Saumya Mehta
AI Engineer & Senior Data Scientist
Carnegie Learning • Indiana University
Designing and deploying scalable ML systems in education with 6+ years of experience. Led RCTs across 10K+ students delivering 4-7% gains in standardized test outcomes.
Building Intelligence for Impact
I turn noisy data into products people actually use. Building ML systems that power real-time recommendations, GenAI features, and rapid-fire A/B tests for 120K+ students.
What I Do Best
End-to-end ML pipelines: Kafka → Spark/Databricks → Snowflake → PyTorch → FastAPI on K8s
GenAI & Retrieval: LangChain + Pinecone RAG services that answer questions in <150ms p95
Product Analytics: CUPED & causal-inference frameworks; run experiments, ship only when guard-rails win
Scale & Reliability: 500K+ events/day, <120ms latency, 99.9% uptime—validated in production
Storytelling with Data: Executive dashboards and one-slide ROI narratives that move product roadmaps
Currently
Seeking opportunities to contribute to innovative teams working on cutting-edge AI/ML projects that make a meaningful difference.
Wins I'm Proud Of
Measurable impact from real-world ML deployments
Lesson Completion Boost
Contextual-bandit recommender deployed across 10K classrooms
Support Ticket Reduction
AI-generated hint rewrites via RAG pipeline with content filters
Redshift Cost Reduction
Refactored ETL and adopted Spectrum roll-ups
Innovation Award
Self-serve experiment platform trimmed time-to-insight by 40%
Production ML Systems
Real-world machine learning and AI projects deployed at scale. From RCTs with 10K+ students to RAG pipelines and logging infrastructure capturing 500K+ events daily, these projects drive measurable business impact.
Featured Projects
MATHstream RCT Analysis Platform
Led randomized controlled trials across 10K+ students, delivering data-driven insights that improved core MATHstream features and drove 15% upsell conversion in pilot districts.
RAG Pipeline with LangChain & PGVector
Designed and deployed Retrieval-Augmented Generation pipelines using LangChain, OpenAI embeddings, and PGVector for contextual response generation from proprietary datasets.
Real-time ML Detector-Reactor System
Implemented real-time detector-reactor models integrated with recommendation algorithms to deliver personalized content based on student behavior patterns.
Additional Projects
Contributed to CompuCell3D, an open-source platform for multicellular biological modeling. Advanced chemical transport modeling in dynamic multicellular contexts.
Curriculum Recommendation Engine - SuccessMaker
Developed Bayesian Item Response Theory algorithms for curriculum recommendations, resulting in 40% improvement in student test scores at Pearson.
ML Pipeline on Snowflake - 1M Records/sec
Architected end-to-end ML pipelines handling 1M+ records per second, delivering 25% performance improvement and 50% reduction in processing time.
Process Mining for Student Learning Trajectories
Developed process mining algorithms to extract learning trajectories from online assessments, detecting a 10-12% incidence of unfair means and supporting policy decisions.
Ready to Build Production ML Systems?
With 6+ years of experience deploying ML at scale, I'm excited to tackle your next data challenge. Let's discuss how we can drive measurable impact together.
Research
Contributing to the advancement of machine learning and AI through peer-reviewed research, collaborative studies, and real-world applications in education technology and computational biology.
Presents methodologies for rapid, large-scale experimentation in educational technology, demonstrating 1.2× faster iteration for educators through systematic A/B testing and quasi-experimental design with over 10K students.
Develops advanced computational models for chemical transport in multicellular biological systems, contributing to the open-source CompuCell3D platform for biological modeling and simulation.
Reviews the current state and future prospects of AI applications in gastroenterology, examining how machine learning can enhance diagnostic accuracy and treatment personalization in clinical practice.
Presents visualization techniques for understanding student learning patterns through curriculum pacing analysis, enabling educators to identify at-risk students and optimize learning pathways.
Investigates the combined therapeutic effects of Simvastatin and Metformin on pancreatic stellate cells, examining their potential in treating pancreatic fibroinflammatory conditions.
Research Interests
I'm actively researching applications of AI in education technology, scalable ML systems for real-time processing, and computational biology. My work spans from theoretical research to production deployment with measurable impact.
Education
Strong academic foundation in Data Science and Machine Learning with hands-on research experience in computational biology and statistical modeling.
Master of Science in Data Science
Academic Excellence
My graduate education at Indiana University provided a strong foundation in statistical methods, machine learning algorithms, and big data technologies. The hands-on research experience with CompuCell3D and computational biology has been instrumental in developing my expertise in scientific computing and model development.
Certifications
CITI Program for Research Engineers
CITI for Social Behavioural Education
AWS Machine Learning Specialty
Technical Skills
A comprehensive toolkit spanning machine learning, data engineering, and full-stack development with years of hands-on experience.
Machine Learning & AI
Programming Languages
Data Engineering & Cloud
Data Analysis & Visualization
Frequently Used Technologies
Resume
Download my comprehensive CV or view an interactive version online. Always kept up-to-date with my latest experience and achievements.
Saumya Mehta
Data Scientist
Professional Summary
AI & Data Science professional with 6+ years of experience designing and deploying scalable machine learning systems in education and assessment. Presented at ICICLE-25 on large-scale experimentation that delivered 1.2× faster iteration for educators. Proven ability to operationalize AI solutions, communicate complex technical ideas to non-experts, and lead cross-functional initiatives.
Professional Experience
Senior Data Scientist, Machine Learning & AI
June 2023 – July 2025 • Carnegie Learning • Raleigh, NC
- Led RCTs and A/B tests across 10K+ students in partner school districts, delivering insights that drove core MATHstream feature improvements and informed district-level product decisions
- Conducted quasi-experimental analysis of MATHstream's impact on standardized test scores, revealing 4–7% gains, which resulted in 15% upsell conversion in pilot districts (est. 1.7B revenue)
- Designed and deployed scalable logging infrastructure to capture and standardize 500K+ student events daily in real time, enabling robust analytics and cross-team research integration
- Designed and deployed Retrieval-Augmented Generation (RAG) pipelines using LangChain, OpenAI embeddings, and PGVector to enable contextual response generation from proprietary document datasets
- Implemented real-time detector-reactor models integrated with recommendation algorithms to deliver personalized content based on student behavior, enabling faster remediation and improved engagement
- Built scalable vector indexing and semantic search systems leveraging Pinecone and OpenAI embeddings, optimizing document chunking, similarity retrieval, and response latency for downstream LLM tasks
Software Development Engineer Intern
May 2022 – Aug 2022 • Pearson (Savvas) • Boston, MA
- Developed curriculum recommendation algorithms for the SuccessMaker engine using Bayesian Item Response Theory, resulting in a 40% improvement in student test scores
- Reduced query execution times by 5× by migrating PostgreSQL queries to Redshift SQL and optimizing them with Common Table Expressions and User Defined Functions
- Streamlined the data analytics pipeline for the Learning Analytics team by creating external views in Amazon Redshift and automating query scheduling on AWS for ETL tasks
Data Scientist
Jun 2018 – Aug 2021 • Playpower Labs • Remote
- Led the research and development of an AI-powered, paper-based formative learning product aimed at creating equitable assessments for students across India, piloted across 50+ schools
- Architected and deployed end-to-end machine learning pipelines on Snowflake, handling up to 1 million records per second and delivering a 25% improvement in model performance and a 50% reduction in data processing time
- Enhanced real-time data processing capabilities using Spark Streaming, R, and Kafka, delivering a 50% improvement in processing speed and scalability for a distributed computing setup handling over 100K records per second of streaming data
- Developed process mining algorithms to extract students' learning trajectories from online assessments, detecting a 10–12% incidence of recourse to unfair means, and presented the findings to stakeholders to assist with policy making
- Implemented and optimized ETL pipelines using Apache Airflow and AWS Glue, delivering a 30% reduction in ETL runtime and improving data quality and reliability
Education
Master of Science in Data Science
Aug 2021 – May 2023 • Indiana University • Bloomington, IN, USA • CGPA: 3.93
Technical Skills
Languages & Frameworks:
Python, R, PostgreSQL, Redis, C++, Streamlit, Django, PyTorch, TensorFlow, SageMaker, Ollama
AI & Machine Learning:
Prompt Engineering, Reinforcement Learning, Quasi-Experimental Analysis, RAG, A/B Testing, Causal Inference, Bayesian Inference, Model Interpretability, LangChain, LangGraph, A2A, DeepEval, FAISS, Azure AI, Hugging Face
Data Pipelines:
Spark, Kafka, Snowflake, Docker, Kubernetes, Jenkins, AWS, Databricks
Projects
Improving Image Captions with Depth Maps
Improved image captioning by combining CNN and RNN models with depth maps, producing more accurate and informative captions.
Abusive Language Detection in Social Media using Natural Language Processing
Developed a multi-headed model that detects abusive language and threats such as hate speech, obscenity, and targeted threats, using LSTMs and GloVe-emoji embeddings.
Publications
- MATHstream and UpGrade: Using Rapid, Large-Scale Experimentation for Data-Driven Improvements - International Consortium for Innovation and Collaboration in Learning Engineering (ICICLE), June 2025
- Advanced Chemical Transport Modeling in Dynamic Multicellular Contexts Using CompuCell3D - Biophysical Journal, May 2023
- Using Curriculum Pacing in LearnSphere to Visualize Student Learning Trajectories - Sharing and Reusing Data and Analytic Methods with LearnSphere conference, Mar 2019
- Gastroenterology in the age of artificial intelligence: Bridging technology and clinical practice - World Journal of Gastroenterology (In Peer Review), June 2025
- Simvastatin + Metformin Treatment Targets Growth and Fibroinflammatory Responses in PaSc - American Pancreatic Association, Dec 2023
Latest Blogs
Thoughts on machine learning, data science, and the future of AI. Sharing knowledge and experiences from the field.
The Future of Machine Learning in Healthcare
Exploring how AI and ML are revolutionizing medical diagnosis and treatment.
Building Scalable Data Pipelines with Apache Kafka
A deep dive into creating robust, real-time data processing systems.
Computer Vision: From Theory to Production
Practical insights on deploying computer vision models in real-world applications.
Let's Connect
I'm always excited to discuss new opportunities, collaborate on interesting projects, or simply chat about the latest in AI and machine learning.
I'm actively seeking full-time positions in data science, machine learning engineering, or AI research roles. Remote and hybrid opportunities are welcome.
Ready to work together?
Whether you have a specific role in mind or just want to explore possibilities, I'd love to hear from you. Let's build something amazing together!
© 2025 Saumya Mehta. Built with Next.js, TypeScript, and lots of coffee