Available for Senior Data Science Roles
👋
Hello, I'm

Saumya Mehta

AI Engineer & Senior Data Scientist

Carnegie Learning • Indiana University

Designing and deploying scalable ML systems in education with 6+ years of experience. Led RCTs across 10K+ students delivering 4-7% gains in standardized test outcomes.

6+
Years Experience
1.2x
Faster Iteration
4-7%
Test Score Gains
500K+
Events/Day
👨‍💻
Raleigh, NC
Available
About Me

Building Intelligence for Impact

I turn noisy data into products people actually use. I build ML systems that power real-time recommendations, GenAI features, and rapid-fire A/B tests for 120K+ students.

What I Do Best

End-to-end ML pipelines: Kafka → Spark/Databricks → Snowflake → PyTorch → FastAPI on K8s

GenAI & Retrieval: LangChain + Pinecone RAG services that answer questions in <150ms p95

Product Analytics: CUPED & causal-inference frameworks; run experiments and ship only when guardrail metrics hold

Scale & Reliability: 500K+ events/day, <120ms latency, 99.9% uptime, all validated in production

Storytelling with Data: Executive dashboards and one-slide ROI narratives that move product roadmaps

Impact Metrics

500K+
Events/Day
120K+
Students
<150ms
RAG Latency
99.9%
Uptime

Currently

Seeking opportunities to contribute to innovative teams working on cutting-edge AI/ML projects that make a meaningful difference.

Wins I'm Proud Of

Measurable impact from real-world ML deployments

+7%

Lesson Completion Boost

Contextual-bandit recommender deployed across 10K classrooms

-15%

Support Ticket Reduction

AI-generated hint rewrites via RAG pipeline with content filters

-38%

Redshift Cost Reduction

Refactored ETL and adopted Spectrum roll-ups

🏆

Innovation Award

Self-serve experiment platform trimmed time-to-insight by 40%

Real ML & AI Projects

Production ML Systems

Real-world machine learning and AI projects deployed at scale. From RCTs with 10K+ students to RAG pipelines processing 500K+ events daily, these projects drive measurable business impact.

Featured Projects

Education Tech • Production

MATHstream RCT Analysis Platform

Carnegie Learning • 2023-2025

Led randomized controlled trials across 10K+ students, delivering data-driven insights that improved core MATHstream features and drove 15% upsell conversion in pilot districts.

Python • PostgreSQL • Statistical Modeling • A/B Testing
Impact Metrics
10K+
students
4-7% test gains
improvement
Est. $1.7B
revenue
15% upsell
conversion
AI/LLM • Production

RAG Pipeline with LangChain & PGVector

Carnegie Learning • 2024

Designed and deployed Retrieval-Augmented Generation pipelines using LangChain, OpenAI embeddings, and PGVector for contextual response generation from proprietary datasets.

LangChain • OpenAI • PGVector • Python
Impact Metrics
<100ms
latency
94%
accuracy
500K+ indexed
documents
Real-time
queries
Machine Learning • Production

Real-time ML Detector-Reactor System

Carnegie Learning • 2024

Implemented real-time detector-reactor models integrated with recommendation algorithms to deliver personalized content based on student behavior patterns.

Python • ML Pipelines • Real-time Processing • Recommendation Systems
Impact Metrics
500K+ daily
events
Real-time
latency
Improved
engagement
Production
scale

Additional Projects

Computational Biology

CompuCell3D - Advanced Biological Modeling

Indiana University • 2023

Contributed to CompuCell3D, an open-source platform for multicellular biological modeling. Advanced chemical transport modeling in dynamic multicellular contexts.

Published
citations
Open Source
platform
C++ • Python • Scientific Computing • +1
Education Tech

Curriculum Recommendation Engine - SuccessMaker

Pearson (Savvas) • 2022

Developed Bayesian Item Response Theory algorithms for curriculum recommendations, resulting in a 40% improvement in student test scores at Pearson.

40% test scores
improvement
Bayesian IRT
algorithm
Python • Bayesian Methods • PostgreSQL • +1
Data Science

ML Pipeline on Snowflake - 1M Records/sec

Playpower Labs • 2020-2021

Architected end-to-end ML pipelines handling 1M+ records per second, delivering 25% performance improvement and 50% reduction in processing time.

1M records/sec
throughput
25% performance
improvement
Snowflake • Python • Apache Spark • +1
Data Science

Process Mining for Student Learning Trajectories

Playpower Labs • 2019-2021

Developed process mining algorithms to extract learning trajectories from online assessments, detecting 10-12% unfair means and supporting policy decisions.

10-12% unfair means
detected
50+ schools
students
Python • Process Mining • R • +1

Ready to Build Production ML Systems?

With 6+ years of experience deploying ML at scale, I'm excited to tackle your next data challenge. Let's discuss how we can drive measurable impact together.

Research & Publications

Research

Contributing to the advancement of machine learning and AI through peer-reviewed research, collaborative studies, and real-world applications in education technology and computational biology.

🎤 Published • Conference Paper

MATHstream and UpGrade: Using Rapid, Large-Scale Experimentation for Data-Driven Improvements

International Consortium for Innovation and Collaboration in Learning Engineering (ICICLE)
2025
2 authors

Presents methodologies for rapid, large-scale experimentation in educational technology, demonstrating 1.2× faster iteration for educators through systematic A/B testing and quasi-experimental design with over 10K students.

Authors:

Saumya Mehta, Carnegie Learning Team
📚 Published • Journal Article

Advanced Chemical Transport Modeling in Dynamic Multicellular Contexts Using CompuCell3D

Biophysical Journal
2023
2 authors

Develops advanced computational models for chemical transport in multicellular biological systems, contributing to the open-source CompuCell3D platform for biological modeling and simulation.

Authors:

Saumya Mehta, Indiana University Collaborators
📚 In Peer Review • Journal Article

Gastroenterology in the age of artificial intelligence: Bridging technology and clinical practice

World Journal of Gastroenterology
2025
2 authors

Reviews the current state and future prospects of AI applications in gastroenterology, examining how machine learning can enhance diagnostic accuracy and treatment personalization in clinical practice.

Authors:

Saumya Mehta, Medical AI Collaborators
🎤 Published • Conference Paper

Using Curriculum Pacing in Learnsphere to Visualize Student Learning Trajectories

Sharing and Reusing Data and Analytic Methods with LearnSphere Conference
2019
2 authors

Presents visualization techniques for understanding student learning patterns through curriculum pacing analysis, enabling educators to identify at-risk students and optimize learning pathways.

Authors:

Saumya Mehta, Playpower Labs Team
📊 Published • Poster Presentation

Simvastatin + Metformin Treatment Targets Growth and Fibroinflammatory Responses in PaSc

American Pancreatic Association
2023
2 authors

Investigates the combined therapeutic effects of Simvastatin and Metformin on pancreatic stellate cells, examining their potential in treating pancreatic fibroinflammatory conditions.

Authors:

Saumya Mehta, Research Collaborators

Research Interests

Educational AI • Large-Scale Experimentation • RAG & LLM Systems • Computational Biology • A/B Testing & RCTs • Bayesian Methods • Real-time ML • Process Mining

I'm actively researching applications of AI in education technology, scalable ML systems for real-time processing, and computational biology. My work spans from theoretical research to production deployment with measurable impact.

Academic Excellence

Education

Strong academic foundation in Data Science and Machine Learning with hands-on research experience in computational biology and statistical modeling.

Graduate Degree • GPA: 3.93

Master of Science in Data Science

Indiana University Bloomington
Bloomington, IN, USA
Aug 2021 – May 2023

Relevant Coursework

Advanced Natural Language Processing
Elements of Artificial Intelligence
Building Intelligent Systems
Computer Vision
Bayesian Data Analysis
Applied Machine Learning
Machine Learning and Signal Processing
Reinforcement Learning
Intro to Statistics
Deep Learning Systems

Academic Excellence

My graduate education at Indiana University provided a strong foundation in statistical methods, machine learning algorithms, and big data technologies. The hands-on research experience with CompuCell3D and computational biology has been instrumental in developing my expertise in scientific computing and model development.

3.93
Graduate GPA
2023
Graduation Year
MS
Data Science
Professional Development

Certifications

Completed

Complete Agentic AI Bootcamp with LangGraph and LangChain

Udemy
2025
Completed

Convolutional Neural Networks

DeepLearning.AI
2018
Completed

CITI Program for Research Engineers

CITI Program
2022
Completed

CITI for Social Behavioural Education

CITI Program
2023-2025
Ongoing

AWS Machine Learning Specialty

Amazon Web Services
2025
Technical Skills


A comprehensive toolkit spanning machine learning, data engineering, and full-stack development with years of hands-on experience.

🤖

Machine Learning & AI

TensorFlow: 90%
PyTorch: 85%
Scikit-learn: 95%
Keras: 88%
XGBoost: 92%
Deep Learning: 87%
Computer Vision: 82%
NLP: 85%
💻

Programming Languages

Python: 95%
R: 80%
SQL: 90%
JavaScript: 75%
Java: 70%
Scala: 65%
Julia: 60%
Go: 55%
☁️

Data Engineering & Cloud

Apache Spark: 85%
Apache Kafka: 80%
AWS: 88%
Docker: 90%
Kubernetes: 75%
Apache Airflow: 82%
Databricks: 78%
Snowflake: 70%
📊

Data Analysis & Visualization

Pandas: 95%
NumPy: 92%
Matplotlib: 88%
Seaborn: 85%
Plotly: 80%
Tableau: 75%
Power BI: 70%
D3.js: 65%
6+
Years of Experience
50+
Projects Completed
10+
Research Publications

Frequently Used Technologies

🐍 Python
🔮 TensorFlow
🔥 PyTorch
☁️ AWS
🐳 Docker
📊 Pandas
🧮 NumPy
📈 Matplotlib
💾 PostgreSQL
⚡ Spark
🔄 Kafka
🌐 FastAPI
⚛️ React
📝 Jupyter
🎯 MLflow
Resume & CV

Resume

Download my comprehensive CV or view an interactive version online. Always kept up-to-date with my latest experience and achievements.

Saumya Mehta

Data Scientist

smehta2530@gmail.com
(412) 905-9023
Raleigh, NC

Professional Summary

AI & Data Science professional with 6+ years of experience designing and deploying scalable machine learning systems in education and assessment. Presented at ICICLE-25 on large-scale experimentation that enabled 1.2× faster iteration for educators. Proven ability to operationalize AI solutions, communicate complex technical ideas to non-experts, and lead cross-functional initiatives.

Professional Experience

Senior Data Scientist, Machine Learning & AI

June 2023 – July 2025

Carnegie Learning • Raleigh, NC

  • Led RCTs and A/B tests across 10K+ students in partner school districts, delivering insights that drove core MATHstream feature improvements and informed district-level product decisions
  • Conducted quasi-experimental analysis of MATHstream's impact on standardized test scores, revealing 4–7% gains that resulted in a 15% upsell conversion in pilot districts (est. $1.7B revenue)
  • Designed and deployed scalable logging infrastructure to capture and standardize 500K+ student events daily in real time, enabling robust analytics and cross-team research integration
  • Designed and deployed Retrieval-Augmented Generation (RAG) pipelines using LangChain, OpenAI embeddings, and PGVector to enable contextual response generation from proprietary document datasets
  • Implemented real-time detector-reactor models integrated with recommendation algorithms to deliver personalized content based on student behavior, enabling faster remediation and improved engagement
  • Built scalable vector indexing and semantic search systems leveraging Pinecone and OpenAI embeddings, optimizing document chunking, similarity retrieval, and response latency for downstream LLM tasks

Software Development Engineer Intern

May 2022 – Aug 2022

Pearson (Savvas) • Boston, MA

  • Developed curriculum recommendation algorithms for the SuccessMaker engine using Bayesian Item Response Theory, resulting in a 40% improvement in student test scores
  • Reduced query execution times by 5x by migrating PostgreSQL queries to Redshift SQL and optimizing them with Common Table Expressions and User Defined Functions
  • Streamlined the data analytics pipeline for the Learning Analytics team by creating external views in Amazon Redshift and automating query scheduling on AWS for ETL tasks

Data Scientist

Jun 2018 – Aug 2021

Playpower Labs • Remote

  • Led the research and development of an AI-powered, paper-based formative learning product aimed at creating equitable assessments for students across India, piloted across 50+ schools
  • Architected and deployed end-to-end machine learning pipelines on Snowflake, handling up to 1 million records per second and delivering a 25% improvement in model performance and a 50% reduction in data processing time
  • Enhanced real-time data processing with Spark Streaming, R, and Kafka, delivering a 50% improvement in processing speed and scalability for a distributed computing setup handling over 100K records per second of streaming data
  • Developed process mining algorithms to extract students' learning trajectories from online assessments, detecting a 10-12% rate of recourse to unfair means, and presented the findings to stakeholders to inform policy making
  • Implemented and optimized ETL pipelines using Apache Airflow and AWS Glue, delivering a 30% reduction in ETL runtime and improving data quality and reliability

Education

Master of Science in Data Science

Aug 2021 – May 2023

Indiana University • Bloomington, IN, USA • CGPA: 3.93

Relevant Coursework:

Advanced Natural Language Processing
Elements of Artificial Intelligence
Building Intelligent Systems
Computer Vision
Bayesian Data Analysis
Applied Machine Learning
Machine Learning and Signal Processing
Reinforcement Learning
Intro to Statistics
Deep Learning Systems

Technical Skills

Languages & Frameworks:

Python, R, PostgreSQL, Redis, C++, Streamlit, Django, PyTorch, TensorFlow, SageMaker, Ollama

AI & Machine Learning:

Prompt Engineering, Reinforcement Learning, Quasi-Experimental Analysis, RAG, A/B Testing, Causal Inference, Bayesian Inference, Model Interpretability, LangChain, LangGraph, A2A, DeepEval, FAISS, Azure AI, Hugging Face

Data Pipelines:

Spark, Kafka, Snowflake, Docker, Kubernetes, Jenkins, AWS, Databricks

Projects

Improving Image Captions with Depth Maps

Improved image captioning by combining CNN and RNN models with depth maps to produce more accurate and informative captions.

View on GitHub →

Abusive Language Detection in Social Media using Natural Language Processing

Developed a multi-headed model that detects abusive language such as hate speech, obscenity, and targeted threats using LSTMs and GloVe-emoji embeddings.

View on GitHub →

Publications

  • MATHstream and UpGrade: Using Rapid, Large-Scale Experimentation for Data-Driven Improvements - International Consortium for Innovation and Collaboration in Learning Engineering (ICICLE), June 2025
  • Advanced Chemical Transport Modeling in Dynamic Multicellular Contexts Using CompuCell3D - Biophysical Journal, May 2023
  • Using Curriculum Pacing in Learnsphere to Visualize Student Learning Trajectories - Sharing and Reusing Data and Analytic Methods with LearnSphere conference, Mar 2019
  • Gastroenterology in the age of artificial intelligence: Bridging technology and clinical practice - World Journal of Gastroenterology (In Peer Review), June 2025
  • Simvastatin + Metformin Treatment Targets Growth and Fibroinflammatory Responses in PaSc - American Pancreatic Association, Dec 2023
Latest Blogs


Thoughts on machine learning, data science, and the future of AI. Sharing knowledge and experiences from the field.

Featured
January 15, 2024
5 min read
Machine Learning

The Future of Machine Learning in Healthcare

Exploring how AI and ML are revolutionizing medical diagnosis and treatment.

Data Engineering • 8 min read

Building Scalable Data Pipelines with Apache Kafka

A deep dive into creating robust, real-time data processing systems.

January 8, 2024
Computer Vision • 6 min read

Computer Vision: From Theory to Production

Practical insights on deploying computer vision models in real-world applications.

January 1, 2024

Let's Connect

I'm always excited to discuss new opportunities, collaborate on interesting projects, or simply chat about the latest in AI and machine learning.

Currently Available for New Opportunities

I'm actively seeking full-time positions in data science, machine learning engineering, or AI research roles. Remote and hybrid opportunities are welcome.

Particularly interested in:

Machine Learning Opportunities • Data Science Roles • Research Collaborations • Consulting Projects • Speaking Engagements • Mentorship

Ready to work together?

Whether you have a specific role in mind or just want to explore possibilities, I'd love to hear from you. Let's build something amazing together!

© 2025 Saumya Mehta. Built with Next.js, TypeScript, and lots of coffee