Resume

Machine learning systems from raw data to production.

Senior machine learning engineer with experience across modeling, distributed data systems, deployment, reliability, applied AI, research, and teaching.

Experience

Consulting Research Engineer, Climate Data Platform SDSU 2025–present

Built reusable climate-data pipelines to process heterogeneous scientific datasets and publish outputs directly into databases, Parquet, and Zarr for downstream analysis and platform use.
Python · Parquet · Zarr · Data Engineering · Reusable Pipelines · Climate
Worked across geospatial and scientific data formats including Zarr, NetCDF, GeoTIFF, and shapefiles to support ingestion, transformation, and publication workflows for climate datasets.
Zarr · NetCDF · GeoTIFF · Shapefiles · Geospatial · Climate
Designed the processing workflows to be reusable across different climate datasets rather than one-off scripts, supporting repeatable ingestion and transformation patterns for the platform.
Reusable Pipelines · Data Engineering · Platform · Climate
Built and maintained platform-oriented climate-data processing workflows for the iCHARM website and related SDSU research use cases.
iCHARM · Research Engineering · Climate · Platform

Senior Machine Learning Engineer Unit21 2021–present

Built AI-agent tasks that customers used to automate time-consuming workflows.
Python · AWS Bedrock · LLM Workflows
Exported selected Databricks datasets to versioned Parquet files in S3, enabling faster metric calculation with DuckDB and preserving reproducible audit trails for statistical analysis outputs.
PySpark · DuckDB · Statistics · Model Evaluation
Designed and built ML systems for AML and fraud detection using heterogeneous, customer-specific datasets across hundreds of customer environments.
Predictive Modeling · Anomaly/Risk Modeling · PostgreSQL · FastAPI
Built automated feature generation pipelines for customer-provided custom data, enabling substantial model performance gains for customers with rich, predictive data sources.
ETL Pipelines · Spark · Statistics · Feature Engineering · Model Evaluation
Improved model performance in high-lift customer environments from approximately 0.60 AUC to as high as 0.97 AUC through automated feature engineering and model pipeline improvements.
PySpark · Python · MLflow · Model Evaluation
Re-architected large-scale Spark processing from Spark SQL to optimized RDD-based workflows, reducing runtime from approximately 12 hours to roughly 30 minutes on the same cluster.
Python · PySpark · RDD · Spark · Databricks · Terraform
Built an LLM-assisted model explainability workflow that mapped per-observation SHAP values to raw feature values, feature metadata, and model scores to generate clear score narratives.
Python · Production · SHAP · LLM
Built AI-agent workflows using AWS Bedrock for financial investigation, data analysis, and customer-defined tasks.
Bedrock · Python · LLM · AI Agents
Designed FastAPI-based model execution services integrating distributed feature generation, model scoring, and production-facing prediction workflows.
Python · Model Serving · Production · Predictive Modeling · Model Evaluation
Led ML and AI research prototypes that informed or evolved into production-facing platform features.
Python · Research · Prototypes · LlamaIndex · PyTorch · spaCy
Researched consortium-based fraud intelligence sharing approaches to improve cross-customer risk detection and bad-actor identification.
SQL · Data Science · Notebooks
Cleaned and standardized customer-specific datasets with inconsistent schemas, formats, and data quality issues into reliable, model-ready inputs for machine learning.
SQL · Data Science · Notebooks · Spark · Python
Migrated ML processing workflows from Snowflake-based execution to PySpark/Databricks, improving scalability, maintainability, production performance, and control over large-scale analytical computation.
Snowflake · PySpark · Databricks · Warehousing · Data Engineering
Designed and maintained Nix/nix-darwin infrastructure for macOS engineering environments, standardizing dependencies and improving reproducibility across engineering teams.
NixOS · nix-darwin · Productivity
Participated in weekly on-call rotations for production ML and data systems, owning incident response and reliability for customer-facing platform workflows.
On-Call · Incident Response · Production · Observability
Developed a SQL lineage tool to identify required model features and dependencies, automatically pruning unused scripts and feature logic to dramatically speed up model data pipelines.
SQL · Snowflake · sqlglot
Automated local development startup workflows to make local testing faster and easier.
Bash · Linux · Docker · tmux

Adjunct Professor – Machine Learning Engineering SDSU 2020–2023

Developed and taught Machine Learning Engineering coursework focused on practical ML systems, model deployment, production workflows, and applied engineering patterns.
Teaching
Created public lectures, slides, and examples used by students learning modern ML engineering concepts.
reveal.js · JavaScript · HTML

Senior Machine Learning Engineer AppFolio 2018–2021

Built production ML monitoring dashboards in Redash to track accuracy trends, drift, customer-specific performance, and overall model health across deployed models using real-world invoice-resolution feedback.
Dashboard · SQL · Redash · Drift · Model Evaluation · Monitoring
Improved a BI data consolidation tool that merged sharded customer databases into a single Snowflake analytics database using Debezium and Python, resolving schema drift, data-type conflicts, and warehouse-specific merge/upsert issues across rolling deployments on Kubernetes.
Debezium · Helm Charts · Kubernetes · PostgreSQL · Data Engineering · Warehousing
Built an end-to-end Spark-based ML training and deployment platform that used custom RDD tasks to train 1,400 customer-specific Scikit-learn invoice prediction models in parallel on a cluster.
Spark · Python · PySpark · Predictive Modeling · AWS · Docker
Built a custom Airflow-on-Kubernetes service using Helm charts, with cluster provisioning, permissions, and related infrastructure scripted in Terraform.
Airflow · Helm · Kubernetes · Terraform
Built Airflow DAG workflows that ran custom Docker images on Kubernetes clusters to automatically train and update customer models.
Airflow · Docker · Python
Built a fully automated ML training and deployment platform supporting approximately 1,400 customer-specific invoice prediction models.
Predictive Modeling · FastAPI
Designed cloud-native ML infrastructure using Kubernetes, Docker, Airflow, Terraform, and production deployment workflows.
Docker · Airflow · Terraform
Developed document understanding models to extract invoice metadata including vendor, invoice amount, invoice date, and related fields from customer invoices.
OCR · Python · Weights & Biases · MLflow
Built conversational AI and customer-support chatbot systems using BERT-based NLP models, custom feature engineering, intent prediction, response selection, and next-question recommendation workflows.
BERT · NLP · Python · PyTorch · PyTorch Lightning · spaCy
Converted ML models to AWS Lambda containers for both cost savings and faster deployments.
AWS · Python · Model Serving
Implemented CI/CD workflows for automating data preparation, model training, validation, deployment, and production updates.
CI/CD
Improved existing machine learning models through feature engineering and model selection.
SQL · Model Evaluation · Feature Engineering · Predictive Modeling · A/B Testing
Used rapid invoice-resolution feedback to compare model behavior and evaluate candidate improvements across deployed customer-specific prediction systems.
Model Evaluation · Experimentation · A/B Testing · Monitoring · Predictive Modeling
Published trained Scikit-learn invoice prediction models to cloud storage for production serving by application services.
Scikit-learn · Python · Model Serving · AWS · Pandas · NumPy
Participated in weekly on-call rotations for production ML systems and platform infrastructure, owning incident response and operational reliability for deployed customer models.
On-Call · Incident Response · Production · Monitoring

Data Scientist Experian 2014–2018

Re-architected Hadoop-based data processing workflows to Apache Spark, reducing processing time and AWS infrastructure cost.
Hadoop · Spark
Designed and implemented an end-to-end AWS EMR optimization system for cookie-replacement and digital identity models, evaluating candidate variables against entropy, longevity, and validation-performance metrics in a survival-analysis-style framework for offline model comparison.
AWS EMR · Java · Hadoop · Pig · SQL · Model Optimization
Built automated reporting workflows using SQL, R, R Markdown, and Shiny to summarize candidate model performance, compare feature sets, and support statistical analysis of digital identity models.
SQL · R · R Markdown · Shiny · Reporting · Model Evaluation
Developed distributed data systems using Hadoop, Pig, Java, SQL, Spark, and Airflow for feature analysis, model evaluation, and reporting.
Java · SQL · Spark · Airflow
Designed optimization workflows for evaluating large combinatorial modeling spaces that were computationally infeasible to search exhaustively, supporting large-scale offline model and feature comparison.
Numerical Optimization · Model Evaluation · Experimentation
Created Airflow-based orchestration for local and AWS-based data pipelines, from ingestion through modeling and final reports.
Airflow · Linux
Expanded available predictive variables from approximately 100 to over 2,000 through automated feature engineering and data analysis.
Java · Spark
Built a Hadoop/Spark cluster from scratch using decommissioned hardware when the team lacked a dedicated environment for Spark experimentation.
Spark · Linux

Statistical Modeler Digital Risk 2012–2014

Built end-to-end predictive modeling systems using credit, property, CSV, web, API, and third-party data sources to prioritize loans for forensic audit and defect detection.
Predictive Modeling · Data Engineering · Statistics · Decision Support
Designed and maintained a SQL Server research database with over 9 million observations and more than 5,500 candidate predictive variables.
SQL · SQL Server · Data Engineering · Feature Generation
Developed mortgage, valuation, and financial risk models using R, SAS, Matlab, SQL Server, C#, and statistical modeling workflows for high-stakes forensic-audit decision support.
SAS · C# · Matlab · R · SQL · SQL Server
Implemented a Bayesian network in C#.
C# · Bayesian Network · Bayesian Methods
Built machine learning automated valuation models (AVMs) for real estate assets, including apartment-complex-level models using property, credit, geographic, and third-party data sources.
AVM
Engineered geospatial valuation features, including shapefile-based distance-to-coast calculations for Florida properties where proximity materially impacted property value.
Shapefile · Geospatial · AVM
Generated 10-year real estate investment analysis reports estimating property income, expenses, and projected future property value.
Wrote multithreaded optimization tools in Matlab and C# to accelerate variable selection and modeling workflows, improving one process by roughly 20x over the prior non-optimized approach.
Matlab · C# · Numerical Optimization · Variable Selection · Model Evaluation
Implemented production scoring systems and automated internal and customer-facing reports using SQL Server, SSRS, and C#-generated PDFs with maps, visualizations, charts, and analytical summaries for audit-targeting workflows.
C# · SQL Server · SSRS · Reporting · Decision Support
Solved workforce distribution optimization problems to increase employee productivity.
Optimizer · Linear Programming · Optimization
Compared baseline and candidate models, including head-to-head testing of competing approaches, to improve statistical targeting of loans for forensic audit.
Model Evaluation · A/B Testing · Experimentation · Statistics · Decision Support

Doctoral Researcher, Computational Statistics SDSU / CGU 2011–2018

Developed spherical-harmonic statistical methods for dimension reduction and global climate signal extraction from high-dimensional geospatial data on the surface of a sphere.
Computational Statistics · Dimension Reduction · Eigenvalues · Eigenvectors · Spherical Harmonics · Statistics
Stabilized rank-deficient matrix and eigensystem calculations using diagonal-perturbation and regularization techniques in repeated large-scale harmonic-specific statistical workflows.
Linear Algebra · Regularization · Numerical Methods · Statistics · Eigenvalues · Eigenvectors
Built a hybrid Java and C/C++ numerical pipeline, using JNI-based Java wrappers around native code to accelerate computationally intensive matrix and estimation routines; used Matlab for visualization and plotting.
Java · C++ · C · JNI · Numerical Methods · Matlab
Re-architected the research computation to run on Spark by modeling harmonic-specific workloads as distributed classes/RDD tasks, making thousands of expensive regularized computations tractable across a cluster.
Spark · RDD · Java · Distributed Computing · Statistics · Performance Optimization

.NET Application Developer / QA / Technical Support CoreLogic 2002–2012

Built an end-to-end credit card bust-out fraud model for a major banking institution under a one-month deadline, covering data cleaning, feature extraction, model development, evaluation, and delivery in a form the client could implement with minimal system changes.
R · Bash · Linux · ETL Pipelines · Model Evaluation · Predictive Modeling
Validated both incoming client data integrity and the statistical methodology used in production code, helping ensure production model correctness.
C# · Statistics · Validation · Model Evaluation
Built a survival-analysis / net-present-value optimization model for mortgage-loss mitigation, comparing foreclosure, short-sale, and loan-modification scenarios under varying bounded input parameters for decision support.
C# · Solver · Survival Analysis · Optimization · Financial Modeling · Decision Support
Rewrote an existing numerical optimizer in C#, making it roughly 10x faster through multithreading, mathematical simplification, and formula-level refactoring.
Optimization
Built fraud detection, mortgage-risk, and predictive analytics machine learning models for financial services clients.
Machine Learning · Fraud · Mortgage Risk · Predictive Analytics
Developed C# production code for mathematical, statistical, and optimization components of the WillCap mortgage-risk project.
C# · Statistics · Optimization · Financial Modeling

Education

Ph.D. in Computational Statistics

San Diego State University / Claremont Graduate University, 2018

M.S. in Statistics

San Diego State University, 2014

B.S. / B.A. in Statistics / Mathematics

San Diego State University, 2009

Most Outstanding Math/Stats Graduating Student, Class of 2009

Publications

Pierret, J., and S.S.P. Shen, 2017. 4D visual delivery of big climate data: A fast web database application system. Advances in Data Science and Adaptive Analysis.

Shen, S.S.P., Pierret, J., et al., 2020. 4DVD visualization and delivery of 20th century reanalysis data: methods and examples. Theoretical and Applied Climatology.
