Resume

Machine learning systems from raw data to production.

Senior machine learning engineer with experience across modeling, distributed data systems, deployment, reliability, applied AI, research, and teaching.

Experience

Consulting Research Engineer, Climate Data Platform SDSU 2025–present
  • Built reusable climate-data pipelines to process heterogeneous scientific datasets and publish outputs directly into databases, Parquet, and Zarr for downstream analysis and platform use.
    Python, Parquet, Zarr, Data Engineering, Reusable Pipelines, Climate
  • Worked across geospatial and scientific data formats including Zarr, NetCDF, GeoTIFF, and shapefiles to support ingestion, transformation, and publication workflows for climate datasets.
    Zarr, NetCDF, GeoTIFF, Shapefiles, Geospatial, Climate
  • Designed the processing workflows as reusable components that apply across climate datasets rather than as one-off scripts, supporting repeatable ingestion and transformation patterns for the platform.
    Reusable Pipelines, Data Engineering, Platform, Climate
  • Built and maintained platform-oriented climate-data processing workflows for the iCHARM website and related SDSU research use cases.
    iCHARM, Research Engineering, Climate, Platform
Senior Machine Learning Engineer Unit21 2021–present
  • Built AI-agent tasks that customers used to automate time-consuming workflows.
    Python, AWS Bedrock, LLM Workflows
  • Exported selected Databricks datasets to versioned Parquet files in S3, enabling faster metric calculation with DuckDB and preserving reproducible audit trails for statistical analysis outputs.
    PySpark, DuckDB, Statistics, Model Evaluation
  • Designed and built ML systems for AML and fraud detection using heterogeneous, customer-specific datasets across hundreds of customer environments.
    Predictive Modeling, Anomaly/Risk Modeling, PostgreSQL, FastAPI
  • Built automated feature generation pipelines for customer-provided custom data, enabling substantial model performance gains for customers with rich, predictive data sources.
    ETL Pipelines, Spark, Statistics, Feature Engineering, Model Evaluation
  • Improved model performance in high-lift customer environments from approximately 0.60 AUC to as high as 0.97 AUC through automated feature engineering and model pipeline improvements.
    PySpark, Python, MLflow, Model Evaluation
  • Re-architected large-scale Spark processing from Spark SQL to optimized RDD-based workflows, reducing runtime from approximately 12 hours to roughly 30 minutes on the same cluster.
    Python, PySpark, RDD, Spark, Databricks, Terraform
  • Built an LLM-assisted model explainability workflow that mapped per-observation SHAP values to raw feature values, feature metadata, and model scores to generate clear score narratives.
    Python, Production, SHAP, LLM
  • Built AI-agent workflows using AWS Bedrock for financial investigation, data analysis, and customer-defined tasks.
    Bedrock, Python, LLM, AI Agents
  • Designed FastAPI-based model execution services integrating distributed feature generation, model scoring, and production-facing prediction workflows.
    Python, Model Serving, Production, Predictive Modeling, Model Evaluation
  • Led ML and AI research prototypes that informed or evolved into production-facing platform features.
    Python, Research, Prototypes, LlamaIndex, PyTorch, spaCy
  • Researched consortium-based fraud intelligence sharing approaches to improve cross-customer risk detection and bad-actor identification.
    SQL, Data Science, Notebooks
  • Cleaned and standardized customer-specific datasets with inconsistent schemas, formats, and data quality issues into reliable, model-ready inputs for machine learning.
    SQL, Data Science, Notebooks, Spark, Python
  • Migrated ML processing workflows from Snowflake-based execution to PySpark/Databricks, improving scalability, maintainability, production performance, and control over large-scale analytical computation.
    Snowflake, PySpark, Databricks, Warehousing, Data Engineering
  • Designed and maintained Nix/nix-darwin infrastructure for macOS engineering environments, standardizing dependencies and improving reproducibility across engineering teams.
    NixOS, nix-darwin, Productivity
  • Participated in weekly on-call rotations for production ML and data systems, owning incident response and reliability for customer-facing platform workflows.
    On-Call, Incident Response, Production, Observability
  • Developed a SQL lineage tool to identify required model features and dependencies, automatically pruning unused scripts and feature logic to dramatically speed up model data pipelines.
    SQL, Snowflake, sqlglot
  • Automated local development startup workflows to make local testing faster and easier.
    Bash, Linux, Docker, tmux
Adjunct Professor – Machine Learning Engineering SDSU 2020–2023
  • Developed and taught Machine Learning Engineering coursework focused on practical ML systems, model deployment, production workflows, and applied engineering patterns.
    teaching
  • Created public lectures, slides, and examples used by students learning modern ML engineering concepts.
    reveal.js, JavaScript, HTML
Senior Machine Learning Engineer AppFolio 2018–2021
  • Built production ML monitoring dashboards in Redash to track accuracy trends, drift, customer-specific performance, and overall model health across deployed models using real-world invoice-resolution feedback.
    Dashboard, SQL, Redash, Drift, Model Evaluation, Monitoring
  • Improved a BI data consolidation tool that merged sharded customer databases into a single Snowflake analytics database using Debezium and Python, resolving schema drift, data-type conflicts, and warehouse-specific merge/upsert issues across rolling deployments on Kubernetes.
    Debezium, Helm Charts, Kubernetes, PostgreSQL, Data Engineering, Warehousing
  • Built an end-to-end Spark-based ML training and deployment platform that used custom RDD tasks to train 1,400 customer-specific Scikit-learn invoice prediction models in parallel on a cluster.
    Spark, Python, PySpark, Predictive Modeling, AWS, Docker
  • Built a custom Airflow-on-Kubernetes service using Helm charts, with cluster provisioning, permissions, and related infrastructure scripted in Terraform.
    Airflow, Helm, Kubernetes, Terraform
  • Built Airflow DAG workflows that ran custom Docker images on Kubernetes clusters to automatically train and update customer models.
    Airflow, Docker, Python
  • Built a fully automated ML training and deployment platform supporting approximately 1,400 customer-specific invoice prediction models.
    Predictive Modeling, FastAPI
  • Designed cloud-native ML infrastructure using Kubernetes, Docker, Airflow, Terraform, and production deployment workflows.
    Docker, Airflow, Terraform
  • Developed document understanding models to extract invoice metadata including vendor, invoice amount, invoice date, and related fields from customer invoices.
    OCR, Python, Weights & Biases, MLflow
  • Built conversational AI and customer-support chatbot systems using BERT-based NLP models, custom feature engineering, intent prediction, response selection, and next-question recommendation workflows.
    BERT, NLP, Python, PyTorch, PyTorch Lightning, spaCy
  • Converted ML models to AWS Lambda containers for both cost savings and faster deployments.
    AWS, Python, Model Serving
  • Implemented CI/CD workflows for automating data preparation, model training, validation, deployment, and production updates.
    CI/CD
  • Improved existing machine learning models through feature engineering and model selection.
    SQL, Model Evaluation, Feature Engineering, Predictive Modeling, A/B Testing
  • Used rapid invoice-resolution feedback to compare model behavior and evaluate candidate improvements across deployed customer-specific prediction systems.
    Model Evaluation, Experimentation, A/B Testing, Monitoring, Predictive Modeling
  • Published trained Scikit-learn invoice prediction models to cloud storage for production serving by application services.
    Scikit-learn, Python, Model Serving, AWS, Pandas, NumPy
  • Participated in weekly on-call rotations for production ML systems and platform infrastructure, owning incident response and operational reliability for deployed customer models.
    On-Call, Incident Response, Production, Monitoring
Data Scientist Experian 2014–2018
  • Re-architected Hadoop-based data processing workflows to Apache Spark, reducing processing time and AWS infrastructure cost.
    Hadoop, Spark
  • Designed and implemented an end-to-end AWS EMR optimization system for cookie-replacement and digital identity models, evaluating candidate variables against entropy, longevity, and validation-performance metrics in a survival-analysis-style framework for offline model comparison.
    AWS EMR, Java, Hadoop, Pig, SQL, Model Optimization
  • Built automated reporting workflows using SQL, R, R Markdown, and Shiny to summarize candidate model performance, compare feature sets, and support statistical analysis of digital identity models.
    SQL, R, R Markdown, Shiny, Reporting, Model Evaluation
  • Developed distributed data systems using Hadoop, Pig, Java, SQL, Spark, and Airflow for feature analysis, model evaluation, and reporting.
    Java, SQL, Spark, Airflow
  • Designed optimization workflows for evaluating large combinatorial modeling spaces that were computationally infeasible to search exhaustively, supporting large-scale offline model and feature comparison.
    Numerical Optimizations, Model Evaluation, Experimentation
  • Created Airflow-based orchestration for local and AWS-based data pipelines, from ingestion through modeling and final reports.
    Airflow, Linux
  • Expanded available predictive variables from approximately 100 to over 2,000 through automated feature engineering and data analysis.
    Java, Spark
  • Built a Hadoop/Spark cluster from scratch using decommissioned hardware when the team lacked a dedicated environment for Spark experimentation.
    Spark, Linux
Statistical Modeler Digital Risk 2012–2014
  • Built end-to-end predictive modeling systems using credit, property, CSV, web, API, and third-party data sources to prioritize loans for forensic audit and defect detection.
    Predictive Modeling, Data Engineering, Statistics, Decision Support
  • Designed and maintained a SQL Server research database with over 9 million observations and more than 5,500 candidate predictive variables.
    SQL, SQL Server, Data Engineering, Feature Generation
  • Developed mortgage, valuation, and financial risk models using R, SAS, Matlab, SQL Server, C#, and statistical modeling workflows for high-stakes forensic-audit decision support.
    SAS, C#, Matlab, R, SQL, SQL Server
  • Implemented a Bayesian network in C#.
    C#, Bayesian Network, Bayesian Methods
  • Built machine learning automated valuation models (AVMs) for real estate assets, including apartment-complex-level models using property, credit, geographic, and third-party data sources.
    AVM
  • Engineered geospatial valuation features, including shapefile-based distance-to-coast calculations for Florida properties where proximity materially impacted property value.
    Shapefile, Geospatial, AVM
  • Generated 10-year real estate investment analysis reports estimating property income, expenses, and projected future property value.
  • Wrote multithreaded optimization tools in Matlab and C# to accelerate variable selection and modeling workflows, improving one process by roughly 20x over the prior non-optimized approach.
    Matlab, C#, Numerical Optimization, Variable Selection, Model Evaluation
  • Implemented production scoring systems and automated internal and customer-facing reports using SQL Server, SSRS, and C#-generated PDFs with maps, visualizations, charts, and analytical summaries for audit-targeting workflows.
    C#, SQL Server, SSRS, Reporting, Decision Support
  • Solved workforce distribution optimization problems to increase employee productivity.
    Optimizer, Linear Programming, Optimization
  • Compared baseline and candidate models, including head-to-head testing of competing approaches, to improve statistical targeting of loans for forensic audit.
    Model Evaluation, A/B Testing, Experimentation, Statistics, Decision Support
Doctoral Researcher, Computational Statistics SDSU / CGU 2011–2018
  • Developed spherical-harmonic statistical methods for dimension reduction and global climate signal extraction from high-dimensional geospatial data on the surface of a sphere.
    Computational Statistics, Dimension Reduction, Eigenvalues, Eigenvectors, Spherical Harmonics, Statistics
  • Stabilized rank-deficient matrix and eigensystem calculations using diagonal-perturbation and regularization techniques in repeated large-scale harmonic-specific statistical workflows.
    Linear Algebra, Regularization, Numerical Methods, Statistics, Eigenvalues, Eigenvectors
  • Built a hybrid Java and C/C++ numerical pipeline, using JNI-based Java wrappers around native code to accelerate computationally intensive matrix and estimation routines; used Matlab for visualization and plotting.
    Java, C++, C, JNI, Numerical Methods, Matlab
  • Re-architected the research computation to run on Spark by modeling harmonic-specific workloads as distributed classes/RDD tasks, making thousands of expensive regularized computations tractable across a cluster.
    Spark, RDD, Java, Distributed Computing, Statistics, Performance Optimization
.NET Application Developer / QA / Technical Support CoreLogic 2002–2012
  • Built an end-to-end credit card bust-out fraud model for a major banking institution under a one-month deadline, covering data cleaning, feature extraction, model development, evaluation, and delivery in a form the client could implement with minimal system changes.
    R, Bash, Linux, ETL Pipelines, Model Evaluation, Predictive Modeling
  • Validated both incoming client data integrity and the statistical methodology used in production code, helping ensure production model correctness.
    C#, Statistics, Validation, Model Evaluation
  • Built a survival-analysis / net-present-value optimization model for mortgage-loss mitigation, comparing foreclosure, short-sale, and loan-modification scenarios under varying bounded input parameters for decision support.
    C#, Solver, Survival Analysis, Optimization, Financial Modeling, Decision Support
  • Rewrote an existing numerical optimizer in C#, making it roughly 10x faster through multithreading, mathematical simplification, and formula-level refactoring.
    Optimization
  • Built fraud detection, mortgage-risk, and predictive analytics machine learning models for financial services clients.
    Machine Learning, Fraud, Mortgage Risk, Predictive Analytics
  • Developed C# production code for mathematical, statistical, and optimization components of the WillCap mortgage-risk project.
    C#, Statistics, Optimization, Financial Modeling

Education

Ph.D. in Computational Statistics

San Diego State University / Claremont Graduate University, 2018

M.S. in Statistics

San Diego State University, 2014

B.S. / B.A. in Statistics / Mathematics

San Diego State University, 2009

Most Outstanding Math/Stats Graduating Student, Class of 2009

Publications

Pierret, J., and S.S.P. Shen, 2017. 4D visual delivery of big climate data: A fast web database application system. Advances in Data Science and Adaptive Analysis.

Shen, S.S.P., Pierret, J., et al., 2020. 4DVD visualization and delivery of 20th century reanalysis data: methods and examples. Theoretical and Applied Climatology.