Predictive Maintenance

Data Scientist — ETL Pipelines, Model Training & Inference, Business Reporting
Amine Dhemaied (Deputy Division Chief), Vicent Talfumiere (Division Chief), Loïc Saulnier (Operations Manager)
From January 2025 to March 2026 (1 year 3 months)
From raw IoT telemetry to predictive models and maintenance dashboards — I built an end-to-end ML pipeline to detect railway risk zones and help field teams prioritize interventions.
GitLab (CI/CD), Power BI, Azure Storage, ArcGIS, Python, SQL
Overview
French trains run on 30,000 km of track. When a rail defect goes undetected, the cost isn't just money — it's safety. I built an end-to-end ML system to identify Risk Zones (ZER) on the national railway network, turning massive volumes of raw IoT data into actionable maintenance decisions for field teams.
THE PROBLEM
SNCF's network generates over 100 GB of data from IoT sensors, geometry measurement trains, and maintenance logs. But these sources don't speak the same language: GPS signals drift, sampling rates vary, and spatial references are inconsistent. Correlating a sensor reading at kilometer 42.7 with a maintenance event logged at "PK 42+700" requires careful spatial alignment.
On top of this, actual defect zones represent less than 1% of the network — a classic imbalanced learning problem where a naive model could score 99% accuracy by predicting "no defect" everywhere, while missing every real risk.
DATA & PIPELINE
I designed an ETL pipeline to harmonize heterogeneous spatio-temporal data around a unified Linear Reference System (LRS):
- GPS Snapping: Corrected spatial drift via orthogonal projection onto the theoretical track graph (Line/Track/PK), ensuring every sensor reading maps to the right piece of rail.
- Temporal Alignment: Used `merge_asof` logic to join continuous time series (geometry measurements sampled every 25 cm) with discrete events (maintenance interventions, defect reports).
- Feature Engineering: Computed rolling statistics (mean, variance, skewness) over spatial windows (100 m, 1 km) to capture the local "texture" of the track — how smoothly or erratically the geometry evolves in each segment.
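The GPS snapping step reduces, per track segment, to an orthogonal projection. A minimal sketch (the function name and the straight-segment simplification are mine; the real track graph is a polyline per Line/Track):

```python
import numpy as np

def snap_to_segment(point, seg_start, seg_end):
    """Orthogonally project a drifted GPS point onto a track segment.

    Returns the snapped coordinates and the curvilinear position t in
    [0, 1] along the segment, which converts to a PK once the segment's
    start PK and length are known.
    """
    p = np.asarray(point, dtype=float)
    a = np.asarray(seg_start, dtype=float)
    b = np.asarray(seg_end, dtype=float)
    ab = b - a
    # Scalar projection of (p - a) onto ab, clamped so the snapped
    # point stays on the segment rather than on its extension.
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return a + t * ab, t

# A reading drifted 3 m off a straight 1 km segment snaps back onto
# the track axis, halfway along the segment.
snapped, t = snap_to_segment((500.0, 3.0), (0.0, 0.0), (1000.0, 0.0))
# snapped -> [500., 0.], t -> 0.5
```

On the real network, the same projection is applied to the nearest candidate segment of the theoretical track graph.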
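The temporal alignment and feature engineering steps can be sketched with pandas; column names and values here are illustrative, not the project's actual schema:

```python
import pandas as pd

# Toy geometry measurements: one reading every 25 cm along the track.
geo = pd.DataFrame({
    "pk_m": [0.00, 0.25, 0.50, 0.75, 1.00, 1.25],
    "ts": pd.to_datetime(["2025-01-10"] * 6),
    "leveling": [1.2, 1.1, 1.4, 2.0, 2.1, 1.9],
})

# Discrete maintenance events, each stamped with a date.
events = pd.DataFrame({
    "ts": pd.to_datetime(["2025-01-05", "2025-02-01"]),
    "intervention": ["tamping", "grinding"],
})

# merge_asof attaches the most recent past event to each reading;
# both frames must be sorted on the join key.
joined = pd.merge_asof(
    geo.sort_values("ts"), events.sort_values("ts"),
    on="ts", direction="backward",
)

# Rolling stats over a spatial window: with 25 cm sampling, a 1 m
# window spans 4 consecutive readings (a 100 m window would span 400).
joined = joined.sort_values("pk_m")
joined["leveling_mean_1m"] = joined["leveling"].rolling(4, min_periods=1).mean()
joined["leveling_var_1m"] = joined["leveling"].rolling(4, min_periods=1).var()
```

Sorting by PK before the rolling pass is what turns a time-indexed operation into a spatial one: the window slides along the rail, not the clock.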
METHODOLOGY
In railway safety, a missed defect is far more costly than an unnecessary inspection. The entire modeling strategy is built around this asymmetry:
- Cost-Sensitive Modeling: Optimized XGBoost using class weighting (`scale_pos_weight`) and decision threshold tuning to maximize Recall — catching as many real defects as possible, even at the cost of some false alarms.
- Model Comparison: Benchmarked Logistic Regression (baseline), Random Forest, and XGBoost across the full precision-recall spectrum.
- Root Cause Analysis: Used SHAP values to identify what drives defects — revealing that longitudinal leveling degradation combined with clay-rich subsoil is the strongest predictor of track failure.
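The cost-sensitive recipe can be sketched with scikit-learn, using the Logistic Regression baseline as a stand-in for XGBoost (`class_weight="balanced"` plays the same role as `scale_pos_weight`); the data here is synthetic, not SNCF's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~1% positives, mimicking rare defect zones.
X, y = make_classification(
    n_samples=20000, n_features=10, weights=[0.99], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weighting: mistakes on the rare positive class cost more,
# so the model stops defaulting to "no defect" everywhere.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Threshold tuning: lowering the cut-off trades precision for recall.
metrics = {
    thr: (
        precision_score(y_te, proba >= thr, zero_division=0),
        recall_score(y_te, proba >= thr),
    )
    for thr in (0.70, 0.50, 0.30)
}
```

Sweeping the threshold like this is what produces the precision/recall trade-off table below: recall can only grow as the cut-off drops.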
| Threshold | Precision | Recall | F1 | Strategy |
|---|---|---|---|---|
| 0.50 (default) | 68.2% | 73.5% | 0.71 | Balanced |
| 0.30 (chosen) | 42.1% | 90.8% | 0.58 | Safety-first |
| 0.70 | 81.4% | 54.2% | 0.65 | Conservative |
Lowering the decision threshold catches 9 out of 10 risky zones — the right call when an undetected defect can compromise train safety.
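The "safety-first" choice amounts to picking the largest threshold that still meets a target recall, so precision is sacrificed no more than necessary. A minimal, dependency-free sketch (the helper name is mine):

```python
import numpy as np

def threshold_for_recall(y_true, proba, target_recall=0.9):
    """Largest decision threshold whose recall still meets the target.

    Scanning candidate thresholds from high to low, recall can only
    grow; stopping at the first one that reaches the target keeps
    precision as high as possible.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    n_pos = y_true.sum()
    for thr in np.sort(np.unique(proba))[::-1]:
        recall = ((proba >= thr) & (y_true == 1)).sum() / n_pos
        if recall >= target_recall:
            return thr
    return 0.0

# Toy example with two positives: catching both of them requires
# lowering the threshold to the weaker positive's score, 0.4.
thr = threshold_for_recall([0, 1, 0, 1], [0.2, 0.9, 0.6, 0.4], 1.0)
# thr -> 0.4
```

In practice the same sweep is run on a held-out set, then the chosen threshold is frozen before deployment.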
DASHBOARD
Data Science doesn't end at the model. I designed spatio-temporal heatmaps — dense matrices where the X-axis tracks position along the rail (in kilometers) and the Y-axis tracks time (in months). These visualizations let maintenance engineers see how defects slowly drift and intensify, revealing seasonal patterns like the impact of rainfall on ballast stability.
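The dense matrix behind such a heatmap is a pivot over position and time; a minimal pandas sketch with illustrative column names:

```python
import pandas as pd

# Toy defect-risk scores indexed by track position (PK) and month.
df = pd.DataFrame({
    "pk_km": [42, 42, 43, 43, 42, 43],
    "month": ["2025-01", "2025-01", "2025-01",
              "2025-02", "2025-02", "2025-02"],
    "risk": [0.2, 0.4, 0.1, 0.5, 0.6, 0.7],
})

# Pivot into the heatmap matrix: rows = months (time axis),
# columns = PK (position along the rail), cells = mean risk.
heat = df.pivot_table(
    index="month", columns="pk_km", values="risk", aggfunc="mean",
)
# heat can then be rendered with e.g. plotly.express.imshow(heat).
```

Reading down a column shows a single zone degrading over time; reading across a row shows where the risk sits along the line at a given month.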
A second layer compares defect distributions before and after repairs, qualifying whether an intervention actually held or if the geometry is degrading again. This turns the dashboard from a snapshot into a living monitoring tool.
RESULTS & IMPACT
The model catches 9 out of 10 risky zones before they escalate into safety-critical incidents — enabling maintenance teams to intervene proactively rather than reactively.
Beyond the metric, the project delivered a complete decision support system: from raw telemetry to interactive Power BI dashboards that field teams actually use. SHAP-based root cause analysis translated statistical findings into plain-language recommendations for non-technical stakeholders — bridging the gap between Data Science and operations.
REFERENCES
View Technical Report (PDF) →
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD '16.
- Lundberg, S.M., & Lee, S.I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS '17.
- SNCF Réseau — Internal methodology for ZER classification and maintenance prioritization.
TECH STACK
Data Science: Python, Pandas, Scikit-learn, XGBoost, SHAP, MLflow, DVC.
Engineering: SQL (PostgreSQL), ETL Pipelines, ArcGIS (GIS), GitLab CI/CD.
Visualization: Power BI, Streamlit, Plotly.
This is an archived project. Please reach out if you have any questions.