Predictive Maintenance Analytics Platform
The client was an industrial equipment operator running roughly two hundred high-value machines across three sites in a manufacturing-and-logistics business. Unplanned downtime was the largest single source of operational variance, with each significant incident costing more than fifty thousand dollars in lost throughput plus parts and overtime. The maintenance team ran a calendar-based preventive program, which meant machines were serviced too often when running well and not often enough during stress periods.
Trading, FinTech & Analytics
Manufacturing, Logistics
16 weeks from kickoff to all three sites in production
5 specialists
The full story
The practical problem was that the machines had sensors but the data lived in vendor-specific systems with no unified view. Maintenance decisions relied on machine-operator intuition plus calendar dates, neither of which captured the actual stress state of a machine. Several near-miss failures in the prior year had been preceded by anomalies visible in the sensor data after the fact, but no one had been watching the right signals at the right time.
We built a predictive maintenance platform that ingested sensor streams from all two hundred machines into a single time-series store, ran per-machine-class anomaly detection plus failure prediction models, and generated maintenance recommendations with predicted-time-to-failure and confidence intervals. The maintenance scheduler took recommendations plus production-schedule constraints and proposed maintenance windows that minimized production impact while respecting predicted-failure deadlines.
What shipped was an operations workstation where the maintenance manager sees every machine’s health score, predicted-time-to-failure for any at-risk machines, and a proposed maintenance schedule that balances risk against production commitments. Unplanned downtime dropped substantially in the first year, the maintenance team reduced over-servicing on healthy machines, and the operations director gained a defensible view into asset condition that they could carry into capital budgeting conversations.
Calendar-based maintenance over-serviced healthy machines and under-serviced stressed ones, with unplanned downtime as the cost.
Each significant downtime incident cost over fifty thousand dollars in lost throughput, parts, and overtime, with low predictability.
Sensor data lived in vendor-specific systems with no unified view, so cross-machine patterns were invisible to the maintenance team.
Calendar-based schedules serviced healthy machines too often and missed stressed machines until they failed unexpectedly.
Near-miss failures in the prior year were visible in sensor data after the fact, but no one was watching the right signals in real time.
Capital budgeting for asset replacement relied on age and intuition because no quantitative condition view existed across the fleet.
How we structured the engagement
Unified sensor streams into one time-series store and built per-machine-class models so prediction matched physical reality.
- Phase 01 (Weeks 1-3): Discovery
Audited two years of maintenance logs and three months of available sensor data across all machine classes. Worked with maintenance leads on which failure modes were most costly and which sensor signals correlated with which failures. Output: a per-class feature set, failure mode taxonomy, and a unified time-series schema.
- Phase 02 (Weeks 4-5): Architecture
Designed an MQTT-based ingestion path from machine PLCs and sensors into AWS IoT Core, with stream processing into a time-series store. Built per-machine-class anomaly detectors and failure prediction models because machine physics differed enough that one-size-fits-all models underperformed in pilot testing.
- Phase 03 (Weeks 6-13): Build
Shipped ingestion and the time-series store first, then per-class models in sequence by failure-mode cost. Built the maintenance scheduler with production-schedule integration via the existing ERP. Implemented the maintenance manager workstation with health scores, predicted-time-to-failure, and proposed scheduling.
- Phase 04 (Weeks 14-16): Launch
Rolled out across the three sites in two-week waves, each starting with the highest-cost failure modes. Held a weekly maintenance review where the platform’s recommendations were validated against the team’s judgment, with disagreements logged as training signal. Promoted to ERP-integrated scheduling once recommendation acceptance held above seventy-five percent.
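The cutover gate described in the launch phase can be sketched roughly as follows. The seventy-five percent threshold comes from the text; the Review record, the function names, and the requirement of three consecutive weeks are assumptions made for illustration, since the case study does not say how long acceptance had to hold.

```python
from dataclasses import dataclass


@dataclass
class Review:
    recommendation_id: str
    accepted: bool  # True when the maintenance lead agreed with the platform


def acceptance_rate(reviews):
    """Share of reviewed recommendations that the leads accepted."""
    if not reviews:
        return 0.0
    return sum(r.accepted for r in reviews) / len(reviews)


def gate_open(weekly_reviews, threshold=0.75, weeks_required=3):
    """True when each of the last `weeks_required` weeks beat the threshold.

    `weeks_required` is a hypothetical parameter, not a figure from the project.
    """
    recent = weekly_reviews[-weeks_required:]
    if len(recent) < weeks_required:
        return False
    return all(acceptance_rate(week) > threshold for week in recent)
```

Requiring several consecutive weeks, rather than a single good week, is one way to read "held above"; a rolling average would be an equally plausible reading.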
What we built, component by component
- 01 Sensor ingestion
MQTT-based capture from machine PLCs and edge sensors into AWS IoT Core with per-machine identity and tenancy metadata.
- 02 Time-series store
Unified store across all two hundred machines with retention policies tuned per signal type and per failure-mode lookback need.
- 03 Anomaly detectors
Per-machine-class detectors that flag deviation from baseline behavior, used as input to the failure prediction models.
- 04 Failure prediction models
Per-class models that predict time-to-failure with confidence intervals, trained on historical failures and near-misses.
- 05 Maintenance scheduler
Takes predictions plus production schedule constraints from the ERP and proposes maintenance windows that minimize impact.
- 06 Operations workstation
Maintenance manager interface with fleet health scores, at-risk machines, predicted failure timing, and schedule proposals.
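To give a feel for what "time-to-failure with an interval" can look like in the failure prediction component, here is a deliberately simple sketch. The case study does not describe how the production models computed their intervals, so this uses an empirical percentile band over past same-class failures as a stand-in; the function name and the input shape are hypothetical.

```python
import statistics


def ttf_estimate(hours_to_failure):
    """Median and 10th-90th percentile band from past same-class failures.

    `hours_to_failure` holds, for each historical failure in one machine
    class, the hours between the first anomaly flag and the failure.
    This is an empirical band, not a model-based confidence interval.
    """
    if len(hours_to_failure) < 2:
        raise ValueError("need at least two historical failures")
    # n=10 yields nine cut points; the first and last are the 10th and 90th
    # percentiles. method="inclusive" keeps them inside the observed range.
    deciles = statistics.quantiles(hours_to_failure, n=10, method="inclusive")
    return {
        "p10": deciles[0],
        "median": statistics.median(hours_to_failure),
        "p90": deciles[-1],
    }
```

A scheduler would typically treat the lower end of the band as the conservative deadline, since servicing slightly early is cheaper than an unplanned stop.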
Sensor streams flow through MQTT into AWS IoT Core and into the time-series store. Anomaly detectors run continuously per machine, failure prediction models score at-risk machines, and the scheduler combines predictions with production schedule data from the ERP. The operations workstation renders the consolidated view and proposed maintenance windows for human approval and ERP commit.
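The first hop in that flow, turning a raw MQTT message into a row for the unified store, might look something like the sketch below. The topic layout (sites/<site>/machines/<machine_id>/signals/<signal>), the JSON payload fields ts and value, and the Reading record are all assumptions for illustration; the case study does not publish the actual schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """One row in a hypothetical unified time-series schema."""
    site: str
    machine_id: str
    signal: str
    ts: datetime
    value: float


def normalize(topic: str, payload: bytes) -> Reading:
    """Parse an assumed topic layout and JSON payload into a Reading."""
    parts = topic.split("/")
    if (len(parts) != 6 or parts[0] != "sites"
            or parts[2] != "machines" or parts[4] != "signals"):
        raise ValueError(f"unexpected topic: {topic}")
    body = json.loads(payload)
    return Reading(
        site=parts[1],
        machine_id=parts[3],
        signal=parts[5],
        # Payload timestamp is assumed to be epoch seconds in UTC.
        ts=datetime.fromtimestamp(body["ts"], tz=timezone.utc),
        value=float(body["value"]),
    )
```

Carrying site and machine identity in the topic rather than the payload keeps vendor-specific payloads small and lets the broker route by machine without parsing message bodies.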
The trade-offs we made and why
Per-machine-class models rather than one fleet-wide model
Machine physics differed enough that a fleet-wide model averaged across classes and underperformed every class individually. Per-class models cost more to operate but produced predictions that matched the actual failure mechanisms, which is what made the maintenance team trust the output.
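A toy example shows why averaging across classes hides real anomalies. It scores a reading as a z-score against a baseline of healthy readings, once per class and once fleet-wide. The machine classes, the vibration values, and the function names are invented for illustration and are not the project's detectors.

```python
import statistics
from collections import defaultdict


def fit_baselines(readings):
    """readings: iterable of (group, value) pairs from healthy operation.

    Returns {group: (mean, sample standard deviation)}.
    """
    by_group = defaultdict(list)
    for group, value in readings:
        by_group[group].append(value)
    return {g: (statistics.fmean(v), statistics.stdev(v))
            for g, v in by_group.items()}


def z_score(baselines, group, value):
    """Deviation from the group's baseline in standard deviations."""
    mean, std = baselines[group]
    return (value - mean) / std


# Invented healthy vibration levels for two machine classes.
healthy = ([("conveyor", v) for v in (1.9, 2.0, 2.1, 2.0, 1.8, 2.2)]
           + [("press", v) for v in (7.8, 8.0, 8.2, 8.1, 7.9, 8.0)])

per_class = fit_baselines(healthy)
fleet_wide = fit_baselines(("fleet", v) for _, v in healthy)

suspect = 4.0  # a conveyor vibrating at twice its normal level
z_class = z_score(per_class, "conveyor", suspect)
z_fleet = z_score(fleet_wide, "fleet", suspect)
```

Against the conveyor baseline the reading is far outside normal behavior, while against the pooled fleet baseline it sits comfortably near the middle, because the presses drag the pooled mean and spread upward.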
Unified time-series store over vendor-specific systems
Vendor-specific systems were optimized for vendor diagnostics, not cross-fleet patterns. A unified store made cross-machine and cross-site analysis possible, which surfaced patterns like correlated failures across machines sharing power infrastructure that vendor systems would never expose.
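Once events from every machine sit in one store, a cross-machine check of that kind becomes a short query. The sketch below flags different machines on the same power feed that failed close together in time. The event shape, the feed_of mapping, and the one-hour window are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlated_failures(events, feed_of, window=timedelta(hours=1)):
    """Find consecutive failures on one power feed within `window`.

    events:  list of (machine_id, failure_time) tuples.
    feed_of: dict mapping machine_id to a power feed identifier.
    Returns {feed: [(earlier_machine, later_machine), ...]}.
    Only time-adjacent events on a feed are compared, which is enough
    for a sketch but would miss some pairs in a dense burst.
    """
    by_feed = defaultdict(list)
    for machine, ts in events:
        by_feed[feed_of[machine]].append((ts, machine))

    hits = defaultdict(list)
    for feed, items in by_feed.items():
        items.sort()
        for (t1, m1), (t2, m2) in zip(items, items[1:]):
            if m1 != m2 and t2 - t1 <= window:
                hits[feed].append((m1, m2))
    return dict(hits)
```

A vendor system that only sees its own machines cannot run this check at all, because the grouping key (the shared feed) spans vendors.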
Integrated with production schedule for window proposals
Maintenance recommendations without production-schedule awareness would have proposed windows that production could not accept, undermining adoption. Wiring the scheduler to the ERP made the proposals operationally realistic and shifted the conversation from "can we do this maintenance" to "when in this window".
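The core of that scheduling decision can be sketched as a filter followed by a minimum: keep only the windows that are long enough and finish before the predicted-failure deadline, then take the one with the least production impact. The Window record, the production_loss field, and pick_window are hypothetical names; the real scheduler's inputs from the ERP are not described in detail.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Window:
    start: datetime
    end: datetime
    production_loss: float  # assumed cost of idling the machine in this slot


def pick_window(windows, deadline, duration_hours):
    """Cheapest window that fits the job and ends before the deadline.

    `deadline` is assumed to be the conservative (early) end of the
    predicted time-to-failure interval. Returns None when nothing fits,
    which is the case a human has to resolve with production planning.
    """
    needed_seconds = duration_hours * 3600
    feasible = [
        w for w in windows
        if w.end <= deadline
        and (w.end - w.start).total_seconds() >= needed_seconds
    ]
    if not feasible:
        return None
    # Break ties on cost by preferring the earlier window.
    return min(feasible, key=lambda w: (w.production_loss, w.start))
```

Returning None instead of the least-bad infeasible window keeps the trade-off visible: the proposal is either realistic for production or escalated.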
Logged maintenance-team disagreements as training signal
Maintenance leads had years of intuition the model could not encode from scratch. Logging their disagreements with model recommendations and reviewing them weekly let the model learn from the team’s expertise rather than competing with it, which accelerated model improvement and earned trust simultaneously.
What changed for the client
failure accuracy
F1 score (the harmonic mean of precision and recall) for failures predicted within the predicted time window, measured on a held-out three-month sample.
unplanned downtime
Reduction in unplanned downtime hours across the three sites in the first full year after platform rollout completed.
machines monitored
Total machines under continuous monitoring at full rollout, covering all high-value equipment across all three sites.
recommendation acceptance
Share of maintenance recommendations accepted by leads after model tuning was complete, used as ERP-integration cutover gate.
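For readers unfamiliar with the failure accuracy metric above, the harmonic mean of precision and recall is computed as in this short sketch. The counts in the example are made up to show the arithmetic and are not the client's results.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall.

    A true positive here would be a failure that occurred inside the
    predicted time window; a false positive a predicted failure that did
    not occur in its window; a false negative a failure nobody predicted.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 correct predictions, 2 false alarms, 2 misses.
example = f1_score(8, 2, 2)
```

The harmonic mean penalizes imbalance, so a model that flags every machine (high recall, low precision) or almost none (the reverse) scores poorly on it.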
The tools behind the system
Built with a deliberate stack chosen for production reliability and operational velocity.
Lessons learned from the build
Per-class models were worth the operational overhead. We almost shipped a fleet-wide model in the interest of simplicity, but the per-class approach added meaningfully to prediction quality on every machine class. Industrial physics resists one-size-fits-all, and we would commit to per-class models from day one in future projects.
Logging team disagreements as training signal turned skeptics into contributors. Maintenance leads who initially resisted the platform became its most active users once they saw their corrections changing model behavior. The disagreement log was as much a change-management tool as a data pipeline.
Production-schedule integration was the line between adoption and shelfware. Without it, the platform would have produced recommendations the operations team could not act on, and adoption would have failed within a month. We would scope ERP integration into the critical path on any operational platform in the future.