xStryk™

Decision Intelligence for AI in production — guardrails, traceability & evaluation.

MINING · ACTIVE

ML model competition system for multinational mining

xSingular · Multinational · 7 min read

How a multinational mining company accelerated model selection 4.7x with standardized sandbox, automatic evaluation, and real-time leaderboard.

4.7x · Selection acceleration
8+ · Teams competing
100% · Reproducible evaluation

Model Competition · 5-Model Ensemble · Accuracy Convergence · Mining

[Chart: accuracy vs. training iteration for five competing models (LGBM, XGBoost, RandomForest, NeuralNet, LogisticReg). Champion: LGBM, 0.923 accuracy, 87 features, multinational mining ops]

Key results at a glance

  • xSingular achieved 4.7x selection acceleration through the xStryk ecosystem.
  • xSingular achieved 8+ competing teams through the xStryk ecosystem.
  • xSingular achieved 100% reproducible evaluation through the xStryk ecosystem.

Corporate context and problem genesis

A multinational mining company with operations in four countries had, over five years, accumulated a heterogeneous portfolio of machine learning initiatives: 23 models at various stages of development, trained by 6 internal teams, 4 technology vendors, and 2 university research centers. Each entity used its own technology stack, its own validation datasets, and its own reporting metrics.

The problem became critical when the operations leadership had to select a failure prediction model for copper extraction equipment from three candidates: one from a European industrial vendor, one from the internal data science team, and one from a partner university. All three presented excellent results in their own evaluations, with reported accuracy between 91% and 94%. When an independent evaluation was requested, the IT team had no framework to execute it: there were no common reference datasets, no standardized metrics, and the three models used slightly different definitions of the failure event they were trying to predict.

The decision was ultimately made based on vendor reputation and contract price, not objective technical evidence. In production, the selected model's precision turned out 23% lower than the vendor's reported figure, because the vendor's evaluation dataset did not represent the real production data distribution of the specific mining site.

Without a standardized evaluation framework, model selection becomes a political exercise: each team presents its own metrics, under its own conditions, on its own datasets. The only way to compare is to create a neutral environment where all models are evaluated with exactly the same data, metrics, and execution conditions.

Standard evaluation framework design

The first step was defining the priority use cases the system should cover and the corresponding evaluation metrics. Four high-priority use cases were identified for the first version: failure prediction for extraction equipment (trucks, shovels, crushers), ore grade prediction by block, reagent consumption optimization in flotation, and energy consumption prediction per shift.

For each use case, three levels of metrics were defined: business metrics (the indicator the operational area cares about, for example the cost of an unpredicted failure in dollars per event), machine learning technical metrics (AUC-ROC, RMSE, precision/recall depending on the case), and robustness metrics (model performance with typical production noise, across shift changes, and with missing data in realistic proportions).

Defining the business metrics was the most critical step and the one requiring the most time: 6 weeks of workshops with maintenance engineers, geologists, and operators to translate operational objectives into quantifiable metrics that could be automatically evaluated on historical data.

  • Use case 1 - Failure prediction: business metric = expected cost of false negatives (unpredicted failure) vs. cost of false positives (unnecessary maintenance)
  • Use case 2 - Ore grade: business metric = production planning error in dollars, technical metric = RMSE per block
  • Use case 3 - Reagent consumption: business metric = deviation vs. optimal consumption in kilograms per ton, technical metric = MAE and systematic bias
  • Use case 4 - Energy consumption: business metric = deviation cost vs. firm power contract in dollars per hour
  • Cross-cutting robustness metrics: performance with 10% missing data, with Gaussian noise on critical sensors, and with distribution shifts between shifts
  • Reproducibility metric: performance variance across 5 independent runs with different seeds — maximum 2% CV threshold to classify a model as reproducible
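The reproducibility gate in the last bullet can be sketched as a simple check: run the same container several times with different seeds and compute the coefficient of variation of the resulting metric. The scores below are illustrative, not from the case:

```python
import statistics

def is_reproducible(scores, max_cv=0.02):
    """Classify a run set as reproducible when the coefficient of
    variation (sample stdev / mean) stays under the 2% threshold."""
    cv = statistics.stdev(scores) / statistics.mean(scores)
    return cv <= max_cv, cv

# Five AUC scores from the same container run with different seeds
stable, cv = is_reproducible([0.921, 0.923, 0.922, 0.924, 0.920])
```

A run set that fails this gate is flagged before its metrics ever reach the leaderboard, so a lucky seed cannot win a round.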

Evaluation sandbox design

The sandbox is a containerized execution environment (Docker + Kubernetes) that guarantees identical conditions for all competing models. Each model is packaged as a standardized container that receives input datasets via a defined API contract and returns predictions in a standardized format. The sandbox runs the complete evaluation suite — technical, business, and robustness metrics — and returns results in a structured JSON that feeds the leaderboard.

The API contract design was a critical engineering decision. A minimalist format was chosen: the model container receives a Parquet file with input features and must return a Parquet file with predictions and, optionally, confidence intervals. This design is generic enough to accept any type of model (sklearn, PyTorch, TensorFlow, ONNX) without requiring teams to modify their technology stack, but specific enough to guarantee reproducibility and isolation between runs.
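A minimal sketch of the contract check on the model's output, using plain dicts as stand-ins for the Parquet rows (field names such as `row_id`, `prediction`, and `ci_lower`/`ci_upper` are assumptions, not the real schema):

```python
def validate_predictions(inputs, outputs):
    """Minimal contract check: one prediction per input row, matching
    row ids, a numeric prediction, and an optional confidence interval
    that actually brackets the prediction."""
    if len(outputs) != len(inputs):
        return False
    for inp, out in zip(inputs, outputs):
        if out.get("row_id") != inp["row_id"]:
            return False
        pred = out.get("prediction")
        if not isinstance(pred, (int, float)):
            return False
        if "ci_lower" in out and "ci_upper" in out \
                and not (out["ci_lower"] <= pred <= out["ci_upper"]):
            return False
    return True

ok = validate_predictions(
    [{"row_id": 1}, {"row_id": 2}],
    [{"row_id": 1, "prediction": 0.8, "ci_lower": 0.7, "ci_upper": 0.9},
     {"row_id": 2, "prediction": 0.3}],
)
```

In the real system this validation would read the Parquet files themselves; the point is that a container whose output fails the schema check is rejected before any metric is computed.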

Benchmarking datasets were built with anonymized real production data, following three principles: representativeness (the dataset distribution must reflect production data distribution, not a cleaned or filtered version), calibrated difficulty (the dataset deliberately includes the most difficult case types: shift changes, missing data in realistic proportions, rare failure events), and stability (the dataset does not change between competition rounds to guarantee comparability).
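The robustness perturbations described above (realistic missing-data proportions, noise on critical sensors) can be sketched like this; the rates, seed, and field names are illustrative assumptions:

```python
import random

def perturb(rows, missing_rate=0.10, noise_sigma=0.05, seed=42):
    """Copy feature rows, blanking out ~10% of values (missing data)
    and adding Gaussian noise to the surviving numeric values."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        new = {}
        for key, value in row.items():
            if rng.random() < missing_rate:
                new[key] = None          # simulated sensor dropout
            else:
                new[key] = value + rng.gauss(0.0, noise_sigma)
        out.append(new)
    return out

# Illustrative row; field names are hypothetical
noisy = perturb([{"vibration_rms": 3.1, "motor_temp_c": 74.0}])
```

Fixing the seed keeps the perturbed benchmark identical across competition rounds, which preserves the stability principle.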

The sandbox API contract is the most important standardization mechanism in the system. By requiring all models to be packaged in a container with an identical input/output interface, the sandbox can evaluate models from any vendor, in any language, with any technology stack — without the evaluation team needing access to source code or model weights.

  • Standardized Docker container: base image with CPU/RAM resources limited by use case, no internet access during evaluation
  • API contract: Parquet input with schema defined by use case, Parquet output with predictions and optional confidence intervals
  • Automatic evaluation suite: 47 tests per use case (technical, business, robustness, and reproducibility metrics)
  • Benchmarking datasets: anonymized real data with calibrated distribution representing actual production, including edge cases and missing data
  • Execution isolation: each model evaluated in an independent Kubernetes namespace with time limit configured per use case
  • Provenance record: hash of evaluation datasets, hash of model container, and execution timestamp recorded with each result
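The provenance record in the last bullet reduces to hashing the evaluation inputs and timestamping the run; a minimal sketch, with hypothetical field names and container tag:

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(dataset_bytes, container_tag):
    """Pin a leaderboard entry to the exact dataset bytes and model
    container that produced it, plus the execution timestamp."""
    return {
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "container_tag": container_tag,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }

rec = provenance_record(b"<parquet bytes>", "registry/failure-pred:v12")
```

Because both hashes are stored with every result, any score on the leaderboard can later be re-run against byte-identical inputs.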

Automatic evaluation pipeline and leaderboard

The evaluation pipeline triggers each time a team uploads a new version of their model to the corporate registry (integrated with the MLflow model registry). A webhook fires the CI pipeline that validates the container format, runs the evaluation suite in the sandbox, and publishes results to the leaderboard in an average time of 34 minutes per model per use case.

The leaderboard shows separate rankings by metric (business, technical, robustness) and a weighted composite ranking that integrates all three dimensions. Composite ranking weights are configurable by the governance team and differ by use case: for failure prediction, the weight of the business metric (expected error cost) is 60% of the composite score, because an unpredicted crusher failure can cost $180,000 per event. For ore grade prediction, the weight of the technical metric (RMSE) is higher because the business impact is more linear.
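The weighted composite ranking can be sketched as below. The 60% business weight for failure prediction comes from the case; the remaining weights and the model scores are illustrative assumptions:

```python
def composite_rank(models, weights):
    """Rank models by a weighted sum of their per-dimension scores;
    weights are configured per use case by the governance team."""
    scored = [
        (name, sum(weights[dim] * metrics[dim] for dim in weights))
        for name, metrics in models.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

weights = {"business": 0.60, "technical": 0.25, "robustness": 0.15}
models = {
    "vendor_a": {"business": 0.78, "technical": 0.92, "robustness": 0.70},
    "internal": {"business": 0.88, "technical": 0.85, "robustness": 0.80},
}
ranking = composite_rank(models, weights)
```

With these weights the internal model outranks the vendor despite the vendor's better technical score, which is exactly the behavior the business-weighted composite is designed to produce.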

The governance dashboard includes, beyond the leaderboard, the complete version history of each model with its metrics, the learning curve of each team across competition rounds, and correlation analysis between technical and business metrics to detect cases where a model excellent on the technical metric performs poorly on the business metric (indicating a problem definition misalignment).

  • Automatic trigger: webhook on push to MLflow model registry, no human intervention to start evaluation
  • Average full evaluation time per model: 34 minutes for a 47-test suite
  • Leaderboard with three independent rankings: business, technical, and robustness, plus weighted composite ranking by use case
  • Governance dashboard: version history, team learning curve, and technical-business correlation analysis
  • Automatic alerts: notification to the team when a model surpasses the current leader on any metric or composite ranking
  • Executive report export: automatic PDF generation with results summary for selection committee presentation

Pilot competition and system validation

The first formal competition was run on the extraction truck failure prediction use case, the one with the highest immediate economic impact. 8 teams participated: 2 internal data science teams from different operational regions, 3 industrial technology vendors, 2 university centers, and 1 corporate headquarters research team.

The competition process had four rounds of 3 weeks each. In the first two rounds, teams could see their own results and the anonymized ranking of others (without revealing which team held each position). In the third round, team names were revealed on the leaderboard. In the fourth round, the top five teams presented their technical architectures to the selection committee, which could consider factors not captured by the sandbox (model interpretability, maintenance costs, update plan).

The competition result was revealing: the team winning on the technical metric (AUC-ROC 0.923) was an external vendor that had optimized its model specifically to maximize AUC. However, on the business metric (expected error cost) it was surpassed by the internal team, whose model (AUC 0.891) had a classification threshold optimized to minimize false negatives on high-cost failures — the metric that actually mattered operationally. The system captured this difference quantitatively and transparently.

The pilot competition demonstrated that optimizing for the correct technical metric is not enough: a model with higher AUC can be inferior on the business metric if its classification threshold is not calibrated for the asymmetric cost of errors. The sandbox quantified this objectively; without the framework, this difference would have been invisible in each vendor's presentations.
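The threshold-calibration effect described above can be made concrete with a small grid search over thresholds that minimizes expected error cost. The $180,000 failure cost is from the case; the $12,000 maintenance cost and the toy events are assumed illustrations:

```python
def expected_cost(y_true, probs, threshold, cost_fn, cost_fp):
    """Total expected error cost at a threshold: false negatives
    (unpredicted failures) vs. false positives (unneeded maintenance)."""
    cost = 0.0
    for truth, p in zip(y_true, probs):
        predicted = p >= threshold
        if truth and not predicted:
            cost += cost_fn
        elif predicted and not truth:
            cost += cost_fp
    return cost

def best_threshold(y_true, probs, cost_fn, cost_fp):
    """Pick the grid threshold with the lowest expected cost."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda t: expected_cost(y_true, probs, t, cost_fn, cost_fp))

y_true = [1, 0, 1, 0, 0]
probs = [0.40, 0.30, 0.70, 0.20, 0.60]
threshold = best_threshold(y_true, probs, cost_fn=180_000, cost_fp=12_000)
```

Because a missed failure costs an order of magnitude more than unnecessary maintenance, the optimal threshold is pushed low enough to catch every true failure, even at the price of extra false alarms, mirroring how the internal team's lower-AUC model won on the business metric.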

Integration with the production model lifecycle

The competition system does not end with the selection of the winning model — it integrates with the deployment and production monitoring pipeline. The winning model is automatically promoted to a staging environment where it runs in shadow mode for 4 weeks, comparing its predictions with those of the currently active production model and with real events. Only if shadow mode performance is consistent with sandbox performance is promotion to production authorized.
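The shadow-mode gate reduces to a consistency check between the sandbox score and the score observed over the shadow period. A minimal sketch; the tolerance band is an assumed value, since the case says only "consistent" without giving a number:

```python
def authorize_promotion(sandbox_score, shadow_score, tolerance=0.03):
    """Authorize promotion to production only if shadow-mode
    performance stays within a tolerance band of the sandbox result."""
    return shadow_score >= sandbox_score - tolerance

# Illustrative scores: sandbox baseline vs. four weeks of shadow mode
promote = authorize_promotion(sandbox_score=0.88, shadow_score=0.86)
```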

Once in production, each model is subject to automatic quarterly re-evaluation in the sandbox using the previous quarter's production data (anonymized and processed into sandbox format). If the production model's performance decays 8% or more relative to the sandbox baseline, a new restricted competition round is automatically triggered, limited to the top five models from prior competition history.

  • Automatic staging promotion: the winner is deployed in shadow mode 4 weeks before replacing the production model
  • Shadow mode validation: systematic comparison between new model and active model predictions against real production events
  • Quarterly re-evaluation: each production model is evaluated with the previous quarter's data to detect performance drift
  • New competition trigger: if drift exceeds 8% degradation, a restricted competition round is automatically activated
  • Lineage record: each production model version has a record of the originating competition round, sandbox version, and evaluation datasets used
  • Integration with operational alerts: if the failure model fails to predict a critical event, it is automatically recorded as an analysis case for the next round
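The quarterly re-evaluation trigger above is a relative-degradation check. The 8% threshold is from the case; the scores are illustrative:

```python
def needs_new_round(baseline_score, current_score, max_degradation=0.08):
    """Trigger a restricted competition round when production
    performance decays 8% or more relative to the sandbox baseline."""
    drift = (baseline_score - current_score) / baseline_score
    return drift >= max_degradation

# Sandbox baseline AUC vs. this quarter's re-evaluation (~9% decay)
trigger = needs_new_round(baseline_score=0.891, current_score=0.810)
```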

Quantified results

The most direct impact of the system was the acceleration of the model selection cycle. Before the system, the process from presenting candidates to making a selection decision took an average of 11.4 weeks (including meeting coordination time, manual evaluation, and internal approval processes). With the system, the cycle was reduced to 2.4 weeks: 34 minutes of automatic evaluation plus the time of 4 structured competition rounds (4 x 3 weeks for strategic projects, or a single direct evaluation for routine replacements). The total acceleration is 4.7x.

The most significant economic impact was the detection, in the second competition (ore grade), that the model in production for the past 22 months had a systematic bias of -4.3% in ore grade prediction for high-concentration blocks — the highest-value segment. Quantifying this bias in production planning error terms equated to $8.2M in uncaptured revenue over the prior 22 months, a figure that by itself justified the complete investment in the system.

  • Model selection cycle acceleration: from 11.4 weeks to 2.4 weeks (4.7x)
  • Subjectivity reduction: 100% of selection decisions based on quantitative metrics comparable across all teams
  • Production model bias detection: -4.3% systematic bias in high-concentration ore grade prediction, equivalent to $8.2M over 22 months
  • Manual evaluation effort reduction: from 120 person-hours per selection round to 8 hours (coordination and results review)
  • Portfolio performance improvement: the average model selected through competition outperforms the average manually selected model by 18% on the business metric
  • Reproducibility: 100% of evaluation results are reproducible with the same container and the same datasets

Lessons learned and future extensions

The most valuable lesson was about business metric design: the first versions of metrics defined in the workshops were too complex to calculate automatically on historical data. For example, the first definition of the business metric for failure prediction included the opportunity cost of lost production, which required the copper spot price at the time of each event — an external data point not available in structured format. Metrics were simplified to use fixed historical maintenance costs, sacrificing some economic precision in exchange for robust automatic evaluability.

The second lesson was about the competition dynamics between internal teams and vendors: vendors tend to optimize for metrics they know in advance. In the first two rounds, the sandbox included all metric details in the public documentation, allowing more resourced vendors to optimize specifically for the benchmark. Starting from the third round, a set of "secret" robustness metrics (revealed only after evaluation) was introduced to detect benchmark overfitting.

  • Business metrics must be automatically calculable on historical data — if they require external data or human judgment, they are not scalable as an automatic benchmark
  • The sandbox API contract must be generic enough to accept any technology stack but specific enough to guarantee reproducibility
  • Benchmarking datasets must deliberately include difficult cases (missing data, distribution shifts, rare events) that represent real production behavior
  • Partially hidden robustness metrics prevent benchmark overfitting and reveal true model generalization
  • Integrating the sandbox with the production lifecycle (shadow mode, quarterly re-evaluation, automatic trigger) is what converts the competition into a continuous governance system, not a one-time event
  • The composite ranking must have configurable weights by use case: the asymmetric cost of errors varies radically between critical failure prediction and energy consumption forecasting

Have a similar challenge?

Let's talk 30 minutes about your use case. No strings attached.

Schedule call