xStryk™

Decision Intelligence for AI in production — guardrails, traceability & evaluation.

PUBLIC SECTOR · VALIDATED

Decision engine for public innovation program evaluation

xSingular · Innovation · 7 min read

How a public innovation institution achieved 3.2x more transparency in awards with multi-criteria scoring and full traceability.

3.2x
Award transparency
200+
Projects evaluated
100%
Explainable decisions

Program Evaluation Pipeline · AI Decision Engine · Public Sector

RECEIVED 847 → EVALUATED 612 (72% pass rate) → SCORED 289 (47% pass rate) → SHORTLISTED 124 (43% pass rate) → APPROVED 67 (54% pass rate)
7.9% overall approval rate · AI-scored · Public innovation programs

Key data summary

  • The new decision engine achieved 3.2x more transparency as measured by the institution's composite auditability index, with 100% of award decisions fully explainable and documented. Mean evaluation time per project fell from 4.2 hours to 1.6 hours.
  • xSingular achieved 3.2x in award transparency through the xStryk ecosystem.
  • xSingular achieved 200+ projects evaluated through the xStryk ecosystem.
  • xSingular achieved 100% explainable decisions through the xStryk ecosystem.

Context and challenge

A government innovation promotion institution evaluated more than 200 technology innovation projects annually to allocate public funding totaling several billion pesos. The evaluation process relied on expert committees applying qualitative criteria without formal weighting or systematic documentation of justifications. Rejected applicants received no detailed score explanations, and external audits had repeatedly flagged the impossibility of verifying the consistency and fairness of the process.

The problem was compounded by a 40% growth in application volume over three years without a proportional increase in evaluation resources. Evaluators spent an average of 4.2 hours per project just reading and synthesizing documentation, leaving little time for systematic comparative analysis. Internal studies also detected significant variations in scores assigned by different evaluators to identical projects (coefficient of variation: 23%), suggesting a concerning level of structural subjectivity in the process.

In the public sector, every resource allocation decision must be explainable, auditable, and reproducible. A decision engine without these three properties is not just inefficient — it is regulatorily unacceptable and erodes institutional legitimacy.

Data landscape & sources

The project began with an inventory of available data on calls and evaluations from the past 5 years. Historical data quality and completeness were heterogeneous: the first two years had partial records in digitized paper, while the last three years had structured data in the call management system. Systematic biases were identified in historical data that required specific treatment before using them as a validation reference.

  • Call management system: structured application forms with 45-80 fields per project depending on program type, including budget, team, timeline, and technical description
  • Evaluation history: scores by criterion, evaluator comments, and final decision for 1,200+ projects over 5 years, with variable completeness by call
  • Post-award tracking records: milestones met, accountability, and results of funded projects from previous calls for institutional track record assessment
  • Evaluator database: profile, experience, institutional affiliation, and prior evaluation history to model and control evaluator bias
  • Public applicant information: government registries, tax authority records, and intellectual property databases to verify and enrich declared data
  • Regulatory documentation: call guidelines, official criteria, definitions, and declared weightings to calibrate the model against the applicable regulatory framework

Methodology & analysis

The analysis phase had two parallel objectives: understanding how the historical process had worked (including its biases) and designing a scoring framework that was simultaneously rigorous, explainable, and accepted by evaluators. For the first objective, bias analysis was applied using fairness-aware ML techniques; for the second, an iterative co-design process with the institution's evaluation teams took 8 weeks and 14 structured working sessions.

  • Bias detection in historical awards using disparate impact analysis by applicant type (company size, region, project director gender) with Equalized Odds statistic and approval rate difference by demographic group
  • Multi-criteria scoring framework design with 6 main dimensions (technical feasibility, economic impact, team, innovation, sustainability, track record) and 24 sub-criteria with configurable weightings per program
  • Text feature extraction from proposals with NLP (TF-IDF + language embeddings using BETO model trained in Spanish) to quantify qualitative dimensions such as technical originality and impact clarity
  • Scoring model calibration against independent expert panel judgments in 3 iterative rounds, measuring agreement with Spearman correlation metric (target: r > 0.82)
  • Blind re-evaluation of 200+ historical projects with the new engine to validate consistency and detect cases where the model and historical outcome diverged significantly (threshold: normalized score difference > 15%)
  • Weighting sensitivity analysis: simulation of 500 weight combinations to identify robustness ranges where project rankings are stable under criterion variations
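The weighting sensitivity analysis above can be sketched in a few lines of Python. Everything here is illustrative — the project scores, the perturbation scale, and the function names are assumptions — but the mechanics match the description: jitter the criterion weights, re-rank the projects, and measure how stable the ranking stays (Kendall's tau near 1 means robust).

```python
import random
from itertools import combinations

def rank(scores):
    """Return each project's rank (0 = best) given its total score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def kendall_tau(r1, r2):
    """Kendall rank correlation between two rankings (no tie handling)."""
    n = len(r1)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def perturb_weights(base, scale, rng):
    """Jitter base weights, clamp to the 0.05 floor, then renormalize."""
    w = [max(0.05, b + rng.uniform(-scale, scale)) for b in base]
    total = sum(w)
    return [x / total for x in w]

def sensitivity(subscores, base_weights, n_sims=500, scale=0.05, seed=42):
    """Average rank agreement between baseline and simulated weightings."""
    rng = random.Random(seed)
    base_rank = rank([sum(w * s for w, s in zip(base_weights, row))
                      for row in subscores])
    taus = []
    for _ in range(n_sims):
        w = perturb_weights(base_weights, scale, rng)
        r = rank([sum(wi * s for wi, s in zip(w, row)) for row in subscores])
        taus.append(kendall_tau(base_rank, r))
    return sum(taus) / len(taus)
```

An average tau close to 1 over the 500 simulated weight vectors identifies the robustness range where project rankings are stable under criterion variations.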

Model architecture / technical design

The decision engine combines a weighted multi-criteria scoring model with SHAP values for local per-decision explainability, an anomaly detection module for consistency alerts, and an automatic award report generator — all deployed on certified government cloud infrastructure with encryption in transit and at rest and full traceability of every interaction.

  • Scoring engine: weighted additive scoring model with 6 dimensions and 24 sub-criteria; configurable weights per call with consistency constraints (sum = 1, weights >= 0.05 per main dimension); Python implementation with per-call schema validation
  • NLP text module: spaCy + BETO pipeline (BERT in Spanish, 110M parameters) for proposal embeddings; qualitative sub-criterion classifiers with F1-score > 0.77 in 5-fold cross-validation on expert evaluator annotations
  • Explainability: SHAP TreeExplainer for tree models and KernelExplainer for NLP components; local per-decision explanation with each sub-criterion's contribution expressed in points out of 100; waterfall chart visualization per report
  • Anomaly detection: Isolation Forest on historical score distribution per call to detect atypically scored projects requiring additional human review; manual review referral rate calibrated to 8%
  • Report generator: parametric LaTeX template with scoring data, comparison charts against program distribution, and narrative justification per criterion; automatic generation in < 90 seconds per project
  • Infrastructure: certified government cloud deployment; AES-256 encryption at rest, TLS 1.3 in transit; immutable audit logs with digital signature per evaluator action
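The weighted additive core of the scoring engine can be sketched as follows. The six dimension names and the consistency constraints (weights sum to 1, each main dimension at least 0.05) come from the architecture above; the per-dimension averaging of sub-criteria is a simplification, since the real engine weights the 24 sub-criteria individually.

```python
import math

DIMENSIONS = ["technical_feasibility", "economic_impact", "team",
              "innovation", "sustainability", "track_record"]

def validate_weights(weights):
    """Enforce the per-call constraints: cover all six dimensions,
    sum to 1, and keep every main dimension at weight >= 0.05."""
    if set(weights) != set(DIMENSIONS):
        raise ValueError("weights must cover exactly the six main dimensions")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("dimension weights must sum to 1")
    if any(w < 0.05 for w in weights.values()):
        raise ValueError("each main dimension needs weight >= 0.05")

def score_project(sub_scores, weights):
    """Weighted additive score on a 0-100 scale.

    sub_scores maps each dimension to its sub-criterion scores in
    [0, 100]; they are averaged per dimension before weighting.
    Returns the total plus a per-dimension breakdown, the same
    points-out-of-100 contributions the explainability reports show.
    """
    validate_weights(weights)
    total = 0.0
    breakdown = {}
    for dim in DIMENSIONS:
        dim_score = sum(sub_scores[dim]) / len(sub_scores[dim])
        contribution = weights[dim] * dim_score
        breakdown[dim] = round(contribution, 2)
        total += contribution
    return round(total, 2), breakdown
```

Returning the breakdown alongside the total mirrors the design choice described under Explainability: every decision decomposes into per-criterion contributions by construction.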

Implementation details

The main technical challenge was integrating the NLP component into the evaluation process in a way that human evaluators could understand and trust the qualitative sub-criterion scores. Language models generate scores that are not intuitively interpretable, so a calibration system was implemented mapping model outputs to semantic scales familiar to evaluators (1-7, with behaviorally anchored rating scales defined during co-design). This calibration process required three rounds of annotation by expert evaluators, totaling 1,800 labeled examples across the 24 sub-criteria.
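One simple way to build the kind of calibration described above is to fit cutoffs on the expert-annotated examples and map raw model outputs onto the anchored 1-7 scale. This is a deliberately minimal stand-in — the production calibration used behaviorally anchored rating scales co-designed with evaluators, and the midpoint rule below is an assumption for illustration.

```python
def fit_thresholds(raw_scores, labels):
    """Fit cutoffs mapping raw model outputs to anchored scale levels.

    Cutoff between two adjacent levels = midpoint of their mean raw
    scores, estimated from expert-labeled examples (here hypothetical;
    the real system calibrated on 1,800 annotations).
    """
    by_label = {}
    for score, label in zip(raw_scores, labels):
        by_label.setdefault(label, []).append(score)
    means = {l: sum(v) / len(v) for l, v in by_label.items()}
    levels = sorted(means)
    cutoffs = [(means[a] + means[b]) / 2 for a, b in zip(levels, levels[1:])]
    return levels, cutoffs

def to_scale(raw, levels, cutoffs):
    """Map a raw model output to its calibrated scale level."""
    for level, cut in zip(levels, cutoffs):
        if raw < cut:
            return level
    return levels[-1]
```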

Government cloud deployment imposed additional security constraints that affected the architecture. The BETO model could not be sent to an external API for inference — all inference had to run within the government platform perimeter. An inference server was implemented with ONNX Runtime on government platform instances, achieving inference latency of 380ms per full proposal (average 8,000 tokens). Full traceability required that every individual score, every applied weight, and every evaluator interaction be recorded in immutable logs with timestamp and digital signature, generating approximately 2.4 GB of auditable logs per call.
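The immutability requirement above can be illustrated with a hash-chained log: each entry embeds the previous entry's digest, so altering any record invalidates everything after it. This is a stdlib sketch, not the production system — the HMAC key stands in for the per-evaluator digital signatures used on the government platform.

```python
import hashlib
import hmac
import json
import time

class AuditLog:
    """Append-only, hash-chained audit trail (illustrative sketch)."""

    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self._entries = []
        self._prev_digest = "0" * 64  # genesis value

    def append(self, evaluator: str, action: str, payload: dict) -> str:
        """Record an evaluator action, chained to the previous entry."""
        record = {
            "ts": time.time(),
            "evaluator": evaluator,
            "action": action,
            "payload": payload,
            "prev": self._prev_digest,
        }
        body = json.dumps(record, sort_keys=True).encode()
        digest = hmac.new(self._key, body, hashlib.sha256).hexdigest()
        self._entries.append((record, digest))
        self._prev_digest = digest
        return digest

    def verify(self) -> bool:
        """Recompute the whole chain; any tampered entry breaks it."""
        prev = "0" * 64
        for record, digest in self._entries:
            if record["prev"] != prev:
                return False
            body = json.dumps(record, sort_keys=True).encode()
            expected = hmac.new(self._key, body, hashlib.sha256).hexdigest()
            if not hmac.compare_digest(expected, digest):
                return False
            prev = digest
        return True
```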

Validation & testing

The validation process was particularly rigorous given the regulatory context. A blind re-evaluation of 247 projects from previous calls was conducted: the scoring engine was applied without knowledge of the historical outcome, and results were compared against actual awards. Overall agreement measured with Spearman correlation was r=0.84 (target: r > 0.82). In the 38 cases with the greatest divergence (normalized ranking difference > 20%), a case-by-case analysis was conducted with the evaluation team, which identified that 22 of those cases corresponded to situations where the historical process had exhibited the inconsistencies already documented in external audits — meaning the model was more consistent than the historical human process in those cases.

A formal fairness test was also conducted applying fairness metrics: disparate impact analysis showed the new system reduced the approval rate gap between regions from 8.3 percentage points to 2.1 percentage points, and eliminated the statistically significant correlation between project director gender and approval probability detected in the historical record (p=0.031 historically, p=0.412 with the new system). An independent external audit validated the methodology and results before production deployment.
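The two fairness quantities reported above — the approval-rate gap in percentage points and the disparate impact ratio — reduce to per-group approval rates. A minimal sketch, on hypothetical decision data:

```python
def approval_rates(decisions):
    """decisions: iterable of (group, approved) -> group approval rates."""
    counts, approvals = {}, {}
    for group, approved in decisions:
        counts[group] = counts.get(group, 0) + 1
        approvals[group] = approvals.get(group, 0) + int(approved)
    return {g: approvals[g] / counts[g] for g in counts}

def approval_gap_pp(decisions):
    """Max minus min approval rate across groups, in percentage points."""
    rates = approval_rates(decisions)
    return (max(rates.values()) - min(rates.values())) * 100

def disparate_impact_ratio(decisions):
    """Min rate / max rate; the common four-fifths rule flags values < 0.8."""
    rates = approval_rates(decisions)
    return min(rates.values()) / max(rates.values())
```

On data like the historical record described here, a gap of 8.3 percentage points between regions would show up directly in `approval_gap_pp`; the gender correlation test additionally needs a significance test (e.g. a chi-squared or permutation test) on top of these rates.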

Results & business impact

The new decision engine achieved 3.2x more transparency as measured by the institution's composite auditability index, with 100% of award decisions fully explainable and documented. Mean evaluation time per project was reduced from 4.2 hours to 1.6 hours, freeing evaluation capacity equivalent to 2.4 full-time evaluators per call.

  • 3.2x improvement in institutional transparency index as measured by independent external audit post-implementation
  • 100% of award decisions explainable with automatic SHAP report generated in < 90 seconds per project
  • -62% evaluation time per project (from 4.2h to 1.6h average), equivalent to 2.4 FTE freed per call
  • Regional disparate impact reduced from 8.3 to 2.1 percentage points; statistically significant gender bias eliminated
  • 200+ projects re-evaluated with Spearman consistency r=0.84 vs historical record, identifying 22 cases where the engine was more consistent than the historical manual process
  • Zero award appeals citing transparency concerns in the first two calls with the system, vs 4-6 average appeals in previous calls

Key lessons learned

  • SHAP explainability is not just a technical tool — it is a trust interface: evaluators adopt the system when they can read and discuss the explanation of each decision in their own conceptual language
  • Detecting and documenting biases in historical data before using it to validate the new model is critical — validating a "fair" model against biased historical data produces artificially optimistic concordance results that do not reflect real improvement
  • Co-designing the criteria framework with evaluation teams (8 weeks, 14 sessions) was the highest-ROI investment in the project: it generated organizational ownership that made formal adoption training unnecessary
  • Language models in a government context must run entirely on-premise or certified infrastructure — designing the inference architecture to meet this requirement from the start is far more efficient than adapting an API-first solution later
  • Fairness analysis is not an optional post-modeling step — it must be integrated as an optimization objective from the scoring framework design, with equity metrics defined before training any component
  • Automatic award reports reduce bureaucratic friction and improve the applicant experience, but their real value is that they force the system to be explainable by construction — if the engine cannot generate a coherent report, the model has a design problem

Have a similar challenge?

Let's talk for 30 minutes about your use case. No strings attached.

Schedule call