xStryk™

Decision Intelligence for AI in production — guardrails, traceability & evaluation.

PUBLIC SECTOR · VALIDATED

Decision engine for public innovation program evaluation

xSingular · Innovation · 7 min read

How a public innovation institution achieved 3.2x more transparency in awards with multi-criteria scoring and full traceability.

3.2x
Award transparency
200+
Projects evaluated
100%
Explainable decisions

Program Evaluation Pipeline · AI Decision Engine · Public Sector

RECEIVED 847 → EVALUATED 612 (72% pass rate) → SCORED 289 (47% pass rate) → SHORTLISTED 124 (43% pass rate) → APPROVED 67 (54% pass rate)
7.9% overall approval rate · AI-scored · Public innovation programs

Key data summary

  • The new decision engine achieved 3.2x more transparency as measured by the institution's composite auditability index, with 100% of award decisions fully explainable and documented. Mean evaluation time per project fell from 4.2 hours to 1.6 hours.
  • xSingular achieved 3.2x in award transparency through the xStryk ecosystem.
  • xSingular achieved 200+ projects evaluated through the xStryk ecosystem.
  • xSingular achieved 100% explainable decisions through the xStryk ecosystem.

Context and challenge

A government innovation promotion institution evaluated more than 200 technology innovation projects annually to allocate public funding totaling several billion pesos. The evaluation process relied on expert committees applying qualitative criteria without formal weighting or systematic documentation of justifications. Rejected applicants received no detailed score explanations, and external audits had repeatedly flagged the impossibility of verifying the consistency and fairness of the process.

The problem was compounded by a 40% growth in application volume over three years without a proportional increase in evaluation resources. Evaluators spent an average of 4.2 hours per project just reading and synthesizing documentation, leaving little time for systematic comparative analysis. Internal studies also detected significant variations in scores assigned by different evaluators to identical projects (coefficient of variation: 23%), suggesting a concerning level of structural subjectivity in the process.

In the public sector, every resource allocation decision must be explainable, auditable, and reproducible. A decision engine without these three properties is not just inefficient — it is regulatorily unacceptable and erodes institutional legitimacy.

Data landscape & sources

The project began with an inventory of available data on calls and evaluations from the past 5 years. Historical data quality and completeness were heterogeneous: the first two years had partial records in digitized paper, while the last three years had structured data in the call management system. Systematic biases were identified in historical data that required specific treatment before using them as a validation reference.

  • Call management system: structured application forms with 45-80 fields per project depending on program type, including budget, team, timeline, and technical description
  • Evaluation history: scores by criterion, evaluator comments, and final decision for 1,200+ projects over 5 years, with variable completeness by call
  • Post-award tracking records: milestones met, accountability, and results of funded projects from previous calls for institutional track record assessment
  • Evaluator database: profile, experience, institutional affiliation, and prior evaluation history to model and control evaluator bias
  • Public applicant information: government registries, tax authority records, and intellectual property databases to verify and enrich declared data
  • Regulatory documentation: call guidelines, official criteria, definitions, and declared weightings to calibrate the model against the applicable regulatory framework

Methodology & analysis

The analysis phase had two parallel objectives: understanding how the historical process had worked (including its biases) and designing a scoring framework that was simultaneously rigorous, explainable, and accepted by evaluators. For the first objective, bias analysis was applied using fairness-aware ML techniques; for the second, an iterative co-design process with the institution's evaluation teams took 8 weeks and 14 structured working sessions.

  • Bias detection in historical awards using disparate impact analysis by applicant type (company size, region, project director gender) with Equalized Odds statistic and approval rate difference by demographic group
  • Multi-criteria scoring framework design with 6 main dimensions (technical feasibility, economic impact, team, innovation, sustainability, track record) and 24 sub-criteria with configurable weightings per program
  • Text feature extraction from proposals with NLP (TF-IDF + language embeddings using BETO model trained in Spanish) to quantify qualitative dimensions such as technical originality and impact clarity
  • Scoring model calibration against independent expert panel judgments in 3 iterative rounds, measuring agreement with Spearman correlation metric (target: r > 0.82)
  • Blind re-evaluation of 200+ historical projects with the new engine to validate consistency and detect cases where the model and historical outcome diverged significantly (threshold: normalized score difference > 15%)
  • Weighting sensitivity analysis: simulation of 500 weight combinations to identify robustness ranges where project rankings are stable under criterion variations
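The weighting sensitivity analysis above can be sketched in a few lines of Python. Everything here is illustrative — the project scores, the perturbation scale, and the function names are assumptions — but the mechanics match the description: jitter the criterion weights, re-rank the projects, and measure how stable the ranking stays (Kendall's tau near 1 means robust).

```python
import random
from itertools import combinations

def rank(scores):
    """Return each project's rank (0 = best) given its total score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def kendall_tau(r1, r2):
    """Kendall rank correlation between two rankings (no tie handling)."""
    n = len(r1)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def perturb_weights(base, scale, rng):
    """Jitter base weights, clamp to the 0.05 floor, then renormalize."""
    w = [max(0.05, b + rng.uniform(-scale, scale)) for b in base]
    total = sum(w)
    return [x / total for x in w]

def sensitivity(subscores, base_weights, n_sims=500, scale=0.05, seed=42):
    """Average rank agreement between baseline and simulated weightings."""
    rng = random.Random(seed)
    base_rank = rank([sum(w * s for w, s in zip(base_weights, row))
                      for row in subscores])
    taus = []
    for _ in range(n_sims):
        w = perturb_weights(base_weights, scale, rng)
        r = rank([sum(wi * s for wi, s in zip(w, row)) for row in subscores])
        taus.append(kendall_tau(base_rank, r))
    return sum(taus) / len(taus)
```

An average tau close to 1 over the 500 simulated weight vectors identifies the robustness range where project rankings are stable under criterion variations.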

Model architecture / technical design

The decision engine combines a weighted multi-criteria scoring model with SHAP values for local per-decision explainability, an anomaly detection module for consistency alerts, and an automatic award report generator — all deployed on certified government cloud infrastructure with encryption in transit and at rest and full traceability of every interaction.

  • Scoring engine: weighted additive scoring model with 6 dimensions and 24 sub-criteria; configurable weights per call with consistency constraints (sum = 1, weights >= 0.05 per main dimension); Python implementation with per-call schema validation
  • NLP text module: spaCy + BETO pipeline (BERT in Spanish, 110M parameters) for proposal embeddings; qualitative sub-criterion classifiers with F1-score > 0.77 in 5-fold cross-validation on expert evaluator annotations
  • Explainability: SHAP TreeExplainer for tree models and KernelExplainer for NLP components; local per-decision explanation with each sub-criterion's contribution expressed in points out of 100; waterfall chart visualization per report
  • Anomaly detection: Isolation Forest on historical score distribution per call to detect atypically scored projects requiring additional human review; manual review referral rate calibrated to 8%
  • Report generator: parametric LaTeX template with scoring data, comparison charts against program distribution, and narrative justification per criterion; automatic generation in < 90 seconds per project
  • Infrastructure: certified government cloud deployment; AES-256 encryption at rest, TLS 1.3 in transit; immutable audit logs with digital signature per evaluator action
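The weighted additive core of the scoring engine can be sketched as follows. The six dimension names and the consistency constraints (weights sum to 1, each main dimension at least 0.05) come from the architecture above; the per-dimension averaging of sub-criteria is a simplification, since the real engine weights the 24 sub-criteria individually.

```python
import math

DIMENSIONS = ["technical_feasibility", "economic_impact", "team",
              "innovation", "sustainability", "track_record"]

def validate_weights(weights):
    """Enforce the per-call constraints: cover all six dimensions,
    sum to 1, and keep every main dimension at weight >= 0.05."""
    if set(weights) != set(DIMENSIONS):
        raise ValueError("weights must cover exactly the six main dimensions")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("dimension weights must sum to 1")
    if any(w < 0.05 for w in weights.values()):
        raise ValueError("each main dimension needs weight >= 0.05")

def score_project(sub_scores, weights):
    """Weighted additive score on a 0-100 scale.

    sub_scores maps each dimension to its sub-criterion scores in
    [0, 100]; they are averaged per dimension before weighting.
    Returns the total plus a per-dimension breakdown, the same
    points-out-of-100 contributions the explainability reports show.
    """
    validate_weights(weights)
    total = 0.0
    breakdown = {}
    for dim in DIMENSIONS:
        dim_score = sum(sub_scores[dim]) / len(sub_scores[dim])
        contribution = weights[dim] * dim_score
        breakdown[dim] = round(contribution, 2)
        total += contribution
    return round(total, 2), breakdown
```

Returning the breakdown alongside the total mirrors the design choice described under Explainability: every decision decomposes into per-criterion contributions by construction.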

Implementation details

The main technical challenge was integrating the NLP component into the evaluation process in a way that human evaluators could understand and trust the qualitative sub-criterion scores. Language models generate scores that are not intuitively interpretable, so a calibration system was implemented mapping model outputs to semantic scales familiar to evaluators (1-7, with behaviorally anchored rating scales defined during co-design). This calibration process required three rounds of annotation by expert evaluators, totaling 1,800 labeled examples across the 24 sub-criteria.
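One simple way to build the kind of calibration described above is to fit cutoffs on the expert-annotated examples and map raw model outputs onto the anchored 1-7 scale. This is a deliberately minimal stand-in — the production calibration used behaviorally anchored rating scales co-designed with evaluators, and the midpoint rule below is an assumption for illustration.

```python
def fit_thresholds(raw_scores, labels):
    """Fit cutoffs mapping raw model outputs to anchored scale levels.

    Cutoff between two adjacent levels = midpoint of their mean raw
    scores, estimated from expert-labeled examples (here hypothetical;
    the real system calibrated on 1,800 annotations).
    """
    by_label = {}
    for score, label in zip(raw_scores, labels):
        by_label.setdefault(label, []).append(score)
    means = {l: sum(v) / len(v) for l, v in by_label.items()}
    levels = sorted(means)
    cutoffs = [(means[a] + means[b]) / 2 for a, b in zip(levels, levels[1:])]
    return levels, cutoffs

def to_scale(raw, levels, cutoffs):
    """Map a raw model output to its calibrated scale level."""
    for level, cut in zip(levels, cutoffs):
        if raw < cut:
            return level
    return levels[-1]
```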

Government cloud deployment imposed additional security constraints that affected the architecture. The BETO model could not be sent to an external API for inference — all inference had to run within the government platform perimeter. An inference server was implemented with ONNX Runtime on government platform instances, achieving inference latency of 380ms per full proposal (average 8,000 tokens). Full traceability required that every individual score, every applied weight, and every evaluator interaction be recorded in immutable logs with timestamp and digital signature, generating approximately 2.4 GB of auditable logs per call.
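The immutability requirement above can be illustrated with a hash-chained log: each entry embeds the previous entry's digest, so altering any record invalidates everything after it. This is a stdlib sketch, not the production system — the HMAC key stands in for the per-evaluator digital signatures used on the government platform.

```python
import hashlib
import hmac
import json
import time

class AuditLog:
    """Append-only, hash-chained audit trail (illustrative sketch)."""

    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self._entries = []
        self._prev_digest = "0" * 64  # genesis value

    def append(self, evaluator: str, action: str, payload: dict) -> str:
        """Record an evaluator action, chained to the previous entry."""
        record = {
            "ts": time.time(),
            "evaluator": evaluator,
            "action": action,
            "payload": payload,
            "prev": self._prev_digest,
        }
        body = json.dumps(record, sort_keys=True).encode()
        digest = hmac.new(self._key, body, hashlib.sha256).hexdigest()
        self._entries.append((record, digest))
        self._prev_digest = digest
        return digest

    def verify(self) -> bool:
        """Recompute the whole chain; any tampered entry breaks it."""
        prev = "0" * 64
        for record, digest in self._entries:
            if record["prev"] != prev:
                return False
            body = json.dumps(record, sort_keys=True).encode()
            expected = hmac.new(self._key, body, hashlib.sha256).hexdigest()
            if not hmac.compare_digest(expected, digest):
                return False
            prev = digest
        return True
```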

Validation & testing

The validation process was particularly rigorous given the regulatory context. A blind re-evaluation of 247 projects from previous calls was conducted: the scoring engine was applied without knowledge of the historical outcome, and results were compared against actual awards. Overall agreement measured with Spearman correlation was r=0.84 (target: r > 0.82). In the 38 cases with the greatest divergence (normalized ranking difference > 20%), a case-by-case analysis was conducted with the evaluation team, which identified that 22 of those cases corresponded to situations where the historical process had exhibited the inconsistencies already documented in external audits — meaning the model was more consistent than the historical human process in those cases.

A formal fairness test was also conducted applying fairness metrics: disparate impact analysis showed the new system reduced the approval rate gap between regions from 8.3 percentage points to 2.1 percentage points, and eliminated the statistically significant correlation between project director gender and approval probability detected in the historical record (p=0.031 historically, p=0.412 with the new system). An independent external audit validated the methodology and results before production deployment.
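The two fairness quantities reported above — the approval-rate gap in percentage points and the disparate impact ratio — reduce to per-group approval rates. A minimal sketch, on hypothetical decision data:

```python
def approval_rates(decisions):
    """decisions: iterable of (group, approved) -> group approval rates."""
    counts, approvals = {}, {}
    for group, approved in decisions:
        counts[group] = counts.get(group, 0) + 1
        approvals[group] = approvals.get(group, 0) + int(approved)
    return {g: approvals[g] / counts[g] for g in counts}

def approval_gap_pp(decisions):
    """Max minus min approval rate across groups, in percentage points."""
    rates = approval_rates(decisions)
    return (max(rates.values()) - min(rates.values())) * 100

def disparate_impact_ratio(decisions):
    """Min rate / max rate; the common four-fifths rule flags values < 0.8."""
    rates = approval_rates(decisions)
    return min(rates.values()) / max(rates.values())
```

On data like the historical record described here, a gap of 8.3 percentage points between regions would show up directly in `approval_gap_pp`; the gender correlation test additionally needs a significance test (e.g. a chi-squared or permutation test) on top of these rates.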

Results & business impact

The new decision engine achieved 3.2x more transparency as measured by the institution's composite auditability index, with 100% of award decisions fully explainable and documented. Mean evaluation time per project was reduced from 4.2 hours to 1.6 hours, freeing evaluation capacity equivalent to 2.4 full-time evaluators per call.

  • 3.2x improvement in institutional transparency index as measured by independent external audit post-implementation
  • 100% of award decisions explainable with automatic SHAP report generated in < 90 seconds per project
  • -62% evaluation time per project (from 4.2h to 1.6h average), equivalent to 2.4 FTE freed per call
  • Regional disparate impact reduced from 8.3 to 2.1 percentage points; statistically significant gender bias eliminated
  • 200+ projects re-evaluated with Spearman consistency r=0.84 vs historical record, identifying 22 cases where the engine was more consistent than the historical manual process
  • Zero award appeals citing transparency concerns in the first two calls with the system, vs 4-6 average appeals in previous calls

Key lessons learned

  • SHAP explainability is not just a technical tool — it is a trust interface: evaluators adopt the system when they can read and discuss the explanation of each decision in their own conceptual language
  • Detecting and documenting biases in historical data before using it to validate the new model is critical — validating a "fair" model against biased historical data produces artificially optimistic concordance results that do not reflect real improvement
  • Co-designing the criteria framework with evaluation teams (8 weeks, 14 sessions) was the highest-ROI investment in the project: it generated organizational ownership that made formal adoption training unnecessary
  • Language models in a government context must run entirely on-premise or certified infrastructure — designing the inference architecture to meet this requirement from the start is far more efficient than adapting an API-first solution later
  • Fairness analysis is not an optional post-modeling step — it must be integrated as an optimization objective from the scoring framework design, with equity metrics defined before training any component
  • Automatic award reports reduce bureaucratic friction and improve the applicant experience, but their real value is that they force the system to be explainable by construction — if the engine cannot generate a coherent report, the model has a design problem

Have a similar challenge?

Let's talk for 30 minutes about your use case. No strings attached.

Schedule call