xStryk™

Decision Intelligence for AI in production — guardrails, traceability & evaluation.

BANKING · IN PRODUCTION

Unified intelligent agent platform for banking

xSingular · Top 3 · 9 min read

How a Top 3 bank reduced credit decision time by 58% with risk scoring, next-best-action, and anomaly detection agents.

  • -58% · Credit decision time
  • 100% · Regulatory traceability
  • 3 · Integrated domains

Unified AI Platform · Banking · 3-Layer Architecture · In Production

[Architecture diagram] Data layer (core banking, CRM, credit bureau, transactions, market data) → Intelligence layer (credit model, fraud detection, risk engine, compliance AI) → Decision layer (approval flow, limit calc, rate engine, monitoring). In production · Top-3 bank · Real-time decisions · xStryk platform.

Key data summary

  • xSingular achieved -58% in credit decision time through the xStryk ecosystem.
  • xSingular achieved 100% regulatory traceability through the xStryk ecosystem.
  • xSingular achieved 3 integrated domains through the xStryk ecosystem.

Context and regulatory pressure

A Top 3 bank in the financial system operated its credit risk models, dynamic pricing, and customer intelligence in completely independent silos. Scoring models had been developed by three different vendors at different points in time, and their outputs were consumed manually by analysts who integrated them in spreadsheets to produce credit decisions. The full cycle — from application to decision — averaged 4.2 days for SME credit and 11 days for corporate credit.

Each domain had its own model update cycle: the risk team recalibrated scoring every 6 months, the pricing team adjusted rates quarterly, and the customer intelligence team updated its segmentations annually. This asynchrony created regulatory inconsistencies: a customer could receive a product offer whose spread was based on a risk score recalibrated 5 months after the pricing had been set, so the price no longer reflected the current risk assessment.

The catalyzing event for the project was an observation from the financial regulator during a model review: the bank could not reconstruct the reasoning chain of 43% of credit decisions from the previous quarter because individual model logs were not synchronized with the analyst's decision logs. The regulator set an 8-month deadline to demonstrate full traceability or face operational restrictions.

In regulated banking, traceability between a credit decision, the underlying model version, and a snapshot of input data at the moment of decision is not a technical improvement — it is a regulatory requirement whose non-compliance can lead to operational restrictions, fines, or limitations on portfolio growth.

Existing model audit and gap mapping

The first phase of the project was a comprehensive technical audit of existing models. Three active credit scoring models were found in production, two of which had been developed externally and delivered as binary artifacts without source code or feature documentation. Reconstructing their logic required input-output analysis with 24 months of historical decision data.

The dependency mapping revealed that the pricing engine consumed scores from risk model A via a nightly batch job, but the customer intelligence next-best-action engine used risk model B — a different model with an 11% difference in predictions for customers with scores between 550 and 650. This inconsistency meant the bank could simultaneously offer a customer a conservative product (based on risk A) and trigger an additional credit offer (based on risk B) with a different approval threshold.

  • Inventory of 7 active production models: 3 risk scoring, 2 pricing, 1 product propensity, 1 fraud detection
  • Identification of 5 undocumented dependencies between models across different domains
  • Quantification of inconsistency between parallel risk models: 11% difference in scores for the 550-650 segment
  • Mapping of 23 traceability gaps: decisions without model version, input data, or precise timestamp records
  • Classification of models by regulatory impact: 4 high-risk (credit decisions), 3 medium-risk (pricing and propensity)
  • Demographic fairness analysis on scoring outputs: disparate impact detected in 2 protected segments
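The disparate-impact screening mentioned above can be illustrated with the four-fifths rule, a common heuristic that flags a protected group when its approval rate falls below 80% of the reference group's. The group names and rates below are illustrative, not the bank's actual data:

```python
# Hypothetical sketch of a four-fifths-rule disparate impact check.
# Decision samples and group compositions are illustrative only.

def approval_rate(decisions):
    """Fraction of approved decisions in a list of booleans."""
    return sum(decisions) / len(decisions)

def disparate_impact_ratio(protected, reference):
    """Ratio of protected-group approval rate to reference-group rate.
    Values below 0.8 (the four-fifths rule) flag potential disparate impact."""
    return approval_rate(protected) / approval_rate(reference)

# Illustrative decision samples per segment
reference_group = [True] * 80 + [False] * 20   # 80% approval
protected_group = [True] * 55 + [False] * 45   # 55% approval

ratio = disparate_impact_ratio(protected_group, reference_group)
flagged = ratio < 0.8  # 0.6875 < 0.8 -> flagged
```

In practice this check would run per protected attribute at every retraining cycle, alongside the equal-opportunity and calibration metrics listed later in the compliance section.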

Unified platform architecture

The solution was designed as an agent platform with three layers: a shared ingestion and feature store layer, a specialized agent layer by domain with a standardized communication protocol, and a traceability layer that records each decision with its complete context before sending it to the source system.

The most critical architectural decision was the design of the shared feature store. Rather than having each agent build its own features from raw data — which generates inconsistencies when two agents use slightly different definitions of the same variable — a centralized feature store was implemented with versioned semantic definitions. The risk agent, pricing agent, and next-best-action agent consume exactly the same features calculated with exactly the same logic, guaranteeing consistency between decisions from different domains for the same customer at the same time.

The shared feature store resolves the root inconsistency: when two agents calculate the same feature (for example, "debt-to-income ratio") with slightly different logic, their decisions are neither comparable nor auditable. Semantic feature consistency is the foundation of regulatory traceability.
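The idea of versioned semantic definitions can be sketched as a small registry: each feature has exactly one computation per version, and every agent requests its snapshot from the same registry. This is a minimal illustration with hypothetical names, not the platform's actual API:

```python
# Minimal sketch of a versioned feature registry: one semantic
# definition per (name, version), shared by all agents.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class FeatureDef:
    name: str
    version: str
    compute: Callable[[dict], float]  # raw customer record -> feature value

class FeatureStore:
    def __init__(self):
        self._registry = {}

    def register(self, feature: FeatureDef):
        self._registry[(feature.name, feature.version)] = feature

    def snapshot(self, record: dict, features: list) -> dict:
        """Compute the versioned feature snapshot every agent consumes."""
        return {
            f"{name}@{version}": self._registry[(name, version)].compute(record)
            for name, version in features
        }

store = FeatureStore()
store.register(FeatureDef(
    name="debt_to_income",
    version="v2",
    compute=lambda r: r["total_debt"] / r["annual_income"],
))

snap = store.snapshot(
    {"total_debt": 30_000, "annual_income": 120_000},
    [("debt_to_income", "v2")],
)
# snap == {"debt_to_income@v2": 0.25}
```

Because the risk, pricing, and next-best-action agents all call `snapshot` with the same versioned keys, the "debt-to-income ratio" example above can only ever be computed one way per version.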

  • Centralized feature store with versioned semantic definitions and consistent calculation across all agents
  • Risk Scoring Agent: calibrated XGBoost ensemble + logistic model for explainability, with SHAP values per decision
  • Dynamic Pricing Agent: margin optimization model with regulatory constraints on maximum spread and real-time cost of funds
  • Next-Best-Action Agent: multi-class propensity model with demographic fairness constraints and contactability policies
  • Anomaly Detection Agent: unsupervised model on transactional behavior patterns with real-time alerts to the risk committee
  • Traceability layer: immutable snapshot of inputs, model version, outputs, and final decision with precise timestamp before each communication to core banking
  • Human-in-the-loop engine: automatic routing of decisions to human analyst when agent confidence falls below threshold configured per segment

Models, algorithms, and technical decisions

The Risk Scoring agent implements an ensemble of two models with complementary roles. The primary model is XGBoost with 180 features, calibrated to produce accurate default probabilities (Brier score < 0.08 on the validation dataset). The secondary model is a logistic regression with the 25 most important features identified by SHAP, trained to produce the same decision with greater explainability. For each application, the primary model score is generated and the secondary model is used to produce the explanation in terms of the most influential factors, expressed in the business language used by the credit analyst.
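The two-model pattern can be sketched as follows: a primary scorer produces the probability, and a simpler logistic model ranks the most influential factors for the analyst. The coefficients, feature names, and stand-in score below are illustrative, not the bank's models:

```python
# Sketch of the primary/secondary pattern: score from the primary
# model, explanation from a transparent logistic model. All values
# here are hypothetical.
import math

def primary_score(features: dict) -> float:
    """Stand-in for the calibrated XGBoost ensemble."""
    # In production this would call the gradient-boosted model.
    return 0.12  # illustrative default probability

LOGISTIC_COEFS = {"debt_to_income": 2.1, "months_delinquent": 0.9, "tenure_years": -0.3}
INTERCEPT = -2.5

def explain(features: dict, top_k: int = 2):
    """Rank features by |coefficient * value| to surface the factors
    the analyst sees, alongside the logistic model's own probability."""
    contribs = {f: LOGISTIC_COEFS[f] * v for f, v in features.items()}
    prob = 1 / (1 + math.exp(-(INTERCEPT + sum(contribs.values()))))
    ranked = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return prob, ranked[:top_k]

features = {"debt_to_income": 0.4, "months_delinquent": 2, "tenure_years": 5}
score = primary_score(features)   # decision score
prob, top = explain(features)     # analyst-facing explanation
```

The decision is taken on `score`; `top` is what gets translated into the business language shown to the credit analyst.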

The Anomaly Detection agent uses a Variational Autoencoder (VAE) trained on 18 months of normal transactional behavior patterns. Per-customer reconstruction error is monitored in real time and compared against a reference distribution by customer segment. When the error exceeds the 99.5th percentile of the segment's reference distribution, an alert is generated that includes the list of transactions that most contributed to the anomaly score, calculated via input perturbation.
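Leaving the VAE itself aside, the alerting rule reduces to a percentile comparison against the segment's reference distribution. A minimal sketch, with an illustrative error distribution standing in for real reconstruction errors:

```python
# Sketch of the alerting rule only: compare a customer's reconstruction
# error against the 99.5th percentile of their segment's reference
# distribution. The VAE and the data are assumed to exist upstream.
import math

def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100])."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[idx]

def should_alert(error, segment_reference, q=99.5):
    """True when the error exceeds the segment's q-th percentile."""
    return error > percentile(segment_reference, q)

# Illustrative per-segment reference distribution of reconstruction errors
reference = [i / 1000 for i in range(1000)]

should_alert(1.2, reference)   # above the 99.5th percentile -> alert
should_alert(0.5, reference)   # within normal range -> no alert
```

The production version would additionally attach the per-transaction attribution (via input perturbation) before routing the alert to the risk committee.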

A relevant technical decision was the design of the human-in-the-loop mechanism. Rather than a fixed confidence threshold, a dynamic calibration system was implemented: the human escalation threshold is adjusted weekly based on the observed error rate in automatic decisions from the prior period. If the agent had an above-target error rate in the SME segment, the confidence threshold required for automatic approval is increased, raising the proportion of cases going to human review until the error rate returns to the target range.
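The weekly calibration loop can be sketched as a bounded adjustment rule: raise the confidence threshold when the observed automatic-decision error rate exceeds target, lower it when below. The step size and bounds below are illustrative assumptions:

```python
# Sketch of the dynamic human-in-the-loop threshold. Step size and
# clamping bounds are hypothetical; the bank tunes these per segment.

def recalibrate(threshold, observed_error, target_error,
                step=0.02, lo=0.50, hi=0.99):
    """Weekly adjustment of the confidence threshold for automatic
    approval, clamped to [lo, hi]."""
    if observed_error > target_error:
        threshold += step   # send more cases to human review
    elif observed_error < target_error:
        threshold -= step   # allow more automatic approvals
    return min(hi, max(lo, threshold))

# SME segment had a 6% error rate against a 4% target:
t = recalibrate(0.80, observed_error=0.06, target_error=0.04)  # -> 0.82
```

Repeated weekly, this pushes the automatic/manual split toward the target error rate without anyone hand-tuning thresholds per segment.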

  • Risk Scoring: XGBoost ensemble (180 features) + logistic (25 features) with per-decision SHAP values and Platt calibration
  • Dynamic Pricing: margin optimization with regulatory maximum spread constraints and real-time cost of funds via Central Bank API
  • Next-Best-Action: multi-class propensity model (7 products) with fairness post-processing via probability reweighting
  • Anomaly Detection: Variational Autoencoder (VAE) with per-segment reconstruction error and input perturbation attribution
  • Human-in-the-loop: dynamic confidence threshold with weekly calibration based on observed error rate per segment
  • Fairness: disparate impact evaluation across 4 protected demographic attributes at every retraining cycle

Regulatory compliance and traceability

The traceability module was the most critical component of the project from a regulatory standpoint. Each credit decision generates an immutable record that includes: the version identifier of each participating model, the snapshot of features calculated at the time of decision (not raw data, but the calculated values the model received as input), the model score and confidence, the applied decision threshold, and the justification for whether the decision was automatic or escalated to a human.

To meet regulatory requirements, the bank needed to be able to reconstruct any decision from the past 5 years in less than 2 hours of querying. The traceability record design uses a partitioning scheme by date and segment that enables efficient queries without scanning the entire dataset. Audit tests conducted with the bank's compliance team demonstrated an average decision reconstruction time of 4.3 minutes.
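The shape of the immutable record can be sketched as a serialized payload whose hash serves as a tamper-evident identifier. Field names are illustrative, not the platform's actual schema:

```python
# Sketch of an immutable decision record: hashing the serialized
# payload yields a tamper-evident record_id. Fields are hypothetical.
import hashlib
import json
from datetime import datetime, timezone

def build_decision_record(model_versions, feature_snapshot, score,
                          confidence, threshold, decision, automatic):
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_versions": model_versions,      # version of each agent's model
        "feature_snapshot": feature_snapshot,  # calculated inputs, not raw data
        "score": score,
        "confidence": confidence,
        "threshold": threshold,
        "decision": decision,
        "automatic": automatic,                # automatic vs escalated to human
    }
    serialized = json.dumps(payload, sort_keys=True)
    payload["record_id"] = hashlib.sha256(serialized.encode()).hexdigest()
    return payload

rec = build_decision_record(
    model_versions={"risk_scoring": "v4.2", "pricing": "v1.9"},
    feature_snapshot={"debt_to_income@v2": 0.25},
    score=0.12, confidence=0.91, threshold=0.85,
    decision="approve", automatic=True,
)
```

Partitioning these records by date and segment (as the article describes) is what keeps a 5-year reconstruction query in the minutes range rather than a full-table scan.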

  • Immutable record per decision: model version, feature snapshot, score, confidence, threshold, and decision type (automatic or human)
  • Average decision reconstruction time for audit: 4.3 minutes against 2-hour regulatory target
  • CI/CD pipeline with regulatory approval gates: no model reaches production without Chief Risk Officer sign-off and automated fairness validation
  • Fairness evaluation at every retraining: disparate impact, equal opportunity, and calibration by demographic group
  • Portfolio stress testing: impact simulation of each model on the portfolio risk distribution under 12 macroeconomic scenarios
  • Annual external audit: independent validation process with full access to version history and decision records

Deployment and change management

Deployment was performed in the bank's on-premise datacenter under a blue-green deployment model with automatic rollback. The infrastructure uses Kubernetes with separate namespaces per agent domain, allowing the pricing agent to be updated without affecting risk agent availability. Each namespace has network policies preventing direct inter-agent communication — all communication passes through the central messaging bus to guarantee the traceability record.

The change management program was as important as the technical deployment. Credit analysts were the most critical stakeholder: they needed to trust the system's recommendations to process automatic decisions, but also needed to know exactly how and when to exercise their override judgment. A 16-hour internal certification program was designed covering the interpretation of SHAP values, use of the audit panel, and escalation protocols. The adoption rate at 90 days post go-live was 94% of active analysts.

Override design is a governance decision as important as model design. If override is too easy, analysts use it to avoid the system without learning to interpret it. If too restrictive, it generates resistance and distrust. The right balance requires data on actual override patterns, not intuition.

Quantified results

Results were measured in two dimensions: operational efficiency and decision quality. In efficiency, average credit decision time was reduced from 4.2 days to 1.8 days for SME credit (-58%) and from 11 days to 5.3 days for corporate credit (-52%), because high-confidence model cases are processed automatically and only complex cases reach the analyst with pre-analyzed context.

In decision quality, the 12-month default rate of the approved portfolio decreased 17% in the first year of operation, controlling for product mix and the macroeconomic environment. This result was validated by the bank's risk team comparing pre and post-deployment cohorts with equivalent observed risk characteristics.

  • SME credit decision time reduction: from 4.2 to 1.8 days (-58%)
  • Corporate credit decision time reduction: from 11 to 5.3 days (-52%)
  • Reduction in 12-month default rate in approved portfolio: -17% adjusted for mix
  • 100% regulatory traceability: zero decisions without complete records since go-live
  • Reduction in cross-domain inconsistencies: from 11% to 0.3% difference between risk scores consumed by different agents
  • Response time to fraud incident with VAE alert: from 6.8 hours (historical average) to 23 minutes

Lessons learned

The most important lesson was that regulatory traceability cannot be a module added at the end of development — it must be the design principle that determines the architecture from the start. When traceability is added to an already-built system, prior technical compromises (asynchronous logs, absence of feature snapshots, informal model versioning) generate technical debt that is extraordinarily costly to resolve.

The second lesson was about semantic feature consistency. The time invested in designing and maintaining a shared feature store with versioned definitions pays back many times over in the reduction of inter-agent inconsistency incidents. In the first quarter post-deployment, the team recorded zero cross-domain inconsistency incidents, compared to an average of 3.4 incidents per quarter with the prior system.

  • Regulatory traceability must be the central design principle, not a module added at the end
  • The shared feature store with versioned semantic definitions eliminates inter-agent inconsistency at the root
  • The design of the human override mechanism is as critical as model design — it requires real behavior data to calibrate correctly
  • Binary models without source code are a regulatory risk: the bank must be able to audit and reconstruct the logic of every production model
  • The analyst certification program is a necessary condition for adoption: the rate of poorly justified overrides correlates inversely with training hours received
  • Blue-green deployment with automatic rollback by business metric (not just technical) is the correct pattern for credit models in production

Have a similar challenge?

Let's talk 30 minutes about your use case. No strings attached.

Schedule call