XAI in Production: SHAP, LIME, Attention and When to Use Each
Explainability Is Not a Report: It Is a System Layer
When a regulator asks "why did the system make this decision", they do not expect a 40-page PDF generated three months later. They expect a traceable, consistent, and reproducible answer, available at decision time or immediately after. This means explainability must be integrated into the inference pipeline, not added as an ad-hoc script.
The three families of methods that dominate XAI in production are SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and the native attention mechanisms of transformer architectures. Each has specific trade-offs in fidelity, computational cost, and context of use.
| Dimension | SHAP | LIME | Attention |
|---|---|---|---|
| Theoretical fidelity | High (game theory foundation, exact Shapley values) | Approximate (local perturbation, linear surrogate model) | Variable (correlation does not imply causation; task-dependent) |
| Computational cost | High (exponential in features; TreeSHAP is O(TLD²)) | Moderate (N configurable perturbations, typically 1000-5000) | Low (byproduct of the forward pass, no additional cost) |
| Model type | Any (model-agnostic); accelerated for trees (TreeSHAP) | Any (model-agnostic) | Only transformers / models with attention mechanism |
| Granularity | Feature-level (global and local) | Feature-level (local only) | Token / patch / step (architecture-dependent) |
| Best production use | Audit, compliance, regulator explanations | Quick debug, end-user explanations | NLP, vision, time series with transformers |
SHAP: When Precision Matters More Than Speed
SHAP computes the marginal contribution of each feature to the prediction, based on Shapley values from cooperative game theory. The theoretical guarantees are strong: contributions sum to the difference between the prediction and the baseline expected value (local accuracy), and they satisfy consistency and symmetry. For tree-based models (XGBoost, LightGBM, CatBoost), TreeSHAP computes exact Shapley values in polynomial time. For deep learning models, KernelSHAP is model-agnostic but computationally expensive.
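To make the guarantees concrete, here is a minimal sketch of exact Shapley values computed by brute-force coalition enumeration (the definition SHAP approximates efficiently). The toy model `f` and the baseline are invented for illustration; real workloads would use the `shap` library, since this enumeration is O(2^n) in the number of features.

```python
import itertools
import math

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    f: model over a full feature vector; x: instance to explain;
    baseline: reference values for features absent from a coalition.
    Cost is O(2^n) in features, so this is only viable for tiny n.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for coalition in itertools.combinations(others, size):
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = (math.factorial(size) * math.factorial(n - size - 1)
                          / math.factorial(n))
                with_i = [x[j] if (j in coalition or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in coalition else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy model with an interaction term (v0 * v2): the interaction's credit
# is split between features 0 and 2, and contributions still sum exactly.
f = lambda v: 2 * v[0] + 3 * v[1] + v[0] * v[2]
x, base = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(f, x, base)
```

The local-accuracy property means `sum(phi)` equals `f(x) - f(baseline)` exactly, which is what makes SHAP values auditable: the explanation accounts for the whole prediction.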
In production, SHAP is pre-computed in batch for global explanations (which features dominate model decisions) and computed on-demand for local explanations (why this specific prediction has this value). KernelSHAP cost can be mitigated with sampling and caching explanations for similar inputs.
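The caching idea above can be sketched with a key built from rounded feature values, so near-identical inputs reuse one expensive explanation. The rounding precision and the in-memory dict are illustrative assumptions; a production system would use a real cache (e.g. Redis) and a tolerance tuned per feature.

```python
import hashlib
import json

# Hypothetical explanation cache: inputs that round to the same key reuse
# one expensive KernelSHAP-style computation.
_cache = {}

def cache_key(features, decimals=3):
    """Quantize features so near-identical inputs map to the same key."""
    rounded = [round(float(v), decimals) for v in features]
    return hashlib.sha256(json.dumps(rounded).encode()).hexdigest()

def explain_with_cache(explain_fn, features):
    key = cache_key(features)
    if key not in _cache:
        _cache[key] = explain_fn(features)  # expensive call, at most once per key
    return _cache[key]
```

The trade-off is precision: rounding to 3 decimals assumes explanations are locally stable at that granularity, which should be validated per model.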
LIME: Fast Explanations for Debug and UX
LIME generates local perturbations of the input, observes how the prediction changes, and fits an interpretable model (typically linear regression or decision tree) that approximates the original model behavior in the neighborhood of the point of interest. It is fast, intuitive, and produces explanations that non-technical users can understand.
The main limitation is instability: two LIME runs on the same input can produce different explanations if the perturbation sampling changes. In production, this is mitigated by fixing the random seed and increasing the number of perturbations, but it introduces a trade-off with latency.
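The perturb-and-fit mechanics, and why a fixed seed restores reproducibility, can be sketched in a few lines of numpy: perturb the input, weight samples by proximity, and fit a weighted linear surrogate. The kernel width, noise scale, and toy model are illustrative choices, not LIME's defaults (the `lime` library uses a ridge surrogate and its own sampling scheme).

```python
import numpy as np

def lime_local_surrogate(f, x, n_samples=1000, kernel_width=0.75, seed=0):
    """LIME-style sketch: Gaussian perturbations around x, RBF proximity
    weights, and a weighted least-squares linear surrogate.
    Fixing `seed` makes the explanation reproducible across runs."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))
    y = np.array([f(z) for z in Z])
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)  # closer samples count more
    # Weighted least squares with intercept: scale rows by sqrt(weight).
    A = np.hstack([Z, np.ones((n_samples, 1))])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    return beta[:-1]  # per-feature local weights (intercept dropped)

f = lambda v: 3 * v[0] - 2 * v[1]  # toy model: surrogate should recover [3, -2]
x = np.array([1.0, 1.0])
coefs = lime_local_surrogate(f, x)
```

Two calls with the same seed return identical coefficients; changing the seed changes the sample and, for nonlinear models, can change the explanation, which is exactly the instability described above.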
Attention: Native Explainability in Transformers
Attention mechanisms in transformer architectures produce weight matrices that indicate how much the model "attends" to each token (or patch, or temporal step) when generating output. This information is a free byproduct of the forward pass and requires no additional computation.
However, there is active debate about whether attention weights truly "explain" the model decision. Recent studies show that attention can be manipulated without changing the prediction (attention is not explanation). In production, we recommend using attention as a quick saliency heuristic, complemented with SHAP or LIME for formal audits.
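A minimal numpy sketch of scaled dot-product attention shows both points: the weights fall out of the forward pass for free, and they are a distribution over tokens (each row sums to 1), not a causal attribution. The shapes below are arbitrary illustration values.

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: softmax over keys.
    This is the 'free' saliency signal a transformer produces, but a
    high weight does not imply causal importance for the prediction."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query tokens, hidden dim 8
K = rng.normal(size=(4, 8))
W = attention_weights(Q, K)  # W[i, j]: how much token i attends to token j
```

Because each row is forced to sum to 1, attention always "points somewhere" even when no input token is particularly important, which is one reason it fails as a formal explanation.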
Production Integration Patterns
Pattern 1: Synchronous Explainability (Real-Time)
For decisions that require immediate explanation (credit approval, fraud detection with user notification), the explainability computation runs as part of the inference pipeline. Typical choices are LIME with a reduced perturbation budget (500-1000 samples) or, when the model is a transformer, attention weights. Added latency is typically 50-200 ms.
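The synchronous pattern can be sketched as a single inference step that returns the prediction and its explanation together, with a fallback when the primary explainer fails. The hooks `model_fn`, `explain_fn`, and `fallback_fn` are hypothetical stand-ins (e.g. LIME with 500 perturbations versus attention weights already in hand); real services would also enforce a hard timeout.

```python
import time

def predict_and_explain(model_fn, explain_fn, fallback_fn, x, budget_ms=200):
    """Pattern 1 sketch: inline explanation with a cheap fallback.
    The measured latency is returned so budget violations can be monitored."""
    prediction = model_fn(x)
    start = time.perf_counter()
    try:
        explanation = explain_fn(x)
    except Exception:
        explanation = fallback_fn(x)  # degrade gracefully, never block the decision
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {"prediction": prediction,
            "explanation": explanation,
            "explain_latency_ms": elapsed_ms,
            "over_budget": elapsed_ms > budget_ms}
```

Returning the explanation in the same response payload is what makes the answer available "at decision time" rather than reconstructed later.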
Pattern 2: Asynchronous Explainability (Batch/Audit)
For periodic audits or bias analysis, full SHAP is run in batch over a representative sample of recent decisions. Results are stored versioned and joined with the decision log. This pattern allows deep analysis without impacting service latency.
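The join described above can be sketched with plain records: batch SHAP output is linked to the decision log by `request_id` and stamped with a version. All field names and values below are illustrative; in practice this is a SQL join over the decision log and the attribution table.

```python
# Hypothetical decision log and batch SHAP output, keyed by request_id.
decision_log = [
    {"request_id": "r1", "prediction": 0.91},
    {"request_id": "r2", "prediction": 0.12},
]
shap_batch = {
    "r1": {"income": 0.40, "age": -0.05},
    "r2": {"income": -0.30, "age": 0.02},
}

# Versioned audit rows: each decision carries its attributions and the
# batch-run identifier, so explanations stay reproducible over time.
audit_rows = [
    {**d, "attributions": shap_batch[d["request_id"]], "shap_version": "2024-06"}
    for d in decision_log
    if d["request_id"] in shap_batch
]
```

Versioning the explanation artifacts matters because a retrained model invalidates old attributions: an auditor must see the explanation the model of record produced at the time.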
Explainability Architecture on Google Cloud
Vertex Explainable AI offers integrated feature attributions for tabular and image models. For custom models and advanced explainability, the following architecture separates explanation compute from the inference pipeline:
For synchronous explainability (latency < 200ms), Cloud Functions execute LIME with 500 perturbations and cache results in Memorystore. For batch audits, a Vertex AI Pipeline runs full SHAP over a sample of recent decisions, writes Shapley values to BigQuery, and links them to the decision log by request_id. Looker dashboards show feature importance trends, explanation distribution by segment, and explanation drift alerts (changes in which features dominate decisions).
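The explanation-drift alert mentioned above can be sketched as a comparison of mean absolute attribution per feature between a reference window and a recent window, alerting when the dominant feature changes. The windows and values below are toy data; a real check would run over BigQuery query results and use a statistical test rather than a simple argmax comparison.

```python
import numpy as np

def mean_abs_attrib(shap_matrix):
    """Global importance per feature: mean |SHAP value| over a window of rows."""
    return np.mean(np.abs(shap_matrix), axis=0)

# Toy windows: rows are decisions, columns are features.
ref = np.array([[0.4, 0.1],
                [0.5, 0.2]])     # reference window: feature 0 dominates
recent = np.array([[0.1, 0.6],
                   [0.2, 0.5]])  # recent window: feature 1 dominates

# Alert when the top-ranked feature changes between windows.
drift = np.argmax(mean_abs_attrib(ref)) != np.argmax(mean_abs_attrib(recent))
```

A change in which features dominate, even with stable accuracy, often signals data drift or a proxy-feature problem worth investigating before a regulator does.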
Key Takeaways
- Explainability is a system layer, not a post-hoc report. It must be integrated into the inference pipeline or available on demand.
- SHAP offers the highest theoretical fidelity and is ideal for audit and compliance. Its cost is mitigated with TreeSHAP (trees) or batch processing.
- LIME is faster and more intuitive, but less stable. Ideal for UX and quick debug in production.
- Attention weights are free but not always faithful. Use them as a heuristic, not as a formal explanation.
- A layered explainability stack separates presentation, compute, storage, traceability, and inference so each layer can scale independently.
