Feature Engineering in Production: From Notebooks to Pipelines
The Training-Serving Skew Problem
Training-serving skew is the #1 cause of silent model degradation in production. It occurs when the features the model saw during training differ from those it receives at inference time. Common causes include transformations implemented differently for training (Python/pandas) and serving (Java/SQL), features computed with future data during training (temporal data leakage), and bugs in feature engineering logic that go undetected until production.
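A minimal sketch of the first failure mode: the training pipeline normalizes a feature with statistics computed from the training data, while a reimplemented serving path hard-codes stale statistics. The feature name and the numbers are hypothetical; the point is that the same raw input produces two different feature values.

```python
import pandas as pd

# Training side: z-score normalization using statistics computed from the
# training DataFrame (hypothetical feature "session_length").
train = pd.DataFrame({"session_length": [10.0, 20.0, 30.0, 40.0]})
mu, sigma = train["session_length"].mean(), train["session_length"].std()
train["session_length_norm"] = (train["session_length"] - mu) / sigma

# Serving side: a separate reimplementation that hard-codes stale statistics
# (e.g., copied from an older notebook run). Hypothetical values.
STALE_MU, STALE_SIGMA = 18.0, 9.5

def serve_normalize(x: float) -> float:
    return (x - STALE_MU) / STALE_SIGMA

# The same raw input yields different feature values in each pipeline:
# this divergence is training-serving skew, and the model never sees an error.
x = 30.0
train_value = (x - mu) / sigma
serve_value = serve_normalize(x)
print(f"training: {train_value:.3f}  serving: {serve_value:.3f}")
```

Because both paths run without raising any exception, the skew is silent: only the model's metrics degrade.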
Feature Store: The Skew Solution
A Feature Store centralizes feature definition, computation, and serving. The same transformation serves both training and inference, eliminating skew by design. Offline features (batch) are materialized in BigQuery for training; online features (low-latency) are served from Bigtable or Memorystore for inference. Point-in-time correctness ensures that during training only features available at the time of each observation are used, preventing data leakage.
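Point-in-time correctness can be illustrated with a backward as-of join: each training observation picks up the most recent feature value materialized at or before its event timestamp, never a future one. This is a sketch using `pandas.merge_asof` with hypothetical entity, timestamp, and feature names; a production feature store performs the equivalent join at scale.

```python
import pandas as pd

# Hypothetical labeled observations: (entity, event timestamp).
observations = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-03-10"]),
})

# Hypothetical feature table: periodic materializations of a feature value.
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "feature_ts": pd.to_datetime(["2024-02-28", "2024-03-05", "2024-03-12"]),
    "purchases_30d": [3, 5, 9],
})

# Point-in-time join: for each observation, take the latest feature value
# with feature_ts <= event_ts. The 2024-03-12 value can never leak into
# the 2024-03-10 observation.
training_set = pd.merge_asof(
    observations.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
print(training_set[["event_ts", "purchases_30d"]])
```

A naive join on `user_id` alone would attach whichever feature row happened to be latest, silently training the model on information from the future.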
Feature Monitoring and Governance
Features are critical assets: they require versioning, ownership, documentation, lineage, and quality monitoring. Feature monitoring detects distribution anomalies, unexpected nulls, and degraded freshness. Feature governance assigns owners, defines freshness SLAs, and maintains a searchable catalog of every feature in the ecosystem. Without governance, features proliferate unchecked and the feature store turns into a data swamp.
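One common way to detect the distribution anomalies mentioned above is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against its distribution in production. This is a self-contained sketch; the data is simulated and the alerting thresholds are the usual rule of thumb, not a standard from any specific tool.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (training) sample
    and a current (production) sample, using quantile bins from the reference."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range production values
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid log(0) and division by zero.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
train_dist = rng.normal(0.0, 1.0, 10_000)   # feature at training time
same_dist = rng.normal(0.0, 1.0, 10_000)    # healthy production traffic
shifted = rng.normal(0.8, 1.0, 10_000)      # simulated production drift

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 alert.
print(f"stable:  {psi(train_dist, same_dist):.3f}")
print(f"drifted: {psi(train_dist, shifted):.3f}")
```

In practice a check like this runs on a schedule per feature, and a PSI above the alert threshold pages the feature's owner, which is exactly why governance must assign one.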
Key Takeaways
- Training-serving skew is the #1 cause of silent degradation. A Feature Store eliminates it by design.
- Point-in-time correctness prevents temporal data leakage during training.
- Online (Bigtable, <10ms) and offline (BigQuery, batch) features must be served from the same definition.
