Building a Feature Store From Scratch: A Hands-On Guide for ML Developers
Machine learning models are only as good as the data they're trained on and the features they use for predictions. As AI systems become more complex and move into production, managing these features efficiently turns into a significant challenge. This is where a feature store comes in. It’s a specialized system designed to manage the entire lifecycle of machine learning features, ensuring consistency, reliability, and low-latency access.
Many teams realize they need a feature store when their models, which worked perfectly during development, start acting up in production. This often happens because the logic used to create features for training is different from the logic used to serve them for real-time predictions. This problem is known as "training-serving skew". A feature store solves this by defining features once and making them available consistently for both training and inference.
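To make "define once, use everywhere" concrete, here is a minimal sketch in which a single function computes a feature for both the training path and the serving path. The feature name `days_since_signup` and the sample dates are hypothetical and chosen only for illustration.

```python
from datetime import date


def days_since_signup(signup_date: date, as_of: date) -> int:
    """Single definition of the feature, shared by training and serving."""
    return (as_of - signup_date).days


# Training path: compute the feature for historical rows, as of each label date.
training_rows = [
    {"signup": date(2026, 1, 1), "label_date": date(2026, 1, 31)},
    {"signup": date(2026, 2, 10), "label_date": date(2026, 3, 1)},
]
training_features = [
    days_since_signup(row["signup"], row["label_date"]) for row in training_rows
]

# Serving path: the same function, evaluated at request time.
serving_feature = days_since_signup(date(2026, 1, 1), as_of=date(2026, 1, 31))
```

Because both paths call the same function, a change to the feature logic cannot silently apply to one path and not the other.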
In this tutorial, we'll walk through building a minimal working feature store from scratch. We'll explore the five core components every feature store needs and see how modern AI applications, especially those involving Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), influence its design. We'll write everything in Python, using Parquet files queried with DuckDB for the offline store, Redis for the online store, and FastAPI for the serving layer.
What Exactly is a Feature Store?
At its heart, a feature store is a centralized system that stores, manages, and serves machine learning features across different models and teams. It acts as a bridge between data engineering and machine learning engineering, streamlining the process of getting features from raw data to deployed models.
Think of it as a specialized data management system for your ML ingredients. It helps with several critical aspects:
- Consistency: Ensures the same feature logic is used for both training and serving, preventing performance degradation in production.
- Reusability: Allows data scientists and ML engineers to discover, share, and reuse high-quality features across different projects and models, reducing duplicate work.
- Version Control: Manages different versions of features, providing data lineage and reproducibility for experiments.
- Low-Latency Serving: Provides fast access to features for real-time model inference.
- Monitoring: Helps detect issues like feature drift or data quality problems.
Now, let's break down the five essential components of a feature store and build them step by step.
The Five Essential Components of a Feature Store
A modern feature store typically consists of five main components: a Feature Registry, an Offline Store, an Online Store, a Transformation/Ingestion Layer (often called a materialization pipeline), and a Feature Retrieval (or Serving) API.
Component 1: The Feature Registry – Defining Features as Code
The feature registry is the brain of your feature store. It’s a centralized catalog that defines all your features and stores important metadata about them. This includes details about where the raw data comes from, how features are transformed, their data types, and how they are used by models. Defining features as code here ensures consistency and allows for version control, making it easier for teams to collaborate and understand what each feature represents.
Purpose: To act as a single source of truth for feature definitions and their metadata.
Minimal Implementation Idea: We can use Python dataclasses or Pydantic models to define our features. This lets us clearly specify feature names, data types, and transformation logic. For a simple setup, these definitions can live in Python files within a designated 'feature repository'.
Example (features.py):
from dataclasses import dataclass

import pandas as pd


@dataclass
class Feature:
    name: str
    description: str
    data_type: str
    entity_key: str  # The primary key for the entity this feature describes

    def transform(self, raw_data: pd.DataFrame) -> pd.DataFrame:
        """
        Applies the transformation logic to raw data to create this feature.
        Subclasses override this; the base class only defines the interface.
        """
        raise NotImplementedError


@dataclass
class UserLoginCount(Feature):
    name: str = "user_login_count_7d"
    description: str = "Number of logins by a user in the last 7 days."
    data_type: str = "int"
    entity_key: str = "user_id"

    def transform(self, raw_data: pd.DataFrame) -> pd.DataFrame:
        # Assumes raw_data has 'user_id' and 'timestamp' columns, one row per login.
        # The 7-day window is anchored at the newest timestamp in the batch;
        # a production pipeline would anchor it at an explicit cutoff time.
        print(f"Transforming {self.name}...")
        # Parse into a new Series so the caller's DataFrame is not mutated
        timestamps = pd.to_datetime(raw_data['timestamp'], utc=True)
        seven_days_ago = timestamps.max() - pd.Timedelta(days=7)
        recent_logins = raw_data[timestamps >= seven_days_ago]
        # Group by user_id and count logins
        feature_df = recent_logins.groupby(self.entity_key).size().reset_index(name=self.name)
        return feature_df


# A simple registry dictionary
FEATURE_REGISTRY = {
    "user_login_count_7d": UserLoginCount()
}


def get_feature_definition(feature_name: str) -> Feature:
    if feature_name not in FEATURE_REGISTRY:
        raise ValueError(f"Feature '{feature_name}' not found in registry.")
    return FEATURE_REGISTRY[feature_name]
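A registry is also what makes features discoverable. The sketch below shows one way to look up features by the entity they describe. It is a standalone illustration: `FeatureMeta`, `register`, `features_for_entity`, and the second feature name `user_purchase_count_30d` are hypothetical and not part of the registry above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FeatureMeta:
    """Metadata only; transformation logic is left out of this sketch."""
    name: str
    description: str
    data_type: str
    entity_key: str


REGISTRY: Dict[str, FeatureMeta] = {}


def register(meta: FeatureMeta) -> None:
    """Adds a feature, refusing to silently overwrite an existing definition."""
    if meta.name in REGISTRY:
        raise ValueError(f"Feature '{meta.name}' is already registered.")
    REGISTRY[meta.name] = meta


def features_for_entity(entity_key: str) -> List[str]:
    """Lists the names of all features keyed by the given entity."""
    return sorted(n for n, m in REGISTRY.items() if m.entity_key == entity_key)


register(FeatureMeta("user_login_count_7d", "Logins in the last 7 days.", "int", "user_id"))
register(FeatureMeta("user_purchase_count_30d", "Purchases in the last 30 days.", "int", "user_id"))

user_features = features_for_entity("user_id")
```

Rejecting duplicate names is a deliberate choice here: two teams defining the same feature name with different logic is exactly the kind of inconsistency a registry is meant to prevent.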
Component 2: The Offline Store – Historical Data for Training
The offline store is where you keep large volumes of historical feature data. This data is primarily used for training your machine learning models, backfilling missing data, and performing batch predictions. It's designed for high throughput and scalability, often storing data in formats optimized for analytical queries, like columnar stores. A crucial aspect here is "point-in-time correctness," which means that when you retrieve historical features for training, you get the values as they existed at a specific past moment, preventing data leakage.
Purpose: To store comprehensive, historical feature data for model training and batch processing.
Minimal Implementation Idea: We'll use DuckDB for querying and Parquet files for storage. Parquet is an efficient columnar storage format widely used in big data ecosystems. DuckDB is an in-process SQL OLAP database known for its speed and ability to query Parquet files directly.
Example (Conceptual Python for offline store interaction):
import os
from typing import Any, List, Optional

import duckdb
import pandas as pd

from features import FEATURE_REGISTRY

OFFLINE_STORE_PATH = "offline_features/"
os.makedirs(OFFLINE_STORE_PATH, exist_ok=True)


def write_offline_features(feature_name: str, df: pd.DataFrame, entity_key: str):
    file_path = os.path.join(OFFLINE_STORE_PATH, f"{feature_name}.parquet")
    # Work on a copy and stamp each row with its ingestion time for point-in-time reads
    df = df.copy()
    df['feature_ingestion_timestamp'] = pd.Timestamp.now(tz='UTC')
    new_rows = len(df)
    # DataFrame.to_parquet has no portable append mode, so merge with the existing rows
    # and rewrite the file. Fine for a demo; a real store would write one file per batch.
    if os.path.exists(file_path):
        df = pd.concat([pd.read_parquet(file_path), df], ignore_index=True)
    df.to_parquet(file_path, index=False)
    print(f"Wrote {new_rows} rows for feature '{feature_name}' to offline store: {file_path}")


def read_offline_features(feature_name: str, entity_ids: Optional[List[Any]] = None,
                          as_of_time: Optional[pd.Timestamp] = None) -> pd.DataFrame:
    file_path = os.path.join(OFFLINE_STORE_PATH, f"{feature_name}.parquet")
    if not os.path.exists(file_path):
        return pd.DataFrame()
    entity_key = FEATURE_REGISTRY[feature_name].entity_key
    # The path and column name come from our own config and registry;
    # caller-supplied values are passed as bound parameters, not string-formatted.
    query = f"SELECT * FROM '{file_path}'"
    conditions, params = [], []
    if entity_ids:
        placeholders = ', '.join('?' for _ in entity_ids)
        conditions.append(f"{entity_key} IN ({placeholders})")
        params.extend(entity_ids)
    if as_of_time is not None:
        conditions.append("feature_ingestion_timestamp <= ?")
        params.append(as_of_time.to_pydatetime())
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    conn = duckdb.connect(database=':memory:')
    try:
        df = conn.execute(query, params).fetchdf()
    finally:
        conn.close()
    # Simple point-in-time logic: for each entity, keep the latest record at or before as_of_time.
    # A more robust solution would do this in SQL with a window function partitioned by
    # entity_key and ordered by timestamp.
    if as_of_time is not None and not df.empty:
        df = df.sort_values(by='feature_ingestion_timestamp', ascending=False)
        df = df.drop_duplicates(subset=[entity_key], keep='first')
    print(f"Read {len(df)} rows for feature '{feature_name}' from offline store.")
    return df
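When building a training set, each label usually has its own timestamp, so a single `as_of_time` is not enough: every row needs the feature value that was current at that row's event time. One common way to do this in pandas is `merge_asof`. The sketch below assumes a hypothetical `labels` table with an `event_timestamp` column; the sample values are made up for illustration.

```python
import pandas as pd

# Historical feature values, one row per (user_id, ingestion time).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ingestion_timestamp": pd.to_datetime(
        ["2026-06-01", "2026-06-08", "2026-06-05"], utc=True
    ),
    "user_login_count_7d": [3, 5, 2],
})

# Training labels, each observed at its own event time.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2026-06-03", "2026-06-10", "2026-06-04"], utc=True
    ),
    "label": [0, 1, 0],
})

# For each label row, take the most recent feature row for the same user
# whose ingestion time is at or before the label's event time.
# merge_asof requires both inputs to be sorted by their time keys.
training_set = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("feature_ingestion_timestamp"),
    left_on="event_timestamp",
    right_on="feature_ingestion_timestamp",
    by="user_id",
    direction="backward",
)
```

User 2's label on 2026-06-04 gets no feature value, because that user's only feature row was ingested a day later. Leaving it empty rather than using the later value is what prevents leakage.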
Component 3: The Online Store – Real-time Lookups for Inference
The online store is optimized for lightning-fast, low-latency retrieval of feature values. It typically holds only the most recent feature values for each entity and is critical for real-time model inference in production applications like fraud detection, recommendation systems, or personalized user experiences. Databases like Redis or DynamoDB are commonly used for their speed and key-value lookup capabilities.
Purpose: To provide instant access to the freshest feature values for real-time predictions.
Minimal Implementation Idea: We'll use Redis, an in-memory data store, for its speed. Each feature value for an entity can be stored as a key-value pair, where the key is a combination of the entity ID and feature name.
Example (Conceptual Python for online store interaction):
import json
from typing import Any, List

import pandas as pd
import redis

from features import FEATURE_REGISTRY

# For a real application, use a connection pool and proper configuration
# For local development, a default Redis instance is often fine
try:
    ONLINE_STORE_CLIENT = redis.Redis(host='localhost', port=6379, db=0)
    ONLINE_STORE_CLIENT.ping()
    print("Successfully connected to Redis online store.")
except redis.exceptions.ConnectionError as e:
    print(f"Could not connect to Redis: {e}. Please ensure Redis is running.")
    # Fall back to None so the functions below can skip Redis calls
    ONLINE_STORE_CLIENT = None


def write_online_features(feature_name: str, df: pd.DataFrame, entity_key: str):
    if ONLINE_STORE_CLIENT is None:
        print("Redis client not available. Skipping online write.")
        return
    # Batch all SET commands into one round trip
    pipe = ONLINE_STORE_CLIENT.pipeline()
    for _, row in df.iterrows():
        key = f"{feature_name}:{row[entity_key]}"
        # Series.to_json handles NumPy and timestamp values that json.dumps rejects
        value = row.drop(entity_key).to_json(date_format='iso')
        pipe.set(key, value)
    pipe.execute()
    print(f"Wrote {len(df)} rows for feature '{feature_name}' to online store.")


def read_online_features(feature_name: str, entity_ids: List[Any]) -> pd.DataFrame:
    if ONLINE_STORE_CLIENT is None:
        print("Redis client not available. Returning empty DataFrame.")
        return pd.DataFrame()
    if not entity_ids:
        return pd.DataFrame()  # MGET requires at least one key
    keys = [f"{feature_name}:{entity_id}" for entity_id in entity_ids]
    values = ONLINE_STORE_CLIENT.mget(keys)
    results = []
    # MGET returns values in key order, with None for missing keys
    for entity_id, val in zip(entity_ids, values):
        if val:
            data = json.loads(val)
            data[FEATURE_REGISTRY[feature_name].entity_key] = entity_id  # Add entity key back
            results.append(data)
    df = pd.DataFrame(results)
    print(f"Read {len(df)} rows for feature '{feature_name}' from online store.")
    return df
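If you want to exercise the key layout without a running Redis server (in unit tests, for example), a dictionary-backed stand-in is enough. The class below is a hypothetical helper, not part of redis-py; it only mirrors the `feature_name:entity_id` key scheme and JSON values used above.

```python
import json
from typing import Any, Dict, List, Optional


class InMemoryOnlineStore:
    """Dict-backed stand-in for Redis, using the same key layout."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    @staticmethod
    def make_key(feature_name: str, entity_id: Any) -> str:
        return f"{feature_name}:{entity_id}"

    def set(self, feature_name: str, entity_id: Any, values: Dict[str, Any]) -> None:
        # Values are stored as JSON strings, as they would be in Redis
        self._data[self.make_key(feature_name, entity_id)] = json.dumps(values)

    def mget(self, feature_name: str, entity_ids: List[Any]) -> List[Optional[Dict[str, Any]]]:
        # Like Redis MGET: one result per requested key, None when the key is missing
        results = []
        for entity_id in entity_ids:
            raw = self._data.get(self.make_key(feature_name, entity_id))
            results.append(json.loads(raw) if raw is not None else None)
        return results


store = InMemoryOnlineStore()
store.set("user_login_count_7d", 1, {"user_login_count_7d": 3})
found = store.mget("user_login_count_7d", [1, 99])
```

Here `found` holds the stored dictionary for user 1 and `None` for user 99, which has no entry.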
Component 4: The Materialization Pipeline – Keeping Stores in Sync
The materialization pipeline is the engine that moves transformed feature values from your raw data sources into both your offline and online stores. It ensures that the online store always has the freshest data, while the offline store maintains a complete historical record. This pipeline can involve batch processing (e.g., daily jobs) or streaming (e.g., real-time updates from Kafka). Its primary goal is to automate feature ingestion and transformations, ensuring consistency and preventing training-serving skew by using the same logic for both stores.
Purpose: To transform raw data into features and continuously push these features to both offline and online stores.
Minimal Implementation Idea: For a minimal setup, we can simulate this with a Python script that orchestrates the transformation and then writes to both stores. In a real-world scenario, this would be managed by an orchestration tool like Apache Airflow or Prefect, or a streaming engine like Apache Flink or Spark Streaming.
Example (Conceptual Python script for materialization):
# materialization_pipeline.py
import pandas as pd

from features import get_feature_definition
from offline_store import write_offline_features
from online_store import write_online_features


def run_materialization(feature_name: str, raw_data_source: pd.DataFrame):
    feature_def = get_feature_definition(feature_name)
    print(f"--- Running materialization for feature: {feature_name} ---")
    # 1. Transform raw data into features
    transformed_df = feature_def.transform(raw_data_source)
    if transformed_df.empty:
        print(f"No data transformed for feature '{feature_name}'. Skipping write.")
        return
    # 2. Write to offline store (for historical data and training)
    write_offline_features(feature_name, transformed_df, feature_def.entity_key)
    # 3. Write to online store (for real-time serving)
    write_online_features(feature_name, transformed_df, feature_def.entity_key)
    print(f"--- Materialization complete for feature: {feature_name} ---")


if __name__ == "__main__":
    # Simulate some raw data (e.g., user login events)
    sample_raw_data = pd.DataFrame({
        'user_id': [1, 2, 1, 3, 2, 1, 3, 2],  # illustrative IDs, one per timestamp below
        # Parse as UTC so these rows can be combined with the tz-aware rows below
        'timestamp': pd.to_datetime([
            '2026-06-01 10:00:00', '2026-06-01 10:05:00', '2026-06-02 11:00:00',
            '2026-06-02 11:15:00', '2026-06-03 12:00:00', '2026-06-04 13:00:00',
            '2026-06-09 14:00:00', '2026-06-10 15:00:00'
        ], utc=True)
    })
    # Let's add a more recent entry to demonstrate "freshest" data for online store
    current_time = pd.Timestamp.now(tz='UTC')
    recent_raw_data = pd.DataFrame({
        'user_id': [1, 2, 1],  # illustrative IDs, one per timestamp below
        'timestamp': [
            current_time - pd.Timedelta(minutes=5),
            current_time - pd.Timedelta(minutes=3),
            current_time - pd.Timedelta(minutes=1)
        ]
    })
    # Combine historical and recent raw data
    full_raw_data = pd.concat([sample_raw_data, recent_raw_data], ignore_index=True)
    run_materialization("user_login_count_7d", full_raw_data)
    # You would schedule this script to run periodically
Component 5: The Feature Retrieval API – Accessing Features
The feature retrieval API is how your models and applications actually get features from the feature store. For training, this often involves SDKs that query the offline store for historical data. For real-time inference, it's a low-latency API (often RESTful) that fetches features from the online store. This API needs to be robust, scalable, and secure, providing a consistent interface regardless of whether the features are for training or serving.
Purpose: To provide a unified, low-latency interface for models and applications to retrieve features.
Minimal Implementation Idea: We'll use FastAPI to build a simple REST API. FastAPI is a modern, fast (high-performance) web framework for building APIs with Python, based on standard Python type hints.
Example (api.py):
from typing import List, Optional

import pandas as pd
from fastapi import FastAPI, HTTPException

from features import FEATURE_REGISTRY
from offline_store import read_offline_features
from online_store import read_online_features

app = FastAPI(
    title="Minimal Feature Store API",
    description="A basic API for retrieving ML features from online and offline stores."
)


def parse_entity_ids(entity_ids: str) -> List[int]:
    """Parses comma-separated integer IDs, answering HTTP 400 on malformed input."""
    try:
        return [int(x.strip()) for x in entity_ids.split(',')]  # Assuming integer IDs
    except ValueError:
        raise HTTPException(status_code=400, detail="entity_ids must be comma-separated integers.")


# The store reads are blocking calls, so the endpoints are plain `def` functions;
# FastAPI runs those in a thread pool instead of blocking the event loop.
@app.get("/features/online/{feature_name}", summary="Get online features for inference")
def get_online_features_endpoint(feature_name: str, entity_ids: str):
    """
    Retrieves the latest feature values for given entities from the online store.
    - feature_name: The name of the feature (e.g., 'user_login_count_7d').
    - entity_ids: Comma-separated list of entity IDs (e.g., '1,2,3').
    """
    if feature_name not in FEATURE_REGISTRY:
        raise HTTPException(status_code=404, detail=f"Feature '{feature_name}' not found.")
    ids_list = parse_entity_ids(entity_ids)
    features_df = read_online_features(feature_name, ids_list)
    if features_df.empty:
        raise HTTPException(status_code=404, detail=f"No online features found for '{feature_name}' and provided entity IDs.")
    return features_df.to_dict(orient="records")


@app.get("/features/offline/{feature_name}", summary="Get offline features for training/backfill")
def get_offline_features_endpoint(feature_name: str, entity_ids: Optional[str] = None,
                                  as_of_time: Optional[str] = None):
    """
    Retrieves historical feature values for given entities from the offline store.
    - feature_name: The name of the feature.
    - entity_ids: Optional comma-separated list of entity IDs.
    - as_of_time: Optional timestamp (ISO format) for point-in-time correctness (e.g., '2026-01-01T12:00:00Z').
    """
    if feature_name not in FEATURE_REGISTRY:
        raise HTTPException(status_code=404, detail=f"Feature '{feature_name}' not found.")
    ids_list = parse_entity_ids(entity_ids) if entity_ids else None
    try:
        parsed_as_of_time = pd.to_datetime(as_of_time, utc=True) if as_of_time else None
    except ValueError:
        raise HTTPException(status_code=400, detail="as_of_time must be an ISO-format timestamp.")
    features_df = read_offline_features(feature_name, ids_list, parsed_as_of_time)
    if features_df.empty:
        raise HTTPException(status_code=404, detail=f"No offline features found for '{feature_name}' with given criteria.")
    return features_df.to_dict(orient="records")

# To run this API:
# 1. Save the above code as `api.py`.
# 2. Make sure `features.py`, `offline_store.py`, `online_store.py` are in the same directory.
# 3. Install dependencies: `pip install fastapi uvicorn pandas pyarrow duckdb redis`
# 4. Run from your terminal: `uvicorn api:app --reload`
# 5. Access the API documentation at http://127.0.0.1:8000/docs
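On the consuming side, a model service only needs to build the request URL and parse the JSON response. Here is a minimal client sketch using just the standard library. The helper names `build_online_url` and `fetch_online_features` are hypothetical, and the base URL assumes the API is running locally at the address shown in the run instructions.

```python
import json
from typing import Any, Dict, List
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # assumption: local uvicorn server


def build_online_url(feature_name: str, entity_ids: List[int]) -> str:
    """Builds the URL for the online endpoint, encoding the comma-separated IDs."""
    query = urlencode({"entity_ids": ",".join(str(i) for i in entity_ids)})
    return f"{BASE_URL}/features/online/{quote(feature_name)}?{query}"


def fetch_online_features(feature_name: str, entity_ids: List[int],
                          timeout: float = 2.0) -> List[Dict[str, Any]]:
    """Calls the API and returns the decoded list of feature records."""
    url = build_online_url(feature_name, entity_ids)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_online_url("user_login_count_7d", [1, 2, 3])
```

Note that `urlopen` raises `urllib.error.HTTPError` for the API's 404 and 400 responses, so a real client would catch that and decide how to fall back (for example, to default feature values).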
How AI Changes the Design of a Feature Store
The rise of advanced AI applications, particularly Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, introduces new demands on feature stores. While the core components remain the same, the emphasis shifts, and new considerations emerge:
- Extreme Low-Latency for LLM Context: LLM agents often need structured user context (like plan tier, recent activity, account state) injected into prompts for personalized responses. For real-time interactions, this context typically has to be retrieved within a few milliseconds (a common target is under 10 ms). The online store and its retrieval API become even more critical, requiring highly optimized data structures and serving infrastructure.
- Complementary Role with Vector Databases: It's important to understand that a feature store is not a vector database. While both provide data for AI models, they solve different retrieval problems:
  - Feature Store: Provides structured, tabular features (e.g., user's age, number of purchases, aggregated metrics).
  - Vector Database: Stores high-dimensional embeddings and performs similarity searches (e.g., finding the three most similar past viewing sessions for a user).

  A sophisticated LLM/RAG stack will use both. The feature store provides specific user attributes, and the vector database provides contextual similarity information, with the LLM prompt combining these diverse inputs.
- Complex On-Demand Feature Computation: While pre-computed features are ideal, some AI use cases might require features computed on-the-fly based on real-time request data. The transformation layer needs to be flexible enough to handle these "on-demand" transformations, ensuring consistency with how they'd be calculated during training.
- Scalability and Freshness: The volume and velocity of data for AI models can be immense. Feature stores need to scale horizontally to handle massive ingestion rates and high query volumes, while ensuring features are as fresh as possible.
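To illustrate the first and third points together, the sketch below assembles prompt context from pre-computed features (as they might come back from the online store) plus one on-demand feature derived from the incoming request. Everything here is hypothetical: the function names, the `plan_tier` feature, the word-count thresholds, and the prompt layout are illustrative choices, not a prescribed format.

```python
from typing import Any, Dict


def message_length_bucket(message: str) -> str:
    """On-demand feature: computed from the request itself, not pre-materialized."""
    word_count = len(message.split())
    if word_count <= 5:
        return "short"
    if word_count <= 30:
        return "medium"
    return "long"


def build_prompt(user_message: str, stored_features: Dict[str, Any]) -> str:
    """Combines stored and on-demand features into a context block for the LLM."""
    context = dict(stored_features)  # copy so the caller's dict is not modified
    context["message_length_bucket"] = message_length_bucket(user_message)
    # Sort keys so the prompt is deterministic for a given set of features
    lines = [f"- {name}: {value}" for name, value in sorted(context.items())]
    return "User context:\n" + "\n".join(lines) + f"\n\nUser message: {user_message}"


prompt = build_prompt(
    "How do I upgrade?",
    {"plan_tier": "free", "user_login_count_7d": 3},
)
```

To avoid skew, the same `message_length_bucket` function would also be applied to logged requests when building training data.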
Next Steps and Beyond a Minimal Implementation
Building a minimal feature store from scratch, as we've done here, provides a solid understanding of its core components and challenges. For production-grade AI systems, you'll likely want to leverage mature, open-source, or managed feature store solutions that offer robustness, scalability, and advanced features out of the box.
Popular open-source feature stores include Feast and Hopsworks. Feast, originally developed at Gojek and now a standalone open-source project, focuses on defining, managing, and serving features, supporting various offline and online stores. Hopsworks is another open-source platform that includes a feature store with an integrated vector database and support for streaming ingestion.
Commercial and managed options like Tecton (whose team has been a major contributor to Feast), Databricks Feature Store, Google Cloud's Vertex AI Feature Store, and Amazon SageMaker Feature Store provide managed services that integrate with their surrounding platforms, offering features like advanced monitoring, access control, and seamless deployment.
The journey from raw data to robust, production-ready AI models is complex. A well-designed feature store is a foundational piece of infrastructure that streamlines this journey, reduces common pitfalls, and empowers developers to build more reliable and impactful AI applications.



