Technical Definition
Batch learning is a machine learning paradigm in which models are trained on fixed datasets at discrete intervals rather than continuously. The model learns from a complete batch of historical data and is retrained periodically as new data accumulates.
System Architecture
```
Data Collection (Time Period T)
            ↓
Data Aggregation & Preprocessing
            ↓
     Feature Engineering
            ↓
Model Training (Batch Process)
            ↓
Model Evaluation & Validation
            ↓
      Model Deployment
            ↓
     Serving Predictions
            ↓
[Wait for Next Batch Period]
```
Batch Learning Workflow
- Data Accumulation: Collect data over a time period (daily, weekly, monthly)
- Preparation: Clean, validate, and preprocess the accumulated data
- Training: Train model on complete dataset
- Validation: Evaluate on holdout test set
- Deployment: Replace production model if performance improves
- Serving: Use model for predictions until next batch
- Repeat: Start cycle again after defined interval
Code Example: Batch Learning Pipeline
```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import pickle
from datetime import datetime

class BatchLearningPipeline:
    def __init__(self, model_path='models/', retrain_interval_days=7):
        self.model_path = model_path
        self.retrain_interval_days = retrain_interval_days
        self.model = None
        self.scaler = None
        self.last_training_date = None

    def load_batch_data(self, data_source):
        """Load data accumulated since the last training run."""
        # On the first run there is no previous training date, so load everything
        since = self.last_training_date or datetime.min
        query = """
            SELECT features, target
            FROM training_data
            WHERE timestamp > %s
        """
        df = pd.read_sql(query, data_source, params=[since])
        return df

    def preprocess_features(self, X):
        """Impute missing values and drop outlier rows."""
        # Handle missing values (median imputation for numeric columns)
        X = X.fillna(X.median(numeric_only=True))
        # Remove outliers (3-sigma rule)
        numeric_cols = X.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            mean = X[col].mean()
            std = X[col].std()
            X = X[(X[col] >= mean - 3 * std) & (X[col] <= mean + 3 * std)]
        return X

    def train_batch(self, data):
        """Train the model on a batch of data."""
        X = data.drop('target', axis=1)
        y = data['target']
        # Preprocess (this may drop outlier rows, so realign the target by index)
        X = self.preprocess_features(X)
        y = y.loc[X.index]
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        # Scale features (fit the scaler on training data only)
        self.scaler = StandardScaler()
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        # Train model
        self.model = LogisticRegression(
            max_iter=1000,
            random_state=42,
            class_weight='balanced'
        )
        self.model.fit(X_train_scaled, y_train)
        # Evaluate
        train_score = self.model.score(X_train_scaled, y_train)
        test_score = self.model.score(X_test_scaled, y_test)
        test_auc = roc_auc_score(y_test, self.model.predict_proba(X_test_scaled)[:, 1])
        print(f"Training Accuracy: {train_score:.4f}")
        print(f"Testing Accuracy: {test_score:.4f}")
        print(f"Testing AUC-ROC: {test_auc:.4f}")
        return {
            'train_accuracy': train_score,
            'test_accuracy': test_score,
            'test_auc': test_auc
        }

    def should_retrain(self):
        """Check whether retraining is due."""
        if self.last_training_date is None:
            return True
        days_since = (datetime.now() - self.last_training_date).days
        return days_since >= self.retrain_interval_days

    def save_model(self):
        """Persist the model and scaler to disk."""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        model_file = f"{self.model_path}model_{timestamp}.pkl"
        scaler_file = f"{self.model_path}scaler_{timestamp}.pkl"
        with open(model_file, 'wb') as f:
            pickle.dump(self.model, f)
        with open(scaler_file, 'wb') as f:
            pickle.dump(self.scaler, f)
        self.last_training_date = datetime.now()
        return model_file, scaler_file

    def predict(self, X_new):
        """Make predictions using the current model."""
        if self.model is None or self.scaler is None:
            raise ValueError("Model not trained yet")
        X_scaled = self.scaler.transform(X_new)
        return self.model.predict(X_scaled)

# Usage (db_connection is an existing database connection)
pipeline = BatchLearningPipeline(retrain_interval_days=7)
if pipeline.should_retrain():
    # Load accumulated data
    batch_data = pipeline.load_batch_data(db_connection)
    # Train
    metrics = pipeline.train_batch(batch_data)
    # Save
    pipeline.save_model()
```
Training Batch vs. Prediction Batch
Training Batch
- Large dataset used to train the model
- Processed once or periodically
- Computationally intensive but done offline
Prediction Batch
- New data to generate predictions on
- Can be smaller, processed more frequently
- Used with already-trained model
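A prediction batch can be scored in fixed-size chunks against the already-trained model, which keeps memory use flat regardless of batch size. The sketch below is illustrative: the synthetic data and the `predict_in_chunks` helper are assumptions, not part of the pipeline above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical trained artifacts (in practice, loaded from disk)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

def predict_in_chunks(model, scaler, X_new, chunk_size=1000):
    """Score a large prediction batch in fixed-size chunks."""
    preds = []
    for start in range(0, len(X_new), chunk_size):
        chunk = X_new[start:start + chunk_size]
        preds.append(model.predict(scaler.transform(chunk)))
    return np.concatenate(preds)

X_new = rng.normal(size=(2500, 3))
predictions = predict_in_chunks(model, scaler, X_new, chunk_size=1000)
print(predictions.shape)  # (2500,)
```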
Technical Limitations
- Concept Drift: Model degrades as data distribution changes between batches
- Latency: There's always a delay between data collection and model update
- Data Staleness: Predictions use outdated models between retrain cycles
- Computational Resources: Large batches require significant processing power
- Batch Dependency: Requires accumulating sufficient data before training
- No Real-Time Adaptation: Cannot respond to rapid changes instantly
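Concept drift between batches can be caught with even a simple statistical check. The sketch below is a minimal example, not a standard API: the `detect_mean_drift` helper and its threshold are illustrative choices that flag a feature whose new-batch mean deviates from the training mean by more than a few standard errors.

```python
import numpy as np

def detect_mean_drift(train_col, new_col, threshold=3.0):
    """Flag drift when the new batch's mean deviates from the training
    mean by more than `threshold` standard errors."""
    se = train_col.std(ddof=1) / np.sqrt(len(new_col))
    z = abs(new_col.mean() - train_col.mean()) / se
    return z > threshold

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(0.5, 1.0, 1000)  # distribution moved by +0.5

print(detect_mean_drift(train, train))    # False (identical batches)
print(detect_mean_drift(train, shifted))  # True (mean shifted by 0.5)
```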
Performance Considerations
Training Efficiency
- Vectorization: Leverage NumPy/Pandas for efficient processing
- Parallelization: Use multi-core processors for scaling
- Sampling: Use stratified sampling for representative batches
- Data Formats: Use efficient formats (Parquet vs. CSV)
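Stratified sampling keeps a downsampled training batch representative of the full class distribution, which matters for imbalanced targets. A minimal sketch using scikit-learn's `train_test_split` with `stratify` (the synthetic data is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: ~10% positive class
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "feature": rng.normal(size=10000),
    "target": (rng.random(10000) < 0.1).astype(int),
})

# Draw a stratified 10% sample that preserves the class ratio
sample, _ = train_test_split(
    df, train_size=0.1, stratify=df["target"], random_state=42
)
print(len(sample))  # 1000
print(abs(sample["target"].mean() - df["target"].mean()) < 0.01)  # True
```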
Memory Management
- Chunking: Process large batches in chunks
- Disk Storage: Use databases for efficient data storage
- Compression: Compress historical data
- Cleanup: Archive old data regularly
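Chunking in practice: pandas can stream a CSV with `chunksize` so only one chunk is resident in memory at a time. A minimal sketch (the file here is synthetic, written just for the demo):

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Write a large-ish CSV to a temporary location for the demo
path = os.path.join(tempfile.mkdtemp(), "batch_data.csv")
pd.DataFrame({"value": np.arange(100_000)}).to_csv(path, index=False)

# Aggregate chunk by chunk; only 10,000 rows are in memory at once
total, count = 0.0, 0
for chunk in pd.read_csv(path, chunksize=10_000):
    total += chunk["value"].sum()
    count += len(chunk)

print(count)          # 100000
print(total / count)  # 49999.5
```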
Monitoring
- Performance Drift: Track model accuracy over time
- Data Drift: Monitor input data distribution changes
- Version Control: Track model versions for rollback
- Logging: Log predictions for evaluation
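Data drift is often monitored with the Population Stability Index (PSI), where values above roughly 0.2 are conventionally treated as significant drift. A minimal sketch; the binning scheme and thresholds here are common rules of thumb, not a fixed standard:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a new sample.
    Rule of thumb: < 0.1 stable, > 0.2 significant drift."""
    # Bin edges from baseline quantiles, extended to cover all values
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) for empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10000)
same = rng.normal(0, 1, 10000)     # same distribution
drifted = rng.normal(1, 1, 10000)  # mean shifted by one std

print(population_stability_index(baseline, same) < 0.1)     # True
print(population_stability_index(baseline, drifted) > 0.2)  # True
```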
Best Practices
- Schedule Consistently: Regular retraining on fixed schedule
- Data Validation: Check data quality before training
- Holdout Test Set: Use consistent test set across batches
- Baseline Comparison: Compare new model against current production model
- Gradual Rollout: Deploy to sample users first before full rollout
- Monitoring Dashboard: Track model performance metrics continuously
- Versioning: Maintain model versions for debugging and rollback
- Documentation: Record training parameters and data characteristics
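Baseline comparison in practice: evaluate the candidate and the current production model on the same fixed holdout set and promote only on improvement. The models and synthetic data below are illustrative stand-ins, not a prescribed choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic data with a roughly linear decision boundary
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 2000) > 0).astype(int)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Stand-ins: a weak "production" model and a stronger candidate
production = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score both on the SAME holdout, then decide
prod_auc = roc_auc_score(y_hold, production.predict_proba(X_hold)[:, 1])
cand_auc = roc_auc_score(y_hold, candidate.predict_proba(X_hold)[:, 1])
deploy = cand_auc > prod_auc
print(f"production AUC={prod_auc:.3f} candidate AUC={cand_auc:.3f} deploy={deploy}")
```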
Batch Learning vs. Online Learning
| Aspect | Batch Learning | Online Learning |
|---|---|---|
| Data | Fixed dataset | Streaming data |
| Frequency | Periodic | Continuous |
| Latency | Higher | Lower |
| Computation | Intense but offline | Lighter but continuous |
| Adaptation | Periodic jumps | Gradual changes |
Use Cases for Batch Learning
- Churn Prediction: Retrain weekly with accumulated subscriber data
- Demand Forecasting: Retrain monthly with new sales data
- Credit Scoring: Retrain quarterly with new loan data
- Batch Recommendation Systems: Update recommendations nightly
- Periodic Report Generation: Generate insights weekly or monthly
Future Implications
Near Term: Hybrid approaches combining batch and online learning will become standard
Long Term: Most systems will shift toward continuous learning with adaptive batch windows
Tags
Machine Learning · Batch Processing · Data Science