Knowledge Base
Technical · Batch Learning · 5 min read

Understanding Batch Learning: Technical Level

Technical guide to batch learning paradigm, implementation strategies, and performance considerations.

AI Guru Team

6 November 2024

Technical Definition

Batch learning is a machine learning paradigm where models are trained on fixed datasets during discrete time intervals rather than continuously. The model learns from a complete batch of historical data and is retrained periodically when new data accumulates.

System Architecture

Data Collection (Time Period T)
    ↓
Data Aggregation & Preprocessing
    ↓
Feature Engineering
    ↓
Model Training (Batch Process)
    ↓
Model Evaluation & Validation
    ↓
Model Deployment
    ↓
Serving Predictions
    ↓
[Wait for Next Batch Period]

Batch Learning Workflow

  1. Data Accumulation: Collect data over a time period (daily, weekly, monthly)
  2. Preparation: Clean, validate, and preprocess the accumulated data
  3. Training: Train model on complete dataset
  4. Validation: Evaluate on holdout test set
  5. Deployment: Replace production model if performance improves
  6. Serving: Use model for predictions until next batch
  7. Repeat: Start cycle again after defined interval

Code Example: Batch Learning Pipeline

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
import os
import pickle
from datetime import datetime

class BatchLearningPipeline:
    def __init__(self, model_path='models/', retrain_interval_days=7):
        self.model_path = model_path
        self.retrain_interval_days = retrain_interval_days
        self.model = None
        self.scaler = None
        self.last_training_date = None
        
    def load_batch_data(self, data_source):
        """Load data accumulated since the last training run"""
        # On the first run there is no previous training date, so fall
        # back to the epoch and load the full history
        since = self.last_training_date or datetime(1970, 1, 1)
        query = """
        SELECT features, target
        FROM training_data
        WHERE timestamp > %s
        """
        df = pd.read_sql(query, data_source, params=[since])
        return df
    
    def preprocess_features(self, X):
        """Feature preprocessing"""
        # Impute missing values with column medians (numeric columns only)
        X = X.fillna(X.median(numeric_only=True))
        
        # Remove outlier rows (3-sigma rule). Dropping rows here means the
        # target must be realigned afterwards; train_batch does this via X.index
        numeric_cols = X.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            mean = X[col].mean()
            std = X[col].std()
            X = X[(X[col] >= mean - 3*std) & (X[col] <= mean + 3*std)]
        
        return X
    
    def train_batch(self, data):
        """Train model on a batch of data"""
        X = data.drop('target', axis=1)
        y = data['target']
        
        # Preprocess (may drop outlier rows)
        X = self.preprocess_features(X)
        # Realign the target with the rows that survived preprocessing
        y = y.loc[X.index]
        
        # Split data, stratifying to keep class proportions consistent
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )
        
        # Scale features
        self.scaler = StandardScaler()
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        
        # Train model
        self.model = LogisticRegression(
            max_iter=1000,
            random_state=42,
            class_weight='balanced'
        )
        self.model.fit(X_train_scaled, y_train)
        
        # Evaluate
        train_score = self.model.score(X_train_scaled, y_train)
        test_score = self.model.score(X_test_scaled, y_test)
        test_auc = roc_auc_score(y_test, self.model.predict_proba(X_test_scaled)[:, 1])
        
        print(f"Training Accuracy: {train_score:.4f}")
        print(f"Testing Accuracy: {test_score:.4f}")
        print(f"Testing AUC-ROC: {test_auc:.4f}")
        
        return {
            'train_accuracy': train_score,
            'test_accuracy': test_score,
            'test_auc': test_auc
        }
    
    def should_retrain(self):
        """Check if retraining is due"""
        if self.last_training_date is None:
            return True
        
        days_since = (datetime.now() - self.last_training_date).days
        return days_since >= self.retrain_interval_days
    
    def save_model(self):
        """Persist model and scaler to disk"""
        os.makedirs(self.model_path, exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        model_file = f"{self.model_path}model_{timestamp}.pkl"
        scaler_file = f"{self.model_path}scaler_{timestamp}.pkl"
        
        with open(model_file, 'wb') as f:
            pickle.dump(self.model, f)
        with open(scaler_file, 'wb') as f:
            pickle.dump(self.scaler, f)
        
        self.last_training_date = datetime.now()
        return model_file, scaler_file
    
    def predict(self, X_new):
        """Make predictions using current model"""
        if self.model is None or self.scaler is None:
            raise ValueError("Model not trained yet")
        
        X_scaled = self.scaler.transform(X_new)
        return self.model.predict(X_scaled)

# Usage
pipeline = BatchLearningPipeline(retrain_interval_days=7)

if pipeline.should_retrain():
    # Load accumulated data (db_connection is a DB-API/SQLAlchemy
    # connection to the table queried in load_batch_data)
    batch_data = pipeline.load_batch_data(db_connection)
    
    # Train
    metrics = pipeline.train_batch(batch_data)
    
    # Save
    pipeline.save_model()
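
In production, this usage block would typically run under an external scheduler (cron, Airflow, or similar) at the chosen retrain interval, with the pickled artifacts loaded by a separate serving process. The exact orchestration tool is a deployment choice, not part of the pipeline itself.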

Training Batch vs. Prediction Batch

Training Batch

  • Large dataset used to train the model
  • Processed once or periodically
  • Computationally intensive but done offline

Prediction Batch

  • New data to generate predictions on
  • Can be smaller, processed more frequently
  • Used with an already-trained model (see the sketch below)
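
As an illustrative sketch, a prediction batch can be scored in fixed-size chunks with the already-trained pipeline from above; the Parquet file name and chunk size here are hypothetical placeholders.

import pandas as pd

# Score a prediction batch in chunks to bound memory usage.
# 'scoring_data.parquet' (hypothetical) holds the same feature
# columns that the pipeline was trained on.
scoring_df = pd.read_parquet('scoring_data.parquet')

chunk_size = 10_000
predictions = []
for start in range(0, len(scoring_df), chunk_size):
    chunk = scoring_df.iloc[start:start + chunk_size]
    predictions.extend(pipeline.predict(chunk))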

Technical Limitations

  • Concept Drift: Model degrades as data distribution changes between batches
  • Latency: There is an inherent delay between data collection and the next model update
  • Data Staleness: Predictions come from an increasingly stale model between retrain cycles
  • Computational Resources: Large batches require significant processing power
  • Batch Dependency: Requires accumulating sufficient data before training
  • No Real-Time Adaptation: Cannot respond to rapid changes instantly

Performance Considerations

Training Efficiency

  • Vectorization: Leverage NumPy/Pandas vectorized operations instead of Python loops
  • Parallelization: Spread work across CPU cores (e.g., scikit-learn's n_jobs parameter)
  • Sampling: Use stratified sampling so downsampled batches stay representative (sketched below)
  • Data Formats: Prefer columnar formats such as Parquet over CSV for faster I/O
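
A minimal sketch of the stratified-sampling point, using synthetic imbalanced data (the 9:1 class ratio and sizes are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows with roughly a 9:1 class imbalance
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)

# stratify=y preserves the class ratio in both splits, so a
# downsampled training batch stays representative of the full batch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)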

Memory Management

  • Chunking: Process large batches in fixed-size chunks (see the sketch after this list)
  • Disk Storage: Use databases for efficient data storage
  • Compression: Compress historical data
  • Cleanup: Archive old data regularly
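
One way to realize the chunking point is pandas' chunksize reader, which streams a large file instead of loading it whole; the file path and chunk size below are placeholders.

import pandas as pd

# Stream a large batch file in 100,000-row chunks rather than
# loading it into memory at once ('training_data.csv' is a placeholder)
parts = []
for chunk in pd.read_csv('training_data.csv', chunksize=100_000):
    chunk = chunk.dropna()  # per-chunk preprocessing bounds peak memory
    parts.append(chunk)

batch = pd.concat(parts, ignore_index=True)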

Monitoring

  • Performance Drift: Track model accuracy over time
  • Data Drift: Monitor input data distribution changes between batches (illustrated below)
  • Version Control: Track model versions for rollback
  • Logging: Log predictions for evaluation
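
A simple data-drift check, sketched below, compares each numeric feature's distribution in the current batch against the previous training batch with a two-sample Kolmogorov-Smirnov test from SciPy; the 0.05 significance level is an illustrative choice, not a standard.

from scipy.stats import ks_2samp

def detect_drift(reference_df, current_df, alpha=0.05):
    """Flag numeric columns whose distribution shifted between batches."""
    drifted = []
    for col in reference_df.select_dtypes(include='number').columns:
        stat, p_value = ks_2samp(reference_df[col], current_df[col])
        if p_value < alpha:  # reject "same distribution" at level alpha
            drifted.append((col, p_value))
    return drifted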

Best Practices

  • Schedule Consistently: Regular retraining on fixed schedule
  • Data Validation: Check data quality before training
  • Holdout Test Set: Use consistent test set across batches
  • Baseline Comparison: Compare the new model against the current production model on the same holdout set (see the sketch after this list)
  • Gradual Rollout: Deploy to a sample of users before full rollout
  • Monitoring Dashboard: Track model performance metrics continuously
  • Versioning: Maintain model versions for debugging and rollback
  • Documentation: Record training parameters and data characteristics
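
The baseline-comparison point might look like the sketch below: promote the newly trained challenger only if it beats the production champion on the shared holdout set by a minimum margin. The metric and the min_improvement threshold are illustrative assumptions, not fixed rules.

from sklearn.metrics import roc_auc_score

def should_promote(new_model, prod_model, X_holdout, y_holdout,
                   min_improvement=0.002):
    """Promote the challenger only if it beats the champion on AUC-ROC
    by at least min_improvement (an illustrative threshold)."""
    new_auc = roc_auc_score(y_holdout, new_model.predict_proba(X_holdout)[:, 1])
    prod_auc = roc_auc_score(y_holdout, prod_model.predict_proba(X_holdout)[:, 1])
    return new_auc >= prod_auc + min_improvement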

Batch Learning vs. Online Learning

Aspect         Batch Learning          Online Learning
Data           Fixed dataset           Streaming data
Frequency      Periodic                Continuous
Latency        Higher                  Lower
Computation    Intense but offline     Lighter but continuous
Adaptation     Periodic jumps          Gradual changes
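
For contrast with the batch pipeline above, scikit-learn's SGDClassifier supports incremental updates through partial_fit, which is the online-learning column of this table in miniature; the mini-batch stream below is synthetic, so the sizes and decision rule are illustrative only.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Online learning in miniature: the model is updated one mini-batch
# at a time instead of being retrained on the full accumulated dataset
rng = np.random.default_rng(0)
model = SGDClassifier(loss='log_loss', random_state=0)

for step in range(10):  # simulate a data stream
    X_mini = rng.normal(size=(100, 5))
    y_mini = (X_mini[:, 0] > 0).astype(int)
    model.partial_fit(X_mini, y_mini, classes=[0, 1])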

Use Cases for Batch Learning

  • Monthly Churn Prediction: Retrain weekly with accumulated data
  • Quarterly Demand Forecasting: Retrain monthly with new sales data
  • Annual Credit Scoring: Retrain quarterly with new loan data
  • Batch Recommendation Systems: Update recommendations nightly
  • Periodic Report Generation: Generate insights weekly or monthly

Future Implications

Near Term: Hybrid approaches combining batch and online learning will become standard

Long Term: Most systems will shift toward continuous learning with adaptive batch windows

Tags

Machine Learning · Batch Processing · Data Science