Knowledge Base
Technical · Neural Networks · 6 min read

Understanding Artificial Neural Networks: Technical Level

Technical deep-dive into artificial neural networks, architecture design, and implementation best practices.

AI Guru Team

6 November 2024

Technical Definition

Artificial neural networks are computational models inspired by the biological neural networks found in the brain. They consist of interconnected nodes (neurons) organized in layers; each connection carries an adjustable weight that lets the network learn patterns from data.

Network Architecture

Basic Components

Neurons (Nodes)

  • Receive weighted inputs
  • Apply activation function
  • Pass output to next layer

Weights

  • Multiply input values
  • Adjusted during training via backpropagation
  • Store learned information

Biases

  • Added to weighted sum
  • Help shift activation function
  • Improve model expressiveness

Activation Functions

  • ReLU: max(0, x) - the standard non-linearity for hidden layers
  • Sigmoid: 1/(1+e^-x) - squashes outputs into (0, 1), common for binary classification
  • Tanh: (e^x - e^-x)/(e^x + e^-x) - zero-centered outputs in (-1, 1)
  • Softmax: exponentiates and normalizes scores into a probability distribution - used for multi-class outputs
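A single neuron's computation (weighted sum plus bias, passed through an activation) and the activations above can be sketched directly in NumPy. The function names here are our own, not from a library:

```python
import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0, x)

def sigmoid(x):
    # squashes any real value into (0, 1)
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

# tanh is available directly as np.tanh

# One neuron: weighted sum of inputs, plus bias, through an activation
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.4, 0.3, -0.2])   # weights
b = 0.1                          # bias
z = np.dot(x, w) + b             # weighted sum = -0.4
print(relu(z))                   # 0.0 (negative pre-activation is clipped)
print(sigmoid(0.0))              # 0.5
```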

Network Layers

Input Layer
    ↓
Hidden Layers (Feature Learning)
    ↓
Output Layer (Prediction)
  • Input Layer: Receives raw features
  • Hidden Layers: Extract features and patterns
  • Output Layer: Produces predictions

Code Example: Feed-Forward Neural Network

import numpy as np
from typing import List, Tuple

class NeuralNetwork:
    def __init__(self, layer_sizes: List[int], learning_rate: float = 0.01):
        """
        Initialize neural network with specified architecture
        
        Args:
            layer_sizes: List of neuron counts per layer
            learning_rate: Learning rate for gradient descent
        """
        self.learning_rate = learning_rate
        self.weights = []
        self.biases = []
        self.layer_sizes = layer_sizes
        
        # Initialize weights (He initialization, suited to ReLU) and biases
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i + 1]))
            self.weights.append(w)
            self.biases.append(b)
    
    def relu(self, x):
        """ReLU activation function"""
        return np.maximum(0, x)
    
    def relu_derivative(self, x):
        """ReLU derivative for backpropagation"""
        return (x > 0).astype(float)
    
    def sigmoid(self, x):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def sigmoid_derivative(self, s):
        """Sigmoid derivative, written in terms of the sigmoid output s"""
        return s * (1 - s)
    
    def forward(self, X: np.ndarray) -> Tuple[np.ndarray, List]:
        """
        Forward propagation through network
        
        Args:
            X: Input data (samples, features)
            
        Returns:
            Output predictions and cache for backpropagation
        """
        cache = []
        A = X
        
        for i in range(len(self.weights)):
            Z = np.dot(A, self.weights[i]) + self.biases[i]
            
            # Use ReLU for hidden layers, sigmoid for output
            if i < len(self.weights) - 1:
                A = self.relu(Z)
            else:
                A = self.sigmoid(Z)
            
            cache.append((Z, A))
        
        return A, cache
    
    def backward(self, X: np.ndarray, y: np.ndarray, 
                 output: np.ndarray, cache: List) -> None:
        """
        Backward propagation to compute gradients
        
        Args:
            X: Input data
            y: Target labels
            output: Network output
            cache: Cached values from forward pass
        """
        m = X.shape[0]  # Number of samples
        
        # With a sigmoid output and binary cross-entropy loss, the
        # output-layer gradient simplifies to (output - y) / m
        dZ = (output - y) / m
        
        for i in reversed(range(len(self.weights))):
            # Activation feeding into layer i (the raw input for layer 0)
            A_prev = cache[i - 1][1] if i > 0 else X
            
            dW = np.dot(A_prev.T, dZ)
            dB = np.sum(dZ, axis=0, keepdims=True)
            
            # Propagate the gradient to the previous layer's pre-activation
            # (computed before this layer's weights are updated)
            if i > 0:
                dA = np.dot(dZ, self.weights[i].T)
                Z_prev = cache[i - 1][0]
                dZ = dA * self.relu_derivative(Z_prev)
            
            # Update weights and biases
            self.weights[i] -= self.learning_rate * dW
            self.biases[i] -= self.learning_rate * dB
    
    def train(self, X: np.ndarray, y: np.ndarray, 
              epochs: int = 100, batch_size: int = 32):
        """
        Train the neural network
        
        Args:
            X: Training data
            y: Training labels
            epochs: Number of training iterations
            batch_size: Samples per batch
        """
        losses = []
        
        for epoch in range(epochs):
            epoch_loss = 0
            num_batches = X.shape[0] // batch_size
            
            for batch in range(num_batches):
                start_idx = batch * batch_size
                end_idx = start_idx + batch_size
                
                X_batch = X[start_idx:end_idx]
                y_batch = y[start_idx:end_idx]
                
                # Forward and backward pass
                output, cache = self.forward(X_batch)
                self.backward(X_batch, y_batch, output, cache)
                
                # Compute loss
                loss = -np.mean(y_batch * np.log(output + 1e-8) + 
                               (1 - y_batch) * np.log(1 - output + 1e-8))
                epoch_loss += loss
            
            epoch_loss /= num_batches
            losses.append(epoch_loss)
            
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss:.4f}")
        
        return losses
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make predictions on new data"""
        output, _ = self.forward(X)
        return (output > 0.5).astype(int)

# Usage: synthetic data with random labels, so accuracy will hover near chance
X_train = np.random.randn(100, 10)
y_train = np.random.randint(0, 2, (100, 1))

nn = NeuralNetwork([10, 16, 8, 1], learning_rate=0.01)
losses = nn.train(X_train, y_train, epochs=100, batch_size=16)

y_pred = nn.predict(X_train)
accuracy = np.mean(y_pred == y_train)
print(f"Training Accuracy: {accuracy:.4f}")

Implementation Requirements

Hardware

  • GPUs: NVIDIA CUDA-capable for faster training
  • Memory: Sufficient RAM for model parameters and batch data
  • Processing: Multi-core CPUs for parallelization

Software

  • Python with TensorFlow or PyTorch
  • NumPy for numerical operations
  • Matplotlib/Seaborn for visualization

Data Requirements

  • Normalized inputs (zero mean, unit variance)
  • Sufficient training samples (rule of thumb: 10+ samples per parameter)
  • Labeled data for supervised learning
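Zero-mean, unit-variance normalization is a few lines with NumPy; the statistics should be computed on the training set only and reused for validation and test data (the array values here are illustrative):

```python
import numpy as np

X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])

# Fit normalization statistics on the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_norm = (X_train - mean) / std
print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]

# Apply the same statistics to new data at inference time
x_new = (np.array([[2.5, 500.0]]) - mean) / std
```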

Technical Limitations

  • Overfitting: Networks can memorize training data instead of generalizing
  • Vanishing Gradients: Training deep networks with many layers becomes difficult
  • Computational Cost: Training requires significant compute resources
  • Hyperparameter Tuning: Many hyperparameters to optimize
  • Black Box Nature: Difficult to interpret what network learned
  • Data Requirements: Need large labeled datasets for good performance
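The vanishing-gradient problem can be seen numerically: the sigmoid derivative is at most 0.25, so a gradient multiplied through many sigmoid layers shrinks toward zero. This is a simplified sketch that ignores weight matrices:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
grad = 1.0
for _ in range(20):
    z = rng.normal()        # a pre-activation somewhere in the network
    s = sigmoid(z)
    grad *= s * (1 - s)     # local sigmoid derivative, at most 0.25

# After 20 layers the surviving gradient is bounded by 0.25**20 ≈ 9e-13
print(grad < 1e-12)  # True
```

ReLU avoids this saturation for positive inputs, which is one reason it is preferred in deep hidden layers.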

Performance Considerations

Training Optimization

  • Batch Normalization: Stabilize training with normalized layer inputs
  • Dropout: Regularization technique to prevent overfitting
  • Early Stopping: Stop training when validation loss stops improving
  • Learning Rate Scheduling: Decay learning rate during training
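Two of the techniques above, learning-rate decay and early stopping, reduce to a few lines of bookkeeping. The class and function names here are our own, and the loss sequence is illustrative:

```python
# Exponential learning-rate decay
def decayed_lr(initial_lr, epoch, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate([0.9, 0.7, 0.6, 0.61, 0.62, 0.63]):
    lr = decayed_lr(0.01, epoch)
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
print(stopped_at)  # 5: loss last improved at epoch 2, patience ran out
```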

Inference Optimization

  • Model Quantization: Reduce precision to lower memory and compute
  • Pruning: Remove unimportant connections
  • Knowledge Distillation: Train smaller model from larger one
  • Caching: Store intermediate activations for common patterns
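Model quantization, in its simplest symmetric form, maps float weights to 8-bit integers plus one scale factor, cutting memory fourfold versus float32. A minimal sketch:

```python
import numpy as np

def quantize(w):
    # Symmetric int8 quantization: one scale factor per tensor
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.88], dtype=np.float32)
q, scale = quantize(w)
w_hat = dequantize(q, scale)

# Round-trip error is bounded by scale / 2
print(np.max(np.abs(w - w_hat)))
```

Production frameworks use per-channel scales and calibration data, but the core idea is the same.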

Best Practices

  • Data Preprocessing: Normalize and standardize inputs
  • Train-Validation-Test Split: Typical 60-20-20 split
  • Monitor Loss: Track both training and validation loss
  • Regularization: Use dropout and L1/L2 to prevent overfitting
  • Hyperparameter Search: Use grid or random search systematically
  • Ensemble Methods: Combine multiple models for better performance
  • Save Best Model: Checkpoint model during training
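The 60-20-20 split mentioned above can be done with a shuffled index array (the seed is fixed only so the split is reproducible):

```python
import numpy as np

# Shuffle indices once, then carve out 60/20/20 slices
rng = np.random.default_rng(42)
n_samples = 100
indices = rng.permutation(n_samples)

train_idx = indices[:60]
val_idx = indices[60:80]
test_idx = indices[80:]

print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```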

References

  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). "Deep Learning"
  • Glorot, X., & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks"
  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving Deep into Rectifiers"

Future Implications

Near Term: Advances in neural architecture search automating model design

Long Term: More interpretable neural networks and integration with symbolic reasoning

Tags

Deep Learning · Neural Networks · Machine Learning