Multilayer Perceptron
Multilayer Perceptrons (MLPs) are a class of feedforward neural networks characterized by a layered structure: an input layer, one or more hidden layers, and an output layer. Each layer comprises neurons that are fully connected to the neurons in the subsequent layer through weighted connections.
Structure of MLPs
- Input Layer
  - Receives the initial data.
  - Passes the input data to the first hidden layer.
- Hidden Layers
  - Perform transformations on the input data.
  - Utilize non-linear activation functions to model complex relationships.
  - Each neuron connects to every neuron in the following layer.
- Output Layer
  - Produces the final prediction or classification.
Neurons and Weights
- Neurons
  - Basic units of MLPs that process input data.
  - Apply activation functions to weighted sums of their inputs.
- Weights
  - Parameters that determine the strength of connections between neurons.
  - Adjusted during training to minimize prediction errors.
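As an illustration, a single neuron's computation, a weighted sum of its inputs plus a bias, passed through an activation function, can be sketched in a few lines of plain Python (the function names here are illustrative, not part of any library):

```python
def neuron(inputs, weights, bias, activation):
    # Weighted sum of inputs plus bias, passed through the activation function
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

relu = lambda z: max(0.0, z)

# 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, and ReLU leaves positive values unchanged
output = neuron([1.0, 2.0], [0.5, -0.25], 0.1, relu)
print(output)  # 0.1
```

A full layer applies this computation once per neuron, and an MLP chains such layers together.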
Hidden Layers
Hidden layers are pivotal in enabling MLPs to capture and model intricate patterns within data. They achieve this through transformations that incorporate non-linear activation functions.
Role of Hidden Layers
- Data Transformation
  - Convert inputs from the previous layer into a form suitable for the next layer.
- Non-Linear Activation Functions
  - Introduce non-linearity, allowing the network to learn complex mappings.
  - Essential for the network's ability to approximate non-linear functions.
Importance of Non-Linearity
Without non-linear activation functions, any stack of layers collapses into a single linear transformation, regardless of depth, severely restricting the network's modeling capabilities.
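This collapse is easy to verify numerically: composing two linear maps is itself a linear map. A quick sketch using NumPy (the matrices are arbitrary, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights of a first "layer"
W2 = rng.normal(size=(2, 4))   # weights of a second "layer"
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer whose weight matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```

Inserting a non-linearity such as ReLU between the two matrix multiplications breaks this equivalence, which is what gives depth its expressive power.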
Activation Functions
Activation functions are crucial in determining the output of each neuron, influencing the network's ability to learn and perform.
ReLU (Rectified Linear Unit)
- Definition
  - f(x) = max(0, x)
- Characteristics
  - Retains positive inputs unchanged.
  - Converts negative inputs to zero.
  - Introduces non-linearity.
  - Computationally efficient.
- Usage
  - Widely used due to its simplicity and effectiveness.
  - Facilitates faster training and mitigates the vanishing gradient problem.
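The definition above translates directly into code; a minimal sketch in plain Python:

```python
def relu(x):
    # Keep positive inputs unchanged; zero out negatives
    return max(0.0, x)

print([relu(v) for v in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]
```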
Sigmoid Function
- Definition
  - σ(x) = 1 / (1 + e^(-x))
- Characteristics
  - Maps inputs to a range between 0 and 1.
  - Smooth gradient.
- Usage
  - Suitable for binary classification tasks.
  - Historically significant but less favored in modern architectures due to issues like vanishing gradients.
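A direct implementation of the sigmoid formula, showing how it squashes any real input into the open interval (0, 1):

```python
import math

def sigmoid(x):
    # 1 / (1 + e^(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))              # 0.5
print(round(sigmoid(5.0), 3))    # 0.993
```

The flat tails visible here (values near 0 or 1 for large |x|) are exactly where gradients vanish, which is why sigmoid fell out of favor for deep hidden layers.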
Tanh Function
- Definition
  - tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Characteristics
  - Maps inputs to a range between -1 and 1.
  - Zero-centered output, improving learning dynamics.
- Usage
  - Can be more effective than sigmoid in certain scenarios.
  - Helps achieve faster convergence during training.
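Tanh is in fact a rescaled, re-centered sigmoid, tanh(x) = 2·σ(2x) − 1, which is easy to check numerically:

```python
import math

def tanh_via_sigmoid(x):
    # tanh is a rescaled, re-centered sigmoid: tanh(x) = 2*sigmoid(2x) - 1
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    return 2.0 * sigmoid(2.0 * x) - 1.0

print(math.tanh(0.0))                                       # 0.0
print(abs(tanh_via_sigmoid(0.7) - math.tanh(0.7)) < 1e-12)  # True
```

The rescaling is what centers the output around zero, which is the property credited above with improving learning dynamics.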
Universal Approximation Theorem
The Universal Approximation Theorem states that an MLP with even a single hidden layer, given sufficiently many neurons, can approximate any continuous function on a compact domain to arbitrary accuracy. This theoretical foundation underscores the flexibility and power of MLPs in modeling complex functions.
Practical Considerations
- Network Architecture
  - The number of hidden layers and neurons impacts the network's ability to approximate functions.
- Weight Initialization
  - Proper initialization is critical for effective training.
- Training Algorithms
  - Optimization methods and learning rates influence the network's performance and convergence.
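One way to experiment with the architecture choices above is to parameterize the network by a list of hidden-layer widths. The helper below is a hypothetical sketch (`make_mlp` is not a PyTorch API, just an illustration):

```python
from torch import nn

def make_mlp(in_dim, hidden_dims, out_dim):
    # One Linear + ReLU pair per requested hidden width, then a final Linear
    layers = []
    prev = in_dim
    for width in hidden_dims:
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

model = make_mlp(28 * 28, [512, 512], 10)
print(len(model))  # 5 modules: Linear, ReLU, Linear, ReLU, Linear
```

Varying `hidden_dims` then lets you compare shallow-wide against deep-narrow configurations under the same training loop.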
Example: PyTorch Implementation
Below is an example of implementing an MLP using PyTorch, a popular deep learning framework.
For a comprehensive guide, refer to the Quickstart — PyTorch Tutorials.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Downloading and loading the dataset
train_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

batch_size = 64
train_dataloader = DataLoader(train_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

# Define the MLP model
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

# Setting up the device, model, loss function, and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MLP().to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Training and testing functions
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f}")

# Running the training and testing
epochs = 5
for epoch in range(epochs):
    print(f"Epoch {epoch+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")
Explanation of the Code
- Data Loading
  - Utilizes the FashionMNIST dataset.
  - Transforms images to tensor format for processing.
- Model Definition
  - The MLP class defines the network architecture.
  - Consists of two hidden layers with ReLU activation functions.
  - The output layer has 10 neurons corresponding to the 10 classes.
- Training Process
  - Uses the Stochastic Gradient Descent (SGD) optimizer.
  - The cross-entropy loss function measures prediction error.
  - The training loop iterates over epochs, updating weights to minimize loss.
- Evaluation
  - The model's performance is assessed on the test dataset.
  - Reports accuracy and average loss.
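Before committing to a full training run, it can be useful to sanity-check tensor shapes. The sketch below pushes a dummy FashionMNIST-sized batch through a freshly initialized copy of the same architecture (not the trained model from the listing, and the random batch is purely illustrative):

```python
import torch
from torch import nn

# Untrained copy of the same architecture, for a shape check only
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
batch = torch.randn(64, 1, 28, 28)  # 64 dummy grayscale 28x28 images
logits = mlp(batch)
print(logits.shape)  # torch.Size([64, 10])
```

Each row of `logits` holds one unnormalized score per class; `CrossEntropyLoss` applies the softmax internally, which is why the model returns raw logits.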