Effective Techniques to Decrease Learning Rate for Adam Optimizer in PyTorch
- The learning rate controls how much the model's weights are adjusted during training.
- A high learning rate can lead to the model oscillating or diverging, while a low learning rate can make training slow.
- Decreasing the learning rate (learning rate decay) is often beneficial as training progresses, allowing the model to fine-tune near the optimal solution.
Methods for Learning Rate Decay in PyTorch with Adam:
- Manual Decay:
- Directly update the learning rate stored in the optimizer's param_groups after each epoch or after a set number of iterations.
- This method gives you fine-grained control but requires manual intervention.

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(num_epochs):
    # Train loop...
    for param_group in optimizer.param_groups:
        param_group['lr'] *= 0.9  # Reduce the learning rate by 10% after each epoch
- Learning Rate Schedulers:
- PyTorch provides built-in schedulers that automatically adjust the learning rate based on a predefined strategy.
- This approach is more flexible and avoids the need for manual updates.
a) ReduceLROnPlateau:
- Reduces the learning rate when a monitored metric (e.g., validation loss) stops improving for a specified number of epochs (patience).
- Useful when training progress plateaus.

from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = ReduceLROnPlateau(optimizer, factor=0.1, patience=3)  # Multiply LR by 0.1 after 3 epochs without improvement

for epoch in range(num_epochs):
    # Train loop...
    val_loss = ...  # Validation loss from your evaluation loop
    scheduler.step(val_loss)  # Update the scheduler with the validation loss
b) ExponentialLR:
- Multiplies the learning rate by a constant factor (gamma) at regular intervals.
- Simple approach for general learning rate decay.

from torch.optim.lr_scheduler import ExponentialLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = ExponentialLR(optimizer, gamma=0.9)  # Reduce the LR by 10% every epoch

for epoch in range(num_epochs):
    # Train loop...
    scheduler.step()  # Update the scheduler at the end of each epoch
c) CosineAnnealingLR:
- Gradually reduces the learning rate using a cosine annealing schedule, reaching a minimum learning rate at the end of training.
- Can be helpful for fine-tuning in the later stages.
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)  # Anneal to the minimum LR over num_epochs epochs

for epoch in range(num_epochs):
    # Train loop...
    scheduler.step()  # Update the scheduler at the end of each epoch
Choosing the Right Method:
- Manual decay offers the most control but requires constant monitoring.
- Learning rate schedulers are more automated but may require experimentation to find optimal hyperparameters (e.g., factor, patience, gamma). Either way, it helps to log the current learning rate during training, as shown in the sketch below.
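Whichever method you choose, it is worth verifying that the learning rate actually changes as expected. Here is a minimal sketch of how you might log it each epoch; model and num_epochs are placeholders carried over from the examples above, and ExponentialLR is used only for illustration:

import torch
from torch.optim.lr_scheduler import ExponentialLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(num_epochs):
    # Train loop...
    scheduler.step()
    # Read the learning rate directly from the optimizer's param_groups...
    current_lr = optimizer.param_groups[0]['lr']
    # ...or ask the scheduler for the value it last set
    print(f"epoch {epoch}: lr = {current_lr:.6f}, scheduler reports {scheduler.get_last_lr()}")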
Additional Considerations:
- Experiment with different decay methods and hyperparameters to find what works best for your specific model and dataset.
- Consider techniques like warmup (gradually increasing the learning rate initially) for better convergence in some cases.
import torch.optim as optim

# Example 1: Manual learning rate decay
# Define your model...
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(num_epochs):
    # Train loop...
    # Manual learning rate decay (adjust the factor and frequency as needed)
    for param_group in optimizer.param_groups:
        param_group['lr'] *= 0.9  # Reduce the learning rate by 10% after each epoch
    # Rest of your training code...
# Example 2: ReduceLROnPlateau
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = ReduceLROnPlateau(optimizer, factor=0.1, patience=3, verbose=True)  # Multiply LR by 0.1 after 3 epochs without improvement; verbose prints each change

for epoch in range(num_epochs):
    # Train loop...
    # Track validation loss for ReduceLROnPlateau
    val_loss = ...  # Calculate validation loss
    # Update the scheduler with the validation loss
    scheduler.step(val_loss)
    # Rest of your training code...
# Example 3: ExponentialLR
from torch.optim.lr_scheduler import ExponentialLR

optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = ExponentialLR(optimizer, gamma=0.9, verbose=True)  # Reduce the LR by 10% every epoch, with verbosity

for epoch in range(num_epochs):
    # Train loop...
    # Update the scheduler at the end of each epoch
    scheduler.step()
    # Rest of your training code...
# Example 4: CosineAnnealingLR
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, verbose=True)  # Anneal to the minimum LR over num_epochs epochs, with verbosity

for epoch in range(num_epochs):
    # Train loop...
    # Update the scheduler at the end of each epoch
    scheduler.step()
    # Rest of your training code...
Explanation of Improvements:
- Added comments for clarity.
- Included verbose=True in some schedulers for informative output (note that verbose is deprecated in recent PyTorch releases; printing scheduler.get_last_lr() is the forward-compatible alternative).
- Emphasized the importance of calculating validation loss for ReduceLROnPlateau.
- Highlighted the need to adjust hyperparameters (factor, patience, gamma, T_max) for your specific task.
LambdaLR (Custom Decay Function):
This scheduler allows you to define a custom learning rate function based on the current epoch. It provides more flexibility than the fixed decay rules above.
from torch.optim.lr_scheduler import LambdaLR

def lr_lambda(epoch):
    # The returned value multiplies the initial LR, so LR = 0.01 * 0.95 ** epoch
    return 0.95 ** epoch  # Reduce by 5% every epoch

optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

for epoch in range(num_epochs):
    # Train loop...
Cyclic Learning Rate (CLR):
CLR involves periodically increasing and decreasing the learning rate during training. This can help the model escape local minima and improve generalization.
PyTorch provides this as torch.optim.lr_scheduler.CyclicLR. Here's a basic example:
from torch.optim.lr_scheduler import CyclicLR

optimizer = optim.Adam(model.parameters(), lr=0.01)
# cycle_momentum must be False for Adam, which has no momentum parameter
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.1, step_size_up=2000,
                     step_size_down=2000, cycle_momentum=False)  # Adjust hyperparameters to your dataset size

for epoch in range(num_epochs):
    for batch in train_loader:  # train_loader: your DataLoader
        # Train step...
        scheduler.step()  # Update the scheduler after each batch
Gradual Warmup:
This technique gradually increases the learning rate from a very low initial value to the actual learning rate over a few epochs. It can improve convergence and stability, especially for complex models or datasets.
optimizer = optim.Adam(model.parameters(), lr=0.01)
base_lr = 0.01
warmup_epochs = 5  # Adjust as needed

for epoch in range(num_epochs):
    if epoch < warmup_epochs:
        # Linear warmup: ramp the LR from base_lr / warmup_epochs up to base_lr
        new_lr = (epoch + 1) * (base_lr / warmup_epochs)
        for param_group in optimizer.param_groups:
            param_group['lr'] = new_lr
    # Rest of your training code...
The best method depends on your specific problem and dataset. Here are some general guidelines:
- Manual Decay: Simple and offers control, but requires monitoring.
- ReduceLROnPlateau: Effective when training stalls, but sensitive to hyperparameter tuning.
- ExponentialLR/CosineAnnealingLR: Simple decay strategies for general training.
- LambdaLR: Provides more control with custom learning rate functions.
- CLR: May help escape local minima and improve generalization.
- Gradual Warmup: Can improve convergence in complex scenarios; see the sketch below for combining warmup with a decay schedule.
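If you want warmup followed by decay without hand-rolling the loop shown earlier, you can chain built-in schedulers. A minimal sketch, assuming a recent PyTorch version (roughly 1.10 or newer, which provides LinearLR and SequentialLR) and treating model and num_epochs as placeholders:

import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = optim.Adam(model.parameters(), lr=0.01)
warmup_epochs = 5  # Adjust as needed

# Ramp the LR from 10% of its base value up to 100% over warmup_epochs epochs
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs)
# Then anneal the LR down toward zero over the remaining epochs
decay = CosineAnnealingLR(optimizer, T_max=num_epochs - warmup_epochs)
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[warmup_epochs])

for epoch in range(num_epochs):
    # Train loop...
    scheduler.step()  # Advance the combined schedule once per epoch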