Intro
Adam with AMSGrad combines adaptive learning rates with a convergence guarantee, making gradient‑based training more reliable. This guide walks through the algorithm’s mechanics, practical code, and key pitfalls. By the end you will know exactly how to swap in AMSGrad in PyTorch or TensorFlow and why it sometimes outperforms the standard Adam.
Key Takeaways
- Adam with AMSGrad replaces the moving average of squared gradients in the update denominator with its running maximum, preventing the effective learning rate from inflating.
- The algorithm requires three hyper‑parameters: learning rate (α), exponential decay rates (β₁, β₂), and a small ε for numerical stability.
- Implementation in most frameworks needs only a flag change; no custom gradient clipping is required.
- Empirical studies show AMSGrad can converge faster on sparse‑gradient problems, but may lag on dense‑gradient tasks.
- Monitoring loss curves and gradient norms helps detect when the AMSGrad update diverges.
What is Adam with AMSGrad?
Adam with AMSGrad is a variant of the Adam optimizer that corrects a known theoretical issue with the original algorithm’s convergence proof. It maintains the first‑moment (m) and second‑moment (v) estimates but caps the second‑moment at its running maximum, ensuring the effective step size never grows beyond the best observed value. The modification adds a single line of code in most libraries while preserving the adaptive per‑parameter learning rates that make Adam popular.
Why Adam with AMSGrad Matters
Standard Adam can produce step sizes larger than theoretically justified, leading to divergent behavior on some non‑convex loss surfaces. By forcing v̂ₜ to be non‑decreasing, AMSGrad provides a tighter bound on the regret, which translates into more stable training on deep networks and reinforcement‑learning agents. The original AMSGrad paper demonstrates empirical gains on benchmark tasks such as CIFAR‑10 and language modeling.
How Adam with AMSGrad Works
The update proceeds in three stages each iteration:
# Pseudocode for one step of Adam with AMSGrad
g_t = gradient(loss, params) # current gradient
m_t = β₁ * m_{t-1} + (1 - β₁) * g_t # first‑moment estimate
v_t = β₂ * v_{t-1} + (1 - β₂) * (g_t ** 2) # second‑moment estimate
# Bias correction
m_hat = m_t / (1 - β₁ ** t)
v_hat = v_t / (1 - β₂ ** t)
v_hat_max = max(v_hat_max_{t-1}, v_hat) # cap second‑moment at its running max
θ_{t+1} = θ_t - α * m_hat / (√v_hat_max + ε) # parameter update
Key points:
- The first‑moment (m) mirrors momentum, smoothing noisy gradients.
- The second‑moment (v) scales each parameter update inversely to the magnitude of its gradient history.
- The max operation ensures v̂ never decreases, guaranteeing a non‑increasing effective learning‑rate schedule.
- Bias correction mitigates the initialization bias of the exponentially weighted averages.
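The update above can be sketched as a minimal, framework‑free implementation for a single scalar parameter (plain Python; the helper name amsgrad_step and the dict‑based state are illustrative, not a library API):

```python
import math

def amsgrad_step(theta, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update for a scalar parameter; `state` is mutated in place."""
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * g        # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g    # second moment
    m_hat = state["m"] / (1 - beta1 ** t)                    # bias correction
    v_hat = state["v"] / (1 - beta2 ** t)
    state["v_hat_max"] = max(state["v_hat_max"], v_hat)      # AMSGrad cap
    return theta - lr * m_hat / (math.sqrt(state["v_hat_max"]) + eps)

# Usage: minimize L(theta) = theta**2, whose gradient is 2*theta
state = {"t": 0, "m": 0.0, "v": 0.0, "v_hat_max": 0.0}
theta = 1.0
for _ in range(100):
    theta = amsgrad_step(theta, 2 * theta, state, lr=0.1)
```

Because v_hat_max only ever grows, the step taken per unit of gradient signal can only shrink as training proceeds.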
Used in Practice
Switching to AMSGrad requires only a parameter flag in PyTorch or TensorFlow:
# PyTorch implementation
import torch
opt = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
# TensorFlow / Keras implementation
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, amsgrad=True)
When training a ResNet‑50 on ImageNet, enabling amsgrad=True sometimes improves final top‑1 accuracy, but any gain is task‑ and schedule‑dependent rather than guaranteed. For language models such as the Transformer, the flag often stabilizes training on long sequences by preventing sudden loss spikes.
Risks / Limitations
AMSGrad’s capped second moment can slow convergence on problems where gradient magnitudes naturally shrink over time, because the denominator never forgets early large gradients. It also adds a small memory overhead for storing the running maximum. And because the algorithm still relies on exponential moving averages, it remains sensitive to the choice of β₂, especially over very long training runs.
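The first limitation can be seen numerically: with contrived gradients that decay toward zero, the bias‑corrected second moment shrinks, but its running maximum does not, so AMSGrad's per‑unit‑learning‑rate step stays small (plain‑Python sketch; the decay schedule is an arbitrary choice for illustration):

```python
import math

beta2, eps = 0.999, 1e-8
v = v_hat_max = 0.0
t = 0
for k in range(200):
    g = 10.0 * 0.9 ** k                  # gradients decay toward zero
    t += 1
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = v / (1 - beta2 ** t)         # Adam's bias-corrected second moment
    v_hat_max = max(v_hat_max, v_hat)    # AMSGrad's running maximum

adam_scale = 1.0 / (math.sqrt(v_hat) + eps)     # Adam step per unit of m_hat
ams_scale = 1.0 / (math.sqrt(v_hat_max) + eps)  # AMSGrad step, strictly smaller here
```

Here Adam's denominator has decayed along with the gradients, while AMSGrad's is still pinned at the early maximum, so AMSGrad takes noticeably smaller steps.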
Adam with AMSGrad vs Standard Adam and RMSprop
| Feature | Adam (vanilla) | AMSGrad | RMSprop |
|---|---|---|---|
| Step‑size mechanism | α · m̂ / √v̂ | α · m̂ / √v̂_max (running max) | α · g / √v |
| Convergence guarantee | Original regret proof later shown flawed; counterexamples exist | Formal regret bound (Reddi et al., 2018) | No formal guarantee for non‑convex |
| Memory overhead | m, v | m, v, v̂_max | v only |
| Typical performance on dense nets | Fast early progress | Stable later progress | Good for RNNs |
The table shows that AMSGrad sits between the aggressive step scaling of vanilla Adam and the simpler per‑parameter scaling of RMSprop, offering a balanced trade‑off for many deep‑learning tasks.
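The three step‑size mechanisms in the table can be written side by side for one scalar step (a simplified sketch: ε placement and bias‑correction details vary between implementations):

```python
import math

def adam_delta(lr, m_hat, v_hat, eps=1e-8):
    return lr * m_hat / (math.sqrt(v_hat) + eps)      # vanilla Adam

def amsgrad_delta(lr, m_hat, v_hat_max, eps=1e-8):
    return lr * m_hat / (math.sqrt(v_hat_max) + eps)  # AMSGrad: capped denominator

def rmsprop_delta(lr, g, v, eps=1e-8):
    return lr * g / (math.sqrt(v) + eps)              # RMSprop: raw gradient, no momentum

# With identical statistics, a larger cap (v_hat_max >= v_hat) can only shrink the step:
d_adam = adam_delta(1e-3, m_hat=1.0, v_hat=1.0)
d_ams = amsgrad_delta(1e-3, m_hat=1.0, v_hat_max=4.0)
```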
What to Watch
- Gradient norms: Sudden spikes often precede loss divergence; log them each step alongside the loss.
- Learning‑rate decay schedule: AMSGrad’s capped v can interact with step‑wise schedulers; consider warm‑up or cosine annealing.
- Batch‑size scaling: Larger batches often need higher β₂ to keep the effective learning rate stable.
- Hyper‑parameter sensitivity: Test β₂ values of 0.99 and 0.999 to see if capping improves loss curves.
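The first item on this list can be tracked with a small monitor that logs the gradient norm each step and flags spikes relative to a recent median (framework‑agnostic sketch; the window size and 3x threshold are arbitrary choices, not established defaults):

```python
from collections import deque
from statistics import median

class GradNormMonitor:
    """Flags steps whose gradient norm spikes above a multiple of the recent median."""
    def __init__(self, window=50, spike_factor=3.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def update(self, grad_norm):
        # Only start flagging once we have a reasonable baseline
        spike = (len(self.history) >= 10
                 and grad_norm > self.spike_factor * median(self.history))
        self.history.append(grad_norm)
        return spike

# Usage with synthetic norms: steady around 1.0, one spike at step 30
monitor = GradNormMonitor()
flags = [monitor.update(10.0 if step == 30 else 1.0) for step in range(60)]
```

In a real training loop, grad_norm would come from the framework (e.g., the global norm over all parameter gradients) and a flagged step is a cue to inspect the loss curve or lower the learning rate.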
FAQ
1. Does AMSGrad always converge faster than Adam?
No. AMSGrad improves stability on sparse‑gradient problems, but on dense, well‑conditioned datasets it can be slower to converge. Monitor validation loss to decide.
2. Can I use AMSGrad with learning‑rate schedulers?
Yes. A scheduler simply rescales α each step, which composes cleanly with AMSGrad’s capped second moment. In PyTorch, call scheduler.step() after optimizer.step(), as with any optimizer.
3. Is AMSGrad compatible with mixed‑precision training?
Yes. Both PyTorch and TensorFlow support amsgrad=True with float16 or bfloat16, provided you scale the loss appropriately.
4. How do I debug a divergence when using AMSGrad?
Check gradient clipping, reduce the learning rate, or lower β₂ to allow v̂ to grow more quickly in early steps.
5. Does AMSGrad affect the memory footprint significantly?
It adds one extra tensor per parameter tensor (the running maximum of v), growing the optimizer state from two buffers to three — roughly a 50% increase in optimizer memory. This is minor for most models but worth budgeting for very large ones.
6. Are there other variants that combine AMSGrad with other tricks?
Yes, you can layer weight decay, gradient centralization, or AdamW on top of AMSGrad by adjusting the loss or the update rule, but the core capping remains unchanged.
7. Where can I learn more about the theoretical background?
Read the original paper “On the Convergence of Adam and Beyond” (arXiv link) and the optimization overview by Sebastian Ruder.
8. How does AMSGrad behave with very small batch sizes?
Small batches introduce high‑variance gradients, which can make v̂ grow quickly. In such cases, a smaller β₂ (e.g., 0.9) often stabilizes training.
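The effect of β₂ in this answer can be checked directly: with synthetic gradients that are noisy early and calm later, a smaller β₂ forgets the noisy phase much faster (plain‑Python sketch; the gradient values are arbitrary):

```python
def second_moment(grads, beta2):
    """Return the bias-corrected second-moment estimate after processing `grads`."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
    return v / (1 - beta2 ** len(grads))

# Noisy early gradients followed by a long calm phase
grads = [5.0] * 10 + [0.1] * 90
fast = second_moment(grads, beta2=0.9)    # quickly forgets the noisy phase
slow = second_moment(grads, beta2=0.999)  # still dominated by early noise
```

With β₂ = 0.999, v (and hence the AMSGrad cap) remains inflated by the early noise long after the gradients have settled; β₂ = 0.9 tracks the recent calm gradients instead.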