Optimization Algorithms

➺ Modern Optimizers:

  • Adam/AdamW (adaptive moments; AdamW decouples weight decay)
  • Lion (sign-based update with momentum)
  • Adafactor (memory-efficient, factored second moments)
  • Shampoo (block-wise matrix preconditioning)

➺ Technical Implementation:

  Adam Variants (a configuration sketch follows this list)
  • Learning rate: 1e-4 to 1e-3
  • Beta1: 0.9 (first-moment / momentum decay)
  • Beta2: 0.999 (second-moment / variance decay)
  • Epsilon: 1e-8 (numerical stability)
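
A minimal sketch of how these defaults map onto a PyTorch AdamW instance. The model, the 1e-4 learning rate, and the weight_decay value are placeholder choices, not values prescribed by this document:

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder model

# AdamW with the hyperparameters listed above; weight_decay is an assumed
# value because the list above does not specify one.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,              # lower end of the 1e-4 to 1e-3 range
    betas=(0.9, 0.999),   # beta1 (momentum), beta2 (variance)
    eps=1e-8,             # numerical stability term
    weight_decay=0.01,    # assumed; AdamW applies it decoupled from the gradient
)
```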

  Advanced Optimizers (see the selection sketch after this list)
  • AdaBelief
  • Rectified Adam (RAdam)
  • AdaGrad
  • LAMB for large-batch training
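
A small sketch, assuming PyTorch 1.10+, of selecting two of these (RAdam and AdaGrad) from torch.optim. AdaBelief and LAMB are mentioned only in comments because they usually ship in third-party packages; the helper name and learning rate are illustrative:

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder model

def build_optimizer(name: str, params, lr: float = 1e-3):
    """Return an optimizer by name; only built-in torch.optim classes are used."""
    if name == "radam":
        # Rectified Adam: warms up the adaptive term to stabilize early training.
        return torch.optim.RAdam(params, lr=lr)
    if name == "adagrad":
        # AdaGrad: accumulates squared gradients per parameter without decay.
        return torch.optim.Adagrad(params, lr=lr)
    # AdaBelief and LAMB typically come from third-party packages
    # (e.g. adabelief-pytorch, apex/torch-optimizer) and are omitted here.
    raise ValueError(f"unknown optimizer: {name}")

optimizer = build_optimizer("radam", model.parameters())
```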

  Gradient Processing (see the training-loop sketch after this list)
  • Gradient clipping (bound the global gradient norm)
  • Gradient accumulation (simulate larger batches over several steps)
  • Gradient centralization (subtract the per-tensor gradient mean)
  • Gradient noise scale (diagnostic for choosing batch size)
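
A hedged sketch of a training loop combining gradient accumulation and norm clipping in PyTorch; the model, accumulation factor, clipping threshold, and data loader are illustrative placeholders:

```python
import torch

model = torch.nn.Linear(512, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

accum_steps = 4        # assumed accumulation factor
max_grad_norm = 1.0    # assumed clipping threshold

def train_epoch(data_loader):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so accumulated gradients average over micro-batches.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            # Clip the global gradient norm before the parameter update.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            optimizer.zero_grad()
```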

➺ Practical Considerations:

  • Memory requirements (Adam-style optimizers keep extra state per parameter; see the estimate after this list)
  • Convergence stability across learning rates and random seeds
  • Scaling properties with batch size and model size
  • Hardware compatibility (fused kernels, mixed precision, sharded optimizer state)
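
A back-of-the-envelope sketch of optimizer-state memory, assuming fp32 (4-byte) state tensors. The per-optimizer state counts reflect common implementations (Adam/AdamW keep two moment tensors per parameter, SGD with momentum keeps one); the 1B-parameter example is illustrative:

```python
# Rough optimizer-state memory estimate, assuming fp32 (4-byte) state tensors.
STATE_TENSORS_PER_PARAM = {
    "sgd_momentum": 1,  # momentum buffer
    "adam": 2,          # exp_avg and exp_avg_sq
    "adamw": 2,         # same state as Adam; only the weight-decay rule differs
}

def optimizer_state_gb(num_params: int, optimizer: str, bytes_per_value: int = 4) -> float:
    """Approximate optimizer state size in GB (excludes parameters and gradients)."""
    tensors = STATE_TENSORS_PER_PARAM[optimizer]
    return num_params * tensors * bytes_per_value / 1e9

# Example: a 1B-parameter model with AdamW needs roughly 8 GB of fp32 optimizer
# state, on top of ~4 GB of parameters and ~4 GB of gradients.
print(f"{optimizer_state_gb(1_000_000_000, 'adamw'):.1f} GB")
```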