Optimizers: From SGD To AdamW

sky_io@outlook.com (K4i) — Mon, 29 Jun 2026 10:00:00 +0800

In the previous training posts, we separated the loop into a few pieces:

the loss function defines what counts as wrong;
forward and backward propagation compute gradients for each parameter;
gradient descent updates parameters using those gradients.

But in real training code, we usually do not write:

param = param - lr * param.grad

We write:

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
optimizer.step()

So what does the optimizer actually do? Why is AdamW so common? Is modern neural-network training basically “just use AdamW”, or are there still meaningful alternatives?
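
As a first, deliberately simplified answer to the first question, here is a sketch of the smallest possible optimizer in plain Python. It is not how torch.optim is implemented. `Param` and `TinySGD` are hypothetical names made up for this illustration, and the gradient is computed by hand where a real loop would call `loss.backward()`. The point is only that an optimizer object holds references to the parameters and applies an update rule to each of them when `step()` is called.

```python
class Param:
    """Hypothetical stand-in for a tensor: a value plus its gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


class TinySGD:
    """Illustrative optimizer that applies param = param - lr * grad."""

    def __init__(self, params, lr):
        self.params = list(params)  # keep references to the parameters
        self.lr = lr

    def step(self):
        # Apply the plain gradient-descent update to every parameter.
        for p in self.params:
            p.value -= self.lr * p.grad

    def zero_grad(self):
        # Reset gradients so the next iteration starts clean.
        for p in self.params:
            p.grad = 0.0


# Minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = Param(0.0)
optimizer = TinySGD([w], lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    w.grad = 2 * (w.value - 3.0)  # stands in for loss.backward()
    optimizer.step()
```

With this toy setup, `w.value` moves toward 3.0. Optimizers such as AdamW keep the same `step()` interface but change what happens inside it.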
