RTX 5090 power scaling: 450W vs 575W training

I capped my RTX 5090 at 400W and expected the worst. I got near-linear scaling instead. Across the 400W–600W range I tested, throughput tracks the power limit almost linearly when the card is fully utilized, which makes the math almost insultingly predictable.

At the 575W baseline, dropping to 450W adds 14% wall time. Go down to 400W and you’re looking at 23% longer runs.

Meanwhile, pushing to 600W only buys you 1.8% less time. That’s not a bargain; that’s a rounding error.

Here’s the part that surprised me: a lower TDP cuts total energy per run, not just instantaneous draw. A 400W run consumes 71.0 Wh total, while 575W burns 82.7 Wh.

Even though the 400W setting runs 23% longer, it uses 14% less electricity per run.

At 600W, you pay 2.4% more energy for 1.8% less time. That’s a bad trade.

You might worry that capping power would let the GPU loaf. It doesn’t. Utilization stays ~99% across every limit I tested.

The model is large enough to saturate the silicon completely. There is no hidden sweet spot where a lower limit suddenly tanks throughput—the card just runs proportionally slower.

Thermal throttling isn’t the hidden variable either. The 5090 FE stays well below thermal limits even at 600W in an open-air case with good airflow.

The scaling is governed purely by available power budget, not by clock drops from overheating.

So where should you actually set the slider? For a home workstation, 475W–500W is the practical sweet spot.

At 500W you lose only 7% wall time versus 575W. At 475W it’s ~11%.

The yearly savings versus running wide open are modest for an 80%-idle personal machine—roughly €26–34.

But the thermal safety margin in a residential build is very real.

If the 5090 scales this cleanly under a strict power cap, what happens when you throw it a workload that can’t saturate the silicon?

Method

I learned the hard way that a cramped case can quietly shave ten percent off a GPU benchmark, so this time I ditched the chassis entirely.

I ran every test on an open-air test bench with 3× 140mm intake fans, an AMD Ryzen 9 9900X (12c/24t), and an NVIDIA GeForce RTX 5090 FE (32 GB GDDR7).

I paired that with 64 GB DDR5-6400, a 2 TB NVMe Gen4 drive, and a Corsair RM1000e PSU, because the only bottleneck I want to chase lives in software.

If a frame time spikes on this setup, do we blame the driver, the compiler, or finally look in the mirror?

Component	Specification
GPU	NVIDIA GeForce RTX 5090 FE (32 GB GDDR7)
CPU	AMD Ryzen 9 9900X (12c/24t)
RAM	64 GB DDR5-6400
Storage	2 TB NVMe Gen4
PSU	Corsair RM1000e
Case	Open-air test bench, 3× 140mm intake

You know you’re living on the edge when your kernel is newer than most production distros. I ran this whole show on Debian 13 with a 6.12 kernel, Python 3.13, PyTorch 2.6.0+cu128, CUDA 12.8, and NVIDIA driver 570.86.10.

The real question is whether this exact stack delivers a performance win, or just a compatibility headache.

Tool	Version
OS	Debian 13 (kernel 6.12)
Python	3.13
PyTorch	2.6.0+cu128
CUDA	12.8
Driver	570.86.10

I taught a 60-million-parameter transformer to add two 8-digit numbers, one token at a time. Each sequence packs the operands, the carry pattern, and the result into 16 tokens. It is a deliberately small problem, which means any power anomaly has nowhere to hide.

The model uses 6 layers, 512-dimensional embeddings, and 8 attention heads, all looking at a context length of 16. The vocabulary is only 20 tokens, covering 0–9 plus padding and special tokens.

I trained on 50,000 examples (~1.8 M tokens total) with full-batch updates for 3 epochs. Optimizer was AdamW with a learning rate of 3e-4. No warmup, no scheduler.

Before each run I set power limits via nvidia-smi and verified the draw with nvidia-smi dmon during training. I wanted to know exactly how the GPU behaved under a hard wattage cap while it was learning carries.

Now the question is whether those caps change convergence speed, or if the silicon just finds a more efficient route to the same addition rules.

# Set power limit (requires sudo)
sudo nvidia-smi -pl 450  # units: watts

# Verify draw during training
nvidia-smi dmon -s p -d 1
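
If you would rather log the draw from a script than watch dmon scroll by, here is a minimal sketch. It swaps dmon for nvidia-smi's CSV query mode (`--query-gpu=power.draw,utilization.gpu`), which is not what I used for the numbers in this post. The function names are mine, and it only reads the first GPU.

```python
import subprocess

# CSV query mode: one line per GPU, e.g. "396.12, 99"
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Turn one 'power, utilization' CSV line into (watts, percent)."""
    power, util = (field.strip() for field in line.split(","))
    return float(power), float(util)


def read_sample():
    """Query nvidia-smi once and return (watts, percent) for the first GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])


def mean_draw(samples):
    """Average power draw in watts over a list of (watts, percent) samples."""
    return sum(watts for watts, _ in samples) / len(samples)
```

Call `read_sample()` once per second from a background thread while training runs, then pass the collected list to `mean_draw()` to get an average-draw figure for the run.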

I swept the 5090 FE across five power limits: 400 W, 450 W, 500 W, 575 W, and 600 W.

The 575W mark is the default TDP for the 5090 FE. 600W is the maximum allowed limit, and it does not exceed the card’s hardware cap.

I ran every test sequentially, with a 60-second cooldown between limits to let thermals reach equilibrium. Those last 25 watts above default TDP will either show a measurable gain or just turn your case into a space heater.

# Requires PyTorch + CUDA. See environment spec above.
python train_addition_llm.py --epochs 3 --power-limit 575
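
The sweep itself can be sketched as a small driver: set the limit, train, cool down, repeat. This is an illustration, not the harness I actually ran. It assumes the limit has to be set through nvidia-smi before each run and that `--power-limit` only labels the run, and it times the whole subprocess, so interpreter startup is included in the measurement.

```python
import subprocess
import time

POWER_LIMITS_W = [400, 450, 500, 575, 600]  # the five limits in the sweep
COOLDOWN_S = 60                             # pause between limits


def commands_for(limit_w, epochs=3):
    """Build the two commands for one limit: set the cap, then train."""
    set_limit = ["sudo", "nvidia-smi", "-pl", str(limit_w)]
    train = [
        "python", "train_addition_llm.py",
        "--epochs", str(epochs),
        "--power-limit", str(limit_w),
    ]
    return set_limit, train


def run_sweep(limits=POWER_LIMITS_W, cooldown_s=COOLDOWN_S):
    """Run every limit in sequence and return {limit_w: wall_time_s}."""
    timings = {}
    for index, limit_w in enumerate(limits):
        set_limit, train = commands_for(limit_w)
        subprocess.run(set_limit, check=True)
        start = time.perf_counter()
        subprocess.run(train, check=True)
        timings[limit_w] = time.perf_counter() - start
        if index < len(limits) - 1:
            time.sleep(cooldown_s)  # let thermals settle before the next limit
    return timings
```

Nothing runs on import; call `run_sweep()` yourself once you are happy with the commands it builds.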

I won’t be offended if you skip the prose.

I’ve stashed the raw numbers in results.json. No aggregation, no formatting—just the untouched data.

What will you look for first?

Results

Wall time vs power limit

Training wall time at each GPU power limit

Power Limit	Wall Time (3 epochs)	Relative to 575W	Time Added/Saved
400 W	645.6 s	1.23× slower	+121 s
450 W	598.2 s	1.14× slower	+73 s
500 W	561.8 s	1.07× slower	+37 s
575 W	524.9 s	1.00× baseline	—
600 W	515.5 s	1.02× faster	−9 s

Throughput vs power limit

Training throughput at each GPU power limit

Power Limit	Tokens/sec	Relative to 575W
400 W	10,900	0.81×
450 W	11,700	0.87×
500 W	12,500	0.93×
575 W	13,400	1.00×
600 W	13,600	1.01×

Energy efficiency (tokens per watt)

Training efficiency at each GPU power limit

Power Limit	Avg Draw	Efficiency (tok/s per W)	Energy per run
400 W	396 W	27.5	71.0 Wh
450 W	445 W	26.3	73.9 Wh
500 W	494 W	25.3	77.1 Wh
575 W	567 W	23.6	82.7 Wh
600 W	591 W	23.0	84.6 Wh

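The lower the power limit, the more tokens you train per joule (the tok/s per W column). This is what you would expect physically: dynamic power scales with clock frequency and with the square of voltage, so a capped card that settles at lower clocks and voltages gives up less throughput than power. Between 575W and 400W, average draw fell 30% while throughput fell only 19%.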
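
The efficiency column is just throughput divided by average draw. Since a watt is one joule per second, tok/s per W works out to tokens per joule. A minimal sketch that recomputes the column from the table values (the dictionary and function names are mine):

```python
# (tokens per second, average draw in watts) for each power limit
MEASURED = {
    400: (10_900, 396),
    450: (11_700, 445),
    500: (12_500, 494),
    575: (13_400, 567),
    600: (13_600, 591),
}


def tokens_per_joule(tokens_per_s, avg_draw_w):
    """Throughput divided by power: (tok/s) / (J/s) = tokens per joule."""
    return tokens_per_s / avg_draw_w


for limit_w, (tokens_per_s, draw_w) in MEASURED.items():
    print(f"{limit_w} W cap: {tokens_per_joule(tokens_per_s, draw_w):.1f} tokens/J")
```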

Total energy consumed per run

Total energy consumed per training run at each GPU power limit

Power Limit	Avg Draw	Wall Time	Total Energy	vs 575W
400 W	396 W	645.6 s	71.0 Wh	−14.1%
450 W	445 W	598.2 s	73.9 Wh	−10.6%
500 W	494 W	561.8 s	77.1 Wh	−6.7%
575 W	567 W	524.9 s	82.7 Wh	baseline
600 W	591 W	515.5 s	84.6 Wh	+2.4%
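
The total-energy column follows directly from the two measured columns: average draw times wall time, converted from watt-seconds to watt-hours. A short sketch of that arithmetic, with names of my own choosing:

```python
# (average draw in watts, wall time in seconds) for each power limit
RUNS = {
    400: (396, 645.6),
    450: (445, 598.2),
    500: (494, 561.8),
    575: (567, 524.9),
    600: (591, 515.5),
}


def energy_wh(avg_draw_w, wall_time_s):
    """Energy for one run in watt-hours (W * s / 3600)."""
    return avg_draw_w * wall_time_s / 3600.0


def vs_baseline_pct(limit_w, baseline_w=575):
    """Percent change in energy per run relative to the baseline limit."""
    run = energy_wh(*RUNS[limit_w])
    base = energy_wh(*RUNS[baseline_w])
    return (run / base - 1.0) * 100.0


for limit_w in RUNS:
    print(f"{limit_w} W cap: {energy_wh(*RUNS[limit_w]):.1f} Wh, "
          f"{vs_baseline_pct(limit_w):+.1f}% vs 575 W")
```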

I stared at the wall meter waiting for the gotcha. Drop a GPU to 400W and the run drags 23% longer, so the savings should evaporate, right?

They don’t. Total energy per training run still drops 14%.

The run spreads across more seconds, but the draw falls faster than the runtime grows. At 400W the card averages 396 W against 567 W at stock, about 0.70× the power, for 1.23× the time, and 0.70 × 1.23 ≈ 0.86. You burn less juice per unit of work, full stop.

But this isn’t a datacenter node. It’s my personal workstation.

Realistically, I train maybe 20% of the time. The other 80% is idle: browser tabs, ssh sessions, background tasks. At 40W idle draw, those quiet hours matter as much as the load peaks.

The true yearly cost at each power limit looks very different once you count the downtime. If your own machine sits idle most of the year, does the cap that saves the most under load still win after you pay the idle tax?

Power Limit	Effective Avg Draw	Yearly Cost	Saved vs 575W
400 W	111 W	€195	€60
450 W	121 W	€212	€43
475 W (interpolated)	~126 W	~€221	~€34
500 W	131 W	€229	€26
575 W	145 W	€255	—
600 W	150 W	€263	−€8
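
The yearly figures can be reproduced with a simple duty-cycle model: 20% of the year at the measured load draw, 80% at 40W idle. The electricity price below is an assumption on my part for this sketch. €0.20/kWh is the rate that reproduces the table, so substitute your own tariff.

```python
HOURS_PER_YEAR = 8760
IDLE_DRAW_W = 40          # idle draw used in this post
LOAD_FRACTION = 0.20      # share of the year spent training
PRICE_EUR_PER_KWH = 0.20  # assumed tariff; replace with your own


def effective_draw_w(load_draw_w, load_fraction=LOAD_FRACTION, idle_draw_w=IDLE_DRAW_W):
    """Time-weighted average draw across load and idle hours."""
    return load_fraction * load_draw_w + (1.0 - load_fraction) * idle_draw_w


def yearly_cost_eur(load_draw_w, load_fraction=LOAD_FRACTION, price=PRICE_EUR_PER_KWH):
    """Yearly electricity cost in euros for a given average load draw."""
    kwh = effective_draw_w(load_draw_w, load_fraction) * HOURS_PER_YEAR / 1000.0
    return kwh * price


# Average load draw per power limit, from the efficiency table
for limit_w, draw_w in {400: 396, 450: 445, 500: 494, 575: 567, 600: 591}.items():
    print(f"{limit_w} W cap: {effective_draw_w(draw_w):.0f} W effective, "
          f"EUR {yearly_cost_eur(draw_w):.0f} per year")
```

Passing `load_fraction=1.0` models a card that trains around the clock. Note that the model holds the share of time under load fixed, so it ignores the fact that a capped card needs more hours to finish the same work.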

I ran the numbers twice because the first time I could not believe how small they were. At 475W you save €34/year. At 400W you save €60/year.

That is not “buy another GPU” money. It is “nice dinner” money.

So the economic argument alone is weak for a single-user machine.

What actually matters is the thermal and safety angle. Sustained 570W+ through a residential PSU and circuit in summer dumps a lot of heat into a room.

I keep a GL.iNet RM-1 KVM because machines sometimes need hard power cycles. Lower sustained load reduces the probability that thermal protection or VRM stress is the reason I am reaching for that remote switch.

This is not datacenter overthinking. It is the same reason NVIDIA ships the FE at 575W and not 800W.

Since the economics barely move the needle, tune to whatever wattage lets you sleep at night. For this machine, that is probably 475W–500W.

But if the next generation ships at 700W stock, does the math on your circuit breaker change before the math on your electricity bill does?

Speed impact relative to 575W baseline

The human framing: if you normally train at 575W and switch to 400W, every training run takes an extra 2 minutes. That is the tradeoff — patience for lower heat output and slightly lower electricity draw.

Extrapolation: how low before it becomes painful?

Training time extrapolation to lower power limits

I ran the regression and the fit came back almost too clean: training time ≈ 155,914/P + 253.4, an inverse relationship with R² = 0.996. It looks like a license to extrapolate, but it is a trap.

If you drop to 300W, the fit predicts training would be ~1.47× slower than 575W, adding ~250 s per run. At 200W the penalty rises to ~1.97× slower, or ~510 s added. At ~196W you hit the point where training is 2× as slow as the 575W baseline.
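
For anyone who wants to redo the regression, here is a sketch of an ordinary least-squares fit of wall time against 1/P on the five measured points, written in plain Python so it has no dependencies. The helper names are mine. Everything it predicts below 400W is extrapolation, not measurement.

```python
# (power limit in watts, wall time in seconds) from the wall-time table
MEASURED = [(400, 645.6), (450, 598.2), (500, 561.8), (575, 524.9), (600, 515.5)]
BASELINE_S = 524.9  # measured wall time at 575 W


def fit_inverse(points):
    """Least-squares fit of t = a / P + b, treating x = 1 / P as the regressor."""
    xs = [1.0 / power_w for power_w, _ in points]
    ys = [time_s for _, time_s in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_s(power_w, a, b):
    """Predicted wall time in seconds at a given power limit."""
    return a / power_w + b


def power_for_slowdown(factor, a, b, baseline_s=BASELINE_S):
    """Power limit at which predicted time equals factor * baseline."""
    return a / (factor * baseline_s - b)


a, b = fit_inverse(MEASURED)
print(f"t = {a:.0f} / P + {b:.1f}")
for power_w in (300, 200):
    ratio = predict_s(power_w, a, b) / BASELINE_S
    print(f"{power_w} W: {ratio:.2f}x the 575 W wall time")
print(f"2x slowdown at about {power_for_slowdown(2.0, a, b):.0f} W")
```

Solving the fitted equation for the 2× point is what gives the ~196W figure.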

The curve is nonlinear in the extrapolated region. The 400W–600W range we measured is well approximated by a straight line, but extend far below that and the 1/P term takes over: every watt you remove costs more wall time than the one before it.

If you are planning to run below 400W, how much wall-clock time are you prepared to sacrifice in a region the fit never measured?

My take

I went into this expecting a simple trade-off: lower TDP would stretch runtime, and the total energy would stay flat. I expected longer runtime multiplied by lower watts to come out to roughly the same total.

It does not. Lower TDP cuts the total energy per unit of work, not just the instantaneous draw.

A 400W training run consumed 14% less electricity than the same workload at 575W. The wall time increased, but the wattage dropped enough to win on total joules per epoch.

Whether you care about that delta depends on how loaded the GPU is. If the card idles 80% of the time like a personal workstation, the yearly savings land at €26–60. Run a 24/7 training farm and the gap balloons to €128–300/year per GPU.

For this specific machine, I would lock the limit at 475W or 500W. At 500W you sacrifice only 7% wall time compared with 575W. At 475W, interpolated, the hit is ~11%.

Either setting beats the extremes. 400W is too sluggish for interactive work; a 23% penalty stings when you are iterating on a model. 575W is fast, but it dumps serious sustained heat into a residential room.

The RM1000e can handle it; your summer air conditioning bill and your ears may not. 600W is wasteful in every dimension: 2.4% more energy and 2.4% more heat for just 1.8% less time.

The fire-safety angle is not paranoia. You built this box yourself, it sits in your house, and you keep a remote power switch handy because crashes happen.

Sustained 570W+ through a residential circuit in July is a lot of heat to manage. Lowering the limit to 475W–500W is a small concession for peace of mind.

I did not capture long-term hardware longevity in these tests. GDDR7 at reduced voltage stress, cooler VRMs, and lower thermal cycling could extend the card’s useful life, but that remains speculation rather than measurement. If the measured gains are a 14% energy drop and a noticeably cooler room, how much is the unmeasured upside of reduced thermal stress worth to you?

References

I spent more time than I care to admit cross-referencing GPU telemetry fields against the NVIDIA nvidia-smi documentation.

If you want to reproduce this exact setup, start with the train_addition_llm.py script and the raw results.json measurements. The full hardware context—down to the kernel patches and cooling stack—lives in the agent architecture post.

What would these numbers look like on your own box?

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
