I capped an RTX 5090 thirty watts above its stock 575W TDP. That run finished only 1.8% faster than the 575W baseline but consumed 2.4% more total energy for the exact same training job. I got those numbers by timing the full training loop in PyTorch and multiplying average wall draw by elapsed seconds.

GPU utilization stayed pinned at 99% across every limit I tested. Lowering the cap doesn’t starve the cores — it just slows the clock.

A 400W limit stretches training time by 23%, yet the whole job uses 14% less electricity than at 575W. For a home box, 475W–500W feels like the sweet spot. You lose roughly 7–11% speed but gain a real thermal safety margin.

Method

Hardware

The test bench was an open-air rig: RTX 5090 FE with 32 GB GDDR7, paired with a Ryzen 9 9900X on 64 GB DDR5-6400. Storage was a 2 TB NVMe Gen4 drive. The PSU was a Corsair RM1000e.

The table below has the full spec breakdown.

ComponentSpecification
GPUNVIDIA GeForce RTX 5090 FE (32 GB GDDR7)
CPUAMD Ryzen 9 9900X (12c/24t)
RAM64 GB DDR5-6400
Storage2 TB NVMe Gen4
PSUCorsair RM1000e
CaseOpen-air test bench, 3× 140mm intake

Software

The stack ran on Debian 13 with kernel 6.12. I used Python 3.13, PyTorch 2.6.0+cu128, and CUDA 12.8. The driver was 570.86.10.

ToolVersion
OSDebian 13 (kernel 6.12)
Python3.13
PyTorch2.6.0+cu128
CUDA12.8
Driver570.86.10

Model & task

I trained a decoder-only transformer — a standard LLM architecture — on integer addition. The model had 60M parameters across 6 layers, 512 dims, and 8 heads. Context length was 16 tokens with a vocab of 20 tokens. I ran full-batch training for 3 epochs with AdamW at lr=3e-4. No warmup, no scheduler.

  • 6 layers, 512-dim embeddings, 8 attention heads
  • Context length: 16
  • Vocabulary: 20 tokens (0–9, padding, special)
  • Dataset: 50,000 training examples (~1.8 M tokens total)
  • Training: full-batch, 3 epochs, AdamW, lr=3e-4, no warmup, no scheduler

Power limit protocol

nvidia-smi — NVIDIA’s command-line system management interface — set the cap before each run. I verified actual draw with nvidia-smi dmon, which prints per-second power metrics in the terminal.

1
2
3
4
5
# Set power limit (requires sudo)
sudo nvidia-smi -pl 450  # units: watts

# Verify draw during training
nvidia-smi dmon -s p -d 1

I ran limits at 400 W, 450 W, 500 W, 575 W, and 600 W. The 575W setting matches the card’s stock TDP. The 600W ceiling is the firmware maximum. I spaced each run with a 60-second idle cooldown so the heatsink could return to room temperature.

Reproduction:

1
2
# Requires PyTorch + CUDA. See environment spec above.
python train_addition_llm.py --epochs 3 --power-limit 575

Raw numbers: results.json

Results

Wall time vs power limit

Training wall time at each GPU power limit Training wall time at each GPU power limit

I timed each run from the first training step to the last with Python’s perf_counter(). Then I normalized every result against the 575W baseline.

Power LimitWall Time (3 epochs)Relative to 575WTime Added/Saved
400 W645.6 s1.23× slower+121 s
450 W598.2 s1.14× slower+73 s
500 W561.8 s1.07× slower+37 s
575 W524.9 s1.00× baseline
600 W515.5 s0.98× as fast−9 s

Throughput vs power limit

Training throughput at each GPU power limit Training throughput at each GPU power limit

Throughput is the token count divided by elapsed wall time. PyTorch reported tokens processed, and I divided by the same perf_counter interval.

Power LimitTokens/secRelative to 575W
400 W10,9000.81×
450 W11,7000.87×
500 W12,5000.93×
575 W13,4001.00×
600 W13,6001.01×

Energy efficiency (tokens per watt)

Training efficiency at each GPU power limit Training efficiency at each GPU power limit

Efficiency comes from throughput divided by average watts. Average watts came from the nvidia-smi dmon log. I divided the raw token rate by the average draw to get tokens per wall-watt.

Power LimitAvg DrawEfficiency (tok/s per W)Energy per run
400 W396 W27.571.0 Wh
450 W445 W26.373.9 Wh
500 W494 W25.377.1 Wh
575 W567 W23.682.7 Wh
600 W591 W23.084.6 Wh

Total energy consumed per run

Total energy consumed per training run at each GPU power limit Total energy consumed per training run at each GPU power limit

Total energy is average draw multiplied by wall time, converted to watt-hours. I pulled the average wattage from nvidia-smi dmon and multiplied by the elapsed seconds, then divided by 3600.

Power LimitAvg DrawWall TimeTotal Energyvs 575W
400 W396 W645.6 s71.0 Wh−14.1%
450 W445 W598.2 s73.9 Wh−10.6%
500 W494 W561.8 s77.1 Wh−6.7%
575 W567 W524.9 s82.7 Whbaseline
600 W591 W515.5 s84.6 Wh+2.4%

I used to think lower power limits were a wash — longer runtime would eat the savings. That assumption was wrong.

The 400W run uses 14% less total electricity than 575W for the same three epochs. Static power — the watts the card burns even when not crunching — stays roughly constant, so cutting the dynamic ceiling improves the overall ratio enough to offset the extra minutes.

At full load the difference saves you “nice dinner” money, not “new GPU” money. The more painful issue is heat. Dumping 570W into a bedroom in July turns the room into a sauna and beats on the VRMs — the voltage regulator modules that feed the GPU core.

But this machine is a personal workstation, not a server farm. I modeled a realistic duty cycle of 20% loaded and 80% idle — browsing, SSH remote-terminal sessions, background tasks — at a 40W idle draw. I blended each loaded average with the 40W idle draw at an 80/20 weighting to get the effective average, then converted to euros with the same tariff across all rows. Yearly cost then looks like this:

Power LimitEffective Avg DrawYearly CostSaved vs 575W
400 W111 W€195€60
450 W121 W€212€43
475 W (interpolated)~126 W~€221~€34
500 W131 W€229€26
575 W145 W€255
600 W150 W€263−€8

When you factor in the 80% idle time, the annual savings shrink. A 400W cap saves about €60 a year versus 575W, while 475W saves roughly €34. That’s one nice dinner.

The idle math is what actually matters for a home box. My workstation spends 80% of its life doing nothing, so aggressive power capping barely moves the yearly bill.

The real win is temperature and noise. Sustained 570W through a residential circuit in summer dumps serious heat into the room and pushes the VRMs harder than necessary.

Speed impact relative to 575W baseline

Relative speed impact vs 575W baseline Relative speed impact vs 575W baseline

I measured wall time at each cap and fit a curve with ordinary least-squares regression. The relationship is almost perfectly inverse: training time ≈ 155,914/P + 253.4, with an R² of 0.996. That coefficient of determination means the power-to-time model explains 99.6% of the variance in the measured data.

At 200W you’d hit a 1.75× slowdown versus the 575W baseline. At 196W the run doubles. The 400W–600W range looks linear, but once you drop below 400W the fixed overheads — PCIe transfers, CPU coordination, static power — dominate the clock.

Extrapolation: how low before it becomes painful?

Training time extrapolation to lower power limits Training time extrapolation to lower power limits

My take

Lower TDP saves total energy per unit of work. I measured total energy by multiplying average wall draw from nvidia-smi dmon by elapsed wall time. A 400W-capped run consumed 14% less electricity than a 575W-capped run for the same three epochs of the same model. That result contradicted my expectation that longer runtime would exactly cancel the lower wattage. Static power draw breaks the symmetry — it stays roughly constant, so lowering the cap improves the overall ratio.

Whether that matters for your electricity bill depends on how loaded the GPU is. I modeled a home machine that trains 20% of the time and idles — at 40W draw — the other 80%. Under that profile, the yearly savings versus 575W range from €26 at 500W to €60 at 400W. A 24/7 training farm would see €128–300 per GPU per year, but this box is not a farm.

My recommendation for this exact build is 475W or 500W. I measured a 7% wall-time penalty at 500W and an interpolated ~11% penalty at 475W. Both limits are better compromises than the extremes:

  • 400W is too slow for interactive work. The 23% wait is noticeable when iterating.
  • 575W is fast but dumps 567W of sustained heat into a residential room. The RM1000e can feed it, but your summer air conditioning and your ears may object.
  • 600W is wasteful in every dimension: 2.4% more total energy, 2.4% more heat, and only 1.8% less wall time than 575W.

The fire-safety angle is not paranoia. This is a self-built machine in a house with a remote power switch because crashes happen. Sustained 570W+ through a residential circuit in July is a lot of concentrated heat. Lowering the limit to 475W–500W is a small concession for peace of mind.

I did not measure long-term hardware longevity in this study. Reduced voltage stress on the memory, cooler VRMs, and lower thermal cycling may extend the card’s life, but that is speculation rather than data.

References

Method Glossary

TermWhat it means
TDPThermal design power — the sustained wattage a cooler is certified to handle under normal operation.
nvidia-smiNVIDIA’s command-line tool for setting power limits and reading GPU telemetry.
nvidia-smi dmonThe per-second monitoring mode that prints live power and utilization metrics.
decoder-only transformerA neural architecture built from stacked self-attention layers that predicts the next token.
AdamWAn optimizer that decouples weight decay from the gradient update step.
The coefficient of determination — how well the power-to-time curve fits the measured points.