1. Power-to-performance is essentially linear on a fully-utilized RTX 5090. At 575W baseline, dropping to 450W adds 14% wall time. Dropping to 400W adds 23%. Bumping to 600W saves only 1.8%.
  2. A lower power limit reduces total energy per run, not just instantaneous draw. A 400W run consumes 71.0 Wh total; a 575W run consumes 82.7 Wh. Despite running 23% longer, the 400W setting uses 14% less electricity per run. At 600W you pay 2.4% more energy for 1.8% less time, which is a bad trade.
  3. GPU utilization stays ~99% across all tested limits. The model is large enough to saturate the silicon. There is no cliff where a lower power limit suddenly tanks utilization; the run simply gets slower as the limit drops.
  4. Thermal throttling is not a factor here. The 5090 FE stays well below thermal limits even at 600W in an open-air case with good airflow. The scaling is governed by available power budget, not clock drops from overheating.
  5. The practical sweet spot for a home workstation is 475W–500W. At 500W you lose only 7% wall time versus 575W; at 475W (interpolated), about 11%. The yearly savings versus 575W are modest for an 80%-idle personal machine (~€26–34), but the thermal safety margin in a residential build is real.

Method

Hardware

| Component | Specification |
| --- | --- |
| GPU | NVIDIA GeForce RTX 5090 FE (32 GB GDDR7) |
| CPU | AMD Ryzen 9 9900X (12c/24t) |
| RAM | 64 GB DDR5-6400 |
| Storage | 2 TB NVMe Gen4 |
| PSU | Corsair RM1000e |
| Case | Open-air test bench, 3× 140mm intake |

Software

| Tool | Version |
| --- | --- |
| OS | Debian 13 (kernel 6.12) |
| Python | 3.13 |
| PyTorch | 2.6.0+cu128 |
| CUDA | 12.8 |
| Driver | 570.86.10 |

Model & task

A 60 million parameter decoder-only transformer that learns integer addition by processing one digit per token. Each example is a 16-token sequence: two 8-digit numbers concatenated, followed by the carry pattern and result.

  • 6 layers, 512-dim embeddings, 8 attention heads
  • Context length: 16
  • Vocabulary: 20 tokens (0–9, padding, special)
  • Dataset: 50,000 training examples (~0.8 M tokens total at 16 tokens each)
  • Training: full-batch, 3 epochs, AdamW, lr=3e-4, no warmup, no scheduler

Power limit protocol

Power limits were set before each run via nvidia-smi and verified with nvidia-smi dmon during training:

# Set power limit (requires sudo)
sudo nvidia-smi -pl 450  # units: watts

# Verify draw during training
nvidia-smi dmon -s p -d 1

Power limits tested: 400 W, 450 W, 500 W, 575 W, 600 W.

The 575W setting is the default TDP for the 5090 FE. The 600W setting is the maximum allowed limit (it does not exceed the card’s hardware cap). All runs were sequential with a 60-second cooldown between limits to allow thermal equilibrium.
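The protocol above can be scripted. What follows is a minimal sketch, not the script used for these measurements. It assumes the `nvidia-smi -pl` command and the `train_addition_llm.py` invocation quoted in this write-up; the `run` and `sleep` parameters are hypothetical hooks added here so the sequence can be dry-run on a machine without a GPU.

```python
import subprocess
import time

LIMITS_W = [400, 450, 500, 575, 600]  # power limits tested in this write-up
COOLDOWN_S = 60                        # cooldown between runs, in seconds


def commands_for(limit_w, epochs=3):
    """Return the two commands for one run: set the limit, then train."""
    return [
        ["sudo", "nvidia-smi", "-pl", str(limit_w)],
        ["python", "train_addition_llm.py",
         "--epochs", str(epochs), "--power-limit", str(limit_w)],
    ]


def sweep(limits=LIMITS_W, run=subprocess.run, sleep=time.sleep):
    """Run each limit sequentially, cooling down between consecutive runs."""
    for i, limit_w in enumerate(limits):
        if i > 0:
            sleep(COOLDOWN_S)  # let temperatures settle before the next limit
        for cmd in commands_for(limit_w):
            run(cmd, check=True)  # abort the sweep if any command fails
```

Calling `sweep()` with no arguments executes all five limits for real. For a dry run, pass stubs, for example `sweep(run=lambda cmd, check: print(cmd), sleep=lambda s: None)`.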

Reproduction:

# Requires PyTorch + CUDA. See environment spec above.
python train_addition_llm.py --epochs 3 --power-limit 575

Raw numbers: results.json

Results

Wall time vs power limit

[Figure: Training wall time at each GPU power limit]

| Power limit | Wall time (3 epochs) | Relative to 575W | Time added/saved |
| --- | --- | --- | --- |
| 400 W | 645.6 s | 1.23× slower | +121 s |
| 450 W | 598.2 s | 1.14× slower | +73 s |
| 500 W | 561.8 s | 1.07× slower | +37 s |
| 575 W | 524.9 s | 1.00× | baseline |
| 600 W | 515.5 s | 0.98× (faster) | −9 s |

Throughput vs power limit

[Figure: Training throughput at each GPU power limit]

| Power limit | Tokens/sec | Relative to 575W |
| --- | --- | --- |
| 400 W | 10,900 | 0.81× |
| 450 W | 11,700 | 0.87× |
| 500 W | 12,500 | 0.93× |
| 575 W | 13,400 | 1.00× |
| 600 W | 13,600 | 1.01× |

Energy efficiency (tokens/s per watt)

[Figure: Training efficiency at each GPU power limit]

| Power limit | Avg draw | Efficiency (tok/s per W) | Energy per run |
| --- | --- | --- | --- |
| 400 W | 396 W | 27.5 | 71.0 Wh |
| 450 W | 445 W | 26.3 | 73.9 Wh |
| 500 W | 494 W | 25.3 | 77.1 Wh |
| 575 W | 567 W | 23.6 | 82.7 Wh |
| 600 W | 591 W | 23.0 | 84.6 Wh |

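The lower the power limit, the more tokens you train per joule consumed. The likely reason is voltage/frequency scaling: dynamic power grows roughly with clock frequency times voltage squared, so the last few percent of clock speed are the most expensive, and a lower limit pushes the GPU to a more efficient point on that curve. Static power (memory controllers, display engine, PCIe link) does not shrink with the limit, so it works against this gain rather than causing it.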
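The efficiency column is throughput divided by average draw. Since a watt is a joule per second, tokens/s per watt is the same quantity as tokens per joule. A short sketch that recomputes the column from the measured values (the dictionary and function names are illustrative, not taken from the training script):

```python
# Measured values copied from the throughput and efficiency tables.
# Keys are power limits in watts.
TOKENS_PER_S = {400: 10_900, 450: 11_700, 500: 12_500, 575: 13_400, 600: 13_600}
AVG_DRAW_W = {400: 396, 450: 445, 500: 494, 575: 567, 600: 591}


def tokens_per_joule(limit_w):
    """Throughput divided by average draw: (tok/s) / (J/s) = tok/J."""
    return TOKENS_PER_S[limit_w] / AVG_DRAW_W[limit_w]


for limit_w in sorted(TOKENS_PER_S):
    print(f"{limit_w} W: {tokens_per_joule(limit_w):.1f} tok/J")
```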

Total energy consumed per run

[Figure: Total energy consumed per training run at each GPU power limit]

| Power limit | Avg draw | Wall time | Total energy | vs 575W |
| --- | --- | --- | --- | --- |
| 400 W | 396 W | 645.6 s | 71.0 Wh | −14.1% |
| 450 W | 445 W | 598.2 s | 73.9 Wh | −10.6% |
| 500 W | 494 W | 561.8 s | 77.1 Wh | −6.7% |
| 575 W | 567 W | 524.9 s | 82.7 Wh | baseline |
| 600 W | 591 W | 515.5 s | 84.6 Wh | +2.4% |

This table answers the non-obvious question: does the longer runtime at lower TDP eat the wattage savings? No. Even though the 400W run takes 23% longer, total energy per training run drops by 14%. The GPU’s static power draw (~250W just to keep the memory and PCIe link alive) is paid for more seconds, but dynamic compute power falls faster than throughput does. The net effect: less total juice per unit of work.
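The energy figures follow directly from average draw and wall time: energy (Wh) = watts × seconds / 3600. A minimal sketch using the measured values from the table (names are illustrative):

```python
# (average draw in W, wall time in s) per power limit, from the table above.
RUNS = {
    400: (396, 645.6),
    450: (445, 598.2),
    500: (494, 561.8),
    575: (567, 524.9),
    600: (591, 515.5),
}
BASELINE_W = 575  # default power limit used as the reference


def energy_wh(limit_w):
    """Energy for one training run in watt-hours."""
    draw_w, wall_s = RUNS[limit_w]
    return draw_w * wall_s / 3600


def energy_vs_baseline_pct(limit_w):
    """Percent change in energy per run relative to the 575 W default."""
    return (energy_wh(limit_w) / energy_wh(BASELINE_W) - 1) * 100


for limit_w in sorted(RUNS):
    print(f"{limit_w} W: {energy_wh(limit_w):.1f} Wh "
          f"({energy_vs_baseline_pct(limit_w):+.1f}%)")
```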

But this machine does not train continuously — it is a personal workstation. A realistic load profile is roughly 20% under load and 80% idle (browsing, ssh sessions, background tasks). At 40W idle draw, the true yearly cost at each power limit is:

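| Power limit | Effective avg draw | Yearly cost | Saved vs 575W |
| --- | --- | --- | --- |
| 400 W | 111 W | €195 | €60 |
| 450 W | 121 W | €212 | €43 |
| 475 W (interpolated) | ~126 W | ~€221 | ~€34 |
| 500 W | 131 W | €229 | €26 |
| 575 W | 145 W | €255 | |
| 600 W | 150 W | €263 | −€8 |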

The savings are real but modest: €34/year at 475W, €60/year at 400W. This is not “buy another GPU” money. It is “nice dinner” money.
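The yearly-cost table can be reproduced from the load profile. The sketch below assumes an electricity price of €0.20/kWh, which is not stated in the text but is the value the table's figures imply; the 20% load share and 40 W idle draw are the ones given above. Function names are illustrative.

```python
HOURS_PER_YEAR = 8760
IDLE_W = 40                # idle draw stated above
LOAD_FRACTION = 0.20       # share of time under training load
PRICE_EUR_PER_KWH = 0.20   # assumption: inferred from the table, not stated

# Average draw under load per power limit, from the measurements.
AVG_DRAW_W = {400: 396, 450: 445, 500: 494, 575: 567, 600: 591}


def effective_draw_w(limit_w, load_fraction=LOAD_FRACTION):
    """Time-weighted average of load draw and idle draw."""
    return load_fraction * AVG_DRAW_W[limit_w] + (1 - load_fraction) * IDLE_W


def yearly_cost_eur(limit_w, load_fraction=LOAD_FRACTION):
    """Yearly electricity cost at the assumed price."""
    kwh = effective_draw_w(limit_w, load_fraction) * HOURS_PER_YEAR / 1000
    return kwh * PRICE_EUR_PER_KWH


for limit_w in sorted(AVG_DRAW_W):
    saved = yearly_cost_eur(575) - yearly_cost_eur(limit_w)
    print(f"{limit_w} W: EUR {yearly_cost_eur(limit_w):.0f}/year, saved {saved:.0f}")
```

Passing `load_fraction=1.0` gives the continuous-load case for a machine that trains around the clock.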

So the economic argument alone is weak for a single-user machine. What remains is the thermal and safety angle: sustained 570W+ through a residential PSU and circuit in summer dumps a lot of heat into a room. The GL.iNet RM-1 KVM exists because machines sometimes need hard power cycles. Lower sustained load reduces the probability that thermal protection or VRM stress is the reason you are reaching for that remote switch. This is not datacenter overthinking — it is the same reason NVIDIA ships the FE at 575W and not 800W.

Economics aside, tune to whatever wattage lets you sleep at night. For this machine, that is probably 475W–500W.

Speed impact relative to 575W baseline

[Figure: Relative speed impact vs 575W baseline]

The human framing: if you normally train at 575W and switch to 400W, every training run takes an extra 2 minutes. That is the tradeoff — patience for lower heat output and slightly lower electricity draw.

Extrapolation: how low before it becomes painful?

[Figure: Training time extrapolation to lower power limits]

Fit to the measured data: training time (s) ≈ 155,914/P + 253.4, where P is the power limit in watts (inverse relationship, R² = 0.996).

From this fit:

  • Training at 300W would be ~1.47× slower than 575W (~250 s added per run)
  • Training at 200W would be ~1.97× slower than 575W (~510 s added per run)
  • Training at ~196W would be 2× as slow as the 575W baseline

The curve is nonlinear in the extrapolated region. The range we measured (400W–600W) is well approximated by a straight line, but below it the 1/P term dominates and wall time climbs increasingly steeply. These figures are extrapolations beyond the measured range, not measurements.
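For readers who want to redo the fit, here is a sketch of an ordinary least-squares fit of t = a/P + b to the five measured points, written in plain Python so it has no dependencies. It is one way to obtain such a fit, not necessarily the procedure used for the numbers above; the function names are illustrative.

```python
# Measured (power limit in W, wall time in s) from the wall-time table.
DATA = [(400, 645.6), (450, 598.2), (500, 561.8), (575, 524.9), (600, 515.5)]
BASELINE_S = 524.9  # measured wall time at the 575 W default


def fit_inverse(data):
    """Least-squares fit of t = a / P + b. Returns (a, b, r_squared)."""
    xs = [1.0 / p for p, _ in data]  # regress on x = 1/P
    ts = [t for _, t in data]
    n = len(data)
    x_mean = sum(xs) / n
    t_mean = sum(ts) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxt = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    a = sxt / sxx
    b = t_mean - a * x_mean
    ss_res = sum((t - (a * x + b)) ** 2 for x, t in zip(xs, ts))
    ss_tot = sum((t - t_mean) ** 2 for t in ts)
    return a, b, 1 - ss_res / ss_tot


def predict_s(power_w, a, b):
    """Predicted wall time in seconds at a given power limit."""
    return a / power_w + b


a, b, r2 = fit_inverse(DATA)
for p in (300, 200):
    t = predict_s(p, a, b)
    print(f"{p} W: {t:.0f} s, {t / BASELINE_S:.2f}x the 575 W time")
```

Because the model is linear in 1/P, a closed-form regression is enough and no iterative optimizer is needed.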

My take

The finding that surprised me: a lower power limit reduces total energy per unit of work, not just instantaneous draw. I expected longer runtime × lower watts ≈ same total. It is not the same. A 400W training run uses 14% less electricity than 575W for the same epochs.

Whether that matters for your electricity bill depends on how loaded the GPU is. For a personal machine idling 80% of the time, the yearly savings are €26–60 — real, but not life-changing. If you were running a 24/7 training farm, the gap would be €128–300/year per GPU.

My recommendation for this machine: 475W or 500W. At 500W you lose only 7% wall time versus 575W; at 475W (interpolated), about 11%. Either is a better compromise than the extremes:

  • 400W is too slow for interactive work. The 23% penalty is noticeable when iterating.
  • 575W is fast but dumps serious sustained heat into a residential room. The RM1000e can handle it; your summer air conditioning bill and your ears may not.
  • 600W is a poor trade: 2.4% more energy and heat per run in exchange for 1.8% less time.

The fire-safety angle is not paranoia. You built this machine yourself, it lives at your house, and you have a remote power switch because crashes happen. Sustained 570W+ through a residential circuit in July is a lot of heat to manage. Lowering the limit to 475W–500W is a small concession for peace of mind.

One thing this study does not capture is long-term hardware longevity. GDDR7 at reduced voltage stress, cooler VRMs, and lower thermal cycling may extend the card’s useful life — but that is speculation, not measured.

References