Thermal Forensics: The Physics of Heat Transfer, Silicon Gradient Analysis, and the Hotspot Delta as a Primary Degradation Metric

February 27, 2026 | By Assurd Engineering Lab

Thermal Forensics: Why the Temperature You See Is Not the Temperature That Matters

"The Core temperature sensor is an averaging function across a spatially heterogeneous thermal landscape. The Hotspot sensor is the truth. The delta between them is the diagnosis." — Assurd Engineering Lab, Internal SOP v4.3

A buyer examines a GPU in a resale listing. The seller attaches a GPU-Z screenshot showing a Core temperature of 65°C under load. The heatsink fans are spinning. The card looks fine. The price is right.

At Assurd Techlabs, we have rejected cards running at 62°C core temperature. We have certified cards running at 79°C core temperature. The core temperature, in isolation, is one of the most misleading single data points in GPU health assessment — not because it is wrong, but because it is incomplete.

This document explains the full physics of GPU thermal management, the meaning and diagnostic power of the Core-to-Hotspot Delta, the specific failure modes of thermal interface materials (TIM), and why our 1800-second forensic thermal trace catches degradation that every consumer-facing benchmark misses.


Part I: GPU Thermal Architecture — From Junction to Ambient

The Thermal Resistance Network

Heat generated in a GPU die must travel from the silicon junction, where transistor switching converts electrical energy to heat, to the ambient environment. This path crosses several distinct interfaces, each characterized by a thermal resistance ($R_{\theta}$, in °C/W):

$$T_j = T_a + P_{\text{die}} \cdot \left(R_{\theta,j-s} + R_{\theta,s-h} + R_{\theta,h-a}\right)$$

Where:

  • $T_j$ = Junction temperature (the actual temperature of the silicon)
  • $T_a$ = Ambient air temperature
  • $P_{\text{die}}$ = Power dissipated in the die (Watts)
  • $R_{\theta,j-s}$ = Junction-to-spreader thermal resistance (die → integrated heat spreader or bare die)
  • $R_{\theta,s-h}$ = Spreader-to-heatsink thermal resistance (the TIM layer, the critical interface)
  • $R_{\theta,h-a}$ = Heatsink-to-ambient thermal resistance (determined by fin density, airflow, and fan performance)

Modern NVIDIA consumer GPUs use a bare-die configuration: the GPU package does not include an integrated heat spreader (IHS). The thermal interface material (TIM) is applied directly between the GPU die and the heatsink base. This minimizes $R_{\theta,j-s}$ (no IHS layer to traverse) but makes the TIM condition directly and immediately visible in the thermal data.
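To make the series model concrete, here is a minimal Python sketch of the junction temperature equation. The resistance values and the 300 W load are hypothetical numbers chosen only for illustration; they are not measurements of any particular card. The only thing varied between the two cases is the TIM term $R_{\theta,s-h}$.

```python
def junction_temperature(t_ambient_c, p_die_w, r_js, r_sh, r_ha):
    """Series thermal-resistance model: T_j = T_a + P_die * (R_js + R_sh + R_ha)."""
    return t_ambient_c + p_die_w * (r_js + r_sh + r_ha)


# Hypothetical resistances in degC/W, chosen only for illustration.
AMBIENT_C = 22.0
POWER_W = 300.0

healthy = junction_temperature(AMBIENT_C, POWER_W, r_js=0.02, r_sh=0.03, r_ha=0.12)
degraded = junction_temperature(AMBIENT_C, POWER_W, r_js=0.02, r_sh=0.08, r_ha=0.12)

print(f"healthy TIM:  {healthy:.1f} degC")
print(f"degraded TIM: {degraded:.1f} degC")
```

Because the resistances add in series, every extra 0.01 °C/W in the TIM layer costs 3 °C at this assumed power level, regardless of how good the rest of the cooler is.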

The Distributed Nature of GPU Die Heating

Unlike a CPU, which concentrates heat in a relatively small, geometrically simple core area, a GPU die is a spatially distributed heat source. An AD102 die (RTX 4090) measures approximately 608mm² and contains:

  • Shader Array (SM cluster): The primary heat source. Thousands of shader processors operating concurrently, at up to 450W total board power
  • Memory Interface (GPC-to-L2): High-speed data paths generating localized heat near the die perimeter
  • RT Cores and Tensor Cores: Specialized computation units with their own heat distribution profiles
  • Video Engine (NVENC/NVDEC): Separate fixed-function blocks with distinct activity patterns

The spatial power density across the die is highly non-uniform. Certain functional blocks — particularly the shader multiprocessors under 100% ALU utilization — generate significantly more heat per unit area than adjacent blocks. This non-uniformity means the die has thermal gradients: regions that are significantly hotter than others, even when the "average" temperature reported by the driver is moderate.

The Hotspot sensor is the firmware's readout of the single highest-temperature sensor on the die. The Core temperature is a weighted average across multiple sensor sites. The gap between them — the Hotspot Delta — is the spatial signature of these thermal gradients.


Part II: Thermal Interface Material — The Physics of Degradation

What TIM Does and How It Works

The microscopic surface of a machined heatsink base, even when optically polished, is not flat. Surface roughness on the order of Ra 0.1–0.8 μm creates microscopic air gaps at the interface between metal and GPU die. Air has an extremely low thermal conductivity:

$$k_{\text{air}} \approx 0.026 \ \text{W/m} \cdot \text{K}$$

By comparison:

| Material | Thermal Conductivity (W/m·K) |
| --- | --- |
| Air | ~0.026 |
| Kryonaut (Grizzly) | ~12.5 |
| Conductonaut (Liquid Metal) | ~73 |
| Copper | ~385 |
| Aluminum | ~205 |

Even a micrometre-thick layer of air across a 100mm² interface carrying 150W of heat would produce a temperature drop of many tens of degrees, catastrophically degrading thermal performance. TIM fills these microscopic gaps, replacing air with a material of 2–3 orders of magnitude higher thermal conductivity.

The contact thermal resistance of a TIM layer is:

$$R_{\text{TIM}} = \frac{\delta}{k_{\text{TIM}} \cdot A}$$

Where:

  • $\delta$ = TIM bond line thickness (meters), ideally minimized by heatsink clamp force
  • $k_{\text{TIM}}$ = TIM thermal conductivity (W/m·K)
  • $A$ = Contact area (m²)
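The formula can be evaluated directly for the 100mm², 150W case mentioned earlier. The sketch below assumes a uniform 1 μm layer, which is an illustrative simplification: a real interface is a patchwork of contact points and gaps, not a continuous film. Conductivities are taken from the table above.

```python
def tim_resistance(bond_line_m, k_w_per_mk, area_m2):
    """Conduction resistance of a uniform layer: R = delta / (k * A), in degC/W."""
    return bond_line_m / (k_w_per_mk * area_m2)


AREA_M2 = 100e-6   # 100 mm^2 interface, as in the text
HEAT_W = 150.0     # heat flowing through the interface, as in the text
GAP_M = 1e-6       # assumed uniform 1 micrometre layer (illustrative only)

for name, k in [("air", 0.026), ("paste", 12.5), ("liquid metal", 73.0)]:
    r = tim_resistance(GAP_M, k, AREA_M2)
    print(f"{name:>12}: R = {r:.5f} degC/W, drop = {r * HEAT_W:.2f} degC")
```

Under these assumptions the air layer alone accounts for a drop of roughly 58 °C, while the same thickness of paste costs about a tenth of a degree.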

The Pump-Out Effect — Physics and Mechanism

The pump-out effect is the primary TIM failure mode in GPU applications and the most common root cause of elevated Hotspot Deltas in pre-owned cards.

Silicone-based thermal compounds (the category that includes most consumer-grade pastes) are non-Newtonian fluids at elevated temperatures. As the GPU heats up, the silicone carrier becomes less viscous and the paste flows more readily. Under the mechanical pressure of the heatsink mounting hardware, this lower-viscosity paste is squeezed radially outward from the center of the die toward the edges. This reduces the bond line thickness at the die center, which is initially beneficial (thinner TIM means lower $R_{\text{TIM}}$). Over many thermal cycles, however, the paste continues migrating outward until the center of the die, the region of highest heat flux, has no remaining TIM coverage.

The physics of this process follow the squeeze film lubrication model:

$$\frac{\partial \delta}{\partial t} = -\frac{\delta^3 \cdot \Delta P}{3 \mu \cdot r^2}$$

Where:

  • $\delta$ = Bond line thickness
  • $\Delta P$ = Applied pressure from mounting hardware
  • $\mu$ = Dynamic viscosity of the TIM (temperature-dependent)
  • $r$ = Radial position from center

The temperature dependence of viscosity ($\mu$ decreasing with increasing $T$) creates a positive feedback loop: as the die gets hotter (because the TIM is degrading), the TIM becomes more liquid, which accelerates further pump-out, which increases the die temperature further.
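For constant pressure, viscosity, and radius, the squeeze-film equation above has a closed-form solution, $\delta(t) = \delta_0 / \sqrt{1 + 2 c\, \delta_0^2\, t}$ with $c = \Delta P / (3 \mu r^2)$. The sketch below evaluates it for two viscosities to illustrate how strongly a thinner carrier accelerates bond line thinning. All parameter values are hypothetical, and the model ignores filler particles, which in a real paste set a floor on how thin the layer can get.

```python
import math


def bond_line_thickness(t_s, delta0_m, pressure_pa, viscosity_pa_s, radius_m):
    """Closed-form solution of d(delta)/dt = -delta^3 * dP / (3 * mu * r^2).

    Assumes pressure, viscosity, and radius stay constant over time.
    """
    c = pressure_pa / (3.0 * viscosity_pa_s * radius_m ** 2)
    return delta0_m / math.sqrt(1.0 + 2.0 * c * delta0_m ** 2 * t_s)


# Hypothetical values for illustration only.
DELTA0_M = 50e-6      # initial bond line: 50 micrometres
PRESSURE_PA = 2e5     # mounting pressure
RADIUS_M = 1e-2       # 10 mm from the die center
ONE_HOUR_S = 3600.0

for label, mu in [("cool, viscous paste", 1000.0), ("hot, thinned paste", 100.0)]:
    d = bond_line_thickness(ONE_HOUR_S, DELTA0_M, PRESSURE_PA, mu, RADIUS_M)
    print(f"{label}: {d * 1e6:.2f} um after one hour")
```

The tenfold drop in viscosity does not produce a tenfold thinner layer, because the rate falls off with $\delta^3$, but it does shorten the time needed to reach any given thickness by the same factor of ten.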

The result is a characteristic thermal signature:

  • Hotspot temperature rises (the center of the die, now TIM-starved, can no longer efficiently transfer heat)
  • Core temperature may remain moderate (the averaging function of the Core sensor masks the localized hotspot)
  • Hotspot Delta increases dramatically, from a healthy 10–15°C to 25°C, 35°C, or more

This is precisely why a card at 62°C Core with a 40°C Hotspot Delta — indicating a 102°C Hotspot — is more concerning than a card at 79°C Core with a 14°C Hotspot Delta (93°C Hotspot). The former has a severely degraded thermal interface. The latter is being pushed to its thermal limit by an aggressive power curve, but its heat transfer path is intact.

Silicone Oil Separation and Phase Change Degradation

Beyond pump-out, long-term TIM degradation is driven by silicone oil separation. Many thermal compounds are formulated as a suspension of high-thermal-conductivity particles (silver, zinc oxide, aluminum oxide) in a silicone carrier oil. Over years of thermal cycling, the oil component separates from the particle suspension, migrates to the perimeter (driven by pressure and surface tension gradients), and evaporates or oxidizes.

The remaining compound becomes:

  1. Harder (reduced bond line compliance, creating mechanical stress on the die during thermal cycling)
  2. Less conductive (the oil separation disrupts the thermal conduction pathways through the particle network)
  3. Cracked (dried compound can develop microcracks, which are essentially air inclusions in the thermal pathway)

Phase-change TIM (solid at room temperature, liquid at operating temperature) avoids the pump-out issue but can undergo phase segregation over time, with similar performance degradation consequences.


Part III: VRAM and Memory Junction Temperature Forensics

GDDR6X — A Uniquely Demanding Thermal Target

GDDR6X memory (used in RTX 3080/3090/4080/4090 class products) operates with a significantly different thermal profile than the GDDR6 found in lower-tier cards. GDDR6X uses PAM4 (4-level Pulse Amplitude Modulation) signaling to reach its bandwidth targets: for example, 19.5 Gbps per pin on the GA102-based RTX 3090 and 21 Gbps on the AD102-based RTX 4090, versus 15 Gbps for GDDR6 on the GA106-based RTX 3060. PAM4 signaling requires higher output driver power, which translates directly to increased heat generation per DRAM die.

The memory packages on a high-end card are arranged on the PCB around the GPU package, typically covered by thermal pads (not paste) that transfer heat to the heatsink. The thermal pad specification (thickness, thermal conductivity, and compression factor) is precisely engineered by the GPU manufacturer.

Thermal pad degradation mechanisms:

  • Compression set: Soft thermal pads (typically 5–17 W/m·K silicone or phase-change pads) permanently deform under heatsink pressure over time. A pad that began at 1.5mm thick may measure 1.1mm after 3 years of use. A pad that has taken a set no longer presses firmly against the memory package and the heatsink, so contact resistance rises. This increases $R_{\text{TIM}}$ and raises VRAM junction temperatures.

  • Hardening: Some thermal pad formulations harden with age and thermal cycling, losing their compliance. Hard pads cannot conform to microscopic surface irregularities, creating effective air gaps.

  • Displacement: In severe cases (mining rigs operated without proper PCB support, or cards dropped during handling), thermal pads can partially delaminate or shift, leaving entire DRAM dies with inadequate thermal contact.

Memory Junction Temperature Limits and Throttling Behavior

GDDR6X memory has a maximum rated junction temperature ($T_j$) of 95°C for sustained operation. NVIDIA's firmware implements a memory temperature throttle: when VRAM junction temperature exceeds approximately 95°C, the firmware reduces the memory clock speed and, if necessary, the GPU core clock to reduce heat generation.

This throttle is often invisible in typical diagnostic tools. GPU-Z reports the "Memory Temperature" sensor (on cards where this sensor is accessible — some do not expose this via software), but the firmware-level throttle response is not always reflected in driver-reported status.

At Assurd, we monitor:

  1. Memory Junction temperature (where hardware-accessible)
  2. Memory clock consistency — frequency drops below nominal at any point during stress test indicate thermal throttling
  3. Memory bandwidth degradation, measured via compute benchmarks with known memory bandwidth demands

$$\eta_{\text{mem}} = \frac{\text{Measured Bandwidth (GB/s)}}{\text{Theoretical Peak Bandwidth (GB/s)}}$$

A healthy card with no memory thermal throttling achieves $\eta_{\text{mem}} > 0.95$. Throttled or degraded cards fall below 0.90.
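As a rough illustration of the efficiency check, the sketch below computes theoretical peak bandwidth from bus width and per-pin data rate, then applies the 0.95 and 0.90 thresholds. The measured figures are invented for the example, and the "borderline" label for values between the two thresholds is an assumption of this sketch, since the text does not name that band.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Theoretical peak bandwidth in GB/s: bus width (bits) * per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8.0


def memory_efficiency(measured_gbs, peak_gbs):
    """eta_mem = measured bandwidth / theoretical peak bandwidth."""
    return measured_gbs / peak_gbs


def classify_efficiency(eta):
    """Apply the thresholds from the text; the middle band label is an assumption."""
    if eta > 0.95:
        return "healthy"
    if eta < 0.90:
        return "throttled or degraded"
    return "borderline"


# 384-bit bus at 21 Gbps per pin (RTX 4090 class) gives 1008 GB/s peak.
peak = peak_bandwidth_gbs(384, 21.0)

for measured in (970.0, 880.0):  # hypothetical benchmark results in GB/s
    eta = memory_efficiency(measured, peak)
    print(f"measured {measured:.0f} GB/s -> eta = {eta:.3f} ({classify_efficiency(eta)})")
```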


Part IV: The Assurd Forensic Thermal Trace — 1800 Seconds

Why 30 Minutes Is the Minimum

Consumer benchmark tools — Unigine Superposition, 3DMark, even FurMark in its default modes — run for 90 to 300 seconds. Marketing-focused lab reviews rarely test beyond 600 seconds. This is insufficient for forensic thermal certification for a fundamental physical reason: thermal mass.

A GPU heatsink assembly has significant thermal mass — the combined mass of copper heat pipes, aluminum fin stack, and copper base plate. Heat flow into this mass follows a time-domain exponential approach:

$$T_{\text{heatsink}}(t) = T_a + \Delta T_{\text{ss}} \cdot \left(1 - e^{-t/\tau}\right)$$

Where:

  • $\Delta T_{\text{ss}}$ = Steady-state temperature rise above ambient
  • $\tau$ = Thermal time constant = $R_{\theta} \cdot C_{\theta}$ (product of thermal resistance and thermal mass, i.e. heat capacity)

For a typical triple-fan GPU heatsink with significant mass, $\tau$ is in the range of 600–1200 seconds. This means a heatsink doesn't reach thermal equilibrium (its steady-state temperature) until well past 10 minutes of continuous load. A 2-minute FurMark run catches the silicon and VRM transient response, but the heatsink is still absorbing heat, so the GPU is running cooler than it will in a sustained gaming session.
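The exponential approach can be turned into a quick estimate of how much of the final temperature rise a test of a given length actually sees. The sketch below assumes the single-time-constant model above holds exactly, which is a simplification for a real multi-part cooler, and uses the two ends of the quoted $\tau$ range.

```python
import math


def fraction_of_steady_state(t_s, tau_s):
    """Fraction of the steady-state rise reached after t seconds: 1 - exp(-t / tau)."""
    return 1.0 - math.exp(-t_s / tau_s)


for tau in (600.0, 1200.0):          # ends of the range quoted in the text
    for t in (120.0, 600.0, 1800.0):  # short benchmark, long review, forensic trace
        pct = 100.0 * fraction_of_steady_state(t, tau)
        print(f"tau = {tau:>6.0f} s, t = {t:>6.0f} s -> {pct:5.1f}% of steady-state rise")
```

With $\tau$ = 600 s, a 120-second run reaches only about 18% of the eventual rise, while 1800 seconds reaches about 95%. At the upper end of the range ($\tau$ = 1200 s) the heatsink is still approaching equilibrium at 1800 seconds, which is why 30 minutes is treated as a minimum and not a comfortable margin.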

Heat Soak: Once the heatsink approaches its thermal saturation point, its reduced heat-sinking capacity causes the GPU die to run hotter than in the early phase of the test. This is when:

  • TIM pump-out degradation becomes visible — the failing thermal path produces a Hotspot Delta that was invisible at t=120s
  • Fan bearing degradation becomes audible and measurable — early in a run, fans operate at moderate RPM. As temperatures rise, fan speeds increase to maximum, and any bearing wear or blade imbalance becomes apparent in the RPM trace
  • Thermal throttling becomes sustained: prolonged operation above the thermal limit produces a clock-state reduction that does not recover while the load persists, unlike the brief excursions seen in short tests

The 1-Second Polling Protocol

Our forensic trace captures the following parameters at 1-second intervals for the full 1800-second duration:

| Sensor | Unit | Diagnostic Purpose |
| --- | --- | --- |
| Core Temperature | °C | Baseline thermal performance |
| Hotspot Temperature | °C | TIM and die integrity |
| VRAM Junction (if accessible) | °C | Memory thermal health |
| Fan RPM (all channels) | RPM | Bearing health, control loop stability |
| GPU Core Clock | MHz | Thermal throttle detection |
| Power Draw (die) | W | Load consistency |
| VCore | mV | VRM thermal stability |
| Hotspot Delta (derived) | °C | Primary TIM diagnostic |

Hotspot Delta ($\Delta T_{hs}$) is computed per-sample:

$$\Delta T_{hs}(t) = T_{\text{hotspot}}(t) - T_{\text{core}}(t)$$

We report both the time-averaged $\overline{\Delta T_{hs}}$ and the steady-state value $\Delta T_{hs,ss}$ (the average of samples from 1500–1800s).
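A minimal sketch of the two reported figures, assuming the trace is already available as two equal-length lists sampled once per second, so that list index equals elapsed seconds. The function name and the synthetic trace are illustrative; this is not the lab's actual tooling.

```python
import math
from statistics import fmean


def hotspot_delta_summary(core_c, hotspot_c, ss_start=1500):
    """Return (time-averaged delta, steady-state delta) for a 1 Hz trace.

    With one sample per second, index i corresponds to t = i seconds, so the
    steady-state window is every sample from index ss_start onward.
    """
    if len(core_c) != len(hotspot_c):
        raise ValueError("core and hotspot traces must have equal length")
    deltas = [h - c for c, h in zip(core_c, hotspot_c)]
    if len(deltas) <= ss_start:
        raise ValueError("trace is shorter than the steady-state window start")
    return fmean(deltas), fmean(deltas[ss_start:])


# Synthetic 1800-sample trace (illustrative shapes, not measured data).
core = [40.0 + 30.0 * (1.0 - math.exp(-t / 300.0)) for t in range(1800)]
hotspot = [c + 8.0 + 6.0 * (1.0 - math.exp(-t / 600.0)) for t, c in enumerate(core)]

avg_delta, ss_delta = hotspot_delta_summary(core, hotspot)
print(f"time-averaged delta: {avg_delta:.2f} degC")
print(f"steady-state delta:  {ss_delta:.2f} degC")
```

In a trace where the delta grows during heat soak, the steady-state figure sits above the time average, so reporting both shows how much of the gap only appears late in the run.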

Fan Curve Forensics

Fan health is assessed not by peak RPM but by the RPM-temperature relationship — the fan curve — and by the stability of RPM at each temperature setpoint.

A healthy fan maintains a smooth, monotonic RPM response to temperature increases, with an RPM standard deviation at any given temperature setpoint of <50 RPM (1-sigma). Degraded bearings introduce characteristic signatures:

  • Intermittent RPM drops: A ball bearing with worn races produces periodic friction spikes that momentarily slow the fan, visible as downward RPM excursions in the trace
  • Hunting oscillation: A control loop with degraded fan response characteristics oscillates around the target RPM rather than settling — visible as sustained sinusoidal RPM variation
  • RPM noise floor increase: Increased variance in the RPM signal (higher standard deviation) indicates mechanical noise in the bearing

$$\sigma_{\text{RPM}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(RPM_i - \overline{RPM})^2}$$

Certification requirement: $\sigma_{\text{RPM}} < 35$ RPM at any steady-state setpoint.
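The formula above is the population standard deviation (dividing by $N$), which Python's standard library provides as `statistics.pstdev`. The sketch below applies it to two short, invented RPM windows at a single setpoint; a real trace would use far more samples.

```python
from statistics import pstdev

SIGMA_LIMIT_RPM = 35.0  # certification limit from the text


def rpm_sigma(samples):
    """Population standard deviation of RPM samples at one steady-state setpoint."""
    return pstdev(samples)


def passes_fan_check(samples, limit=SIGMA_LIMIT_RPM):
    """True when sigma_RPM is below the certification limit."""
    return rpm_sigma(samples) < limit


# Invented sample windows, for illustration only.
steady_fan = [1800, 1810, 1790, 1805, 1795]
noisy_fan = [1800, 1700, 1850, 1750, 1900]

for name, window in [("steady", steady_fan), ("noisy", noisy_fan)]:
    print(f"{name}: sigma = {rpm_sigma(window):.1f} RPM, pass = {passes_fan_check(window)}")
```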


Part V: Heatsink Integrity and Warpage Analysis

GPU Heatsink Sag and PCB Flexure

High-mass heatsink assemblies — particularly triple-fan designs weighing 1.2–2.0 kg — create significant mechanical stress on the GPU PCB when the card is mounted in a tower case. Over years of operation and handling, this can produce PCB flexure: a bow in the PCB that changes the applied pressure distribution of the heatsink mounting hardware.

The consequences:

  • Reduced clamping force at the die center (less compression of the TIM bond line, and therefore higher $R_{\text{TIM}}$)
  • Non-uniform contact pressure, which combined with pump-out, creates asymmetric hot spots

We assess PCB bow and structural flexure using precision machinist straightedges and planar gauge tolerances. A bow of >0.3mm over the 300mm typical GPU PCB length is flagged for heatsink mounting hardware assessment.

Heat Pipe Degradation

Modern GPU heatsinks use copper heat pipes filled with a working fluid (typically distilled water or ethanol) in a vacuum. The fluid evaporates at the hot end (above the GPU die), travels as vapor to the cold end (in the fin stack), condenses, and returns to the hot end via a sintered wick structure.

Degradation mechanisms:

  • Working fluid loss: Micro-leaks at the heat pipe end caps reduce the fluid inventory, degrading heat transport capacity
  • Wick degradation: Oxidation or contamination of the sintered copper wick reduces capillary action, limiting the maximum heat transport rate
  • Non-condensable gas accumulation: Outgassing from internal surfaces can accumulate as non-condensable gas, progressively displacing the working fluid

A degraded heat pipe has a measurably higher $R_{\theta,h-s}$, the thermal resistance from heatsink base to fin array (distinct from the spreader-to-heatsink term $R_{\theta,s-h}$ in Part I). This manifests as an elevated base-to-fin temperature delta, detectable with an IR thermometer or thermal camera during our inspection process.


Part VI: Thermal Certification Thresholds

Full Test Matrix

| Parameter | Assurd Gold | Assurd Certified | Investigation Required | Fail |
| --- | --- | --- | --- | --- |
| Hotspot Delta ($\overline{\Delta T_{hs}}$) | < 12°C | < 18°C | 18–30°C | > 30°C |
| Steady-State Core Temp (at rated TGP, 22°C ambient) | < 72°C | < 80°C | < 83°C | > 83°C |
| VRAM Junction (where accessible) | < 80°C | < 88°C | > 90°C triggers investigation | > 95°C |
| Fan RPM Sigma | < 20 RPM | < 35 RPM | > 50 RPM: bearing assessment | > 80 RPM: bearing replacement |
| Clock Consistency (% time at max boost) | > 97% | > 93% | Thermal investigation if < 90% | Fail |
| Thermal Throttle Events (>1°C below TJ max) | 0 | < 3 total | > 10 | Persistent |
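The Hotspot Delta row of the matrix can be expressed as a simple classifier. This sketch covers that one row only; an actual certification would have to satisfy every row, and the treatment of the exact boundary values (18°C and 30°C falling into "Investigation Required") is an assumption read from the "18–30°C" band.

```python
def classify_hotspot_delta(delta_c):
    """Map a Hotspot Delta in degC onto the tiers of the Hotspot Delta row."""
    if delta_c < 12.0:
        return "Assurd Gold"
    if delta_c < 18.0:
        return "Assurd Certified"
    if delta_c <= 30.0:
        return "Investigation Required"
    return "Fail"


# The two example cards discussed in Part II: (core temp, hotspot temp) in degC.
cards = {"62 degC core card": (62.0, 102.0), "79 degC core card": (79.0, 93.0)}

for name, (core_c, hotspot_c) in cards.items():
    delta = hotspot_c - core_c
    print(f"{name}: delta = {delta:.0f} degC -> {classify_hotspot_delta(delta)}")
```

On this single criterion the cooler-running card fails outright while the hotter one lands in the certified band, which is the reversal the core temperature alone would hide.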

Thermal Investigation Protocol

Cards with Hotspot Deltas in the 18–30°C range are subject to a detailed thermal investigation. Findings are documented in the certificate and disclosed to the customer. The card is returned with a full written thermal assessment.

Assurd does not perform hardware remediation — our role is forensic documentation, not repair.


Conclusion

The core temperature your GPU reports is a comfort metric — a system-level average designed to be readable by consumers and activation firmware. The Hotspot Delta is an engineering metric. It encodes the state of the thermal interface, the integrity of the heat transfer path, and the long-term silicon health trajectory of every card we process.

When we say a card has been thermally certified, we mean it has survived 1800 seconds of heat soak, that its Hotspot Delta was measured at thermal equilibrium and found within specification, that its VRAM temperatures are within safe operating margins, and that its fan system demonstrated consistent, stable performance across the full temperature range.

You cannot learn any of this from a 3-minute benchmark screenshot.


Thermal certification data, including the full 1800-second trace for any certified unit, is available to the original purchaser upon request.