elric neumann

Entropy and compute estimation in model inputs

Various techniques have been used to estimate VRAM usage, which differ between training and inference modes, as described in Reducing Activation Recomputation in Large Transformer Models by NVIDIA researchers. At times it may be more intuitive to know the cost of a model's input, which requires logprobs and thus relates to relative per-token entropy.

The cache memory cost is influenced by the dimensionality of the input's embedding vector: larger vectors require more memory, and the cost scales proportionally. We define memory cost as:

$$\text{memory cost} = \text{vector dim} \times \text{memory rate}$$

where:

- vector dim is the dimensionality of the input's embedding vector
- memory rate is a constant per-dimension memory cost (memory_rate in the code below)

Regarding the memory rate, we need to assess what occurs during vectorization, since we require the call-overhead delta between successive layers. When embedding vectors are first loaded from main memory into the CPU (for preprocessing, or during memory transfers to the GPU), they initially pass through the CPU cache hierarchy.

The outcome of this is that, for large vectors, frequent cache misses in L1 and L2 and evictions from L3 increase main-memory access latency, which affects processing time and the effective memory rate. This is particularly relevant in pipelines where embeddings are preprocessed on the CPU before being transferred to the GPU.

Memory rate on the GPU: for embedding vectors that don't fit entirely into GPU caches, cache misses force accesses to the slower, off-chip global memory (GDDR SDRAM).
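As a rough back-of-the-envelope check (the vector dimension, batch size, and cache sizes below are assumptions for illustration, not measurements):

# A single fp32 embedding of dimension 4096 occupies 16 KiB.
vector_dim, bytes_per_float = 4096, 4
vector_bytes = vector_dim * bytes_per_float  # 16_384 bytes

# A batch of 512 such vectors is 8 MiB, which already exceeds a typical
# L2 cache and competes for L3, so most accesses fall through to main
# memory before the GPU transfer even begins.
batch_size = 512
batch_bytes = batch_size * vector_bytes  # 8_388_608 bytes (8 MiB)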

There is actually a long lore around how VRAM on GPUs is partitioned into texture pages and how pipelining texture coordinates from the CPU is handled, but I won't go into that here; you can read about it on the OpenGL forums.


Processing cost accounts for the computational effort required to handle the input's context length and complexity (entropy):

$$\text{processing cost} = \big(\text{context length} \times \log(1 + \text{entropy})\big) \times \text{processing rate}$$

where:

- context length is the number of tokens in the input
- entropy is the normalized per-token entropy (computed below)
- processing rate is a constant per-token compute cost (processing_rate in the code below)

We use a logarithmic function of entropy to model diminishing returns, implying that very high entropy inputs do not exponentially increase computational complexity.
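For instance, with the natural log used in the code below, $\log(1 + 1) \approx 0.69$ while $\log(1 + 8) \approx 2.20$, so an eightfold increase in entropy raises the processing term by only about 3.2×.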

Each embedding dimension in a vector requires storage, and moving these embeddings between CPU and GPU (or within GPU memory) incurs bandwidth usage and potential bottlenecks. The available memory bandwidth (e.g., GDDR6 bandwidth on modern GPUs, or the PCIe link from main memory) limits the rate at which data can be transferred, especially for models with high-dimensional embeddings.
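To make the bandwidth point concrete, here is a hedged sketch; the PCIe and GDDR6 figures are rough published peak numbers, not measurements:

batch_bytes = 8 * 1024**2   # the 8 MiB embedding batch from above
pcie_bw = 32e9              # ~32 GB/s host-to-GPU over PCIe 4.0 x16 (assumed)
gddr6_bw = 448e9            # ~448 GB/s on-device GDDR6 bandwidth (assumed)

print(batch_bytes / pcie_bw * 1e6, "microseconds over PCIe")   # ~262 us
print(batch_bytes / gddr6_bw * 1e6, "microseconds in GDDR6")   # ~19 us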


VRAM cost is inversely proportional to the available GPU memory capacity, since more VRAM allows faster processing by reducing CPU-GPU data transfers. VRAM cost is given by:

$$\text{vram cost} = \frac{\text{context length}}{\text{vram capacity}} \times \text{vram rate}$$

where:

- vram capacity is the available GPU memory
- vram rate is a constant scaling factor (vram_rate in the code below)

The techniques for estimating VRAM described in the paper linked above can stand in for vram capacity (cf. the estimation by Andrew Lapp):

# Activation memory, in elements, from Andrew Lapp's estimate
activations = (
  num_layers * (5/2) * num_attn_heads * batch_size * seq_len**2
  + 17 * batch_size * hidden_size * seq_len
)

# KV cache elements: keys and values for every layer and every token
kv_cache = batch_size * seq_len * 2 * num_layers * hidden_size

VRAM is then estimated from the activations or the KV cache by scaling with respect to precision, depending on the model's mode (training or inference). Each subsequent estimate falls in the same range, since it depends only on hyperparameters and not on probabilities (which come from learned parameters).
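A minimal sketch of that scaling step, reusing the activations and kv_cache element counts from above; the mode switch and byte sizes are my assumptions, not part of the original estimate:

def estimate_vram_gib(activations, kv_cache, mode="inference", bytes_per_element=2):
  # fp16/bf16 -> 2 bytes, fp32 -> 4 bytes, int8 -> 1 byte per element.
  elements = kv_cache if mode == "inference" else activations
  return elements * bytes_per_element / 1024**3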

Notice that this approach does not estimate LoRA VRAM requirements.

Entropic measure of complexity with logprobs #

For models like GPT or similar, token probabilities are typically accessible via the model's API if it exposes log_probs or next_token_probs, which provide the per-token probabilities required to compute entropy.

We then apply Shannon entropy over all token probabilities:

import math

def cost_entropy(token_probs):
  entropy = -sum(p * math.log2(p) for p in token_probs if p > 0)
  return entropy / len(token_probs)  # normalized entropy (average over tokens)
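Most APIs return natural-log probabilities rather than raw probabilities, so a small conversion step (a sketch; token_logprobs is a hypothetical list taken from the response) produces the input for cost_entropy:

def logprobs_to_probs(token_logprobs):
  # Natural-log probabilities -> probabilities in (0, 1].
  return [math.exp(lp) for lp in token_logprobs]

# token_logprobs = [-0.05, -2.3, -0.7]  # hypothetical values from an API response
# entropy = cost_entropy(logprobs_to_probs(token_logprobs))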

Since the estimate of the model's input cost is proportional to this relative entropy, we combine the three terms:

def cost_input(vector_dim, context_length, entropy, vram_capacity,
               memory_rate=0.001, processing_rate=0.002, vram_rate=0.0005):
  memory_cost = vector_dim * memory_rate
  processing_cost = (context_length * math.log(1 + entropy)) * processing_rate
  vram_cost = (context_length / vram_capacity) * vram_rate
  return memory_cost + processing_cost + vram_cost
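Putting the pieces together; every number below is a placeholder chosen for illustration:

token_probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical per-token probabilities
entropy = cost_entropy(token_probs)

cost = cost_input(
  vector_dim=4096,      # embedding dimensionality
  context_length=2048,  # tokens in the prompt
  entropy=entropy,
  vram_capacity=24,     # available GPU memory, e.g. in GB
)
print(cost)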

Applicability #

I think this approach not only applies to autoregressive decoders but could also be mapped to a diffusion model, although it would look different for, say, DenseNet-*, whose densely connected blocks have to represent detailed spatial relationships learned by earlier convolutional layers. Embedding dimensions would correspond to feature maps, and processing_cost would scale with the number of convolutions and the filter size, as sketched below.
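A hedged sketch of that mapping; the formula below is only my reading of the suggestion, not something defined earlier in the post:

def cost_processing_conv(num_convs, filter_size, num_feature_maps, processing_rate=0.002):
  # Feature maps stand in for the embedding dimension; compute scales with
  # the number of convolutions and the (square) filter size.
  return num_convs * (filter_size ** 2) * num_feature_maps * processing_rate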

Given that small variations increase the model's capacity to capture fine-grained distinctions, the only way to mitigate this without discarding the entropic cost is to enable dropout, batch normalization, and other regularizations (maybe).