Entropy and compute estimation in model inputs
Various techniques exist for estimating VRAM usage, and they differ between training and inference modes, as described in Reducing Activation Recomputation in Large Transformer Models by NVIDIA researchers. At times it may be more intuitive to estimate the cost of a model's input instead; this requires logprobs, which relate to the relative per-token entropy.
The cache memory cost is influenced by the dimensionality of the input's embedding vector: larger vectors require proportionally more memory. We define memory cost as:

memory_cost = vector_dim * memory_rate

where:
- vector_dim is the number of dimensions in the embedding vector, and
- memory_rate is a fixed cost per dimension.
Regarding the memory rate, we need to look at what happens during vectorization, since what we're after is the call-overhead delta between subsequent layers. When embedding vectors are first loaded from main memory into the CPU (for preprocessing or during memory transfers to the GPU), they initially pass through the CPU cache hierarchy:
- L1 Cache: Closest to the CPU cores with the lowest latency (e.g. 1-4 cycles) but limited capacity, typically around 32 KB per core. It's effective for storing small, frequently accessed data but is quickly saturated by high-dimensional embeddings.
- L2 Cache: Intermediate-level cache, typically several hundred KB per core. L2 cache is slightly slower but larger than L1, helping to store chunks of embeddings that can fit within its capacity, reducing frequent main memory access.
- L3 Cache: Shared among all CPU cores, often several MB in size. L3 acts as a last-level buffer before main memory, staging larger data transfers. For high-dimensional vectors, L3 becomes essential, but frequent cache evictions occur since it's shared across all cores and can't hold many large vectors simultaneously.
Memory rate on the GPU:
- L1 Cache (per Streaming Multiprocessor): Each Streaming Multiprocessor (SM) in a GPU has its own L1 cache, which is small (usually around 64-128 KB per SM). L1 on the GPU is fast and efficient for handling the immediate data needed by the processing threads in that SM, but it is easily overwhelmed by high-dimensional embeddings.
- L2 Cache (shared across GPU): Modern GPUs have a larger, shared L2 cache (ranging from several MB, e.g. 4-16 MB, depending on GPU architecture). The L2 cache is crucial for managing data that needs to be shared across multiple SMs or repeatedly accessed across threads. When handling embeddings, L2 helps reduce latency by storing chunks of vectors accessed by different SMs, but it, too, becomes quickly saturated as embeddings grow in size.
The outcome is that, for large vectors, frequent L1/L2 cache misses and L3 evictions increase main-memory access latency, which inflates processing time and the effective memory rate. This is particularly relevant in pipelines where embeddings are preprocessed on the CPU before being transferred to the GPU. For embedding vectors that don't fit entirely into the GPU caches, cache misses force accesses to slower, off-chip global memory (GDDR SDRAM).
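As a rough illustration of why high-dimensional embeddings overflow these caches, the sketch below compares the footprint of a small batch of embedding vectors against the typical capacities quoted above (fp32 elements, a batch of 32, and the capacities themselves are assumptions for illustration):

```python
# Assumed, illustrative capacities based on the typical sizes quoted above.
CACHE_SIZES_KIB = {
    "CPU L1 (per core)": 32,
    "CPU L2 (per core)": 512,
    "CPU L3 (shared)": 8 * 1024,
    "GPU L1 (per SM)": 128,
    "GPU L2 (shared)": 8 * 1024,
}

def batch_kib(vector_dim, batch_size=32, bytes_per_elem=4):  # fp32 assumed
    """Footprint of a batch of embedding vectors in KiB."""
    return vector_dim * batch_size * bytes_per_elem / 1024

for dim in (768, 4096, 12288):
    size = batch_kib(dim)
    fits = [name for name, cap in CACHE_SIZES_KIB.items() if size <= cap]
    print(f"dim={dim}: {size:.0f} KiB, fits in: {', '.join(fits) or 'main memory only'}")
```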
There is actually a long lore around how VRAM on GPUs is partitioned into texture pages and how pipelining texture coordinates from the CPU is handled, but I won't go into that here; you can read about it on the OpenGL forums.
Processing cost accounts for the computational effort required to handle the input's context length and complexity (entropy):

processing_cost = context_length * log(1 + entropy) * processing_rate

where:
- context_length is the length of the input (number of tokens),
- entropy reflects the unpredictability of the input w.r.t. the final layer, and
- processing_rate is a fixed multiplier for computational effort per token and unit of entropy.
We use a logarithmic function of entropy to model diminishing returns, implying that very high entropy inputs do not exponentially increase computational complexity.
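To make the diminishing returns concrete, here is a quick comparison (values are illustrative only) of the log(1 + entropy) term against entropy itself, using the same processing_rate as the cost function later in the post:

```python
import math

context_length, processing_rate = 2048, 0.002  # illustrative values

for entropy in (0.5, 1.0, 2.0, 4.0, 8.0):
    log_term = math.log(1 + entropy)
    processing_cost = context_length * log_term * processing_rate
    print(f"entropy={entropy:>4}: log(1 + entropy)={log_term:.3f}, "
          f"processing_cost={processing_cost:.2f}")
```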
Each embedding dimension in a vector requires storage, and moving these embeddings between CPU and GPU (or within GPU memory) incurs bandwidth usage and potential bottlenecks. Memory bandwidth between the GPU and main memory (e.g., GDDR6 bandwidth in GPUs) impacts the rate at which data can be transferred, especially for models with high-dimensional embeddings.
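As a back-of-the-envelope sketch, transfer time scales directly with embedding dimensionality and sequence length; the 448 GB/s GDDR6 bandwidth and fp16 precision below are assumptions, not measurements:

```python
def transfer_time_ms(batch_size, seq_len, hidden_size,
                     bytes_per_elem=2,      # fp16 assumed
                     bandwidth_gb_s=448):   # nominal GDDR6 bandwidth, assumed
    total_bytes = batch_size * seq_len * hidden_size * bytes_per_elem
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3

# e.g. a hypothetical 4096-dim model with a 2048-token context at batch size 8
print(f"{transfer_time_ms(8, 2048, 4096):.3f} ms per transfer")
```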
VRAM cost is inversely proportional to the available GPU memory capacity, since higher VRAM allows faster processing by reducing CPU-GPU data transfer. It is given by:

vram_cost = (context_length / vram_capacity) * vram_rate

where:
- vram_capacity is the GPU VRAM in gigabytes,
- context_length is the input sequence length, and
- vram_rate is a scaling factor for the cost per token, adjusted for VRAM capacity.
The VRAM estimation techniques from the paper linked above can be used in place of vram_capacity (cf. this estimation by Andrew Lapp):
```python
# Activation memory for a transformer forward pass: the seq_len**2 term comes
# from the attention score matrices, the linear term from hidden-state activations.
activations = (
    num_layers * (5 / 2) * num_attn_heads * batch_size * seq_len**2
    + 17 * batch_size * hidden_size * seq_len
)

# KV cache: keys and values for every layer, token, and hidden unit.
kv_cache = batch_size * seq_len * 2 * num_layers * hidden_size
```
VRAM is then estimated from either the activations or the KV cache by scaling w.r.t. precision (bytes per element), depending on the model's mode. Each subsequent estimation will fall in the same range, since it depends on hyperparameters and not on probabilities (which depend on learned parameters).
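A minimal sketch of that scaling step, assuming fp16 (2 bytes per element); the hyperparameters in the example call are hypothetical:

```python
def estimate_vram_gb(num_layers, num_attn_heads, hidden_size,
                     batch_size, seq_len, bytes_per_elem=2, training=True):
    activations = (
        num_layers * (5 / 2) * num_attn_heads * batch_size * seq_len**2
        + 17 * batch_size * hidden_size * seq_len
    )
    kv_cache = batch_size * seq_len * 2 * num_layers * hidden_size
    # Training keeps activations around for the backward pass; inference mostly
    # pays for the KV cache.
    elements = activations if training else kv_cache
    return elements * bytes_per_elem / 1e9

# e.g. a hypothetical 32-layer, 32-head, 4096-hidden model at seq_len=2048
print(estimate_vram_gb(32, 32, 4096, batch_size=1, seq_len=2048, training=False))
```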
Notice that this approach does not estimate LoRA VRAM requirements.
Entropic measure of complexity with logprobs #
For models like GPT or similar, token probabilities are typically accessible via the model's API if it supports log_probs or next_token_probs, which provide the per-token probabilities required to compute entropy.
We then apply Shannon entropy over all token probabilities:
```python
import math

def cost_entropy(token_probs):
    # Shannon entropy over per-token probabilities, skipping zero-probability tokens.
    entropy = -sum(p * math.log2(p) for p in token_probs if p > 0)
    return entropy / len(token_probs)  # normalized entropy (average over tokens)
```
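If the API returns log probabilities rather than raw probabilities (as logprobs-style fields usually do), exponentiate them first; the values below are made up for illustration:

```python
# Hypothetical per-token natural-log probabilities from a completion API.
token_logprobs = [-0.12, -1.8, -0.5, -3.2, -0.9]

token_probs = [math.exp(lp) for lp in token_logprobs]
print(cost_entropy(token_probs))
```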
The estimation of the model's input cost is then proportional to this relative entropy:
```python
def cost_input(vector_dim, context_length, entropy, vram_capacity,
               memory_rate=0.001, processing_rate=0.002, vram_rate=0.0005):
    memory_cost = vector_dim * memory_rate
    processing_cost = (context_length * math.log(1 + entropy)) * processing_rate
    vram_cost = (context_length / vram_capacity) * vram_rate
    return memory_cost + processing_cost + vram_cost
```
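A quick end-to-end usage sketch; the token probabilities and the 24 GB VRAM figure are made-up values:

```python
token_probs = [0.62, 0.21, 0.09, 0.05, 0.03]   # hypothetical per-token probabilities
entropy = cost_entropy(token_probs)

cost = cost_input(vector_dim=4096, context_length=2048,
                  entropy=entropy, vram_capacity=24)
print(f"entropy={entropy:.3f}, input cost={cost:.3f}")
```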
Applicability #
I think this approach may not only apply to autoregressive decoders but could also be mapped to a diffusion model, although it would look different for, say, a DenseNet-* architecture, whose densely connected blocks have to represent the detailed spatial relationships learned by earlier convolutional layers. Embedding dimensions would correspond to feature maps, and processing_cost would be relative to the number of convolutions and the filter size.
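Purely as a sketch of that mapping (the functional form and numbers below are my own illustrative assumptions), the token count could be swapped for the amount of convolutional work, keeping the entropy term as-is:

```python
import math

def cost_processing_conv(num_convs, filter_size, feature_map_elems,
                         entropy, processing_rate=0.002):
    # Assumed analogue of processing_cost: convolution count and filter area
    # stand in for context_length; log(1 + entropy) is kept unchanged.
    work = num_convs * (filter_size ** 2) * feature_map_elems
    return work * math.log(1 + entropy) * processing_rate

# e.g. a hypothetical dense block: 12 convolutions, 3x3 filters, 56x56 feature maps
print(cost_processing_conv(12, 3, 56 * 56, entropy=0.8))
```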
Given that small variations increase the model's capacity to capture fine-grained distinctions, the only way this could be mitigated without discarding the entropic cost is to enable dropout, batch normalization, and other regularizations (maybe).