GLSL_NV_cooperative_matrix_decode_vector
The original text file describing this extension as a set of diffs to the OpenGL Shading Language Specification follows.
Name
NV_cooperative_matrix_decode_vector
Name Strings
GL_NV_cooperative_matrix_decode_vector
Contact
Jeff Bolz, NVIDIA (jbolz 'at' nvidia.com)
Contributors
Jeff Bolz, NVIDIA
Status
Complete
Version
Last Modified: April 28, 2026
Revision: 1
Dependencies
This extension can be applied to OpenGL GLSL versions 4.50
(#version 450) and higher.
This extension can be applied to OpenGL ES ESSL versions 3.20
(#version 320) and higher.
This extension depends on GL_NV_cooperative_matrix2.
The examples in this document are written using the vector
template syntax provided by GL_EXT_long_vector for uniformity across
component types and vector lengths. This extension does not require
GL_EXT_long_vector to be enabled; the same shaders can be expressed
using the built-in vector types (e.g. f16vec2, u8vec4) when the
component type and length are representable that way.
Overview
This extension extends GL_NV_cooperative_matrix2 to allow a single
coopMatLoadTensorNV call to provide a vector-returning decode function
in addition to the existing scalar-returning _decodeFunc_. With this
optional second function, one invocation of the vector decode function
decodes V block-adjacent elements at once, instead of being invoked V
separate times via the scalar function.
The motivating case is block-quantized weight tensors: a small unit
of encoded data (such as one dword of a 4-bit format, or one byte
of a 1-bit format) can be loaded and decoded once for V block-adjacent
elements at a time, rather than being loaded and decoded V separate
times.
Because not every call site of a load can map V neighboring matrix
elements onto a single packed-vector register, the implementation
may invoke _decodeVectorFunc_ or rely on _decodeFunc_ (per element)
at any call site, choosing whichever fits better. For example, the
implementation may use _decodeVectorFunc_ when staging through shared
memory and _decodeFunc_ when loading directly into registers along an
axis that does not match the vector function's V-direction.
V can be 2, 4, or 8. Larger V matches the natural pack unit of formats
such as 1-bit (8 elements per byte) and 4-bit (8 elements per dword)
quantized weights, and amortizes per-invocation overhead over more
outputs.
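The amortization described above can be illustrated on the host. The following Python sketch is not shader code and is not part of this extension. It models a hypothetical 4-bit format in which one dword holds eight unsigned 4-bit values, lowest nibble first; the nibble order and the names decode_scalar and decode_vector8 are assumptions made for illustration. The vector path reads the dword once for eight outputs, whereas the scalar path reads it once per output.

```python
def decode_scalar(words, index):
    """Per-element path: one dword read for every output element."""
    word = words[index >> 3]              # dword holding element `index`
    return (word >> (4 * (index & 7))) & 0xF


def decode_vector8(words, index):
    """Vector path with V == 8: one dword read yields eight outputs."""
    assert index % 8 == 0, "group must start at a multiple of V"
    word = words[index >> 3]              # single read for the whole group
    return [(word >> (4 * lane)) & 0xF for lane in range(8)]


# Two dwords holding the values 0..15, lowest nibble first (assumed order).
words = [0x76543210, 0xFEDCBA98]
```

Both paths produce the same values; only the number of reads of the encoded data differs.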
Mapping to SPIR-V
-----------------
For informational purposes (non-normative), the
decodeVectorFunc parameter is expected to map to the DecodeVectorFunc
tensor addressing operand on OpCooperativeMatrixLoadTensorNV.
Modifications to the OpenGL Shading Language Specification, Version 4.60
Including the following line in a shader can be used to control the
language features described in this extension:
#extension GL_NV_cooperative_matrix_decode_vector : <behavior>
where <behavior> is as specified in section 3.3.
New preprocessor #defines are added to the OpenGL Shading Language:
#define GL_NV_cooperative_matrix_decode_vector 1
Modify Section 8.X, Cooperative Matrix Functions
Augment the description of the Load functions defined by
GL_NV_cooperative_matrix2 (the variants of coopMatLoadTensorNV that
take a _decodeFunc_ parameter) by allowing an additional optional
parameter _decodeVectorFunc_ that immediately follows _decodeFunc_:
void coopMatLoadTensorNV(out coopmat<...> result,
buf,
uint elementOffset,
tensorLayoutNV<...> t,
decodeFunc,
decodeVectorFunc);
void coopMatLoadTensorNV(out coopmat<...> result,
buf,
uint elementOffset,
tensorLayoutNV<...> t,
tensorViewNV<...> view,
decodeFunc,
decodeVectorFunc);
The function passed as _decodeVectorFunc_ is subject to the same parameter pattern
as _decodeFunc_: its first parameter must be a buffer_reference type, and
its second and third parameters must be const-in arrays of uint32_t
whose dimension matches the tensor dimension, but the two functions
can use different buffer_reference types for their first parameters.
_decodeVectorFunc_ must return a vector of length V whose component
type matches the component type of _result_, where V must be one of
2, 4, or 8.
Let V be the length of the vector returned by _decodeVectorFunc_.
For each matrix element of _result_, the implementation invokes either
_decodeFunc_ for that element, or _decodeVectorFunc_ for a group of V
block-adjacent matrix elements that contains it. The implementation
chooses which function to invoke and how often. Multiple invocations
of the same function with the same parameter values are expected to
return the same value.
When _decodeVectorFunc_ is invoked, the V matrix elements covered by
the invocation are V matrix elements whose tensor coordinates, as
computed by matrixCoordToTensorElement(WithView), share the same
blockCoord and the same coordInBlock components in every dimension
except LDim - 1, and whose coordInBlock[LDim - 1] values are V
consecutive integers starting at a multiple of V. The blockCoord,
coordInBlock, and pointer parameters passed to _decodeVectorFunc_
are those for the element of the group with the lowest
coordInBlock[LDim - 1]. For each i in the range 0 to V - 1, component
i of the returned vector is stored to the matrix element whose
coordInBlock[LDim - 1] is the group's lowest value plus i.
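The grouping rule above can be sketched as a small host-side model. This is Python rather than shader code, and the helper names vector_group and scatter are hypothetical; the sketch assumes a coordinate is given as a list whose last entry plays the role of coordInBlock[LDim - 1].

```python
def vector_group(coord_in_block, v):
    """Return (base_coord, lane) for the V-group containing an element.

    base_coord is the coordinate passed to the vector decode function
    (the group member with the lowest last-dimension coordinate), and
    lane is the component of the returned vector that belongs to the
    element at coord_in_block.
    """
    if v not in (2, 4, 8):
        raise ValueError("V must be 2, 4, or 8")
    last = coord_in_block[-1]
    base_last = last - (last % v)         # round down to a multiple of V
    base = list(coord_in_block[:-1]) + [base_last]
    return base, last - base_last


def scatter(decode_vector, base, v):
    """Map component i of the returned vector to the element at base + i."""
    values = decode_vector(base)
    return {tuple(base[:-1] + [base[-1] + i]): values[i] for i in range(v)}
```

For example, with V == 4 the element at coordInBlock = (0, 13) belongs to the group whose base is (0, 12) and receives component 1 of the returned vector.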
If _decodeVectorFunc_ is invoked and the following requirement is not
satisfied, the behavior of coopMatLoadTensorNV is undefined:
- t.blockSize[LDim - 1] must be a multiple of V.
This condition, together with coordInBlock[LDim - 1] being a
multiple of V, ensures that the V matrix elements covered by one
invocation of _decodeVectorFunc_ always lie within a single block.
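The reasoning behind this requirement can be checked by brute force. The Python sketch below is a host-side illustration only; groups_fit_in_block is a hypothetical helper, and it models just the last dimension of a block.

```python
def groups_fit_in_block(block_size_last, v):
    """True if every aligned V-group along the last dimension stays
    inside one block of block_size_last elements."""
    for coord in range(block_size_last):
        base = coord - (coord % v)        # group start, a multiple of V
        if base + v > block_size_last:    # group would run past the block end
            return False
    return True
```

Under this model, a block size of 32 works with V == 8, while a block size of 12 works with V == 4 but not with V == 8, because the group starting at coordinate 8 would extend past the end of the block.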
In any function used as a _decodeFunc_ or _decodeVectorFunc_
parameter, and any function called directly or indirectly by those
functions, tangled instructions (as defined in the SPIR-V spec) are
not allowed.
Examples
Decode a Q8_0-style block-quantized weight tensor (32 int8 quants per
block sharing one float16_t scale) into a gl_MatrixUseA matrix of
float16_t. Each block is 2 + 32 = 34 bytes packed, so the
buffer_reference is only 2-byte aligned. Two buffer_reference types
describe the same block layout in two different shapes: one uses
int8_t for each quant, the other packs pairs of quants in int16_t.
The shader supplies a per-element scalar decode using the byte view
and a per-V-group vector decode with V == 2 using the int16_t view.
The vector function does a single 16-bit load and unpacks both quants
into a 2-lane vector scaled by b.d once. The implementation picks
between the two functions per call site:
#extension GL_NV_cooperative_matrix2 : enable
#extension GL_NV_cooperative_matrix_decode_vector : enable
#extension GL_EXT_long_vector : enable
#extension GL_EXT_shader_explicit_arithmetic_types : enable
layout(buffer_reference, std430,
buffer_reference_align = 2) readonly buffer block_q8_0_b {
float16_t d; // per-block scale
int8_t qs[32]; // 32 quantized int8 values
};
layout(buffer_reference, std430,
buffer_reference_align = 2) readonly buffer block_q8_0_w {
float16_t d; // per-block scale
int16_t qs[16]; // two int8 quants per int16
};
float16_t decode_q8_0_scalar(
const in block_q8_0_b b,
const in uint32_t blockCoord[2],
const in uint32_t coordInBlock[2])
{
return float16_t(b.qs[coordInBlock[1]]) * b.d;
}
vector<float16_t, 2> decode_q8_0_v2(
const in block_q8_0_w b,
const in uint32_t blockCoord[2],
const in uint32_t coordInBlock[2])
{
// coordInBlock[1] is a multiple of V == 2, so a single
// 16-bit load yields both quants for this group.
int16_t pair = b.qs[coordInBlock[1] >> 1];
uint p = uint(pair);
return vector<float16_t, 2>(int8_t(p & 0xFFu), int8_t(p >> 8)) * b.d;
}
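// The scope (gl_ScopeWorkgroup here) is illustrative.
void load(out coopmat<float16_t, gl_ScopeWorkgroup, M, N, gl_MatrixUseA> mat,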
uint elementOffset, uint row, uint col)
{
tensorLayoutNV<2> t = createTensorLayoutNV(2);
// 1 x 32 block of matrix elements per Q8_0 block.
t = setTensorLayoutBlockSizeNV(t, 1, 32);
t = setTensorLayoutDimensionNV(t, NumRows, NumCols);
t = sliceTensorLayoutNV(t, row, M, col, N);
coopMatLoadTensorNV(mat, input.buf, elementOffset, t,
decode_q8_0_scalar, decode_q8_0_v2);
}
The same load can be used for any matrix Use, with or without a
tensorViewNV, and the implementation may invoke either function at
any call site.
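Because the implementation may call either function for any element, the two decode functions must agree. One way to gain confidence in that is a host-side check such as the Python sketch below. It is a reference model, not shader code: it assumes little-endian storage of the 16-bit view, computes in Python floats rather than rounding to float16_t, and the helper names are hypothetical.

```python
import struct


def pack_block(scale, quants):
    """Pack one block: float16 scale followed by 32 int8 quants (34 bytes)."""
    assert len(quants) == 32
    return struct.pack("<e32b", scale, *quants)


def to_int8(bits):
    """Reinterpret the low 8 bits as a signed value, like an int8_t conversion."""
    bits &= 0xFF
    return bits - 256 if bits >= 128 else bits


def decode_scalar(block, col):
    """Byte view: one quant per load."""
    (scale,) = struct.unpack_from("<e", block, 0)
    (q,) = struct.unpack_from("<b", block, 2 + col)
    return q * scale


def decode_v2(block, col):
    """16-bit view: one load yields the two quants of an aligned pair."""
    assert col % 2 == 0, "group must start at a multiple of V == 2"
    (scale,) = struct.unpack_from("<e", block, 0)
    (pair,) = struct.unpack_from("<h", block, 2 + 2 * (col >> 1))
    p = pair & 0xFFFFFFFF                 # models uint(pair), sign-extended
    return [to_int8(p) * scale, to_int8(p >> 8) * scale]


quants = [i - 16 for i in range(32)]      # includes negative values
block = pack_block(0.5, quants)
```

Comparing decode_v2(block, c) with the pair of decode_scalar results for columns c and c + 1, for every even c, exercises the sign handling of both the low and the high byte.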
Issues
(1) Why is _decodeVectorFunc_ a separate parameter rather than
having _decodeFunc_ return a vector?
RESOLVED: An alternate design could make _decodeFunc_ return a
vector and couple it to static restrictions on the matrix Use and
tensorViewNV and to dynamic alignment on the load's span, offset,
and layout dimensions. That approach gets intricate and still misses
important cases: for example, loading a matrix with gl_MatrixUseB from
row-major memory can require an effective transpose relative to tensor
storage, so vector decode aligned to the blocking layout can be harmful
when values are written straight into registers, while the same
tensor load can still benefit from vector decode when a shared-memory
staging pass makes loads match that layout. A single vector-returning
_decodeFunc_ cannot serve both kinds of call site without further
special cases.
This extension keeps _decodeFunc_ as the required scalar decode
function (unchanged from GL_NV_cooperative_matrix2) and adds an
optional _decodeVectorFunc_. The implementation may invoke either at
any call site. The shader supplies both; the implementation chooses
per site. Use and tensorViewNV stay independent of decode shape; the only
structural rule is that t.blockSize[LDim - 1] is a multiple of V. One load
covers both paths without an up-front shader choice. For example, the
implementation may decline _decodeVectorFunc_ where V lanes map poorly to
registers or a V-group would straddle a span or clip boundary.
(2) Are there static restrictions on the matrix Use or the
tensorViewNV when _decodeVectorFunc_ is used?
RESOLVED: No. The semantic mapping of _decodeVectorFunc_'s V return
components is defined entirely in terms of post-view
coordInBlock[LDim - 1] values and does not depend on the matrix Use
or on the tensorViewNV. Whether _decodeVectorFunc_ is invoked at a
particular call site is the implementation's choice.
(3) Why must blockSize[LDim - 1] be a multiple of V?
RESOLVED: This is the only condition needed to guarantee that the V
matrix elements covered by one invocation of _decodeVectorFunc_ all
lie in a single block (so the function reads from a single,
well-defined encoded block). Whether _decodeVectorFunc_ is profitable
at a given call site is left to the implementation, as discussed in
issue (1).
(4) Why only a single _decodeVectorFunc_ instead of letting the shader
supply multiple vector decode functions for different V (e.g.
V == 2, V == 4, and V == 8) and letting the implementation pick
among them?
RESOLVED: An alternate design could let the shader provide several
vector decode functions for different V and let the implementation
pick the most profitable one per call site. The incremental benefit
over a single shader-chosen V is small: each block-quantized format
has a natural pack unit (one byte for 1-bit formats, one dword for
4-bit formats, and so on), and once V is matched to that unit, larger
V mostly amortizes overhead that is already small at the matched
unit. Supporting multiple vector decode functions would add
per-call-site selection logic in the implementation and a
cross-product of allowed combinations to validate, in exchange for
that small benefit. In practice, a shader author can experiment with
different V values and keep whichever is fastest for their format and
target. coopMatLoadTensorNV therefore accepts at most one
_decodeVectorFunc_; the implementation only chooses, per call site,
whether to invoke it or fall back to _decodeFunc_.
Revision History
Revision 1, 2026-04-28 (Jeff Bolz)
- Initial revision.