This document proposes adding support for shader integer dot product instructions.

1. Problem Statement

Dot product operations between vectors of integer values are used heavily in machine learning algorithms, acting as a fairly fundamental building block. When running machine learning algorithms in Vulkan, these have to be emulated using other integer operations; however many implementations have dedicated fast paths for these operations.

An additional problem is that there is no clear common subset of accelerated dot product operations between vendors - making standardizing on a solution somewhat tricky.

This proposal aims to enable these fast paths for machine learning algorithms with minimal difficulty.

2. Solution Space

There are two main ways in which applications could gain access to these fast paths:

  1. Rely on compiler pattern matching to optimize standard integer operations into dot products

  2. Add dedicated dot product operations

The first of those is more or less a "do nothing" approach and puts a burden on implementations to detect these cases, with variable success rates. Adding dedicated dot product operations is less error prone, but does mean machine learning content needs to be updated to use these new operations. In the long run, the latter is likely to be much more reliable for new applications - so this proposal aims to add new operations.

The question then becomes which dedicated dot product operations should be exposed if there is no common subset of accelerated operations. Choices become:

  1. Multiple extensions advertising different operations

  2. One extension with the superset of operations but make them all optional

  3. One extension with all operations available, emulating those that are not accelerated

Most existing ML backends targeting SPIR-V compile to SPIR-V once and expect the code to work everywhere within their target market - they will pick a single expression of the ML operations at the macro level and compile to that. To run this code everywhere, only option 3 works directly - the only option faced with 1 or 2 would be to emulate the functions as they do today, perhaps picking up optimizations in extreme cases only.

Newer backends such as those using MLIR are looking at generating platform-specific optimized IR, which can be done in part by expressing the macro-level operations differently. Backends like this could use information about the accelerated operations to determine which SPIR-V operations to target, and thus 1 and 2 are well suited to this. Option 3 would also work but would need additional information in order to make optimization decisions.

In order to satisfy both of these types of backends, this proposal works along the lines of option 3, while providing platform-specific information to allow optimizing compilers to make useful choices.

3. Proposal

3.1. API Features

The following features are exposed by this extension:

typedef struct VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           shaderIntegerDotProduct;
} VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR;

shaderIntegerDotProduct is the core feature enabling this extension’s functionality.

3.2. API Properties

The following features are exposed by this extension:

typedef struct VkPhysicalDeviceShaderIntegerDotProductPropertiesKHR {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           integerDotProduct8BitUnsignedAccelerated;
    VkBool32           integerDotProduct8BitSignedAccelerated;
    VkBool32           integerDotProduct8BitMixedSignednessAccelerated;
    VkBool32           integerDotProduct4x8BitPackedUnsignedAccelerated;
    VkBool32           integerDotProduct4x8BitPackedSignedAccelerated;
    VkBool32           integerDotProduct4x8BitPackedMixedSignednessAccelerated;
    VkBool32           integerDotProduct16BitUnsignedAccelerated;
    VkBool32           integerDotProduct16BitSignedAccelerated;
    VkBool32           integerDotProduct16BitMixedSignednessAccelerated;
    VkBool32           integerDotProduct32BitUnsignedAccelerated;
    VkBool32           integerDotProduct32BitSignedAccelerated;
    VkBool32           integerDotProduct32BitMixedSignednessAccelerated;
    VkBool32           integerDotProduct64BitUnsignedAccelerated;
    VkBool32           integerDotProduct64BitSignedAccelerated;
    VkBool32           integerDotProduct64BitMixedSignednessAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating8BitUnsignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating8BitSignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating4x8BitPackedSignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating4x8BitPackedMixedSignednessAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating16BitUnsignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating16BitSignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating16BitMixedSignednessAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating32BitUnsignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating32BitSignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating32BitMixedSignednessAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating64BitUnsignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating64BitSignedAccelerated;
    VkBool32           integerDotProductAccumulatingSaturating64BitMixedSignednessAccelerated;
} VkPhysicalDeviceShaderIntegerDotProductPropertiesKHR;

Each of these properties is a boolean that will be VK_TRUE if the implementation provides a performance advantage for the corresponding SPIR-V instruction, over application-provided code composed from elementary instructions and/or other dot product instructions. This could be either because the implementation uses optimized machine code sequences whose generation from application-provided code cannot be guaranteed or because it uses hardware features that cannot otherwise be targeted from application-provided code.

Properties are written as integerDotProduct<AccumulatingSaturating>{type bitwidth}{Unsigned|Signed|MixedSignedness}Accelerated. Each property corresponds to a SPIR-V opcode of the form Op{U|S|SU}Dot<AccSat>KHR, as defined in SPIR-V extension SPV_KHR_integer_dot_product. The <AccumulatingSaturating> portion of the property corresponds to the AccSat instruction variants. The type bitwidth refers to the size of the input vectors and whether it is a packed format or not. {Unsigned|Signed|MixedSignedness} in the property correspond to {U|S|SU} in the instruction name. ---

3.3. SPIR-V Changes

This proposal uses an existing SPIR-V extension: SPV_KHR_integer_dot_product.

4. Examples