VK_AMDX_shader_enqueue

This extension adds the ability for developers to enqueue compute workgroups from a shader.

1. Problem Statement

Applications are increasingly using more complex renderers, often incorporating multiple compute passes that classify, sort, or otherwise preprocess input data. These passes may be used to determine how future work is performed on the GPU; but triggering that future GPU work requires either a round trip to the host, or going through buffer memory and using indirect commands. Host round trips necessarily include more system bandwidth and latency as command buffers need to be built and transmitted back to the GPU. Indirect commands work well in many cases, but they have little flexibility when it comes to determining what is actually dispatched; they must be enqueued ahead of time, synchronized with heavy API barriers, and execute with a single pre-recorded pipeline.

Whilst latency can be hidden, and indirect commands can work in many cases where additional latency and bandwidth are not acceptable, recent engine developments such as Unreal 5’s Nanite technology explicitly require the flexibility of shader selection and low latency. A desirable solution should provide the flexibility required for these systems, while keeping the execution loop firmly on the GPU.

2. Solution Space

Three main possibilities exist:

  1. Extend indirect commands

  2. VK_NV_device_generated_commands

  3. Shader enqueue

More flexible indirect commands could feasibly allow things like shader selection, introduce more complex flow control, or include indirect state setting commands. The main issue with these is that these always require parameters to be written through regular buffer memory, and that buffer memory has to be sized for each indirect command to handle the maximum number of possibilities. As well as the large allocation size causing memory pressure, pushing all that data through buffer memory will reduce the bandwidth available for other operations. All of this could cause bottlenecks elsewhere in the pipeline. Hypothetically a new interface for better scheduling/memory management could be introduced, but that starts looking a lot like option 3.

Option 2 - implementing a cross-vendor equivalent of VK_NV_device_generated_commands would be a workable solution that both adds flexibility and avoids a CPU round trip. The reason it has not enjoyed wider support is due to concerns about how the commands are generated - it uses a tokenised API which has to be processed by the GPU before it can be executed. For existing GPUs this can mean doing things like running a single compute shader invocation to process each token stream into a runnable command buffer, adding both latency and bandwidth on the GPU.

Option 3 - OpenCL and CUDA have had some form of shader enqueue API for a while, where the focus has typically been primarily on enabling developers and on compute workloads. From a user interface perspective these have had a decent amount of battle testing and are quite popular and flexible interfaces.

This proposal is built around something like Option 3, but extended to be explicit and performant.

3. Proposal

3.1. API Changes

3.1.1. Graph Pipelines

In order to facilitate dispatch of multiple shaders from the GPU, the implementation needs some information about how pipelines will be launched and synchronized. This proposal introduces a new execution graph pipeline that defines execution paths between multiple shaders, and allows dynamic execution of different shaders.

VkResult vkCreateExecutionGraphPipelinesAMDX(
    VkDevice                                        device,
    VkPipelineCache                                 pipelineCache,
    uint32_t                                        createInfoCount,
    const VkExecutionGraphPipelineCreateInfoAMDX*    pCreateInfos,
    const VkAllocationCallbacks*                    pAllocator,
    VkPipeline*                                     pPipelines);

typedef struct VkExecutionGraphPipelineCreateInfoAMDX {
    VkStructureType                             sType;
    const void*                                 pNext;
    VkPipelineCreateFlags                       flags;
    uint32_t                                    stageCount;
    const VkPipelineShaderStageCreateInfo*      pStages;
    const VkPipelineLibraryCreateInfoKHR*       pLibraryInfo;
    VkPipelineLayout                            layout;
    VkPipeline                                  basePipelineHandle;
    int32_t                                     basePipelineIndex;
} VkExecutionGraphPipelineCreateInfoAMDX;

Shaders defined by pStages and any pipelines in pLibraryInfo->pLibraries define the possible nodes of the graph. The linkage between nodes however is defined wholly in shader code.

Shaders in pStages must be in the GLCompute execution model, and may have the CoalescingAMDX execution mode. Pipelines in pLibraries can be compute pipelines or other graph pipelines created with the VK_PIPELINE_CREATE_LIBRARY_BIT_KHR flag bit.

Each shader in an execution graph is associated with a name and an index, which are used to identify the target shader when dispatching a payload. The VkPipelineShaderStageNodeCreateInfoAMDX structure specifies the entry point name and index for a shader, and can be chained to the VkPipelineShaderStageCreateInfo structure.

const uint32_t VK_SHADER_INDEX_UNUSED_AMDX = 0xFFFFFFFF;

typedef struct VkPipelineShaderStageNodeCreateInfoAMDX {
    VkStructureType                             sType;
    const void*                                 pNext;
    const char*                                 pName;
    uint32_t                                    index;
} VkPipelineShaderStageNodeCreateInfoAMDX;

  • index sets the index value for a shader.

  • pName allows applications to override the name specified in SPIR-V by OpEntryPoint.

If pName is NULL then the original name is used, as specified by VkPipelineShaderStageCreateInfo::pName. If index is VK_SHADER_INDEX_UNUSED_AMDX then the original index is used, either as specified by the ShaderIndexAMDX Execution Mode, or 0 if that too is not specified. If this structure is not provided, pName defaults to NULL, and index defaults to VK_SHADER_INDEX_UNUSED_AMDX.
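The defaulting rules above can be sketched as a small host-side helper. This is only an illustration: `NodeOverride`, `NodeIdentity` and `resolve_node_identity` are hypothetical names that mirror the relevant fields of VkPipelineShaderStageNodeCreateInfoAMDX, not part of the extension.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same value as VK_SHADER_INDEX_UNUSED_AMDX in this proposal. */
#define SHADER_INDEX_UNUSED 0xFFFFFFFFu

/* Hypothetical mirror of the pName/index fields of
 * VkPipelineShaderStageNodeCreateInfoAMDX. */
typedef struct NodeOverride {
    const char *pName;  /* NULL keeps the original name */
    uint32_t    index;  /* SHADER_INDEX_UNUSED keeps the original index */
} NodeOverride;

/* Hypothetical result type: the name and shader index a node ends up with. */
typedef struct NodeIdentity {
    const char *name;
    uint32_t    index;
} NodeIdentity;

/* stage_name:     VkPipelineShaderStageCreateInfo::pName
 * has_index_mode: nonzero if the shader declares the ShaderIndexAMDX mode
 * mode_index:     operand of that execution mode, if declared
 * override:       NULL when the structure is not chained */
NodeIdentity resolve_node_identity(const char *stage_name,
                                   int has_index_mode,
                                   uint32_t mode_index,
                                   const NodeOverride *override)
{
    NodeIdentity id;
    assert(stage_name != NULL);

    /* A non-NULL override name replaces the original name. */
    id.name = (override != NULL && override->pName != NULL)
                  ? override->pName
                  : stage_name;

    /* An explicit index wins; otherwise fall back to the execution mode,
     * and finally to 0. */
    if (override != NULL && override->index != SHADER_INDEX_UNUSED) {
        id.index = override->index;
    } else {
        id.index = has_index_mode ? mode_index : 0u;
    }
    return id;
}
```

Passing `override == NULL` models the case where the structure is not provided, which behaves the same as a NULL pName together with VK_SHADER_INDEX_UNUSED_AMDX.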

When dispatching from another shader, the index is dynamic and can be specified in uniform control flow - however the name must be statically declared as a decoration on the payload. Allowing the index to be set dynamically lets applications stream shaders in and out dynamically, by simply changing constant data and relinking the graph pipeline from new libraries. Shaders with the same name and different indexes must consume identical payloads and have the same execution model. Shaders with the same name in an execution graph pipeline must have unique indexes.

3.1.2. Scratch Memory

Implementations may need scratch memory to manage dispatch queues or similar when executing an execution graph pipeline; this memory is explicitly managed by the application.

typedef struct VkExecutionGraphPipelineScratchSizeAMDX {
    VkStructureType                     sType;
    void*                               pNext;
    VkDeviceSize                        size;
} VkExecutionGraphPipelineScratchSizeAMDX;

VkResult vkGetExecutionGraphPipelineScratchSizeAMDX(
    VkDevice                                device,
    VkPipeline                              executionGraph,
    VkExecutionGraphPipelineScratchSizeAMDX* pSizeInfo);

Applications can query the amount of scratch memory required for a given pipeline, and the address of a buffer of that size must be provided when calling vkCmdDispatchGraphAMDX. The amount of scratch memory needed by a given pipeline is related to the number and size of payloads across the whole graph; while the exact relationship is implementation dependent, reducing the number of unique nodes (nodes with different name strings) and the size of payloads can reduce scratch memory consumption.

Buffers created for this purpose must use the new buffer usage flags:

VK_BUFFER_USAGE_EXECUTION_GRAPH_SCRATCH_BIT_AMDX
VK_BUFFER_USAGE_2_EXECUTION_GRAPH_SCRATCH_BIT_AMDX

Scratch memory needs to be initialized against a graph pipeline before it can be used with that graph for the first time, using the following command:

void vkCmdInitializeGraphScratchMemoryAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch);

This command initializes the scratch memory for the currently bound execution graph pipeline. Scratch memory will need to be re-initialized if it is going to be reused with a different execution graph pipeline, but can be used with the same pipeline repeatedly without re-initialization. Scratch memory initialization can be synchronized using the compute pipeline stage VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT and shader write access flag VK_ACCESS_SHADER_WRITE_BIT.
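An application might track the re-initialization rule with some simple bookkeeping. The sketch below is a hypothetical application-side model (`ScratchState` and its helpers are not part of the extension). It only decides whether vkCmdInitializeGraphScratchMemoryAMDX needs to be recorded, and does not model the synchronization that must follow initialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical application-side record for one scratch buffer. */
typedef struct ScratchState {
    uint64_t address;          /* device address of the scratch buffer */
    uint64_t size;             /* size of the buffer in bytes */
    uint64_t initialized_for;  /* app-chosen id of the pipeline it was last
                                  initialized for; 0 means never */
} ScratchState;

/* True if the buffer satisfies the size reported by
 * vkGetExecutionGraphPipelineScratchSizeAMDX for a pipeline. */
bool scratch_is_large_enough(const ScratchState *s, uint64_t required_size)
{
    return s->size >= required_size;
}

/* True if the scratch memory must be (re-)initialized before it is used
 * with the pipeline identified by pipeline_id. */
bool scratch_needs_initialization(const ScratchState *s, uint64_t pipeline_id)
{
    assert(pipeline_id != 0);
    return s->initialized_for != pipeline_id;
}

/* Call after recording the initialization command for pipeline_id. */
void scratch_mark_initialized(ScratchState *s, uint64_t pipeline_id)
{
    assert(pipeline_id != 0);
    s->initialized_for = pipeline_id;
}
```

Reusing the buffer with the same pipeline leaves `scratch_needs_initialization` false, while switching to another pipeline makes it true again.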

3.1.3. Dispatch a graph

Once an execution graph has been created and scratch memory has been initialized for it, the following commands can be used to execute the graph:

typedef struct VkDispatchGraphInfoAMDX {
    uint32_t                                    nodeIndex;
    uint32_t                                    payloadCount;
    VkDeviceOrHostAddressConstAMDX              payloads;
    uint64_t                                    payloadStride;
} VkDispatchGraphInfoAMDX;

typedef struct VkDispatchGraphCountInfoAMDX {
    uint32_t                                    count;
    VkDeviceOrHostAddressConstAMDX              infos;
    uint64_t                                    stride;
} VkDispatchGraphCountInfoAMDX;

void vkCmdDispatchGraphAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch,
    const VkDispatchGraphCountInfoAMDX*         pCountInfo);

void vkCmdDispatchGraphIndirectAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch,
    const VkDispatchGraphCountInfoAMDX*         pCountInfo);

void vkCmdDispatchGraphIndirectCountAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch,
    VkDeviceAddress                             countInfo);

Each of the above commands enqueues an array of nodes in the bound execution graph pipeline with separate payloads, according to the contents of the VkDispatchGraphCountInfoAMDX and VkDispatchGraphInfoAMDX structures.

vkCmdDispatchGraphAMDX takes all of its arguments from host pointers. VkDispatchGraphCountInfoAMDX::infos.hostAddress is a pointer to an array of VkDispatchGraphInfoAMDX structures, with stride equal to VkDispatchGraphCountInfoAMDX::stride and VkDispatchGraphCountInfoAMDX::count elements.

vkCmdDispatchGraphIndirectAMDX consumes most parameters on the host, but uses the device address for VkDispatchGraphCountInfoAMDX::infos, and also treats the payloads parameters as device addresses.

vkCmdDispatchGraphIndirectCountAMDX consumes countInfo on the device and all child parameters also use device addresses.

Data consumed via a device address must be from buffers created with the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT and VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT flags. payloads is a pointer to a linear array of payloads in memory, with a stride equal to payloadStride. payloadCount may be 0. scratch may be used by the implementation to hold temporary data during graph execution, and can be synchronized using the compute pipeline stage and shader write access flags.
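The strided addressing of infos and payloads can be illustrated on the host. The sketch below uses local mirror structures with plain host pointers in place of VkDeviceOrHostAddressConstAMDX, which is a simplification for illustration; `payload_at` and `info_at` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of VkDispatchGraphInfoAMDX, host-pointer variant only. */
typedef struct DispatchGraphInfo {
    uint32_t    nodeIndex;
    uint32_t    payloadCount;
    const void *payloads;
    uint64_t    payloadStride;
} DispatchGraphInfo;

/* Local mirror of VkDispatchGraphCountInfoAMDX, host-pointer variant only. */
typedef struct DispatchGraphCountInfo {
    uint32_t    count;
    const void *infos;
    uint64_t    stride;
} DispatchGraphCountInfo;

/* Address of payload i: base + i * payloadStride. */
const void *payload_at(const DispatchGraphInfo *info, uint32_t i)
{
    assert(i < info->payloadCount);
    return (const uint8_t *)info->payloads
           + (size_t)i * (size_t)info->payloadStride;
}

/* Address of dispatch info i: base + i * stride. */
const DispatchGraphInfo *info_at(const DispatchGraphCountInfo *ci, uint32_t i)
{
    assert(i < ci->count);
    return (const DispatchGraphInfo *)((const uint8_t *)ci->infos
                                       + (size_t)i * (size_t)ci->stride);
}
```

A stride larger than the element size lets payloads or infos be embedded in larger application structures without repacking them.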

These dispatch commands must not be called in protected command buffers or secondary command buffers.

If a selected node does not include a StaticNumWorkgroupsAMDX or CoalescingAMDX declaration, the first part of each element of payloads must be a VkDispatchIndirectCommand structure, indicating the number of workgroups to dispatch in each dimension. If an input payload variable in NodePayloadAMDX storage class is defined in the shader, its structure type must include VkDispatchIndirectCommand in its first 12 bytes.
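A payload layout that satisfies this rule could look like the following sketch. `DispatchIndirectCommand` is a local mirror of VkDispatchIndirectCommand, and `ClassifyPayload` with its fields is a made-up example payload, not something defined by the extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of VkDispatchIndirectCommand: three 32-bit workgroup counts. */
typedef struct DispatchIndirectCommand {
    uint32_t x;
    uint32_t y;
    uint32_t z;
} DispatchIndirectCommand;

/* Hypothetical payload for a node without StaticNumWorkgroupsAMDX or
 * CoalescingAMDX: the dispatch size comes first, user data follows. */
typedef struct ClassifyPayload {
    DispatchIndirectCommand grid;       /* must occupy the first 12 bytes */
    uint32_t                firstItem;  /* user data (illustrative) */
    uint32_t                itemCount;  /* user data (illustrative) */
} ClassifyPayload;

_Static_assert(sizeof(DispatchIndirectCommand) == 12,
               "dispatch size must be exactly 12 bytes");
_Static_assert(offsetof(ClassifyPayload, grid) == 0,
               "dispatch size must be at the start of the payload");

/* Builds a payload that launches enough workgroups to cover itemCount items. */
ClassifyPayload make_classify_payload(uint32_t firstItem,
                                      uint32_t itemCount,
                                      uint32_t itemsPerWorkgroup)
{
    ClassifyPayload p;
    assert(itemsPerWorkgroup > 0u);
    p.grid.x = (itemCount + itemsPerWorkgroup - 1u) / itemsPerWorkgroup;
    p.grid.y = 1u;
    p.grid.z = 1u;
    p.firstItem = firstItem;
    p.itemCount = itemCount;
    return p;
}
```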

If that node does not include a MaxNumWorkgroupsAMDX declaration, it is assumed that the node may be dispatched with a grid size up to VkPhysicalDeviceLimits::maxComputeWorkGroupCount.

If that node does not include a CoalescingAMDX declaration, all data in the payload is broadcast to all workgroups dispatched in this way. If that node includes a CoalescingAMDX declaration, data in the payload will be consumed by exactly one workgroup. There is no guarantee of how payloads will be consumed by CoalescingAMDX nodes.

The nodeIndex is a unique integer that identifies a specific shader name and shader index (defined by VkPipelineShaderStageNodeCreateInfoAMDX) added to the execution graph pipeline. vkGetExecutionGraphPipelineNodeIndexAMDX can be used to query the identifier for a given node:

VkResult vkGetExecutionGraphPipelineNodeIndexAMDX(
    VkDevice                                        device,
    VkPipeline                                      executionGraph,
    const VkPipelineShaderStageNodeCreateInfoAMDX*   pNodeInfo,
    uint32_t*                                       pNodeIndex);

pNodeInfo specifies the shader name and index as set up when creating the pipeline, with the associated node index returned in pNodeIndex. When used with this function, pNodeInfo->pName must not be NULL.

To summarize, execution graphs use two kinds of indexes:

  1. shader index specified in VkPipelineShaderStageNodeCreateInfoAMDX and used to enqueue payloads,

  2. node index specified in VkDispatchGraphInfoAMDX and used only for launching the graph from a command buffer.
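Because the node index is only obtained by querying, an application may want to cache the results keyed by name and shader index. The sketch below is a hypothetical application-side cache: `NodeCacheEntry`, `find_node_index` and the `NODE_NOT_FOUND` sentinel are illustrative names, and the stored node indexes are assumed to have come from vkGetExecutionGraphPipelineNodeIndexAMDX.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel for a cache miss; not part of the extension. */
#define NODE_NOT_FOUND 0xFFFFFFFFu

/* One cached query result. */
typedef struct NodeCacheEntry {
    const char *name;         /* shader name used at pipeline creation */
    uint32_t    shaderIndex;  /* shader index used at pipeline creation */
    uint32_t    nodeIndex;    /* value previously returned by the query */
} NodeCacheEntry;

/* Looks up the node index for (name, shaderIndex) in the cache. */
uint32_t find_node_index(const NodeCacheEntry *entries, uint32_t count,
                         const char *name, uint32_t shaderIndex)
{
    uint32_t i;
    assert(name != NULL);
    for (i = 0; i < count; ++i) {
        if (entries[i].shaderIndex == shaderIndex &&
            strcmp(entries[i].name, name) == 0) {
            return entries[i].nodeIndex;
        }
    }
    return NODE_NOT_FOUND;
}
```

The returned value is what would be written to VkDispatchGraphInfoAMDX::nodeIndex; the shader index itself is never passed to the dispatch commands.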

Execution graph pipelines and their resources are bound using a new pipeline bind point:

VK_PIPELINE_BIND_POINT_EXECUTION_GRAPH_AMDX

3.1.4. Properties

The following new properties are added to Vulkan:

typedef struct VkPhysicalDeviceShaderEnqueuePropertiesAMDX {
    VkStructureType                     sType;
    void*                               pNext;
    uint32_t                            maxExecutionGraphDepth;
    uint32_t                            maxExecutionGraphShaderOutputNodes;
    uint32_t                            maxExecutionGraphShaderPayloadSize;
    uint32_t                            maxExecutionGraphShaderPayloadCount;
    uint32_t                            executionGraphDispatchAddressAlignment;
} VkPhysicalDeviceShaderEnqueuePropertiesAMDX;

Each limit is defined as follows:

  • maxExecutionGraphDepth defines the maximum node chain length in the graph, and must be at least 32. The dispatched node is at depth 1 and the node enqueued by it is at depth 2, and so on. If a node uses tail recursion, each recursive call increases the depth by 1 as well.

  • maxExecutionGraphShaderOutputNodes specifies the maximum number of unique nodes that can be dispatched from a single shader, and must be at least 256.

  • maxExecutionGraphShaderPayloadSize specifies the maximum total size of payload declarations in a shader, and must be at least 32KB.

  • maxExecutionGraphShaderPayloadCount specifies the maximum number of output payloads that can be initialized in a single workgroup, and must be at least 256.

  • executionGraphDispatchAddressAlignment specifies the alignment of non-scratch VkDeviceAddress arguments consumed by graph dispatch commands, and must be no more than 4 bytes.
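Two of these limits lend themselves to simple host-side checks. The sketch below assumes a linear chain of nodes and uses a local `EnqueueLimits` mirror of the properties structure; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the limits in VkPhysicalDeviceShaderEnqueuePropertiesAMDX. */
typedef struct EnqueueLimits {
    uint32_t maxExecutionGraphDepth;
    uint32_t maxExecutionGraphShaderOutputNodes;
    uint32_t maxExecutionGraphShaderPayloadSize;
    uint32_t maxExecutionGraphShaderPayloadCount;
    uint32_t executionGraphDispatchAddressAlignment;
} EnqueueLimits;

/* Depth check for a linear chain: the dispatched node is at depth 1, each
 * further distinct node adds 1, and each tail-recursive call adds 1 too. */
bool depth_within_limit(const EnqueueLimits *limits,
                        uint32_t distinctNodesInChain,
                        uint32_t recursiveCalls)
{
    uint64_t depth = (uint64_t)distinctNodesInChain + recursiveCalls;
    return depth <= limits->maxExecutionGraphDepth;
}

/* Alignment check for a non-scratch device address passed to a dispatch. */
bool address_is_aligned(const EnqueueLimits *limits, uint64_t address)
{
    assert(limits->executionGraphDispatchAddressAlignment != 0u);
    return address % limits->executionGraphDispatchAddressAlignment == 0u;
}
```

The limit values used when exercising these helpers would come from the physical device properties query; the minimums quoted above (32, 256, 32KB, 256) and the 4-byte alignment bound are only the guaranteed bounds.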

3.1.5. Features

The following new feature is added to Vulkan:

typedef struct VkPhysicalDeviceShaderEnqueueFeaturesAMDX {
    VkStructureType                     sType;
    void*                               pNext;
    VkBool32                            shaderEnqueue;
} VkPhysicalDeviceShaderEnqueueFeaturesAMDX;

The shaderEnqueue feature enables all functionality in this extension.

3.2. SPIR-V Changes

A new capability is added:

Capability: ShaderEnqueueAMDX (5067)
Uses shader enqueue capabilities.
Enabling Capabilities: Shader

A new storage class is added:

Storage Class: NodePayloadAMDX (5068)
Input payload from a node dispatch.
In the GLCompute execution model with the CoalescingAMDX execution mode, it is visible across all functions in all invocations in a workgroup; otherwise it is visible across all functions in all invocations in a dispatch.
Variables declared with this storage class are read-write, and must not have initializers.
Enabling Capabilities: ShaderEnqueueAMDX

Storage Class: NodeOutputPayloadAMDX (5076)
Output payload to be used for dispatch.
Variables declared with this storage class are read-write, must not have initializers, and must be initialized with OpInitializeNodePayloadsAMDX before they are accessed.
Once initialized, a variable declared with this storage class is visible to all invocations in the declared Scope.
Valid in GLCompute execution models.
Enabling Capabilities: ShaderEnqueueAMDX

An entry point must only declare one variable in the NodePayloadAMDX storage class in its interface.

New execution modes are added:

Execution Mode: CoalescingAMDX (5069)
Indicates that a GLCompute shader has coalescing semantics. (GLCompute only)
Must not be declared alongside StaticNumWorkgroupsAMDX or MaxNumWorkgroupsAMDX.
Extra Operands: none
Enabling Capabilities: ShaderEnqueueAMDX

Execution Mode: MaxNodeRecursionAMDX (5071)
Maximum number of times a node can enqueue itself.
Extra Operands: <id> Number of recursions
Enabling Capabilities: ShaderEnqueueAMDX

Execution Mode: StaticNumWorkgroupsAMDX (5072)
Statically declare the number of workgroups dispatched for this shader, instead of obeying an API- or payload-specified value. Values are reflected in the NumWorkgroups built-in value. (GLCompute only)
Must not be declared alongside CoalescingAMDX or MaxNumWorkgroupsAMDX.
Extra Operands: <id> x size, <id> y size, <id> z size
Enabling Capabilities: ShaderEnqueueAMDX

Execution Mode: MaxNumWorkgroupsAMDX (5077)
Declare the maximum number of workgroups dispatched for this shader. Dispatches must not exceed this value. (GLCompute only)
Must not be declared alongside CoalescingAMDX or StaticNumWorkgroupsAMDX.
Extra Operands: <id> x size, <id> y size, <id> z size
Enabling Capabilities: ShaderEnqueueAMDX

Execution Mode: ShaderIndexAMDX (5073)
Declare the shader index for this shader. (GLCompute only)
Extra Operands: <id> Shader Index
Enabling Capabilities: ShaderEnqueueAMDX

A shader module declaring ShaderEnqueueAMDX capability must only be used in execution graph pipelines created by vkCreateExecutionGraphPipelinesAMDX command.

MaxNodeRecursionAMDX must be specified if a shader re-enqueues itself, which takes place if that shader initializes and finalizes a payload for the same node name and index. Other forms of recursion are not allowed.

An application must not dispatch the shader with a number of workgroups in any dimension greater than the values specified by MaxNumWorkgroupsAMDX.

StaticNumWorkgroupsAMDX allows the declaration of the number of workgroups to dispatch to be coded into the shader itself, which can be useful for optimizing some algorithms. When a compute shader is dispatched using existing vkCmdDispatchGraph* commands, the workgroup counts specified there are overridden. When enqueuing such shaders with a payload, these arguments will not be consumed from the payload before user-specified data begins.

The values of MaxNumWorkgroupsAMDX and StaticNumWorkgroupsAMDX must be less than or equal to VkPhysicalDeviceLimits::maxComputeWorkGroupCount.
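The interaction between a declared maximum and the device limit can be sketched as a small validation helper. `GridSize`, `effective_max_grid` and `grid_is_valid` are hypothetical names; the device limit would come from VkPhysicalDeviceLimits::maxComputeWorkGroupCount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Workgroup counts in each dimension. */
typedef struct GridSize {
    uint32_t x;
    uint32_t y;
    uint32_t z;
} GridSize;

/* Upper bound on the dispatch size for a node: the MaxNumWorkgroupsAMDX
 * operands when declared, otherwise the device limit. */
GridSize effective_max_grid(bool hasMaxDecl, GridSize declared,
                            GridSize deviceLimit)
{
    if (hasMaxDecl) {
        /* A declared maximum must itself respect the device limit. */
        assert(declared.x <= deviceLimit.x);
        assert(declared.y <= deviceLimit.y);
        assert(declared.z <= deviceLimit.z);
        return declared;
    }
    return deviceLimit;
}

/* True if the requested dispatch size does not exceed the bound in any
 * dimension. */
bool grid_is_valid(GridSize requested, GridSize maxGrid)
{
    return requested.x <= maxGrid.x &&
           requested.y <= maxGrid.y &&
           requested.z <= maxGrid.z;
}
```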

The arguments to each of these execution modes must be a constant 32-bit integer value, and may be supplied via specialization constants.

When a GLCompute shader is being used in an execution graph, NumWorkgroups must not be used.

When CoalescingAMDX is used, it has the following effects on a compute shader’s inputs and outputs:

  • The WorkgroupId built-in is always (0,0,0)

      • NB: This affects related built-ins like GlobalInvocationId

      • Similar to StaticNumWorkgroupsAMDX, no dispatch size is consumed from the payload

  • The input in the NodePayloadAMDX storage class must have a type of OpTypeArray or OpTypeRuntimeArray.

  • This input must be decorated with NodeMaxPayloadsAMDX, indicating the number of payloads that can be received.

  • The number of payloads received is provided in the CoalescedInputCountAMDX built-in.

  • If OpTypeArray is used, that input’s array length must be equal to the size indicated by the NodeMaxPayloadsAMDX decoration.
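The shape of a coalescing node's input can be modelled on the CPU as a fixed-size array plus a count. This is only a conceptual sketch: `NODE_MAX_PAYLOADS` stands in for the NodeMaxPayloadsAMDX operand, `coalescedInputCount` for the CoalescedInputCountAMDX built-in, and `ItemPayload` is a made-up payload type.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the NodeMaxPayloadsAMDX operand (illustrative value). */
#define NODE_MAX_PAYLOADS 8u

/* Hypothetical payload type; note there is no dispatch size at the start. */
typedef struct ItemPayload {
    uint32_t value;
} ItemPayload;

/* Models the work of one workgroup of a coalescing node: only the first
 * coalescedInputCount elements of the input array are valid. */
uint32_t sum_coalesced_inputs(const ItemPayload inputs[NODE_MAX_PAYLOADS],
                              uint32_t coalescedInputCount)
{
    uint32_t sum = 0u;
    uint32_t i;
    assert(coalescedInputCount <= NODE_MAX_PAYLOADS);
    for (i = 0; i < coalescedInputCount; ++i) {
        sum += inputs[i].value;
    }
    return sum;
}
```

How many payloads arrive together is up to the implementation, so a shader written this way has to behave correctly for any count from 1 up to the declared maximum.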

New decorations are added:

Decoration: NodeMaxPayloadsAMDX (5020)
Must only be used to decorate a variable in the NodeOutputPayloadAMDX or NodePayloadAMDX storage class.
Variables in the NodeOutputPayloadAMDX storage class must have this decoration. If such a variable is decorated, the operand indicates the maximum number of payloads in the array as well as the maximum number of payloads that can be allocated by a single workgroup for this output.
Variables in the NodePayloadAMDX storage class must have this decoration if the CoalescingAMDX execution mode is specified, otherwise they must not. If such a variable is decorated, the operand indicates the maximum number of payloads in the array.
Extra Operands: <id> Max number of payloads
Enabling Capabilities: ShaderEnqueueAMDX

Decoration: NodeSharesPayloadLimitsWithAMDX (5019)
Decorates a variable in the NodeOutputPayloadAMDX storage class to indicate that it shares output resources with Payload Array when dispatched.
Without the decoration, each variable’s resources are separately allocated against the output limits; by using the decoration only the limit of Payload Array is considered. Applications must still ensure that at runtime the actual usage does not exceed these limits, as this decoration only relaxes static validation.
Must only be used to decorate a variable in the NodeOutputPayloadAMDX storage class, Payload Array must be a different variable in the NodeOutputPayloadAMDX storage class, and Payload Array must not be itself decorated with NodeSharesPayloadLimitsWithAMDX.
It is only necessary to decorate one variable to indicate sharing between two node outputs. Multiple variables can be decorated with the same Payload Array to indicate sharing across multiple node outputs.
Extra Operands: <id> Payload Array
Enabling Capabilities: ShaderEnqueueAMDX

Decoration: PayloadNodeNameAMDX (5091)
Decorates a variable in the NodeOutputPayloadAMDX storage class to indicate that the payloads in the array will be enqueued for the shader with Node Name.
Must only be used to decorate a variable that is initialized by OpInitializeNodePayloadsAMDX.
Extra Operands: Literal Node Name
Enabling Capabilities: ShaderEnqueueAMDX

Decoration: TrackFinishWritingAMDX (5078)
Decorates a variable in the NodeOutputPayloadAMDX or NodePayloadAMDX storage class to indicate that a payload which is enqueued and then accessed in a receiving shader will be used with the OpFinishWritingNodePayloadAMDX instruction.
Must only be used to decorate a variable in the NodeOutputPayloadAMDX or NodePayloadAMDX storage class.
Must not be used to decorate a variable in the NodePayloadAMDX storage class if the shader uses CoalescingAMDX execution mode.
If a variable in NodeOutputPayloadAMDX storage class is decorated, then a matching variable with NodePayloadAMDX storage class in the receiving shader must be decorated as well.
If a variable in NodePayloadAMDX storage class is decorated, then a matching variable with NodeOutputPayloadAMDX storage class in the enqueuing shader must be decorated as well.
Extra Operands: none
Enabling Capabilities: ShaderEnqueueAMDX

The NodeSharesPayloadLimitsWithAMDX decoration allows more control over the maxExecutionGraphShaderPayloadSize limit, and can be useful when a shader may output a large number of payloads, but to potentially different nodes.
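The effect of NodeSharesPayloadLimitsWithAMDX on static accounting can be sketched as follows. This is a simplified model that assumes each output declaration counts as its maximum payload count multiplied by its payload size; the real accounting is defined by the implementation and the specification. `OutputDecl` and `static_payload_bytes` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one NodeOutputPayloadAMDX variable. */
typedef struct OutputDecl {
    uint32_t maxPayloads;   /* NodeMaxPayloadsAMDX operand */
    uint32_t payloadBytes;  /* size of one payload structure */
    int      sharesWith;    /* index of the Payload Array whose limits it
                               shares, or -1 if not decorated */
} OutputDecl;

/* Sums the statically declared payload size. Variables that share limits
 * with another Payload Array are not counted separately (model assumption). */
uint64_t static_payload_bytes(const OutputDecl *outs, uint32_t count)
{
    uint64_t total = 0u;
    uint32_t i;
    for (i = 0; i < count; ++i) {
        if (outs[i].sharesWith >= 0) {
            uint32_t target = (uint32_t)outs[i].sharesWith;
            /* The target must be a different, undecorated variable. */
            assert(target < count);
            assert(target != i);
            assert(outs[target].sharesWith < 0);
            continue;
        }
        total += (uint64_t)outs[i].maxPayloads * outs[i].payloadBytes;
    }
    return total;
}
```

With three outputs of 256 payloads of 64 bytes each, the unshared total in this model is 49152 bytes, above the 32KB minimum limit, while sharing all of them with the first output brings it down to 16384 bytes. The application still has to keep actual runtime usage within the shared limit.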

Two new built-ins are provided:

BuiltIn: ShaderIndexAMDX (5073)
Index assigned to the current shader.
Enabling Capabilities: ShaderEnqueueAMDX

BuiltIn: CoalescedInputCountAMDX (5021)
Number of valid inputs in the NodePayloadAMDX storage class array when using the CoalescingAMDX Execution Mode. (GLCompute only)
Enabling Capabilities: ShaderEnqueueAMDX

The business of actually allocating and enqueuing payloads is done by OpInitializeNodePayloadsAMDX:

OpInitializeNodePayloadsAMDX

Allocate payloads in memory and make them accessible through the Payload Array variable. The payloads are enqueued for the node shader identified by the Node Index and Node Name in the decoration PayloadNodeNameAMDX on the Payload Array variable.

Payload Array variable must be an OpTypePointer with a Storage Class of NodeOutputPayloadAMDX, and a Type of OpTypeArray with an Element Type of OpTypeStruct.

The array pointed to by Payload Array variable must have Payload Count elements.

Payloads are allocated for the Scope indicated by Visibility, and are visible to all invocations in that Scope.

Payload Count is the number of payloads to initialize in the Payload Array.

Payload Count must be less than or equal to the NodeMaxPayloadsAMDX decoration on the Payload Array variable.

Payload Count and Node Index must be dynamically uniform within the scope identified by Visibility.

Visibility must only be either Invocation or Workgroup.

This instruction must be called in uniform control flow.
This instruction must not be called on a Payload Array variable that has previously been initialized.

Capability: ShaderEnqueueAMDX

Word Count: 5 | Opcode: 5090 | <id> Payload Array | Scope <id> Visibility | <id> Payload Count | <id> Node Index

Once a payload element is initialized, it will be enqueued to workgroups in the corresponding shader after the calling shader has written all of its values. Enqueues are performed in the same manner as the vkCmdDispatchGraph* API commands. If the node enqueued has the CoalescingAMDX execution mode, there is no guarantee what set of payloads are visible to the same workgroup.

The shader must not enqueue payloads to a shader with the same name as this shader unless the index identifies this shader and MaxNodeRecursionAMDX is declared with a sufficient depth. Shaders with the same name and different indexes can each recurse independently.
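The self-enqueue rule can be expressed as a small predicate. The sketch below checks only this rule: it does not detect other forms of recursion or the graph depth limit, and it assumes the caller tracks how many times the node has already enqueued itself in the current chain. `ShaderNode` and `self_enqueue_rule_allows` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of the enqueuing shader. */
typedef struct ShaderNode {
    const char *name;
    uint32_t    index;
    bool        hasMaxRecursion;  /* MaxNodeRecursionAMDX declared? */
    uint32_t    maxRecursion;     /* its operand, if declared */
} ShaderNode;

/* Returns false if enqueuing to (targetName, targetIndex) would violate the
 * same-name rule; selfEnqueuesSoFar is the number of times this node has
 * already enqueued itself in the current chain (tracked by the caller). */
bool self_enqueue_rule_allows(const ShaderNode *self,
                              const char *targetName,
                              uint32_t targetIndex,
                              uint32_t selfEnqueuesSoFar)
{
    assert(targetName != NULL);

    /* A different node name is not covered by this rule. */
    if (strcmp(self->name, targetName) != 0) {
        return true;
    }
    /* Same name but a different index is never allowed. */
    if (self->index != targetIndex) {
        return false;
    }
    /* Same name and index: needs a declared, sufficient recursion budget. */
    return self->hasMaxRecursion && selfEnqueuesSoFar < self->maxRecursion;
}
```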

A shader can explicitly specify that it is done writing to outputs (allowing the enqueue to happen sooner) by calling OpFinalizeNodePayloadsAMDX:

OpFinalizeNodePayloadsAMDX

Optionally indicates that all accesses to an array of output payloads have completed.
Payload Array is a payload array previously initialized by OpInitializeNodePayloadsAMDX.
This instruction must be called in uniform control flow.
Payload Array must be an OpTypePointer with a Storage Class of NodeOutputPayloadAMDX, and a Type of OpTypeArray or OpTypeRuntimeArray with an Element Type of OpTypeStruct. Payload Array must not have been previously finalized by OpFinalizeNodePayloadsAMDX.

Capability: ShaderEnqueueAMDX

Word Count: 2 | Opcode: 5075 | <id> Payload Array

Once this has been called, accessing any element of Payload Array is undefined behavior.

OpFinishWritingNodePayloadAMDX

Optionally indicates that all writes to the input payload by the current workgroup have completed.
Returns true when all workgroups that can access this payload have called this function.

Must not be called if the shader is using CoalescingAMDX execution mode, or if the shader was dispatched with a vkCmdDispatchGraph* command, rather than enqueued from another shader.

Must not be called if the input payload is not decorated with TrackFinishWritingAMDX.

Result Type must be OpTypeBool.
Payload is a variable in the NodePayloadAMDX storage class.

Capability: ShaderEnqueueAMDX

Word Count: 4 | Opcode: 5078 | <id> Result Type | Result <id> | <id> Payload

Once this has been called for a given payload, writing values into that payload by the current invocation/workgroup is undefined behavior.
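The return value of OpFinishWritingNodePayloadAMDX can be pictured as a shared countdown. The following is a conceptual CPU model using C11 atomics, not a description of how any implementation tracks payloads; `TrackedPayload` and its helpers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Conceptual model of a payload shared by several workgroups. */
typedef struct TrackedPayload {
    atomic_uint remaining;  /* workgroups that have not finished writing */
} TrackedPayload;

/* workgroups: number of workgroups that can access the payload. */
void tracked_payload_init(TrackedPayload *p, unsigned workgroups)
{
    assert(workgroups > 0u);
    atomic_init(&p->remaining, workgroups);
}

/* Models one workgroup signalling that its writes are complete. Returns
 * true only for the call that completes the set, that is, once every
 * workgroup sharing the payload has called it. */
bool finish_writing(TrackedPayload *p)
{
    unsigned before = atomic_fetch_sub(&p->remaining, 1u);
    assert(before > 0u);
    return before == 1u;
}
```

In this model the workgroup that observes true knows that no other workgroup will write to the payload again, so it can safely consume the combined result.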

4. Issues

4.1. RESOLVED: For compute nodes, can the input payload be modified? If so what sees that modification?

Yes, input payloads are writable and OpFinishWritingNodePayloadAMDX instruction is provided to indicate that all workgroups that share the same payload have finished writing to it.

Limitations apply to this functionality. Please refer to the instruction’s specification.

4.2. UNRESOLVED: Do we need input from the application to tune the scratch allocation?

For now, no; more research is required to determine what information would actually be useful to know.

4.3. PROPOSED: How does this extension interact with device groups?

It works the same as any other dispatch command - work is replicated to all devices unless applications split the work themselves. There is no automatic scheduling between devices.