VK_ARM_render_pass_striped

This document describes a proposal for a new extension that allows the processing of a render pass instance to be split into stripes.

1. Problem Statement

It is common to do post-processing on the images produced by a render pass instance. This is typically done using additional render pass instances or compute passes on the same graphics and compute queue - and on the same physical device.

In some cases, however, the post-processing can be more efficiently done on specialized hardware - that is not Vulkan-capable. The various memory import and export extensions can support this use-case. But existing synchronization requires that the Vulkan device has rendered the complete output image before the external device can access it. In many cases, the latency would be reduced if the post-processing could start before the complete image is rendered.

This proposal aims to support this latency reduction by splitting the processing of a render pass instance into a set of stripes that can be individually synchronized with an external device.

2. Solution Space

The first option to consider is to just use the render area (in VkRenderingInfo or VkRenderPassBeginInfo) to render each stripe separately. This would be functionally correct, but it requires the complete render pass to be replicated for each stripe, which adds overhead for both the application and the implementation.

We want to give the implementation visibility of the stripes such that some work can be shared across stripes.

The stripes should be specified per render pass. Extending the render pass structures with that information is straightforward.

There are more options for handling the synchronization.

The intent is to synchronize with an external (to Vulkan) consumer, so the synchronization primitive must be exportable.

If we use semaphores, when should the semaphore objects be provided? The options are:

  1. Provide the semaphores along with the queue submission commands (e.g. on vkQueueSubmit) as normal.

  2. Provide the semaphores along with the render pass information.

Providing the semaphores with the render pass would give the implementation all the information in one place. But it is unusual to provide semaphores during recording, and resubmitting command buffers that have semaphores embedded in them adds complexity.

This proposal picks the first option and describes a mapping between an array of semaphores and the render passes in the submitted command buffer.

This proposal does not allow stripes to be specified per subpass for two reasons:

  1. It is not necessary since the expected use-case is post-processing of the final image

  2. If different subpasses specified different stripeAreas, any subpass merging of those passes would likely have to be disabled

The expectation in this proposal is that the stripes are only used for the last subpass in a render pass instance.

3. Proposal

3.1. Features

typedef struct VkPhysicalDeviceRenderPassStripedFeaturesARM
{
    VkStructureType sType;
    void *pNext;
    VkBool32 renderPassStriped;
} VkPhysicalDeviceRenderPassStripedFeaturesARM;

This feature indicates that striped rendering is supported.

3.2. Properties

typedef struct VkPhysicalDeviceRenderPassStripedPropertiesARM
{
    VkStructureType sType;
    void *pNext;
    VkExtent2D renderPassStripeGranularity;
    uint32_t maxRenderPassStripes;
} VkPhysicalDeviceRenderPassStripedPropertiesARM;

These properties indicate implementation-defined limits on the maximum number of stripes that are supported and how fine-grained they can be. For a tile-based GPU, the stripe granularity will typically be a multiple of the tile size.
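As an illustration of how an application might work within these limits, the following sketch picks a height for full-width horizontal stripes that is a multiple of the granularity and never needs more stripes than the maximum. It is a minimal sketch under stated assumptions: Extent2D is a local stand-in for VkExtent2D so that the code builds without Vulkan headers, and choose_stripe_height and stripe_count are hypothetical helpers that are not part of the extension.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for VkExtent2D (assumption: avoids needing Vulkan headers). */
typedef struct Extent2D {
    uint32_t width;
    uint32_t height;
} Extent2D;

/* Hypothetical helper: returns the smallest multiple of granularity.height
 * such that at most max_stripes stripes of that height cover render_height.
 * The two limits would come from renderPassStripeGranularity and
 * maxRenderPassStripes. */
uint32_t choose_stripe_height(uint32_t render_height,
                              Extent2D granularity,
                              uint32_t max_stripes)
{
    assert(granularity.height > 0 && max_stripes > 0);

    /* Smallest height that still fits within max_stripes stripes. */
    uint32_t min_height = (render_height + max_stripes - 1) / max_stripes;

    /* Round up to a whole number of granularity units, at least one. */
    uint32_t units = (min_height + granularity.height - 1) / granularity.height;
    if (units == 0) {
        units = 1;
    }
    return units * granularity.height;
}

/* Hypothetical helper: number of stripes of stripe_height needed to cover
 * render_height. The last stripe may be shorter than the others. */
uint32_t stripe_count(uint32_t render_height, uint32_t stripe_height)
{
    assert(stripe_height > 0);
    return (render_height + stripe_height - 1) / stripe_height;
}
```

With a granularity of 64x64, a render height of 1080 and a limit of four stripes, this yields a stripe height of 320, giving three stripes of 320 rows and a final stripe of 120 rows.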

3.3. Specifying stripes

Stripes can be specified per render pass instance. The following structure can be added to the pNext chain of VkRenderingInfo or VkRenderPassBeginInfo:

typedef struct VkRenderPassStripeBeginInfoARM
{
    VkStructureType sType;
    const void *pNext;
    uint32_t stripeInfoCount;
    const VkRenderPassStripeInfoARM *pStripeInfos;
} VkRenderPassStripeBeginInfoARM;

stripeInfoCount is the number of stripes and also the number of elements in the pStripeInfos array.

The individual stripes are specified by the following structure:

typedef struct VkRenderPassStripeInfoARM
{
    VkStructureType sType;
    const void *pNext;
    VkRect2D stripeArea;
} VkRenderPassStripeInfoARM;

stripeArea is the region covered by the stripe. As a rule, the values of stripeArea.offset.x and stripeArea.extent.width must be multiples of renderPassStripeGranularity.width. However, it is difficult for an application to guarantee that the dimensions of every image are multiples of the stripe granularity. We therefore allow an exception to the general rule for the stripe at the edge: stripeArea.extent.width does not need to be a multiple of renderPassStripeGranularity.width as long as the sum of stripeArea.offset.x and stripeArea.extent.width is equal to the renderArea.extent.width of the render pass instance. The same constraints apply to the values of stripeArea.offset.y and stripeArea.extent.height, with respect to renderPassStripeGranularity.height and renderArea.extent.height.
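The alignment rule and its edge exception can be sketched as a small check that an application might run before recording. This is an illustrative sketch only: Offset2D, Extent2D and Rect2D are local stand-ins for the corresponding Vulkan types, stripe_axis_ok and stripe_area_ok are hypothetical helpers, and rejecting negative offsets is an assumption of the sketch, not something stated by this proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for VkOffset2D, VkExtent2D and VkRect2D. */
typedef struct Offset2D { int32_t x, y; } Offset2D;
typedef struct Extent2D { uint32_t width, height; } Extent2D;
typedef struct Rect2D { Offset2D offset; Extent2D extent; } Rect2D;

/* Checks the rule along one axis: the offset must be a multiple of the
 * granularity, and the size must be a multiple too unless offset + size
 * equals the render area extent (the edge exception). */
bool stripe_axis_ok(int32_t offset, uint32_t size,
                    uint32_t granularity, uint32_t render_extent)
{
    if (offset < 0) {
        return false; /* assumption of this sketch */
    }
    if ((uint32_t)offset % granularity != 0) {
        return false;
    }
    if (size % granularity == 0) {
        return true;
    }
    /* Edge exception; 64-bit sum avoids wrap-around. */
    return (uint64_t)offset + size == render_extent;
}

/* Applies the same rule to both axes of a stripe area. */
bool stripe_area_ok(Rect2D stripe, Extent2D granularity, Extent2D render_extent)
{
    assert(granularity.width > 0 && granularity.height > 0);
    return stripe_axis_ok(stripe.offset.x, stripe.extent.width,
                          granularity.width, render_extent.width) &&
           stripe_axis_ok(stripe.offset.y, stripe.extent.height,
                          granularity.height, render_extent.height);
}
```

For a 1920x1080 render area with a 64x64 granularity, a final stripe at offset.y = 960 with a height of 120 passes the check because 960 + 120 equals 1080, whereas the same stripe with a height of 100 does not.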

In order to synchronize with an external consumer, a semaphore needs to be signaled when a stripe completes. These semaphores are specified on queue submit by including the following structure in the pNext chain of VkSubmitInfo2→VkCommandBufferSubmitInfo:

typedef struct VkRenderPassStripeSubmitInfoARM
{
    VkStructureType sType;
    const void *pNext;
    uint32_t stripeSemaphoreInfoCount;
    const VkSemaphoreSubmitInfo* pStripeSemaphoreInfos;
} VkRenderPassStripeSubmitInfoARM;

stripeSemaphoreInfoCount is the number of elements in pStripeSemaphoreInfos. The value of stripeSemaphoreInfoCount must be equal to the sum of the VkRenderPassStripeBeginInfoARM→stripeInfoCount parameters that are recorded in the VkCommandBufferSubmitInfo→commandBuffer.

pStripeSemaphoreInfos is a pointer to an array of VkSemaphoreSubmitInfo structures describing the render pass stripe signal operations. The elements of this array are mapped to striped render passes in submission order and in stripe order within each render pass. Each semaphore is signaled when the associated stripe is complete.

For example, if VkCommandBufferSubmitInfo→commandBuffer contains three render passes, where the first has two stripes, the second is not striped, and the third has three stripes, then stripeSemaphoreInfoCount must have the value 5, the first two entries of pStripeSemaphoreInfos are associated with the first render pass and the last three entries are associated with the third render pass. The semaphore in the first entry of pStripeSemaphoreInfos is signaled when the first stripe of the first render pass is complete, etc.
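The mapping described above amounts to a running sum over the per-render-pass stripe counts. The following sketch shows one way an application might compute the array index for a given render pass and stripe. The helper names and the stripe_counts array (one entry per render pass in submission order, with 0 for a render pass that is not striped) are assumptions made for illustration and are not part of the extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index in pStripeSemaphoreInfos of the first stripe
 * of render pass `pass`. stripe_counts holds the stripeInfoCount of each
 * render pass in submission order, with 0 for non-striped render passes. */
uint32_t first_semaphore_index(const uint32_t *stripe_counts, size_t pass)
{
    assert(stripe_counts != NULL);
    uint32_t index = 0;
    for (size_t i = 0; i < pass; ++i) {
        index += stripe_counts[i];
    }
    return index;
}

/* Hypothetical helper: index of the semaphore for stripe `stripe` of
 * render pass `pass`, following stripe order within the render pass. */
uint32_t semaphore_index(const uint32_t *stripe_counts,
                         size_t pass, uint32_t stripe)
{
    assert(stripe < stripe_counts[pass]);
    return first_semaphore_index(stripe_counts, pass) + stripe;
}

/* Hypothetical helper: the value stripeSemaphoreInfoCount must have for a
 * command buffer containing pass_count render passes. */
uint32_t total_stripe_semaphores(const uint32_t *stripe_counts,
                                 size_t pass_count)
{
    return first_semaphore_index(stripe_counts, pass_count);
}
```

For the example above, the stripe counts are {2, 0, 3}: the total is 5, the first render pass uses indices 0 and 1, and the third render pass uses indices 2 to 4.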

4. Issues

4.1. PROPOSED: Naming. Do we use "stripe" or "slice"?

The specification uses "slice" to refer to slices of a 3D image. This proposal wants to express partitions in 2D. The video extensions, e.g. "VK_EXT_video_decode_h264", use "slice" to mean something very similar to what this proposal is aiming to achieve.

We use "stripe" to disambiguate from the way slice is used to describe partitions in 3D.

4.2. PROPOSED: Could we use timeline semaphores for this?

Probably. But timeline semaphores are not yet available on all platforms.

Regular timeline semaphores are not shareable to foreign queues so are not a good fit for the target use-cases.

External timeline semaphores require the native platform to support a TIMELINE-type native fence. There are two ways to export these:

  1. OPAQUE_FD: These are of limited use since they can only be shared by the same driver

  2. Platform timeline fences: There is currently no way to export these from Vulkan

For now, we require binary semaphores.

4.3. PROPOSED: How do stripes interact with multiview or layered rendering?

The main question is whether a stripe covers individual views (or layers of the framebuffer) or all of them.

Each view could be post-processed separately - and in that case, one may want to start the post-processing for a stripe of one view before the next view is processed. In that case, we would want one semaphore per view.

This extension is specified to complete a stripe for all views (or layers) before moving on to the next stripe. The main motivation for this choice is to reduce the number of synchronization operations required.

4.4. PROPOSED: Is striped rendering supported for all renderable formats?

Yes.