VK_EXT_host_image_copy
This document identifies inefficiencies with image data initialization and proposes an extension to improve it.
1. Problem Statement
Copying data to optimal-layout images in Vulkan requires staging the data in a buffer first, and using the GPU to perform the copy. Similarly, copying data out of an optimal-layout image requires a copy to a buffer. This restriction can cause a number of inefficiencies in certain scenarios.
Take initializing an image for the purpose of sampling as an example, where the source of data is a file. The application has to load the data to memory (one copy), then initialize the buffer (second copy) and finally copy over to the image (third copy). Applications can remove one copy from the above scenario by creating and memory mapping the buffer first and loading the image data from disk directly into the buffer. This is not always possible, for example because the streaming and graphics subsystems of a game engine are independent, or in the case of layering, because the layer is given a pointer to the data which is already loaded from disk.
The extra copy through a buffer is not just a performance cost. The buffer allocated for the image copy is at least as big as the image itself, and lives for a short duration until the copy is confirmed to be done. When an application performs a large number of image initializations at the same time, such as a game loading assets, it will momentarily have twice as much memory allocated for its images (the images themselves and their staging buffers), greatly increasing its peak memory usage. This can lead to out-of-memory errors on some devices.
This document proposes an extension that allows image data to be copied from/to host memory directly, obviating the need to perform the copy through a buffer and saving memory. While copying to an optimal-layout image on the CPU has its own costs, this extension can still lead to better performance by allowing the CPU to perform some copies in parallel with the GPU.
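The peak-memory argument above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming every staging buffer is exactly as large as its image and that all staging buffers are alive at the same time; the function names are hypothetical and are not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of the sizes, in bytes, of all images being initialized. */
uint64_t total_image_bytes(const uint64_t *image_bytes, size_t count)
{
    assert(image_bytes != NULL || count == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += image_bytes[i];
    return total;
}

/* Peak allocation when each image has a live staging buffer of the
 * same size (assumption: buffers are not recycled between uploads). */
uint64_t peak_bytes_with_staging(const uint64_t *image_bytes, size_t count)
{
    return 2 * total_image_bytes(image_bytes, count);
}

/* Peak allocation when data is copied from host memory directly:
 * only the images themselves are allocated. */
uint64_t peak_bytes_with_host_copy(const uint64_t *image_bytes, size_t count)
{
    return total_image_bytes(image_bytes, count);
}
```

Both figures ignore the application's own copy of the source data, which exists in either path.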
2. Proposal
An extension is proposed to address this issue. The extension’s API is designed to be similar to buffer-image and image-image copies.
Introduced by this API are:
Features, advertising whether the implementation supports host→image, image→host and image→image copies:
typedef struct VkPhysicalDeviceHostImageCopyFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           hostImageCopy;
} VkPhysicalDeviceHostImageCopyFeaturesEXT;
Query of which layouts can be used in to-image and from-image copies:
typedef struct VkPhysicalDeviceHostImageCopyPropertiesEXT {
    VkStructureType    sType;
    void*              pNext;
    uint32_t           copySrcLayoutCount;
    VkImageLayout*     pCopySrcLayouts;
    uint32_t           copyDstLayoutCount;
    VkImageLayout*     pCopyDstLayouts;
    uint8_t            optimalTilingLayoutUUID[VK_UUID_SIZE];
    VkBool32           identicalMemoryTypeRequirements;
} VkPhysicalDeviceHostImageCopyPropertiesEXT;
In the above, optimalTilingLayoutUUID can be used to ensure compatible data layouts between memory and images when using VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT in the below commands.
identicalMemoryTypeRequirements specifies whether using VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT may affect the memory type requirements of the image or not.
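One plausible use of optimalTilingLayoutUUID is to store it next to image data that was read back with VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT, and to compare it against the current device's value before reusing that data. The sketch below assumes that usage. UUID_SIZE and swizzled_data_is_compatible are hypothetical names; UUID_SIZE stands in for VK_UUID_SIZE, which is 16 in the Vulkan headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for VK_UUID_SIZE (16 in the Vulkan headers). */
#define UUID_SIZE 16

/* Returns true when data stored together with `stored_uuid` has the same
 * optimal-tiling layout as the device reporting `device_uuid`, i.e. the
 * two UUIDs are byte-for-byte identical. */
bool swizzled_data_is_compatible(const uint8_t stored_uuid[UUID_SIZE],
                                 const uint8_t device_uuid[UUID_SIZE])
{
    assert(stored_uuid != NULL && device_uuid != NULL);
    return memcmp(stored_uuid, device_uuid, UUID_SIZE) == 0;
}
```

If the UUIDs differ, the application would fall back to a copy without VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT from data in the regular, unswizzled layout.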
Defining regions to copy to an image:
typedef struct VkCopyMemoryToImageInfoEXT {
    VkStructureType               sType;
    void*                         pNext;
    VkHostImageCopyFlagsEXT       flags;
    VkImage                       dstImage;
    VkImageLayout                 dstImageLayout;
    uint32_t                      regionCount;
    const VkMemoryToImageCopyEXT* pRegions;
} VkCopyMemoryToImageInfoEXT;
In the above, flags may be VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT, in which case the data in host memory should have the same swizzling layout as the image.
This is mainly useful for embedded systems where this swizzling is known and well defined outside of Vulkan.
Defining regions to copy from an image:
typedef struct VkCopyImageToMemoryInfoEXT {
    VkStructureType               sType;
    void*                         pNext;
    VkHostImageCopyFlagsEXT       flags;
    VkImage                       srcImage;
    VkImageLayout                 srcImageLayout;
    uint32_t                      regionCount;
    const VkImageToMemoryCopyEXT* pRegions;
} VkCopyImageToMemoryInfoEXT;
In the above, flags may be VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT, in which case the data in host memory will have the same swizzling layout as the image.
Defining regions to copy between images:
typedef struct VkCopyImageToImageInfoEXT {
    VkStructureType               sType;
    void*                         pNext;
    VkHostImageCopyFlagsEXT       flags;
    VkImage                       srcImage;
    VkImageLayout                 srcImageLayout;
    VkImage                       dstImage;
    VkImageLayout                 dstImageLayout;
    uint32_t                      regionCount;
    const VkImageCopy2*           pRegions;
} VkCopyImageToImageInfoEXT;
In the above, flags may be VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT, in which case data is copied between images with no swizzling layout considerations.
Current limitations on source and destination images necessarily lead to raw copies between images, so this flag is currently redundant for image to image copies.
Defining the copy regions themselves:
typedef struct VkMemoryToImageCopyEXT {
    VkStructureType             sType;
    void*                       pNext;
    const void*                 pHostPointer;
    uint32_t                    memoryRowLength;
    uint32_t                    memoryImageHeight;
    VkImageSubresourceLayers    imageSubresource;
    VkOffset3D                  imageOffset;
    VkExtent3D                  imageExtent;
} VkMemoryToImageCopyEXT;
typedef struct VkImageToMemoryCopyEXT {
    VkStructureType             sType;
    void*                       pNext;
    void*                       pHostPointer;
    uint32_t                    memoryRowLength;
    uint32_t                    memoryImageHeight;
    VkImageSubresourceLayers    imageSubresource;
    VkOffset3D                  imageOffset;
    VkExtent3D                  imageExtent;
} VkImageToMemoryCopyEXT;
The following functions perform the actual copy:
VkResult vkCopyMemoryToImageEXT(VkDevice device, const VkCopyMemoryToImageInfoEXT* pCopyMemoryToImageInfo);
VkResult vkCopyImageToMemoryEXT(VkDevice device, const VkCopyImageToMemoryInfoEXT* pCopyImageToMemoryInfo);
VkResult vkCopyImageToImageEXT(VkDevice device, const VkCopyImageToImageInfoEXT* pCopyImageToImageInfo);
Images that are used by these copy instructions must have the VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT usage bit set.
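Since the API mirrors buffer-image copies, the memoryRowLength and memoryImageHeight fields of the region structures can be read like bufferRowLength and bufferImageHeight in VkBufferImageCopy: both are in texels, and zero means tightly packed according to imageExtent. The sketch below computes how many bytes of host memory one region spans under that reading, for an uncompressed format and without VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT. The function name and the texel_size parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of host memory spanned by one copy region of an uncompressed
 * format. Row length and image height are in texels; zero means
 * "tightly packed", i.e. equal to the region's width or height. */
uint64_t host_region_size_bytes(uint32_t memory_row_length,
                                uint32_t memory_image_height,
                                uint32_t width, uint32_t height,
                                uint32_t depth, uint32_t texel_size)
{
    assert(width > 0 && height > 0 && depth > 0 && texel_size > 0);

    uint64_t row_length = memory_row_length ? memory_row_length : width;
    uint64_t image_height = memory_image_height ? memory_image_height : height;
    assert(row_length >= width && image_height >= height);

    /* All full slices and rows before the last row, plus the last row,
     * which only needs `width` texels rather than a full row. */
    uint64_t texels =
        ((uint64_t)(depth - 1) * image_height + (height - 1)) * row_length + width;
    return texels * texel_size;
}
```

For example, a tightly packed 4x4x1 region with 4-byte texels spans 64 bytes, while the same region with a row length of 8 texels spans 112 bytes.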
Additionally, to avoid having to submit a command just to transition the image to the correct layout, the following function is introduced to do the layout transition on the host. The allowed layouts are limited to serve this purpose without requiring implementations to implement complex layout transitions.
typedef struct VkHostImageLayoutTransitionInfoEXT {
    VkStructureType            sType;
    void*                      pNext;
    VkImage                    image;
    VkImageLayout              oldLayout;
    VkImageLayout              newLayout;
    VkImageSubresourceRange    subresourceRange;
} VkHostImageLayoutTransitionInfoEXT;
VkResult vkTransitionImageLayoutEXT(VkDevice device, uint32_t transitionCount, const VkHostImageLayoutTransitionInfoEXT* pTransitions);
The allowed values for oldLayout are:
- VK_IMAGE_LAYOUT_UNDEFINED
- VK_IMAGE_LAYOUT_PREINITIALIZED
- Layouts in VkPhysicalDeviceHostImageCopyPropertiesEXT::pCopySrcLayouts
The allowed values for newLayout are:
- Layouts in VkPhysicalDeviceHostImageCopyPropertiesEXT::pCopyDstLayouts. This list always includes VK_IMAGE_LAYOUT_GENERAL.
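The rules for oldLayout and newLayout above can be sketched as a small validity check. This is only an illustration: ImageLayout and the LAYOUT_* constants are hypothetical stand-ins for VkImageLayout and its enumerants (using the same numeric values as the Vulkan headers), and the two arrays stand in for pCopySrcLayouts and pCopyDstLayouts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for VkImageLayout and three of its values. */
typedef int32_t ImageLayout;
enum {
    LAYOUT_UNDEFINED      = 0,
    LAYOUT_GENERAL        = 1,
    LAYOUT_PREINITIALIZED = 8
};

/* Linear search of a queried layout list. */
bool layout_in_list(ImageLayout layout, const ImageLayout *list, uint32_t count)
{
    assert(list != NULL || count == 0);
    for (uint32_t i = 0; i < count; ++i) {
        if (list[i] == layout)
            return true;
    }
    return false;
}

/* oldLayout must be UNDEFINED, PREINITIALIZED, or a copy-source layout;
 * newLayout must be a copy-destination layout. */
bool host_transition_is_allowed(ImageLayout old_layout, ImageLayout new_layout,
                                const ImageLayout *src_layouts, uint32_t src_count,
                                const ImageLayout *dst_layouts, uint32_t dst_count)
{
    bool old_ok = old_layout == LAYOUT_UNDEFINED ||
                  old_layout == LAYOUT_PREINITIALIZED ||
                  layout_in_list(old_layout, src_layouts, src_count);
    bool new_ok = layout_in_list(new_layout, dst_layouts, dst_count);
    return old_ok && new_ok;
}
```

An application would typically query the two lists once at startup and run a check like this in debug builds before calling vkTransitionImageLayoutEXT.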
When VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT is used in copies to or from an image with VK_IMAGE_TILING_OPTIMAL, the application may need to query the memory size needed for copy.
The vkGetImageSubresourceLayout2EXT function can be used for this purpose:
void vkGetImageSubresourceLayout2EXT(
    VkDevice                       device,
    VkImage                        image,
    const VkImageSubresource2EXT*  pSubresource,
    VkSubresourceLayout2EXT*       pLayout);
The memory size in bytes needed for copies using VK_HOST_IMAGE_COPY_MEMCPY_BIT_EXT can be retrieved by chaining VkSubresourceHostMemcpySizeEXT to pLayout:
typedef struct VkSubresourceHostMemcpySizeEXT {
    VkStructureType            sType;
    void*                      pNext;
    VkDeviceSize               size;
} VkSubresourceHostMemcpySizeEXT;
2.1. Querying support
To determine if a format supports host image copies, VK_FORMAT_FEATURE_2_HOST_IMAGE_TRANSFER_BIT_EXT is added.
2.2. Required formats
All color formats that support sampling are required to support VK_FORMAT_FEATURE_2_HOST_IMAGE_TRANSFER_BIT_EXT, with some exceptions for externally defined formats:
- DRM format modifiers
- Android hardware buffers
2.3. Limitations
Images in optimal layout are often swizzled non-linearly. When copying between images and buffers, the GPU can perform the swizzling and address translations in hardware. When copying between images and host memory however, the CPU needs to perform this swizzling. As a result:
- The implementation may decide to use a simpler and less efficient layout for the image data when VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT is specified.
  - If optimalDeviceAccess is set however (see below), the implementation indicates that the memory layout is equivalent, from a performance perspective, to that of an image that does not enable VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT. Applications can then assume that host image copy is just as efficient as using device copies for resources which are accessed many times on device.
  - Equivalent performance is only expected within a specific memory type however. On a discrete GPU for example, non-device-local memory is expected to be slower to access than device-local memory.
- The copy on the CPU may indeed be slower than the double-copy through a buffer due to the above swizzling logic.
Additionally, to perform the copy, the implementation must be able to map the image’s memory which may limit the memory type the image can be allocated from.
It is therefore recommended that developers measure performance and decide whether this extension results in a performance gain or loss in their application. Unless specifically recommended on a platform, it is not generally recommended for applications to perform all image copies through this extension.
2.4. Querying performance characteristics
typedef struct VkHostImageCopyDevicePerformanceQueryEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           optimalDeviceAccess;
    VkBool32           identicalMemoryLayout;
} VkHostImageCopyDevicePerformanceQueryEXT;
This struct can be chained as an output struct in vkGetPhysicalDeviceImageFormatProperties2.
Given certain image creation flags, it is important for applications to know if using VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT has an adverse effect on device performance.
This query cannot be a format feature flag, since image creation information can affect this query.
For example, an image that is only created with VK_IMAGE_USAGE_SAMPLED_BIT and VK_IMAGE_USAGE_TRANSFER_DST_BIT might not have compression at all on some implementations, but adding VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT would change this query.
Other implementations may want to use compression even for VK_IMAGE_USAGE_TRANSFER_DST_BIT.
identicalMemoryLayout is intended for the gray area where the image is just swizzled in a slightly different pattern to aid host access, but is fundamentally similar to non-host image copy paths, such that it is unlikely that performance changes in any meaningful way except in pathological situations.
The inclusion of this field gives more leeway to implementations that would like to set optimalDeviceAccess for an image without having to guarantee 100% identical memory layout, and allows applications to choose host image copies in that case, knowing that performance is not sacrificed.
As a baseline, block-compressed formats are required to set optimalDeviceAccess to VK_TRUE.
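As an illustration of how an application might act on this query, the sketch below picks an upload path from the two reported fields. The policy is an assumption made for this example, not something the proposal mandates, and PerfQuery, UploadPath and choose_upload_path are hypothetical names; PerfQuery stands in for the two VkBool32 fields of VkHostImageCopyDevicePerformanceQueryEXT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the two result fields of the performance query. */
typedef struct PerfQuery {
    bool optimal_device_access;
    bool identical_memory_layout;
} PerfQuery;

typedef enum UploadPath {
    UPLOAD_PATH_HOST_COPY,      /* host-side copy into the image */
    UPLOAD_PATH_STAGING_BUFFER  /* classic buffer-to-image copy on the GPU */
} UploadPath;

/* Example policy: use host copies whenever the implementation reports no
 * device-side penalty. Otherwise keep the staging path for images that
 * are accessed many times on the device, where a slower layout would
 * cost the most, and use host copies for the rest. */
UploadPath choose_upload_path(const PerfQuery *query, bool accessed_often_on_device)
{
    assert(query != NULL);
    if (query->optimal_device_access || query->identical_memory_layout)
        return UPLOAD_PATH_HOST_COPY;
    return accessed_often_on_device ? UPLOAD_PATH_STAGING_BUFFER
                                    : UPLOAD_PATH_HOST_COPY;
}
```

A policy like this only covers device-side access cost; the CPU cost of the copy itself still needs to be measured, as noted in the limitations.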
3. Issues
3.1. RESOLVED: Should other layouts be allowed in VkHostImageLayoutTransitionInfoEXT?
Specifying VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT effectively puts the image in a physical layout where VK_IMAGE_LAYOUT_GENERAL performs similarly to the OPTIMAL layouts for that image.
Therefore, it was deemed unnecessary to allow other layouts, as they provide no performance benefit.
In practice, especially for read-only textures, a host-transferred image in the VK_IMAGE_LAYOUT_GENERAL layout could be just as efficient as an image transitioned to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
VkHostImageCopyDevicePerformanceQueryEXT can be used to query whether using VK_IMAGE_USAGE_HOST_TRANSFER_BIT_EXT can be detrimental to performance.
If it is, performance measurements are recommended to ensure the gains from this extension outweigh the potential losses.
3.2. RESOLVED: Should queue family ownership transfers be supported on the host as well?
As long as the allowed layouts are limited to the ones specified above, the actual physical layout of the image will not vary between queue families, and so queue family ownership transfers are currently unnecessary.