VK_QCOM_data_graph_model
This document proposes a new extension which builds upon VK_ARM_data_graph to allow applications to run QNN models with data graph pipelines.
1. Problem Statement
Machine learning models, such as those in ONNX format, are defined as data graphs and are produced by frameworks such as PyTorch or TensorFlow. These graphs can be accelerated efficiently on heterogeneous compute platforms, such as the Qualcomm® AI Engine.
Image processing is a primary use case for these models, leveraging the Adreno™ GPU for pre- and post-processing tasks, and either the Hexagon™ NPU or Adreno™ GPU for model execution. However, the current workflow lacks standardization, making seamless and efficient interoperability between these engines challenging.
2. Solution Space
Vulkan is a natural fit as a standards platform for heterogeneous compute workflows. Data graph pipelines can be leveraged to execute QNN models on the NPU or GPU and interop with the GPU’s image processing capabilities.
3. Proposal
This proposal builds on two existing extensions:
- VK_ARM_tensors, which defines tensor objects that can be used for the model's inputs and outputs
- VK_ARM_data_graph, which provides a framework to execute the models in data graph pipelines
3.1. Querying engine capabilities
First, applications need to enumerate the queue families using vkGetPhysicalDeviceQueueFamilyProperties and filter for those that support VK_QUEUE_DATA_GRAPH_BIT_ARM.
Next, applications can determine the engine properties of these queue families using vkGetPhysicalDeviceQueueFamilyDataGraphPropertiesARM.
The query returns a list of engine/operation pairs supported by each queue family. An engine or operation may appear in the list more than once, for example when a single engine supports multiple operations.
This extension exposes new engines to run neural models for the Qualcomm® AI Engine:
VK_PHYSICAL_DEVICE_DATA_GRAPH_PROCESSING_ENGINE_TYPE_NEURAL_QCOM = 1000629000 // Hexagon(TM) NPU
VK_PHYSICAL_DEVICE_DATA_GRAPH_PROCESSING_ENGINE_TYPE_COMPUTE_QCOM = 1000629001 // Adreno(TM) GPU - reserved for future use
Only external semaphores and memory are permitted with foreign engines. The application can query the supported external handle types with vkGetPhysicalDeviceQueueFamilyDataGraphProcessingEnginePropertiesARM.
The extension also exposes new operations, which define the model and/or algorithm to run:
VK_PHYSICAL_DEVICE_DATA_GRAPH_OPERATION_TYPE_NEURAL_MODEL_QCOM = 1000629000
VK_PHYSICAL_DEVICE_DATA_GRAPH_OPERATION_TYPE_BUILTIN_MODEL_QCOM = 1000629001
- VK_PHYSICAL_DEVICE_DATA_GRAPH_OPERATION_TYPE_NEURAL_MODEL_QCOM defines operations that can execute a compatible data graph model provided by the application. The VkPhysicalDeviceDataGraphOperationSupportARM::name defines the type of model supported.
- VK_PHYSICAL_DEVICE_DATA_GRAPH_OPERATION_TYPE_BUILTIN_MODEL_QCOM defines operations that can execute a predefined model that is provided by the implementation. The VkPhysicalDeviceDataGraphOperationSupportARM::name defines the type of built-in model.
In each case, defining the names of the operations is out of scope of this document. Please refer to the appropriate documentation for specific interfacing instructions.
3.2. Foreign pipeline creation
3.2.1. Prepare external pipeline cache
In order to supply a data graph model to the implementation for foreign engines, the application must first prepare a binary using tooling external to Vulkan.
This workflow is out of scope of this document. As an example, at the time of this writing, documentation for VK_PHYSICAL_DEVICE_DATA_GRAPH_OPERATION_TYPE_NEURAL_MODEL_QCOM operation types with the "Generic QNN" name can be found in the QNN Documentation.
See the setup instructions for information about how to convert a model to QNN, quantize, serialize, and prepare a Vulkan pipeline cache blob.
The pipeline cache blob, if properly constructed, will have the following cache header:
VK_PIPELINE_CACHE_HEADER_VERSION_DATA_GRAPH_QCOM = 1000629000
VK_DATA_GRAPH_MODEL_TOOLCHAIN_VERSION_LENGTH_QCOM = 3U
typedef enum VkDataGraphModelCacheTypeQCOM {
VK_DATA_GRAPH_MODEL_CACHE_TYPE_GENERIC_BINARY_QCOM = 0, // Usable with "Generic QNN" operation name (out of scope of this document)
VK_DATA_GRAPH_MODEL_CACHE_TYPE_INVALID_QCOM = 0xFFFFFFFF,
} VkDataGraphModelCacheTypeQCOM;
typedef struct VkPipelineCacheHeaderVersionDataGraphQCOM {
uint32_t headerSize;
VkPipelineCacheHeaderVersion headerVersion;
VkDataGraphModelCacheTypeQCOM cacheType;
uint32_t cacheVersion;
uint32_t toolchainVersion[VK_DATA_GRAPH_MODEL_TOOLCHAIN_VERSION_LENGTH_QCOM];
} VkPipelineCacheHeaderVersionDataGraphQCOM;
- headerSize specifies the size in bytes of the header
- headerVersion specifies VK_PIPELINE_CACHE_HEADER_VERSION_DATA_GRAPH_QCOM for this header type
- cacheType specifies the type of model binary contained within
- cacheVersion specifies the serialized encoding version of the model binary contained within
- toolchainVersion specifies the toolchain version that built the model binary contained within
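Before handing a blob to vkCreatePipelineCache, an application may want to sanity-check these fields. The sketch below is illustrative only: it mirrors the definitions above as local standalone types (the real declarations come from the Vulkan headers) and validates a raw blob against them.

```cpp
#include <cstdint>
#include <cstring>

// Local mirrors of the definitions above so the sketch is self-contained.
enum VkDataGraphModelCacheTypeQCOM : uint32_t {
    VK_DATA_GRAPH_MODEL_CACHE_TYPE_GENERIC_BINARY_QCOM = 0,
    VK_DATA_GRAPH_MODEL_CACHE_TYPE_INVALID_QCOM = 0xFFFFFFFF,
};

constexpr uint32_t VK_PIPELINE_CACHE_HEADER_VERSION_DATA_GRAPH_QCOM = 1000629000;
constexpr uint32_t VK_DATA_GRAPH_MODEL_TOOLCHAIN_VERSION_LENGTH_QCOM = 3;

struct VkPipelineCacheHeaderVersionDataGraphQCOM {
    uint32_t headerSize;
    uint32_t headerVersion;  // VkPipelineCacheHeaderVersion in the real API
    VkDataGraphModelCacheTypeQCOM cacheType;
    uint32_t cacheVersion;
    uint32_t toolchainVersion[VK_DATA_GRAPH_MODEL_TOOLCHAIN_VERSION_LENGTH_QCOM];
};

// Returns true if the blob begins with a plausible data graph cache header,
// copying the parsed header to pOut. memcpy avoids unaligned reads.
bool IsDataGraphCacheBlob(const uint8_t* pBlob, size_t blobSize,
                          VkPipelineCacheHeaderVersionDataGraphQCOM* pOut) {
    if (blobSize < sizeof(*pOut)) return false;
    std::memcpy(pOut, pBlob, sizeof(*pOut));
    return pOut->headerSize >= sizeof(*pOut) &&
           pOut->headerVersion == VK_PIPELINE_CACHE_HEADER_VERSION_DATA_GRAPH_QCOM &&
           pOut->cacheType != VK_DATA_GRAPH_MODEL_CACHE_TYPE_INVALID_QCOM;
}
```

A check like this cannot prove device compatibility (see the next section), but it does catch blobs that are truncated or were never prepared for this extension.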
3.2.2. Import pipeline cache
Once a model is properly serialized with a VkPipelineCacheHeaderVersionDataGraphQCOM header, it can be imported into
Vulkan by creating a pipeline cache with vkCreatePipelineCache, passing the blob in the
VkPipelineCacheCreateInfo::pInitialData parameter.
It is out of scope for Vulkan or the Vulkan Validation Layers (VVL) to verify that the blob is compatible with the device,
with the exception that the dataGraphModel feature must be enabled to import a blob with the
VK_PIPELINE_CACHE_HEADER_VERSION_DATA_GRAPH_QCOM header version.
If it is not compatible, then
vkCreateDataGraphPipelinesARM will
return VK_PIPELINE_COMPILE_REQUIRED. The application should refer to the appropriate documentation to determine
compatible cache types, operation types, and respective versions.
3.2.3. Create data graph pipeline
Imported data graph models can be used to create a data graph pipeline by passing the pipeline cache to the vkCreateDataGraphPipelinesARM call.
The appropriate engine type must be attached to the VkDataGraphPipelineCreateInfoARM::pNext with the
VkDataGraphProcessingEngineCreateInfoARM
structure. This specializes the pipeline to only be bindable to command buffers allocated from pools also created with
this engine type.
VkDataGraphPipelineCreateInfoARM::flags must have at least VK_PIPELINE_CREATE_2_FAIL_ON_PIPELINE_COMPILE_REQUIRED_BIT set.
If anything goes wrong with importing the binary, VK_PIPELINE_COMPILE_REQUIRED will be returned.
Imported blobs may contain multiple models, so a VkDataGraphPipelineIdentifierCreateInfoARM
must be chained to the VkDataGraphPipelineCreateInfoARM::pNext. See the appropriate documentation to acquire the correct
VkDataGraphPipelineIdentifierCreateInfoARM::pIdentifier for the constructed blob.
VkDataGraphPipelineResourceInfoARM is not permitted for data graphs imported in this manner. The resources, that is
the model's inputs and outputs, are defined by the built model; their requirements are described in the appropriate documentation,
obtainable from the model's author.
The input/output binding mappings to construct the VkPipelineLayout should be obtainable from the documentation of the
tool that packed the model into a pipeline cache blob.
Session memory must be allocated with external memory created with a handle type retrieved from
VkQueueFamilyDataGraphProcessingEnginePropertiesARM::foreignMemoryHandleTypes.
3.3. Foreign built-in models
The application prepares no pipeline blob or shader module for the built-in models.
These models are provided by the implementation and selected at compile time by providing the following
structure to VkDataGraphPipelineCreateInfoARM::pNext:
typedef struct VkDataGraphPipelineBuiltinModelCreateInfoQCOM {
VkStructureType sType;
void* pNext;
const VkPhysicalDeviceDataGraphOperationSupportARM* pOperation;
} VkDataGraphPipelineBuiltinModelCreateInfoQCOM;
- pOperation specifies the built-in operation, and must match all fields of a supported operation for the engine provided to the pipeline creation in VkDataGraphProcessingEngineCreateInfoARM
Some built-in models require arguments; these can be passed with VkDataGraphPipelineCompilerControlCreateInfoARM.
See the appropriate documentation for what each built-in model does and the arguments that it takes, as well as any
input and output descriptors it needs to define in the VkPipelineLayout.
Creating the pipeline is otherwise very similar to foreign models:
- VkDataGraphProcessingEngineCreateInfoARM must be provided with the appropriate engine type
- VK_PIPELINE_CREATE_2_FAIL_ON_PIPELINE_COMPILE_REQUIRED_BIT must be provided, and creation will fail if the arguments are not compatible with the operation
- VkDataGraphPipelineResourceInfoARM must not be provided; resource compatibility is defined by the operation, not by the application
Additionally, unlike foreign models, pipelineCache is ignored.
3.4. Descriptor sets
Descriptor sets must be allocated from a descriptor pool with VkDataGraphProcessingEngineCreateInfoARM specified.
Descriptor buffers must be allocated with the VK_BUFFER_USAGE_2_DATA_GRAPH_FOREIGN_DESCRIPTOR_BIT_ARM usage flag if
the descriptor buffer will be bound for use in a foreign engine.
All tensors bound to a foreign descriptor set must adhere to the binding locations provided by the appropriate
documentation and be allocated with external memory created with a handle type retrieved from
VkQueueFamilyDataGraphProcessingEnginePropertiesARM::foreignMemoryHandleTypes.
3.5. Tensors
Foreign tensors must be bound to external memory using
VkExternalMemoryTensorCreateInfoARM and
include the VK_TENSOR_USAGE_DATA_GRAPH_BIT_ARM usage.
See the appropriate documentation, obtainable from the creator of the model, for how to set the remaining parameters so that the tensors are compatible with the inputs and outputs of the model.
In order for the tensors to interoperate with the GPU, they must be aliased to an image using VK_TENSOR_USAGE_IMAGE_ALIASING_BIT_ARM.
See the memory aliasing section for rules about
tensor/image aliasing.
It is possible that the parameters required for tensor creation of the model’s inputs and outputs are not compatible or optimal with the GPU. In this case, the application should alias a tensor that is compatible with an image and GPU, then use vkCmdCopyTensorARM to copy it to/from a tensor that is compatible with the model.
Optimal tiled aliased images will always be compatible with the model’s tensors,
provided the model/engine supports that tiling mode, and the memory aliasing rules are followed.
To determine whether linear images are compatible with the model, use vkGetImageSubresourceLayout to get
the required padding for the image, and check whether the resulting strides are permitted by the model for the
input/output tensor dimensions, following the mapping between subresource layout and tensor dimensions as described in
memory aliasing.
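As an illustration of that mapping, the byte strides for a 4-D NHWC tensor aliasing a linear image can be derived from the queried subresource layout. The helpers below are a sketch: the function names are ours, and they assume a single-plane image with tightly packed channels, matching the stride layout used in the tensor-creation example later in this document.

```cpp
#include <array>
#include <cstdint>

// Derive NHWC tensor strides (in bytes) for a tensor aliasing a linear image:
// N -> depthPitch, H -> rowPitch, W -> channels * bytesPerChannel,
// C -> bytesPerChannel. Assumes tightly packed channels in a single plane.
std::array<int64_t, 4> NhwcStridesFromLayout(int64_t rowPitch, int64_t depthPitch,
                                             int64_t channels, int64_t bytesPerChannel) {
    return {depthPitch, rowPitch, channels * bytesPerChannel, bytesPerChannel};
}

// The linear image is usable directly only if the derived strides match the
// strides the model requires for the corresponding input/output tensor.
bool StridesCompatible(const std::array<int64_t, 4>& imageStrides,
                       const std::array<int64_t, 4>& modelStrides) {
    return imageStrides == modelStrides;
}
```

For example, a 1920x1080 image with 3 tightly packed 8-bit channels and no row padding would yield strides {1920*1080*3, 1920*3, 3, 1}; if the model requires different strides, fall back to the vkCmdCopyTensorARM path described above.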
3.6. Command buffers
Command buffers must be allocated from a pool that was created with VkDataGraphProcessingEngineCreateInfoARM specified.
If the queueFamilyIndex that was used to create the pool only supports VK_QUEUE_DATA_GRAPH_BIT_ARM, only a small set of
commands can be recorded to the command buffer, as specified by the Supported Queue Types property
listed after each command definition in the specification.
At the time of this writing, the following commands are permitted:
- vkCmdBindPipeline
- vkCmdBindDescriptorSets
- vkCmdBindDescriptorBuffersEXT
- vkCmdSetDescriptorBufferOffsetsEXT
- vkCmdSetDescriptorBufferOffsets2EXT
- vkCmdDispatchDataGraphARM
The vkCmdDispatchDataGraphARM command is what records the execution of the data graph.
3.7. Synchronization
No barriers are permitted unless other queue types are exposed for the family; data-graph-only barriers are left for a future extension. No implicit barriers are issued by a dispatch, so any hazards between dispatches must be split across different queue submit batches and ordered using semaphores.
Semaphores executing with a queue created from a family that includes a foreign engine must be created as external
using one of the handle types retrieved from foreignSemaphoreHandleTypes.
4. Example
4.1. Prepare cache
The following is an upscaling example to illustrate a workflow at the time of this writing. Links and tools may change with time; please refer to the appropriate documentation for current practices.
# See offline documentation to generate and push the following to device:
# * PipelineCache.bin
# * PipelineIdentifier.bin
# * GraphData.json
4.2. Queue properties
// <Query queue family properties and set VkQueueFamilyProperties to pProps>
for (uint32_t i = 0; i < count; i++) {
if (pProps[i].queueFlags & VK_QUEUE_DATA_GRAPH_BIT_ARM) {
// Get the graph properties
uint32_t graphCount = 0;
vkGetPhysicalDeviceQueueFamilyDataGraphPropertiesARM(device, i, &graphCount, nullptr);
VkQueueFamilyDataGraphPropertiesARM* pGraphProps = new VkQueueFamilyDataGraphPropertiesARM[graphCount];
for (uint32_t j = 0; j < graphCount; j++) {
pGraphProps[j].sType = VK_STRUCTURE_TYPE_QUEUE_FAMILY_DATA_GRAPH_PROPERTIES_ARM;
pGraphProps[j].pNext = nullptr;
}
vkGetPhysicalDeviceQueueFamilyDataGraphPropertiesARM(device, i, &graphCount, pGraphProps);
for (uint32_t j = 0; j < graphCount; j++) {
// Find engine for Hexagon(TM) NPU, with Generic QNN operation
if ((pGraphProps[j].engine.type == VK_PHYSICAL_DEVICE_DATA_GRAPH_PROCESSING_ENGINE_TYPE_NEURAL_QCOM) &&
(pGraphProps[j].engine.isForeign) &&
(pGraphProps[j].operation.operationType == VK_PHYSICAL_DEVICE_DATA_GRAPH_OPERATION_TYPE_NEURAL_MODEL_QCOM) &&
(strncmp(pGraphProps[j].operation.name, "Generic QNN", sizeof(pGraphProps[j].operation.name)) == 0)) {
// NOTE Check pGraphProps[j].operation.version is compatible from appropriate documentation
// <Suitable queueFamilyIndex found at `i`>
}
}
}
}
4.3. Engine properties
VkPhysicalDeviceQueueFamilyDataGraphProcessingEngineInfoARM info = {
VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_QUEUE_FAMILY_DATA_GRAPH_PROCESSING_ENGINE_INFO_ARM, // sType
nullptr, // pNext
queueFamilyIndex, // queueFamilyIndex
VK_PHYSICAL_DEVICE_DATA_GRAPH_PROCESSING_ENGINE_TYPE_NEURAL_QCOM // engineType
};
VkQueueFamilyDataGraphProcessingEnginePropertiesARM props = {
VK_STRUCTURE_TYPE_QUEUE_FAMILY_DATA_GRAPH_PROCESSING_ENGINE_PROPERTIES_ARM // sType
};
vkGetPhysicalDeviceQueueFamilyDataGraphProcessingEnginePropertiesARM(device, &info, &props);
// <Determine which external handle to use from props.foreignSemaphoreHandleTypes and props.foreignMemoryHandleTypes>
4.4. Create descriptor set layout
// NOTE See GraphData.json for the required inputs/outputs, this model takes 1 of each
VkDescriptorSetLayoutBinding bindings[] = {
{
inputBinding, // binding - sourced from GraphData.json
VK_DESCRIPTOR_TYPE_TENSOR_ARM, // descriptorType
1, // descriptorCount
VK_SHADER_STAGE_COMPUTE_BIT, // stageFlags
nullptr // pImmutableSamplers
},
{
outputBinding, // binding - sourced from GraphData.json
VK_DESCRIPTOR_TYPE_TENSOR_ARM, // descriptorType
1, // descriptorCount
VK_SHADER_STAGE_COMPUTE_BIT, // stageFlags
nullptr // pImmutableSamplers
}
};
// <Create descriptor set layout like normal>
4.5. Create pipeline cache
FILE* pFile = fopen("/sdcard/PipelineCache.bin", "rb");
fseek(pFile, 0, SEEK_END);
long size = ftell(pFile);
fseek(pFile, 0, SEEK_SET);
uint8_t* pCacheBlob = new uint8_t[size];
size_t bytesRead = fread(pCacheBlob, 1, size, pFile);
VkPipelineCacheCreateInfo cacheInfo = {
VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO, // sType
nullptr, // pNext
0, // flags
size, // initialDataSize
pCacheBlob // pInitialData
};
vkCreatePipelineCache(device, &cacheInfo, nullptr, pPipelineCache);
4.6. Create pipeline
FILE* pFile = fopen("/sdcard/PipelineIdentifier.bin", "rb");
fseek(pFile, 0, SEEK_END);
long size = ftell(pFile);
fseek(pFile, 0, SEEK_SET);
uint8_t* pIdentifierBlob = new uint8_t[size];
size_t bytesRead = fread(pIdentifierBlob, 1, size, pFile);
VkPhysicalDeviceDataGraphProcessingEngineARM engine = {
VK_PHYSICAL_DEVICE_DATA_GRAPH_PROCESSING_ENGINE_TYPE_NEURAL_QCOM, // type
true // isForeign
};
VkDataGraphProcessingEngineCreateInfoARM engineInfo = {
VK_STRUCTURE_TYPE_DATA_GRAPH_PROCESSING_ENGINE_CREATE_INFO_ARM, // sType
nullptr, // pNext
1, // processingEngineCount
&engine // pProcessingEngines
};
VkDataGraphPipelineIdentifierCreateInfoARM identifierInfo = {
VK_STRUCTURE_TYPE_DATA_GRAPH_PIPELINE_IDENTIFIER_CREATE_INFO_ARM, // sType
&engineInfo, // pNext
size, // identifierSize
pIdentifierBlob // pIdentifier
};
VkDataGraphPipelineCreateInfoARM createInfo = {
VK_STRUCTURE_TYPE_DATA_GRAPH_PIPELINE_CREATE_INFO_ARM, // sType
&identifierInfo, // pNext
VK_PIPELINE_CREATE_2_FAIL_ON_PIPELINE_COMPILE_REQUIRED_BIT, // flags
layout, // layout - created from descriptor set layout like normal
0, // resourceInfoCount
nullptr // pResourceInfos
};
vkCreateDataGraphPipelinesARM(device, VK_NULL_HANDLE, pipelineCache, 1, &createInfo, nullptr, pPipeline);
4.7. Create tensors
This example assumes that the model's tensors are not compatible with the GPU images and shows how to create two different kinds of tensors: model tensors and aliased tensors. The intent is to render to the GPU attachment with the aliased tensor, then perform a tensor copy to the model tensor.
memFlags is the chosen external handle type that was previously queried as being supported by the engine,
used for the model tensors that the engine accesses directly. When memFlags == 0, the path being used is the
aliased tensor; it does not need to be external in this example because aliased tensors are not used directly by the model.
To determine if a model’s tensor is compatible with the GPU images, see the Tensor section.
uint32_t queueFamilies[] = {
graphicsFamilyIndex, // Find this queue family
queueFamilyIndex // Hexagon(TM) NPU queue family
};
VkExternalMemoryTensorCreateInfoARM externalMem = {
VK_STRUCTURE_TYPE_EXTERNAL_MEMORY_TENSOR_CREATE_INFO_ARM, // sType
nullptr, // pNext
memFlags // handleTypes
};
int64_t dimensions[][4] = {
{1, 540, 960, 3}, // Input tensor
{1, 1080, 1920, 3}, // Output tensor
};
// NOTE For the model tensors, these values need to be found in offline documentation for the model
VkTensorDescriptionARM desc = {
VK_STRUCTURE_TYPE_TENSOR_DESCRIPTION_ARM, // sType
nullptr, // pNext
VK_TENSOR_TILING_OPTIMAL_ARM, // tiling
VK_FORMAT_R8_UNORM, // format
ARRAY_SIZE(dimensions[(isInput) ? 0 : 1]), // dimensionCount
dimensions[(isInput) ? 0 : 1], // pDimensions
nullptr, // pStrides - implementation determines for optimal
VK_TENSOR_USAGE_TRANSFER_DST_BIT_ARM | // usage - add transfer usage since tensors will be copied;
VK_TENSOR_USAGE_TRANSFER_SRC_BIT_ARM | // could make this more exact though, for example
((memFlags) ? VK_TENSOR_USAGE_DATA_GRAPH_BIT_ARM : // the input model tensor only needs to be a DST
VK_TENSOR_USAGE_IMAGE_ALIASING_BIT_ARM) // transfer to copy render target data into it
};
// If memFlags is 0 then creating internal tensor only, not for use with foreign
VkTensorCreateInfoARM info = {
VK_STRUCTURE_TYPE_TENSOR_CREATE_INFO_ARM, // sType
(memFlags) ? &externalMem : nullptr, // pNext
0, // flags
&desc, // pDescription
VK_SHARING_MODE_EXCLUSIVE, // sharingMode
(memFlags) ? ARRAY_SIZE(queueFamilies) : 1, // queueFamilyIndexCount
queueFamilies, // pQueueFamilyIndices
};
vkCreateTensorARM(device, &info, nullptr, pTensor);
VkTensorMemoryRequirementsInfoARM reqInfo = {
VK_STRUCTURE_TYPE_TENSOR_MEMORY_REQUIREMENTS_INFO_ARM, // sType
nullptr, // pNext
*pTensor // tensor
};
VkMemoryRequirements2 memReqs = {VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2};
vkGetTensorMemoryRequirementsARM(device, &reqInfo, &memReqs);
// Include this assuming dedicated allocation is required (as it is per spec with the AHB export assumption above)
VkMemoryDedicatedAllocateInfoTensorARM dedicatedInfo = {
VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO_TENSOR_ARM, // sType
nullptr, // pNext
*pTensor // tensor
};
VkExportMemoryAllocateInfo exportInfo = {
VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO, // sType
&dedicatedInfo, // pNext
memFlags // handleTypes
};
VkMemoryAllocateInfo allocInfo = {
VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO, // sType
(memFlags) ? &exportInfo : nullptr, // pNext
memReqs.memoryRequirements.size, // allocationSize
std::countr_zero(memReqs.memoryRequirements.memoryTypeBits) // memoryTypeIndex - needs better algorithm to select this
};
VkDeviceMemory memory;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
// Should probably batch this up for all tensors ready to be bound
VkBindTensorMemoryInfoARM bindInfo = {
VK_STRUCTURE_TYPE_BIND_TENSOR_MEMORY_INFO_ARM, // sType
nullptr, // pNext
*pTensor, // tensor
memory, // memory
0 // memoryOffset
};
vkBindTensorMemoryARM(device, 1, &bindInfo);
// View not needed if just doing a tensor copy, so make the views for the model tensors only
if (memFlags)
{
VkTensorViewCreateInfoARM viewInfo = {
VK_STRUCTURE_TYPE_TENSOR_VIEW_CREATE_INFO_ARM, // sType
nullptr, // pNext
0, // flags
*pTensor, // tensor
VK_FORMAT_R8_UNORM, // format
};
vkCreateTensorViewARM(device, &viewInfo, nullptr, pTensorView);
}
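As the comment in the allocation code above notes, picking the memory type with std::countr_zero takes the first supported type without checking its property flags. A more careful selection scans memoryTypeBits against the required flags. This sketch uses minimal local mirrors of the Vulkan memory property structures (names ours) so that it stands alone; a real application would use VkPhysicalDeviceMemoryProperties from vkGetPhysicalDeviceMemoryProperties.

```cpp
#include <cstdint>

// Minimal stand-ins for VkMemoryType / VkPhysicalDeviceMemoryProperties.
struct MemoryType { uint32_t propertyFlags; };
struct MemoryProperties { uint32_t memoryTypeCount; MemoryType memoryTypes[32]; };

// Pick the first memory type allowed by memoryTypeBits (from the tensor's
// VkMemoryRequirements) that has all requested property flags.
// Returns UINT32_MAX if none qualifies.
uint32_t FindMemoryTypeIndex(const MemoryProperties& props, uint32_t memoryTypeBits,
                             uint32_t requiredFlags) {
    for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
        const bool allowed = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags = (props.memoryTypes[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags) return i;
    }
    return UINT32_MAX;
}
```

The `memoryTypeIndex` in the allocation above could then be chosen with, for example, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT as the required flags.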
If the GPU image is compatible with the model tensor, then only one kind of tensor needs to be created, an aliased and external tensor, and the tensor copy can be omitted. The intent is to alias the offscreen render attachment and use that tensor directly in the model. The following is an example of doing this with a linear image.
uint32_t queueFamilies[] = {
graphicsFamilyIndex, // Find this queue family
queueFamilyIndex // Hexagon(TM) NPU queue family
};
VkExternalMemoryTensorCreateInfoARM externalMem = {
VK_STRUCTURE_TYPE_EXTERNAL_MEMORY_TENSOR_CREATE_INFO_ARM, // sType
nullptr, // pNext
memFlags // handleTypes
};
int64_t dimensions[][4] = {
{1, 540, 960, 3}, // Input tensor
{1, 1080, 1920, 3}, // Output tensor
};
// subresourceLayout[0] was queried from the render target image
// subresourceLayout[1] was queried from the output image
int64_t strides[][4] = {
{subresourceLayout[0].depthPitch, subresourceLayout[0].rowPitch, 3, 1},
{subresourceLayout[1].depthPitch, subresourceLayout[1].rowPitch, 3, 1}
};
VkTensorDescriptionARM desc = {
VK_STRUCTURE_TYPE_TENSOR_DESCRIPTION_ARM, // sType
nullptr, // pNext
VK_TENSOR_TILING_LINEAR_ARM, // tiling
VK_FORMAT_R8_UNORM, // format
ARRAY_SIZE(dimensions[(isInput) ? 0 : 1]), // dimensionCount
dimensions[(isInput) ? 0 : 1], // pDimensions
strides[(isInput) ? 0 : 1], // pStrides
VK_TENSOR_USAGE_TRANSFER_DST_BIT_ARM | // usage
VK_TENSOR_USAGE_TRANSFER_SRC_BIT_ARM |
VK_TENSOR_USAGE_DATA_GRAPH_BIT_ARM |
VK_TENSOR_USAGE_IMAGE_ALIASING_BIT_ARM
};
// The tensor is always external in this example since it is used directly by the model
VkTensorCreateInfoARM info = {
VK_STRUCTURE_TYPE_TENSOR_CREATE_INFO_ARM, // sType
&externalMem, // pNext
0, // flags
&desc, // pDescription
VK_SHARING_MODE_EXCLUSIVE, // sharingMode
ARRAY_SIZE(queueFamilies), // queueFamilyIndexCount
queueFamilies, // pQueueFamilyIndices
};
vkCreateTensorARM(device, &info, nullptr, pTensor);
// <Allocate and bind memory like above, create the view>
4.8. Create aliased image
// This uses VkTensorDescriptionARM set to pDesc to determine how to size the image
VkImageCreateInfo info = {
VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO, // sType
nullptr, // pNext
0, // flags
VK_IMAGE_TYPE_2D, // imageType
GetGfxFormat(pDesc), // format - set to VK_FORMAT_R8G8B8_UNORM for {R8_UNORM, dimensions[count - 1] = 3}
{(uint32_t)pDesc->pDimensions[2], (uint32_t)pDesc->pDimensions[1], 1}, // extent - NHWC
1, // mipLevels
(uint32_t)pDesc->pDimensions[0], // arrayLayers
VK_SAMPLE_COUNT_1_BIT, // samples
VK_IMAGE_TILING_OPTIMAL, // tiling
VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT | // usage - for render attachment image, switch usage for output image
VK_IMAGE_USAGE_TENSOR_ALIASING_BIT_ARM,
VK_SHARING_MODE_EXCLUSIVE, // sharingMode
1, // queueFamilyIndexCount
&queueFamilyIndex, // pQueueFamilyIndices
VK_IMAGE_LAYOUT_UNDEFINED // initialLayout
};
vkCreateImage(device, &info, nullptr, pImage);
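The NHWC-to-image mapping used above, extent {W, H, 1} with arrayLayers = N and the channel count folded into the image format, can be factored into small helpers. This is a sketch; the helper names, the local Extent3D stand-in, and the 4-D NHWC assumption are ours.

```cpp
#include <cstdint>

// Local stand-in for VkExtent3D so the sketch is self-contained.
struct Extent3D { uint32_t width, height, depth; };

// Map 4-D NHWC tensor dimensions onto a 2D image, mirroring the example above:
// extent = {W, H, 1}. The channel count is folded into the image format
// (e.g. C = 3 with an R8_UNORM tensor maps to an RGB image format).
Extent3D ExtentFromNhwc(const int64_t dims[4]) {
    return {static_cast<uint32_t>(dims[2]), static_cast<uint32_t>(dims[1]), 1u};
}

// The batch dimension N becomes the image's arrayLayers.
uint32_t ArrayLayersFromNhwc(const int64_t dims[4]) {
    return static_cast<uint32_t>(dims[0]);
}
```

With the output tensor dimensions from the example, {1, 1080, 1920, 3}, this yields a 1920x1080 image with a single array layer.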
4.9. Create descriptors
VkPhysicalDeviceDataGraphProcessingEngineARM engine = {
VK_PHYSICAL_DEVICE_DATA_GRAPH_PROCESSING_ENGINE_TYPE_NEURAL_QCOM, // type
true // isForeign
};
VkDataGraphProcessingEngineCreateInfoARM engineInfo = {
VK_STRUCTURE_TYPE_DATA_GRAPH_PROCESSING_ENGINE_CREATE_INFO_ARM, // sType
nullptr, // pNext
1, // processingEngineCount
&engine // pProcessingEngines
};
VkDescriptorPoolSize sizes[] = {
{VK_DESCRIPTOR_TYPE_TENSOR_ARM, 2}
};
VkDescriptorPoolCreateInfo poolInfo = {
VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO, // sType
&engineInfo, // pNext
0, // flags
1, // maxSets
ARRAY_SIZE(sizes), // poolSizeCount
sizes // pPoolSizes
};
vkCreateDescriptorPool(device, &poolInfo, nullptr, &pool);
// <allocate descriptor set using set layout and pool, assign to `set`>
VkWriteDescriptorSetTensorARM tensorWrites[] = {
{
VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET_TENSOR_ARM, // sType
nullptr, // pNext
1, // tensorViewCount
&tensorView[0] // pTensorViews - for input model tensor
},
{
VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET_TENSOR_ARM, // sType
nullptr, // pNext
1, // tensorViewCount
&tensorView[1] // pTensorViews - for output model tensor
}
};
VkWriteDescriptorSet writes[] = {
{
VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, // sType
&tensorWrites[0], // pNext
set, // dstSet
inputBinding, // binding - sourced from GraphData.json
0, // dstArrayElement
1, // descriptorCount
VK_DESCRIPTOR_TYPE_TENSOR_ARM, // descriptorType
nullptr, // pImageInfo
nullptr, // pBufferInfo
nullptr // pTexelBufferView
},
{
VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, // sType
&tensorWrites[1], // pNext
set, // dstSet
outputBinding, // binding - sourced from GraphData.json
0, // dstArrayElement
1, // descriptorCount
VK_DESCRIPTOR_TYPE_TENSOR_ARM, // descriptorType
nullptr, // pImageInfo
nullptr, // pBufferInfo
nullptr // pTexelBufferView
}
};
vkUpdateDescriptorSets(device, ARRAY_SIZE(writes), writes, 0, nullptr);
4.10. Create command pool
VkPhysicalDeviceDataGraphProcessingEngineARM engine = {
VK_PHYSICAL_DEVICE_DATA_GRAPH_PROCESSING_ENGINE_TYPE_NEURAL_QCOM, // type
true // isForeign
};
VkDataGraphProcessingEngineCreateInfoARM engineInfo = {
VK_STRUCTURE_TYPE_DATA_GRAPH_PROCESSING_ENGINE_CREATE_INFO_ARM, // sType
nullptr, // pNext
1, // processingEngineCount
&engine // pProcessingEngines
};
VkCommandPoolCreateInfo info = {
VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO, // sType
&engineInfo, // pNext
0, // flags
queueFamilyIndex // queueFamilyIndex - Hexagon(TM) NPU queue family
};
// NOTE When binding session memory (VkBindDataGraphPipelineSessionMemoryInfoARM), make sure the VkDeviceMemory
// is also allocated with the supported external handle type flags