Synchronizing the CPU and GPU
This sample compares two methods for synchronizing between the CPU and GPU, WaitIdle and Fences, demonstrating which one is the best option to avoid stalling.
The simplest way to synchronize the CPU and GPU is to use either vkQueueWaitIdle or vkDeviceWaitIdle; these commands wait until the queue or device has finished executing all work dispatched to it. Note that when using a single queue, vkQueueWaitIdle is functionally equivalent to vkDeviceWaitIdle.
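As a hedged sketch (queue, device, and submit_info are placeholder names, and this fragment assumes an already-created Vulkan device rather than being a standalone program), the coarse approach looks like:

```c
// Coarse CPU/GPU synchronization: block the CPU until the GPU is idle.
vkQueueSubmit(queue, 1, &submit_info, VK_NULL_HANDLE);
vkQueueWaitIdle(queue);    // waits for all work submitted to this queue
// With a single queue, this has the same effect as:
vkDeviceWaitIdle(device);  // waits for every queue on the device
```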
While this method works reliably, it is far coarser than is actually required to synchronize the CPU and GPU. It creates bubbles in the GPU's workload that prevent it from maintaining a full pipeline, stopping it from parallelizing vertex and fragment work from separate frames, which results in higher frame times and lower efficiency.
The alternative to WaitIdle is to use a Vulkan Fence object. Fences are designed to let the GPU inform the CPU when it has finished a single frame's workload, allowing the CPU to safely re-use that frame's resources. This method avoids stalling while waiting for the GPU to finish executing: the CPU can continue to submit the following frames without having to wait for the GPU, which in turn prevents the GPU pipeline from draining of work.
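The arithmetic behind this overlap can be sketched with a toy model (not part of the sample; the 16 ms / 56 ms split below is an assumption chosen to match the measurements reported later in this document):

```c
// Toy model of frame pacing. With WaitIdle the CPU's command recording
// and the GPU's execution run back to back, so a frame costs their sum.
// With per-frame fences the two overlap, so a frame costs only the
// longer of the two.
int frame_time_waitidle_ms(int cpu_ms, int gpu_ms)
{
    return cpu_ms + gpu_ms; /* serial: the GPU idles while the CPU records */
}

int frame_time_fenced_ms(int cpu_ms, int gpu_ms)
{
    return cpu_ms > gpu_ms ? cpu_ms : gpu_ms; /* pipelined: they overlap */
}
```

With an assumed 16 ms of CPU recording and 56 ms of GPU work per frame, the model gives 72 ms per frame serially and 56 ms pipelined, mirroring the figures measured on the Mali G76 device below.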
This sample provides two radio buttons that allow you to alternate between using WaitIdle and Fences. When WaitIdle is selected, the sample calls vkDeviceWaitIdle before beginning each frame. This forces the GPU to finish executing all work dispatched to it and, in doing so, drains the pipeline. As a result, the GPU sits idle from the moment the next frame's command buffer starts being recorded until it is dispatched, which increases frame times.
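Sketched as a render loop (hedged: record_frame_commands, cmd_buffer, submit_info, and present_info are illustrative placeholders, not the sample's actual identifiers):

```c
// "WaitIdle" path: drain the whole device before recording each frame.
while (running) {
    vkDeviceWaitIdle(device);           // CPU stalls; the GPU then sits idle...
    record_frame_commands(cmd_buffer);  // ...for this entire recording span
    vkQueueSubmit(queue, 1, &submit_info, VK_NULL_HANDLE);
    vkQueuePresentKHR(queue, &present_info);
}
```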
When Fence is selected, the sample assigns a Fence to each frame during its creation; it then calls vkWaitForFences at the start of each frame, using the Fence belonging to the next frame to be computed. This method allows the CPU to continue dispatching work to the GPU while the GPU executes the previous frame's workload.
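A hedged sketch of this per-frame fence pattern (frames, FRAME_COUNT, and record_frame_commands are illustrative placeholders, not the sample's actual code; the fences are created signalled so that the very first wait returns immediately):

```c
// At creation time: one fence per in-flight frame.
VkFenceCreateInfo fence_info = {
    .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,
    .flags = VK_FENCE_CREATE_SIGNALED_BIT,
};
for (uint32_t i = 0; i < FRAME_COUNT; i++)
    vkCreateFence(device, &fence_info, NULL, &frames[i].fence);

// Per frame: wait only for this frame's previous submission to finish,
// then reset the fence and hand it back to vkQueueSubmit so the GPU
// signals it when the new submission completes.
VkFence fence = frames[frame_index].fence;
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &fence);
record_frame_commands(frames[frame_index].cmd_buffer);
vkQueueSubmit(queue, 1, &submit_info, fence);
```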
Below is a screenshot of the sample running on a phone with a Mali G76 GPU. When WaitIdle is used the average frame time is 72ms, yet it decreases by 22% to 56ms when Fences is selected.
The performance increase is also clearly demonstrated when using Streamline Performance Analyzer.
The graph below shows the Job Manager Cycles and stalls in a single frame when WaitIdle is used, while the following graph shows the Job Manager Cycles for a frame when Fences are used.
This output clearly shows that WaitIdle forces the GPU to drain all of its work, which leaves the GPU idling and causes higher frame times.
Best practice summary:
Do: use Fences to asynchronously read back data to the CPU; don't block synchronously and cause the pipeline to drain.
Don't: unnecessarily wait for GPU data on either the CPU or the GPU.
Avoid: calling vkDeviceWaitIdle() unless absolutely necessary; it provides only coarse, less granular synchronization.
Impact: the impact can be very minor or very significant, depending on the relative sizes and ordering of the queued workloads.
Debugging: Arm Mobile Studio can be used to visualize the Arm CPU and Mali GPU activity on both GPU queues, and can quickly show bubbles in scheduling either locally to the GPU queues (indicative of a stage dependency issue) or globally across both CPU and GPU (indicative of a blocking CPU call being used).