Over the last few months at Rebellion we've taken our in-house Asura engine used in Sniper Elite 3 and added support for AMD's Mantle API.
Our Head of Programming Kevin Floyer-Lea brings us up-to-date with the story so far...
The primary goal of Mantle is to provide a low-level interface that allows applications to speak directly to AMD's "Graphics Core Next" family of GPUs - greatly reducing the CPU overhead of translating commands for the GPU. With more traditional APIs like DirectX 11 there is often a disconnect between how costly a developer thinks (hopes!) an API call will be, and how much work the driver actually ends up doing underneath.
In simple terms the expected CPU gains of Mantle should be twofold. Firstly, making a command stream for the GPU should be less work on the CPU - and without any "surprises" or mysterious stalls. Secondly, the making of command streams can be entirely multithreaded. The native support of multithreading is perhaps one of the most important features from Rebellion’s point of view - while Microsoft had made some attempts at supporting multithreading with DX11 it was fundamentally limited by the single-threaded design choices of the previous versions.
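As a rough illustration of that second point (not Asura or Mantle code), the pattern is to record independent command lists on worker threads and then submit them in a fixed order. The names `record_commands` and `build_frame` and the tuple-based command format are hypothetical, and Python threads only show the structure here, not real parallel speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def record_commands(batch):
    """Build a command list for one batch of draws (hypothetical format)."""
    return [("draw", mesh) for mesh in batch]

def build_frame(batches, workers=4):
    """Record each batch on a worker thread, then flatten in submission order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the final stream is deterministic
        command_lists = list(pool.map(record_commands, batches))
    # A single thread would then hand the finished lists to the GPU queue.
    return [cmd for cmds in command_lists for cmd in cmds]

frame = build_frame([["terrain", "trees"], ["soldier", "truck"], ["sky"]])
```

The key property is that recording needs no shared state between threads; only the final submission step is ordered.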
Furthermore, with Mantle the developer gains access to things that drivers typically hide away - like the GPU's dedicated memory, or hardware features such as Asynchronous Compute. This brings the PC closer to console programming, where developers are used to having direct control over available resources and squeezing the most out of the hardware.
It was these aspects which drew us to supporting Mantle - we'd long wished for the sort of control we had on consoles on our PC titles, and it was clear that whatever else may happen with Mantle in the future, it's most definitely kick-started a move to more lightweight APIs as we've seen with recent announcements concerning Microsoft’s DirectX 12, Apple's Metal, and Khronos’ Next Generation OpenGL Initiative.
Our main goal for supporting Mantle was to take maximum advantage of the potential for multithreading the API calls, and refactor our existing engine rendering pipeline to better fit what we predict are the requirements of this new breed of lightweight APIs. In that respect we spent more time restructuring our engine's rendering architecture than we did writing Mantle-specific code!
It was also important that we used exactly the same data and assets as the (already shipped!) DX11 version of Sniper Elite 3 - so we wouldn't be optimising any shaders, data formats or rendering techniques at this stage - we'd just be shipping a new executable and using the same assets. This was primarily done to reduce cost and risk - but in hindsight it makes us a fairly unbiased test case between the two APIs.
What we have now is a fairly preliminary implementation in many respects - as Asura is a fully cross-platform engine designed to work on multiple platforms simultaneously, we aim to build upon this work to make a more independent code layer which sits over multiple low-level APIs as they become available.
For our first comparison let’s look at the beginning of the “Siwa” level of Sniper Elite 3, which is one of the more graphically demanding start positions in the game as it encompasses lots of layered scenery and vegetation stretching off to the old city complex in the distance. Half-hidden in the scene are dozens of people and some vehicles which the culling system can’t remove because they are actually visible – just not that obvious. Gameplay hasn’t really kicked off yet so the rest of the engine’s systems are idling along; rendering is the biggest CPU hit here.
Below is what Task Manager reports if we just sit at the start position for 60 seconds. This is using an Intel i7-3770K CPU with 8 logical processors, coupled with an AMD R9 290X GPU, running on Ultra settings at a resolution of 1920x1200 – so we’d expect to be GPU bound in this scenario.
The Mantle version clearly shows a much more balanced CPU load across the cores – though the total CPU utilisation has only dropped from 23% on DirectX 11, to 21% on Mantle. The more balanced load is exactly as we’d hoped, since all the Mantle API calls are now distributed across the available cores by our Asura engine’s multithreaded task system, just like we do for other systems like AI, animation or physics.
It’s worth noting that Sniper Elite 3 and the Asura engine are already optimised to account for DirectX 11’s weaknesses. For example, we make heavy use of instancing and similar batching techniques to reduce the number of draw calls we make per frame – all the usual things to reduce CPU overhead, which means Mantle will have fewer easy wins compared to other draw call heavy titles.
So that’s what the CPU is doing – but what’s the actual frame rate? On those settings we’re running at an average of 88fps on DX11, and 100fps on Mantle – around a 13.6% speed increase. This explains why the total CPU utilisation is still quite similar – with Mantle the CPU has to cope with 12 more frames every second, meaning we’re packing in more work and yet still using less CPU power. Furthermore, because the work is more distributed, if we increase CPU load (say by using a faster graphics card, or by lowering resolution) a single logical processor is less likely to become the bottleneck.
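For anyone wanting to check the arithmetic, the figures above work out as follows. The helper names are purely illustrative; the only inputs are the two average frame rates quoted.

```python
def speedup_percent(fps_before, fps_after):
    """Relative frame rate increase, as a percentage of the starting rate."""
    return (fps_after - fps_before) / fps_before * 100.0

def frame_time_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps

gain = speedup_percent(88, 100)   # (100 - 88) / 88
dx11_ms = frame_time_ms(88)       # per-frame time on DX11
mantle_ms = frame_time_ms(100)    # per-frame time on Mantle
print(f"{gain:.1f}% faster, {dx11_ms:.2f} ms -> {mantle_ms:.2f} ms per frame")
```

Put another way, each frame takes roughly 1.4 ms less on Mantle in this scene.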
The size of the frame rate increase is a pleasant surprise, as frankly at this stage in development we were expecting to have a more equal frame rate when GPU bound. There’s still a fair amount of scope for increasing performance with Mantle, particularly as we’re not yet taking advantage of the Asynchronous Compute queue. This would allow us to take some of our expensive compute shaders – like our Obscurance Fields technique – and schedule them to run in parallel with the rendering of shadow maps, which are particularly light on ALU work.
One reason for the performance gains seen so far may be the way we are handling the GPU’s memory - we pre-allocate VRAM in large chunks and then directly manage and defragment that memory ourselves. Similarly when updates for dynamic data and streaming textures are needed, we DMA copy the affected memory as part of our command stream to the GPU - thus eliminating the sort of copying and duplicating of buffers the DirectX drivers might have to do.
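To give a flavour of what managing and defragmenting memory ourselves involves, here is a toy sub-allocator over a single pre-allocated block. It is a minimal sketch, not our engine code: the `VramPool` class, its first-fit policy and its bookkeeping are all assumptions for illustration, and the "moves" it returns stand in for the GPU copies a real implementation would issue.

```python
class VramPool:
    """Toy sub-allocator over one pre-allocated block (sizes in bytes)."""

    def __init__(self, size):
        self.size = size
        self.blocks = {}  # name -> (offset, length); hypothetical bookkeeping

    def _used_sorted(self):
        # Live blocks ordered by their offset within the pool
        return sorted(self.blocks.items(), key=lambda kv: kv[1][0])

    def alloc(self, name, length):
        """First-fit: place the block in the first gap that is large enough."""
        cursor = 0
        for _, (offset, used) in self._used_sorted():
            if offset - cursor >= length:
                break  # the gap before this block fits
            cursor = offset + used
        if cursor + length > self.size:
            return None  # no gap fits; caller may defragment and retry
        self.blocks[name] = (cursor, length)
        return cursor

    def free(self, name):
        del self.blocks[name]

    def defragment(self):
        """Slide live blocks down to close gaps; returns the copies required."""
        moves, cursor = [], 0
        for name, (offset, length) in self._used_sorted():
            if offset != cursor:
                moves.append((name, offset, cursor))  # (block, from, to)
                self.blocks[name] = (cursor, length)
            cursor += length
        return moves
```

In use, freeing a block can leave a hole too small for the next request even though enough total space exists; compacting the live blocks then makes the allocation succeed.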
Ironically, one unintended consequence of increased texture streaming performance, and the ability to hold more textures at once given we have more control over memory, is that we’ve found that we often have far more high resolution textures being used in the Mantle version... which could in theory decrease frame rate. Thankfully speed increases from other areas seem to have hidden this, so hopefully you’ll just get better looking textures.
Another big reason for the speed gains is the way Mantle handles shaders. On DirectX we’re accustomed to having separate shader stages that are treated independently – the common ones being vertex and pixel shaders. Mantle instead uses monolithic pipelines – a concept that combines all the shader stages and the relevant rendering state into a single object.
As well as taking less CPU overhead to use, having everything together in one pipeline allows for some holistic optimisations that otherwise wouldn’t be possible – for example, perhaps that value calculated in the vertex shader isn’t actually used in the pixel shader... so it could be optimised out entirely. This seems to have particularly benefited Sniper Elite 3 when it comes to tessellation, where we’re making heavy use of all the traditional stages as well as hull and domain shaders.
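The "unused value" case can be pictured with a deliberately simplified model. Real shader compilers work on compiled shader code rather than lists of names, so treat `link_pipeline` and the interface names below as hypothetical stand-ins for what becomes possible once all stages are known together.

```python
def link_pipeline(vs_outputs, ps_inputs):
    """Split vertex shader outputs into those the pixel shader reads and those it never uses."""
    kept = [name for name in vs_outputs if name in ps_inputs]
    removed = [name for name in vs_outputs if name not in ps_inputs]
    return kept, removed

kept, removed = link_pipeline(
    vs_outputs=["uv", "normal", "fog_factor"],
    ps_inputs={"uv", "normal"},
)
# 'removed' lists values whose calculation could be dropped from the vertex stage
```

With separately compiled stages the vertex shader has to assume every output might be read, so this kind of pruning is not available.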
To make testing easier we’ve added a Benchmark option to Sniper Elite 3 – available on the “Extras” page from the game’s front end menus. The benchmark contains varying scenes similar to what happens in game, e.g. wide, long distance views; close-ups with tessellation; obscurance fields and shadows; a truck full of characters driving by; lots of special effects overdraw in a gratuitous slow-mo explosion. These put different degrees of stress on the CPU and GPU and hopefully give us a more representative view of what happens in the game as a whole.
A word of caution at this point - when leaving the benchmark running repeatedly, we found that the dynamic power management software can kick in, reducing the GPU clock speed and thus skewing the profiling results. So it’s a good idea to use something like AMD’s OverDrive panel to monitor your GPU and guarantee consistency – and possibly increase your allowed fan speed if you don’t mind trading noise for frame rate!
At the end of the benchmark you’ll get an average frame rate report, and a more detailed log file is saved out to your Documents folder. Our initial tests with the benchmark are showing performance gains very similar to those seen in the Siwa test above; here’s a breakdown using our R9 290X setup, varying both resolution and quality settings.
To guarantee we’re GPU bound for the final setting we’ll use 1920x1200 at Ultra quality with 4x supersampling – which means the engine internally renders everything at 3840x2400, and then right at the end downsamples back to 1920x1200 to give us an extremely good looking (and expensive) anti-aliased image.
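The resolution figures follow from splitting the sample count evenly between the two axes, as this small sketch shows. The function name is illustrative, and the even split is an assumption that matches the 3840x2400 figure quoted above.

```python
import math

def internal_resolution(width, height, sample_factor):
    """Render size for ordered-grid supersampling with a square sample factor."""
    scale = math.isqrt(sample_factor)
    if scale * scale != sample_factor:
        raise ValueError("sample_factor must be a perfect square, e.g. 4 or 16")
    return width * scale, height * scale

ssaa_width, ssaa_height = internal_resolution(1920, 1200, 4)
pixel_ratio = (ssaa_width * ssaa_height) / (1920 * 1200)
```

So 4x supersampling means twice the width and twice the height, and four times as many pixels to shade.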
Similarly, here are the results for an HD 7970, coupled with an older CPU that has only 4 logical processors:
Rather than going into more detail here we’ll let tech sites and interested users have a go themselves and come to their own conclusions. Let us know what you find!
Try it yourself
The latest version of Sniper Elite 3 now available on Steam has support for both Mantle and the Benchmark feature. To enable the Mantle build you need to select the “Use Mantle” tickbox in the game’s launcher, which is accessed via the Options button. The tickbox should be greyed out if you don’t have the requisite hardware or up-to-date drivers – we require AMD Catalyst™ 14.9 or later drivers, which are available here: http://support.amd.com/en-us/download
NOTE: be aware that these drivers only support Windows 7 and Windows 8.1 – not Windows 8.0! If you have Windows 8.0 you can update to 8.1 for free via the Windows Store page. Best to back stuff up first!
All in all, even this first pass of Mantle has delivered all that we’d hoped for:
- Improved frame rate
- Reduced CPU power consumption (important for laptops)
- Less susceptible to frame rate spikes when other programs hit the CPU
- Future scalability with higher numbers of cores
- Scope for increasing scene and world complexity
- Ability to increase the CPU budget for other systems like AI.
The last two points are more relevant to our future games, and for now we need to see how this first pass of Mantle behaves in the wild and fix any issues that come up, before moving on to new features and improvements that would make sense to add to Sniper Elite 3. One big area that we haven't yet addressed and which needs investigating is multiple GPU support - this can be a tricky area to get right.
The way DirectX 11 handles multiple GPUs is “AFR” or Alternate Frame Rendering, which as the name suggests means if you have two comparably powered GPUs they simply take turns rendering frames. This is in many respects the easiest approach to take – and is a great way of making your game CPU bound! So possibly our Mantle version could show some big improvements when using this method.
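A simplified way to see why AFR pushes a game towards being CPU bound: frames are dealt out round-robin, so to keep every GPU busy the CPU has to prepare frames proportionally faster. The sketch below is an idealised model with hypothetical names, and the 20 ms GPU frame time is an illustrative figure rather than a measurement.

```python
def afr_gpu_for_frame(frame_index, gpu_count):
    """Alternate Frame Rendering: frames are dealt to GPUs round-robin."""
    return frame_index % gpu_count

def cpu_budget_ms(gpu_frame_ms, gpu_count):
    """Time the CPU has to prepare each frame if every GPU is to stay busy."""
    return gpu_frame_ms / gpu_count

assignment = [afr_gpu_for_frame(i, 2) for i in range(6)]
budget = cpu_budget_ms(20.0, 2)  # illustrative: each GPU needs 20 ms per frame
```

With two GPUs the CPU's per-frame budget halves, which is where lower API overhead should pay off.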
However, with the independent control over the GPUs Mantle gives us, we could approach the problem very differently - for example one GPU could be rendering the basic geometry in the scene, while another handles lighting and shadows for the same frame, with the final image composited at the end. This may also provide a route for when GPUs aren’t of a comparable power level – for example an APU’s integrated graphics coupled with a discrete desktop GPU. It’s the potential for completely new approaches like this which excites me the most about Mantle and the APIs which will follow it.
Head of Programming, Rebellion