Jump to content

Help needed: Optimizing 0 A.D. with Intel VTune Amplifier


Recommended Posts

Hello, 0 A.D. community! I'm a Technical Consulting Engineer at Intel Corporation, and I need your help.

In short, I'm reaching out to you in the hopes of finding 1-3 developers familiar with the 0 A.D. code base to work with me on optimizing your game using Intel's software analysis tools, specifically Intel® VTune™ Amplifier. While I won't be contributing any patches personally, and do not intend to be a long term developer, you are more than welcome to keep and submit any optimizations we make as part of this project, free of charge. In particular I'm interested in optimizing the functions BlendBoneMatrices, Render, and possibly also SkinPoint.

 

A little bit of background for you. Intel® VTune™ Amplifier is a sampling-based analysis tool used to measure performance and locate/diagnose bottlenecks in code. Unfortunately, Intel has a bit of a deficiency in documentation for using this tool on graphical applications like video games (a problem in and of itself already), which in turn leads to the additional problem of a lack of familiarity with this use case in the employees tasked with supporting the product.

Currently I am attempting to remedy the second problem by creating an Intel internal training demo for using Intel® VTune™ Amplifier on video games. In theory it should be simple: find a suitable game, run the analyzer, optimize appropriately, show off the results. After some searching around I settled on 0 A.D. as a good candidate for this purpose, and I've been collecting plenty of data on it (it's also a great excuse to play a really fun RTS at work ;)). I've been able to identify some bottlenecks worth investigation and I've diagnosed the causes to varying degrees. But this game is a... very big project, and that's where I got stuck. I'm brand new to this code base and while I know (or at least suspect) what needs to be fixed, I don't know how to do it. 

This is where you come in. With your assistance, I hope to understand how to implement these optimizations, since you know the code much better than I do. Again, you're perfectly welcome to keep whatever optimizations we come up with.

 

While the current goal is to create an internal training demo, ideally I hope to create official public documentation on the Intel Developer Zone website as well, such as a tutorial or video, using 0 A.D. as the example application. Assuming Legal gives me the go-ahead after reviewing the game's licensing conditions, I would be more than happy to give you credit for your assistance (should you request it) in this context as well.

 

Please let me know if you're interested in assisting. I would be happy to provide more detailed information about the bottlenecks I've identified, as well as access to Intel® VTune™ Amplifier, my analysis result files, and the 0 A.D. save file I've been testing on.

*I'd like to make it clear that I am acting here as an individual employee of Intel Corporation pursuing a work-related task, but I should certainly not be taken as an official representative of the company as a whole. I am "Alex from Intel", not just "Intel"!

  • Like 13
Link to comment
Share on other sites

Hello Alex! Thank you very much for your interest and for picking 0 A.D. for this project :D

I feel like the best way for you to get answers would be to be in touch with the programmers through chat (we use IRC) while you are at work. It mainly depends on your timezone: indeed, we are rather active outside of our work hours, because of the volunteer nature of the project. I hope we can find a common window :)

Additionally, feel free to contact me or @feneur for any legal question, especially regarding our code license.

Looking forward to hearing about your analysis!

  • Like 3
  • Thanks 1
Link to comment
Share on other sites

Thank you for your response, @Itms! I'll definitely try the IRC channel. I'm usually available 9am-4pm PST, meetings and other duties notwithstanding. 

I'll keep your offer in mind if Legal needs additional information. Honestly I don't expect there to be any issues, though! :) 

Edited by Alex from Intel
Link to comment
Share on other sites

For the sake of keeping track of things, and for easy access by the developers helping me out, I figure I should recap all my findings so far.

I started with the code as of February 1st, to which I added some VTune Amplifier ITT API calls to mark frames as well as place user events marking the durations of actual gameplay and the loading screen on the CPU timeline. This was the only initial change made.

My designated test case is a save file from a medium sized random Lake map featuring the Romans (myself) versus the Iberians (CPU). My standard test procedure is to select my army of 115 units, march them westward along the southern edge of the lake, and engage the Iberian forces.

  Reveal hidden contents

 

On filtering down the Microarchitecture Analysis result that I'm currently working from to only the actual gameplay portion, I found three general regions of interest: a slow portion immediately after the loading screen ends, normal gameplay, and a bizarre "gap" where no frames were processed at all.

Long version:

  Reveal hidden contents

Short version:

I've written the dead section off as a fluke (and I don't think it could be optimized anyway). The slow part at the beginning needs further analysis to see if I can get meaningful data. I've concluded that CModelDef::BlendBoneMatrices and ShaderModelRenderer::Render are the primary functions to investigate.

 

Analysis of Functions:

BlendBoneMatrices assessment:

  Reveal hidden contents

 

Render assessment:

Render is more complicated. It's overwhelmingly back-end bound (89.8% of pipeline slots) and almost all of this is memory bound (81.4%). Digging deeper into the numbers it's 74.6% DRAM bound, which essentially means that it's slowed down mainly by missing the last level cache quite often. Even more abstractly, it's not accessing data in an efficient way. 

It's a fairly long function and it had several hotspots in it.

Hotspot #1, line 446

  Reveal hidden contents

I haven't dived so deeply into the other hotspots of note in this function.

  Reveal hidden contents

So that's pretty much what I've got as of right now.

Edited by Alex from Intel
  • Like 3
  • Thanks 2
Link to comment
Share on other sites

You mention your test case uses 135 units. Usually games are played with a maximum total of 1200 units aprox. (8 players 150 population each). From what I know then the pathfinder becomes the most time consuming task for the CPU. Might be worth looking at it with VTune.

  • Like 1
Link to comment
Share on other sites

Thanks @nani! I'll definitely look into using larger test cases - but just for the record, the stated "115 units" is just the size of the army I'm directly controlling/moving. I have other units going about their business back in my city and no idea how many the other faction has running around (I'll have to pause the game in combat and see if I can count just how many Iberian units come rushing out to fend me off when I attack).

I suppose I should also mention that earlier (using a different VTune Amplifier result collected under different conditions), I did actually find CCMPPathfinder::ComputeShortPath as a hotspot! The cause of the bottlenecking, according to my analysis, was that this function was bound by branch misprediction. Unfortunately, after looking into the code for some time I was forced to conclude that there was nothing that could be done to mitigate the misprediction rate. I have not yet seen other parts of the path finder become hotspots but I'll try building a new test case on an 8-player map tomorrow. Honestly I was sticking to 2 player so far to allow myself time to actually build up a decent test case - I never said I was good at this game! ;)

 

In other news, I was able to run a tighter analysis (with a sample rate of 0.1ms) on the slow startup portion. In that part, SkinPoint is definitely the main hotspot, followed by BlendBoneMatrices. SkinPoint is 49.9% back end bound, almost entirely core bound. None of that was attributed to the divider, though, so it's just plain port overuse. I'm currently investigating whether it might be possible to vectorize the offending loop, as that can dramatically cut down on the port traffic, but ultimately it comes down to whether the nature of the loop is vector-friendly.

  • Like 4
Link to comment
Share on other sites

  On 21/02/2019 at 11:57 PM, Alex from Intel said:

no idea how many the other faction has running around (I'll have to pause the game in combat and see if I can count just how many Iberian units come rushing out to fend me off when I attack).

Expand  

You can look at the replay of the match, select the Iberians' point of view, then you can know exactly how many of them are there, without counting manually. To see replays, if you are in single player, go to Single Player > Replays.

Edited by coworotel
Link to comment
Share on other sites

  On 21/02/2019 at 11:43 PM, nani said:

From what I know then the pathfinder becomes the most time consuming task for the CPU. Might be worth looking at it with VTune.

Expand  

I took your advice and started a very large 8 player map and built up to a population just shy of 300 units. Unfortunately by then several other factions had killed each other off but there were still a sizable number of units from various factions (I know one of my allies had 200+ units from switching perspectives). I then marched a very large force through some trees and buildings and I could actually see the game struggling along at ~5 FPS under the strain of the computations, even without VTune Amplifier running on it.

0ad_laggy_army.thumb.png.18d00b5703b1be432dce00a21ec7f828.png

(Also I might have a cavalry obsession)

 

But surprisingly... it doesn't look like the pathfinder is the problem (unless the problematic part of the algorithm is showing up through something else that it's calling - could it have something to do with the std::_Tree activity going on here?). Once again, the biggest identifiable hotspot is BlendBoneMatrices. Render is up there again too. This time I see Bind in the top hotspots as well (it's memory bound).

0ad_laggy_army_results.png.90057c0150a1c405001ac972e0b7a548.png

For the record, both the pathfinder functions I see here are listed as being bad-speculation bound, so that would make them rather hard to optimize (a lot of the time branch misprediction is unavoidable).

  • Like 4
Link to comment
Share on other sites

I've worked on an OpenGL4 renderer that uses more modern features like instancing and bindless textures in 2016. My conclusion was also that the renderer is a bottleneck and it's mainly memory bound. I've even tried Intel VTune Amplifier, but switched back to Perf later. By changing how we pass data to our shaders and reducing the number of draw calls, I was able to reduce that overhead by a lot. Unfortunately my work never got beyond the experimental stage.  I've written a small blog in the staff forums. I've copied the two most recent post here, if you're interested (still from 2016).

The code from the experimental OGL 4 branch is still around on github: https://github.com/Yves-G/0ad/tree/OGL4

Post10: January – Profiling, profiling and ... profiling:

  Reveal hidden contents

Post 11: 10th of July - Bindless textures and instancing:

  Reveal hidden contents

 

  • Like 4
  • Thanks 4
Link to comment
Share on other sites

  On 22/02/2019 at 8:56 PM, Alex from Intel said:

But surprisingly... it doesn't look like the pathfinder is the problem (unless the problematic part of the algorithm is showing up through something else that it's calling - could it have something to do with the std::_Tree activity going on here?). Once again, the biggest identifiable hotspot is BlendBoneMatrices. Render is up there again too. This time I see Bind in the top hotspots as well (it's memory bound).

Expand  

It really depends on frame. Try to move the camera for some empty space. Or order all 1000+ units to move.

  On 22/02/2019 at 8:56 PM, Yves said:

I've worked on an OpenGL4 renderer that uses more modern features like instancing and bindless textures in 2016.

Expand  

Unfortunately we have only 56% of players who support the GL4+. Also I suppose in most cases GL4+ means Vulkan too. It also gives a lot of opportunities to optimize.

  On 22/02/2019 at 8:56 PM, Yves said:

it's mainly memory bound.

Expand  

We use a lot of space in our VBOs, we don't even use compressing.

The most important thing I really want to know from Intel: why drivers crash? (We don't see so many of them for other vendors).

  • Like 1
Link to comment
Share on other sites

Thank you so much for these first findings!

  On 20/02/2019 at 11:01 PM, Alex from Intel said:

I'm usually available 9am-4pm PST, meetings and other duties notwithstanding.

Expand  

The majority of the dev team is in CET, so the best window would be your morning work hours, which are roughly our evening.

Regarding the renderer: this is an area of the code that I don't know very well, but Yves did a lot of research on it, as you can see; and Vladislav is the most active dev in this area, I let him take over from here.

Regarding the pathfinder: it is not a surprise that the so-called short pathfinder is a bottleneck. There are plans to revamp or get rid of it, as it is a very old piece of code. I am quite happy that the hierarchical pathfinder appears as a secondary issue: it was already revamped a couple years ago :) So your findings in the pathfinder area are matching our current focus!

We are going to upgrade our version of SpiderMonkey and we hope to see some performance improvements in the parts of SM that appear in your analysis.

Looking forward to reading your next findings :)

  • Like 1
  • Thanks 2
Link to comment
Share on other sites

  On 22/02/2019 at 9:31 PM, vladislavbelov said:

It really depends on frame. Try to move the camera for some empty space. Or order all 1000+ units to move.

Expand  
  On 22/02/2019 at 9:44 PM, Itms said:

it is not a surprise that the so-called short pathfinder is a bottleneck. There are plans to revamp or get rid of it, as it is a very old piece of code.

Expand  

Thanks for the feedback. It may well be that the pathfinder occasionally bottlenecks the application under certain circumstances - but so far we've located a few functions that are consistently eating up the CPU time, so focusing optimizations on them is likely to bring better speedup overall. This is especially true since what I've seen from both of the pathfinding functions, they mostly suffer from branch mispredictions, which aren't really something you can optimize in most circumstances.

 

  On 22/02/2019 at 9:31 PM, vladislavbelov said:
  On 22/02/2019 at 8:56 PM, Yves said:

I've worked on an OpenGL4 renderer that uses more modern features like instancing and bindless textures in 2016.

Expand  

Unfortunately we have only 56% of players who support the GL4+.

Expand  

@Yves Thank you for this gold mine of a post! I'll have to dive into what you've provided. Depending on the amount of work needed to complete your experiment (as I do unfortunately have a deadline for producing training material from this effort), it might be extremely helpful to me, even if it's not necessarily something that can be turned around and merged back into the actual game.

  • Like 3
Link to comment
Share on other sites

  • 2 weeks later...

I have some good news and some bad news. The good news is that I've finished my analysis. The bad news is I can't really do anything about it myself.

After what I found last time I did some work trying to optimize the code, but I found that nothing I did had any measurable effect on performance. My changes caused individual functions to shuffle around a bit (and credit where it's due, the pathfinder did once pop up at the top of the list) in the Intel VTune Amplifier interface but I saw no improvement in the actual game.

  Reveal hidden contents

So I took a step back, thinking maybe I missed something at a more basic level. Now, as I said in my first post, the reason for this project is that my team is, as a whole, relatively inexperienced with using our product on video games. We're more accustomed to scientific applications and the like. So I'd approached this the way I was used to doing, and went straight for the largest known hotspots, ignoring the unknown modules. I also focused on the lengths of the frames instead of the overall frame rate.

I conferred a bit with someone who was a little more used to the inner workings of video games, and re-evaluated my data. A large chunk of your time is spent in those unknown modules, and you have significant gaps between your frames - what's especially interesting is that if I filter my data down to only the gaps between frames, all the pathfinding, rendering, bone matrix manipulation, etc, drops away, and the overwhelmingly vast majority of what's left is "outside any known module". This kind of pattern can indicate that the bottleneck is an outside factor. Being a video game, the most obvious potential source is the GPU - it doesn't matter how efficiently calculations are being done if things end up waiting for the GPU to finish what it's doing.

So I ran a GPU/CPU analysis, but it showed that the GPU wasn't the problem. There were gaps in the activity there, too. So now I looked at the large number of threads involved in your game, most of which are not doing anything most of the time, and thought that might be the problem. So I ran a Threading analysis and saw that there were quite a few transitions between threads happening in those gaps between frames.

threadtransitions.thumb.png.88c1b24d9aa20b04692a9508aac71222.png

When I opened the call stacks for those top two objects, the call chains eventually went down into BaseThreadInitThunk (with the semaphore object going through a js::HelperThread::threadLoop function on the way).

thunk.png.03d238c13e1abeea6e2cbbd3982ffcbc.png

As I understand it, 0 A.D. is built on a mix of C++ and JavaScript code, and Thunking is a form of interfacing between two different types of code - such as two different languages. So presumably this is where your two languages are talking to each other.

So what I believe is going on here, is that these gaps between frames are coming from the game having to wait for the JavaScript. A couple possibilities I can think of are that you might be interfacing the two languages too often, or you might be doing computations in the JavaScript that really belong in the C++.

Unfortunately there's not much I, personally, can do to relieve that bottleneck. 

 

Technically, I now have enough content to fulfill the barest requirements of my project. Before I continue I want to make it clear that I absolutely understand that you're on your own schedule and I don't expect anything from you! I just want to let you know about my own timeline, in case it affects your priorities or decisions.

If you guys do end up fixing your JavaScript issues in the next week and a half, please let me know and I would be more than happy to include the improved results in my initial presentation. If it gets done by mid-April, I would be happy to include a result comparison/improvement showcase in any official documentation I might produce from this, and if early enough, I might also be able to include a second analysis step.

Thank you all for your interest, and good luck!

  • Like 5
Link to comment
Share on other sites

Hey thanks for the analysis ! If you did manage to optimise our functions can you drop git diffs of the code you touched ? Maybe there is still little optimizations we can benefit from :)

As for the JS engine we won't change it anytime soon however we plan to upgrade it to a more recent and hopefully faster version. Maybe @Itms can pull up a branch with Spidermonkey 45 (Here you are using 38) for you to run some profiling on.

I don't think using debug symbols for Spidermonkey would help you find the bottleneck source though I'd bet it's in the interaction of UnitMotion and UnitAI. Maybe @wraitii or @Itms know more about it.

What are your plans after this ? :)

Let us know if there is anything we can do. 

 

 

Link to comment
Share on other sites

  On 04/03/2019 at 7:58 PM, stanislas69 said:

If you did manage to optimise our functions can you drop git diffs of the code you touched ? Maybe there is still little optimizations we can benefit from :)

Expand  

I mostly fiddled around in Matrix3D.h, entirely replacing the contents of some functions with either calls to the Intel® Math Kernel Library or blocks of SSE intrinsics. I did not see any measurable improvement, but I've attached the file. The original code is left intact, commented out. Aside from these completely re-written functions, the only change is the addition of the <mkl.h> header.

  Reveal hidden contents

 

  On 04/03/2019 at 7:58 PM, stanislas69 said:

Maybe Itms can pull up a branch with Spidermonkey 45 (Here you are using 38) for you to run some profiling on.

Expand  

Couldn't hurt.

 

  On 04/03/2019 at 7:58 PM, stanislas69 said:

I don't think using debug symbols for Spidermonkey would help you find the bottleneck source

Expand  
  On 04/03/2019 at 8:20 PM, nani said:

So if Javascript code is the culprit can VTune profile what part of the js code is consuming time or how many interface calls does in average per frame?

Expand  

Unfortunately I don't believe so. VTune Amplifier does have some JavaScript profiling capabilities, but as far as I know they're limited to node.js profiling. Also, I know pretty much nothing about JavaScript. :mellow:

 

  On 04/03/2019 at 7:58 PM, stanislas69 said:

What are your plans after this ? :)

Expand  

I'm afraid while that JavaScript bottleneck is in place there's not much else I can do for 0 A.D. I do have enough from this to at least fuel the internal training session late next week. Depending on how that goes, I might produce a small how-to video/article based on this project. If I do, I'll be sure to post a link here.

Edited by Alex from Intel
  • Like 4
Link to comment
Share on other sites

Hey,

What you've noticed is accurate, and somewhat known (mostly by me I guess): The rendering is draw-call bound and slow, the pathfinder can be slow, and the Javascript main loop is very slow (we use it for most gameplay stuff, which is why it's slow).

Unfortunately, profiling spider monkey accurately is very difficult. There are few tools, and almost none that can be integrated with an existing profiler. What you'd see in OSX's instruments and I assume VTune are memory addresses, which are actually JIT-ed code and/or spider monkey code (mostly the former). With a lot of hard work, using JIT debug output and the trace logger, one might create debug symbol-like structures to document these, but it seems not super useful.

What we did in the past (and probably will in the future) for profiling JS was using our internal profiling, which does have an overhead but can give you an idea of what's going on.

_Overall_, though, we've already optimised most obvious bottlenecks. The remaining bottlenecks are not trivial to fix, and JS is somewhat of a lost cause.

  • Like 3
Link to comment
Share on other sites

  • 2 months later...
  On 06/05/2019 at 11:01 PM, Alex from Intel said:

I have finally made this into an actual piece of published collateral. It is not much, but as promised, here it is.

This is probably the last you'll hear from me, so thanks once more for all the help, and have a good one!

Expand  

Thank you so much for choosing our game to run the experiment :) Once the new version of Spidermonkey is in @Itms said he would run vtune once again :)

  • Like 2
Link to comment
Share on other sites

  On 06/05/2019 at 11:01 PM, Alex from Intel said:

I have finally made this into an actual piece of published collateral. It is not much, but as promised, here it is.

This is probably the last you'll hear from me, so thanks once more for all the help, and have a good one!

Expand  

Thank you very much for your hard endeavor, I hope your audience will find it useful and insightful and as @stanislas69 said earlier it is a good and reliable piece of information for our newer ending quest (or puzzle) for better optimizations.

Edited by asterix
  • Like 2
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

 Share

×
×
  • Create New...