Help needed: Optimizing 0 A.D. with Intel VTune Amplifier

Alex from Intel · February 20, 2019

Hello, 0 A.D. community! I'm a Technical Consulting Engineer at Intel Corporation, and I need your help.

In short, I'm reaching out to you in the hopes of finding 1-3 developers familiar with the 0 A.D. code base to work with me on optimizing your game using Intel's software analysis tools, specifically Intel® VTune™ Amplifier. While I won't be contributing any patches personally, and do not intend to be a long-term developer, you are more than welcome to keep and submit any optimizations we make as part of this project, free of charge. In particular I'm interested in optimizing the functions BlendBoneMatrices, Render, and possibly also SkinPoint.

A little bit of background for you. Intel® VTune™ Amplifier is a sampling-based analysis tool used to measure performance and locate/diagnose bottlenecks in code. Unfortunately, Intel has a bit of a deficiency in documentation for using this tool on graphical applications like video games (a problem in and of itself already), which in turn leads to the additional problem of a lack of familiarity with this use case in the employees tasked with supporting the product.

Currently I am attempting to remedy the second problem by creating an Intel internal training demo for using Intel® VTune™ Amplifier on video games. In theory it should be simple: find a suitable game, run the analyzer, optimize appropriately, show off the results. After some searching around I settled on 0 A.D. as a good candidate for this purpose, and I've been collecting plenty of data on it (it's also a great excuse to play a really fun RTS at work). I've been able to identify some bottlenecks worth investigation and I've diagnosed the causes to varying degrees. But this game is a... very big project, and that's where I got stuck. I'm brand new to this code base and while I know (or at least suspect) what needs to be fixed, I don't know how to do it.

This is where you come in. With your assistance, I hope to understand how to implement these optimizations, since you know the code much better than I do. Again, you're perfectly welcome to keep whatever optimizations we come up with.

While the current goal is to create an internal training demo, ideally I hope to create official public documentation on the Intel Developer Zone website as well, such as a tutorial or video, using 0 A.D. as the example application. Assuming Legal gives me the go-ahead after reviewing the game's licensing conditions, I would be more than happy to give you credit for your assistance (should you request it) in this context as well.

Please let me know if you're interested in assisting. I would be happy to provide more detailed information about the bottlenecks I've identified, as well as access to Intel® VTune™ Amplifier, my analysis result files, and the 0 A.D. save file I've been testing on.

*I'd like to make it clear that I am acting here as an individual employee of Intel Corporation pursuing a work-related task, but I should certainly not be taken as an official representative of the company as a whole. I am "Alex from Intel", not just "Intel"!

**Itms** · February 20, 2019

Hello Alex! Thank you very much for your interest and for picking 0 A.D. for this project!

I feel like the best way for you to get answers would be to be in touch with the programmers through chat (we use IRC) while you are at work. It mainly depends on your timezone: we are rather active outside of our work hours, because of the volunteer nature of the project. I hope we can find a common window.

Additionally, feel free to contact me or @feneur for any legal question, especially regarding our code license.

Looking forward to hearing about your analysis!

Alex from Intel · February 20, 2019

Thank you for your response, @Itms! I'll definitely try the IRC channel. I'm usually available 9am-4pm PST, meetings and other duties notwithstanding.

I'll keep your offer in mind if Legal needs additional information. Honestly I don't expect there to be any issues, though!

Edited February 20, 2019 by Alex from Intel

**Stan`** · February 21, 2019

Hello @Alex from Intel, I'm Stan from 0 A.D. Welcome to the forums! I'm also available on the forums, on IRC, and by PM if you have any questions that go unanswered.

Hope this will improve the game.

Alex from Intel · February 21, 2019

For the sake of keeping track of things, and for easy access by the developers helping me out, I figure I should recap all my findings so far.

I started with the code as of February 1st, to which I added some VTune Amplifier ITT API calls to mark frames as well as place user events marking the durations of actual gameplay and the loading screen on the CPU timeline. This was the only initial change made.

My designated test case is a save file from a medium sized random Lake map featuring the Romans (myself) versus the Iberians (CPU). My standard test procedure is to select my army of 115 units, march them westward along the southern edge of the lake, and engage the Iberian forces.

Exact army composition, for reference:

Veles x20
Triarius x5
Hastatus x32
Eques Socius x8
Eques Romanus x44
Eques Consularis x5
Marcus Claudius Marcellus x1

Perhaps it would have been better to use a bigger army, but this was about as far as I could get before I started running out of resources. The game was also starting to sputter on my machine, so I figured I was giving the CPU enough of a beating to produce usable VTune Amplifier results.

On filtering down the Microarchitecture Analysis result that I'm currently working from to only the actual gameplay portion, I found three general regions of interest: a slow portion immediately after the loading screen ends, normal gameplay, and a bizarre "gap" where no frames were processed at all.

Long version:

Overall, the top five time-consuming functions originating in pyrogenesis.exe itself (as opposed to various dlls) were:

CModelDef::BlendBoneMatrices
ShaderModelRenderer::Render
CModelDef::SkinPointsAndNormals_SSE
CVertexBuffer::Bind
CModelDef::SkinPoint

First, I filtered the results further to investigate the strange blank section in which CPU activity dropped like a rock and no frames were processed (not even a slow frame - just no frame at all). I noticed this section while actually collecting the result, as the game froze for about four seconds in the middle of combat before proceeding as if nothing had happened. However, VTune Amplifier reports that the majority of activity during this section was coming from dlls, and I haven't seen this happen before. So for now I'm operating under the assumption that this was a one-off system fluke rather than a game flaw. For what it's worth, the top three functions in this section originating from pyrogenesis.exe were something in std::_Tree (not part of 0 A.D.'s own code), BlendBoneMatrices, and Render.

I also looked into the slow part at the beginning, where the most notable functions were BlendBoneMatrices and SkinPoint. Unfortunately this is a very small portion of the result, and I didn't have enough data to get accurate microarchitecture event breakdowns for this section. I may see if a tighter analysis with a faster sample rate produces interesting results.

Finally, I filtered data to only represent the normal gameplay (a chunk of time in between the slow start and the dead zone). Here the top functions were the same as for the overall results.

Short version:

I've written the dead section off as a fluke (and I don't think it could be optimized anyway). The slow part at the beginning needs further analysis to see if I can get meaningful data. I've concluded that CModelDef::BlendBoneMatrices and ShaderModelRenderer::Render are the primary functions to investigate.

Analysis of Functions:

BlendBoneMatrices assessment:

BlendBoneMatrices is primarily back-end bound (at 48.1% of pipeline slots), which basically means the CPU is getting and reading instructions faster than it can actually process them, either due to not being able to fetch the appropriate data fast enough (memory bound) or not being able to do the computations fast enough (core bound). This function is pretty evenly split between the two causes, at 21.4% memory bound and 26.7% core bound.

VTune has identified the primary hotspots as lines 223 and 221, which have roughly the same form: they call boneMatrix.Blend and boneMatrix.AddBlend. These basically just boil down to matrix math done with a CMatrix3D class that looks very strange to me. There are two equally plausible possibilities here:

The person who designed this class is a wizard and knew exactly what they were doing, and this looks strange because I don't understand its brilliance, or
This method of laying out your matrix data is as weird as it looks, and overhauling this structure will bring performance improvements.

I'm honestly a bit afraid to touch it because in either case, this is a fundamental and low-level structure and tampering with it may break everything.

I do know that Intel has a hyper-optimized math library that does some matrix magic but I don't know whether MKL could (or philosophically would) be used in your open source game. I will have to look into this.
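For readers unfamiliar with the operations named above, here is a minimal sketch of what a weighted matrix blend of this kind typically computes. The `Mat4` type and the `BlendBones` helper are hypothetical stand-ins written for illustration, not the actual CMatrix3D code. The sketch assumes that Blend overwrites the target with a scaled matrix and that AddBlend accumulates a scaled matrix into it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4x4 matrix stored as 16 contiguous floats.
// This is NOT 0 A.D.'s CMatrix3D; it only mirrors the assumed semantics.
struct Mat4
{
    std::array<float, 16> m{};

    // Assumed meaning of Blend: this = src * weight.
    void Blend(const Mat4& src, float weight)
    {
        for (std::size_t i = 0; i < 16; ++i)
            m[i] = src.m[i] * weight;
    }

    // Assumed meaning of AddBlend: this += src * weight.
    void AddBlend(const Mat4& src, float weight)
    {
        for (std::size_t i = 0; i < 16; ++i)
            m[i] += src.m[i] * weight;
    }
};

// Weighted sum of bone matrices: the first influence overwrites the
// result, the remaining influences accumulate into it.
inline Mat4 BlendBones(const std::vector<Mat4>& bones,
                       const std::vector<std::size_t>& indices,
                       const std::vector<float>& weights)
{
    assert(indices.size() == weights.size() && !indices.empty());
    Mat4 out;
    out.Blend(bones[indices[0]], weights[0]);
    for (std::size_t i = 1; i < indices.size(); ++i)
        out.AddBlend(bones[indices[i]], weights[i]);
    return out;
}
```

With the elements laid out as one flat array, each loop is a simple multiply or multiply-add over 16 floats, which compilers can usually vectorize. Whether the real class is laid out this way is exactly the open question raised above.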

Render assessment:

Render is more complicated. It's overwhelmingly back-end bound (89.8% of pipeline slots) and almost all of this is memory bound (81.4%). Digging deeper into the numbers it's 74.6% DRAM bound, which essentially means that it's slowed down mainly by missing the last level cache quite often. Even more abstractly, it's not accessing data in an efficient way.

It's a fairly long function and it had several hotspots in it.

Hotspot #1, line 446


			const CShaderConditionalDefines& condefs = model->GetMaterial().GetConditionalDefines();
			for (size_t j = 0; j < condefs.GetSize(); ++j) // This one is line 446
			{
				const CShaderConditionalDefines::CondDefine& item = condefs.GetItem(j);
				int type = item.m_CondType;
				switch (type)
				{
					case DCOND_DISTANCE:
					{
						CVector3D modelpos = model->GetTransform().GetTranslation();
						float dist = worldToCam.Transform(modelpos).Z;

						float dmin = item.m_CondArgs[0];
						float dmax = item.m_CondArgs[1];

						if ((dmin < 0 || dist >= dmin) && (dmax < 0 || dist < dmax))
							condFlags |= (1 << j);

						break;
					}
				}
			}

This one is pretty baffling for me. There seems to be a lot of time spent on the actual loop control line (446) but:

VTune Amplifier has no data recorded for the lines actually inside the loop, in either source code view or assembly code view
When I put a breakpoint on the "for" line, and ran the game through a debugger, it triggered constantly, but when I put breakpoints on all the lines inside the loop and played the game for a while, no breakpoint ever triggered.
Only some of the lines inside the loop had assembly code associated with them. Many lines were simply not present in the assembly at all.
No assembly exists for line 445, the part that actually retrieves condefs. I don't know if this means the compiler optimized it out or just directly merged it into 446 (though the latter makes more sense).

So all evidence points to this loop not ever running, which makes the presence of a hotspot on its control line rather baffling. It's possible, I guess, that the loop never runs but the initial conditional check is still running (this code chunk does exist within another loop and is in the Render function so it's probably getting called a lot). It could also be a case of event skid (basically, due to the way VTune Amplifier measures, sometimes data is linked to a line a couple assembly instructions after the one that actually generated it) but I have no idea where it would be skidding from.
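The first hypothesis can be illustrated with a small sketch: even when a loop body never runs, its condition is still evaluated once per enclosing iteration, and that evaluation has to read data belonging to each model. `Model`, `Material`, and `CountLoopWork` are hypothetical stand-ins for illustration only, not the real classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; the real classes are far richer.
struct Material { std::vector<int> condDefines; };
struct Model    { Material material; };

struct LoopStats
{
    std::size_t conditionChecks = 0; // times "j < size" was evaluated
    std::size_t bodyRuns = 0;        // times the loop body executed
};

inline LoopStats CountLoopWork(const std::vector<Model>& models)
{
    LoopStats stats;
    for (const Model& model : models)
    {
        const std::vector<int>& condefs = model.material.condDefines;
        for (std::size_t j = 0; ; ++j)
        {
            ++stats.conditionChecks;    // the check runs at least once per model
            if (!(j < condefs.size()))  // and must read data owned by the model
                break;
            ++stats.bodyRuns;
        }
    }
    // Every model contributes at least one condition check.
    assert(stats.conditionChecks >= models.size());
    return stats;
}
```

For 1000 models with empty lists, this reports 1000 condition checks and zero body runs. If the models are scattered in memory, that one read per model could plausibly be a cache miss, which would fit the DRAM-bound profile, though that remains a guess.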

I haven't dived so deeply into the other hotspots of note in this function.

Hotspot 2 (line 763)


ENSURE(rdata->GetKey() == m->vertexRenderer.get());

Hotspot 3 (line 668)


size_t samplersNum = samplers.size();

Hotspot 4 (line 692, the "if")


if (texBindingNames[s] == samp.Name && bind.Active())
{
  bind = texBindings[s];
}

So that's pretty much what I've got as of right now.

Edited February 21, 2019 by Alex from Intel

nani · February 21, 2019

You mention your test case uses 115 units. Usually games are played with a maximum total of 1200 units approx. (8 players, 150 population each). From what I know then the pathfinder becomes the most time consuming task for the CPU. Might be worth looking at it with VTune.

Alex from Intel · February 21, 2019

Thanks @nani! I'll definitely look into using larger test cases - but just for the record, the stated "115 units" is just the size of the army I'm directly controlling/moving. I have other units going about their business back in my city and no idea how many the other faction has running around (I'll have to pause the game in combat and see if I can count just how many Iberian units come rushing out to fend me off when I attack).

I suppose I should also mention that earlier (using a different VTune Amplifier result collected under different conditions), I did actually find CCmpPathfinder::ComputeShortPath as a hotspot! The cause of the bottlenecking, according to my analysis, was that this function was bound by branch misprediction. Unfortunately, after looking into the code for some time I was forced to conclude that there was nothing that could be done to mitigate the misprediction rate. I have not yet seen other parts of the pathfinder become hotspots but I'll try building a new test case on an 8-player map tomorrow. Honestly I was sticking to 2 player so far to allow myself time to actually build up a decent test case - I never said I was good at this game!

In other news, I was able to run a tighter analysis (with a sampling interval of 0.1 ms) on the slow startup portion. In that part, SkinPoint is definitely the main hotspot, followed by BlendBoneMatrices. SkinPoint is 49.9% back-end bound, almost entirely core bound. None of that was attributed to the divider, though, so it's just plain port overuse. I'm currently investigating whether it might be possible to vectorize the offending loop, as that can dramatically cut down on the port traffic, but ultimately it comes down to whether the nature of the loop is vector-friendly.

**Imarok** · February 22, 2019

That sounds interesting for @vladislavbelov

coworotel · February 22, 2019

13 hours ago, Alex from Intel said:

no idea how many the other faction has running around (I'll have to pause the game in combat and see if I can count just how many Iberian units come rushing out to fend me off when I attack).

You can look at the replay of the match, select the Iberians' point of view, then you can know exactly how many of them are there, without counting manually. To see replays, if you are in single player, go to Single Player > Replays.

Edited February 22, 2019 by coworotel

February 22, 2019

Or press Alt-D, which would offer even more control for devs while in-game.

Alex from Intel · February 22, 2019

20 hours ago, nani said:

From what I know then the pathfinder becomes the most time consuming task for the CPU. Might be worth looking at it with VTune.

I took your advice and started a very large 8 player map and built up to a population just shy of 300 units. Unfortunately by then several other factions had killed each other off but there were still a sizable number of units from various factions (I know one of my allies had 200+ units from switching perspectives). I then marched a very large force through some trees and buildings and I could actually see the game struggling along at ~5 FPS under the strain of the computations, even without VTune Amplifier running on it.

(Also I might have a cavalry obsession)

But surprisingly... it doesn't look like the pathfinder is the problem (unless the problematic part of the algorithm is showing up through something else that it's calling - could it have something to do with the std::_Tree activity going on here?). Once again, the biggest identifiable hotspot is BlendBoneMatrices. Render is up there again too. This time I see Bind in the top hotspots as well (it's memory bound).

[Attached image: 0ad_laggy_army_results.png]

For the record, both the pathfinder functions I see here are listed as being bad-speculation bound, so that would make them rather hard to optimize (a lot of the time branch misprediction is unavoidable).

**Yves** · February 22, 2019

I've worked on an OpenGL4 renderer that uses more modern features like instancing and bindless textures in 2016. My conclusion was also that the renderer is a bottleneck and it's mainly memory bound. I've even tried Intel VTune Amplifier, but switched back to Perf later. By changing how we pass data to our shaders and reducing the number of draw calls, I was able to reduce that overhead by a lot. Unfortunately my work never got beyond the experimental stage. I've written a small blog in the staff forums. I've copied the two most recent posts here, if you're interested (still from 2016).

The code from the experimental OGL 4 branch is still around on github: https://github.com/Yves-G/0ad/tree/OGL4

Post 10: January – Profiling, profiling and ... profiling

In January, I did some patch reviews and only started working on the rendering branch about two weeks ago. I still had to find some better ways of measuring performance accurately. Figuring out whether a change improved performance is the minimum requirement; some way of measuring what exactly improved would be nice to have.

Testing Intel VTune Amplifier XE

Philip suggested this profiler from Intel and I gave it a try. It costs around $900, but there's a free 30-day trial and free licenses are available for non-commercial open source developers.

The profiler is relatively easy to use and it supports the developer in many ways. Context sensitive help gives information about how to use the profiler and how to interpret data, the sampling based profiler clearly indicates if there are too few samples to be statistically relevant and the software even knows enough about the processor and the collected data to make suggestions what values might indicate potential problems.

My problem was just that I couldn't use any of the fancy features because my development machine has neither an Intel CPU nor an Intel graphics chip. All I could test was the basic sampling profiler, which can't do more than what I already did with perf. In the end I stopped using VTune and went back to perf.

The first step: Using absolute counts from Perf and a fixed frame to turn ratio

Something else Philip suggested was setting up a fixed frame to turn ratio for the profiling and using the absolute cycle counts of perf. When measuring per-frame values using the in-game profiler, it doesn't matter how many frames you render (well, it did because of the buffer initialization thing described in the previous post - this was one reason why these measurements weren't very accurate). When using absolute counts (cycles, instructions etc.), the total number of frames matters very much. If making it faster means rendering more frames, the per-function counts would increase instead of getting lower.

What I did was hard coding the variable "realTimeSinceLastFrame" to 0.2 / 5. In singleplayer mode, a turn takes 200 ms, so this means 5 frames per turn. For my autostart-benchmark of 75 turns, that's 375 frames (actually 376 in practice).
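The arithmetic behind that fixed ratio can be written out as a short sketch. The constant and function names here are hypothetical; only the numbers (200 ms turns, 5 frames per turn, 75 turns) come from the post.

```cpp
#include <cassert>

// Values taken from the post: 200 ms turns and 5 frames per turn.
const double kTurnLengthSeconds = 0.2;
const int kFramesPerTurn = 5;

// Fixed per-frame time step that yields kFramesPerTurn frames per turn.
inline double FixedFrameTime()
{
    return kTurnLengthSeconds / kFramesPerTurn;
}

// Nominal number of frames rendered over a run of the given length.
inline int ExpectedFrames(int turns)
{
    assert(turns >= 0);
    return turns * kFramesPerTurn;
}
```

This gives a 0.04 second step and a nominal 375 frames for 75 turns; the one extra frame observed in practice is not modelled here.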

This way I could get relatively stable results just by measuring the total runtime or by counting the total number of cycles using "perf stat":


perf stat -e cycles ./pyrogenesis -autostart=scenarios/DeepForestProfile -autostart-players=4 -autostart-finalturn=75 -nosound

Example output:


    58'813'616'438 cycles                   
    23.295729288 seconds time elapsed

The average values of several runs show that the SVN version is about 0.25 seconds or 1.1% faster than my development branch with OpenGL 4 disabled.
Repeating the measurement later gave similar results: 0.21 seconds or 0.9%.

I wanted to further improve the accuracy by configuring perf to start collecting events after loading the map instead of from the beginning. It takes about half of the time just to load the map, so that would have helped a lot. Unfortunately perf doesn't support that yet. It could work using Toggle Events, but apparently these are only implemented in a custom branch. Otherwise the only way seems to be using the Perf API from C++. Both approaches seemed to be a bit too cumbersome for the moment.

Second step: Any other useful events?

After I got it working this way, I suddenly realized that Perf had many more events available to collect and I could, for example, measure changes in the number of cache misses. So far my experience was that these events are hard to use with "perf record". One of the difficulties is that you have to pick a reasonable frequency for the samples to avoid influencing the performance too much and to get enough samples for statistical significance. However, when using perf stat you don't have to worry about that. It just collects the absolute values and doesn't have to collect samples as with perf record.

As I mentioned earlier, Perf lacks some good documentation, so the available events weren't very useful. For example, there's an event called "cache-misses", but it doesn't say if these are for the L1, L2, L3 or the TLB cache, or maybe a combination of these. It also doesn't say if it's just the data cache or also the instruction cache.

After some tinkering and reading, I figured out how to access the CPU events directly. The AMD BIOS and Kernel Developer's Guide describes "Event select" and "Unit Mask" values used to get specific events. There are different documents for different CPU families and the Unit masks and Event Select values might be different depending on the exact processor model and stepping. Using Wikipedia and some data from my hardware, I figured out that I need to look into the guide for the AMD 10h family of processors and my CPU is a Phenom II X4 with a Deneb model and C2 stepping.

For example, on page 468, it describes EventSelect 4E1h for "L3 Cache Misses". Below, there's a list of some Unit Masks to further narrow down the collected events.


01h Read Block Exclusive (Data cache read)
02h Read Block Shared (Instruction cache read)
04h Read Block Modify
10h Core 0 Select
20h Core 1 Select
40h Core 2 Select
80h Core 3 Select

To select events from all cores and choose "Read Block Exclusive", you would combine the flags like this: 0x01 | 0x10 | 0x20 | 0x40 | 0x80 = 0xF1
To pass them to Perf, you use the "-e" parameter, prefix the values with "r" and append the Unit Mask followed by the Event Select value: rF14E1

The advantage is that you have the description of the events from the supplier (AMD) directly instead of having *no description at all* from Perf.
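The flag combination and the string layout described above can be sketched as follows. `RawEventName` is a hypothetical helper that only performs the formatting as the post describes it; whether perf interprets the resulting string as intended on a given kernel and CPU is not something this sketch verifies.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Unit mask bits as listed in the post for EventSelect 4E1h.
constexpr std::uint32_t kReadBlockExclusive = 0x01;
constexpr std::uint32_t kCore0 = 0x10;
constexpr std::uint32_t kCore1 = 0x20;
constexpr std::uint32_t kCore2 = 0x40;
constexpr std::uint32_t kCore3 = 0x80;

// Builds "r<UnitMask><EventSelect>" in upper-case hex, following the
// layout described in the post. Hypothetical helper, formatting only.
inline std::string RawEventName(std::uint32_t unitMask, std::uint32_t eventSelect)
{
    assert(unitMask <= 0xFF);
    char buffer[32];
    std::snprintf(buffer, sizeof(buffer), "r%X%X",
                  static_cast<unsigned>(unitMask),
                  static_cast<unsigned>(eventSelect));
    return std::string(buffer);
}
```

Combining the Read Block Exclusive flag with all four core-select flags gives a unit mask of 0xF1, and formatting it with event select 0x4E1 yields the rF14E1 string quoted above.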

Third step: Profit!

Ok, it's time to use the new knowledge.

Compare some potentially relevant events using perf stat (L1, L2 and L3 cache misses):


perf stat -e r41,r27E,rF14E1 ./pyrogenesis -autostart=scenarios/DeepForestProfile -autostart-players=4 -autostart-finalturn=75 -nosound

I made several runs and calculated the difference between the average values from SVN to my branch without the GL4 code enabled.


~ -1.1% L1 misses in branch
~ +1.5% L2 misses in branch
~ +8.8% L3 misses in branch
~ +1.5% time for branch

I don't know why there are fewer L1 cache misses, but it's a small value, so it might not even be significant. There's quite a large difference in L3 misses, though. The difference was there in each measurement, not just in the average value, so it seemed possible to use perf record to figure out some more about where it comes from.

I tried using perf report to figure out where the L3 cache misses happen.


perf record -e rF14E1 -c 1000 ./pyrogenesis -autostart=scenarios/DeepForestProfile -autostart-players=4 -autostart-finalturn=75 -nosound

Just comparing the top functions already revealed quite a lot:

branch:


   15.14%   3528  pyrogenesis  fglrx_dri.so        [.] 0x00000000015f0a89
   12.60%   2936  pyrogenesis  pyrogenesis         [.] void std::__introsort_loop<__gnu_cxx::__normal_iterator<CModel**, std::vector<CModel* ...
   11.15%   2599  pyrogenesis  [kernel.kallsyms]   [k] 0xffffffff811c9384
    3.73%    870  pyrogenesis  pyrogenesis         [.] CUnitAnimation::Update(float)
    2.56%    597  pyrogenesis  libc-2.19.so        [.] malloc_consolidate

SVN:


   16.82%   3876  pyrogenesis  fglrx_dri.so        [.] 0x0000000000f1c62c 
   12.29%   2833  pyrogenesis  pyrogenesis         [.] void std::__introsort_loop<__gnu_cxx::__normal_iterator<CModel** ... 
   10.37%   2389  pyrogenesis  [kernel.kallsyms]   [k] 0xffffffff81174124 
    4.00%    921  pyrogenesis  pyrogenesis         [.] CUnitAnimation::Update(float) 
    3.08%    710  pyrogenesis  pyrogenesis         [.] ShaderModelRenderer::Render(std::shared_ptr<RenderModifier> ... 
    2.55%    588  pyrogenesis  libc-2.19.so        [.] malloc_consolidate

That kernel entry ("kernel.kallsyms") is probably overhead from Perf collecting the events. The overhead grows when it has to collect more events, so it gets bigger as the number of cache misses in other functions increases. We already know that the graphics driver is a performance problem (fglrx_dri.so), but I don't quite understand why there are so many more cache misses in SVN for it.

The difference in the functions for sorting models during rendering is the most interesting part and probably the source of the problem. I have moved the m_Material member from CModel to CModelAbstract, so it's now further away from another member (m_pModelDef) that is also accessed in that sort function. This is already fixed in my cache efficiency patch, so we should get back the performance when I commit that to the branch. I'll make another measurement to confirm that assumption when I'm ready to commit. In addition, it might be worth trying some other cache efficiency optimizations for the data used in that model sort function.
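To make the layout argument concrete, here is a minimal sketch of why the distance between two members matters. Both structs are hypothetical: the field names echo the post, but the sizes and the amount of data in between are invented, and a 64-byte cache line is an assumption that holds on common x86 CPUs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layouts; field names echo the post, sizes are invented.
struct ModelSpreadOut
{
    const void* m_pModelDef;  // read by the sort
    char otherState[120];     // unrelated data in between (made-up size)
    const void* m_Material;   // also read by the sort
};

struct ModelHotFieldsTogether
{
    const void* m_pModelDef;  // both hot fields are adjacent
    const void* m_Material;
    char otherState[120];
};

constexpr std::size_t kCacheLineBytes = 64; // assumed cache line size

// True if two byte offsets fall in the same cache line, assuming the
// object itself starts on a cache-line boundary.
constexpr bool SameCacheLine(std::size_t offsetA, std::size_t offsetB)
{
    return offsetA / kCacheLineBytes == offsetB / kCacheLineBytes;
}
```

In the spread-out layout, a comparison that reads both members touches two different cache lines per model; with the fields adjacent, it touches one. For a sort that compares many models, that difference can show up as extra last-level cache misses.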

Post 11: 10th of July - Bindless textures and instancing:

I haven't posted here for a while, but I've continued working on the GL4 branch and, in particular, on instancing and bindless textures. I've focused more on the OpenGL and driver related code again because this is still the main bottleneck of the renderer. A lot of time is still spent in the driver and even some of the slower parts of our code could probably be improved using the same techniques that can be used to reduce driver overhead. Besides, fine-tuning the cache efficiency is probably a waste of time at this point because the code that would benefit the most from it is going to change anyway.

Instancing

The idea behind instancing is to give a batch of work (multiple 3D models to draw) to the driver together with all the required data rather than setting up each model one by one. This should be more efficient because the driver has to do less validation work per model and it's usually also more efficient to transfer a large amount of data at once rather than transferring the same data in small pieces.

Imagine a game of chess with black and white pieces (king, queen, pawns etc.) that have to be rendered by our engine. Our current approach was like this:

Set the model: knight
Set the color: white
Set the position: 1/0
Draw 1st white knight
Set the position: 6/0
Draw 2nd white knight
Set color: black
Set position: 1/7
Draw 1st black knight
etc...

This requires a separate draw call for each piece, so 32 draw calls. Changing colors, models or the position requires driver validation each time. We've optimized that by sorting the pieces (models) as efficiently as possible. For example, we could first draw all white pieces or sort by model and draw all pawns without having to switch the model.

The GL4 branch uses a different approach. All data gets stored in buffers and is accessible through a draw ID. So far it still has to use the old sorting approach for models because these can't be switched from within the driver only (at least that's not implemented in the branch yet). Other than that, it can prepare all the data in buffers and then draw all the same models at once with a single draw call. So in the chess example, it would be like this:

Preparation:

Sort by models
Prepare buffers containing color and position for all models
Upload the buffers and make them available to the GPU

Drawing:

Select model: knight
Draw 4 knights using the prepared data
Select model: pawn
Draw 16 pawns using the prepared data
etc...
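The two approaches in the chess example can be compared with a small sketch that only counts draw calls and does no actual rendering. `Piece`, `InstanceData`, and the helper functions are hypothetical names for illustration; a real renderer would upload the per-instance data to GPU buffers instead of keeping it in a map.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical piece description for the chess example.
struct Piece
{
    std::string model; // "knight", "pawn", ...
    bool white;
    int x, y;
};

// Per-instance data that would be placed in a buffer.
struct InstanceData { bool white; int x, y; };

// Old approach: one draw call per piece.
inline std::size_t DrawCallsPerPiece(const std::vector<Piece>& pieces)
{
    return pieces.size();
}

// Instanced approach: group by model, one draw call per group.
inline std::map<std::string, std::vector<InstanceData>>
BuildBatches(const std::vector<Piece>& pieces)
{
    std::map<std::string, std::vector<InstanceData>> batches;
    for (const Piece& p : pieces)
        batches[p.model].push_back(InstanceData{p.white, p.x, p.y});
    return batches;
}

// Builds the 32 starting pieces of a chess game.
inline std::vector<Piece> MakeChessSet()
{
    const char* backRank[8] = {"rook", "knight", "bishop", "queen",
                               "king", "bishop", "knight", "rook"};
    std::vector<Piece> pieces;
    for (int side = 0; side < 2; ++side)
    {
        const bool white = (side == 0);
        for (int x = 0; x < 8; ++x)
        {
            pieces.push_back(Piece{backRank[x], white, x, white ? 0 : 7});
            pieces.push_back(Piece{"pawn", white, x, white ? 1 : 6});
        }
    }
    assert(pieces.size() == 32);
    return pieces;
}
```

For the full set, the per-piece approach needs 32 draw calls, while grouping by model needs 6 (one per piece type), with the knight batch holding 4 instances and the pawn batch 16.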

Bindless textures

The chess example above is a bit simplified. Instead of using a simple color, we use textures for most of our models. These textures have to be bound (selected/activated) before each draw call, which makes instancing impossible for models with different textures. Bindless textures are a solution for this problem. Essentially you generate a handle for your texture and make the texture resident (only resident textures can be used in shaders). The handle is a uint64 value that can be passed to shaders using uniforms or SSBOs. The shaders can use these handles just like other texture samplers that have been bound the old-fashioned way.

Current state of the GL4 renderer

Currently, bindless textures and instancing are implemented for the model renderer. Multiple draw calls are still required for model or shader changes.

Here's a performance comparison between different versions of the branch and SVN on the map Deep Forest. It shows how long, in milliseconds, the respective sections in the renderer took on average (shorter is better). The "render submissions" section basically encompasses all the rendering and "shadow map" is only for the shadow map rendering within the model renderer.

**vladislavbelov** · February 22, 2019

10 minutes ago, Alex from Intel said:

But surprisingly... it doesn't look like the pathfinder is the problem (unless the problematic part of the algorithm is showing up through something else that it's calling - could it have something to do with the std::_Tree activity going on here?). Once again, the biggest identifiable hotspot is BlendBoneMatrices. Render is up there again too. This time I see Bind in the top hotspots as well (it's memory bound).

It really depends on the frame. Try moving the camera to some empty space, or order all 1000+ units to move.

12 minutes ago, Yves said:

I've worked on an OpenGL4 renderer that uses more modern features like instancing and bindless textures in 2016.

Unfortunately, only 56% of our players support GL4+. Also I suppose in most cases GL4+ means Vulkan too. It also gives a lot of opportunities to optimize.

33 minutes ago, Yves said:

it's mainly memory bound.

We use a lot of space in our VBOs; we don't even use compression.

The most important thing I really want to know from Intel: why do the drivers crash? (We don't see as many crashes with other vendors.)

**Itms** · February 22, 2019

Thank you so much for these first findings!

On 2/21/2019 at 12:01 AM, Alex from Intel said:

I'm usually available 9am-4pm PST, meetings and other duties notwithstanding.

The majority of the dev team is in CET, so the best window would be your morning work hours, which are roughly our evening.

Regarding the renderer: this is an area of the code that I don't know very well, but Yves did a lot of research on it, as you can see; and Vladislav is the most active dev in this area, so I'll let him take over from here.

Regarding the pathfinder: it is not a surprise that the so-called short pathfinder is a bottleneck. There are plans to revamp or get rid of it, as it is a very old piece of code. I am quite happy that the hierarchical pathfinder appears as a secondary issue: it was already revamped a couple years ago. So your findings in the pathfinder area are matching our current focus!

We are going to upgrade our version of SpiderMonkey and we hope to see some performance improvements in the parts of SM that appear in your analysis.

Looking forward to reading your next findings.

Alex from Intel · February 22, 2019

25 minutes ago, vladislavbelov said:

It really depends on the frame. Try moving the camera to some empty space, or order all 1000+ units to move.

13 minutes ago, Itms said:

it is not a surprise that the so-called short pathfinder is a bottleneck. There are plans to revamp or get rid of it, as it is a very old piece of code.

Thanks for the feedback. It may well be that the pathfinder occasionally bottlenecks the application under certain circumstances, but so far we've located a few functions that are consistently eating up the CPU time, so focusing optimizations on them is likely to bring a better speedup overall. This is especially true since, from what I've seen, both of the pathfinding functions mostly suffer from branch mispredictions, which aren't really something you can optimize in most circumstances.

29 minutes ago, vladislavbelov said:

1 hour ago, Yves said:

I've worked on an OpenGL4 renderer that uses more modern features like instancing and bindless textures in 2016.

Unfortunately, only 56% of our players support GL4+.

@Yves Thank you for this gold mine of a post! I'll have to dive into what you've provided. Depending on the amount of work needed to complete your experiment (as I do unfortunately have a deadline for producing training material from this effort), it might be extremely helpful to me, even if it's not necessarily something that can be turned around and merged back into the actual game.

Alex from Intel · March 4, 2019

I have some good news and some bad news. The good news is that I've finished my analysis. The bad news is I can't really do anything about it myself.

After what I found last time I did some work trying to optimize the code, but I found that nothing I did had any measurable effect on performance. My changes caused individual functions to shuffle around a bit (and credit where it's due, the pathfinder did once pop up at the top of the list) in the Intel VTune Amplifier interface but I saw no improvement in the actual game.

As a side note, I did make a mistake while optimizing that temporarily resulted in an amusing bug.

[Image: ibrokeit.gif]

So I took a step back, thinking maybe I missed something at a more basic level. Now, as I said in my first post, the reason for this project is that my team is, as a whole, relatively inexperienced with using our product on video games. We're more accustomed to scientific applications and the like. So I'd approached this the way I was used to doing, and went straight for the largest known hotspots, ignoring the unknown modules. I also focused on the lengths of the frames instead of the overall frame rate.

I conferred a bit with someone who was a little more used to the inner workings of video games, and re-evaluated my data. A large chunk of your time is spent in those unknown modules, and you have significant gaps between your frames - what's especially interesting is that if I filter my data down to only the gaps between frames, all the pathfinding, rendering, bone matrix manipulation, etc, drops away, and the overwhelmingly vast majority of what's left is "outside any known module". This kind of pattern can indicate that the bottleneck is an outside factor. Being a video game, the most obvious potential source is the GPU - it doesn't matter how efficiently calculations are being done if things end up waiting for the GPU to finish what it's doing.
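To make the "gaps between frames" idea concrete, here is a minimal sketch of how in-frame time could be separated from between-frame time in hand-written instrumentation. This is not how VTune Amplifier does its filtering, and `FrameGapTracker` with its method names is a hypothetical helper, not part of 0 A.D. Timestamps are passed in explicitly so the arithmetic is easy to check.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from 0 A.D. or VTune): accumulates the time spent
// inside frames and the time spent in the gaps between one frame's end and
// the next frame's start.
class FrameGapTracker {
public:
    using Clock = std::chrono::steady_clock;
    using Millis = std::chrono::duration<double, std::milli>;

    void BeginFrame(Clock::time_point now) {
        assert(!m_InFrame);
        if (m_HasPrevEnd)
            m_GapTotal += now - m_PrevEnd;  // time since the previous frame ended
        m_Begin = now;
        m_InFrame = true;
    }

    void EndFrame(Clock::time_point now) {
        assert(m_InFrame);
        m_FrameTotal += now - m_Begin;      // time spent inside this frame
        m_PrevEnd = now;
        m_HasPrevEnd = true;
        m_InFrame = false;
    }

    double FrameMs() const { return m_FrameTotal.count(); }
    double GapMs() const { return m_GapTotal.count(); }

    // Fraction of the measured time that fell between frames.
    double GapFraction() const {
        const double total = FrameMs() + GapMs();
        return total > 0.0 ? GapMs() / total : 0.0;
    }

private:
    Clock::time_point m_Begin{};
    Clock::time_point m_PrevEnd{};
    Millis m_FrameTotal = Millis::zero();
    Millis m_GapTotal = Millis::zero();
    bool m_InFrame = false;
    bool m_HasPrevEnd = false;
};
```

In a real loop one would pass `FrameGapTracker::Clock::now()` at the start and end of each frame. A large `GapFraction()` would point at time being lost outside the per-frame work, which is the pattern described above.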

So I ran a GPU/CPU analysis, but it showed that the GPU wasn't the problem. There were gaps in the activity there, too. So now I looked at the large number of threads involved in your game, most of which are not doing anything most of the time, and thought that might be the problem. So I ran a Threading analysis and saw that there were quite a few transitions between threads happening in those gaps between frames.

When I opened the call stacks for those top two objects, the call chains eventually went down into BaseThreadInitThunk (with the semaphore object going through a js::HelperThread::threadLoop function on the way).

[Image: thunk.png]

As I understand it, 0 A.D. is built on a mix of C++ and JavaScript code, and Thunking is a form of interfacing between two different types of code - such as two different languages. So presumably this is where your two languages are talking to each other.

So what I believe is going on here is that these gaps between frames are coming from the game having to wait for the JavaScript. A couple of possibilities I can think of are that you might be interfacing the two languages too often, or you might be doing computations in the JavaScript that really belong in the C++.

Unfortunately there's not much I, personally, can do to relieve that bottleneck.

Technically, I now have enough content to fulfill the barest requirements of my project. Before I continue I want to make it clear that I absolutely understand that you're on your own schedule and I don't expect anything from you! I just want to let you know about my own timeline, in case it affects your priorities or decisions.

If you guys do end up fixing your JavaScript issues in the next week and a half, please let me know and I would be more than happy to include the improved results in my initial presentation. If it gets done by mid-April, I would be happy to include a result comparison/improvement showcase in any official documentation I might produce from this, and if early enough, I might also be able to include a second analysis step.

Thank you all for your interest, and good luck!

**Stan`** · March 4, 2019

Hey, thanks for the analysis! If you did manage to optimise our functions, can you drop git diffs of the code you touched? Maybe there are still little optimizations we can benefit from.

As for the JS engine, we won't change it anytime soon; however, we plan to upgrade it to a more recent and hopefully faster version. Maybe @Itms can pull up a branch with SpiderMonkey 45 (here you are using 38) for you to run some profiling on.

I don't think using debug symbols for SpiderMonkey would help you find the bottleneck source, though I'd bet it's in the interaction of UnitMotion and UnitAI. Maybe @wraitii or @Itms know more about it.

What are your plans after this?

Let us know if there is anything we can do.

nani · March 4, 2019

So if the JavaScript code is the culprit, can VTune profile which part of the JS code is consuming time, or how many interface calls it makes on average per frame?

(-_-) · March 4, 2019

(SM tracelogging does exist btw)

**Stan`** · March 4, 2019

1 hour ago, (-_-) said:

(SM tracelogging does exist btw)

IIRC it's broken: it needed a few patches to work and a lot more to be useful. I remember there was a ticket about it.

Alex from Intel · March 4, 2019

2 hours ago, stanislas69 said:

If you did manage to optimise our functions, can you drop git diffs of the code you touched? Maybe there are still little optimizations we can benefit from.

I mostly fiddled around in Matrix3D.h, entirely replacing the contents of some functions with either calls to the Intel® Math Kernel Library or blocks of SSE intrinsics. I did not see any measurable improvement, but I've attached the file. The original code is left intact, commented out. Aside from these completely re-written functions, the only change is the addition of the <mkl.h> header.

The changed functions are:

   CMatrix3D operator*(const CMatrix3D &matrix) const
   CMatrix3D operator*(float f) const
   CMatrix3D operator+(const CMatrix3D &m) const
   CMatrix3D& operator+=(const CMatrix3D &m)
   void Blend(const CMatrix3D& m, float f)
   void AddBlend(const CMatrix3D& m, float f)

Matrix3D.h
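For readers wondering what swapping one of these functions for SSE intrinsics might look like, here is a minimal sketch of an `AddBlend`-style operation. It is not the attached file. The `Mat4` struct, its flat 16-float layout, the assumed semantics (`dst += src * f`), and the scalar fallback are all assumptions made for illustration; only the `_mm_*` intrinsics from `<xmmintrin.h>` are real.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical stand-in for a 4x4 float matrix; NOT the real CMatrix3D.
struct Mat4 {
    float v[16];
};

// Assumed AddBlend semantics: dst += src * f, element by element.
inline void AddBlendSketch(Mat4& dst, const Mat4& src, float f) {
#if SKETCH_HAS_SSE
    const __m128 factor = _mm_set1_ps(f);  // broadcast f into all four lanes
    for (std::size_t i = 0; i < 16; i += 4) {
        // Unaligned loads and stores: this struct makes no alignment promise.
        const __m128 s = _mm_loadu_ps(src.v + i);
        const __m128 d = _mm_loadu_ps(dst.v + i);
        _mm_storeu_ps(dst.v + i, _mm_add_ps(d, _mm_mul_ps(s, factor)));
    }
#else
    // Portable fallback for targets without SSE.
    for (std::size_t i = 0; i < 16; ++i)
        dst.v[i] += src.v[i] * f;
#endif
}
```

Compilers will often auto-vectorize the scalar loop on their own at higher optimization levels, so a hand-written version like this is worth measuring rather than assuming a gain.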

2 hours ago, stanislas69 said:

Maybe Itms can pull up a branch with SpiderMonkey 45 (here you are using 38) for you to run some profiling on.

Couldn't hurt.

2 hours ago, stanislas69 said:

I don't think using debug symbols for SpiderMonkey would help you find the bottleneck source

2 hours ago, nani said:

So if the JavaScript code is the culprit, can VTune profile which part of the JS code is consuming time, or how many interface calls it makes on average per frame?

Unfortunately, I don't believe so. VTune Amplifier does have some JavaScript profiling capabilities, but as far as I know they're limited to node.js profiling. Also, I know pretty much nothing about JavaScript.

2 hours ago, stanislas69 said:

What are your plans after this?

I'm afraid while that JavaScript bottleneck is in place there's not much else I can do for 0 A.D. I do have enough from this to at least fuel the internal training session late next week. Depending on how that goes, I might produce a small how-to video/article based on this project. If I do, I'll be sure to post a link here.

Edited March 4, 2019 by Alex from Intel

**wraitii** · March 5, 2019

Hey,

What you've noticed is accurate, and somewhat known (mostly by me I guess): The rendering is draw-call bound and slow, the pathfinder can be slow, and the Javascript main loop is very slow (we use it for most gameplay stuff, which is why it's slow).

Unfortunately, profiling SpiderMonkey accurately is very difficult. There are few tools, and almost none that can be integrated with an existing profiler. What you'd see in OSX's Instruments, and I assume in VTune, are memory addresses, which are actually JIT-ed code and/or SpiderMonkey code (mostly the former). With a lot of hard work, using JIT debug output and the trace logger, one might create debug-symbol-like structures to document these, but it doesn't seem super useful.

What we did in the past (and probably will in the future) for profiling JS was using our internal profiling, which does have an overhead but can give you an idea of what's going on.
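As a rough illustration of what in-engine profiling of this kind can look like, here is a minimal scoped-timer sketch. It is not 0 A.D.'s actual profiler: `ProfileRegistry`, `ProfileEntry` and `ScopedProfile` are hypothetical names, and the design (one clock read on entry and one on exit of each section) is only one simple way to do it.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-section statistics: accumulated time and call count.
struct ProfileEntry {
    double totalMs = 0.0;
    unsigned calls = 0;
};

// Hypothetical registry keyed by section name.
class ProfileRegistry {
public:
    void Record(const std::string& name, double ms) {
        assert(ms >= 0.0);  // steady_clock never goes backwards
        ProfileEntry& e = m_Entries[name];
        e.totalMs += ms;
        ++e.calls;
    }

    // Returns an all-zero entry for sections that were never recorded.
    const ProfileEntry& Get(const std::string& name) const {
        static const ProfileEntry empty{};
        const auto it = m_Entries.find(name);
        return it != m_Entries.end() ? it->second : empty;
    }

private:
    std::map<std::string, ProfileEntry> m_Entries;
};

// RAII timer: measures from construction to destruction and reports the
// elapsed time to the registry under the given name.
class ScopedProfile {
public:
    ScopedProfile(ProfileRegistry& reg, std::string name)
        : m_Reg(reg), m_Name(std::move(name)),
          m_Start(std::chrono::steady_clock::now()) {}

    ~ScopedProfile() {
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - m_Start;
        m_Reg.Record(m_Name, elapsed.count());
    }

    ScopedProfile(const ScopedProfile&) = delete;
    ScopedProfile& operator=(const ScopedProfile&) = delete;

private:
    ProfileRegistry& m_Reg;
    std::string m_Name;
    std::chrono::steady_clock::time_point m_Start;
};
```

Wrapping a section in `{ ScopedProfile p(registry, "SomeSection"); ... }` adds two clock reads and a map lookup per call, which is the kind of overhead mentioned above, so it suits coarse sections better than tight inner loops.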

_Overall_, though, we've already optimised most obvious bottlenecks. The remaining bottlenecks are not trivial to fix, and JS is somewhat of a lost cause.

Alex from Intel · May 6, 2019

I have finally made this into an actual piece of published collateral. It is not much, but as promised, here it is.

This is probably the last you'll hear from me, so thanks once more for all the help, and have a good one!

**Stan`** · May 7, 2019

6 hours ago, Alex from Intel said:

I have finally made this into an actual piece of published collateral. It is not much, but as promised, here it is.

This is probably the last you'll hear from me, so thanks once more for all the help, and have a good one!

Thank you so much for choosing our game to run the experiment. Once the new version of SpiderMonkey is in, @Itms said he would run VTune once again.

asterix · May 7, 2019

7 hours ago, Alex from Intel said:

I have finally made this into an actual piece of published collateral. It is not much, but as promised, here it is.

This is probably the last you'll hear from me, so thanks once more for all the help, and have a good one!

Thank you very much for your hard work. I hope your audience will find it useful and insightful, and as @stanislas69 said earlier, it is a good and reliable piece of information for our never-ending quest (or puzzle) for better optimizations.

Edited May 7, 2019 by asterix
