Profiling sandbox

**wraitii** · October 18, 2016

Opening this thread to post some profiling results for my own sake and if anyone cares. It's not going to be very formatted

Results after the 2000 turns, some at 3000 (i.e. after the 15th minute, we can reasonably assume players are skirmishing and have a solid amount of pop by then)

I was looking at which components and messages take the most time. Using an MP replay of a 2v2 for these.
In terms of component Guard is an obvious bottleneck, taking often upwards of 5ms for no good reason. It needs to be rewritten.
UnitMotion is fairly slow (7ms). UnitAI on its own is somewhat slow. And when they show up, range manager and mirage can be slow.
The slowest is by far Timer.js though.

Looking at that one specifically, it's because of unitAI. Though Damage calls can be slow too.
For unitAI, it's mostly because everything is too slow. The Gather timer is easily 7ms per turn, Combat when it shows up is terrible, Repair is slow too. Everything needs to be sped up.

**wraitii** · October 18, 2016

Looking more specifically at Gather, it seems the slower parts are the following:

-The initial call to "CheckTargetRange and CanGather". Can probably be slightly optimized. About 1ms at turn 1500. Probably slow because called so often and may call CPP
-The case where we need to find a nearby resource or dropsite (haven't looked in detail but most likely the first).
-The call to return to the nearest dropsite when we're full (about 2ms)
-The actual call to cmpResourceGatherer.PerformGather. This is the big one, 5ms on average at turn 1500.

This is annoying because this last call is slow mostly because of:
-The PostMessage "ResourceCarryingChanged"
-The PostMessage in ResourceSupply "ResourceSUpplyChanged"
-The call to Fogging

We should consider directly calling AIProxy for those, perhaps that would make it faster. Another thing we could do is wait for turn end to send the resource supply message, since often a resource is worked on by many entities and we might avoid sending repetitive messages that way (unsure).

Another thing is that in MP in particular the timer may get called several times per frame, and we are not "clever", we call everything twice, which is wasting resources. We might win some time by calling the timer less often. Not sure exactly how often that happens with resources though, but if it happens it's late-game where lag is already bad, compounding the issue.

**wraitii** · October 27, 2016

Ran some more profiling, it seems PerformGather is really slow when the Fogging activates (which isn't really surprising), generally because copying information is slow (presumably because querying all those interfaces is a little slow).

Not entirely sure how we could improve the data copy though.

For the rest, the Gathering timer could be improved by removing a few checks in the beginning (in particular CanGather check seems useless since PerformGather does it again). Sending stuff to stats and the AIProxy takes around 0.2ms per turn, which isn't trivial but doesn't seem worth optimizing.

Looking at other Timer calls, the important UnitAI ones are Combat.Attacking and Repair.Repairing, so those are the targets.

**wraitii** · October 31, 2016

Been looking at the Combat attacking timer using Combat Demo Huge.

The component update itself is around 100ms/turn at peak, so the pathfinder is not the only issue there.

Specifically on the combat timer, finding new targets is by far the slowest, then it's a combination of the rest. Could be optimized by removing redundant checks, mostly, and improving QUeryInterface I guess. A particular source of slowness is TargetKilled, because of the two calls to the statistics (I think the query interface through player ID are particularly slow).

Interestingly, the slow part of finding new targets is the JS AttackTargetByPreference more than the cpp, which is reasonably quick in comparison.

**wraitii** · November 2, 2016

Some more detailed profiling, based on a 2V2 provided by Elexis (which is basically using rev ~ Alpha 21).
This is an MP game so turn rates are 500ms. All analysis done using a custom profiling script that outputs per-frame data and R.

First we can look at the whole thing, smoothed. "Components" is Update Components, the rest are various pathfinder related things. I don't profile long-range at all here.

Then unsmoothed: as we can see there are several very high spikes.

Indeed R tells over over a third of turns are over 50ms, over 80% are above 20ms (which is 50 FPS if we ran a turn every frame - obviously we don't). The SP turns are about 200ms, also 50 ms per turns means already quite a slowdown.

We can do some fancier things, such as find the average breakdown for buckets ( <20ms, 20-50ms…):

As you can see, the short-range pathfinder is hardly the number 1 cause of lag. The number 1 cause of lag is in fact updating components. I'll profile some different things with this replay and upload a new post for more info.

**wraitii** · November 2, 2016

Fairly more comprehensive profiling. The three attached graphs are a very smoothed, less smoothed, and proportional plot of the same game.

As you can see, we have the occasional spike in LongPath, probably when we move many units not in formation. ShortPath also occasionally spikes when units move around, and particularly when they combat (which can be detected because Timer, which is the JS component mostly used by unitAI, spikes too).

But as a general rule, the longest call is the update of Timer.Js, which afaik is mostly the gathering, repairing, and combat timer.

(NB: profiling are exclusive, Timer is part of Components in the above graphs).

**wraitii** · November 6, 2016

Looking at range queries. It's about 10ms to execute active queries in the late game, because we have 500 active queries.

In general, parabolic/global/regular queries are about as much time, though their slowness varies.

For general queries, we might gain some ms by actually checking for max distance before we check for component existence, since that seems to be slightly more efficient.

The good news is that the subdivision Move/add/remove are mostly irrelevant in terms of computation time compared to the other stuff.

We might be able to speed up queries by just checking against units that moved this turn. This needs to be tested and probably needs to be optimized spatially a bit but could be interesting.

**wraitii** · January 18, 2017

Ran some tests looking at things. There was some memory dumbness recently, though I've been working on fixing them. Still need to figure out if we need to call Clean() in UpdateGrid in the beginning, I doubt it.

It seems, from my profiling, that BroadCastMessage is slow on its own because of the cost of calling the functions. Those are virtual, obviously, so they may be expensive if we call them enough times. It'd be interesting to look into it more, this may be a false positive or the real cost is calling JS components, I'm unsure. (EDIT: now that I think of it, it's much more likely that the cost is the cache-miss from accessing the component itself in the map, since we're using a map.)

Another virtual call that might be more costly than we expect: the TestShape filter from our obstruction filters, notably used in the pathfinder's TestLine

Decomposing TestLine, the std::map find is quite slow in general, probably in large part because we have tons of shapes and very few actually interest us so we waste a lot of time getting them back (as expected). It might be better here to store shapes in the subdivisions directly.

**wraitii** · January 21, 2017

Some more random profiling (as usual, take percentages as indicative more than actual):

RangeManger::UpdateVisibility() seems quite slow. It's basically 100% the call to ComputeLOSVisibility, but in an unexpected way: 60% is ComputeLosVisibility(entity_id_t, player_id_t) calling ComponentManager::LookupEntityHandle() because that triggers an std::map find call, which is as usual horrible. The remaining 40% are the ComputeLOSVisibility call itself (which is not inlined whereas the first one is). Maybe we could avoid that handle lookup. For the actual ComputeLOSVisibility call, it seems to be mostly virtual function calls that are costly (and here I do mean the fact that they're virtual).
RangeManager::PerformQuery() i about 40-50% GetInRange calls, most of the cost there is inserting (as expected). I do believe that could be skipped by iterating the original arrays directly. I don't see any special hotspots in the chess themselves, though the vector comparisons do seem slower, but honestly the assembly code is a little hard to read there :P. It seems to me like the slowest part is the check for min range, but I can't guarantee it.
Also a bit of overhead in UpdateVisibilityData from the loop there at the beginning, and probably some overhead on function calls to UpdateVisilbilty and PerformQuery in general. Nothing too awful though.

And I confirm general slowness in BroadCastMessage from using the map and calling the virtual HandleMessage functions. I'm certain the map is slow, I'm less sure about the function calls.

**leper** · January 22, 2017

15 hours ago, wraitii said:

RangeManger::UpdateVisibility() seems quite slow. It's basically 100% the call to ComputeLOSVisibility, but in an unexpected way: 60% is ComputeLosVisibility(entity_id_t, player_id_t) calling ComponentManager::LookupEntityHandle() because that triggers an std::map find call, which is as usual horrible. The remaining 40% are the ComputeLOSVisibility call itself (which is not inlined whereas the first one is). Maybe we could avoid that handle lookup. For the actual ComputeLOSVisibility call, it seems to be mostly virtual function calls that are costly (and here I do mean the fact that they're virtual).

There is a very simple improvement we can make there, use std::unordered_map.

**wraitii** · January 22, 2017

I'd rather replace them with an EntityMap (we need good insertion speed, good lookup, good erase, and good iteration speed) but that means treating local entities separately (and I'd like to get D8 committed first).

Some more profiling on the aspect, using a 3AI replay. Since there's no formations being used, we can check the calls to MT_UPDATE_MOTION_FORMATION and the number of entities called to get a rough estimate of "raw overhead per entity". We can complete with Update_Final to the visual actor, since only moving units do something there.

Motion formation takes about 50µs, for 270 entities, i.e. around 0.2µs per entity.
Final update takes 400-700, with 550µs as a median, for over 2450 entities. Raw overhead would be around 0.2µs again.

What this tells us is that with around 5K entities (quite possible depending on map/mapsize) we can expect a raw broadcast-message overhead, per turn, around 2ms, possibly more. And I'm not counting the overhead from looking up handles and such.

This isn't huge, obviously, but it's there. And well, 60 FPS means 17ms at most per frame. Now we cheat a little, but even if we aim to keep turns below 40/50ms, this is still at least 4/5%

edit: obviously that means little since it's computer-dependant, but I'm on a 2015 2.7Ghz intel i5

**leper** · January 22, 2017

We don't need iteration speed, look at all the uses of that cache, it's only inserting once, and tons of lookups, and iterating when shutting down (or running a simulation test).

**wraitii** · April 21, 2017

On a quick note:

Atlas currently recomputes the whole pathfinder grid (on top of water stuff) when we change the water height, and that's also extremely slow. 50% is the terrain stuff, 25% re-rasterizing entities, 25% recomputing the hierarchical pathfinder entirely.

So we probably should try and speed that up a tad if we start having maps that change water height, and/or we should change atlas water height from a slider to a text-box. The slider compounds the problem.

**leper** · April 21, 2017

(Well if people want to make games lag by doing that during the game they will have to either stop doing that or speed up those computations.)

For the Atlas issue only one possibility would be to only recompute all that sim-state once modifications are done (or at least most of them that don't change what is displayed). So the slider would work nicely, but we'd have to tell the simulation to stop recomputing once the slider starts moving, and stop when it stops moving. The same could (should) be done for terrain modifications, where the start/stop points are a lot easier to define than the moving part.

**wraitii** · May 23, 2018

Following the upgrade to SM45, I've been looking at profiling Spidermonkey (I've also been considering what we'd need to update to HEAD, i.e. 62, and that sounds somewhat doable)

This has been a long-standing issue. We do have ways to do it, Engine.ProfileStart and Engine.ProfileStop, but they:

can't use RAII so you need to call the stop before any return, which is troublesome
Probably incur high performance penalty, which makes fine grained profiling difficult
Can only tell you about what you expect to be slow

Spidermonkey has an API to do very precise logging, the Tracelogger. There is an existing tool to analyse this but it's fairly unreadable. However, a script allows outputting things that can be easily converted into a profiler2 parsable JSON file, and then we could use Profiler2.html to read it.

The problem here is that the Tracelogger is extremely fine-grained (function level), so it outputs ridiculous amounts of data in even a few seconds.

There's internal API support for enabling or disabling that, but we'll have to patch the JS API to do call it ourselves. What I'm envisioning is having an Engine.JSProfileStart / Stop that enables the tracelogger output, call this in fairly high level functions (such as Timer.js::OnUpdate) and then see what this outputs.

**Stan`** · May 23, 2018

Any perf gain with 45 ?

**wraitii** · May 23, 2018

Haven't really been able to notice so, which isn't that surprising, moving from 38 to 45. 45 onwards will have spectre mitigations which will slow it down but maybe other gains here and there.

**Stan`** · May 23, 2018

Yeah, The mozilla guys told us there was performance improvements to be expected if we support 62 eventually

**vladislavbelov** · May 23, 2018

Btw, does the SM45 support pyrogenesis minimum requirements? I mean GCC 4.8.1, Windows XP, etc.

**wraitii** · May 23, 2018

This is an open question for @Itms

**Stan`** · May 23, 2018

https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Instructions/Simple_Firefox_build/Linux_and_MacOS_build_preparation

According to this GCC version started to be 4.9 starting from Firefox 50

According the page for building sm 45 vs2013 still works so if the toolset works the xp toolset works as well

**wraitii** · July 18, 2019

Quick and dirty screenshot from an MP 4v4 at the 16th minute.

This shows:

- short requests computation (red) taking 50-90 ms per turn, with the odd peak above
- long path requests taking about 20ms
- Active queries taking 20-30ms per turn
- Sim update (excluding active queries) taking 30-40ms per turn
- Motion update taking about 30-40ms per turn.

So threading the pathfinder and having unit pushing (which will reduce # of short requests) will help, but it won't be completely smooth sailing.

**wraitii** · August 20, 2019

I'm working on some trigger-bound maps to test performance of some stuffs... Some interesting results.

In particular, creating a new entity is somewhat slow (200µs), so creating dozens of entities per turn can be a noticeable slowdown. The main culprits seem to be Identity and ResourceGatherer, which do caching on Init. In general it appears JS components are a bit slower than C++, which is to be expected. Not sure if anything can be done about that though.

**Stan`** · August 20, 2019

4 minutes ago, wraitii said:

The main culprits seem to be Identity and ResourceGatherer, which do caching on Init.

Identity.Init() calls two Template.js functions, Maybe some caching would help (eg. storing classes for a given unit template ?) Also we do twice the GetVisibleIdentityClasses () code (because it's concatenated in GetIdentityClasses ) dunno how fast that regex is, but could maybe save some cycles.

About concat performance

Maybe

Array.prototype.concat.apply

would be faster.

**Stan`** · November 2, 2019

I ran some tests on Timer.js

{"Component":"GuiInterface","FunctionName":"DeleteTimeNotification","TotalTime":0.0,"Count":1,"Average":0.0,"TurnAverageCount":6.105006105006105E-05}
{"Component":"ResourceTrickle","FunctionName":"Trickle","TotalTime":0.0,"Count":1,"Average":0.0,"TurnAverageCount":6.105006105006105E-05}
{"Component":"Trigger","FunctionName":"DoAction","TotalTime":9.0,"Count":1,"Average":9.0,"TurnAverageCount":6.105006105006105E-05}
{"Component":"Barter","FunctionName":"ProgressTimeout","TotalTime":9.0,"Count":369,"Average":0.02439,"TurnAverageCount":0.022527472527472527}
{"Component":"AttackDetection","FunctionName":"HandleTimeout","TotalTime":56.0,"Count":40059,"Average":0.0014,"TurnAverageCount":2.445604395604396}
{"Component":"Gate","FunctionName":"OperateGate","TotalTime":58.0,"Count":1385,"Average":0.04188,"TurnAverageCount":0.08455433455433456}
{"Component":"StatisticsTracker","FunctionName":"UpdateSequences","TotalTime":121.0,"Count":440,"Average":0.275,"TurnAverageCount":0.026862026862026864}
{"Component":"BattleDetection","FunctionName":"TimerHandler","TotalTime":263.0,"Count":20945,"Average":0.01256,"TurnAverageCount":1.2786935286935288}
{"Component":"GarrisonHolder","FunctionName":"HealTimeout","TotalTime":281.0,"Count":4961,"Average":0.05664,"TurnAverageCount":0.3028693528693529}
{"Component":"Capturable","FunctionName":"TimerTick","TotalTime":455.0,"Count":3937,"Average":0.11557,"TurnAverageCount":0.24035409035409036}
{"Component":"Health","FunctionName":"ExecuteRegeneration","TotalTime":797.0,"Count":74885,"Average":0.01064,"TurnAverageCount":4.571733821733822}
{"Component":"BuildingAI","FunctionName":"FireArrows","TotalTime":3201.0,"Count":32131,"Average":0.09962,"TurnAverageCount":1.9615995115995117}
{"Component":"ResourceGatherer","FunctionName":"PerformGather|Engine.SendMessage","TotalTime":5652.0,"Count":350670,"Average":0.01612,"TurnAverageCount":21.40842490842491}
{"Component":"ProductionQueue","FunctionName":"ProgressTimeout","TotalTime":12013.0,"Count":24454,"Average":0.49125,"TurnAverageCount":1.492918192918193}
{"Component":"DelayedDamage","FunctionName":"MissileHit","TotalTime":12402.0,"Count":38835,"Average":0.31935,"TurnAverageCount":2.370879120879121}
{"Component":"ResourceGatherer","FunctionName":"PerformGather","TotalTime":13220.0,"Count":350670,"Average":0.0377,"TurnAverageCount":21.40842490842491}
{"Component":"UnitAI","FunctionName":"TimerHandler","TotalTime":144595.0,"Count":545161,"Average":0.26523,"TurnAverageCount":33.282112332112334}
There were 1488905 lines.

This is a one hour game with 4 AIs in alpha 24 minus with D2400.

Profiling sandbox

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites