[DISCUSS] Performance Improvements

**k776** · May 21, 2013

Hey Don,

As far as I know, no one has started on an Octree, so anything you can do would be a great help :-)

**alpha123** · May 21, 2013

I'm also curious to know why the existing FOSS implementations aren't what we need. (Licensing?)

As far as I know, nobody has actually evaluated them to see if they have what we need. Most of the work will be integrating it with the engine anyway; implementing the octree itself shouldn't be particularly difficult.

Feel free to hop on #0ad-dev on QuakeNet sometime. It's always great to get a new contributor.

scroogie · May 21, 2013

I'm still not convinced that the game actually needs an Octree instead of a Quadtree. I feel even with flying units, the subdivision in the Z axis would be unnecessary, but others are much more knowledgeable than me.

I just wanted to point out that there is already an abstraction for spatial subdivisions here: http://trac.wildfiregames.com/browser/ps/trunk/source/simulation2/helpers/Spatial.h

But I don't know if its used everywhere. Probably the hardest part will be to look at RangeManager, ObstructionManager, Terrain, Territory etc. to see where it could be applied. Because of the amount of units it might be helpful to take a look at loose quadtrees.

**wraitii** · May 21, 2013

We do use a basic subdivision system (not a quadtree because we don't navigate nodes, we directly access the right square using a vector, I believe) for notably the range manager and the obstruction/path finding managers. It's somewhat efficient, though there are inefficiencies that

Could be fixed.

The problem is that each manager keeps his own and the simulation as a whole doesn't, when it really should. I'd advise looking at the obstruction implementation: it keeps static objects (no need to update them) and moving objects (need to be updated at a cost). The game can use events for updates.

Overall, a new quadtree/octree ( I'm thinking it wouldnt cost muuuuch to make it an octree, not sure it has a point though) would need to be owned by the simulation (perhaps a manager) and shared for all stuffs that could use it, in particular the renderer for frustum calls.

**RedFox** · May 21, 2013

Has someone taken up the octree project? This seems like a nice, self-contained project for a new volunteer, and I'd like to pick it up.
...
I haven't done any game development, which is actually one of the reasons I'd like to work on a more abstract module like this one. I get that it can be used to represent a 3d space, but I'm ignorant of how we would go from the data structure to, say, a rendered scene.

It's an excellent place to start. Though 0AD doesn't require an octree - but a quadtree. Since it's a top-down strategy game not an FPS, only movement in the 2 dimensional plane really matters. What needs to be built into the quadtree implementation is range comparison and collision detection. They are different in the sense that collision detection only takes place inside a single quad while range finding expands outwards into quads that are in range.

kosmo · May 21, 2013

Wouldn't it be better to implement an octree, to be more flexible if 3d ever plays a significant role?

**wraitii** · May 21, 2013

I wouldn't rule it out. If it's not too expensive (I'm speaking CPU-wise here), we could use it to have better frustrum cullings (such as removing bushes that can't be seen but not big trees or something). It would probably need to be slightly clever, though. And it'd indeed be nice to have. But if it leads to a somewhat significant slowdown, then scrap it.

**RedFox** · May 21, 2013

Wouldn't it be better to implement an octree, to be more flexible if 3d ever plays a significant role?

Flexible? You mean when 0AD transforms into Homeworld and then the Z axis (or Y axis in opengl) finally plays a role? Yeah... not likely.

I wouldn't rule it out. If it's not too expensive (I'm speaking CPU-wise here), we could use it to have better frustrum cullings (such as removing bushes that can't be seen but not big trees or something). It would probably need to be slightly clever, though. And it'd indeed be nice to have. But if it leads to a somewhat significant slowdown, then scrap it.

For an octree, each node will take almost twice as much memory and there will be twice the number of nodes. So an octree will actually take ~4x amount of memory. It will also be around 4x slower since all quadtree traversals will include an exponentially larger amount of nodes. At any rate, I think it's kind of pointless to have an octree. 0AD is a strategy game with units on the ground. If that's not an obvious place for a quadtree, I don't know what is. Any entity will still retain it's X, Y, Z position, so range finding will work correctly.

Consider the case where an unit is on a wall and another unit is directly under it, for example 5m below. Even though in the quadtree they are in the same grid, their effective squared-distance will yield 25, which means they are nowhere near each other.

The goal of using a quadtree is to minimize unnecessary collision and range tests, not completely rule them out.

Edited May 21, 2013 by RedFox

scroogie · May 21, 2013

I wouldn't rule it out. If it's not too expensive (I'm speaking CPU-wise here), we could use it to have better frustrum cullings (such as removing bushes that can't be seen but not big trees or something). It would probably need to be slightly clever, though. And it'd indeed be nice to have. But if it leads to a somewhat significant slowdown, then scrap it.

It is actually quite expensive. How is frustum culling done currently? I don't really see how you subdividing the third dimension helps there, but maybe I miss it. To rephrase what RedFox said, subdividing along an Axis in my opinion only makes sense if there is so much variation on that axis, that you actually seperate a lot of objects. In the case of 0ad its most probably much cheaper to test all objects that are above each other than to subdivide it. Especially if you want to use a uniform scheme. I'd imagine that you'd either have too many empty cells or the smaller vertical extent will ruin your horizontal subdivision (of course there are ways around that, but it'll add complexity again).

**wraitii** · May 21, 2013

Should have made myself clearer: I meant a quadtree with a final "top/bottom" branching, for stuffs that are only near the ground. This could be useful for refraction/reflection culling for the water, for example, and in a few odd cases. Not too interesting, I have to admit.

Frustum culling is currently done by checking the bounding box of every renderable model against the frustum, afaik.

tuan kuranes · May 22, 2013

A first step would be to use current spatial for culling, just projecting 3D camera frustum on 2D terrain, and calling getrange on that. (would give much faster than doing 3D culling against all frustum planes, and letting reuse same 2D current code)

Imho, perf wise, current spatial algo problems are:

1: duplicates: better have only one and only one entity in one tile, thus removing the very costly sort/uniq in getRange

2: contiguous memory: better have a single vector for the whole structure, rather than vectors of vectors.

3: getRange allocating on the heap a std::vector each time

All those can be solved in current spatial code, with some simplifications, but those must be addressed some way in a new partitioning scheme.

1: inserting based on entity center or point.

2: algo to rearrange subpart of huge single vector when adding/removing, keeping tile in contiguous ranges (sparse vector)

3: range vector as parameter, and static/member of class that makes the call)

The clear advantage of a tree would nicely solve the rect to "point simplification" that could make huge structure being not taken in account when in range query border (depending on aabox size, it stays on higher quadtree nodes, instead of ending in leaves.)

Octree is certainly overkill, lots of memory and lots branching per node for near nothing interest in a 2.5D game, and that rules out using the same code for range & pathfinder (using CFVector2D) and culling (octree would need to use 3D vectors)

Quadtree would perhaps give some perf improvements.

Note that those algo are easy to implement and test, if you agree on keeping same interface as current spatial code, and once spatial code is also shared with frustum, it's just a matter of subclassing.

I would even add a kd-tree, geohash, hilbert curve to the tests.

And even tests them separately (kd-tree would be very fast for static obstacles (costly add/remove but very), and loose quadtree would very fast for moving units.)

Btw, on another topic performance improvement : following current code, do anyone know what's the requirement that make current code re-compute local AaBox for animated+prop object ?

Couldn't we use object static aabox, not taking animation or/and props ?

It's not as if we need that much precision ?

(It's very cpu intensive to make all the vertex transformation, especially if you're not needing it at all CPU side, when gpu skinning is enabled )

If it's really needed on some case (a mechanical crane ?), could that be made optional for those case ?

Edited May 22, 2013 by tuan kuranes

**wraitii** · May 22, 2013

Tuan kuranes: that's pretty spot on. I'll just add that right now getinrange sorts the unit for no really good reason(in some cases), and a speed up can be obtained from removing that (though that could be linked with returning duplicates, which for rendering might or might not be annoying).

One thing though: your idea about frustum culling could lead to trees or tall objects popping up in front of the camera, which is bad.

tuan kuranes · May 22, 2013

One thing though: your idea about frustum culling could lead to trees or tall objects popping up in front of the camera, which is bad.

I think that if you carefully all project frustum corner points (including near plane corner points) on the 2D "terrain" plane, you do then get the biggest rectangle containing all frustum projected points, thus preventing any possible popping in front.

In fact, it's more on the conservative case, drawing a bit more than needed. Worst case would be make non-visible terrain tiles and small object submitted to render when camera angle near FPS view, but it's a "rare" use case anyway.

scroogie · June 8, 2013

kuranes, you wrote "(got a bit sidetracked with making pathfinding a separate static lib with its own gui tools for test/code in wxwidget)". That sounds awesome. Can you already share something?

tuan kuranes · June 8, 2013

@scroogie: nothing interesting/visible yet, just separated the pathfind code in a lib, behind a facade interface. Now working on making the visualization (just grid and colored square in opengl, not very different than minimap,just with enough flexibility to handle multiples grid level like tile/vertex) this weekend, all that in a wxwidget gui (using a wxformbuilder thing). Then probably next we, I'll have to make all the changes on the pathfind lib interface in order to handle the "classic" proposed tool set for a pathfinder app. (load/save/edit, algo change, step by step/play/pause/stop/start/end, benchs, etc.). Then tests will begins.

**RedFox** · June 8, 2013

I'm also working on a pathfinder test. Hopefully I can contribute something based on it in the future.

**k776** · June 9, 2013

Anyone working on the path finder: Please make sure all path finder work is compatible with the latest work at http://trac.wildfiregames.com/ticket/1756 . Make sure you communicate. We dont want conflicting work.

scroogie · June 29, 2013

I've run a full 1on1 game against Aegis on Acropolis 3 through callgrind (10383 turns) to get an overview of the performance. I know its not accurate, but I think it gives valid pointers. If anyone is interested, I can upload the output data.

There are some interesting things in the output. As expected, the most costly function is CCmpPathfinder::ComputeShortPath(). Out of the 100 functions with the largest self cost, more than 50% are inside libmozjs (e.g. EnumerateNativeProperties, js_AtomizeString, Snapshot, str_split, js::EqualStrings, js_TraceObject, etc.), so I guess that shows the importance of the Spidermonkey upgrades / performance work currently done. Other functions showing high self cost are (in order) GetPercentMapExplored, map::find, isqrt64, SpatialSubdivision::GetInRange(), CCmpTerritorryManager::CalculateTerritories(), CCmpRangeManager::PerformQuery(), Pathfinder_Tile::ProcessNeighbour(), CCmpRangeManager::LosUpdateHelperIncremental and CCmpPathFinder_Vertex::SplitAAEdges().

Edited July 17, 2013 by scroogie

scroogie · July 17, 2013

So, to add to this, from the list in the prior post:

- GetPercentMapExplored has a patch in trac.

- isqrt64 is not debatable. Maybe the game can save on normalisations and length calls (and there are already some patches that do), but I think thats for long-term.

- GetInRange can probably profit from the work on a Quadtree.

- which will also help PerformQuery. That also uses a sorted list of entities, but still traverses all of them every time. Just as a sidenote, PerformQuery actually has 11.04% inclusive time in my measurements, so its really relevant. Nearly 40% of this time is spent on finding entities by id.

- CalculateTerritories could maybe be changed to only recompute parts of the influence map on changes, instead of recomputing the whole map?

- LosUpdateHelperIncremental could maybe use LOS templates, like in the GPG article? This would also offer easy flexibility on vision ranges (instead of just circles).

- As said above the RangeManager in general does a lot of map finds for entity ids, which can be replaced with the entity container that is being worked on in trac.

CCmpPathfinder_Tile::ProcessNeighbour() and CCmpPathfinder_Vertex::SplitAAEdges() are different stories, still curious what tuan comes up with.

**wraitii** · September 1, 2013

Some profiling data of a few simulation functions in an aegis vs aegis game I've just run. Obviously this graph doesn't show ComputeShortPath which gets stupidly slow (27ms at times) when the AI starts to attack and pathfinds a ton of units individually (I'll change that).

What I've found interesting here is that "Move" takes 2.5ms reliably in the mid-game (25th minute) as many units move, and execute active queries is even slower (6/7ms). Those curves are quite flat, which would indicate those values don't really change from turn to turn. Combined, this is about 10 ms per frame devoted only to these two fundamental functions. With optimizations to the queries (which are easy) and optimizations to "Move" (which might be impossible, I actually have no idea what this function is), we could cut it down to 2/3 ms, which is a serious improvement.

(calculateTerritories is here because I wanted to see if it was slow… It's 1 ms at most, not too much but not completely negligible either).

Thanduel · September 7, 2013

This seems the relevant place to post this so: I have started working on a few small optimisations for the fixed point library starting with http://trac.wildfiregames.com/ticket/2109 with some more planned as time permits.

scroogie · September 10, 2013

Found a way to get callgraphs from vtune* and did a new profile with svn trunk, if anyones interested.

* http://software.inte...fier-xe-results

*edit* That didn't work out well. Compressed and attached instead.

output.png.zip

Edited September 10, 2013 by scroogie

**RedFox** · September 10, 2013

Nice callgraph scroogie!

Looks like main bottlenecks identified are:

CComponentManager::BroadcastMessage (4.35% exclusive loops at lines 823/832)

CComponentManager::PostMessage(1.72% exclusive)

CCmpRangeManager::PerformQuery(0.99% exclusive loop at 269)

CCmpRangeManager::ExecuteActiveQueries(10.46%) <-- should probably start with this ^_^

CCmpPathfinder::ComputeShortPath(0.78%) <-- this is definitely a lot more in late game

Looks like rest of it somewhat divides over all the JS functions..

scroogie · September 11, 2013

Yes, I was quite surprised to see the amount of time the JavaScript takes, I mean, the fully inclusive usage. I guess it impressively shows how much is really done in JavaScript. More than I thought, but I never really looked at the AIs. From the module classification in VTune, you see that 44.8% (!!) of the time is spent in libmozjs as compared to pyrogenesis.

I'd also think that ExecuteActiveQueries is the most sensible target right now.

**RedFox** · September 11, 2013

Alright, I finished tuning ExecuteActiveQueries a little bit. The gain isn't anything groundbreaking, but the idle execution time in Combat Demo (Huge) went down from 17-20ms per frame to 2-3ms per frame.

Of course, if the units get close and start fighting, the time spent escalates. Albeit, ComputeShortPath becomes the dominant bottleneck there, taking over 600-700ms...

So this little fix should be a nice addition to A15. You can look at the ticket and patch here: http://trac.wildfire...1707#comment:41

[DISCUSS] Performance Improvements

Recommended Posts

Link to comment

Share on other sites

Top Posters In This Topic

Popular Days

Top Posters In This Topic

Popular Days

Popular Posts

RedFox

agentx

Yves

Posted Images

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation