[DISCUSS] Performance Improvements

**RedFox** · May 4, 2013

My few cents:
Memory is the huge culprit for me that no one points, but that's the real performance killer but there should be no allocation/deallocation during the game. Not a single one. preallocation, memory pools, you name it. It doesn't show really in profilers (still new and delete are hotspots...) as it's a general issue. There a very huge gains perf there. (memory fragmenation/allocation is a huge perf killer, causing cpu cycles to be wasted waiting for data. check all Data oriented designs paper/talk/slides)

You are correct on that, if I take a look at a small 1 minute profile run, the amount of string and vector resizing is huge. Tens of millions of buffer resizes tend to add up and fragment the heap. Even if the 'real' memory use seems small, the damage has already been dealt algorithmically. The use of unordered_map<> is also huge in this project - hash tables actually require quite a lot of memory; i'll get back to this statement at the end of this post..

Coming from ogre, making the switch is obviously a good idea, not only for perfs, but also leading to a better decoupling from mechanics and graphics code.

Even though Ogre also has its own quirks, it would indeed be easier to use and doesn't require maintenance.

Performance: made a non-behaviour performance code patch for all the hotspot listed by profilers if anyone want to review it/give a try. Gives huge perfs gains here, but doesn't solve the AI/memory hiccups. What it does is mostly removes unecessary cpu cycles and memory copies. Mostly good perf practices in C++ (added to the wiki, notably http://www.tantalon....opt/asyougo.htm guidelines).

That's an excellent patch! You've gone and fixed a lot of painful code. Great job! ^_^

fixedpt isn't really needed and doesn't help the base code sloc. Not only few external library uses it (not ogre, opengl, detours, etc.) and there's really not much reason nowadays, as it is, cpu software optimisations against x64 cpu fp (specially if not using fp/precise), x64 registers, not speaking of sse/avx/etc specialised branch/predication mechanism. No Map size here seems large enough to justify precision problems with floats here, and even if there's tricks to solve those kind of problem.

I've tried to explain that point, but it seems the general opinion is somewhat fixed...

Going through the code, it seems that someone discovered the proverbial hammer in associate maps and then proceeded to treat every problem as a proverbial nail. Even worse are constructs like CStrIntern::GetString, which creates a huge map of shared_ptr<>'s to CStrInternInternals to 'save space on duplicate strings'. Of course the Key has to be 'std::string', meaning the unordered_map<> has to store a duplicate of the string anyways.

Not only is this implementation flawed, but CStrIntern wastes a more than double amount of space for every string and searching for a matching string begins to take a large amount of time. In the profiler the comparison was triggered over 4 million times during a 1 minute run of the game.

Why do I say it wastes more than double? This is because of the implementation of the C++ hash table (unordered_map<>), which has to keep a copy of the Key value. Hash tables work by estimating a bucket for a given hash. A single bucket can contain several key<->value pairs. In order to make difference, the boost and std implementations need a copy of the original value (this isn't a problem when your key is an integer for example).

Since the unordered_map<> key in CStrIntern is a plain 'std::string', a full copy of the string is kept as a key. This has huge performance implications (!). The memory usage itself is very difficult to calculate on a hash_map, because it's non-linear and depends on the actual hash values calculated. Nevertheless, doing a simple whole program memory comparison will do the trick:


// VC++11, Release
wchar_t buffer[80];
const size_t memStart = ProcessMemory();
std::unordered_map<std::wstring, MyData, MyHasher, MyComparer> tbl;
for(int i = 0; i < 5000; ++i)
{
    int len = swprintf(buffer, L"aaaaBBBBccccDDDDeeee	%8d", i);
    tbl[std::wstring(buffer, len)] = MyData();
}
const size_t memUsed = (ProcessMemory() - memStart);

Given that we have 5,000 unique strings, 32 wchars in length (which isn't that far-fetched for 0AD), the memory usage grows to around 1.75mb and once over 1,024 entries, the hash table slows down by several orders of magnitude, especially in debug mode.

Now the first reaction would be "well, a few megabytes shouldn't really matter". Oh? Well, 5000 wchar_t strings, of length 32 should roughly take 'sizeof(wchar_t)*33*5000' bytes of memory (which is 330kb on Win32). So that's around 6 times less memory. On Linux wchar_t is 4 bytes, but the relation itself should remain similar. Now if your program only dealed with strings and your L2 cache size was 1mb, then the second option (330kb) would fit inside the L2 cache, while the larger one (1.7mb) would not. This would immediatelly make your program 5x slower, due to the cache miss.

I've written image analysis software for robotics before and even a small difference of 100kb in a lookup table size could make the whole algorithm 5-10 times faster, depending on the actual number of memory accesses. Now, the L2 cache isn't really that huge and other programs want a share of the cache too - this is why keeping your data small (330kb?) can ensure that it fits into the cache while contending with other processes.

To sum up: I think CStrIntern is a huge bug in the game logic and should be replaced with a proper translation implementation - which would solve the 'duplicate strings' problem while remaining efficient - by handing out const pointers to the strings in a large block of continuous memory.

tuan kuranes · May 4, 2013

tuan kuranes, can you give some details on how did you measure performance improvement with your patches posted on trac?

As I said above, using Very sleepy (fast profiling, the "detector"), CodeXL (deep but slow profiling, the "check if perf gains") all on win64 but 32bits exe.

Testing done with made lots of maps for realtime feeling (small, huge, with an without agents, with or without bots, with or without resources, with lots or not lots of pathfinding)

Profiler and profiler2 are nice, but hard to detect spikes with them, without some sort 'line graphics", and the fps "frequency smoother" isn't helping getting raw good numbers on that either.

If you have a look, most changes are really common "optimise as you go things" as pointed in the link above.

Other changes are rather simple and common sense: For instance, pathfinding sort is faster if you precompute the distance for each Edge once before sorting, otherwise each sort operation eats the computeLength cycles. (not speaking of rvo not kicking.). Then again, making code out of each checkvisibily to be shared by all, avoiding the call for basic ifleft, ifright check is clearly cpu cycles win.

I tried not to make any change out of that line, working on guaranted less cpu cycles, measuring each changes, (cache, cpu, branc miss).

I've been rather conservative, for instance, and optimising the Edge sort, where it's in fact faster withtout, (it's faster to check more edge than sort them as it cost not only in cpu but above all in memory)

Also, to ease review and eventual integration I think it would be better to provide a separate patch for every different improvement.

I'm sorry for that, I must say I intended to do that, but it's really just hard not just to fix the const and pass param as ref, optimisations as you go, it helps me read the code...

If you remove that part of change, it boils down to pathfinding changes, minimal c++ perf branch fixes there and there. (about minimizing "if" inside "while"), and changes where rvo doesn't kick in order to lower temp object alloc memory cost, particularly for hotspot code.

I'll try to make smaller patches (but cannot promise anything before next saturday...)

@zoot: not sure P2P is really less bandwith, each client must send its user actions to each other clients anyways, so it grows exponentially with the numbers of players, whereas client server, it's as single send per client, and server issue agregated info once per client (with clever update/quantization/compression tricks to minimize bandwidth). So even latency might be better with client-server, minimizing the buffer bloat of p2p (huge number of packet, each having to pass the buffering of each network equipement.)

@redfox: Indeed, the memory problem is in all those way the current code cause memory allocation in an "invisible manner". It's not "visible" as there not much code with new/delete/malloc, but there's a huge lot of hidden ones done on the stack, by the stl, by boost, and that kills the memory by fragmenting it incredibly. Ogre has nice pool allocator code, that can be use outside of ogre, that helps a lot there. That big step will have to be done, and using pooling and code that force acknowledgment of alloc (meanging using Pointers or at least SharedPointers) is the way to go. The pathfinding reallocate for each computeshort call at least 6 std::vector, no way the memory is contigous. The L1/L2 cache is trashed, and cpu is only really waiting data.

About fixed fixed, seems logical to wait until it doesn't mean breaking network code...

Edited May 4, 2013 by tuan kuranes

zoot · May 4, 2013

@zoot: not sure P2P is really less bandwith, each client must send its user actions to each other clients anyways, so it grows exponentially with the numbers of players, whereas client server, it's as single send per client, and server issue agregated info once per client (with clever update/quantization/compression tricks to minimize bandwidth). So even latency might be better with client-server, minimizing the buffer bloat of p2p (huge number of packet, each having to pass the buffering of each network equipement.)

The number of user commands (move unit, train unit etc.) per turn are much lower than the number of state changes (position, health, UnitAI changes of hundreds of units and projectiles etc.) per turn and we're capped at 8 players, so the amount of "exponentiation" is limited.

(The game host may even act as a 'thin' server, distributing player commands so there is no "exponentiation" at all. I don't know the networking model well enough to know whether it actually does that.)

Edited May 4, 2013 by zoot

**wraitii** · May 4, 2013

Looked at your code, Tuan Kuranes. Beyond the ton of const changes (you could have put those in another patch for readability ), it seems mostly sensible. I'd recommend you take a look at the existing patches for the rangeManager, there are some more things there.

The fixed issue is not an issue as long as the network cannot deal with smaaaall precision error: we can't afford the change.

I would rather optimize the current code to avoid too many memory allocations/deallocations than switch to a pool system though. We need to know how much faster that would be than better use of memory.

tuan kuranes · May 4, 2013

@zoot: Well let's say 20 packet per seconds:

P2p: each player sends and receives, 8 player => 8*20*8 => 1280

ClientServ: 8*20+8*20: 320.

Meaning, in terms of latency a non-neglibile gains in all case.

Let's imagine we want to grow to 16 player one day in the future, with

p2p: 5120

ClientServ: 640

better GPU CPU, RAM and bandwidth, we might have, latency is rhater hard to lower. (waiting for quantic teleportation network might be long...)

And those are rather the lower bar, as packet ack, error retransmission, etc. makes it much more in practice.

(the real point in network is latency, bandwidth is already not really a problem anymore)

@wraitii: sorry again for the big patch. I already looked at other patches, but I needed to start without changes at first, as it's easier to understand and read. I need to get a bigger picture before making changes of the current pathfinding code, I still don't get clearly the why and how paths are computed/short.long, etc. I would really start with making it a separate static lib, behind a facade interface, allowing easier refactoring and better code localisation, leading to better optimisation opportunities (if not threading opportunities).

I don't know any other way to optimise memory than memory reuse/pool.

For instance, computeshortpath allocate vector Edge, edgeleft, edgeright, tiledge, etc... for each path computation. How to avoid allocating each path without getting them form pool and return them to pool once finished. (actually, for perf, we use reserve() calls, which actually reserve often more that what we need, causing bigger memory hurts)

Memory pooling does guarantee control of the memory (what we alloc and where if we do it once at start), and zero allocation/dealloc per frames.

zoot · May 4, 2013

@zoot: Well let's say 20 packet per seconds:
P2p: each player sends and receives, 8 player => 8*20*8 => 1280
ClientServ: 8*20+8*20: 320.
Meaning, in terms of latency a non-neglibile gains in all case.

That is wrong. In the P2P case we send 5 packets per second (one every 200 ms) per peer per peer. In the float-proof client-server case we'd send hundreds upon hundreds packets to account for position, health, animation state etc. of every given unit, building, projectile etc. per peer.

And there is no "we may want to grow to 16 players". The player cap is fixed.

**RedFox** · May 4, 2013

That is wrong. In the P2P case we send 5 packets per second (one every 200 ms) per peer per peer.

I think he meant 'for example, if'.

In the float-proof client-server case we'd send hundreds upon hundreds packets to account for position, health, animation state etc. of every given unit, building, projectile etc. per peer.

How is this any different between P2P and Client-Server? In either case there exists a 'synced' version of the current game and only state changes are sent/received. In the client-server model the client sends a change in its state to the server, while 'starting' the action. The server receives the update, sends an acknowledge and also updates all other clients with the new state.

In the P2P model described, everyone broadcasts their state changes to every other client, meaning N times more packets to send out in any case.

Basically, with a running server you can at least control the bandwidth and sync at the same time. In peer-to-peer there's no single instance that can be considered the main sync object (or perhaps player 0?).

If the network data is sent in raw binary, then bandwidth shouldn't be an issue at all.

zoot · May 4, 2013

No, in the P2P case we send player commands. In the client-server case we'd send state changes.

Each player command can translate into numerous state changes, and therein lies the difference in bandwidth.

zoot · May 4, 2013

Apart from whatever bandwidth the client-server model may or may not consume, it would require radical changes to the simulation for it to understand state changes instead of player commands.

**alpha123** · May 4, 2013

Each player command can translate into numerous state changes, and therein lies the difference in bandwidth.

Not only that, but sending even just the locations of 1200 units would take SO MUCH bandwidth. See this excellent article about how Age of Empires used p2p with command passing to handle 1500 units on a 28.8 modem. Those guys know what they're talking about.

**wraitii** · May 4, 2013

Tuan Kuranes: I meant try to avoid as many temporary allocations as we can, and use the existing ones as efficiently as possible. Beyond that, it's probably the only way to go to get a pool.

Now, on the other hand, it seems to me this whole debate is quickly getting into the "unconstructive" area. There's a little too much being thrown in the air with numbers flying all over the place. I would suggest a little recap of the points debated here, because based on what is being discussed, it sounds a little too much like the whole game code needs to be rewrote absolutely-or-it's-never-gonna-work, and that's not going to be very constructive.

zoot · May 4, 2013

Now, on the other hand, it seems to me this whole debate is quickly getting into the "unconstructive" area. There's a little too much being thrown in the air with numbers flying all over the place. I would suggest a little recap of the points debated here, because based on what is being discussed, it sounds a little too much like the whole game code needs to be rewrote absolutely-or-it's-never-gonna-work, and that's not going to be very constructive.

Or, perhaps, to state it in another way: if the P2P model worked for so many other RTSs, it probably will for us too.

**Ykkrosh** · May 4, 2013

Even worse are constructs like CStrIntern::GetString, which creates a huge map of shared_ptr<>'s to CStrInternInternals to 'save space on duplicate strings'.

That's not what CStrIntern is for - it's just for fast string comparisons and copies, since string equality is simply pointer equality. In particular it's for cases where you might expect to use enums, but you want to let data files define new values too and don't want the hassle of maintaining some centralised definition of every possible value, so you use strings and convert them at load-time into some unique identifier (which happens to point to the character data). Memory usage is not a concern at all, since it's only used for a few short strings. (It should certainly never be used for UI text.)

(But I think quite a bit of the code is silly in that it converts a string literal into a CStrIntern many times per frame, which isn't cheap - the CStrInterns ought to be constructed once and cached in the appropriate places, and then there should be no extra runtime cost compared to enums. So that'd be nice to fix. And with that fixed, we should never look at the contents of the strings at runtime, so they're not going to be pulled into cache.)

In the P2P case we send 5 packets per second (one every 200 ms) per peer per peer.

The game's current network architecture is client-server: every player sends their list of commands to the server, and the server sends those commands back to every player, then the server tells each player when it's safe to start the next turn. (The server doesn't know anything about the simulation state - it's just blindly forwarding packets and counting turns). So if nobody is doing anything, each client will just receive one packet from the server per turn, and the server will just send one per player per turn.

tuan kuranes · May 5, 2013

- sorry about network OT discussion, really just laying down clearly the condition to change from fp if float, and pointing derminism is a real complexity problem for sofwtare dev.

- About Memory and performance, best we make another thread then ? (list current memory hotspots. ? Vote what to do about how to remove them ?)

**wraitii** · May 5, 2013

I think the best would be a different discussion thread, at this point. What we would need for a discussion on topic X to go somewhere is:

-Some breakdown of how system X is currently done in the game (which is what is mostly lacking currently).

-some data about how much time it takes per frame/turn, preferably verifiable/reproducible, so we can get data from multiple machines and get the real hotspots and bottlenecks from there.

-some concise description of how we could improve on this, with a con/pro list (that obviously would be discussed. We can certainly update it more or less realtime.)

The problem of estimating the time needed for the actual changes is a function of those two and of knowledge of the code, but having a good breakdown of what exactly is slow could help by knowing why it was implemented that way and help input from users knowing the code (which could help fix issues that shouldn't be there in the first place, like with CStrIntern.)

The memory issue is really quite different from the proposals of switching to Ogre, which is much more problematic in terms of how many changes it would require. Same with changing things in components.

The fixed point issue should probably be left alone for now while we optimize the rest (there are also ways to avoid incessant conversions between floats and fixed, so we could gain speed by doing that). If after all the other optimizations, it's still a huge problem, then we'll see how we can tackle it, but I doubt that will be necessary in the end. Truly there are other areas to optimize.

So generally I'd recommend one topic per specific issue otherwise the discussion seems to get diluted, particularly on big issues like the ones being discussed here.

(the model being the "Design Committee" topics here.)

zoot · May 5, 2013

- sorry about network OT discussion, really just laying down clearly the condition to change from fp if float, and pointing derminism is a real complexity problem for sofwtare dev.

I don't think it was OT, it's valid to ask how things work. We just need to agree that we can't optimize the game for every thinkable situation at the same time.

**MishFTW** · May 5, 2013

Would performance be improved by implementing a different multiplayer system? Or perhaps integrating with an already existing package?

zoot · May 5, 2013

Would performance be improved by implementing a different multiplayer system? Or perhaps integrating with an already existing package?

There is no "multiplayer system". The networking model is something integral to how the engine is designed. If someone wants to create a new engine from scratch we obviously can't stop them, but people would probably be reluctant to commit something that would tear up all existing gameplay and AI functionality. Especially given that the gains overall are questionable.

scroogie · May 6, 2013

Applied the patch for fun:

- doesn't apply cleanly anymore

- C++0x errors (need flag in all makefiles)

- ambiguous overloads in CCmpRangeManager (due to removal of implicit cast)

- misses some coding conventions (trivial things like position of pointer symbol etc.)

- duplicate isqrt (both in header and cpp)

- fast_blur_halide misses the intrinsics includes and is never used (?)

- contains some commented out code?

Don't know if its related, but also after a few minutes the game errored, printing this over and over:

ERROR: JavaScript error: gui/session/session.js line 452

ReferenceError: type is not defined

updateHero()@gui/session/session.js:452

onSimulationUpdate()@gui/session/session.js:413

__eventhandler34 (simulationupdate)([object Object])@sn simulationupdate:0

tuan kuranes · May 6, 2013

@scroogie: thx for trying, made note in patch page about that, most notably about lack of testing under linux, will do linux test/compilation changes next we with patch separation in three, and adding code style check, sry for that.

tuan kuranes · May 11, 2013

@scroogie posted new patch(es), tested under linux, but didn't get C++0X errors, just c++11 warning in non patched code?

Created a specific memory performance thread here: http://www.wildfiregames.com/forum/index.php?showtopic=17310

Edited May 11, 2013 by tuan kuranes

scroogie · May 11, 2013

I don't know why yet, but I'm getting

patch: **** malformed patch at line 1658: Index: source/simulation2/components/CCmpObstructionManager.cpp

tuan kuranes · May 11, 2013

@scroogie: I've reuploaded all patches, after testing them each once on linux (with "svn revert --depth=infinity . && svn patch perf.patch && make -C build/workspaces/gcc"

), just to make sure. But didn't get that error? (wouldn't it be easier and less OT if you could post on patch tracker page ?)

**fabio** · May 14, 2013

I found this, it's for FPS games but it may be interesting here also, since it talks about gaming with low latency:

http://ttimo.typepad.com/blog/2013/05/es_core-an-experimental-framework-for-low-latency-high-fps-multiplayer-games.html

Don · May 21, 2013

Has someone taken up the octree project? This seems like a nice, self-contained project for a new volunteer, and I'd like to pick it up.

I've been working full time as a software developer for about 17 years now, mostly in C, C++, and the last year in Objective-C. I've done quite a bit of shell programming and Python, which I imagine won't be useful with 0ad. Still feel like a JavaScript newb but I do see it on the job. Very comfortable with Subversion.

I haven't done any game development, which is actually one of the reasons I'd like to work on a more abstract module like this one. I get that it can be used to represent a 3d space, but I'm ignorant of how we would go from the data structure to, say, a rendered scene.

I'll go ahead and get the source checked out and building. In the meantime, I'd appreciate it if someone from the team could give me a more detailed idea of the interface you need. I'm also curious to know why the existing FOSS implementations aren't what we need. (Licensing?)

Don

Edited May 21, 2013 by Don

[DISCUSS] Performance Improvements

Recommended Posts

Link to comment

Share on other sites

Top Posters In This Topic

Popular Days

Top Posters In This Topic

Popular Days

Popular Posts

RedFox

agentx

Yves

Posted Images

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation