RedFox

[DISCUSS] Performance Improvements


(edit: that code does seem crazily inefficient, though)

Edit2: ran the game a little in the main menu; Xcode's profiler tells me about 40% of the time was spent calling CGUI: textDraw, if I read things correctly.

I agree that it's extremely inefficient. It won't take much effort to modify the current code. I'll take a look into it and submit a patch sometime if it's okay. Before I do that, I'm going to build a small test with FreeType 2 beautiful anti-aliased fonts... ^_^


Since Philip's pathfinder is still a WIP, I haven't given it much attention. Perhaps Philip could comment on this? If he can finish the pathfinder, or if we can start over?

It's only the long range pathfinder that Philip was working on (and will finish - IIRC he'll work on it again after the Git migration), and the short range one is at least as much of a bottleneck (most of Combat Demo Huge is spent in the short range pathfinder). The long range one has some really really horrible cases (it ends up searching the entire map often with large formations :() which is why it was being worked on first.

If you are reasonably familiar with the code, I highly suggest working on the short range pathfinder - you seem qualified for the job, and it's a huge performance bottleneck right now (20% with floats will look like nothing).

I thought Philip said that JavaScript has no noticeable effect on performance?

If he did say that, I'd say he's wrong. As you mentioned in the OP, UnitAI is entirely in JS. It's not that JavaScript is much of a bottleneck right now (we don't really do anything particularly performance intensive in it), but I'm almost certain there will be a visible speedup from a newer SpiderMonkey (the latest versions are really quite fast, like V8 fast).

FreeType 2 would be excellent - not only would it boost performance a bit, the fonts in-game look a little ugly currently. :(

Edited by alpha123


We do some quite costly computations in Javascript. Most of it is hopefully just temporary because a pathfinder interface is missing.

Another issue is garbage collection.

Oh yes, I forgot the AIs do their own pathfinding. In that case there would be a substantial speedup in single player.

IIRC SpiderMonkey 17 doesn't have a significantly different GC than 1.8.5 (I think it's just a regular mark-and-sweep collector). Eventually it will have a generational/compacting GC, which will definitely help with the AI out-of-memory errors.


Since Philip's pathfinder is still a WIP, I haven't given it much attention. Perhaps Philip could comment on this? If he can finish the pathfinder, or if we can start over?

If someone did want to start over, the first step would be to get a solid understanding of the current implementations, to learn from their technical requirements and their qualities and their mistakes. After that, finishing the current WIP implementation would probably not feel like so much more work than starting from scratch :)

(I'd still like to finish my WIP stuff, and I'm trying to make more time for 0 A.D. stuff recently, but I'm hopeless at committing to any schedules or anything.)

I thought Philip said that JavaScript has no noticeable effect on performance?

Ignoring AI players (which do silly things like pathfinding in JS), I don't remember seeing JS as a particularly significant cost in profiles - usually something on the order of 10%. (I could be misremembering - better profiling data would be helpful). In that case it's still nice to make it faster, but making it infinitely fast would only give ~10% framerate increase, and the C++ bottlenecks will still need fixing either way.
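The estimate above is the usual Amdahl's-law bound. A minimal sketch (the helper names are hypothetical, not part of the engine) of how a component's share of frame time caps what optimising it can gain:

```cpp
#include <cassert>

// Overall speedup when a component that takes `fraction` of the frame time
// is made `factor` times faster (Amdahl's law). Hypothetical helper.
double OverallSpeedup(double fraction, double factor)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}

// Upper bound on the speedup as `factor` goes to infinity,
// i.e. the component becomes "infinitely fast".
double MaxSpeedup(double fraction)
{
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

With a fraction of 0.10 the bound is 1/0.9, roughly 1.11, which matches the ~10% figure quoted above.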

Image drawing doesn't take any time at all, however CTextRenderer::Render does:

perf_mainmenu_04.png

So most of the time is spent rendering text?

Isn't that table showing that 0.56% of the total per-frame time is in CTextRenderer::Render? That sounds plenty fast enough already to me :P

If I run with vsync disabled, I see CTextRenderer::Render going up to 6%, but then the menu is rendering at 1000fps so that's not really a problem.

Anyway, I definitely like profiler usage, and the MSVC one is okay - just need to be careful to apply it to the areas that really do need optimisation and not get distracted by inelegant code that doesn't actually matter in practice :)

I've been looking at the new FreeType 2 library, which is an open-source library used for TrueType and OpenType text rendering. It has a very slim and streamlined API. Perhaps this is something I could start with? :)

Rendering runs of text to textures at runtime (vs the current approach which uses FreeType and Cairo to render all the individual glyphs to a texture then draws a quad per glyph at runtime - see source/tools/fontbuilder2) would probably be nice for a few reasons - mainly nicer English text (proper kerning etc) and much better support for non-English language (no need to pre-render every glyph from every possible language in every font/size/weight, which is a lot of texture memory even for non-CJK languages; proper support for combining characters and ligatures and other substitutions that some scripts depend on; etc). It'd probably be ideal to use Pango for text layout, since that'll deal with the i18n issues. One slight complication is that some of our fonts are drawn as a thick black outline with a white fill on top, and we'd probably want to continue supporting that kind of effect - I'm not sure if we could/should just embed Cairo and use it for font rasterisation like the offline fontbuilder2 does.
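As a rough sketch of the "render a run of text once, then draw it as one quad" idea: the cache below is an assumption about how this could be organised, and the `Rasteriser` callback stands in for whatever would actually do layout and rasterisation (Pango, FreeType, Cairo). `TextureId` and `TextRunCache` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical handle for a GPU texture holding one rendered text run.
using TextureId = std::uint32_t;

// Caches one texture per (font, text) pair so that a run is rasterised once
// and can be drawn as a single quad on later frames.
class TextRunCache
{
public:
    // Stand-in for the real layout + rasterisation step (an assumption).
    using Rasteriser = std::function<TextureId(const std::string& font, const std::string& text)>;

    explicit TextRunCache(Rasteriser rasterise) : m_Rasterise(std::move(rasterise)) {}

    TextureId Get(const std::string& font, const std::string& text)
    {
        const auto key = std::make_pair(font, text);
        const auto it = m_Cache.find(key);
        if (it != m_Cache.end())
            return it->second; // cache hit: nothing is rasterised
        const TextureId id = m_Rasterise(font, text);
        m_Cache.emplace(key, id);
        return id;
    }

    std::size_t Size() const { return m_Cache.size(); }

private:
    Rasteriser m_Rasterise;
    std::map<std::pair<std::string, std::string>, TextureId> m_Cache;
};
```

A real version would also need eviction (text that changes every frame would otherwise fill the cache) and would have to handle the outline-plus-fill effect mentioned above.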


Great post, Yves!

RedFox: I know the GUI example was just for demonstration but like Philip I get about 800-900 fps on the main menu with a 2 year old GPU, so I can't believe it's worth doing anything about the UI for performance reasons (there may be other good reasons, but then they need to be discussed in a different topic/context). Even a significantly slower system should have acceptable performance there. On an ancient c. 2004 single-core laptop with Intel GM915 graphics, I'm still getting 70+ fps in the UI, even considering that GPU is unusable with the game, but I don't think we should fret too much about that right now (especially since there was discussion recently and more or less everyone agreed dropping the fixed function pipeline would be no loss).

My advice for anyone wanting to optimize the game would be to actually play it. Play it in single player with AIs, but know they are very stupid and inefficient currently, so also play it in multiplayer with no AIs. By playing I mean finish a game :) Huge combat demo is not really playing the game, and loading one of the few maps with an excessive number of trees isn't really playing the game either. You'll notice real world issues playing real world maps in a real world way. The last staff match I played was 8 players: 4 humans and 4 AIs. We all commented on how smooth the game was; though the map was chosen to be as lag-free as possible, I think it illustrates the point.

The game runs remarkably smoothly for me in multiplayer games, though I'm not quite on a 2007 laptop, but many of the issues affecting my experience should be the ones affecting your experience and others in an even bigger way. It would be nice to focus on those first, rather than considering rewriting the renderer, or major simulation architecture rewrites. It might take more time before you understand what's going on and what needs to be done, but there are others around (like Philip) who have looked at these problems before, so you can use that knowledge and not be completely in the dark.

No one is working on the short range pathfinder as far as I know, and that's a major concern in battles, of which the huge combat demo is an extreme example. I would be interested if someone tried that map and then claimed the GUI engine, renderer, or even fixed point math is the most serious concern or even a logical place to begin. Formations need redesigning but they bring out the worst sorts of pathfinding issues, especially with large formations moving around static obstacles (the current long range pathfinder scans the entire map to find that a tile is unreachable, and it might do this dozens of times per turn - a problem mostly solved with Philip's WIP patch). I ask what the benefit of a 20% gain really is in code that may take several seconds per turn because it doesn't scale?
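One common way to avoid scanning the whole map just to learn that a goal is unreachable is to label connected regions once and then compare labels. The sketch below is a generic illustration of that technique on a plain tile grid; it is not a description of what Philip's WIP patch does, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Labels every passable tile with the ID of its 4-connected region.
// Impassable tiles get 0. `passable` is row-major with width*height entries.
std::vector<std::uint32_t> LabelRegions(const std::vector<bool>& passable, int width, int height)
{
    assert(static_cast<int>(passable.size()) == width * height);
    std::vector<std::uint32_t> region(passable.size(), 0);
    std::uint32_t nextId = 0;
    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };

    for (int start = 0; start < width * height; ++start)
    {
        if (!passable[start] || region[start] != 0)
            continue;
        // Breadth-first flood fill from an unlabelled passable tile.
        region[start] = ++nextId;
        std::queue<int> open;
        open.push(start);
        while (!open.empty())
        {
            const int tile = open.front();
            open.pop();
            const int x = tile % width, y = tile / width;
            for (int d = 0; d < 4; ++d)
            {
                const int nx = x + dx[d], ny = y + dy[d];
                if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                    continue;
                const int n = ny * width + nx;
                if (passable[n] && region[n] == 0)
                {
                    region[n] = nextId;
                    open.push(n);
                }
            }
        }
    }
    return region;
}

// Two tiles are mutually reachable iff both are passable and share a region.
bool IsReachable(const std::vector<std::uint32_t>& region, int from, int to)
{
    return region[from] != 0 && region[from] == region[to];
}
```

The labelling costs one pass over the map and only needs redoing when passability changes; each reachability query is then a constant-time comparison instead of a full search.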

AIs aren't careful about memory usage so GC becomes a problem (significant intermittent delays), and the AI API doesn't expose all the functionality they need, so they have to do things like terrain analysis in JS, doing MBs of allocations - instead we should have a better C++ interface for AIs and move performance critical logic there. AIs should be multithreaded, pathfinding should be multithreaded - in both cases I think there are blocking issues before we get to that point, like upgrading JS or completing the pathfinder design. We should be more careful about how we schedule GC.

Having a thorough benchmark mode would be great, so we could more reliably measure and report performance data. Having a better way to collect and analyze performance data from our users would be a boon - currently it does something like report data at the start of a match, if they enable user reports, it goes into a massive database that is never actually used :(

There are a lot of tasks for an able C++ programmer that don't involve squeezing out small gains here or there because a profiler might indicate it's an issue, need to keep the big picture in mind as well.


About the AI. We should be able to move pathfinding to C++ (perhaps using the main pathfinder for that). The issue is that if it's threaded, it needs to be thread-safe. I'm thinking AIs wouldn't really need much more beyond knowing if something is accessible, a "real distance from there to there" feature and some basic long-range pathfinder. I have implemented the latter in JS as an oversampled A* (basically I look at every 3rd tile rather than every tile, which makes it acceptably fast on my computer), so hooking into some of the long-range pathfinder features could be enough.

The AI also deals a lot with maps, when it really shouldn't. In particular for dropsite placements, it does fairly expensive stuff. I'm not sure how possible it is to switch this to C++, but there ought to be some way to gain performance there.

Finally the entity "collection" system for AIs is still fairly dumb and could be optimized a lot.

And then there's the rest.

In particular, the GC issue is tougher to deal with. Fixing all AI leaks, if any, wouldn't fix it completely since given the architecture (it gets the simulation state from the JS (maintained by AIProxy), and that's about 40kb each turn) it will OutOfMemory someday.


Isn't that table showing that 0.56% of the total per-frame time is in CTextRenderer::Render? That sounds plenty fast enough already to me :P

If I run with vsync disabled, I see CTextRenderer::Render going up to 6%, but then the menu is rendering at 1000fps so that's not really a problem.

Anyway, I definitely like profiler usage, and the MSVC one is okay - just need to be careful to apply it to the areas that really do need optimisation and not get distracted by inelegant code that doesn't actually matter in practice :)

The second column is Elapsed inclusive time in percentage of total run time.

That 100% would be the total wall clock time for all 5 threads. So in that single thread context 20% would become its share of the wall time.

Running a new profile with Tiber river, and if I compare other gui objects, I can see that CText::Draw and CMiniMap::Draw are taking much more time compared to the rest: around 4% of that 20%. So that would make around 1/5th spent drawing text and the minimap (20% of rendering pipeline). It's safe to say that pre-rendering a glyph run into a texture would save a noticeable amount of frame time.

I'll continue profiling to get to the bigger bottlenecks.


Historic_Bruno has a point with the fact that late-game is where the real lag happens. But there are some interesting things to get from early-game extreme situations, because that's when the non-obvious slowdowns are noticeable.

Some more random profiling info from starting a game on Peloponnese (according to Xcode's profiler).

Time spent in "QueryInterface": 2.5%

Time spent in "InstancingModelRenderer::RenderModel()": 18.7%. 13.7% of that was system command "glUpdateDispatch" (not sure if that means anything, just shows we can get interesting tidbits of data).

In contrast, some info on Median Oasis with 3 very hard Aegis bots at 19 minutes. They had expanded greatly.

CheckStackAndEnterMethodJIT: 16.9%

CGame and CRender took basically the same amount of time, respectively 38 and 28% of the time (at game start it's much more contrasted).

Pathfinder::ProcessShortRequest was 7.3% of the time (total pathfinding was 7.9).

"BroadcastMessage" took 15% (23% total, but the direct calls were 15%), which went (absolute numbers):

-2.3% to VisualActor update (0.9% from calls to CUnit::ReloadObject in CUnit::SetEntitySelection, 1.0% from UpdateVisibility)

-2.9% to UnitMotion (1.1% for "TurnTo", 0.7% for "MoveTo", 0.6% for calls to Pathfinder::CheckMovement)

-8.2% to the rangeManager, with, as expected, 7.8% for ExecuteActiveQueries (about 30% of the time there was SpatialSubdivision::GetInRange). We have patches that speed this up considerably; I'm thinking it would take less than 2/3% with those, which is non-negligible.

Rendering took 38.4%, with 2.5% for the GUI and 4.4% to enumerate objects (which is basically caused by GetWorldBoundsRec and "CModel::ValidatePosition").

Some interesting tidbits for the renderer (numbers are now relative to the time it took to complete RenderSubmissions)

There was basically no water in my shot, yet waterreflection takes 6.8% of the time, mainly because of RenderPatches. TerrainRenderer::renderpatches takes 11% of the time, mostly for renderblends and renderdecals (5.5 and 3.7%). The shadowmap took basically the same time as rendersilhouette, 17%.

Overall InstancingModelRenderer::Rendermodel takes 36.9% of the time spent in RenderSubmission.

Now all this doesn't mean much, obviously, but it does show that rendering takes time. It also shows that the rangeManager is right now one of the biggest slowdowns. BroadcastMessage's overhead was about 3% of the time, which isn't really too much.


The second column is Elapsed inclusive time in percentage of total run time.

That 100% would be the total wall clock time for all 5 threads. So in that single thread context 20% would become its share of the wall time.

Running a new profile with Tiber river, and if I compare other gui objects, I can see that CText::Draw and CMiniMap::Draw are taking much more time compared to the rest: around 4% of that 20%. So that would make around 1/5th spent drawing text and the minimap (20% of rendering pipeline). It's safe to say that pre-rendering a glyph run into a texture would save a noticeable amount of frame time.

I'll continue profiling to get to the bigger bottlenecks.

When I profiled GAE the minimap was taking up a ton of time. I made it update only 1/4 as often and the problem went away entirely; updating it 2x a second was perfectly good for the information a minimap provides.
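A minimal sketch of that kind of throttling, assuming the caller has a monotonic clock in seconds and can reuse the previously drawn minimap texture between updates. `UpdateThrottle` is a hypothetical helper, not existing engine code.

```cpp
#include <cassert>

// Allows an expensive update at most once per `interval` seconds; in between,
// the caller keeps using the previous result. Hypothetical helper.
class UpdateThrottle
{
public:
    explicit UpdateThrottle(double interval) : m_Interval(interval), m_NextUpdate(0.0)
    {
        assert(interval > 0.0);
    }

    // `now` is the current time in seconds. Returns true when the caller
    // should rebuild (e.g. redraw the minimap texture) on this frame.
    bool ShouldUpdate(double now)
    {
        if (now < m_NextUpdate)
            return false;
        m_NextUpdate = now + m_Interval;
        return true;
    }

private:
    double m_Interval;
    double m_NextUpdate;
};
```

With an interval of 0.5 the minimap would be rebuilt twice a second regardless of frame rate, which is the rate described above.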

I only mention this because it seems like this problem exists in many games in the RTS genre and I'm surprised someone hasn't come up with a good fix for it that I've heard of.


CheckStackAndEnterMethodJIT: 16.9%

Where does that come from and what are these 16.9%?

I'm just asking because it could be related to my current work on Spidermonkey.


Basically it's Xcode's profiling tool, and that's a % of the time taken by the game to run over about 30 seconds. I'm not sure if that's clear?

If it's right after starting the game this probably explains these 16.9%.

If it's later in game that would probably be because the JIT-ed code gets thrown away by the garbage collector, but there's a solution for that if the information about js::NotifyAnimationActivity is correct. :)

Still I think 16.9% is quite a lot.


That was later in the game. I'm not sure what calls this function, or for that matter how accurate that number is (I'm fairly confident for the main thread, but this was apparently another thread; the JS stuff is kind of exploded in the profiler). Frankly, I'm really not sure what that particular function is. If I investigate it further, it takes me to assembly code which reads "Jaeger trampoline", which is apparently a thing from MethodJIT.


I guess it's nothing we should worry about now since it will most likely change with the new spidermonkey.

It would be good to check it again when that's ready though.


Right now simulation 'turns' are taken a few times per second. Given that we can calculate how many frames are rendered between turns on average (call it N), we can divide the time into 'frame slots'...

I haven't looked into this in detail so I can't say if that approach is good.

I wondered if it's possible to simply render as many frames as possible while the sim update is running.

So a lot of functions would need to have profile macros? Or do you mean a kind of profiler you can run inside JS scripts?

The important functions already have profile macros. That's how profiler1 and profiler2 work.

So basically, it's rather useless since the amount of data is overwhelming.

That was in a match with 4 AIs and 4 players where I got the 32 GB profile.txt.

I think the text format of profiler1 was a bad choice, it has way too much overhead.

Visual Studio 2012 has a really really good profiler too...

I have never used it before but it looks quite good. Is it part of the express edition? I have found this feature comparison but the Express edition isn't listed there.

What I need at the moment is something to measure the performance difference before and after doing some changes.

For the Spidermonkey upgrade for example I need to know how much faster version 18 is compared to version 17.

Basically if I run an AI VS AI match without changing the camera and generate a graph like in the image I posted above, that should be ok for a first comparison (with the limitations I mentioned and it doesn't actually work because of the bug). Of course it would be much better to have a real benchmark mode that can do something similar and with predefined camera movement each time.

It should be possible to lay one graph over the other and compare them.

I've been looking at the new FreeType 2 library, which is an open-source library used for TrueType and OpenType text rendering. It has a very slim and streamlined API. Perhaps this is something I could start with? :)

I would prefer the benchmark mode and it isn't one of the problems that really hurt, but it seems to make sense. Your choice. :)


What situation did you measure the 20% performance improvement in? If I run e.g. Combat Demo (Huge) and start a fight for a couple of minutes, then the profiler indicates about 35% of the total runtime is in CFixedVector2D::CompareLength, which is called by CCmpPathfinder::ComputeShortestPath calling std::sort(edgesAA...). Almost all the time in CompareLength is doing lots of expensive-on-x86 64-bit multiplies, so that's the kind of thing that might well be made a lot faster by using floats (though I'd guess it could also be made quite a bit faster while sticking with ints, with some x86-specific code to do a 32x32->64 bit mul or with SSE2 or something). But the real problem here is that the short-range pathfinder is terribly unscalable - it needs a different algorithm, which'll mean it won't do a crazy number of std::sort(edgesAA...), and then the performance of CFixedVector2D::CompareLength will hardly matter. Were you measuring that performance issue or something else?
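To make the cost being discussed concrete, here is a sketch of an integer length comparison of the kind described: it compares squared lengths, so it needs no square root but does need 64-bit products. `Vec2i` and this `CompareLength` are hypothetical stand-ins, not the engine's CFixedVector2D code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2D vector with 32-bit fixed-point (integer) components.
struct Vec2i
{
    std::int32_t x, y;
};

// Squared length as an unsigned 64-bit value. Each product is a
// 32x32->64 bit multiply; the sum fits even for the most extreme inputs.
std::uint64_t SquaredLength(const Vec2i& v)
{
    const std::int64_t xx = static_cast<std::int64_t>(v.x) * v.x;
    const std::int64_t yy = static_cast<std::int64_t>(v.y) * v.y;
    return static_cast<std::uint64_t>(xx) + static_cast<std::uint64_t>(yy);
}

// Returns -1, 0 or +1 when |a| is less than, equal to or greater than |b|.
int CompareLength(const Vec2i& a, const Vec2i& b)
{
    const std::uint64_t la = SquaredLength(a);
    const std::uint64_t lb = SquaredLength(b);
    if (la < lb)
        return -1;
    return la > lb ? 1 : 0;
}
```

When such a comparison is used as a sort predicate it runs O(n log n) times per sort, which is why reducing the number of sorts, as suggested above, matters more than speeding up each multiply.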

I was curious about this. Actually, the code of the pathfinder points to the "points of visibility" article in GPG. As I had that lying around I had a look, and the article specifically says that POV is a reasonable choice if you have 3D with few dynamic obstacles. For RTS it suggests a rectangular grid approach. I think it's also meant to have the edges precomputed instead of computed on the fly?

If I look at the code, seeing edges between passability classes, edges between obstruction squares, a sort for every open vertex etc. I'm wondering if this is actually much faster than simply searching the squares with heuristic directly?


My few cents:

  • Memory is the big culprit for me that no one points out, and it's the real performance killer: there should be no allocation/deallocation during the game. Not a single one. Preallocation, memory pools, you name it. It doesn't really show in profilers (though new and delete are still hotspots...) because it's a general issue. There are very large perf gains to be had there. (Memory fragmentation/allocation is a huge perf killer, causing CPU cycles to be wasted waiting for data. Check the data-oriented design papers/talks/slides.)
  • Coming from Ogre, making the switch is obviously a good idea, not only for perf, but also because it leads to better decoupling of mechanics and graphics code.
  • Multithreading: did you consider OpenMP? It's the easiest/fastest way to do multithreading, and is now supported by all compilers. Or go for a library like Intel TBB or jobswarms.
  • Graphics: switching to Ogre will help, but some things have to be done in other ways, mainly all the "upload to GPU" code (minimap draw for instance), which forces OpenGL drivers to sync and causes a great perf loss. (Better to render to a framebuffer texture, do lazy updates, instancing, anything.)
  • Pathfinder: first separate the code into a library, then put it behind a nice Facade, then use the Detour library. http://digestingduck...-in-detour.html Then make it run in threads. If you want to do pathfinding on your own, read the whole blog http://digestingduck.blogspot.com/ for docs/explanations, as it's near state of the art.
  • Performance: I made a non-behaviour-changing performance patch for all the hotspots listed by profilers, if anyone wants to review it/give it a try. It gives huge perf gains here, but doesn't solve the AI/memory hiccups. What it does is mostly remove unnecessary CPU cycles and memory copies. Mostly good perf practices in C++ (added to the wiki, notably the http://www.tantalon....opt/asyougo.htm guidelines).
  • I'm using Very Sleepy (fast) and CodeXL (complete) for profiling on win32.
  • Fixed point isn't really needed and doesn't help the code base SLOC. Few external libraries use it (not Ogre, OpenGL, Detour, etc.), and there's not much reason for it nowadays: it pits CPU software optimisations against x64 hardware floating point (especially if not using fp/precise), x64 registers, not to mention SSE/AVX etc. and specialised branch/predication mechanisms. No map size here seems large enough to cause precision problems with floats, and even if one did, there are tricks to solve that kind of problem.
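For the "no allocation during the game" point in the list above, a minimal sketch of a preallocated pool. It assumes a fixed upper bound on live objects and a default-constructible `T`; `FixedPool` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity pool: all storage is allocated once up front, and
// Acquire/Release only move indices on a free list.
template <typename T>
class FixedPool
{
public:
    explicit FixedPool(std::size_t capacity) : m_Items(capacity)
    {
        m_Free.reserve(capacity);
        for (std::size_t i = capacity; i > 0; --i)
            m_Free.push_back(i - 1);
    }

    // Returns nullptr when the pool is exhausted instead of allocating.
    T* Acquire()
    {
        if (m_Free.empty())
            return nullptr;
        const std::size_t index = m_Free.back();
        m_Free.pop_back();
        return &m_Items[index];
    }

    // `item` must have come from Acquire() on this pool.
    void Release(T* item)
    {
        assert(item >= m_Items.data() && item < m_Items.data() + m_Items.size());
        m_Free.push_back(static_cast<std::size_t>(item - m_Items.data()));
    }

    std::size_t Available() const { return m_Free.size(); }

private:
    std::vector<T> m_Items;          // storage, never resized after construction
    std::vector<std::size_t> m_Free; // indices of unused slots
};
```

Objects are reused as-is rather than reconstructed, so callers would need to reset any state themselves. Keeping the storage contiguous is also what helps with the cache behaviour mentioned in the memory bullet.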

Edited by tuan kuranes


We have zero tolerance for precision issues. The problem is that small imprecisions quickly accumulate over turns, leading to desync.
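A small sketch of why integer fixed point sidesteps this: integer addition is associative and behaves identically on every platform, whereas floating-point addition is not associative, so results can depend on evaluation order and compiler settings. The 16.16 `Fixed` type below is a hypothetical illustration, not the engine's fixed-point class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16.16 fixed-point number: `raw` holds value * 65536.
struct Fixed
{
    std::int32_t raw;

    static Fixed FromInt(std::int32_t v)
    {
        assert(v >= -32768 && v <= 32767); // must fit in the 16 integer bits
        return Fixed{ v * 65536 };
    }

    // Plain integer addition: exact and order-independent (barring overflow).
    Fixed operator+(Fixed o) const { return Fixed{ raw + o.raw }; }

    // Multiply in 64 bits, then drop the extra 16 fractional bits.
    // Division truncates toward zero, which is well defined for negatives.
    Fixed operator*(Fixed o) const
    {
        const std::int64_t wide = static_cast<std::int64_t>(raw) * o.raw;
        return Fixed{ static_cast<std::int32_t>(wide / 65536) };
    }
};
```

Because every operation is integer arithmetic, two clients that apply the same operations in the same order end up with bit-identical values.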


I wondered if it's possible to simply render as many frames as possible while the sim update is running.

To do rendering concurrently with update, you really need some kind of double-buffering system for the simulation state - you don't want to be rendering from the same data structures you're modifying in another thread, because that leads to race conditions and madness, so you want to render from an immutable copy of the state. I suppose that's not terribly hard in principle, since we mostly just need to copy CCmpVisualActor and some bits of CCmpPosition at the start of a turn, but there's probably lots of tricky details with other things that are updated more than once per turn (building placement previews, projectiles, everything in Atlas, etc). (Also it'll add an extra turn of latency between player input and visible output, so we'd need to make our turn length shorter to compensate for that.)
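A rough sketch of the "render from an immutable copy" idea, assuming the render-relevant state can be reduced to a small snapshot struct. `UnitSnapshot`, `RenderState` and `SnapshotExchange` are hypothetical names; the real set of data to copy (CCmpVisualActor, parts of CCmpPosition, and the tricky cases listed above) would be much larger.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical subset of per-unit state that the renderer needs.
struct UnitSnapshot
{
    float x, z, angle;
};
using RenderState = std::vector<UnitSnapshot>;

// Hands immutable snapshots from the simulation thread to the render thread.
class SnapshotExchange
{
public:
    // Simulation thread: copy the state at the start of a turn and publish it.
    void Publish(const RenderState& state)
    {
        auto copy = std::make_shared<const RenderState>(state);
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Latest = std::move(copy);
    }

    // Render thread: take the latest snapshot. The shared_ptr keeps it alive
    // even if the simulation publishes a newer one while a frame is drawn.
    std::shared_ptr<const RenderState> Acquire() const
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_Latest;
    }

private:
    mutable std::mutex m_Mutex;
    std::shared_ptr<const RenderState> m_Latest;
};
```

The lock only covers a pointer swap, so neither thread waits on the other's real work; the cost is the per-turn copy plus the extra turn of latency noted above.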

I think the text format of profiler1 was a bad choice, it has way too much overhead.

Yeah, it was never meant to be used like this - it was designed just for the interactive in-game table view, then someone added a mode that saved the table to a text file because it's helpful when debugging other people's performance problems, and then I reused it in the replay mode to save every few turns and draw graphs, which is a totally inappropriate thing to do.

What I need at the moment is something to measure the performance difference before and after doing some changes.

For the Spidermonkey upgrade for example I need to know how much faster version 18 is compared to version 17.

I think it's best to use the (non-visual) replay mode for that, since it's about optimising one component rather than about profiling the entire system - rendering introduces a lot of unpredictability across hardware (e.g. someone with a faster graphics card will have higher FPS, so we'll spend more CPU time on per-frame rendering overhead per second, so simulation will look relatively less expensive than on an identical CPU with slower GPU), and I'd guess it introduces more variability on the same hardware (e.g. a very small change might push you over a vsync threshold and double your framerate), and it takes much longer to simulate a whole match. Replay mode lets you focus on just the simulation CPU cost, and a better version of profiler2 would let you see the worst-case simulation cost per turn over the whole match (which would ideally be under maybe 30msec so we can maintain smooth 60fps rendering (the graphics drivers do a bit of buffering which can cover an occasional extra gap between frames)), and that should be enough to see how well an optimisation works. (And then it can be compared to not-so-easily-reproducible whole-system profiles to see whether the thing you're optimising is a significant cost in the wider picture.)
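A minimal sketch of the kind of summary that would support a before/after comparison, assuming per-turn simulation times in milliseconds have already been collected from a replay. `TurnStats` and `SummariseTurns` are hypothetical; the 30 ms budget is the rough figure suggested above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Summary of per-turn simulation cost over one replay. Hypothetical type.
struct TurnStats
{
    double worstMs;         // most expensive turn
    std::size_t worstTurn;  // index of that turn
    std::size_t overBudget; // number of turns above the budget
};

TurnStats SummariseTurns(const std::vector<double>& turnMs, double budgetMs)
{
    assert(!turnMs.empty());
    const auto worst = std::max_element(turnMs.begin(), turnMs.end());
    TurnStats stats;
    stats.worstMs = *worst;
    stats.worstTurn = static_cast<std::size_t>(worst - turnMs.begin());
    stats.overBudget = static_cast<std::size_t>(
        std::count_if(turnMs.begin(), turnMs.end(),
                      [budgetMs](double ms) { return ms > budgetMs; }));
    return stats;
}
```

Running the same replay before and after a change and comparing `worstMs` and `overBudget` gives a reproducible number that does not depend on the GPU or on vsync.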


We have zero tolerance for precision issues. The problem is that small imprecisions quickly accumulate over turns, leading to desync.

Desync, so it's a network problem then? If it's not network related, it's indeed strange that it occurs on such small maps. As said there, my point was that it depends on the size of the map: http://home.comcast.net/~tom_forsyth/blog.wiki.html

This seems more like a 20 km map where we need 1 cm precision (float-like) than a map from Pluto to the Sun (7.4 billion km) with sub-micrometer precision (64-bit-fp-like)?

What surprises me is that float precision isn't enough here. If it's network related and doesn't happen in single player, then that's maybe another kind of issue, more about network server/client prediction/data representation/quantisation algorithms (indeed much more complex with floats than with fixed point).

About the perf problem and the 20% gain without fixed precision: here (using the NATIVE_SQRT define in my patch to compare), isqrt64 (win32 exe on win64) is at the top of the list in the profiler, whereas native sqrt is at the bottom, way down, so there is a very noticeable difference and impact. (See here for some benchmarks for that exact problem: http://omeg.pl/blog/2012/03/performance-of-various-square-root-computing-algorithms/ ) In that case 20% is the minimum.
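For reference, this is what an integer-only 64-bit square root typically looks like. It is the classic digit-by-digit (binary) method, shown as a generic sketch; it is not claimed to be the engine's isqrt64, and `IntegerSqrt64` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// floor(sqrt(n)) using the digit-by-digit binary method. Integer-only,
// so the result is bit-identical on every platform and compiler.
std::uint32_t IntegerSqrt64(std::uint64_t n)
{
    std::uint64_t result = 0;
    std::uint64_t bit = std::uint64_t(1) << 62; // highest power of four in 64 bits
    while (bit > n)
        bit >>= 2;
    while (bit != 0)
    {
        if (n >= result + bit)
        {
            n -= result + bit;
            result = (result >> 1) + bit;
        }
        else
        {
            result >>= 1;
        }
        bit >>= 2;
    }
    return static_cast<std::uint32_t>(result);
}
```

The loop runs up to 32 iterations of dependent integer operations, which gives a sense of why it can show up at the top of a profile while a hardware floating-point sqrt does not.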

In fact, that problem is so impactful that users might like either an x64 exe on 64-bit Windows, or one native sqrt exe for single player and one isqrt64 exe for multiplayer (if the desync bug is a network thing).


It is mainly a network issue, yes. Each network client calculates the game state by itself. So if I shoot an arrow at a unit, and one client calculates the damage as 20 and another client calculates the damage as 21, the clients quickly become irrecoverably out of sync if this happens all the time for all units.
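A common way for lockstep clients to notice this kind of divergence is to hash their simulation state each turn and compare the hashes. The sketch below assumes that approach purely for illustration; the thread does not describe how 0 A.D. detects it, and `UnitState` and `HashState` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-unit state, integer-only so it hashes identically everywhere.
struct UnitState
{
    std::int32_t id;
    std::int32_t hitpoints;
    std::int32_t x, z;
};

// 64-bit FNV-1a over the fields, fed byte by byte in a fixed order so the
// result does not depend on struct padding or host endianness.
std::uint64_t HashState(const std::vector<UnitState>& units)
{
    std::uint64_t hash = 14695981039346656037ull; // FNV offset basis
    auto mix = [&hash](std::int32_t value)
    {
        const std::uint32_t v = static_cast<std::uint32_t>(value);
        for (int i = 0; i < 4; ++i)
        {
            hash ^= (v >> (8 * i)) & 0xffu;
            hash *= 1099511628211ull; // FNV prime
        }
    };
    for (const UnitState& u : units)
    {
        mix(u.id);
        mix(u.hitpoints);
        mix(u.x);
        mix(u.z);
    }
    return hash;
}
```

In the arrow example, the client that computed 21 damage would end up with a different hit point value and therefore a different hash from the client that computed 20, so the mismatch can be flagged on the turn it happens.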


Thanks for the answer. So clearly, before even thinking of going float, one has to either

  • make sure floats give totally predictable/reproducible results on every client (very hard in practice, see http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/ )
  • change the current network model (server authority, or the newer p2p area-authority where each peer is the authority for different areas dynamically, anything not relying on determinism). It can be progressive, with a state ack (Q3 network model) from time to time to "resync". (I haven't even read the current network code yet, so sorry if I'm stating the obvious. Just to point out, a p2p networking model using determinism (unless you're doing search code, that is) is not the usual route afaik; client-server is rather the norm, and gives much simpler code overall and fewer compatibility issues.)


tuan kuranes, can you give some details on how you measured the performance improvement from your patches posted on trac?

Also, to ease review and eventual integration I think it would be better to provide a separate patch for every different improvement.

Thanks for your work anyway :).


I believe the point quantumstate made is that peer-to-peer is standard for RTS because it conserves bandwidth and is less lag prone than client-server, which is essential when you have hundreds of units moving around as opposed to just a few in e.g. an FPS.

