[DISCUSS] Performance Improvements

**RedFox** · April 26, 2013

From what I know of the code, implementing ogre fully means changing the entirety o the GUI folder and perhaps also converting GUI files(to use ogre. If we actually switch the GuI its a full change)

Yes, though MyGUI for Ogre3D is a very good implementation and I doubt we should use our own.

It means a complete change of the renderer folder.
It means tons of updates in the graphics folder, but that actually shouldnt be too much of a problem.

Those folders would probably disappear entirely, replaced by Ogre3D.

Might require some changes to collada(can't recall if ogre handles collada natively).

Ogre3D has a collada plugin. So no changes there.

Ps/ probably has files that would need to be changed as its the core of pyro genesis, but that also shouldn't be a problem with your experience

Yes, this is the main part where the integration would occur. This is the brigdge between modules: [simulation] <-> [ps engine] <-> [ps graphics]. Luckily it means that graphics can be replaced with changes to [ps engine].

The issue is maths/. If you use the ogre provided versions of matrices, vectors, it Ould require changing things everywhere in the code. Perhaps a simpler solution would be to typedefs those types (which would maintain the usage of CWhatever for class names)
The rest should be mostly left as is.

Ogre3D has a very mature maths library. I already replaced the ps maths vectors with: Vector2, Vector3 and Vector4 (had to roll my custom implementations). But Ogre3D has a much more mature version with maths for interpolation and optimized ray intersection. Now that is cool.

However ogre was not chosen before, and we need to know why to check if the reasoning still applies. I suggest you start a new topic explaining what we would gain (in terms of speed, flexibility, features) and how much time you think it'll take after assessing the necessary changes. Then well be able to take a sensible decision

I think you are right. A lot of these arguments should put into a list, describing the benefits and amount of code required to change. I'll get right at it.

**Pureon** · April 26, 2013

Intriguing discussion.

Apart from the collada models and animations, I think we still have some PMD/PMA files for unit animations, which would need integrating (or re-exporting as another format from 3dmax).

The only game I have installed that uses Ogre3D is Rigs of Rods, and unfortunately that performs extremely poorly on my mid-high end system. Maybe i should try one of the RTS games instead.

**RedFox** · April 26, 2013

At this time though I reckon it's better to work on the current code and see where that goes. Changing to Ogre3D is something for (perhaps) part 2...

Right now I'll start working on the resource files.

**Pureon** · April 26, 2013

Sounds really good man, thanks for helping us out

**quantumstate** · April 26, 2013

This is all something that can be fixed with a custom Quadtree implementation. An entity would have a pointer to a Quadtree grid and vice-versa. This allows an entity to get a list of pointers of objects that are in the same grid. Notice how this suddenly, without any complex lookup what-so-ever, decreased a huge list (like 200) to just a handful (usually never more than 8).
If needed, the parent grid can be checked and so on.

This is not true. Both steps which I mentioned are slow run after an exact distance check has been done. Basically how it works now is that it uses the spatial data structure to get the nearby units, then it filters by the euclidean distance to get precisely the units within range. After this has been done it has a slow de-duplicate step and the slow access to more extensive entity data which it uses to apply non distance based filtering. There are lots of entities because a lot of range queries run with the unit los range which is large.

Yes, for example int the source of Doom3. And all of these games have managed with it. Having a few small places where float imprecision is taken to account, is much much better than using CFixed everywhere. Floating point maths standard is now over 28 years old and holding strong. We are aware of the rounding errors and if all else fail, just compile with /fp:precise.

Doom 3 uses a server-client networking model, so they aren't running a full identical simulation on each client. An fps has far less state to synchronise than an RTS.

Yes, it also brings out the points why fully deterministic systems like that don't work - even a slight change in version number or entity tags will break sync. IMO, a different sync model should be considered in this case and deviate into a more network intensive solution.
The article also notes that using /fp:precise works (as stated by several compiler providers too: MSVC++, GCC) and provides fully deterministic results. Just so you know, 0AD already uses /fp:precise, even though precise floats are slower (since they are rounded), they still have a much better performance than CFixed<>.

Our current solution works well, it seems to be the standard way to do networking in RTS games. The Blizzard games, AoE, and Supreme Commander work in the same way.

I'm fairly sure that fp:precise is not safe across different compilers and platforms. We are going to need significant changes to our build systems to make this work properly, this will probably cause headaches with linux distros who want to use normal packaging systems etc. To me it seems easier to jstu try and optimise our existing use of fixed point, integer arithmetic might be able to replace some bits.

Yggdrasil · April 27, 2013

I'm fairly sure that fp:precise is not safe across different compilers and platforms. We are going to need significant changes to our build systems to make this work properly, this will probably cause headaches with linux distros who want to use normal packaging systems etc. To me it seems easier to jstu try and optimise our existing use of fixed point, integer arithmetic might be able to replace some bits.

You might want to check out streflop which is used by spring and megaglest.

http://nicolas.brodu...ation/streflop/

3. String Translation:
Problem: It would seem as if this had no effect for performance or memory, but you should think again. Creating a translation (multilingual) system offers us a cheap ability to group all the strings together into a huge memory block.
Solution: Load game/unit strings into a large memory block and simply share out const pointers to the strings. Something like this: "unit_javelinist\0Generic Peltast\0unit_javelinist_descr\0Peltasts are light infantry...\0".
Result: Every heap allocation costs some memory for the system heap bookkeeping and on Win32 platform it's around ~24 bytes. If you have 1000 strings of size 24, that will add up to 24KB and totaling at 48KB used. With the translation system this would remain ~24KB. Of course, there are much more strings in the game, so the result would be noticeable.

Better just use gettext.

http://www.gnu.org/software/gettext/

A keyword based translation system is also not good to maintain.

related thread: http://www.wildfireg...showtopic=13790

**quantumstate** · April 27, 2013

You might want to check out streflop which is used by spring and megaglest.
http://nicolas.brodu...ation/streflop/

That looks pretty handy, thanks.

**Ykkrosh** · April 27, 2013

Removed CFixed<> and replaced it with float.

That'll cause OOS in multiplayer, because floats aren't consistent between different builds of the game. That's almost the whole point of using CFixed. (The rest of the point is that floats cause nasty imprecision issues, so you might find that e.g. pathfinding works okay at the bottom-left corner of the map but not at the top-right because the coordinates are larger so the resolution is coarser. That's usually a very hard problem to reason about, whereas using fixed-point numbers makes it a non-issue). And they're just ints, so almost all operations on them will be at least as fast as the same operations on floats.

I think /fp:precise doesn't solve this, because the MSVC documentation says the compiler can still do intermediate calculations with 80-bit precision and only round at the end, which means it can give different answers to platforms that do 64-bit precision throughout (e.g. SSE2, or a debug build that spills things onto the stack). Also various library functions (sqrt, sin, etc) differ between platforms. Also requiring /fp:precise globally would make our graphics code (which doesn't really care about precise floats) slower. (And using different compiler options for different modules of the code would be insane and would probably break LTCG etc.)

If I remember correctly, one of the parts of CFixed that was actually slow was CFixed::Sqrt, since that's hard to do with ints. I tried changing the implementation to use floating-point sqrt instead, which made it much faster and appeared to give the same output in all the cases I tested, but I never quite trusted it enough to be fully portable. It probably wouldn't be infeasible to exhaustively test it on every interesting platform to see if it's safe enough, so that might be a nice thing to do. (I don't remember what was actually calling Sqrt though - it might have just been the pathfinder computing Euclidean distances, and I think the new pathfinder design avoids that by computing octile distances instead.)

Have you done any profiling to see which of the CFixed operations are actually causing a performance problem?

Problem: Currently the rendering system constructs a sorted map of models and their shaders Every Frame. Not only is this bad that such a large map and vector is created every frame, it could be redesigned to achieve the same effect with no maps or vectors recreated.
Solution: the models can be sorted beforehand, during context initialization (!) and grouped under specified shaders and materials.
Result: This should give a huge (I'm not kidding) performance boost, so this will be the next thing I'll implement.

Have you done any profiling of this? (Note that if you're running in MSVC you have to be careful to disable the debug heap (even in release builds) else a lot of the STL usage will be massively slower than when run outside the debugger). When I implemented this, I think I measured the per-frame overhead as pretty insignificant (and it allowed a lot of flexibility with changing materials and effects and lighting etc at runtime, without the hassle and bug-proneness of invalidating lots of cached data). It'd be nice to have numbers of how much it currently costs, to see what the potential benefits of improving it would be.

2. Entity/Actor/Prop XMLs:
Problem: Not only is XML inefficient to parse, it's also much harder to actually read and edit. To make it worse, the filesystem overhead for these hundreds of files is huge (!). Loading times for weaker filesystems can be very long.

We've solved that by caching the parsed XML in a binary format (XMB) that takes no time to load, and release packages have all of the files combined into a zip so there is no filesystem overhead.

Every heap allocation costs some memory for the system heap bookkeeping and on Win32 platform it's around ~24 bytes. If you have 1000 strings of size 24, that will add up to 24KB and totaling at 48KB used. With the translation system this would remain ~24KB. Of course, there are much more strings in the game, so the result would be noticeable.

Players have got at least half a gigabyte of RAM - why is 48KB worth any effort?

4. Entity-Component System is inefficient:
Problem: While some would say its having a map<int, map<int, *>> shouldn't have any real effect on performance (since the profiler(?) says so), I would recommend to reconsider and rethink the algorithm of the entire system. Maps are very memory intensive and too many vectors are created/destroyed constantly while the look-up of Components remains slow.

I agree that's a problem - QueryInterface is a bit slow, and BroadcastMessage likely has terrible cache usage patterns.

Give the good old ECS pattern a slight Object-Oriented overhaul. Create an IEntity interface, give concrete implementations like UnitEntity : IEntity a strong reference to a Component in its class definition. This will remove the need for component look-up. The message system can also be redesigned to be more direct. And finally, Managers can be divided into frame slots across frames, since a lot of the data can be asynchronous.

Is your suggestion to create a new class for every type of entity (UnitEntity, InfantryUnitEntity, MeleeInfantryUnitEntity, SpearmanMeleeInfantryUnitEntity, etc)? That's the kind of inflexible hard-coded hierarchy that component-based designs are intentionally trying to get away from. There's a load of documentation from game developers over the past several years about how the component systems make their lives much better.

When an entity is accessing its own components, maybe a better way to minimise the QueryInterface cost without significantly changing the design would be to let a component cache the returned IComponent* the first time it does QueryInterface (after instantation/deserialization) and use the pointer directly after that, which is safe since components can't be deleted dynamically unless the whole entity is deleted. When accessing components of a different entity you've still got to do a lookup on the entity ID though. (Storing pointers to other entities would be bad because it makes deserialization hard and makes dangling pointer bugs much more likely). Again it'd be useful to do some profiling to see which uses of QueryInterface have a significant cost - I'd suspect it's probably a small number that would be easy to fix without redesigning anything much.

Redesign the pathfinder to include a long-distance inaccurate path and a short-distance accurate path. The accurate path is only updated incrementally and for very short distances, making this method suitable for an RTS.

It does that already (where "inaccurate" is "tile-based and ignoring mobile entities").

Currently a very naive subdivision scheme is used for collision, range and obstruction. Even worse is that all of these modules are decoupled, duplicating a lot of data and wasting time on multiple updates of the same kind.

The different modules store different data and have different access patterns - e.g. collision uses obstruction shapes and needs to do queries against typically short lines, whereas range tests use the centres of entities and need to do queries against huge circles with filtering on player ID. (And the renderer needs 3D bounding boxes of models for frustum culling, which is conceptually similar but practically very different again.)

I do agree the current subdivision scheme is naive and could do with improving, but I don't think trying to use a single instance for all the different modules would help.

I don't think the subdivision-data-structure queries are currently the slow part of the range tests - even if the queries were instant, they might return several hundred entities and just trying to figure out which ones have entered or left each other entity's range is too expensive. Usually only a very small proportion have recently entered/left the range, so ideally we wouldn't have to even look at the majority of entities that have remained clearly inside/outside. I don't know how best to do that, though.

Translate UnitAI to C++ and redesign it. In a FSM case (such as UnitAI.js), the controller pattern would be used to decouple actions from the FSM logic and make it maintainable.

I'd prefer to leave it in JS and redesign it . (I have no idea what design it should have, but its current design is certainly not great). It needs to access a load of other components that are only implemented in JS and don't have C++ interfaces yet, so moving it to C++ would be a pain and would reduce the engine's flexibility, plus C++ is just generally more of a pain than JS to write in. I don't think I've ever noticed UnitAI performance as a serious problem - it doesn't run hugely frequently (given that it's all event-driven) and doesn't do any expensive computation, and we should be able to cope happily a few thousand JS functions calls per second. Is there some profiling data showing which parts of it are a problem?

**RedFox** · April 27, 2013

That'll cause OOS in multiplayer, because floats aren't consistent between different builds of the game. That's almost the whole point of using CFixed. (The rest of the point is that floats cause nasty imprecision issues, so you might find that e.g. pathfinding works okay at the bottom-left corner of the map but not at the top-right because the coordinates are larger so the resolution is coarser. That's usually a very hard problem to reason about, whereas using fixed-point numbers makes it a non-issue). And they're just ints, so almost all operations on them will be at least as fast as the same operations on floats.

The problem with multiplayer OOS is a fair point, but I'm still inclined to think that it can be solved with floats. Perhaps a client-server architecture should be considered? What do you think? So far changing to floats gave +20% increase in performance, so I'd say its noticeable.

I think /fp:precise doesn't solve this, because the MSVC documentation says the compiler can still do intermediate calculations with 80-bit precision and only round at the end, which means it can give different answers to platforms that do 64-bit precision throughout (e.g. SSE2, or a debug build that spills things onto the stack). Also various library functions (sqrt, sin, etc) differ between platforms. Also requiring /fp:precise globally would make our graphics code (which doesn't really care about precise floats) slower. (And using different compiler options for different modules of the code would be insane and would probably break LTCG etc.)

As far as I know, /fp:precise adheres to a consistent floating point rounding to reduce errors with floating point math. The VC++ workspace is configured with /fp:precise 'out of the box'.

If I remember correctly, one of the parts of CFixed that was actually slow was CFixed::Sqrt, since that's hard to do with ints. I tried changing the implementation to use floating-point sqrt instead, which made it much faster and appeared to give the same output in all the cases I tested, but I never quite trusted it enough to be fully portable. It probably wouldn't be infeasible to exhaustively test it on every interesting platform to see if it's safe enough, so that might be a nice thing to do. (I don't remember what was actually calling Sqrt though - it might have just been the pathfinder computing Euclidean distances, and I think the new pathfinder design avoids that by computing octile distances instead.)

Float sqrt does have an advantage there, since it's usually a compiler intrinsic and replaced with the fsqrt opcode. That might be one of the reasons float performance was faster for my build.

Have you done any profiling of this? (Note that if you're running in MSVC you have to be careful to disable the debug heap (even in release builds) else a lot of the STL usage will be massively slower than when run outside the debugger). When I implemented this, I think I measured the per-frame overhead as pretty insignificant (and it allowed a lot of flexibility with changing materials and effects and lighting etc at runtime, without the hassle and bug-proneness of invalidating lots of cached data). It'd be nice to have numbers of how much it currently costs, to see what the potential benefits of improving it would be.

I wouldn't ever consider debug build performance something critical. In that sense I was talking about the rendering loop performance and how it could be improved. Perhaps something like this (pseudo code):


for(IShaderTechniqueMgr* tech : context->GetTechniques())
{
  for(IMaterialPass* pass : tech->GetMaterialPasses())
  {
	renderer->SetTechnique(tech->Technique());
	renderer->SetMaterial(pass->Material());
	for(IModel* model : pass->GetModelsCulled(frustrum))
	{
	  renderer->Render(model);
	}
  }
}

We've solved that by caching the parsed XML in a binary format (XMB) that takes no time to load, and release packages have all of the files combined into a zip so there is no filesystem overhead.

Ah that makes sense. I guess I was expecting a more straightforward solution. But the binary format cache should work marvellous.

Players have got at least half a gigabyte of RAM - why is 48KB worth any effort?

That was just a tiny example. The actual amount of strings can reach a much bigger size. And even if the strings don't take that much memory, the answer will always be: cache. A translation module is needed anyways, so I guess it will bring a benefit either way.

I agree that's a problem - QueryInterface is a bit slow, and BroadcastMessage likely has terrible cache usage patterns.

Yes, I figured having strong references to core components in an entity would help with that. Such as (more pseudocode):


class UnitEntity : ...
{
    CCmpTransform mTransformation;
    CCmpActor mActor;
};

Something like that. Of course a quick answer to your next question:

Is your suggestion to create a new class for every type of entity (UnitEntity, InfantryUnitEntity, MeleeInfantryUnitEntity, SpearmanMeleeInfantryUnitEntity, etc)? That's the kind of inflexible hard-coded hierarchy that component-based designs are intentionally trying to get away from. There's a load of documentation from game developers over the past several years about how the component systems make their lives much better.

No, I don't mean that at all. Rather 4-5 generalized entity types to make the core of the game run faster, while scriptable components would still extend the entity functionality. It seems to combine the best of the [hardcoded] and [fully dynamic] approaches, while leaving less headache to us programmers in the end.

When an entity is accessing its own components, maybe a better way to minimise the QueryInterface cost without significantly changing the design would be to let a component cache the returned IComponent* the first time it does QueryInterface (after instantation/deserialization) and use the pointer directly after that, which is safe since components can't be deleted dynamically unless the whole entity is deleted. When accessing components of a different entity you've still got to do a lookup on the entity ID though. (Storing pointers to other entities would be bad because it makes deserialization hard and makes dangling pointer bugs much more likely). Again it'd be useful to do some profiling to see which uses of QueryInterface have a significant cost - I'd suspect it's probably a small number that would be easy to fix without redesigning anything much.

I'm afraid the profile won't point out a chokepoint in the code, if the whole code is equally accessing a whole bunch of components. Even though its relatively fast, it also adds up over iterations, creating the algorithmical bottleneck.

It does that already (where "inaccurate" is "tile-based and ignoring mobile entities").

Are you talking about the stalled pathfinder? It would be great if it was completed.

The different modules store different data and have different access patterns - e.g. collision uses obstruction shapes and needs to do queries against typically short lines, whereas range tests use the centres of entities and need to do queries against huge circles with filtering on player ID. (And the renderer needs 3D bounding boxes of models for frustum culling, which is conceptually similar but practically very different again.)
I do agree the current subdivision scheme is naive and could do with improving, but I don't think trying to use a single instance for all the different modules would help.

You are right that specific implementations all require different aspects, but they all fit into the same kind of BSP scheme, which can be used as a base interface for the specific implementations. So all of the data could be in a quadtree for a 'rough-match' lookup and after narrowing it down, a more specific algorithm can be used for the actual detection.

I don't think the subdivision-data-structure queries are currently the slow part of the range tests - even if the queries were instant, they might return several hundred entities and just trying to figure out which ones have entered or left each other entity's range is too expensive. Usually only a very small proportion have recently entered/left the range, so ideally we wouldn't have to even look at the majority of entities that have remained clearly inside/outside. I don't know how best to do that, though.

Perhaps this could be done during the actual movement update? This would mean a list of "EntitiesInRange" would be attached to every entity and when two units are in range of each other, they share each other's pointers. This would avoid dangling pointers and would also instantly notify an unit when someone leaves its range. Using the quadtree structure, if an unit enters a new grid, it can do a quick update of entities who are already in the grid, to see if anyone is in range. After that initial enter, the squared distances are updated during each sample where the unit moves. Something like this could work and possibly faster than comparing two large arrays.

I'd prefer to leave it in JS and redesign it . (I have no idea what design it should have, but its current design is certainly not great). It needs to access a load of other components that are only implemented in JS and don't have C++ interfaces yet, so moving it to C++ would be a pain and would reduce the engine's flexibility, plus C++ is just generally more of a pain than JS to write in. I don't think I've ever noticed UnitAI performance as a serious problem - it doesn't run hugely frequently (given that it's all event-driven) and doesn't do any expensive computation, and we should be able to cope happily a few thousand JS functions calls per second. Is there some profiling data showing which parts of it are a problem?

I think right now is a bit too early to answer that question, but I'm inclined to think that moving towards a well implemented C++ design with a scriptable interface would work out best.

**Ykkrosh** · April 27, 2013

The problem with multiplayer OOS is a fair point, but I'm still inclined to think that it can be solved with floats. Perhaps a client-server architecture should be considered? What do you think? So far changing to floats gave +20% increase in performance, so I'd say its noticeable.

Fundamentally redesigning the entirely multiplayer architecture isn't something that should be taken so lightly

Where precisely did that 20% come from? I'd expect it to be one or two call sites that are doing some relatively expensive maths, and it would be far easier to find and focus on optimising those areas than to make changes across the entire codebase with unpredictably harmful side-effects.

As far as I know, /fp:precise adheres to a consistent floating point rounding to reduce errors with floating point math. The VC++ workspace is configured with /fp:precise 'out of the box'.


#include <stdio.h>
#include <float.h>
static float x = 1.0f;
static float y = FLT_EPSILON/2.0f;
int main() {
  float t = x + y - x;
  printf("%g\n", t);
}

> cl test.cpp /O2 /fp:precise /arch:SSE2
> test
0
> cl test.cpp /O2 /fp:precise /arch:SSE
> test
5.96046e-008

(MSVC 2012, x86)

I wouldn't ever consider debug build performance something critical.

I don't mean a debug build - I mean a release build that you are running in a debugger (e.g. when you press F5 in Visual Studio). (If I remember correctly, it's any time IsDebuggerPresent() returns true when the CreateProcess() is called, unless you set an environment variable _NO_DEBUG_HEAP=1.) (That caused me quite a bit of confusion when first trying to measure and optimise stuff )

That was just a tiny example. The actual amount of strings can reach a much bigger size. And even if the strings don't take that much memory, the answer will always be: cache.

It takes about 60 bytes per character just for the GL vertex arrays to draw the text, so the sizeof(wchar_t) bytes per character plus overhead per string to store all the currently-displayed text won't be a significant part of the total cache usage either.

Are you talking about the stalled pathfinder? It would be great if it was completed.

I mean the old one that's currently in the game. (The new not-currently-in-progress one uses the same short-range pathfinder, it just makes the long-range tile-based pathfinder faster and fixes some bugs due to mismatches between the two.)

**RedFox** · April 28, 2013

Fundamentally redesigning the entirely multiplayer architecture isn't something that should be taken so lightly

Where precisely did that 20% come from? I'd expect it to be one or two call sites that are doing some relatively expensive maths, and it would be far easier to find and focus on optimising those areas than to make changes across the entire codebase with unpredictably harmful side-effects.
Where precisely did that 20% come from? I'd expect it to be one or two call sites that are doing some relatively expensive maths, and it would be far easier to find and focus on optimising those areas than to make changes across the entire codebase with unpredictably harmful side-effects.

If it was some expensive bottleneck in the code, I'm sure you would have noticed it alread. The weight was 'distributed' over all the program in the numerous hundred of conversions like fixed::fromInt(1) and fixed::fromFloat(mVariable). I reckon those conversions added up for some significant downtime. You are absolutely right about the possible harmful effects of changing code - that's a given any time you write new code. But in this case the codebase does need some maintance and since I have time, it would be foolish to postpone it to never.


> cl test.cpp /O2 /fp:precise /arch:SSE2
0
> cl test.cpp /O2 /fp:precise /arch:SSE
5.96046e-008

Yes, my own tests also showed that SSE2 does some extra rounding off the edges. Now is a time to figure which is worse? For floating point accuracy, SSE2 would be best, because it rounds out FLT_EPSILON/2.0f. Unfortunately it'll not work for embedded systems, since they don't have SSE2. All modern modern PC cpu's support SSE2, so it won't matter unless embedded systems are concerned. Alas, a lot of people would also want to play this on an iPhone or Android device... that's waaaay to much cross platform than we can handle.

I don't mean a debug build - I mean a release build that you are running in a debugger (e.g. when you press F5 in Visual Studio). (If I remember correctly, it's any time IsDebuggerPresent() returns true when the CreateProcess() is called, unless you set an environment variable _NO_DEBUG_HEAP=1.) (That caused me quite a bit of confusion when first trying to measure and optimise stuff )

Yes, I'm aware of the debugger attaching and of the heap debug mode. That's why I ran the game outside of the IDE. So still 20%.

It takes about 60 bytes per character just for the GL vertex arrays to draw the text, so the sizeof(wchar_t) bytes per character plus overhead per string to store all the currently-displayed text won't be a significant part of the total cache usage either.

In either case, even if it won't hit the cache, a translation system will still be needed. And in that case, some minor performance increase will creep with them.

I mean the old one that's currently in the game. (The new not-currently-in-progress one uses the same short-range pathfinder, it just makes the long-range tile-based pathfinder faster and fixes some bugs due to mismatches between the two.)

Ah ok. SO in that sense, the pathfinder still needs some work to be done on.

**Ykkrosh** · April 28, 2013

If it was some expensive bottleneck in the code, I'm sure you would have noticed it alread.[

There are quite a few that have been noticed years ago and still not fixed

The weight was 'distributed' over all the program in the numerous hundred of conversions like fixed::fromInt(1) and fixed::fromFloat(mVariable). I reckon those conversions added up for some significant downtime.

fixed::fromInt(1) gets compiled to something like "mov ecx, 10000h", so it has no cost. fixed::fromFloat is more expensive but is used very rarely (CCmpRallyPointRenderer is the only non-trivial one I can see (and that looks like it would probably be better if it used fixed everywhere instead of mixing it with float and converting back and forth)). Most of the other operations on fixed get inlined into the equivalent operation on ints.

What situation did you measure the 20% performance improvement in? If I run e.g. Combat Demo (Huge) and start a fight for a couple of minutes, then the profiler indicates about 35% of the total runtime is in CFixedVector2D::CompareLength, which is called by CCmpPathfinder::ComputeShortestPath calling std::sort(edgesAA...). Almost all the time in CompareLength is doing lots of expensive-on-x86 64-bit multiplies, so that's the kind of thing that might well be made a lot faster by using floats (though I'd guess it could also be made quite a bit faster while sticking with ints, with some x86-specific code to do a 32x32->64 bit mul or with SSE2 or something). But the real problem here is that the short-range pathfinder is terribly unscalable - it needs a different algorithm, which'll mean it won't do a crazy number of std:sort(edgesAA...), and then the performance of CFixedVector2D::CompareLength will hardly matter. Were you measuring that performance issue or something else?

Yes, my own tests also showed that SSE2 does some extra rounding off the edges. Now is a time to figure which is worse? For floating point accuracy, SSE2 would be best, because it rounds out FLT_EPSILON/2.0f. Unfortunately it'll not work for embedded systems, since they don't have SSE2. All modern modern PC cpu's support SSE2, so it won't matter unless embedded systems are concerned. Alas, a lot of people would also want to play this on an iPhone or Android device... that's waaaay to much cross platform than we can handle.

The point is that you can't just add a compiler option like /fp:precise and get the same FP results on all platforms. I think it's unreasonable to require the game to always be built with SSE2 on x86 (particularly given how hard it is to control the build flags used by Linux distros), but then anyone running an x87 FP build of the game will eventually get out-of-sync errors. The way to avoid all those problems is to not use floats.

(I think ARM should generally give the same results as x86 SSE2 (ignoring how compilers rearrange expressions), since its VFP does 32-bit/64-bit operations - the problem is just x87 which does 80-bit operations without rounding.)

**RedFox** · April 28, 2013

There are quite a few that have been noticed years ago and still not fixed

Hmm, that is rather dire. :/ Do you mean some of the core issues that were brought out?

fixed::fromInt(1) gets compiled to something like "mov ecx, 10000h", so it has no cost. fixed::fromFloat is more expensive but is used very rarely (CCmpRallyPointRenderer is the only non-trivial one I can see (and that looks like it would probably be better if it used fixed everywhere instead of mixing it with float and converting back and forth)). Most of the other operations on fixed get inlined into the equivalent operation on ints.

You are right about the code generated. Yet some parts converted to fixed point and then to floating point, which is one step too much already. I'm just trying to look at a pragmatic angle here: Sure, floating numbers tend to 'float', but they're pretty good for their overall convenience. If OOS is the main issue, I'm sure a deterministic rounding method can be brought in (which is both fast and accurate) during hash generation.

What situation did you measure the 20% performance improvement in? If I run e.g. Combat Demo (Huge) and start a fight for a couple of minutes, then the profiler indicates about 35% of the total runtime is in CFixedVector2D::CompareLength, which is called by CCmpPathfinder::ComputeShortestPath calling std::sort(edgesAA...). Almost all the time in CompareLength is doing lots of expensive-on-x86 64-bit multiplies, so that's the kind of thing that might well be made a lot faster by using floats (though I'd guess it could also be made quite a bit faster while sticking with ints, with some x86-specific code to do a 32x32->64 bit mul or with SSE2 or something). But the real problem here is that the short-range pathfinder is terribly unscalable - it needs a different algorithm, which'll mean it won't do a crazy number of std:sort(edgesAA...), and then the performance of CFixedVector2D::CompareLength will hardly matter. Were you measuring that performance issue or something else?

The performance improvement was measured in a single-player match "Punjab 1". Nowadays floats are fast on x86/x64, so using hardware optimized FSQRT or SSE2 RSQRTSS opcode (yeah - sqrt is built into cpu on x86/SIMD, weird right? ) seems like the way to go.

The point is that you can't just add a compiler option like /fp:precise and get the same FP results on all platforms. I think it's unreasonable to require the game to always be built with SSE2 on x86 (particularly given how hard it is to control the build flags used by Linux distros), but then anyone running an x87 FP build of the game will eventually get out-of-sync errors. The way to avoid all those problems is to not use floats.

Currently even SSE2 capable hardware struggles to run the game, so in that matter, relying on SSE2 is a viable option. SSE2 was introduced with P4 over 12 years ago. I might have agreed with 'not everyone' having SSE2 6 years ago, but now that's something very rare. Lets face it, SSE2 is on each and every Intel/AMD x86/x64 arch currently reading this post. Sounds reasonable? Tell me what you think about making SSE2 'required'? We could even make tests to see if floats given good results in 'real-world-cases'. I think the performance improvement is worth it, even if fixed point is still all the rage on microcontrollers.

(I think ARM should generally give the same results as x86 SSE2 (ignoring how compilers rearrange expressions), since its VFP does 32-bit/64-bit operations - the problem is just x87 which does 80-bit operations without rounding.)

That's good to know - I didn't know much about the ARM architecture.

zoot · April 28, 2013

How difficult is it actually to replace fixed points with floats? If it's a relatively simple change, it's probably better to leave it for a volunteer contributor.

**RedFox** · April 29, 2013

How difficult is it actually to replace fixed points with floats? If it's a relatively simple change, it's probably better to leave it for a volunteer contributor.

Well, as the OP stated, it took a full day or two to bring in the required changes. Furthermore, testing and bugfixes (like OOS) were ignored. So that will add up a lot of time. This is a non-trivial change of the entire engine.

Thanduel · April 29, 2013

It seems a bit heavy handed to change the entire engine to floating point without doing some significant testing across a number of machines to ascertain whether there is actually any benefit. As you say it is a non-trivial change. I have only given the code a cursory look but perhaps there is room for optimisation of the fixed point library? By the way I'm one of the those embedded devs that likes fixed point, well more like depends on fixed point.

**RedFox** · April 30, 2013

It seems a bit heavy handed to change the entire engine to floating point without doing some significant testing across a number of machines to ascertain whether there is actually any benefit. As you say it is a non-trivial change. I have only given the code a cursory look but perhaps there is room for optimisation of the fixed point library? By the way I'm one of the those embedded devs that likes fixed point, well more like depends on fixed point.

In this current case, I'm running a 4 year old laptop with a cheaper Intel i7 mobile chip. I'm not getting that much performance out of the game and changing to floats gave at least 20% performance increase in a regular game. With the 'big combat demo' it might give even more benefit. The main target platforms all have an integrated FPU on the chip and it's not reasonable to suffer a performance drop because rare embedded systems without an FPU are considered as a possible release target. It just makes no sense - even ARM processors (probably on your phone too) have an FPU. Floats are fast due to hardware support; sure - not all float values can be precisely represented, but that is something that can be accounted for.

Edited April 30, 2013 by RedFox

Thanduel · April 30, 2013

Indeed I do think it would be silly to use fixed point over floating point to support a bunch of embedded processors representing a tiny fraction of the user base. I'm also well aware that all modern CPU's have FPUs even multiple FPUs integrated. However a large part of the reason for using a fixed point system is performance on any platform. Therefore obviously if we can beat the performance with a floating point system we should use it. What I'm saying though is the performance improvement you get is surprising and might be an indication that there is room for improvement in our fixed point system. There really shouldn't be such a gap even with a FPU.

**RedFox** · April 30, 2013

Indeed I do think it would be silly to use fixed point over floating point to support a bunch of embedded processors representing a tiny fraction of the user base. I'm also well aware that all modern CPU's have FPUs even multiple FPUs integrated. However a large part of the reason for using a fixed point system is performance on any platform. Therefore obviously if we can beat the performance with a floating point system we should use it. What I'm saying though is the performance improvement you get is surprising and might be an indication that there is room for improvement in our fixed point system. There really shouldn't be such a gap even with a FPU.

Ah, now I understand what you're saying. Spending time optimizing a fixed point library could take weeks. I've fallen into the fixed-point pithole before (working on Atmel 8-bit ATMega chips) and in the end, when I ran out of time and just needed a working solution, reverted to the gcclib software floats. In the end it didn't matter if I used fixed point or floats - there was enough performance... Yet I spent a huge time on premature optimization; something I could've spent on the actual code instead.

So what I'm trying to say is - we shouldn't bother growing and optimizing a fixed-point library. We should just stick with the fast floats and win in both performance and development time. Sounds reasonable? ^_^

**Yves** · April 30, 2013

Maybe we should just skip the fixed point vs float discussion for a moment.

20% sounds like a lot, but I think you measured that at the beginning of the game and the bigger problem is later when the simulation starts using up most of the resources. Also it could make such a significant difference simply because other parts of the engine aren't optimized yet and have to do too many calculations. Philip indicated something like that related to the short-range pathfinder. Last but not least there seem to be different opinions that are well argued but still no real consensus.

If you will eventually become a fulltime programmer you should start with a less contentious issue . Don't get me worng, but no matter if your right or wrong in that matter it would be good to add something everybody is happy with. There are enough tasks for that, it shouldn't be the problem.

In my opinion the main problems currently are:

1. Pathfinder: Meaning the pathfinder itself that is used for unit movement but also the pathing functionality that the AI need and currently implements separately because there's no interface to the engine's pathfinder implementation that fulfills the requirements.

2. Multithreading support: Separating AI and Simulation from the rendering would about double the average framerate and would get rid of lag-spikes happening when for example the AI calculates the map for dropsite placement (multiple seconds on large maps!). I've tried doing that for the AI but my conclusion was that the spidermonkey upgrade needs to be done first (3).

3. JavaScript Engine improvements and Spidermonkey upgrade: I'm currently working on that in ticket #1886. It's difficult to predict how much that will improve performance. For Firefox the Spidermonkey performance improved a lot but our situation is a bit different. Anyway, the upgrade is need for various reasons - one being the multithreading support.

An issue we have with all kinds of performance improvements is how to measure them. You are talking about 20% improvement which is inherently quite a random number you measure in one specific situation on a very specific plattform (Hardware, driver, OS, compiler options) and with unspecified settings. It would be very helpful both for making good judgements and for motivation to have better ways of measuring performance.

What's missing is some kind of benchmark mode that tests different aspects of performance in scripted ingame scenes. It could be some camera movements over dense forests, over water, a lot of units fighting, an army moving through a forest and pathing through a narrow gap in formation, a maximum of 8 AI players doing their normal AI-calculations etc.

It should store all the profiler information in the background to draw nice graphs later like how the ms/frame performance changed over time and what parts of the engine used most of the calculation time in these different situations. We already have most of what we need for that.

I think that would be a very good task for you to start with, if you like.

The pathfinder would be the most important task to continue but that's probably a bit though to start with.

We already have two integrated profilers in the engine (simply called Profiler1 and Profiler2).

Profiler1:

"source/tools/replayprofile/*" contains a Perl script to extract profiling data from the text files generated by Profiler1.

There's also a HTML file that can display nice graphs like that:

Issues with profiler1:

There's a bug that messes up the profiling data after some time. I havent analized it any further but you should get what I mean if you let it run with a game for a few minutes.
It has some issues with multithreading
In some situations it created huge files of profile data. I once got a 32 GB file and it has such an impact on performance that the data looses its significance

Profiler2: Philip create this one which is very helpful for measuring accurately what happens in a single frame. Check the forum thread for more information.

It even supports displaying multiple threads and calculations on the GPU.

Profiler2 keeps the data in fixed size memory buffer and overwrites the oldest data when the buffer is full. The format is a binary format with much less overhead than profiler1's output. It makes the data accessible via web-server.

Maybe it would be better to extend this profiler to save the data in the memory buffer to the disk.

What do you think?

zoot · April 30, 2013

As far as benchmarking goes, this could be a vaguely related ticket: http://trac.wildfiregames.com/ticket/417

**wraitii** · April 30, 2013

Excellent post, Yves, I completely agree.

I'd add that there are probably a lot of slightly inefficient code bits in quite a few areas of the game, which overall would probably give semi-substantial improvements, but would require probably not too obvious changes (and possibly tedious ones, like caching things, or reinventing slightly better algorithms).

I would like to make a case for proper model culling, we currently cull models once per frame, in a not really efficient way (we have to check the world bounds against the frustrum, which is fairly slow, and sometimes recalculate the bounds, which is slower). We would need to do it faster, so we can do it more often

(the code for water reflection and refraction fakes culling by flagging unwanted models. But since the rendered recreates the bucket and all, this is not really efficient (and again, the method for culling is slow). This is bad because we can't really do this efficiently, it introduces bugs, and particularly as I'm working on multiple water level (infinite, really), we would really need to be able to better cull models.

This is sort of linked to the Ogre and the Octree discussion, but I think it's one of the more important point.

**sanderd17** · April 30, 2013

2. Multithreading support: Separating AI and Simulation from the rendering would about double the average framerate and would get rid of lag-spikes happening when for example the AI calculates the map for dropsite placement (multiple seconds on large maps!). I've tried doing that for the AI but my conclusion was that the spidermonkey upgrade needs to be done first (3).

Multithreading would make the game so much faster. If you would put every AI in its own thread, you could play 1human vs 3AI on a quad core, and it would still be faster than the current 1 vs 1.

**RedFox** · April 30, 2013

Maybe we should just skip the fixed point vs float discussion for a moment.

I think you're right, it doesn't serve any purpose at the moment.

If you will eventually become a fulltime programmer you should start with a less contentious issue . Don't get me worng, but no matter if your right or wrong in that matter it would be good to add something everybody is happy with. There are enough tasks for that, it shouldn't be the problem.

Perhaps you're correct on that matter, since a lot of these problems are disputatious, its better to handle smaller tasks right now. ^_^

In my opinion the main problems currently are:
1. Pathfinder: Meaning the pathfinder itself that is used for unit movement but also the pathing functionality that the AI need and currently implements separately because there's no interface to the engine's pathfinder implementation that fulfills the requirements.

Since Philip's pathfinder is still a WIP, I haven't given it much attention. Perhaps Philip could comment on this? If he can finish the pathfinder, or if we can start over?

2. Multithreading support: Separating AI and Simulation from the rendering would about double the average framerate and would get rid of lag-spikes happening when for example the AI calculates the map for dropsite placement (multiple seconds on large maps!). I've tried doing that for the AI but my conclusion was that the spidermonkey upgrade needs to be done first (3).

Right now simulation 'turns' are taken a few times per second. Given that we can calculate per average how many frames are called between turns (let it be N), we can divide the time into 'frame slots'. Specifically, you can assign slot[0] = Task0, slot[1] = Task1, slot[N] = TaskN. Where of course any task that has a slot greater than N just overflows. In this sense, if the tasks are asynchronous they can be assigned to any slot. If they require a previous task to run first, they have to be assigned to a higher slot.

It's actually pretty simple once you put it to code. It just helps you divide all the tasks over a series of frames instead of calling all the tasks every 250ms.

It should help with the framerate spikes and it won't introduce data-race conditions.

3. JavaScript Engine improvements and Spidermonkey upgrade: I'm currently working on that in ticket #1886. It's difficult to predict how much that will improve performance. For Firefox the Spidermonkey performance improved a lot but our situation is a bit different. Anyway, the upgrade is need for various reasons - one being the multithreading support.

I thought Philip said that JavaScript has no noticeable effect on performance?

An issue we have with all kinds of performance improvements is how to measure them. You are talking about 20% improvement which is inherently quite a random number you measure in one specific situation on a very specific plattform (Hardware, driver, OS, compiler options) and with unspecified settings. It would be very helpful both for making good judgements and for motivation to have better ways of measuring performance.

You are correct, that is always a huge issue - getting the same results on different platforms.

What's missing is some kind of benchmark mode that tests different aspects of performance in scripted ingame scenes. It could be some camera movements over dense forests, over water, a lot of units fighting, an army moving through a forest and pathing through a narrow gap in formation, a maximum of 8 AI players doing their normal AI-calculations etc.

So a lot of functions would need to have profile macros? Or do you mean a kind of profiler you can run inside JS scripts?

We already have two integrated profilers in the engine (simply called Profiler1 and Profiler2).
Profiler1:
"source/tools/replayprofile/*" contains a Perl script to extract profiling data from the text files generated by Profiler1.
There's also a HTML file that can display nice graphs like that:
Issues with profiler1:
There's a bug that messes up the profiling data after some time. I havent analized it any further but you should get what I mean if you let it run with a game for a few minutes.
It has some issues with multithreading
In some situations it created huge files of profile data. I once got a 32 GB file and it has such an impact on performance that the data looses its significance

So basically, it's rather useless since the amount of data is overwhelming.

Profiler2: Philip create this one which is very helpful for measuring accurately what happens in a single frame. Check the forum thread for more information.
It even supports displaying multiple threads and calculations on the GPU.
Profiler2 keeps the data in fixed size memory buffer and overwrites the oldest data when the buffer is full. The format is a binary format with much less overhead than profiler1's output. It makes the data accessible via web-server.
Maybe it would be better to extend this profiler to save the data in the memory buffer to the disk.

Now Profiler2 sounds like something much better.

What do you think?

I think all of this profiling and implementing a better profiler is really really good stuff actually... BUT - There's always a but somewhere. Visual Studio 2012 has a really really good profiler too... The main benefit of the VC++ integrated profiler is that it accurately measures time spent inside a function and the timers are compiled into the assembly output if you run your program with the profiler on. Of course its downside is that it has a similar issue to to Profiler1: it generates huge (~512mb per minute for pyrogenesis) datasets that take quite some time (a few minutes) to analyze. Luckily the analyzed report is quite small

So for example, taking a performance sample of just running the Main Menu and opening the few tooltips present, of course only filtering and showing the last minute of crazy menu action:

First thing we can see is that 0 AD Main Menu doesn't use a lot of CPU power: out of 8 cores it only needs one active thread while others are mostly in sleeping state. Even though 0 AD didn't consume that much CPU, it still didn't run fast. What gives? The only noticeable peaks in CPU activity are probably from menus and dialogs opening up.

Checking the hot path for the generated report, we see something interesting:

Around 25% of time is spent suspending the thread (which is very good - we don't need to use more power than necessary), and another 50% is used for WaitForSingleObject, which is the Windows API's mutex lock function. Are we having concurrency issues here? Is it a worker thread waiting for mutex wakeup? Interesting to note is that std::char_traits<char>::compare is being called a lot and is thus pointed out by the profiler. There were 5 threads running in total, so it makes sense that they were being suspended most of the running time, otherwise CPU usage would have skyrocketed.

Taking a closer look at the function statistics that we really care about: the active application hotpath.

Now we can start really profiling the main menu. What is going on here exactly? Aside from lots of threads sleeping, we can see that GuiManager is the one doing the real work. Around 20% of the remaining CPU time was spent drawing the GUI, if we do some quick math (20/25) that makes 80% of the total run time.

So what, big news? Rendering is takes time. We all knew that. Besides, everyone uses recursion to create easy-to-use GUI API's.

What we're missing here however, why is recursion taking so much time? Image drawing doesn't take any time at all, however CTextRenderer::Render does:

So most of the time is spent rendering text?

Lets have a look:

Hah! I get it. It's constructing the glyph runs every frame again and again and again. It should do this when the Text object is initialized with text. Direct2D API on windows has special glyph run objects for handling this and creating static text vs dynamic text. Clearly we've found why the GUI was struggling slightly on the profiler run.

Summary: Even though GUI isn't the priority, it was the easiest thing to profile (just to showcase the VS2012 profiler and how useful it can be) in such a short amount of time. And it also helped discover that the 10 years old GUI code actually has a pretty big weakness: text rendering wastes a huge amount of memory and resources every frame, since they are temporary objects.

I've been looking at the new FreeType 2 library, which is an open-source library used for TrueType and OpenType text rendering. It has a very slim and streamlined API. Perhaps this is something I could start with?

**wraitii** · April 30, 2013

This is one of those things that seem like they'd improve performance slightly and be generally good, I support it.

FYI, XCode has a really interesting profiler too in the instruments, so I could perhaps also get some data from that.

(edit: that code does seem crazily inefficient, though)

Edit2: ran the game a little in the main menu, Xcode's profiler tells me about 40% of the time was spent walling CGUI: textDraw if I read things correctly.

[DISCUSS] Performance Improvements

Recommended Posts

Link to comment

Share on other sites

Top Posters In This Topic

Popular Days

Top Posters In This Topic

Popular Days

Popular Posts

RedFox

agentx

Yves

Posted Images

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation