tuan kuranes

Posts posted by tuan kuranes

  1. @zoot: Well let's say 20 packet per seconds:

    P2p: each player sends and receives, 8 player => 8*20*8 => 1280

    ClientServ: 8*20 + 8*20 = 320.

    Meaning, in terms of latency, a non-negligible gain in all cases.

    Let's imagine we want to grow to 16 player one day in the future, with

    p2p: 5120

    ClientServ: 640

    We might get better GPUs, CPUs, RAM and bandwidth over time, but latency is rather hard to lower (waiting for a quantum-teleportation network might take a while...)

    And those are rather the lower bar, as packet acks, error retransmission, etc. make it much higher in practice.

    (the real point in networking is latency; bandwidth is not really a problem anymore)
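    The rough arithmetic above can be sketched as two small formulas, using the same n² approximation as the numbers in this post (helper names are hypothetical, just for illustration):

```cpp
// Rough packets-per-second estimates, matching the n^2 approximation
// used in the numbers above. Function names are hypothetical.
int P2PPackets(int players, int rate)
{
    // every player exchanges `rate` packets/s with every player
    return players * players * rate;
}

int ClientServerPackets(int players, int rate)
{
    // one send + one receive per client per second
    return players * rate + players * rate;
}
```

    With rate = 20: P2PPackets(8, 20) gives 1280 vs ClientServerPackets(8, 20) = 320, and at 16 players 5120 vs 640, matching the figures above.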

    @wraitii: sorry again for the big patch. I already looked at other patches, but I needed to start without changes at first, as it's easier to understand and read. I need to get a bigger picture before making changes to the current pathfinding code; I still don't clearly get the why and how of path computation (short vs. long paths, etc.). I would really start with making it a separate static lib, behind a facade interface, allowing easier refactoring and better code localisation, leading to better optimisation opportunities (if not threading opportunities).

    I don't know any other way to optimise memory than memory reuse/pool.

    For instance, ComputeShortPath allocates vectors of Edge (edgesLeft, edgesRight, tileEdges, etc.) for each path computation. How can we avoid allocating on each path computation without getting them from a pool and returning them to the pool once finished? (Actually, for perf, we use reserve() calls, which often reserve more than what we need, causing a bigger memory hit.)

    Memory pooling guarantees control of the memory (what we allocate and where, if we do it once at start), and zero allocations/deallocations per frame.
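    A minimal sketch of the kind of vector pool meant here (names and structure are hypothetical, not the actual 0 A.D. code): vectors are handed out and returned instead of being constructed per path computation, so their heap buffers survive across calls.

```cpp
#include <utility>
#include <vector>

// Hypothetical per-type vector pool: Release() clears a vector but
// keeps its capacity, so the next Acquire() gets a pre-grown buffer
// and steady-state path queries allocate nothing.
template <typename T>
class VectorPool
{
public:
    std::vector<T> Acquire()
    {
        if (m_Free.empty())
            return std::vector<T>();
        std::vector<T> v = std::move(m_Free.back());
        m_Free.pop_back();
        return v;            // keeps the capacity from a previous user
    }

    void Release(std::vector<T>&& v)
    {
        v.clear();           // destroys elements, keeps capacity
        m_Free.push_back(std::move(v));
    }

private:
    std::vector<std::vector<T>> m_Free;
};
```

    A pathfinder would acquire its Edge vectors at the start of ComputeShortPath and release them at the end.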

  2. tuan kuranes, can you give some details on how did you measure performance improvement with your patches posted on trac?

    As I said above, using Very Sleepy (fast profiling, the "detector") and CodeXL (deep but slow profiling, the "check the perf gains"), all on win64 but with a 32-bit exe.

    Testing was done with lots of hand-made maps for a realtime feeling (small, huge, with and without agents, with or without bots, with or without resources, with lots of pathfinding or not much).

    Profiler and profiler2 are nice, but it's hard to detect spikes with them without some sort of line graph, and the fps "frequency smoother" isn't helping to get good raw numbers on that either.

    If you have a look, most changes are really common "optimise-as-you-go" things, as pointed out in the link above.

    Other changes are rather simple and common sense. For instance, the pathfinding sort is faster if you precompute the distance for each Edge once before sorting; otherwise each sort comparison pays the ComputeLength cost (not to mention RVO not kicking in). Likewise, hoisting the code shared by each visibility check, avoiding the call for the basic if-left/if-right checks, is a clear CPU-cycle win.
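    A minimal sketch of the precomputed-key idea described above (the Edge type and names are hypothetical stand-ins, not the actual pathfinder code): compute each length once, then sort on the cached value instead of recomputing it inside the comparator.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Edge { float x0, y0, x1, y1; };   // hypothetical stand-in

static float ComputeLength(const Edge& e)
{
    return std::hypot(e.x1 - e.x0, e.y1 - e.y0);
}

// Decorate each edge with its length once (n calls to ComputeLength),
// then sort on the cached key, instead of paying ComputeLength inside
// every one of the O(n log n) comparisons.
static void SortEdgesByLength(std::vector<Edge>& edges)
{
    struct Keyed { float len; Edge e; };
    std::vector<Keyed> keyed;
    keyed.reserve(edges.size());
    for (const Edge& e : edges)
        keyed.push_back({ComputeLength(e), e});

    std::sort(keyed.begin(), keyed.end(),
              [](const Keyed& a, const Keyed& b) { return a.len < b.len; });

    for (std::size_t i = 0; i < edges.size(); ++i)
        edges[i] = keyed[i].e;
}
```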

    I tried not to make any change beyond that line of work: guaranteed fewer CPU cycles, measuring each change (cache, CPU, branch misses).

    I've been rather conservative; for instance, on the Edge sort, it's in fact faster without it (it's cheaper to check more edges than to sort them, as sorting costs not only CPU but above all memory).

    Also, to ease review and eventual integration I think it would be better to provide a separate patch for every different improvement.

    I'm sorry for that; I must say I intended to do that, but it's really hard not to fix the consts and pass-parameters-by-reference optimisations as I go; it helps me read the code...

    If you remove that part of the changes, it boils down to pathfinding changes, minimal C++ perf/branch fixes here and there (about minimising "if" inside "while"), and changes where RVO doesn't kick in, in order to lower the temp-object allocation cost, particularly for hotspot code.

    I'll try to make smaller patches (but cannot promise anything before next saturday...)

    @zoot: not sure P2P really uses less bandwidth: each client must send its user actions to every other client anyway, so traffic grows quadratically with the number of players, whereas with client-server it's a single send per client, and the server issues aggregated info once per client (with clever update/quantisation/compression tricks to minimise bandwidth). So even latency might be better with client-server, minimising the buffer bloat of p2p (a huge number of packets, each having to pass through the buffering of each piece of network equipment).

    @redfox: Indeed, the memory problem lies in all the ways the current code causes memory allocation in an "invisible" manner. It's not "visible" because there's not much code with new/delete/malloc, but there's a huge number of hidden allocations done on the stack, by the STL, by Boost, and that kills the memory by fragmenting it incredibly. Ogre has nice pool-allocator code that can be used outside of Ogre and helps a lot there. That big step will have to be done, and using pooling and code that forces acknowledgment of allocations (meaning using pointers, or at least shared pointers) is the way to go. The pathfinder reallocates at least 6 std::vectors for each ComputeShortPath call; there's no way the memory is contiguous. The L1/L2 cache is trashed, and the CPU is mostly just waiting for data.

    About the fixed-point question, it seems logical to wait until changing it doesn't mean breaking the network code...

  3. Thanks for the answer. So clearly, before even thinking of going float, one has to either

    • make sure float gives totally predictable/reproducible results whatever the client (very hard in practice, see http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/ )
    • change the current network model (server authority, or a newer p2p area-authority model where each peer is dynamically authoritative for different areas; anything not relying on determinism; it can be progressive, with state acks (Q3 network model) from time to time to "resync"). (I haven't read the current network code yet, sorry if I'm stating the obvious. Just to point out: a p2p networking model relying on determinism is not the usual route afaik (unless you're doing search code, that is); client-server is rather the norm, gives much simpler code overall, and fewer compatibility issues.)
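    A tiny illustration of why float determinism across clients is so fragile (a generic example, not 0 A.D. code): floating-point addition isn't associative, so two peers that merely accumulate the same values in a different order already end up with different bits, and under deterministic lockstep that difference is a desync.

```cpp
// Floating-point addition is not associative: summing the same values
// in a different order yields a different bit pattern. Two lockstep
// peers doing this each turn would drift apart and desync.
bool SameStateAfterSum()
{
    double clientA = (0.1 + 0.2) + 0.3;   // one evaluation order
    double clientB = 0.1 + (0.2 + 0.3);   // another order
    return clientA == clientB;            // false on IEEE-754 doubles
}
```

    And that's before compiler flags, x87 vs. SSE, or libm differences between the clients even enter the picture.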

  4. We have zero tolerance for precision issues. The problem is that small imprecisions quickly accumulate over turns, leading to desync.

    Desync: it's a network problem then? If it's not network-related, it's indeed strange that it occurs on such small maps. As said there, my point was that it depends on the size of the map: http://home.comcast.net/~tom_forsyth/blog.wiki.html

    It seems more like 20 km maps where we need 1 cm precision (float-like) than a map from Pluto to the Sun (7.4 billion km) with sub-micrometer precision (64-bit-fp-like)?

    What surprises me is that float precision isn't enough here. If it's network-related, and doesn't happen in single player, then that's maybe another kind of issue, more about the network server/client prediction, data representation and quantisation algorithms (indeed much more complex with float than with fixed point).
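    That back-of-the-envelope claim can be checked directly: the spacing between adjacent 32-bit floats (one ulp) around 20000 (say, metres, for a 20 km map measured from the origin) is about 2 mm, comfortably inside a 1 cm budget. A minimal check:

```cpp
#include <cmath>

// One ulp of a 32-bit float near 20000: floats in [2^14, 2^15) have a
// spacing of 2^(14-23) = 2^-9 = 0.001953125 (~2 mm if the unit is
// metres), well within a 1 cm precision budget.
float UlpNear20km()
{
    float x = 20000.0f;
    return std::nextafterf(x, 1e30f) - x;
}
```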

    About the perf problem and the 20% gain without fixed precision: here (using the NATIVE_SQRT define in my patch to compare), isqrt64 (win32 exe on win64) is at the top of the profiler list, whereas native sqrt is at the bottom, way down, so there is a very noticeable difference and impact (see http://omeg.pl/blog/2012/03/performance-of-various-square-root-computing-algorithms/ for benchmarks of that exact problem). In that case 20% is a minimum.

    In fact, that problem is so impactful that users might like either an x64 exe on Windows 64, or one native-sqrt exe for single player and one isqrt64 exe for multiplayer (if the desync bug is a network thing).
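    For reference, bit-exact integer square roots of the kind a fixed-point pathfinder needs are typically computed with a shift-and-subtract loop like the generic sketch below (not necessarily the engine's actual isqrt64); each call costs dozens of data-dependent integer ops and branches, versus a single hardware instruction for native sqrt, which is why the two end up at opposite ends of the profile.

```cpp
#include <cstdint>

// Generic bit-by-bit integer square root: returns floor(sqrt(n)).
// Deterministic across platforms (pure integer math), but needs up to
// ~32 iterations with data-dependent branches, whereas native sqrt is
// one hardware instruction. Not necessarily the engine's isqrt64.
uint32_t ISqrt64(uint64_t n)
{
    uint64_t root = 0;
    uint64_t bit = uint64_t(1) << 62;     // highest even bit position
    while (bit > n)
        bit >>= 2;
    while (bit != 0)
    {
        if (n >= root + bit)
        {
            n -= root + bit;
            root = (root >> 1) + bit;
        }
        else
            root >>= 1;
        bit >>= 2;
    }
    return static_cast<uint32_t>(root);
}
```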

  5. My few cents:

    • Memory is the huge culprit that no one points out, but it's the real performance killer: there should be no allocation/deallocation during the game. Not a single one. Preallocation, memory pools, you name it. It doesn't really show in profilers (still, new and delete are hotspots...) as it's a systemic issue. There are very huge perf gains there. (Memory fragmentation/allocation is a huge perf killer, causing CPU cycles to be wasted waiting for data; check all the data-oriented-design papers/talks/slides.)
    • Coming from Ogre, making the switch is obviously a good idea, not only for perf, but also leading to a better decoupling of mechanics and graphics code.
    • Multithreading: did you consider OpenMP? It's the easiest/fastest way to do multithreading, and is now supported by all compilers. Or go for a library like Intel TBB or JobSwarm.
    • Graphics: switching to Ogre will help, but some things have to be done differently, mainly all the "upload to GPU" code (the minimap draw, for instance) that forces OpenGL drivers to sync, which causes great perf loss (better to render to a framebuffer texture, do lazy updates, instancing, anything).
    • Pathfinder: first separate the code into a library behind a nice Facade, then use the Detour library (http://digestingduck...-in-detour.html), then make it run in threads. If you want to do pathfinding on your own, read the whole blog (http://digestingduck.blogspot.com/) for docs/explanations, as it's near state of the art.
    • Performance: I made a behaviour-preserving performance patch for all the hotspots listed by profilers, if anyone wants to review it/give it a try. It gives huge perf gains here, but doesn't solve the AI/memory hiccups. What it does is mostly remove unnecessary CPU cycles and memory copies; mostly good C++ perf practices (added to the wiki, notably the http://www.tantalon....opt/asyougo.htm guidelines).
    • Using Very Sleepy (fast) and CodeXL (complete) for profiling on win32.
    • Fixed-point isn't really needed and doesn't help the codebase SLOC. Few external libraries use it (not Ogre, OpenGL, Detour, etc.), and there's really not much reason for it nowadays, given x64 CPU floating-point optimisations (especially if not using /fp:precise), x64 registers, not to mention SSE/AVX and specialised branch-prediction mechanisms. No map size here seems large enough to cause precision problems with floats, and even if it were, there are tricks to solve that kind of problem.
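    As an illustration of how low the OpenMP entry cost mentioned above is (a generic sketch, not 0 A.D. code): a single pragma parallelises a loop, and the same source still compiles and runs serially on a compiler without OpenMP enabled, since unknown pragmas are simply ignored.

```cpp
#include <cstddef>
#include <vector>

// Sum of squares over a value array. With OpenMP enabled (-fopenmp or
// /openmp) the loop runs across cores, with the reduction clause
// combining per-thread partial sums; without OpenMP the pragma is
// ignored and the loop runs serially with identical results.
// (Signed loop index for compatibility with older OpenMP versions.)
double SumSquares(const std::vector<double>& values)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (std::ptrdiff_t i = 0; i < std::ptrdiff_t(values.size()); ++i)
        total += values[i] * values[i];
    return total;
}
```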

  6. - Why an octree? That's for 3D spatial subdivision; in 2D it's a quadtree, but that's just a name for a sparse hierarchical grid.

    So what you need here is better grid performance and usage (in LOS, pathfinding, culling).

    And for that, in order to support huge maps and huge agent numbers, the key is to be "hierarchical". For instance:

    _grid, etc: http://theory.stanfo...sentations.html

    _ hierarchical pathfinding, hpa/hpa+:

    http://aigamedev.com...pa-pathfinding/

    http://digestingduck...athfinding.html

    http://digestingduck...tion-grids.html

    Units and formations will also benefit from hierarchy: a group/formation requests only one hierarchical path, and the agents just "follow the path" (using very simple steering/swarming algos).

    - Performance-wise: there's memory work to do, as there should be no new/alloc/free/delete during the main loop. (Typically CMatrix3D::operator* appears in profiler hotspots because it allocates a new CMatrix3D on the stack for its return value each time; the correct way is how CMatrix3D::GetInverse does it, taking a "dest" parameter.) Data should be laid out contiguously in memory, so that CPU cache use is optimal (i.e. you don't check each bounding box against the clip planes individually, but all bounding boxes against the clip planes).

    Read all the data-oriented-design papers available (http://dice.se/publi...riented-design/).
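    The operator* vs. GetInverse point boils down to writing into a caller-provided destination instead of returning a temporary. A minimal sketch with a hypothetical 3x3 matrix type (the real CMatrix3D differs):

```cpp
// Hypothetical 3x3 matrix; the engine's CMatrix3D differs. The point:
// Multiply writes into a caller-owned `dest` (GetInverse's style), so
// a hot loop can reuse one matrix instead of constructing a temporary
// per call, as a by-value operator* would.
struct Matrix3
{
    float m[3][3];
};

// Note: dest must not alias a or b.
void Multiply(const Matrix3& a, const Matrix3& b, Matrix3& dest)
{
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
        {
            float sum = 0.0f;
            for (int k = 0; k < 3; ++k)
                sum += a.m[r][k] * b.m[k][c];
            dest.m[r][c] = sum;
        }
}
```

    The caller hoists one Matrix3 out of the loop and passes it as dest on every iteration; steady state allocates and copies nothing.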
