
Manifesto against threading



Don't take the title seriously. What I am writing is my view, not that of the WFG team.

I'm trying to speed up the engine and make it stutter less.

First I was looking at multi-threading, since "if one core is too slow, multiple will be faster", right...

In recently uploaded CppCon presentations about hardware efficiency and general optimizations, two statements were made: "Computationally-bound algorithms can be sped up by using multiple threads" and "On modern hardware, applications are mostly memory-bound". Taken together, they suggest it is hard or even impossible to speed up a typical application using multiple threads. But step by step:

Computationally bound means the speed of an application is limited by the speed of computation. Faster computation does not always mean a faster application. A web browser is likely network-bound: a faster CPU does not improve the application speed by much, but improving the network speed does (this can be further split into latency-bound and throughput-bound). An application can be bound by many other things: GPU-bound, storage-bound, memory-bound, user-bound ;)

An application is fast if no single bound outweighs the others.

Over the last decades CPUs got smaller (smaller is important since it reduces latency) and faster. -> Computationally-bound applications became faster.
Memory also got faster, but is spatially still "far" away from the CPU. -> Memory-bound applications did not become much faster.

There is some logic that can help us:

We have to bring the data (normally residing in memory) spatially close to the CPU, or inside it. That's a cache. When the CPU requests data from memory, the data is also stored in the cache. The next time it is requested, we do not have to go to memory.
Exploit: Use data from the cache as much as possible. More practically: visit the data only once per <something>. Let's say you develop a game ;) every entity has to update its position and health every turn. You should visit every entity once: "for each entity {update position and update health}". When you update health, the entity's data is already in cache. If you instead do "for each entity {update position} for each entity {update health}", at the start of the second for-loop the first entity will no longer be in cache (if the cache is not big enough to hold all entities) and another (slow) fetch from memory is required.
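A minimal sketch of the two loop shapes (the Entity struct and update logic are made up for illustration, not engine code):

```cpp
#include <vector>

// Hypothetical entity; fields are illustrative only.
struct Entity {
    float x, y;   // position
    int health;
};

// Cache-friendly: each entity is visited once per turn, so the health
// update hits data the position update just pulled into cache.
void UpdateFused(std::vector<Entity>& entities) {
    for (Entity& e : entities) {
        e.x += 1.0f;    // update position
        e.health -= 1;  // update health while the data is still hot
    }
}

// Cache-unfriendly: by the time the second loop starts, the first
// entities may have been evicted and must be fetched from memory again.
void UpdateSplit(std::vector<Entity>& entities) {
    for (Entity& e : entities)
        e.x += 1.0f;
    for (Entity& e : entities)
        e.health -= 1;
}
```

Both produce the same result; only the memory access pattern differs.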

When the CPU fetches data from memory, most of the time it needs an "object" that is bigger than a byte. So the memory sends not just one byte to the cache but also the data around it. That's called a cache line. A cache line is typically 64 bytes.
Exploit: Place data which is accessed together in the same cache line. If multiple entities are packed into the same cache line, you only need one fetch to get all of them. Optimally the entities are stored consecutively. If that is not possible, an allocator can be used. The job of an allocator is to place data (entities) in some smart spot (the same cache line). An allocator can yield a great speed improvement: D4718. Adding "consecutiveness" to an already allocator-aware container also yields a small speed improvement: D4843.
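As a rough illustration of how much fits into one line (64-byte cache lines assumed; the struct layout is hypothetical and padding may differ on exotic platforms):

```cpp
#include <cstddef>

// Hypothetical compact entity: two 4-byte floats + one 4-byte int = 12 bytes.
struct Entity {
    float x, y;
    int health;
};

constexpr std::size_t kCacheLine = 64;

// Entities that share a single 64-byte cache line when stored contiguously
// (e.g. in a std::vector): one memory fetch brings in several of them.
constexpr std::size_t kEntitiesPerLine = kCacheLine / sizeof(Entity);

static_assert(sizeof(Entity) == 12, "assumes 4-byte float/int and no padding");
static_assert(kEntitiesPerLine == 5, "five entities share one cache line");
```

With scattered heap allocations, each of those five entities could instead cost its own fetch.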

Now and then data has to be fetched from memory. It would be ideal if the fetch could be started asynchronously: the CPU would invoke the "fetcher", run some other work, and then access the prefetched data.
Exploit: The compiler/CPU should know which data it has to fetch "long" before it is accessed. In an "std::list", each element stores the pointer to the next element. The first element can be prefetched, but the second cannot: which data to load cannot be determined before the first fetch completes. This creates a fetch chain, where each fetch depends on a previous one. Thus iterating an "std::list" is slower than an "std::vector".
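The contrast can be sketched like this (summing is just a stand-in for any per-element work):

```cpp
#include <list>
#include <numeric>
#include <vector>

// Contiguous: the address of element i+1 is known without loading element i,
// so the hardware prefetcher can stream the data ahead of the loop.
int SumVector(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// Pointer chasing: the address of the next node is only known after the
// current node has been fetched, so every step can stall on memory.
int SumList(const std::list<int>& l) {
    return std::accumulate(l.begin(), l.end(), 0);
}
```

Both functions compute the same sum; only the dependency structure of the loads differs.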
Exploit 2: Avoid virtual functions; they involve a fetch chain of several dependent fetches. A normal function call involves only one fetch, which can be prefetched and might even be inlined.
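To make the claim concrete, here is a simplified model of the indirection (Unit/Archer are made-up names; real compilers may devirtualize or lay this out differently):

```cpp
struct Unit {
    virtual ~Unit() = default;
    virtual int Attack() const { return 1; }  // dispatched through the v-table
    int AttackDirect() const { return 1; }    // direct call, trivially inlinable
};

struct Archer : Unit {
    int Attack() const override { return 7; }
};

// A virtual call conceptually loads the object's v-table pointer, then the
// function pointer out of the v-table, then jumps to the code: a short chain
// of dependent fetches. A direct, non-virtual call needs none of that.
int Damage(const Unit& u) {
    return u.Attack();
}
```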

 

Back to threading

In a single-threaded application you might get away with memory-boundness, since big parts of the application fit into the cache. With multiple threads, each accessing a different part, they still have to share the cache and the fetcher logic (I didn't verify that with measurements). The pathfinder is not memory-bound (that I did measure), so threading it yields an improvement.

Sometimes I read that pyrogenesis is not fast because it is mostly single-threaded. I don't think that is true. Yes, threading more parts of the engine will yield a speedup, but in the long run reducing memory-boundness should have higher priority than using multiple threads. There also might be some synergy between them.

Edited by phosit
Add first line

  • 3 weeks later...

I don't really see anything in your post that supports the position to not use multithreading? Yes, obviously doing more efficient computations is better than doing less efficient computations faster, but given the same level of optimisation, a well designed multi-threaded CPU-bound program will almost always be faster (by how much depends on Amdahl's Law). In most architectures the cores have some levels of independent caches, so increasing the number of threads also increases the effective memory bandwidth (for things that are in cache, taking into account cache invalidation, etc, etc). Cores also have their own prefetchers. Modern memory systems are quite well optimised for concurrent access. It's definitely not a case of all cores having to go through the same narrow pipe. For another example, if two cores require data from two different memory banks (last level cache miss, on the order of 100 cycles), the memory controller can issue read requests for both, and wait for both at the same time.
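For reference, Amdahl's Law can be sketched in a few lines (the 0.9 parallel fraction in the note below is an arbitrary example, not a measurement of pyrogenesis):

```cpp
// Amdahl's Law: if a fraction p of the work is parallelizable,
// the best possible speedup on n cores is 1 / ((1 - p) + p / n).
double AmdahlSpeedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with p = 0.9 the bound on 8 cores is about 4.7x, and even with infinitely many cores the speedup caps out at 1 / (1 - 0.9) = 10x, which is why the serial fraction matters so much.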

Most of what you described are basic optimisation concepts that anyone who cares about performance should be familiar with. However, I think there's one big idea that took me much longer to REALLY understand and get onboard with - always profile first. Humans are absolutely terrible at estimating where the performance hotspots are, and if left with our guesses, we will spend all our time optimising things that just don't matter. Now when I program, I never do any non-trivial optimisations ahead of time. Strictly only after profiling. Of course, that doesn't mean I pessimise unnecessarily. I still try to not make unnecessary copies of data, etc. Just nothing that requires spending more time.

Just a few notes about the specific things you mentioned:

* Manual prefetching helps in theory, but is extremely hard to do usefully in practice. You have to make sure you prefetch far enough ahead that it makes a difference (if you do it just 10 cycles ahead of when you need it, that's not going to make any difference, and you are paying the instruction decoding cost, etc), but not so far that it gets retired before you need it (in which case you wasted memory bandwidth that maybe could have been used by another thread). Also, in your example of a linked list, if you traverse the list on every frame, the hardware prefetcher is probably already doing it for you, because it would have recognised the data access pattern. I have been working on performance-critical stuff for about 10 years now, and have only seen one instance of manual prefetching that's actually helpful (and only marginally). Many have tried. If I'm trying to optimise a program, this is going to be one of the very last things I look into. Yes, using a vector over a list where possible is a good idea. Mostly because contiguous access means each cacheline fetch can pick up more elements, and for a very long vector, DRAM has burst mode that is more efficient.

* Virtual functions are fine in reality. The v-table is almost always going to be in L1i cache, so those fetches are essentially free with good instruction scheduling. Also, in many cases, the actual function called can be proven at compile time, and the compiler can de-virtualise it, in which case it's absolutely free. Virtual functions offer a lot of readability and maintainability benefits. I would need very concrete evidence showing that they are a significant bottleneck before taking them out. Again, profiling first is important. I have not encountered a single case where this makes a difference.


I might have formulated it a bit too harshly. We should still use multithreading, but for now (for me) making the game more cache-friendly has higher priority.

When I wrote that, I didn't know that each core has its own prefetcher :D

Yes, don't optimize prematurely. But what is an optimization‽ I use reduce over fold when possible. For some people that's already an optimization.

Yes, manual prefetching is hard... I think... I've never done it. I don't say we should do it manually, but we should design the data structures in a way that the compiler / the CPU can do it.
I don't think the data access pattern will be remembered across a frame. The linked list is not the only data structure which gets iterated / would have to be remembered, and the linked list might be mutated.

No, the compiler doesn't do that much devirtualization (we haven't enabled LTO), which also prevents further optimization.
If the v-table is in the cache, it uses up some valuable cache space (that's bad, to my understanding). Also, when I profiled, many CPU cycles were "lost" at virtual function calls.

I would like to talk further with you and get some tips on how to profile, but that's easier on IRC. Feel free to join ;)


  • 4 weeks later...
On 17/12/2022 at 7:29 PM, phosit said:

First I was locking at multi-threading.

Well, being locked would eventually lead to problems in most cases. Though, I understand your point and do not seek to contradict it.

From what I can tell by now, in multiplayer the simulation is done on each client and then compared every n milliseconds. It actually feels like that when I am playing the game, as if the simulation did a brief freeze every half a second or so, then went on, then froze again, etc.

I think that the synchronization of clients would be a good occasion to implement multithreading, wouldn't it? It is not crucial for the simulations of all clients to be 100% certainly* the same, as long as the clients can be expected to behave in the same way (we are talking about multiplayer integrity checks here, a big/hot topic on its own for a moddable game). Meaning that maybe not all checks should be done during the running game, and if so, then probably asynchronously, using free system resources.

Maybe - I know neither the computing effort necessary for such a task, nor am I familiar with the engine design yet - the game replay might be serialized by the host on free cores, so that there would be a static reference for all clients to the correct sim state, or to check whether the sim state was correct 1 minute ago. But possibly that is not a good approach.

Let me give a scenario. Let's assume that all of our clients are in gamestate X, that their simulation will start according to state X, and that their units will behave according to gamestate X. Meaning, all players apply the same logic to the same values starting on the same map. From this starting point, given that all the players' commands are shared on the network, the game should be reconstructible for each client in the same way.

This would mean: if it is now ensured™ that all the clients behave in the same way, and that each client started with the same terrain, entities etc., then this could lessen the amount of data exchange required between the clients, right? Ideally the datasets which must be accurately synchronized on each player's main thread would then mainly consist of the commands given by the players, instead of constantly checking the whole simulation.

If this actually worked well, it would have another implication concerning deviations between clients. Because if the clients all behave in the same way, the verification of simulation states - the actual in-game synchronization between clients (-> multiplayer integrity check) - would eventually become a check which only seldom throws an OutOfSync error, would it not?**

And if that is so, then there should probably be main-thread syncing between the clients as soon as some asynchrony is detected by the - asynchronously running - integrity checks during the game. The clients could then be put on pause and given a loading bar for the synchronization task or something.

In handling this OutOfSync error, we would then do whatever is needed to ensure that our game returns to gamestate X, or give the host the ability to cancel the game if malicious behaviour is suspected. An error is something that should not occur, after all.

Probably the main question here is: how do we ensure that multiple clients playing together are simulation-wise in an equal state (not graphics-wise, nor sound-wise, but rather data- and script-wise)? Can certain checks be made obsolete by client-code design? Another question in this context: what if someone gets the game code, changes it in a cheaty way, just pretends to be a normal client (version etc.) and then connects to a multiplayer game? Would his differing game state be recognized beforehand, or would the game just start and then keep throwing OutOfSync errors? Because that would be quite inefficient, I think.

===

* Please interpret the term properly here; they should ideally be 100% the same, but it is not necessary to do this check in realtime / on the main thread. So, technically, all clients in a game could have had 100% equal simulations without actually checking that during the whole game, instead assuming it because they took some security measures beforehand, ensuring that equal simulation starts -> outcomes, equal JSON and XML, and equal binary behaviour could be assumed.

** I've encountered about 2 or 3 OutOfSync errors so far. Assuming that there is 1 check per second, applied to 10 hours of playtime, that would be... a tiny percentage of the checks being "positive".

Edited by sternstaub

On 04/02/2023 at 4:33 AM, sternstaub said:

From what I can tell by now, in multiplayer the simulation is done on each client and then compared every n milliseconds. It actually feels like that when I am playing the game, as if the simulation did a brief freeze every half a second or so, then went on, then froze again, etc.

Did you profile it? https://trac.wildfiregames.com/wiki/EngineProfiling#Profiler2
I experienced the same thing: https://code.wildfiregames.com/D4786

On 04/02/2023 at 4:33 AM, sternstaub said:

I think that the synchronization of clients would be a good occasion to implement multithreading, wouldn't it? It is not crucial for the simulations of all clients to be 100% certainly* the same, as long as the clients can be expected to behave in the same way (we are talking about multiplayer integrity checks here, a big/hot topic on its own for a moddable game). Meaning that maybe not all checks should be done during the running game, and if so, then probably asynchronously, using free system resources.

No. It is costly to copy JS objects (the simulation components) to another thread, and JS can't reference data on another thread.
There has to be some kind of OOS check. It isn't done every turn.

On 04/02/2023 at 4:33 AM, sternstaub said:

This would mean: if it is now ensured™ that all the clients behave in the same way, and that each client started with the same terrain, entities etc., then this could lessen the amount of data exchange required between the clients, right? Ideally the datasets which must be accurately synchronized on each player's main thread would then mainly consist of the commands given by the players, instead of constantly checking the whole simulation.

It's already done that way: only commands, and sometimes a hash of the simulation state, are shared.

 


  

9 hours ago, phosit said:

Good hint

9 hours ago, phosit said:

There has to be some kind of OOS check. It isn't done every turn.

Then it might be interesting for some people if the density of these checks could be reduced in the game settings until the problem is solved. That way the host would have an option to sacrifice security for responsiveness, for example if you want to play with larger troop counts.

Questions would be: is this implementable, is it desirable and is it responsible to assume that players know what the option does?

//EDIT: hope this does not go too off-topic here.

Edited by sternstaub
Link to comment
Share on other sites

On 06/02/2023 at 9:08 PM, sternstaub said:

Then it might be interesting for some people if the density of these checks could be reduced in the game settings until the problem is solved. That way the host would have an option to sacrifice security for responsiveness, for example if you want to play with larger troop counts.

Questions would be: is this implementable, is it desirable and is it responsible to assume that players know what the option does?

It is implementable (I think). It could be a hidden setting (only in the user.cfg). @Stan` also mentioned making the synchronizations-per-turn ratio variable for a28 or so.

Edited by phosit
