Everything posted by RedFox

  1. This is my activity status report thread for August 2013. There's a pretty large gap since the previous update: I only really started working once all the legal issues were finally taken care of. Consequently, this progress report starts from the middle of the week and isn't anything terribly impressive.

     Week #2: 12.08 - 18.08
     14.08 1900-0600 (11) - Debugging on Ubuntu: OpenGL and GLSL. TextRenderer drawbatch optimization.
     15.08 1600-0200 (10) - Debugging on Ubuntu: no success. TerrainRenderer optimization.
     16.08 1300-0500 (16) - TerrainRenderer optimization. Removed FFP. ShaderProgram optimization.
     17.08 0000-0600 (6) - Debugging, first breakthroughs.

     Of the total of 43 hours, most of the time went into debugging, though I was able to squeeze in some improvements that seemed to make enough of a difference. The most disappointing aspect is the ModelRenderer, which is pretty much fully back to its original form - solely because of debugging the crash on Linux. Hopefully I can get around to reverting that and restoring the improvements ModelRenderer previously had.

     1. Debugging on Ubuntu
     What's the issue?: The new patch causes a mysterious crash on Linux and OS X.
     I decided to pick up a copy of Ubuntu and do some debugging. Even though I got some relevant data out of it, it wasn't a success - the game ran fine on my x86 Ubuntu. So the problem is most likely related to some size_t differences between x86 and x64 - Josh is running x64 Ubuntu and he experiences the crash. This is actually a relevant breakthrough and has pointed me towards the possible causes. I expect to finally find the reason in the coming week.

     2. TextRenderer DrawCalls
     What are Text DrawCalls?: Text DrawCalls are clumps of text that carry the text color, style and font.
     I didn't really intend to do anything about TextRenderer, but while debugging on Linux I noticed that OpenGL keeps rendering text one word at a time. This is an extremely inefficient way to do it, so I improved the DrawCall batching to clump similar lines of text together.
     DrawCall batches before: { "This ", "is ", "an ", "example." }
     DrawCall batches after:  { "This is an example." }
     Depending on the actual amount of text the rendering speedup varies, but it noticeably raised the FPS on text-heavy pages. I don't know how much more optimization of text rendering can be justified, but one thing is certain: we will have to migrate to something better, like FreeType fonts and true OpenGL vertex-buffered rendering. That would not only reduce the amount of code we have, it would also speed the engine up - we could remove a lot of redundant/duplicate code and rely solely on graphics-card-side buffers, which is the modern way to program in OpenGL. For now, I'll leave it be.
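     To illustrate the batching idea (a minimal sketch with hypothetical names, not the actual TextRenderer code), consecutive draw calls that share the same font, color and style get merged into one:

     #include <cstddef>
     #include <string>
     #include <vector>

     // Hypothetical draw-call record; the real TextRenderer uses its own structures.
     struct SDrawCall
     {
         int font;          // font handle
         unsigned color;    // packed RGBA
         int style;         // bold/italic/etc.
         std::string text;  // text run to render
     };

     // Merge adjacent draw calls with identical state into a single batch.
     static std::vector<SDrawCall> BatchDrawCalls(const std::vector<SDrawCall>& calls)
     {
         std::vector<SDrawCall> batched;
         for (std::size_t i = 0; i < calls.size(); ++i)
         {
             const SDrawCall& c = calls[i];
             if (!batched.empty() &&
                 batched.back().font == c.font &&
                 batched.back().color == c.color &&
                 batched.back().style == c.style)
                 batched.back().text += c.text; // same state: append to the previous batch
             else
                 batched.push_back(c);          // state changed: start a new batch
         }
         return batched;
     }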
     3. TerrainRenderer
     What is TerrainRenderer?: TerrainRenderer, obviously enough, renders the in-game terrain. It works in three stages: 1) base tiles, 2) blends between tiles, 3) decals on the terrain.
     What are Patches?: Patches are 16x16-tile chunks of the terrain and the main unit used in frustum culling and rendering.
     So this is actually a pretty important part of the engine, and not that surprisingly, it's also quite difficult to optimize. The previous implementation used a memory pool and a rather complex sorting scheme built out of three nested levels of std::map. Since I was already hard pressed for time, I couldn't manage a complete redo of the rendering algorithm itself, but I was able to improve the pre-sorting and batching of Terrain Patches.
     Instead of relying on a memory pool and three nested levels of std::map, I wrote a simple and very straightforward structure that does the bare minimum and takes advantage of the hardcoded limits on the number of patches and the number of blends per patch. Since terrain rendering takes up a large share of the renderer's time, this change was quite noticeable: in both Debug and Release builds I measured roughly a 33% performance improvement. Where batching and sorting previously took about half of the terrain rendering time, the sorting is now insignificant compared to the actual rendering. I'll use timing data from Profiler2 for comparison later.
     Before: --todo--
     After: --todo--
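     The gist of the replacement is something like the sketch below: flat, fixed-capacity arrays keyed by texture instead of nested std::maps. The names and capacities here are made up for illustration; the real engine code and its hardcoded limits differ.

     // Hypothetical sketch of flat batch storage with hardcoded capacities.
     struct TerrainBatch
     {
         const void* texture;      // batch key
         const void* patches[256]; // patches rendered with this texture
         int numPatches;
     };

     struct TerrainBatcher
     {
         TerrainBatch batches[64]; // upper bound known in advance, so no heap allocation
         int numBatches;

         TerrainBatcher() : numBatches(0) {}

         void Add(const void* texture, const void* patch)
         {
             // A linear search is fine: the batch count is small and the array is cache-friendly.
             // Bounds checks are omitted for brevity.
             for (int i = 0; i < numBatches; ++i)
             {
                 if (batches[i].texture == texture)
                 {
                     batches[i].patches[batches[i].numPatches++] = patch;
                     return;
                 }
             }
             TerrainBatch& b = batches[numBatches++];
             b.texture = texture;
             b.patches[0] = patch;
             b.numPatches = 1;
         }
     };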
     4. Removed FFP
     What is FFP?: The fixed-function pipeline (FFP) was a module of 0 A.D. that emulated shader support. It was a really nasty hack to support ancient PCs with no GPU or no shader support.
     I think it's fair to call this the biggest change of the week. It took a lot of careful editing to make sure everything still works properly, but the patience paid off and I was able to remove a lot of complexity from the whole shader pipeline. It's difficult to measure the performance improvement this gives, and it's safe to say it's actually quite negligible. The main gain is that we have a lot less code to maintain. Previously quite a few optimizations were out of the question because FFP was in the way; now that it's removed, we can slowly move towards a cleaner, more maintainable and, of course, faster shader system.

     5. ShaderProgram optimizations
     What is ShaderProgram?: This is an internal class that abstracts away the GPU's programmable shaders and automatically prepares the OpenGL shader program for rendering.
     This was actually a really tiny change in the way OpenGL shader uniform and attribute bindings are stored, but it's necessary to make way for a more efficient binding system in the future. I intend to move away from the current string-based binding lookups and replace them completely with constant values. There are two ways we could go about this:
     #1 (less preferred): Preassign binding locations for the shader input stage. For example, attribute location 0 will always be vertex data, attribute location 1 always the UV0 set, etc. This is somewhat tedious, since you have to explicitly declare an attribute location in the shader:
     layout (location = 0) vec3 a_vertex; // our vertex attribute in the shader
     Its name can be anything, really; all we care about is binding location 0 in the layout. Vertex data would always be sent to that location (of course, only if there is an attribute with that binding location - otherwise vertex data would not be uploaded).
     #2 (preferred): Variable names in the shader program have an explicit meaning. For example, a_vertex would always be vertex data. If you fail to declare a variable with the name a_vertex, your shader won't receive any vertex data. This is pretty much perfect for shader development: we enforce consistent variable naming and we can remove a lot of superfluous data from the shader XMLs (probably even negate the need for shader XMLs for non-effects). In the shader it would look nice and clean:
     vec3 a_vertex; // our vertex attribute in the shader
     Having explored the two possible approaches, it's pretty much obvious that #2 is the way to go. This would allow us to seriously streamline shader compilation and the attribute/uniform binding during the shader's input layout stage.
     The most obvious reason to go this way is that a very large number of shader variable names have already been hardcoded into the engine by the previous developers. Since we probably won't be redesigning all the shader files, option #2 leaves them as they are (with some changes to variable names). Here is the list of currently hardcoded shader variables and definitions:
     It is obvious that the current shader system is far from truly moddable. However, if we document all the "hardcoded" uniform names, such as u_TextureTransform, other people can write their own shaders without much hassle. We can also finally throw away ShaderDefStr - a boon for performance - and resolve all the definition names at compile time. We would have something like this:
     enum ShaderAttributes
     {
         a_Vertex = 0, // vertex position attribute
         a_Color,      // vertex color attribute
     };
     static std::string mapping[] = {
         "a_Vertex",
         "a_Color",
     };
     int GetShaderVariableID(const std::string& var) // for shader compilation stage
     {
         for (int i = 0; i < ARRAY_SIZE(mapping); ++i)
             if (var == mapping[i])
                 return i;
         return -1; // mapping does not exist for this variable
     }
     const std::string& GetShaderString(int id) // for shader compilation stage
     {
         return mapping[id];
     }
     This naive implementation maps any incoming strings from the shader file to internal indices that match the appropriate enums such as ShaderAttributes. Since the actual number of variables isn't that big, we can get away with a simple loop - thanks to the CPU cache, a simple loop over a small array beats an std::map lookup. I'll stop my tedious explanation of "what's to change" and leave it here.

     6. GLSL compilation optimization
     What is GLSL?: Currently we support two types of shaders: 1) old ARB shaders, which are fast; 2) new GLSL shaders, which provide all sorts of fancy effects.
     All modern cards (since 2004 or so) support GLSL - the only issue is our own unoptimized GLSL shader code, which also has some bugs on certain systems with certain drivers. Still, migrating completely to GLSL is pretty much the only way to go: the ARB shaders are completely unmaintained and, obviously enough, GLSL is much easier to write than ARB assembly. GLSL also supports a large variety of extra features such as C-like loops, control statements, macro definitions and functions - it pretty much looks like C without its standard library. Moreover, if we ever wish to support Android systems, we will need solid GLSL support. With all this in mind, I changed how the GLSL compilation preprocessing is done - leaving most of the work to the driver (even if that's slightly less efficient) by sending 3 separate sources:
     1) GLSL version - defaults to #version 120 (OpenGL 2.1, the minimum required to use GLSL with our shaders right now)
     2) Shader defines - all #defines set for this shader compilation stage
     3) Shader source - the actual shader source
     The OpenGL driver appends all 3 sources into a single source and takes care of the preprocessing. This greatly reduces code on our side and also reduces the complexity and overhead of the ShaderDefines structure. The changes I've made lay some groundwork for future changes to the GLSL shader system.
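     The mechanism behind the "three sources" part is simply the multi-string form of glShaderSource, which concatenates the strings in order before compilation. Roughly like this (a sketch, not the actual engine call site; the define shown is just an example):

     // 3) the actual shader body (normally loaded from disk)
     const char* shaderBody =
         "void main() { gl_Position = gl_Vertex; }\n";

     const char* sources[3] = {
         "#version 120\n",          // 1) GLSL version line
         "#define USE_SHADOW 1\n",  // 2) defines for this compilation stage (example)
         shaderBody                 // 3) shader source
     };

     GLuint shader = glCreateShader(GL_VERTEX_SHADER);
     glShaderSource(shader, 3, sources, NULL); // NULL lengths: the strings are null-terminated
     glCompileShader(shader);                  // the driver concatenates the three strings in order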
     To end Week #2: currently I'm mostly working on debugging the patch to get it working on Linux. As soon as A14 is released, I'd like to commit this large patch to avoid any further SVN conflicts down the road - the patch is gigantic already. You can check it out here: http://trac.wildfiregames.com/ticket/1995
     No statistics this time around, though the numbers are obviously a lot higher than before.
     -----------------------------------------
     This is my current task list:
     -) Patch review
     -) Megapatch debugging
     -) ModelRenderer WIP
     -) Performance improvement fancy graphs
     -) PSA and PMD to COLLADA converter
     -) Migrate to C++11
     -) COLLADA -> PMD cache optimization to improve first-time cache loading
  2. Since the engine is pretty dynamic, we don't really have to close it down after applying the changes - the shaders should be reloaded automagically. Other than that, it sounds like an excellent feature and a must-have for A15.
  3. I should have specified the Intel GMA 900 series - my bad. The first (GMA 900) series wasn't that good and is the failure case, since it doesn't support OpenGL 2.0. Anything from the GMA X3100 up seems to support OpenGL 2.0. You can read more about it on Wikipedia.
  4. According to the statistics you provided, Wijitmaker, we can actually get the necessary information from /report/opengl/. GLSL 1.10, which runs on OpenGL 2.0, requires the extensions GL_ARB_vertex_shader and GL_ARB_fragment_shader:
     GL_ARB_vertex_shader (~GL2.0): 92%
     GL_ARB_fragment_shader (~GL2.0): 91%
     At first glance it looks like ~92% support GL 2.0. But if we take a closer look, we can see that the failed cases are either Windows users with ancient cards, or Linux+Radeon users with older drivers (which should be fixed by simply updating the drivers). You can clearly see that some cards that failed succeeded with newer drivers on Linux. There is also the case of Intel GMA graphics, which seems to be the main source of failures. But to be honest, I tried to run 0 A.D. on Intel GMA graphics only once and it ran as a glitchy blue mess at 2 FPS. So people who have those ancient cards can't run 0 A.D. anyway - even if the game somehow manages to start up and actually display something meaningful on screen, it's going to be unplayable.
  5. Intel GMA graphics is quite rubbish - it supports roughly OpenGL 1.4 and usually doesn't run anything decently. Last time I tried running 0 A.D. on my old laptop with Intel GMA graphics, the whole game was blue and it ran at around 2 FPS. In this sense it doesn't really matter - those computers are so ancient they can't run the game anyway. I think the most sensible move would be to enable "Prefer GLSL" in this release and then iron out all the GLSL bugs - I'm fairly sure most issues exist because the GLSL path hasn't been properly debugged on multiple platforms. Once we get around to ironing out the bugs, we can drop ARB shader support. How does that sound?
  6. Looking at the code right now, we have two different copies of shader code: 1) old legacy ARB shaders from the fixed-function era of OpenGL 1.5, and 2) OpenGL 2.0 GLSL shaders. It's been 10 years since OpenGL 2.0 was introduced, and since we've already decided to move away from fixed-function code, it would only make sense to drop the ARB shaders. For those not familiar with ARB/GLSL: this is code that runs on your graphics card to render pretty pixels on the screen. Right now the old ARB support adds quite an unnecessary amount of code to our shader pipeline. Removing it would make our code smaller, easier to manage and also faster, since we wouldn't need a preprocessor wrapper - and without the preprocessor wrapper, a lot of the shader compilation code gets smaller too. Furthermore, as long as we keep supporting the ancient ARB shaders, every change in shader code means also updating and testing the ARB shaders. We're already low on developers as it is, so I don't see any reason to justify that overhead. Even more important is that most code is written in modern GLSL - nobody uses old ARB assembly anymore, and nobody would update the ARB shaders anyway, which would hamper any shader development. Right now is the time to vote on this matter and seriously consider moving to GLSL permanently.
  7. Great! If you have any more questions, feel free to join our IRC dev chat: http://webchat.quakenet.org/?channels=0ad-dev
  8. Hello, Atilla, glad to have another programmer signing up to help with 0 A.D. Taking a look at the code, you still have some areas to improve in, but I think you can definitely do it. What I found:
     Comments - Even if this doesn't seem necessary to a beginner, you should honestly comment every function of your project with at least a brief description - even if the function itself is obvious. This habit has saved me a lot of trouble in the past when I have to edit code I wrote years ago. I suggest you follow the Doxygen commenting style, since that is what we use in 0 A.D.: http://en.wikipedia.org/wiki/Doxygen
     References - You don't seem to use pass-by-reference for objects. In general, it is a good habit to pass structures like Vector2f as 'const Vector2f&' to avoid a copy-constructor call. Roughly speaking, everything bigger than 8 bytes should be passed as a const reference (like Vector3f, which we use a lot). Here's an article on const correctness: http://www.parashift...view-const.html
     Const correctness - If you want to write good C++, you must also write your code to be const correct: functions that don't modify object state should be declared const and should return a const reference to any non-POD (plain-old-data) type:
     sf::Vector2i ClickableArea::getChoord1();              // before
     const sf::Vector2i& ClickableArea::getChoord1() const; // after
     Code formatting - While this is a philosophical debate at its heart, in general we want code to stick to this standard, which is the most readable: http://trac.wildfire...ions#Formatting
     Code quality - You should try to make code simple and readable; over-complicating code makes it harder for others to read. You should also notice that you made a serious programming error by passing 'std::vector<Minion*>' by value - this copies the entire vector instead of passing it as a reference, 'std::vector<Minion*>&'.
     For now I recommend joining the IRC developer chat, grabbing a copy of the SVN repository via our Trac and then finding yourself a ticket you feel comfortable with. Here's our guide for submitting patches: http://trac.wildfire...bmittingPatches Have fun!
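     To illustrate the pass-by-reference point above, here is a small self-contained sketch (the Minion name is taken from the reviewed code; the functions are hypothetical):

     #include <cstdio>
     #include <vector>

     struct Minion { int health; };

     // Bad: the whole vector (and its heap buffer) is copied on every call.
     static void CountByValue(std::vector<Minion*> minions)
     {
         std::printf("by value: %u minions\n", (unsigned)minions.size());
     }

     // Good: a const reference passes only an alias - no copy, no allocation.
     static void CountByRef(const std::vector<Minion*>& minions)
     {
         std::printf("by ref:   %u minions\n", (unsigned)minions.size());
     }

     int main()
     {
         std::vector<Minion*> minions(1000, (Minion*)0);
         CountByValue(minions); // copies 1000 pointers and allocates a new buffer
         CountByRef(minions);   // copies nothing
     }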
  9. That's probably because StarCraft II is a very well-polished game. Everything about it is fluent, smooth and logical; there are no obvious glitches, performance issues or gameplay annoyances. It just feels natural. Now that's a very, very well-designed game - that's why it got 9.5. We can weigh the pros and cons all day, but unless you actually play it, you won't ever really know. So my suggestion: play the game. That way you'll have a definite experience to compare against other titles.
  10. Having played all the AoE series games and also all the Starcraft series games, I must say that where the AoE series tries to immerse you in the epic history of mankind, Starcraft focuses on unprecedentedly fluent and balanced gameplay. This is why there are so many tournaments even for Starcraft: Brood War, which is 13 years old. Combat itself isn't horrible at all - it's actually very well balanced, in the sense that every unit has some sort of counter unit. Because of this you can formulate a strategy or 'build' for your game, and learning to improve and adapt it makes for great strategic play. What makes it fun to watch is the quick pace of the game - the early pushes usually happen around 4-5 minutes in. So it's not about turtling; it's about great tactics and strategic decisions, which makes it very fun to watch if you can follow what's going on. At the end of the day, though, all people have different tastes. I personally really enjoy all sorts of strategy games - all for different reasons. For me, comparing AoE vs Starcraft is like comparing Rome: Total War vs Total Annihilation.
  11. Awesome work, Don - very good job implementing a generic Quadtree class. I have only a few remarks:
     Memory pool implementation - We have already been working on a general solution for 0 A.D. that doesn't require fixing pointers when a pool runs out of memory. Furthermore, your pool doesn't reuse pointers, so a realloc is always imminent, which is not ideal for a memory pool. Check out the thread: [Discuss] Memory aka game slowing over time. Right now that work is pretty much finished, but it hasn't been put into the engine yet. Perhaps this is something that can be improved in the future? It would definitely make the code even simpler. Kudos on implementing a memory pool though - that's very important for a Quadtree.
     Power-of-two check - You are correct that an integer implementation of a Quadtree has to have a power-of-two size. However, you check [size % 2 == 0]. Consider the case [14 % 2 == 0]: 14 is not a power of two. You should do a proper pow2 check: http://graphics.stan...rmineIfPowerOf2
     Child layout - I know that having a 2D array m_Child[2][2] seems like a good idea, but for a Quadtree the preferred solution is to have 4 variables: NW, NE, SW, SE. Why? This lets us unroll the loops, which actually speeds the code up a lot. I can see, however, that you designed it that way so you could later reuse it for an Octree.
     I like how you calculate the child index - that way you get a constant search speed. Again, kudos for not using an std::vector for children: that solution seems to permeate most implementations and is very bad for performance. The MAX_CHILDREN approach is the best one. Again, good work. I'll be looking forward to what others think of this.
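     For reference, a minimal power-of-two check along the lines of the bit-twiddling trick linked above:

     // True only for 1, 2, 4, 8, ...: a power of two has exactly one bit set,
     // so clearing the lowest set bit (v & (v - 1)) must leave zero.
     static bool IsPowerOfTwo(unsigned int v)
     {
         return v != 0 && (v & (v - 1)) == 0;
     }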
  12. Awesome data there, scroogie. I guess the focus should be to minimize the usage of std::map (std::_Rb_tree) and really implement a memory pool. I'd start replacing the STL containers with the allocator variants, but it would probably break a lot of code. Plus I have a few patch changes pending, which would make it very hard to submit :/
  13. That's just C++ syntax for declaring static template members:
     template<class T> class MyClass
     {
         static T StaticMember;
     };
     template<class T> T MyClass<T>::StaticMember;
     It's quite convoluted, but I needed that step to have global static template instantiations. I also have a version of the pool that doesn't take any template arguments and is instantiated with constructor arguments instead; there doesn't seem to be any speed difference at all. This makes it a lot easier to manage these global pools, since I can use a lookup table for the bucket sizes, pointing to the correct pool.
     That looks perfect for what I'm doing. Right now writing a wrapper is actually a pretty horrible task! I've been trying to keep everything C++03 for now - that's why the horrible wrappers. If we move on to C++11 as a final verdict, I'll just have to delete a lot of superfluous code.
     You mean using mmap / VirtualAlloc? There's a problem with that solution. Sometimes the address following your paged memory section is unavailable, meaning mmap/VirtualAlloc will fail. To prevent that, we would have to reserve (but not yet commit) an unknown number of memory pages ahead of time. On a 64-bit system this wouldn't be a problem, but the 32-bit address space is much more limited in that sense. If we don't reserve the address space ahead of time, something else will use those addresses - and if mmap / VirtualAlloc fails, the application will crash.
     We could try reserving a large part of the address space. Right now I have 25 pool buckets in the bin_allocator. If we reserve a fixed 25 x 1 MB, it would soon crash, because 0 A.D. uses a lot more memory than that - as can be seen from the MTuner data. If we create an enum with special tuning data per bucket, we could perhaps avoid that, but it wouldn't apply to every case - still leading to a crash. However, we can keep the dynamic_pool structure as it is right now (an array of memory pools) and reserve each pool with 1 MB of virtual memory in 'reserve'. The pool can automatically commit more and more memory pages (commit is required in Win32) - up until 1 MB. Then it's full, and the dynamic_pool class does the work for us by allocating a new pool into its array. This would give better performance when we are forced to continuously expand the pools, which is something that happens a lot in 0 A.D. - though I'm not sure if the used memory count is due to fragmentation or to actual live allocations.
     Last note: multithreading. We can implement a mutex lock inside each pool, which is a lot better than the CRT's current mutex lock per malloc. Yeah, this is getting really complicated now.
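     As an illustration of the bucket-size lookup idea (the bin sizes mirror the 25 buckets listed later in this thread; the table and function names are my own, not the actual bin_allocator code):

     // Hypothetical sketch of mapping an allocation size to its pool bucket.
     static const unsigned kBucketSizes[25] = {
         4, 8, 12, 16, 20, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128,
         160, 192, 224, 256, 384, 512, 640, 768, 896, 1024
     };

     // Returns the index of the smallest bucket that fits 'size', or -1 if it's too large.
     static int BucketIndex(unsigned size)
     {
         for (int i = 0; i < 25; ++i)
             if (size <= kBucketSizes[i])
                 return i;
         return -1; // larger than 1024 bytes: fall back to the system allocator
     }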
  14. That's a good point - the problems reported by MTuner ended up just being really bad code. It's stuff that should and will definitely get fixed. Luckily I can just write a 'debug' layer that restores psvector and friends to the old allocators. This way we can still use MTuner to find memory-intensive code.
  15. So today I took some time to actually test the gross performance of the memory pool. The best way to test it (and the most suitable if it's ever used in 0 A.D.) was to write a custom std::allocator. I chose to create only two kinds of allocators (mostly because STL allocators are temporary objects). Now, some of you might remember that the "pool" and "dynamic_pool" classes are size based, so in order to make them fit with generic allocators we have to create a proxy:
     template<unsigned TSIZE> struct _global_pool
     {
         static dynamic_pool<TSIZE, true> pool;
     };
     template<unsigned TSIZE> dynamic_pool<TSIZE, true> _global_pool<TSIZE>::pool(true);
     This is pretty cool: we now have a "global_pool" allocator and a "bin_allocator" allocator. Both of them just access the global template instantiation of the _global_pool<>::pool variable. We can now test the allocator pretty easily:
     std::vector<int, global_pool<int>> vec;
     But that looks really nasty. Can you imagine having that all over the codebase? Me neither. A typedef would be cumbersome, since you'd have to declare it everywhere; soon the code would be full of typedefs - very counter-intuitive indeed. So the best I could do was to write wrappers on top of all the STL containers and also the string class. This works out perfectly, because global_pool<> will not work on variable-sized allocations, so you can't use it in std::unordered_map or std::vector, for example. By writing the wrappers we ensure that the correct type of allocator is used. As a plus, since STL containers are already heavily debugged and refined, we don't have to worry about crashing our memory pools anymore. To the wrappers:
     /**
      * Overrides std::vector and plugs in a bin_allocator as the allocator.
      * The bin_allocator allocates from global memory pools based
      * on the chunk size.
      */
     template<class T> class psvector : public std::vector<T, bin_allocator<T> >
     {
         // forwarding constructors omitted
     };
     Using it is really simple, and since it's just a subclass of std::vector, it has all of its functionality:
     psvector<int> vec;
     Ahh, that looks perfect. The P is for perfect and the S is for shiny. (Just kidding - the ps stands for pyrogenesis.) Finally, we can see the performance difference between syscall memory allocations and memory-pooled allocations:
     VC++11, Release, default optimizations
     1000x ps::stl   42ms
     1000x std::stl  1170ms
     A 28x performance difference. That's right. Just by plugging in a memory pool we won, by committing less overall memory and making fewer system alloc calls. Our allocations are faster and they take up less memory in the long run - isn't that sweet? Here's the test code (nothing fancy):
     Since I've finished the STL container wrapper and memory pool code, it'll only be a matter of time until this gets into the pyrogenesis source.
     Limitations:
     1) bin_allocator only 'bins' allocations in the range 4..1024 bytes. Even then, the actual allocation chunk sizes are:
     4 8 12 16 20 24 32 40 48 56 64 80 96 112 128 160 192 224 256 384 512 640 768 896 1024
     So it wastes some memory in favor of speed. However, this 'waste' of space will definitely not matter, since we win a lot more in the long term and the speed is just fantastic.
     ps::stl  mem-committed: 522K
     std::stl mem-committed: 384K
     2) The pool deallocation scheme is not linear. The global pools are auto-cleaned when they go empty - except the first pool, which is never deallocated. I'm afraid we'll have to live with this for now; cleaning the pools doesn't make sense anyways.
     What we could do further:
     1) Extend the wrappers over more structures such as std::stringstream and friends.
     2) Test how well this scales in 0 A.D. - the initial pool reserve can be increased for less overhead and to match 0 A.D.'s requirements. Remember that the number of pools is increased on demand and pools have a fixed initial capacity; if we know we will have 10,000 allocations, we can reserve the pool size beforehand.
     3) Give RedFox a cookie, since he spent all day (and night) making this beast work.
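     For anyone curious what such an allocator looks like, here is a minimal sketch of an std::allocator that forwards to a pool. The pool_alloc/pool_dealloc stand-ins and all names are my own illustration, not the actual pyrogenesis bin_allocator:

     #include <cstddef>
     #include <new>

     // Stand-ins: in the real thing these would forward to the size-binned global pools.
     inline void* pool_alloc(std::size_t size)       { return ::operator new(size); }
     inline void  pool_dealloc(void* p, std::size_t) { ::operator delete(p); }

     template<class T> class pool_allocator
     {
     public:
         typedef T value_type;
         typedef T* pointer;
         typedef const T* const_pointer;
         typedef T& reference;
         typedef const T& const_reference;
         typedef std::size_t size_type;
         typedef std::ptrdiff_t difference_type;

         template<class U> struct rebind { typedef pool_allocator<U> other; };

         pool_allocator() {}
         template<class U> pool_allocator(const pool_allocator<U>&) {}

         pointer address(reference x) const             { return &x; }
         const_pointer address(const_reference x) const { return &x; }

         pointer allocate(size_type n, const void* = 0)
         {
             return static_cast<pointer>(pool_alloc(n * sizeof(T)));
         }
         void deallocate(pointer p, size_type n) { pool_dealloc(p, n * sizeof(T)); }

         void construct(pointer p, const T& val) { new (p) T(val); }
         void destroy(pointer p)                 { p->~T(); }
         size_type max_size() const              { return size_type(-1) / sizeof(T); }
     };

     // Allocators that share the same global pools compare equal.
     template<class T, class U>
     bool operator==(const pool_allocator<T>&, const pool_allocator<U>&) { return true; }
     template<class T, class U>
     bool operator!=(const pool_allocator<T>&, const pool_allocator<U>&) { return false; }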
  16. A few updates on the ticket: http://trac.wildfiregames.com/ticket/1995 So far things are going slowly, since a lot of the bottlenecks pointed out by MTuner also have major design issues. In those situations, plugging in a memory pool or some other sort of cache is just evil - the right thing to do is to fix the design and throw away what isn't needed. The perfect example of this is the CParser class in /source/ps/Parser.h/.cpp, which is an evil custom regex parser straight from hell. It was so memory-inefficient that after removing it and replacing it with a traditional tokenizer, the startup time of the game dropped dramatically (here's the patch: #2005). Hopefully, once enough of these bottlenecks get ironed out, we'll start to really see the performance improvements.
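     By "traditional tokenizer" I just mean stepping through the string once and splitting on delimiters instead of running a general pattern matcher. A minimal hedged sketch of the idea (not the actual patch code):

     #include <string>
     #include <vector>

     // Split 'text' on any of the characters in 'delims', skipping empty tokens.
     // One pass, no regular expressions, no per-character allocations beyond the tokens.
     static std::vector<std::string> Tokenize(const std::string& text, const std::string& delims)
     {
         std::vector<std::string> tokens;
         std::string::size_type start = text.find_first_not_of(delims);
         while (start != std::string::npos)
         {
             std::string::size_type end = text.find_first_of(delims, start);
             tokens.push_back(text.substr(start, end - start));
             start = text.find_first_not_of(delims, end);
         }
         return tokens;
     }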
  17. Judging from the huge use of std::vector in the codebase, we'll definitely get a noticeable performance improvement. A little bit more reading: http://cpprocks.com/9-reasons-to-start-using-c11/
  18. Hi again guys. We've already discussed the obvious performance improvements that can be gained by using the latest C++ standard, C++11. The GCC and Visual C++ compilers both support it, and it would make sense to use it. The main gains are:
     1. Ranged-for loops. These are fast and convenient loops designed to iterate over containers that have ::begin() and ::end() iterators. You can easily make your own code work with ranged-for by supplying these two functions. The loop and its underlying implementation provide optimizations that would be tedious and ugly to write by hand:
     for (SVertex& vert : m_Vertices)
     {
         vert.m_Position.Z += deltaHeight;
     }
     2. The 'auto' keyword. This should be pretty self-explanatory - it makes awful code look pretty:
     std::map<int, std::string> dictionary;
     // C++03:
     for (std::map<int, std::string>::iterator it = dictionary.begin(); it != dictionary.end(); ++it)
     {
         // ...
     }
     // C++11:
     for (auto it = dictionary.begin(); it != dictionary.end(); ++it)
     {
         // ...
     }
     3. Smart pointers and new STL containers. A lot of boost implementations have been brought over to C++11, including shared_ptr<>, unique_ptr<>, unordered_map<> and more. It would make sense to use the implementations from the standard library now.
     4. Lambdas. These in-line functions are very handy for small functions that rely on callbacks:
     std::vector<int> v;
     std::for_each(v.begin(), v.end(), [](int n) { std::cout << n << std::endl; });
     5. Static assert. Before, we would have to rely on some boost implementation, but now the standard provides a neat way of ensuring template code correctness at compile time:
     template<class T> class Container
     {
         static_assert(sizeof(T) >= 4, "Specified T is not big enough for this specialized container!");
         // ....
     };
     6. Move semantics. The newly introduced move constructor and move assignment operator would give us a noticeable increase in performance because of the liberal use of std::string and std::vector copying. Since the C++11 STL implements move semantics, we can avoid doing a full copy of temporary objects.
     You can read more about C++11 here: http://www.codeproje...splus-Developer
     As for enabling C++11 features:
     GCC - just add the compiler switch -std=c++11 and we're good to go.
     Visual C++ - use Visual Studio 2010 or higher.
     Tell me what you think, guys. This is a small step for our compiler script, but a huge leap for pyrogenesis.
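     Since item 6 has no snippet above, here is a minimal hedged illustration of what move semantics buys us (the function and names are made up, not engine code):

     #include <string>
     #include <utility>
     #include <vector>

     // Builds a large list of names. In C++03, returning this by value could mean
     // a full deep copy; in C++11 the vector's move constructor just steals the buffer.
     static std::vector<std::string> BuildNames()
     {
         std::vector<std::string> names;
         for (int i = 0; i < 10000; ++i)
             names.push_back("unit_" + std::to_string(i));
         return names; // moved (or elided), not deep-copied
     }

     int main()
     {
         std::vector<std::string> names = BuildNames();      // no full copy in C++11
         std::vector<std::string> stolen = std::move(names); // explicit move: 'names' is left empty/unspecified
         return stolen.empty() ? 1 : 0;
     }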
  19. For the memory pool I'm using a custom implementation that is very lightweight. The code for it looks ugly because it's so heavily optimized, but that's something we'll have to live with to get the performance. The guy at CodeProject does a lot of unnecessary processing that slows things down. I use a singly-linked list for the freelist and an additional 'available' counter. I'll omit any debug code to keep this already complex sample simple. I'm using a C-like approach here to get the maximum juice out of the application. The pool objects are always dynamically allocated as [sizeof(pool) + bufferSize] - a clever trick to avoid allocating a secondary buffer inside the pool. You might also notice that I'm placing the freelist next pointer inside the actual memory block; this requires TSIZE to be at least the size of a pointer. Normally this is pretty dangerous, but in debug mode I revert to a conventional struct where the next pointer sits before the data. I also write a guard value into the place where the next pointer will live in release mode, so that I can catch any invalid writes to deleted pointers. This makes the code somewhat complex in debug mode, and we might not even need it, but it's nice to have an assertion fire when you make a mistake. In release mode the pool has no real overhead.
     If the pool is used as a temporary dynamic memory source with no deallocs, we could change the alloc function to prefer new nodes from <available> instead of <freed>. You can clear the state of the pool by just calling ::clear() - the pool will forget its freed list and reset the available count. Also, we don't really need the 'freed' counter - I'm just using it for accurate estimates of pool memory usage.
     As for C++11, we should definitely migrate to the new standard. There's no reason to stick to the old, slower one.
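     Since the sample itself isn't reproduced here, this is a minimal sketch of the scheme described above - next pointers stored inside freed blocks, an 'available' counter, and the pool allocated as a single [header + buffer] block. It's an illustration under those assumptions, not the actual pyrogenesis pool (no debug guards, no growth, no alignment handling):

     #include <cstdlib>

     // TSIZE must be at least sizeof(void*) so a freed block can hold the freelist pointer.
     template<unsigned TSIZE> struct fixed_pool
     {
         struct node { node* next; };

         node*    freed;      // singly-linked list of freed blocks
         unsigned available;  // number of never-used blocks left at the end of the buffer
         char*    buffer;     // first never-used block

         // Allocate the header and the block buffer as one contiguous allocation.
         static fixed_pool* create(unsigned capacity)
         {
             void* mem = std::malloc(sizeof(fixed_pool) + capacity * TSIZE);
             fixed_pool* p = static_cast<fixed_pool*>(mem);
             p->freed = 0;
             p->available = capacity;
             p->buffer = reinterpret_cast<char*>(p + 1); // blocks start right after the header
             return p;
         }

         void* alloc()
         {
             if (freed) // reuse a freed block first
             {
                 node* n = freed;
                 freed = n->next;
                 return n;
             }
             if (available) // otherwise hand out the next never-used block
             {
                 void* block = buffer;
                 buffer += TSIZE;
                 --available;
                 return block;
             }
             return 0; // pool exhausted
         }

         void dealloc(void* ptr)
         {
             node* n = static_cast<node*>(ptr);
             n->next = freed; // the freelist pointer lives inside the freed block itself
             freed = n;
         }
     };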
  20. Thanduel, that's some awesome data! How much memory did you need to process all of it? I can see that the number of allocs for 32-byte objects is around 11 million - that's huge! Poor CRT heap. I'm currently working on optimizing the performance hotspots reported by MTuner; I'm confident those same hotspots are the cause of the huge number of allocs.
  21. Exactly! We're on the same page now. That's exactly what I was thinking about when writing those snippets. You never really need to create any vectors - just traverse the QTree and invoke the function on any range matches. Can't wait to see what you come up with.
  22. A reason not to hide the implementation is performance. If you add a method std::vector<Object*> QNode::GetChildren(), you'll get a huge performance penalty due to the full vector copy. Consider writing an iterator function instead:
     template<class Lambda> inline void ForEachChild(Lambda func)
     {
         for (Object* o : mObjects)
             func(o);
     }
     // to use it:
     node->ForEachChild([&](Object* o)
     {
         if (o == this)
             return;
         if (pos.SquaredDistance(o->pos) <= collisionRange)
             collisionFunc(this, o);
     });
     But now that I think about it, a more efficient solution might be to leave the collision checking to the Quadtree structure itself. That would make more sense, I think:
     void QNode::CheckRange(void (*rangeFunc)(Object* a, Object* b), float range)
     {
         int jcount = mObjects.size();
         int icount = jcount - 1;
         for (int i = 0; i < icount; i++)
         {
             for (int j = i + 1; j < jcount; j++)
             {
                 Object* a = mObjects[i], *b = mObjects[j];
                 if (RangeCheck(a, b, range))
                     rangeFunc(a, b);
             }
         }
     }
     Or perhaps an unrolled version for a single node, if you know the max number of children is 4:
     void QNode::CheckRange(void (*rangeFunc)(Object* a, Object* b), float range)
     {
         // in this case objects is defined as "Object* objects[4];"
         // so we can use indices without any fear of operator[]
         // also, a way of checking the number of objects should be added;
         // right now I just assume RangeCheck handles that.
         if (RangeCheck(objects[0], objects[1], range)) rangeFunc(objects[0], objects[1]);
         if (RangeCheck(objects[0], objects[2], range)) rangeFunc(objects[0], objects[2]);
         if (RangeCheck(objects[0], objects[3], range)) rangeFunc(objects[0], objects[3]);
         if (RangeCheck(objects[1], objects[2], range)) rangeFunc(objects[1], objects[2]);
         if (RangeCheck(objects[1], objects[3], range)) rangeFunc(objects[1], objects[3]);
         if (RangeCheck(objects[2], objects[3], range)) rangeFunc(objects[2], objects[3]);
     }
     The unrolled version should be a lot faster. If you combine this with walking through nearby nodes and the lambda method, you never even need to pass temporary vector objects around - that way you take full advantage of what a Quadtree offers. Avoid using std::vector at all costs here: it's too heavy an object for this structure. You would have thousands of QNodes and thus thousands of std::vectors - very bad for performance.
  23. Sorry, sometimes we systems programmers forget that not everyone is on the same page with us. For us it's painfully obvious that the problem is memory. It's not about the total memory used; it's about the number of allocations/deallocations and the fragmentation caused by interleaved allocations of different sizes. Here's an example of how to get a memory fragment (it's very easy):
     void* p1 = malloc(4); // [4]
     void* p2 = malloc(4); // [4][4]
     void* p3 = malloc(8); // [4][4][8]
     free(p2);             // [4] 4 [8]
     void* p4 = malloc(8); // [4] 4 [8] [8]
     Now this isn't bad at all, actually. If someone happens to call malloc(4) again, everything is okay - the block is reused:
     void* p5 = malloc(4); // [4] [4] [8] [8]
     But in real-life applications this doesn't always happen - you might have something like this:
     // 128 [4] 4 [8] [8] ...
     void* p5 = malloc(4); // [4] 108 [4] 4 [8] [8] ...
     And no, I didn't make a mistake in the calculation: a heap block in Windows takes 8 bytes (header) + [8 bytes (alignment)], so the overhead of an allocation varies quite a bit, but thanks to this both the block header and the memory itself are 8-byte aligned. Case in point - the original 4-byte block was not reused and can now be considered a fragment. In large applications those small fragments accumulate over time, and the more you free other data, the more you cement that block's fragmentation. Since it's now at the very far end of the free list, it will probably never be allocated again.
     I wouldn't call it just a theory. In fact, I just posted a few graphs where we can see dozens of obvious fragments. Like I explained, those fragments can be described as blocks that pollute the free list but never get reallocated. What that does is critically diminish malloc() performance - you always need to traverse the free list to get a new block. Windows 7 has the Low-Fragmentation Heap, which is basically just like dlmalloc - a bin-based allocator - but it's not used when a debugger is attached.
     Now, before I get off-track (again): I got the opportunity to try out MTuner (30-day trial), which is a memory allocation profiler.
     Pyrogenesis rev #13491, Release, VC++ 2012
     Map: Fast Oasis, 27-second fast run
     Here's the memory graph from main menu -> load map -> exit of "Fast Oasis":
     Compared to other games I've seen graphs for, pyrogenesis has a very slow memory usage 'climb'. Typically you'd expect to see a very large peak right after hitting load - that's reserving memory, creating pools, etc. Pyrogenesis, however, has A LOT of small allocations. When I say a lot, I mean it - I was amazed, actually. Each allocation from the heap has a small overhead, and the smaller the allocation, the bigger the proportionate overhead. For example, if you allocate 4 bytes, the OS actually needs to set aside 16 bytes. Of course, in debug mode it's much, much worse than that, but I digress.
     If you need 100,000 4-byte objects, you can do it two ways:
     1) Allocate 100,000 4-byte objects from the heap:
        - Takes 1,600,000 bytes (due to each heap block's header)
        - Is ridiculously slow (100,000 mallocs)
        - Trashes the heap and fragments it (after you delete all of them, the heap has a huge segment of 4-byte blocks)
        - Deallocation takes a huge amount of time (100,000 frees)
     2) Allocate a 400,000-byte pool:
        - Takes 400,000 + ~12 (pool header) bytes
        - Is ridiculously fast (1 malloc)
        - Doesn't trash the heap at all; in fact the OS will VirtualAlloc a separate region of memory for such a large alloc
        - Deallocation is trivial (1 free)
     Now you may think "Well, 100,000 is a ridiculous hypothetical number and doesn't apply here at all" - but if you do, you're in for a surprise. Here's a histogram of allocations rounded up to fixed-size bins (the 8b bin contains the 1-, 4- and 8-byte allocs). It turns out the numbers in my example were actually rather small compared to the real thing, and I'd also like to note that this was just a 27-second run of the game:
     You can see that pyrogenesis is actually quite memory hungry. The number of allocations per second is huge - you can't afford something like this in an AAA title; over 15,000 allocations per second is just asking for trouble. The time it takes to iterate the free list is even worse, since we've fragmented memory with all those tiny allocs. We can debate all we want, but it's now painfully obvious where a lot of our problems lie: it's the sheer number of allocations that cripples us.
     There is a solution to this - a very simple one: custom allocators. The C++ STL containers were designed to allow custom allocators, and it's an easy change to make. The main beneficiaries would be the unordered_map and map structures, which rely heavily on small allocations. We should avoid arena allocators, since they are designed to last the entire lifetime of a program and won't fit nicely with game state changes. However, we can use dynamic memory pools that last for the span of a single game state and then get destroyed.
     I've already implemented one for my own A* pathfinder, since I'm using a sorted linked list instead of a vector.
     With CRT malloc/free: 1401 tiles visited in 1.66 ms.
     With a memory pool: 1401 tiles visited in 193 us - yes, that's right, microseconds.
     That's roughly a 10x performance increase, just from using a memory pool. All I need is to give the pool a size hint before I use it and it's good to go. How it's used:
     // template argument defines the size of allocation objects. Must be >= 4.
     // constructor size argument hints the number of objects a single pool can hold
     // before a new pool is required.
     dynamic_pool<sizeof(MyStruct)> poolCollection( PoolSize );
     MyStruct* st = (MyStruct*)poolCollection.alloc();
     // ...
     poolCollection.dealloc(st);
     The code can be even more efficient (and also much faster) if a plain fixed pool is used:
     // pools have a fixed size and the actual pool data is contained inside the structure;
     // that's why you have to allocate it dynamically:
     pool<sizeof(MyStruct)>* fixedPool = pool<sizeof(MyStruct)>::create( FixedSize );
     MyStruct* st = (MyStruct*)fixedPool->alloc();
     // ...
     fixedPool->dealloc(st);
     delete fixedPool; // delete when you're done
     Before I digress any further, I'd like to share one last bit with you guys: the actual hotspot data for allocations. These are concrete points in the pyrogenesis source that we can fix right now. I've already created a ticket here: #1995
     <memblock>, <total memops>, <stacktrace>
     size 4,   count 61320, modeldef.cpp:300 (!!!!!)
     size 8,   count 61320, modeldef.cpp:301
     size 48,  count 42028, parser.cpp:246
     size 36,  count 40866, guimanager.cpp:273 & 265
     size 8,   count 32749, modeldef.cpp:241
     size 32,  count 27638, textrenderer.cpp:136
     size 8,   count 21952, cgui.cpp:986
     size 16,  count 21952, textrenderer.cpp:77
     size 16,  count 21952, shaderdefines.cpp:135
     size 48,  count 21952, cgui.cpp:954
     size 12,  count 20688, modelrenderer.cpp:635
     size 112, count 17325, guirenderer.cpp:388 (!!!!)
     size 108, count 16896, patchrdata.cpp:173
     size 8,   count 14852, shaderdefines.cpp:116
     size 48,  count 14154, font.cpp:33 (!!!!)
     size 36,  count 13230, patchrdata.cpp:1025
     size 48,  count 12008, parser.cpp:981
     size 24,  count 9783,  patchrdata.cpp:1025
     size 8,   count 9645,  modelrenderer.cpp:433
     size 8,   count 8935,  modelrenderer.cpp:698
     size 48,  count 8265,  cgui.cpp:916
     size 48,  count 8144,  parser.cpp:933
     size 93,  count 7806,  guitext.cpp:266
     size 132, count 7806,  cgui.cpp:689
     size 132, count 7806,  cgui.cpp:800 & 855
     size 12,  count 7072,  modelrenderer.cpp:636
     Further hotspots: textrenderer.cpp:77, textrenderer.cpp:179, patchrdata.cpp:1052, guitext.cpp:210, terraintextureentry.h:75, patchrdata.cpp:192, cgui.cpp:675, texturemanager.cpp:150, texturemanager.cpp:153, texturemanager.cpp:511, texturemanager.h:136, parser.cpp:639, componentmanager.cpp:620
  24. A quadtree is just a tree structure and isn't anything special in itself - it's how you use it, and especially how you specialize it, that matters. If you simply implement a quadtree and put a bunch of objects in a Q-node, you still don't have anything useful. However, if you also store the pointer to the Q-node inside the object, you get something you can use effectively:
     struct Object
     {
         QNode* node; // NULL if not in the quadtree, or a valid pointer to a quadtree node
         Vector3 pos; // 3D position of this game object

         /**
          * Checks if this Object collides with anything and calls the specified collisionFunc
          */
         void CheckCollision(void (*collisionFunc)(Object* a, Object* b), float collisionRange)
         {
             collisionRange *= collisionRange; // pow2(collisionRange)
             for (Object* o : node->children)
             {
                 if (o == this)
                     continue; // ignore this object
                 // is it in range?
                 if (pos.SquaredDistance(o->pos) <= collisionRange)
                     collisionFunc(this, o); // trigger collision
             }
         }
     };
     An important aspect of this approach is that you first move the objects and then check for collisions. Based on the collision events you modify the path/position, and then you move on to the next frame.
     Haha, don't worry - the Quadtree is just a very versatile structure and can be used in many, many ways, for example in pathfinding to dynamically reduce the number of grid cells.