RedFox
WFG Retired
Everything posted by RedFox

1. I started this week by debugging Megapatch and its UTF8 changes. However, since I wanted to really improve performance for A15, I decided to pick a bottleneck in the current code and give it a noticeable algorithmic boost that would make the game, well - playable. RangeManager looked like an easy enough place to start... Oh how wrong I was. The amount of research, debugging and despair packed into this week is hard to put into words. I was up there with significant improvements, and I also visited the deepest pits of despair when it all failed to deliver. It was a rollercoaster of fail that I will definitely remember as a lesson learned - the hard way.

Enough with the poetics, let's get down to business: I spent a lot of effort this week on performance profiling and on researching spatial subdivision methods. The bad news: I failed to produce any noticeable results. The good news: I have ideas for how to make it work. I spent 63 hours in total, so I definitely don't think it was completely wasted.

Week #6 09.09 - 15.09
09.09 14-22 (08) - Bugfixes on megapatch UTF8 changes. Console debugging finished!
10.09 08-23 (15) - HPA* research. RangeManager and SpatialSubdivision optimization.
11.09 08-19 (11) - RangeManager optimization and testing.
12.09 00-07 (07) - RangeManager testing.
13.09 20-06 (10) - RangeManager debugging. Profiling RangeManager bottlenecks.
14.09 20-03 (07) - Disappointment. Research, Sweep and Prune?, analysis of the RangeManager algorithm.
15.09 16-21 (05) - Weekly report. Megapatch.

1. Bugfixes and progress on Megapatch.

What's the status?: The UTF8 transition is almost complete. Only debugging remains.

I started the week by fixing critical issues reported by historicbruno and finished a rewrite of the console code so that it works properly with UTF8. The next step will be to make GUI CInput support UTF8, which will take a day for sure.

2. RangeManager and SpatialSubdivision optimization.

What's RangeManager?: This is the module that manages range queries. It is important for UnitAI so that archers and infantry can engage enemies that enter their line of sight.

What's SpatialSubdivision?: This is our old, naive attempt at reducing the number of range comparisons. It's not very efficient.

What's the issue?: RangeManager has a huge bottleneck in the ExecuteActiveQueries function. The time is mostly spent extracting sorted entity lists from SpatialSubdivision and then comparing the lengths.

So this issue, even though it sounds sort of trivial, is not trivial at all. There are a hundred ways to do this, and most of them are really bad. Unfortunately, we have one of the bad ones. In one of the many optimization threads I posted a thorough analysis of the whole algorithm. You can view the relevant post HERE.

We can also see that Big-Oh notation is pretty useless for describing our folly, since operations that should be marked as O(1) are definitely non-trivial and carry a constant multiplier on the amount of time spent. With a rough estimate, however, I concluded that our algorithm lies somewhere between O(n^2) and O(nlog2n). Both of those are pretty bad, because with SpatialSubdivision the performance needs to be below O(nlog2n) if we want it to be acceptable.

So, to get an actual idea of what's happening behind the scenes, I decided to measure:
1) How many QUERIES per turn do we have?
2) How many MATCHES are retrieved from SpatialSubdivision for filtering?
3) How many RESULTS do we actually get after filtering?
Acropolis 1 IDLE
EAQ Queries: 209
EQ Matches: 7947
EAQ Filtered Results: 1

We can see that we spend most of the time filtering MATCHES. For every QUERY we get about 40 MATCHES to filter. So in Acropolis the algorithm looks more like O(40n*c1), where c1 is the constant time it takes to compare each result. We can't reduce N, but we can reduce the number of possible MATCHES retrieved from SpatialSubdivision, and we can also speed up c1 to make each comparison cheaper.

Combat Demo (Huge) IDLE
EAQ Queries: 2528
EQ Matches: 685496 <<< !!!
EAQ Filtered Results: 0

Oh boy. It's scary how powerful CPUs are nowadays that we never noticed the whopping 0.7 million distance comparisons done in the Huge combat demo. It's obvious we're returning far too many "positive" results from SpatialSubdivision. We really need to reduce that number any way we can. If the algorithm were O(nlog2n), we would have ~28575 matches. If it were O(n^2), we would have ~6.3m. In reality there is no way to measure this with Big-Oh notation.

Combat Demo (Huge) COMBAT
EAQ Queries: 1676
EQ Matches: 415393
EAQ Filtered Results: 11421 <<< !!!

Once the combat has been going for a while, the number of possible MATCHES drops - however, the filtered results are the second bottleneck in the algorithm, and this part is much more expensive than filtering all the matches.

So, what have we learned? We have to:
1) Reduce the number of possible MATCHES.
2) Speed up distance filtering [CFixedVector2::CompareLength()].
3) Speed up processing of RESULTS.

-------

I've learned a lot from this, and it's much easier to approach the problem now that I have actual relevant data instead of bogus Big-Oh graphs that don't really help at all. Oh how low CS has fallen.

3. RangeManager patch debugging. Disappointment.

What's the issue?: You were on the right path, right?
Well, sort of yes and no on this one. I was focusing on reducing allocations in general, but there were flaws in the optimizations, and after correcting them the results are less impressive.

The first graph (thanks to historicbruno) showed that we were looking at some pretty good performance improvements in general. It looked as if it would make RangeManager ~5x faster.

-----

However, this turned out to be false. The patch caused an issue in ObstructionManager, which actually reduced the performance of that module! After fixing the bugs, the actual gain is not that impressive. Before, idle: 17ms; after, idle: 12ms. So barely any improvement at all! In historicbruno's graph you can see that even though the patch speeds up one module (red), the overall performance gets worse (green).

I was really devastated by the failure - historicbruno and I spent the whole week furiously debugging this and we ended up empty-handed. I guess this is a lesson showing that optimization is not something trivial. It takes a huge amount of effort to optimize bad algorithms without breaking the existing logic. My strategy was to avoid excessive sorting by returning MATCHES from SpatialSubdivision as unsorted blobs that were guaranteed to have unique handles. However, this immediately broke ObstructionManager, which relied on sorted and unique results. It also broke games with very large buildings like a Wonder - making units unable to attack it. The performance gain was noticeable, but the subtle interdependencies around the engine ensured that I couldn't have my way.

4. Analysis of the RangeManager algorithm.
If you missed the link in the wall of text above, you can take a look at my attempt at an algorithmic analysis of RangeManager's EAQ. It really shows how bad Big-Oh is for actual performance profiling, since it doesn't help us at all. http://www.wildfireg...120#entry274300

5. Sweep and Prune.

In the Performance Optimizations thread, jcantero suggested using the Sweep and Prune algorithm. It actually sounds really useful if we implement it properly. Right now I'm thinking each entry in the two arrays would have inner bounds for the entity bounding box and outer bounds for the entity range. To avoid constant iteration, I also plan on using a lookup entry which stores the last known approximate index of the object in the array. If the structure has changed, it would revalidate the index, which is very easy, since objects usually move very little. Furthermore, this method would always return unique entity IDs that are correctly distance sorted! Isn't this perfect? Even if this algorithm were O(nlog2n), it would be ridiculously fast because of its simplicity and all the extra work we're not doing. Here's an article on Sweep & Prune: http://jitter-physic...sweep-and-prune

Expect to hear back about Sweep and Prune soon. And kudos to jcantero for suggesting it.

----

This is it for now! I hope you liked this update; even though it's a testament to a week full of failures, we still learned a lot - especially that optimizing existing "finished" code is neither easy nor trivial. In the future we should put 2 extra days of effort into our modules, to avoid weeks of optimization and debugging a year later.
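To make the idea a bit more concrete, here is a minimal single-axis sketch of the interval-overlap variant of sweep and prune. The layout and every name here are my own illustrative assumptions, not the design that will eventually replace SpatialSubdivision, and this sketch does not yet return results sorted by distance:

#include <cstdint>
#include <vector>

// One axis of a sweep-and-prune structure: entries stay sorted by their
// interval start, and a query collects every entity whose interval overlaps
// the query interval. 'min'/'max' are the outer bounds (position +/- range).
struct SweepEntry
{
    uint32_t id;   // entity id
    float    min;  // position - range along this axis
    float    max;  // position + range along this axis
};

struct SweepAxis
{
    std::vector<SweepEntry> entries; // kept sorted by 'min'

    // Entities move very little between turns, so one insertion-sort pass
    // restores ordering in close to linear time.
    void Resort()
    {
        for (size_t i = 1; i < entries.size(); ++i)
        {
            SweepEntry e = entries[i];
            size_t j = i;
            for (; j > 0 && entries[j - 1].min > e.min; --j)
                entries[j] = entries[j - 1];
            entries[j] = e;
        }
    }

    // Each entity has exactly one entry, so the candidates come out unique -
    // unlike tile buckets, where a large building appears in several tiles.
    void Query(float qmin, float qmax, std::vector<uint32_t>& out) const
    {
        for (size_t i = 0; i < entries.size() && entries[i].min <= qmax; ++i)
            if (entries[i].max >= qmin)
                out.push_back(entries[i].id);
    }
};

A second axis (or the inner bounding-box bounds mentioned above) would prune the candidate list further before the precise range test.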
2. Hi jcantero. For the sort of problem we're having, it's really hard to find an ideal solution. After a week of hassling with this problem, I have ascertained for sure that the current system is definitely the wrong way to do it. Let me explain the problem of RangeManager alone. We also have ObstructionManager, which relies on the same spatial subsystem, but it's not the focus of patch 1707.

RangeManager:
1) Entities have AABBs; we interpret these as approximated world-aligned BBs in the SpatialSubdivision class.
2) Entities have RangeQueries - meaning they want to know about other entities in their LOS or range.
3) RangeManager provides each RangeQuery entity with a list of Added and Removed entities.

I think this is the simplest way I can put it. RangeManager doesn't deal with collisions (that's up to ObstructionManager); it deals only with range-based queries. Herein lies the problem though - all of this, when put into code:

- Iterate over all RangeQueries: O(N).
- Perform Query: SpatialSubdivision::GetInRange(r) -> M, gets entities in range: O(r/64 + 1) ~> C1
  - Sort and remove duplicate entities (buildings span several tiles): O(Mlog2M) + O(M)
- Iterate over results: O(C2*M), where C2 is constant non-trivial time. It takes work to validate results.
- Validated results are pushed to results R.
- Find Added (entities that entered range) by calculating the difference from Rold: O(2(Rold+Rnew)) -> A
- Find Removed (entities that left range) by calculating the difference from Rold: O(2(Rold+Rnew))
- Sort Added by distance: O(C3*Alog2A), where C3 is a non-trivial constant. Each comparison is expensive due to our bad implementation of a fixed-point number library. Our CFixedVector2::CompareLength() function shows up very strongly in the performance profiler.

Since I'm not the author of the original code, it took me about a week to analyze all of this. The code is definitely non-trivial; for example, we need to sort the items by distance before the end result of each iteration can be applied, because a lot of code relies on this distance-sorted list of new entities.

If we put all of the above together and do some very general approximations:

O(N) * [ O(C1*Mlog2M) + O(C2*M) + O(8R) + O(C3*Alog2A) ]

Now, O(8R) can be considered rather trivial, and the same actually applies to O(C3*Alog2A) unless we have a huge number of entities entering range, which rarely happens. So if we remove those:

O(N) * [ O(C1*Mlog2M) + O(C2*M) ]

C1 and C2 are both non-trivial constants, so we have to account for them. C1 is the number of subdivision tiles explored, which is usually 4 tiles in the actual game. However, if we assume that copying entity IDs takes a trivial amount of time, we could forget about C1 - though it's nice to know that traversing the SpatialSubdivision also takes some time. C2 is the precision test we do to validate which entities are in range. It's definitely non-trivial and should be accounted for.

Now if we take all of it together, the most important quantities are N and M - the queries and the results:

O(N*Mlog2M) + O(N*M)

We know that M is always smaller than N, since M is a subset of N. However, there are N subsets. The only place we can put this algorithm is somewhere between O(n^2) and O(nlog2n).

So what's killing the performance here? The sorting we do in SpatialSubdivision::GetInRange(r) and the validation of those results. This is all under PerformQuery. We should return an already distance-sorted list of entities and turn validation of those entities into something trivial.
Here's a comparison chart from bigocheatsheet.com:
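For readers who prefer code over Big-Oh terms, here is a rough C++ reconstruction of the per-turn loop analysed above, with the cost of each step marked in comments. All names and types are simplified placeholders, and the expensive hooks are stubbed out; the real CCmpRangeManager code differs in detail:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

typedef uint32_t entity_id_t;

// Placeholder query record; the real RangeManager stores more state per query.
struct Query
{
    entity_id_t owner;
    std::vector<entity_id_t> lastResults;   // sorted results from the previous turn (Rold)
};

// Stub hooks: in the engine these are the subdivision lookup, the fixed-point
// range test (CFixedVector2::CompareLength) and the distance sort.
static std::vector<entity_id_t> GetInRange(const Query&) { return std::vector<entity_id_t>(); }
static bool IsInRange(entity_id_t, const Query&) { return true; }
static void SortByDistance(const Query&, std::vector<entity_id_t>&) {}
static void PostRangeUpdate(const Query&, const std::vector<entity_id_t>&, const std::vector<entity_id_t>&) {}

void ExecuteActiveQueriesSketch(std::vector<Query>& activeQueries)
{
    for (size_t i = 0; i < activeQueries.size(); ++i)         // O(N) queries per turn
    {
        Query& q = activeQueries[i];

        std::vector<entity_id_t> matches = GetInRange(q);     // ~O(r/64 + 1) tiles copied (C1)
        std::sort(matches.begin(), matches.end());            // O(M log2 M)
        matches.erase(std::unique(matches.begin(), matches.end()), matches.end()); // O(M)

        std::vector<entity_id_t> results;                     // O(C2 * M): precise range filtering
        for (size_t m = 0; m < matches.size(); ++m)
            if (IsInRange(matches[m], q))
                results.push_back(matches[m]);

        std::vector<entity_id_t> added, removed;              // O(2 * (Rold + Rnew)) set differences
        std::set_difference(results.begin(), results.end(),
                            q.lastResults.begin(), q.lastResults.end(),
                            std::back_inserter(added));
        std::set_difference(q.lastResults.begin(), q.lastResults.end(),
                            results.begin(), results.end(),
                            std::back_inserter(removed));

        SortByDistance(q, added);                             // O(C3 * A log2 A), expensive comparisons
        q.lastResults.swap(results);
        PostRangeUpdate(q, added, removed);                   // UnitAI reacts to the Added/Removed lists
    }
}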
  3. ENSURE is a simple macro that is always used, regardless of Release/Debug. ASSERT uses ENSURE. ASSERT is undef'd in Release builds.
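A minimal sketch of that relationship (these are not the engine's exact definitions - the error-reporting call is a placeholder, and this version makes ASSERT a no-op rather than literally undefining it):

#include <cstdio>
#include <cstdlib>

// Placeholder for the engine's real error reporting / crash dialog.
inline void ReportError(const char* expr, const char* file, int line)
{
    std::fprintf(stderr, "Assertion failed: %s (%s:%d)\n", expr, file, line);
    std::abort();
}

// ENSURE: checked in every build configuration.
#define ENSURE(expr) \
    do { if (!(expr)) ReportError(#expr, __FILE__, __LINE__); } while (0)

// ASSERT: same check in debug builds, compiled out in release builds.
#ifndef NDEBUG
#  define ASSERT(expr) ENSURE(expr)
#else
#  define ASSERT(expr) ((void)0)
#endif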
4. Alright, I finished tuning ExecuteActiveQueries a little bit. The gain isn't anything groundbreaking, but the idle execution time in Combat Demo (Huge) went down from 17-20ms per frame to 2-3ms per frame. Of course, if the units get close and start fighting, the time spent escalates - although at that point ComputeShortPath becomes the dominant bottleneck, taking over 600-700ms... So this little fix should be a nice addition to A15. You can look at the ticket and patch here: http://trac.wildfire...1707#comment:41
5. Nice callgraph scroogie! Looks like the main bottlenecks identified are:
CComponentManager::BroadcastMessage (4.35% exclusive, loops at lines 823/832)
CComponentManager::PostMessage (1.72% exclusive)
CCmpRangeManager::PerformQuery (0.99% exclusive, loop at 269)
CCmpRangeManager::ExecuteActiveQueries (10.46%) <-- should probably start with this
CCmpPathfinder::ComputeShortPath (0.78%) <-- this is definitely a lot more in the late game
The rest of it seems to be spread across all the JS functions.
6. The resource system in Pyrogenesis (the engine that drives 0AD) is quite aggressive in loading and caching the raw file data in memory. So in essence we're already dealing with a GC-like system here, although we actually never free these resources, since keeping them in the cache seems more worthwhile. The actual memory management for resources is also extremely simple during the loading sequence. The memory allocation for each model, for example, only consists of a few malloc's and matching free's in the destructor. I wouldn't even call that a hassle for any seasoned C++ developer, since adding a matching delete to a new is like closing { brackets; }. The fact of the matter is, we just need some clean and robust C++ to get the performance. Right now it's all about optimizing old, inefficient algorithms that were written 3 years ago and were classified as "just a temporary hack".
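For illustration, the pattern being described is just ordinary constructor/destructor pairing. A hypothetical model resource might look like this (the names are made up, not the engine's actual classes):

#include <cstddef>

struct Vertex { float x, y, z; };

// Hypothetical model resource: the allocation in the constructor has its
// matching release in the destructor, "like closing a bracket".
class ModelData
{
public:
    explicit ModelData(std::size_t vertexCount)
        : m_Vertices(new Vertex[vertexCount]), m_Count(vertexCount)
    {
    }
    ~ModelData()
    {
        delete[] m_Vertices; // matching delete for the new[] above
    }

private:
    // non-copyable: the raw pointer owns the allocation
    ModelData(const ModelData&);
    ModelData& operator=(const ModelData&);

    Vertex*     m_Vertices;
    std::size_t m_Count;
};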
7. I see where you're coming from, but the current performance issues are not related to language or OpenGL version. The performance issues are caused by code with very bad worst-case behaviour, like CCmpRangeManager using insertion sort in a tight loop, which is O(n^2) in the worst case - in short, it's horrible. If we concentrate on ironing out algorithmic bugs, we'll have very good performance. A garbage collector is a very bad idea for intensive real-time systems such as 0AD. I've worked on games in C# before and the GC always got in the way once the engine was sufficiently complex. Furthermore, debugging dangling references that cause obscure leaks is just ridiculous. In general we do very little actual memory allocation with the new patch - you can see it in the memory timeline graph. Once the game is underway it's mostly smooth. If we used C#, memory usage would keep climbing very fast until it hit a GC cycle - then the whole game would freeze for half a second. Definitely not something we want. Ever. Dealing with the JS GC is quite enough trouble already. The best approach here is to allocate as much as possible beforehand and not do any allocations or deallocations during the game loop. This is something you can't really do effectively with GC-based languages.
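A sketch of what "allocate beforehand, never in the game loop" means in practice. The container and names are hypothetical, not actual engine code:

#include <cstddef>
#include <vector>

struct Projectile { float x, y, vx, vy; };

class ProjectileManager
{
public:
    // Called once during loading: one big allocation up front.
    void Init()
    {
        m_Projectiles.reserve(4096);
    }

    // Called during the game loop: push_back stays inside the reserved
    // capacity, so the hot path performs no heap allocations at all.
    void Spawn(const Projectile& p)
    {
        if (m_Projectiles.size() < m_Projectiles.capacity())
            m_Projectiles.push_back(p);
    }

    void Update(float dt)
    {
        for (std::size_t i = 0; i < m_Projectiles.size(); ++i)
        {
            m_Projectiles[i].x += m_Projectiles[i].vx * dt;
            m_Projectiles[i].y += m_Projectiles[i].vy * dt;
        }
    }

private:
    std::vector<Projectile> m_Projectiles;
};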
8. We completely agree - optimization has been the top priority for a long time now and we're only just starting to make breakthroughs on it. Since 0AD is cross-platform, we are somewhat limited in the optimizations that can be applied; even worse, some optimizations break the game on some platforms while working without issues on others. For example, OSX OpenGL drivers fail to adhere to GLSL compilation standards, meaning we need custom code for macro preprocessing. Worse still are optimizations that speed up some machines and slow down others. Some rendering code can be adjusted to favor powerful GPUs, giving a noticeable performance boost; the downside is that weaker GPUs struggle. However, it is mostly agreed that AI and pathfinding are the single biggest issues in the game, and unfortunately both are non-trivial. Pathfinding is still one of the hardest things to get right in the gaming industry, since naive A* or Dijkstra algorithms don't perform well at this scale. We are considering HPA* with clearance-based pathfinding; the relevant article can be found here: http://aigamedev.com/open/tutorial/clearance-based-pathfinding/ The biggest issue with the AI is that it's in JS and does some dangerously complex things that it shouldn't be doing. We obviously need more dedicated C++ programmers to solve these issues. That's the reason for the fundraiser.
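To illustrate the clearance part of that approach: each grid cell stores the size of the largest obstacle-free square anchored at it, so a unit with a k-tile footprint only needs to consider cells with clearance >= k during A*. This is a generic sketch of the idea from the linked article, not our pathfinder's actual data layout:

#include <algorithm>
#include <vector>

// passable[y * w + x] != 0 means the tile can be walked on.
// Returns, for every tile, the side length of the largest passable square
// whose top-left corner is that tile ("true clearance").
std::vector<int> ComputeClearance(const std::vector<char>& passable, int w, int h)
{
    std::vector<int> clearance(static_cast<std::size_t>(w) * h, 0);
    for (int y = h - 1; y >= 0; --y)
    {
        for (int x = w - 1; x >= 0; --x)
        {
            if (!passable[y * w + x])
                continue;                    // blocked tile: clearance stays 0
            if (x == w - 1 || y == h - 1)
            {
                clearance[y * w + x] = 1;    // border tiles fit only a 1x1 unit
                continue;
            }
            int right = clearance[y * w + (x + 1)];
            int down  = clearance[(y + 1) * w + x];
            int diag  = clearance[(y + 1) * w + (x + 1)];
            clearance[y * w + x] = 1 + std::min(right, std::min(down, diag));
        }
    }
    return clearance;
}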
  9. Looks like a strong majority ended up supporting ARB. Perhaps we could try optimizing our GLSL shaders to reduce some of the performance differences. If there is barely any difference, we can continue optimizing our code for GLSL specifically, which will pay off in the long term. What do you guys think?
10. To be perfectly honest, the first four are issues that a modern compiler can easily catch, and they don't hamper development. I've never seen a dangling pointer error without getting a crash beforehand, thanks to the VC++ Debug Runtime. The last issue though - platform-specific errors - has been the bane of my existence for the last 2 months. Code that works on Windows doesn't on OSX/Linux due to driver and compiler differences.
11. I think this thread sums up why the R2TW UI is so bad compared to Shogun 2: http://forums.totalw...on-of-Shogun-2. Notice how much easier it is in Shogun 2 to see whether your units are firing arrows, fighting or doing something else. In Rome 2 it's just a mess - really hard to tell what your units are doing from the UI.
12. I think I might have overworked myself last week. Pulling 88-hour weeks isn't really healthy and has left me wanting some free time. So the fifth week shines in its modest 32 hours.

Week #5 02.09 - 08.09
02.09 10-22 (12) - Performance profiling data, fancy graphs, screenshots, performance fixes.
03.09 17-20 (03) - Clipboard code on linux/osx fixed. Bugfixes.
04.09 18-20 (02) - Testing megapatch. ModelRenderer optimization.
05.09 23-06 (07) - ModelRenderer glSwapBuffers fixed. GUI occlusion culling.
06.09 13-17 (04) - GUI culling finished and tested. Fixing reported bugs.
07.09 17-19 (02) - Fixed bugs reported by historicbruno.

1. Performance Profiling Data

You probably noticed the thread with fancy profiling graphs and comparisons. It actually took me quite a while to get all the data, analyze it, and fix performance issues as they came along. All in all, this testing cycle fixed some performance issues in Megapatch, so I consider it a significant improvement - especially in the GUI. You can check the thread HERE; I won't be copy-pasting all of that.

2. Clipboard and UTF8 Bugfixes

Mostly UTF8 support for APIs on Linux and OSX that previously worked in UCS-2.

3. ModelRenderer optimization

A while back I was forced to revert all my efforts on ModelRenderer. Now I've brought all those optimizations back, so rendering is noticeably faster. I was also able to fix the glSwapBuffers slowdown. Apparently a bug caused models to be incorrectly sorted/batched, so the GPU had to spend a lot more effort rendering the scene. The slowdown in glSwapBuffers was basically the CPU waiting on the GPU to finish rendering into the backbuffer.

Since we're already doing a combination of alpha blending and alpha testing for all our transparent models, we don't actually have to do any distance sorting of the models. This sped up the renderer a lot!

Before: Transparent models were sorted by distance. This made the code very complex.
After: No more sorting is done; everything is handled by the GPU.

4. GUI culling

This is an important fix for the lobby - in order to have an acceptable framerate we have to do some additional culling of on-screen objects. This means we skip rendering and ticking GUI pages that are active but hidden from view. This gives a really good performance increase for the lobby, where we have thousands of lines of text to render - the previous rendering algorithm was very, very inefficient in such cases, and performance would gradually drop over time until everything became unplayable.

Before: 200 FPS that eventually degraded into 20 FPS or less in the Match Setup screen.
After: A good and consistent 500 FPS in the Match Setup screen.

5. Bugfixes

Historicbruno did some review on megapatch and found quite a lot of critical issues. I spent a few hours fixing those to ensure the game doesn't crash when switching to Atlas from the in-game menu.

--------------

This is it for week #5 - the next one will be a lot bigger.
13. It will only look like a mess in C++ if we do a direct port. However, if we put our heads together with all the people who have been working on the AI, draw up some actual design requirements and structure, we can design something elegant and readable. It really depends on how much effort we put into it - though at least in C++ it won't be such a performance hog.
14. Yves, wraitii: I think a quick fix could be to add a "length" member to the entities object. We can increase the counter as needed. At any rate, it looks like a mess. We should definitely consider moving it to C++.
15. Alright, looks like most of the logic is in entitycollection.js. Isn't this bad?

entitycollection.js : 10
Object.defineProperty(this, "length", {
    get: function () {
        if (this._length === undefined)
        {
            this._length = 0;
            for (var id in entities)
                ++this._length;
        }
        return this._length;
    }
});

Why doesn't it use entities.length()? If there are deleted entities in the entities array, then that might explain the gradual degradation of overall performance.
  16. Looks like there's some O(nlogn) action going around, so it seems to iterate over all entities and does something on a binary-tree structure (totally shooting in the dark here based on the graph alone). Is the entity-delta code entirely in JS?
17. The profiling data captures about 1 minute, resulting in a 1GB data file. It takes roughly 4GB of RAM to process that data. So at best I think I can capture ~2 minutes before it becomes impossible to process the profiling data. Perhaps if I load a savegame with some intensive AI action...?
18. 2. Performance Profiling with MTuner

MTuner is a really useful tool for profiling memory usage and detecting leaks. The goal is to compare A14 Release vs A15 Dev in overall memory allocation intensity. The actual peak memory usage doesn't really concern us, since we can always sacrifice memory space for speed. What we can't sacrifice, however, is memory latency - meaning we must reduce the number of dynamic allocations to the lowest we can get. The A15 Dev patch actually focuses heavily on optimizing memory usage and reducing allocation bottlenecks where they are detected. Even though I've been working furiously on the current patch, I still haven't been able to remove all the bottlenecks. There is still a lot of room to improve.

Both sessions were profiled on Cycladic Archipelago 6; although I wasn't able to exactly repeat every action and movement I took, the results were accurate enough over several sessions.

1) Memory usage Timeline

This type of graph shows the overall memory usage during the lifetime of the program and gives us a rough idea of what the program is doing. We can also see JavaScript's GC heap slowly growing, resulting in the slight rise in memory usage. If we recorded enough data, we would see a saw-like /|/|/|/| pattern in memory usage. Profiling projects with a GC heap is ridiculously hard, because you never know if it's a leak or just the GC delaying its collection cycle.

A14 Release: The game starts up slightly under the 32MB mark, and once loading begins, memory usage starts slowly climbing. The whole loading sequence takes around 6 seconds to finish. After that we see a usage graph with lots of tiny spikes. The first thing we can see is that memory is allocated gradually during loading, which is actually not that great for loading times - we spend a lot of time waiting on the disk to read our data, then we do a lot of processing on the single loaded asset, just to wait on the disk some more. It's hard for me to judge loading times, since I have an SSD that pretty much eliminates any IO latency, but others have reported up to 40s loading times, so it's a worthwhile topic.

A15 Dev: The game starts up just the same, if slightly faster, and the gross memory usage is pretty much the same. The loading segment is a lot steeper in A15 and it loads 33% faster, in just 4 seconds. After that we see a much smoother graph with only a few spikes along the way. The loading is faster because there is less work for the CPU. However, the amount of data is pretty small and the Force GS SSD should be able to load all of it (~100mb) in less than a second or so. Lower-end PCs will definitely benefit from this 33% loading time improvement. We can also notice that memory growth itself is far steeper during loading - this is because model loading allocates a bit more than needed, to avoid any situation where we have to reallocate a list of 1000 vertices. This really pays off in speed and we don't even notice the few bytes we wasted. Compared to A14 there are far fewer spikes; those that remain are all in the downwards direction, meaning deallocation - reallocation, which is a typical pattern for a resource that gets reloaded. So this graph is looking a lot better and means we're using our memory more efficiently.

Summary: A15 Dev has definitely got worthwhile improvements. Even though we'd wish for more, 0AD has a huge codebase which requires an immense amount of work to optimize.
2) Memory allocation Histogram

The allocation histogram shows the general number of allocations over a selected timespan. In this case I've selected the time after loading and just before closing. This is to measure general allocation behaviour during the lifetime of the game.

A14 Release: It's obvious that the number of allocations is just huge. In just a minute, 0AD managed to do over 1.008 million allocations. This is less than desirable. A lot of the work goes into really small allocations, which is not an efficient use of memory, since we get a very high alloc/dealloc frequency and a lot of memory management overhead associated with tiny allocations.

A15 Dev: We can immediately notice that the number of very small allocations has gone down significantly. A lot of 32-byte allocations still remain, which is what should be focused on in the next iteration. In total, though, we've reduced the number of allocations by 94,000. Most of that came from all the tiny allocations, and some of them carried over to bigger allocation chunks, resulting in overall more efficient use of memory.

Summary: Even though we reduced memory management overhead by 10%, we only gained a meagre ~+25% FPS, which is not much of an improvement at all. We need breakthroughs that double or triple the FPS if we really want results.

-----------------

This is it for now, I hope you enjoyed this A15 Dev preview.
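One common way to attack the remaining tiny allocations mentioned above is a fixed-size pool that hands out blocks from a preallocated slab instead of hitting the general allocator. This is a generic sketch, not anything that is in the patch yet (alignment handling is ignored for brevity):

#include <cstddef>
#include <vector>

// Fixed-size pool for small objects (e.g. the 32-byte allocations above):
// one big slab allocation up front, then a free list of equally sized blocks.
class SmallBlockPool
{
public:
    SmallBlockPool(std::size_t blockSize, std::size_t blockCount)
        : m_BlockSize(blockSize), m_Slab(blockSize * blockCount), m_FreeList()
    {
        m_FreeList.reserve(blockCount);
        for (std::size_t i = 0; i < blockCount; ++i)
            m_FreeList.push_back(&m_Slab[i * blockSize]);
    }

    void* Allocate()
    {
        if (m_FreeList.empty())
            return 0; // pool exhausted; a real implementation would grow or fall back
        char* block = m_FreeList.back();
        m_FreeList.pop_back();
        return block;
    }

    void Deallocate(void* block)
    {
        m_FreeList.push_back(static_cast<char*>(block));
    }

private:
    std::size_t        m_BlockSize;
    std::vector<char>  m_Slab;      // one allocation instead of thousands of tiny ones
    std::vector<char*> m_FreeList;
};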
19. The following profiling was done on 2 separate builds of pyrogenesis in release mode. Both have exactly the same optimizations applied and are built with VC++2012. I'll refer to these versions as:
1) A14 Release - the SVN version of A14, r13791
2) A15 Dev - the megapatch on r13791

First we'll test everything at a "visual" glance. This means I don't use any profiling tools; we only monitor the FPS and "how it feels". Both of these tests will be run with the cache fully loaded on Cycladic Archipelago 6. Once that is done, we can compare peak memory usage, memory allocation histograms and loading times in general.

The testing system is: Windows 7 x64, Intel i7-720QM 1.6GHz, 8GB DDR3, Radeon HD 5650 1GB, Force GS 240GB SSD
Game settings: Most fancy options set to high, Postprocessing disabled, Windowed

1. First Glance @ 1280x720

-) A14 Release

This is the version that will be packaged and tested before release in the next couple of days. We've been working hard on optimizations, but most of these never made it into A14. This will give us a fair comparison of how big a performance gain we're looking at.

The menu is a good place to test the core speed of the engine. Very fast engines usually get over 1000 fps. A14 gets around ~480 fps, which is not bad at all considering we run a very complex scripting engine behind the scenes. To further test general game speed, let's enter the Match Setup chatroom. At first it starts pretty strong at ~300 fps. But once more and more text piles up, the FPS drops to a meager ~50-60 fps!! This is because text rendering is still extremely inefficient in A14.

--------------

Now let's load Cycladic Archipelago 6. It's very hard to profile loading times, because I have a 550mb/s SSD. Loading was fast, around 6 seconds, though it stuck at 100% for half of that. The last 100% is where all the shaders get loaded. I get a fairly steady ~46 fps in the initial screen. Zooming in, the FPS obviously increases to ~58, because there is less stuff to render. Once we zoom out with a revealed map, the fps drops to ~40.

-------------------

-) A14 Release summary: The chatroom showed how big a bottleneck the current GUI can be; it's not very efficient. With a revealed map I get 40 fps, which is a bit low, considering my system can play Mass Effect 3 in 1080p at the same fps.

-) A15 Dev

This one has about 2 months' worth of optimizations put into it. I used to think that I would achieve more in such a long period of time, but despite my previous experience, working on pyrogenesis has been different - mostly because it's cross-platform, which restricts many optimization options available to the programmer, and secondly because code that worked and ran fine on Windows often didn't work at all on Linux. This meant a few weeks of coding was lost and had to be reverted. The patch adds 7376 lines and removes 5507 lines of code. It has also gained the nickname "megapatch", due to how big the SVN .patch file is (~1mb).

The menu in the patched version runs at ~630 fps, so at first glance at least something appears to have improved. Now let's check how the Match Setup chatroom fares on A15 Dev. About ~300 fps, just like before. Looks like there's some other bottleneck in the code, but then again 300 fps is more than enough. What happens if we spam a few hundred lines of text at it? Only a slight drop to ~280 fps, which is a lot better than before. It means long lobby times won't hurt the game in A15.

--------------

Now let's load Cycladic Archipelago 6.
The loading is slightly faster, seemingly 4 seconds. Again, half of that is spent at 100%. This time it's faster because A15 Dev optimizes shader compilation, reducing the number of shaders compiled from ~300 to ~130. The initial look shows us ~61 fps, which is roughly +33% faster than A14. It's far less of an improvement than expected though; I'm slightly dismayed at that. If we zoom in, we see a similar improvement ratio of +25% at ~73 fps. And with the map revealed and zoomed out we get ~51 fps, which is about +27%.

--------------

-) A15 Dev summary: I'm a bit disappointed. After all the optimizations, I expected much better results. However, it's nice to see that the textrenderer optimizations paid off. The loading time of 4s is already fast enough for me, so I can't complain. Also, the general improvement of ~+25% fps is enough to make the game much more playable. I think the best improvement is the new input system - it's much smoother than the previous one, so it just "feels" faster, even though it isn't by much.

This is the end of the first glance, which is part 1 of the profiling session. The next part will show some memory usage data.
20. Another week has passed and I've been working furiously, mostly on fixing any memory leaks that popped up, optimizing XMB parsing to use string tokens and finally the wide-scale UTF8 transition of the engine, which is an epic task in itself but is definitely going to speed up the core of the engine by a noticeable factor. As I promised to add some profiling data, I've also taken some quick snapshots and comparisons of memory usage in general.

Week #4 26.08 - 01.09
26.08 1500-0300 (12) - XML Conversion optimization. WriteBuffer optimization.
27.08 2000-1100 (15) - XMB parse optimization.
28.08 1800-0400 (10) - UTF8 conversion optimization. JS ScriptLoad and Eval optimization.
29.08 1600-0800 (16) - Memory leak fixes. TextRenderer UTF8 support, major optimization. EntityMap.
30.08 1700-0300 (10) - GUI UTF8 transition.
31.08 1200-0100 (13) - Still UTF8 transition, Console input UTF8 compatible.
01.09 1500-0300 (12) - Patch review, weekly summary, performance graphs, GUI text hack.

You can notice that I've put in an insane number of hours this week (88) - migrating to UTF8 only is not trivial in the slightest, and a lot of code has to be made UTF8 aware: text rendering, console input, GUI input, script interface, just to name a few. As expected, there is a noticeable performance improvement from migrating to UTF8 - the reason is simple: we're doing a lot less string conversion everywhere.

1. XML Conversion

What's the issue?: Converting large XML files to XMB takes a ridiculously large amount of time.

I don't know if this can be called a real issue, since it only really affects us developers, but this was actually part of the XMB optimization, and the added optimization made it more than twice as fast. By implementing a custom String-ID map and making use of the knowledge that libxml2 already buckets its strings, I was able to greatly improve the bucketing speed. Furthermore, by placing the actual String-ID table at the end of the XMB file, we need to traverse the whole XML tree only once - before, we had to do it twice: 1) to get all unique String-IDs, 2) to write all the XML nodes. Perhaps the most important change is that strings are now stored as UTF8, not UCS-2, which makes conversion faster and also benefits the XMB parser greatly.

Before: The XML tree was traversed twice and an std::set with expensive string comparisons was used.
After: The XML tree is only traversed once, with a custom String-ID map and simple pointer comparisons. Speedup depends on actual XML tree complexity and the number of unique attribute/id names. Strings are stored as UTF8, which is much faster.

2. XMB Parser

What's the issue?: XMB files are binary XML files, but because the strings are stored as UCS-2, any string-related operations are ridiculously complex and slow.

Due to legacy reasons, XMB strings were stored as UCS-2 strings, which is the JS-compatible string format. As the project evolved, more and more layers of strings ended up creating a ridiculously complex chain of string conversions whenever XMB strings are read. The actual conversion sequence was as follows: XMB UCS-2 -> UTF16 -> WCHAR string (UCS-2 on Windows, UTF32 on Linux) -> UTF8 string. That's 3 layers of string conversion, just to get the end result as a UTF8 string. The correct solution here is pretty obvious: most of the game works with 1-byte char sequences (std::string) and the best compatible format for that is the general multi-byte string (variable char length) known as UTF8.
Any C++ code that doesn't care about specific Unicode characters will still function perfectly fine with UTF8 strings - that's how UTF8 was designed. You can read more about character encodings here: http://www.joelonsof...es/Unicode.html

Since we now convert XML files to XMB with UTF8 strings, the actual conversion sequence is much simpler: XMB UTF8 -> UTF8 string (std::string); which means we do a simple copy of the string. However, do we really need to do even that? I've recently been working on renaming the CTokenizer class to CTokenStr - a special string class that references other strings. It doesn't contain any string data itself, only pointers to the start and end of the string. Once I introduced CTokenStr instead of std::string, the actual conversion sequence looks like this: XMB UTF8 -> referenced by CTokenStr; which means we don't do any work at all! And we all know the fastest way to do something is to not do it at all.

Before: 3 layers of string conversions and copies.
After: No string conversions and no string copies - we actually don't do much work at all. Isn't that nice?

3. UTF8 Conversion

What's the issue?: The previous UTF8 conversion wasn't optimized for speed and wasn't robust enough to "just work". More importantly, the current UTF8Codec does not expose a convenient 1-char encode/decode.

Luckily I've worked with UTF8 libraries and high-speed encoding/decoding before. With most of the engine core being converted to work with UTF8 exclusively, we really needed a reliable UTF8 decode function that always works and automatically corrects itself on any invalid UTF8 sequence. The resulting prototype looks like this:

/**
 * Decodes a single 16-bit WCHAR (stored in a 32-bit unsigned) from the
 * specified UTF8 sequence.
 *
 * @param utf8sequence An UTF8 sequence of 1-4 bytes.
 * @param outch The output character decoded from the UTF8 sequence.
 * @return Number of bytes consumed while decoding (1-4).
 */
int utf_decode_wchar(const char* utf8sequence, wchar_t& outch);

Which can easily be used to grab a sequence of wchars:

wchar_t ch;
while (str < end)
{
    str += utf_decode_wchar(str, ch);
    // do something with 'ch'
}

We only ever need this char-by-char decoding in:
1) TextRenderer - GlyphAtlas works on Unicode, so we need the wchar_t
2) Console - in order to handle input correctly, we need to be UTF8 aware
3) GUI input fields - same as Console
4) UTF8 conversion - the actual methods that decode UTF8 strings to wchar_t strings

Previously there were a lot of UTF8 -> WCHAR string conversions in the engine, but with the gradual migration to UTF8 there is practically no conversion at all. Altogether, the only intensive part is TextRenderer, which is already quite efficient. Furthermore, there was no way to convert directly from UTF8 to UCS-2; you always had to do UTF8 -> WCHAR string -> UCS-2 string. With the new interface you can convert UTF8 directly to UCS-2.

Before: No way to efficiently decode individual UTF8 sequences, nor UTF8 directly to UCS-2.
After: A very robust and efficient UTF8 interface allows us to streamline the entire engine.

4. JavaScript Script Loading

What's the issue?: In order to load JavaScript scripts, the source needs to be converted to UCS-2.

This actually ties in with the UTF8 changes in general - by implementing a static conversion buffer for UCS-2 scripts, the conversion of scripts to UCS-2 is much, much more efficient. By using a static buffer, there is no memory allocation, which is perfect.
It's very straightforward now:
1) Load the script file with VFS
2) Convert the file buffer directly to UCS-2 in the static code buffer
3) Execute the script

Previously there was a lot of dynamic conversion of the file buffer, which was a huge waste of processing time.

Before: Script loading involved several layers of dynamic string conversions.
After: Script files are converted directly to UCS-2 in a static buffer and then loaded.

5. TextRenderer UTF8 Support

What's the issue?: Previously, in order to render anything on screen, you had to convert it into a WCHAR string, which was very clumsy.

This one actually makes text rendering ridiculously faster than before. We don't use any dynamic containers for text, nor do we index it or anything. Instead we have a fixed-size vertex buffer. All text that is submitted for rendering is immediately converted into vertices. Why is this good? It means we only need to send vertex data to the GPU once. This gives a very noticeable speedup and is the correct way to deal with vertices on the GPU. This transition also allowed text to be batched together by font - meaning there are no unnecessary font texture switches going on.

So, all of this goodness just because we transitioned to UTF8? Well, of course we could have done it before, but allowing both WCHAR and UTF8 strings to be rendered meant this was the only viable choice, really. No other way would have been right. And for future reference, there is still a lot to improve when it comes to text - though this system is now compatible with the TrueType OpenGL text engine I developed around June, which means we can actually transition to TrueType fonts for really awesome and clear anti-aliased text.

Before: Text rendering wasn't that efficient and only supported WCHAR strings, which made printing UTF8 very clumsy.
After: The text renderer is a lot faster, uses less memory, supports both UTF8 and WCHAR strings and is ready for a transition to TrueType fonts.

6. UTF8 Transition

What's the issue?: The engine relies heavily on WCHAR strings, which is very bad for cross-platform projects.

So you've noticed there's a lot of UTF8 here and there this week. It's all part of a larger goal to transition to UTF8 strings only. This makes the whole engine a lot simpler, since we only need WCHAR conversion in a few select places. These are:
1) When we render text (although we do this char by char).
2) When we open files with the Windows API.

In all other respects, using WCHAR strings makes no sense at all and we can get away with just using UTF8 strings. Modules that have been converted to UTF8:
1) GUI
2) XML/XMB
3) TextRenderer

This transition is still in progress and is a slow and arduous one.

7. Patch Reviews

I finally managed to do some patch reviewing on some core components of the engine, so the week wasn't entirely wasted. I reckon the most efficient way to handle patch reviews is to explicitly assign C++ patches to me for review - although I'll be jumping into patch reviews more frequently from now on, this would speed up the process. Patches that I approved were marked as "reviewed", and thanks to sanderd17 we can now easily query those tickets. So when A14 is released and the feature lock is lifted, we can start throwing stuff into SVN and make it all work before A15 is released.

To end Week #4: I don't know if there's much else to say about all of this, so I'll now throw fancy performance graphs at you. Beware.
( followed in the next post )

-----------------------------------------

This is my current TaskList:
-) UTF8 transition
-) PSA and PMD to collada converter.
-) Migrate to C++11
-) Collada -> PMD cache opt to improve first-time cache loading.
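For anyone curious what a char-by-char decoder of the kind described in section 3 above boils down to, here is a stripped-down sketch. It ignores surrogates and most of the invalid-sequence recovery the real function needs, so treat it as an outline rather than the actual implementation:

// Simplified UTF8 -> wide char decode: returns the number of bytes consumed (1-4).
// Unsupported sequences decode to U+FFFD; a real decoder must also resynchronize
// on truncated input and handle code points above the BMP.
static int utf_decode_wchar_sketch(const char* utf8sequence, wchar_t& outch)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(utf8sequence);
    if (s[0] < 0x80)                       // 1 byte: plain ASCII
    {
        outch = static_cast<wchar_t>(s[0]);
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)             // 2 bytes: U+0080 .. U+07FF
    {
        outch = static_cast<wchar_t>(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0)             // 3 bytes: U+0800 .. U+FFFF
    {
        outch = static_cast<wchar_t>(((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        return 3;
    }
    outch = 0xFFFD;                        // 4-byte sequences don't fit a 16-bit wchar_t
    return 4;
}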
21. So I guess this report will be pretty epic. I've been working all night on XMB file loading and optimization, mostly to greatly improve loading speeds. However, I digress; here's my report for last week.

Week #3 19.08 - 25.08
19.08 1100-1700 (6) - Debugging and bugfixes on megapatch. Huge breakthrough.
20.08 1600-0200 (10) - Debugging shader issues.
21.08 1200-1900 (7) - ShaderProgram variations reduction. ModelRenderer texture bind bug solved!
22.08 2100-0500 (8) - Windows Stacktrace failure debugging.
23.08 1000-1200 (2) - Alpha sorting removed.
25.08 1400-0500 (15) - Fundraiser footage. Megapatch bugfixes. UTF conversion optimization.

Of the 48 hours in total, most went into debugging, but it finally paid off. The patch is now stable on Linux and OSX, which means it's ready for commit after the A14 release. At the end of the week I took some extra time to improve UTF conversion performance (since we're doing a lot of it) and also grabbed some footage for the fundraiser.

1. Debugging breakthrough

What's the issue?: Well, until recently the patch crashed on Linux and OSX; on Windows the game ran fine.

It was a really frustrating issue, since I couldn't debug the crash at all - I could only hope to fix any bugs that changes to the shader definitions system had caused. Funnily enough, the failure was simply due to incorrect hashes of CShaderDefines.

End result: We can now deploy the patch after A14 is released and start refining out any bugs that pop up.

2. ShaderProgram Variations

What's the issue?: For each rendering ability such as Shadows, Specular or Normals, a combination of ShaderDefines is formed. For each unique combination a new shader is compiled. This is very inefficient.

When running 0AD in an OpenGL debugger, I noticed that the number of shader programs generated totalled around 300. Each shader compilation actually takes a pretty long time during loading, so generating over 300 shaders from just a few sounds like a high crime. The biggest problem is the batch-sorting that is done prior to rendering models - the larger the number of shaders, the more inefficient rendering becomes due to constant resource binding/unbinding. Batching is also inefficient, resulting in more texture state changes than are actually needed.

My solution was to implement a second layer of caching inside CShaderProgram itself and hash any input shaders. This allows me to check whether the current source code has already been compiled and, if so, retrieve a reference-counted handle to the shader program. This is really great and reduced the number of shader programs from 300 to around 120. What we could do to improve the situation further is use fewer shader defines - the smaller the number of variations, the smaller the number of shaders compiled.

End result: The annoying wait at the end of the loading bar was reduced by half and is hardly noticeable now.

3. Windows Stacktrace failure

What's the issue?: Several error reports on Windows fail to generate a proper stacktrace, and usually another error occurs while generating the error message.

This was actually pretty hard to debug. On VS2008 the issue was somewhat improved with the /Oy- flag, which forces the use of frame pointers. On VS2012, disabling Whole Program Optimization generally gave improved results. Still, a lot of cases failed and no stacktrace was generated at all. Apparently, if the top-level function is inlined, WinDbg.dll is unable to resolve the function reference.
In that case the only fix was to change the stacktrace behaviour to simply display all functions and skip any filtering of the callstack. This at least gives some kind of stacktrace, which is better than nothing.

End result: Error reports can now be expected to always give a stacktrace on Windows.

4. Alpha sorting

What's the issue?: A noticeable amount of time during rendering is spent sorting transparent models - improvement here is essential for better rendering performance.

Even though I spent the least amount of time on this issue, it probably had the biggest FPS impact on the renderer. The current renderer distance-sorted all transparent models prior to rendering, resulting in some pretty complex batching before rendering. This takes almost half of the rendering time itself and is pretty useless, because OpenGL employs a Z-buffer which, in combination with proper alpha testing, gives perfect results. Since 0AD already employs this functionality, all I had to do was remove <sort_by_distance/> and any code related to distance sorting in the modelrenderer.

End result: Visually no difference. About a 33% gain in performance (depending on the number of trees), 50 fps -> 70 fps.

5. UTF Conversion

What's the issue?: There is a lot of string conversion going back and forth in the 0AD engine: UTF8, UTF16, UCS-2 and UTF32 strings are all being used and constantly converted from one type to another.

My first goal was to reduce the number of conversions done, but that's a really huge change. The next best thing I could do was streamline the UTF8 conversion code.
1) Added conversion of UTF8 -> UTF16 and UTF16 -> UTF8 for faster SpiderMonkey interaction.
2) Added a special case for UCS-2 <-> WCHAR_T on Windows, resulting in faster conversion performance there.
3) Improved, optimized and streamlined the code to do UTF conversion much faster than before.

However, these changes are intended for a gradual move from WCHAR_T strings (UCS-2 on Windows and UTF32 on Linux) to simple UTF8 strings. There is a lot of code that uses WCHAR_T strings even though there is no real need for it. The only part of the code that needs to deal with UCS-2 strings is Windows CreateFileW, which is rarely called.

End result: Fewer string conversions; faster UTF8/UTF16/UTF32 string conversion.

To end Week #3: I still didn't manage to do any patch reviewing, so I'll /have/ to do it first thing tomorrow (otherwise I'll procrastinate again and work on some awesome module instead). I think it was an excellent week nevertheless - I was able to squash the annoying runtime bugs thanks to everyone on IRC helping me test it out. Since I finally got my 8GB of RAM, I can dedicate a day to memory performance comparisons.

-----------------------------------------

This is my current TaskList:
-) Patch review
-) Performance Improvement Fancy Graphs
-) PSA and PMD to collada converter.
-) Migrate to C++11
-) Collada -> PMD cache opt to improve first-time cache loading.
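The second caching layer described in section 2 boils down to "hash the generated source, reuse the compiled program on a hit". A rough sketch with placeholder types follows; the real CShaderProgram code also has to handle defines, reference counting, collisions and GL state:

#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Placeholder for a compiled GL program handle.
struct CompiledProgram
{
    unsigned glHandle;
};

class ShaderProgramCache
{
public:
    std::shared_ptr<CompiledProgram> GetOrCompile(const std::string& source)
    {
        const std::size_t key = std::hash<std::string>()(source);
        auto it = m_Cache.find(key);
        if (it != m_Cache.end())
            return it->second;                  // identical source: reuse the compiled program

        std::shared_ptr<CompiledProgram> program(new CompiledProgram());
        program->glHandle = CompileGL(source);  // stand-in for the glCompileShader/glLinkProgram path
        m_Cache[key] = program;
        return program;
    }

private:
    static unsigned CompileGL(const std::string&) { return 0; /* stub */ }
    std::unordered_map<std::size_t, std::shared_ptr<CompiledProgram>> m_Cache;
};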
22. I'll have to postpone today's update until tomorrow, since I'm working on a really interesting piece regarding XML parsing that will improve load times tremendously. If my new RAM also arrives tomorrow, I'll finally be able to throw in memory usage comparisons (apparently 4GB is not enough to profile 0AD memory usage). Stay tuned!
  23. If the issue only persists with ShadowPCF enabled, it might be a totally different issue. Perhaps we can sort this out on IRC?
24. Looks like you have more than enough texture units on your GPU. Then it has to be an issue with the shader texture binding stage. Perhaps the diffuse texture is the same as the blend texture (invalid sampler?). This is most likely, since it would be a programming bug in the blend generation code. With a good debugger it would be pretty easy to figure out. If the diffuse and blend textures match, it's a bug in the code. The texture binding is done in PatchRData.cpp at line 796. Hope you know how to use a debugger.
25. Having taken a closer look at TerrainRenderer now, it looks like the TerrainBlends stage doesn't have a texture bound, which has to be a logic bug in the GLSL path of the code. Perhaps the older card has a lower limit on texture units? The TerrainBlends stage uses at least 2 textures (the base texture and the blends texture), if not more.