
RedFox

WFG Retired
Everything posted by RedFox

  1. This looks really excellent! I think it's a great idea.
  2. The emplace_back function seems a lot better than the old push_back for objects that are non-trivial / non-POD. Nevertheless, all the arguments have to be pushed onto the stack not once but twice (once for emplace_back, once for the type constructor) before being copied a third time into the actual buffer. For POD types, something like this would be preferred, since it involves one initialization and one copy:

    Data d = { 1, 2, 3 };
    vec.push_back(d); // argument is passed as 'const Data&'

And I still think the move constructor is a hack that tries to fix one of the biggest problems C++ has: temporary non-trivial objects, which is something that should be avoided at all cost. At least now they've introduced a way to escape heap allocation/deallocation triggering 3 times, which somewhat lessens the impact. Still, the implementation of std::string is non-trivial, and looks like this on VC++ 2012 (28 bytes):

    class string {
        iterator_proxy* _Proxy;
        union {
            char _Buf[16];
            char* _Ptr;
        };
        size_t _Size;
        size_t _Reserved;
    };

The idea is that small strings can be contained inside the internal buffer, while strings longer than 15 chars are allocated on the heap. Regarding std::move and move constructors/destructors in general, they are still non-trivial, containing dozens of operations for copying and nulling the string objects. C# sidesteps this by making every class a reference type (similar in spirit to Microsoft's COM, but managed by a garbage collector rather than reference counts). In C++ the closest parallel is a shared_ptr<> wrapper, which is again completely different from C# references:

    // again from VC++ 2012 STL
    template<class T> class shared_ptr {
        T* _Ptr;               // pointer to object allocated with new
        _Ref_count_base* _Rep; // pointer to a dynamically allocated object that holds the reference count
    };

So this keeps a dynamically allocated pointer to a reference-counting object, which is rather inelegant. If you use a lot of shared_ptr<>'s in your project you'll start feeling the overhead. Luckily, or rather surprisingly, Microsoft designed an (in my opinion) elegant solution to the problem:

    class IUnknown {
    public:
        int refcount;
        IUnknown() : refcount(1) {}   // initial creation has refcount 1
        void AddRef() { ++refcount; } // adds to the reference count of the object
        void Release() { if(!--refcount) delete this; } // decreases refcount; if it reaches 0, deletes the object
    };
    class MyObject : public IUnknown { /* implement your own object */ };

So now you introduce a reference-counting overhead for every object you intend to pass by reference. For actual use you need a reference wrapper. In Microsoft's COM it's a comptr<>:

    template<class T> struct comptr {
        T* ptr;
        // add copy operator overloads to AddRef and Release as needed
    };
    // comptr<MyObject> obj = new MyObject(); // compiles fine because MyObject inherits IUnknown

Yeah, when I said it's elegant I might have been a bit too generous. It's more accurate to say it's more elegant than a shared_ptr<>. What it does achieve, though, is a single allocation for the object - no matter how many copies of comptr<MyObject> you pass around. For real C++ use you sometimes need objects allocated on the stack (because it's fast and your data is sometimes trivial) and other times you need them to be dynamic. The COM solution will never work for stack-allocated objects; it will result in a crash or heap corruption. That's awful.
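For completeness, the copy overloads that the comptr comment above waves away would look roughly like this (a sketch against the simplified IUnknown shown here, not Microsoft's actual ATL CComPtr):

    template<class T> struct comptr {
        T* ptr;

        comptr() : ptr(nullptr) {}
        comptr(T* p) : ptr(p) {}      // adopts the initial refcount of 1 coming from 'new'
        comptr(const comptr& other) : ptr(other.ptr) { if(ptr) ptr->AddRef(); }
        ~comptr() { if(ptr) ptr->Release(); }

        comptr& operator=(const comptr& other) {
            if(this != &other) {
                if(other.ptr) other.ptr->AddRef(); // AddRef first so indirect self-references stay alive
                if(ptr) ptr->Release();
                ptr = other.ptr;
            }
            return *this;
        }
        T* operator->() const { return ptr; }
    };

All it really buys over shared_ptr<> is that the counter lives inside the object itself, so there is only one allocation.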
Another solution would be to create a custom reference object that proxies a struct of the object pointer and its reference count:

    template<class T> class refptr {
        struct proxy {
            T* ptr;
            int refcount;
        };
        proxy* p; // proxies should be allocated from a special dynamic memory pool
    public:
        inline refptr() : p(nullptr) {} // default to a null pointer
        inline refptr(T* ptr) {
            p = proxyAlloc();  // gets a new object from the proxy memory pool
            p->ptr = ptr;      // set the proxy ptr
            p->refcount = 1;   // initial refcount is 1
        }
        inline ~refptr() {
            release();         // try to release the object
        }
        inline void release() {
            if(p && !--p->refcount) { // decrease reference count
                delete p->ptr; // delete the actual object
                proxyFree(p);  // return proxy to its memory pool
            }
        }
        inline T* operator->() {
            return p->ptr;     // will cause a nullptr exception if the proxy is nullptr
        }
        inline operator bool() { return p ? true : false; } // so we can test the refptr in conditionals
    };

Even though it's not a completely working example (it's missing the copy constructors, the assignment operators and actually a ton more operators to make it usable), it still illustrates the idea of a very simple reference object (only 4 bytes wide on x86) that gives a properly reference-counted object:

    {
        Data d1; // allocate a Data object on the stack
        d1.a = 5;
        refptr<Data> d2 = new Data(); // allocate on the heap and store as a reference
        d2->a = 5;
        refptr<Data> d3 = d2; // share the reference
        d2 = nullptr;         // decreases the reference count
        d3->a == 5;           // still accessible through d3
        d3 = nullptr;         // refcount reaches 0 here, so the Data object is deleted
    }

So yes, C++11 provides many ways of manipulating data for us, but it doesn't provide the one true solution that we really need. I think the compiler should be extended to treat the following patterns as special cases:

    vector<int> GetData() {
        vector<int> data;
        data.push_back(10);
        return data;
    }
    {
        vector<int> list = GetData(); // case 1 - initialization
        list = GetData();             // case 2 - temporary copy replace
    }

And it should generate altered code for these special cases (regardless of optimization levels):

    // case 1 - initialization
    void _init_GetData(vector<int>& out) {
        out.vector();  // call default ctor
        out.push_back(10);
    }
    // case 2 - temporary copy replace
    void _replace_GetData(vector<int>& out) {
        out.~vector(); // call the dtor
        out.vector();  // call the default ctor
        out.push_back(10);
    }
    {
        vector<int> list;
        _init_GetData(list);
        _replace_GetData(list);
    }

Optimizations like this are only partly achieved with certain compiler flags and they don't always work. I don't see why the compiler can't generate this kind of code. It's true that with this change the assignment operator will not be called, but isn't that what was actually desired? Even if you replace the code with std::string or even refptr, the end result will still be the same. So yeah, rvalue references are a hack for a mistreated feature (operator=) of the C++ language. And I guess we'll have to live with it, unless someone comes along with a clean new language somewhere in between C++ and D.
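Just to sketch what the missing pieces would look like (these go inside the refptr class above, with proxyAlloc/proxyFree being the hypothetical pool functions it already assumes):

    // copy construction: share the proxy and bump the count
    refptr(const refptr& other) : p(other.p) {
        if(p) ++p->refcount;
    }
    // copy assignment: drop our current reference, then share the other one
    refptr& operator=(const refptr& other) {
        if(p != other.p) {
            release();
            p = other.p;
            if(p) ++p->refcount;
        }
        return *this;
    }
    // assigning nullptr just drops the reference (as in 'd2 = nullptr' above)
    refptr& operator=(std::nullptr_t) {
        release();
        p = nullptr;
        return *this;
    }

A real implementation would also null p inside release(), so a released refptr can never touch a recycled proxy.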
  3. That's an excellent point - introducing a proper move constructor/operator to the objects will make a huge difference to the current code. It would still introduce quite a bit of copying, though, and the temporary object destructors are still called - they just won't have to delete any resources, since the data has already been moved out. The following code will still be the most efficient:

    void GetSomeData(vector<Data>& outData) {
        outData.push_back(Data());
    }
    {
        vector<Data> list;
        // ...
        GetSomeData(list);
    }

Whereas the following code introduces a rather large number of operations:

    vector<Data> GetSomeData() {
        vector<Data> result;
        result.push_back(Data());
        return result;
    }
    {
        vector<Data> list;
        // ...
        list = GetSomeData();
    }

What happens behind the scenes here:

    void GetSomeData(vector<Data>& rvalue) {
        vector<Data> result;      // ctor
        result.push_back(Data());
        rvalue.operator=(result); // copy the result
        result.~vector();         // dtor of the initial vector
    }
    {
        vector<Data> list;
        vector<Data> rvalue;      // compiler generates a temporary for the return result
        GetSomeData(rvalue);      // get the result into the temporary
        list.operator=(rvalue);   // copy the temporary
        rvalue.~vector();         // dtor of the temporary rvalue
    }

Basically what C++11 introduces is a new type of copy operator/constructor called the 'move operator/constructor':

    vector(vector&& rvalue)
        : size(rvalue.size), capacity(rvalue.capacity), buffer(rvalue.buffer) {
        rvalue.size = 0;
        rvalue.capacity = 0;
        rvalue.buffer = nullptr;
    }
    vector& operator=(vector&& rvalue) {
        // we don't have to check for 'this != &rvalue', since rvalues are always just temporary objects
        if(buffer) delete[] buffer; // make sure any data held by this vector is freed
        size = rvalue.size;
        capacity = rvalue.capacity;
        buffer = rvalue.buffer;
        rvalue.size = 0;
        rvalue.capacity = 0;
        rvalue.buffer = nullptr;
        return *this;
    }

So, instead of calling the standard 'vector& operator=(const vector& other)', the compiler will use 'vector& operator=(vector&& rvalue)' whenever a temporary object is assigned. This is a lot better than copying the data 3 times! But it's still bad. It's basically a hack to fix an ugly design problem that C++ has, and something we have to live with. The only proper way to deal with this is to just use 'void GetSomeData(vector<Data>& outData)'. It's less intuitive, but at least it's not ugly behind the scenes.
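If anyone wants to see for themselves which of these gets called, a quick throwaway test type does the trick (nothing 0 A.D.-specific, just an instrumented struct):

    #include <cstdio>
    #include <vector>

    struct Tracer {
        Tracer()                         { puts("default ctor"); }
        Tracer(const Tracer&)            { puts("copy ctor"); }
        Tracer(Tracer&&)                 { puts("move ctor"); }
        Tracer& operator=(const Tracer&) { puts("copy assign"); return *this; }
        Tracer& operator=(Tracer&&)      { puts("move assign"); return *this; }
        ~Tracer()                        { puts("dtor"); }
    };

    std::vector<Tracer> GetSomeData() {
        std::vector<Tracer> result;
        result.push_back(Tracer()); // default ctor + move (or copy) into the vector's buffer
        return result;              // NRVO or the vector's move ctor, depending on the compiler
    }

    int main() {
        std::vector<Tracer> list;
        list = GetSomeData();       // move assignment of the temporary vector
    }

In practice most compilers elide the temporary entirely (return value optimization), which is pretty much the 'special case' code generation argued for in the previous post.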
  4. This is a good point to start! Maintaining a more detailed list of 'good C++ design patterns' seems like the way to go. It shouldn't be too brief, though, and should explain these concepts to non-expert C++ programmers. Not everyone has expert knowledge of x86 assembly, of how the operating system maps physical memory to virtual memory, or of how the stack and heap exactly work.

I haven't run any long-running tests on the pyrogenesis engine, but usually if your program doesn't have any obvious leaks, yet the total paged memory keeps increasing, the answer is always fragmentation. Even though I'm not familiar with Linux allocation schemes, Windows Vista/7/8 use the Low Fragmentation Heap (a cluster of memory pools). Even though it sounds promising, it only mitigates the cost of malloc searching for (and failing to find) a properly sized memory block. If your dynamic allocations vary too much, it will end up creating a lot of differently sized clusters, thus increasing the memory paged by the process anyway. It's a bit too optimistic to think that something like that would magically solve all the problems - memory management is a design issue that is not easy to fix and will always take a lot of time. Nevertheless, sometimes it's easier to spend those 2 extra hours of your evening creating a quick memory pool, rather than let others spend hours maintaining that code a few months later. Sounds familiar?

In the ComputeShortPath example, an edge consists of 2 FixedVector2D-s, which each consist of 32-bit CFixed objects. Since each edge is 128 bits / 16 bytes in size, the programmer obviously counted on the SSE2 extensions, which move unaligned 128-bit memory blocks through the XMM registers in a single op (Edge is a POD type, so G++/VC++ will definitely use MOVUPD [move unaligned packed doubles]). This of course requires compiling with SSE2 flags, but 0AD does not compile with any enhanced instruction set flags, so perhaps that wasn't the best decision.

Overall, this seems like the perfect place to use a memory pool. If you are lazy, you can just pre-determine the maximum possible pool node count, align it to the default page size (4KB) and allocate one huge block of memory. Note that any self-respecting stack is 1MB in size (that's the default on the Win32 platform anyway), so if you know your dynamic data fits into a reasonable size (64KB?), you can use alloca(size_t) to allocate on the stack instead. Given that you stated 6 std::vector<Edge>'s with 16*16 edges each as the possible maximum (I haven't looked that closely), that would give a required reserve of sizeof(Edge) [16] * 256 * 6 => ~24.5KB. This is actually so small that it's a bit evil to fragment the heap with vectors like that. It's easier to just feed 64KB of stack memory into a generic pool and never look back again. No fragmentation: check. No heap alloc: check. No mutex lock on the process heap: check. Triple win, people? Yeah, alloca wins huge in small scenarios like this.

I think in general terms (aside from some obvious nomenclature errors regarding stack/heap) his assessment was pretty much spot on. If you design an entire program with the idea that 'a single vector won't fragment memory', then you'll end up with a program full of a lot of vectors and a lot of fragmented memory. Some profiler data is always good to look at, but proper analysis can take hours or days - it's not that simple after all.
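As a rough illustration of what feeding stack memory into a generic pool could look like (a sketch only - no alignment handling, no per-node freeing, and the function/struct names are made up):

    #include <cstddef>

    // Trivial bump allocator over a caller-provided block,
    // e.g. a 64KB buffer sitting on the stack.
    struct StackPool {
        char*  buffer;
        size_t capacity;
        size_t used;

        StackPool(char* buf, size_t cap) : buffer(buf), capacity(cap), used(0) {}

        void* alloc(size_t bytes) {
            if(used + bytes > capacity)
                return nullptr;       // out of pool space; caller must handle it
            void* p = buffer + used;
            used += bytes;
            return p;
        }
    };

    void ComputeShortPathSketch() {
        char block[64 * 1024];        // one stack 'allocation' for the whole call
        StackPool pool(block, sizeof(block));

        void* edges = pool.alloc(16 * 256 * 6); // the ~24.5KB edge reserve estimated above
        (void)edges;
        // ... fill and use the edges: no heap traffic, no locks,
        // and everything vanishes when the function returns
    }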
In this case std::vector is a classic example of memory fragmentation, since the initial allocation size depends on the implementation and thus varies between platforms. Looking at the VC++ STL, its std::vector implementation is one of the worst, since it always allocates exactly by sizeof(T) and always expands by +50%. Because of that, something like this usually happens:

    SIZE  CAPACITY  BYTES
       0         0      0
       1         1      4
       2         2      8
       3         3     12
       4         4     16
       5         6     24
       6         6     24
       7         9     36
       8         9     36
       9         9     36
      10        13     52

And you wonder where the fragmentation comes from? There are 7 reallocations required for only 10 elements added. And C++ isn't exactly known for its fast reallocation, since it requires: 1) new buffer, 2) copy data, 3) delete old buffer. If you used a custom vector implementation that always makes the initial allocation 32 bytes, with each successive allocation being +50% aligned to 32 bytes, you would get something like this:

    SIZE  CAPACITY  BYTES
       0         0      0
       1         8     32
       2         8     32
       3         8     32
       4         8     32
       5         8     32
       6         8     32
       7         8     32
       8         8     32
       9        16     64
      10        16     64

This scales better for heap memory management - if we free a 32-byte block and allocate a 64-byte block, another instance of our custom vector will happily pick up the 32-byte block we just released - it's a perfect fit after all. Yeah, sorry for going off-topic so suddenly. I'm just used to programming very performance-critical applications, and most of the time custom implementations such as this are needed to fit the current scenario. This scenario favors speed over memory, so allocating a few bytes ahead is not a problem.
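For what it's worth, the two growth policies are easy to simulate; this little throwaway program reproduces both capacity tables for sizeof(T) == 4 (the +50% and 32-byte numbers are just the ones used above):

    #include <cstdio>
    #include <cstddef>

    int main() {
        const size_t elemSize = 4;
        size_t capA = 0; // policy A: grow by +50% of the element count (VC++-style)
        size_t capB = 0; // policy B: allocate in 32-byte blocks, grow by +50%

        puts("SIZE  capA bytesA   capB bytesB");
        for(size_t size = 1; size <= 10; ++size) {
            if(size > capA) {
                size_t grown = capA + capA / 2;                     // +50%
                capA = (grown > size) ? grown : size;
            }
            if(size > capB) {
                size_t bytes = capB ? capB * elemSize * 3 / 2 : 32; // +50%, or the 32-byte initial block
                bytes = (bytes + 31) / 32 * 32;                     // round up to a 32-byte boundary
                capB = bytes / elemSize;
            }
            printf("%4zu  %4zu %6zu   %4zu %6zu\n",
                   size, capA, capA * elemSize, capB, capB * elemSize);
        }
    }

Policy A ends up allocating 7 times over those 10 pushes; policy B only twice.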
  5. I think he meant 'for example, if'. How is this any different between P2P and Client-Server? In either case there exists a 'synced' version of the current game and only state changes are sent/received. In the client-server model the client sends a change in its state to the server, while 'starting' the action. The server receives the update, sends an acknowledge and also updates all other clients with the new state. In the P2P model described, everyone broadcasts their state changes to every other client, meaning N times more packets to send out in any case. Basically, with a running server you can at least control the bandwidth and sync at the same time. In peer-to-peer there's no single instance that can be considered the main sync object (or perhaps player 0?). If the network data is sent in raw binary, then bandwidth shouldn't be an issue at all.
  6. You are correct on that: if I take a look at a small 1-minute profile run, the amount of string and vector resizing is huge. Tens of millions of buffer resizes tend to add up and fragment the heap. Even if the 'real' memory use seems small, the damage has already been dealt algorithmically. The use of unordered_map<> is also huge in this project - hash tables actually require quite a lot of memory; I'll get back to this at the end of this post. Even though Ogre also has its own quirks, it would indeed be easier to use and doesn't require maintenance.

That's an excellent patch! You've gone and fixed a lot of painful code. Great job! I've tried to explain that point, but it seems the general opinion is somewhat fixed... Going through the code, it seems that someone discovered the proverbial hammer in associative maps and then proceeded to treat every problem as a proverbial nail. Even worse are constructs like CStrIntern::GetString, which creates a huge map of shared_ptr<>'s to CStrInternInternals to 'save space on duplicate strings'. Of course the key has to be 'std::string', meaning the unordered_map<> has to store a duplicate of the string anyway. Not only is this implementation flawed, but CStrIntern wastes more than double the space for every string, and searching for a matching string starts to take a large amount of time. In the profiler the comparison was triggered over 4 million times during a 1-minute run of the game.

Why do I say it wastes more than double? Because of the implementation of the C++ hash table (unordered_map<>), which has to keep a copy of the key value. Hash tables work by mapping a given hash to a bucket. A single bucket can contain several key<->value pairs, and in order to tell them apart, the boost and std implementations need a copy of the original key (this isn't a problem when your key is, for example, an integer). Since the unordered_map<> key in CStrIntern is a plain 'std::string', a full copy of the string is kept as the key. This has huge performance implications (!). The memory usage itself is very difficult to calculate for a hash map, because it's non-linear and depends on the actual hash values, but a simple whole-program memory comparison will do the trick:

    // VC++11, Release
    wchar_t buffer[80];
    const size_t memStart = ProcessMemory();
    std::unordered_map<std::wstring, MyData, MyHasher, MyComparer> tbl;
    for(int i = 0; i < 5000; ++i) {
        int len = swprintf(buffer, L"aaaaBBBBccccDDDDeeee %8d", i);
        tbl[std::wstring(buffer, len)] = MyData();
    }
    const size_t memUsed = (ProcessMemory() - memStart);

Given that we have 5,000 unique strings, 32 wchars in length (which isn't that far-fetched for 0AD), the memory usage grows to around 1.75MB, and once over 1,024 entries the hash table slows down by several orders of magnitude, especially in debug mode. Now the first reaction would be "well, a few megabytes shouldn't really matter". Oh? Well, 5,000 wchar_t strings of length 32 should roughly take 'sizeof(wchar_t)*33*5000' bytes of memory (which is 330KB on Win32). So that's around 6 times less memory. On Linux wchar_t is 4 bytes, but the ratio itself should remain similar. Now if your program only dealt with strings and your L2 cache size was 1MB, then the second option (330KB) would fit inside the L2 cache, while the larger one (1.75MB) would not. That alone would immediately make your program around 5x slower, due to cache misses.
I've written image analysis software for robotics before, and even a small difference of 100KB in a lookup table size could make the whole algorithm 5-10 times faster, depending on the actual number of memory accesses. The L2 cache isn't really that huge and other programs want a share of it too - which is why keeping your data small (330KB?) helps ensure that it stays in the cache while contending with other processes. To sum up: I think CStrIntern is a huge bug in the game logic and should be replaced with a proper translation implementation - which would solve the 'duplicate strings' problem while remaining efficient - by handing out const pointers to the strings in a large block of contiguous memory.
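Roughly what I have in mind, as a sketch (not a drop-in replacement for CStrIntern - no translation hookup, and the class name and chunk size are made up):

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Interns strings into large contiguous chunks and hands out stable
    // const char* pointers into them. The index is only consulted when a
    // string is interned; afterwards callers just pass the raw pointer
    // around and compare pointers instead of re-hashing string contents.
    class StringBlock {
        std::vector<std::vector<char>> chunks;              // chunk buffers never reallocate once filled
        std::unordered_map<std::string, const char*> index; // string -> its interned copy
    public:
        const char* intern(const std::string& s) {
            auto it = index.find(s);
            if(it != index.end())
                return it->second;

            const size_t chunkSize = 64 * 1024;
            size_t needed = s.size() + 1;
            if(chunks.empty() || chunks.back().size() + needed > chunks.back().capacity()) {
                chunks.emplace_back();
                chunks.back().reserve(needed > chunkSize ? needed : chunkSize);
            }
            std::vector<char>& chunk = chunks.back();
            const char* ptr = chunk.data() + chunk.size();  // where this string will start
            chunk.insert(chunk.end(), s.begin(), s.end());
            chunk.push_back('\0');
            return index.emplace(s, ptr).first->second;
        }
    };

The index still keeps one std::string copy per unique string for insert-time lookups, but nothing downstream touches the map again - which is where CStrIntern currently burns those millions of comparisons.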
  7. This is a huge change, but also something that just has to be done. We could follow the OOC (Object-Oriented C) approach, where objects are expressed as opaque handles:

    // myscript.js
    var g_Playlist = [ "music01.ogg", "music02.ogg" ]; // our playlist
    var g_Current = 0; // invalid handle
    var g_Next = 0;    // index of the next sound to play

    // ... some event ...
    {
        // where sound_* is the global sound interface
        var newsound = sound_load(g_Playlist[g_Next++]); // load the next piece
        if (g_Next >= g_Playlist.length)
            g_Next = 0;
        sound_stop(g_Current); // stop the current music after we have actually loaded newsound
        sound_play(newsound);  // start playing the new music
        g_Current = newsound;
    }

The C++ code would keep track of the handles and their validity. For example, 0 would be an invalid handle. You won't ever get dangling pointers this way, since a handle is simply invalidated once the underlying object has been deleted. Null pointers are also impossible, since handle 0 marks an invalid handle - in which case the relevant interface method should just do nothing.
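On the C++ side the bookkeeping could be as simple as this (a sketch; SoundHandle, g_Sounds and friends are made-up names, not anything in the engine):

    #include <unordered_map>
    #include <cstdint>

    typedef uint32_t SoundHandle;   // 0 is reserved as the invalid handle

    struct Sound { /* decoder state, buffers, ... */ };

    static std::unordered_map<SoundHandle, Sound*> g_Sounds;
    static SoundHandle g_NextHandle = 1;

    SoundHandle sound_load(const char* file) {
        (void)file;                 // real code would decode the file here
        Sound* s = new Sound();
        SoundHandle h = g_NextHandle++;
        g_Sounds[h] = s;
        return h;                   // the script only ever sees this integer
    }

    void sound_stop(SoundHandle h) {
        auto it = g_Sounds.find(h);
        if(it == g_Sounds.end())
            return;                 // 0 or stale handle: silently do nothing
        delete it->second;
        g_Sounds.erase(it);         // the handle is now invalid forever
    }

    void sound_play(SoundHandle h) {
        auto it = g_Sounds.find(h);
        if(it != g_Sounds.end()) {
            /* start playback of it->second */
        }
    }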
  8. The second column is elapsed inclusive time as a percentage of total run time. That 100% is the total wall-clock time across all 5 threads, so in that single-thread context 20% is its share of the wall time. Running a new profile on Tiber River and comparing the other GUI objects, I can see that CText::Draw and CMiniMap::Draw take much more time than the rest: around 4% out of that 20%. So roughly a fifth of the rendering pipeline (4/20) is spent drawing text and the minimap. It's safe to say that pre-rendering a glyph run into a texture would save a noticeable amount of frame time. I'll continue profiling to get to the bigger bottlenecks.
  9. I agree that it's extremely inefficient. It won't take much effort to modify the current code. I'll take a look into it and submit a patch sometime, if that's okay. Before I do that, I'm going to build a small test with FreeType 2's beautiful anti-aliased fonts...
  10. I think you're right, it doesn't serve any purpose at the moment. Perhaps you're correct on that matter; since a lot of these problems are contentious, it's better to handle smaller tasks right now. Since Philip's pathfinder is still a WIP, I haven't given it much attention. Perhaps Philip could comment on this? Whether he can finish the pathfinder, or whether we should start over?

Right now simulation 'turns' are taken a few times per second. Given that we can calculate on average how many frames run between turns (call it N), we can divide the time into 'frame slots'. Specifically, you can assign slot[0] = Task0, slot[1] = Task1, slot[N] = TaskN, where of course any task with a slot greater than N just overflows. In this sense, if the tasks are asynchronous they can be assigned to any slot; if they require a previous task to run first, they have to be assigned to a higher slot. It's actually pretty simple once you put it to code (see the sketch at the end of this post). It just helps you divide all the tasks over a series of frames instead of calling all of them every 250ms. It should help with the framerate spikes and it won't introduce data races.

I thought Philip said that JavaScript has no noticeable effect on performance? You are correct, that is always a huge issue - getting the same results on different platforms. So a lot of functions would need to have profile macros? Or do you mean a kind of profiler you can run inside JS scripts? So basically, it's rather useless since the amount of data is overwhelming. Now Profiler2 sounds like something much better. I think all of this profiling and implementing a better profiler is really, really good stuff actually... BUT - there's always a but somewhere - Visual Studio 2012 has a really good profiler too... The main benefit of the VC++ integrated profiler is that it accurately measures time spent inside a function, and the timers are compiled into the assembly output if you run your program with the profiler on. Its downside is similar to Profiler1's: it generates huge datasets (~512MB per minute for pyrogenesis) that take quite some time (a few minutes) to analyze. Luckily the analyzed report is quite small.

So, for example, here's a performance sample of just running the main menu and opening the few tooltips present, filtered to show only the last minute of crazy menu action. The first thing we can see is that the 0 AD main menu doesn't use a lot of CPU power: out of 8 cores it only needs one active thread, while the others are mostly sleeping. Even though 0 AD didn't consume that much CPU, it still didn't run fast. What gives? The only noticeable peaks in CPU activity are probably from menus and dialogs opening up. Checking the hot path in the generated report, we see something interesting: around 25% of the time is spent suspending threads (which is very good - we don't need to use more power than necessary), and another 50% goes to WaitForSingleObject, the Windows API function used to wait on mutexes and other synchronization objects. Are we having concurrency issues here? Is it a worker thread waiting for a mutex wakeup? It's also interesting that std::char_traits<char>::compare is called a lot and is therefore singled out by the profiler. There were 5 threads running in total, so it makes sense that they were suspended most of the running time, otherwise CPU usage would have skyrocketed. Let's take a closer look at the statistics we really care about: the active application hot path.
Now we can start really profiling the main menu. What is going on here exactly? Aside from lots of threads sleeping, we can see that GuiManager is the one doing the real work. Around 20% of the total CPU time was spent drawing the GUI; if we do some quick math (20/25), that makes 80% of the active run time. So what, big news? Rendering takes time. We all knew that. Besides, everyone uses recursion to create easy-to-use GUI APIs. What we're missing here, however, is why that recursion is taking so much time. Image drawing doesn't take any time at all, but CTextRenderer::Render does. So most of the time is spent rendering text? Let's have a look: Hah! I get it. It's constructing the glyph runs every frame, again and again and again. It should do this once, when the text object is initialized with its text. The Direct2D API on Windows has special glyph run objects for handling exactly this, distinguishing static text from dynamic text. Clearly we've found why the GUI was struggling on the profiler run.

Summary: Even though the GUI isn't the priority, it was the easiest thing to profile in such a short amount of time (just to showcase the VS2012 profiler and how useful it can be). It also helped discover that the 10-year-old GUI code has a pretty big weakness: text rendering wastes a huge amount of memory and resources every frame, since the glyph runs are temporary objects. I've been looking at the FreeType 2 library, an open-source library used for TrueType and OpenType text rendering. It has a very slim and streamlined API. Perhaps this is something I could start with?
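For the frame-slot idea mentioned earlier in this post, a bare-bones version could look something like this (a sketch; TaskFn, the class name and the slot counts are all made up):

    #include <vector>
    #include <functional>

    typedef std::function<void()> TaskFn;

    // Spreads per-turn simulation tasks over the N frames between turns,
    // instead of running them all inside the same 250ms tick.
    class FrameSlotScheduler {
        std::vector<std::vector<TaskFn>> slots; // slots[i] runs on the i-th frame after a turn
        size_t currentSlot;
    public:
        explicit FrameSlotScheduler(size_t framesPerTurn)
            : slots(framesPerTurn), currentSlot(0) {}

        // Tasks that depend on an earlier task simply get a higher slot index.
        void Assign(size_t slot, TaskFn task) {
            slots[slot % slots.size()].push_back(std::move(task));
        }

        // Called once per rendered frame.
        void OnFrame() {
            for(TaskFn& task : slots[currentSlot])
                task();
            currentSlot = (currentSlot + 1) % slots.size();
        }
    };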
  11. Ah, now I understand what you're saying. Spending time optimizing a fixed-point library could take weeks. I've fallen into the fixed-point pitfall before (working on Atmel 8-bit ATmega chips) and in the end, when I ran out of time and just needed a working solution, I reverted to the gcclib software floats. In the end it didn't matter whether I used fixed point or floats - there was enough performance either way... yet I spent a huge amount of time on premature optimization, time I could have spent on the actual code instead. So what I'm trying to say is: we shouldn't bother growing and optimizing a fixed-point library. We should just stick with fast floats and win in both performance and development time. Sounds reasonable?
  12. In this current case, I'm running a 4-year-old laptop with a cheaper Intel i7 mobile chip. I'm not getting that much performance out of the game, and changing to floats gave at least a 20% performance increase in a regular game. With the 'big combat demo' it might give even more. The main target platforms all have an FPU integrated on the chip, and it's not reasonable to suffer a performance drop because rare embedded systems without an FPU are considered a possible release target. It just makes no sense - even ARM processors (probably the one in your phone too) have an FPU. Floats are fast thanks to hardware support; sure, not all float values can be precisely represented, but that is something that can be accounted for.
  13. Well, as the OP stated, it took a full day or two to bring in the required changes. Furthermore, testing and bugfixes (like OOS) were not included in that, so they will add a lot more time. This is a non-trivial change to the entire engine.
  14. Hmm, that is rather dire. :/ Do you mean some of the core issues that were brought up? You are right about the generated code. Yet some parts convert to fixed point and then to floating point, which is already one step too many. I'm just trying to take a pragmatic angle here: sure, floating-point numbers tend to 'float', but they're pretty good for their overall convenience. If OOS is the main issue, I'm sure a deterministic rounding method (both fast and accurate) can be brought in during hash generation. The performance improvement was measured in a single-player match, "Punjab 1". Nowadays floats are fast on x86/x64, so using the hardware-optimized FSQRT or the SSE RSQRTSS opcode (yeah - sqrt is built into the CPU on x86/SIMD, weird right?) seems like the way to go. Currently even SSE2-capable hardware struggles to run the game, so in that respect, relying on SSE2 is a viable option. SSE2 was introduced with the P4 over 12 years ago. I might have agreed with 'not everyone having SSE2' 6 years ago, but now that's something very rare. Let's face it: SSE2 is on each and every Intel/AMD x86/x64 machine currently reading this post. Sounds reasonable? Tell me what you think about making SSE2 'required'? We could even run tests to see if floats give good results in real-world cases. I think the performance improvement is worth it, even if fixed point is still all the rage on microcontrollers. That's good to know - I didn't know much about the ARM architecture.
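To illustrate the hardware sqrt point: with SSE available, a scalar square root is one instruction away through compiler intrinsics (a sketch; in practice you'd simply let the compiler emit SQRTSS for std::sqrt when SSE code generation is enabled):

    #include <xmmintrin.h>  // SSE intrinsics (SQRTSS / RSQRTSS)

    // Exact single-precision square root - compiles down to one SQRTSS instruction.
    inline float SqrtSSE(float x) {
        return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
    }

    // Fast approximate 1/sqrt(x) via RSQRTSS (roughly 12 bits of precision),
    // handy when a rough vector normalization is good enough.
    inline float RSqrtSSE(float x) {
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    }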
  15. If it were some expensive bottleneck in the code, I'm sure you would have noticed it already. The weight was 'distributed' over the whole program in the hundreds of conversions like fixed::fromInt(1) and fixed::fromFloat(mVariable). I reckon those conversions added up to some significant overhead. You are absolutely right about the possible harmful effects of changing code - that's a given any time you write new code. But in this case the codebase does need some maintenance, and since I have time, it would be foolish to postpone it forever. Yes, my own tests also showed that SSE2 does some extra rounding off the edges. Now is the time to figure out which is worse. For floating-point accuracy, SSE2 would be best, because it rounds out FLT_EPSILON/2.0f. Unfortunately it won't work for embedded systems, since they don't have SSE2. All modern PC CPUs support SSE2, so it won't matter unless embedded systems are a concern. Alas, a lot of people would also want to play this on an iPhone or Android device... that's way more cross-platform than we can handle. Yes, I'm aware of the debugger attaching and of the heap debug mode. That's why I ran the game outside of the IDE. So still 20%. In either case, even if it doesn't hit the cache, a translation system will still be needed, and in that case some minor performance increase will come along with it. Ah, OK. So in that sense, the pathfinder still needs some work done on it.
  16. The problem with multiplayer OOS is a fair point, but I'm still inclined to think it can be solved with floats. Perhaps a client-server architecture should be considered? What do you think? So far changing to floats gave a +20% increase in performance, so I'd say it's noticeable. As far as I know, /fp:precise enforces consistent floating-point rounding to reduce errors in floating-point math. The VC++ workspace is configured with /fp:precise out of the box. Float sqrt does have an advantage there, since it's usually a compiler intrinsic and gets replaced with the fsqrt opcode. That might be one of the reasons float performance was faster for my build. I wouldn't ever consider debug build performance something critical. In that sense I was talking about the rendering loop performance and how it could be improved. Perhaps something like this (pseudocode):

    for(IShaderTechniqueMgr* tech : context->GetTechniques()) {
        for(IMaterialPass* pass : tech->GetMaterialPasses()) {
            renderer->SetTechnique(tech->Technique());
            renderer->SetMaterial(pass->Material());
            for(IModel* model : pass->GetModelsCulled(frustum)) {
                renderer->Render(model);
            }
        }
    }

Ah, that makes sense. I guess I was expecting a more straightforward solution, but the binary format cache should work marvellously. That was just a tiny example - the actual number of strings can get much bigger. And even if the strings don't take that much memory, the answer will always be: cache. A translation module is needed anyway, so I guess it will bring a benefit either way. Yes, I figured having strong references to core components in an entity would help with that. Such as (more pseudocode):

    class UnitEntity : ... {
        CCmpTransform mTransformation;
        CCmpActor mActor;
    };

Something like that. And a quick answer to your next question: no, I don't mean that at all. Rather, 4-5 generalized entity types to make the core of the game run faster, while scriptable components would still extend the entity functionality. It seems to combine the best of the [hardcoded] and [fully dynamic] approaches, while leaving less headache for us programmers in the end. I'm afraid the profiler won't point out a choke point in the code if the whole code is equally busy accessing a whole bunch of components. Even though each access is relatively fast, it adds up over iterations, creating an algorithmic bottleneck. Are you talking about the stalled pathfinder? It would be great if it were completed. You are right that specific implementations all require different aspects, but they all fit into the same kind of BSP scheme, which can be used as a base interface for the specific implementations. So all of the data could sit in a quadtree for a 'rough match' lookup, and after narrowing it down, a more specific algorithm can be used for the actual detection. Perhaps this could be done during the actual movement update? This would mean a list of "EntitiesInRange" is attached to every entity, and when two units come into range of each other, they share each other's pointers. This would avoid dangling pointers and would also instantly notify a unit when someone leaves its range. Using the quadtree structure, when a unit enters a new grid cell it can do a quick update against the entities already in that cell, to see if anyone is in range. After that initial entry, the squared distances are updated on each sample where the unit moves. Something like this could work, and possibly faster than comparing two large arrays.
I think right now is a bit too early to answer that question, but I'm inclined to think that moving towards a well implemented C++ design with a scriptable interface would work out best.
  17. Can you give a few screenshots of trees to show the difference?
  18. At this time though I reckon it's better to work on the current code and see where that goes. Changing to Ogre3D is something for (perhaps) part 2... Right now I'll start working on the resource files.
  19. Yes, though MyGUI for Ogre3D is a very good implementation and I doubt we should roll our own. Those folders would probably disappear entirely, replaced by Ogre3D. Ogre3D has a COLLADA plugin, so no changes there. Yes, this is the main part where the integration would occur. This is the bridge between modules: [simulation] <-> [ps engine] <-> [ps graphics]. Luckily it means the graphics can be replaced with changes to [ps engine] only. Ogre3D has a very mature maths library. I already replaced the ps maths vectors with Vector2, Vector3 and Vector4 (had to roll my own implementations), but Ogre3D has a much more mature version, with maths for interpolation and optimized ray intersection. Now that is cool. I think you are right. A lot of these arguments should be put into a list describing the benefits and the amount of code required to change. I'll get right on it.
  20. It does seem so. I have substantial time today to assess the amount of code that would require reimplementing if such a switch were made. I wouldn't jump into something this drastic without analyzing the current state - but nevertheless, the graphics engine does require a major overhaul, so it's something I definitely need to take up! Yes, for example in the source of Doom 3. And all of these games have managed with it. Having a few small places where float imprecision is taken into account is much, much better than using CFixed everywhere. The floating-point standard (IEEE 754) is now over 28 years old and holding strong. We are aware of the rounding errors, and if all else fails, just compile with /fp:precise. Yes, it also brings out the reasons why fully deterministic systems like that don't work - even a slight change in version number or entity tags will break sync. IMO a different sync model should be considered in this case, deviating towards a more network-intensive solution. The article also notes that using /fp:precise works (as stated by several compiler vendors too: MSVC++, GCC) and provides fully deterministic results. Just so you know, 0AD already uses /fp:precise; even though precise floats are slower (since they are rounded), they still perform much better than CFixed<>. Right now the main branch of the SVN will stay as it is regardless, since these changes are too big to integrate into the game immediately. And as mentioned, multiplayer sync may be broken. I'd say we should just forget about CFixed<> and use /fp:precise. It's better to use a language built-in feature than to design a library that slows the game down considerably.
  21. Even though that is a rather rare edge case, something like that can be circumvented by simply applying a special rounding function during hash generation, which converts the float to an integer:

    inline __int64 roundedFloatHash(float f) {
        return (__int64(f * 100000.0f) >> 2);
    }
    // 0.00154 -> 154 -> 38
    // 0.00523 -> 523 -> 130

In either case, I don't see why the game engine has to suffer a 20% performance loss when float hashing could be implemented on demand. It is a matter of precision that can easily be decided on with the world size in mind. Plenty of games use methods like this - it's a well-used pattern.
  22. The quickest hack to increase performance would be to put the bucket maps and vectors into ShaderModelRendererInternals. That would save at least a bit on memory allocation overhead... heh.
  23. That is a valid point that I didn't consider. Though in this case, the float can be rounded to a nearby value during serialization (which happens before the string is hashed...). That method is utter nonsense though. The component itself should have a GetHash() method which provides a reliable hash that can be used to compare two component objects. I see no reason to revert back to fixed point. Bad design choices spur an onslaught of derivative Whiskey-Tango-Foxtrot, leading to decreased maintainability and performance.
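Something along these lines is what I mean by GetHash() (a sketch; the component name, its fields and the quantization factor are purely illustrative):

    #include <cstdint>

    // Hypothetical position component - real components obviously look different.
    struct CCmpPositionExample {
        float x, z;

        // Quantize floats before mixing them into the hash, so peers whose values
        // differ only by tiny float noise still produce the same hash.
        static uint64_t Quantize(float f) {
            return (uint64_t)(int64_t)(f * 1000.0f); // e.g. millimetre-level precision
        }

        uint64_t GetHash() const {
            uint64_t h = 0xcbf29ce484222325ULL;      // FNV-1a 64-bit offset basis
            uint64_t vals[2] = { Quantize(x), Quantize(z) };
            for(uint64_t v : vals) {
                for(int i = 0; i < 8; ++i) {
                    h ^= (v >> (i * 8)) & 0xFF;      // fold in one byte at a time
                    h *= 0x100000001b3ULL;           // FNV-1a prime
                }
            }
            return h;
        }
    };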
  24. The opcodes and precision might be different depending on the CPU (FPU?) of the device or the compiler's software float library, but the overall usability of floating-point numbers remains the same. Even though errors do occur with floats occasionally, I included proper methods for testing float equality, something similar to:

    inline bool Equal(float a, float b) {
        return fabs(a - b) < M_EPSILON;
    }

where M_EPSILON is a very small float (e.g. 0.000000001f). I've talked to Philip about this; unfortunately he doesn't have time to finish it. I wouldn't benefit much from his solution either, since it would take me as long to implement a new one as to completely understand the existing one. This is all something that can be fixed with a custom quadtree implementation. An entity would have a pointer to a quadtree cell and vice versa. This allows an entity to get a list of pointers to the objects that are in the same cell. Notice how this suddenly, without any complex lookup whatsoever, shrinks a huge list (like 200) down to just a handful (usually never more than 8). If needed, the parent cell can be checked, and so on.
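A rough sketch of the entity <-> cell linkage I have in mind (a uniform grid rather than a full quadtree, and all the names are made up):

    #include <vector>

    struct GridCell;

    struct Entity {
        float x, z;
        GridCell* cell; // back-pointer to the cell we currently occupy (nullptr at start)
    };

    struct GridCell {
        std::vector<Entity*> entities; // everyone currently inside this cell
    };

    class WorldGrid {
        std::vector<GridCell> cells;
        int width, height; // grid dimensions in cells
        float cellSize;    // world units per cell
    public:
        WorldGrid(int w, int h, float size)
            : cells(w * h), width(w), height(h), cellSize(size) {}

        GridCell& CellAt(float x, float z) {
            int cx = (int)(x / cellSize), cz = (int)(z / cellSize);
            return cells[cz * width + cx]; // sketch: no bounds clamping
        }

        // Called whenever an entity moves; keeps the entity<->cell links up to date.
        void Update(Entity& e) {
            GridCell& now = CellAt(e.x, e.z);
            if(e.cell == &now)
                return; // still in the same cell, nothing to do
            if(e.cell) {
                std::vector<Entity*>& v = e.cell->entities; // unregister from the old cell
                for(size_t i = 0; i < v.size(); ++i)
                    if(v[i] == &e) { v[i] = v.back(); v.pop_back(); break; }
            }
            now.entities.push_back(&e); // register in the new cell; range checks now only
            e.cell = &now;              // need to look at now.entities (and maybe neighbours)
        }
    };

On entering a new cell the entity only has to distance-check the handful of entities already registered there, which is exactly the 'quick update' described above.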
  25. In that regard, I'll have to ease myself into Git and slowly start poking around the component logic. When I'm certain I have the full picture, I'll make the required changes. It probably means I'll have to implement some parts inefficiently in order to get the code out faster (for example the JavaScript support). I'm currently available most of the day, so any ideas / recommendations / additional input is always welcome. Right now I'll try to fork 0ad and get my source under version control.