RedFox

WFG Retired
Posts posted by RedFox

  1. @redfox, haha I'm not sure I'd call move semantics a hack, I was impressed with the elegance of the solution, and my impression of the design decisions of WG21 was that they were very carefully thought out. In fact the papers are all of exceptionally impressive quality.

    Another interesting addition in C++11 for the STL is the emplace_back() function for appending to containers:

    http://en.cppreferen...or/emplace_back

    The emplace_back function seems a lot better than the old push_back for objects that are non-trivial / non-POD. Nevertheless, all the arguments have to be pushed onto the stack not only once but twice (once for emplace_back, once for the type constructor), before the object is copied a third time into the actual buffer.
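
    For illustration, a hedged sketch of how the two calls differ at the call site (Item is a hypothetical non-POD type with a matching constructor, not anything from the engine):


    #include <string>
    #include <utility>
    #include <vector>

    struct Item
    {
        std::string name;
        int a, b;
        Item(std::string n, int a, int b) : name(std::move(n)), a(a), b(b) {}
    };

    void Example()
    {
        std::vector<Item> vec;
        vec.push_back(Item("unit", 1, 2)); // builds a temporary Item, then copies/moves it into the buffer
        vec.emplace_back("unit", 1, 2);    // forwards the arguments straight to the Item constructor
    }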

    For POD types, something like this would be preferred, since it involves 1 initialization and 1 copy:


    Data d = { 1, 2, 3 };
    vec.push_back(d); // argument is passed as 'const Data&'

    And I still think the move constructor is a hack that tries to fix one of the biggest problems C++ has: temporary non-trivial objects, which are something that should be avoided at all costs. At least now they've introduced a way to avoid triggering heap allocation/deallocation three times, which somewhat lessens the impact.

    Still, the implementation of std::string is non-trivial, and looks like this on VC++ 2012 (28 bytes):


    class string
    {
        iterator_proxy* _Proxy;
        union {
            char _Buf[16];
            char* _Ptr;
        };
        size_t _Size;
        size_t _Reserved;
    };

    The idea is that small strings are kept inside the internal buffer, while strings longer than 15 chars are allocated on the heap. Regarding std::move and move constructors/destructors in general, they are still non-trivial, containing dozens of operations for copying and nulling the string objects.
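
    As a hedged illustration of that behaviour (the exact inline buffer size is implementation-specific; VC++ 2012 keeps roughly 15 chars inline):


    #include <string>
    #include <utility>

    void SsoExample()
    {
        std::string small = "hello";         // fits the internal _Buf: no heap allocation
        std::string big(100, 'x');           // too long for the buffer: _Ptr points to the heap
        std::string stolen = std::move(big); // move ctor copies the members and nulls 'big' -
                                             // a couple dozen operations, but no new allocation
    }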

    C# solves this by just turning every class into reference counted objects (like Microsoft's COM, but with a Garbage Collector touch). In C++ the closest parallel is a shared_ptr<> wrapper, however it's again completely different from C# references:


    // again from VC++ 2012 STL
    template<class T> class shared_ptr
    {
        T* _Ptr;               // pointer to the object allocated with new
        _Ref_count_base* _Rep; // pointer to a dynamically allocated object that holds the shared reference count
    };

    So this just keeps a pointer to a dynamically allocated reference-counting object, which is rather inelegant. If you use a lot of shared_ptr<>'s in your project you'll start feeling the overhead.

    Luckily, or rather surprisingly, Microsoft designed an (in my opinion) elegant solution to the problem:


    class IUnknown {
    public:
        int refcount;
        IUnknown() : refcount(1) {} // initial creation has refcount 1
        void AddRef() { ++refcount; } // adds to the reference count of the object
        void Release() { if(!--refcount) delete this; } // decreases refcount; if it reaches 0, deletes the object
    };
    class MyObject : public IUnknown { /* implement your own object */ };

    So now you introduce a reference counting overhead for every object you intend to pass by reference. For actual use you will need a reference wrapper. In Microsoft's COM it's a comptr<>:


    template<class T> struct comptr
    {
        T* ptr;
        // add constructors and copy operator overloads that call AddRef and Release as needed
    };
    //
    comptr<MyObject> obj = new MyObject(); // compiles fine because MyObject inherits IUnknown

    Yeah, when I said it's elegant I might have been a bit too generous. I think it's more accurate to say it's more elegant than a shared_ptr<>. What it does achieve, though, is a single allocation for the object, no matter how many copies of comptr<MyObject> you pass around.

    For true C++ use you sometimes need objects allocated on the stack (because it's fast and your data is sometimes trivial) and other times you need them to be dynamic. So the COM solution will never work for stack-allocated objects; it will result in a crash or heap corruption. That's awful.

    Another solution would be to create a custom reference object that proxies a struct of the object pointer and its reference count:


    template<class T> struct refptr
    {
        struct proxy {
            T* ptr;
            int refcount;
        };
        proxy* p; // proxies should be allocated from a special dynamic memory pool

        inline refptr() : p(nullptr) {} // default to a null pointer
        inline refptr(T* ptr) {
            p = proxyAlloc();  // gets a new proxy from the proxy memory pool
            p->ptr = ptr;      // set the proxy ptr
            p->refcount = 1;   // initial refcount is 1
        }
        inline ~refptr() {
            release();         // try to release the object
        }
        inline void release() {
            if(p && !--p->refcount) { // decrease the reference count
                delete p->ptr; // delete the actual object
                proxyFree(p);  // return the proxy to its memory pool
            }
        }
        inline T* operator->() {
            return p->ptr;     // will crash if the proxy is nullptr - intentionally no checks here
        }
        inline operator bool() { return p ? true : false; } // so we can test the refptr in conditionals
    };

    Even though it's not a completely working example (it's missing the copy constructor, the assignment operators and a ton more operators to make it usable), it still illustrates the idea of a very simple reference object (only 4 bytes wide on x86) that gives you a properly reference-counted object:


    {
        Data d1; // allocate a Data object on the stack
        d1.a = 5;

        refptr<Data> d2 = new Data(); // allocate on the heap and store as a reference
        d2->a = 5;
        refptr<Data> d3 = d2; // share the reference
        d2 = nullptr;         // decreases the reference count
        assert(d3->a == 5);   // d3 still sees the same object
        d3 = nullptr;         // refcount reaches 0 here, so the Data object is deleted
    }
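
    For completeness, a hedged sketch of the copy constructor and assignment operators the refptr above is missing (same hypothetical proxy layout and pool functions assumed; these would go inside the template):


    inline refptr(const refptr& other) : p(other.p) {
        if(p) ++p->refcount;          // sharing the proxy bumps the count
    }
    inline refptr& operator=(const refptr& other) {
        if(this != &other) {
            release();                // drop our current reference first
            p = other.p;
            if(p) ++p->refcount;
        }
        return *this;
    }
    inline refptr& operator=(std::nullptr_t) { // enables 'd2 = nullptr' from the example above
        release();
        p = nullptr;
        return *this;
    }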

    So yes, C++11 provides many ways for manipulating data for us, but it doesn't provide the one true solution that we really need. I think the compiler should be extended to treat the following patterns as special cases:


    vector<int> GetData()
    {
        vector<int> data;
        data.push_back(10);
        return data;
    }
    {
        vector<int> list = GetData(); // case 1 - initialization
        list = GetData();             // case 2 - temporary copy replace
    }

    And it should generate altered code for these special cases (regardless of optimization levels):


    // case 1 - initialization
    void _init_GetData(vector<int>& out)
    {
        out.vector(); // call default ctor
        out.push_back(10);
    }
    // case 2 - temporary copy replace
    void _replace_GetData(vector<int>& out)
    {
        out.~vector(); // call the dtor
        out.vector();  // call the default ctor
        out.push_back(10);
    }
    {
        vector<int> list;
        _init_GetData(list);
        _replace_GetData(list);
    }

    Optimizations like this (return value optimization) are only partly achieved with certain compiler flags and they don't always work. I don't see why the compiler can't generate this kind of code. It's true that with this change the assignment operator will not be called, but isn't that what was actually desired? Even if you replace the code with std::string or even refptr, the end result will still be the same.

    So yeah, rvalue references are a hack for a mistreated feature (operator=) of the C++ language. And I guess we'll have to live with it, unless someone comes around with a clean new language somewhere in between C++ and D.

  2. I really appreciate this discussion, some great resources have been referenced.

    To add something to the fray: a lot may be gained by using the new features of C++11. In particular I have in mind rvalue references, because they eliminate a lot of temporary object copying.

    http://www.cprogramm...s-in-c++11.html

    That's an excellent point - introducing a proper move constructor/operator to the objects will make a huge difference to the current code. Though it would still involve quite a bit of copying, and the temporary object destructors are still called - they just won't have to delete any resources, since the data has been moved already.

    The following code will still be the most efficient:


    void GetSomeData(vector<Data>& outData)
    {
        outData.push_back(Data());
    }

    {
        vector<Data> list;
        // ...
        GetSomeData(list);
    }

    Whereas the following code would introduce a rather large number of operations:


    vector<Data> GetSomeData()
    {
        vector<Data> result;
        result.push_back(Data());
        return result;
    }

    {
        vector<Data> list;
        // ...
        list = GetSomeData();
    }

    What happens behind the scenes here:


    void GetSomeData(vector<Data>& rvalue)
    {
        vector<Data> result;      // ctor
        result.push_back(Data());
        rvalue.operator=(result); // copy the result
        result.~vector();         // calls the dtor of the initial vector
    }

    {
        vector<Data> list;
        vector<Data> rvalue;    // compiler generates a temporary for the return result
        GetSomeData(rvalue);    // get the result into the temporary
        list.operator=(rvalue); // copies the temporary
        rvalue.~vector();       // calls the dtor of the temporary rvalue
    }

    Basically what C++11 introduces is a new type of copy operator/constructor called the 'move operator/constructor':


    vector(vector&& rvalue)
        : size(rvalue.size), capacity(rvalue.capacity), buffer(rvalue.buffer)
    {
        rvalue.size = 0;
        rvalue.capacity = 0;
        rvalue.buffer = nullptr;
    }

    vector& operator=(vector&& rvalue)
    {
        // we don't have to check for 'this != &rvalue', since rvalues are always just temporary objects
        if(buffer) delete[] buffer; // make sure any data on this vector is freed
        size = rvalue.size;
        capacity = rvalue.capacity;
        buffer = rvalue.buffer;
        rvalue.size = 0;
        rvalue.capacity = 0;
        rvalue.buffer = nullptr;
        return *this;
    }

    So, instead of calling the standard 'vector& operator=(const vector& other)', the compiler will use 'vector& operator=(vector&& rvalue)' whenever a temporary object is assigned.

    This is a lot better than copying the data 3 times! But it's still bad. It's basically a hack to fix an ugly design problem that C++ has, and this is something we have to live with. The only proper way to deal with it is to just use 'void GetSomeData(vector<Data>& outData)'. It's less intuitive, but at least it's not ugly behind the scenes.

  3. First, simplest step: we could agree on basic performance guidelines. The draft is open for corrections/additions/etc., but I feel it's important to have a set of rules guiding code quality regarding performance and memory. There are a lot of web pages (GotW) or books (Scott Meyers' Effective C++, Effective STL) that list and enforce those.

    This is a good point to start from! Maintaining a more detailed list of 'good C++ design patterns' seems like the way to go. It shouldn't be too brief though, and should explain these concepts to non-expert C++ programmers. Not everyone has expert knowledge of x86 assembly, of how the operating system maps physical memory to virtual memory, or of how the stack and heap exactly work.

    Now, as stated on the wiki page, the game slowing down over time is due to the current memory model (nearly everything is heap allocated inside containers), which implies a lot of new/malloc/delete/free during runtime, which even shows up in the profiler.

    The problem is called memory fragmentation

    I haven't run any long-running tests on the pyrogenesis engine, but usually if your program doesn't have any obvious leaks, yet the total paged memory keeps increasing, the answer is always fragmentation. Even though I'm not familiar with Linux allocation schemes, Windows Vista/7/8 use the Low Fragmentation Heap (a cluster of memory pools). Even though it sounds promising, it only prevents the adverse effect of malloc trying and failing to find a proper-sized memory block. If your dynamic allocations vary too much, it will end up creating a lot of different-sized clusters, thus increasing the memory paged by the process anyway.

    Once all code follows those steps, that will leave us with the current "direct allocation" scheme, which uses objects allocated on the heap instead of reusable object pointers,

    It's a bit too forgiving to think that something like that would magically solve all the problems - the memory management problem is a design issue that is not easy to fix and will always take a lot of time. Nevertheless, sometimes it's easier to spend those 2 extra hours of your evening creating a quick memory pool, rather than letting others spend hours maintaining that code a few months later. Sounds familiar?

    For instance the pathfinding computeShortPath code fills 6 different std::vector<Edge>, making heap allocations for those Edges (0ad does not use any custom STL allocator), then copy allocations and copy operations along the way.

    Meaning that for a simple path with a minimum range of 16, a minimum average of 6*16*16 = 1536 allocations and even more object copy operations happen over a single A* step (that explains the memory fragmentation on its own).

    Using pointers, as in std::vector<Edge*>, would reduce those to allocations done by ourselves, and remove all the "copy allocations", as each Edge allocated here is duplicated and stored in at least the main EdgeList and, depending on position, copied into EdgesLeft, EdgesRight, EdgesTop, etc.

    Using a memory pool, there would be no allocation at all. (no slowing down over time.)

    In the ComputeShortPath example, an edge consists of 2 FixedVector2D-s, which each consist of two 32-bit CFixed objects. Since each edge is 128 bits / 16 bytes in size, the programmer obviously counted on the SSE2 extensions, which move unaligned 128-bit memory blocks through the XMM registers in a single op (Edge is a POD type, so G++/VC++ will definitely use MOVUPD [move unaligned packed doubles]). This of course requires compiling with SSE2 flags.

    But 0AD does not compile with any enhanced instruction set flags, so perhaps that wasn't the best decision. Overall, this seems like the perfect place to use a memory pool. If you are lazy, you can just pre-determine the maximum possible pool node count, align it to default page size (4KB) and just allocate one huge block of memory.

    To note though, any self-respecting stack is 1MB in size (that's the default on the Win32 platform anyway), so if you know your dynamic data fits into a reasonable size (64KB?), you can use alloca(size_t) to dynamically allocate on the stack instead.

    Given that you stated 6 std::vector<Edge> with 16*16 edges each was the possible maximum (I haven't looked that closely), that gives a required reserve of sizeof(Edge) (16 bytes) * 256 * 6 => 24KB. This is actually so small that it's a bit evil to fragment the heap with vectors like that. It's easier to just feed 64KB of stack memory into a generic pool and never look back again.

    No fragmentation: check. No heap alloc: check. No mutex lock on process heap: check. Triple win, people? Yeah, alloca wins huge in small scenarios like this.
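
    To make that concrete, here's a hedged sketch of a trivial bump allocator over a caller-provided buffer (all names are illustrative, nothing here is existing 0 A.D. code); ComputeShortPath could feed it stack memory from alloca() and place Edges into it without ever touching the heap:


    #include <cstddef>

    class BumpPool
    {
        char* begin;
        char* end;
        char* cursor;
    public:
        BumpPool(void* buffer, size_t bytes)
            : begin((char*)buffer), end((char*)buffer + bytes), cursor((char*)buffer) {}

        void* Allocate(size_t bytes)
        {
            bytes = (bytes + 15) & ~size_t(15);       // keep 16-byte granularity for the POD Edge
            if (cursor + bytes > end) return nullptr; // out of arena space - the caller must handle this
            void* p = cursor;
            cursor += bytes;
            return p;
        }
        void Reset() { cursor = begin; }              // "free" everything at once after each A* step
    };

    // usage sketch inside the pathfinder:
    //   void* buf = alloca(64 * 1024);               // 64KB straight off the stack
    //   BumpPool pool(buf, 64 * 1024);
    //   Edge* e = new (pool.Allocate(sizeof(Edge))) Edge();
    //   ... run one short-path step ...
    //   pool.Reset();                                // nothing to delete; the stack frame just goes away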

    Being careful with this would be a good idea, but I think you also need to be careful with your analysis.

    I think in general terms (aside from some obvious nomenclature errors regarding stack/heap) his assessment was pretty much spot on. If you design an entire program with the idea that 'a single vector won't fragment memory', then you'll have a program full of a lot of vectors with a lot of fragmented memory. :)

    Also, you mention that you can see the effect of fragmentation in the profiler - have you got this information somewhere? (I haven't been keeping up with everything recently.) Your example of computeShortPath causing fragmentation seems very surprising; since the allocation is always the same size and is very localised, the allocated blocks should be reused. I would have thought that fragmentation occurs when you have a mixture of large and small allocations of varying sizes; a classic bad case is a container which is getting items added to it and gradually grows, so it keeps asking for larger allocations.

    Some profiler data is always good to look at, but proper analysis can take hours or days - it's not that simple after all. In this case std::vector is a classic example of memory fragmentation, since the initial allocation size depends on the implementation and thus varies between platforms.

    Looking at VC++ STL, its std::vector implementation is one of the worst, since it always allocates by sizeof(T) and always expands by +50%. Because of that, something like this usually happens:


    SIZE  CAPACITY  BYTES
       0         0      0
       1         1      4
       2         2      8
       3         3     12
       4         4     16
       5         6     24
       6         6     24
       7         9     36
       8         9     36
       9         9     36
      10        13     52

    And you wonder where the fragmentation comes from? There are 7 reallocations required in only 10 elements added. And C++ isn't exactly known for its fast reallocation, since it requires: 1) new buffer, 2) copy data, 3) delete old buffer.

    If you used a custom vector implementation that always makes its initial allocation at, say, 32 bytes, with each successive allocation growing by +50% and aligned to 32 bytes, you would get something like this:


    SIZE  CAPACITY  BYTES
       0         0      0
       1         8     32
       2         8     32
       3         8     32
       4         8     32
       5         8     32
       6         8     32
       7         8     32
       8         8     32
       9        16     64
      10        16     64

    This scales better for heap memory management - if we release a 32-byte block and allocate a 64-byte block, another instance of our custom vector can pick up the 32-byte block we just freed - it's a perfect fit after all. ^_^
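
    A hedged sketch of the growth policy just described (+50% growth with a 32-byte minimum, rounded up to 32-byte blocks):


    #include <cstddef>

    static size_t NextCapacityBytes(size_t currentBytes, size_t neededBytes)
    {
        size_t grown = currentBytes + currentBytes / 2; // grow by +50%
        if (grown < neededBytes) grown = neededBytes;   // never less than what the caller needs
        if (grown < 32) grown = 32;                     // minimum allocation of one 32-byte block
        return (grown + 31) & ~size_t(31);              // round up to a 32-byte boundary
    }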

    Yeah, sorry for going off-topic so suddenly. I'm just used to programming very performance-critical applications, and most of the time custom implementations such as this are needed to fit the scenario at hand. This scenario favors speed over memory, so allocating a few bytes ahead is not a problem.

  4. That is wrong. In the P2P case we send 5 packets per second (one every 200 ms) per peer per peer.

    I think he meant 'for example, if'.

    In the float-proof client-server case we'd send hundreds upon hundreds of packets to account for position, health, animation state etc. of every given unit, building, projectile etc. per peer.

    How is this any different between P2P and client-server? In either case there exists a 'synced' version of the current game and only state changes are sent/received. In the client-server model the client sends a change in its state to the server while 'starting' the action. The server receives the update, sends an acknowledgement and also updates all other clients with the new state.

    In the P2P model described, everyone broadcasts their state changes to every other client, meaning N times more packets to send out in any case.

    Basically, with a running server you can at least control the bandwidth and sync at the same time. In peer-to-peer there's no single instance that can be considered the main sync object (or perhaps player 0?).

    If the network data is sent in raw binary, then bandwidth shouldn't be an issue at all.

  5. My few cents:

    • Memory is the huge culprit that no one points out, but it's the real performance killer: there should be no allocation/deallocation during the game. Not a single one. Preallocation, memory pools, you name it. It doesn't really show in profilers (still, new and delete are hotspots...) as it's a general issue. There are very big perf gains there. (Memory fragmentation/allocation is a huge perf killer, causing CPU cycles to be wasted waiting for data; check all the data-oriented design papers/talks/slides.)

    You are correct on that: if I take a look at a small 1-minute profile run, the amount of string and vector resizing is huge. Tens of millions of buffer resizes tend to add up and fragment the heap. Even if the 'real' memory use seems small, the damage has already been dealt algorithmically. The use of unordered_map<> is also huge in this project - hash tables actually require quite a lot of memory; I'll get back to this statement at the end of this post.

    • Coming from Ogre, making the switch is obviously a good idea, not only for perf, but also because it leads to a better decoupling of mechanics and graphics code.

    Even though Ogre also has its own quirks, it would indeed be easier to use and doesn't require maintenance.

    • Performance: I made a non-behaviour-changing performance patch for all the hotspots listed by profilers, if anyone wants to review it / give it a try. It gives big perf gains here, but doesn't solve the AI/memory hiccups. What it does is mostly remove unnecessary CPU cycles and memory copies. Mostly good C++ perf practices (added to the wiki, notably the http://www.tantalon....opt/asyougo.htm guidelines).

    That's an excellent patch! You've gone and fixed a lot of painful code. Great job! ^_^

    • fixedpt isn't really needed and doesn't help the base code SLOC. Few external libraries use it (not Ogre, OpenGL, Detours, etc.), and there's really not much reason for it nowadays: software fixed point is up against x64 CPU floating point (especially if not using /fp:precise), x64 registers, not to speak of the specialised branch/prediction mechanisms of SSE/AVX/etc. No map size here seems large enough to justify precision problems with floats, and even then there are tricks to solve those kinds of problems.

    I've tried to explain that point, but it seems the general opinion is somewhat fixed...

    Going through the code, it seems that someone discovered the proverbial hammer in associative maps and then proceeded to treat every problem as a proverbial nail. Even worse are constructs like CStrIntern::GetString, which creates a huge map of shared_ptr<>'s to CStrInternInternals to 'save space on duplicate strings'. Of course the key has to be 'std::string', meaning the unordered_map<> has to store a duplicate of the string anyway.

    Not only is this implementation flawed, but CStrIntern wastes more than double the amount of space for every string, and searching for a matching string begins to take a large amount of time. In the profiler the comparison was triggered over 4 million times during a 1-minute run of the game.

    Why do I say it wastes more than double? This is because of the implementation of the C++ hash table (unordered_map<>), which has to keep a copy of the key value. Hash tables work by computing a bucket for a given hash. A single bucket can contain several key<->value pairs. In order to tell them apart, the boost and std implementations need a copy of the original key (this isn't a problem when your key is an integer, for example).

    Since the unordered_map<> key in CStrIntern is a plain 'std::string', a full copy of the string is kept as a key. This has huge performance implications (!). The memory usage itself is very difficult to calculate on a hash_map, because it's non-linear and depends on the actual hash values calculated. Nevertheless, doing a simple whole program memory comparison will do the trick:


    // VC++11, Release
    wchar_t buffer[80];
    const size_t memStart = ProcessMemory();
    std::unordered_map<std::wstring, MyData, MyHasher, MyComparer> tbl;
    for(int i = 0; i < 5000; ++i)
    {
        int len = swprintf(buffer, L"aaaaBBBBccccDDDDeeee %8d", i);
        tbl[std::wstring(buffer, len)] = MyData();
    }
    const size_t memUsed = (ProcessMemory() - memStart);

    Given that we have 5,000 unique strings, 32 wchars in length (which isn't that far-fetched for 0AD), the memory usage grows to around 1.75MB, and once over 1,024 entries the hash table slows down by several orders of magnitude, especially in debug mode.

    Now the first reaction would be "well, a few megabytes shouldn't really matter". Oh? Well, 5000 wchar_t strings of length 32 should roughly take sizeof(wchar_t)*33*5000 bytes of memory (which is ~330KB on Win32). So that's around 6 times less memory. On Linux wchar_t is 4 bytes, but the relation itself should remain similar. Now if your program only dealt with strings and your L2 cache size was 1MB, then the second option (330KB) would fit inside the L2 cache, while the larger one (1.75MB) would not. That would immediately make your program 5x slower, due to the cache misses.

    I've written image analysis software for robotics before, and even a small difference of 100KB in lookup table size could make the whole algorithm 5-10 times faster, depending on the actual number of memory accesses. Now, the L2 cache isn't really that huge and other programs want a share of the cache too - this is why keeping your data small (330KB?) can ensure that it fits into the cache while contending with other processes.

    To sum up: I think CStrIntern is a huge bug in the game logic and should be replaced with a proper translation implementation - one which would solve the 'duplicate strings' problem while remaining efficient - by handing out const pointers to the strings in a large block of contiguous memory.
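
    To show what I mean, here's a hedged sketch (not a drop-in 0 A.D. patch) of interning strings into fixed-size chunks and handing out stable const pointers; a real version would replace the linear scan with a proper index of offsets rather than keeping full std::string copies as hash keys:


    #include <cstring>
    #include <vector>

    class StringArena
    {
        static const size_t ChunkSize = 64 * 1024;
        std::vector<std::vector<char>> chunks;  // chunks are never resized, so returned pointers stay valid
        std::vector<const char*> interned;      // flat list of interned strings, used only for lookup

    public:
        const char* Intern(const char* s)
        {
            for (const char* existing : interned)    // O(n) scan for illustration only
                if (strcmp(existing, s) == 0)
                    return existing;                 // already interned: hand out the existing pointer

            size_t len = strlen(s) + 1;
            if (chunks.empty() || chunks.back().size() + len > ChunkSize)
            {
                chunks.push_back(std::vector<char>());
                chunks.back().reserve(ChunkSize);    // reserve once so data() never moves
            }
            std::vector<char>& chunk = chunks.back();
            const char* p = chunk.data() + chunk.size();
            chunk.insert(chunk.end(), s, s + len);   // append the string plus its terminator
            interned.push_back(p);
            return p;
        }
    };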

  6. This is a huge change, but also something that just has to be done. We could follow the OOC (Object-Oriented C) approach, where objects are expressed as opaque handles:


    // myscript.js
    var g_Playlist = [ "music01.ogg", "music02.ogg" ]; // our playlist
    var g_Current = 0; // invalid handle
    var g_Next = 0;    // index of the next sound to play
    // ... some event ...
    {
        // where sound_* is the global sound interface
        var newsound = sound_load(g_Playlist[g_Next++]); // load the next piece
        if(g_Next >= g_Playlist.length) g_Next = 0;
        sound_stop(g_Current); // stop the current music after we have actually loaded newsound
        sound_play(newsound);  // start playing the new music
        g_Current = newsound;
    }

    The C++ code would keep track of the handles and their validity. For example, 0 would be an invalid handle. You won't ever get dangling pointers this way, since a handle can be invalidated after its object has been deleted. Null pointers are also impossible, since handle 0 marks an invalid handle - in which case the interface method should simply do nothing.
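
    A hedged sketch of how the C++ side could back those handles (Sound, LoadSoundFile and the sound_* bindings are illustrative names, not existing engine code):


    #include <string>
    #include <unordered_map>

    struct Sound { void Play() {} void Stop() {} };                          // stand-in for the engine's sound object
    static Sound* LoadSoundFile(const std::string&) { return new Sound(); }  // hypothetical loader

    typedef unsigned int SoundHandle;

    class SoundManager
    {
        std::unordered_map<SoundHandle, Sound*> m_Sounds;
        SoundHandle m_Next;
    public:
        SoundManager() : m_Next(1) {}                 // handle 0 is reserved as "invalid"

        SoundHandle Load(const std::string& file)
        {
            Sound* s = LoadSoundFile(file);
            if (!s) return 0;                         // failed load -> invalid handle
            SoundHandle h = m_Next++;
            m_Sounds[h] = s;
            return h;
        }
        void Play(SoundHandle h)
        {
            auto it = m_Sounds.find(h);
            if (it != m_Sounds.end()) it->second->Play(); // unknown or stale handles are silently ignored
        }
        void Stop(SoundHandle h)
        {
            auto it = m_Sounds.find(h);
            if (it == m_Sounds.end()) return;         // stale handle: do nothing, never crash
            it->second->Stop();
            delete it->second;
            m_Sounds.erase(it);                       // the handle is invalid from now on
        }
    };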

  7. Isn't that table showing that 0.56% of the total per-frame time is in CTextRenderer::Render? That sounds plenty fast enough already to me :P

    If I run with vsync disabled, I see CTextRenderer::Render going up to 6%, but then the menu is rendering at 1000fps so that's not really a problem.

    Anyway, I definitely like profiler usage, and the MSVC one is okay - just need to be careful to apply it to the areas that really do need optimisation and not get distracted by inelegant code that doesn't actually matter in practice :)

    The second column is Elapsed inclusive time in percentage of total run time.

    That 100% would be the total wall clock time for all 5 threads. So in that single thread context 20% would become its share of the wall time.

    Running a new profile with Tiber River: if I compare the other GUI objects, I can see that CText::Draw and CMiniMap::Draw take much more time than the rest - around 4% out of that 20%. So that would make around 1/5th of the rendering pipeline spent drawing text and the minimap. It's safe to say that pre-rendering a glyph run into a texture would save a noticeable amount of frame time.

    I'll continue profiling to get to the bigger bottlenecks.

  8. (edit: that code does seem crazily inefficient, though)

    Edit2: ran the game a little in the main menu, Xcode's profiler tells me about 40% of the time was spent calling CGUI::textDraw if I read things correctly.

    I agree that it's extremely inefficient. It won't take much effort to modify the current code. I'll take a look at it and submit a patch sometime, if that's okay. Before I do that, I'm going to build a small test with FreeType 2's beautiful anti-aliased fonts... ^_^

  9. Maybe we should just skip the fixed point vs float discussion for a moment.

    I think you're right, it doesn't serve any purpose at the moment. :)

    If you eventually want to become a full-time programmer you should start with a less contentious issue ;). Don't get me wrong, but no matter if you're right or wrong on that matter, it would be good to add something everybody is happy with. There are enough tasks for that, it shouldn't be a problem. :)

    Perhaps you're correct on that matter; since a lot of these problems are contentious, it's better to handle smaller tasks right now. ^_^

    In my opinion the main problems currently are:

    1. Pathfinder: Meaning the pathfinder itself that is used for unit movement but also the pathing functionality that the AI need and currently implements separately because there's no interface to the engine's pathfinder implementation that fulfills the requirements.

    Since Philip's pathfinder is still a WIP, I haven't given it much attention. Perhaps Philip could comment on this? If he can finish the pathfinder, or if we can start over?

    2. Multithreading support: Separating AI and Simulation from the rendering would about double the average framerate and would get rid of lag-spikes happening when for example the AI calculates the map for dropsite placement (multiple seconds on large maps!). I've tried doing that for the AI but my conclusion was that the spidermonkey upgrade needs to be done first (3).

    Right now simulation 'turns' are taken a few times per second. Given that we can calculate on average how many frames are rendered between turns (let it be N), we can divide that time into 'frame slots'. Specifically, you can assign slot[0] = Task0, slot[1] = Task1, ..., slot[N] = TaskN, where any task whose slot is greater than N simply overflows into the last slot. In this sense, if the tasks are independent they can be assigned to any slot; if they require a previous task to run first, they have to be assigned to a later slot.

    It's actually pretty simple once you put it into code (a rough sketch follows below). It just helps you divide all the tasks over a series of frames instead of calling all of them every 250ms.

    It should help with the framerate spikes and it won't introduce data-race conditions.
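
    Here's a hedged sketch of that frame-slot idea (N is the expected number of frames per simulation turn; all names are illustrative, not existing engine code):


    #include <cstddef>
    #include <functional>
    #include <vector>

    class TurnSlotScheduler
    {
        std::vector<std::function<void()>> m_Tasks; // Task0, Task1, ... in dependency order
        size_t m_Frame;                             // frames elapsed since the current turn started
    public:
        TurnSlotScheduler() : m_Frame(0) {}
        void AddTask(std::function<void()> task) { m_Tasks.push_back(task); }

        void OnFrame(size_t N) // called once per rendered frame
        {
            if (m_Frame < N - 1 && m_Frame < m_Tasks.size())
                m_Tasks[m_Frame]();                 // slot[i] runs Task i
            else if (m_Frame == N - 1)
                for (size_t i = N - 1; i < m_Tasks.size(); ++i)
                    m_Tasks[i]();                   // overflow: any remaining tasks share the last slot
            if (++m_Frame >= N) m_Frame = 0;        // a new turn begins, start again at slot 0
        }
    };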

    3. JavaScript Engine improvements and Spidermonkey upgrade: I'm currently working on that in ticket #1886. It's difficult to predict how much that will improve performance. For Firefox the Spidermonkey performance improved a lot but our situation is a bit different. Anyway, the upgrade is need for various reasons - one being the multithreading support.

    I thought Philip said that JavaScript has no noticeable effect on performance?

    An issue we have with all kinds of performance improvements is how to measure them. You are talking about a 20% improvement, which is inherently quite a random number you measure in one specific situation on a very specific platform (hardware, driver, OS, compiler options) and with unspecified settings. It would be very helpful both for making good judgements and for motivation to have better ways of measuring performance.

    You are correct, that is always a huge issue - getting the same results on different platforms.

    What's missing is some kind of benchmark mode that tests different aspects of performance in scripted ingame scenes. It could be some camera movements over dense forests, over water, a lot of units fighting, an army moving through a forest and pathing through a narrow gap in formation, a maximum of 8 AI players doing their normal AI-calculations etc.

    So a lot of functions would need to have profile macros? Or do you mean a kind of profiler you can run inside JS scripts?

    We already have two integrated profilers in the engine (simply called Profiler1 and Profiler2).

    Profiler1:

    "source/tools/replayprofile/*" contains a Perl script to extract profiling data from the text files generated by Profiler1.

    There's also a HTML file that can display nice graphs like that:

    Issues with profiler1:

    • There's a bug that messes up the profiling data after some time. I haven't analyzed it any further, but you should get what I mean if you let it run with a game for a few minutes.
    • It has some issues with multithreading
    • In some situations it created huge files of profile data. I once got a 32 GB file, and it had such an impact on performance that the data loses its significance.

    So basically, it's rather useless since the amount of data is overwhelming.

    Profiler2: Philip created this one, which is very helpful for accurately measuring what happens in a single frame. Check the forum thread for more information.

    It even supports displaying multiple threads and calculations on the GPU.

    Profiler2 keeps the data in fixed size memory buffer and overwrites the oldest data when the buffer is full. The format is a binary format with much less overhead than profiler1's output. It makes the data accessible via web-server.

    Maybe it would be better to extend this profiler to save the data in the memory buffer to the disk.

    Now Profiler2 sounds like something much better.

    What do you think? :)

    I think all of this profiling and implementing a better profiler is really really good stuff actually... BUT - there's always a but somewhere. Visual Studio 2012 has a really really good profiler too... The main benefit of the VC++ integrated profiler is that it accurately measures time spent inside a function, and the timers are compiled into the assembly output if you run your program with the profiler on. Of course its downside is that it has a similar issue to Profiler1: it generates huge datasets (~512MB per minute for pyrogenesis) that take quite some time (a few minutes) to analyze. Luckily the analyzed report is quite small.

    So for example, taking a performance sample of just running the Main Menu and opening the few tooltips present, of course only filtering and showing the last minute of crazy menu action:

    perf_mainmenu_01.png

    The first thing we can see is that the 0 A.D. main menu doesn't use a lot of CPU power: out of 8 cores it only needs one active thread, while the others are mostly sleeping. Even though 0 A.D. didn't consume that much CPU, it still didn't run fast. What gives? The only noticeable peaks in CPU activity are probably from menus and dialogs opening up.

    Checking the hot path for the generated report, we see something interesting:

    perf_mainmenu_02.png

    Around 25% of the time is spent suspending threads (which is very good - we don't need to use more power than necessary), and another 50% is spent in WaitForSingleObject, the Windows API's generic wait function (used for mutexes, events, etc.). Are we having concurrency issues here? Is it a worker thread waiting for a mutex wakeup? It's interesting to note that std::char_traits<char>::compare is being called a lot and is thus highlighted by the profiler. There were 5 threads running in total, so it makes sense that they were suspended most of the running time; otherwise CPU usage would have skyrocketed.

    Taking a closer look at the function statistics that we really care about: the active application hotpath.

    perf_mainmenu_03.png

    Now we can start really profiling the main menu. What is going on here exactly? Aside from lots of threads sleeping, we can see that GuiManager is the one doing the real work. Around 20% of the total time was spent drawing the GUI; with some quick math (20/25) that makes about 80% of the active run time.

    So what, big news? Rendering takes time. We all knew that. Besides, everyone uses recursion to create easy-to-use GUI APIs.

    What we're missing here, however, is why the recursion takes so much time. Image drawing doesn't take any time at all, but CTextRenderer::Render does:

    perf_mainmenu_04.png

    So most of the time is spent rendering text?

    Let's have a look:

    perf_mainmenu_05.png

    Hah! I get it. It's constructing the glyph runs every frame, again and again and again. It should only do this when the Text object is initialized with text. The Direct2D API on Windows has special glyph-run objects for handling exactly this, distinguishing static text from dynamic text. Clearly we've found why the GUI was struggling slightly in the profiler run.

    Summary: Even though the GUI isn't the priority, it was the easiest thing to profile in such a short amount of time (and a good showcase of the VS2012 profiler and how useful it can be). It also helped discover that the 10-year-old GUI code actually has a pretty big weakness: text rendering wastes a huge amount of memory and resources every frame, since the glyph runs are temporary objects.

    I've been looking at the new FreeType 2 library, which is an open-source library used for TrueType and OpenType text rendering. It has a very slim and streamlined API. Perhaps this is something I could start with? :)

  10. Indeed I do think it would be silly to use fixed point over floating point to support a bunch of embedded processors representing a tiny fraction of the user base. I'm also well aware that all modern CPUs have FPUs - even multiple FPUs - integrated. However, a large part of the reason for using a fixed-point system is performance on any platform. Therefore, obviously, if we can beat its performance with a floating-point system we should use it. What I'm saying though is that the performance improvement you get is surprising and might be an indication that there is room for improvement in our fixed-point system. There really shouldn't be such a gap even with an FPU.

    Ah, now I understand what you're saying. :) Spending time optimizing a fixed-point library could take weeks. I've fallen into the fixed-point pitfall before (working on Atmel 8-bit ATmega chips), and in the end, when I ran out of time and just needed a working solution, I reverted to the gcclib software floats. In the end it didn't matter whether I used fixed point or floats - there was enough performance... Yet I spent a huge amount of time on premature optimization - time I could've spent on the actual code instead.

    So what I'm trying to say is - we shouldn't bother growing and optimizing a fixed-point library. We should just stick with the fast floats and win in both performance and development time. Sounds reasonable? ^_^

  11. It seems a bit heavy handed to change the entire engine to floating point without doing some significant testing across a number of machines to ascertain whether there is actually any benefit. As you say it is a non-trivial change. I have only given the code a cursory look but perhaps there is room for optimisation of the fixed point library? By the way I'm one of those embedded devs that likes fixed point, well more like depends on fixed point. :P

    In this current case, I'm running a 4-year-old laptop with a cheaper Intel i7 mobile chip. I'm not getting that much performance out of the game, and changing to floats gave at least a 20% performance increase in a regular game. With the 'big combat demo' it might give even more benefit. The main target platforms all have an integrated FPU on the chip, and it's not reasonable to suffer a performance drop because rare embedded systems without an FPU are considered a possible release target. It just makes no sense - even ARM processors (probably on your phone too) have an FPU. Floats are fast due to hardware support; sure, not all float values can be precisely represented, but that is something that can be accounted for.

  12. How difficult is it actually to replace fixed points with floats? If it's a relatively simple change, it's probably better to leave it for a volunteer contributor.

    Well, as the OP stated, it took a full day or two to bring in the required changes. Furthermore, testing and bugfixes (like the OOS issues) weren't included in that. So it will add up to a lot of time. This is a non-trivial change to the entire engine.

  13. There are quite a few that have been noticed years ago and still not fixed :(

    Hmm, that is rather dire. :/ Do you mean some of the core issues that were brought out?

    fixed::fromInt(1) gets compiled to something like "mov ecx, 10000h", so it has no cost. fixed::fromFloat is more expensive but is used very rarely (CCmpRallyPointRenderer is the only non-trivial one I can see (and that looks like it would probably be better if it used fixed everywhere instead of mixing it with float and converting back and forth)). Most of the other operations on fixed get inlined into the equivalent operation on ints.

    You are right about the code generated. Yet some parts convert to fixed point and then to floating point, which is one step too many already. I'm just trying to look at it from a pragmatic angle here: sure, floating-point numbers tend to 'float', but they're pretty good for their overall convenience. If OOS is the main issue, I'm sure a deterministic rounding method (which is both fast and accurate) can be brought in during hash generation.

    What situation did you measure the 20% performance improvement in? If I run e.g. Combat Demo (Huge) and start a fight for a couple of minutes, then the profiler indicates about 35% of the total runtime is in CFixedVector2D::CompareLength, which is called by CCmpPathfinder::ComputeShortestPath calling std::sort(edgesAA...). Almost all the time in CompareLength is doing lots of expensive-on-x86 64-bit multiplies, so that's the kind of thing that might well be made a lot faster by using floats (though I'd guess it could also be made quite a bit faster while sticking with ints, with some x86-specific code to do a 32x32->64 bit mul or with SSE2 or something). But the real problem here is that the short-range pathfinder is terribly unscalable - it needs a different algorithm, which'll mean it won't do a crazy number of std::sort(edgesAA...), and then the performance of CFixedVector2D::CompareLength will hardly matter. Were you measuring that performance issue or something else?

    The performance improvement was measured in a single-player match, "Punjab 1". Nowadays floats are fast on x86/x64, so using the hardware FSQRT or SSE2 SQRTSS instruction (yeah - sqrt is built into the CPU on x86/SIMD, weird right? :D) seems like the way to go.

    The point is that you can't just add a compiler option like /fp:precise and get the same FP results on all platforms. I think it's unreasonable to require the game to always be built with SSE2 on x86 (particularly given how hard it is to control the build flags used by Linux distros), but then anyone running an x87 FP build of the game will eventually get out-of-sync errors. The way to avoid all those problems is to not use floats.

    Currently even SSE2-capable hardware struggles to run the game, so in that regard, relying on SSE2 is a viable option. SSE2 was introduced with the P4 over 12 years ago. I might have agreed with 'not everyone' having SSE2 6 years ago, but now that's something very rare. Let's face it, SSE2 is on each and every Intel/AMD x86/x64 machine currently reading this post. :) Sounds reasonable? Tell me what you think about making SSE2 'required'. We could even run tests to see whether floats give good results in real-world cases. I think the performance improvement is worth it, even if fixed point is still all the rage on microcontrollers.

    (I think ARM should generally give the same results as x86 SSE2 (ignoring how compilers rearrange expressions), since its VFP does 32-bit/64-bit operations - the problem is just x87 which does 80-bit operations without rounding.)

    That's good to know - I didn't know much about the ARM architecture. :)

  14. Fundamentally redesigning the entirely multiplayer architecture isn't something that should be taken so lightly :)

    Where precisely did that 20% come from? I'd expect it to be one or two call sites that are doing some relatively expensive maths, and it would be far easier to find and focus on optimising those areas than to make changes across the entire codebase with unpredictably harmful side-effects.


    If it was some expensive bottleneck in the code, I'm sure you would have noticed it already. The weight was 'distributed' over the whole program in the hundreds of conversions like fixed::fromInt(1) and fixed::fromFloat(mVariable). I reckon those conversions added up to some significant overhead. You are absolutely right about the possible harmful effects of changing code - that's a given any time you write new code. But in this case the codebase does need some maintenance, and since I have the time, it would be foolish to postpone it forever.


    > cl test.cpp /O2 /fp:precise /arch:SSE2
    0
    > cl test.cpp /O2 /fp:precise /arch:SSE
    5.96046e-008

    Yes, my own tests also showed that SSE2 does some extra rounding off the edges. Now it's time to figure out which is worse. For floating-point accuracy, SSE2 would be best, because it rounds out FLT_EPSILON/2.0f. Unfortunately it won't work for embedded systems, since they don't have SSE2. All modern PC CPUs support SSE2, so it won't matter unless embedded systems are concerned. Alas, a lot of people would also want to play this on an iPhone or Android device... that's waaaay more cross-platform than we can handle.

    I don't mean a debug build - I mean a release build that you are running in a debugger (e.g. when you press F5 in Visual Studio). (If I remember correctly, it's any time IsDebuggerPresent() returns true when the CreateProcess() is called, unless you set an environment variable _NO_DEBUG_HEAP=1.) (That caused me quite a bit of confusion when first trying to measure and optimise stuff :()

    Yes, I'm aware of the debugger attaching and of the heap debug mode. That's why I ran the game outside of the IDE. So still 20%.

    It takes about 60 bytes per character just for the GL vertex arrays to draw the text, so the sizeof(wchar_t) bytes per character plus overhead per string to store all the currently-displayed text won't be a significant part of the total cache usage either.

    In either case, even if it won't hit the cache, a translation system will still be needed. And in that case, some minor performance increase will come along with it.

    I mean the old one that's currently in the game. (The new not-currently-in-progress one uses the same short-range pathfinder, it just makes the long-range tile-based pathfinder faster and fixes some bugs due to mismatches between the two.)

    Ah, OK. So in that sense, the pathfinder still needs some work done on it.

  15. That'll cause OOS in multiplayer, because floats aren't consistent between different builds of the game. That's almost the whole point of using CFixed. (The rest of the point is that floats cause nasty imprecision issues, so you might find that e.g. pathfinding works okay at the bottom-left corner of the map but not at the top-right because the coordinates are larger so the resolution is coarser. That's usually a very hard problem to reason about, whereas using fixed-point numbers makes it a non-issue). And they're just ints, so almost all operations on them will be at least as fast as the same operations on floats.

    The problem with multiplayer OOS is a fair point, but I'm still inclined to think that it can be solved with floats. Perhaps a client-server architecture should be considered? What do you think? So far changing to floats gave a +20% increase in performance, so I'd say it's noticeable.

    I think /fp:precise doesn't solve this, because the MSVC documentation says the compiler can still do intermediate calculations with 80-bit precision and only round at the end, which means it can give different answers to platforms that do 64-bit precision throughout (e.g. SSE2, or a debug build that spills things onto the stack). Also various library functions (sqrt, sin, etc) differ between platforms. Also requiring /fp:precise globally would make our graphics code (which doesn't really care about precise floats) slower. (And using different compiler options for different modules of the code would be insane and would probably break LTCG etc.)

    As far as I know, /fp:precise adheres to a consistent floating point rounding to reduce errors with floating point math. The VC++ workspace is configured with /fp:precise 'out of the box'.

    If I remember correctly, one of the parts of CFixed that was actually slow was CFixed::Sqrt, since that's hard to do with ints. I tried changing the implementation to use floating-point sqrt instead, which made it much faster and appeared to give the same output in all the cases I tested, but I never quite trusted it enough to be fully portable. It probably wouldn't be infeasible to exhaustively test it on every interesting platform to see if it's safe enough, so that might be a nice thing to do. (I don't remember what was actually calling Sqrt though - it might have just been the pathfinder computing Euclidean distances, and I think the new pathfinder design avoids that by computing octile distances instead.)

    Float sqrt does have an advantage there, since it's usually a compiler intrinsic that gets replaced with the FSQRT instruction. That might be one of the reasons float performance was faster in my build.
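
    For illustration, a hedged sketch of what such a float-backed fixed-point square root could look like (assuming a 16.16 fixed-point layout; this is not the actual CFixed code):


    #include <cmath>
    #include <stdint.h>

    static int32_t FixedSqrt(int32_t fx)        // fx is a non-negative 16.16 fixed-point value
    {
        if (fx <= 0)
            return 0;
        double real = fx / 65536.0;             // fixed -> floating point
        double root = std::sqrt(real);          // hardware square root (FSQRT / SQRTSD)
        return (int32_t)(root * 65536.0 + 0.5); // floating point -> fixed, rounded to nearest
    }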

    Have you done any profiling of this? :) (Note that if you're running in MSVC you have to be careful to disable the debug heap (even in release builds) else a lot of the STL usage will be massively slower than when run outside the debugger). When I implemented this, I think I measured the per-frame overhead as pretty insignificant (and it allowed a lot of flexibility with changing materials and effects and lighting etc at runtime, without the hassle and bug-proneness of invalidating lots of cached data). It'd be nice to have numbers of how much it currently costs, to see what the potential benefits of improving it would be.

    I wouldn't ever consider debug build performance something critical. In that sense I was talking about the rendering loop performance and how it could be improved. Perhaps something like this (pseudo code):


    for(IShaderTechniqueMgr* tech : context->GetTechniques())
    {
        for(IMaterialPass* pass : tech->GetMaterialPasses())
        {
            renderer->SetTechnique(tech->Technique());
            renderer->SetMaterial(pass->Material());
            for(IModel* model : pass->GetModelsCulled(frustum))
            {
                renderer->Render(model);
            }
        }
    }

    We've solved that by caching the parsed XML in a binary format (XMB) that takes no time to load, and release packages have all of the files combined into a zip so there is no filesystem overhead.

    Ah, that makes sense. I guess I was expecting a more straightforward solution. But the binary format cache should work marvellously.

    Players have got at least half a gigabyte of RAM - why is 48KB worth any effort?

    :) That was just a tiny example. The actual number of strings can reach a much bigger size. And even if the strings don't take that much memory, the answer will always be: cache. A translation module is needed anyway, so I guess it will bring a benefit either way.

    I agree that's a problem - QueryInterface is a bit slow, and BroadcastMessage likely has terrible cache usage patterns.

    Yes, I figured having strong references to core components in an entity would help with that. Such as (more pseudocode):


    class UnitEntity : ...
    {
        CCmpTransform mTransformation;
        CCmpActor mActor;
    };

    Something like that. Of course a quick answer to your next question:

    Is your suggestion to create a new class for every type of entity (UnitEntity, InfantryUnitEntity, MeleeInfantryUnitEntity, SpearmanMeleeInfantryUnitEntity, etc)? That's the kind of inflexible hard-coded hierarchy that component-based designs are intentionally trying to get away from. There's a load of documentation from game developers over the past several years about how the component systems make their lives much better.

    No, I don't mean that at all. Rather, 4-5 generalized entity types to make the core of the game run faster, while scriptable components would still extend the entity functionality. That seems to combine the best of the [hardcoded] and [fully dynamic] approaches, while leaving less of a headache for us programmers in the end.

    When an entity is accessing its own components, maybe a better way to minimise the QueryInterface cost without significantly changing the design would be to let a component cache the returned IComponent* the first time it does QueryInterface (after instantiation/deserialization) and use the pointer directly after that, which is safe since components can't be deleted dynamically unless the whole entity is deleted. When accessing components of a different entity you've still got to do a lookup on the entity ID though. (Storing pointers to other entities would be bad because it makes deserialization hard and makes dangling pointer bugs much more likely). Again it'd be useful to do some profiling to see which uses of QueryInterface have a significant cost - I'd suspect it's probably a small number that would be easy to fix without redesigning anything much.

    I'm afraid the profiler won't point out a chokepoint in the code if the whole codebase is accessing a whole bunch of components equally. Even though it's relatively fast, it also adds up over iterations, creating an algorithmic bottleneck.

    It does that already (where "inaccurate" is "tile-based and ignoring mobile entities").

    Are you talking about the stalled pathfinder? It would be great if it was completed.

    The different modules store different data and have different access patterns - e.g. collision uses obstruction shapes and needs to do queries against typically short lines, whereas range tests use the centres of entities and need to do queries against huge circles with filtering on player ID. (And the renderer needs 3D bounding boxes of models for frustum culling, which is conceptually similar but practically very different again.)

    I do agree the current subdivision scheme is naive and could do with improving, but I don't think trying to use a single instance for all the different modules would help.

    You are right that specific implementations all require different aspects, but they all fit into the same kind of BSP scheme, which can be used as a base interface for the specific implementations. So all of the data could be in a quadtree for a 'rough-match' lookup and after narrowing it down, a more specific algorithm can be used for the actual detection.

    I don't think the subdivision-data-structure queries are currently the slow part of the range tests - even if the queries were instant, they might return several hundred entities and just trying to figure out which ones have entered or left each other entity's range is too expensive. Usually only a very small proportion have recently entered/left the range, so ideally we wouldn't have to even look at the majority of entities that have remained clearly inside/outside. I don't know how best to do that, though.

    Perhaps this could be done during the actual movement update? This would mean a list of "EntitiesInRange" is attached to every entity, and when two units are in range of each other they share each other's pointers. This would avoid dangling pointers and would also instantly notify a unit when someone leaves its range. Using the quadtree structure, if a unit enters a new grid cell, it can do a quick check against the entities already in that cell to see if anyone is in range. After that initial entry, the squared distances are updated on each sample where the unit moves. Something like this could work, and possibly faster than comparing two large arrays.
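
    A hedged sketch of that per-movement range update (the quadtree lookup and the actual event dispatch are omitted; all names are illustrative):


    #include <algorithm>
    #include <vector>

    struct Unit
    {
        float x, y, range;
        std::vector<Unit*> inRange;                       // the "EntitiesInRange" list described above

        void OnMoved(const std::vector<Unit*>& sameCell)  // units in the same quadtree cell
        {
            for (Unit* other : sameCell)
            {
                if (other == this) continue;
                float dx = other->x - x, dy = other->y - y;
                bool isNear = dx * dx + dy * dy <= range * range; // squared distance, no sqrt needed
                std::vector<Unit*>::iterator it = std::find(inRange.begin(), inRange.end(), other);
                if (isNear && it == inRange.end())
                    inRange.push_back(other);             // fire an "entered range" event here
                else if (!isNear && it != inRange.end())
                    inRange.erase(it);                    // fire a "left range" event here
            }
        }
    };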

    I'd prefer to leave it in JS and redesign it :). (I have no idea what design it should have, but its current design is certainly not great). It needs to access a load of other components that are only implemented in JS and don't have C++ interfaces yet, so moving it to C++ would be a pain and would reduce the engine's flexibility, plus C++ is just generally more of a pain than JS to write in. I don't think I've ever noticed UnitAI performance as a serious problem - it doesn't run hugely frequently (given that it's all event-driven) and doesn't do any expensive computation, and we should be able to cope happily with a few thousand JS function calls per second. Is there some profiling data showing which parts of it are a problem? :)

    I think right now is a bit too early to answer that question, but I'm inclined to think that moving towards a well implemented C++ design with a scriptable interface would work out best.

  16. From what I know of the code, implementing Ogre fully means changing the entirety of the GUI folder and perhaps also converting GUI files (to use Ogre; if we actually switch the GUI, it's a full change).

    Yes, though MyGUI for Ogre3D is a very good implementation, and I doubt we should roll our own.

    It means a complete change of the renderer folder.

    It means tons of updates in the graphics folder, but that actually shouldn't be too much of a problem.

    Those folders would probably disappear entirely, replaced by Ogre3D.

    Might require some changes to COLLADA handling (can't recall if Ogre handles COLLADA natively).

    Ogre3D has a COLLADA plugin. So no changes there.

    ps/ probably has files that would need to be changed, as it's the core of Pyrogenesis, but that also shouldn't be a problem with your experience.

    Yes, this is the main part where the integration would occur. This is the bridge between modules: [simulation] <-> [ps engine] <-> [ps graphics]. Luckily it means that graphics can be replaced with changes to [ps engine].

    The issue is maths/. If you use the Ogre-provided versions of matrices and vectors, it would require changing things everywhere in the code. Perhaps a simpler solution would be to typedef those types (which would maintain the usage of CWhatever for class names).

    The rest should be mostly left as is.

    Ogre3D has a very mature maths library. I already replaced the ps maths vectors with Vector2, Vector3 and Vector4 (had to roll my own custom implementations), but Ogre3D has a much more mature version, with maths for interpolation and optimized ray intersection. Now that is cool.
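    If we went the typedef route suggested above, it could look roughly like this. Whether the existing CVector3D/CMatrix3D/CQuaternion call sites map cleanly onto Ogre's interfaces is an assumption here; real code would likely still need shims for differing member and method names:


    // maths_aliases.h - hedged sketch of aliasing Ogre's maths types under the
    // existing C-prefixed names.
    #include <OgreVector3.h>
    #include <OgreMatrix4.h>
    #include <OgreQuaternion.h>

    typedef Ogre::Vector3    CVector3D;
    typedef Ogre::Matrix4    CMatrix3D;
    typedef Ogre::Quaternion CQuaternion;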

    However, Ogre was not chosen before, and we need to know why, to check if the reasoning still applies. I suggest you start a new topic explaining what we would gain (in terms of speed, flexibility, features) and how much time you think it'll take after assessing the necessary changes. Then we'll be able to make a sensible decision.

    I think you are right. A lot of these arguments should be put into a list describing the benefits and the amount of code that would need to change. I'll get right at it. ;)

  17. Well, Ogre3D supports OpenGL ES 1.1 or higher. So this would probably make porting easier.

    Hm, the more I think about it, the better it would be to use an established rendering engine like Ogre.

    It does seem so. I have substantial time today to assess the amount of code that would require reimplementing if such a switch were made. I wouldn't jump into something this drastic without analyzing the current state - but nevertheless, the graphics engine does require a major overhaul, so it's something I definitely need to take up!

    The game has to suffer a 20% performance loss because we don't want random out of sync errors popping up. You say most games use a pattern like this, do you have sources for that statement?

    Yes, for example in the source of Doom3. And all of these games have managed with it. Having a few small places where float imprecision is taken into account is much, much better than using CFixed everywhere. The floating-point maths standard is now over 28 years old and holding strong. We are aware of the rounding errors, and if all else fails, just compile with /fp:precise.

    This article http://gafferongames...nt-determinism/ has numerous quotes saying that unless they used the same compiler, replays and multiplayer would not work. The hashing is irrelevant; the little differences will sometimes cause a genuine divergence in the game state. A single multiplayer game could easily have more than a billion floating point operations happening.

    Yes, and it also brings out the reasons why fully deterministic systems like that don't work - even a slight change in version number or entity tags will break sync. IMO, a different sync model should be considered in this case, deviating towards a more network-intensive solution.

    The article also notes that using /fp:precise works (as stated by several compiler providers too: MSVC++, GCC) and provides fully deterministic results. Just so you know, 0AD already uses /fp:precise; even though precise floats are slower (since they are rounded), they still perform much better than CFixed<>.

    It's definitely a very interesting long term perspective, particularly as the renderer is quite limited.

    On the topic of floats vs fixed, unless serialization and OOS checks can be efficiently changed to disregard all rounding errors, this is something that needs to be kept. Perhaps it can be optimized a little, however (20% seems fairly big). Anyway, there are other areas where optimizations are interesting, so that's nothing too important.

    Right now the main SVN branch will stay as it is regardless, since these changes are too huge to integrate into the game immediately. And as mentioned, multiplayer sync may be broken.

    Edit: this is probably impossible, but another solution would be for SP to use floats and MP to use CFixed. Might require two apps though, à la CoD and perhaps some other games.

    I'd say we should just forget about CFixed<> and use /fp:precise. It's better to use a built-in language feature than to design a library that slows down the game considerably.

  18. I already explained that rounding doesn't work. One CPU will give 1.49999, the other will give 1.5000; one player will get 1, the other will get 2, and they will go out of sync. Rounding simply decreases the probability of an out-of-sync problem. Unless you make every algorithm that uses floats avoid every rounding boundary, you cannot solve this. So using fixed is a good design decision because it makes keeping the game in sync manageable.

    Even though that is a rather rare edge case, something like that can be circumvented by applying a special rounding function during hash generation, which converts the float to an int:


    inline __int64 roundedFloatHash(float f) {
        // scale to the precision we care about, then drop the low bits so values
        // that differ only by tiny rounding errors hash identically
        return __int64(f * 100000.0f) >> 2;
    }
    // e.g. roundedFloatHash(0.00154f) == 38, roundedFloatHash(0.00523f) == 130

    In either case, I don't see why the game engine has to suffer a 20% performance loss when float hashing could be implemented on demand. It is a matter of precision that can easily be decided based on the world size. All games use methods like this; it's a well-used pattern.

  19. The renderer is a big change that has the advantage of being very long term and really good FPS-wise, so that's why I think it's the second highest priority (particularly as there's really no one else to do it).

    The quickest hack to increase performance would be to put the bucket maps and vectors into ShaderModelRendererInternals. That would save at least a bit on memory allocation overhead.. heh.
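    What that hack boils down to is reusing the containers across frames instead of rebuilding them. A rough sketch - only the struct name comes from the actual renderer; the member layout and bucketing key are invented for the example:


    #include <unordered_map>
    #include <vector>

    struct CModel;                           // opaque here

    struct ShaderModelRendererInternals {
        // persists across frames; clear() keeps the vectors' capacity, so the
        // per-frame heap traffic mostly disappears after the first few frames
        std::unordered_map<int, std::vector<CModel*>> buckets;
    };

    void RenderModels(ShaderModelRendererInternals& r, const std::vector<CModel*>& models) {
        for (auto& kv : r.buckets)
            kv.second.clear();               // reuse previously allocated memory

        for (CModel* m : models) {
            int techKey = 0;                 // placeholder for the real bucketing key
            r.buckets[techKey].push_back(m);
        }
        // ... sort and submit each bucket as before ...
    }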

  20. Very similar isn't good enough. We need exactly the same result or a player will go out of sync. You can't rely on rounding because you will occasionally get a value right on the boundary so it rounds differently on each system. Very carefully designed code might be able to force stability but it would be really hard to do, and it needs to be done all over the simulation so is basically infeasible.

    That is a valid point that I didn't consider. Though in this case, the float can be rounded to a nearby value during Serialization (which is used before the string is hashed...). That method is utter nonsense though. The component itself should have a method GetHash() which provides a reliable hash that can be used to compare two component objects (a rough sketch follows below).

    I see no reason to revert back to fixed point. Bad design choices spur an onslaught of derivative Whiskey-Tango-Foxtrot, leading to decreased maintainability and performance.
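    Something along these lines is what I have in mind for GetHash(). The component, the quantization step and the hash-combine constants are placeholders rather than a final design, and as noted above, quantizing only narrows the rounding-boundary problem, it does not remove it:


    #include <cmath>
    #include <cstdint>

    // simple 64-bit hash combiner (boost-style mixing constants)
    inline uint64_t HashCombine(uint64_t seed, uint64_t v) {
        return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
    }

    // quantize a float onto a fixed grid before hashing, so values that are meant
    // to be equal across machines hash identically
    inline uint64_t QuantizeFloat(float f, float step = 1.0f / 1024.0f) {
        return (uint64_t)(int64_t)std::floor(f / step + 0.5f);
    }

    struct CCmpPositionExample {             // hypothetical component
        float x = 0, y = 0, rotation = 0;

        uint64_t GetHash() const {
            uint64_t h = 0;
            h = HashCombine(h, QuantizeFloat(x));
            h = HashCombine(h, QuantizeFloat(y));
            h = HashCombine(h, QuantizeFloat(rotation));
            return h;
        }
    };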

  21. From what I have read, floats will not be identical between systems once you start using different compilers and different architectures.

    The opcodes and precision might be different, depending on the CPU (FPU?) of the device or the compiler's software float library, but the overall usability of floating-point numbers remains the same. Even though some errors do occasionally occur with floats, I included proper methods for testing float equality, which is just something similar to:


    inline bool Equal(float a, float b) {
        return fabs(a - b) < M_EPSILON;
    }

    Where M_EPSILON is a very small float (e.g. 0.000000001f).

    Do you know about the stalled pathfinder rewrite at #1756?

    I've talked to Philip about this; unfortunately he doesn't have time to finish it. I wouldn't benefit much from his solution either, since it would take me as long to implement a new one as to completely understand the existing one.

    I had a look at this. The naive subdivision isn't what causes the performance hit; de-duplicating results that span region boundaries and using an inefficient set for lookups are much bigger factors. Unification like you say might be a good idea though.

    This is all something that can be fixed with a custom quadtree implementation. An entity would have a pointer to its quadtree cell and vice versa. This allows an entity to get a list of pointers to the objects in the same cell. Notice how this, without any complex lookup whatsoever, reduces a huge candidate list (like 200) to just a handful (usually never more than 8).

    If needed, the parent grid can be checked and so on.
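    A bare-bones sketch of that cross-linking (the subdivision/split logic is omitted and the names are made up for the example):


    #include <algorithm>
    #include <vector>

    struct QTEntity;

    struct QuadCell {
        QuadCell* parent = nullptr;
        std::vector<QTEntity*> entities;     // entities whose centre lies in this cell
    };

    struct QTEntity {
        QuadCell* cell = nullptr;            // back-pointer to the cell we currently occupy
    };

    // Move an entity between cells when it crosses a cell boundary.
    inline void Relocate(QTEntity& e, QuadCell& newCell) {
        if (e.cell) {
            auto& v = e.cell->entities;
            v.erase(std::remove(v.begin(), v.end(), &e), v.end());
        }
        newCell.entities.push_back(&e);
        e.cell = &newCell;
    }

    // Candidates for a range check are just the co-located entities; walking up
    // to the parent (or into neighbouring cells) is the "if needed" part above.
    inline const std::vector<QTEntity*>& Candidates(const QTEntity& e) {
        return e.cell->entities;
    }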

  22. In that regard, I'll have to ease myself into Git and slowly start poking around the component logic. When I'm certain I have the full picture, I'll make the required changes. It probably means I'll have to implement some parts inefficiently in order to get the code out faster (for example, JavaScript support).

    I'm currently available most of the day, so any ideas / recommendations / additional input is always welcome. Right now I'll try to fork 0ad and get my source under version control.
