
[Discuss] Memory aka game slowing over time.



Following the performance thread and my performance patch work, I'm creating this thread to discuss how to solve this specific problem.

First, simplest step: we could agree on basic performance guidelines. The draft is open for corrections/additions, but I feel it's important to have a set of rules guiding code quality with regard to performance and memory. There are plenty of web pages (GotW) and books (Scott Meyers' Effective C++ and Effective STL) that list and enforce these.

Now, as stated in the wiki page, the game slowing down over time is due to the current memory model (nearly everything is heap-allocated inside containers), which implies a lot of new/malloc/delete/free calls during runtime; these calls even show up in the profiler.

The problem is called memory fragmentation.

It causes performance to degrade over time, not only because of slower memory access but also because malloc/new itself gets slower (the time to find a free memory block of the right size grows), and it can even lead to crashes ("heap allocation failed").

Once all the code follows those guidelines, that still leaves us with the current "direct allocation" scheme, which uses objects allocated on the heap instead of reusable object pointers.

For instance, the pathfinding ComputeShortPath code fills 6 different std::vector<Edge> containers, heap-allocating each Edge (0 A.D. does not use any custom STL allocator), with copy construction and copy operations along the way.

That means for a simple path with a minimum range of 16, a minimum average of 6*16*16 = 1536 allocations, and even more object copy operations, over a single A* step (which explains the memory fragmentation on its own).

Using pointers, as in std::vector<Edge*>, would reduce this to allocations done by ourselves and eliminate all the copy construction, since each Edge allocated here is duplicated and stored in at least the main edge list, and depending on its position copied into EdgesLeft, EdgesRight, EdgesTop, etc.

Using a memory pool, there would be no runtime allocation at all (and no slowing down over time).

That's why I recommend moving to pointers, or even better, memory pools.

[last edit] s/stack/heap/

Edited by tuan kuranes

Being careful with this would be a good idea, but I think you also need to be careful with your analysis.

Your usage of "stack" seems to differ from how I understand it. In C++ the stack is where local variables are allocated: when you enter a function, the arguments and local variables are pushed onto the stack, and when the function returns everything is popped off again. Fragmenting the stack is impossible; the normal way to overflow it is too much recursion. You seem to be using the word stack for the heap, which is where new and malloc put the things they create. The heap can get fragmented, as you say. Could you please go through and correct the terminology to match standard C++ terms? Otherwise it is quite confusing to read (especially since with containers you allocate the container itself on the stack, but its constructor can allocate memory from the heap for the container to use).

Also, you mention that you can see the effect of fragmentation in the profiler. Have you got this information somewhere? (I haven't been keeping up with everything recently.) Your example of ComputeShortPath causing fragmentation seems very surprising: since the allocations are always the same size and very localised, the allocated blocks should be reused. I would have thought fragmentation occurs when you have a mixture of large and small allocations of varying sizes; a classic bad case is a container that gradually grows as items are added, so it keeps asking for larger allocations.


Some comments about the wiki article:

I suggest you put this in Coding Conventions or Code Quality rather than somewhere else, and try to follow the layout/style of those pages. It would make it easier to read and understand (also, your descriptions of the optimizations are a bit under-detailed for my taste).

Finally, a wiki article written in the first person ("I") feels a bit weird to me.


@quantumstate: sorry, fixed the stack/heap inversion.

As you said, the ComputeShortPath containers are themselves allocated on the stack, but their contents are on the heap, and the containers are popped off the stack at the end of the method call.

The heap slots they occupied are then returned for general malloc/free usage, and therefore may already be filled before the next reserve or push_back call. (And here the A* priority queue can allocate inside the loop too.)

The point is that pool allocation guarantees that we reuse the same spot; if the pool is allocated big enough at the start, it has a good chance of being contiguous, and it also saves a lot of copy constructor calls (copy from stack-allocated to heap-allocated, then to the other edge lists, etc.).

Here on Windows, free/new/delete[] and many internal STL calls (push_back) are at the top of the profiler output (Very Sleepy profiler). ( https://docs.google....dit?usp=sharing )

Memory fragmentation is my explanation for the slowdown over time, but I don't know the code as well as you; perhaps it is caused by something else?

@wraitii: removed the "I". I linked it from Code Quality in order to avoid one huge page, but I can merge it. I'll do a detailing pass on the descriptions, but I was afraid it would look like copy-paste from the linked sites.

Edited by tuan kuranes

First, simplest step: we could agree on basic performance guidelines. The draft is open for corrections/additions, but I feel it's important to have a set of rules guiding code quality with regard to performance and memory. There are plenty of web pages (GotW) and books (Scott Meyers' Effective C++ and Effective STL) that list and enforce these.

This is a good point to start from! Maintaining a more detailed list of 'good C++ design patterns' seems like the way to go. It shouldn't be too brief, though, and should explain these concepts to non-expert C++ programmers. Not everyone has expert knowledge of x86 assembly, how the operating system maps physical memory to virtual memory, or how the stack and heap work exactly.

Now, as stated in the wiki page, the game slowing down over time is due to the current memory model (nearly everything is heap-allocated inside containers), which implies a lot of new/malloc/delete/free calls during runtime; these calls even show up in the profiler.

The problem is called memory fragmentation.

I haven't run any long-running tests on the pyrogenesis engine, but usually if your program doesn't have any obvious leaks, yet the total paged memory keeps increasing, the answer is almost always fragmentation. Even though I'm not familiar with Linux allocation schemes, Windows Vista/7/8 use the Low Fragmentation Heap (a cluster of memory pools). Even though it sounds promising, it only prevents the adverse effect of malloc trying and failing to find a properly sized memory block. If your dynamic allocations vary too much, it will end up creating a lot of different-sized clusters, increasing the memory paged by the process anyway.

Once all the code follows those guidelines, that still leaves us with the current "direct allocation" scheme, which uses objects allocated on the heap instead of reusable object pointers.

It's a bit too optimistic to think that something like that would magically solve all the problems: memory management is a design issue that is not easy to fix and will always take a lot of time. Nevertheless, sometimes it's easier to spend those 2 extra hours of your evening creating a quick memory pool than to let others spend hours maintaining that code a few months later. Sounds familiar?

For instance, the pathfinding ComputeShortPath code fills 6 different std::vector<Edge> containers, heap-allocating each Edge (0 A.D. does not use any custom STL allocator), with copy construction and copy operations along the way.

That means for a simple path with a minimum range of 16, a minimum average of 6*16*16 = 1536 allocations, and even more object copy operations, over a single A* step (which explains the memory fragmentation on its own).

Using pointers, as in std::vector<Edge*>, would reduce this to allocations done by ourselves and eliminate all the copy construction, since each Edge allocated here is duplicated and stored in at least the main edge list, and depending on its position copied into EdgesLeft, EdgesRight, EdgesTop, etc.

Using a memory pool, there would be no runtime allocation at all (and no slowing down over time).

In the ComputeShortPath example, an Edge consists of 2 FixedVector2D members, each of which consists of two 32-bit CFixed values. Since each Edge is 128 bits / 16 bytes in size, the programmer apparently counted on the SSE2 extensions, which move unaligned 128-bit memory blocks through the XMM registers in a single op (Edge is a POD type, so G++/VC++ will definitely use MOVUPD [move unaligned packed doubles]). This of course requires compiling with SSE2 flags.

But 0 A.D. does not compile with any enhanced instruction set flags, so perhaps that wasn't the best decision. Overall, this seems like the perfect place to use a memory pool. If you are lazy, you can just pre-determine the maximum possible pool node count, align it to the default page size (4KB) and allocate one huge block of memory.

To note though, any self-respecting stack is 1MB in size (that's the default on the Win32 platform, anyway), so if you know your dynamic data fits into a reasonable size (64KB?), you can use alloca(size_t) to allocate it dynamically on the stack instead.

Given that you stated that 6 std::vector<Edge> with 16*16 edges each was the possible maximum (I haven't looked that closely), that gives a required reserve of sizeof(Edge) (16 bytes) * 256 * 6 = 24,576 bytes, about 24KB. This is actually so small that it's a bit evil to fragment the heap with vectors like that. It's easier to just feed 64KB of stack memory into a generic pool and never look back.

No fragmentation: check. No heap alloc: check. No mutex lock on the process heap: check. Triple win, people? Yeah, alloca wins big in small scenarios like this.

Being careful with this would be a good idea, but I think you also need to be careful with your analysis.

I think in general terms (aside from some obvious nomenclature errors regarding stack/heap) his assessment was pretty much spot on. If you design an entire program with the idea that 'a single vector won't fragment memory', then you'll end up with a program full of vectors and a lot of fragmented memory. :)

Also, you mention that you can see the effect of fragmentation in the profiler. Have you got this information somewhere? (I haven't been keeping up with everything recently.) Your example of ComputeShortPath causing fragmentation seems very surprising: since the allocations are always the same size and very localised, the allocated blocks should be reused. I would have thought fragmentation occurs when you have a mixture of large and small allocations of varying sizes; a classic bad case is a container that gradually grows as items are added, so it keeps asking for larger allocations.

Some profiler data is always good to look at, but proper analysis can take hours or days; it's not that simple, after all. In this case std::vector is a classic source of memory fragmentation, since the initial allocation size depends on the implementation and thus varies between platforms.

Looking at the VC++ STL, its std::vector implementation is one of the worst, since it always makes the initial allocation exactly sizeof(T) and always expands by +50%. Because of that, something like this usually happens:


SIZE   CAPACITY   BYTES
   0          0       0
   1          1       4
   2          2       8
   3          3      12
   4          4      16
   5          6      24
   6          6      24
   7          9      36
   8          9      36
   9          9      36
  10         13      52

And you wonder where the fragmentation comes from? That's 7 reallocations in only 10 elements added. And C++ isn't exactly known for fast reallocation, since it requires: 1) allocating a new buffer, 2) copying the data, 3) deleting the old buffer.

If you used a custom vector implementation that always makes its initial allocation 32 bytes, with each successive allocation growing by +50% and aligned to 32 bytes, you would get something like this:


SIZE   CAPACITY   BYTES
   0          0       0
   1          8      32
   2          8      32
   3          8      32
   4          8      32
   5          8      32
   6          8      32
   7          8      32
   8          8      32
   9         16      64
  10         16      64

This scales much better for heap memory management: if we release a 32-byte block and allocate a 64-byte block, another instance of our custom vector will happily pick up the 32-byte block we just used; it's a perfect fit, after all. ^_^

Sorry for going off-topic so suddenly. I'm just used to programming very performance-critical applications, and most of the time custom implementations like this are needed to fit the scenario at hand. This scenario favours speed over memory, so allocating a few bytes ahead is not a problem.

Edited by RedFox

I really appreciate this discussion, some great resources have been referenced.

To add something to the fray: a lot may be gained by using the new features of C++11. In particular I have in mind rvalue references, because they eliminate a lot of temporary object copying.

http://www.cprogramming.com/c++11/rvalue-references-and-move-semantics-in-c++11.html


I really appreciate this discussion, some great resources have been referenced.

To add something to the fray: a lot may be gained by using the new features of C++11. In particular I have in mind rvalue references, because they eliminate a lot of temporary object copying.

http://www.cprogramm...s-in-c++11.html

That's an excellent point: introducing proper move constructors/operators to these objects will make a huge difference to the current code. It would still involve quite a bit of copying, though, and the temporary objects' destructors are still called; they just won't have to delete any resources, since the data has already been moved out.

The following code will still be the most optimal:


void GetSomeData(vector<Data>& outData)
{
    outData.push_back(Data());
}

{
    vector<Data> list;
    // ...
    GetSomeData(list);
}

Whereas the following code introduces a rather large number of operations:


vector<Data> GetSomeData()
{
    vector<Data> result;
    result.push_back(Data());
    return result;
}

{
    vector<Data> list;
    // ...
    list = GetSomeData();
}

What happens behind the scenes is this:


void GetSomeData(vector<Data>& rvalue)
{
    vector<Data> result; // ctor
    result.push_back(Data());
    rvalue.operator=(result); // copy the result
    result.~vector(); // dtor of the local vector
}

{
    vector<Data> list;
    vector<Data> rvalue; // the compiler generates a temporary for the return value
    GetSomeData(rvalue); // get the result into the temporary
    list.operator=(rvalue); // copy the temporary
    rvalue.~vector(); // dtor of the temporary rvalue
}

Basically, what C++11 introduces is a new type of copy operator/constructor called the move operator/constructor:


vector(vector&& rvalue)
    : size(rvalue.size), capacity(rvalue.capacity), buffer(rvalue.buffer)
{
    rvalue.size = 0;
    rvalue.capacity = 0;
    rvalue.buffer = nullptr;
}

vector& operator=(vector&& rvalue)
{
    // no need to check for 'this != &rvalue', since rvalues are always temporary objects
    if (buffer) delete[] buffer; // make sure any data held by this vector is freed
    size = rvalue.size;
    capacity = rvalue.capacity;
    buffer = rvalue.buffer;
    rvalue.size = 0;
    rvalue.capacity = 0;
    rvalue.buffer = nullptr;
    return *this;
}

So instead of calling the standard 'vector& operator=(const vector& other)', the compiler will use 'vector& operator=(vector&& rvalue)' whenever a temporary object is assigned.

This is a lot better than copying the data 3 times, but it's still bad. It's basically a hack around an ugly design problem that C++ has, and one we have to live with. The only fully clean way to deal with it is to just use 'void GetSomeData(vector<Data>& outData)'. It's less intuitive, but at least nothing ugly happens behind the scenes.

Edited by RedFox

@Thanduel: maybe we could start a new Discuss thread on C++11 usage? It does have many benefits and is supported across all the latest compilers. ( http://c0de517e.blogspot.fr/2013/05/integrating-c11-in-your-diet.html )

@redfox: indeed, a lot of my posted patches did just that, converting pass-by-value to pass-by-reference in hotspot zones.

Pathfinding is just one example, but it emphasizes the cost of not using pointers and not handling memory ourselves, leading to complex memory fragmentation and lots of hidden object copies. There's a lot of other code around with std::vector allocated on the stack inside a method, still reserving memory on the heap, all freed at the end of the method call.

Note that there are also a lot of new/delete calls we can surely avoid. What's the idea on those? Can we patch them? Allocate at the maximum possible size and reuse as much as possible? (inside calculateTerritories, LOS texture computation, lots of Grid<> and sparse grids, etc.)

And there are also the free/malloc calls in the profiler that seem mostly related to JS. Not sure what we can do there; is passing const string& from C++ to a JS method even possible?


@redfox, haha, I'm not sure I'd call move semantics a hack; I was impressed with the elegance of the solution, and my impression of WG21's design decisions is that they were very carefully thought out. In fact the papers are all of exceptionally impressive quality.

Another interesting C++11 addition to the STL is the emplace_back() function for appending to containers:

http://en.cppreferen...or/emplace_back

What this means is that you can construct objects in place when appending to a container, without any copies or moves at all. There is an example demonstrating this at the link above.

@tuan kuranes: that is probably a good idea; not only will the use of C++11 features improve performance, it will also make a lot of code much cleaner.

Edited by Thanduel

@redfox, haha, I'm not sure I'd call move semantics a hack; I was impressed with the elegance of the solution, and my impression of WG21's design decisions is that they were very carefully thought out. In fact the papers are all of exceptionally impressive quality.

Another interesting C++11 addition to the STL is the emplace_back() function for appending to containers:

http://en.cppreferen...or/emplace_back

The emplace_back function seems a lot better than the old push_back for objects that are non-trivial / non-POD. Nevertheless, all the arguments have to be pushed onto the stack not once but twice (once for emplace_back, once for the type's constructor) before being copied a third time into the actual buffer.

For POD types, something like this is preferable, since it involves 1 initialization and 1 copy:


Data d = { 1, 2, 3 };
vec.push_back(d); // argument is passed as 'const Data&'

And I still think the move constructor is a hack that tries to fix one of the biggest problems C++ has: temporary non-trivial objects, which should be avoided at all costs. At least they've now introduced a way to escape triggering heap allocation/deallocation 3 times, which somewhat lessens the impact.

Still, the implementation of std::string is non-trivial; it looks like this in VC++ 2012 (28 bytes):


class string
{
    iterator_proxy* _Proxy;
    union {
        char _Buf[16];
        char* _Ptr;
    };
    size_t _Size;
    size_t _Reserved;
};

The idea is that small strings can be kept inside the internal buffer, while strings longer than 15 chars are allocated on the heap. Regarding std::move and move constructors/operators in general, they are still non-trivial, involving dozens of operations for copying and nulling out the string objects.

C# solves this by just turning every class into a reference-counted object (like Microsoft's COM, but with a garbage collector touch). In C++ the closest parallel is a shared_ptr<> wrapper, but it's again completely different from C# references:


// again from VC++ 2012 STL
template<class T> struct shared_ptr
{
    T* _Ptr;               // pointer to the object allocated with new
    _Ref_count_base* _Rep; // pointer to a dynamically allocated object holding the reference count
};

So this keeps a dynamically allocated pointer to a reference-counting object alongside the object pointer itself, which is rather inelegant. If you use a lot of shared_ptr<>'s in your project, you'll start feeling the overhead.

Luckily, or rather surprisingly, Microsoft designed an (in my opinion) elegant solution to the problem:


class IUnknown
{
    int refcount;
public:
    IUnknown() : refcount(1) {}   // initial creation has refcount 1
    void AddRef() { ++refcount; } // adds to the reference count of the object
    void Release() { if (!--refcount) delete this; } // decreases refcount; deletes the object when it reaches 0
};
class MyObject : public IUnknown { /* implement your own object */ };
So you introduce a reference-counting overhead for every object you intend to pass by reference. For actual use you need a reference wrapper; in Microsoft's COM it's a comptr<>:


template<class T> struct comptr
{
    T* ptr;
    // add copy constructors/operators that call AddRef and Release as needed
};
//
comptr<MyObject> obj = new MyObject(); // compiles fine because MyObject inherits IUnknown

Yeah, when I said it's elegant I might have been a bit too generous; it's more accurate to say it's more elegant than a shared_ptr<>. What it does achieve, though, is a single allocation per object, no matter how many copies of comptr<MyObject> you pass around.

In real C++ use you sometimes need objects allocated on the stack (because it's fast and your data is sometimes trivial), and other times you need them to be dynamic. The COM solution can never work for stack-allocated objects; it will result in a crash or heap corruption. That's awful.

Another solution is to create a custom reference object that proxies a struct holding the object pointer and its reference count:


template<class T> struct refptr
{
    struct proxy {
        T* ptr;
        int refcount;
    };
    proxy* p; // proxies should be allocated from a special dynamic memory pool

    inline refptr() : p(nullptr) {} // default to a null pointer
    inline refptr(T* ptr) {
        p = proxyAlloc(); // gets a new object from the proxy memory pool
        p->ptr = ptr;     // set the proxy ptr
        p->refcount = 1;  // initial refcount is 1
    }
    inline ~refptr() {
        release(); // try to release the object
    }
    inline void release() {
        if (p && !--p->refcount) { // decrease the reference count
            delete p->ptr;  // delete the actual object
            proxyFree(p);   // return the proxy to its memory pool
        }
    }
    inline T* operator->() {
        return p->ptr; // will cause a nullptr access if the proxy is nullptr
    }
    inline operator bool() { return p != nullptr; } // so we can test the refptr in conditionals
};

Even though it's not a complete working example (it's missing the copy constructors/operators and a ton of other operators needed to make it usable), it illustrates the idea of creating a very simple reference object (only 4 bytes wide on x86) that gives a properly reference-counted object:


{
    Data d1; // allocate a Data object on the stack
    d1.a = 5;

    refptr<Data> d2 = new Data(); // allocate on the heap and store as a reference
    d2->a = 5;
    refptr<Data> d3 = d2; // share the reference
    d2 = nullptr;         // decreases the reference count
    // d3->a is still 5: the object stays alive through d3
    d3 = nullptr; // refcount reaches 0 here, so the Data object is deleted
}

So yes, C++11 provides many ways of manipulating data for us, but it doesn't provide the one true solution we really need. I think the compiler should be extended to treat the following patterns as special cases:


vector<int> GetData()
{
    vector<int> data;
    data.push_back(10);
    return data;
}
{
    vector<int> list = GetData(); // case 1 - initialization
    list = GetData();             // case 2 - temporary copy replace
}

And it should generate altered code for these special cases (regardless of optimization levels):


// case 1 - initialization
void _init_GetData(vector<int>& out)
{
    out.vector(); // call the default ctor
    out.push_back(10);
}
// case 2 - temporary copy replace
void _replace_GetData(vector<int>& out)
{
    out.~vector(); // call the dtor
    out.vector();  // call the default ctor
    out.push_back(10);
}
{
    vector<int> list;
    _init_GetData(list);
    _replace_GetData(list);
}

Optimizations like this are only partly achieved with certain compiler flags, and they don't always kick in. I don't see why the compiler couldn't generate this kind of code. It's true that with this change the assignment operator would never be called, but isn't that what was actually desired? Even if you replace the code with std::string or even refptr, the end result is the same.

So yeah, rvalue references are a hack around a mistreated feature (operator=) of the C++ language. And I guess we'll have to live with it, unless someone comes along with a clean new language somewhere in between C++ and D.

Edited by RedFox

I appreciate the detailed response; you've evidently spent a lot of time trying to get around temporary object creation. I need to look at the disassembly of these calls to understand what is going on behind the scenes a little better. I'm actually very surprised that the stack copying isn't optimised out, but then C++ is a new beast for me. Anyway, thanks again for the detailed response.

The question now is what is the way forward with the C++ code base.


I appreciate the detailed response; you've evidently spent a lot of time trying to get around temporary object creation. I need to look at the disassembly of these calls to understand what is going on behind the scenes a little better. I'm actually very surprised that the stack copying isn't optimised out, but then C++ is a new beast for me. Anyway, thanks again for the detailed response.

I guess it's because I come from a C background, where being in control of every aspect of the program is really important. Coming from that, code that generates hidden behaviour behind the scenes can make development a real bane.

The question now is what is the way forward with the C++ code base.

Aside from a definite need to move the codebase to C++11, I think we should agree on a list of performance topics to add to http://trac.wildfiregames.com/wiki/CodeAndMemoryPerformance.

The first part should be filled with general C++ knowledge that isn't obvious to everyone. Perhaps excerpts from 'More Effective C++' by Scott Meyers? His book does a really good job of covering the basics. It might look a bit 'well, duh' to seasoned C++ developers, but it will definitely reveal new details.

The second part should cover algorithmic design, or 'optimize as you go': simple design alterations that make code run faster while performing the same algorithm.

Lastly, it should finish with memory-specific optimizations, data alignment, and how to use new/delete more efficiently. For example: how to use alloca, a non-standard but widely available function (from <malloc.h> on Windows) that allocates a dynamic memory block on the stack.

The current code and memory performance wiki page is very, very brief. I think every point deserves a tiny code example. Why? Not everyone immediately picks up on ambiguous or made-up abbreviations.

Aside from a definite need to move the codebase to C++11

Definitely needs a Discuss thread?

Here's another nice c++11

I think we should agree on a list of performance topics to add

Great three-part list.

Rewriting/copying web pages and Scott Meyers' books might make it tedious to read and, in the end, unread; that's why I went for just listing strict "do this or don't commit" style guidelines. Perhaps we can do the listing plus other wiki pages explaining each item.

I would go for guideline + example + link to a deeper wiki page?

  • 4 weeks later...

I've noticed in a couple of discussions here that most people are not convinced that memory fragmentation is a significant problem. It may be a good idea to take some fragmentation measurements. I know that in VC++ you can use _heapwalk to walk the heap and get data on the free/used blocks of memory along with their sizes.

http://msdn.microsoft.com/en-us/library/h0c183dk%28v=vs.71%29.aspx

It might be interesting to do this at timed intervals to get an idea of the rate of fragmentation.


@Thanduel: Using the memory analyzer from http://blog.makingartstudios.com/?page_id=72, run 0 A.D. with it in debug mode, switch to the "fragmentation" tab and you'll get nice memory-map block drawings showing just that.

(The other tabs are interesting too; you should be able to spot all the recurring "allocator" code spots, and all the other memory problems.)


@Thanduel: Using the memory analyzer from http://blog.makingar...com/?page_id=72, run 0 A.D. with it in debug mode, switch to the "fragmentation" tab and you'll get nice memory-map block drawings showing just that.

(The other tabs are interesting too; you should be able to spot all the recurring "allocator" code spots, and all the other memory problems.)

Wow! This is an amazing tool! Plus rep, man! :) You're now my personal hero, really, for finding this awesome tool.

First test:

Pyrogenesis rev #13491, Release, VC++ 2012

Only the main menu (let's start slow)

The first thing that makes leak detection really hard is the GC, due to its constant memory increase-and-free cycles.

[Image: 0ad_gc01.png]

The next stop is to analyze the first heap, which is the Mozilla JS engine heap.

A note about the fragmentation graph: red blocks represent allocated nodes, white blocks represent free nodes AND memory node headers. I don't know why they didn't exclude the headers, since it makes spotting frags very hard.

The slightly larger white blocks between red blocks are fragmentations.

The JS heap can be described as a very large number of small allocs that are luckily rounded to 32-byte alignment. This greatly reduces fragmentation, though there is still quite a bit of it. Keep in mind that this is just the 'quiet' main menu.

[Image: 0ad_frags02.png]

Next is the CRT heap: the heap used by C, C++ and pretty much every other library. This is the most important heap of the program. What characterizes this heap:

1) Lots of tiny allocations. This is bad for malloc/free speed, since there is a lot of coalescing going on.

2) Fragmentation. This is bad for malloc speed, since these nodes sit in the free list, meaning malloc has to iterate over them every single time, making it gradually slower. In this case the smaller nodes are the main culprits: the bigger blocks can be broken down and reused, but the small ones stay in the free list.

[Image: 0ad_frags03.png]

[Image: 0ad_frags04.png]

Now if we look at the distribution of allocations over the entire long run of the main menu, we notice that most allocations are aligned to neat 8-byte boundaries, which makes them easy to manage. However, the sheer volume of same-sized allocations is staggering.

[Image: 0ad_allocs_histo01.png]

There are two ways to increase general performance:

1) Localized usage of custom pool allocators. This is hard to achieve across the entire program, but Arena allocators are nothing new. And it's fast.

2) Use jemalloc. This is basically a collection of thread-local Arena allocators sorted into bins, and it is superior to dlmalloc (http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf). It's probably the fastest general-purpose solution you can think of, and it has seen years of debugging.

Writing a custom pool allocator is surprisingly easy, BUT jemalloc uses VirtualAlloc / mmap to expand its pools by mapping more virtual addresses to the end of the memory block, which is the most efficient way out there.

Of course, to get the speed-up of releasing a whole bunch of nodes in a single go, you'd still need a memory pool.

I'll post a sample of memory allocation during actual gameplay later.


Awesome post, you got to this before I could! The scary thing is that this is before a game has even started.

Well, sadly the application doesn't handle memory very well itself and crashes when I try to profile after loading a map.

If you have a 64-bit OS with more than 4GB of ram, you could perhaps post profile information of an actual game in progress?


Well, sadly the application doesn't handle memory very well itself and crashes when I try to profile after loading a map.

If you have a 64-bit OS with more than 4GB of ram, you could perhaps post profile information of an actual game in progress?

Will do later this evening when I get home.


So unfortunately, when I try to start a game I get a JavaScript error in the civ menu. I don't have time right now to roll back a couple of revisions; I'll have another go tomorrow.

Type error y.name undefined from gui/common/functions_utility.js line 95

Very exciting way to end the day. : /


I've noticed in a couple of discussions here that most people are not convinced that memory fragmentation is a significant problem.

How would anyone know it was a significant problem in 0 A.D. without data to show it? :) In particular I'd like the discussion about 0 A.D. memory fragmentation to be steered toward determining if it's a significant performance problem (we can all agree it is in theory).


Historic_bruno expresses exactly what I was thinking.

Your explanation of the problem makes sense, but I haven't seen any benchmarks yet that show how much effect memory fragmentation has on performance for 0 A.D. Do you see any way this could be measured before implementing complex changes?

It's similar to the problem I had with SpiderMonkey. There are a lot of benchmarks showing how much browser performance was improved, but there's no way to tell in advance how much it will improve performance for 0 A.D. and whether it's worth the effort.

EDIT: I'll definitely test SpiderMonkey's jemalloc flag :)


How would anyone know it was a significant problem in 0 A.D. without data to show it? :)

Sorry, :D, sometimes we systems programmers forget that not everyone is on the same page with us :). For us, it's painfully obvious that the problem is memory. It's not about the total memory used; it's about the number of allocations/deallocations and the subsequent fragmentation caused by interleaved different-sized allocations.

Here's an example of how you get a memory fragment (it's very easy):


void* p1 = malloc(4); // [4] ...
void* p2 = malloc(4); // [4][4] ...
void* p3 = malloc(8); // [4][4][8] ...
free(p2); // [4] 4 [8] ...
void* p4 = malloc(8); // [4] 4 [8] [8] ...

Now this isn't bad at all, actually. If someone happens to call malloc(4) again, everything is okay - the block is reused:


void* p5 = malloc(4); // [4] [4] [8] [8] ..

But in real life applications, this doesn't always apply - you might have something like this:


// 128 [4] 4 [8] [8] ...
void* p5 = malloc(4); // [4] 108 [4] 4 [8] [8] ...

And no, I didn't make a mistake in the calculation. A heap block on Windows takes 8 bytes (header) + [8 bytes (alignment)]. Because of this, the overhead of allocations varies a lot, but it also means both the block header and the memory itself are 8-byte aligned.

Case in point: the original 4-byte block was not reused and can now be considered a fragment. In large applications those small fragments accumulate over time, and the more you free other data, the more you cement that block's fragmentation. Since it's now at the very far end of the free list, it will probably never be allocated again.
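A toy first-fit simulation (my own illustration, not real malloc internals) shows why stranded fragments hurt: each alloc scans the free list front to back for the first block big enough, so every small block near the front that never fits adds a constant cost to all future allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Toy first-fit free list. alloc() walks the list in order and takes the
// first block that is large enough; lastScanSteps records how many nodes
// the walk had to visit, i.e. the cost the stranded fragments impose.
struct FreeList {
    std::list<std::size_t> blocks;   // sizes of free blocks, in scan order
    std::size_t lastScanSteps = 0;   // nodes visited by the last alloc()

    bool alloc(std::size_t want) {
        lastScanSteps = 0;
        for (auto it = blocks.begin(); it != blocks.end(); ++it) {
            ++lastScanSteps;
            if (*it >= want) {       // first fit: take it out of the list
                blocks.erase(it);
                return true;
            }
        }
        return false;                // no block large enough
    }
};
```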

In particular I'd like the discussion about 0 A.D. memory fragmentation to be steered toward determining if it's a significant performance problem (we can all agree it is in theory).

I wouldn't call it just a theory. In fact, I just posted a few graphs where we can see dozens of obvious fragments. As I explained, those fragments can be described as blocks that pollute the free list but never get reallocated. That critically diminishes malloc() performance - you always need to traverse the free list to get a new block.

Windows 7 has the Low-Fragmentation Heap, which is basically just like dlmalloc - a bin based allocator, but it's not used when a debugger is attached.

Now, before I get off-track (again), I got the opportunity to try out MTuner (30-day trial), which is a memory allocation profiler.

Pyrogenesis rev #13491, Release, VC++ 2012

Map Fast Oasis, 27 second fast-run

Here's the memory graph from mainmenu->load map->exit of "Fast Oasis":

0ad_memtuner_timeline.png

Compared to other games I've seen graphs for, pyrogenesis has a very slow memory usage 'climb'. Typically you'd expect to see a very large peak after hitting load - that's reserving memory, creating pools, etc. Pyrogenesis however has A LOT of small allocations. When I say a lot, I mean it. I was amazed, actually.

Each allocation from the heap has a small overhead. The smaller the allocation, the bigger the proportional overhead. For example, if you allocate 4 bytes, the heap will actually need to set aside 16 bytes. Of course, in debug mode it's much, much worse than that, but I digress.

If you need 100,000 4-byte sized objects, you can do it two ways:

1) Allocate 100,000 4-byte objects from the heap:

  • Takes 1,600,000 bytes (due to each heap block header)
  • Is ridiculously slow (100,000 mallocs)
  • Trashes the heap and fragments it (after you delete all of them, the heap has a huge segment of 4-byte blocks)
  • Deallocation takes a huge amount of time (100,000 frees)

2) Allocate a 400,000 byte pool:

  • Takes 400,000 + ~12(pool header) bytes
  • Is ridiculously fast (1 malloc)
  • Doesn't trash the heap at all. In fact the OS will have to VirtualAlloc a separate part of memory for such a large alloc.
  • Deallocation trivial (1 free)
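The arithmetic in the two lists above can be checked directly, assuming the 16 bytes per heap block quoted earlier (4-byte payload plus header and alignment) and the ~12-byte pool header. The constants below are the figures from this post, not measured values:

```cpp
#include <cassert>
#include <cstddef>

// Cost of 100,000 4-byte objects under the two strategies above.
constexpr std::size_t kObjects       = 100000;
constexpr std::size_t kObjectSize    = 4;
constexpr std::size_t kHeapBlockSize = 16; // payload + header + alignment
constexpr std::size_t kPoolHeader    = 12; // approximate pool bookkeeping

// Strategy 1: one heap block per object.
constexpr std::size_t heapCost() { return kObjects * kHeapBlockSize; }
// Strategy 2: one pool holding all objects.
constexpr std::size_t poolCost() { return kObjects * kObjectSize + kPoolHeader; }
```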

Now you may or may not think "Well, 100,000 is a ridiculous hypothetical number and doesn't apply here at all", but if you do - you're in for a surprise.

Here's a histogram of allocations rounded up to fixed-size bins (the 8b bin contains 1-, 4- and 8-byte allocs). It seems the numbers in my example were actually rather small compared to the real thing. I'd also like to note that this was just a 27-second run of the game:

0ad_memtuner_numallocs.png

You can see that Pyrogenesis is actually quite memory hungry. The number of allocations per second is huge. You can't afford something like this in an AAA title - over 15,000 allocations per second is just asking for trouble. The amount of time it takes to iterate the free list is even worse since we've fragmented memory with all those tiny allocs.

Now we can debate all we want, but now it's painfully obvious where a lot of our problems lie right now - it's the sheer number of allocations that cripples us.

There is a solution to this - a very simple one: custom allocators. The C++ STL containers were designed to allow custom allocators, and it's an easy change to make. The main beneficiaries would be unordered_map and map, which rely heavily on small node allocations. We should avoid arena allocators, since they are designed to last the entire lifetime of a program and won't fit nicely with game state changes. However, we can use dynamic memory pools that last for the span of a single game state and then get destroyed.
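As a sketch of the custom-allocator idea, here is a minimal C++11-style allocator over a caller-owned buffer (all names are mine, not Pyrogenesis code). It only bumps a cursor, individual frees are no-ops, and the whole buffer is reclaimed at once - matching the "lives for one game state, then destroyed" model:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>      // std::bad_alloc
#include <vector>

// Minimal bump allocator usable with STL containers (a sketch, not a
// production allocator: no thread safety, no per-object deallocation).
template<class T>
struct BumpAllocator {
    using value_type = T;

    char* buf;          // caller-owned storage
    std::size_t cap;    // capacity in bytes
    std::size_t* used;  // shared bump cursor, so rebound copies stay in sync

    BumpAllocator(char* b, std::size_t c, std::size_t* u)
        : buf(b), cap(c), used(u) {}
    template<class U>
    BumpAllocator(const BumpAllocator<U>& o)
        : buf(o.buf), cap(o.cap), used(o.used) {}

    T* allocate(std::size_t n) {
        std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buf);
        // Round the cursor up to T's alignment.
        std::uintptr_t aligned = (base + *used + alignof(T) - 1)
                               & ~(std::uintptr_t(alignof(T)) - 1);
        std::uintptr_t end = aligned + n * sizeof(T);
        if (end > base + cap)
            throw std::bad_alloc();
        *used = end - base;
        return reinterpret_cast<T*>(aligned);
    }
    void deallocate(T*, std::size_t) {} // freed all at once with the buffer
};

template<class A, class B>
bool operator==(const BumpAllocator<A>& x, const BumpAllocator<B>& y)
{ return x.buf == y.buf; }
template<class A, class B>
bool operator!=(const BumpAllocator<A>& x, const BumpAllocator<B>& y)
{ return !(x == y); }

// Hypothetical usage: a vector whose storage comes from our buffer.
inline int bumpVectorDemo() {
    static char buffer[4096];
    std::size_t used = 0;
    BumpAllocator<int> alloc(buffer, sizeof buffer, &used);
    std::vector<int, BumpAllocator<int>> v(alloc);
    v.push_back(1); v.push_back(2); v.push_back(3);
    return (used >= 3 * sizeof(int)) ? v[0] + v[1] + v[2] : -1;
}
```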

I've already implemented one for my own A* pathfinder, since I'm using a sorted linked-list instead of a vector.

  • With CRT malloc/free: 1401 tiles visited in 1.66ms.
  • With a memory pool: 1401 tiles visited in 193us - yes, that's right, micro-seconds.

That's roughly a 10x performance increase, just by using a memory pool. All I need is to give the pool a size hint before I use it and it's good to go.

How it's used:


// template argument defines the size of allocation objects. Must be >= 4.
// constructor size argument hints the number of objects a single pool can hold
// before a new pool is required.
dynamic_pool<sizeof(MyStruct)> poolCollection( PoolSize );

MyStruct* st = (MyStruct*)poolCollection.alloc();
// ...
poolCollection.dealloc(st);

The code can be even more efficient (and also much faster) if just a regular pool is used:


// pools have fixed size and the actual pool data is contained inside the structure
// that's why you have to allocate it dynamically:
pool<sizeof(MyStruct)>* fixedPool = pool<sizeof(MyStruct)>::create( FixedSize );

MyStruct* st = (MyStruct*)fixedPool->alloc();
// ...
fixedPool->dealloc(st);

delete fixedPool; // delete when you're done

Before I digress any further, I'd like to share one last bit with you guys. The actual hotspot data for allocations. These are concrete points in Pyrogenesis source that we can fix right now. I've already created a ticket here: #1995


<memblock>, <total memops>, <stacktrace>
size 4, count 61320, modeldef.cpp:300 (!!!!!)
size 8, count 61320, modeldef.cpp:301
size 48, count 42028, parser.cpp:246
size 36, count 40866, guimanager.cpp:273 & 265
size 8, count 32749, modeldef.cpp:241
size 32, count 27638, textrenderer.cpp:136
size 8, count 21952, cgui.cpp:986
size 16, count 21952, textrenderer.cpp:77
size 16, count 21952, shaderdefines.cpp:135
size 48, count 21952, cgui.cpp:954
size 12, count 20688, modelrenderer.cpp:635
size 112, count 17325, guirenderer.cpp:388 (!!!!)
size 108, count 16896, patchrdata.cpp:173
size 8, count 14852, shaderdefines.cpp:116
size 48, count 14154, font.cpp:33 (!!!!)
size 36, count 13230, patchrdata.cpp:1025
size 48, count 12008, parser.cpp:981
size 24, count 9783, patchrdata.cpp:1025
size 8, count 9645, modelrenderer.cpp:433
size 8, count 8935, modelrenderer.cpp:698
size 48, count 8265, cgui.cpp:916
size 48, count 8144, parser.cpp:933
size 93, count 7806, guitext.cpp:266
size 132, count 7806, cgui.cpp:689
size 132, count 7806, cgui.cpp:800 & 855
size 12, count 7072, modelrenderer.cpp:636
textrenderer.cpp:77
textrenderer.cpp:179
patchrdata.cpp:1052
guitext.cpp:210
terraintextureentry.h:75
patchrdata.cpp:192
cgui.cpp:675
texturemanager.cpp:150
texturemanager.cpp:153
texturemanager.cpp:511
texturemanager.h:136
parser.cpp:639
componentmanager.cpp:620

