
[Discuss] Memory aka game slowing over time.



Just a few more profiler images to throw into the fray.

Revision 13491, single-player match against the Aegis bot. I basically left the profiler running for the duration of the match. The following is the memory usage for the whole game: the first section is the main menu, the second is the game itself, and the late-game peak is from using the developer overlay to reveal the map.

general_memory_usage_zps096f9311.jpg

Summary memory allocation histograms for the run. The numbers are so large they don't display properly.

memory_histogram_zps484339b8.jpg

Three images of sections of the CRT heap towards the end of the game. As RedFox has already mentioned, there is a huge number of very small allocations, as indicated both in the following images and in the histogram above.

CRT_Heap_3_zpsdbdf0633.jpg

CRT_Heap_1_zps7c130175.jpg

CRT_Heap_2_zps9a9ecefe.jpg

The green highlighted section is just the allocated memory block I selected.

Edited by Thanduel
Link to comment
Share on other sites

Thanduel, that's some awesome data! How much memory did you need to process all of it?

I can see that the number of allocs for 32-byte objects is around 11 million - that's huge! Poor CRT-heap :(

I'm currently working on optimizing the performance hotspots reported by MTuner. I'm confident those same hotspots are the cause for the huge number of allocs.

Link to comment
Share on other sites

I didn't track the memory consumption of the profiler, but my machine has 6 GB of tri-channel RAM.

Haha, indeed it's a staggering number of allocs. Notice also the number of reallocs shown in the general summary: that is a huge amount of memory copying combined with allocation.

Using a memory pool as you suggested should shave quite a lot of time off of each frame. What sort of implementation are you using behind the scenes? I had a look at one yesterday that used two doubly linked lists to manage the used and free blocks.

http://www.codeproject.com/Articles/27487/Why-to-use-memory-pool-and-how-to-implement-it

This is a bit off topic, but another thing I've been looking at whilst this discussion has been going on is implementing move constructors in some of the basic value objects in 0AD. Some potential candidates are classes like CVectorXD, but I need to have a closer look at the code. We already get the benefit of the STL containers using move constructors by compiling with VC2010 and above. That is in fact another point to be made: the alpha releases should all be compiled with GCC's C++11 flags and with Visual Studio 2012 to make sure we get the STL/std library C++11 performance improvements.
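To make the gains concrete, here's a minimal sketch of what a move constructor buys for a value type that owns heap memory (the Buffer class below is purely hypothetical, not actual 0AD code; note that moves pay off mainly for heap-owning types, less so for small fixed-size classes):

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical heap-owning value type, for illustration only.
class Buffer {
    char*       data_;
    std::size_t size_;
public:
    explicit Buffer(std::size_t n) : data_(new char[n]), size_(n) {}
    ~Buffer() { delete[] data_; }

    // Copy: duplicates the heap block (alloc + memcpy).
    Buffer(const Buffer& o) : data_(new char[o.size_]), size_(o.size_) {
        std::memcpy(data_, o.data_, size_);
    }
    // Move: steals the pointer - no allocation, no memcpy.
    Buffer(Buffer&& o) : data_(o.data_), size_(o.size_) {
        o.data_ = 0;
        o.size_ = 0;
    }

    std::size_t size() const { return size_; }
    bool moved_from() const { return data_ == 0; }
};
```

STL containers compiled in C++11 mode will prefer the move constructor for temporaries, which is exactly where the copy cost disappears.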

Link to comment
Share on other sites

For the memory pool I'm using a custom implementation that is very light-weight. The code for it looks ugly because it's so heavily optimized, but that's something we'll have to live with to get the performance.

The guy at codeproject does a lot of unnecessary processing that slows things down. I use a single-linked list for the freelist and an additional 'available' counter. I'll omit any debug code to keep this already complex sample simple.

I'm using a C-like approach here to get the maximum juice out of the application. The pool objects are always dynamically allocated as [sizeof(pool) + bufferSize]. This is a clever trick to avoid allocating a secondary buffer inside the pool.

You might also notice that I'm placing the freelist next pointer inside the actual memory block. This requires TSIZE to be at least the size of a pointer. Normally this is pretty dangerous, but in debug mode I revert to a conventional struct where the next pointer is before the data. I also write a guard value into the place where the next pointer will be during release mode, so that I can catch any invalid writes through deleted pointers.

This really makes the code somewhat complex in debug mode and we might not even need it, but it's nice to have an assertion when you make a mistake.

In release mode the pool has no real overhead. If the pool is used as a temporary dynamic memory source with no deallocs, we could change the alloc function to prefer new nodes from <available> instead of <freed>. You can clear the state of the pool by just calling ::clear() - the pool will forget its freed list and reset available count. Also, we don't really need the 'freed' counter - I'm just using it for accurate estimates of pool memory usage.


/**
 * Our memory pool is a SIZE based pool, meaning you specify the size
 * of a single object as the template parameter.
 * If the pool has no more available elements, alloc() returns NULL.
 * You can return objects to the pool by using dealloc().
 * You can clear the state of the pool with clear() - this will make the
 * allocator forget its free list and reset the available number of objects.
 */
template<unsigned TSIZE> struct pool
{
    static_assert(sizeof(void*) <= TSIZE, "TSIZE is not big enough - must be at least sizeof(void*)");
    static_assert((TSIZE % 4) == 0, "TSIZE must be aligned to 4");

public:
    // Size of the actual pool object is 16 bytes,
    // the allocated data follows right after the pool.
    // We could use ushort for available/freed, but that would
    // misalign the data. This favors newer x86/x64 model performance.

    unsigned available; // number of untouched elements left in the buffer
    unsigned freed;     // number in free list

    union node
    {
        node* next;       // free list next pointer
        char data[TSIZE]; // the actual data
    };

    node* list;   // free list pointer
    char* end;    // pointer to the end of the buffer (first invalid address)
    char buffer[];// the actual buffer contained in this object

private:

    /**
     * You can't use the constructor. Pool is a variable size object
     * allocated as sizeof(pool) + TSIZE*capacity
     * @note Use pool::create(capacity) instead
     */
    inline pool() {}

public:

    /**
     * Creates a new pool<TSIZE> object. Pool is a variable size object
     * allocated as sizeof(pool) + TSIZE*capacity
     * @param capacity Number of objects to reserve space for
     * @return A new pool object
     */
    static pool* create(unsigned capacity)
    {
        unsigned bufferSize = sizeof(node) * capacity;
        pool* p = (pool*)malloc(sizeof(pool) + bufferSize);
        p->available = capacity;
        p->freed = 0;
        p->list = nullptr;
        p->end = p->buffer + bufferSize;
        return p;
    }

    /**
     * Destroys the pool
     */
    inline static void destroy(pool* p)
    {
        free(p);
    }

    /**
     * Clears the current state of the pool and resets
     * all the variables of the pool
     */
    inline void clear()
    {
        available = (end - buffer) / sizeof(node);
        freed = 0;
        list = nullptr;
    }

    /**
     * @return A new object from this pool, or NULL if no more handles available.
     */
    inline void* alloc() // gets a new item
    {
        if(freed)
        {
            --freed;
            void* ptr = (void*)list;
            list = list->next; // unshift the linked list
            return ptr;
        }
        else if(available)
        {
            return &((node*)buffer)[--available];
        }
        return nullptr; // it's empty
    }

    /**
     * Deallocates a pointer
     * @param ptr Pointer
     */
    inline void dealloc(void* ptr) // puts it into the freed list
    {
        ++freed;
        ((node*)ptr)->next = list; // shift the linked list
        list = (node*)ptr;         // store the beginning of the linked list
    }

    /**
     * @return TRUE if the pointer is in range of this pool
     */
    inline bool is_owner(void* ptr) const
    {
        return buffer <= (char*)ptr && (char*)ptr < end;
    }
};
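To make the alloc/dealloc mechanics concrete, here's a usage sketch against a condensed copy of the class above (trimmed so the snippet stands alone; clear() and is_owner() are omitted, and buffer[1] stands in for the flexible array member):

```cpp
#include <cstdlib>

// Condensed copy of the pool<TSIZE> class above, just enough to show usage.
template<unsigned TSIZE> struct pool
{
    unsigned available;  // untouched elements left in the buffer
    unsigned freed;      // elements in the free list
    union node { node* next; char data[TSIZE]; };
    node* list;          // free list head
    char* end;
    char buffer[1];      // stand-in for the flexible array member

    static pool* create(unsigned capacity)
    {
        unsigned bufferSize = sizeof(node) * capacity;
        pool* p = (pool*)malloc(sizeof(pool) + bufferSize);
        p->available = capacity;
        p->freed = 0;
        p->list = 0;
        p->end = p->buffer + bufferSize;
        return p;
    }
    static void destroy(pool* p) { free(p); }

    void* alloc()
    {
        if(freed) { --freed; void* ptr = list; list = list->next; return ptr; }
        if(available) return &((node*)buffer)[--available];
        return 0; // pool exhausted
    }
    void dealloc(void* ptr)
    {
        ++freed;
        ((node*)ptr)->next = list;
        list = (node*)ptr;
    }
};
```

Note how a freed block is handed out again before any untouched element - exactly the freelist-first policy described above.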

As for C++11, we should definitely migrate to the new standard. No reason to stick to the old, slower standard.

Link to comment
Share on other sites

Thanks for the more detailed explanation. :)

What I would like to see is a performance comparison of 0 A.D., but I guess that's only possible after the changes are implemented.

I finally wrote a few words about creating graphs of the simulation performance. It does not measure the improvements in the rendering performance or GUI performance but it should be useful for the simulation performance.

Link to comment
Share on other sites


You're right that the codeproject fellow uses a lot of unnecessary stuff; I was just looking to see the approach people take. His use of the doubly linked list is pretty serious overkill when you're essentially treating the linked list as a stack.

I work through and decipher ugly old C/C++ every day, and this code is very clear and easy to understand, so I wouldn't worry about that. I really like the use of the union there; a separate pointer would just be wasted space. Perhaps it makes sense to have a second memory pool type to specifically support temporary allocation, though switching the priority in the current implementation would only cost an extra cycle per allocation off the freed list. I guess every cycle counts.

Edited by Thanduel
Link to comment
Share on other sites

A few updates on the ticket: http://trac.wildfiregames.com/ticket/1995

So far things are going slowly, since a lot of the bottlenecks pointed out by MTuner also have some major design issues. In those situations, plugging in a memory pool or another sort of cache is just evil - the right thing to do is just fix it and throw away what isn't needed.

I guess the perfect example of this is the CParser class in /source/ps/Parser.h/.cpp, which is an evil custom regex parser straight from hell. It was so memory-inefficient that after removing it and replacing it with a traditional tokenizer, the startup time of the game dropped dramatically (here's the patch: #2005).

Hopefully if enough of these bottlenecks get ironed out, we'll start to really see the performance improvements.

Link to comment
Share on other sites

So today I took some time to actually test out the gross performance of the memory pool. The best way to test it (and the most suitable if it's ever used in 0 A.D.) was to write a custom std::allocator.

I chose to create two kinds of allocators only (mostly because STL allocators are temporary objects):


/**
* A Global Pool is a global memory pool accessor.
* This class does not hold any data itself, but accesses a global memory pool.
* @note General 1 element allocator for STL
*/
template<class T> class global_pool : public std::allocator<T> { /* ... */ };
/**
* A bin allocator is a special allocator that divides allocations into fixed-size memory pools
* @note This special allocator pools allocations between [4..1024] bytes
* @note General vector allocator for STL
*/
template<class T> class bin_allocator : public std::allocator<T> { /* ... */ };

Now some of you might remember that the "pool" and "dynamic_pool" classes are size based; so in order to make them fit with generic allocators we have to create a proxy:


template<unsigned TSIZE> struct _global_pool { static dynamic_pool<TSIZE, true> pool; };
template<unsigned TSIZE> dynamic_pool<TSIZE, true> _global_pool<TSIZE>::pool(true);

This is pretty cool, we now have a "global_pool" allocator and a "bin_allocator" allocator. Both of them just access the global template instantiation of the _global_pool<>::pool variable.
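Since only the declarations are shown above, here's a rough, self-contained sketch of how such a pool-backed STL allocator can be wired up. Everything here is illustrative: fixed_heap stands in for the real _global_pool<TSIZE>::pool and just forwards to malloc/free, and pool_allocator is not the actual global_pool implementation:

```cpp
#include <cstddef>
#include <cstdlib>
#include <list>
#include <memory>

// Stand-in for the real size-based global pools; forwards to malloc/free
// so this sketch compiles on its own.
template<unsigned TSIZE> struct fixed_heap
{
    static void* alloc()          { return std::malloc(TSIZE); }
    static void  dealloc(void* p) { std::free(p); }
};

// Minimal allocator routing single-element allocations to the pool
// matching sizeof(T) - one global pool per element size.
template<class T> class pool_allocator : public std::allocator<T>
{
public:
    template<class U> struct rebind { typedef pool_allocator<U> other; };

    pool_allocator() {}
    template<class U> pool_allocator(const pool_allocator<U>&) {}

    T* allocate(std::size_t n, const void* = 0)
    {
        if(n == 1) return (T*)fixed_heap<sizeof(T)>::alloc();
        return (T*)std::malloc(n * sizeof(T)); // pools are 1-element only
    }
    void deallocate(T* p, std::size_t n)
    {
        if(n == 1) fixed_heap<sizeof(T)>::dealloc(p);
        else       std::free(p);
    }
};
```

A node-based container then draws every node from the size-matched pool via rebind, e.g. std::list<int, pool_allocator<int> >.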

We can now test out the allocator pretty easily:


std::vector<int, global_pool<int>> vec;

But that looks really nasty. Can you imagine having that all over the codebase? Me neither. A typedef would be cumbersome, since you'd have to declare it everywhere. Soon the code will be full of typedefs - very counter-intuitive indeed.

So the best I could do was to write wrappers on top of all the STL containers and also the string class. This works out perfectly, because global_pool<> will not work on variable-sized allocations, so you can't use it in std::unordered_map or std::vector, for example. By writing the wrappers we ensure that the correct type of allocator is used.

As a plus, since STL containers are already heavily debugged and refined, we don't have to worry about crashing our memory pools anymore.

To the wrappers:


/**
* Overrides std::vector and plugs in a bin_allocator as the allocator.
* The bin_allocator allocates from global memory pools based
* on the chunk size.
*/
template<class T> class psvector : public std::vector<T, bin_allocator<T> > { /* ... */ };

Using it is really simple and since it's just a subclass of std::vector, it has all of its functionality.


psvector<int> vec;

Ahh, that looks perfect. The P is for perfect and S is for shiny. (just kidding; - the ps stands for pyrogenesis)

Finally, we can see the performance difference of Syscall memory allocations vs Memory pooled allocations:

VC++11, Release, default optimizations


1000x ps::stl 42ms
1000x std::stl 1170ms

A 28x performance difference.

That's right. Just by plugging in a memory pool we won by making far fewer System Alloc calls. Our allocations are faster and cause far less heap churn in the long run - isn't that sweet? :D

Here's the test code (nothing fancy):


template<class Vector> void test_vector()
{
    Vector vec;
    for(int i = 0; i < 32; ++i)
        vec.push_back(i);
    Vector vec2 = vec;
    vec2 = vec;
    Vector vec3 = Vector();
    vec3 = Vector();
}
template<class String> void test_string()
{
    String str;
    for(int i = 0; i < 32; ++i)
        str.push_back('A');
    String str2 = str;
    str2 = str;
    String str3 = String();
    str3 = String();
}
template<class Map> void test_map()
{
    Map map;
    for(int i = 0; i < 32; ++i)
        map[i] = i;
    Map map2 = map;
    map2 = map;
    Map map3 = Map();
    map3 = Map();
}
template<class Set> void test_set()
{
    Set set;
    for(int i = 0; i < 32; ++i)
        set.insert(i);
    Set set2 = set;
    set2 = set;
    Set set3 = Set();
    set3 = Set();
}

int main()
{
    const int iterations = 1000;

    Timer t1(tstart);
#if 1
    for(int i = 0; i < iterations; ++i)
    {
        test_vector<psvector<int>>();
        test_vector<pslist<int>>();
        test_string<psstring>();
        test_map<psmap<int, int>>();
        test_set<psset<int>>();
        test_map<psunordered_map<int, int>>();
        test_set<psunordered_set<int>>();
    }
    t1.Stop();
    printf("%dx ps::stl %dms\n", iterations, (int)(t1.Elapsed() * 1000));
#else
    Timer t2(tstart);
    for(int i = 0; i < iterations; ++i)
    {
        test_vector<std::vector<int>>();
        test_vector<std::list<int>>();
        test_string<std::string>();
        test_map<std::map<int, int>>();
        test_set<std::set<int>>();
        test_map<std::unordered_map<int, int>>();
        test_set<std::unordered_set<int>>();
    }
    t2.Stop();
    printf("%dx std::stl %dms\n", iterations, (int)(t2.Elapsed() * 1000));
#endif

    system("pause");
    return 0;
}

Since I've finished the STL container wrapper and memory pool code, it'll only be a matter of time until this gets into the pyrogenesis source.

Limitations:

1) bin_allocator only 'bins' allocations in range 4..1024. Even then, the actual allocation chunks are:

4 8 12 16 20 24 32 40 48 56 64 80 96 112 128 160 192 224 256 384 512 640 768 896 1024

So it wastes some memory in favor of speed.

However, this 'waste' of space will definitely not matter, since we win A LOT more in the long term and the speed is just fantastic.


ps::stl mem-committed: 522K
std::stl mem-committed: 384K
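For illustration, the rounding from a requested size to one of those chunks can be sketched with a small table lookup (the table is copied from the list above; bin_allocator's real lookup may well be implemented differently):

```cpp
#include <cstddef>

// The bin_allocator size classes as listed above.
static const std::size_t kBins[] = {
    4, 8, 12, 16, 20, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128,
    160, 192, 224, 256, 384, 512, 640, 768, 896, 1024
};
static const int kNumBins = sizeof(kBins) / sizeof(kBins[0]);

// Round a request up to its bin size; returns 0 for out-of-range sizes,
// which would fall through to a plain heap allocation.
std::size_t bin_size(std::size_t request)
{
    for(int i = 0; i < kNumBins; ++i)
        if(request <= kBins[i])
            return kBins[i];
    return 0;
}
```

The gap between the request and its bin (e.g. a 33-byte request landing in the 40-byte bin) is exactly the memory traded for speed.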

2) The pool deallocation scheme is not linear. The global pools are auto-cleaned when they go empty, except the first pool, which is never deallocated. I'm afraid we'll have to live with this for now - cleaning the pools doesn't make sense anyway.

What we could do further:

1) Extend the wrappers over more structures such as std::stringstream and friends

2) Test how well this scales in 0 A.D. - the initial pool reserve can be increased for less overhead and to match 0 A.D. requirements. Remember that the number of pools is increased on demand and pools have a fixed initial capacity. If we know we will have 10000 allocations, we can reserve pool size beforehand.

3) Give RedFox a cookie, since he spent all day (and night) making this beast work.

Edited by RedFox
Link to comment
Share on other sites

Great job, RedFox! here's a virtual cookie ;-)

I would suggest fixing as many of the MTuner-reported problems that have design issues as possible before committing this memory pool feature. That way we can be sure to get the actual fixes first, and then this memory fix!

Link to comment
Share on other sites


That's a good point - the problems reported by MTuner ended up just being really bad code. :D It's stuff that should and will definitely get fixed.

Luckily I can just write a 'debug' layer that restores psvector and friends to the old allocators. This way we can still use MTuner to find memory-intensive code.

Link to comment
Share on other sites

This looks great! Awesome results!


template<unsigned TSIZE> struct _global_pool { static dynamic_pool<TSIZE, true> pool; };
template<unsigned TSIZE> dynamic_pool<TSIZE, true> _global_pool<TSIZE>::pool(true);

Am I right in understanding the second line as making calls to dynamic_pool refer to the static dynamic_pool inside of _global_pool?

Also, whilst talking about templates, there is a new feature of C++11 that allows you to make template aliases, with pretty much the same effect as the inheritance you use here to hide the allocator parameter. I'm not sure if there is any benefit over the current method; I am just posting it for interest's sake.

https://blogs.oracle.com/pcarlini/entry/template_aliases

I'd like to have a look at the source for the allocators and the pools; is there somewhere you could put it up so I can look through it all? I have some ideas for some optimisations that would require using the memory pool system in 0AD, and I'd like to benchmark them.

Link to comment
Share on other sites

This looks great! Awesome results!


template<unsigned TSIZE> struct _global_pool { static dynamic_pool<TSIZE, true> pool; };
template<unsigned TSIZE> dynamic_pool<TSIZE, true> _global_pool<TSIZE>::pool(true);

Am I right in understanding the second line as making calls to dynamic_pool refer to the static dynamic_pool inside of _global_pool?

That's just C++ syntax for declaring static template members:


template<class T> class MyClass
{
    static T StaticMember;
};
template<class T> T MyClass<T>::StaticMember;

It's quite convoluted, but I needed that step to have global static template instantiations.

I also have a version of the pool that doesn't take any template arguments and is instantiated with arguments instead. There doesn't seem to be any speed difference at all. This makes it a lot easier to manage these global pools, since I can use a lookup table for the bucket sizes - pointing to the correct pool.

Also whilst talking about templates, there is a new feature of C++11 that allows you to make template aliases, with pretty much the same effect as the inheritance you use here to hide the allocator parameter. I'm not sure if there is any benefit over the current method though I am just posting it for interests sake.

https://blogs.oracle...emplate_aliases

That looks perfect for what I'm doing. :) Right now writing a wrapper is actually a pretty horrible task! I've been trying to keep everything C++03 for now; that's why the horrible wrappers. If we settle on C++11 as a final verdict, I'll just have to delete a lot of superfluous code. :)


template<class T> class psvector
    : public std::vector<T, bin_allocator<T> >
{
public:
    typedef std::vector<T, bin_allocator<T> > vector;

    inline psvector() : vector() {}
    inline explicit psvector(size_t count) : vector(count) {}
    inline psvector(size_t count, const T& value) : vector(count, value) {}
    template<class Iter> inline psvector(Iter first, Iter last) : vector(first, last) {}

    inline psvector(const psvector& right) : vector(right) {}
    inline psvector& operator=(const psvector& right) { vector::operator=(right); return *this; }
#if HAS_CPP11
    inline psvector(psvector&& right) : vector(std::move(right)) {}
    inline psvector& operator=(psvector&& right) { vector::operator=(std::move(right)); return *this; }
#endif
};


template<class T> using psvector = std::vector<T, bin_allocator<T>>;

I'd like to have a look at the source for the allocators and the pools, is there somewhere you could put it up so I can look through it all. I have some ideas for some optimisations that would require using the memory pool system in 0AD and I'd like to benchmark them.

You mean using mmap / VirtualAlloc? There's a problem with that solution. Sometimes the address following your paged memory section is unavailable, meaning mmap/VirtualAlloc will fail. For that not to happen, we have to reserve (but not yet commit) an unknown number of memory pages ahead. On a 64-bit system this wouldn't be a problem, but the 32-bit address space is much more limited in that sense.

If we don't reserve the address space ahead, something else will use those addresses. And if mmap / VirtualAlloc fails, the application will crash.

We could try reserving a large part of the address space. Right now I have 25 pool buckets in the bin_allocator. If we reserve a fixed size of 25x1MB, it would soon crash, because 0 A.D. uses a lot more memory than that - as can be seen from the MTuner data.

0ad_memtuner_peakusage.png

If we create an enum with special tuning data per bucket, we could perhaps avoid that. But it wouldn't apply to every case - still leading to a crash.

However, we can keep the dynamic_pool structure as it is right now (an array of memory pools) and reserve pools with 1MB virtual memory in 'reserve'. The pool can automatically commit more and more memory pages (commit is required in Win32) - up until 1MB. Then it's full and the dynamic_pool class would do the work for us by allocating a new pool into its array.

This would give better performance when we are forced to continuously expand the pools, which is something that happens a lot in 0 A.D. Though I'm not sure if the used memory count is due to fragmentation or to allocations still alive.
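Here's a minimal POSIX sketch of the reserve-then-commit idea (Linux mmap/mprotect; on Win32 the equivalent is VirtualAlloc with MEM_RESERVE followed by MEM_COMMIT - the sizes and helper names are illustrative):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Reserve address space only: PROT_NONE pages consume no physical memory
// until they are committed, so a 1MB reservation up front is cheap.
char* reserve_block(std::size_t bytes)
{
    void* p = mmap(0, bytes, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? 0 : (char*)p;
}

// Commit the first 'bytes' of a reserved block so the pool can grow into it.
// Called again with a larger size as the pool fills up.
bool commit_block(char* base, std::size_t bytes)
{
    return mprotect(base, bytes, PROT_READ | PROT_WRITE) == 0;
}
```

Because the whole range was reserved in one mmap call, growing the committed region can never fail due to a neighbouring allocation stealing the address.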

Last note: Multithreading. We can implement a mutex lock inside each pool, which is a lot better than CRT's current mutex-lock per malloc. Yeah this is getting really complicated now.
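A per-pool lock could look roughly like this (a sketch with C++11 std::mutex; the vector-backed freelist is just for illustration - the real pool would keep its intrusive freelist):

```cpp
#include <mutex>
#include <vector>

// Sketch of per-pool locking: only this pool's users contend with each
// other, instead of every malloc in the process hitting the CRT heap lock.
class locked_stack_pool
{
    std::mutex lock;           // one mutex per pool
    std::vector<void*> freed;  // freelist treated as a stack, as discussed
public:
    void dealloc(void* p)
    {
        std::lock_guard<std::mutex> guard(lock);
        freed.push_back(p);
    }
    void* alloc()
    {
        std::lock_guard<std::mutex> guard(lock);
        if(freed.empty()) return 0; // caller falls back to a fresh block
        void* p = freed.back();
        freed.pop_back();
        return p;
    }
};
```

Contention is now proportional to how many threads share one size class, not to the total allocation rate of the application.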

Edited by RedFox
Link to comment
Share on other sites

I ran a full game (10383 turns) against a bot on Acropolis 3 through callgrind, as I was curious about the hotspots of the game. It counted 75 196 964 calls to malloc, if that is of any interest. It's actually number 6 on the most-called function list. Even more impressive is std::_Rb_tree<unsigned int const, EntityData>::find(unsigned int const&), which was called 183283748 times. There are some other interesting results, like GetPercentMapExplored taking 6.56%, but I guess that's totally off-topic here.

Link to comment
Share on other sites

Awesome data there, scroogie. :) I guess the focus should be to minimize usage of std::map (std::_Rb_tree) and really implement the memory pool.

I'd start replacing the STL containers with the allocator variants, but it would probably break a lot of code. Plus I have a few pending patch changes, which would make it very hard to submit. :/
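To illustrate the std::_Rb_tree point: those 183 million find() calls walk a red-black tree, so swapping the container for a hash map changes the cost per lookup from O(log n) to O(1) expected, usually without touching the call sites much. A sketch, not a drop-in patch - EntityData is stubbed here, the real struct lives in the simulation code:

```cpp
#include <map>
#include <unordered_map>

// Stub for illustration; not the real simulation struct.
struct EntityData { int hp; };

typedef std::map<unsigned, EntityData>           EntityTree; // current: red-black tree
typedef std::unordered_map<unsigned, EntityData> EntityHash; // candidate: hash table

// The same hot lookup written once for either container.
template<class Container> int hp_of(const Container& c, unsigned id)
{
    typename Container::const_iterator it = c.find(id);
    return it == c.end() ? -1 : it->second.hp;
}
```

The trade-off is losing ordered iteration, which matters if any code relies on walking entities in id order.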

Link to comment
Share on other sites

The code is a little too fresh for me to contribute much in this discussion, other than to say this slowdown issue makes playing this game completely impossible for me, on a modern gaming machine. I cannot stress enough how important the work you guys are doing in this thread is!

Kudos and great work!

*edit*: +1 cookie for RedFox

Edited by Silkjc
Link to comment
Share on other sites
