
Paid Development: August 2013 Activity Status


This is my activity status report thread for August 2013.

There's a pretty large gap since the previous update; I only really started working once all the legal issues were finally taken care of. As a result, the progress report starts from the middle of the week and isn't anything really impressive.

Week #2 12.08 - 18.08

14.08 1900-0600 (11) - Debugging on Ubuntu: OpenGL and GLSL. TextRenderer drawbatch optimization.

15.08 1600-0200 (10) - Debugging on Ubuntu: no success. TerrainRenderer optimization.

16.08 1300-0500 (16) - TerrainRenderer optimization. Removed FFP. ShaderProgram optimization.

17.08 0000-0600 (6) - Debugging, first breakthroughs.

From the total of 43 hours, most of the time went into debugging, though I was able to squeeze in some improvements that seemed to make enough of a difference. I think the most disappointing aspect is the ModelRenderer, which is pretty much fully back to its original form - solely because of debugging the crash on Linux. Hopefully I can get around to reapplying the improvements that ModelRenderer previously had.

1. Debugging on Ubuntu

What's the issue?: The new patch causes a mysterious crash on Linux and OSX

I decided to pick up a copy of Ubuntu and do some debugging. Even though I got some relevant data out of this, it wasn't a success - the game ran fine on my x86 Ubuntu.

So the problem is most likely related to some size_t changes between x86/x64 - Josh is running x64 Ubuntu and he experiences the crash. This is actually a relevant breakthrough and has pointed me towards the possible causes. I expect to finally find the reason in the coming week.

2. TextRenderer DrawCalls

What are Text DrawCalls?: Text DrawCalls are clumps of text that contain the text color, style and font.

I didn't really intend to do anything about TextRenderer, but while debugging on Linux, I noticed that OpenGL keeps rendering text one word at a time. This is an extremely inefficient way to do it, so I improved the DrawCall batching to clump similar lines of text together.

DrawCall batches before: { "This ", "is ", "an ", "example." }

DrawCall batches after: { "This is an example." }

So depending on the actual amount of text, the rendering speedup varies. It noticeably raised the FPS on text-heavy pages.
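The merging step can be sketched roughly like this (hypothetical names and state fields - the actual TextRenderer code differs):

```cpp
#include <string>
#include <vector>

// Hypothetical draw-call state; the real TextRenderer tracks color, style and font.
struct DrawCall { int font; int color; std::string text; };

// Merge adjacent draw calls that share the same state into one batch,
// so OpenGL renders whole lines instead of one word at a time.
std::vector<DrawCall> BatchDrawCalls(const std::vector<DrawCall>& calls)
{
    std::vector<DrawCall> batched;
    for (const DrawCall& dc : calls)
    {
        if (!batched.empty() && batched.back().font == dc.font
                             && batched.back().color == dc.color)
            batched.back().text += dc.text; // same state: append to current batch
        else
            batched.push_back(dc);          // state changed: start a new batch
    }
    return batched;
}
```

Fed the example above, { "This ", "is ", "an ", "example." } with identical state collapses into the single batch { "This is an example." }.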

I don't know how much more optimization of text rendering can be justified - but one thing is certain: we will have to migrate to something better, like FreeType fonts and true vertex-buffered OpenGL rendering. This would not only let us remove a lot of redundant/duplicate code, but it would also speed the engine up, relying solely on graphics card based buffers - which is the modern way to program in OpenGL.

For now, I'll leave it be.

3. TerrainRenderer

What is TerrainRenderer?: TerrainRenderer obviously renders the in-game terrain. It works in 3 stages: 1) base tiles, 2) blends between tiles, 3) decals on the terrain.

What are Patches?: Patches are 16x16 tiles of the terrain. This is the main unit used in frustum culling and rendering.

So this is actually a pretty important part of the engine. And not that surprisingly, it's also quite difficult to optimize. The previous implementation used a memory pool to manage the patches and a rather complex 3-dimensional mapped sorting algorithm.

Since I was already hard pressed for time, I couldn't manage a complete redo of the rendering algorithm itself, but I was able to improve the pre-sorting and batching of Terrain Patches. Instead of relying on a memory pool and some complex usage of 3 dimensions of std::map, I wrote a simple and very straightforward structure that does the bare minimum and takes into account the hardcoded limits of number of patches and number of blends per patch.

Since Terrain rendering actually takes a lot of the renderer time, this change was pretty noticeable. In both Debug and Release builds I experienced roughly 33% performance improvement.

If the batching and sorting took about half of Terrain rendering time, now the sorting is rather insignificant compared to actual rendering.
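As a rough illustration of the idea (the names and limits here are made up, not the engine's actual constants), the replacement is essentially a flat, fixed-capacity array with O(1) append:

```cpp
#include <cassert>
#include <cstddef>

const size_t MAX_PATCHES = 1024; // assumed hardcoded limit, not the real value

// Bare-minimum batching structure: no memory pool, no std::map - patches are
// appended into preallocated storage and the whole thing is reset each frame.
// The same idea applies per-patch for the (also hardcoded) blend limit.
struct PatchBatch
{
    int patches[MAX_PATCHES];
    size_t numPatches = 0;

    void Add(int patchId)
    {
        assert(numPatches < MAX_PATCHES); // limits are hardcoded, so this holds
        patches[numPatches++] = patchId;  // O(1) append, no allocation
    }
    void Clear() { numPatches = 0; }      // reuse the same storage next frame
};
```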

I'll use timing data from Profiler2 for comparison later.

Before: --todo--

After: --todo--

4. Removed FFP

What is FFP?: The Fixed Function Pipeline (or FFP) was a module of 0AD that emulated shader support. It was a really nasty hack to support ancient PCs with no GPUs or no shader support.

I think it's best to say this is the biggest change of this week. It took a lot of careful editing of the code to make sure it works properly. Patience paid off and I was able to remove a lot of complexity from the whole shader pipeline.

It's difficult to measure the performance improvement this gives, but it's safe to say that it's actually quite negligible. The main gain is that we have a lot less code to maintain. Previously quite a lot of optimizations were out of the question due to FFP being in the way.

Now that it's removed, we will be able to slowly move on to a cleaner, more maintainable and of course, faster Shader system.

5. ShaderProgram optimizations

What is ShaderProgram?: This is an internal class that abstracts away GPU's programmable shaders and automatically prepares the OpenGL shader program for rendering.

This was actually a really tiny change in the way that OpenGL shader uniform and attribute bindings are stored, but it's necessary to make way for a more efficient binding system in the future.

I intend to move away from the current string based binding lookups and replace it completely with constant values.

There are two ways we could go about this:

#1 (less preferred): Preassign binding locations for the shader input stage. For example, attribute location 0 will always be for Vertex data. Attribute location 1 always for UV0 set, etc.

This is somewhat tedious, since you'll have to explicitly declare an attribute location in the shader:

layout (location = 0) in vec3 a_vertex; // our vertex attribute in the shader (explicit locations require GLSL 3.30+ or ARB_explicit_attrib_location)

Its name can be anything, really. All we care about is binding location 0 in the layout. Vertex data would always be sent to that location (of course, only if there is an attribute with that binding location, otherwise vertex data would not be uploaded).

#2 (preferred): Variable names in the shader program have explicit meaning behind them. For example, a_vertex would always be for vertex data. If you fail to declare a variable with the name a_vertex, your shader won't receive any vertex data.

This is somewhat perfect for shader development - we enforce consistent variable naming this way and we can remove a lot of superfluous data in the shader XMLs (probably even negate the need for shader XMLs for non-effects). In the shader it would look nice and clean:

attribute vec3 a_vertex; // our vertex attribute in the shader (GLSL 1.20 syntax)

Having explored the two possible ways to go about it, it's pretty much obvious that #2 would be the way to go. This would allow us to seriously streamline shader compilation, the shader pipeline, and attribute/uniform binding during the shader's input layout stage.

The most obvious reason why I would go this way is that a very large number of shader variable names have already been hardcoded into the engine by the previous developers. Since we probably won't be redesigning all shader files, option #2 would leave shader files as they are (with some changes to variable names).

Here is the list of currently hardcoded shader variables and definitions:

const ShaderDefStr u_Transform = "transform";
const ShaderDefStr u_CameraPos = "cameraPos";
const ShaderDefStr u_Color = "color";
const ShaderDefStr u_Tex = "tex";
const ShaderDefStr u_BaseTex = "baseTex";
const ShaderDefStr u_ColorMul = "colorMul";
const ShaderDefStr u_ColorAdd = "colorAdd";
const ShaderDefStr u_ShadowTex = "shadowTex";
const ShaderDefStr u_ShadowTransform = "shadowTransform";
const ShaderDefStr u_ShadowScale = "shadowScale";
const ShaderDefStr u_LosTex = "losTex";
const ShaderDefStr u_LosTransform = "losTransform";
const ShaderDefStr u_Ambient = "ambient";
const ShaderDefStr u_SunColor = "sunColor";
const ShaderDefStr u_SunDir = "sunDir";
const ShaderDefStr u_FogColor = "fogColor";
const ShaderDefStr u_FogParams = "fogParams";
const ShaderDefStr u_NormalMap = "normalMap";
const ShaderDefStr u_NormalMap2 = "normalMap2";
const ShaderDefStr u_Foam = "Foam";
const ShaderDefStr u_MapSize = "mapSize";
const ShaderDefStr u_DepthTex = "depthTex";
const ShaderDefStr u_WaveTex = "waveTex";
const ShaderDefStr u_ReflectionMap = "reflectionMap";
const ShaderDefStr u_RefractionMap = "refractionMap";
const ShaderDefStr u_LosMap = "losMap";
const ShaderDefStr u_Shininess = "shininess";
const ShaderDefStr u_Time = "time";
const ShaderDefStr u_Waviness = "waviness";
const ShaderDefStr u_SpecularStrength = "specularStrength";
const ShaderDefStr u_Murkiness = "murkiness";
const ShaderDefStr u_Tint = "tint";
const ShaderDefStr u_ReflectionTintStrength = "reflectionTintStrength";
const ShaderDefStr u_ReflectionTint = "reflectionTint";
const ShaderDefStr u_Translation = "translation";
const ShaderDefStr u_RepeatScale = "repeatScale";
const ShaderDefStr u_ReflectionMatrix = "reflectionMatrix";
const ShaderDefStr u_RefractionMatrix = "refractionMatrix";
const ShaderDefStr u_LosMatrix = "losMatrix";
const ShaderDefStr u_ScreenSize = "screenSize";
const ShaderDefStr u_MaskTex = "maskTex";
const ShaderDefStr u_ObjectColor = "objectColor";
const ShaderDefStr u_PlayerColor = "playerColor";
const ShaderDefStr u_ShadingColor = "shadingColor";
const ShaderDefStr u_TextureTransform = "textureTransform";
const ShaderDefStr u_SkinBlendMatrices0 = "skinBlendMatrices[0]";
const ShaderDefStr u_SkinBlendMatrices = "skinBlendMatrices";
const ShaderDefStr u_WaterTex = "waterTex";
const ShaderDefStr u_SkyCube = "skyCube";
const ShaderDefStr u_BlendTex = "blendTex";
const ShaderDefStr u_LosTex1 = "losTex1";
const ShaderDefStr u_LosTex2 = "losTex2";
const ShaderDefStr u_Delta = "delta";
const ShaderDefStr u_InstancingTransform = "instancingTransform";
const ShaderDefStr u_RenderedTex = "renderedTex";
const ShaderDefStr u_TexSize = "texSize";
const ShaderDefStr u_BlurTex2 = "blurTex2";
const ShaderDefStr u_BlurTex4 = "blurTex4";
const ShaderDefStr u_BlurTex8 = "blurTex8";
const ShaderDefStr u_Width = "width";
const ShaderDefStr u_Height = "height";
const ShaderDefStr u_zNear = "zNear";
const ShaderDefStr u_zFar = "zFar";
const ShaderDefStr u_Brightness = "brightness";
const ShaderDefStr u_Hdr = "hdr";
const ShaderDefStr u_Saturation = "saturation";
const ShaderDefStr u_Bloom = "bloom";

const ShaderDefStr a_Tangent = "a_tangent";
const ShaderDefStr a_SkinJoints = "a_skinJoints";
const ShaderDefStr a_SkinWeights = "a_skinWeights";

const ShaderDefStr fx_Default = "arb/model_solid";
const ShaderDefStr fx_Minimap = "minimap";
const ShaderDefStr fx_Bloom = "bloom";
const ShaderDefStr fx_Particle = "particle";
const ShaderDefStr fx_ParticleSolid = "particle_solid";
const ShaderDefStr fx_LosInterp = "los_interp";
const ShaderDefStr fx_GuiText = "gui_text";
const ShaderDefStr fx_GuiSolid = "gui_solid";
const ShaderDefStr fx_GuiBasic = "gui_basic";
const ShaderDefStr fx_GuiAdd = "gui_add";
const ShaderDefStr fx_GuiGrayscale = "gui_grayscale";
const ShaderDefStr fx_ForegroundOverlay = "foreground_overlay";
const ShaderDefStr fx_SkySimple = "sky_simple";
const ShaderDefStr fx_WavesGLSL = "glsl/waves";
const ShaderDefStr fx_WaterHighGLSL = "glsl/water_high";

const ShaderDefStr def_UseNormals = "USE_NORMALS";
const ShaderDefStr def_UseRealDepth = "USE_REAL_DEPTH";
const ShaderDefStr def_UseFoam = "USE_FOAM";
const ShaderDefStr def_UseWaves = "USE_WAVES";
const ShaderDefStr def_UseRefraction = "USE_REFRACTION";
const ShaderDefStr def_UseReflection = "USE_REFLECTION";
const ShaderDefStr def_UseShadows = "USE_SHADOWS";
const ShaderDefStr def_UseInstancing = "USE_INSTANCING";
const ShaderDefStr def_UseGPUSkinning = "USE_GPU_SKINNING";
const ShaderDefStr def_UseShadow = "USE_SHADOW";
const ShaderDefStr def_UseFPShadow = "USE_FP_SHADOW";
const ShaderDefStr def_UseShadowPCF = "USE_SHADOW_PCF";
const ShaderDefStr def_UseShadowSampler = "USE_SHADOW_SAMPLER";
const ShaderDefStr def_Decal = "DECAL";
const ShaderDefStr def_MinimapBase = "MINIMAP_BASE";
const ShaderDefStr def_MinimapLos = "MINIMAP_LOS";
const ShaderDefStr def_MinimapPoint = "MINIMAP_POINT";

const ShaderDefStr def_ModeShadowCast = "MODE_SHADOWCAST";
const ShaderDefStr def_ModeWireFrame = "MODE_WIREFRAME";
const ShaderDefStr def_ModeSilhouetteOccluder = "MODE_SILHOUETTEOCCLUDER";
const ShaderDefStr def_ModeSilhouetteDisplay = "MODE_SILHOUETTEDISPLAY";
const ShaderDefStr def_AlphaBlendPassOpaque = "ALPHABLEND_PASS_OPAQUE";
const ShaderDefStr def_AlphaBlendPassBlend = "ALPHABLEND_PASS_BLEND";

const ShaderDefStr def_SysHasARB = "SYS_HAS_ARB";
const ShaderDefStr def_SysHasGLSL = "SYS_HAS_GLSL";
const ShaderDefStr def_SysPreferGLSL = "SYS_PREFER_GLSL";

It is obvious that the current shader system is far from anything truly moddable. However - if we document all the "hardcoded" uniform names, such as u_TextureTransform, other people can program their own shaders without much hassle.

We can also finally throw away ShaderDefStr - a boon for performance - and resolve all the definition names at compile time. We would have something like this:

enum ShaderAttributes {
    a_Vertex = 0, // vertex position attribute
    a_Color,      // vertex color attribute
    // ...
};
static std::string mapping[] = {
    "a_vertex", // illustrative entries matching the enum order
    "a_color",
    // ...
};
int GetShaderVariableID(const std::string& var) // for shader compilation stage
{
    for (int i = 0; i < ARRAY_SIZE(mapping); ++i)
        if (var == mapping[i])
            return i;
    return -1; // mapping does not exist for this variable
}
const std::string& GetShaderString(int id) // for shader compilation stage
{
    return mapping[id];
}

This naive implementation would map any incoming strings from the shader file to internal indices that match the appropriate enums, like the ShaderAttributes enum. Since the actual number of variables isn't that big, we can get away with a simple loop. Thanks to the CPU cache, a simple linear scan over a small array is usually faster than an std::map lookup.

I'll stop my tedious explanation of "what's to change" and leave it here.

6. GLSL Compilation optimization

What is GLSL?: Currently we support 2 types of shaders: 1) Old ARB shaders that are fast; 2) New GLSL shaders that provide all sorts of fancy effects.

All modern cards (since 2004?) support GLSL - the only issue is our own unoptimized GLSL shader code, which also has some bugs on certain systems with certain drivers.

However, migrating completely to GLSL is pretty much the only way to go. ARB shaders are completely unmaintained and obviously enough, GLSL is much easier to write than ARB assembly.

GLSL also supports a large variety of extra features such as C-like loops, control statements, macro definitions, functions. It pretty much looks like C without its standard library.

Moreover, if we ever wish to support Android systems, we will need solid GLSL support.

With all this in mind, I changed how the current GLSL compilation preprocessing is done - leaving most of the work to the driver (even if that's slightly less efficient) by sending 3 separate sources:

1) GLSL version - defaults to #version 120 (OpenGL 2.1, minimum required to use GLSL with our shaders right now)

2) Shader defines - All #defines set for this shader compilation stage

3) Shader source - The actual shader source

The OpenGL driver will concatenate all 3 sources into a single source and will take care of the preprocessing. This greatly reduces code on our side and also reduces the complexity and overhead of the ShaderDefines structure.
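Concretely, the three chunks can be handed to the driver in one call. A sketch (the helper name and string contents are illustrative):

```cpp
#include <string>
#include <vector>

// Build the three source chunks in the order the driver will concatenate them.
std::vector<std::string> BuildShaderSources(const std::string& defines,
                                            const std::string& source)
{
    return {
        "#version 120\n", // 1) GLSL version (OpenGL 2.1, our current minimum)
        defines,          // 2) all #defines for this compilation stage
        source            // 3) the actual shader source
    };
}

// The chunks would then be passed to OpenGL roughly like this:
//   const char* srcs[3] = { v[0].c_str(), v[1].c_str(), v[2].c_str() };
//   glShaderSource(shader, 3, srcs, nullptr); // driver concatenates & preprocesses
//   glCompileShader(shader);
```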

The changes I've made lay some groundwork for future changes on the GLSL shader system.

To end Week #2:

Currently I'm mostly working on debugging the patch to get it working on Linux. As soon as A14 is released, I'd like to commit this large patch to avoid any further SVN conflicts down the road. The patch is gigantic already. You can check it out here:


No statistics this time around, though the numbers are obviously a lot higher than before :)


This is my current TaskList:

-) Patch review

-) Megapatch Debugging

-) ModelRenderer WIP

-) Performance Improvement Fancy Graphs

-) PSA and PMD to collada converter.

-) Migrate to C++11

-) Collada -> PMD cache opt to improve first-time cache loading.


Interesting as usual. I agree about method #2: if we document it, modders can just reuse those even for completely different stuff.


I'll have to postpone today's update until tomorrow, since I'm working on a really interesting piece regarding XML parsing that will improve load times tremendously. If my new RAM also arrives tomorrow, I'll be able to finally throw in memory usage comparisons. (Apparently 4GB is not enough to profile 0AD memory usage :( ).

Stay tuned! :)


So I guess this report will be pretty epic. I've been working all night on XMB file loading and optimization. Mostly to greatly improve loading speeds. However, I digress, here's my report of last week.

Week #3 19.08 - 25.08

19.08 1100-1700 (6) - Debugging and bugfixes on megapatch. Huge breakthrough.

20.08 1600-0200 (10) - Debugging shader issues.

21.08 1200-1900 (7) - ShaderProgram variations reduction. ModelRenderer texture bind bug solved!

22.08 2100-0500 (8) - Windows Stacktrace failure debugging.

23.08 1000-1200 (2) - Alpha sorting removed.

25.08 1400-0500 (15) - Fundraiser footage. Megapatch bugfixes. UTF conversion optimization.

From the total of 48 hours, most of it went into debugging, but it finally paid off. The patch is now stable on Linux and OSX, which means it's ready for commit after A14 release. At the end of the week I took some extra time to improve UTF conversion performance (since we're doing a lot of it) and also grabbed some footage for the fundraiser.

1. Debugging breakthrough

What's the issue?: Well, until recently the patch crashed on Linux and OSX; on Windows the game ran fine.

It was a really frustrating issue, since I couldn't debug the crash at all - I could only hope to fix any bugs that the changes to the shader definitions system caused. Funnily enough, the failure was simply due to incorrect hashes of CShaderDefines.

End result: We can now deploy the patch after A14 is released and start refining out any bugs that pop up.

2. ShaderProgram Variations

What's the issue?: For each rendering ability, such as Shadows, Specular and Normals, a combination of ShaderDefines is formed. For each unique combination a new shader is compiled. This is very inefficient.

When running 0AD in an OpenGL debugger, I noticed that the number of shader programs generated totaled around 300. Each shader compilation actually takes a pretty long time during loading, so generating over 300 shaders from just a few sounds like a high crime.

The biggest problem is the batch-sorting that is done prior to rendering models - the larger the amount of shaders, the more inefficient rendering becomes due to constant resource binding/unbinding. Batching is also inefficient, resulting in more texture state changes than are actually needed.

My solution was to implement a second layer of caching inside CShaderProgram itself and hash any input shaders. This allows me to check if the current source code has already been compiled and if so - retrieve a reference counted handle to the shader program.

This is really great and reduced the amount of shader programs from 300 to around 120.
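The second-layer cache boils down to something like this (hypothetical names - the engine's CShaderProgram internals differ, and a production version would also compare sources on a hash collision, which this sketch skips):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for a compiled GL program handle.
struct CompiledProgram { unsigned glHandle = 0; };

// Cache keyed on a hash of the full preprocessed source: different define
// combinations that produce identical source share one compiled program.
std::unordered_map<size_t, std::shared_ptr<CompiledProgram>> g_programCache;

std::shared_ptr<CompiledProgram> GetOrCompile(const std::string& fullSource)
{
    size_t key = std::hash<std::string>{}(fullSource);
    auto it = g_programCache.find(key);
    if (it != g_programCache.end())
        return it->second; // already compiled: hand out a shared handle

    auto prog = std::make_shared<CompiledProgram>(); // real code would invoke
    g_programCache[key] = prog;                      // glCompileShader etc. here
    return prog;
}
```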

What we could do more to improve this situation is to use less shader defines - the smaller the number of variations, the smaller the number of shaders compiled.

End result: The annoying load time at the end of the loading bar was reduced by half and is hardly noticeable now.

3. Windows Stacktrace failure

What's the issue?: Several error reports on windows fail to generate a proper stacktrace and usually another error occurs while generating the error message.

This was actually pretty hard to debug. On VS2008 the issue was somewhat improved with the /Oy- flag, which forces the usage of frame pointers. On VS2012, disabling Whole Program Optimization generally gave improved results.

Still, a lot of cases failed and no stacktrace was generated at all. Apparently, if the top-level function is inlined, WinDbg.dll is unable to resolve the function reference. In that case the only fix was to change the stacktrace behaviour to simply display all functions and skip any filtering on the callstack.

This at least gave some kind of stacktrace, which is better than nothing.

End result: Error reports can now be expected to always give a stacktrace on Windows.

4. Alpha sorting

What's the issue?: A noticeable amount of time during rendering is spent sorting transparent models - improvement in this is essential for better rendering performance.

Even though I spent the least amount of time on this issue - it probably had the biggest FPS impact on the renderer.

The current renderer distance-sorted all transparent models prior to rendering, resulting in some pretty complex batching before rendering. This takes almost half of the rendering time itself and is pretty useless, because OpenGL employs a Z-Buffer which, in combination with proper alpha testing, gives perfect results.

Since 0AD already employs this functionality, all I had to do was remove <sort_by_distance/> and any code related to distance sorting in the modelrenderer.

End result: Visually no difference. About 33% gain in performance (depending on amount of trees), 50 fps -> 70 fps.

5. UTF Conversion

What's the issue?: There is a lot of string conversion going back and forth in the 0AD engine: UTF8, UTF16, UCS-2, UTF32 strings are all being used and constantly converted from one type to another.

My first goal was to reduce the amount of conversions done, but that's a really huge change. The next best thing I could do was streamline the UTF8 conversion code.

1) Added conversion of UTF8 -> UTF16 and UTF16 -> UTF8 for faster SpiderMonkey interaction.

2) Added a special case for UCS-2 <-> WCHAR_T on Windows, resulting in faster conversion performance on Windows.

3) Improved, optimized and streamlined the code to do UTF conversion much faster than before.

However, these changes are intended for gradual movement from WCHAR_T strings (UCS-2 on Windows and UTF32 on Linux) to simple UTF8 strings. There is a lot of code that uses WCHAR_T strings, even though there is no real need for it. The only part of code that needs to deal with UCS-2 strings is Windows CreateFileW, which is rarely called.

End result: Less string conversions, faster UTF8/UTF16/UTF32 string conversion

To end Week #3:

I still didn't manage to do any patch reviewing, so I'll /have/ to do it first thing tomorrow (otherwise I'll procrastinate again and work on some awesome module instead). I think it was an excellent week nevertheless - I was able to squash the annoying runtime bugs thanks to everyone on the IRC helping me test it out.

Since I finally got my 8GB of RAM, I can dedicate a day for memory performance comparisons.


This is my current TaskList:

-) Patch review

-) Performance Improvement Fancy Graphs

-) PSA and PMD to collada converter.

-) Migrate to C++11

-) Collada -> PMD cache opt to improve first-time cache loading.


Another week has passed and I've been working furiously, mostly on fixing any memory leaks that popped up, optimizing XMB parsing to use string tokens, and finally the wide-scale UTF8 transition of the engine - an epic task in itself, but one that is definitely going to speed up the core of the engine by a noticeable factor.

As I've promised to add some profiling data, I've also taken some quick snapshots and comparisons on memory usage in general.

Week #4 26.08 - 01.09

26.08 1500-0300 (12) - XML Conversion optimization. WriteBuffer optimization.

27.08 2000-1100 (15) - XMB parse optimization.

28.08 1800-0400 (10) - UTF8 conversion optimization. JS ScriptLoad and Eval optimization.

29.08 1600-0800 (16) - Memory leak fixes. TextRenderer UTF8 support, major optimization. EntityMap

30.08 1700-0300 (10) - GUI UTF8 transition

31.08 1200-0100 (13) - Still UTF8 transition, Console input UTF8 compatible.

01.09 1500-0300 (12) - Patch review, weekly summary, performance graphs, GUI text hack

You can notice that I've put in some insane number of hours for this week (88) - migrating to UTF8 only is not trivial in the slightest and a lot of code has to be made UTF8 aware - text rendering, console input, gui input, script interface; just to name a few.

As expected, there is a noticeable performance improvement by migrating to UTF8 - the reason is simple: We're doing a lot less string conversions everywhere.

1. XML Conversion

What's the issue?: Converting large XML files to XMB takes a ridiculously large amount of time.

I don't know if this can be called a real issue, since it only really affects us developers, but in reality this was actually part of the XMB optimization, and the added optimization made it more than twice as fast.

By implementing a custom String-ID map and making use of the knowledge that libxml2 already buckets its strings, I was able to greatly improve the bucketing speed. Furthermore, by placing the actual String-ID table at the end of the XMB file, we need to traverse the whole XML tree only once - before, we had to do it twice: 1) to get all unique String-IDs, 2) to write all the XML nodes.

Perhaps the most important change was that strings are now stored as UTF8 not UCS-2, which makes conversion faster and also benefits XMB parser greatly.

Before: XML tree was traversed twice and std::set with expensive string comparisons was used.

After: XML tree only traversed once and a custom String-ID map with simple pointer comparisons. Speedup depends on actual XML tree complexity and the amount of unique attribute/id names. Strings are stored as UTF8 which is much faster.
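The String-ID trick works because libxml2 interns element and attribute names in a dictionary, so equal names share one pointer and the map can key on the pointer itself. A minimal sketch (types and names are illustrative, not the engine's actual code):

```cpp
#include <cstdint>
#include <unordered_map>

typedef unsigned char xmlChar; // stands in for libxml2's xmlChar here

// Pointer-keyed String-ID map: no string comparisons, just pointer hashing.
struct StringIDMap
{
    std::unordered_map<const xmlChar*, uint32_t> ids;

    uint32_t GetOrAssign(const xmlChar* internedName)
    {
        auto it = ids.find(internedName);
        if (it != ids.end())
            return it->second;      // seen before: pointer lookup only
        uint32_t id = (uint32_t)ids.size();
        ids[internedName] = id;     // first sighting: assign the next ID
        return id;
    }
};
```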

2. XMB Parser

What's the issue?: XMB files are binary XML files, but because the strings are stored as UCS-2 any string related operations are ridiculously complex and slow.

Due to legacy reasons, XMB strings were stored as UCS-2 strings, which is the JS compatible string format. As the project evolved, more and more layers of strings ended up creating a ridiculously complex chain of string conversions whenever XMB strings are read.

The actual conversion sequence was as follows: XMB UCS-2 -> UTF16 -> WCHAR string (UCS-2 on Windows, UTF32 on Linux) -> UTF8 string.

That's 3 layers of string conversions, just to get the end result as an UTF8 string. The correct solution here is pretty obvious: most of the game works with 1-byte char sequences (the std::string) and the best compatible format with that is the general multi-byte string (variable char length) known as UTF8. Any C++ code that doesn't care about specific Unicode characters will still function perfectly fine with UTF8 strings - that's how UTF8 was designed to be.

You can read more about character Encodings here: http://www.joelonsof...es/Unicode.html

Since we now convert XML files to XMB with UTF8 strings, the actual conversion sequence is much simpler:

XMB UTF8 -> UTF8 string (std::string); which means we do a simple copy of the string.

However, do we really need to do even that? I've recently been working on renaming the CTokenizer class to CTokenStr - which is a special string class that references other strings. It doesn't contain any string data itself - only pointers to the start and end of the string.

Once I introduced CTokenStr instead of std::string, the actual conversion sequence looks like this:

XMB UTF8 -> referenced by CTokenStr; which means we actually don't do any work at all!

And we all know the fastest way to do something is to not do it at all. :)

Before: 3 layers of string conversions and copies.

After: No string conversions and no string copies - we actually don't do much work at all. Isn't that nice?
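The core of the CTokenStr idea can be sketched as a non-owning reference into an existing buffer (a simplification - the real class has a much richer interface):

```cpp
#include <cstring>
#include <string>

// Non-owning string reference: two pointers into somebody else's buffer.
struct CTokenStr
{
    const char* start;
    const char* end; // one past the last character

    size_t Length() const { return end - start; }

    bool Equals(const char* s) const
    {
        size_t n = strlen(s);
        return n == Length() && memcmp(start, s, n) == 0;
    }

    // Copying only happens if the caller explicitly asks for it.
    std::string ToString() const { return std::string(start, end); }
};
```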

3. UTF8 Conversion

What's the issue?: The previous UTF8 conversion wasn't optimized for speed and wasn't robust enough to "just work". More importantly, the current UTF8Codec does not expose a convenient 1-char encode/decode.

Luckily I've worked with UTF8 libraries and high-speed encoding/decoding before. With most of the engine core being converted to work with UTF8 exclusively, we really needed a reliable UTF8 decode function that would always work and automatically correct itself on any invalid UTF8 sequences.

The resulting prototype looks like this:

/**
 * Decodes a single 16-bit WCHAR (stored in a 32-bit unsigned) from the
 * specified UTF8 sequence.
 * @param utf8sequence An UTF8 sequence of 1-4 bytes.
 * @param outch The output character decoded from the UTF8 sequence.
 * @return Number of bytes consumed while decoding (1-4).
 */
int utf_decode_wchar(const char* utf8sequence, wchar_t& outch);

Which can be easily used to grab a sequence of wchar's:

wchar_t ch;
while (str < end)
{
    str += utf_decode_wchar(str, ch);
    // do something with 'ch'
}
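For reference, a simplified stand-alone decoder in this spirit could look like the following (an illustrative sketch, not the engine's actual implementation; it returns char32_t to sidestep the 16/32-bit wchar_t platform difference, and it doesn't reject overlong encodings):

```cpp
#include <cstdint>

// Decodes one UTF-8 sequence (1-4 bytes) and returns the bytes consumed.
// Invalid sequences decode to U+FFFD and consume one byte so the caller
// can resynchronize - the "self-correcting" behaviour described above.
int utf_decode(const char* s, char32_t& out)
{
    const unsigned char* u = reinterpret_cast<const unsigned char*>(s);
    if (u[0] < 0x80) { out = u[0]; return 1; } // ASCII fast path
    if ((u[0] & 0xE0) == 0xC0 && (u[1] & 0xC0) == 0x80) { // 2-byte sequence
        out = ((u[0] & 0x1F) << 6) | (u[1] & 0x3F);
        return 2;
    }
    if ((u[0] & 0xF0) == 0xE0 && (u[1] & 0xC0) == 0x80
                              && (u[2] & 0xC0) == 0x80) { // 3-byte sequence
        out = ((u[0] & 0x0F) << 12) | ((u[1] & 0x3F) << 6) | (u[2] & 0x3F);
        return 3;
    }
    if ((u[0] & 0xF8) == 0xF0 && (u[1] & 0xC0) == 0x80
        && (u[2] & 0xC0) == 0x80 && (u[3] & 0xC0) == 0x80) { // 4-byte sequence
        out = ((u[0] & 0x07) << 18) | ((u[1] & 0x3F) << 12)
            | ((u[2] & 0x3F) << 6) | (u[3] & 0x3F);
        return 4;
    }
    out = 0xFFFD; // invalid lead/continuation byte: emit replacement character
    return 1;
}
```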

We only ever need this char by char decoding in:

1) TextRenderer - GlyphAtlas works on Unicode, so we need the wchar_t

2) Console - In order to handle input correctly, we need to be UTF8 aware

3) GUI Input fields - Same as Console

4) UTF8 conversion - The actual methods that decode utf8 strings to wchar_t strings

Previously there were a lot of UTF8 -> WCHAR string conversions in the engine, but with the gradual migration to UTF8, there is practically no conversion at all. Altogether, the only intensive part is TextRenderer, which is already quite efficient.

Furthermore, there was no way to convert directly to UCS-2 from UTF8; you always had to do UTF8 -> WCHAR string -> UCS-2 string. With the new interface you can convert UTF8 directly to UCS-2.

Before: No way to efficiently decode individual UTF8 sequences, nor UTF8 directly to UCS-2.

After: A very robust and efficient UTF8 interface allows to streamline the entire engine.

4. JavaScript Script Loading

What's the issue?: In order to load javascript scripts, the source needs to be converted to UCS-2.

So this actually ties in with the UTF8 changes in general - by implementing a static conversion buffer for UCS-2 scripts, the conversion of scripts to UCS-2 is much, much more efficient. By using a static buffer, there is no memory allocation, which is perfect.

It's very straightforward now:

1) Load the script file with VFS

2) Convert the file buffer directly to UCS-2 in the static code buffer

3) Execute the script

Previously there would have been a lot of dynamic conversions of the file buffer which was a huge waste of processing time.

Before: Script loading involved several layers of dynamic string conversions.

After: Script files are converted directly to UCS-2 in a static buffer and then loaded.
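The static-buffer idea can be sketched like this (hypothetical names; ASCII-only conversion shown for brevity, whereas real UTF8 -> UCS-2 decoding handles multi-byte sequences):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One buffer reused across script loads: it grows to fit the largest script
// once, after which conversions cause no further heap allocations.
static std::vector<uint16_t> g_scriptCodeBuffer;

const std::vector<uint16_t>& ConvertScriptToUCS2(const std::string& utf8)
{
    g_scriptCodeBuffer.clear();           // keeps capacity: no reallocation
    g_scriptCodeBuffer.reserve(utf8.size());
    for (unsigned char c : utf8)
        g_scriptCodeBuffer.push_back(c);  // ASCII code units map 1:1 to UCS-2
    return g_scriptCodeBuffer;            // hand straight to the JS engine
}
```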

5. TextRenderer UTF8 Support

What's the issue?: Previously, in order to render anything on screen, you had to convert it into a WCHAR string, which was very clumsy.

So this one actually makes text rendering ridiculously faster than before. We don't use any dynamic containers for text, nor index it or anything. Instead we have a fixed-size Vertex Buffer. All text that is submitted for rendering is immediately converted into vertices.

Why is this good? This means we only need to send Vertex data to the GPU once. This gives a very noticeable speedup and is the correct way to deal with vertices on the GPU.

This transition also allowed us to batch text together by font - meaning there are no unnecessary font texture switches going on.

So, all of this goodness just because we transitioned to UTF8? Well, of course we could have done it before, but the need to render both WCHAR and UTF8 strings meant this was the only viable choice, really. No other way would have been right.

And for future reference, there is still a lot to improve when it comes to text - though this system is now compatible with the TrueType OpenGL text engine I developed around June, which means we can actually transition to TrueType fonts for really awesome and clear anti-aliased text :).
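As a rough illustration of the fixed-buffer batching described above - all names, sizes, and the glyph layout here are invented for the sketch, not the engine's actual API:

```cpp
#include <cstddef>
#include <vector>

struct TextVertex { float x, y, u, v; };

// Fixed-size vertex buffer: no per-frame heap allocation for text.
static const size_t MAX_TEXT_VERTICES = 16384;
static TextVertex   g_textVertices[MAX_TEXT_VERTICES];
static size_t       g_textVertexCount = 0;

// One draw batch per font: a contiguous range inside the shared vertex
// buffer, so each font texture is bound exactly once per frame.
struct TextBatch { int fontId; size_t firstVertex, vertexCount; };
static std::vector<TextBatch> g_batches;

// Append one glyph quad (2 triangles = 6 vertices). If the last batch uses
// the same font, extend it instead of starting a new one - this is the
// clumping that avoids unnecessary font texture switches.
bool SubmitGlyph(int fontId, float x, float y, float w, float h)
{
    if (g_textVertexCount + 6 > MAX_TEXT_VERTICES)
        return false; // buffer full; a real renderer would flush here
    if (g_batches.empty() || g_batches.back().fontId != fontId)
        g_batches.push_back({fontId, g_textVertexCount, 0});
    const TextVertex quad[6] = {
        {x, y, 0, 0}, {x + w, y, 1, 0}, {x, y + h, 0, 1},
        {x + w, y, 1, 0}, {x + w, y + h, 1, 1}, {x, y + h, 0, 1},
    };
    for (const TextVertex& v : quad)
        g_textVertices[g_textVertexCount++] = v;
    g_batches.back().vertexCount += 6;
    return true;
}
```

At the end of the frame the whole array would be uploaded to the GPU in one go, and each batch drawn with a single draw call after binding its font texture.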

Before: Text rendering wasn't that efficient and only supported WCHAR strings, which made printing UTF8 very clumsy.

After: Text renderer is a lot faster, uses less memory, supports both UTF8 and WCHAR strings and is compatible for a transition to TrueType fonts.

6. UTF8 Transition

What's the issue?: The engine relies heavily on WCHAR strings, which is very bad for cross-platform projects.

So you've noticed there's a lot of UTF8 here and there this week. It's all part of a larger goal to transition to UTF8 strings only. This makes the whole engine a lot simpler, since we only need WCHAR conversion in a few select places.

These are:

1) When we render text (although we do this char by char).

2) When we open files with Windows API.

So in all respects, using WCHAR strings makes no sense at all, and we can get away with just using UTF8 strings.

Modules that have been converted to UTF8:

1) GUI


3) TextRenderer

This transition is still in progress and is a slow and arduous one.

7. Patch Reviews

I finally managed to do some patch reviewing on some core components of the engine, so the week wasn't entirely wasted.

I reckon the most efficient way to handle patch reviews is to explicitly assign C++ patches to me for review. I'll be jumping into patch reviews more frequently from now on anyway, but explicit assignment would speed up the progress.

Patches that I approved were marked as "reviewed" and thanks to sanderd17, we can now easily query those tickets. So when A14 is released and feature lock is lifted, we can start throwing stuff into SVN and make it all work before A15 is released.

To end Week #4:

I don't know if there's much else to say about all of this. I guess to end this week, I'll now throw fancy performance graphs at you. Beware. :)

( followed in the next post )


This is my current TaskList:

-) UTF8 transition

-) PSA and PMD to collada converter.

-) Migrate to C++11

-) Collada -> PMD cache opt to improve first-time cache loading.


The following profiling is done on 2 separate builds of pyrogenesis in release mode. Both have exactly the same optimizations applied and are built on VC++2012. I'll refer to these versions as:

1) A14 Release - The SVN version of A14, r13791

2) A15 Dev - The megapatch on r13791

First we'll test everything at a "visual" glance. This means I don't use any profiling tools; we only monitor the FPS and "how it feels". Both of these tests will be run with the cache fully loaded on Cycladic Archipelago 6.

Once that is done, we can compare memory peak usage, memory allocation histograms and loading times in general.

The testing system is:

Windows 7 x64, Intel i7-720QM 1.6GHz, 8GB DDR3, Radeon HD 5650 1GB, Force GS 240GB SSD

Game settings:

Most fancy options set to high, Postprocessing disabled, Windowed

1. First Glance @ 1280x720

-) A14 Release

This is the version that will be packaged and tested before release in the next couple of days. We've been working hard on optimizations, but most of them never made it into A14. This will give us a fair comparison of how big a performance gain we'll be looking at.

The menu is a good place to test the core speed of the engine. Very fast engines usually get over 1000 fps. A14 gets ~480 fps, which is not bad at all considering we run a very complex scripting engine behind the scenes.


To further test general game speed, let's enter the Match Setup chatroom. At first it starts pretty strong at ~300 fps:


But once more and more text piles up, the FPS drops to a meager ~50-60 fps! This is because text rendering is still extremely inefficient in A14.



Now let's load Cycladic Archipelago 6. It's very hard to profile loading times because I have a 550MB/s SSD. The loading was fast, around 6 seconds, though it was stuck at 100% for half of that.

The last 100% is where all the shaders get loaded.

I get a fairly steady ~46 fps in the initial screen.


Zooming in, the FPS obviously increases to ~58, because there is less stuff to render.


Once we zoom out with a revealed map, the fps drops to ~40.



-) A14 Release summary:

The chatroom showed how big a bottleneck the current GUI can be; it's not very efficient. With a revealed map I get 40 fps, which is a bit low, considering my system can play Mass Effect 3 in 1080p at the same fps.

-) A15 Dev

This one has about 2 months' worth of optimizations put into it. I thought I would achieve more in such a long period of time, but despite my previous experience, working on pyrogenesis has been different.

Mostly because it's cross-platform, which restricts many of the optimization options available to the programmer; and secondly because code that worked and ran fine on Windows often didn't work at all on Linux. This meant a few weeks of coding were lost and had to be reverted.

The patch adds 7376 lines and removes 5507 lines of code. It has also gained a nickname "megapatch", due to how big the SVN .patch file is (~1mb).

The menu in the patched version runs at ~630 fps, so at first glance at least something appears to have improved.


Now let's check how the Match Setup chatroom fares on A15 Dev. About ~300 fps, just like before. Looks like there's some other bottleneck in the code, but then again 300 fps is more than enough.


What happens if we spam a few hundred lines of text at it? Only a slight drop to ~280 fps, which is a lot better than before. It means long lobby times won't hurt the game in A15.



Now let's load Cycladic Archipelago 6. The loading is slightly faster, at around 4 seconds. Again, half of that is spent at 100%. However, this time it's faster because A15 Dev optimizes shader compilation, reducing the number of shaders compiled from ~300 to ~130.

The initial look shows us ~61 fps, which is roughly +33% faster than A14. It's far less of an improvement than expected though. I'm slightly dismayed at that.


If we zoom in, we see a similar improvement ratio of +25% at ~73fps:


And with reveal map and zoomed out we get ~51 fps, which is about +27%.



-) A15 Dev Summary:

I'm a bit disappointed. After all the optimizations, I expected much better results. However, it's nice to see that the TextRenderer optimizations paid off.

The loading time of 4s is already fast enough for me, so I can't complain. Also, the general improvement of +~25% fps is enough to make the game much more playable.

I think the best improvement is the new input system - it's much smoother than the previous one, so the game just "feels" faster, even if the raw numbers don't show much of a difference.

This is the end for first glance, which is part 1 of the profiling session. Next part will show some memory usage data.



2. Performance Profiling with MTuner

MTuner is a really useful tool for profiling memory usage and detecting leaks. The goal is to compare A14 Release vs A15 Dev in its overall memory allocation intensity.

The actual memory peak usage doesn't really concern us, since we can always sacrifice memory space for speed. What we can't sacrifice however is memory latency - meaning we must reduce the number of dynamic allocations to the lowest we can get.

The A15 Dev patch actually focuses heavily on optimizing memory usage and reducing allocation bottlenecks where they are detected. Even though I've been working furiously on the current patch, I still haven't been able to remove all the bottlenecks. There is still a lot of room to improve.

Both sessions were profiled on Cycladic Archipelago 6. Although I wasn't able to exactly repeat every action and movement I took, the results were consistent enough over several sessions.

1) Memory usage Timeline

This type of graph shows the overall memory usage during the lifetime of the program and gives us a rough idea what the program is doing. We can also see JavaScript's GC Heap slowly growing, resulting in the slight rise in memory usage. If we recorded enough data, we would see a saw-like /|/|/|/| pattern in memory usage.

Profiling projects with a GC Heap is ridiculously hard, because you never know if it's a leak or just GC delaying its collection cycle.

A14 Release:

The game starts up slightly under the 32MB mark, and once loading begins, memory usage starts slowly climbing. The whole loading sequence takes around 6 seconds to finish. After that we see a usage graph with lots of tiny spikes.


The first thing we can see is that memory is allocated gradually during loading, which is actually not that great for loading times - we spend a lot of time waiting on the disk to read our data, then go through a lot of processing on the single loaded asset, just to wait on the disk some more.

It's hard for me to judge loading times, since I have an SSD that pretty much eliminates any IO latency, but others have reported up to 40s loading times, so it's a worthwhile topic.

A15 Dev:

The game starts up just the same, if slightly faster, and the gross memory usage is pretty much the same. The loading segment is a lot steeper in A15 and it loads 33% faster, in just 4 seconds. After that we see a much smoother graph with only a few spikes along the way.


The loading is faster because there is less work for the CPU. However, the amount of data is pretty small, and the Force GS SSD should be able to load all of it (~100mb) in less than a second or so. Lower-end PCs will definitely benefit from this 33% loading time improvement.

We can also notice that memory growth itself is far steeper during loading - this is because model loading allocates a bit more than needed, to avoid any situation where we have to reallocate a list of 1000 vertices. This really pays off in speed and we don't even notice the few bytes we wasted.
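The over-allocation idea can be illustrated with a toy growth policy. The 1.5x factor and the numbers here are assumptions for the sketch, not the engine's actual policy:

```cpp
#include <cstddef>

// Grow a capacity with headroom, so that repeatedly appending elements
// triggers only a handful of reallocations instead of one per append.
size_t grow_capacity(size_t current, size_t required)
{
    size_t cap = current ? current : 64;   // small baseline on first growth
    while (cap < required)
        cap += cap / 2;                    // 1.5x growth: few reallocs, little waste
    return cap;
}

// Count how many reallocations it takes to reach `total` elements in
// `batch`-sized steps - this is the cost the steeper-but-smoother A15
// loading graph avoids paying per vertex.
int realloc_count(size_t total, size_t batch)
{
    size_t size = 0, cap = 0;
    int reallocs = 0;
    while (size < total)
    {
        size += batch;
        if (size > cap) { cap = grow_capacity(cap, size); ++reallocs; }
    }
    return reallocs;
}
```

With exact-fit growth, appending 1000 vertices one at a time would mean 1000 reallocations; with headroom it takes only a handful, at the cost of a few wasted bytes at the end.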

Compared to A14 there are far fewer spikes. The ones that remain all point downwards, meaning deallocation followed by reallocation, which is the typical pattern for a resource that gets reloaded. So this graph is looking a lot better and means we're using our memory more efficiently.

Summary: A15 Dev has definitely got worthwhile improvements. Even though we'd wish for more, 0AD has a huge codebase which requires an immense amount of work to optimize.

2) Memory allocation Histogram

The allocation histogram shows the general number of allocations over a selected timespan. In this case I've selected the time after loading and just before closing. This is to measure general allocation performance during lifetime of the game.

A14 Release:

It's obvious that the number of allocations is just huge. In just a minute, 0AD managed to do over 1.008 million allocations. This is less than desirable.


A lot of time goes into really small allocations, which is not an efficient use of memory, since we have a very high alloc/dealloc frequency and a lot of memory management overhead associated with tiny allocations.

A15 Dev:

We can immediately notice that the number of very small allocations has gone down significantly. However, a lot of 32-byte allocations remain, which is what should be focused on in the next iteration.


However, in total we've reduced the number of allocations by 94,000. Most of that reduction came from the tiny allocations, some of which were carried over into bigger allocation chunks, resulting in an overall more efficient use of memory.


Even though we reduced memory management overhead by 10%, we only gained a meagre ~+25% FPS, which is a rather small improvement. We need breakthroughs that double or triple the fps if we really want results.


This is it for now, I hope you enjoyed this A15 Dev preview.


Very interesting graphs, and very interesting reports too. It seems to me Mythos made a really good decision when he contacted you.


Definitely some very nice improvements. Keep up the good work! :thumbsup:

How long did the game run in these graphs?

One thing I'd like to point out is that currently the major bottlenecks are the AI and the pathfinding.

You won't see any of these problems if you test only a few seconds in game, but they bring the performance down from 60 fps to about 3 fps with 4 AI players, all graphics settings disabled and zoomed in as much as possible.

The problem is that because most players seem to play in singleplayer mode, the user experience will only improve significantly if we solve these problems too.

Try something like this: set the game to run at "insane" speed and see how it performs after around 10000-15000 turns.

./pyrogenesis -quickstart -autostart="Oasis 04" -autostart-ai=1:qbot-wc -autostart-ai=2:qbot-wc -autostart-ai=3:qbot-wc -autostart-ai=4:qbot-wc

I think the main reason is the shared script applying entities delta.

At this point there are so many entity changes in each turn that it takes forever. I've posted a graph showing that here.

As far as I know wraitii is already thinking about some changes to address this problem. I will finish the SpiderMonkey upgrade first before getting into more trouble ;)



Maybe it's better if we post these reports as 4 different topics, so we can link them separately on the social media.


Try something like this, set the game to run on "insane" speed and see how it performs after around 10000-15000 turns.

I think the main reason is the shared script applying entities delta.

At this point there are so many entity changes in each turn that it takes forever. I've posted a graph showing that here.

The profiling data captures about 1 minute, resulting in a 1GB data file. It takes roughly 4GB of RAM to process that data. So at best I think I can capture ~2 minutes before it becomes impossible to process the profiling data.

Perhaps if I load a savegame with some intensive AI action...?



Perhaps if I load a savegame with some intensive AI action...?

Hmm, not sure how well saved games work at the moment.

If it works, it's a very good idea for some other tests I need to do where the simulation behavior is slightly different.

I can just save a game at turn 10000 and measure over a short time, so that the changes don't affect it too much.

Thanks for that idea :).


Nice work redfox! This is how to get breakthrough performance: http://www.wildfireg...65

Porting the game to C# is a terrible idea. Not only will it break Linux portability by making the game reliant on Mono, but it will never be as fast as well-written C/C++. The answer to the performance issues is patience and hard work on the code base, not magic language fixes. The current fundraiser is a recognition of this.


Nice work redfox! This is how to get breakthrough performance: http://www.wildfireg...65

If you're referring to using a newer version of OpenGL, that's a good way to lose a lot of users for dubious performance gains.

If you're referring to porting the game to C#, that's a good way to spend a ridiculous amount of effort and lose a lot of users for more dubious performance gains.


If you're referring to using a newer version of OpenGL, that's a good way to lose a lot of users for dubious performance gains.

If you're referring to porting the game to C#, that's a good way to spend a ridiculous amount of effort and lose a lot of users for more dubious performance gains.

I'm referring to the OpenGL version. The performance gains mahdi posted are incredible - more than 3x. If the single-threaded CPU performance requirements are very low, systems with Intel Atom and AMD Bobcat CPUs will be able to run this game. AMD and Intel have sold millions upon millions of these systems in the last few years (the Bobcat APUs have quite a potent GPU), and they make up a large portion of the market. With the trend of slimmer and slimmer laptops and tablets, one thing is sure: single-threaded CPU performance is not going up by a lot, in contrast to GPU performance, which scales well with smaller production nodes.

I think compatibility with these systems is more important than compatibility with ancient desktop systems (which are the only systems whose compatibility you'd be breaking). Maybe it isn't as black and white as I describe it, but a 3x CPU performance gain is worth looking into. :) In short: I believe the group of current and future users with a quite modern GPU and a CPU with poor single-threaded performance is much larger than the group who have a fast CPU with an ancient GPU. High CPU requirements are not the way to go.

Edited by Norvegia


Switching the spiky parts from manual memory management to garbage collection could also speed things up. GC allocation is typically much faster, and since GC'ing itself has a time complexity with respect to the number of live objects, garbage collection can actually be much faster when you have lots of short-lived objects, as seems to be the case here. Any stalling from GC'ing (the price of more throughput) is also more appropriate during initial map loading than during gameplay.


I'm referring to the opengl version. The performance gains mahdi posted is Incredible. More than 3x. If the single threaded cpu performance requirements is very low, systems with intel atom and amd bobcat cpus will be able to run this game. And AMD and Intel has sold millions upon millions of these systems the last few years (the bobcat APUs have quite a potent GPU). These systems make up a large portion of the market...

I see where you're coming from, but the current performance issues are not related to the language or the OpenGL version. The performance issues are caused by code with very bad worst-case behavior, like CCmpRangeManager using insertion sort in a tight loop, which has O(n^2) worst-case performance - in short, it's horrible.

If we concentrate on ironing out algorithmic bugs, we'll have very good performance.
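To make the complexity point concrete - this is a generic illustration, not the actual CCmpRangeManager code - inserting each element into its sorted position inside a tight loop is O(n^2), while collecting everything and sorting once is O(n log n):

```cpp
#include <algorithm>
#include <vector>

// O(n^2): every insertion may shift up to n elements to keep the
// container sorted after each step.
std::vector<int> sorted_by_insertion(const std::vector<int>& in)
{
    std::vector<int> out;
    for (int v : in)
        out.insert(std::upper_bound(out.begin(), out.end(), v), v);
    return out;
}

// O(n log n): append everything first, then sort exactly once.
std::vector<int> sorted_once(std::vector<int> in)
{
    std::sort(in.begin(), in.end());
    return in;
}
```

Both produce the same result; only the second one stays fast as n grows, which is the kind of algorithmic fix being advocated here.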

Switching the spikes from manual memory management to garbage collection could also speed things up. GC alloc is typically much faster, and since GC'ing itself has a time complexity with respect the number of live objects, garbage collection can actually be much faster when you have lots of short-lived objects, as seems to be the case hear. Any stalling from GCing (the price of more throughput) is also more appropriate during initial map loading than game play.

A garbage collector is a very bad idea for intensive real-time systems such as 0AD. I've worked on games in C# before, and the GC always got in the way once the engine was sufficiently complex. Furthermore, debugging dangling references that cause obscure leaks is just ridiculous.

In general, we do very little actual memory allocation with the new patch - you can see it in the memory timeline graph. Once the game is underway, it's mostly smooth. If we used C#, memory usage would keep climbing very fast until it hit a GC cycle - then the whole game would freeze for half a second. Definitely not something we want. Ever. Dealing with the JS GC is quite enough trouble already.

The best approach here is to allocate as much as possible beforehand and not do any allocations or deallocations during the game loop. This is something you can't really do effectively with GC based languages.
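A minimal sketch of that allocate-up-front approach is a free-list pool; the names and sizes here are illustrative, not from the engine:

```cpp
#include <cstddef>
#include <vector>

struct Projectile { float x, y, vx, vy; bool alive; };

// All heap work happens in the constructor, before the game loop starts.
// During the loop, acquire/release are O(1) and never touch the allocator.
class ProjectilePool {
public:
    explicit ProjectilePool(size_t capacity) : storage(capacity) {
        free_list.reserve(capacity);
        for (size_t i = 0; i < capacity; ++i)
            free_list.push_back(&storage[i]);
    }
    Projectile* acquire() {                    // no allocation
        if (free_list.empty()) return nullptr; // pool exhausted: caller decides
        Projectile* p = free_list.back();
        free_list.pop_back();
        p->alive = true;
        return p;
    }
    void release(Projectile* p) {              // no deallocation
        p->alive = false;
        free_list.push_back(p);
    }
    size_t available() const { return free_list.size(); }
private:
    std::vector<Projectile> storage;           // fixed backing store, never resized
    std::vector<Projectile*> free_list;
};
```

Since the backing store is never resized, the pointers handed out stay valid for the pool's lifetime, which is exactly what a GC'd language can't easily guarantee without pinning.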

