Mesh skinning performance

**Ykkrosh** · April 11, 2012

Re this (split off from that thread to try to minimise disorganisation):

So, what did you guys find out in the testing? What did it do to frame rates when you displayed a bunch of these high-poly animated models on screen?

Testing with 1024 animated actors (the new mesh vs the old mesh skeletal/m_tunic_short.dae, both with animation biped/inf_hoplite_walk.psa and texture skeletal/hele_isp_e_1.dds; no props etc), with all actors on screen. Empty map (no water etc, so no reflections/refractions rendered). Shadows enabled (so each model will be rendered twice). Core i5-2500K 3.3GHz, Windows, vsync disabled, ARB shader rendering, 1024x768 window. Ran on GeForce 560 Ti (pretty high-end compared to most users; ought to run at 60fps with no problem) and Intel HD Graphics 3000 (the current fastest Intel one; generally expected to be usable for gaming at low quality settings, so probably a realistic target for decent performance).

Old mesh:

* Triangles: 390

* Vertexes: 302

* Model triangles drawn: 798,720

* Vertex buffers allocated: 10,372,852 bytes

New mesh:

* Triangles: 6656

* Vertexes: 3402

* Model triangles drawn: 13,631,488

* Vertex buffers allocated: 112,016,048 bytes

Old mesh / GeForce 560 Ti:

* Total frame time: 12.5 msec/frame

* Time in "prepare models": 3.5 msec/frame

* Total frame time when paused: 2.5 msec/frame

New mesh / GeForce 560 Ti:

* Total frame time: 45.5 msec/frame

* Time in "prepare models": 38.0 msec/frame

* Total frame time when paused: 24.0 msec/frame

Old mesh / Intel HD Graphics 3000:

* Total frame time: 26 msec/frame

* Time in "prepare models": 3.5 msec/frame

* Total frame time when paused: 17 msec/frame

New mesh / Intel HD Graphics 3000:

* Total frame time: 145 msec/frame

* Time in "prepare models": 100 msec/frame

* Total frame time when paused: 130 msec/frame

There are 17x as many triangles in the new mesh, and 11x as many vertexes. Vertex buffers are 32 bytes per vertex, for each instance of the mesh.
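As a rough cross-check of the buffer sizes listed above, the 32 bytes per vertex per instance can be multiplied out for the 1024 test actors. This is only a sketch; the function and constant names are made up for illustration. The measured allocations (10,372,852 and 112,016,048 bytes) come out slightly higher than this estimate, presumably because of other allocations or allocator overhead.

```python
# Estimate the vertex buffer size needed when every animated instance
# gets its own copy of the skinned vertices.
# BYTES_PER_VERTEX comes from the post; the names are illustrative only.
BYTES_PER_VERTEX = 32


def vertex_buffer_bytes(vertices_per_mesh, instances):
    """Return the bytes needed for `instances` private copies of one mesh."""
    return vertices_per_mesh * BYTES_PER_VERTEX * instances


old_bytes = vertex_buffer_bytes(302, 1024)   # old mesh, 1024 actors
new_bytes = vertex_buffer_bytes(3402, 1024)  # new mesh, 1024 actors

print(old_bytes, new_bytes)
print(round(6656 / 390, 1), round(3402 / 302, 1))  # triangle and vertex ratios
```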

"Total frame time" is limited by the CPU or GPU, whichever is slower (since they run in parallel).

"Time in "prepare models"" is the CPU cost of the skinning computation and vertex data upload - in the "New mesh / GeForce 560 Ti" case, "prepare models" is about 60% skinning and 40% upload. (Skinning should have the same cost in the Intel HD 3000 case, but the upload is much slower.)

"Total frame time when paused" means the meshes aren't animating, so there's no skinning or vertex data upload - it's basically just the GPU cost of rendering all the triangles.

Based on the paused times, GF560Ti can render about 600M tri/sec, HD3000 can render about 100M tri/sec - those figures sound vaguely plausible so I'll assume they're right. If we want 30fps on HD3000, that means at most 3M tri/frame. With the new 6656-tri mesh (keeping shadows enabled, ignoring props and buildings and trees which will eat into the polygon count), we could have ~200 units on screen at once before hitting the triangle count limit. Half as many triangles would allow twice as many units.
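The triangle budget above can be reproduced with a few lines of arithmetic. This is a sketch under the same assumptions as the post (paused frame time is pure GPU triangle cost, shadows double the triangles drawn); the function names are hypothetical. Using the rounded 100M tri/sec figure without rounding the per-frame budget down to 3M gives about 250 units rather than 200, which is the same ballpark.

```python
# Triangles drawn per frame for the new mesh:
# 6656 tris * 1024 units * 2 passes (main pass + shadow pass).
NEW_MESH_TRIS_DRAWN = 13_631_488


def tris_per_second(tris_per_frame, paused_ms):
    """Throughput implied by a paused (GPU-bound) frame time in milliseconds."""
    return tris_per_frame / (paused_ms / 1000.0)


def max_units(rate, fps, tris_per_unit, passes=2):
    """Units that fit in the per-frame triangle budget at the target fps."""
    budget = rate / fps
    return int(budget // (tris_per_unit * passes))


gf560_rate = tris_per_second(NEW_MESH_TRIS_DRAWN, 24.0)    # roughly 570M tri/sec
hd3000_rate = tris_per_second(NEW_MESH_TRIS_DRAWN, 130.0)  # roughly 105M tri/sec

units_full = max_units(100e6, 30, 6656)       # new mesh on HD3000 at 30fps
units_half = max_units(100e6, 30, 6656 // 2)  # half as many triangles per unit
print(units_full, units_half)
```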

Independent of this triangle rendering, the CPU skinning takes about 25 msec/frame for these 1024 units. 200 units should therefore be ~5 msec/frame. This is a fairly fast CPU, so multiply by perhaps 2 for a reasonable lower-end CPU. Running at 60fps means we only have 16 msec/frame in total, and 5ms (or 10ms) is a big chunk. So I think we'd be primarily limited by CPU skinning cost, before being limited by triangle rendering cost, except on especially slow GPUs and fast CPUs.
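The CPU side of that estimate, as a small sketch. It assumes skinning cost scales linearly with the number of units, and the factor of 2 for a slower CPU is the guess from the post; the names are illustrative.

```python
SKINNING_MS_FOR_1024_UNITS = 25.0   # approximate measurement from the post
FRAME_BUDGET_MS_60FPS = 1000.0 / 60  # about 16.7 ms


def skinning_ms(units, cpu_slowdown=1.0):
    """Estimated CPU skinning time per frame, assuming linear scaling."""
    return SKINNING_MS_FOR_1024_UNITS * units / 1024 * cpu_slowdown


fast_cpu = skinning_ms(200)       # the test machine
slow_cpu = skinning_ms(200, 2.0)  # assumed lower-end CPU, 2x slower

for label, ms in (("fast", fast_cpu), ("slow", slow_cpu)):
    share = ms / FRAME_BUDGET_MS_60FPS
    print(f"{label}: {ms:.1f} ms/frame, {share:.0%} of a 60fps frame")
```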

Vertex data upload seems unpleasantly expensive; 100MB of vertex data per frame at 60fps is approaching the PCIe x16 bandwidth limit, so that'll never work especially well, and with smaller numbers of units it's still a lot of bandwidth. I think our current vertex data upload code is somewhat inefficient (it updates lots of tiny chunks instead of throwing out the entire vertex buffer each frame, which'll probably prevent some driver optimisations) and could be improved, but that wouldn't solve the fundamental bandwidth problem.
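To put a number on the bandwidth claim: the sketch below multiplies the measured new-mesh vertex buffer size by 60fps and compares it with a bus limit. The post does not say which PCIe generation is meant, so the 8 GB/s figure (the theoretical peak of PCIe 2.0 x16 in one direction) is an assumption, and real-world throughput would be lower.

```python
VERTEX_BYTES_PER_FRAME = 112_016_048  # new mesh, 1024 units (measured in the post)

# Assumption: PCIe 2.0 x16, 500 MB/s per lane per direction, theoretical peak.
PCIE2_X16_BYTES_PER_SEC = 16 * 500e6


def upload_rate(bytes_per_frame, fps):
    """Bytes per second needed to re-upload all vertex data every frame."""
    return bytes_per_frame * fps


rate = upload_rate(VERTEX_BYTES_PER_FRAME, 60)
fraction = rate / PCIE2_X16_BYTES_PER_SEC
print(f"{rate / 1e9:.2f} GB/s, {fraction:.0%} of the assumed bus limit")
```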

So... I don't think the 6656-tri mesh is obscenely high resolution, but it's a bit too much if we want 200 units on screen at once (and much too much if we want more). But what we should really try is to do skinning on the GPU instead of on the CPU - that wouldn't increase the GPU's maximum renderable tris/sec, but it would eliminate the CPU skinning cost (at the expense of putting more load on the GPU vertex shaders) and would also eliminate the vertex data upload. That shouldn't be technically complex (I hope), so I suppose I'll experiment with that to see how it influences performance. With that data it should be possible to make a more informed tradeoff between gameplay design (number of units) and art design (number of triangles per unit).

**Ykkrosh** · April 11, 2012

I suppose I'll experiment with that to see how it influences performance

Hmm, I did a simple test with doing the vertex transforms in the (GLSL) vertex shader instead of on the CPU. Models share mesh data; the mesh data has a GL_UNSIGNED_BYTE blend-matrix index attribute per vertex (via glVertexAttribIPointer); blend matrices are uploaded per model into a "uniform mat4 blendMatrices[128]" with glUniformMatrix4fv.

Both models have 30 bones (counting the implicit root bone). The low-res mesh has 121 blend matrices (the number of distinct combinations of bone weights). The high-res mesh has 30 (because it's using the buggy PMD exporter which only uses one bone per vertex).
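To illustrate where a count like 121 vs 30 comes from, here is a minimal sketch of collecting the distinct bone-weight combinations in a mesh and giving each vertex an index into that table. The data layout and function name are hypothetical and the vertex data is made up; this is not the engine's actual code. With only one bone per vertex, the number of combinations can never exceed the number of bones, which fits the 30 seen for the high-res mesh.

```python
def build_blend_table(vertex_influences):
    """Map each distinct bone-weight combination to a blend-matrix index.

    vertex_influences: one tuple of (bone_index, weight) pairs per vertex.
    Returns (combos, indices): the list of distinct combinations, and for
    each vertex the index of its combination in that list.
    """
    combos = []
    index_of = {}
    indices = []
    for influences in vertex_influences:
        # Sort so the same set of influences in a different order matches.
        key = tuple(sorted(influences))
        if key not in index_of:
            index_of[key] = len(combos)
            combos.append(key)
        indices.append(index_of[key])
    return combos, indices


# Made-up example: five vertices influenced by two bones.
verts = [
    ((0, 1.0),),
    ((0, 0.5), (1, 0.5)),
    ((1, 0.5), (0, 0.5)),  # same combination as above, different order
    ((1, 1.0),),
    ((0, 1.0),),
]
combos, indices = build_blend_table(verts)
print(len(combos), indices)
```

Each entry in `combos` would correspond to one blended matrix computed on the CPU per model per frame, and `indices` is the kind of per-vertex byte attribute that the shader uses to pick its matrix.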

Total frame times:

Old mesh / GF560Ti: 16 msec/frame

New mesh / GF560Ti: 28 msec/frame

Old mesh / HD3000: 67 msec/frame

New mesh / HD3000: 160 msec/frame

The only case that's faster is the new mesh on GF560Ti. I presume that's because it has few blend matrices and many vertices (unlike the old mesh), and the GPU has more processing power than the CPU (unlike the HD3000).

With old mesh on GF560Ti, profiler says 31% of time is computing the blend matrices, 15% is inside the blend matrix glUniformMatrix4fv, most of the rest is in drivers. The uniform cost (and presumably much of the driver cost) could possibly be reduced with GL_ARB_uniform_buffer_object, but only new drivers support it; or for the old mesh (with many more blend matrices than bones) it could be reduced by uploading the bone matrices and weights and doing the blending inside the vertex shader, instead of blending on the CPU (which would also save that CPU time). I suppose I should try that too, to see if it makes the old mesh any faster.

**Wijitmaker** · April 11, 2012

Cool Philip - nice test method.

Both models have 30 bones (counting the implicit root bone). The low-res mesh has 121 blend matrices (the number of distinct combinations of bone weights). The high-res mesh has 30 (because it's using the buggy PMD exporter which only uses one bone per vertex).

I was wondering if that might be an issue. I'll see if I can get you a .dae file that will properly weight the vertexes. That might change performance though wouldn't it? I guess we'll soon find out.

So... I don't think the 6656-tri mesh is obscenely high resolution

I'm surprised the engine performed as well as it did. 6656 tris is absurdly high in comparison to RTS games from the past that I'm familiar with (AoM, AoE III, BfME, C&C) - and I freely admit I'm out of the loop these days with what current RTS games are doing. However, if Pyrogenesis could handle something between 390 tris and 6656 tris (there is a lot of purposeless geometry in that model that could be cut), and the art team sees a visual benefit vs. the cost of making the new models - then I say go for it.

Unfortunately I haven't heard anything back from that Blender/Max artist I met a few weeks ago. So, I guess it is back on me to get that skeleton working in Blender.

**Mythos_Ruler** · April 11, 2012

I'd do these tests with something much smaller than the 6656 mesh. There's no way the units will even end up close to that number of triangles/vertices. If we redo the body meshes, then I'm thinking the highest they'll go is 1000 triangles. If we keep the old ones and just edit those, then we'll likely only add maybe 100-200 triangles in tweaks, max. Add a few hundred for props and the largest (infantry soldier) dude will clock in at 1500 at the most. Cavalry will likely run around 1500-1600. So, there's really no use in running any more tests with a 6656 triangle mesh.

The way I'm leaning is to just edit the existing meshes to give us those additional body types we've been talking about, plus some added geometry here and there (sleeves, a better looking foot, the bottom of tunics, et cetera), giving us a body mesh of maybe 500 triangles, give or take.

**Ykkrosh** · April 11, 2012

I suppose I should try that too, to see if it makes the old mesh any faster.

Tried that - arrays of 4 joints/weights are shared between each model, and the CPU just computes the animated bone matrices (multiplied by inverse bind pose) and uploads that per model, and the vertex shader does all the weighted blending per vertex.
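The per-vertex work that moves into the vertex shader here is ordinary linear blend skinning. The sketch below shows the same math in plain Python for a single vertex with 4 joints/weights; it is only an illustration of the technique, not the shader that was tested, and the helper names are made up. The bone matrices are assumed to already be the animated bone transform multiplied by the inverse bind pose, as described above.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def skin_vertex(position, joints, weights, bone_matrices):
    """Linear blend skinning for one vertex.

    The result is the weighted sum of the position transformed by each
    influencing bone matrix. Unused slots carry a weight of 0.
    """
    p = [position[0], position[1], position[2], 1.0]
    out = [0.0, 0.0, 0.0]
    for j, w in zip(joints, weights):
        if w == 0.0:
            continue
        t = mat_vec(bone_matrices[j], p)
        for i in range(3):
            out[i] += w * t[i]
    return out


def translation(x, y, z):
    """Build a 4x4 translation matrix (test helper)."""
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]


# Two bones: one stays put, one moves 2 units along x.
bones = [translation(0, 0, 0), translation(2, 0, 0)]
skinned = skin_vertex(
    (1.0, 1.0, 1.0),
    joints=(0, 1, 0, 0),
    weights=(0.5, 0.5, 0.0, 0.0),
    bone_matrices=bones,
)
print(skinned)  # halfway between the two bone transforms
```

In the GPU version only `bone_matrices` changes per model per frame; the joints and weights live in the shared mesh data.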

Old mesh / GF560Ti: 7 msec/frame

New mesh / GF560Ti: 35 msec/frame

Old mesh / HD3000: 35 msec/frame

New mesh / HD3000: 285 msec/frame

It's better than the previous approach for the old mesh (since there are fewer uniforms to upload), but worse for the new mesh (since there's the same number of uniforms and more computation in the vertex shader; especially on HD3000 which is seemingly slow at vertex shaders).

So... If we were targeting fast GPUs (not Intel ones), and/or we had high-poly models (e.g. this was an FPS game), it'd make sense to do as much work as possible in the vertex shader. Since we should optimise for slow GPUs, and we have low-poly models, it seems like sticking with CPU skinning (and optimising it a bit more) is best. So that's good to know.

I was wondering if that might be an issue. I'll see if I can get you a .dae file that will properly weight the vertexes. That might change performance though wouldn't it?

No need to bother with that, I think - it shouldn't affect the conclusions much, and it'll probably break the method in post #2 entirely (there's a hardware limit of ~250 blend matrices which would likely be exceeded).

I'd do these tests with something much smaller than the 6656 mesh. There's no way the units will even end up close to that number of triangles/vertices.

Yeah, I didn't intend it to be a realistic mesh - it's just to get a feeling for how performance varies with mesh complexity, and it's easier to see that when testing extremes. It's not necessary to test ones in between since you can just interpolate between those extremes to get a rough but adequate view: if ~200 units with 6K-tri meshes are alright, and ~1000 units with 0.4K-tri meshes are alright, then 1K-tri meshes should be perfectly fine for several hundred units on screen at once.
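One way to read "interpolate between those extremes" is a straight line between the two data points (390 tris at ~1000 units, 6656 tris at ~200 units). That choice of a linear model is an assumption made here for illustration; other curves through the same two points would give different numbers, and the result is only as rough as the inputs.

```python
def interpolate_units(tris, lo=(390, 1000), hi=(6656, 200)):
    """Linearly interpolate an acceptable unit count for a given mesh size.

    lo and hi are (triangles per mesh, acceptable units) pairs taken from
    the two extremes discussed in the post.
    """
    (t0, u0), (t1, u1) = lo, hi
    fraction = (tris - t0) / (t1 - t0)
    return u0 + fraction * (u1 - u0)


estimate = interpolate_units(1000)
print(round(estimate))  # several hundred units for a 1K-tri mesh
```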

So, please feel free to use a thousand triangles on unit meshes if you want, but not a lot more than that.

**Wijitmaker** · April 12, 2012

The way I'm leaning is to just edit the existing meshes to give us those additional body types we've been talking about, plus some added geometry here and there (sleeves, a better looking foot, the bottom of tunics, et cetera), giving us a body mesh of maybe 500 triangles, give or take.

Would you mind making me a task with the specifics of what you want? I'd like to assist.

**Mythos_Ruler** · April 12, 2012

Would you mind making me a task with the specifics of what you want? I'd like to assist.

Tonight I'm going to get some writing done. If I hit my goal of 10 pages (I'm at 3 pages now), then I'll spend all of tomorrow doing 0 A.D. planning and stuff. I'll prioritize thinking about and listing changes and features for new dude and dudette meshes.

**feneur** · April 12, 2012

Tonight I'm going to get some writing done. If I hit my goal of 10 pages (I'm at 3 pages now), then I'll spend all of tomorrow doing 0 A.D. planning and stuff. I'll prioritize thinking about and listing changes and features for new dude and dudette meshes.

Splendid. And good luck with your writing (y)

**Shield Bearer** · April 12, 2012

Great findings, Philip! It sure makes me a bit happier. I've committed a test dude mesh to the SVN; it's about 700 tris, so I don't think we'll have to worry about the mesh going too much over 1k, especially since we'll be editing the old mesh instead.

**Ykkrosh** · April 12, 2012

I committed the GPU skinning code now - you probably shouldn't test it, but if you really want to then you need to set preferglsl=true and gpuskinning=true in the config file, and need a device that supports OpenGL 3.0 (or GL_EXT_gpu_shader4). (Other combinations are likely to crash, which is intentional.)

**Wijitmaker** · April 13, 2012

On a side note (not to derail this topic): are there plans to review the hardware that the player is working with and simply disallow configuration options that would cause the game to crash on their system? Along with that, are there plans to recommend system settings for users, using all that data you're collecting?

**Ykkrosh** · April 13, 2012

In almost all cases we do detect hardware capabilities and ignore options if they're not supported. The only exception is with the new GLSL stuff, since it's a bit trickier (different shaders require different GLSL versions and can be mixed with non-GLSL shaders) and is all disabled by default so it doesn't matter yet - that would need to be cleaned up before being properly supported.

We already select default graphics settings based on hardware to some extent (here), for performance or to avoid bugs, but that's pretty crude since we don't have very relevant data. I think it'd probably be useful to add a benchmarking mode which can compare various settings (fixed-function vs ARB shaders vs GLSL shaders, shadows, dynamic reflections, resolution, antialiasing, various minor implementation details, etc) and report performance, so we can see if there are unexpected performance problems with various devices/drivers/OSes/etc and so we can pick sensible defaults.

**fabio** · April 15, 2012

Would it be possible to use two (or more) different meshes: the more detailed one when there are few units and/or when zoomed in, and the less detailed one when zoomed out (you won't be able to see the details at a distance anyway) and/or when there are many units (so as not to decrease performance)?

**plumo** · April 15, 2012

Would it be possible to use two (or more) different meshes: the more detailed one when there are few units and/or when zoomed in, and the less detailed one when zoomed out (you won't be able to see the details at a distance anyway) and/or when there are many units (so as not to decrease performance)?

Or the low quality one for old rigs, and the higher quality one for newer computers?

If you use 2 models, won't you notice the change while zooming out or in?

**Ykkrosh** · April 15, 2012

Would it be possible to use two (or more) different meshes: the more detailed one when there are few units and/or when zoomed in, and the less detailed one when zoomed out (you won't be able to see the details at a distance anyway) and/or when there are many units (so as not to decrease performance)?

That's possible, though it'd probably be more useful to swap to sprites which should give more significant performance savings when zoomed out (since the overriding cost with a large number of units is probably the number of geometry batches, not the number of polygons). (See e.g. Rome Total War, which does lots of LOD with model resolutions and sprites since it supports extreme zooming). The usual problem with LOD is that you can see the units flip between the different meshes/sprites/etc, which is kind of ugly, and it's more work (low-res models need artists, sprites need coding). We could also do other LOD-related stuff like reduce the temporal precision of animations, e.g. run all animations at 10fps (vs the current implementation which is as precise as your framerate) and then if two units are at the same frame we only need to compute the skinning once and share it between them, which might help.
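The animation-sharing idea at the end of that paragraph could look roughly like the sketch below: animation time is snapped to a fixed rate (10fps here), and units that land on the same frame of the same animation reuse one skinning result. Everything here (class name, cache key, the stand-in skinning function) is hypothetical, and it assumes the units share the same mesh and animation.

```python
class SkinningCache:
    """Shares skinning results between units on the same quantised frame."""

    def __init__(self, anim_fps=10):
        self.anim_fps = anim_fps
        self.cache = {}
        self.computed = 0  # how many times skinning actually ran

    def frame_index(self, anim_time):
        """Snap a continuous animation time (seconds) to a frame number."""
        return int(anim_time * self.anim_fps)

    def get(self, anim_name, anim_time, compute):
        """Return the skinned result, computing it only once per frame."""
        frame = self.frame_index(anim_time)
        key = (anim_name, frame)
        if key not in self.cache:
            self.cache[key] = compute(anim_name, frame / self.anim_fps)
            self.computed += 1
        return self.cache[key]


def fake_skin(anim_name, t):
    """Stand-in for the real (expensive) skinning computation."""
    return (anim_name, t)


cache = SkinningCache(anim_fps=10)
unit_times = [0.01, 0.05, 0.09, 0.12, 0.18, 0.31]  # six units, made-up times
results = [cache.get("walk", t, fake_skin) for t in unit_times]
print(cache.computed, "skinning passes for", len(unit_times), "units")
```

A real version would also need to clear the cache each rendered frame and key on the mesh as well, and the saving depends entirely on how many units happen to coincide.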

In general, I think it's sensible to try optimising the simple high-quality approach first (which is what I was trying (failing) to do here), and if that turns out to be insufficient then add performance hacks onto it later. I don't know whether the current approach is insufficient in practice (we have too many other performance problems that need to be fixed before it might become a bottleneck), and if so I don't know when will be a good time for "later" - I suppose it's not yet, so I'll try to avoid spending more time experimenting with this myself, but hopefully it wouldn't be too far away.
