Further I/O Simplifications



Certain circumstances motivate further changes to our I/O foundation.

i) the development of a new compression algorithm at work that runs circles around LZ / Huffman / arithmetic coders has greatly diminished my interest in using Zip archives for compression (they remain useful for reducing small-file overhead and seeks, though)

ii) measurements made while writing the thesis chapter on image I/O indicate our code currently lags behind HD write benchmarks to an unreasonable degree. In particular, there is a decent bit of overhead due to splitting I/Os into blocks and adding caching functionality. The latter is most useful for Zip archives, and now provides little gain vs. the OS file cache because aio is disabled on Linux, and we're not doing any decompression in parallel with IO anymore.

iii) Sector sizes are moving towards 4K and larger, and the optimal block size/queue depth parameters have definitely changed.

What I have done and propose to integrate is:

1) pull out all the stops and completely rewrite waio.cpp, including several crazy (in the sense of "who would actually bother to do this") but legal and documented optimizations. This achieves 3..10% higher throughput than reputable benchmarks such as ATTO and AS SSD on both hard disks and solid-state disks.

2) also completely rewrite io.cpp, the OS-independent wrapper on top of POSIX aio or waio that splits I/Os into chunks. Remove support for caching (aio doesn't need it, and sync I/O should rather use the OS cache), allow variable sector size/alignment/block size/queue depth, and add support for template-based callbacks/hooks when I/Os are issued (required for generating data on the fly) and when they complete (for overlapped decompression - used at work).

I/Os are described via io::Operation objects passed to io::Run. Unless non-default io::Parameters are prescribed, the I/O runs synchronously. (A rough sketch of the block-splitting/completion-hook mechanism follows after this list.)

3) update the remaining code to match the new interface; also change the I/O buffers from shared_ptr (which seems to trigger ANOTHER compiler bug in ICC 12.0) to a nifty new UniqueRange smart pointer (sketched below). This is similar to C++0x unique_ptr with the addition of a size() member; the implementation provides simplistic but useful emulation of rvalue references in C++03.
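Roughly speaking, the block splitting and completion hook from 2) boil down to something like the following (illustrative sketch only - the names and signature are made up, and it uses plain synchronous pread rather than the real aio path):

```cpp
// Hypothetical sketch of splitting a read into blocks and invoking a
// completion hook per block (where overlapped decompression or on-the-fly
// data generation would plug in). Not the actual io.cpp interface.
#include <sys/types.h>
#include <unistd.h>     // pread
#include <algorithm>    // std::min
#include <cstddef>

template<class CompletedHook>
ssize_t BlockedRead(int fd, void* buffer, size_t size, off_t offset,
                    size_t blockSize, CompletedHook completed)
{
    char* out = static_cast<char*>(buffer);
    size_t total = 0;
    while (total < size)
    {
        const size_t chunk = std::min(blockSize, size - total);
        const ssize_t bytes = pread(fd, out + total, chunk, offset + total);
        if (bytes <= 0)
            return bytes;   // error or EOF
        completed(out + total, (size_t)bytes);  // e.g. decompress this block
        total += (size_t)bytes;
    }
    return (ssize_t)total;
}
```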
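And for the curious, UniqueRange boils down to roughly this (heavily simplified sketch; the real implementation emulates rvalue references more carefully and manages the buffer allocation/alignment itself):

```cpp
// Simplified idea of UniqueRange: sole ownership of a buffer plus its size,
// with auto_ptr-style ownership transfer standing in for C++0x rvalue
// references under C++03. Illustrative only, not the actual class.
#include <cstddef>
#include <cstdlib>

class UniqueRange
{
public:
    UniqueRange() : ptr_(0), size_(0) {}
    UniqueRange(void* ptr, size_t size) : ptr_(ptr), size_(size) {}  // assumes a malloc'd buffer
    ~UniqueRange() { free(ptr_); }

    // "move": ownership is transferred on copy/assignment (as with auto_ptr)
    UniqueRange(UniqueRange& rhs) : ptr_(rhs.ptr_), size_(rhs.size_)
    {
        rhs.ptr_ = 0; rhs.size_ = 0;
    }
    UniqueRange& operator=(UniqueRange& rhs)
    {
        if (this != &rhs)
        {
            free(ptr_);
            ptr_ = rhs.ptr_; size_ = rhs.size_;
            rhs.ptr_ = 0; rhs.size_ = 0;
        }
        return *this;
    }

    void* get() const { return ptr_; }
    size_t size() const { return size_; }   // the addition vs. unique_ptr

private:
    void* ptr_;
    size_t size_;
};
```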

These changes should have the benefit of solving (by avoiding) the what-cache-size-should-we-use problem, probably increasing performance (synchronous I/Os can use FastIO driver entry points and are faster when the main benefit of aio, overlapping computation with I/O, isn't needed), and stripping out some rusty, no-longer-used code.

Unfortunately, new software usually includes bugs, though I will have tested the I/O logic at work beforehand.

The interesting question is whether anyone foresees any problems with this approach, or disadvantages for 0ad or maybe even other codebases.

I welcome comments here and will hold off from committing until back from vacation in 2 weeks (to leave time for discussion and be able to investigate any issues that might arise after integration).


As far as I'm aware, from the game's perspective raw throughput is pretty much irrelevant: either we're loading from a cold disk cache and get killed by seeks, or else we're easily fast enough and don't need to worry, since our data set is only tens of megabytes.

I would (perhaps naively) imagine there might be optimisations that could reduce seeks (like maybe the game can request a hundred files at once, and they get read from disk in an optimal order rather than in the arbitrary order the game would otherwise load them in, or something) which might help, but I don't know how feasible that is.

Something else that would likely help is enough thread-safety to read files from background threads. In particular, the AI scripts sometimes want to synchronously read an entity template XML file, but they (will eventually) run in a thread, so they can't use the VFS directly. They also can't proxy the requests through the main thread, since the main thread might be spending the next 100ms rendering and would add terrible latency. So currently the main thread loads all the template files in advance, which hurts startup time quite a bit. If it was possible for the AI thread to load files on demand while the main thread is busy rendering, that would probably be very nice. (It'd be okay to only allow a single thread in the VFS code at once and block the others for this case - they don't need concurrent VFS usage.)

I have no idea how much effort it'd take to add these features, but I expect they're what would help the game most. (Am I mistaken in thinking so?)

About caches: I don't think the game has any particular need of file caching - most stuff is cached at a higher level (as textures, meshes, etc) and it seems a waste of memory to have it in multiple caches. So removing caching sounds fine to me.


As far as I'm aware, from the game's perspective raw throughput is pretty much irrelevant

Agreed. My main concern here is the thesis and stuff at work that very much cares about throughput.

However, I'd like to avoid forking the code.

I would (perhaps naively) imagine there might be optimisations that could reduce seeks (like maybe the game can request a hundred files at once, and they get read from disk in an optimal order rather than in the arbitrary order the game would otherwise load them in

Nah, POSIX and Windows have scatter/gather IO, but that's for a single file only (think DB). aio would enable such an optimization, but we don't know the maximum number of pending IOs, so that's not workable.
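For reference, this is what single-file scatter/gather looks like with POSIX readv (minimal example; the filename and buffer sizes are just placeholders):

```cpp
// One readv call fills several buffers, but always from the same file
// descriptor - it cannot coalesce requests that span many files.
#include <sys/uio.h>   // readv, struct iovec
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int fd = open("some.dat", O_RDONLY);   // placeholder filename
    if (fd < 0)
        return 1;

    char header[64], body[4096];
    struct iovec iov[2];
    iov[0].iov_base = header; iov[0].iov_len = sizeof(header);
    iov[1].iov_base = body;   iov[1].iov_len = sizeof(body);

    const ssize_t bytes = readv(fd, iov, 2);  // single syscall, single file
    (void)bytes;
    close(fd);
    return 0;
}
```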

Something else that would likely help is enough thread-safety to read files from background threads

Yep, still on the wishlist. This is actually moving closer to done - our own (block) caching is one of the bigger obstacles. After removing it as planned here, I think we'd only need scoped locks in each of the vfs.cpp member functions. That raises the question of which flavor of locks to use today: OpenMP doesn't allow scoped-lock syntax, and the SDL stuff is slow (not just a CRITICAL_SECTION), which leaves pthread_mutex (plus a mini scoped_lock wrapper).
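Such a wrapper is tiny - something along these lines (illustrative sketch, not actual code; the vfsMutex and VFS::LoadFile names in the usage comment are made up):

```cpp
// Mini scoped lock over pthread_mutex: locks in the constructor, unlocks in
// the destructor, so every early return from a VFS member function releases
// the mutex automatically.
#include <pthread.h>

class ScopedLock
{
public:
    explicit ScopedLock(pthread_mutex_t& mutex) : mutex_(mutex)
    {
        pthread_mutex_lock(&mutex_);
    }
    ~ScopedLock()
    {
        pthread_mutex_unlock(&mutex_);
    }

private:
    ScopedLock(const ScopedLock&);            // non-copyable
    ScopedLock& operator=(const ScopedLock&);

    pthread_mutex_t& mutex_;
};

// hypothetical usage inside each vfs.cpp member function:
//   static pthread_mutex_t vfsMutex = PTHREAD_MUTEX_INITIALIZER;
//   Status VFS::LoadFile(...) { ScopedLock lock(vfsMutex); ... }
```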

waio.cpp is now thread-safe; so is the new io.h, since it doesn't use global data.

I have no idea how much effort it'd take to add these features, but I expect they're what would help the game most. (Am I mistaken in thinking so?)

Sounds right to me. Adding those locks would be a matter of minutes; finding any remaining bugs, maybe longer ;) But it'll have to wait until the other major changes are integrated in ~2 weeks (as will the ENSURE/ASSERT stuff).

About caches: I don't think the game has any particular need of file caching - most stuff is cached at a higher level (as textures, meshes, etc) and it seems a waste of memory to have it in multiple caches. So removing caching sounds fine to me.

Agreed in principle, but we're currently only talking about the block cache. The file cache currently does satisfy some dozens of requests, so we should ensure those are cached somewhere else before removing it.
