Jump to content

Reducing Boost Dependency


Recommended Posts

We've been integrating two codebases at work and the Boost dependency in our lowlevel code is unfortunately a no-go.

I've gone through some gyrations to allow 0ad to continue using Boost while completely removing it from the other codebase.

The biggest changes are replacing ::shared_ptr with a SHARED_PTR macro (as with STL_HASH_MAP, which also requires a different namespace depending on compiler), and replacing almost all fs::wpath with std::wstring.

This snowballed into several hours' work and rather extensive changes, so I would like to hear if anyone has objections and possibly change things to ensure everyone is OK with this change.

The proposed 223 KB patch is attached; I'd appreciate if someone could test on Linux.

Some talking points:

1) exists has been replaced with DirectoryExists / FileExists. Since stat() must only be called for files, I think this is a worthwhile change to make in any case.

2) other fs:: functions have been replaced with pretty much the same functions in path_util.h. fs::wpath.leaf() and branch_path become Filename() and Path(). These are rather generic names, so I wonder if there would be value in putting them in a namespace. One route would be to call it fs:: as before and try for mostly drop-in compatibility with Boost filesystem - but I don't really like that idea because "mostly" leads to false expectations. We could also call it Path.

3) the convenient fs::wpath operator/ becomes Join(), which is available for 2 and 3 arguments. Amusingly enough, the new code is often shorter because less casts and .string() fluff arise.

4) VfsPath tried to increase safety by making it harder to confuse different types of path (real POSIX vs. VFS). However, that is very, very weak due to the implicit conversion from std::wstring, and because we're already calling fs::wpath::string() in many places. I don't see any harm in just converting to std::wstring, but it's probably good to keep the VfsPath as a typedef for added documentation.

Any other comments / suggestions?

If not, I'd like to commit in a few days at the latest, because it's rather difficult to keep 2 codebases in sync.

reduceBoost.zip

Link to comment
Share on other sites

The Boost stuff is actually mostly header-only, so that is basically what is already happening.

However, the problem is that (say) sys_get_executable_name has fs::wpath as part of its signature.

That would mean it can't be used as part of the codebase here, since Boost (or even a subset, which is basically the same thing, since using one header is tantamount to including the whole kitchen sink and pot of soup) is "verboten".

To fix this, I've changed that function (and all others) to return a std::wstring, and if 0ad needs a fs::wpath, it's easy to create one directly from the string. Those "easy" changes do add up, though, and I'd like to be sure it's not too much of a burden for the rest of the code from y'all's perspective.

Link to comment
Share on other sites

Getting rid of Boost Filesystem means we won't have to worry about the V2 API getting dropped in future releases, which is a nice side-effect of this.

Boost doesn't use the namespace "fs", so I don't think it'd be confusing to have something silently Boost-like (not advertised as a drop-in replacement, but preferring compatibility when possible) reusing that namespace. Anybody new to the code would have no expectation that it's like Boost, and anybody old will be aware of this change.

I think a namespace is important because otherwise the function names are misleading, e.g. I'd probably expect Join("a", "b") to be equivalent to "a"+"b" and wouldn't expect it to be path-related.

SHARED_PTR looks very ugly. Can't it just be shared_ptr? (and include using 'using' or 'typedef' instead of macro)

Since you're assuming shared_ptr exists either in std:: or std::tr1:: (if Boost is disabled), could you also assume (for your set of compilers) that unordered_map exists there too, instead of using STL_HASH_MAP which has much less predictable behaviour? (Regardless of that, outside of lib we can rely on Boost and should probably get rid of STL_HASH_MAP entirely and use unordered_map, for modernity and consistency. That's probably separate from this patch, though.)

I don't remember ever accidentally mixing VFS and non-VFS paths, such that the type system would help. But I do find VfsPath very useful as documentation, since it stops me getting to the stage of accidentally mixing paths, so it sounds good to keep it.

Attached a patch that fixes a few issues on Linux, but some errors remain, of the forms:

error: ‘path_from_wpath’ was not declared in this scope
error: no matching function for call to ‘std::basic_ofstream<char, std::char_traits<char> >::basic_ofstream(const wchar_t*, std::_Ios_Openmode)’

This brings up the old encoding problem - paths on Windows are natively 16-bit strings, paths on Linux are natively 8-bit, and conversion can be lossy. With fs::wpath, Boost decoded 8-bit paths with the current locale (which caused this since people have misconfigured locales, but it mostly worked). We used external_file_string for std::ofstream, which returns 8-bit or 16-bit depending on platform.

If we want to fix things properly, I think instead of replacing the non-VFS fs::wpath with std::wstring we should replace it with a platform-dependent path type which uses std::string on Linux and std::wstring on Windows, so no lossy conversions are needed, with suitable handling of string literals etc (which we can probably assume are ASCII). (VfsPath would remain std::wstring and we could require the VFS directory contents are encoded as UTF-8 on Linux regardless of locale, probably). That's basically what Boost Filesystem V3 does.

If we want to fix things quickly, we can assume most people on Linux have UTF-8-compatible paths (which we mostly assume already), so path_from_wpath just does UTF-8 encoding, and for std::ofstream we can add a native_path_from_wpath that encodes (on Linux) or returns the std::wstring (on Windows), and it probably won't break for many people.

In the latter case, maybe we should at least use some typedef of std::wstring for non-VFS paths (NativePath? PhysicalPath?), so we can relatively easily change it to a native-sized path in the future if the UTF-8 assumption is sufficiently invalid.

reduceBoost2.patch

Link to comment
Share on other sites

Excellent points, as always :)

I think a namespace is important because otherwise the function names are misleading, e.g. I'd probably expect Join("a", "b") to be equivalent to "a"+"b" and wouldn't expect it to be path-related.

Agreed.

Boost doesn't use the namespace "fs", so I don't think it'd be confusing to have something silently Boost-like (not advertised as a drop-in replacement, but preferring compatibility when possible) reusing that namespace. Anybody new to the code would have no expectation that it's like Boost, and anybody old will be aware of this change.

OK, but these functions are in path_util and only cover that functionality; fs is kind of exaggerating there.

How about Path instead? or PathUtil even, though that's getting rather clunky.

SHARED_PTR looks very ugly. Can't it just be shared_ptr? (and include using 'using' or 'typedef' instead of macro)

Yeah, that's one of the first things I did. Turns out only one kind of shared_ptr is needed, the concerns of having to mix them are mooted, so we can revert to the previous just-put-it-in-main-namespace approach.

unordered_map exists there too, instead of using STL_HASH_MAP which has much less predictable behaviour? (Regardless of that, outside of lib we can rely on Boost and should probably get rid of STL_HASH_MAP entirely and use unordered_map, for modernity and consistency. That's probably separate from this patch, though.)

hm, I expect so, but would rather get this working ASAP, so it'd be good to address that later :)

I don't remember ever accidentally mixing VFS and non-VFS paths, such that the type system would help. But I do find VfsPath very useful as documentation, since it stops me getting to the stage of accidentally mixing paths, so it sounds good to keep it.

Agreed.

This brings up the old encoding problem - paths on Windows are natively 16-bit strings, paths on Linux are natively 8-bit, and conversion can be lossy.

Oh no, that one again :)

I think your dream is to just "get" whatever the OS gives us in the way of paths and treat them as a black box. However, we are going to want to append paths, etc. - so we can't just pretend to accept any encoding. Given that, we may as well prescribe one internal representation and ensure we can do all the processing we want. That also seems less prone to "oops I can only test on one OS" breakage

with suitable handling of string literals etc (which we can probably assume are ASCII)

Yes, but something as simple as möd nämes on the command line, and we have the encoding problem.

Do you have concrete examples of lossy conversion that are actually a problem? I'd be surprised to hear people wanting/expecting non-UTF-16 characters in paths.

Those edge cases have to be weighed against the inconvenience of always having to go through conversion functions. My first patch actually simplifies the code in a lot of cases because lots of conversions/casts go away.

If we want to fix things quickly, we can assume most people on Linux have UTF-8-compatible paths (which we mostly assume already), so path_from_wpath just does UTF-8 encoding, and for std::ofstream we can add a native_path_from_wpath that encodes (on Linux) or returns the std::wstring (on Windows), and it probably won't break for many people.

I do like that option, but have a serious backlog and might not get around to it today :S

Link to comment
Share on other sites

How about Path instead? or PathUtil even, though that's getting rather clunky.
Either sounds fine to me.
I think your dream is to just "get" whatever the OS gives us in the way of paths and treat them as a black box. However, we are going to want to append paths, etc. - so we can't just pretend to accept any encoding.
That is what I want (in an ideal world where we have time to not just kludge it :)). If we want to append/split/etc paths from the OS, we don't have to know the encoding, since we still know e.g. the code unit '/' is the directory separator regardless of encoding and regardless of code unit size (given that Linux will only use ASCII-compatible encodings). If we want to append paths which come partially from our own code (mod names, saved game names, etc) it's best to just require our path fragments to be ASCII (we won't show the filenames to users - for i18n we need to store names in multiple languages in some metadata file anyway, and also people might want names with '/' and ':' etc so we couldn't represent the full names in the filename anyway), which means we can append byte strings and not care about the encoding.
Do you have concrete examples of lossy conversion that are actually a problem? I'd be surprised to hear people wanting/expecting non-UTF-16 characters in paths.
When someone is running the game from /home/josé (last component is 4 bytes) with locale es_ES.iso-8859-1, decoding and encoding as UTF-8 would break (either with a decoding error, or with a 3 or 5 byte re-encoded output). Boost Filesystem V2 at least decoded/encoded using the current locale, so it would work as long as the locale was configured correctly; V3 on POSIX uses byte strings instead so it should work even in misconfigured environments. (OS X often has misconfigured environments, it seems.)
Those edge cases have to be weighed against the inconvenience of always having to go through conversion functions.
We need some platform-dependent conversion functions anyway, because std::ofstream wants char* on Linux. Otherwise it seems the most common situation is Join(NativePath, std::wstring) which can assume the second argument is ASCIIish and convert to native automatically, and LOGMESSAGE(L"Loaded %ls", NativePath.string().c_str()) which can convert to wstring by trying UTF-8 and falling back to ISO-8859-1 (it doesn't matter if that's lossy since it's just a log message).

(If there's anything I can do to help with this then I'd be willing - I don't mean to sound like I expect you to do everything here :). If there was some NativePath class/typedef that was implemented as std::wstring using UTF-8 everywhere, I think I should be able to change that to std::string on Linux/OS X and fix up all the conversions and operations and whatnot, if that would help.)

Link to comment
Share on other sites

Either sounds fine to me.

Alright, I'll go with Path::Join/Path/Extension, and FileExists/FileSize can stay in the root namespace.

I think your dream is to just "get" whatever the OS gives us in the way of paths and treat them as a black box. However, we are going to want to append paths, etc. - so we can't just pretend to accept any encoding.

That is what I want (in an ideal world where we have time to not just kludge it ). If we want to append/split/etc paths from the OS, we don't have to know the encoding, since we still know e.g. the code unit '/' is the directory separator regardless of encoding and regardless of code unit size (given that Linux will only use ASCII-compatible encodings).

OK, this is the part I don't get. Granted, path components are separated by '/' - but do the filesystems successfully handle DIFFERENT encodings of each component? If we just grab them from any source and pass them on, it could be 8859-1, UTF-8 or any other CP - and a mix of them doesn't sound healthy.

If we want to append paths which come partially from our own code (mod names, saved game names, etc) it's best to just require our path fragments to be ASCII

That's the QString approach and it is PART of the solution, yeah. But again, there are various other sources of pathnames: savegame paths from the UI (this alone is enough to torpedo any attempt at being unaware of the OS/FS encoding), data files, command line and maybe others. Each of these must have a well-defined conversion to the internal wchar_t representation.

Those edge cases have to be weighed against the inconvenience of always having to go through conversion functions.

We need some platform-dependent conversion functions anyway, because std::ofstream wants char* on Linux. Otherwise it seems the most common situation is Join(NativePath, std::wstring) which can assume the second argument is ASCIIish and convert to native automatically, and LOGMESSAGE(L"Loaded %ls", NativePath.string().c_str()) which can convert to wstring by trying UTF-8 and falling back to ISO-8859-1 (it doesn't matter if that's lossy since it's just a log message).

Let me elaborate on that. It is clear we need conversion functions at the boundaries of our code (e.g. reading command line - that's exactly what sys_WideFromArgv is for). However, I propose to eliminate all other conversions INSIDE the game, which is the vast majority. All game code should be able to rely upon the fact that we have correctly encoded wchar_t strings that can be parsed/displayed/manipulated.

When someone is running the game from /home/josé (last component is 4 bytes) with locale es_ES.iso-8859-1, decoding and encoding as UTF-8 would break (either with a decoding error, or with a 3 or 5 byte re-encoded output). Boost Filesystem V2 at least decoded/encoded using the current locale, so it would work as long as the locale was configured correctly; V3 on POSIX uses byte strings instead so it should work even in misconfigured environments. (OS X often has misconfigured environments, it seems.)

OK, I propose to extend or add similar functions to sys_WideFromArgv to correctly handle that case.

If UTF8 decoding fails, try the current locale. If that also "fails" (i.e. we decode into a pathname that doesn't exist after passing back through our wchar_t -> native conversion), then require the user to specify the right locale on the command line.

(If there's anything I can do to help with this then I'd be willing - I don't mean to sound like I expect you to do everything here . If there was some NativePath class/typedef that was implemented as std::wstring using UTF-8 everywhere, I think I should be able to change that to std::string on Linux/OS X and fix up all the conversions and operations and whatnot, if that would help.)

Great :) It is appreciated - I still can't test on Linux and will probably end up causing a bit of breakage, as already noted :/

Link to comment
Share on other sites

OK, this is the part I don't get. Granted, path components are separated by '/' - but do the filesystems successfully handle DIFFERENT encodings of each component? If we just grab them from any source and pass them on, it could be 8859-1, UTF-8 or any other CP - and a mix of them doesn't sound healthy.

Native filesystems don't handle any encodings, they just handle strings of '/'-separated bytes. (Similarly NTFS doesn't handle encodings, it just handles strings of 16-bit integers - it's highly conventional to interpret them as UTF-16 or perhaps UCS-2 rather than as anything else, but NTFS itself doesn't care). A path could be "\xEF\xBF\xBF/\xC3" and nothing in the OS would care.

The only time encoding ever matters is user input and output, since it's nice to convert between bytes and a human-readable string, for which the convention is to use the current locale settings. If your filesystem has some weird messed-up combination of encodings then applications should still work correctly, they'll just look weird when they try displaying paths to you.

OK, I propose to extend or add similar functions to sys_WideFromArgv to correctly handle that case.

If UTF8 decoding fails, try the current locale. If that also "fails" (i.e. we decode into a pathname that doesn't exist after passing back through our wchar_t -> native conversion), then require the user to specify the right locale on the command line.

If we implement locale support, I don't think there's any need to try UTF-8 first - the locale is more authoritative so we should just use that.

Link to comment
Share on other sites

QUOTE
Native filesystems don't handle any encodings, they just handle strings of '/'-separated bytes.



OK. So in your example, we are told of /home/josé and aren't supposed to care how it's encoded. José wants to save a game and chooses as the filename "josé". Now unless we understand how to translate the unicode string he told us into the actual FS encoding, the filename will turn out to be some gobbledygook from the perspective of the OS. Even if it turns out to be legal, he might not be able to copy that file.

QUOTE
If we implement locale support, I don't think there's any need to try UTF-8 first - the locale is more authoritative so we should just use that.



Since the locale might turn out to be UTF8? But you were saying the locale might be set incorrectly, so I do think there is value in providing the UTF8 fallback (under the assumption that some Linux FS are actually UTF8 encoded despite the locale being set to es-ES or whatever).
Link to comment
Share on other sites

Per today's meeting, we're going to introduce NativePath. Any code that receives filenames from external sources must ensure they don't contain illegal /: nor non-ASCII characters. NativePath is just a std::wstring (on Linux, each byte is stored in a wchar_t). This allows easy processing within our code and also makes for correct display of the path (even without knowing about the actual FS locale) in the common case of ISO 8859-1.

The corresponding patch is attached. Works on Windows; would appreciate a careful look (you never know) and testing on Linux.

reduceBoost2.patch

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

 Share

×
×
  • Create New...