Jump to content

[Discussion] Spidermonkey upgrade


Recommended Posts

I've tested replacing delete and it seems to make a big difference.

Unfortunately I'm not sure if I could have changed the simulation behaviour and replacing delete requires some ugly hacks which isn't good either.

Apparently "for(var i in entities_)" also steps into the loop for entities that are set to undefined.

post-7202-0-30387200-1372980659_thumb.gi

Edit:

About "OSR script has argsobj". You can see which script it is, by providing IONFLAGS=scripts,aborts. The line before the abort is the script where it is failing on. I would definitely suggest to not use "arguments" as much as possible. It is known to decrease performance on all engines a lot...

We aren't using "arguments" anywhere else.

This is one code where it points to:


Resources.prototype.add = function(that) {
for ( var tKey in this.types) {
var t = this.types[tKey];
this[t] += that[t];
}
this.population += that.population;
};

Link to comment
Share on other sites

Apparently "for(var i in entities_)" also steps into the loop for entities that are set to undefined.

I know that's unfortunate, but it will create much better type information and be much faster.

There are different solutions for that problem though,

1) if entities_ is small, you can recreate that object when wanting to delete a property.

2) you can have an array with all "available keys" and recreate the array when you remove a poperty.

3) If the value in entities[x] is never undefined, I would just check it in the loop and just "continue;"

This is one code where it points to:

 Resources.prototype.add = function(that) {
for ( var tKey in this.types) {
var t = this.types[tKey];
this[t] += that[t];
}
this.population += that.population;
};

That seems very strange! We don't need an arguments object there...

Link to comment
Share on other sites

In your latest benchmark, does the 1.8.5 benchmark measure delete or the replaced delete code? It would be nice to see any optimizations for v24 applied to 1.8.5 as well...

The 1.8.5 benchmark is still the original one. I don't recreate it to safe time. I could do a test with all the improvements later but I think it isn't worth to test every single improvement in both v24 and v1.8.5.

3) If the value in entities[x] is never undefined, I would just check it in the loop and just "continue;"

I set it to false and checked for that. I have to add a lot of these checks because there are a lot of loops. It improves performance quite a bit as we can seen (assuming I didn't add a bug that causes significant simulation behaviour changes).

Link to comment
Share on other sites

A test for the setrval issue.


[Scripts] Analyzing script tests/setrval.js:6 (0x7fb6436511f0) (usecount=1004)
[Abort] Unsupported opcode: setrval (line 19)
[Abort] aborted @ tests/setrval.js:19
[Abort] Builder failed to build.
[Abort] Disabling Ion mode 0 compilation of script tests/setrval.js:6
[Scripts] Analyzing script tests/setrval.js:1 (0x7fb643651128) (usecount=1100)

It can't be simply that return values are not supported. Testing a small function with return values seems to work.

EDIT: I've found a way to fix the "Abort" message. Unfortunately it makes the script about 10% slower, so it probably causes some other issues.

The strange thing is that it also prints the abort message if I pass an object as reference and use that as return value (as in the modified version) but use a return statement without value instead of the hack with the quit variable.

EDIT2: Looks like a similar issue as the last one...


./js --ion-parallel-compile=on tests/setrval_modified.js 2>&1 | grep "Took bailout" | wc -l
1655

1655 Baseline Bailouts in 2000 executions of the function...

setrval.js.txt

setrval_modified.js.txt

Link to comment
Share on other sites

So in most cases we will use JSOP_RETURN to indicate a return. But for some returns we need to do more e.g. close open try blocks. In that case JSOP_SETRVAL is used to set the value and the extra code is executed followed by JSOP_RETRVAL to return the value. So that is happening here. Apparently we don't compile those bytecodes yet. (Happens most often with try blocks what isn't supported by IM). So the problem here is the "return" in the for-in loop.

Opened https://bugzilla.mozilla.org/show_bug.cgi?id=890722 to fix this.

1) I think the patch to support it will be fairly small. So if you want I can backport the patch to FF24 and post the resulting patch here... Let me know if you are interested in that

2) To fix it in JS code, is to hoist the return out of the for each loop.


var refpath = node.split(".");
var refd = spec;
for each (var p in refpath)
{
refd = refd[p];
if (!refd)
break;
}
if (!refd)
{
error("FSM node "+path.join(".")+" referred to non-defined node "+node);
return {};
}
node = refd;

I didn't test it though...

Edited by h4writer
  • Like 1
Link to comment
Share on other sites

Thanks, that's indeed the fastest solution and it also avoids the baseline bailouts!

Have you seen my "Edit2" in the last post? It doesn't matter anymore for this case but it could be a bug.

Performance (execution time for 2000 iterations in microseconds):


original:
1715559
modified_yves:
1976366
modified_h4writer:
1548373
1551303

Link to comment
Share on other sites

H4writer offered a "0 A.D. day", so I'm uploading my current heavily WIP patch here and provide some instructions how to use it.

Brave people are encouraged to test it (but you won't be able to really play a game with it yet). Also bug-reports and feedback is only needed if you find some important performance issues or fundamental design issues because there are too many known issues at the moment.

Build instructions (also check http://trac.wildfire...ildInstructions):

  1. Check out revision 13408
  2. Apply the patch (I always had some rejected files... it's a bit cumbersome)
  3. Fix some absolute paths I used (sorry!)
  4. "ln -s" the aurora-source directory to 0ad/libraries/source/spidermonkey/mozjs24
  5. cd 0ad/libraries/source/spidermonkey/mozjs24/js/src && autoconf2.13
  6. run "./update-workspaces.sh --with-system-enet --without-audio --with-system-nvtt --disable-atlas -j5" from 0ad/build/workspaces/
  7. cd gcc && make -j5 config=debug pyrogenesis

Testing in non-visual replay mode (simulation)

I've attached the commands.txt file I used for the performance measurements in this thread.

run ./pyrogenesis -replay=/path/to/commands.txt 

or

run ./pyrogenesis_dbg -replay=/path/to/commands.txt

How to create the performance graphs.

What works/doesn't work

There's still a lot that doesn't work. Basically the only really tested things are:

  1. Building on my Ubuntu system
  2. Generating the random map alpine_lakes
  3. Running the non-visual replay with the commands.txt provided
  4. Starting a normal game on Acropolis1. Playing isn't possible yet because of a missing change in the config-db code and probably there are a lot other issues too. I also haven't checked the GUI-code for missing requests, so it could simply crash after a while.

EDIT: There were some files missing in the patch. I've attached a new version.

commands.txt

patch_24_v0.2.diff

Link to comment
Share on other sites

I have been looking into the issues surrounding the JS engine and how to improve performance by having better code (i.e taking more use of IonMonkey).

So we have bailouts (temporary go to baseline again) and invalidation (delete the IM code) in this code. From high to low this is:

1) MCallGeneric

2) GetPopertyPolymorphicV

3) BoundsChecks

4) ValueToInt32

1) MCallGeneric is the generic code used in IonMonkey to call to a function. Normally in the browser the JSObject to where the call takes place is always a JSFunction. Here that isn't the case, since I assume you added your own JSObject's and override ->call function. In that case we were bailing, so stuck in baseline until we can enter IM again. So I created a small fix for this.

2) GetPopertyPolymorphicV looks at the baseline caches to know which objects go there and create better code to lookup a property. If we encounter a new type we bailout and see a new type, resulting in IM rebuilding. Now it seems we are bailing here the whole time. I assume again that an own JSObject is given and for some reason we aren't recording this type! I temporary disabled looking at baseline caches. In that case we use GetPropertyCache that is slightly slower, but can't bail.

So now I'm back at looking to BoundsChecks issue, also raised above. But I found it time to measure how much difference this would make.

I created such a graph you created and it was a bit disappointing, but it points to another issue.

The difference is only visible on the spikes, there it is clearly visible fixing the issues indeed improves performance.

But the base line didn't go down. Also the base line didn't go up when only allowing the interpreter (our slowest engine).

=> I think the difference between v24/v1.8.5 is not because of the engine, but because of the overhead difference between the two.

I ran oprofile on the whole execution:

h4writer@h4writer-ThinkPad-W530:~/Build/0ad/binaries/system$ opreport ./pyrogenesis -d | grep "^[0-9a-z]" | head -n30
Using /home/h4writer/Build/0ad/binaries/system/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
warning: [vdso] (tgid:25236 range:0xb778e000-0xb778efff) could not be found.
vma samples % image name symbol name
08138bd0 152029 4.1760 pyrogenesis std::_Rb_tree<unsigned int, std::pair<unsigned int const, EntityData>, std::_Select1st<std::pair<unsigned int const, EntityData> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, EntityData> > >::find(unsigned int const&)
08136f00 144718 3.9752 pyrogenesis CCmpRangeManager::GetPercentMapExplored(int)
002026e0 123392 3.3894 libmozjs24-ps-release.so JSCompartment::wrap(JSContext*, JS::MutableHandle<JS::Value>, JS::Handle<JSObject*>)
001922b0 93085 2.5569 libmozjs24-ps-release.so js::ShapeTable::search(int, bool)
00000000 92117 2.5303 libstdc++.so.6.0.17 /usr/lib/i386-linux-gnu/libstdc++.so.6.0.17
00000000 86632 2.3797 no-vmlinux /no-vmlinux
000770a0 74491 2.0462 libc-2.15.so _int_malloc
00000000 66707 1.8324 anon (tgid:25236 range:0xb2f09000-0xb2f48fff) anon (tgid:25236 range:0xb2f09000-0xb2f48fff)
00000000 65781 1.8069 libnspr4.so /usr/lib/i386-linux-gnu/libnspr4.so
...

So 8% is spent in pyrogenesis itself. I think the 3.3% JSCompartment::wrap is also overhead. But I'm mostly guessing.

I think it would be good to have a oprofile log from turn 80(*20) to turn 90(*20). That would show where the time is going too...

  • Like 1
Link to comment
Share on other sites

H4writer and I have discussed the problems in IRC and I think we haven't really found a conclusion yet. He brought up some interesting questions and observations.

  1. A strange observation is that currently the average performance (not counting the pikes) is better with all the optimizations (Ion Monkey, Type Inference, Baseline) disabled.
  2. Does the performance really get lost during JS code execution or is it some other overhead that got added?
  3. Is something wrong about how we pass data to IonMonkey which could keep it from making some important optimizations?

I've spend the whole thursday experimenting with wrappers and changing the interface to avoid some structured clones. I assumed that copying the entire gamestate with a structured clone is bad for IonMonkey because it doesn't recognise that objects and values are the same as in the last turn and it can't do optimizations because the data's memory address changes each turn (or so... ;) ).

At the end of the day there was no improvement but I was still not sure if these concerns are valid or not. :(

I will probably try to do some more experiments in a sandbox environment.

I've decided to look a bit closer if I can spot some specific JS code that runs slower. I noticed two things:

  1. There is JS code that runs slower but also code that is faster with V24
  2. About half of the time is in currently "unlogged" sections (i.e. sections not wrapped into profile macros).

I'll add some more profile macros and will try figuring out where that time is spent.

In the meantime I have some graphs comparing different profile sections of our JS code.

Watch the scales, they are different!

Parts where V24 is faster:

post-7202-0-17467300-1373657190_thumb.gi

post-7202-0-69383500-1373657187_thumb.gi

post-7202-0-43575500-1373657188_thumb.gi

Parts where v1.8.5 is faster:

post-7202-0-06094300-1373657186_thumb.gi

post-7202-0-95064300-1373657186_thumb.gi

Execute items is part of "AI script":

post-7202-0-29506900-1373657189_thumb.gi

Link to comment
Share on other sites

So now I'm back at looking to BoundsChecks issue, also raised above. But I found it time to measure how much difference this would make.

I created such a graph you created and it was a bit disappointing

What difference did you measure? Your fix for MCallGeneric?

EDIT: If you have any patches, could you post them please? I'd like to avoid finding problems you already fixed. :)

Link to comment
Share on other sites

Implement JSOP_SETRVAL and JSOP_RETRVAL (should apply to v24)

https://bugzilla.mozilla.org/show_bug.cgi?id=890722

Fix bailing of LCallGeneric:

https://bugzilla.mozilla.org/show_bug.cgi?id=891910

Quick and dirty fix for GetPopertyPolymorphicV. This doesn't fix the real issue and could actually decrease performance. I'm not sure yet what the right fix is,

but this removes the constant bailing from that bytecode:


diff --git a/js/src/ion/BaselineInspector.cpp b/js/src/ion/BaselineInspector.cpp
--- a/js/src/ion/BaselineInspector.cpp
+++ b/js/src/ion/BaselineInspector.cpp
@@ -100,18 +100,18 @@ BaselineInspector::maybeShapesForPropert
if (stub->toGetProp_Fallback()->hadUnoptimizableAccess())
shapes.clear();
} else {
if (stub->toSetProp_Fallback()->hadUnoptimizableAccess())
shapes.clear();
}

// Don't inline if there are more than 5 shapes.
- if (shapes.length() > 5)
- shapes.clear();
+ //if (shapes.length() > 5)
+ shapes.clear();

return true;
}

ICStub *
BaselineInspector::monomorphicStub(jsbytecode *pc)
{
if (!hasBaselineScript())

Your new graphs are actually very nice. It's nice to see that some parts indeed get the improvement. It should be cool if you find out which where the worst offenders. Would narrow the search.

I hope I can create some free time to look further myself. I'm going to be in the US next week, so I'll be online in different hours and I have no idea how much free time I will have. But my next objective is to fix the boundchecks.

Link to comment
Share on other sites

The bailout inside the loop mentioned above is caused by a typebarrier.

[Bailouts]  bailing from bytecode: loopentry, MIR: typebarrier [73], LIR: typebarrier [85]

I've found a good article that describes some of the basics of type inference.

Then I asked on the #jsapi channel what could cause this issue and jandem had a look at it.

Inside the loop it detects a type it didn't expect and then has to regenerate that code. It shouldn't happen again and again, so it's probably a bug. I'll try to provide some more information like the ion.cfg log and maybe a standalone script to reproduce the issue.

He also pointed me to Bug 894852 with a patch that should fix the GetPopertyPolymorphicV bailout issues.

I tested the performance difference and the number of bailouts with the V25 development branch and the different patches:

V25 branch without patches: 336021 bailouts

V25 branch + LCallGeneric patch: 8057 bailouts

V25 +LCallGeneric patch + PropertyLookup patch: 2819 bailouts

There are a much less bailouts now, but unfortunately the performance difference is so small that it's hard to tell if there's even a difference.

I still think that fixing some more bailout issues will improve performance. It's very important what part of code causes the bailouts, how often this code is called and how well it could be optimized otherwise. For the typebarrier issue we know that there is a performance problem in this specific loop.

Link to comment
Share on other sites

Yves, with so much work put into understanding the logic behind Spidermonkey, do you think you could write a (as comprehensive as possible) guide to good JS use in 0 A.D., what to do vs what not to do and things like that? Or is that out of your scope?

(obviously after you're finished with this)

I'm asking because I haven't really found serious resources about JS optimizations and it really varies form compiler to compiler so having a specific list of "Dos and Don't's" would be pretty cool.

Link to comment
Share on other sites

Tuan Kuranes has posted this link which is about some general performance guidelines.

Avoiding "delete" for properties is very important from my experience and probably won't change too soon.

The other issues we discovered are mostly issues in Spidermonkey that should be fixed soon.

Everyting seems to change quite rapidly, so too specific guidelines for Spidermonkey will probably be outdated a few weeks or months later.

I could write some instructions how to detect performance problems like using Ion spew if I eventually succeed at it ;)

Edit: I've attached the Ion.cfg for the typebarrier issue in map-module.js:39 (oh what a stack of file-endings)

ion.cfg.tar.lzma.txt

Link to comment
Share on other sites

H4writer suggested creating a tracelog. It's a new feature in Spidermonkey 25 and should allow us to identify code that has problems with JIT compiling.

There were a few difficulties generating that tracelog but finally it worked.

How to create such a file:

  1. Edit TraceLogging.cpp to create the file somewehre else than in /tmp/ (I had to do this because I use a ramdisk for temp and the files can become quite large).
  2. configure with --enable-trace-logging and rebuild Spidermonkey and 0 A.D.
  3. Make sure the code doesn't use multiple contexts. I still have to investigate how to use trace-logging with multiple contexts.
  4. Run pyrogenesis_dbg as normal. It should write a tracelogging.log in the directory specified above.
  5. Download basic.js, generate.py and style.css from here: https://github.com/h...er/blob/master/
  6. I had to convert generate.py to python 3 syntax (replace print "xyz" with print("xyz")) because it crashed with python 2.7. Maybe the crashing was related to an invalid format of the file I had first, so it's probably not necessary.
  7. run
    python3.3 generate.py tracelogging.log 2000000 > out2.html

    The number can be smaller for smaller input files or larger for larger input files. It somehow specifies how detailed the report is (don't know exactly what it does).

  8. Make sure basic.js and style.css are in the same directory and open the generated html file.

It will create a graph like that (notice the information it prints when hovering a block).

white: interpreter

ionmonkey compilation

ionmonkey running

baseline running

GC

post-7202-0-56706300-1374849647_thumb.gi

And it also prints some very useful statistics:

post-7202-0-56564900-1374849648_thumb.gi

The full html file is attached.

H4writer has figured out the cause of the typebarrier issue in the meantime and he knows a way to work around it in JS code, but there's no patch yet to fix it in Spidermonkey. The issue is tracked in bug 897926.

He also created a meta-bug for 0 A.D. peformance improvements. These are fixes that we discover during debugging and profiling but the issues aren't specific to 0 A.D. and could slow down other applications too.

I think we are aiming for v25 or v26 now because v24 is already out of the development branch and we need to get our fixes into Spidermonkey. I think that's better than applying a lot of JS workarounds for v24.

EDIT: I realized that I forgot to set limitScriptSize to false in Ion.h, so large script didn't get ion compiled.

I've attached a new tracelog with this setting set to false (iontrace_no_limit_script_size.zip). It confirms the issue in terrain-analysis-pathfinder.js:146 which we have already discovered (70.50% total time) and also the type-inference issue in map-module.js:39 (3.3% total time).

I'm going to look at the issue in queue-manager.js:128 next (4.97%).


[BaselineBailouts] Bailing to baseline simulation/ai/qbot-wc/queue-manager.js:128 (IonScript=0x9424400) (FrameType=2)
[BaselineBailouts] Reading from snapshot offset 8540 size 28522
[BaselineBailouts] Incoming frame ptr = 0x7fffd66fd978
[Bailouts] bailing from bytecode: call, MIR: call [341], LIR: callgeneric [475]
[BaselineBailouts] Callee function (simulation/ai/qbot-wc/queue-manager.js:128)
[BaselineBailouts] Not constructing!
[BaselineBailouts] Restoring frames:
[BaselineBailouts] FrameNo 0
[BaselineBailouts] Unpacking simulation/ai/qbot-wc/queue-manager.js:128
[BaselineBailouts] [BASELINE-JS FRAME]
[BaselineBailouts] WRITE_PTR 0xe443e28/0x7fffd66fd970 PrevFramePtr 0x7fffd66fd9a8
[BaselineBailouts] SUB_072 0xe443de0/0x7fffd66fd928 BaselineFrame
[BaselineBailouts] FrameSize=264
[BaselineBailouts] ScopeChain=0x7fa7445a7060
[BaselineBailouts] Is function!
[BaselineBailouts] thisv=fffbffa744380200
[BaselineBailouts] frame slots 26, nargs 1, nfixed 20
[BaselineBailouts] arg 0 = fffbffa73df34300
[BaselineBailouts] WRITE_VAL 0xe443dd8/0x7fffd66fd920 FixedValue fffbffa744380200
[BaselineBailouts] WRITE_VAL 0xe443dd0/0x7fffd66fd918 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443dc8/0x7fffd66fd910 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443dc0/0x7fffd66fd908 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443db8/0x7fffd66fd900 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443db0/0x7fffd66fd8f8 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443da8/0x7fffd66fd8f0 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443da0/0x7fffd66fd8e8 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d98/0x7fffd66fd8e0 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d90/0x7fffd66fd8d8 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d88/0x7fffd66fd8d0 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d80/0x7fffd66fd8c8 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d78/0x7fffd66fd8c0 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d70/0x7fffd66fd8b8 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d68/0x7fffd66fd8b0 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d60/0x7fffd66fd8a8 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d58/0x7fffd66fd8a0 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d50/0x7fffd66fd898 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d48/0x7fffd66fd890 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443d40/0x7fffd66fd888 FixedValue fff9000000000000
[BaselineBailouts] pushing 3 expression stack slots
[BaselineBailouts] WRITE_VAL 0xe443d38/0x7fffd66fd880 StackValue fffbffa744421a00
[BaselineBailouts] WRITE_VAL 0xe443d30/0x7fffd66fd878 StackValue fffbffa744380200
[BaselineBailouts] WRITE_VAL 0xe443d28/0x7fffd66fd870 StackValue fffbffa73df34300
[BaselineBailouts] Resuming at pc offset 423 (op call) (line 166) of simulation/ai/qbot-wc/queue-manager.js:128
[BaselineBailouts] Bailout kind: Bailout_Normal
[BaselineBailouts] WRITE_WRD 0xe443d20/0x7fffd66fd868 Descriptor 0000000000001081
[BaselineBailouts] WRITE_PTR 0xe443d18/0x7fffd66fd860 ReturnAddr 0x7fa7463552b9
[BaselineBailouts] [BASELINE-STUB FRAME]
[BaselineBailouts] WRITE_PTR 0xe443d10/0x7fffd66fd858 StubPtr 0x70609a0
[BaselineBailouts] WRITE_PTR 0xe443d08/0x7fffd66fd850 PrevFramePtr 0x7fffd66fd970
[BaselineBailouts] WRITE_VAL 0xe443d00/0x7fffd66fd848 ArgVal fffbffa73df34300
[BaselineBailouts] WRITE_VAL 0xe443cf8/0x7fffd66fd840 ArgVal fffbffa744380200
[BaselineBailouts] WRITE_WRD 0xe443cf0/0x7fffd66fd838 ActualArgc 0000000000000001
[BaselineBailouts] CalleeStackSlot=0
[BaselineBailouts] Callee = fffbffa744421a00
[BaselineBailouts] WRITE_PTR 0xe443ce8/0x7fffd66fd830 CalleeToken 0x7fa744421a00
[BaselineBailouts] WRITE_WRD 0xe443ce0/0x7fffd66fd828 Descriptor 0000000000000202
[BaselineBailouts] WRITE_PTR 0xe443cd8/0x7fffd66fd820 ReturnAddr 0x7fa74704cf42
[BaselineBailouts] [RECTIFIER FRAME]
[BaselineBailouts] WRITE_VAL 0xe443cd0/0x7fffd66fd818 FillerVal fff9000000000000
[BaselineBailouts] SUB_016 0xe443cc0/0x7fffd66fd808 CopiedArgs
[BaselineBailouts] MemCpy from 0xe443cf8
[BaselineBailouts] WRITE_WRD 0xe443cb8/0x7fffd66fd800 ActualArgc 0000000000000001
[BaselineBailouts] WRITE_PTR 0xe443cb0/0x7fffd66fd7f8 CalleeToken 0x7fa744421a00
[BaselineBailouts] WRITE_WRD 0xe443ca8/0x7fffd66fd7f0 Descriptor 0000000000000184
[BaselineBailouts] WRITE_PTR 0xe443ca0/0x7fffd66fd7e8 ReturnAddr 0x7fa7577544f4
[BaselineBailouts] FrameNo 1
[BaselineBailouts] Unpacking simulation/ai/qbot-wc/queue-manager.js:42
[BaselineBailouts] [BASELINE-JS FRAME]
[BaselineBailouts] WRITE_PTR 0xe443c98/0x7fffd66fd7e0 PrevFramePtr 0x7fffd66fd850
[BaselineBailouts] SUB_072 0xe443c50/0x7fffd66fd798 BaselineFrame
[BaselineBailouts] FrameSize=112
[BaselineBailouts] ScopeChain=0x7fa7445a7060
[BaselineBailouts] Is function!
[BaselineBailouts] thisv=fffbffa744380200
[BaselineBailouts] frame slots 8, nargs 2, nfixed 2
[BaselineBailouts] arg 0 = fffbffa73df34300
[BaselineBailouts] arg 1 = fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443c48/0x7fffd66fd790 FixedValue fff9000000000000
[BaselineBailouts] WRITE_VAL 0xe443c40/0x7fffd66fd788 FixedValue fff9000000000000
[BaselineBailouts] pushing 2 expression stack slots
[BaselineBailouts] WRITE_VAL 0xe443c38/0x7fffd66fd780 StackValue fffbffa73df34300
[BaselineBailouts] WRITE_VAL 0xe443c30/0x7fffd66fd778 StackValue fffbffa74467e8e0
[BaselineBailouts] Resuming after pc offset 4 (op callprop) (line 43) of simulation/ai/qbot-wc/queue-manager.js:42
[BaselineBailouts] Bailout kind: Bailout_Normal
[BaselineBailouts] [TYPE-MONITOR CHAIN]
[BaselineBailouts] Popping top stack value into R0.
[BaselineBailouts] Adjusted framesize -= 8: 104
[BaselineBailouts] Set resumeAddr=0x7fa746356e4a monitorStub=0x7ac3e00
[BaselineBailouts] Done restoring frames
[BaselineBailouts] Done restoring frames
[BaselineBailouts] Got pc=0x507c75c
[BaselineBailouts] Restored outerScript=(simulation/ai/qbot-wc/queue-manager.js:128,319) innerScript=(simulation/ai/qbot-wc/queue-manager.js:42,234) (bailoutKind=0)

iontrace.zip

iontrace_no_limit_script_size.zip

Link to comment
Share on other sites

I was able to find a workaround for the bounds check bailouts.

Apparently Ionmonkey has some problems with arrays that contain holes. I knew this kind of arrays aren't optimal but it's a bug if it causes Spidermonkey to keep bailing out. I didn't fix all occurences of these bailouts because at some locations it's more difficult to work with arrays (or objects with numeric properties) that don't contain holes.

That issue should also be fixed in Spidermonkey to fix the remaining occurences.

Unfortunately I was disappointed again by the result. The improvement is hardly visible on the graph and I expected a lot more when I saw the "70% execution time" (which is now less than 5% with the workaround). One reason could be that a lot of time is spent in these functions before the first turn even starts. This means it's not visible in the graph if these functions execute faster.

I've measured that the time passed until the first turn starts decreased from about 4 seconds in release mode with parallel compiling enabled to about 3 seconds. If parallel compiling is disabled, it decreases from about 13 seconds to 4 seconds.

It doesn't make a difference if parallel compiling is enabled or disabled later, so that doesn't explain where the rest of the improvent "hides". Maybe it's a measuring issue with debug mode (I created the first tracelog in debug mode because I didn't know that it's also enabled in release mode).

I've also attached the updated tracelog including the workaround and created in release mode.

EDIT: I investigated a little further and figured out that the same problem that causes the type-barrier issue also seems to cause the bounds check issue. The problem only occurs if the loop is there inside fun.

If I pass i as first argument to fun instead of rnd, it only bails once, so the non-consecutive access is also important.


var obj = {}

fun = function(index, value)
{
for(var i=0; i<4000; i++)
i -= 0.1;

obj[index] = value;
}
for (var i=0; i<20; i++)
{
var rnd = Math.floor(Math.random() * 3000);
fun(rnd, true);
}

patch_bound_check_bailouts_v1.0.diff

bounds_check_fixed.html.zip

Link to comment
Share on other sites

Thanks for the graphs. Would it be possible to increase the zoom level. I mostly try to make sure the whole graph is visible on the screen. Maybe here you could try to have it visible in twice the screen height? That way you have an overview and still have enough information ...

Link to comment
Share on other sites

I've tested the bounds-check patch. The test-script doesn't bailout multiple times anymore.

Also in shared.js:350 and shared.js:360 it only records two bailouts.

Unfortunately that only seems to cover the real issue because the performance is still as bad as before and also the tracelogging stats don't improve (still most of the time is spent in ion_compile or script baseline).

Here's a graph of the profile section "Run Workers" which spends nearly all the time in shared.js:360(getMetadata) and shared.js:350 (setMetadata).

post-7202-0-41389500-1375128455_thumb.gi

Tracelogging stats before the patch (filename, times called, times compiled, time total ...):


simulation/ai/common-api-v3/shared.js:350 16752 13 0.04% ion_compile : 75.03%, script ionmonkey : 20.44%, script baseline : 4.44%, script interpreter : 0.09%,


simulation/ai/common-api-v3/shared.js:360 10728 10 0.02% ion_compile : 18.50%, script ionmonkey : 0.04%, script baseline : 80.61%, script interpreter : 0.85%,

Tracelogging stats after the patch (filename, times called, times compiled, time total ...):

simulation/ai/common-api-v3/shared.js:350	 5679	 5	 0.00%	script ionmonkey : 3.54%, script interpreter : 3.03%, ion_compile : 53.20%, script baseline : 40.24%,

simulation/ai/common-api-v3/shared.js:360	 14771	 10	 0.03%	script ionmonkey : 17.34%, script interpreter : 0.69%, ion_compile : 2.29%, script baseline : 79.69%,

Some strange observations about the data:

  1. Spidermonkey 25 requires about 5-6 times longer in average (not counding the peaks, just looking at the base line).
  2. RunWorkers isn't a big fraction of the whole turn duration, we are talking about 1-2 ms. You can see that it's pretty much the combined duration of getMetadata plus setMetadata. Now if the whole turn takes roughly 20 ms (rounded up), 1-2 ms is between 5% and 10%, not 0.3% or 0.6% as the "time total" column shows. I don't know what this column actually shows, but it's probably not very relevant for the real performance we get in the end.
  3. The value in the column "times called" is a bit strange. Our simulation is deterministic and the function should be called exactly the same number of times. Maybe it's related to inlining of the function.

EDIT: attached another ion.cfg (not directly related to that post)

ion_entity-js-456.cfg.zip

Link to comment
Share on other sites

  • 4 weeks later...

I've worked the past weeks on finding several causes of simulation state differences.

As long as v1.8.5 and the new version don't produce the same simulation state in the end of the game, it's difficult to tell if I introduced any bugs and also the performance measurements could be affected.

At the moment I have found most of the possible differences and in the test setup I use, both versions produce the same simulation state. There are little differences because some properties are set to false instead of being deleted but they don't cause any difference in gameplay and just appear in the serialized state because I haven't adjusted the serialization function yet.

I'm going to describe one of the most annoying differences that was quite hard to analyze.

I'm looking at entity collections, one of the core components of the AI.

Entity collections are used a lot and therefore making them faster is important. They are essentially just objects containing other objects in numerically indexed properties and must support removal of entities.

The main problems are:

1. Using delete on object properties is not supported by IonMonkey and the function or loop where it is used can't be JIT compiled.

2. Iterating over sparse arrays (arrays with non consecutive indexing) or using for .. in for iterating over object properties is slow.

What I tried first to avoid deleting properties is simply setting them to false.

When iterating over the properties I have to ignore those that have the value set to false. First it seems to be a bit ugly and cumbersome, but fortunately most of that is hidden in the entity-collection's forEach implementation.

However, there's a very subtle difference that caused a lot of problems (and still isn't completely solved).

The ECMA standard specifies that the iteration order of for .. in is undefined, meaning that you can't be sure you get the lowest index first or anything like that.

Well, if the standard says the order is undefined, why do we depend on a specific order?

We don't really depend on it, we just have to be sure it's the same order each time. For example there are some places where we just pick the first X entities and do something with them. We assign the first X idle workers to chop trees.

We don't care which workers that are in the first place, but it could make a difference!

At the moment Spidermonkey iterates in ascending order for numeric properties with indices that are smaller than a specified limit (don't know where that limit is exactly). After that it depends on the insertion order of properties in FIFO order (First In First Out).

The tricky bit is that the order is different if you add a property, delete it and add it again compared to adding it, setting it to false and setting it to the same object again.

Check out this example:


var useDelete = true;
var useOrderWorkaround = false;
test(useDelete, useOrderWorkaround);

function test(useDelete, useOrderWorkaround)
{
var start = elapsed();
var obj = {};
obj["10000"] = "value1";
obj["9000"] = "value2";
obj["7000"] = "value3";
if (useDelete)
delete obj["10000"];
else
obj["10000"] = false;
obj["10000"] = "value4";
var callback = function(val, index, array)
{
print("index " + index + " val: " + val);
};
if (!useOrderWorkaround)
foreach1(obj, callback);
else
foreach2(obj, callback);
var end = elapsed();
print("duration: " + (end - start) + "\n");
}
function foreach1(obj, callback)
{
for (var propIndex in obj)
{
if (obj[propIndex] != false)
callback(obj[propIndex], propIndex, obj);
}
}
function foreach2(obj, callback)
{
var array = [];
for (var propIndex in obj)
{
if (obj[propIndex] != false)
array[propIndex] = obj[propIndex]
}
array.forEach(callback);
}

Output:

1. Run

var useDelete = true; var useOrderWorkaround = false;

10000 comes last because it was the last property added.

index 9000 val: value2
index 7000 val: value3
index 10000 val: value4
duration: 128

2. Run

var useDelete = false; var useOrderWorkaround = false;

10000 comes first because it was the first property added. Setting it to false and to the original value again only changes its value, it doesn't re-add the property.


index 10000 val: value4
index 9000 val: value2
index 7000 val: value3
duration: 177

3. Run

var useDelete = true; var useOrderWorkaround = true;

The workaround makes the order ascending both when using delete or when setting it to false.

index 7000 val: value3
index 9000 val: value2
index 10000 val: value4
duration: 2501

4. Run

var useDelete = false; var useOrderWorkaround = true;

The workaround makes the order ascending both when using delete or when setting it to false.

index 7000 val: value3
index 9000 val: value2
index 10000 val: value4
duration: 2560

There are two problems with this workaround.

  1. It must be applied to 1.8.5 and the new version to make the simulation state identical.
  2. It completely destroys the performance (about 18 times slower in this example). It adds peaks of several seconds and causes the graph to scale in a way that makes any other peformance analysis impossible.

I don't yet understand why this is so bad for performance. It iterates over the properties twice, but that should only make it about half as fast. Since it looks like I have to adjust the 1.8.5 version anyway, I could as well just add the whole change to avoid deleting properties and test if 1.8.5 behaves the same way.

Another approach would be finding all places where we depend on the iteration order of for .. in and change the code to be explicit about which entities should be used. For assigning idle workers we could use the closest one for example instead of just picking and arbitrary worker.

@h4writer

I wanted to fix the iteration order problem before posting the current patch.

Unfortunately I couldn't come up with a good solution today. Do you have an idea how we could store entity collections more efficiently?

Link to comment
Share on other sites

To be honest about AI things like that, just tell the developer (me) how it works and the developer (I) will adapt. I've got an unadded optimization for entity collections, too, that uses a real array (basically if an entityCollection is not going to change much (few removals), then I keep a real array for iterating instead of an associative one.

I wouldn't worry about that too much since it's so inefficient in Javascript that I might want to try it in C++ someday.

Link to comment
Share on other sites

That's one of the reasons I try not to use for ... in. The order isn't specified.

If it is really important to get the same order, I would keep a property that contains the properties by name in the right order.

You can keep that "order" variable on the object and adjust whenever a property get's deleted/added ...


var obj = {
"prop2": 1
"prop3": 2
"prop4": 3
}

obj.order = ["prop2", "prop3", "prop4"];

for (var i=0; i < obj.order.length; i++) {
var key = obj.order[i];
callback(key, obj[key]);
}

Link to comment
Share on other sites

The ECMA standard specifies that the iteration order of for .. in is undefined, meaning that you can't be sure you get the lowest index first or anything like that.

Well, if the standard says the order is undefined, why do we depend on a specific order?

We don't really depend on it, we just have to be sure it's the same order each time. For example there are some places where we just pick the first X entities and do something with them. We assign the first X idle workers to chop trees.

We don't care which workers that are in the first place, but it could make a difference!

Yes, and it will almost certainly make a difference in AI (de)serialization, which is one of the points I've been trying to make on that topic (see this comment)

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

 Share

×
×
  • Create New...