Measuring and statistics

I'm currently reading a book about statistics. I expect this to be useful when measuring and comparing performance, especially in cases where the results vary a lot between runs. I'm posting my conclusions here because I'm not very confident that they are correct and would appreciate some feedback.

To try this out in practice, I've made multiple performance measurements, all with r15626 and c++11 enabled, from a 2vs2 AI game on "Median Oasis (4)" over 15008 turns. The values are the number of seconds it took to complete the whole replay in non-visual replay mode (-replay flag). I ran the same test 10 times.

`665, 673, 675, 669, 666, 668, 668, 678, 679, 667`

The average value is 670.80.

That value alone doesn't tell us how much different measurements vary, so we also need the standard deviation, which is 5.07.

This still doesn't tell us how reliable the result is. Would it be possible that we measure another time and suddenly get a value of 800? Based on my experience with this kind of measurement I can say that this is very unlikely, but how can this be proven with statistics?

This can be done by calculating the confidence interval for the average.

We start by calculating the standard error of the average value:

```
s  = standard deviation
n  = number of measurements
sm = standard error

sm = s / sqrt(n) = 5.07 / sqrt(10) = 1.60
```

Now to calculate the confidence interval, we have this formula (t-table here):

```
m = average
t = t-value in the t-table for df = n - 1, using the 0.05 column under "α (2 tail)".
    0.05 stands for 95% probability.
u = universe average (I hope that is the right term in English)

m - t * sm < u < m + t * sm
--> 670.80 - 2.262 * 1.60 < u < 670.80 + 2.262 * 1.60
--> 667.17 < u < 674.43
```

So we've chosen the t-value for 0.05.

This means that, with 95% confidence, the true average (the value the measurements are centered around) lies between 667.17 and 674.43.
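As a sanity check, the whole calculation can be reproduced from the raw data with Python's standard library (the t-value 2.262 for df = 9 and α = 0.05 two-tailed comes from the t-table):

```python
import math
import statistics

times = [665, 673, 675, 669, 666, 668, 668, 678, 679, 667]

n = len(times)
m = statistics.mean(times)       # sample mean
s = statistics.stdev(times)      # sample standard deviation (n - 1 in the denominator)
sm = s / math.sqrt(n)            # standard error of the mean

t = 2.262  # t-table value for df = 9, alpha = 0.05 (two-tailed)
print(f"mean = {m:.2f}, s = {s:.2f}, sm = {sm:.2f}")
print(f"95% CI: {m - t * sm:.2f} < u < {m + t * sm:.2f}")
```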

I'm uncertain about the normal distribution assumption. The t-test requires normally distributed values, and I assume the measured values aren't necessarily normally distributed. I'm also not sure whether the central limit theorem applies here because we are working with an average value. Any thoughts?

• 2 weeks later...

I've analyzed the distribution of the measured values a bit. For all measurements I've used the same conditions as in the first post for the test:
r15626, c++11 enabled, 2vs2 AI game on "Median Oasis (4)", 15008 turns

First I've assumed that the factors causing differences in the measurements are quite random and that the results will be normally distributed. Such factors are mainly other processes using system resources, delays caused by the hardware and how the available resources are assigned to processes.

First measurement series
In the first run, I didn't care much about what else was running on my test-machine. I looked at data in the text editor, WLAN was activated and I sometimes even had the resource monitor open. I have a script that repeats the replay 10 times and I ran that script 9 times.
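The repeat script can be sketched like this (only a sketch; the binary name and flags in the commented example are placeholders that depend on your build):

```python
import subprocess
import time

def measure(cmd, repeats):
    """Run cmd the given number of times; return each run's wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations

# Hypothetical invocation -- binary name and flags depend on your build:
# print(measure(["./pyrogenesis", "-replay", "commands.txt"], 10))
```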

The resulting distribution graph doesn't look like a normal distribution.

Some statistical data:
Average: 669.69
Median: 668.00
Standard deviation: 4.31
Maximum value: 679
Minimum value: 663

Second measurement series
I thought it might increase the accuracy of the measurements if I closed all open programs, disabled the network and didn't touch the computer during the measurements. I repeated my script 7 times, for a total of 70 measurements.

Some statistical data:
Average: 662.00
Median: 661.50
Standard deviation: 2.79
Maximum value: 676
Minimum value: 657

As you can see, the data is grouped more closely together and tends more strongly toward a central value. These are indications that the measurements are indeed more reliable.

Third measurement series
The first and second measurement series produced quite different average and median values. The difference was around 1%, which means that performance improvements on that scale could not be reliably proven. I wanted to know how close a third measurement series would get to the second when using the same improved measurement setup as in the second series.
This time I simply set the number of replay repetitions to 100 in the script instead of running the script 10 times. Theoretically that's the same, and unless my CPU catches fire, it should also be the same in practice.

Some statistical data:
Average: 662.43
Median: 662.00
Standard deviation: 2.47
Maximum value: 672
Minimum value: 658

The results are very close to those of the second measurement series.
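The relative differences between the series averages can be checked directly from the numbers above:

```python
avg1, avg2, avg3 = 669.69, 662.00, 662.43  # series averages from above

diff_1_2 = (avg1 - avg2) / avg1 * 100  # relaxed vs. quiet setup
diff_3_2 = (avg3 - avg2) / avg3 * 100  # two runs of the quiet setup

print(f"series 1 vs 2: {diff_1_2:.2f}%")
print(f"series 3 vs 2: {diff_3_2:.2f}%")
```

So the two quiet-setup series agree to well under 0.1%, while the relaxed setup differs by over 1%.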

Conclusions:
For measurements with maximum accuracy, it makes an important difference to reduce the number of running processes that could influence the performance of the system. Especially on the graphs, it's very obvious that there are still some outliers. For this reason I would say the median is a better value for comparisons than the arithmetic mean (average), although it makes only a small difference here.
About the distribution, I would say it is similar to a normal distribution, but not a true normal distribution. There are generally more outliers on the right side, and the curve falls off less steeply there. That's probably because there's a theoretical minimum number of processor cycles, memory access operations etc. that are required. Going below this limit is not possible, but the run time can always go higher when other processes on the system disturb the measurement.
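The intuition of a hard lower limit plus one-sided disturbances can be illustrated with a toy model: a fixed baseline time plus exponentially distributed noise. The baseline and noise values below are made up for the sketch, not fitted to the real data. The resulting sample is right-skewed, so the mean sits above the median:

```python
import random
import statistics

random.seed(42)

baseline = 660.0  # hypothetical hard lower limit (seconds)
noise_mean = 2.5  # average extra delay from other processes (made up for the sketch)

samples = [baseline + random.expovariate(1 / noise_mean) for _ in range(10_000)]

print(f"mean   = {statistics.mean(samples):.2f}")
print(f"median = {statistics.median(samples):.2f}")
# The mean ends up above the median, as in a right-skewed distribution.
```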

The data is attached as a CSV file if you're interested in playing around with it.

Data.txt

Very interesting. Good performance measurement demands asceticism of you: no web browsing, no games, etc. Good job!

Good performance measurement demands asceticism of you: no web browsing, no games, etc.

Or better, use a second computer; then you don't have to touch the first one at all and the results are even better (and you can do whatever you want on your main machine).

I've used my notebook for the measurements.

• 2 weeks later...

The central limit theorem should apply, IIRC, assuming the usual conditions hold.

The samples are most likely not normally distributed. It would need to be checked, but something like a Poisson law seems more likely: there's a "normal" run time, and there are more reasons for it to suddenly go slower than faster, with lock-ups being fairly random in length. I'm fairly sure there's a class of functions that act a lot like a Poisson law in R, but I can't recall their names.

Probably not a big deal.

I'm fairly sure there's a class of functions that act a lot like a Poisson law in R, but I can't recall their names.

Maybe this: http://zoonek2.free.fr/UNIX/48_R/07.html

Exponential distribution (possible typo).

I love statistics!

u = universe average (I hope that is the right term in English)

Population mean (as opposed to sample mean)

(And as a rule of thumb, it's better to do statistics with a sample size of at least n = 30.)

• 3 weeks later...
If you want to improve the accuracy of your measurements, you can reserve a CPU core for the measured process on Linux using the taskset utility and the isolcpus kernel parameter, e.g.

boot using isolcpus=1

This should limit the influence of other processes.
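Putting the two together, a measurement script would prefix the replay command with taskset so it runs on the isolated core. A minimal sketch (the replay command is a placeholder; only the taskset -c prefix is the real mechanism):

```python
import shlex

# Hypothetical replay command; the binary name is a placeholder for your build.
replay = "./pyrogenesis -replay commands.txt"

# "taskset -c 1" restricts the process to CPU core 1,
# which isolcpus=1 has kept free of other user-space processes.
pinned = ["taskset", "-c", "1"] + shlex.split(replay)
print(" ".join(pinned))
```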
