Yves Posted August 30, 2014

I'm reading a book about statistics at the moment. I expect this to be useful when measuring and comparing performance, especially in cases where the results vary a lot between multiple measurements. I'm posting my conclusions here because I'm not very confident that they are correct and would be glad about some feedback.

To try it in practice, I've made multiple performance measurements. They were all with r15626 and C++11 enabled, from a 2vs2 AI game on "Median Oasis (4)" over 15008 turns. The values are the number of seconds it took to complete the whole replay in non-visual replay mode (-replay flag). I ran the same test 10 times:

665, 673, 675, 669, 666, 668, 668, 678, 679, 667

The average value is 670.80. That value alone doesn't tell us how much different measurements vary, so we also need the standard deviation, which is 5.07.

This still doesn't tell us how reliable the result is. Would it be possible that we measure another time and suddenly get a value of 800? Based on my experience with this kind of measurement I can say that this is very unlikely, but how can this be shown with statistics? It can be done by calculating a confidence interval for the average.

We start by calculating the standard error of the average value:

s = standard deviation
n = number of measurements
sm = standard error

sm = s / sqrt(n) = 5.07 / sqrt(10) = 1.60

Now to calculate the confidence interval, we have this formula (t-table here):

m = average
t = t-value in the t-table for df = n - 1, using 0.05 in the "α (2 tail)" column; 0.05 corresponds to a 95% confidence level
u = universe average (I hope that is the right term in English)

m - t * sm < u < m + t * sm
670.80 - 2.262 * 1.60 < u < 670.80 + 2.262 * 1.60
667.18 < u < 674.42

So we've chosen the t-value for 0.05. This means that with 95% confidence, the true average lies between 667.18 and 674.42.

I'm uncertain about the normal distribution. The t-test requires normally distributed values, and I assume that the measured values aren't necessarily normally distributed. I'm also not sure whether the central limit theorem applies here because we are working with an average value. Any thoughts?
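For reference, here is the whole calculation in a few lines of Python. This is just a sketch of the math above, not what I actually ran; it assumes SciPy is available for the t-quantile:

import math
import statistics
from scipy import stats

times = [665, 673, 675, 669, 666, 668, 668, 678, 679, 667]

n = len(times)
mean = statistics.mean(times)        # 670.80
s = statistics.stdev(times)          # sample standard deviation, ~5.07
sem = s / math.sqrt(n)               # standard error of the mean, ~1.60

t = stats.t.ppf(0.975, df=n - 1)     # two-sided 95% -> 0.975 quantile, ~2.262
low, high = mean - t * sem, mean + t * sem

print(f"mean = {mean:.2f}, s = {s:.2f}, SEM = {sem:.2f}")
print(f"95% CI for the population mean: [{low:.2f}, {high:.2f}]")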
Yves Posted September 8, 2014

I've analyzed the distribution of the measured values a bit. For all measurements I've used the same conditions as in the first post: r15626, C++11 enabled, 2vs2 AI game on "Median Oasis (4)", 15008 turns.

First I assumed that the factors causing differences between measurements are quite random and that the results will be normally distributed. Such factors are mainly other processes using system resources, delays caused by the hardware, and how the available resources are assigned to processes.

First measurement series

In the first run, I didn't care much about what else was running on my test machine. I looked at data in the text editor, WLAN was activated, and I sometimes even had the resource monitor open. I have a script that repeats the replay 10 times, and I ran that script 9 times. The resulting distribution graph doesn't look like a normal distribution.

Some statistical data:
Average: 669.69
Median: 668.00
Standard deviation: 4.31
Maximum value: 679
Minimum value: 663

Second measurement series

I thought it might increase the accuracy of the measurements if I closed all open programs, disabled the network and didn't touch the computer during the measurement. I repeated my script 7 times for a total of 70 measurements.

Some statistical data:
Average: 662.00
Median: 661.50
Standard deviation: 2.79
Maximum value: 676
Minimum value: 657

As you can see, the data is grouped closer together and has a stronger tendency towards a central value. These are indications that the measured data is indeed more reliable.

Third measurement series

The first and second series produced quite different average and median values. The difference was around 1%, which means that performance improvements on that scale could not be reliably proven. I wanted to know how close a third measurement would get to the second when using the improved setup of the second series. This time I just set the number of replay repetitions to 100 in the script instead of running the script 10 times. Theoretically that's the same, and unless my CPU catches fire, it should also be the same in practice.

Some statistical data:
Average: 662.43
Median: 662.00
Standard deviation: 2.47
Maximum value: 672
Minimum value: 658

The results are very close to those of the second measurement series.

Conclusions

For measurements with maximum accuracy, it makes an important difference to reduce the number of running processes that could influence the performance of the system. Especially on the graphs, it's very obvious that there are still some outliers. For this reason I would say that the median is a better value to compare than the arithmetic mean (average), although it makes only a small difference here.

About the distribution, I would say it is similar to a normal distribution, but it's not a true normal distribution. Generally there are more outliers on the right side, and the curve falls off less steeply there. That's probably because there's a theoretical minimum of how many processor cycles, memory access operations etc. are required. Going below this limit is not possible, but the time can always go higher when other processes on the system disturb the measurement.

The data is attached as a csv file if you want to play around with it.

Data.txt
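If you want to reproduce the summary statistics, something like this should do. It assumes the attached Data.txt contains one measurement per line; the exact file format is an assumption here:

import statistics

with open("Data.txt") as f:
    values = [float(line) for line in f if line.strip()]

print("n      =", len(values))
print("mean   =", round(statistics.mean(values), 2))
print("median =", statistics.median(values))
print("stdev  =", round(statistics.stdev(values), 2))
print("min/max =", min(values), "/", max(values))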
kanetaka Posted September 8, 2014

Very interesting. Good performance measurement demands asceticism of you: no web browsing, no games, etc. Good job!
Yves Posted September 9, 2014

"Good performance measurement demands asceticism of you: no web browsing, no games, etc."

Or better, you have a second computer; then you don't have to touch the first one at all and the results are even better (and you can do whatever you want on your main machine). I used my notebook for the measurements.
wraitii Posted September 21, 2014

The central limit theorem should apply, iirc, assuming what needs to be assumed.

The samples are most likely not normally distributed. It'd need to be checked, but something like a Poisson law seems more likely, since there's a "normal" run-time and more reasons for it to go slow than to suddenly go faster, with lock-ups being fairly random in length. I'm fairly sure there's a class of functions in R that behave a lot like a Poisson law, but I can't recall their names. Probably not a big deal.
Stan` Posted September 21, 2014

"...something like a Poisson law seems more likely... I'm fairly sure there's a class of functions in R that behave a lot like a Poisson law, but I can't recall their names."

Maybe this: http://zoonek2.free.fr/UNIX/48_R/07.html (the exponential distribution?)
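To check which shape fits, something along these lines could be run over the measured values. This is a sketch in Python with SciPy rather than R, and times stands in for one of the measurement series above (the ten values from the first post are reused as an example):

from scipy import stats

times = [665, 673, 675, 669, 666, 668, 668, 678, 679, 667]

print("skewness:", stats.skew(times))      # > 0 hints at a heavier right tail

# Shifted exponential fit: `loc` acts as the theoretical minimum run-time,
# `scale` as the mean of the random extra delay on top of it.
loc, scale = stats.expon.fit(times)
print(f"exponential fit: offset={loc:.1f}, mean extra delay={scale:.1f}")

# Shapiro-Wilk normality test: a small p-value argues against normality
# (though with only 10 samples the test has little power).
stat, p = stats.shapiro(times)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p:.3f}")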
Jeru Posted September 22, 2014

I love statistics!

"u = universe average (I hope that is the right term in English)"

Population mean (as opposed to sample mean).

(And as a rule of thumb, it's better to do statistics with a sample size of at least n = 30.)
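A quick way to convince yourself that the central limit theorem does its job even for skewed run-times is a small bootstrap experiment. Plain Python, again reusing the ten values from the first post as an example:

import random
import statistics

times = [665, 673, 675, 669, 666, 668, 668, 678, 679, 667]

# Resample the data with replacement many times and look at the spread
# of the resulting sample means.
means = [
    statistics.mean(random.choices(times, k=len(times)))
    for _ in range(10_000)
]
print("bootstrap mean of means:", round(statistics.mean(means), 2))
print("bootstrap SE of mean   :", round(statistics.stdev(means), 2))

The spread of the resampled means should come out close to s / sqrt(n) from the first post, regardless of the shape of the underlying distribution.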
meap Posted October 12, 2014

If you want to improve the accuracy of your measurements, you can reserve a CPU core for the measured process on Linux using the taskset utility and the isolcpus kernel parameter. For example, boot with isolcpus=1, then start 0ad with:

taskset -c 1 0ad

This should limit the influence of other processes.
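A hypothetical harness combining this with the timing loop from the earlier posts might look roughly like the following. The binary name, the exact arguments after -replay, and the core number are assumptions, not the setup used above:

import subprocess
import time

RUNS = 10
# Pin to the core isolated via isolcpus; the -replay invocation is a
# placeholder and would need the real replay arguments.
CMD = ["taskset", "-c", "1", "0ad", "-replay"]

for i in range(RUNS):
    start = time.perf_counter()
    subprocess.run(CMD, check=True, stdout=subprocess.DEVNULL)
    print(f"run {i + 1}: {time.perf_counter() - start:.0f} s")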