This tool is used to compare microbenchmarks across two versions of code. It's paranoid about nulling out timing error, so the numbers should be meaningful. It runs the benchmarks many times, scaling the iterations up if the benchmark is extremely short, and it nulls out its own timing overhead while doing so. It reports results graphically with a text interface in the terminal.
You first run it with --record, which generates a JSON dotfile with runtimes for each of your benchmarks. Then you change the code and run again with --compare, which re-runs the benchmarks and generates comparison plots between your recorded and current times. In the example output, I did a --record on the master branch, then switched to my scoring_redesign branch and did a --compare. In my output, three of the benchmarks' runtimes look to be unchanged; the other three got significantly slower. As you can see at the bottom of the output, the entire process takes about a second from hitting enter to having full output, so I can get very rapid performance feedback.
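To make that workflow concrete, here's a minimal sketch of what --record and --compare could do under the hood. This is illustrative only, not the tool's actual internals; the dotfile name and the record, compare, and plot helpers are all made-up names.

require "json"

# Hypothetical dotfile path; the real tool's filename may differ.
DOTFILE = File.join(Dir.home, ".benchmark_times.json")

# results is assumed to look like { "bench name" => [16 corrected runtimes] }.
def record(results)
  File.write(DOTFILE, JSON.generate(results))
end

def compare(current_results)
  recorded = JSON.parse(File.read(DOTFILE))
  current_results.each do |name, times|
    # plot is a stand-in for the text-plot rendering step.
    plot(name, before: recorded[name], after: times)
  end
end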
It goes to great lengths to null out both constant timing offsets and timing jitter. Specifically:
* All benchmarks are run 16 times, with all of those data points being used in the output (not just the average, median, or minimum).
* They're run with and without garbage collection, with both being reported.
* The cost of the benchmark runner itself is nulled out in a way that should be very accurate. It times the user-supplied block, then repeats the timing with an empty block, subtracting the latter from the former.
* It guarantees a minimum test runtime for precise timing. First, it runs the block directly. If it takes less than 1ms, it's deemed unsafe and the process is restarted, but now the block will be run twice. If it's still under 1ms, that will be doubled to four times, etc., until it takes at least 1ms. This is what the "!"s are in the output: each of those is the benchmark being restarted with double the number of repetitions. The 1ms limit is checked after correcting for test runner overhead as described above, so the user-provided benchmark block itself is guaranteed to use at least 1ms of actual CPU time. (A rough sketch of this overhead correction and doubling loop appears after this list.)
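Here's a rough sketch of the overhead correction and the doubling loop described above, just to make the mechanism concrete. It's not the tool's actual code; the names timed and sample are made up.

def timed(reps, &block)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  reps.times(&block)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

def sample(&block)
  reps = 1
  loop do
    raw       = timed(reps, &block)  # time the user-supplied block
    overhead  = timed(reps) { }      # time an empty block the same way
    corrected = raw - overhead       # null out the runner's own cost
    return corrected if corrected >= 0.001  # require at least 1ms of real block time
    print "!"                        # the "!" in the output: restart with doubled reps
    reps *= 2
  end
end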
In the plots, "X" is the median of the 16 iterations. The horizontal lines extend to the minimum and maximum values. This gives you a clear visualization of timing jitter. Each plot's range goes from 0 to the maximum sampled runtime to avoid confusing effects of tightly-zoomed plots. If the lines don't overlap, you can have good confidence that there's a real performance difference. In the second through fourth benchmarks below, you can see a big difference: I've made them much slower by changing Selecta's scoring algorithm. (This was my actual motivation for creating the tool. It's not a synthetic example.) I haven't wanted fine detail so far, so the text plots have been sufficient. I may add numbers eventually.
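As an illustration of that plot format, here's a rough sketch of how a single line could be rendered. It's not the tool's actual rendering code; plot_line and its parameters are invented, with samples standing in for the 16 measured runtimes of one variant and axis_max for the largest runtime across all variants (so every line shares the same 0-to-max scale).

def plot_line(label, samples, axis_max, width = 40)
  scale  = ->(t) { ((t / axis_max) * (width - 1)).round }  # map 0..axis_max onto 0..width-1
  lo, hi = scale.call(samples.min), scale.call(samples.max)
  median = scale.call(samples.sort[samples.size / 2])
  row = Array.new(width, " ")
  (lo..hi).each { |i| row[i] = "-" }  # whiskers span min..max
  row[median] = "X"                   # median marker
  format("%-16s |%s|", label, row.join)
end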
This tool is Ruby-specific, but its design certainly isn't. This is the benchmark tool that I've wanted for years, since before I even knew Ruby. It 1) gives me the bare minimum statistics needed for confidence; 2) nulls out every source of timing offset or jitter that I know of and can control; and 3) automatically scales the repetitions up to get significant samples.
# I haven't thought much about the API yet.
# bench_group is like RSpec's "describe"; bench is like "it".
# Before and after blocks are supported for setup and teardown (not used here).
# Score is Selecta's scoring code; WORDS is assumed to be a word list defined elsewhere.
bench_group "filtering" do
  bench "non-matching" do
    Score.score("x" * 16, "y" * 16)
  end
  bench "matching exactly" do
    Score.score("x" * 16, "x" * 16)
  end
  bench "matching broken up" do
    Score.score("xy" * 20, "x" * 10)
  end
  bench "overlapping matches" do
    Score.score("x" * 40, "x" * 10)
  end
  bench "words, non-matching" do
    WORDS[0,1000].each { |choice| Score.score(choice, "x" * 16) }
  end
  bench "words, matching" do
    WORDS[0,1000].each { |choice| Score.score(choice, WORDS[500]) }
  end
end
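For reference, the output below came from a workflow along these lines (the command sequence is reconstructed from the description above; only the final --compare invocation appears literally, in the timing line at the very bottom):

git checkout master
ruby benchmark.rb --record
git checkout scoring_redesign
ruby benchmark.rb --compare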
filtering non-matching          !!!!!!!!................!!!!!!!!................
filtering matching exactly      !!................!!................
filtering matching broken up    ................................
filtering overlapping matches   ................................
filtering words, non-matching   ................................
filtering words, matching       ................................
filtering non-matching
Before (GC):    | -----X--------                 |
Before (No GC): | ---X---------                  |
After (GC):     | -----X-----------              |
After (No GC):  | --X----------------------------|
                0                          0.0105

filtering matching exactly
Before (GC):    | X---                              |
Before (No GC): | X--                               |
After (GC):     | -----X----------------------      |
After (No GC):  | ----X-----------------------------|
                0                              0.715

filtering matching broken up
Before (GC):    |X-                                     |
Before (No GC): |X-                                     |
After (GC):     | --X---                                |
After (No GC):  | ---X---------------------------------|
                0                                   2.31

filtering overlapping matches
Before (GC):    |X-                                     |
Before (No GC): |X-                                     |
After (GC):     | ---X---------------------------------|
After (No GC):  | ----X-----------------------          |
                0                                      6

filtering words, non-matching
Before (GC):    | ---X----------------                            |
Before (No GC): | ---X----                                        |
After (GC):     | --------X--------------------------------------|
After (No GC):  | ---X--------------                              |
                0                                               20

filtering words, matching
Before (GC):    | -X-------------------------------------------   |
Before (No GC): | --X---------------------                        |
After (GC):     | ---X---------------------------------------------|
After (No GC):  | -X--------                                       |
                0                                             20.1

ruby benchmark.rb --compare  1.03s user 0.08s system 97% cpu 1.136 total
I thought about units more after we exchanged those toots and had the same realization: there's an ambiguity between time and IPS. I'll probably switch this tool to IPS, which seems more intuitive (higher numbers are better). Adding " IPS" to the values is trivial, of course.
But for a gist, you can just show both forms of the code followed by the output of the comparison. That seems to give all of the information present with a multi-variant syntax, but without needing an extra feature.
I see your point about the gists. But that's a bit of work for the person reading the gist to make a comparison. And even more if there are a bunch of variants. (For example, various sorting algorithms.) Before/after (record/compare) doesn't handle more than 2 choices well. But like I said, this is a different use case, and it might not be one that you care about right now.
Anyway, I'm glad I was able to get you to think more about how it works. I can't believe more people aren't excited about this enough to be commenting. I guess you weren't controversial enough in the Tweet announcing it. Recommendation for the 1.0 announcement: "Benchmarking is Dead".
I can think of a use case for having both variants in the code next to each other -- sharing the code with other people in an easier way than multiple branches. For example, in a gist. ;)