Folding at Home, Part 5: Benchmarking with JMH
If you investigate the fold
function (in its various flavors)
in the Scala standard library, you’ll invariably notice that different collection interfaces provide subtly
different implementations of this higher-order function. The ostensible reason: performance.
Consider, for example, this commit to the Scala collections library.
The commit message reads: “Removing a couple of indirection levels gives some speed-up.” Among the changes
is an implementation of foldLeft
catered to the LinearSeq
trait,
specifically:
override def foldLeft[B](z: B)(op: (B, A) => B): B = {
  var acc = z
  var these: LinearSeq[A] = coll
  while (!these.isEmpty) {
    acc = op(acc, these.head)
    these = these.tail
  }
  acc
}
This overrides a more general version of the function,
found in the IterableOnce
trait:
def foldLeft[B](z: B)(op: (B, A) => B): B = {
  var result = z
  val it = iterator
  while (it.hasNext) {
    result = op(result, it.next())
  }
  result
}
Before comparing these functions, note that both use while
loops, unlike our implementations,
which used recursion and relied on tail optimization to avoid
stack problems. On one hand, this demonstrates Scala’s flexibility: one can freely move back and forth between
imperative and functional styles, based on what one deems appropriate for a given situation. On the other hand,
I can’t help but feel betrayed. Implementing a key higher-order function with a while loop is nothing
short of blasphemy. We must inform the Office of Functional Purity, and summon a terrible Inquisition upon the
heads of the unfaithful Scala library maintainers.
Perhaps “betrayed” is a strong word. These iterative foldLeft implementations are eminently reasonable and
comprehensible. Perhaps it’s simply a touch amusing that Scala goes to great lengths to make recursive functions
efficient, yet, internally, shies away from using them.
Which brings us back to the topic of optimization. Presumably, a loop was chosen here for the sake of efficiency, a perfectly valid concern in the standard collections library of a programming language. With that in mind, let’s take a closer look at the two foldLeft implementations above.
The first version steps through a LinearSeq
using the head
and tail
fields, while the second uses
an iterator to accomplish the same thing. The motivation behind the LinearSeq
version seems to be that
the iterator-based approach introduces unnecessary overhead via additional function calls, e.g. to
next
and hasNext
. One can simply use tail
and isEmpty
directly, instead of invoking them
indirectly through the iterator functions.
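To see where that overhead comes from, here is a simplified sketch (the class name and details are mine, not the actual library source) of how an iterator over a linked sequence can be implemented in terms of those very same operations:

```scala
import scala.collection.LinearSeq

// A simplified sketch of an iterator over a linked sequence: every
// hasNext and next() call is an extra layer of indirection on top of
// isEmpty, head, and tail.
class SketchedLinearSeqIterator[A](private var these: LinearSeq[A]) extends Iterator[A] {
  def hasNext: Boolean = !these.isEmpty // delegates to isEmpty
  def next(): A = {
    val result = these.head             // delegates to head
    these = these.tail                  // delegates to tail
    result
  }
}
```

The iterator-based foldLeft pays for these extra calls on every element; the LinearSeq override simply cuts out the middleman.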
This brings us, at last, to the question I would like to address in this post: is the optimization here at all meaningful? The git commit doesn’t offer any quantitative data to support the changes, but perhaps we can generate some on our own. That is, we’d like to measure the performance of these functions under various conditions, and come to some conclusions about their relative efficiency. This is better known as benchmarking.
Enter JMH
JMH, or the Java Microbenchmark Harness, is a framework one
can use to build benchmarks for JVM programs. We’ll show how to use this tool to build a simple
benchmark for the different incarnations of foldLeft
. Since I am evaluating a Scala
application, I will use the sbt-jmh plugin, which builds
on top of JMH.
Basic Project Layout
As the name suggests, sbt-jmh
is meant to be used with a Scala sbt
project. My simple project
setup looks like this:
├── build.sbt
├── project
│ ├── build.properties
│ └── plugins.sbt
└── src
└── main
└── scala
└── benchmark
└── FoldBenchmark.scala
My build.sbt file is rather simple; all it needs to do is call the enablePlugins function:
name := "Benchmark"
version := "0.1"
scalaVersion := "2.13.3"
enablePlugins(JmhPlugin)
The plugins.sbt
file tells the build system which version of the JMH plugin should be installed:
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")
…and the build.properties
file simply specifies the sbt version:
sbt.version = 1.3.13
The Benchmark
Our JMH benchmark will have two parts: a state class that describes the inputs for a given benchmark invocation (in this case, the size of the list we are folding), and individual benchmark functions that call the desired fold variant.
Benchmark State
The BenchmarkState
class below defines some state that is applied to
each individual benchmark we write. For this particular application, we want to run
each function with different input sizes. To accomplish this,
we define a field with the @Param
annotation, and initialize
that annotation with an Array
of possible values.
@State(Scope.Benchmark)
class BenchmarkState {
  @Param(Array(
    "1000",
    "10000",
    "100000",
    "1000000",
    "10000000"))
  var inputSize: Int = _

  var testSeq: LinearSeq[Int] = _

  @Setup(Level.Trial)
  def prepare(): Unit = {
    testSeq = LinearSeq.range[Int](0, inputSize)
  }
}
Note that the input array contains strings; I’m not entirely sure
why strings are required here, but I assume these values are at
some point passed to a sub-invocation of the JVM as command-line
arguments. At any rate, we can define the field representing a
given input as an Int, and the values will be converted appropriately.
Next, we define a function with the @Setup annotation. With Level.Trial, it runs once before each benchmark trial (that is, before each full set of warmup and measurement iterations). This function simply creates a LinearSeq whose size is determined by the current inputSize.
Benchmark Functions
Our individual benchmarks are implemented in a separate class, in functions with
the @Benchmark
annotation, e.g.:
class FoldBenchmark {
  @Benchmark
  def foldLeftIteratorBaseline(state: BenchmarkState): Int = {
    state.testSeq.iterator.foldLeft(0)(_ + _)
  }

  @Benchmark
  def foldLeftLinearSeqBaseline(state: BenchmarkState): Int = {
    state.testSeq.foldLeft(0)(_ + _)
  }
}
Inside each benchmark, we simply call the desired foldLeft
. Note that an
instance of BenchmarkState
is passed to the function, which we
use to access the current input.
For fun, let’s put in one more benchmark: a recursive implementation of foldLeft, with tail-call optimization enforced by the @annotation.tailrec annotation:
@annotation.tailrec
final def foldLeft[A, B](as: LinearSeq[A], z: B)(f: (B, A) => B): B =
  if (as.isEmpty) z
  else foldLeft(as.tail, f(z, as.head))(f)

@Benchmark
def foldLeftRecursiveBaseLine(state: BenchmarkState): Int = {
  foldLeft(state.testSeq, 0)(_ + _)
}
Running the Benchmark
We can run the benchmark by invoking the following command from the project root:
sbt jmh:run
Note that this can take a very long time (hours), depending on the configuration of the benchmarks. By default, JMH will measure throughput; that is, the number of times the program can run over a given unit of time. This entails running the program many times. JMH also does things like “warmup” each JVM instance before collecting data (by doing dry-runs of your benchmark), to avoid measurements being affected by JVM operations like class loading (perhaps also to give the JVM an opportunity to apply just-in-time optimizations). All of this contributes to a rather hefty runtime for the benchmark itself.
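If you just want a quick (and correspondingly less reliable) estimate, sbt-jmh forwards standard JMH command-line options through the run task. For example, something like the following should trim the run down to a few minutes (-i is measurement iterations, -wi warmup iterations, -f forks, -t threads; the trailing regex selects which benchmarks to run):

```shell
# Quick-and-dirty run: fewer iterations, one fork, one thread,
# restricted to the benchmarks in FoldBenchmark.
sbt "jmh:run -i 3 -wi 3 -f 1 -t 1 .*FoldBenchmark.*"
```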
Results
At the end of its run, JMH spits out a summary like the following:
[info] FoldBenchmark.foldLeftIteratorBaseline 1000 thrpt 25 330731.361 ± 3508.192 ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline 10000 thrpt 25 31538.970 ± 255.618 ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline 100000 thrpt 25 2715.065 ± 130.792 ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline 1000000 thrpt 25 206.659 ± 7.558 ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline 10000000 thrpt 25 21.186 ± 0.312 ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline 1000 thrpt 25 316580.885 ± 1348.899 ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline 10000 thrpt 25 30744.543 ± 515.116 ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline 100000 thrpt 25 2702.528 ± 43.966 ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline 1000000 thrpt 25 191.799 ± 5.364 ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline 10000000 thrpt 25 21.066 ± 0.305 ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine 1000 thrpt 25 340990.156 ± 2884.475 ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine 10000 thrpt 25 32189.315 ± 288.115 ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine 100000 thrpt 25 2592.348 ± 88.177 ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine 1000000 thrpt 25 200.300 ± 8.842 ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine 10000000 thrpt 25 21.819 ± 0.311 ops/s
Summarizing this in a table:
| Function | Input Size 10^3 (ops/s) | Input Size 10^4 (ops/s) | Input Size 10^5 (ops/s) | Input Size 10^6 (ops/s) | Input Size 10^7 (ops/s) |
|---|---|---|---|---|---|
| LinearSeq foldLeft | 316580.885 ± 1348.899 | 30744.543 ± 515.116 | 2702.528 ± 43.966 | 191.799 ± 5.364 | 21.066 ± 0.305 |
| Iterator foldLeft | 330731.361 ± 3508.192 | 31538.970 ± 255.618 | 2715.065 ± 130.792 | 206.659 ± 7.558 | 21.186 ± 0.312 |
| Recursive foldLeft | 340990.156 ± 2884.475 | 32189.315 ± 288.115 | 2592.348 ± 88.177 | 200.300 ± 8.842 | 21.819 ± 0.311 |
There is very little difference between the throughput of these functions. Different benchmarks might expose different results, and more study is required, but on the face of things it appears that the Scala compiler and the JVM’s JIT compiler do a reasonable job of making these three implementations roughly equivalent, performance-wise.
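To put a rough number on “very little difference”: taking the throughput figures from the table above at input size 1000, the gap between the iterator and LinearSeq baselines works out to about 4.5%, and at the larger input sizes the gaps fall within the reported error margins:

```scala
// Throughput figures copied from the results table (ops/s, input size 1000).
val iteratorThrpt  = 330731.361
val linearSeqThrpt = 316580.885

// Relative difference, as a percentage of the LinearSeq baseline.
val relDiffPercent = (iteratorThrpt - linearSeqThrpt) / linearSeqThrpt * 100
// roughly 4.5%; at input size 10^7 the difference (21.186 vs 21.066)
// is smaller than the ± error on either measurement
```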
You can run this demo yourself by checking out the instructions at this repository.