Folding at Home, Part 5: Benchmarking with JMH

If you investigate the fold function (in its various flavors) in the Scala standard library, you’ll invariably notice that different collection interfaces provide subtly different implementations of this higher-order function. The ostensible reason: performance.

Consider, for example, this commit to the Scala collections library. The commit message reads: “Removing a couple of indirection levels gives some speed-up.” Among the changes is an implementation of foldLeft tailored to the LinearSeq trait, specifically:

override def foldLeft[B](z: B)(op: (B, A) => B): B = {
  var acc = z
  var these: LinearSeq[A] = coll
  while (!these.isEmpty) {
    acc = op(acc, these.head)
    these = these.tail
  }
  acc
}

This overrides a more general version of the function, found in the IterableOnce trait:

def foldLeft[B](z: B)(op: (B, A) => B): B = {
  var result = z
  val it = iterator
  while (it.hasNext) {
    result = op(result, it.next())
  }
  result
}

Before comparing these functions, note that both use while loops, unlike our implementations, which used recursion and relied on tail-call optimization to avoid stack problems. On one hand, this demonstrates Scala’s flexibility: one can freely move back and forth between imperative and functional styles, based on what one deems appropriate for a given situation. On the other hand, I can’t help but feel betrayed. Implementing a key higher-order function with a while loop is nothing short of blasphemy. We must inform the Office of Functional Purity, and summon a terrible Inquisition upon the heads of the unfaithful Scala library maintainers.

Perhaps “betrayed” is a strong word. These iterative foldLeft implementations are eminently reasonable and comprehensible. Perhaps it’s simply a touch amusing that Scala goes to great lengths to make recursive functions efficient, yet, internally, shies away from using them.

Which brings us back to the topic of optimization. Presumably, a loop was chosen here for the sake of efficiency, a perfectly valid concern in the standard collections library of a programming language. With that in mind, let’s take a closer look at the two foldLeft implementations above.

The first version steps through a LinearSeq using the head and tail fields, while the second uses an iterator to accomplish the same thing. The motivation behind the LinearSeq version seems to be that the iterator-based approach introduces unnecessary overhead via additional function calls, e.g. to next and hasNext. One can simply use tail and isEmpty directly, instead of invoking them indirectly through the iterator functions.
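To make that indirection concrete, here is a simplified sketch of what an iterator over a cons-style list does under the hood. This is not the actual library code, and ConsIterator is a hypothetical name for illustration only; the point is that each pass through the loop costs a hasNext and a next() call, which in turn delegate to the same isEmpty, head, and tail operations the LinearSeq version invokes directly.

```scala
// Hypothetical sketch: an iterator that walks a cons list via head/tail.
class ConsIterator[A](private var rest: List[A]) extends Iterator[A] {
  def hasNext: Boolean = !rest.isEmpty // delegates to isEmpty
  def next(): A = {
    val h = rest.head                  // delegates to head
    rest = rest.tail                   // delegates to tail
    h
  }
}

// Folding through the iterator: two extra method calls per element,
// compared with calling isEmpty/head/tail on the list directly.
def sumViaIterator(xs: List[Int]): Int = {
  var acc = 0
  val it = new ConsIterator(xs)
  while (it.hasNext) acc += it.next()
  acc
}
```

Whether the JIT compiler inlines this indirection away is, of course, exactly the question a benchmark can help answer.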

This brings us, at last, to the question I would like to address in this post: is the optimization here at all meaningful? The git commit doesn’t offer any quantitative data to support the changes, but perhaps we can generate some on our own. That is, we’d like to measure the performance of these functions under various conditions, and come to some conclusions about their relative efficiency. This is better known as benchmarking.

Enter JMH

JMH, or the Java Microbenchmark Harness, is a framework one can use to build benchmarks for JVM programs. We’ll show how to use this tool to build a simple benchmark for the different incarnations of foldLeft. Since I am evaluating a Scala application, I will use the sbt-jmh plugin, which builds on top of JMH.

Basic Project Layout

As the name suggests, sbt-jmh is meant to be used with a Scala sbt project. My simple project setup looks like this:

├── build.sbt
├── project
│   ├── build.properties
│   └── plugins.sbt
└── src
    └── main
        └── scala
            └── benchmark
                └── FoldBenchmark.scala

My build.sbt file is rather simple; all it needs to do is call the enablePlugins function:

name := "Benchmark"
version := "0.1"
scalaVersion := "2.13.3"
enablePlugins(JmhPlugin)

The plugins.sbt file tells the build system which version of the JMH plugin should be installed:

addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")

…and the build.properties file simply contains the sbt version:

sbt.version = 1.3.13

The Benchmark

Our JMH benchmark will have two parts: a state class that describes the inputs for a given benchmark invocation (in this case, the size of the list we are folding), and individual benchmark functions that call the desired fold variant.

Benchmark State

The BenchmarkState class below defines some state that is applied to each individual benchmark we write. For this particular application, we want to run each function with different input sizes. To accomplish this, we define a field with the @Param annotation, and initialize that annotation with an Array of possible values.

@State(Scope.Benchmark)
class BenchmarkState {
  @Param(Array(
    "1000",
    "10000",
    "100000",
    "1000000",
    "10000000"))
  var inputSize: Int = _

  var testSeq: LinearSeq[Int] = _

  @Setup(Level.Trial)
  def prepare(): Unit = {
    testSeq = LinearSeq.range[Int](0, inputSize)
  }
}

Note that the input array contains strings; I’m not entirely sure why strings are required here, but I assume these values are at some point passed to a sub-invocation of the JVM as command-line arguments. At any rate, we can declare the field representing a given input as an Int, and the values will be converted appropriately.

Next, we define a function annotated with @Setup, which is called at the start of each benchmark trial. This function simply creates a LinearSeq whose size is determined by the current inputSize.

Benchmark Functions

Our individual benchmarks are implemented in a separate class, in functions with the @Benchmark annotation, e.g.:

class FoldBenchmark {
  @Benchmark
  def foldLeftIteratorBaseline(state: BenchmarkState): Int = {
    state.testSeq.iterator.foldLeft(0)(_ + _)
  }

  @Benchmark
  def foldLeftLinearSeqBaseline(state: BenchmarkState): Int = {
    state.testSeq.foldLeft(0)(_ + _)
  }
}

Inside each benchmark, we simply call the desired foldLeft. Note that an instance of BenchmarkState is passed to the function, which we use to access the current input.

For fun, let’s put in one more benchmark: a recursive implementation of foldLeft, with tail-call optimization enforced via the @tailrec annotation:

@annotation.tailrec
final def foldLeft[A,B](as: LinearSeq[A], z: B)(f : (B,A) => B): B =
  if (as.isEmpty) z
  else foldLeft(as.tail, f(z, as.head))(f)

@Benchmark
def foldLeftRecursiveBaseLine(state: BenchmarkState): Int = {
  foldLeft(state.testSeq, 0)(_ + _)
}
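Before measuring anything, it’s worth a quick sanity check that the recursive version computes the same result as the library’s foldLeft. A minimal sketch, reusing the recursive implementation above (renamed foldLeftRec here to avoid shadowing the library method):

```scala
import scala.collection.LinearSeq

// The recursive fold from above, under a distinct name.
@annotation.tailrec
def foldLeftRec[A, B](as: LinearSeq[A], z: B)(f: (B, A) => B): B =
  if (as.isEmpty) z
  else foldLeftRec(as.tail, f(z, as.head))(f)

// Both variants should sum 0 until 1000 to the same value.
val xs: LinearSeq[Int] = LinearSeq.range(0, 1000)
val expected = xs.foldLeft(0)(_ + _)
val actual   = foldLeftRec(xs, 0)(_ + _)
```

If the two disagree, the benchmark numbers are meaningless, so this kind of check is cheap insurance.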

Running the Benchmark

We can run the benchmark by invoking the following command from the project root:

sbt jmh:run

Note that this can take a very long time (hours), depending on the configuration of the benchmarks. By default, JMH will measure throughput; that is, the number of times the program can run over a given unit of time. This entails running the program many times. JMH also does things like “warmup” each JVM instance before collecting data (by doing dry-runs of your benchmark), to avoid measurements being affected by JVM operations like class loading (perhaps also to give the JVM an opportunity to apply just-in-time optimizations). All of this contributes to a rather hefty runtime for the benchmark itself.
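If the default configuration takes too long for casual experimentation, JMH command-line options can be passed through sbt-jmh. For example, the following (fewer warmup and measurement iterations, a single fork, and a regex selecting only our benchmark class) trades some statistical rigor for a much faster turnaround:

```shell
sbt "jmh:run -i 3 -wi 3 -f 1 .*FoldBenchmark.*"
```

Here -i sets the number of measurement iterations, -wi the warmup iterations, and -f the number of forked JVMs; run `sbt "jmh:run -h"` for the full list of options.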

Results

At the end of its run, JMH spits out a summary like the following:

[info] FoldBenchmark.foldLeftIteratorBaseline          1000  thrpt   25  330731.361 ± 3508.192  ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline         10000  thrpt   25   31538.970 ±  255.618  ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline        100000  thrpt   25    2715.065 ±  130.792  ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline       1000000  thrpt   25     206.659 ±    7.558  ops/s
[info] FoldBenchmark.foldLeftIteratorBaseline      10000000  thrpt   25      21.186 ±    0.312  ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline         1000  thrpt   25  316580.885 ± 1348.899  ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline        10000  thrpt   25   30744.543 ±  515.116  ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline       100000  thrpt   25    2702.528 ±   43.966  ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline      1000000  thrpt   25     191.799 ±    5.364  ops/s
[info] FoldBenchmark.foldLeftLinearSeqBaseline     10000000  thrpt   25      21.066 ±    0.305  ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine         1000  thrpt   25  340990.156 ± 2884.475  ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine        10000  thrpt   25   32189.315 ±  288.115  ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine       100000  thrpt   25    2592.348 ±   88.177  ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine      1000000  thrpt   25     200.300 ±    8.842  ops/s
[info] FoldBenchmark.foldLeftRecursiveBaseLine     10000000  thrpt   25      21.819 ±    0.311  ops/s

Summarizing this in a table:

| Function | Input size 10^3 (ops/s) | Input size 10^4 (ops/s) | Input size 10^5 (ops/s) | Input size 10^6 (ops/s) | Input size 10^7 (ops/s) |
|---|---|---|---|---|---|
| LinearSeq foldLeft | 316580.885 ± 1348.899 | 30744.543 ± 515.116 | 2702.528 ± 43.966 | 191.799 ± 5.364 | 21.066 ± 0.305 |
| Iterator foldLeft | 330731.361 ± 3508.192 | 31538.970 ± 255.618 | 2715.065 ± 130.792 | 206.659 ± 7.558 | 21.186 ± 0.312 |
| Recursive foldLeft | 340990.156 ± 2884.475 | 32189.315 ± 288.115 | 2592.348 ± 88.177 | 200.300 ± 8.842 | 21.819 ± 0.311 |

There is very little difference between the throughput of these functions. Different benchmarks might expose different results, and more study is required, but on the face of things it appears that the Scala compiler and JVM runtime-JIT do a reasonable job of making these three implementations roughly equivalent, performance-wise.

You can run this demo yourself by checking out the instructions at this repository.
