To demonstrate the performance improvements that Artemis automatically achieves and its broad applicability, we applied it to three corpora: 8 popular GitHub projects, 5 projects from the DaCapo benchmark suite, and 30 projects, filtered to meet Artemis’s requirements and sampled uniformly at random from GitHub.

Artemis requires projects with a capable build system and an extensive test suite. Together, these two requirements mean that Artemis must be able to build the project and run it against its test suite.

Our first corpus comprises eight popular GitHub projects, selected to be diverse and to have good test suites. We define a project as popular if it has received at least 200 stars on GitHub, and we deem a test suite good if its line coverage meets or exceeds 70%. The projects in this corpus are usually well written, optimised, and peer reviewed by experienced developers. We applied Artemis to these projects to investigate whether it can find a better combination of data structures than the one selected by experienced human developers.
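The two quantitative criteria above can be written as a simple predicate. The following is a minimal sketch, not the selection script we used: the `Project` record and its field names are hypothetical, and the diversity criterion, which was a manual judgement, is not modelled.

```python
from dataclasses import dataclass

MIN_STARS = 200            # popularity threshold stated in the text
MIN_LINE_COVERAGE = 0.70   # "good test suite" threshold stated in the text


@dataclass(frozen=True)
class Project:
    """Hypothetical record of the metadata needed for the check."""
    name: str
    stars: int
    line_coverage: float   # fraction in [0, 1]


def is_popular(project: Project) -> bool:
    # A project is popular if it has at least 200 GitHub stars.
    return project.stars >= MIN_STARS


def has_good_test_suite(project: Project) -> bool:
    # A test suite is good if line coverage meets or exceeds 70%.
    return project.line_coverage >= MIN_LINE_COVERAGE


def eligible_for_first_corpus(project: Project) -> bool:
    # Both quantitative criteria must hold; diversity is not checked here.
    return is_popular(project) and has_good_test_suite(project)
```

Both thresholds are inclusive, so a project with exactly 200 stars and exactly 70% line coverage passes the check.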

This first corpus might not be representative, precisely because of the popularity of its projects. To address this threat to validity, we turned to the DaCapo benchmarks. The authors of DaCapo built it, from the ground up, to be representative; their goal was to provide the research community with realistic, large-scale Java benchmarks together with a good methodology for Java evaluation. DaCapo (version 9.12) contains 14 open-source, client-side Java benchmarks that come with extensive built-in evaluation. Each benchmark provides accurate measurements of execution time and memory consumption. DaCapo first appeared in 2006, targeting Java 1.5, and has not been updated to work with newer versions of Java. For this reason, we had difficulty compiling all the benchmarks and could use only 5 of the 14: fop, avrora, xalan, pmd and sunflow.

The details of the DaCapo benchmark can be found by visiting

Because of its age and because we use only a subset of it, our DaCapo corpus may not be representative. To counter this threat, we uniformly sampled projects from GitHub, discarding those that did not meet Artemis’s constraints, such as being equipped with a build system, until we had collected 30 projects. These projects are diverse in both domain and size. They include static analysers, testing frameworks, web clients, and graph processing applications. Their sizes vary from 576 to 94K lines of code, with a median of 14,881 lines. Their popularity varies from 0 to 5,642 stars, with a median of 52 stars per project. The median number of tests is 170 and the median line coverage is 72%.
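The sample-and-discard procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the tooling we used: `sample_until` and its parameters are hypothetical names, the candidate list is assumed to be available up front, and `meets_requirements` stands in for the check that Artemis can build a project and run its test suite.

```python
import random
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_until(
    candidates: Iterable[T],
    meets_requirements: Callable[[T], bool],
    target: int = 30,
    seed: Optional[int] = None,
) -> List[T]:
    """Draw candidates uniformly at random without replacement,
    discarding those that fail the check, until `target` are kept."""
    rng = random.Random(seed)
    pool = list(candidates)
    # Visiting a uniformly shuffled pool in order is equivalent to
    # repeatedly drawing uniformly at random without replacement.
    rng.shuffle(pool)

    selected: List[T] = []
    for project in pool:
        if len(selected) == target:
            break
        if meets_requirements(project):
            selected.append(project)
    # May hold fewer than `target` items if the pool is exhausted.
    return selected
```

Because rejected candidates are simply skipped, the accepted projects form a uniform sample of the subset that satisfies the requirements, which matches filtering first and then sampling.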