Samstag, 9. Juli 2011

DTrace and it's impact on JVM performance

So, I did this blogpost that gave a very shallow introduction to what DTrace can do with the JVM and I got an amazing feedback after that. Actually I'm quite flattered that it had such an impact.

One message I received pointed out that DTrace probes are having a significant impact on the JVM's performance, he (the author) would never recommend using the DTrace JVM integration, and then he pointed me… well, a product by his own company (which shall remain nameless, since I don't want to advertise commercial products on this blog). To be fair: I admit that I would do the same. When you put a lot of effort into a piece of software you do it for reasons you believe in.

However, he gave me a link to some benchmarks that showed some impressive results and diagrams about how much the performance of the JVM suffers when DTrace probes are enabled. The bar diagrams were scary, DTrace looked pretty bad compared to the "rival" technology (you can't really call it a rival since DTrace has a completely different objective). But something was fishy about this, and that was: the axes of the diagrams were not labeled. They did show a small blue bar and a big green bar and nothing else. The code for the test case was provided as a gif (no copy and paste to reproduce the results nice and easy). Numbers were not put into any perspective. And blog comments were disabled.

None the less, this was interesting enough to start a little bit of research on the topic.

The benchmarks seemed to focus on how much enabled probes did slow down method calls. I personally don't like benchmarks that use extremely unrealistic cases as a foundation (look how fast I can increment them integers in a tight loop suckaz! Mah language iz teh fast!) but this time I will just go with the flow because this is pretty much what they did in that benchmark (don't try this at home kids, use realistic conditions to benchmark your stuff). I'm not using the same test code here but the idea seems to be pretty much the same.

The system I'm running this on is a Thinkpad X201 with a core i7 and 8 gigs of RAM (yeah I know, I'm just compensating, get over it ;)). The operating system is OpenIndiana Build 151-beta with DTrace 1.7. Java is at 1.6.0_25.

The stupid test case I will be using is this Java program:

class Test {

public int callTest(int i) {
if (i != 0)
callTest(i - 1);

return i;

public static void main(String[] args) {
Test t = new Test();

long starttime = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++)
long endtime = System.currentTimeMillis();

System.out.println(endtime - starttime);

Once again, this is not a realistic example. I will just use it to demonstrate that there really is an impact.

> java Test
> java -XX:+ExtendedDTraceProbes Test

Wow, this really seems to hurt… the programm is about 35 times slower with DTrace probes enabled.

To put this into perspective I will commit a capital crime in programming. I will compare this stupid program in language A to the same stupid program written in language B. Let language B be Python in this case. To be crystal clear about this: this is NOT a comparison of Python and Java performance. I just need some landmarks to make all those numbers meaningful (at least to some extent since this is a very badly chosen scenario to begin with).

Here is our Python test case.

import time

def callTest(i):
if not i == 0:
callTest(i - 1)

return i

r = range(0,1000000)

starttime = time.time()

for i in r:

endtime = time.time()
print (endtime - starttime)

And the result:

> python

OK, our software runs slower with probes enabled, but we are still faster than Python and Python's performance is acceptable for a lot of use cases. We now have a slower JVM that can be instrumented. So I'd say: No real harm done.

Now let's use those probes and aggregate some real data. For this test I will use a slightly modified version of j_methodcalls.d, a script by Brendan Gregg that is shipped with the DTrace Toolkit. The script is licensed under CDDL but I did remove the license header here to make it more concise and blog-friendly.

#!/usr/sbin/dtrace -Zs

this->class = (char *)copyin(arg1, arg2 + 1);
this->class[arg2] = '\0';
this->method = (char *)copyin(arg3, arg4 + 1);
this->method[arg4] = '\0';
this->name = strjoin(strjoin(stringof(this->class), "."),
@calls[pid, this->name] = count();

Let's run it!

> pfexec ./j_methodcalls.d -c "java -XX:+ExtendedDTraceProbes Test"

OK, this is A LOT slower, even slower than Python. 4 Minutes!

We are now aggregating data and that means we are copying data from userspace into kernelspace from where it can be fetched by DTrace consumers (in our case the dtrace command line tool). What data did we get in this amount of time? Actually: A LOT. It flooded my terminal and I had to pipe the result into less to be able to read all of it. DTrace recorded every method call that happened in the JVM while the program was running. It counted the calls per method, and the class the method belongs to. Keep in mind that we did not need to modify our code to get these results. We don't even need to restart a running JVM to enable the probes we can activate them by using the jinfo command. And we could have used DTrace to gather system wide data in the same script (something I might demonstrate sometime on this blog)

Now lets use the most naive debugging technique on earth. We will print out "CALLED!" every time our callTest method gets called (if you ever did this before you know how disastrous the result well be). This gives us pretty much no information. We just know that a particular method has been called and we need to modify our code, recompile and load it into running JVM.

> time java Test

real 1h39m18.15s
user 3m53.29s
sys 7m55.68s

As we expected, the result is a disaster. Calling print in a tight loop is an extremely stupid thing to do. We could have used a counter that get incremented with every method call, proxy objects, interceptors etc. (all of them would have been significantly faster).

To do something similar like the print example with DTrace I just a another clause to the script:

tick-1s {

This addition prints out what happened in 1 second intervals

1 75779 :tick-1s
4028 Test.callTest 400379

1 75779 :tick-1s
4028 Test.callTest 404720

1 75779 :tick-1s
4028 Test.callTest 402135

1 75779 :tick-1s
4028 Test.callTest 398934

dtrace: pid 4028 has exited

real 4m14.23s
user 4m13.89s
sys 0m0.46s

The performance impact stays pretty much the same with DTrace, we are done in 4 Minutes while we are presented with a readable stream of information.

There are a lot of ways to generate similar data, but most of them require code changes, are not able to do system wide tracing, are limited to one process and/or just one specific runtime.


Tracing the JVM costs (this shows especially in this pathological use case), but DTrace provides us with a very broad spectrum of probes. The JVM ones are just one source of data. We can actually instrument every part of the system with our DTrace script. Maybe a problem is not even related to our program at all, maybe it's NFS misbehaving, something is wrong with the database or there is some heavy IO going on. With DTrace the whole system becomes transparent. This changes the whole "blame game" and that's the whole point of DTrace. Looking at the system as a whole.

The bottom line is: trace the JVM only if you need to and be aware of the performance impact. This tool is for hunting down problems that are very hard or even impossible to analyze with traditional tools. I did use it to trace obscure memory leaks and dead-locks (both in non-Java contexts) and I was able to exactly pinpoint the culprit.

Don't use DTrace when there is a tool that does a better job for this specific task. Use it wisely. Anyway, it's a great utility to have in your toolbox.

Last but not least: use realistic use cases for benchmarking, label your diagram axes, and compare tools that have the same objective.

Keine Kommentare: