Missing JVM Optimization 2017-11-18

A professor at my department was benchmarking a small Java program to see what impact changing the type of a loop variable had. While doing this he found a case where changing the loop bound variable type had unexpected impact. My simplified version of his loop is below:

static void test() {
  int N = 10000; // long N is much slower!
  long sum = 0;
  for (int i = 0; i < N; i++) {
    sum += i;
  }
}

When the professor changed the type of N from int to long, the loop ran much slower. He measured this by running the test method a million times and measuring the total execution time.

His results showed that when N had type long, the loop ran many orders of magnitude slower. That seemed very odd: N is constant during the loop, and he ran the program on a 64-bit machine so it should not cost that much to compare two 64-bit integers.

Here is the main method of the benchmark program if you want to try it for yourself:

public static void main (String args[]) {
  long start = System.currentTimeMillis();
  for (int i = 0; i < 1000000; i++) {
    test();
  }
  long time = System.currentTimeMillis() - start;
  System.out.format("time = %d ms%n", time);
}

When I ran the version where N has type int, using Java 8 (1.8.0_112) on a 64-bit Intel machine with Linux, it took 5-7 milliseconds. When I ran the version where N has type long, it took 10 seconds! At this point it was pretty obvious what is going on. You may have already noticed that test does not return the sum that the loop computes. This means that the loop is dead: it could be deleted from the program without affecting the observable output of the program.

To confirm that the long N version is indeed suffering from lack of dead code elimination, I used a diagnostic JVM option to print the optimized assembly code for the test method:

java -XX:+UnlockDiagnosticVMOptions \
    -XX:CompileCommand=print,*.test TestI

This gave a lot of output. First it prints the C1 optimized version of the method twice. That’s lightly optimized bytecode according to the HotSpot glossary. Then it prints the C2 optimized version twice. When a method is run enough times, the JVM will try to optimize it using the more powerful (and slower to compile) C2 compiler. I don’t know why the disassembly is printed twice for the C1 and C2 versions.

When looking at the C2 compiled version of the TestI program (with int N), it is easy to see that the loop inside the test method is entirely eliminated:

Compiled method (c2)      84   17       4       TestI::test (25 bytes)
**SNIP**
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00007fdb68fc9410} 'test' '()V' in 'TestI'
  #           [sp+0x20]  (sp of caller)
  0x00007fdb6d111840: sub    $0x18,%rsp
  0x00007fdb6d111847: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - TestI::test@-1 (line 13)
  0x00007fdb6d11184c: add    $0x10,%rsp
  0x00007fdb6d111850: pop    %rbp
  0x00007fdb6d111851: test   %eax,0x1823a7a9(%rip)        # 0x00007fdb8534c000
                                                ;   {poll_return}
  0x00007fdb6d111857: retq
**SNIP**

The first two instructions are just setting up the stack frame, and then the next two instructions pop the stack frame. I don’t know what the purpose of the test instruction is here, but it just compares a register to some value at some specific memory address.

I ran the same command again, but using TestL (with long N). Here is the decompiled C2 code:

Compiled method (c2)      77   16       4       TestL::test (30 bytes)
**SNIP**
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00007f2df15cf428} 'test' '()V' in 'TestL'
  #           [sp+0x20]  (sp of caller)
  0x00007f2df5111ba0: sub    $0x18,%rsp
  0x00007f2df5111ba7: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - TestL::test@-1 (line 13)

  0x00007f2df5111bac: xor    %r10d,%r10d
  0x00007f2df5111baf: xor    %r11d,%r11d
  0x00007f2df5111bb2: mov    $0x1,%r8d
  0x00007f2df5111bb8: jmp    0x00007f2df5111bcc
  0x00007f2df5111bba: nopw   0x0(%rax,%rax,1)
  0x00007f2df5111bc0: mov    %r8d,%r9d
  0x00007f2df5111bc3: inc    %r9d
  0x00007f2df5111bc6: mov    %r8d,%r11d
  0x00007f2df5111bc9: mov    %r9d,%r8d          ;*iinc
                                                ; - TestL::test@23 (line 15)
  0x00007f2df5111bcc: movslq %r8d,%r9
  0x00007f2df5111bcf: movslq %r11d,%r11
  0x00007f2df5111bd2: add    %r11,%r10          ; OopMap{off=53}
                                                ;*goto
                                                ; - TestL::test@26 (line 15)

  0x00007f2df5111bd5: test   %eax,0x1881f425(%rip)        # 0x00007f2e0d931000
                                                ;*goto
                                                ; - TestL::test@26 (line 15)
                                                ;   {poll}
  0x00007f2df5111bdb: cmp    $0x2710,%r9
  0x00007f2df5111be2: jl     0x00007f2df5111bc0  ;*ifge
                                                ; - TestL::test@14 (line 15)
  0x00007f2df5111be4: add    $0x10,%rsp
  0x00007f2df5111be8: pop    %rbp
  0x00007f2df5111be9: test   %eax,0x1881f411(%rip)        # 0x00007f2e0d931000
                                                ;   {poll_return}
  0x00007f2df5111bef: retq

You can already see that this version is doing a lot more stuff. Notice the cmp and jl instructions near the end: those are for testing the loop condition and running another iteration. This confirms that the loop has not been eliminated in the long N case.

Why is the loop not eliminated? There could be several reasons. One case where it would not be safe to delete the loop is if N was a parameter of the test method, and the loop variable had type int. In that case, the loop variable could in some cases never be larger or equal to N. It might be that the optimizer is conservatively treating all cases with a long upper bound as this potential special case. The C2 compiler does not have a lot of time to do its work, so it is possible that it’s overly conservative in this case to save time in other similar cases.

Categories: Uncategorized

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.