A few months ago I implemented the Möller-Trumbore ray-triangle intersection algorithm (trivia: the inventor, Tomas Möller, works in the graphics group at my CS department). My initial implementation looked like this (slightly edited):

```
// e1 and e2 are triangle edges, o is origin
// ray = input ray
Vec pvec = new Vec(0, 0, 0);
Vec qvec = new Vec(0, 0, 0);
Vec tvec = new Vec(0, 0, 0);
pvec.cross(ray.d, e2);
double det = e1.dot(pvec);
if (det > -EPSILON && det < EPSILON)
return false;
tvec.sub(ray.o, o);
double u = tvec.dot(pvec) / det;
if (u < 0 || u > 1)
return false;
qvec.cross(tvec, e1);
double v = ray.d.dot(qvec) / det;
if (v < 0 || (u+v) > 1)
return false;
double t = e2.dot(qvec) / det;
return t < tNear;
```

After making sure the above was correct I started optimizing it. First I replaced the three divisions by `det`

by calculating the reciprocal of `det`

and then multiplying with that. This replaces three divisions with one division and three multiplications. This should be an improvement if floating point division is at least 1.5 times slower than floating point multiplication, which is generally true.

Next, I removed the temporary Vec objects. This is where things get interesting. The idea is to avoid allocating these temporary objects on the heap and replace them with stack-allocated local variables. The code gets a lot bigger because you have to inline all the cross product and dot product code in the intersection algorithm. The inlined version looks like this:

```
double px=0,py=0,pz=0;
double qx=0,qy=0,qz=0;
double tx=0,ty=0,tz=0;
px = ray.d.y * e2.z - ray.d.z * e2.y;
py = ray.d.z * e2.x - ray.d.x * e2.z;
pz = ray.d.x * e2.y - ray.d.y * e2.x;
double det = e1.x*px + e1.y*py + e1.z*pz;
if (det > -EPSILON && det < EPSILON)
return false;
double recip = 1 / det;
tx = ray.o.x - o.x;
ty = ray.o.y - o.y;
tz = ray.o.z - o.z;
double u = (tx*px + ty*py + tz*pz) * recip;
if (u < 0 || u > 1)
return false;
qx = ty * e1.z - tz * e1.y;
qy = tz * e1.x - tx * e1.z;
qz = tx * e1.y - ty * e1.x;
double v = (ray.d.x*qx + ray.d.y*qy + ray.d.z*qz) * recip;
if (v < 0 || (u+v) > 1)
return false;
double t = (e2.x*qx + e2.y*qy + e2.z*qz) * recip;
return t < tNear;
```

The code certainly became less readable, but is it faster? The answer may depend on the JVM you use to run the code, but in most cases the above will NOT be faster. It can even be slower than the non-inlined version.

The Java language has no built-in way to explicitly allocate reference objects on the stack, as C++ does. However, the Java runtime can still just-in-time optimize your code to avoid unnecessary heap allocations. The Java JIT compiler tries to detect objects that don’t escape a method and allocate those on the stack instead. An object does not escape a method if it is not returned from the method, not assigned by reference in some other object, and not passed as an argument to some other method.

It can be difficult to figure out exactly when things are moved from heap to stack, but using the Caliper microbenchmark framework you can get a good idea. I benchmarked two versions of my intersection code against each other. One that used inlining and one that did not. It turned out that the one using inlining was consistently slower, though not by much, than the one not using manual inlining. The small difference in performance that occurred is probably caused by the way the JIT compiler happened to layout the code differently in each case, incidentally giving a slightly better cache coherence advantage to the non-manually-inlined version.

I also benchmarked the versions using the reciprocal multiplication optimization I discussed above and found that using this technique was beneficial, as originally expected.

## Leave a Reply