Coalescing slow cases to improve JIT memory usage.
The size of our JIT-generated code is a known memory issue. According to Oliver, the slow cases for some of our operations are on the order of 128 bytes each. It occurred to me that we could shrink the JITed code by compiling each slow case only once and having all subsequent generated code jump to that shared slow case. The issue with this is that our slow cases jump back to specific locations in the hot path, with potentially different values on the stack, as opposed to a normal function, which returns to a very specific state. We can circumvent this by hand-writing the assembly to return to an offset from the return address with the required state.

What do people think?

-- Nathan
On Jun 25, 2010, at 4:20 PM, Nathan Lawrence wrote:
I think it's a good idea. It may even help speed, in addition to memory use. I bet there are also some slow cases that can only return to exactly one place, and therefore could act like more conventional function calls.

Regards,
Maciej
Neat idea.

One thing to consider is that some "slow" cases are pretty hot. For example, on 64-bit, double addition is a "slow" case. Turning double addition into a function call might be too expensive.

I'd suggest instrumenting the code to count how often certain slow cases are taken, and strategizing accordingly. Maybe code in these slow cases needs to move onto the hot path. Alternatively, maybe rare slow cases can change to functions, and hot slow cases can remain as they are.

Geoff
On Jun 25, 2010, at 5:12 PM, Geoffrey Garen wrote:
Neat idea.
One thing to consider is that some "slow" cases are pretty hot. For example, on 64bit, double addition is a "slow" case. Turning double addition into a function call might be too expensive.
I don't believe a call to a fixed address is significantly more expensive than a jump to a fixed address. I don't know about ret vs. jump to a fixed address, but I expect that not to be huge either. At least on CPUs with decent branch prediction, the branch predictor is designed to handle the call/ret pattern well.

Regards,
Maciej
Hey Nathan,

Absolutely - this is effectively what we do for calls, where the bulk of the work of the slow case is performed by a shared routine. We experimented with using this technique more broadly in the early stages of developing the JIT, and back then there was a measurable performance degradation from the call overhead – but the code has changed a lot since then, and the tradeoffs and requirements may be different now (particularly across the varying hardware platforms the JIT has since been ported to).

However, returning to an offset from the return address is probably not a good plan. Upon executing a call, processors commonly cache the return address of the call instruction in a circular buffer used to predict return destinations. When the processor reaches the return instruction, it pops a value from this return-address stack to predict the destination of the return. If you change the return address, you're going to get a mispredict and probably a pipeline flush. (We modify the return address in our exception-handling path, but we don't expect exceptions to be high performance.)

Also bear in mind that there is no conditional call instruction on x86, so to eliminate the slow path altogether you'd have to litter the hot path with inverted branches over the calls out to the trampolines. I'd suggest you'd be more likely to find success keeping the hot path branching out to slow cases, and experimenting with moving the bulk of the work of larger slow cases out into shared routines (which would be called from the slow case).

This is certainly an interesting area to investigate.

cheers,
G.
Hi,

The second part is true for ARM as well. Cortex-A8 and A9 have an 8-entry buffer to predict return addresses. If prediction fails, the CPU flushes this buffer, and that becomes a big performance loss, since it can no longer predict even those return addresses which were on the stack before (it would be ineffective to implement better hardware logic for this). An example:

1) A calls B, B calls C, C calls D
2) D puts something in the link register and returns
3) this flushes the return predictor, so the return addresses of C, B, and A can no longer be predicted either

I agree with the second part about moving big chunks of common code into custom functions. This would basically be MacroAssembler-based inline assembly, and would practically be a static function. Would it be possible to generate the assembly for such functions at compile time and link it into the code? That would reduce the initial startup time of jsc, and the resulting code would probably be faster as well, although this would be against the FMW (favourite magic word = complexity), since the macro assembler would also have to generate assembly.

Zoltan
_______________________________________________ squirrelfish-dev mailing list squirrelfish-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/squirrelfish-dev
participants (5)

- Gavin Barraclough
- Geoffrey Garen
- Maciej Stachowiak
- Nathan Lawrence
- Zoltan Herczeg