Subject: LuaJIT roadmap 2008
From: Mike Pall <mikelu-0802@...>
Date: Fri, 1 Feb 2008 19:09:22 +0100
There have been several requests to publish a roadmap for LuaJIT,
so here it is:
— LuaJIT 1.1.4 will be released in the next few days. This is
strictly a minor bug fix upgrade to catch up with Lua 5.1.3.
— CoCo 1.1.4 will be released at the same time, because Lua 5.1.3
makes some incompatible changes to the C call depth counting
(basically reverse-patching these parts back to Lua 5.1.2).
I've been working on LuaJIT 2.x for quite some time now. It took
much longer than I had expected. I'm now at the third complete
redesign, because the first two approaches I tried didn't work
out: 1. shadow SSA info to augment the old bytecode got too
complex. 2. one-pass on-the-fly SSA generation doesn't work for
Lua because of an ugly corner-case involving upvalue semantics.
However I've gotten quite far with the third approach, namely a
fast interpreter combined with a trace compiler (detailed
description at the end of this posting).
The VM has been designed from the ground up to support the new
execution model. It still shares some components with the
standard Lua 5.1 interpreter, but many parts have changed.
Here are the most interesting changes:
— The bytecode format has been redesigned. It's much more
orthogonal, partially specialized and tuned for faster
— The bytecode interpreter has been completely rewritten in
hand-tuned x86 assembler. It has been optimized to reduce
branch prediction misses and cache misses. Scheduling has been
tuned and instruction-level parallelism has been improved. It
features extremely fast dispatch ("indirect threading") and
it's very compact. The commonly used parts of the code fit into
a single 4K page. Your current-generation out-of-order- and
speculative-execution CPU monster under your desk will love it.
— Object references are still tagged values, but now only use
half the space (8 bytes). IEEE-754 doubles are overlapped with
other object references. The object tags have a special
encoding which is an unused FP NaN. Amongst other things this
means stacks and arrays need 50% less space. D-Caches rejoice.
— The explicit frame stack (CallInfo) is gone. The frame
structure is implicitly stored in the regular Lua stack (tagged
values). This speeds up call handling substantially and also
simplifies the compiler (uniform treatment of slots and frames).
— Exception handling has been redesigned along with the new frame
structure. A pcall() is almost zero-cost now.
— Coroutines are completely "stackless" again. No C stack needs
to be allocated for their use. This is quite a departure from
the concept for LuaJIT 1.x -- the only drawback is that you
cannot yield *and return* to a C function. But you can of
course yield across pcall() and from iterators and metamethods
(which is what most people really want or need).
— Many of the standard library functions are now "internal"
functions. The fast paths are written in assembler, the
fallback paths are written in C. Apart from dramatic speedups
this also enables the stackless nature of coroutines.
Lots of other minor things have changed, like the fast string
hash (see the backport I posted a few weeks ago).
Preliminary benchmarks show that *the interpreter alone* is
already 2x-4x faster in most cases. Embarassingly this is better
than LuaJIT 1.x in a few benchmarks. Much higher speedups are
still to be expected from the trace compiler.
But the VM itself is nowhere near complete. Lots of rough edges
remain. Debug hooks are not yet implemented (I want them to be
zero-cost when not active). Other debug functionality, like
tracebacks, already work fine. Many more library functions need
to be internalized. And a plethora of little details still bear
prominent TODO signs.
The biggest stumbling block right now is the trace compiler and
all the associated machinery. I guess I need to redesign trace
recording (again), because a simple static approach to ucode
generation doesn't seem to work out well. The x86 backend and the
register allocator is not there yet, most optimizations are still
in their infancy. Consider it vaporware for now.
In short: no realistic ETA, yet -- "2008" is the best promise I'm
able to give. Before you ask: prereleases are pointless right
now, because you wouldn't get a fully working VM and I would
drown in support requests.
On the positive side, the new VM should be easier to port to
other CPU architectures. But there's more to consider, like a
different number type (on platforms without hardware FP) or
different pointer sizes (x64). This will certainly complicate
things. In any case, a port is not something I want to embark on
before the x86 version is stable.
My list of things I originally wanted to be in the first release
has already shrunken considerably. Some things NOT to expect:
— Redesigned memory management: faster allocation and
cache-friendly marking. Got many ideas and all the literature
I'd ever want to read -- not sure when I'll approach this.
— Concurrency. I've got a neat idea about a specific concurrency
model: basically a GIL which is only unlocked for the GC, for
blocking on I/O and for CPU-intensive code paths. I.e. the
interpreter would run locked, trace compiled code paths run
unlocked or lock on demand. C calls which block or may take
lots of CPU time need to unlock explicitly. I don't know
whether this works out in practice though.
— A compiler for pattern matches (regexes). I've got the basic
design and some code, but it doesn't fit the current execution
model anymore. Not that important, so it'll have to wait.
How a trace compiler works
The LuaJIT 2.x VM operates in four phases:
2. Interpretation + Trace Recording
3. Trace Compilation
4. Trace Execution
Here's a simplified description of the different phases:
1. Interpret the bytecode and keep running statistics on loops
and calls. As soon as certain thresholds are reached, the current
code path is considered "hot" and the VM proceeds to phase 2 and
2. Keep on interpreting the bytecode, but at the same time record
its actions. This involves not just the bytecodes themselves, but
also the individual steps taken to execute them (e.g. think about
a table lookup which might involve metamethods). The relatively
coarse bytecode is translated into a form of SSA-based microcode
(ucode) which captures much more detail.
Recording strictly follows the control flow at runtime, which
effectively flattens only the used parts of the code to a linear
"trace". Dead code paths are never recorded, just like uncommon
("cold") paths. Function calls are treated just like any other
control flow transition and tracing continues across them (which
is equivalent to inlining in a traditional compiler).
Conditional branches, type checks, function dispatch and other
runtime decisions cause "guards" to be inserted into the trace.
These guarantee an exit of the trace whenever their precondition
turns out not to be true in a future execution of a trace.
A really neat consequence is that for a dynamic language (such as
Lua) a trace also captures a particular runtime specialization of
operations and types. A trace is like a statically typed variant
of the code and is well suited to optimization (see below).
Recording is stopped under certain conditions. Some conditions
cause an abort, e.g. if an error is raised or some particular
(hopefully uncommon) combination of events cannot be handled. In
this case the collected trace is thrown away and the VM proceeds
again to phase 1 and continues interpreting the code.
Recording is also stopped whenever the trace reaches certain
points in the control flow. E.g. when the trace start point is
hit again (next loop iteration). In this case the trace is
considered complete and the VM proceeds to phase 3 to compile it.
3. A completed trace is optimized and compiled to machine code.
Since the trace is just a linear stream of statically typed
SSA-based ucode, all the traditional compiler optimizations
apply. I'm only listing the most useful ones here:
— Constant propagation and constant folding replace runtime
operations with their constant results. Guards depending on
constants can be completely eliminated. Runtime specialization
and inlining create many opportunities for this optimization.
E.g. the return address from an inlined function is a constant
(the call is in the same trace) so the guard can be eliminated.
— CSE (common subexpression elimination) eliminates redundancy in
the form of repeated expressions which calculate the same
result. Most of these are not caused by the programmer (e.g.
the classic (a+b)*(a+b) example) but by the translation to
ucode. The most common targets for CSE are repeated guards
(e.g. type checks) and address translations for table lookups.
— Loop-invariant code hosting moves operations out of a loop
which only depend on inputs outside of the loop. This is most
useful to move guards resulting from method resolving and
function dispatch out of a loop. E.g. calling math.abs() can be
reduced to a single machine code instruction inside the loop
plus a check before the start of the loop that math.abs still
resolves to the internal function originally handling it.
Optimization often substantially reduces the number of ucodes in
a trace. Converting the remaining ucode to machine code is pretty
straightforward because it's just a linear stream of instructions.
[This is a bit over-simplified: traces can extend existing traces
to form a trace tree and guards are converted to branches to
compensation code for returning to the interpreter. There are
more complications under the hood.]
Once a trace has been compiled, it's registered with its start
point. Whenever the interpreter hits this part of the code again,
it proceeds with phase 4 and directly runs the machine code. In
the case of a loop this is done right after compilation.
4. The compiled machine code resulting from a trace is run. If
it's truly a "hot" trace, the compiled code will run often enough
to outweigh the effort of optimizing and compiling it.
A guard may occasionally cause a trace to exit. This can happen
when a different control-flow path is taken or a type check or
bounds check fails and so on. In this case compensation code is
called to prepare the state for returning to the interpreter.
Depending on the kind of guard, the VM either switches to phase 1
or phase 2. Errors force a return to the interpreter. Most other
guards start another trace which in turn extends the existing
Descriptions of trace compilation employed on bytecodes are here:
— Incremental Dynamic Code Generation with Trace Trees
A. Gal and M. Franz, 2006
— HotpathVM: An Effective JIT for Resource-constrained Devices
A. Gal, M. Franz, C. W. Probst, 2006
The technique originated in system emulators doing on-the-fly
binary-to-binary translation. A good reference is here:
— Dynamo: A Transparent Dynamic Optimization System
V. Bala, E. Duesterwald, S. Banerjia, 2000
Popular software using this technique are QEMU and Valgrind.
Intel's Netburst architecture (Pentium 4) employs a trace cache
to store a copy of the x86 instruction stream translated by the
instruction decoder to RISC-like micro-ops.