Performance
One of the main goals of Luau is to enable high performance code, with gameplay code being the main use case. This can be viewed as two separate goals:
- Make idiomatic, commonly written code faster
- Enable even higher performance code through careful tuning
Both of these goals are important - it's insufficient to just focus on the highly tuned code, and all things being equal we prefer to raise all boats by implementing general optimizations. However, in some cases it's important to be aware of optimizations that Luau does and doesn't do.
Worth noting is that Luau is focused, first and foremost, on stable high performance code in an interpreted context. This is because JIT compilation is not available on many platforms Luau runs on, and AOT compilation would only work for code that Roblox ships (and even that does not always work). This is in stark contrast with LuaJIT, which, while providing an excellent interpreter as well, focuses much of its attention on the JIT (with many optimizations unavailable in the interpreter).
Luau eventually plans to implement a JIT on some platforms, but this is subject to careful memory safety analysis and is unlikely to be deployed for client-side scripts, as the extra risk involved in JITs is much more pronounced when it may affect players.
The rest of this document goes into some optimizations that Luau employs and how to best leverage them when writing code. The document is not complete - a lot of optimizations are transparent to the user and involve detailed low-level tuning of various parts that is not described here - and all of this is subject to change without notice, as it doesn't affect the semantics of valid code.
Fast bytecode interpreter
Luau features a very highly tuned portable bytecode interpreter. It's similar to the Lua interpreter in that it's written in C, but it's highly tuned to yield efficient assembly when compiled with Clang and the latest versions of MSVC. On some workloads it can match the performance of the LuaJIT interpreter, which is written in highly specialized assembly. We are continuing to tune the interpreter and the bytecode format over time; while extra performance can be extracted by rewriting the interpreter in assembly, we're unlikely to ever do that as the extra gains at this point are marginal, and we gain a lot from C in terms of portability and being able to quickly implement new optimizations.
Of course the interpreter isn't typical C code - it uses many tricks to achieve extreme levels of performance and to coerce the compiler to produce efficient assembly. Due to a better bytecode design and a more efficient dispatch loop it's noticeably faster than Lua 5.x (including Lua 5.4, which made some changes similar to Luau's but doesn't come close). The bytecode design was partially inspired by the excellent LuaJIT interpreter. Most computationally intensive scripts only use the interpreter core loop and builtins, which on x64 compile into ~16 KB, thus leaving half of the instruction cache for other infrequently called code.
Optimizing compiler
Unlike Lua and LuaJIT, Luau uses a more classical compiler construction with a frontend that parses source into an AST and a backend that generates bytecode from it. This carries a small penalty in terms of compilation time, but results in more flexible code and, crucially, makes it easier to optimize the generated bytecode.
While bytecode optimizations are limited due to the flexibility of Luau code (e.g. `a * 1` may not be equivalent to `a` if `*` is overloaded through metatables), even in the absence of type information the Luau compiler can perform some optimizations such as "deep" constant folding across functions and local variables, upvalue optimizations for upvalues that aren't mutated, analysis of builtin function usage, and some peephole optimizations on the resulting bytecode. In the future we plan to do bytecode-level inlining and possibly other code transformations.
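As a rough illustration of the constant folding described above, the sketch below uses only local constants and arithmetic on them, so the multiplier is a value a compiler could in principle compute ahead of time. The names are hypothetical, and whether a particular expression is actually folded depends on the compiler version and settings.

```lua
-- Hypothetical constants: never reassigned, so their values are known at compile time.
local SECONDS_PER_MINUTE = 60
local MINUTES_PER_HOUR = 60
local SECONDS_PER_HOUR = SECONDS_PER_MINUTE * MINUTES_PER_HOUR

local function hoursToSeconds(hours)
    -- SECONDS_PER_HOUR is a candidate for folding into the literal 3600.
    -- `hours * ...` itself cannot be folded: `hours` is only known at runtime.
    return hours * SECONDS_PER_HOUR
end
```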
The Luau compiler currently doesn't use type information to do further optimizations; however, early experiments suggest that we can extract further wins. Because we control the entire stack (unlike e.g. TypeScript where the type information is discarded completely before reaching the VM), we have more flexibility there and can make some tradeoffs during codegen even if the type system isn't completely sound. For example, it might be reasonable to assume that in the presence of known types, we can infer the absence of side effects for arithmetic operations and builtins - if the runtime types mismatch due to intentional violation of type safety through global injection, the code will still be safely sandboxed; this may unlock optimizations such as common subexpression elimination and allocation hoisting without a JIT. This is speculative pending further research.
Inline caching for table and global access
Table access for field lookup is optimized in Luau using a mechanism that blends inline caching (classically used in Java/JavaScript VMs) and HREFs (implemented in LuaJIT). The compiler can predict the hash slot used by a field lookup, and the VM can correct this prediction dynamically.
As a result, field access can be very fast in Luau, provided that:
- The source code uses `table.field` notation. The compiler doesn't optimize `table[field]` as it assumes that in this case `field` is not a string and/or can change for different accesses. Because of this you should avoid using `table["field"]` which isn't idiomatic anyway.
- The field access doesn't use metatables. The fastest way to work with tables in Luau is to store fields directly inside the table, and store methods in the metatable (see below); access to "static" fields in classic OOP designs is best done through `Class.StaticField` instead of `object.StaticField`.
- The object structure is usually uniform. While it's possible to use the same function to access tables of different shape - e.g. `function getX(obj) return obj.x end` can be used on any table that has a field `"x"` - it's best to not vary the keys used in the tables too much, as it defeats this optimization.
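The three conditions above can be sketched together as follows. `Account`, `InterestRate` and the helper functions are hypothetical names chosen for illustration; the sketch only shows the access patterns the list recommends and makes no claim about how much faster they are.

```lua
-- Hypothetical class table used only for illustration.
local Account = {}
Account.__index = Account
Account.InterestRate = 0.05 -- "static" field stored on the class table

local function newAccount(balance)
    -- Every instance has the same set of keys, stored directly in the table.
    return setmetatable({ balance = balance }, Account)
end

local function projectedBalance(account)
    -- account.balance: direct field read using table.field notation.
    -- Account.InterestRate: read through the class rather than the instance,
    -- so the lookup does not have to go through the metatable.
    return account.balance * (1 + Account.InterestRate)
end

local savings = newAccount(100)
local projected = projectedBalance(savings)
```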
The same optimization is applied to the custom globals declared in the script, although it's best to avoid these altogether by using locals instead. Still, this means that the difference between `function` and `local function` is less pronounced in Luau.
Importing global access chains
While global access for library functions can be optimized in a similar way, this optimization breaks down when the global table is using sandboxing through metatables, and even when globals aren't sandboxed, `math.max` still requires two table accesses.
It's always possible to "localize" the global accesses by using `local max = math.max`, but this is cumbersome - in practice it's easy to forget to apply this optimization. To avoid relying on programmers remembering to do this, Luau implements a special optimization called "imports", where most global chains such as `math.max` are resolved when the script is loaded instead of when the script is executed.
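For comparison, here is a minimal sketch of the two spellings discussed above. Both functions (hypothetical names) compute the same result; the second is the plain form that the text describes imports as handling, so manual localization is not expected to be needed in Luau.

```lua
-- Manual "localization": the global chain is resolved once, when this line runs.
local max = math.max

local function clampToZeroLocalized(x)
    return max(x, 0)
end

-- Plain global chain: the kind of access the "imports" optimization is
-- described as resolving when the script is loaded.
local function clampToZero(x)
    return math.max(x, 0)
end
```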
This optimization relies on being able to predict the shape of the environment table for a given function; this is possible due to global sandboxing, however this optimization is invalid in some cases:
- `loadstring` can load additional code that runs in the context of the caller's environment
- `getfenv`/`setfenv` can directly modify the environment of any function
The use of any of these functions performs a dynamic deoptimization, marking the affected environment as "impure". The optimizations are only in effect on functions with "pure" environments - because of this, the use of `loadstring`/`getfenv`/`setfenv` is not recommended. Note that `getfenv` deoptimizes the environment even if it's only used to read values from the environment.
Note: Luau still supports these functions as part of our backwards compatibility promise, although we'd love to switch to Lua 5.2's `_ENV` as that mechanism is cleaner and doesn't require costly dynamic deoptimization.
Fast method calls
Luau specializes method calls to improve their performance through a combination of compiler, VM and binding optimizations. The compiler emits a specialized instruction sequence when methods are called through `obj:Method` syntax (you should avoid `obj.Method(obj)`, which isn't idiomatic anyway). When the object in question is a Lua table, the VM performs some voodoo magic based on inline caching to try to quickly discover the implementation of this method through the metatable.
For this to be effective, it's crucial that `__index` in a metatable points to a table directly. For performance reasons it's strongly recommended to avoid `__index` functions as well as deep `__index` chains; an ideal object in Luau is a table with a metatable that points to itself through `__index`.
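The "ideal object" layout described above corresponds to the common class idiom sketched below. `Counter` and its methods are hypothetical names; the point is only that `__index` refers directly to a table (the metatable itself) and that the method is invoked with colon syntax.

```lua
local Counter = {}
Counter.__index = Counter -- the metatable points to itself through __index

function Counter.new(start)
    -- Fields live in the instance table; methods live in the metatable.
    return setmetatable({ value = start }, Counter)
end

function Counter:increment(amount)
    self.value = self.value + amount
    return self.value
end

local counter = Counter.new(10)
local result = counter:increment(5) -- colon syntax rather than counter.increment(counter, 5)
```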
When the object in question is a reflected userdata, a special mechanism called "namecall" is used to minimize the interop cost. In the classical Lua binding model, `obj:Method` is called in two steps, retrieving the function object (`obj.Method`) and calling it; both steps are often implemented in C++, and the method retrieval needs to use a method object cache - all of this makes method calls slow.
Luau can directly call the method by name using the "namecall" extension, and an optimized reflection layer can retrieve the correct method quickly through more voodoo magic based on string interning and custom Luau features that aren't exposed through Luau scripts.
As a result of both optimizations, common Lua tricks of caching the method in a local variable aren't very productive in Luau and aren't recommended either.
Specialized builtin function calls
Due to global sandboxing and the ability to dynamically deoptimize code running in impure environments, in pure environments we go beyond optimizing the interpreter and optimize many built-in functions through a "fastcall" mechanism.
For this mechanism to work, the function call must be "obvious" to the compiler - it needs to call a builtin function directly, e.g. `math.max(x, 1)`, although it also works if the function is "localized" (`local max = math.max`); this mechanism doesn't work for indirect function calls unless they were inlined during compilation, and doesn't work for method calls (so calling `string.byte` is more efficient than `s:byte`).
The mechanism works by directly invoking a highly specialized and optimized implementation of a builtin function from the interpreter core loop, without setting up a stack frame and while omitting other work. Additionally, some fastcall specializations are partial in that they don't support all types of arguments. For example, all `math` library builtins are only specialized for numeric arguments, so calling `math.abs` with a string argument will fall back to the slower implementation that will do string->number coercion.
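A small sketch of the call shapes discussed above, using only standard library functions. The comments restate which forms the text describes as eligible for fastcall; the code itself only demonstrates that the forms are interchangeable in terms of results.

```lua
local s = "Luau"

-- Direct call to a builtin: the form described above as eligible for fastcall.
local first = string.byte(s, 1)

-- Method-call form: same result, but described above as not eligible.
local firstViaMethod = s:byte(1)

-- Numeric argument: matches the numeric-only specialization of math builtins.
local magnitude = math.abs(-3)
```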
As a result, builtin calls are very fast in Luau - they are still slightly slower than core instructions such as arithmetic operations, but only slightly so. The set of fastcall builtins is slowly expanding over time and as of this writing contains `math`, `bit32`, `assert`, `type`, `typeof` and some functions from the `string` library.
Note: The partial specialization mechanism is cute in that for `assert`, it only specializes on truthy conditions; hopefully the performance of `assert(false)` isn't crucial for most code!
Optimized table iteration
Luau implements a fully generic iteration protocol; however, for iteration through tables it recognizes three common idioms (`for .. in ipairs(t)`, `for .. in pairs(t)` and `for .. in next, t`) and emits specialized bytecode that is carefully optimized using custom internal iterators.
As a result, iteration through tables typically doesn't result in function calls for every iteration; the performance of iteration using `pairs` and `ipairs` is comparable, so it's recommended to pick the iteration style based on readability instead of performance.
Iterating through array-like tables using `for i=1,#t` tends to be slightly slower because of extra cost incurred when reading elements from the table.
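The iteration styles mentioned in this section look as follows on a small array-like table. `scores` and the accumulator names are hypothetical; all loops visit the same elements, and the sketch makes no performance measurement of its own.

```lua
local scores = { 10, 20, 30 }

local total = 0
for _, score in ipairs(scores) do -- array-style iteration
    total = total + score
end

local count = 0
for _ in pairs(scores) do -- generic iteration over all keys
    count = count + 1
end

local viaNext = 0
for _, score in next, scores do -- equivalent traversal for a plain table, spelled with next
    viaNext = viaNext + score
end

-- Index-based loop: works, but is described above as slightly slower.
local indexed = 0
for i = 1, #scores do
    indexed = indexed + scores[i]
end
```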
Creating and modifying tables
Luau implements several optimizations for table creation. When creating object-like tables, it's recommended to use table literals (`{ ... }`) and to specify all table fields in the literal in one go instead of assigning fields later; this triggers an optimization inspired by LuaJIT's "table templates" and results in higher performance when creating objects. When creating array-like tables, if the maximum size of the table is known up front, it's recommended to use the `table.create` function, which can create an empty table with preallocated storage, and optionally fill it with a given value.
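To make the object-like case concrete, the sketch below builds the same table in the two ways contrasted above. The particle fields and function names are hypothetical; both functions return equivalent tables, and only the first uses the all-in-one literal that the text recommends.

```lua
-- Recommended shape: all fields are specified in the literal in one go.
local function newParticle(x, y)
    return { x = x, y = y, velocityX = 0, velocityY = 0, alive = true }
end

-- Equivalent result, but the fields are assigned after the table is created,
-- which is the pattern the text above advises against.
local function newParticleIncremental(x, y)
    local particle = {}
    particle.x = x
    particle.y = y
    particle.velocityX = 0
    particle.velocityY = 0
    particle.alive = true
    return particle
end

local a = newParticle(1, 2)
local b = newParticleIncremental(1, 2)
```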
When appending elements to tables, it's recommended to use `table.insert` (which is currently ever so slightly slower than `t[#t+1]` but it will be improved in the future) if the table size is not known. In cases when a table is filled sequentially, however, it's much more efficient to use a known index for insertion - together with preallocating tables using `table.create` this can result in much faster code, for example this is the fastest way to build a table of squares:
    -- N is the number of elements, assumed to be known up front
    local t = table.create(N)
    for i=1,N do
        t[i] = i * i
    end
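For the other case mentioned above, where the final size is not known in advance, a minimal sketch using `table.insert` might look like this. `collectEven` is a hypothetical helper written for illustration only.

```lua
-- Collects the even values from an array-like table.
local function collectEven(values)
    local result = {}
    for _, v in ipairs(values) do
        if v % 2 == 0 then
            -- The number of matches is not known up front, so append.
            table.insert(result, v)
        end
    end
    return result
end

local evens = collectEven({ 1, 2, 3, 4, 5, 6 })
```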
Native Vector3 math
Note: this optimization is still in progress, so this section doesn't document it, but it's going to be great
Fast memory allocator
Similarly to LuaJIT, but unlike vanilla Lua, Luau implements a custom allocator that is highly specialized and tuned to the common allocation workloads we see. The allocator design is inspired by classic pool allocators as well as the excellent `mimalloc`, but through careful domain-specific tuning it beats all general purpose allocators we've tested, including `rpmalloc`, `mimalloc`, `jemalloc`, `ptmalloc` and `tcmalloc`.
This doesn't mean that memory allocation in Luau is free - it's carefully optimized, but it still carries a cost, and a high rate of allocations requires more work from the garbage collector. The garbage collector is incremental, so short of some edge cases this rarely results in visible GC pauses, but it can impact throughput since scripts will be interrupted to perform "GC assists" (helping clean up the garbage). Thus for high performance Luau code it's recommended to avoid allocating memory in tight loops, by avoiding temporary table and userdata creation.
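As an illustration of the advice above, the two hypothetical functions below compute the same sum; the first creates a temporary table on every iteration, while the second keeps the values in locals and allocates nothing inside the loop.

```lua
-- Allocates a temporary table on every iteration.
local function sumOfPairsAllocating(n)
    local total = 0
    for i = 1, n do
        local pair = { i, i + 1 } -- new table each time through the loop
        total = total + pair[1] + pair[2]
    end
    return total
end

-- Same result without any per-iteration allocation.
local function sumOfPairs(n)
    local total = 0
    for i = 1, n do
        total = total + i + (i + 1)
    end
    return total
end
```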
Optimized garbage collector
Note: our garbage collector optimizations are still in progress, so this section doesn't document them.