mirror of
https://github.com/luau-lang/luau.git
synced 2025-01-31 23:03:10 +00:00
9aa82c6fb9
When the input is a constant, we use a fairly inefficient sequence of fmov+fcvt+dup or, when the double isn't encodable in fmov, adr+ldr+fcvt+dup. Instead, we can use the same lowering as X64 when the input is a constant, and load the vector from memory.

However, if the constant is encodable via fmov, we can use a vector fmov instead (which is just one instruction and doesn't need constant space). Fortunately the bit encoding of fmov for 32-bit floating point numbers matches that of 64-bit: the decoding algorithm is a little different because it expands into a larger exponent, but the values are compatible, so if a double can be encoded into a scalar fmov with a given abcdefgh pattern, the same pattern should encode the same float. Due to the very limited number of mantissa and exponent bits, all values that are encodable are also exact in both 32-bit and 64-bit floats.

This strategy is roughly the same as what gcc uses. For complex vectors, we previously used 4 instructions and 8 bytes of constant storage (4×4+8=24 bytes), and now we use 2 instructions and 16 bytes of constant storage (2×4+16=24 bytes), so the memory footprint is the same; for simple vectors we just need 1 instruction (4 bytes).

clang lowers vector constants a little differently, opting to synthesize a 64-bit integer using 4 instructions (mov/movk) and then move it to the vector register. This requires 5 instructions and 20 bytes, vs ours/gcc 2 instructions and 8+16=24 bytes.

I tried a simpler version of this that would be more compact (synthesize a 32-bit integer constant with mov+movk, and move it to the vector register via dup.4s), but this was a little slower on M2, so for now we prefer the slightly larger version as it's not a regression vs the current implementation.
On the vector approximation benchmark we get:

- Before this PR (flag=false): ~7.85 ns/op
- After this PR (flag=true): ~7.74 ns/op
- After this PR, with 0.125 instead of 0.123 in the benchmark code (to use fmov): ~7.52 ns/op
- Not part of this PR, but the mov/dup strategy described above: ~8.00 ns/op
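The fmov encodability argument in the commit message can be checked with a small standalone sketch. This is not the Luau implementation: the helper names `fmov_imm8`, `expand_imm8_to_double` and `expand_imm8_to_float` are hypothetical, and the bit layout is written from the AArch64 8-bit floating-point immediate format (sign `a`, exponent bits derived from `b`, `c`, `d`, fraction `efgh`) as I understand it. It tests whether a double has an `abcdefgh` pattern and expands that pattern into both a 64-bit and a 32-bit float so the two values can be compared.

```python
import struct


def fmov_imm8(value: float):
    """Return the 8-bit abcdefgh pattern for `value`, or None if not encodable.

    Hypothetical helper; assumes the AArch64 8-bit FP immediate layout.
    """
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    # Only the top 4 fraction bits (efgh) may be set.
    if bits & ((1 << 48) - 1):
        return None
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    efgh = (bits >> 48) & 0xF
    # The 11-bit exponent must look like NOT(b) : b repeated 8 times : cd.
    b = ((exp >> 10) & 1) ^ 1
    if (exp >> 2) & 0xFF != (0xFF if b else 0x00):
        return None
    cd = exp & 0x3
    return (sign << 7) | (b << 6) | (cd << 4) | efgh


def expand_imm8_to_double(imm8: int) -> float:
    """Expand abcdefgh into a 64-bit float (b is repeated 8 times)."""
    a, b = (imm8 >> 7) & 1, (imm8 >> 6) & 1
    cd, efgh = (imm8 >> 4) & 0x3, imm8 & 0xF
    exp = ((b ^ 1) << 10) | ((0xFF if b else 0) << 2) | cd
    bits = (a << 63) | (exp << 52) | (efgh << 48)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def expand_imm8_to_float(imm8: int) -> float:
    """Expand abcdefgh into a 32-bit float (b is repeated 5 times)."""
    a, b = (imm8 >> 7) & 1, (imm8 >> 6) & 1
    cd, efgh = (imm8 >> 4) & 0x3, imm8 & 0xF
    exp = ((b ^ 1) << 7) | ((0x1F if b else 0) << 2) | cd
    bits = (a << 31) | (exp << 23) | (efgh << 19)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def all_patterns_agree() -> bool:
    """Check that every pattern decodes to the same value at both widths."""
    return all(
        expand_imm8_to_double(i) == expand_imm8_to_float(i)
        and fmov_imm8(expand_imm8_to_double(i)) == i
        for i in range(256)
    )
```

Under these assumptions, `fmov_imm8(0.125)` yields a pattern while `fmov_imm8(0.123)` yields `None`, which matches the benchmark note about swapping 0.123 for 0.125 to reach the fmov path.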
- apicalls.lua
- assert.lua
- attrib.lua
- basic.lua
- bitwise.lua
- buffers.lua
- calls.lua
- clear.lua
- closure.lua
- constructs.lua
- coroutine.lua
- coverage.lua
- datetime.lua
- debug.lua
- debugger.lua
- errors.lua
- events.lua
- exceptions.lua
- gc.lua
- ifelseexpr.lua
- interrupt.lua
- iter.lua
- literals.lua
- locals.lua
- math.lua
- move.lua
- native.lua
- native_types.lua
- ndebug_upvalues.lua
- pcall.lua
- pm.lua
- safeenv.lua
- sort.lua
- strconv.lua
- stringinterp.lua
- strings.lua
- tables.lua
- tmerror.lua
- tpack.lua
- types.lua
- userdata.lua
- utf8.lua
- vararg.lua
- vector.lua