Hot-rodding Factor with typed functions, SIMD, and data-map

October 19, 2009

Factor's gained a number of features and libraries recently to help it deal efficiently with large sets of binary data.

Typed functions

Factor's compiler can take advantage of all the type information at its disposal to eliminate type checks and dynamic dispatch. However, since Factor is a dynamic language, very little type information usually survives past a function call; values can't walk across the function's borders as free integers or floats but must be smuggled through the call site inside hidden compartments of vans labeled object. Factor has some existing features to avoid this: for small functions, you can tear down the wall by inlining the function at its call site. However, inlining large functions that get used in multiple places bloats the compiled code. If you like to play fast and loose, the Factor compiler has some hooks to let you make unchecked assertions about the types of values in your code, but these hooks are unsafe, and no one should use them or know about them. Forget I said anything.
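For the morbidly curious, the main such hook is declare, from the kernel.private vocabulary; it asserts the classes of the top stack values with no runtime check whatsoever. A deliberately dangerous sketch, for illustration only:

USING: kernel.private math ;
IN: scratchpad

! Unsafe: the compiler trusts the declaration blindly. Handing this word
! anything but floats is liable to corrupt memory or crash the VM.
: unsafe-add-floats ( a b -- c )
    { float float } declare + ;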

Given these inadequate choices, I wrote a typed library that wraps the aforementioned compiler hooks in a safe, easy-to-use syntax. You can define a TYPED: function like so:

USING: typed math ;
IN: scratchpad

TYPED: add-floats ( a: float b: float -- c: float ) + ;

The add-floats function will check its inputs before executing, converting them to floats if it can and throwing an error otherwise:

( scratchpad ) 1 2+1/2 add-floats .
3.5
( scratchpad ) "lol" "wut" add-floats .
Generic word >float does not define a method for the string class.
Dispatching on object: "lol"

When add-floats is called from another function, its type checks are inlined into the caller, so if the compiler can see that add-floats' inputs are going to be floats, it can eliminate the checks. The actual body of the function is compiled under a hidden name with the unsafe type declarations applied. After the call to the hidden typed function, the output value types are declared in the calling function, allowing the compiler to further optimize the caller's code. We can see the compiler in action using its optimized. diagnostic function:

( scratchpad ) USE: compiler.tree.debugger
( scratchpad ) [ 2.0 3.0 add-floats 5.0 * ] optimized.
[ 2.0 3.0 ( typed add-floats ) 5.0 float* ]

The compiler knows that 2.0 and 3.0 are floats, so it doesn't emit code to check them before calling ( typed add-floats ) (the hidden function that implements add-floats with type specializations). The compiler likewise knows that the result of add-floats will be a float, so multiplying it with 5.0 can be done with the specialized float* function instead of the generic * operator.

One cute thing to note is that typed is a completely self-contained Factor library. Nothing needed to change in the Factor VM or compiler to make it work.

SIMD support

Slava and I worked together to add support for hardware vector types to the language and compiler. Applying Factor's library of vector operations to objects of special fixed-size array types such as float-4 or short-8 now generates code that uses the SSE machine instructions of modern Intel processors. Together with first-class struct and binary array objects and typed functions, it's possible to write reasonably performing vector code in a high-level style. For example, the new math.matrices.simd library implements fast 4x4 matrix math using code like this:

STRUCT: matrix4
    { columns float-4[4] } ;

: columns ( a -- a1 a2 a3 a4 )
    columns>> 4 firstn ; inline

TYPED:: m4.v ( m: matrix4 v: float-4 -- v': float-4 )
    m columns :> m4 :> m3 :> m2 :> m1
    
    v first  m1 n*v
    v second m2 n*v v+
    v third  m3 n*v v+
    v fourth m4 n*v v+ ;
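The vector words here come straight from math.vectors: n*v scales a vector by a scalar, and v+ adds two vectors elementwise, but on float-4 values they compile down to single SSE instructions. A quick illustrative transcript (the output is reconstructed from the literal syntax, so treat it as a sketch):

( scratchpad ) USING: math.vectors math.vectors.simd ;
( scratchpad ) SIMDS: float ;
( scratchpad ) 2.0 float-4{ 1.0 2.0 3.0 4.0 } n*v .
float-4{ 2.0 4.0 6.0 8.0 }
( scratchpad ) float-4{ 1.0 2.0 3.0 4.0 } float-4{ 0.5 0.5 0.5 0.5 } v+ .
float-4{ 1.5 2.5 3.5 4.5 }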

We can load math.matrices.simd and look at the compiler's code generation for m4.v with the test-mr. function:

( scratchpad ) USING: math.matrices.simd typed.debugger ;
( scratchpad ) \ m4.v typed-test-mr.
=== word: ( typed m4.v ), label: ( typed m4.v )

_label 0 
_prologue T{ stack-frame { total-size 32 } } 
_label 1 
##gc RAX RCX 32 { } { } f 
##peek RAX D 1 
##peek RCX D 0 
##unbox-any-c-ptr RBX RAX RDX 
##alien-vector XMM0 RBX 0 float-4-rep 
##alien-vector XMM1 RBX 16 float-4-rep 
##alien-vector XMM2 RBX 32 float-4-rep 
##alien-vector XMM3 RBX 48 float-4-rep 
##alien-vector XMM4 RCX 10 float-4-rep 
##shuffle-vector-imm XMM5 XMM4 { 0 0 0 0 } float-4-rep 
##mul-vector XMM5 XMM5 XMM0 float-4-rep 
##shuffle-vector-imm XMM0 XMM4 { 1 1 1 1 } float-4-rep 
##mul-vector XMM0 XMM0 XMM1 float-4-rep 
##add-vector XMM5 XMM5 XMM0 float-4-rep 
##shuffle-vector-imm XMM0 XMM4 { 2 2 2 2 } float-4-rep 
##mul-vector XMM0 XMM0 XMM2 float-4-rep 
##add-vector XMM5 XMM5 XMM0 float-4-rep 
##shuffle-vector-imm XMM4 XMM4 { 3 3 3 3 } float-4-rep 
##mul-vector XMM4 XMM4 XMM3 float-4-rep 
##add-vector XMM5 XMM5 XMM4 float-4-rep 
##inc-d -1 
##allot RAX 32 byte-array RCX 
##load-immediate RCX 128 
##set-slot-imm RCX RAX 1 6 
##set-alien-vector RAX 10 XMM5 float-4-rep 
##replace RAX D 0 
_label 2 
_epilogue T{ stack-frame { total-size 32 } } 
##return 
_spill-area-size 0 

(typed-test-mr. is a version of test-mr. that examines the secret function for a TYPED: definition.) The preamble and postamble code are a bit bulkier than they would be for a similar C function, setting up GC roots at the beginning and allocating a new memory block at the end, but the heart of the function is about as good as it gets: load the four matrix columns and the input vector into registers (##alien-vector), broadcast each element of the vector in turn (##shuffle-vector-imm), multiply it against the corresponding matrix column (##mul-vector), sum the results (##add-vector), and store the final result in the newly allotted memory block (##set-alien-vector). The vector values fly around in vector registers until the final result lands in its destination in memory.

Doug Coleman has also used Factor's SIMD support to implement the SIMD Fast Mersenne Twister algorithm in the random.sfmt library. He reports performance within a fraction of a second of an equivalent C++ implementation.

data-map

4x4 matrices and random number generators are nice, but the bread-and-butter purpose of SIMD is to accelerate the processing of large amounts of data. Slava plans to add some auto-vectorization support so that operations like v+ on packed binary arrays use SIMD operations. But as a general solution, auto-vectorization is tough for both the compiler and the developer to deal with. Even in mature vectorizing compilers, if you plié when the compiler expects you to jeté, the compiler will throw up its hands and vomit scalar code in disgust.

As an alternative to auto-vectorization in Factor, I've been experimenting with a macro to make it easy to write explicitly vectorized code, which I've given the working title data-map. It can take objects of any of Factor's packed binary array types as inputs and map over their contents from any C type to any other C type. You can also grab input values in groups and map them to groups of output values. This makes it easy to express tricky vector operations that would be tough or impossible to trick a C compiler into auto-vectorizing. Here's an example that packs an array of floating-point pixel values into byte-sized pixel values:

USING: alien.c-types alien.data.map generalizations kernel
math.vectors math.vectors.conversion math.vectors.simd
specialized-arrays typed ;
SIMDS: float int short uchar ;
SPECIALIZED-ARRAYS: float float-4 uchar-16 ;
IN: scratchpad

TYPED: float-pixels>byte-pixels ( floats: float-array -- bytes: byte-array )
    [
        [ 255.0 v*n float-4 int-4 vconvert ] 4 napply
        [ int-4 short-8 vconvert ] 2bi@
        short-8 uchar-16 vconvert
    ] data-map( float-4[4] -- uchar-16 ) ;

The above code grabs four float-4 vectors at a time from the input array and crams them into one uchar-16 vector, first scaling the four inputs from 0.0–1.0 to 0–255 (255.0 v*n float-4 int-4 vconvert), then packing the two pairs of int vectors into two short vectors (int-4 short-8 vconvert), and finally packing the two short vectors into a single uchar vector. The machine instruction that the vconvert operation maps to handles saturating values below 0 and above 255 for us.
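Schematically, each data-map iteration narrows 64 bytes of input to 16 bytes of output (element values invented for illustration):

! One iteration: 16 floats in, 16 bytes out.
!   4 x float-4   e.g. float-4{ 0.0 0.2 0.4 1.0 } ...
!     [ 255.0 v*n float-4 int-4 vconvert ] 4 napply
!   4 x int-4     e.g. int-4{ 0 51 102 255 } ...
!     [ int-4 short-8 vconvert ] 2bi@
!   2 x short-8
!     short-8 uchar-16 vconvert
!   1 x uchar-16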

data-map can also iterate over multiple input arrays in parallel. Here's another pixel-pushing example that folds planar R, G, B, and A data into a single RGBA image:

USING: alien.c-types alien.data.map kernel locals
math.vectors math.vectors.simd specialized-arrays typed ;
SIMDS: uchar ;
SPECIALIZED-ARRAYS: uchar-16 ;
IN: scratchpad

:: vmerge-transpose ( a b c d -- ac bd ac bd )
    a c (vmerge) b d (vmerge) ; inline

TYPED: RGBA-planes>RGBA (
    r: byte-array
    g: byte-array
    b: byte-array
    a: byte-array
    --
    rgba: byte-array
)
    [ vmerge-transpose vmerge-transpose ]
    data-map( uchar-16 uchar-16 uchar-16 uchar-16 -- uchar-16[4] ) ;

The (vmerge) function maps to SIMD instructions that interleave the contents of two vectors. By applying it twice to our four input vectors we get four interleaved vectors we can store to the destination buffer.
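Concretely, each (vmerge) yields the low-half and high-half interleavings of its two inputs, so two rounds of vmerge-transpose shuffle the planes into pixel order. A sketch with invented element names:

! a = R0 R1 ... R15, b = G0 ... G15, c = B0 ... B15, d = A0 ... A15
! First round:  a c (vmerge) => R0 B0 R1 B1 ... R7 B7 | R8 B8 ... R15 B15
!               b d (vmerge) => G0 A0 G1 A1 ... G7 A7 | G8 A8 ... G15 A15
! Second round interleaves those, giving four uchar-16 vectors of the form
! R0 G0 B0 A0 R1 G1 B1 A1 ... -- four RGBA pixels per output vector.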

data-map still has some shortcomings. All the above examples assume that the length of the input arrays is a multiple of the input grouping. data-map does nothing to deal with the tail end; currently, you'd need to find and handle it yourself, as in the sketch below. data-map also doesn't offer a solution for iterating over potentially misaligned subsequences of packed arrays, where you would also want to handle the unaligned head of the incoming slice separately before passing the aligned part to the vectorized main loop. However, as is, it's been useful enough to accelerate the terrain generation demo from needing a five-second warmup to starting nearly instantly. As I try to apply it to more complex problems, I'll be tweaking the design to make it more general.
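Until then, a caller can split the ragged tail off by hand. Here's a minimal sketch of that idea; the helper words are hypothetical (not part of the library), and the scalar path truncates where the SIMD conversions round:

USING: kernel math math.order sequences specialized-arrays typed ;
SPECIALIZED-ARRAYS: float ;
IN: scratchpad

! Hypothetical scalar fallback for a single pixel.
: float-pixel>byte-pixel ( x -- n )
    255.0 * >integer 0 255 clamp ; inline

TYPED: all-float-pixels>byte-pixels ( floats: float-array -- bytes: byte-array )
    dup length 16 mod cut*                      ! split off the ragged tail
    [ float-pixels>byte-pixels ]                ! vector path: multiple-of-16 head
    [ [ float-pixel>byte-pixel ] B{ } map-as ]  ! scalar path: at most 15 leftovers
    bi* append ;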

✻   ✼   ✻

Of course, Factor is still primarily a dynamic language, and it does nothing to stop you from slipping a type-opaque function call into your inner loop and kicking its performance down from C++ to Python levels. It still takes some arcane knowledge of the compiler's abilities to get it to generate fast code. But having typed function definitions, a rich set of object types that map to CPU types, and control flow constructs that work with typed data makes Factor's fast path a lot easier to access.