Intel SPMD Program Compiler

category: code [glöplog]
It might shown by your twitter recently, I don't know.

Intel SPMD Program Compiler - An open-source compiler for high-performance SIMD programming on the CPU


I've built the examples and I get a 5x speed up or more on my quad core.

Discuss, I guess.
added on the 2011-06-30 04:22:15 by xernobyl xernobyl
Should try to compile something done with visual studio, my supposedly SIMD optimized math library, and my thread pool, and compare it with this.
added on the 2011-06-30 09:16:57 by xernobyl xernobyl
cool, buuut why don't the techniques get integrated into existing open source compilers instead? (I'm sure icc supports this already)

edit: after browsing for a minute I found that it compiles to llvm IR first, so that is nice at least. Still don't understand why this needs to be a seperate project...
added on the 2011-06-30 10:53:02 by shuffle2 shuffle2
Because the language is not actual C/C++. Needs language extensions to do it.
added on the 2011-06-30 13:22:12 by NeARAZ NeARAZ
I have no idea what this is but I feel an urge to write a tutorial about it.
added on the 2011-06-30 14:31:41 by doomdoom doomdoom
Another step in the right direction it seems. Using SSE can be very beneficial yet writing intrisics-driven SIMD stuff can be tedious (and thick to look at for a fellow programmer, compared to regular C/C++) *and* this seems to automatically force somewhat careful selection of what you're going to SIMD-mize and what *not*.

SIMD-mize, sodomize.. hmm, close enough.
added on the 2011-06-30 14:35:50 by superplek superplek
They talked about auto-vectorization for years, did they give up ?
added on the 2011-06-30 15:01:57 by MsK` MsK`
Does this mean that LLVM SIMD optimization are already better than ICC itself? That would be quite a feat.
added on the 2011-06-30 17:00:06 by ponce ponce
This is an intermediate step I guess, *if* we ever get to auto-vectorization (which can be either suboptimal or downright disastrous if applied too freely w/halfassed heuristics -- so it has to be done absolutely right).

And it feels kind of safe as well: you write tailored C in relatively small batches to feed a certain SIMD slots/pipes/program width. It's probably going to miss some hardcore and more algorithmic optimizations you can do when writing SIMD routines by hand but it looks sweet enough to quickly convert all the less tricky stuff. Feels like "SIMD HLSL" :)

Certainly downloading/experimenting with this soon.
added on the 2011-06-30 17:10:40 by superplek superplek
Does this mean that LLVM SIMD optimization are already better than ICC itself? That would be quite a feat.

The whole point here is that this isn't doing any SIMD optimizations, at least not in the sense of taking code and SIMDifying it; instead, the language has semantics that make SIMDification trivial, which is (personal opinion ahead!) the only sane way to build tools for this kind of stuff.

It just generates SSE intrinsics and passes it through the LLVM backend. Compiling stuff like this *to* C/C++ is a losing proposition here because the language semantics (aliasing rules in particular) make it exceedingly hard (impossible except in trivial cases really) to do that compilation in a way that doesn't introduce artificial dependencies that aren't in the source program.

It's far easier to skip that level entirely and just give a set of intrinsics plus dependencies to a competent backend and just let it do codegen+scheduling.

They might've been able to use the iCC backend for that, but not certainly not if they had to go through the C frontend first, and I'm not sure how easy it is to integrate a new frontend into that codebase - probably hellish :). LLVM OTOH has a decent optimizer and is exceptionally easy (this is just in comparison, it's still not a trivial problem) to hook stuff up to. Since this was basically a research project with a very small team, that's the best way to do it.

This is an intermediate step I guess, *if* we ever get to auto-vectorization

You get auto-vectorization already, but the algorithms for this were all developed for FORTRAN programs and have severe limitations:

  • Loops to be auto-vectorized need to be FORTRAN-style DO loops or equivalent forms (i.e. only integer loop counters, fixed step, iteration count fixed on loop entry)
  • They may only access either scalar local variables or arrays (that's plain arrays, either global or on the stack, and they may not even be indirected through a base pointer) that are known at compile time not to alias. (This last one makes it very hard to do this for C programs).
  • These must be arrays of a basic type - and most vectorizers only support arrays of floats and doubles (some also support arrays of 32-bit ints, but that's it).
  • All array indices must be induction variables, i.e. integer affine functions of the loop counter(s) - a*i + b*j + c*k + ... with a, b, c being integer constants.
  • A lot of compilers can only deal with rectangular loops (i.e. i=0..N-1, j = 0..M-1). There's techniques that can deal with triangular or trapezoidal loops (where the iteration count of an inner loop is either an affine function of an outer loop counter, or the loop can be split into components that match this pattern) but only few compilers actually support them. More complex loop nests require techniques like polyhedral optimization (which can deal with loops that iterate over an arbitrary volume that can be described as a polyhedron). Some compilers have partial implementations of this (e.g. GCC/LLVM) but so far it doesn't mesh well with other optimizations - this is still a research problem. (Also, even if it worked, such loops are generally hard to vectorize).

TL;DR: Auto-vectorization is not magic, and unless you're solving numerical systems using FORTRAN code that was written in the 70s and 80s, it wasn't actually designed to help you. :) In particular, I haven't seen anything that even tries to deal well with heterogeneous data (where heterogeneous, as far as current compilers are concerned, starts at "struct of two arrays instead of just one straight array of floats").
added on the 2011-06-30 19:38:12 by ryg ryg
; instead, the language has semantics that make SIMDification trivial, which is (personal opinion ahead!) the only sane way to build tools for this kind of stuff

That's how I feel about it too. We can't expect plain compilers to take a look inside the programmer's head (there's this line between what code expresses on paper and "intention") and unleash a barrage of utopian heuristics to solve that. Instead, we do it this way. Easy enough and still in full control.

</cpt. obvious>

Someone once pointed out VC++ expanding some very trivial floating point code to a vectorized add and even that made me look a little worried.

So -- this FORTRAN stuff -- it was specifically designed to boost legacy code?
added on the 2011-06-30 20:14:53 by superplek superplek
So -- this FORTRAN stuff -- it was specifically designed to boost legacy code?

Basically, yes; mostly FORTRAN 77 code that no-one wants to touch. There's heaps of that stuff in the HPC / scientific computation community (which were the first to be interested in vectorization). In fact, the autovectorization stuff is mostly based on 80s (and late 70s) research on compiling code for vector machines - note that's not SIMD as today; these would just have a FP alu in the middle and load/store units on either side, all pipelined and designed to make sure the FP unit does useful work every single clock cycle. E.g. you'd have things like a "vector double add" instruction with a count that would just add two arrays of N doubles in N+<small overhead> cycles - guaranteed.

Anyway, those vector machines were supercomputers, so the compilers for them are designed for code that you'd run on supercomputers. :) Which is mostly linear algebra on huge matrices/vectors of floats or doubles.

So autovectorization (and a lot of other loop optimizations too, by the way, like blocking!) algorithms are great for code that implements:

  • Multiplication of large dense matrices
  • Multiplication of large vectors by large matrices
  • Dense linear solvers.
  • Sparse linear solvers for system with simple structure (i.e. band-diagonal or block band-diagonal systems).
  • and so on.

They're not at all useful for:

  • Code that deals with variable-sized data
  • Code that has heterogeneous data of any kind
  • Irregular data-flow patterns
  • Code with branches (unless they can be easily turned into predication)
  • Making a demo about it.
added on the 2011-06-30 21:04:38 by ryg ryg
Good concept. But whats missing (from just scanning the pages) is a dynamic compilation/execution helper API

Only option to generate SIMD core at runtime is A:
- use DX Warp (limited functionality)
- use openCL (still half baked)
- license / or write your own compiler

This 5 meg compiler + a runtime API would be my first choice.
added on the 2011-06-30 21:28:35 by T21 T21
It's using LLVM.

If you want your runtime SIMD code generator, use their frontend and then link against LLVM. Tada. :)
added on the 2011-07-01 03:02:52 by ryg ryg