Fast software rasteriser in JavaScript?

Just for good measure, I did a C++ port of the ryg6 version (line by line, preserving variable types etc.), and compared rendering 1000 frames in the C++ version and the JavaScript version (I removed the putImageData() part to measure just the raw JavaScript/C++ performance):

g++ -O3: 6.7 ms/frame
Chrome: 22 ms/frame
FF: 68 ms/frame
Opera: 68 ms/frame

So, obviously g++ -O3 is faster than JavaScript (surprised?). But hey! I still think that when we're "only" 3x-10x slower than compiled C++ code (with full optimizations turned on), it's not really THAT bad, is it?
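
A minimal sketch of this kind of per-frame timing, for anyone who wants to repeat the measurement. `renderFrame` is a hypothetical stand-in for the rasteriser's inner loop (it is not the ryg6 code), and the frame count and buffer size are arbitrary. The checksum is there so that the work has an observable result.

```javascript
// Hypothetical stand-in for one frame of rasteriser work (not the ryg6 code).
function renderFrame(buffer, frame) {
  let checksum = 0;
  for (let i = 0; i < buffer.length; i++) {
    buffer[i] = (i * 31 + frame) & 0xff;
    checksum = (checksum + buffer[i]) | 0;
  }
  return checksum;
}

// Run `frames` frames and return the average time per frame in milliseconds.
function benchmark(frames, width, height) {
  const buffer = new Uint8Array(width * height * 4); // RGBA bytes
  let sink = 0; // consume every result so the loop has a visible effect
  const start = Date.now();
  for (let f = 0; f < frames; f++) {
    sink = (sink + renderFrame(buffer, f)) | 0;
  }
  const elapsedMs = Date.now() - start;
  return { msPerFrame: elapsedMs / frames, sink: sink };
}

const result = benchmark(100, 64, 64);
console.log(result.msPerFrame.toFixed(3) + " ms/frame (checksum " + result.sink + ")");
```

Averaging over many frames also smooths out JIT warm-up, which a single-frame measurement would not.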
added on the 2012-04-25 17:18:13 by marcus256 marcus256
NOTE: I didn't actually output the result from the C++ version, so I don't know if it's correct (though I've seen similar figures for other similar comparisons that I've made, so they should be about right at least).
added on the 2012-04-25 17:22:22 by marcus256 marcus256
Ryg: for the case where the block is not fully inside, it can be better not to execute all the comparisons, and instead do some extra computation and mask each pixel when you write it.
This pays off with SSE, though maybe not in JavaScript.
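
A rough sketch of what "mask the pixel when you write" could look like in scalar JavaScript. The function name, the parameter layout and the assumption that all edge values fit in 32-bit integers are mine, not taken from the rasteriser in this thread.

```javascript
// Fill one row of a partially covered block. w0..w2 are the edge values at the
// first pixel of the row and dx0..dx2 their per-pixel steps (32-bit integers).
function fillRowMasked(pixels, offset, count, color, w0, w1, w2, dx0, dx1, dx2) {
  for (let x = 0; x < count; x++) {
    // -1 (all bits set) if any edge value is negative, otherwise 0.
    const outside = (w0 >> 31) | (w1 >> 31) | (w2 >> 31);
    // Select the new colour or the old pixel without branching.
    pixels[offset + x] = (color & ~outside) | (pixels[offset + x] & outside);
    w0 = (w0 + dx0) | 0;
    w1 = (w1 + dx1) | 0;
    w2 = (w2 + dx2) | 0;
  }
}

const row = new Uint32Array(4).fill(0x11111111);
// Edge 0 goes 1, 0, -1, -2 across the row; the other two edges stay positive.
fillRowMasked(row, 0, 4, 0xff00ff00, 1, 10, 10, -1, 0, 0);
console.log(Array.from(row, (p) => p.toString(16)));
```

Every pixel is read and written whether it is covered or not, which is the extra work being traded for the missing branch.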
the problem is that this all still goes through some magic black box, subject-to-change JIT. i tried to clean up the algorithm without doing any low-level hackery (with the exception of the "or trick", which is really more a case of "throw it in" than a legitimate optimization when there's a distinct possibility that the variables are actually kept as doubles...).
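
For readers who have not seen it, a small illustration of the "or trick" and of why it only holds for values in 32-bit range. The function names are made up for this example.

```javascript
// Reference test: three separate comparisons.
function insideThreeCompares(a, b, c) {
  return a >= 0 && b >= 0 && c >= 0;
}

// "Or trick": OR-ing 32-bit integers keeps any set sign bit,
// so a single comparison covers all three values.
function insideOrTrick(a, b, c) {
  return (a | b | c) >= 0;
}

console.log(insideThreeCompares(3, 0, 12), insideOrTrick(3, 0, 12));   // true true
console.log(insideThreeCompares(3, -1, 12), insideOrTrick(3, -1, 12)); // false false

// Bitwise operators convert their operands to int32 first. A value that is
// held as a double and lies outside that range wraps, and the trick breaks:
const big = 2147483648; // 2^31, does not fit in a signed 32-bit integer
console.log(insideThreeCompares(big, 0, 0), insideOrTrick(big, 0, 0)); // true false
```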

if you're targeting native code, there's tons of stuff you can do - the biggest single one being that this algorithm is really easy to SIMDify (at which point you need to do the masking). of course you can also tinker around with block size etc. to figure out which is best (also depends on the SIMD width - if the number of pixels in a block matches your SIMD width, that's convenient for obvious reasons).

more importantly (and tricky to do in JS), you need to avoid int overflows. even for fairly small triangle sizes and amount of subpixel correction, you need to do 64-bit arithmetic for that, at least at the top level. (the pixel-level tests easily fit in 32 bits and won't have overflow issues if you have some bigger block-size trivial reject/accept test before them)
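
To put numbers on the overflow point, here is a sketch with an assumed 8 bits of subpixel precision and a triangle edge only 200 pixels long. The edge-function form and all names are illustrative. `Math.imul` and `BigInt` are features of current JavaScript engines, used here to imitate 32-bit and 64-bit integer arithmetic.

```javascript
const SUBPIXEL_BITS = 8; // assumed precision, purely for illustration
const toFixedPoint = (px) => px << SUBPIXEL_BITS;

// Edge function with plain doubles (exact while the magnitude stays below 2^53).
function edgeDouble(x0, y0, x1, y1, px, py) {
  return (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0);
}

// Same formula with 32-bit wraparound, as a C `int` would compute it.
function edgeInt32(x0, y0, x1, y1, px, py) {
  return (Math.imul(x1 - x0, py - y0) - Math.imul(y1 - y0, px - x0)) | 0;
}

// Same formula in 64-bit integers via BigInt.
function edgeInt64(x0, y0, x1, y1, px, py) {
  const n = BigInt;
  const value = (n(x1) - n(x0)) * (n(py) - n(y0)) - (n(y1) - n(y0)) * (n(px) - n(x0));
  return BigInt.asIntN(64, value);
}

const far = toFixedPoint(200); // 51200 subpixel units
const args = [0, 0, far, 0, 0, far]; // edge (0,0)-(200,0), sample point (0,200)

console.log(edgeDouble(...args)); // 2621440000
console.log(edgeInt32(...args));  // -1673527296, the sign has flipped
console.log(edgeInt64(...args));  // 2621440000n
```

With the sign flipped, the 32-bit version would classify the point as being on the wrong side of the edge.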
added on the 2012-04-25 19:06:11 by ryg ryg
ryg: you talk about SIMD optimization (which is something I have never tried, but I know how it is done).

Is there any C/C++ compiler that can work out where this kind of low-level optimization applies and convert some for/while loops to SIMD instructions? (I think Intel did some research on this years ago, but I'm not sure.)

I know doing it yourself will always give better results, but I think optimizing code this way can sometimes reduce portability and readability a lot (and it requires some ASM skills). Having a compiler that can do it automatically would be a good trade-off.
added on the 2012-04-25 19:24:22 by Tigrou Tigrou
there's lots of autovectorization research from the 80s onwards, mostly focused on optimizing FORTRAN linear algebra code (the stuff that supercomputers spend most of their time running). all of this has fairly strong constraints on the structure of the code - it's hard to reorder computations for autovectorization while simultaneously respecting exact serial semantics of the original code.
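
One small, concrete instance of the "exact serial semantics" problem, offered as an illustration rather than as anything specific to this rasteriser: floating-point addition is not associative, so summing in SIMD-style lanes can give a different answer than the loop as written. The data and function names are made up.

```javascript
// Sum in source order, the way the serial loop is written.
function sumSerial(values) {
  let s = 0;
  for (const v of values) s += v;
  return s;
}

// Sum the way a `lanes`-wide vector unit would: lane i accumulates every
// lanes-th element, and the lanes are added together at the end.
function sumLanes(values, lanes) {
  const acc = new Float64Array(lanes);
  for (let i = 0; i < values.length; i++) acc[i % lanes] += values[i];
  let s = 0;
  for (const a of acc) s += a;
  return s;
}

const data = [1e16, 1, -1e16, 1];
console.log(sumSerial(data));   // 1
console.log(sumLanes(data, 2)); // 2
```

A compiler that has to reproduce the serial result bit for bit cannot make this transformation on its own.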

if you want the "instant SIMD, just add water" kind, try ISPC (http://ispc.github.com). the ISPC source language has the right semantics to compile to efficient SIMD code even without magic pixie dust. :) it's basically a shader compiler for CPUs (and nicely shows how big an impact the execution model of a language can make). this is done by people at intel and currently only targets x86 CPUs, but it's LLVM-based and open-source with a BSD license. codegen isn't perfect, but it's decent (as long as you mostly stick with 32-bit types), and it's much easier (and more convenient) to use than intrinsics.
added on the 2012-04-25 20:07:24 by ryg ryg
Yeah, doing automatic SIMD optimizations in languages such as C++ and JavaScript is really difficult (and will probably never be possible to do in a satisfactory way). A more likely scenario is that we get some sort of language extension (e.g. River Trail), a new API (e.g. WebCL), or simply find ways to utilize multi-core parallelism with JavaScript workers, for instance. ...or all of the above ;)

In any event - I think the future of JS is quite exciting, and even the current state gives us quite a powerful tool (we might not get maximum "native" performance, but at least we don't have to be afraid of trying...).
added on the 2012-04-25 22:17:38 by marcus256 marcus256
Congrats, folks.

You are scratching the surface of what has been possible using the Matrox Mystique Graphic card.

That thing did that job in 1996 (12 years ago). The performance figures you are able to achieve using your JavaScript rasterizers are quite comparable. Except that you don't have any kind of texture mapping.

added on the 2012-04-25 22:38:55 by torus torus
Painting the hallway through the letterbox? and why not good sir? :)

I for one love these threads and the nuggets of demostyle goodness within them
great work ryg, 1337 optimizing skills.
added on the 2012-04-25 23:54:25 by Skate Skate
yea, the virtual machine rule-of-thumb "10x slower than native code" still stands as it seems :-)

torus: I've recently ported that "rasterizer thingie" to a 16€ cortex-m4 board and it runs quite well, by the way :D
added on the 2012-04-25 23:57:58 by xyz xyz
@torus: still living in 2008, aren't we? ;-)
And the Matrox "Miststück" (Mystique), of all things, as the reference?
The S3 "Würg" (ViRGE) would have done just as well ;-)
added on the 2012-04-25 23:58:00 by RufUsul RufUsul
Oh hi xyz..

That "rasterizer thingie" came to my mind a few weeks ago. Remember our code generation things with the DM6446? Now i found out that the mysterious VICP module is in fact a second ARM9 core running at full clock speed with access to the DMA, very tightly coupled to the DSP, twice the bandwidth to the L2 cache. Only drawback: It has only 12kb of static RAM for code.

If TI had told us so instead of keeping it a secret, we could have had at least twice the performance for 3D rasterization *and* we would have been able to use your ARM JIT.

Stupid folks..

I use that thing on the OMAP3 as a dma-engine with arithmetic, e.g. copy stuff from a to b and do calculations on the fly.
added on the 2012-04-26 07:45:08 by torus torus
"Just for good measure, I did a C++ port of the ryg6 version"

I'm bored. Share this code. :)
added on the 2012-04-26 08:51:04 by doomdoom doomdoom
@doomdoom, sure: https://gist.github.com/2500329

Update: The original version had a bug. This version should be correct (added a TGA exporter to check the result). The previous figures were slightly wrong. The correct results are (fastest first):

g++ -O3: 9.8 ms/frame
g++ -O2: 11 ms/frame
g++ -O1: 12.5 ms/frame
Chrome: 22 ms/frame
g++ (no opt): 51 ms/frame
FF: 68 ms/frame
Opera: 68 ms/frame

...so, I guess the differences are even less between C++ and JavaScript than I first thought. ;)
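
For reference, the slowdown factors implied by the corrected figures, relative to g++ -O3. This only does the arithmetic on the numbers listed above: Chrome comes out at roughly 2.2x, and FF and Opera at roughly 6.9x.

```javascript
// Corrected timings from the list above, in ms per frame.
const msPerFrame = {
  "g++ -O3": 9.8,
  "g++ -O2": 11,
  "g++ -O1": 12.5,
  "Chrome": 22,
  "g++ (no opt)": 51,
  "FF": 68,
  "Opera": 68,
};

const baseline = msPerFrame["g++ -O3"];
const slowdown = {};
for (const [name, ms] of Object.entries(msPerFrame)) {
  slowdown[name] = ms / baseline;
  console.log(name.padEnd(14) + slowdown[name].toFixed(1) + "x");
}
```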
added on the 2012-04-26 17:30:46 by marcus256 marcus256
It's not like that C++ code has any kind of optimization, or even the common sense *not* to write (no, store) bytes one at a time into a std::vector. The only thing remotely fashionable is that filler lifted from Devmaster.

Not exactly a useful comparison :)
added on the 2012-04-26 18:41:42 by superplek superplek
are you on a 64-bit OS? if so, you really want to compile this one for 64-bits, because it sure can use the extra registers. (and if your browser is 64-bit, its jit does get to use them).

plek: this is not nicks half-assed filler from devmaster anymore that mr. doob started with (ignore the comment). this is the good stuff. :)

anyway, it's totally fair not to combine writes etc. since the javascript code doesn't either.
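
For anyone wondering what "combining writes" would mean on the JavaScript side, a sketch assuming an RGBA byte layout like the one canvas ImageData uses. The buffer size and function names are invented for the example, and neither the JS rasteriser nor the C++ port in this thread is claimed to do this.

```javascript
const WIDTH = 4;
const HEIGHT = 1;
const buffer = new ArrayBuffer(WIDTH * HEIGHT * 4);
const bytes = new Uint8ClampedArray(buffer); // one element per colour channel
const words = new Uint32Array(buffer);       // one element per pixel, same memory

// Typed arrays use the platform byte order, so detect it once.
const littleEndian = new Uint8Array(new Uint32Array([1]).buffer)[0] === 1;

// Four separate byte stores per pixel.
function writeSeparate(i, r, g, b, a) {
  const o = i * 4;
  bytes[o] = r;
  bytes[o + 1] = g;
  bytes[o + 2] = b;
  bytes[o + 3] = a;
}

// One combined 32-bit store per pixel.
function writeCombined(i, r, g, b, a) {
  words[i] = littleEndian
    ? (a << 24) | (b << 16) | (g << 8) | r
    : (r << 24) | (g << 16) | (b << 8) | a;
}

writeSeparate(0, 10, 20, 30, 255);
writeCombined(1, 10, 20, 30, 255);
console.log(Array.from(bytes.subarray(0, 8)));
```

Both pixels end up with the same bytes in memory; whether the combined store is actually faster is up to the JIT.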
added on the 2012-04-26 18:51:25 by ryg ryg
I'll have to have a closer look at that filler then :)

my point still remains that to me it feels useless to compare anything against a more or less 1:1 C++ port that would never see actual use
added on the 2012-04-26 19:02:03 by superplek superplek
on the contrary, if you want to see how well the JS compiler does compared to a C++ compiler, that's the *only* thing that makes sense.
added on the 2012-04-26 19:37:40 by ryg ryg
oh absolutely, you'll get a factor for a reasonably arbitrary type of test :)

"so, I guess the differences are even less between C++ and JavaScript than I first thought"

but that doesn't make this statement very true though. all that means is "I can write a suboptimal native port in language x and then it's about this much faster"

oh well, nitpicking.
added on the 2012-04-26 20:36:14 by superplek superplek
(make that suboptimal just 1:1 instead)
added on the 2012-04-26 20:38:18 by superplek superplek
About comparing JS and C/C++, the Mozilla/Emscripten guys are basically doing what marcus256 did, but the other way around, and they see the same 3-8x ratio: they take open source C/C++ projects and cross-compile them to JavaScript.
added on the 2012-04-26 20:46:48 by p01 p01
Maybe you guys saw the JavaScript H264 and WEBM decoders. They can decode 480p videos at 30 fps on my 1.5 GHz EEE PC. If you ask me, that's pretty decent and can realistically be used as a fallback decoder.
added on the 2012-04-26 20:50:00 by p01 p01
I'm quite surprised that the non-optimized C++ version actually runs slower than Chrome's JavaScript version. C++ is actually a lot simpler (more constraining) than JavaScript and doesn't carry the many "runtime features" that JS does (garbage collector, array bounds checking, dynamic typing, ...).

Seems Google did a pretty good job with that V8 engine.
added on the 2012-04-26 22:14:10 by Tigrou Tigrou