Size-limited coding vs new shader models (Is raytracing even possible in 64k or below?)

category: code [glöplog]
Wouldn't using RTX for raytracing be just as lame as using DrawSphere() from DirectX/OpenGL?!

Using RTX for raytracing would be as lame as using DX or OpenGL or Vulkan when we could all stick to pure software rendering instead.

Oh wait...
added on the 2020-12-30 16:46:16 by keops keops
that's why I put in some extra sentence(s) :P
♥ you keops! :P

P.S.: I am turning away from PC these weeks, so I'll be back to software anyway soon! We'll see when that happens, but it will, some day! :)
Raytracing hardware isn't just for meshes, you can just as easily build an acceleration structure with a bunch of different SDFs in it, and use that to jump the empty space and speed up the ray marching a ton (since you'd only need the distance to objects in the local area). So it's potentially useful even at 4k level.

And yeah, you can do the same thing without that hardware, but it's a ton slower. Are you going to stick with that cheap looking lighting when everyone else is path tracing? :)
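A minimal sketch of what alia describes — sphere tracing, where each step safely skips the guaranteed-empty space around the ray. The scene and all constants here are made up for illustration; an accelerated version would only query the SDFs of objects near the ray instead of the whole list:

```python
import math

# Hypothetical scene: a couple of sphere SDFs, (center, radius).
SPHERES = [((0.0, 0.0, 5.0), 1.0), ((2.0, 0.0, 8.0), 1.0)]

def scene_sdf(p):
    # Distance to the nearest object. With an acceleration structure
    # you'd only evaluate the SDFs in the local area around p.
    return min(math.dist(p, c) - r for c, r in SPHERES)

def sphere_trace(origin, direction, max_steps=128, eps=1e-4, max_dist=100.0):
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        d = scene_sdf(p)
        if d < eps:
            return t  # hit
        t += d        # safe step: d units of space are guaranteed empty
        if t > max_dist:
            break
    return None       # miss
```

A ray fired straight down +Z from the origin hits the first sphere at t ≈ 4.0; the cost per step is one SDF evaluation, which is exactly what the acceleration structure would cheapen.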
added on the 2020-12-30 18:01:22 by alia alia
So what it boils down to is: Is it possible to write a small transform that changes DXIL or SPIR-V into a representation that's more compressible?

I can only throw in my decades-old experience with MIDI files here - they're pretty horrible to compress as-is, but just changing the structure into something an LZ+entropy packer (or a PPM variant like modern executable packers use) understands better can yield impressive gains. In the case of V2 I managed to get 20k zipped worth of .mid down to 4.? k compressed plus about 0.5k of completely unoptimized transform code that then fed a regular MIDI byte stream to the actual synth. Perhaps something like this is possible with shader bytecode, too.
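As a toy illustration of that kind of transform (the 4-byte event layout here is invented, not real MIDI): splitting an interleaved event stream into one stream per field changes nothing about the byte count, it only lines up the regular parts so an LZ/entropy stage can see them:

```python
# Hypothetical fixed-size events: (delta_time, status, note, velocity).
def split_streams(events):
    # One stream per field instead of one interleaved stream.
    return tuple(bytes(e[i] for e in events) for i in range(4))

def merge_streams(streams):
    # Exact inverse: re-interleave the per-field streams.
    return [tuple(bs) for bs in zip(*streams)]

events = [(6, 0x90, 60 + (i % 8), 100) for i in range(64)]
interleaved = bytes(b for e in events for b in e)
split = b"".join(split_streams(events))

assert merge_streams(split_streams(events)) == events  # lossless
assert len(split) == len(interleaved)                  # same size pre-compression
```

The status and velocity streams here are runs of a single byte — trivial for any compressor — while interleaved they'd sit four bytes apart.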

As for allowing the shader compilers to be installed on compo machines - it's sadly a bit more complicated this time because, as far as I know, there are no standard installers like the old DirectX 9 ones. Hm.
added on the 2020-12-30 19:08:08 by kb_ kb_
And yeah, you can do the same thing without that hardware, but it's a ton slower

hardware RTX is BVH intersection

if you rewrite it as compute shaders (like UE5 does) then you can use very fast raytracing anywhere

for myself, I think "screen space culling" of SDFs plus a simple voxel marcher for optimization is more than enough to get 30+ FPS in complex realtime SDF raytracing scenes, even on a 2015-era GPU
(my code for tests like this is about 200 KB)
added on the 2020-12-30 19:09:12 by Danilw Danilw
you do understand, we are talking <=64k and <=4k here?
just asking!
nothing wrong with sharing my experience
don't be so closed-minded and aggressive
added on the 2020-12-30 19:49:54 by Danilw Danilw
Is it possible to write a small transform that changes DXIL or SPIR-V into a representation that's more compressible

I know nothing about compression, but some 4 years ago I made this: https://github.com/aras-p/smol-v which basically transforms SPIR-V into another form that is smaller by itself, but also compressed with regular compressors better.

The largest compression offender is SPIR-V variable identifiers, which are all unique numbers, so even shader sequences that do similar operations end up referring to different variable IDs. Even something trivial like delta-encoding each ID from the previous one was the main win, IIRC. That can be done by a very small piece of code, and then you let kkrunchy et al. handle the rest.
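The delta trick in one screenful — a hedged sketch, not smol-v's actual encoder. Two shaders doing the same operations with disjoint ID ranges become identical after the first delta, which is exactly what a dictionary compressor wants to see:

```python
def delta_encode(ids):
    # Store each ID as its difference from the previous one.
    prev, out = 0, []
    for i in ids:
        out.append(i - prev)
        prev = i
    return out

def delta_decode(deltas):
    # Running sum undoes the encoding exactly.
    prev, out = 0, []
    for d in deltas:
        prev += d
        out.append(prev)
    return out

# Two "shaders" with the same shape but different ID ranges:
a = list(range(100, 130))
b = list(range(500, 530))
assert delta_encode(a)[1:] == delta_encode(b)[1:]  # identical after the first delta
assert delta_decode(delta_encode(a)) == a
```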

For DXIL, it's a container format with LLVM-based bitcode. That by itself is also not very compressible. Someone should invent a similar transformation for it too :)
added on the 2020-12-30 19:51:27 by NeARAZ NeARAZ
Ah, good old delta encoding. That also carried the bulk of the work in my case :)

Just want to add that "smaller by itself" isn't necessarily a requirement when it comes to intros - after my transform the 178k .mid file for fr-08 ended up at over 300k but with the structure of the music laid bare for the compressor to see.
added on the 2020-12-30 20:03:30 by kb_ kb_
Yeah. SPIR-V is *very* regular; everything is 4 bytes etc., which makes it nice for compression. Just these variable/value IDs that are changing across the whole shader are messing things up -- that would be the first thing to delta encode, and then the rest is perhaps not terribly bad anymore.
added on the 2020-12-30 20:15:23 by NeARAZ NeARAZ
One could possibly look at whether the compiler usually only uses a small subset of instructions/modes etc. (or whether there are discernible patterns in how it uses those instructions) and then separate the dwords by semantics and apply some delta or move-to-front encoding to the separate streams.
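For reference, move-to-front in a few lines (a generic textbook version, not anything tuned for SPIR-V): recently seen symbols map to small indices, so a repetitive ID stream turns into mostly-small numbers that an entropy coder handles well:

```python
def mtf_encode(stream):
    table = sorted(set(stream))  # any initial table both sides agree on works
    out = []
    for s in stream:
        i = table.index(s)
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    table = sorted(alphabet)
    out = []
    for i in codes:
        s = table.pop(i)
        out.append(s)
        table.insert(0, s)
    return out

stream = [7, 7, 7, 3, 3, 7, 9, 9]
codes = mtf_encode(stream)
assert mtf_decode(codes, set(stream)) == stream  # lossless round trip
```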

But: just spitballing here. I'm probably not going to do anything with Vulkan anytime soon, so, grain of salt and stuff. :)
added on the 2020-12-30 20:22:32 by kb_ kb_
ah NeARAZ I forgot you're on here :)

I've been looking at shader compression and ripping SMOL-V apart. Thanks for putting it together; especially the test setup has been very helpful. I've expanded it on a fork and added my experimental format + our 64k packer to the matrix which has been really cool.

I don't have terribly good numbers yet; only slightly better than SMOL-V on individual shaders but it's a bit behind in aggregate, and things don't work all the way (I haven't tested anything at runtime, just hacking around with rearranging+packing). It's all a bit experimental, so take the following with more grains of salt :)

Some notes about what I've done so far:
- Instruction word lengths can mostly be dropped. (SMOL-V does some stuff here as well)
- Instead of delta coding result IDs, I've looked into dropping them entirely by reordering all IDs to start from 1 and increment, and using an LRU cache for referring to them for arguments. This seems to help quite a bit but I don't have it working entirely because I need accurate argument information for all instructions to actually get the ID reassignments right.
- Splitting streams also seems to be effective, but as usual, you need to be careful about how it's done and which data goes together in which streams. I don't have anything concrete to add beyond trying a bit randomly to find correlations; I hope this becomes clearer as the experiments get further.
- I think varint + zigzag is ultimately hurting compression performance. I don't have numbers on this but my intuition is that you want to keep certain kinds of "records" the same size throughout the data in order for the compressor to pick up on patterns better, and varint will certainly hurt this.
- Similarly, I think spirv-remap is also hurting performance. With the result IDs dropped (that's the part it hurts most, I think) and the arguments properly in a cache, I think code will be regularized between shaders better than it would be otherwise.
- Also similarly, it should be possible to regularize types and capabilities, and omit them from the encoded representation. Each shader can just get the same block of definitions upfront, which would both be smaller (nothing to encode) as well as normalize type identifiers (which should further normalize code between shaders).
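A much-simplified sketch of the result-ID idea above (this uses a plain "distance back" reference, not ferris's actual LRU cache, and the instruction format is invented): if every result ID is just the next counter value in definition order, it never has to be stored at all, and argument references become small offsets that repeat across similar code:

```python
def encode(instrs):
    # instrs: list of (opcode, result_id, [arg_ids]), with result IDs
    # assumed already renumbered 1, 2, 3, ... in definition order.
    out = []
    for op, res, args in instrs:
        # Result ID omitted entirely; each argument stored as
        # "how many results ago it was defined".
        out.append((op, [res - a for a in args]))
    return out

def decode(encoded):
    out, next_id = [], 1
    for op, dists in encoded:
        out.append((op, next_id, [next_id - d for d in dists]))
        next_id += 1
    return out

prog = [("Load", 1, []), ("Load", 2, []),
        ("FAdd", 3, [1, 2]), ("FMul", 4, [3, 3])]
assert decode(encode(prog)) == prog  # IDs reconstructed for free
```

Note how a second copy of this code with IDs 5..8 would encode to exactly the same bytes — the normalizing effect shown in the WIP dump below comes from the same principle.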

Both my experiments and SMOL-V so far have performed about 3-4x worse than minifying GLSL on a corpus of all the shaders used in our last 64k. I'm confident we can relatively easily get to the 2x worse range or so; beyond that it's going to be difficult, but I'm hopeful!
added on the 2020-12-30 23:08:16 by ferris ferris
Skipping/LRU sounds pretty much equivalent to move-to-front encoding - the result would always end up in slot 0 and can be omitted. Perhaps a modified MTF scheme? Like, evaluate all arguments and then MTF them (but only if they're getting referenced by later instructions)?
added on the 2020-12-30 23:50:19 by kb_ kb_
Indeed it's equivalent to MTF, but there are some additional insertions for values that aren't explicit in the stream (eg new result IDs) and "front" may be relative. So far it seems best to always insert to the very front but I've seen cases where "near the front" works better or some kind of dynamic insertion/update scheme may work nicely as you suggest, so I'm trying to stay open-minded.
added on the 2020-12-30 23:54:34 by ferris ferris
here's some WIP output showing the normalizing effect of such a cache:
Code:
0x003d: 0x0000000c 0x00000000 (was 0x00000081) 0x00000004 (was 0x00000012) (Load)
0x008e: 0x0000000c 0x00000000 (was 0x00000082) 0x00000000 (was 0x00000081) 0x000000ff (was 0x00000029) (VectorTimesScalar)
0x003e: 0x0000000b (was 0x00000083) 0x00000082 (Store)
0x0039: 0x00000006 0x00000000 (was 0x00000084) 0x00000010 0x00000000 (was 0x00000083) (FunctionCall)
0x0088: 0x00000006 0x00000000 (was 0x00000085) 0x00000000 (was 0x00000084) 0x00000003 (was 0x00000029) (FDiv)
0x003d: 0x00000006 0x00000000 (was 0x00000086) 0x00000007 (was 0x0000007d) (Load)
0x0081: 0x00000006 0x00000000 (was 0x00000087) 0x00000000 (was 0x00000086) 0x00000002 (was 0x00000085) (FAdd)
0x003e: 0x00000003 (was 0x0000007d) 0x00000087 (Store)
0x003d: 0x0000000c 0x00000000 (was 0x00000088) 0x00000009 (was 0x00000012) (Load)
0x008e: 0x0000000c 0x00000000 (was 0x0000008a) 0x00000000 (was 0x00000088) 0x000000ff (was 0x00000089) (VectorTimesScalar)
0x003e: 0x00000012 (was 0x0000008b) 0x0000008a (Store)
0x0039: 0x00000006 0x00000000 (was 0x0000008c) 0x00000010 0x00000000 (was 0x0000008b) (FunctionCall)
0x0088: 0x00000006 0x00000000 (was 0x0000008d) 0x00000000 (was 0x0000008c) 0x00000003 (was 0x00000089) (FDiv)
0x003d: 0x00000006 0x00000000 (was 0x0000008e) 0x00000007 (was 0x0000007d) (Load)
0x0081: 0x00000006 0x00000000 (was 0x0000008f) 0x00000000 (was 0x0000008e) 0x00000002 (was 0x0000008d) (FAdd)
0x003e: 0x00000003 (was 0x0000007d) 0x0000008f (Store)
0x003d: 0x0000000c 0x00000000 (was 0x00000090) 0x00000009 (was 0x00000012) (Load)
0x008e: 0x0000000c 0x00000000 (was 0x00000092) 0x00000000 (was 0x00000090) 0x000000ff (was 0x00000091) (VectorTimesScalar)
0x003e: 0x00000019 (was 0x00000093) 0x00000092 (Store)
0x0039: 0x00000006 0x00000000 (was 0x00000094) 0x00000010 0x00000000 (was 0x00000093) (FunctionCall)
0x0088: 0x00000006 0x00000000 (was 0x00000095) 0x00000000 (was 0x00000094) 0x00000003 (was 0x00000091) (FDiv)
0x003d: 0x00000006 0x00000000 (was 0x00000096) 0x00000007 (was 0x0000007d) (Load)
0x0081: 0x00000006 0x00000000 (was 0x00000097) 0x00000000 (was 0x00000096) 0x00000002 (was 0x00000095) (FAdd)
0x003e: 0x00000003 (was 0x0000007d) 0x00000097 (Store)
0x003d: 0x0000000c 0x00000000 (was 0x00000098) 0x00000009 (was 0x00000012) (Load)
0x008e: 0x0000000c 0x00000000 (was 0x0000009a) 0x00000000 (was 0x00000098) 0x000000ff (was 0x00000099) (VectorTimesScalar)
0x003e: 0x00000020 (was 0x0000009b) 0x0000009a (Store)
0x0039: 0x00000006 0x00000000 (was 0x0000009c) 0x00000010 0x00000000 (was 0x0000009b) (FunctionCall)
0x0088: 0x00000006 0x00000000 (was 0x0000009d) 0x00000000 (was 0x0000009c) 0x00000003 (was 0x00000099) (FDiv)
0x003d: 0x00000006 0x00000000 (was 0x0000009e) 0x00000007 (was 0x0000007d) (Load)
0x0081: 0x00000006 0x00000000 (was 0x0000009f) 0x00000000 (was 0x0000009e) 0x00000002 (was 0x0000009d) (FAdd)
0x003e: 0x00000003 (was 0x0000007d) 0x0000009f (Store)

All of the IDs here are forced to zero due to being perfectly predictable (they're actually omitted from the output stream but that's not shown here). load/store pointers should get a similar treatment, as should function call targets. But yeah, WIP. :)
added on the 2020-12-30 23:56:55 by ferris ferris
correction: all of the *result IDs
added on the 2020-12-30 23:57:40 by ferris ferris
still considered an estimate until a working decoder exists, but I've passed smol-v on our 64k shaders (both compressed with squishy in this case):
Code:
SmolV  26.8KB ( 27434B) 3.9%
splt-v 24.7KB ( 25243B) 3.6%

minified GLSL of the same (except for compute+geom shaders which are left raw due to lack of minifier support) is 11665 bytes, so we're still just over a factor of 2 off, but again, I'm hopeful.

Anyways, fuck 2020, hope to have more concrete updates next year :)
added on the 2020-12-31 16:27:11 by ferris ferris