Bad compression ratio for assembler code under crinkler?

category: general [glöplog]

While trying to optimize some 4k intro code by porting it to assembler, i'm facing a strange result.

It seems that even if assembler gives you a 10-20% gain in uncompressed size, the gain in compressed size is sometimes only 5% over the C++ version, and in some cases the asm version even compresses worse! I ran this test on several functions (except intrinsics and small functions like memcpy, which are worth coding in asm) and got the same results...

I did a test on a simple function, so it probably won't reflect what happens for an entire intro developed in asm.
For example, take the simple code of initializing a vertex shader / fragment shader in C++:

Code:

void __forceinline InitProcAdresses() { 
    for(int i = 0; i < 12; i++)  
        procAdresses[i] = GETPROCADRESS(&procAdressesNames[i][0]); 
} 
GLuint ShaderCompile(const char *vsh, const char *psh) 
{ 
    InitProcAdresses(); 
    GLuint shader = glCreateProgram();         
    GLuint s = glCreateShader(GL_VERTEX_SHADER); 
    glShaderSource(s, 1, (const GLchar**)(&vsh), NULL); 
    glCompileShader(s); 
    glAttachShader(shader,s); 
    s = glCreateShader(GL_FRAGMENT_SHADER);     
    glShaderSource(s, 1, (const GLchar**)(&psh), NULL); 
    glCompileShader(s); 
    glAttachShader(shader,s); 
    glLinkProgram(shader); 
    return shader; 
}

The equivalent version in assembler (i didn't test it, the purpose is to have an idea of the size of the code)

Code:


GLuint __declspec( naked ) ShaderCompile(const char *vsh, const char *psh) { 
    __asm { 
        push ebp; 
        push esi; 
        push edi; 
getproc: 
        mov esi, procAdresses; 
        mov edi, procAdressesNames; 
        push edi; 
        call GETPROCADRESS; 
        mov [esi], eax; 
        add esi, 4; 
        add edi, 24; 
        cmp edi, procAdressesNames + 12*22; 
        jl getproc; 
 
        // GLuint shader = glCreateProgram();         
        mov esi, procAdresses; 
        lodsb;  
        call eax; 
        mov ebp, eax;                            // [shader = EBP] 
 
        // GLuint s = glCreateShader(GL_VERTEX_SHADER); [s = EDI] 
        push GL_VERTEX_SHADER; 
        lodsb;  
        call eax; 
        mov edi, eax; 
 
        // glShaderSource(s, 1, (const GLchar**)(&vsh), NULL); 
        push 0; 
        lea eax, [esp + 16]; 
        push eax; 
        push 1; 
        push edi; 
        lodsb;  
        call eax; 
 
        // glCompileShader(s); 
        push edi; 
        lodsb;  
        call eax; 
         
        // glAttachShader(shader,s); 
        push edi; 
        push ebp; 
        lodsb;  
        call eax; 
 
        // Restart procAdresses 
        mov esi, procAdresses; 
 
        // glCreateShader(GL_FRAGMENT_SHADER);     
        push GL_FRAGMENT_SHADER; 
        lodsb;  
        call eax; 
        mov edi, eax; 
 
        // glShaderSource(s, 1, (const GLchar**)(&psh), NULL); 
        push 0; 
        lea eax, [esp + 20]; 
        push eax; 
        push 1; 
        push edi; 
        lodsb;  
        call eax; 
 
        // glCompileShader(s); 
        push edi; 
        lodsb;  
        call eax; 
         
        // glAttachShader(shader,s); 
        push edi; 
        push ebp; 
        lodsb; 
        call eax; 
 
        // glLinkProgram(shader); 
        push ebp; 
        lodsb;  
        call eax; 
 
        pop edi; 
        pop esi; 
        pop ebp; 
        ret; 
    }; 
}

The fact is that under crinkler the C++ version (compiled with VC2008) compresses better:

Code:


                       Uncompressed       Compressed
ShaderCompile ASM        126 bytes         73 bytes
ShaderCompile C++        143 bytes         67 bytes

My questions are:
- Why such a result? Is it because the code is too small, and would we get better results on larger code? Since i'm not an x86-killer, are there really some asm tips & tricks to use for smaller code?
- Is it really worth using assembler then? I don't think that 5% makes the difference for an intro.

What do you think? What is your practice with this?

added on the 2009-05-02 13:07:59 by xoofx

Ace: yeah that's not good x86 asm :D

In ASM you can schedule and group the instructions so the compression ratio is better. You can also assume that all Windows API functions keep the contents of the EBX, ESI & EDI registers intact, which can help with even more optimisations.

added on the 2009-05-02 13:28:32 by hitchhikr

Shouldn't that be:

lodsd
call eax

?

added on the 2009-05-02 15:22:37 by parcelshit

I didn't even notice that it was for Linux (but i assume there must be some preserved registers as well).

added on the 2009-05-02 15:31:56 by hitchhikr

Well it looks based on the source of a certain linux intro which is a constant source of curiosity, it seems ;)

I guess he's talking about Windows though since Crinkler is in the topic.

added on the 2009-05-02 15:35:23 by parcelshit

you're using lots of instructions that the compiler either doesn't use (push with immediate operand, lods*) or uses infrequently (call eax), which means the compressor needs some bytes to adapt to the new opcode distribution at the start of the function, and some more at the end to adapt to the c++ code again. packers don't compress bytes individually, it's all about context.

besides, i haven't checked the compiled function, but i'm pretty certain that it's a lot more regular than your code, which makes it bigger but easier to compress.
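To make the point about context concrete, here is a toy sketch. It is not how crinkler works (crinkler uses context mixing, which is far more elaborate); it is only an adaptive order-0 byte model, and the two byte patterns are hypothetical stand-ins for "compiler-style" and "hand-written-style" code. It estimates how many bits a chunk costs once the model has already adapted to what came before.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Estimated size in bits of `data` under a toy adaptive order-0 model:
// every byte value starts with a count of 1, each byte costs -log2(p)
// with p = count / total, and its count is bumped after it is coded.
double adaptive_cost_bits(const Bytes& data) {
    std::vector<double> count(256, 1.0);
    double total = 256.0;
    double bits = 0.0;
    for (std::uint8_t b : data) {
        bits -= std::log2(count[b] / total);
        count[b] += 1.0;
        total += 1.0;
    }
    return bits;
}

// Returns `pattern` repeated `n` times.
Bytes repeat(const Bytes& pattern, int n) {
    Bytes out;
    for (int i = 0; i < n; ++i)
        out.insert(out.end(), pattern.begin(), pattern.end());
    return out;
}

// Bits needed for `tail` once the model has already adapted to `head`.
double cost_after(const Bytes& head, const Bytes& tail) {
    Bytes both = head;
    both.insert(both.end(), tail.begin(), tail.end());
    return adaptive_cost_bits(both) - adaptive_cost_bits(head);
}

// Hypothetical byte patterns, chosen only for illustration.
const Bytes compiler_style = {0x8B, 0x45, 0x08, 0x50, 0xE8};     // mov/push/call-like bytes
const Bytes handwritten_style = {0xAD, 0xFF, 0xD0, 0x6A, 0x57};  // lodsd/call eax/push-like bytes
```

With `head = repeat(compiler_style, 50)`, `cost_after(head, compiler_style)` comes out well below `cost_after(head, handwritten_style)`: the same number of bytes costs more when it does not match what the model has seen so far.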

that aside, seriously, your asm code is... not good. it's got lots of bugs for one thing (you're using 12*22 but add edi, 24 in the loop - how long are your strings, 22 or 24 bytes? it's lodsd to load a dword, not lodsb; vsh is in [esp+20] after the push 0, not in [esp+16], and similarly for psh with [esp+24]; you reset the pointer to procAddresses, but then it's pointing at glCreateProgram, not glCreateShader as you assume) and it's not particularly well size optimized either.
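The stack offsets in that list can be checked with a little arithmetic. A minimal sketch, assuming 32-bit code, offsets taken relative to ESP, and GL entry points that pop their own arguments (stdcall), so ESP is back at its post-prologue value before each new call is set up. The helper name is made up for illustration.

```cpp
// Byte offset from ESP to stack argument `arg_index` (0-based) of a 32-bit
// function, given how many dwords have been pushed since function entry.
// The leading 4 accounts for the return address.
constexpr int arg_offset(int arg_index, int dwords_pushed) {
    return 4 + 4 * dwords_pushed + 4 * arg_index;
}

// After push ebp / push esi / push edi (3 dwords):
static_assert(arg_offset(0, 3) == 16, "vsh right after the prologue");
// After the additional push 0 that starts the glShaderSource call:
static_assert(arg_offset(0, 4) == 20, "vsh once the NULL has been pushed");
static_assert(arg_offset(1, 4) == 24, "psh once the NULL has been pushed");
```

So the `lea` that follows `push 0` needs 20 for vsh and 24 for psh, each 4 more than the values used in the original listing.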

hitchhikr: look at the code, he does rely on windows preserving ebx, esi, edi and ebp.

anyway, i've tried out how small i can get it without changing the interface, this is the result (also completely untested, of course):

Code:

      push ebp;
      push esi;
      push edi;
      
      mov esi, procAddressesNames;
      mov edi, procAddresses;
      push edi;
      push 12; // update to match new function count here
      pop ebp;
getproclp:
      push esi;
      push eax;
      call GETPROCADDRESS;
      stosd;
      add esi, 24;
      dec ebp;
      jnz getproclp;
      
      pop esi;
      lodsd;
      call eax;
      xchg eax, edi; // edi=shader
      
      push -2;
      pop ebp;
      push esi;
shaderlp:
      mov esi, [esp];
      lea eax, [ebp+GL_VERTEX_SHADER+1];
      push eax;
      lodsd;
      call eax;
      
      push eax;
      push edi;
      push eax;
      push 0;
      push dword ptr [esp+44+ebp*4];
      push 1;
      push eax;
      
      lodsd;
      call eax;
      lodsd;
      call eax;
      lodsd;
      call eax;
      inc ebp;
      jnz shaderlp;
      
      pop eax;
      push edi;
      lodsd;
      call eax;
      xchg eax, edi;
      pop edi;
      pop esi;
      pop ebp;
      ret;

87 bytes uncompressed unless i miscounted somewhere.

added on the 2009-05-02 15:41:23 by ryg

hitchhikr: linux uses the same register preservation conventions as windows does, the only difference is that one requires the direction flag to be cleared on procedure entry while the other doesn't, i'm not certain which was which :)

added on the 2009-05-02 15:42:50 by ryg

I think windows needs a cld.

Nevertheless, both the C & ASM code have more to do with Linux than with Crinkler/Windows, as this approach would be very inefficient on the latter platform.

As for ASM, the ability to directly use a controlled/restricted set of instructions also helps to outmatch any C/C++ code in terms of compressed size. It just takes a little bit more time to craft.

added on the 2009-05-02 16:05:35 by hitchhikr

what on earth does crinkler and windows have to do with the efficiency of "C & ASM code"? crinkler doesn't give a shit, and nothing prevents you from doing a context mixing compressor on linux. i actually finished and debugged the kkrunchy 0.23a3 (and following versions) compressor under linux so i could use valgrind (handy to have if you're working with large models in a weird sizeoptimized depacker - it's very easy to accidentally read/write out of bounds).

using a restricted instruction set is not nearly as effective as one would expect because the x86 instruction encoding is quite irregular: e.g. xchg eax, <reg> is 1 byte while xchg <reg>, eax is 2 bytes (as are all forms of mov <reg>, <reg>); you can use signed 8-bit displacements on register-relative addressing, but if the register is esp, it's an entirely different encoding and 1 byte bigger; and so on. even if the instruction looks nearly identical in asm code, it can be entirely different on the opcode level. all these irregularities are why kkrunchys opcode reordering is a relatively large amount of code (>1k before compression); for RISC platforms with orthogonal instruction sets and few different instruction encodings (MIPS, ARM, SPARC), you could get the same effect at a fraction of the size.
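A few of these irregularities can be shown with a tiny encoder. This is a minimal sketch covering only three instruction forms, with helper names invented for illustration; the opcode bytes are the standard 32-bit encodings (90+r for `xchg eax, r32`, 8B /r for `mov r32, r/m32`).

```cpp
#include <cstdint>
#include <vector>

enum Reg : std::uint8_t { EAX = 0, ECX = 1, EDX = 2, EBX = 3, ESP = 4, EBP = 5, ESI = 6, EDI = 7 };

using Bytes = std::vector<std::uint8_t>;

// xchg eax, r32 has a dedicated one-byte form: 90+r.
Bytes xchg_eax(Reg r) {
    return { static_cast<std::uint8_t>(0x90 + r) };
}

// mov dst, src (register to register): opcode 8B plus a ModRM byte with mod=11.
Bytes mov_reg_reg(Reg dst, Reg src) {
    return { 0x8B, static_cast<std::uint8_t>(0xC0 | (dst << 3) | src) };
}

// mov dst, [base + disp8]: opcode 8B, ModRM with mod=01, then the displacement.
// With ESP as base, r/m=100 means "a SIB byte follows", which costs one extra byte.
Bytes mov_reg_mem8(Reg dst, Reg base, std::int8_t disp) {
    Bytes out = { 0x8B, static_cast<std::uint8_t>(0x40 | (dst << 3) | base) };
    if (base == ESP)
        out.push_back(0x24);  // SIB: no index register, base = esp
    out.push_back(static_cast<std::uint8_t>(disp));
    return out;
}
```

For example, `xchg_eax(EDI)` is the single byte 97, `mov_reg_reg(EDI, EAX)` is 8B F8, `mov_reg_mem8(EAX, EBP, 8)` is 8B 45 08 and `mov_reg_mem8(EAX, ESP, 8)` is 8B 44 24 08. Lines that look almost identical in the asm source end up with different lengths and different byte patterns.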

added on the 2009-05-02 16:39:16 by ryg

Quote:

i actually finished and debugged the kkrunchy 0.23a3 (and following versions) compressor under linux

pls to release :(

added on the 2009-05-02 16:47:42 by parcelshit

the compressor. it can still only pack PE executables, not ELFs.

added on the 2009-05-02 16:52:07 by ryg

Quote:

what on earth does crinkler and windows have to do with the efficiency of "C & ASM code"? crinkler doesn't give a shit.

I think you didn't get it, by efficiency i meant that Crinkler is handling the API functions loading by itself, something which is done manually in the code above and is (so far) required for Linux small sized prods but would be a waste under Windows in such context.

Quote:

using a restricted instruction set is not nearly as effective as one would expect because the x86 instruction encoding is quite irregular: e.g. xchg eax, <reg> is 1 byte while xchg <reg>, eax is 2 bytes (as are all forms of mov <reg>, <reg>); you can use signed 8-bit displacements on register-relative addressing, but if the register is esp, it's an entirely different encoding and 1 byte bigger; and so on. even if the instruction looks nearly identical in asm code, it can be entirely different on the opcode level. all these irregularities are why kkrunchys opcode reordering is a relatively large amount of code (>1k before compression); for RISC platforms with orthogonal instruction sets and few different instruction encodings (MIPS, ARM, SPARC), you could get the same effect at a fraction of the size.

Since we're obviously talking about PC 4k intros here, the instruction set used in such cases is mostly reduced to pushes and calls with a few floating point instructions anyway. The rest of the file is mostly devoted to shaders nowadays, and only the synth would use x86 opcodes (most notably fpu instructions).

added on the 2009-05-02 17:03:49 by hitchhikr

Quote:

I think you didn't get it, by efficiency i meant that Crinkler is handling the API functions loading by itself, something which is done manually in the code above and is (so far) required for Linux small sized prods but would be a waste under Windows in such context.

Erm, no. GETPROCADDRESS != GetProcAddress (if you look up the function, you'll see that it takes two arguments), it's wglGetProcAddress. The default Win32 OpenGL implementation is still only OGL 1.1, if you want anything more that means ARB extensions, which means wglGetProcAddress. This is not simply a GetProcAddress on OPENGL32.DLL (i.e. you can't just directly import it and screw the middleman); the implementation is in a different DLL that's vendor-specific (nvoglnt.dll for nvidia, don't remember the name for ATI). Also, at least for NV, the functions aren't even exported in the DLL, it's just an internal table of names+function pointers somewhere in the file that crinkler's import code can't possibly find or use.

added on the 2009-05-02 17:22:54 by ryg

Quote:

done manually in the code above and is (so far) required for Linux small sized prods but would be a waste under Windows in such context.

Well as ryg says the GETPROCADDRESS is wglGetProcAddress here. The cross-over with the linux source code I believe the code to be based on is that by ordering the function table you can do all API calls using 'lodsd; call eax'. This isn't an overhead with the import code I was using under linux as what you get back is a list of function addresses anyway. I think you know this already though as we discussed it on irc ages ago...
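The ordered-table idea can be sketched in C++ as follows. This is only an illustration: the stubs are hypothetical stand-ins that take no arguments and just log their name, whereas the real GL entry points take parameters. The cursor plays the role of ESI, and `call_next` is the counterpart of `lodsd; call eax`.

```cpp
#include <string>
#include <vector>

std::vector<std::string> call_log;  // records which stub ran, in order

// Hypothetical stand-ins for the GL entry points.
void create_program() { call_log.push_back("glCreateProgram"); }
void create_shader()  { call_log.push_back("glCreateShader"); }
void shader_source()  { call_log.push_back("glShaderSource"); }
void compile_shader() { call_log.push_back("glCompileShader"); }
void attach_shader()  { call_log.push_back("glAttachShader"); }
void link_program()   { call_log.push_back("glLinkProgram"); }

using Fn = void (*)();

// Function table stored in the order the functions get called.
const Fn table[] = {create_program, create_shader, shader_source,
                    compile_shader, attach_shader, link_program};

// Counterpart of `lodsd; call eax`: load the next pointer, advance, call.
void call_next(const Fn*& cursor) {
    Fn f = *cursor++;
    f();
}

void run_sequence() {
    const Fn* cursor = table;
    call_next(cursor);                       // glCreateProgram
    const Fn* per_shader = cursor;           // where the per-shader calls start
    for (int pass = 0; pass < 2; ++pass) {   // vertex shader, then fragment shader
        cursor = per_shader;                 // rewind, like reloading esi
        for (int i = 0; i < 4; ++i)
            call_next(cursor);               // create, source, compile, attach
    }
    call_next(cursor);                       // glLinkProgram
}
```

Calling `run_sequence()` logs ten calls. Note that the rewind has to land on the glCreateShader slot, not on the start of the table.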

added on the 2009-05-02 17:37:18 by parcelshit

Ah yeah i remember i helped you craft that stuff some time ago, but as i'm old & tired i can't remember everything. These are indeed OGL extension functions, guess i haven't used those for a while.

But still, Windows comes with DirectX which wouldn't require all these imports so for that OS this is not the most efficient solution ;D

added on the 2009-05-02 17:52:04 by hitchhikr

not necessarily. d3d has more setup code, you need d3dx to get at the hlsl compiler, and hlsl has more red tape than glsl (you need to declare all the inputs/outputs, for example). or you could use compiled shaders, but they're pretty big.

added on the 2009-05-02 18:11:44 by ryg

Thanks hitchhikr, parapete and ryg for your remarks. Yes the original asm code is not good, i agree (but be indulgent, after 20 years of programming in "high level language" it's quite hard to come back to pure assembler - i miss the 16 registers in 68000! ;) ) . As i said, i didn't test it and it was just a quick prototype to evaluate the size (and yes, based on a "you massive clone" sample for calling the gl functions ;) )

Ryg, your version is nice. But as you said, the compression is based on the context, and because the rest of the code is still in c++, the compression is not working well. That's probably why even with your version, i still get 74 bytes compressed with crinkler, compared to the 67 bytes in c++. Probably a whole demo coded in asm would be better compressed...

Anyway, i need to practice a bit more x86 asm, it's probably worth it. But i'm also very surprised that the c++ version is going so low after compression, and it's not the first time i have encountered this.

added on the 2009-05-02 18:20:46 by xoofx

Everybody is using d3dx (you pushed it yourself) and that's the kind of dll that crinkler imports with great benefits.

The size of the shaders may be rather equivalent all things considered, but i think HLSL has a more relaxed (and smaller) syntax than GLSL, which results in smaller shaders that also pack better (i'll verify that someday eventually). Also, there's some discrepancy between the ATI & Nvidia GLSL implementations, so some optimizations aren't really safe unless you provide 2 versions of your intro, or you leave them out, of course.
This is less crucial for 4k intros than for 1k effects, tho.

http://pouet.net/prod.php?which=52938 << HLSL
http://pouet.net/prod.php?which=52940 << GLSL

Pick one.

added on the 2009-05-02 18:35:17 by hitchhikr

i didn't push anything, and i'm pretty certain that none of our prods needs d3dx (though i may be mistaken).

added on the 2009-05-02 19:50:37 by ryg

Now now, chaps.

added on the 2009-05-02 20:03:28 by parcelshit

@lx: While it might sound silly, there is a significant difference between asm and 100% asm.

You might get some space saving by rewriting parts of your code into asm, but, as you have seen, it is not much. The really big saving will come when you write everything in asm. This will make the code more uniform across the intro, making it compress better.

Also make sure that you stick to a certain coding style all the way. Use the same register for the same thing always. Push the same set of registers onto the stack in every function, even if some are not used. In general, use large but similar constructs to do similar things rather than small but different ones.

Your lodsd, call eax idiom is probably good and compact when this is the dominant way to call API functions, but if you get many more API calls which are called in the usual call [function pointer address] way, it might be better to change it to only use this way.

But whatever you do, don't assume that some way will be better than some other way. Try both.
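For reference, the raw (uncompressed) sizes of the two call styles are easy to tally. This is a back-of-the-envelope sketch with made-up helper names; it counts only the call instructions themselves (argument pushes are the same either way) and says nothing about the compressed size, which has to be measured with the packer.

```cpp
// Raw bytes for n calls through a table walked with `lodsd; call eax`:
// one `mov esi, imm32` (BE + 4-byte address), then AD and FF D0 per call.
constexpr int table_walk_bytes(int n) { return 5 + 3 * n; }

// Raw bytes for n calls of the form `call dword ptr [addr32]`
// (FF 15 + 4-byte address), with no setup.
constexpr int direct_call_bytes(int n) { return 6 * n; }

static_assert(table_walk_bytes(10) == 35, "ten table-walk calls");
static_assert(direct_call_bytes(10) == 60, "ten direct calls");
```

The table walk wins on raw bytes, but it only works when the calls happen in table order, and the repeated FF 15 pattern of direct calls may be cheap for the compressor if the rest of the intro uses it too.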

Happy 100% asm coding! :-D

added on the 2009-05-02 20:25:10 by Blueberry

Well the issue between NVidia and ATI for GLSL is simple : ATI follows the specification rules, and NVidia makes their own :P

Your best bet without an ATI card to test is AMD GPU ShaderAnalyzer or some such program that compiles the shader code using Catalyst drivers so if your shader compiles in there, it SHOULD work on an ATI card (and it even gives you which cards it's safe on). Problem is, "SHOULD" is not "WILL"...

Anyway, another thing to note is that the function name loop has proven to be larger than directly using the function names in C.

Let me give an example:

Code:


    p = ((PFNGLCREATEPROGRAMPROC) wglGetProcAddress("glCreateProgram"))();
    s = ((PFNGLCREATESHADERPROC) wglGetProcAddress("glCreateShader"))(GL_VERTEX_SHADER);
    ((PFNGLSHADERSOURCEPROC) wglGetProcAddress("glShaderSource"))(s,1,&shaders_vertex,NULL);
    ((PFNGLCOMPILESHADERPROC) wglGetProcAddress("glCompileShader"))(s);
    ((PFNGLATTACHSHADERPROC) wglGetProcAddress("glAttachShader"))(p,s);
    s = ((PFNGLCREATESHADERPROC) wglGetProcAddress("glCreateShader"))(GL_FRAGMENT_SHADER);
    ((PFNGLSHADERSOURCEPROC) wglGetProcAddress("glShaderSource"))(s,1,&shaders_fragment,NULL);
    ((PFNGLCOMPILESHADERPROC) wglGetProcAddress("glCompileShader"))(s);
    ((PFNGLATTACHSHADERPROC) wglGetProcAddress("glAttachShader"))(p,s);
    ((PFNGLLINKPROGRAMPROC) wglGetProcAddress("glLinkProgram"))(p);
    ((PFNGLUSEPROGRAMPROC) wglGetProcAddress("glUseProgram"))(p);

Using the strings this way without the getprocaddress loop is actually smaller, though it looks like it wouldn't be.

If you're curious, here's my ASM (NASM) equivalent:

Code:


					; Shader init
%ifndef _4K_NO_SHADERS_
					;  Create program
					push sn_glCreateProgram
					call wglGetProcAddress
					call eax
					mov [p],eax
%ifndef _4K_NO_VERTEX_SHADER_
					;  Vertex shader
					push sn_glCreateShader
					call wglGetProcAddress
					push 8b31h ; GL_VERTEX_SHADER
					call eax
					mov [pfd],eax
					push sn_glShaderSource
					call wglGetProcAddress
					push byte 0
					;   Have to pass a pointer pointer (not a typo) because glShaderSource only accepts string arrays.
					mov [esi],dword shaders_vertex
					push esi
					push byte 1
					push dword [pfd]
					call eax
					push sn_glCompileShader
					call wglGetProcAddress
					push dword [pfd]
					call eax
					push sn_glAttachShader
					call wglGetProcAddress
					push dword [pfd]
					push dword [p]
					call eax
%endif
					;  Fragment shader
					push sn_glCreateShader
					call wglGetProcAddress
					push 8b30h ; GL_FRAGMENT_SHADER
					call eax
					mov [pfd],eax
					push sn_glShaderSource
					call wglGetProcAddress
					push byte 0
					mov [esi],dword shaders_fragment
					push esi
					push byte 1
					push dword [pfd]
					call eax
					push sn_glCompileShader
					call wglGetProcAddress
					push dword [pfd]
					call eax
					push sn_glAttachShader
					call wglGetProcAddress
					push dword [pfd]
					push dword [p]
					call eax
					;  Link and bind
					push sn_glLinkProgram
					call wglGetProcAddress
					push dword [p]
					call eax
					push sn_glUseProgram
					call wglGetProcAddress
					push dword [p]
					call eax
%endif

...and the data section:

Code:


%ifndef _4K_NO_SHADERS_
%ifndef _4K_NO_VERTEX_SHADER_
shaders_vertex: ; Vertex shader
	db "void main(){"
	db  "gl_Position=ftransform();"
	db "}",0
%endif
shaders_fragment: ; Fragment shader
	db "void main(){"
	db  "gl_FragColor=vec4(1);" ; Write pixel
	db "}",0
sn_glCreateProgram: ; Names of shader procs
	db "glCreateProgram",0
sn_glCreateShader:
	db "glCreateShader",0
sn_glShaderSource:
    db "glShaderSource",0
sn_glCompileShader:
    db "glCompileShader",0
sn_glAttachShader:
    db "glAttachShader",0
sn_glLinkProgram:
    db "glLinkProgram",0
sn_glUseProgram:
    db "glUseProgram",0
%endif

This seems pretty optimal to me.

added on the 2009-05-02 20:30:38 by ferris

Quote:

d3d has more setup code

Are you sure? IIRC my minimal D3D9 setup-code is smaller than my minimal GL setup-code. I mean, it's a call to Direct3DCreate9() followed by a COM-call to IDirect3D9::CreateDevice(). In OpenGL you need to call ChangeDisplaySettings(), ChoosePixelFormat(), SetPixelFormat(), GetDC(), wglCreateContext() and wglMakeCurrent(). I'm not sure if I've ever measured the two snippets against each other, the D3D9-code sure looks a lot simpler.

added on the 2009-05-02 20:31:13 by kusma

no, not sure. i have to admit i never measured it :)

added on the 2009-05-02 20:52:44 by ryg

I also tend to think the d3d9 startup code is smaller. Also, in case you wanted to make a non shader-based intro (??) and setup some antialiasing, then d3d9 wins by some hundred bytes.

added on the 2009-05-02 21:09:10 by iq

pouët.net
