pouët.net

4k intro, C vs assembler?

category: code [glöplog]
I have no doubt that when it comes to performance, modern compiler are more effective than manual asm coding.
However, when it comes to size code, I'm not sure that this is the case.

For example, see the code below:
Code:static void BuildSkin() { for (int x = 0; x < width; x++) for (int y = 0; y < height; y++) { int cellValue = *GetCellPtr(x, y); if (cellValue) { CreateFace(x, 1, y, 4, cellValue); for (int nSide = 0; nSide < 4; nSide++) { int CellValue = *GetCellPtr(x + DirMapX[nSide], y + DirMapY[nSide]); if (!CellValue) CreateFace(x, 0, y, nSide, cellValue); } } else CreateFace(x, 0, y, 5, 0); } }

Original compiled code size is 128 bytes and it reduces to 90 bytes after compression.
For me its seems quite big.

The question: would a manually tweaked asm be more much effective?

For reference, this is the generate asm:
Code: 00420629 55 PUSH EBP 0042062A 8B EC MOV EBP, ESP 0042062C 51 PUSH ECX 0042062D 83 65 FC 00 AND DWORD [EBP-0x4], 0x0 00420631 53 PUSH EBX 00420632 56 PUSH ESI 00420633 57 PUSH EDI 00420634 33 FF XOR EDI, EDI 00420636 8B 4D FC MOV ECX, [EBP-0x4] 00420639 8B C7 MOV EAX, EDI 0042063B E8 4D FB FF FF CALL ?GetCellPtr@@YGPAHHH@Z 00420640 8B 18 MOV EBX, [EAX] 00420642 33 F6 XOR ESI, ESI 00420644 3B DE CMP EBX, ESI 00420646 74 40 JZ 0x420688 00420648 53 PUSH EBX 00420649 6A 04 PUSH 0x4 0042064B 57 PUSH EDI 0042064C 6A 01 PUSH 0x1 0042064E FF 75 FC PUSH DWORD [EBP-0x4] 00420651 E8 53 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z 00420656 0F BE 8E 84 19 42 00 MOVSX ECX, BYTE [ESI+_DirMapX] 0042065D 0F BE 86 88 19 42 00 MOVSX EAX, BYTE [ESI+_DirMapY] 00420664 03 4D FC ADD ECX, [EBP-0x4] 00420667 03 C7 ADD EAX, EDI 00420669 E8 1F FB FF FF CALL ?GetCellPtr@@YGPAHHH@Z 0042066E 83 38 00 CMP DWORD [EAX], 0x0 00420671 75 0D JNZ 0x420680 00420673 53 PUSH EBX 00420674 56 PUSH ESI 00420675 57 PUSH EDI 00420676 6A 00 PUSH 0x0 00420678 FF 75 FC PUSH DWORD [EBP-0x4] 0042067B E8 29 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z 00420680 46 INC ESI 00420681 83 FE 04 CMP ESI, 0x4 00420684 7C D0 JL 0x420656 00420686 EB 0D JMP 0x420695 00420688 56 PUSH ESI 00420689 6A 05 PUSH 0x5 0042068B 57 PUSH EDI 0042068C 56 PUSH ESI 0042068D FF 75 FC PUSH DWORD [EBP-0x4] 00420690 E8 14 01 00 00 CALL ?CreateFace@@YGXHHHEF@Z 00420695 47 INC EDI 00420696 83 FF 20 CMP EDI, 0x20 00420699 7C 9B JL 0x420636 0042069B FF 45 FC INC DWORD [EBP-0x4] 0042069E 83 7D FC 20 CMP DWORD [EBP-0x4], 0x20 004206A2 7C 90 JL 0x420634 004206A4 5F POP EDI 004206A5 5E POP ESI 004206A6 5B POP EBX 004206A7 C9 LEAVE 004206A8 C3 RET

added on the 2013-04-06 10:26:51 by TLM TLM
for 4k most people use C/C++ nowdays, but if you're asm lover and can't get enough of it, by all means make a demo about it, or a 4k in this case.
added on the 2013-04-06 10:38:03 by psenough psenough
crinkler should be enough for anyone!
added on the 2013-04-06 10:38:56 by psenough psenough
lol
added on the 2013-04-06 10:41:17 by TLM TLM
I think I would suggest c++ & inline ASM for things you want to optimize. I know that our coder normally writes anything in c++ and refactors it then in asm by tweaking some patterns...
added on the 2013-04-06 10:49:05 by FeN FeN
4k -> nasm/yasm.
added on the 2013-04-06 11:13:57 by las las
las, what do you mean?
added on the 2013-04-06 11:16:43 by TLM TLM
ehh. well. trick the compressor? pretty hard. i dunno much about compression far from reading the crinkler reports. and how it does and looks.

but i'd say there's definetely potential to optimize this assembly here. see register edx isn't used at all. the c++ way. make it a "class" pointer with some offset jumptable that makes the first calls smaller. or force a forward declaration to get a positive offset. the thing is zeros compress better. change the function arguments order to make the pattern more entropy friendly repetive. this is just a lil tho. the function got random arguments anyway. also the DirMapX and DirMapY needs to get outta the pattern which is hard.

*my two cheap cents here* :D
added on the 2013-04-06 11:31:35 by yumeji yumeji
*just thought re-reading my post*

how about using edx for the second loop counter? that'd kill that DWORD [EBP-0x4] stack locals. the same negative offset shit. or you decrease ebp to get positive stack offsets for locals. more zeros to compress.

anyway. that compiler did a shit job. an assembly betterization would definetely kill that. period.
added on the 2013-04-06 12:35:36 by yumeji yumeji
Just a quick look, 102 or 96 bytes uncompressed.

EDX can be used for X (previously [EBP-4]). -14 bytes
You don't need the stack frame anymore. -4 bytes.
Use PUSHA/POPA (slower, but smaller). -4 bytes.
EBP is now free, use it for GetCellPtr. -4 bytes.

Code: PUSHA ; -3-2 MOV EBP, ?CreateFace@@YGXHHHEF@Z; +5 XOR EDX, EDX ; -3 __XLOOP: XOR EDI, EDI __YLOOP: MOV ECX, EDX ; -1 MOV EAX, EDI CALL ?GetCellPtr@@YGPAHHH@Z MOV EBX, [EAX] XOR ESI, ESI CMP EBX, ESI JE __ELSE PUSH EBX PUSH 0x4 PUSH EDI PUSH 0x1 PUSH EDX ; -2 CALL EBP ; -3 __NSIDELOOP: MOVSX ECX, BYTE [ESI+_DirMapX] MOVSX EAX, BYTE [ESI+_DirMapY] ADD ECX, EDX ; -1 ADD EAX, EDI CALL ?GetCellPtr@@YGPAHHH@Z CMP DWORD [EAX], 0x0 JNZ __SKIP PUSH EBX PUSH ESI PUSH EDI PUSH 0x0 PUSH EDX ; -2 CALL EBP ; -3 __SKIP: INC ESI CMP ESI, 0x4 JL __NSIDELOOP JMP __THEN __ELSE: PUSH ESI PUSH 0x5 PUSH EDI PUSH ESI PUSH EDX ; -2 CALL EBP ; -3 __THEN: INC EDI CMP EDI, 0x20 JL __YLOOP INC EDX ; -2 CMP EDX, 0x20 ; -1 JL __XLOOP POPA ; -2 RET ; -1


You can save 6 more bytes by looping down instead of up, but the face order will be reversed.

Code: PUSHA ; -3-2 MOV EBP, ?CreateFace@@YGXHHHEF@Z; +5 PUSH 0x1F ; -3 POP EDX ;+1 __XLOOP: PUSH 0x1F POP EDI ;+1 __YLOOP: MOV ECX, EDX ; -1 MOV EAX, EDI CALL ?GetCellPtr@@YGPAHHH@Z MOV EBX, [EAX] OR EBX, EBX JZ __ELSE PUSH EBX PUSH 0x4 PUSH EDI PUSH 0x1 PUSH EDX ; -2 CALL EBP ; -3 PUSH 3 POP ESI ;+1 __NSIDELOOP: MOVSX ECX, BYTE [ESI+_DirMapX] MOVSX EAX, BYTE [ESI+_DirMapY] ADD ECX, EDX ; -1 ADD EAX, EDI CALL ?GetCellPtr@@YGPAHHH@Z CMP DWORD [EAX], 0x0 JNZ __SKIP PUSH EBX PUSH ESI PUSH EDI PUSH 0x0 PUSH EDX ; -2 CALL EBP ; -3 __SKIP: DEC ESI ;-3 JNS __NSIDELOOP JMP __THEN __ELSE: PUSH ESI PUSH 0x5 PUSH EDI PUSH ESI PUSH EDX ; -2 CALL EBP ; -3 __THEN: DEC EDI ;-3 JNS __YLOOP DEC EDX ;-3 ; -2-1 JNS __XLOOP POPA ; -2 RET ; -1
added on the 2013-04-06 12:51:41 by rrrola rrrola
Slightly off-topic: I have no experience with 4K, but I see C++ mentioned in this thread I few times. I wonder if it really is a good idea to make use of C++ in 4Ks. Many C++ features tend to blow up code a lot. Shouldn't it be safer to stick to C to minimize code size?
added on the 2013-04-06 12:58:06 by zgrg zgrg
nice one. :D

still needs to compress. argument order is still valid. and the forward offset for GetCellPtr. am i right?
added on the 2013-04-06 13:10:13 by yumeji yumeji
zgrg: depends on what you want from "C++". classes, templates, etc have no place in a 4k (with the exception of COM objects like d3d/dsound) really.

for our revision exegfx, i made the odd choice to do the effect itself in C but did the stub/init/glue in ASM for a bunch of reasons:
- i like doing a bit of ASM every now and then, just for the feel of it.
- it was slightly smaller for a window init + data parser/loader code (at least the VS compiler didn't seem to realize the upsides of lods* / stos*)
- with a 4k, you often want to create multiple versions for the final product (different resolutions, windowed, etc) and i just found being able to compile ASM from command-line with different preprocessor definitions works better than it does with C (visual studio is a bit of a bastard with all his environment variables) so I just set up a build script that was able to pop out all the versions from one .bat file.
- with crinkler it's often beneficial to have control over your sections, which works out slightly better if you're running ASM.

note that these are all edge cases and the unpredictability of compression makes everything a "possible" good idea instead of a surefire one.
added on the 2013-04-06 13:45:43 by Gargaj Gargaj
So what does crinkler say about the hand-optimized code? Does it actually compress better?
added on the 2013-04-06 13:46:08 by trc_wm trc_wm
yumeji: you’re right.
added on the 2013-04-06 13:47:04 by rrrola rrrola
yumeji, thanks for the input!

rrrola, this is very impressive! I will try that tonight and post the compression results.

Just to make sure that I'm on the right track here, assembler inlining is not an option as I want to change the function framing. So I have to nasm/yasm it and link it together, right?
added on the 2013-04-06 13:54:19 by TLM TLM
Quote:
I have no doubt that when it comes to performance, modern compiler are more effective than manual asm coding.


That's an overrated statement.
Pure asm coding is always best. Not one compiler is smart enough to optimize down to the byte level. Some are quite good yes, but with assembler you can do tricks that not necessarily an compiler can do to save bytes.
added on the 2013-04-06 14:00:07 by rudi rudi
rudi, I have seen compilers do crazy things that a real person would not consider. I belive that writing assembler that would fully utilize the internal CPU architecture is REALLY REALLY complicated.
added on the 2013-04-06 14:04:21 by TLM TLM
I think it is a matter of knowledge + (abstract) thinking. And fun. And I also guess that there will always be 1 byte that can be saved by optimizing non asm coded stuff by hand.

Quote:
I belive that writing assembler that would fully utilize the internal CPU architecture is REALLY REALLY complicated.


But maybe not impossibru (=
thats because your algorithm changes during assembly-optimization.
added on the 2013-04-06 14:49:20 by rudi rudi
I'd not really say "classes don't belong into a 4K". There are some situations when classes can actually help, eg. with addressing.

Case in point (just typed down and compiled with VS2012)

Code: class TestClass { public: int a,b; int Test() { return a+b; } }; static int c,d; int CTest() { return c+d; }


Now, the final machine code for CTest() looks like (shortened):
Code: ?CTest@@YAHXZ PROC ; CTest, COMDAT 00000 a1 00 00 00 00 mov eax, DWORD PTR _d 00005 03 05 00 00 0000 add eax, DWORD PTR _c 0000b c3 ret 0 ?CTest@@YAHXZ ENDP ; CTest


while the "OMG C++ BLOAT" code of TestClass::Test() looks like this:
Code: ?Test@TestClass@@QAEHXZ PROC ; TestClass::Test, COMDAT ; _this$ = ecx 00000 8b 41 04 mov eax, DWORD PTR [ecx+4] 00003 03 01 add eax, DWORD PTR [ecx] 00005 c3 ret 0 ?Test@TestClass@@QAEHXZ ENDP ; TestClass::Test


Addressing of the variables just got way smaller and clearer. Of course YMMV and after some crinkling the difference will mostly vanish in the depths of context modelling, but I wouldn't discount those C++ features entirely.
added on the 2013-04-06 14:52:21 by kb_ kb_
tlm: there may be a VC optimization option for stubless functions (like -fomit-frame-pointer in GCC).

Some assembly-level optimizations can also be done directly from C: looping down instead of up, indexed loops ending at zero, do-while loops to put the condition at the end, the do-undo trick to get rid of "else" jumps, …
added on the 2013-04-06 15:02:34 by rrrola rrrola
TLM, rrrola: It's /Oy (aka "omit frame pointer")
added on the 2013-04-06 15:05:11 by kb_ kb_
Quote:
Pure asm coding is always best.

Surely you want to keep track of all the pipeline stalls, branch predictions and whatnot? I bet you can get insane from that pretty quickly, while a compiler certainly won't complain about reordering statements for best execution times.
Quote:
rudi, I have seen compilers do crazy things that a real person would not consider.


Well, obviously compilers are not sentient. So these 'crazy things' they do have been considered by humans, and programmed into the optimization logic of the compiler.

But, back to the original topic... I suppose compiled code will not have all that much entropy, since compilers only use a small subset of the total x86 instructionset, and always apply the same code generation routines to the same type of input. Which means that the compiled code will compress very nicely.
I'm not sure how much you can really win by writing everything by hand. Taking the compression into account seems to add yet another level of complexity to optimizing assembly code.
Or perhaps it even means that you should NOT try to be complex, but just keep to the same simple patterns, like a compiler does. Perhaps saving bytes in the original code, because you think you're being 'smart' ends up in lower compression ratios because the compressor has to use a different encoding for that special case than for other parts of the code, where this would not be the case with a compiler.

Discuss...
Or whatever :P
added on the 2013-04-06 15:25:30 by Scali Scali

login