ABI update woes in packers (dnload on FreeBSD 12 and Ubuntu 18.04)

category: code [glöplog]
I'm kind of interested in hearing how often and how other packer authors here have had to react to operating system changes. This mostly concerns people like blueberry or ts, but anyone else is free to chip in.

Why am I asking? Lately it seems like all FreeBSD/Linux tiny intro hacking has become a kind of red queen's race, where every update might lead to hours and hours of debugging just to get things working again. I recently updated to the latest FreeBSD, crossing my fingers in the hope that it would not happen this time. Of course it did.

To the point. Something in lld (the LLVM linker) has changed, and it now arranges the dynamic section entries differently. The current expectations about the layout had been relying on a discovery made by minas some 5 years ago. He found out that:
  • The DT_SYMTAB entry in the dynamic section immediately follows the DT_STRTAB entry.
  • The memory block pointed to by DT_STRTAB always immediately follows the memory block pointed to by DT_SYMTAB.

Or as seen in Linux libc.so.6:
Code:
...
 0x0000000000000005 (STRTAB)  0x11038
 0x0000000000000006 (SYMTAB)  0x3d90
...

This beautiful symmetry meant that walking through the symbols in a shared object was as simple as finding the STRTAB entry, taking the next entry as the SYMTAB entry, and iterating over symbol elements from the memory block of the latter until reaching the memory block of the former.
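For illustration, that old lookup can be sketched in plain C over a dynamic array (a sketch, not minas' actual code; the demo data in the test mimics the libc.so.6 listing below, and the helper name is mine):

```c
#include <elf.h>

/* minas' observation as a lookup: find the DT_STRTAB entry in the dynamic
 * array and assume that the very next entry is DT_SYMTAB.  Returns 1 and
 * fills in the two addresses on success, 0 if the assumption does not hold. */
int find_adjacent_tables(const Elf64_Dyn *dyn, Elf64_Addr *strtab, Elf64_Addr *symtab)
{
    for (const Elf64_Dyn *d = dyn; d->d_tag != DT_NULL; ++d)
    {
        if (d->d_tag == DT_STRTAB)
        {
            if (d[1].d_tag != DT_SYMTAB)
            {
                return 0; /* lld no longer guarantees the adjacency */
            }
            *strtab = d->d_un.d_ptr;
            *symtab = d[1].d_un.d_ptr;
            return 1;
        }
    }
    return 0;
}
```

The STRTAB address doubles as the end-of-iteration sentinel, since the STRTAB block was assumed to follow the SYMTAB block in memory.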

lld has since been updated, and the dynamic section entries in libc.so.7 from the latest FreeBSD look like this:
Code:
...
 0x0000000000000006 SYMTAB  0x14770
 0x000000000000000b SYMENT  24 (bytes)
 0x0000000000000005 STRTAB  0x337b8
...
 0x000000006ffffff0 VERSYM  0x26e78
 0x000000006ffffffc VERDEF  0x28710
...

Not only is there an entry between the two we're interested in, but further down there are another two blocks altogether that appear in memory before STRTAB, so minas' findings cease to help.

dnload has always had a way to do the library scouring safely, counting the symbols by analyzing the HASH or GNU_HASH sections. This is naturally horribly expensive in binary size, and thus unacceptable. Luckily, even if the rules have changed, they're not completely arbitrary. Some debugging and investigation revealed that:
  • All memory blocks are tightly packed - the SYMTAB block is immediately followed by another memory block, which can be used as the end address of iteration.
  • The STRTAB block can still be the immediately following block, but if it's not, it's always VERSYM.

Thus, the header scouring code reduces to the following (hash contains the 32-bit hash of the symbol name, lmap is the ELF link map):
Code:
const dnload_elf_sym_t* symtab = (const dnload_elf_sym_t*)elf_get_library_dynamic_section(lmap, DT_SYMTAB);
const char* strtab = (char*)elf_get_library_dynamic_section(lmap, DT_STRTAB);
const dnload_elf_sym_t* symtab_end = (const dnload_elf_sym_t*)strtab;
// If the section immediately following SYMTAB is not STRTAB, it may be something else.
{
  const dnload_elf_sym_t *potential_end = (const dnload_elf_sym_t*)elf_get_library_dynamic_section(lmap, DT_VERSYM);
  if(potential_end < symtab_end)
  {
    symtab_end = potential_end;
  }
}
for(const dnload_elf_sym_t *sym = symtab; (sym < symtab_end); ++sym)
{
  const char *name = strtab + sym->st_name;
  if(sdbm_hash((const uint8_t*)name) == hash)
  {
    return (void*)((const uint8_t*)sym->st_value + (size_t)lmap->l_addr);
  }
}

While this new 'safe' mechanism is better than counting the number of symbols from the hash sections, the size of a trivial binary still increases by a depressing ~40 packed bytes. Rather unacceptable even in a 4k.
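For context, the count itself is cheap to read in the classic DT_HASH case (a sketch under the standard ELF layout; the helper name is mine): the DT_HASH block starts with [nbucket][nchain], and nchain equals the number of entries in the dynamic symbol table. It's DT_GNU_HASH, which has no direct count and needs a bucket/chain walk, that really costs packed bytes.

```c
#include <stdint.h>

/* DT_HASH block layout: [nbucket][nchain][bucket[nbucket]][chain[nchain]].
 * nchain equals the number of dynamic symbol table entries, so the symbol
 * count falls out of a single array access. */
uint32_t symbol_count_from_dt_hash(const uint32_t *hash_section)
{
    return hash_section[1]; /* nchain */
}
```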

As a stroke of luck, Juippi decided to try shader coding and made me investigate the current state of Linux support, which had been broken for years. It turns out the supposed problems were more or less trivial to fix, and should have been fixed ages ago:

1) GNU ld not generating correct binaries from manually generated headers + asm? Well, if lld has changed, perhaps GNU linker has too?

I used to create binaries as follows (intro.final.S is the verbatim assembler for the whole intro content, including ELF headers):
Code:
as CMakeFiles/intro.final.S -o CMakeFiles/intro.final.o
ld --entry=0x400000 CMakeFiles/intro.final.o -T CMakeFiles/intro.ld -o CMakeFiles/intro.out
objcopy --output-target=binary CMakeFiles/intro.out CMakeFiles/intro.unprocessed

It turns out that, for whatever reason, the objcopy step is no longer necessary and even breaks the binary. The correct procedure is:
Code:
as CMakeFiles/intro.final.S -o CMakeFiles/intro.final.o
ld --entry=0x400000 CMakeFiles/intro.final.o -T CMakeFiles/intro.ld --oformat=binary -o CMakeFiles/intro.unprocessed

2) The assembler generated by newer versions of gcc on Linux requires a reference to global offset table?

An entry point would look like this for example:
Code:
_start:
    pushl %ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    subl $80, %esp
    call __x86.get_pc_thunk.bx
    addl $_GLOBAL_OFFSET_TABLE_, %ebx
    movl dynamic_r_debug@GOT(%ebx), %eax
    ...

Makes sense, since the generated binary has to be relocatable. However, we don't need this, since we know exactly at which address the binary is going to be loaded. You can explicitly force generation of non-position-independent code, so passing -fno-pic and doing some automatic squashing in Python turns the entry point above into:
Code:
_start:
    subl $88, %esp
    movl dynamic_r_debug, %eax

There are some unknowns about how exactly to handle the stack pointer decrement after removing the needless push instructions from the beginning. I only have a hack solution for now, but it allows me to compile working binaries for all permutations and see what's up with the sizes for a trivial 1quad+bytebeat test program (safety denotes the symbol scouring assumptions; unsafe is minas' method):
Code:
linux-ia32 (unsafe):     999
linux-amd64 (unsafe):   1061
freebsd-ia32 (unsafe):  1019 (works for now?)
freebsd-ia32 (safe):    1061
freebsd-amd64 (unsafe): 1085 (crashes)
freebsd-amd64 (safe):   1129

Even discounting the need for a safe scouring method, it's easy to understand why Linux binaries would be slightly smaller. environ and __progname are not needed for the Linux libc.so, and the binary does not require a symtab of its own. All in all, it kind of looks as if the time of FreeBSD is over, at least for now.

All changes here have been pushed to GitHub. I've tested on either Ubuntu 18.04 LTS or FreeBSD 12.0-RELEASE (you need --safe-symtab).

And yeah, Pouet is not my personal blog, but then again it's probably the only place in the world where a post about ELF hacking minutiae makes sense.
added on the 2019-02-26 22:37:24 by Trilkk
I'm kinda in a hurry so I can't go in depth right now (already supposed to be asleep because the seminar tomorrow is criminally early), but:

* In my hacked-up version of Shiz' linker/packer, a custom 'symbol table' is defined whose contents are known at compile-/assemble-time (of the loader routine), which simplifies all the SYMTAB/... lookup and size calculation stuff.
* Instead of using the DT_DEBUG trick to get hold of the link_map, ld.so leaks it to the entrypoint. This simplifies everything a bit.
* By misusing glibc internals, we can get hold of the hashes computed by ld.so, so that code doesn't have to be included (see the glibc source on the internal link_map structure). However, the offset changes often between versions, so hacks are needed to make it a bit more portable.
* Other stuff I forgot

(also, all this is Linux-only, for obvious reasons)

I'll write something better when I have time (probably tomorrow)
added on the 2019-02-26 23:03:24 by porocyon
This sounds as bad as import by ordinal (on Win32) to me.
added on the 2019-02-27 00:34:44 by las
@porocyon: Thanks for the link!

* I think I do the same thing. There is no real symbol table, just an array of (sdbm) hashes that I fill in after finding a corresponding symbol with the same hash by walking through all linked libraries. I still need to include the hash function in the binary to compare against the names found in the libraries tho.
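(For readers: the sdbm hash mentioned here is the classic one; this is the textbook version, and dnload's variant may differ in minor details.)

```c
#include <stdint.h>

/* Classic sdbm string hash: hash = c + (hash << 6) + (hash << 16) - hash,
 * folded over each byte of the NUL-terminated name. */
uint32_t sdbm_hash(const uint8_t *op)
{
    uint32_t ret = 0;
    for (; *op; ++op)
    {
        ret = *op + (ret << 6) + (ret << 16) - ret;
    }
    return ret;
}
```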

* Since we know the locations of everything in the header before linking, we can just take the link map from the DT_DEBUG section - accessing it becomes essentially one MOV. This also works on both FreeBSD and Linux. How can it get more trick-y than that? You mean there's a way to forgo having the DT_DEBUG dynamic section to begin with?
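(In portable C terms - not dnload's precomputed-offset MOV - the DT_DEBUG trick looks roughly like this; it assumes a normally started, dynamically linked glibc binary where ld.so has filled in the entry, and the helper name is mine:)

```c
#include <elf.h>
#include <link.h>

/* ld.so fills the executable's own DT_DEBUG dynamic entry with a pointer to
 * its r_debug structure, whose r_map member is the head of the link_map
 * chain of all loaded shared objects. */
extern ElfW(Dyn) _DYNAMIC[];

struct link_map *link_map_from_dt_debug(void)
{
    for (ElfW(Dyn) *dyn = _DYNAMIC; dyn->d_tag != DT_NULL; ++dyn)
    {
        if (dyn->d_tag == DT_DEBUG)
        {
            struct r_debug *debug = (struct r_debug *)dyn->d_un.d_ptr;
            return debug ? debug->r_map : 0;
        }
    }
    return 0;
}
```

dnload avoids the loop entirely because it writes the ELF headers itself and thus knows the offset of the DT_DEBUG entry in advance.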

As for the last one, I'm afraid I can't really understand what you mean without seeing the code. Having to account for offset changes sounds kind of dangerous though, and I'm kind of seconding las here. Most compo rules state that import by ordinal on win32 is considered BM, which is why dnload explicitly walks through the symbols and calculates the hashes, so it doesn't require any exact, specific version(s) of the libraries.

But please do explain, I'm very interested. For now, I'll take a look at your 'smol' github.
added on the 2019-02-27 09:27:59 by Trilkk
* Yeah, sorry. I somehow thought you were putting the imported symbols of the executable in a real SYMTAB, too. Not sure why. I was being stupid.

* It just ends up in a register, no mov needed :) (see eg. https://code.woboq.org/userspace/glibc/sysdeps/i386/dl-machine.h.html, line 190). It's slightly more complicated on x86_64, but there's still a net gain in size.

* The 'real' link_map struct can be found here. As you can see, somewhere deep down, there's an "l_gnu_buckets" field (and related ones), which makes hash lookup quite easy. However, the size of the "l_info" field can vary quite a bit (because of ABI updates), but the contents of the "l_entry" field, which comes after l_info, are known at compile time, so the relative offset between the link_map and l_entry can be computed with an x86 string op thingy.

This sounds very shoddy indeed, but it somehow works on my machine (glibc 2.28), ubuntu 18.04 LTS (2.27), 17.10 (forgot which version) and whatever glibc the old CentOS boxes at my uni run. Furthermore, the _dl_start code has been pretty much the same since at least 2002 (according to their git repo), the internal link_map is also quite stable (2007 or so), minus the l_info stuff of course.
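(If I understand the l_entry trick correctly, the offset scan amounts to something like this in plain C - porocyon presumably uses an x86 string instruction instead; the helper name and the test data are mine:)

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Scan an opaque structure word by word for a value that is known at
 * link time (here: the entry point stored in l_entry).  The offset at
 * which it is found reveals how this glibc version laid out the struct,
 * so nearby fields can then be addressed relative to it. */
size_t scan_for_known_value(const void *base, size_t max_bytes, uintptr_t needle)
{
    const unsigned char *p = (const unsigned char *)base;
    for (size_t off = 0; off + sizeof(uintptr_t) <= max_bytes; off += sizeof(uintptr_t))
    {
        uintptr_t value;
        memcpy(&value, p + off, sizeof(value));
        if (value == needle)
        {
            return off;
        }
    }
    return (size_t)-1;
}
```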

Thus, I think other ABI changes (eg. when the GCC people unilaterally decide to change the ABI once again, or when assumptions about linker output are proven false again, etc.) are more common anyway.
added on the 2019-02-27 13:47:03 by porocyon
Also, about GCC causing ABI changes in my last paragraph: https://sourceforge.net/p/fbc/bugs/659/
added on the 2019-02-27 15:46:52 by porocyon
Another bump: my void machine got updated to glibc 2.29 (looks like they're faster than the arch people), and the same binary still works.
added on the 2019-02-27 16:09:31 by porocyon
Ok. I spent some time reading your project, and the tricks seem quite insane, but viable. Also, despite what I said earlier, making assumptions about memory layout within dynamic linker internals is not particularly more dubious than making assumptions about layout of objects created by linkers in general.

Relying on register contents passed to the entry point is kinda dirty tho.

(what kind of dirty assumptions are made for win32 or osx? I have no idea)

Anyhow, even with all this cleverness, looking at loader32.asm in smol, it doesn't look that small. Just as a test, here's the full source for a program that prints "Hello World!" on Linux (32-bit). It can be assembled and linked into a 336-byte binary (omitting filedrop for now).

How small would a similar trivial binary made with smol go and what would it look like?
added on the 2019-02-27 20:24:45 by Trilkk
The register-content-passing trick works because the ABI guarantees (for now...) that the '_dl_init' call in ld.so won't destroy the contents of eax. (Though I'm quite sure this will survive longer than stack alignment details.) But because this isn't the case on x86_64, more tricks are needed. (Also, it isn't exactly my project, I stole Shiz' ideas, and fixed and optimized it a little bit.)

It's true that the smol loader is currently larger, because it tries to look for a symbol in a specific DSO, instead of simply trying all of them. I guess I could add an option to just make it look at all the DSOs, then it'll be quite a bit smaller, as it won't need a basename and strcmp implementation :P (Now to find some time to do that...)

Also, I'm aiming for it to work with the Revision compo rules, which means there's no SDL/GLFW available, and only 64-bit versions of most shared libraries are available. This complicates things a bit more (eg. 4klang only works on 32-bit). My X11+GL init routine is smaller than what you'd usually come across, but still annoyingly large. Audio is done using aplay :)

(Another thing about the Revision compo rules: they're still stating Ubuntu 17.10, I hope this is simply copied from last year, and will be updated to 18.*. Would really like confirmation on this, though.)

Furthermore, there's also the issue of the compression algorithm used: lzma-based things aren't exactly optimal for assembly, so a crinkler-type PAQ-based packer could be a better idea (unlord/xylem is working on this), which would also do everything in-memory. Something like vondehi can be used as well, but it has known bugs I haven't found the time to fix yet. Contributions are welcome :P (It's also someone else's stuff (in this case, blackle/Suricrasia Online's) that I stole and optimized a bit. Also, GitLab accepts pull requests sent by email, so you don't need to open an account.)
added on the 2019-02-27 23:47:26 by porocyon
Forgot something: I tried dnload again on my machine, but it's still segfaulting. Will try to find some time to investigate this...
added on the 2019-02-28 00:06:10 by porocyon
Yeah, I think your and Shiz' approach can potentially be somewhat smaller once the loader is optimized. If I understood correctly, you don't need a hash function nor do you need to browse the section headers, so that's quite a lot of extra stuff to take out.

Plain X11+GL init code was looking to be in the order of 500 compressed bytes for me. The API is incredibly verbose. The situation's not entirely different from when I ported dnload to the Raspberry Pi, where you need 11 library calls and 5 structs just for VideoCore and EGL init. Luckily most compos I've taken part in have been okay with SDL.

I also tried playing around with a CM compressor some years back, but was not getting proper results for non-trivial input data. It was also incredibly slow. The project's currently mothballed. It's good to know someone's working on a proper solution.

As for the crash: if you're compiling for amd64, the entry point is probably not being crunched properly. You can take a look at the files intro.final.S and intro.S around _start. The crunching tries to erase all the push instructions and then increase the later %rsp subtract by a proper amount. You can run the compile.sh script with -v, then copy the as and ld calls into a console and try with different numbers.

I'd be happy to cooperate and fix all bugs you find. At least immediately when I'm back from my vacation (leaving tomorrow, returning on 18th).

We could also take this to IRC then, for efficiency and to annoy other Pööt users less. I assume you're on CET.
added on the 2019-02-28 12:10:17 by Trilkk