Creating Tiny Executables in D

A screenshot of the optimized hello world code.

In the spirit of “A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux,” we are going to try to create a minimal 64-bit ELF executable using just standard D and some inline assembly code.

We are going to focus on creating the standard hello world executable popularized by Brian Kernighan.

In standard D, using the Phobos library (now the only real standard library for D; previously there were two competing standard libraries), we would write this as:

import std.stdio;

int main(){
    write("Hello\n");
    return 0;
}

Compiling this:

dmd basic_hello.d 

yields a 1.9M executable.

wc -c ./basic_hello
 1895528 ./basic_hello

Using gdc brings it down to 28k. Note: gdc, like gcc, names the executable ‘a.out’ by default, whereas dmd names it after the file or module.

gdc basic_hello.d 
wc -c ./a.out
 28608 ./a.out

So we get some improvement. The biggest issue is that it is dynamically linking in a bunch of code we don’t need, including machinery for exceptions we won’t throw and code to handle cases that never come up here. Really we just need to write a string to standard output, which is exactly what the C function puts is for.

import core.stdc.stdio;

int main(){
    puts("Hello");
    return 0;
}

Compiling this with gdc gives us a small decrease in size.

gdc basic_hello.d 
wc -c ./a.out
 21296 ./a.out

Running ldd on it we can see we still pull in the D runtime even though we don’t need it.

> ldd ./a.out
linux-vdso.so.1 (0x00007ffd02db8000)
libgphobos.so.1 => /lib64/libgphobos.so.1 (0x00007f4b1caa8000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f4b1ca8d000)
libm.so.6 => /lib64/libm.so.6 (0x00007f4b1c947000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4b1c925000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f4b1c91e000)
libc.so.6 => /lib64/libc.so.6 (0x00007f4b1c753000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4b1d09f000)

This is exactly what betterC mode is for! Let’s try it. We just need to modify our main function to be ‘extern(C)’.

import core.stdc.stdio;
extern(C):
int main(){
    puts("Hello");
    return 0;
}

With gdc, betterC mode is enabled with the flag ‘-fno-druntime’.

gdc -fno-druntime basic_hello.d 
wc -c ./a.out
 20600 ./a.out

So that is pretty good: we are down to just 20k. And apart from libgcc_s, we are only linking against the standard C library.

>ldd ./a.out
linux-vdso.so.1 (0x00007ffd119c7000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f447f734000)
libc.so.6 => /lib64/libc.so.6 (0x00007f447f569000)
/lib64/ld-linux-x86-64.so.2 (0x00007f447f783000)

We can run strip on it to save some space:

strip -s ./a.out
wc -c ./a.out
 15016 ./a.out

Now we are running into the limits of what we can do while linking to the C standard library, short of switching to musl. And even there we are getting away from programming in D and instead just optimizing C code, which is not the point of this article.

So how does the C standard library print a string to standard output? In the end it performs a system call. On modern Windows, Microsoft makes this unreasonably hard to do directly, because the syscall numbers are not stable and change between updates. This might stop some viruses that hardcode system calls, but it makes it impractical to do any Windows programming without going through the C standard library or the Win32 API. So in this article we will focus on Linux in particular, since its system call interface is explicitly blessed by the kernel developers. FreeBSD and other Unices tend to prefer that you go through the C standard library API. Here we will also just worry about the x86_64 architecture (as of 2020, the only other architecture in major use that is widely supported by D compilers is AArch64).

The system call interface for Linux on x86_64 is described in the System V ABI document. In short, you pass the syscall number in the rax register and the arguments in rdi, rsi, rdx, r10, r8, and r9. The result comes back in rax, and the syscall instruction itself overwrites rcx and r11.
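To make the register assignment concrete, here is a small sketch that writes the convention down as plain data. The names syscallArgRegs, syscallClobbers, and regForArg are made up for illustration and do not come from any library. Note that the fourth argument travels in r10, not in rcx as it would for an ordinary function call, because the syscall instruction overwrites rcx.

```d
// Illustrative only: the Linux x86_64 syscall register convention as data.
// All names in this snippet are hypothetical helpers, not a real API.

/// The syscall number goes in rax, and the result comes back in rax.
enum string syscallNumberReg = "rax";

/// Registers that carry syscall arguments 1 through 6, in order.
immutable string[6] syscallArgRegs = ["rdi", "rsi", "rdx", "r10", "r8", "r9"];

/// Registers the `syscall` instruction itself overwrites.
immutable string[2] syscallClobbers = ["rcx", "r11"];

/// Returns the register that holds the argument at zero-based `index`.
string regForArg(size_t index) @safe @nogc nothrow pure
{
    assert(index < syscallArgRegs.length, "at most six syscall arguments");
    return syscallArgRegs[index];
}
```

For write(1, buf, 6) this means rax holds the syscall number, rdi holds 1, rsi holds the buffer address, and rdx holds 6.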

The three main D language compilers, DMD, GDC, and LDC, all have slightly different ways to include assembly code inline. Just for fun I wrote a version for each using their features. One thing I noticed is that the LDC version of inline assembly asks for a comma-separated list of constraints. I included some spaces in the list and received an invalid asm constraint error that was ridiculously tricky to track down.

One other thing to note is that the main function we have been using so far only works because, in the end, we link to the standard C library. The real entry point lives in the C start files, which set up the scaffolding the C runtime needs for ‘atexit’ and such, and then call main (side note: the C language does have a runtime, you just usually don’t notice it). By not using the standard C library and its start files, we no longer have an entry point. Most tools assume the entry point is called ‘_start’, but this can be overridden in the linker.

Let’s write our syscall routines in a file called syscall.d:

module syscall;
version (LDC)
{
    import ldc.llvmasm;
}

version (Posix)
{
extern (C):
pragma(inline, true):
    long syscall1(long number, long a1) @system @nogc nothrow
    {

        version (DigitalMars)
        {
            long ret;
            asm @system @nogc nothrow
            {
                mov RAX, number;
                mov RDI, a1[RBP];
                syscall;
                mov ret, RAX;
            }
            return ret;
        }
        version (GNU)
        {
            long ret;
            asm @system @nogc nothrow
            {
                "syscall" : "=a"(ret) : "a"(number), "D"(a1) : "rcx", "r11";
            }
            return ret;
        }
        version (LDC)
        {
            pragma(LDC_allow_inline);
            return __asm!long("syscall", "={rax},{rax},{rdi},~{rcx},~{r11}", number, a1);
        }
    }
extern (C):
pragma(inline, true):
    long syscall3(long number, long a1, long a2, long a3) @system @nogc nothrow
    {

        version (DigitalMars)
        {
            long ret;
            asm @system @nogc nothrow
            {
                mov RAX, number;
                mov RDI, a1[RBP];
                mov RSI, a2[RBP];
                mov RDX, a3[RBP];
                syscall;
                mov ret, RAX;
            }
            return ret;
        }
        version (GNU)
        {
            long ret;
            asm @system @nogc nothrow
            {
                "syscall" : "=a"(ret) : "a"(number), "D"(a1), "S"(a2), "d"(a3) : "rcx", "r11";
            }
            return ret;
        }
        version (LDC)
        {
            pragma(LDC_allow_inline);
            return __asm!long("syscall",
                    "={rax},{rax},{rdi},{rsi},{rdx},~{rcx},~{r11}", number, a1, a2, a3);
        }
    }



}
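The wrappers above hand back whatever the kernel left in RAX. On Linux a failed system call reports its error there as a negated errno value; setting a global errno is something libc does on top. Below is a minimal sketch of how one might decode that, assuming the kernel's usual error range of -4095 to -1. The names maxErrno, isSyscallError, and errnoOf are hypothetical and not part of the syscall module above.

```d
// Illustrative helpers for interpreting a raw Linux syscall return value.
// Assumption: errors are encoded as -errno in the range [-4095, -1].

/// Largest errno value assumed to be encoded in a return value.
enum long maxErrno = 4095;

/// True if `ret` looks like a negated errno rather than a result.
bool isSyscallError(long ret) @safe @nogc nothrow pure
{
    return ret < 0 && ret >= -maxErrno;
}

/// Returns the errno encoded in `ret`, or 0 if the call succeeded.
long errnoOf(long ret) @safe @nogc nothrow pure
{
    return isSyscallError(ret) ? -ret : 0;
}
```

For example, a write to a closed file descriptor would come back as -9 (EBADF on Linux), which errnoOf turns into 9, while a successful write of six bytes returns 6 and decodes to 0.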

And our _start entry point and some helper functions go in hello.d:

import syscall;

immutable long SYS_WRITE = 1;
immutable long SYS_EXIT = 60;

extern (C):
@nogc:
nothrow:
pragma(inline, true):
size_t write(int fd, in char* buf, size_t count)
{
    return syscall3(SYS_WRITE, fd, cast(long) buf, count);
}

extern (C):
@nogc:
nothrow:
pragma(inline, true):
void exit(long exit_code){
     syscall1(SYS_EXIT, exit_code);
}

extern (C):
@nogc:
nothrow:
void _start()
{
    write(1, "Hello\n", 6);
    exit(0);
}
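Hardcoding the length 6 is easy to get wrong if the message ever changes. Here is a small sketch of letting the compiler supply it through the literal's .ptr and .length properties. To keep the snippet self-contained it calls the libc write from core.sys.posix.unistd as a stand-in for the wrapper above, and sayHello is a hypothetical name.

```d
// Stand-in for the article's own write wrapper so this compiles on its own.
import core.sys.posix.unistd : write;

/// Writes the greeting to standard output (fd 1) and returns the
/// number of bytes written, or -1 if the write failed.
long sayHello()
{
    static immutable string msg = "Hello\n";

    // Checked at compile time: the literal really is six bytes long.
    static assert(msg.length == 6);

    // .ptr and .length replace the hand-counted 6.
    return write(1, msg.ptr, msg.length);
}
```

In the hello.d listing above, the same idea would read write(1, msg.ptr, msg.length) inside _start.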

We also need to link this ourselves so that we don’t pull in any libc dependencies. We do this by only compiling, not linking (the ‘-c’ flag), and then running the ‘ld’ command by hand.

gdc -fno-druntime -nodefaultlibs hello.d syscall.d -c 
ld ./hello.o
ldd ./a.out
         not a dynamic executable

Cool, so we now have a statically linked executable that has no dependence on the standard C library! How much did we save?

wc -c ./a.out
 13672 ./a.out

Ugh, all that work and only about 1.3k of extra savings (15016 bytes down to 13672)! Let’s see what is in the executable. We can get this by running ‘readelf -a’ on it. Our executable has 9 section headers, counting the empty null entry at index 0.

[0]
[1] .text
[2] .rodata
[3] .eh_frame
[4] .data
[5] .comment
[6] .symtab
[7] .strtab
[8] .shstrtab

The first thing that sticks out is the ‘.eh_frame’ section. That is used for exception unwinding, stack cleanup, and other assorted things we don’t need. We can tell gdc that we don’t want those by adding two flags:

>gdc -fno-druntime -nodefaultlibs  -fno-exceptions -fno-asynchronous-unwind-tables  hello.d syscall.d -c 
>ld ./hello.o
>wc -c ./a.out
 9488 ./a.out

Awesome, we just saved 4k, but there is still fat to trim. One thing we can learn from reading the info pages for ‘ld’ is that there is a flag, ‘-n’ (‘--nmagic’), that turns off page alignment of sections and disables linking against shared libraries, which helps pack the executable. Using this saves us about 3k:

>gdc -fno-druntime -nodefaultlibs  -fno-exceptions -fno-asynchronous-unwind-tables  hello.d syscall.d -c 
>ld -n ./hello.o
>wc -c ./a.out
 6112 ./a.out

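If we look at the executable with readelf again, we can see that the compiler outputs symbols for a lot of things we don’t need. (This dump appears to come from a build where all the code lived in a single file, test.d, with wrappers syscall0 through syscall6, which is why the names differ from the listings above.)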

Symbol table ‘.symtab’ contains 21 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000004000b0 0 SECTION LOCAL DEFAULT 1
2: 00000000004002cf 0 SECTION LOCAL DEFAULT 2
3: 00000000004012d8 0 SECTION LOCAL DEFAULT 3
4: 0000000000000000 0 SECTION LOCAL DEFAULT 4
5: 0000000000000000 0 FILE LOCAL DEFAULT ABS test.d
6: 0000000000400187 32 FUNC GLOBAL DEFAULT 1 syscall0
7: 00000000004012e0 8 OBJECT GLOBAL DEFAULT 3 _D4test8SYS_EXITyl
8: 000000000040021d 83 FUNC GLOBAL DEFAULT 1 syscall5
9: 00000000004012d8 8 OBJECT GLOBAL DEFAULT 3 _D4test9SYS_WRITEyl
10: 00000000004001a7 51 FUNC GLOBAL DEFAULT 1 syscall2
11: 00000000004000b0 47 FUNC GLOBAL DEFAULT 1 write
12: 0000000000400162 37 FUNC GLOBAL DEFAULT 1 _start
13: 00000000004012e8 0 NOTYPE GLOBAL DEFAULT 3 __bss_start
14: 00000000004001da 67 FUNC GLOBAL DEFAULT 1 syscall4
15: 0000000000400137 43 FUNC GLOBAL DEFAULT 1 syscall1
16: 00000000004012e8 0 NOTYPE GLOBAL DEFAULT 3 _edata
17: 00000000004012e8 0 NOTYPE GLOBAL DEFAULT 3 _end
18: 0000000000400117 32 FUNC GLOBAL DEFAULT 1 exit
19: 0000000000400270 95 FUNC GLOBAL DEFAULT 1 syscall6
20: 00000000004000df 56 FUNC GLOBAL DEFAULT 1 syscall3

We can put each function into a separate section and then have the linker “garbage collect” the unused sections. This requires two parts. First, we need gdc to put the functions and data into separate sections with ‘-ffunction-sections’ and ‘-fdata-sections’. Next, we need to pass ‘--gc-sections’ to the linker. At the same time we can tell the linker to strip everything we don’t need with ‘--strip-all’ (‘-s’ is the short form of the same option) and turn off the build ID with ‘--build-id=none’. We may as well pass ‘-Os’ to gdc too, to see if it can optimize for size at all (without it, the result is 800 bytes instead of 616).

>gdc -fno-druntime -fno-asynchronous-unwind-tables -nodefaultlibs  -ffunction-sections -fdata-sections -fno-exceptions -Os hello.d syscall.d -c 
>ld -s --gc-sections --strip-all --build-id=none -n ./hello.o
>wc -c ./a.out
 616 ./a.out

Wow! Now we are talking! A 616-byte executable. But we can still do better. If we look with ‘objdump’ we can see there is still a ‘.comment’ section. We can strip this out directly:

>strip -R .comment ./a.out
>wc -c ./a.out
 496 ./a.out

Which gets us down to 496 bytes. And it still works!

>./a.out
 Hello

For comparison, Drew DeVault’s Hello World blog post shows how small other languages can get.