Small Windows Executables
Chris Dragan

It looks like we have a chain reaction, since this article is a reaction to SunmaN's article from Hugi#20 about 4K intros for Win32, which also was a reaction to a reaction. *grin*

Usually we face the fact that our Win32 programs are enormous. Today it is not a problem to write a small 4096-byte "Hello, world!" program for Win32, which may surprise many Visual Basic programmers. However, 4096 bytes for a program that doesn't do anything is TOO MUCH!

Tools won't help, that available.
There's no linker that is able
to make an executable
with a size that's reasonable.

This little "poem" illustrates one thing, known from the very beginning: when you want to have something done well - do it yourself.

To create a small Windows executable, we first have to learn the Windows executable format, which is known as Portable Executable (PE). Naturally, the portability of this format is questionable. We don't have too many resources:

- The description of the PE format, released before the first Win32, contains some useful information, though it is not sufficient. You can find it at wotsit.org.

- A very well known file named winnt.h, distributed with every C(++) compiler for Win32, contains the structures found in the PE headers, with a minimal description. Just use your favourite editor's search command and look for the text "image format".

- Issue #2 of Assembly Journal (asmjournal.freeservers.com) also contains an interesting article about a tiny PE executable.

Having these three resources we are able to create our own executable that will be very small. The article in asmjournal shows a Win32 console program that prints its command line. The program has only 192 bytes!!!! Unfortunately this is only possible under WinNT4. The app loads under Win2K, but since it calls some kernel routines by their fixed WinNT4-specific addresses, it crashes.

So here we notice once again that WinNT is a very different OS from the so-called consumer Windows. WinNT is much less restrictive than Win9x, concerning the executables. Nevertheless we want to create small executables that will work with Win9x, too!

Choosing an assembler

At this point we choose an assembler - it will be NASM (the Netwide Assembler). This assembler allows us to hand-code a PE header with minimal effort, using only one directive and no linker.

If you haven't used NASM, it is worth knowing that it has two (and ONLY two) uncomfortable limitations. The following 32-bit code:



	jmp	here
here:	add	ecx, 5

...it assembles to these bytes:



E9 00 00 00 00   	   jmp	   +00000000h
81 C1 05 00 00 00	   add	   ecx, 00000005h

To make NASM behave kindly, we have to code:



	jmp	short here
here:	add	ecx, byte 5

...and then we get:



EB 00   		   jmp	   +00h
83 C1 05		   add	   ecx, +05h

As a side note, NASM has multiple advantages:

- it is free,

- it is simple as it has very, very few directives,

- it has a powerful preprocessor, not found in any other assembler,

- it provides you with total control over your code,

- it is extremely portable - you can use your code on any x86 OS.

But we aren't here to talk about NASM pros and cons... *grin*

First things first

How does a PE executable work? A PE executable, further referred to as PE, consists of headers and sections. Headers contain data for the executable loader, and sections contain the actual program.

In the file, the sections of a PE are aligned. This means that each section starts at an address divisible by a number which is the base of the alignment, and also each section has a size divisible by that number. If it happens that a section is too short, i.e. its size is not aligned, it is simply padded with zeroes to the alignment boundary. For example if the required alignment is 32, and our section has only 24 bytes, additional 8 zeroes are added to the end of the section. The same goes for the headers: if the headers in our PE are too short, they are zero padded to the alignment boundary, so that the first section that follows them is aligned.
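
The padding rule above is easy to express in code. Here is a minimal sketch in C; the function names align_up and padding_for are my own, not taken from any header, and the alignment is assumed to be a power of two:

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next multiple of alignment.
   Assumption: alignment is a non-zero power of two. */
uint32_t align_up(uint32_t size, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Number of zero bytes appended to reach the alignment boundary. */
uint32_t padding_for(uint32_t size, uint32_t alignment)
{
    return align_up(size, alignment) - size;
}
```

With the numbers from the text, padding_for(24, 32) gives the 8 zero bytes mentioned above.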

The docs say that all sections have to be aligned, but practice shows that the last section in a PE doesn't have to have aligned size - therefore file size doesn't have to be aligned.

Win9x requires that file alignment be 512 bytes (200h). This means that all the headers together occupy no less than 512 bytes, or a multiple of 512, and also every section, but the last one, occupies a multiple of 512 bytes. And here we have the first difference between WinNT and Win9x: WinNT doesn't have this restriction.

PE sections contain miscellaneous things. They may contain code, data, imports (references to functions imported from other DLLs), exports, resources, symbol tables, etc. No matter what the docs say and how the available linkers behave, it is not required that these entities reside in separate sections. For example, one can put code and data into one section. One can even put everything into one section - and this is what we want to do. As you may easily guess, doing this we gain a few bytes otherwise lost on file alignment.

Somewhere in the PE headers (later about where exactly) there is a number that tells where the PE will be loaded. A PE can be either a program or a DLL library (a DLL can be a library with routines, fonts, drivers, etc.). For DLLs this load address is only a proposed location - the system can load a DLL at a different location. But executables are (almost) always loaded at this address. And for executables this address is usually 400000h (4MB); using a different address than 400000h we risk our program being loaded not at the address we want. This fixed nature of a program's load address shows another difference between programs and DLLs - DLLs need additional relocation tables that will enable the system to relocate DLL code to a different location than the one specified, and executables usually don't have the relocations.

I am not going to explain the mysteries of 32-bit flat addressing mode here, but it is enough to know that each program has its own address space in this mode; as you probably know, Win32 uses flat mode. A program calls multiple routines that reside in DLLs. These DLLs are mapped to the program's address space, and each DLL has a unique location - hence the need for relocating DLLs, while the program can have a fixed load address. Each module has a unique module handle, which sometimes needs to be passed to system functions, like CreateWindowEx() for example. As a matter of fact, this module handle is the module's load address; programs needlessly call the GetModuleHandle() routine, which always returns their load address, which is usually 400000h. Hence another optimisation for us: we won't ever need to import and call the GetModuleHandle() function.

PEs aren't loaded linearly. The headers are loaded exactly at the load address, but sections are further relocated. When being loaded, sections are expanded and aligned to a greater value than file alignment. This is called section alignment. The following table presents a set of entities residing in an example PE, and how they get relocated.

What         File position   File size   Memory position (RVA)   Memory size
Headers      0               200h        0                       1000h
Section 1    200h            600h        1000h                   1000h
Section 2    800h            0           2000h                   7000h
Section 3    800h            121h        9000h                   1000h

Again, the section size in the file is aligned to File Alignment, while its size in memory after being loaded is aligned to Section Alignment. In the above example a PE contains 3 sections: first that has some stuff and is padded to 600h bytes (the actual section size can be 500h for instance), second that is empty, at least in the file, and third that has 121h bytes - alignment not required. Assuming a default section alignment of 1000h, the first section is expanded to 1000h bytes - 0A00h zeroes are added to its end to fulfill the alignment requirements, the second section is expanded to 7000h bytes and the last section to 1000h. Now we notice that section sizes in memory are actually bigger than section sizes in file, and this is a nice way of allocating non-temporal memory.

Win9x requires that the section alignment be no less than 1000h. Of course both file alignment and section alignment have to be powers of two. Almost all programs use the default alignment values - 200h and 1000h, respectively. I wouldn't recommend using any other values than these; who knows what Microsofters will devise in the future?

Sections must be given ascending addresses, i.e. they have to be placed in memory in the same order in which they appear in the file. Not sticking to this rule may produce undesirable results.

You are probably wondering what the RVA found in our small example means, and why the headers are loaded at RVA=0. RVA means Relative Virtual Address, and it is an offset relative to Image Base - the address at which our file is loaded. So if Image Base is 400000h, the headers are loaded at 400000h and the first section from our example at 401000h.
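
To make the relation between file offsets, RVAs and real addresses concrete, here is a small sketch in C that encodes the example table from above. The struct and function names are hypothetical, made up for this illustration only:

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the example layout: where a piece of the PE sits in the
   file and where it ends up in memory. */
struct region {
    uint32_t file_pos, file_size;   /* aligned to file alignment    */
    uint32_t rva, mem_size;         /* aligned to section alignment */
};

const struct region example_layout[4] = {
    { 0x000, 0x200, 0x0000, 0x1000 },   /* headers                   */
    { 0x200, 0x600, 0x1000, 0x1000 },   /* section 1                 */
    { 0x800, 0x000, 0x2000, 0x7000 },   /* section 2: no file data   */
    { 0x800, 0x121, 0x9000, 0x1000 },   /* section 3                 */
};

/* An RVA becomes a real address by adding the image base. */
uint32_t rva_to_va(uint32_t image_base, uint32_t rva)
{
    return image_base + rva;
}

/* File offset that backs the given RVA, or -1 when the RVA lies outside
   every region or in memory that exists only after loading. */
int64_t rva_to_file_offset(const struct region *table, size_t count,
                           uint32_t rva)
{
    for (size_t i = 0; i < count; i++) {
        if (rva >= table[i].rva && rva - table[i].rva < table[i].mem_size) {
            uint32_t delta = rva - table[i].rva;
            if (delta < table[i].file_size)
                return (int64_t)table[i].file_pos + delta;
            return -1;
        }
    }
    return -1;
}
```

For instance, rva_to_va(0x400000, 0x1000) yields 401000h, the address of the first section in the example.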

What makes us headerache

The time has come to reveal the structure of PE headers - the aim of this article. Hopefully having understood how PE files are loaded and what they consist of, we can learn that all addresses in the headers and in the load-time portions of sections (e.g. import tables) are relative to image base, i.e. they are RVAs. All addresses in code and data of a PE are non-relative, i.e. fixed, unless they are relocated - provided that a PE contains relocation tables.

Each PE has the following headers, in exactly that order:

- DOS stub,

- PE header,

- optional header,

- section headers.

If there aren't any weird things in a PE, it has some padding after the headers, and then the sections come.

The DOS stub is a small DOS executable that usually displays some annoying message when the user tries to run the program in DOS. This stub is not required to exist, only the MZ header must be there. The MZ header must consist of two bytes 'M' and 'Z' at offset 0, and a 32-bit number at offset 3Ch, so our entire MZ header has 64 (40h) bytes. The other bytes within the MZ header are not important - they can have any value. The 32-bit number that ends the MZ header is a file-relative offset to the PE header.
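
As a rough illustration of how little the MZ header needs, the following C sketch fills a 64-byte buffer with nothing but the 'MZ' signature and the offset of the PE header. The names make_mz_header and read_pe_offset are hypothetical; zeroing the unimportant bytes is just one possible choice, since they may hold anything:

```c
#include <stdint.h>
#include <string.h>

enum { MZ_HEADER_SIZE = 0x40, MZ_PE_OFFSET_FIELD = 0x3C };

/* Fill a 64-byte buffer with a bare MZ header: the signature at offset 0
   and the little-endian file offset of the PE header at offset 3Ch. */
void make_mz_header(uint8_t hdr[MZ_HEADER_SIZE], uint32_t pe_offset)
{
    memset(hdr, 0, MZ_HEADER_SIZE);          /* the rest may be anything */
    hdr[0] = 'M';
    hdr[1] = 'Z';
    hdr[MZ_PE_OFFSET_FIELD + 0] = (uint8_t)(pe_offset & 0xFF);
    hdr[MZ_PE_OFFSET_FIELD + 1] = (uint8_t)((pe_offset >> 8) & 0xFF);
    hdr[MZ_PE_OFFSET_FIELD + 2] = (uint8_t)((pe_offset >> 16) & 0xFF);
    hdr[MZ_PE_OFFSET_FIELD + 3] = (uint8_t)((pe_offset >> 24) & 0xFF);
}

/* Read back the file offset of the PE header. */
uint32_t read_pe_offset(const uint8_t hdr[MZ_HEADER_SIZE])
{
    return (uint32_t)hdr[MZ_PE_OFFSET_FIELD]
         | ((uint32_t)hdr[MZ_PE_OFFSET_FIELD + 1] << 8)
         | ((uint32_t)hdr[MZ_PE_OFFSET_FIELD + 2] << 16)
         | ((uint32_t)hdr[MZ_PE_OFFSET_FIELD + 3] << 24);
}
```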

The PE header should be located in file at an address divisible by 8 (must be 8-byte aligned). It can actually begin within the DOS header, using up the unused bytes, but it is better to place it after the MZ header, i.e. at offset 40h. We will take advantage of the extra spare bytes in the MZ header at a later time.

The PE header contains a bunch of numbers:

Size    Value   Description
dword   'PE'    PE magic number identifying the PE header
word    14Ch    Machine for which this executable is built (14Ch is 386)
word    1       Number of sections - in our case it will be only one
dword   ?       Time stamp - this can be any value
dword   ?       Pointer to symbol table - we won't use any symbol tables
dword   0       Number of symbols - zero in our case
word    X       Size of optional header
word    10Fh    Characteristics - bitflags (10Fh is 32-bit executable)

The ? values are unimportant - let's set them to 0. The size of the optional header that comes right after the PE header has been marked as X - we will put an appropriate expression in the source file there, so the size of the optional header will be figured out at compile time. We will also do similar things later on.
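
For readers who prefer C to tables, the PE header above can be sketched as a struct. This is only an illustration: the field names are my own (winnt.h uses different ones), and the static_assert merely checks that the layout adds up to the 24 bytes the table implies.

```c
#include <assert.h>
#include <stdint.h>

/* The PE header from the table above; field names are illustrative. */
struct pe_header {
    uint32_t signature;             /* 'P', 'E', 0, 0                  */
    uint16_t machine;               /* 14Ch for 386                    */
    uint16_t number_of_sections;
    uint32_t time_stamp;            /* unimportant                     */
    uint32_t symbol_table_ptr;      /* unimportant                     */
    uint32_t number_of_symbols;
    uint16_t optional_header_size;  /* the X from the table            */
    uint16_t characteristics;       /* 10Fh                            */
};

static_assert(sizeof(struct pe_header) == 24, "PE header is 24 bytes");

/* Build the header with the values from the table, unimportant
   fields set to 0. */
struct pe_header make_pe_header(uint16_t optional_header_size)
{
    struct pe_header h = {0};
    h.signature = 0x00004550u;      /* "PE\0\0" as a little-endian dword */
    h.machine = 0x14C;
    h.number_of_sections = 1;
    h.optional_header_size = optional_header_size;
    h.characteristics = 0x10F;
    return h;
}
```
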
The optional header, which is in fact NOT optional, contains more information, specific to our executable:

Size        Value     Description
word        10Bh      Optional header magic number
word        ?         Linker version - we don't care
dword       ?         Size of code - we could give it some real value
dword       ?         Size of data
dword       ?         Size of uninitialized data - this is usually 0
dword       X         Program entry point
dword       X         Base of code
dword       X         Base of data
dword       400000h   Image base - this is where our PE is loaded
dword       1000h     Section alignment - we agree on 1000h
dword       200h      File alignment - phew!
dword       4         OS version - better leave it 4.00
dword       0         Image version - huh?
dword       4         Subsystem version
dword       ?         Win32 version
dword       X         Image size IN MEMORY
dword       X         Size of all headers - file offset of first section
dword       ?         Checksum
word        2         Subsystem (2 is Win32 GUI)
word        ?         DLL characteristics - we have an executable, not a DLL
dword       100000h   Stack size
dword       1000h     Stack commit
dword       100000h   Heap size
dword       1000h     Heap commit
dword       0         Loader flags
dword       16        Number of directories
32 dwords   -         Directories follow

A lot of stuff! All meaningful addresses are of course RVAs. The entry point is the RVA at which our program will start. Base of code and base of data aren't too important, but we can set them to some valid values. The stack and heap sizes have to be set to some useful values. Stacks are always thread-specific, and in our case each stack will have an initial size of 1000h bytes, and limits of 100000h bytes. Heaps usually aren't used, but we can (or must) sacrifice 4KB of memory.
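
Two of the X values can be worked out with simple arithmetic. The sketch below is in C with hypothetical function names. The first function just adds up the table above (96 bytes of fixed fields plus 8 bytes per directory). The second assumes the usual convention that the image size in memory is the end of the last section rounded up to the section alignment; the article does not spell this rule out, so treat it as my assumption.

```c
#include <stdint.h>

/* Fixed fields of the optional header take 96 bytes; every directory
   entry adds two dwords. */
uint32_t optional_header_size(uint32_t number_of_directories)
{
    return 96u + 8u * number_of_directories;
}

/* Round value up to a multiple of alignment (a power of two). */
uint32_t round_up(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Assumed rule: image size = end of the last section in memory,
   rounded up to the section alignment. */
uint32_t image_size_in_memory(uint32_t last_section_rva,
                              uint32_t last_section_size,
                              uint32_t section_alignment)
{
    return round_up(last_section_rva + last_section_size, section_alignment);
}
```

With 16 directories the optional header comes to 224 bytes, and a single short section at RVA 1000h gives an image size of 2000h.
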
The table of directories found in the executable is a real pain in the a. Each entry of this table consists of two dwords - a pointer (RVA) to a directory, residing somewhere in some section, and the size of that directory. The second directory is the imports directory, and this is what we want to have. We aren't interested in any other directories, so we set all of their corresponding entries in the table to 0s - if you want to learn more about them, refer to the PE document from wotsit.org, or to winnt.h. In our case there could be only two table entries, as the second of them points to our beloved import directory. For WinNT it is OK to have only two entries, but Win9x requires 16, and that's why we don't like it.
With the end of the table of directories the optional header ends. After it we find section headers. The number of section headers is in the PE header existing before the optional header. Each section header has the following structure:

Size    Value        Description
qword   ?            ASCII section name - can be anything you want
dword   X            Size in memory
dword   X            RVAddress in memory (in our case it will be 1000h)
dword   X            Size in file
dword   X            Offset in file
dword   ?            Pointer to relocations - we won't have any
dword   ?            Pointer to line numbers - debug info, anyone ?
word    0            Number of relocations
word    0            Number of line number entries
dword   0E0000060h   Flags

There are many available flag values (from winnt.h):

Code        Description (section contains...)
00000020h   Code
00000040h   Data
00000080h   Uninitialized data - this is fiction!!!
01000000h   Extended relocations
02000000h   Section can be discarded
04000000h   Section is not cacheable
08000000h   Section is not pageable
10000000h   Section is shareable
20000000h   Section is executable
40000000h   Section is readable
80000000h   Section is writable

There are even more flags, but many of those presented here, like most of the others, are unimportant. For example if we choose our section to be readable but not writable, we still will be able to write to it, even under WinNT.
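
The flags value 0E0000060h used in the section header is simply a few entries of the flag table OR-ed together. A short C sketch, with constant names invented for this example (winnt.h has its own, longer names):

```c
#include <stdint.h>

/* A subset of the section flags listed above; names are illustrative. */
#define SCN_CODE        0x00000020u
#define SCN_DATA        0x00000040u
#define SCN_DISCARDABLE 0x02000000u
#define SCN_EXECUTABLE  0x20000000u
#define SCN_READABLE    0x40000000u
#define SCN_WRITABLE    0x80000000u

/* Flags for a single section that holds everything: code and data,
   executable, readable and writable. */
uint32_t all_in_one_section_flags(void)
{
    return SCN_CODE | SCN_DATA | SCN_EXECUTABLE | SCN_READABLE | SCN_WRITABLE;
}

/* Non-zero when every bit of mask is set in flags. */
int section_has(uint32_t flags, uint32_t mask)
{
    return (flags & mask) == mask;
}
```
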

Before we seriously go into coding, we still have to learn how an import table looks. An import table consists of a set of entries, each of which corresponds to some DLL from which we import functions. Each import table entry has the following structure:

Size    Value   Description
dword   X       RVA of original thunk
dword   ?       Time stamp
dword   ?       Forwarder chain (what???)
dword   X       RVA of ASCIIZ DLL name
dword   X       RVA of replaced thunk

The last entry in an import table is zero-filled, indicating the end of the import table. The so-called thunk is a zero-terminated array of dword pointers to imported function names. The replaced thunk is filled by the PE loader with actual pointers to imported routines. The original thunk remains untouched, but unlike most linkers, we can supply the same RVA for both the replaced and the original thunk, thus including only one thunk per imported DLL. The entries of each thunk point to ASCIIZ function names preceded by a word value called "hint". The hint is only a suggested index into the DLL's table of exported names; when it does not match, the loader falls back to looking the function up by name, so we can set the hints to 0, and use them as ASCIIZ function name terminators. Note that all function names should be word aligned. Also, it is obvious but worth mentioning once again that the entries of a thunk are RVAs to hints, and those hints are followed by the actual imported function names.
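
Here is a sketch in C of how the hint/name entries and an import table entry could be laid out. All names are hypothetical, and for clarity the sketch writes an explicit zero terminator after each name instead of reusing the next hint as the terminator:

```c
#include <stdint.h>
#include <string.h>

#define NO_ROOM 0xFFFFFFFFu

/* One import table entry (one per DLL), as in the table above. */
struct import_entry {
    uint32_t original_thunk;
    uint32_t time_stamp;
    uint32_t forwarder_chain;
    uint32_t dll_name;
    uint32_t replaced_thunk;
};

/* Both thunk RVAs may point at the same array. */
struct import_entry make_import_entry(uint32_t thunk_rva, uint32_t dll_name_rva)
{
    struct import_entry e = {0};
    e.original_thunk = thunk_rva;
    e.dll_name = dll_name_rva;
    e.replaced_thunk = thunk_rva;
    return e;
}

/* Append a word hint (0) followed by the ASCIIZ name at the next
   word-aligned offset. Returns the offset of the hint and advances *pos,
   or returns NO_ROOM when the buffer is too small. */
uint32_t put_import_name(uint8_t *buf, uint32_t cap, uint32_t *pos,
                         const char *name)
{
    uint32_t start = (*pos + 1u) & ~1u;          /* word alignment */
    uint32_t len = (uint32_t)strlen(name);

    if (start > cap || cap - start < 2u + len + 1u)
        return NO_ROOM;
    if (start != *pos)
        buf[*pos] = 0;                           /* alignment padding */
    buf[start] = 0;                              /* hint, low byte    */
    buf[start + 1] = 0;                          /* hint, high byte   */
    memcpy(buf + start + 2, name, len + 1u);     /* name and its 0    */
    *pos = start + 2u + len + 1u;
    return start;
}
```

The value returned by put_import_name, plus the RVA at which the buffer will be loaded, is what goes into the thunk array.
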

Tips and tricks

As indicated above, we want to create only one single section in our PE, so we won't lose any bytes on file alignment (i.e. alignment of sections within the file). Needless to say, we don't care about any export tables, resources, symbol tables and other weird things that can reside in a PE. Unfortunately we meet two serious Win9x limitations: we must use file alignment 200h, which makes our headers occupy 512 bytes, and what's more we have to include 16 directories, 14 of which are unused - an unnecessary loss of 14*8=112 bytes. As the headers are loaded into memory at the image base, we can fill their unused parts with useful data, such as imported function names, for example. The spare places we get, after getting rid of the DOS stub and leaving only the 64-byte MZ header, are the unused parts of this MZ header (58 bytes) plus the padding bytes after section headers (160 bytes).
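
The byte counts quoted above can be double-checked with a few lines of C. The constants simply restate the header sizes from the earlier tables; the function names are made up for this check:

```c
#include <stdint.h>

enum {
    FILE_ALIGNMENT  = 0x200,
    MZ_HEADER       = 64,
    PE_HEADER       = 24,
    DIRECTORIES     = 16,
    DIRECTORY_ENTRY = 8,
    OPTIONAL_HEADER = 96 + DIRECTORIES * DIRECTORY_ENTRY,
    SECTION_HEADER  = 40
};

/* Bytes actually occupied by all headers for a given section count. */
uint32_t headers_used(uint32_t sections)
{
    return MZ_HEADER + PE_HEADER + OPTIONAL_HEADER + sections * SECTION_HEADER;
}

/* Padding between the last section header and the first section. */
uint32_t padding_after_headers(uint32_t sections)
{
    uint32_t used = headers_used(sections);
    uint32_t aligned = (used + FILE_ALIGNMENT - 1) & ~(uint32_t)(FILE_ALIGNMENT - 1);
    return aligned - used;
}

/* MZ header minus the 'MZ' signature and the dword at offset 3Ch. */
uint32_t mz_spare_bytes(void)
{
    return MZ_HEADER - 2 - 4;
}

/* Bytes taken by directory entries we do not need. */
uint32_t unused_directory_bytes(uint32_t used_directories)
{
    return (DIRECTORIES - used_directories) * DIRECTORY_ENTRY;
}
```

With one section the headers use 352 bytes, which leaves exactly the 160 bytes of padding mentioned in the text.
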

The code

The program we want to create will be a skeleton for a 4K intro. Of course it would be even better to make a simple compressor for our code, but this is rather a topic for Dario Phong.
In C, the program would look as follows:

	WNDCLASS WindowClass; // = { ... };

	void main() // NOT WinMain()
	{
		RegisterClass( &WindowClass );
		HWND hWnd = CreateWindowEx( 0, ClassName, ClassName,
			WS_OVERLAPPEDWINDOW, CW_USEDEFAULT, 0, CW_USEDEFAULT, 0,
			0, 0, (HINSTANCE)0x400000, 0 ); // module handle = load address
		ShowWindow( hWnd, SW_SHOW );
		UpdateWindow( hWnd );

		LPDIRECTDRAW lpDD;
		DirectDrawCreate( 0, &lpDD, 0 );
		lpDD->Vtbl->SetCooperativeLevel( lpDD, hWnd, DDSCL_EXCLUSIVE | DDSCL_FULLSCREEN );
		if ( lpDD->Vtbl->SetDisplayMode( lpDD, 640, 480, 32 ) == DD_OK )
			for (;;) {
				WaitMessage(); // Replace this with your frame rendering
				MSG msg;
				if ( ! PeekMessage( &msg, 0, 0, 0, PM_REMOVE ) ) continue;
				if ( msg.message == WM_QUIT ) break;
				DefWindowProc( msg.hwnd, msg.message, msg.wParam, msg.lParam );
			}
		lpDD->Vtbl->Release( lpDD );
	}

	LRESULT CALLBACK WndProc ( HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam )
	{
		if ( uMsg != WM_DESTROY ) // in assembly this is a plain jump to DefWindowProc
			return DefWindowProc( hwnd, uMsg, wParam, lParam );
		PostQuitMessage( 0 );
		return 0;
	}

This is a minimal program that switches into 640x480x32bpp mode and remains in it until the user presses Alt+F4. Note that I didn't include any surface-creation code here; creating a primary ddsurface is a must if one wants to display anything. We could also add some code for handling the Esc key to the main() function, like:
	if ( msg.message == WM_KEYDOWN && msg.wParam == VK_ESCAPE )
		DestroyWindow( msg.hwnd ); // CloseWindow() would only minimize the window

We actually can afford importing one more function, since we put the import strings in the PE header's padding areas. You may take a different approach on your own, but consider that the import function names are needed only at load time, and we actually don't know what happens with the headers while the program runs.

Implementation

Whether you know it or not, Win32 assumes that all significant calls through the system leave the registers ebx, esi, edi and ebp untouched. This concerns not only the imported routines we call, but also callbacks supplied by us, such as WndProc(). The result is always returned in eax or in the edx:eax pair, and eax, ecx and edx may be destroyed. The standard Win32 calling convention is stdcall (arguments pushed in reverse order and removed by the callee); an exception is the function wsprintf, which uses the C calling convention, since it takes a variable number of arguments.

Many standard Win32 functions come in two versions: ANSI and UNICODE. This concerns mainly routines that take strings as arguments, for example CreateWindowEx. The actual names of this function are CreateWindowExA for ANSI and CreateWindowExW for UNICODE. The UNICODE versions are fully implemented only on WinNT, so CreateWindowExA is the name we will use. (Note that actually there is no CreateWindow function, as it is re-defined to CreateWindowExA/W in winuser.h.)

DirectX and other COM-style calls are also simple. It is enough to notice the difference between calling them from C++ and C:
	lpDD->SetDisplayMode( 640, 480, 32 );                 // In C++
	lpDD->Vtbl->SetDisplayMode( lpDD, 640, 480, 32 );     // In C

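Since Vtbl (spelled lpVtbl in the actual DirectX headers) is located at offset 0, the latter has an alternate syntax, valid when lpDD is treated as a pointer to a pointer to the method table: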
(*lpDD)->SetDisplayMode( lpDD, 640, 480, 32 );

And this is exactly what we are doing in assembly.
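
To see why both syntaxes reach the same routine, here is a self-contained C sketch with a mock object. Everything in it (struct display, its method table, the helper functions) is a made-up stand-in and not the real DirectDraw interface; only the calling pattern matches.

```c
#include <stdint.h>

struct display;                     /* mock stand-in for a COM object */

/* Method table: every method receives the object as its first argument. */
struct display_vtbl {
    long (*SetDisplayMode)(struct display *self, int width, int height, int bpp);
    unsigned long (*Release)(struct display *self);
};

struct display {
    const struct display_vtbl *Vtbl;    /* first member, i.e. offset 0 */
    int width, height, bpp;
    unsigned long refs;
};

static long display_set_mode(struct display *self, int width, int height, int bpp)
{
    self->width = width;
    self->height = height;
    self->bpp = bpp;
    return 0;                           /* 0 stands for success here */
}

static unsigned long display_release(struct display *self)
{
    return --self->refs;
}

static const struct display_vtbl display_methods = {
    display_set_mode,
    display_release
};

void display_init(struct display *d)
{
    d->Vtbl = &display_methods;
    d->width = d->height = d->bpp = 0;
    d->refs = 1;
}

/* Call through the named member. */
long set_mode_via_member(struct display *d)
{
    return d->Vtbl->SetDisplayMode(d, 640, 480, 32);
}

/* Call by treating the object pointer as a pointer to its first member,
   which is what the assembly code effectively does. */
long set_mode_via_deref(struct display *d)
{
    const struct display_vtbl **pp = (const struct display_vtbl **)d;
    return (*pp)->SetDisplayMode(d, 640, 480, 32);
}
```

In assembly the second form boils down to loading the dword at the object pointer and calling through a fixed offset in the table it points to.
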
For the purpose of our example skeleton, we use registers for storing common values, like 0 or hWnd. We also keep a pointer to MSG structure in a register. Because the thunks will lie near this structure, we will also use addresses of imported function pointers relative to it, gaining three bytes on each call to an imported routine.
In case you want to create a hardware accelerated 4KB intro: it is possible to do this with OpenGL, but you probably wouldn't fit many effects, since the import tables would take many precious bytes. A better approach is to use Direct3D - you do not need any imports beyond the ones used in our example, and all calls to Direct3D are done through COM.

Final words

There isn't much to say. You should find the source of the example in the bonus pack. Have fun coding small proggies, and I hope this article helped you in that matter.

Chris Dragan

