2.12 Translating and Starting a Program
This section describes the four steps in transforming a C program in a file from storage (disk or flash memory) into a program running on a computer. Figure 2.20 shows the translation hierarchy. Some systems combine these steps to reduce translation time, but programs go through these four logical phases. This section follows this translation hierarchy.
FIGURE 2.20 A translation hierarchy for C. A high-level language program is first compiled into an assembly language program and then assembled into an object module in machine language. The linker combines multiple modules with library routines to resolve all references. The loader then places the machine code into the proper memory locations for execution by the processor. To speed up the translation process, some steps are skipped or combined. Some compilers produce object modules directly, and some systems use linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked library routines are x.a, dynamically linked library routines are x.so, and executable files by default are called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, and .EXE to the same effect.
[Figure 2.20 diagram: C program -> Compiler -> Assembly language program -> Assembler -> Object: Machine language module, together with Object: Library routine (machine language) -> Linker -> Executable: Machine language program -> Loader -> Memory]
Compiler
The compiler transforms the C program into an assembly language program, a symbolic form of what the machine understands. High-level language programs take many fewer lines of code than assembly language, so programmer productivity is much higher.
In 1975, many operating systems and assemblers were written in assembly language because memories were small and compilers were inefficient. The million-fold increase in memory capacity per single DRAM chip has reduced program size concerns, and optimizing compilers today can produce assembly language programs nearly as well as an assembly language expert, and sometimes even better for large programs.
Assembler
Since assembly language is an interface to higher-level software, the assembler can also treat common variations of machine language instructions as if they were instructions in their own right. The hardware need not implement these instructions; however, their appearance in assembly language simplifies translation and programming. Such instructions are called pseudoinstructions.
As mentioned above, the RISC-V hardware makes sure that register x0 always has the value 0. That is, whenever register x0 is used, it supplies a 0, and if the programmer attempts to change the value in x0, the new value is simply discarded. Register x0 is used to create the assembly language instruction that copies the contents of one register to another. Thus, the RISC-V assembler accepts the following instruction even though it is not found in the RISC-V machine language:
    li x9, 123 // load immediate value 123 into register x9
The assembler converts this assembly language instruction into the machine language equivalent of the following instruction:
    addi x9, x0, 123 // register x9 gets register x0 + 123
The RISC-V assembler also converts mv (move) into an addi instruction. Thus
    mv x10, x11 // register x10 gets register x11
becomes
    addi x10, x11, 0 // register x10 gets register x11 + 0
assembly language A symbolic language that can be translated into binary machine language.
pseudoinstruction A common variation of assembly language instructions often treated as if it were an instruction in its own right.
The assembler also accepts j Label to unconditionally branch to a label, as a stand-in for jal x0, Label. It also converts branches to faraway locations into a branch and a jump. As mentioned above, the RISC-V assembler allows large constants to be loaded into a register despite the limited size of the immediate instructions. Thus, the load immediate (li) pseudoinstruction introduced above can create constants larger than addi’s immediate field can contain; the load address (la) macro works similarly for symbolic addresses. Finally, the assembler can simplify the instruction set by determining which variation of an instruction the programmer wants. For example, the RISC-V assembler does not require the programmer to specify the immediate version of the instruction when using a constant for arithmetic and logical instructions; it just generates the proper opcode. Thus
    and x9, x10, 15 // register x9 gets x10 AND 15
becomes
    andi x9, x10, 15 // register x9 gets x10 AND 15
We include the “i” on the instructions to remind the reader that andi produces a different opcode in a different instruction format than the and instruction with no immediate operands.
In summary, pseudoinstructions give RISC-V a richer set of assembly language instructions than those implemented by the hardware. If you are going to write assembly programs, use pseudoinstructions to simplify your task. To understand the RISC-V architecture and be sure to get best performance, however, study the real RISC-V instructions found in Figures 2.1 and 2.18.
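The expansions described above can be sketched as a small rewriting function. This is a minimal illustration, not how any particular assembler is implemented: the names `expand` and `is_constant` are hypothetical, the parsing handles only the simple `op a, b, c` form, and the `lui` plus `addi` split for large constants assumes the value fits in a signed 32-bit range (corner cases near the top of that range on 64-bit RISC-V are ignored).

```python
def is_constant(text):
    """Return True if text parses as an integer literal (decimal, 0x.., 0b..)."""
    try:
        int(text, 0)
        return True
    except ValueError:
        return False


def expand(line):
    """Expand one pseudoinstruction into a list of real RISC-V instructions."""
    op, _, rest = line.partition(" ")
    args = [a.strip() for a in rest.split(",")]

    if op == "mv" and len(args) == 2:                # mv rd, rs
        return [f"addi {args[0]}, {args[1]}, 0"]
    if op == "j" and len(args) == 1:                 # j Label
        return [f"jal x0, {args[0]}"]
    if op == "and" and len(args) == 3 and is_constant(args[2]):
        # A constant operand selects the immediate form of the instruction.
        return [f"andi {args[0]}, {args[1]}, {args[2]}"]
    if op == "li" and len(args) == 2:
        rd, value = args[0], int(args[1], 0)
        if -2048 <= value <= 2047:                   # fits addi's 12-bit immediate
            return [f"addi {rd}, x0, {value}"]
        # Assumed strategy: lui sets the upper 20 bits, addi adds the
        # sign-extended low 12 bits.
        low = ((value & 0xFFF) ^ 0x800) - 0x800
        high = ((value - low) >> 12) & 0xFFFFF
        steps = [f"lui {rd}, {high}"]
        if low != 0:
            steps.append(f"addi {rd}, {rd}, {low}")
        return steps
    return [line]                                    # already a real instruction
```

For instance, `expand("mv x10, x11")` yields the single `addi` shown earlier, while a constant such as 4095 needs two instructions because its low 12 bits are treated as the signed value -1.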
Assemblers will also accept numbers in a variety of bases. In addition to binary and decimal, they usually accept a base that is more succinct than binary yet converts easily to a bit pattern. RISC-V assemblers use hexadecimal and octal.
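As a small illustration of why hexadecimal and octal convert easily to bit patterns, the sketch below parses constants written in several bases. It uses Python's own literal prefixes (`0x`, `0o`, `0b`) as a stand-in; the notation a given assembler accepts may differ, and `parse_constant` is a hypothetical helper name.

```python
def parse_constant(text):
    """Parse a numeric constant; base 0 lets the prefix choose the base."""
    return int(text, 0)


def bit_pattern(value, width=8):
    """Render a non-negative value as a fixed-width binary string."""
    return format(value, f"0{width}b")


# Each hexadecimal digit maps to exactly four bits: A -> 1010, 5 -> 0101.
pattern = bit_pattern(parse_constant("0xA5"))
```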
Such features are convenient, but the primary task of an assembler is assembly into machine code. The assembler turns the assembly language program into an object file, which is a combination of machine language instructions, data, and information needed to place instructions properly in memory.
To produce the binary version of each instruction in the assembly language program, the assembler must determine the addresses corresponding to all labels.
Assemblers keep track of labels used in branches and data transfer instructions in a symbol table. As you might expect, the table contains pairs of symbols and addresses.
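A minimal sketch of how such a table might be built follows. It assumes a toy input format in which a label sits on its own line ending in a colon and every instruction occupies 4 bytes; the function names and the sample program are hypothetical.

```python
def build_symbol_table(lines, start=0):
    """First pass: record the address at which each label is defined."""
    symbols = {}
    address = start
    for line in lines:
        text = line.strip()
        if text.endswith(":"):
            symbols[text[:-1]] = address   # label names the next instruction
        elif text:
            address += 4                   # assumed: 4 bytes per instruction
    return symbols


def branch_offset(symbols, label, pc):
    """Second pass helper: PC-relative distance from pc to a label."""
    return symbols[label] - pc


program = [
    "loop:",
    "addi x5, x5, -1",
    "bne x5, x0, loop",
    "done:",
    "jal x0, done",
]
symbols = build_symbol_table(program)
```

With the table in hand, the `bne` at address 4 can be given the offset `branch_offset(symbols, "loop", 4)`, which is negative because the target lies earlier in the program.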
The object file for UNIX systems typically contains six distinct pieces:
• The object file header describes the size and position of the other pieces of the object file.
• The text segment contains the machine language code.
• The static data segment contains data allocated for the life of the program. (UNIX allows programs to use both static data, which is allocated throughout the program, and dynamic data, which can grow or shrink as needed by the program. See Figure 2.13.)
• The relocation information identifies instructions and data words that depend on absolute addresses when the program is loaded into memory.
• The symbol table contains the remaining labels that are not defined, such as external references.
• The debugging information contains a concise description of how the modules were compiled so that a debugger can associate machine instructions with C source files and make data structures readable.
symbol table A table that matches names of labels to the addresses of the memory words that instructions occupy.
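One way to picture the six pieces is as fields of a single record. The sketch below is only a mental model, not a real object file format: the class name, field types, and the sample module with its labels `counter` and `helper` are all hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ObjectFile:
    header: dict            # sizes and positions of the other pieces
    text_segment: list      # machine code (shown here as assembly strings)
    data_segment: list      # static data
    relocation_info: list   # (address, instruction type, dependency)
    symbol_table: dict      # label -> address, or None if not yet defined
    debug_info: dict = field(default_factory=dict)


def unresolved(obj):
    """Labels the linker still has to resolve for this module."""
    return sorted(label for label, addr in obj.symbol_table.items() if addr is None)


module = ObjectFile(
    header={"name": "main", "text_size": 8, "data_size": 8},
    text_segment=["ld x10, 0(x3)", "jal x1, 0"],
    data_segment=[0],
    relocation_info=[(0, "ld", "counter"), (4, "jal", "helper")],
    symbol_table={"counter": None, "helper": None},
)
```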
The next subsection shows how to attach routines that have already been assembled, such as library routines.
Linker
What we have presented so far suggests that a single change to one line of one procedure requires compiling and assembling the whole program. Complete retranslation is a terrible waste of computing resources. This repetition is particularly wasteful for standard library routines, because programmers would be compiling and assembling routines that by definition almost never change. An alternative is to compile and assemble each procedure independently, so that a change to one line would require compiling and assembling only one procedure. This alternative requires a new systems program, called a link editor or linker, which takes all the independently assembled machine language programs and “stitches” them together. The reason a linker is useful is that it is much faster to patch code than it is to recompile and reassemble.
There are three steps for the linker:
1. Place code and data modules symbolically in memory.
2. Determine the addresses of data and instruction labels.
3. Patch both the internal and external references.
The linker uses the relocation information and symbol table in each object module to resolve all undefined labels. Such references occur in branch instructions and data addresses, so the job of this program is much like that of an editor: it finds the old addresses and replaces them with the new addresses. Editing is the origin of the name “link editor,” or linker for short.
If all external references are resolved, the linker next determines the memory locations each module will occupy. Recall that Figure 2.13 on page 106 shows the RISC-V convention for allocation of program and data to memory. Since the files were assembled in isolation, the assembler could not know where a module’s instructions and data would be placed relative to other modules. When the linker places a module in memory, all absolute references, that is, memory addresses that are not relative to a register, must be relocated to reflect its true location.
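The first two linker steps, placing modules and determining label addresses, can be sketched as follows. The base addresses, module names, and sizes are assumed values chosen for illustration, and `place_modules` and `resolve` are hypothetical helpers rather than part of any real linker.

```python
TEXT_BASE = 0x0040_0000   # assumed start of the text segment
DATA_BASE = 0x1000_0000   # assumed start of the static data segment


def place_modules(modules):
    """Lay modules out back to back; modules is a list of
    (name, text_size, data_size) tuples with sizes in bytes."""
    layout = {}
    text_addr, data_addr = TEXT_BASE, DATA_BASE
    for name, text_size, data_size in modules:
        layout[name] = {"text": text_addr, "data": data_addr}
        text_addr += text_size
        data_addr += data_size
    totals = {"text": text_addr - TEXT_BASE, "data": data_addr - DATA_BASE}
    return layout, totals


def resolve(layout, definitions):
    """Turn module-relative label definitions into absolute addresses.
    definitions maps label -> (module, segment, offset)."""
    return {
        label: layout[module][segment] + offset
        for label, (module, segment, offset) in definitions.items()
    }


layout, totals = place_modules([("A", 0x100, 0x20), ("B", 0x200, 0x30)])
addresses = resolve(layout, {"B": ("B", "text", 0), "Y": ("B", "data", 0)})
```

Because the second module's text begins where the first one's ends, any absolute reference into it must be relocated by the size of everything placed before it.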
The linker produces an executable file that can be run on a computer. Typically, this file has the same format as an object file, except that it contains no unresolved references. It is possible to have partially linked files, such as library routines, that still have unresolved addresses and hence result in object files.
linker Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.
executable file A functional program in the format of an object file that contains no unresolved references. It can contain symbol tables and debugging information. A “stripped executable” does not contain that information. Relocation information may be included for the loader.
Linking Object Files
Link the two object files below. Show updated addresses of the first few instructions of the completed executable file. We show the instructions in assembly language just to make the example understandable; in reality, the instructions would be numbers.
Note that in the object files we have highlighted the addresses and symbols that must be updated in the link process: the instructions that refer to the addresses of procedures A and B and the instructions that refer to the addresses of data doublewords X and Y.
Object file header
    Name        Procedure A
    Text size   100hex
    Data size   20hex
Text segment
    Address   Instruction
    0         ld x10, 0(x3)
    4         jal x1, 0
    . . .     . . .
Data segment
    0         (X)
    . . .     . . .
Relocation information
    Address   Instruction type   Dependency
    0         ld                 X
    4         jal                B
Symbol table
    Label   Address
    X       –
    B       –

Object file header
    Name        Procedure B
    Text size   200hex
    Data size   30hex
Text segment
    Address   Instruction
    0         sd x11, 0(x3)
    4         jal x1, 0
    . . .     . . .
Data segment
    0         (Y)
    . . .     . . .
Relocation information
    Address   Instruction type   Dependency
    0         sd                 Y
    4         jal                A
Symbol table
    Label   Address
    Y       –
    A       –
Procedure A needs to find the address for the variable labeled X to put in the load instruction and to find the address of procedure B to place in the jal instruction. Procedure B needs the address of the variable labeled Y for the store instruction and the address of procedure A for its jal instruction.
From Figure 2.14 on page 107, we know that the text segment starts at address 0000 0000 0040 0000hex and the data segment at 0000 0000 1000 0000hex. The text of procedure A is placed at the first address and its data at the second. The object file header for procedure A says that its text is 100hex bytes and its data is 20hex bytes, so the starting address for procedure B text is 40 0100hex, and its data starts at 1000 0020hex.
Executable file header
    Text size   300hex
    Data size   50hex
Text segment
    Address                    Instruction
    0000 0000 0040 0000hex     ld x10, 0(x3)
    0000 0000 0040 0004hex     jal x1, 252ten
    . . .                      . . .
    0000 0000 0040 0100hex     sd x11, 32(x3)
    0000 0000 0040 0104hex     jal x1, -260ten
    . . .                      . . .
Data segment
    Address
    0000 0000 1000 0000hex     (X)
    . . .                      . . .
    0000 0000 1000 0020hex     (Y)
    . . .                      . . .
Now the linker updates the address fields of the instructions. It uses the instruction type field to know the format of the address to be edited. We have three types here:
1. The jump and link instructions use PC-relative addressing. Thus, for the jal at address 40 0004hex to go to 40 0100hex (the address of procedure B), it must put (40 0100hex – 40 0004hex) or 252ten in its address field. Similarly, since 40 0000hex is the address of procedure A, the jal at 40 0104hex gets the negative number -260ten (40 0000hex – 40 0104hex) in its address field.
2. The load addresses are harder because they are relative to a base register. This example uses x3 as the base register, assuming it is initialized to 0000 0000 1000 0000hex. To get the address 0000 0000 1000 0000hex (the address of doubleword X), we place 0ten in the address field of ld at address 40 0000hex.
3. Store addresses are handled just like load addresses, except that their S-type instruction format represents immediates differently than loads’ I-type format. We place 32ten (20hex) in the address field of sd at address 40 0100hex to get the address 0000 0000 1000 0020hex (the address of doubleword Y).
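The arithmetic behind these three cases can be checked with a short sketch. The addresses are the ones used in this example; the helper names are hypothetical. The two `*_imm_bits` functions show only where a 12-bit immediate lands inside the 32-bit instruction word in the I-type and S-type formats, not a full instruction encoder.

```python
PROC_A, PROC_B = 0x40_0000, 0x40_0100       # text addresses of A and B
X_ADDR, Y_ADDR = 0x1000_0000, 0x1000_0020   # data addresses of X and Y
X3_BASE = 0x1000_0000                       # assumed value held in x3


def jal_offset(pc, target):
    """PC-relative offset placed in a jal's address field."""
    return target - pc


def data_offset(address, base=X3_BASE):
    """Offset from the base register placed in a load or store."""
    return address - base


def i_type_imm_bits(imm):
    """I-type (loads): imm[11:0] occupies instruction bits 31..20."""
    return (imm & 0xFFF) << 20


def s_type_imm_bits(imm):
    """S-type (stores): imm[11:5] in bits 31..25, imm[4:0] in bits 11..7."""
    return (((imm >> 5) & 0x7F) << 25) | ((imm & 0x1F) << 7)
```

The split in the S-type format is why the linker needs the instruction type from the relocation information: the same offset is written into different bit positions for `ld` and `sd`.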
Loader
Now that the executable file is on disk, the operating system reads it to memory and starts it. The loader follows these steps in UNIX systems:
1. Reads the executable file header to determine the size of the text and data segments.
2. Creates an address space large enough for the text and data.
3. Copies the instructions and data from the executable file into memory.
4. Copies the parameters (if any) to the main program onto the stack.
5. Initializes the processor registers and sets the stack pointer to the first free location.
6. Branches to a start-up routine that copies the parameters into the argument registers and calls the main routine of the program. When the main routine returns, the start-up routine terminates the program with an exit system call.
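The six steps can be mimicked with a toy model. This is a loose sketch under heavy simplification, not an operating system loader: memory is a single `bytearray`, the stack is a Python list, the executable is a hypothetical dictionary, and `main` is passed in as an ordinary function standing in for the program's main routine.

```python
def load_and_run(executable, args, main):
    """Toy walk-through of the loader steps; returns (exit status, memory)."""
    # Step 1: read the header to learn the segment sizes.
    text_size = executable["header"]["text_size"]
    data_size = executable["header"]["data_size"]

    # Step 2: create an address space large enough for text and data.
    memory = bytearray(text_size + data_size)

    # Step 3: copy instructions and data into memory.
    memory[:text_size] = executable["text"]
    memory[text_size:] = executable["data"]

    # Step 4: copy the program's parameters onto the stack.
    stack = list(args)

    # Step 5: initialize registers; sp indexes the first free stack slot.
    registers = {"sp": len(stack), "a0": 0, "a1": None}

    # Step 6: the start-up routine moves the parameters into argument
    # registers, calls main, and exits with main's return value.
    registers["a0"], registers["a1"] = len(stack), stack
    exit_status = main(registers["a0"], registers["a1"])
    return exit_status, memory


toy = {
    "header": {"text_size": 4, "data_size": 2},
    "text": b"\x13\x00\x00\x00",
    "data": b"\x01\x02",
}
status, image = load_and_run(toy, ["prog", "x"], lambda argc, argv: argc)
```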