Threaded code


In computer science, threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It is often used in compilers, which may generate code in that form or be implemented in that form themselves. The code may be processed by an interpreter or it may simply be a sequence of machine code call instructions.
Threaded code has better density than code generated by alternative generation techniques and by alternative calling conventions. In cached architectures, it may execute slightly slower. However, a program that is small enough to fit in a computer processor's cache may run faster than a larger program that suffers many cache misses. Small programs may also be faster at thread switching, when other programs have filled the cache.
Threaded code is best known for its use in many compilers of programming languages, such as Forth, many implementations of BASIC, some implementations of COBOL, early versions of B, and other languages for small minicomputers and for amateur radio satellites.

History

The common way to make computer programs is to use a compiler to translate source code to machine code. The resulting executable is typically fast but, because it is specific to a hardware platform, it is not portable. A different approach is to generate instructions for a virtual machine and to use an interpreter on each hardware platform. The interpreter instantiates the virtual machine environment and executes the instructions. Thus, only the interpreter must be compiled.
Early computers had relatively little memory. For example, the Data General Nova, the IBM 1130, and many of the first microcomputers had only 4 kB of RAM installed. Consequently, a lot of time was spent trying to find ways to reduce a program's size, so that it would fit in the available memory.
One solution is to use an interpreter which reads the symbolic language a bit at a time, and calls functions to perform the actions. As the source code is typically much denser than the resulting machine code, this can reduce overall memory use. This was the reason Microsoft BASIC was implemented as an interpreter: its own code had to share the 4 kB memory of machines like the Altair 8800 with the user's source code. A compiler translates from a source language to machine code, so the compiler, source, and output must all be in memory at the same time. In an interpreter, there is no output. Code is created a line at a time, executed, and then discarded.
Threaded code is a formatting style for compiled code that minimizes memory use. Instead of writing out every step of an operation at its every occurrence in the program, as was common in macro assemblers for instance, the compiler writes each common bit of code into a subroutine. Thus, each bit exists in only one place in memory. The top-level application in these programs may consist of nothing but subroutine calls. Many of these subroutines, in turn, also consist of nothing but lower-level subroutine calls. This technique, code refactoring, remains widely used today, although for different reasons.
Mainframes and some early microprocessors such as the RCA 1802 required several instructions to call a subroutine. In the top-level application and in many subroutines, that sequence is constantly repeated, with only the subroutine address changing from one call to the next. This means that a program consisting of many function calls may have considerable amounts of repeated code as well.
To address this, threaded code systems used pseudo-code to represent function calls in a single operator. At run time, a tiny "interpreter" would scan over the top-level code, extract the subroutine's address in memory, and call it. In other systems, this same basic concept is implemented as a branch table, dispatch table, or virtual method table, all of which consist of a table of subroutine addresses.
During the 1970s, hardware designers spent considerable effort to make subroutine calls faster and simpler. On the improved designs, only a single instruction is expended to call a subroutine, so the use of a pseudo-instruction saves no room, and such calls carry almost no extra overhead. Today, though almost all programming languages focus on isolating code into subroutines, they do so for code clarity and maintainability, not to save space.
Threaded code systems save room by replacing that list of function calls, where only the subroutine address changes from one call to the next, with a list of execution tokens, which are essentially function calls with the call opcode stripped off, leaving behind only a list of addresses.
Over the years, programmers have created many variations on that "interpreter" or "small selector". The particular address in the list of addresses may be extracted using an index, general purpose register or pointer. The addresses may be direct or indirect, contiguous or non-contiguous, relative or absolute, resolved at compile time or dynamically built. No single variation is "best" for all situations.

Development

To save space, programmers squeezed the lists of subroutine calls into simple lists of subroutine addresses, and used a small loop to call each subroutine in turn. For example, the following pseudocode uses this technique to add two numbers A and B. In the example, the list is labeled thread and a variable ip tracks our place within the list. Another variable sp contains an address elsewhere in memory that is available to hold a value temporarily.

start:
ip = &thread // points to the address '&pushA', not the textual label 'thread'
top:
jump *ip++ // follow ip to address in thread, follow that address to subroutine, advance ip
thread:
&pushA
&pushB
&add
...
pushA:
*sp++ = A // follow sp to available memory, store A there, advance sp to next
jump top
pushB:
*sp++ = B
jump top
add:
addend = *--sp // point sp to last value saved on stack, follow it to copy that value out
*sp++ = *--sp + addend // copy another value out of stack, add, copy sum into stack
jump top

The calling loop at top is so simple that it can be repeated inline at the end of each subroutine. Control now jumps once, from the end of a subroutine to the start of another, instead of jumping twice via top. For example:

start:
ip = &thread // ip points to &pushA
jump *ip++ // send control to first instruction of pushA and advance ip to &pushB
thread:
&pushA
&pushB
&add
...
pushA:
*sp++ = A // follow sp to available memory, store A there, advance sp to next
jump *ip++ // send control where ip says to and advance ip
pushB:
*sp++ = B
jump *ip++
add:
addend = *--sp // point sp to last value saved on stack, follow it to copy that value out
*sp++ = *--sp + addend // copy another value out of stack, add, copy sum into stack
jump *ip++

This is called direct threaded code. Although the technique is older, the first widely circulated use of the term "threaded code" is probably James R. Bell's 1973 article "Threaded Code".
In 1970, Charles H. Moore invented a more compact arrangement, indirect threaded code, for his Forth virtual machine. Moore arrived at this arrangement because Nova minicomputers had an indirection bit in every address, which made ITC easy and fast. Later, he said that he found it so convenient that he propagated it into all later Forth designs.
Today, some Forth compilers generate direct-threaded code while others generate indirect-threaded code. The executables act the same either way.

Threading models

Practically all executable threaded code uses one or another of these methods for invoking subroutines.

Direct threading

Addresses in the thread are the addresses of machine language. This form is simple, but may have overheads because the thread consists only of machine addresses, so all further parameters must be loaded indirectly from memory. Some Forth systems produce direct-threaded code. On many machines direct-threading is faster than subroutine threading.
For example, a stack machine executing the sequence "push A, push B, add" might be translated to the following thread and routines, where ip is initialized to the address labeled thread.

start:
ip = &thread // ip points to &pushA
jump *ip++ // send control to first instruction of pushA and advance ip to &pushB
thread:
&pushA
&pushB
&add
...
pushA:
*sp++ = A
jump *ip++ // send control where ip says to and advance ip
pushB:
*sp++ = B
jump *ip++
add:
addend = *--sp
*sp++ = *--sp + addend
jump *ip++

Alternatively, operands may be included in the thread. This can remove some indirection needed above, but makes the thread larger:

start:
ip = &thread
jump *ip++
thread:
&push
&A // address where A is stored, not literal A
&push
&B
&add
...
push:
*sp++ = *(*ip++) // fetch the value at the operand address and push it; must move ip past the operand address, since it is not a subroutine address
jump *ip++
add:
addend = *--sp
*sp++ = *--sp + addend
jump *ip++

Indirect threading

Indirect threading uses pointers to locations that in turn point to machine code. The indirect pointer may be followed by operands which are stored in the indirect "block" rather than storing them repeatedly in the thread. Thus, indirect code is often more compact than direct-threaded code. The indirection typically makes it slower, though usually still faster than bytecode interpreters. Where the handler operands include both values and types, the space savings over direct-threaded code may be significant. Older FORTH systems typically produce indirect-threaded code.
For example, if the goal is to execute "push A, push B, add", the following might be used. Here, ip is initialized to address &thread, each code fragment is found by double-indirecting through ip and an indirect block; and any operands to the fragment are found in the indirect block following the fragment's address. This requires keeping the current subroutine in ip, unlike all previous examples where it contained the next subroutine to be called.

start:
ip = &thread // points to '&i_pushA'
jump *(*ip) // follow pointers to 1st instruction of 'push', DO NOT advance ip yet
thread:
&i_pushA
&i_pushB
&i_add
...
i_pushA:
&push
&A
i_pushB:
&push
&B
i_add:
&add
push:
*sp++ = *(*ip + 1) // look 1 past start of indirect block for operand address
jump *(*++ip) // advance ip in thread, jump through next indirect block to next subroutine
add:
addend = *--sp
*sp++ = *--sp + addend
jump *(*++ip)

Subroutine threading

So-called "subroutine-threaded code" consists of a series of machine-language "call" instructions. Early compilers for ALGOL, Fortran, Cobol and some Forth systems often produced subroutine-threaded code. The code in many of these systems operated on a last-in-first-out stack of operands, for which compiler theory was well-developed. Most modern processors have special hardware support for subroutine "call" and "return" instructions, so the overhead of one extra machine instruction per dispatch is somewhat diminished.
Anton Ertl, the Gforth compiler's co-creator, stated that "in contrast to popular myths, subroutine threading is usually slower than direct threading". However, Ertl's most recent tests show that subroutine threading is faster than direct threading in 15 out of 25 test cases. More specifically, he found that direct threading is the fastest threading model on Xeon, Opteron, and Athlon processors, indirect threading is fastest on Pentium M processors, and subroutine threading is fastest on Pentium 4, Pentium III, and PPC processors.
As an example of call threading for "push A, push B, add":

thread:
call pushA
call pushB
call add
ret
pushA:
*sp++ = A
ret
pushB:
*sp++ = B
ret
add:
addend = *--sp
*sp++ = *--sp + addend
ret

Token threading

Token-threaded code uses lists of 8- or 12-bit indices into a table of pointers. It is notably compact, without much special effort by a programmer. It is usually half to three-fourths the size of other threadings, which are themselves a quarter to an eighth the size of non-threaded code. The table's pointers can be either indirect or direct. Some Forth compilers produce token-threaded code. Some programmers consider the "p-code" generated by some Pascal compilers, as well as the bytecodes used by .NET, Java, BASIC and some C compilers, to be token-threading.
A common approach, historically, is bytecode, which uses 8-bit opcodes and, often, a stack-based virtual machine. A typical interpreter is known as a "decode and dispatch interpreter", and follows the form:

start:
vpc = &thread
top:
i = decode(vpc++) /* may be implemented simply as a fetch of the byte at vpc */
addr = table[i]
jump *addr
thread: /* Contains bytecode, not machine addresses. Hence it is more compact. */
1 /*pushA*/
2 /*pushB*/
0 /*add*/
table:
&add /* table[0] = address of machine code that implements bytecode 0 */
&pushA /* table[1] */
&pushB /* table[2] */
pushA:
*sp++ = A
jump top
pushB:
*sp++ = B
jump top
add:
addend = *--sp
*sp++ = *--sp + addend
jump top

If the virtual machine uses only byte-size instructions, decode is simply a fetch from thread, but often there are commonly used 1-byte instructions plus some less-common multibyte instructions, in which case decode is more complex. The decoding of single byte opcodes can be very simply and efficiently handled by a branch table using the opcode directly as an index.
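As a rough illustration of the single-byte case, here is a minimal decode-and-dispatch loop in C. The opcode numbering, the 255 halt sentinel, and every identifier are illustrative choices made for this sketch, not features of any particular interpreter.

#include <stdint.h>
#include <stdio.h>

typedef void (*handler_t)(void);

static int stack[16], *sp = stack;   /* small operand stack */
static int A = 3, B = 4;

static void op_add(void)   { int x = *--sp; sp[-1] += x; }
static void op_pushA(void) { *sp++ = A; }
static void op_pushB(void) { *sp++ = B; }

/* Branch table: the opcode itself is the index, as in the thread above. */
static const handler_t table[] = { op_add, op_pushA, op_pushB };

/* Bytecode thread: pushA (1), pushB (2), add (0), then a halt sentinel. */
static const uint8_t thread[] = { 1, 2, 0, 255 };

int main(void) {
    const uint8_t *vpc = thread;
    for (;;) {
        uint8_t op = *vpc++;     /* decode: just fetch the next byte */
        if (op == 255)           /* sentinel marks the end of the thread */
            break;
        table[op]();             /* dispatch through the branch table */
    }
    printf("%d\n", sp[-1]);      /* prints 7 */
    return 0;
}

A switch statement over the opcode is an equally common way to express the same dispatch; the branch table simply makes the indexing explicit.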
For instructions where the individual operations are simple, such as "push" and "add", the overhead involved in deciding what to execute is larger than the cost of actually executing it, so such interpreters are often much slower than machine code. However, for more complex instructions, the overhead percentage is proportionally less significant.
Counter-intuitively, token-threaded code can sometimes run faster than the equivalent machine code, when that machine code is too large to fit in the processor's cache but the higher code density of threaded code, especially token-threaded code, allows it to fit entirely in high-speed cache.

Huffman threading

Huffman threaded code consists of lists of tokens stored as Huffman codes. A Huffman code is a variable-length string of bits that identifies a unique token. A Huffman-threaded interpreter locates subroutines using an index table or a tree of pointers that can be navigated by the Huffman code. Huffman-threaded code is one of the most compact representations known for a computer program. The index and codes are chosen by measuring the frequency of calls to each subroutine in the code. Frequent calls are given the shortest codes. Operations with approximately equal frequencies are given codes with nearly equal bit-lengths. Most Huffman-threaded systems have been implemented as direct-threaded Forth systems, and used to pack large amounts of slow-running code into small, cheap microcontrollers. Most published uses have been in smart cards, toys, calculators, and watches. The bit-oriented tokenized code used in PBASIC can be seen as a kind of Huffman-threaded code.
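A minimal sketch of how such an interpreter might decode and dispatch tokens, assuming the decode tree has already been built from measured call frequencies; the codes chosen here (0 for add, 10 and 11 for the pushes), the node layout, and all identifiers are illustrative only.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef void (*handler_t)(void);

/* One node of the decode tree: interior nodes branch on the next bit of the
   thread, leaves name the subroutine for one token. Frequent subroutines sit
   near the root and therefore need fewer bits. */
typedef struct node {
    const struct node *child[2];   /* NULL at a leaf */
    handler_t          handler;    /* meaningful only at a leaf */
} node_t;

static int stack[16], *sp = stack;
static int A = 3, B = 4;

static void pushA(void) { *sp++ = A; }
static void pushB(void) { *sp++ = B; }
static void add(void)   { int x = *--sp; sp[-1] += x; }

/* Hand-built tree for the codes: add = 0, pushA = 10, pushB = 11. */
static const node_t leaf_add   = { { NULL, NULL }, add   };
static const node_t leaf_pushA = { { NULL, NULL }, pushA };
static const node_t leaf_pushB = { { NULL, NULL }, pushB };
static const node_t inner      = { { &leaf_pushA, &leaf_pushB }, NULL };
static const node_t root       = { { &leaf_add, &inner }, NULL };

/* The thread "pushA pushB add" is the bit string 10 11 0, packed high-first. */
static const uint8_t thread[] = { 0xB0 };   /* 1011 0000 */
static size_t bitpos;

static int next_bit(void) {
    int bit = (thread[bitpos >> 3] >> (7 - (bitpos & 7))) & 1;
    bitpos++;
    return bit;
}

/* Decode one variable-length token and call its subroutine. */
static void dispatch(const node_t *n) {
    while (n->child[0] != NULL)
        n = n->child[next_bit()];
    n->handler();
}

int main(void) {
    dispatch(&root);   /* pushA */
    dispatch(&root);   /* pushB */
    dispatch(&root);   /* add   */
    printf("%d\n", sp[-1]);   /* prints 7 */
    return 0;
}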

Lesser-used threading

An example is string threading, in which operations are identified by strings, usually looked up in a hash table. This was used in Charles H. Moore's earliest Forth implementations and in the University of Illinois's experimental hardware-interpreted computer language. It is also used in Bashforth.
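A rough sketch of the idea in C, using a linear scan of a small dictionary in place of a real hash table to keep the example short; the thread, the dictionary, and every identifier are illustrative.

#include <stdio.h>
#include <string.h>

typedef void (*handler_t)(void);

static int stack[16], *sp = stack;
static int A = 3, B = 4;

static void pushA(void) { *sp++ = A; }
static void pushB(void) { *sp++ = B; }
static void add(void)   { int x = *--sp; sp[-1] += x; }

/* Dictionary of named operations; a real system would hash the names. */
static const struct { const char *name; handler_t fn; } dict[] = {
    { "pushA", pushA }, { "pushB", pushB }, { "add", add },
};

/* The thread is simply a list of operation names. */
static const char *thread[] = { "pushA", "pushB", "add", NULL };

int main(void) {
    for (const char **ip = thread; *ip != NULL; ip++)
        for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
            if (strcmp(dict[i].name, *ip) == 0) { dict[i].fn(); break; }
    printf("%d\n", sp[-1]);   /* prints 7 */
    return 0;
}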

RPL

HP's RPL, first introduced in the HP-18C calculator in 1986, is a type of proprietary hybrid direct-threaded and indirect-threaded threaded-interpreted language that, unlike other TILs, allows embedding of RPL "objects" into the "runstream", i.e. the stream of addresses through which the interpreter pointer advances. An RPL "object" can be thought of as a special data type whose in-memory structure contains an address to an "object prolog" at the start of the object, with data or executable code following it. The object prolog determines how the object's body should be executed or processed. Using the "RPL inner loop", which was invented by William C. Wickes in 1986 and published in "Programming Environments", Institute for Applied Forth Research, Inc., 1988, execution proceeds as follows:
  1. Dereference the IP and store it into O
  2. Increment IP by the length of one address pointer
  3. Dereference O and store its address in O_1
  4. Transfer control to next pointer or embedded object by setting the PC to O_1 plus one address pointer
  5. Go back to step 1
This can be represented more precisely by:

O  = [I]
I  = I + Δ
PC = [O] + Δ

Where above, O is the current object pointer, I is the interpreter pointer, Δ is the length of one address word and the "[]" operator stands for "dereference".
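For readers more comfortable with C, the following is a loose model of that inner loop. The names object_t, prolog_t and runstream are invented for this sketch, the control transfer is modeled as an ordinary function call rather than a jump, and the embedded-object correction described next is not modeled.

#include <stdio.h>
#include <stdlib.h>

/* Each object begins with a pointer to its prolog, which decides how the
   rest of the object is handled. */
typedef struct object object_t;
typedef void (*prolog_t)(const object_t *self);

struct object {
    prolog_t prolog;    /* [O]: address of the object's prolog */
    int      value;     /* a toy body: just one integer */
};

static const object_t **I;   /* interpreter pointer into the runstream */
static const object_t  *O;   /* current object pointer */

static void print_prolog(const object_t *self) { printf("%d\n", self->value); }
static void quit_prolog(const object_t *self)  { (void)self; exit(0); }

static const object_t obj_seven = { print_prolog, 7 };
static const object_t obj_quit  = { quit_prolog, 0 };

/* The runstream: a list of pointers to objects. */
static const object_t *runstream[] = { &obj_seven, &obj_quit };

int main(void) {
    I = runstream;
    for (;;) {
        O = *I;          /* O  = [I]                                    */
        I = I + 1;       /* I  = I + Δ                                  */
        O->prolog(O);    /* PC = [O] + Δ, modeled as calling the prolog */
    }
}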
When control is transferred to an object pointer or an embedded object, execution continues as follows:

PROLOG -> PROLOG ( the prolog address at the start of the prolog code points to itself )
IF O + Δ =/= PC
THEN GOTO INDIRECT ( test for direct execution )
O = I - Δ ( correct O to point to the start of the embedded object )
I = I + α ( correct I to point past the embedded object, where α is the length of the object body )
INDIRECT ( rest of the prolog )

On HP's Saturn microprocessors that use RPL, there is a third level of indirection made possible by an architectural / programming trick which allows faster execution.

Branches

In all of these interpreters, a branch simply changes the thread pointer (ip above). A conditional branch that jumps only if the top-of-stack value is zero might be encoded as follows. Note that &thread here is the destination of the jump, not the address of a handler, so it must be skipped over regardless of whether the branch is taken.

thread:
...
&brz
&thread
...
brz:
tmp = *ip++ // fetch the branch destination from the thread and step past it
if (*sp++ == 0) // pop the top of the stack; take the branch if it is zero
ip = tmp
jump *ip++

Common amenities

Separating the data and return stacks in a machine eliminates a great deal of stack management code, substantially reducing the size of the threaded code. The dual-stack principle originated three times independently: for Burroughs large systems, Forth, and PostScript. It is used in some Java virtual machines.
Three registers are often present in a threaded virtual machine, and another one exists for passing data between subroutines ("words"). These are:
  1. ip or i (instruction pointer) of the virtual machine (not to be confused with the program counter of the underlying hardware)
  2. w (work pointer)
  3. rp or r (return stack pointer)
  4. sp or s (parameter stack pointer, for passing parameters between words)
Often, threaded virtual machines, such as implementations of Forth, have a simple virtual machine at heart, consisting of three primitives. Those are:
  1. nest, also called docol
  2. unnest, or semi_s
  3. next
In an indirect-threaded virtual machine, the one given here, the operations are:

next:
*ip++ -> w // fetch the address of the next word from the thread and advance ip
jump **w++ // jump through the word's code field, leaving w pointing at the word's body
nest:
ip -> *rp++ // save the current thread position on the return stack
w -> ip // make the word's body (a nested thread) the current thread
next
unnest:
*--rp -> ip // restore the caller's thread position
next

This is perhaps the simplest and fastest interpreter or virtual machine.