Details

Of the many attempts to build Smalltalk Computers, the early ones depended heavily on microcoded architectures. For many years the Dorado was considered the ideal personal computer for Smalltalk, though the fact that it cost over $100 thousand to build and was so hot and loud that it had to be kept in a separate room meant that not many people got to actually use one.

The switch from bipolar to MOS technology for processors made them slower than main memory. In this environment dense instruction encoding no longer was important and simple hardwired architectures in the form of RISC won the day. Clock speeds for the new CMOS processors increased rapidly and soon zoomed past what even the expensive ECL machines had achieved and they just kept going up until recently. Memory density also increased exponentially but its speed is not that much faster than in the old days. So it seems that the time has come to once again consider microcode.

The fourth part of the "blue book" ("Smalltalk-80: The Language and Its Implementation") is a complete description of the Smalltalk virtual machine in the form of a Smalltalk program. Though actually executable, it was written in non object oriented style meant to be easy to translate into Pascal or C (which several people have done over the years) and to be understandable.

Several chapters of the "green book" ("Smalltalk-80: Bits of History, Words of Advice") describe the difference between the specification from the blue book and practical implementations of Smalltalk virtual machines. It is particularly interesting to read Chapter 7 (pages 113 to 126), which describes the Dorado implementation, while looking at the machine description and microcode listing. Some key parts were done in microcode while the rest was implemented in Alto machine language (a variation of the Data General Nova machine language) which was also implemented in microcode. Alternative schemes were discussed.

Dual Image

The "Back to the Future" paper describes the history and implementation of Squeak. There is a mostly complete description of the whole system in Smalltalk itself. What is missing can easily be seen by searching for the senders of #cCode: and friends. Things like the actual code for the sin primitive were not written and either punt to the underlying Smalltalk implementation or depend on a linked C library to provide the actual code.

The system described in the paper has a normal Smalltalk virtual machine and image and inside that you run the Squeak VM code written in Smalltalk which uses one large Array object inside of which is the Squeak image. So there are two VMs and two images. This is used for development and debugging - for normal use the Squeak VM code is translated into C, compiled into a native code standalone application.

The underlying Smalltalk VM and image in the development mode had to be a non Squeak system in the very first bootstrapping pass (obviously) but now they can be Squeak itself. The underlying image can be extremely reduced if it is going to only be used for this application and nothing more. The application can be divided into three parts: the bytecode interpreter, the memory system and the primitives. An interesting variation would be to short circuit the bytecode interpreter so that the underlying VM's interpreter is used instead while keeping the other two parts as they are. This would significantly speed up the simulated Squeak. If this were done on a machine whose native instruction set happens to be Squeak bytecodes, then the simulated Squeak would be fast enough that it could be used as the actual Squeak seen by the user. The underlying Squeak would be hidden away as part of the machine's firmware - the very reduced system image could even be stored mostly in ROM. In theory this system image would need its own VM to be able to run, but in practice a clever scheme could allow it to share the VM inside itself with the user image. This Escher painting shows how confusing supporting an image through itself can be:

Microcode

SiliconSqueak can fetch either bytecodes from the instruction cache or 32 bit wide "microcode" instructions. The bytecodes are transformed into either a single microcode instruction or a call to a sequence of them, depending on how complex its operation is.

There is only one format for microcode:

31 30-29 28-24 23 22-21 20-16 15 14-13 12-8 7 6-4 3-0
cond uD rD ifTrue uA rA #pc uB rB tag uC op

The tag, uC and op fields relate to the ALU.

The uX and rX pair of fields select where the A and B inputs of the ALU come from and the destination of its output.

The unit encoding is as follows:

00 01 10 11
uD t0-t31 i0-i31 x0-x31 s00-s37
uA t0-t31 i0-i31 x0-x31 s00-s37
uB t0-t31 i0-i31 f0-f31 #0-#31

The last encoding for uB is not really a unit but just immediate values instead.

The fetch unit is controlled by the the cond bit, which indicates conditional execution, the ifTrue bit to select the sense of the tests and the #pc bit to indicate that the new PC is to be formed from the 32 bit word after the current microinstruction.

cond ifTrue #pc name description
0 0 0 next execute the microinstruction which follows this one
0 0 1 call save the current PC and go to the indicated address
0 1 0 fetch8 get the following bytecode and either translate it into a microinstruction using hardware or jump to one of 256 addresses, depending on the configuration registers
0 1 1 fetch4 combines the indicated address with the top 4 bits of the following byte to jump to one of 16 addresses
1 0 0 skipFalse jump over the following microinstruction when the condition is false
1 0 1 jumpFalse go to the indicated address when the condition is false
1 1 0 skipTrue jump over the following microinstruction when the condition is true
1 1 1 jumpTrue go to the indicated address when the condition is true

Note that for Squeak the fetch4 wastes memory since only the top 2 or 3 bits of the extended bytecodes are used, so a full 256 byte area will include two or four copies of each instruction sequence. But this allows the hardware to be used with other languages (the option to replace the hardware decoding for fetch8 also helps). Another thing to note is that the PC is one of the context registers and can be the destination in any instruction, so there are many more control flow options than implied in the above table.

Units

Each unit has its own state and can do simple operations, like addition, locally without involving the ALU. Most operations take just one clock cycle, but a cache miss can cause the whole microinstruction to be delayed by quite a few clock cycles.

Temporary

This gives direct access to the arguments and temporary variables of the currently executing method. t0 is always the receiver while t1 is the first argument if there was one. The current top of the expression stack is always available as t31 (store) or as t30 (pop), and in the latter case the stack pointer is decremented after use.

Instance Variable

The first 32 words of the receiver in the currently executing method (t0) can be accessed as i0 to i31.

Stream

There are four stream subunits, each represented by 8 registers: s00 to s07, s10 to s17, s20 to s27 and s30 to s37.

sX0 index
sX1 step
sX2 base points to the word with the first valid index in an object that holds the contents of the stream.
sX3 limit
sX4 reset will cause the pointer to the word with the first valid index in that object to appear in the base and index, the step to be set to 1 and the limit to the last word in the object just stored.
sX5 atEnd when the limit is reached the index wraps around and the status register sX5 goes from false to true.
sX6 next words can be read or written to the stream.
sX7 nextByte bytes can be read or written to the stream.

Besides being used for streams, this unit can efficiently implement array access and can also be used to access instance variables or method literals beyond the first 32.

Context

These registers control the context of the operation of the other units.

Frame Registers

The frame registers must be saved to the return stack on every message send and restored on every return.

x00 self normally has the same value as t0, but is different in the case of block methods. It is the base for i0 to i31.
x01 method is the currently executing method. It is the base for f0 to f31 as well as for the byte stream.
x02 instruction pointer either points to a bytecode in the current method or, if it is negative, points to a word in the virtual L2 intruction cache.
x03 context either is nil or it points to a context object associated with this frame.
x04 frame pointer indicates the offset from the start of the data stack of t0.

Thread Registers

The thread registers must be saved/restored when switching between one thread and another.

x08 stack pointer indicates the offset from the start of the data stack of the top element on the evaluation stack.
x09 return pointer indicates the offset from the start of the return stack of the topmost saved value of the previous frame.
x10 dblockHigh a 7 bit value indicating the current high data block in the stack cache.
x11 dblockLow a 7 bit value indicating the current low data block in the stack cache
x12 rblock a 7 bit value indicating the current return block in the stack cache
x13 savedIP the value of the instruction pointer before the execution of a call.

Image Registers

A single hardware node can be used for several "images". As has been described above, at least a system and a user image is typically used, though with two or more processors it would be possible to have each node deal with a single image. In the case where a node does need to switch between images, however, these registers need to be saved/restored.

x14 bcTable the offset of a 4KB region in the virtual L2 instruction cache with a block of 16 bytes corresponding to each possible bytecode. One bit indicates that the built-in hardware decoder should be used.
x15 pageTable the physical address of a table used to map image addresses to physical addresses.

System Registers

These registers allow low level access to the hardware or have temporary variables that don't need to be saved.

x16 stack cache pointer selects a word in the stack cache.
x17 stack cache data reads or writes the word selected by the previous register and increments that register.
x18 data cache pointer
x19 data cache data
x20 instruction cache pointer
x21 instruction cache data
x22 ring data is used to directly inject data into the ring network or to fetch data from it.
x23 ring last/status is used to inject the last word of a pack into the ring network or else to read the status of the network.
x24 rTop the top of the return stack. Reading from it causes a "pop" while writing to it a "push".
x25 current byte is the value pointed to by the instruction pointer in the current method.
x26 next byte increments the instruction pointer.
x27 next word increments the instruction pointer by 4.
x28-x31 undefined
x05-x07 scratch is used for such things as remembering the selector of a message or the number of arguments.

Fetch

This can fetch either bytecodes or microinstructions, and when used as registers f0 to f31 allow access to the literals in the currently executing method.

ALU

When the tag bit is zero, the "raw" 32 bits are used as in other processors. When it is one, the values of the inputs and the result must be proper SmallIntegers or all side effects of the instruction will be cancelled (writing to the destination register, changing the stack pointer or the index in the stream units) and a call to a known address (what is normally called a "trap") will happen instead.

Besides the 32 bit result that the ALU sends to the D bus, it also generates a 1 bit result which is used for control flow.

The uC field selects which of the eight possible combinational subunits (some might include pipeline registers and so not be strictly combinational) is used to compute the result. The bits of the op field mean different things to each of these subunits:

  1. add - the bottom bit of op selects between addition and subtraction while the other three bits encode the condition to be tested (in the traditional PDP-11/ARM style: zero, carry, negative, overflow, less than, less than or equal, lower or same unsigned, always).
  2. logic - the 16 rules from bitblit are selected by op.
  3. multiply - the bits of op select between signed/unsigned, multiply/shift, left/right and special/normal. "Right" with "multiply" means division. "Special" means rotation for the case of "shift", the high 32 bits for "multiply" and the remainder for "divide". Signed rotations or shift left don't make sense and have exactly the same effect as the unsigned versions.
  4. to be defined
  5. to be defined
  6. to be defined
  7. to be defined
  8. to be defined

In the FPGA version some of the subunits can be defined for specific applications. A floating point unit is a good example of things that can be added to the list of standard subunits in the future.

What's new

  • Down to business
    27 December 20116:26:02 pm by admin
    I like blogging, so here is my second post.
  • Hello
    27 December 201112:00 am by admin
    Hi! This is my first post.