
The Mill is a novel CPU architecture first presented to the world at large in 02013. It had been developed by a small group of computer experts at Mill Computing during the preceding ten years.

The design goals of the Mill are to achieve the performance advantages of superscalar, out of order CPUs while maintaining the cost, complexity and power advantages of simpler DSP and in-order designs.

1 Design Philosophy

The design philosophy behind the Mill is multifaceted. Its core design principles aim to achieve the price-performance characteristics of DSP processors on general-purpose code, that is, the code for which conventional Out of Order and Superscalar techniques have been optimized. In doing so the design is expected to achieve absolute performance comparable to conventional CPU designs with significantly lower power and price.

One essential concept which forms the foundation of this is the realization that a majority of general-purpose code is composed of loops, and further that software pipelining, as used in DSPs, is able to absorb massive amounts of simple hardware resources. Consequently, many architectural choices are made to enable software pipelining of general loops to a much greater degree than is possible with conventional architectures. Relying on software pipelining, as produced by the compiler, instead of dynamic scheduling in hardware using OoO and superscalar techniques, eliminates the need for the significant portions of the CPU chip dedicated to that dynamic scheduling and their consequent power and area costs.

Consequently, the Mill architecture is quite distinctive. In detail, the Mill is a Statically Scheduled, In Order, Exposed Pipeline, Single Address Space, Wide Issue, Belt Architecture CPU design.

2 Programming Model

The following description of the design of the Mill is a synthesis of the official documentation, the official wiki, and various other comments the CTO, Ivan Godard, has made on forums across the Internet.

2.1 Glossary

EBB

An Extended Basic Block is a contiguous section of executable code that has only a single entry point, but may have multiple exit points. That is, only the beginning may be a jump target for a jump or call, but the EBB may contain multiple jumps to other EBBs. On the Mill calling a function is allowed inside an EBB and is not considered leaving the EBB.

Exu

Exu refers to the arithmetic functional units and the corresponding side of the CPU. Exu operations manipulate data on the belt.

Flow

Flow operations are any operation dealing with addresses, such as loads, stores, or jumps.

Operation

An operation is something like adding two numbers or jumping to a new address. It uses some functional unit to perform work. Since modern CPUs have multiple functional units it is possible to perform multiple operations in a single cycle.

Different operations may take differing numbers of arguments and consume differing numbers of cycles until their results are ready.

Instruction

An instruction is a set of operations that are all issued in the same cycle, even if those operations do not all complete within the same cycle.

Spiller

The Spiller is a component of the Mill CPU which performs memory reads and writes in the background to move data to and from the special internal memories, such as the scratchpad, when there are insufficient internal resources. Thus, if there is a deep stack of function calls, each of which uses some scratchpad room, there will be no faults due to lack of scratchpad room, because the spiller will be moving that data between the scratchpad and main memory as necessary. Similarly, call frame belt values are paged in and out as necessary.

2.2 The Belt

The Mill uses the unconventional Belt Architecture in preference to the conventional Register Architecture.

2.2.1 Motivation

As of 02015 the dominant architectures for high-performance CPUs are register based with register renaming. Register renaming is used to overcome architectural limitations in the number of programmer-visible registers, and further to allow speculative execution where the compiler cannot determine ahead of time which branch will be taken.

The vast majority of these register uses are as temporary holding areas for intermediate results. Of these, the vast majority are only referenced once, which is partly responsible for the effectiveness of Static Single Assignment as a compiler technique.

However, some necessary uses of registers are to store longer-lived data. This may be data which is accessed relatively frequently, or data which will be used a significant number of cycles after it is produced. This use of registers makes determining the lifetimes of register values difficult, as the use could be beyond the visibility window of the CPU. As such, until proven otherwise, CPUs must assume the liveness of register values and may not reuse those registers.

Given a set of registers and other data sources on a CPU there must be some way to distribute these values to the various functional, control, and memory units within the processor. This requires hardware whose latency scales logarithmically with the number of sources, assuming full connectedness without additional buffering stages, and whose power scales proportionally to the product of sources and sinks, as all of that hardware must be powered. Consequently, having a large set of registers, programmer visible or not, is expensive both in power and latency. It is expensive in power because so many hardware elements and such a large distribution network require power. It is expensive in latency because the depth of muxes (or equivalent) the distribution network requires grows with the logarithm of the number of sources.

Register renaming is a significant source of both the performance and the power costs of modern conventional CPUs intended for general-purpose use, because renaming supplies the registers necessary to allow arbitrary levels of Instruction Level Parallelism. In order to support parallelism at the instruction level it must be possible to identify the dataflow explicitly such that dependencies can be satisfied in order. Unfortunately, this naming will result in name conflicts when a sufficiently complex series of instructions is needed, or if the ILP exceeds what is architecturally feasible, even if the particular CPU has sufficient hardware resources.

There are other CPU architectures which do not present the issues of requiring naming of elements in the dataflow, such as the Accumulator Architecture or Stack Architecture CPUs, but they do not allow significant parallelism because with only a single implicit destination the code written for those architectures is predominantly a series of consecutive hazards.

2.2.2 Implicit Naming of Dataflow Temporaries

As noted above, a significant portion of the uses of registers are for temporary values which will shortly be discarded. The Mill provides an alternative to the traditional register renaming problems by using a Belt Architecture, where the destinations of operations are implicit and yet it is possible to issue many operations within a single cycle.

The belt is conceptually a FIFO of fixed size. Operations specify which elements of the belt are to be used as inputs, and any functional unit can consume data from any belt position. Belt positions are indexed by position relative to the front of the belt. When a functional unit completes, its results are pushed onto the front of the belt, and any elements at the end of the belt in excess of the capacity are lost. Consequently, the data that was previously at belt position b0 may, at the end of the next cycle, be at belt position b5 if five operation results have dropped onto the belt.

A common misconception is that the belt can be viewed as a series of registers. Given the temporal nature of the belt this is not at all the case. The data referenced by a belt position changes after every instruction by a variable number of positions, and there is no explicit or implicit transfer of data from one position to another. Additionally, the Mill has no general-purpose registers of any kind; all general computation is done using the belt. There are special-purpose registers, such as a stack pointer, but they are not usable for general computation.
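
To make the temporal indexing concrete, the following is a minimal sketch in C of a belt modelled as a fixed-size FIFO. The eight-position belt, the per-operation drop (rather than all drops happening together at the end of the instruction), and the operation names are simplifying assumptions for illustration, not properties of any particular Mill family member.

    #include <stdio.h>
    #include <stdint.h>

    #define BELT_SIZE 8  /* assumed size; real members vary */

    typedef struct {
        int64_t slot[BELT_SIZE]; /* slot[0] is the front of the belt, b0 */
    } Belt;

    /* Dropping a result pushes everything back one position; the value
     * formerly at the last position falls off the belt and is lost. */
    static void drop(Belt *b, int64_t result) {
        for (int i = BELT_SIZE - 1; i > 0; i--)
            b->slot[i] = b->slot[i - 1];
        b->slot[0] = result;
    }

    /* Operands are named by position relative to the front: b0, b1, ... */
    static int64_t get(const Belt *b, int pos) { return b->slot[pos]; }

    int main(void) {
        Belt b = {{0}};
        drop(&b, 10);               /* some earlier result, currently b0  */
        drop(&b, get(&b, 0) + 5);   /* "add b0, 5" drops 15; 10 is now b1 */
        drop(&b, get(&b, 1) * 2);   /* "mul b1, 2" reads the 10, drops 20 */
        printf("b0=%lld b1=%lld b2=%lld\n",
               (long long)get(&b, 0), (long long)get(&b, 1),
               (long long)get(&b, 2)); /* prints b0=20 b1=15 b2=10 */
        return 0;
    }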

2.2.3 Explicit Value Preservation

Given the belt's limited size and ever-changing contents, where items are added to and fall off of the belt with every instruction, it is unsuitable for storing values for more than a handful of cycles, since the data will be pushed off the belt. To preserve values with a lifetime longer than a handful of cycles the Mill has the Scratchpad.

The Scratchpad is a subsystem of the CPU. It contains some internal SRAM to store the values as they are moved from and to the belt. If the internal SRAM is insufficiently large then excess values will be paged out to main memory as necessary.

Every frame has an isolated scratchpad which is inaccessible from any other frame. Each Mill family member has a specific maximum scratchpad size, and the number of scratchpad cells required by a particular frame must be explicitly allocated.

The scratchpad is not general-purpose memory. Instead, it is statically addressed and all data put into the scratchpad must be properly aligned. The only source for data directed to the scratchpad is the belt and the only destination for data from the scratchpad is the front of the belt. This transfer from and to the belt requires three cycles, and consequently, all Mill family members have belts sized such that the majority of operation results continue to exist on the belt for at least three cycles.

2.2.4 Multiple Operation Results

Register architectures generally allow a single result per operation with a small number of inputs, often in the range of zero through three. This is generally adequate for most simple operations, and an instruction encoding which allows multiple outputs trades instruction size for that ability. For example, if we assume that a particular architecture requires four bits to address amongst its register set, then an instruction with two outputs and two inputs would require sixteen bits just to indicate the registers. Additional bits would be required to specify which variant is being requested and which operation should be performed. Since the front of the belt is always the implicit destination, it is a simple conceptual matter to implement operations which drop multiple values onto the belt when they complete. One common example is the integer division operation, which returns the quotient and the remainder as two separate belt elements.

A more complex case is the situation with function calls. Programmer-written function calls can require arbitrary elements from the belt as arguments and would, ideally, be able to return multiple results. During the call the belt contents will be entirely replaced with the temporaries of the call itself. A traditional approach to function calls is to first save all the register values to the in-memory call frame, then place the arguments to the function into the start of the new frame for the new call, and finally call into the function. On return, the results are again placed on the stack, the registers restored, and execution continues back in the caller after the call.

The Mill avoids this complication by reducing a call to a single operation, which collects an arbitrary set of elements from the belt to serve as arguments and, upon return, drops the results of the function onto the caller's belt just as any other operation would. The call operation performs the belt reordering by pushing the caller's belt onto a conceptual stack and creating a new belt that only contains the arguments in their well-known order. Upon return, the belt instance is popped, the call results are dropped onto the belt, and execution continues in the caller.
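
Continuing the sketch above, the following hedged C illustration shows what the call operation's belt handling conceptually does; the frame array standing in for the spiller and the ordering of returned results are assumptions made for the example.

    #include <stdint.h>

    #define BELT_SIZE 8
    #define MAX_FRAMES 16

    typedef struct { int64_t slot[BELT_SIZE]; } Belt;

    static Belt saved[MAX_FRAMES];  /* stands in for spiller-managed state */
    static int depth = 0;

    /* call: push the caller's belt and give the callee a fresh belt holding
     * only the selected arguments, in a well-known order (b0, b1, ...). */
    static void mill_call(Belt *belt, const int arg_pos[], int nargs) {
        saved[depth++] = *belt;
        Belt fresh = {{0}};
        for (int i = 0; i < nargs; i++)
            fresh.slot[i] = belt->slot[arg_pos[i]];
        *belt = fresh;
    }

    /* return: restore the caller's belt and drop the results onto its front,
     * exactly as the results of any other multi-result operation would. */
    static void mill_return(Belt *belt, const int64_t results[], int nresults) {
        Belt caller = saved[--depth];
        for (int i = BELT_SIZE - 1; i >= nresults; i--)
            caller.slot[i] = caller.slot[i - nresults];
        for (int i = 0; i < nresults; i++)
            caller.slot[i] = results[i];   /* ordering here is illustrative */
        *belt = caller;
    }

    int main(void) {
        Belt b = {{ 40, 30, 20, 10 }};     /* caller: b0=40, b1=30, ...    */
        int args[] = { 2, 0 };             /* pass b2 (20) and b0 (40)     */
        mill_call(&b, args, 2);            /* callee sees b0=20, b1=40     */
        int64_t results[] = { 99 };
        mill_return(&b, results, 1);       /* caller sees b0=99, b1=40 ... */
        return 0;
    }

After the return, the caller's operands reappear shifted by the number of results, exactly as if the call had been a one-cycle arithmetic operation.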

2.2.5 Atomic Calls

Given that the Mill is a statically scheduled, wide issue, exposed pipeline architecture, having function calls as a single operation could result in some programming challenges. For example, as a wide issue machine the Mill has multiple operations per instruction. If one of these operations is a multi-cycle operation and another a call operation, then there is the difficulty of reconciling a consistent belt positioning within a call frame with executing the call concurrently. The Mill does not stall the call until the long-running operations complete and then proceed with the call. Instead, it treats all calls as atomic single-cycle operations. Any operations in flight at the time of the call operation are retired to the original caller's belt upon the return from the call.

This design choice improves performance due to not needing to flush the CPU pipelines before calls and not needing to restart instructions upon returning from calls. This is especially relevant when it comes to interrupts and faults, which are treated as involuntary calls with the hardware supplying the function arguments.

2.2.6 Atomic Loops

The Mill treats loops quite similarly to function calls. That is, there is a single atomic operation to enter a loop and loops can return multiple results. Similar to calls, loops started with the inner operation get their own belt. Contrary to the case with function calls loops retain access to the spiller and stack state outside the loop as one would expect.

Each loop also gets a rotator on the spiller. A rotator is an addressing convenience where the rotate operation will change the base offset into the spiller memory reserved for the inner loop. This is useful when using the spiller with software pipelining to enable the compiler and specializer to produce instructions which are agnostic to the absolute addressing of items in the spiller. A loop may only modify its own rotator, though it can access values from the rotators of outer loops.

2.2.7 Concurrent Pipeline Operation Retirement

Since different operations may have different latencies, and a pipeline will have several functional units capable of varied operations, it is possible that multiple operations issued in different cycles to the same pipeline may retire in the same cycle. This is fully supported by the Mill.

2.2.8 Operand and Element Metadata

Every belt position and scratchpad cell contains a single operand and the associated metadata. The metadata is extra information, which is carried along with the operand while it is within the Mill core. Operands in memory do not have metadata. When storing the data to memory the metadata is stripped and when loading operands from memory the load operation itself defines the metadata associated with the operand.

An operand is composed of a set of elements and each element has further additional metadata bits. The operand metadata defines the width and number of elements within the operand. Therefore, an operand with only a single element is the standard scalar value, though it can be of any of the sizes supported by the specific Mill implementation. An operand with multiple elements is a vector and each element within a particular vector is the same width.

The metadata for each element can include a number of pieces of information. Some of this metadata are the IEEE754 floating-point status bits: divide by zero, inexact, invalid, underflow, and overflow. Any floating-point operation sets these flags as appropriate, and the properties of these metadata are discussed later when talking about the instruction set.

Similarly, every element contains a NaR metadata bit. NaR stands for Not a Result and indicates the value of that element is invalid for some reason. The None metadata bit is also available and indicates that there is no data for that element.

There exist operations that can modify the width and scalarity in the commonly semantically useful ways that are described in detail when discussing the instruction set.
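
A rough C sketch of what an operand and its metadata, as described above, might look like inside the core; the field names, widths, and the maximum vector length are illustrative assumptions rather than any actual hardware format.

    #include <stdint.h>

    /* Per-element flags carried inside the core (illustrative layout). */
    enum {
        ELEM_NAR       = 1 << 0,  /* Not a Result: the element is invalid     */
        ELEM_NONE      = 1 << 1,  /* None: there is no data for this element  */
        FP_DIVIDE_ZERO = 1 << 2,  /* IEEE754 status bits set by FP operations */
        FP_INEXACT     = 1 << 3,
        FP_INVALID     = 1 << 4,
        FP_UNDERFLOW   = 1 << 5,
        FP_OVERFLOW    = 1 << 6,
    };

    typedef struct {
        uint64_t bits;   /* raw element payload, up to the member's max width */
        uint8_t  flags;  /* ELEM_* / FP_* metadata bits for this element      */
    } Element;

    /* An operand is one or more elements of equal width.  A single-element
     * operand is a scalar; a multi-element operand is a vector. */
    typedef struct {
        uint8_t width_bytes;   /* width of each element                     */
        uint8_t elem_count;    /* 1 for a scalar, more than 1 for a vector  */
        Element elems[8];      /* 8 is an assumed maximum vector length     */
    } Operand;

A store to memory would keep only the bits, while a load would supply fresh metadata taken from the load operation's own width and scalarity arguments.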

2.3 Memory Hierarchy

The Mill takes a non-conventional approach towards memory access and caching in order to minimize apparent latency and reduce the necessity to completely stall the CPU while waiting on memory. Avoiding stalls during low latency memory operations is more important on the Mill than on a conventional out-of-order CPU because, being in-order and statically scheduled, when one operation needs to wait for a memory access to complete it necessarily blocks all further operations from being performed. An out-of-order machine can continue to process other concurrent operation paths while some small number of operations are blocked for a handful of cycles waiting on memory. Doing the same on the Mill would break the static scheduling model, where the compiler must know all the operation latencies ahead of time, and add variability to the exposed pipeline. Consequently, if the Mill is forced to stall due to memory delays it will do so in a transparent manner: those delayed cycles will appear, to the code stream, as if they never happened.

For the purpose of discussion in this section the following terminology will be used:

D$1

Data Cache Level 1. The first level data cache has an access latency of three cycles and is relatively small. As with any memory, increasing the size results in an unavoidable increase in latency. Many D$1 (also known as an L1 cache) caches are on the order of 64KB in size.

D$2

Data Cache Level 2. The second level data cache has an access latency of ten cycles and is larger than the D$1 cache, but still of finite size. Many D$2 (also known as L2) caches are on the order of 256KB in size.

DRAM

DRAM, also known as main memory, is the large primary memory of the system. It has a latency on the order of three hundred cycles and is of a mostly arbitrary size. Common DRAM sizes are several gigabytes in size.

These components form the memory hierarchy. Each cache is a lower-latency subset of the next larger component of the hierarchy, and retrieval proceeds seamlessly, apart from the access latency, down the hierarchy until the requested data is found. Thus, if a piece of data is requested by the CPU, first the D$1 is checked; if it isn't found there then the D$2 is checked; and if it isn't found there either, DRAM is finally checked for the necessary data.
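
The walk down the hierarchy can be summarised with a trivial sketch; the cycle counts are the approximate figures quoted above, and the hit flags stand in for real tag lookups.

    #include <stdbool.h>
    #include <stdio.h>

    /* Approximate cycles to satisfy a load, given where the data is found. */
    static int load_latency(bool d1_hit, bool d2_hit) {
        if (d1_hit) return 3;     /* D$1: small, on the order of 64KB   */
        if (d2_hit) return 10;    /* D$2: larger, on the order of 256KB */
        return 300;               /* DRAM: main memory                  */
    }

    int main(void) {
        printf("%d %d %d\n",
               load_latency(true,  false),   /* 3   */
               load_latency(false, true),    /* 10  */
               load_latency(false, false));  /* 300 */
        return 0;
    }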

2.3.1 Memory Model

The solid lines represent data flows, for example, program memory reads. The dotted lines represent concurrent control processing, for example, protection checks or virtual to physical address translations. The dashed line indicates which parts of the memory hierarchy are physically addressed versus virtually addressed.

In this diagram we see several units, some of which are common with conventional architectures, some of which are in unusual places and some of which are not commonly found in conventional architectures.

dPLB

The dPLB is the data Protection Lookaside Buffer. It operates in parallel with data cache accesses to determine if the currently running thread has the appropriate permissions for the access being performed. If the access exceeds its permissions then the access is invalidated and any results return the appropriate metadata laden error values.

iPLB

The iPLB is similar to the dPLB except that it checks instruction accesses and will raise a fault if code for which insufficient permissions have been granted to the thread is executed.

TLB

The TLB on the Mill is in a different location when compared with conventional architectures. Additionally, the Mill TLB is only concerned with translation and not protection. On a conventional architecture the TLB is responsible for both translation and protection checking.

eI$0, eI$1, fI$0 and fI$1

These caches are part of the split instruction decoding described below. e stands for Exu and f stands for Flow in this case, though the distinction is less clear cut than during early Mill development and they are described as left and right decoders below.

As can be seen in the diagram above, the Mill is slightly unusual for mainstream CPUs in that it has a Single Virtual Address Space and further that the caches are all virtually addressed.

Doing this has several advantages over the traditional model. The traditional model has the TLB inline with the caches, such that the caches are physically addressed, because historically address spaces were 32 bits and this was insufficient to share amongst the many processes on a system. Placing the TLB in the critical path limits its size, causing more TLB misses than might otherwise be necessary.

The Mill, with its single 60-bit address space, no longer has the restriction that multiple programs must alias virtual addresses to fit within a single machine. Consequently, the TLB can be moved below all the caches. Since any access that reaches the TLB will, by definition, be going to DRAM there will already be 300 cycles or more of latency so adding another dozen cycles for a large TLB is no great loss and allows the TLB to be more power efficient and have many fewer misses.

Additionally, the shared address space makes sharing data between different processes and the OS simpler. A pointer always points to the same memory and requires no complex translation. Protection is also simplified, because there is only a simple table of permissions per protection domain, rather than a table that can only be consulted after address translation has been done.

This allows protection to be a small, fast unit. On the Mill the Protection Lookaside Buffer operates in parallel with the cache to provide this capability. If protection fails the load is invalidated and a fault produced.

2.3.2 Backless Memory

Another advantage of placing the TLB below the cache is that the CPU can allocate cache lines for virtual addresses without a corresponding entry in the TLB. A traditional CPU, with the TLB between the CPU and the cache, must take a TLB miss and fault to the operating system to allocate a page when there doesn't exist a physical mapping for a virtual address. On the Mill, physical mappings are not allocated until the cache line is flushed to DRAM. This makes it possible for some memory to be allocated, used, and freed without the OS ever allocating physical memory and without any trips to DRAM to fetch the original contents.

Stores to such memory will, of course, allocate a new cache line in which the stored bytes are marked valid. The remaining invalid bytes are still defined as whatever is in DRAM for that virtual address. However, if there is no TLB or page table entry for that address, the TLB returns implicit zero values without accessing DRAM or faulting to the OS to allocate pages: memory that is not backed by DRAM is defined to be implicitly zero. This zeroing does not touch DRAM at all, but merely returns zeroes to the retire station, which fills in the appropriate bytes of the cache line and marks those bytes as valid.

If and when the backless memory is evicted from cache to DRAM, only then is physical memory allocated. This is done without OS intervention. The hardware allocates the page from a pre-allocated block of page table entries for pages that are a single cache line in size. These entries are provided in advance by a background OS process. This same process consolidates pages if possible. The Mill uses hierarchical page tables as is normal, though the granularity of pages goes down to a single cache line, much smaller than the standard page sizes of multiple kilobytes.

2.3.3 Reducing Memory Stores Through Implicit Zero

Some significant amount of memory bandwidth and CPU cycles are spent zeroing memory. For security reasons this must be done whenever new memory is allocated and should be done for the stack as stack frames are allocated, even if that memory previously had been allocated for an earlier stack frame in the same program.

The Mill program data stack is allocated in multiples of cache lines, and each stack cache line has a flag bit which declares whether that line has been stored to since the stack frame was allocated. If this IZ bit indicates that no store has taken place, then any load from that cache line will always return zero. This implicit zero prevents security leaks caused by reading leftover stack values from previous calls, and requires no CPU cycles or memory bandwidth to be spent zeroing that memory. When a store occurs to a stack cache line whose IZ bit indicates a pristine state, the store completes, all the other bytes of the cache line are zeroed, the IZ bit is flipped, and the memory thereafter operates as normal. Upon exiting a stack frame, usually by returning from the call, the stack frame cache lines are discarded. Thus, if the stack frame hadn't been evicted from cache into memory during the entire duration of the call, the data stack would have been in backless memory and required no OS intervention and no DRAM access.

Cache lines which have the IZ bit for that line indicating an implicit zero state are termed unrealized. When a cache line ceases to be unrealized it becomes realized and has real zeroes stored in the appropriate memory. The IZ bit mask, having one bit for each line, is of finite size. When a stack line is rotated out of the range of the IZ it is realized. Similarly, newly allocated stack lines beyond the range of the IZ are realized at allocation time.
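
A small C model of the IZ behaviour for a single stack cache line may help; the 64-byte line size and the exact point at which the bit flips are assumptions made for illustration.

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 64   /* assumed cache line size in bytes */

    typedef struct {
        uint8_t bytes[LINE_SIZE];
        int     iz;        /* 1 = unrealized: no store since frame allocation */
    } StackLine;

    static void alloc_line(StackLine *l) {
        l->iz = 1;         /* newly allocated stack lines start pristine */
    }

    /* Loads from an unrealized line always return zero, with no DRAM access. */
    static uint8_t load_byte(const StackLine *l, int off) {
        return l->iz ? 0 : l->bytes[off];
    }

    /* The first store realizes the line: the stored byte is written and the
     * rest of the line is zeroed, after which it behaves as normal memory. */
    static void store_byte(StackLine *l, int off, uint8_t v) {
        if (l->iz) {
            memset(l->bytes, 0, LINE_SIZE);
            l->iz = 0;
        }
        l->bytes[off] = v;
    }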

The IZ is a core-private detail and is not saved out during a task switch. Consequently, a task switch causes realization of all stack lines, and in multicore systems a CPU may choose to realize a line whenever it has cause to believe the address of that line may have been leaked to other cores.

Given the implicit zero semantics, a compiler is free to elide any program initialization which would otherwise set stack values to zero. As zero is a common initialization value this presents an efficiency gain.

2.3.4 Memory Consistency Guarantees

The Mill provides simple and easy to understand memory consistency guarantees. Quite simply, there is a direct mapping from the order of operations defined in the program, to the order of operations issued, to the order seen by the load/store functional units, and so on down the memory hierarchy. This sequential consistency is maintained through all levels. There is no hardware memory reordering. Consequently, there is no need for memory barrier operations and they are not available on the Mill.

This simplicity of consistency guarantee makes the whole class of bugs caused by hardware memory reordering impossible.

2.3.5 Load Latency Causes

Given the variable latency of memory accesses depending on where in the hierarchy the data is to be found, the only way to hide the latency of memory loads is to request the load as early as possible and perform other work while waiting for the load to complete.

The traditional out-of-order approach to this problem uses dynamic scheduling implemented within the CPU itself to track which operations have their data ready and can be executed. As soon as the address of a load has been determined the load is issued, and all dependent operations are stalled until the result appears. In some instances it is possible to issue a load even before the final address is computed, but this prefetching is only possible when an address is known, even if it might be down a branch that is in the end not taken. This hardware approach has limitations on the number of instructions that can be followed at once, and therefore on the ability of the CPU to find non-dependent work to perform while loads are in progress.

The traditional static scheduling approach to hiding load latency is to schedule loads as if they were to always hit the expected cache level, often the L1. Since the scheduling is done at compile time and not runtime, there are greater resources to find good scheduling to hide latencies. However, given the optimistic assumption of load latencies, always assuming an L1 hit, there is a limited ability to have a large number of loads concurrently outstanding and any situation where there was an L1 miss will stall the entire CPU until the load completes.

Load latency really covers several distinct situations which differ enough in their particulars that different strategies may be employed to ameliorate the issue.

2.3.5.1 Cache Overflow

If the working set is larger than the caches of the CPU and accessed randomly, then there are no CPU architecture level solutions. A solution must come from the software, perhaps in the form of data structures which exhibit better cache locality.

2.3.5.2 Data Dependent Load

If the series of loads is inherently serialized, that is the address of the next load cannot be determined without data from the previous load, then there is nothing to be done at the CPU architecture level. This is the typical linked list case where the address of the next element in the list cannot be determined until the current element has been loaded.

2.3.5.3 Multiple Independent Loads

If a program wants to load multiple values from memory to be used at approximately the same time, these loads are independent, and it is a better use of resources to load as many of the values as possible concurrently instead of loading each value one at a time. Concurrently loading four values from DRAM, for example, would consume 300 cycles, whereas loading the same four values serially would consume 1200 cycles.

2.3.5.4 Cache Misses

As previously explained, it is normal under static scheduling for the compiler to assume that every load will be satisfied from the L1 cache. To do otherwise would waste registers, or register equivalents, on the CPU, reducing performance when loads can in fact be serviced from the L1 cache. This assumption presents an issue when the data is not found in the L1 cache. If the data is found in the L2 cache instead, then there is a relatively small difference between the expected and realized latency: ten cycles rather than the assumed three. During these seven cycles it is likely that the CPU could find sufficient work to keep busy. This is most likely not the case if both the L1 and L2 cache miss and the load is serviced from DRAM with roughly 297 cycles of additional latency, though some of those cycles could be filled with work.

2.3.6 Latency Hiding

One great advantage of out-of-order CPUs over traditional statically scheduled CPUs is their ability to hide minor amounts of memory load latency, such as would be seen from an L1 miss that hits in L2. The out-of-order CPU can find work to perform during the wait if there are independent operations, while the statically scheduled CPU must stall because the operation stream was produced with a fixed load latency in mind and the CPU must enforce that. The Mill introduces capabilities which provide mechanisms for the compiler to increase the range of acceptable load latencies without needing any operation stream changes. It also provides the compiler with architectural guarantees about loads and stores beyond what is normally available to a compiler, but in line with what is available to the hardware of an out-of-order CPU at runtime. As such, it effectively moves the out-of-order logic off the CPU and into the compiler, where it has no ongoing power or space costs.

2.3.6.1 Deferred Loads

The way in which the Mill allows the compiler greater flexibility in acceptable load latencies is with deferred loads. A deferred load is a load where the code stream declares how many cycles, with respect to the current call frame, from the issue of the load it expects the result. Suppose there is a load with a compiler-determined delay of ten cycles. If the load is satisfied earlier, say it was found in the L1, then the result is stored off the belt within the CPU until the tenth cycle has finished, at which point it is placed on the belt. If instead satisfying the load came from L2 and took exactly ten cycles, there would be no additional waiting. During the interval between the load issue and the declared load retirement the compiler is free to schedule other operations to be performed. If, instead, the load must be satisfied by DRAM, then ten cycles of operations would be performed before the CPU stalls until the load is satisfied and can be retired.
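
A toy model of a deferred load, in C: the load is issued with a declared retire cycle, an early response from memory is buffered off the belt, and the core stalls only if memory is still outstanding when the declared cycle arrives. The structure and the example latencies are illustrative only.

    #include <stdio.h>

    typedef struct {
        int issue_cycle;     /* cycle the load operation was issued        */
        int declared_delay;  /* deferral chosen by the compiler, in cycles */
        int memory_latency;  /* actual cycles memory takes (3, 10, ~300)   */
    } DeferredLoad;

    /* Returns the cycle at which the result actually appears on the belt,
     * and reports how many cycles (if any) the core had to stall. */
    static int retire_cycle(const DeferredLoad *ld, int *stall) {
        int ready    = ld->issue_cycle + ld->memory_latency;   /* data available */
        int declared = ld->issue_cycle + ld->declared_delay;   /* scheduled drop */
        *stall = ready > declared ? ready - declared : 0;
        return declared + *stall;
    }

    int main(void) {
        int stall;
        DeferredLoad l1_hit = {0, 10, 3};    /* early result is buffered      */
        DeferredLoad l2_hit = {0, 10, 10};   /* arrives exactly on time       */
        DeferredLoad dram   = {0, 10, 300};  /* core stalls for 290 cycles    */
        printf("%d\n", retire_cycle(&l1_hit, &stall)); /* 10, stall 0   */
        printf("%d\n", retire_cycle(&l2_hit, &stall)); /* 10, stall 0   */
        printf("%d\n", retire_cycle(&dram,   &stall)); /* 300, stall 290 */
        return 0;
    }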

The performance characteristics of this are superior to the traditional statically scheduled case because allowances for L2 hits can be made at compile time. In many cases performance is similar to the out-of-order case, as there are a small, finite number of operations which can be performed while waiting for load results; enough to fill the time until an L2 hit, but usually not enough to fill the time until a DRAM fetch. The compiler will be able to see a significant number of these, whereas the out-of-order hardware has a limit to the depth into the future instruction stream it is able to track, and consequently to how much it can execute out of order to hide the latency.

2.3.6.2 Alias Immunity

Though the compiler may be able to see many possible operations to fill out the time before a load can be satisfied, it is limited in how far before the consumer of that load it is able to schedule the load, due to memory aliasing. Memory aliasing is the situation where two or more variables within a program point to the same memory location. In most cases this is only a theoretical possibility and at runtime there is no aliasing, but it is often impossible for the compiler to prove the lack of aliasing for any particular code.

In traditional CPUs the value returned by a load is defined as a snapshot of memory at the time the load is issued. On the Mill the value returned by a load is defined to be a snapshot of memory at the time the load retires. Consequently, it is safe to hoist the issue of a load above a store which happens to write to the same location: the store's effect will be reflected in the value when the load retires. This alias immunity provides the compiler with memory consistency guarantees equal in power to those available at runtime to out-of-order hardware in avoiding aliasing problems. Consequently, the compiler is able to move loads earlier in the instruction stream as is convenient, often to as soon as the address has been determined.

2.3.6.3 Loads Across Variable Latency Control Flow

Deferred loads handily provide a facility to turn what would otherwise be variable latency loads into fixed latency loads. As long as the intervening instructions are also of fixed latency everything works fine. However, if there is some control flow between the load issue and retirement such that one path has a higher latency than the other then it would not be possible to issue the load until after the flow control branch has been determined, which would limit the ability of the compiler to hide the latency of loads.

The Mill provides a second type of load to solve this problem. A pickup load is very much like a deferred load except that, where a deferred load places its result onto the belt after the designated latency, a pickup load is given a program-chosen identifier and does not place its result on the belt until a subsequent pickup operation is issued with that identifier. In this way the load can be issued before the flow control and the result placed onto the belt after the flow control has occurred, even if the flow control has a different latency down each path. This returns the ability to lift loads back to the point that the address is available to the compiler, even before complex flow control.
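
A sketch of the pickup mechanism, extending the toy model above; the tag table, its size, and the function names are invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_PENDING 8   /* assumed number of outstanding pickup loads */

    typedef struct {
        uint32_t tag;       /* program-chosen identifier                  */
        int64_t  value;     /* buffered result, held off the belt         */
        bool     ready;     /* has memory responded yet?                  */
        bool     in_use;
    } PickupSlot;

    static PickupSlot pending[MAX_PENDING];

    /* issue side: record the tag; memory fills in value/ready later. */
    static void issue_pickup_load(uint32_t tag) {
        for (int i = 0; i < MAX_PENDING; i++) {
            if (!pending[i].in_use) {
                pending[i] = (PickupSlot){ .tag = tag, .in_use = true };
                return;
            }
        }
    }

    /* pickup: drop the tagged result onto the belt (returned here), stalling
     * only if memory has not yet responded for that tag. */
    static int64_t pickup(uint32_t tag, bool *stall) {
        for (int i = 0; i < MAX_PENDING; i++) {
            if (pending[i].in_use && pending[i].tag == tag) {
                *stall = !pending[i].ready;   /* stall until memory responds */
                pending[i].in_use = false;
                return pending[i].value;
            }
        }
        *stall = false;
        return 0;                             /* no such tag: illustrative only */
    }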

2.3.6.4 Limitations As Compared to Dynamic Scheduling

The Mill provides mechanisms to statically schedule most loads early enough that no stall is expected as long as the data is found in the cache hierarchy and not DRAM. However, this only covers the most common cases. There exist situations where it is not possible to optimally schedule, in a static way, multiple dependent loads, even if there would otherwise be sufficient work available to fill the time. These cascading dependent loads are rare in practice, but they provide situations where an out-of-order CPU would be able to dynamically schedule more optimally than the compiler.

2.3.6.5 Advantages As Compared to Dynamic Scheduling

Conversely, there are many situations where the compiler, with its greater visibility into the code stream, is able to produce a better schedule than a dynamic scheduler. This is simply because when scheduling dynamically the hardware which does the scheduling has finite resources and a finite time budget. In addition, the code stream may be laid out such that loads come just before their consumers, which requires that any dynamic scheduling to fill the latency occur within a resource-restricted window around those two operations. The compiler is able to take a holistic look at the entire instruction stream of an EBB or function and schedule accordingly. Consequently, the compiler effectively has an infinite window of visibility into the code stream, where the dynamic scheduler of an out-of-order CPU has a finite window of relatively small size.

A larger window will generally result in a more optimal schedule, irrespective of whether that schedule was determined statically or dynamically.

2.3.7 Cache Data Validity

If a store is executed to a location that is not available in the cache, a store miss, there are two traditional solutions. The first, write-through, is to write the new value into DRAM directly. The second, write-back, is to fetch the cache line from DRAM and then update it when it arrives. In either case, significant hardware logic is involved to track the snooping and merging with other stores while waiting for DRAM to respond, in order to ensure memory consistency.

On the Mill, in contrast, every byte of the cache has a bit that declares whether the data in that byte is valid or not. Bytes are declared valid if they are stored to or come up from DRAM. Newly allocated cache lines, such as if an address with no existing cache line was stored into L1, have every byte marked as invalid.

When a store is performed, it writes its data into a cache line in the top level, L1, cache. When a load is issued, the start address and a request mask are passed to the memory hierarchy. The request mask is merely a mask of the bytes, starting from the address, which still need to be satisfied. For example, if part of the request can be satisfied from L1 then those bytes will be returned to the retire station and the rest of the request sent to the next level cache. When the entirety of the request has been satisfied the request is terminated and the retire station is able to retire the load. When a request is terminated it does not go down any further in the memory hierarchy.

As memory requests are made which must be satisfied between multiple levels of the cache hierarchy cache line merging is performed. This is done simply by copying the cache line data up one level in the hierarchy and merging the valid bytes. If both cache lines have valid data for the same byte then the highest level cache wins. That is, if both L1 and L2 claim to have valid data for the same byte then the value in L1 will be the result of the merge.

One major advantage of this system is that stores will never miss in the L1 cache. If there is no existing cache line for the stored address a new one is allocated in the L1 and the store is complete. Merging will occur lazily as loads are requested or as cache lines are evicted. This means that a significant amount of hardware related to buffering stores can be eliminated.
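
The request-mask and merge behaviour can be sketched as follows; the 64-byte line size and the data structures are assumptions, but the merge rule (the higher, closer level wins for bytes both levels claim valid) follows the description above.

    #include <stdint.h>

    #define LINE_SIZE 64   /* assumed cache line size */

    typedef struct {
        uint8_t  data[LINE_SIZE];
        uint64_t valid;    /* one bit per byte: is this byte's data valid? */
    } CacheLine;

    /* Satisfy the requested bytes from one level and return the mask of
     * bytes that still need to be fetched from the next level down. */
    static uint64_t satisfy(const CacheLine *line, uint64_t request,
                            uint8_t out[LINE_SIZE]) {
        uint64_t served = request & line->valid;
        for (int i = 0; i < LINE_SIZE; i++)
            if (served & (1ull << i))
                out[i] = line->data[i];
        return request & ~served;    /* remainder goes to the next level */
    }

    /* Merging a line up one level: the higher level's valid bytes win. */
    static void merge_up(CacheLine *upper, const CacheLine *lower) {
        for (int i = 0; i < LINE_SIZE; i++)
            if (!(upper->valid & (1ull << i)) && (lower->valid & (1ull << i)))
                upper->data[i] = lower->data[i];
        upper->valid |= lower->valid;
    }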

2.4 Security Model

The Mill security model is based on memory addresses and visibility into those addresses.

2.4.1 Turfs

A collection of regions, described below, is termed a turf. Each turf has a globally unique, hardware-managed identifier. A turf can be thought of as a process protection region; in a traditional architecture, this might be the equivalent of the process private address space.

2.4.2 Threads

A thread is a thread of execution similar to Software/Thread except that where a traditional thread can only ever be run within a single address space, Mill threads may only be run in a single turf at any one time, but the turf is changeable. Note that multiple threads may be running in the same turf at any one time, much as a process may contain many threads.

The current turf ID of the currently executing thread on the CPU is maintained in a CPU register. Each thread also has a globally unique, hardware-managed identifier.

2.4.3 Regions

All Mill protections occur through the hardware concept of regions. A region descriptor is a data structure managed by the operating system and cached in the Protection Lookaside Buffers for performance. A region is described by the following components:

LWB

Lower bound of the address region based in the virtual address space

UPB

Upper bound of the address region based in the virtual address space

Rights

The access rights of the region. These rights include the standard read, write, and execute permissions as well as the portal permission

Turf ID

The identifier of the turf this region belongs to. Only accesses matching the turf ID will succeed with the permissions granted by this region. However, there exists a wildcard turf ID, which may be used here, and will allow any thread with a matching thread ID access to this region and whichever rights it allows.

Thread ID

The thread identifier indicates which thread may access this region. As with the turf ID this may be wildcarded to allow any thread with a matching turf ID to access this region.

Regions have byte granularity and need not be backed by physical memory. Access rights to a region of memory are directly granted between threads of execution out of the accessible regions of the originating thread. Of course, multiple regions may overlap with each other, in which case the union of the rights will be in effect.
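
A hedged sketch of a region descriptor and the corresponding permission check as a C structure; the field widths and the wildcard encoding are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    enum {
        RIGHT_READ    = 1 << 0,
        RIGHT_WRITE   = 1 << 1,
        RIGHT_EXECUTE = 1 << 2,
        RIGHT_PORTAL  = 1 << 3,
    };

    #define WILDCARD UINT32_MAX   /* assumed encoding of the wildcard ID */

    typedef struct {
        uint64_t lwb;        /* lower bound, virtual address        */
        uint64_t upb;        /* upper bound, virtual address        */
        uint32_t rights;     /* RIGHT_* bits granted by this region */
        uint32_t turf_id;    /* owning turf, or WILDCARD            */
        uint32_t thread_id;  /* owning thread, or WILDCARD          */
    } Region;

    /* Does this region grant the wanted rights for addr to a thread
     * currently running in a given turf? */
    static bool region_allows(const Region *r, uint64_t addr, uint32_t want,
                              uint32_t turf, uint32_t thread) {
        if (addr < r->lwb || addr >= r->upb) return false;
        if (r->turf_id   != WILDCARD && r->turf_id   != turf)   return false;
        if (r->thread_id != WILDCARD && r->thread_id != thread) return false;
        return (r->rights & want) == want;
    }

Overlapping regions would each be checked in turn, the effective rights being the union of every matching grant.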

On the Mill there are no privileged modes as all permissions are based upon region visibility. Consequently, the Mill boots by first creating a single thread which has access to a single turf which contains the entirety of the sixty-bit address space.

New regions can be created from old regions using the grant operation. This is a hardware operation which takes an existing region and produces a new region with a subset of the address range, a subset of the existing access rights, a new turf ID, and a new thread ID. The new regions are inserted directly into the PLB with a novel bit set. When entries with the novel bit are evicted from the PLB they are flushed to the OS-maintained region table. The region table uses an Augmented Interval Tree, and whether evicted PLB entries are inserted into the table by the OS or directly by the CPU depends on the family member. Regions which are pulled into the PLB from the region table do not have the novel bit set and will be discarded on eviction without any further writeback.

Regions are deleted using the revoke operation. If the region exists only in the PLB, that is the novel bit is set, then the region is simply discarded from the PLB. This feature makes transient regions quite cheap as they are never pushed to the OS and no OS intervention is necessary. Regions that are revoked but exist in the region table are deleted lazily using a similar mechanism to the way newly created regions are evicted to the region table.

Though it would seem that regions provide the ability to implement a capability system, the performance characteristics of regions on the Mill preclude that. Regions are intended to represent entire segments of programs and not finer-grained elements such as objects.

2.4.4 Well Known Regions

In typical programs the majority of memory accesses happen to common regions, such as the code segment. Though it is possible, it would be wasteful to use the fully generic PLB capabilities for these regions, so the Mill provides the concept of well-known regions to increase performance and decrease power consumption in these common cases.

Every turf has three well known regions, each with the expected permissions. The code region is executable. The data region is readable and writable and contains both the initialized data and the uninitialized data. The constant region is read only. These regions are not contained in the PLB but instead are contained within CPU registers.

Every thread has two well-known regions: the stack and thread local storage. These are also contained in hardware registers.

The stack region behaves slightly unusually in that it is dynamically adjusted to only contain the valid area of the stack. That is, it expands during calls and contracts upon return. Though there may be additional stack available before hitting the upper limit of the stack that excess memory space is not included in the stack region and so is inaccessible. As previously discussed, new stack frames are initialized to zero on instantiation and data leakage from the stack is therefore eliminated.

2.4.5 Separate Call-Return Stack Data

On the Mill, return addresses from calls are not stored beside the application data on the stack. Instead those addresses, amongst other CPU state, are handled by the spiller, which has its own regions. This separation of application data from the call-return state makes Return Oriented Programming impossible. The regions the spiller uses are not, generally, made available to the running application itself and are therefore safe from tampering.

In order to retrieve the backtrace for a series of frames, some thread with read permissions on the data side of the stack frames must call into the Trace Service. This service will return the call stack to the caller as long as the caller has the correct permissions.

2.4.6 Services and Portals

Portals are special memory regions on the Mill that change the turf upon calling through them. They are used primarily to implement services, which are callable pieces of code that have no associated thread, but do have a separate turf ID. Portals are called like any function, by address, and represent an execution context, even though they have no thread.

A portal is a single cache line in memory with all the information necessary to change the current thread into the portal context.

Entry

The address of the function to call once through the portal.

Turf ID

The turf ID to move the thread into when calling through the portal. Essentially, this is the turf of the service and the service also has access to any thread accessible regions.

data

The data well known region of the service.

code

The code well known region of the service.

pool

The constant pool well known region of the service.

This is accomplished by giving the entire cache line the portal access permission, which signals to the instruction processing part of the CPU that this is a portal and not normal instructions.
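
The portal contents listed above can be pictured as a cache-line-sized structure along the following lines; the layout and field widths are purely illustrative.

    #include <stdint.h>

    typedef struct { uint64_t base, length; } WellKnownRegion; /* illustrative */

    /* One portal: a cache-line-sized descriptor, marked with the portal
     * permission, that switches the calling thread into the service's turf. */
    typedef struct {
        uint64_t        entry;    /* address of the function to call        */
        uint32_t        turf_id;  /* turf to run in for the call's duration */
        WellKnownRegion data;     /* the service's data region              */
        WellKnownRegion code;     /* the service's code region              */
        WellKnownRegion pool;     /* the service's constant pool region     */
    } Portal;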

A portal call is not a task switch. It is closer to a dynamic library call on a traditional architecture than a process switch. The service still has access to some of the memory of the caller, but is unaware and incapable of accessing the rest. This includes the stack frames of the caller. Abstractly, the stack frames keep piling on top of each other with some being owned by the application and some by the service.

Unfortunately, there are two major problems with a naive implementation matching the abstract view. Firstly, such an implementation could result in a very fragmented stack if the application and service ended up calling back and forth into each other up the stack. This fragmentation would require many PLB entries for small memory regions, which would lead to poor performance.

Secondly, there are various possible faults related to the stack, which may result in the service, through no fault of its own, receiving a fault such as a stack overflow. Handling such a fault in the service code at any point would be burdensome.

Instead, the logical stack is composed of stacklets, where each stacklet is one link in the chain of the stack. Practically, these stacklets are represented by a single stack well-known region for each service and application, which is extended when the corresponding service makes a call. Stacklet allocation is done using a computed address within the virtual address space, using turf ID and thread ID as indices into a two-dimensional array. One sixteenth of the total addressable memory is reserved for stacklets at these computed addresses.

In all other respects with regard to allocation stacklets operate as previously described. That is, stack frames are implicitly zeroed when allocated and backless memory is used where possible to avoid allocating physical memory for transient stack frames.

Though the precise number of turfs and threads and the size of the stacklets are member dependent, a typical implementation may reserve twenty-two bits for both turf ID and thread ID. Given this and the one-sixteenth address space reservation, that implies a stacklet size of four kilobytes: the reserved 2^56 bytes divided amongst 2^22 × 2^22 stacklets leaves 2^12 bytes each.
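
The arithmetic behind that figure, and the corresponding computed stacklet address, can be sketched as follows; the bit positions are derived from the example numbers above and are not a defined layout.

    #include <stdint.h>
    #include <stdio.h>

    #define ADDR_BITS      60   /* single virtual address space   */
    #define TURF_ID_BITS   22   /* example figures from the text  */
    #define THREAD_ID_BITS 22

    int main(void) {
        /* One sixteenth of the address space is reserved for stacklets. */
        uint64_t reserved  = 1ull << (ADDR_BITS - 4);                 /* 2^56 bytes */
        uint64_t stacklets = 1ull << (TURF_ID_BITS + THREAD_ID_BITS); /* 2^44       */
        uint64_t stacklet_size = reserved / stacklets;                /* 2^12 = 4KB */

        /* A stacklet base address can then be computed directly from the
         * turf and thread IDs, with no allocation step (illustrative only). */
        uint32_t turf = 7, thread = 42;
        uint64_t base = (0xFull << (ADDR_BITS - 4))   /* assumed reserved slice */
                      + ((uint64_t)turf << (THREAD_ID_BITS + 12))
                      + ((uint64_t)thread << 12);

        printf("stacklet size = %llu bytes\n", (unsigned long long)stacklet_size);
        printf("stacklet base = 0x%llx\n", (unsigned long long)base);
        return 0;
    }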

The metadata for each stacklet is stored as a cache-line-sized object, also at a computed address. This info block is biased such that the unused state is all zeroes and, therefore, the metadata block for an unused stacklet requires no physical memory backing it and no explicit initialization. Contained within this info block is information such as the top-of-stack offset, the base address, and the limit address of the stacklet.

As both the portal cache line and the stacklet info block must be fetched, the cost of a portal call is two cache requests and the single cycle call operation latency.

As calling through a portal changes the turf ID of the currently running thread a service does not have implicit access to the memory of the application. If memory is to be passed to a service, the pass operation must be used to explicitly allow access. The pass operation works by inserting a transient region that is accessible by the thread ID. This region is deleted when the next called function uses the return operation. That is, the region is only valid for the lifetime of the next call and any nested calls within that.

In cases such as varargs or excess arguments there may be a need to implicitly give the callee access to memory regions of the caller. The args operation reserves a portion of the top of the stack as a place for implicit arguments. This is accomplished by setting the caller's outp register to the designated size below the top of stack. During the call this offset is transferred to the callee's inp register and a new transient region is created between the value of inp and the caller's value of stackp. That is, an implicit region for the designated number of bytes at the top of the stack is passed.

2.5 Instruction Set

2.5.1 Encoding

Instruction encoding is a significant determinant of maximum performance. If a CPU is unable to decode instructions at a sufficient rate it will be unable to keep all its functional units fed. In feeding them there are two competing factors that must be addressed. Firstly, there is finite bandwidth and instruction cache available. It is expensive to increase memory bandwidth between the various caches. Similarly, increasing the cache size has significant costs in power and latency; larger caches require greater time to access. The best use of bandwidth and cache comes from variable-width instruction encodings which have higher entropy, that is, a higher density encoding.

On the other hand, dense instruction encodings require complex logic to decode. Lower density encodings are easier to decode and can therefore be made faster to decode. Fixed-width encodings, where every instruction has the same width and layout, are faster to decode than variable-length encodings.

Early high-end Mills were proposed which would be able to retire up to 33 operations per cycle. Given this high number of operations there is insufficient memory bandwidth and cache size to use a fixed-width encoding. Instead a variable-width encoding is mandated in order to be able to feed all the functional units. Additionally, the belt being the implicit destination of all operations also increases the density of the encoding by eliminating destination information.

2.5.1.1 Physical Layout

However, with variable-length instructions it is more difficult to decode one cache line worth of instructions, because the instruction boundaries are not regular. This problem is generally considered to be polynomial in nature. Given the nature of polynomial algorithms, performing the algorithm twice on inputs of half the length has a smaller running time than performing the algorithm once on the full input. This is what the Mill does, with two program counters each driving one decoder. Each decoder decodes one physical stream of instructions from memory, and together the two physical streams represent the single logical instruction stream of the program. Each half of the logical instruction stream provides a single bundle (cache line sized chunk) of instructions per cycle.

Splitting instruction decode this way allows the decoding hardware to be doubled relatively cheaply compared to conventional designs. Especially interesting is the ability to double the instruction cache size, since each decoder may have their own instruction cache, without requiring all the polynomial or exponential costs normally associated with doubling cache sizes.

Splitting decoding in this way does have one issue in how you define jump targets and keep both decoders at the same place within the logical instruction stream. The naïve method would be to have two instruction addresses, requiring a doubled size of function pointers within the system. This is inconvenient. The Mill takes the Extended Basic Block concept normally associated with compilers and higher-level programming languages and makes it a core feature of the CPU architecture. An EBB is a block of code with a single entry point and one or more exit points and represents the only valid jump destinations on the Mill. The logical instruction stream is broken up into two physical instruction streams by using the middle point of the EBB in memory as the entry point. From that point, one physical instruction stream decodes upwards in memory from that point as is conventional and the other physical instruction stream decodes downwards in memory from that point. Thus, the problem of how to specify a single concise entry point is solved by using a single address, which is read in both directions to specify the logical start of the two physical instruction streams.

The logical progression of operations executed in a single cycle (an instruction) is broken up into two bundles. The left bundle of each instruction becomes part of the left stream and the right bundle part of the right stream. The left stream is then reversed and the two streams are joined at the head in memory. This results in a physical layout of the bundles where the left decoder decodes leftwards (towards lower memory) and the right decoder decodes rightwards (towards higher memory).

The address at which the two decoder streams are joined is the entrance point to the EBB. Each decoder decodes one bundle per cycle, subject to lag counts, to be issued.
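
A small sketch of the resulting layout and the two program counters; the entry address and bundle sizes are invented for illustration.

    #include <stdio.h>
    #include <stdint.h>

    /* An EBB's entry address sits between its two physical streams: the
     * right decoder walks toward higher addresses, the left decoder toward
     * lower addresses, so a single address names both streams. */
    int main(void) {
        uint64_t entry = 0x1000;              /* the single jump/call target     */
        int left_sizes[]  = {4, 6, 3};        /* bytes per left bundle (made up) */
        int right_sizes[] = {5, 5, 7};        /* bytes per right bundle (made up)*/

        uint64_t left_pc = entry, right_pc = entry;
        for (int cycle = 0; cycle < 3; cycle++) {
            left_pc -= left_sizes[cycle];     /* left stream decodes downward  */
            printf("cycle %d: left bundle 0x%llx..0x%llx, right bundle 0x%llx..0x%llx\n",
                   cycle,
                   (unsigned long long)left_pc,
                   (unsigned long long)(left_pc + left_sizes[cycle]),
                   (unsigned long long)right_pc,
                   (unsigned long long)(right_pc + right_sizes[cycle]));
            right_pc += right_sizes[cycle];   /* right stream decodes upward   */
        }
        return 0;
    }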

The early public Mill designs indicated that there would be differences in the operation capabilities of the two decoders. One side, the Exu side, would be primarily responsible for arithmetic operations. The other side, the Flow side, would be responsible for operations related to addressing, such as loads or stores. While this would restrict the ability to execute all possible instructions in a single cycle it does allow for different encodings (that is, one particular bit pattern can mean one operation to the left decoder but mean a different operation to the right decoder) to be used for each decoder to achieve better code density.

2.5.1.2 Bundle Format

In the Mill, each instruction can contain several operations to be executed within the same cycle. As described in the previous section, these are divided into two half-bundles to be decoded by two decoders. The half-bundles still contain variable-length operations, however, which need to be separated out during decode.

The bundle format contains a fixed-length header followed by a number of blocks, each of which contains a number of operations that are of the same size. The bundles themselves are byte aligned while the blocks are aligned to well known bit sizes. Some alignment holes on the order of bits often exist between blocks.

These alignment holes are used to encode noops for the other decoder and are called a lag count. If the lag count is zero then the other decoder executes immediately, if the lag count is one then the other decoder waits one cycle before issuing and so on. If there are an insufficient number of lag count bits then traditional noop instructions are also available, though needed much less often.

Thus, once the operation boundaries have been determined, decoding is as simple as in the fixed-width instruction case.

Each fixed-size bundle header contains the total length of the bundle along with the number of operations in each block. With this information it is a simple matter to perform the necessary shifts and masks in hardware to decode two blocks per cycle: the bit range of the first block is known as soon as the header has been read, and the bit range of the last block follows from the total bundle length in the header. Each bundle is decoded one block at a time from each end in parallel, and therefore two blocks are decoded per cycle per decoder. Mill bundles to date have only three blocks.
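The following Python sketch shows how, once the header has supplied the operation counts, a half-bundle splits into its blocks with simple arithmetic. The bit-string representation, the parameter names and the per-block operation widths are illustrative assumptions, and the alignment holes between blocks are ignored here.

    # Sketch only: separate a half-bundle into blocks of equal-width operations.
    def split_blocks(bundle_bits, op_counts, op_widths):
        # bundle_bits: the payload as a '0'/'1' string with the header already stripped.
        # op_counts[i] operations of op_widths[i] bits each make up block i.
        blocks = []
        offset = 0
        for count, width in zip(op_counts, op_widths):
            block = [bundle_bits[offset + k * width : offset + (k + 1) * width]
                     for k in range(count)]
            blocks.append(block)
            offset += count * width
        return blocks

    # Example: a first block of three 5-bit operations and a second of two 7-bit ones.
    blocks = split_blocks("11111" "00000" "10101" "1111111" "0000000",
                          op_counts=[3, 2], op_widths=[5, 7])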

2.5.2 Ganging

The most common operations a CPU is asked to perform tend to have two or fewer inputs and one or fewer outputs. Consequently, it is most economical to support those kinds of operations directly. It is possible to add a third input or a second output to support some of the less common operations, but this is expensive relative to the gain from those specialized instructions.

The Mill supports such operations by ganging two adjacent functional units to work as a single unit for the duration of that single operation. This is achieved by passing the primary operation to the lower slot and an args operation to the higher slot. The args operation indicates that the input arguments of its slot should be passed to the lower-numbered slot(s) expecting such inputs.

Two example arithmetic uses of this feature are a Fused Multiply-Add, using three functional units to compute two separate instances of FMA, and a Fused Multiply Add-Subtract, where both the sum and the difference of two multiplications are computed simultaneously using two functional units.

2.5.2.1 Ganged Predicates

It is sometimes the case that code needs to perform flow control based on the output of an arithmetic operation. For example, a branch to the top of a loop may be taken so long as a decremented value is not yet zero. Performing this test with a separate instruction would increase the latency of such an operation by one cycle while a compare (or equivalent) is performed. Condition codes are a common solution to this problem: the various conditions an arithmetic operation can produce are stored in a global register. Unfortunately, this causes difficulty when speculating, because global state is modified, and on wide machines such as the Mill it is often unclear which operation made which change to the condition register. Alternatively, there could be variants of the main arithmetic operations that return the result along with a Boolean indicating whether a condition held; this has the severe disadvantages of an explosion of opcodes, one per condition to check, and of adding a second output to otherwise fast ALU operations, which implies additional hardware fastpaths and limits the clock rate.

Instead of any of these approaches, the Mill uses its ganging capability. All arithmetic units unconditionally provide their condition code results horizontally to the next greater slot, and there are operations defined which extract a given condition code and return the result as a Boolean on the belt, ready to be used for flow control or vector masking. Since these conditions are normal data elements on the belt they can be speculated like any other data.
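As a behavioural sketch (the operation and flag names here are assumptions, not the Mill's actual mnemonics), an ALU slot hands its condition codes sideways to the next-higher slot, where an extract-style operation turns one chosen condition into an ordinary Boolean belt operand:

    # Sketch only: each ALU result carries its condition codes sideways to the
    # next-higher slot, where a (hypothetical) extract operation turns one chosen
    # condition into a plain Boolean belt operand.
    def alu_sub(a, b):
        result = (a - b) & 0xFFFFFFFF
        conds = {
            "zero":     result == 0,
            "negative": bool(result & 0x80000000),
            "carry":    a < b,               # borrow on unsigned subtract
        }
        return result, conds                 # conds travel horizontally, not on the belt

    def extract_cond(conds, which):
        # Runs in the adjacent (ganged) slot; drops a Boolean onto the belt, so it
        # can be speculated and used by branches or vector masks like any data.
        return conds[which]

    result, conds = alu_sub(5, 5)
    loop_done = extract_cond(conds, "zero")  # True: the decremented value reached zero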

2.5.3 Uniform Instruction Set

Taking advantage of the metadata associated with all operands, as discussed above, the Mill has a uniform set of operations. That is, every operation on the Mill, when passed an operand for which the operation makes sense, will perform that operation on the operand. The programmer does not need to differentiate between scalar values of different widths, between scalars and vectors, or between vectors of differing width and scalarity. A single opcode indicates the semantic operation to perform and the hardware evaluates the metadata to select the matching physical operation.

This uniformity allows a more compact instruction encoding, with all the associated performance advantages, and in some cases allows polymorphic code that operates correctly on many different kinds of operands without code changes. Note, however, that the metadata does not contain data type information. The CPU makes no distinction between a four-byte integer and a four-byte floating-point number; there are distinct integer and floating-point opcodes and it is the responsibility of the compiler to ensure the correct opcode is used for the datatype.
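A rough Python model of the idea, using a hypothetical operand representation that carries width and scalarity as metadata; a single add opcode then covers scalars and vectors of any element width:

    # Sketch only: one "add" opcode; the operand metadata (byte width, scalarity)
    # selects the physical operation. Operand is a hypothetical representation,
    # not the real hardware format.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Operand:
        width: int          # bytes per element, carried as metadata
        values: List[int]   # one entry for a scalar, several for a vector

    def add(a: Operand, b: Operand) -> Operand:
        assert a.width == b.width and len(a.values) == len(b.values)
        mask = (1 << (8 * a.width)) - 1              # wrap at the element width
        summed = [(x + y) & mask for x, y in zip(a.values, b.values)]
        return Operand(a.width, summed)

    # The same opcode handles a 4-byte scalar and a vector of 2-byte elements.
    print(add(Operand(4, [7]), Operand(4, [8])).values)             # [15]
    print(add(Operand(2, [1, 0xFFFF]), Operand(2, [2, 1])).values)  # [3, 0]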

2.5.4 Speculative Execution

The operation set of the Mill is designed with compiler-driven speculation in mind. To that end, operations do not raise exceptions or traps unless absolutely necessary. Instead, metadata bits are set to indicate whether the result is valid and, if it is invalid, what sort of invalid state it is in. For example, any IEEE 754 floating-point flags set as the result of a floating-point operation are carried through to the next floating-point operation. The specified operation is performed on the values passed to it, while the resulting flags are the or'd union of the flags of the operands and any new flags set by the operation itself. At no point during this process are the floating-point flags stored in any global area.

Storing the flags in a global area would prevent speculation, as the executed operations may not end up being on the taken path. Instead, floating-point flags are not realized until an operation which cannot be speculated, such as a store, branch or function call, is performed. When the value is realized, the floating-point results that may previously have been on a speculative path are now known to be on the taken path, and their flags are merged into the global fpState register.

Similarly, the NaR bit can be set when an operation results in an error and this metadata bit will be propagated, element-wise, through subsequent operations. Unlike the floating-point flags, however, when the NaR bit is set the value of the element does not exist. In its place there is a tuple describing what kind of error was detected, such as a load result from an invalid memory access, and where the error was encountered. As with the floating-point flags the NaR values do not raise an exception until they are realized. Any debugger examining the runtime will be triggered at the point of realization, but can use this information to display the point of fault detection in most cases.

Finally, the None metadata bit indicates the absence of data in an element. As with NaR this propagates through operations element-wise. However, unlike NaR a None element does nothing when a non-speculable operation is performed upon it. That is, if a store were to be performed on a None value there would be no change to any memory and no exception would be raised. The store operation would be processed, but the result would be as if that particular instance of the store was not there at all.
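A minimal sketch of these semantics, using a hypothetical element representation; it shows flags being or'd along speculable operations, NaR and None propagating element-wise, and nothing becoming architecturally visible until a store realizes the value:

    # Sketch only: a hypothetical scalar element carrying speculation metadata.
    # NaR and None propagate through speculable operations; flags are or'd along;
    # nothing faults until a non-speculable operation (here, a store) realizes it.
    from dataclasses import dataclass, field

    @dataclass
    class Element:
        value: int = 0
        nar: bool = False          # Not-a-Result: the payload would describe the fault
        none: bool = False         # absence of data
        fp_flags: set = field(default_factory=set)

    def speculable_add(a: Element, b: Element) -> Element:
        return Element(a.value + b.value,
                       nar=a.nar or b.nar,
                       none=a.none or b.none,
                       fp_flags=a.fp_flags | b.fp_flags)   # flags merge, no global state

    def store(memory: dict, addr: int, elem: Element, fp_state: set):
        if elem.none:
            return                 # None: the store quietly does nothing
        if elem.nar:
            raise RuntimeError("fault realized at store")   # NaR: fault only now
        fp_state |= elem.fp_flags  # flags reach the global fpState only on realization
        memory[addr] = elem.value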

2.5.5 Phasing

The Mill assigns every operation to one of several phases. A phase is a logical period within a single cycle at which an operation may retire. The advantage of this is that an operation in a later phase may use the result of an operation from an earlier phase within the same clock cycle, and the two can therefore be encoded within the same instruction. The phases, in order, are described in the following table.

Phase         Typical Operations
Reader phase  Operations which take no belt inputs, such as constants or scratchpad reads.
Op phase      Many of the traditional operations, such as add, load, and equal.
Call phase    Function calls.
Pick phase    The pick operation, which implements the ternary operator.
Writer phase  Operations which produce no belt outputs, such as stores, branches, or scratchpad writes.

The call phase has the interesting property that the first two phases of the instruction execute to completion, the call then executes for some arbitrary number of cycles, and on return the original instruction resumes, skipping any operations already issued.

The fundamental effect of phasing is to increase operation-level parallelism by reducing the restrictions on passing data between operations within the CPU pipeline. In the middle of large EBBs this is not an advantage relative to VLIW machines, because a VLIW can simply schedule the interdependent operations into the necessary instructions across several cycles. However, if the EBBs are small, or the code in question is at the beginning of an EBB, phasing moves the reader phase into the cycle of the previous branch instruction, with no prediction involved since the instruction is known to have been chosen. If the code is at the end of the EBB then phasing moves the writer-phase operations into the cycle of the next instruction, irrespective of what that instruction is doing, even branching. In effect, the reader and writer phases are moved across flow control.

Given the branch-heavy nature of general-purpose code, phasing results in a performance win in many cases. Phasing is inferior to out-of-order execution where the data dependencies do not match the phasing order; however, if the OoO CPU is issue limited then phasing provides superior performance opportunities.

2.5.5.1 Cascaded Calls

If there are multiple calls within a single instruction the calls are called consecutively in slot order and the return values of the first call are passed as arguments to the second call and so on.

This provides efficient hardware support for tail calls.

2.5.6 Multi-way Branches

As the Mill is a wide-issue machine it is capable of issuing several predicate computations in an instruction and further can issue several branches within a single instruction. It achieves this feat by defining that the first branch in slot order to succeed will be taken and all other branches will be ignored.

This facility eases the implementation of multi-way branch constructs such as switch statements or long if-else cascades. It does so without consuming prediction table space, as each instruction carries only a single prediction, and having multiple branches within a single instruction prevents cascaded branch misprediction penalties. The latter can occur when every case is tested individually: each branch receives its own prediction, which may be that the branch is taken, and when multiple consecutive cases are not taken each branch may incur its own mispredict penalty. On the Mill there is only one prediction for the whole group of branches, so only one mispredict penalty is possible.
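A small sketch of the "first branch in slot order to succeed wins" rule, using a hypothetical list-of-pairs representation for the branch slots:

    # Sketch only: several conditional branches issued in one instruction; the
    # first one in slot order whose predicate is true supplies the next address.
    def resolve_branches(branch_slots, fall_through):
        # branch_slots: list of (predicate, target) pairs in slot order.
        for predicate, target in branch_slots:
            if predicate:
                return target          # all later branches are ignored
        return fall_through            # no branch taken: continue in this EBB

    # A three-way switch resolved in a single instruction:
    next_pc = resolve_branches([(False, 0x100), (True, 0x200), (True, 0x300)], 0x400)
    # next_pc == 0x200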

2.5.7 Cold Code Branch Prediction

The Mill has, for code which is hot within the upper-level caches, a relatively typical branch prediction subsystem. However, hot code in the highest level of cache is not where all the time is spent by modern code running under a multitasking operating system. A significant portion of the time, the code the CPU is executing is not in cache, the branch history table has been completely invalidated since the process's previous time slice, and there is nothing on which to base a branch prediction. This may be the first time the code has been run, or some other large program may just have left the CPU after evicting all previous contents of the cache. In a multitasking system where many large processes run on the same CPU for milliseconds before being exchanged for another large process, it can be the normal case that the cache contains little if any of the code of the newly running process.

In such cases, where all the caches miss when loading instructions, as much as 95% of CPU cycles can be wasted waiting for the correct instructions to arrive at the CPU so they can be executed.

2.5.7.1 Fundamental Issues

The difficulties with cold code stem from three fundamental problems. Firstly, when loading instructions from DRAM, conventional machines do not know how many instructions to load. Modern machines use a heuristic and load two cache lines' worth of instructions, but there is no guarantee that the load will be both sufficient and, at the same time, not wasteful. If a cache line contains four instructions then two cache lines hold eight instructions. Empirically this is sufficient in many cases; that is, there is often a branch approximately every eight instructions. However, if the next branch does not come for ten instructions then the third cache line of instructions suffers another cache miss and DRAM fetch. Conversely, if the branch was in fact within the first cache line then the second cache line is likely to go unused, having pushed out some other code which may be needed again shortly.

Secondly, branches are not located until they have been decoded, and consequently no fetching of their targets from main memory can begin until the branch is already in the CPU pipeline, giving little opportunity to hide the fetch latency without CPU stalls.

Thirdly, when running cold code there is no history in the branch prediction table so the predictions will essentially be random guesses until history can be built up. Even if the code is in the cache there will be frequent mispredict stalls due solely to the poor quality of predictions.

2.5.7.2 Prediction with EBBs

The Mill is able to provide solutions for these issues due to its use of EBBs, as discussed above. Specifically, because the Mill uses EBBs as the fundamental unit of code it does not have to predict every branch, but instead needs only predict the exits from an EBB, which may have many branches within it. That is, the Mill predicts exits and not branches.

Exit predictions are stored in the Exit Table (not a branch history table, but serving a similar purpose) as a tuple of: the offset of the branch destination relative to the beginning of the current EBB; the number of whole cache lines within the EBB up to the predicted exiting branch operation; the number of instructions in the last cache line up to that branch; and a kind field. The kind may be a jump, a return, an inner call, or an outer call.

The Exit Table is essentially a hash table keyed on the address of the EBB's entry point.

2.5.7.3 Prefetching Predictions

With the address of the current EBB as the original key it is possible to repeatedly look up the next prediction from the Exit Table by following the chain. This does not require examining or decoding the instruction stream itself and can be done independently of it. At each step of this chaining, the code for the prediction can be prefetched into the top-level instruction cache, hiding the access latency should the request have to go out to DRAM. If the prediction is correct then, because it contains the number of cache lines expected to be executed, the memory request is exactly as long as necessary: no cache lines are wasted and no additional DRAM access delays are incurred because a fetch heuristic was incorrect.

This chaining continues until a loop is detected, the prediction accesses invalid memory or is otherwise proven to be incorrect or, in extreme cases, some depth limit is achieved.
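A software model of this chaining, with a hypothetical ExitPrediction record mirroring the tuple described above and a prefetch callback standing in for the hardware prefetcher:

    # Sketch only: a hypothetical software model of exit-prediction chaining.
    # exit_table maps an EBB entry address to its predicted exit; the walk
    # prefetches exactly the cache lines the prediction says will execute.
    from dataclasses import dataclass

    @dataclass
    class ExitPrediction:
        target_offset: int      # offset of the predicted exit target from the EBB entry
        cache_lines: int        # whole cache lines up to the predicted exiting branch
        instructions: int       # instructions in the last cache line up to that branch
        kind: str               # "jump", "return", "inner call" or "outer call"

    def chain_and_prefetch(exit_table, entry_address, prefetch, depth_limit=32):
        seen = set()
        for _ in range(depth_limit):
            prediction = exit_table.get(entry_address)
            if prediction is None or entry_address in seen:
                break                                        # miss or loop: stop chaining
            seen.add(entry_address)
            prefetch(entry_address, prediction.cache_lines)  # no heuristic over-fetch
            entry_address = entry_address + prediction.target_offset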

This prefetching is performed by the Prefetcher component of the CPU. If a prefetch would fail, for example because the memory access would cause a page fault, then prefetching is aborted until some branch is taken, which restarts the prefetch process. Prefetching uses otherwise idle DRAM cycles and pulls cold code into the cache hierarchy.

Concurrently, the predictions are stored in the Prediction Cache. This is a small cache intended to be just large enough to contain all the predictions necessary for an inner loop. Having this small, fast cache makes it possible to increase the size, and therefore the latency, of the Exit Table, which is expected to have a latency, and presumably a size, similar to the L1 data cache. The Prediction Cache is only interested in loops: chaining from the Exit Table stops when a loop is detected and recommences when a prediction miss occurs.

The instruction fetcher chains predictions from the prediction cache and fetches the predicted code into microcaches (ex$0 and fl$0 in the diagram above). As this code was prefetched earlier it is likely to already be available in cache, even if the code was previously cold. The fetcher then forwards the predictions to the decoder, which uses the predictions and the stream of instructions from the microcache. Given the constant stream of instructions and predictions the decoder is thus able to continually decode down the predicted path at maximum speed.

In the case of a mispredict, the penalty is only the time to go through the Fetcher and Decoder again.

2.5.7.4 Prediction Preloading

Due to physical constraints the size of the exit table is limited and entries within it must be reused when different programs are executing. Often this implies that shortly after a task switch the contents of the exit table contain no predictions for the currently running code. The Mill provides a facility to supply predictions from code for which there is no explicit history available.

When the Exit Table is checked for a prediction and misses, a request is sent to the load module for any prediction data available in the executable itself. This will often be represented as a separate section in the executable format; for example, ELF executables already have text and BSS sections, and a prediction section would be a logical addition. When the prediction is loaded it is installed in the Exit Table ready for use.

To hide the latency of DRAM accesses, multiple predictions are loaded at a time on an Exit Table miss. This gives the best chance of minimizing Exit Table miss penalties, which are usually much more expensive than mispredict penalties. It also raises the question of which predictions to load. It is not appropriate to preload predictions on a mispredict, since a prediction existed; it was merely wrong and should be updated instead. Loading on every Exit Table miss has the problem of polluting the Exit Table with unlikely error paths.

The Mill instead does a bulk load of predictions from the load module when entering a function. This provides an obvious entry point into the EBB graph and a useful unit of prediction: only the likely EBBs need have load-module predictions, while the unlikely error paths can be ignored. This gives useful flexibility in choosing the tradeoff, on a per-function basis, between cold cache miss penalties, mispredict penalties, Exit Table consumption and executable size.

The precise mechanism that determines whether a bulk load of a function's predictions proceeds is whether a prediction for the entry EBB is found in the Exit Table or not. If no such prediction is found then the bulk load proceeds. This, of course, may happen ahead of the call as part of the normal code prefetch process. Specifically, prediction chaining will trigger the bulk loading of predictions, and when those predictions are installed in the Exit Table chaining will delve into the function and cause the code for the predicted EBBs of the function to be prefetched. This has a total delay of two load times: one to request the predictions and then one for the code. Thereafter, predictions and code are loaded concurrently. This contrasts with the conventional delay of one load time per branch in the typical case.

2.5.7.5 Load Module Prediction Sources

The question may be raised of where the load module gets its predictions. There are two general sources. The first is the compiler, which may have good information in some cases, such as an EBB with a single exit; in such cases the compiler can produce a prediction with ease.

The second source is a profiler that measures which way branches go. While this can be done in software, the necessary instrumentation changes the performance behaviour of the program under inspection. On the Mill, the prediction system has a facility to log prediction update events to memory. Thus, when a prediction in the Exit Table is changed, that event may be logged. The log can be post-processed to update the predictions in the load module. It is also possible to use these event logs to improve code produced by JITs or optimizers.

If there is no better information to place into the load module predictions then the prediction defaults to the final branch of the EBB. This ensures that the entire EBB is loaded into cache at once, even if the branch prediction itself is inaccurate; the performance advantages of caching outweigh the disadvantages of an otherwise random branch prediction. Of course, where better information exists, certain EBBs of a function may carry no load module prediction information at all, saving memory and memory bandwidth in the process.

2.5.8 Operation Set Overview

This section provides a brief overview of the various operations a Mill supports. Only the most interesting features will be noted. For more detailed information please see the detailed page for each operation.

2.5.8.1 pick

Name             pick
Phase            Pick phase
Latency          0 cycles
Supported Types  Scalars and vectors
Speculable       Yes

The pick operation is a hardware implementation of the C ternary operator. That is, given the value of a predicate it will return one of two other operands depending on the value of that predicate.

In the case where the predicate is a scalar the first result operand will be returned if the low bit of the predicate is one and the second result operand will be returned if the low bit of the predicate is zero.

In the case that the predicate and result operands are vectors of identical scalarity then the pick operation will operate as if a pick were performed upon each matching triplet of elements across all three vectors. That is, if the first element of the predicate is one then the first element of the first result operand will be placed into the first element position in the returned result. If the first predicate element was zero then the first element of the second result operand would be used instead. This operation is repeated in a similar fashion with the second elements of each vector and so on for each corresponding element in the three vectors.

The intended use of the pick operation is in conjunction with if conversion to eliminate some forms of branching and in doing so create new possibilities for software pipelining and fewer branch mispredictions.
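For concreteness, a small Python sketch of the scalar and element-wise vector behaviour described above (plain lists stand in for belt vectors):

    # Sketch only: the scalar and element-wise forms of pick.
    def pick_scalar(predicate, if_true, if_false):
        return if_true if (predicate & 1) else if_false   # low bit selects

    def pick_vector(predicates, if_true, if_false):
        # All three operands must have the same scalarity; selection is per element.
        return [t if (p & 1) else f
                for p, t, f in zip(predicates, if_true, if_false)]

    print(pick_scalar(1, "then", "else"))                  # then
    print(pick_vector([1, 0, 1, 0], [10, 11, 12, 13],
                      [20, 21, 22, 23]))                   # [10, 21, 12, 23]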

3 Implementation Strategies and Details

The previous section described the conceptual programmer's model of the Mill. That model is nothing more than a set of constraints and guarantees for which multiple implementations are possible and appropriate in differing circumstances. This section describes some of the implementation options for certain Mill features and further details about the lowest level guarantees the Mill architecture provides to programmers along with how they can be met efficiently.

3.1 The Belt

The Belt is at the core of the Mill's design and there are many different possible hardware implementations depending on the size of the belt and relative costs.

3.1.1 Shift Register

The simplest and least efficient belt implementation would be a series of registers connected such that values shift along the registers as items are added to the belt. There would be a physical register named for each belt position, and when data moved from one belt position to another as results were dropped, the data would be physically transferred. So if a single new element were dropped onto the belt, all existing data would shift along by one register at the same time.
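A toy Python model of this shift-register scheme; the class name and the belt length are illustrative:

    # Sketch only: the shift-register model of the belt. Dropping a result
    # physically moves every older value one register along; values pushed off
    # the end simply fall off the belt.
    class ShiftRegisterBelt:
        def __init__(self, length=8):
            self.regs = [None] * length            # regs[0] is belt position b0

        def drop(self, value):
            self.regs = [value] + self.regs[:-1]   # everything shifts by one

        def read(self, position):
            return self.regs[position]             # belt addressing: b0, b1, ...

    belt = ShiftRegisterBelt()
    belt.drop(3)
    belt.drop(4)
    print(belt.read(0), belt.read(1))   # 4 3  (the newest value is b0)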

There are no plans to implement any variant of the Mill using this method.

3.1.2 Two Stage Crossbar

A two stage crossbar may be implemented physically in a number of ways. However, conceptually they are all equivalent.

In the Mill the first stage is the latency-1 crossbar. This connects all the functional units that can produce a result in a single clock cycle to all the possible consumers of those results. This restriction enables the latency-1 crossbar to remain small and fast as it is in the fastpath of the cycle timing.

The second stage is referred to as the latency-n crossbar and connects all other functional unit results to the latency-1 crossbar. This second crossbar can thus take significantly more time and be significantly larger without a detrimental effect on the fast operations. Since each pipeline has two inputs there is no congestion on the input side between the latency-1 crossbar and the latency-n crossbar, so long as the latency-n crossbar has two feeds into the latency-1 crossbar.

Given a two-stage crossbar design, the determination of belt positions may not be immediately obvious. Each pipeline is a collection of several functional units that may have different latencies. For example, one pipeline may have an adder (latency 1), a rotate-and-mask unit (latency 2), a multiplier (latency 3) and an integer division unit (latency 4). For each latency in the pipeline there is a dedicated register, so a pipeline with four different latencies has four registers. Since each pipeline can only start one operation per cycle, the four registers can be viewed as a FIFO of registers; each cycle, all the results move one register further down the FIFO.

If the operations executed were always single cycle, results would uniformly pass through all the registers in order. If an operation consumes multiple cycles then, conceptually, its result is held first in the latency-1 register, then the latency-2 register and so on, just as for single-cycle operations. However, the result is not truly available until it has finished in the functional unit and been stored in the register for the appropriate cycle. Because the bubble for the result is pushed through the series of registers cycle by cycle, it is always unambiguous that the necessary register will be free when the operation completes.
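The following sketch models one pipeline's chain of latency registers under the description above; the class and method names are invented for illustration. A result drifts one register per cycle and only becomes readable once its countdown expires, which is why the register it lands in is always free:

    # Sketch only: the chain of latency registers inside one pipeline. A value
    # issued with latency L becomes readable L ticks later, in register L.
    class PipelineLatencyRegs:
        def __init__(self, max_latency=4):
            # One register per latency step; entries are (cycles_left, value) or None.
            self.regs = [None] * (max_latency + 1)

        def issue(self, latency, value):
            # Only one operation can issue per cycle, so register 0 is always free.
            self.regs[0] = (latency, value)

        def tick(self):
            # Every occupant drifts one register down the FIFO each cycle and gets
            # one cycle closer to being a real, readable result.
            shifted = [None] + self.regs[:-1]
            self.regs = [None if e is None else (max(e[0] - 1, 0), e[1]) for e in shifted]

        def ready(self, position):
            e = self.regs[position]
            return e is not None and e[0] == 0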

If a valid belt value reaches the end of the registers embedded inside the pipeline alive then it is shifted either into a buffer in the spiller or, in some optimization cases, into a register within another pipeline. This shifting of belt position continues even when the originating belt is not the top belt of the call stack. When a live belt value runs out of buffers to be stored in it is output to the memory hierarchy by the spiller.

Each belt value in a pipeline's latency registers is tagged with its frame ID and its belt position. Belt position is advanced by incrementing the position field of all live belt values that match the top frame ID, a cheap hardware operation. Various hardware mechanisms can be used to connect the somewhat arbitrary location of a belt value to the functional units that need it. A naive solution would be a CAM-style broadcast, where a destination functional unit input requests a particular belt position and only the register or buffer containing that position responds with its value on the appropriate bus.
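A sketch of this tag-based naming, under the assumption that each live value is a small record carrying its frame id and belt position; the field and function names are illustrative:

    # Sketch only: belt values are named by (frame id, belt position) tags rather
    # than by physical location. A drop renames everything in the top frame with
    # one broadcast increment; values pushed past the end are dropped here, though
    # the real machine would hand still-live values to the spiller instead.
    def drop(tagged_values, top_frame, new_value, belt_length=8):
        # tagged_values: list of dicts {'frame': .., 'pos': .., 'value': ..} living
        # in whatever latency register or buffer currently holds them.
        for v in tagged_values:
            if v['frame'] == top_frame:
                v['pos'] += 1                       # cheap, parallel increment
        live = [v for v in tagged_values
                if v['frame'] != top_frame or v['pos'] < belt_length]
        live.append({'frame': top_frame, 'pos': 0, 'value': new_value})
        return live

    def read(tagged_values, top_frame, pos):
        # CAM-style lookup: whichever register holds the matching tag answers.
        for v in tagged_values:
            if v['frame'] == top_frame and v['pos'] == pos:
                return v['value']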

3.1.3 Register File

3.1.4 Content Addressable Memory

It is also possible to implement the belt as a Content Addressable Memory.

3.2 Deferred Loads and Aliasing Immunity

Given the architectural definition that loads return the value of the memory as of the retirement of the load operation, there are several possible implementations, but many involve snooping the store stream to catch aliasing and then recovering.

3.2.1 Retire Stations

The primary proposed method on the Mill for implementing deferred loads with aliasing immunity is to have a number of retire stations. When a load is issued, a retire station is allocated and given the arguments of the load. Concurrently, a request is sent out to the memory hierarchy. Each active retire station listens to the store stream for addresses that overlap its load. If such a store is discovered, any data the retire station has accumulated is discarded and the memory request is sent again. Because the store has just occurred, the data will be available in the L1 cache. If this additional request pushes the load past its original delay, a CPU stall occurs, but true aliasing is quite rare in real code, even though it is often unprovable at compile time.
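A simplified, synchronous sketch of a retire station (real hardware would receive the data asynchronously); the class shape and the request callback are assumptions for illustration:

    # Sketch only: one retire station. It remembers the load's arguments, watches
    # the store stream, and re-requests the data whenever a store overlaps the
    # loaded range, so the value it finally delivers is as of the load's retirement.
    class RetireStation:
        def __init__(self, address, size, request):
            self.address, self.size = address, size
            self.request = request              # callback into the memory hierarchy
            self.data = request(address, size)  # the initial request goes out immediately

        def snoop_store(self, store_address, store_size):
            overlaps = (store_address < self.address + self.size and
                        self.address < store_address + store_size)
            if overlaps:
                # Discard whatever was accumulated; the fresh copy is in the L1
                # cache because the store has just happened.
                self.data = self.request(self.address, self.size)

        def retire(self):
            return self.data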

There are a finite number of physical retire stations, but each frame has a logically distinct set of retire stations, and loads from multiple frames can be in flight at one time. If the number of physical retire stations is exceeded, perhaps because the retire stations allocated across the entire call stack exceed the capability of the hardware, then the excess retire stations have their arguments, but not any data they have accumulated, passed to the spiller. Upon a return operation, spilled loads are reissued. Since the original request would likely have been satisfied from DRAM, the reissue will quite possibly be satisfied from cache. Additionally, it is possible to predict returns from calls and therefore reissue the load in advance of the return, masking the cache latency.

3.3 Phasing

Phasing implies the ability to execute different operations from one instruction at different times, such that some operations can use as input the output of an operation of an earlier phase within the same instruction.

3.3.1 Cycle Stretching

The most naive method of implementing phasing would be to simply stretch the clock cycle such that all five phases could execute within the cycle. The easiest way to think about this would be to consider the primary cycle being divided into five subcycles, possibly of unequal lengths, such that all the single-cycle operations of a phase type can be completed within the time of the subcycle.

3.3.2 Cascaded Cycles

A better way to implement phasing is to issue the different operation types over a number of cycles such that the results are available at the correct times. On the Mill this issuing is done over three cycles and fits naturally with the Mill's pipelined decoder, which makes good use of the bundle format: each of the three blocks of a bundle contains operations of a different phase and is decoded at the right time, without unnecessary delays or buffering. That is, the three cycles of decode latency correspond precisely to the three cycles of phasing latency, as different operations become available from decode at different times. For example, given three instructions, the phases of each instruction execute as follows:

Cycle   Instruction 1   Instruction 2   Instruction 3
0       Reader Phase
1       Op Phase        Reader Phase
2       Writer Phase    Op Phase        Reader Phase
3                       Writer Phase    Op Phase
4                                       Writer Phase

When using this cascaded-cycle approach, phasing can be viewed as creating a software pipeline across adjacent instructions, with each instruction's phases overlapping those of its neighbours.
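The schedule in the table above can be generated mechanically; a small Python sketch, with invented function names, that maps each cycle to the phases issuing in it:

    # Sketch only: reproduces the cascaded issue schedule shown above. Instruction
    # i has its reader phase in cycle i, its op phase in cycle i+1 and its writer
    # phase in cycle i+2.
    PHASES = ["Reader Phase", "Op Phase", "Writer Phase"]

    def schedule(n_instructions):
        table = {}
        for instr in range(n_instructions):
            for offset, phase in enumerate(PHASES):
                table.setdefault(instr + offset, {})[instr + 1] = phase
        return table   # cycle -> {instruction number: phase issued that cycle}

    for cycle, row in sorted(schedule(3).items()):
        print(cycle, row)
    # 0 {1: 'Reader Phase'}
    # 1 {1: 'Op Phase', 2: 'Reader Phase'}
    # 2 {1: 'Writer Phase', 2: 'Op Phase', 3: 'Reader Phase'}
    # 3 {2: 'Writer Phase', 3: 'Op Phase'}
    # 4 {3: 'Writer Phase'}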

3.4 Pick with Zero Latency

The pick operation and the pick phase have no program-visible latency. This is achieved, in reasonable belt implementations, not by moving data or hiding the latency in some other way, but by modifying the naming of belt data instead. No data is moved, so there is no significant latency cost and the pick phase consumes no cycles.

Another way to think about this is that the pick operation occurs during data routing. It thus has no functional unit and does not consume a full cycle in latency. Instead, it adds some small amount of latency to the routing and renaming of belt positions, though this latency can be hidden by other CPU bookkeeping.