Itanium Architecture

Yesterday, I went to the first day of a three-day class put on by Intel titled “Tuning for the Itanium architecture.” It was pretty sweet. There were a lot of details that I’m still digesting.

Executive summary: “A dumb compiler on IA-32 will produce acceptable binaries. A dumb compiler on IA-64 will produce really, really bad binaries.”

Major technical points:
----------------------
* Core memory (even the caches) is SLOOOOOW (and getting slower) relative to CPU clock speeds. Assume a 300 cycle delay for fetching a value from memory, and don’t even mention page faults.

* Lots and lots of registers: 128 data-usable registers, plus (I think) about another 128 for system use, performance profiling, and constants.

* They have implemented a register stack using register renaming. For people who stopped at Computer Architecture 1, this is a monstrous shift: Basically, it’s possible (at the assembly level) to say “allocate me N clean registers, and name them R-1 through R-N”. The hardware promises that (on the next cycle) your code can safely refer to registers 1 through N. The values formerly in those registers are hidden somewhere (in negative register addresses, if you really want to get at them). When you “pop” that context, the old register values come back, again in a single cycle.

What does that mean? It means that you can pass arguments to subroutines in the registers, explicitly, rather than hoping you get lucky with the memory stack. I think this has been in other architectures before, but it was not available in IA-32. It also means that the hardware is doing its very best to keep you from using core memory. More on that later.

* The assembly language is explicitly parallel. Instructions are organized (by the compiler or assembly hacker) into “groups.” Within a group, instructions can be executed in any order. Most chips designed for the last decade have tried to provide lots and lots of cleverness about out of order execution, but they hid it from the assembly. Assembly was a linear sequence of instructions. The hardware promised to give you back results as if the instructions had been executed in the exact order specified in the code, but the chip might do all sorts of monkeying with the *real* execution order to make things go faster. In IA-64, the compiler / author has the power to explicitly declare blocks of instructions as parallel.

This (apparently) freed up a lot of transistors for other purposes. They still have a branch prediction unit, but the execution cores will frequently be filled with parallel sets of instructions, *all* of which will be stored at the end.

* There is only one Floating Point Operation: f = a * b + c.
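Here is a minimal C sketch of how ordinary add and multiply fall out of that one operation. As far as I know, the chip does this with constant registers holding 1.0 and 0.0. The function names are made up for illustration, and plain C arithmetic only models the values, not the single rounding step that a real fused multiply-add performs.

```c
/* Hypothetical helper standing in for the one floating-point operation:
 * f = a * b + c. */
static double fmul_add(double a, double b, double c) {
    return a * b + c;
}

/* Addition is the same operation with the multiplier pinned to 1.0. */
static double fp_add(double a, double c) {
    return fmul_add(a, 1.0, c);
}

/* Multiplication is the same operation with the addend pinned to 0.0. */
static double fp_mul(double a, double b) {
    return fmul_add(a, b, 0.0);
}
```

So `fp_add(2.0, 3.0)` and `fp_mul(2.0, 3.0)` both go through the same multiply-add path, just with different constants plugged in.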

* It is *possible* to do SIMD operations on the registers. They suggest “software pipelining” instead.

* Software pipelining is exploiting the explicit parallelism in the assembly to implement your very own pipeline (of arbitrary depth). Describing this in detail is beyond the scope of my understanding right now, but it’s pretty wicked cool. Just trust that there’s a hardware pipeline for the microcode (8 steps deep, thanks for asking), and software pipelining is totally distinct from that.
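To give a rough feel for the idea, here is a sketch in C rather than IA-64 assembly. It assumes a made-up three-stage loop body (load, compute, store), and the function names are hypothetical. In the pipelined version, each trip through the loop works on three different elements at once, and the three statements inside a trip do not depend on each other, which is what would let them sit in one parallel group.

```c
#include <stddef.h>

/* Plain loop: each iteration loads, computes, then stores one element. */
void scale_plain(const int *x, int *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int loaded = x[i];
        int computed = loaded * 2 + 1;
        y[i] = computed;
    }
}

/* Software-pipelined loop: trip t stores element t-2, computes element t-1,
 * and loads element t. The stages are written in reverse order so each one
 * reads the value left behind by the previous trip. */
void scale_pipelined(const int *x, int *y, size_t n) {
    int loaded = 0;    /* value in flight between load and compute */
    int computed = 0;  /* value in flight between compute and store */

    /* n + 2 trips: the first two fill the pipeline, the last two drain it. */
    for (size_t t = 0; t < n + 2; t++) {
        if (t >= 2)           y[t - 2] = computed;        /* stage 3: store   */
        if (t >= 1 && t <= n) computed = loaded * 2 + 1;  /* stage 2: compute */
        if (t < n)            loaded = x[t];              /* stage 1: load    */
    }
}
```

Both functions produce the same output array. The guards on each stage are only there to handle filling and draining the pipeline; my understanding is that the real hardware has dedicated support for that part, but I am not going to pretend I can describe it accurately yet.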

* The Intel compilers try their very hardest to be compatible with GCC. They claim that you could tune the heck out of your bottleneck function, build it with the Intel compilers, and re-link that object back into an otherwise GCC built application.

* Speculative data accesses are now explicit in the assembly. Again, other architectures have done this in the past, but not IA-32. A speculative data load defers any exception it triggers and only takes that exception if the loaded value turns out to have been needed.
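A small C model of that behavior, as a sketch only: the struct, the function names, and the use of a NULL pointer to stand in for "any address that would fault" are all assumptions made for illustration, not how the hardware is programmed.

```c
#include <stddef.h>

/* Result of a speculative load: the value plus a "this would have faulted"
 * flag that travels with it. */
typedef struct {
    long value;
    int  deferred;  /* 1 if the load hit a bad address */
} spec_value;

/* Speculative load: never faults. NULL models a faulting address. */
static spec_value speculative_load(const long *addr) {
    spec_value r = {0, 0};
    if (addr == NULL) {
        r.deferred = 1;   /* remember the problem instead of trapping */
    } else {
        r.value = *addr;
    }
    return r;
}

/* Returns 0 on success, -1 where a real program would take the exception. */
static int sum_if_needed(int needed, const long *addr, long base, long *out) {
    spec_value v = speculative_load(addr);  /* hoisted above the test */
    if (!needed) {
        *out = base;      /* value never used, so the fault never surfaces */
        return 0;
    }
    if (v.deferred) {
        return -1;        /* check at the point of use */
    }
    *out = base + v.value;
    return 0;
}
```

The point is that the load can be started early, before the code knows whether it wants the value, without risking a crash on the path that never uses it.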

* For “if/else” operations, they have “predicated” execution, which exploits the explicit parallelism. Basically:

(P0, P1) = (a > b) ## Set Predicate 0 and 1 based on the truth value of (a > b). They’ll be either (0, 1) or (1, 0)
;; ## End parallel block. The previous must be done before the next happens.
(P0) c = a ## If P0 is set, do this
(P1) c = b ## If P1 is set, do this instead
;;
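The same if/else, sketched in C for comparison. This is only an analogy: the function names are made up, and bit masks stand in for the hardware's guarded instructions. The predicated version has no branch at all, so there is nothing for the branch predictor to get wrong.

```c
/* Branching version: the CPU has to guess which way the `if` goes. */
int max_branch(int a, int b) {
    int c;
    if (a > b) {
        c = a;
    } else {
        c = b;
    }
    return c;
}

/* Predicated version: compute two complementary predicates, then let each
 * one guard its own assignment. Exactly one of the two takes effect. */
int max_predicated(int a, int b) {
    int p0 = (a > b);  /* 1 when a > b, else 0 */
    int p1 = !p0;      /* always the opposite of p0 */

    /* -1 is all one bits, so (-p & v) is v when p is 1 and 0 when p is 0. */
    return (-p0 & a) | (-p1 & b);
}
```

For example, `max_predicated(3, 5)` sets the predicates to (0, 1) and returns `b`, while `max_predicated(5, 3)` sets them to (1, 0) and returns `a`.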
