Inside the machine
15 Dec 2024
Basic computing concepts
The key structures in the toy computer are
- Storage: holds both the instructions and free space for data
- Registers: ultra-fast data access on the CPU
- Arithmetic logic units
- Data buses
A key callout is that each instruction can only be one of (see the sketch after this list):
- An arithmetic instruction
- A memory access instruction: loading from and storing to main memory (lumping branch/jump instructions in here as well)
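A minimal sketch of the split, with the instruction breakdown as RISC-flavoured pseudo-assembly in the comments (the mnemonics and function name are illustrative, not any real ISA):

```c
/* Hypothetical sketch: how one line of C decomposes into the
 * instruction categories on a load/store machine. */
int add_elem(int *a, int *b, int i) {
    return a[i] + b[i];
    /* Roughly compiles to something like:
     *   load r1, [a + i*4]   ; memory access
     *   load r2, [b + i*4]   ; memory access
     *   add  r3, r1, r2      ; arithmetic
     *   ret                  ; control transfer (the branch/jump bucket)
     */
}
```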
Mechanics of program execution
A fun callout: when you turn on a computer, you're starting the bootstrapping process. But how does the computer start doing anything without any instructions?
- It first looks up the BIOS: a read-only program that sets up all the hardware stuff
- Then the BIOS hands over to a bootloader program that goes and finds the OS
- Then the OS sets up all your own programs
Pipelined execution
Modern RISC-style processors have four pipeline stages
- Fetch
- Decode
- Execute
- Write
The CPU has a speed at which it can execute each of these four steps. If it waited for a full loop per instruction, then while we're fetching we aren't executing or writing, and while we're writing we aren't fetching: we'd have cool hardware just sitting around. To fix this, most modern processors are pipelined, so a line of instructions flows through the processor like an assembly line (a toy simulation is sketched below).
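A toy simulation in C of the idealised four-stage pipeline (the textbook model, not any real chip): each row is an instruction, each column a clock cycle, and once the pipeline fills, one instruction finishes per cycle instead of one every four.

```c
#include <stdio.h>

int main(void) {
    const char *stages = "FDXW";          /* Fetch, Decode, eXecute, Write */
    int n_instr = 5, n_stages = 4;
    for (int i = 0; i < n_instr; i++) {
        printf("instr %d: ", i);
        for (int c = 0; c < n_instr + n_stages - 1; c++) {
            int s = c - i;                /* stage instr i occupies in cycle c */
            putchar(s >= 0 && s < n_stages ? stages[s] : '.');
        }
        putchar('\n');
    }
    return 0;
}
```

Running it prints a staircase of FDXW rows, each shifted one cycle right of the last, which is the overlap that keeps all four stages busy.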
Superscalar execution
To make computers even faster, we realised we could just shove two (or more) execution units together. This is the idea behind superscalar processors: they can process multiple instructions, and hence multiple scalars (numbers), at once. Nothing complicated.
Unsurprisingly, hardware has specialised to process the fundamental data types more easily:
- Scalar ints
- Vector ints
- Scalar floats
- Vector floats
How might the stack look?
- Your program talks to
- the instruction set architecture (ISA)
- which gets converted to run on
- any hardware implementing that same architecture
Into this ISA, Intel added new cool stuff to do faster processing (a small example follows the list):
- MMX (multimedia extensions)
- SSE, SSE2 (streaming SIMD extensions)
- SIMD = Single Instruction Multiple Data
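A minimal SIMD sketch using the real SSE intrinsics from `xmmintrin.h` (compile with `-msse` on older 32-bit toolchains; SSE is on by default for x86-64): a single `_mm_add_ps` instruction performs four float additions at once, where scalar code would need four separate adds.

```c
#include <stdio.h>
#include <xmmintrin.h>                   /* SSE intrinsics */

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];
    __m128 va = _mm_loadu_ps(a);         /* load 4 floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);      /* single instruction, four additions */
    _mm_storeu_ps(out, vc);
    for (int i = 0; i < 4; i++)
        printf("%.0f ", out[i]);         /* prints: 11 22 33 44 */
    return 0;
}
```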
But unsurprisingly, superscalar design (processing multiple instructions at once) isn't a free lunch; there's some complexity:
- Data hazards: oh no, one of our instructions depends on another: just try to merge/forward the result directly (a sketch of the cost follows this list)
- Structural hazards: oh no, the processor doesn't have enough capacity to handle these instructions: just have a fat register file with enough ports (an array)
- Control hazards: oh no, we need to wait for conditional branches: just prefetch/predict those branches
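A sketch of why data hazards matter, assuming a superscalar core with multiple adders: the first loop is one long dependency chain (each add must wait for the previous one), while the second breaks the work into four independent chains the hardware can run in parallel. The function names are mine; note that with floats this reordering changes rounding, which is why compilers won't do it for you without something like `-ffast-math`.

```c
/* One dependency chain: iteration i needs the result of iteration i-1. */
float sum_serial(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: the adds can issue in parallel. */
float sum_parallel(const float *x, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```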
It's shocking that on modern processors a vector instruction executes about as fast as a scalar one, so vectorised code gets its multi-element speedup essentially for free.
Intel Pentium 4 vs Motorola G4e
The key design differences are:
- Intel pushed for clock speed and deeper pipelines, which leads to a higher risk of bubbles clogging up the pipeline
- The G4e pushed for more parallel functional units, which needed an issue queue to be used well
Something that is talked about a lot is the pain of branch prediction and how necessary it is. It's so expensive to sit for many cycles waiting to get something from outside the L1 cache that a lot of work is put into predicting: compiler hints, Intel's trace cache, and speculatively executing down predicted branches. The classic demonstration of the cost is sketched below.
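A sketch in C: the same counting loop runs over random data (branch mispredicted roughly half the time) and then over sorted data (branch almost perfectly predicted). Compile without heavy optimisation (e.g. `-O1`), since optimisers may replace the branch with branch-free code and hide the effect; exact timings are machine-dependent.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

long count_big(const int *x, int n) {
    long count = 0;
    for (int i = 0; i < n; i++)
        if (x[i] >= 128)                 /* the branch under test */
            count++;
    return count;
}

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { N = 1 << 20, REPS = 100 };
    static int x[N];
    long c1 = 0, c2 = 0;
    for (int i = 0; i < N; i++) x[i] = rand() % 256;

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++) c1 = count_big(x, N);   /* random: mispredicts */
    clock_t t1 = clock();

    qsort(x, N, sizeof x[0], cmp);       /* sorting makes the branch predictable */

    clock_t t2 = clock();
    for (int r = 0; r < REPS; r++) c2 = count_big(x, N);   /* sorted: predictor locks on */
    clock_t t3 = clock();

    printf("random: %ld in %.4fs, sorted: %ld in %.4fs\n",
           c1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           c2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    return 0;
}
```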
64-Bit Computing and x86-64
Intel's own 64-bit attempt wasn't good enough, so AMD beat them by creating the more popular instruction set architecture, x86-64.
Caching and Performance
| Level | Access Time | Size | Tech | Managed By |
|---|---|---|---|---|
| Register | 1 ns | 1 KB | CMOS | Compiler |
| L1 | 2 ns | 10 KB | SRAM | Hardware |
| L2 | 5 ns | 5 MB | SRAM | Hardware |
| Main memory | 30 ns | 1 GB | DRAM | OS |
| Hard disk | 3,000,000 ns | 100 GB | Magnetic | OS |
The key principles behind cache locality working are:
- Spatial locality: the neighbours of stuff you just accessed are likely needed next (see the traversal sketch after this list)
- Temporal locality: stuff you just accessed is likely to be re-accessed soon
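A sketch of spatial locality in C: both functions compute the same sum, but the row-major walk touches memory sequentially (each cache line fetched is fully used) while the column-major walk strides a whole row ahead on every access, so for large matrices it misses cache far more often.

```c
#include <stdio.h>
#include <string.h>

enum { N = 1024 };
static int m[N][N];

long sum_rows(int a[N][N]) {             /* walks memory in order */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

long sum_cols(int a[N][N]) {             /* strides N ints per access */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    memset(m, 1, sizeof m);              /* every byte 0x01, so every int 0x01010101 */
    printf("%ld %ld\n", sum_rows(m), sum_cols(m));   /* same sums, different speeds */
    return 0;
}
```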
Caches are usually simplified into blocks, and the usual caching strategies exist: fully associative, n-way set associative, and direct mapped. You can't do cool eviction policies, as that's too expensive; it's gotta be cheap if it's done so often.
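A sketch of direct mapping with made-up parameters (64-byte blocks, 1024 sets, i.e. a 64 KB cache): every address maps to exactly one slot, which is what keeps eviction cheap, since the incoming block simply overwrites whatever is there.

```c
#include <stdio.h>
#include <stdint.h>

enum { BLOCK = 64, SETS = 1024 };        /* hypothetical cache geometry */

int main(void) {
    uint64_t addr = 0x7ffe12345678;      /* an example address */
    uint64_t offset = addr % BLOCK;              /* byte within the block */
    uint64_t index  = (addr / BLOCK) % SETS;     /* the one slot it can live in */
    uint64_t tag    = addr / (BLOCK * SETS);     /* identifies which block is cached */
    printf("offset=%llu index=%llu tag=%llx\n",
           (unsigned long long)offset,
           (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}
```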