Writing speedy cpp
12 Apr 2025
Measuring
There are four main branches of profiling
- Instrumentation: compiler adds extra code on each function to count it
- Debugging: profiler inserts temporary debug breakpoints at every function
- Timer based sampling: profiler tells OS to interrupt every x time and will count where these happen, less reliable
- Event sampling: profiler tells CPU to generate interrupts at certain events. Lets you see how many cache misses, branch mispredictions, etc happen. These are CPU specific!
RAM memory access is usually a major cause of perf issues, so we have memory caches to avoid this. These are the usually sizes. It takes 100s of clock cycles for a RAM access but only 5 or so for a cache hit
- L1: 64kb
- L2: 1MB
- L3: 8MB
Another common concern is context switches, OS will interrupt your process to run other stuff!
Dependency chains are another concern, microprocessors will do out of order execution, so you want to avoid long dependency chains.
And a final common concern is the using the entire CPU, microprocessors can do multiple stuff to integers and floats at the same time, SIMD!
Choosing the right algo
Before you dig into performance optimisations, first find the optimal algo!
And consider if others have already implemented already!
- Boost
- Intel math kernel library
- Intel performance primitives library
Efficiency of cpp constructs
Variable storage
Where data is stored will affect cache efficiency.
Stack storage: are the fastest because the same range of memory addresses is reused again and again, so its almost certain that this part of the memory is mirrored in the L1 data cache.
Global and static storage: separated into three parts.
- Constants
- Initialised variables
- Uninitialised variables
Main issue is that as these memory addresses arent reused, it has bad cache properties. You should just try to make all global static stuff const, as this removes a conditional check on whether if this variable is initialised or not yet.
Dynamic storage, the heap! Very slow, and heavily affected by heap fragmentation.
Variables declared inside a class. Main thing to note is that they have the same order of declaration so store variables that are used together in the same place!
Integer variables and operators
Addition, subtraction, comparison, bit operations and shift operations only take 1 clock cycle! Multiplication will take like 11, and division is like 80 đ„Č.
Floating point variables and operators
Theres two different ways of doing floating point instructions.
- Use the 8 floating point x87 register stack
- Use vector registers XMM, YMM, ZMM
Float addition is like 5, multiplication is like 8 and division is like 40.
JUST AVOID CONVERSIONS! between ints and floats
Enums
Enums are just integers.
Booleans
Place conditional chains that are most true first in ||
for early exits!
Also bools are stored as 8 bit integers! So you can move to a char for more optimisations.
Pointers and references
Pointers and refs are equivalent n perf and honestly the perf is not that bad for local variables. Also calling out that member pointers have a bunch of possible implementations and it youre pointing to multiple inheritance pointers then thats bad.
Arrays
- the best!
- when accessing/ iterating through, you should hit adjacent bits of the array
- if its small enough to fit in l1 cache, try to make it nice power of 2
Branches and switch statements
Modern microprocessors do pipelined execution which they feed instructions into and get results. An issue of this pipeline is that it takes a while to figure out if something has gone wrong, like youâre making the wrong cake. In reality this shows up when you feed the wrong branch of a conditional into the pipeline and you gotta redo stuff taking like 20 clock cycles.
Functions
Something to note is that functions can slow down code by
- causing a microprocessor jump, taking up to 4 cycles
- code cache being less effective as youâre somewhere else now
- function parameters need to be re read
- You need to setup the stack frame
Try to
- Avoid unnecessary functions
- Use inline functions
- Avoid functions that call other functions!
- Use macros
- Use
__fastcall
and__vectorcall
to speed up integer parameter functions - Add
static
so they are inlined, move them to anonymous namespaces!
Function parameters
Const ref function parameters are pretty efficient, will be transferred in registers.
Function tail calls
When last return of a function is another function, compiler will just jump to second function, not a full function call, nice! Return types need to be identical tho.
Class data members
Just know that for your structs, data members have requirements of where the data member is placed in memory, is it on a alignment divisible by 8, 4, 2 or 1 bytes. This can leave holes in your classes!
Runtime type identification
Guess what similar to virtual member functions, doing type identification at runtime is also slow. Who knew.
Inheritance
Not much of a penalty with inheriting structs, but something to note is that the functions you call might be in different modules and therefore slower to call.
Bitfields
To save space you can share memory with union
and specify bits youâre using with bitfields
. Looks very error prone.
Templates
Something to note is you can avoid runtime virtual functions, by pushing arguments into templates!
Threads
Threads allow you to do multiple tasks at the same time, and these are distributed across your CPUs, at the cost of context switches every 30ms.
Compiler optimisations
How compilers optimise
Hereâs a list of things compilers will do
- Inline functions
- Propagate constants
- Eliminate pointers, when target is known
- Eliminate subexpressions that are used frequently
- Store stuff in registers, Main callout is that if a variable has a pointer or reference to it then it canât be stored in a register! BRUH
- Join identical branches, if both sides of a branch do the same thing, it will simplify
- Remove needless conditional jumps, by copying returns earlier
- Unroll loops
- Move code out of loops if theyâre unrelated
- Vectorisation, use vector registers for handling multiple data simultaneously
- Algebraic reductions: Doesnât work on floats very well!
Obstacles to optimisation by compiler
Theres a bunch of restrictions to compiler optimisations you should be aware of.
- Compilers canât optimise across cpp files!
- When accessing variables through pointers, it causes issues that the compiler canât figure out if accessed members are identical and can be reused!
- Compilers canât know if a function is pure, so you gotta do it yourself, so itâs not called a lot of times needlessly. Or try out
__attribute__((const))
on linux! - Virtual functions suck
- Compilers canât do complicated algebraic reduction!
Obstacles to optimisation by CPU
The main thing to note is to avoid long dependency chains! Avoid loop carried dependency chains!
Optimisation directives
Checking compiler output
Few ways to actually do this!
- Use a compiler cli flag,
-S or /Fa
, and use a disassembler on the object file - If intel, use
/FAs or -fsource-asm
but will rpevent some optimisations
Memory optimisations
Cache code and data
Cache organisation
Caches are organised into lines and sets. This is the n way set associative cache. So on a memory address modulo will choose which set it uses. The key thing to be aware of is how far a memory address needs to be away for it to be a cache hit or not. AKA the critical stride `num sets x line size = total cache size / number of ways
Functions that are used together should be stored together
Variables that are used together should be stored together
Alignment of data
As you already know variables are auto aligned and this can cause needless holes in your structs. So you should reorder them.
You can also align large objects by the cache line size alignas(64) int arr[1024]
. So your nice array will start at a cache line.
Data structures and container classes
- If the maximum number of elements is known at compile time or a decent upper limit can be set, use a fixed size array.
- Avoid additional allocations by reserving your vectors!
- FIFO: queue
- LIFO: array with stack pointer
- need ordering: sorted list, bin tree
Explicit cache control
Modern processors already prefetch, so you _mm_prefetch
isnt really necessary.
You can also do __mm_stream_pi
to do writes without triggering a cache refresh.
Multithreading
Theres three ways of doing things in parallel
- Use multiple CPUs
- Use multi core CPUs
- Use out of order execution
- Use vector operations
Out of order execution
CPUs can actually handle more than a hundred pending operations at once, can be useful to split a loop into two and store intermediate results in order to break a long dependency chain.
Vectorising
SIMD, single instruction multiple data operations can make stuff much faster! The vector register size depends on the instruction set youâre using.
When working you can do the following to do simd with pointers better
#pragma vector aligned
- declare functions inline
Do the following for better auto vectorisation
- When accessing through pointers, tell the compiler explicitly that the pointers do not alias, with
__restrict or __restrict__
keyword. - Use less restrictive floating point operations.
-02 -fno-trapping-math -fno-math-errno -fno-signed-zeros -ffast-math
-
AVX instruction set and YMM registers
AVX512 instruction set and ZMM registers
Automatic vectorisation
Using intrinsic functions
Using vector classes
Transforming serial code for vectorisation
Mathematical functions for vectors
Aligning dynamically allocated memory
Aligning RGB video or 3 dimensional vectors
Conclusion