TCP/IP Illustrated
23 Dec 2024
Intro
Let's do a whirlwind tour of the entire stack and what it looks like.
One of the key principles is multiplexing: which just means that instead of having a connection using a cable for a period of time, you chunk communication into packets so you can share this cable between multiple people.
This packet idea was further developed with the addition of source and destination metadata to each of these packets: datagrams. This is great because now all the switches and routers in between the computers don't need to have any state management, they are just dumb pipes forwarding datagrams.
The idea of fate sharing also follows where all the necessary state needed to maintain an active communication must be at the source and destination endpoints.
Another principle is this idea of implementing the right features at the right level/abstraction, and not making premature abstractions at too low of a level. With this principle the layering idea of network protocols emerges. This is super neat. The real neat thing here is that each layer will blindly encapsulate the data it gets, wrapping it in its own metadata header, and computers will only unwrap these packets as far up as necessary.
OSI model
- Physical: How to actually send bits across a cable. Connector, data rate, encoding info, low level error correction. Ethernet 1000BASE-T
- Link: How to communicate across a single link. Error detection. Ethernet, Wi-Fi.
- Network: How to communicate across multiple hops. IP datagrams, which let us hop across different links.
- From here up, the stuff is only implemented on clients/servers.
- Transport: How to communicate between multiple programs running on the same computer system, and possibly reliable delivery. TCP.
- Session: How to establish and create an ongoing connection. ISO X.225
- Presentation: How to communicate data formats, like ASCII.
- Application: Whatever you want to do.
There are three main components to a network
- Computers: these are the end things that communicate
- Switches: these operate at link layer, and forward packets as necessary to the right MAC-id device within network.
- Routers: these connect networks, operating at the network layer, forwarding packets to the right IP address.
L3 Internet address architecture
Let's have a look at IP addresses, the L3 network layer addresses.
- Every device connected to the internet has one
- DNS maps host names to them
- They are allocated to users and orgs, usually users just rent internet service provider addresses.
How do they look
- IPv4: look like 255.255.255.255
- IPv6: look like y:y:y:y:y:y:y:y
Network address translation (NAT): self explanatory but these rewrite IP addresses in datagrams as they enter the internet.
Classless Inter-Domain Routing (CIDR): is just moving away from the fixed A, B, C class system of addresses. Now you signify a group of addresses with an integer prefix length giving the number of bits used for the network part, e.g. 192.168.1.0/24.
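A quick way to poke at CIDR prefixes is Python's standard ipaddress module. This is just a sketch; the 192.168.1.0/24 block and the two host addresses are made-up examples.

```python
import ipaddress

# Hypothetical example block: /24 means the first 24 bits are the network part.
network = ipaddress.ip_network("192.168.1.0/24")

print(network.netmask)        # the mask with 24 leading one bits
print(network.num_addresses)  # 2 ** (32 - 24) addresses in the block
print(ipaddress.ip_address("192.168.1.42") in network)
print(ipaddress.ip_address("192.168.2.42") in network)
```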
You can view network interfaces with the ifconfig command.
You can view domain, IP, registrar info with the whois command.
L2 Link layer
The purpose of the link layer is to send and receive IP datagrams. Simple!
- There's a bunch of these
- They transfer protocol data units (PDUs) called frames, which are usually a couple of kB at most (standard Ethernet carries up to 1500 bytes of payload)
Ethernet
Sends Ethernet frames, which have
- Preamble: a known bit pattern that lets the receiver lock onto the timing of the encoded bits so it can read the rest of the frame
- Start Frame Delimiter SFD
- Destination DST (MAC address)
- Source SRC (MAC address)
- Type: what type of data follows, IPv4, IPv6, ARP
- Some tags
- Payload
- CRC: integrity check
You also need to wait 12 bytes' worth of time (the inter-packet gap) before sending your next Ethernet frame.
Full duplex, power save, auto negotiation
Half duplex just means you can send stuff down the cable in one direction at a time; full duplex means you can send both ways at the same time.
Bridges and switches
Bridges and switches connect physical link layer networks. But how does a switch know where to send stuff to the right MAC address? It maintains a mini table and populates it slowly as it receives new frames! This table has time based eviction policy.
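Here's a rough sketch of that learning table. The class name, the port numbers and the 300 second max age are my own assumptions for illustration; a real switch does this in hardware.

```python
import time

class LearningSwitch:
    """Toy MAC learning table with time-based eviction (not a real switch)."""

    def __init__(self, max_age_seconds=300.0, clock=time.monotonic):
        self.max_age = max_age_seconds  # assumed aging time
        self.clock = clock
        self.table = {}  # MAC address -> (port, time last seen)

    def receive(self, in_port, src_mac, dst_mac, all_ports):
        """Learn the source, then return the list of ports to forward the frame to."""
        now = self.clock()
        self.table[src_mac] = (in_port, now)
        # Evict entries that have not been refreshed recently.
        self.table = {mac: (port, seen) for mac, (port, seen) in self.table.items()
                      if now - seen <= self.max_age}
        entry = self.table.get(dst_mac)
        if entry is not None:
            return [entry[0]]
        # Unknown destination: flood to every port except the one it came in on.
        return [p for p in all_ports if p != in_port]
```

The first frame to an unknown MAC gets flooded; once the owner replies, the switch has learned its port and later frames go out of that port only.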
WiFi (IEEE 802.11)
WiFi frames are fairly similar to ethernet frames. One key difference is the addition of a frame control word to specify the type of frame this is, these can be:
- Management Frames: how wifi access points communicate fundamental metadata and establish connections
- Control Frames: For control flow and acknowledgement of frames. WiFi is less reliable than cable so we resend packets if we don't get an ACK.
- Data Frames: Pretty self explanatory, but you can also combine and separate frames into more easily transmittable chunks.
L3 Internet Protocol (IP)
Intro
IP datagrams deliver all the TCP, UDP, ICMP, IGMP data.
- It's all best effort: it does not give a shit about redelivering or handling failures.
- It maintains no connection state.
- It can duplicate or fail to deliver its datagrams, it does not care lmao.
IPv4, IPv6 headers
The IPv4 header is 20 bytes (without options) and contains all the expected metadata like version, checksum, TTL, source and destination etc.
Header (IPv4) | What is it? |
---|---|
Version | Main call out is that the version field sits in the same place in IPv4 and IPv6, but everything else differs. A host handling both is where the name dual stack comes from. |
IHL (Internet Header Length) | The number of 32-bit words in the header |
DSField (Differentiated Services Field) | To help with network congestion, set by routers. How much priority should I get? |
ECN (Explicit Congestion Notification) | To help with network congestion as well, set by routers. |
Total Length | Total length of the IPv4 datagram. It's 16 bits, so the max IPv4 datagram size is 65,535 bytes |
Id | Identifies the datagram so fragments of different datagrams don't get mixed up |
Flags | Don't Fragment (DF) and More Fragments (MF) bits |
Fragment Offset | Where this fragment sits in the original datagram, in 8-byte units |
Time to live | Actually a hop limit, no one actually asserts on the time |
Protocol | What protocol's data are we carrying? (e.g. TCP, UDP, ICMP) |
Header checksum | Self explanatory (not a CRC, it's a simpler Internet checksum) |
Source IP address | 32 bit IP addresses |
Destination IP address | 32 bit IP address |
Options | All proposed IPv4 options are basically not used. IPv6 has a bunch more useful ones: like Jumbo payload, padding, tunnel limits, etc. |
IP data | The meat and potatoes! |
MTU, maximum transmission unit: the largest packet that can be sent over a link without needing to break it down.
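To make the IPv4 header table above concrete, here's a minimal sketch that unpacks the fixed 20 bytes with struct. The function name, dict keys and the hand-built sample header are my own; options (IHL > 5) are ignored.

```python
import socket
import struct

def parse_ipv4_header(raw):
    """Unpack the fixed 20-byte IPv4 header into a dict (options are ignored)."""
    (version_ihl, ds_ecn, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": version_ihl >> 4,
        "ihl_words": version_ihl & 0x0F,
        "ds_field": ds_ecn >> 2,
        "ecn": ds_ecn & 0x03,
        "total_length": total_length,
        "id": ident,
        "flags": flags_frag >> 13,               # 3 bits: reserved, DF, MF
        "fragment_offset": flags_frag & 0x1FFF,  # in 8-byte units
        "ttl": ttl,
        "protocol": protocol,
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hand-built sample header (made-up addresses, checksum left as 0 for brevity).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1234, 0x4000, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
header = parse_ipv4_header(sample)
print(header)
```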
IPv6 extension headers
IPv6 has a bunch of additions to the IPv4 header structure.
Header (IPv6) | What is it? |
---|---|
Routing Header | You can specify nodes you want your datagram to visit before it reaches its end goal. Works by overwriting the destination IP address as you visit each node on the journey. |
Fragment Header | Same idea as IPv4 fragmentation, with a larger (32-bit) identification field. |
IP forwarding
General ordering of handling all this IP stuff is
- I get an IP datagram, from another protocol or network interface
- I check if I'm the destination IP
- If not, I open up my routing table to find the best match for the destination IP
- I crack open the header for the protocol or next header field
- I blast the datagram to the next hop, out of the matching interface
- If I can't find it in my routing table, I discard it and usually send an ICMP error back to the source
So what does this routing table look like?
- Destination
- Mask: 32 bit (IPv4) to scope down the destination to compare with where you need to go
- Next hop: contains IP address of next IP entity you need to send to
- Interface: What is the network interface I actually need to send stuff into to for this destination
How does it work? We just pick the matching masked destination with the most mask bits (longest prefix match).
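A small sketch of that lookup, using the standard ipaddress module. The routes, next hops and interface names below are invented for illustration, and real routers use much faster structures (tries, TCAM) than a linear scan.

```python
import ipaddress

# Hypothetical routing table: (destination network, next hop, interface).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1", "eth0"),  # default route
    (ipaddress.ip_network("10.0.0.0/8"), "10.255.0.1", "eth1"),
    (ipaddress.ip_network("10.1.0.0/16"), "10.1.255.1", "eth2"),
]

def lookup(destination, routes=ROUTES):
    """Return (next hop, interface) of the most specific matching route, or None."""
    address = ipaddress.ip_address(destination)
    matches = [route for route in routes if address in route[0]]
    if not matches:
        return None
    # The route with the longest prefix (most mask bits) wins.
    best = max(matches, key=lambda route: route[0].prefixlen)
    return best[1], best[2]

print(lookup("10.1.2.3"))
print(lookup("8.8.8.8"))
```

10.1.2.3 matches all three routes but the /16 wins; 8.8.8.8 only matches the default route.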
View hops with traceroute -n google.com
Firewalls and Network Address Translation (NAT)
Intro
One problem the internet had was a lot of cyber attacks. To mitigate this we have firewalls, which are just routers that restrict what type of traffic they forward.
Another problem is that IPv4 addresses are running out and they're getting quite expensive. To mitigate this we have NAT (network address translation) gateways, which allow the same private addresses to be reused within many gated networks.
Firewalls
Two types:
- Proxy firewalls: Application layer gateway, terminating connections and creating internal only connections. SOCKS, HTTP proxy are examples.
- Packet filtering firewalls: drops IP datagrams, acts as a router. Use filters and ACLs to control.
Network Address Translation
NAT needs to consume all ingoing and outgoing datagrams so it can rewrite the addresses and fix checksums.
For TCP, there's the three way handshake of SYN, SYN-ACK, ACK. When the NAT sees the first outgoing SYN, it rewrites the private source IP and port to its public IP and a free port, and adds this to its mapping table. If we don't get a SYN-ACK back we can remove the entry, as a connection wasn't established. Or when we see the FINs go by we can clear it up.
UDP doesn't have a handshake or closing, so instead we can just have a time based eviction policy to clean up our table.
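A toy version of that mapping table might look like this. Everything here (class name, the 30 second idle timeout, ports starting at 40000) is an assumption for illustration; a real NAT also rewrites the checksums and tracks TCP state.

```python
import itertools

class NatTable:
    """Toy NAT mapping table with time-based eviction."""

    def __init__(self, public_ip, idle_timeout=30.0, first_port=40000):
        self.public_ip = public_ip
        self.idle_timeout = idle_timeout         # assumed idle timeout in seconds
        self._ports = itertools.count(first_port)
        self.outbound = {}   # (private ip, private port) -> public port
        self.inbound = {}    # public port -> (private ip, private port)
        self.last_used = {}  # public port -> time of last packet

    def translate_out(self, private_ip, private_port, now):
        """Rewrite the source of an outgoing packet, creating a mapping if needed."""
        key = (private_ip, private_port)
        if key not in self.outbound:
            port = next(self._ports)
            self.outbound[key] = port
            self.inbound[port] = key
        port = self.outbound[key]
        self.last_used[port] = now
        return self.public_ip, port

    def translate_in(self, public_port, now):
        """Map an incoming packet back to the private endpoint, or None if unknown."""
        key = self.inbound.get(public_port)
        if key is not None:
            self.last_used[public_port] = now
        return key

    def expire(self, now):
        """Drop mappings that have been idle too long (what UDP relies on)."""
        for port, seen in list(self.last_used.items()):
            if now - seen > self.idle_timeout:
                key = self.inbound.pop(port)
                del self.outbound[key]
                del self.last_used[port]
```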
NAT Traversal
There's a bunch of cool strategies that exist for getting through NAT
- Hairpinning/ NAT loopback: on a private to private connection just do no additional work and just continue to map private to private
- Pinhole/ Hole punching: establish a private to public connection, and then with the new info, establish a public to public connection directly. Skype does this! STUN protocol helps with this.
- TURN (traversal using relays around NAT): just give up going through NAT and just go through a third party server.
- ICE (Interactive Connectivity Establishment): for P2P, tries the candidate paths (direct, STUN, TURN) and establishes a connection over one that works
Broadcasting and Local Multicasting (IGMP and MLD)
Intro
There are 4 kinds of IP addresses
- Unicast
- Anycast
- Multicast
- Broadcast (No IPv6)
The key purpose of multicast and broadcast is to deliver packets to multiple places and to discover servers or clients.
How do L2 link layers efficiently handle multicast and broadcast?
The main difference between multicast and broadcast is that multicast only involves those that support a specific service or protocol.
Usually only UDP does multicasting. TCP is for one to one connections.
Broadcasting
The data is simply delivered to every receiver on the network. The all-ones address (255.255.255.255) is the broadcast address, or you can use the last address in a subnet to broadcast to just that subnet.
Multicasting
Instead of sending data to all people, let's just send data to anyone who is interested in it. Hosts and routers maintain state on whether they're interested.
- People join a group by sending an IGMP message to a router
- The router gets a datagram for a multicast address (224.0.0.0 to 239.255.255.255, which maps to MAC addresses starting 01:00:5e)
- It will blast it to any subscribed people
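For the link layer side, an IPv4 multicast group maps onto an Ethernet multicast MAC by taking the 01:00:5e prefix and copying in the low 23 bits of the group address. A small sketch (the function name is my own):

```python
import ipaddress

def multicast_mac(group):
    """Map an IPv4 multicast group address to its Ethernet multicast MAC address."""
    address = ipaddress.IPv4Address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not in 224.0.0.0/4")
    low23 = int(address) & 0x7FFFFF  # only the low 23 bits fit into the MAC
    mac = (0x01005E << 24) | low23
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))

print(multicast_mac("224.0.0.1"))
print(multicast_mac("239.255.255.255"))
```

Since only 23 of the 28 group bits survive, several groups share one MAC (224.0.0.1 and 224.128.0.1, for example), so hosts still have to filter at the IP layer.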
Use netstat -rn to view your routing table.
User Datagram Protocol (UDP) and IP Fragmentation
Intro
UDP provides a datagram oriented L4 transport layer protocol that preserves message boundaries and carries a checksum. It does not provide error correction, sequencing, duplicate elimination, flow control or congestion control.
The UDP datagram looks like this, with the UDP header and data stuffed into the data slot of the IP datagram.
- IPv4 header
- UDP header
- Source port number
- Destination port number
- Length
- Checksum
- UDP data
UDP checksum
The checksum is computed over the UDP data, the UDP header, and some of the IPv4 header (a pseudo-header with the source and destination addresses, protocol and length). This is why NAT gateways need to edit at the L3 IP layer and the L4 transport layer as well, so they can update this checksum.
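Here's a sketch of how that checksum could be computed for UDP over IPv4: a 16-bit ones' complement sum over a pseudo-header (addresses, protocol, length), the UDP header and the data. Function names and the sample addresses used with it are my own.

```python
import socket
import struct

def internet_checksum(data):
    """16-bit ones' complement checksum, as used by IP, UDP and TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload):
    """Checksum over the IPv4 pseudo-header, the UDP header and the UDP data."""
    length = 8 + len(payload)  # UDP header is 8 bytes
    pseudo_header = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
                     + struct.pack("!BBH", 0, socket.IPPROTO_UDP, length))
    udp_header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = internet_checksum(pseudo_header + udp_header + payload)
    # In UDP over IPv4 a checksum field of 0 means "not computed", so 0 is sent as 0xFFFF.
    return checksum or 0xFFFF
```

Because the source and destination IPs are part of the sum, rewriting either address changes the checksum, which is exactly the extra work a NAT has to do.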
UDP and IPv6
There exists a Teredo project to tunnel IPv6 over UDP on IPv4, because IPv6 support has been slow to roll out.
IP fragmentation
When a packet is too big, IP will fragment it into smaller pieces. In IPv4 this can happen at the source or any intermediate router; in IPv6 this happens only at the source. A major issue with this for UDP is that if any one fragment is lost, you can't reassemble the datagram at all!
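A simplified sketch of fragmentation and reassembly, assuming a 20 byte IP header and representing each fragment as a plain tuple rather than a real packet. The helper names are my own, and duplicate or overlapping fragments are not handled.

```python
def fragment(payload, mtu, header_len=20):
    """Split an IP payload into (offset in 8-byte units, more_fragments, data) tuples."""
    # Every fragment except the last must carry a multiple of 8 bytes.
    chunk = (mtu - header_len) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for start in range(0, len(payload), chunk):
        data = payload[start:start + chunk]
        more = start + chunk < len(payload)
        fragments.append((start // 8, more, data))
    return fragments

def reassemble(fragments):
    """Rebuild the payload, or return None if any fragment is missing."""
    ordered = sorted(fragments, key=lambda f: f[0])
    if not ordered or ordered[-1][1]:
        return None  # the final fragment (more_fragments == False) never arrived
    expected = 0
    for offset, _more, data in ordered:
        if offset * 8 != expected:
            return None  # a hole in the middle: the whole datagram is lost
        expected += len(data)
    return b"".join(data for _offset, _more, data in ordered)
```

With a 1500 byte MTU each fragment carries up to 1480 bytes, and dropping any one of them makes reassemble return None.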
Path MTU Discovery with UDP
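Use Internet Control Message Protocol (ICMP), that's just a message a router will send back to you to tell you stuff like your packet is too big or I can't get to the destination. The sender marks its datagrams as don't fragment and shrinks them whenever a "too big" message comes back.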
UDP server design
Something to note is that the server is typically handed just the UDP data block; the IP and UDP headers are often stripped. So if you need them you'll need to keep that in mind.
UDP in the internet
UDP looks to account for 10 - 40% of internet traffic, and much of that use looks to be media playing and tunneling.
Name resolution and Domain Name System (DNS)
Intro
Remembering IP addresses sounds awful, so we have a big hierarchical database that maps host names to IP addresses.
DNS name space
DNS names are organised in a hierarchical namespace: top level domains, subdomains, and the labels that make up each name.
Name servers and zones
So what if you're managing a portion of the name space? You need some name servers for your "zone". And you will have delegation records to hand over smaller subtree zones to other name servers.
Caching
Most name servers, outside of some root TLD servers will cache zone info as they learn, with a TTL eviction policy. Each DNS record, name to IP address mapping has a TTL.
DNS protocol
Two sides to the protocol
- Hitting the DNS: standard requests
- Controlling DNS: zone transfers, DNS notify.
Resource Record | What is it? |
---|---|
A, AAAA | Address Records, map a name to an IP (A for IPv4, AAAA for IPv6) |
NS | Name server, what are the authoritative name servers for a domain |
CNAME | Canonical Name records, these are aliases to point to other resource records! |
SOA | Start of Authority records, mark the start of a zone and carry its primary name server, admin contact, serial number and timers |
PTR | Reverse DNS lookup queries, Pointer queries. Lets you do a reverse lookup. |
MX | Mail exchanger records: which mail servers accept email for a domain |
TXT | any text, such as anti spam for email, or verifying ownership |
SRV | Service Records, like a general MX, you can specify what kind of protocol, ports a service supports |
NAPTR | Name authority pointer records, more complex mappings |
OPT | Allows extra features |
Query DNS with dig
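To see what a standard request looks like on the wire, here's a rough sketch that builds the bytes of a DNS query for an A record. The function name and the 0x1234 query id are arbitrary choices of mine, and nothing is actually sent.

```python
import struct

QTYPE_A = 1    # A record
QCLASS_IN = 1  # Internet class

def build_dns_query(hostname, query_id=0x1234, qtype=QTYPE_A):
    """Build the bytes of a standard DNS query for one name."""
    # Header: id, flags (only "recursion desired" set), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is sent as length-prefixed labels, ended by a zero-length label.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)

query = build_dns_query("example.com")
print(query.hex())
```

Sending these bytes in a UDP datagram to port 53 of a resolver is roughly what dig does for you under the hood.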
Transmission Control Protocol (TCP) basics
Intro
IP and UDP do no error correction.
There are four categories of communication failures
- Packet bit errors: fixed with error correcting codes
- Packet reordering: fixed with sequence numbers
- Packet duplication: fixed with sequence numbers
- Packet erasure: retry based on an estimate of round trip time
Intro to TCP
UDP provides a packet sending interface; TCP instead provides a connection oriented interface, where you send and get byte streams.
- TCP breaks up this byte stream into packets
- Numbers these packets
- Wraps these packets (segments) in IP datagrams
- And repackages these at the other side back into a byte stream
- TCP waits for acknowledgement of packets, and if it doesn't get it it'll retransmit the packets.
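The chunk, number and repackage steps above can be sketched like this. It's a simplification with made-up helper names: real sequence numbers wrap at 2^32, and ACKs and retransmission are left out.

```python
def segment(stream, mss, initial_seq=0):
    """Cut a byte stream into (sequence number, data) segments of at most mss bytes."""
    return [(initial_seq + i, stream[i:i + mss]) for i in range(0, len(stream), mss)]

def reassemble_stream(segments, initial_seq=0):
    """Rebuild the in-order byte stream from segments that may be reordered or duplicated."""
    received = {}
    for seq, data in segments:
        received.setdefault(seq, data)  # a duplicate has a sequence number we already hold
    stream = b""
    next_seq = initial_seq
    while next_seq in received:  # stop at the first hole
        stream += received[next_seq]
        next_seq += len(received[next_seq])
    return stream, next_seq  # next_seq is what the receiver would ACK
```

Reordering and duplication are both handled by the sequence numbers alone; a missing segment just leaves the receiver stuck ACKing the start of the hole.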
TCP header and encapsulation
Unsurprisingly TCP has a header that wraps its TCP data in each IP datagram.
Header | What is it? |
---|---|
Source port | just a port |
Destination port | just a port (an IP address plus a port is a "socket") |
Sequence number | Position in the byte stream of the first byte of data in this segment |
Acknowledgement Number | The next sequence number the sender of this segment expects to receive |
Header Length | Length of the TCP header in 32-bit words |
Resv | Reserved bits, set to zero |
CWR | There's a bunch of bit fields defined in TCP. This one is congestion window reduced: the "I've slowed down" bit |
ECE | ECN echo: sender received an earlier congestion notification |
URG | Urgent: urgent pointer field is valid |
ACK | Acknowledgement number field is valid; on for every segment once a connection is established |
PSH | receiver should push this data asap |
RST | reset the connection |
SYN | synchronise sequence numbers to initiate a connection |
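FIN | I've finished sending all my data |
Window Size | How many bytes the sender of this segment is willing to receive. This is the sliding window that opens back up as data is ACKed and consumed |
TCP checksum | Similar to UDP, spans the TCP header, data and some IP header fields |
Urgent Pointer | not used much |
Options | Optional extras such as max segment size, selective acknowledgement, window scale and timestamps |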
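As a sketch of how the TCP header table above maps onto bytes, this unpacks the fixed 20 byte header and decodes the flag bits. The function name and dict keys are my own, and options are not decoded.

```python
import struct

FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]  # lowest bit first

def parse_tcp_header(raw):
    """Unpack the fixed 20-byte TCP header; options (if any) are not decoded."""
    (src_port, dst_port, seq, ack, offset_reserved, flags,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", raw[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len_bytes": (offset_reserved >> 4) * 4,  # field counts 32-bit words
        "flags": [name for bit, name in enumerate(FLAG_NAMES) if flags & (1 << bit)],
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }

# Hand-built SYN-ACK with made-up ports and numbers (checksum left as 0).
sample = struct.pack("!HHIIBBHHH", 443, 50000, 1000, 2001, 5 << 4, 0x12, 65535, 0, 0)
print(parse_tcp_header(sample))
```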
TCP connection management
Intro
UDP is a connectionless protocol.
TCP is a connection oriented protocol. TCP will detect and repair all data transfer problems, like packet loss, duplication and errors.
TCP connection establishment and termination
A TCP connection is identified by a pair of endpoints, each one an IP address and port.
TCP has three phases
- Setup a connection
- Client sends a SYN segment with the port it wants to connect to and the client's initial sequence number (ISN)
- Server sends a SYN segment with its own ISN. AND it ACKs the client's message, by returning the client's ISN + 1.
- Client ACKs the server's segment, by returning the server's ISN + 1.
- Transfer of data
- Closing a connection
- Client sends a FIN segment.
- Server ACKs the client's FIN segment by returning the FIN's sequence number + 1.
- Server sends a FIN.
- Client ACKs the FIN.
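The sequence and acknowledgement numbers in the setup exchange above can be written out as a tiny sketch. The function and tuple layout are my own; the numbers are taken modulo 2^32 because the field is 32 bits wide.

```python
SEQ_SPACE = 2 ** 32  # sequence numbers are 32-bit and wrap around

def three_way_handshake(client_isn, server_isn):
    """Return the (sender, flags, seq, ack) tuples exchanged during connection setup."""
    return [
        ("client", "SYN", client_isn, None),
        ("server", "SYN-ACK", server_isn, (client_isn + 1) % SEQ_SPACE),
        ("client", "ACK", (client_isn + 1) % SEQ_SPACE, (server_isn + 1) % SEQ_SPACE),
    ]

for step in three_way_handshake(client_isn=100, server_isn=300):
    print(step)
```

The SYN itself consumes one sequence number, which is why each side ACKs the other's ISN + 1 even though no data has been sent yet.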
When TCP gets a segment there are two things it requires
- Valid checksum
- A sequence number that is within its sliding window
ARP: address resolution protocol maps IP addresses to MAC addresses. It's an L2 link layer protocol. Works by broadcasting "who has this IP?", getting a response with the MAC, and done.
TCP options
TCP has a bunch of options, here's some
- Max Segment Size: yup it's written on the tin
- Selective Acknowledgement: by default ACKs are cumulative, so when you have holes it's a problem. With SACK the receiver can report the out-of-order blocks it already has (up to 3 or 4 per segment) so the sender knows which holes to patch.
- Window Scale: Lets us increase our sliding window
- Timestamp options: lets you add some telemetry info of timestamps to get the round trip time. And avoid crappy issues with sequence numbers wrapping around.
- User timeout: lets you tell the other guy your timeouts.
- Auth: lets you authenticate TCP segments with hashes.
TCP server operation
How does a TCP server usually operate?
The usual flow is: a TCP connection request arrives at a server, the server accepts the connection and hands it over to a new process or thread to handle the client.
Usually the Berkeley sockets API is used, and this has a queue per listening endpoint of connections that are about to be established. Main call out is that the application API already has the three way handshake abstracted away.
TCP timeout and retransmission
Intro
How does TCP retransmit data? It does this with two main strategies.
- Timeout based retransmission: correcting against basic packet loss
- Fast retransmission: correcting against an older packet being lost, which needs to be resent quickly before sending any new ones
Setting the retransmission timeout
The retransmission timeout is how long a device waits for an ACK before it resends a segment. There's a bunch of ways to estimate what's a good number here using the round trip time estimate (and derivations such as a smoothed metric and a variance metric).
- One sharp edge here is that if you've resent a segment there's no way to figure out if a returning ACK is for the first or second copy you sent.
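One common way to do the smoothed RTT plus variance estimate is the scheme from RFC 6298. Below is a minimal sketch of it; the class name is my own, clock granularity is ignored, and it assumes the caller skips samples from retransmitted segments (Karn's algorithm) to dodge the ambiguity above.

```python
class RtoEstimator:
    """Smoothed RTT and variance estimator in the style of RFC 6298."""

    ALPHA = 1 / 8  # weight of a new sample in the smoothed RTT
    BETA = 1 / 4   # weight of a new sample in the RTT variance
    MIN_RTO = 1.0  # seconds, lower bound on the timeout

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample):
        """Feed one RTT sample (seconds) and return the new retransmission timeout."""
        if self.srtt is None:
            # First measurement seeds both estimates.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        return max(self.MIN_RTO, self.srtt + 4 * self.rttvar)
```

A jittery path inflates the variance term and so the timeout, while a steady path lets the timeout settle down towards the floor.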
Fast retransmit
This is when the receiver indicates that it has a hole, via duplicate ACKs (optionally carrying SACK info), and we gotta fill it in ASAP without waiting for the timeout.
TCP data flow and window management
Intro
Users have a variety of needs. For interactive uses we want segments to be emitted immediately, even if that uses space poorly; for large data transfers we'd want the opposite. How does TCP handle this stuff?
Interactive communication
TCP handles both of these types of data with the same protocol but with different algos
- interactive data: 10s of bytes
- bulk data: 1k bytes
Interactive communication such as ssh is quite surprising: the client sends a packet for every keystroke!
Delayed acknowledgements
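With these interactive uses it's kind of a waste to send an equivalent ACK for every keystroke segment, so TCP will delay and batch these, sending one cumulative ACK for multiple segments. The delay also gives the ACK a chance to ride along with data going the other way, which is often called piggybacking.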
Nagle algorithm
Nagle algo is a way of self regulating the number of mini segments you send depending on the network you're on (the round trip time!).
Works by just waiting/queueing small writes until all previously dispatched segments are ACKed.
This can be an issue if the other side is also delaying its ACKs, which means you can have car crashes where both the server and client are waiting pointlessly. Use TCP_NODELAY to stop it.
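In Python, turning Nagle off for a socket looks roughly like this (a minimal sketch on an unconnected socket; you'd normally set it on the socket you're about to connect or have just accepted):

```python
import socket

# Create a TCP socket and turn Nagle's algorithm off for it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it stuck.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)
sock.close()
```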
Flow control and window management
TCP flow control is done by the receiver advertising in each segment how much window space it has left. A sharp edge here is that if the receiver window is constantly close to fully filled, each time the sender sends a segment to fill it this segment will be comically small, a poor spend of bandwidth (this is known as silly window syndrome).
TCP congestion control
Intro
How does TCP prevent the network from being overwhelmed when making large bulk data transfers? The main fix is slowing down, but how do we recognise when we should do that and when we can start sending normally again?
The main problem is that there's usually no explicit signal of network congestion. We need to infer it from e.g. a spike in packet loss.
Standard algorithms
The basic strategy is that you have an advertised window size from the receiver, plus a congestion window size on the sender side which you tune, and you just use the min of the two.
To get this sender side window estimate, the basic strategy is to start low and ratchet up the window until you see packet loss. (exponentially increase, known as slow start)
After you've found an initial estimate we usually move to steady state operation where we very incrementally increase the window based on successfully sent data. (incrementally adjust, known as congestion avoidance)
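Here's a toy simulation of the two phases. It counts the window in whole segments per round trip and, on loss, drops back to one segment (a Tahoe-like simplification I've picked for brevity); the function names and parameters are my own.

```python
def simulate_cwnd(rounds, ssthresh, loss_rounds=(), mss=1):
    """Return the congestion window (in segments) at the start of each round trip."""
    cwnd = mss
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Loss is taken as the congestion signal: halve the threshold and restart.
            ssthresh = max(cwnd // 2, 2 * mss)
            cwnd = mss
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double every round trip
        else:
            cwnd += mss  # congestion avoidance: one more segment per round trip
    return history

def send_window(cwnd, advertised_window):
    """The sender keeps at most min(cwnd, receiver window) unacknowledged data in flight."""
    return min(cwnd, advertised_window)

print(simulate_cwnd(8, ssthresh=8))
print(simulate_cwnd(8, ssthresh=8, loss_rounds=(4,)))
```

The first run doubles up to the threshold and then creeps up linearly; the second shows the window collapsing after a loss in round 4 and climbing back towards a lower threshold.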
TCP keep alive
Intro
Funnily TCP has no polling, so an idle connection can just exist forever lmao.
TCP keepalive is this probing functionality, that has been jammed into TCP. Helpful for detecting dead clients and removing any held resources. You send a keepalive probe and the other device ACKs your probe. Done.
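Enabling keepalive on a socket from Python looks roughly like this. SO_KEEPALIVE is the portable switch; the tuning knobs are platform specific (they exist on Linux), so this sketch only sets the ones the socket module exposes, and the 60/10/5 values are arbitrary examples.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Optional tuning, only where the platform exposes it:
# idle seconds before the first probe, seconds between probes, probes before giving up.
for name, value in (("TCP_KEEPIDLE", 60), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 5)):
    if hasattr(socket, name):
        sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)

keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_enabled)
sock.close()
```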