INTRO
This post covers the TCP/IP and UDP/IP paths on kernel 5.10+, with interactive diagrams.
Let’s go:
Interactive Diagrams
The diagrams below trace the major kernel functions on both the egress (TX) and ingress (RX) paths, plus the sk_buff buffer layout and netfilter/eBPF hook points.
Egress Path (TX)
1. Userspace → Socket Layer
A process calls write(), send(), or sendto() on a socket fd. The VFS dispatches through sock_sendmsg(), which pulls the struct sock from the fd, attaches credentials from task_struct (PID/UID/GID), runs LSM hooks (SELinux/AppArmor), then dispatches to the transport protocol via the sk_prot->sendmsg function pointer.
The INDIRECT_CALL_INET macro is an optimization that avoids indirect branch prediction penalties by hardcoding checks for the two most common protocols (TCP/UDP) before falling through to the generic indirect call.
2. Transport Layer — TCP
tcp_sendmsg() first checks that the connection is ESTABLISHED, then iterates over the user buffer, allocating sk_buff structures sized to MSS.
Data is copied from userspace via skb_add_data_nocache() (into existing tail room) or by allocating new pages.
Segments are enqueued to the socket write queue (sk->sk_write_queue).
Then tcp_push() → tcp_write_xmit() walks the queue, applying congestion window (cwnd) and receiver window (rwnd) constraints, setting retransmission timers, building the TCP header (seq/ack/flags/window/options), computing the checksum (or deferring it to hardware via CHECKSUM_PARTIAL), and finally calling ip_queue_xmit() through the icsk_af_ops->queue_xmit function pointer.
Transport Layer — UDP
udp_sendmsg() is simpler: it resolves the route via ip_route_output_flow(), handles corking (UDP_CORK — batching multiple sendmsg() calls into one IP datagram) vs non-corking (immediate send), builds the UDP header, computes the checksum, and calls ip_make_skb() + udp_send_skb() which hands off to IP.
3. IP Layer
__ip_queue_xmit() (TCP path) or ip_push_pending_frames() (UDP path) handles route lookup via the FIB (Forwarding Information Base — the compiled routing table).
If the route is cached in skb->_skb_refdst, the lookup is skipped.
The IP header is constructed (version, IHL, TOS, TTL, protocol, src/dst addresses).
Then:
- The NF_INET_LOCAL_OUT netfilter hook fires: this is where iptables/nftables OUTPUT chain rules execute, and where conntrack begins tracking the flow.
- ip_output() fires NF_INET_POST_ROUTING: the POSTROUTING chain (SNAT/masquerade happens here).
- ip_finish_output() checks the MTU and fragments if necessary via ip_fragment(). Fragmentation is avoided if the DF bit is set (PMTUD).
- The neighbor subsystem resolves the L2 address: neigh_resolve_output() does an ARP lookup (or uses the neighbor cache). If no ARP reply exists yet, the skb is queued in the neighbor's arp_queue pending resolution.
- The Ethernet header is pushed onto the skb.
4. Qdisc / Device Layer
dev_queue_xmit() sets skb->mac_header, then enters the queueing discipline (qdisc).
The default is pfifo_fast (or fq_codel on modern distros). __qdisc_run() dequeues skbs, runs validate_xmit_skb() (VLAN tag insertion, software GSO segmentation when the hardware lacks TSO, checksum finalization), then calls the driver's ndo_start_xmit().
The skb is placed in the TX ring buffer (typically a DMA-mapped ring descriptor).
The driver writes the descriptor and pokes the NIC’s doorbell register (MMIO write) to trigger transmission.
Bypass paths:
dev_direct_xmit() is used by XDP and AF_XDP to skip the qdisc entirely.
XDP_TX reflects a packet at the driver level without ever going up the stack.
TC egress (sch_handle_egress() → tcf_classify()) runs tc-BPF or u32/flower classifiers in __dev_queue_xmit(), before the skb is handed to the qdisc.
Ingress Path (RX)
1. NIC → Driver
The NIC DMAs the packet into a pre-allocated ring buffer (RX ring), writes the descriptor with metadata (length, checksum status, RSS hash), and raises a hardware interrupt (or MSI-X vector).
The driver's ISR calls napi_schedule() to schedule NAPI polling, then masks the interrupt. This interrupt mitigation is critical: without NAPI, per-packet interrupts would kill throughput.
2. NAPI Poll / Driver → netdev
In softirq context (NET_RX_SOFTIRQ), napi_poll() calls the driver’s poll function, which walks the RX ring, allocates sk_buff structures, fills in metadata (protocol via eth_type_trans(), device, rx hash), and calls napi_gro_receive().
GRO (Generic Receive Offload) coalesces multiple TCP segments into a single large skb before passing it up, reducing per-packet overhead.
3. netif_receive_skb()
- skb->mac_header is set and the Ethernet header is pulled.
- af_packet sockets (tcpdump/libpcap) get a clone here via deliver_skb() to all registered ptype_all handlers.
- sch_handle_ingress() runs if a clsact/ingress qdisc is attached; this is a major eBPF hook point (BPF_PROG_TYPE_SCHED_CLS).
- VLAN-tagged frames are dispatched to the correct VLAN sub-interface.
- rx_handler() can steal the packet if the interface is enslaved to a bridge or has a registered rx_handler.
- Protocol demux dispatches to ip_rcv() based on skb->protocol (ETH_P_IP).
4. IP Layer
ip_rcv() validates the IP header (version == 4, IHL >= 5, total length consistent, header checksum), sets skb->transport_header, fires NF_INET_PRE_ROUTING (PREROUTING chain, DNAT, conntrack).
ip_rcv_finish() does the route lookup via ip_route_input_noref() → FIB lookup.
The routing decision sets skb->dst->input to one of:
- ip_local_deliver(): the packet is for us.
- ip_forward(): the packet needs forwarding (decrements TTL, fires the NF_INET_FORWARD hook).
- ip_mr_input(): multicast routing.
For local delivery: ip_defrag() reassembles fragments, NF_INET_LOCAL_IN fires (INPUT chain), then ip_local_deliver_finish() strips the IP header and dispatches to the transport protocol handler.
5. Transport Layer — TCP
tcp_v4_rcv() validates the TCP header, verifies the checksum, looks up the socket via __inet_lookup_skb() (established hash table first, then listener table).
For established connections: tcp_v4_do_rcv() → tcp_rcv_established().
- Fast path: header prediction holds (expected next seq/ack, no special flags, window unchanged), so data is copied directly to userspace or queued to sk->sk_receive_queue.
- Slow path: out-of-order segments, SACK processing, ECN, urgent data; the segment gets full state-machine treatment.
For SYN packets to listeners: tcp_v4_cookie_check() (SYN cookies if SYN flood), tcp_check_req() for 3WHS completion.
Transport Layer — UDP
udp_rcv() → __udp4_lib_rcv(): socket lookup by destination port, checksum verification, udp_queue_rcv_skb() enqueues to sk->sk_receive_queue, sk->sk_data_ready() wakes the reader.
6. Socket Layer / Userspace Read
read() → sock_recvmsg() → inet_recvmsg() → tcp_recvmsg() / udp_recvmsg().
Data is copied from skb(s) in the receive queue to the userspace buffer. The skbs are freed after consumption.
References
Primary
- Stephan & Wüstrich, The Path of a Packet Through the Linux Kernel (TUM, 2024)
- PackageCloud, Monitoring and Tuning the Linux Networking Stack: Receiving Data
- PackageCloud, Monitoring and Tuning the Linux Networking Stack: Sending Data
- PackageCloud, Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data
Official Kernel Documentation
- Networking subsystem index
- NAPI documentation
- sk_buff documentation
- Kernel networking API
- /proc/sys/net/ tuning parameters
- Kernel source browser (v5.10.8)
Linux Foundation Wiki
Academic Papers
- Høiland-Jørgensen et al., The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel (ACM CoNEXT ‘18)
- Cai et al., Understanding Host Network Stack Overheads (ACM SIGCOMM ‘21) — PDF
- Chimata, Path of a Packet in the Linux Kernel Stack (2005, kernel 2.6.11)
Community
- Linux Network Performance Ultimate Guide
- Path of a Received Packet in the Kernel — Overview (Sheharyaar, 2024)
- DaveM’s Linux Networking Blog — GRO internals
