Linux Network Packets Path

April 4, 2026

INTRO

This post covers the TCP/IP and UDP/IP paths on kernel 5.10+, with interactive diagrams.

Let’s go:

Interactive Diagrams

The diagrams below trace the major kernel functions on both the egress (TX) and ingress (RX) paths, plus the sk_buff buffer layout and netfilter/eBPF hook points.



Egress Path (TX)

1. Userspace → Socket Layer

A process calls write(), send(), or sendto() on a socket fd. The VFS dispatches through sock_sendmsg(), which pulls the struct sock from the fd, attaches credentials from task_struct (PID/UID/GID), runs LSM hooks (SELinux/AppArmor), then dispatches to the transport protocol via the sk_prot->sendmsg function pointer.
The INDIRECT_CALL_INET macro is an optimization that avoids indirect branch prediction penalties by hardcoding checks for the two most common protocols (TCP/UDP) before falling through to the generic indirect call.

2. Transport Layer — TCP

tcp_sendmsg() first checks that the connection is ESTABLISHED, then iterates over the user buffer, allocating sk_buff structures sized to MSS.
Data is copied from userspace via skb_add_data_nocache() (into existing tail room) or by allocating new pages.
Segments are enqueued to the socket write queue (sk->sk_write_queue).
Then tcp_push() → tcp_write_xmit() walks the queue, applying congestion window (cwnd) and receiver window (rwnd) constraints, setting retransmission timers, building the TCP header (seq/ack/flags/window/options), computing the checksum (or deferring it to hardware via CHECKSUM_PARTIAL), and finally calling ip_queue_xmit() through the icsk_af_ops->queue_xmit function pointer.

Transport Layer — UDP

udp_sendmsg() is simpler: it resolves the route via ip_route_output_flow(), handles corking (UDP_CORK — batching multiple sendmsg() calls into one IP datagram) vs non-corking (immediate send), builds the UDP header, computes the checksum, and calls ip_make_skb() + udp_send_skb() which hands off to IP.

3. IP Layer

__ip_queue_xmit() (TCP path) or ip_push_pending_frames() (UDP path) handles route lookup via the FIB (Forwarding Information Base — the compiled routing table).
If the route is cached in skb->_skb_refdst, the lookup is skipped.
The IP header is constructed (version, IHL, TOS, TTL, protocol, src/dst addresses).
Then:

  • NF_INET_LOCAL_OUT netfilter hook fires — this is where iptables/nftables OUTPUT chain rules execute, and where conntrack begins tracking the flow.
  • ip_output() fires NF_INET_POST_ROUTING — the POSTROUTING chain (SNAT/masquerade happens here).
  • ip_finish_output() checks the MTU and fragments if necessary via ip_fragment(). If the DF bit is set, an oversized packet is not fragmented: it is dropped and an ICMP "Fragmentation Needed" error is generated instead, which is the mechanism PMTUD relies on.
  • Neighbor subsystem resolves L2 address: neigh_resolve_output() does ARP lookup (or uses the neighbor cache). If no ARP reply exists, the skb is queued in the neighbor’s arp_queue pending resolution.
  • The Ethernet header is pushed onto the skb.

4. Qdisc / Device Layer

dev_queue_xmit() sets skb->mac_header, then enters the queueing discipline (qdisc).
The default is pfifo_fast (or fq_codel on modern distros). __qdisc_run() dequeues skbs, runs validate_xmit_skb() (VLAN tag insertion, software GSO segmentation for skbs the hardware can't TSO, checksum finalization), then calls the driver’s ndo_start_xmit().
The skb is placed in the TX ring buffer (typically a DMA-mapped ring descriptor).
The driver writes the descriptor and pokes the NIC’s doorbell register (MMIO write) to trigger transmission.

Bypass paths:
dev_direct_xmit() is used by XDP and AF_XDP to skip the qdisc entirely.
XDP_TX reflects a packet at the driver level without ever going up the stack.
TC egress (sch_handle_egress() → tcf_classify()) runs tc-BPF or u32/flower classifiers inside __dev_queue_xmit(), before the skb is handed to the qdisc.


Ingress Path (RX)

1. NIC → Driver

The NIC DMAs the packet into a pre-allocated ring buffer (RX ring), writes the descriptor with metadata (length, checksum status, RSS hash), and raises a hardware interrupt (or MSI-X vector).
The driver’s ISR calls napi_schedule() to schedule NAPI polling, then masks the interrupt. This interrupt mitigation is critical: without NAPI, per-packet interrupts would kill throughput at high packet rates.

2. NAPI Poll / Driver → netdev

In softirq context (NET_RX_SOFTIRQ), napi_poll() calls the driver’s poll function, which walks the RX ring, allocates sk_buff structures, fills in metadata (protocol via eth_type_trans(), device, rx hash), and calls napi_gro_receive().

GRO (Generic Receive Offload) coalesces multiple TCP segments into a single large skb before passing it up, reducing per-packet overhead.

3. netif_receive_skb()

  • skb->mac_header is set, Ethernet header is pulled.
  • af_packet sockets (tcpdump/libpcap) get a clone here via deliver_skb() to all registered ptype_all handlers.
  • tc_ingress() runs if a clsact/ingress qdisc is attached — this is a major eBPF hook point (BPF_PROG_TYPE_SCHED_CLS).
  • VLAN tagged frames are dispatched to the correct VLAN sub-interface.
  • rx_handler() can steal the packet if the interface is enslaved to a bridge or has a registered rx_handler.
  • protocol demux dispatches to ip_rcv() based on skb->protocol (ETH_P_IP).

4. IP Layer

ip_rcv() validates the IP header (version == 4, IHL >= 5, total length consistent, header checksum), sets skb->transport_header, fires NF_INET_PRE_ROUTING (PREROUTING chain, DNAT, conntrack).
ip_rcv_finish() does the route lookup via ip_route_input_noref() → FIB lookup.
The routing decision sets the dst’s input function (skb_dst(skb)->input) to one of:

  • ip_local_deliver() — packet is for us.
  • ip_forward() — packet needs forwarding (decrements TTL, fires NF_INET_FORWARD hook).
  • ip_mr_input() — multicast routing.

For local delivery: ip_defrag() reassembles fragments, NF_INET_LOCAL_IN fires (INPUT chain), then ip_local_deliver_finish() strips the IP header and dispatches to the transport protocol handler.

5. Transport Layer — TCP

tcp_v4_rcv() validates the TCP header, verifies the checksum, looks up the socket via __inet_lookup_skb() (established hash table first, then listener table).
For established connections: tcp_v4_do_rcv() → tcp_rcv_established().

  • Fast path: header prediction (predicted next seq/ack, no special flags, window unchanged) → data is queued straight to sk->sk_receive_queue with minimal processing.
  • Slow path: out-of-order segments, SACK processing, ECN, urgent data — full state machine treatment.

For SYN packets to listeners: tcp_v4_cookie_check() (SYN cookies if SYN flood), tcp_check_req() for 3WHS completion.

Transport Layer — UDP

udp_rcv() → __udp4_lib_rcv(): socket lookup by destination port, checksum verification, udp_queue_rcv_skb() enqueues to sk->sk_receive_queue, and sk->sk_data_ready() wakes the reader.

6. Socket Layer / Userspace Read

read() → sock_recvmsg() → inet_recvmsg() → tcp_recvmsg() / udp_recvmsg().
Data is copied from skb(s) in the receive queue to the userspace buffer. The skbs are freed after consumption.
