Networking System Calls

When you create a socket using the socket() system call, the kernel returns a file descriptor whose file_operations structure is assigned as follows:

static const struct file_operations socket_file_ops = {
    [...]
    .aio_read  = sock_aio_read,
    .aio_write = sock_aio_write,
    [...]
};

When we write to this socket using the write() system call, the kernel dispatches through these file operations:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, ...)
{
    struct file *file;
    [...]
    file = fget_light(fd, &fput_needed);
    [...] ===>
    ret = file->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
    [...]
}
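To make the entry point concrete, here is a minimal userspace sketch: a plain write() on a connected TCP socket descriptor is what lands in sock_aio_write() via socket_file_ops above. The destination 127.0.0.1:7 (the echo port) is just an example.

/* Minimal sketch: write() on a socket fd enters the path traced above. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* creates struct socket/sock */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(7) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Plain write() on the socket fd: dispatched through f_op->aio_write
     * on the kernel version quoted above. */
    const char msg[] = "hello";
    if (write(fd, msg, sizeof(msg) - 1) < 0)
        perror("write");

    close(fd);
    return 0;
}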

For every socket created by a userspace application there is a corresponding struct socket and a struct sock. By kernel naming convention:

sock --> struct socket
sk   --> struct sock
static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, ..)
{
    [...]
    struct socket *sock = file->private_data;
    [...] ===>
    return sock->ops->sendmsg(iocb, sock, msg, size);
    [...]
}
struct socket {
    [...]
    struct file *file;
    struct sock *sk;
    const struct proto_ops *ops;
};
const struct proto_ops inet_stream_ops = {
    .family = PF_INET,
    [...]
    .connect = inet_stream_connect,
    .accept = inet_accept,
    .listen = inet_listen,
    .sendmsg = tcp_sendmsg,
    .recvmsg = inet_recvmsg,
    [...]
};

The sendmsg call is dispatched to a protocol-specific function through proto_ops; for TCP over IPv4, inet_stream_ops maps it to tcp_sendmsg.

int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size)
{
    struct sock *sk = sock->sk;
    struct iovec *iov;
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    [...]
    mss_now = tcp_send_mss(sk, &size_goal, flags);

    /* Ok commence sending. */
    iovlen = msg->msg_iovlen;
    iov = msg->msg_iov;
    copied = 0;
    [...]
    while (--iovlen >= 0) {
        int seglen = iov->iov_len;
        unsigned char __user *from = iov->iov_base;

        iov++;
        while (seglen > 0) {
            int copy = 0;
            int max = size_goal;
            [...]
            skb = sk_stream_alloc_skb(sk, select_size(sk, sg),
                                      sk->sk_allocation);
            if (!skb)
                goto wait_for_memory;

            /*
             * Check whether we can use HW checksum.
             */
            if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
                skb->ip_summed = CHECKSUM_PARTIAL;
            [...]
            skb_entail(sk, skb);
            [...]
            /* Where to copy to? */
            if (skb_tailroom(skb) > 0) {
                /* We have some space in skb head. Superb! */
                if (copy > skb_tailroom(skb))
                    copy = skb_tailroom(skb);
                if ((err = skb_add_data(skb, from, copy)) != 0)
                    goto do_fault;
            }
            [...]
        }
    }
    [...]
    if (copied)
        tcp_push(sk, flags, mss_now, tp->nonagle);
    [...]
}
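The loop over msg->msg_iov above mirrors the iovec array a userspace program hands to sendmsg(). A minimal, hypothetical sketch of how those fields are filled in (assumes fd is an already-connected TCP socket; the helper name is illustrative):

/* Hypothetical helper: shows where the msg_iov/msg_iovlen values that
 * tcp_sendmsg() iterates come from. */
#include <sys/socket.h>
#include <sys/uio.h>

ssize_t send_two_chunks(int fd)
{
    struct iovec iov[2] = {
        { .iov_base = "hello ", .iov_len = 6 },
        { .iov_base = "world",  .iov_len = 5 },
    };
    struct msghdr msg = {
        .msg_iov = iov,     /* becomes msg->msg_iov in the kernel */
        .msg_iovlen = 2,    /* becomes msg->msg_iovlen (iovlen)   */
    };
    /* tcp_sendmsg() walks these iovecs, copying into MSS-sized sk_buffs. */
    return sendmsg(fd, &msg, 0);
}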

Each sk_buff copies and holds up to MSS (tcp_send_mss) bytes, to help the code that actually creates packets. The Maximum Segment Size (MSS) is the maximum payload size a single TCP segment can carry.
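Userspace can inspect the MSS the kernel computed for a connection with getsockopt() and TCP_MAXSEG; a small sketch (assumes fd is a connected TCP socket):

/* Query the per-connection MSS that the kernel derived. */
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void print_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);
    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("MSS for this connection: %d bytes\n", mss);
}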

The sk_stream_alloc_skb function creates a new sk_buff, and skb_entail adds it to the tail of the send socket buffer. The skb_add_data function copies the actual application data into the data buffer of the sk_buff. All the data is copied by repeating this procedure (creating an sk_buff and appending it to the send socket buffer). Therefore, MSS-sized sk_buffs sit in the send socket buffer as a list. Finally, tcp_push is called to turn the data that can be transmitted now into packets and send them.

The tcp_push function transmits as many of the sk_buffs in the send socket buffer as TCP allows, in sequence.

Then, the tcp_transmit_skb function is called to create a packet.

Finally, queue_xmit is called to hand the packet down to the IP layer; for IPv4, queue_xmit is implemented by the ip_queue_xmit function.

The ip_queue_xmit function performs the work required by the IP layer. __sk_dst_check checks whether the cached route is still valid; if there is no cached route, or the cached route has become invalid, it performs IP routing. It then calls skb_push to reserve the IP header area and fills in the IP header fields. Further down the call chain, ip_send_check computes the IP header checksum and the netfilter hooks are invoked. The ip_finish_output function creates IP fragments if fragmentation is required; with TCP no fragmentation occurs, since TCP already segments data to the MSS. ip_finish_output2 is therefore called, which adds the Ethernet header. At this point the packet is complete.

The completed packet is transmitted through the dev_queue_xmit function. First the packet passes through the qdisc. If the default qdisc is used and its queue is empty, the sch_direct_xmit function sends the packet straight down to the driver, skipping the queue. The dev_hard_start_xmit function calls the actual driver. Before the driver is called, the device TX lock is taken to prevent several threads from accessing the device simultaneously. Because the kernel takes the device TX lock, the driver's transmission code needs no additional locking. This is closely related to the parallel processing that will be discussed next time.

The ndo_start_xmit function invokes the driver code. Just before that, you will see ptype_all and dev_queue_xmit_nit. ptype_all is a list holding modules such as packet capture hooks; if a capture program is running, dev_queue_xmit_nit copies the packet to it. Therefore, the packet that tcpdump shows is the packet as handed to the driver. When checksum offload or TSO is used, the NIC modifies the packet afterwards, so the tcpdump capture can differ from what actually goes out on the wire. After the transmission completes, the driver's interrupt handler frees the sk_buff.
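For illustration, this is (roughly) how a capture program hooks in from userspace: an AF_PACKET socket opened with ETH_P_ALL registers a handler on the ptype_all list described above, the same mechanism tcpdump relies on. A minimal sketch (requires root; the frame count is arbitrary):

/* Minimal packet sniffer: AF_PACKET + ETH_P_ALL hooks ptype_all. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];
    for (int i = 0; i < 5; i++) {
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n < 0) { perror("recv"); break; }
        printf("captured frame of %zd bytes\n", n);
    }
    close(fd);
    return 0;
}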

THE RECEIVE PATH:

The general path is: receive a packet, then add its data to the receive socket buffer. After the driver interrupt handler has run, execution continues in the NAPI poll handler, so we follow that first.
static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();
    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        [...]
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }
        [...]
    }
    [...]
}
int netif_receive_skb(struct sk_buff *skb)
[...] ===>
static int __netif_receive_skb(struct sk_buff *skb)
{
    struct packet_type *ptype, *pt_prev;
    [...]
    __be16 type;
    [...]
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    [...]
    type = skb->protocol;
    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    if (pt_prev)
        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
    [...]
}

static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
    [...]
};

As mentioned before, the net_rx_action function is the softirq handler that receives packets. First, the napi_struct of the driver that requested the NAPI poll is taken from poll_list, and the driver's poll handler is called. The driver wraps each received packet in an sk_buff and then calls netif_receive_skb.

When a module has requested all packets, netif_receive_skb delivers the packets to that module as well. As on the transmit side, packets are handed to every module registered on the ptype_all list; this is where packets are captured.

Then the packet is passed to the upper layer according to its type. An Ethernet frame carries a 2-byte ethertype field in its header that indicates the packet type; the driver records this value in skb->protocol. Each protocol has its own packet_type structure and registers a pointer to it in the ptype_base hash table. IPv4 uses ip_packet_type, whose type field holds the IPv4 ethertype (ETH_P_IP). An IPv4 packet is therefore handed to the ip_rcv function.
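To make the ptype_base mechanism concrete, here is a hedged kernel-module sketch that registers its own packet_type with dev_add_pack(), the same way ip_packet_type is registered for ETH_P_IP. The ethertype 0x88B5 (a local experimental value) and the handler name are illustrative, not from the kernel source:

/* Sketch: hook a custom ethertype into ptype_base via dev_add_pack(). */
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int demo_rcv(struct sk_buff *skb, struct net_device *dev,
                    struct packet_type *pt, struct net_device *orig_dev)
{
    pr_info("demo: got %u byte packet on %s\n", skb->len, dev->name);
    kfree_skb(skb);                    /* consume the packet */
    return NET_RX_SUCCESS;
}

static struct packet_type demo_packet_type __read_mostly = {
    .type = cpu_to_be16(0x88B5),       /* experimental ethertype */
    .func = demo_rcv,
};

static int __init demo_init(void)
{
    dev_add_pack(&demo_packet_type);   /* registers in ptype_base */
    return 0;
}

static void __exit demo_exit(void)
{
    dev_remove_pack(&demo_packet_type);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");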

The ip_rcv function performs the work required by the IP layer. It validates the packet, checking things like the length and the header checksum. After passing through the netfilter code, it executes the ip_local_deliver function, which reassembles IP fragments if required. It then calls ip_local_deliver_finish, again through the netfilter code. The ip_local_deliver_finish function removes the IP header with __skb_pull and looks up the upper-layer protocol matching the protocol value in the IP header. Similar to ptype_base, each transport protocol registers its own net_protocol structure in inet_protos. IPv4 TCP uses tcp_protocol and calls tcp_v4_rcv, which is registered as its handler.

When a packet reaches the TCP layer, the processing flow depends on the TCP state and the packet type. Here we look at the processing path taken when the next expected data packet arrives on a connection in the ESTABLISHED state. This is the path a receiving server executes most frequently when there is no packet loss or out-of-order delivery.

First, the tcp_v4_rcv function validates the received packet. If the data offset field is smaller than the minimum header size (th->doff < sizeof(struct tcphdr) / 4), it is a header error. Then __inet_lookup_skb is called to look up, in the TCP connection hash table, the connection the packet belongs to. From the sock structure found, all required structures such as tcp_sock and socket can be obtained.

The actual protocol processing starts in the tcp_v4_do_rcv function. If the TCP is in the ESTABLISHED state, tcp_rcv_established is called. Processing of the ESTABLISHED state is handled separately and optimized, since it is the most common state. tcp_rcv_established first executes the header prediction code, a fast check that detects the common case. The common case here is that there is no data to transmit and the received data packet is exactly the one that must arrive next, i.e., its sequence number is the one the receiving TCP expects. In that case the procedure completes by adding the data to the socket buffer and transmitting an ACK.

Going forward, you will see the statement comparing truesize with sk_forward_alloc. It checks whether the receive socket buffer has free space for the new packet data. If it does, header prediction "hits" (the prediction succeeded): __skb_pull is called to remove the TCP header, then __skb_queue_tail adds the packet to the receive socket buffer, and finally __tcp_ack_snd_check transmits an ACK if necessary. At this point packet processing is complete.

If there is not enough free space, the slow path is taken. The tcp_data_queue function allocates new buffer space and adds the data packet to the socket buffer; at this point the receive socket buffer size is automatically increased if possible. Unlike the fast path, tcp_data_snd_check is then called to transmit new data if possible. Finally, tcp_ack_snd_check is called to create and transmit an ACK packet if necessary.

The amount of code executed by these two paths is small; this is achieved by optimizing the common case. It also means the uncommon cases are processed significantly more slowly, out-of-order delivery being one of them.

References:
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/
http://www.cubrid.org/blog/dev-platform/understanding-tcp-ip-network-stack/


struct sock has three queues: RX, TX and error.

sk_receive_queue: sk_buff <-> sk_buff <-> sk_buff
sk_write_queue:   sk_buff <-> sk_buff <-> sk_buff
sk_error_queue:   sk_buff <-> sk_buff <-> sk_buff

APIs used to manage these queues (a sketch of their use follows below):
skb_queue_tail(): adds an sk_buff to the tail of a queue.
skb_dequeue(): removes an sk_buff from the head of a queue.
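A hedged kernel-side sketch of these helpers, using the real API from <linux/skbuff.h>; the local queue here is only for illustration (in struct sock the queues are sk->sk_receive_queue and friends):

/* Sketch: enqueue then dequeue an sk_buff on an sk_buff_head. */
#include <linux/skbuff.h>

static void queue_demo(struct sk_buff *skb)
{
    struct sk_buff_head q;
    struct sk_buff *out;

    skb_queue_head_init(&q);      /* initialize list head + lock */
    skb_queue_tail(&q, skb);      /* enqueue at the tail */
    out = skb_dequeue(&q);        /* dequeue from the head */
    if (out)
        kfree_skb(out);
}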

For UDP, ip_local_deliver_finish() calls udp_rcv() (registered as the UDP handler), which eventually queues the packet via sock_queue_rcv_skb().

All three receive system calls, recv(), recvfrom() and recvmsg(), are handled by udp_recvmsg() for UDP sockets.

udp_recvmsg() --> __skb_recv_datagram() --> reads from sk->sk_receive_queue.

static const struct net_protocol udp_protocol = {
    .handler = udp_rcv,
    .err_handler = udp_err,
    [...]
};
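This handler table is registered with the IP layer at init time; a sketch of the registration, modeled on net/ipv4/af_inet.c (IPPROTO_UDP indexes the inet_protos array that ip_local_deliver_finish() consults):

/* Sketch: register the UDP handler in inet_protos. */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/in.h>
#include <net/protocol.h>

static int __init register_udp_handler(void)
{
    /* Makes ip_local_deliver_finish() hand protocol-17 packets to udp_rcv(). */
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) {
        pr_crit("inet: cannot add UDP protocol\n");
        return -EAGAIN;
    }
    return 0;
}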

If a socket is listening on the destination port, udp_queue_rcv_skb() --> sock_queue_rcv_skb() adds the packet to sk_receive_queue via skb_queue_tail().

udp_recvmsg() calls __skb_recv_datagram() to receive one sk_buff.

memcpy_toiovec() performs the actual copy to user space by invoking copy_to_user().
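Putting the receive side together, a minimal userspace sketch: the recvfrom() below ends up in udp_recvmsg(), which dequeues one sk_buff from sk_receive_queue and copies its payload out to the user buffer. Port 9999 is just an example.

/* Minimal UDP receiver: one recvfrom() == one sk_buff dequeued. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(9999),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    char buf[2048];
    /* One datagram, dequeued by __skb_recv_datagram() in the kernel. */
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    if (n >= 0)
        printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}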

If you connect() an unbound UDP socket and then call bind(), you will get the EINVAL error.
The reason is that connecting an unbound socket calls inet_autobind(), which automatically binds it to a random port. So after connect() the socket is already bound, and calling bind() afterwards fails with EINVAL (since the socket is already bound).
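This behavior is easy to demonstrate with a small runnable sketch (127.0.0.1:53 and port 12345 are arbitrary example values):

/* Demonstrates EINVAL: connect() autobinds, so a later bind() fails. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port = htons(53) };
    inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

    /* connect() on the unbound socket triggers inet_autobind(). */
    connect(fd, (struct sockaddr *)&peer, sizeof(peer));

    struct sockaddr_in local = { .sin_family = AF_INET,
                                 .sin_port = htons(12345) };
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
        printf("bind after connect: %s\n", strerror(errno)); /* EINVAL */

    close(fd);
    return 0;
}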

References:
http://www.haifux.org/lectures/217/netLec5.pdf
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/
