Networking System Calls

When you create a socket using the socket() system call, the kernel returns a file descriptor whose file_operations structure is assigned as follows:

static const struct file_operations socket_file_ops = {
    [...]
    .aio_read  = sock_aio_read,
    .aio_write = sock_aio_write,
    [...]
};

When we write to this socket using the write() system call, the kernel dispatches through these file operations:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, ...)
{
    struct file *file;
    [...]
    file = fget_light(fd, &fput_needed);
    [...] ===>
    ret = file->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
    [...]
}
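To make the entry point concrete, here is a minimal userspace sketch: a plain write() on a connected TCP socket descriptor is what lands in sock_aio_write() via socket_file_ops above. The destination 127.0.0.1:7 (the echo port) is just an example.

/* Minimal sketch: write() on a socket fd enters the path traced above. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* creates struct socket/sock */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(7) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Plain write() on the socket fd: dispatched through f_op->aio_write
     * on the kernel version quoted above. */
    const char msg[] = "hello";
    if (write(fd, msg, sizeof(msg) - 1) < 0)
        perror("write");

    close(fd);
    return 0;
}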

For every socket created by a userspace application there is a corresponding struct socket and a struct sock. By kernel naming convention:

sock --> struct socket
sk   --> struct sock
static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, ..)
{
    [...]
    struct socket *sock = file->private_data;
    [...] ===>
    return sock->ops->sendmsg(iocb, sock, msg, size);
    [...]
}
struct socket {
    [...]
    struct file *file;
    struct sock *sk;
    const struct proto_ops *ops;
};
const struct proto_ops inet_stream_ops = {
    .family = PF_INET,
    [...]
    .connect = inet_stream_connect,
    .accept = inet_accept,
    .listen = inet_listen,
    .sendmsg = tcp_sendmsg,
    .recvmsg = inet_recvmsg,
    [...]
};

The sendmsg call is dispatched to a protocol-specific function through proto_ops; for TCP over IPv4, inet_stream_ops maps it to tcp_sendmsg.

int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size)
{
    struct sock *sk = sock->sk;
    struct iovec *iov;
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    [...]
    mss_now = tcp_send_mss(sk, &size_goal, flags);

    /* Ok commence sending. */
    iovlen = msg->msg_iovlen;
    iov = msg->msg_iov;
    copied = 0;
    [...]
    while (--iovlen >= 0) {
        int seglen = iov->iov_len;
        unsigned char __user *from = iov->iov_base;

        iov++;
        while (seglen > 0) {
            int copy = 0;
            int max = size_goal;
            [...]
            skb = sk_stream_alloc_skb(sk, select_size(sk, sg),
                                      sk->sk_allocation);
            if (!skb)
                goto wait_for_memory;

            /*
             * Check whether we can use HW checksum.
             */
            if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
                skb->ip_summed = CHECKSUM_PARTIAL;
            [...]
            skb_entail(sk, skb);
            [...]
            /* Where to copy to? */
            if (skb_tailroom(skb) > 0) {
                /* We have some space in skb head. Superb! */
                if (copy > skb_tailroom(skb))
                    copy = skb_tailroom(skb);
                if ((err = skb_add_data(skb, from, copy)) != 0)
                    goto do_fault;
            }
            [...]
        }
    }
    [...]
    if (copied)
        tcp_push(sk, flags, mss_now, tp->nonagle);
    [...]
}
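The loop over msg->msg_iov above mirrors the iovec array a userspace program hands to sendmsg(). A minimal, hypothetical sketch of how those fields are filled in (assumes fd is an already-connected TCP socket; the helper name is illustrative):

/* Hypothetical helper: shows where the msg_iov/msg_iovlen values that
 * tcp_sendmsg() iterates come from. */
#include <sys/socket.h>
#include <sys/uio.h>

ssize_t send_two_chunks(int fd)
{
    struct iovec iov[2] = {
        { .iov_base = "hello ", .iov_len = 6 },
        { .iov_base = "world",  .iov_len = 5 },
    };
    struct msghdr msg = {
        .msg_iov = iov,     /* becomes msg->msg_iov in the kernel */
        .msg_iovlen = 2,    /* becomes msg->msg_iovlen (iovlen)   */
    };
    /* tcp_sendmsg() walks these iovecs, copying into MSS-sized sk_buffs. */
    return sendmsg(fd, &msg, 0);
}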

Each sk_buff copies and holds up to MSS (tcp_send_mss) bytes, to help the code that actually creates packets. The Maximum Segment Size (MSS) is the maximum payload size a single TCP segment can carry.
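Userspace can inspect the MSS the kernel computed for a connection with getsockopt() and TCP_MAXSEG; a small sketch (assumes fd is a connected TCP socket):

/* Query the per-connection MSS that the kernel derived. */
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void print_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);
    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("MSS for this connection: %d bytes\n", mss);
}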

The sk_stream_alloc_skb function creates a new sk_buff, and skb_entail adds it to the tail of the send socket buffer. The skb_add_data function copies the actual application data into the data buffer of the sk_buff. All the data is copied by repeating this procedure (creating an sk_buff and appending it to the send socket buffer). Therefore, MSS-sized sk_buffs sit in the send socket buffer as a list. Finally, tcp_push is called to turn the data that can be transmitted now into packets and send them.

The tcp_push function transmits as many of the sk_buffs in the send socket buffer as TCP allows, in sequence.

Then, the tcp_transmit_skb function is called to create a packet.

Finally, queue_xmit is called to hand the packet down to the IP layer; for IPv4, queue_xmit is implemented by the ip_queue_xmit function.

The ip_queue_xmit function performs the work required by the IP layer. __sk_dst_check checks whether the cached route is still valid; if there is no cached route, or the cached route has become invalid, it performs IP routing. It then calls skb_push to reserve the IP header area and fills in the IP header fields. Further down the call chain, ip_send_check computes the IP header checksum and the netfilter hooks are invoked. The ip_finish_output function creates IP fragments if fragmentation is required; with TCP no fragmentation occurs, since TCP already segments data to the MSS. ip_finish_output2 is therefore called, which adds the Ethernet header. At this point the packet is complete.

The completed packet is transmitted through the dev_queue_xmit function. First the packet passes through the qdisc. If the default qdisc is used and its queue is empty, the sch_direct_xmit function sends the packet straight down to the driver, skipping the queue. The dev_hard_start_xmit function calls the actual driver. Before the driver is called, the device TX lock is taken to prevent several threads from accessing the device simultaneously. Because the kernel takes the device TX lock, the driver's transmission code needs no additional locking. This is closely related to the parallel processing that will be discussed next time.

The ndo_start_xmit function invokes the driver code. Just before that, you will see ptype_all and dev_queue_xmit_nit. ptype_all is a list holding modules such as packet capture hooks; if a capture program is running, dev_queue_xmit_nit copies the packet to it. Therefore, the packet that tcpdump shows is the packet as handed to the driver. When checksum offload or TSO is used, the NIC modifies the packet afterwards, so the tcpdump capture can differ from what actually goes out on the wire. After the transmission completes, the driver's interrupt handler frees the sk_buff.
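For illustration, this is (roughly) how a capture program hooks in from userspace: an AF_PACKET socket opened with ETH_P_ALL registers a handler on the ptype_all list described above, the same mechanism tcpdump relies on. A minimal sketch (requires root; the frame count is arbitrary):

/* Minimal packet sniffer: AF_PACKET + ETH_P_ALL hooks ptype_all. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];
    for (int i = 0; i < 5; i++) {
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n < 0) { perror("recv"); break; }
        printf("captured frame of %zd bytes\n", n);
    }
    close(fd);
    return 0;
}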

THE RECEIVE PATH:

The general path is: receive a packet, then add its data to the receive socket buffer. After the driver interrupt handler has run, execution continues in the NAPI poll handler, so we follow that first.
static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();
    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        [...]
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }
        [...]
    }
    [...]
}
int netif_receive_skb(struct sk_buff *skb)
[...] ===>
static int __netif_receive_skb(struct sk_buff *skb)
{
    struct packet_type *ptype, *pt_prev;
    [...]
    __be16 type;
    [...]
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    [...]
    type = skb->protocol;
    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    if (pt_prev)
        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
    [...]
}

static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
    [...]
};

As mentioned before, the net_rx_action function is the softirq handler that receives packets. First, the napi_struct of the driver that requested the NAPI poll is taken from poll_list, and the driver's poll handler is called. The driver wraps each received packet in an sk_buff and then calls netif_receive_skb.

When a module has requested all packets, netif_receive_skb delivers the packets to that module as well. As on the transmit side, packets are handed to every module registered on the ptype_all list; this is where packets are captured.

Then the packet is passed to the upper layer according to its type. An Ethernet frame carries a 2-byte ethertype field in its header that indicates the packet type; the driver records this value in skb->protocol. Each protocol has its own packet_type structure and registers a pointer to it in the ptype_base hash table. IPv4 uses ip_packet_type, whose type field holds the IPv4 ethertype (ETH_P_IP). An IPv4 packet is therefore handed to the ip_rcv function.
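To make the ptype_base mechanism concrete, here is a hedged kernel-module sketch that registers its own packet_type with dev_add_pack(), the same way ip_packet_type is registered for ETH_P_IP. The ethertype 0x88B5 (a local experimental value) and the handler name are illustrative, not from the kernel source:

/* Sketch: hook a custom ethertype into ptype_base via dev_add_pack(). */
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int demo_rcv(struct sk_buff *skb, struct net_device *dev,
                    struct packet_type *pt, struct net_device *orig_dev)
{
    pr_info("demo: got %u byte packet on %s\n", skb->len, dev->name);
    kfree_skb(skb);                    /* consume the packet */
    return NET_RX_SUCCESS;
}

static struct packet_type demo_packet_type __read_mostly = {
    .type = cpu_to_be16(0x88B5),       /* experimental ethertype */
    .func = demo_rcv,
};

static int __init demo_init(void)
{
    dev_add_pack(&demo_packet_type);   /* registers in ptype_base */
    return 0;
}

static void __exit demo_exit(void)
{
    dev_remove_pack(&demo_packet_type);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");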

The ip_rcv function performs the work required by the IP layer. It validates the packet, checking things like the length and the header checksum. After passing through the netfilter code, it executes the ip_local_deliver function, which reassembles IP fragments if required. It then calls ip_local_deliver_finish, again through the netfilter code. The ip_local_deliver_finish function removes the IP header with __skb_pull and looks up the upper-layer protocol matching the protocol value in the IP header. Similar to ptype_base, each transport protocol registers its own net_protocol structure in inet_protos. IPv4 TCP uses tcp_protocol and calls tcp_v4_rcv, which is registered as its handler.

When a packet reaches the TCP layer, the processing flow depends on the TCP state and the packet type. Here we look at the processing path taken when the next expected data packet arrives on a connection in the ESTABLISHED state. This is the path a receiving server executes most frequently when there is no packet loss or out-of-order delivery.

First, the tcp_v4_rcv function validates the received packet. If the data offset field is smaller than the minimum header size (th->doff < sizeof(struct tcphdr) / 4), it is a header error. Then __inet_lookup_skb is called to look up, in the TCP connection hash table, the connection the packet belongs to. From the sock structure found, all required structures such as tcp_sock and socket can be obtained.

The actual protocol processing starts in the tcp_v4_do_rcv function. If the TCP is in the ESTABLISHED state, tcp_rcv_established is called. Processing of the ESTABLISHED state is handled separately and optimized, since it is the most common state. tcp_rcv_established first executes the header prediction code, a fast check that detects the common case. The common case here is that there is no data to transmit and the received data packet is exactly the one that must arrive next, i.e., its sequence number is the one the receiving TCP expects. In that case the procedure completes by adding the data to the socket buffer and transmitting an ACK.

Going forward, you will see the statement comparing truesize with sk_forward_alloc. It checks whether the receive socket buffer has free space for the new packet data. If it does, header prediction "hits" (the prediction succeeded): __skb_pull is called to remove the TCP header, then __skb_queue_tail adds the packet to the receive socket buffer, and finally __tcp_ack_snd_check transmits an ACK if necessary. At this point packet processing is complete.

If there is not enough free space, the slow path is taken. The tcp_data_queue function allocates new buffer space and adds the data packet to the socket buffer; at this point the receive socket buffer size is automatically increased if possible. Unlike the fast path, tcp_data_snd_check is then called to transmit new data if possible. Finally, tcp_ack_snd_check is called to create and transmit an ACK packet if necessary.

The amount of code executed by these two paths is small; this is achieved by optimizing the common case. It also means the uncommon cases are processed significantly more slowly, out-of-order delivery being one of them.

References:
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/
http://www.cubrid.org/blog/dev-platform/understanding-tcp-ip-network-stack/


struct sock has three queues: RX, TX and error.

sk_receive_queue: sk_buff <-> sk_buff <-> sk_buff
sk_write_queue:   sk_buff <-> sk_buff <-> sk_buff
sk_error_queue:   sk_buff <-> sk_buff <-> sk_buff

APIs used to manage these queues (a sketch of their use follows below):
skb_queue_tail(): adds an sk_buff to the tail of a queue.
skb_dequeue(): removes an sk_buff from the head of a queue.
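A hedged kernel-side sketch of these helpers, using the real API from <linux/skbuff.h>; the local queue here is only for illustration (in struct sock the queues are sk->sk_receive_queue and friends):

/* Sketch: enqueue then dequeue an sk_buff on an sk_buff_head. */
#include <linux/skbuff.h>

static void queue_demo(struct sk_buff *skb)
{
    struct sk_buff_head q;
    struct sk_buff *out;

    skb_queue_head_init(&q);      /* initialize list head + lock */
    skb_queue_tail(&q, skb);      /* enqueue at the tail */
    out = skb_dequeue(&q);        /* dequeue from the head */
    if (out)
        kfree_skb(out);
}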

For UDP, ip_local_deliver_finish() calls udp_rcv() (registered as the UDP handler), which eventually queues the packet via sock_queue_rcv_skb().

All three receive system calls, recv(), recvfrom() and recvmsg(), are handled by udp_recvmsg() for UDP sockets.

udp_recvmsg() --> __skb_recv_datagram() --> reads from sk->sk_receive_queue.

static const struct net_protocol udp_protocol = {
    .handler = udp_rcv,
    .err_handler = udp_err,
    [...]
};
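This handler table is registered with the IP layer at init time; a sketch of the registration, modeled on net/ipv4/af_inet.c (IPPROTO_UDP indexes the inet_protos array that ip_local_deliver_finish() consults):

/* Sketch: register the UDP handler in inet_protos. */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/in.h>
#include <net/protocol.h>

static int __init register_udp_handler(void)
{
    /* Makes ip_local_deliver_finish() hand protocol-17 packets to udp_rcv(). */
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) {
        pr_crit("inet: cannot add UDP protocol\n");
        return -EAGAIN;
    }
    return 0;
}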

If a socket is listening on the destination port, udp_queue_rcv_skb() --> sock_queue_rcv_skb() adds the packet to sk_receive_queue via skb_queue_tail().

udp_recvmsg() calls __skb_recv_datagram() to receive one sk_buff.

memcpy_toiovec() performs the actual copy to user space by invoking copy_to_user().
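Putting the receive side together, a minimal userspace sketch: the recvfrom() below ends up in udp_recvmsg(), which dequeues one sk_buff from sk_receive_queue and copies its payload out to the user buffer. Port 9999 is just an example.

/* Minimal UDP receiver: one recvfrom() == one sk_buff dequeued. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(9999),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    char buf[2048];
    /* One datagram, dequeued by __skb_recv_datagram() in the kernel. */
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    if (n >= 0)
        printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}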

If you connect() an unbound UDP socket and then call bind(), you will get the EINVAL error.
The reason is that connecting an unbound socket calls inet_autobind(), which automatically binds it to a random port. So after connect() the socket is already bound, and calling bind() afterwards fails with EINVAL (since the socket is already bound).
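This behavior is easy to demonstrate with a small runnable sketch (127.0.0.1:53 and port 12345 are arbitrary example values):

/* Demonstrates EINVAL: connect() autobinds, so a later bind() fails. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port = htons(53) };
    inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

    /* connect() on the unbound socket triggers inet_autobind(). */
    connect(fd, (struct sockaddr *)&peer, sizeof(peer));

    struct sockaddr_in local = { .sin_family = AF_INET,
                                 .sin_port = htons(12345) };
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
        printf("bind after connect: %s\n", strerror(errno)); /* EINVAL */

    close(fd);
    return 0;
}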

References:
http://www.haifux.org/lectures/217/netLec5.pdf
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/
