Syscall

Note: from here on, all commands are assumed to be run from the linux directory (the root of your kernel source tree).

1. Creating a shared folder

Let's make a shared folder between the host machine and the VM.

First we'll create a folder to be shared from our host machine:

mkdir ../shared_folder

Now we expose this folder to the virtual machine by adding a few more flags to our qemu command:

qemu-system-x86_64 \
    -drive file=../my_disk.raw,format=raw,index=0,media=disk \
    -m 2G -nographic \
    -kernel ./arch/x86_64/boot/bzImage \
    -append "root=/dev/sda rw console=ttyS0 loglevel=6" \
    -fsdev local,id=fs1,path=../shared_folder,security_model=none \
    -device virtio-9p-pci,fsdev=fs1,mount_tag=shared_folder \
    --enable-kvm

What do these new flags mean?
  • -fsdev local,id=fs1,path=<path to shared folder>,security_model=none: this adds a new file system device to our emulation. Make sure path points to the right directory. Don't worry about security_model=none: with this argument, files created or modified inside the guest get the same permissions as if they had been created by the host user.
  • -device virtio-9p-pci,fsdev=fs1,mount_tag=<shared folder name on mount>: this defines the name (tag) and type of the virtual device (virtio-9p-pci).

Now we need to choose the mountpoint of the shared folder, i.e., the directory inside the VM where our shared folder will be mounted. To do that, inside the VM, edit the file /etc/fstab with your preferred editor (nano or vi), and add this line at the end of the file:

# <device> <mountpoint> <type> <options> <dump> <pass>
shared_folder /root/host_folder 9p trans=virtio 0 0

Now reboot your VM and check that there is a /root/host_folder with the same files and folders as shared_folder on the host machine. For now, your shared_folder is still empty, so the existence of the /root/host_folder directory should be enough proof that it's working (you can create a random file inside it from the host just to check that it shows up in the VM as well).
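
If you want a stronger check than "the directory exists", you can ask the mount table directly. Below is a minimal sketch that uses glibc's <mntent.h> helpers to look for a mount of a given type; mount_has_type is a name made up for this example, and it assumes /proc is mounted in the VM.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `dir` is a mount point of filesystem type `type`,
 * 0 if it is not, and -1 if the mount table cannot be read. */
int mount_has_type(const char *dir, const char *type)
{
    FILE *tab = setmntent("/proc/self/mounts", "r");
    struct mntent *ent;
    int found = 0;

    if (!tab)
        return -1;

    /* Walk every entry of the mount table until we find a match. */
    while ((ent = getmntent(tab)) != NULL) {
        if (strcmp(ent->mnt_dir, dir) == 0 &&
            strcmp(ent->mnt_type, type) == 0) {
            found = 1;
            break;
        }
    }

    endmntent(tab);
    return found;
}
```

Inside the VM, mount_has_type("/root/host_folder", "9p") should return 1 once the fstab entry has taken effect.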

2. Add a syscall

Linux has a lot of syscalls, but we are going to add one more, in the name of science: a memory copy. We usually don't need the kernel to copy memory from one place to another in userspace, but we are adding this for learning purposes.

First, this is what the interface will look like:

sys_memcpy(void *src, void *dst, unsigned long int size);

When it succeeds, it returns 0. Otherwise, it returns an error code.

If you don't like my interface, feel free to be creative and try cool ways to do a memcpy.

Note

Creating a syscall is very architecture dependent, since each arch has its own calling convention, different registers to use, and different syscall tables. In this tutorial, I'm going to add a syscall to the x86-64 ABI. Keep in mind that we will not cover the i386 or x32 ABIs in this tutorial, but this document provides useful insights for solving compatibility issues: Adding a New System Call - kernel.org

2.1 Registering our new syscall

We need to register our syscall in a few places, so the kernel knows what to do when userspace asks for it. The first file is arch/x86/entry/syscalls/syscall_64.tbl: add a new line after the last entry in the first table (this is not at the end of the file!). For Linux v5.6, this is after pidfd_getfd:

 437    common  openat2                 __x64_sys_openat2
 438    common  pidfd_getfd             __x64_sys_pidfd_getfd
+439    common  memcpy                  __x64_sys_memcpy

 #
 # x32-specific system call numbers start at 512 to avoid cache impact

Note

A correct multiplatform implementation would require the syscall to be added to the syscall_32.tbl as well. The second table entry is only required when there is a need to treat x32 syscall differently.

439 will be our syscall number. Take note of it, since we are going to use it in other places as well.

Add a function signature at include/linux/syscalls.h:

 asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);

+asmlinkage long sys_memcpy(void __user *src, void __user *dst,
+                        unsigned long len);

 /*
  * Not a real system call, but a placeholder for syscalls which are

I added it after the last syscall signature. Now that the kernel knows the number and the signature, let's glue things together in include/uapi/asm-generic/unistd.h:

 #define __NR_pidfd_getfd 438
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)

+#define __NR_memcpy 439
+__SYSCALL(__NR_memcpy, sys_memcpy)
+
 #undef __NR_syscalls
-#define __NR_syscalls 439
+#define __NR_syscalls 440

The last place to register our syscall is at kernel/sys_ni.c:

 COND_SYSCALL(setuid16);

+COND_SYSCALL(memcpy);
+
 /* restartable sequence */

This is required to provide a fallback stub implementation of our syscall that returns -ENOSYS.

2.2 Writing some code

Finally, let's add the code of our syscall. I created a file kernel/memcpy.c, but you can put it wherever you want, just make sure it gets compiled. I also added obj-y += memcpy.o to kernel/Makefile. Let's see, step by step, what our code needs to do. For now, the only include we need is <linux/syscalls.h>.

To declare the syscall, we are going to use a macro SYSCALL_DEFINE3, that will do some magic for us. Note that the type and the name of each variable are separated by a comma:

SYSCALL_DEFINE3(memcpy, void __user *, src, void __user *, dst,
        unsigned long, len)

The 3 is for syscalls with three arguments. The maximum number of arguments that a syscall can have is six, because some architectures don't have enough registers to deal with a 7th argument. If you need to pass more than six values, you need to use a pointer to a struct. When this macro is expanded, we get the same function signature we declared in include/linux/syscalls.h.

__user is an annotation that compilers ignore, but static analyzers (like sparse) use it to check that you are not misusing user pointers.

Note

What kinds of misuse are possible with user memory? Hint: read about the difference in how kernel memory and user memory are mapped, and about the memory management unit.

Now, we move along defining our function. First, we will need a kernel buffer to temporarily store data:

{
    void *buf;

    buf = kmalloc(len, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

Note

Exercise: how about defining a maximum value for len and returning an error code if it's bigger than we support?
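
One possible shape for that check, written as a plain C function so you can try it in userspace first. This is only a sketch: MEMCPY_MAX_LEN and memcpy_check_len are hypothetical names, the 1 MiB limit is arbitrary, and -EINVAL is just one reasonable choice of error code. In the syscall, the test would go before the kmalloc() call.

```c
#include <errno.h>

/* Hypothetical upper bound; pick whatever limit suits your experiment. */
#define MEMCPY_MAX_LEN (1UL << 20) /* 1 MiB */

/* Returns 0 if len is acceptable, or a negative errno value otherwise,
 * following the kernel convention for error returns. */
long memcpy_check_len(unsigned long len)
{
    if (len > MEMCPY_MAX_LEN)
        return -EINVAL;
    return 0;
}
```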

Now, we need to store the data from the user in the kernel. For that, we are going to use the function copy_from_user(). It's imperative to use this function to copy data from userspace to the kernel, since it checks that the pointer and the size are valid. If you want to have some fun seeing errors, try the kernel's internal memcpy() implementation instead.

    if (copy_from_user(buf, src, len)) {
        kfree(buf);    /* don't leak the buffer on error */
        return -EFAULT;
    }

EFAULT is used for invalid memory access. We are almost done! We just need to copy the data back to the user and finish the syscall. To do that, the kernel also provides a copy_to_user():

    if (copy_to_user(dst, buf, len)) {
        kfree(buf);    /* don't leak the buffer on error */
        return -EFAULT;
    }

    kfree(buf);
    return 0;
}

In the end, my kernel/memcpy.c file looks like this:

#include <linux/syscalls.h>

SYSCALL_DEFINE3(memcpy, void __user *, src, void __user *, dst,
        unsigned long, len)
{
    void *buf;

    buf = kmalloc(len, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    if (copy_from_user(buf, src, len)) {
        kfree(buf);    /* don't leak the buffer on error */
        return -EFAULT;
    }

    if (copy_to_user(dst, buf, len)) {
        kfree(buf);
        return -EFAULT;
    }

    kfree(buf);
    return 0;
}
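
If you want to play with the control flow before rebooting into a new kernel, here is a userspace model of the same bounce-buffer logic. It is a sketch under loud assumptions: model_copy is a hypothetical stand-in for copy_from_user()/copy_to_user() (like them, it returns the number of bytes it could NOT copy), a NULL pointer plays the role of a bad user address, and malloc()/free() replace kmalloc()/kfree(). It uses a single cleanup label, a pattern you will see a lot in kernel code, so the buffer is released on every path.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for copy_{from,to}_user(): returns the number of
 * bytes NOT copied, so 0 means success. NULL models an invalid address. */
static unsigned long model_copy(void *to, const void *from, unsigned long n)
{
    if (!to || !from)
        return n;
    memcpy(to, from, n);
    return 0;
}

/* Userspace model of the syscall body: returns 0 or a negative errno. */
long model_sys_memcpy(const void *src, void *dst, unsigned long len)
{
    long ret = 0;
    void *buf = malloc(len); /* kmalloc(len, GFP_KERNEL) in the kernel */

    if (!buf)
        return -ENOMEM;

    if (model_copy(buf, src, len)) {
        ret = -EFAULT;
        goto out;
    }

    if (model_copy(dst, buf, len))
        ret = -EFAULT;

out:
    free(buf); /* kfree(buf) in the kernel; runs on success and on error */
    return ret;
}
```

Calling model_sys_memcpy(a, b, sizeof(a)) on two arrays copies a into b and returns 0; passing NULL for either pointer returns -EFAULT without leaking the temporary buffer.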

The kernel side is ready. Now it's time to use our syscall from userspace.

3. The user side

Let's test our syscall from userspace! Remember to recompile your kernel and boot it with the shared folder, so that files created inside the VM are saved on your local machine.

Build and Boot in your VM
make -j$(nproc)
qemu-system-x86_64 \
    -drive file=../my_disk.raw,format=raw,index=0,media=disk \
    -m 4G -nographic \
    -kernel ./arch/x86_64/boot/bzImage \
    -append "root=/dev/sda rw console=ttyS0 loglevel=6" \
    -fsdev local,id=fs1,path=../shared_folder,security_model=none \
    -device virtio-9p-pci,fsdev=fs1,mount_tag=shared_folder \
    --enable-kvm

Glibc provides a wrapper for calling syscalls, conveniently called syscall(). All we need to do is use the first argument as the syscall number, and the following ones as the syscall's arguments.
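
To get a feel for syscall() before pointing it at a brand new number, here is a minimal sketch that invokes an existing system call, getpid, by number. SYS_getpid comes from <sys/syscall.h>; raw_getpid is just a name made up for this example.

```c
#include <sys/syscall.h> /* SYS_getpid */
#include <unistd.h>      /* syscall(), getpid() */

/* Ask the kernel for our PID by syscall number, bypassing the
 * dedicated getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned by raw_getpid() should match what getpid() gives you, which is a cheap way to confirm that the syscall() plumbing works in your test environment.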

Let's include headers to have access to syscall(), printf() and errno:

#include <stdio.h>
#include <unistd.h>
#include <errno.h>

Define the number of our system call, and the size of the test array:

#define __NR_memcpy 439
#define ARR_LEN 10

Note

If you enable CONFIG_HEADERS_INSTALL and run make headers_install with INSTALL_HDR_PATH pointing into the rootfs of your test environment, you don't need to define __NR_memcpy yourself; you can just include <linux/unistd.h>.

Create some variables, and a test array:

int ret, i;

int a[] = {1, 2, 3, 777, '5', 'a', 0x0800, -15500, 42, 1337};
int b[ARR_LEN];

Finally, call the syscall to do the magic:

ret = syscall(__NR_memcpy, a, b, ARR_LEN * sizeof(int)); 
if (ret == -1)
    printf("error: %d\n", errno);
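
The ret == -1 check reflects how glibc's syscall() reports failures: the kernel hands back a negative error code (like our -EFAULT), and the wrapper turns that into a return value of -1 with the code stored in errno. Here is a small sketch of that behaviour using an existing syscall, close, on a descriptor that is never valid; errno_of_bad_close is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/syscall.h> /* SYS_close */
#include <unistd.h>      /* syscall() */

/* Closes an invalid file descriptor through syscall() and returns the
 * errno value the wrapper stored, or 0 if the call did not fail. */
int errno_of_bad_close(void)
{
    errno = 0;
    if (syscall(SYS_close, -1) == -1)
        return errno;
    return 0;
}
```

Here the helper should return EBADF. For our syscall, passing a bogus pointer should surface as EFAULT in the same way, and strerror(errno) from <string.h> gives a readable message instead of a bare number.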

And check if both arrays are identical:

for (i = 0; i < ARR_LEN; i++) {
    if (a[i] != b[i])
        printf("error, value %d is different\n", i);
}

And if nothing was printed, our syscall worked! This is the complete user code:

#include <stdio.h>
#include <unistd.h>
#include <errno.h>

#define __NR_memcpy 439
#define ARR_LEN 10

int main()
{
    int ret, i;

    int a[] = {1, 2, 3, 777, '5', 'a', 0x0800, -15500, 42, 1337};
    int b[ARR_LEN];

    ret = syscall(__NR_memcpy, a, b, ARR_LEN * sizeof(int)); 
    if (ret == -1)
        printf("error: %d\n", errno);

    for (i = 0; i < ARR_LEN; i++) {
        if (a[i] != b[i])
            printf("error, value %d is different\n", i);
    }

    return 0;
}

Exercise

  1. Use strace to see your syscall in action. Use gdb and catch syscall <syscall_number> to stop the program when your syscall is called. Use info reg when the program stops to see the values of the registers.

  2. Let's create a syscall that not only copies the data, but also modifies it. Have you ever heard of the Caesar cipher?

    Implement a syscall that, for a given string, rotation count, and operation, encrypts/decrypts the string.

    #define OP_ENCRYPT 0
    #define OP_DECRYPT 1
    
    sys_caesar(char *in_str, char *out_str, unsigned int op, unsigned int rot);
    
  3. Join all data of our syscall inside a struct, and send this struct to the kernel. Such struct can look like this:

    struct memcpy_data {
        void *src;
        void *dst;
        unsigned long int size;
    };
    

    And modify the syscall signature to be like this:

    sys_memcpy(struct memcpy_data *data);
    

    Check the size of this struct. Will this size be the same on all architectures? Compile the userspace code for the i386 ABI (with gcc -m32) and try to use the syscall as is. Do you think your implementation will work? Check the compatibility documentation and try to fix your code.
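
As a hint for the size question in exercise 3: one approach, in the spirit of what the kernel's "Adding a New System Call" document suggests for structs that carry pointers, is to store them in fixed-width 64-bit fields so the layout no longer depends on the ABI. The sketch below is a userspace illustration only; memcpy_data_fixed and make_memcpy_data are hypothetical names, not part of any kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Every field is 64 bits wide, so the struct is 24 bytes whether the
 * program is built for x86-64 or with gcc -m32. */
struct memcpy_data_fixed {
    uint64_t src;  /* user pointer, widened to 64 bits */
    uint64_t dst;  /* user pointer, widened to 64 bits */
    uint64_t size; /* number of bytes to copy */
};

/* Fill the struct from ordinary pointers. Going through uintptr_t
 * avoids sign-extension surprises on 32-bit builds. */
struct memcpy_data_fixed make_memcpy_data(const void *src, void *dst,
                                          size_t size)
{
    struct memcpy_data_fixed d = {
        .src = (uint64_t)(uintptr_t)src,
        .dst = (uint64_t)(uintptr_t)dst,
        .size = size,
    };
    return d;
}
```

Compare sizeof(struct memcpy_data_fixed) with the size of the original struct memcpy_data under both gcc and gcc -m32 to see the difference for yourself.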