Syscall
Note: we will consider all commands from here on out as being run from the linux
directory (source of your kernel tree).
1. Creating a shared folder
Let's make a shared folder between the host machine and the VM.
First we'll create a folder to be shared from our host machine:
mkdir ../shared_folder
Now we expose this folder to the virtual machine by adding a few more flags to our qemu command:
qemu-system-x86_64 \
-drive file=../my_disk.raw,format=raw,index=0,media=disk \
-m 2G -nographic \
-kernel ./arch/x86_64/boot/bzImage \
-append "root=/dev/sda rw console=ttyS0 loglevel=6" \
-fsdev local,id=fs1,path=../shared_folder,security_model=none \
-device virtio-9p-pci,fsdev=fs1,mount_tag=shared_folder \
--enable-kvm
What do these new flags mean?
- -fsdev local,id=fs1,path=<path to shared folder>,security_model=none: this adds a new file system device to our emulation. Make sure to put the right directory at path. Don't worry about security_model=none: with it, files created or modified from inside the guest get the same permissions as if they had been created by the host user.
- -device virtio-9p-pci,fsdev=fs1,mount_tag=<shared folder name on mount>: this defines the type of the virtual device (virtio-9p-pci) and the name (tag) the guest uses to mount it.
Now we need to determine the mountpoint of the shared folder, i.e., what directory inside the VM our shared folder will be mounted to. For this, inside the VM, edit the file /etc/fstab with your preferred editor (nano or vi), and add this line at the end of the file:
# <device> <mountpoint> <type> <options> <dump> <pass>
shared_folder /root/host_folder 9p trans=virtio 0 0
Now reboot your VM and check that there is a /root/host_folder containing the same files and folders as shared_folder on the host machine. For now, your shared_folder is still empty, so the existence of the /root/host_folder directory should be enough proof that it's working (you can create a random file inside it from the host just to check that it shows up in the VM as well).
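If you prefer to check the mount from a program instead of by eye, here is a minimal sketch that looks for a 9p entry in /proc/mounts. It assumes the guest exposes /proc/mounts in the usual "device mountpoint type options dump pass" layout; the demo_ function names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if one /proc/mounts line describes a 9p mount at `mountpoint`. */
static bool demo_is_9p_mount(const char *line, const char *mountpoint)
{
	char dev[128], dir[256], type[32];

	assert(line && mountpoint);
	/* Only the first three fields matter here: device, mountpoint, type. */
	if (sscanf(line, "%127s %255s %31s", dev, dir, type) != 3)
		return false;
	return strcmp(dir, mountpoint) == 0 && strcmp(type, "9p") == 0;
}

/* Scan /proc/mounts; returns false if the file cannot be read. */
static bool demo_shared_folder_mounted(const char *mountpoint)
{
	char line[512];
	bool found = false;
	FILE *f = fopen("/proc/mounts", "r");

	if (!f)
		return false;
	while (!found && fgets(line, sizeof(line), f))
		found = demo_is_9p_mount(line, mountpoint);
	fclose(f);
	return found;
}
```

Inside the VM, demo_shared_folder_mounted("/root/host_folder") should return true once the fstab entry has taken effect.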
2. Add a syscall
Linux has a lot of syscalls, but we are going to add one more, in the name of science: a memory copy. We usually don't need the kernel to copy memory from one place to another in userspace, but we are adding this for learning purposes.
First, this is what the interface will look like:
sys_memcpy(void *src, void *dst, unsigned long int size);
When it succeeds, it will return 0. Otherwise, it will return an error code.
If you don't like my interface, feel free to be creative and try cool ways to
do a memcpy.
Note
Creating a syscall is very architecture dependent, since each arch has its own calling convention, different registers to use, and different syscall tables. In this tutorial, I'm going to add a syscall to the x86-64 ABI. Keep in mind that we will not cover the i386 or x32 ABIs in this tutorial, but this document provides all the useful insights needed to solve compatibility issues: Adding a New System Call - kernel.org
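To make the calling-convention point concrete, here is a minimal sketch of a three-argument syscall issued on x86-64 without going through libc. On this ABI the syscall number goes in rax, the first three arguments in rdi, rsi and rdx, and the syscall instruction clobbers rcx and r11. The helper names raw_syscall3 and raw_getpid are made up for this example, and on any other architecture the sketch simply falls back to libc's syscall().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a 3-argument syscall directly. */
static long raw_syscall3(long nr, long a0, long a1, long a2)
{
#if defined(__x86_64__)
	long ret;

	/* rax = number, rdi/rsi/rdx = arguments; rcx and r11 get clobbered. */
	__asm__ volatile ("syscall"
			  : "=a" (ret)
			  : "a" (nr), "D" (a0), "S" (a1), "d" (a2)
			  : "rcx", "r11", "memory");
	return ret;	/* raw kernel value: a negative errno on failure */
#else
	/* Other ABIs use different registers; let libc deal with them. */
	return syscall(nr, a0, a1, a2);
#endif
}

/* Example use: ask the kernel for our PID. */
static long raw_getpid(void)
{
	long pid = raw_syscall3(SYS_getpid, 0, 0, 0);

	assert(pid > 0);	/* getpid cannot fail */
	return pid;
}
```

The example uses getpid because it exists on every kernel; once your kernel is built, the same helper could be pointed at the new syscall number.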
2.1 Registering our new syscall
We need to register our syscall in a few places, so the kernel knows what to do
when userspace asks for it. The first file is
arch/x86/entry/syscalls/syscall_64.tbl: add a new line after the last entry
in the first table (this is not at the end of the file!). For Linux v5.6, this
is after pidfd_getfd:
437 common openat2 __x64_sys_openat2
438 common pidfd_getfd __x64_sys_pidfd_getfd
+439 common memcpy __x64_sys_memcpy
#
# x32-specific system call numbers start at 512 to avoid cache impact
Note
A correct multiplatform implementation would require the syscall to be
added to the syscall_32.tbl as well. The second table entry is only
required when there is a need to treat x32 syscall differently.
439 will be our syscall number. Take note of it, since we are going to use it in
other places as well.
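As a mental model of what that table entry buys us, here is a toy dispatcher in plain user-space C: the syscall number is just an index into an array of function pointers, and anything without an entry yields -ENOSYS. This is only a sketch; the real table is generated from the .tbl file at build time, and every demo_ and DEMO_ name below is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Every handler in this toy model takes three long arguments. */
typedef long (*demo_syscall_fn)(long, long, long);

#define DEMO_NR_PIDFD_GETFD 438
#define DEMO_NR_MEMCPY      439
#define DEMO_NR_SYSCALLS    440	/* one past the highest number in use */

static_assert(DEMO_NR_MEMCPY < DEMO_NR_SYSCALLS,
	      "the table needs a slot for the new syscall");

/* Made-up handlers that only report which one ran. */
static long demo_pidfd_getfd(long a, long b, long c)
{
	(void)a; (void)b; (void)c;
	return DEMO_NR_PIDFD_GETFD;
}

static long demo_memcpy(long a, long b, long c)
{
	(void)a; (void)b; (void)c;
	return DEMO_NR_MEMCPY;
}

/* Slots that are not listed stay NULL. */
static const demo_syscall_fn demo_table[DEMO_NR_SYSCALLS] = {
	[DEMO_NR_PIDFD_GETFD] = demo_pidfd_getfd,
	[DEMO_NR_MEMCPY]      = demo_memcpy,
};

/* Look the number up and call the handler, or report "no such syscall". */
static long demo_dispatch(unsigned long nr, long a0, long a1, long a2)
{
	if (nr >= DEMO_NR_SYSCALLS || demo_table[nr] == NULL)
		return -ENOSYS;
	return demo_table[nr](a0, a1, a2);
}
```

In this model, forgetting the table entry for 439 would make demo_dispatch(439, ...) return -ENOSYS, which is the same symptom you would see from userspace with an unregistered number.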
Add a function signature at include/linux/syscalls.h:
asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);
+asmlinkage long sys_memcpy(void __user *src, void __user *dst,
+ unsigned long len);
/*
* Not a real system call, but a placeholder for syscalls which are
I added it after the last syscall signature. Now that the kernel knows the
number and the signature, let's glue things together at
include/uapi/asm-generic/unistd.h:
#define __NR_pidfd_getfd 438
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
+#define __NR_memcpy 439
+__SYSCALL(__NR_memcpy, sys_memcpy)
+
#undef __NR_syscalls
-#define __NR_syscalls 439
+#define __NR_syscalls 440
The last place to register our syscall is at kernel/sys_ni.c:
COND_SYSCALL(setuid16);
+COND_SYSCALL(memcpy);
+
/* restartable sequence */
This is required to provide a fallback stub implementation of our syscall
that returns -ENOSYS.
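The idea behind such a fallback can be sketched in user space with a weak symbol: the weak definition is only used when no other object file provides a strong one. This is a simplified model of the mechanism, not the kernel's actual macro expansion, and the demo_ names are made up.

```c
#include <assert.h>
#include <errno.h>

/* Generic "not implemented" handler, modeled on the kernel's sys_ni_syscall(). */
static long demo_sys_ni_syscall(void)
{
	return -ENOSYS;
}

long demo_sys_memcpy(void);

/*
 * Weak fallback: the linker picks this definition only if no other
 * object file provides a strong demo_sys_memcpy().
 */
__attribute__((weak)) long demo_sys_memcpy(void)
{
	return demo_sys_ni_syscall();
}

/* What a caller sees when the real implementation is not compiled in. */
static long demo_call_memcpy(void)
{
	long ret = demo_sys_memcpy();

	assert(ret <= 0);	/* this model reports errors as negative errno values */
	return ret;
}
```

Linking in a second file with a regular (strong) demo_sys_memcpy() would silently replace the stub, which is why the real syscall keeps working once its source file is compiled.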
2.2 Writing some code
Finally, let's add the code of our syscall. I created a file kernel/memcpy.c,
but you can put it wherever you want, just make sure it gets compiled. I also
added obj-y += memcpy.o to kernel/Makefile. Let's see, step by step, what our
code needs to do. The main include we need is <linux/syscalls.h>; kmalloc() and
the user-copy helpers are declared in <linux/slab.h> and <linux/uaccess.h>,
which are worth including explicitly rather than relying on indirect includes.
To declare the syscall, we are going to use the macro SYSCALL_DEFINE3, which will
do some magic for us. Note that the type and the name of each argument are separated
by a comma:
SYSCALL_DEFINE3(memcpy, void __user *, src, void __user *, dst,
unsigned long, len)
3 is for syscalls with three arguments. The maximum number of arguments that a
syscall can have is six, since some architectures don't have enough registers to
deal with a 7th argument. If you need to pass more than six values, you need
to use a pointer to a struct. When this macro is expanded, we get the
same function signature we declared in include/linux/syscalls.h.
The __user is an attribute not used by compilers, but used by static
analyzers (like sparse) to check that you are not misusing user data.
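To see why the compiler ignores it, here is a small user-space sketch of how such an annotation can be defined. The definition mirrors what I recall from kernels around v5.6 (sparse defines __CHECKER__, a normal compiler does not), so treat the exact attribute spelling as an assumption; demo_fetch is a made-up helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef __CHECKER__
/* Seen only by sparse: a separate address space that must not be dereferenced. */
# define __user __attribute__((noderef, address_space(1)))
#else
/* Seen by the compiler: the annotation disappears entirely. */
# define __user
#endif

/*
 * Made-up helper that is "allowed" to read annotated memory. It returns the
 * number of bytes NOT copied, like the kernel's copy helpers.
 */
static size_t demo_fetch(void *to, const void __user *from, size_t n)
{
	assert(to && from);
	/* With the macro empty, this cast is a no-op for the compiler. */
	memcpy(to, (const void *)from, n);
	return 0;
}
```

With a normal compiler this builds as if __user were not there; the benefit only appears when a checker that understands the attribute flags a direct dereference of an annotated pointer.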
Note
What kinds of misuse are possible with user memory? Hint: check out the difference in how kernel memory and user memory are mapped, and read about the memory management unit.
Now, we move along defining our function. First, we will need a kernel buffer to temporarily store data:
{
void *buf;
buf = kmalloc(len, GFP_KERNEL);
if (!buf)
return -ENOMEM;
Note
Exercise: how about defining a maximum value for len and
returning an error code if it's bigger than we support?
Now, we need to store the data from the user in the kernel. For that, we
are going to use the function copy_from_user(). It's imperative to use
this function to copy data from userspace to the kernel, since it checks whether
the pointer and the size are valid. If you want to have some fun seeing some errors,
use the internal memcpy() implementation instead.
if (copy_from_user(buf, src, len)) {
	kfree(buf);	/* release the buffer before bailing out */
	return -EFAULT;
}
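The validity check mentioned above can be pictured as a range test: the whole span from the address to the address plus the size has to stay below the top of the user address space, and the addition must not wrap around. The sketch below models only that test, assuming the 47-bit x86-64 layout for the limit; the real helper also copes with addresses that are in range but not mapped, which this model ignores. Names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed top of the user address range (47-bit x86-64 layout). */
#define DEMO_USER_LIMIT 0x00007ffffffff000ULL

static_assert(DEMO_USER_LIMIT < UINT64_MAX, "limit must leave room for kernel space");

/* True if [addr, addr + size) lies entirely inside the user range. */
static bool demo_range_ok(uint64_t addr, uint64_t size)
{
	/* Written so that addr + size is never computed and cannot overflow. */
	if (size > DEMO_USER_LIMIT)
		return false;
	return addr <= DEMO_USER_LIMIT - size;
}
```

A kernel-looking address such as 0xffff888000000000 fails this test, and so does a huge size chosen to make the sum wrap around to a small number.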
EFAULT is used for invalid memory access. We are almost done! We just need
to copy the data back to the user and finish the syscall. To do that, the kernel
also provides a copy_to_user():
if (copy_to_user(dst, buf, len)) {
	kfree(buf);	/* release the buffer before bailing out */
	return -EFAULT;
}
kfree(buf);
return 0;
}
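One common way to make sure the temporary buffer is released on every path, including the error returns, is a single exit label. Here is a user-space model of that shape, with malloc() standing in for kmalloc() and two made-up helpers standing in for the user-copy functions (they treat NULL as an invalid user pointer). It is a sketch of the pattern, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the user-copy helpers: return the number of bytes NOT copied. */
static unsigned long demo_copy_from_user(void *to, const void *from, unsigned long n)
{
	assert(to != NULL);	/* the kernel-side buffer must be valid */
	if (!from)
		return n;	/* pretend NULL is a bad user pointer */
	memcpy(to, from, n);
	return 0;
}

static unsigned long demo_copy_to_user(void *to, const void *from, unsigned long n)
{
	assert(from != NULL);
	if (!to)
		return n;
	memcpy(to, from, n);
	return 0;
}

static long demo_sys_memcpy(const void *src, void *dst, unsigned long len)
{
	long ret = 0;
	void *buf;

	if (len == 0)
		return 0;	/* nothing to copy */

	buf = malloc(len);	/* stands in for kmalloc(len, GFP_KERNEL) */
	if (!buf)
		return -ENOMEM;

	if (demo_copy_from_user(buf, src, len)) {
		ret = -EFAULT;
		goto out;
	}
	if (demo_copy_to_user(dst, buf, len))
		ret = -EFAULT;
out:
	free(buf);		/* single place where the buffer is released */
	return ret;
}
```

The same structure carries over to the kernel version by swapping in kmalloc(), kfree() and the real copy helpers.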
In the end, my kernel/memcpy.c file looks like this:
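#include <linux/slab.h>		/* kmalloc(), kfree() */
#include <linux/syscalls.h>	/* SYSCALL_DEFINE3() */
#include <linux/uaccess.h>	/* copy_from_user(), copy_to_user() */

SYSCALL_DEFINE3(memcpy, void __user *, src, void __user *, dst,
		unsigned long, len)
{
	void *buf;

	/* Temporary kernel buffer for the data in transit. */
	buf = kmalloc(len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* Both helpers return the number of bytes they could NOT copy. */
	if (copy_from_user(buf, src, len)) {
		kfree(buf);	/* don't leak the buffer on the error path */
		return -EFAULT;
	}
	if (copy_to_user(dst, buf, len)) {
		kfree(buf);
		return -EFAULT;
	}

	kfree(buf);
	return 0;
}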
The kernel side is ready. Now it's time to use our syscall from userspace.
3. The user side
Let's test our syscall from userspace! Remember to recompile your kernel and boot it with the shared folder, so that the files created inside the VM are also saved on your host machine.
Build and Boot in your VM
make -j$(nproc)
qemu-system-x86_64 \
-drive file=../my_disk.raw,format=raw,index=0,media=disk \
-m 4G -nographic \
-kernel ./arch/x86_64/boot/bzImage \
-append "root=/dev/sda rw console=ttyS0 loglevel=6" \
-fsdev local,id=fs1,path=../shared_folder,security_model=none \
-device virtio-9p-pci,fsdev=fs1,mount_tag=shared_folder \
--enable-kvm
Glibc provides a wrapper for calling syscalls, conveniently called syscall().
All we need to do is use the first argument as the syscall number, and the
following ones as the syscall's arguments.
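As a warm-up that works on any kernel, here is a minimal sketch of the same mechanism with a syscall that already exists: gettid, which takes no arguments. The wrapper name demo_gettid is made up; SYS_gettid comes from <sys/syscall.h>.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Typed wrapper: first argument is the syscall number, no further arguments. */
static pid_t demo_gettid(void)
{
	long ret = syscall(SYS_gettid);

	assert(ret > 0);	/* gettid cannot fail */
	return (pid_t)ret;
}
```

In the main thread of a process the thread ID equals the process ID, so demo_gettid() and getpid() should agree there.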
Let's include headers to have access to syscall(), printf() and errno:
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
Define the number of our system call, and the size of the test array:
#define __NR_memcpy 439
#define ARR_LEN 10
Note
If you run make headers_install with INSTALL_HDR_PATH pointing at the
usr directory of your test environment's rootfs (the headers land in
its include subdirectory), you don't need to define __NR_memcpy, you
can just include <linux/unistd.h>.
Create some variables, and a test array:
int ret, i;
int a[] = {1, 2, 3, 777, '5', 'a', 0x0800, -15500, 42, 1337};
int b[ARR_LEN];
Finally, call the syscall to do the magic:
ret = syscall(__NR_memcpy, a, b, ARR_LEN * sizeof(int));
if (ret == -1)
printf("error: %d\n", errno);
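A bare error number is not very readable, so here is a small sketch of how the failure path can be reported with strerror(). To keep it runnable on an unmodified kernel, it provokes an error with a syscall that exists everywhere, close() on an invalid file descriptor: the kernel returns -EBADF, and syscall() turns that into a return value of -1 with errno set to EBADF. The function name is made up.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Call close(-1) through syscall() and return the errno it produced (0 on success). */
static int demo_failing_syscall(void)
{
	long ret;

	errno = 0;
	ret = syscall(SYS_close, -1);
	assert(ret == 0 || ret == -1);	/* close only ever reports 0 or -1 here */

	if (ret == -1) {
		int saved = errno;	/* save it before another libc call overwrites it */

		fprintf(stderr, "close(-1) failed: %s (errno %d)\n",
			strerror(saved), saved);
		return saved;
	}
	return 0;
}
```

The same pattern applies to the new syscall: an invalid pointer should surface as errno set to EFAULT, and an unregistered number as ENOSYS.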
And check if both arrays are identical:
for (i = 0; i < ARR_LEN; i++) {
if (a[i] != b[i])
printf("error, value %d is different\n", i);
}
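If you only care whether the arrays match, and not which element differs, the loop above can be replaced by a single memcmp() over the raw bytes. A minimal sketch, with a made-up helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the first n ints of a and b are byte-for-byte identical. */
static bool demo_arrays_equal(const int *a, const int *b, size_t n)
{
	assert(a && b);
	return memcmp(a, b, n * sizeof(*a)) == 0;
}
```

For plain int arrays a byte comparison is equivalent to comparing element by element; the explicit loop is still handy when you want the index of the first mismatch.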
And if nothing was printed, our syscall worked! This is the complete user code:
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

#define __NR_memcpy 439
#define ARR_LEN 10

int main(void)
{
	int ret, i;
	int a[] = {1, 2, 3, 777, '5', 'a', 0x0800, -15500, 42, 1337};
	int b[ARR_LEN];

	ret = syscall(__NR_memcpy, a, b, ARR_LEN * sizeof(int));
	if (ret == -1)
		printf("error: %d\n", errno);

	for (i = 0; i < ARR_LEN; i++) {
		if (a[i] != b[i])
			printf("error, value %d is different\n", i);
	}
	return 0;
}
Exercise
- Use strace to see your syscall in action. Use gdb and catch syscall <syscall_number> to stop the program when your syscall is called. Use info reg when the program stops to see the values of the registers.
- Let's create a syscall that not only copies the data, but also modifies it. Have you ever heard of the Caesar cipher? Implement a syscall that, for a given string, number of rotations and operation, encrypts or decrypts the string.
  #define OP_ENCRYPT 0
  #define OP_DECRYPT 1
  sys_caesar(char *in_str, char *out_str, unsigned int op, unsigned int rot);
- Join all the data of our syscall inside a struct, and send this struct to the kernel. Such a struct can look like this:
  struct memcpy_data {
  	void *src;
  	void *dst;
  	unsigned long int size;
  };
  And modify the syscall signature to be like this:
  sys_memcpy(struct memcpy_data *data);
  Check the size of this struct. Will this size be the same in all architectures? Compile a userspace program for the i386 ABI (with gcc -m32) and try to use the syscall as is. Do you think your implementation will work? Check the compatibility documentation and try to fix your code.
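For the Caesar exercise, it helps to have a plain user-space reference to compare your syscall's output against. The sketch below is one possible reading of the interface above, under my own assumptions: only ASCII letters are rotated, everything else is copied unchanged, and an unknown operation returns -1. It is not the kernel-side solution, and the demo_ names are made up.

```c
#include <assert.h>
#include <stddef.h>

#define OP_ENCRYPT 0
#define OP_DECRYPT 1

/* Rotate one ASCII letter forward by `shift` (0..25); leave other bytes alone. */
static char demo_rotate(char c, unsigned int shift)
{
	if (c >= 'a' && c <= 'z')
		return (char)('a' + (c - 'a' + shift) % 26);
	if (c >= 'A' && c <= 'Z')
		return (char)('A' + (c - 'A' + shift) % 26);
	return c;
}

/* out_str must have room for the input plus its terminating NUL. */
static int demo_caesar(const char *in_str, char *out_str,
		       unsigned int op, unsigned int rot)
{
	unsigned int shift = rot % 26;
	size_t i;

	assert(in_str && out_str);
	if (op != OP_ENCRYPT && op != OP_DECRYPT)
		return -1;
	/* Decrypting is the same as rotating forward by the complement. */
	if (op == OP_DECRYPT)
		shift = (26 - shift) % 26;

	for (i = 0; in_str[i] != '\0'; i++)
		out_str[i] = demo_rotate(in_str[i], shift);
	out_str[i] = '\0';
	return 0;
}
```

Running your test program against both this function and the syscall with the same inputs gives a quick way to spot off-by-one mistakes in the rotation.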