Linux System/Driver Training (Linux System/Driver Schulung)
Dr.-Ing. Oliver Tschäche
Tschaeche IT-Services
July 14, 2005
© 2005 Tschaeche IT-Services, Oliver Tschäche, http://www.tschaeche.com

Activities (Tätigkeiten)
- 1996: Diploma in electrical engineering, specializing in microelectronics
- 1996-2000: University of Erlangen
  - System/network administrator
  - Doctorate: hardware development of fault-tolerant arithmetic units
- 2000: Founded Tschaeche IT-Services
- 2000-2004: Freelance consultant/developer + university
  - Caldera: distributed version control system
  - DBench: PC simulator FAUmachine
  - University of Erlangen: lecture "Design of Hardware and its Linux Drivers" (DHWK)
- Since 2005:
  - Siemens: benchmarking of current NUMA systems
  - University: FAUmachine, DHWK
Personal career goal:
- 90% active participation in development projects
- 10% teaching/training/consulting

Session goals
- get new ideas on how to implement applications efficiently
- a good understanding of how user space system calls relate to the Linux driver interface
- an overview of communication mechanisms
  - between user space processes themselves
  - between user space processes and kernel drivers
- presentation of programming APIs: user/kernel space

Focus
- System programming: interface between user space and kernel space (list of system calls)
- Not the libraries (libc is only used as a trampoline to switch to kernel mode)
- Kernel driver perspective: how to offer an efficient hardware interface to user space

Outline
- System Startup
- Process Implementation
- Communication Mechanisms
- Design Decisions
- User space API, system programming
- Kernel Space API

Outline: System Startup
1. Hardware/BIOS startup
2. Optionally a bootloader
3. Linux

Hardware/BIOS startup
BIOS: implemented in PROM, EEPROM or other non-volatile memory
1. HW power-good
2. HW reset
3. Starting BIOS
4. Power On Self Test (POST)
5. Initial hardware setup
6. Optionally load and start a bootloader

Bootloader
- Bootloader: u-boot, BIOS built-in, (x86: GRUB, Lilo, PXElinux, ...)
- Boot into different OSs
- Tasks of a bootloader in the Linux environment:
  - Set up the kernel command line: driver setup, initrd, rootfs, init
  - Load kernel
  - Load RAM disk
  - Start kernel

Linux Startup - User Space View
1. Initialize hardware, drivers, sockets, IPC, ...
2. Mount the root filesystem; its location is compiled into the kernel or provided on the command line
3. Some systems: start the first process (/linuxrc) from the RAM disk
   3.1 load hardware-dependent modules (e.g. the SCSI driver for the root FS)
   3.2 free the RAM disk, mount the effective root filesystem
4. Start/replace the first process: /sbin/init (compiled in) or the init parameter (command line)
   4.1 init executes system configuration scripts:
       - hardware config: e.g. filesystem checks, configure speed/protocol of serial consoles, configure network interfaces (IP, routing)
       - start daemons: crond, syslog, ssh, ...
   4.2 init starts gettys on consoles

Linux Startup - Kernel Space View I
1. The bootloader jumps to byte 0x1000 (_stext) of the loaded image
2. _stext (arch/<host>/kernel/head.S) initializes the stack pointer and performs the other steps necessary to create a minimal C runtime environment
3. start_kernel (init/main.c) prints the startup banner, parses the command line, calls the other initialization functions
4. setup_arch (arch/<host>/kernel/setup.c) detects memory, enables the host MMU (paging_init()), sets up the host-specific read/write I/O port functions in the machine vector
5. trap_init (arch/<host>/kernel/traps.c) initializes interrupt capabilities (not enabled yet)
6. init_IRQ (arch/<host>/kernel/irq.c) initializes the hardware IRQ system with disabled IRQs. IRQ lines are enabled on request (request_irq())

Linux Startup - Kernel Space View II
7. sched_init (kernel/sched.c) initializes the pidhash array and bottom-half handlers
8. softirq_init (kernel/softirq.c) initializes the softirq subsystem; softirqs are later managed by the kernel's ksoftirqd
9. time_init (arch/<host>/kernel/time.c) initializes the kernel timer tick system, usually by installing an interrupt handler
10. console_init (drivers/char/tty_io.c)
11. init_modules (kernel/module.c)
12. kmem_cache_init (mm/slab.c) initializes the kernel buffer organisation
13. calibrate_delay (init/main.c) calculates the BogoMIPS
14. mem_init (arch/<host>/mm/init.c)
15. kmem_cache_sizes_init (mm/slab.c)
16. fork_init (kernel/fork.c)

Linux Startup - Kernel Space View III
17. proc_caches_init (kernel/fork.c)
18. vfs_caches_init (fs/dcache.c)
19. buffer_init (fs/buffer.c)
20. page_cache_init (mm/filemap.c)
21. signal_init (kernel/signal.c)
22. proc_root_init (fs/proc/root.c) initializes the /proc filesystem
23. ipc_init (ipc/util.c)
24. check_bugs (include/asm-<host>/bugs.h)
25. smp_init (init/main.c) initializes the IOAPIC on Intel architectures, does nothing on others
26. rest_init (init/main.c) frees memory, launches init()
27. init (init/main.c) completes the kernel startup:
    27.1 do_basic_setup (init/main.c) initializes hardware (PCI, network, ...)

Linux Startup - Kernel Space View IV
    27.2 prepare_namespace (init/main.c) mounts the root filesystem
    27.3 creates stdin, stdout, stderr
    27.4 execve of the initial process (usually /sbin/init)
28. do_initcalls (init/main.c) calls the init functions of compiled-in modules
29. mount_root (fs/super.c) actually mounts the root fs
see: http://billgatliff.com/ bgat/articles/emb-linux/startup.pdf

Outline: Process Implementation
- Process local Stuff
- Filesystem Stuff
- Table of Signal Handlers
- Tracing a Process
- Virtual Memory
- Capabilities
- Resources
- Operations on a Process
API: clone, fork, execve

Process local Stuff
- effective UID/GID: owner/group under which the process is running
- pid: created by the kernel, can't be changed
- process group ID: inherited from the parent
  API: setpgid()
- thread group ID: inherited from the parent, or the pid of the child if it starts a new thread group. This is controlled with the CLONE_THREAD flag of the clone system call
- thread ID: equal to the PID unless the process is part of a thread group (CLONE_THREAD)
  API: gettid()

Filesystem Stuff
- Filesystem Information
- File Descriptor Table
- Name Space

Filesystem Stuff: Filesystem Information
- umask
  API: umask
- Current working directory
  API: chdir
- Root directory
  API: chroot

Filesystem Stuff: File Descriptor Table
- Integer referring to an entry in kernel space
- an entry may be a socket, pipe, file, memory mapped region
API: open, socket, accept (listen), close, read, write, ioctl, ...

Filesystem Stuff: Name Space
- Filesystem hierarchy
  API: mount, umount

Filesystem Stuff: Process of Path Resolution
Path definition:
- Starting with '/' ⇒ absolute path
- Starting with anything other than '/' ⇒ relative path
Process of path resolution:
1. Select a starting lookup directory
2. Follow path components with a trailing '/'
3. Evaluate the final path component
Source: man 2 path_resolution

Filesystem Stuff, Path Resolution
1. Select the starting lookup directory depending on the first character:
   - if it is a '/' ⇒ use the root-dir element of the process
   - if it is not a '/' ⇒ use the cwd element of the process
   - ATTENTION: the cwd does not have to lie below the root-dir

Filesystem Stuff, Path Resolution
2. Change the current lookup directory according to the path components with a trailing '/':
   1. fail with EACCES if the process has no search permission
   2. fail with ENOENT if the component is not found
   3. fail with ENOTDIR if the component is neither a directory nor a symbolic link
   4. if the component is a directory, set the current lookup directory to this component
   5. if the component is a symbolic link, resolve it
      - if the result is not a directory ⇒ fail with ENOTDIR
      - if the result is a directory ⇒ set the current lookup directory to it
   The resolution process involves limited recursion and fails with ELOOP if the limit is reached

Filesystem Stuff, Path Resolution
3. Find the final entry
   - it does not need to be a directory
   - if it does not exist, that is not necessarily an error
     - depends on the system call
     - e.g. the open syscall may want to create it

Table of Signal Handlers
- contains pointers to handler functions
- signal mask and pending signals are elements of each process
API: sigaction, signal

Tracing a Process
- flag indicating that the process is traced
API: ptrace

Virtual Memory / User Address Space
- segments: text, data, bss, stack
  API: brk, sbrk
- memory mapped files
  API: mmap, munmap
- state of paging
  API: mlock, mlockall, munlock, munlockall
- Replacing segments and memory mappings by loading a new program, but keeping files
  API: execve

Resource Limits/Usage
- as (address space): size
- core: size of core file (truncated if greater)
- cpu: seconds (SIGXCPU, finally SIGKILL)
- data: data segment size
- memlock: maximum number of bytes locked in RAM
- stack: size of the process stack (SIGSEGV: use an alternate stack)
- fsize: maximum size of a created file (write, truncate)
- locks: number of file locks
- nofile: maximum file descriptor number + 1
- ofile: BSD-compatible name for nofile
API: getrlimit, setrlimit, getrusage

Capabilities
- Filesystem: chown, dac_override, dac_read_search, fowner, fsetid, mknod
- IPC: ipc_lock, ipc_owner, kill
- Network: net_admin, net_bind_service, net_broadcast, net_raw
- Process: setuid, setgid, setpcap
- System: admin, boot, chroot, module, nice, pacct, ptrace, rawio, resource, time, tty_config
API: capset, capget
see: man 7 capabilities

Operations on a Process
- parent death signal
- core dump
- keep capabilities on the UID transition from uid 0 to non-0
API: prctl

Outline: Communication Mechanisms
- Signal Handling
- Filedescriptor
- IPC
- Comparison

Outline: Signal Handling
- Signal Basics: Action, Type
- Sending Signals - Permissions
- Receiving/Handling Signals
- Standard (POSIX.1) Signals vs. Real-Time (POSIX 1003.1-2001, formerly POSIX.4) Signals
see: man 7 signal

Signal Handling: Signal Basics
- Predefined actions for the receiving process:
  - Term: terminate the process
  - Ign: ignore the signal, just go on
  - Core: terminate the process and dump core
  - Stop: stop the process (send SIGCONT to restart)
- Standard (POSIX.1) signals:
  - predefined meaning, e.g. HUP, QUIT, KILL, CHLD, ILL, SEGV, ...
  - signal numbers depend on the architecture
  - the default action depends on the signal type
- Real-Time (POSIX 1003.1-2001) signals:
  - no predefined meaning (LinuxThreads uses the first three)
  - can be used for application-defined purposes
  - the default action is to terminate the receiving process
see: man 7 signal

Signal Handling: Standard vs. Real-Time Signals
- Priority is not defined by POSIX, but Linux (like most Unices) handles standard signals first
- Multiple instances are queued for real-time signals; standard signals queue only one instance
- Real-time signals are delivered in a guaranteed order:
  - multiple real-time signals of the same type are delivered in the order they were sent
  - real-time signals of different types are delivered starting with the lowest-numbered signal

Signal Handling: Receiving/Handling Signals
- A process can't catch SIGKILL/SIGSTOP
- Access to global variables from a signal handler: use volatile
- Install an alternate stack: sigaltstack()
- Simple handler: signal(), ANSI C
  - the only argument is the number of the signal
  - after catching one signal the handler has to be registered again
- Advanced handler: sigaction(), POSIX
  - may operate on an alternate stack
  - can make certain system calls restartable across signals
  - configurable one-shot behaviour
  - allows recursive occurrence (within the handler)
  - provides additional information (sigqueue, other)

Signal Handling: Sending Signals
- Permissions:
  - the sender is privileged (CAP_KILL), or
  - the effective or real UID of the sending process equals the real or saved set-user-ID of the target process
  - in case of SIGCONT: it is sufficient that sending and receiving process belong to the same session
- API: raise, kill, killpg, (CLONE_THREAD: tkill, tgkill), sigqueue
- API: sigqueue → adds data (one single integer or pointer)

Signal Handling: Signals Sent by the Kernel
- terminal I/O: SIGINT, SIGQUIT, SIGTSTP
- sockets: SIGURG, SIGPIPE, SIGIO
- write(): SIGPIPE
- alarm(), setitimer(), sleep(): SIGALRM, SIGVTALRM, SIGPROF
- abort(): SIGABRT
- fork(), clone(): SIGCHLD (sent to the parent when the child terminates)
- execve(): SIGTRAP (if the process is traced)
- many more...
Signal Handling: Blocking Signals
- Blocking signals:
  - select which signals are blocked: sigprocmask()
  - get the set of pending (blocked) signals: sigpending()
  - sleep until a signal is delivered: pause(), sigsuspend()

Outline: Filedescriptors
What is a filedescriptor:
- VFS - Virtual Filesystem Layer
- Pipes
- Sockets
Features:
- Wait on events of more than one filedescriptor
  API: select
- Works in harmony with signals

Outline: Filedescriptors - Virtual Filesystem Layer
- VFS offers the API - the filesystem implements the functionality
- Filesystems
- Permissions
- File Types
- Example: procedure performed on open
- API: creat, open, close, read, write, fcntl, flock, umask, chmod, fchmod, chown, dup, dup2, link, unlink, mknod, stat, mmap, munmap, utime

Filedescriptors - Virtual Filesystem Layer
Filesystems:
- Mountpoint in the name space of a process
- Implementations of block-based FSs: ext2/3, xfs, reiser, ...
- Other FSs: proc, tmpfs, dev, sys, ...
- different capabilities:
  - memory mapping: mmap()
  - read/write methods: readv(), pread(), writev(), pwrite()

Filedescriptors - Virtual Filesystem Layer
Permissions:
- Access modes:
  - read: read a file
  - write: write a file
  - execute: execute a file
- Originator:
  - User: effective user ID of the requesting process
  - Group: group IDs of that user
  - Other: everyone not matched by User or Group
- Special flags:
  - Set-UID flag: changes the effective user ID on execve
  - Set-GID flag: changes the group ID on execve; on a directory, new files keep the GID of the directory
  - Sticky flag (on a directory): the directory's write permission does not apply to files already created by other users

Filedescriptors - Virtual Filesystem Layer
File types:
- data: some bytes
- directories
- soft links: string references to other locations
- hard links: references to other locations at inode level
- character/block devices: drivers accessible through the FS, e.g. tty, hard disks, ...
- named pipes (fifos) and named unix sockets

Filedescriptors - Virtual Filesystem Layer
(Simplified) open procedure: open("/tmp/test", O_CREAT | O_WRONLY, 0777)
1. Look up the path and finally the file (man 2 path_resolution). Process data used:
   - name space
   - root or cwd
   - UID/GID: access permissions of directories
2. create the file using the umask of the process and 0777
3. create the kernel file entry, register functionality
4. create an entry in the process's filedescriptor table
5. finally pass the number of the entry to user space
(Simplified) write to a file: write(1, "hallo", 5)
- 1 → filedescriptor → file operation → write

Filedescriptors - Pipes/Fifos
- Local to the host
- One-way communication
- Stream oriented: one process writes → one process reads
- Fifos are named pipes within the filesystem (access permissions), pipes are invisible
- Opening a fifo blocks until both a reading and a writing process are connected

Outline: Filedescriptors - Sockets
- Basics
- Protocol Family

Filedescriptors - Sockets
- Protocol family (implementing one communication semantic)
- Communication semantics:
  - SOCK_STREAM: reliable, sequenced, two-way, connection-based byte stream
    API: read, write, send, recv, accept, connect, listen
  - SOCK_DGRAM: connectionless, unreliable messages
    API: sendto, recvfrom; read, write after a preceding connect()
  - SOCK_SEQPACKET: reliable, sequenced, two-way messages
    API: read, write (data may be discarded), send, recv, accept, connect, listen
  - SOCK_RAW: raw network protocol
    API: sendto, recvfrom
  - SOCK_RDM: reliable datagrams, ordering not guaranteed
    API: sendto, recvfrom
  API: socket
- Bind the socket to an address (dependent on protocol family and communication semantics)
  API: bind

Filedescriptors - Sockets
Protocol families:
- Unix: created host-local within the file system (file type s)
  see: man 7 unix
- Inet: IP-based network communication:
  - RAW: raw IP datagrams (SOCK_RAW)
    see: man 7 raw
  - UDP: UDP/IP datagrams (SOCK_DGRAM)
    see: man 7 udp
  - TCP: TCP/IP data stream (SOCK_STREAM)
    see: man 7 tcp
  see: man 7 inet
- Netlink: transfer datagrams between kernel modules and user space processes
  see: man 7 netlink
- PACKET: send/receive at device driver (OSI layer 2) level
  see: man 7 packet
- X25, IPX, INET6, APPLETALK

Outline: IPC
IPC elements:
- Basics
- Message Queues
- Semaphore Sets
- Shared Memory Segments
see: man 5 ipc

IPC: Basics
IPC name space:
- Handled with keys (integers) instead of filenames
- Conversion of a name into a key: man 3 ftok
- Key private to a process (and its children): use IPC_PRIVATE
Access permissions:
- read, write (no execute)
- UID, GID, others
- like the well-known FS permissions
- API: msgctl, semctl, shmctl

IPC: Message Queues
- Creation of a message queue:
  - IPC_CREAT: create the queue if it does not exist yet
  - IPC_EXCL (together with IPC_CREAT): fail if the queue already exists
  API: msgget
- Send a message: the type must be set
  API: msgsnd
- Receive a message depending on the type:
  - Type == 0: the first message in the queue is read
  - Type > 0: the first message with that type is read (or, with MSG_EXCEPT, the first message not of that type)
  - Type < 0: the first message with the lowest type less than or equal to abs(Type) is read
  API: msgrcv
- Control/remove a message queue
  API: msgctl

IPC: Semaphore Sets
- Creation of a semaphore set
  API: semget
- Using semaphores:
  - may operate on more than one element, but atomically
  - a semaphore element (SE) is initialised to 0
  - an operation may automatically be undone on process termination (SEM_UNDO)
  - operation modes (om):
    - om > 0: increase SE by om, no wait necessary
    - om == 0: wait for SE == 0
    - om < 0: wait until SE >= abs(om), then decrease SE by abs(om)
  API: semop, semtimedop
- Control/remove a semaphore set
  API: semctl

IPC: Shared Memory Segments
- Creation of a shared memory segment
  API: shmget
- Attaching to shared memory segments; works like mmap, munmap
  API: shmat, shmdt
- Control/remove a shared memory segment
  API: shmctl
Design Decisions
- System programming:
  - Signals vs. Sockets
  - IPC vs. Filedescriptors
- Kernel programming:
  - Kernel Driver vs. User Space Driver
  - Module vs. Built-in Driver
  - Device Control

Signals vs. Sockets
- Signals:
  + "Broadcast" (sending to process groups)
  + Short latency (handled by the signal handler)
  o Priorities
  o OS-dependent order of handling
  - Small amount of data (signal number/siginfo)
  - Granularity of permissions

Signals vs. Sockets
- Sockets:
  + variable message size (implement a protocol)
  + implementation-defined order (select)
  + high number of originators (one socket for each type of message)
  o FS permissions for Unix sockets
  - higher latency compared to signals (2 syscalls: select, read/write)

IPC vs. Filedescriptors
- IPC properties:
  + More flexible permissions: uid/gid of the user can be chosen by a non-privileged user
  + Semaphores
  - can't wait simultaneously on different message queues
- Filedescriptor properties:
  + wait on several file descriptors (API: select)
  + mmap implements shared memory
  - access permissions of 2.4 (extended attributes/ACLs are only implemented in 2.6)

Kernel Driver vs. User Space Driver
Pros of a user space driver:
- Full libc support: can do exotic tasks
- Easy debugging without having to go through contortions to debug a running kernel
- If the driver hangs you can kill it and keep the system running (unless the hardware is misbehaving)
- User memory is swappable: large drivers which are used infrequently can be swapped out
- A well-designed driver program can still allow concurrent access to a device

Kernel Driver vs. User Space Driver
Drawbacks of a user space driver:
- Direct access to memory is possible only by mmapping /dev/mem (privileged operation)
- Access to I/O ports is available only after calling ioperm or iopl (MIPS support?); access through /dev/port may be too slow to be effective (both are privileged operations)
- Response time is slower because a context switch is required to transfer information or actions between the client and the hardware
- Worse yet, if the driver has been swapped to disk, the response time increases terribly. mlock may help, but libraries need to be locked too (privileged operation)
- The most important devices can't be handled in user space, including (but not limited to) network interfaces and block devices

Module vs. Built-in Driver
- A driver can be written so that both are possible
- Built-in driver:
  + kernel memory: large contiguous physical blocks are only available at boot
  + does not need module support
  - reboot after driver source changes during development
  - reboot when driver parameters change
- Module:
  + loadable during run-time: remove - modify - load
  + driver modes can be implemented by parameters
  - difficult to get contiguous physical memory regions

Device Control
Controlling by write:
- for 'robotic' devices which don't transfer data but just respond to commands:
  - command-oriented: data is never sent (written)
  - a command interpreter is easier to implement than ioctl
  - easy to use from user space: echo, cat
- large driver (parser)
- escape sequences have to be implemented for data transfers
Controlling by ioctl:
- completely avoids write (can't be used with echo or cat)
- keeps the driver small (no parser)
- user space must implement each command

Overview: system programming
- process control
- signal handling
- file system file descriptors
- sockets file descriptors
- waiting/checking for multiple events
- ioctl

Overview: process control
- create new processes: clone(), fork(), vfork()
- replacing a process: execve()
- terminating a process: _exit(), exit()
- handling children: wait(), waitpid(), wait3(), wait4()

Creating new processes
clone(function, stack, flags, arg): implements threads
- sys_clone: linux-2.4.20/arch/mips/kernel/syscall.c
  - copies the state of the process (all registers including stack pointer and program counter)
  - the child shares selectable parts of its parent's context (FS, FILES, NS, SIGHAND, VM)
  - the signal type for termination is provided in flags (ORed in)
  - use special options in the wait-family calls if not using SIGCHLD!
- unlike fork(), clone() is a library call:
  - the library checks for the alternate stack; sys_clone itself supports going on with NULL
  - the library calls function with argument arg for the child
  - when the function returns, the library calls exit() with the return value of function
  - the caller must provide an alternate stack for the child (although sys_clone supports working on a copy)
- example: clone_umask.c

Creating new processes
- fork() calls clone() with SIGCHLD as flags:
  - shares NS
  - creates a copy of FS, FILES, SIGHAND, VM
  - VM: uses copy-on-write for mapped pages
  - example: see examples.user-api/fork_simple.c
- vfork() calls clone() with SIGCHLD | CLONE_VM | CLONE_VFORK:
  - shares NS and VM!
  - creates a copy of FS, FILES, SIGHAND
  - VM: be careful with the stack, it is shared with the parent!
  - the parent is suspended until the child terminates or calls execve()
  - mostly used when a call to execve() follows very soon
see: arch/mips/kernel/syscall.c c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Replacing a process execve(filename, args, env): I text, data, bss and stack segments are overwritten by the loaded program. I Filedescriptors are not closed. I Pending signals are cleared, signal handlers are reset to default actions. I SUID/SGID bits change the effective UID/GID of the process. I if the process is traced, a SIGTRAP is sent after successful execve I example: see exec simple.c c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Terminating a process I terminate a process immediately: exit(exit code) I I I I I any open filedescriptors are closed (if not shared with parent CLONE FILES) any child processes are inherited by the init process the parent process is sent a SIGCHLD (or the signal supplied in clone()) exit code is returned to the parent process and can be collected with wait handling terminated processes: wait(), waitpid(), wait3, wait4 I I wait(status): wait for one child and catch the exit status waitpid(pid, status, options): I I I I waits for the child with pid pid supplies options: WNOHANG, WUNTRACED catch exit status macros handling exit status c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Process Control Examples I fork simple.c I exec simple.c I clone umask.c c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com signal handling I handling signal masks: API: sigemptyset(), sigfillset(), sigaddset(), sigdelset(), sigismember() I installing a handler API: signal(), sigaction() I sending signals: API: raise(), kill(), killpg(), tkill(), sigqueue(), alarm() I block signal delivery API: sigprocmask I examination of blocked, but pending signals API: sigpending I wait for distinct signals API: sigsuspend, pause c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com handling signal masks I declaration: sigset t sigset I sigemptyset(&sigset): clear all signals in the set 
I sigfillset(&sigset): set all signals in the set I sigaddset(&sigset): add a signal to the set I sigdelset(&sigset): remove a signal from the set I sigismember(&sigset): is signal already a member in the set? c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Installing a handler I signal(num, handler): ANSI C I I I num: signal type (SIGINT, SIGTERM,...), can’t catch SIGKILL, SIGSTOP handler: function to call if signal num is delivered sigaction(num, action, oldact): POSIX I I handler function flags: I I I I I I I SA NOCLDSTOP: suppress child stop notifications SA ONESHOT: Restore the signal default handler after handling the signal once SA ONSTACK: Use alternate stack (if available, see: man 2 sigaltstack) SA RESTART: Make certain system calls restartable across signals SA NOMASK: allow recursive occurence (within handler) SA SIGINFO: provide additional information (sigqueue, other) mask of blocked signals while this handler is active c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Sending signals I kill(pid, sig): send signal sig to process with pid pid I raise(sig): kill(getpid(), sig), library function I sigqueue(pid, sig, val): send signal sig and data val to process with pid pid only works with real time signals (SIGRTMIN+n) I killpg(pgrp, sig): send signal to all processes of the process group pgrp, mapped to kill(-pgrp, sig) I tkill(tid, sig): send signal only to one process of a thread group (CLONE THREAD, each process has the same pid) I alarm(seconds): schedules a SIGALRM for the process. Any previously set alarm is cancelled. 
Use 0 seconds to remove an alarm c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Blocking signals I sigprocmask(action, &new set, &old set): block a set of signals I I I I I SIG BLOCK: add the members of new set to current set SIG UNBLOCK: remove the members of new set from current set SIG SETMASK: use the mask of new set from current set old set: the state of the set before the call examination of blocked, but pending signals: sigpending(&sigset) c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Suspending a process until a signal is delivered I sigsuspend(&si): sleep until a signal is delivered which is a member of si I pause(): sleep until a signal is delivered which terminates the process or is catched by a signal handler c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com Example: sighandler.c c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com File System I Opening a file: API: open(), close() I Filedescriptor modification: API: fcntl() I Writing data API: write(), writev(), sendfile() I Reading data: API: read(), readv(), sendfile() (if mmap possible) I Memory mapping: API: mmap(), munmap(), mremap(), mlock() c 2005 Tschaeche IT-Services Oliver Tschäche http://www.tschaeche.com File System I open(name, flags, permissions): I I name: path to the file some flags: I I I I I I I I O APPEND: before each write the file pointer is possitioned at the end O NONBLOCK: any subsequent operation on the file descriptor nor the open itself will block O SYNC: block until data is physically written O NOFOLLOW: if name is a symbolic link, fail O DIRECT: minimize cache effects, read/write directly from/to disk O ASYNC: generate a SIGIO when the file descriptor is ready to send/receive data O LARGEFILE: allow files to be opened whose size cannot be represented in an off t (2/4GB) some of these can be altered after the open with fcntl() permissions: if a file is created, access permissions (reduced by 
    umask)

File System
- close(fd):
  - close the file descriptor, it can't be used any more
  - if fd is the last copy of a particular file descriptor, associated resources are freed
  - if fd is the last reference to a file which was removed (unlink()), delete the file

File System, fcntl(fd, cmd, arg)
- Handling close-on-exec:
  - F_DUPFD: copy fd to the lowest-numbered free fd greater than or equal to arg; close-on-exec on the copy is off!
  - F_GETFD: read close-on-exec flag
  - F_SETFD: set close-on-exec flag to the FD_CLOEXEC bit of arg
  - Example: fs_close_on_exec.c
- Status flags:
  - F_GETFL: read file descriptor's flags
  - F_SETFL: set file descriptor's flags (O_APPEND, O_NONBLOCK, O_ASYNC, O_DIRECT)

File System, fcntl(fd, cmd, arg) continued
- Managing signals:
  - F_GETOWN: get the process or process group (negative value) currently receiving SIGIO/SIGURG
  - F_SETOWN: set the process or process group that will receive SIGIO/SIGURG (O_ASYNC must be set). If the signal handler is installed with SA_SIGINFO, si_code indicates SI_SIGIO and si_fd gives the associated file descriptor
  - F_GETSIG: get the signal sent when input/output becomes possible
  - F_SETSIG: set the signal sent when input/output becomes possible. With a real-time signal, multiple I/O events may be queued using the same signal number

File System, fcntl(fd, cmd, arg) continued
Leases (Linux specific, define _GNU_SOURCE):
- F_SETLEASE: arg selects the kind of lease
- watchdog on a file:
  - uses signaling SIGIO
  - lease breaker calls open, which blocks (unless O_NONBLOCK is set)
  - lease holder handles SIGIO and downgrades lease (cleanup: e.g.
    flushing buffers)
  - if the lease holder is too slow (/proc/sys/fs/lease-break-time), the kernel forces the downgrade
- kinds of lease (arg of F_SETLEASE):
  - F_RDLCK: we will be notified if another process opens the file for writing
  - F_WRLCK: we will be notified if another process opens the file for reading or writing
  - F_UNLCK: remove the lease from the file
- Example: fs_lease.c

File System, fcntl(fd, cmd, arg) continued
File/directory change notification (Linux specific, define _GNU_SOURCE):
- fd refers to a directory
- F_NOTIFY: arg is the logical OR of:
  - DN_MULTISHOT: keep notifying, not only once
  - DN_ACCESS: a file was accessed (read(), readv(), pread())
  - DN_MODIFY: a file was modified (write(), writev(), pwrite(), truncate())
  - DN_CREATE: a file was created (open(), creat(), mknod(), mkdir(), link(), symlink(), rename())
  - DN_DELETE: a file was deleted (unlink(), rename to another directory, rmdir())
  - DN_RENAME: a file was renamed within this directory (rename())
  - DN_ATTRIB: the attributes of a file were changed (chown(), chmod(), utime(), utimes())
- Example: fs_notify.c

File System
Handling the file offset
- lseek(fd, offset, whence): set the current position in the file
  - SEEK_SET: new current position is set to offset
  - SEEK_CUR: new current position is current position plus offset
  - SEEK_END: new current position is end of file plus offset
  - a (not written to) gap between end of file and current position does not increase the file size; on read, zeros are returned
- _llseek(fd, offset_high, offset_low, result, whence):
  - like lseek but 64 bit clean
  - Linux specific
- read(), readv(), write(), writev():
  - read size bytes from fd into the memory buffer / write size bytes from the buffer to fd, using the current file offset
  - the file offset is advanced on successful read/write
  - don't mix functions on file descriptors with the functions from the stdio (FILE) library

File System
Reading/Writing with file offset modification:
- read(fd, buffer, size), write(fd, buffer, size):
  - read size bytes from file descriptor fd into buffer / write size bytes from buffer to fd
  - modify the file offset when successful
  - may successfully return with fewer than size bytes transferred (interrupted by signal, near end of file, no more bytes available from pipe or terminal)
  - if the file descriptor is set to O_NONBLOCK, the function fails with EAGAIN if it would block
- readv(fd, vector, num), writev(fd, vector, num):
  - like read()/write() but support several buffer/size pairs stored in the vector of size num
  - modify the file offset
  - erroneously placed in manual section 3

File System
Reading/Writing without file offset modification:
- pread(fd, buffer, size, offset), pwrite(fd, buffer, size, offset):
  - read size bytes from file descriptor fd into buffer / write size bytes from buffer to fd, starting at offset within the file
  - do not modify the file offset
  - may successfully return with fewer than size bytes transferred (interrupted by signal, near end of file, no more bytes available from pipe or terminal)
  - if the file descriptor is set to O_NONBLOCK, the function fails with EAGAIN if it would block
  - use with macro before including unistd.h:
      #define _XOPEN_SOURCE 500
      #include <unistd.h>

File System
Memory mapped files
- parts of a file are mapped into the virtual memory of the process
- mapping only possible in page-sized units
- getpagesize(): returns the number of bytes in a page
- mmap(), munmap(): map/unmap regions of a file into virtual memory
- msync(): synchronize memory with
  the backing file

File System, memory mapping
- mmap(start, length, prot, flags, fd, offset):
  - start: suggested address in virtual memory (page-aligned), use 0 to let the kernel choose
  - length: size of the memory/file area (page-sized)
  - prot: memory protection; must not conflict with the access permissions of the file
    - PROT_EXEC: pages may be executed
    - PROT_READ: pages may be read
    - PROT_WRITE: pages may be written
    - PROT_NONE: pages may not be accessed
  - fd: file descriptor
  - offset: offset within the file (multiple of the page size)
  - the file size must cover all mapped pages, otherwise SIGBUS on access

File System, memory mapping
- flags: type of mapped object, must specify one of MAP_SHARED or MAP_PRIVATE
  - MAP_SHARED (POSIX.1b): share this mapping with other processes, storing to the region is like writing to the file
  - MAP_PRIVATE (POSIX.1b): create a private copy-on-write mapping.
    - stores do not affect the original file
    - it is unspecified whether changes to the file after the mmap are visible in the mapped region
  - MAP_FIXED (POSIX.1b): don't ignore start, map exactly at that address
  - MAP_NORESERVE: don't reserve swap space (for private/anonymous mappings), might get SIGSEGV upon write when no memory is available
  - MAP_GROWSDOWN: used for stacks, VM system should extend this mapping downwards
  - MAP_ANONYMOUS: mapping is not backed by any file (fd and offset are ignored)

File System, memory mapping
- munmap(start, length): removes the mapping for the requested region
  - further references to addresses of that region cause a SIGSEGV
  - closing the file does not unmap a region
  - the access time of the file may be updated at any time between mmap() and munmap()
  - the modification/status change time of the file may be updated at any time between the first write to that region and a call to munmap() or msync()
- msync(start, length, flag): write changes of file-backed mappings to the file
  - only writes back the memory area starting at start of length length
  - MS_ASYNC: only schedule update, return immediately
  - MS_SYNC: wait until the file is updated
- Examples: fs_mmap_simple.c, fs_mmap_on_fork.c

POSIX shared memory objects
- Creating/opening/removing POSIX shared memory objects: API: shm_open(), shm_unlink()
- File descriptor modification: API: fcntl()
- Memory mapping: API: mmap(), munmap(), mremap(), mlock()
- Use real time library when linking: -lrt
- available in glibc 2.2 and higher
- uses a dedicated file system, normally mounted under /dev/shm

POSIX shared memory objects
- shm_open(name, open_flags, mode):
  - easily create/open shared memory
  - not backed up to the disk
  - fd is guaranteed to be the lowest-numbered
  - fd is close-on-exec by default
- shm_unlink(name):
  - removes only the name, like unlink
  - object is destroyed
    after the last process unmaps the object
  - subsequent attempts to shm_open the name create a new object

Pipes, Fifos
- One way communication
- Creating a pipe: pipe(fd[2])
  - returns two file descriptors
  - fd[0] is for reading
  - fd[1] is for writing
- Using/creating a fifo: use open() after mkfifo()
- Sending data: API: write(), writev(), sendfile()
- Receiving data: API: read(), readv()
- Connection parameters: API: fcntl() (O_NONBLOCK, O_ASYNC)
- Example: fork a process with separate stdio, pipes_on_fork.c

Sockets
- Opening/creating a connection: API: socket(), connect(), bind(), listen(), accept(), socketpair()
- Sending data: API: send(), sendto(), sendmsg(), write(), writev(), sendfile()
- Receiving data: API: recv(), recvfrom(), recvmsg(), read(), readv(), sendfile() (if mmap possible)
- offset is not supported, call pread(), pwrite() only with 0 offset
- Connection parameters: API: getsockname(), getpeername()
- Sockets support O_NONBLOCK, O_ASYNC: API: fcntl()
- Closing a socket: API: close() and, additionally, shutdown() on stream sockets

Sockets
Opening/creating sockets
- two calls: socket() opens the descriptor, bind(s, address, size) then assigns the local address that most communication mechanisms need
- socket(domain, type, protocol): open a file descriptor with the selected communication mechanism
  - domain: PF_UNIX, PF_INET, PF_INET6,...
  - type: SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET, SOCK_RAW, SOCK_RDM
  - protocol: most families implement only one protocol, use 0
- bind(s, address, size):
  - s: file descriptor obtained from socket()
  - address: address specifier (dependent on family)
  - size: size of the address specifier
- non SOCK_STREAM/SOCK_SEQPACKET type sockets are ready to be used now

Sockets
Connecting SOCK_STREAM/SOCK_SEQPACKET type sockets
- Server side:
  - listen(s, backlog):
    - enable willingness to accept connections on file descriptor s
    - backlog: maximum number of pending connection requests
    - polling s for read signals an incoming connection
  - accept(s, address, size):
    - create a new file descriptor from the first connection on the queue of listening socket s
    - the new file descriptor does not inherit any flags (O_ASYNC, O_NONBLOCK) from the listening socket
    - address: contains the origin of the incoming connection
- Client side:
  - connect(s, address, size):
    - connect socket s to the destination address
  - if a SOCK_DGRAM socket calls connect():
    - the default destination address is set
    - dgrams are received only from that address
    - connect() may be used multiple times to reset the default address

Sockets
Socket options: setsockopt(s, level, optname, optval, optlen), getsockopt(s, level, optname, optval, optlen)
- level:
  - SOL_SOCKET for the socket
  - the protocol number for other levels, e.g. from
    getent protocols tcp
- general socket options:
  - SO_KEEPALIVE: keep rarely used stream type sockets alive
  - SO_OOBINLINE: place out-of-band data within data streams
  - SO_RCVTIMEO, SO_SNDTIMEO: timeout until reporting an error
  - SO_BINDTODEVICE: bind socket to a particular interface (eth0), inet sockets
  - SO_REUSEADDR: ignore timeout on inet closed sockets
  - SO_DONTROUTE: don't use gateways
  - SO_BROADCAST: dgram sockets may receive/send broadcast packets
  - SO_SNDBUF, SO_RCVBUF: set size of send/receive buffers
  - SO_LINGER: shutdown(), close() socket synchronously
  - SO_PRIORITY: set type-of-service field of inet sockets

Sockets
Closing sockets:
- non SOCK_STREAM/SOCK_SEQPACKET type sockets:
  - just call close() on the file descriptor
- SOCK_STREAM/SOCK_SEQPACKET type sockets:
  - call shutdown(s, mode):
    - without shutdown() tcp-sockets will stay in close-wait state
    - SHUT_RD: further reception will be disallowed
    - SHUT_WR: further transmission will be disallowed
    - SHUT_RDWR: further reception and transmission will be disallowed
  - then call close() on the file descriptor

Sockets
Transferring data:
- read(), write(): like file system FDs
  - map to recv(), send() with flags=0
- send(s, buf, size, flags), recv(s, buf, size, flags):
  - normally used with connection based sockets
  - use the default destination for non SOCK_STREAM, SOCK_SEQPACKET sockets
- sendto(s, buf, size, flags, to, to_len), recvfrom(s, buf, size, flags, from, from_len):
  - normally used with non connection based sockets
  - SOCK_STREAM, SOCK_SEQPACKET based sockets must use (NULL, 0) for from/to fields
  - explicitly set the destination address/get the sender address for one call
- sendmsg(s, msghdr, flags), recvmsg(s, msghdr, flags):
  - counterparts of readv(), writev()

Sockets
Send data, flags:
- MSG_OOB: send out-of-band data
- MSG_DONTROUTE: don't use a gateway
- MSG_DONTWAIT: use non blocking mode
- MSG_NOSIGNAL: don't send SIGPIPE on error
- MSG_MORE: wait for additional data before sending
Receive data, flags:
- MSG_OOB: receive out-of-band data
- MSG_DONTWAIT: use non blocking mode
- MSG_PEEK: return data but don't remove it from the queue
- MSG_WAITALL: block until the full request is satisfied (signals may break this)
- MSG_TRUNC: return the real length of the dgram packet, even if it was longer than the passed buffer

Waiting/checking for activities on file descriptors
select(), pselect():
- work on fd_sets: sets of marked file descriptors
  - FD_ZERO(&fd_set): clears a set of file descriptors
  - FD_SET(fd, &fd_set): enable file descriptor fd in the set
  - FD_CLR(fd, &fd_set): disable file descriptor fd in the set
  - FD_ISSET(fd, &fd_set): tests the state of file descriptor fd
- three groups, registered file descriptors will be watched to see
  - if bytes are ready for reading
  - if bytes can be written non blocking
  - if an exception arises
- select(), pselect() will return as soon as:
  - an event is triggered
  - a timeout elapses
  - the process is signaled
- fd_sets will show the originator

Waiting/checking for activities on file descriptors
pselect(n, rfds, wfds, efds, timeout, sigset):
- n: highest-numbered descriptor of all sets, plus 1
- timeout: struct timespec, nanosecond based
- sigset: set of signals which are blocked while waiting
select(n, rfds, wfds, efds, timeout):
- n: highest-numbered descriptor of all sets, plus 1
- timeout: struct timeval, microsecond based
- like a pselect() call with a NULL pointer as sigset
see example: sk_unix_server.c, sk_unix_client.c

Waiting/checking for activities on file descriptors
poll(pollfds, nfds, timeout):
- pollfds: an array with registered file descriptors and events
- possible events
  to be requested:
  - POLLIN: there is data to read
  - POLLPRI: there is urgent data to read
  - POLLOUT: write will not block any more
- possible events to be returned:
  - all requested events
  - POLLERR: error condition
  - POLLHUP: hang up
  - POLLNVAL: invalid request, fd not open
- fields of each pollfds entry:
  - fd: file descriptor to be watched
  - events: requested events
  - revents: returned events
- nfds: number of entries in the array
- timeout: milliseconds

IO Control: manipulate operating parameters
- ioctl(fd, request, data):
  - request is specific to the opened file/socket
  - data depends on the type of request
- works on sockets:
  - set/get the process group to which a SIGIO/SIGURG is sent
  - get the time of the last packet passed to the user
  - test for out-of-band data within tcp sockets
  - more family/type dependent..., man 7 tcp, man 7 unix
- works on character special files:
  - control operating characteristics of character special files
  - serial line: set baud rate, start/stop bits
  - terminals (tty): map CTRL-C to SIGINT, echo, buffering
  - cdrom: start playing, eject
  - printer: reset, get status
  - many more...
- works on file systems:
  - set block size
  - ext2 file system: version, modify flags
  - many more...

Overview Kernel Space API
- Modules/Built in Drivers
- Debugging
- Proc Filesystem
- dev Filesystem
- Character devices
- Scheduling
- Atomic Operations
- Interrupts
- Memory Management
- Network devices
see: Linux Device Drivers, 2nd Edition http://www.xml.com/ldd/chapter/book

Modules/built in Drivers
- Building Modules
- Manual Configuration Parameters
- Kernel built in Drivers

Building Modules
- Set Preprocessor Variables:
  - __KERNEL__: access kernel specific data in include files
  - MODULE: compile as module (not built in).
    Must be set before <linux/module.h> is included
  - add kernel includes to path: -I /usr/src/linux/include
- Entry Point: init_module()
- Exit Point: cleanup_module()

Building Modules: Hello world Example
Makefile:
    CFLAGS = -D__KERNEL__ -DMODULE -I/usr/src/linux/include \
             -O -Wall

    all: module.o

    clean:
            rm -f *.o *~ core
module.c:
    #include <linux/module.h>

    int init_module(void) { printk("<1>Hello, world\n"); return 0; }
    void cleanup_module(void) { printk("<1>Goodbye cruel world\n"); }

Manual Configuration Parameters
- MODULE_PARM(variable, type)
  - variable: char, short, int, long, char *
  - type: "b", "h", "i", "l", "s"
  - use arrays for type: "1-3i"
- MODULE_PARM_DESC(variable, description)
- View descriptions with strings (grep parm_desc)

insmod module integer=1 array=2,3
    #include <linux/module.h>

    int integer = 0x300;
    int array[2];
    MODULE_PARM(integer, "i");
    MODULE_PARM_DESC(integer, "The base I/O port (0x300)");
    MODULE_PARM(array, "1-2i");

    int init_module(void)
    {
        printk("<1>integer=0x%x\n", integer);
        printk("<1>array[0]=%d\n", array[0]);
        printk("<1>array[1]=%d\n", array[1]);
        return 0;
    }
    void cleanup_module(void) { printk("<1>Goodbye cruel world\n"); }

Usage Counter
- Goal: Safely remove Module
- Macros defined in <linux/module.h>:
  - MOD_INC_USE_COUNT
  - MOD_DEC_USE_COUNT
  - MOD_IN_USE
- For debugging: implement ioctl to reset this counter

Exporting Symbols
- Goal: Make Symbols available in subsequently loaded modules
- Macros defined in <linux/module.h>:
  - EXPORT_NO_SYMBOLS: module does not export any symbol
  - EXPORT_SYMBOL: export with versioning information
  - EXPORT_SYMBOL_NOVERS: export without versioning information
    EXPORT_SYMBOL(exported_function);

Kernel Built in Drivers
- Use Attributes:
  - __init: free memory of that function after initialisation
  - __exit: ignore that function, don't create any code
- Attributes __init and __exit have no effect if compiled as module
- Problem: Module needs entry point (init_module) → use macros: module_init(init-name), module_exit(exit-name)

Debugging
- Debugging by Printing: printk()
  - setting loglevel: KERN_EMERG, KERN_ALERT, KERN_CRIT, KERN_ERR, KERN_WARNING, KERN_NOTICE, KERN_INFO, KERN_DEBUG
  - Control Logged Output: echo 8 > /proc/sys/kernel/printk
  - Switching Logmessages On/Off:
      #ifdef DEBUG
      #  define PDEBUG(fmt, args...) \
                 printk(KERN_DEBUG "my dev: " fmt, ##args)
      #else
      #  define PDEBUG(fmt, args...)
      #endif
      PDEBUG("some log message %d\n", integer_value);

Debugging - Continued
- Debugging by Querying:
  - Use /proc FS: Everybody looks at /proc - Security?
  - Use ioctl: Undocumented ioctls often remain unnoticed
  - implement ioctl resetting module count
- Debugging by Watching the Application:
  - Debugger/strace
  - printf
- Debuggers and Related Tools:
  - gdb: gdb /usr/src/linux/vmlinux /proc/kcore
  - kdb/kgdb: Kernel Debugger only available for I386 (as patch)
  - ikd: Less architecture dependent than kdb (as patch)
  - Kernel Crash Dump Analysers: LKCD, LCRASH
  - User-Mode-Linux: Virtual Machine, Running Linux as Process

procfs
- create new directory: proc_mkdir(), proc_mkdir_mode()
- create new read entry:
  1. implement read function
  2. install read function: create_proc_read_entry()
- cleanup directories and entries: remove_proc_entry()
- example: scull/main.c

devfs
Filesystem in which each driver registers an access point.
Alternatively, use manually created special files (mknod, see man 1 mknod).
- found in <linux/devfs_fs_kernel.h>
- devfs is only available if macro CONFIG_DEVFS_FS is set
- Create Directory: devfs_mk_dir(parent, name, info):
  - parent: handle for parent directory (use NULL for root directory)
  - name: name of the new directory
  - info: not used

devfs
- Create Character Device: devfs_register(dir, name, flags, major, minor, mode, ops, info)
  - dir: handle for parent directory (use NULL for root directory)
  - name: name of the new character device
  - flags: DEVFS_FL_AUTO_DEVNUM,...
  - major/minor: explicitly set major/minor number
  - mode: access permissions
  - ops: file operations
  - info: private data
- Remove Directory/ChrDev: devfs_unregister()
- see example: scull/main.c

Character Device Driver
Components of a Character Device Driver:
- Major number: selects the driver; 60-63, 120-127, 250-254 are reserved for local use
- Minor number: selects the device within the driver
- create character device node in filesystem: mknod
- Provides File Operations (read, write,...)
- Access through Filesystem
- Kernel API: register_chrdev, unregister_chrdev (<linux/fs.h>)
- see example: scull/main.c

Character Device Driver - File Operations
File Operations (<linux/fs.h>):
- llseek: modifies position counter
- read, write: get, send data
- readdir: only useful to filesystems
- poll: implement poll/select
- ioctl: issue device specific commands
- mmap: map device memory to a process's address space
- flush: called within close system call
- release: called when last file is closed
- fsync: flush pending data
- fasync: notifies change in operation mode
- flock: only used within regular files
- readv, writev: read/write on multiple memory areas
Only implement needed functions

Character Device Driver - File structure
File structure (<linux/fs.h>):
- f_mode: check for FMODE_READ or FMODE_WRITE in ioctl()
- f_pos: 64 bit value of the current reading/writing position; don't change it, use the last argument of the read(), write() operation instead
- f_flags: flags used with open() (O_RDONLY, O_NONBLOCK,... see <linux/fcntl.h>), use f_mode to check read/write
- f_op: file operations, may be replaced for tuning
- private_data: used to store a pointer to allocated data, specific to that file

Character Device Driver - Typical procedure
1. module init: register device manually or with devfs, supply file operations
2. f_op->open: allocate file-specific data, place it in private_data
3. f_op->read, write,...: use private_data to do the work
4. f_op->release: release everything that open created
Attention: dup(), fork() just create new references to the file structure and don't call f_op->open(). Accordingly, f_op->release() is only called if the reference count of a file struct drops to 0.
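
The note on dup() can be observed from user space. The following is a minimal sketch, not one of the course examples: it assumes a writable /tmp, and the helper name offset_seen_through_dup is made up for illustration. Because the file offset lives in the shared file structure (f_pos), reading through one descriptor moves the offset seen through its copy.

```c
#define _XOPEN_SOURCE 500   /* for mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo helper: reads `count` bytes through one descriptor and
 * returns the file offset as seen through a dup()ed copy, or -1 on error. */
static off_t offset_seen_through_dup(size_t count)
{
    static const char data[] = "0123456789abcdef";
    char path[] = "/tmp/dup_demo_XXXXXX";   /* assumed scratch location */
    char buf[sizeof(data)];
    off_t pos = -1;
    int fd, copy;

    assert(count < sizeof(data));           /* demo file holds 16 bytes */

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file is deleted on last close() */

    copy = dup(fd);                         /* new reference, no second open() */
    if (copy >= 0 &&
        write(fd, data, sizeof(data) - 1) == (ssize_t)(sizeof(data) - 1) &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, count) == (ssize_t)count)
        pos = lseek(copy, 0, SEEK_CUR);     /* query the offset via the copy */

    if (copy >= 0)
        close(copy);
    close(fd);
    return pos;
}
```

After reading 4 bytes through the original descriptor, the copy reports offset 4 rather than 0, since both descriptors refer to one file structure. A driver sees the same thing from the other side: one open(), one release(), however many descriptors point at the file.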

Character Device Driver - read/write return values
f_op->read(file, buffer, size, offset)
- requested number of bytes (size) was transferred
- some but not all bytes were transferred: application has to retry the read
- 0 if end-of-file is reached
- negative value: specifies error (<linux/errno.h>)
f_op->write(file, buffer, size, offset)
- requested number of bytes (size) was transferred
- some but not all bytes were transferred: application has to retry the write
- 0: nothing was transferred, this is not an error
- negative value: specifies error (<linux/errno.h>)
read/write return with not completely filled buffers if O_NONBLOCK is set or the process is signaled

Character Device Driver - readv/writev
vector versions of read/write
- if not defined in f_op, emulated with subsequent read/write
- file pointer and position pointer are the same as for read/write
struct iovec:
- created in user space, but copied to kernel space before calling the driver
- defined in <linux/uio.h>
offer higher efficiency:
- e.g. useful for tapes: create only one record

Character Device Driver - ioctl
ioctl(inode, file, cmd, arg):
- inode, file: like arguments of open
- cmd: number that corresponds to a command
  - simple choice: start with 1, problem: cmd should be unique. Then, if used with the wrong device, an error can be detected.
  - old convention: 8 bit magic code, 8 bit ordinal number
  - current convention: use bit fields (<linux/ioctl.h>, <asm/ioctl.h>):
    - type (_IOC_TYPEBITS): magic number, 8 bit wide
    - number (_IOC_NRBITS): ordinal number, 8 bit wide
    - direction: _IOC_NONE, _IOC_READ and/or _IOC_WRITE
    - size: from 8 to 14 bits wide (dependent on architecture)
- arg:
  - pointer to additional data
  - the data itself
- return value:
  - 0 on success
  - -ENOTTY for an unknown command: POSIX
  - -EINVAL for an unknown command: common practice

Character Device Driver - ioctl
predefined commands device drivers are interested in (magic number 'T'):
- FIOCLEX: set close-on-exec flag
- FIONCLEX: clear close-on-exec flag
- FIOASYNC: modify async flag
- FIONBIO: modify blocking flag (historical fcntl())

Character Device Driver - ioctl
pointer arguments:
- access_ok(type, addr, size): checks user space area
  - type: VERIFY_READ for read access, VERIFY_WRITE for read/write access
  - addr, size: user space address and its size
  - returns 1 for success, 0 for failure
  - driver should return -EFAULT if access fails
- transfer single values:
  - put_user(datum, ptr), get_user(datum, ptr): include access_ok(), return -EFAULT on failure
  - __put_user(datum, ptr), __get_user(datum, ptr): no check version
- copy_to_user(), copy_from_user(): transfer data between kernel and user space

Character Device Driver - ioctl
Capabilities, some of <linux/capability.h>:
- problem: any user should be able to use a device, but should not be able to perform any operation
- additionally check for capabilities:
  - CAP_DAC_OVERRIDE: ability to override access restrictions on files and directories
  - CAP_NET_ADMIN: ability to perform network administration tasks
  - CAP_SYS_MODULE: ability to load or remove modules
  - CAP_SYS_RAWIO: ability to perform raw IO operations
  - CAP_SYS_ADMIN: catch all
    capability, provide any access
- capable(capability) returns true if the process has the capability
- example: scull/main.c

Character Device Driver - poll
determine whether a process is able to read/write non blocking (<linux/poll.h>)
- used in applications that use multiple input/output streams
- f_op->poll(file, poll_table):
  1. call poll_wait(file, wait_queue, poll_table) on each wait queue that could indicate a change in the poll status
  2. return a bit mask describing operations that could be performed without blocking
     - POLLIN: ready to read
     - POLLRDNORM: ready to read normal data (ored with POLLIN)
     - POLLPRI: ready to read out-of-band data (select exception)
     - POLLHUP: end-of-file reached (select read)
     - POLLERR: an error condition occurred (select read/write)
     - POLLOUT: ready to write
     - POLLWRNORM: ready to write normal data
     - POLLWRBAND: ready to write high priority data
- example: scull/pipe.c

Character Device Driver - fasync
send signal when data arrives/write buffers are available again
- problem: several struct files may be opened
- register each async file in a queue:
  - fasync_helper(fd, file, mode, fasync_struct)
    - file: register struct file
    - mode: switch on/off async mode
    - fasync_struct: queue which is modified
- send signal to all processes with registered files of a fasync_struct:
  - kill_fasync(fasync_struct, sig, band):
    - fasync_struct: elements of this queue are signaled
    - sig: signal to send
    - band: event to send (POLLIN, POLLOUT,...)
- remove registered async files in calls to f_op->release!!
- example: scull/pipe.c

Character Device Driver - access control
problem: open can be called several times on a device file
- restrict to a single open: use a counter, return -EBUSY on second open
- restrict to a single user: remember the user of the first open
- block the open call: for short time usage (e.g. logging)
- create copies of the private data
see chapter 5 of Linux Device Drivers

Character Device Driver - memory mapping
- f_op->mmap(), see section memory management
- f_op->munmap(), see section memory management
- f_op->remap(), see section memory management

Scheduling
- Each process (current->state) is in a defined state (<linux/sched.h>):
  - TASK_RUNNING: Ready to be scheduled
  - TASK_INTERRUPTIBLE: Waiting for an event or a signal
  - TASK_UNINTERRUPTIBLE: Waiting for an event only, not interruptible by signals
  - TASK_ZOMBIE: Task finished, waiting to return exit code
  - TASK_STOPPED: Task sleeping, waiting for signal SIGCONT
- Scheduling is evaluated after timer interrupt, HZ (<linux/param.h>) times a second
- Manual scheduling is initiated by schedule()
- Processes are not scheduled while in kernel mode
  - Avoid long lasting loops within kernel code (or schedule manually)
  - Preemptive Problem

Delaying execution
- Time is measured in jiffies (defined in <linux/sched.h>), incremented at each timer interrupt (HZ times a second)
  - Example code:
      set_current_state(TASK_INTERRUPTIBLE);
      schedule_timeout(delay*HZ);
  - delay is the timeout in seconds
  - an extra time interval could pass between the expiration of the timeout and when the process is scheduled to execute
- Short delays: udelay(unsigned long usecs), mdelay(unsigned long msecs)
  - udelay() waits for usecs microseconds
  - mdelay() is
    a loop around udelay(1000)
  - implemented as busy wait, based on bogomips loops per second
  - suggested maximum value for udelay is 1 millisecond
- Short delays: increase HZ, but use with caution

Wait Queues
- List of Processes which are Waiting for Wakeup: wait_queue_head_t
- initialize a Wait Queue: init_waitqueue_head(), DECLARE_WAIT_QUEUE_HEAD()
- Register current Process to wait on this queue:
  - sleep_on(): Wait until an Event occurs on that Queue
  - interruptible_sleep_on(): Wait until an Event occurs on that Queue or the Process is signaled
  - sleep_on_timeout(): Same as sleep_on() but resume after timeout
  - interruptible_sleep_on_timeout(): Same as interruptible_sleep_on() but resume after timeout
  - wait_event(): Wait until an Event and check condition
  - wait_event_interruptible(): Wait until an Event and check condition or the Process is signaled

Wait Queues - Continued
Simplified Wait
    void simplified_sleep_on(wait_queue_head_t *queue)
    {
        wait_queue_t wait;

        init_waitqueue_entry(&wait, current);
        current->state = TASK_INTERRUPTIBLE;
        add_wait_queue(queue, &wait);
        schedule();
        remove_wait_queue(queue, &wait);
    }

Wait Queues - Continued
Simplified Wait Exclusive
    void simplified_sleep_exclusive(wait_queue_head_t *queue)
    {
        wait_queue_t wait;

        init_waitqueue_entry(&wait, current);
        current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;
        add_wait_queue_exclusive(queue, &wait);
        schedule();
        remove_wait_queue(queue, &wait);
    }

Wait Queues - Continued
Going to sleep without Races
- Problem: a wakeup arriving between the test and the sleep is lost
    while (short_head == short_tail) {
        interruptible_sleep_on(&short_queue);
        /* ...
        */
    }

Wait Queues - Continued
Going to sleep without races
- Solution:
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    add_wait_queue(&short_queue, &wait);
    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (short_head != short_tail) /* whatever test */
            break;
        schedule();
    }
    set_current_state(TASK_RUNNING);
    remove_wait_queue(&short_queue, &wait);

Wait Queues - Continued
- Wake up the processes and schedule immediately:
  - wake_up(): wake up registered processes
  - wake_up_interruptible(): wake up registered processes which have called an interruptible version
  - wake_up_interruptible() and wake_up() may not return immediately; a woken-up process may be executed first
- Wake up the processes but keep the current process running:
  - wake_up_sync(): wake up registered processes by marking them runnable
  - wake_up_interruptible_sync(): wake up registered processes which have called an interruptible version
  - wake_up_interruptible_sync() and wake_up_sync() only mark processes runnable but don't call schedule(); this is left to the current process. This may save context switches! See example.

Wait Queues - Continued
wake_up vs.
wake_up_sync:
    while (1) {
        wake_up(wq1);
        wake_up(wq2);
        schedule();
    }

Task Queues
Problems:
- Polling the hardware without a context switch
- Give timely input to a hardware device (short periods; use kernel timers for long periods)
- Keep latency of interrupt routines short
Properties:
- Doing work without a context switch
- Suitable process context not available (in general):
  - don't access user space, you don't know which process is active
  - not allowed to call schedule(), sleep_on(), kmalloc() (calls sleep_on()) or semaphores

Task Queues
Predefined task queues:
- scheduler queue:
  - called from keventd within process context
  - tq_schedule is hidden, use schedule_task()
- timer queue:
  - called at each timer tick in interrupt context
  - use queue_task(&your_task, &tq_timer)
- immediate queue:
  - called at return of a system call or when the scheduler is run, whichever comes first
  - use queue_task(&your_task, &tq_immediate)
  - uses the bottom half mechanism, thus call mark_bh(IMMEDIATE_BH)
  - Attention: mark the bottom half after queueing the task!
  - fastest queue

Kernel Timers
- Resolution is jiffies
- Avoids reregistering a task in the timer queue
- Easy use: register your task once and the kernel calls it once when the time expires
- Use functions to be forward compatible:
  - init_timer(): initialization
  - add_timer(): insert timer into the global list of active timers
  - mod_timer(): modify expiration of an active timer
  - del_timer(): delete an active timer from the list
  - del_timer_sync(): makes sure that the timer function is not running on any CPU (rmmod)

Tasklets
- since 2.4 tasklets are the preferred way to accomplish bottom-half tasks
- defer execution of a task until a safe time, like task queues
- run once and may reschedule themselves, like task queues
- may be run in parallel on SMP systems
- run on the CPU which first schedules them (better cache behaviour, faster)
- work on single-CPU machines

Atomic Operations
- goal: modify bits without interference of IRQs or other CPUs
- bit operations
- integer operations: data types depend on architecture, 24 bits guaranteed
- semaphores

Atomic Operations
Bit operations: <asm/bitops.h>
- test_bit(nr, addr):
  - test the bit in memory, not using caches
- set_bit(nr, addr), clear_bit(nr, addr), change_bit(nr, addr):
  - directly write to memory, not using caches
- test_and_set_bit(nr, addr), test_and_clear_bit(nr, addr), test_and_change_bit(nr, addr):
  - again, directly work on memory
  - return the state of the bit before the operation

Atomic Operations
Integer operations: <asm/atomic.h>
- type: integer, but use only 24 bits!
- atomic_read(atom): returns integer
- atomic_set(atom, int): set atomic to int
- atomic_add(int, atom), atomic_sub(int, atom): add/sub int to/from atomic
- atomic_inc(atom), atomic_dec(atom): increment/decrement atomic
- atomic_add_and_test(int, atom), atomic_sub_and_test(int, atom): perform the operation and return true if the result is zero
- atomic_inc_and_test(atom), atomic_dec_and_test(atom): perform the operation and return true if the result is zero

Atomic Operations - multiprocessor
Spinlocks: <linux/spinlock.h>
- busy wait until the lock is released
- spin_lock_init(lock): run-time initialization
- spin_lock(lock), spin_unlock(lock): acquire/release the given lock
- spin_lock_irqsave(lock, flags), spin_unlock_irqrestore(lock, flags): disable IRQs (saving the flags) before acquiring the lock, restore them after releasing it
- spin_lock_irq(lock), spin_unlock_irq(lock): disable/enable IRQs before/after acquiring/releasing the given lock
- spin_lock_bh(lock), spin_unlock_bh(lock): disable/enable execution of bottom halves before/after acquiring/releasing the given lock
- spin_is_locked(lock): check the state of the lock
- spin_trylock(lock): acquire the lock if free, else return failure
- spin_unlock_wait(lock): wait until a lock is released

Atomic Operations - multiprocessor
Reader-Writer locks: <linux/spinlock.h>
- Problem:
  - spinlocks only allow a single reader
  - many readers may lock data simultaneously; exclusive locks are only needed for writers
- read_lock(lock), read_unlock(lock): acquire/release a read lock, more than one CPU may get a read lock
- write_lock(lock), write_unlock(lock): only one CPU will get a write lock (as soon as all read locks are released)
- Reader-writer locks support irq, irqsave and bh variants

Atomic Operations
Semaphores: <asm/semaphore.h>
- sema_init(semaphore, state):
  - semaphore: has a state and wait queue
  - state: initial state
  - state == 0: semaphore is held by a process
  - 0 < state: semaphore is free
- down_interruptible(semaphore):
  - waits until the semaphore has a state greater than 0
  - decrements the state of the semaphore
  - if the process is signaled while waiting, it returns true and the driver should return -ERESTARTSYS
- down(semaphore):
  - like down_interruptible(), but does not allow signals to be delivered
- up(semaphore):
  - release the semaphore (increment state by 1)

Interrupts
- register an IRQ handler
- blocking interrupts
- enable/disable interrupts

Interrupts
Registering IRQs: <linux/sched.h>
- request_irq(irq, handler, flags, name, private_data):
  - irq: the number of the requested interrupt
  - handler: function to call for an IRQ with the number irq
  - flags:
    - SA_INTERRUPT: fast handler, called with interrupts disabled
    - SA_SHIRQ: shared IRQ, other handlers may be attached to this IRQ
    - SA_SAMPLE_RANDOM: IRQ contributes to the entropy pool used by /dev/random
  - name: name listed in /proc/interrupts
  - private_data: associated data, used with shared IRQs to select the handler when calling free_irq()
- free_irq(irq, private_data)

Interrupts
Blocking interrupts (<asm/system.h>, included from <linux/sched.h>):
- don't use sti() directly (the following handler may rely on disabled IRQs)
- use the macros save_flags(), restore_flags():
  - can't pass flags to a non-inline function!
    unsigned long flags;

    save_flags(flags);
    cli();
    /* This code runs with interrupts disabled */
    restore_flags(flags);

Interrupts
Enable/disable IRQs:
- Generated interrupts are lost while an IRQ is disabled!
- Must not be used with shared IRQs!
- disable_irq(irq): wait for the IRQ handler to finish, then disable reporting of that IRQ.
- disable_irq_nosync(irq): disable the IRQ without regard to any running IRQ handler.
- enable_irq(irq): enable reporting of that IRQ.

Interrupts
IRQ handler:
    void short_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* ... do irq handling ... */
        /* wake any reading process */
        wake_up_interruptible(&short_queue);
    }

Interrupts
Tasklet:
    void short_do_tasklet(unsigned long);
    DECLARE_TASKLET(short_tasklet, short_do_tasklet, 0);

    void short_tl_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* ... do irq handling ... */
        tasklet_schedule(&short_tasklet);
    }

    void short_do_tasklet(unsigned long unused)
    {
        /* ... do real work ... */
    }

Interrupts
Bottom halves:
    void short_bh_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* ... do irq handling ... */
        /* Queue the bh. Don't care about multiple enqueueing */
        queue_task(&short_task, &tq_immediate);
        mark_bh(IMMEDIATE_BH);
    }

Memory management
- Memory zones
- Allocating memory
- Address spaces: physical, virtual, buses
- Side effects: caches!
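Looking back at the interrupt handler slides: the handler produces data and wakes a reader that sleeps while short_head == short_tail. The following is a minimal user-space sketch of such a circular buffer, not the code of the short driver. The names ring_put and ring_get and the size RING_SIZE are assumptions made for illustration, and the kernel wake-up call is only indicated by a comment.

```c
#include <stddef.h>

#define RING_SIZE 8  /* hypothetical capacity; one slot always stays unused */

struct ring {
    unsigned char buf[RING_SIZE];
    size_t head;  /* next slot the producer ("interrupt handler") writes */
    size_t tail;  /* next slot the consumer ("reader") reads */
};

/* Empty exactly when head == tail, the condition the reader sleeps on. */
static int ring_empty(const struct ring *r)
{
    return r->head == r->tail;
}

static int ring_full(const struct ring *r)
{
    return (r->head + 1) % RING_SIZE == r->tail;
}

/* Producer side: returns 0 on success, -1 if the byte had to be dropped. */
static int ring_put(struct ring *r, unsigned char c)
{
    if (ring_full(r))
        return -1;
    r->buf[r->head] = c;
    r->head = (r->head + 1) % RING_SIZE;
    /* a real handler would call wake_up_interruptible() at this point */
    return 0;
}

/* Consumer side: returns 0 and stores one byte, -1 if it would have to sleep. */
static int ring_get(struct ring *r, unsigned char *c)
{
    if (ring_empty(r))
        return -1;
    *c = r->buf[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return 0;
}
```

Only the producer moves head and only the consumer moves tail, which is why a single handler and a single reader can share such a buffer with little locking; with several readers or writers a spinlock would be needed.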
Memory Zones
- normal memory
- DMA-capable memory: not used in MIPS systems
  - must be used in DMA transfers with peripheral devices of the ISA bus
- high memory: not used in MIPS systems
  - introduced with P II virtual memory extension (64 GB)

Memory Allocation
- kmalloc(): a malloc()-like interface
- lookaside caches
- page-oriented allocation
- use virtual memory
- boot-time allocation

Memory Allocation
kmalloc(size, flags), kfree(obj):
- may sleep until pages are available (low-memory situation):
  - a function calling kmalloc() must be reentrant
- the returned region is contiguous physical memory
- does not clear the memory it obtains

Memory Allocation
kmalloc(): controlled by flags:
- GFP_KERNEL: allocation is performed on behalf of a process and may sleep
- GFP_ATOMIC: never sleeps, called from outside a process's context, e.g.
  interrupt handler, task queues, kernel timers
- GFP_BUFFER: differs from GFP_KERNEL in that fewer attempts are made to flush buffers; mainly used to avoid deadlocks when the I/O subsystem itself needs memory
- GFP_USER: low-priority GFP_KERNEL request
- GFP_HIGHUSER: low-priority GFP_KERNEL request for high memory
- GFP_DMA: highly architecture dependent, memory usable in DMA requests
- GFP_HIGHMEM: highly architecture dependent, used as part of GFP_HIGHUSER

Memory Allocation
Lookaside caches
- allocate many objects of the same size again and again
- kmem_cache_create(name, size, offset, flags, constr, destr):
  - name: association used in /proc/slabinfo
  - size: size of the object
  - offset: ensure particular alignment
  - flags:
    - SLAB_NO_REAP: protect the cache from being reduced when the system looks for memory
    - SLAB_HWCACHE_ALIGN: align objects according to hardware caches
    - SLAB_CACHE_DMA: allocate DMA-capable memory
- constr(obj, cache, flags), destr(obj, cache, flags):
  - provide constructor, destructor if necessary
  - constr is called with the SLAB_CTOR_CONSTRUCTOR flag set, so the same function can be used for both
  - may be called several times for one object
  - must not sleep if the SLAB_CTOR_ATOMIC flag is set
- kmem_cache_destroy(cache):
  - removes the complete cache
  - only succeeds if all allocated objects are returned

Memory Allocation
Lookaside caches
- kmem_cache_alloc(cache, flags):
  - get a cache object
  - may perform kmalloc() if no object is available within the cache
- kmem_cache_free(cache, obj):
  - frees an object
  - memory is not immediately freed, but may be used for the next request

Memory Allocation
get_zeroed_page(flags), __get_free_page(flags), __get_free_pages(flags, order)
- allocate big, page-oriented chunks of memory
- flags as for kmalloc():
  - GFP_ATOMIC: never sleeps
  - GFP_KERNEL: may sleep until
memory is available
- order: 2^order pages are allocated
free_page(addr), free_pages(addr, order)
- frees allocated pages

Memory Allocation
vmalloc(size), vfree(obj):
- allocate big chunks of memory, contiguous in kernel virtual address space
- modifies page tables; for __get_free_page() and kmalloc() the virtual-to-physical mapping is 1-to-1
- used for big kernel buffers
- can't be used for DMA
- uses GFP_KERNEL and may therefore sleep; thus not usable in interrupt handlers or task queues
- address range: VMALLOC_START, VMALLOC_END (<asm/pgtable.h>)

Memory Allocation
Boot-time allocation
- inflexible but least prone to failure
- can't be used by modules
- done before memory management starts
- needs a reboot to take effect
- alloc_bootmem(size), alloc_bootmem_pages(size), alloc_bootmem_low(size), alloc_bootmem_low_pages(size):
  - pages: allocate whole pages
  - low: allocate below MAX_DMA_ADDRESS

Memory Allocation
Other:
- bigphysarea patch: see http://www.polyware.nl/~middelink/En/hob-v4l.html
  - allocate memory via cmdline at startup
  - passed to the device driver module later
- Reserve high memory addresses:
  - use cmdline parameter mem=126M (if 128M available)
  - supported by the standard kernel
  - allocator module on the O'Reilly ftp sites
- both methods need a reboot to adjust the memory size

Memory management
Address Spaces
- bus addresses (PCI)
- physical address space, initialization done by BIOS or kernel startup
- user virtual address space (one per process)
- kernel logical addresses: physical-virtual one-to-one mapping
- kernel virtual addresses:
  - address extension requires more than 32 bits
  - used with modules
- organisation unit: pages
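Coming back to the page-oriented allocation slide: the order argument is a power-of-two exponent, so a driver has to turn a byte count into an order before calling __get_free_pages(). The following is a small user-space sketch of that arithmetic. The page size PAGE_SZ of 4096 bytes and the helper names order_for_size and bytes_for_order are assumptions made for illustration; the real page size is architecture dependent.

```c
#include <stddef.h>

#define PAGE_SZ 4096UL  /* assumed page size; architecture dependent */

/* Smallest order such that (PAGE_SZ << order) >= size. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = PAGE_SZ;

    while (span < size) {
        span <<= 1;   /* each step doubles the number of pages */
        order++;
    }
    return order;
}

/* Bytes actually reserved by an allocation of the given order. */
static size_t bytes_for_order(unsigned int order)
{
    return (size_t)PAGE_SZ << order;
}
```

Because the result is rounded up to a power of two, a request of 20000 bytes occupies eight pages (order 3); for buffers that just miss a power of two, the wasted memory can be considerable.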
Memory management
Memory map and struct page
- one struct page for each memory page
- contains reference count
- wait queue of processes waiting for that page
- kernel virtual address of that page
- flags:
  - PG_locked: can't be swapped out
  - PG_reserved: may not be used

Memory management
Memory map and struct page, mm_struct and page tables

Memory management
Memory map and struct page, macros:
- virt_to_page(addr): find struct page for a kernel logical address
- page_address(page): return kernel virtual address of the page
- kmap(page), kunmap(page):
  - kernel logical address if low memory
  - kernel virtual address if high memory
  - limited number of mappings, release soon
  - may be mapped more than once
  - may sleep until a mapping is available

Memory management - User Space Interface
Virtual Memory Areas
- a homogeneous region in the virtual memory of a process
- user space API: mmap(), munmap()
- reference VMAs: /proc/<PID>/maps
- elements:
  - start, end: begin and end of the virtual address range
  - offset: within a file (page-oriented)
  - permissions: read, write, execute
  - vm_ops, vm_private: driver operations, driver data
  - major, minor, inode: file system information

Memory management - User Space Interface
Virtual Memory Areas (<linux/mm.h>)
- vm_ops:
  - open(VMA): called when the VMA is inherited by a child process (clone(), fork()). ATTENTION: has to be called manually within the mmap() syscall
  - close(VMA): called when the VMA is destroyed (e.g. when a process using it exits) or when munmap() unmaps the entire area
  - unmap(VMA): called from the kernel when parts or the entire area of the VMA are unmapped
  - sync(VMA): backend for the msync() system call
  - nopage(VMA): called when a process accesses a page of the area which is currently
not in memory; return a struct page here

Memory management - User Space Interface
Virtual Memory Areas (<linux/mm.h>)
- vm_ops, non-driver-specific VMA operations:
  - swapout(VMA): kernel wants to swap out that page; return 0 if okay, returning non-0 will send a SIGBUS to the process
  - protect(VMA): unused yet. Intention: change protection
  - wppage(VMA): unused yet. Intention: handle faults for write access to write-protected pages

Memory management - User Space Interface
mmap: kernel vs. user function:
1. User: mmap(addr, size, prot, flags, fd, offset)
2. Kernel performs a good deal of work:
   - checks parameters
   - resolves file descriptor
   - creates VMA, without any active mapping
3. Kernel: mmap(filp, vma)

Memory management - change user space mapping
remap_page_range(virt_addr, phys_addr, size, prot):
- creates new page tables for the specified region
- virt_addr: user space address where the mapping begins
- phys_addr: the physical address to which virt_addr is mapped
- size: size of the region (same for virtual and physical address)
- prot:
  - protection of the new VMA (hint: vma->vm_page_prot)
  - caches: disable caches, architecture-dependent flags (see pgprot_noncached() in drivers/char/mem.c)
- does not need nopage from the VMA operations

Memory management - change user space mapping
VMA operation nopage(vma, addr, write_access):
- kernel creates the VMA without any mapping for mmap()
- if the user accesses an unmapped page, nopage() is called
- nopage() has to map a single page
- nopage() has to take care of the page's reference counter (get_page())
- user space mremap() does not notify the driver when increasing the region:
  - nopage() must implement growing regions of the VMA
- user space mremap() notifies
driver if reducing the region: calls VMA operation unmap()

Memory management - change user space mapping
Examples:
- Remap nopage mappings: ldd->simple/simple.c
- prevent extension of mapping: sigbus on nopage.c

kiobuf Interface
- the other way round: kernel maps a user space buffer
- avoids copying data:
  - non-kiobuf read/write: copies data to/from a kernel buffer
- independent from memory management, simplifies life greatly
- fast when reading the same data once (if reading twice, use kernel buffers)
- linux-2.4 lacks async IO for raw IO
- API: include <linux/iobuf.h>

kiobuf Interface
Create/free kiobuf objects:
- kiobuf_init(kiobuf): initialize an already defined structure
- alloc_kiovec(nr, iovec): allocate nr kernel IO buffers
- free_kiovec(nr, iovec): free nr kernel IO buffers
Map/unmap user space into a kiobuf:
- map_user_kiobuf(rw, kiobuf, addr, size): registers all pages which belong to the user memory region in the kiobuf
- unmap_kiobuf(kiobuf): adjust ref. count of each page
Example: mmap/kiobuf.c

DMA
DMA transfer - input modes:
- synchronous: user process calls read:
  1. driver allocates buffer (use kiobuf)
  2. hardware writes data to the buffer and raises an interrupt when done
  3. interrupt handler acknowledges the IRQ and awakens the process
- asynchronous: data acquisition devices (NIC):
  1. hardware raises an IRQ to announce that new data has arrived
  2. interrupt handler allocates a buffer and tells the hardware where to transfer the data
  3. hardware transfers the data and raises another IRQ when it's done
  4.
the handler dispatches the data, wakes relevant processes, takes care of housekeeping
- NICs often use ring buffers

DMA
DMA - memory
- physical address must be simultaneously available to the CPU and the hardware
- disable caches so that updates can be seen
- consistent DMA mappings:
  - allocated once, when loading the module
  - exist for a long time
  - monopolize mapping registers of the hardware (even when they are not being used)
- streaming DMA mappings:
  - set up for a single operation
  - some hardware can optimize streaming DMA mappings
- have a look at other drivers

Network Devices
- Initialization
- Controlling transmission concurrency
- Transferring packets
- Changes in link state
- Custom ioctl

Network Devices - Initialization
register_netdev(netdev) used with preset name/init, other fields are filled in by init:
- name: interface name, use %d for auto numbering
- init(netdev): callback for initialization
- open(netdev), release(netdev):
  - start/stop interface
  - has to start/stop the packet queue for the interface: netif_start_queue(netdev), netif_stop_queue(netdev)
- set_config(netdev): change configuration parameters (ifconfig)
- do_ioctl(netdev): ioctl-like interface, set special parameters
- hard_start_xmit(sk_buff, netdev): send a single packet
- tx_timeout(netdev): packet transmission fails to complete within a reasonable period (driver missed an interrupt?), resume packet transmission
- get_stats(netdev): access statistical information of an interface

Network Devices - Initialization
register_netdev(netdev) used with preset name/init
- flags:
  - IFF_UP: interface is running
  - IFF_DEBUG: can be used for verbosity of the driver's printk
  - IFF_NOARP: interface does not use ARP
  - IFF_PROMISC: by default NICs filter
packets per hardware; receive all packets if set
- setting special fields of netdev (e.g. packet header handling):
  - ether_setup: ethernet devices
  - ltalk_setup: ltalk devices
  - fc_setup: fibre channel devices
  - fddi_setup: FDDI
  - hippi_setup: HIPPI
  - tr_configure: token ring; attention, there is a tr_setup() doing nothing in 2.4!

Network Devices
Socket buffer elements <linux/skbuff.h>:
- skb records:
  - dev: the device a buffer came from or is sent to
  - protocol: protocol which is used, e.g. ethernet
  - head: beginning of allocated space
  - data: beginning of valid octets
  - tail: end of valid octets
  - end: end of allocated space
  - len: tail - data

Network Devices
Handling of socket buffers <linux/skbuff.h>:
- dev_alloc_skb(len), dev_kfree_skb(skb): allocate, free socket buffer in driver context
- alloc_skb(len, prio), kfree_skb(skb): internal kernel functions, use the dev_xxx functions in a driver
- socket buffers are allocated in DMA-capable memory
- skb_put(skb, len), __skb_put(skb, len): add data at the end
  - adjust the tail and len records of the skb
  - return the address at which to put len bytes
  - __skb_put() omits the check whether the buffer is full
- skb_push(skb, len), __skb_push(skb, len): like the put methods, but at the beginning

Network Devices
Transmitting packets:
- API: hard_start_xmit(sk_buff, netdev)
- Problem: hard_start_xmit() is called several times, but the driver has a limited amount of memory
- Solution:
  1. call netif_stop_queue(): stops the kernel from calling hard_start_xmit()
  2. the kernel keeps moving packets into the queue
  3.
call netif_wake_queue(): enables the kernel to call hard_start_xmit() and send a packet waiting in the queue
- hard_start_xmit() is protected by a spinlock, so only one call is active at a time

Network Devices
Receiving packets without DMA:
1. allocate socket buffer: dev_alloc_skb(len)
2. copy received bytes to the socket buffer
3. fill in metadata: dev, protocol
4. update statistics
5. call netif_rx(skb)
DMA-based method:
1. pre-allocate socket buffer: dev_alloc_skb(len)
2. let the hardware copy the data directly to the socket buffer
3. fill in metadata: dev, protocol
4. update statistics
5. call netif_rx(skb)
Example: snull/snull.c

Question & Answer
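As a recap of the socket buffer slides, the head/data/tail/end bookkeeping can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: struct toy_skb and the toy_* functions are hypothetical names, not the kernel's struct sk_buff API, and the sketch returns NULL when there is no room, which is a simplification of the kernel's behaviour.

```c
#include <stddef.h>

#define TOY_SKB_SIZE 64  /* hypothetical size of the allocated space */

struct toy_skb {
    unsigned char space[TOY_SKB_SIZE];
    unsigned char *head;  /* beginning of allocated space */
    unsigned char *data;  /* beginning of valid octets */
    unsigned char *tail;  /* end of valid octets */
    unsigned char *end;   /* end of allocated space */
    size_t len;           /* tail - data */
};

static void toy_init(struct toy_skb *s)
{
    s->head = s->data = s->tail = s->space;
    s->end = s->space + TOY_SKB_SIZE;
    s->len = 0;
}

/* Leave n bytes of headroom; only meaningful on an empty buffer. */
static int toy_reserve(struct toy_skb *s, size_t n)
{
    if (s->len != 0 || (size_t)(s->end - s->tail) < n)
        return -1;
    s->data += n;
    s->tail += n;
    return 0;
}

/* Append n bytes: returns where to write them, NULL without tailroom. */
static unsigned char *toy_put(struct toy_skb *s, size_t n)
{
    unsigned char *old_tail = s->tail;

    if ((size_t)(s->end - s->tail) < n)
        return NULL;
    s->tail += n;
    s->len += n;
    return old_tail;
}

/* Prepend n bytes: returns the new data pointer, NULL without headroom. */
static unsigned char *toy_push(struct toy_skb *s, size_t n)
{
    if ((size_t)(s->data - s->head) < n)
        return NULL;
    s->data -= n;
    s->len += n;
    return s->data;
}
```

A receive path would typically reserve headroom first, put the payload, and let upper layers push their headers in front of it; in every step len stays equal to tail - data.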