Linux System/Driver Training

Transcription

© 2005 Tschaeche IT-Services
Dr.-Ing. Oliver Tschäche
July 14, 2005
http://www.tschaeche.com

Activities
* 1996: Diplom in electrical engineering, specializing in microelectronics
* 1996-2000: Uni Erlangen
  * System/network administrator
  * Doctorate: hardware development of fault-tolerant arithmetic units
* 2000: Founded Tschaeche IT-Services
* 2000-2004: Self-employed consultant/developer + university
  * Caldera: Distributed version control system
  * DBench: PC simulator: FAUmachine
  * Uni Erlangen: lecture "Design of Hardware and its Linux Drivers" (DHWK)
* since 2005:
  * Siemens: Benchmarking of current NUMA systems
  * Uni: FAUmachine, DHWK
Personal career goal:
* 90% active participation in development projects
* 10% teaching/training/consulting
Session goals
* get new ideas on how to implement applications efficiently
* a good understanding of the relationship between user space system calls and the Linux driver interface
* an overview of communication mechanisms
  * between user space processes themselves
  * between user space processes and kernel drivers
* presentation of programming APIs: user/kernel space
Focus
* System Programming: Interface between User Space and Kernel Space (list of system calls)
* Not the libraries (libc only used as trampoline to switch to Kernel Mode)
* Kernel Driver Perspective: How to offer an efficient hardware interface to User Space
Outline
System Startup
Process Implementation
Communication Mechanisms
Design Decisions
User space API, system programming
Kernel Space API
Outline System Startup
1. Hardware/BIOS startup
2. Bootloader (optional)
3. Linux
Hardware/BIOS startup
BIOS: Implemented in PROM, EEPROM or other non-volatile memory
1. HW power-good
2. HW reset
3. Starting BIOS
4. Power On Self Test (POST)
5. Initial Hardware Setup
6. Optionally load and start the bootloader
Bootloader
* Bootloader: u-boot, BIOS built-in (x86: GRUB, Lilo, PXElinux, ...)
* Boot into different OSs
* Tasks of a Bootloader in the Linux Environment:
  * Set up kernel command line: driver setup, initrd, rootfs, init
  * Load Kernel
  * Load RAM-Disk
  * Start Kernel
Linux Startup - User Space View
1. Initialize hardware, drivers, sockets, IPC, ...
2. Mount root filesystem; its location is compiled into the kernel or provided by cmdline
3. Some systems: Start first process (/linuxrc) from RAMDISK
    3.1 load hardware dependent modules (e.g. SCSI driver for the root FS)
    3.2 Free RAMDISK, mount effective root filesystem
4. Start/Replace first process: /sbin/init (compiled in) or init parameter (cmdline)
    4.1 init executes system configuration scripts:
        * hardware config: e.g. filesystem checks, configure speed/protocol of serial consoles, configure network interfaces (IP, routing)
        * start daemons: crond, syslog, ssh, ...
    4.2 init starts gettys on consoles
Linux Startup - Kernel Space View I
1. bootloader jumps to byte 0x1000 (_stext) of the loaded image
2. _stext (arch/<host>/kernel/head.S) initializes the stack pointer and performs other necessary functions to create a minimal C runtime environment
3. start_kernel (init/main.c) prints the startup banner, parses the command line, calls other initialization functions
4. setup_arch (arch/<host>/kernel/setup.c) detects memory, enables the host MMU (paging_init()), sets up host-specific read/write io-port functions in the machine vector
5. trap_init (arch/<host>/kernel/traps.c) initializes interrupt capabilities (not enabled yet)
6. init_IRQ (arch/<host>/kernel/irq.c) initializes the hardware IRQ system with disabled IRQs. IRQ lines are enabled on request (request_irq())
Linux Startup - Kernel Space View II
7. sched_init (kernel/sched.c) initializes the pidhash array and bottom-half handlers
8. softirq_init (kernel/softirq.c) initializes the softirq subsystem; softirqs are managed by the kernel's ksoftirqd later
9. time_init (arch/<host>/kernel/time.c) initializes the kernel timer tick system, usually by installing an interrupt handler
10. console_init (drivers/char/tty_io.c)
11. init_modules (kernel/module.c)
12. kmem_cache_init (mm/slab.c) initializes kernel buffer organisation
13. calibrate_delay (init/main.c) calculates the BogoMIPS
14. mem_init (arch/<host>/mm/init.c)
15. kmem_cache_sizes_init (mm/slab.c)
16. fork_init (kernel/fork.c)
Linux Startup - Kernel Space View III
17. proc_caches_init (kernel/fork.c)
18. vfs_caches_init (fs/dcache.c)
19. buffer_init (fs/buffer.c)
20. page_cache_init (mm/filemap.c)
21. signals_init (kernel/signal.c)
22. proc_root_init (fs/proc/root.c) initializes the /proc filesystem
23. ipc_init (ipc/util.c)
24. check_bugs (include/asm-<host>/bugs.h)
25. smp_init (init/main.c) initializes the IOAPIC on Intel architectures, does nothing for others
26. rest_init (init/main.c) frees memory, launches init()
27. init (init/main.c) frees memory and launches the initial user process in the following sub-steps
    27.1 do_basic_setup (init/main.c) initializes hardware (PCI, network, ...)
Linux Startup - Kernel Space View IV
    27.2 prepare_namespace (init/main.c) mounts the root filesystem
    27.3 creates stdin, stdout, stderr
    27.4 execve of the initial process (usually /sbin/init)
28. do_initcalls (init/main.c) calls the init functions of compiled-in modules
29. mount_root (fs/super.c) actually mounts the root fs
see: http://billgatliff.com/ bgat/articles/emb-linux/startup.pdf
Outline Process Implementation
* Process local Stuff
* Filesystem Stuff
* Table of Signal Handlers
* Tracing a Process
* Virtual Memory
* Capabilities
* Resources
* Operations on a Process
API: clone, fork, execve
Process local Stuff
* effective UID/GID: owner/group under which the process is running
* pid: created by kernel, can't be changed
* process group ID: inherited from parent
  API: setpgid()
* thread group ID: inherited from the parent, or the PID of the child if it starts a new thread group. This is controlled with the CLONE_THREAD flag of the clone system call
* thread ID: equal to the PID unless the process is part of a thread group (CLONE_THREAD)
  API: gettid()
Filesystem Stuff
* Filesystem Information
* File Descriptor Table
* Name Space
Filesystem Stuff
Filesystem Information
* umask
  API: umask
* Current Working Directory
  API: chdir
* Root Directory
  API: chroot
Filesystem Stuff
File Descriptor Table
* Integer referring to an entry in kernel space
* an entry may be a socket, pipe, file, memory mapped region
API: open, socket, accept (listen), close, read, write, ioctl, ...
Filesystem Stuff
Name Space
* Filesystem hierarchy
  API: mount, umount
Filesystem Stuff
Process of Path Resolution, Path Definition
* Starting with '/' ⇒ absolute path
* Starting with other than '/' ⇒ relative path
Process of Path Resolution
1. Select a starting lookup directory
2. Follow path components with a trailing '/'
3. Evaluate final path component
Source: man 2 path_resolution
Filesystem Stuff, Path Resolution
1. Select a starting current lookup directory dependent on the first character:
   * if it's a '/' ⇒ use root-dir-element of process
   * if it's not a '/' ⇒ use cwd-element of process
   * ATTENTION: the cwd may lie outside the root-dir of the process
Filesystem Stuff, Path Resolution
2. Change current lookup directory according to path components with a trailing '/':
   1. fail with EACCES if process has no search permission
   2. fail with ENOENT if component is not found
   3. fail with ENOTDIR if component is not a directory and not a symbolic link
   4. if component is a directory, set current lookup directory to this component
   5. if component is a symbolic link, resolve it
      * if it is not a directory ⇒ fail with ENOTDIR
      * if it is a directory ⇒ set current lookup directory
   Resolution process involves limited recursion and fails with ELOOP if the limit is reached
Filesystem Stuff, Path Resolution
3. Find final entry
   * does not need to be a directory
   * if it does not exist, it's not necessarily an error
     * depends on the system call
     * e.g.: open-syscall may want to create it
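
The failure cases of path resolution can be observed from user space. The following is a minimal sketch, not part of the original slides: the helper name path_errno is hypothetical, and it uses stat(), which resolves the path the same way open() does.

```c
#include <errno.h>
#include <sys/stat.h>

/* Hypothetical helper: resolve `path` with stat() and return 0 on
 * success, or the errno value that path resolution failed with. */
int path_errno(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        return 0;
    return errno;
}
```

On a typical Linux system, path_errno("/dev/null/x") should yield ENOTDIR because /dev/null is not a directory, while a path through a missing directory should yield ENOENT.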
Table of Signal Handlers
* contains pointers to handler functions
* signal mask and pending signals are elements of each process
API: sigaction, signal
Tracing a Process
* flag indicating that the process is traced
  API: ptrace
Virtual Memory / User Address Space
* segments: text, data, bss, stack
  API: brk, sbrk
* memory mapped files
  API: mmap, munmap
* state of paging
  API: mlock, mlockall, munlock, munlockall
* Replacing segments and memory mappings by loading a new program, but keeping files
  API: execve
Resource Limits/Usage
* as (address space): size
* core: size of core file (truncated if greater)
* cpu: seconds (SIGXCPU, finally SIGKILL)
* data: data segment size
* memlock: maximum number of bytes locked in RAM
* stack: size of process stack (SIGSEGV: use alternate stack)
* fsize: maximum size of a created file (write, truncate)
* locks: number of file locks
* nofile: maximum file descriptor number + 1
* ofile: BSD compatibility to nofile
API: getrlimit, setrlimit, getrusage
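
As an illustration of the resource limit API, the sketch below reads the "nofile" limit of the calling process. The helper name nofile_limits is hypothetical and not taken from the training material.

```c
#include <sys/resource.h>

/* Hypothetical helper: read the soft and hard "nofile" limits of the
 * calling process. Returns 0 on success, -1 on error. */
int nofile_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;   /* limit enforced right now */
    *hard = rl.rlim_max;   /* ceiling for the soft limit */
    return 0;
}
```

An unprivileged process may raise its soft limit with setrlimit() only up to the hard limit.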
Capabilities
* Filesystem: chown, dac_override, dac_read_search, fowner, fsetid, mknod
* IPC: ipc_lock, ipc_owner, kill
* Network: net_admin, net_bind_service, net_broadcast, net_raw
* Process: setuid, setgid, setpcap
* System: admin, boot, chroot, module, nice, pacct, ptrace, rawio, resource, time, tty_config
API: capset, capget
see: man 7 capabilities
Operations on a Process
* parent death signal
* core dump
* keep capabilities on UID transition from uid 0 to non 0
API: prctl
Outline Communication Mechanism
* Signal Handling
* Filedescriptor
* IPC
* Comparison
Outline Signal Handling
* Signal Basics: Action, Type
* Sending Signals - Permissions
* Receiving/Handling Signals
* Standard (POSIX.1) Signals vs. Real-Time (POSIX 1003.1-2001, former POSIX.4) Signals
see: man 7 signal
Signal Handling
Signal Basics
* Predefined Actions for the receiving process:
  * Term: terminate the process
  * Ign: ignore the signal, just go on
  * Core: terminate the process and dump core
  * Stop: stop the process (send SIGCONT to restart)
* Standard (POSIX.1) signals:
  * predefined meaning, e.g. HUP, QUIT, KILL, CHLD, ILL, SEGV, ...
  * depend on architecture
  * default action is dependent on signal type
* Real-Time (POSIX 1003.1-2001) signals
  * no predefined meaning (LinuxThreads use first three)
  * can be used for application defined purposes
  * default action is to terminate the receiving process
see: man 7 signal
Signal Handling
Standard vs. Real-Time Signals
* Priority not defined by POSIX, but Linux (like most Unices) handles Standard Signals first
* Multiple instances are queued for Real-Time Signals, Standard Signals only queue one instance
* Real-Time Signals are delivered in a guaranteed order:
  * Multiple RTS of the same type are delivered in the order they were sent
  * Multiple but different RTS are delivered starting with the lowest numbered signal
Signal Handling
Receiving/Handling Signals
* Process can't catch SIGKILL/SIGSTOP
* Access to global variables from signal handler: use volatile
* install alternate stack: sigaltstack()
* Simple Handler: signal(), ANSI C
  * the only argument is the number of the signal
  * after catching one signal the handler has to be registered again
* Advanced Handler: sigaction(), POSIX
  * may operate on alternate stack
  * make certain system calls restartable across signals
  * configurable one shot behaviour
  * allows recursive occurrence (within handler)
  * provide additional information (sigqueue, other)
Signal Handling
Sending Signals
* Permissions:
  * Privileged (CAP_KILL)
  * Effective or real UID of the sending process must equal the real or saved set-user-ID of the target process
  * In case of SIGCONT: sending and receiving process belong to the same session
* API: raise, kill, killpg, (CLONE_THREAD: tkill, tgkill), sigqueue
* API: sigqueue → Add data (one single integer or pointer)
Signal Handling
Signals sent by the kernel
* terminal IO: SIGINT, SIGQUIT, SIGTSTP
* sockets: SIGURG, SIGPIPE, SIGIO
* write(): SIGPIPE
* alarm(), setitimer(), sleep(): SIGALRM, SIGVTALRM, SIGPROF
* abort(): SIGABRT
* fork(), clone(): SIGCHLD
* execve(): SIGTRAP
* many more...
Signal Handling
* Blocking signals:
  * Select which signals are blocked: sigprocmask()
  * Get a list of pending (blocked) signals: sigpending()
  * Sleep until a signal is delivered: pause(), sigsuspend()
Outline Filedescriptors
What is a filedescriptor:
* VFS - Virtual Filesystem Layer
* Pipes
* Sockets
Features:
* Wait on an event of more than one filedescriptor
  API: select
* Works in harmony with signals
Outline Filedescriptors - Virtual Filesystem Layer
* VFS offers API - Filesystem implements functionality
* Filesystems
* Permission
* File Types
* Example: Procedure done on open
* API: creat, open, close, read, write, fcntl, flock, umask, chmod, fchmod, chown, dup, dup2, link, unlink, mknod, stat, mmap, munmap, utime
Filedescriptors - Virtual Filesystem Layer
Filesystems:
* Mountpoint in Name Space of a process
* Implementations of block based FSs: ext2/3, xfs, reiser, ...
* Other FSs: proc, tmpfs, dev, sys, ...
* different capabilities:
  * memory mapping: mmap()
  * read/write methods: readv(), pread(), writev(), pwrite()
Filedescriptors - Virtual Filesystem Layer
Permissions:
* Access Modes:
  * read: read a file
  * write: write a file
  * execute: execute a file
* Originator:
  * User: effective User ID of requesting process
  * Group: Group IDs of that user
  * Other: all others which are not matched by User or Group
* Special Flags:
  * Set UID flag: changes effective User ID on execve
  * Set GID flag: changes Group ID on execve; on a directory, newly created files keep the GID of the directory
  * Sticky flag: on a directory, the write permission of the directory does not apply to files already created by other users (only the owner may delete or rename them)
Filedescriptors - Virtual Filesystem Layer
File Types:
* data: some bytes
* directories
* soft links: string references to other locations
* hard links: references to other locations at inode level
* character/block devices: drivers accessible through FS, e.g.: tty, harddisks, ...
* named pipes (fifos) and named unix sockets
Filedescriptors - Virtual Filesystem Layer
(Simplified) Open Procedure: open("/tmp/test", O_CREAT | O_WRONLY, 0777)
1. Look up the path and finally the file (man 2 path_resolution). Used process data:
   * Name Space
   * root or cwd
   * UID/GID: Access Permissions of Directories
2. create the file using umask of process and 0777
3. create kernel file entry, register functionality
4. create entry in process's filedescriptor table
5. at last pass the number of the entry to the user space
(Simplified) Write to a File: write(1, "hallo", 5)
* 1 → filedescriptor → file-operation → write
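
The effect of the umask in step 2 can be checked from user space. This is a sketch under assumptions: the helper created_mode is hypothetical, the umask is forced to 022 for the demonstration, and O_EXCL is added (it is not on the slide) so that an existing file is never overwritten.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` with requested mode 0777 while the
 * process umask is 022, write five bytes, and return the permission
 * bits the file actually got (or -1 on error). The file is removed. */
int created_mode(const char *path)
{
    struct stat st;
    mode_t old_mask = umask(022);   /* mask applied when creating */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0777);
    int mode = -1;

    if (fd >= 0) {
        /* fd -> filedescriptor table entry -> file operation -> write */
        if (write(fd, "hallo", 5) == 5 && fstat(fd, &st) == 0)
            mode = st.st_mode & 0777;
        close(fd);
        unlink(path);
    }
    umask(old_mask);                /* restore the caller's umask */
    return mode;
}
```

With umask 022, the requested 0777 is reduced to 0755.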
Filedescriptors - Pipes/Fifos
* Local to Host
* One Way Communication
* Stream oriented: One process writes → One process reads
* Fifos are named pipes within the Filesystem (access permissions), pipes are invisible
* Fifos block until a reading and a writing process are connected
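
A minimal sketch of the one-way byte stream, kept inside a single process for brevity. The helper pipe_roundtrip is hypothetical, and the message is assumed to be small enough to fit into the pipe buffer, otherwise write() would block.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: push `msg` through an anonymous pipe and read
 * it back into `buf`. Returns the number of bytes read, or -1. */
ssize_t pipe_roundtrip(const char *msg, char *buf, size_t buflen)
{
    int fds[2];                 /* fds[0]: read end, fds[1]: write end */
    size_t len = strlen(msg);
    ssize_t n = -1;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], msg, len) == (ssize_t)len) {
        close(fds[1]);          /* reader sees EOF after the data */
        fds[1] = -1;
        n = read(fds[0], buf, buflen);
    }
    if (fds[1] >= 0)
        close(fds[1]);
    close(fds[0]);
    return n;
}
```

In a real application the two ends would normally be used by different processes, for example a parent and a child created with fork().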
Outline Filedescriptors - Sockets
* Basics
* Protocol Family
Filedescriptors - Sockets
* Protocol Family (implementing one comm. semantic)
* Communication Semantics
  * SOCK_STREAM: reliable, sequenced, two-way, connection based byte stream
    API: read, write, send, recv, accept, connect, listen
  * SOCK_DGRAM: connectionless, unreliable messages
    API: sendto, recvfrom; after a connect() also read, write
  * SOCK_SEQPACKET: reliable, sequenced, two-way, messages
    API: read, write (data may be discarded), send, recv, accept, connect, listen
  * SOCK_RAW: raw network protocol
    API: sendto, recvfrom
  * SOCK_RDM: reliable datagram, ordering not guaranteed
    API: sendto, recvfrom
  API: socket
* Bind socket to an address (dependent on Protocol Family and Communication Semantics)
  API: bind
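
The difference between message and stream semantics can be sketched with a pair of Unix datagram sockets. The helper first_datagram_size is hypothetical, and socketpair() is used here only to avoid the bind/connect setup; it does not appear on the slides.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send two messages over a connected pair of Unix
 * datagram sockets and return the size of the first message received,
 * or -1 on error. */
ssize_t first_datagram_size(void)
{
    int sv[2];
    char buf[64];
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    if (send(sv[0], "ab", 2, 0) == 2 && send(sv[0], "cde", 3, 0) == 3)
        n = recv(sv[1], buf, sizeof(buf), 0);  /* one message per call */
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The receive call returns 2 bytes although 5 bytes are queued, because message boundaries are preserved. With SOCK_STREAM the same recv() could return both messages as one run of bytes.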
Filedescriptors - Sockets
Protocol Families
* Unix: Created host local within the file system (file type s)
  see: man 7 unix
* Inet: IP based network communication:
  * RAW: RAW-IP datagrams (SOCK_RAW)
    see: man 7 raw
  * UDP: UDP-IP datagrams (SOCK_DGRAM)
    see: man 7 udp
  * TCP: TCP-IP data stream (SOCK_STREAM)
    see: man 7 tcp
  see: man 7 inet
* Netlink: Transfer datagrams between kernel modules and user space processes
  see: man 7 netlink
* PACKET: send/receive at device driver (OSI layer 2) level
  see: man 7 packet
* X25, IPX, INET6, APPLETALK
Outline IPC
IPC Elements:
* Basics
* Message Queues
* Semaphore Sets
* Shared Memory Segments
see: man 5 ipc
IPC: Basics
IPC Name Space
* Handled with keys (integer) instead of filenames
* Conversion of a Name into a Key: man 3 ftok
* Process (and Children) Private Key: use IPC_PRIVATE
Access Permissions:
* read, write, (no execute)
* UID, GID, others
* Like well known FS permissions
* API: msgctl, semctl, shmctl
IPC
Message Queues:
* Creation of a message queue:
  * IPC_CREAT: create the queue if it does not exist yet
  * IPC_EXCL (together with IPC_CREAT): fail if the queue already exists
  API: msgget
* Send a message: Type must be set
  API: msgsnd
* Receive a message dependent on Type:
  * Type == 0: first message in the queue is read
  * Type > 0: first message with that type is read (with MSG_EXCEPT: first message not of that type)
  * Type < 0: first message with the lowest type less than or equal to abs(Type) is read
  API: msgrcv
* Control/Remove message queue
  API: msgctl
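
Type-based reception can be sketched with a private queue. The struct demo_msg and the helper receive_by_type are hypothetical names, and the message texts are arbitrary.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

struct demo_msg {
    long mtype;        /* message type, must be > 0 */
    char mtext[16];
};

/* Hypothetical helper: put a type-1 and a type-2 message on a private
 * queue, then ask for type 2. Copies the received text into `out`
 * (at least 16 bytes) and returns the received type, or -1 on error. */
long receive_by_type(char *out)
{
    struct demo_msg m;
    long type = -1;
    int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    memset(&m, 0, sizeof(m));
    m.mtype = 1;
    strcpy(m.mtext, "first");
    if (msgsnd(id, &m, sizeof(m.mtext), 0) == 0) {
        memset(&m, 0, sizeof(m));
        m.mtype = 2;
        strcpy(m.mtext, "second");
        if (msgsnd(id, &m, sizeof(m.mtext), 0) == 0 &&
            msgrcv(id, &m, sizeof(m.mtext), 2, 0) >= 0) {
            type = m.mtype;
            strcpy(out, m.mtext);
        }
    }
    msgctl(id, IPC_RMID, NULL);   /* remove the queue again */
    return type;
}
```

Although the type-1 message was sent first, asking for type 2 skips over it. A Type of 0 would return the type-1 message instead.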
IPC
Semaphore Sets
* Creation of a Semaphore Set
  API: semget
* Using Semaphores
  * May operate on more than one element, but atomically
  * Semaphore Element (SE) initialised to 0
  * Operation may automatically be undone on process termination
  * Operation Modes (om):
    * om > 0: increase SE by om, no wait necessary
    * om == 0: wait for SE == 0
    * om < 0: wait until SE >= abs(om), then decrease SE by abs(om)
  API: semop, semtimedop
* Control/remove a Semaphore Set
  API: semctl
IPC
Shared Memory Segments
* Creation of a Shared Memory Segment
  API: shmget
* Attaching to Shared Memory Segments
  Works like mmap, munmap
  API: shmat, shmdt
* Control/remove a Shared Memory Segment
  API: shmctl
Design Decisions
* System Programming:
  * Signals vs. Sockets
  * IPC vs. Filedescriptors
* Kernel Programming:
  * Kernel Driver vs. User Space Driver
  * Module vs. Built in Driver
  * Device Control
Signals vs. Sockets
* Signals:
  + "Broadcast" (sending to process groups)
  + Short latency (handled by signal handler)
  o Priorities
  o OS dependent order of handling
  - Short amount of data (sig number/sig info)
  - Granularity of Permissions
Signals vs. Sockets
* Sockets:
  + variable message size (implement protocol)
  + implementation defined order (select)
  + high number of originators (one socket for each type of msg)
  o FS permissions for Unix Sockets
  - higher latency compared to signals (2 syscalls: select, read/write)

IPC vs. Filedescriptors
* IPC properties:
  + More flexible permissions: uid/gid of user choosable by non privileged user
  + Semaphores
  - can't wait simultaneously on different message queues
* Filedescriptor properties:
  + wait on several file descriptors (API: select)
  + mmap implements shared memory
  - access permissions of 2.4 (extended attributes/acl only implemented in 2.6)
Kernel Driver vs. User Space Driver
User Space Driver's Pros:
* Full libc Support: Can do exotic tasks
* Easy Debugging without having to go through contortions to debug a running kernel
* If the driver hangs you can kill it and keep the system running (unless the hardware is misbehaving)
* User Memory is swappable: Large drivers which are used infrequently can be swapped out
* A well-designed driver program can still allow concurrent access to a device
Kernel Driver vs. User Space Driver
User Space Driver's Drawbacks:
* Direct Access to Memory is possible only by mmapping /dev/mem (privileged operation)
* Access to I/O ports is available only after calling ioperm or iopl (MIPS support?); access through /dev/port may be too slow to be effective (both are privileged operations)
* Response Time is slower because a context switch is required to transfer information or actions between the client and the hardware
* worse yet, if the driver has been swapped to disk, the response time increases terribly. mlock may help, but libraries need to be locked too (privileged operation)
* the most important devices can't be handled in user space, including (but not limited to) network interfaces and block devices
Module vs. Built-in-Driver
* Driver can be written so that both are possible
* Built-in-Driver
  + kernel memory: large contiguous physical blocks only available at boot
  + does not need module support
  - reboot after driver source changes during development
  - reboot when driver parameters change
* Module
  + loadable during run-time: remove - modify - load
  + driver modes can be implemented by parameters
  - difficult to get contiguous physical memory regions

Device Control
Controlling by write:
* 'robotic' devices which don't transfer data but just respond to commands:
  * command-oriented: data is never sent (written)
  * command interpreter is easier implemented as ioctl
  * easy to use from user space: echo, cat
  * large driver (parser)
  * implement escape sequences for data transfers
Controlling by ioctl:
* completely avoid write (can't be used with echo or cat)
* keeps driver small (no parser)
* user space must implement each command
Overview system programming
* process control
* signal handling
* file system file descriptors
* sockets file descriptors
* waiting/checking for multiple events
* ioctl
Overview process control
* create new processes: clone(), fork(), vfork()
* replacing a process: execve()
* terminating a process: exit(), _exit()
* handling children: wait(), waitpid(), wait3(), wait4()
Creating new processes
clone(function, stack, flags, arg): implements threads
* sys_clone: linux-2.4.20/arch/mips/kernel/syscall.c
* copies the state of the process (all registers including stack-pointer, program-counter)
* children share selectable parts of the parent's context (FS, FILES, NS, SIGHAND, VM)
* provide signal type for termination (ORed into flags)
  * use special options in wait-family calls if not using SIGCHLD!
* unlike fork(), clone() is a library call:
  * library checks for alternate stack, sys_clone supports going on with NULL
  * library calls function with argument arg for the child
  * when the function returns, library calls exit() with the return value of function
  * the caller must provide an alternate stack for the child (although sys_clone supports working on a copy)
* example: clone_umask.c
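
The example file named on the slide is not reproduced in this transcript. The following sketch is an independent illustration of the same idea, assuming glibc's clone() wrapper: a child that shares filesystem information (CLONE_FS) changes the umask, and the parent sees the change. The names child_set_umask, umask_after_clone and CHILD_STACK_SIZE are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for clone() in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Runs in the child; the return value becomes its exit code. */
static int child_set_umask(void *arg)
{
    umask(*(mode_t *)arg);
    return 0;
}

/* Hypothetical helper: start a child that shares FS information
 * (CLONE_FS) and let it change the umask. Returns the parent's umask
 * afterwards, or -1 on error. The previous umask is restored. */
int umask_after_clone(mode_t child_mask)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    mode_t old, seen;
    pid_t pid;

    if (stack == NULL)
        return -1;
    old = umask(022);           /* known starting point */
    /* The stack grows downwards on most architectures: pass the top. */
    pid = clone(child_set_umask, stack + CHILD_STACK_SIZE,
                CLONE_FS | SIGCHLD, &child_mask);
    if (pid < 0 || waitpid(pid, NULL, 0) != pid) {
        umask(old);
        free(stack);
        return -1;
    }
    seen = umask(old);          /* read the current mask and restore */
    free(stack);
    return (int)seen;
}
```

Without CLONE_FS the child would work on its own copy of the filesystem information and the parent would still see 022.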
Creating new processes
* fork() calls clone() with SIGCHLD as flags:
  * shares NS
  * creates copy of FS, FILES, SIGHAND, VM
  * VM: use copy-on-write for mapped pages
  * example: see examples.user-api/fork_simple.c
* vfork(): calls clone() with SIGCHLD | CLONE_VM | CLONE_VFORK
  * shares NS and VM!
  * creates copy of FS, FILES, SIGHAND
  * VM: be careful with the stack, it is shared with the parent!
  * parent is suspended until the child terminates or does execve()
  * Mostly used when a call to execv() follows very soon.
  * see: arch/mips/kernel/syscall.c
Replacing a process
execve(filename, args, env):
* text, data, bss and stack segments are overwritten by the loaded program.
* Filedescriptors are not closed.
* Pending signals are cleared, signal handlers are reset to default actions.
* SUID/SGID bits change the effective UID/GID of the process.
* if the process is traced, a SIGTRAP is sent after successful execve
* example: see exec_simple.c
Terminating a process
* terminate a process immediately: _exit(exit_code)
  * any open filedescriptors are closed (if not shared with parent, CLONE_FILES)
  * any child processes are inherited by the init process
  * the parent process is sent a SIGCHLD (or the signal supplied in clone())
  * exit code is returned to the parent process and can be collected with wait
* handling terminated processes: wait(), waitpid(), wait3(), wait4()
  * wait(status): wait for one child and catch the exit status
  * waitpid(pid, status, options):
    * waits for the child with pid pid
    * supplies options: WNOHANG, WUNTRACED
    * catch exit status
    * macros handling exit status
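
Creating a child, terminating it and collecting its exit status can be sketched as follows. The helper child_exit_status is a hypothetical name chosen for this illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that terminates with `code`
 * (0..255) and return the exit status the parent collects, or -1. */
int child_exit_status(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: terminate immediately */

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status))     /* e.g. killed by a signal */
        return -1;
    return WEXITSTATUS(status); /* macro extracting the exit code */
}
```

Passing WNOHANG as the third argument of waitpid() would make the call return 0 immediately if the child has not terminated yet.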
Process Control Examples
* fork_simple.c
* exec_simple.c
* clone_umask.c
signal handling
* handling signal masks:
  API: sigemptyset(), sigfillset(), sigaddset(), sigdelset(), sigismember()
* installing a handler
  API: signal(), sigaction()
* sending signals:
  API: raise(), kill(), killpg(), tkill(), sigqueue(), alarm()
* block signal delivery
  API: sigprocmask
* examination of blocked, but pending signals
  API: sigpending
* wait for distinct signals
  API: sigsuspend, pause
handling signal masks
* declaration: sigset_t sigset
* sigemptyset(&sigset): clear all signals in the set
* sigfillset(&sigset): set all signals in the set
* sigaddset(&sigset, signum): add a signal to the set
* sigdelset(&sigset, signum): remove a signal from the set
* sigismember(&sigset, signum): is the signal already a member of the set?
Installing a handler
* signal(num, handler): ANSI C
  * num: signal type (SIGINT, SIGTERM, ...), can't catch SIGKILL, SIGSTOP
  * handler: function to call if signal num is delivered
* sigaction(num, action, oldact): POSIX
  * handler function
  * flags:
    * SA_NOCLDSTOP: suppress child stop notifications
    * SA_ONESHOT: Restore the signal default handler after handling the signal once
    * SA_ONSTACK: Use alternate stack (if available, see: man 2 sigaltstack)
    * SA_RESTART: Make certain system calls restartable across signals
    * SA_NOMASK: allow recursive occurrence (within handler)
    * SA_SIGINFO: provide additional information (sigqueue, other)
  * mask of blocked signals while this handler is active
Sending signals
* kill(pid, sig): send signal sig to process with pid pid
* raise(sig): kill(getpid(), sig), library function
* sigqueue(pid, sig, val): send signal sig and data val to process with pid pid; multiple instances are queued only for real time signals (SIGRTMIN+n)
* killpg(pgrp, sig): send signal to all processes of the process group pgrp, mapped to kill(-pgrp, sig)
* tkill(tid, sig): send signal only to one process of a thread group (CLONE_THREAD, each process has the same pid)
* alarm(seconds): schedules a SIGALRM for the process. Any previously set alarm is cancelled. Use 0 seconds to remove an alarm
Blocking signals
* sigprocmask(action, &new_set, &old_set): block a set of signals
  * SIG_BLOCK: add the members of new_set to the current set
  * SIG_UNBLOCK: remove the members of new_set from the current set
  * SIG_SETMASK: replace the current set with new_set
  * old_set: the state of the set before the call
* examination of blocked, but pending signals: sigpending(&sigset)
Suspending a process until a signal is delivered
* sigsuspend(&si): temporarily replace the signal mask with si and sleep until a signal which is not a member of si is delivered
* pause(): sleep until a signal is delivered which terminates the process or is caught by a signal handler
Example:
sighandler.c
File System
* Opening a file:
  API: open(), close()
* Filedescriptor modification:
  API: fcntl()
* Writing data:
  API: write(), writev(), sendfile()
* Reading data:
  API: read(), readv(), sendfile() (if mmap possible)
* Memory mapping:
  API: mmap(), munmap(), mremap(), mlock()
File System
* open(name, flags, permissions):
  * name: path to the file
  * some flags:
    * O_APPEND: before each write the file pointer is positioned at the end
    * O_NONBLOCK: neither the open itself nor any subsequent operation on the file descriptor will block
    * O_SYNC: block until data is physically written
    * O_NOFOLLOW: if name is a symbolic link, fail
    * O_DIRECT: minimize cache effects, read/write directly from/to disk
    * O_ASYNC: generate a SIGIO when the file descriptor is ready to send/receive data
    * O_LARGEFILE: allow files to be opened whose size cannot be represented in an off_t (2/4GB)
    some of these can be altered after the open with fcntl()
  * permissions: if a file is created, access permissions (reduced by umask)
c
2005
Tschaeche IT-Services
Oliver Tschäche
http://www.tschaeche.com
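The effect of O_APPEND can be shown with a short sketch. The function name append_demo and the scratch path passed to it are assumptions for this illustration; the file is created, used and removed again.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Writes "abc", rewinds, writes "de" and returns the resulting file size.
 * Returns -1 on any error. The scratch file is removed afterwards. */
long append_demo(const char *path)
{
    off_t size = -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, "abc", 3) == 3 &&
        lseek(fd, 0, SEEK_SET) == 0 &&   /* rewind to the start ... */
        write(fd, "de", 2) == 2)         /* ... O_APPEND still writes at the end */
        size = lseek(fd, 0, SEEK_END);
    close(fd);
    unlink(path);
    return (long)size;
}
```

Without O_APPEND the second write would overwrite "ab" and the size would stay 3; with it, a check such as assert(append_demo("/tmp/append_demo.tmp") == 5) holds.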
File System
- close(fd):
  - close file descriptor, can’t be used any more
  - if fd is the last copy of a particular file descriptor, associated resources are freed
  - if fd is the last reference to a file which was removed (unlink()), delete the file
File System, fcntl(fd, cmd, arg)
- Handling close-on-exec:
  - F_DUPFD: copy fd to the lowest numbered fd greater than or equal to arg, close-on-exec on copy is off!
  - F_GETFD: read close-on-exec flag
  - F_SETFD: set close-on-exec flag to the FD_CLOEXEC bit of arg
  - Example: fs_close_on_exec.c
- Status flags:
  - F_GETFL: read file descriptor’s flags
  - F_SETFL: set file descriptor’s flags (O_APPEND, O_NONBLOCK, O_ASYNC, O_DIRECT)
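The descriptor-flag and status-flag commands above can be exercised on a pipe. This is a minimal sketch, not the fs_close_on_exec.c example from the course; the name fcntl_flags_demo and the lower bound 10 for F_DUPFD are arbitrary choices for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if all fcntl() observations hold, 0 if not, -1 if pipe() fails. */
int fcntl_flags_demo(void)
{
    int fd[2], copy, flags, ok;

    if (pipe(fd) < 0)
        return -1;

    /* descriptor flag: close-on-exec */
    flags = fcntl(fd[0], F_GETFD);
    ok = flags >= 0 && fcntl(fd[0], F_SETFD, flags | FD_CLOEXEC) == 0;
    ok = ok && (fcntl(fd[0], F_GETFD) & FD_CLOEXEC) != 0;

    /* F_DUPFD: lowest free descriptor >= 10, close-on-exec is off on the copy */
    copy = fcntl(fd[0], F_DUPFD, 10);
    ok = ok && copy >= 10 && (fcntl(copy, F_GETFD) & FD_CLOEXEC) == 0;

    /* status flags belong to the open file, so the copy sees them too */
    flags = fcntl(fd[0], F_GETFL);
    ok = ok && flags >= 0 && fcntl(fd[0], F_SETFL, flags | O_NONBLOCK) == 0;
    ok = ok && (fcntl(copy, F_GETFL) & O_NONBLOCK) != 0;

    if (copy >= 0)
        close(copy);
    close(fd[0]);
    close(fd[1]);
    return ok;
}
```

The last check illustrates the difference between the two flag groups: FD_CLOEXEC is per descriptor, while O_NONBLOCK is shared by all descriptors that refer to the same open file.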
File System, fcntl(fd, cmd, arg) continued
- Managing signals:
  - F_GETOWN: get process or process group (negative value), currently receiving SIGIO/SIGURG
  - F_SETOWN: set process or process group that will receive SIGIO/SIGURG (O_ASYNC must be set). If the signal handler sets SA_SIGINFO, si_code indicates SI_SIGIO and si_fd gives the associated file descriptor
  - F_GETSIG: get signal sent when input/output becomes possible
  - F_SETSIG: set signal sent when input/output becomes possible. Using a real time signal, multiple I/O events may be queued using the same signal number
File System, fcntl(fd, cmd, arg) continued
Leases (Linux specific, define _GNU_SOURCE):
- watchdog on a file:
  - uses signaling SIGIO:
    - lease breaker calls open, which blocks (unless O_NONBLOCK is set)
    - lease holder handles SIGIO and downgrades lease (cleanup: e.g. flushing buffers)
    - if lease holder is too slow (/proc/sys/fs/lease-break-time), the kernel forces the downgrade
  - F_SETLEASE, arg kind of lease:
    - F_RDLCK: we will be notified if another process opens the file for writing
    - F_WRLCK: we will be notified if another process opens the file for reading or writing
    - F_UNLCK: remove the lease from the file
- Example: fs_lease.c
File System, fcntl(fd, cmd, arg) continued
File/directory change notification (Linux specific, define _GNU_SOURCE):
- fd refers to a directory
- F_NOTIFY: arg is the logical OR of:
  - DN_MULTISHOT: don’t notify only once
  - DN_ACCESS: a file was accessed (read(), readv(), pread())
  - DN_MODIFY: a file was modified (write(), writev(), pwrite(), truncate())
  - DN_CREATE: a file was created (open(), creat(), mknod(), mkdir(), link(), symlink(), rename())
  - DN_DELETE: a file was deleted (unlink(), rename to another directory, rmdir())
  - DN_RENAME: a file was renamed within this directory (rename())
  - DN_ATTRIB: the attributes of a file were changed (chown(), chmod(), utime(), utimes())
- Example: fs_notify.c
File System
Handling the file offset
- lseek(fd, offset, whence): set the current position in the file
  - SEEK_SET: new current position is set to offset
  - SEEK_CUR: new current position is current position plus offset
  - SEEK_END: new current position is end of file plus offset
  - seeking beyond the end of file does not by itself increase the file size. Once data is written there, the (not written to) gap reads back as zeros.
- _llseek(fd, offset_high, offset_low, result, whence):
  - like lseek but 64 bit clean
  - Linux specific
- read(), readv(), write(), writev():
  - read from/write to fd size bytes into memory buffer using current file offset
  - file offset is readjusted at successful read/write
- don’t mix functions on file descriptors with the functions from the stdio (FILE) library
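The behaviour of lseek() around the end of file can be checked with a scratch file. A minimal sketch; the name seek_gap_demo, the offset 4096 and the path argument are assumptions made for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the gap behaves as described, 0 if not, -1 if open() fails. */
int seek_gap_demo(const char *path)
{
    struct stat st;
    char buf[4] = { 1, 1, 1, 1 };
    int ok;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    /* seeking past the end does not grow the file ... */
    ok = lseek(fd, 4096, SEEK_SET) == 4096;
    ok = ok && fstat(fd, &st) == 0 && st.st_size == 0;
    /* ... but writing there does */
    ok = ok && write(fd, "x", 1) == 1;
    ok = ok && fstat(fd, &st) == 0 && st.st_size == 4097;
    /* the gap that was never written reads back as zeros */
    ok = ok && lseek(fd, 100, SEEK_SET) == 100;
    ok = ok && read(fd, buf, 4) == 4;
    ok = ok && buf[0] == 0 && buf[1] == 0 && buf[2] == 0 && buf[3] == 0;
    /* SEEK_END with a negative offset positions relative to the end */
    ok = ok && lseek(fd, -1, SEEK_END) == 4096;
    close(fd);
    unlink(path);
    return ok;
}
```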
File System
Reading/Writing with file offset modification:
- read(fd, buffer, size), write(fd, buffer, size):
  - read from/write to file descriptor fd size bytes at memory buffer
  - modify file offset when successful
  - may successfully return with less than size bytes transferred (interrupted by signal, near end of file, no more bytes available from pipe or terminal)
  - if file descriptor is set to O_NONBLOCK, function fails with EAGAIN if it would block
- readv(fd, vector, num), writev(fd, vector, num):
  - like read()/write() but supports several buffer/size pairs stored in the vector of size num
  - modifies file offset
  - erroneously placed in manual section 3
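The vector calls can be sketched with a pipe: one writev() gathers two buffers into a single write, one readv() scatters the bytes into two buffers again. The name vector_io_demo and the buffer contents are assumptions for this illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Returns 1 if 12 bytes made the round trip intact, 0 if not, -1 if pipe() fails. */
int vector_io_demo(void)
{
    int fd[2];
    char head[8] = { 0 }, tail[8] = { 0 };
    struct iovec out[2], in[2];
    ssize_t n;

    if (pipe(fd) < 0)
        return -1;

    /* gather: two source buffers, one system call */
    out[0].iov_base = (void *)"Hello, ";
    out[0].iov_len = 7;
    out[1].iov_base = (void *)"world";
    out[1].iov_len = 5;

    /* scatter: the first 7 bytes go to head, the next 5 to tail */
    in[0].iov_base = head;
    in[0].iov_len = 7;
    in[1].iov_base = tail;
    in[1].iov_len = 5;

    n = writev(fd[1], out, 2);
    if (n == 12)
        n = readv(fd[0], in, 2);
    close(fd[0]);
    close(fd[1]);
    return n == 12 && strcmp(head, "Hello, ") == 0 && strcmp(tail, "world") == 0;
}
```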
File System
Reading/Writing without file offset modification:
- pread(fd, buffer, size, offset), pwrite(fd, buffer, size, offset):
  - read from/write to file descriptor fd size bytes at memory buffer starting from offset within the file
  - does not modify the file offset
  - may successfully return with less than size bytes transferred (interrupted by signal, near end of file, no more bytes available from pipe or terminal)
  - if file descriptor is set to O_NONBLOCK, function fails with EAGAIN if it would block
  - use with macro before including unistd:
    #define _XOPEN_SOURCE 500
    #include <unistd.h>
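A short sketch of the positional calls: after a normal write() the file offset is 10, and pwrite()/pread() at explicit offsets leave it there. The name positional_io_demo and the path argument are assumptions for this illustration.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 500   /* as on the slide: must precede the includes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if pread()/pwrite() left the file offset untouched,
 * 0 if not, -1 if open() fails. */
int positional_io_demo(const char *path)
{
    char buf[3];
    int ok;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    ok = write(fd, "0123456789", 10) == 10;      /* file offset is now 10 */
    ok = ok && pwrite(fd, "AB", 2, 4) == 2;      /* overwrite "45" in place */
    ok = ok && pread(fd, buf, 3, 3) == 3;        /* read bytes 3..5 */
    ok = ok && memcmp(buf, "3AB", 3) == 0;
    ok = ok && lseek(fd, 0, SEEK_CUR) == 10;     /* offset did not move */
    close(fd);
    unlink(path);
    return ok;
}
```

Because no shared offset is involved, these calls are convenient when several threads or processes access different parts of one file through the same descriptor.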
File System
Memory mapped files
- parts of a file are mapped into the virtual memory of the process
- mapping only possible in page-sized units
- getpagesize(): returns the number of bytes in a page
- mmap(), munmap(): maps/unmaps regions of a file into virtual memory
- msync(): synchronize memory with the backing file
File System, memory mapping
- mmap(start, length, prot, flags, fd, offset):
  - start: suggest an address in virtual memory (page-aligned), use 0 to let the kernel choose
  - length: size of the memory/file area (page-sized)
  - prot: represents memory protection and must not conflict with access permissions of the file
    - PROT_EXEC: pages may be executed
    - PROT_READ: pages may be read
    - PROT_WRITE: pages may be written
    - PROT_NONE: pages may not be accessed
  - fd: file descriptor
  - offset: offset within the file (page-sized)
  - file size must cover all mapped pages, otherwise SIGBUS on access
File System, memory mapping
- flags: type of mapped object, must specify one of MAP_SHARED or MAP_PRIVATE
  - MAP_SHARED (POSIX.1b): share this mapping with other processes, storing to the region is like writing to the file
  - MAP_PRIVATE (POSIX.1b): create a private copy-on-write mapping.
    - Stores do not affect the original file
    - It is unspecified if changes to the file after the mmap are visible in the mapped region
  - MAP_FIXED (POSIX.1b): don’t ignore start
  - MAP_NORESERVE: don’t reserve swap space (for private/anonymous mappings), might get SIGSEGV upon write when no memory is available
  - MAP_GROWSDOWN: used for stacks, VM system should extend this mapping downwards
  - MAP_ANONYMOUS: mapping is not backed by any file (fd and offset are ignored)
File System, memory mapping
- munmap(start, length): removes mapping for the requested region
  - further references to addresses of that region cause a SIGSEGV
  - closing the file does not unmap a region
  - access time of the file is updated at some point between mmap() and munmap()
  - modification/status change time of the file is updated at some point between the first write to that region and a call to munmap() or msync()
- msync(start, length, flag): write changes of file-backed mappings to the file
  - only writes back the memory area starting at start of length length
  - MS_ASYNC: only schedule update, return immediately
  - MS_SYNC: wait until the file is updated
- Examples: fs_mmap_simple.c, fs_mmap_on_fork.c
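The difference between MAP_SHARED and MAP_PRIVATE can be observed by mapping one page of a scratch file twice. This is a minimal sketch and not the course's fs_mmap_simple.c; the name mmap_demo and the path argument are assumptions, and sysconf(_SC_PAGESIZE) is used in place of getpagesize().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a MAP_SHARED store reached the file and a MAP_PRIVATE
 * store did not, 0 otherwise, -1 if open() fails. */
int mmap_demo(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    char *shared = MAP_FAILED, *priv = MAP_FAILED;
    char byte = 1;
    int ok = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    /* the file must cover the mapped page, otherwise access raises SIGBUS */
    if (ftruncate(fd, page) == 0) {
        shared = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        priv = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    }
    if (shared != MAP_FAILED && priv != MAP_FAILED) {
        priv[1] = 'P';                   /* copy-on-write: the file is untouched */
        shared[0] = 'S';                 /* like writing to the file */
        ok = msync(shared, page, MS_SYNC) == 0;
        ok = ok && pread(fd, &byte, 1, 0) == 1 && byte == 'S';
        ok = ok && pread(fd, &byte, 1, 1) == 1 && byte == 0;
    }
    if (shared != MAP_FAILED)
        munmap(shared, page);
    if (priv != MAP_FAILED)
        munmap(priv, page);
    close(fd);
    unlink(path);
    return ok;
}
```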
POSIX shared memory objects
- Creating/opening/removing POSIX shared memory objects:
  API: shm_open(), shm_unlink()
- File descriptor modification:
  API: fcntl()
- Memory mapping:
  API: mmap(), munmap(), mremap(), mlock()
- Use real time library when linking: -lrt
- available in glibc 2.2 and higher
- uses dedicated file system, normally mounted under /dev/shm
POSIX shared memory objects
- shm_open(name, open_flags, mode):
  - easily create/open shared memory
  - not backed up to the disk
  - fd is guaranteed to be the lowest-numbered
  - fd is close-on-exec by default
- shm_unlink(name):
  - removes only the name, like unlink
  - object is destroyed after last process unmaps the object
  - subsequent attempts to shm_open the name create a new object
Pipes, Fifos
- One way communication
- Creating a pipe: pipe(fd[2])
  - returns two file descriptors
  - fd[0] is for reading
  - fd[1] is for writing
- Using/creating a fifo: use open() after mkfifo()
- Sending data:
  API: write(), writev(), sendfile()
- Receiving data:
  API: read(), readv()
- Connection parameters:
  API: fcntl() (O_NONBLOCK, O_ASYNC)
- Example: fork a process with separate stdio, pipes_on_fork.c
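A pipe is typically created before fork(), so that parent and child each keep one end. The sketch below is a reduced illustration and not the course's pipes_on_fork.c; the name pipe_fork_demo and the message "ping" are assumptions.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes "ping" into the pipe, the parent reads it.
 * Returns 1 on success, 0 on mismatch, -1 if pipe() or fork() fails. */
int pipe_fork_demo(void)
{
    int fd[2], status = 0;
    char buf[16] = { 0 };
    ssize_t n;
    pid_t pid;

    if (pipe(fd) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {                      /* child: keeps only the write end */
        close(fd[0]);
        _exit(write(fd[1], "ping", 4) == 4 ? 0 : 1);
    }
    close(fd[1]);                        /* parent: keeps only the read end */
    n = read(fd[0], buf, sizeof buf - 1);
    close(fd[0]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return n == 4 && strcmp(buf, "ping") == 0 &&
           WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Closing the unused end in each process matters: the reader only sees end-of-file once every copy of the write end is closed.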
Sockets
- Opening/creating a connection:
  API: socket(), connect(), bind(), listen(), accept(), socketpair()
- Sending data:
  API: send(), sendto(), sendmsg(), write(), writev(), sendfile()
- Receiving data:
  API: recv(), recvfrom(), recvmsg(), read(), readv(), sendfile() (if mmap possible)
- offset is not supported, call pread(), pwrite() only with 0 offset
- Connection parameters:
  API: getsockname(), getpeername()
- Sockets support O_NONBLOCK, O_ASYNC:
  API: fcntl()
- Closing a socket:
  API: close() and, additionally, shutdown() on stream-type sockets
Sockets
Opening/creating sockets
- socket(domain, type, protocol): open file descriptor with selected communication mechanism
  - domain: PF_UNIX, PF_INET, PF_INET6,...
  - type: SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET, SOCK_RAW, SOCK_RDM
  - protocol: most families implement only one protocol, use 0
- bind(s, address, size): most communication mechanisms need to be assigned a local address
  - s: file descriptor obtained from socket()
  - address: address specifier (dependent on family)
  - size: size of address specifier
  - non SOCK_STREAM/SOCK_SEQPACKET type sockets are ready to be used now
Sockets
Connecting SOCK_STREAM/SOCK_SEQPACKET type sockets
- Server side:
  - listen(s, backlog):
    - enable willingness to accept connections on file descriptor s
    - backlog: maximum number of pending connection requests
    - poll for read on s signals incoming connection
  - accept(s, address, size):
    - create new file descriptor from the first connection on the queue of listening socket s
    - file descriptor does not inherit any flags (O_ASYNC, O_NONBLOCK) from listening socket
    - address: contains the origin of the incoming connection
- Client side:
  - connect(s, address, size):
    - connect socket s to the destination address
  - if a SOCK_DGRAM socket calls connect():
    - default destination address is set
    - receive datagrams only from that address
    - may use connect() multiple times to reset default address
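The server and client steps can be put together in one process for a PF_UNIX stream socket. This is a simplified sketch, not the course's sk_unix_server.c/sk_unix_client.c; it relies on Linux queueing the connect() on the backlog before accept() is called, and the name unix_stream_demo and the socket path are assumptions.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns 1 if "hello" travelled from client to server, 0 otherwise. */
int unix_stream_demo(const char *path)
{
    struct sockaddr_un addr;
    char buf[8] = { 0 };
    int srv, cli, conn = -1, ok = 0;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);
    unlink(path);                        /* remove a stale socket file */

    srv = socket(PF_UNIX, SOCK_STREAM, 0);
    cli = socket(PF_UNIX, SOCK_STREAM, 0);
    if (srv < 0 || cli < 0)
        goto out;
    /* server side: local address, then willingness to accept */
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto out;
    if (listen(srv, 1) < 0)
        goto out;
    /* client side: the request waits on the backlog queue */
    if (connect(cli, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto out;
    conn = accept(srv, NULL, NULL);      /* new descriptor for this connection */
    if (conn < 0)
        goto out;
    if (send(cli, "hello", 5, 0) != 5)
        goto out;
    if (recv(conn, buf, sizeof buf - 1, 0) != 5)
        goto out;
    ok = strcmp(buf, "hello") == 0;
out:
    if (conn >= 0)
        close(conn);
    if (cli >= 0)
        close(cli);
    if (srv >= 0)
        close(srv);
    unlink(path);
    return ok;
}
```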
Sockets
Socket options: setsockopt(s, level, optname, optval, optlen), getsockopt(s, level, optname, optval, optlen)
- level:
  - SOL_SOCKET for the socket
  - protocol number for other levels, e.g. getent protocols tcp
- general socket options:
  - SO_KEEPALIVE: keep rarely used stream type sockets alive
  - SO_OOBINLINE: place out-of-band data within data streams
  - SO_RCVTIMEO, SO_SNDTIMEO: timeout until reporting an error
  - SO_BINDTODEVICE: bind socket to a particular interface (eth0), inet sockets
  - SO_REUSEADDR: ignore timeout on inet closed sockets
  - SO_DONTROUTE: don’t use gateways
  - SO_BROADCAST: dgram sockets may receive/send broadcast packets
  - SO_SNDBUF, SO_RCVBUF: set size of send/receive buffers
  - SO_LINGER: shutdown(), close() socket synchronously
  - SO_PRIORITY: set type-of-service field of inet sockets
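As an illustration of one of these options, the sketch below sets SO_RCVTIMEO on one end of a socketpair and then calls recv() although nothing was sent. The name rcvtimeo_demo and the 50 ms timeout are assumptions for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if recv() gave up with EAGAIN/EWOULDBLOCK after the timeout,
 * 0 otherwise, -1 if socketpair() fails. */
int rcvtimeo_demo(void)
{
    int sv[2], ok;
    char c;
    struct timeval tv;

    if (socketpair(PF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    tv.tv_sec = 0;
    tv.tv_usec = 50000;                  /* 50 ms receive timeout */
    ok = setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) == 0;

    /* nothing was sent: recv() blocks for the timeout, then reports an error */
    errno = 0;
    ok = ok && recv(sv[0], &c, 1, 0) == -1;
    ok = ok && (errno == EAGAIN || errno == EWOULDBLOCK);

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```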
Sockets
Closing sockets:
- non SOCK_STREAM/SOCK_SEQPACKET type sockets:
  - just call close() on file descriptor
- SOCK_STREAM/SOCK_SEQPACKET type sockets:
  - call shutdown(s, mode):
    - SHUT_RD: further reception will be disallowed
    - SHUT_WR: further transmission will be disallowed
    - SHUT_RDWR: further reception and transmission will be disallowed
    - without shutdown() tcp-sockets will stay in close-wait state
  - then call close() on file descriptor
Sockets
Transferring data:
- read(), write(): like file system FDs
  - maps to recv(), send() with flags=0
- send(s, buf, size, flags), recv(s, buf, size, flags):
  - normally used with connection based sockets
  - uses default destination for non SOCK_STREAM, SOCK_SEQPACKET sockets
- sendto(s, buf, size, flags, to, to_len), recvfrom(s, buf, size, flags, from, from_len):
  - normally used with non connection based sockets
  - SOCK_STREAM, SOCK_SEQPACKET based sockets must use (NULL, 0) for from/to fields
  - explicitly set destination address/get sender address for one call
- sendmsg(s, msghdr, flags), recvmsg(s, msghdr, flags):
  - readv(), writev() counterparts
Sockets
Send data, flags:
- MSG_OOB: send out-of-band data
- MSG_DONTROUTE: don’t use a gateway
- MSG_DONTWAIT: use non blocking mode
- MSG_NOSIGNAL: don’t send SIGPIPE on error
- MSG_MORE: wait for additional data before sending
Receive data, flags:
- MSG_OOB: receive out-of-band data
- MSG_DONTWAIT: use non blocking mode
- MSG_PEEK: return data but don’t remove it from the queue
- MSG_WAITALL: block until full request is satisfied (signals may break this)
- MSG_TRUNC: return real length of dgram packet, even if it was longer than the passed buffer
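Two of the receive flags can be demonstrated on a datagram socketpair. A minimal sketch; the name msg_flags_demo and the payload "data" are assumptions for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if MSG_PEEK left the datagram queued and MSG_DONTWAIT
 * failed without blocking, 0 otherwise, -1 if socketpair() fails. */
int msg_flags_demo(void)
{
    int sv[2], ok;
    char buf[16] = { 0 };

    if (socketpair(PF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    ok = send(sv[0], "data", 4, 0) == 4;
    /* MSG_PEEK: look at the datagram, it stays on the queue */
    ok = ok && recv(sv[1], buf, sizeof buf, MSG_PEEK) == 4;
    /* a normal recv() therefore still finds it, and removes it */
    ok = ok && recv(sv[1], buf, sizeof buf, 0) == 4;
    ok = ok && memcmp(buf, "data", 4) == 0;
    /* queue is empty now: MSG_DONTWAIT fails instead of blocking */
    errno = 0;
    ok = ok && recv(sv[1], buf, sizeof buf, MSG_DONTWAIT) == -1;
    ok = ok && (errno == EAGAIN || errno == EWOULDBLOCK);

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```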
Waiting/checking for activities on file descriptors
select(), pselect():
- work on fd sets: sets of marked file descriptors
  - FD_ZERO(&fd_set): clears a set of file descriptors
  - FD_SET(fd, &fd_set): enable file descriptor fd in the set
  - FD_CLR(fd, &fd_set): disable file descriptor fd in the set
  - FD_ISSET(fd, &fd_set): tests the state of file descriptor fd
- three groups, registered file descriptors will be watched to see
  - if bytes are ready for reading
  - if bytes can be written non blocking
  - if an exception arises
- select(), pselect() will return as soon as:
  - an event is triggered
  - a timeout elapses
  - the process is signaled
- fd sets will show the originator
Waiting/checking for activities on file descriptors
pselect(n, rfds, wfds, efds, timeout, sigset):
- n: highest numbered descriptor of all sets, plus 1
- timeout: struct timespec, nanosecond based
- sigset: set of signals which are blocked while waiting
select(n, rfds, wfds, efds, timeout):
- n: highest numbered descriptor of all sets, plus 1
- timeout: struct timeval, microsecond based
- like pselect() call with NULL pointer sigset
see example: sk_unix_server.c, sk_unix_client.c
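A small sketch of select() on the read end of a pipe: first the timeout elapses because nothing is readable, then one byte is written and the descriptor is reported as ready. The name select_demo and the 10 ms timeout are assumptions for this illustration.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if select() first timed out and then reported the pipe
 * as readable, 0 otherwise, -1 if pipe() fails. */
int select_demo(void)
{
    int fd[2], idle, ready, marked;
    fd_set rfds;
    struct timeval tv;

    if (pipe(fd) < 0)
        return -1;

    /* empty pipe: the timeout elapses, select() returns 0 */
    FD_ZERO(&rfds);
    FD_SET(fd[0], &rfds);
    tv.tv_sec = 0;
    tv.tv_usec = 10000;
    idle = select(fd[0] + 1, &rfds, NULL, NULL, &tv);

    if (write(fd[1], "x", 1) != 1)
        idle = -1;

    /* sets and timeout are modified by select(): set them up again */
    FD_ZERO(&rfds);
    FD_SET(fd[0], &rfds);
    tv.tv_sec = 0;
    tv.tv_usec = 10000;
    ready = select(fd[0] + 1, &rfds, NULL, NULL, &tv);
    marked = FD_ISSET(fd[0], &rfds) != 0;

    close(fd[0]);
    close(fd[1]);
    return idle == 0 && ready == 1 && marked;
}
```

Note that the first argument is the highest descriptor plus one, and that the fd sets must be rebuilt before every call because select() overwrites them with the result.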
Waiting/checking for activities on file descriptors
poll(pollfds, nfds, timeout):
- pollfds: an array with registered file descriptors and events:
  - fd: file descriptor to be watched
  - events: requested events
  - revents: returned events
- possible events to be requested:
  - POLLIN: there is data to read
  - POLLPRI: there is urgent data to read
  - POLLOUT: write will not block any more
- possible events to be returned:
  - all requested events
  - POLLERR: error condition
  - POLLHUP: hang up
  - POLLNVAL: invalid request, fd not open
- nfds: number of entries in the array
- timeout: milliseconds
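The requested and returned events can be observed on the two ends of a pipe. A minimal sketch assuming Linux pipe semantics (POLLHUP on the read end once the writer is gone); the name poll_demo is made up for this illustration.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if poll() reported POLLOUT, POLLIN and POLLHUP as expected,
 * 0 otherwise, -1 if pipe() fails. */
int poll_demo(void)
{
    int fd[2], ok;
    char c;
    struct pollfd pfd[2];

    if (pipe(fd) < 0)
        return -1;
    pfd[0].fd = fd[0];
    pfd[0].events = POLLIN;              /* read end: is there data? */
    pfd[1].fd = fd[1];
    pfd[1].events = POLLOUT;             /* write end: would write block? */

    /* empty pipe: only the write end is ready */
    ok = poll(pfd, 2, 0) == 1;
    ok = ok && pfd[0].revents == 0 && (pfd[1].revents & POLLOUT) != 0;

    /* one byte queued: both ends are ready */
    ok = ok && write(fd[1], "x", 1) == 1;
    ok = ok && poll(pfd, 2, 0) == 2 && (pfd[0].revents & POLLIN) != 0;

    /* writer closed and data drained: hang up is returned unrequested */
    close(fd[1]);
    ok = ok && read(fd[0], &c, 1) == 1;
    ok = ok && poll(&pfd[0], 1, 0) == 1 && (pfd[0].revents & POLLHUP) != 0;

    close(fd[0]);
    return ok;
}
```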
IO Control: manipulate operating parameters
- ioctl(fd, request, data):
  - request is special to opened file/socket
  - data is dependent on the type of request
- works on sockets:
  - set/get process group to which a SIGIO/SIGURG is sent
  - get the time of the last packet passed to the user
  - test for out-of-band data within tcp sockets
  - more family/type dependent..., man 7 tcp, man 7 unix
- works on character special files
  - control operating characteristics of character special files
  - serial line: set baud rate, start/stop bits
  - terminals (tty): map CTRL-C to SIGINT, echo, buffering
  - cdrom: start playing, eject
  - printer: reset, get status
  - many more...
- works on file systems
  - set block size
  - ext2 file system: version, modify flags
  - many more...
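A simple request that needs no special hardware is FIONREAD, which asks how many bytes are waiting to be read. The sketch assumes Linux, where this request is supported on pipes; the name fionread_demo is made up for this illustration.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if FIONREAD reported the 5 bytes queued in the pipe,
 * 0 otherwise, -1 if pipe() fails. */
int fionread_demo(void)
{
    int fd[2], pending = -1, ok;

    if (pipe(fd) < 0)
        return -1;
    ok = write(fd[1], "12345", 5) == 5;
    /* request: FIONREAD, data: pointer to an int that receives the count */
    ok = ok && ioctl(fd[0], FIONREAD, &pending) == 0;
    ok = ok && pending == 5;
    close(fd[0]);
    close(fd[1]);
    return ok;
}
```

Here the third argument is a pointer to additional data; other requests pass the value itself, which is why the prototype leaves the type of the argument open.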
Overview Kernel Space API
- Modules/Built in Drivers
- Debugging
- Proc Filesystem
- dev Filesystem
- Character devices
- Scheduling
- Atomic Operations
- Interrupts
- Memory Management
- Network devices
see: Linux Device Drivers, 2nd Edition
http://www.xml.com/ldd/chapter/book
Modules/built in Drivers
- Building Modules
- Manual Configuration Parameters
- Kernel built in Drivers
Building Modules
- Set Preprocessor Variables:
  - __KERNEL__: access kernel specific data in include files
  - MODULE: compile as module (not built in). Must be set before <linux/module.h> is included
  - add kernel includes to path: -I /usr/src/linux/include
- Entry Point: init_module()
- Exit Point: cleanup_module()
Building Modules: Hello world Example
Makefile:
CFLAGS = -D__KERNEL__ -DMODULE -I/usr/src/linux/include \
         -O -Wall
all: module.o
clean:
	rm -f *.o *~ core
module.c:
#include <linux/module.h>
int init_module(void) { printk("<1>Hello, world\n"); return 0; }
void cleanup_module(void) { printk("<1>Goodbye cruel world\n"); }
Manual Configuration Parameters
- MODULE_PARM(variable, type)
  - variable: char, short, integer, char *
  - type: "b", "h", "i", "l", "s"
  - use arrays for type: "1-3i"
- MODULE_PARM_DESC(variable, description)
- View Description with strings (grep parm_desc)
insmod module integer=1 array=2,3
#include <linux/module.h>
int integer=0x300;
int array[2];
MODULE_PARM(integer, "i");
MODULE_PARM_DESC(integer, "The base I/O port (0x300)");
MODULE_PARM(array, "1-2i");
int init_module(void) {
    printk("<1>integer=0x%x\n", integer);
    printk("<1>array[0]=%d\n", array[0]);
    printk("<1>array[1]=%d\n", array[1]);
    return 0;
}
void cleanup_module(void) { printk("<1>Goodbye cruel world\n"); }
Usage Counter
- Goal: Safely remove Module
- Macros defined in <linux/module.h>:
  - MOD_INC_USE_COUNT
  - MOD_DEC_USE_COUNT
  - MOD_IN_USE
- For debugging: implement ioctl to reset this counter
Exporting Symbols
- Goal: Make Symbols available in subsequently loaded modules
- Macros defined in <linux/module.h>:
  - EXPORT_NO_SYMBOLS: module does not export any symbol
  - EXPORT_SYMBOL: export with versioning information
  - EXPORT_SYMBOL_NOVERS: export without versioning information
EXPORT_SYMBOL(exported_function);
Kernel Built in Drivers
- Use Attributes:
  - __init: free memory of that function after initialisation
  - __exit: ignore that function, don’t create any code
- Attributes __init and __exit have no effect, if compiled as module
- Problem: Module needs entry point (init_module)
  → use macros: module_init(init-name), module_exit(exit-name)
Debugging
- Debugging by Printing: printk()
  - setting loglevel: KERN_EMERG, KERN_ALERT, KERN_CRIT, KERN_ERR, KERN_WARNING, KERN_NOTICE, KERN_INFO, KERN_DEBUG
  - Control Logged Output: echo 8 > /proc/sys/kernel/printk
  - Switching Logmessages On/Off:
#ifdef DEBUG
# define PDEBUG(fmt, args...) \
         printk(KERN_DEBUG "my dev: " fmt, ##args)
#else
# define PDEBUG(fmt, args...)
#endif
PDEBUG("some log message %d\n", integer_value);
Debugging - Continued
- Debugging by Querying:
  - Use /proc FS: Everybody looks at /proc - Security?
  - Use ioctl: Undocumented ioctls often remain unnoticed
  - implement ioctl resetting module count
- Debugging by Watching the Application:
  - Debugger/strace
  - printf
- Debuggers and Related Tools:
  - gdb: gdb /usr/src/linux/vmlinux /proc/kcore
  - kdb/kgdb: Kernel Debugger only available for I386 (as patch)
  - ikd: Less architecture dependent than kdb (as patch)
  - Kernel Crash Dump Analysers: LKCD, LCRASH
  - User-Mode-Linux: Virtual Machine, Running Linux as Process
procfs
- create new directory: proc_mkdir(), proc_mkdir_mode()
- create new read entry:
  1. implement read function
  2. install read function: create_proc_read_entry()
- cleanup directories and entries: remove_proc_entry()
- example: scull/main.c
devfs
Filesystem in which each driver registers an access point. Alternatively, use manually created (with mknod, see man 1 mknod) special files.
- found in <linux/devfs_fs_kernel.h>
- devfs is only available if macro CONFIG_DEVFS_FS is set
- Create Directory: devfs_mk_dir(parent, name, info):
  - parent: handle for parent directory (use NULL for root directory)
  - name: name of the new directory
  - info: not used
devfs
- Create Character Device: devfs_register(dir, name, flags, major, minor, mode, ops, info)
  - dir: handle for parent directory (use NULL for root directory)
  - name: name of the new character device
  - flags: DEVFS_FL_AUTO_DEVNUM,...
  - major/minor: explicitly set major/minor number
  - mode: access permissions
  - ops: file operations
  - info: private data
- Remove Directory/ChrDev: devfs_unregister()
- see example: scull/main.c
Character Device Driver
Components of a Character Device Driver:
- Major number: selects driver, 60-63, 120-127, 250-254 are reserved for local use
- Minor number: select device within the driver
- create character device node in filesystem: mknod
- Provides File Operations (read, write,...)
- Access through Filesystem
- Kernel API: register_chrdev, unregister_chrdev (<linux/fs.h>)
- see example: scull/main.c
Character Device Driver - File Operations
File Operations (<linux/fs.h>):
- llseek: modifies position counter
- read, write: get, send data
- readdir: only useful to filesystems
- poll: implement poll/select
- ioctl: issue device specific commands
- mmap: map device memory to a process’s address space
- flush: called within close system call
- release: called when last file is closed
- fsync: flush pending data
- fasync: notifies change in operation mode
- flock: only used within regular files
- readv, writev: read/write on multiple memory areas
Only implement needed functions
Character Device Driver - File structure
File structure (<linux/fs.h>):
- f_mode: check for FMODE_READ or FMODE_WRITE in ioctl()
- f_pos: 64bit value of current reading/writing position; don’t change it directly, use the last argument of the read(), write() operations instead
- f_flags: flags used with open() (O_RDONLY, O_NONBLOCK,... see <linux/fcntl.h>), use f_mode to check read/write
- f_op: file operations, may be replaced for tuning
- private_data: used to store a pointer to allocated data, special to that file
Character Device Driver - Typical procedure
1. module_init: register device manually or via devfs, supply file operations
2. f_op->open: allocate file special data, place it in private_data
3. f_op->read, write,...: use private_data to do the work
4. f_op->release: release everything that open created
Attention: dup(), fork() just create new references to the file structure and don’t call f_op->open(). Accordingly, f_op->release() is only called if the reference count of a file struct drops to 0.
Character Device Driver - read/write return values
f_op->read(file, buffer, size, offset)
- requested number of bytes (size) were transferred
- some but not all bytes were transferred: application has to retry the read
- 0 if end-of-file is reached
- negative value: specifies error (<linux/errno.h>)
f_op->write(file, buffer, size, offset)
- requested number of bytes (size) were transferred
- some but not all bytes were transferred: application has to retry the write
- 0: nothing was transferred, this is not an error
- negative value: specifies error (<linux/errno.h>)
read/write return with not completely filled buffers, if O_NONBLOCK is set or the process is signaled
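Seen from user space, these return values mean that an application has to loop until everything is transferred. The sketch below shows one common way to do this for blocking descriptors; the helper names write_all, read_all and retry_demo are assumptions for this illustration, not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Keep calling write() until all size bytes are transferred.
 * Returns size on success, -1 on error (errno set by write()). */
ssize_t write_all(int fd, const void *buf, size_t size)
{
    const char *p = buf;
    size_t left = size;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)          /* interrupted by a signal: retry */
                continue;
            return -1;
        }
        p += n;                          /* partial transfer: continue after it */
        left -= (size_t)n;
    }
    return (ssize_t)size;
}

/* Keep calling read() until size bytes arrived or end-of-file is reached.
 * Returns the number of bytes read, -1 on error. */
ssize_t read_all(int fd, void *buf, size_t size)
{
    char *p = buf;
    size_t done = 0;

    while (done < size) {
        ssize_t n = read(fd, p + done, size - done);
        if (n == 0)                      /* end-of-file */
            break;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Round trip through a pipe; returns 1 on success, -1 if pipe() fails. */
int retry_demo(void)
{
    int fd[2], ok;
    char buf[8] = { 0 };

    if (pipe(fd) < 0)
        return -1;
    ok = write_all(fd[1], "hello", 5) == 5;
    close(fd[1]);                        /* reader sees end-of-file after the data */
    ok = ok && read_all(fd[0], buf, sizeof buf - 1) == 5;
    ok = ok && strcmp(buf, "hello") == 0;
    close(fd[0]);
    return ok;
}
```

With O_NONBLOCK set, the helpers return -1 with errno EAGAIN when the call would block; a caller would then wait with select() or poll() before retrying.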
Character Device Driver - readv/writev
vector versions of read/write
- if not defined in f_op, emulated with subsequent read/write
- file pointer and position pointer are the same as for read/write
- struct iovec:
  - created in user space, but copied to kernel space before calling the driver
  - defined in <linux/uio.h>
- offer higher efficiency:
  - e.g. useful for tapes: create only one record
Character Device Driver - ioctl
ioctl(inode, file, cmd, arg):
- inode, file: like arguments of open
- cmd: number that corresponds to a command
  - simple choice: start with 1, problem: cmd should be unique. Then, if used with the wrong device, an error can be detected.
  - old convention: 8 bit magic code, 8 bit ordinal number
  - current convention: use bit fields <linux/ioctl.h>, <asm/ioctl.h>
    - type (_IOC_TYPEBITS): magic number, 8 bit wide
    - number (_IOC_NRBITS): ordinal number, 8 bit wide
    - direction: _IOC_NONE, _IOC_READ or/and _IOC_WRITE
    - size: from 8 to 14 bits wide (dependent on architecture)
- arg:
  - pointer to additional data
  - the data itself
- return value:
  - 0 on success
  - for an unknown command, ENOTTY: POSIX
  - for an unknown command, EINVAL: common practice
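The bit-field convention can be inspected from user space, since the encoding macros are also available there. In this sketch the magic number 'k' and the DEMO_* command names are hypothetical; only the _IO/_IOR/_IOW and _IOC_* macros come from <linux/ioctl.h>.

```c
#include <assert.h>
#include <linux/ioctl.h>

/* Hypothetical command set of a driver using the magic number 'k' */
#define DEMO_IOC_MAGIC  'k'
#define DEMO_IOCRESET   _IO(DEMO_IOC_MAGIC, 0)        /* no data transfer */
#define DEMO_IOCGVALUE  _IOR(DEMO_IOC_MAGIC, 1, int)  /* user reads an int */
#define DEMO_IOCSVALUE  _IOW(DEMO_IOC_MAGIC, 2, int)  /* user writes an int */

/* Returns 1 if the fields decode to the values they were built from. */
int ioctl_encoding_demo(void)
{
    return _IOC_TYPE(DEMO_IOCGVALUE) == DEMO_IOC_MAGIC     /* type: magic number */
        && _IOC_NR(DEMO_IOCGVALUE) == 1                    /* ordinal number */
        && _IOC_SIZE(DEMO_IOCGVALUE) == sizeof(int)        /* size of the argument */
        && _IOC_DIR(DEMO_IOCGVALUE) == _IOC_READ           /* direction, seen from the user */
        && _IOC_DIR(DEMO_IOCSVALUE) == _IOC_WRITE
        && _IOC_DIR(DEMO_IOCRESET) == _IOC_NONE
        && DEMO_IOCGVALUE != DEMO_IOCSVALUE;               /* commands stay unique */
}
```

A driver can use the same decoding macros at the top of its ioctl method to reject commands whose type field does not match its own magic number.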
Character Device Driver - ioctl
predefined commands, device drivers are interested in magic number 'T':
- FIOCLEX: set close-on-exec flag
- FIONCLEX: clear close-on-exec flag
- FIOASYNC: modify async flag
- FIONBIO: modify blocking flag (historical fcntl())
Character Device Driver - ioctl
pointer arguments:
- access_ok(type, addr, size): checks user space area
  - type: VERIFY_READ for read access, VERIFY_WRITE for read/write access
  - addr, size: user space address and its size
  - returns 1 for success, 0 for failure
  - driver should return -EFAULT if access fails
- transfer single values:
  - put_user(datum, ptr), get_user(datum, ptr): include access_ok(), return -EFAULT on failure
  - __put_user(datum, ptr), __get_user(datum, ptr): no check version
  - copy_to_user(), copy_from_user(): transfer data between kernel and user space
Character Device Driver - ioctl
Capabilities, some of <linux/capability.h>:
- problem: any user should be able to use a device, but should not be able to perform every operation
- additionally check for capabilities:
  - CAP_DAC_OVERRIDE: ability to override access restrictions on files and directories
  - CAP_NET_ADMIN: ability to perform network administration tasks
  - CAP_SYS_MODULE: ability to load or remove modules
  - CAP_SYS_RAWIO: ability to perform raw IO operations
  - CAP_SYS_ADMIN: catch all capability, provide any access
- capable(capability) returns true, if the process has the capability
- example: scull/main.c
Character Device Driver - poll
determine whether a process is able to read/write non blocking <linux/poll.h>
- used in applications that use multiple input/output streams
- f_op->poll(file, poll_table):
  1. call poll_wait(file, wait_queue, poll_table) on each wait queue that could indicate a change in the poll status
  2. return a bit mask describing operations that could be performed without blocking
     - POLLIN: ready to read
     - POLLRDNORM: ready to read normal data (ored with POLLIN)
     - POLLPRI: ready to read out-of-band data (select exception)
     - POLLHUP: end-of-file reached (select read)
     - POLLERR: an error condition occurred (select read/write)
     - POLLOUT: ready to write
     - POLLWRNORM: ready to write normal data
     - POLLWRBAND: ready to write high priority data
- example: scull/pipe.c
Character Device Driver - fasync
send signal when data arrives/write buffers are available again
- problem: several struct files may be opened
- register each async file in a queue:
  - fasync_helper(fd, file, mode, fasync_struct)
    - file: register struct file
    - mode: switch on/off async mode
    - fasync_struct: queue which is modified
- send signal to all processes with registered files of a fasync_struct:
  - kill_fasync(fasync_struct, sig, band):
    - fasync_struct: elements of this queue are signaled
    - sig: signal to send
    - band: event to send (POLLIN, POLLOUT,...)
- remove registered async files in calls to f_op->release!!
- example: scull/pipe.c
Character Device Driver - access control
problem: open can be called several times on a device file
- restrict to a single open:
  use a counter, return -EBUSY on second open
- restrict to a single user:
  remember user of the first open
- block the open call: for short time usage (e.g. logging)
- create copies of the private data
see chapter 5 of Linux Device Drivers
Character Device Driver - memory mapping
- f_op->mmap(), see section memory management
- f_op->munmap(), see section memory management
- f_op->remap(), see section memory management
Scheduling
- Each process (current->state) is in a defined state (<linux/sched.h>):
  - TASK_RUNNING: Ready to be scheduled
  - TASK_INTERRUPTIBLE: Waiting for an event or a signal
  - TASK_UNINTERRUPTIBLE: Waiting for an event only, not interruptible by signals
  - TASK_ZOMBIE: Task finished, waiting to return exit code
  - TASK_STOPPED: Task sleeping, waiting for signal SIGCONT
- Scheduling is evaluated after timer interrupt, HZ (<linux/param.h>) times a second
- Manual scheduling is initiated by schedule()
- Processes are not scheduled while in kernel mode
  - Avoid long lasting Loops within Kernel Code (or schedule manually)
  - Preemptive Problem
Delaying execution
- Time is measured in jiffies (defined in <linux/sched.h>), incremented at each timer interrupt (HZ times a second)
- Example code:
      set_current_state(TASK_INTERRUPTIBLE);
      schedule_timeout(delay*HZ);
  - delay is the timeout in seconds
  - an extra time interval may pass between the expiration of the timeout and the moment the process is actually scheduled
- Short delays: udelay(unsigned long usecs), mdelay(unsigned long msecs)
  - udelay() waits for usecs microseconds
  - mdelay() is a loop around udelay(1000)
  - implemented as busy wait, based on the bogomips loops per second
  - suggested maximum value for udelay() is 1 millisecond
- Short delays: increase HZ, but use with caution
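The tick arithmetic behind delay*HZ can be tried out in user space. The following is a sketch under assumptions: MODEL_HZ is an assumed tick rate (the real HZ depends on the architecture), and seconds_to_ticks() and tick_after() are hypothetical helper names. tick_after() mirrors the idea of a wrap-safe comparison, which matters because the tick counter overflows and starts again at zero.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_HZ 100UL  /* assumed tick rate, for illustration only */

/* Convert a delay in seconds into timer ticks, as in delay*HZ. */
unsigned long seconds_to_ticks(unsigned long seconds)
{
    assert(seconds <= ULONG_MAX / MODEL_HZ);    /* guard against overflow */
    return seconds * MODEL_HZ;
}

/*
 * Wrap-safe test "tick a lies after tick b".
 * The unsigned difference is interpreted as a signed value, so the
 * result stays correct when the counter has wrapped in between.
 */
int tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}
```

A plain comparison such as timeout > now gives the wrong answer when timeout has wrapped past zero, whereas the signed difference does not.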
Wait Queues
- List of processes which are waiting for a wake-up: wait_queue_head_t
- Initialize a wait queue: init_waitqueue_head(), DECLARE_WAIT_QUEUE_HEAD()
- Register the current process to wait on this queue:
  - sleep_on(): wait until an event occurs on that queue
  - interruptible_sleep_on(): wait until an event occurs on that queue or the process is signaled
  - sleep_on_timeout(): same as sleep_on(), but resume after the timeout
  - interruptible_sleep_on_timeout(): same as interruptible_sleep_on(), but resume after the timeout
  - wait_event(): wait until an event occurs and the condition holds
  - wait_event_interruptible(): wait until an event occurs and the condition holds, or the process is signaled
Wait Queues - Continued
Simplified Wait
    void simplified_sleep_on(wait_queue_head_t *queue)
    {
        wait_queue_t wait;

        init_waitqueue_entry(&wait, current);
        current->state = TASK_INTERRUPTIBLE;
        add_wait_queue(queue, &wait);
        schedule();                         /* give up the CPU until woken */
        remove_wait_queue(queue, &wait);
    }
Wait Queues - Continued
Simplified Wait Exclusive
    void simplified_sleep_exclusive(wait_queue_head_t *queue)
    {
        wait_queue_t wait;

        init_waitqueue_entry(&wait, current);
        current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;
        add_wait_queue_exclusive(queue, &wait);
        schedule();                         /* give up the CPU until woken */
        remove_wait_queue(queue, &wait);
    }
Wait Queues - Continued
Going to sleep without Races
- Problem: a wake-up arriving between the test and the sleep is lost

      while (short_head == short_tail) {
          interruptible_sleep_on(&short_queue);
          /* ... */
      }
Wait Queues - Continued
Going to sleep without Races
- Solution:

      wait_queue_t wait;

      init_waitqueue_entry(&wait, current);
      add_wait_queue(&short_queue, &wait);
      while (1) {
          set_current_state(TASK_INTERRUPTIBLE);
          if (short_head != short_tail) /* whatever test */
              break;
          schedule();
      }
      set_current_state(TASK_RUNNING);
      remove_wait_queue(&short_queue, &wait);
Wait Queues - Continued
- Wake up the processes and schedule immediately:
  - wake_up(): wake up registered processes
  - wake_up_interruptible(): wake up registered processes which have called an interruptible version
  - wake_up_interruptible() and wake_up() may not return immediately; a woken-up process may be executed first
- Wake up the processes but keep the current process running:
  - wake_up_sync(): wake up registered processes by marking them runnable
  - wake_up_interruptible_sync(): wake up registered processes which have called an interruptible version
  - wake_up_interruptible_sync() and wake_up_sync() only mark processes runnable, but don't call schedule(); this is left to the current process
- This may save context switches! See example.
Wait Queues - Continued
wake_up vs. wake_up_sync:

    while (1) {
        wake_up(wq1);
        wake_up(wq2);
        schedule();
    }
Task Queues
Problems:
- polling the hardware without a context switch
- give timely input to a hardware device (short periods; use kernel timers for long periods)
- keep the latency of interrupt routines short
Properties:
- doing work without a context switch
- a suitable process context is not available (in general)
  - don't access user space, you don't know which process is active
  - not allowed to call schedule(), sleep_on(), kmalloc() (which may sleep), semaphores
Task Queues
Predefined Task Queues:
- scheduler queue:
  - called from keventd within process context
  - tq_schedule is hidden, use schedule_task()
- timer queue:
  - called at each timer tick in interrupt context
  - use queue_task(your_task, &tq_timer)
- immediate queue:
  - called at return of a system call or when the scheduler is run, whichever comes first
  - use queue_task(your_task, &tq_immediate)
  - uses the bottom half mechanism, thus call mark_bh(IMMEDIATE_BH)
  - Attention: mark the bottom half after queueing the task!
  - fastest queue
Kernel Timers
- Resolution is jiffies
- Avoids re-registering a task in the timer queue
- Easy to use: register your function once and the kernel calls it once when the time expires
- Use the functions to be forward compatible:
  - init_timer(): initialization
  - add_timer(): insert the timer into the global list of active timers
  - mod_timer(): modify the expiration of an active timer
  - del_timer(): delete an active timer from the list
  - del_timer_sync(): makes sure that the timer function is not running on any CPU (e.g. for rmmod)
Tasklets
- since 2.4 tasklets are the preferred way to accomplish bottom-half tasks
- defer execution of a task until a safe time, like task queues
- run once and may reschedule themselves, like task queues
- may be run in parallel on SMP systems
- run on the CPU which first schedules them (better cache behaviour, faster)
- work on single-CPU machines
Atomic Operations
- goal: modify bits without interference from IRQs or other CPUs
- bit operations
- integer operations: data types depend on the architecture, 24 bits guaranteed
- semaphores
Atomic Operations
Bit operations: <asm/bitops.h>
- test_bit(nr, addr):
  - test the bit in memory, not using caches
- set_bit(nr, addr), clear_bit(nr, addr), change_bit(nr, addr):
  - directly write to memory, not using caches
- test_and_set_bit(nr, addr), test_and_clear_bit(nr, addr), test_and_change_bit(nr, addr):
  - again, work directly on memory
  - return the state of the bit before the operation
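The "return the state of the bit before the operation" behaviour can be illustrated in user space. This is only a model: the model_* functions are hypothetical names, they operate on a single C11 atomic_ulong instead of an arbitrary bit array, and C11 <stdatomic.h> replaces the kernel primitives.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define MODEL_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Atomically set bit nr and return its previous state (0 or 1). */
int model_test_and_set_bit(int nr, atomic_ulong *addr)
{
    assert(nr >= 0 && nr < MODEL_BITS);
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_or(addr, mask);
    return (old & mask) != 0;
}

/* Atomically clear bit nr and return its previous state (0 or 1). */
int model_test_and_clear_bit(int nr, atomic_ulong *addr)
{
    assert(nr >= 0 && nr < MODEL_BITS);
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_and(addr, ~mask);
    return (old & mask) != 0;
}

/* Read bit nr without modifying it. */
int model_test_bit(int nr, atomic_ulong *addr)
{
    assert(nr >= 0 && nr < MODEL_BITS);
    return (int)((atomic_load(addr) >> nr) & 1UL);
}
```

A typical use is a busy flag: whoever sees 0 returned from the test-and-set call owns the resource, everybody else sees 1.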
Atomic Operations
Integer operations: <asm/atomic.h>
- type: atomic_t, an integer, but use only 24 bits!
- atomic_read(atom): returns the integer value
- atomic_set(atom, int): set the atomic to int
- atomic_add(int, atom), atomic_sub(int, atom): add/subtract int to/from the atomic
- atomic_inc(atom), atomic_dec(atom): increment/decrement the atomic
- atomic_add_and_test(int, atom), atomic_sub_and_test(int, atom): perform the operation and return true if the result is zero
- atomic_inc_and_test(atom), atomic_dec_and_test(atom): perform the operation and return true if the result is zero
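In the Linux kernel the *_and_test operations report whether the result of the operation is zero, which is what makes them useful for reference counting. A minimal user-space model, assuming C11 <stdatomic.h> as a stand-in for atomic_t; the model_* names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Decrement *v; return 1 if the result is zero, else 0. */
int model_atomic_dec_and_test(atomic_int *v)
{
    assert(v != NULL);
    /* atomic_fetch_sub() returns the old value, so old == 1 means new == 0. */
    return atomic_fetch_sub(v, 1) == 1;
}

/* Increment *v; return 1 if the result is zero, else 0. */
int model_atomic_inc_and_test(atomic_int *v)
{
    assert(v != NULL);
    return atomic_fetch_add(v, 1) == -1;
}
```

With a reference counter, the caller that sees a true result from the decrement is the last user and may free the object.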
Atomic Operations - multiprocessor
Spinlocks: <linux/spinlock.h>
- busy wait until the lock is released
- spin_lock_init(lock): run-time initialization
- spin_lock(lock), spin_unlock(lock): acquire/release the given lock
- spin_lock_irqsave(lock, flags), spin_unlock_irqrestore(lock, flags): disable IRQs (and save flags)/restore IRQs before/after acquiring/releasing the given lock
- spin_lock_irq(lock), spin_unlock_irq(lock): disable/enable IRQs before/after acquiring/releasing the given lock
- spin_lock_bh(lock), spin_unlock_bh(lock): disable/enable execution of bottom halves before/after acquiring/releasing the given lock
- spin_is_locked(lock): check the state of the lock
- spin_trylock(lock): acquire the lock if free, else return failure
- spin_unlock_wait(lock): wait until a lock is released
Atomic Operations - multiprocessor
Reader-Writer locks: <linux/spinlock.h>
- Problem:
  - spinlocks only allow a single holder, even if it only reads
  - many readers may access the data simultaneously; exclusive locking is only needed for writers
- read_lock(lock), read_unlock(lock): acquire/release a read lock; more than one CPU may hold a read lock
- write_lock(lock), write_unlock(lock): only one CPU gets the write lock (as soon as all read locks are released)
- Reader-writer locks support irq, irqsave and bh variants
Atomic Operations
Semaphores: <asm/semaphore.h>
- sema_init(semaphore, state):
  - semaphore: has a state and a wait queue
  - state: initial state
    - state == 0: semaphore is held by a process
    - state > 0: semaphore is free
- down_interruptible(semaphore):
  - waits until the semaphore has a state greater than 0
  - decrements the state of the semaphore
  - if the process is signaled while waiting, it returns true (nonzero) and the driver should return -ERESTARTSYS
- down(semaphore):
  - like down_interruptible(), but does not allow signals to be delivered
- up(semaphore):
  - release the semaphore (increment state by 1)
Interrupts
- register an IRQ handler
- blocking interrupts
- enable/disable interrupts
Interrupts
Registering IRQs: <linux/sched.h>
- request_irq(irq, handler, flags, name, private_data):
  - irq: the number of the requested interrupt
  - handler: function to call for an IRQ with the number irq
  - flags:
    - SA_INTERRUPT: fast handler, called with interrupts disabled
    - SA_SHIRQ: shared IRQ, other handlers may be attached to this IRQ
    - SA_SAMPLE_RANDOM: the IRQ contributes to the entropy pool used by /dev/random
  - name: name listed in /proc/interrupts
  - private_data: associated data, used with shared IRQs to select the handler when calling free_irq()
- free_irq(irq, private_data)
Interrupts
Blocking interrupts (<asm/system.h>, included from <linux/sched.h>):
- don't use sti() directly (a following handler may rely on IRQs being disabled)
- use the macros save_flags(), restore_flags():
  - can't pass flags to a non-inline function!

      unsigned long flags;

      save_flags(flags);
      cli();
      /* This code runs with interrupts disabled */
      restore_flags(flags);
Interrupts
Enable/disable IRQs:
- Generated interrupts are lost while an IRQ is disabled!
- Must not be used with shared IRQs!
- disable_irq(irq): wait for the IRQ handler to finish, then disable reporting of that IRQ
- disable_irq_nosync(irq): disable the IRQ without regard to any running IRQ handler
- enable_irq(irq): enable reporting of that IRQ
Interrupts
IRQ handler:
    void short_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* ... do irq handling ... */
        /* wake any reading process */
        wake_up_interruptible(&short_queue);
    }
Interrupts
Tasklet:
    void short_do_tasklet(unsigned long);
    DECLARE_TASKLET(short_tasklet, short_do_tasklet, 0);

    void short_tl_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* ... do irq handling ... */
        tasklet_schedule(&short_tasklet);   /* defer the real work */
    }

    void short_do_tasklet(unsigned long unused)
    {
        /* ... do real work ... */
    }
Interrupts
Bottom halves:

    void short_bh_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        /* ... do irq handling ... */
        /* Queue the bh. Don't care about multiple enqueueing */
        queue_task(&short_task, &tq_immediate);
        mark_bh(IMMEDIATE_BH);
    }
Memory management
- Memory zones
- Allocating memory
- Address spaces: physical, virtual, buses
- Side effects: caches!
Memory Zones
- normal memory
- DMA-capable memory: not used in MIPS systems
  - must be used in DMA transfers with peripheral devices on the ISA bus
- high memory: not used in MIPS systems
  - used on 32-bit x86 with the physical address extension (PAE, up to 64 GB)
Memory Allocation
- kmalloc(): a malloc()-like interface
- lookaside caches
- page-oriented allocation
- use virtual memory
- boot-time allocation
Memory Allocation
kmalloc(size, flags), kfree(obj):
- may sleep until pages are available (low-memory situation):
  - a function calling kmalloc() must be reentrant
- the returned region is contiguous in physical memory
- does not clear the memory it obtains
Memory Allocation
kmalloc(): controlled by flags:
- GFP_KERNEL: the allocation is performed on behalf of a process and may sleep
- GFP_ATOMIC: never sleeps; for callers outside a process context, e.g. interrupt handlers, task queues, kernel timers
- GFP_BUFFER: differs from GFP_KERNEL in that it makes fewer attempts to flush buffers; mainly used to avoid deadlocks when the I/O subsystem itself needs memory
- GFP_USER: low-priority GFP_KERNEL request
- GFP_HIGHUSER: low-priority GFP_KERNEL request for high memory
- __GFP_DMA: highly architecture dependent, memory usable in DMA requests
- __GFP_HIGHMEM: highly architecture dependent, used as part of GFP_HIGHUSER
Memory Allocation
Lookaside caches
- allocate many objects of the same size again and again
- kmem_cache_create(name, size, offset, flags, constr, destr):
  - name: association used in /proc/slabinfo
  - size: size of the object
  - offset: ensure a particular alignment
  - flags:
    - SLAB_NO_REAP: protect the cache from being reduced when the system looks for memory
    - SLAB_HWCACHE_ALIGN: align objects according to the hardware caches
    - SLAB_CACHE_DMA: allocate DMA-capable memory
  - constr(obj, cache, flags), destr(obj, cache, flags):
    - provide a constructor and destructor if necessary
    - constr is called with the SLAB_CTOR_CONSTRUCTOR flag set, so the same function can be used for both
    - may be called several times for one object
    - must not sleep if the SLAB_CTOR_ATOMIC flag is set
- kmem_cache_destroy(cache): removes the complete cache
  - only succeeds if all allocated objects have been returned
Memory Allocation
Lookaside caches
- kmem_cache_alloc(cache, flags):
  - get a cache object
  - may allocate new memory (flags as for kmalloc()) if no object is available within the cache
- kmem_cache_free(cache, obj):
  - frees an object
  - memory is not freed immediately, but may be reused for the next request
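The reuse behaviour of a lookaside cache can be shown with a free list in user space. This is a sketch of the idea only, not the slab allocator: struct model_cache and the model_cache_* functions are hypothetical, and malloc() stands in for the kernel's page allocation.

```c
#include <assert.h>
#include <stdlib.h>

struct model_cache {
    size_t obj_size;    /* size of every object in this cache */
    void *free_list;    /* singly linked list of returned objects */
};

/* Hand out a cached object if one is available, else allocate a new one. */
void *model_cache_alloc(struct model_cache *c)
{
    assert(c != NULL);
    if (c->free_list != NULL) {
        void *obj = c->free_list;
        c->free_list = *(void **)obj;   /* unlink the first free object */
        return obj;
    }
    /* Every object must be able to hold the free-list link. */
    return malloc(c->obj_size < sizeof(void *) ? sizeof(void *) : c->obj_size);
}

/* Return an object to the cache; the memory is kept for the next request. */
void model_cache_free(struct model_cache *c, void *obj)
{
    assert(c != NULL && obj != NULL);
    *(void **)obj = c->free_list;       /* link in front of the free list */
    c->free_list = obj;
}
```

An object that was just returned is handed out again by the next allocation, so no trip to the underlying allocator is needed.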
Memory Allocation
get_zeroed_page(flags), __get_free_page(flags), __get_free_pages(flags, order)
- allocate big, page-oriented chunks of memory
- flags: same as for kmalloc()
- GFP_ATOMIC: never sleeps
- GFP_KERNEL: may sleep until memory is available
- order: 2^order pages are allocated
free_page(addr), free_pages(addr, order)
- frees allocated pages
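Since the page allocator hands out 2^order pages, a driver has to round its buffer size up to the next power-of-two page count. A user-space sketch of that calculation, assuming a page size of 4096 bytes; MODEL_PAGE_SIZE and model_get_order() are names made up for this example:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed page size, architecture dependent */

/* Smallest order such that 2^order pages hold at least size bytes. */
int model_get_order(size_t size)
{
    assert(size > 0);
    size_t pages = (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
    int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

For example, a buffer of 5 pages needs order 3, so 8 pages are allocated and 3 of them stay unused.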
Memory Allocation
vmalloc(size), vfree(obj):
- allocates big chunks of memory, contiguous in kernel virtual address space
- modifies page tables; with __get_free_page() and kmalloc() the virtual-to-physical mapping is 1-to-1
- used for big kernel buffers
- can't be used for DMA
- uses GFP_KERNEL and therefore may sleep; thus not usable in interrupt handlers or task queues
- address range: VMALLOC_START to VMALLOC_END (<asm/pgtable.h>)
Memory Allocation
Boot-Time Allocation
- inflexible but least prone to failure
- can't be used by modules
- done before memory management starts
- needs a reboot to take effect
alloc_bootmem(size), alloc_bootmem_pages(size), alloc_bootmem_low(size), alloc_bootmem_low_pages(size):
- pages: allocate whole pages
- low: allocate below MAX_DMA_ADDRESS
Memory Allocation
Other:
- bigphysarea patch, see: http://www.polyware.nl/~middelink/En/hob-v4l.html
  - allocate memory via cmdline at startup
  - passed to the device driver module later
- Reserve Highmem addresses:
  - use the cmdline parameter mem=126M (if 128M are available)
  - supported by the standard kernel
  - allocator module on the O'Reilly ftp sites
- both methods need a reboot to adjust the memory size
Memory management
Address Spaces
- bus addresses (PCI)
- physical address space, initialization done by BIOS or kernel startup
- user virtual address space (one per process)
- kernel logical addresses: physical-virtual one-to-one mapping
- kernel virtual addresses:
  - address extension requires more than 32 bits
  - used with modules
- organisation unit: pages
Memory management
Memory map and struct page
- one struct page for each memory page
- contains a reference count
- wait queue of processes waiting for that page
- kernel virtual address of that page
- flags:
  - PG_locked: can't be swapped out
  - PG_reserved: may not be used
Memory management
Memory map and struct page, mm_struct and page tables
Memory management
Memory map and struct page, macros:
- virt_to_page(addr): find the struct page for a kernel logical address
- page_address(page): return the kernel virtual address of the page
- kmap(page), kunmap(page):
  - kernel logical address if low memory
  - kernel virtual address if high memory
  - limited number of mappings, release soon
  - may be mapped more than once
  - may sleep until a mapping is available
Memory management - User Space Interface
Virtual Memory Areas
- a homogeneous region in the virtual memory of a process
- user space API: mmap(), munmap()
- list of the VMAs of a process: /proc/<PID>/maps
- elements:
  - start, end: begin and end of the virtual address range
  - offset: offset within a file (page-oriented)
  - permissions: read, write, execute
  - vm_ops, vm_private_data: driver operations, driver data
  - major, minor, inode: file system information
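The VMA elements can be read back from user space by parsing one line of /proc/<PID>/maps. The following is a minimal sketch: struct vma_info and parse_maps_line() are hypothetical names, and the sample line in the usage note is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

struct vma_info {
    unsigned long start, end;   /* virtual address range */
    char perms[5];              /* e.g. "r-xp": read, write, execute, private/shared */
    unsigned long offset;       /* offset within the mapped file */
    unsigned int major, minor;  /* device numbers of the file system */
    unsigned long inode;        /* inode of the mapped file, 0 if none */
};

/* Parse one line of /proc/<PID>/maps; returns 0 on success, -1 on error. */
int parse_maps_line(const char *line, struct vma_info *v)
{
    assert(line != NULL && v != NULL);
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu",
                   &v->start, &v->end, v->perms, &v->offset,
                   &v->major, &v->minor, &v->inode);
    return n == 7 ? 0 : -1;
}
```

Fed with a line such as "08048000-0804c000 r-xp 00000000 03:01 12345 /bin/cat", the function fills in a 16 KB executable mapping at file offset 0; the trailing path name is ignored by this sketch.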
Memory management - User Space Interface
Virtual Memory Areas (<linux/mm.h>)
- vm_ops:
  - open(VMA): called when the VMA is inherited by a child process (clone(), fork()); ATTENTION: it is not called by the mmap() system call, the driver's mmap method has to call it manually
  - close(VMA): called when the VMA is destroyed, e.g. when the process exits or when munmap() unmaps the entire area
  - unmap(VMA): called by the kernel when parts of the VMA or the entire area are unmapped
  - sync(VMA): back end for the msync() system call
  - nopage(VMA): called when the process accesses a page of the area which is currently not in memory; return a struct page here
Memory management - User Space Interface
Virtual Memory Areas (<linux/mm.h>)
- vm_ops, VMA operations that are not driver specific:
  - swapout(VMA): the kernel wants to swap out that page; return 0 if okay, returning non-zero will send a SIGBUS to the process
  - protect(VMA): not used yet. Intention: change protection
  - wppage(VMA): not used yet. Intention: handle faults for write access to write-protected pages
Memory management - User Space Interface
mmap: Kernel vs. User function:
1. User: mmap(addr, size, prot, flags, fd, offset)
2. Kernel performs a good deal of work:
   - checks parameters
   - resolves the file descriptor
   - creates the VMA, without any active mapping
3. Kernel calls the driver: f_op->mmap(filp, vma)
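The user side of step 1 can be tried without any driver by requesting an anonymous mapping, where no file descriptor is resolved. This is a minimal sketch assuming Linux with glibc; map_anonymous() is a hypothetical helper name.

```c
#define _DEFAULT_SOURCE     /* makes glibc expose MAP_ANONYMOUS */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map size bytes of zero-filled, private memory.
 * addr = NULL lets the kernel choose the address; fd = -1 and
 * offset = 0 because no file backs an anonymous mapping.
 * Returns NULL on failure.
 */
void *map_anonymous(size_t size)
{
    assert(size > 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

For a device mapping the call looks the same, except that fd refers to the opened device file and the flags are typically MAP_SHARED; the kernel then ends up in the driver's mmap method.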
Memory management - change user space mapping
remap_page_range(virt_addr, phys_addr, size, prot):
- creates new page tables for the specified region
- virt_addr: user space address where the mapping begins
- phys_addr: the physical address to which virt_addr is mapped
- size: size of the region (same for virtual and physical address)
- prot:
  - protection of the new VMA (hint: vma->vm_page_prot)
  - caches: disable caches, architecture-dependent flags (see pgprot_noncached() in drivers/char/mem.c)
- does not need nopage from the VMA operations
Memory management - change user space mapping
VMA operation nopage(vma, addr, write_access):
- the kernel creates the VMA without any mapping for mmap()
- if the user accesses an unmapped page, nopage() is called
- nopage() has to map a single page
- nopage() has to take care of the page's reference counter (get_page())
- user space mremap() does not notify the driver when growing the region:
  - nopage() must implement growing regions of the VMA
- user space mremap() notifies the driver when shrinking the region:
  - calls the VMA operation unmap()
Memory management - change user space mapping
Examples:
- Remap nopage mappings: ldd->simple/simple.c
- Prevent extension of a mapping: sigbus on nopage.c
kiobuf Interface
- the other way round: the kernel maps a user space buffer
- avoids copying data:
  - non-kiobuf read/write: copies data to/from a kernel buffer
- independent of memory management, simplifies life greatly
- fast when reading the same data once (if reading twice, use kernel buffers)
- linux-2.4 lacks async I/O for raw I/O
- API: #include <linux/iobuf.h>
kiobuf Interface
Create/free kiobuf objects:
- kiobuf_init(kiobuf): initialize an already defined structure
- alloc_kiovec(nr, iovec): allocate nr kernel I/O buffers
- free_kiovec(nr, iovec): free nr kernel I/O buffers
Map/unmap user space into a kiobuf:
- map_user_kiobuf(rw, kiobuf, addr, size): registers all pages which belong to the user memory region in the kiobuf
- unmap_kiobuf(kiobuf): adjusts the reference count of each page
Example: mmap/kiobuf.c
DMA
DMA transfer - input modes:
- synchronous: user process calls read:
  1. driver allocates a buffer (use kiobuf)
  2. hardware writes data to the buffer and raises an interrupt when done
  3. interrupt handler acknowledges the IRQ and awakens the process
- asynchronous: data acquisition devices (NIC):
  1. hardware raises an IRQ to announce that new data has arrived
  2. interrupt handler allocates a buffer and tells the hardware where to transfer the data
  3. hardware transfers the data and raises another IRQ when it's done
  4. the handler dispatches the data, wakes relevant processes, takes care of housekeeping
- NICs often use ring buffers
DMA
DMA - memory
- physical address must be simultaneously available to the CPU and the hardware
- disable caches so that updates can be seen
- consistent DMA mappings:
  - allocated once, when loading the module
  - monopolize mapping registers of the hardware for a long time (even when they are not being used)
- streaming DMA mappings:
  - set up for a single operation
  - some hardware can optimize streaming DMA mappings
- have a look at other drivers
Network Devices
- Initialization
- Controlling transmission concurrency
- Transferring packets
- Changes in link state
- Custom ioctl
Network Devices - Initialization
register_netdev(netdev) is used with preset name/init; the other fields are filled in by init:
- name: interface name, use %d for auto numbering
- init(netdev): callback for initialization
- open(netdev), stop(netdev):
  - start/stop the interface
  - have to start/stop the packet queue of the interface: netif_start_queue(netdev), netif_stop_queue(netdev)
- set_config(netdev): change configuration parameters (ifconfig)
- do_ioctl(netdev): ioctl-like interface, set special parameters
- hard_start_xmit(sk_buff, netdev): send a single packet
- tx_timeout(netdev): packet transmission failed to complete within a reasonable period (driver missed an interrupt?); resume packet transmission
- get_stats(netdev): access statistical information of an interface
Network Devices - Initialization
register_netdev(netdev) used with preset name/init
- flags:
  - IFF_UP: interface is running
  - IFF_DEBUG: can be used for the verbosity of the driver's printk
  - IFF_NOARP: interface does not use ARP
  - IFF_PROMISC: by default NICs filter packets in hardware; receive all packets if set
- setting special fields of netdev (e.g. packet header handling):
  - ether_setup: Ethernet devices
  - ltalk_setup: LocalTalk devices
  - fc_setup: Fibre Channel devices
  - fddi_setup: FDDI
  - hippi_setup: HIPPI
  - tr_configure: Token Ring; attention: there is a tr_setup() doing nothing in 2.4!
Network Devices
Socket buffer elements <linux/skbuff.h>:
- skb records:
  - dev: the device a buffer came from or is sent to
  - protocol: protocol which is used, e.g. Ethernet
  - head: beginning of allocated space
  - data: beginning of valid octets
  - tail: end of valid octets
  - end: end of allocated space
  - len: tail - data
Network Devices
Handling of socket buffers <linux/skbuff.h>:
- dev_alloc_skb(len), dev_kfree_skb(skb): allocate/free a socket buffer in driver context
- alloc_skb(len, prio), kfree_skb(skb): internal kernel functions; use the dev_xxx functions in a driver
- socket buffers are allocated in DMA-capable memory
- skb_put(skb, len), __skb_put(skb, len): add data at the end
  - adjust the tail and len records of the skb
  - return the address at which to put len bytes
  - __skb_put() omits the check whether the buffer is full
- skb_push(skb, len), __skb_push(skb, len): like the put methods, but add data at the beginning
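How the head, data, tail and end pointers move can be modelled with a few lines of user-space C. This is a sketch, not the kernel implementation: struct model_skb and the model_skb_* functions are hypothetical, and they return NULL when there is no room, whereas the checking kernel functions treat that as a fatal driver bug. model_skb_reserve() models reserving headroom before any data is added, which is an assumption beyond the functions listed here.

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
    unsigned char *head;    /* beginning of allocated space */
    unsigned char *data;    /* beginning of valid octets */
    unsigned char *tail;    /* end of valid octets */
    unsigned char *end;     /* end of allocated space */
    size_t len;             /* tail - data */
};

/* Start with an empty buffer: no valid octets, no headroom. */
void model_skb_init(struct model_skb *skb, unsigned char *buf, size_t size)
{
    assert(skb != NULL && buf != NULL);
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Leave n bytes of headroom; only valid while the buffer is empty. */
void model_skb_reserve(struct model_skb *skb, size_t n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes: returns where to write them, or NULL if the tailroom is too small. */
unsigned char *model_skb_put(struct model_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes: returns the new data pointer, or NULL if the headroom is too small. */
unsigned char *model_skb_push(struct model_skb *skb, size_t n)
{
    if ((size_t)(skb->data - skb->head) < n)
        return NULL;
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```

Reserving headroom first lets a lower protocol layer prepend its header later without copying the payload.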
Network Devices
Transmitting packets:
- API: hard_start_xmit(sk_buff, netdev)
- Problem: hard_start_xmit() is called several times, but the driver has a limited amount of memory
- Solution:
  1. call netif_stop_queue(): stops the kernel from calling hard_start_xmit()
  2. the kernel keeps moving packets into the queue
  3. call netif_wake_queue(): enables the kernel to call hard_start_xmit() again and send a packet waiting in the queue
- hard_start_xmit() is protected by a spinlock, so only one call runs at a time
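The stop/wake protocol can be modelled in user space with a small ring of transmit slots. This is a sketch under assumptions: struct model_nic, model_xmit() and model_tx_done() are hypothetical, TX_RING_SIZE is an arbitrary slot count, and the queue_stopped flag stands in for the effect of netif_stop_queue() and netif_wake_queue().

```c
#include <assert.h>
#include <stddef.h>

#define TX_RING_SIZE 4  /* assumed number of hardware transmit slots */

struct model_nic {
    int ring[TX_RING_SIZE]; /* packet ids handed to the "hardware" */
    size_t head;            /* next free slot */
    size_t tail;            /* oldest slot still in flight */
    size_t in_flight;       /* number of occupied slots */
    int queue_stopped;      /* 1 while the kernel must not call xmit */
};

/* Models hard_start_xmit(): returns 0, or -1 if called while stopped. */
int model_xmit(struct model_nic *nic, int packet_id)
{
    assert(nic != NULL);
    if (nic->queue_stopped)
        return -1;
    nic->ring[nic->head] = packet_id;
    nic->head = (nic->head + 1) % TX_RING_SIZE;
    if (++nic->in_flight == TX_RING_SIZE)
        nic->queue_stopped = 1;     /* models netif_stop_queue() */
    return 0;
}

/* Models the transmit-done interrupt: frees the oldest slot, returns its id or -1. */
int model_tx_done(struct model_nic *nic)
{
    assert(nic != NULL);
    if (nic->in_flight == 0)
        return -1;
    int id = nic->ring[nic->tail];
    nic->tail = (nic->tail + 1) % TX_RING_SIZE;
    nic->in_flight--;
    nic->queue_stopped = 0;         /* models netif_wake_queue() */
    return id;
}
```

The queue is stopped as soon as the last slot is taken, not when the next packet arrives, so the transmit function never has to reject a packet.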
Network Devices
Receiving packets without DMA:
1. allocate a socket buffer: dev_alloc_skb(len)
2. copy the received bytes to the socket buffer
3. fill in metadata: dev, protocol
4. update statistics
5. call netif_rx(skb)
DMA-based method:
1. pre-allocate the socket buffer: dev_alloc_skb(len)
2. let the hardware copy the data directly to the socket buffer
3. fill in metadata: dev, protocol
4. update statistics
5. call netif_rx(skb)
Example: snull/snull.c
Question & Answer