vhost_net: a kernel-level virtio server
Michael S. Tsirkin authored

What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
  migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device.  Backend is also configured by userspace, including vlan/mac
etc.

Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.

Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use

Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item.  The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.

(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: default avatarRusty Russell <rusty@rustcorp.com.au>
Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
Acked-by: default avatar"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
3a4d5c94
Name Last commit Last update
..
accessibility drop explicit include of autoconf.h
acpi Merge branch 'misc-2.6.33' into release
amba Merge branches 'arm', 'at91', 'bcmring', 'ep93xx', 'mach-types', 'misc' and 'w90x900' into devel
ata pata_bf54x: handle portmuxing of pins through GPIO PORTs
atm drivers/atm/lanai.c: use %pM to show MAC address
auxdisplay auxdisplay: remove PARPORT dependency
base devtmpfs: unlock mutex in case of string allocation error
block Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block
bluetooth Bluetooth: Prevent ill-timed autosuspend in USB driver
cdrom Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
char kfifo: fix warn_unused_result
clocksource cs5535: add a generic clock event MFGPT driver
connector connector: Fix incompatible pointer type warning
cpufreq Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
cpuidle drivers/cpuidle: Move dereference after NULL test
crypto Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
dca dca: module load should not be an error message
dio m68k: don't export static inline functions
dma Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
edac Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
eisa Merge branch 'akpm'
firewire Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6
firmware drivers/firmware/iscsi_ibft.c: use %pM to show MAC address
gpio gpiolib: add support for changing value polarity in sysfs
gpu Merge branch 'drm-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
hid
hwmon
i2c
ide
idle
ieee1394
ieee802154
infiniband
input
isdn
leds
lguest
macintosh
mca
md
media
memstick
message
mfd
misc
mmc
mtd
net
nubus
of
oprofile
parisc
parport
pci
pcmcia
platform
pnp
power
pps
ps3
rapidio
regulator
rtc
s390
sbus
scsi
serial
sfi
sh
sn
spi
ssb
staging
tc
telephony
thermal
uio
usb
uwb
vhost
video
virtio
vlynq
w1
watchdog
xen
zorro
Kconfig
Makefile