git.camperquake.de Git

Update 'zfs.sh -u' to umount all zfs filesystems

Before it is safe to unload the zfs module stack all mounted
zfs filesystems must be unmounted. If they are not unmounted,
there will be references held on the modules and the stack cannot
be removed. To handle this have 'zfs.sh -u' which is used by all
of the test scripts umount all zfs filesystem before attempting
to unload the module stack.

Suppress share error on mount

Until code is added to support automatically sharing datasets
we should return success instead of failure. This prevents the
command line tools from returning a non-zero error code. While
a user likely won't notice this, test scripts like zconfig.sh
do and correctly fail because of it.

Add get/setattr, get/setxattr hooks

While the attr/xattr hooks were already in place for regular
files this hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported. This patch registers these additional callbacks
for both directory and special inode types.

Fix FIFO and socket handling

Under Linux when creating a fifo or socket type device in the ZFS
filesystem it's critical that the rdev is stored in a SA. This
was already being correctly done for character and block devices,
but that logic needed to be extended to include FIFOs and sockets.

This patch takes care of device creation but a follow on patch
may still be required to verify that the dev_t is being correctly
packed/unpacked from the SA.

Create minors for all zvols

It was noticed that when you have zvols in multiple datasets
not all of the zvol devices are created at module load time.
Fajarnugraha did the leg work to identify that the root cause of
this bug is a non-zero return value from zvol_create_minors_cb().

Returning a non-zero value from the dmu_objset_find_spa() callback
function results in aborting processing the remaining children in
a dataset. Since we want to ensure that the callback in run on
all children regardless of error simply unconditionally return
zero from the zvol_create_minors_cb(). This callback function
is solely used for this purpose so surpressing the error is safe.

Closes #96

Linux 2.6.36 compat, sops->evict_inode()

The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback. It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.

Linux 2.6.33 compat, get/set xattr callbacks

The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods. The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added. The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.

Linux 2.6.35 compat, fops->fsync()

The fsync() callback in the file_operations structure used to take
3 arguments.  The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers.  To
handle this a compatibility prototype was added to ensure the right
prototype is used.  Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.

Linux 2.6.35 compat, const struct xattr_handler

The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure. To handle this we define an
appropriate xattr_handler_t typedef which can be used. This was
the preferred solution because it keeps the code clean and readable.

Prefer /lib/modules/$(uname -r)/ links

Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links. Only if neither of these
links exist fallback to alternate methods for deducing which
kernel to build with. This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.

MS_DIRSYNC and MS_REC compat

It turns out that older versions of the glibc headers do not
properly define MS_DIRSYNC despite it being explicitly mentioned
in the man pages. They instead call it S_WRITE, so for system
where this is not correct defined map MS_DIRSYNC to S_WRITE.
At the time of this commit both Ubuntu Lucid, and Debian Squeeze
both use the out of date glibc headers.

As for MS_REC this field is also not available in the older headers.
Since there is no obvious mapping in this case we simply disable
the recursive mount option which used it.

Add missing -ldl linker option

The inclusion on dlsym(), dlopen(), and dlclose() symbols require
us to link against the dl library. Be careful to add the flag to
both the libzfs library and the commands which depend on the library.

Update AUTHORS file

This file has gotten stale and needed to be updated.  There are
individuals who deserve to be recognized for their contributions
to the project.  I've done my best to assemble names from the
commit logs of those who have submitted patches.  This list may
not be comprehensive, if you feel I've overlooked your contribution
please let me know and we can get your name added.

Use 'noop' IO Scheduler

Initial testing has shown the the right IO scheduler to use under Linux
is noop.  This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization.  While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device.  This yields the largest possible requests for the
device with the lowest total overhead.

While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option.  You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory.  In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.

Suppress large kmem_alloc() warning

The following warning was observed under normal operation.  It's
not fatal but it's something to be addressed long term.  Flag the
offending allocation with KM_NODEBUG to suppress the warning and
flag the call site.

SPL: Showing stack for process 21761
Pid: 21761, comm: iozone Tainted: P           ----------------
2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
[<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl]
[<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl]
[<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs]
[<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs]
[<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs]
[<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs]
[<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs]
[<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs]
[<ffffffff8116d905>] vfs_read+0xb5/0x1a0
[<ffffffff8116da41>] sys_read+0x51/0x90
[<ffffffff81013172>] system_call_fastpath+0x16/0x1b

Update META to 0.6.0

Roll the version forward to 0.6.0, the addition of the Posix
layer warrents updating the major version number.

Invalidate dcache and inode cache

When performing a 'zfs rollback' it's critical to invalidate
the previous dcache and inode cache. If we don't there will
stale cache entries which when accessed will result in EIOs.

Remove useless libefi warnings

These two warnings in libefi serve no real purpose. When running
without DEBUG they are already supressed, and even when DEBUG is
enabled all they indicate is the device doesn't already have an
EFI label. For a Linux machine this is probably the common case.

Move cv_destroy() outside zp->z_range_lock()

With the recent SPL change (d599e4fa) that forces cv_destroy()
to block until all waiters have been woken.  It is now unsafe
to call cv_destroy() under the zp->z_range_lock() because it
is used as the condition variable mutex.  If there are waiters
cv_destroy() will block until they wake up and aquire the mutex.
However, they will never aquire the mutex because cv_destroy()
will not return allowing it's caller to drop the lock.  Deadlock.

To avoid this cv_destroy() is now run asynchronously in a taskq.
This solves two problems:

1) It is no longer run under the zp->z_range_lock so no deadlock.
2) Since cv_destroy() may now block we don't want this slowing
   down zfs_range_unlock() and throttling the system.

This was not as much of an issue under OpenSolaris because their
cv_destroy() implementation does not do anything.  They do however
risk a bad paging request if cv_destroy() returns, the memory holding
the condition variable is free'd, and then the waiters wake up and
try to reference it.  It's a very small unlikely race, but it is
possible.

Add mmap(2) support

It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.

The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC.  This has been shown to work
well for the common read(2)/write(2) case.  However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache.  To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache.  The code is careful
to keep both copies synchronized.

When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated.  For a read(2) data will be read first from the page
cache then the ARC if needed.  Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.

New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region.  These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage().  This will occur due to either a sync or the usual
page aging behavior.  Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.

While this implementation ensures correct behavior it does have
have some drawbacks.  The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files.  It also adds additional complexity to the code keeping
both caches synchronized.

Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers.  The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index.  The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both.  It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.

Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy.  However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.

Add Hooks for Linux Xattr Operations

The Linux specific xattr operations have all been located in the
file zpl_xattr.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

Add Hooks for Linux Super Block Operations

The Linux specific super block operations have all been located in the
file zpl_super.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

Add Hooks for Linux Inode Operations

The Linux specific inode operations have all been located in the
file zpl_inode.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

Add Hooks for Linux File Operations

The Linux specific file operations have all been located in the
file zpl_file.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.

This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks.  In also
adds all the new zpl_* file to the Makefile.in.  This is not a
standalone commit, you required the following zpl_* commits.

Wrap with HAVE_XVATTR

For the moment exactly how to handle xvattr is not clear. This
change largely consists of the code to comment out the offending
bits until something reasonable can be done.

Add zp->z_is_zvol flag

A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset. This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.

Prototype/structure update for Linux

I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it.  As such I'll just summerize the intent
of the changes which are all (or partly) in this commit.  Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants.  More specifically:

1) Replace all instances of zfsvfs_t with zfs_sb_t.  While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this.  While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.

2) Replace vnode_t with struct inode.  The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one.  There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date.  The code has been updated
to remove all need for this type.

3) Replace all vtype_t's with umode types.  Along with this shift
all uses of types to mode bits.  The Solaris code would pass a
vtype which is redundant with the Linux mode.  Just update all the
code to use the Linux mode macros and remove this redundancy.

4) Remove using of vn_* helpers and replace where needed with
inode helpers.  The big example here is creating iput_aync to
replace vn_rele_async.  Other vn helpers will be addressed as
needed but they should be be emulated.  They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.

5) Update znode alloc/free code.  Under Linux it's common to
embed the inode specific data with the inode itself.  This removes
the need for an extra memory allocation.  In zfs this information
is called a znode and it now embeds the inode with it.  Allocators
have been updated accordingly.

6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.

This will be the first and last of these to large to review commits.

Remove dmu_write_pages() support

For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object. It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.

Create a root znode without VFS dependencies

For portability reasons it's handy to be able to create a root
znode and basic filesystem components without requiring the full
cooperation of the VFS. We are committing to this to simply the
filesystem creations code.

Remove zfs_ctldir.[ch]

This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement. These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.

Disable fuid features

These features should probably be enabled in the Linux zpl code.
For now I'm disabling them until it's clear what needs to be done.

Disable zfs_sync during oops/panic

Minor update to ensure zfs_sync() is disabled if a kernel oops/panic
is triggered. As the comment says 'data integrity is job one'. This
change could have been done by defining panicstr to oops_in_progress
in the SPL. But I felt it was better to use the native Linux API
here since to be clear.

Disable Shutdown/Reboot

This support has been disable with HAVE_SHUTDOWN. We can support
this at some point by adding the needed reboot notifiers.

Remove SYNC_ATTR check

This flag does not need to be support under Linux. As the comment
says it was only there to support fsflush() for old filesystem like
UFS. This is not needed under Linux.

Remove mount options

Mount option parsing is still very Linux specific and will be
handled above this zfs filesystem layer. Honoring those mount
options once set if of course the responsibility of the lower
layers.

Remove zfs_active_fs_count

This variable was used to ensure that the ZFS module is never
removed while the filesystem is mounted. Once again the generic
Linux VFS handles this case for us so it can be removed.

Remove unused mount functions

The functions zfs_mount_label_policy(), zfs_mountroot(), zfs_mount()
will not be needed because most of what they do is already handled
by the generic Linux VFS layer. They all call zfs_domount() which
creates the actual dataset, the caller of this library call which
will be in the zpl layer is responsible for what's left.

Remove zfs_major/zfs_minor/zfsfstype

Under Linux we don't need to reserve a major or minor number for
the filesystem.  We can rely on the VFS to handle colisions without
this being handled by the lower ZFS layers.

Additionally, there is no need to keep a zfsfstype around.  We are
not limited on Linux by the OpenSolaris infrastructure which needed
this.  The upper zpl layer can specify the filesystem type.

Remove Solaris VFS Hooks

The ZFS code is being restructured to act as a library and a stand
alone module. This allows us to leverage most of the existing code
with minimal modification. It also means we need to drop the Solaris
vfs/vnode functions they will be replaced by Linux equivilants and
updated to be Linux friendly.

VFS: Add zfs_inode_update() helper

For the moment we have left ZFS unchanged and it updates many values
as part of the znode.  However, some of these values should be set
in the inode.  For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.

This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode.  At
which point zfs_update_inode() can be dropped entirely.  Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.

VFS: Integrate zfs_znode_alloc()

Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure.  This differs
significantly from Solaris.

Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS.  This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode().  This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.

Enable zfs_znode compilation

Basic compilation of the bulk of zfs_znode.c has been enabled.  After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces.  The following commits
will systematically replace update the requiter interfaces.  There
are of course pros and cons to this decision.

Pros:
* This simplifies intergration with Linux in the long term.  There is
  no longer any need to manage vnodes which are a foreign concept to
  the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.

Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.

ACL related changes

A small collection of ACL related changes related to not
supporting fuid mapping. This whole are will need to be
closely investigated.

Init/destroy tsd

Add missing tsd_destroy() call for rrw_tsd_key to avoid a leak.

Add Linux Compat Infrastructure

Lay the initial ground work for a include/linux/ compatibility
directory.  This was less critical in the past because the bulk
of the ZFS code consumes the Solaris API via the SPL.  This API
was stable and the bulk Linux API differences were handled in
the SPL.

However, with the addition of a full Posix layer written directly
against the Linux APIs we are going to need more compatibility
code.  It makes sense that all this code should be cleanly located
in one place.  Subsequent patches should move the existing zvol
and vdev_disk compatibility code in to this directory.

Replace VOP_* calls with direct zfs_* calls

These generic Solaris wrappers are no longer required. Simply
directly call the correct zfs functions for clarity.

Add basic uio support

This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux. While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep. This code is expected to be short lived and
removed as soon as all the remaining uio based APIs and updated.

Add trivial acl helpers

The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common().  Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static function to this file.  This way I don't
need to reimplement this functionality from scratch in the SPL.

Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed.  If
that turns out to be the case they can then be removed.

Remove dead ACL code

The following code was unused which caused gcc to complain.
Since it was deadcode it has simply been removed.

Remove zfs_parse_bootfs() support

Remove unneeded bootfs functions. This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.

VFS: Wrap with HAVE_SHARE

Certain NFS/SMB share functionality is not yet in place.  These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled.  I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL.  So I'm just
renaming the wrapper here to HAVE_SHARE.  They still won't be
compiled until all the share issues are worked through.  Share
support is the last missing piece from zfs_ioctl.c.

Wrap with HAVE_MLSLABEL

The zfs_check_global_label() function is part of the HAVE_MLSLABEL
support which was previously commented out by a HAVE_ZPL check.
Since we're still deciding what to do about mls labels wrap it
with the preexisting macro to keep it compiled out.

Remove znode move functionality

Unlike Solaris the Linux implementation embeds the inode in the
znode, and has no use for a vnode. So while it's true that fragmention
of the znode cache may occur it should not be worse than any of the
other Linux FS inode caches. Until proven that this is a problem it's
just added complexity we don't need.

Conserve stack in zfs_mkdir()

Move the sa_attrs array from the stack to the heap to minimize stack
space usage.

Conserve stack in zfs_sa_upgrade()

As always under Linux stack space is at a premium. Relocate two
20 element sa_bulk_attr_t arrays in zfs_sa_upgrade() from the stack
to the heap.

Export required vfs/vn symbols

Add HAVE_SCANSTAMP

This functionality is not supported under Linux, perhaps it
will be some day if it's decided it's useful.

Add initial rw_uio functions to the dmu

These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios. However, this
patch readds then with they dea they can just be reworked and
the uio bits dropped.

Remove HAVE_ZPL from commands and libraries

Thanks to the previous few commits we can now build all of the
user space commands and libraries with support for the zpl.

Documentation updates

Minor Linux specific documentation updates to the comments and
man pages.

Minimal libshare infrastructure

ZFS even under Solaris does not strictly require libshare to be
available.  The current implementation attempts to dlopen() the
library to access the needed symbols.  If this fails libshare
support is simply disabled.

This means that on Linux we only need the most minimal libshare
implementation.  In fact just enough to prevent the build from
failing.  Longer term we can decide if we want to implement a
libshare library like Solaris.  At best this would be an abstraction
layer between ZFS and NFS/SMB.  Alternately, we can drop libshare
entirely and directly integrate ZFS with Linux's NFS/SMB.

Finally the bare bones user-libshare.m4 test was dropped.  If we
do decide to implement libshare at some point it will surely be
as part of this package so the check is not needed.

Add 'zfs mount' support

By design the zfs utility is supposed to handle mounting and unmounting
a zfs filesystem.  We could allow zfs to do this directly.  There are
system calls available to mount/umount a filesystem.  And there are
library calls available to manipulate /etc/mtab.  But there are a
couple very good reasons not to take this appraoch... for now.

Instead of directly calling the system and library calls to (u)mount
the filesystem we fork and exec a (u)mount process.  The principle
reason for this is to delegate the responsibility for locking and
updating /etc/mtab to (u)mount(8).  This ensures maximum portability
and ensures the right locking scheme for your version of (u)mount
will be used.  If we didn't do this we would have to resort to an
autoconf test to determine what locking mechanism is used.

The downside to using mount(8) instead of mount(2) is that we lose
the exact errno which was returned by the kernel.  The return code
from mount(8) provides some insight in to what went wrong but it
not quite as good.  For the moment this is translated as a best
guess in to a errno for the higher layers of zfs.

In the long term a shared library called libmount is under development
which provides a common API to address the locking and errno issues.
Once the standard mount utility has been updated to use this library
we can then leverage it.  Until then this is the only safe solution.

  http://www.kernel.org/pub/linux/utils/util-linux/libmount-docs/index.html

Open up libzfs_run_process/libzfs_load_module

Recently helper functions were added to libzfs_util to load a kernel
module or execute a process. Initially this functionality was limited
to libzfs but it has become clear there will be other consumers. This
change opens up the interface so it may be used where appropriate.

Disable umount.zfs helper

For the moment, the only advantage in registering a umount helper
would be to automatically unshare a zfs filesystem.  Since under
Linux this would be unexpected (but nice) behavior there is no
harm in disabling it.

This is desirable because the 'zfs unmount' path invokes the system
umount.  This is done to ensure correct mtab locking but has the
side effect that the umount.zfs helper would be called if it exists.
By default this helper calls back in to zfs to do the unmount on
Solaris which we don't want under Linux.

Once libmount is available and we have a safe way to correctly
lock and update the /etc/mtab file we can reconsider the need
for a umount helper.  Using libmount is the prefered solution.

Enable mount.zfs helper

While not strictly required to mount a zfs filesystem using a
mount helper has certain advantages.

First, we need it if we want to honor the mount behavior as found
on Solaris.  As part of the mount we need to validate that the
dataset has the legacy mount property set if we are using 'mount'
instead of 'zfs mount'.

Secondly, by using a mount helper we can automatically load the
zpl kernel module.  This way you can just issue a 'mount' or
'zfs mount' and it will just work.

Finally, it gives us common hook in user space to add any zfs
specific mount options we might want.  At the moment we don't
have any but now the infrastructure is at least in place.

Autoconf selinux support

If libselinux is detected on your system at configure time link
against it.  This allows us to use a library call to detect if
selinux is enabled and if it is to pass the mount option:

  "context=\"system_u:object_r:file_t:s0"

For now this is required because none of the existing selinux
policies are aware of the zfs filesystem type.  Because of this
they do not properly enable xattr based labeling even though
zfs supports all of the required hooks.

Until distro's add zfs as a known xattr friendly fs type we
must use mntpoint labeling.  Alternately, end users could modify
their existing selinux policy with a little guidance.

Fix ZVOL rename minor devices

During a rename we need to be careful to destroy and create a
new minor for the ZVOL _only_ if the rename succeeded. The previous
code would both destroy you minor device unconditionally, it would
also fail to create the new minor device on success.

Fix minor compiler warnings

These compiler warnings were introduced when code which was
previously #ifdef'ed out by HAVE_ZPL was re-added for use
by the posix layer. All of the following changes should be
obviously correct and will cause no semantic changes.

Add missing mkdirp prototype

For while now mkdirp has been built as part of libspl however
the protoype was never added to libgen.h. This went unnoticed
until enabling the mount support which uses mkdirp().

Use cv_timedwait_interruptible in arc

The issue is that cv_timedwait() sleeps uninterruptibly to block signals
and avoid waking up early.  Under Linux this counts against the load
average keeping it artificially high.  This change allows the arc to
sleep interruptibly which mean it may be woken up early due to a signal.

Normally this means some extra care must be taken to handle a potential
signal.  But for the arcs usage of cv_timedwait() there is no harm in
waking up before the timeout expires so no extra handling is required.

Fix block device-related issues in zdb.

Specifically, this fixes the two following errors in zdb when a pool
is composed of block devices:

1) 'Value too large for defined data type' when running 'zdb <dataset>'.
2) 'character device required' when running 'zdb -l <block-device>'.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Enable rrwlock.c compilation

With the addition of the thread specific data interfaces to the
SPL it is safe to enable compilation of the re-enterant read
reader/writer locks.

Refresh autogen.sh products

Refresh the autogen.sh products based on the versions which are
installed by default in the GA RHEL6.0 release.

autoconf (GNU Autoconf) 2.63
automake (GNU automake) 1.11.1
ltmain.sh (GNU libtool) 2.2.6b

Remove partition from vdev name in zfault.sh

As of the 0.5.2 tag, names of whole-disk vdevs must be specified to
the command line tools without partition identifiers. This commit
fixes a 'zpool online' command in zfault.sh that incorrectly includes
he partition in the vdev name, causing test 9 to fail.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Skip /dev/hpet during 'zpool import'

If libblkid does not contain ZFS support, then 'zpool import' will scan
all block devices in /dev/ to determine which ones are components of a
ZFS filesystem.  It does this by opening all the devices and stat'ing
them to determine which ones are block devices.  If the device turns
out not to be a block device it is skipped.

Usually, this whole process is pretty harmless (although slow).  But
there are certain devices in /dev/ which must be handled in a very
specific way or your system may crash.  For example, if /dev/watchdog
is simply opened the watchdog timer will be started and your system
will panic when the timer expires.

It turns out the /dev/hpet causes similiar problems although only when
accessed under a virtual machine.  For some reason accessing /dev/hpet
causes qemu to crash.  To address this issue this commit adds /dev/hpet
to the device blacklist, it will be skipped solely based on its name.

Add '-ts' options to zconfig.sh/zfault.sh usage

When adding this functionality originally the options to only
run specific tests (-t), or conversely skip specific tests (-s)
were omitted from the usage page. This commit adds the missing
documentation.

Remove spl/zfs modules as part of cleanup

The idea behind the '-c' flag is to cleanup everything from a
previous test run which might cause the test script to fail.
This should also include removing the previously loaded module.
This makes it a little easier to run 'zconfig.sh -c', however
remember this is a test script and it will take all of your
other zpools offline for the purposes of the test. This notion
has also been extended to the default 'make check' behavior.

Unconditionally load core kernel modules

Loading and unloading the zlib modules as part of the zfs.sh
script has proven a little problematic for a few reasons.

  * First, your kernel may not need to load either zlib_inflate
    or zlib_deflate.  This functionality may be built directly in
    to your kernel.  It depends entirely on what your distribution
    decided was the right thing to do.

  * Second, even if you do manage to load the correct modules you
    may not be able to unload them.  There may other consumers
    of the modules with a reference preventing the unload.

To avoid both of these issues the test scripts have been updated to
attempt to unconditionally load all modules listed in KERNEL_MODULES.
If the module is successfully loaded you must have needed it. If
the module can't be loaded that almost certainly means either it is
built in to your kernel or is already being used by another consumer.
In both cases this is not an issue and we can move on to the spl/zfs
modules.

Finally, by removing these kernel modules from the MODULES list
we ensure they are never unloaded during 'zfs.sh -u'.  This avoids
the issue of the script failing because there is another consumer
using the module we were not aware of.  In other words the script
restricts unloading modules to only the spl/zfs modules.

Closes #78

Fix for access beyond end of device error

This commit fixes a sign extension bug affecting l2arc devices.  Extremely
large offsets may be passed down to the low level block device driver on
reads, generating errors similar to

    attempt to access beyond end of device
    sdbi1: rw=14, want=36028797014862705, limit=125026959

The unwanted sign extension occurrs because the function arc_read_nolock()
stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel.
This offset is then passed to zio_read_phys() as a uint64_t argument, causing
sign extension for values of 0x80000000 or greater.  To avoid this, we store
the offset in a uint64_t.

This change also changes a few daddr_t struct members to uint64_t in the libspl
headers to avoid similar bugs cropping up in the future.  We also add an ASSERT
to __vdev_disk_physio() to check for invalid offsets.

Closes #66
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Linux 2.6.36 compat, use fops->unlocked_ioctl()

As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the ZFS code only contains support
back to 2.6.18 vintage kernels, I'm not adding an autoconf check
for this and simply moving everything to use fops->unlocked_ioctl().

Linux 2.6.36 compat, blk_* macros removed

Most of the blk_* macros were removed in 2.6.36.  Ostensibly this was
done to improve readability and allow easier grepping.  However, from
a portability stand point the macros are helpful.  Therefore the needed
macros are redefined here if they are missing from the kernel.

Linux 2.6.36 compat, synchronous bio flag

The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags.  The new flag is called REQ_SYNC.  To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function.  Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.

Preferred interface for flagging a synchronous bio:
  2.6.12-2.6.29: BIO_RW_SYNC
  2.6.30-2.6.35: BIO_RW_SYNCIO
  2.6.36-2.6.xx: REQ_SYNC

Linux 2.6.36 compat, use REQ_FAILFAST_MASK

As of linux-2.6.36 the BIO_RW_FAILFAST and REQ_FAILFAST flags
have been unified under the REQ_* names.  These flags always had
to be kept in-sync so this is a nice step forward, unfortunately
it means we need to be careful to only use the new unified flags
when the BIO_RW_* flags are not defined.  Additional autoconf
checks were added for this and if it is ever unclear which method
to use no flags are set.  This is safe but may result in longer
delays before a disk is failed.

Perferred interface for setting FAILFAST on a bio:
  2.6.12-2.6.27: BIO_RW_FAILFAST
  2.6.28-2.6.35: BIO_RW_FAILFAST_{DEV|TRANSPORT|DRIVER}
  2.6.36-2.6.xx: REQ_FAILFAST_{DEV|TRANSPORT|DRIVER}

Remove inconsistent use of EOPNOTSUPP

Commit 3ee56c292bbcd7e6b26e3c2ad8f0e50eee236bcc changed an ENOTSUP return value
in one location to ENOTSUPP to fix user programs seeing an invalid ioctl()
error code. However, use of ENOTSUP is widespread in the zfs module. Instead
of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be
consistent with user space. The changed return value in the above commit is
therefore no longer needed, so this commit reverses it to maintain consistency.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add lustre zpios-test workload

The lustre zpios-test simulates a reasonable lustre workload.  It will
create 128 threads, the same as a Lustre OSS, and then 4096 individual
objects.  Each objects is 16MiB in size and will be written/read in 1MiB
from a random thread.  This is fundamentally how we expect Lustre to behave
for large IO intensive workloads.

Prep for 0.5.2 tag

Update META file to prep for 0.5.2 tag.

Replace custom zpool configs with generic configs

To streamline testing I have in the past added several custom configs
to the zpool-config directory.  This change reverts those custom configs
and replaces them with three generic config which can do the same thing.
The generic config behavior can be set by setting various environment
variables when calling either the zpool-create.sh or zpios.sh scripts.

For example if you wanted to create and test a single 4-disk Raid-Z2
configuration using disks [A-D]1 with dedicated ZIL and L2ARC devices
you could run the following.

$ ZIL="log A2" L2ARC="cache B2" RANKS=1 CHANNELS=4 LEVEL=2 \
  zpool-create.sh -c zpool-raidz

$ zpool status tank
  pool: tank
state: ONLINE
scan: none requested
config:

      NAME        STATE     READ WRITE CKSUM
      tank        ONLINE       0     0     0
        raidz2-0  ONLINE       0     0     0
          A1      ONLINE       0     0     0
          B1      ONLINE       0     0     0
          C1      ONLINE       0     0     0
          D1      ONLINE       0     0     0
      logs
        A2        ONLINE       0     0     0
      cache
        B2        ONLINE       0     0     0

errors: No known data errors

Make rollbacks fail gracefully

Support for rolling back datasets require a functional ZPL, which we currently
do not have.  The zfs command does not check for ZPL support before attempting
a rollback, and in preparation for rolling back a zvol it removes the minor
node of the device.  To prevent the zvol device node from disappearing after a
failed rollback operation, this change wraps the zfs_do_rollback() function in
an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL.  This is
consistent with the behavior of other ZPL dependent commands such as mount.

The orginal error message observed with this bug was rather confusing:

    internal error: Unknown error 524
    Aborted

This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but
Linux actually has no such error code.  It should instead return EOPNOTSUPP, as
that is how ENOTSUP is defined in user space.  With that we would have gotten
the somewhat more helpful message

    cannot rollback 'tank/fish': unsupported version

This is rather a moot point with the above changes since we will no longer make
that ioctl call without a ZPL.  But, this change updates the error code just in
case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Increate zio write interrupt thread count.

Increasing the default zio_wr_int thread count from 8 to 16 improves
write performence by 13% on large systems. More testing need to be
done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum
of 8 threads.

Shorten zio_* thread names

Linux kernel thread names are expected to be short. This change shortens
the zio thread names to 10 characters leaving a few chracters to append
the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.

Fix panic mounting unformatted zvol

On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL
file pointer if the user tries to mount a zvol without a filesystem on it.
This change adds checks to prevent a null pointer dereference.

Closes #73.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Call modprobe with absolute path

Some sudo configurations may not include /sbin in the PATH.
libzfs_load_module() currently does not call modprobe with an absolute path, so
it may fail under such configurations if called under sudo. This change adds
the absolute path to modprobe so we no longer rely on how PATH is set.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix intermittent 'zpool add' failures

Creating whole-disk vdevs can intermittently fail if a udev-managed symlink to
the disk partition is already in place.  To avoid this, we now remove any such
symlink before partitioning the disk.  This makes zpool_label_disk_wait() truly
wait for the new link to show up instead of returning if it finds an old link
still in place.  Otherwise there is a window between when udev deletes and
recreates the link during which access attempts will fail with ENOENT.

Also, clean up a comment about waiting for udev to create symlinks.  It no
longer needs to describe the special cases for the link names, since that is
now handled in a separate helper function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add zconfig test for adding and removing vdevs

This test performs a sanity check of the zpool add and remove commands.  It
tests adding and removing both a cache disk and a log disk to and from a zpool.
Usage of both a shorthand device path and a full path is covered.  The test
uses a scsi_debug device as the disk to be added and removed.  This is done so
that zpool will see it as a whole disk and partition it, which it does not
currently done for loopback devices.  We want to verify that the manipulation
done to whole disks paths to hide the parition information does not break the
add/remove interface.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove solaris-specific code from make_leaf_vdev()

Portability between Solaris and Linux isn't really an issue for us anymore, and
removing sections like this one helps simplify the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Support shorthand names with zpool remove

zpool status displays abbreviated vdev names without leading path components
and, in the case of whole disks, without partition information.  Also, the
zpool subcommands 'create' and 'add' support using shorthand devices names
without qualified paths.  Prior to this change, however, removing a device
generally required specifying its name as it is stored in the vdev label.  So
while zpool status might list a cache disk with a name like A16, removing it
would require a full path such as /dev/disk/zpool/A16-part1, which is
non-intuitive.

This change adds support for shorthand device names with the remove subcommand
so one can simply type, for example,

        zpool remove tank A16

A consequence of this change is that including the partition information when
removing a whole-disk vdev now results in an error.  While this is arguably the
correct behavior, it is a departure from how zpool previously worked in this
project.

This change removes the only reference to ctd_check_path(), so that function is
also removed to avoid compiler warnings.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add helper functions for manipulating device names

This change adds two helper functions for working with vdev names and paths.
zfs_resolve_shortname() resolves a shorthand vdev name to an absolute path
of a file in /dev, /dev/disk/by-id, /dev/disk/by-label, /dev/disk/by-path,
/dev/disk/by-uuid, /dev/disk/zpool.  This was previously done only in the
function is_shorthand_path(), but we need a general helper function to
implement shorthand names for additional zpool subcommands like remove.
is_shorthand_path() is accordingly updated to call the helper function.

There is a minor change in the way zfs_resolve_shortname() tests if a file
exists.  is_shorthand_path() effectively used open() and stat64() to test for
file existence, since its scope includes testing if a device is a whole disk
and collecting file status information.  zfs_resolve_shortname(), on the other
hand, only uses access() to test for existence and leaves it to the caller to
perform any additional file operations.  This seemed like the most general and
lightweight approach, and still preserves the semantics of is_shorthand_path().

zfs_append_partition() appends a partition suffix to a device path.  This
should be used to generate the name of a whole disk as it is stored in the vdev
label. The user-visible names of whole disks do not contain the partition
information, while the name in the vdev label does.   The code was lifted from
the function make_disks(), which now just calls the helper function.  Again,
having a helper function to do this supports general handling of shorthand
names in the user interface.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add zfault zpool configurations and tests

Eleven new zpool configurations were added to allow testing of various
failure cases.  The first 5 zpool configurations leverage the 'faulty'
md device type which allow us to simuluate IO errors at the block layer.
The last 6 zpool configurations leverage the scsi_debug module provided
by modern kernels.  This device allows you to create virtual scsi
devices which are backed by a ram disk.  With this setup we can verify
the full IO stack by injecting faults at the lowest layer.  Both methods
of fault injection are important to verifying the IO stack.

The zfs code itself also provides a mechanism for error injection
via the zinject command line tool.  While we should also take advantage
of this appraoch to validate the code it does not address any of the
Linux integration issues which are the most concerning.  For the
moment we're trusting that the upstream Solaris guys are running
zinject and would have caught internal zfs logic errors.

Currently, there are 6 r/w test cases layered on top of the 'faulty'
md devices.  They include 3 writes tests for soft/transient errors,
hard/permenant errors, and all writes error to the device.  There
are 3 matching read tests for soft/transient errors, hard/permenant
errors, and fixable read error with a write.  Although for this last
case zfs doesn't do anything special.

The seventh test case verifies zfs detects and corrects checksum
errors.  In this case one of the drives is extensively damaged and
by dd'ing over large sections of it.  We then ensure zfs logs the
issue and correctly rebuilds the damage.

The next  test cases use the scsi_debug configuration to injects error
at the bottom of the scsi stack.  This ensures we find any flaws in the
scsi midlayer or our usage of it.  Plus it stresses the device specific
retry, timeout, and error handling outside of zfs's control.

The eighth test case is to verify that the system correctly handles an
intermittent device timeout.  Here the scsi_debug device drops 1 in N
requests resulting in a retry either at the block level.  The ZFS code
does specify the FAILFAST option but it turns out that for this case
the Linux IO stack with still retry the command.  The FAILFAST logic
located in scsi_noretry_cmd() does no seem to apply to the simply
timeout case.  It appears to be more targeted to specific device or
transport errors from the lower layers.

The ninth test case handles a persistent failure in which the device
is removed from the system by Linux.  The test verifies that the failure
is detected, the device is made unavailable, and then can be successfully
re-add when brought back online.  Additionally, it ensures that errors
and events are logged to the correct places and the no data corruption
has occured due to the failure.

Fix missing 'zpool events'

It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped.  This was discovered while writing the zfault.sh
tests to validate common failure modes.

This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough.  In this case 1024 byte is the
default.  If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t.  This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error.  Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.

The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit.  This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.

Initial zio delay timing

While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.

This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured.  Unfortunately,
even with FAILFAST set we may retry and request and not get an
error.  This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack.  However
by setting the limit at 30s we can log the event even if no error
was returned.

Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram.  This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.

None of this code changes the internal behavior of ZFS.  Currently
it is simply for reporting excessively long delays.