[Acl-Devel] [RFC] new design for EA on-disk format

Andreas Dilger adilger@clusterfs.com
Wed, 10 Jul 2002 16:52:46 -0600


Daniel, Jeff Dike (UML), and I were discussing an alternative EA design
at OLS, and I thought I would post it here for discussion by the rest of
the ext[23] folks (I'm not sure what Daniel's plans are on this front,
so I will just go ahead with it).  It is somewhat of a departure from
the current EA setup (one block per inode), but I think it is a win on
all fronts.

It turns out that the current EA setup is biting us on Lustre pretty
noticeably.  The metadata server (which holds all of the directory data
and file attributes) is always running out of space because we store
a 16-byte EA with each inode.  We run out of blocks before we run out
of inodes, on a filesystem that _should_ have mostly inodes and only
directory data blocks (using default mke2fs parameters).  Obviously we
can tune mke2fs, but in reality we waste about 255/256 of our space.


The basic concept of the design is "EAs are like named directory entries",
so we should store them in "directories" and re-use as much of the
current directory handling code as possible (especially htree).

There is essentially one EA directory inode per some number[*] of regular
inodes, and this EA directory is a file which is structured like an ext2
directory, with an arbitrary number of variable-length EA entries in it.
For a large EA directory it would be structured like an htree indexed
directory in order to do fast EA lookup (the index would only appear
after the directory size has grown past a single block).

[*]  The ratio of inodes per EA directory might be one per some
     fixed number of inodes, maybe 256 inodes, or one per block group,
     or even one per filesystem if we are using the htree code for
     fast lookup.  The more inodes per EA directory, the more likely
     EA data could be shared, and the fewer partially full blocks.
     The fewer inodes per EA directory, the smaller the chance of a
     global corruption of this data and the less contention for EA updates.
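
To illustrate the fixed-ratio option only, here is a minimal sketch of
which inodes would end up sharing an EA directory.  The constant and
the function name are hypothetical; the only assumptions are the "maybe
256 inodes" figure above and the ext2 convention that inode numbers
start at 1.

```c
#include <stdint.h>

/* Assumption: one EA directory per fixed run of 256 inodes. */
#define EA_INODES_PER_DIR 256u

/*
 * Hypothetical helper: index of the EA directory that would serve
 * inode `ino`.  Inode numbers are 1-based, so subtract 1 first.
 */
uint32_t ea_dir_index(uint32_t ino)
{
        return (ino - 1) / EA_INODES_PER_DIR;
}
```

With one EA directory per block group, the divisor would be the
filesystem's inodes-per-group value instead of a constant.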

The benefits of this model are:
- It is easy to store multiple variable length attributes in a single block,
  which can be shared among several inodes.
- Maximum single EA size is blocksize less EA entry header (12 bytes).
  This is to keep the "entry doesn't cross block boundary" behaviour that
  has worked so nicely with ext2 directory entries.
- You can store an essentially arbitrary number of EAs off of each inode.
  This avoids problems with user attributes using up all of the EA space
  and causing problems when you _have_ to add a system EA.
- It is easy to scale the lookup of EAs in a large EA directory by re-using
  the htree code.
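
As a rough sketch of the size arithmetic behind these points: the
12-byte header is the figure given above, while the function names and
the rounding of record lengths up to a multiple of 4 (as ext2 directory
entries do) are assumptions for illustration.

```c
#include <stdint.h>

#define EA_ENTRY_HDR 12u        /* EA entry header size, in bytes */

/* Largest EA value that fits in a block without crossing its boundary. */
uint32_t ea_max_size(uint32_t blocksize)
{
        return blocksize - EA_ENTRY_HDR;
}

/*
 * Number of equally sized EA entries that fit in one block, assuming
 * each record (header + data) is padded to a multiple of 4 bytes.
 */
uint32_t ea_entries_per_block(uint32_t blocksize, uint32_t ea_bytes)
{
        uint32_t reclen = (EA_ENTRY_HDR + ea_bytes + 3u) & ~3u;

        return blocksize / reclen;
}
```

Under these assumptions a 4096-byte block could hold 146 entries
carrying 16 bytes of data each, instead of one.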

The regular inodes would store the inode number of their EA directory,
maybe in the same i_file_acl field that the current EA code uses, or
in i_faddr formerly reserved for file tails.

Each inode would have an "EA table" stored in its EA directory which
holds a list of attribute names assigned to that inode.  The attributes
themselves will be stored in separate EA entries.  For performance
reasons we may still want to keep some/all of this inside a large inode
to avoid multiple seeks to get the EA data.

A proposal for EA on-disk layout could look like the following:

/* One EA table per regular inode that has EA data in this EA directory */
struct ext2_ea_table {
        __u16   et_reclen;      /* bytes to start of next entry (% 4 == 0) */
        __u16   et_size;        /* bytes of data in this EA table */

        __u32   et_name_hash;   /* 31-bit hash of EA name "system.ea_table",
                                   essentially a magic (bit 0 == shared) */
        __u32   et_inode;       /* source inode number for this table */

        /* per entry data */
        __u8    et_namelen;     /* length of this EA name ((len + 1)%4 == 0) */
        __u8    et_unused[3];
        char    et_name[0];     /* EA name (nul terminated if not ...) */
};

/* One entry per EA, common if a shared entry */
struct ext2_ea_entry {
        __u16   ee_reclen;      /* bytes to start of next entry (% 4 == 0) */
        __u16   ee_size;        /* bytes of data in this EA entry */

        __u32   ee_name_hash;   /* 31-bit hash of EA name, low-bit == shared */
        __u32   ee_inode;       /* source inode number, or refcount if shared */

        __u8    ee_data[0];     /* EA data */
};
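
A minimal sketch of how one block of such entries might be scanned,
much like walking ext2 directory entries by rec_len.  The struct below
mirrors the 12-byte fixed header of ext2_ea_entry; the function name,
the use of host byte order, and the sanity checks are all assumptions,
and hash collisions are not resolved here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed 12-byte header, mirroring the proposed ext2_ea_entry fields. */
struct ea_hdr {
        uint16_t reclen;        /* bytes to start of next entry (% 4 == 0) */
        uint16_t size;          /* bytes of data in this entry */
        uint32_t name_hash;     /* 31-bit hash, low bit == shared */
        uint32_t inode;         /* source inode, or refcount if shared */
};

/*
 * Hypothetical linear scan of one block.  Returns the byte offset of
 * the first entry matching `hash` (and `ino`, unless the entry is
 * shared), or -1 if none is found or a record looks corrupt.
 */
long ea_find_in_block(const uint8_t *block, size_t blocksize,
                      uint32_t hash, uint32_t ino)
{
        size_t off = 0;

        while (off + sizeof(struct ea_hdr) <= blocksize) {
                struct ea_hdr h;

                memcpy(&h, block + off, sizeof(h));

                /* A record must hold its header and data and stay in-block. */
                if (h.reclen < sizeof(h) || h.reclen % 4 != 0 ||
                    off + h.reclen > blocksize ||
                    (size_t)h.size + sizeof(h) > h.reclen)
                        return -1;

                /* Shared entries keep a refcount in `inode`, so skip that test. */
                if (h.name_hash == hash && ((hash & 1u) || h.inode == ino))
                        return (long)off;

                off += h.reclen;
        }
        return -1;
}
```

Keeping every record inside its block is what lets a scan like this
validate each block independently.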

Interaction with the EAs would be something like the following:
1) To list all EAs assigned to a given inode find the EA table for that
   inode in that inode's EA directory.  The EA table would have a
   "well known" hash number/name so we would not need to recompute for
   each access.  If the EA table was small, we could just put it inside
   a large inode, or not have one at all if all of the EA data could
   fit directly inside the large inode space.

2) To locate a particular attribute for a given inode, you _could_ first
   check that inode's EA table to see if that attribute was assigned for
   the inode, or you could do a direct lookup in the EA directory for
   the given attribute if you have reason to believe it already exists
   (in-kernel code might do this).  This would avoid a lookup of the EA
   table (and htree index if indexed) to find the EA entry.

3) For shared EA entries, the low bit of the name hash is set, which
   indicates that the "ee_inode" field actually holds the refcount on
   that attribute instead of the parent inode number (we could mask
   refcounts at 2^16 and keep 16 bits for future use if there is one
   EA directory per block group).

4) For non-shared EA entries, we could potentially use the inode number
   at the beginning/end of the name hash to avoid a large number of hash
   collisions.  This assumes that EA users know which EAs are sharable
   and which are not.  Most applications would know this, but it would
   need an API which allowed this to be specified (maybe by using the
   "shared" and "usr_shared" prefixes, or "unique" and "usr_unique" if
   we want the "system" and "user" to be shared by default?).  Putting
   something into the "shared" space that has unique data (e.g. ACL)
   does not necessarily force it to be shared, it just means that there
   will be more hash collisions than there would be otherwise.
   
   That would preclude automatic EA sharing for poorly implemented apps.
   This is not necessarily optimal, so interesting ideas for this are
   welcome.
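
To make items 3 and 4 concrete, here is a small sketch of how the
shared bit and the refcount might be encoded.  All names are
hypothetical, and placing the 31-bit hash in the upper bits, masking
the refcount to 16 bits, and reporting a count of 1 for non-shared
entries are assumptions rather than settled parts of the design.

```c
#include <stdint.h>

#define EA_HASH_SHARED   0x1u           /* low bit of the name hash */
#define EA_REFCOUNT_MASK 0xffffu        /* assumption: 16-bit refcount */

/* Nonzero if the entry's ee_inode field holds a refcount. */
int ea_is_shared(uint32_t name_hash)
{
        return (name_hash & EA_HASH_SHARED) != 0;
}

/* Pack a 31-bit hash into the upper bits, shared flag into bit 0. */
uint32_t ea_make_hash(uint32_t hash31, int shared)
{
        return ((hash31 & 0x7fffffffu) << 1) |
               (shared ? EA_HASH_SHARED : 0u);
}

/*
 * Number of inodes referencing the entry: the masked refcount for a
 * shared entry, otherwise 1 (ee_inode names the single owner).
 */
uint32_t ea_refcount(uint32_t name_hash, uint32_t ee_inode)
{
        return ea_is_shared(name_hash) ? (ee_inode & EA_REFCOUNT_MASK) : 1u;
}
```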

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/