[lug] Cluster File Systems

Lee Woodworth blug-mail at duboulder.com
Wed Aug 8 02:12:31 MDT 2018

Thanks Davide for the link. Interesting info.

I naively thought that after 35+ years, simple cluster-file-system
setup would have gotten easier. Instead, both the scope and the
complexity have grown.

What I am after is removing single points of failure (SPOF) due to server
and/or disk outages (e.g. server kernel upgrades, server replacement). The
file server exports NFSv4 from RAID-backed file systems (mostly XFS) on
LVM. The uses range from /home, to the Postfix mail queue/mailboxes, a
backup OSM database (PostgreSQL), nearline file store, video archive and
build roots (e.g. rpi, rock64, apu2). The non-hot-swap disks are in the
server case and there isn't another system available that can accept
the disks.

For those interested in what I understand about some of the options:

o These days, cluster/distributed wrt storage implies lots of clients,
   100's of terabytes or more, and a desire for high 'performance' among
   the clients. It seems to vary by fs whether performance means latency
   or throughput.

o The 'largeness' orientation means the hardware requirements might be
   substantial even for small deployments (e.g. Ceph).

o Lustre -- complicated to set up, very much tied to the RH distros via
   kernel requirements.

o Ceph -- looks complicated to set up, and memory requirements are pretty
   large for small deployments. Operating RAM for OSDs is ~500 MB each,
   but recovery operations could require 1 GB of RAM per TB of storage,
   i.e. 4 TB disks -> 4 GB RAM for recovery.
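The sizing above is just the commonly cited rule of thumb multiplied out; as a
trivial sketch (the 1 GB/TB figure is a guideline, not a hard limit):

```shell
# Rough OSD recovery-memory sizing, assuming the oft-quoted
# ~1 GB RAM per TB of storage rule of thumb (a guideline only).
disk_tb=4                          # capacity of one OSD disk, in TB
gb_per_tb=1                        # recovery rule of thumb
recovery_gb=$((disk_tb * gb_per_tb))
echo "recovery RAM for a ${disk_tb} TB OSD: ~${recovery_gb} GB"
```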

o BeeGFS/fhGFS -- Mirroring (replication) and file ACLs are only
   available in the enterprise version (see the LICENSE file).

o OpenStack Swift -- Swift by itself is an object store. The
   Swift-On-File backend provides 'POSIX' file access (maybe via NFS
   protocols), but does not do replication.

   >> Swift-On-File currently works only with Filesystems with extended
   >> attributes support. It is also recommended that these Filesystems
   >> provide data durability as Swift-On-File should not use Swift's
   >> replication mechanisms.

o GlusterFS -- distributed, replicated file system (glusterfs.org)
   Not recommended for big-file access (e.g. databases). FUSE-based
   client access to cluster data allows for automatic fail over.
   Don't know if a cluster server can also be a client, say with FUSE,
   e.g. a server running Postfix which writes to the shared mailstore.

   The documentation says built-in NFS access is limited to NFSv3.

   ~3 yrs ago I had problems with lots of rapid reads/writes to
   gluster-served files. Maybe it's generally better now, but:
   >> Having had a fair bit of experience looking at large memory
   >> systems with heavily loaded regressions, be it CAD, EDA or similar
   >> tools, we've sometimes encountered stability problems with Gluster.
   >> We had to carefully analyse the memory footprint and amount of
   >> disk wait times over days. This gave us a rather remarkable story
   >> of disk trashing, huge iowaits, kernel oops, disk hangs etc.
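For what it's worth, mounting a Gluster volume with the FUSE client looks
roughly like the command below (the host and volume names server1/gv0 are
made up). The client fetches the volume layout from whichever server it
contacts and then talks to all bricks directly, which is what makes the
automatic fail over possible:

```shell
# Build the FUSE mount command for a Gluster volume (names are examples).
# backup-volfile-servers lets the initial mount succeed even if the
# first server happens to be down at mount time.
server=server1
volume=gv0
mountpoint=/mnt/gv0
mount_cmd="mount -t glusterfs -o backup-volfile-servers=server2:server3 ${server}:/${volume} ${mountpoint}"
echo "$mount_cmd"      # run it with: sudo $mount_cmd
```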

o OpenAFS -- distributed; descriptions don't mention replication.
   Requires Kerberos 5 (added complexity in user management).
   IIRC, non-POSIX file semantics.

o LizardFS (based on MooseFS) -- distributed, replicating
   Metadata servers (32 GB RAM suggested, but 4 GB may be ok for 'small' cases)
     dedicated machine per metadata server; need 2: master, shadow
   MetaLoggers and ChunkServers (2 GB RAM suggested)
     metaloggers are optional, used when both master & shadow metadata
     servers fail
   replication can use raid5-like modes but requires n data, 1 parity,
   and 1 extra-recovery chunk server
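If I'm reading the LizardFS docs right, these replication/EC modes are
declared as 'goals' in mfsgoals.cfg on the master, roughly like this
(the ids and goal names below are made-up examples):

```
# /etc/mfs/mfsgoals.cfg -- sketch only; ids and names are examples
# keep 3 plain copies of each chunk on any 3 chunkservers:
3 3copies : _ _ _
# raid5-like erasure coding, 2 data parts + 1 parity part;
# needs at least 3 chunkservers, plus a spare for recovery:
5 ec21 : $ec(2,1)
```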

o OrangeFS (based on PVFS) -- distributed, non-replicating
   might not support POSIX locking
   direct-access client is in mainline kernels 4.6+
   fault tolerance handled through OS/HW RAID
     metadata is _not_ replicated even with multiple metadata servers

o XtreemFS -- distributed, file-level replication
   files are replicated on close (big-file issues?)
   server code in Java, FUSE client in C++
   stable release 1.5.1 is from 2015; the github repo has some minor
     activity in the last few months

o NAS systems -- still a SPOF
   - need duplicate systems for quick recovery from non-disk failures
   - need some sort of regular syncing to standby, or manually move
     disks on failure
   - NFS clients have to kill processes, remount exports
   - firmware lock-in, limited support period for (most?) commercial versions
     2 bay units ~$250 no disks (netgear, synology, qnap)
     can end-users replace the firmware with any of these?
   - rockstor.com Pro 8: $1000, no disks; mentions DIY, so it might be
     possible to fully manage the system (standard motherboard?)
   - helios 4 DIY NAS (http://kobol.io/helios4)
     kickstarter, 4 non-hot-swap drive bays, 2GB RAM, Marvell Armada 388 SoC
     ~$200 no drives, shipped from HK

o md raid1/5... on iSCSI targets over GbE (tgt for linux)
   - will this work?
   - use rock64s as iSCSI targets w/ 4GB cache, use available
       systems for primary/standby NFS servers
   - one of the simpler things to do
   - requires manual switch-over on NFS server failure; on the standby:
       connect to the iSCSI targets
       bring up the preconfigured /dev/mdX
       mount the exported fs
       update nfs exports
       update dns with the new nfs server address
       kill processes, remount exports on clients
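The manual switch-over steps above could be scripted; a sketch under
assumptions (the portal addresses, array device and mountpoint are all
made up, and the run wrapper echoes instead of executing so the sequence
can be dry-run safely):

```shell
#!/bin/sh
# Sketch of a manual NFS fail over onto md-over-iSCSI storage.
# All names (portals, array, mountpoint) are example values.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

# 1. log in to the iSCSI targets from the standby server
for portal in 192.168.1.11 192.168.1.12; do
    run iscsiadm -m node -p "$portal" --login
done

# 2. assemble the preconfigured md array (defined in mdadm.conf)
run mdadm --assemble /dev/md0

# 3. mount the exported filesystem and re-export it
run mount /dev/md0 /srv/export
run exportfs -ra

# 4. repointing clients is site-specific: update DNS so the NFS
#    name resolves to this host, then remount on the clients
```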

On 08/04/2018 05:31 AM, Davide Del Vento wrote:
> Ops. The link: http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html
> On Fri, Aug 3, 2018 at 7:16 PM, Davide Del Vento <davide.del.vento at gmail.com
>> wrote:
>> Since you and others mentioned a few things I did not know about, while
>> trying to learn more about them I've found this which seems to be quite
>> informative regarding BeeGFS (aka fhGFS) as compared to gluster and a
>> little bit to Lustre.
