[lug] Cluster File Systems
blug-mail at duboulder.com
Wed Aug 8 02:12:31 MDT 2018
Thanks Davide for the link. Interesting info.
For the curious these might be interesting:
I naively thought that after 35+ years simple cluster-file-system
setup would have gotten easier. Instead the scope has grown, and the
complexity with it.
What I am after is removing the SPOF created by server and/or disk outages
(e.g. server kernel upgrade, server replacement). The file server exports
nfsv4 from raid-backed file-systems (xfs mostly) on lvm. The uses range from
/home and the postfix mail-queue/mailboxes to a backup osm database
(postgres), a nearline file store, a video archive, and build roots (e.g.
rpi, rock64, apu2). The non-hot-swap disks are in the server case and there
isn't another system available that can accept the disks.
For those interested in what I understand about some of the options:
o These days, cluster/distributed wrt storage implies lots of clients,
hundreds of terabytes or more, and a desire for high 'performance' among
the clients. It seems to vary by fs whether performance means latency or
throughput.
o The 'largeness' orientation means the hardware requirements might be
substantial even for small deployments (e.g. ceph)
o Lustre -- complicated to set up, very much tied to the RH distro via
its server kernel requirements.
o Ceph -- looks complicated to set up and memory requirements are pretty
large for small deployments. Operating RAM for OSDs is ~500MB each, but
recovery operations could require 1GB of RAM per TB of storage, i.e.
4TB disks -> 4GB RAM for recovery.
o BeeGFS/fhGFS -- Mirroring (replication) and file ACLs are only
available in the enterprise version (see the LICENSE file).
o OpenStack Swift -- swift by itself is an object store. The swift-on-
file backend provides 'POSIX' file access (maybe via NFS protocols),
but does not do replication.
>> Swift-On-File currently works only with Filesystems with extended
>> attributes support. It is also recommended that these Filesystems
>> provide data durability as Swift-On-File should not use Swift's
>> replication mechanisms.
o GlusterFS -- distributed, replicated file system (glusterfs.org)
Not recommended for big file access (e.g. databases). FUSE-based
client access to cluster data allows for automatic failover.
Don't know if a cluster server can also be a client, say with FUSE.
E.g. server running postfix which writes to the shared mailstore.
The docs say NFS access is limited to NFSv3.
~3 yrs ago I had problems with lots of rapid reads/writes to
gluster-served files. Maybe it's generally better now, but:
>> Having had a fair bit of experience looking at large memory
>> systems with heavily loaded regressions, be it CAD, EDA or similar
>> tools, we've sometimes encountered stability problems with Gluster.
>> We had to carefully analyse the memory footprint and amount of
>> disk wait times over days. This gave us a rather remarkable story
>> of disk trashing, huge iowaits, kernel oops, disk hangs etc.
o OpenAFS -- distributed, descriptions don't mention replication.
Requires Kerberos 5 (added complexity in user management)
IIRC non-POSIX file semantics
o LizardFS (based on MooseFS) -- distributed, replicating
Metadata servers (suggest 32GB RAM, but 4GB may be ok for 'small' cases);
dedicated machine per metadata server, need 2: master, shadow
MetaLogger and ChunkServers (suggest 2GB RAM);
metaloggers are optional, used to recover when both master & shadow
metadata servers are lost
replication can use raid5-like modes but requires n data, 1 parity,
and 1 extra-recovery chunk servers
o OrangeFS (based on PVFS) -- distributed, non-replicating
might not support POSIX locking
direct-access client is in mainline kernels 4.6+
fault tolerance handled through OS/HW RAID
metadata is _not_ replicated even with multiple metadata servers
o XtreemFS -- distributed, file-level replication
files are replicated on close (big file issues?)
server code in Java, FUSE client in C++
stable release 1.5.1 from 2015; the GitHub repo has some minor
activity in the last few months
o NAS systems -- still a SPOF
- need duplicate systems for quick recovery from non-disk failures
- need some sort of regular syncing to standby, or manually move
disks on failure
- NFS clients have to kill processes, remount exports
- firmware lock-in, limited support period for (most?) commercial versions
- 2-bay units ~$250 with no disks (Netgear, Synology, QNAP);
can end-users replace the firmware on any of these?
- rockstor.com Pro 8, $1000 with no disks; mentions DIY so it might be
possible to fully manage the system (standard motherboard?)
- helios 4 DIY NAS (http://kobol.io/helios4)
kickstarter, 4 non-hot-swap drive bays, 2GB RAM, Marvell Armada 388 SoC
~$200 no drives, shipped from HK
o md raid1/5... on iSCSI targets over GbE (tgt for Linux)
- will this work?
- use rock64s as iSCSI targets w/ 4GB cache, use available
systems for primary/standby nfs servers
- one of the simpler things to do
- requires manual switch-over when the nfs server fails; on the standby
(rough script sketch below):
connect to iSCSI targets
bring up preconfigured /dev/mdX
mount exported fs
update nfs exports
update dns with new nfs server address
kill processes, remount exports on clients
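
To make the switch-over concrete, here is a rough sketch of what a
fail-over script on the standby might look like. All of the IQNs,
portals, device names and paths below are made-up placeholders, and the
DNS step is only a comment since it depends on how the zone is managed:

  #!/usr/bin/env python3
  # Sketch of the manual fail-over steps on the standby NFS server.
  # Target IQNs, portals, md device and mount point are all placeholders.
  import subprocess

  TARGETS = [
      ("iqn.2018-08.example:disk0", "192.168.1.11:3260"),
      ("iqn.2018-08.example:disk1", "192.168.1.12:3260"),
  ]

  def run(*cmd):
      print("+", " ".join(cmd))
      subprocess.run(cmd, check=True)

  # 1. log in to the iSCSI targets (open-iscsi initiator)
  for iqn, portal in TARGETS:
      run("iscsiadm", "-m", "node", "-T", iqn, "-p", portal, "--login")

  # 2. assemble the preconfigured md array from the attached targets
  run("mdadm", "--assemble", "--scan")

  # 3. mount the filesystem to be exported
  run("mount", "/dev/md0", "/srv/export")

  # 4. (re)export it -- assumes /etc/exports is already in place
  run("exportfs", "-ra")

  # 5. repoint the NFS service name at this host (nsupdate, a pushed
  #    /etc/hosts entry, or a floating IP) -- omitted, site-specific.
  # Clients still need stale processes killed and the export remounted.

The target side is presumably just tgt exporting a block device on each
rock64; whether md resync/recovery behaves well over GbE iSCSI is the
part I would test first.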
On 08/04/2018 05:31 AM, Davide Del Vento wrote:
> Ops. The link: http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html
> On Fri, Aug 3, 2018 at 7:16 PM, Davide Del Vento <davide.del.vento at gmail.com> wrote:
>> Since you and others mentioned a few things I did not know about, while
>> trying to learn more about them I've found this which seems to be quite
>> informative regarding BeeGFS (aka fhGFS) as compared to gluster and a
>> little bit to Lustre.