Ceph: A Scalable Distributed File System
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn.

Abstract: We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. The scientific and high-performance computing communities in particular have driven advances in the performance and scalability of distributed storage systems, typically predicting more general purpose needs by a few years. Traditional solutions, exemplified by NFS [20], provide a straightforward model in which a server exports a file system hierarchy that clients can map into their local name space.

More recent distributed file systems have adopted architectures based on object-based storage, in which conventional hard disks are replaced with intelligent object storage devices (OSDs), which combine a CPU, network interface, and local cache with an underlying disk or RAID [4, 7, 8, 32, 35]. OSDs replace the traditional block-level interface with one in which clients can read or write byte ranges to much larger and often variably sized named objects, distributing low-level block allocation decisions to the devices themselves.

Systems adopting this model continue to suffer from scalability limitations due to little or no distribution of the metadata workload. Continued reliance on traditional file system principles like allocation lists and inode tables and a reluctance to delegate intelligence to the OSDs have further limited scalability and performance, and increased the cost of reliability.

We present Ceph, a distributed file system that provides excellent performance and reliability while promising unparalleled scalability.

Our architecture is based on the assumption that systems at the petabyte scale are inherently dynamic: large systems are inevitably built incrementally, node failures are the norm rather than the exception, and the quality and character of workloads are constantly shifting over time. Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions.

This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery. Ceph utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves the scalability of metadata access, and with it, the scalability of the entire system. We discuss the goals and workload assumptions motivating our choices in the design of the architecture, analyze their impact on system scalability and performance, and relate our experiences in implementing a functional system prototype.

We say the Ceph interface is near-POSIX because we find it appropriate to extend the interface and selectively relax consistency semantics in order to better align with the needs of applications and to improve system performance.

Figure 1: System architecture. Each process can either link directly to a client instance or interact with a mounted file system.

The primary goals of the architecture are scalability (to hundreds of petabytes and beyond), performance, and reliability. Scalability is considered in a variety of dimensions, including the overall storage capacity and throughput of the system, and performance in terms of individual clients, directories, or files. Our target workload may include such extreme cases as tens or hundreds of thousands of hosts concurrently reading from or writing to the same file or creating files in the same directory.

Such scenarios, common in scientific applications running on supercomputing clusters, are increasingly indicative of tomorrow's general purpose workloads.

More importantly, we recognize that distributed file system workloads are inherently dynamic, with significant variation in data and metadata access as active applications and data sets change over time. Ceph directly addresses the issue of scalability while simultaneously achieving high performance, reliability and availability through three fundamental design features: decoupled data and metadata, dynamic distributed metadata management, and reliable autonomic distributed object storage.

Decoupled Data and Metadata - Ceph maximizes the separation of file metadata management from the storage of file data. Metadata operations (open, rename, etc.) are collectively managed by a metadata server cluster, while clients interact directly with OSDs to perform file I/O. Object-based storage has long promised to improve the scalability of file systems by delegating low-level block allocation decisions to individual devices. However, in contrast to existing object-based file systems [4, 7, 8, 32] which replace long per-file block lists with shorter object lists, Ceph eliminates allocation lists entirely.

Instead, file data is striped onto predictably named objects, while a special-purpose data distribution function called CRUSH [29] assigns objects to storage devices. This allows any party to calculate rather than look up the name and location of objects comprising a file's contents, eliminating the need to maintain and distribute object lists, simplifying the design of the system, and reducing the metadata cluster workload.

Dynamic Distributed Metadata Management - Because file system metadata operations make up as much as half of typical file system workloads [22], effective metadata management is critical to overall system performance. Ceph utilizes a novel metadata cluster architecture based on Dynamic Subtree Partitioning [30] that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among tens or even hundreds of MDSs (metadata servers).

A dynamic hierarchical partition preserves locality in each MDS's workload, facilitating efficient updates and aggressive prefetching to improve performance for common workloads. Significantly, the workload distribution among metadata servers is based entirely on current access patterns, allowing Ceph to effectively utilize available MDS resources under any workload and achieve near-linear scaling in the number of MDSs.

Reliable Autonomic Distributed Object Storage - Large systems composed of many thousands of devices are inherently dynamic: they are built incrementally, they grow and contract as new storage is deployed and old devices are decommissioned, device failures are frequent and expected, and large volumes of data are created, moved, and deleted. All of these factors require that the distribution of data evolve to effectively utilize available resources and maintain the desired level of data replication.

Ceph delegates responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs that store the data, while at a high level, OSDs collectively provide a single logical object store to clients and metadata servers. This approach allows Ceph to more effectively leverage the intelligence (CPU and memory) present on each OSD to achieve reliable, highly available object storage with linear scaling. We describe the operation of the Ceph client, metadata server cluster, and distributed object store, and how they are affected by the critical features of our architecture.

We also describe the status of our prototype.

The Ceph client runs on each host executing application code and exposes a file system interface to applications. In the Ceph prototype, the client code runs entirely in user space and can be accessed either by linking to it directly or as a mounted file system via FUSE [25], a user-space file system interface.

Each client maintains its own file data cache, independent of the kernel page or buffer caches, making it accessible to applications that link to the client directly. When a process opens a file, the client sends a request to the MDS cluster. An MDS traverses the file system hierarchy to translate the file name into the file inode, which includes a unique inode number, the file owner, mode, size, and other per-file metadata.

If the file exists and access is granted, the MDS returns the inode number, file size, and information about the striping strategy used to map file data into objects. The MDS may also issue the client a capability (if it does not already have one) specifying which operations are permitted. Capabilities currently include four bits controlling the client's ability to read, cache reads, write, and buffer writes.
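As a rough sketch, the four capability bits can be pictured as a small bitmask that the client consults before caching reads or buffering writes. The bit layout and helper names below are illustrative assumptions, not Ceph's actual encoding.

```python
# Hedged sketch of the four capability bits an MDS may grant a client.
# The exact bit layout is an assumption for illustration only.
CAP_READ     = 1 << 0   # client may read file data
CAP_RDCACHE  = 1 << 1   # client may cache reads locally
CAP_WRITE    = 1 << 2   # client may write file data
CAP_WRBUFFER = 1 << 3   # client may buffer (delay) writes

def can_cache_reads(caps: int) -> bool:
    """Serve reads from the local cache only if both bits are held."""
    return bool(caps & CAP_READ) and bool(caps & CAP_RDCACHE)

def can_buffer_writes(caps: int) -> bool:
    """Buffer writes only if both write and buffering bits are held."""
    return bool(caps & CAP_WRITE) and bool(caps & CAP_WRBUFFER)

# Example: read-only sharers might hold CAP_READ | CAP_RDCACHE, while a
# lone writer holds all four bits until another client opens the file.
```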

In the future, capabilities will include security keys allowing clients to prove to OSDs that they are authorized to read or write data [13, 19]; the prototype currently trusts all clients. Ceph generalizes a range of striping strategies to map file data onto a sequence of objects. To avoid any need for file allocation metadata, object names simply combine the file inode number and the stripe number.
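A minimal sketch of this naming scheme follows. The name format and the 4 MiB stripe unit are illustrative assumptions; only the principle (object names derived from inode number plus stripe number, with no allocation list) comes from the text above.

```python
# Hedged sketch: stripe file data onto predictably named objects.
STRIPE_UNIT = 4 * 1024 * 1024  # bytes per object stripe (assumed value)

def object_name(ino: int, stripe_no: int) -> str:
    # Assumed format: "<inode hex>.<stripe index hex>"
    return f"{ino:x}.{stripe_no:08x}"

def objects_for_range(ino: int, offset: int, length: int):
    """Name every object touched by a byte range, with per-object extents."""
    out, end = [], offset + length
    while offset < end:
        stripe_no = offset // STRIPE_UNIT
        obj_off = offset % STRIPE_UNIT
        n = min(STRIPE_UNIT - obj_off, end - offset)
        out.append((object_name(ino, stripe_no), obj_off, n))
        offset += n
    return out

# A client holding inode 0x1234 can locate a 10 MiB read at offset 0
# without consulting any allocation metadata:
print(objects_for_range(0x1234, 0, 10 * 1024 * 1024))
```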

For example, if one or more clients open a file for read access, an MDS grants them the capability to read and cache file content. Armed with the inode number, layout, and file size, the clients can name and locate all objects containing file data and read directly from the OSD cluster.

Any objects or byte ranges that don't exist are defined to be file "holes," or zeros. Similarly, if a client opens a file for writing, it is granted the capability to write with buffering, and any data it generates at any offset in the file is simply written to the appropriate object on the appropriate OSD.

The client relinquishes the capability on file close and provides the MDS with the new file size (the largest offset written), which redefines the set of objects that may exist and contain file data. When a file is opened by multiple clients with either multiple writers or a mix of readers and writers, the MDS revokes any previously issued read caching and write buffering capabilities, forcing client I/O for that file to be synchronous. That is, each application read or write operation will block until it is acknowledged by the OSD, effectively placing the burden of update serialization and synchronization with the OSD storing each object.

When writes span object boundaries, clients acquire exclusive locks on the affected objects (granted by their respective OSDs), and immediately submit the write and unlock operations to achieve the desired serialization. Object locks are similarly used to mask latency for large writes by acquiring locks and flushing data asynchronously.

Although read-write sharing is relatively rare in general-purpose workloads [22], it is more common in scientific computing applications [27], where performance is often critical. For this reason, it is often desirable to relax consistency at the expense of strict standards conformance in situations where applications do not rely on it. Although Ceph supports such relaxation via a global switch, and many other distributed file systems punt on this issue [20], this is an imprecise and unsatisfying solution: either performance suffers, or consistency is lost system-wide.

Performance-conscious applications which manage their own consistency (e.g., by writing different parts of the same file) can instead use an O_LAZY open flag to explicitly relax the usual coherency requirements for a shared file. Both read operations (e.g., readdir, stat) and updates (e.g., unlink, chmod) are applied synchronously by the MDS. For simplicity, no metadata locks or leases are issued to clients. For HPC workloads in particular, callbacks offer minimal upside at a high potential cost in complexity.

Instead, Ceph optimizes for the most common metadata access scenarios. A readdir followed by a stat of each file (e.g., ls -l) is an extremely common access pattern and a notorious performance killer for large directories. A readdir in Ceph requires only a single MDS request, which fetches the entire directory, including inode contents. By default, if a readdir is immediately followed by one or more stat calls, the briefly cached information is returned; otherwise it is discarded. Although this relaxes coherence slightly in that an intervening inode modification may go unnoticed, we gladly make this trade for vastly improved performance.

This behavior is explicitly captured by the readdirplus [31] extension, which returns lstat results with directory entries (as some OS-specific implementations of getdir already do).
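A minimal sketch of this client-side behavior is shown below. The class, method names, and cache structure are hypothetical; the point is only that one readdir answers the immediately following stat calls and that any other access discards the briefly cached results.

```python
# Hedged sketch: a readdir returns the whole directory, including inode
# contents, which are kept just long enough to answer the stats that
# typically follow (the "ls -l" pattern). Names are hypothetical.
class ClientMetadataCache:
    def __init__(self, mds):
        self.mds = mds
        self.brief = {}          # path -> inode attrs from the last readdir

    def readdir(self, dirpath: str):
        entries = self.mds.readdir(dirpath)     # one MDS request
        self.brief = {f"{dirpath}/{name}": ino for name, ino in entries.items()}
        return list(entries)

    def stat(self, path: str):
        if path in self.brief:                  # served from the briefly
            return self.brief.pop(path)         # cached readdir results
        self.brief.clear()                      # any other access discards them
        return self.mds.stat(path)              # otherwise ask the MDS
```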

Ceph could allow consistency to be further relaxed by caching metadata longer, much like earlier versions of NFS, which typically cache for 30 seconds. However, this approach breaks coherency in a way that is often critical to applications, such as those using stat to determine if a file has been updated-they either behave incorrectly, or end up waiting for old cached values to time out.

We opt instead to again provide correct behavior and extend the interface in instances where it adversely affects performance. This choice is most clearly illustrated by a stat operation on a file currently opened by multiple clients for writing.

In order to return a correct file size and modification time, the MDS revokes any write capabilities to momentarily stop updates and collect up-to-date size and mtime values from all writers. The highest values are returned with the stat reply, and capabilities are reissued to allow further progress.

Although stopping multiple writers may seem drastic, it is necessary to ensure proper serializability. For a single writer, a correct value can be retrieved from the writing client without interrupting progress. Applications for which coherent behavior is unnecessary (victims of a POSIX interface that doesn't align with their needs) can use statlite [31], which takes a bit mask specifying which inode fields are not required to be coherent.

File and directory metadata in Ceph is very small, consisting almost entirely of directory entries (file names) and inodes (80 bytes). Unlike conventional file systems, no file allocation metadata is necessary: object names are constructed using the inode number and distributed to OSDs using CRUSH. This simplifies the metadata workload and allows our MDS to efficiently manage a very large working set of files, independent of file sizes. A set of large, bounded, lazily flushed journals allows each MDS to quickly stream its updated metadata to the OSD cluster in an efficient and distributed manner.

The per-MDS journals, each many hundreds of megabytes, also absorb repetitive metadata updates common to most workloads such that when old journal entries are eventually flushed to long-term storage, many are already rendered obsolete. Although MDS recovery is not yet implemented by our prototype, the journals are designed such that in the event of an MDS failure, another node can quickly rescan the journal to recover the critical contents of the failed node's in-memory cache for quick startup and in doing so recover the file system state.
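One way to picture how a lazily flushed journal absorbs repetitive updates is sketched below. The data structures and threshold are illustrative assumptions; this sketch also collapses repeated updates to the same inode as they are appended, whereas the real journal appends every entry sequentially and skips superseded entries at flush time.

```python
# Hedged sketch: later journal entries for the same inode supersede earlier
# ones, so by flush time many entries are already obsolete.
from collections import OrderedDict

class MDSJournal:
    def __init__(self, flush_threshold_bytes=500 * 1024 * 1024):
        self.entries = OrderedDict()     # ino -> latest metadata update
        self.bytes = 0
        self.threshold = flush_threshold_bytes

    def append(self, ino, update, size=80):
        self.entries[ino] = update       # newer update replaces the older one
        self.entries.move_to_end(ino)
        self.bytes += size               # the on-disk journal still grows sequentially

    def flush_old(self, store):
        """Write only surviving (non-superseded) updates to long-term storage."""
        if self.bytes < self.threshold:
            return
        for ino, update in self.entries.items():
            store.write_inode(ino, update)   # hypothetical long-term store API
        self.entries.clear()
        self.bytes = 0
```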

This strategy provides the best of both worlds: streaming updates to disk in an efficient sequential fashion, and a vastly reduced re-write workload, allowing the long-term on-disk storage layout to be optimized for future read access. In particular, inodes are embedded directly within directories, allowing the MDS to prefetch entire directories with a single OSD read request and exploit the high degree of directory locality present in most workloads [22].

Each directory's content is written to the OSD cluster using the same striping and distribution strategy as metadata journals and file data. Inode numbers are allocated in ranges to metadata servers and considered immutable in our prototype, although in the future they could be trivially reclaimed on file deletion.

An auxiliary anchor table [28] keeps the rare inode with multiple hard links globally addressable by inode number, all without encumbering the overwhelmingly common case of singly-linked files with an enormous, sparsely populated and cumbersome inode table.

While most existing distributed file systems employ some form of static subtree-based partitioning to delegate metadata management authority (usually forcing an administrator to carve the dataset into smaller static "volumes"), some recent and experimental file systems have used hash functions to distribute directory and file metadata [4], effectively sacrificing locality for load distribution.

Both approaches have critical limitations: static subtree partitioning fails to cope with dynamic workloads and data sets, while hashing destroys metadata locality and critical opportunities for efficient metadata prefetching and storage.

Figure 2: Ceph dynamically maps subtrees of the directory hierarchy to metadata servers based on the current workload. Individual directories are hashed across multiple nodes only when they become hot spots.

Ceph's MDS cluster is based on a dynamic subtree partitioning strategy [30] that adaptively distributes cached metadata hierarchically across a set of nodes, as illustrated in Figure 2.

Each MDS measures the popularity of metadata within the directory hierarchy using counters with an exponential time decay. Any operation increments the counter on the affected inode and all of its ancestors up to the root directory, providing each MDS with a weighted tree describing the recent load distribution.
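A small sketch of such a decaying counter follows. The half-life value, update rule, and node attributes are assumptions made for illustration, not the prototype's actual parameters.

```python
# Hedged sketch of a popularity counter with exponential time decay, used to
# weight an inode and all of its ancestors. The half-life is an assumption.
import math, time

class DecayCounter:
    def __init__(self, half_life=30.0):
        self.value = 0.0
        self.last = time.monotonic()
        self.k = math.log(2) / half_life      # decay constant

    def hit(self, amount=1.0, now=None):
        now = time.monotonic() if now is None else now
        self.value = self.value * math.exp(-self.k * (now - self.last)) + amount
        self.last = now

    def read(self, now=None):
        now = time.monotonic() if now is None else now
        return self.value * math.exp(-self.k * (now - self.last))

def record_op(inode):
    """Increment the counter on an inode and on every ancestor up to the root."""
    node = inode                      # assumes nodes carry .popularity and .parent
    while node is not None:
        node.popularity.hit()
        node = node.parent
```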

MDS load values are periodically compared, and appropriately-sized subtrees of the directory hierarchy are migrated to keep the workload evenly distributed. The combination of shared long-term storage and carefully constructed namespace locks allows such migrations to proceed by transferring the appropriate contents of the in-memory cache to the new authority, with minimal impact on coherence locks or client capabilities.

Imported metadata is written to the new MDS's journal for safety, while additional journal entries on both ends ensure that the transfer of authority is invulnerable to intervening failures (similar to a two-phase commit). The resulting subtree-based partition is kept coarse to minimize prefix replication overhead and to preserve locality. When metadata is replicated across multiple MDS nodes, inode contents are separated into three groups, each with different consistency semantics: security (owner, mode), file (size, mtime), and immutable (inode number, ctime, layout).

While immutable fields never change, security and file locks are governed by independent finite state machines, each with a different set of states and transitions designed to accommodate different access and update patterns while minimizing lock contention.

For example, owner and mode are required for the security check during path traversal but rarely change, requiring very few states, while the file lock reflects a wider range of client access modes as it controls an MDS's ability to issue client capabilities. Ceph uses its knowledge of metadata popularity to provide a wide distribution for hot spots only when needed and without incurring the associated overhead and loss of directory locality in the general case.

The contents of heavily read directories (e.g., many opens) are selectively replicated across multiple nodes to distribute load. Directories that are particularly large or experiencing a heavy write workload (e.g., many file creations) have their contents hashed by file name across the cluster, trading directory locality for a balanced distribution. This adaptive approach allows Ceph to encompass a broad spectrum of partition granularities, capturing the benefits of both coarse and fine partitions in the specific circumstances and portions of the file system where those strategies are most effective.

Every MDS response provides the client with updated information about the authority and any replication of the relevant inode and its ancestors, allowing clients to learn the metadata partition for the parts of the file system with which they interact. Future metadata operations are directed at the authority (for updates) or a random replica (for reads) based on the deepest known prefix of a given path. Normally clients learn the locations of unpopular (unreplicated) metadata and are able to contact the appropriate MDS directly.

Clients accessing popular metadata, however, are told the metadata reside either on different or multiple MDS nodes, effectively bounding the number of clients believing any particular piece of metadata resides on any particular MDS, dispersing potential hot spots and flash crowds before they occur.
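A hedged sketch of how a client might pick a target MDS from its learned partition is shown below. The map layout, paths, and tie-breaking are assumptions, not Ceph's actual client state.

```python
# Hedged sketch: route a metadata operation using the deepest known prefix
# of the path. Updates go to the authority; reads may go to a random replica.
import random

# path prefix -> (authoritative MDS, [replica MDSs]); contents are hypothetical
known = {
    "/":             (0, []),
    "/home":         (1, []),
    "/home/popular": (1, [2, 3]),   # a hot directory replicated on MDS 2 and 3
}

def pick_mds(path: str, is_update: bool) -> int:
    prefix = path
    while prefix not in known:                   # walk up to the deepest known prefix
        prefix = prefix.rsplit("/", 1)[0] or "/"
    authority, replicas = known[prefix]
    if is_update or not replicas:
        return authority
    return random.choice([authority] + replicas)

print(pick_mds("/home/popular/file.txt", is_update=False))  # 1, 2, or 3
print(pick_mds("/home/popular/file.txt", is_update=True))   # always 1
```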

Ceph's Reliable Autonomic Distributed Object Store (RADOS) achieves linear scaling in both capacity and aggregate performance by delegating management of object replication, cluster expansion, failure detection and recovery to OSDs in a distributed fashion.

In order to avoid imbalance (e.g., recently deployed devices sitting mostly idle or empty), Ceph distributes new data pseudo-randomly and migrates a random subsample of existing data to new devices. This stochastic approach is robust in that it performs equally well under any potential workload. Ceph first maps objects into placement groups (PGs) using a simple hash function, with an adjustable bit mask to control the number of PGs; placement groups are then assigned to OSDs using CRUSH. This differs from conventional approaches (including other object-based file systems) in that data placement does not rely on any block or object list metadata.

To locate any object, CRUSH requires only the placement group and an OSD cluster map: a compact, hierarchical description of the devices comprising the storage cluster. This approach has two key advantages: first, it is completely distributed such that any party (client, OSD, or MDS) can independently calculate the location of any object; and second, the map is infrequently updated, virtually eliminating any exchange of distribution-related metadata.

In doing so, CRUSH simultaneously solves both the data distribution problem ("where should I store data?") and the data location problem ("where did I store data?"). By design, small changes to the storage cluster have little impact on existing PG mappings, minimizing data migration due to device failures or cluster expansion.
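In outline, the two-step placement can be sketched as follows. The hash choice, PG count, and the crush() stand-in are assumptions; the real CRUSH algorithm is described in [29].

```python
# Hedged sketch of the two-step placement: object -> placement group (PG)
# by hashing with an adjustable bit mask, then PG -> OSDs via CRUSH.
import hashlib

PG_BITS = 10                       # 2**10 placement groups (adjustable mask)
PG_MASK = (1 << PG_BITS) - 1

def object_to_pg(object_name: str) -> int:
    h = int.from_bytes(hashlib.sha1(object_name.encode()).digest()[:8], "big")
    return h & PG_MASK             # stable mapping that needs no stored metadata

def crush(pgid: int, cluster_map, replicas=3):
    """Stand-in for CRUSH: deterministically map a PG to an ordered OSD list."""
    osds = sorted(cluster_map["osds"])
    return [osds[(pgid + i) % len(osds)] for i in range(replicas)]

cluster_map = {"epoch": 7, "osds": list(range(12))}
pg = object_to_pg("1234.00000000")
print(pg, crush(pg, cluster_map))  # any party can compute this independently
```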

The cluster map hierarchy is structured to align with the cluster's physical or logical composition and potential sources of failure. For instance, one might form a four-level hierarchy for an installation consisting of shelves full of OSDs, rack cabinets full of shelves, and rows of cabinets.

Each OSD also has a weight value to control the relative amount of data it is assigned. CRUSH maps each PG onto a list of OSDs based on placement rules, which define the level of replication and any constraints on placement. For example, one might replicate each PG on three OSDs, all situated in the same row to limit inter-row replication traffic but separated into different cabinets to minimize exposure to a power circuit or edge switch failure.
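A toy version of such a rule is sketched below. The hierarchy literal and the hash-based selection are simplified stand-ins for real CRUSH buckets and weights, not the actual algorithm.

```python
# Hedged sketch of a placement rule: put all three replicas of a PG in one
# row but in three different cabinets. Structures are illustrative only.
import hashlib

hierarchy = {
    "row1": {"cab1": [0, 1], "cab2": [2, 3], "cab3": [4, 5]},
    "row2": {"cab4": [6, 7], "cab5": [8, 9], "cab6": [10, 11]},
}

def pick(pgid, salt, options):
    """Deterministically pick one option based on the PG id and a salt."""
    h = int(hashlib.sha1(f"{pgid}:{salt}".encode()).hexdigest(), 16)
    return sorted(options)[h % len(options)]

def place(pgid, replicas=3):
    row = pick(pgid, "row", hierarchy)            # one row per PG
    cabinets = sorted(hierarchy[row])
    chosen, i = [], 0
    while len(chosen) < replicas:                 # distinct cabinets in that row
        cab = pick(pgid, f"cab{i}", cabinets)
        i += 1
        if cab not in chosen:
            chosen.append(cab)
    return [pick(pgid, f"osd{c}", hierarchy[row][c]) for c in chosen]

print(place(42))   # three OSDs: same row, different cabinets
```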

The cluster map also includes a list of down or inactive devices and an epoch number, which is incremented each time the map changes. All OSD requests are tagged with the client's map epoch, such that all parties can agree on the current distribution of data. Incremental map updates are shared between cooperating OSDs, and piggyback on OSD replies if the client's map is out of date.
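A rough sketch of the epoch check follows; the message fields, class, and stored deltas are hypothetical, intended only to show that a stale client receives the missing incremental updates piggybacked on a normal reply.

```python
# Hedged sketch: requests carry the sender's cluster-map epoch; if the OSD
# holds a newer map, the missing incremental updates ride back on the reply.
class OSD:
    def __init__(self):
        self.map_epoch = 7
        self.incrementals = {6: "epoch-6 delta", 7: "epoch-7 delta"}

    def handle(self, request):
        reply = {"result": "ok"}
        if request["map_epoch"] < self.map_epoch:
            # Ship only the deltas the client is missing, not the whole map.
            reply["map_updates"] = [
                self.incrementals[e]
                for e in range(request["map_epoch"] + 1, self.map_epoch + 1)
                if e in self.incrementals
            ]
        return reply

print(OSD().handle({"op": "read", "map_epoch": 5}))
```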

To maintain system availability and ensure data safety in a scalable fashion, RADOS manages its own replication of data using a variant of primary-copy replication [ 2 ], while taking steps to minimize the impact on performance. Data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs for n-way replication.

Clients send all writes to the first non-failed OSD in an object's PG (the primary), which assigns a new version number for the object and PG and forwards the write to any additional replica OSDs. After each replica has applied the update and responded to the primary, the primary applies the update locally and the write is acknowledged to the client. Reads are directed at the primary. This approach spares the client of any of the complexity surrounding synchronization or serialization between replicas, which can be onerous in the presence of other writers or failure recovery.
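A simplified sketch of this write path is shown below; synchronous calls stand in for the asynchronous messages of the real system, and the classes are illustrative only.

```python
# Hedged sketch of primary-copy replication for one placement group.
# It shows only the ordering: replicas apply first, then the primary, then
# the acknowledgment to the client.
class ReplicaOSD:
    def __init__(self):
        self.store = {}

    def apply(self, oid, version, data):
        self.store[oid] = (version, data)
        return "applied"

class PrimaryOSD(ReplicaOSD):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas
        self.version = 0

    def client_write(self, oid, data):
        self.version += 1                         # new version for object and PG
        for r in self.replicas:                   # forward to each replica
            assert r.apply(oid, self.version, data) == "applied"
        self.apply(oid, self.version, data)       # apply locally last
        return "ack"                              # acknowledge to the client

pg = PrimaryOSD([ReplicaOSD(), ReplicaOSD()])
print(pg.client_write("1234.00000000", b"hello"))  # 'ack' after full replication
```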

This approach also shifts the bandwidth consumed by replication from the client to the OSD cluster's internal network, where we expect greater resources to be available. Intervening replica OSD failures are ignored, as any subsequent recovery (see Section 5) will reliably restore replica consistency. Acknowledging writes involves two distinct concerns. First, clients are interested in making their updates visible to other clients.

This should be quick: writes should be visible as soon as possible, particularly when multiple writers or mixed readers and writers force clients to operate synchronously. Second, clients are interested in knowing definitively that the data they've written is safely replicated, on disk, and will survive power or other failures. RADOS disassociates synchronization from safety when acknowledging updates, allowing Ceph to realize both low-latency updates for efficient application synchronization and well-defined data safety semantics.

Figure 4 illustrates the messages sent during an object write. The primary forwards the update to replicas, and replies with an ack after it is applied to all OSDs' in-memory buffer caches, allowing synchronous POSIX calls on the client to return. A final commit is sent (perhaps many seconds later) when data is safely committed to disk.

We send the ack to the client only after the update is fully replicated to seamlessly tolerate the failure of any single OSD, even though this increases client latency. By default, clients also buffer writes until they commit to avoid data loss in the event of a simultaneous power loss to all OSDs in the placement group.
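The client-side consequence of separating ack from commit can be sketched as follows; the class and callback names are hypothetical, not the real client API.

```python
# Hedged sketch: the client keeps written data buffered after the ack (update
# visible and replicated in memory) and drops it only on the later commit
# (update durable on disk on the OSDs).
class WriteTracker:
    def __init__(self):
        self.pending = {}            # write id -> buffered data

    def submit(self, wid, data, osd):
        self.pending[wid] = data     # keep a copy until it is safe on disk
        osd.write(wid, data)         # hypothetical OSD client call

    def on_ack(self, wid):
        # Update applied and replicated: a synchronous write() may return,
        # but the buffer is kept in case every replica loses power at once.
        pass

    def on_commit(self, wid):
        # Update is on disk at the replicas: safe to drop our copy.
        self.pending.pop(wid, None)
```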

This is actually a fairly common design among distributed file systems, since a single MDS can manage one or more paths that may cover even petabyte-scale data. The client has to contact the MDS in any case, because it needs to learn where the requested data lives. An important practice in Ceph is that the lookup step is eliminated entirely, further reducing the MDS load. So how does the client know where the data is?

The answer is simply to calculate it. When the client makes a request, the MDS returns an inode that uniquely identifies the file. From that, the client can compute the names of the data objects and, through a special mapping (CRUSH), obtain their locations. Each file's inode also carries three groups of fields: permissions, file size, and immutable attributes ([5], section 4). Pushing the separation of data and metadata this far greatly reduces the load on the MDS.

First, we should be clear about the purpose of dynamic distributed metadata management. The short answer: to reduce metadata disk I/O and to spread metadata traffic. In my view, the paper's description of this part is not entirely clear ([5], section 4). To optimize the update path, each MDS streams its journal directly to the OSD cluster, so that when one node goes down, another node can quickly scan the journal and recover.

I do have some doubts here. Why not give each MDS one or more standby nodes and run a consensus algorithm such as Paxos or Raft? That would also allow quick recovery after a crash and seems simpler. Perhaps it is a trade-off against the extra load this would place on the MDS. The paper's description is also clearly incomplete: log-based storage generally needs log compaction, otherwise the journal accumulates a large number of stale entries.

Transferring metadata through a log also makes all disk operations sequential, which makes persisting the log faster. Now consider how the MDS cluster is organized: each MDS node is responsible for a portion of the file system's directory tree.

Setting everything else aside, the design of this MDS cluster is exciting in its own right: it is a novel decentralized cluster that also supports expansion. GFS, for example, effectively uses a static partition (after all, there is only a single master node), which creates hot spots under unexpected access patterns.

Dynamic subtree partitioning solves this problem elegantly. Each MDS counts accesses to its metadata with counters: any operation increments the count on the affected inode and on all of its ancestors up to the root directory, giving each MDS a weighted tree that describes the recent load distribution. Periodically migrating subtrees keeps the metadata load balanced.

Although updates to replicated metadata are rare, they do eventually occur, and they involve something like a distributed transaction because multiple copies must be modified. Metadata is also spread out when hot spots appear: directories with a heavy read load (for example, many concurrent opens) are replicated to multiple nodes.

Each MDS provides the client not only with the inode but also with replication information for its ancestors, so the next operation can be sent either to the authoritative node or to a random replica (the nodes are peers; replicas simply hold copies of the data), which resolves the hot-spot issue. A question many readers will have at this point is: why not just hash directly? The reason is actually simple: the CRUSH mapping here is one-to-many, i.e. the inode the client obtains from the MDS can be mapped to multiple OSDs, with an intermediate layer (placement groups) in between.

CRUSH tends to favor devices with larger capacity (higher weight). This tendency is intriguing: why not simply pick the largest device every time? Because placement must remain pseudo-random and balanced in proportion to capacity, rather than concentrating data on a single device. As for how an inode maps onto multiple OSDs to achieve decentralization: file data is striped into objects named by inode and stripe number, objects are hashed into placement groups, and CRUSH maps each placement group onto a list of OSDs. Finally, how should we understand "reliable, autonomic distributed object storage"?

As mentioned earlier, Ceph is in principle a distributed file system of unbounded scale, which makes its maintenance burdensome: failure detection, failure repair, cluster changes, and so on. Naturally, we would like Ceph to be designed from the start so that these problems are handled automatically. Let's return to the question above: what is reliable, autonomic distributed object storage?

My understanding is that it means storage that is safe, achieves the consistency the client specifies, recovers from failures automatically, and supports scaling out.
