by F. Schmuck et al., FAST 2002.
Abstract:
GPFS is IBM’s parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
Link to the full paper:
http://www.cse.buffalo.edu/faculty/tkosar/cse710_spring13/papers/gpfs.pdf
With byte-range tokens, multiple parallel writes to the same file are possible, so how does the metadata server handle these requests concurrently? Are there any issues associated with this?
Assuming node1, node2, ... arrive in that order and start writing to the same file at offsets c1, c2, ...,
then at
T1: node1 holds the token for the byte range (0, infinity)
T2: node1: (0, c2), node2: (c2, infinity), assuming c1 < c2
T3: node1: (0, c2), node2: (c2, c3), node3: (c3, infinity)
and so on (see the sketch below).
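To make the negotiation above concrete, here is a minimal sketch of how a token manager might split an existing byte-range token when a new writer arrives, assuming the split point is the newcomer's offset. The class and method names are made up for illustration; this is not GPFS code.

# Hypothetical sketch of byte-range token splitting, not actual GPFS code.
# When a new node asks to write at `offset`, the current holder of the
# conflicting range keeps [start, offset) and the requester gets [offset, end).

INF = float("inf")

class ByteRangeTokenManager:
    def __init__(self):
        # Non-overlapping (start, end, holder) tuples.
        self.tokens = []

    def acquire(self, node, offset):
        """Grant `node` a token starting at `offset`."""
        for i, (start, end, holder) in enumerate(self.tokens):
            if start <= offset < end and holder != node:
                # Split the conflicting token instead of revoking it entirely.
                self.tokens[i] = (start, offset, holder)
                self.tokens.append((offset, end, node))
                return (offset, end)
        # No conflict: grant everything from offset onward.
        self.tokens.append((offset, INF, node))
        return (offset, INF)

mgr = ByteRangeTokenManager()
print(mgr.acquire("node1", 0))    # T1: node1 holds (0, inf)
print(mgr.acquire("node2", 100))  # T2: node1 (0, 100), node2 (100, inf)
print(mgr.acquire("node3", 200))  # T3: node2 (100, 200), node3 (200, inf)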
According to the measurements given in the paper (taken on a 32-node IBM system with 480 disks), write throughput leveled off after 17 nodes due to a problem in the switch adapter microcode.
However, the main point to note in their measurements was that writing to a single file from multiple nodes was just as fast as each node writing to a different file, which demonstrated the effectiveness of the byte-range token protocol.
The author says that "File blocks are assigned to nodes in a round-robin fashion, so that each data block will be read or written only by one particular node. GPFS forwards read and write operations originating from other nodes to the node responsible for a particular data block." If the number of these operations increases, won't it be a potential bottleneck in the system?
Well, that depends on the block size. For fine-grained sharing this is more efficient than distributed locking, because it requires fewer messages than a token exchange, and it avoids the overhead of flushing dirty data to disk when revoking a token.
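As a rough illustration of the forwarding ("data shipping") mode described in that quote, here is a sketch of round-robin block ownership with writes forwarded to the responsible node. The node list, block size, and helper functions are assumptions for the example, not the actual GPFS implementation.

# Illustrative sketch of data shipping: data blocks are assigned to nodes
# round-robin, and a write to a block owned by another node is forwarded
# to the owner instead of acquiring a byte-range token.

NODES = ["node1", "node2", "node3", "node4"]
BLOCK_SIZE = 256 * 1024  # assumed block size for this example

def responsible_node(offset):
    """Round-robin assignment of data blocks to nodes."""
    block_index = offset // BLOCK_SIZE
    return NODES[block_index % len(NODES)]

def write(local_node, offset, data):
    owner = responsible_node(offset)
    if owner == local_node:
        print(f"{local_node}: writing {len(data)} bytes at offset {offset} locally")
    else:
        # One forwarded message; no token revoke, no flush of dirty data.
        print(f"{local_node}: forwarding write at offset {offset} to {owner}")

write("node1", 0, b"x" * 100)          # block 0 is owned by node1: local write
write("node1", BLOCK_SIZE + 10, b"y")  # block 1 is owned by node2: forwarded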
Byte-range tokens can sometimes result in false sharing, right?
Then how does GPFS solve this issue?
Since the smallest unit of I/O is one sector, the byte-range token granularity can be no smaller than one sector; otherwise, two nodes could write to the same sector at the same time, causing lost updates. Hence multiple nodes writing into the same data block will cause token conflicts even if individual write operations do not overlap (“false sharing”).
GPFS uses a shared write lock to avoid this. With it, multiple nodes can append to the same file concurrently, since GPFS elects one node (the metanode) to perform the inode updates.
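A small sketch of why the shared write lock is safe, under the assumption (as described in the paper) that the elected node merges inode updates by keeping the largest file size and latest mtime; the class names below are invented for illustration, not GPFS code.

# Illustrative sketch: writers hold only a shared write lock on the inode,
# buffer their size/mtime updates locally, and the elected node merges them.
# Merging is commutative, so no exclusive lock is needed.

class InodeUpdate:
    def __init__(self, size, mtime):
        self.size = size
        self.mtime = mtime

class ElectedInodeUpdater:
    """The one node chosen to write the inode back to disk."""
    def __init__(self):
        self.size = 0
        self.mtime = 0.0

    def merge(self, update):
        self.size = max(self.size, update.size)     # largest offset written
        self.mtime = max(self.mtime, update.mtime)  # latest modification time

updater = ElectedInodeUpdater()
updater.merge(InodeUpdate(size=4096, mtime=100.0))  # buffered update from node1
updater.merge(InodeUpdate(size=8192, mtime=99.5))   # buffered update from node2
print(updater.size, updater.mtime)  # 8192 100.0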
How is fault tolerance achieved in GPFS?
Fault Tolerance
1) Node failures: periodic heartbeat messages are sent to detect node failures.
2) Communication failures: continued operation could result in a corrupted file system, so the file system is accessible only by the group containing a majority of the nodes in the cluster (see the sketch after this list).
3) Disk failures: files can be replicated.
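Here is a minimal sketch of the majority rule from point 2; it only illustrates the quorum check itself, not the actual GPFS group services protocol.

# Only the partition holding a strict majority of the configured nodes may
# keep the file system accessible; the rest stop to avoid corrupting it.

def has_quorum(reachable_nodes, all_nodes):
    """True if this partition contains a strict majority of the cluster."""
    return len(reachable_nodes) > len(all_nodes) // 2

cluster = ["n1", "n2", "n3", "n4", "n5"]
print(has_quorum(["n1", "n2", "n3"], cluster))  # True: keeps the file system
print(has_quorum(["n4", "n5"], cluster))        # False: access is suspended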
1. Can you explain a little more about communication failures in a two-node configuration, and about disk fencing?
2. When the cluster is broken exactly in half, how is the failure dealt with? (In one of the cluster file systems I am aware of, an extra weight called epsilon is given to one of the nodes; when the system breaks in half, the half containing the node with the epsilon takes over.)
If a nodeset has only two nodes, then losing one of them results in a loss of quorum, and GPFS will attempt to restart its daemons on both nodes. Thus at least three nodes in a nodeset are necessary to prevent shutting down the daemons on all nodes prior to restarting them.
Alternatively, one can specify a single-node quorum when there are only two nodes in the nodeset. In this case, a node failure results in GPFS fencing the failed node, and the remaining node continues operation. This is an important consideration, since a GPFS cluster using RAID can have a maximum of two nodes in the nodeset.
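As a sketch of the two quorum rules described above (majority quorum versus single-node quorum with fencing), here is a small illustration; the single_node_quorum flag and function names are made up for this example and are not actual GPFS configuration options.

# Illustration only: decide whether a group of surviving nodes may keep
# the file system active after a failure or partition.

def survives(reachable, all_nodes, single_node_quorum=False):
    if single_node_quorum and len(all_nodes) == 2:
        # Two-node nodeset with single-node quorum: the surviving node
        # fences the failed node's disks and continues operation.
        return len(reachable) >= 1
    # Default rule: a strict majority of the nodeset is required.
    return len(reachable) > len(all_nodes) // 2

two_nodes = ["n1", "n2"]
print(survives(["n1"], two_nodes))                           # False: quorum lost, daemons restart
print(survives(["n1"], two_nodes, single_node_quorum=True))  # True: fence n2 and continue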
In the case of replication, what happens when there is disk space to store only one copy of new data? In such a scenario, how is a disk failure handled?