Wednesday, February 6, 2013

GPFS: A Shared-Disk File System for Large Computing Clusters


by F. Schmuck et al., FAST 2002.

Abstract:
GPFS is IBM’s parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.

Link to the full paper:
http://www.cse.buffalo.edu/faculty/tkosar/cse710_spring13/papers/gpfs.pdf

13 comments:

  1. With byte-range tokens, multiple parallel writes to the same file are possible, so how does the metadata server handle these concurrent requests? Are there any issues associated with this?

    Replies
    1. Assume node1, node2, ... arrive in that order and start writing to the same file at offsets c1, c2, ... (with c1 < c2 < c3 < ...). Then:
      T1: node1 holds the token for the range (0, infinity)
      T2: node1: (0, c2), node2: (c2, infinity)
      T3: node1: (0, c2), node2: (c2, c3), node3: (c3, infinity)
      and so on. A Python sketch of this range-splitting logic is given at the end of this reply.

      According to the measurements in the paper (taken on a 32-node IBM system with 480 disks), write throughput leveled off after 17 nodes due to a problem in the switch adapter microcode.
      The main point of the measurements, however, was that writing to a single file from multiple nodes was just as fast as each node writing to a different file, which demonstrates the effectiveness of the byte-range token protocol.
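
      Below is a minimal Python sketch of the range-splitting walkthrough above. The class, method, and data layout are illustrative assumptions on my part, not the actual GPFS token manager; it only handles exclusive write tokens that extend to infinity, as in the example.

      INF = float("inf")

      class ByteRangeTokenManager:
          def __init__(self):
              self.tokens = []  # list of (start, end, owner); end is exclusive

          def request_write(self, node, offset):
              """Grant `node` a write token for (offset, INF), shrinking or
              revoking any existing token that overlaps that range."""
              kept = []
              for start, end, owner in self.tokens:
                  if end <= offset:
                      kept.append((start, end, owner))     # no conflict
                  elif start < offset:
                      kept.append((start, offset, owner))  # keep the part below offset
                  # tokens lying entirely above offset are revoked
              kept.append((offset, INF, node))
              self.tokens = sorted(kept)
              return (offset, INF)

      mgr = ByteRangeTokenManager()
      mgr.request_write("node1", 0)      # T1: node1 (0, inf)
      mgr.request_write("node2", 1024)   # T2: node1 (0, 1024), node2 (1024, inf)
      mgr.request_write("node3", 4096)   # T3: node1 (0, 1024), node2 (1024, 4096), node3 (4096, inf)
      print(mgr.tokens)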

  2. This comment has been removed by the author.

  3. The author says that "File blocks are assigned to nodes in a round-robin fashion, so that each data block will be read or written only by one particular node. GPFS forwards read and write operations originating from other nodes to the node responsible for a particular data block". If the number of these forwarded operations increases, won't it be a potential bottleneck in the system?

    Replies
    1. Well, that depends on the block size and on how fine-grained the sharing is. For fine-grain sharing, this data-shipping approach is more efficient than distributed locking, because forwarding an operation requires fewer messages than a token exchange, and it avoids the overhead of flushing dirty data to disk when a token is revoked. A sketch of the round-robin assignment follows below.
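
      Below is a minimal Python sketch of the round-robin block-to-node assignment described in the quoted passage. The function names and the forwarding stub are illustrative assumptions, not GPFS code.

      def responsible_node(block_index, nodes):
          """Round-robin: block i is handled by node i mod N."""
          return nodes[block_index % len(nodes)]

      def write(offset, data, block_size, nodes, self_node):
          """Perform the write locally if we own the block, otherwise ship it."""
          target = responsible_node(offset // block_size, nodes)
          if target == self_node:
              return ("local write", target)
          return ("forwarded to", target)  # data shipping: send the data to target

      nodes = ["node1", "node2", "node3", "node4"]
      print([responsible_node(b, nodes) for b in range(6)])
      # ['node1', 'node2', 'node3', 'node4', 'node1', 'node2']
      print(write(5 * 262144, b"payload", 262144, nodes, "node1"))
      # ('forwarded to', 'node2')

      Because every node can compute this mapping independently, there is no central server to overload; the forwarding load is spread round-robin across the nodes.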

  4. Byte-range tokens can sometimes result in false sharing, right?
    Then how does GPFS solve this issue?

    Replies
    1. Since the smallest unit of I/O is one sector, the byte-range token granularity can be no smaller than one sector; otherwise, two nodes could write to the same sector at the same time, causing lost updates. Hence multiple nodes writing into the same data block will cause token conflicts even if the individual write operations do not overlap (“false sharing”).
      For fine-grain sharing GPFS avoids this with the data shipping mode described above, where each block is read and written by a single responsible node. For updates to the inode itself, GPFS uses a shared write lock, so multiple nodes can append to the same file concurrently, with one elected node (the metanode) responsible for writing the inode back to disk. The sketch below shows how non-overlapping writes in the same block still conflict at block granularity.
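
      A small Python illustration of the point above (the sector and block sizes are arbitrary example values, not GPFS defaults): two writes to disjoint byte ranges are independent at sector granularity but conflict once ranges are rounded to whole data blocks.

      SECTOR = 512          # smallest unit of I/O
      BLOCK = 256 * 1024    # example data block size

      def rounded_range(offset, length, granularity):
          """Round a byte range outward to multiples of granularity."""
          start = (offset // granularity) * granularity
          end = -(-(offset + length) // granularity) * granularity  # ceiling
          return (start, end)

      def overlap(a, b):
          return a[0] < b[1] and b[0] < a[1]

      w1 = (0, 4096)      # node1 writes bytes [0, 4096)
      w2 = (8192, 4096)   # node2 writes bytes [8192, 12288)

      # Disjoint at sector granularity ...
      print(overlap(rounded_range(*w1, SECTOR), rounded_range(*w2, SECTOR)))  # False
      # ... but both writes fall into the same 256 KB data block, so they conflict.
      print(overlap(rounded_range(*w1, BLOCK), rounded_range(*w2, BLOCK)))    # True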

  5. This comment has been removed by the author.

  6. How is fault tolerance achieved in GPFS?

    Replies
    1. Fault tolerance is handled at three levels:
      1) Node failures: periodic heartbeat messages are used to detect failed nodes.

      2) Communication failures: continued operation after a network partition could corrupt the file system, so the file system remains accessible only to the group containing a majority of the nodes in the cluster (see the quorum sketch below).

      3) Disk failures: files can be replicated.
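
      A one-function Python sketch of the majority rule in point 2; the function is illustrative, not GPFS code.

      def has_quorum(group_size, cluster_size):
          """A group may access the file system only if it holds a strict
          majority of the nodes in the cluster."""
          return group_size > cluster_size // 2

      # An 8-node cluster partitioned 5/3: only the larger group keeps access.
      print(has_quorum(5, 8), has_quorum(3, 8))   # True False
      # A 4/4 split leaves neither group with a majority.
      print(has_quorum(4, 8))                     # False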

  7. 1. Can you explain a little more about communication failures in a two-node configuration and about disk fencing?
    2. When the cluster is broken exactly in half, how is the failure dealt with? (In one of the cluster file systems I am aware of, an extra weight called epsilon is given to one of the nodes. When the system breaks in half, the half containing the node with the epsilon takes over.)

    Replies
    1. If a nodeset has only two nodes, then losing one of them results in a loss of quorum, and GPFS will attempt to restart its daemons on both nodes. Thus at least three nodes in a nodeset are needed to avoid shutting down the daemons on all nodes before restarting them.
      Alternatively, one can specify a single-node quorum when there are only two nodes in the nodeset. In that case, a node failure results in GPFS fencing the failed node, and the remaining node continues operation. This is an important consideration, since a GPFS cluster using RAID can have at most two nodes in a nodeset. A small decision sketch follows below.
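
      A minimal Python sketch of the behaviour described above; the function name, arguments, and return strings are illustrative assumptions, not GPFS code or configuration names.

      def on_node_failure(nodeset_size, surviving, single_node_quorum=False):
          """What the surviving nodes do after a failure, per the rules above."""
          if nodeset_size == 2 and single_node_quorum:
              # Two-node nodeset with single-node quorum: fence the failed
              # node and let the survivor continue.
              return "fence failed node, continue"
          if surviving > nodeset_size // 2:
              return "quorum held, continue"
          return "quorum lost, restart GPFS daemons"

      print(on_node_failure(2, 1))                           # quorum lost, restart GPFS daemons
      print(on_node_failure(2, 1, single_node_quorum=True))  # fence failed node, continue
      print(on_node_failure(3, 2))                           # quorum held, continue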

  8. In the case of replication, what happens when there is disk space to store only one copy of new data? In such a scenario, how is a disk failure handled?
