Introduction
The GFS is a file system based on the distributed systems and designed by Google for the
map reduce
module. It has the essential elements of the distributed systems, such as
fault-tolerant, replica, etc.
So that, it is different from standard filesystems. And in the design, GFS will not provide a POSIX interface, so you will only access it using a special SDK.
And Google already deprecated the GFS and designed a new file system, but the GFS still is valuable for learning distributed systems.
Design
File operation
The GFS supports two file operation methods: file download and file upload, and the file upload contain two different kinds: file append and file write.
The file append is very clear. It is appended your data to the end of the file
(actually the end of the chunk), so you will not need to specify the offset that you want
to write because it is always the end offset of the file. The file append also provides an
atomic append operation called recoard append
, and I will illustrate it later.
The file write means you can write your data to a specific offset of the file, so you need to provide the offset of the file and file data you want to write.
File structure
The GFS does not have an ordinary file structure. It splits a file into multiple chunks (if your file is big enough) and stores them in multiple replicas called chunk servers.
Also, GFS uses the namespace to emulate the folder of the file system. A file can be under
a specific namespace such as: parent
, and for the parent
namespace, it could have a
namespace for itself, such as: root.
So the full path of the file will be like
this: /root/parent/file
.
Master node
The GFS designed the master node used to manage the location and other data of your files, and in the design part, the master is a single node, which means that only one fully functional master can provide service for clients and chunk servers.
It seems the single master will decrease the availability. But it can simplify the design so that it will not need to consider too much data consistency. And for the availability, GFS used the shadow master to improve the scalability and availability.
The shadow master is just like a master, but it can only provide a limited service for clients, so that means that when the master node is down, clients can talk to the shadow master to do some read-only logic such as downloading files.
Metadata
The master periodically sends the heartbeat message to all chunk servers. The heartbeat message can use to give instructions to chunk servers and collect information. After the master gets that information from chunk servers, it will store that information in memory, and a part of the information in persistent storage called operation log.
About the metadata that master saved, you can see below the bullet list, and for the description of some terms, I will explain it in the later chapter.
- The mapping from files to chunks
- The current locations of chunks
- The file and chunk namespaces
- Access control information
- Chunk version number
The operation log is a way to save the metadata to persistent storage. And the master does not store all data to persistent storage, such as the current locations of chunks, because it can collect from the heartbeat response.
When the master starts, it will try to replay the operation log to recover the previous state, so reducing the operation log data can improve the master start speed.
Data flow
As mentioned above, the GFS constructed by multiple services, and they work together to provide the file service for clients. You can see below the data flow to know approximately the interaction flow between different roles.
File transfer flow
First, let us start from the file transfer flow to learn how GFS provides file store service for other clients.
As the data flow illustration, the service components are master and chunk servers, probably have multiple shadow masters if you want to own the availability and scalability.
So, in this chapter, we will figure out how components interact with each other and how the client interacts with these services.
Terms
Here are some terms that will appear in the later file transfer flow.
Term | Description |
---|---|
Chunk size | One file will separate into multiple chunks and stored to multiple chunk servers according to the chunk size. The value of chunk size is configurable. |
Chunk handle | Ideally, it should be the unique value for each chunk assigned by the master when the chunk is created. |
Chunk version number | It is for stale replica detection. And the value will be updated when the master grants a new lease on a chunk. |
Upload file
The above picture illustrates the upload file flow.
The client sends the filename which the client wants to upload and chunk index to the master.
The master replies will include a chunk handle, chunk servers, and other stuff. It also includes the primary and secondary servers.
Then the client starts to upload the file data to all chunk servers. Here is some optimization for the upload process: When the client uploads file data to the closest chunk server, the chunk server will forward the file data to another closest chunk server which not receive the file data. This process can avoid network bottlenecks and high-latency links.
Those chunk servers will not write the file data to the local disk directly. Instead, they will write the data to the LRU cache and wait for the write request to write the data to the local disk.
After uploading the file data to all chunk servers, the client will talk to the primary chunk server to send the write request to finish the upload process. And the primary will assign consecutive serial numbers to all the mutations it receives.
The primary will forward the write request to all secondaries with the serial number to avoid data inconsistency and wait for the secondaries’ replies.
After the primary gets replies from secondaries, it will reply to the client, and the replies also include the error information if the secondaries encounter any errors. And after encountering errors, the modified region will be in an inconsistent state. The client can fix it by retrying.
Download file
The download flow is more straightforward than the upload flow. It just asks the chunk server address from the master and tries to download file data.
The client asks for chunk servers’ addresses by sending the filename and chunk index to the master.
The master will replay the chunk handle, chunk servers’ addresses, chunk version number, etc., to the client.
The client will connect to an arbitrary chunk server to download file data using the chunk handle and offset. If the chunk version number of the chunk server is lower than the up-to-date chunk version number, that means the chunk server is a stale replica, and the client should connect to another chunk server.
After the client downloaded the file data, the client should do the checksum calculation and compare the result with the checksum of the chunk server. It can avoid disk hardware failure or unstable networks.
Consistency
Now, we know the upload/download flow, and the file will save to multiple replicas to guarantee the file will not lose. We still have a question: how the GFS avoid and deal with the chunk different with other replicas? So, in this part, let us learn about the consistency of the GFS.
File upload
Like I described above, the file upload has split into two steps: upload file data to chunk servers and send the write request to the primary chunk server. In step two, the primary chunk server will forward your write request to the rest chunk servers.
After the chunk servers received the file data, they will save the file data into an LRU buffer in cache to waiting for flush to disk. So in the first step, no data will save to disk, and if the chunk server is down.
In step two, the write request can identify the file data upload earlier to all chunk servers. And the primary will assign consecutive serial numbers to all the file data that the primary received, which provides the necessary serialization. It can avoid different chunk servers writing the file data to disk in different orders. Then the primary will forward the written request in a specific order and wait for a reply.
Also, the GFS provides an atomic append operation called record append
. In a traditional
write operation, the client needs to specify the offset of the file data, so the
concurrent write seems impossible because we are not sure how the GFS will save the file
data in which order. But for the append operation, we only expect the data to add to the
end of the chunk, which means we do not care about the order of the data.
And in the atomic append, the write request of primary send to secondaries also includes the offset information, which indicates the secondaries write the file data to the exact position of the chunk with the primary and other secondaries even there already have data. So that the GFS can guarantee all the append operations in secondaries are the same.
Stale chunk
Let me show you a new example: if a chunk server stored a bunch of chunks and for some reason, it lost the communication with master, in this period, some chunks have been modified, and some replicas of the chunks are in the chunk server. If the chunk server is back online, it can communicate with the master again, but as you know, some chunks have been stale. The inconsistency has already happened.
In this case, the GFS used the chunk version number to detect the stale chunk in each chunk server and sync the newest one to other chunk servers until the number of fresh chunks is the same as our configuration. So, how does the GFS maintains the chunk version number?
When a client wants to write something to the GFS, it will talk with the master first. The master will increase the chunk version number and broadcast the notification to all related chunk servers, and those chunk servers will increase the chunk version number held by themselves.
So if a chunk server is not available in this period, it will not receive the notification, so they will not increase the chunk version number. And in the following information collection, the chunk server will send the old chunk version number to the master (if the chunk server is back to available), and then the master will know the chunk server contains how many stale chunks.
In the download part, the client will get the chunk version number and chunk server list from the master, so when the client talks with a chunk server, it will check the chunk version number, if the version number is lower than the master returned, it will switch to next chunk server. So that it can guarantee the client can always get the new data.
Summary
So, we knew the basic concept of GFS:
It uses the single master design to simplify the architecture, the multiple replicas and shadow master to implement the high availability and reliability.
The upload and download part involves the primary chunk server to reduce the communication between the client and the master to avoid the bottleneck.
The GFS was great because it is simple and can solve many distributed systems issues by using a simple design. In a sense: Simplicity is power.