The Google File System vs HDFS

Google's paper on their file system describes GFS as a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

GFS formed the basis for the development of HDFS, which is more or less an open-source implementation of GFS.

GFS Assumptions

  • Scale OUT, not UP: prefer using commodity hardware to exotic hardware

  • High component failure rate: it is assumed that components of commodity hardware fail all the time

  • Modest number of huge files: it is assumed that there are a few million files, each typically multiple gigabytes in size

  • Files are write-once, mostly appended to

  • Large streaming reads over random access: the workload favours sequential reads, because the high sustained throughput of streaming reads matters more than the low latency that random access offers (see the sketch after this list)
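
To make the assumed access pattern concrete, here is a minimal sketch of a large streaming read written against the Hadoop FileSystem Java API (GFS itself is not publicly available, so HDFS stands in for it). The path and buffer size are illustrative assumptions, not values from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the assumed workload: one long sequential scan of a huge file,
    // rather than many small random reads.
    public class StreamingReadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses core-site.xml / hdfs-site.xml
        Path file = new Path("/user/example/events.log");    // hypothetical path

        byte[] buffer = new byte[8 * 1024 * 1024];            // large read buffer (assumption)
        long total = 0;
        try (FSDataInputStream in = fs.open(file)) {
          int n;
          while ((n = in.read(buffer)) > 0) {                 // sequential, streaming read
            total += n;                                       // process the buffer here
          }
          // in.seek(offset) exists for random access, but the design assumes it is rare
        }
        System.out.println("Read " + total + " bytes sequentially");
      }
    }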

GFS Design Decisions

  • Files stored as chunks: Fixed size (64 MB)

  • Reliability through replication: Each chunk is replicated across 3+ "chunkservers" (see the HDFS sketch after this list)

  • Single master to coordinate access, keep metadata: Simple centralized management

  • No data caching: Little benefit due to large datasets, streaming reads

  • Simplify the API: Push some of the issues onto the client (e.g., data layout)
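
The first two design decisions map directly onto HDFS settings. The following sketch uses the Hadoop FileSystem Java API to create a file with an explicit 64 MB block size and a replication factor of 3; the path and buffer size are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the HDFS equivalents of two GFS design decisions:
    // fixed-size chunks (blocks) and replication across several nodes.
    public class BlockAndReplicationSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/events.log");  // hypothetical path
        long blockSize = 64L * 1024 * 1024;                 // 64 MB, mirroring the GFS chunk size
        short replication = 3;                              // 3 replicas, mirroring GFS
        int bufferSize = 4096;

        // create(Path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
          out.writeBytes("one large, sequentially written file\n");
        }
      }
    }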

From GFS to HDFS

As mentioned earlier, HDFS is based on GFS. However, there are a few key differences between the two:

Terminology differences

  • GFS Master = Hadoop NameNode

  • GFS chunkservers = Hadoop DataNodes (a sketch below shows both roles in action)
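
To see both roles in one place, the sketch below asks the NameNode (playing the GFS master role) which DataNodes (playing the chunkserver role) hold each block of a file. The path is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: list each block of a file and the DataNodes holding its replicas,
    // using metadata served by the NameNode.
    public class WhereAreMyBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/events.log");   // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
          // Offset of this block within the file, and the hosts storing its replicas
          System.out.println(block.getOffset() + " -> "
              + String.join(", ", block.getHosts()));
        }
      }
    }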

Functional differences

  • No file appends in HDFS (at the time; later Hadoop releases added append support, shown in the sketch after this list)

  • HDFS performance is (likely) slower
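
For completeness, the sketch below shows what an append looks like on a cluster that does support it (Hadoop 2.x and later); on the early HDFS versions described here this call would simply fail. The path is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch, assuming a cluster where append is enabled: later HDFS releases
    // expose FileSystem.append(), closing this gap with GFS.
    public class AppendSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/events.log");   // hypothetical existing file

        try (FSDataOutputStream out = fs.append(file)) {
          out.writeBytes("one more record appended to an existing file\n");
        }
      }
    }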

The next section introduces the Data Types in Hadoop.
