The Google File System vs HDFS

Google's paper on their file system describes GFS as a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

GFS formed the basis for the development of HDFS, which is more or less an open-source implementation of GFS.

GFS Assumptions

  • Scale OUT, not UP: prefer using commodity hardware to exotic hardware

  • High component failure rate: it is assumed that components of commodity hardware fail all the time

  • Modest number of huge files: it is assumed that there are a few million files, each typically multiple gigabytes in size

  • Files are write-once, mostly appended to

  • Large streaming reads over random access: the workload favours sequential reads, because the high sustained throughput of streaming reads matters more than the low latency that random access offers (see the sketch after this list)
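
To make the assumed access pattern concrete, here is a minimal sketch of a large streaming read written against the Hadoop FileSystem Java API (GFS itself is not publicly available, so HDFS stands in for it). The path and buffer size are illustrative assumptions, not values from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the assumed workload: one long sequential scan of a huge file,
    // rather than many small random reads.
    public class StreamingReadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses core-site.xml / hdfs-site.xml
        Path file = new Path("/user/example/events.log");    // hypothetical path

        byte[] buffer = new byte[8 * 1024 * 1024];            // large read buffer (assumption)
        long total = 0;
        try (FSDataInputStream in = fs.open(file)) {
          int n;
          while ((n = in.read(buffer)) > 0) {                 // sequential, streaming read
            total += n;                                       // process the buffer here
          }
          // in.seek(offset) exists for random access, but the design assumes it is rare
        }
        System.out.println("Read " + total + " bytes sequentially");
      }
    }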

GFS Design Decisions

  • Files stored as chunks: Fixed size (64 MB)

  • Reliability through replication: Each chunk is replicated across 3+ "chunkservers" (see the HDFS sketch after this list)

  • Single master to coordinate access, keep metadata: Simple centralized management

  • No data caching: Little benefit due to large datasets, streaming reads

  • Simplify the API: Push some of the issues onto the client (e.g., data layout)
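
The first two design decisions map directly onto HDFS settings. The following sketch uses the Hadoop FileSystem Java API to create a file with an explicit 64 MB block size and a replication factor of 3; the path and buffer size are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the HDFS equivalents of two GFS design decisions:
    // fixed-size chunks (blocks) and replication across several nodes.
    public class BlockAndReplicationSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/events.log");  // hypothetical path
        long blockSize = 64L * 1024 * 1024;                 // 64 MB, mirroring the GFS chunk size
        short replication = 3;                              // 3 replicas, mirroring GFS
        int bufferSize = 4096;

        // create(Path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
          out.writeBytes("one large, sequentially written file\n");
        }
      }
    }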

From GFS to HDFS

As mentioned earlier, HDFS is based on GFS. However, there are a few key differences between the two:

Terminology differences

  • GFS Master = Hadoop NameNode

  • GFS chunkservers = Hadoop DataNodes (a sketch below shows both roles in action)
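
To see both roles in one place, the sketch below asks the NameNode (playing the GFS master role) which DataNodes (playing the chunkserver role) hold each block of a file. The path is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: list each block of a file and the DataNodes holding its replicas,
    // using metadata served by the NameNode.
    public class WhereAreMyBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/events.log");   // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
          // Offset of this block within the file, and the hosts storing its replicas
          System.out.println(block.getOffset() + " -> "
              + String.join(", ", block.getHosts()));
        }
      }
    }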

Functional differences

  • No file appends in HDFS (at the time; later Hadoop releases added append support, shown in the sketch after this list)

  • HDFS performance is (likely) slower
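
For completeness, the sketch below shows what an append looks like on a cluster that does support it (Hadoop 2.x and later); on the early HDFS versions described here this call would simply fail. The path is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch, assuming a cluster where append is enabled: later HDFS releases
    // expose FileSystem.append(), closing this gap with GFS.
    public class AppendSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/events.log");   // hypothetical existing file

        try (FSDataOutputStream out = fs.append(file)) {
          out.writeBytes("one more record appended to an existing file\n");
        }
      }
    }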

The next section introduces the Data Types in Hadoop.
