I met a few customers last month and realized that Hadoop is finally mainstream (here in India at least). I may be wrong here; probably my circle of knowledge is too small, or it has been mainstream for some time and all this while I was in my cave. That being said, a little blog about Hadoop won’t hurt the internet. Also, it has been a while since I wrote. With Big Data being one of the hottest trends in IT right now (or is it #AI, #ML?), we all need to become more comfortable and familiar with these industry terms, so I thought it would be useful to put something together about Hadoop. This blog will tell you what Hadoop is and, briefly, what the offerings are.
In a nutshell Hadoop is….
An open-source Java framework that brings compute as physically close to the data as possible. Hadoop basically consists of two parts: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.
What is HDFS?
It takes a file and splits it up into many small ‘chunks’ (blocks), then distributes and stores these blocks across many servers. These blocks are also replicated (the number of copies required can be specified by the application) for fault tolerance. Storing data across multiple nodes in this way boosts performance and gets you faster results, since queries run against the distributed data in parallel. HDFS uses an intelligent placement model for reliability and performance; optimizing placement is what makes HDFS unique among most distributed file systems. An important thing to remember about HDFS is that it is more like a file system and less like a database. Each block is generally 128 MB, or can be even larger (you can actually set this value, and this is important, as we will see later).
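To make the chunking and replication idea concrete, here is a toy Python sketch. This is not HDFS code – the node names, the round-robin placement, and the function name are all made up for illustration – but the block size and replication factor mirror the configurable values a real cluster would use.

```python
# Toy illustration of HDFS-style chunking and replication (NOT actual HDFS code).
# Block size and replication factor are configurable, just like dfs.blocksize
# and dfs.replication in a real cluster.

def chunk_and_place(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` nodes."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Naive round-robin placement; real HDFS uses a rack-aware policy.
        placement[block_id] = [nodes[(block_id + r) % len(nodes)]
                               for r in range(replication)]
    return placement

# A 1 GB file with 128 MB blocks -> 8 blocks, each stored on 3 of the 5 nodes.
plan = chunk_and_place(file_size=1024 * 1024 * 1024,
                       block_size=128 * 1024 * 1024,
                       nodes=["node1", "node2", "node3", "node4", "node5"])
print(len(plan))   # 8 blocks
print(plan[0])     # ['node1', 'node2', 'node3']
```

Losing one node here costs nothing, because every block still has two other copies – which is exactly the fault-tolerance argument above.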
What is MapReduce?
This works on top of HDFS – after HDFS has distributed the data to many different servers, MapReduce sends a fragment of a program (“a piece of work”) to each server to execute. So, in a nutshell, MapReduce is a framework that enables you to write an application that processes vast amounts of data in parallel by “sharing out” the work to be completed across a large number of servers (collectively referred to as a cluster). The basic idea is that you divide the job into two parts: a Map and a Reduce.
Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines (it is possible for machines to re-distribute the work further, leading to a multi-level structure). Each machine processes its piece of work and sends its answer back up the structure.
Reduce then collects all the answers and combines them back together in some way to get a single answer to the original problem it was trying to solve.
The key to how MapReduce does “things” is to take input (for example, a list of records); the records are then split amongst the machines by the map step. The map computation produces a list of key/value pairs (basically a set of two linked data items – the key is the unique identifier and the value is either the data itself or a link to the data). Reduce then collates all of the pairs: it looks for duplicate keys and merges them. So Map takes a set of data chunks and produces key/value pairs; Reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can’t tell whether the job was split into 100 pieces or 2 pieces; the end result looks pretty much like the result of a single map. Hadoop is meant for cheap commodity hardware, where scaling takes place by simply adding more cheap hardware.
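The map-emits-pairs, reduce-merges-duplicates flow above can be sketched in a few lines of plain Python. This is a minimal in-process word-count illustration, not Hadoop API code – the function names and sample records are invented for the example.

```python
# Minimal in-process sketch of the map/reduce idea (no Hadoop required):
# map emits key/value pairs, reduce merges the values that share a key.
from collections import defaultdict

def map_phase(records):
    """Emit a (word, 1) pair for every word, like a word-count mapper."""
    pairs = []
    for record in records:
        for word in record.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    """Collate pairs by key and merge (sum) the values of duplicate keys."""
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return dict(merged)

records = ["big data", "big results", "big data platform"]
counts = reduce_phase(map_phase(records))
print(counts)  # {'big': 3, 'data': 2, 'results': 1, 'platform': 1}
```

Note that splitting `records` into 2 chunks or 100 chunks and mapping each chunk separately would give the same final result once reduce merges the pairs – which is the point made above about not being able to tell how the job was split.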
What problems does HADOOP address?
Well, I guess the first thing to ask yourself is: “What is a big data problem?” The way I see it, “big data” gives our customers an opportunity to use the data streaming in from all around to change the way they engage with their own customers. This can be anything: suggesting the right content on a video website, or deciding the sentiment of a community (streaming analytics). So, as you can see, the information Hadoop provides is of extreme value to the business, and hence it is very important to choose the right platform for it. I am no expert on Hadoop, but what I have read and learnt from my customers is that Hadoop data is very valuable and its results are as good as gold, since they provide RATIONAL FEEDBACK into the business, often in real time. Taking the above importance of data into consideration, here are some design criteria I think are required for Hadoop:
- Scalable – both on compute and storage
- Data integrity
- Performance – both on IOPS and Throughput
- Ability to tier data to an HDFS-enabled system for minimum ETL (Extract, Transform, Load) – more on this later.
- Reliable (to some extent) – come on, we have multiple copies :)
- Data protection solution – because you still have to ward off logical errors, loss of the name node, etc.
Now, I could write about all the attributes mentioned above, but I guess you all probably know this, so I will not waste time. What organizations often overlook is a sound and stable data protection strategy for Hadoop. This applies whether you run open-source Hadoop, Cloudera, or anything else. No matter what you have, YOU NEED BACKUP. Here is why —
Backup and recovery are more challenging and complicated than ever, especially with the explosion of data growth and the increasing need for data security today. Imagine big players such as Facebook, Yahoo! (the first to implement Hadoop), eBay, and more; how challenging it must be for them to handle unprecedented volumes and velocities of unstructured data, something which traditional relational databases can’t handle. To emphasize the importance of backup, let’s look at a study conducted in 2009. This was the time when Hadoop was still evolving and a handful of bugs still existed in it. Yahoo! had about 20,000 nodes running Apache Hadoop in 10 different clusters, and HDFS lost only 650 blocks out of 329 million total blocks. These blocks were lost due to bugs in the Hadoop package. Imagine what the scenario would be now – I am sure you would bet on losing hardly a block. As a backup manager, your utmost goal is to design and execute a foolproof backup strategy capable of retrieving data after any disaster. Simply put, the plan is to protect the files in HDFS against disastrous situations and restore them to their normal state.
- Even though there are multiple copies of the data in Hadoop (normally 3, but this can be configured for more or less), there is still a chance that a logical error corrupts every copy of your HDFS data and stops your “analytics project”. Yes, you may still have the data in its native format, but you will have to put it into HDFS once again (Extract, Transform, Load) – imagine doing this for 10 TB / 100 TB / 1 PB of data. No, you cannot think of a small data size, since Hadoop does not work that way; you need big data for big results. Let me add to the misery: what if you had data streaming into HDFS from Flume (think of it as an application that lets raw data from a generating program – say, a Twitter feed – land directly in HDFS for processing)? Guess what: this little human error / bug just made you lose all of it, data which you cannot get back. Logical errors and human errors are real for Hadoop as well.
- All the metadata pertaining to HDFS, MapReduce, and other jobs in Hadoop is stored in the NAME NODE. The name node generally sits on a different server, and loss of this server alone can be fatal. The latest Hadoop deployments have multi-name-node environments, but they too are prone to deletion of data by a bug, human error, or a malicious user.
- In most cases, Hadoop’s trash mechanism is not ON by default, and even when it is, it does not protect against logical errors.
- Ransomware and malicious users are some other situations which should force you to look at a data protection strategy for the Hadoop infrastructure. A clever piece of malware will not only encrypt your complete HDFS file system but will also run the encryption job against your DR site (if you have one for Hadoop). There will be no way out of this if you do not have a backup.
There are more reasons to protect Hadoop, but above all is the fact that if you consider it important enough, you will back it up. That being said, Hadoop data protection is not easy – remember, we have terabytes of data, and to make matters worse, we cannot use TAPE, our best friend. Before I get into the details, I would like to give Hadoop a chance, since it has some of its own utilities for data protection:
Onboard Data Protection:
- HDFS Snapshots –
Their biggest issue is that they are READ ONLY (just kidding – all snapshots, except a few, are supposed to be and are read only), but the fact that they work at the DIRECTORY LEVEL really limits the granularity of restore. The news just keeps on getting worse from here:
- Snapshots work at the directory level – GLR (Granular Level Restore) is not really possible.
- The data owner owns the snapshots as well – so he / she can delete them too; not a good design.
- Not really HDFS-consistent.
- Preserves consistency only on file close! (Open files do not get backed up.) This results in partial backups, since there are normally many open files in Hadoop.
- HDFS Trash Box –
Your normal recycle bin, only this time it’s for Hadoop. It protects against erroneous deletions, but has some flaws which clearly make it a non-contender as a proper, reliable backup solution.
- Deleted files reside in the trash only for a limited time and are automatically purged after that.
- Implemented in HDFS – not a good solution; if HDFS fails, trash fails.
- A data user with the right privileges can move or delete the contents of the trash – this is not how backups work.
- Not enabled by default.
Off board Data Protection:
- Distributed copy (DISTCP) –
Distributed Copy is a mechanism to copy data from one Hadoop cluster to another, which makes it seem more like a replication solution than a backup solution – and REPLICATION IS NOT BACKUP. That being said, it can create a copy of Hadoop data on an NFS share, though not on TAPE. To understand why, you can read this: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hdfs_nfsgateway.html This technology is designed to copy large volumes of data either within the same Hadoop cluster, to another Hadoop cluster, or to Amazon S3 (or an S3-compliant object store), OpenStack Swift, FTP, or NAS storage visible to all Hadoop data nodes. A key benefit of distcp is that it uses Hadoop’s parallel processing model (MapReduce) to carry out the work. This is important for large Hadoop clusters, as a protection method that does not leverage the distributed nature of Hadoop is bound to hit scalability limits.
One downside of the current distcp implementation is that single-file replication performance is bound to one map task and one data node respectively. Each file requiring replication can only consume the networking resources available to the data node running the corresponding map task. For many small files this is not a problem, as many map tasks can be used to distribute the workload. However, for the very large files common in Hadoop environments, the performance of individual file copies will be limited by the bandwidth available to individual data nodes. For example, if data nodes are connected via 1 GbE and the average file size in the cluster is 10 TB, then the minimum amount of time required to copy one average file is about 22 hours. This means replication is not really a viable solution for large Hadoop clusters.
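The 22-hour figure is easy to verify with back-of-the-envelope arithmetic. The sketch below assumes the full 1 GbE link speed is usable (real-world throughput would be lower still due to protocol overhead) and uses decimal units (10 TB = 10 × 10^12 bytes):

```python
# Back-of-the-envelope check of the single-file copy time quoted above.
# Assumes 1 GbE = 1e9 bits/s of usable bandwidth; real throughput is lower.
file_size_bytes = 10 * 10**12     # 10 TB (decimal units)
link_bits_per_sec = 1 * 10**9     # one 1 GbE data-node link
seconds = file_size_bytes * 8 / link_bits_per_sec
hours = seconds / 3600
print(round(hours, 1))  # ~22.2 hours for one 10 TB file over a single link
```

Since distcp pins each file's copy to one map task on one data node, no amount of extra cluster hardware shortens this per-file transfer.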
Now that it’s settled that Hadoop cannot protect itself with its native solutions, what would be the best way to protect it?
I just realized that this is more than 2000 words already (yes, I can count while I type!). I will continue this topic in another blog, where we will see how we can actually protect Hadoop at a granular level, with minimal resources, while maintaining application consistency and reliable restores.
Read more about Hadoop: https://hadoop.apache.org