Big data refers to data sets so voluminous and complex that traditional data-processing software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sourcing.
To understand the phenomenon of big data, it is often described using five Vs: Volume, Velocity, Variety, Veracity, and Value.
Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips, and sensor data we produce and share every second. We are no longer talking terabytes but zettabytes or brontobytes. On Facebook alone, users send 10 billion messages, click the “like” button 4.5 billion times, and upload 350 million new pictures every day.
Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds, the speed at which credit card transactions are checked for fraudulent activity, or the milliseconds it takes trading systems to analyse social media networks to pick up signals that trigger decisions to buy or sell shares. Big data technology now allows us to analyse the data while it is being generated, without ever putting it into databases.
Variety refers to the different types of data we can now use. In the past we focused on structured data that neatly fits into tables or relational databases, such as financial data (e.g. sales by product or region). Today, however, an estimated 80% of the world’s data is unstructured and therefore cannot easily be put into tables (think of photos, video sequences or social media updates). With big data technology we can now harness different types of data (structured and unstructured), including messages, social media conversations, photos, sensor data, and video or voice recordings, and bring them together with more traditional, structured data.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the varying reliability and accuracy of content), but big data and analytics technology now allows us to work with these types of data. The volumes often make up for the lack of quality or accuracy.
Value: there is one more V to take into account when looking at big data: value. It is all well and good having access to big data, but unless we can turn it into value it is useless. So you can safely argue that value is the most important V of big data. It is important that businesses make a business case for any attempt to collect and leverage big data; it is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of costs and benefits.
2.1. Introduction
- Apache Spark – Spark is an open-source big data framework that provides a fast, general-purpose data-processing engine. Spark is designed for fast computation and covers a wide range of workloads, for example batch, interactive, iterative, and streaming.
- Hadoop MapReduce – MapReduce is also an open-source framework for writing applications. It processes structured and unstructured data stored in HDFS. Hadoop MapReduce is designed to process large volumes of data on a cluster of commodity hardware, and it processes data in batch mode.
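The MapReduce programming model can be illustrated without a Hadoop cluster. The sketch below is a minimal, hypothetical simulation of the map, shuffle, and reduce phases of a word count in plain Python; a real Hadoop job distributes these phases across HDFS blocks and cluster nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["spark is fast", "hadoop mapreduce is batch"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In real Hadoop, the intermediate pairs between each phase are written to disk, which is exactly the overhead Spark avoids by keeping them in memory.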
2.2. Speed
- Apache Spark – Spark is a lightning-fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. Spark makes this possible by reducing the number of read/write cycles to disk and storing intermediate data in memory.
- Hadoop MapReduce – MapReduce reads from and writes to disk, which slows down the processing speed.
2.3. Difficulty
- Apache Spark – Spark is easy to program, as it offers tons of high-level operators on RDDs (Resilient Distributed Datasets).
- Hadoop MapReduce – In MapReduce, developers need to hand-code each and every operation, which makes it much harder to work with.
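The difference in programming style can be sketched in plain Python, as an assumption-laden stand-in for Spark's operator chaining (the functions and data here are illustrative, not Spark APIs): high-level operators express in one chain what hand-coded processing spells out step by step.

```python
from functools import reduce

data = [1, 2, 3, 4, 5, 6]

# Spark-style: chain high-level operators (map, filter, reduce).
result = reduce(lambda a, b: a + b,
                filter(lambda x: x % 2 == 0,
                       map(lambda x: x * x, data)))

# MapReduce-style: every operation is hand-coded explicitly.
squared = []
for x in data:
    squared.append(x * x)
evens = []
for x in squared:
    if x % 2 == 0:
        evens.append(x)
total = 0
for x in evens:
    total += x

print(result, total)  # both compute the same sum
```

Both versions compute the same aggregate; the operator-chained form is shorter and closer to how RDD transformations read in Spark code.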
2.4. Easy to Manage
- Apache Spark – Spark can perform batch processing, interactive queries, machine learning, and streaming all in the same cluster, which makes it a complete data analytics engine. There is no need to manage a different component for each need; installing Spark on a cluster is enough to handle all these requirements.
- Hadoop MapReduce – MapReduce provides only a batch engine, so we depend on different engines, for example Storm, Giraph, or Impala, for other requirements. Managing this many components is very difficult.
2.5. Real-time analysis
- Apache Spark – Spark can process real-time data, i.e. data coming from real-time event streams at the rate of millions of events per second, e.g. Twitter or Facebook data. Spark’s strength is its ability to process live streams efficiently.
- Hadoop MapReduce – MapReduce falls short when it comes to real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.
2.6. Latency
- Apache Spark – Spark provides low-latency computing.
- Hadoop MapReduce – MapReduce is a high latency computing framework.
2.7. Interactive mode
- Apache Spark – Spark can process data interactively.
- Hadoop MapReduce – MapReduce doesn’t have an interactive mode.
2.8. Streaming
- Apache Spark – Spark can process real-time data through Spark Streaming.
- Hadoop MapReduce – With MapReduce, you can only process data in batch mode.
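Spark Streaming handles live data by slicing the stream into micro-batches. The following plain-Python sketch is a hypothetical simulation of that idea (no Spark involved): events arrive continuously and are processed in small fixed-size batches as they come, rather than in one monolithic batch job after the fact.

```python
def micro_batches(stream, batch_size):
    # Slice an (in principle unbounded) event stream into small batches,
    # roughly as Spark Streaming slices a stream by batch interval.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

events = range(10)  # stand-in for a live event stream
running_total = 0
totals = []
for batch in micro_batches(events, batch_size=3):
    running_total += sum(batch)  # process each micro-batch on arrival
    totals.append(running_total)

print(totals)
```

The key contrast with MapReduce is that results are updated after every small batch, so downstream consumers see near-real-time output instead of waiting for the whole job to finish.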
2.9. Ease of use
- Apache Spark – Spark is easier to use: its RDD abstraction enables a user to process data with high-level operators, and it provides rich APIs in Java, Scala, Python, and R.
- Hadoop MapReduce – MapReduce is more complex: developers have to work with low-level APIs to process the data, which requires a lot of hand coding.
2.10. Recovery
- Apache Spark – RDDs allow recovery of partitions on failed nodes by recomputing them from the DAG of transformations (the lineage). Spark also supports a recovery style more similar to Hadoop’s, by way of checkpointing, which truncates the lineage to reduce an RDD’s dependencies.
- Hadoop MapReduce – MapReduce is naturally resilient to system faults and failures, so it is a highly fault-tolerant system.
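The lineage idea can be sketched as follows: instead of storing materialized data, an RDD-like object remembers the chain of transformations that produced it, so a lost result can be rebuilt by replaying that chain from the source. This is a simplified, hypothetical model written in plain Python, not Spark's actual implementation.

```python
class LineageRDD:
    """Toy RDD: stores its source and transformation chain, not its data."""

    def __init__(self, source, transforms=()):
        self.source = source          # the original input data
        self.transforms = transforms  # lineage: functions applied so far

    def map(self, fn):
        # Record the transformation instead of applying it eagerly.
        return LineageRDD(self.source, self.transforms + (fn,))

    def collect(self):
        # Recompute from the source by replaying the lineage in order.
        data = list(self.source)
        for fn in self.transforms:
            data = [fn(x) for x in data]
        return data

rdd = LineageRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
# Even if a computed partition were lost, collect() can rebuild the
# result from the recorded lineage.
print(rdd.collect())
```

Checkpointing, by contrast, would save `collect()`'s output to stable storage so that a long lineage need not be replayed from the very beginning.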
2.11. Scheduler
- Apache Spark – Due to in-memory computation, Spark acts as its own flow scheduler.
- Hadoop MapReduce – MapReduce needs an external job scheduler, for example Oozie, to schedule complex flows.
2.12. Fault tolerance
- Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart the application from scratch in case of any failure.
- Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so there is no need to restart the application from scratch in case of any failure.
2.13. Security
- Apache Spark – Spark is a little less secure than MapReduce, because it supports authentication only through a shared secret (password authentication).
- Hadoop MapReduce – Apache Hadoop MapReduce is more secure because of Kerberos, and it also supports Access Control Lists (ACLs), a traditional file-permission model.
2.14. Cost
- Apache Spark – Spark requires a lot of RAM to run in memory, which increases the size of the cluster and, in turn, its cost.
- Hadoop MapReduce – MapReduce is the cheaper option in terms of cost.
2.15. Language Developed
- Apache Spark – Spark is developed in Scala.
- Hadoop MapReduce – Hadoop MapReduce is developed in Java.
2.16. Category
- Apache Spark – Spark is a data analytics engine; hence it is a common choice for data scientists.
- Hadoop MapReduce – MapReduce is a basic data processing engine.
2.17. License
- Apache Spark – Apache License 2.0
- Hadoop MapReduce – Apache License 2.0
2.18. OS support
- Apache Spark – Spark is cross-platform.
- Hadoop MapReduce – Hadoop MapReduce is also cross-platform.
2.19. Programming Language support
- Apache Spark – Scala, Java, Python, R, SQL.
- Hadoop MapReduce – Primarily Java; other languages such as C, C++, Ruby, Groovy, Perl, and Python are also supported via Hadoop Streaming.
2.20. SQL support
- Apache Spark – It enables the user to run SQL queries using Spark SQL.
- Hadoop MapReduce – It enables users to run SQL queries using Apache Hive.
2.21. Scalability
- Apache Spark – Spark is highly scalable: nodes can be added to the cluster as needed. The largest known Spark cluster has 8,000 nodes.
- Hadoop MapReduce – MapReduce is also highly scalable: we can keep adding nodes to the cluster. The largest known Hadoop cluster has 14,000 nodes.
2.22. Lines of code
- Apache Spark – Apache Spark is implemented in merely 20,000 lines of code.
- Hadoop MapReduce – Hadoop 2.0 has 120,000 lines of code.
2.23. Machine Learning
- Apache Spark – Spark has its own machine learning library, MLlib.
- Hadoop MapReduce – Hadoop requires an external machine learning tool, for example Apache Mahout.
2.24. Caching
- Apache Spark – Spark can cache data in memory for further iterations, which enhances system performance.
- Hadoop MapReduce – MapReduce cannot cache data in memory for future requirements, so its processing speed is not as high as Spark’s.
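The effect of caching on iterative workloads can be sketched in plain Python: without a cache, every iteration recomputes the upstream result (much as MapReduce rereads from disk), while with a cache it is computed once and reused, which is the behavior Spark's `cache()`/`persist()` calls provide. The counter below is an illustrative assumption, not a Spark measurement.

```python
recomputations = 0

def expensive_transform(data):
    # Stand-in for an expensive upstream stage (e.g., a large scan + map).
    global recomputations
    recomputations += 1
    return [x * x for x in data]

data = [1, 2, 3]

# Without caching: the upstream result is recomputed on every iteration.
for _ in range(3):
    _ = sum(expensive_transform(data))
uncached_runs = recomputations

# With caching: compute once, then reuse the in-memory result
# (analogous to calling cache() on an RDD before iterating).
recomputations = 0
cached = expensive_transform(data)
for _ in range(3):
    _ = sum(cached)
cached_runs = recomputations

print(uncached_runs, cached_runs)
```

This one-versus-many recomputation gap is why iterative algorithms such as machine learning training loops benefit so much from Spark's in-memory caching.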
2.25. Hardware Requirements
- Apache Spark – Spark needs mid to high-level hardware.
- Hadoop MapReduce – MapReduce runs very well on commodity hardware.
2.26. Community
- Apache Spark – Spark is one of the most active projects at Apache, with a very strong community.
- Hadoop MapReduce – Much of the MapReduce community has shifted to Spark.