Transformation Processing Smackdown; Spark vs Hive vs Pig

Track: Data Science and Machine Learning
Skill Level: Beginner
Room: Room A403
Time Slot: Fri 2/24, 2:30 PM
Tags: hadoop, hive, pig, big data, spark
Abstract

One of the core use cases for Hadoop is ETL offload, and there are a number of framework options to choose from. This presentation will focus on the transformation processing aspects of ETL (or is it ELT?) and compare & contrast three of the most popular frameworks: Spark, Hive and Pig. The history and background of each framework will be presented, but the focus will be on the strengths and weaknesses, as well as the community adoption, of each.

The talk will rate how each framework addresses various functional requirements of transformation processing and present sample code to show how well each framework works in practice. The intention is to help teams facing technology choices make the most appropriate decisions, so suggested team skills & experience for Pig, Hive and Spark will also be discussed.

While not formally part of the comparison, other frameworks such as MapReduce, Cascading and Flink will also be discussed.

Lester Martin

Lester is a 20+ year software development veteran whose skills range from the mainframe through Java and .NET distributed & web technologies; for the last several years he has focused on “big data” tools, including Hadoop and NoSQL technologies. He currently works at Hortonworks as a system architect & technical trainer and enjoys engaging in interactive discussions about these exciting technologies and open source in general. His 2015 Hadoop Summit presentation is available at https://www.youtube.com/watch?v=EUz6Pu1lBHQ, and general information, social/community profiles and blog links can be found at http://lester.website.