Map/Reduce Large Scale Duplicate Detection Prof. Felix Naumann,

Map/Reduce Large Scale Duplicate Detection Prof. Felix Naumann,

Map/Reduce Large Scale Duplicate Detection Prof. Felix Naumann, Arvid Heise Agenda 2 Big Data Word Count Example Hadoop Distributed File System Hadoop Map/Reduce Advanced Map/Reduce Stratosphere Map/Reduce | Arvid Heise| April 15, 2013 Agenda 3 Big Data Word Count Example Hadoop Distributed File System

Hadoop Map/Reduce Advanced Map/Reduce Stratosphere Map/Reduce | Arvid Heise| April 15, 2013 What is Big Data? collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications [http://en.wikipedia.org/wiki/Big_data] terabytes, petabytes, in a few years exabytes Challenges Capturing, storage, analysis, search, ... Sources Web, social platforms Science Map/Reduce | Arvid Heise| April 15, 2013 Example: Climate Data Analysis PS,1,1,0,Pa, surface pressure

T_2M,11,105,0,K,air_temperature TMAX_2M,15,105,2,K,2m maximum temperature TMIN_2M,16,105,2,K,2m minimum temperature U,33,110,0,ms-1,U-component of wind V,34,110,0,ms-1,V-component of wind QV_2M,51,105,0,kgkg-1,2m specific humidity CLCT,71,1,0,1,total cloud cover (Up to 200 parameters) Analysis Tasks on Climate Data Sets Validate climate models Locate hot-spots in climate models Monsoon Drought

Flooding Compare climate models 10TB Based on different parameter settings Necessary Data Processing Operations Filter, aggregation (sliding window), join 5 950km, 1h 3 m re on so t h lu s, tio n

Advanced pattern recognition Map/Reduce | Arvid Heise| April 15, 2013 2km resolution Big Data Landscape 6 Map/Reduce | Arvid Heise| April 15, 2013 Agenda 7 Big Data Word Count Example Hadoop Distributed File System Hadoop Map/Reduce Advanced Map/Reduce Stratosphere Map/Reduce | Arvid Heise| April 15, 2013

Programming Model 8 Inspired by functional programming concepts map and reduce Operates on key/value pairs Map Process key/value pairs individually Generate intermediate key/value pairs Example (LISP): (mapcar 1+ (1 2 3 4)) (2 3 4 5) Reduce Merge intermediate key/value pairs with same key Example (LISP): (reduce + (1 2 3 4)) 10 Map/Reduce | Arvid Heise| April 15, 2013 Programmers Perspective: Word Count 9 1

to be, or not to be, that is the question: 2 whether 'tis nobler in the mind to suffer 3 the slings and arrows of outrageous fortune, 4 or to take arms against a sea of troubles Map Reduce To 4

Be 2 Or 2 Not 1 Map/Reduce | Arvid Heise| April 15, 2013 Map/Reduce Job Programmers Perspective: WC Map 10

1 to be, or not to be, that is the question: 2 whether 'tis nobler in the mind to suffer Map UDF to 1 whether 1 be 1 'tis

1 or 1 nobler 1 not 1 in 1 to 1

the 1 Map/Reduce | Arvid Heise| April 15, 2013 Programmers Perspective: WC Reduce 11 be 1 to

1 be 1 to 1 Reduce UDF to 2

be 2 not 1 Map/Reduce | Arvid Heise| April 15, 2013 not 1 Agenda

12 Big Data Word Count Example Hadoop Distributed File System Hadoop Map/Reduce Advanced Map/Reduce Stratosphere Map/Reduce | Arvid Heise| April 15, 2013 Behind the Scenes 13 Map/Reduce framework takes care of Data partitioning Data distribution Data replication Parallel execution of tasks Fault tolerance Status reporting Map/Reduce | Arvid Heise| April 15, 2013

Hadoop Architecture 14 Job Task Task Jobtracker Tasktracker Tasktracker Map/Reduce HDFS Name node

Data node Data node Master Slave 1 Slave N Map/Reduce | Arvid Heise| April 15, 2013 HDFS Upload 15 First step: User uploads data to HDFS Job Task Task

Jobtracker Tasktracker Tasktracker Map/Reduce HDFS Name node Data node Data node Master Slave 1

Slave N Map/Reduce | Arvid Heise| April 15, 2013 HDFS Upload 16 Block/split-based format (usually 64 MB) Splits are replicated over several nodes (usually 3 times) In average: each slave receives #Split*3/#Slaves splits HDFS Client 1) Request locations 2) Upload HDFS Name node 3) Register

Master Map/Reduce | Arvid Heise| April 15, 2013 Data node Slave 1 Data node Slave N Agenda 17 Big Data Word Count Example Hadoop Distributed File System Hadoop Map/Reduce Advanced Map/Reduce Stratosphere

Map/Reduce | Arvid Heise| April 15, 2013 Job Submission 18 Second step: User submits job Job Task Task Jobtracker Tasktracker Tasktracker Map/Reduce HDFS Name

node Data node Data node Master Slave 1 Slave N Map/Reduce | Arvid Heise| April 15, 2013 Job Submission 19 Job tracker allocates resources for submitted job Uses name node to determine which nodes processes what Distributes tasks to nodes Job

Task Task Jobtracker Tasktracker Tasktracker Map/Reduce HDFS Master Map/Reduce | Arvid Heise| April 15, 2013 Slave 1 Slave N

Job Execution 20 Third step: job execution Input splits Map Map Map Shuffle Reduce Reduce Reduce Slave 1 Slave N

Output splits Map/Reduce | Arvid Heise| April 15, 2013 Map tasks 21 Third step: job execution, map task Nodes process tasks indepently Task tracker receives tasks and spawn one map process per task Task Task Tasktracker Tasktracker Task Map Task Map Task

Map Task Slave 1 Map/Reduce | Arvid Heise| April 15, 2013 Slave N Map Execution 22 Task tracker receives input as map waves Each wave consists of at most #processors splits Spawns a new JVM(!) for each split Each wave has at least ~6s overhead For each split, the map task reads the key value pairs Invokes the map UDF for each map task Collects emitted results and spills them immediately to a local file Optionally reuses JVM to reduce time per wave Map/Reduce | Arvid Heise| April 15, 2013

Job Execution, Shuffle 23 Input splits Map Map Map Shuffle Reduce Reduce Reduce Slave 1 Slave N

Output splits Map/Reduce | Arvid Heise| April 15, 2013 Shuffle 24 Partitioner distributes data to the different nodes Uses unique mapping from key to node Often: key.hashCode() % numReducer Key/Value-pairs are serialized and sent over network Spilled to local disk of the reducer Sorted by key with two-phase merge sort Usually most costly phase Map/Reduce | Arvid Heise| April 15, 2013 Job Execution, Shuffle 25 Input splits

Map Map Map Shuffle Reduce Reduce Reduce Slave 1 Slave N Output splits Map/Reduce | Arvid Heise| April 15, 2013 Reducer Execution

26 Basic idea Scans over sorted list Invokes reducer UDF for subset of data with same keys In reality, a bit more complicated Provides reducer UDF with iterator Iterator returns all values with same key UDF is invoked as long as there is one element left Only one scan with little memory overhead Stores result on local disk Replicates splits (two times) Map/Reduce | Arvid Heise| April 15, 2013 Combiner 27 Local reducer Invoked in map phase for smaller groups of keys Not the complete list of values in general Preaggregates result to reduce network cost!

Can even be invoked recursively on preaggregated results Map/Reduce | Arvid Heise| April 15, 2013 Word Count Recap, Data Upload 28 During upload, split input (In general, more than one line) 1 1 to be, or not to be, that is the question: 2 whether 'tis nobler in the mind to suffer 3

the slings and arrows of outrageous fortune, 4 or to take arms against a sea of troubles to be, or not to be, that is the question: 2 whether 'tis nobler in the mind to suffer Map/Reduce | Arvid Heise| April 15, 2013 Word Count Recap, Map Phase 29 For each input split invoke map task Map task receives each line in the split Tokenizes line, emits (word, 1) for each word

Locally combines results! Decreases I/O from #word to #distinct words per split (64MB) Map Task Tasktracker 1 to be, or not to be, that is the question: Tasktracker Slave 1 Map/Reduce | Arvid Heise| April 15, 2013 Map Task 2 whether 'tis nobler in the mind to suffer Slave N

Word Count Recap, Shuffle+Reduce 30 Assigns each word to reducer Sends all preaggregated results to reducer For example, (to, 3512) Reducer sorts results and UDF sums preaggregated results up Each reducer outputs a partial word histogram Client is responsible for putting output splits together Map/Reduce | Arvid Heise| April 15, 2013 Behind the Scenes 31 Map/Reduce framework takes care of Data partitioning Data distribution Data replication Parallel execution of tasks Fault tolerance Status reporting

Map/Reduce | Arvid Heise| April 15, 2013 Fault Tolerance 32 On Map/Reduce level Each task tracker sends progress report If a node does not respond within 10 minutes (configurable) It is declared dead The assigned tasks are redistributed over the remaining nodes Because of replication, 2 nodes can be down at any time On HDFS level Each data node sends periodic heartbeat to name node In case of down time Receives no new I/O Lost replications are restored at other nodes Map/Reduce | Arvid Heise| April 15, 2013 Agenda 33

Big Data Word Count Example Hadoop Distributed File System Hadoop Map/Reduce Advanced Map/Reduce Stratosphere Map/Reduce | Arvid Heise| April 15, 2013 Record Reader 34 For WC, we used LineRecordReader Splits text files at line ends (\n) Generates key/value pair of (line number, line) Hadoop users can supply own readers Could already tokenize the lines Emits (word, 1) No mapper needed Necessary for custom/complex file formats Useful when having different file formats but same mapper

Map/Reduce | Arvid Heise| April 15, 2013 Dealing with Multiple Inputs 35 Map and reduce take only one input Operations with two inputs are tricky to implement Input splits of map can originate in several different files Logical concatenation of files Standard trick: tagged union In record reader/mapper output (key, (inputId, value)) Mapper and reducer UDFs can distinguish inputs Map/Reduce | Arvid Heise| April 15, 2013 Join 36 Reduce-side join Tagged union (joinKey, (inputId, record)) All records with same join key are handled by same reducer Cache all values in local memory

Perform inner/outer join Emit all pairs of values with different inputIds May generate OOM for larger partitions Map-side join Presort and prepartition input All relevant records should reside in same split Load and cache split Perform inner/outer join Map/Reduce | Arvid Heise| April 15, 2013 Secondary Grouping/Sort 37 Exploit that partitioner and grouping are two different UDFs Map emits ((key1, key2), value) Partitioner partitions data only on first key1 All KV-pairs ((keyX, ?), ?) are on the same physical machine However, reducer is invoked on partitions ((keyX, keyY), ?) Useful to further subdivide partitions

Join data could also be tagged ((joinKey, inputId), record) Only need to cache one input and iterate over other partition Hadoop Reducer always sorts data Data is grouped by first key and sorted by second key Map/Reduce | Arvid Heise| April 15, 2013 Side-effect Files 38 Sometimes even these tricks are not enough Example: triangle enumeration/three way join SELECT x, y, z WHERE x.p2=y.p1 AND y.p2=z.p1 AND z.p2=x.p1 Cohens approach with two map/reduce jobs Generate triad (SELECT x, y, z WHERE x.p2=y.p1 AND y.p2=z.p1) Probe missing edge with a reducer on input data Huge intermediate results on skewed data sets! Way faster: one map/reduce job Generate triad and immediately test if missing edge is in data Needs to load data set into main memory in reducer Map/Reduce Arvid

Heise| April 15, 2013 Might |run into OOM 39 Complete pipeline in Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). Jens Dittrich, Jorge-Arnulfo QuianRuiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jrg Schad. PVLDB 3(1): 518529 (2010) More than 10 UDFs! Map/Reduce | Arvid Heise| April 15, 2013 Agenda 40

Big Data Word Count Example Hadoop Distributed File System Hadoop Map/Reduce Advanced Map/Reduce Stratosphere Map/Reduce | Arvid Heise| April 15, 2013 Overview over Stratosphere 41 Research project by HU, TU, and HPI Overcome shortcomings of Map/Reduce Allow optimization of queries similar to DBMS Map/Reduce | Arvid Heise| April 15, 2013 Extensions of Map/Reduce 42 Additional second-order functions Complex workflows instead of Map/Reduce pipelines

More flexible data model Extensible operator model Optimization of workflows Sophisticated check pointing Dynamic machine booking Map/Reduce | Arvid Heise| April 15, 2013 Intuition for Parallelization Contracts 43 Map and reduce are second-order functions Call first-order functions (user code) Provide first-order functions with subsets of the input data Define dependencies between the records that must be obeyed when splitting them into subsets Contract: required partition properties Map All records are independently processable Reduce Records with identical key must

be processed together Key Input set Value Independent subsets Contracts beyond Map and Reduce 44 Cross Two inputs Each combination of records from the two inputs is built and is independently processable Match Two inputs, each combination of records with equal key from the two inputs is built Each pair is independently processable

CoGroup Multiple inputs Pairs with identical key are grouped for each input Groups of all inputs with identical key are processed together Complex Workflows Results Match name 45 pivotization Directed acyclic graphs More natural programming Holistic view on query Map/Reduce queries scattered over several jobs Higher abstraction

Allows optimization Less data is shipped Map Reduce filter students pivot CoGroupsid Map merge student with its duplicates pivot Map Cross

find similar students annotate entities Map Students Map/Reduce | Arvid Heise| April 15, 2013 annotate sentences News articles Motivation for Record Model 46 Key/Value-pairs are not very flexible In Map/Reduce Map performs calculation and sets key

Reducer uses key and performs aggregation Strong implicit interdependence between Map and Reduce In Stratosphere, we want to reorder Pacts Need to reduce interdependence Record data model Array of values Keys are explicitly set by contract (Reduce, Match, CoGroup) Map/Reduce | Arvid Heise| April 15, 2013 Record Model 47 All fields are serialized into a byte stream User code is responsible for Managing the indices Knowing the correct type of the field Huge performance gain through lazy deserialization Deserialize only accessed fields Serialize only modified fields

Map/Reduce | Arvid Heise| April 15, 2013 Composite Keys 48 Composite keys in Map/Reduce New tuple data structure Map copies values into the fields Emits (keys, value) Stratosphere allows to specify composite keys Reduce, Match, CoGroup can be configured to take several indices/types in the record as key Map/Reduce | Arvid Heise| April 15, 2013 More Documentation 49 Project website https://stratosphere.eu/ MapReduce and PACT - Comparing Data Parallel Programming Models

Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, Daniel Warneke In Proceedings of Datenbanksysteme fr Business, Technologie und Web (BTW) 2011, pp. 25-44 Map/Reduce | Arvid Heise| April 15, 2013

Recently Viewed Presentations

  • Amazing Arthropods First Grade Science-Insects By Aaronia Baxley

    Amazing Arthropods First Grade Science-Insects By Aaronia Baxley

    Arachnids *Arachnids include spiders, scorpions, mites, and ticks. *They are different from insects in that they have 8 legs and 2 body parts. The largest arachnid is a tarantula. Myriapods Myriapods include centipedes and millipedes. They are usually seen in...
  • Chosen Criminals: Bonnie Parker & Clyde Barrow (The

    Chosen Criminals: Bonnie Parker & Clyde Barrow (The

    On April 1, 1934, Bonnie and Clyde encountered two young highway patrolmen near Grapevine, Texas. Before the officers could draw their guns, they were shot. On April 6, 1934, a constable at Miami, Oklahoma was murdered by Bonnie and Clyde,...
  • Market Failures: Public Goods and Externalities

    Market Failures: Public Goods and Externalities

    Market Failures. Market fails to produce the right amount of the product. Resources may be: Over-allocated. Under-allocated. LO1. 5-Market failure occurs when the competitive market system produces the "wrong" amounts of certain goods or services, or fails to provide any...
  • Part II: Preparation/Process - Professor Mark J. Grossman

    Part II: Preparation/Process - Professor Mark J. Grossman

    Power of Persuasion. Persuading is the goal of most public relations programs. Getting someone to do something through advice, reasoning or arm-twisting. Classic persuasion theory - people may be of two minds. Systematic mode (carefully considers argument) Heuristic mode (skimming...
  • State of Tennessee

    State of Tennessee

    Consider employee only coverage or employee + child(ren) to receive the maximum basic term life insurance benefit. ... You cannot have a HSA if either you or your spouse are enrolled in a medical flexible spending account (FSA) or HRA...
  • Propaganda War - Weebly

    Propaganda War - Weebly

    Propaganda War. Total war included controlling public opinion. Censored press. Hid discouraging news. Propaganda was used to spread ideas to promote a cause or to damage an opposing cause. Women join the War Effort. Women played critical role. At home,...
  • Office of Professional Preparation Certification Training Webinar II

    Office of Professional Preparation Certification Training Webinar II

    EIPA - WT 1 hour. EIPA - WT 3 hr. Form 60. Passing scores on the written test for EIPA may be used for 1 credit hour for Reading, 3 credit hours each for Special Education, Psychology, an elective, and...
  • https://www.math.upenn.edu/~ghrist/notes.html Background for k-means clustering Creating Voronoi diagrams

    https://www.math.upenn.edu/~ghrist/notes.html Background for k-means clustering Creating Voronoi diagrams

    Voronoi diagram: Suppose your data points live in Rn. A Voronoi diagram is a partition of space into Voronoi cells. The . Voronoi. cell . associated with vis