Pairwise Document Similarty in Large Collection with MapReduce

Pairwise Document Similarty in Large Collection with MapReduce

Ivory: Pairwise Document Similarity in Large Collection with MapReduce Tamer Elsayed, Jimmy Lin, and Doug Oard Laboratory for Computational Linguistics and Information Processing (CLIP Lab) UM Institute for Advanced Computer Studies (UMIACS) Problem 0.20 0.30 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 ~~~~~~~~~~ 0.34 ~~~~~~~~~~ 0.34 0.13 0.74 0.20 0.30 ~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00

~~~~~~~~~~ 0.34 0.34 0.13 0.74 0.20 0.30 ~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 ~~~~~~~~~~ 0.34 0.34 0.13 0.74 0.20 ~~~~~~~~~~ 0.30 ~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 0.34 0.34 0.13 0.74

Applications: more-like-that queries Clustering e.g., co-reference resolution 0.20 ~~~~~~~~~~ 0.30 ~~~~~~~~~~ 0.54 ~~~~~~~~~~ 0.21 ~~~~~~~~~~ 0.00 0.34 0.34 0.13 0.74 Solutions Trivial sim(d i , d j ) wt ,di wt ,d j tV

For each pair of vectors Compute the inner product Loads each vector O(N) times Better Each term contributes only if appears in sim(d i , d j ) w td i d j sim(d i , d j ) t ,di di d j wt ,d j term _ contrib(t , d , d

td i d j i j ) Algorithm Loads each posting once Matrix must fit in memory Works for small collections Otherwise: disk access optimization Hadoopify : 2-Step Solution Indexing 1) one MapRedcue step term posting file

Pairwise Similarity 2) another MapRedcue step term contribution for all possible pairs Generate df*(df-1) intermediate contribution / term Indexing d1 AABC map d2 BDD map map (A,(d3,1)) (B,(d3,2)) (E,(d3,1)) d3 ABBE (A,(d1,2))

(B,(d1,1)) (C,(d1,1)) (B,(d2,1)) (D,(d2,2)) reduce (A,[(d1,2), (d3,1)]) (B,[(d1,1), (d2,1), (d3,2)]) reduce (B,[(d1,1), (d2,1), (d3,2)]) (C,[(d1,1)]) reduce (C,[(d1,1)]) (D,[(d2,2)]) reduce (D,[(d2,2)]) (E,[(d3,1)])

reduce (E,[(d3,1)]) (A,[(d1,2), (d3,1)]) shuffle Pairwise Similarity (A,[(d1,2), (d3,1)]) map ((d1,d3),2) shuffle (B,[(d1,1), (d2,1), (d3,2)]) map (C,[(d1,1)]) map (D,[(d2,2)]) map (E,[(d3,1)])

map ((d1,d2),1) ((d1,d3),2) ((d2,d3),2) ((d1,d2),[1]) reduce ((d1,d2),1) ((d1,d3),[2, 2]) reduce ((d1,d3),4) ((d2,d3),[2]) reduce ((d2,d3),2) Implementation Issues df-cut Drop common terms

Intermediate tuples dominated by very high df terms efficiency Vs. effectiveness Space saving tricks Common doc + stripes Blocking Compression Experimental Setup Hadoop 0.16.0 Cluster of 19 nodes (w/double processors) Aquaint-2 collection 906K documents Okapi BM25 Subsets of collection

Efficiency (running time) 99% df-cut Computation Time (minutes) 140 120 100 80 60 40 2 R = 0.997 20 0 0 10 20 30 40 50 60 Corpus Size (%) 70

80 90 100 Intermediate Pairs (billions) Efficiency (disk usage) 9,000 8,000 df-cut at 99% df-cut at 99.9% df-cut at 99.99% df-cut at 99.999% no df-cut 7,000 6,000 5,000 4,000 3,000 2,000 1,000 0 0 10 20 30

40 50 60 Corpus Size (%) 70 80 90 100 Effectiveness (recent) Effect of df-cut on effectiveness Medline04 - 909k abstracts 100 Relative P5 (%) 95 90 85 80 75 70 65 60 55 50 99.00

99.10 99.20 99.30 99.40 99.50 99.60 df-cut (%) 99.70 99.80 99.90 100.00 Conclusion Simple and efficient MapReduce solution 2H (using 38 nodes, 99% df-cut) for ~million-doc collection Play tricks for I/O bound jobs

Effective linear-time-scaling approximation 99.9% df-cut achieves 98% relative accuracy df-cut controls efficiency vs. effectiveness tradeoff Future work Bigger collections! More investigation of df-Cut and other techniques Analytical model Compression techniques (e.g., bitwise) More effectiveness experiments Joint resolution of personal names in email Co-reference resolution of names and organization MapReduce IR research platform Batch query processing Thank You!

MapReduce Framework input map input input map input map map Shuffling: group values by keys reduce output reduce output reduce output

Recently Viewed Presentations

  • Girls' Writing…… - Near East South Asia Council of Overseas

    Girls' Writing…… - Near East South Asia Council of Overseas

    The Female Brain:Verbal vs. Spatial. In boys' brains, there are generally more cortical areas dedicated to spatial-mechanical functioning than in girls'. In girls' brains, there is greater cortical emphasis on verbal-emotive processing. Girls naturally utilize more words on average than...
  • Land Grabs - Geographical Association

    Land Grabs - Geographical Association

    With thank to Tony Cassidy for inspiration here Ingredients. Fossil fuels Deforestation Cattle Rice Global Warming Recipe. Cooking Instructions 1.Liberally burn fossil fuels from the industrial revolution to the present. Making sure none of the carbon is offset. 2. At...
  • CARP Principles By [put your name here] Overview

    CARP Principles By [put your name here] Overview

    CARP principles are important for design view. They are the concepts and method for displaying visual elements effectively to viewers. Without CARP principles, it would make the design boring and it could increase cognitive overload. Imagine that you see the...
  • Proteins AP Biology Proteins Multipurpose molecules AP Biology

    Proteins AP Biology Proteins Multipurpose molecules AP Biology

    * It's a helix or B sheet within a single region. Can have both in one protein but a specific region is one or another * * How the whole thing holds together Division Ave. High School Ms. Foglia AP...
  • Use of graphics in spss - University of Ottawa

    Use of graphics in spss - University of Ottawa

    Use of graphics in SPSS Download the BECK SPSS dataset Access the course Blackboard site to obtain the dataset. * Box plot Graphic representation of a frequency Includes the IQR Used for scalar and quasi-scalar variables Can be grouped *...
  • Golden Age of Athens & the Peloponesian War

    Golden Age of Athens & the Peloponesian War

    Golden Age of Athens & the Peloponnesian War ... the Parthenon Also has many other temples and alters Legend Athena and Poseidon competed to be the patron god of Athens by providing gifts to the people of the city on...
  • The Design and Performance of an Astigmatic Laser

    The Design and Performance of an Astigmatic Laser

    Rectangularly symmetric Obtainable in spherical-planar resonator by adjusting plane mirror angle Also obtainable by an intra-cavity crosswire, when the laser is operating in a multimode Indices (m,n) refer to the number of intensity minima in the x and y directions...
  • Logical Query Plan - University of California, San Diego

    Logical Query Plan - University of California, San Diego

    Recall the Netflix Schema. ... Declarative "querying" (logical-physical separation) is a . key system design principle from the RDBMS world: Declarativity often helps improve . user productivity. Enables behind-the-scenes . performance optimizations.