Transcript

How's it built

Think like a vertex

Example: PageRank

Bells and whistles

How do I run a Giraph job?

And what does it look like?

State of the world

Future Work

Thank you!

BSP: Revenge of the messages

Example: Find max value

From Y! to

What's suboptimal with MR?

Giraph: loosely based on...

Key concepts

Superstep 0: V1.compute(), V2.compute(), V3.compute(), V4.compute()

Our neighbors send us their values, we add them up

job chaining == unintended consequences

iteration == job chaining

iteration

Version 0.1

  • Yahoo! Research developed original codebase
  • Entered Apache Incubator in July 2011
  • New Apache team quickly formed

> bin/giraph \
    ~/maxValueGiraph-1.0.jar maxValueGiraph.MaximumValueInGraph \
    -w 5 \
    -if org.apache.giraph.lib.TextDoubleDoubleAdjacencyListVertexInputFormat \
    -ip my_graph \
    -of org.apache.giraph.lib.IdWithValueTextOutputFormat \
    -op my_output

First, we think we're the max

Aggregators

Combiners

Checkpointing

Even more improved RPC

More {in|out}putformats

Improved robustness

YARN!

And compute our new value

Anybody more max?

  • Improved out-of-box experience
  • Significant memory improvements
  • Improved definition of Vertex and Combiner
  • Lots of useful file formats

Superstep 1: V1.compute(), V2.compute(), V3.compute(), V4.compute()

::

ZooKeeper

  • Writing to disk isn't always evil
  • Store work at user-defined intervals
  • Restart on failure
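The checkpointing idea above (store work at a user-defined interval, then restart from the last stored point after a failure) can be sketched in a few lines. This is a single-process illustration only, not Giraph's checkpoint mechanism: `run_with_checkpoints`, `step_fn`, and the pickle file are hypothetical names chosen for the example.

```python
import os
import pickle


def run_with_checkpoints(state, step_fn, total_supersteps, interval, path):
    """Run step_fn once per superstep, saving state every `interval` supersteps.

    If a checkpoint file already exists at `path`, resume from it instead of
    starting over. All names here are hypothetical, for illustration only.
    """
    start = 0
    if os.path.exists(path):
        # Restart on failure: pick up from the last stored superstep.
        with open(path, "rb") as f:
            start, state = pickle.load(f)

    for superstep in range(start, total_supersteps):
        state = step_fn(state)
        completed = superstep + 1
        if completed % interval == 0:
            # Write to a temp file first so a crash mid-write cannot
            # corrupt the previous checkpoint.
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((completed, state), f)
            os.replace(tmp, path)
    return state
```

A larger interval means less disk IO but more recomputation after a failure, which is the trade-off behind making the interval user-defined.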

halt

If so, tell everybody

  • Global values calculated in superstep n, available in next superstep
  • Sum, Min, Max provided
  • User-definable
  • User-defined function to combine messages before being sent or delivered
  • Similar to combiners in Hadoop
  • Saves on network or memory
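To make the aggregator and combiner bullets concrete, here is a minimal sketch in plain Python. It does not use the Giraph API; `combine_max` and `SumAggregator` are hypothetical names. The combiner collapses all messages bound for the same vertex into one, and the aggregator only exposes a superstep's total in the following superstep.

```python
def combine_max(messages):
    """Hypothetical max combiner.

    messages: iterable of (destination vertex id, value) pairs.
    Keeps only the largest value per destination, so fewer messages
    have to be sent or held in memory.
    """
    combined = {}
    for dest, value in messages:
        if dest not in combined or value > combined[dest]:
            combined[dest] = value
    return sorted(combined.items())


class SumAggregator:
    """Hypothetical sum aggregator with superstep-delayed visibility."""

    def __init__(self):
        self.current = 0.0   # being accumulated during superstep n
        self.previous = 0.0  # what vertices may read: the total from superstep n - 1

    def aggregate(self, value):
        self.current += value

    def end_superstep(self):
        # At the barrier, publish this superstep's total and reset.
        self.previous, self.current = self.current, 0.0
```

A max combiner is only safe when the receiving vertex needs nothing but the maximum, as in the find-max example; an algorithm that needs every individual message cannot use it.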

<picture of giraffe wearing mortarboard not found... how is that possible?>

  • Shared state
  • Master-Worker coordination
  • Aggregators

If not, vote to be done
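The find-max steps (assume we are the max, check whether anybody is more max, tell everybody if our value changed, otherwise vote to be done) can be sketched as a small single-process simulation of supersteps. This is plain Python rather than Giraph code; `run_max_value` and the graph layout are hypothetical, chosen only to illustrate the message flow.

```python
def run_max_value(values, edges):
    """Simulate vertex-centric max propagation in synchronous supersteps.

    values: dict of vertex id -> initial value
    edges:  dict of vertex id -> list of neighbor ids
    Returns (final values, number of messages sent in each superstep).
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    halted = set()
    sent_per_superstep = []
    superstep = 0

    while True:
        # A halted vertex is woken up again only by an incoming message.
        runnable = [v for v in values if v not in halted or inbox[v]]
        if not runnable:
            break

        outbox = {v: [] for v in values}
        sent = 0
        for v in runnable:
            halted.discard(v)
            # "Anybody more max?"
            incoming_max = max(inbox[v], default=values[v])
            changed = incoming_max > values[v]
            if changed:
                values[v] = incoming_max
            if superstep == 0 or changed:
                # "If so, tell everybody"
                for neighbor in edges[v]:
                    outbox[neighbor].append(values[v])
                    sent += 1
            else:
                # "If not, vote to be done"
                halted.add(v)

        sent_per_superstep.append(sent)
        inbox = outbox  # messages are delivered at the superstep barrier
        superstep += 1

    return values, sent_per_superstep
```

On a four-vertex path 1-2-3-4 with starting values 3, 6, 2, 1, every vertex ends at 6 and the per-superstep message counts are 6, 4, 1, 0, which shows fewer messages being sent as the algorithm progresses.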

Trunk (version 0.2 soon?)

Do this a pre-set number of times

We graduated!

Superstep 2: V2.compute(), V3.compute(), V4.compute()

HCat, Oozie, Azkaban, etc.

Higher-level languages

Monitoring and metrics

halt

  • Dramatically improved RPC system <- A Big Deal!
  • HBase and Accumulo integration
  • Hadoop 1.0 support

Disk IO and job scheduling quickly dominate the algorithm

http://incubator.apache.org/giraph/

2010

2011

2012

  • -w = how many workers. Worker == Mapper
  • Standard bin script
  • Lots of {in|out}putformats with silly names
  • Read and write from directories

(top-level-project domain coming soon)

  • Fewer messages sent as algorithm progresses
  • Three supersteps versus one MR job. Still faster?

Map-only jobs? Pre-partitioning the data? HaLoop? HadApt?

A better way of doing large-scale graph processing on Hadoop

Superstep 3: V2.compute(), V4.compute()

bit.ly/newbie_apache_giraph_issues

And send our new value to everybody else...

or it's time to quit
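The PageRank steps from the slides (add up the values our neighbors sent, compute our new value, send it to everybody else, stop after a pre-set number of supersteps) can be sketched as below. The slides do not give the update formula, so the damping factor of 0.85 and the `(1 - d) / N + d * sum` form are assumptions taken from the common textbook formulation; `run_pagerank` is a hypothetical name and this is a plain-Python simulation, not Giraph code.

```python
def run_pagerank(edges, supersteps=30, damping=0.85):
    """Vertex-centric PageRank run for a fixed number of supersteps.

    edges: dict of vertex id -> list of out-neighbor ids
    damping: assumed value, not stated in the slides
    """
    n = len(edges)
    rank = {v: 1.0 / n for v in edges}
    inbox = {v: [] for v in edges}

    for step in range(supersteps + 1):
        outbox = {v: [] for v in edges}
        for v, neighbors in edges.items():
            if step > 0:
                # Our neighbors sent us their values; add them up
                # and compute our new value.
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            if step < supersteps:
                # Send our new value to everybody else, split evenly
                # across our out-edges.
                for target in neighbors:
                    outbox[target].append(rank[v] / len(neighbors))
            # Otherwise it's time to quit: send nothing (vote to halt).
        inbox = outbox  # delivered at the superstep barrier
    return rank
```

As long as every vertex has at least one out-edge, the ranks keep summing to 1 across supersteps; handling of dangling vertices is left out of this sketch.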

halt

"The performance, scalability, and fault-tolerance of Pregel are already satisfactory for graphs with billions of vertices."

*

* A very badly behaved one
