Spark GraphX – Distributed Graph Computations
Spark GraphX is a graph processing framework built on top of Spark.
GraphX models graphs as property graphs where vertices and edges can have properties.
Caution
|
FIXME Diagram of a graph with friends. |
GraphX comes with its own package org.apache.spark.graphx
.
Tip
|
Import
|
Graph
Graph
abstract class represents a collection of vertices
and edges
.
1 2 3 4 5 |
abstract class Graph[VD: ClassTag, ED: ClassTag] |
vertices
attribute is of type VertexRDD
while edges
is of type EdgeRDD
.
Graph
can also be described by triplets
(that is of type RDD[EdgeTriplet[VD, ED]]
).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq( (0L, "Jacek"), (1L, "Agata"), (2L, "Julian"))) val edges: RDD[Edge[String]] = sc.parallelize(Seq( Edge(0L, 1L, "wife"), Edge(1L, 2L, "owner") )) scala> val graph = Graph(vertices, edges) graph: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@5973e4ec |
package object graphx
package object graphx
defines two type aliases:
-
VertexId
(Long
) that represents a unique 64-bit vertex identifier. -
PartitionID
(Int
) that is an identifier of a graph partition.
Standard GraphX API
Graph
class comes with a small set of API.
-
Transformations
-
mapVertices
-
mapEdges
-
mapTriplets
-
reverse
-
subgraph
-
mask
-
groupEdges
-
-
Joins
-
outerJoinVertices
-
-
Computation
-
aggregateMessages
-
Creating Graphs (Graph object)
Graph
object comes with the following factory methods to create instances of Graph
:
-
fromEdgeTuples
-
fromEdges
-
apply
Note
|
The default implementation of Graph is GraphImpl.
|
GraphOps – Graph Operations
GraphImpl
GraphImpl
is the default implementation of Graph abstract class.
It lives in org.apache.spark.graphx.impl
package.
OLD – perhaps soon to be removed
Apache Spark comes with a library for executing distributed computation on graph data, GraphX.
-
Apache Spark graph analytics
-
GraphX is a pure programming API
-
missing a graphical UI to visually explore datasets
-
Could TitanDB be a solution?
-
From the article Merging datasets using graph analytics:
Such a situation, in which we need to find the best matching in a weighted bipartite graph, poses what is known as the stable marriage problem. It is a classical problem that has a well-known solution, the Gale–Shapley algorithm.
A popular model of distributed computation on graphs known as Pregel was published by Google researchers in 2010. Pregel is based on passing messages along the graph edges in a series of iterations. Accordingly, it is a good fit for the Gale–Shapley algorithm, which starts with each “gentleman” (a vertex on one side of the bipartite graph) sending a marriage proposal to its most preferred single “lady” (a vertex on the other side of the bipartite graph). The “ladies” then marry their most preferred suitors, after which the process is repeated until there are no more proposals to be made.
The Apache Spark distributed computation engine includes GraphX, a library specifically made for executing distributed computation on graph data. GraphX provides an elegant Pregel interface but also permits more general computation that is not restricted to the message-passing pattern.