LARGE SCALE GRAPH ANALYTICS
rabin
Posted on August 12, 2020
Abstract
Big data is a field that is growing exponentially. It concerns itself with finding approaches to systematically evaluate and process datasets from sources that generate large amounts of data over short periods, data that cannot be analyzed with conventional data-processing software. In this age of information, research has turned to extracting and evaluating information from data graphs that contain extensive amounts of data. In this research project, we look into large-scale graph analytics from the perspective of the querying process. That can include finding the shortest paths in a data graph or finding matches between a data graph and a query graph. Most problems in this field can be conceptualized as questions about paths or patterns on the data graph and the query graph. However, even seemingly straightforward tasks, like spotting patterns of the query graph inside the data graph, can be astonishingly problematic: dual simulation, for example, already has cubic complexity, and subgraph isomorphism is harder still.
Moreover, the iterative nature of the algorithms in large-scale graph analytics limits the effectiveness of the straightforward MapReduce approach to parallel and distributed processing. Solutions for massive graphs are nevertheless necessary, hence the need for further research. We will present the current state of research and speculate on the direction large-scale graph analytics is likely to take.
Chapter 1: Introduction
Big Data Analytics focuses primarily on methods to accumulate, store, retrieve, and process tremendous amounts of data by carrying out analysis on it. It also concerns itself with how to present the results. Big Data Analytics can make forecasts, spot patterns, and build a better understanding of large-scale datasets. Previously, the problem was how to acquire or even create data. The internet now generates copious amounts of data from social media, satellites, security cameras, and the weather. The dilemmas that big data now faces are how to store the data, how to access it, and how it can be processed, shared, and presented. These are all problems of data analytics. One might ask what breakthroughs have been made to advance big data analytics algorithms, and whether there have been any improvements in parallel and distributed processing. These hurdles would be devastating to big data analytics if there were no advancements in its various disciplines. Some of the subjects in which big data has made progress include statistics, numerical analysis, linear algebra, machine learning, deep learning, data mining, graph theory, graph mining, and parallel and distributed processing.
We will focus on large-scale graph analytics. As the saying goes, a picture is worth a thousand words. Graphs are likewise of extreme importance because of their ability to visualize information in a versatile and expressive manner. They can effectively present and summarize data from social networks like Twitter, web search engines like Google, and genome sequencing in genetics. It is also wise to note that the discipline of graph pattern matching has had a profound effect on the several fields it has been applied to.
Abstractly, we endeavor to find subgraphs of a provided data graph that match a given pattern. Much research has been conducted in this field, and through it, techniques like Subgraph Isomorphism and Regular Expression Matching have been developed. The challenge in big data graph analytics is that scientists always have to work on massive graphs. We shall also look into ways to improve the communication in distributed processing of algorithms for inexact graph pattern matching, and into the effect of different partitionings on runtime and network input/output (I/O). The comprehensive results we have obtained from the research lead us to conclude that the algorithms show excellent scalability, and that min-cut partitioning can boost performance depending on the circumstances, with the added advantage of significantly decreasing network traffic.
Chapter 2: Literature Review
Graphs have become prominent in Computer Science. Consider creating models of intricate entities and systems such as circuits, protein structures, biological networks such as DNA and RNA structures, chemical compounds, web workflows, and the structure of XML documents. Graph mining is becoming a very active subfield of data mining because of the interest in analyzing large amounts of data structured in the form of massive graphs. For further progress in this field, scientists have to pick an efficient platform for their analysis, while system designers and developers need to come up with new approaches to evaluate the performance and other aspects of the field. Most of the existing graphs, like social networks and communication networks, are so enormous that conventional approaches are overwhelmed when analysis is required, which has created the need for new, custom methods for processing and analysis. Twitter, Facebook, and Google continuously produce large graphs whose analysis involves large amounts of data and computationally intensive tasks. These graphs contain interrelationships among their vertices, which makes the system more intricate. Another consideration is the algorithm's ability to cope with increasing graph size, which is usually a challenge. Large-scale graphs make us question our knowledge and thoughts on computer architecture and algorithms. They are dynamic and quite problematic to parallelize, in part because many graph algorithms perform relatively little computation per vertex, so communication costs dominate.
Jeffrey Dean and Sanjay Ghemawat (2004) came up with the MapReduce model at Google, and it became prevalent in several business and scientific circles. The model has been set as a basis for processing big data and has been implemented by Apache as Hadoop, an open-source framework. One of the uses of Hadoop is to carry out analysis on petabytes of data using commodity hardware. We also have to point out that MapReduce has severe limitations in practical applications.
Malewicz, Austern, Bik, Dehnert, Horn, Leiser, and Czajkowski (2010) proposed a new programming model at Google called Pregel, which uses the Bulk Synchronous Parallel model of parallel computing. Pregel implemented a message-passing interface, and the graph is stored in memory distributed across the nodes of the cluster. GraphX, Apache Giraph, GraphLab, and Hama are open-source applications that use the BSP model. While Giraph, GraphX, and GraphLab are customized graph analytics engines, Hama is not; it is a pure BSP engine. A lot of corporate businesses and scientific organizations are demanding analysis of large-scale graphs; therefore, the selection of suitable and efficient platforms for analyzing big data graphs is paramount. Besides, graph-structured data has intricate properties such as power-law distributions, and it is also important to note that frameworks for large-scale graph analytics are still being researched and improved on while new ones are being created.
Range partitioning, the Shimmy pattern, and in-mapper combining are the three design patterns implemented in MapReduce graph algorithms, as presented by Jimmy Lin and Michael Schatz (2010). They developed these design patterns to boost the efficiency of MapReduce-based graph algorithms. Siddharth Suri and Sergei Vassilvitskii (2011) proposed MapReduce algorithms for counting triangles, which can be used to calculate clustering coefficients. We look at two approaches. First, MR-NodeIterator is a custom algorithm explicitly created to compute the number of triangles incident on each vertex. Second, MR-GraphPartition gives a MapReduce version of the algorithm that counts the total number of triangles. Valiant (1990) originally defined the Bulk Synchronous Parallel (BSP) model used in parallel computing; it was created as a bridging model for designing parallel algorithms. Malewicz, Austern, Bik, Dehnert, Horn, Leiser, and Czajkowski (2010) described the analytics engine Pregel that they created at Google. Pregel is a user-friendly framework that allows for distributed processing of graphs over many machines. David Ediger and David Bader (2013) researched the algorithmic effects of the BSP model on graph algorithms, and its performance was contrasted with the open-source software GraphCT. There were notable differences between the BSP model and shared-memory algorithms when they were compared on how each adapted to increasing graph size; the BSP model showed boosted, scalable performance on the Cray XMT. To assess the significance of Apache Giraph and GraphLab, Sherif Sakr (2013) compared open-source software solutions for processing large amounts of graph-structured data. His research pointed out the limitations of MapReduce when used for graph processing. He then presented alternative platforms, still in development, that tackle the graph analytics hurdle.
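To make the idea behind MR-NodeIterator concrete, here is a minimal single-machine Python sketch of the node-iterator principle (for each vertex, enumerate pairs of its neighbors and check for the closing edge). The function name and graph representation are ours for illustration; this is not the distributed MapReduce implementation from the paper.

```python
from itertools import combinations

def count_triangles(adj):
    """Count triangles by iterating over each vertex's neighbor pairs.

    adj: dict mapping each vertex to the set of its neighbors
    (undirected graph). Each triangle is discovered once per vertex,
    so the raw count is divided by 3.
    """
    raw = 0
    for v, neighbors in adj.items():
        for u, w in combinations(sorted(neighbors), 2):
            if w in adj[u]:          # u and w adjacent -> triangle {v, u, w}
                raw += 1
    return raw // 3

# Triangle 1-2-3 plus a pendant edge 3-4:
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
# count_triangles(adj) -> 1
```

The distributed version partitions this per-vertex work across reducers; the "curse of the last reducer" arises because high-degree vertices generate quadratically many neighbor pairs.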
His work presented ShortestPath and PageRank algorithm implementations in GraphLab and Giraph. Apache Giraph is a scalable implementation of the Bulk Synchronous Parallel (BSP) model. It runs on commodity hardware, abstracts away load distribution, and enables parallelism in a fault-tolerant manner.
The Bulk Synchronous Parallel (BSP) model was tested against two other parallel computing techniques, MapReduce and Map-side join, on two separate graph problems: collective classification and single-source shortest path. It was discovered that iterative graph processing with the BSP model outperformed MapReduce to a considerable extent.
Parallel programming is the process of dividing program instructions into portions and then distributing them among various computers for processing and storage. Parallelism can be further broken down into bit-level, instruction-level, task-level, and data-level parallelism. A parallel programming model is a programming model that allows its instructions and data to be analyzed and evaluated in parallel on separate processors. To decide which programming model to pick, you have to be aware of the nature of the problem you are dealing with: is it computation-intensive or data-intensive? As big data became prevalent in recent years, the demand for parallel processing systems that could handle vast amounts of varied and intricate data increased, and scientists became more focused on creating programs and frameworks for data-parallel systems. We will concern ourselves primarily with data-parallel processing systems that process data and instructions distributed over different computing nodes. We are going to look at two big data programming models, their properties, their differences, and how they are used for graph processing: Bulk Synchronous Parallel (BSP) and MapReduce. Zhang, Gao, Gao, and Wang (2012) created an extended version of MapReduce called iterative MapReduce. MapReduce came from a functional programming background and was created for distributed computing that can evaluate large-scale datasets, a data-intensive task. MapReduce primarily carries out the following functions: scheduling tasks and data, splitting the given data, distributing it across different processors, and enabling the process to continue even if one of the processors fails. MapReduce also allows a user to implement the Map and Reduce functions.
The map and reduce functions are then given access to partitions of the input data. MapReduce also provides a system-defined shuffle-and-sort phase and a combiner mechanism for merging intermediate results from mappers. The map phase processes data on the nodes, while the reduce phase combines and summarizes the answers. MapReduce is best suited for problems that require a large number of computations that can be run concurrently on different processors. MapReduce was tuned for reliability rather than efficiency. It was not created for graph processing, but several design patterns, like range partitioning and the Shimmy pattern, can be adopted to boost its performance on iterative graph algorithms. MapReduce was first introduced in 2004, and since then it has stood the test of time. Iterative graph algorithms can be expressed as sequential chains of MapReduce jobs, where each job, once processed, gives out an intermediate solution.
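As a minimal illustration of the map, shuffle-and-sort, and reduce phases described above, the classic word-count computation can be simulated on a single machine. The function names here are ours for illustration and are not part of any framework's API.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: summarize the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big graphs"])))
# counts == {"big": 2, "data": 1, "graphs": 1}
```

A combiner would apply the same summing logic on each mapper's local output before the shuffle, reducing the volume of intermediate data sent over the network.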
Here is what the iterative MapReduce process described above looks like:
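A minimal single-machine sketch of such a chain of MapReduce jobs, using single-source shortest paths (breadth-first distances) as an assumed example: each loop iteration plays the role of one MapReduce job, and note that the entire distance table is re-emitted every round, which is exactly the kind of per-iteration overhead at issue here.

```python
def sssp_map(graph, distances):
    """Mapper: each vertex re-emits its own distance and offers a
    candidate distance (its distance + 1) to every out-neighbor."""
    for v, neighbors in graph.items():
        yield (v, distances[v])
        if distances[v] != float("inf"):
            for u in neighbors:
                yield (u, distances[v] + 1)

def sssp_reduce(pairs):
    """Reducer: keep the minimum candidate distance per vertex."""
    best = {}
    for v, d in pairs:
        best[v] = min(best.get(v, float("inf")), d)
    return best

def shortest_paths(graph, source):
    distances = {v: float("inf") for v in graph}
    distances[source] = 0
    # One MapReduce "job" per iteration, until a fixed point is reached.
    while True:
        updated = sssp_reduce(sssp_map(graph, distances))
        if updated == distances:
            return distances
        distances = updated

graph = {"a": ["b"], "b": ["c"], "c": []}
# shortest_paths(graph, "a") -> {"a": 0, "b": 1, "c": 2}
```

In a real cluster, the intermediate `updated` table would be written to the distributed file system and re-read by the next job, which is where the disk and network costs accumulate.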
The whole structure of the graph and its components must be transferred across the networked computing nodes, from mapper to reducer, for each iteration, and every task on the MapReduce chain must be reassigned. One limitation occurs because each task writes out its intermediate result: the disk I/O becomes very expensive. However, we can overcome this limitation by using an iterative MapReduce extension. We can divide the graph into portions, with each portion stored on a separate machine, and the Reduce function can then access the partitions using remote reads. However, we run the risk of the job halting once it encounters an error; the fault-tolerance property is affected.
Chapter 3: Overview of the Bulk Synchronous Parallel Model and Related Platforms
The Bulk Synchronous Parallel Model
Valiant (1990) proposed this model for parallel computing. He defined the parallel computation process as a series of global supersteps. Each superstep involves a concurrent computation phase carried out in parallel by all the processors. The global communication phase involves all the processors exchanging and synchronizing their data once each superstep's computation is completed. Google came up with a distributed computing framework named Pregel that allows an analysis to continue despite encountering an error, an ability termed fault tolerance. The Pregel framework was inspired by the Bulk Synchronous Parallel (BSP) model.
A computation in the Bulk Synchronous Parallel (BSP) model involves a chain of iterations that we term supersteps.
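The superstep structure can be sketched as a single-machine simulation, using maximum-value propagation (the introductory example from the Pregel paper) as the vertex program. The loop below is ours for illustration: a real BSP system runs the computation phase in parallel across workers, with a barrier synchronization between supersteps.

```python
def bsp_max_value(graph, values):
    """Simulate BSP supersteps: every active vertex sends its value to
    its neighbors; a vertex that receives a larger value adopts it and
    stays active. The computation halts when no messages are in flight."""
    active = set(graph)
    while active:
        # Communication phase: active vertices send messages.
        inbox = {}
        for v in active:
            for u in graph[v]:
                inbox.setdefault(u, []).append(values[v])
        # Computation phase (next superstep): process received messages.
        active = set()
        for v, msgs in inbox.items():
            m = max(msgs)
            if m > values[v]:
                values[v] = m
                active.add(v)
        # Barrier: all vertices finish before the next superstep begins.
    return values

graph = {1: [2], 2: [1, 3], 3: [2]}
# bsp_max_value(graph, {1: 3, 2: 6, 3: 1}) -> {1: 6, 2: 6, 3: 6}
```

Because each vertex only sees messages sent in the previous superstep, the result is deterministic regardless of how the per-vertex work is scheduled within a superstep.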