80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 2
Tariq Abughofa
Posted on December 29, 2019
This is a continuation of the resources I listed in part 1.
This part includes the following four categories:
- Machine Learning & Algorithms in Big Data
- Data Processing Systems
- Real-time Processing
- Graph Processing
Machine Learning and Algorithms in Big Data
Recommending items to more than a billion people: An article about collaborative filtering at Facebook.
Machine Learning with Sparkling Water: using H2O, the machine learning framework, with Apache Spark.
MLlib: a scalable machine learning library on Apache Spark from Stanford/Databricks.
TensorFlow: the famous large-scale machine learning library.
Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.
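To give a feel for what collaborative filtering means before diving into the articles above, here is a minimal sketch in plain Python. It is a simple user-based neighborhood approach on made-up ratings, not the method used in the linked articles, and the names `ratings`, `cosine`, and `recommend` are hypothetical.

```python
from math import sqrt

# Toy ratings: user -> {item: rating}. Made-up data for illustration only.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 5},
    "cat": {"B": 1, "C": 2, "D": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, k=1):
    """Rank items the user has not rated by similarity-weighted ratings
    from all other users, and return the top k."""
    mine = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        for item, rating in theirs.items():
            if item not in mine:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("ann", ratings, k=2))
```

At the scale the articles discuss, comparing every pair of users like this is not feasible, which is part of why large-scale systems turn to distributed algorithms.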
Data Processing Systems
Airflow: a workflow management system by Airbnb.
Oozie: a workflow management system for Hadoop by Yahoo!.
BlinkDB: analytics on large-scale data from Berkeley.
FlumeJava: a library for developing parallel data pipelines from Google.
MapReduce: the Google framework behind Hadoop.
Pig: an engine that supports Pig Latin, a procedural dataflow language for Hadoop, from Yahoo!.
Hive (resource#2): A data warehouse on top of Hadoop.
The Dataflow Model: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.
MillWheel: stream processing engine from Google.
Photon: A tool to join data streams at Google.
Kinesis: stream processing engine from Amazon.
Apache Flink (resource#2): stream and batch processing engine from TU Berlin.
Trill: incremental data analytics engine from Microsoft.
Kafka: the famous distributed messaging system from LinkedIn.
Apache Spark: the famous stream and batch processing engine. It uses distributed memory abstractions: RDDs, DataFrames, and Datasets. With Spark 2, it moved to Structured Streaming (resource#2) (3) (4), and the Spark SQL library allows SQL queries over Spark DataFrames. The whole Databricks blog is a great resource for the project.
SparkR: a Spark library for writing processing applications in R.
GraphX (resource#2): distributed graph processing with Spark's RDDs.
GraphFrames: distributed graph processing with Spark's Dataframes.
SnappyData (resource#2): a transactional datastore on top of Spark.
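Many of the systems in this section build on or react to the MapReduce model, so a tiny illustration may help. The following is a single-process sketch of the map, shuffle, and reduce phases for word counting. It is not Hadoop's or Google's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big graph data"]))
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; here everything runs in one process to keep the idea visible.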
Real-time Processing
Samza (resource#2) (3) (4): Stream processing engine from LinkedIn.
Storm: real-time data processing engine from Twitter.
Heron: the new Storm from Twitter.
Real-time data processing at Facebook.
Pulsar: real-time data processing engine from eBay.
Graph Processing
WTF: the Who to Follow service at Twitter.
GraphJet: real-time recommendation graph engine at Twitter.
Pregel: large-scale graph processing engine at Google.
Giraph: open-source implementation of Pregel, scaled up by Facebook.
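Pregel and Giraph follow a vertex-centric model: in each superstep, every vertex reads the messages sent to it, updates its own value, and sends messages to its neighbors, until no messages remain. Below is a minimal single-process sketch of that idea, propagating the maximum value through a graph. It is not the Pregel or Giraph API; `pregel_max` and the toy graph are hypothetical, and every neighbor is assumed to appear as a key in `graph`.

```python
def pregel_max(graph, values):
    """graph: vertex -> list of out-neighbors; values: vertex -> number.
    Returns the values after every vertex has learned the largest value
    reachable through incoming paths."""
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])

    # Later supersteps: only vertices that received messages do work.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays halted
            best = max(messages)
            if best > values[v]:
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
        inbox = outbox
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1}))
```

A real engine partitions the vertices across machines and runs each superstep in parallel, with a barrier between supersteps; the loop above only mimics that control flow.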