Mabble Rabble: spark

11 December 2020

Should You Use Flink

Flink is currently a very unstable platform. The team has reinstated FlinkML, which is itself unstable, reworked the graph option, and introduced the Table API. Any stable work right now really depends on Spark. The Flink team needs to make up its mind about stream processing and the abstracted features it wants to provide to the stack. The Python option, in fact, is riddled with bugs. Waiting a while might make the platform more stable, but that depends on the team's goals in the near future. Even the documentation is going slightly pear-shaped.

When a core aspect of a platform changes, it is best to fork it into a completely separate project. That fundamental shift, made without a fork, is what has left the Flink platform so unstable and the documentation so hard to track. Maybe something better will come along in the near future to replace Spark and Flink and be ready for commercial use. So far, though, Spark seems to be the only real contender in the market: slightly unstable in its own right, but flexible enough without the added frustration.

9 April 2018

Deep Learning Pipelines with Spark

BigDL - CPU Optimized
DeepLearning4J - JVM
DeepLearning Pipelines - Integration
MLLIB Perceptron - Integration
TensorflowOnSpark - Integration
TensorFrames - Integration

4 April 2018

Feature Structure Goals in Spark

Classification & Regression
End Goal:

Column of type Double to represent Label
Column of type Vector (Sparse or Dense)
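
The sparse and dense layouts hold the same values in different shapes. The sketch below is a plain-Python illustration that runs without Spark; `to_dense`, `to_sparse`, and the sample `row` are hypothetical names made up for this example. In Spark itself the matching constructors are `Vectors.dense` and `Vectors.sparse` from `pyspark.ml.linalg`.

```python
from typing import List, Tuple

# A sparse vector is stored as (size, indices, values); a dense one as a plain list.
SparseVector = Tuple[int, List[int], List[float]]


def to_dense(sparse: SparseVector) -> List[float]:
    """Expand (size, indices, values) into a full list of floats."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def to_sparse(dense: List[float]) -> SparseVector:
    """Keep only the non-zero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)


# One training row in the target shape: a Double label plus a feature vector.
row = {"label": 1.0, "features": to_sparse([0.0, 3.5, 0.0, 0.0, 2.0])}
print(row)
print(to_dense(row["features"]))
```

The sparse form pays off when most features are zero, for example after one-hot encoding or hashing text.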

Recommendations
End Goal:

Column of Users
Column of Items
Column of Ratings

Unsupervised Learning
End Goal:

Column of Type Vector (Sparse or Dense)

Graph Analytics
End Goal:

DataFrame of Vertices
DataFrame of Edges
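
As a rough picture of the vertices/edges shape, here is a plain-Python sketch using lists of dicts in place of DataFrames. The column names `id`, `src`, and `dst` follow the GraphFrames convention, which is an assumption since the post does not name a library; the sample people and the `relationship` column are invented for illustration.

```python
from collections import Counter

# Vertices: one row per node, keyed by "id".
vertices = [
    {"id": "a", "name": "Alice"},
    {"id": "b", "name": "Bob"},
    {"id": "c", "name": "Carol"},
]

# Edges: one row per directed link; "src" and "dst" refer to vertex ids.
edges = [
    {"src": "a", "dst": "b", "relationship": "follows"},
    {"src": "b", "dst": "c", "relationship": "follows"},
    {"src": "c", "dst": "b", "relationship": "follows"},
]

# Sanity check: every edge endpoint should be a known vertex.
vertex_ids = {v["id"] for v in vertices}
dangling = [
    e for e in edges
    if e["src"] not in vertex_ids or e["dst"] not in vertex_ids
]

# A simple graph statistic computed from the edge table alone.
in_degree = Counter(e["dst"] for e in edges)

print(dangling)
print(dict(in_degree))
```

Keeping the two tables separate means vertex attributes and edge attributes can each grow extra columns without touching the other.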

31 March 2018

Spark Monitoring

Spark Application/Jobs

Logs
Spark UI

JVM

OS/Machine

Cluster

22 March 2018

Blaze Ecosystem

5 March 2018

Beam Capability Matrix

13 February 2018

py4J

22 April 2017

Alluxio

25 March 2017

Crunch & Scrunch

Apache Crunch

Cybersecurity with Spot

Apache Spot

5 March 2017

PipelineIO

21 February 2017

Data Science & Big Data Salary Surveys

big data salary
big data salaries top bi data warehousing
the new tech job paying up 500k on wall street
Harnham Salary Guide 2015
2016 data science salary survey
2015 data science salary survey

20 February 2017

Data Science Cheatsheets

data science machine learning cheat sheets

16 February 2017

Apache Beam

Apache Beam
apache beam unifies batch and streaming for big data

8 January 2017

SMACK Stack

S : Scala and Spark (The Engine)
M : Mesos (The Hardware Scheduler)
A : Akka (The Actor Model)
C : Cassandra (The Storage)
K : Kafka (The Message Broker)

A Brief History of Smack
Smack Hands-On
Smack Made Simple
Smack Guide
why is smack stack all rage lately
Smack Slideshare
Smack Personalization

Alternatives for Stream Analytics:
GearPump
Flink

29 October 2016

Big Data Stream Processing

Spark
Flink
DataFlow/Beam
Streamsets

awesome streaming

24 September 2016

Spark vs Flink

apache spark vs apache flink

17 May 2016

Engine Paradigms & Systems

Paradigm	System	Explanation
MapReduce	Hadoop	Small, recoverable tasks; sequential code inside the map and reduce functions
Dryad/Nephele	Tez	Extends the MapReduce model to a DAG model; backtracking-based recovery
PACTs	Flink	Query-processing runtime embedded in a DAG engine; extends DAGs to cyclic graphs; incremental construction of graphs
RDDs	Spark	Functional implementation of Dryad-style recovery (RDDs); restricted to coarse-grained transformations; direct execution of the API
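
To make the first row of the table concrete, here is a minimal single-process sketch of the MapReduce paradigm as a word count. It is plain Python, not Hadoop code; `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names chosen for this illustration, and the shuffle step that a real framework performs across machines is simulated with a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit one (word, 1) pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


lines = ["spark flink spark", "flink beam"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)
```

Each map call and each reduce call depends only on its own input, which is what makes the individual tasks small and independently recoverable.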

Engine Comparison	Hadoop	Tez	Spark	Flink
API	MapReduce on k/v pairs	k/v pair readers/writers	Transformations on k/v pair collections	Iterative transformations on collections
Paradigm	MapReduce	DAG	RDD	Cyclic dataflows
Optimization	None	None	Optimization of SQL queries	Optimization in all APIs
Execution	Batch sorting	Batch sorting and partitioning	Batch with memory pinning	Stream with out-of-core algorithms

Courtesy of Apache Flink