Mabble Rabble: hadoop

Choose an ML algorithm that is fast and lean enough to work accurately on a single machine
Subsample the data
Scale vertically
Sacrifice some accuracy if that is cheaper
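
As an illustration of the subsampling option, here is a minimal sketch of reservoir sampling, one common way to draw a uniform random subset from a dataset too large to hold in memory. The technique is an assumption on my part, since the notes do not say which sampling method is meant, and the function name `reservoir_sample` is hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)      # randint is inclusive on both ends
            if j < k:
                sample[j] = item       # keep the new item with probability k / (i + 1)
    return sample

rows = range(100_000)                  # stand-in for a dataset streamed from disk
subset = reservoir_sample(rows, k=1000, seed=42)
```

The stream is read once and only `k` items are kept at any time, so the same code works whether the source is a file, a database cursor, or a generator.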

Horizontal scalability options:

Hadoop ecosystem with Mahout
Spark ecosystem with MLlib
Turi from GraphLab
Streaming Technologies like Kafka, Storm, AWS Kinesis, Flink, Spark Streaming

Scalability consideration for a model-building pipeline:

Choose a scalable algorithm such as logistic regression or SVM
Scale up nonlinear algorithms by making approximations
Use a distributed infrastructure to scale out
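
One possible reading of "making approximations" is kernel approximation: replace a nonlinear kernel with an explicit random feature map so that a fast linear model can be trained on the mapped data. The sketch below uses random Fourier features for an RBF kernel. This choice of technique is an assumption, not something the notes specify, and the names `make_rff`, `rbf_kernel`, and `dot` are hypothetical.

```python
import math
import random

def make_rff(dim, n_features, gamma, seed=0):
    """Build a random feature map z with z(x).z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)     # std-dev of the Gaussian frequencies
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def transform(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]
    return transform

def rbf_kernel(x, y, gamma):
    """Exact RBF kernel, used here only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

gamma = 0.5
transform = make_rff(dim=3, n_features=2000, gamma=gamma, seed=0)
x, y = [0.1, 0.4, -0.3], [0.3, 0.1, 0.2]
exact = rbf_kernel(x, y, gamma)
approx = dot(transform(x), transform(y))
```

With more random features the inner product tracks the exact kernel more closely, at the cost of a wider feature vector. That is the same speed-versus-accuracy trade-off the notes describe.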

How to scale predictions in both volume and velocity:

Use infrastructure that scales with the number of workers
Send the same prediction request to multiple workers and return the first response to optimize prediction velocity
Choose an algorithm that can parallelize across multiple machines
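
The second point above, sending one request to several workers and keeping the first answer, can be sketched with Python's standard `concurrent.futures` module. The workers here are simulated with threads and artificial delays. `predict_on_worker` and `first_prediction` are hypothetical names, and the averaging "model" is only a placeholder.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def predict_on_worker(worker_id, delay, features):
    """Hypothetical worker: sleeps to simulate latency, then returns a score."""
    time.sleep(delay)
    score = sum(features) / len(features)    # placeholder for a real model
    return worker_id, score

def first_prediction(features, worker_delays):
    """Send the same request to every worker and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(worker_delays))
    futures = [pool.submit(predict_on_worker, wid, delay, features)
               for wid, delay in worker_delays.items()]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)                # do not block on the slower workers
    return next(iter(done)).result()

worker_id, score = first_prediction(
    [0.2, 0.4, 0.6], {"a": 0.30, "b": 0.02, "c": 0.20})
```

This buys lower latency by spending extra compute: every worker still runs the prediction, and all but one result is discarded.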

An interesting alternative to Hadoop for scalability is Vowpal Wabbit, which builds models on large datasets without requiring a big data system. Feature selection also comes in handy for reducing the size of the training data by retaining only the most predictive subset of features. Lasso is a linear algorithm that is often used for feature selection. For predictions, scaling in volume means being able to handle more data, while scaling in velocity means being able to do it fast enough for the use case. One also has to weigh the trade-off between the speed and the accuracy of predictions.
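
To make the Lasso remark concrete, here is a minimal sketch of Lasso fitted by cyclic coordinate descent on synthetic data. The L1 penalty drives the weights of uninformative features to exactly zero, which is why it doubles as a feature selector. The solver, the synthetic data, and the value of `alpha` are all illustrative assumptions. In practice one would more likely use an existing library implementation.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping to exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)                       # equals y - Xw while w is all zeros
    for _ in range(n_iters):
        for j in range(d):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue                     # skip an all-zero feature
            rho = sum(v * (r + v * w[j]) for v, r in zip(col, residual)) / n
            new_wj = soft_threshold(rho, alpha) / z
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                w[j] = new_wj
    return w

rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
# Only features 0 and 3 carry signal; the other three are pure noise.
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
```

On this toy data the two informative features should be the only ones left in `selected`. Their weights come out somewhat smaller in magnitude than the true coefficients because the penalty shrinks them.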

GlusterFS

21 February 2017

Data Science & Big Data Salary Surveys

big data salary
big data salaries top bi data warehousing
the new tech job paying up 500k on wall street
Harnham Salary Guide 2015
2016 data science salary survey
2015 data science salary survey

20 February 2017

Data Science Cheatsheets

data science machine learning cheat sheets

17 February 2017

Analytical Task Workflows

Celery
Akka
Luigi
Airflow
Dask
Azkaban
Oozie
Aurora
Falcon
Chronos
Sparrow
Pinball
BigDataScript
Makeflow

16 February 2017

Apache Beam

Apache Beam
apache beam unifies batch and streaming for big data

9 February 2017

Big Data Watch

Airflow
Apex
Arrow
Beam
BlinkDB
Cascading
DL4J
Drill
Druid
Flink
Flume
Gearpump
GlusterFS
H2O
Hadoop
Heron
Ignite
Impala
Kafka
Kudu
Mahout
NiFi
Phoenix
PrestoDB
Samza
Scalding
Spark
Storm
StreamSets
ZooKeeper
Oryx

hadoop ecosystem table

8 February 2017

Google BigData Interoperability

bigdata-interop

Apache Projects Directory

8 January 2017

Hortonworks Toolset

Falcon
Atlas
Sqoop
Flume
Kafka
NFS
WebHDFS
Hadoop
Hadoop MapReduce
Hadoop HDFS
Hadoop YARN
Pig
Hive
HBase
Accumulo
Phoenix
Storm
Solr
Spark
HAWQ
Zeppelin
NiFi
Ranger
Knox
Cloudbreak
ZooKeeper
Oozie
Slider
Tez
Metron

Hortonworks Projects

Stream Processing Engines

windowing in big data streams spark flink kafka akka

SMACK Stack

S : Scala and Spark (The Engine)
M : Mesos (The Hardware Scheduler)
A : Akka (The Actor Model)
C : Cassandra (The Storage)
K : Kafka (The Message Broker)

A Brief History of Smack
Smack Hands-On
Smack Made Simple
Smack Guide
why is smack stack all rage lately
Smack Slideshare
Smack Personalization

Alternatives for Stream Analytics:
GearPump
Flink

5 November 2016

Python MapReduce

PySpark
MrJob
Luigi
DPark
Streamparse
Dumbo

26 October 2016

Scala Data Tools

Below is a list of general mathematics and machine learning tools that have emerged in Scala, aside from the Hadoop tools and the Scala APIs for databases.

Algebird: Twitter’s API for abstract algebra that can be used with almost any Big Data API.
Factorie: A toolkit for deployable probabilistic modeling, with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.
Figaro: A toolkit for probabilistic programming.
H2O: A high-performance, in-memory distributed compute engine for data analytics. Written in Java with Scala and R APIs.
Relate: A thin database access layer focused on performance.
ScalaNLP: A suite of Machine Learning and numerical computing libraries. It is an umbrella project for several libraries, including Breeze, for machine learning and numerical computing, and Epic, for statistical parsing and structured prediction.
ScalaStorm: A Scala API for Storm.
Scalding: Twitter’s Scala API around Cascading that popularized Scala as a language for Hadoop programming.
Scoobi: A Scala abstraction layer on top of MapReduce with an API that’s similar to Scalding’s and Spark’s.
Slick: A database access layer developed by Typesafe.
Spark: The emerging standard for distributed computation in Hadoop environments, as well as in Mesos clusters and on single machines (“local” mode).
Spire: A numerics library that is intended to be generic, fast, and precise.
Summingbird: Twitter’s API that abstracts computation over Scalding (batch mode) and Storm (event streaming).

Mabble Rabble

24 January 2025

Submarine

22 October 2017

Big Data Ecosystem

26 April 2017

Big Data Processing Frameworks

22 April 2017

Comparing Deep Learning Frameworks

25 March 2017

Crunch & Scrunch

Cybersecurity with Spot

2 March 2017

Scalable Machine Learning