Showing posts with label hadoop. Show all posts

2 March 2017

Scalable Machine Learning

Reasons to scale machine learning:
  • the training data doesn't fit on a single machine
  • training the model takes too long
  • the volume of incoming data is too high
  • predictions have low-latency requirements
How to spend less time on scalable infrastructure:
  • choose a fast, lean ML algorithm that is accurate enough while running on a single machine
  • subsample the data
  • scale vertically (use a bigger machine)
  • sacrifice some accuracy if that is cheaper
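
As a rough illustration of the subsampling option above, here is a minimal sketch of reservoir sampling, one common way to draw a uniform random subset from data that is too large to load at once. The function name `reservoir_sample` and the toy `rows` stream are assumptions made for this example, not part of any particular library.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical data source: any iterable works, including a file read line by line.
rows = range(1_000_000)
subset = reservoir_sample(rows, k=1000)
```

Only `k` items are held in memory at any time, so the full dataset never has to fit on the machine doing the sampling.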
Horizontal scalability options:
  • Hadoop ecosystem with Mahout
  • Spark ecosystem with MLlib
  • Turi (formerly GraphLab)
  • streaming technologies such as Kafka, Storm, AWS Kinesis, Flink, and Spark Streaming
Scalability considerations for a model-building pipeline:
  • choose a scalable algorithm such as logistic regression or a linear SVM
  • scale up nonlinear algorithms by making approximations
  • use distributed infrastructure to scale out
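
One reason linear models such as logistic regression scale well is that they can be trained one example at a time, so the training data never has to be held in memory. The following is a minimal sketch of that idea using plain stochastic gradient descent; the names `train_online`, `predict` and `synthetic_stream`, the learning rate, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
from typing import Iterable, Iterator, List, Tuple

Example = Tuple[List[float], int]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_online(examples: Iterable[Example], n_features: int, lr: float = 0.1) -> List[float]:
    """One pass of SGD for logistic regression; the last weight is the bias."""
    weights = [0.0] * (n_features + 1)
    for x, y in examples:
        # zip stops at len(x), so the bias term is added separately.
        z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
        error = sigmoid(z) - y  # gradient of the log loss with respect to z
        for i, xi in enumerate(x):
            weights[i] -= lr * error * xi
        weights[-1] -= lr * error
    return weights


def predict(weights: List[float], x: List[float]) -> int:
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else 0


def synthetic_stream(n: int, seed: int = 0) -> Iterator[Example]:
    """Hypothetical data source: label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 1 if x[0] + x[1] > 0 else 0


weights = train_online(synthetic_stream(20000), n_features=2)
```

Because `train_online` consumes a generator, the same loop would work unchanged on examples read lazily from disk or from a message queue.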
How to scale predictions in both volume and velocity:
  • use infrastructure that can scale by increasing the number of workers
  • send the same prediction request to multiple workers and return the first response, to optimize prediction velocity
  • choose an algorithm that can be parallelized across multiple machines
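
The idea of sending the same prediction request to several workers and keeping the first answer can be sketched with Python's standard `concurrent.futures` module. The workers here are hypothetical stand-ins that sleep to simulate latency and return a dummy score; in a real system each would call a separate prediction service.

```python
import concurrent.futures as cf
import time
from typing import Callable, List

Worker = Callable[[List[float]], float]


def make_worker(delay_s: float) -> Worker:
    """Build a fake prediction worker with a fixed simulated latency."""
    def worker(features: List[float]) -> float:
        time.sleep(delay_s)  # stand-in for network and queueing delay
        return sum(features) / len(features)  # stand-in for a model score
    return worker


def first_response(workers: List[Worker], features: List[float]) -> float:
    """Send the same request to every worker and return the first successful result."""
    pool = cf.ThreadPoolExecutor(max_workers=len(workers))
    try:
        futures = [pool.submit(worker, features) for worker in workers]
        last_error = None
        for future in cf.as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # a failed worker should not fail the request
                last_error = exc
        raise RuntimeError("all workers failed") from last_error
    finally:
        # Do not block on the slower workers; their results are simply discarded.
        pool.shutdown(wait=False)


workers = [make_worker(0.5), make_worker(0.01), make_worker(0.2)]
start = time.perf_counter()
score = first_response(workers, [0.2, 0.4, 0.6])
elapsed = time.perf_counter() - start
```

The trade-off is extra load: every request costs as many predictions as there are workers, in exchange for latency close to that of the fastest one.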
Vowpal Wabbit is an interesting alternative to Hadoop for scalability: it builds models on large datasets without requiring a big data system. Feature selection also comes in handy for reducing the size of the training data, by selecting and retaining only the most predictive subset of features. Lasso, a linear model with an L1 penalty, is often used for feature selection. For predictions, scaling in volume means being able to handle more data, while scaling in velocity means being able to do it fast enough for the use case. One also has to weigh the trade-off between the speed and the accuracy of predictions.
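
To make the Lasso remark concrete, here is a minimal sketch of Lasso fitted by coordinate descent with soft thresholding, written in plain Python. It assumes no intercept and roughly zero-mean features; the function names, the penalty `alpha=0.1`, and the synthetic data (six features, of which only features 0 and 3 carry signal) are assumptions for illustration. In practice one would more likely use an existing library implementation.

```python
import random
from typing import List


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold, clipping to exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X: List[List[float]], y: List[float],
                             alpha: float, n_iters: int = 100) -> List[float]:
    """Minimize (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw, since w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual, excluding its own contribution.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


# Synthetic data: only features 0 and 3 influence the target.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0, 0.1) for row in X]

w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
```

Features whose weight is driven to exactly zero can be dropped from the training data, which is what makes the L1 penalty useful for shrinking wide datasets.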

GlusterFS


8 January 2017

Hortonworks Toolset

  • Falcon
  • Atlas
  • Sqoop
  • Flume
  • Kafka
  • NFS
  • WebHDFS
  • Hadoop
  • Hadoop MapReduce
  • Hadoop HDFS
  • Hadoop YARN
  • Pig
  • Hive
  • HBase
  • Accumulo
  • Phoenix
  • Storm
  • Solr
  • Spark
  • HAWQ
  • Zeppelin
  • NiFi
  • Ranger
  • Knox
  • Cloudbreak
  • ZooKeeper
  • Oozie
  • Slider
  • Tez
  • Metron

Stream Processing Engines



SMACK Stack

S : Scala and Spark (The Engine)
M : Mesos (The Hardware Scheduler)
A : Akka (The Actor Model)
C : Cassandra (The Storage)
K : Kafka (The Message Broker)

A Brief History of Smack
Smack Hands-On
Smack Made Simple
Smack Guide
Why Is the SMACK Stack All the Rage Lately
Smack Slideshare
Smack Personalization

Alternatives for Stream Analytics:
GearPump
Flink

26 October 2016

Scala Data Tools

Below is a list of the general mathematics and machine learning data tools that have emerged in Scala, aside from the Hadoop and Scala APIs for databases.
  • Algebird: Twitter’s API for abstract algebra that can be used with almost any Big Data API.
  • Factorie: A toolkit for deployable probabilistic modeling, with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.
  • Figaro: A toolkit for probabilistic programming.
  • H2O: A high-performance, in-memory distributed compute engine for data analytics. Written in Java with Scala and R APIs.
  • Relate: A thin database access layer focused on performance.
  • ScalaNLP: A suite of Machine Learning and numerical computing libraries. It is an umbrella project for several libraries, including Breeze, for machine learning and numerical computing, and Epic, for statistical parsing and structured prediction.
  • ScalaStorm: A Scala API for Storm.
  • Scalding: Twitter’s Scala API around Cascading that popularized Scala as a language for Hadoop programming.
  • Scoobi: A Scala abstraction layer on top of MapReduce with an API that’s similar to Scalding’s and Spark’s.
  • Slick: A database access layer developed by Typesafe. 
  • Spark: The emerging standard for distributed computation in Hadoop environments, as well as in Mesos clusters and on single machines (“local” mode).
  • Spire: A numerics library that is intended to be generic, fast, and precise.
  • Summingbird: Twitter’s API that abstracts computation over Scalding (batch mode) and Storm (event streaming).