Showing posts with label hadoop. Show all posts
24 January 2025
Submarine
Labels: big data, Cloud, data science, deep learning, distributed systems, hadoop, machine learning
22 October 2017
26 April 2017
22 April 2017
Comparing Deep Learning Frameworks
Compare Deep Learning
Compare Symbolic Deep Learning
Comparison of Deep Learning Frameworks
Comparison of deep learning software
Comparison of deep learning software resources
Labels: big data, data science, deep learning, hadoop, Java, machine learning, python, scala, semantic web
25 March 2017
2 March 2017
Scalable Machine Learning
Reasons to scale machine learning:
- the training data doesn't fit on a single machine
- training the model takes too long
- the volume of incoming data is too high
- predictions must be served with low latency
How to spend less time on scalable infrastructure:
- choose a fast, lean ML algorithm that works accurately on a single machine
- subsample the data
- scale vertically (use a bigger machine)
- sacrifice some accuracy if that is cheaper
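As an illustration of the subsampling option above, here is a minimal sketch of reservoir sampling, one common way to draw a fixed-size uniform sample from a dataset that is too large to load at once. The function name `reservoir_sample` and the toy stream are assumptions made for this example; they do not come from any particular library.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    The stream is read once and only k rows are held in memory,
    so the full dataset never has to fit on the machine.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical usage: a generator stands in for a large file or table scan.
stream = (f"row-{n}" for n in range(100_000))
subset = reservoir_sample(stream, k=1_000)
print(len(subset))
```

A model trained on `subset` can then be compared against one trained on more data to judge whether the accuracy lost to subsampling is acceptable.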
Horizontal scalability options:
- Hadoop ecosystem with Mahout
- Spark ecosystem with MLlib
- Turi (formerly GraphLab)
- streaming technologies such as Kafka, Storm, AWS Kinesis, Flink, and Spark Streaming
Scalability considerations for a model-building pipeline:
- choose a scalable algorithm such as logistic regression or SVM
- scale up nonlinear algorithms by making approximations
- use a distributed infrastructure to scale out
How to scale predictions in both volume and velocity:
- use infrastructure that lets you scale up the number of workers
- send the same prediction request to multiple workers and return the first response, to optimize prediction velocity
- choose an algorithm that can be parallelized across multiple machines
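The idea of sending the same prediction request to several workers and keeping the first answer can be sketched with Python's standard `concurrent.futures` module. This is only an illustration under assumptions: `make_worker` simulates prediction servers with different latencies, and `hedged_predict` is a hypothetical helper name. In a real system each worker would be a network call to a separate machine.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, List, Sequence, Tuple

Prediction = Tuple[str, float]
Worker = Callable[[List[float]], Prediction]


def make_worker(name: str, delay_s: float) -> Worker:
    """Build a simulated worker: it sleeps, then returns (name, score)."""
    def predict(features: List[float]) -> Prediction:
        time.sleep(delay_s)            # stands in for network and compute latency
        return name, sum(features)     # toy "model": sum of the features
    return predict


def hedged_predict(workers: Sequence[Worker], features: List[float]) -> Prediction:
    """Send the same request to every worker and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(workers))
    try:
        futures = [pool.submit(w, features) for w in workers]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        # If the fastest worker raised, the exception propagates here.
        return next(iter(done)).result()
    finally:
        # Do not block on the slower workers; their results are discarded.
        pool.shutdown(wait=False)


workers = [make_worker("slow", 0.3), make_worker("fast", 0.01)]
print(hedged_predict(workers, [1.0, 2.0]))
```

The trade-off is cost: every request consumes capacity on several workers, so this buys lower latency at the expense of prediction volume.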
Vowpal Wabbit is also an interesting alternative to Hadoop for scalability: it builds models on large datasets without requiring a big data system. Feature selection also comes in handy when one wants to reduce the size of the training data by selecting and retaining the most predictive subset of features. Lasso is a linear algorithm that is often used for feature selection.
With respect to prediction volume and velocity, scaling in volume means being able to handle more data, while scaling in velocity means being able to do it fast enough for the use case. One also has to weigh the trade-off between the speed and the accuracy of predictions.
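To make the Lasso-based feature selection mentioned above concrete, the following is a small pure-Python sketch of Lasso fitted by coordinate descent. It assumes the features are roughly centered and omits the intercept; the function names and the toy data are invented for this example. In practice one would more likely use an existing library implementation.

```python
import random
from typing import List


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X: List[List[float]], y: List[float],
                             alpha: float, n_iters: int = 100) -> List[float]:
    """Minimize (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iters):
        for j in range(d):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero feature carries no information
            # Correlation of feature j with the residual, with feature j's
            # own current contribution added back in.
            rho = sum(col[i] * (residual[i] + w[j] * col[i]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            for i in range(n):
                residual[i] -= delta * col[i]
            w[j] = new_w
    return w


def select_features(w: List[float], tol: float = 1e-9) -> List[int]:
    """Indices of features whose Lasso weight is non-zero."""
    return [j for j, wj in enumerate(w) if abs(wj) > tol]


# Toy data (assumed): only features 0 and 2 actually influence the target.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
y = [3.0 * r[0] - 2.0 * r[2] + rng.gauss(0.0, 0.1) for r in X]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
print(select_features(weights))
```

The L1 penalty drives the weights of uninformative features to exactly zero, so the surviving indices give a smaller feature set to keep in the training data. A larger `alpha` removes more features at the cost of more bias in the remaining weights.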
Labels: big data, data science, distributed systems, hadoop, machine learning, predictive analytics
21 February 2017
Data Science & Big Data Salary Surveys
big data salary
big data salaries top bi data warehousing
the new tech job paying up 500k on wall street
Harnham Salary Guide 2015
2016 data science salary survey
2016 data science salary survey
2015 data science salary survey
Labels: big data, Cloud, data science, flink, hadoop, linked data, machine learning, nosql, semantic web, spark
20 February 2017
17 February 2017
16 February 2017
9 February 2017
8 February 2017
Apache Projects Directory
Labels: apache, big data, Cloud, data science, hadoop, Java, machine learning, scala, software engineering
8 January 2017
Hortonworks Toolset
- Falcon
- Atlas
- Sqoop
- Flume
- Kafka
- NFS
- WebHDFS
- Hadoop
- Hadoop MapReduce
- Hadoop HDFS
- Hadoop YARN
- Pig
- Hive
- HBase
- Accumulo
- Phoenix
- Storm
- Solr
- Spark
- HAWQ
- Zeppelin
- NiFi
- Ranger
- Knox
- Cloudbreak
- ZooKeeper
- Oozie
- Slider
- Tez
- Metron
SMACK Stack
S : Scala and Spark (The Engine)
M : Mesos (The Hardware Scheduler)
A : Akka (The Actor Model)
C : Cassandra (The Storage)
K : Kafka (The Message Broker)
A Brief History of Smack
Smack Hands-On
Smack Made Simple
Smack Guide
why is smack stack all rage lately
Smack Slideshare
Smack Personalization
Alternatives for Stream Analytics:
GearPump
Flink
Labels: akka, big data, cassandra, data science, distributed systems, hadoop, kafka, machine learning, nosql, reactive, scala, spark
5 November 2016
26 October 2016
Scala Data Tools
Below is a list of the general mathematics and machine learning data tools that have emerged in Scala, aside from the Hadoop and Scala APIs for databases.
- Algebird: Twitter’s API for abstract algebra that can be used with almost any Big Data API.
- Factorie: A toolkit for deployable probabilistic modeling, with a succinct language for creating relational factor graphs, estimating parameters, and performing inference.
- Figaro: A toolkit for probabilistic programming.
- H2O: A high-performance, in-memory distributed compute engine for data analytics. Written in Java with Scala and R APIs.
- Relate: A thin database access layer focused on performance.
- ScalaNLP: A suite of Machine Learning and numerical computing libraries. It is an umbrella project for several libraries, including Breeze, for machine learning and numerical computing, and Epic, for statistical parsing and structured prediction.
- ScalaStorm: A Scala API for Storm.
- Scalding: Twitter’s Scala API around Cascading that popularized Scala as a language for Hadoop programming.
- Scoobi: A Scala abstraction layer on top of MapReduce with an API that’s similar to Scalding’s and Spark’s.
- Slick: A database access layer developed by Typesafe.
- Spark: The emerging standard for distributed computation in Hadoop environments, as well as in Mesos clusters and on single machines (“local” mode).
- Spire: A numerics library that is intended to be generic, fast, and precise.
- Summingbird: Twitter’s API that abstracts computation over Scalding (batch mode) and Storm (event streaming).