Scalable Machine Learning
Reasons to scale machine learning:
- the training data doesn't fit on a single machine
- the time to train a model is too long
- the volume of incoming data is too high
- predictions have low-latency requirements
How to spend less time on a scalable infrastructure:
- choose the right ML algorithm: one that is fast and lean and can work accurately on a single machine
- subsample the data (see the sketch below)
- scale vertically
- sacrifice accuracy if it is cheaper
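For instance, subsampling can keep training on a single machine practical. A minimal sketch with scikit-learn follows; the synthetic dataset and the 5% sampling fraction are assumptions for illustration.

```python
# Minimal sketch, assuming a large in-memory dataset: subsample rows so a
# single-machine model stays practical. Data and fraction are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.random((1_000_000, 20))
y = rng.integers(0, 2, size=1_000_000)

# Keep a random 5% of the rows (stratified sampling omitted for brevity).
idx = rng.choice(len(X), size=len(X) // 20, replace=False)
model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
```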
Horizontal scalability options:
- the Hadoop ecosystem with Mahout
- the Spark ecosystem with MLlib (see the sketch below)
- Turi (formerly GraphLab)
- streaming technologies such as Kafka, Storm, AWS Kinesis, Flink, and Spark Streaming
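As a concrete illustration of the Spark option, a minimal MLlib sketch is below; the input path and the column names ("f1", "f2", "label") are assumptions, not from the original post.

```python
# Minimal sketch of scaling out model training with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("scalable-ml").getOrCreate()

df = spark.read.parquet("hdfs:///data/training.parquet")  # hypothetical path
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

# Fitting is distributed across the Spark executors.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
```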
Scalability considerations for a model-building pipeline:
- choose a scalable algorithm such as logistic regression or a linear SVM
- scale up nonlinear algorithms by making approximations (see the sketch below)
- use a distributed infrastructure to scale out
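One common approximation is to replace a kernel SVM with an explicit kernel map followed by a linear model. A minimal sketch using scikit-learn's Nystroem transformer is below; the synthetic data and parameters are illustrative assumptions.

```python
# Minimal sketch: approximate an RBF-kernel SVM with an explicit kernel map
# (Nystroem) plus a linear classifier, which scales far better than fitting
# a full kernel SVM on all the data.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=50_000, n_features=30, random_state=0)

approx_svm = make_pipeline(
    Nystroem(kernel="rbf", n_components=300, random_state=0),
    SGDClassifier(loss="hinge"),  # linear SVM trained with SGD
)
approx_svm.fit(X, y)
```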
How to scale predictions in both volume and velocity:
- use an infrastructure that can scale out across a number of workers
- send the same prediction request to multiple workers and return the first response to optimize prediction velocity (see the sketch below)
- choose an algorithm that can parallelize across multiple machines
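A minimal sketch of the "send the same request to several workers and take the first answer" idea, using Python's concurrent.futures; predict_on_worker and the worker URLs are hypothetical placeholders.

```python
# Minimal sketch of hedged predictions: the same request goes to several
# workers and whichever answers first wins.
import random
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def predict_on_worker(worker_url, features):
    # Hypothetical stand-in for an HTTP/RPC call to a model server.
    time.sleep(random.uniform(0.01, 0.1))  # simulate variable latency
    return {"worker": worker_url, "prediction": 1}

def hedged_predict(worker_urls, features):
    with ThreadPoolExecutor(max_workers=len(worker_urls)) as pool:
        futures = [pool.submit(predict_on_worker, u, features) for u in worker_urls]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:  # a real system would also cancel in-flight calls
            f.cancel()
        return next(iter(done)).result()

print(hedged_predict(["http://worker-a", "http://worker-b"], features=[0.3, 0.7]))
```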
A curious alternative to Hadoop for scalability is Vowpal Wabbit, which builds models on large datasets without requiring a big data system. Feature selection also comes in handy when one wants to reduce the size of the training data by selecting and retaining the most predictive subset of features; Lasso is a linear algorithm that is often used for feature selection (see the sketch below). With respect to prediction velocity and volume, scaling in volume means being able to handle more data, while scaling in velocity means being able to do it fast enough for the use case. One also has to weigh the trade-off between the speed and accuracy of predictions.
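As an illustration of Lasso-driven feature selection, the sketch below keeps only the features with non-zero Lasso coefficients; the synthetic data and the alpha value are assumptions.

```python
# Minimal sketch of Lasso-based feature selection: features whose Lasso
# coefficients shrink to zero are dropped before training the final model.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=10_000, n_features=200,
                       n_informative=10, random_state=0)

selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)  # only the most predictive features remain
```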
Labels: big data, data science, distributed systems, hadoop, machine learning, predictive analytics
Text Summarization with Deep Learning
- Text Summarization with Tensorflow
- Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
- Sequence-to-Sequence with Attention Model for Text Summarization
- A Neural Attention Model for Abstractive Sentence Summarization
- Generating News Headlines with Recurrent Neural Networks
- ATTSum: Joint Learning of Focusing and Summarization with Neural Attention
- A Convolutional Attention Network for Extreme Summarization of Source Code
- Sequence-to-Sequence RNNs for Text Summarization
- Learning Summary Statistic for Approximate Bayesian Computation via Deep Neural Network
- LCSTS: A Large Scale Chinese Short Text Summarization Dataset
- Deep Dependency Substructure-Based Learning for Multidocument Summarization
- Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization
- Query-oriented Unsupervised Multi-document Summarization via Deep Learning
- Abstractive Multi-Document Summarization via Phrase Selection
- Modelling, Visualising and Summarizing Documents with a Single Convolutional Neural Network
- SRRank: Leveraging Semantic Roles for Extractive Multi-Document Summarization