Mabble Rabble

21 May 2016

Open Source Data Science Masters

One doesn't need a PhD to be a Data Scientist. Many have moved from Software Engineering or Data Analyst positions into Data Scientist roles, while others have taught themselves on the job. Many also move away from the Data Scientist role in favor of the more illustrious Big Data Engineer role, wearing numerous hats as they transition into a more satisfying occupation. It is an occupational hazard, though, if a Data Scientist ends up asking a Big Data Engineer what unit testing is or how to search for data sources; in that case the odd frown, and possibly a questioning look at their merits, would be well deserved. The link below provides some relevant tracks for self-training online in data science.

data science masters

17 May 2016

Engine Paradigms & Systems

Paradigm	System	Explanation
MapReduce	Hadoop	Small recoverable code tasks, sequential tasks inside map and reduce functions
Dryad/Nephele	Tez	Extends the MapReduce model to a DAG model, backtracking-based recovery
PACTs	Flink	Embedded query processing runtime in a DAG engine, extends DAGs to cyclic graphs, incremental construction of graphs
RDDs	Spark	Functional implementation of Dryad recovery (RDDs), restricted to coarse-grained transformations, direct execution of API
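
As a rough illustration of the MapReduce row above, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are made up for this example, and a real cluster would run the map and reduce tasks on separate machines and re-run any task that fails.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently, which is what makes the
    # map tasks small and individually recoverable.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
counts = word_count(["spark flink tez", "hadoop spark", "flink spark"])
```

The work inside each map or reduce call is sequential; parallelism comes only from running many such calls side by side.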

Engine Comparison	Hadoop	Tez	Spark	Flink
API	MapReduce on k/v pairs	k/v pairs readers/writers	transformation on k/v pair collections	iterative transformation on collections
Paradigm	MapReduce	DAG	RDD	Cyclic Dataflows
Optimization	none	none	optimization of SQL queries	optimization in all APIs
Execution	batch sorting	batch sorting and partitioning	batch with memory pinning	stream with out-of-core algorithms

Courtesy of Apache Flink
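
The DAG paradigm in the tables above can be sketched in a few lines of plain Python. This example is not taken from Flink, Tez, or any other engine listed; `run_dag`, the task names, and the sample data are hypothetical, and the sketch only shows the core idea that a task runs once all of its upstream tasks have produced output.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished.

    tasks: task name -> function taking a list of upstream results
    dependencies: task name -> list of upstream task names
    """
    results = {}
    # static_order() yields each task only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = tasks[name](upstream)
    return results


# A small diamond-shaped job: one source, two branches, one join.
tasks = {
    "read": lambda _: [1, 2, 3, 4],
    "square": lambda inputs: [x * x for x in inputs[0]],
    "evens": lambda inputs: [x for x in inputs[0] if x % 2 == 0],
    "join": lambda inputs: sum(inputs[0]) + sum(inputs[1]),
}
dependencies = {
    "read": [],
    "square": ["read"],
    "evens": ["read"],
    "join": ["square", "evens"],
}
results = run_dag(tasks, dependencies)
```

A plain MapReduce job is the special case of a two-stage chain. Engines that support cyclic dataflows go one step further and allow iteration, which this acyclic sketch does not attempt.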

Graph Comparison

Analytical
Type	Backend	Supported Frameworks	Context of Use
Giraph	Hadoop/HDFS	Spark/Hadoop	Data Processing for Analytics
GraphX	Titan, Neo4j, HDFS	Spark	Data Processing for Analytics (in-memory)
GraphLab	Hadoop/HDFS	Spark/Hadoop	Data Processing for Analytics, using PowerGraph and GAS models

Operational
Type	Backend	Supported Frameworks	Context of Use
Cayley	MongoDB or LevelDB	Custom Implementation in Go	Knowledge Graph
Titan	Cassandra, HBase, HDFS	TinkerPop & RDF SPARQL	Massive Knowledge Graphs OLAP/OLTP (now part of DataStax)
Neo4j	Custom	TinkerPop	Data Visualization, Web Browsing, Portfolio Analytics, Gene Sequencing, Mobile Social Application
OrientDB	Custom	TinkerPop & RDF SPARQL	Embedded and Standalone, Knowledge Graph, Multimodel (Document + Graph)

Semantic
Type	Backend	Supported Frameworks	Context of Use
Blazegraph and MapGraph	Custom	Sesame RDF SPARQL TinkerPop	Massive Knowledge Graphs on GPU, includes support for Semantic Web Standards of W3C (used by Wikidata, a Wikimedia project)
Stardog	Custom	RDF SPARQL	Semantic data use cases in the cloud (third-party)
Ontotext GraphDB	Custom	Sesame Jena RDF SPARQL	Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C (used by BBC, Euromoney, Financial Times, etc.)
Virtuoso	Custom/Hybrid	Sesame Jena RDF SPARQL	Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C (used by DBpedia)
AllegroGraph	Custom	Sesame RDF SPARQL	Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C
OpenCog	Custom	Semantic Knowledge	Massive Artificial General Intelligence Graph Knowledge Base

wikidata graph comparison

OLTP/Graph Databases
OLTP/Analytical Databases
Graph Database as a Service
Native Semantic Graph Databases
Graph Query / Interfaces

16 May 2016

Streaming

Below is a curated list of stream processing frameworks, applications, and beyond.

awesome-streaming

28 April 2016

Data Science of Colors

Colors play an important role in data science by clarifying and visualizing the information gained for a data-driven story. They magnify and project insights from data, adding much-needed value. The following are some highlighted links on color palettes.

kuler
colorlovers
colorbrewer2

27 April 2016

Public Data Sources for Machine Learning

UCI	Collection of benchmark datasets for regression and classification tasks	UCI Machine Learning Repository
KDD	Extended version of UCI datasets	UCI KDD Extended Version
DELVE	Platform for comparative assessment of regression and classification tasks	DELVE
DMOZ	Collection of links for different datasets	DMOZ Directory
KDnuggets	Collection of links for different datasets	Further Datasets
ChemDB	Chemical data that can be used as datasets for machine learning	ChemDB
Golem	Datasets for learning prediction rules	Golem Datasets
NDR	Datasets for nonlinear dimensionality reduction	Nonlinear Dimensionality Reduction
General	A list of dataset links by category	further datasets
AWS Public	Public list of datasets via S3	large dataset repository
Datahub	Public list of datasets	datahub datasets
BigML	Curated list of datasets	bigML datasets
Curated GitHub	Curated categorized list of datasets on GitHub	public datasets on github
Wikipedia list	Curated categorized list of datasets on Wikipedia	datasets of ML
Data Science	Data Science Projects	19 free public data sources
Data Science	Data Science Projects	data science datasets

3 April 2016

Hadoop Ecosystem

hadoop ecosystem table
Apache Hadoop
Apache Hadoop on Wikipedia
understanding hadoop ecosystem
bigdata ecosystem table
awesome hadoop
Hadoop Summit

deep dive amazon elastic mapreduce
AWS re:Invent 2015 Big Data Analytics sessions
AWS re:Invent 2016
Amazon Global Summits

31 March 2016

JavaScript Ecosystem

The JavaScript ecosystem is huge, and it gets even bigger when one includes Node.js applications. Keeping track of new trends can be difficult, as it grows in so many different directions at such a rapid pace. One can keep abreast of the changes through community meetups and GitHub. Choosing the right library for an application can pose a dilemma, as one is spoilt for choice when it comes to JavaScript. However, there is no official standardization in place apart from ECMAScript, which is actively being worked on. The community's libraries are like a mushroom cloud that just keeps getting bigger. As new libraries evolve, others die out or lose traction and are left without support. There seems to be no formal quality assurance or standards-driven approval process of the kind some other languages have. The JavaScript community appears to be driven very much by trends, which at times can even dictate the choice of libraries used in applications as a form of value or impact to the business. Naturally, this also adds a degree of risk. The current trend is toward reactive applications. The following links provide an aggregated view of the JavaScript ecosystem and the trends for the various libraries.

Javascripting
Libscore
Libscore Search
List of JavaScript libraries
awesome javascript
whats happening in the javascript ecosystem
what to expect from javascript in 2016 beyond the browser