One doesn't need a PhD to be a Data Scientist. Many have moved into Data Scientist roles from Software Engineering or Data Analyst positions, while others have taught themselves on the job. Many also move away from the Data Scientist role in favor of the more illustrious Big Data Engineer, wearing numerous hats as they transition into a more satisfying occupation. It is, however, an occupational hazard if a Data Scientist ends up asking a Big Data Engineer what unit testing is or how to search for data sources, in which case the odd frown and possibly a questioning glance over their merits would be well deserved. The link below provides some relevant tracks for online self-training in data science.
21 May 2016
17 May 2016
Engine Paradigms & Systems
Paradigm | System | Explanation |
---|---|---|
MapReduce | Hadoop | Small recoverable code tasks, sequential tasks inside map and reduce functions |
Dryad/Nephele | Tez | Extends the MapReduce model to a DAG model, backtracking-based recovery |
PACTs | Flink | Embedded query processing runtime in a DAG engine, extends DAGs to cyclic graphs, incremental construction of graphs |
RDDs | Spark | Functional implementation of Dryad recovery (RDDs), restricted to coarse-grained transformations, direct execution of API |
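To make the MapReduce row above more concrete, here is a minimal single-process sketch of the paradigm in plain Python. It does not use Hadoop or any of the engines listed. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration only. It shows small sequential tasks inside a map function and a reduce function, with a grouping step between them that a real engine would carry out across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the engine in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently, which is what makes the
    # individual tasks small and easy to re-run after a failure.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data big engines", "data flows"])
print(counts)
```

Because each map call depends only on its own record and each reduce call only on one key, a failed task can be repeated in isolation, which is roughly what "small recoverable code tasks" refers to in the table.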
Engine Comparison
 | Hadoop | Tez | Spark | Flink |
---|---|---|---|---|
API | MapReduce on k/v pairs | k/v pair readers/writers | transformations on k/v pair collections | iterative transformations on collections |
Paradigm | MapReduce | DAG | RDD | cyclic dataflows |
Optimization | none | none | optimization of SQL queries | optimization in all APIs |
Execution | batch sorting | batch sorting and partitioning | batch with memory pinning | stream with out-of-core algorithms |
Courtesy of Apache Flink
Graph Comparison
Analytical
Type | Backend | Supported Frameworks | Context of Use |
---|---|---|---|
Giraph | Hadoop/HDFS | Spark/Hadoop | Data Processing for Analytics |
GraphX | Titan, Neo4J, HDFS | Spark | Data Processing for Analytics (in-memory) |
GraphLab | Hadoop/HDFS | Spark/Hadoop | Data Processing for Analytics, using PowerGraph and GAS models |
Operational
Type | Backend | Supported Frameworks | Context of Use |
---|---|---|---|
Cayley | MongoDB or LevelDB | Custom Implementation in Go | Knowledge Graph |
Titan | Cassandra, HBase, HDFS | Tinkerpop & RDF SPARQL | Massive Knowledge Graphs OLAP/OLTP (now part of Datastax) |
Neo4J | Custom | Tinkerpop | Data Visualization, Web Browsing, Portfolio Analytics, Gene Sequencing, Mobile Social Application |
OrientDB | Custom | Tinkerpop & RDF SPARQL | Embedded and Standalone, Knowledge Graph, Multimodel (Document + Graph) |
Semantic
Type | Backend | Supported Frameworks | Context of Use |
---|---|---|---|
Blazegraph and MapGraph | Custom | Sesame RDF SPARQL Tinkerpop | Massive Knowledge Graphs on GPU, includes support for Semantic Web Standards of W3C (used by Wikidata, a Wikimedia project) |
Stardog | Custom | RDF SPARQL | Semantic data use cases in the cloud (third-party) |
OntoText GraphDB | Custom | Sesame Jena RDF SPARQL | Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C (used by BBC, Euromoney, Financial Times, etc.) |
Virtuoso | Custom/Hybrid | Sesame Jena RDF SPARQL | Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C (used by DBPedia) |
Allegrograph | Custom | Sesame RDF SPARQL | Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C |
OpenCog | Custom | Semantic Knowledge | Massive Artificial General Intelligence Graph Knowledge Base |
OLTP/Graph Databases
OLTP/Analytical Databases
Graph Database as a Service
Native Semantic Graph Databases
Graph Query / Interfaces
16 May 2016
Streaming
Below is a curated list of stream processing frameworks, applications, and beyond.
awesome-streaming
28 April 2016
Data Science of Colors
Colors play an important role in data science, clarifying and visualizing the information gained in a data-driven story. They magnify and project insights from data, adding much-needed value. The following are some highlighted links on color palettes.
kuler
colorlovers
colorbrewer2
27 April 2016
Public Data Sources for Machine Learning
Source | Description | Link |
---|---|---|
UCI | Collection of benchmark datasets for regression and classification tasks | UCI Machine Learning Repository |
KDD | Extended version of UCI datasets | UCI KDD Extended Version |
DELVE | Platform for comparative assessment of regression and classification tasks | DELVE |
DMOZ | Collection of links for different datasets | DMOZ Directory |
KDNuggets | collection of links for different datasets | Further Datasets |
ChemDB | chemical data that can be used as datasets for machine learning | ChemDB |
Golem | trying to learn rules for prediction | Golem Datasets |
NDR | Data sets for nonlinear dimensionality reduction | Nonlinear Dimensionality Reduction |
General | A list of dataset links by category | further datasets |
AWS Public | public list of datasets via S3 | large dataset repository |
Datahub | public list of datasets | datahub datasets |
BigML | curated list of datasets | bigML datasets |
Curated Github | curated categorized list of datasets on github | public datasets on github |
wikipedia list | curated categorized list of datasets on wikipedia | datasets of ML |
Data Science | Data Science Projects | 19 free public data sources |
Data Science | Data Science Projects | data science datasets |
3 April 2016
31 March 2016
JavaScript Ecosystem
The JavaScript ecosystem is huge, and it gets even bigger when one includes Node.js applications. Keeping track of new trends can be difficult, as it grows in so many different directions at such a rapid pace. One can keep abreast of the changes through community meetups and even GitHub. In fact, choosing the right library for an application can pose a dilemma, as one is spoilt for choice when it comes to JavaScript. However, there is no official standardization in place apart from the actively developed ECMAScript. It is like a mushroom cloud of libraries in the community that just keeps getting bigger. As new libraries evolve, others die out or lose traction through a complete lack of support. There seems to be no formal quality assurance or standards-driven approval process of the kind some other languages have. The JavaScript community appears to be driven very much by trends, which at times can even dictate the choice of libraries used in applications as a form of value or impact to the business. Naturally, this also adds a degree of risk. The current trend is toward reactive applications. The following links provide an aggregated view of the JavaScript ecosystem and the trends for the various libraries.
Javascripting
Libscore
Libscore Search
List of JavaScript libraries
awesome javascript
whats happening in the javascript ecosystem
what to expect from javascript in 2016 beyond the browser
Labels: intelligent web, interface design, JavaScript, nodejs, programming, software engineering, web design