23 May 2016

Employment Checks

When an employer relies on a third party to run KYC checks for compliance, why do data-entry mistakes still happen? Aren't validation and verification part of compliance? It is even more shocking when candidates are checked against federal databases using misspelt variants of their names, even though the candidate supplied the correct information. In most circumstances, such a lack of due diligence should make the third party legally liable for a penalty, along with the employer that uses it to conduct the checks. Intelligent agents are needed in all aspects of data entry, both to protect individuals' personal information and to avoid data discrepancies in which records get linked to the wrong person through human error. Why must we be forced to trust another human to process our personal information, especially someone we know nothing about and have not been able to vet for integrity ourselves? Sharing personal information is a risky affair, especially when a candidate has no idea which third parties handle the processing and so has no basis for trust; even then, such trust is questionable at best.
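
As a rough illustration of the kind of automated validation argued for above, the sketch below compares the name a candidate supplied with the name an operator keyed in before any database lookup is made. The function names and the 0.85 similarity threshold are assumptions made for this example, not part of any compliance standard or vendor product.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def check_entered_name(provided: str, entered: str, threshold: float = 0.85) -> str:
    """Classify a keyed-in name against the name the candidate supplied.

    Returns "match", "probable typo", or "mismatch". The threshold is an
    illustrative assumption and would need tuning on real data.
    """
    a, b = normalize_name(provided), normalize_name(entered)
    if a == b:
        return "match"
    similarity = SequenceMatcher(None, a, b).ratio()
    return "probable typo" if similarity >= threshold else "mismatch"


if __name__ == "__main__":
    print(check_entered_name("Katherine Johnson", "Kathrine Johnson"))
```

Anything other than "match" would be sent back for correction rather than submitted, so a misspelt variant never reaches the database search in the first place.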

21 May 2016

Open Source Data Science Masters

One doesn't need a PhD to be a Data Scientist. Many have moved into Data Scientist roles from Software Engineering or Data Analyst positions, while others have taught themselves on the job. Many also move away from the Data Scientist role in favor of the more illustrious Big Data Engineer, wearing numerous hats as they transition into a more satisfying occupation. It is an occupational hazard, though, if a Data Scientist ends up asking a Big Data Engineer what unit testing is or how to search for data sources; in that case the odd frown, and possibly a questioning glance over merits, would be well deserved. The link below provides some relevant tracks for self-training online in data science.

17 May 2016

Engine Paradigms & Systems

Paradigm | System | Explanation
MapReduce | Hadoop | Small recoverable code tasks, sequential tasks inside map and reduce functions
Dryad/Nephele | Tez | Extends the MapReduce model to a DAG model, backtracking-based recovery
PACTs | Flink | Embedded query processing runtime in a DAG engine, extends DAGs to cyclic graphs, incremental construction of graphs
RDDs | Spark | Functional implementation of Dryad recovery (RDDs), restricted to coarse-grained transformations, direct execution of API

Engine Comparison

Aspect | Hadoop | Tez | Spark | Flink
API | MapReduce on k/v pairs | k/v pairs readers/writers | transformation on k/v pair collections | iterative transformation on collections
Paradigm | MapReduce | DAG | RDD | Cyclic Dataflows
Optimization | none | none | optimization of SQL queries | optimization in all APIs
Execution | batch sorting | batch sorting and partitioning | batch with memory pinning | stream with out-of-core algorithms

Courtesy of Apache Flink
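
To make the MapReduce entry above more concrete, here is a minimal single-process sketch of the paradigm: a map function emits key/value pairs, the pairs are grouped by key, and a reduce function folds each group. It does not use Hadoop's API; `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all the values collected for a single key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, group-by-key (the "shuffle"), then reduce, in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sorting the keys mirrors the sort-based grouping of a batch engine.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["to be or not to be", "to see or not to see"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

In a real engine each map or reduce call is a small task that can be re-run on its own after a failure, which is what the "small recoverable code tasks" description refers to; the sketch leaves out distribution and recovery entirely.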

Graph Comparison

Analytical

Type | Backend | Supported Frameworks | Context of Use
Giraph | Hadoop/HDFS | Spark/Hadoop | Data Processing for Analytics
GraphX | Titan, Neo4J, HDFS | Spark | Data Processing for Analytics (in-memory)
GraphLab | Hadoop/HDFS | Spark/Hadoop | Data Processing for Analytics, using PowerGraph and GAS models

Operational

Type | Backend | Supported Frameworks | Context of Use
Cayley | MongoDB or LevelDB | Custom Implementation in Go | Knowledge Graph
Titan | Cassandra, HBase, HDFS | Tinkerpop & RDF, SPARQL | Massive Knowledge Graphs OLAP/OLTP (now part of Datastax)
Neo4J | Custom | Tinkerpop | Data Visualization, Web Browsing, Portfolio Analytics, Gene Sequencing, Mobile Social Application
OrientDB | Custom | Tinkerpop & RDF, SPARQL | Embedded and Standalone, Knowledge Graph, Multimodel (Document + Graph)

Semantic

Type | Backend | Supported Frameworks | Context of Use
Blazegraph and MapGraph | Custom | Sesame, RDF, SPARQL, Tinkerpop | Massive Knowledge Graphs on GPU, includes support for Semantic Web Standards of W3C (used by Wikidata, a Wikimedia project)
Stardog | Custom | RDF, SPARQL | Semantic data use cases in the cloud (third-party)
OntoText GraphDB | Custom | Sesame, Jena, RDF, SPARQL | Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C (used by BBC, Euromoney, Financial Times, etc.)
Virtuoso | Custom/Hybrid | Sesame, Jena, RDF, SPARQL | Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C (used by DBPedia)
Allegrograph | Custom | Sesame, RDF, SPARQL | Optimized as a Semantic Graph Database based on Semantic Web Standards of W3C
OpenCog | Custom | Semantic Knowledge | Massive Artificial General Intelligence Graph Knowledge Base

OLTP/Graph Databases
OLTP/Analytical Databases
Graph Database as a Service
Native Semantic Graph Databases
Graph Query / Interfaces

16 May 2016

Streaming

Below is a curated list of stream processing frameworks, applications, and beyond.

awesome-streaming

28 April 2016

Data Science of Colors

Colors play an important role in data science: they clarify and visualize the information gained in a data-driven story, and they magnify and project insights from the data, adding much-needed value. The following are some highlighted links on color palettes.

kuler
colorlovers
colorbrewer2
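
For readers who want to experiment before picking a palette from the sites above, the following is a minimal sketch that builds a sequential palette by linearly interpolating between two hex colors in RGB space. The function names are made up for this example, and plain RGB interpolation is an assumption of convenience: it is not perceptually uniform, and curated palettes such as those on colorbrewer2 are designed by hand rather than generated this way.

```python
from typing import List, Tuple


def hex_to_rgb(color: str) -> Tuple[int, int, int]:
    """Convert '#rrggbb' into an (r, g, b) tuple of integers in 0..255."""
    color = color.lstrip("#")
    return int(color[0:2], 16), int(color[2:4], 16), int(color[4:6], 16)


def rgb_to_hex(rgb: Tuple[int, int, int]) -> str:
    """Convert an (r, g, b) tuple back into a lowercase '#rrggbb' string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def sequential_palette(start: str, end: str, steps: int) -> List[str]:
    """Return `steps` colors evenly spaced between `start` and `end`."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    a, b = hex_to_rgb(start), hex_to_rgb(end)
    palette = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the start color, 1.0 at the end color
        r, g, bl = (round(x + (y - x) * t) for x, y in zip(a, b))
        palette.append(rgb_to_hex((r, g, bl)))
    return palette


if __name__ == "__main__":
    print(sequential_palette("#000000", "#646464", 3))
```

A sequential palette like this suits ordered values such as counts or rates; categorical data is usually better served by clearly distinct hues.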

27 April 2016

Public Data Sources for Machine Learning

Source | Description | Link
UCI | Collection of benchmark datasets for regression and classification tasks | UCI Machine Learning Repository
KDD | Extended version of UCI datasets | UCI KDD Extended Version
DELVE | Platform for comparative assessment of regression and classification tasks | DELVE
DMOZ | Collection of links for different datasets | DMOZ Directory
KDNuggets | Collection of links for different datasets | Further Datasets
ChemDB | Chemical data that can be used as datasets for machine learning | ChemDB
Golem | Trying to learn rules for prediction | Golem Datasets
NDR | Data sets for nonlinear dimensionality reduction | Nonlinear Dimensionality Reduction
General | A list of dataset links by category | further datasets
AWS Public | Public list of datasets via S3 | large dataset repository
Datahub | Public list of datasets | datahub datasets
BigML | Curated list of datasets | bigML datasets
Curated Github | Curated categorized list of datasets on GitHub | public datasets on github
Wikipedia list | Curated categorized list of datasets on Wikipedia | datasets of ML
Data Science | Data Science Projects | 19 free public data sources
Data Science | Data Science Projects | data science datasets