Mabble Rabble

31 August 2016

Generalization of Machine Learning Pipeline

Open-Domain Question/Answering Pipeline

26 August 2016

Benchmarking Google Cloud DataFlow

Google Cloud Dataflow Benchmark
Dataflow tops Spark in benchmark test
Google Cloud Dataflow
ApacheFlink-DataFlow (use Beam Runner)
ApacheSpark-DataFlow (use Beam Runner)
Beam vs Spark Comparison

24 August 2016

NER Projects

Named Entity Recognizers perform a form of information extraction that focuses on named entities, classifying them into specifically defined categories, and may also make use of entity linking. Annotation is a fundamental aspect of this classification. Quality measures usually include precision, recall, and F1 score (the harmonic mean of precision and recall). Evaluations are also often compared against a gold standard: either a benchmark that is available under reasonable conditions, or the most accurate test possible without restrictions, which is then treated as the ground truth. The lists below highlight a few open source and commercial projects for NER. One can even use the semantic web, in the form of a thesaurus server, to incorporate SKOS schemes as a way of classifying or annotating terms with embedded URIs. For further examples, see applications of PoolParty or Apache Stanbol.

Stanford Named Entity Recognizer

Illinois Named Entity Tagger

OpeNER

Other Libraries for custom NER:
OpenNLP
UIMA
CoreNLP
spaCy
NLTK
SyntaxNet & TensorFlow
DL4J
Apache Lucene
KEA
FastText

SpeedRead
Knowledge Population
Benchmarking
NER Survey
Google NLP API

xLisa
FOX
AIDA
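
One can sketch the precision, recall and F1 measures mentioned above in a few lines of Python. This is only a minimal sketch, assuming entities are scored as exact (start, end, label) matches against a gold standard; the function name ner_scores and the sample spans are hypothetical and do not come from any of the tools listed.

```python
def ner_scores(gold, predicted):
    """Return (precision, recall, f1) for exact-match entity spans.

    Each entity is a hypothetical (start, end, label) tuple; a prediction
    counts as correct only if all three fields match a gold entity.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for illustration only.
gold = [(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "LOC")]
predicted = [(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "LOC"), (12, 13, "ORG")]
precision, recall, f1 = ner_scores(gold, predicted)
```

Here two of the four predictions match the gold standard exactly, so precision is 2/4 and recall is 2/3. Exact matching is the strictest choice; partial-overlap scoring would need a different comparison.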

20 August 2016

Words and Vectors

Deriving vector representations of words has become an active research area in Natural Language Processing, driven by deep learning techniques. Word2Vec is a widely used technique for producing such word embeddings, which in turn are often used to cluster related words. Its input is a text corpus and its output is a set of feature vectors for the words in that corpus. Many libraries provide implementations of word embeddings, including Gensim, DL4J, Spark, and others. The following are some variations on, and close relatives of, the Word2Vec approach.

Doc2Vec (aka Paragraph2Vec, Sentence2Vec, Text2Vec)
Phrase2Vec
Sequence2Vec

Thought2Vec

GloVe
Concept2Vec
Sense2Vec

Word2Vec Paper
Word2Vec Paper2
Word2Vec Paper3
Google Blog on Word2Vec
Background on Word2Vec
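
To make the "text corpus in, one vector per word out" idea concrete, here is a small sketch. Note that it is not Word2Vec: it swaps in plain co-occurrence counts for the learned embeddings so that it runs with the standard library only. The function names, window size and toy corpus are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of neighbour counts.

    A count-based stand-in for learned embeddings: words that appear in
    similar contexts end up with similar vectors.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stocks fell on the market".split(),
]
vectors = cooccurrence_vectors(corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_stocks = cosine(vectors["cat"], vectors["stocks"])
```

In this toy corpus "cat" and "dog" share the same contexts, so cat_dog comes out higher than cat_stocks. A real Word2Vec implementation, such as the one in Gensim, learns dense vectors instead of counting neighbours.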

16 August 2016

Open Semantic Search

This seems to be a new open source project in semantic search, and the range of features it is trying to cover is quite useful. It still appears to be a very new project with much left to implement, but it is worth tracking.

Open Semantic Search

Popular BigData and Machine Learning Libraries

Machine learning libraries and frameworks are constantly evolving, yet no single tool fits every problem. As more libraries appear, the choice will likely grow so large that many of them are eventually abandoned or refactored towards the cloud, where they can meet the larger data processing requirements of scale-out workloads. Even so, certain libraries already have a massive following in industry; some examples are listed below. Languages like Python, Java, Scala, and C++ are the most suited to this kind of work, although languages like Go are not far behind. Most of these libraries are closely tied to progress in academic research, which also gives an indication of which new approaches can be used now and what may be possible in the future.

TensorFlow
DL4J
DataFlow
Flink
Spark
Theano
scikit-learn
GraphLab
Mahout
Spring XD


Data Source	Description
Land Registry 10 Years Data	Build a story visualization of sold property prices and a timeline of trends across the UK
Marvel API	Using the Marvel API and social media, collect, mine and build a comical visualization story for characters
TFL Data Feeds	Track TFL Data across London
Local Urban Data	WhatsOn, Congestion, Events, Hubbub, GeoLocation
Social Media, Blogs, News, Reviews	Product or Brand tracking/engagement on the web
GitHub, Twitter, Meetups, Quora, Stack Overflow, mailing lists, StackShare	Monitor/track technology trends (BigData, ML, Batch/Stream Processing, etc)
Social Media, Blogs, News, Alerts	Monitor and visualize political risk, events, and trends with a story timeline
Google N-Grams, Gutenberg, Wiktionary, WordNet, etc	Spelling checker using Word2Vec/GloVe
Single and Multi-Documents (News Feeds, Journals, Business Documents, etc)	Information Extraction (Summary, Topic Tags, Language Detection, Author, etc)
Santander	Measuring customer satisfaction
HomeDepot	Search relevance of search terms
Companies House, Social Media, Corporate Sites, Compliance, AngelList	Track companies with partners, creditors, suppliers, sponsors, buyers
Walmart	Use historical data to predict store sales
Historical Stock Prices, News	Monitor and track stock prices and news for forecasting
World Bank Datasets & Indicators, UK Office for National Statistics, US Census Data, IMF Data, Census Hub, and others	Track and visualize census data across regions
World University Rankings	Find the best universities of the world
World Food Facts	Find the nutrition facts in foods
Reddit Comments	Storytelling and visualization of contextualized comments on Reddit
Handwriting and Digits	Training a computer to detect handwriting
Faces	Training a computer to detect facial expressions
Twitter and others	Building a profile of how people view the EU
Cats and Dogs Dataset	Distinguish Dogs from Cats
Any music/video stream	Write a stream sampler that takes a random (representative) sample of size k from a stream of values of unknown and possibly very large length. The sampler should work with two kinds of input: values piped directly into the process (stdin), and values generated using a good random source
Expedia Hotels	Predict which hotel type an Expedia customer will book; learning to rank hotels
Amazon Fine Foods	Analyze reviews: What does the product-reviewer graph look like? What words tend to indicate positive and negative reviews? What types of food products get reviewed the most? How does the review score distribution vary across reviewers? What makes a review helpful?
NIPS 2015	Analyze and explore research papers and citations
Data Curation/Scraping + DBpedia	Ontology engineering of a few custom/domain contexts, scraping, building a commonsense graph/reasoning
Anomaly Detection (Spam, Fraud, Fault, Network)	Monitor/Track/Identify Anomalies from Data
Domain Data	Monitor/Track Domain Websites
Images/Videos/Music/Shows/News Feeds/Twitter/Facebook/Reviews	Develop semantic recommendations (processing multiple types of streaming)
FAQ sources	Build a FAQ graph and recommendation for technology
Recipes, Barcodes, etc	Mining ingredients for wellness, nutrition, religion, quantified self, fitness and health
Museum, gallery, and library (WorldCat) datasets, catalogs, Library of Congress, etc	Mining and visualization of connected archives
Relevant contextual dataset	Real-time topic extraction in NLP using LDA to drive recommendations
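
The stream sampler idea in the table above can be sketched with reservoir sampling. The choice of algorithm (the classic Algorithm R) is an assumption, since the table only states the requirement, and the function name reservoir_sample and its rng parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return a uniform random sample of up to k items from an iterable.

    The stream is read once and only k items are held in memory, so its
    length can be unknown and very large.
    """
    reservoir = []
    for i, value in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k values.
            reservoir.append(value)
        else:
            # Keep the new value with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = value
    return reservoir


# Values generated from a random source; a seeded generator keeps runs repeatable.
source = random.Random(42)
generated = (source.random() for _ in range(10_000))
sample = reservoir_sample(generated, 5, rng=random.Random(0))
```

For piped input the same function can be given sys.stdin directly, since a file object iterates line by line. If "a good random source" means cryptographic quality, passing random.SystemRandom() as rng is one option.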
