31 August 2016

Generalization of Machine Learning Pipeline

Open-Domain Question/Answering Pipeline


Data Science Projects

Data Source | Description
Land Registry 10 Years Data | Build a story visualization of sold property prices and a timeline of trends across the UK
Marvel API | Using the Marvel API and social media, collect, mine and build a comical visualization story for characters
TFL Data Feeds | Track TFL data across London
Local Urban Data | WhatsOn, Congestion, Events, Hubbub, GeoLocation
Social Media, Blogs, News, Reviews | Product or brand tracking/engagement on the web
Github, Twitter, Meetups, Quora, Stackoverflow, MailingLists, stackshare | Monitor/track technology trends (BigData, ML, Batch/Stream Processing, etc)
Social Media, Blogs, News, Alerts | Monitor and visualize political risk, events, and trends with a story timeline
Google N-Grams, Gutenberg, Wiktionary, WordNet, etc | Spelling checker using word2vec/glove
Single and Multi-Documents (News Feeds, Journals, Business Documents, etc) | Information extraction (Summary, Topic Tags, Language Detection, Author, etc)
Santander | Measure customer satisfaction
HomeDepot | Search relevance of search terms
Company House, Social Media, Corporate Sites, Compliance, Angelist | Track companies with partners, creditors, suppliers, sponsors, buyers
Walmart | Use historical data to predict store sales
Historical Stock Prices, News | Monitor and track stock prices and news for forecasting
WorldBank Datasets & Indicators, UK Office of National Statistics, US Census Data, IMF Data, Census Hub, and others | Track and visualize census data across regions
World University Rankings | Find the best universities of the world
World Food Facts | Find the nutrition facts in foods
Reddit Comments | Storytelling and visualization of contextualized comments on Reddit
Handwriting and Digits | Train a computer to detect handwriting
Faces | Train a computer to detect facial expressions
Twitter and others | Build a profile of how people view the EU
Cats and Dogs Dataset | Distinguish dogs from cats
Any music/video stream | Write a stream sampler that takes a random (representative) sample of size k from a stream of values of unknown and possibly very large length. The sampler should work with two kinds of input: values piped directly into the process (stdin), and values generated using a good random source
Expedia Hotels | Predict which hotel type an Expedia customer will book; learning to rank hotels
Amazon Fine Foods | Analyze reviews: What does the product-reviewer graph look like? What words tend to indicate positive and negative reviews? What types of food products get reviewed the most? How does the review score distribution vary across reviewers? What makes a review helpful?
NIPS 2015 | Analyze and explore research papers and citations
Data Curation/Scraping + DBPedia | Ontology engineering of a few custom/domain contexts, scraping, building a commonsense graph/reasoning
Anomaly Detection (Spam, Fraud, Fault, Network) | Monitor/track/identify anomalies from data
Domain Data | Monitor/track domain websites
Images/Videos/Music/Shows/News Feeds/Twitter/Facebook/Reviews | Develop semantic recommendations (processing multiple types of streaming)
FAQ sources | Build a FAQ graph and recommendation for technology
Recipes, Barcodes, etc | Mining ingredients for: wellness, nutrition, religion, quantified self, fitness and health
Museum, gallery, and library (worldcat) datasets, catalogs, library of congress, etc | Mining and visualization of connected archives
Relevant contextual dataset | Topic extraction in NLP in real time to do recommendations using LDA
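
The stream sampler project listed above (a random sample of size k from a stream of unknown length) is commonly approached with reservoir sampling. The following is a minimal sketch, assuming the classic "Algorithm R" variant; the names reservoir_sample and random_values are hypothetical helpers introduced here for illustration, not part of any project brief.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from `stream`.

    The stream is consumed once and only k items are held in memory,
    so its total length does not need to be known in advance.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a reservoir
            # entry only if the slot falls inside the reservoir,
            # which happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def random_values(n: int, rng: Optional[random.Random] = None) -> Iterator[float]:
    """Hypothetical generated input: n floats from an OS-backed random source."""
    rng = rng or random.SystemRandom()
    for _ in range(n):
        yield rng.random()
```

For the piped input case, the same function can be called as reservoir_sample(sys.stdin, k) after importing sys, since a file object is iterable line by line. For the generated case, reservoir_sample(random_values(100000), k) works without ever materializing the full stream.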

Public Data Sources

24 August 2016

NER Projects

Named entity recognition (NER) is a form of information extraction that focuses on named entities, classifying them into specifically defined categories, and may make use of entity linking. Annotation is a fundamental aspect of this classification. Quality is usually measured with precision, recall, and the F1 score (the harmonic mean of precision and recall). Results are also often compared against a gold standard: either a benchmark that is available under reasonable conditions, or the most accurate test possible without restrictions, which is treated as the ground truth. Below are a few open source and commercial projects for NER. One can also use the semantic web, in the form of a thesaurus server, to incorporate SKOS schemes as a way of classifying or annotating terms with embedded URIs. Further examples can be seen in applications of PoolParty or Apache Stanbol.
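
As a rough illustration of the quality measures mentioned above, here is a minimal sketch of precision, recall and F1 computed against a gold standard. It assumes exact-match scoring, where an entity counts as correct only if its span and label both match; the Entity tuple layout and the sample annotations are made up for illustration.

```python
from typing import Set, Tuple

# Assumed representation: (start offset, end offset, label).
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity],
                        predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Score predicted entities against gold annotations (exact match)."""
    true_positives = len(gold & predicted)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: three gold entities, four predictions, two correct.
gold = {(0, 5, "PER"), (10, 16, "LOC"), (20, 26, "ORG")}
predicted = {(0, 5, "PER"), (10, 16, "ORG"), (20, 26, "ORG"), (30, 35, "LOC")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# prints: precision=0.50 recall=0.67 f1=0.57
```

Exact matching is the strictest choice; evaluations that give partial credit for overlapping spans would produce different numbers from the same annotations.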

20 August 2016

Words and Vectors

Clustering has become an active research area in Natural Language Processing, driven by deep learning techniques that derive vector representations of meaning. Word2Vec is a widely used technique in this area. Its input is a text corpus and its output is a set of feature vectors for words, so that words can then be grouped by the similarity of their vectors. Many libraries provide implementations of word embeddings, including Gensim, DL4J, Spark, and others. The following are some variations on the same Word2Vec approach.

Doc2Vec (aka Paragraph2Vec, Sentence2Vec, Text2Vec)
Phrase2Vec
Sequence2Vec

16 August 2016

Open Semantic Search

Open Semantic Search appears to be a new open source project in semantic search, and the range of features it aims to cover looks quite useful. It is still a very new project with much left to implement, but it is well worth tracking.

Popular BigData and Machine Learning Libraries

Machine Learning libraries and frameworks are constantly evolving, yet there is no single tool that fits all solutions. As more libraries appear, the choice is likely to grow so large that many of them will eventually be abandoned or refactored towards the cloud, in order to meet the data processing requirements of scale-out. Some libraries already have a massive following in industry; a few examples are listed below. Languages like Python, Java, Scala, and C++ are best suited to this kind of work, though languages like Go are not far behind. Most of these libraries closely follow progress in academic research, which also gives an indication of which new approaches can be used now and what may be possible in the future.

TensorFlow
DL4J
DataFlow
Flink
Spark
Theano
ScikitLearn
GraphLab
Mahout
SpringXD