31 August 2016
Data Science Projects
Data Source | Description |
Land Registry 10 Years Data | Build a story visualization of sold property prices and a timeline of trends across the UK |
Marvel API | Using the Marvel API and social media, collect, mine and build a comical visualization story for characters |
TFL Data Feeds | Track TFL Data across London |
Local Urban Data | WhatsOn, Congestion, Events, Hubbub, GeoLocation |
Social Media, Blogs, News, Reviews | Product or Brand tracking/engagement on the web |
GitHub, Twitter, Meetups, Quora, Stack Overflow, mailing lists, StackShare | Monitor/track technology trends (BigData, ML, Batch/Stream Processing, etc) |
Social Media, Blogs, News, Alerts | Monitor and visualize political risk, events, and trends with a story timeline |
Google N-Grams, Gutenberg, Wiktionary, WordNet, etc | Spelling checker using word2vec/GloVe |
Single and Multi-Documents (News Feeds, Journals, Business Documents, etc) | Information Extraction (Summary, Topic Tags, Language Detection, Author, etc) |
Santander | Measuring customer satisfaction |
HomeDepot | Search relevance of search terms |
Companies House, Social Media, Corporate Sites, Compliance, AngelList | Track companies with partners, creditors, suppliers, sponsors, buyers |
Walmart | Use historical data to predict store sales |
Historical Stock Prices, News | Monitor and track stock prices and news for forecasting |
World Bank Datasets & Indicators, UK Office for National Statistics, US Census Data, IMF Data, Census Hub, and others | Track and visualize census data across regions |
World University Rankings | Find the best universities of the world |
World Food Facts | Find the nutrition facts in foods |
Reddit Comments | Storytelling and visualization of contextualized comments on Reddit |
Handwriting and Digits | Training a computer to detect handwriting |
Faces | Training a computer to detect facial expressions |
Twitter and others | Building a profile of how people view the EU |
Cats and Dogs Dataset | Distinguish Dogs from Cats |
Any music/video stream | Write a stream sampler that takes a random (representative) sample of size k from a stream of values of unknown and possibly very large length. The sampler should work with two kinds of input: values piped directly into the process (stdin), and values generated using a good random source |
Expedia Hotels | Predict which hotel type an Expedia customer will book; learning to rank hotels |
Amazon Fine Foods | Analyze reviews: What does the product-reviewer graph look like? Which words tend to indicate positive and negative reviews? Which types of food products get reviewed the most? How does the review score distribution vary across reviewers? What makes a review helpful? |
NIPS 2015 | Analyze and explore research papers and citations |
Data Curation/Scraping + DBPedia | Ontology engineering of a few custom/domain contexts, scraping, building a commonsense graph/reasoning |
Anomaly Detection (Spam, Fraud, Fault, Network) | Monitor/Track/Identify Anomalies from Data |
Domain Data | Monitor/Track Domain Websites |
Images/Videos/Music/Shows/News Feeds/Twitter/Facebook/Reviews | Develop semantic recommendations (processing multiple types of streaming) |
FAQ sources | Build a FAQ graph and recommendation for technology |
Recipes, Barcodes, etc | Mining ingredients for wellness, nutrition, religion, quantified self, fitness, and health |
Museum, gallery, and library (WorldCat) datasets, catalogs, Library of Congress, etc | Mining and visualization of connected archives |
Relevant contextual dataset | Real-time topic extraction using LDA to drive recommendations |
Public Data Sources
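The stream sampler project in the table above can be approached with reservoir sampling. The sketch below is one possible take, not a reference solution: it assumes the classic "Algorithm R", and the names `reservoir_sample` and `random_values`, as well as the choice of `random.SystemRandom` as the "good random source", are illustrative assumptions.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Uses Algorithm R: one pass, O(k) memory, stream length not needed.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an old one
            # only when the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def random_values(n: int) -> Iterator[float]:
    """Yield n floats from the OS entropy source (assumed 'good' source)."""
    source = random.SystemRandom()
    for _ in range(n):
        yield source.random()
```

For the stdin case, something like `reservoir_sample((line.rstrip("\n") for line in sys.stdin), k)` should work after `import sys`, since the function only needs an iterable and never holds more than k items at once.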
Labels: big data, data science, deep learning, linked data, machine learning, natural language processing, text analytics
26 August 2016
24 August 2016
NER Projects
Named Entity Recognizers are a form of information extraction that identifies named entities in text and classifies them into predefined categories, optionally using entity linking. Annotation is a fundamental aspect of this classification. Quality is usually measured with precision, recall, and F1 score (the harmonic mean of precision and recall). Results are also often evaluated against a gold standard: either a benchmark that is available under reasonable conditions, or the most accurate test possible without restrictions, which is treated as the ground truth. The projects below are a few open source and commercial options for NER. One can also use the semantic web, in the form of a thesaurus server, to incorporate SKOS schemes as a way of classifying or annotating terms with embedded URIs. For further examples, see applications of PoolParty or Apache Stanbol.
OpeNER
Other Libraries for custom NER:
OpenNLP
UIMA
CoreNLP
spaCy
NLTK
SyntaxNet & TensorFlow
DL4J
Apache Lucene
KEA
FastText
SpeedRead
Knowledge Population
Benchmarking
NER Survey
Google NLP API
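The precision, recall, and F1 measures mentioned in this post can be sketched in a few lines. This is a minimal illustration assuming strict, entity-level matching, where a predicted entity counts only if its span and label both match the gold standard exactly; the `(start, end, label)` representation and the function name `prf1` are assumptions made for the example, and real benchmarks may score differently.

```python
from typing import Dict, Set, Tuple

# Hypothetical entity representation: (start token, end token, label).
Entity = Tuple[int, int, str]


def prf1(gold: Set[Entity], predicted: Set[Entity]) -> Dict[str, float]:
    """Entity-level precision, recall and F1 with exact span+label match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy annotations, invented purely for illustration.
gold = {(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "LOC"), (12, 13, "ORG")}
scores = prf1(gold, predicted)
print(scores)
```

In this toy case two of the four predictions are correct and two of the three gold entities are found, so precision is 0.5 and recall is 2/3. The mislabelled span `(5, 6)` is penalised twice, once as a false positive and once as a false negative, which is typical of strict matching.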
Labels: data science, linked data, metadata, natural language processing, semantic web, text analytics
20 August 2016
Words and Vectors
Clustering has become an active research area in Natural Language Processing, driven by deep learning techniques that derive vector representations of meaning. Word2Vec is a widely used technique here: it is not a clustering algorithm itself, but the word vectors it learns place similar words close together, which makes them easy to cluster. Its input is a text corpus and its output is a set of feature vectors for words. Many libraries provide implementations of word embeddings, including Gensim, DL4J, and Spark. The following are some variations on the same Word2Vec approach.
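To make the corpus-in, vectors-out idea a little more concrete, here is a minimal sketch of the first step in the skip-gram formulation of Word2Vec, which is an assumption on my part since the post does not say which formulation it has in mind. It only builds the (center, context) training pairs from a tokenized sentence; the neural network that turns those pairs into vectors is left out, and the function name `skipgram_pairs` is hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) pairs for every token in the sentence.

    Each word is paired with the words at most `window` positions away.
    """
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Words that keep appearing with the same context words end up with similar vectors once a model is trained on such pairs, which is why the resulting vectors lend themselves to clustering.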
16 August 2016
Open Semantic Search
This looks like a new open source project in semantic search, and the range of features it aims to cover is quite useful. It is still a very new project with much left to implement, but it is well worth tracking.
Popular BigData and Machine Learning Libraries
Machine Learning libraries and frameworks are constantly evolving, and no single tool fits every solution. As more libraries appear, the choice will likely grow so large that many of them are shunned or refactored towards the cloud to meet the data processing requirements of scale-out. Some libraries, however, already have a massive following in industry; a few examples are listed below. Languages like Python, Java, Scala, and C++ are best suited to this kind of work, although languages like Go are not far behind. Most of these libraries closely follow progress in academic research, which also gives an indication of which new approaches can be used now and what may be possible in the future.
TensorFlow
DL4J
DataFlow
Flink
Spark
Theano
scikit-learn
GraphLab
Mahout
SpringXD