Mabble Rabble: webscraper

One doesn't have to have a Phd to be a Data Scientist. Many have transferred from Software Engineering or Data Analyst into Data Scientist roles. While others have self-taught on the job. Many move away from Data Scientist role in favor of the more illustrious Big Data Engineer taking on numerous hats as they transition into a more satisfying occupation. Although, it is an occupational hazard if a Data Scientist ends up asking a Big Data Engineer what unit testing is or how to search for data sources in which case the odd frown and possibly a questionable glance over merits would be well deserved. The below link provides some relevant tracks for self-training online in data science.

data science masters

3 May 2015

Common Crawl

Common Crawl provides an archive snapshot dataset of the web which can be utilized for massive array of applications. It is also based on the Heritrix archival crawler making it quite reusable and extensible for open-ended solutions whether that be building a search engine against years of web page data, extracting specific data from web page documents, or even to train machine learning algorithms. Common Crawl is also available via the AWS public data repository and accessible via the AWS S3 blob store. There are plenty of MapReduce examples available in both Python and Java to make it approachable for developers. Having years of data at a developer's disposal saves one from manually setting up such crawler processes.

28 February 2015

Alternatives To OpenRefine

OpenRefine which used to be part of a Google project stream has become an almost irreplaceable tool for data cleansing and transformations. This is a part of activity regarded generally as data wrangling. One can clean messy data, transform data into various normalizations/denormalizations, parse data from various websites, merge data from various sources, and reconcile with Freebase (this has now been discontinued and work continues on Wikidata). However, the tool does have its many quirks and limitations. There are quite a few tools available as alternatives, most of which stem from research then end up becoming commercial products in their own right. Unfortunately, other open source options are only left as experimental and then slowly are made unavailable for public use. A few interesting free alternatives are listed below.

DataWrangler (commercialized into Trifacta)
Karma
Potluck
Exhibit
FusionTables
Many Eyes (discontinued)
DataCleaner

School of Data Online Resources

25 December 2013

Web Crawling

Web crawler allows one to search and scrap through document URLs based on a specific criteria for indexing. It also needs to be approached from a netiquette friendly way conforming to the robots.txt rules. Scalability can be an issue as well as different approaches can be devised for an optimal outcome. An algorithm driven approach is vital for a constructive approach of meeting requirements that might incorporate either an informed or an uninformed search strategy. At times, they even incorporate a combination as well as heuristics. This ultimately implies that, from an algorithmic point of view of a crawler, the web is seen as a graph search and lends itself well with linked data. They could be conducted in a distributed fashion utilizing multiagent approach or as singular agents. Web crawlers can also be used for monitoring websites usage, security, and dispensing information analytics that might otherwise be hidden from a web master. There are quite a few open source tools and services available for a developer. There is always a period in which testing would need to be done locally to work out the ideal and web friendly approach. There is no one best solution out there if the needs go beyond the limitations of any existing libraries can offer. In that respect, it really means designing one's own custom search strategy. And, perhaps, making it open source to share with the community.

Python:

Scrapy

BeautifulSoup

Mechanize
HTML5Lib
PyCreep

Java:

bixo
Jaunt