25 May 2025
Pythonic Feed Libraries
- feedparser
- feedfinder
- feedfinder2
- fastfeedparser
- rss-parser
- reader
- feedgen
- feedsearch
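Of these, feedparser is the long-standing default for reading RSS/Atom. A minimal sketch, with an illustrative feed URL:

```python
# Minimal sketch using feedparser; the feed URL is illustrative.
import feedparser

feed = feedparser.parse("https://example.com/feed.xml")
print(feed.feed.get("title"))
for entry in feed.entries[:5]:
    print(entry.title, entry.link)
```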
Labels: big data, data science, deep learning, machine learning, semantic web, webcrawler, webscraper
30 January 2020
Event Monitoring
Open:
- DiffEngine
- Edgi
- Huginn
- Klaxon
- Lighthouse
- Newsdiffs
- Nytdiff
- Pagelyzer
- Siteseer
- Beehive
- Memorious
Premium:
- ChangeDetect
- ChangeDetection
- ChangeTower
- Diffbot
- Distill
- Fluxguard
- Followthatpage
- OnWebChange
- PageFreezer
- TheWebWatcher
- TimeMachine
- Tackly
- Versionista
- Visualping
- Wachete
- WatchThatPage
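Under the hood, most of these tools reduce to the same loop: fetch a page on a schedule, fingerprint its content, and alert when the fingerprint changes. A minimal sketch in Python, with an illustrative URL and polling interval:

```python
# Poll a page on a schedule, fingerprint the body, and flag changes.
# The URL and interval are illustrative.
import hashlib
import time

import requests

def watch(url, interval=3600):
    last = None
    while True:
        body = requests.get(url, timeout=10).content
        digest = hashlib.sha256(body).hexdigest()
        if last is not None and digest != last:
            print(f"{url} changed")  # real tools diff, notify, and archive here
        last = digest
        time.sleep(interval)

watch("https://example.com")
```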
Labels: big data, data science, devops, event, intelligent web, natural language processing, text analytics, webcrawler, webscraper
18 August 2019
Types of Data Discovery
- CDR
- Emails
- ERP
- Social Media
- Web Logs
- Server Logs
- System Logs
- HTML Pages
- Sales
- Photos
- Videos
- Audios
- Tabulated
- CRM
- Transactions
- XDR
- Sensor Data
- Call Center
- Knowledge Bases
- Google Search
- Google Trends
- News
- Sanctions Data
- Profile Data
Labels: big data, data science, deep learning, intelligent web, machine learning, natural language processing, semantic web, text analytics, webcrawler, webscraper
7 June 2019
Scrapely
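Scrapely learns an extraction template from one annotated example page and then applies it to similarly structured pages, no selectors or XPath required. A minimal sketch along the lines of the project's README; the URLs and field values are examples:

```python
# Train a template from one annotated page, then scrape similar pages;
# adapted from the project's README, URLs and field values are examples.
from scrapely import Scraper

s = Scraper()
s.train("http://pypi.python.org/pypi/w3lib/1.1",
        {"name": "w3lib 1.1", "author": "Scrapy project"})

# The learned template drives extraction on pages with the same layout.
print(s.scrape("http://pypi.python.org/pypi/Django/1.3"))
```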
Labels: big data, data science, deep learning, intelligent web, machine learning, python, semantic web, webcrawler, webscraper
Ahmia
Labels: big data, data science, intelligent web, semantic web, webcrawler, webscraper, webservices
ScrapingHub Splash
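Splash exposes JavaScript rendering through a simple HTTP API. A minimal sketch, assuming a local instance on the default port (e.g. via the scrapinghub/splash Docker image):

```python
# Render a JavaScript-heavy page through Splash's HTTP API; assumes a local
# instance, e.g. started with: docker run -p 8050:8050 scrapinghub/splash
import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 2},
)
print(resp.text[:200])  # the DOM after JavaScript has executed
```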
Labels: big data, data science, JavaScript, natural language processing, python, text analytics, webcrawler, webscraper
20 December 2017
Curlie
Labels: artificial intelligence, big data, data science, intelligent web, linked data, machine learning, semantic web, webcrawler, webscraper
2 April 2017
Web Data Commons
Labels: big data, data science, intelligent web, internet, semantic web, webcrawler, webscraper
8 December 2016
Web Scraping Services
Labels: big data, data science, distributed systems, intelligent web, metadata, natural language processing, semantic web, text analytics, webcrawler, webscraper, webservices
21 May 2016
Open Source Data Science Masters
One doesn't have to have a PhD to be a Data Scientist. Many have transferred into Data Scientist roles from Software Engineering or Data Analyst positions, while others are self-taught on the job. Some move away from the Data Scientist role altogether in favor of the more illustrious Big Data Engineer title, taking on numerous hats as they transition into a more satisfying occupation. It is an occupational hazard, though, when a Data Scientist ends up asking a Big Data Engineer what unit testing is or how to search for data sources, in which case the odd frown, and possibly a questioning glance over merits, is well deserved. The link below provides some relevant tracks for self-training online in data science.
3 May 2015
Common Crawl
Common Crawl provides archived snapshot datasets of the web which can be utilized for a massive array of applications. It is based on the Heritrix archival crawler, making it quite reusable and extensible for open-ended solutions, whether that be building a search engine against years of web page data, extracting specific data from web page documents, or training machine learning algorithms. Common Crawl is available via the AWS public data repository and accessible through the AWS S3 blob store. There are plenty of MapReduce examples available in both Python and Java to make it approachable for developers. Having years of data at one's disposal saves a developer from setting up such crawl processes manually. A small Python sketch of fetching a single page out of the archive follows below.
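As a minimal sketch, assuming the requests and warcio packages and an example crawl ID (substitute any published crawl from the listings), one can look up a URL in the Common Crawl CDX index and pull just that page's WARC record over HTTP:

```python
# Look up a URL in the Common Crawl CDX index, then fetch just that page's
# WARC record from the public bucket via an HTTP range request.
# The crawl ID is an example; substitute any published crawl.
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL_ID = "CC-MAIN-2024-10"
INDEX = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

def fetch_page(url):
    # The index returns one JSON object per capture; take the first.
    resp = requests.get(INDEX, params={"url": url, "output": "json"})
    resp.raise_for_status()
    record = json.loads(resp.text.splitlines()[0])

    # Each capture is a self-contained gzipped WARC record at a byte offset.
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    warc = requests.get(
        f"https://data.commoncrawl.org/{record['filename']}",
        headers={"Range": f"bytes={start}-{end}"},
    )
    for rec in ArchiveIterator(io.BytesIO(warc.content)):
        if rec.rec_type == "response":
            return rec.content_stream().read()

print(fetch_page("example.com")[:200])
```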
Labels: big data, data science, intelligent web, linked data, nosql, semantic web, text analytics, webcrawler, webscraper
28 February 2015
Alternatives To OpenRefine
OpenRefine, which started out as a Google project (Google Refine), has become an almost irreplaceable tool for data cleansing and transformation, an activity generally regarded as data wrangling. One can clean messy data, transform data into various normalizations/denormalizations, parse data from various websites, merge data from various sources, and reconcile against Freebase (now discontinued, with work continuing on Wikidata). However, the tool does have many quirks and limitations. Quite a few alternative tools are available, most of which stem from research and then become commercial products in their own right. Unfortunately, other open source options are often left as experimental and slowly become unavailable for public use. A few interesting free alternatives are listed below, followed by a rough sketch of the same wrangling steps in pandas.
DataWrangler (commercialized into Trifacta)
Karma
Potluck
Exhibit
FusionTables
Many Eyes (discontinued)
DataCleaner
School of Data Online Resources
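For comparison, here is a rough sketch of typical OpenRefine-style wrangling (clean, cluster-and-merge, join) in plain pandas; the file and column names are hypothetical stand-ins for whatever messy sources are at hand:

```python
# OpenRefine-style wrangling in pandas; file and column names are hypothetical.
import pandas as pd

customers = pd.read_csv("customers.csv")

# Clean: trim whitespace and normalize case, like a text facet + transform.
customers["city"] = customers["city"].str.strip().str.title()

# Cluster-and-merge stand-in: collapse known spelling variants.
customers["city"] = customers["city"].replace({"N.Y.": "New York", "Nyc": "New York"})

# Merge data from another source on a shared key.
orders = pd.read_csv("orders.csv")
merged = customers.merge(orders, on="customer_id", how="left")
print(merged.head())
```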
25 December 2013
Web Crawling
A web crawler allows one to search and scrape through document URLs for indexing, based on specific criteria. It also needs to be approached in a netiquette-friendly way, conforming to each site's robots.txt rules. Scalability can be an issue, and different approaches can be devised for an optimal outcome. An algorithm-driven design is vital to meeting requirements and might incorporate an informed or an uninformed search strategy, at times a combination of both along with heuristics. This ultimately implies that, from the crawler's algorithmic point of view, the web is seen as a graph search problem, which lends itself well to linked data. Crawls can be conducted in a distributed fashion using a multiagent approach or as singular agents. Web crawlers can also be used for monitoring website usage and security, and for dispensing information analytics that might otherwise be hidden from a webmaster. There are quite a few open source tools and services available to a developer, and there is always a period in which testing needs to be done locally to work out an ideal and web-friendly approach. There is no one best solution out there if one's needs go beyond what any existing library can offer; in that respect, it really means designing one's own custom search strategy and, perhaps, making it open source to share with the community. A minimal breadth-first sketch in Python follows, before the lists of library options.
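As a minimal sketch of the graph-search view, assuming the requests and beautifulsoup4 packages and an illustrative seed URL, here is a polite breadth-first crawler that honours robots.txt:

```python
# A polite breadth-first crawler: the web as a graph search, constrained by
# robots.txt and a crawl delay. The seed URL and limits are illustrative.
import time
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=20, delay=1.0):
    robots = robotparser.RobotFileParser()
    robots.set_url(urljoin(seed, "/robots.txt"))
    robots.read()

    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        if not robots.can_fetch("*", url):
            continue  # honour robots.txt before fetching
        resp = requests.get(url, timeout=10)
        yield url, resp.text
        time.sleep(delay)  # netiquette: rate-limit requests

        # Expand the graph: enqueue unseen links on the same host.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == urlparse(seed).netloc and link not in seen:
                seen.add(link)
                frontier.append(link)

for url, html in crawl("https://example.com"):
    print(url, len(html))
```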
Python:
Java:
websphinx
JSoup
JSpider
TagSoup
WebEater
JoBo
Hounder
Web Harvest
Aperture
Capek
Spider
Metis
Arachnid
LARM
HyperSpider
Yacy
Norconex HTTP Collector
WebLech
Arale
Ruby:
Nokogiri
Mechanize
scrapi
wombat
hpricot
LinkedData:
Services:
Also, HBase appears in general to be a very good back-end for a crawler architecture, and it plays well with Hadoop.
Obviously, there are a lot more options out there, most of which come at a premium; the majority of the premium options have been left unmentioned here.
high performance distributed web crawler
high performance distributed web crawler survey
learning and discovering structure in web pages
UbiCrawler: A Scalable Fully Distributed Web Crawler
Searching the Web
Labels: artificial intelligence, big data, hadoop, hbase, intelligent web, linked data, webcrawler, webscraper