
17 June 2025

Disruptive Search

Google's stranglehold on the search engine market, with a near-monopoly exceeding 90% of global queries, represents an unprecedented concentration of power over information access. This dominance is not merely about market share; it dictates what billions of people see, influences commerce, and shapes the digital landscape. However, this immense power is increasingly challenged by a growing public distrust fueled by Google's checkered past with data breaches and its often-criticized approach to data protection compliance. This vulnerability presents a fertile ground for a truly disruptive competitor, one capable of not just challenging but ultimately dismantling Google's search model.

Google's reputation has been repeatedly marred by significant data privacy incidents. The 2018 Google+ data breach, which exposed the personal information of over 52 million users, vividly demonstrated systemic flaws in its data security. Beyond direct breaches, Google has faced substantial regulatory backlash. The French CNIL's €50 million fine in 2019 for insufficient transparency and invalid consent for ad personalization, and subsequent fines for making it too difficult to refuse cookies, highlight a consistent pattern of prioritizing its advertising-driven business model over user privacy. These incidents, coupled with ongoing concerns about data collection through various services and the implications of broad surveillance laws, have eroded trust among a significant portion of the global internet user base.

To truly disrupt and ultimately destroy Google's search model, a competitor would need to embody a radical departure from the status quo. Its foundation must be absolute, unwavering user privacy. This means a "privacy-by-design" philosophy, where no user data is collected, no search history is stored, and no personalized advertising is served based on browsing habits. This fundamental commitment to anonymity would directly address Google's biggest weakness and attract users deeply concerned about their digital footprints.

Beyond privacy, the disruptive search engine would need to redefine the search experience itself. Leveraging advanced AI, it would offer a sophisticated, conversational interface that provides direct, concise answers to complex queries, akin to a highly intelligent research assistant. Crucially, every answer would be accompanied by clear, verifiable citations from a diverse array of reputable, unbiased sources. This "answer engine" approach would eliminate the need for users to sift through endless links, a stark contrast to Google's current link-heavy results pages.

Furthermore, this competitor would champion radical transparency in its algorithms. Users would have insight into how results are generated and ranked, combating algorithmic bias and ensuring a more diverse and inclusive information landscape. It would prioritize factual accuracy and intellectual property, ensuring ethical use of content with clear attribution to creators.

To truly dismantle Google's integrated ecosystem, this disruptive search engine would also need to offer seamless, privacy-preserving integrations with other essential digital tools. Imagine a search engine that naturally connects with a secure, encrypted communication platform, or a decentralized file storage system, all without collecting personal data. Such an ecosystem would effectively sever the user's reliance on Google's interconnected suite of products.

Ultimately, a successful competitor would be monetized through a model entirely decoupled from personal data. This could involve a premium subscription service for advanced features, a focus on ethical, context-aware advertising (e.g., ads related to the search query, not the user's profile), or even a non-profit, community-supported model. This financial independence from surveillance capitalism is key to its disruptive power.

In essence, this hypothetical competitor would not just be an alternative search engine; it would be a paradigm shift. By championing absolute privacy, offering intelligent and transparent answers, fostering an open and ethical information environment, and building a privacy-first ecosystem of digital tools, it could systematically erode Google's user base and fundamentally alter the landscape of online information, leading to the obsolescence of Google's current data-intensive search and product model.

16 October 2024

Schema.org

Schema.org is useful markup to have on a website: it makes the site more search engine friendly by helping crawlers understand its content and internal structure, which enables better search results. However, the schema.org website itself lacks clarity and is difficult to navigate, with information presented as a cluttered jumble.

Benefits:

  • Enables search engines to better understand the content of a site, which can help it rank higher in search results
  • Improves click-through rates and organically increases traffic to the site
  • Provides more flexibility and context for how sites appear in search results
  • Increases the relevancy of user searches
  • Improves content and context strategy
  • Improves user experience
  • Flexible choice of syntax: Microdata, RDFa, or JSON-LD (see the JSON-LD sketch below)
  • Provides a meta vocabulary to define the context of the site
  • Helps extract the who, what, when, and why from sites
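
As a rough illustration of the JSON-LD flavour mentioned in the benefits above, the following minimal Python sketch serializes a small schema.org Article object and wraps it in the script tag that would go into a page's head. All property values here are placeholders, not taken from any real page.

  # Minimal sketch: build a schema.org Article object and emit it as JSON-LD.
  # All values below are placeholders for illustration only.
  import json

  article = {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Example Headline",
      "datePublished": "2024-10-16",
      "author": {"@type": "Person", "name": "Example Author"},
  }

  # The serialized object is embedded in the page head inside a script tag.
  print('<script type="application/ld+json">')
  print(json.dumps(article, indent=2))
  print('</script>')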

Drawbacks:

  • Unfriendly schema.org community when it comes to suggestions, feedback, and improvements
  • Submitting new changes or schemas is slow and often fraught with frustration
  • The schema.org site itself is cluttered and difficult to navigate
  • The community is not very open and is unwelcoming to new users
  • No real reasoning support and little effort towards web of data queryability
  • The suggestion, submission, and approval process discriminates against user contributions
  • A very opinionated and closed community, which makes engagement unconstructive
  • Huge Google bias, with often rude and arrogant community members
  • The markup is often buggy, flawed, and inflexible to community-driven changes
  • The process is fraught with trial and error
  • Difficult to develop a strategy around the markup
  • Difficult to implement at scale on larger websites
  • Maintaining the markup is a challenge
  • It is subjective and questionable whether the markup significantly improves discoverability
  • Limited tooling that supports and provides insight into the markup
  • An inflexible schema.org developer community makes the standard inaccessible, inextensible, and unmaintainable
  • Unclear documentation on the schema.org website
  • The markup is still very limited in context and scope, especially for larger websites
  • Lacks sufficient domain coverage as a markup vocabulary

Although the project has a long history and many websites make use of the markup, it has significant drawbacks that justify alternative efforts. The project is also Google-sponsored, with a significant corporate bias that undermines the merit and accessibility of open community engagement. The often slow process means the markup is not reaching its full potential. An active, open community might speed things up, but that is likely to be blocked by the existing community of developers, who are not very forthcoming with community engagement. Bugs in schemas take a very long time to resolve, and recommendations are usually not appreciated by the community. There is significant cause for concern when websites adopt markup whose community is often unapproachable and inflexible. After all, it is supposed to be a web standard arising from community effort and engagement. Lastly, a markup that lacks readability, reasoning, and trust as a web standard is likely to be insufficient for the spectrum of web crawling, semantic search, the web of data, and AI in general.

8 May 2022

DiffBot

Diffbot is one of the most useless solutions out there for harvesting the web. In fact, their solution is basically what Google already provides for free. They also use methods that have been used by multiple providers for the last twenty years. They do not, in fact, provide a knowledge graph. The solution is simply an indexed crawl that one can replicate with Elasticsearch, or even with Common Crawl data. What they are doing is trying to make a fool out of organizations and charging a premium for it. There are plenty of free alternatives out there that do a better job. Their notion of a knowledge graph is a marketing gimmick: it has no real semantics and provides no meaningful inference. Even the data they extract is just that, data, and not machine-readable; they add virtually no real metadata. Their solution does not even utilize schema.org, let alone any W3C standards. They also do not follow the web etiquette of obeying robots.txt; Diffbot uses a ruthless form of crawling, disguising itself as a human visitor via spoofing. In most cases, their approach is also likely to violate the GDPR. Nor are there any real deep learning models being used for computer vision, AI, or natural language processing. This is a perfect example of an organization trying to sell something that has no real value to would-be customers.

30 January 2020

Event Monitoring

Open:
  • DiffEngine
  • Edgi
  • Huginn
  • Klaxon
  • Lighthouse
  • Newsdiffs
  • Nytdiff
  • Pagelyzer
  • Siteseer
  • Beehive
  • Memorious

Premium:
  • ChangeDetect
  • ChangeDetection
  • ChangeTower
  • Diffbot
  • Distill
  • Fluxguard
  • Followthatpage
  • OnWebChange
  • PageFreezer
  • TheWebWatcher
  • TimeMachine
  • Tackly
  • Versionista
  • Visualping
  • Wachete
  • WatchThatPage
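
Most of the tools above, open and premium alike, boil down to the same loop: fetch a page, normalize or hash it, and compare it against the previously stored snapshot. A minimal sketch of that idea follows; the URL and state file are hypothetical.

  # Minimal page-change detection sketch: hash the fetched page body and
  # compare it with the hash saved on the previous run.
  import hashlib
  import pathlib
  import urllib.request

  URL = "https://example.com/page-to-watch"   # hypothetical page to monitor
  STATE = pathlib.Path("last_hash.txt")       # stores the previous hash

  body = urllib.request.urlopen(URL, timeout=30).read()
  digest = hashlib.sha256(body).hexdigest()

  previous = STATE.read_text().strip() if STATE.exists() else None
  if previous != digest:
      print("first snapshot taken" if previous is None else "change detected")
      STATE.write_text(digest)
  else:
      print("no change")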

18 August 2019

Types of Data Discovery

  • CDR
  • Emails
  • ERP
  • Social Media
  • Web Logs
  • Server Logs
  • System Logs
  • HTML Pages
  • Sales
  • Photos
  • Videos
  • Audios
  • Tabulated
  • CRM
  • Transactions
  • XDR
  • Sensor Data
  • Call Center
  • Knowledge Bases
  • Google Search
  • Google Trends
  • News
  • Sanctions Data
  • Profile Data

3 May 2015

Common Crawl

Common Crawl provides archived snapshot datasets of the web that can be utilized for a massive array of applications. Its data is published in the standard web archive formats also used by archival crawlers such as Heritrix, making it quite reusable and extensible for open-ended solutions, whether that be building a search engine against years of web page data, extracting specific data from web page documents, or training machine learning algorithms. Common Crawl is available via the AWS public data repository and accessible through the AWS S3 blob store. There are plenty of MapReduce examples available in both Python and Java to make it approachable for developers. Having years of data at a developer's disposal saves one from manually setting up such crawler processes.
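
As a small sketch of getting at the data, the listing below uses boto3 with anonymous (unsigned) access to enumerate a few objects in the public commoncrawl S3 bucket. The crawl label in the prefix is only an example and has to match a crawl that has actually been published.

  # Sketch: list a few Common Crawl objects from the public S3 bucket
  # ("commoncrawl") using anonymous access. Requires boto3/botocore.
  import boto3
  from botocore import UNSIGNED
  from botocore.config import Config

  s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

  # Each crawl lives under crawl-data/CC-MAIN-<year>-<week>/ ;
  # the crawl label below is an example, not a guaranteed one.
  response = s3.list_objects_v2(
      Bucket="commoncrawl",
      Prefix="crawl-data/CC-MAIN-2024-33/segments/",
      MaxKeys=10,
  )
  for obj in response.get("Contents", []):
      print(obj["Key"], obj["Size"])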

25 December 2013

Web Crawling

A web crawler allows one to search and scrape through document URLs based on specific criteria for indexing. It also needs to be approached in a netiquette-friendly way, conforming to the rules in robots.txt. Scalability can be an issue as well, and different approaches can be devised for an optimal outcome. An algorithm-driven approach is vital for constructively meeting requirements, and it might incorporate either an informed or an uninformed search strategy; at times crawlers combine both, along with heuristics. This ultimately implies that, from the algorithmic point of view of a crawler, the web is seen as a graph search problem, which lends itself well to linked data. Crawls can be conducted in a distributed fashion using a multi-agent approach or as single agents. Web crawlers can also be used for monitoring website usage and security, and for surfacing analytics that might otherwise be hidden from a web master. There are quite a few open source tools and services available to a developer. There is always a period in which testing needs to be done locally to work out the ideal, web-friendly approach. There is no single best solution out there if one's needs go beyond what existing libraries can offer. In that respect, it really means designing one's own custom search strategy, and, perhaps, making it open source to share with the community.
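
As a minimal sketch of the ideas above, the following uninformed (breadth-first) crawler treats pages as nodes in a graph, checks robots.txt before each fetch, and stays on the seed host. The seed URL and page limit are placeholders.

  # Minimal breadth-first crawl sketch using only the standard library.
  # It checks robots.txt before each fetch and stays on the seed host.
  import urllib.request
  import urllib.robotparser
  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin, urlparse

  class LinkParser(HTMLParser):
      """Collects href values from anchor tags."""
      def __init__(self):
          super().__init__()
          self.links = []
      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(seed, max_pages=20):
      robots = urllib.robotparser.RobotFileParser()
      robots.set_url(urljoin(seed, "/robots.txt"))
      robots.read()

      seen, queue, fetched = {seed}, deque([seed]), 0
      while queue and fetched < max_pages:
          url = queue.popleft()
          if not robots.can_fetch("*", url):
              continue  # respect robots.txt rules
          try:
              html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
          except Exception:
              continue
          fetched += 1
          parser = LinkParser()
          parser.feed(html)
          for href in parser.links:
              link = urljoin(url, href)
              # Breadth-first expansion over the site's link graph.
              if urlparse(link).netloc == urlparse(seed).netloc and link not in seen:
                  seen.add(link)
                  queue.append(link)
      return seen

  print(crawl("https://example.com/"))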

Python:

Java:

Linked Data:

Services:

Also, HBase appears in general to be a very good back-end for a crawler architecture, and it plays well with Hadoop.
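
For illustration, a rough sketch of that idea using the happybase client is shown below. The connection host, table name, and column family are assumptions, and the table would need to exist on the cluster already.

  # Rough sketch: persist a crawled page into HBase via happybase.
  # Host, table name, and column family "p" are assumptions.
  import happybase
  from urllib.parse import urlparse

  connection = happybase.Connection("hbase-host")   # hypothetical Thrift host
  table = connection.table("crawled_pages")         # assumed to already exist

  url = "https://example.com/article"
  parts = urlparse(url)
  # Nutch-style row key: reversed host so pages of one site sort together.
  row_key = ".".join(reversed(parts.netloc.split("."))) + parts.path

  table.put(row_key.encode(), {
      b"p:url": url.encode(),
      b"p:status": b"200",
      b"p:content": b"<html>...</html>",
  })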

Obviously, there are many more options out there, most of which are premium. The majority of the premium options have deliberately not been mentioned here.

Further reading:
  • High Performance Distributed Web Crawler
  • High Performance Distributed Web Crawler Survey
  • Learning and Discovering Structure in Web Pages
  • UbiCrawler: A Scalable Fully Distributed Web Crawler
  • Searching the Web