The evolution of Retrieval-Augmented Generation (RAG) is pushing the boundaries of what is possible with large language models (LLMs). A sophisticated approach, GraphRAG, integrates knowledge graphs to provide LLMs with a more structured and contextually rich understanding of data. For a robust and scalable GraphRAG implementation in the cloud, a hybrid architecture leveraging Amazon Web Services (AWS) provides a compelling solution. This approach combines S3 for static data storage, Neptune for the definitive knowledge graph, Elasticsearch for flexible indexing, and a combination of Kendra and FAISS for dynamic, high-performance retrieval.
In this architecture, the long-term memory of the system is a multi-layered construct. Amazon S3 serves as the foundational data lake, storing the raw, unstructured documents that are the source of the knowledge graph. This provides a durable, scalable, and cost-effective storage solution. From these documents, a knowledge graph is extracted and stored primarily in Amazon Neptune, a purpose-built graph database. Neptune excels at representing and querying complex relationships, making it the canonical source of truth for the system's long-term, interconnected knowledge. To enhance searchability, this knowledge graph is replicated into Elasticsearch. With its powerful full-text and filtered search capabilities, Elasticsearch acts as a secondary, highly performant index over the long-term knowledge, enabling traditional lexical searches and fast lookups of graph entities and properties.
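To make the ingestion flow concrete, here is a minimal sketch that pulls a raw document from S3, upserts extracted entities into Neptune through its openCypher HTTPS endpoint, and mirrors each entity into Elasticsearch. The bucket, endpoints, index name, and the `extract_entities` helper are illustrative assumptions, not part of any prescribed setup:

```python
import json

import boto3
import requests
from elasticsearch import Elasticsearch

NEPTUNE_ENDPOINT = "https://my-neptune-cluster:8182/openCypher"  # hypothetical endpoint
es = Elasticsearch("https://my-es-domain:9200")                  # hypothetical endpoint
s3 = boto3.client("s3")

def extract_entities(text: str) -> list[dict]:
    """Placeholder for the entity/relation extraction step (e.g., an LLM call)."""
    return [{"id": "acme-corp", "name": "Acme Corp", "type": "Organization"}]

# 1. Pull the raw document from the S3 data lake.
doc = s3.get_object(Bucket="raw-docs", Key="reports/q1.txt")["Body"].read().decode()

for entity in extract_entities(doc):
    # 2. Upsert the entity into Neptune, the canonical knowledge graph.
    requests.post(NEPTUNE_ENDPOINT, data={
        "query": "MERGE (e:Entity {id: $id}) SET e.name = $name, e.type = $type",
        "parameters": json.dumps(entity),
    })
    # 3. Mirror the same entity into Elasticsearch for lexical search.
    es.index(index="kg-entities", id=entity["id"], document=entity)
```

In a production pipeline this loop would typically run as an event-driven job (for example, triggered on S3 object creation) rather than an ad hoc script.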
For the performance-critical tasks of short-term memory and semantic search, a separate, hybrid approach is used. Amazon Kendra, an intelligent search service, can be leveraged to create a managed index for conversational history or other short-term context. It provides a straightforward way to ingest, index, and search data with built-in natural language processing. In parallel, FAISS, a high-performance library for similarity search, handles the vector embeddings. When a user query arrives, it is converted into a vector embedding, which is then used to perform a rapid search against a FAISS index to find semantically similar nodes or documents. This dual approach lets the system rely on Kendra's managed service for general-purpose natural-language search while reserving FAISS for low-latency, high-throughput vector lookups.
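The following sketch shows both retrieval paths side by side: a Kendra query over indexed conversational context, and a FAISS similarity search over precomputed embeddings. The Kendra index ID, embedding dimension, and randomly generated vectors are stand-in assumptions:

```python
import boto3
import faiss
import numpy as np

# Path 1: managed natural-language search over short-term conversational context.
kendra = boto3.client("kendra")
kendra_resp = kendra.query(
    IndexId="my-kendra-index-id",  # hypothetical index ID
    QueryText="What did the user ask about pricing earlier?",
)
context_passages = [r["DocumentExcerpt"]["Text"] for r in kendra_resp["ResultItems"]]

# Path 2: FAISS similarity search over embeddings of graph nodes/documents.
dim = 768                                                        # must match your embedding model
node_embeddings = np.random.rand(10_000, dim).astype("float32")  # demo stand-in
index = faiss.IndexFlatL2(dim)        # exact search; swap for IndexIVFFlat at larger scale
index.add(node_embeddings)

query_embedding = np.random.rand(1, dim).astype("float32")       # stand-in for the embedded query
distances, node_ids = index.search(query_embedding, k=10)        # top-10 nearest nodes
```

In practice the `node_embeddings` would be produced offline by the same embedding model that encodes incoming queries, so that the two live in a shared vector space.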
Implementing FAISS alongside Elasticsearch enables a sophisticated hybrid retrieval strategy. Elasticsearch first filters a vast dataset down to a manageable candidate set based on lexical keywords or metadata. The vectors of the surviving documents are then passed to FAISS for a nearest neighbor search that surfaces the most semantically relevant results. This two-stage process confines the expensive vector comparison to a small, pre-qualified candidate set, combining the precision of lexical filtering with the recall of semantic search.
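A compact sketch of that two-stage flow follows, assuming each Elasticsearch document stores its embedding in an `embedding` field (the field names, index name, and demo query vector are assumptions). Over a small filtered candidate set, an exact flat FAISS index is usually sufficient:

```python
import faiss
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-domain:9200")  # hypothetical endpoint

# Stage 1: Elasticsearch narrows the corpus with lexical + metadata filters.
resp = es.search(
    index="kg-documents",  # hypothetical index
    query={"bool": {
        "must": [{"match": {"text": "quarterly revenue"}}],
        "filter": [{"term": {"doc_type": "report"}}],
    }},
    size=1000,
    source=["embedding"],
)
hits = resp["hits"]["hits"]
candidate_vectors = np.array([h["_source"]["embedding"] for h in hits], dtype="float32")

# Stage 2: FAISS ranks only the surviving candidates by vector similarity.
query_embedding = np.random.rand(1, candidate_vectors.shape[1]).astype("float32")  # stand-in
index = faiss.IndexFlatIP(candidate_vectors.shape[1])  # exact inner-product search
index.add(candidate_vectors)
scores, positions = index.search(query_embedding, k=10)
top_doc_ids = [hits[p]["_id"] for p in positions[0] if p != -1]
```

Building a throwaway flat index per query is cheap at this scale; an alternative is a persistent FAISS index combined with ID-based filtering, at the cost of more bookkeeping.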
To scale this GraphRAG system in the cloud, the native capabilities of the AWS services are key. Amazon Neptune and Elasticsearch (self-managed or via Amazon OpenSearch Service, the successor to Amazon Elasticsearch Service) are designed for horizontal scaling, handling massive data volumes and high query loads by distributing data across multiple nodes. FAISS, while a local library, can be scaled by deploying it on containerized services such as Amazon ECS or EKS, sharding indexes across instances, and placing a load balancer in front to manage query traffic. This allows the system to meet the demands of large-scale applications while maintaining low-latency retrieval and a performant user experience.
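One way to containerize FAISS for ECS or EKS is to wrap an index shard in a small HTTP service and let a load balancer fan queries out across replicas. This sketch uses Flask; the shard path, port, and request payload shape are assumptions:

```python
import faiss
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
index = faiss.read_index("/data/shard-0.faiss")  # hypothetical pre-built index shard

@app.route("/search", methods=["POST"])
def search():
    body = request.get_json()
    # Expects {"vector": [...], "k": 10}; vector dim must match the shard's index.
    query = np.array(body["vector"], dtype="float32").reshape(1, -1)
    distances, ids = index.search(query, body.get("k", 10))
    return jsonify({"ids": ids[0].tolist(), "distances": distances[0].tolist()})

if __name__ == "__main__":
    # Bind on all interfaces so the container is reachable behind the load balancer.
    app.run(host="0.0.0.0", port=8080)
```

With one shard per container, an aggregation layer merges per-shard results by score, and replicas of hot shards absorb additional query traffic.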