Microsoft AI GraphRAG

Enhancing Intelligent Applications using GraphRAG

In today's rapidly evolving enterprise landscape, using large language models (LLMs) to build AI-driven operations and intelligent applications is increasingly important. As organizations accumulate private datasets, it becomes essential to establish clear relationships between those datasets using LLMs and knowledge graphs.

LLMs vs Knowledge Graphs

A prime example of this need is an incident management platform that requires a thorough understanding of error events and performance events to make accurate service circuit SLA decisions. Likewise, in enterprise security information and event management (SIEM) systems, correlating user identities with their access paths from logs is essential for identifying anomalies. Interest in using LLMs for these purposes has grown within the research community. One approach that embodies this idea is Microsoft's GraphRAG (graph-based retrieval-augmented generation), which was announced in February 2024. GraphRAG offers AI-driven content interpretation and search by using LLMs to create a knowledge graph from private datasets, which users can then query for better results.

GraphRAG Advantages

The major advantage of Microsoft's GraphRAG over traditional vector search techniques is its ability to handle complex queries that demand higher-order reasoning or a broad understanding of the whole dataset. For instance, when asked "What are the most unusual conversations?", a conventional vector search may fall short because it retrieves only the passages most similar to the query, and no single passage answers a question about the dataset as a whole. In contrast, GraphRAG builds a knowledge graph based on semantic concepts and provides a holistic view of all sources, allowing users to discover relevant information at various levels of abstraction for more accurate retrieval-augmented generation.

Google Cloud has a similar GraphRAG implementation using Neo4j.

GraphRAG can be employed across critical information discovery and analysis use cases where datasets span multiple documents, contain noise or misinformation, or where the user's queries are abstract or thematic in nature. Furthermore, it is designed to complement a domain expert's analytical approach rather than replace their insights.

The GraphRAG process begins by splitting an input corpus into analyzable TextUnits and using LLMs to extract entities, relationships, and key claims from them. The resulting graph then undergoes hierarchical clustering with the Leiden technique, which groups closely related entities into communities. From there, summaries are generated bottom-up for each community and its constituents, giving users a comprehensive view of their dataset.
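The indexing steps described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not GraphRAG's implementation: `extract_relationships` is a hypothetical stand-in for the LLM extraction step (it only pairs up known entity names that occur in the same text unit), and plain connected components are swapped in for Leiden clustering to keep the example dependency-free. All function names here are invented for illustration.

```python
from collections import defaultdict


def split_into_text_units(text, max_words=50):
    """Split a document into fixed-size word windows (the TextUnits)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def extract_relationships(text_unit, known_entities):
    """Hypothetical stand-in for the LLM step: pair up known entities
    that are mentioned in the same text unit."""
    lowered = text_unit.lower()
    found = sorted(e for e in known_entities if e.lower() in lowered)
    return [(a, b) for i, a in enumerate(found) for b in found[i + 1:]]


def build_graph(text_units, known_entities):
    """Build an undirected adjacency map from the extracted pairs."""
    graph = defaultdict(set)
    for unit in text_units:
        for a, b in extract_relationships(unit, known_entities):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group entities into communities. Connected components are used
    here in place of Leiden, which GraphRAG actually uses."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph.get(node, set()) - members)
        seen |= members
        communities.append(sorted(members))
    return communities


units = ["Alice emailed Bob about the outage.", "Carol paged Dave."]
graph = build_graph(units, {"Alice", "Bob", "Carol", "Dave"})
communities = connected_components(graph)
```

In a real pipeline, each community would then be passed to an LLM to produce its summary, starting from the smallest communities and working upward.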

When querying GraphRAG's knowledge graph, users can choose between two primary modes: global search, for holistic questions about the corpus, and local search, for questions about specific entities that are answered by exploring related concepts in their neighborhood. Note that tuning the prompts by following Microsoft's Prompt Tuning Guide may be necessary to achieve good results on your own dataset.
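The difference between the two query modes can be illustrated with a small sketch. It assumes a graph stored as a plain adjacency dictionary and community summaries stored as strings. Both function names and the keyword-count scoring are hypothetical simplifications: in GraphRAG itself an LLM reads the community summaries and composes the answer, whereas here a word count stands in for that step.

```python
def local_search(graph, entity, hops=1):
    """Local mode: collect the entities within `hops` edges of `entity`."""
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        # Expand one hop outward, skipping anything already visited.
        frontier = {n for node in frontier for n in graph.get(node, ())} - seen
        seen |= frontier
    return sorted(seen - {entity})


def global_search(community_summaries, query_terms):
    """Global mode: score every community summary against the query
    (map step), then rank the communities by score (reduce step)."""
    scored = []
    for community_id, summary in community_summaries.items():
        words = summary.lower().split()
        score = sum(words.count(term.lower()) for term in query_terms)
        if score:
            scored.append((score, community_id))
    # Highest score first; ties broken by community id for determinism.
    return [cid for _, cid in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
summaries = {
    "c0": "network outage outage reports",
    "c1": "billing disputes",
    "c2": "outage postmortem",
}
neighbors = local_search(graph, "A", hops=2)
ranked = global_search(summaries, ["outage"])
```

Local search touches only a small neighborhood of the graph, while global search considers every community, which is why it suits questions about the corpus as a whole.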

Architecture

The architecture diagram shows how Google Cloud and Neo4j work together to build and interact with knowledge graphs:

Knowledge extraction - On the left side of the diagram, blue arrows show data flowing from structured and unstructured sources into Vertex AI. Generative AI is used to extract entities and relationships from that data, which are then converted to Neo4j Cypher queries that are run against the Neo4j database to populate the knowledge graph. This work was traditionally done manually with handcrafted rules. Using generative AI eliminates much of the manual work of data cleansing and consolidation.
Knowledge consumption - On the right side of the diagram, green arrows show applications that consume the knowledge graph. They present natural language interfaces to users. Vertex AI generative AI converts that natural language to Neo4j Cypher that is run against the Neo4j database. This allows non-technical users to interact more closely with the database than was possible without generative AI.
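To make the knowledge extraction step more concrete, here is a minimal sketch of turning one extracted (subject, relation, object) triple into a Cypher statement. It is an illustration under assumptions, not the pipeline from the diagram: there the generative model produces the Cypher, whereas this sketch uses a fixed template. The function name `triple_to_cypher`, the `Entity` label, and the `name` property are all hypothetical. Cypher does not allow relationship types to be passed as query parameters, so the relation name is validated before being interpolated, while the entity names are passed as parameters.

```python
import re

# Relationship types are interpolated into the query text, so restrict
# them to plain identifiers to avoid injecting arbitrary Cypher.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def triple_to_cypher(subject, relation, obj):
    """Return a (query, parameters) pair that upserts one triple."""
    if not _IDENTIFIER.match(relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[:{relation}]->(o)"
    )
    return query, {"subject": subject, "object": obj}


query, params = triple_to_cypher("Patient", "DIAGNOSED_WITH", "Multiple sclerosis")
```

The returned pair can be handed to whichever Neo4j client the application uses. `MERGE` is used rather than `CREATE` so that re-running the extraction does not duplicate nodes or relationships.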

Use Cases

We’re seeing this architecture come up again and again across verticals. Some examples include:

Healthcare - Modeling the patient journey for multiple sclerosis to improve patient outcomes
Manufacturing - Using generative AI to collect a bill of materials that extends across domains, something that wasn’t tractable with previous manual approaches
Oil and gas - Building a knowledge base with extracts from technical documents that users without a data science background can interact with. This enables them to more quickly educate themselves and answer questions about the business.