Knowledge Graph Construction From Unstructured Data Sources

What's a good way to convert text documents into a knowledge graph? In the current open source libraries for GraphRAG, a dominant notion is: "Just use an LLM to generate a graph automatically, which should be good enough to use." For those of us who work with graphs in regulated environments or mission-critical apps, this obviously isn't appropriate. In some contexts it may even constitute unlawful practice, e.g., given US laws regarding data management in some federal agencies.

Let's step back to review the broader practices in knowledge graph construction. For downstream use cases, such as where KGs are grounding AI apps, there's a larger question to ask: how can we build KGs from both structured and unstructured data sources, and keep human expert reviews in the loop, while taking advantage of LLMs and other deep learning models?

This talk provides a step-by-step guide to working with unstructured data sources for constructing and updating knowledge graphs. We'll assume you have some experience coding in Python and working with popular open source tools.

The general sketch is to parse the text (e.g., based on `spaCy` pipelines), then use _textgraph_ methods to build a _lexical graph_. We generate a _semantic layer_ atop this, making use of _named entity recognition_ and _entity extraction_, and leveraging previous _entity resolution_ work with structured data sources to perform _entity linking_. These steps enrich the semantics for nodes in the graph. Then, making use of _relation extraction_ to connect pairs of nodes, we enrich the semantics for edges in the graph. At each step, we use LLMs and other deep learning models to augment narrowly-defined tasks within the overall workflow. Using domain-specific resources such as a thesaurus, we'll show how to perform _semantic random walks_ to expand the graph. Finally, we'll show how to apply graph analytics to make use of the graph -- tying into what's needed for use cases such as GraphRAG.
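To make the sketch concrete, here is a minimal illustration of the lexical-graph and thesaurus-expansion steps using `networkx`. This is not the talk's implementation: the sentence data, the toy thesaurus, and the function names are all assumptions, and it presumes an upstream `spaCy` pipeline has already reduced each sentence to lemmas of its content words.

```python
# A minimal sketch, assuming parsing/lemmatization happened upstream
# (e.g., in a spaCy pipeline); all data below is illustrative.
import itertools
import networkx as nx

# Hypothetical parser output: each sentence as a list of lemmas.
parsed_sentences = [
    ["acme", "acquire", "widgetco"],
    ["widgetco", "make", "widget"],
    ["acme", "widgetco", "sell", "widget"],
]

# Toy domain thesaurus: term -> related terms (hypothetical).
thesaurus = {"widget": ["gadget"], "acquire": ["purchase"]}

def build_lexical_graph(sentences):
    """Textgraph construction: nodes are lemmas, edges link lemmas
    that co-occur in a sentence, weighted by co-occurrence count."""
    g = nx.Graph()
    for sent in sentences:
        for a, b in itertools.combinations(sorted(set(sent)), 2):
            w = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=w)
    return g

def expand_with_thesaurus(g, thesaurus):
    """Semantic expansion: attach thesaurus neighbors as new
    'related_to' edges, enriching the semantics of existing nodes."""
    for term, related in thesaurus.items():
        if term in g:
            for r in related:
                g.add_edge(term, r, rel="related_to", weight=1)

g = build_lexical_graph(parsed_sentences)
expand_with_thesaurus(g, thesaurus)

# Graph analytics: rank nodes by PageRank (the TextRank idea),
# a rough proxy for which terms matter most downstream.
ranks = nx.pagerank(g, weight="weight")
top_terms = sorted(ranks, key=ranks.get, reverse=True)[:3]
```

In a real workflow, the entity-linking and relation-extraction steps would then promote some of these lexical nodes and edges into typed semantic-layer elements, with human review before they land in the KG.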

Estimated at 45 minutes, not including time for Q&A.