
Extracting information from unstructured text with source grounding using LangExtract

With most modern LLMs now supporting structured outputs, extracting information from unstructured text or documents has become straightforward. However, a critical question remains: does the extracted information actually exist in the source text, or are we dealing with LLM hallucinations that produce plausible but incorrect answers?

Fortunately, Google has recently open-sourced LangExtract, a Python library designed to tackle exactly this problem. Beyond simply extracting information from unstructured text or documents, LangExtract provides precise source grounding that shows exactly where in the text the LLM found each piece of information, so we can easily trace back to verify each extraction.

In this article, I’ll demonstrate how to use LangExtract to extract information from unstructured text. The text below serves as our example:

To keep things simple, I’ll focus on extracting the names of people mentioned and their opinions in the example above.

Setting up the API key

Even though the README mentions that LangExtract supports various LLMs, including local open-source models via Ollama, I encountered some errors when trying to set it up locally. So for now, I’ll use the Gemini model to perform the extraction.

After installing the library (pip install langextract), get the API key from Google AI Studio and add it to the .env file:

LANGEXTRACT_API_KEY=your-gemini-api-key-here
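LangExtract reads the key from this environment variable, so the .env file needs to be loaded into the process environment before calling the library. The python-dotenv package handles this robustly; purely as an illustration, a minimal stdlib-only loader might look like this (the parsing rules here are my own simplification, not anything LangExtract requires):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader; python-dotenv is the more robust choice."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments; don't override existing variables.
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
```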

Define the example data

Example data is crucial for LangExtract to produce reliable structured outputs. While we can provide multiple examples, I’ll use just one example for this demonstration:

import langextract as lx

example_text = """
John Smith went to the store with his sister Amy Wilson.
She thinks the new shopping app is very convenient,
but he believes online shopping lacks personal touch.
"""

example_extractions = [
    lx.data.Extraction(
        extraction_class="person",
        extraction_text="John Smith",
        attributes={
            "name": "John Smith",
            "first_name": "John",
            "last_name": "Smith",
        },
    ),
    lx.data.Extraction(
        extraction_class="person",
        extraction_text="Amy Wilson",
        attributes={
            "name": "Amy Wilson",
            "first_name": "Amy",
            "last_name": "Wilson",
        },
    ),
    lx.data.Extraction(
        extraction_class="opinion",
        extraction_text="online shopping lacks personal touch",
        attributes={"person": "John Smith", "sentiment": "negative"},
    ),
    lx.data.Extraction(
        extraction_class="opinion",
        extraction_text="the new shopping app is very convenient",
        attributes={"person": "Amy Wilson", "sentiment": "positive"},
    ),
]

example = lx.data.ExampleData(
    text=example_text,
    extractions=example_extractions,
)

The example data takes two key parameters: text (the sample text) and extractions (a collection of what should be extracted). In the code above, I defined two extraction classes: “person” and “opinion,” each with their sample text extractions and required attributes.

Create the prompt instruction

We also need to provide a prompt instruction that tells the model what to extract. Here, I’ll instruct the model to extract people’s names and their opinions:

prompt = "Extract people's names and their opinions or views expressed in the text."

Clear and specific instructions help the model understand the task better and produce more accurate results.

Set up the extraction

Now let’s set up the LangExtract extraction:

result = lx.extract(
    examples=[example],
    prompt_description=prompt,
    text_or_documents=unstructured_text,  # From the earlier example
    model_id="gemini-2.5-flash-lite",
)

for extraction in result.extractions:
    print("Class:", extraction.extraction_class)
    print("Text:", extraction.extraction_text)
    print("Attributes:", getattr(extraction, "attributes", {}))

    start_pos = extraction.char_interval.start_pos
    end_pos = extraction.char_interval.end_pos

    print(f"Character Interval: start_pos={start_pos}, end_pos={end_pos}")
    print("---")

After running the code, the extraction will produce output similar to this:

LangExtract: model=gemini-2.5-flash-lite, current=816 chars, processed=816 chars:  [00:02]
✓ Extraction processing complete
✓ Extracted 9 entities (2 unique types)
  • Time: 2.80s
  • Speed: 292 chars/sec
  • Chunks: 1

Class: person
Text: Emma Johnson
Attributes: {'name': 'Emma Johnson', 'first_name': 'Emma', 'last_name': 'Johnson'}
Character Interval: start_pos=1, end_pos=13
---
Class: person
Text: Tom Johnson
Attributes: {'name': 'Tom Johnson', 'first_name': 'Tom', 'last_name': 'Johnson'}
Character Interval: start_pos=117, end_pos=128
---
...
---
Class: opinion
Text: This is amazing!
Attributes: {'person': 'Emma Johnson', 'sentiment': 'positive'}
Character Interval: start_pos=209, end_pos=224
---
Class: opinion
Text: I don't trust these machines
Attributes: {'person': 'David Johnson', 'sentiment': 'negative'}
Character Interval: start_pos=423, end_pos=451
---

As we can see, LangExtract successfully extracts structured information from the unstructured text, following our defined example. It also provides character interval information that shows exactly where each piece of text was found, which we can use for verification.
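Because each extraction carries a character interval, verifying grounding reduces to a simple slice comparison against the source string. A sketch (the helper name and sample text below are mine, not part of LangExtract):

```python
def is_grounded(source_text: str, extraction_text: str, start: int, end: int) -> bool:
    """An extraction is grounded if its interval slices out exactly its text."""
    return source_text[start:end] == extraction_text

# Illustrative check against a made-up source string:
text = "Emma Johnson said: This is amazing!"
print(is_grounded(text, "Emma Johnson", 0, 12))  # True
print(is_grounded(text, "Emma Johnson", 1, 13))  # False: off-by-one interval
```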

This precise source grounding capability sets LangExtract apart from traditional extraction methods. Additionally, LangExtract offers interactive visualization features that help review and validate extracted data more easily.
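Beyond verification, the flat list of extractions is easy to post-process. As a sketch using plain dicts in place of LangExtract's Extraction objects (the attribute layout mirrors the example data defined earlier):

```python
from collections import defaultdict

def opinions_by_person(extractions):
    """Group opinion extractions under the person named in their attributes."""
    grouped = defaultdict(list)
    for e in extractions:
        if e["extraction_class"] == "opinion":
            grouped[e["attributes"]["person"]].append(
                (e["extraction_text"], e["attributes"]["sentiment"])
            )
    return dict(grouped)

sample = [
    {"extraction_class": "opinion",
     "extraction_text": "This is amazing!",
     "attributes": {"person": "Emma Johnson", "sentiment": "positive"}},
    {"extraction_class": "opinion",
     "extraction_text": "I don't trust these machines",
     "attributes": {"person": "David Johnson", "sentiment": "negative"}},
]
print(opinions_by_person(sample))
```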
