Extracting information from unstructured text with source grounding using LangExtract
With most modern LLMs now supporting structured outputs, extracting information from unstructured text or documents has become straightforward. However, a critical question remains: does the extracted information actually exist in the source text, or are we dealing with LLM hallucinations that produce plausible but incorrect answers?
Fortunately, Google has recently open-sourced LangExtract, a Python library designed to tackle exactly this problem. Beyond simply extracting information from unstructured text or documents, LangExtract provides precise source grounding that shows exactly where in the text the LLM found each piece of information, so we can easily trace back to verify each extraction.
In this article, I’ll demonstrate how to use LangExtract to extract information from unstructured text. The text below serves as our unstructured text example:
Emma Johnson was excited about her new AI assistant. She asked it to help her plan a birthday party for her brother Tom Johnson. The AI suggested decorations, food, and games that would be perfect for him.
“This is amazing!” Emma said to her friend Sarah Williams. “The AI even remembered that Tom loves soccer, so it recommended a soccer-themed cake.”
Meanwhile, her father David Johnson was skeptical about using AI. “I don’t trust these machines,” he told his wife Maria Johnson. But when Maria showed him how the AI helped her organize her recipe collection, David started to see its potential.
Later that evening, Emma’s colleague Jake Martinez called to ask about the AI assistant. “My sister Lisa has been looking for something like this,” he said. Emma promised to share the information with him.
To keep things simple, I’ll focus on extracting the names of people mentioned and their opinions in the example above.
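In code, I'll store this example in a variable. The name `unstructured_text` matches the one referenced in the extraction call later:

```python
unstructured_text = """
Emma Johnson was excited about her new AI assistant. She asked it to help her plan a birthday party for her brother Tom Johnson. The AI suggested decorations, food, and games that would be perfect for him.

"This is amazing!" Emma said to her friend Sarah Williams. "The AI even remembered that Tom loves soccer, so it recommended a soccer-themed cake."

Meanwhile, her father David Johnson was skeptical about using AI. "I don't trust these machines," he told his wife Maria Johnson. But when Maria showed him how the AI helped her organize her recipe collection, David started to see its potential.

Later that evening, Emma's colleague Jake Martinez called to ask about the AI assistant. "My sister Lisa has been looking for something like this," he said. Emma promised to share the information with him.
"""
```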
Setting up the API key
Even though the README mentions that LangExtract supports various LLMs, including local open-source models via Ollama, I encountered some errors when trying to set it up locally. So for now, I’ll use the Gemini model to perform the extraction.
Just get the API key from Google AI Studio and add it to the .env file:
LANGEXTRACT_API_KEY=your-gemini-api-key-here
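LangExtract picks the key up from this environment variable. If you don't use a .env loader (such as python-dotenv), you can also set it in-process before calling the library; a minimal sketch (the placeholder value is obviously not a real key):

```python
import os

# Set the key for the current process only if it isn't already configured,
# e.g. by a shell export or a .env loader.
os.environ.setdefault("LANGEXTRACT_API_KEY", "your-gemini-api-key-here")
```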
Define the example data
Example data is crucial for LangExtract to produce reliable structured outputs. While we can provide multiple examples, I’ll use just one example for this demonstration:
import langextract as lx
example_text = """
John Smith went to the store with his sister Amy Wilson.
She thinks the new shopping app is very convenient,
but he believes online shopping lacks personal touch.
"""
example_extractions = [
    lx.data.Extraction(
        extraction_class="person",
        extraction_text="John Smith",
        attributes={
            "name": "John Smith",
            "first_name": "John",
            "last_name": "Smith",
        },
    ),
    lx.data.Extraction(
        extraction_class="person",
        extraction_text="Amy Wilson",
        attributes={
            "name": "Amy Wilson",
            "first_name": "Amy",
            "last_name": "Wilson",
        },
    ),
    lx.data.Extraction(
        extraction_class="opinion",
        extraction_text="online shopping lacks personal touch",
        attributes={"person": "John Smith", "sentiment": "negative"},
    ),
    lx.data.Extraction(
        extraction_class="opinion",
        extraction_text="the new shopping app is very convenient",
        attributes={"person": "Amy Wilson", "sentiment": "positive"},
    ),
]

example = lx.data.ExampleData(
    text=example_text,
    extractions=example_extractions,
)
The example data takes two key parameters: text (the sample text) and extractions (a collection of what should be extracted). In the code above, I defined two extraction classes: “person” and “opinion,” each with their sample text extractions and required attributes.
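One detail worth noting: each extraction_text should appear verbatim in the example text, since LangExtract grounds extractions in exact source spans. A quick sanity check (plain Python, no LangExtract required) makes this explicit:

```python
example_text = """
John Smith went to the store with his sister Amy Wilson.
She thinks the new shopping app is very convenient,
but he believes online shopping lacks personal touch.
"""

# Spans used in the example extractions; each must be a verbatim
# substring of example_text for source grounding to work.
spans = [
    "John Smith",
    "Amy Wilson",
    "online shopping lacks personal touch",
    "the new shopping app is very convenient",
]

for span in spans:
    assert span in example_text, f"not found verbatim: {span!r}"
print("all example spans are grounded in the example text")
```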
Create the prompt instruction
We also need to provide a prompt instruction that tells the model what to extract. Here, I’ll instruct the model to extract people’s names and their opinions:
prompt = "Extract people's names and their opinions or views expressed in the text."
Clear and specific instructions help the model understand the task better and produce more accurate results.
Set up the extraction
Now let’s set up the LangExtract extraction:
result = lx.extract(
    examples=[example],
    prompt_description=prompt,
    text_or_documents=unstructured_text,  # the example text shown earlier
    model_id="gemini-2.5-flash-lite",
)

for extraction in result.extractions:
    print("Class:", extraction.extraction_class)
    print("Text:", extraction.extraction_text)
    print("Attributes:", getattr(extraction, "attributes", {}))
    start_pos = extraction.char_interval.start_pos
    end_pos = extraction.char_interval.end_pos
    print(f"Character Interval: start_pos={start_pos}, end_pos={end_pos}")
    print("---")
After running the code, the extraction will produce output similar to this:
LangExtract: model=gemini-2.5-flash-lite, current=816 chars, processed=816 chars: [00:02]
✓ Extraction processing complete
✓ Extracted 9 entities (2 unique types)
• Time: 2.80s
• Speed: 292 chars/sec
• Chunks: 1
Class: person
Text: Emma Johnson
Attributes: {'name': 'Emma Johnson', 'first_name': 'Emma', 'last_name': 'Johnson'}
Character Interval: start_pos=1, end_pos=13
---
Class: person
Text: Tom Johnson
Attributes: {'name': 'Tom Johnson', 'first_name': 'Tom', 'last_name': 'Johnson'}
Character Interval: start_pos=117, end_pos=128
---
...
---
Class: opinion
Text: This is amazing!
Attributes: {'person': 'Emma Johnson', 'sentiment': 'positive'}
Character Interval: start_pos=209, end_pos=224
---
Class: opinion
Text: I don't trust these machines
Attributes: {'person': 'David Johnson', 'sentiment': 'negative'}
Character Interval: start_pos=423, end_pos=451
---
As we can see, LangExtract successfully extracts structured information from the unstructured text, following our defined example. It also provides character interval information that shows exactly where each piece of text was found, which we can use for verification.
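Those character intervals make verification mechanical: slicing the source text at start_pos:end_pos should reproduce the extracted span (assuming the intervals use standard Python half-open indexing, which the offsets in the output above suggest). A small stand-alone sketch of such a check:

```python
source = "\nEmma Johnson was excited about her new AI assistant."

# A stand-in for one extraction result: the span text plus its
# character interval in the source.
extraction_text = "Emma Johnson"
start_pos, end_pos = 1, 13

# Grounding check: the interval must slice out exactly the extracted text.
assert source[start_pos:end_pos] == extraction_text
print("verified:", source[start_pos:end_pos])
```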
This precise source grounding capability sets LangExtract apart from traditional extraction methods. Additionally, LangExtract offers interactive visualization features that make reviewing and validating the extracted data easier.