Information Extraction Solutions
Exam weight: 10–15%
Overview
This domain covers building retrieval and grounding pipelines (the backbone of RAG) and extracting structured content from documents, images, audio, and video using Azure Content Understanding. It connects tightly with Domain 2's RAG work.
Key Concepts
Retrieval & Grounding Pipelines (RAG Ingestion)
The ingestion side of a RAG pipeline:
Source content (PDF, Word, images, audio, video)
↓
Ingest → chunk into segments
↓
Enrich with skills (OCR, entity extraction, custom skills)
↓
Embed chunks → vector representations
↓
Index in Azure AI Search (vector + keyword fields)
↓
Available for hybrid / vector / semantic search at query time
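The chunking step in the flow above can be illustrated with a minimal sketch. It assumes a simple word-based sliding window with overlap; the function name `chunk_text` and its default sizes are hypothetical, and production pipelines often split on tokens, sentences, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input: ten placeholder words, windows of 4 words with 1 word of overlap
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
# chunks == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of indexing some text twice.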
| Concept | Description |
|---|---|
| Chunking | Splitting documents into segments that fit in a model's context window |
| Embedding | Converting text to a vector using an embedding model (e.g., text-embedding-ada-002) |
| Vector search | Finding similar documents by comparing embedding vectors |
| Hybrid search | Combining keyword (BM25) + vector search for better recall |
| Semantic ranking | Re-ranking results using a language model for relevance |
| Skillset | A pipeline of enrichment steps (OCR, entity extraction, custom functions) applied during indexing |
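To make the vector search and hybrid search rows concrete, here is a small sketch. It ranks documents by cosine similarity and then merges a keyword ranking with the vector ranking using Reciprocal Rank Fusion (RRF), the merging method Azure AI Search documents for hybrid queries. This is an illustration of the idea, not the service's implementation; the function names, the toy two-dimensional vectors, and the constant `k = 60` are assumptions made for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_vector(query_vec: list[float], doc_vecs: dict[str, list[float]]) -> list[str]:
    """Return document ids ordered from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several rankings: each document scores sum(1 / (k + rank)) over the rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: a keyword (BM25-style) ranking and 2-D stand-ins for embeddings
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
doc_vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8], "doc_c": [0.0, 1.0]}
vector_ranking = rank_by_vector([0.0, 1.0], doc_vectors)  # ["doc_c", "doc_b", "doc_a"]

fused = rrf_fuse([keyword_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]  # ["doc_b", "doc_c", "doc_a"]
```

`doc_b` wins because it ranks well in both lists, which is why hybrid search tends to recover results that either method alone would rank lower.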
Azure AI Search: Core Components
| Component | Purpose |
|---|---|
| Index | The searchable data store: fields, types, searchability |
| Indexer | Pulls data from a source and runs the skillset |
| Skillset | Enrichment pipeline: OCR, entity extraction, embedding, custom skills |
| Data source | Connection to Blob Storage, Cosmos DB, SQL, etc. |
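As a rough illustration of what an index definition holds, the sketch below writes one as a Python dictionary shaped like the JSON body of the Azure AI Search REST "create index" call. The property names follow that REST schema as best understood here and should be checked against the API version in use. The profile and algorithm names are hypothetical, and the field names (`title`, `content`, `url`, `contentVector`) are chosen to match the query example in this section.

```python
EMBEDDING_DIMENSIONS = 1536  # output size of text-embedding-ada-002

index_definition = {
    "name": "my-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "url", "type": "Edm.String", "searchable": False},
        {
            "name": "contentVector",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": EMBEDDING_DIMENSIONS,
            "vectorSearchProfile": "my-vector-profile",  # hypothetical name
        },
    ],
    "vectorSearch": {
        "algorithms": [{"name": "my-hnsw", "kind": "hnsw"}],  # hypothetical name
        "profiles": [{"name": "my-vector-profile", "algorithm": "my-hnsw"}],
    },
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {"fieldName": "title"},
                    "prioritizedContentFields": [{"fieldName": "content"}],
                },
            }
        ]
    },
}


def undefined_vector_profiles(definition: dict) -> list[str]:
    """Return vector fields whose profile is not declared under vectorSearch.profiles."""
    declared = {p["name"] for p in definition["vectorSearch"]["profiles"]}
    return [
        f["name"]
        for f in definition["fields"]
        if "vectorSearchProfile" in f and f["vectorSearchProfile"] not in declared
    ]
```

The small check at the end catches a common mistake: a vector field that points at a profile name that was never declared.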
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="my-index",
    credential=AzureKeyCredential("<key>"),
)

# Hybrid search: keyword + vector.
# get_embedding() is not defined here; it is assumed to return the query's embedding
# as a list of floats from the same model used to populate contentVector.
query_vector = VectorizedQuery(
    vector=get_embedding("What is Azure AI Foundry?"),
    k_nearest_neighbors=3,
    fields="contentVector",
)

results = search_client.search(
    search_text="Azure AI Foundry",   # keyword (BM25) part of the query
    vector_queries=[query_vector],    # vector part of the query
    select=["title", "content", "url"],
    query_type="semantic",            # apply semantic ranking to the merged results
    semantic_configuration_name="my-semantic-config",
    top=5,
)

for r in results:
    print(r["title"], r["@search.score"])
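Retrieved results become useful for grounding once they are turned into context for the model. The following is a minimal sketch of that step, assuming each hit is a dictionary with the `title`, `content`, and `url` fields selected in the query. The function name `build_grounded_prompt`, the prompt wording, and the sample hits are hypothetical.

```python
def build_grounded_prompt(question: str, hits: list[dict], max_chars: int = 500) -> str:
    """Format search hits as numbered sources followed by the user's question."""
    sources = []
    for i, hit in enumerate(hits, start=1):
        snippet = hit["content"][:max_chars]  # trim long chunks to limit prompt size
        sources.append(f"[{i}] {hit['title']} ({hit['url']})\n{snippet}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical hits standing in for documents returned by a search call
sample_hits = [
    {"title": "Doc one", "content": "First passage.", "url": "https://example.com/1"},
    {"title": "Doc two", "content": "Second passage.", "url": "https://example.com/2"},
]
prompt = build_grounded_prompt("What is covered?", sample_hits)
```

Numbering the sources lets the model cite them, which makes it easier to check that an answer is grounded in retrieved content.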
Azure Content Understanding
Extracts structured data from multi-modal content:
| Content type | Analyzer type | Example output |
|---|---|---|
| Documents / forms | prebuilt-invoice, prebuilt-receipt, custom | Structured fields (InvoiceTotal, VendorName) |
| Images | Custom or prebuilt-read | OCR text, layout, detected objects |
| Audio | Custom audio analyzer | Transcript, speaker turns, topics |
| Video | Custom video analyzer | Transcript, scene descriptions, on-screen text |
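import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "<your-key>"

# Submit a document for extraction; analysis runs asynchronously
response = requests.post(
    f"{endpoint}contentunderstanding/analyzers/prebuilt-invoice:analyze?api-version=2024-12-01-preview",
    headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
    json={"url": "https://example.com/invoice.pdf"},
)
response.raise_for_status()
# The URL to poll is returned in the Operation-Location response header
operation_url = response.headers["Operation-Location"]

# Poll until the operation reaches a terminal status
while True:
    result = requests.get(operation_url, headers={"Ocp-Apim-Subscription-Key": key}).json()
    if result["status"].lower() in ("succeeded", "failed"):
        break
    time.sleep(2)
# On success, read result["result"]["contents"][0]["fields"]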
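Once the extracted fields are available, they usually need flattening before they can be stored or indexed. The sketch below assumes each field is an object with a `type` plus a single typed value key such as `valueString` or `valueNumber`. That shape, the helper name `flatten_fields`, and the sample data are assumptions for illustration and should be checked against an actual response.

```python
def flatten_fields(fields: dict) -> dict:
    """Map each field name to its typed value, or None when no value was extracted."""
    flat = {}
    for name, field in fields.items():
        # Assumed convention: the value lives under a key that starts with "value"
        value_keys = [k for k in field if k.startswith("value")]
        flat[name] = field[value_keys[0]] if value_keys else None
    return flat


# Hypothetical fields in the assumed response shape
sample_fields = {
    "VendorName": {"type": "string", "valueString": "Contoso"},
    "InvoiceTotal": {"type": "number", "valueNumber": 110.0},
    "DueDate": {"type": "date"},  # no value extracted for this field
}
flat = flatten_fields(sample_fields)
# flat == {"VendorName": "Contoso", "InvoiceTotal": 110.0, "DueDate": None}
```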
Producing Grounded Representations for RAG
Content Understanding can output Markdown, which is ideal for chunking and feeding into a RAG pipeline:
Document PDF → Content Understanding (layout + OCR)
↓
Markdown output (headings, tables, text preserved)
↓
Chunk + embed + index in Azure AI Search
↓
Query → retrieve → LLM generates grounded response
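One reason Markdown output suits this flow is that headings give natural chunk boundaries. The following is a minimal sketch of heading-aware splitting, assuming ATX-style headings (`#` to `######`). The function name `split_markdown_by_heading` and the sample document are hypothetical, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading, keeping tables and text together."""
    sections = []
    current = {"heading": "", "lines": []}

    def has_content(section: dict) -> bool:
        return bool(section["heading"]) or any(line.strip() for line in section["lines"])

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if has_content(current):
                sections.append(current)  # close the section that just ended
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        sections.append(current)
    return [{"heading": s["heading"], "text": "\n".join(s["lines"]).strip()} for s in sections]


# Hypothetical Markdown, standing in for an extracted document
sample_md = "# Invoice\nVendor: Contoso\n\n## Line items\n| Item | Qty |\n|---|---|\n| Widget | 2 |\n"
sections = split_markdown_by_heading(sample_md)
# sections[0] == {"heading": "Invoice", "text": "Vendor: Contoso"}
```

Each section can then be embedded and indexed as its own chunk, with the heading stored alongside it as extra retrieval context.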
Azure Services & Foundry Features
| Service | Purpose |
|---|---|
| Azure AI Search | Indexing, vector search, hybrid search, RAG |
| Azure Content Understanding | Multi-modal extraction (Foundry Tools) |
| Azure Blob Storage | Source for document ingestion |
| Azure OpenAI embeddings | Convert text to vectors |