Information Extraction Solutions

Exam weight: 10–15%

Overview

This domain covers building retrieval and grounding pipelines (the backbone of RAG) and extracting structured content from documents, images, audio, and video using Azure Content Understanding. It connects tightly with Domain 2's RAG work.

Key Concepts

Retrieval & Grounding Pipelines (RAG Ingestion)

The ingestion side of a RAG pipeline:

Source content (PDF, Word, images, audio, video)
    ↓
Ingest → chunk into segments
    ↓
Enrich with skills (OCR, entity extraction, custom skills)
    ↓
Embed chunks → vector representations
    ↓
Index in Azure AI Search (vector + keyword fields)
    ↓
Available for hybrid / vector / semantic search at query time
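
The "chunk into segments" step above can be sketched in a few lines. This is a minimal illustration, not how Azure AI Search splits text internally: the function name `chunk_text`, the character-based sizes, and the overlap value are all assumptions chosen for the example. Production pipelines more often split on tokens or sentence boundaries.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Each window starts `overlap` characters before the previous one ended,
    so a sentence cut at a boundary still appears whole in one chunk.
    (Hypothetical helper; sizes are illustrative, not service defaults.)
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window has reached the end of the text
        if start + max_chars >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "Azure AI Search indexes chunks, not whole documents. " * 5
    for i, chunk in enumerate(chunk_text(sample, max_chars=80, overlap=20)):
        print(i, repr(chunk))
```

The overlap trades index size for recall: larger overlaps duplicate more text but reduce the chance that a relevant passage is split across two chunks.
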

Concept | Description
Chunking | Splitting documents into segments that fit in a model's context window
Embedding | Converting text to a vector using an embedding model (e.g., text-embedding-ada-002)
Vector search | Finding similar documents by comparing embedding vectors
Hybrid search | Combining keyword (BM25) + vector search for better recall
Semantic ranking | Re-ranking results using a language model for relevance
Skillset | A pipeline of enrichment steps (OCR, entity extraction, custom functions) applied during indexing
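
To make the vector search and hybrid search rows concrete, here is a small self-contained sketch. It ranks toy documents by cosine similarity, then merges a keyword ranking and a vector ranking with Reciprocal Rank Fusion (RRF), the method Azure AI Search documents for combining hybrid results. The function names, the toy vectors, and the constant `k=60` are assumptions for illustration; the service's internal implementation is not shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" (real ones have hundreds of dimensions)
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    vector_ranking = vector_search([1.0, 0.1], docs, k=3)
    keyword_ranking = ["b", "a", "c"]  # pretend BM25 output
    print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
```

Because RRF works on ranks rather than raw scores, BM25 scores and vector similarities never need to be normalized onto a common scale.
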

Azure AI Search: Core Components

Component | Purpose
Index | The searchable data store: fields, types, searchability
Indexer | Pulls data from a source and runs the skillset
Skillset | Enrichment pipeline: OCR, entity extraction, embedding, custom skills
Data source | Connection to Blob Storage, Cosmos DB, SQL, etc.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="my-index",
    credential=AzureKeyCredential("<key>"),
)

# get_embedding() is not defined in this snippet. It stands for a helper that
# returns the query's embedding as a list of floats, produced by the same
# embedding model that populated the contentVector field.
query_vector = VectorizedQuery(
    vector=get_embedding("What is Azure AI Foundry?"),
    k_nearest_neighbors=3,
    fields="contentVector",
)

# Hybrid search (keyword + vector) with semantic re-ranking on top
results = search_client.search(
    search_text="Azure AI Foundry",
    vector_queries=[query_vector],
    select=["title", "content", "url"],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    top=5,
)

# @search.score is the hybrid score; semantic queries also return
# @search.reranker_score for each result.
for r in results:
    print(r["title"], r["@search.score"])
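
The query above assumes an index that already has `title`, `content`, `url`, and `contentVector` fields plus a semantic configuration named `my-semantic-config`. The following sketch builds a matching index definition as a plain dictionary in the shape of the REST "create index" payload. The `id` key field, the profile name `vector-profile`, and the algorithm name `hnsw-config` are hypothetical, and the property names reflect the REST schema as commonly documented, so they should be checked against the API version in use.

```python
import json

EMBEDDING_DIMENSIONS = 1536  # output size of text-embedding-ada-002


def build_index_definition(name: str = "my-index") -> dict:
    """Return an index definition with keyword, vector, and semantic settings."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},  # hypothetical key field
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "url", "type": "Edm.String", "searchable": False},
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": EMBEDDING_DIMENSIONS,
                "vectorSearchProfile": "vector-profile",
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "hnsw-config", "kind": "hnsw"}],
            "profiles": [{"name": "vector-profile", "algorithm": "hnsw-config"}],
        },
        "semantic": {
            "configurations": [
                {
                    "name": "my-semantic-config",
                    "prioritizedFields": {
                        "titleField": {"fieldName": "title"},
                        "prioritizedContentFields": [{"fieldName": "content"}],
                    },
                }
            ]
        },
    }


def vector_profiles_resolve(index: dict) -> bool:
    """Check that every vector field names an existing profile and algorithm."""
    profiles = index["vectorSearch"]["profiles"]
    profile_names = {p["name"] for p in profiles}
    algorithm_names = {a["name"] for a in index["vectorSearch"]["algorithms"]}
    fields_ok = all(
        f["vectorSearchProfile"] in profile_names
        for f in index["fields"]
        if "vectorSearchProfile" in f
    )
    profiles_ok = all(p["algorithm"] in algorithm_names for p in profiles)
    return fields_ok and profiles_ok


if __name__ == "__main__":
    definition = build_index_definition()
    assert vector_profiles_resolve(definition)
    print(json.dumps(definition, indent=2))
```

The vector field's `dimensions` value must match the embedding model used at both indexing and query time; a mismatch is a common cause of rejected documents or queries.
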

Azure Content Understanding

Extracts structured data from multi-modal content:

Content type | Analyzer type | Example output
Documents / forms | prebuilt-invoice, prebuilt-receipt, custom | Structured fields (InvoiceTotal, VendorName)
Images | Custom or prebuilt-read | OCR text, layout, detected objects
Audio | Custom audio analyzer | Transcript, speaker turns, topics
Video | Custom video analyzer | Transcript, scene descriptions, on-screen text

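import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "<your-key>"

# Submit a document for extraction (the call is asynchronous)
response = requests.post(
    f"{endpoint}contentunderstanding/analyzers/prebuilt-invoice:analyze?api-version=2024-12-01-preview",
    headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
    json={"url": "https://example.com/invoice.pdf"},
)
response.raise_for_status()

# The polling URL comes back in the Operation-Location response header
operation_url = response.headers["Operation-Location"]
# Poll GET operation_url (with the same key header) until the status is
# "Succeeded" (compare case-insensitively), then read the extracted fields
# from the result payload, e.g. result["result"]["contents"][0]["fields"].
# Header and payload names can differ between API versions, so check them
# against the version in use.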
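
The polling step described in the comments above can be sketched as a small loop. This is an illustration under assumptions rather than the service's official client: `poll_until_done` is a hypothetical helper, the status fetch is passed in as a function so the loop can be exercised without a network call, and the terminal failure statuses (`failed`, `canceled`) are assumed values.

```python
import time
from typing import Callable


def poll_until_done(
    fetch: Callable[[], dict],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> dict:
    """Call fetch() repeatedly until the operation reports success.

    fetch must return the parsed JSON body of the operation status request.
    Raises RuntimeError on a terminal failure and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch()
        # Compare case-insensitively: "Succeeded" and "succeeded" both match
        status = str(body.get("status", "")).lower()
        if status == "succeeded":
            return body
        if status in ("failed", "canceled"):  # assumed terminal statuses
            raise RuntimeError(f"Analysis ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated status responses standing in for real HTTP calls
    fake_responses = iter([
        {"status": "Running"},
        {"status": "Succeeded", "result": {"contents": []}},
    ])
    print(poll_until_done(lambda: next(fake_responses), interval_s=0.0))
```

With `requests`, the `fetch` argument could be a lambda that issues a GET against the operation URL with the subscription key header and returns `.json()`.
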

Producing Grounded Representations for RAG

Content Understanding can output Markdown, which is well suited to chunking and feeding into a RAG pipeline:

Document PDF → Content Understanding (layout + OCR)
    ↓
Markdown output (headings, tables, text preserved)
    ↓
Chunk + embed + index in Azure AI Search
    ↓
Query → retrieve → LLM generates grounded response
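
One reason Markdown output suits this flow is that headings give natural chunk boundaries. The sketch below splits a Markdown string into one chunk per heading and keeps the heading text as metadata. The function name `split_markdown_by_heading` and the sample invoice text are assumptions for illustration. The splitter is deliberately simple: for instance, it does not skip `#` lines inside fenced code blocks.

```python
import re

# A Markdown ATX heading: 1 to 6 '#' characters, whitespace, then the title
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Return one {"title", "content"} dict per non-empty heading section.

    Text that appears before the first heading gets an empty title.
    """
    sections: list[dict] = []
    title = ""
    lines: list[str] = []

    def flush() -> None:
        # Only emit a section if it has some non-blank content
        if any(line.strip() for line in lines):
            sections.append({"title": title, "content": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = (
        "# Invoice\n"
        "Vendor: Contoso\n"
        "\n"
        "## Line items\n"
        "| Item | Price |\n"
        "| A | 10 |\n"
    )
    for section in split_markdown_by_heading(sample):
        print(section["title"], "->", repr(section["content"]))
```

Storing the heading alongside each chunk lets the title be indexed as its own field, which helps both keyword matching and semantic ranking.
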

Azure Services & Foundry Features

Service | Purpose
Azure AI Search | Indexing, vector search, hybrid search, RAG
Azure Content Understanding | Multi-modal extraction (Foundry Tools)
Azure Blob Storage | Source for document ingestion
Azure OpenAI embeddings | Convert text to vectors

Study Resources