Information Extraction Solutions

Exam weight: 10–15%

Overview

This domain covers building retrieval and grounding pipelines (the backbone of RAG) and extracting structured content from documents, images, audio, and video using Azure Content Understanding. It connects tightly with Domain 2's RAG work.

Key Concepts

Retrieval & Grounding Pipelines (RAG Ingestion)

The ingestion side of a RAG pipeline:

Source content (PDF, Word, images, audio, video)
       ↓
Ingest → chunk into segments
       ↓
Enrich with skills (OCR, entity extraction, custom skills)
       ↓
Embed chunks → vector representations
       ↓
Index in Azure AI Search (vector + keyword fields)
       ↓
Available for hybrid / vector / semantic search at query time
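The chunking step above can be sketched in a few lines. This is a minimal illustration, not the way Azure AI Search chunks content: it splits on characters rather than tokens, and the function name `chunk_text` and the `max_chars` and `overlap` defaults are assumptions chosen for the example. The overlap repeats the tail of each chunk at the start of the next so a sentence cut at a boundary still appears whole in one of them.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", max_chars=4, overlap=1):
        print(piece)
```

In a real pipeline the limits would usually be expressed in tokens, since embedding models cap input by token count, and splits would preferably fall on sentence or section boundaries.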

Concept	Description
Chunking	Splitting documents into segments small enough for the embedding model's input limit and for the LLM's context window at query time
Embedding	Converting text to a vector using an embedding model (e.g., `text-embedding-ada-002`)
Vector search	Finding similar documents by comparing embedding vectors
Hybrid search	Running keyword (BM25) and vector search in one query and merging the two ranked lists with Reciprocal Rank Fusion (RRF) for better recall
Semantic ranking	Re-ranking results using a language model for relevance
Skillset	A pipeline of enrichment steps (OCR, entity extraction, custom functions) applied during indexing
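To make the vector and hybrid search rows concrete, here is a toy sketch of the two ideas: ranking documents by cosine similarity between embeddings, and merging a keyword ranking with a vector ranking using Reciprocal Rank Fusion, the fusion method Azure AI Search documents for hybrid queries. The function names, the in-memory dictionary of vectors, and the constant `k=60` are assumptions for illustration; the service itself uses approximate nearest-neighbor indexes rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k documents whose vectors are most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


if __name__ == "__main__":
    keyword_ranking = ["a", "b", "c"]  # hypothetical BM25 order
    vector_ranking = ["c", "a", "d"]   # hypothetical vector order
    print(rrf_fuse([keyword_ranking, vector_ranking]))
```

Because RRF works on ranks rather than raw scores, it can combine BM25 scores and similarity scores even though the two live on different scales.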

Azure AI Search — Core Components

Component	Purpose
Index	The searchable data store — fields, types, searchability
Indexer	Pulls data from a source and runs the skillset
Skillset	Enrichment pipeline — OCR, entity extraction, embedding, custom skills
Data source	Connection to Blob Storage, Cosmos DB, SQL, etc.
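As a rough picture of what an index holds, the sketch below builds an index definition as a plain Python dictionary modeled on the body of the REST Create Index call. The field names (`title`, `content`, `url`, `contentVector`) and the semantic configuration name mirror the query example in this section; the `id` key field, the profile and algorithm names, and the helper name `build_index_definition` are assumptions. The default of 1536 dimensions matches `text-embedding-ada-002`. Check the property names against the API version you target before sending anything like this to the service.

```python
import json


def build_index_definition(name: str, dimensions: int = 1536) -> dict:
    """Return a dict shaped like a Create Index request body with keyword and vector fields."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "url", "type": "Edm.String", "retrievable": True},
            {
                # Vector field: must use the same dimensionality as the embedding model
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": dimensions,
                "vectorSearchProfile": "default-profile",
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "default-hnsw", "kind": "hnsw"}],
            "profiles": [{"name": "default-profile", "algorithm": "default-hnsw"}],
        },
        "semantic": {
            "configurations": [
                {
                    "name": "my-semantic-config",
                    "prioritizedFields": {
                        "titleField": {"fieldName": "title"},
                        "prioritizedContentFields": [{"fieldName": "content"}],
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_definition("my-index"), indent=2))
```

The indexer, skillset, and data source are defined in the same declarative style and refer to each other and to the index by name.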

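from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="my-index",
    credential=AzureKeyCredential("<key>"),
)

# Assumption: an Azure OpenAI resource with an embedding deployment. The endpoint,
# key, API version, and deployment name are placeholders to replace with your own.
openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai>.openai.azure.com",
    api_key="<openai-key>",
    api_version="2024-02-01",
)

def get_embedding(text: str) -> list[float]:
    """Embed the query with the same model that produced the index's contentVector field."""
    response = openai_client.embeddings.create(model="<embedding-deployment>", input=text)
    return response.data[0].embedding

# Hybrid search: keyword + vector, then re-ranked by the semantic ranker
query_vector = VectorizedQuery(
    vector=get_embedding("What is Azure AI Foundry?"),
    k_nearest_neighbors=3,
    fields="contentVector",
)

results = search_client.search(
    search_text="Azure AI Foundry",
    vector_queries=[query_vector],
    select=["title", "content", "url"],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    top=5,
)

for r in results:
    # @search.score is the fused hybrid score; the reranker score is set for semantic queries
    print(r["title"], r["@search.score"], r.get("@search.reranker_score"))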
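Retrieved results only ground a response once they are placed in front of the model. The sketch below shows one possible way to turn search hits into a prompt with numbered, citable sources. The function name `build_grounded_prompt`, the instruction wording, and the per-document character cap are assumptions for illustration; the `title`, `content`, and `url` keys match the fields selected in the query above.

```python
def build_grounded_prompt(question: str, documents: list[dict], max_chars_per_doc: int = 500) -> str:
    """Format retrieved documents as numbered sources followed by the user's question."""
    sources = []
    for i, doc in enumerate(documents, start=1):
        snippet = doc["content"][:max_chars_per_doc]  # keep the prompt within a size budget
        sources.append(f"[{i}] {doc['title']} ({doc['url']})\n{snippet}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    hits = [{"title": "Example page", "content": "Example passage text.", "url": "https://example.com/page"}]
    print(build_grounded_prompt("What does the example page say?", hits))
```

Numbering the sources lets the model's answer be traced back to specific documents, which is the practical meaning of grounding.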

Azure Content Understanding

Azure Content Understanding extracts structured data from multimodal content:

Content type	Analyzer type	Example output
Documents / forms	`prebuilt-invoice`, `prebuilt-receipt`, custom	Structured fields (InvoiceTotal, VendorName)
Images	Custom or `prebuilt-read`	OCR text, layout, detected objects
Audio	Custom audio analyzer	Transcript, speaker turns, topics
Video	Custom video analyzer	Transcript, scene descriptions, on-screen text

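import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "<your-key>"
auth = {"Ocp-Apim-Subscription-Key": key}

# Submit a document for extraction; analysis is asynchronous, so this returns before
# the result is ready. Check that the analyzer ID exists for the api-version you use.
response = requests.post(
    f"{endpoint}contentunderstanding/analyzers/prebuilt-invoice:analyze?api-version=2024-12-01-preview",
    headers={**auth, "Content-Type": "application/json"},
    json={"url": "https://example.com/invoice.pdf"},
)
response.raise_for_status()

# The Operation-Location response header carries the URL to poll for the result
operation_url = response.headers["Operation-Location"]

# Poll until the operation finishes (status is compared case-insensitively)
while True:
    result = requests.get(operation_url, headers=auth).json()
    if result["status"].lower() in ("succeeded", "failed"):
        break
    time.sleep(2)

# On success, extracted fields are under result -> contents; verify the shape for your API version
if result["status"].lower() == "succeeded":
    fields = result["result"]["contents"][0]["fields"]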
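Once an analysis has succeeded, the extracted fields come back as nested JSON rather than flat values. The sketch below flattens them into a simple name-to-value dictionary. It assumes each field is an object with a `type` and a single typed value key such as `valueString` or `valueNumber`; that shape is an assumption to confirm against the response you actually receive. The function name `simplify_fields` and the sample payload are made up for illustration.

```python
def simplify_fields(fields: dict) -> dict:
    """Map each field name to its typed value, or None when no value key is present."""
    simple = {}
    for name, field in fields.items():
        # Assumed convention: the value lives under a key starting with "value"
        value_key = next((k for k in field if k.startswith("value")), None)
        simple[name] = field[value_key] if value_key else None
    return simple


if __name__ == "__main__":
    # Illustrative payload only, not real service output
    sample = {
        "VendorName": {"type": "string", "valueString": "Contoso"},
        "InvoiceTotal": {"type": "number", "valueNumber": 110.0},
    }
    print(simplify_fields(sample))
```

Returning `None` for a field without a value keeps missing data visible to downstream code instead of silently dropping the key.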

Producing Grounded Representations for RAG

Content Understanding can return a document's content as Markdown that preserves headings and tables, which gives natural boundaries for chunking before the content enters a RAG pipeline:

Document PDF → Content Understanding (layout + OCR)
                    ↓
             Markdown output (headings, tables, text preserved)
                    ↓
             Chunk + embed + index in Azure AI Search
                    ↓
             Query → retrieve → LLM generates grounded response
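Because the Markdown output keeps headings, one simple way to chunk it is by section. The sketch below splits a Markdown string at heading lines and keeps each heading alongside its text, so the heading can be stored as metadata or prepended to the chunk before embedding. The function name `split_markdown_by_heading` is an assumption, and this is a deliberately naive splitter: it does not handle `#` lines inside fenced code blocks and does not cap section length.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading, keeping any text before the first heading."""
    sections = []
    heading, lines = "", []

    def flush() -> None:
        # Skip sections that have neither a heading nor any non-blank text
        if heading or any(line.strip() for line in lines):
            sections.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):  # ATX heading such as "## Items"
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return sections


if __name__ == "__main__":
    md = "# Invoice\nTotal: 110\n\n## Items\n| a | b |\n"
    for section in split_markdown_by_heading(md):
        print(section)
```

Sections that are still too long for the embedding model can be split further, for example with a fixed-size window, while keeping the heading attached to every piece.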

Azure Services & Foundry Features

Service	Purpose
Azure AI Search	Indexing, vector search, hybrid search, RAG
Azure Content Understanding	Multi-modal extraction (Foundry Tools)
Azure Blob Storage	Source for document ingestion
Azure OpenAI embeddings	Convert text to vectors
