Computer Vision Solutions

Exam weight: 10–15%

Overview

AI-103 tests computer vision through two lenses: generation (creating images and video from prompts) and multimodal understanding (using models to interpret visual content). The exam also covers responsible AI for visual content.

Key Concepts

Image & Video Generation

| Capability | Model / Service | Notes |
|---|---|---|
| Text-to-image | DALL-E 3 | Generate images from text prompts |
| Image editing | DALL-E 3 (inpainting) | Edit specific regions of an image using a mask |
| Text-to-video | Azure AI video generation models | Generate short video clips from prompts |
| Video editing | Video generation models | Modify specific segments of generated video |

from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2024-02-01",
    azure_endpoint="<endpoint>",
    api_key="<key>",
)

# Generate an image
result = client.images.generate(
    model="dall-e-3",  # name of your DALL-E 3 deployment
    prompt="A photorealistic azure cloud floating above a city skyline",
    size="1024x1024",
    quality="hd",
    style="natural",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
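
The `size`, `quality`, and `style` arguments in the call above only accept a small set of values for DALL-E 3, and `n` must be 1. A minimal sketch of checking them before sending a request follows. The helper name `build_image_request` is hypothetical and not part of the `openai` library.

```python
# Values DALL-E 3 accepts for the options used in the generate() call.
ALLOWED_SIZES = {"1024x1024", "1792x1024", "1024x1792"}
ALLOWED_QUALITIES = {"standard", "hd"}
ALLOWED_STYLES = {"natural", "vivid"}


def build_image_request(prompt, size="1024x1024", quality="standard", style="vivid"):
    """Hypothetical helper: validate options and return kwargs for images.generate()."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"unsupported size: {size}")
    if quality not in ALLOWED_QUALITIES:
        raise ValueError(f"unsupported quality: {quality}")
    if style not in ALLOWED_STYLES:
        raise ValueError(f"unsupported style: {style}")
    # DALL-E 3 generates one image per request, so n is fixed at 1.
    return {"prompt": prompt, "size": size, "quality": quality, "style": style, "n": 1}
```

The returned dictionary could then be unpacked into the call, for example `client.images.generate(model="dall-e-3", **build_image_request("A city skyline", quality="hd"))`.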

Multimodal Understanding

| Capability | How to implement |
|---|---|
| Visual Q&A | Send image + question to GPT-4o |
| Image captioning | Ask GPT-4o to describe the image |
| Accessibility alt-text | Prompt for concise, accessibility-aligned descriptions |
| Object/component identification | Ask GPT-4o to identify specific objects or regions |
| Video analysis | Use Azure Content Understanding for video segments |

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage, ImageContentItem, ImageUrl, TextContentItem
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(endpoint="<endpoint>", credential=AzureKeyCredential("<key>"))

response = client.complete(
    model="gpt-4o",
    messages=[
        UserMessage(content=[
            TextContentItem(text="Describe this image and identify all visible objects."),
            ImageContentItem(image_url=ImageUrl(url="https://example.com/photo.jpg")),
        ])
    ],
)
print(response.choices[0].message.content)
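
The example above points at a publicly reachable image URL. When the image is a local file, one common approach is to embed it as a base64 data URL instead. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes the file extension correctly reflects the image type.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data, mime_type):
    """Encode raw image bytes as a data URL (data:<mime>;base64,<payload>)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_url(path):
    """Read a local image and return it as a data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    return bytes_to_data_url(Path(path).read_bytes(), mime_type)
```

The resulting string can be passed where the HTTPS URL appears, for example `ImageUrl(url=image_file_to_data_url("photo.jpg"))`. Large images inflate the request size, so resizing before encoding may be worthwhile.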

Responsible AI for Visual Content

| Control | Purpose |
|---|---|
| Content filters (vision) | Block generation of violent, sexual, or hateful images |
| Indirect prompt injection detection | Detect malicious instructions embedded in images (e.g., hidden text) |
| Watermarking | Mark AI-generated images to indicate their origin |
| Brand / prohibited symbol detection | Flag images that violate usage policies |
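
When a content filter blocks a request, the service returns an error instead of an image, and the application should tell that apart from other failures. The sketch below inspects an already-parsed error body. The two error codes are assumptions based on commonly reported Azure OpenAI responses and should be verified against the current documentation. The function name is hypothetical.

```python
# Assumed error codes signalling a content-filter rejection; verify these
# against the current Azure OpenAI documentation before relying on them.
FILTER_CODES = {"contentFilter", "content_policy_violation"}


def is_content_filter_rejection(error_body):
    """Return True if a parsed JSON error response looks like a content-filter block."""
    error = error_body.get("error") or {}
    return error.get("code") in FILTER_CODES


# Illustrative payloads (made up for this example, not captured from a live service).
blocked = {"error": {"code": "contentFilter", "message": "Request blocked by safety system."}}
throttled = {"error": {"code": "429", "message": "Rate limit exceeded."}}

print(is_content_filter_rejection(blocked))    # True
print(is_content_filter_rejection(throttled))  # False
```

A filter rejection usually calls for rephrasing the prompt or telling the user the request was refused, whereas throttling errors can simply be retried.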

Azure Content Understanding — Visual Pipelines

For structured extraction from images and video (not just description):

| Pipeline type | What it produces |
|---|---|
| Single-task | One extraction task (e.g., extract all text from an image) |
| Pro-mode | Complex multi-step extraction with custom field schemas |

Azure Services & Foundry Features

| Service | Access |
|---|---|
| DALL-E 3 | Azure OpenAI deployment in Foundry |
| GPT-4o (multimodal) | Azure OpenAI deployment in Foundry |
| Azure AI Vision | Foundry Tools → Vision |
| Azure Content Understanding | Foundry Tools → Content Understanding |
| Azure Video Indexer | Standalone Azure service |

Study Resources