Computer Vision Solutions
Exam weight: 10–15%
Overview
AI-103 tests computer vision through two lenses: generation (creating images and video from prompts) and multimodal understanding (using models to interpret visual content). The exam also covers responsible AI for visual content.
Key Concepts
Image & Video Generation
| Capability | Model / Service | Notes |
|---|---|---|
| Text-to-image | DALL-E 3 | Generate images from text prompts |
| Image editing | DALL-E 3 (inpainting) | Edit specific regions of an image using a mask |
| Text-to-video | Azure AI video generation models | Generate short video clips from prompts |
| Video editing | Video generation models | Modify specific segments of generated video |
from openai import AzureOpenAI
client = AzureOpenAI(
api_version="2024-02-01",
azure_endpoint="<endpoint>",
api_key="<key>"
)
# Generate an image
result = client.images.generate(
model="dall-e-3",
prompt="A photorealistic azure cloud floating above a city skyline",
size="1024x1024",
quality="hd",
style="natural",
n=1,
)
print(result.data[0].url)
Multimodal Understanding
| Capability | How to implement |
|---|---|
| Visual Q&A | Send image + question to GPT-4o |
| Image captioning | Ask GPT-4o to describe the image |
| Accessibility alt-text | Prompt for concise, accessibility-aligned descriptions |
| Object/component identification | Ask GPT-4o to identify specific objects or regions |
| Video analysis | Use Azure Content Understanding for video segments |
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage, ImageContentItem, ImageUrl, TextContentItem
from azure.core.credentials import AzureKeyCredential
client = ChatCompletionsClient(endpoint="<endpoint>", credential=AzureKeyCredential("<key>"))
response = client.complete(
model="gpt-4o",
messages=[
UserMessage(content=[
TextContentItem(text="Describe this image and identify all visible objects."),
ImageContentItem(image_url=ImageUrl(url="https://example.com/photo.jpg"))
])
]
)
print(response.choices[0].message.content)
Responsible AI for Visual Content
| Control | Purpose |
|---|---|
| Content filters (vision) | Block generation of violent, sexual, or hateful images |
| Indirect prompt injection detection | Detect malicious instructions embedded in images (e.g., hidden text) |
| Watermarking | Mark AI-generated images to indicate their origin |
| Brand / prohibited symbol detection | Flag images that violate usage policies |
Azure Content Understanding — Visual Pipelines
For structured extraction from images and video (not just description):
| Pipeline type | What it produces |
|---|---|
| Single-task | One extraction task (e.g., extract all text from an image) |
| Pro-mode | Complex multi-step extraction with custom field schemas |
Azure Services & Foundry Features
| Service | Access |
|---|---|
| DALL-E 3 | Azure OpenAI deployment in Foundry |
| GPT-4o (multimodal) | Azure OpenAI deployment in Foundry |
| Azure AI Vision | Foundry Tools → Vision |
| Azure Content Understanding | Foundry Tools → Content Understanding |
| Azure Video Indexer | Standalone Azure service |