Computer Vision Solutions

Exam weight: 10–15%

Overview

AI-103 tests computer vision through two lenses: generation (creating images and video from prompts) and multimodal understanding (using models to interpret visual content). The exam also covers responsible AI for visual content.

Key Concepts

Image & Video Generation

Capability	Model / Service	Notes
Text-to-image	DALL-E 3	Generate images from text prompts
Image editing	DALL-E 3 (inpainting)	Edit specific regions of an image using a mask
Text-to-video	Azure AI video generation models	Generate short video clips from prompts
Video editing	Video generation models	Modify specific segments of generated video

from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2024-02-01",
    azure_endpoint="<endpoint>",
    api_key="<key>"
)

# Generate an image
result = client.images.generate(
    model="dall-e-3",
    prompt="A photorealistic azure cloud floating above a city skyline",
    size="1024x1024",
    quality="hd",
    style="natural",
    n=1,
)
print(result.data[0].url)

Multimodal Understanding

Capability	How to implement
Visual Q&A	Send image + question to GPT-4o
Image captioning	Ask GPT-4o to describe the image
Accessibility alt-text	Prompt for concise, accessibility-aligned descriptions
Object/component identification	Ask GPT-4o to identify specific objects or regions
Video analysis	Use Azure Content Understanding for video segments

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage, ImageContentItem, ImageUrl, TextContentItem
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(endpoint="<endpoint>", credential=AzureKeyCredential("<key>"))

response = client.complete(
    model="gpt-4o",
    messages=[
        UserMessage(content=[
            TextContentItem(text="Describe this image and identify all visible objects."),
            ImageContentItem(image_url=ImageUrl(url="https://example.com/photo.jpg"))
        ])
    ]
)
print(response.choices[0].message.content)

Responsible AI for Visual Content

Control	Purpose
Content filters (vision)	Block generation of violent, sexual, or hateful images
Indirect prompt injection detection	Detect malicious instructions embedded in images (e.g., hidden text)
Watermarking	Mark AI-generated images to indicate their origin
Brand / prohibited symbol detection	Flag images that violate usage policies

Azure Content Understanding — Visual Pipelines

For structured extraction from images and video (not just description):

Pipeline type	What it produces
Single-task	One extraction task (e.g., extract all text from an image)
Pro-mode	Complex multi-step extraction with custom field schemas

Azure Services & Foundry Features

Service	Access
DALL-E 3	Azure OpenAI deployment in Foundry
GPT-4o (multimodal)	Azure OpenAI deployment in Foundry
Azure AI Vision	Foundry Tools → Vision
Azure Content Understanding	Foundry Tools → Content Understanding
Azure Video Indexer	Standalone Azure service

Overview​

Key Concepts​

Image & Video Generation​

Multimodal Understanding​

Responsible AI for Visual Content​

Azure Content Understanding — Visual Pipelines​

Azure Services & Foundry Features​

Study Resources​