Computer Vision & Image Generation with Foundry

Exam objectives:

Interpret visual input in prompts using a deployed multimodal model
Create new visual outputs using generative models
Build a lightweight application that includes vision capabilities

Overview

AI-901 tests computer vision through the lens of multimodal models (models that accept images as input) and image generation models (DALL-E). You're not expected to use the classic Azure AI Vision SDK in isolation — the exam focuses on how these capabilities are accessed through Foundry using modern multimodal models.

Key Concepts

Multimodal Vision with GPT-4o

A multimodal model accepts both text and images as input. You can pass an image URL or base64-encoded image directly in the prompt.

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    UserMessage, SystemMessage,
    ImageContentItem, ImageUrl, TextContentItem
)
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="<your-endpoint>",
    credential=AzureKeyCredential("<your-key>")
)

response = client.complete(
    model="gpt-4o",
    messages=[
        SystemMessage("You are an image analysis assistant."),
        UserMessage(content=[
            TextContentItem(text="What objects are in this image?"),
            ImageContentItem(image_url=ImageUrl(url="https://example.com/image.jpg"))
        ])
    ]
)
print(response.choices[0].message.content)

Vision Capabilities via Multimodal Models

Capability	How to achieve it
Describe an image	Ask the model to describe the image in the prompt
Identify objects	Ask "what objects are in this image?"
Read text in an image (OCR)	Ask "what text appears in this image?"
Visual question answering	Ask a specific question about the image
Analyze a chart or diagram	Ask the model to interpret the visual data

Image Generation with DALL-E

from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2024-02-01",
    azure_endpoint="<your-endpoint>",
    api_key="<your-key>"
)

result = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city skyline at sunset with flying cars",
    n=1,
    size="1024x1024",
    quality="standard",
)

image_url = result.data[0].url
print(image_url)

DALL-E Configuration Parameters

Parameter	Options	Notes
`size`	`1024x1024`, `1792x1024`, `1024x1792`	DALL-E 3
`quality`	`standard`, `hd`	`hd` is more detailed but costs more
`style`	`vivid`, `natural`	DALL-E 3 only
`n`	`1`	DALL-E 3 only supports 1 image per request

Azure Services & Foundry Features

Service	Purpose
GPT-4o (multimodal)	Image understanding, visual Q&A
DALL-E 3	Image generation from text prompts
Azure AI Vision	Classic vision tasks (OCR, object detection) — accessible via Foundry Tools
Foundry playground	Test vision prompts interactively

Study Resources

📖 Use images in prompts with Azure OpenAI
📖 DALL-E image generation quickstart
📖 Azure AI Vision documentation
🧪 Vision Studio

Overview​

Key Concepts​

Multimodal Vision with GPT-4o​

Vision Capabilities via Multimodal Models​

Image Generation with DALL-E​

DALL-E Configuration Parameters​

Azure Services & Foundry Features​

Study Resources​