Skip to main content

Computer Vision & Image Generation with Foundry

Exam objectives:

  • Interpret visual input in prompts using a deployed multimodal model
  • Create new visual outputs using generative models
  • Build a lightweight application that includes vision capabilities

Overviewโ€‹

AI-901 tests computer vision through the lens of multimodal models (models that accept images as input) and image generation models (DALL-E). You're not expected to use the classic Azure AI Vision SDK in isolation โ€” the exam focuses on how these capabilities are accessed through Foundry using modern multimodal models.

Key Conceptsโ€‹

Multimodal Vision with GPT-4oโ€‹

A multimodal model accepts both text and images as input. You can pass an image URL or base64-encoded image directly in the prompt.

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
UserMessage, SystemMessage,
ImageContentItem, ImageUrl, TextContentItem
)
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
endpoint="<your-endpoint>",
credential=AzureKeyCredential("<your-key>")
)

response = client.complete(
model="gpt-4o",
messages=[
SystemMessage("You are an image analysis assistant."),
UserMessage(content=[
TextContentItem(text="What objects are in this image?"),
ImageContentItem(image_url=ImageUrl(url="https://example.com/image.jpg"))
])
]
)
print(response.choices[0].message.content)

Vision Capabilities via Multimodal Modelsโ€‹

CapabilityHow to achieve it
Describe an imageAsk the model to describe the image in the prompt
Identify objectsAsk "what objects are in this image?"
Read text in an image (OCR)Ask "what text appears in this image?"
Visual question answeringAsk a specific question about the image
Analyze a chart or diagramAsk the model to interpret the visual data

Image Generation with DALL-Eโ€‹

from openai import AzureOpenAI

client = AzureOpenAI(
api_version="2024-02-01",
azure_endpoint="<your-endpoint>",
api_key="<your-key>"
)

result = client.images.generate(
model="dall-e-3",
prompt="A futuristic city skyline at sunset with flying cars",
n=1,
size="1024x1024",
quality="standard",
)

image_url = result.data[0].url
print(image_url)

DALL-E Configuration Parametersโ€‹

ParameterOptionsNotes
size1024x1024, 1792x1024, 1024x1792DALL-E 3
qualitystandard, hdhd is more detailed but costs more
stylevivid, naturalDALL-E 3 only
n1DALL-E 3 only supports 1 image per request

Azure Services & Foundry Featuresโ€‹

ServicePurpose
GPT-4o (multimodal)Image understanding, visual Q&A
DALL-E 3Image generation from text prompts
Azure AI VisionClassic vision tasks (OCR, object detection) โ€” accessible via Foundry Tools
Foundry playgroundTest vision prompts interactively

Study Resourcesโ€‹