Computer Vision & Image Generation with Foundry
Exam objectives:
- Interpret visual input in prompts using a deployed multimodal model
- Create new visual outputs using generative models
- Build a lightweight application that includes vision capabilities
Overviewโ
AI-901 tests computer vision through the lens of multimodal models (models that accept images as input) and image generation models (DALL-E). You're not expected to use the classic Azure AI Vision SDK in isolation โ the exam focuses on how these capabilities are accessed through Foundry using modern multimodal models.
Key Conceptsโ
Multimodal Vision with GPT-4oโ
A multimodal model accepts both text and images as input. You can pass an image URL or base64-encoded image directly in the prompt.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
UserMessage, SystemMessage,
ImageContentItem, ImageUrl, TextContentItem
)
from azure.core.credentials import AzureKeyCredential
client = ChatCompletionsClient(
endpoint="<your-endpoint>",
credential=AzureKeyCredential("<your-key>")
)
response = client.complete(
model="gpt-4o",
messages=[
SystemMessage("You are an image analysis assistant."),
UserMessage(content=[
TextContentItem(text="What objects are in this image?"),
ImageContentItem(image_url=ImageUrl(url="https://example.com/image.jpg"))
])
]
)
print(response.choices[0].message.content)
Vision Capabilities via Multimodal Modelsโ
| Capability | How to achieve it |
|---|---|
| Describe an image | Ask the model to describe the image in the prompt |
| Identify objects | Ask "what objects are in this image?" |
| Read text in an image (OCR) | Ask "what text appears in this image?" |
| Visual question answering | Ask a specific question about the image |
| Analyze a chart or diagram | Ask the model to interpret the visual data |
Image Generation with DALL-Eโ
from openai import AzureOpenAI
client = AzureOpenAI(
api_version="2024-02-01",
azure_endpoint="<your-endpoint>",
api_key="<your-key>"
)
result = client.images.generate(
model="dall-e-3",
prompt="A futuristic city skyline at sunset with flying cars",
n=1,
size="1024x1024",
quality="standard",
)
image_url = result.data[0].url
print(image_url)
DALL-E Configuration Parametersโ
| Parameter | Options | Notes |
|---|---|---|
size | 1024x1024, 1792x1024, 1024x1792 | DALL-E 3 |
quality | standard, hd | hd is more detailed but costs more |
style | vivid, natural | DALL-E 3 only |
n | 1 | DALL-E 3 only supports 1 image per request |
Azure Services & Foundry Featuresโ
| Service | Purpose |
|---|---|
| GPT-4o (multimodal) | Image understanding, visual Q&A |
| DALL-E 3 | Image generation from text prompts |
| Azure AI Vision | Classic vision tasks (OCR, object detection) โ accessible via Foundry Tools |
| Foundry playground | Test vision prompts interactively |