Anthropic Claude 3 Sonnet: vision capabilities

10 minute read
Content level: Advanced

An overview of Claude 3 Sonnet vision capabilities through a set of six examples going in different directions


1. Introduction

The Anthropic Claude 3 models (Haiku, Sonnet, Opus), announced very recently, are currently the only ones on the Amazon Bedrock list of models with the “Text & Vision” modality. To apply Generative AI to the acceleration of Mainframe Modernization projects, we are currently exploring the capabilities of such LLMs to automatically generate application documentation and functional test code in order to accelerate migration projects.

We didn’t see many articles detailing those vision features. So, I explored them to discover more precisely what they encompass. In the frenzy of discovery, I went beyond the features needed for code documentation and code generation. Six different use cases are presented below.

I used Sonnet, the model in the middle range of power for Claude v3. In the following sections, different kinds of images are submitted to Sonnet on Amazon Bedrock to discover how the LLM “views” them and which kind of information this model can derive from the picture or photograph. To avoid giving any hints about their content, unrelated names were used for the image files. Also, no additional metadata about those pictures was given to the LLM: just the raw image files.
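As an illustration, here is a minimal sketch of how such a raw image file can be submitted to Claude 3 Sonnet on Amazon Bedrock with boto3 and the Anthropic Messages request format. The model ID, the `max_tokens` value, and the helper names are assumptions for the example, not the exact setup used in this article:

```python
import base64
import json


def build_vision_request(image_path: str, prompt: str) -> str:
    """Build the Anthropic Messages API body for one image plus a text prompt."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    # The raw image, base64-encoded, with no extra metadata
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }
    return json.dumps(body)


def describe_image(image_path: str, prompt: str) -> str:
    """Invoke Claude 3 Sonnet on Bedrock (requires AWS credentials and access)."""
    import boto3  # deferred so the payload builder works without boto3 installed

    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=build_vision_request(image_path, prompt),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
```

The same payload builder is reused for every use case below; only the image file and the prompt text change.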

2. Summary of discovered Vision Capabilities

The various image-understanding prompts below demonstrate the following capabilities:

  • recognize and precisely describe objects (buildings and monuments in Paris), animals (eagle), and concepts (drawing of a chess board, fields on a screen, components of a technical schema)
  • reason in order to inventory multiple similar objects (data fields) in one image and recognize features (colors, positions, etc.) of those objects
  • reason and do mathematical computations to create coordinates (x, y) for fields on a screen and buildings on a map, though the computations are still partially imprecise or incorrect
  • scrutinize the image, foreground and background, to surface second-order details: leaf debris in the image (eagle), monospace font (3270 map)
  • read text present on a map / image (London map) via OCR-like features
  • link objects of the picture with related textual content in the knowledge base of the LLM (Paris and eagle pictures)
  • self-awareness of potential issues: mentions of potentially imprecise calculations

The above list is very probably incomplete, but that’s what came to mind when analyzing the responses.

It is impressive - for me, at least! - to see such a blend of multiple advanced and diverse capabilities in a single model. Such a level of sophistication would have demanded a thorough integration of multiple complex tools and services some years ago, in the pre-GenAI era. With Anthropic Sonnet, it all comes together with simple prompts after a few seconds of inference. Generative AI is clearly a game changer: the way services and applications will be developed in the (near) future and the added value that they will deliver can be envisioned from totally different perspectives. In addition, just imagine what those LLMs will deliver a few releases from now, with much higher precision and richer content. Indeed, it is hard - for me - to define exactly where we will be in 12 months: the human mind doesn’t deal easily with predicting exponential leaps in capabilities.

A totally new range of applications will be enabled, and existing ones will become much more sophisticated!

Now, back to my drawing board for acceleration of migration projects with those new present and future capabilities in mind.

3. 3270 Mainframe Map

If you are familiar with mainframe applications, you know “green screens” like the one below: it is a 3270 screen (aka BMS map) displayed by a CICS transaction running on a mainframe. So, I started by asking Sonnet to describe this map with the fields and their location.

[Image: 3270 screen of the CICS account inquiry transaction]

Prompt: Can you describe this mainframe map with its fields, their attributes (color, size, etc.) ? Please, also describe location (x,y) of each field that you find.

Sure, here's a detailed description of the mainframe screen with the fields, their attributes, and their approximate (x, y) 
coordinates:

1. Top Bar (Green text, small font size): "DEMOCMP Account Inquiry"
   Location: (x=1, y=1)
2. Instruction (White text, small font size): "Enter Account Number and press ENTER"
   Location: (x=1, y=3)
3. Account Number (Green text, small font size): "00000001"
   Location: (x=2, y=5)
4. First Name (Green text, small font size): "C"
   Location: (x=2, y=6)
5. Last Name (Green text, small font size): "Brown"
   Location: (x=2, y=7)
6. Address 1 (Green text, small font size): "1604 Cross Bend Rd"
   Location: (x=2, y=8)
7. Address 2 (Green text, small font size): "Plano"
   Location: (x=2, y=9)
8. State (Green text, small font size): "TX"
   Location: (x=10, y=9)
9. Zip Code (Green text, small font size): "75023"
   Location: (x=2, y=10)
10. Phone Number (Green text, small font size): "972.555.0001"
    Location: (x=2, y=11)
11. Account Balance (Green text, small font size): "$1,000.00"
    Location: (x=2, y=12)
12. Avg Daily Balance (Green text, small font size): "$962.16"
    Location: (x=2, y=13)
13. Status Messages (White text, small font size):
    * "Name and Address record read"
      Location: (x=1, y=15)
    * "Account Balance record read"
      Location: (x=1, y=16)
    * "Phone Number record read"
      Location: (x=1, y=17)
14. Timestamp (White text, small font size): "Inquiry Logged for account 00000001 Fri, 07 Sep 2012 16:07:36 GMT"
    Location: (x=1, y=18)
15. Instructions (White text, small font size): "Press the CLEAR key to exit, PF12 for next"
    Location: (x=1, y=20)

The fields are displayed in a monospaced font, with labels and values in different colors 
(green and white) on a black background, which is typical of mainframe terminal screens. 
The coordinates provided are approximate, with (x=1, y=1) representing the top-left corner 
of the screen.
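A textual inventory like the one above can be post-processed into structured data, for instance to compare the fields against a BMS map definition. A minimal sketch, assuming the response keeps the field/location layout shown above (the regexes and function name are illustrative):

```python
import re

# Matches field lines like: 3. Account Number (Green text, small font size): "00000001"
FIELD_RE = re.compile(
    r'\d+\.\s*(?P<name>[^(]+)\((?P<color>\w+) text[^)]*\):\s*"(?P<value>[^"]*)"'
)
# Matches location lines like: Location: (x=2, y=5)
COORD_RE = re.compile(r'\(x=(?P<x>\d+),\s*y=(?P<y>\d+)\)')


def parse_fields(response: str) -> list[dict]:
    """Pair each field line with the coordinate line that follows it."""
    fields = []
    pending = None
    for line in response.splitlines():
        m = FIELD_RE.search(line)
        if m:
            pending = {
                "name": m.group("name").strip(),
                "color": m.group("color").lower(),
                "value": m.group("value"),
            }
            continue
        c = COORD_RE.search(line)
        if c and pending:
            pending["x"] = int(c.group("x"))
            pending["y"] = int(c.group("y"))
            fields.append(pending)
            pending = None
    return fields
```

Since the model itself warns that the coordinates are approximate, the parsed output is best treated as a candidate layout to be validated, not as ground truth.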

4. AWS Architecture Diagram

If you’re a heavy AWS user, you know diagrams like the one below representing a solution implementation. It’s interesting to see how it gets analyzed by the LLM. It seems that Claude Sonnet could probably pass the initial certification of the AWS curriculum!

[Image: AWS architecture diagram of a data restore solution]

Prompt: can you describe this image?

The image appears to be an architectural diagram or workflow illustrating an AWS cloud 
solution. It consists of various AWS services and their interactions, enclosed within dashed 
rectangles representing different stages or processes.

In the "Formulate restore request" stage, the AWS Fargate service is connected to a Bucket 
service.

The "Query S3 directory" stage involves the S3 Glacier Deep Archive service, which is queried 
by an AWS Lambda function.

The "Initiate restore" stage also has an AWS Lambda function, which triggers the Amazon Simple 
Queue Service (Amazon SQS).

Finally, in the "Update status" stage, the Amazon DynamoDB service is updated, likely reflecting 
the restoration status.

The diagram showcases the integration and orchestration of multiple AWS services to handle a 
data restoration process, potentially from the S3 Glacier Deep Archive storage to a Bucket.

5. City Map

In the next trial, we want to see which information the LLM can extract from a city map. We do not mention the city of London in the prompt but choose a map with famous locations and buildings. We didn’t ask for it here, but the LLM is able to augment each item of the list with a precise description of each identified location, as it saw gazillions of texts about them during its training.

[Image: map of central London]

Prompt: Can you describe this picture? where are we? Please provide the coordinates (x,y) with range [0...100] of each notable place that you locate.

Based on the map image, we are looking at the area around St. James's Park and Buckingham 
Palace in central London, England. Here are the notable places and their approximate (x, y) 
coordinates within the range [0...100]:

1. Buckingham Palace: (x=29, y=63)
2. St. James's Palace: (x=38, y=53)
3. The Mall: (x=46, y=47)
4. St. James's Park: (x=49, y=60)
5. Canada Gate: (x=43, y=71)
6. Victoria Memorial: (x=36, y=69)
7. The Guards Museum: (x=39, y=75)
8. Admiralty Arch: (x=54, y=42)
9. Whitehall: (x=57, y=51)
10. The Cenotaph: (x=64, y=75)
11. Westminster Abbey: (x=73, y=84)
12. Big Ben (Elizabeth Tower): (x=68, y=79)
13. St. James's Park Station: (x=55, y=66)
14. Institute of Contemporary Arts: (x=61, y=36)
15. Hayward Gallery: (x=90, y=76)

Please note that these coordinates are approximate and scaled to fit within the [0...100] range 
based on the visual representation in the map image. The actual coordinates may vary slightly 
due to the scaling and subjective identification of precise locations.
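One way to check those normalized coordinates is to map them back onto the image and draw markers at the resulting pixels. A small sketch of the scaling step, assuming a map image of known pixel dimensions (the 1200x800 size is an arbitrary example):

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Convert normalized [0...100] coordinates to pixel coordinates,
    with (0, 0) at the top-left corner of the image."""
    return round(x / 100 * width), round(y / 100 * height)


# Example: Buckingham Palace reported at (29, 63) on a 1200x800 map image
px, py = to_pixels(29, 63, 1200, 800)
```

Overlaying such markers makes the partially imprecise computations mentioned in section 2 immediately visible: most markers land close to the right building, some a noticeable distance away.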

6. Chess board

I am a chess player. So, I wanted to see if the LLM would recognize the kind of drawing we use to represent games and puzzles and what it would say about the position of the pieces on the board.

[Image: chessboard drawing with the standard starting position]

Prompt: Can you describe this image?

The image shows a chessboard with all the chess pieces set up in their standard starting positions 
for a new game. The pieces are arranged on the back ranks (rows closest to each player) with the 
white pieces on the bottom and the black pieces on the top. 

On the white side, from left to right, there is a rook, knight, bishop, queen, king, bishop, knight, and rook. 

On the black side, the piece arrangement mirrors the white side, with a rook, knight, bishop, queen, 
king, bishop, knight, and rook from left to right. In front of each back rank are the eight pawns for each 
side, forming the second ranks.
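The arrangement described by the model matches the standard starting position, which is easy to check programmatically. A small sketch in plain Python, without any chess library (the names are illustrative):

```python
# Back rank from left to right as described by the model (a-file to h-file)
WHITE_BACK_RANK = ["rook", "knight", "bishop", "queen", "king", "bishop", "knight", "rook"]
# Black's back rank mirrors White's: same piece order on the eighth rank
BLACK_BACK_RANK = list(WHITE_BACK_RANK)

# Standard FEN letters for each piece (lowercase = black, uppercase = white)
PIECE_LETTER = {"rook": "r", "knight": "n", "bishop": "b", "queen": "q", "king": "k"}


def back_rank_fen(pieces: list[str], white: bool = False) -> str:
    """Encode a back rank as the corresponding FEN rank string."""
    rank = "".join(PIECE_LETTER[p] for p in pieces)
    return rank.upper() if white else rank


def starting_position_fen() -> str:
    """FEN of the standard starting position, for comparison."""
    return "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
```

Both ranks encode to the familiar `rnbqkbnr` pattern of the starting-position FEN, confirming that the model's verbal description is consistent.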

7. Personal Image

Finally, to see if bias from LLM training had an impact on recognition capabilities, I used a personal photo (taken at the Volerie des Aigles, Alsace, France, if you wonder) to make sure that Sonnet had never seen it before.

[Image: personal photo of a bald eagle perched on a stump]

Prompt: can you describe this image?

The image shows a bald eagle perched on a wooden stump or log in what appears to be a natural setting 
with rock and leaf debris in the background.

The bald eagle, which is the national symbol of the United States, has its distinctive white head and tail 
feathers contrasting with its dark brown body feathers. Its powerful curved beak and intense eyes are 
clearly visible, giving it a majestic and commanding presence.

The eagle is sitting upright on the stump, which has a weathered and textured surface, possibly an old 
tree stump or piece of driftwood. The setting behind the eagle appears to be a rocky or earthy environment 
with fallen leaves and scattered debris, suggesting a forest or woodland area.

The image captures the bald eagle in a natural habitat, showcasing its strength, beauty, and connection 
with the wilderness. The contrast between the eagle's stark colors and the earthy tones of the background 
creates a visually striking and captivating scene.