<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Python</title>
    <description>Dries Buytaert on Python.</description>
    <link>https://dri.es/tag/python</link>
    <atom:link href="https://dri.es/tag/python/rss.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Extract speaker notes from PowerPoint to text</title>
      <link>https://dri.es/extract-speaker-notes-from-powerpoint-to-text</link>
      <guid>https://dri.es/extract-speaker-notes-from-powerpoint-to-text</guid>
      <pubDate>Thu, 09 Oct 2025 11:41:43 -0400</pubDate>
      <description>&lt;p&gt;When working on presentations, I like to extract my speaker notes to review the flow and turn them into blog posts. I&#039;m doing this right now for my DrupalCon Vienna talk.&lt;/p&gt;
&lt;p&gt;I used to do this manually, but with presentations often having 100+ slides, it gets tedious and isn&#039;t very repeatable. So I ended up automating this with a Python script.&lt;/p&gt;
&lt;p&gt;Since I use Apple Keynote or Google Slides rather than Microsoft PowerPoint, I first export my presentations to PowerPoint format, then run my Python script.&lt;/p&gt;
&lt;p&gt;If you&#039;ve ever needed to pull speaker notes from a presentation for review, editing or blogging, here is my script and how to use it.&lt;/p&gt;
&lt;h3&gt;Speaker notes extractor script&lt;/h3&gt;
&lt;p&gt;Save this code as &lt;code&gt;powerpoint-to-text.py&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;#!/usr/bin/env python3
&amp;quot;&amp;quot;&amp;quot;Extract speaker notes from PowerPoint presentations to text files.&amp;quot;&amp;quot;&amp;quot;

import sys
from pathlib import Path
from pptx import Presentation

def extract_speaker_notes(pptx_path: Path) -&amp;gt; tuple[str, int]:
    presentation = Presentation(pptx_path)
    notes_text = []

    for i, slide in enumerate(presentation.slides, 1):
        if slide.has_notes_slide and slide.notes_slide.notes_text_frame:
            notes = slide.notes_slide.notes_text_frame.text.strip()
            if notes:
                notes_text.append(f&amp;quot;=== Slide {i} ===\n{notes}\n&amp;quot;)

    return &amp;quot;\n&amp;quot;.join(notes_text), len(notes_text)

def main():
    if len(sys.argv) != 2:
        print(&amp;quot;Usage: python powerpoint-to-text.py presentation.pptx&amp;quot;)
        sys.exit(1)

    input_path = Path(sys.argv[1])

    if not input_path.exists():
        print(f&amp;quot;Error: File &#039;{input_path}&#039; not found&amp;quot;)
        sys.exit(1)

    if input_path.suffix.lower() != &#039;.pptx&#039;:
        print(f&amp;quot;Warning: &#039;{input_path}&#039; may not be a PowerPoint file&amp;quot;)

    try:
        notes_text, notes_count = extract_speaker_notes(input_path)
    except Exception as e:
        print(f&amp;quot;Error reading presentation: {e}&amp;quot;)
        sys.exit(1)

    output_path = input_path.with_suffix(&#039;.txt&#039;)
    output_path.write_text(notes_text, encoding=&#039;utf-8&#039;)

    print(f&amp;quot;Extracted {notes_count} slides with notes to {output_path}&amp;quot;)

if __name__ == &amp;quot;__main__&amp;quot;:
    main()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script uses the &lt;code&gt;python-pptx&lt;/code&gt; library to read PowerPoint files. This library understands the internal structure of .pptx files (which are zip archives containing XML). It provides a clean Python interface to access slides and their speaker notes. The script loops through each slide, checks if it has notes, and writes them to a text file.&lt;/p&gt;
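&lt;p&gt;To see the zip-and-XML structure directly, here is a minimal sketch that uses only the Python standard library. It assumes the conventional layout in which PowerPoint stores notes under &lt;code&gt;ppt/notesSlides/&lt;/code&gt;. The helper name &lt;code&gt;list_notes_parts&lt;/code&gt; is made up for this illustration and is not part of &lt;code&gt;python-pptx&lt;/code&gt;.&lt;/p&gt;

```python
import zipfile


def list_notes_parts(pptx_file):
    """Return the notes-slide XML parts found inside a .pptx archive.

    Assumes the conventional part naming (ppt/notesSlides/notesSlideN.xml).
    """
    with zipfile.ZipFile(pptx_file) as archive:
        # Keep only the XML parts in the notes folder, sorted alphabetically.
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("ppt/notesSlides/") and name.endswith(".xml")
        )
```

&lt;p&gt;Calling it on an exported presentation lists the XML parts that hold the notes. &lt;code&gt;python-pptx&lt;/code&gt; parses those parts, so the script does not have to.&lt;/p&gt;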
&lt;h3&gt;Usage&lt;/h3&gt;
&lt;p&gt;I like to use &lt;a href=&quot;https://github.com/astral-sh/uv&quot;&gt;uv&lt;/a&gt; to run Python code. &lt;code&gt;uv&lt;/code&gt; is a fast, modern Python package manager that handles dependencies automatically:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;$ uv run --with python-pptx powerpoint-to-text.py your-presentation.pptx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This saves a &lt;code&gt;.txt&lt;/code&gt; file with your notes in the same directory as the input file, not the current directory or desktop.&lt;/p&gt;
&lt;p&gt;The text file contains:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;=== Slide 1 ===
Speaker notes from slide 1 ...

=== Slide 3 ===
Speaker notes from slide 3 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only slides with speaker notes are included.&lt;/p&gt;
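&lt;p&gt;If you want to process the notes further, the text file is easy to read back. The sketch below is one possible approach, not part of the script above: the hypothetical helper &lt;code&gt;parse_notes&lt;/code&gt; splits the file on the &lt;code&gt;=== Slide N ===&lt;/code&gt; headers and returns the notes keyed by slide number. It assumes no line in your notes looks like such a header.&lt;/p&gt;

```python
import re

# Matches the header lines written by the extractor, e.g. "=== Slide 3 ===".
HEADER = re.compile(r"^=== Slide (\d+) ===$")


def parse_notes(text):
    """Map slide numbers to their notes, based on the '=== Slide N ===' headers."""
    notes = {}
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = int(match.group(1))
            notes[current] = []
        elif current is not None:
            notes[current].append(line)
    # Join the collected lines and drop the blank separator lines at the edges.
    return {number: "\n".join(lines).strip() for number, lines in notes.items()}
```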
</description>
    </item>
    <item>
      <title>Comparing local LLMs for alt-text generation, round 2</title>
      <link>https://dri.es/comparing-local-llms-for-alt-text-generation-round-2</link>
      <guid>https://dri.es/comparing-local-llms-for-alt-text-generation-round-2</guid>
      <pubDate>Tue, 27 May 2025 14:04:35 -0400</pubDate>
      <description>&lt;p&gt;Four months ago, I &lt;a href=&quot;https://dri.es/comparing-local-llms-for-alt-text-generation&quot;&gt;tested 10 local vision LLMs&lt;/a&gt; and compared them against the top cloud models. &lt;em&gt;Vision models&lt;/em&gt; can analyze images and describe their content, making them useful for &lt;code&gt;alt&lt;/code&gt;-text generation.&lt;/p&gt;
&lt;p&gt;The result? The local models missed important details or introduced hallucinations. So &lt;a href=&quot;https://dri.es/automating-alt-text-generation-ai&quot;&gt;I switched to using cloud models&lt;/a&gt;, which produced better results but meant sacrificing privacy and offline capability.&lt;/p&gt;
&lt;p&gt;Two weeks ago, &lt;a href=&quot;https://ollama.com/&quot;&gt;Ollama&lt;/a&gt; released &lt;a href=&quot;https://github.com/ollama/ollama/releases/tag/v0.7.0&quot;&gt;version 0.7.0&lt;/a&gt; with improved support for vision models. They added support for three vision models I hadn&#039;t tested yet: &lt;a href=&quot;https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503&quot;&gt;Mistral 3.1&lt;/a&gt;, &lt;a href=&quot;https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct&quot;&gt;Qwen 2.5 VL&lt;/a&gt; and &lt;a href=&quot;https://huggingface.co/google/gemma-3-27b-it&quot;&gt;Gemma 3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I decided to evaluate these models to see whether they&#039;ve caught up to GPT-4o and Claude 3.5 Sonnet in quality. Can local models now generate accurate and reliable &lt;code&gt;alt&lt;/code&gt;-text?&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Provider&lt;/th&gt;
  &lt;th&gt;Release date&lt;/th&gt;
  &lt;th&gt;Model size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;
  &lt;a href=&quot;https://huggingface.co/google/gemma-3-27b-it&quot;&gt;Gemma 3 (27B)&lt;/a&gt;
&lt;/td&gt;
  &lt;td&gt;Google DeepMind&lt;/td&gt;
  &lt;td&gt;March 2025&lt;/td&gt;
  &lt;td&gt;27B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;
  &lt;a href=&quot;https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct&quot;&gt;Qwen 2.5 VL (32B)&lt;/a&gt;
&lt;/td&gt;
  &lt;td&gt;Alibaba&lt;/td&gt;
  &lt;td&gt;March 2025&lt;/td&gt;
  &lt;td&gt;32B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;
  &lt;a href=&quot;https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503&quot;&gt;Mistral 3.1 (24B)&lt;/a&gt;
&lt;/td&gt;
  &lt;td&gt;Mistral AI&lt;/td&gt;
  &lt;td&gt;March 2025&lt;/td&gt;
  &lt;td&gt;24B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Updating my &lt;code&gt;alt&lt;/code&gt;-text script&lt;/h3&gt;
&lt;p&gt;For my earlier experiments, I created &lt;a href=&quot;https://github.com/dbuytaert/image-caption&quot;&gt;an open-source script&lt;/a&gt; that generates &lt;code&gt;alt&lt;/code&gt;-text descriptions. The script is a Python wrapper around &lt;a href=&quot;https://github.com/simonw/llm&quot;&gt;Simon Willison&#039;s &lt;code&gt;llm&lt;/code&gt; tool&lt;/a&gt;, which provides a unified interface to LLMs. It supports models from Ollama, Hugging Face and various cloud providers.&lt;/p&gt;
&lt;p&gt;To test the new models, I added 3 new entries to my script&#039;s &lt;a href=&quot;https://github.com/dbuytaert/image-caption/blob/v2/models.yaml&quot;&gt;&lt;code&gt;models.yaml&lt;/code&gt;&lt;/a&gt;, which defines each model&#039;s prompt, temperature, and token settings. Once configured, generating &lt;code&gt;alt&lt;/code&gt;-text is simple. Here is an example using the three new vision models:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;$ ./caption.py test-images/image-1.jpg --model mistral-3.1-24b gemma3-27b qwen2.5vl-32b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which outputs something like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;{
  &amp;quot;image&amp;quot;: &amp;quot;test-images/image-1.jpg&amp;quot;,
  &amp;quot;captions&amp;quot;: {
    &amp;quot;mistral-3.1-24b&amp;quot;: &amp;quot;A bustling intersection at night filled with pedestrians crossing in all directions.&amp;quot;,
    &amp;quot;gemma3-27b&amp;quot;: &amp;quot;A high-angle view shows a crowded Tokyo street filled with pedestrians and brightly lit advertising billboards at night.&amp;quot;,
    &amp;quot;qwen2.5vl-32b&amp;quot;: &amp;quot;A bustling city intersection at night, crowded with people crossing the street, surrounded by tall buildings with bright, colorful billboards and advertisements.&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Evaluating the models&lt;/h3&gt;
&lt;p&gt;To keep the results consistent, I used the same test images and the same evaluation method as in &lt;a href=&quot;https://dri.es/comparing-local-llms-for-alt-text-generation#model-evaluation&quot;&gt;my earlier blog post&lt;/a&gt;. The detailed results are in this &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1LttMys6cEkse6LrRZoznvDdvBiz_WkVyauNBFV2jIXg/edit&quot;&gt;Google spreadsheet&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Each &lt;code&gt;alt&lt;/code&gt;-text was scored from 0 to 5 based on three criteria: how well it identified the most important elements in the image, how effectively it captured the mood or atmosphere, and whether it avoided repetition, grammar issues or hallucinated details. I then converted each score into a letter grade from A to F.&lt;/p&gt;
&lt;p&gt;For comparison, the cloud models received the following scores: GPT-4o earned an average of 4.8 out of 5 (grade A), and Claude 3.5 Sonnet received a perfect 5 out of 5 (grade A).&lt;/p&gt;
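&lt;p&gt;The conversion from a 0 to 5 score to a letter grade can be sketched as below. The cut-offs are hypothetical: the post does not state the real ones, so these were chosen only so that they agree with the averages reported in this post (for example, 4.8 and 5 map to A). The function name &lt;code&gt;letter_grade&lt;/code&gt; is likewise an assumption.&lt;/p&gt;

```python
def letter_grade(average_score):
    """Map a 0-5 average score to a letter grade using assumed cut-offs."""
    # Hypothetical thresholds; the real grading boundaries are not published.
    cutoffs = [(4.5, "A"), (3.5, "B"), (2.5, "C"), (1.5, "D")]
    for minimum, grade in cutoffs:
        if average_score >= minimum:
            return grade
    return "F"
```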
&lt;h4&gt;Test image 1: Shibuya Crossing, Tokyo&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/japan-2024/shibuya-crossing-1280w.jpg&quot; alt=&quot;Aerial view of an intersection, capturing the evening commute with pedestrians, traffic and electronic billboards.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;Mistral 3.1 (24B)&lt;/td&gt;
  &lt;td&gt;A bustling intersection at night filled with pedestrians crossing in all directions.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Gemma 3 (27B)&lt;/td&gt;
  &lt;td&gt;A high-angle view shows a crowded Tokyo &lt;mark&gt;street&lt;/mark&gt; filled with pedestrians and brightly lit advertising billboards.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Qwen 2.5 VL (32B)&lt;/td&gt;
  &lt;td&gt;A bustling city intersection at night, crowded with people crossing the street, surrounded by brightly lit billboards and tall buildings.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;None earned an A because they failed to identify this as Shibuya Crossing in Tokyo, which the cloud models did. Mistral also missed the billboards.&lt;/p&gt;
&lt;h4&gt;Test image 2: Isabella Stewart Gardner Museum, Boston&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/miscellaneous-2023/isabella-stewart-gardner-museum-1280w.jpg&quot; alt=&quot;An empty picture frame in a room with ornate wallpaper, candles, old paintings, and antique furniture.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;Mistral 3.1 (24B)&lt;/td&gt;
  &lt;td&gt;An ornate wall features a large empty picture frame.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Gemma 3 (27B)&lt;/td&gt;
  &lt;td&gt;An empty, ornate gold frame hangs on a patterned green wall between two framed portraits and a candle sconce.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Qwen 2.5 VL (32B)&lt;/td&gt;
  &lt;td&gt;A vintage-style room features ornate wallpaper, a framed empty canvas, a lit candelabra, and a decorative vase on a table, with portraits on either side.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The vision models in my previous post often mistook the empty frame for a framed painting. All three models in this test correctly identified it as empty. Gemma and Qwen captured valuable details about the scene, while Mistral&#039;s description felt sparse.&lt;/p&gt;
&lt;h4&gt;Test image 3: wakeboarding in Vermont, USA&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/vermont-2024/wakeboarding-1280w.jpg&quot; alt=&quot;Two men in swim shorts on the back of a boat watching another person wakeboarding behind the boat.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;Mistral 3.1 (24B)&lt;/td&gt;
  &lt;td&gt;Two shirtless men on a boat watch another person &lt;mark&gt;water skiing&lt;/mark&gt; on a lake.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Gemma 3 (27B)&lt;/td&gt;
  &lt;td&gt;Two people on a boat watch a &lt;mark&gt;waterskier&lt;/mark&gt; speeding across the lake on a sunny day.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Qwen 2.5 VL (32B)&lt;/td&gt;
  &lt;td&gt;Two shirtless men on a boat watch a person &lt;mark&gt;water skiing&lt;/mark&gt; in the distance on a calm lake.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;All three described a wakeboarding scene as &amp;quot;water skiing&amp;quot;, while the cloud models correctly identified it as wakeboarding.&lt;/p&gt;
&lt;h4&gt;Test image 4: hiking in the Dolomites, Italy&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/italy-2022/santa-maddalena-church-in-funes-2-1280w.jpg&quot; alt=&quot;Santa Maddalena church in Funes&quot; width=&quot;1280&quot; height=&quot;846&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;Mistral 3.1 (24B)&lt;/td&gt;
  &lt;td&gt;A wooden statue of a &lt;mark&gt;saint&lt;/mark&gt; is mounted on a post with directional signs pointing to various locations.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Gemma 3 (27B)&lt;/td&gt;
  &lt;td&gt;A small wooden shrine with a statue of Mary stands beside a signpost indicating hiking trails in a grassy field.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Qwen 2.5 VL (32B)&lt;/td&gt;
  &lt;td&gt;A wooden shrine with a statue of &lt;mark&gt;a figure&lt;/mark&gt; stands on a tree stump, surrounded by a scenic mountain landscape with directional signs in the foreground.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Only Gemma recognized the statue as Mary. Both Mistral and Gemma missed the mountains in the background, which seems important.&lt;/p&gt;
&lt;h4&gt;Test image 5: backgammon by candlelight&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/vermont-2023/backgammon-by-candlelight-1280w.jpg&quot; alt=&quot;A backgammon board on a wooden table, accompanied by candles that cast a warm glow.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;Mistral 3.1 (24B)&lt;/td&gt;
  &lt;td&gt;A lit candle and a glass of liquid are on a wooden table next to a wooden board game.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Gemma 3 (27B)&lt;/td&gt;
  &lt;td&gt;A lit candle and glass votive sit on a wooden table, creating a warm, inviting glow in a dimly lit space.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Qwen 2.5 VL (32B)&lt;/td&gt;
  &lt;td&gt;A cozy scene with a lit candle on a wooden table, next to a backgammon board and a glass of liquid, creating a warm and inviting atmosphere.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Neither Mistral nor Gemma recognized the backgammon board. Only Qwen identified it correctly. Mistral also failed to capture the photo&#039;s mood.&lt;/p&gt;
&lt;h3 id=&quot;model-accuracy&quot;&gt;Model accuracy&lt;/h3&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;table&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Repetitions&lt;/th&gt;
  &lt;th&gt;Hallucinations&lt;/th&gt;
  &lt;th&gt;Moods&lt;/th&gt;
  &lt;th&gt;Average score&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Mistral 3.1 (24B)&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffeb99&quot;&gt;Fair&lt;/td&gt;
  &lt;td&gt;3.4/5&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Gemma 3 (27B)&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;4.2/5&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Qwen 2.5 VL (32B)&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;4.4/5&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Qwen 2.5 VL performed best overall, with Gemma 3 not far behind.&lt;/p&gt;
&lt;p&gt;Needless to say, these results are based on a small set of test images. And while I used a structured scoring system, the evaluation still involves subjective judgment. This is not a definitive ranking, but it&#039;s enough to draw some conclusions.&lt;/p&gt;
&lt;p&gt;It was nice to see that all three LLMs avoided repetition and hallucinations, and generally captured the mood of the images.&lt;/p&gt;
&lt;p&gt;Local models still make mistakes. All three described wakeboarding as &amp;quot;water skiing&amp;quot;, and most failed to recognize the statue as Mary or to place the intersection in Japan. Cloud models get these details right, as I showed in &lt;a href=&quot;https://dri.es/comparing-local-llms-for-alt-text-generation&quot;&gt;my previous blog post&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;I ran my original experiment four months ago, and at the time, none of the models I tested felt accurate enough for large-scale &lt;code&gt;alt&lt;/code&gt;-text generation. Some, like Llama 3, showed promise but still fell short in overall quality.&lt;/p&gt;
&lt;p&gt;Newer models like Qwen 2.5 VL and Gemma 3 have matched the performance I saw earlier with Llama 3. Both performed well in my latest test. They produced relevant, grounded descriptions without hallucinations or repetition, which earlier local models often struggled with.&lt;/p&gt;
&lt;p&gt;Still, the quality is not yet at the level where I would trust these models to generate thousands of &lt;code&gt;alt&lt;/code&gt;-texts without human review. They make more mistakes than GPT-4o or Claude 3.5 Sonnet.&lt;/p&gt;
&lt;p&gt;My main question was: are local models now good enough for practical use? While Qwen 2.5 VL performed best overall, it still needs human review. I&#039;ve started using it for small batches where manual checking is manageable. For large-scale, fully automated use, I continue using cloud models as they remain the most reliable option.&lt;/p&gt;
&lt;p&gt;That said, local vision-language models continue to improve. My long-term goal is to return to a 100% local-first workflow that gives me more control and keeps my data private. While we&#039;re not there yet, these results show real progress.&lt;/p&gt;
&lt;p&gt;My plan is to wait for the next generation of local vision models (or upgrade my hardware to run larger models). When those become available, I&#039;ll test them and report back.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Automating alt-text generation with AI</title>
      <link>https://dri.es/automating-alt-text-generation-ai</link>
      <guid>https://dri.es/automating-alt-text-generation-ai</guid>
      <pubDate>Thu, 20 Feb 2025 06:22:29 -0500</pubDate>
      <description>&lt;p&gt;Billions of images on the web lack proper &lt;code&gt;alt&lt;/code&gt;-text, making them inaccessible to millions of users who rely on screen readers.&lt;/p&gt;
&lt;p&gt;My own website is no exception, so &lt;a href=&quot;https://dri.es/comparing-local-llms-for-alt-text-generation&quot;&gt;a few weeks ago&lt;/a&gt;, I set out to add missing &lt;code&gt;alt&lt;/code&gt;-text to about 9,000 images on this website.&lt;/p&gt;
&lt;p&gt;What seemed like a simple fix became a multi-step challenge. I needed to &lt;a href=&quot;https://dri.es/comparing-local-llms-for-alt-text-generation&quot;&gt;evaluate different AI models&lt;/a&gt; and &lt;a href=&quot;https://dri.es/i-want-to-run-ai-locally-here-is-why-i-am-not-yet&quot;&gt;decide between local or cloud processing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To make the web better, a lot of websites need to add &lt;code&gt;alt&lt;/code&gt;-text to their images. So I decided to document my progress here on &lt;a href=&quot;https://dri.es/&quot;&gt;my blog&lt;/a&gt; so others can learn from it – or offer suggestions. This third post dives into the technical details of how I built an automated pipeline to generate &lt;code&gt;alt&lt;/code&gt;-text at scale.&lt;/p&gt;
&lt;h3&gt;High-level architecture overview&lt;/h3&gt;
&lt;p&gt;My automation process follows three steps for each image:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check if &lt;code&gt;alt&lt;/code&gt;-text exists for a given image&lt;/li&gt;
&lt;li&gt;Generate new &lt;code&gt;alt&lt;/code&gt;-text using AI when missing&lt;/li&gt;
&lt;li&gt;Update the database record for the image with the new &lt;code&gt;alt&lt;/code&gt;-text&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The rest of this post goes into more detail on each of these steps. If you&#039;re interested in the implementation, you can find most of the &lt;a href=&quot;https://github.com/dbuytaert/image-caption&quot;&gt;source code on GitHub&lt;/a&gt;.&lt;/p&gt;
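&lt;p&gt;As a rough illustration of the three steps, and not the actual implementation, the flow for one image can be sketched as a single function. The three callables (&lt;code&gt;fetch_metadata&lt;/code&gt;, &lt;code&gt;generate_alt&lt;/code&gt; and &lt;code&gt;update_metadata&lt;/code&gt;) are hypothetical stand-ins for the API calls and the AI model described in the following sections.&lt;/p&gt;

```python
def ensure_alt_text(image_id, fetch_metadata, generate_alt, update_metadata):
    """Run the three-step flow for one image; return the new alt-text or None."""
    # Step 1: check whether alt-text already exists.
    metadata = fetch_metadata(image_id)
    if (metadata.get("alt") or "").strip():
        return None

    # Step 2: generate alt-text with an AI model (passed in by the caller).
    alt_text = generate_alt(image_id, metadata)

    # Step 3: store only the changed field.
    update_metadata(image_id, {"alt": alt_text})
    return alt_text
```

&lt;p&gt;Passing the three operations in as functions keeps the flow easy to test without touching a real website or model.&lt;/p&gt;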
&lt;h3&gt;Retrieving image metadata&lt;/h3&gt;
&lt;p&gt;To systematically process 9,000 images, I needed a structured way to identify which ones were missing &lt;code&gt;alt&lt;/code&gt;-text.&lt;/p&gt;
&lt;p&gt;Since my site runs on &lt;a href=&quot;https://www.drupal.org/&quot;&gt;Drupal&lt;/a&gt;, I built two REST API endpoints to interact with the image metadata:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GET /album/{album-name}/{image-name}/get&lt;/code&gt; – Retrieves metadata for an image, including title, &lt;code&gt;alt&lt;/code&gt;-text, and caption.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PATCH /album/{album-name}/{image-name}/patch&lt;/code&gt; – Updates specific fields, such as adding or modifying &lt;code&gt;alt&lt;/code&gt;-text.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&#039;ve built similar APIs before, including one for my &lt;a href=&quot;https://dri.es/building-my-own-temperature-and-humidity-monitor&quot;&gt;basement&#039;s temperature and humidity monitor&lt;/a&gt;. That post provides a more detailed breakdown of how I build endpoints like this.&lt;/p&gt;
&lt;p&gt;This API uses separate URL paths (&lt;code&gt;/get&lt;/code&gt; and &lt;code&gt;/patch&lt;/code&gt;) for different operations, rather than using a single resource URL. I&#039;d prefer to follow RESTful principles, but this approach avoids caching problems, including content negotiation issues in CDNs.&lt;/p&gt;
&lt;p&gt;Anyway, with the new endpoints in place, fetching metadata for an image is simple:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;curl -H &amp;quot;Authorization: test-token&amp;quot; \
  &amp;quot;https://dri.es/album/isle-of-skye-2024/journey-to-skye/get&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every request requires an authorization token. And no, &lt;code&gt;test-token&lt;/code&gt; isn&#039;t the real one. Without it, anyone could edit my images. While crowdsourced &lt;code&gt;alt&lt;/code&gt;-text might be an interesting experiment, it&#039;s not one I&#039;m looking to run today.&lt;/p&gt;
&lt;p&gt;This request returns a JSON object with image metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;{
  &amp;quot;title&amp;quot;: &amp;quot;Journey to Skye&amp;quot;,
  &amp;quot;alt&amp;quot;: &amp;quot;&amp;quot;,
  &amp;quot;caption&amp;quot;: &amp;quot;Each year, Klaas and I pick a new destination for our outdoor adventure. In 2024, we set off for the Isle of Skye in Scotland. This stop was near Glencoe, about halfway between Glasgow and Skye.&amp;quot;
}

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the &lt;code&gt;alt&lt;/code&gt; field is empty, the next step is to generate a description using AI.&lt;/p&gt;
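&lt;p&gt;The same check can be scripted in Python. This is a minimal sketch using only the standard library, assuming the URL pattern and the bare token in the &lt;code&gt;Authorization&lt;/code&gt; header shown in the &lt;code&gt;curl&lt;/code&gt; example. The helper names &lt;code&gt;build_get_request&lt;/code&gt; and &lt;code&gt;needs_alt_text&lt;/code&gt; are made up for this illustration.&lt;/p&gt;

```python
import json
import urllib.request

BASE_URL = "https://dri.es/album"


def build_get_request(album, image, token):
    """Build (but do not send) the GET request for one image's metadata."""
    url = f"{BASE_URL}/{album}/{image}/get"
    return urllib.request.Request(url, headers={"Authorization": token})


def needs_alt_text(response_body):
    """Return True when the JSON metadata has a missing or empty alt field."""
    metadata = json.loads(response_body)
    return not (metadata.get("alt") or "").strip()
```

&lt;p&gt;To send the request, pass the result of &lt;code&gt;build_get_request&lt;/code&gt; to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; and feed the response body to &lt;code&gt;needs_alt_text&lt;/code&gt;.&lt;/p&gt;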
&lt;h3&gt;Generating and refining &lt;code&gt;alt&lt;/code&gt;-text with AI&lt;/h3&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/isle-of-skye-2024/journey-to-skye-1280w.jpg&quot; alt=&quot;A person stands by a small lake surrounded by grassy hills and mountains under a cloudy sky in the Scottish Highlands.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;p&gt;In &lt;a href=&quot;https://dri.es/comparing-local-llms-for-alt-text-generation&quot;&gt;my first post on AI-generated &lt;code&gt;alt&lt;/code&gt;-text&lt;/a&gt;, I wrote a Python script to compare 10 different local &lt;a href=&quot;https://en.wikipedia.org/wiki/Large_language_model&quot;&gt;Large Language Models&lt;/a&gt; (LLMs). The script uses &lt;a href=&quot;https://pytorch.org/&quot;&gt;PyTorch&lt;/a&gt;, a widely used machine learning framework for AI research and deep learning. This implementation was a great learning experience.&lt;/p&gt;
&lt;p&gt;The original script takes an image as input and generates &lt;code&gt;alt&lt;/code&gt;-text using multiple LLMs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;./caption.py journey-to-skye.jpg
{
  &amp;quot;image&amp;quot;: &amp;quot;journey-to-skye.jpg&amp;quot;,
  &amp;quot;captions&amp;quot;: {
    &amp;quot;vit-gpt2&amp;quot;: &amp;quot;A man standing on top of a lush green field next to a body of water with a bird perched on top of it.&amp;quot;,
    &amp;quot;git&amp;quot;: &amp;quot;A man stands in a field next to a body of water with mountains in the background and a mountain in the background.&amp;quot;,
    &amp;quot;blip&amp;quot;: &amp;quot;This is an image of a person standing in the middle of a field next to a body of water with a mountain in the background.&amp;quot;,
    &amp;quot;blip2-opt&amp;quot;: &amp;quot;A man standing in the middle of a field with mountains in the background.&amp;quot;,
    &amp;quot;blip2-flan&amp;quot;: &amp;quot;A man is standing in the middle of a field with a river and mountains behind him on a cloudy day.&amp;quot;,
    &amp;quot;minicpm-v&amp;quot;: &amp;quot;A person standing alone amidst nature, with mountains and cloudy skies as backdrop.&amp;quot;,
    &amp;quot;llava-13b&amp;quot;: &amp;quot;A person standing alone in a misty, overgrown field with heather and trees, possibly during autumn or early spring due to the presence of red berries on the trees and the foggy atmosphere.&amp;quot;,
    &amp;quot;llava-34b&amp;quot;: &amp;quot;A person standing alone on a grassy hillside with a body of water and mountains in the background, under a cloudy sky.&amp;quot;,
    &amp;quot;llama32-vision-11b&amp;quot;: &amp;quot;A person standing in a field with mountains and water in the background, surrounded by overgrown grass and trees.&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My original plan was to run everything locally for full control, no subscription costs, and optimal privacy. But after testing 10 local LLMs, I changed my mind.&lt;/p&gt;
&lt;p&gt;I knew cloud-based models would be better, but wanted to see if local models were good enough for &lt;code&gt;alt&lt;/code&gt;-texts. Turns out, they&#039;re not quite there. You can read the &lt;a href=&quot;https://dri.es/comparing-local-llms-for-alt-text-generation&quot;&gt;full comparison&lt;/a&gt;, but I gave the best local models a B, while cloud models earned an A.&lt;/p&gt;
&lt;p&gt;While local processing aligned with my principles, it compromised the primary goal: creating the best possible descriptions for screen reader users. So I abandoned my local-only approach and decided to use cloud-based LLMs.&lt;/p&gt;
&lt;p&gt;To automate &lt;code&gt;alt&lt;/code&gt;-text generation for 9,000 images, I needed programmatic access to cloud models rather than relying on their browser-based interfaces – though &lt;a href=&quot;https://dri.es/i-gave-an-ai-agent-edit-access-to-my-website&quot;&gt;browser-based AI can be tons of fun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Instead of expanding my script with cloud LLM support, I switched to &lt;a href=&quot;https://simonwillison.net/&quot;&gt;Simon Willison&lt;/a&gt;&#039;s &lt;code&gt;llm&lt;/code&gt; tool: &lt;a href=&quot;https://llm.datasette.io/&quot;&gt;https://llm.datasette.io/&lt;/a&gt;. &lt;code&gt;llm&lt;/code&gt; is a command-line tool and Python library that supports both local and cloud-based models. It takes care of installation, dependencies, API key management, and uploading images. Basically, all the things I didn&#039;t want to spend time maintaining myself.&lt;/p&gt;
&lt;p&gt;Despite enjoying my PyTorch explorations with vision language models and multimodal encoders, I needed to focus on results. My weekly progress goal meant prioritizing working &lt;code&gt;alt&lt;/code&gt;-text over building homegrown inference pipelines.&lt;/p&gt;
&lt;p&gt;I also considered you, my readers. If this project inspires you to make your own website more accessible, you&#039;re better off with a script built on a well-maintained tool like &lt;code&gt;llm&lt;/code&gt; rather than trying to adapt my custom implementation.&lt;/p&gt;
&lt;p&gt;Scrapping my PyTorch implementation stung at first, but building on a more mature and active open-source project was far better for me and for you. So I rewrote my script, now in the &lt;a href=&quot;https://github.com/dbuytaert/image-caption&quot;&gt;v2 branch&lt;/a&gt;, with the original PyTorch version preserved in v1.&lt;/p&gt;
&lt;p&gt;The new version of my script keeps the same simple interface but now supports cloud models like ChatGPT and Claude:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;./caption.py journey-to-skye.jpg --model chatgpt-4o-latest claude-3-sonnet --context &amp;quot;Location: Glencoe, Scotland&amp;quot;
{
  &amp;quot;image&amp;quot;: &amp;quot;journey-to-skye.jpg&amp;quot;,
  &amp;quot;captions&amp;quot;: {
    &amp;quot;chatgpt-4o-latest&amp;quot;: &amp;quot;A person in a red jacket stands near a small body of water, looking at distant mountains in Glencoe, Scotland.&amp;quot;,
    &amp;quot;claude-3-sonnet&amp;quot;: &amp;quot;A person stands by a small lake surrounded by grassy hills and mountains under a cloudy sky in the Scottish Highlands.&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--context&lt;/code&gt; parameter improves &lt;code&gt;alt&lt;/code&gt;-text quality by adding details the LLM can&#039;t determine from the image alone. This might include GPS coordinates, album titles, or even &lt;a href=&quot;https://dri.es/van-life-on-the-isle-of-skye&quot;&gt;a blog post about the trip&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this example, I added &lt;code&gt;&amp;quot;Location: Glencoe, Scotland&amp;quot;&lt;/code&gt;. Notice how ChatGPT-4o mentions Glencoe directly while Claude-3 Sonnet references the Scottish Highlands. This contextual information makes descriptions more accurate and valuable for users. For maximum accuracy, use all available information!&lt;/p&gt;
&lt;h3&gt;Updating image metadata&lt;/h3&gt;
&lt;p&gt;With &lt;code&gt;alt&lt;/code&gt;-text generated, the final step is updating each image. The &lt;code&gt;PATCH&lt;/code&gt; endpoint accepts only the fields that need changing, preserving other metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;curl -X PATCH \
  -H &amp;quot;Authorization: test-token&amp;quot; \
  &amp;quot;https://dri.es/album/isle-of-skye-2024/journey-to-skye/patch&amp;quot; \
  -d &#039;{
    &amp;quot;alt&amp;quot;: &amp;quot;A person stands by a small lake surrounded by grassy hills and mountains under a cloudy sky in the Scottish Highlands.&amp;quot;
  }&#039;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&#039;s it. This completes the automation loop for one image. It checks if &lt;code&gt;alt&lt;/code&gt;-text is needed, creates a description using a cloud-based LLM, and updates the image if necessary. Now, I just need to do this about 9,000 times.&lt;/p&gt;
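&lt;p&gt;As a rough illustration, the loop for one image could be sketched like this in Python. The three callables (&lt;code&gt;fetch_alt&lt;/code&gt;, &lt;code&gt;generate_alt&lt;/code&gt; and &lt;code&gt;update_alt&lt;/code&gt;) are hypothetical placeholders for the API calls and the LLM, not the actual code behind my site. In the example they are replaced by an in-memory dictionary so the sketch runs offline.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from typing import Callable, Optional


def process_image(
    image_url: str,
    fetch_alt: Callable[[str], Optional[str]],
    generate_alt: Callable[[str], str],
    update_alt: Callable[[str, str], None],
):
    """Run the check, generate, update loop for one image.

    Returns True if a new alt-text was written, False if one already existed.
    """
    existing = fetch_alt(image_url)
    if existing and existing.strip():
        return False  # alt-text already present, nothing to do
    update_alt(image_url, generate_alt(image_url))
    return True


# Stand-ins for the real API calls and the LLM (assumptions for illustration).
store = {"a.jpg": "A red boat on a lake.", "b.jpg": ""}
results = {
    url: process_image(url, store.get, lambda u: "Generated for " + u, store.__setitem__)
    for url in list(store)
}
print(results)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Passing the three steps in as functions keeps the loop itself independent of any particular model or endpoint.&lt;/p&gt;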
&lt;h3&gt;Tracking AI-generated &lt;code&gt;alt&lt;/code&gt;-text&lt;/h3&gt;
&lt;p&gt;Before running the script on all 9,000 images, I added a label to the database that marks each &lt;code&gt;alt&lt;/code&gt;-text as either human-written or AI-generated. This makes it easy to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Re-run AI-generated descriptions without overwriting human-written ones&lt;/li&gt;
&lt;li&gt;Upgrade AI-generated &lt;code&gt;alt&lt;/code&gt;-text as better models become available&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With this approach I can update the AI-generated &lt;code&gt;alt&lt;/code&gt;-text when ChatGPT 5 is released. And eventually, it might allow me to return to my original principles: to use a high-quality local LLM trained on public domain data. In the meantime, it helps me make the web more accessible today while building toward a better long-term solution tomorrow.&lt;/p&gt;
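&lt;p&gt;To show why the label matters, here is a minimal sketch of a re-run that respects it. The &lt;code&gt;AltText&lt;/code&gt; record and the label values &lt;code&gt;human&lt;/code&gt; and &lt;code&gt;ai&lt;/code&gt; are assumptions made for this example; my database schema may look different.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dataclasses import dataclass


@dataclass
class AltText:
    image: str
    text: str
    source: str  # hypothetical label values: "human" or "ai"


def refresh_ai_alt_texts(records, generate):
    """Regenerate missing and AI-generated alt-texts, leaving human-written ones alone."""
    refreshed = []
    for record in records:
        if record.source == "human" and record.text.strip():
            continue  # never overwrite a human-written description
        record.text = generate(record.image)
        record.source = "ai"
        refreshed.append(record.image)
    return refreshed


records = [
    AltText("frame.jpg", "An empty gold frame on a wall.", "human"),
    AltText("lake.jpg", "A lake.", "ai"),
    AltText("boat.jpg", "", "ai"),
]
# The lambda stands in for a call to a newer, better model.
refreshed = refresh_ai_alt_texts(records, lambda image: "New description of " + image)
print(refreshed)
&lt;/code&gt;&lt;/pre&gt;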
&lt;h3&gt;Next steps&lt;/h3&gt;
&lt;p&gt;Now that the process is automated for a single image, the last step is to run the script on all 9,000. And honestly, it makes me nervous. The perfectionist in me wants to review every single AI-generated &lt;code&gt;alt&lt;/code&gt;-text, but that is just not feasible. So, I have to trust AI. I&#039;ll probably write one more post to share the results and what I learned from this final step.&lt;/p&gt;
&lt;p&gt;Stay tuned.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Comparing local large language models for alt-text generation</title>
      <link>https://dri.es/comparing-local-llms-for-alt-text-generation</link>
      <guid>https://dri.es/comparing-local-llms-for-alt-text-generation</guid>
      <pubDate>Mon, 03 Feb 2025 11:45:10 -0500</pubDate>
      <description>&lt;p&gt;I have &lt;a href=&quot;https://dri.es/photos&quot;&gt;10,000 photos&lt;/a&gt; on my website. About 9,000 have no &lt;code&gt;alt&lt;/code&gt;-text. I&#039;m not proud of that, and it has bothered me for a long time.&lt;/p&gt;
&lt;p&gt;When I started my blog nearly 20 years ago, I didn&#039;t think much about &lt;code&gt;alt&lt;/code&gt;-text. Over time, I realized its importance for visually impaired users who rely on screen readers.&lt;/p&gt;
&lt;p&gt;For the past 5+ years, I have diligently added &lt;code&gt;alt&lt;/code&gt;-text to every new image I uploaded. But that only covers about 1,000 images, leaving most older photos without descriptions.&lt;/p&gt;
&lt;p&gt;Writing 9,000 &lt;code&gt;alt&lt;/code&gt;-texts manually would take ages. Of course, AI could do this much faster, but is it good enough?&lt;/p&gt;
&lt;p&gt;To see what AI can do, I tested 12 &lt;em&gt;Large Language Models&lt;/em&gt; (LLMs): 10 running locally and 2 in the cloud. My goal was to determine whether they can generate accurate &lt;code&gt;alt&lt;/code&gt;-text.&lt;/p&gt;
&lt;p&gt;The TL;DR is that, not surprisingly, cloud models (GPT-4o, Claude 3.5 Sonnet) set the benchmark with A-grade performance, though they are not 100% perfect. I prefer local models for privacy, cost, and offline use. Among local options, the Llama variants and MiniCPM-V perform best. Both earned a B grade: they work reliably but sometimes miss important details.&lt;/p&gt;
&lt;p&gt;I know I&#039;m not the only one. Plenty of people – entire organizations even – have massive backlogs of images without &lt;code&gt;alt&lt;/code&gt;-text. I&#039;m determined to fix that for my blog and share what I learn along the way. This blog post is just step one – &lt;a href=&quot;https://buttondown.com/dries-buytaert-blog&quot;&gt;subscribe by email&lt;/a&gt; or &lt;a href=&quot;https://dri.es/rss.xml&quot;&gt;RSS&lt;/a&gt; to get future posts.&lt;/p&gt;
&lt;h3&gt;Models evaluated&lt;/h3&gt;
&lt;p&gt;I tested &lt;code&gt;alt&lt;/code&gt;-text generation using 12 AI models: 9 on my MacBook Pro with 32GB RAM, 1 on a higher-RAM machine (thanks to Jeremy Andrews, a friend and long-time Drupal contributor), and 2 cloud-based services.&lt;/p&gt;
&lt;p&gt;The table below lists the models I tested, with details like links to research papers, release dates, parameter sizes (in billions), memory requirements, some architectural details and more:&lt;/p&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
   &lt;th&gt;&lt;/th&gt;
   &lt;th&gt;Model&lt;/th&gt;
   &lt;th&gt;Launch date&lt;/th&gt;
   &lt;th&gt;Type&lt;/th&gt;
   &lt;th&gt;Vision encoder&lt;/th&gt;
   &lt;th&gt;Language encoder&lt;/th&gt;
   &lt;th&gt;Model size (billions of parameters)&lt;/th&gt;
   &lt;th&gt;RAM&lt;/th&gt;
   &lt;th&gt;Deployment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
   &lt;td&gt;1&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://huggingface.co/nlpconnect/vit-gpt2-image-captioning&quot;&gt;VIT-GPT2&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2021&lt;/td&gt;
   &lt;td&gt;Image-to-text&lt;/td&gt;
   &lt;td&gt;ViT (Vision Transformer)&lt;/td&gt;
   &lt;td&gt;GPT-2&lt;/td&gt;
   &lt;td&gt;0.4B&lt;/td&gt;
   &lt;td&gt;~8GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;2&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://huggingface.co/microsoft/git-base&quot;&gt;Microsoft GIT&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2022&lt;/td&gt;
   &lt;td&gt;Image-to-text&lt;/td&gt;
   &lt;td&gt;Swin Transformer&lt;/td&gt;
   &lt;td&gt;Transformer Decoder&lt;/td&gt;
   &lt;td&gt;1.2B&lt;/td&gt;
   &lt;td&gt;~8GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;3&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://huggingface.co/Salesforce/blip-image-captioning-large&quot;&gt;BLIP Large&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2022&lt;/td&gt;
   &lt;td&gt;Image-to-text&lt;/td&gt;
   &lt;td&gt;ViT&lt;/td&gt;
   &lt;td&gt;BERT&lt;/td&gt;
   &lt;td&gt;0.5B&lt;/td&gt;
   &lt;td&gt;~8GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;4&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://huggingface.co/Salesforce/blip2-opt-2.7b&quot;&gt;BLIP-2 OPT&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2023&lt;/td&gt;
   &lt;td&gt;Image-to-text&lt;/td&gt;
   &lt;td&gt;CLIP ViT&lt;/td&gt;
   &lt;td&gt;OPT&lt;/td&gt;
   &lt;td&gt;2.7B&lt;/td&gt;
   &lt;td&gt;~8GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;5&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://huggingface.co/Salesforce/blip2-flan-t5-xl&quot;&gt;BLIP-2 FLAN-T5&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2023&lt;/td&gt;
   &lt;td&gt;Image-to-text&lt;/td&gt;
   &lt;td&gt;CLIP ViT&lt;/td&gt;
   &lt;td&gt;FLAN-T5 XL&lt;/td&gt;
   &lt;td&gt;3B&lt;/td&gt;
   &lt;td&gt;~8GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;6&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://ollama.com/library/minicpm-v&quot;&gt;MiniCPM-V&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2024&lt;/td&gt;
   &lt;td&gt;Multi-modal&lt;/td&gt;
   &lt;td&gt;SigLip-400M&lt;/td&gt;
   &lt;td&gt;Qwen2-7B&lt;/td&gt;
   &lt;td&gt;8B&lt;/td&gt;
   &lt;td&gt;~16GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;7&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://ollama.com/library/llava&quot;&gt;LLaVA 13B&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2024&lt;/td&gt;
   &lt;td&gt;Multi-modal&lt;/td&gt;
   &lt;td&gt;CLIP ViT&lt;/td&gt;
   &lt;td&gt;Vicuna 13B&lt;/td&gt;
   &lt;td&gt;13B&lt;/td&gt;
   &lt;td&gt;~16GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;8&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://ollama.com/library/llava&quot;&gt;LLaVA 34B&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2024&lt;/td&gt;
   &lt;td&gt;Multi-modal&lt;/td&gt;
   &lt;td&gt;CLIP ViT&lt;/td&gt;
   &lt;td&gt;Vicuna 34B&lt;/td&gt;
   &lt;td&gt;34B&lt;/td&gt;
   &lt;td&gt;~32GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;9&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://ollama.com/library/llama3.2-vision&quot;&gt;Llama 3.2 Vision 11B&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2024&lt;/td&gt;
   &lt;td&gt;Multi-modal&lt;/td&gt;
   &lt;td&gt;Custom Vision Encoder&lt;/td&gt;
   &lt;td&gt;Llama 3.2&lt;/td&gt;
   &lt;td&gt;11B&lt;/td&gt;
   &lt;td&gt;~20GB&lt;/td&gt;
   &lt;td&gt;Local, Dries&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;10&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://ollama.com/library/llama3.2-vision&quot;&gt;Llama 3.2 Vision 90B&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2024&lt;/td&gt;
   &lt;td&gt;Multi-modal&lt;/td&gt;
   &lt;td&gt;Custom Vision Encoder&lt;/td&gt;
   &lt;td&gt;Llama 3.2&lt;/td&gt;
   &lt;td&gt;90B&lt;/td&gt;
   &lt;td&gt;~128GB&lt;/td&gt;
   &lt;td&gt;Local, Jeremy&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;11&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://chat.openai.com&quot;&gt;OpenAI GPT-4o&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2024&lt;/td&gt;
   &lt;td&gt;Multi-modal&lt;/td&gt;
   &lt;td&gt;Custom Vision Encoder&lt;/td&gt;
   &lt;td&gt;GPT-4&lt;/td&gt;
   &lt;td&gt;&amp;gt;150B&lt;/td&gt;
   &lt;td&gt;
 &lt;/td&gt;
   &lt;td&gt;Cloud&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;12&lt;/td&gt;
   &lt;td&gt;
    &lt;a href=&quot;https://claude.ai&quot;&gt;Anthropic Claude 3.5 Sonnet&lt;/a&gt;
 &lt;/td&gt;
   &lt;td&gt;2024&lt;/td&gt;
   &lt;td&gt;Multi-modal&lt;/td&gt;
   &lt;td&gt;Custom Vision Encoder&lt;/td&gt;
   &lt;td&gt;Claude 3.5&lt;/td&gt;
   &lt;td&gt;&amp;gt;150B&lt;/td&gt;
   &lt;td&gt;
 &lt;/td&gt;
   &lt;td&gt;Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;h3&gt;How image-to-text models work (in less than 30 seconds)&lt;/h3&gt;
&lt;p&gt;LLMs come in many forms, but for this project, I focused on &lt;em&gt;image-to-text&lt;/em&gt; and &lt;em&gt;multi-modal&lt;/em&gt; models. Both types of models can analyze images and generate text, either by describing images or answering questions about them.&lt;/p&gt;
&lt;p&gt;Image-to-text models follow a two-step process: &lt;strong&gt;vision encoding&lt;/strong&gt; and &lt;strong&gt;language decoding&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Vision encoding&lt;/strong&gt;: First, the model breaks an image down into &lt;em&gt;patches&lt;/em&gt;. You can think of these as &amp;quot;puzzle pieces&amp;quot;. The patches are converted into mathematical representations called &lt;em&gt;embeddings&lt;/em&gt;, which summarize their visual details. Next, an &lt;a href=&quot;https://en.wikipedia.org/wiki/Attention_(machine_learning)&quot;&gt;attention mechanism&lt;/a&gt; singles out the most important patches (e.g. the puzzle pieces with the cat&#039;s outline or fur texture) and gives less weight to less relevant details (e.g. puzzle pieces with plain blue skies).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language decoding&lt;/strong&gt;: Once the model has summarized the most important visual features, it uses a &lt;em&gt;language model&lt;/em&gt; to translate those features into words. This step is where the actual text (image captions or Q&amp;amp;A answers) is generated.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In short, the vision encoder &lt;em&gt;sees&lt;/em&gt; the image, while the language encoder &lt;em&gt;describes&lt;/em&gt; it.&lt;/p&gt;
&lt;p&gt;If you look at the table above, you&#039;ll see that each row pairs a &lt;em&gt;vision encoder&lt;/em&gt; (e.g., ViT, CLIP, Swin) with a &lt;em&gt;language encoder&lt;/em&gt; (e.g., GPT-2, BERT, T5, Llama).&lt;/p&gt;
&lt;p&gt;For a more in-depth explanation, I recommend &lt;a href=&quot;https://sebastianraschka.com/&quot;&gt;Sebastian Raschka&lt;/a&gt;&#039;s article &lt;a href=&quot;https://sebastianraschka.com/blog/2024/understanding-multimodal-llms.html&quot;&gt;Understanding Multi-modal LLMs&lt;/a&gt;, which also covers how image encoders work. It&#039;s fantastic!&lt;/p&gt;
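&lt;p&gt;To make the patch step a bit more concrete, here is a toy sketch in plain Python. It only shows how a grid of pixel values can be cut into flattened, non-overlapping patches. Real vision encoders work on tensors and follow this step with learned embeddings and attention, none of which is shown here. The function name is made up for this example.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch row by row and flatten it into one list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 "image" whose pixel values are 0..15, cut into four 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)
print(patches)
&lt;/code&gt;&lt;/pre&gt;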
&lt;h3&gt;Comparing different AI models&lt;/h3&gt;
&lt;p&gt;I wrote a Python script that generates &lt;code&gt;alt&lt;/code&gt;-texts for images using nine different local models. You can find it in my &lt;a href=&quot;https://github.com/dbuytaert/image-caption&quot;&gt;GitHub repository&lt;/a&gt;. It takes care of installing models, running them, and generating &lt;code&gt;alt&lt;/code&gt;-texts. It supports both &lt;a href=&quot;https://huggingface.co/&quot;&gt;Hugging Face&lt;/a&gt; and &lt;a href=&quot;https://ollama.ai/&quot;&gt;Ollama&lt;/a&gt; and is built to be easily extended as new models come out.&lt;/p&gt;
&lt;p&gt;You can run the script as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;$ ./caption.py ./test-images/image-1.jpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first time you run the script, it will download all models, which requires significant disk space and bandwidth – expect to download over 50GB of model data.&lt;/p&gt;
&lt;p&gt;The script outputs a JSON response, making it easy to integrate or analyze programmatically. Here is an example output:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &quot;image&quot;: &quot;test-images/image-1.jpg&quot;,
  &quot;alt-texts&quot;: {
    &quot;vit-gpt2&quot;: &quot;A city at night with skyscrapers and a traffic light on the side of the street in front of a tall building.&quot;,
    &quot;git&quot;: &quot;A busy city street is lit up at night, with the word qroi on the right side of the sign.&quot;,
    &quot;blip&quot;: &quot;This is an aerial view of a busy city street at night with lots of people walking and cars on the side of the road.&quot;,
    &quot;blip2-opt&quot;: &quot;An aerial view of a busy city street at night.&quot;,
    &quot;blip2-flan&quot;: &quot;An aerial view of a busy street in tokyo, japanese city at night with large billboards.&quot;,
    &quot;minicpm-v&quot;: &quot;A bustling cityscape at night with illuminated billboards and advertisements, including one for Michael Kors.&quot;,
    &quot;llava-13b&quot;: &quot;A bustling nighttime scene from Tokyo&#039;s famous Shibuya Crossing, characterized by its bright lights and dense crowds of people moving through the intersection.&quot;,
    &quot;llava-34b&quot;: &quot;A bustling city street at night, filled with illuminated buildings and numerous pedestrians.&quot;,
    &quot;llama32-vision-11b&quot;: &quot;A bustling city street at night, with towering skyscrapers and neon lights illuminating the scene.&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
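&lt;p&gt;Because the output is JSON, comparing models takes only a few lines of Python. The sketch below assumes the top-level key is &lt;code&gt;alt-texts&lt;/code&gt;, as in the sample output, and uses a shortened copy of that sample. The &lt;code&gt;caption_word_counts&lt;/code&gt; helper is made up for this example and is not part of the script.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json

# Shortened copy of the sample output; the "alt-texts" key is an assumption.
sample_output = '''
{
  "image": "test-images/image-1.jpg",
  "alt-texts": {
    "blip2-opt": "An aerial view of a busy city street at night.",
    "llava-34b": "A bustling city street at night, filled with illuminated buildings and numerous pedestrians."
  }
}
'''


def caption_word_counts(raw_json):
    """Map each model name to the number of words in its description."""
    data = json.loads(raw_json)
    return {model: len(text.split()) for model, text in data["alt-texts"].items()}


counts = caption_word_counts(sample_output)
print(counts)
&lt;/code&gt;&lt;/pre&gt;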
&lt;h3&gt;Test images&lt;/h3&gt;
&lt;p&gt;With the script ready, I decided to test it on some of &lt;a href=&quot;https://dri.es/photos&quot;&gt;my 10,000 photos&lt;/a&gt;. Not all of them at once. I picked five that I consider non-standard. Instead of simple portraits or landscapes, I picked photos with elements that might confuse or challenge the models.&lt;/p&gt;
&lt;p&gt;One photo is from the &lt;a href=&quot;https://en.wikipedia.org/wiki/Isabella_Stewart_Gardner_Museum_theft&quot;&gt;Isabella Stewart Gardner Museum&lt;/a&gt; in Boston and features an empty gold frame. The frame once held a masterpiece stolen in the infamous 1990 heist, one of the biggest art thefts in history. I wanted to see if the models would recognize it as empty or mistake it for a framed painting.&lt;/p&gt;
&lt;p&gt;Another photo, taken last summer in Vermont, shows a wakeboarder. Though he is the main subject, he is relatively small in the frame. I was curious to see if the models could still recognize him as the focal point.&lt;/p&gt;
&lt;p&gt;In another photo, a backgammon game is set in a dark but cozy atmosphere. I was curious to see if the models could recognize partially visible objects and capture the mood of the scene.&lt;/p&gt;
&lt;p&gt;To ensure a fair test, I stripped all &lt;a href=&quot;https://en.wikipedia.org/wiki/Exif&quot;&gt;EXIF metadata&lt;/a&gt; from the images. This includes any embedded captions, GPS coordinates, or other details that could inadvertently help the models.&lt;/p&gt;
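&lt;p&gt;For readers who want to do the same, here is one possible way to drop EXIF data from a JPEG using only the Python standard library. This is a simplified sketch, not the tool I used: it removes APP1 segments, where EXIF (and XMP) data lives, and copies everything else unchanged. Captions stored as IPTC data in APP13 segments would need the same treatment, and a dedicated metadata tool is the safer choice for real photos.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def strip_app1_segments(jpeg_bytes):
    """Return a copy of a JPEG byte stream without its APP1 (EXIF/XMP) segments."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    out = bytearray(jpeg_bytes[:2])  # keep the start-of-image marker
    pos = 2
    while len(jpeg_bytes) - pos >= 4:
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError("expected a segment marker")
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            break
        # The two length bytes count themselves but not the marker.
        length = int.from_bytes(jpeg_bytes[pos + 2:pos + 4], "big")
        if marker != 0xE1:  # copy every segment except APP1
            out += jpeg_bytes[pos:pos + 2 + length]
        pos += 2 + length
    out += jpeg_bytes[pos:]  # copy the scan data and end-of-image marker
    return bytes(out)


# Tiny synthetic stream: SOI, APP0 (JFIF), APP1 (Exif), then scan data and EOI.
soi = b"\xff\xd8"
app0 = b"\xff\xe0" + (7).to_bytes(2, "big") + b"JFIF\x00"
app1 = b"\xff\xe1" + (8).to_bytes(2, "big") + b"Exif\x00\x00"
scan = b"\xff\xda\x00\x02\x12\x34\xff\xd9"
cleaned = strip_app1_segments(soi + app0 + app1 + scan)
print(b"Exif" in cleaned)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a real photo, you would read the bytes from disk with &lt;code&gt;Path.read_bytes()&lt;/code&gt; and write the cleaned result to a new file.&lt;/p&gt;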
&lt;p&gt;Yes, I &lt;em&gt;know&lt;/em&gt; that a test set of five images is small, but it&#039;s sufficient to identify the top models for further evaluation. With 12 models generating &lt;code&gt;alt&lt;/code&gt;-texts for each photo, I had to &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1LttMys6cEkse6LrRZoznvDdvBiz_WkVyauNBFV2jIXg/edit&quot;&gt;manually evaluate 60 &lt;code&gt;alt&lt;/code&gt;-texts&lt;/a&gt;, which was tedious to do fairly and accurately. For now, these five images are enough to filter out weaker models and pinpoint the best 2-3 LLMs for more detailed testing.&lt;/p&gt;
&lt;p&gt;Next, I&#039;ll share the detailed results for each of the five photos. It&#039;s a lot to go through, so feel free to skip ahead to the &lt;a href=&quot;#model-accuracy&quot;&gt;summary&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Test image 1: Shibuya Crossing, Tokyo&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/japan-2024/shibuya-crossing-1280w.jpg&quot; alt=&quot;Aerial view of an intersection, capturing the evening commute with pedestrians, traffic and electronic billboards.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;p&gt;Benchmark LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;GPT-4o&lt;/td&gt;
  &lt;td&gt;A vibrant city intersection with crowds crossing, surrounded by illuminated billboards and towering buildings, likely Shibuya, Tokyo.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Claude&lt;/td&gt;
  &lt;td&gt;Hundreds of people cross Tokyo&#039;s famous Shibuya intersection as neon signs light up the surrounding skyscrapers.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;VIT-GPT2&lt;/td&gt;
  &lt;td&gt;A city at night with skyscrapers and a &lt;mark&gt;traffic light&lt;/mark&gt; on the side of the street in front of a tall &lt;mark&gt;building&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;GIT&lt;/td&gt;
  &lt;td&gt;A busy city street is lit up at night, with the word &lt;mark&gt;qroi&lt;/mark&gt; on the right side of the sign.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP&lt;/td&gt;
  &lt;td&gt;This is an aerial view of a busy city street at night with lots of people walking and cars on the side of the road.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP-2 with OPT&lt;/td&gt;
  &lt;td&gt;An aerial view of a busy city street at night.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP-2 with FLAN&lt;/td&gt;
  &lt;td&gt;An aerial view of a busy street in &lt;mark&gt;Tokyo, Japanese city&lt;/mark&gt; at night with large billboards.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA (13B)&lt;/td&gt;
  &lt;td&gt;A bustling nighttime scene of Tokyo&#039;s Shibuya Crossing, with neon lights and people walking across the intersection.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA (34B)&lt;/td&gt;
  &lt;td&gt;A bustling city street at night, with illuminated buildings and numerous pedestrians.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;MiniCPM-V&lt;/td&gt;
  &lt;td&gt;A bustling cityscape at night with various advertisements, including one for Michael Kors and another for Yakult.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (11B)&lt;/td&gt;
  &lt;td&gt;A bustling city street at night, with numerous billboards and advertisements lining the buildings.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (90B)&lt;/td&gt;
  &lt;td&gt;A bustling city street with many people and buildings.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Test image 2: Isabella Stewart Gardner Museum, Boston&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/miscellaneous-2023/isabella-stewart-gardner-museum-1280w.jpg&quot; alt=&quot;An empty picture frame in a room with ornate wallpaper, candles, old paintings, and antique furniture.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;p&gt;Benchmark LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;GPT-4o&lt;/td&gt;
  &lt;td&gt;
   &lt;mark&gt;Elegant&lt;/mark&gt; wallpaper, an empty gold frame, candelabra, and framed painting in a classic room.&lt;/td&gt;
   &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Claude&lt;/td&gt;
   &lt;td&gt;Ornate Victorian room with damask wallpaper, candelabra, and gold frames on wall.&lt;/td&gt;
   &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;VIT-GPT2&lt;/td&gt;
  &lt;td&gt;A living room with a &lt;mark&gt;mirror&lt;/mark&gt;, candles, and a vase of &lt;mark&gt;flowers&lt;/mark&gt; on a table in front of a &lt;mark&gt;mirror&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;GIT&lt;/td&gt;
  &lt;td&gt;A picture frame is hanging on a wall next to a vase &lt;mark&gt;and a vase&lt;/mark&gt; with &lt;mark&gt;the word tulips on it&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP&lt;/td&gt;
  &lt;td&gt;A room with a painting on the wall and &lt;mark&gt;two vases&lt;/mark&gt; on the table in front of it.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP-2 with OPT&lt;/td&gt;
  &lt;td&gt;A room with a painting on the wall, a picture frame, and a chandelier.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP-2 with FLAN&lt;/td&gt;
  &lt;td&gt;A room with a &lt;mark&gt;painting&lt;/mark&gt; on the wall and a couple of framed pictures hanging on the wall next to it.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA (13B)&lt;/td&gt;
  &lt;td&gt;An &lt;mark&gt;empty&lt;/mark&gt; wall with ornate decorations, including candlesticks and framed artwork, suggesting a formal or historical setting.
&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA (34B)&lt;/td&gt;
  &lt;td&gt;An ornate room with framed pictures on the wall, a chandelier &lt;mark&gt;hanging from the ceiling&lt;/mark&gt;, and a &lt;mark&gt;fireplace mantel&lt;/mark&gt; adorned with decorative items.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;MiniCPM-V&lt;/td&gt;
  &lt;td&gt;A room with ornate wallpaper, candlesticks and framed portraits of historical figures is displayed.&lt;/td&gt;
  &lt;td&gt;
&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (11B)&lt;/td&gt;
  &lt;td&gt;An empty gold frame on a wall with ornate wallpaper, surrounded by other decorative items.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (90B)&lt;/td&gt;
  &lt;td&gt;An ornate room with a gold frame, a painting of a man, and a vase on a &lt;mark&gt;table&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Test image 3: wakeboarding in Vermont, USA&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/vermont-2024/wakeboarding-1280w.jpg&quot; alt=&quot;Two men in swim shorts on the back of a boat watching another person wakeboarding behind the boat.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;p&gt;Benchmark LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;GPT-4o&lt;/td&gt;
  &lt;td&gt;Two shirtless boys on a boat watch a person wakeboarding in the distance on a cloudy day.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Claude&lt;/td&gt;
  &lt;td&gt;People watching from a boat as a person is wakeboarding on a lake with a wooded shoreline.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;VIT-GPT2&lt;/td&gt;
  &lt;td&gt;A man standing on top of a boat next to another man &lt;mark&gt;holding a surfboard in one hand&lt;/mark&gt; and &lt;mark&gt;a surfboard in the other&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;GIT&lt;/td&gt;
  &lt;td&gt;Two men are in a boat, one of them is wearing an orange hat &lt;mark&gt;and the other is wearing an orange hat&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP&lt;/td&gt;
  &lt;td&gt;There are two men riding on the back of a boat in the water, &lt;mark&gt;one of them is on a surfboard and the other is on a board&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP-2 with OPT&lt;/td&gt;
  &lt;td&gt;
   &lt;mark&gt;Three young men&lt;/mark&gt; sitting on the back of a boat.&lt;/td&gt;
   &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;BLIP-2 with FLAN&lt;/td&gt;
   &lt;td&gt;
    &lt;mark&gt;A group of people&lt;/mark&gt; sitting on a boat watching a man ride a &lt;mark&gt;surfboard&lt;/mark&gt; in the middle of the water.&lt;/td&gt;
    &lt;td&gt;C&lt;/td&gt;
 &lt;/tr&gt;
   &lt;tr&gt;
    &lt;td&gt;LLaVA (13B)&lt;/td&gt;
    &lt;td&gt;Two shirtless men on a boat, watching another man &lt;mark&gt;surfing&lt;/mark&gt; in the water.&lt;/td&gt;
    &lt;td&gt;B&lt;/td&gt;
 &lt;/tr&gt;
   &lt;tr&gt;
    &lt;td&gt;LLaVA (34B)&lt;/td&gt;
    &lt;td&gt;Two shirtless men on a boat, watching another person &lt;mark&gt;water skiing&lt;/mark&gt;.&lt;/td&gt;
    &lt;td&gt;B&lt;/td&gt;
 &lt;/tr&gt;
   &lt;tr&gt;
    &lt;td&gt;MiniCPM-V&lt;/td&gt;
    &lt;td&gt;Two shirtless men on a boat watch as another person &lt;mark&gt;water skis away from them&lt;/mark&gt;, with trees and sky visible across the lake behind him.&lt;/td&gt;
    &lt;td&gt;B&lt;/td&gt;
 &lt;/tr&gt;
   &lt;tr&gt;
    &lt;td&gt;Llama 3.2 Vision (11B)&lt;/td&gt;
    &lt;td&gt;Two shirtless men on a boat watching &lt;mark&gt;people&lt;/mark&gt; wakeboarding or &lt;mark&gt;surfing&lt;/mark&gt; behind it.&lt;/td&gt;
    &lt;td&gt;B&lt;/td&gt;
 &lt;/tr&gt;
   &lt;tr&gt;
    &lt;td&gt;Llama 3.2 Vision (90B)&lt;/td&gt;
    &lt;td&gt;Two men on a boat watching another man wakeboarding.&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Test image 4: hiking in the Dolomites, Italy&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/italy-2022/santa-maddalena-church-in-funes-2-1280w.jpg&quot; alt=&quot;Santa Maddalena church in Funes&quot; width=&quot;1280&quot; height=&quot;846&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;p&gt;Benchmark LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;GPT-4o&lt;/td&gt;
  &lt;td&gt;A wooden shrine with a statue of Mary and directional signs, set against mountains and green hills.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Claude&lt;/td&gt;
  &lt;td&gt;Wooden shrine of St. Magdalena with trail signs in the Dolomites mountains of Italy.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;VIT-GPT2&lt;/td&gt;
  &lt;td&gt;A &lt;mark&gt;bird&lt;/mark&gt; perched on top of a wooden post in front of a mountain range with a view of &lt;mark&gt;a mountain range and mountains&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;GIT&lt;/td&gt;
  &lt;td&gt;A &lt;mark&gt;wooden sign&lt;/mark&gt; in front of a mountain with a statue on it that says &lt;mark&gt;&quot;Meise museum&quot;&lt;/mark&gt; on the top.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP&lt;/td&gt;
  &lt;td&gt;There is &lt;mark&gt;a birdhouse with a statue&lt;/mark&gt; on top of it in the middle of a field with trees and mountains in the background.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP-2 with OPT&lt;/td&gt;
  &lt;td&gt;A statue of Mary on a wooden post in the middle of a field with mountains in the background.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP-2 with FLAN&lt;/td&gt;
  &lt;td&gt;A statue of the Virgin Mary sits on top of a wooden post &lt;mark&gt;in the middle of a mountain&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA (13B)&lt;/td&gt;
  &lt;td&gt;A picturesque scene of a small wooden chapel with a statue of &lt;mark&gt;Jesus&lt;/mark&gt;, set against a backdrop of majestic mountains and lush greenery.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA (34B)&lt;/td&gt;
  &lt;td&gt;A wooden shrine with religious iconography, surrounded by alpine scenery and signposts.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;MiniCPM-V&lt;/td&gt;
  &lt;td&gt;A wooden shrine with a statue of Mary and several directional signs pointing to various locations such as Kirchsteig, Magdalena, St.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (11B)&lt;/td&gt;
  &lt;td&gt;A wooden shrine with a statue of Mary and a signpost in front of mountains.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (90B)&lt;/td&gt;
  &lt;td&gt;A statue of Mary in a wooden shrine with a signpost pointing to various locations, including Rundweg St.&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Test image 5: backgammon by candlelight&lt;/h4&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;figure&gt;&lt;img src=&quot;https://dri.es/files/cache/vermont-2023/backgammon-by-candlelight-1280w.jpg&quot; alt=&quot;A backgammon board on a wooden table, accompanied by candles that cast a warm glow.&quot; width=&quot;1280&quot; height=&quot;850&quot; /&gt;
&lt;/figure&gt;

&lt;/div&gt;
&lt;p&gt;Benchmark LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;GPT-4o&lt;/td&gt;
  &lt;td&gt;A cozy, dimly lit room with &lt;mark&gt;a candle&lt;/mark&gt; on a wooden table, next to a backgammon board, creating a warm, rustic ambiance.
&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Claude&lt;/td&gt;
  &lt;td&gt;Two candles light up a game board for backgammon on a wooden table at night.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Local LLMs:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Description&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
  &lt;tbody&gt;
  &lt;tr&gt;
  &lt;td&gt;VIT-GPT2&lt;/td&gt;
  &lt;td&gt;A candle is lit on a wooden table in front of a &lt;mark&gt;fire place&lt;/mark&gt; with candles and other items on top of it.&lt;/td&gt;
  &lt;td&gt;E&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;GIT&lt;/td&gt;
  &lt;td&gt;Two candles are lit next to each other on a table, &lt;mark&gt;one of them is lit up and the other is lit up&lt;/mark&gt;.
&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP&lt;/td&gt;
  &lt;td&gt;There is a lit &lt;mark&gt;candle&lt;/mark&gt; sitting on top of a wooden table next to a game board and &lt;mark&gt;a glass of wine&lt;/mark&gt; on the table.
&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;BLIP-2 with OPT&lt;/td&gt;
  &lt;td&gt;A candle sits on top of a wooden table.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;BLIP-2 with FLAN&lt;/td&gt;
  &lt;td&gt;A candle sits on a wooden table next to a backgammon board and a &lt;mark&gt;glass of wine&lt;/mark&gt;.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;LLaVA (13B)&lt;/td&gt;
  &lt;td&gt;A cozy indoor setting with candles, &lt;mark&gt;a guitar&lt;/mark&gt;, and a warm ambiance.&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;LLaVA (34B)&lt;/td&gt;
  &lt;td&gt;A candlelit wooden table with &lt;mark&gt;musical instruments&lt;/mark&gt; and a cozy ambiance.&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;MiniCPM-V&lt;/td&gt;
  &lt;td&gt;A dimly lit room with candles and backgammon pieces on a wooden table, creating an atmosphere of relaxation or leisure activity.&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (11B)&lt;/td&gt;
  &lt;td&gt;A dimly lit room with a wooden table, featuring a backgammon board and two candles.
&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Llama 3.2 Vision (90B)&lt;/td&gt;
  &lt;td&gt;A candle and backgammon board on a wooden table.
&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&quot;model-accuracy&quot;&gt;Model accuracy&lt;/h3&gt;
&lt;p&gt;I evaluated each description using &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1LttMys6cEkse6LrRZoznvDdvBiz_WkVyauNBFV2jIXg/edit&quot;&gt;a structured but subjective scoring system&lt;/a&gt;. For each image, I identified the two or three most important objects the AI should recognize and include in its description. I also assessed whether the model captured the photo&#039;s mood, which can be important for visually impaired users. Finally, I deducted points for repetition, grammar errors, or hallucinations (invented details). Each &lt;code&gt;alt&lt;/code&gt;-text received a score from 0 to 5, which I then converted to a letter grade from A to F.&lt;/p&gt;
&lt;div class=&quot;large&quot;&gt;
  &lt;table&gt;
  &lt;tr&gt;
  &lt;th&gt;Model&lt;/th&gt;
  &lt;th&gt;Repetitions&lt;/th&gt;
  &lt;th&gt;Hallucinations&lt;/th&gt;
  &lt;th&gt;Moods&lt;/th&gt;
  &lt;th&gt;Average score&lt;/th&gt;
  &lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;VIT-GPT2&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Often&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Often&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Poor&lt;/td&gt;
  &lt;td&gt;0.4/5&lt;/td&gt;
  &lt;td&gt;F&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;GIT&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Often&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Often&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Poor&lt;/td&gt;
  &lt;td&gt;1.6/5&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Often&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Often&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffcccc&quot;&gt;Poor&lt;/td&gt;
  &lt;td&gt;1.8/5&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP2 w/OPT&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Rarely&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffeb99&quot;&gt;Sometimes&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffeb99&quot;&gt;Fair&lt;/td&gt;
  &lt;td&gt;2.6/5&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;BLIP2 w/FLAN&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Rarely&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffeb99&quot;&gt;Sometimes&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffeb99&quot;&gt;Fair&lt;/td&gt;
  &lt;td&gt;2.2/5&lt;/td&gt;
  &lt;td&gt;D&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA 13B&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffeb99&quot;&gt;Sometimes&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;3.2/5&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;LLaVA 34B&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ffeb99&quot;&gt;Sometimes&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;3.2/5&lt;/td&gt;
  &lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;MiniCPM-V&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;3.8/5&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 11B&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Rarely&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;4.4/5&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Llama 90B&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Rarely&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;3.8/5&lt;/td&gt;
  &lt;td&gt;B&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;GPT-4o&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;4.8/5&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
  &lt;tr&gt;
  &lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Never&lt;/td&gt;
  &lt;td style=&quot;background-color: #ccffcc&quot;&gt;Good&lt;/td&gt;
  &lt;td&gt;5/5&lt;/td&gt;
  &lt;td&gt;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
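&lt;p&gt;One way to sketch the score-to-grade conversion in Python is shown below. It is an assumption rather than the formula from the spreadsheet: the score is rounded to the nearest whole number and mapped onto six grades (0 is F, 1 is E, up to 5 is A), which is consistent with the averages and grades in the table above. The names &lt;code&gt;GRADES&lt;/code&gt; and &lt;code&gt;score_to_grade&lt;/code&gt; are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import math

# Assumed scale (not confirmed by the spreadsheet): position 0 is an F, position 5 is an A.
GRADES = 'FEDCBA'


def score_to_grade(score):
    # Keep the score inside the 0-5 range used for each alt-text.
    clamped = min(max(score, 0), 5)
    # Round half up to the nearest whole number, then look up the letter.
    return GRADES[math.floor(clamped + 0.5)]


# Average scores taken from the table above.
averages = {'VIT-GPT2': 0.4, 'BLIP2 w/FLAN': 2.2, 'Llama 11B': 4.4, 'GPT-4o': 4.8}
for model, score in averages.items():
    print(model, score, score_to_grade(score))
&lt;/code&gt;&lt;/pre&gt;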
&lt;p&gt;The cloud-based models, GPT-4o and Claude 3.5 Sonnet, performed nearly perfectly on my small test of five images, with no major errors, hallucinations, or repetitions, and with excellent mood detection.&lt;/p&gt;
&lt;p&gt;Among local models, both Llama variants and MiniCPM-V showed the strongest performance.&lt;/p&gt;
&lt;p&gt;Repetition in descriptions frustrates users of screen readers. Early models like VIT-GPT2, GIT, and BLIP frequently repeat content, making them unsuitable.&lt;/p&gt;
&lt;p&gt;Hallucinations can be a serious issue in my opinion. Describing nonexistent objects or actions misleads visually impaired users and erodes trust. Among the best-performing local models, MiniCPM-V did not hallucinate, while Llama 11B and Llama 90B each made one mistake. Llama 90B misidentified a cabinet at the museum as a table, and Llama 11B described multiple people wakeboarding instead of just one. While these errors aren&#039;t dramatic, they are still frustrating.&lt;/p&gt;
&lt;p&gt;Capturing mood is essential for giving visually impaired users a richer understanding of images. While early models struggled in this area, all recent models performed well. This includes both LLaVA variants and MiniCPM-V.&lt;/p&gt;
&lt;p&gt;From a practical standpoint, Llama 11B and MiniCPM-V ran smoothly on my 32GB RAM laptop, but Llama 90B needed more memory. Long story short, this means that Llama 11B and MiniCPM-V are my best candidates for additional testing.&lt;/p&gt;
&lt;h3&gt;Possible next steps&lt;/h3&gt;
&lt;p&gt;The results raise a tough question: is a &amp;quot;B&amp;quot;-level &lt;code&gt;alt&lt;/code&gt;-text better than none at all? Many human-written &lt;code&gt;alt&lt;/code&gt;-texts probably aren&#039;t perfect either. Should I wait for local models to hit an &amp;quot;A&amp;quot;-grade, or is an imperfect description still better than no &lt;code&gt;alt&lt;/code&gt;-text at all?&lt;/p&gt;
&lt;p&gt;Here are four possible next steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Combine AI outputs&lt;/strong&gt; – Run the same image through different models and merge their results to try and create more accurate descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wait and upgrade&lt;/strong&gt; – Use the best local model for now, tag AI-generated &lt;code&gt;alt&lt;/code&gt;-texts in the database, and refresh them in 6–12 months when new and better local models are available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Go cloud-based&lt;/strong&gt; – Get the best quality with a cloud model, even if it means uploading 65GB of photos. I can&#039;t explain why, or if the feeling is even justified, but it feels like giving in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid approach&lt;/strong&gt; – Use AI to generate &lt;code&gt;alt&lt;/code&gt;-texts but review them manually. With 9,000 images, that is not practical. I&#039;d need a way to flag &lt;code&gt;alt&lt;/code&gt;-texts most likely to be wrong. Can LLMs give me a reliable confidence score?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each option comes with trade-offs. Some options are quick but imperfect, others take work but might be worth it. Going cloud-based is the easiest but it feels like giving in. Waiting for better models is effortless but means delaying progress. Merging AI outputs or assigning a confidence score takes more effort but might be the best balance of speed and accuracy.&lt;/p&gt;
&lt;p&gt;Maybe the solution is a combination of these options? I could go cloud-based now, tag the AI-generated &lt;code&gt;alt&lt;/code&gt;-texts in my database, and regenerate them in 6–12 months when LLMs have gotten even better.&lt;/p&gt;
&lt;p&gt;It also comes down to pragmatism versus principle. Should I stick to local models because I believe in data privacy and Open Source, or should I prioritize accessibility by providing the best possible &lt;code&gt;alt&lt;/code&gt;-text for users? The local-first approach better aligns with my values, but it might come at the cost of a worse experience for visually impaired users.&lt;/p&gt;
&lt;p&gt;I&#039;ll be weighing these options over the next few weeks. What would you do? I&#039;d love to hear your thoughts!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; My thoughts on using AI for &lt;code&gt;alt&lt;/code&gt;-text have evolved across several blog posts. First, I &lt;a href=&quot;https://dri.es/i-want-to-run-ai-locally-here-is-why-i-am-not-yet&quot;&gt;chose a cloud-based LLM&lt;/a&gt; after all. Then, I &lt;a href=&quot;https://dri.es/automating-alt-text-generation-ai&quot;&gt;built an automated system&lt;/a&gt; to generate and update descriptions for just one image. Finally, I &lt;a href=&quot;https://dri.es/trusting-ai-with-my-images-was-not-easy&quot;&gt;scaled it to 9,000 images&lt;/a&gt; and learned to trust AI in the process.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Python wrapper for Mollom</title>
      <link>https://dri.es/python-wrapper-for-mollom</link>
      <guid>https://dri.es/python-wrapper-for-mollom</guid>
      <pubDate>Fri, 09 May 2008 03:04:10 -0400</pubDate>
      <description>&lt;p&gt;&lt;a href=&quot;http://itkovian.net&quot;&gt;Andy Georges&lt;/a&gt; released a &lt;a href=&quot;http://itkovian.net/base/python-wrapper-mollom&quot;&gt;Python wrapper for Mollom&lt;/a&gt;. The wrapper can be used to integrate Mollom in your Python applications, but it also gets Mollom one step closer to the &lt;a href=&quot;https://www.djangoproject.com/&quot;&gt;Django project&lt;/a&gt; and &lt;a href=&quot;https://cloud.google.com/appengine/&quot;&gt;Google App Engine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://www.mollom.com/api&quot;&gt;Mollom API&lt;/a&gt; was released &lt;a href=&quot;https://dri.es/mollom-api-now-available&quot;&gt;less than 10 days ago&lt;/a&gt;, and already &lt;a href=&quot;https://mollom.com&quot;&gt;Mollom&lt;/a&gt; is supported on PHP, Java, Python and Ruby. &lt;em&gt;Sweet!&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
  </channel>
</rss>
