Transcribing Handwriting with a Multimodal Model

by Shawn Graham

Legacy archaeological data is often trapped in handwritten context sheets, site diaries, and scraps of paper. Transcribing such materials so that one can work with them is time-consuming and difficult. If you’re an English speaker trained in North American cursive, you’re probably better off than the average undergrad today, who didn’t learn cursive… but then throw Italian cursive into the mix:

A page from a list of stratigraphy at Gabii

…well, that’s a challenge. Or consider the more tractable challenge – context recording sheets that certainly do structure the data, but are filled in by hand:

Single context recording sheet from the West Kent Archaeological Society

This too can be difficult and time-consuming. So we have been exploring what might be done with so-called ‘multimodal models’. These are LLMs in which visual and/or audio data are represented in the same embedding space as the associated text. Some of these models are then further fine-tuned to work primarily with images. After feeding such a model an image, one can ask it to ‘Describe the scene’ and so on.

We’ve been experimenting with the Qwen2-VL-Instruct model (you can launch a chat interface for it via Google Colab from this user’s GitHub repository, if you’d like a go); we feed it an image, then prompt it with ‘extract text’. Simon Willison writes about this here, too. When it runs, it works well, but getting it to load and run consistently has been a problem. We’re still experimenting. But we’ve also tried Google’s Gemini model, again because of Simon Willison’s experimentation, where he feeds it video and gets structured text out!

Ok, so we’re not at the playing-with-archaeological-video-to-see-what-we-can-see stage yet. But I did sort out a Google notebook that uses Gemini (you’ll need an API key, available through the ‘secrets’ button in the notebook) that will take images and transcribe them. I fed it that first page from Gabii, telling it ‘Extract HIGH QUALITY text from the italian handwriting in the image’, and it returned:

ELENCO DEGLI STRATI
 
US-0 = US-1 15/10/98 - da 0 e 50-60cm
 
US-2 = Si può individuare alla [illegible] di un 60-70 cm dal piano di
campagna -
Si compone di un livello terroso, con file presenze di malta
diverse, frammenti di mat. ceramici, pezzi di tegoli, pezzi di
pietre [illegible].
Si raccolgono 6 monete di bronzo. 13/10
 
US-3 = Strato terroso, friabile, di colore nerastro.
(ex US-A) Tale strato si mette in luce sotto nel livello formato
da US-2
È ricchissimo di frammenti di materiale ceramico, frammenti
di vetro, ossa animali, alcune monete, frammenti di
metalli da costruzione.
 
US-4 = Strato terroso, più compatto rispetto a US-3, di colore
(ex US-B) grigio-biancastro. Tale colorazione è dovuta all'elevata
concentrazione di malta o numerose minute disfettie
Si concentra in zone vicine ed alle sterne
prete di US-3
 
US-5 = Taglio rilevato nello strato terroso [illegible] US-3
(ex US-D) Il taglio, dalla forma meglio conservata ma leggermente visibile [illegible] di
1 e 1,50 m.
 
US-6 = Riempimento taglio US-5 composto da un sedimento terroso di colore
(ex US-C) marrone-grigiastro ([illegible] di malta e [illegible]).
 
US-7 = Forme di forma tondeggiante scavate
(ex US-E) nell'US-3.
Di forma si individuano solo
al margine NORD.

Pretty slick, eh? And then, because of the way my code runs, I inadvertently gave it an extra prompt, ‘Please process this page’ (instead of the placeholder text ‘insert_input_here’) and it translated the page too:

**US-0 = US-1**  15/10/98 - from 0 to 50-60cm
 
**US-2** = It can be distinguished at the edge of a pit about 60-70 cm from the wall of the "muro" (wall).
Composed of a loose layer, with the presence of various mortars, fragments of ceramic material, pieces of tiles, pieces of green glass.
6 bronze coins were collected. 13/10
 
**US-3** (ex US-A) = Loose, brittle, dark-colored layer.
This layer is visible just below the level formed by US-2.
 
It is rich in fragments of ceramic material, glass fragments, animal bones, some coins, fragments of construction materials.
 
**US-4** (ex US-B) = Compact layer, more compact than US-3, greyish-whitish in color. This coloring is due to the high concentration of mortar or various dispersed materials.
It can be found in small areas near and under US-3
 
**US-5** (ex US-D) = Cut located in the earthen layer from north-west to US-3
The cut, of better defined but slightly curved shape, measures about 1 by 1.50 m
 
**US-6** (ex US-C) = Filling of cut US-5, made up of a grayish-greenish earthen deposit (mortar remains).
 
**US-7** (ex US-E) = Traces of a rounded shape dug into US-3
This trace is only visible on the NORTH edge
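
For anyone who wants to try the same pattern outside the notebook, here is a minimal sketch of the call. It is not the notebook’s actual code. It assumes the `google-generativeai` and `Pillow` packages are installed; the model name `gemini-1.5-flash`, the function names, and the `api_key` argument are my own illustrative choices. Unlike the run described above, it drops the second prompt whenever it is still the unfilled placeholder.

```python
# Hypothetical sketch: send an image plus one or two text prompts to Gemini.
PLACEHOLDER = "insert_input_here"  # placeholder text mentioned in the post


def build_prompts(instruction, extra=PLACEHOLDER):
    """Return the text prompts to send, skipping an unfilled placeholder."""
    prompts = [instruction]
    if extra and extra != PLACEHOLDER:
        prompts.append(extra)
    return prompts


def transcribe(image_path, instruction, extra=PLACEHOLDER,
               api_key=None, model_name="gemini-1.5-flash"):
    """Ask a Gemini model to transcribe the handwriting in one image."""
    # Imported here so build_prompts() works without these packages installed.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=api_key)            # assumed: the key is passed in by the caller
    model = genai.GenerativeModel(model_name)   # assumed model name
    image = Image.open(image_path)
    # The request is a list of parts: the text prompts followed by the image.
    response = model.generate_content(build_prompts(instruction, extra) + [image])
    return response.text
```

Something like `transcribe("gabii_page1.jpg", "Extract HIGH QUALITY text from the italian handwriting in the image", api_key=my_key)` would then return the transcription as a string; the file name and `my_key` are placeholders.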

I fed it the context sheet. It returned:

Grid Square(s):

DEPOSIT
1. Compaction: FRIABLE
2. Colour: BROWNISH BLACK
3. Composition / Particle size (over 10%): SANDY SILT
4. Inclusions (under 10%): GREY CLAY, OCCASIONAL TUFA, MODERATE
5. Thickness & extent: 0.6m x 1.3m x
6. Other comments: POSS: ANIMAL BURROWS PULLING TOPSOIL INTO CONTEXT. SIMILAR TO (3001)
7. Method & conditions: TROWEL - WARM & SUNNY

Site Code: KEP (12)
Context: 3058
Area/Section: TRENCH 3
Context type: DEPOSIT

CUT
1. Shape in plan
2. Corners
3. Dimensions/Depth
4. Break of slope- top
5. Sides
6. Break of slope- base
7. Base
8. Orientation
9. Inclination of axis
10. Truncated (if known)
11. Fill nos
12. Other comments
Draw profile overleaf

Stratigraphic matrix
3001 
This context is 3058
3057  3059

Your interpretation: Internal
Your discussion: Primary fill of 3057. Very similar to (3001) but with more density of finds and small irregular lumps of tufa

Context same as: (X)
Plan nos: P
Initials & date: KF 26/6/12

Finds (tick)
Bone ✓ Glass ✓

Levels on reverse
Highest:
Lowest:

Environmental samples
Sample nos & type:

Finds Sleving: on site ☐

Checked interpretation:

I then swapped out the secondary placeholder prompt with ‘Return results suitable for CSV’ and lo! It duly did. I also cooked up a prompt asking it to extract just the stratigraphic relationships recorded on the sheet. This took a bit more jiggery-pokery, but it also worked (and, for extra fun, the relationships could be returned as a Cypher query, ready for loading into a graph database).
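
To give a sense of what the Cypher output can look like, here is a minimal sketch that builds the statements in Python from a list of relationships rather than asking the model for them. The `Context` label, the `ABOVE` relationship type, and the function name are hypothetical, and the pairs reflect my reading of the matrix on the sheet above (3001 over 3058, which sits over 3057 and 3059), which is an assumption about the layout.

```python
def relationships_to_cypher(pairs, label="Context", rel="ABOVE"):
    """Turn (upper, lower) context-number pairs into Cypher MERGE statements.

    The label and relationship type are illustrative, not a standard schema.
    """
    statements = []
    for upper, lower in pairs:
        # MERGE creates each node and relationship only if it does not exist yet,
        # so the statements can be re-run without duplicating data.
        statements.append(
            f"MERGE (a:{label} {{id: {int(upper)}}}) "
            f"MERGE (b:{label} {{id: {int(lower)}}}) "
            f"MERGE (a)-[:{rel}]->(b);"
        )
    return "\n".join(statements)


# Assumed reading of the stratigraphic matrix on the sheet.
matrix = [(3001, 3058), (3058, 3057), (3058, 3059)]
print(relationships_to_cypher(matrix))
```

Each line of the result is a standalone statement that can be pasted into a graph database that speaks Cypher, such as Neo4j.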

Anyway, it’s not the most elegant code, and we’ll want to experiment with other multimodal models, since it would be best if all of this could run on a local machine. But holy moly, this is a useful thing.

Friday, October 18, 2024 | Categories: News
Post tagged with experiments, legacy data, llm, multimodal
Short URL: https://carleton.ca/xlab/?p=251