…in which I discuss the logic and functioning of two Jupyter notebooks that use Simon Willison’s LLM to act as a kind of discovery agent for a history API and an archaeology API. Both notebooks are available for copying, improving, and using!
Crossposted from ElectricArchaeology by Shawn Graham
I have been playing with GPT-Researcher, figuring out ways to make it use different data sources. I eventually got it to work with OpenAlex, Open Context, and Chronicling America, and to write particular kinds of research reports. It was tricksy, and getting it to even start took a wee bit (tacit/hidden knowledge at play once again). I talk about this in this earlier post.
Part of the impulse for exploring this kind of approach was captured nicely by Eric Kansa in a discussion via Mastodon:
I think LLMs have some interesting applicability in making archaeological data more accessible to wider publics.
They have the potential of helping to bridge more everyday language with the more specialized / technical language that describes #opendata in #archaeology
@electricarchaeo @ekansa just jumping in to say that’s my big hope for LLMs / vector space search too – getting over the hit and miss effect of having to know the right keywords
I love this kind of framing. I struggle to use APIs effectively because I have trouble thinking in the kind of granularity that writing/parsing API calls requires. One of the biggest strengths of large language models is that they can translate between one domain and another. ‘Rewrite this driver’s manual in the style of an 18th century pirate’; ‘Turn this research question into a series of appropriate API calls to the Chronicling America database’: to the LLM, it’s all the same task. That’s incredibly powerful.
I am teaching a seminar on artificial intelligence in/and history in the fall. I thought therefore it might be a good exercise for my students to explore these kinds of agents and the ways one might write prompts to achieve this kind of bridging between the things we want to know about and the tools we have for retrieving the data. I set out to reimplement GPT-Researcher after a fashion using a good ol’ Jupyter notebook approach via Google Colab. My students will most certainly not have enough computing power to run an LLM locally on their own machines; even if they do, I do not necessarily want to start from zero providing tech support for 20 different machines and 20 different environments. A Google Colab environment is just the ticket.
I built two notebooks, one for working with the Chronicling America API, and one for working with the Open Context API. The logic for both is the same: translate the user’s research question into API calls, retrieve and clean the results, then have the model write a report from them.
At first I was using the initial prompt to translate from the user’s research question to the various kinds of Google searches one would use. Then I was wrestling with the problem that those kinds of search strings just don’t really work against a database of scanned newspapers. I fiddled some more, getting the prompt to generate appropriate keywords and key phrases that captured the user’s research question. This worked better, but then I was having to parse the JSON results in Python and it was all getting to be a mess. Then I realized: translation. I didn’t have to have an intermediary step. I could show the model what a completed URL using the API looked like. I should also note that I’m using Simon Willison’s wonderful LLM. You can set up prompt templates which you then feed text into or out of. Below is what the initial prompt for my Chronicling America notebook looks like:
# newsquery prompt
# note that this prompt has an example of a complete advanced search URL using the api.
# LLMs are good at following examples, so we can just use this prompt to translate from the user's
# question into the direct search api strings that might answer the question.
!llm 'You will write six newspaper database search queries from the following $input. You need to construct queries that use high-quality keywords or keyword phrases to capture the general sense of the original question. The search api we will use can look like this- https://chroniclingamerica.loc.gov/search/pages/results/?state=&lccn=&dateFilterType=range&date1=1950&date2=1960&language=&ortext=&andtext=&phrasetext=Canadian+lumber&proxtext=&proxdistance=5&rows=20&searchType=advanced&format=json. The strings that you generate should be formulated to follow that pattern of variables, and always with format as json. Phrases should follow phrasetext= and be joined with a + sign. Keywords on the other hand should follow proxtext= and be joined with a + sign. Search can be narrowed to states by adding the state name after state=. RETURN ONLY THE FULL URL YOU HAVE GENERATED. DO NOT PUT SPACES IN THE API STRING. Use the + sign if necessary. Generate at least one api string using THIS PATTERN IN PARTICULAR: https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1756&date2=1963&proxtext=canadian+softwood+lumber&x=0&y=0&dateFilterType=yearRange&rows=20&searchType=basic. Take it step by step. This is important to my career.' --save newsquery
See how elegant that is? (Although having to encourage the LLM to think things through because I might get fired is simultaneously funny and frightening – funny that this counts as ‘programming’, frightening because it has learned from the web that better results sometimes follow from that kind of pressure; and yet, this works.) Later on, after having the user write their question into a variable field, we can just get cracking with:
!llm -t newsquery '{question}' > generated-queries.txt
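For context, the {question} there comes from the variable field mentioned above: in Colab that can be a simple form cell, set earlier in the notebook, along these lines (the example question is illustrative only, echoing the softwood lumber example in the prompt):

#@title Research question
# A Colab form field; whatever is typed here gets interpolated into the !llm calls below.
question = "How was Canadian softwood lumber covered in the American press?" #@param {type:"string"}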
Running the template like this generates several different API calls, which we then load up and run, writing the results to file. A bit of cleaning takes place. Then another prompt is called that directs the LLM to organize the news articles chronologically, summarize each one in light of the original research question, write it all out, and then provide a synoptic overview.
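A minimal sketch of that middle stretch might look something like the cell below. It assumes the queries landed in generated-queries.txt as above; the Chronicling America field names ("items", "ocr_eng"), the results.txt filename, and the newsreport template name are my own placeholders for illustration, not the notebook's exact code.

import json
import requests

# Load the API URLs generated by the newsquery template.
with open("generated-queries.txt") as f:
    queries = [line.strip() for line in f if line.strip().startswith("http")]

articles = []
for url in queries:
    try:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        data = resp.json()
    except Exception:
        continue  # LLM-generated URLs are not always well formed; skip the duds
    # Chronicling America returns matching pages under "items"; keep just
    # enough of each hit for the summarizing prompt to work with.
    for item in data.get("items", []):
        articles.append({
            "date": item.get("date"),
            "title": item.get("title"),
            "text": (item.get("ocr_eng") or "")[:1500],  # trim the OCR to keep the context window manageable
        })

with open("results.txt", "w") as f:
    json.dump(articles, f, indent=2)

The collected results can then be fed to a second, report-writing template, e.g. !cat results.txt | llm -t newsreport > report.md, with newsreport standing in for whatever summarizing template you have saved.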
Similarly, with regard to Open Context, we follow the same logic. The challenge with Open Context is that it is so granular. The initial generated queries lead to more URIs, so there are some helper functions that parse the original results returned via the API in order to traverse the next layer of results. NB Eric also tweaked some of the API for me to experiment with, which is why that notebook points to the ‘staging’ site for Open Context at present.
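To give a flavour of what those helpers do, here is a rough sketch of the traversal step. The key names ("oc-api:has-results", "uri") reflect my reading of Open Context's JSON output, and the function itself is illustrative rather than the notebook's actual code, so check both against the live (or staging) API.

import requests

def fetch_next_layer(search_url, limit=20):
    """Follow the record URIs inside a first-pass Open Context search result."""
    first = requests.get(search_url, headers={"Accept": "application/json"}, timeout=60).json()
    records = []
    for hit in first.get("oc-api:has-results", [])[:limit]:
        uri = hit.get("uri")
        if not uri:
            continue
        # Individual records can be requested as JSON in turn.
        record = requests.get(uri, headers={"Accept": "application/json"}, timeout=60).json()
        records.append(record)
    return records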
Having done all this, what is left for the students to do? Well, right now the notebooks are using the GPT-4 preview model. I do not expect and will not require students to pay OpenAI for keys, so they’ll have to explore the different models available through the LLM plugin for GPT4All. This will require them to rework the prompts, no doubt, and also to consider tokenization, context windows, and how to assemble the data. The template system for LLM will also allow them to easily write new kinds of report templates and to consider the comparative results.
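Mechanically, at least, swapping in a local model is not much work; something like the cell below should do it, though the particular model id is only an example of the sort of thing llm models will list once the plugin is installed.

# install the GPT4All plugin for llm and see which local models it exposes
!llm install llm-gpt4all
!llm models

# rerun the saved template against a local model; the model id below is
# illustrative only; substitute whatever `llm models` reports on your machine
!llm -m orca-mini-3b-gguf2-q4_0 -t newsquery '{question}' > generated-queries.txt

The rub, as noted above, is that smaller local models have smaller context windows and follow instructions less reliably, so the prompts themselves will likely need reworking.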
My course website and resources will all be public once I finish developing everything. I think this will be an interesting experience. The final output for the course will be a collaboratively authored book or grimoire for historians deploying LLMs in the service of good history. And the point of the seminar? Right there in the task: what constitutes good history when LLMs are around?