We talked a bit about using GPT3 to create a knowledge graph in an earlier post. In this one, we’ll go into the weeds in a bit more detail.

If you could take a long-distance, macroscopic, big-picture look at the antiquities trade, what might you see? This is the question animating our ‘New Organigram Project’, which uses certain machine learning techniques to shift our perspectives on what we know (or think we know) about the trade. The principal technique we’ve been using is to create knowledge graphs, and then feed these through a neural network to help identify new probable connections. That’s the thing with studying an illegal, illicit, or shadowy trade: you only catch glimpses. But what do those glimpses add up to?

One of the things that we want to be able to do is to continually update our knowledge graph about the antiquities trade. When we have statements about the trade arranged as subject – predicate – object triples (“Medici”, “owned”, “Hydra Gallery”), we can knit them into a network. In network analysis, we normally want all of the ‘nodes’ or ‘entities’ or ‘things’ to be of the same kind, and the relationships between them to be of the same kind too – for instance, ‘students’ who are connected to each other by virtue of being ‘friends with’ one another.

But when we start getting into networks with multiple kinds of entities, and multiple kinds of relationships, things can get very complicated. One way of dealing with this might be to re-project the network so that one ends up with a series of networks. Say you have a network with ‘students’, ‘teachers’, and ‘classes’. Students and teachers are connected to classes by virtue of ‘teaching’ or ‘studying in’. You could create a graph of classes connected to other classes by virtue of students (if ‘Bob’ studies ‘chemistry’ and ‘geology’, chemistry and geology would be the nodes and would have a link with strength 1: one student, Bob. If ‘Alice’ studies them too, the link would be strength 2. Alternatively, if students are the nodes, then Bob and Alice would be connected with strength 2: two classes in common.) While this approach is feasible when you have only a handful of different kinds of links or entities, it becomes more unwieldy with each additional type you have to deal with.
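To make that re-projection concrete, here is a minimal sketch in Python using networkx. It is purely illustrative – the student/class names are the toy example from above, not part of our actual pipeline:

```
# A toy version of the re-projection described above, using networkx.
# The student/class names are invented; this is not part of our pipeline.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(["Bob", "Alice"], bipartite="student")
B.add_nodes_from(["chemistry", "geology"], bipartite="class")
B.add_edges_from([("Bob", "chemistry"), ("Bob", "geology"),
                  ("Alice", "chemistry"), ("Alice", "geology")])

# Project onto classes: the edge weight is the number of shared students.
classes = bipartite.weighted_projected_graph(B, ["chemistry", "geology"])
print(list(classes.edges(data=True)))  # [('chemistry', 'geology', {'weight': 2})]

# Project onto students: the edge weight is the number of shared classes.
students = bipartite.weighted_projected_graph(B, ["Bob", "Alice"])
print(list(students.edges(data=True)))  # [('Bob', 'Alice', {'weight': 2})]
```

Each new kind of entity or relationship means another projection like this, which is exactly where the approach starts to creak.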

This is what attracted us to what are called ‘embedding models’. In this approach, we convert all of those statements about the antiquities trade into mathematical vectors (lists of numbers) describing the semantic content of the statements in a way the computer can measure. Vectors are directions in space; the more entities and relationships, the more dimensions in that space, and proximity in that space implies some kind of semantic similarity. We create this embedding model with a machine learning approach (‘training a neural architecture over a graph’, as Ampligraph puts it: https://docs.ampligraph.org/en/2.0.0/ampligraph.latent_features.html).

The upshot of that is that when we know a lot about a person, we can make some educated guesses about people we don’t know as much about by measuring their proximity to things we do know a lot about.
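As a rough sketch of what that looks like in practice – assuming the AmpliGraph 1.x API (the 2.x API is organized a little differently) and a handful of invented triples standing in for our real graph:

```
# A toy sketch, assuming the AmpliGraph 1.x API: fit an embedding model on a
# few invented triples, then measure how 'close' two entities sit in the space.
import numpy as np
from scipy.spatial.distance import cosine
from ampligraph.latent_features import ComplEx

X = np.array([
    ["Medici", "owned", "Hydra Gallery"],
    ["Medici", "sold", "the krater"],
    ["Hecht", "bought", "the krater"],
    ["Hecht", "worked with", "Medici"],
])

model = ComplEx(k=50, eta=5, epochs=100, batches_count=1, seed=0, verbose=False)
model.fit(X)

# Entities that play similar roles in the graph end up near one another in the
# embedding space; cosine distance is one simple way to measure that proximity.
medici, hecht = model.get_embeddings(np.array(["Medici", "Hecht"]),
                                     embedding_type="entity")
print(cosine(medici, hecht))
```

The numbers here (k, eta, epochs) are placeholders; the point is that once the model is fit, every entity has a position in the embedding space, and it is the unexpected proximities between those positions that we go looking for.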

The first time we tried this, we used a knowledge graph that we created by closely reading through the encyclopedia articles at Trafficking Culture and listing out all of the relationships we found. We initially collected as much information as we could, but with play and experimentation we whittled it down to statements about people, objects, and organizations: people’s personal or work relationships, their relationships with organizations, and purchase/theft/donation-type relationships with regard to objects. We visualized the proximities (remember, things are ‘close’ depending on where they get represented in the multi-dimensional vector space of the relationships) and found a really interesting proximity: Leonardo Patterson with the Brooklyn Museum. We had no reason to think that there was any connection between Patterson and the Brooklyn Museum, but we trawled through their online database anyway… and found that Patterson had made some low-value donations that no one had really noticed or thought much of. This led us to find a similar pattern of behaviour by Patterson at other museums! (That work is published in Advances in Archaeological Practice.)

The problem with our approach was that it wasn’t very fast. Reading, annotating, and cleaning up the initial graph of knowledge statements took a very long time. This kind of ‘relationship extraction’ is an area of machine learning research, but when people develop automated approaches to this, it generally requires extremely carefully annotated source texts to train a model. Often, not only do we have to mark up the entities and relationships of interest, but we also have to count the number of characters in the sentence so we can indicate the start and stop points; this is information we did not have in our first round of annotation, so this approach was not… attractive… for us. But like everyone else, we saw the sudden explosion of interest and hype around GPT3 (and ChatGPT) and wondered if maybe these ‘stochastic parrots’ – autocomplete on steroids, token predictors – might be both smart enough to identify relationships and dumb enough that we could use them.
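To give a sense of what the character-offset annotation mentioned above demands, here is an invented, schematic example in Python. The schema is illustrative only (different toolkits want different formats), but the offset bookkeeping is the common, tedious part:

```
# Invented example of span-level annotation for supervised relation extraction:
# every entity needs character start/end offsets in the sentence. The schema is
# illustrative only; real toolkits each have their own format.
sentence = "Medici sold the krater to Hecht."
example = {
    "text": sentence,
    "entities": [
        {"label": "PERSON", "start": 0, "end": 6},      # "Medici"
        {"label": "ART_WORK", "start": 12, "end": 22},   # "the krater"
        {"label": "PERSON", "start": 26, "end": 31},     # "Hecht"
    ],
    "relations": [
        {"type": "Sold", "head": 0, "tail": 1},      # Medici -> the krater
        {"type": "Sold To", "head": 0, "tail": 2},   # Medici -> Hecht
    ],
}
```

Multiply that by every sentence in every source text and you can see why we went looking for a shortcut.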

We began to play with the OpenAI sandbox to see if we could channel the language model towards our own ends. We tried many different prompts:

> Identify the critical relationships and return subject, verb, object pairs

> Identify the critical relationships and return subject, verb, object pairs as markdown

> You are a cultural heritage relationship extractor. Find the relationships…

…and so on. We have found that we can, through a mixture of very specific instructions and carefully framed examples, steer GPT3 enough that we can feed it a newspaper article and have it return most of what we’re after. We can even get it to return the information such that it is written in the graph database language CYPHER (which means we can go from reading to graph query in one step; a ‘graph database’ doesn’t represent information in tables with keys, but rather as nodes and relationships, and can query over the links; this was the subject of our earlier post).
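We won’t reproduce GPT3’s CYPHER output here, but to give a sense of the shape of the thing, here is a hypothetical sketch (in Python, with our Medici example) of the kind of MERGE statement a single triple becomes:

```
# Illustrative only: the sort of Cypher MERGE statement a single triple can
# become. In practice our prompt has GPT3 write the CYPHER itself; this sketch
# just shows the shape of what comes back.
def triple_to_cypher(subj, rel, obj):
    rel_type = rel.upper().replace(" ", "_")
    return (f"MERGE (a:Entity {{name: '{subj}'}}) "
            f"MERGE (b:Entity {{name: '{obj}'}}) "
            f"MERGE (a)-[:{rel_type}]->(b)")

print(triple_to_cypher("Medici", "owned", "Hydra Gallery"))
# MERGE (a:Entity {name: 'Medici'}) MERGE (b:Entity {name: 'Hydra Gallery'}) MERGE (a)-[:OWNED]->(b)
```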

We have custom code that we run at the terminal on our machine to feed the prompt and the articles to OpenAI, one article in a text file at a time, and then write the result to file. However, we’ve got a csv file with thousands of newspaper articles. We could split that csv into individual text files, and then feed those one at a time. But if we could somehow use GPT inside a spreadsheet on our csv file, to extract the relationships from the text, and then place the results in the adjoining cell of the spreadsheet, we would solve some issues. One issue is making sure that GPT3 isn’t making things up – we can visually check things much faster. Another issue emerges when an article is just too big. OpenAI has a limit on how large a context window it can use at a time. If the prompt and the article are together larger than the amount of material the context window can hold, we get an error. Having GPT3 work from within a spreadsheet means we can filter to find those errors quickly and then decide what to do with them. Another issue is compiling the resulting series of statements into the csv format that Ampligraph requires; in a spreadsheet, we can quickly build a pivot table to see which relationships or entities occur the most, which relationships GPT has generated and how many, for us to manually cull and pull together into our eventual knowledge graph file.
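For reference, the terminal workflow mentioned at the start of that paragraph is roughly the following shape – a hedged sketch, assuming the older openai 0.x Python client; the file and column names (articles.csv, ‘text’) are placeholders rather than our actual files. It also shows the context-window problem: an over-long article simply raises an error we have to catch.

```
# Rough sketch of the terminal workflow: read articles from a csv, send each
# one to OpenAI with our extraction prompt, write the result beside it.
# Assumes the older openai 0.x Python client; 'articles.csv' and the 'text'
# column are placeholder names, not our actual files.
import csv
import openai

openai.api_key = "YOUR-API-KEY"
PROMPT = "List ONLY the relations of the types [...] in the text.\nTEXT: "  # abridged

with open("articles.csv", newline="", encoding="utf-8") as infile, \
     open("relations.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    for row in csv.DictReader(infile):
        try:
            response = openai.Completion.create(
                engine="text-davinci-003",
                prompt=PROMPT + row["text"] + "\nEND\n",
                temperature=0,
                max_tokens=2000,
            )
            writer.writerow([row["text"], response["choices"][0]["text"]])
        except openai.error.InvalidRequestError as err:
            # Usually means the prompt plus article exceeded the context window.
            writer.writerow([row["text"], f"ERROR: {err}"])
```

The spreadsheet route described above does the same thing, but puts the checking, filtering, and pivoting where we can see it.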

The thing we’ve discovered with a knowledge graph embedding model is that if we have too many different kinds of relationships, the results are too … nebulous. The cloud isn’t lumpy enough to say or spot anything useful, as it were. Also, GPT3 can be very … expansive … in how it identifies relationships and entities. ‘G Medici’, ‘Medici’, ‘Giacomo Medici’s Warehouse’ are all entities it might spot. The solution is to somehow constrain GPT3 to be very conservative in what it extracts. There is a parameter called ‘temperature’ which, when dialed down to 0, steers GPT3 to choose only those terms that have the very strongest probabilities. Another part of the solution is to be extremely explicit in the prompt and add, in the example fed to the prompt, markers of ‘person’ or ‘organization’, as well as the definite article ‘the’ when dealing with an antiquity.

That one floored us. Say the sentence was, ‘Medici sold the krater to Hecht.’ If, in the example of desired output passed to GPT3, we said something like ‘Bob’,’sold’,’vase’, the result for our sentence might be ‘medici’, ‘sold’, ‘krater’; but it could also be ‘krater’,’sold’,’medici’, especially if the source sentence was written in the passive voice. We found that if we phrased the examples with a ‘the’, like this: ‘Bob’,’sold’,’the Vase’, GPT3 would no longer be flummoxed. (They say that GPT4, being so much bigger and more complicated, is far better at these things naturally, and so doesn’t require all of this alchemy, but though we’ve applied for access, we’re still on the waiting list to find out.)

So, just to recap:
– we want to call GPT3 from within a spreadsheet
– we want it to stick to the facts
– we want it to not get flummoxed by convoluted sentence structures
– we want it to do the work quickly

We found what we were looking for on the website of a Danny Richman, an online marketer: https://www.seotraininglondon.org/gpt3-google-sheets-free-tutorial/ . Richman’s posts taught us how to write a function for a Google Sheets spreadsheet that would take our prompt, prepend it to the text in a cell, pass the complete thing to OpenAI, and then return our results.

The code below contains our complete script. To use it, make a new Google sheet. Then, under ‘Extensions’, click on ‘Apps Script’. An editor for the script will open. Delete the small bit of script that is already in the editor, and paste ours in instead. Save it (ctrl+s). Then hit the ‘Deploy’ button. Select ‘web app’ if it asks, agree to the security questions (yes, let it access things). Then, once it’s deployed, reload your google spreadsheet.

Make a new tab/sheet in the spreadsheet file, and call it ‘Settings’. In cell A2, you will need to paste your OpenAI API key (if you don’t have one, you can get one by creating an account with OpenAI, and you probably still get a few free credits to play with before this starts costing you anything). Make a new sheet, and fill the first column with the texts you want to extract information from. Then in an adjoining cell, you can type the formula: `=getentities(a1)` and you’re good to go! The function takes the prompt, adds it to the cell text, sends it to OpenAI, then prints out the result.

When everything is finished, click ‘Save’ under the ‘Save Values’ menu (that menu should’ve appeared once you reloaded the spreadsheet after making your app script). This will copy the text _as text_ into the cells. That way, you won’t make any inadvertent changes to the sheet or cause it to start passing material to OpenAI again. When you’re done, it’s probably best to delete cell A2 in your ‘Settings’ sheet, so that your API key isn’t left lying around.

We’ll talk about the important bits of the prompt in a moment. Here’s the code to paste into your app script:

// This script by Danny Richman is free to use on a Creative Commons License @DannyRichman
// Modifications by S Graham only to the prompt

function onOpen(){
// This function adds a menu option to convert formulas to plain text.
var ui = SpreadsheetApp.getUi();
ui.createMenu('Save Values')
.addItem('Save', 'saveAsFixedValue')
.addToUi();
}
function saveAsFixedValue(){
// This function prevents the API being called each time the sheet is refreshed or updated to save your API credits.
var ss = SpreadsheetApp.getActiveSheet()
var data = ss.getDataRange().getValues()
ss.getRange(1,1,data.length,data[0].length).setValues(data)
}
function GETENTITIES(val) {
// main function to generate formula and write value to sheet
var ss = SpreadsheetApp.getActiveSheet()
// Exit function if no description provided
if (val == "" ) {
result = ""
return result;
}
// get API key from settings sheet
var setsh = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Settings");
var apiKey = setsh.getRange(2,1).getValue()
//prompt prompted by arxiv.org/pdf/2305.05003.pdf
// configure the API request to OpenAI
var data = {
"prompt" : "List ONLY the relations of the types [OrgBased In, Work For, Worked With, Located In, Purchased, Bought From, Sold, Sold To, Stole, stolen from, Donated, Donated To, Occupation] among the entities [PERSON, LOCATION, ORGANIZATION, ART_WORK] in the text. IGNORE relations not of the listed types. TEXT: Meanwhile, Shi Liming at the Institute of Zoology of Kunming found that pandas lack variety in their protein heredity, which may serve as one of the major reasons for pandas' near extinction.\nEND\nRelations: [['Shi Liming:Per', 'Work For', 'Institute of Zoology:Org']\n['Institute of Zoology:Org', 'OrgBased In', 'Kunming:Loc']]\nTEXT: At that point, it was not clear who had given the statue to the museum, but records suggest that Mary Turlington was the ultimate source. It may be that Kenny Wong sold it to her in the first place. Wong claims he bought it at auction.\nEND\nRelations: [['Mary Turlington:Per', 'Occupation', 'art dealer:Org']\n['Mary Turlington:Per', 'Donated', 'the statue:Art_Work']\n['Mary Turlington:Per','Donated to','museum:Org']\n['Kenny Wong:Per','Sold to','Mary Turlingtong:Per']\n['Kenny Wong:Per','Sold','the statue:Art_Work']\n['Kenny Wong:per','Bought From','Auction:Org']]\nText\nA U.S. Senate subcommittee released a report confirming that Mary Turlington was the buyer present at an auction where the looted statue was sold.he statue was stolen from a cemetery.\nEND\nRelations: [['Mary Turlington','occupation','buyer']\n['Mary Turlington:Per','purchased','the statue:Art_Work']\n\['the statue:Art_Work','stolen from','cemetery']]\nList ONLY the relations of the types [OrgBased In, Work For, Worked With, Located In, Purchased, Bought From, Sold, Sold To, Stole, stolen from, Donated, Donated To, Occupation] among the entities [PERSON, LOCATION, ORGANIZATION, ART_WORK] in the text. IGNORE relations not of the listed types.\nTEXT:" + val + "\nEND\n",
"temperature": 0,
"max_tokens": 2000,
"top_p": 0,
"best_of": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
"stop": ["END"]
};
var options = {
'method' : 'post',
'contentType': 'application/json',
'payload' : JSON.stringify(data),
'headers': {
Authorization: 'Bearer ' + apiKey,
},
};
var response = UrlFetchApp.fetch(
'https://api.openai.com/v1/engines/text-davinci-003/completions',
options,
);
// Parse the API response and return the extracted relations to the cell
var result = JSON.parse(response.getContentText())['choices'][0]['text']
Logger.log(result)
return result;
}

(The cost of using OpenAI depends on how much data you query it for. If there are other large language models online that have an API – cohere.ai, for instance, does – it should be fairly straightforward to modify this script to use their endpoint; they’re all a bit similar.)

Ok, let’s look at our prompt. We use ALL CAPS to emphasize things we really want GPT3 to pay attention to (it seems to work). We give it a list of desired relationships to identify as an array; the square brackets do seem to focus GPT3’s attention better, which perhaps is a function of all the code it saw during training? Same with the desired entities to identify. All of our newspaper articles that we want to analyze have the word END appended to them (you can do this with a spreadsheet formula) to signal that all of the text has been provided (otherwise, GPT3 can decide to add more details to the story, which is a BAD THING). The worked examples provided are structurally the kind of thing GPT3 will see in the newspaper text, but their details are very unlikely to appear in it – Mary Turlington, for instance, is a character in the _Vinyl Cafe_ stories of Stuart McLean. The injunction to IGNORE relations not listed doesn’t always stop GPT3, but by and large it does seem to do the trick. Adding the kind-of-entity flags (‘:Org’, ‘:Loc’, etc) seems to also help keep GPT3 from going off the rails when it comes to subjects and objects. Then we repeat the instructions one more time; this extra emphasis does seem to make a difference. Other things to fiddle with: temperature and top_p. We keep these dialed right down; we want GPT3 to NOT be creative. Like we said, it’s all alchemy: there’s no theory that will guide you to the exact prompt to achieve what you want. Eye of newt, toe of frog, shazaam!, lead into gold.

We then download the results, open them in Excel, and then start examining the relationships. There’s still some cleaning and sorting to do here, but we find generally this prompt steers us towards those relationships that seem to give us the best results when we get to the point of creating and querying the knowledge graph embedding model.
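The compiling step looks roughly like this – a sketch that assumes GPT3 has returned triples in the bracketed format shown in the prompt’s examples; anything that doesn’t match still gets cleaned by hand:

```
# Sketch of compiling GPT3's bracketed output into the three-column csv of
# triples that Ampligraph expects. Assumes the output follows the bracketed
# format shown in the prompt's examples; anything else needs hand-cleaning.
import csv
import re

TRIPLE = re.compile(r"\['([^']+)',\s*'([^']+)',\s*'([^']+)'\]")

def strip_tag(entity):
    # Drop the ':Per' / ':Org' / ':Loc' / ':Art_Work' suffixes for the graph file.
    return entity.split(":")[0].strip()

gpt_output = (
    "[['Kenny Wong:Per','Sold','the statue:Art_Work']\n"
    "['Kenny Wong:Per','Bought From','Auction:Org']]"
)

with open("triples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for subj, rel, obj in TRIPLE.findall(gpt_output):
        writer.writerow([strip_tag(subj), rel.strip(), strip_tag(obj)])
```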

One thing that we’ve found makes a difference in the quality of the results is to do a round of summarization first. The advantages here are two-fold. One, GPT3 is extremely good at summarizing things, so the prompt itself can be very succinct. Two, because some of the newspaper articles are very long and contain a lot of repetitive background text that we don’t really need (ie, many articles will repeat elements of Elgin and the Parthenon Marbles, to set the scene/colour for the main story), a succinct prompt also lets us process more text with fewer errors. By summarizing, we get a paragraph of text with the essence of the story, which we can then extract relationships from. We have another google sheet where the prompt is:


"prompt" : "You are a high quality text summarizer in the domain of cultural heritage knowledge. Summarize the most salient points the following text, being concise while maintaining as much detail about the main points as possible (ie persons, places, things): \n\nTEXT" + val + "\nEND\n",

By giving GPT3 a persona in the first line, we direct its attention more effectively in terms of deciding what should be prioritized in the summary. When we feed these results to our `getentities` function the results are also more likely to be framed in terms of the desired relationships we want; and it also makes it easier for GPT to return ‘none’ when the summary really doesn’t reflect any structural information about the trade that we are after. (Last week we ran the summarizer on 3146 newspaper articles; total cost: $60.84. Elapsed time: 2 hrs 45 minutes.)
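Put together, the two passes look something like this in miniature (again assuming the older openai 0.x Python client; the prompts here are abridged placeholders, not our full text):

```
# The two-pass idea in miniature: summarize first, then extract relations from
# the summary. Abridged placeholder prompts; assumes the openai 0.x client.
import openai

openai.api_key = "YOUR-API-KEY"

def complete(prompt, text):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt + text + "\nEND\n",
        temperature=0,
        max_tokens=2000,
    )
    return response["choices"][0]["text"].strip()

SUMMARIZE = ("You are a high quality text summarizer in the domain of cultural "
             "heritage knowledge. Summarize the most salient points of the "
             "following text...\n\nTEXT: ")  # abridged
EXTRACT = "List ONLY the relations of the types [...] in the text.\nTEXT: "  # abridged

article = "..."  # a long newspaper article
summary = complete(SUMMARIZE, article)
relations = complete(EXTRACT, summary)
```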

Using GPT3 this way involves tradeoffs. A relationship extraction model trained on annotated relationships would perhaps be more intelligible – we could understand why and how it returns what it does – but crafting one of those is outside our current ability. Our use of GPT3 is rather like what Simon Willison, a developer, has called ‘a calculator for words’. And we must always remember that GPT3 does not return the answers to questions, merely text that looks like what a good answer might be. By keeping things as tightly clamped down as we do, for very specific purposes, and through systematic experimentation with prompts, I think we’ve arrived at a point where we can use it to accelerate the creation of our knowledge graph with a good degree of confidence in the results.

In another post, we’ll describe what came next. But in short, we used the workflow we’ve described here on several thousand newspaper articles published over the last thirty years (but mostly the last decade) across English language media that intersected in some way with the antiquities trade. We took the extracted relationships and double checked them against the source text, and then filtered those results for the relationships we really wanted. We added those relationships to our existing knowledge graph (the one we made by hand) and created a knowledge graph embedding model with it. We’re now exploring the results of that, but so far, the things we expect to find are there; now it’s a question of, are the surprises reasonable or probable? What tips about the trade will it give us? Where will it direct our attention, and what will we find?