A few weeks ago Vianney Mixtur published a blog post and related code repository showing how he used the Mixtral model to pull together recipe data from semi-structured cookery information. We have been experimenting here for some time with a variety of approaches to creating knowledge graphs from academic articles, newspaper articles, auction catalogues, and other kinds of publications. The sheer variety of potential sources has always made things a bit hit and miss. However, the use of pydantic, a library for validating data so that it conforms to your schema, was a new approach for us.

We forked Vianney’s code and set out to modify it for our purposes. We began by modifying the ‘schema’ that the whole process would use. Normally, when we use an LLM to try to extract knowledge graph triples, the sheer variety of relationships the LLM determines can be overwhelming. Remember, it just picks the likeliest tokens given an input text, so it’s quite possible that it might return ‘loots’, ‘steals’, ‘obtains’, and ‘retrieved’ to describe the same relationship. By using a schema that pydantic validates the returned text against, we make things far more consistent. Here’s our schema:

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, Field


# Define RelationType Enum
class RelationType(str, Enum):
    is_the_owner_of = 'is_the_owner_of'
    works_with = 'works_with'
    works_for = 'works_for'
    has_possession_of = 'has_possession_of'
    purchases = 'purchases'
    buys_from = 'buys_from'
    sells = 'sells'
    makes_sale_to = 'makes_sale_to'
    donates_to = 'donates_to'
    obtains_from = 'obtains_from'
    comes_from = 'comes_from'
    has_immediate_family_member = 'has_immediate_family_member'
    legal_status_change = 'legal_status_change'
    has_role = 'has_role'


class Entity(BaseModel):
    """Entity schema"""
    name: str = Field(
        description="The name of the entity, representing various participants in the cultural heritage and art crime domain.",
        examples=[
            "John Doe — an actor",
            "Ancient Greek Vase — an artefact",
            "The Metropolitan Museum of Art — a museum",
            "Royal Canadian Mounted Police — an organization",
            "Internal Revenue Service — a government agency",
        ]
    )


class Relation(BaseModel):
    """Relation schema"""
    name: RelationType = Field(description="The specific name of the relationship that can exist between entities in the domain.")


class Triplet(BaseModel):
    """Triplet schema to represent entity-relation-entity structure"""
    entity1: Entity = Field(description="The SUBJECT entity in the triplet.")
    relation: Relation = Field(description="The relation PREDICATE connecting the first and second entities.")
    entity2: Entity = Field(description="The OBJECT entity in the triplet.")


class CulturalHeritageSchema(BaseModel):
    """Cultural Heritage Schema"""
    triplets: List[Triplet] = Field(
        description=(
            "A list of triplets, in grammatical subject – predicate – object order, "
            "where each triplet consists of two entities and the relation between them. "
            "Triplets capture the structured relationships within the cultural heritage domain."
        )
    )
```
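To see what that validation buys us, here is a minimal sketch using a cut-down, two-relation version of the schema (purely for illustration): pydantic happily parses a triple whose relation is in the enum, and raises `ValidationError` for one that is not.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, ValidationError


# Cut-down version of the schema, just to show the validation behaviour.
class RelationType(str, Enum):
    works_with = 'works_with'
    sells = 'sells'

class Entity(BaseModel):
    name: str

class Relation(BaseModel):
    name: RelationType

class Triplet(BaseModel):
    entity1: Entity
    relation: Relation
    entity2: Entity

class CulturalHeritageSchema(BaseModel):
    triplets: List[Triplet]

# A triple whose relation is in the enum validates cleanly...
good = {"triplets": [{"entity1": {"name": "John Doe"},
                      "relation": {"name": "sells"},
                      "entity2": {"name": "Ancient Greek Vase"}}]}
parsed = CulturalHeritageSchema(**good)

# ...but 'loots' is not in the enum, so pydantic raises ValidationError.
bad = {"triplets": [{"entity1": {"name": "John Doe"},
                     "relation": {"name": "loots"},
                     "entity2": {"name": "Ancient Greek Vase"}}]}
try:
    CulturalHeritageSchema(**bad)
    rejected = False
except ValidationError:
    rejected = True
```

This is the mechanism that turns the LLM's free-wheeling vocabulary into a closed set of predicates: anything outside the enum simply fails to parse.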

Then, in ‘prompts.py’, we set things up so that our text will be passed to a prompt that uses that schema to direct the LLM into the right possibility space, given the text we want it to process:

```python
DEFAULT_BASE_PROMPT = """
Extract information regarding cultural heritage crime from the following text.

In particular, please provide the following information:
- A list of triplets, where each triplet consists of two entities and the relation between them in grammatical subject – predicate – object order. Only use the predefined relations listed below and provide detailed descriptions for each entity.

The predefined relations are:
- 'is_the_owner_of': Denotes a business relationship where an actor controls, or is the legal owner of, a business, gallery, auction house, or other for-profit organization.
- 'works_with': Denotes a business relationship between actors who are dealers, organizations, looters, or collectors.
- 'works_for': Describes an employment or contractual relationship between actors who are dealers, organizations, businesses, museums, government agencies, looters, or collectors.
- 'has_possession_of': Describes a situation where a dealer, organization, collector, or auction house controls an artefact whether through ownership or other means.
- 'purchases': Describes a situation where a dealer, organization, collector, or auction house buys an artefact.
- 'buys_from': Describes a situation where a dealer, organization, collector, or auction house buys an artefact from a named actor.
- 'sells': Describes a situation where a dealer, organization, collector, or auction house sells A NAMED ARTEFACT.
- 'makes_sale_to': Describes a situation where a dealer, organization, collector, or auction house makes a sale to another ACTOR.
- 'donates_to': Describes a situation where a dealer, organization, collector, or auction house donates an artefact to another entity.
- 'obtains_from': Describes a situation where a dealer, organization, collector, or auction house obtains an artefact from another entity under unclear circumstances.
- 'comes_from': Describes a situation where the provenance of an artefact is attributed.
- 'has_immediate_family_member': Describes a direct familial relationship between two individuals.
- 'legal_status_change': Describes when an organization, business, person, etc., has come to the attention of law enforcement.
- 'has_role': Describes the role or roles an actor or organization can have.

{format_instructions}

Make sure to provide a valid and well-formatted JSON adhering to the given schema.

Content:
{content}
"""
```
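For context, the `{format_instructions}` placeholder is filled in at run time with JSON-formatting instructions derived from the pydantic schema, and `{content}` receives the text to be processed. A minimal sketch of how the template is assembled (the instruction text below is an illustrative stand-in, not what the repo actually generates):

```python
# Sketch of how the placeholders in DEFAULT_BASE_PROMPT get filled at run time.
# The real format_instructions are derived from the pydantic schema; this
# stand-in string is purely illustrative.
TEMPLATE = (
    "Extract information regarding cultural heritage crime from the following text.\n\n"
    "{format_instructions}\n\n"
    "Content:\n{content}"
)

format_instructions = "Return JSON matching the CulturalHeritageSchema."  # stand-in
content = "In 1972 the museum purchased an ancient vase from a dealer."   # stand-in

prompt = TEMPLATE.format(format_instructions=format_instructions, content=content)
```

The fully formatted prompt, schema instructions included, is what actually gets sent to the LLM for each chunk of text.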

The prompt and the schema, together with the pydantic validation, mean that the returned knowledge graph triples almost always conform to our schema. Give our notebook a try. A variety of LLMs can be used, courtesy of the langchain library (just comment/uncomment the relevant lines). We use the ‘secrets’ feature of the Google Colab environment to keep track of API keys. Successfully validated triples are written to a dataframe; triples that use a relationship NOT in the schema are written to a different dataframe so that the researcher can check them manually and re-write as appropriate. When we ran all this using GPT-4o on the Trafficking Culture encyclopedia, we ended up with a few thousand triples successfully validating (and only 80 that were not part of the schema), for a cost of $3.12.
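That routing step can be sketched without pandas. Assume the extracted triples come back as (subject, relation, object) tuples (a simplification of the repo's actual handling): anything whose relation falls outside the schema goes into a separate bucket for manual review.

```python
# Hypothetical routing of extracted triples: relations in the schema are kept,
# anything else is set aside for manual review. (The repo writes each bucket to
# its own dataframe; plain lists keep this sketch dependency-free.)
ALLOWED_RELATIONS = {
    "is_the_owner_of", "works_with", "works_for", "has_possession_of",
    "purchases", "buys_from", "sells", "makes_sale_to", "donates_to",
    "obtains_from", "comes_from", "has_immediate_family_member",
    "legal_status_change", "has_role",
}

# Invented example triples, for illustration only.
raw_triples = [
    ("John Doe", "sells", "Ancient Greek Vase"),
    ("John Doe", "loots", "Ancient Tomb"),  # 'loots' is not in the schema
]

validated = [t for t in raw_triples if t[1] in ALLOWED_RELATIONS]
to_review = [t for t in raw_triples if t[1] not in ALLOWED_RELATIONS]
```

Keeping the rejects rather than discarding them matters: an off-schema relation like ‘loots’ may well be worth re-writing into ‘obtains_from’ or similar by hand.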

To modify all this for your own schema, adjust the two files ‘schema.py’ and ‘prompts.py’ as appropriate, just the bits that define the relationships. In the notebook, these two files are re-imported each time the prompts are constructed and passed to the LLM, so that we can experiment with different schemas. Once you have something that works the way you want, you wouldn’t need that; you’d only import once in the initial configuration.
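The re-import trick is just Python's standard `importlib.reload`. A self-contained sketch of the pattern (it writes a throwaway module, `schema_demo.py`, to a temp directory so the example runs anywhere; in the notebook the module would be your edited `schema.py`):

```python
import importlib
import pathlib
import sys
import tempfile

# Sketch of the re-import pattern: edit the schema module between experiments,
# reload it, and the next prompt is built from the new relations.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "schema_demo.py").write_text("RELATIONS = ['works_with']\n")
sys.path.insert(0, str(tmp))

import schema_demo  # first import picks up the original definitions
assert schema_demo.RELATIONS == ["works_with"]

# Simulate the researcher editing the schema file between experiments...
(tmp / "schema_demo.py").write_text("RELATIONS = ['works_with', 'sells']\n")
importlib.reload(schema_demo)  # ...and reloading instead of restarting the kernel
```

Note that a plain `import` a second time would be a no-op (Python caches modules in `sys.modules`), which is exactly why the explicit reload is needed when experimenting.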

Repo: https://github.com/shawngraham/structured-cultural-heritage-crime/