The dataset for the Panama Papers is available here.
Wouldn’t it be great to examine it for people connected to the antiquities trade? In this post I try to do that, using our knowledge graph generated from the Trafficking Culture Encyclopedia. Behind that link you will find:
- Offshore Leaks (2013)
- Panama Papers (2016)
- Bahamas Leaks (2016)
- Paradise Papers (2017)
- Pandora Papers (2021)
The ICIJ people provide guidance for pulling those materials into Neo4j to explore. But what I want to do is go through the data, all these csv tables, and see if any of the people we know from our Trafficking Culture Knowledge Graph appear in there. So I started coding. Well, in truth, I started sketching out every discrete step in what I thought a workflow might look like, and then I started talking with GPT-4 Turbo to design and build my Lego bricks. My knowledge graph has columns ‘source’ and ‘target’, but the dataset has node and relationship tables. Any entity in relationships.csv is referred to by node_id, so I need to sift through the node tables, find matching names, find the relevant node ids, and then pull all that into a final csv file that I can merge with my original knowledge graph csv.
I end up with a four-step process: 1, compare names with the node tables; 2, find the relationships those nodes are implicated in, in the complete dataset; then a little manual futzing; 3, find the names for those nodes; 4, swap the node ids for names so the result can be merged with the knowledge graph.
1_compare.py. This script is a Python program that compares two CSV files to find matching entities. It uses command-line arguments to accept paths to two CSV files and an optional output file name. The first CSV file should have “source” and “target” columns, from which the program builds a set of unique entities. The second CSV file is then read row by row, and if any of the ‘name’ fields in a row match any of the entities from the first file, that row is written to an output CSV file.
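As a rough sketch of that exact-match logic (this is not the script from the gist; the function names `load_entities`, `matching_rows` and `compare_files` are placeholders made up for illustration, and only the ‘source’, ‘target’ and ‘name’ columns come from the description above):

```python
import csv


def load_entities(rows):
    """Build the set of unique names found in the 'source' and 'target' columns."""
    entities = set()
    for row in rows:
        for column in ("source", "target"):
            value = (row.get(column) or "").strip()
            if value:
                entities.add(value)
    return entities


def matching_rows(rows, entities):
    """Yield each node row whose 'name' is exactly one of the known entities."""
    for row in rows:
        if (row.get("name") or "").strip() in entities:
            yield row


def compare_files(graph_path, nodes_path, output_path):
    """Wire the two helpers to CSV files on disk; returns the number of matches."""
    with open(graph_path, newline="", encoding="utf-8") as f:
        entities = load_entities(csv.DictReader(f))
    with open(nodes_path, newline="", encoding="utf-8") as src, \
            open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [],
                                extrasaction="ignore")
        writer.writeheader()
        count = 0
        for row in matching_rows(reader, entities):
            writer.writerow(row)
            count += 1
    return count
```

Reading the node table row by row, rather than loading it whole, matters here because the node tables are large while the set of known entities is small.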
2_relationshipper.py. This script performs the following tasks: It reads the output csv from the previous step and creates a set containing unique ‘node_id’ values from this file. It then reads another CSV file named ‘relationships.csv’ from the ‘orig_data’ folder. As it reads ‘relationships.csv’, it checks each row to see if either ‘node_id_start’ or ‘node_id_end’ matches any ‘node_id’ in the set created from the step 1 output. If there’s a match, the row is considered a matching relationship. The script writes the matching relationships to a new CSV file named ‘output_relationships.csv’. It keeps a count of the number of matching relationships found. After processing all rows, it outputs a message indicating the number of matching relationships written to the output file or a message saying no matches were found if the count is zero.
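A minimal sketch of that filtering step might look like the following. The column names ‘node_id’, ‘node_id_start’ and ‘node_id_end’ are the ones described above; the function names are hypothetical, and the real script in the gist may be organised differently.

```python
import csv


def collect_node_ids(rows):
    """Gather the unique 'node_id' values from the step 1 output."""
    return {row["node_id"] for row in rows if row.get("node_id")}


def filter_relationships(rows, node_ids):
    """Yield relationships whose start or end node is one of node_ids."""
    for row in rows:
        if row.get("node_id_start") in node_ids or row.get("node_id_end") in node_ids:
            yield row


def write_matching_relationships(matches_path, relationships_path, output_path):
    """Filter the relationships file on disk and report how many rows matched."""
    with open(matches_path, newline="", encoding="utf-8") as f:
        node_ids = collect_node_ids(csv.DictReader(f))
    count = 0
    with open(relationships_path, newline="", encoding="utf-8") as src, \
            open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [],
                                extrasaction="ignore")
        writer.writeheader()
        for row in filter_relationships(reader, node_ids):
            writer.writerow(row)
            count += 1
    if count:
        print(f"{count} matching relationships written to {output_path}")
    else:
        print("No matching relationships found.")
    return count
```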
Then I do a manual step where I open output_relationships.csv in Sublime Text and use a regex: `^([^,]+),([^,]+)` to select the node start, node end values, which I then paste into a new file to pass to step 3. Yes, I could probably do this programmatically.
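For what it is worth, here is one way the manual step could be done in code. It is a sketch that assumes, as the regex does, that the first two fields of every row are the start and end node ids; the ‘source’ and ‘target’ header it writes is what the later step expects, and the function names are invented for the example.

```python
import csv


def extract_pairs(rows):
    """Return (start, end) pairs taken from the first two fields of each row."""
    pairs = []
    for row in rows:
        if len(row) >= 2:  # ignore blank or malformed lines
            pairs.append((row[0], row[1]))
    return pairs


def write_pairs(relationships_path, output_path):
    """Copy the first two columns of the relationships file into a new CSV."""
    with open(relationships_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        next(reader, None)  # skip the original header row
        pairs = extract_pairs(reader)
    with open(output_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["source", "target"])
        writer.writerows(pairs)
    return len(pairs)
```

Using the csv module instead of a regex also copes with quoted fields that happen to contain commas.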
3_compare_node_ids.py takes that new file and compares it against all of the node tables again, so that any new nodes turned up by step 2 get names to go with their node ids.
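The lookup amounts to building an id-to-name dictionary across several node tables. A hedged sketch, assuming each node table has ‘node_id’ and ‘name’ columns (the function names and the list of paths are hypothetical):

```python
import csv


def names_for_ids(rows, wanted_ids):
    """Map each wanted node_id to its name, for the rows of one node table."""
    return {
        row["node_id"]: row.get("name", "")
        for row in rows
        if row.get("node_id") in wanted_ids
    }


def scan_node_tables(paths, wanted_ids):
    """Run the lookup over every node table and merge the results."""
    found = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            found.update(names_for_ids(csv.DictReader(f), wanted_ids))
    return found
```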
4_last_step.py replaces the numeric IDs in the ‘source’ and ‘target’ columns of the first CSV (csv1, the one we created manually before script 3) with corresponding names using the mapping provided in the second CSV (the output of script 3). It writes the resulting data with named relationships to a new CSV file, which is named ‘named_relationships.csv’. This csv can then be combined with the knowledge-graph.csv we started with.
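In outline, the replacement is a dictionary lookup on each cell. The sketch below assumes the mapping file has ‘node_id’ and ‘name’ columns, and it leaves any id without a known name untouched; that fallback is my own choice for the example rather than something the original script is known to do.

```python
import csv


def name_relationships(pairs, id_to_name):
    """Swap ids for names; ids with no known name are left unchanged."""
    return [
        {
            "source": id_to_name.get(pair["source"], pair["source"]),
            "target": id_to_name.get(pair["target"], pair["target"]),
        }
        for pair in pairs
    ]


def write_named_relationships(pairs_path, mapping_path, output_path):
    """Read the id pairs and the id-to-name mapping, then write the named pairs."""
    with open(mapping_path, newline="", encoding="utf-8") as f:
        id_to_name = {row["node_id"]: row["name"] for row in csv.DictReader(f)}
    with open(pairs_path, newline="", encoding="utf-8") as f:
        named = name_relationships(list(csv.DictReader(f)), id_to_name)
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["source", "target"])
        writer.writeheader()
        writer.writerows(named)
    return len(named)
```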
Ta da!
Now, all of this depends on *exact* matches. So I redid script 1 to use a fuzzy-matching algorithm. With exact matches, I find just one individual. But with fuzzy matches, run like so:
python fuzzy_compare.py orig_data/knowledge-graph.csv orig_data/nodes-officers.csv --output_csv test2.csv --score_threshold 95
I end up with about 20 hits. Of course, then I inspect them manually, but this all seems like a very promising way to expand our ‘new organigram’ with the results of the offshore papers leak.
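To give a feel for what a score threshold does, here is a small illustration using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen so the example has no dependencies; it is not necessarily the matcher that fuzzy_compare.py uses, so its scores will not line up exactly with the ones behind the 20 hits. The function names are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two names, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_matches(names, entities, score_threshold=95):
    """Return (name, entity, score) triples scoring at or above the threshold."""
    hits = []
    for name in names:
        for entity in entities:
            score = similarity(name, entity)
            if score >= score_threshold:
                hits.append((name, entity, round(score, 1)))
    return hits
```

With this scorer, a one-letter difference such as "Jon Smith" against "John Smith" lands just under 95, so it is caught at a threshold of 90 but not at 95. Comparing every name against every entity is slow on a table the size of nodes-officers.csv, so a dedicated fuzzy-matching library is likely the better choice for real runs.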
Here’s a gist with the five scripts; improvements and suggestions welcome!
https://gist.github.com/shawngraham/8b099ae4992a57718cf66e2c3166a16d