Qualitative Analysis of Social Media Posts - A Workflow from OCR to Obsidian

Structure Emerges From Our Observations

In the past, we have applied many different distant reading techniques to social media posts related to the human remains trade. As part of a new analysis, we are exploring a computer-facilitated close reading of social media posts. Here we detail a workflow that assumes one already has a collection of images that capture posts and their associated texts (perhaps as the outcome of a screenshotting process using something like this tool from Simon Willison).

OCR

The first step is to run OCR on the screenshots. Several languages might be present in the screenshots, and the person posting might have used techniques to obfuscate keyword searching or indeed OCR (emojis, character swaps, and spaces between the letters of misspelled words). Our approach currently uses PaddleOCR, which provides neural-network models trained for 80+ languages. We also tried EasyOCR, but the output broadly seemed better with PaddleOCR. This Colab notebook allows one to upload a zipped file of images, iterate over them performing the OCR, and eventually get a CSV with all the output data (and scores), as well as annotated images showing where the text it found was located.
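The notebook's overall loop (unzip, OCR each image, write rows to a CSV) can be sketched roughly as follows. This is a minimal sketch and not the notebook's actual code: the function name `ocr_zip_to_csv`, the column names, and the `recognize` callable are all assumptions made for illustration. The `recognize` argument stands in for whatever OCR call you use (for us, a wrapper around PaddleOCR, whose exact API differs between versions and so is not shown here); it only needs to return pairs of text and score for a given image.

```python
import csv
import zipfile
from pathlib import Path
from typing import Callable, Iterable, Tuple

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}

# Hypothetical interface: a recognizer takes an image path and returns
# (text, score) pairs, one per line of text it detected.
Recognizer = Callable[[Path], Iterable[Tuple[str, float]]]


def ocr_zip_to_csv(zip_path: Path, work_dir: Path, out_csv: Path,
                   recognize: Recognizer) -> int:
    """Unzip screenshots, OCR each one, and write one CSV row per text line.

    Returns the number of data rows written (the header is not counted).
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(work_dir)

    # Only files with an image extension are passed to the recognizer.
    images = sorted(p for p in work_dir.rglob("*")
                    if p.suffix.lower() in IMAGE_SUFFIXES)

    rows = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "text", "score"])  # assumed column names
        for image in images:
            for text, score in recognize(image):
                writer.writerow([image.name, text, f"{score:.3f}"])
                rows += 1
    return rows
```

Keeping the OCR engine behind a plain callable makes it easy to swap PaddleOCR for another engine and compare the resulting CSVs.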

CSV -> Obsidian

The next step is to get that data into our qualitative analysis environment. We’re using a gently modified version of Ryan J. A. Murphy’s dedicated Obsidian vault for the purpose. The only plugin we’ve had to install is the ‘dataview’ plugin. We use our own custom Python script to read each row of the OCR output and then write it to a markdown file in the appropriate place in the Obsidian vault:

python markdowner.py to-obsidian/image-output.csv '/Users/username/Dropbox/qualitative-analysis-vault/data points/group-1'

Our script uses this little template to format our data points (i.e., the observations on each individual post that we ran OCR on):

```python
markdown_content = (
    f"![[{filename}]]\n\n"
    "```\n"
    f"{ocr_text}\n"
    "```\n"
    "## encoding \n"
)
```

If you then drop the images into the vault as well (you can even put them in a subfolder there, to keep things tidy), Obsidian will find the images. Each individual data point (note file) will have the same name as the image file. It will display the image, the OCR transcript, and then a subheading under which you can put your qualitative analysis as you go.
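For readers who want to build something similar, here is a minimal sketch of what a script like `markdowner.py` might do. It is not our actual script: the function names and the CSV column names (`filename`, `text`) are assumptions, as is the idea that the CSV holds one row per detected line of text, so that rows are grouped by image before writing.

```python
import csv
from collections import defaultdict
from pathlib import Path

FENCE = "`" * 3  # a Markdown code fence, built here to avoid nesting fences


def build_note(filename: str, ocr_text: str) -> str:
    """Format one data point: embedded image, transcript, encoding heading."""
    return (
        f"![[{filename}]]\n\n"
        f"{FENCE}\n"
        f"{ocr_text}\n"
        f"{FENCE}\n"
        "## encoding \n"
    )


def csv_to_notes(csv_path: Path, out_dir: Path) -> list:
    """Write one markdown note per image listed in the OCR CSV."""
    # Assumed layout: columns "filename" and "text", one row per OCR line.
    lines_by_image = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            lines_by_image[row["filename"]].append(row["text"])

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, lines in lines_by_image.items():
        # The note takes the image's name, with .md as its extension.
        note_path = out_dir / f"{Path(filename).stem}.md"
        note_path.write_text(build_note(filename, "\n".join(lines)),
                             encoding="utf-8")
        written.append(note_path)
    return written
```

Calling `csv_to_notes(Path("to-obsidian/image-output.csv"), Path("data points/group-1"))` from inside the vault folder would then mirror the command shown above.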

Qualitative Analysis & Some Stats

Imagine you’ve gotten to that stage. You would then open the vault, click on the first data point, read the transcript and look at the image (making any corrections in the transcript if necessary), and then start making your observations on what discourses are present.

If you detected, say, a discourse about ethics, you could just start typing: [[ethics]]. Obsidian will understand this to be a link to a new note that doesn’t yet exist. Click on it, and Obsidian makes a new note called ‘ethics’. You can then write any other observations in that note about what ‘ethics’ means in the context of the dataset you’re working on. You move that ethics note to the subfolder called ‘codes’ (you can do that now, or do it later). You continue in this way, moving through the different data points. You might add more observations. You might decide that ‘ethics’ isn’t quite the right word, so you go to the ‘ethics’ note and rename it. All of your observations on the data points will update accordingly. Neat, eh? So you iterate your way through this process.

But how do those observations distribute? How do they cluster? Well, first make sure that all of your encoding notes are in the ‘codes’ subfolder. The vault as configured by Murphy has a note that calculates statistics for how many discourses are present in your data points. It does this via a dataview query that looks at the ‘codes’ subfolder and counts the inbound links that point to any notes within that folder. And if any of those notes have links that point to other notes – maybe on your ‘ethics’ note you have a link to a related idea, ‘AreWeTheBaddies’ – the dataview query will also tot up those outbound links. (Incidentally, this also makes the graph view in Obsidian a handy way to see how the discourses map to your dataset.) Obsidian will always update the results of this query if you have the ‘live preview’ (the default) turned on.

For our purposes, we wanted to have statistics according to subfolders in our data points. While the original query was quite straightforward, this seemingly minor modification was not. Here’s our code:

```dataview
TABLE
  length(filter(file.inlinks, (i) => contains(i.file.folder, "data points/group-1"))) AS "Data Points",
  filter(file.outlinks, (o) => contains(o.file.folder, "codes")) AS "Related Themes"
FROM "codes"
WHERE contains(file.inlinks.file.folder, "data points/group-1") or contains(file.outlinks.file.folder, "codes")
SORT length(filter(file.inlinks, (i) => contains(i.file.folder, "data points/group-1"))) DESC
```

We can also modify the query to return not the count of links, but the actual links to the relevant files, like so:

```dataview
TABLE
  filter(file.inlinks, (i) => contains(i.file.folder, "data points/group-1")) AS "Data Points"
FROM "codes"
WHERE contains(file.inlinks.file.folder, "data points/group-1")
SORT length(filter(file.inlinks, (i) => contains(i.file.folder, "data points/group-1"))) DESC
```

The code was developed from code shared in this Obsidian forum post.
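If you want to sanity-check the "Data Points" counts outside Obsidian, the same idea can be approximated by scanning the markdown files directly. The sketch below is an assumption-laden stand-in, not what Dataview does internally: the function name `count_code_links` is hypothetical, it matches links by note name only, and it is case-sensitive, which Obsidian's own link resolution is not.

```python
import re
from collections import Counter
from pathlib import Path

# Capture the link target of [[target]], [[target|alias]] or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def count_code_links(vault: Path, group: str, codes: str = "codes") -> Counter:
    """Count how many data point notes in `group` link to each code note."""
    code_names = {p.stem for p in (vault / codes).glob("*.md")}
    counts = Counter({name: 0 for name in code_names})
    for note in (vault / group).rglob("*.md"):
        text = note.read_text(encoding="utf-8")
        # A set, so a note that mentions a code twice is still counted once.
        targets = {match.strip() for match in WIKILINK.findall(text)}
        for name in targets & code_names:
            counts[name] += 1
    return counts
```

For example, `count_code_links(Path("qualitative-analysis-vault"), "data points/group-1").most_common()` would list the codes from most to least frequent for that group, which should broadly agree with the sorted table.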

There You Go

That’s our workflow for getting our material into our qualitative analysis vault. Murphy concludes:

When reporting on your themes, these counts give you some starting points. Why do the top themes appear so frequently? Why are the bottom ones rare? Speculate, theorize, debate, and discuss!

[…]It’s your job to report on the relevance of the themes you’ve uncovered to your research question(s). The memos you’ve already developed about each theme give you a starting place for commentary. Build on those, and use any exemplary cases you’ve embedded to illustrate your points. If you need to look back at your data, the backlinks from your themes will point to all of the data points you coded with a given theme.

You may also want to examine themes that seem to have common data points in mind. Why did these overlaps happen? What might they mean?
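One way to start looking for the overlaps Murphy mentions is to count how often two codes appear on the same data point. This is only a sketch under the same assumptions as before: the function name `theme_overlaps` is hypothetical, links are matched by note name, and matching is case-sensitive.

```python
import re
from collections import Counter
from itertools import combinations
from pathlib import Path

# Capture the link target of [[target]], [[target|alias]] or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def theme_overlaps(vault: Path, group: str, codes: str = "codes") -> Counter:
    """Count, for each pair of codes, the data points that carry both."""
    code_names = {p.stem for p in (vault / codes).glob("*.md")}
    pairs = Counter()
    for note in (vault / group).rglob("*.md"):
        text = note.read_text(encoding="utf-8")
        present = {match.strip() for match in WIKILINK.findall(text)} & code_names
        # Sorting makes each pair appear in one canonical (alphabetical) order.
        pairs.update(combinations(sorted(present), 2))
    return pairs
```

Pairs with high counts are candidates for the "why did these overlaps happen?" question; the counts themselves say nothing about why.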

Friday, December 15, 2023 | Categories: News
Post tagged with Bonetrade
Short URL: https://carleton.ca/xlab/?p=182

X-Lab
