I don’t think this counts as a New Year’s resolution, but I’ve been meaning to play around with Gephi for a while now. The biggest hurdle for me was finding sources that were conducive to network analysis. My research does not rely heavily on personal correspondence, so I turned to the Canadian Letters and Images Project, based at Vancouver Island University. The project has built a corpus of transcribed letters from Canadian soldiers in wartime, which are grouped into collections. The letters we find in archives are usually saved by a single person, so most of the collections consist of letters exchanged between two people. After lightly perusing the website, I found a collection of about 40 letters written by three brothers of the Rooke family, Robert, Charles, and George, who enlisted during the Boer War.
From Letters to Relational Database
Gephi is a popular tool among Digital Humanists for creating network visualizations. Simply put, it takes a relational database, usually compiled in a spreadsheet or .csv file, and turns it into a visual image that reveals weighted relationships in the data. The obvious relational database to build out of this collection was a spreadsheet of who wrote to whom. But with only 40 letters in the collection and only three letter-writers, the resulting network diagram is probably going to be a little underwhelming. I decided to create a second spreadsheet recording who mentions whom in their letters. That should be a bit more interesting.
To populate these spreadsheets I could have read each letter and typed in names as I found them, but that’s a little more work than I’m prepared to do. Instead, I used the Stanford Named Entity Recognizer to tag all of the person names in the letters. Assuming that the first name to appear in each letter was the addressee and the last name to appear in the letter was the writer, who signed his name at the bottom, it would be pretty easy to populate both of my spreadsheets.
For one letter, the NER extracted these names:
Already, we can see a few of the challenges of relying on the NER. The first name in this letter is Cleary, because the header of the letter reads:
“Dundonald’s Brigade, Cleary’s Division, South Africa”
So in this case the addressee, “Mother,” is the second name mentioned in the letter. But the last name is “Charlie,” who wrote the letter. We also see that a few of the names are not people names. “P. Maritzburg” is actually a place and “Boers” is more of a group of people than a single person. So the data would need a bit of cleaning, but that’s always the case with these things.
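To make the tagging step concrete, here is a minimal sketch of pulling the person names out of a tagged letter. It assumes the tagger was run with inline XML output, so that each name is wrapped in <PERSON> tags; the post does not say which output format was used. The function name and the sample text are invented for illustration and only loosely follow the letter discussed above.

```python
import re

# Assumption: person names arrive wrapped in <PERSON>...</PERSON> tags
# (inline XML output). Other output formats would need a different parser.
PERSON_TAG = re.compile(r"<PERSON>(.*?)</PERSON>", re.DOTALL)


def person_names(tagged_text):
    """Return the tagged person names in the order they appear."""
    # Collapse internal whitespace so names split across lines stay intact.
    return [" ".join(match.split()) for match in PERSON_TAG.findall(tagged_text)]


# Invented sample text, not a transcription of a real letter.
sample = (
    "Dundonald's Brigade, <PERSON>Cleary</PERSON>'s Division, South Africa\n"
    "Dear <PERSON>Mother</PERSON>, we marched out of "
    "<PERSON>P. Maritzburg</PERSON> last week.\n"
    "Your loving son, <PERSON>Charlie</PERSON>"
)

names = person_names(sample)
print(names)
```

The list keeps names in document order, which is what the first-name and last-name assumption depends on. It also keeps mis-tagged entries such as the place name, so the manual clean-up is still needed.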
I wrote a script that took the first and last names mentioned in each letter and compiled them into a csv file, then quickly corrected any of the suspicious-looking addressees or signatories.
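A script along these lines might look like the sketch below. It is not the original script: the function name, the letter ids, and the second and third sample letters are hypothetical, and the first sample is a shortened version of the names discussed above. The "Source" and "Target" headers are used on the assumption that Gephi's spreadsheet importer reads them as the two ends of an edge.

```python
import csv
import io


def correspondence_rows(names_by_letter):
    """Yield (writer, addressee) for each letter.

    Heuristic from the post: the first tagged name is the addressee and
    the last tagged name is the writer who signed the letter.
    """
    for names in names_by_letter.values():
        if len(names) < 2:
            continue  # not enough names to guess both roles
        yield names[-1], names[0]


# Hypothetical input: tagged names per letter, in order of appearance.
names_by_letter = {
    "letter_01": ["Cleary", "Mother", "P. Maritzburg", "Boers", "Charlie"],
    "letter_02": ["Jim", "George"],
    "letter_03": ["Charlie"],
}

buffer = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Source", "Target"])
writer.writerows(correspondence_rows(names_by_letter))
edge_csv = buffer.getvalue()
print(edge_csv)
```

The first sample row comes out as Charlie writing to Cleary, which is exactly the kind of suspicious-looking addressee that has to be corrected by hand.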
Then I wrote another script that compiled all of the names in each letter into another csv file. I opened that csv in Excel, created a new column and manually entered the signatory of the letter next to each name they mention in their letter. That took a bit of time, but it was probably faster than trying to figure out how to do that with a script.
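For comparison, here is a rough sketch of how the signatory column could be filled in by a script instead of by hand. It assumes the last tagged name in each letter is the signatory, so the output would still need the same manual checking. The sample letters are hypothetical, and the "Weight" column is an assumption about what Gephi's importer accepts for edge weights.

```python
import csv
import io
from collections import Counter


def mention_counts(names_by_letter):
    """Count (signatory, mentioned name) pairs across all letters."""
    counts = Counter()
    for names in names_by_letter.values():
        if not names:
            continue
        signatory = names[-1]  # assumption: the writer signs last
        for mentioned in names[:-1]:
            counts[(signatory, mentioned)] += 1
    return counts


# Hypothetical input: tagged names per letter, in order of appearance.
names_by_letter = {
    "letter_01": ["Mother", "Stan", "Jim", "Charlie"],
    "letter_02": ["Mother", "Stan", "Charlie"],
}

buffer = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Weight"])
for (signatory, mentioned), weight in sorted(mention_counts(names_by_letter).items()):
    writer.writerow([signatory, mentioned, weight])
mentions_csv = buffer.getvalue()
print(mentions_csv)
```

Counting repeated pairs gives each edge a weight, so someone mentioned in many letters ends up with a heavier connection than someone mentioned once.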
The next thing to do was to review both spreadsheets to confirm that all of the data was accurate. It took a bit of time to go through the second spreadsheet and eliminate all of the place names that were incorrectly tagged as people, but that definitely took much less time than typing out everything by hand.
The last thing to do was create a csv file with everyone’s names in it, with one column for the label that is going to appear on each person’s node in Gephi and another column for the type of node they are going to represent.
I classified each person as one of three things: a family member, a friend, or a member of the military. These were just guesses based on the content of the letters; I did not do any further research to confirm that someone was only a friend and not a family member.
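A node file like this could be sketched as follows. The "Id" and "Label" headers are used on the assumption that Gephi's spreadsheet importer recognizes them for a nodes table. The "Category" column name, the function name, and the sample classifications are illustrative guesses, not the actual data. Anyone without a classification is flagged rather than guessed at.

```python
import csv
import io


def node_rows(names, categories):
    """Yield (Id, Label, Category) once per distinct name, sorted by name."""
    for name in sorted(set(names)):
        # Names with no manual classification are flagged for review.
        yield name, name, categories.get(name, "unclassified")


# Illustrative classifications only.
categories = {
    "Charlie": "family",
    "Mother": "family",
    "Lord Kitchener": "military",
}
names = ["Charlie", "Mother", "Stan", "Lord Kitchener", "Charlie"]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Id", "Label", "Category"])
writer.writerows(node_rows(names, categories))
nodes_csv = buffer.getvalue()
print(nodes_csv)
```

Using the name itself as the id keeps the node file in step with the edge files, as long as each person is spelled the same way in all of them.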
I imported these into Gephi in two batches, the first one to create a network diagram that showed who wrote letters to who:
As expected, this one is a little plain. It does show that most of the letters were written by Charlie, and most of the letters were received by their mother. There are a few letters written from George to Jim, and one letter written from Charlie to Eva. No huge revelations here, but it’s good to know where letters are coming from when considering the next set: who mentions who in their letters.
The diagram doesn’t come out very well in the screen-capture but it’s clear that Charlie is making the most mentions in his letters. That’s not surprising, because we saw in the last diagram that he wrote most of the letters. The colors reflect the three categories I assigned to each node: family in blue, friends in green, and military in red. So we see that Charlie mentions Stan quite a bit in his letters, as well as Jim, George, and Vic. George seems to mention his mother quite a bit, but most of George’s letters were written to his mother so most of these mentions are probably him addressing her in his letters.
What also comes out really well in the visualization is that there are far more mentions of friends and family than members of the military. Most of the people mentioned are mutual acquaintances who are either still at home, encountered in transit, or serving overseas with the three brothers. This isn’t to say that the content of the letters ignores events at the front – Charlie’s mentions of Lord Kitchener reflect how much he discusses the course of the war. Nevertheless, the diagram reveals the importance of letter writing in maintaining connections with friends and family.
Maybe not a ground-breaking discovery, but it was a fun introduction to Gephi. I’d like to experiment with bigger data sets in the future, particularly ones that combine sources from different collections, to see how Gephi generates more complex data visualizations.