This semester, with the Modernist Versions Project (MVP), I have been creating a repository of modernist texts for the purposes of text analysis and machine learning. The scope of this project requires a powerful infrastructure, including hardware, software, and technical support, provided in part by Compute Canada, a high-performance computing resource platform for universities and institutions across Canada. The plan is to first aggregate a significant number of modernist texts (in TXT format) and, once we have a working repository (in late 2013 / early 2014), mobilize computer vision and machine learning techniques to infer as yet unseen patterns across modernism. Also, in collaboration with Adam Hammond (2012-13 MVP postdoc), we are exploring the possibilities of a Turing test for modernism. This test would follow our machine learning work and, if nothing else, be a playful experiment in the spirit of modernist artifice.
But producing a repository from web-based materials is quite tricky. As part of his work for the Routledge Encyclopedia of Modernism, Stephen Ross has created an impressively thorough list of modernist authors that we will use to amass modernist texts housed across the web. As one might imagine, electronic texts are not always “clean,” and they don’t always have sufficient metadata. And even when repositories like Project Gutenberg Australia have relatively clean text files, their selection is limited due to copyright (among other reasons). As such, the version of modernism most people currently access through popular online repositories like Gutenberg Australia often doesn’t contain important works by notable women writers and people of colour. In Project Gutenberg Australia, there is no Nella Larsen, Zora Neale Hurston, or Langston Hughes. No Dorothy Richardson or Djuna Barnes. Formally, there is also no poetry. So we don’t get Ezra Pound or T. S. Eliot, either. Put differently, Gutenberg Australia’s version of modernism appears to be very different from the version most North American students will encounter in, say, a university course. That said, we are not relying on just one repository for this work, and we hope that scripts written in collaboration with Compute Canada will allow us to be comprehensive and equitable in our articulation of modernism, especially where difference is concerned. We also hope to fill in gaps where possible, either by adding our own texts to existing repositories or by conducting more research on modernist writers who are not (yet) discoverable on the web. In this regard, we are especially inspired by digital humanities practitioners Amy Earhart and Susan Brown.
To get started on this MVP project, I’ve begun meeting with Jentery Sayers, Stephen Ross, and Belaid Moa, who is one of Compute Canada’s HPC Specialists from the WestGrid sector. Belaid has been extremely helpful, guiding me through the WestGrid system and showing me how to develop a Python script that will grab modernist texts from an array of online repositories. Our script needs to locate the required texts within the HTML tree structure of the repository sites, download them, and store them in the Compute Canada database. To develop this script, I am learning the syntax and semantics of arrays, functions, strings, and regular expressions. As we are told over and over, code is a language, and so I look forward to becoming better versed in the intricacies of Python throughout the year.
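To make those three steps (locate, download, store) a little more concrete, here is a minimal sketch of how such a script might be shaped. It is not the actual MVP script. It assumes a repository index page that links directly to plain-text files, and it saves files to a local folder instead of Compute Canada storage. The names LinkCollector, find_text_links, download, and SAMPLE_PAGE, along with the example.org address, are hypothetical and used only for illustration. Only Python's standard library is involved.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag met while walking the HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_text_links(html, base_url):
    """Return absolute URLs of links whose path ends in .txt."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if urlparse(url).path.lower().endswith(".txt"):
            links.append(url)
    return links


def download(url, dest_dir):
    """Save one file into dest_dir (assumed local) and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / Path(urlparse(url).path).name
    with urlopen(url) as response:
        target.write_bytes(response.read())
    return target


# Hypothetical index page, inlined so the parsing step runs offline.
SAMPLE_PAGE = """
<html><body>
  <a href="/texts/novel-one.txt">Novel One</a>
  <a href="about.html">About</a>
  <a href="novel-two.TXT">Novel Two</a>
</body></html>
"""

if __name__ == "__main__":
    base = "https://example.org/authors/index.html"
    for link in find_text_links(SAMPLE_PAGE, base):
        print(link)
```

In practice, each repository lays out its pages differently, so the filtering rule inside find_text_links would likely need to be adapted per site, and any real crawl should respect the repository's terms of use. Calling download(link, "texts") on each result would then fetch the files into a folder named texts.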
For now, the idea is to start small (i.e., with twenty novels) and see how well the analysis scales up when more modernist texts are included. I chose the twenty novels we are tentatively planning to use based on availability, university syllabi, and MVP familiarity with them. We plan to run basic machine learning methods on this sample of texts in order to determine commonalities, differences, and tendencies across them. To be honest, I wonder whether the computer will be as confused about modernism as I am. I wonder if we’ll agree about what makes a passage important or interesting, whether we’ll get tripped up in the same sentences, whether we will come to the same conclusions about a given text. In short, how will this computational approach challenge the ongoing assumptions of literary scholars? More from me soon.