Jana Millar Usiskin – MLab in the Humanities

Making Models of Modernism

Jana Millar Usiskin — Mon, 12 May 2014 19:52:33 +0000

This semester, with the Modernist Versions Project and the Maker Lab, Belaid Moa (Compute Canada) and I have been topic modelling modernist texts. In doing this work, we are hoping to identify heretofore unidentified patterns, both thematic and stylistic, across a (for now, admittedly small) corpus of modernist texts.

Topic modelling assumes authors create documents using collocated clusters of words. By working “backward,” computer algorithms sort the words from a set of pre-processed documents and generate lists of words that comprise these clusters. In our work, we are using the LDA (Latent Dirichlet Allocation) probabilistic model. This rather popular model operates on the Bayesian method of inference, a mathematical concept that works backward from an observed set of data to calculate the probability of certain conditions being in place in order to produce that set of data. In other words, it depends on a notion of causality and asks what circumstances need to be in place in order for certain results to occur.
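The "working backward" described above can be written compactly as Bayes' rule. The symbols are labels chosen here for illustration and do not come from the project: D stands for the observed data (the words in the documents) and θ for the hidden conditions (the topic structure) assumed to have produced them.

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
```

Read left to right, the probability of a topic structure given the texts is proportional to how likely that structure would be to generate those texts, weighted by how plausible the structure was to begin with.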

The MALLET package (an open source application developed primarily by Andrew McCallum at the University of Massachusetts at Amherst) provides an implementation of Gibbs sampling, parameter optimization, and tools for inferring topics from trained models. These affordances let the researcher alter the distribution of topics across documents, and the distribution of words across topics. That is, we can adjust our model to achieve more interesting results. We are interested in a model that is, as Julia Flanders describes it, a “strategic representation,” which might “distort the scale so that we can work with the parts that matter to us” (“The Productive Unease of 21st-century Digital Scholarship”).
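For readers curious about what Gibbs sampling does, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is a toy under stated assumptions, not MALLET's implementation: the function name `lda_gibbs`, the parameter values for `alpha` and `beta`, and the invented four-word "documents" are all illustrative.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. alpha and beta are the smoothing
    parameters for topics-per-document and words-per-topic; the values
    here are arbitrary illustrative defaults.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning every word token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight each topic by how much this document uses it
                # and by how much the topic uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Invented toy "documents", far too small to say anything about modernism.
toy_docs = [
    ["time", "day", "night", "time"],
    ["eyes", "face", "hand", "head"],
    ["time", "night", "face", "eyes"],
]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top)
```

Changing `alpha` shifts how evenly topics are spread across documents, and changing `beta` shifts how evenly words are spread across topics. These are the two distributions the paragraph above describes adjusting.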

For the purposes of our very preliminary study, we are examining word trends across a corpus but also, at least to some extent, narrative tendencies. In so doing, we employed MALLET’s stop words list, which allows the algorithm to ignore common “function words” (i.e., adverbs, conjunctions, pronouns, and prepositions). The idea is to eliminate words that carry little thematic weight. Following a method outlined by Matthew Jockers and advocated by Belaid Moa, we also removed character names where possible. While it would likely be interesting to look at the ways MALLET reads texts without any intervention, for our purposes character names made it harder to express tendencies across the novels. However, we did not employ Jockers’s method in its entirety. In some cases, he uses a noun-based approach, eliminating all parts of speech except for nouns. But we felt that, at least for now, including verbs and adjectives was important for revealing aspects of narrative. Jockers also advocates chunking texts, but we were interested in the ways the algorithm would read entire novels as documents.
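The pre-processing described above (dropping function words and character names while keeping verbs and adjectives) can be sketched in a few lines of Python. This is an illustration only: the two word sets are tiny stand-ins invented for the example, whereas MALLET ships its own much longer stop list, and the function name `preprocess` is hypothetical.

```python
import re

# Tiny illustrative stand-ins; these are NOT MALLET's actual stop list
# or the project's actual list of character names.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in",
              "she", "he", "her", "would", "herself"}
CHARACTER_NAMES = {"mrs", "dalloway", "clarissa"}


def preprocess(text, stop_words=STOP_WORDS, names=CHARACTER_NAMES):
    """Lowercase the text, split it into words, and drop stop words and names."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in stop_words and t not in names]


print(preprocess("Mrs. Dalloway said she would buy the flowers herself."))
# prints ['said', 'buy', 'flowers']
```

Note that the verbs "said" and "buy" survive the filter, in keeping with the decision to retain more than nouns.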

While our repository of modernist texts has been growing, we limited this preliminary study to a corpus of thirty-two early twentieth-century texts, formatted as TXT files, to come up with a profile of the most prominent topics identified by LDA. The algorithm is interested in finding the topics that can be used to correlate all the texts as well as the topics that can be used to distinguish between individual texts. The top three topics that are evenly distributed throughout the corpus—here showing the first nine words—are:

time, felt, day, looked, knew, work, face, hand, night
eyes, face, life, time, white, dark, round, hand, head
men, people, began, room, house, talk, suddenly, end, years

When reading these topics, we might want to consider that, according to the algorithm, these words are not only more frequent within the corpus, but have a greater chance of appearing near each other. As well, the top three or four words are considerably more heavily weighted than later words. Unsurprisingly, time seems to play a significant role in all the categories. Thus we might ask how the categories each tell us something unique about time and temporality. In the first topic, the verbs are all in the past tense. Notably, the second topic arguably contains no verbs, with “face,” “hand,” “eyes,” and “head” possibly being exceptions. The fragmented body parts also reveal an interesting slippage between time embodied in humans and time embodied in objects. For instance, as Stephen Ross notes, the results do not distinguish between the face or hand of a clock and the face or hand of a human character. The third category suggests that critics might want to consider how the durations of events (especially their beginnings and endings) are situated, and how spatiotemporal concepts shift across texts. For instance, when and where do people become men? Or do houses and rooms become years?

MALLET also shows us the relative weight of these word collocations across the novels.

Temporality Past

Temporality Embodied

The Temporality of Place

The third category, labelled “The Temporality of Place,” appears more prominently in Howards End, Mrs. Dalloway, A Passage to India, The Great Gatsby, and Heart of Darkness. We might ask how these texts in particular focus on the temporality of particular physical environments. On the other hand, I wonder why The Waves, Ulysses, and Women in Love do not seem to engage as fully with the first category, labelled “Temporality Past.” We might also ask how the novels of D. H. Lawrence seem to best exemplify the second category, labelled “Temporality Embodied.” Through LDA, do we get any sense of overlap between the ways people and objects embody time?

Building on MALLET’s algorithms, Belaid Moa has also written scripts that allow us to cluster texts according to perceived similarities and differences. Many readers will notice that Howards End and Mrs. Dalloway are similar when it comes to “The Temporality of Place,” but that topic 12 (people street feel leaves trees window room green door) is considerably more prevalent in Mrs. Dalloway than in Howards End. Moa’s script projects all these differences and similarities and allows us to see the texts clustered according to MALLET’s assigned topics.
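One common way to compare texts by their topic weights is cosine similarity between each text's vector of topic proportions. The sketch below is not Moa's script, whose method is not described here; it assumes cosine similarity purely for illustration, and the proportions for `text_a`, `text_b`, and `text_c` are made-up numbers, not results from our corpus.

```python
import math


def cosine_similarity(p, q):
    """Return the cosine of the angle between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


# Hypothetical topic proportions for three unnamed texts (invented values).
topic_proportions = {
    "text_a": [0.6, 0.3, 0.1],
    "text_b": [0.5, 0.4, 0.1],
    "text_c": [0.1, 0.1, 0.8],
}

for name in ("text_b", "text_c"):
    score = cosine_similarity(topic_proportions["text_a"], topic_proportions[name])
    print("text_a vs", name, round(score, 3))
```

With these invented numbers, `text_a` comes out much closer to `text_b` than to `text_c`. A clustering routine would then group texts whose pairwise similarities are high.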

The Multiple Dimensions of Topics in Modernism

Given our current data set, these are the clusters we identified with LDA, with the exemplar being the text most central to that particular cluster:

Cluster 1, exemplar Tender is the Night:
The Awakening (Chopin), Heart of Darkness (Conrad), Lord Jim (Conrad), The Secret Agent (Conrad), The Great Gatsby (Fitzgerald), Tender is the Night (Fitzgerald), The Trial (Kafka), Babbitt (S. Lewis), Tarr (W. Lewis), 1984 (Orwell), Burmese Days (Orwell), The Autobiography of Alice B. Toklas (Stein), Twilight Sleep (Wharton)

Cluster 2, exemplar Ulysses:
Nightwood (Barnes), A Passage to India (Forster), Tess of the D’Urbervilles (Hardy), Dubliners (Joyce), A Portrait of the Artist as a Young Man (Joyce), Ulysses (Joyce)

Cluster 3, exemplar Seven Pillars of Wisdom:
Seven Pillars of Wisdom (T. E. Lawrence)

Cluster 4, exemplar Time Regained:
The Ambassadors (James), The Captive (Proust), Time Regained (Proust)

Cluster 5, exemplar Mrs. Dalloway:
The Good Soldier (Ford), Howards End (Forster), Sons and Lovers (D. H. Lawrence), Women in Love (D. H. Lawrence), Of Human Bondage (Maugham), Three Lives (Stein), The Picture of Dorian Gray (Wilde), Mrs. Dalloway (Woolf), The Waves (Woolf)

What subcategories of early twentieth-century modernism do these clusters suggest? How do we define these clusters for modernist literary criticism? Do they actually suggest anything, including temporal, geographic, or stylistic tendencies? How might these clusters compare with models constructed for, say, Victorian novels? These are questions we are also experimenting with, and we look forward to exploring further as we continue this work.

Post by Jana Millar Usiskin in the ModVers category, with the versioning tag. Images for this post care of Jana Millar Usiskin.

Counting Virginia Woolf

Jana Millar Usiskin — Thu, 10 Apr 2014 17:53:19 +0000

In my last post, “Making Modernism Big,” I ended by asking how a computer might read modernism. During the last few months, this question has informed the work I’ve been doing with computer scientist Belaid Moa. In preliminary attempts to articulate an answer, Belaid and I have been exploring what is possible with Python, a flexible, extensible, and high-level programming language that allows us to give instructions to the computer, essentially teaching it how to read.

Using the texts of Virginia Woolf, Belaid and I are focused on a rather basic computational practice: counting. The computer excels at counting what it reads. Our computer, with the help of Python, Beautiful Soup (a Python library for parsing HTML), and a few regular expressions, can now find the highest-frequency words in Woolf’s novels, from The Voyage Out to Between the Acts, or the number of questions in each (805 in The Voyage Out, 473 in Between the Acts). It can find the frequencies of first words in each sentence of The Waves, or the last words. It can find the frequencies of words per HTML-encoded paragraph in Mrs. Dalloway (she, 6; she, 3; the, 10; a, 2; said, 3; etc.). The computer is eager to quantify, and this is great if we find value in knowing the numbers. But to what extent are the numbers important to human readers? Counting the word “war” in Virginia Woolf will not give us much insight into, say, the ways war and gender intersect in Mrs. Dalloway. At least for humans, counting indeed plays a small part in the usual sitting-down-with-a-book reading experience. While we might unconsciously register a repeated word or phrase, it is highly unlikely that anyone will count them as they go.
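To give a sense of how little code this kind of counting takes, here is a minimal sketch using only Python's standard library. It is not our actual script (which also relies on Beautiful Soup to read the HTML); the function names and the three-sentence sample are invented for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")


def word_frequencies(text):
    """Count how often each word occurs, ignoring case."""
    return Counter(WORD.findall(text.lower()))


def count_questions(text):
    """Count question marks as a rough proxy for the number of questions."""
    return text.count("?")


def first_word_frequencies(text):
    """Count the first word of each sentence.

    Sentences are split naively after '.', '!' or '?', so abbreviations
    such as 'Mrs.' would be treated as sentence endings.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    firsts = []
    for sentence in sentences:
        words = WORD.findall(sentence.lower())
        if words:
            firsts.append(words[0])
    return Counter(firsts)


# Invented sample text, not a quotation from Woolf.
sample = "What is she doing? She is counting. She counts the words."
print(word_frequencies(sample).most_common(2))
print(count_questions(sample))
print(first_word_frequencies(sample))
```

On the sample, "she" is the most frequent word (three occurrences), there is one question, and "she" opens two of the three sentences.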

Still, there is something eerily fascinating about the high-frequency, small words that now captivate our learning computer. These are the words that will likely be most common for any text written in English. It would probably be hard to distinguish between modernist texts, or indeed any group of texts, based on these kinds of results. When thinking about machine learning, these filler words are important because they are usually the easiest to predict. Consider this beginning: “Mrs. Dalloway said she would buy…” Even if you (the human reader) weren’t familiar with the first line, there is a much higher chance you could predict the word that immediately follows (“the”) than the one that comes after that (“flowers”). Predictability is a key part of reading. However, knowing what to predict, a good reader can focus on the parts that are surprising and unpredictable. These small words might signal a kind of architectural structure around which the distinguishing features of literary edifices are often built. Thus, it might be productive to explore not only the ways modernist writers break this edifice apart but also how they reinforce familiar or predictable forms of language.
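One simple way to put a number on this kind of predictability is to count, for each word, which words follow it. The sketch below assumes a basic bigram count, which is only one of many possible measures and not a method we have settled on; the function names and the sample sentence are invented.

```python
from collections import Counter, defaultdict


def next_word_counts(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following


def best_guess(following, word):
    """Return the most frequent next word and the share of cases it covers."""
    counts = following[word]
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    guess, n = counts.most_common(1)[0]
    return guess, n / total


# Invented sample, loosely echoing the opening of Mrs. Dalloway.
tokens = ("she would buy the flowers and she would see "
          "the park and the street").split()
following = next_word_counts(tokens)
print(best_guess(following, "buy"))  # the small word after "buy" is easy to guess
print(best_guess(following, "the"))  # what follows "the" is much less certain
```

In this sample, "the" always follows "buy," while the word after "the" is split three ways among "flowers," "park," and "street."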

Question Frequencies in Woolf Novels

Going forward we are planning to use Python and other programming languages to further explore (un)predictability, with hopes of teaching our computer to recognize the giddy, exhilarating “plunge” of modernist language.

Post by Jana Millar Usiskin in the ModVers category, with the versioning tag. Images for this post care of Jana Millar Usiskin.

Making Modernism Big

Jana Millar Usiskin — Wed, 16 Oct 2013 16:53:47 +0000

This semester, with the Modernist Versions Project (MVP), I have been creating a repository of modernist texts for the purposes of text analysis and machine learning. The scope of this project requires a powerful infrastructure, including hardware, software, and technical support, provided in part by Compute Canada, a high performance computing resource platform for universities and institutions across Canada. The plan is to first aggregate a significant number of modernist texts (in TXT format) and—once we have a working repository (in late 2013 / early 2014)—mobilize computer vision and machine learning techniques to infer as yet unseen patterns across modernism. Also, in collaboration with Adam Hammond (2012-13 MVP postdoc), we are exploring the possibilities of a Turing test for modernism. This test would follow our machine learning work and—if nothing else—be a playful experiment in the spirit of modernist artifice.

But producing a repository based on web-based materials is quite tricky. As part of his work for the Routledge Encyclopedia of Modernism, Stephen Ross has created an impressively thorough list of modernist authors that we will use to amass modernist texts housed across the web. As one might imagine, electronic texts are not always “clean,” and they don’t always have sufficient metadata. And even when repositories like Project Gutenberg Australia have relatively clean text files, their selection is limited due to copyright (among other reasons). As such, the version of modernism most people currently access through popular online repositories like Gutenberg Australia often doesn’t contain important works by notable women writers and people of colour. In Project Gutenberg Australia, there is no Nella Larsen, Zora Neale Hurston, or Langston Hughes. No Dorothy Richardson or Djuna Barnes. Formally, there is also no poetry. So we don’t get Ezra Pound or T. S. Eliot, either. Put differently, Gutenberg Australia’s version of modernism appears to be very different from the version most North American students will encounter in, say, a university course. That said, we are not relying on just one repository for this work, and we hope that scripts written in collaboration with Compute Canada will allow us to be comprehensive and equitable in our articulation of modernism, especially where difference is concerned. We also hope to fill in gaps where possible, either by adding our own texts to existing repositories or conducting more research on modernist writers who are not (yet) discoverable on the web. In this regard, we are especially inspired by digital humanities practitioners Amy Earhart and Susan Brown.

To get started on this MVP project, I’ve begun meeting with Jentery Sayers, Stephen Ross, and Belaid Moa, who is one of Compute Canada’s HPC Specialists from the WestGrid sector. Belaid has been extremely helpful, guiding me through the WestGrid system and showing me how to develop a Python script that will grab modernist texts from an array of online repositories. Our script needs to locate the required texts within the HTML tree structure of the repository sites, download them, and store them in the Compute Canada database. In order to develop this script, I am learning the syntax and semantics of arrays, functions, strings, and regular expressions. As we are told over and over, code is a language, and so I look forward to becoming better versed in the intricacies of Python throughout the year.
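The "locate" step of such a script might look something like the sketch below, which walks the HTML of a page and collects links that point to plain-text files. This is an assumption-laden illustration, not the script we are writing: it uses only Python's standard library, the class and function names are invented, and `example.org` is a placeholder rather than a real repository. Downloading and storing the files are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextLinkParser(HTMLParser):
    """Collect the targets of <a> tags whose href ends in .txt."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".txt"):
            # Resolve relative links against the page's own address.
            self.links.append(urljoin(self.base_url, href))


def find_text_links(html, base_url):
    """Return absolute URLs for every .txt link found in the HTML."""
    parser = TextLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Placeholder page; a real script would first fetch the HTML from a repository.
page = """
<ul>
  <li><a href="/texts/novel.txt">A novel (plain text)</a></li>
  <li><a href="about.html">About this site</a></li>
</ul>
"""
print(find_text_links(page, "https://example.org/authors/"))
# prints ['https://example.org/texts/novel.txt']
```

Each address returned could then be downloaded (for instance with `urllib.request.urlopen`) and saved.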

For now, the idea is to start small (i.e., with twenty novels) and see how well the analysis scales up when more modernist texts are included. I chose the twenty novels we are tentatively planning to use based on availability, university syllabi, and MVP familiarity with them. We plan to run basic machine learning methods on this sample of texts in order to determine commonalities, differences, and tendencies across them. To be honest, I wonder whether the computer will be as confused about modernism as I am. I wonder if we’ll agree about what makes a passage important or interesting, whether we’ll get tripped up in the same sentences, whether we will come to the same conclusions about a given text. In short, how will this computational approach challenge the ongoing assumptions of literary scholars? More from me soon.

Post by Jana Millar Usiskin attached to the ModVers category, with the versioning tag. Image for this post care of Jana Millar Usiskin and Google Images.

The AAB Exhibit: A Walkthrough

Jana Millar Usiskin — Fri, 31 May 2013 22:14:20 +0000

This exhibit is just about ready to launch! More from the Maker Lab soon.

Post by Jana Millar Usiskin, attached to the AABrown project, with the exhibits tag. Featured video for this post, documenting a walkthrough of the Audrey Alexandra Brown Exhibit, produced by Jana Millar Usiskin.

The AAB Exhibit: An Assessment

Jana Millar Usiskin — Mon, 25 Mar 2013 03:23:13 +0000

How does a digital scholarly exhibit argue? Stan Ruecker and Alan Galey suggest that a good prototype makes a substantial argument. Extending their assessment of prototypes to a more general notion of digital projects, here is my report assessing whether the Audrey Alexandra Brown Exhibit fulfills Ruecker and Galey’s criteria:

R and G ask: Is the argument reified by the prototype contestable, defensible, and substantive?

On a technical level, showcasing images and audio housed in UVic Special Collections, and contextualizing these materials with a narrative, the Brown Exhibit argues that the purview of scholarly materials is in libraries. If scholars want to create credible work, then they need to include these kinds of resources, because resources in libraries will likely maintain appropriate metadata, including correct copyright information. In terms of narrative, the exhibit argues that materials should be grouped and showcased thematically. I’ve arranged images of Brown and her work into sections that speak to larger discussions in national literature, media studies, and women’s writing. Rather than arrange chronologically or by media type, methods that might be considered less subjective, I’ve foregrounded my own interpretation of the materials. The exhibit argues that the onus of interpretation falls on the arranger. It presumes that I have the authority and expertise necessary for scholarly interpretation.

R and G ask: Does the prototype have a recognizable position in the context of similar work, either in terms of concept or affordances?

Like the Deena Larsen collection among others, the Brown Exhibit considers archival remediation a necessary and valuable endeavor. Both projects enable online access to materials that may not otherwise be easily accessible. Both projects assume the importance of metadata and suggest that the information that contextualizes artifacts adds value to a visitor’s experience of that artifact.

R and G ask: Is the prototype part of a series of prototypes with an identifiable trajectory?

In the context of the Maker Lab, the Brown Exhibit is one of two Scalar books that are now being built to display remediated materials. Like the Crocodile Cafe Exhibit, the Brown Exhibit focuses on a very particular aspect of West Coast culture. Both value the local particularity of their subject matter and the ways culture is made through media. In a broader context, projects such as the Brown Women Writers Project and Orlando are also attempting to recover women writers in a digital environment. These projects all have the goal of bringing attention to the work of writers who have been historically overlooked or marginalized.

R and G ask: Does the prototype address possible objections?

I can see objections being raised about the need for material to be housed in the library. The time and effort argument as well as the metadata argument might be raised here. The first group would suggest that while it would be better for materials to come from the library, there are so many materials in need of remediation that it becomes difficult for librarians to keep pace with the growing demands of the scholarly community. The exhibit would argue that part of the value in digital work is in the affiliations that arise alongside the finished product. I hope that by including a narrative of process, by linking showcased objects to the library site, and by acknowledging librarians’ contributions, the exhibit will make this kind of claim. In terms of metadata standards, others might argue that standardized metadata, as required for library hosting, might not best represent all digital materials. The exhibit would argue, through the inclusion of narrative, that issues of classification not necessarily apparent in the library’s official metadata can be supplemented with additional narrative.

R and G ask: Is the prototype itself an original contribution to knowledge?

In foregrounding narrative as a way of organizing and showcasing remediated materials, the Brown Exhibit does make an original contribution to the field of scholarly exhibits. I see the exhibit functioning in parallel with the library. In one scenario, the materials can be accessed on the library’s CONTENTdm webpage; here visitors will have a simulated experience of archival serendipity. They can construct their own narratives, and make their own arguments about the work. However, with the Brown Exhibit, they also have the option of encountering the material as they might in a museum or art gallery, with additional interpretation based on another’s research interests. I think this second kind of encounter is more likely to start discussion; visitors can engage with the material in an unexpected context and they might have points on which they disagree. As Galey and Ruecker point out, there is a way of interpreting artifacts such that they become ‘productively contestable.’ In other words, we are more likely to talk about materials when someone starts the discussion. So the exhibit argues that before digital remediation can have an impact, someone needs to do the work of interpretation.

Galey, Alan and Stan Ruecker. “How a Prototype Argues.” Literary and Linguistic Computing 25.4 (2010): 405-424. Oxford Journals. Web. 23 March 2013.

Post by Jana Millar Usiskin, attached to the AABrown project, with the exhibits tag. Featured images for this post care of the Audrey Alexandra Brown Exhibit (built using the Scalar platform) and Literary and Linguistic Computing, at llc.oxfordjournals.org.

Doing Design: Questions of Navigation

Jana Millar Usiskin — Wed, 23 Jan 2013 23:47:28 +0000

Post by Jana Millar Usiskin, attached to the AABrown project, with the exhibits tag. Featured video for this post, documenting a walk through the use of the Scalar platform, produced by Jana Millar Usiskin.

Preservation and the Antimodern Impulse

Jana Millar Usiskin — Fri, 07 Dec 2012 04:48:49 +0000

To what extent does a culture of gratification produce a culture of preservation? The desire for the new, the desire for the contemporary, enables a process by which what is not-new and not-contemporary becomes available for preservation. In a culture where the new can be made to seem old, retro, classic, or vintage instantaneously with tools such as Instagram, there is a sense in which that which motivates consumption also motivates the impulse to preserve what is thrown out as a result. By implementing the aesthetics of the old to fetishize the new, Instagram draws attention to a narrative in which technology changes so fast that what seems new instantly becomes old. However, with this narrative of technological advance comes anxiety over how to preserve the amassing collections of data and technologies that are rapidly becoming obsolete. As Barbara Bordalejo pointed out at a recent ETCL and Maker Lab panel on “Long Term Thinking with Technologies,” not everything can be preserved. And if we don’t preserve everything, what are the criteria we use to select what is preserved? Who gets to make these selections?

Bordalejo’s questions are particularly relevant in the digital humanities. A recent workshop facilitated by the University of Victoria library, entitled “Why the Library Won’t Archive Your Digital Project,” listed a set of topics that included items like “What criteria is used?”; “Does the department cost share on the project?”; and “When does the library take on a project?” Factors that impede the process of institutional preservation include issues of consistent metadata (information that describes an image, text, audio recording, or object to enable preservation and discovery), copyright issues, and structural costs. At present, the UVic Library is planning to create a set of standards by which both librarians and students will have a better understanding of what is required in order to archive digital work. Perhaps at odds with the institutional impulse to standardize is the increasing drive to make scholarship accessible and relevant to the public, in which case scholarly work, especially archival work, begins to compete in a market of both the old-made-new and the new-made-old.

Audrey Alexandra Brown, the poet and writer whose material I am now translating into a digital medium, consistently produced the new-made-old, anticipating, in a sense, today’s Instagram users. She wrote weekly newspaper poems for a mass audience. Her themes and style recalled nineteenth-century romanticism at a time when increased industrialization, communication, and global conflict were transforming the way literature was conceived of, produced, and marketed. In my exhibit of Brown’s work, how can I draw attention to the similarities between Brown’s antimodern impulse and the cultural impetus to produce the new-made-old, between Brown’s sentimentality and our own nostalgic moment? How can I best explore the cultural and social implications of this relationship?

Post by Jana Millar Usiskin, attached to the AABrown project, with the exhibits tag. Featured images for this post care of the Audrey Alexandra Brown Collection at UVic. Digitization by Jana Millar Usiskin.

AAB Archives: Storage and Discoverability

Jana Millar Usiskin — Fri, 26 Oct 2012 21:41:36 +0000

I am a graduate student in English at the University of Victoria. This year in the Maker Lab I am looking forward to working with the University of Victoria Library to build a scholarly exhibit based on its Audrey Alexandra Brown fonds. Brown was a poet from Vancouver Island who has gone largely unnoticed in literary and historical criticism. Her archives include unpublished poems, newspaper publications, early work, photos, letters, biographical materials, and reviews. While digital remediation of this material does not ensure renewed scholarly attention, it certainly plays a role in the extent to which an author can be studied. Thus, under the guidance of Chris Petter and Corey Davis at the UVic Library, with support from Editing Modernism in Canada, and with Maker Lab resources, I am hoping to build an exhibit that will bring renewed attention to this overlooked Canadian writer.

While I had undertaken some of the digitization for a digital humanities seminar (with Jentery Sayers) last semester, a meeting with the UVic librarians earlier this month made me realize how much space I need for storage as well as how much metadata work is yet to be done on the exhibit. We discussed various options for both these issues. Basically, the past few months have been preoccupied with questions of storage and discoverability. Where can I house this material so that it remains stable and searchable? What digitization process must I follow in order to ensure that Brown’s remediated archival material is qualified for these storage spaces? What metadata needs to accompany this material, and how will it map on to the library’s system?

In meetings with Chris Petter and Corey Davis, we decided that the best way of housing the materials, so that I would also have permanent URLs for the data, was through CONTENTdm, a system already used by the library for much of its digital material. CONTENTdm uses the Dublin Core ontology to attach metadata to images, audio, and text materials, which can then be searched through WorldCat and thus made available to a larger scholarly community. In meetings with Nancy Stuart and Gail Fowler, we established the criteria I could follow for the metadata, including which fields to use and how to format the information. These criteria have guided me as I more precisely document the materials I’ve scanned so that they can be uploaded into the library system. Due to time constraints, I will be unable to digitize all the material in the Brown fonds, so my questions at this time pertain to exigency. What materials most need to be digitized at this historical moment? What kinds of stories can this archive tell?