In an age when literature and other cultural objects are created, distributed, and sold as industrial products on massive scales, we continue to study them as if they were art. This isn’t entirely bad—we’ve been doing it for centuries, with more than a little success—but it has real costs to literary and cultural scholarship, costs that are growing over time. Most serious among these is the mounting disparity between the number of books we read (or, more inclusively, the objects we study) and the number produced. This gap is now large enough (and has been for decades) that it’s exceedingly difficult to say with any certainty what contemporary cultural production as a whole looks like. If we understand our disciplinary project as an attempt to comprehend contemporary culture by way of its nominally aesthetic production—and I think this is indeed how we do and should conceive our work—the fact that we have read, seen, and heard so little of that production is a serious problem.
So what should we do about it? We should develop more readily scalable techniques by which to analyze large bodies of text and other cultural products. This is at present a small part of the so-called digital humanities; with persistence and luck, it will be one of the major strands of DH by the time we give up on that catch-all brand and assimilate whatever is of use back into a reconstituted cultural studies. The techniques in question—I’m thinking here of data mining, quantitative text analysis, economic and book history, GIS, and so on—will have their own substantial shortcomings, of course; the question isn’t whether or not we should replace one productive but limited set of techniques with another, but instead what kinds of evidence we’re willing to accept in support of the cultural claims we want to make and what trade-offs we’re willing to accept in pursuit of them.
What follows is a short illustration of the underlying problem of literary and cultural abundance, a quick tour of several techniques that we might use to expand our analytical repertoire so as to deal with that problem more effectively, and, finally, a consideration of the substantial challenges these methods face in the short-to-medium term.
The problem of abundance
There are a lot of books published in the U.S. every year. Hundreds of thousands; millions, if you count print-on-demand and other “non-traditional” titles. Restricting our interest to fiction, the number is still around 50,000 and growing rapidly (see figure 1). And that’s just in the United States. As bad as the problem is now, however, we should also remember that it’s been around for a long time. Anything above a title or two a week is more than we can reasonably expect to keep up with; fiction production has been at that level in Britain since the mid-eighteenth century (Moretti) and in the U.S. since the mid-nineteenth (Wright).
Fifty thousand new novels in the U.S. every year. We read a couple dozen, see reviews of maybe that many again, and so could claim to know something about roughly 0.1% of the total. Even that might be (barely) useful if it were randomly distributed, but of course it’s not; it’s by and large the thousandth that the major publishing houses have decided to support with ad buys, review placement, book tours, and publicity packages. We have no reason to believe that it’s a representative slice of the industry’s output.
Similar issues of raw abundance apply to other media industries, to say nothing of less formal publishing and distribution online. Now, most of this stuff we’re not reading is of modest quality at best. But the point is that once we’ve taken the cultural turn, it’s hard to see exactly how we can justify not only excluding material on the basis of perceived aesthetic value (our job certainly isn’t any longer to say what’s beautiful and what isn’t), but to do so without having been so much as aware of its existence in the first place.
So how would we begin to address such abundance? We need to find ways in which to extract information from and about books without spending much time on them individually, or at least before we spend intensive time with a few of them. Sometimes that can be done by hand, as when Franco Moretti, building on the work of multiple historians of the book, quantifies the international rises of the novel in Graphs, Maps, Trees. But if we want to move beyond metadata, into books themselves—and we do—we will almost certainly need computers.
Computers are lousy readers, of course, but fortunately we don’t need them to read anything. What they do exceptionally well—and what we require here—is count things, quickly, without getting tired or bored or sloppy. Why should counts be useful to us? Because they help us to identify patterns. And pattern recognition is what we claim we’re doing when we study genre, style, form, periodization, political orientation, gender, race, and so on. We don’t usually think of our work in exactly numerical terms, and of course we rarely base our arguments explicitly on counts of words or books. But we do imply that we’ve identified changes in patterns and frequencies when we say, for example, that modernist texts are characterized by a certain set of important features or that women’s writing has forms and concerns distinguishable from men’s.
Neither of those broad claims (concerning modernism and women’s writing) is notably ripe for computational analysis at the moment, though it wouldn’t be hard to imagine interesting statistical information that might be of use to scholars in the field. Other important critical issues, though, are already benefiting from computational and quantitative approaches. Three examples give a sense of what’s possible.
Consider American fiction circa 1850. (Why 1850 rather than, say, 2000? The answer will come in the last section.) Where is it focused and what are its concerns? The standard answers involve variations on “New England” and “Americanness” (understood to include issues of slavery and union). This is the moment of (American) Romanticism and the American Renaissance; our understanding of it is dominated by Melville, Hawthorne, Dickinson, Emerson, Thoreau, and Whitman. To this we would properly add Stowe, Douglass, and a handful of slave narratives (though Jacobs’ Incidents was still a decade off). Both Twain and American regionalism were more than a generation in the future.
If we were to guess at the locations mentioned in books written at the time, where would they fall on a map? Clustered in the northeast, probably, with some scattered mentions of the south and west, plus a few international cities, perhaps mostly in Britain and western Europe. Figure 2 shows what we actually find in a large subset of the novels published in 1851.
According to Wright’s bibliography, one hundred and eleven American novels were published in 1851. The thirty-seven texts used to create figure 2 were those available in machine-readable form from the Wright American Fiction collection and represent exactly a third of the American novels published that year. There are some squarely canonical texts included in this corpus, including Moby-Dick and House of the Seven Gables, but the large majority are obscure novels by the likes of T. S. Arthur and Sylvanus Cobb. Place names were extracted using a tool called GeoDict (now rolled into the larger Data Science Toolkit), which looks for strings of text that match a large database of named locations. It’s a pretty straightforward process and takes between a few minutes and a few hours (depending on input parameters) to run over the 2.5 million total words in the corpus. A bit of cleanup on the extracted places is necessary, mostly because many personal names and common adjectives are also the names of cities somewhere in the world. The results presented here err on the conservative side, excluding any such ambiguous places and requiring a leading preposition for cities and regions; they certainly overlook some valid places. But the results are fascinating. Two points of particular interest:
For one, there are a lot more international locations than one might have expected. True, many of them are in Britain and western Europe, but these are American novels, not British reprints, so even that fact might surprise us. And there are also multiple mentions of locations in South America, Africa, India, China, Russia, Australia, the Middle East, and so on. The imaginative landscape of American fiction in the mid-nineteenth century appears to be pretty diversely outward looking in a way that hasn’t received much attention.
And then—point two—there’s the distinct cluster of named places in the American south. At some level this probably shouldn’t be surprising; we’re talking about books that appeared just a decade before the Civil War, and the south was certainly on people’s minds. But there are nearly as many southern locations as northern ones, and they form a distinct cluster. Perhaps we need to at least consider the possibility that American regionalism took hold significantly earlier than we usually claim.
Although this particular experiment would be possible without a computer, it would have taken months rather than days, and it wouldn’t scale well at all. 1852 would take as long again, as would 1853, etc. (Incidentally, the other years of the early 1850s show a geographic distribution similar to the one presented here.) And it would be totally untenable in the contemporary case, where we would need to deal with tens of thousands of novels rather than dozens. Speed and iterability are two of the major advantages that we gain when we sacrifice the precision of close reading.
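To make the place-name step concrete, here is a minimal sketch of gazetteer matching with the conservative "leading preposition" rule described above. It is not GeoDict's actual code or interface: the tiny gazetteer, the preposition list, and the function name `extract_places` are all assumptions made up for illustration, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical miniature gazetteer; a real one holds a very large
# database of named locations.
GAZETTEER = {"boston", "new orleans", "london", "canton", "charleston"}

# Assumed list of prepositions that must precede a candidate place name.
PREPOSITIONS = ("in", "at", "to", "from", "near")


def extract_places(text, gazetteer=GAZETTEER, max_words=3):
    """Count gazetteer matches that directly follow a preposition."""
    tokens = re.findall(r"[A-Za-z]+", text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in PREPOSITIONS:
            # Try the longest candidate first so "New Orleans" beats "New".
            for n in range(max_words, 0, -1):
                words = tokens[i + 1 : i + 1 + n]
                if len(words) < n:
                    continue
                # Require capitalized words to skip ordinary nouns and adjectives.
                if not all(w[0].isupper() for w in words):
                    continue
                candidate = " ".join(words)
                if candidate.lower() in gazetteer:
                    counts[candidate] += 1
                    i += n  # skip past the words just consumed
                    break
        i += 1
    return counts


sample = "He sailed from Boston to New Orleans. Charleston smiled at him in London."
places = extract_places(sample)
```

In the invented sample, "Charleston" is a character's name and has no leading preposition, so it is left out. That shows the trade-off noted above: the rule drops ambiguous matches at the cost of missing some valid places.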
Another promising technique in computational analysis is automated network analysis. The pioneering study here is Elson, Dames, and McKeown’s work on the differences between rural and urban social networks in the nineteenth-century novel. Their method is clever: they identify exchanges of direct discourse between named characters algorithmically, then treat the quantity of such discourse as a proxy for the degree of social interaction between those characters. Like the geolocation work discussed above, this is a process that could in principle be carried out manually for the corpus with which they work (sixty novels comprising about ten million words), but that is straightforwardly extensible by computation to much larger corpora than could ever be processed by hand.
More interesting than the technical details (which are impressive) is the light the group’s work sheds on a long-standing assumption concerning the effects of urbanization. Where we might have expected that the increased human density of city life would produce literary narratives built around larger social networks in urban settings than those seen in rural settings, Elson et al. find no evidence of such a shift in their sample corpus, which was carefully curated to contain a substantial number of texts from each category. This result casts some doubt on the Bakhtinian theory of chronotopes and, more broadly speaking, may lead us to reëxamine the links between social space, historical change, and narrative structure.
There are, of course, some limitations to the Elson group’s method. There are surely additional markers of social connection beyond conversational exchange; their current technique disregards indirect discourse, which may be significantly different in urban and rural settings; their corpus is substantial but not exhaustive; they were forced to trim peripheral interactions from the generated networks, which may have skewed their findings. But these are not unaddressable issues; eminently tractable studies might resolve each of them. Assuming the main findings hold up, it’s not hard to see how similar work might shed light not only on the structure of literary narratives and their changes over time, but also on cultural coteries, the influence of reviews, and networks of reception in general (both professional, via citation analysis, and popular, for example via Amazon recommendations; on the latter, see Finn).
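The core idea of treating the quantity of direct discourse as a proxy for social interaction can be sketched in a few lines. What follows is a deliberately crude illustration and not the Elson group's implementation: the attribution pattern (a quotation followed by a speech verb and a capitalized name), the rule that adjacent quotations by different speakers form an exchange, and the invented dialogue are all assumptions. Real speaker attribution is far harder than this.

```python
import re
from collections import Counter

# Assumed pattern: '"Good morning," said Emma.' and close variants only.
QUOTE = re.compile(r'"([^"]+)"\s+(?:said|replied|asked|cried)\s+([A-Z][a-z]+)')


def conversation_network(text):
    """Return edge weights: words of dialogue addressed between character pairs."""
    edges = Counter()
    previous_speaker = None
    for match in QUOTE.finditer(text):
        speech, speaker = match.group(1), match.group(2)
        if previous_speaker and previous_speaker != speaker:
            # Treat this utterance as addressed to the previous speaker and
            # add its length in words to the undirected edge between them.
            pair = tuple(sorted((previous_speaker, speaker)))
            edges[pair] += len(speech.split())
        previous_speaker = speaker
    return edges


# Invented dialogue, used only to exercise the function.
sample = (
    '"Good morning," said Emma. '
    '"It is a fine day," replied Harriet. '
    '"Shall we walk to town?" asked Emma. '
    '"Come inside," cried Knightley.'
)
network = conversation_network(sample)
```

The opening utterance of a conversation has no previous speaker and so goes uncounted. Indirect discourse is also invisible to this approach, which is one of the limitations noted above.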
A final example of computational work as a method by which to assess large-scale cultural issues concerns the ability to identify texts that resemble one another in some way. Imagine, for example, that we think allegory should be correlated with periods of rapid literary or cultural evolution (Wilkens 2006, 2010). It’s easy enough to see this dynamic at work in the major literary texts of a single revolutionary period—American late modernism, say—but much more difficult to assess its adequacy with reference to the full field of literary or cultural production at the time, and orders of magnitude more challenging to extend across historical events and cultural domains.
What’s required is a way to measure quickly the degree of allegory in any given text. This is a hard problem, not least because it’s partially dependent on the context of reception. But it’s akin to that of spam filtering insofar as it involves sorting a large pile of individual texts into two groups, one of which possesses an amorphous high-level feature (“spamminess”) and the other of which does not. We can imagine approaching the allegory problem in the same way: train a classifier on a set of known texts, check its performance against a limited test sample, then set it loose on a larger corpus of unknown texts.
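The train, test, and apply workflow just described is the same one a simple spam filter follows. As a sketch only, here is a small multinomial naive Bayes classifier of the kind often used for spam. The choice of naive Bayes, the bag-of-words features, the label names, and the toy sentences are all assumptions for illustration; nothing here is drawn from an actual corpus or from any study cited in this essay.

```python
import math
from collections import Counter


def train(labeled_texts):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training texts
    for text, label in labeled_texts:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, test_set):
    """Share of held-out (text, label) pairs classified correctly."""
    correct = sum(classify(text, model) == label for text, label in test_set)
    return correct / len(test_set)


# Invented toy data standing in for a real training corpus.
training = [
    ("the pilgrim walked toward the celestial city", "allegorical"),
    ("the beast rose and the pilgrim fought", "allegorical"),
    ("her pale grey gown was exquisitely embroidered", "non-allegorical"),
    ("the quiet drawing room was elegantly furnished", "non-allegorical"),
]
held_out = [
    ("the pilgrim fought the beast", "allegorical"),
    ("an elegantly embroidered gown", "non-allegorical"),
]
model = train(training)
score = accuracy(model, held_out)
```

Raw word counts are the weakest possible feature set here, since they mostly track plot-level content. The parts of speech and other structural markers discussed below would take the place of `text.lower().split()`.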
The difficulty is that we don’t know exactly which low-level features of a work account for its high-level allegory. To get a handle on that, we can try two different approaches. One (call it, loosely, deductive) is to identify features that seem likely to go along with allegory based on what we know about allegory itself. Allegory has a difficult task: it needs to maintain a coherent mapping between at least two levels of meaning across large portions of its narrative span. So we might expect allegories to be relatively simple in their base story so as to avoid making this mapping any more ambiguous than necessary. They also need to convince their readers to interpret them allegorically, perhaps by looking like other, well-known allegorical works. It would follow that allegories should be rich in some kinds of words and structures (verbs, for instance, or allusions to war or to the seasons) and poor in others (adjectives and adverbs; dependent clauses).
Figure 4 shows the variations in frequency of selected parts of speech in about a thousand novels written between the late eighteenth and early twentieth centuries. It isn’t terribly interesting on its own, but it does demonstrate both that large-scale part-of-speech tagging is possible (in fact it’s pretty easy and 97–98% accurate) and that there exist temporal variations in the frequency of major parts of speech that exceed background noise (see especially the period between 1810 and 1840), which might be useful indicators of some sort of underlying formal change.
Figure 5 takes a slightly different approach. It summarizes the number of cases (from a relatively small corpus) in which two works by the same author differ significantly in the frequency of each major part of speech class (see listing of texts in table 1). The method of comparison is Dunning log likelihood with a threshold of ten, but the details aren’t especially important. The take-away point is that verbs do indeed look to be associated with more allegorical works, which are correspondingly poor in most of the other major parts of speech. That is to say, there’s evidence, albeit in a relatively small sample, to suggest that allegories tend to be stories of action rather than of portraiture.
|Author|More allegorical|Less allegorical|
|---|---|---|
|Bunyan|Pilgrim’s Progress|Grace Abounding|
|Defoe|Robinson Crusoe|Journal of the Plague Year|
|Dickens|Christmas Carol|Bleak House|
|Orwell|Animal Farm|Burmese Days|
Table 1. Paired texts by single authors, differing in degree of allegorical content. All possible appropriate pairings for each author were used, resulting in thirteen total pairs.
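For readers curious about the comparison itself, the following is a sketch of the log-likelihood (G²) statistic in its standard two-by-two contingency form, applied with the threshold of ten mentioned above. Which variant of the statistic was used for figure 5 is not specified here, so this form is an assumption, as are the function names. The counts in the example are invented, not taken from the texts in table 1.

```python
import math


def dunning_g2(count_a, total_a, count_b, total_b):
    """Log-likelihood ratio (G2) for one category's frequency in two texts.

    count_x is the number of tokens in the category (say, verbs) in text x;
    total_x is the total number of tokens in text x.
    """
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, grand_total - count_a - count_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            expected = row_totals[i] * col_totals[j] / grand_total
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def differs_significantly(count_a, total_a, count_b, total_b, threshold=10.0):
    """True when the two frequencies differ by more than the G2 threshold."""
    return dunning_g2(count_a, total_a, count_b, total_b) > threshold


# Invented counts: 150 verbs per 1,000 tokens versus 100 per 1,000.
example = dunning_g2(150, 1000, 100, 1000)
```

With these invented counts the statistic comes out a little above eleven, just over the threshold. Identical rates in two texts of any size give a value of zero.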
Parts of speech aren’t the only indicators of allegory; work to evaluate other candidates—including keywords, lemma (i.e., word stem) and character n-grams, locations, network structures and so on—is ongoing. What we’d really like to do, though, is to proceed more or less inductively by measuring all of these parameters and then seeing what combination of them does the best job of replicating our settled judgments about the allegorical content of known texts. The problem is that we would need a very large training set to do that, because it would need to average out variations (especially in keywords) that result from the plot-level content of each book. This implies a training set comprising hundreds of known allegorical and nonallegorical novels, maybe thousands. No such corpus currently exists. But by using the results of studies like those presented here, we should be able to identify those texts in a larger pool that stand out as particularly well separated on the continuum of allegory and to devote further attention to them as candidates for inclusion in the expanded training set. This is something of which I think we’ll see more in the future: computational methods guiding our attention to works that seem likely to repay close study. Of course, we’ll also need to be aware of the ways in which that type of guidance can be self-reinforcing, but it’s not as though our current methods of selecting texts are without bias or randomness themselves.
Issues, limitations, and challenges
All three of the data-mining techniques presented here might be easily extended to other important problems in literary and cultural studies. But there are challenges, too. Beyond the obvious limitations that arise from machines’ inability to perform useful interpretation on their own—and hence our need to make creative use of the things they can do—I see three major sources of difficulty when it comes to introducing computational and quantitative methods into mainstream humanities research.
The first and least important set of issues is technical. There are datasets, particularly those that approach the millions of volumes held by Google Books or the Hathi Trust, that can’t be easily manipulated on desktop-class hardware. For the time being, these require either subsetting or collaboration with institutions that have suitable computing resources. Exactly how big is too big depends on what you’re trying to do with the data, but none of the work presented here could be scaled trivially to a million books. On the other hand, computing power grows quickly. More to the point, there’s a lot that’s within easy reach now, enough to keep us busy exploring the available low-hanging fruit while we wait for more horsepower and work on the more difficult social and legal issues that stand in our way.
It’s also true that our current digital tools walk a delicate line between analytical power and accessibility for a user base not generally trained in quantitative analysis. There’s no single correct trade-off in this case, of course, and both the tools and the users will improve with time. Probably more important, we’ll develop with experience a clearer sense of what types of analytical approaches are most fruitful for the work we want to do and concentrate our development resources there. But at the moment, the going can be slow, which means a cost–benefit analysis sometimes weighs against such methods.
Social and cultural issues within the academy are less immediately tractable. Most people currently involved with literary and cultural studies came to the field because they liked to read books or watch films or contemplate art. They’re very good at those things, and the approaches they use to study their objects seem to them, for the most part, to be the most appropriate and desirable ones. To the extent that new methods involve shifting attention and resources away from established techniques, they strike many established scholars as a move in the wrong direction. The question, though, is one of optimum balance. Insofar as we spend more time computing and less time reading, we will become worse readers. But it’s also true that the less computing we do, the less we know about the things computers alone can offer. As I said at the outset, close reading has real and meaningful costs, too—costs that are becoming harder to ignore with the release of a thousand new novels every week.
Fortunately, we don’t need to convince all that many people currently in the field to change course. Digital methods are attracting more than their share of undergraduates and grad students, and far more than their share of grant money. The discipline will shift over time, probably looking rather different a generation from now than it does today (compare, for instance, psychology or political science thirty years ago). That’s enough—and there’s probably no other way in which disciplines evolve—but it’s a slow process. In the meantime, there’s a shortage of people with the appropriate backgrounds (in the broadest sense) for computational and quantitative work.
Finally, there is the very serious issue of access to the material required for computational analysis. It won’t have escaped your attention that in an essay written for Contemporaries and arguing that contemporary cultural analysis is especially well suited to quantitative techniques (given its acute problem of abundance), all three examples involved texts written before 1923. That date, 1923, is the giveaway: it marks the beginning of the era of effectively perpetual copyright. Nearly all the books written before 1923 and currently held in major research libraries are now or soon will be available in machine-readable form. You can download them from Google Books or the Open Content Alliance, and while you might have to deal with imperfect conversion from page images to text and with spotty metadata, they’re there. Moreover, they’ll improve with time, because anyone can correct and redistribute them, perhaps as part of her own project. Books that are still in copyright also exist in digital form (with Google and, in some cases, with libraries and publishers), but you can’t get your hands on them in anything like the way you can with those in the public domain. You probably never will be able to (this being the point of copyright), and even though it’s likely legal to, for example, convert books to digital form for use in your own analysis (provided you don’t redistribute them), the costs involved are prohibitive for nearly all scholars.
This is where we stand at the moment, and it’s why there’s very little existing digital scholarship on contemporary fiction. But there’s one bright spot in the form of Google’s proposed settlement with the Authors Guild and the Association of American Publishers. The settlement includes language that would grant “nonconsumptive” research access to some or all of Google’s holdings. The details aren’t yet clear, but the idea is that there would be conditions under which academic researchers could run queries and algorithms of their own design against a large number of Google’s texts, including those still in copyright. This isn’t a perfect solution—exclusivity of access and lack of distribution rights to the underlying data being major immediate drawbacks—but it’s probably the best we’re going to get for contemporary sources in the foreseeable future. There are already grant-funded collaborations underway between humanities scholars and Google engineers, as well as plans for university research centers built around access to Google’s corpus. So we needn’t write off the techniques discussed here for lack of access to the sources, though the precise form of that access remains to be seen.
All of this is to say that there are real limits to what is currently possible and productive via digital methods for the study of postwar fiction, as well as structural challenges that could limit those methods’ long-term impact. But their promise remains immense, especially for contemporary work that confronts most directly the difficulties associated with finding our way in an overwhelming mass of material. We know there are disadvantages to continuing exclusively on our traditional path. We ought to be willing at least to see where new ones might take us.
Matthew Wilkens is a postdoctoral fellow in the Program in American Culture Studies at Washington University in St. Louis.
Elson, David K., Nicholas Dames, and Kathleen R. McKeown. “Extracting Social Networks from Literary Fiction.” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, 2010. 138–147.
Finn, Ed. “Becoming Yourself: David Foster Wallace and the Afterlife of Reception.” The Legacy of David Foster Wallace: Critical and Creative Assessments. Iowa City: U of Iowa P, 2011. Forthcoming.
Greco, Albert, Clara Rodríguez, and Robert Wharton. The Culture and Commerce of Publishing in the 21st Century. Stanford, CA: Stanford UP, 2007.
Moretti, Franco. Graphs, Maps, Trees: Abstract Models for a Literary History. New York: Verso, 2005.
R. R. Bowker. “U.S. Book Production, 1993–2004.” 2005. Web. http://www.bowker.com/bookwire/decadebookproduction.html
———. “New Book Titles & Editions, 2002–2009.” 14 April 2010. Web. http://www.bowkerinfo.com/bowker/IndustryStats2010.pdf
Wilkens, Matthew. “Toward a Benjaminian Theory of Dialectical Allegory.” New Literary History 37.2 (2006): 285–298.
———. “Nothing as He Thought It Would Be: William Gaddis and American Postwar Fiction.” Contemporary Literature 51.3 (2010): 596–628.
Wright, Lyle H. American Fiction 1851–1875: A Contribution Toward a Bibliography. Rev. ed. San Marino, CA: Huntington Library P, 1965.