For basic primers on this kind of work and what it can achieve, take a look at Matthew L. Jockers’ Macroanalysis, David Berry’s Understanding Digital Humanities, or Witten, Frank, & Hall’s Data Mining: Practical Machine Learning Tools and Techniques. If you’re looking for online readings, I’d recommend Ted Underwood’s “Seven Ways Humanists Are Using Computers to Understand Text,” Scott Weingart’s “Topic Modeling for Humanists: a Guided Tour,” and an e-book by David Easley and Jon Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World.
The purpose of pursuing this kind of work is to find new ways of recognizing patterns and spotting anomalies across an impressively wide range of sources. Within the context of intellectual history, for example, macroanalysis might allow us to trace the transmission and even the development of key terms and concepts over increasingly long stretches of time.
When it comes to digitizing texts in the first place, we can choose from a number of effective tools. While the best digitization tends to be done by technologically well-endowed research libraries, there are also programs that anyone can use to get engaged in similar kinds of work. Optical Character Recognition (OCR) programs like ABBYY FineReader and Adobe Acrobat Pro can prove to be especially powerful resources, allowing us to swiftly convert scanned page images into machine-readable text. Once that’s done, we can turn to tools like OpenRefine in order to clean our data up and make it usable for analysis–although the human touch is usually needed to ensure that the text-based dataset is as clean as it really needs to be.
For a figure like Augustine of Hippo, we could also draw upon existing digitized versions of his corpus in order to get a better picture of how major pieces of the Augustinian vocabulary–from confessio to distentio and beyond–fall into place over a period of decades. Vital resources here would include the digitization efforts undertaken by Belgium’s CETEDOC (Centre de Traitement Electronique des Documents) and the careful, prescient work of James J. O’Donnell (Confessiones online).
Once we have a clean corpus of data to draw from, we can manage our data so that it can easily translate into polished final products. In addition to the obvious (Excel), there are more advanced resources out there that can help streamline our data management, from RStudio (which puts the statistical language R to work for any number of projects) to Stanford’s capacious Palladio, which also includes an NEH-funded visualization program.
Getting even more macroanalytic, scholars can also turn to larger-scale database systems like MySQL, SQLite, MariaDB, and PostgreSQL. A great example of what databases are capable of in the context of the Humanities can be found in OCHRE, hosted by the Oriental Institute at the University of Chicago.
Again, while gathering up all this clean data is already a significant achievement, it is also a prelude to the fun part: actually analyzing the data and seeing what we can discover. There are a number of ways we can approach this stage of macroanalysis. Language processing, as exemplified by the Stanford Natural Language Processing Group and SAMTLA, can harness the power of technology to rapidly decode and parse texts from all over the globe. Markup tools and topic models can then help us give structure to our textual data and collaborate with others who may be analyzing the same datasets. MALLET provides us with a way of modelling word-clusters in order to draw conclusions about terminological trends and semantic trajectories, while (for something completely different) MARKUS stands as an excellent example of how a markup tool can deepen our study of an immense literary corpus (in this case, one written in Chinese).
Lessons learned from methods in macroanalysis can also bear fruit in what we might call not-so-macroanalysis. (‘Microanalysis’ would seem to undersell things.)
For example, even taking a relatively constrained dataset–such as the digitized text of Book XI of Augustine’s Confessions–can allow us to track the use of key terms more precisely than any eyeball test. We don’t need language processing tools to tell us that Book XI deals with the themes of time and temporality, but those same tools can indeed help us determine how exactly Augustine chooses to deploy time-related words (tempus, tempora, distentio, etc.) over the course of the entire book.
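To make that concrete, here is a minimal Python sketch of how we might track time-related words across successive stretches of a text. It is only an illustration under stated assumptions: the sample string is a handful of short phrases standing in for the full digitized Book XI (which we would normally load from a file), the list of forms in TIME_TERMS is a partial one I chose for the example, and the function names are hypothetical.

```python
import re
from collections import Counter

# A few short phrases standing in for the full digitized text of Book XI.
# In practice we would read the whole book from a file instead.
sample_text = (
    "quid est ergo tempus? si nemo ex me quaerat, scio; "
    "si quaerenti explicare velim, nescio. "
    "video igitur tempus quandam esse distentionem. "
    "ecce distentio est vita mea."
)

# Assumed, partial list of forms to track; a fuller list would add more inflections.
TIME_TERMS = {"tempus", "tempora", "temporis", "distentio", "distentionem"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_segment(text, terms, n_segments):
    """Split the tokens into roughly equal segments and count tracked terms in each."""
    tokens = tokenize(text)
    size = max(1, -(-len(tokens) // n_segments))  # ceiling division
    segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [Counter(tok for tok in seg if tok in terms) for seg in segments]


counts = term_counts_by_segment(sample_text, TIME_TERMS, n_segments=2)
for number, segment_counts in enumerate(counts, start=1):
    print(number, dict(segment_counts))
```

Splitting by token count is the crudest possible choice; with a properly marked-up text we would more likely segment by chapter, so that the counts line up with the divisions scholars already cite.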
Translating that processed data into easily interpretable visual media can then offer us a straightforward way to inform others about the terminological breakdown of Augustine’s writing. There is a range of advanced resources aimed at visualizing data in the most analytically responsible way possible: think here of D3, Gephi, and NodeXL. Yet even sites like Wordle or Jason Davies’ word cloud generator or (my fav) Voyant can give us the chance to create a visualization as simple as a wordcloud (see below), which can help people see the intensity with which Augustine pursues the topic of time over the course of Book XI, often better than a paragraph of explanatory prose would.
While constructing a wordcloud like this may seem like a rather straightforward enterprise, it can actually raise a number of thought-provoking questions about ‘data-mining’ ancient texts. When it comes to grammar, for example, a language like Latin offers up some challenges that we seldom face when the object of our language processing is English. Trying to account for the cases of various nouns, for instance, can add new layers of complexity to the process of turning raw text into usable data: tempus, temporis, and tempora are three different strings to a computer, even though a reader sees a single word.
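One way to picture the case problem is as a lookup from inflected forms back to a single dictionary headword (a lemma). The Python sketch below uses a tiny hand-made table that I put together purely for illustration; it covers only two nouns, and the names FORM_TO_LEMMA and count_lemmas are hypothetical. A real project would lean on a proper Latin lemmatizer rather than a table like this.

```python
from collections import Counter

# Toy lookup table (an assumption for illustration): inflected form -> headword.
FORM_TO_LEMMA = {
    "tempus": "tempus",
    "temporis": "tempus",
    "tempori": "tempus",
    "tempore": "tempus",
    "tempora": "tempus",
    "temporum": "tempus",
    "temporibus": "tempus",
    "distentio": "distentio",
    "distentionis": "distentio",
    "distentionem": "distentio",
    "distentione": "distentio",
}


def count_lemmas(tokens, form_to_lemma):
    """Count tokens by lemma; forms missing from the table count as themselves."""
    return Counter(form_to_lemma.get(tok, tok) for tok in tokens)


# Made-up token list, not a quotation from the text.
tokens = ["tempus", "temporis", "tempora", "distentionem", "distentio", "vita"]
lemma_counts = count_lemmas(tokens, FORM_TO_LEMMA)
print(lemma_counts)
```

Without the table, the three forms of tempus would each show up as a separate, smaller word in the cloud; with it, they are pooled into one entry.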
Digitally analyzing a text like Confessions XI can also raise questions having to do with rhetoric (as many scholars are already demonstrating in the field of stylometry). One of the most obvious steps we must take when cleaning up textual data is to remove the ‘stop-words’: terms that recur so frequently in a language that they become almost statistically irrelevant. (In this case, think of non, et, de, and so on.)
But what about a word like te (you; accusative or ablative singular)? It would seem to be a statistically irrelevant term in many settings, and yet in Confessions XI that may not be the case. Throughout this work of his, Augustine frequently refers directly to God in the second person: tu, tibi, te… And so what is the burgeoning young digital humanist to do?
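One possible answer is to treat the stop-word list as something we adjust per text rather than a fixed given. The Python sketch below shows the idea under loud assumptions: the stop-word list is deliberately tiny and of my own making, the token list is invented rather than quoted, and filter_tokens is a hypothetical helper, not part of any particular toolkit.

```python
# Assumed, deliberately tiny stop-word list; real lists for Latin are much longer.
LATIN_STOP_WORDS = {"non", "et", "de", "in", "est", "tu", "tibi", "te"}

# Second-person pronouns we may want to keep, since Augustine uses them to address God.
PROTECTED = {"tu", "tibi", "te"}


def filter_tokens(tokens, stop_words, keep=frozenset()):
    """Drop stop-words from tokens, except any words listed in keep."""
    effective_stop_words = stop_words - set(keep)
    return [tok for tok in tokens if tok not in effective_stop_words]


# Invented token list for illustration, not a quotation from the Confessions.
tokens = ["et", "tu", "non", "de", "tempore", "tibi", "te", "et", "tempus"]

default_result = filter_tokens(tokens, LATIN_STOP_WORDS)
protected_result = filter_tokens(tokens, LATIN_STOP_WORDS, keep=PROTECTED)

print(default_result)
print(protected_result)
```

Running the analysis both ways and comparing the results is itself informative: if the picture of Book XI changes noticeably once tu, tibi, and te are let back in, that tells us something about how much of the book is direct address.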
Regardless of what we decide to do with such data-points, the fact remains that the very exercise of trying to translate Augustine into analyzable digital data can give us reason to reflect on the grammatical, rhetorical, and perhaps even conceptual content of the text proper.