By the end of this section, you should:
- Be aware of the Natural Language Processing techniques that have been applied to various digitised newspaper collections
- Be aware of future developments in the application of NLP techniques to digitised newspapers.
The “pleasures and perils” identified by Richard Abel (Abel 2013) have occupied researchers working with digitised newspapers over the last decade. Indeed, the digitisation of newspapers, and the subsequent possibility of applying text mining algorithms to press articles, has opened many fields of experimentation. Libraries and researchers have imported tools from the digital humanities or developed their own algorithms to index, analyse and link the contents of digitised newspapers.
The bibliography compiled in the ‘Further Learning’ tab below offers a snapshot of current trends in digital research on digitised newspapers, organised along four themes:
- how digitisation has changed research practices, and how scholars reflect on these changes;
- what efforts are under way to index the mass of digitised text contained in newspapers;
- how to study the visual content of digitised newspapers;
- how algorithms can help automatically detect newspaper segments and recurrent article types.
As often underlined, newspapers, in both analogue and digital form, constitute a massive, overwhelming source. The hope raised by digitisation is that algorithms and text mining tools can help organise their content. A popular solution is topic modelling, an approach that identifies how certain words tend to appear together in individual texts. Topic modelling relies on the assumption that a topic is frequently expressed through a similar combination of words; from a collection of texts it derives a series of topics, each defined by words that tend to co-occur. Its output is twofold: a list of co-occurring words defining each topic, and the distribution of these topics across the collection of documents. The most popular implementation of this principle is latent Dirichlet allocation (LDA), which estimates the probability of words co-occurring. Interest in using topic modelling to index digitised newspapers can be found on the library side as well as the research side. However, because of its probabilistic nature it is not fully reliable, and because of its unsupervised nature the produced topics are not always easy to interpret and use. Its main advantage is that, being data-driven, it can provide first insights into a newspaper collection and link together articles that do not necessarily share keywords but are semantically similar.
Aside from the principles of the algorithm, topic modelling has technical strengths and weaknesses. It can operate across several languages and cope with OCR mistakes, but both will affect its results. At the same time, it can even help identify recurrent OCR mistakes. The major hindrance to applying topic modelling to digitised newspaper collections, however, is the segmentation of newspaper content. Topic modelling works best when it can identify the words that appear in a single document; if all articles of a given issue are merged into one document, the observation of co-occurring words will be distorted. The biggest “noise” is potentially created by advertisements, where image and text are mixed up. Pre-processing steps, such as the removal of stopwords (common words carrying no autonomous meaning) or lemmatisation (the reduction of words to their base form), also affect the outcome of topic modelling.
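The pre-processing steps mentioned above can be sketched in a few lines. This is a hand-rolled stand-in for illustration only: the stopword list is deliberately tiny, and the suffix-stripping “lemmatiser” is a crude approximation of what real tools (e.g. spaCy or NLTK) do:

```python
# Illustrative pre-processing sketch: stopword removal plus a crude
# lemmatisation stand-in. The stopword set and suffix rules are toy
# values; real pipelines rely on proper linguistic resources.
import re

STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "were"}

def preprocess(text):
    # Tokenise: lowercase and keep alphabetic sequences only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stopwords, which carry no autonomous meaning.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Toy lemmatiser: strip a few common English suffixes.
    lemmas = []
    for t in tokens:
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        lemmas.append(t)
    return lemmas

print(preprocess("The prices were rising in the markets"))
```

Because these choices (which stopwords, how aggressive the normalisation) change the word counts that topic modelling sees, different pre-processing decisions can yield noticeably different topics from the same collection.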
Topic modelling is implemented as an exploration feature in the impresso project’s interface:
See the presentation of how topic modelling has been applied in the context of impresso: https://impresso-project.ch/news/2018/09/07/tradingzone-tm.html
and what the outcome is: https://impresso-project.ch/news/2019/03/05/Explore_TM.html
This video shows how topics can sometimes correspond to article type, for example ‘horoscopes’:
To better understand what topic modelling does and what its outcomes are for digitised newspapers, see:
- Blei, David M. ‘Probabilistic Topic Models’. Communications of the ACM 55, no. 4 (1 April 2012): 77. https://doi.org/10.1145/2133806.2133826.
- Ciula, Arianna, and Cristina Marras. ‘Circling around Texts and Language: Towards “Pragmatic Modelling” in Digital Humanities’. Digital Humanities Quarterly 10, no. 3 (2016). http://www.digitalhumanities.org/dhq/vol/10/3/000258/000258.html.
- Frermann, Lea, and Mirella Lapata. ‘A Bayesian Model of Diachronic Meaning Change’. Transactions of the Association for Computational Linguistics 4 (2016): 31–45.
- Galen, Quintus Van, and Bob Nicholson. ‘In Search of America’. Digital Journalism 6, no. 9 (21 October 2018): 1165–85. https://doi.org/10.1080/21670811.2018.1512879.
- Jacobi, Carina, Wouter van Atteveldt, and Kasper Welbers. ‘Quantitative Analysis of Large Amounts of Journalistic Texts Using Topic Modelling’. Digital Journalism 4, no. 1 (2 January 2016): 89–106. https://doi.org/10.1080/21670811.2015.1093271.
- Jähnichen, Patrick, Patrick Oesterling, Gerhard Heyer, Tom Liebmann, Gerik Scheuermann, and Christoph Kuras. ‘Exploratory Search Through Visual Analysis of Topic Models’. Digital Humanities Quarterly 11, no. 2 (2017). http://www.digitalhumanities.org/dhq/vol/11/2/000296/000296.html.
- Mimno, David. ‘Computational Historiography: Data Mining in a Century of Classics Journals’. Journal on Computing and Cultural Heritage (JOCCH) 5, no. 1 (1 April 2012): 3. https://doi.org/10.1145/2160165.2160168.
- Yang, Tze-I, Andrew J. Torget, and Rada Mihalcea. ‘Topic Modeling on Historical Newspapers’. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 96–104. LaTeCH ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011. http://dl.acm.org/citation.cfm?id=2107636.2107649.
Further reading on research projects using digitised newspapers
- Buechel, Sven, Johannes Hellrich, and Udo Hahn. ‘Feelings from the Past—Adapting Affective Lexicons for Historical Emotion Analysis’. LT4DH 2016, 2016, 54.
- Kutuzov, Andrey, Terrence Szymanski, and Erik Velldal. ‘Diachronic Word Embeddings and Semantic Shifts: A Survey’. In Proceedings of the 27th International Conference on Computational Linguistics, 1384–1397. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2018. http://www.aclweb.org/anthology/C18-1117.
Data visualisations and visual explorations of the digitised newspapers
- Dzogang, Fabon, Thomas Lansdall-Welfare, and Nello Cristianini. ‘Discovering Periodic Patterns in Historical News’. PLOS ONE 11, no. 11 (8 November 2016): e0165736. https://doi.org/10.1371/journal.pone.0165736.
- Torget, Andrew J., Rada Mihalcea, Jon Christensen, and Geoff McGhee. ‘Mapping Texts: Combining Text-Mining and Geo-Visualization to Unlock the Research Potential of Historical Newspapers’. University of North Texas Digital Library, 2011.
- Wevers, Melvin, and Thomas Smits. ‘The Visual Digital Turn: Using Neural Networks to Study Historical Images’. Digital Scholarship in the Humanities. Accessed 11 February 2019. https://doi.org/10.1093/llc/fqy085.
Segmentations of digitised newspapers
- Langlais, Pierre-Carl. ‘Distant Reading the French News with the Numapresse Project: Toward a Contextual Approach of Text Mining’. Numapresse (blog), 7 February 2019. http://www.numapresse.org/2019/02/07/distant-reading-the-french-news-with-the-numapresse-project-toward-a-contextual-approach-of-text-mining/.
- Langlais, Pierre-Carl. ‘La Formation de La Chronique Boursière Dans La Presse Quotidienne Française (1801-1870) : Métamorphoses Textuelles d’un Journalisme de Données’. Thesis, Paris 4, 2015. http://www.theses.fr/2015PA040176.
- ‘Romans-Feuilletons’. Accessed 17 September 2018. http://www.numapresse.org/exploration/feuilleton/feuilleton_main/comte_de_montecristo_le.html.
- Walma, L. W. B. ‘Filtering the “News”: Uncovering Morphine’s Multiple Meanings on Delpher’s Dutch Newspapers and the Need to Distinguish More Article Types’. TS: Tijdschrift Voor Tijdschriftstudies, 7 December 2015. http://dspace.library.uu.nl/handle/1874/324205.
Research workflows, use-cases and teaching
- Abel, Richard. ‘The Pleasures and Perils of Big Data in Digitized Newspapers’. Film History 25, no. 1–2 (2013): 1–10. https://doi.org/10.2979/filmhistory.25.1-2.1.
- Hill, Mark J. ‘Invisible Interpretations: Reflections on the Digital Humanities and Intellectual History’. Global Intellectual History 1, no. 2 (3 May 2016): 130–50. https://doi.org/10.1080/23801883.2017.1304162.
- Kestemont, Mike, Folgert Karsdorp, and Marten During. ‘Mining the Twentieth Century’s History from the Time Magazine Corpus’. In Abstract Book of EACL 2014: The 14th Conference of the European Chapter of the Association for Computational Linguistics, 62, 2014.
- Mussell, Jim. ‘Teaching Nineteenth-Century Periodicals Using Digital Resources: Myths and Methods’. Victorian Periodicals Review 45, no. 2 (2012): 201–9. https://doi.org/10.1353/vpr.2012.0024.
- Strange, Carolyn, Daniel McNamara, Joshua Wodak, and Ian Wood. ‘Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers’. Digital Humanities Quarterly, 2014. https://openresearch-repository.anu.edu.au/handle/1885/64038.
- Upchurch, Charles. ‘Full-Text Databases and Historical Research: Cautionary Results from a Ten-Year Study’. Journal of Social History 46, no. 1 (1 September 2012): 89–105. https://doi.org/10.1093/jsh/shs035.
- Wijfjes, Huub. ‘Digital Humanities and Media History. A Challenge for Historical Newspaper Research’. Tijdschrift Voor Mediageschiedenis 20, no. 1 (26 June 2017): 4–24. https://doi.org/10.18146/tmg20277.