Digitised newspapers as new artefacts

How are the analogue newspapers reconstructed in digital form?

Let’s start at the beginning: what exactly are we accessing when we work with digitised newspapers, and how do we access it?

As Pelle Snickars explains, digitisation adds many new layers of information on top of the original source, and its output requires a dedicated interface to be accessed. Digitisation transforms the original source and opens it up to more uses and different means of access. Capturing the analogue source as a scanned image and processing that image with text recognition and layout analysis creates a new source format and changes how users can interact with it. But the raw digital output is subject to many technical and institutional constraints, chiefly copyright, and is not usable as such; many further steps are needed before researchers can access it. Users must be aware of the context in which it was produced and of the fact that the field is constantly changing: collections are continually being enriched, the technical quality of newspaper digitisation keeps improving, and the interfaces that give access to the collections are still being developed. The real challenge facing holders of digitised newspaper collections seems to be how to give users easy access to a growing quantity of sources via a suitable interface. Behind seemingly stable institutional search interfaces, digital newspaper collections are undergoing significant and continuous change.

What does digitisation mean in practice?

Historical newspapers are selected for digitisation by their holders, often national libraries, which hold collections of newspapers but do not own the rights to them. Several factors determine which newspaper titles are selected. Legal obstacles are one reason why digitisation often starts with older titles. Another reason why older editions tend to be digitised first is their fragility: national libraries have a mission to preserve the collections they hold, and digitisation often seems to be an effective way to facilitate access to those collections while protecting them from the deterioration caused by use. It can also seem that more popular titles are digitised first, but this is hard to judge, as holders generally do not justify or explain their selection. A further factor is digitisation requests from users, which some libraries are happy to consider.

The newspapers are usually transferred to a digitisation company along with a set of requirements defined by the newspaper holder. The holder defines the format and features the digitisation will entail: what quality goals will be set for the text recognition, which segments will be annotated, whether the digitisation will be based on the paper version of the newspaper or on microfilm copies, and whether the scanning will be performed in colour or in black and white. For instance, the holder may request that annotation be enabled for newspaper obituaries, since this feature is very useful in genealogical research for collecting names and dates. Enabling annotation means identifying the visual elements that characterise these obituaries, such as various cross designs, together with the text attached to each of them. This identification can subsequently be used as a filter to limit searches to obituaries: users can type a name, select the obituaries filter and retrieve results from these segments of the page only. The segmentation of individual articles is crucial for newspapers, since many different elements are generally found on a single page.
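To make the idea of a segment filter more concrete, here is a minimal sketch in Python. It assumes that each annotated region of a page is stored with a segment type; the Segment structure, the search function and the sample texts are all invented for illustration and do not reflect how any particular library interface is implemented.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated region of a digitised page (hypothetical structure)."""
    page: int
    kind: str  # e.g. "article", "advertisement", "obituary"
    text: str


def search(segments, query, kind=None):
    """Return segments containing the query, optionally limited to one segment type."""
    needle = query.lower()
    return [
        s for s in segments
        if needle in s.text.lower() and (kind is None or s.kind == kind)
    ]


# Invented sample data, for illustration only
segments = [
    Segment(1, "article", "Mayor Jean Muller opened the new bridge on Sunday."),
    Segment(4, "obituary", "Jean Muller, aged 82, passed away peacefully."),
    Segment(4, "advertisement", "Muller & Sons: fine furniture since 1890."),
]

all_hits = search(segments, "Muller")
obituary_hits = search(segments, "Muller", kind="obituary")
print(len(all_hits), "hits in total,", len(obituary_hits), "in obituaries")
```

Without the filter, the name matches an article, an obituary and an advertisement; with the filter, only the obituary is returned. This is only possible because the segments were annotated during digitisation.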

The process of digitisation involves using optical character recognition (OCR) to produce text, and potentially also document layout analysis methods (also described as Optical Layout Recognition, or OLR) to recognise individual articles and other visual elements. This information about the text and its location is stored using two standards, the Metadata Encoding and Transmission Standard (METS) and Analyzed Layout and Text Object (ALTO), and sent back to the newspaper holder.

What is OCR?

Optical character recognition (OCR) is a relatively old technology: Canadian post offices, for instance, have used it since 1971 to read addresses on letters. It is a process that recognises text in an image and transforms it into machine-readable and editable text. But how does it work?

First, the material is prepared so that it can be optically processed more easily: the image of the text is straightened and the contrast of the picture is enhanced. This latter step is called “binarization”: the range of colour shades in the image is reduced to black and white in order to increase the contrast between the background and the letters. The binarized image is then inspected by software that identifies the contours of the letters and words and compares them to known fonts. Taking into account not just each individual letter but all the letters of a text helps identify the most probable font used. Once the letters have been identified, more complex processes are performed: identification of the language, and analysis of the spatial distribution of the text on the image (layout analysis). The language is identified with the help of dictionaries, which can create difficulties with historical texts, as spelling changes over time. Layout analysis is crucial, especially for newspapers, where the text is distributed over columns.
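The binarization step can be illustrated with a few lines of Python. This is a deliberately simplified sketch: it assumes a greyscale image represented as rows of values from 0 (black) to 255 (white) and applies a single fixed threshold, which is an assumption made for illustration. Real OCR software typically chooses the threshold adaptively, and the tiny "scan" below is invented.

```python
def binarize(gray, threshold=128):
    """Map each grey value (0-255) to 0 (ink) or 255 (background)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


# A tiny invented 3x4 "scan": dark strokes on yellowed paper
scan = [
    [200, 190, 60, 210],
    [195, 40, 50, 205],
    [210, 200, 70, 198],
]

binary = binarize(scan)

# Print ink as "#" and background as "." to make the result visible
for row in binary:
    print("".join("#" if px == 0 else "." for px in row))
```

After this step, every pixel is either ink or background, which makes the contours of the letters much easier to detect than in the original range of grey shades.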

One final important dimension to be aware of is how the quality of OCR processing is measured. The traditional way to measure the quality of an automated process is to compare its output manually against “ground truth”, in other words against real-world observation. For OCR processing, this means selecting a random sample of articles or pages, transcribing them manually and calculating the proportion of letters and words that the software recognised correctly when compared with the manual transcription. Another way to estimate the quality of OCR output is to check it against dictionaries. This accuracy information is generally used internally by institutions and is rarely published for users via interfaces. Researchers should therefore bear this in mind and develop strategies to analyse the precision and recall of their queries.
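The two approaches to measuring quality can be sketched as follows. This is an illustrative Python example, not the procedure used by any specific institution: it assumes that errors are counted with the Levenshtein edit distance (a common choice for character and word error rates), and the sample sentence and the small word list are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_output):
    """Character-level edits needed, divided by the length of the manual transcription."""
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


def word_error_rate(ground_truth, ocr_output):
    """Same idea at word level: the sequences compared are lists of words."""
    truth_words = ground_truth.split()
    return edit_distance(truth_words, ocr_output.split()) / len(truth_words)


def dictionary_hit_rate(ocr_output, lexicon):
    """Share of OCR tokens found in a word list (no ground truth required)."""
    tokens = ocr_output.lower().split()
    return sum(token in lexicon for token in tokens) / len(tokens)


# Invented example: a manual transcription and a typical noisy OCR result
ground_truth = "The quick brown fox"
ocr_output = "Tbe quick brovvn fox"
lexicon = {"the", "quick", "brown", "fox"}  # hypothetical mini dictionary

cer = character_error_rate(ground_truth, ocr_output)
wer = word_error_rate(ground_truth, ocr_output)
hits = dictionary_hit_rate(ocr_output, lexicon)
print(f"CER: {cer:.2f}  WER: {wer:.2f}  dictionary hit rate: {hits:.2f}")
```

Note that a few wrong characters can spoil a much larger share of words, which matters for keyword search because a query only matches words that were recognised correctly. The dictionary approach needs no manual transcription, but it penalises historical spellings and proper names that are missing from the word list.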

The information produced by OCR processing is then stored using two standards: the Metadata Encoding and Transmission Standard (METS) and Analyzed Layout and Text Object (ALTO).

METS was created to store general information about objects in digital libraries. It contains descriptive metadata, i.e. the title, date, author and place of publication, and generally speaking the information normally found in a library catalogue. It also contains digital metadata, e.g. when the file was created, by whom, which version of the file it is, what format it is in and how it relates to other files stored in the digital library. Finally, it contains metadata about the elements contained in the digital file, such as the title of a page or information about its textual content. The standard provides a long list of categories of information that libraries can use to store and communicate information about a given file. Not all fields are necessarily completed, but METS offers guidelines on how to produce and store this information.
The ALTO standard, as its name indicates, was designed to store OCR output. It contains information about the identified text and its spatial distribution for each image processed with OCR.
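To give an impression of what ALTO stores, the following Python sketch reads a small ALTO-like fragment and lists each recognised word with its position on the page. The fragment is invented and heavily simplified; it assumes the String element with its CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes, and real files contain much more information. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified fragment in the style of an ALTO file
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3400">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Luxemburger" HPOS="120" VPOS="300" WIDTH="410" HEIGHT="48"/>
            <SP/>
            <String CONTENT="Wort" HPOS="545" VPOS="300" WIDTH="160" HEIGHT="48"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return every recognised word with the position and size of its bounding box."""
    root = ET.fromstring(alto_xml.strip())
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT", ""),
                "x": float(element.get("HPOS", 0)),
                "y": float(element.get("VPOS", 0)),
                "width": float(element.get("WIDTH", 0)),
                "height": float(element.get("HEIGHT", 0)),
            })
    return words


words = extract_words(SAMPLE_ALTO)
for word in words:
    print(word["text"], "at", (word["x"], word["y"]))
```

Because each word is stored together with its coordinates, an interface can highlight search hits directly on the scanned image of the page.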

Key points to remember:

Digital text produced via OCR makes newspapers searchable with keywords.
Identifying the structure of a newspaper via document layout analysis enables search filters limited to certain sections, such as advertisements or obituaries.
The METS/ALTO standards store information about the characters in the scanned image and their spatial distribution.
OCR and OLR quality has improved steadily in recent years, and recent efforts to use deep learning technologies promise a further significant boost in quality.

Further Learning

Further reading on the digitisation process

Optical Character Recognition (OCR), Wikipedia page: https://en.wikipedia.org/wiki/Optical_character_recognition
METS (Metadata Encoding and Transmission Standard), Library of Congress: http://www.loc.gov/standards/mets/
METS, Wikipedia page: https://en.wikipedia.org/wiki/Metadata_Encoding_and_Transmission_Standard
ALTO (Analyzed Layout and Text Object), Wikipedia page: https://en.wikipedia.org/wiki/ALTO_(XML)
ALTO, Library of Congress: https://www.loc.gov/standards/alto/v4/alto.xsd
Optical Character Recognition (OCR) – by Computerphile, with Professor Steve Simske (Honorary Professor at the University of Nottingham as well as Director & Chief Technologist at HP Labs’ Security Printing Solutions): https://www.youtube.com/watch?v=ZNrteLp_SvY
In The Field: National Digital Newspaper Program: in the video produced by the NEH, you can follow the stages in the digitisation process, which require many stakeholders and manipulations of the original source before it can be read and analysed: Link to the NEH video: https://www.youtube.com/watch?v=LclIm9s7Iho&t=18s
Digitisation tools at Europeana: http://www.europeana-newspapers.eu/public-materials/tools/
METS/ALTO explained by Veridian: https://veridiansoftware.com/knowledge-base/metsalto/
Output of newspapers digitisation by ProQuest: Cornell, Jennifer. ‘LibGuides: ProQuest Historical Newspapers’.//proquest.libguides.com/hnp/searchinghnp.