Working with parliamentary corpora

by Ulrike Wuttke

By the end of this section, you will be able to…

  • Understand the peculiarities of parliamentary records as a research dataset
  • Understand the most frequent encoding standards for parliamentary corpora
  • Understand the most important metadata in parliamentary corpora
  • Understand the most common annotation layers in parliamentary corpora

Parliamentary proceedings as a research dataset

Parliamentary proceedings have some important characteristics that researchers need to take into account throughout their work. Their most distinguishing characteristic is that they are essentially transcriptions of spoken language produced in highly controlled and regulated settings. They are also rich in invaluable (sociodemographic) metadata. To enable data-driven science, parliamentary data must be easy to find and access, encoded according to international standards and recommendations, and accompanied by rich and correct annotations and metadata.

Encoding of parliamentary data

For the encoding of parliamentary data, the format, quality and structure of the source files are essential.

Source files

Most frequently, transcriptions of parliamentary proceedings are made available in HTML or PDF. These formats are suitable for human readers but not for direct processing by computers. For that purpose a much better format is XML, because it makes the structure of the document explicit. Documents stored in HTML and PDF therefore need to be converted to XML, a process which can range from trivial to highly complex.
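As a rough illustration of the simple end of that range, the sketch below converts a small HTML fragment into XML using only the Python standard library. The HTML layout (the class names "speech" and "speaker") and the output element names (debate, speech, p) are assumptions made up for this example. Real parliamentary websites each have their own markup and usually need custom rules.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical HTML layout, invented for this example.
HTML = """
<div class="speech">
  <span class="speaker">Ms Example</span>
  <p>Thank you, Mr President.</p>
  <p>I support the motion.</p>
</div>
<div class="speech">
  <span class="speaker">Mr Sample</span>
  <p>I disagree.</p>
</div>
"""


class SpeechExtractor(HTMLParser):
    """Collects speaker names and paragraphs from the layout above."""

    def __init__(self):
        super().__init__()
        self.speeches = []   # list of {"speaker": str, "paragraphs": [str]}
        self._field = None   # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "speech":
            self.speeches.append({"speaker": "", "paragraphs": []})
        elif tag == "span" and attrs.get("class") == "speaker":
            self._field = "speaker"
        elif tag == "p" and self.speeches:
            self._field = "p"
            self.speeches[-1]["paragraphs"].append("")

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        if not self.speeches or self._field is None:
            return  # ignore whitespace and text outside known fields
        if self._field == "speaker":
            self.speeches[-1]["speaker"] += data.strip()
        else:
            self.speeches[-1]["paragraphs"][-1] += data


def html_to_xml(html):
    """Builds an XML tree with ad hoc element names from the parsed HTML."""
    parser = SpeechExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("debate")
    for s in parser.speeches:
        speech = ET.SubElement(root, "speech", speaker=s["speaker"])
        for text in s["paragraphs"]:
            ET.SubElement(speech, "p").text = text.strip()
    return root


root = html_to_xml(HTML)
print(ET.tostring(root, encoding="unicode"))
```

The speaker, which was only implied by styling in the HTML, becomes an explicit attribute of each speech in the XML. PDF sources are typically harder, because the text first has to be extracted and its structure reconstructed from layout cues.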

Traditionally, only transcriptions of parliamentary sessions were made available, but audio and video recordings are now increasingly being released as well.

Figure 1. Proceedings of the Austrian parliament in PDF.

Figure 2. Proceedings of the Danish parliament in HTML.

Figure 3. Proceedings of the German parliament in XML.

Figure 4. Informally presented structure of Slovenian parliamentary proceedings (with minimal and maximal occurrences of structural elements).

Structural elements

Parliamentary debates are typically published in a uniform format that changes very little over time. A document typically contains the table of contents, the list of speakers, the index of topics, annexes (session papers, legislation, etc.) and, most importantly, the transcribed speeches. The speeches are accompanied by non-verbal content, such as information about the meeting and the chairperson, descriptions of the outcome of a vote, and descriptions of actions like applause.

Encoding standards

There are several XML schemas in which parliamentary proceedings can be encoded. The most popular include the Political Mashup schema (Figure 5) and TEI (Figure 6).

Figure 5. Political Mashup, formal XML schema for parliamentary proceedings.

Figure 6. An exemplification of the TEI schema for speech as used for Slovenian parliamentary proceedings.
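To give an idea of how a TEI-style encoding of speech can be processed, the following sketch reads a small fragment with the Python standard library. The fragment is invented for illustration and only loosely modelled on TEI elements for transcribed speech (u for an utterance, seg for a segment, kinesic for a non-verbal event such as applause). It is not the actual encoding of the Slovenian corpus, and the speaker identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample; speaker ids and text are hypothetical.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="debateSection">
  <note type="speaker">CHAIR:</note>
  <u who="#ChairPerson" xml:id="u1">
    <seg xml:id="u1.s1">I open the sitting.</seg>
  </u>
  <u who="#SpeakerA" xml:id="u2">
    <seg xml:id="u2.s1">Thank you.</seg>
    <kinesic type="applause"><desc>Applause</desc></kinesic>
    <seg xml:id="u2.s2">I will be brief.</seg>
  </u>
</div>"""

root = ET.fromstring(SAMPLE)

utterances = []
for u in root.iterfind(".//tei:u", TEI_NS):
    # Spoken text lives in the seg children; non-verbal events are kept apart.
    segments = ["".join(seg.itertext()).strip() for seg in u.findall("tei:seg", TEI_NS)]
    events = [k.get("type") for k in u.findall("tei:kinesic", TEI_NS)]
    utterances.append({
        "id": u.get(XML_ID),
        "who": u.get("who").lstrip("#"),  # pointer to a speaker record
        "text": " ".join(segments),
        "events": events,
    })

for utt in utterances:
    print(utt["id"], utt["who"], utt["text"], utt["events"])
```

Because the transcribed words and the transcriber's descriptions sit in different elements, an analysis of what was said can leave out applause and similar notes without any text cleaning.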

Metadata

Parliamentary records are rich in extralinguistic markup on the speech level (e.g. parliamentary session, date, meeting item, speaking time) and also on the speaker level (e.g. speaker’s name, date of birth, gender, education, party affiliation). Metadata can be used by researchers as variables in their analysis or for fine-grained filtering of the research data.
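A minimal sketch of both uses, filtering and grouping, is given below. The records are invented for illustration and do not come from any real corpus. The field names (date, speaker, gender, party, text) are assumptions about how speech-level and speaker-level metadata might be combined in one table.

```python
from collections import defaultdict

# Invented records for illustration only; not real parliamentary data.
speeches = [
    {"date": "2020-01-15", "speaker": "A", "gender": "F", "party": "Party X",
     "text": "We need more funding for schools."},
    {"date": "2020-01-15", "speaker": "B", "gender": "M", "party": "Party Y",
     "text": "The budget does not allow it."},
    {"date": "2021-03-02", "speaker": "A", "gender": "F", "party": "Party X",
     "text": "Schools are still waiting."},
    {"date": "2021-03-02", "speaker": "C", "gender": "M", "party": "Party X",
     "text": "I agree with my colleague."},
]


def word_count(text):
    """Crude whitespace-based word count, sufficient for a sketch."""
    return len(text.split())


def words_by(records, key):
    """Uses a metadata field as a variable: total words per value of `key`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += word_count(record["text"])
    return dict(totals)


# Metadata as an analysis variable: text production per party and per gender.
by_party = words_by(speeches, "party")
by_gender = words_by(speeches, "gender")

# Metadata as a filter: speeches by female members in 2020 only.
female_2020 = [
    s for s in speeches
    if s["gender"] == "F" and s["date"].startswith("2020")
]

print(by_party)
print(by_gender)
print(len(female_2020))
```

The same grouping applied to the year of the speech would give the kind of text production over time that is plotted in studies of parliamentary corpora.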

Figure 7. Text production in the German parliament over time by political party.

Figure 8. Most characteristic nouns specific to a given parliamentary group in the French parliament.

Figure 9. Representation and text production of female members of the Danish parliament.

Linguistic annotations and text enrichment

In addition to the extralinguistic metadata, parliamentary records are typically further enriched with different levels of linguistic annotations. The most standard are tokenization (identifying words and, typically, punctuation marks), sentence segmentation (marking sentence boundaries), morphosyntactic tagging (adding information on the part of speech and other morphosyntactic characteristics of each word token in the corpus) and lemmatization (providing the base dictionary forms of the inflected words).
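The four layers can be pictured with a simplified "vertical" format: one token per line with its lemma and part-of-speech tag, and a blank line at each sentence boundary. The sketch below reads such a sample. The three-column layout, the English sample and its hand-assigned tags are assumptions for illustration. Real corpora use richer formats and tagsets.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text (tokenization)
    lemma: str  # base dictionary form (lemmatization)
    pos: str    # part-of-speech tag (morphosyntactic tagging)


# Hand-annotated toy sample: form, lemma and tag separated by tabs,
# with an empty line marking a sentence boundary (sentence segmentation).
VERTICAL = (
    "The\tthe\tDET\n"
    "members\tmember\tNOUN\n"
    "voted\tvote\tVERB\n"
    ".\t.\tPUNCT\n"
    "\n"
    "They\tthey\tPRON\n"
    "were\tbe\tAUX\n"
    "voting\tvote\tVERB\n"
    "again\tagain\tADV\n"
    ".\t.\tPUNCT\n"
)


def read_vertical(text):
    """Returns a list of sentences, each a list of Token objects."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, pos = line.split("\t")
        current.append(Token(form, lemma, pos))
    if current:
        sentences.append(current)
    return sentences


sentences = read_vertical(VERTICAL)

# Lemmas let inflected forms ("voted", "voting") be counted together.
verb_lemmas = Counter(
    tok.lemma for sent in sentences for tok in sent if tok.pos == "VERB"
)

print(len(sentences), [len(s) for s in sentences])
print(verb_lemmas)
```

Counting by lemma rather than by word form matters most for highly inflected languages, where one dictionary word can appear in many different surface forms.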

Figure 10. Morphosyntactically tagged and lemmatized Slovenian parliamentary corpus.

To further facilitate the use of parliamentary records, they often also include annotations of named entities (names of persons, organizations and places) and, more recently, of sentiment (whether a speech conveys a positive, negative or neutral attitude towards the subject matter). Other frequent text enrichment procedures for parliamentary data are the annotation of discussion topics and their linking to the corresponding Wikipedia pages (e.g. Brexit), and the linking of parliamentary debates to external knowledge sources, such as voting records.
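One common way to store such enrichment is as stand-off annotation: the text is left untouched, and each annotation records the character span it covers together with a label and, optionally, a link. The sketch below illustrates the idea with an invented sentence. The dictionary layout, the labels (ORG, LOC) and the sentence-level sentiment value are assumptions for illustration, not the scheme of any particular corpus.

```python
# Invented example sentence; not taken from a real parliamentary record.
text = "The minister met the European Commission in London."


def make_entity(source, surface, label, link):
    """Builds a stand-off annotation for the first occurrence of `surface`."""
    start = source.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found in text")
    return {"start": start, "end": start + len(surface),
            "label": label, "link": link}


entities = [
    make_entity(text, "European Commission", "ORG",
                "https://en.wikipedia.org/wiki/European_Commission"),
    make_entity(text, "London", "LOC",
                "https://en.wikipedia.org/wiki/London"),
]

# Hypothetical sentence-level sentiment label, stored next to the entities.
annotation = {"text": text, "entities": entities, "sentiment": "neutral"}


def render(source, spans):
    """Shows annotations inline as [surface|LABEL] for quick inspection."""
    parts, position = [], 0
    for span in sorted(spans, key=lambda s: s["start"]):
        surface = source[span["start"]:span["end"]]
        parts.append(source[position:span["start"]])
        parts.append(f"[{surface}|{span['label']}]")
        position = span["end"]
    parts.append(source[position:])
    return "".join(parts)


rendered = render(annotation["text"], annotation["entities"])
print(rendered)
print(annotation["sentiment"])
```

Keeping annotations separate from the text means that several layers, such as entities, sentiment and topics, can refer to the same words without interfering with one another.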

Figure 11. Illustration of the named entity recognition and linking in the UK parliamentary proceedings.


Figure 12. Examples of sentiment annotation (positive in green, negative in red) in the UK parliamentary corpus.

Image Credits: