Collections of Computer-Mediated Communication

by Marie Annisius

By the end of this section, you will be able to…

  • Understand the societal and technological circumstances that shape the content, structure and language of social media
  • Understand the importance and potential of social media for society and for Digital Humanities research

Social media and its importance for cross-disciplinary research

With the advent of Web 2.0 there has been an unprecedented surge of user-generated and social media content, such as forums, blogs and tweets. Such content has quickly become a major source of human knowledge and opinion and is considered a catalyst of bottom-up communication practices that, among many other things, contribute towards the democratization of language. As a consequence, we are seeing a growing need for a thorough multidisciplinary understanding of this type of communication, which is significantly shaped by the specific social and technical circumstances in which it is produced. It is rich in colloquialisms and foreign-language elements, non-canonical spelling variants and syntax, as well as idiosyncratic abbreviations and neologisms.

An important dimension of communication on social media is that it is highly interactive, reflecting the online social networks that form around particular topics, social issues and opinions. In addition to its linguistic content, it is heavily shaped by multimodal content and accompanied by rich, easily accessible (sociodemographic) metadata. Together these open up a wide range of exciting new research opportunities in digital humanities and the social sciences, while also posing new technical, linguistic and ethical challenges for scholars.

The role of research infrastructures in social media research

A combination of two important factors, namely the growing popularity and importance of social media in society and the low technical barrier to collecting social media data, has sparked numerous corpus collection projects. Harvesting data from chats, forums, weblogs and tweets, from social networking sites and from wikis is not only straightforward but is also often facilitated by APIs (Application Programming Interfaces) offered directly by the platform providers. However, few of the collected social media corpora have become available to the rest of the research community. The main reason is the problematic legal and ethical status of the collected data when they are distributed as a resource to the scientific community. This status stems from copyright, from the protection of personal information, and from terms of service, which content providers change very frequently, sometimes several times within the lifespan of a single corpus collection project.

The second major bottleneck to reusing social media corpora is that they are developed using heterogeneous technologies, representation formats and annotation schemas. This occurs because this new type of discourse exhibits features that cannot be adequately handled by the standard schemas and tools developed for the representation, annotation and processing of “standard” language, as found for instance in newspaper articles or literary texts. This heterogeneity leads to interoperability issues and significantly hinders both the reuse of the created resources and the comparability of research results.
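
To make the interoperability problem concrete, here is a minimal Python sketch of two corpora that store the same kind of information under different field names, and of the mapping needed before they can be combined. The record layouts, the field names and the common Post record are all hypothetical and chosen for illustration only; they do not correspond to any particular platform or to an actual CLARIN or DARIAH schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical common record; not an actual standard schema."""
    source: str
    author: str
    timestamp: str  # ISO 8601 string, so plain string sorting is chronological
    text: str


def from_forum(rec: dict) -> Post:
    # Assumed layout of a forum corpus record (illustrative field names).
    return Post("forum", rec["user"], rec["posted_at"], rec["body"])


def from_microblog(rec: dict) -> Post:
    # Assumed layout of a microblog corpus record (illustrative field names).
    return Post("microblog", rec["screen_name"], rec["created"], rec["message"])


# Invented sample records, one per corpus.
forum_rec = {"user": "anna", "posted_at": "2015-03-01T10:00:00", "body": "gr8 idea imo"}
micro_rec = {"screen_name": "ben", "created": "2015-03-01T10:05:00", "message": "lol +1"}

# Once both records share one schema, they can be merged and ordered together.
corpus = [from_microblog(micro_rec), from_forum(forum_rec)]
corpus.sort(key=lambda post: post.timestamp)

for post in corpus:
    print(post.source, post.author, post.text)
```

Every additional corpus with its own layout needs another converter of this kind, which is why shared representation standards reduce the effort of combining resources.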

With the support of research infrastructures such as CLARIN and DARIAH, in projects like ChatCorpus2CLARIN, the barriers posed by legal issues and by compliance with standards are being systematically lowered. This will allow researchers to combine, merge and connect resources, and will increase the sustainability and reusability of those resources.

For more details on this, see Working with social media corpora.

