Boosting Digital Humanities research with CMC data

by Marie Annisius

By the end of this section, you will be able to…

  • Identify possible research questions on social media discourse from different DH disciplines
  • Understand the main quantitative research methods used to study social media discourse in DH

Social media is a place where people express emotions, exchange ideas and chat about whatever is on their minds. People’s behaviour on social media reflects what moves (excites or irritates) them and what their attitudes are towards emerging issues. Combined with easy access to large volumes of data accompanied by rich metadata, this has opened up new opportunities for sociolinguists, communication and media studies scholars, ethnographers, anthropologists and culturologists, as well as political and social scientists. Many of them have started to adopt methodologies previously used in natural language processing and data science, such as text mining, social network analysis, geospatial analysis and data visualization, in order to structure, search, mine, manipulate, visualize, share and combine social media data. These approaches not only enable scholars to apply the interpretative traditions of the humanities and social sciences to data on a very large scale but also allow them to address new research questions and develop novel techniques for tackling complex social phenomena (e.g. fake news, hate speech, nationalism, mental health).

Case study 1: The Citizen Mindscapes Project

The aim of the multidisciplinary Citizen Mindscapes Project is a comprehensive socio-political and linguistic analysis of everyday social media discourse in Finnish society. By applying a wide range of quantitative and qualitative methods from natural language processing, statistics and machine learning to research questions in sociology and media studies, Citizen Mindscapes researchers seek to uncover societal and political trends in Finland.

In a recent article, Pakkasvirta (2018) analyses how concepts and words referring to Latin America are used in Finnish social media, and how this kind of interaction builds on and further strengthens national or “continental” historical stereotypes in the new media. The representations of Latin American stereotypes also reflect the Finnish national self-portrait. The approach combines qualitative corpus research with quantitative content analysis in a big social media data setting: tens of millions of online messages, posted over a period of more than fifteen years, on topics such as local affairs, health, food, religion and celebrity gossip.

The study shows that strong historical stereotypes are maintained and reproduced in twenty-first-century social media by new connective actions, which also reproduce nationalism. In the forum discussions, Latin Americans are mostly discussed from a tourist's perspective. Even the posts on relationships typically stem from travel. The historically established attitudes permeate all categories of the analysed dataset, in which Latin Americans are described as black/dark, hot-blooded, passionate, criminal, backward, jealous and religious, but also as beautiful and friendly. The Latin American stereotypes observed in a U.S. poll from 1940 are still very present in Finnish social media today. This shows that attitudes and prejudices change slowly, even in a time of fast-paced Internet connective actions.

Case study 2: The NTAP Project

The Networks of Texts and People (NTAP) project developed methods and tools to detect, analyse and visualize the distribution, flow and development of knowledge and opinions across online social networks. Its aims were to help general users assess, interpret and use information in the blogosphere, to enable researchers to visualize and analyse information diffusion and polarization in large data sets, and to support media monitoring companies in tracking the spread of statements and opinions through social media.

The project advocated data-driven approaches to climate change discourse in blogs, which is characterized by its large scale and its extremely complex, varied and dynamic nature. Automatic approaches can provide manageable overviews of the content, point out interesting linguistic patterns and automatically annotate the material. Furthermore, the project showed that while some problems can be tackled by treating texts as bags of words and analysing language only at the level of lexis, the identification of higher-level linguistic patterns is often necessary. This can be achieved only partly with established corpus linguistic techniques, which is why the project researchers also investigated other text mining and visualization tools.
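
The difference between a bag-of-words view and a view that keeps some word order can be illustrated with a minimal Python sketch. This is not the NTAP project's tooling: the helper names (bag_of_words, bigrams) and the two example sentences are assumptions made purely for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order (illustrative helper)."""
    return Counter(text.lower().split())


def bigrams(text):
    """Return adjacent word pairs, which retain local word order."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


# Hypothetical example sentences with opposite meanings.
a = "scientists doubt sceptics"
b = "sceptics doubt scientists"

same_bag = bag_of_words(a) == bag_of_words(b)   # word counts are identical
same_bigrams = bigrams(a) == bigrams(b)         # word pairs are not
print(same_bag, same_bigrams)
```

The two sentences are indistinguishable as bags of words, yet their bigrams differ, which is one simple reason why analysis at the level of lexis alone can miss who is said to be doing what.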

In a recent publication, Elgesem et al. (2015) analysed the texts of 1.3 million English blog posts on climate change and the structure of the links between the blogs in which these posts appeared. They combined community detection techniques with probabilistic topic modelling to show how topics related to climate change are discussed across various parts of the English blogosphere. They identified one community of predominantly climate-sceptical bloggers and several accepter communities. Although they observed a series of topics that are characteristic of the climate change discourse in the blogosphere, two turned out to be particularly prominent for characterizing the discourse: one related to climate change science and one related to climate change politics. Interestingly, they also found that the distribution of topics over the communities cuts across the divide between sceptics and non-sceptics. Despite shared topics, differences in the patterns of interactions between the sceptics and different groups of accepters were identified.
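
Community detection methods are often evaluated with a modularity score, which rewards partitions whose groups are densely linked internally. The sketch below computes that score for a toy link graph. It makes several assumptions: the text above does not say which community detection algorithm Elgesem et al. used, the blog names and links are invented, and links between blogs are treated as undirected and unweighted.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each link listed once (assumed non-empty).
    community_of: dict mapping every node to a community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # links with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1

    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community_of[node]] += d

    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical blogs: two tightly linked triangles joined by one link.
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(links, two_groups)
q_one = modularity(links, one_group)
print(q_two > q_one)
```

Splitting the toy graph into its two triangles scores higher than lumping all blogs together. Once blogs carry community labels of this kind, the topics of their posts can be tabulated per community.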

Watch the video of Andrew Salway on “Creating and Using Topically-Focused Blog Corpora” here:

Case study 3: The SoSweet Project

The main goal of the SoSweet project is to provide a detailed account of the links between linguistic variation and social structure on Twitter, both synchronically and diachronically, using novel interdisciplinary and quantitative approaches on much larger samples of data than were used in traditional sociolinguistic studies.

In one of their recent papers, Abitbol et al. (2018) investigated the socioeconomic dependencies of linguistic patterns on Twitter by combining a large corpus of geo-tagged French tweets with detailed socioeconomic maps obtained from the National Institute of Statistics. They focused on three linguistic variables: the rate of standard negation, the rate of plural agreement and the size of the vocabulary set. They then tested whether these variables depend on the socioeconomic status, tweeting location, tweeting time and social network of Twitter users.
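
To make two of these variables concrete, here is a rough Python sketch of how a standard negation rate and a vocabulary size could be computed from text. It is a crude heuristic under stated assumptions, not the procedure used by Abitbol et al.: it assumes that standard French negation pairs "ne" or "n'" with a marker such as "pas", that non-standard usage drops the "ne", and that the "ne" appears within a few tokens before the marker. The marker list, the window size and the sample text are all illustrative.

```python
import re

NEG_MARKERS = {"pas", "jamais", "rien"}   # assumed, incomplete marker list
NE_FORMS = {"ne", "n'", "n’"}             # full and elided forms of "ne"
TOKEN_RE = re.compile(r"n['’]|[a-zàâçéèêëîïôûùüÿœ]+")


def tokenize(text):
    """Lowercase and split into word tokens, keeping elided n' as a token."""
    return TOKEN_RE.findall(text.lower())


def standard_negation_rate(text, window=3):
    """Share of negation markers preceded by ne/n' within `window` tokens.

    Returns None when the text contains no negation marker. The window can
    cross sentence boundaries, so this is only an approximation.
    """
    tokens = tokenize(text)
    total = standard = 0
    for i, tok in enumerate(tokens):
        if tok in NEG_MARKERS:
            total += 1
            if any(t in NE_FORMS for t in tokens[max(0, i - window):i]):
                standard += 1
    return standard / total if total else None


def vocabulary_size(text):
    """Number of distinct tokens in the text."""
    return len(set(tokenize(text)))


# Invented sample: two standard negations and one without "ne".
sample = "Je ne sais pas. Je sais pas. Il n'aime rien."
rate = standard_negation_rate(sample)
vocab = vocabulary_size(sample)
print(rate, vocab)
```

In a real study, such per-user values would be computed over many tweets and would need a proper tokenizer and a linguistically motivated definition of each variable.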

By establishing a detailed multidimensional correlation, they found that socioeconomic indicators and linguistic variables are significantly correlated: Twitter users with a higher socioeconomic status are more likely to use standard variants of language and a larger vocabulary set, while Twitter users at the other end of the socioeconomic spectrum tend to use more non-standard terms and, on average, a smaller vocabulary set. Related to that is their finding that Twitter users from the North of France tend to use more non-standard terms and a smaller vocabulary set than users from the South. Standard language was more likely to be used during the daytime, while non-standard variants were predominant during the night, which is again related to the activity patterns of users of different socioeconomic status. Finally, by observing that the linguistic footprints of Twitter users who are connected and have the same socioeconomic status are more similar than those of disconnected users, they showed a linguistic link between the social network and status homophily.
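
The simplest building block of such an analysis is the correlation between one socioeconomic indicator and one linguistic variable. The sketch below computes a Pearson correlation coefficient in plain Python. The numbers are toy values invented for illustration, not data from the study, and the single-pair correlation is a simplification of the multidimensional analysis described above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy values (invented): one income index and one standard negation rate
# per hypothetical user group.
income_index = [1, 2, 3, 4, 5]
negation_rate = [0.10, 0.20, 0.25, 0.40, 0.45]

r = pearson_r(income_index, negation_rate)
print(round(r, 2))
```

A value of r close to 1 would indicate that the two variables rise together in the toy data. A correlation of this kind does not by itself establish that socioeconomic status causes the linguistic behaviour.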