Basic workflows for talkr • talkr

library(talkr)

Introduction

talkr is a package designed for working with conversational data in R.

Loading some data

We will be using the IFADV corpus as example data for the workflow of talkr. This is a corpus consisting of 20 dyadic conversations in Dutch, published by the Nederlandse Taalunie in 2007 (source)

The snippet below initializes the talkr dataset using the ifadv data. For more information about the IFADV dataset, see the repository link.


data <- get_ifadv()
data <- init(data)
#> Warning in init(data): Column 'uid' already exists in the dataset. This column
#> will be renamed to `original_uid`.

Essential to any talkr workflow is a minimal set of data fields. These are the following:

source: the source conversation (a corpus can consist of multiple sources)
begin: begin time (in ms) of an utterance
end: end time (in ms) of an utterance
utterance: content of an utterance
participant: the person who produced the utterance

The init() function takes these minimal fields and generates a uid: a unique identifier at utterance-level that can be used as a reference to select and filter specific utterances.

The init() function can be used to rename columns if necessary. For example, if the column participant is named speaker, we can rename it as follows:

talkr_data <- init(data,
             participant = "speaker")

A dataset can contain additional fields. For instance, the IFADV sample dataset also contain language (which is Dutch) and utterance_raw (a fuller, less processed version of the utterance content). It also contains measures related to turn-taking and timing, including FTO (floor transfer offset, the offset between current turn and that of a prior participant, in milliseconds) and freq and rank, frequency measures of the utterance content.

Workflow 1: Quality control

Summary statistics

The report_stats function provides a simple summary of a dataset, including the total number of utterances, the total duration of the conversation, the number of participants, and the number of sources.

report_stats(data)
#> ### Corpus-level statistics 
#> 
#> nr of sources: 20
#> nr of participants: 40
#> nr of turns: 14022
#> mean turn duration (ms): 1257
#> turns per hour: 2806
#> total recording (min): 299.9
#> total recording (hours): 5
#> total speaking time (min): 293.8
#> total speaking time (hours): 4.9
#> 
#> ### Conversation-level statistics (per source)
#> 
#> |source          | nr of participants| total recording (ms)| total recording (min)| total speaking time (ms)| total speaking time (min)|
#> |:---------------|------------------:|--------------------:|---------------------:|------------------------:|-------------------------:|
#> |/dutch2/DVA10O  |                  2|               899341|                    15|                   829140|                      13.8|
#> |/dutch2/DVA11Q  |                  2|               899061|                    15|                   884734|                      14.7|
#> |/dutch2/DVA12S  |                  2|               900000|                    15|                   785323|                      13.1|
#> |/dutch2/DVA13U  |                  2|               900000|                    15|                   877328|                      14.6|
#> |/dutch2/DVA14W  |                  2|               900000|                    15|                   796924|                      13.3|
#> |/dutch2/DVA15Y  |                  2|               899379|                    15|                  1032776|                      17.2|
#> |/dutch2/DVA16AA |                  2|               900000|                    15|                   974312|                      16.2|
#> |/dutch2/DVA17AC |                  2|               899961|                    15|                   885257|                      14.8|
#> |/dutch2/DVA19AG |                  2|               899236|                    15|                   788268|                      13.1|
#> |/dutch2/DVA1A   |                  2|               899538|                    15|                   879384|                      14.7|
#> |/dutch2/DVA20AI |                  2|               899145|                    15|                   918466|                      15.3|
#> |/dutch2/DVA22AL |                  2|               899838|                    15|                   879580|                      14.7|
#> |/dutch2/DVA24AK |                  2|               900000|                    15|                   887106|                      14.8|
#> |/dutch2/DVA2C   |                  2|               899635|                    15|                  1097376|                      18.3|
#> |/dutch2/DVA3E   |                  2|               899091|                    15|                   851393|                      14.2|
#> |/dutch2/DVA4C   |                  2|               899108|                    15|                   850314|                      14.2|
#> |/dutch2/DVA6H   |                  2|               899981|                    15|                   889722|                      14.8|
#> |/dutch2/DVA7B   |                  2|               899664|                    15|                   913682|                      15.2|
#> |/dutch2/DVA8K   |                  2|               900000|                    15|                   807418|                      13.5|
#> |/dutch2/DVA9M   |                  2|               899773|                    15|                   802054|                      13.4|
#> 
#> ### Turn statistics (per source)
#> 
#> |source          | nr of turns| mean turn duration (ms)| shortest turn (ms)| longest turn (ms)| turns per hour|
#> |:---------------|-----------:|-----------------------:|------------------:|-----------------:|--------------:|
#> |/dutch2/DVA10O  |         501|                    1655|                233|              3001|           2005|
#> |/dutch2/DVA11Q  |         792|                    1117|                124|              4188|           3171|
#> |/dutch2/DVA12S  |         640|                    1227|                173|              3819|           2560|
#> |/dutch2/DVA13U  |         717|                    1224|                196|              3173|           2868|
#> |/dutch2/DVA14W  |         721|                    1105|                 97|              3115|           2884|
#> |/dutch2/DVA15Y  |         770|                    1341|                154|              4973|           3082|
#> |/dutch2/DVA16AA |         604|                    1613|                292|              3447|           2416|
#> |/dutch2/DVA17AC |         782|                    1132|                  0|              3453|           3128|
#> |/dutch2/DVA19AG |         648|                    1216|                136|              2988|           2594|
#> |/dutch2/DVA1A   |         681|                    1291|                143|              3417|           2725|
#> |/dutch2/DVA20AI |         758|                    1212|                167|              5472|           3035|
#> |/dutch2/DVA22AL |         745|                    1181|                143|              3448|           2981|
#> |/dutch2/DVA24AK |         685|                    1295|                121|              4421|           2740|
#> |/dutch2/DVA2C   |         765|                    1434|                217|              3415|           3061|
#> |/dutch2/DVA3E   |         753|                    1131|                175|              3199|           3015|
#> |/dutch2/DVA4C   |         719|                    1183|                145|              3697|           2879|
#> |/dutch2/DVA6H   |         683|                    1303|                122|              4786|           2732|
#> |/dutch2/DVA7B   |         691|                    1322|                197|              3395|           2765|
#> |/dutch2/DVA8K   |         612|                    1319|                200|              8712|           2448|
#> |/dutch2/DVA9M   |         755|                    1062|                163|              2909|           3021|

Visual quality checks

The plot_quality function provides a visual check of the nature of the data, by visualizing the distribution of turn durations, and transition timing.

Transition timing is similar to FTO, but calculated without additional quality checks: transitions are identified when the participant changes from one turn to the next. The transition time is then calculated as the difference between the beginning of the turn of the new participant, and the end of the turn of the previous one.

By default, plot_quality() will plot the entire dataset:

plot_quality(data)

Quality plots can also be run for a specific source:

plot_quality(data, source = "/dutch2/DVA8K")

A quality plot consists of three separate visualizations, all designed to allow rapid visual inspection and spotting oddities:

A density plot of turn durations. This is normally expected to look like a distribution that has a peak around 2000ms (2 seconds) and maximum lengths that do not far exceed 10000ms (10 seconds) (Liesenfeld & Dingemanse 2022). The goal of this plot is to allow eyeballing of oddities like turns of extreme durations or sets of turns with the exact same duration (unlikely in carefully segmented conversational data).
A density plot of turn transition times. A plot like this is expected to look like a normal distribution centered around 0-200ms (Stivers et al. 2009). Deviations from this may signal problems in the dataset, for instance due to imprecise or automated annotation methods.
A scatterplot of turn transition (x) by turn duration (y). This combines both distributions and is expected to look like a cloud of datapoints that is thickest in the middle region. Any standout patterns (for instance, turns whose duration is equal to their transition time) are indicative of problems in the segmentation or timing data.

Each of the three plots can also be generated separately:


plot_density(data, colname="duration", title="Turn durations",xlab="duration (ms)")

plot_density(data, colname="FTO", title="Turn transitions (FTO)",xlab="FTO (ms)")

plot_scatter(data, colname_x="FTO",colname_y="duration",title="Turn transitions and durations",xlab="transition (ms)", ylab="duration (ms)")

Workflow 2: Plot conversations

Another key use of talkr is to visualize conversational patterns. A first way to do so is geom_turn(), a ggplot2-compatible geom that visualizes the timing and duration of turns in a conversation.

We can start by simply visualizing some of the conversations in the dataset. Here we sample the first four and plot the first minute of each. We display them together using facet_wrap() by source.

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# we simplify participant names
conv <- data |>
  group_by(source) |>
  mutate(participant = as.character(factor(participant, labels=c("A","B"),ordered=T)))

# select first four conversations
these_sources <- unique(data$source)[1:4]

conv |>
  filter(end < 60000, # select first 60 seconds
                source %in% these_sources) |> # filter to keep only these conversations
  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot() +
  facet_wrap(~source) # let's facet to show the conversations side by side

More often, we will want to plot a single conversation and explore it in some more detail. Let’s zoom in on one of these first four. If we plot it without further tweaking, it is not the most helpful: the conversation is 15 minutes long and it is hard to appreciate its structure when we put it all on a single line.


conv |>
  filter(source == "/dutch2/DVA12S") |>
  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

So what we do is similar to conversational transcripts: we present the conversation in a left-to-right, top-to-bottom grid. To do so, we first need to divide the long conversation into a number of shorter lines. We do this using add_lines(). By default, this will divide the conversation into lines of 60000ms each (1 minute), creating as many lines as needed.

For now, let’s focus on the first 4 minutes, which we can do by filtering for line_id < 5 after we’ve added lines.


conv <- conv |> add_lines(line_duration = 60000)

conv |>
  filter(source == "/dutch2/DVA12S",
                line_id < 5) |> # limit to the first five lines
  ggplot(aes(x = line_end, y = line_participant)) +
  ggtitle("The first four minutes from DVA12S") +
  geom_turn(aes(
    begin = line_begin, # the begin and end aesthetics are now line-relative
    end = line_end)) +
  scale_y_reverse(breaks = seq(1, max(conv$line_id))) +  
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()


p <- last_plot()

We can style a plot like this using any available variables. For instance, we can highlight turns that are produced in overlap:


p +
  ggtitle("Turns produced in overlap") +
  geom_turn(aes(
    begin = line_begin,
    end = line_end,
    fill=overlap,
    colour=overlap)) +
  scale_fill_discrete(na.translate=F) + # stop NA value from showing up in legend
  scale_colour_discrete(na.translate=F) # stop NA value from showing up in legend

So far we have just visualized the temporal structure. But conversational turns typically consist of words and other elements.


p +
  ggtitle("Turns produced in overlap") +
  geom_turn(aes(
    begin = line_begin,
    end = line_end,
    fill=overlap,
    colour=overlap)) +
  scale_fill_discrete(na.translate=F) + # stop NA value from showing up in legend
  scale_colour_discrete(na.translate=F) # stop NA value from showing up in legend

Looking into tokens

We can start looking into the internal structure of turns by plotting occurrence of tokens.

To do so, we first need to generate a token-specific dataframe with tokenize(). This calculate token frequencies for all tokens in the selected dataset (all data by default). It also calculates relative positions in time for individual tokens in a turn. Finally, it provides a simple positional classification in to only (the token appears on its own), first (the token is turn-initial), last (the token is utterance-final), and middle (the token is not first nor last).


conv_tokens <- conv |> tokenize()

With information about tokens in hand, we can start asking questions. For instance, how does the relative frequency of words relate to their position in the turn?

To explore this question, let’s look at a shorter excerpt: 1 minute in total, divided over 4 lines. To do this, we create a dataframe this_conv, dividing it into 4 lines of 15 seconds each. We also create a dataframe these_tokens with tokenized turn elements for the same conversation, divided up in the same way.


this_conv <- conv |>
  add_lines(line_duration=15000) |>
  filter(source == "/dutch2/DVA12S",
                line_id < 5) # let's look at the first three lines

these_tokens <- conv_tokens |>
  add_lines(line_duration=15000, time_columns = "relative_time") |>
  filter(source == "/dutch2/DVA12S",
                line_id < 5)

this_conv |>
  ggplot(aes(x = line_end, y = line_participant)) +
  ggtitle("Relative frequency of elements within turns") +
  scale_y_reverse() + # we reverse the axis because lines run top to bottom
  geom_turn(aes(
    begin = line_begin,
    end = line_end)) +
  geom_token(data=these_tokens,
             aes(x=line_relative_time,
                 size=frequency)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()


p <- last_plot()

Finally, we can also print the content of some of the elements. Here, we pick the most frequent turn-initial elements for plotting, highlight them with another layer of geom_token() and plot the text using geom_label_repel():


these_tokens_first <- these_tokens |>
  filter(order=="first",
                rank < 10)

p +
  ggtitle("Some frequent turn-initial elements") +
  geom_token(data=these_tokens_first,
             aes(x=line_relative_time),
             color="red") +
  ggrepel::geom_label_repel(data=these_tokens_first,
                            aes(x=line_relative_time,
                                label=token),
                            direction="y")

Notes

The init function can also be used to reformat timestamps. Default is “ms”, which expects milliseconds. ‘%H:%M:%OS’ will format eg. 00:00:00.010 to milliseconds (10). See ‘?strptime’ for more format examples.

init(format_timestamps="ms")

Token frequencies are calculated over the entire dataset. If you want source-specific data, you can filter the source prior to tokenization:

tokens_DVA9M <- data |>
  filter(source == "/dutch2/DVA9M") |>
  tokenize()

tokens_DVA9M

References

Liesenfeld, Andreas, and Mark Dingemanse. 2022. ‘Building and Curating Conversational Corpora for Diversity-Aware Language Science and Technology’. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178–92. Marseille. doi:10.48550/arXiv.2203.03399.
Stivers, Tanya, N. J. Enfield, Penelope Brown, C. Englert, Makoto Hayashi, Trine Heinemann, Gertie Hoymann, Federico Rossano, J. P. de Ruiter, Kyung-Eun Yoon, and Stephen C. Levinson. 2009. ‘Universals and Cultural Variation in Turn-Taking in Conversation’. Proceedings of the National Academy of Sciences 106 (26): 10587–92. doi:10.1073/pnas.0903616106.