The DepsearcheR package is a simple utility for using R in corpus analyses that rely on dependency annotations in some version of the CoNLL format (cf. e.g. here).

1 Installation

Use either the devtools package or, if devtools isn’t installed, the more lightweight remotes package (probably the better choice in that case). Both provide the install_github function needed for the job:

library(remotes)
install_github('utacorpora/depsearcheR')

2 Example data

Assume that you have parsed a file like the one provided in the inst/extdata folder of this package (the Finnish Wikipedia article on the sparrow):

library(depsearcheR)
library(readr)
mytext <- readr::read_file(
                           system.file("extdata", 
                                       "varpunen_wikipedia.txt",
                                       package="depsearcheR")
                           )

cat(substr(mytext,1,300))

## Varpunen (Passer domesticus) on yleinen lintulaji suuressa osassa Eurooppaa ja Aasiaa. Carolus Linnaeus antoi varpuselle aluksi nimen Fringilla domestica.
## Varpunen on 14–17 cm pitkä ja painaa 30–33 g. Koiras on hieman kookkaampi. Varpusella on tukeva ruumis, suhteellisen suuri pää ja voimakas nokka.

The text has been parsed with the Finnish dependency parser developed at the University of Turku, and the parser’s output file is also included in inst/extdata. Note that the format here is the so-called Universal Dependencies format. This is what the CoNLL-formatted file looks like:

1   Varpunen    varpunen    NOUN    _   Case=Nom|Number=Sing    8   nsubj:cop   _   _
2   (   (   PUNCT   _   _   4   punct   _   _
3   Passer  Passer  PROPN   _   Case=Nom|Number=Sing    4   compound:nn _   _
4   domesticus  domesticus  NOUN    _   Case=Nom|Number=Sing    1   appos   _   _
5   )   )   PUNCT   _   _   4   punct   _   _
6   on  olla    VERB    _   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 8   cop _   _
7   yleinen yleinen ADJ _   Case=Nom|Degree=Pos|Number=Sing 8   amod    _   _
8   lintulaji   lintu#laji  NOUN    _   Case=Nom|Number=Sing    0   root    _   _
9   suuressa    suuri   ADJ _   Case=Ine|Degree=Pos|Number=Sing 10  amod    _   _
10  osassa  osa NOUN    _   Case=Ine|Number=Sing    8   nmod    _   _
11  Eurooppaa   Eurooppa    PROPN   _   Case=Par|Number=Sing    10  nmod    _   _
12  ja  ja  CONJ    _   _   11  cc  _   _
13  Aasiaa  Aasia   PROPN   _   Case=Par|Number=Sing    11  conj    _   _
14  .   .   PUNCT   _   _   8   punct   _   _

1   Carolus Carolus PROPN   _   Case=Nom|Number=Sing    2   name    _   _
2   Linnaeus    Linnaeus    PROPN   _   Case=Nom|Number=Sing    3   nsubj   _   _
3   antoi   antaa   VERB    _   Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act 0   root    _   _
4   varpuselle  varpunen    NOUN    _   Case=All|Number=Sing    3   nmod    _   _
5   aluksi  aluksi  ADV _   _   3   advmod  _   _
6   nimen   nimi    NOUN    _   Case=Gen|Number=Sing    7   nmod:poss   _   _
7   Fringilla   Fringilla   NOUN    _   Case=Ade|Number=Plur    3   nmod    _   _
8   domestica   domestica   X   _   Foreign=Foreign 3   dobj    _   _
9   .   .   PUNCT   _   _   3   punct   _   _

Now, let’s imagine we have a data set consisting of all the sentences of the Finnish Wikipedia article mentioned above. It could be acquired as follows:

library(dplyr)
library(readr)
sentences <- readr::read_file(
                           system.file("extdata", 
                                       "varpunen.conll",
                                       package="depsearcheR")
                           ) %>% 
        strsplit("\n\n")  %>% 
        unlist

The data is also included in the package as a sample vector called varpunen_sentences, which we will use in the following examples.

Now, since there is some variation among the different conll formats and not all conll outputs have the same number of columns, you should set the columns used in each session as a global option via the options function. For instance, for the results of the Stanford parser’s 2015 version, we should do:

options("conll_cols" = c("tokenid","token","lemma","feat","none", "head", "dep"))

Since in the following examples we’ll be using the Finnish dep parser, we’ll set it like this:

options("conll_cols" = c("tokenid","token","lemma","pos","pos2","feat","head","dep","null1","null2"))
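
To make the role of the conll_cols option more concrete, here is a minimal base R sketch of how a tab-separated CoNLL sentence could be turned into a data frame whose column names come from that option. The helper parse_conll_sketch is hypothetical and only illustrates the idea; it is not the package’s actual implementation, and the three-token sentence is a shortened toy example.

```r
# Hypothetical helper (not part of depsearcheR): split a CoNLL sentence
# into rows and fields and name the columns after the "conll_cols" option.
parse_conll_sketch <- function(sentence, cols = getOption("conll_cols")) {
  rows   <- strsplit(sentence, "\n", fixed = TRUE)[[1]]
  fields <- strsplit(rows, "\t", fixed = TRUE)
  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  stopifnot(ncol(df) == length(cols))  # column count must match the option
  names(df) <- cols
  df$tokenid <- as.integer(df$tokenid)
  df$head    <- as.integer(df$head)
  df
}

options("conll_cols" = c("tokenid", "token", "lemma", "pos", "pos2",
                         "feat", "head", "dep", "null1", "null2"))

# Toy sentence (shortened, made up for illustration); fields are tab-separated
toy <- paste(
  "1\tLinnaeus\tLinnaeus\tPROPN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_",
  "2\tantoi\tantaa\tVERB\t_\tVerbForm=Fin\t0\troot\t_\t_",
  "3\tnimen\tnimi\tNOUN\t_\tCase=Gen|Number=Sing\t2\tdobj\t_\t_",
  sep = "\n"
)

parsed <- parse_conll_sketch(toy)
parsed[, c("tokenid", "token", "pos", "head", "dep")]
```

If the option lists a different number of columns than the parser output contains, the sketch stops with an error, which is one reason to set the option explicitly for each parser.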

3 Usage on the sentence level

At the heart of the depsearcheR package is an extremely simple function called FilterConllRows. It filters a conll formatted sentence according to conditions given by the user, and the output of one filter can then be fed into further filters.

As a simple, even trivial, scenario, imagine that we want to retrieve all the nouns of a sentence. This could be achieved in the following way (using the example set of sentences from the previous section):

FilterConllRows(varpunen_sentences[1], "pos", "NOUN")

## # A tibble: 4 x 10
##   tokenid token      lemma      pos   pos2  feat    head dep   null1 null2
##     <int> <fct>      <fct>      <fct> <fct> <fct>  <int> <fct> <fct> <fct>
## 1       1 Varpunen   varpunen   NOUN  _     Case=…     8 nsub… _     _    
## 2       4 domesticus domesticus NOUN  _     Case=…     1 appos _     _    
## 3       8 lintulaji  lintu#laji NOUN  _     Case=…     0 root  _     _    
## 4      10 osassa     osa        NOUN  _     Case=…     8 nmod  _     _

The function can also be used with regular expressions:

FilterConllRows(varpunen_sentences[1], "feat", "Case=Ine", use_regex=T)

## # A tibble: 2 x 10
##   tokenid token    lemma pos   pos2  feat           head dep   null1 null2
##     <int> <fct>    <fct> <fct> <fct> <fct>         <int> <fct> <fct> <fct>
## 1       9 suuressa suuri ADJ   _     Case=Ine|Deg…    10 amod  _     _    
## 2      10 osassa   osa   NOUN  _     Case=Ine|Num…     8 nmod  _     _

…with conditions that have multiple values:

FilterConllRows(varpunen_sentences[1], "pos", c("NOUN","ADJ"))

## # A tibble: 6 x 10
##   tokenid token      lemma      pos   pos2  feat    head dep   null1 null2
##     <int> <fct>      <fct>      <fct> <fct> <fct>  <int> <fct> <fct> <fct>
## 1       1 Varpunen   varpunen   NOUN  _     Case=…     8 nsub… _     _    
## 2       4 domesticus domesticus NOUN  _     Case=…     1 appos _     _    
## 3       7 yleinen    yleinen    ADJ   _     Case=…     8 amod  _     _    
## 4       8 lintulaji  lintu#laji NOUN  _     Case=…     0 root  _     _    
## 5       9 suuressa   suuri      ADJ   _     Case=…    10 amod  _     _    
## 6      10 osassa     osa        NOUN  _     Case=…     8 nmod  _     _

or with negative conditions, for instance to get everything that’s not a noun, a proper noun or an adjective (or a punctuation mark):

FilterConllRows(varpunen_sentences[1], "pos", c("NOUN","ADJ","PROPN","PUNCT"), is_negative=T)

## # A tibble: 2 x 10
##   tokenid token lemma pos   pos2  feat              head dep   null1 null2
##     <int> <fct> <fct> <fct> <fct> <fct>            <int> <fct> <fct> <fct>
## 1       6 on    olla  VERB  _     Mood=Ind|Number…     8 cop   _     _    
## 2      12 ja    ja    CONJ  _     _                   11 cc    _     _
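
The three kinds of conditions shown above (exact values, regular expressions, negation) can be sketched in a few lines of base R. The helper filter_rows_sketch below is hypothetical and merely mirrors the behaviour seen in the outputs; how FilterConllRows is implemented internally is not shown in this vignette. The toy data frame holds four tokens of the first example sentence.

```r
# Hypothetical stand-in for the filtering idea (not the package's own code)
filter_rows_sketch <- function(df, column, values,
                               use_regex = FALSE, is_negative = FALSE) {
  cell <- as.character(df[[column]])
  hit <- if (use_regex) {
    # several patterns are combined into one alternation
    grepl(paste(values, collapse = "|"), cell)
  } else {
    cell %in% as.character(values)
  }
  if (is_negative) hit <- !hit
  df[hit, , drop = FALSE]
}

# Four tokens from the first example sentence
toy <- data.frame(
  tokenid = c(7L, 8L, 9L, 10L),
  token   = c("yleinen", "lintulaji", "suuressa", "osassa"),
  pos     = c("ADJ", "NOUN", "ADJ", "NOUN"),
  feat    = c("Case=Nom|Degree=Pos|Number=Sing", "Case=Nom|Number=Sing",
              "Case=Ine|Degree=Pos|Number=Sing", "Case=Ine|Number=Sing"),
  stringsAsFactors = FALSE
)

nouns     <- filter_rows_sketch(toy, "pos", "NOUN")
inessive  <- filter_rows_sketch(toy, "feat", "Case=Ine", use_regex = TRUE)
not_nouns <- filter_rows_sketch(toy, "pos", "NOUN", is_negative = TRUE)
```

Because each call returns a data frame of the same shape as its input, the result of one call can be passed straight into the next one.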

You can add more complex conditions by piping the results in dplyr style. Since the philosophy behind this package relies quite heavily on that of dplyr (and the tidyverse framework in general), it might be a good idea to get a basic idea of what that is about (e.g. here). The very basic thing to note: the %>% operator used in the following examples is called the pipe operator (from the magrittr package), and it passes the return value of one function on as the first argument of the next function.

So, to get, for instance, something that is not a noun but is in the inessive case, we might do:

FilterConllRows(varpunen_sentences[1], "pos", c("NOUN","PROPN"),is_negative=T) %>% 
    FilterConllRows("feat", "Case=Ine", use_regex=T)

## # A tibble: 1 x 10
##   tokenid token    lemma pos   pos2  feat           head dep   null1 null2
##     <int> <fct>    <fct> <fct> <fct> <fct>         <int> <fct> <fct> <fct>
## 1       9 suuressa suuri ADJ   _     Case=Ine|Deg…    10 amod  _     _

And actually, with the piping style going on, we could start with the sentence:

varpunen_sentences[1]  %>% 
    FilterConllRows("pos", c("NOUN","PROPN"),is_negative=T) %>% 
    FilterConllRows("feat", "Case=Ine", use_regex=T)

## # A tibble: 1 x 10
##   tokenid token    lemma pos   pos2  feat           head dep   null1 null2
##     <int> <fct>    <fct> <fct> <fct> <fct>         <int> <fct> <fct> <fct>
## 1       9 suuressa suuri ADJ   _     Case=Ine|Deg…    10 amod  _     _

Stylistically, this looks a little better to me.

This kind of “filtering the filtered” is where the function actually gets useful for getting information about dependencies.

Now, imagine we want to get all the dependents of a finite verb in a sentence. For that, we can use the values of the columns head and tokenid.

#Get all the finite verbs in the sentence
finverbs <- FilterConllRows(varpunen_sentences[2], "feat", "VerbForm=Fin", T)
#Get their dependents
deps <- FilterConllRows(varpunen_sentences[2], "head", finverbs$tokenid)
deps

## # A tibble: 6 x 10
##   tokenid token      lemma     pos   pos2  feat     head dep   null1 null2
##     <int> <fct>      <fct>     <fct> <fct> <fct>   <int> <fct> <fct> <fct>
## 1       2 Linnaeus   Linnaeus  PROPN _     Case=N…     3 nsubj _     _    
## 2       4 varpuselle varpunen  NOUN  _     Case=A…     3 nmod  _     _    
## 3       5 aluksi     aluksi    ADV   _     _           3 advm… _     _    
## 4       7 Fringilla  Fringilla NOUN  _     Case=A…     3 nmod  _     _    
## 5       8 domestica  domestica X     _     Foreig…     3 dobj  _     _    
## 6       9 .          .         PUNCT _     _           3 punct _     _

There is actually a shortcut function for getting the dependents of a word. It’s called GetDeps and it takes as its arguments:

a word (=a row of a tibble) or multiple words (a tibble) that are the heads we’re looking at
an unfiltered sentence (as a tibble)

So, the previous example could be written as:

varpunen_sentences[2] %>% 
    FilterConllRows("feat", "VerbForm=Fin", T) %>%  # returns one row, passed on to the GetDeps function
    GetDeps(varpunen_sentences[2])  # note: the original sentence is the second argument

## # A tibble: 6 x 10
##   tokenid token      lemma     pos   pos2  feat     head dep   null1 null2
##     <int> <fct>      <fct>     <fct> <fct> <fct>   <int> <fct> <fct> <fct>
## 1       2 Linnaeus   Linnaeus  PROPN _     Case=N…     3 nsubj _     _    
## 2       4 varpuselle varpunen  NOUN  _     Case=A…     3 nmod  _     _    
## 3       5 aluksi     aluksi    ADV   _     _           3 advm… _     _    
## 4       7 Fringilla  Fringilla NOUN  _     Case=A…     3 nmod  _     _    
## 5       8 domestica  domestica X     _     Foreig…     3 dobj  _     _    
## 6       9 .          .         PUNCT _     _           3 punct _     _

There is also the GetHeads function for doing this the other way around: if we want to, say, find all the heads of (common) nouns, we could do it as follows.

varpunen_sentences[2] %>% 
    FilterConllRows("pos", "NOUN") %>%  
    GetHeads(varpunen_sentences[2])

## # A tibble: 2 x 10
##   tokenid token     lemma     pos   pos2  feat      head dep   null1 null2
##     <int> <fct>     <fct>     <fct> <fct> <fct>    <int> <fct> <fct> <fct>
## 1       3 antoi     antaa     VERB  _     Mood=In…     0 root  _     _    
## 2       7 Fringilla Fringilla NOUN  _     Case=Ad…     3 nmod  _     _
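
Both lookups boil down to matching the head column against the tokenid column. The following base R sketch shows that idea on the second example sentence, reduced to the columns needed here; deps_sketch and heads_sketch are hypothetical names used only for illustration and are not the package’s functions.

```r
# Second example sentence, values copied from the CoNLL output shown earlier
sent <- data.frame(
  tokenid = 1:9,
  token   = c("Carolus", "Linnaeus", "antoi", "varpuselle", "aluksi",
              "nimen", "Fringilla", "domestica", "."),
  pos     = c("PROPN", "PROPN", "VERB", "NOUN", "ADV",
              "NOUN", "NOUN", "X", "PUNCT"),
  head    = c(2L, 3L, 0L, 3L, 3L, 7L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Dependents: rows whose head points at one of the given words
deps_sketch <- function(heads, sentence) {
  sentence[sentence$head %in% heads$tokenid, , drop = FALSE]
}

# Heads: rows whose tokenid is referred to by one of the given words
heads_sketch <- function(words, sentence) {
  sentence[sentence$tokenid %in% words$head, , drop = FALSE]
}

verb  <- sent[sent$pos == "VERB", ]
nouns <- sent[sent$pos == "NOUN", ]

verb_deps  <- deps_sketch(verb, sent)$token
noun_heads <- heads_sketch(nouns, sent)$token
```

On this sentence the sketch selects the same tokens as the GetDeps and GetHeads outputs above: six dependents of antoi, and antoi and Fringilla as the heads of the common nouns.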

Note that we can use the output of GetDeps and GetHeads for further filtering, for instance to get only the kind of dependents for finite verbs that are not nouns (or punctuation marks):

varpunen_sentences[2] %>% 
    FilterConllRows("feat", "VerbForm=Fin", T) %>%  
    GetDeps(varpunen_sentences[2])   %>% 
    FilterConllRows("pos", c("PROPN", "NOUN", "PUNCT"), is_negative=T)

## # A tibble: 2 x 10
##   tokenid token     lemma     pos   pos2  feat      head dep   null1 null2
##     <int> <fct>     <fct>     <fct> <fct> <fct>    <int> <fct> <fct> <fct>
## 1       5 aluksi    aluksi    ADV   _     _            3 advm… _     _    
## 2       8 domestica domestica X     _     Foreign…     3 dobj  _     _

4 Usage on the dataset level

All the previous examples demonstrate the usage of the simple functions in this package on the sentence level. What the package is actually for, however, is querying data structures that contain multiple sentences and keeping only the ones that are relevant for the user.

It should be emphasized at this point that this package is definitely not meant for efficient data mining on a large scale. The functions used are simple, perhaps even trivial, and that simplicity comes at the cost of performance. But if you know what you’re searching for and are not dealing with a terribly large dataset (or have a lot of time), this can be a useful approach.

4.1 Querying a raw text

The first, although probably not the most common, use case considered here is one where you have a conll formatted text (such as the one from the first section of this vignette) and you want to make queries to find sentences with certain characteristics. To do that, we want to have the data formatted as a vector of sentences.

Since the data in conll format separates sentences with an empty row, a vector of sentences can be obtained simply by splitting the file by two consecutive newlines, as was done at the beginning of this vignette. There is also a convenience function for this, which takes a filename as a parameter and produces the vector:

sentences  <- GetSentencesFromFile("inst/extdata/varpunen.conll")

Now, let’s say we want to get all the sentences

which include a finite verb
in which the finite verb has a subject dependent
in which the subject is a proper name

The trick here is to develop custom functions that take care of the filtering and then apply those functions to the vector containing the sentences. Here’s an example function that can be used for searching for only the kind of results described above (has a finite verb, the verb has a subject dependent, the subject is a proper name):

MyFilterFunction <- function(sentence) {
    sentence  %>% 
        FilterConllRows("feat","VerbForm=Fin", T) %>% 
        GetDeps(sentence) %>% 
        FilterConllRows("dep","nsubj") %>% 
        FilterConllRows("pos","PROPN")  %>% 
        return
}

There is one requirement to keep in mind when creating these filters: every filter must take a single sentence as its only argument and return the filtered matches so that they can be further processed. Before the return statement you can have as complex or as simple a set of code as you wish. The filters should be used with the ApplyConllFilter function of this library in the following manner:

options("conll_cols" = c("tokenid","token","lemma","pos","pos2","feat","head","dep","null1","null2"))
matched_sentences <- ApplyConllFilter(varpunen_sentences, MyFilterFunction)
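
The contract described above (a filter takes one sentence and returns the matching rows) is enough to sketch what applying a filter to a whole vector amounts to: keep every sentence for which the filter returns at least one row. The base R code below is only an illustration under that assumption; apply_filter_sketch, parse_sketch and propn_subject are hypothetical names, and the two short sentences use a reduced, made-up column layout.

```r
cols <- c("tokenid", "token", "lemma", "pos", "head", "dep")  # reduced toy layout

# Minimal parser for the toy layout (hypothetical helper)
parse_sketch <- function(sentence) {
  fields <- strsplit(strsplit(sentence, "\n", fixed = TRUE)[[1]], "\t", fixed = TRUE)
  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(df) <- cols
  df
}

# A filter: one sentence in, matching rows out (here: proper-noun subjects)
propn_subject <- function(sentence) {
  df <- parse_sketch(sentence)
  df[df$dep == "nsubj" & df$pos == "PROPN", , drop = FALSE]
}

# Keep the sentences for which the filter returns at least one row
apply_filter_sketch <- function(sentences, filter_fn) {
  keep <- vapply(sentences, function(s) nrow(filter_fn(s)) > 0,
                 logical(1), USE.NAMES = FALSE)
  sentences[keep]
}

toy_sentences <- c(
  "1\tLinnaeus\tLinnaeus\tPROPN\t2\tnsubj\n2\tantoi\tantaa\tVERB\t0\troot",
  "1\tKoiras\tkoiras\tNOUN\t2\tnsubj\n2\tlaulaa\tlaulaa\tVERB\t0\troot"
)

matched <- apply_filter_sketch(toy_sentences, propn_subject)
length(matched)
```

Only the first toy sentence has a proper-noun subject, so only that one is kept.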

By default ApplyConllFilter just returns the sentences of the input vector that matched the specified filter. In our case, there was one sentence that matched the query. To view the sentence as a tibble, we can use the function ConllAsTibble:

ConllAsTibble(matched_sentences)

## # A tibble: 9 x 10
##   tokenid token      lemma     pos   pos2  feat     head dep   null1 null2
##     <int> <fct>      <fct>     <fct> <fct> <fct>   <int> <fct> <fct> <fct>
## 1       1 Carolus    Carolus   PROPN _     Case=N…     2 name  _     _    
## 2       2 Linnaeus   Linnaeus  PROPN _     Case=N…     3 nsubj _     _    
## 3       3 antoi      antaa     VERB  _     Mood=I…     0 root  _     _    
## 4       4 varpuselle varpunen  NOUN  _     Case=A…     3 nmod  _     _    
## 5       5 aluksi     aluksi    ADV   _     _           3 advm… _     _    
## 6       6 nimen      nimi      NOUN  _     Case=G…     7 nmod… _     _    
## 7       7 Fringilla  Fringilla NOUN  _     Case=A…     3 nmod  _     _    
## 8       8 domestica  domestica X     _     Foreig…     3 dobj  _     _    
## 9       9 .          .         PUNCT _     _           3 punct _     _

Or, if we would like a more human-readable representation of the actual sentence, there is a function called ConllAsSentence:

ConllAsSentence(matched_sentences)

## [1] "Carolus Linnaeus antoi varpuselle aluksi nimen Fringilla domestica."

Keep in mind, however, that these two functions take single sentences as arguments, not vectors of sentences.
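
As a rough idea of what producing such a human-readable sentence involves, the tokens can be joined with spaces and the space before closing punctuation removed. This is a simplified sketch in base R and an assumption about the general approach, not the code behind ConllAsSentence; pretty_sentence_sketch is a hypothetical name.

```r
# Tokens of the matched sentence, in order
tokens <- c("Carolus", "Linnaeus", "antoi", "varpuselle", "aluksi",
            "nimen", "Fringilla", "domestica", ".")

pretty_sentence_sketch <- function(tokens) {
  joined <- paste(tokens, collapse = " ")
  # drop the space that would otherwise precede closing punctuation
  gsub(" ([.,;:!?])", "\\1", joined)
}

pretty <- pretty_sentence_sketch(tokens)
pretty
```

Opening brackets, quotation marks and similar tokens would need additional rules, which the sketch leaves out.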

If we want a different kind of output from ApplyConllFilter, we can specify this via the return_type parameter. This parameter is a string defaulting to “raw” (the kind of output we got above, i.e. just a filtered vector of conll-formatted sentences). Other possible values are:

matches: returns all the words (rows) that match as a single tibble.
both: returns both the matching words and the conll output as an additional column (sent)
both_pretty: same as above but instead of the raw conll string converts the sentence to a human readable format

The matches version of the output is useful if we want to further analyze the results of our queries, not just count how many matches we got. This will be more closely illustrated below.

4.1.1 Caveat: performance

The kind of queries presented above work fine with the toy files used here as examples. With a larger text, however, this is really not an efficient way of mining it. Still, if you don’t mind the query running for a while, you can, of course, give this approach a go. For such cases, ApplyConllFilter shows a progress bar with an estimate of the remaining time (the progress bar itself is part of the dplyr package).

As a larger example, consider querying the entire (English translation of the) gospel of John¹ with a similar filter. This set of sentences is included with the package as a vector called gospel_of_john_sentences. Note that since we are now switching to a different kind of parser output, we must define the expected columns again using the options function.

AnotherFilter <- function(sentence) {
    sentence  %>% 
        FilterConllRows("feat", "^V", use_regex=T) %>% 
        FilterConllRows("feat", "VV", is_negative=T) %>% 
        GetDeps(sentence) %>% 
        FilterConllRows("dep","nsubj") %>% 
        FilterConllRows("feat",c("NN","NNP"))  %>% 
        return
}

options("conll_cols" = c("tokenid","token","lemma","feat","none", "head", "dep"))
start.time <- Sys.time()
matched_sentences <- ApplyConllFilter(gospel_of_john_sentences, AnotherFilter)
show(Sys.time() - start.time)

## Time difference of 4.102786 secs

show(length(matched_sentences))

## [1] 501

As you can see, on my ThinkPad X220 that took around 4 seconds, which is still not terribly bad. To put this in perspective: this version of the gospel of John has 1,218 sentences. Of these, our search criteria were met by 501.

As a somewhat more serious workload, let’s try querying the whole New Testament. That means 9,329 sentences (not included in this package).

The result (on the same 2.5 GHz Core i5 processor): Time difference of 16.20708 secs

So, let’s say 17 seconds for around 10,000 sentences. If the running time scales linearly, 100,000 sentences would be analyzed in about 170 seconds, a little less than 3 minutes. Well, you get the perspective: not the most efficient way, but suitable for many cases and, if you’ve got time to wait, even for some more serious text mining assignments.
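
The back-of-the-envelope estimate can be written out explicitly. The sketch below uses the timing reported above and assumes that running time grows linearly with the number of sentences, which has not been measured here.

```r
secs_measured <- 16.20708  # timing reported above for the New Testament
n_measured    <- 9329      # number of sentences in that run
n_target      <- 100000    # size of a hypothetical larger corpus

# Assumption: time per sentence stays constant as the corpus grows
secs_per_sentence <- secs_measured / n_measured
est_secs <- secs_per_sentence * n_target
est_mins <- est_secs / 60

round(c(seconds = est_secs, minutes = est_mins), 1)
```

With the unrounded figures the estimate comes out at roughly 174 seconds, still a little under three minutes.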

4.1.2 Getting parameters from the results

In the previous examples we retrieved sentences with different filters, and so far we have relied on the raw output of ApplyConllFilter. Let’s move on to a scenario in which we want to get specific parameters out of the filtered results, that is, to specifying return_type = 'matches'.

Consider the previous example with the sentences from the Gospel of John. Our little filter, repeated below in a slightly simplified form (without the step excluding the VV tag), obtained all the cases where a verb had a subject dependent which was a noun:

AnotherFilter <- function(sentence) {
    sentence  %>% 
        FilterConllRows("feat", "^V", use_regex=T) %>% 
        GetDeps(sentence) %>% 
        FilterConllRows("dep","nsubj") %>% 
        FilterConllRows("feat",c("NN","NNP"))  %>% 
        return
}


matched_words <- ApplyConllFilter(gospel_of_john_sentences, AnotherFilter, "matches")

Now we have a tibble called matched_words which contains all the matching words (the corresponding rows from the conll representations of the sentences). With this tibble we can do any statistical operations we would normally do with our data sets. As an especially simple example, think about taking all the subject dependents of verbs in the gospel of John and looking at which lemmas are most frequent as the subject:

matched_words %>% count(lemma)  %>% arrange(desc(n))

## # A tibble: 129 x 2
##    lemma         n
##    <fct>     <int>
##  1 Jesus       187
##  2 Father       29
##  3 man          23
##  4 Peter        19
##  5 one          18
##  6 world        15
##  7 Pilate       14
##  8 Lord         11
##  9 multitude    11
## 10 John         10
## # ... with 119 more rows

No surprise, the most frequent subject is the proper name “Jesus”. But what is it that he keeps on doing? Let’s modify the filter a little:

YetAnotherFilter <- function(sentence) {
    sentence  %>% 
        FilterConllRows("feat", "^V", use_regex=T) %>% 
        GetDeps(sentence) %>% 
        FilterConllRows("dep","nsubj") %>% 
        FilterConllRows("lemma", "Jesus")  %>% 
        GetHeads(sentence) %>% 
        return
}

verbs_with_Jesus <- ApplyConllFilter(gospel_of_john_sentences,YetAnotherFilter,"both_pretty")

The filter became a little circular (we’re first taking all the dependents of verbs, then filtering those and then getting their heads again…), but it does the job:

verbs_with_Jesus %>% count(lemma) %>% arrange(desc(n))

## # A tibble: 35 x 2
##    lemma      n
##    <fct>  <int>
##  1 say       65
##  2 answer    36
##  3 come      16
##  4 do         9
##  5 go         7
##  6 speak      6
##  7 love       6
##  8 see        5
##  9 walk       4
## 10 make       3
## # ... with 25 more rows

Now, what if we would like to combine this information to get ngrams of sorts?

a <- function(sentence) {
    sentence  %>% 
        FilterConllRows("feat", "^V", use_regex=T) %>% 
        GetDeps(sentence) %>% 
        FilterConllRows("dep","nsubj") %>% 
        FilterConllRows("feat",c("NN","NNP"))  %>% 
        return
}


b <- function(sentence) {
    sentence  %>% 
        FilterConllRows("feat", "^V", use_regex=T) %>% 
        GetDeps(sentence) %>% 
        FilterConllRows("dep","nsubj") %>% 
        FilterConllRows("feat",c("NN","NNP"))  %>% 
        GetHeads(sentence) %>% 
        return
}


subj <- ApplyConllFilter(gospel_of_john_sentences,a,"both_pretty")
verb <- ApplyConllFilter(gospel_of_john_sentences,b,"both_pretty")

tab <- subj %>% 
    select(lemma, sent) %>% 
    left_join(verb  %>% select(lemma, sent), by="sent")   %>% 
    count(lemma.x, lemma.y) %>% 
    arrange(lemma.x, desc(n)) %>% 
    filter(n>4) 

tab  %>% print(n=99)

## # A tibble: 18 x 3
##    lemma.x lemma.y     n
##    <fct>   <fct>   <int>
##  1 Jesus   say        65
##  2 Jesus   answer     36
##  3 Jesus   come       24
##  4 Jesus   do         11
##  5 Jesus   go          8
##  6 Jesus   love        7
##  7 Jesus   be          6
##  8 Jesus   speak       6
##  9 Jesus   see         6
## 10 Jesus   know        5
## 11 Father  give        6
## 12 one     come        7
## 13 hour    come        6
## 14 time    come        8
## 15 Christ  come        5
## 16 Peter   say         5
## 17 woman   say         5
## 18 Pilate  say         6
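
One property of joining by the sentence column is worth keeping in mind: when a sentence contains more than one matching subject and verb, the join pairs every subject with every verb of that sentence. This may be one reason why some counts in the table are higher than the verb counts obtained earlier. The toy example below illustrates the effect with base R’s merge; the data frames and sentence ids are made up for illustration.

```r
# One made-up sentence ("s1") with two subjects and two verbs
subj <- data.frame(lemma = c("Jesus", "Peter"), sent = c("s1", "s1"),
                   stringsAsFactors = FALSE)
verb <- data.frame(lemma = c("say", "answer"), sent = c("s1", "s1"),
                   stringsAsFactors = FALSE)

# Joining on the sentence alone yields all 2 x 2 combinations
pairs <- merge(subj, verb, by = "sent", suffixes = c(".x", ".y"))

counts <- as.data.frame(table(lemma.x = pairs$lemma.x, lemma.y = pairs$lemma.y),
                        stringsAsFactors = FALSE)
counts
```

If exact subject-verb pairs are needed, the join key would also have to identify the verb, for example via the head value of the subject and the tokenid of the verb.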

And we could, of course, keep on going and draw some graphs, build statistical models and so on… A correspondence analysis, maybe?

library(ca)

subj %>% 
    select(lemma, sent) %>% 
    left_join(verb  %>% select(lemma, sent), by="sent") %>% 
    filter(lemma.x %in% tab$lemma.x, lemma.y %in% tab$lemma.y)  %>% 
    mutate(lemma.x = as.character(lemma.x), lemma.y=as.character(lemma.y)) %>% 
    ca(~lemma.x + lemma.y, data=.) %>% 
    plot

4.2 Filtering concordances

Although it is certainly possible to use this library for querying entire texts outputted in the conll format, the use case for which the library was created is when you have a data frame containing a list of concordances with more information than just the actual conll-annotated sentence.

The basic idea is to have the data in a data frame / tibble where one row corresponds to one sentence (or one relevant annotated context of any length) and there is a column containing the conll representation of the sentence. In addition, you probably have some other variables, at least metadata and the like. One sample use case is, for instance, when you’ve queried for Twitter data and would like to have each tweet parsed and then use the conll output of the tweet together with the massive amounts of metadata variables provided by the Twitter API for each tweet.

Here’s a toy example of such a task.

Let’s imagine we want to analyze the usage of the word “referee” in recent UEFA Champions League games (recent, because the Twitter API only lets you search for tweets not older than one week). In order to run the following you need to have requested developer access from Twitter, but never mind, the data is also available as a small data set accompanying this package.

#library(rtweet)
#ref <- search_tweets("(#UCL OR #ChampionsLeague OR #UEFAChampionsLeague) referee",  n = 5000)

I used the method described in the context of the webcorpcrawler Python utility to add syntactic annotations using Stanford CoreNLP. This gave me the following data set, available, as mentioned, as part of this package. (I’ve stripped away most of the properties offered by the Twitter API since there are A LOT of them.)

ucl_ref

## # A tibble: 262 x 6
##    text     followers retweets favorites location parsed_text             
##    <chr>        <int>    <int>     <int> <chr>    <chr>                   
##  1 "Here's…      8251        1         1 Berlin,… "1\tHere\there\tRB\t_\t…
##  2 "\"We l…       146        1         0 Kraków,… "1\t``\t``\t``\t_\t3\tp…
##  3 "#Wedne…        58        7         0 lagos c… "1\t#WednesdayWisdom\t#…
##  4 "This r…     20546       28         0 South A… "1\tThis\tthis\tDT\t_\t…
##  5 "They m…      4310        0         0 Lagos N… "1\tThey\tthey\tPRP\t_\…
##  6 "This r…     11689       28         0 ☆S I Y … "1\tThis\tthis\tDT\t_\t…
##  7 "Jurgen…      4566        1         3 New Del… "1\tJurgen\tJurgen\tNNP…
##  8 "#Jurge…       506        0         0 Armenia  "1\t#JurgenKlopp\t#jurg…
##  9 Liverpo…     34247        1         0 Paris, … "1\tLiverpool\tLiverpoo…
## 10 Liverpo…     34247        1         0 Paris, … "1\tLiverpool\tLiverpoo…
## # ... with 252 more rows

In the ucl_ref dataset, the conll annotated raw text is stored in the parsed_text column. Let’s do something analogous to the previous example and look at the verbs the word ‘referee’ is an object of. First, let’s build our filter:

ref_filt <- function(sentence){
    sentence  %>% 
        FilterConllRows("lemma", "referee") %>% 
        FilterConllRows("dep", "dobj") %>% 
        GetHeads(sentence) %>% 
        FilterConllRows("feat", "^V", T) %>% 
        return
}

We could test the filter by using just the parsed_text column as a vector:

matches <- ApplyConllFilter(ucl_ref$parsed_text, ref_filt)

show(length(matches))

## [1] 12

Right, 12 matches, so the filter is working. But how do we filter the actual data frame, not just the column with the annotated sentences? For this use case, the package includes a version of ApplyConllFilter called ApplyConllFilter_df. Its first argument is the tibble (or data frame) to be filtered, the second argument is the custom filter function we’ve created and the third argument is the name of the column containing the conll annotated sentence. To filter our ucl_ref tibble so that we’re left with only the cases where the word referee is used as a direct object of a verb, we would simply run:

ref_as_obj <- ucl_ref %>% 
    ApplyConllFilter_df(ref_filt, "parsed_text")

…and we get the twelve rows that match the condition. We could now check the metadata of the rows to look at, for instance, where these kinds of tweets are coming from:

ref_as_obj  %>%  count(location)

## # A tibble: 12 x 2
##    location                           n
##    <chr>                          <int>
##  1 ""                                 1
##  2 🏴󠁧󠁢󠁳󠁣󠁴󠁿                                 1
##  3 City Of Compton                    1
##  4 England, United Kingdom            1
##  5 Heathrow                           1
##  6 In My Liverpool Home               1
##  7 India                              1
##  8 İstanbul, Türkiye                  1
##  9 London | England                   1
## 10 Northern Ireland                   1
## 11 Royal Tunbridge Wells, England     1
## 12 Scotland                           1
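
Conceptually, filtering the data frame means running the filter on the conll column of every row and keeping the rows where something matched, so that all other columns come along for free. The base R sketch below illustrates this under that assumption; filter_df_sketch, parse_sketch and referee_as_object are hypothetical names, and the two parsed tweets are made up (in the seven-column layout used for the Stanford parser output).

```r
cols <- c("tokenid", "token", "lemma", "feat", "none", "head", "dep")

# Minimal parser for one conll sentence (hypothetical helper)
parse_sketch <- function(s) {
  fields <- strsplit(strsplit(s, "\n", fixed = TRUE)[[1]], "\t", fixed = TRUE)
  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(df) <- cols
  df
}

# Filter: rows where "referee" is a direct object
referee_as_object <- function(sentence) {
  df <- parse_sketch(sentence)
  df[df$lemma == "referee" & df$dep == "dobj", , drop = FALSE]
}

# Keep the rows of `data` whose conll column yields at least one match
filter_df_sketch <- function(data, filter_fn, conll_col) {
  keep <- vapply(data[[conll_col]], function(s) nrow(filter_fn(s)) > 0,
                 logical(1), USE.NAMES = FALSE)
  data[keep, , drop = FALSE]
}

# Two made-up tweets with metadata
tweets <- data.frame(
  location = c("Lagos", "Berlin"),
  parsed_text = c(
    "1\tThey\tthey\tPRP\t_\t2\tnsubj\n2\tblame\tblame\tVBP\t_\t0\troot\n3\tthe\tthe\tDT\t_\t4\tdet\n4\treferee\treferee\tNN\t_\t2\tdobj",
    "1\tThe\tthe\tDT\t_\t2\tdet\n2\treferee\treferee\tNN\t_\t3\tnsubj\n3\twhistled\twhistle\tVBD\t_\t0\troot"
  ),
  stringsAsFactors = FALSE
)

kept <- filter_df_sketch(tweets, referee_as_object, "parsed_text")
kept$location
```

Only the first tweet uses referee as a direct object, so only its row, including the location metadata, is kept.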

We can also specify that our data set filter should return an additional column (called filtered_col) which contains some property of the word (TODO: words) matched by the filter. For instance, if we would like to get the lemmas of the head verbs in the cases where referee is used as a direct object, we could do:

filtered <- ucl_ref %>% 
    ApplyConllFilter_df(ref_filt, "parsed_text", return_col="lemma")

Now we could check, for instance, if a certain verb + referee as the object combination attracts more retweets than others by using the metadata available in the data frame:

filtered %>% 
    count(filtered_col, retweets) %>% 
    arrange(desc(n))

## # A tibble: 10 x 3
##    filtered_col retweets     n
##    <fct>           <int> <int>
##  1 get                 0     2
##  2 surround            0     2
##  3 criticise           0     1
##  4 play                0     1
##  5 berate              0     1
##  6 pay                 0     1
##  7 include             0     1
##  8 hope                0     1
##  9 go                  0     1
## 10 announce            0     1

Well, obviously the data set here is too limited to give any kind of reasonable results, but you get the idea.

As a final remark about the ApplyConllFilter_df function, it should be added that there is also the possibility to return the whole data frame with the filtered_col column so that for the rows that don’t match the syntactic filter the function just returns NA:

filtered2 <- ucl_ref %>% 
    ApplyConllFilter_df(ref_filt, "parsed_text", return_col="lemma", return_all = T) 


filtered2  %>% count(filtered_col) %>% arrange(desc(n))

## # A tibble: 11 x 2
##    filtered_col     n
##    <chr>        <int>
##  1 <NA>           250
##  2 get              2
##  3 surround         2
##  4 announce         1
##  5 berate           1
##  6 criticise        1
##  7 go               1
##  8 hope             1
##  9 include          1
## 10 pay              1
## 11 play             1
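
The idea of keeping every row and filling in NA where nothing matched can be sketched in base R as follows. This is an illustration of the concept only: head_lemma_of_object and parse_sketch are hypothetical helpers, the column name filtered_col is borrowed from the output above, and the two parsed tweets are made up.

```r
cols <- c("tokenid", "token", "lemma", "feat", "none", "head", "dep")

# Minimal parser for one conll sentence (hypothetical helper)
parse_sketch <- function(s) {
  fields <- strsplit(strsplit(s, "\n", fixed = TRUE)[[1]], "\t", fixed = TRUE)
  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(df) <- cols
  df
}

# Lemma of the head of "referee" when it is a direct object, otherwise NA
head_lemma_of_object <- function(s) {
  df  <- parse_sketch(s)
  obj <- df[df$lemma == "referee" & df$dep == "dobj", , drop = FALSE]
  if (nrow(obj) == 0) return(NA_character_)
  head_lemma <- df$lemma[df$tokenid == obj$head[1]]
  if (length(head_lemma) == 1) head_lemma else NA_character_
}

# Two made-up tweets with metadata
tweets <- data.frame(
  retweets = c(3L, 0L),
  parsed_text = c(
    "1\tThey\tthey\tPRP\t_\t2\tnsubj\n2\tblame\tblame\tVBP\t_\t0\troot\n3\tthe\tthe\tDT\t_\t4\tdet\n4\treferee\treferee\tNN\t_\t2\tdobj",
    "1\tThe\tthe\tDT\t_\t2\tdet\n2\treferee\treferee\tNN\t_\t3\tnsubj\n3\twhistled\twhistle\tVBD\t_\t0\troot"
  ),
  stringsAsFactors = FALSE
)

# Every row is kept; rows without a match get NA in the new column
tweets$filtered_col <- vapply(tweets$parsed_text, head_lemma_of_object,
                              character(1), USE.NAMES = FALSE)
tweets[, c("retweets", "filtered_col")]
```

Keeping the non-matching rows makes it possible to compare matching and non-matching cases on the metadata columns within one data frame.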

¹ Also included as a sample file, taken from the public domain World English version and parsed with the Stanford CoreNLP parser.

DepsearcheR - querying syntactically annotated data with R