This R package wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:
For users unfamiliar with Conditional Random Field (CRF) models, you can read this excellent tutorial https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
In order to build a CRF model, you need to have
Generally the labels follow the IOB
type of scheme which
look something like: B-ORG, I-ORG, B-YOUROWNLABEL, I-YOUROWNLABEL or O.
Indicating the beginning of a certain category (B-)
, the
intermediate part of a certain category (I-)
or outside the
category (O)
.
I went to the New York City District on holidays
would
e.g. be labelled as
O, O, O, O, B-LOCATION, I-LOCATION, I-LOCATION, I-LOCATION, O, O
The attributes of the observations are mostly something like the term itself, the neighbouring terms, the parts of speech, the neighbouring parts of speech or any specific feature you can extract and which is relevant to your business domain (e.g. the number of numbers in the token, how far is it from the start of the document or end of the document, is the token capitalised, does it contain an ampersand, …).
As an example, let’s get some data in Dutch for doing Named Entity Recognition which was distributed as part of the CoNLL-2002 shared task challenge. This dataset contains 1 row per term and provides entity labels as well as the parts of speech tag for each term.
library(crfsuite)
x <- ner_download_modeldata("conll2002-nl")
subset(x, doc_id == 100)
data doc_id sentence_id token pos label
1: ned.train 100 8882 EK Pron B-MISC
2: ned.train 100 8882 Magazine N I-MISC
3: ned.train 100 8882 Canvas N B-ORG
4: ned.train 100 8882 23.45 Num O
5: ned.train 100 8883 Tourjournaal N B-MISC
---
343: ned.train 100 8916 gepresenteerd V O
344: ned.train 100 8916 door Prep O
345: ned.train 100 8916 Stef N B-PER
346: ned.train 100 8916 Wijnants N I-PER
347: ned.train 100 8916 . Punc O
As basic feature enrichment we add the parts of speech tag of the preceding and the next term which we will use later when building the model and do the same for the token. The R package data.table has a nice shift function for this.
library(data.table)
x <- as.data.table(x)
x <- x[, pos_previous := shift(pos, n = 1, type = "lag"), by = list(doc_id)]
x <- x[, pos_next := shift(pos, n = 1, type = "lead"), by = list(doc_id)]
x <- x[, token_previous := shift(token, n = 1, type = "lag"), by = list(doc_id)]
x <- x[, token_next := shift(token, n = 1, type = "lead"), by = list(doc_id)]
Note that CRFsuite handles all attributes equivalently, in order to distinguish between the columns, we need to prepend the column name logic to each column similar as shown at http://www.chokkan.org/software/crfsuite/tutorial.html. This is done using a custom txt_sprintf function which is similar as sprintf but handles NA values gracefully.
x <- x[, pos_previous := txt_sprintf("pos[w-1]=%s", pos_previous), by = list(doc_id)]
x <- x[, pos_next := txt_sprintf("pos[w+1]=%s", pos_next), by = list(doc_id)]
x <- x[, token_previous := txt_sprintf("token[w-1]=%s", token_previous), by = list(doc_id)]
x <- x[, token_next := txt_sprintf("token[w-1]=%s", token_next), by = list(doc_id)]
subset(x, doc_id == 100, select = c("doc_id", "token", "token_previous", "token_next"))
doc_id token token_previous token_next
1: 100 EK <NA> token[w-1]=Magazine
2: 100 Magazine token[w-1]=EK token[w-1]=Canvas
3: 100 Canvas token[w-1]=Magazine token[w-1]=23.45
4: 100 23.45 token[w-1]=Canvas token[w-1]=Tourjournaal
5: 100 Tourjournaal token[w-1]=23.45 token[w-1]=Canvas
---
343: 100 gepresenteerd token[w-1]=, token[w-1]=door
344: 100 door token[w-1]=gepresenteerd token[w-1]=Stef
345: 100 Stef token[w-1]=door token[w-1]=Wijnants
346: 100 Wijnants token[w-1]=Stef token[w-1]=.
347: 100 . token[w-1]=Wijnants <NA>
x <- as.data.frame(x)
Once you have data which are tagged with your own categories, you can build a CRF model. On the previous data, split it into a training and test dataset.
crf_train <- subset(x, data == "ned.train")
crf_test <- subset(x, data == "testa")
And start building your model.
model <- crf(y = crf_train$label,
x = crf_train[, c("pos", "pos_previous", "pos_next",
"token", "token_previous", "token_next")],
group = crf_train$doc_id,
method = "lbfgs", file = "tagger.crfsuite",
options = list(max_iterations = 25, feature.minfreq = 5, c1 = 0, c2 = 1))
model
Conditional Random Field saved at C:\Users\jwijf\AppData\Local\Temp\RtmpqOrzeH\Rbuild355c11b5666e\crfsuite\vignettes\tagger.crfsuite
size of the model in Mb: 0.79
number of categories: 9
category labels: O, B-ORG, B-MISC, B-PER, I-PER, B-LOC, I-MISC, I-ORG, I-LOC
stats <- summary(model)
Summary statistics of last iteration:
Loss: 37014.993192
Feature norm: 30.527576
Error norm: 2230.388754
Active features: 11822
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.229
Dumping summary of the model to file C:\Users\jwijf\AppData\Local\Temp\RtmpIbhgnl\crfsuite_5a7c455c3212.txt
plot(stats$iterations$loss, pch = 20, type = "b",
main = "Loss evolution", xlab = "Iteration", ylab = "Loss")
weights <- coefficients(model, encoding = "UTF-8")
head(weights$transitions)
from to weight
1 B-PER I-PER 7.865476
2 B-ORG I-ORG 6.420383
3 B-MISC I-MISC 6.356296
4 O B-MISC 6.016069
5 O O 5.983449
6 I-MISC I-MISC 5.877581
head(subset(weights$states, label %in% "B-LOC"), n = 10)
attribute label weight
11 Brussel B-LOC 2.297898
19 token[w-1]=in B-LOC 1.820405
21 token[w-1]=) B-LOC 1.756937
24 België B-LOC 1.705850
32 Fra B-LOC 1.492787
37 N B-LOC 1.409908
43 pos[w-1]=Punc B-LOC 1.312599
54 Ita B-LOC 1.104971
62 Antwerpen B-LOC 1.078814
65 token[w-1]=naar B-LOC 1.069607
You can use the model to get predictions of the named entity / chunks / categories you have trained. Below this is done on the holdout data. Provide the model, your data with the attributes and indicate the group the attributes belong to.
scores <- predict(model,
newdata = crf_test[, c("pos", "pos_previous", "pos_next",
"token", "token_previous", "token_next")],
group = crf_test$doc_id)
crf_test$entity <- scores$label
table(crf_test$entity, crf_test$label)
B-LOC B-MISC B-ORG B-PER I-LOC I-MISC I-ORG I-PER O
B-LOC 187 6 46 26 0 0 3 0 95
B-MISC 1 44 13 4 0 0 0 0 29
B-ORG 2 17 109 6 0 5 1 0 20
B-PER 18 49 83 289 4 7 27 13 111
I-LOC 4 1 2 1 14 2 7 2 8
I-MISC 1 4 7 5 0 49 22 11 50
I-ORG 3 8 19 19 3 30 130 23 56
I-PER 5 10 16 42 19 28 107 319 87
O 258 609 391 311 24 94 99 55 33517
In order to facilitate creating training data on your own data, with your own categories, a Shiny app is put inside this R package. To go short, this app allows you to:
To start the app, make sure you have the following packages installed.
install.packages("shiny")
install.packages("flexdashboard")
install.packages("DT")
install.packages("writexl")
And run the app with
rmarkdown::run(file = system.file(package = "crfsuite", "app", "annotation.Rmd"))
The app was developed with shiny 1.4.0.2, flexdashboard 0.5.1.1, DT 0.13, writexl 1.2 and rmarkdown 1.6
When building the model, you need to
In order to identify the parameters of the algorithm, look e.g. at
crf_options("lbfgs")
crf_options("l2sgd")
If you train the model with different algorithm parameters, you probably are interested to see the Precision / Recall / F1 statistics to compare them alongside the model hyperparameters. You can easily get these with the caret R package.
library(caret)
overview <- confusionMatrix(crf_test$entity, crf_test$label, mode = "prec_recall")
overview$overall
overview$byClass[, c("Precision", "Recall", "F1")]
To obtain better models, you need to do feature engineering specific to your business domain.
This example below starts from scratch assuming that you have plain text and you annotated some chunks using the app in this package. Below the manually annotated dataset is shown.
library(crfsuite)
library(udpipe)
library(data.table)
data(airbnb_chunks, package = "crfsuite")
str(airbnb_chunks)
Classes 'chunkrange' and 'data.frame': 1091 obs. of 8 variables:
$ annotation_time: POSIXct, format: "2018-09-02 22:47:53" "2018-09-02 22:47:59" "2018-09-02 22:48:07" "2018-09-02 22:48:15" ...
$ doc_id : int 26261897 26261897 26261897 11412934 11412934 11412934 11412934 19782360 19782360 19782360 ...
$ text : chr "Fijn, ruim appartement. Locatie vlakbij Manneken Pis en de Grote Markt, dus uitstekend gelegen.\nBadkamer was s"| __truncated__ "Fijn, ruim appartement. Locatie vlakbij Manneken Pis en de Grote Markt, dus uitstekend gelegen.\nBadkamer was s"| __truncated__ "Fijn, ruim appartement. Locatie vlakbij Manneken Pis en de Grote Markt, dus uitstekend gelegen.\nBadkamer was s"| __truncated__ "Het appartement van Salvatore ligt midden in de trendy modewijk. Het is een ideale locatie om Brussel te verken"| __truncated__ ...
$ start : num 41 59 156 21 95 192 424 100 220 292 ...
$ end : num 52 70 160 29 101 200 437 104 224 296 ...
$ chunk_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ chunk_entity : chr "LOCATION" "LOCATION" "PERSON" "PERSON" ...
$ chunk : chr "Manneken Pis" " Grote Markt" "Aline" "Salvatore" ...
We want to build a classifier for the following categories:
table(airbnb_chunks$chunk_entity)
DISTANCE LOCATION PERSON
111 419 464
In order to build the training dataset, we need to have data at the token level. In the example below, this is done using the udpipe R package (https://CRAN.R-project.org/package=udpipe).
## Annotate text data with udpipe (version >= 0.7)
udmodel <- udpipe_download_model("dutch")
udmodel <- udpipe_load_model(udmodel$file_model)
airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
airbnb_tokens <- udpipe(x = airbnb_tokens, object = udmodel)
str(airbnb_tokens)
'data.frame': 29092 obs. of 17 variables:
$ doc_id : chr "26261897" "26261897" "26261897" "26261897" ...
$ paragraph_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ sentence_id : int 1 1 1 1 1 2 2 2 2 2 ...
$ sentence : chr "Fijn, ruim appartement." "Fijn, ruim appartement." "Fijn, ruim appartement." "Fijn, ruim appartement." ...
$ start : int 1 5 7 12 23 25 33 41 50 54 ...
$ end : int 4 5 10 22 23 31 39 48 52 55 ...
$ term_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ token_id : chr "1" "2" "3" "4" ...
$ token : chr "Fijn" "," "ruim" "appartement" ...
$ lemma : chr "fijn" "," "ruim" "appartement" ...
$ upos : chr "PROPN" "PUNCT" "ADJ" "VERB" ...
$ xpos : chr "N|eigen|ev|basis|zijd|stan" "LET" "ADJ|vrij|basis|zonder" "WW|vd|vrij|zonder" ...
$ feats : chr "Gender=Com|Number=Sing" NA "Degree=Pos" "VerbForm=Part" ...
$ head_token_id: chr "0" "4" "4" "1" ...
$ dep_rel : chr "root" "punct" "advmod" "parataxis" ...
$ deps : chr NA NA NA NA ...
$ misc : chr "SpaceAfter=No" NA NA "SpaceAfter=No" ...
Once you have the data in 1 row per doc_id/token, you can enrich this
with the chunk entity. Next, by using the function
crf_cbind_attributes
we enrich the training data by adding
relevant attributes of words in the neighbourhood of the word. We added
also a basic column indicating if the term is in the beginning or end of
the sentence (bos/eos). Based on that dataset, a model can be built.
x <- merge(airbnb_chunks, airbnb_tokens)
table(x$chunk_entity)
B-DISTANCE B-LOCATION B-PERSON I-DISTANCE I-LOCATION I-PERSON O
110 411 451 367 214 39 27500
## Indicate beginning of sequence and end of sequence and sequence position
x <- as.data.table(x)
x <- x[, bos := sprintf("BOS+%s", (1:.N)-1), by = list(doc_id)]
x <- x[, eos := sprintf("EOS-%s", (.N:1)-1), by = list(doc_id)]
x <- as.data.frame(x)
## Add preceding and next tokens and parts of speech tags
x <- crf_cbind_attributes(x, terms = c("lemma", "upos"),
by = c("doc_id", "sentence_id"), ngram_max = 3, sep = "|")
attributes <- c("bos", "eos", grep("lemma|upos", colnames(x), value=TRUE))
model <- crf(y = x$chunk_entity, x = x[, attributes],
group = x$doc_id,
method = "lbfgs")
scores <- predict(model, newdata = x[, attributes], group = x$doc_id)
barplot(table(scores$label[scores$label != "O"]), col = "royalblue", cex.names = 0.75)