Introduction to parse and the lookup_ functions


This package aims to extract identifiable drug names from a corpus of text. We assume that the corpus has already been imported into R.

Data: drug_df

Throughout this vignette, we will employ a sample dataset - drug_df - that is intended to represent data collected from a clinical trial. The dataset contains 3 variables and 500 observations.

   [1] 500   3

   tibble [500 x 3] (S3: tbl_df/tbl/data.frame)
    $ textdrug: chr [1:500] "Remeron" "Remeron" "Soma" "Ectacy" ...
    $ sex     : chr [1:500] "male" "female" "female" "female" ...
    $ race    : chr [1:500] "black" "ai/an" "ai/an" "hn/pi" ...

   [1] "tbl_df"     "tbl"        "data.frame"

Note: drug_df is a simulated dataset and does not reflect any true clinical observations.


The parse() function is intended to extract identifiable drug names from a corpus of text such as clinical trial data, social media, or survey or interview transcriptions. parse() takes one argument: the vector that contains the strings to be parsed.

Here are some problematic records in the drug_df dataset that warrant the use of parse():

library(dplyr)
library(stringr)

messy_data <- drug_df %>%
  # select records that contain problematic characters
  filter(str_detect(textdrug, ",|;|and|\\/|=|\\("))

Percocets and Vicodin
Barbiturate (doesn’t know which)
heroin - “few days on, few days off”
heroin- "a few days on, few days off
Ambien = 2 pills
Ambien “a bunch” = 2 pills
promethazine (25mg), clonidine (0.1mg)

As you can see, these records contain many extraneous or problematic characters, multiple drugs in a single record, and several variations of the same drug (e.g. “bup/nx”). We assume that the user is solely interested in the drugs themselves, not in information such as dosage and units.

This messy data is exactly what parse() was designed for.

drug_names <- DOPE::parse(messy_data$textdrug)

    [1] "bup/nx"       "bup/nx"       "bup/nx"       "percocets"    "vicodin"     
    [6] "barbiturate"  "heroin"       "ambien"       "ambien"       "promethazine"
   [11] "clonidine"   
   [1] 8 9
   [1] "omit"

Notice that parse() cleans up the capitalization and punctuation of ‘bup/nx’; it has special code to handle cases of ‘bup/nx’ and ‘speedball’. It also recognizes that the final record, “promethazine (25mg), clonidine (0.1mg)”, contains two drugs and separates them. See the tidytext package.1
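To make the cleaning steps described above more concrete, here is a minimal base-R sketch. It is not the implementation used by parse(): the helper name sketch_parse and its splitting rules are assumptions chosen for illustration, and unlike parse() it does not remove dosage phrases such as "2 pills" or apply the special handling for ‘bup/nx’ and ‘speedball’.

```r
# Hypothetical helper, not part of DOPE: a rough approximation of the kind of
# cleaning described above, using only base R.
sketch_parse <- function(x) {
  # Normalize capitalization.
  x <- tolower(x)
  # Replace parenthetical notes such as "(25mg)" with a space.
  x <- gsub("\\([^)]*\\)", " ", x)
  # Split each record on commas, semicolons, equals signs, or the word "and".
  tokens <- unlist(strsplit(x, ",|;|=|\\band\\b", perl = TRUE))
  # Trim surrounding whitespace and drop empty pieces.
  tokens <- trimws(tokens)
  tokens[nzchar(tokens)]
}

sketch_parse(c("Percocets and Vicodin",
               "promethazine (25mg), clonidine (0.1mg)"))
```

Each record is lowercased, stripped of its parenthetical notes, and split into one element per drug, which mirrors the shape of the parse() output shown above.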

The resulting vector can then be passed on to the lookup_* functions to identify whether the input drug is a class, category or a synonym for other drugs in the same category.



The lookup() function relies on a comprehensive lookup table, lookup_df. This dataframe contains 3 variables: class, category and synonym.

These names were based on terms used by the DEA.2

   [1] 4766    3

   'data.frame':    4766 obs. of  3 variables:
    $ class   : chr  "hallucinogen" "hallucinogen" "hallucinogen" "hallucinogen" ...
    $ category: chr  "2cb" "2cb" "2cb" "2cb" ...
    $ synonym : chr  "banana split" "bdmpea" "bromo" "mft" ...

   [1] "data.frame"

The purpose of lookup() is to return any possible matches in lookup_df, which is a comprehensive dataframe consisting of all drug classes, categories and synonyms. It serves as a helper function for many of the other, more specific functions in the package. The idea is to match a single word, a list of separate words, or a vector passed as an argument against any of the columns. The dataframe returned consists of the matching lookup_df rows as well as original_word, the word that was the source of the match.

Here is an example of a common search done using lookup.

results <- lookup(unique(drug_names))
head(results, 15)
   original_word  class              category                              synonym
   bup/nx         treatment drug     treatment drug                        bup/nx
   percocets      NA                 NA                                    NA
   vicodin        narcotic (opioid)  codeine combinations, non-injectable  vicodin
   barbiturate    depressant         barbiturate                           fiorina
   barbiturate    depressant         barbiturate                           nembutal
   barbiturate    depressant         barbiturate                           pentothal
   barbiturate    depressant         barbiturate                           seconal
   heroin         heroin             heroin                                a-bomb
   heroin         heroin             heroin                                a-bomb (mixed with marijuana)
   heroin         heroin             heroin                                achivia
   heroin         heroin             heroin                                adormidera
   heroin         heroin             heroin                                aip
   heroin         heroin             heroin                                al capone
   heroin         heroin             heroin                                antifreeze
   heroin         heroin             heroin                                aries

You can see that the returned dataframe can contain a very large number of matches (heroin alone returns another few hundred), so the more specific functions below might be of more use depending on one’s needs.
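The matching idea behind lookup() can be sketched in base R as follows. This is only an illustration under stated assumptions: toy_lookup_df is a four-row stand-in built from rows shown in this vignette, sketch_lookup is a hypothetical helper, and exact string equality is assumed as the matching rule, which may differ from what lookup() actually does.

```r
# Toy stand-in for lookup_df; the rows are copied from output shown in this
# vignette and cover only a tiny fraction of the real table.
toy_lookup_df <- data.frame(
  class    = c("hallucinogen", "hallucinogen", "depressant", "depressant"),
  category = c("2cb", "2cb", "barbiturate", "barbiturate"),
  synonym  = c("banana split", "bromo", "nembutal", "seconal"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every row in which a word appears in any
# column, tagged with the word that produced the match.
sketch_lookup <- function(words, lookup_table = toy_lookup_df) {
  pieces <- lapply(words, function(w) {
    hit <- lookup_table$class == w |
      lookup_table$category == w |
      lookup_table$synonym == w
    if (!any(hit)) {
      # Unmatched words are kept as a single row of NAs.
      return(data.frame(original_word = w, class = NA_character_,
                        category = NA_character_, synonym = NA_character_,
                        stringsAsFactors = FALSE))
    }
    data.frame(original_word = w, lookup_table[hit, ],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

sketch_lookup(c("barbiturate", "bromo", "percocets"))
```

A category such as "barbiturate" expands to one row per synonym, while an unmatched word such as "percocets" is kept as a row of NAs, similar to the example output above.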


The compress_lookup() function takes in one argument: the table returned from a search using the lookup function. Its purpose is to narrow down the results to a more specific dataframe consisting of only relevant values, such as class and/or category depending on the user’s selection. By default, compress_lookup returns original_word, class and category.

If a researcher wanted to determine the main classes of drugs being used by the patients of a clinical study, they might pass a large vector of substances from clinical notes taken in the study to the lookup function, then filter the results down to a dataframe of only the classes relevant to their needs.

Here is an example of a common search done using compress_lookup.

filtered_df <- compress_lookup(results)
     original_word             class                             category
   1        bup/nx    treatment drug                       treatment drug
   2     percocets              <NA>                                 <NA>
   3       vicodin narcotic (opioid) codeine combinations, non-injectable
   4   barbiturate        depressant                          barbiturate
   5        heroin            heroin                               heroin
   6        ambien           Unknown                              Unknown

The resulting dataframe is a short list of only the relevant information needed.
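The narrowing step can be approximated in base R by keeping a subset of columns and dropping duplicate rows. The sketch below is an assumption about the general approach, not the package's code: toy_results, sketch_compress and the keep argument are hypothetical names introduced for illustration.

```r
# Toy lookup() result; rows are taken from the example table in this vignette.
toy_results <- data.frame(
  original_word = c("barbiturate", "barbiturate", "heroin", "heroin"),
  class         = c("depressant", "depressant", "heroin", "heroin"),
  category      = c("barbiturate", "barbiturate", "heroin", "heroin"),
  synonym       = c("nembutal", "seconal", "a-bomb", "aip"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the requested columns and drop duplicate rows.
sketch_compress <- function(results,
                            keep = c("original_word", "class", "category")) {
  out <- unique(results[, keep, drop = FALSE])
  rownames(out) <- NULL
  out
}

sketch_compress(toy_results)
```

Once the synonym column is dropped, the rows for each word become identical, so removing duplicates leaves one row per original_word.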


The purpose of lookup_syn() is to find all possible synonyms of, primarily, a slang or street name of a commonly abused drug. Though searching for a drug class or category with lookup() will also return common synonyms, this function makes searching specifically for synonyms explicit by taking just one argument: drug_name. The function then determines the category of the slang term (drug_name) and returns all synonyms that share that category.

Here is an example of a common search done using lookup_syn.

results <- lookup_syn("shrooms")
   1  mushrooms
   2 psilocybin

The resulting dataframe contains a moderately sized list of synonyms for the given drug_name, as determined by sources such as the DEA, FDA and other publicly available resources.
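The two steps described for lookup_syn (find the category of the term, then return the synonyms that share it) can be sketched in base R. This is a hedged illustration only: toy_lookup_df is a small stand-in table built from rows shown in this vignette, sketch_lookup_syn is a hypothetical helper, and excluding the search term from the result is an assumption based on the example output.

```r
# Toy stand-in for lookup_df, using rows shown earlier in this vignette.
toy_lookup_df <- data.frame(
  class    = c("hallucinogen", "hallucinogen", "hallucinogen", "depressant"),
  category = c("2cb", "2cb", "2cb", "barbiturate"),
  synonym  = c("banana split", "bdmpea", "bromo", "seconal"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the other synonyms in the same category.
sketch_lookup_syn <- function(drug_name, lookup_table = toy_lookup_df) {
  # Categories in which the term appears as a synonym.
  cats <- unique(lookup_table$category[lookup_table$synonym == drug_name])
  # All synonyms sharing those categories, excluding the search term itself.
  syns <- lookup_table$synonym[lookup_table$category %in% cats]
  setdiff(syns, drug_name)
}

sketch_lookup_syn("bromo")
```

A term that is not found in the synonym column yields an empty character vector in this sketch.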