Using data.table to match multiple patterns against multiple strings in R

library(data.table)
dat1 <- data.table(id1 = c(1, 1, 2),
pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.table(id2 = c(1174, 1231),
description = c("apple is sweet", "apple is a computer"),
memo = c("bananas, sweet yes", "bananas, sweetyes"))
> dat1
   id1        pattern
1:   1          apple
2:   1      applejack
3:   2 bananas, sweet
> dat2
    id2         description               memo
1: 1174      apple is sweet bananas, sweet yes
2: 1231 apple is a computer  bananas, sweetyes
I have two data.tables, dat1 and dat2. I want to search for each pattern in dat1 against the description and memo columns in dat2 and store the corresponding id2s.
The final output table should look something like this:
   id1        pattern description_match memo_match
1:   1          apple         1174,1231       <NA>
2:   1      applejack              <NA>       <NA>
3:   2 bananas, sweet              <NA>       1174
The regular expression I want to use is \\b[pattern]\\b. Below is my attempt:
dat1[, description_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$description), .(id2 = paste(id2, collapse = ","))]]
dat1[, memo_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$memo), .(id2 = paste(id2, collapse = ","))]]
However, both give me the error that grepl can only use the first pattern.
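The message arises because grepl() takes a single pattern and is vectorised only over the strings being searched. A minimal base R sketch of the intended result, looping over the patterns one at a time (the helper name match_ids is made up for illustration and is not part of any package):

```r
# Data from the question, as plain vectors
patterns <- c("apple", "applejack", "bananas, sweet")
ids <- c(1174, 1231)
description <- c("apple is sweet", "apple is a computer")
memo <- c("bananas, sweet yes", "bananas, sweetyes")

# Hypothetical helper: for each pattern, collect the ids whose text matches
match_ids <- function(patterns, text, ids) {
  vapply(patterns, function(p) {
    hit <- grepl(paste0("\\b", p, "\\b"), text)  # one pattern per grepl() call
    if (any(hit)) paste(ids[hit], collapse = ",") else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

description_match <- match_ids(patterns, description, ids)
memo_match <- match_ids(patterns, memo, ids)
description_match  # "1174,1231" NA NA
memo_match         # NA NA "1174"
```

This is only a sketch for small inputs; it makes one pass over the text per pattern.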

We group by row sequence and create the match columns from 'dat2' by pasting together the 'id2' values selected by the logical output of grepl
library(data.table)
dat1[, c("description_match", "memo_match") := {
pat <- sprintf('\\b(%s)\\b', paste(pattern, collapse = "|"))
.(toString(dat2$id2[grepl(pat, dat2$description)]),
toString(dat2$id2[grepl(pat, dat2$memo)]))
}, seq_along(id1)]
dplyr::na_if(dat1, "")
     id1        pattern description_match memo_match
   <num>         <char>            <char>     <char>
1:     1          apple        1174, 1231       <NA>
2:     1      applejack              <NA>       <NA>
3:     2 bananas, sweet              <NA>       1174
According to ?sprintf
The string fmt contains normal characters, which are passed through to the output string, and also conversion specifications which operate on the arguments provided through .... The allowed conversion specifications start with a % and end with one of the letters in the set aAdifeEgGosxX%
s - Character string. Character NAs are converted to "NA".
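To make the role of %s concrete, here is a small sketch of the string that sprintf builds for a single pattern and how it behaves in grepl. The test strings are chosen for illustration.

```r
pattern <- "apple"

# %s is replaced by the pattern; the doubled backslashes become single
# backslashes in the resulting string
pat <- sprintf('\\b(%s)\\b', pattern)
cat(pat, "\n")  # \b(apple)\b

# The word boundaries stop "apple" from matching inside "applejack"
hits <- grepl(pat, c("apple is sweet", "applejack"))
hits  # TRUE FALSE
```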
Or use a for loop
for(nm in names(dat2)[-1])
dat1[, paste0(nm, "_match") :=
toString(dat2$id2[grepl(paste0("\\b", pattern, "\\b"),
dat2[[nm]])]), seq_along(id1)][]

Related

Get indices of matches with a column in a second data.table

I have two data.tables. Each has a column called 'firstName' and another called 'lastName', which contain some values which will match each other and some that won't. Some values in both data sets might be duplicated.
I want to add a new column to the second data.table, in which I will store the indices of matches from the first data set for each element of 'firstName' within the second data set. I will then repeat the whole matching process with the 'lastName' column and get the intersect of index matches for 'firstName' and 'lastName'. I will then use the intersect of the indices to fetch the case ID (cid) from the first data set and append it to the second data set.
Because there might be more than one match per element, I will store them as lists within my data.table. I cannot use base::match function because it will only return the first match for each element, but I do need the answer to be vectorised in just the same way as the match function.
I've tried different combinations of which(d1$x %in% y) but this does not work either because it matches for all of y at once instead of one element at a time. I am using data.table because for my real-world use case, the data set to match on could be hundreds of thousands of records, so speed is important.
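The limitation described above can be seen in a small base R sketch, using the first names from the example data as plain vectors. The lapply/which version is shown only to illustrate the desired shape of the result and is not meant as a fast solution.

```r
lookup <- c("Jim", "Joe", "Anne", "Jim", "Anne")  # dt1$firstName
query <- c("Maria", "Jim", "Jack", "Anne")        # dt2$firstName

# match() returns only the first matching position per element
first_only <- match(query, lookup)
first_only  # NA 1 NA 3

# One which() per element returns every matching position
all_matches <- lapply(query, function(q) which(lookup == q))
all_matches[[2]]  # 1 4
all_matches[[4]]  # 3 5
```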
I have found a related question here, but I can't quite figure out how to efficiently convert this to data.table syntax.
Here is some example data:
# Load library
library(data.table)
# First data set (lookup table):
dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
# Second data set (data to match with case IDs from lookup table):
dt2 <- data.table(lid = c(1, 2, 3, 4),
firstName = c("Maria", "Jim", "Jack", "Anne"),
lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
result = c("pos", "neg", "neg", "pos"))
My desired output would look like this:
# Output:
> dt2
lid firstName lastName result fn_match ln_match casematch caseid
1: 1 Maria Antonis pos NA NA NA <NA>
2: 2 Jim Dutto neg 1,4 4 4 c4
3: 3 Jack Blogs neg NA NA NA <NA>
4: 4 Anne Mcfee pos 3,5 3 3 c3
A possible solution:
dt1[,id:=seq_along(cid)]
dt1[dt2,.(lid,id,firstName = i.firstName),on=.(firstName)][
,.(casematch =.( id)),by=.(lid,firstName)]
lid firstName casematch
<num> <char> <list>
1: 1 Maria NA
2: 2 Jim 1,4
3: 3 Jack NA
4: 4 Anne 3,5
We could use
library(data.table)
dt1[dt2, .(casematch = toString(cid), lid),on = .(firstName), by = .EACHI]
-output
firstName casematch lid
<char> <char> <num>
1: Maria NA 1
2: Jim c1, c4 2
3: Jack NA 3
4: Anne c3, c5 4
Or with row index
dt1[dt2, .(casematch = dplyr::na_if(toString(.I), "0"), lid),on = .(firstName), by = .EACHI]
firstName casematch lid
<char> <char> <num>
1: Maria <NA> 1
2: Jim 1, 4 2
3: Jack <NA> 3
4: Anne 3, 5 4
Using .EACHI and adding the resulting list column by reference.
dt2[ , res := dt1[ , i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
# lid firstName res
# 1: 1 Maria NA
# 2: 2 Jim 1,4
# 3: 3 Jack NA
# 4: 4 Anne 3,5
Another data.table option
> dt1[, .(cid = toString(cid)), firstName][dt2, on = .(firstName)]
firstName cid lid
1: Maria <NA> 1
2: Jim c1, c4 2
3: Jack <NA> 3
4: Anne c3, c5 4
In my real life scenario, I need to retrieve the indices for matches on more than one column. I found a way to do this in one step by combining some of the other solutions and figured it would be useful to also share this and the explanation of how it works below.
The code below adds a new column caseid to dt2, which gets its values from the column cid in dt1 for the row indices that matched on both firstName and lastName.
Putting dt1 inside the square brackets and specifying on = .(...) is equivalent to merging dt1 with dt2 on firstName and lastName, but instead of merging all columns from both datasets, one new column called caseid is created.
The lower case i. prefix to cid indicates that cid is a column from the second data set (dt1).
The upper case .I inside the square brackets after i.cid will retrieve the row indices of dt1 that match dt2 on firstName and lastName.
# Get case IDs from dt1 for matches of firstName and lastName in one step:
dt2[dt1, caseid := i.cid[.I], on = .(firstName, lastName)]
# Output:
> dt2
lid firstName lastName result caseid
1: 1 Maria Antonis pos <NA>
2: 2 Jim Dutto neg c4
3: 3 Jack Blogs neg <NA>
4: 4 Anne Mcfee pos c3
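For comparison, the same two-column lookup can be sketched as a plain update join, assigning i.cid directly without the .I subscript. This is a sketch on the example data, assuming each firstName/lastName pair occurs at most once in dt1:

```r
library(data.table)

dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
                  lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
dt2 <- data.table(lid = c(1, 2, 3, 4),
                  firstName = c("Maria", "Jim", "Jack", "Anne"),
                  lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
                  result = c("pos", "neg", "neg", "pos"))

# Update join: rows of dt2 that match a row of dt1 on both columns
# receive that row's cid; unmatched rows stay NA
dt2[dt1, caseid := i.cid, on = .(firstName, lastName)]
dt2$caseid  # NA "c4" NA "c3"
```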

Transforming a list of lists into dataframe

I have a list containing a number of other lists, each of which contains a varying number of character vectors, with varying numbers of elements. I want to create a dataframe where each list would be represented as a row and each character vector within that list would be a column. Where the character vector has > 1 element, the elements would be concatenated and separated using a "+" sign, so that they can be stored as one string. The data looks like this:
fruits <- list(
list(c("orange"), c("pear")),
list(c("pear", "orange")),
list(c("lemon", "apple"),
c("pear"),
c("grape"),
c("apple"))
)
The expected output is like this:
fruits_df <- data.frame(col1 = c("orange", "pear + orange", "lemon + apple"),
col2 = c("pear", NA, "pear"),
col3 = c(NA, NA, "grape"),
col4 = c(NA, NA, "apple"))
There is no limit on the number of character vectors that can be contained in a list, so the solution needs to dynamically create columns, leading to a df where the number of columns is equal to the length of the list containing the largest number of character vectors.
For every list in fruits you can create a one row dataframe and bind the data.
dplyr::bind_rows(lapply(fruits, function(x) as.data.frame(t(sapply(x,
function(y) paste0(y, collapse = "+"))))))
# V1 V2 V3 V4
#1 orange pear <NA> <NA>
#2 pear+orange <NA> <NA> <NA>
#3 lemon+apple pear grape apple
This is a bit messy but here is one way
cols <- lapply(fruits, function(x) sapply(x, paste, collapse=" + "))
ncols <- max(lengths(cols))
dd <- do.call("rbind.data.frame", lapply(cols, function(x) {length(x) <- ncols; x}))
names(dd) <- paste0("col", 1:ncol(dd))
dd
# col1 col2 col3 col4
# 1 orange pear <NA> <NA>
# 2 pear + orange <NA> <NA> <NA>
# 3 lemon + apple pear grape apple
or another strategy
ncols <- max(lengths(fruits))
dd <- data.frame(lapply(seq.int(ncols), function(x) sapply(fruits, function(y) paste(unlist(y[x]), collapse=" + "))))
names(dd) <- paste0("col", 1:ncols)
dd
But really you need to either build each column or row from your list and then combine them together.
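One thing worth checking with the second strategy above: paste() on an empty vector with collapse returns "", so cells with no character vector may come out as empty strings rather than NA. A minimal sketch of converting them afterwards (the last line is an addition, not part of the original answer):

```r
fruits <- list(
  list(c("orange"), c("pear")),
  list(c("pear", "orange")),
  list(c("lemon", "apple"), c("pear"), c("grape"), c("apple"))
)

ncols <- max(lengths(fruits))
# Build one column at a time, as in the second strategy
dd <- data.frame(lapply(seq.int(ncols), function(x)
  sapply(fruits, function(y) paste(unlist(y[x]), collapse = " + "))),
  stringsAsFactors = FALSE)
names(dd) <- paste0("col", seq_len(ncols))

# Replace the empty strings left by missing positions with NA
dd[dd == ""] <- NA
dd
```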
Another approach that melts the list to a data.frame using rrapply::rrapply and then casts it to the required format using data.table::dcast:
library(rrapply)
library(data.table)
## melt to long data.frame
long <- rrapply(fruits, f = paste, how = "melt", collapse = " + ")
## cast to wide data.table
setDT(long)
dcast(long[, .(L1, L2, value = unlist(value))], L1 ~ L2)[, !"L1"]
#> ..1 ..2 ..3 ..4
#> 1: orange pear <NA> <NA>
#> 2: pear + orange <NA> <NA> <NA>
#> 3: lemon + apple pear grape apple

how to separate a column into multiple columns and change the results from characters to numbers

id initiativen
1 abc 2a
2 cde 2b
3 efd a
4 geh c
5 jytd 5v
6 jydjytd e
Hello, I have something similar to this, just a lot bigger, and I was wondering what the most efficient way is to divide the column initiativen into two columns, one containing the numbers (2,2,5,4) and one containing the letters or the blank space. It has to be a general formula, as the data frame I need to apply it to is quite big. The letters correspond to a particular initiative number, but the first initiative number is not indicated, so "a" corresponds to initiative number 2.
I would love it to look something like this, with the letters substituted by numbers (blank=1, a=2, b=3 etc.):
id initiativen question
abc 2 2
cde 3 2
efd 2 N/A
geh 4 N/A
jytd 23 5
jydjytd 6 N/A
bfdhslbf 1 3
I have tried to use "separate" but it doesn't really work and doesn't solve the problem of the first initiative having no corresponding letter.
Any help or suggestion would be extremely welcomed and helpful.
Thank you so much:)
How about the following tidyverse solution?
library(tidyverse);
df %>%
separate(initiativen, into = c("p1", "p2"), sep = "(?<=[0-9])(?=[a-z])") %>%
mutate(
initiativen = case_when(
str_detect(p1, "[a-z]") ~ p1,
str_detect(p2, "[a-z]") ~ p2),
question = case_when(
str_detect(p1, "[0-9]") ~ p1,
str_detect(p2, "[0-9]") ~ p2)) %>%
mutate(initiativen = ifelse(is.na(initiativen), 1, match(initiativen, letters) + 1)) %>%
select(-p1, -p2)
# id initiativen question
#1 abc 2 2
#2 cde 3 2
#3 efd 2 <NA>
#4 geh 4 <NA>
#5 jytd 23 5
#6 jydjytd 6 <NA>
#7 vbdjfkb 1 4
Note that the warning can be safely ignored, as it stems from the missing fields when separating.
Explanation: We use a positive look-behind and look-ahead to split entries in initiativen into two parts p1 and p2; we then fill initiativen and question with entries from p1 or p2 depending on whether they contain a number "[0-9]" or a character "[a-z]"; convert characters to numbers with match(initiativen, letters) and finally clean the data.frame.
Sample data
df <- read.table(text =
" id initiativen
1 abc 2a
2 cde 2b
3 efd a
4 geh c
5 jytd 5v
6 jydjytd e
7 vbdjfkb 4", row.names = 1)
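The letter-to-number step from the explanation above can also be sketched in base R on plain vectors. It assumes, as in the sample data, that each entry holds at most one lowercase letter; the variable names are chosen for illustration.

```r
initiativen <- c("2a", "2b", "a", "c", "5v", "e", "4")

letter_part <- gsub("[0-9]", "", initiativen)  # "a" "b" "a" "c" "v" "e" ""
digit_part <- gsub("[a-z]", "", initiativen)   # "2" "2" ""  ""  "5" ""  "4"

# letters is the built-in a-z vector; no letter (nomatch = 0) maps to 1
initiative_no <- match(letter_part, letters, nomatch = 0L) + 1L
initiative_no  # 2 3 2 4 23 6 1

# Empty digit parts become NA, the rest become integers
question <- as.integer(ifelse(nzchar(digit_part), digit_part, NA))
question  # 2 2 NA NA 5 NA 4
```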
Using data.table
# Step one
setDT(df)
df[, ":="(
question = gsub("[a-z]", "", initiativen),
initiativen = match(gsub("[0-9]", "", initiativen), letters, nomatch = 0) + 1L
)
]
df
id initiativen question
1: abc 2 2
2: cde 3 2
3: efd 2
4: geh 4
5: jytd 23 5
6: jydjytd 6
7: vbdjfkb 1 4
# Then some tidying
df[, question := ifelse(nzchar(question), question, NA)]
df
id initiativen question
1: abc 2 2
2: cde 3 2
3: efd 2 <NA>
4: geh 4 <NA>
5: jytd 23 5
6: jydjytd 6 <NA>
7: vbdjfkb 1 4
Data
df <- data.frame(
id = c("abc", "cde", "efd", "geh", "jytd", "jydjytd", "vbdjfkb"),
initiativen = c("2a", "2b", "a", "c", "5v", "e", "4"),
stringsAsFactors = FALSE
)
Edit
Can also be done in one step:
df[, question := gsub("[a-z]", "", initiativen)
][, ":="(
question = ifelse(nzchar(question), question, NA),
initiativen = match(gsub("[0-9]", "", initiativen), letters, nomatch = 0) + 1L
)
]
For the second column you can use regular expression to only keep numeric values:
df$initiativen <- gsub("[^0-9]", "", df$initiativen)

Put the combinations matrix of many rows in a column of a dataframe, then split it

I have a dataframe that looks like this (I simplify):
df <- data.frame(rbind(c(1, "dog", "cat", "rabbit"), c(2, "apple", "peach", "cucumber")))
colnames(df) <- c("ID", "V1", "V2", "V3")
## ID V1 V2 V3
## 1 1 dog cat rabbit
## 2 2 apple peach cucumber
I would like to create a column containing all possible combinations of variables V1:V3 two by two (order doesn't matter), but keeping a link with the original ID. So something like this.
## ID bigrams
## 1 1 dog cat
## 2 1 cat rabbit
## 3 1 dog rabbit
## 4 2 apple peach
## 5 2 apple cucumber
## 6 2 peach cucumber
My idea: use combn(), mutate() and separate_rows().
library(tidyr)
library(dplyr)
df %>%
mutate(bigrams=paste(unlist(t(combn(df[,2:4],2))), collapse="-")) %>%
separate_rows(bigrams, sep="-") %>%
select(ID,bigrams)
The result is not what I expected... I guess that concatenating a matrix (the result of combn()) is not as easy as that.
I have two questions about this: 1) How do I debug this code? 2) Is this a good way to do this kind of thing? I'm new to R, but I have an OpenRefine background, so concatenating and splitting multivalued cells makes a lot of sense to me. But is this also the right method in R?
Thanks in advance for any help.
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), melt it to 'long' format, grouped by 'ID', get the combn of 'value' and paste it together
library(data.table)
dM <- melt(setDT(df), id.var = "ID")[, combn(value, 2, FUN = paste, collapse=' '), ID]
setnames(dM, 2, 'bigrams')[]
# ID bigrams
#1: 1 dog cat
#2: 1 dog rabbit
#3: 1 cat rabbit
#4: 2 apple peach
#5: 2 apple cucumber
#6: 2 peach cucumber
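The same idea, applying combn() to the values of one ID at a time, can be sketched in base R without reshaping. This sketch assumes the columns are character (stringsAsFactors = FALSE) and builds the example data directly rather than via rbind():

```r
df <- data.frame(ID = c(1, 2),
                 V1 = c("dog", "apple"),
                 V2 = c("cat", "peach"),
                 V3 = c("rabbit", "cucumber"),
                 stringsAsFactors = FALSE)

# For each row, take the three values and paste every pair together
pairs <- lapply(seq_len(nrow(df)), function(i) {
  vals <- unlist(df[i, c("V1", "V2", "V3")], use.names = FALSE)
  data.frame(ID = df$ID[i],
             bigrams = combn(vals, 2, FUN = paste, collapse = " "),
             stringsAsFactors = FALSE)
})

# Stack the per-row results into one long data frame
result <- do.call(rbind, pairs)
result
```

The pairs come out in the order combn() generates them (first-second, first-third, second-third), which matches the data.table output above.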
I recommend @akrun's "melt first" approach, but just for fun, here are more ways to do it:
library(tidyverse)
df %>%
mutate_all(as.character) %>%
transmute(ID = ID, bigrams = pmap(
list(V1, V2, V3),
function(a, b, c) combn(c(a, b, c), 2, paste, collapse = " ")
))
# ID bigrams
# 1 1 dog cat, dog rabbit, cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber
(mutate_all(as.character) just because you gave us factors, and factor to character conversion can be surprising).
df %>%
mutate_all(as.character) %>%
nest(-ID) %>%
mutate(bigrams = map(data, combn, 2, paste, collapse = " ")) %>%
unnest(data) %>%
as.data.frame()
# ID bigrams V1 V2 V3
# 1 1 dog cat, dog rabbit, cat rabbit dog cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber apple peach cucumber
(as.data.frame() just for a prettier printing)

How to extract a partially matched string into a new column?

I have this data.table with strings:
dt = tbl_dt(data.table(x=c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen")))
x
1 book|ball|apple
2 flower|orange|cup
3 banana|bandana|pen
..and I also have a reference string which I would like to match with the one in the data.table, extracting the word if it's in there, like so..
fruits = "apple|banana|orange"
str_match(fruits, "flower|orange|cup")
>"orange"
How do I do this for the entire data.table?
require(dplyr)
require(stringr)
dt %>%
mutate (fruit = str_match(fruits, x))
Error in rep(NA_character_, n) : invalid 'times' argument
In addition: Warning message:
In regexec(c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen" :
argument 'pattern' has length > 1 and only the first element will be used
What I would like:
x fruit
1 book|ball|apple apple
2 flower|orange|cup orange
3 banana|bandana|pen banana
Or (to avoid warnings, it is better to use a plain data.table instead of tbl_dt)
dt[, fruits := mapply(str_match, fruits, x)]
dt
## x fruits
## 1: book|ball|apple apple
## 2: flower|orange|cup orange
## 3: banana|bandana|pen banana
Or you could do something similar to @akrun's answer, such as
dt[, fruits := lapply(x, str_match, fruits)]
dt$fruit <- unlist(lapply(dt$x, str_match, fruits))
dt
#Source: local data table [3 x 2]
#
# x fruit
#1 book|ball|apple apple
#2 flower|orange|cup orange
#3 banana|bandana|pen banana
A solution using base R and without str_match (ddf here is the data frame holding the column x):
fruit=NULL
reflist = unlist(strsplit(fruits, '\\|'))
for(xx in ddf$x){
ss = unlist(strsplit(xx,'\\|'))
for(s in ss) if(s %in% reflist) fruit[length(fruit)+1]=s
}
ddf$fruit = fruit
ddf
# x fruit
#1 book|ball|apple apple
#2 flower|orange|cup orange
#3 banana|bandana|pen banana
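Another base R sketch, using regexpr() and regmatches() so that the reference string is the pattern and each element of x is searched in turn. Note that this swaps the roles of pattern and string compared with the str_match call in the question, and that it matches substrings, so "apple" would also be found inside a longer word such as "pineapple".

```r
x <- c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen")
fruits <- "apple|banana|orange"

# regexpr() is vectorised over x and returns -1 where nothing matches
m <- regexpr(fruits, x)

# Pre-fill with NA so that rows without a match keep their position
fruit <- rep(NA_character_, length(x))
fruit[m > 0] <- regmatches(x, m)
fruit  # "apple" "orange" "banana"
```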
