How to extract a partially matched string into a new column?

I have this data.table with strings:
dt = tbl_dt(data.table(x=c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen")))
x
1 book|ball|apple
2 flower|orange|cup
3 banana|bandana|pen
I also have a reference string that I would like to match against each string in the data.table, extracting the matching word if it is present, like so:
fruits = "apple|banana|orange"
str_match(fruits, "flower|orange|cup")
>"orange"
How do I do this for the entire data.table?
require(dplyr)
require(stringr)
dt %>%
mutate (fruit = str_match(fruits, x))
Error in rep(NA_character_, n) : invalid 'times' argument
In addition: Warning message:
In regexec(c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen" :
argument 'pattern' has length > 1 and only the first element will be used
What I would like:
x fruit
1 book|ball|apple apple
2 flower|orange|cup orange
3 banana|bandana|pen banana

To avoid warnings, it is better to use a plain data.table instead of tbl_dt. With that, you can do:
dt[, fruits := mapply(str_match, fruits, x)]
dt
## x fruits
## 1: book|ball|apple apple
## 2: flower|orange|cup orange
## 3: banana|bandana|pen banana
Or you could do something similar to @akrun's answer, such as
dt[, fruits := lapply(x, str_match, fruits)]

dt$fruit <- unlist(lapply(dt$x, str_match, fruits))
dt
#Source: local data table [3 x 2]
#
# x fruit
#1 book|ball|apple apple
#2 flower|orange|cup orange
#3 banana|bandana|pen banana
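As a dependency-free variant of the answers above, here is a minimal base R sketch that uses the reference string as the pattern and pulls the first match out of each row. The vector names `x` and `fruit` are illustrative copies of the question's data, not part of the original answers. Note that this is a substring match, so a token such as "pineapple" would also yield "apple".

```r
# Sample data copied from the question
x <- c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen")
fruits <- "apple|banana|orange"

# regexpr() returns the position of the first match per element (-1 if none)
m <- regexpr(fruits, x)

# Start with NA everywhere, then fill in the rows that matched;
# regmatches() returns only the matched substrings
fruit <- rep(NA_character_, length(x))
fruit[m > 0] <- regmatches(x, m)

fruit
```

Rows without any match keep `NA`, so the result always has the same length as the input and can be assigned directly as a new column.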

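A solution using base R and without str_match (here ddf is a plain data.frame holding the same x column):
ddf = data.frame(x = c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen"), stringsAsFactors = FALSE)
fruit = NULL
# split the reference string into individual words
reflist = unlist(strsplit(fruits, '\\|'))
for(xx in ddf$x){
ss = unlist(strsplit(xx,'\\|'))
# append each token found in the reference list (assumes exactly one match per row)
for(s in ss) if(s %in% reflist) fruit[length(fruit)+1]=s
}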
ddf
# x fruit
#1 book|ball|apple apple
#2 flower|orange|cup orange
#3 banana|bandana|pen banana
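The loop above appends one value per matching token, so it only lines up with the rows when every row has exactly one match. A hedged sketch of a variant that tolerates zero or several matches per row follows; the extra "pen|cup" row and the comma separator are assumptions added for illustration.

```r
fruits <- "apple|banana|orange"
ddf <- data.frame(
  x = c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen", "pen|cup"),
  stringsAsFactors = FALSE
)

# Split the reference string and each row on the literal "|" character
reflist <- strsplit(fruits, "|", fixed = TRUE)[[1]]
tokens <- strsplit(ddf$x, "|", fixed = TRUE)

# One result per row: NA when nothing matches, comma-joined when several match
ddf$fruit <- vapply(tokens, function(tok) {
  hit <- intersect(tok, reflist)
  if (length(hit) == 0) NA_character_ else paste(hit, collapse = ",")
}, character(1))

ddf
```

Because this compares whole tokens rather than substrings, "bandana" is not mistaken for "banana".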

Related

Using data.table to match multiple patterns against multiple strings in R

library(data.table)
dat1 <- data.table(id1 = c(1, 1, 2),
pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.table(id2 = c(1174, 1231),
description = c("apple is sweet", "apple is a computer"),
memo = c("bananas, sweet yes", "bananas, sweetyes"))
> dat1
id1 pattern
1: 1 apple
2: 1 applejack
3: 2 bananas, sweet
> dat2
id2 description memo
1: 1174 apple is sweet bananas, sweet yes
2: 1231 apple is a computer bananas, sweetyes
I have two data.tables, dat1 and dat2. I want to search for each pattern in dat1 against the description and memo columns in dat2 and store the corresponding id2s.
The final output table should look something like this:
id1 pattern description_match memo_match
1: 1 apple 1174,1231 <NA>
2: 1 applejack <NA> <NA>
3: 2 bananas, sweet <NA> 1174
The regular expression I want to use is \\b[pattern]\\b. Below is my attempt:
dat1[, description_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$description), .(id2 = paste(id2, collapse = ","))]]
dat1[, memo_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$memo), .(id2 = paste(id2, collapse = ","))]]
However, both give me the error that grepl can only use the first pattern.

We group by row sequence and create the match columns from 'dat2' by pasting together the 'id2' values selected by the logical output of grepl
library(data.table)
dat1[, c("description_match", "memo_match") := {
pat <- sprintf('\\b(%s)\\b', paste(pattern, collapse = "|"))
.(toString(dat2$id2[grepl(pat, dat2$description)]),
toString(dat2$id2[grepl(pat, dat2$memo)]))
}, seq_along(id1)]
dplyr::na_if(dat1, "")
id1 pattern description_match memo_match
<num> <char> <char> <char>
1: 1 apple 1174, 1231 <NA>
2: 1 applejack <NA> <NA>
3: 2 bananas, sweet <NA> 1174
According to ?sprintf
The string fmt contains normal characters, which are passed through to the output string, and also conversion specifications which operate on the arguments provided through .... The allowed conversion specifications start with a % and end with one of the letters in the set aAdifeEgGosxX%
s - Character string. Character NAs are converted to "NA".
Or use a for loop
for(nm in names(dat2)[-1])
dat1[, paste0(nm, "_match") :=
toString(dat2$id2[grepl(paste0("\\b", pattern, "\\b"),
dat2[[nm]])]), seq_along(id1)][]
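The same per-pattern lookup can be sketched in base R without data.table. The helper name `match_ids` is hypothetical, and the sketch assumes the patterns contain no regex metacharacters that would need escaping. `perl = TRUE` is used here only to make the `\\b` word-boundary handling explicit.

```r
# Data copied from the question as plain vectors
patterns <- c("apple", "applejack", "bananas, sweet")
id2 <- c(1174, 1231)
description <- c("apple is sweet", "apple is a computer")
memo <- c("bananas, sweet yes", "bananas, sweetyes")

# Hypothetical helper: ids whose text contains `pat` as a whole word, or NA
match_ids <- function(pat, text, ids) {
  hit <- grepl(paste0("\\b", pat, "\\b"), text, perl = TRUE)
  if (any(hit)) paste(ids[hit], collapse = ",") else NA_character_
}

# Apply the helper once per pattern, so grepl always sees a single pattern
description_match <- vapply(patterns, match_ids, character(1),
                            text = description, ids = id2, USE.NAMES = FALSE)
memo_match <- vapply(patterns, match_ids, character(1),
                     text = memo, ids = id2, USE.NAMES = FALSE)

data.frame(pattern = patterns, description_match, memo_match)
```

Calling `grepl` once per pattern avoids the "only the first element will be used" problem from the question, at the cost of one pass over `dat2` per pattern.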

How to find intersect elements of concatenated string?

# create sample df
basket_customer <- c("apple,orange,banana","apple,banana,orange","strawberry,blueberry")
basket_ideal<- c("orange,banana","orange,apple,banana","strawberry,watermelon")
customer_name <- c("john","adam","john")
visit_id <- c("1001","1001","1003")
df2 <- cbind.data.frame(basket_customer,basket_ideal,customer_name,visit_id)
df2$basket_ideal <- as.character(basket_ideal)
df2$basket_customer <- as.character(basket_customer)
The goal is to compare the basket elements (fruits) of each customer to the ideal basket and return the missing fruit.
Note that the same visit_id can exist for one or more users, so the unique ID is (visit_id + customer_name), and the elements are not alphabetically sorted.
expected output:
visit_id customer_name NOT_in_basket_ideal NOT_in_basket_customer
1001 john apple NA
1001 adam NA NA
1003 john blueberry watermelon
I tried using rowwise(), intersect(), except(), and unnesting, but did not succeed. Thank you.

We could use Map to loop over the corresponding elements of the list columns, and use setdiff to get the elements of the first vector not in the second
cst_list <- strsplit(df2$basket_customer, ",\\s*")
idl_list <- strsplit(df2$basket_ideal, ",\\s*")
lst1 <- Map(function(x, y) if(identical(x, y)) 'equal'
else setdiff(x, y), cst_list, idl_list)
lst1[lengths(lst1) == 0] <- NA_character_
v1 <- sapply(lst1, toString)
For the second case, just reverse the order of the setdiff arguments
lst2 <- Map(function(x, y) if(identical(x, y)) 'equal'
else setdiff(y, x), cst_list, idl_list)
lst2[lengths(lst2) == 0] <- NA_character_
v2 <- sapply(lst2, toString)
Combining the output from both to 'df2'
df2[c("NOT_in_basket_ideal", "NOT_in_basket_customer")] <- list(v1, v2)
-output
df2[-(1:2)]
# customer_name visit_id NOT_in_basket_ideal NOT_in_basket_customer
#1 john 1001 apple NA
#2 adam 1001 NA NA
#3 john 1003 blueberry watermelon
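For comparison, here is a compact base R sketch of the same split-then-setdiff idea that keeps true missing values: it returns `NA_character_` directly rather than passing `NA` through `toString()`, which would produce the string "NA". The helper name `diff_str` is an assumption made for illustration.

```r
basket_customer <- c("apple,orange,banana", "apple,banana,orange", "strawberry,blueberry")
basket_ideal <- c("orange,banana", "orange,apple,banana", "strawberry,watermelon")

# Hypothetical helper: items present in `a` but absent from `b`, as one string
diff_str <- function(a, b) {
  d <- setdiff(strsplit(a, ",\\s*")[[1]], strsplit(b, ",\\s*")[[1]])
  if (length(d) == 0) NA_character_ else toString(d)
}

# mapply() walks both vectors in parallel, one row at a time
NOT_in_basket_ideal <- unname(mapply(diff_str, basket_customer, basket_ideal))
NOT_in_basket_customer <- unname(mapply(diff_str, basket_ideal, basket_customer))

data.frame(NOT_in_basket_ideal, NOT_in_basket_customer)
```

Since `setdiff` ignores element order, the second row (same fruits, different order) yields `NA` in both columns.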
Or in tidyverse
library(dplyr)
library(purrr)
library(stringr)
df2 %>%
mutate(across(starts_with('basket'), ~ str_extract_all(., "\\w+"))) %>%
transmute(customer_name, visit_id,
NOT_in_basket_ideal = map2_chr(basket_customer,
basket_ideal, ~ toString(setdiff(.x, .y))),
NOT_in_basket_customer = map2_chr(basket_ideal, basket_customer,
~ toString(setdiff(.x, .y))))
# customer_name visit_id NOT_in_basket_ideal NOT_in_basket_customer
#1 john 1001 apple
#2 adam 1001
#3 john 1003 blueberry watermelon

Transforming a list of lists into dataframe

I have a list containing a number of other lists, each of which contain varying numbers of character vectors, with varying numbers of elements. I want to create a dataframe where each list would be represented as a row and each character vector within that list would be a column. Where the character vector has > 1 element, the elements would be concatenated and separated using a "+" sign, so that they can be stored as one string. The data looks like this:
fruits <- list(
list(c("orange"), c("pear")),
list(c("pear", "orange")),
list(c("lemon", "apple"),
c("pear"),
c("grape"),
c("apple"))
)
The expected output is like this:
fruits_df <- data.frame(col1 = c("orange", "pear + orange", "lemon + apple"),
col2 = c("pear", NA, "pear"),
col3 = c(NA, NA, "grape"),
col4 = c(NA, NA, "apple"))
There is no limit on the number of character vectors that can be contained in a list, so the solution needs to dynamically create columns, leading to a df where the number of columns is equal to the length of the list containing the largest number of character vectors.

For every list in fruits you can create a one row dataframe and bind the data.
dplyr::bind_rows(lapply(fruits, function(x) as.data.frame(t(sapply(x,
function(y) paste0(y, collapse = "+"))))))
# V1 V2 V3 V4
#1 orange pear <NA> <NA>
#2 pear+orange <NA> <NA> <NA>
#3 lemon+apple pear grape apple

This is a bit messy but here is one way
cols <- lapply(fruits, function(x) sapply(x, paste, collapse=" + "))
ncols <- max(lengths(cols))
dd <- do.call("rbind.data.frame", lapply(cols, function(x) {length(x) <- ncols; x}))
names(dd) <- paste0("col", 1:ncol(dd))
dd
# col1 col2 col3 col4
# 1 orange pear <NA> <NA>
# 2 pear + orange <NA> <NA> <NA>
# 3 lemon + apple pear grape apple
or another strategy
ncols <- max(lengths(fruits))
dd <- data.frame(lapply(seq.int(ncols), function(x) sapply(fruits, function(y) paste(unlist(y[x]), collapse=" + "))))
names(dd) <- paste0("col", 1:ncols)
dd
But really you need to either build each column or row from your list and then combine them together.
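The "build each row, then combine" idea can be sketched in base R as follows. This is one possible arrangement that pads every row to the widest length before binding. The intermediate names `cells`, `padded` and `m` are illustrative.

```r
fruits <- list(
  list(c("orange"), c("pear")),
  list(c("pear", "orange")),
  list(c("lemon", "apple"), c("pear"), c("grape"), c("apple"))
)

# Collapse each character vector into a single "a + b" string, row by row
cells <- lapply(fruits, function(row) vapply(row, paste, character(1), collapse = " + "))

# Pad every row with NA up to the length of the longest row
ncols <- max(lengths(cells))
padded <- lapply(cells, function(v) { length(v) <- ncols; v })

# Bind the equal-length rows into a matrix, then convert to a data frame
m <- do.call(rbind, padded)
colnames(m) <- paste0("col", seq_len(ncols))
fruits_df <- as.data.frame(m, stringsAsFactors = FALSE)

fruits_df
```

The number of columns is derived from the data, so longer inner lists simply produce more columns.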

Another approach that melts the list to a data.frame using rrapply::rrapply and then casts it to the required format using data.table::dcast:
library(rrapply)
library(data.table)
## melt to long data.frame
long <- rrapply(fruits, f = paste, how = "melt", collapse = " + ")
## cast to wide data.table
setDT(long)
dcast(long[, .(L1, L2, value = unlist(value))], L1 ~ L2)[, !"L1"]
#> ..1 ..2 ..3 ..4
#> 1: orange pear <NA> <NA>
#> 2: pear + orange <NA> <NA> <NA>
#> 3: lemon + apple pear grape apple

Put the combinations matrix of many rows in a column of a dataframe, then split it

I have a dataframe that looks like this (I simplify):
df <- data.frame(rbind(c(1, "dog", "cat", "rabbit"), c(2, "apple", "peach", "cucumber")))
colnames(df) <- c("ID", "V1", "V2", "V3")
## ID V1 V2 V3
## 1 1 dog cat rabbit
## 2 2 apple peach cucumber
I would like to create a column containing all possible combinations of variables V1:V3 two by two (order doesn't matter), but keeping a link with the original ID. So something like this.
## ID bigrams
## 1 1 dog cat
## 2 1 cat rabbit
## 3 1 dog rabbit
## 4 2 apple peach
## 5 2 apple cucumber
## 6 2 peach cucumber
My idea: use combn(), mutate() and separate_rows().
library(tidyr)
library(dplyr)
df %>%
mutate(bigrams=paste(unlist(t(combn(df[,2:4],2))), collapse="-")) %>%
separate_rows(bigrams, sep="-") %>%
select(ID,bigrams)
The result is not what I expected... I guess that concatenating a matrix (the result of combn()) is not as easy as that.
I have two questions about this: 1) How do I debug this code? 2) Is this a good way to do this kind of thing? I'm new to R, but I have an OpenRefine background, so concatenating and splitting multivalued cells makes a lot of sense to me. Is this also the right method in R?
Thanks in advance for any help.

We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), melt it to 'long' format, grouped by 'ID', get the combn of 'value' and paste it together
library(data.table)
dM <- melt(setDT(df), id.var = "ID")[, combn(value, 2, FUN = paste, collapse=' '), ID]
setnames(dM, 2, 'bigrams')[]
# ID bigrams
#1: 1 dog cat
#2: 1 dog rabbit
#3: 1 cat rabbit
#4: 2 apple peach
#5: 2 apple cucumber
#6: 2 peach cucumber
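The grouped `combn` step can also be sketched in base R by building one small data frame per row and stacking them. This assumes the value columns are named V1 to V3 as in the question; the names `pairs` and `result` are illustrative.

```r
df <- data.frame(
  ID = c(1, 2),
  V1 = c("dog", "apple"),
  V2 = c("cat", "peach"),
  V3 = c("rabbit", "cucumber"),
  stringsAsFactors = FALSE
)

# For each row, paste every unordered pair of its values into one string
pairs <- lapply(seq_len(nrow(df)), function(i) {
  vals <- unlist(df[i, c("V1", "V2", "V3")], use.names = FALSE)
  data.frame(
    ID = df$ID[i],
    bigrams = combn(vals, 2, FUN = paste, collapse = " "),
    stringsAsFactors = FALSE
  )
})

# Stack the per-row results into a single long data frame
result <- do.call(rbind, pairs)
result
```

`combn` is applied to one row's values at a time, which is what keeps each pair linked to its original ID.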

I recommend @akrun's "melt first" approach, but just for fun, here are more ways to do it:
library(tidyverse)
df %>%
mutate_all(as.character) %>%
transmute(ID = ID, bigrams = pmap(
list(V1, V2, V3),
function(a, b, c) combn(c(a, b, c), 2, paste, collapse = " ")
))
# ID bigrams
# 1 1 dog cat, dog rabbit, cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber
(mutate_all(as.character) just because you gave us factors, and factor to character conversion can be surprising).
df %>%
mutate_all(as.character) %>%
nest(-ID) %>%
mutate(bigrams = map(data, combn, 2, paste, collapse = " ")) %>%
unnest(data) %>%
as.data.frame()
# ID bigrams V1 V2 V3
# 1 1 dog cat, dog rabbit, cat rabbit dog cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber apple peach cucumber
(as.data.frame() just for a prettier printing)

Matching columns in R and adding frequencies against them

I need to match the values in col1 with those in col2 and col3 and, where they match, add up their frequencies. The output should show, for each unique value, the total count across freq1, freq2 and freq3.
col1 freq1 col2 freq2 col3 freq3
apple 3 grapes 4 apple 1
grapes 5 apple 2 orange 2
orange 4 banana 5 grapes 2
guava 3 orange 6 banana 7
I need my output like this
apple 6
grapes 11
orange 12
guava 3
banana 12
I'm a beginner. How do I code this in R?

We can use melt from data.table with patterns specified in the measure argument to convert the 'wide' format to 'long' format, then grouped by 'col', we get the sum of 'freq' column
library(data.table)
melt(setDT(df1), measure = patterns("^col", "^freq"),
value.name = c("col", "freq"))[,.(freq = sum(freq)) , by = col]
# col freq
#1: apple 6
#2: grapes 11
#3: orange 12
#4: guava 3
#5: banana 12
If the 'col' and 'freq' columns alternate, we can just unlist the subset of 'col' columns and 'freq' columns separately to create a data.frame (using c(TRUE, FALSE) to recycle for subsetting columns), and then use aggregate from base R to get the sum grouped by 'col'.
aggregate(freq~col, data.frame(col = unlist(df1[c(TRUE, FALSE)]),
freq = unlist(df1[c(FALSE, TRUE)])), sum)
# col freq
#1 apple 6
#2 banana 12
#3 grapes 11
#4 guava 3
#5 orange 12

I think the easiest approach for a newcomer to understand is to create 3 separate dataframes (I assume here that your dataframe is named df):
df1 <- data.frame(df$col1, df$freq1)
colnames(df1) <- c("fruit", "freq")
df2 <- data.frame(df$col2, df$freq2)
colnames(df2) <- c("fruit", "freq")
df3 <- data.frame(df$col3, df$freq3)
colnames(df3) <- c("fruit", "freq")
Then bind all dataframes by rows:
df <- rbind(df1, df2, df3)
And at the end group by fruit and sum frequencies using dplyr library.
library(dplyr)
df <- df %>%
group_by(fruit)%>%
summarise(sum(freq))
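All of the answers above follow the same two steps: stack the (name, frequency) pairs into a long format, then sum by name. A self-contained base R sketch of those steps is shown below, with the input rebuilt from the question's table. The names `long` and `totals` are illustrative.

```r
# Input rebuilt from the table in the question
df1 <- data.frame(
  col1 = c("apple", "grapes", "orange", "guava"),   freq1 = c(3, 5, 4, 3),
  col2 = c("grapes", "apple", "banana", "orange"),  freq2 = c(4, 2, 5, 6),
  col3 = c("apple", "orange", "grapes", "banana"),  freq3 = c(1, 2, 2, 7),
  stringsAsFactors = FALSE
)

# Step 1: stack the three name columns and the three frequency columns
long <- data.frame(
  col  = unlist(df1[c("col1", "col2", "col3")], use.names = FALSE),
  freq = unlist(df1[c("freq1", "freq2", "freq3")], use.names = FALSE)
)

# Step 2: split the frequencies by name and sum each group
totals <- vapply(split(long$freq, long$col), sum, numeric(1))
totals
```

The result is a named numeric vector, sorted by name rather than in order of first appearance.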
