How to find intersect elements of concatenated string? - r

# create sample df
basket_customer <- c("apple,orange,banana","apple,banana,orange","strawberry,blueberry")
basket_ideal<- c("orange,banana","orange,apple,banana","strawberry,watermelon")
customer_name <- c("john","adam","john")
visit_id <- c("1001","1001","1003")
df2 <- cbind.data.frame(basket_customer,basket_ideal,customer_name,visit_id)
df2$basket_ideal <- as.character(basket_ideal)
df2$basket_customer <- as.character(basket_customer)
The goal is to compare the basket elements (fruits) of each customer to the ideal basket and return the missing fruit.
Note the same visit_id can exists for 1 or more users so the uniqueID is (id+username) and elements are not alphabetically sorted.
expected output:
visit_id
customer_name
NOT_in_basket_ideal
NOT_in_basket_customer
1001
john
apple
NA
1001
adam
NA
NA
1003
john
blueberry
watermelon
I tried using row_wise(),intersect(),except(),and unnesting however did not succeed. Thank you

We could use Map to loop over the corresponding elements of the list columns, and use setdiff to get the elements of the first vector not in the second
cst_list <- strsplit(df2$basket_customer, ",\\s*")
idl_list <- strsplit(df2$basket_ideal, ",\\s*")
lst1 <- Map(function(x, y) if(identical(x, y)) 'equal'
else setdiff(x, y), cst_list, idl_list)
lst1[lengths(lst1) == 0] <- NA_character_
v1 <- sapply(lst1, toString)
and the second case, just reverse the order
lst2 <- Map(function(x, y) if(identical(x, y)) 'equal'
else setdiff(y, x), cst_list, idl_list)
lst2[lengths(lst2) == 0] <- NA_character_
v2 <- sapply(lst2, toString)
Combining the output from both to 'df2'
df2[c("NOT_in_basket_ideal", "NOT_in_basket_customer")] <- list(v1, v2)
-output
df2[-(1:2)]
# customer_name visit_id NOT_in_basket_ideal NOT_in_basket_customer
#1 john 1001 apple NA
#2 adam 1001 NA NA
#3 john 1003 blueberry watermelon
Or in tidyverse
library(dplyr)
library(purrr)
library(stringr)
df2 %>%
mutate(across(starts_with('basket'), ~ str_extract_all(., "\\w+"))) %>%
transmute(customer_name, visit_id,
NOT_in_basket_ideal = map2_chr(basket_customer,
basket_ideal, ~ toString(setdiff(.x, .y))),
NOT_in_basket_customer = map2_chr(basket_ideal, basket_customer,
~ toString(setdiff(.x, .y))))
# customer_name visit_id NOT_in_basket_ideal NOT_in_basket_customer
#1 john 1001 apple
#2 adam 1001
#3 john 1003 blueberry watermelon

Related

Transforming a list of lists into dataframe

I have a list containing a number of other lists, each of which contain varying numbers of character vectors, with varying numbers of elements. I want to create a dataframe where each list would be represented as a row and each character vector within that list would be a column. Where the character vector has > 1 element, the elements would be concatenated and separated using a "+" sign, so that they can be stored as one string. The data looks like this:
fruits <- list(
list(c("orange"), c("pear")),
list(c("pear", "orange")),
list(c("lemon", "apple"),
c("pear"),
c("grape"),
c("apple"))
)
The expected output is like this:
fruits_df <- data.frame(col1 = c("orange", "pear + orange", "lemon + apple"),
col2 = c("pear", NA, "pear"),
col3 = c(NA, NA, "grape"),
col4 = c(NA, NA, "apple"))
There is no limit on the number of character vectors that can be contained in a list, so the solution needs to dynamically create columns, leading to a df where the number of columns is equal to the length of the list containing the largest number of character vectors.
For every list in fruits you can create a one row dataframe and bind the data.
dplyr::bind_rows(lapply(fruits, function(x) as.data.frame(t(sapply(x,
function(y) paste0(y, collapse = "+"))))))
# V1 V2 V3 V4
#1 orange pear <NA> <NA>
#2 pear+orange <NA> <NA> <NA>
#3 lemon+apple pear grape apple
This is a bit messy but here is one way
cols <- lapply(fruits, function(x) sapply(x, paste, collapse=" + "))
ncols <- max(lengths(cols))
dd <- do.call("rbind.data.frame", lapply(cols, function(x) {length(x) <- ncols; x}))
names(dd) <- paste0("col", 1:ncol(dd))
dd
# col1 col2 col3 col4
# 1 orange pear <NA> <NA>
# 2 pear + orange <NA> <NA> <NA>
# 3 lemon + apple pear grape apple
or another strategy
ncols <- max(lengths(fruits))
dd <- data.frame(lapply(seq.int(ncols), function(x) sapply(fruits, function(y) paste(unlist(y[x]), collapse=" + "))))
names(dd) <- paste0("col", 1:ncols)
dd
But really you need to either build each column or row from your list and then combine them together.
Another approach that melts the list to a data.frame using rrapply::rrapply and then casts it to the required format using data.table::dcast:
library(rrapply)
library(data.table)
## melt to long data.frame
long <- rrapply(fruits, f = paste, how = "melt", collapse = " + ")
## cast to wide data.table
setDT(long)
dcast(long[, .(L1, L2, value = unlist(value))], L1 ~ L2)[, !"L1"]
#> ..1 ..2 ..3 ..4
#> 1: orange pear <NA> <NA>
#> 2: pear + orange <NA> <NA> <NA>
#> 3: lemon + apple pear grape apple

Replace strings containing only blanks with NA

I have a dataframe containing columns named Q1 through Q98. These columns contain strings ("This is a string"), yet some entries only contain a varying number of blanks (" ", " "). I would like to replace all entries containing only blanks with NA.
Consider the dataframe created by the following code:
df<-data.frame(Q1=c("Test test","Test"," "," "),Q2=c("Sample sample"," ","Sample","Sample"))
The solution would modify the above dataframe df such that df$Q1[3:4]==NA and df$Q2[2]==NA.
I have already tried using grepl(" ", df), but this lets me replace every entry that contains blanks, not only those which consist purely of blanks.
One dplyr possibility could be:
df %>%
mutate_all(~ ifelse(nchar(trimws(.)) == 0, NA_character_, .))
Q1 Q2
1 Test test Sample sample
2 Test <NA>
3 <NA> Sample
4 <NA> Sample
Or the same with base R:
df[] <- lapply(df, function(x) ifelse(nchar(trimws(x)) == 0, NA_character_, x))
Or:
df %>%
mutate_all(~ trimws(.)) %>%
na_if(., "")
A dplyr+stringr option
library(dplyr)
library(stringr)
df %>% mutate_all(~str_replace(., "^\\s+$", NA_character_))
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample
You can search for strings with a start ^, then one or more spaces +, then an end $.
df[sapply(df, function(x) grepl('^ +$', x))] <- NA
# Q1 Q2
# 1 Test test Sample sample
# 2 Test <NA>
# 3 <NA> Sample
# 4 <NA> Sample
Some other possibilities
df[] <- lapply(df, function(x) replace(x, grep('^ +$', x), NA))
#or
replace(df, sapply(df, function(x) grepl('^ +$', x)), NA)
Apply sub to all columns of whitespaces:
lapply(df, FUN = sub, pattern = "^\\s*$", replacement = NA)
We can do this in base R
df[trimws(as.matrix(df)) == ''] <- NA
df
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample
Or with replace
library(dplyr)
df %>%
mutate_all(list(~ replace(., trimws(.)=="", NA)))
# Q1 Q2
#1 Test test Sample sample
#2 Test <NA>
#3 <NA> Sample
#4 <NA> Sample

R - Reshaping repeated row value into column

I have data like this:
Name Rating
Tom 3
Tom 4
Tom 2
Johnson 5
Johnson 7
But I'd like it so each unique name is instead a column, with the ratings below, in each row. How can I approach this?
Here is a good way of doing it
x <- data.frame(c("Tom", "Tom", "Tom", "Johnson", "Johnson"), c(3,4,2,5,7))
colnames(x) <- c("Name", "Rating")
n <- unique(x[,1])
m <- max(table(x[,1]))
c <- data.frame(matrix(, ncol = length(n), nrow = m))
for (i in 1:length(n)) {
l <- x[which(x[,1] == n[i]), 2]
l2 <- rep("", m - length(l))
c[,i] <- c(l, l2)
}
colnames(c) <- n
Results:
Tom Johnson
1 3 5
2 4 7
3 2
Here is a way using CRAN package reshape.
library(reshape2)
d <- dcast(mydata, Rating ~ Name, value.var = "Rating")[-1]
d
# Johnson Tom
#1 NA 2
#2 NA 3
#3 NA 4
#4 5 NA
#5 7 NA
As you can see, there are too many NA values in this result. One way of getting rid of them could be:
d <- lapply(d, function(x) x[!is.na(x)])
n <- max(sapply(d, length))
d <- do.call(cbind.data.frame, lapply(d, function(x) c(x, rep(NA, n - length(x)))))
d
# Johnson Tom
#1 5 2
#2 7 3
#3 NA 4
Well, this does the job but introduces some NAs.
Edit: Replace the NAs with some other Rating.
mydata<-data.frame(Name=c("Tom","Tom","Tom","Johnson","Johnson"),Rating=c(3,4,2,5,7))
library(reshape2)
library(tidyverse)
mydata1<-mydata %>%
mutate(Name=as.factor(Name)) %>%
melt(id.var="Name") %>%
dcast(variable+value~Name) %>%
select(-value) %>%
rename(Name=variable) %>%
select_if(is.numeric)
mydata1 %>%
mutate(Johnson=as.factor(Johnson),Tom=as.factor(Tom)) %>%
mutate(Johnson=fct_explicit_na(Johnson,na_level = "No Rating"),
Tom=fct_explicit_na(Tom,na_level = "No Rating"))
Johnson Tom
1 No Rating 2
2 No Rating 3
3 No Rating 4
4 5 No Rating
5 7 No Rating

For Loop to check if any rows have a specfic set of values

I'm trying to run a for-loop to check if any my rows contain a specific set of values. I already know you can simply apply a function to remove the set from the dataframe, but I want to know how to run a for-loop as well. Thanks!
This is my dataframe:
df <- as.data.frame(matrix(NA, nrow = 12, ncol = 3))
df$V1 <- c('1','1','2','3','3','3','4','4','5','5','5','5')
df$V2 <- c('CCC','BBB','AAA','AAA','EEE','BBB','AAA','DDD','EEE','EEE','BBB','CCC')
df$V3 <- c(100,90,80,85,66,98,62,74,56,85,77,66)
colnames(df) <- c('ID','Secondary_ID','Number')
Grouping the Data so there is only 1 unique ID per row
library(dplyr)
library(tidyr)
df_2 <- df%>%
group_by(ID)%>%
summarise(Key_s = paste0(Secondary_ID, collapse = ','))%>%
separate(Key_s, into = c('1','2','3','4'))
I know that you can remove the specific set like this:
remove_this <- c('BBB','CCC')
df_remove <- apply(df_2, 1, function(x) !any(x %in% remove_this))
final_dataframe <- df_2[df_remove,]
I'm trying to run a for-loop which creates another column called output, and if it contains the specific set than "Yes" else "No".
Something like this:
output <- as.character(nrow(df_2))
for(i in 1:nrow(df_2)){
if(df_2[i,] %in% remove_this){
df_2$output <- "Yes"
}else{df_2$output <- "No"}
}
Reverse the test to see if the contents of remove_this are in the row.
df_2$output <- NA # initialize the column
for(i in 1:nrow(df_2)){
df_2$output[i] <- ifelse(all(remove_this %in% df_2[i,]), 'Yes', 'No')
}
You don't need to create a for loop:
remove_this <- c('BBB','CCC')
df_remove <- apply(df_2, 1, function(x) !any(x %in% remove_this))
df_2 %>%
mutate(output = c("No", "Yes")[df_remove + 1L])
# A tibble: 5 x 6
ID `1` `2` `3` `4` output
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 CCC BBB NA NA No
2 2 AAA NA NA NA Yes
3 3 AAA EEE BBB NA No
4 4 AAA DDD NA NA Yes
5 5 EEE EEE BBB CCC No
The "trick" is to convert the logical values FALSE and TRUE of df_remove into integer indices which are used to subset the vector c("No", "Yes").

Multiple Values in One Cell using R

Suppose, there are 2 data.frames, for instance:
dat1 <- read.table("[path_dat1]", header=TRUE, sep=",")
id name age
1 Jack 21
2 James 40
dat2 <- read.table("[path_dat2]", header=TRUE, sep=",")
id interests
1 football
1 basketball
1 soccer
2 pingpang ball
How do I join table 1 and table 2 into a data.frame like the one below?
id name age interests
1 1 Jack 21 (football, basketball, soccer)
2 2 James 40 (pingpang ball)
How can I join these using plyr in the simplest way?
I can't tell you how to solve this in plyr but can in base:
dat3 <- aggregate(interests~id, dat2, paste, collapse=",")
merge(dat1, dat3, "id")
EDIT: If you really want the parenthesis you could use:
ppaste <- function(x) paste0("(", gsub("^\\s+|\\s+$", "", paste(x, collapse = ",")), ")")
dat3 <- aggregate(interests~id, dat2, ppaste)
merge(dat1, dat3, "id")
Using Tyler's example:
dat1$interests <- ave(dat1$id, dat1$id,
FUN=function(x) paste(dat2[ dat2$id %in% x, "interests"], collapse=",") )
> dat1
id name age interests
1 1 Jack 21 football, basketball, soccer
2 2 James 40 pingpang ball

Resources