I am trying to loop over a vector of patterns/strings, matching each pattern against another column and assigning the results to a column named after the pattern being searched. A simple example is below.
I am aware that this example is trivial but it captures the minimal case that produces the error that I cannot resolve.
> library(rlang)
> library(stringr)
> library(dplyr)
> set.seed(5)
> df <- data.frame(
+ groupA = sample(x = LETTERS[1:6], size = 20, replace = TRUE),
+ id_col = 1:20
+ )
>
> mycols <- c('A','C','D')
>
> dfmatches <-
+ lapply(mycols, function(icol) {
+ data.frame(!!icol := grepl(pattern = icol, x = df$groupA))
+ }) %>%
+ cbind.data.frame()
This gives me the error:
Error: `:=` can only be used within a quasiquoted argument
The desired output would be a data.frame like the below:
> dfmatches
A C D
1 FALSE FALSE FALSE
2 FALSE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE FALSE
5 TRUE FALSE FALSE
6 FALSE FALSE FALSE
7 FALSE FALSE TRUE
8 FALSE FALSE FALSE
9 FALSE FALSE FALSE
10 TRUE FALSE FALSE
11 FALSE FALSE FALSE
12 FALSE TRUE FALSE
13 FALSE FALSE FALSE
14 FALSE FALSE TRUE
15 FALSE FALSE FALSE
16 FALSE FALSE FALSE
17 FALSE TRUE FALSE
18 FALSE FALSE FALSE
19 FALSE FALSE TRUE
20 FALSE FALSE FALSE
I've tried multiple variations using {{}} or !! rlang::sym() etc but cannot quite figure out the right syntax for this.
The error occurs because := only works inside functions that support tidy evaluation (dplyr verbs such as mutate/transmute); base data.frame() does not. One option is to use map_dfc from purrr. Also, I think you don't need grepl, since here we are looking for an exact match rather than a partial one.
library(dplyr)
library(purrr)
map_dfc(mycols, ~df %>% transmute(!!.x := groupA == .x))
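If you actually need partial (pattern) matching rather than an exact match, grepl can stay inside the tidy-eval call; a sketch, assuming the df and mycols objects from the question:
# !!.x := works here because transmute() supports tidy evaluation,
# unlike base data.frame() in the original attempt
dfmatches <- map_dfc(mycols, ~ df %>% transmute(!!.x := grepl(.x, groupA)))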
In base R, we can do
setNames(do.call(cbind.data.frame,
                 lapply(mycols, function(x) df$groupA == x)), mycols)
Related
a<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
b<-c(TRUE,FALSE,TRUE,FALSE,FALSE,FALSE)
c<-c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
costumer<-c("one","two","three","four","five","six")
df<-data.frame(costumer,a,b,c)
That's example code. Printed, it looks like this:
costumer a b c
1 one TRUE TRUE TRUE
2 two FALSE FALSE TRUE
3 three TRUE TRUE TRUE
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE
6 six FALSE FALSE FALSE
I want to create a new column df$items that contains only the column names that are TRUE for each row in the data. Something like this:
costumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE
6 six FALSE FALSE FALSE
I thought of using the apply function or using which to select indexes, but couldn't figure it out. Can anyone help me?
# note: apply() coerces the whole data frame to a character matrix, so x == TRUE
# compares against the string "TRUE"; it works here, but restricting to the
# logical columns (as in the next answer) is safer
df$items <- apply(df, 1, function(x) paste0(names(df)[x == TRUE], collapse = ","))
df
costumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE a,c
6 six FALSE FALSE FALSE
df$items = apply(df[2:4], 1, function(x) toString(names(df[2:4])[x]))
df
# costumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
You could use
df$items <- apply(df, 1, function(x) toString(names(df)[which(x == TRUE)]))
Output
# costumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
We can use pivot_longer to reshape to 'long' format and then paste the column names together within each group
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  pivot_longer(cols = a:c) %>%
  group_by(costumer) %>%
  summarise(items = toString(name[value])) %>%
  left_join(df)
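For completeness, a row-wise sketch of the same idea with c_across() (dplyr 1.0 or later; assumes the logical columns are a, b and c as in the example):
library(dplyr)
df %>%
  rowwise() %>%
  mutate(items = toString(c("a", "b", "c")[c_across(a:c)])) %>%
  ungroup()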
Why should I use | vs any() when I'm comparing columns in dplyr::mutate()?
And why do they return different answers?
For example:
library(tidyverse)
df <- data_frame(x = rep(c(T,F,T), 4), y = rep(c(T,F,T, F), 3), allF = F, allT = T)
df %>%
  mutate(
    withpipe = x | y          # returns expected results by row
    , usingany = any(c(x, y)) # returns TRUE for every row
  )
What's going on here and why should I use one way of comparing values over another?
The difference between the two is how the answer is calculated:
For |, elements are compared row-wise and Boolean logic returns the proper value. In the example above, each x and y pair is compared and a logical value is returned for each pair, resulting in 12 answers, one for each row of the data frame.
any(), on the other hand, looks at the entire vector and returns a single value. In the above example, the mutate line that calculates the new usingany column is basically doing this: any(c(df$x, df$y)), which will return TRUE because there's at least one TRUE value in either df$x or df$y. That single value is then assigned to every row of the data frame.
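For illustration, evaluating the same expression outside of mutate() shows that it collapses to a single value, which is then recycled to every row:
any(c(df$x, df$y))
# [1] TRUE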
You can see this in action using the other columns in your data frame:
df %>%
  mutate(
    usingany = any(c(x, y)) # returns all TRUE
    , allfany = any(allF)   # returns all FALSE because every value in df$allF is FALSE
  )
To answer which you should use: use | when you want to compare elements row-wise, and any() when you want a single answer about the entire data frame.
TLDR, when using dplyr::mutate(), you're usually going to want to use |.
You can also use rowwise().
df <- data_frame(x = rep(c(T,F,T), 4), y = rep(c(T,F,T, F), 3), allF = F, allT = T)
df %>%
  rowwise() %>%
  mutate(x_or_y = any(x, y))
Output:
# A tibble: 12 x 5
x y allF allT x_or_y
<lgl> <lgl> <lgl> <lgl> <lgl>
1 TRUE TRUE FALSE TRUE TRUE
2 FALSE FALSE FALSE TRUE FALSE
3 TRUE TRUE FALSE TRUE TRUE
4 TRUE FALSE FALSE TRUE TRUE
5 FALSE TRUE FALSE TRUE TRUE
6 TRUE FALSE FALSE TRUE TRUE
7 TRUE TRUE FALSE TRUE TRUE
8 FALSE FALSE FALSE TRUE FALSE
9 TRUE TRUE FALSE TRUE TRUE
10 TRUE FALSE FALSE TRUE TRUE
11 FALSE TRUE FALSE TRUE TRUE
12 TRUE FALSE FALSE TRUE TRUE
TL;DR (update):
if_any is the cleanest replacement for any() in row-wise operations with dplyr. See below.
You can use either the OR operator | or any().
The same applies when comparing & and all().
As suggested, you must take into account that | is vectorized, while any() is not.
In order to use any() the same way, you must group the data rowwise, so you can call an equivalent of any(current_row). This can be done with purrr::pmap or dplyr::rowwise.
But dplyr::if_any looks a lot cleaner.
See the code below for a comparison of all methods:
df %>%
  mutate(
    row_OR = x | y,
    row_pmap_any = pmap_lgl(select(., c(x, y)), any),
    with_if_any = if_any(c(x, y))
  ) %>%
  rowwise() %>%
  mutate(row_rowwise_any = any(c_across(c(x, y))))
# A tibble: 12 × 8
# Rowwise:
x y allF allT row_OR row_pmap_any with_if_any row_rowwise_any
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
2 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
3 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
4 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
5 FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
6 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
7 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
8 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
9 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
10 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
11 FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
12 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
All methods work, and I did not find much difference in performance.
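If you want to check the performance claim yourself, a rough sketch with the bench package (an assumption on my part; it is not used in the answers above):
library(bench)
library(dplyr)
bench::mark(
  or_operator = df %>% mutate(res = x | y),
  if_any      = df %>% mutate(res = if_any(c(x, y))),
  check = FALSE  # skip strict result comparison; both give the same logical column
)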
I am looking for a function that takes a column of a data.frame as the reference and finds all subsets defined by the level combinations of the other variables. For example, let z be a data frame with 4 columns a, b, c, d, each with 2 levels, and let a be the reference. Then z would look like:
z$a : TRUE FALSE
z$b : TRUE FALSE
z$c : TRUE FALSE
z$d : TRUE FALSE
Then what I need is a LIST whose elements are named by the combinations, such as
aTRUEbTRUEcTRUEdTRUE : subset of the dataframe
aTRUEbFALSEcTRUEdTRUE : subset
...
Here is an example,
set.seed(123)
z=matrix(sample(c(TRUE,FALSE),size = 100,replace = TRUE),ncol=4)
colnames(z) = letters[1:4]
z=as.data.frame(z)
output= list(
'bTRUEcTRUEdFALSE' = subset(z,b==TRUE & c==TRUE & d==FALSE),
'bTRUEcTRUEdTRUE' = subset(z,b==TRUE & c==TRUE & d==TRUE),
'bTRUEcFALSEdFALSE' = subset(z,b==TRUE & c==FALSE & d==FALSE),
'bTRUEcFALSEdTRUE' = subset(z,b==TRUE & c==FALSE & d==TRUE)
# and so on ...
)
output
$bTRUEcTRUEdFALSE
a b c d
13 FALSE TRUE TRUE FALSE
14 FALSE TRUE TRUE FALSE
$bTRUEcTRUEdTRUE
a b c d
4 FALSE TRUE TRUE TRUE
10 TRUE TRUE TRUE TRUE
16 FALSE TRUE TRUE TRUE
20 FALSE TRUE TRUE TRUE
24 FALSE TRUE TRUE TRUE
$bTRUEcFALSEdFALSE
a b c d
17 TRUE TRUE FALSE FALSE
19 TRUE TRUE FALSE FALSE
22 FALSE TRUE FALSE FALSE
$bTRUEcFALSEdTRUE
a b c d
5 FALSE TRUE FALSE TRUE
11 FALSE TRUE FALSE TRUE
15 TRUE TRUE FALSE TRUE
18 TRUE TRUE FALSE TRUE
21 FALSE TRUE FALSE TRUE
23 FALSE TRUE FALSE TRUE
However, there is an issue with the example: firstly, I do not know the number of variables (in this case 4, a to d); secondly, the variable names must be obtained from the data (simply speaking, I cannot use subset, since I do not know the variable names in the condition: a== could be anything==).
What is the most efficient way of doing this in R?
You can use split and paste like so:
split(z, paste(z$b, z$c, z$d))
But the tricky part of your question is how to programmatically combine the variables in columns 2 through the end without knowing beforehand the number of columns, their names, or their values. We can use a function like the one below to paste those values together by row:
apply(z, 1, function(i) paste(i[-1], collapse=""))
Now combine with split
split(z, apply(z, 1, function(i) paste(i[-1], collapse="")))
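If you also want list names in the bTRUEcTRUEdFALSE style from the question, a sketch that pastes the non-reference column names together with their values (assuming, as in the example, that the first column is the reference):
# build names like "bTRUEcTRUEdFALSE" by pasting each non-reference
# column name together with its value in that row, then split on them
grp <- apply(z[-1], 1, function(i) paste0(names(z)[-1], i, collapse = ""))
output <- split(z, grp)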
I am getting some unexpected behavior using %in% c() versus == c() to filter data on multiple conditions. I get incomplete results when using the == c() method. Is there a logical explanation for this behavior?
df <- data.frame(region = as.factor(c(1,1,1,2,2,3,3,4,4,4)),
value = 1:10)
library(dplyr)
filter(df, region == c(1,2))
filter(df, region %in% c(1,2))
# using base syntax
df[df$region == c(1,2),]
df[df$region %in% c(1,2),]
The results do not change if I convert 'region' to numeric.
I get incomplete results when using the == c() method. Is there a logical explanation for this behavior?
That's kind of logical, let's see:
df$region == 1:2
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
df$region %in% 1:2
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
The reason is that in the first form you're trying to compare vectors of different lengths; as #lukeA said in his comment, this form is the same as (see implementation-of-standard-recycling-rules):
# 1 1 1 2 2 3 3 4 4 4 ## df$region
# 1 2 1 2 1 2 1 2 1 2 ## c(1,2) recycled to the same length
# T F T T F F F F F F ## equality of the corresponding elements
df$region == c(1,2,1,2,1,2,1,2,1,2)
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Where each value on the left hand side of the operator is tested with the corresponding value on the right hand side of the operator.
However, when you use df$region %in% 1:2, it works conceptually more like:
sapply(df$region, function(x) { any(x==1:2) })
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
That is, each value is tested against the whole second vector and TRUE is returned if there is at least one match.
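For illustration (not part of the original answer): the recycling in the == version is silent here only because the 10 rows divide evenly by the 2 comparison values; with a length that does not divide evenly, R at least warns:
df$region == c(1, 2, 3)
# Warning: longer object length is not a multiple of shorter object length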
Consider the following data:
library(tibble)
key <- c("a", "b", "c", "d", "e")
tags <- c("A,B", "B", "A,E", "C,D", "")
data <- tibble(key, tags)
Here, key could mean a book title and tags could be genres, or key could be an email sender and tags the recipients. The essential point is that the tags column can contain a variable (possibly zero) number of distinct substrings.
For splitting a fixed number of concatenated tags I can use tidyr::spread, and I can use string splitting to separate the tags column itself, but how do I combine the two?
I would like the transformed data to look like this:
key A B C D E
a TRUE TRUE FALSE FALSE FALSE
b FALSE TRUE FALSE FALSE FALSE
c TRUE FALSE FALSE FALSE TRUE
d FALSE FALSE TRUE TRUE FALSE
e FALSE FALSE FALSE FALSE FALSE
I can see it's possible to do this in several steps by splitting tags, determining the unique substrings, looping over each of them and testing whether tags in each row contains the string. But I'd prefer to do this in a pipeline using the tidyverse.
Question: how can I split the variable number of concatenated tags into one column per tag?
Here's a base R alternative approach:
# get unique values in tags
x <- unique(unlist(strsplit(data$tags, ",", fixed=TRUE)))
# check for existence in the tags column (anchored so multi-character tags
# cannot partially match each other)
res <- sapply(paste0("(^|.*,)", x, "(,.*|$)"), grepl, data$tags)
# add sensible dimension names
dimnames(res) <- list(data$key, x)
The resulting matrix looks like this:
res
# A B E C D
#a TRUE TRUE FALSE FALSE FALSE
#b FALSE TRUE FALSE FALSE FALSE
#c TRUE FALSE TRUE FALSE FALSE
#d FALSE FALSE FALSE TRUE TRUE
#e FALSE FALSE FALSE FALSE FALSE
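If you prefer a data frame with key as a regular column rather than a named logical matrix, a small follow-up sketch (using the res object built above):
# column names are kept from the matrix; row names carry over from res
res_df <- data.frame(key = data$key, res)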
The separate_rows function from tidyr may help you get where you want. This splits the strings within tags into separate rows instead of separate columns, which sets you up to use spread.
To get the TRUE/FALSE result I created a new column of all TRUE to use as the value column, and then filled the missing values with FALSE in spread. In the end, spread kept the blank cell as a column name, which I removed via select. There may be a better way to do this (maybe convert to NA?).
library(tidyr)
library(dplyr)
data %>%
  separate_rows(tags) %>%
  mutate(tagslog = TRUE) %>%
  spread(tags, tagslog, fill = FALSE) %>%
  select(-one_of(""))
key A B C D E
* <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 a TRUE TRUE FALSE FALSE FALSE
2 b FALSE TRUE FALSE FALSE FALSE
3 c TRUE FALSE FALSE FALSE TRUE
4 d FALSE FALSE TRUE TRUE FALSE
5 e FALSE FALSE FALSE FALSE FALSE
You can almost get where you want with just separate_rows and table, but I still had that extra blank column that would need to be removed.
data %>%
  separate_rows(tags) %>%
  with(., table(key, tags) == 1)
tags
key A B C D E
a FALSE TRUE TRUE FALSE FALSE FALSE
b FALSE FALSE TRUE FALSE FALSE FALSE
c FALSE TRUE FALSE FALSE FALSE TRUE
d FALSE FALSE FALSE TRUE TRUE FALSE
e TRUE FALSE FALSE FALSE FALSE FALSE
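If you go this route, a sketch of dropping that blank column afterwards (same pipeline as above, just subsetting out the empty-string tag):
tab <- data %>%
  separate_rows(tags) %>%
  with(., table(key, tags) == 1)
tab[, colnames(tab) != ""]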
A third base R method is
# get named list splitting by commas
myList <- setNames(strsplit(tags, split=",", fixed=TRUE), key)
# get unique elements from list
colTemp <- sort(unique(unlist(myList)))
# check each list element for the unique elements, return matrix
myMat <- t(sapply(myList, function(i) colTemp %in% i))
# add column names
colnames(myMat) <- colTemp
which returns
myMat
A B C D E
a TRUE TRUE FALSE FALSE FALSE
b FALSE TRUE FALSE FALSE FALSE
c TRUE FALSE FALSE FALSE TRUE
d FALSE FALSE TRUE TRUE FALSE
e FALSE FALSE FALSE FALSE FALSE
Building on docendo discimus' approach, but using paste differently (this works here because each tag is a single character; with multi-character tags the anchored patterns above are safer, since a plain grepl could give partial matches):
xx <- sort(unique(unlist(strsplit(data$tags,","))))
data1 <- sapply(paste(xx), grepl, data$tags)
data <- cbind(data[,1],data1)
key A B C D E
1 a TRUE TRUE FALSE FALSE FALSE
2 b FALSE TRUE FALSE FALSE FALSE
3 c TRUE FALSE FALSE FALSE TRUE
4 d FALSE FALSE TRUE TRUE FALSE
5 e FALSE FALSE FALSE FALSE FALSE