Apply a vector of filters based on a string (or vector of strings) in dplyr - r

R and the tidyverse have some extremely powerful but equally mysterious methods for turning strings into actionable expressions. I feel like one needs to be an expert to really understand how to use them.
NOTE: this question differs from this one in that I specifically ask about a vector of (that is, multiple) filter conditions. I demonstrate a solution for a single filter that fails when I try several ways of extending it to multiple filters.
I want to do something along the lines of:
df = data.frame(A=1:10, B=1:10)
df %>% filter(A<3, B<5)
But where the filters are contained in either a string such as "A<3, B<5" or a character vector such as c("A<3", "B<5").
I can do
df %>% filter(eval(str2expression("A<3")))
# A B
# 1 1 1
# 2 2 2
But this does not work:
df %>% filter(eval(str2expression("A<3, B<5")))
Error in str2expression("A<3, B<5") : <text>:1:4: unexpected ','
1: A<3,
^
These don't work either:
> df %>% filter(!!c(str2expression("A<3"), str2expression("B<5")))
Error: Argument 2 filter condition does not evaluate to a logical vector
> df %>% filter(!!!c(str2expression("A<3"), str2expression("B<5")))
Error: Can't splice an object of type `expression` because it is not a vector
Run `rlang::last_error()` to see where the error occurred.
Evaluating a vector of expressions from str2expression for some reason only applies the last expression:
> df %>% filter(eval(c(str2expression("A<3"), str2expression("B<5"))))
# A B
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
Using a vector of evaluated expressions fails altogether:
> df %>% filter(!!!c(eval(str2expression("A<3")), eval(str2expression("B<5"))))
Error in eval(str2expression("A<3")) : object 'A' not found
I can do:
> df %>% filter(!!!c(expr(A<3), expr(B<5)))
# A B
# 1 1 1
# 2 2 2
and this tells me that expr(A<3) is NOT the same thing as str2expression("A<3")
But that isn't starting from strings.
What to do?

You could use parse_exprs from rlang
library(dplyr)
expr <- c("A<3", "B<5")
filter(df, !!!rlang::parse_exprs(expr))
# A B
#1 1 1
#2 2 2
Or you could combine the two expressions and then use it in eval
filter(df, eval(parse(text = paste0(expr, collapse = "&"))))
# A B
#1 1 1
#2 2 2
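The failures in the question come down to object types, which can be checked with base R alone. The following is a minimal sketch, not the answerer's code; the names conds, ex, calls and keep are illustrative assumptions.

```r
df <- data.frame(A = 1:10, B = 1:10)
conds <- c("A<3", "B<5")

# str2expression() returns an expression vector, not a single call
ex <- str2expression(conds)
class(ex)
length(ex)

# eval() on an expression vector evaluates every element in turn but
# returns only the value of the last one, so only B<5 takes effect
last_only <- eval(ex, df)
identical(last_only, df$B < 5)

# str2lang() parses one string into a call, the same kind of object
# that quote(A<3) produces
calls <- lapply(conds, str2lang)
identical(calls[[1]], quote(A<3))

# Evaluate each call against the data and combine the results with &
keep <- Reduce(`&`, lapply(calls, eval, envir = df))
df[keep, ]
```

A list of calls such as calls is also the kind of object that !!! can splice into filter(), which is what rlang::parse_exprs() produces in the answer above.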

Learning from @Ronak Shah's answer: apparently, in dplyr I can combine multiple conditions with a single & in filter instead of a comma. I don't understand this at all; & is not the same thing as the logical and (&&):
> df %>% filter(A<3 & B<5)
A B
1 1 1
2 2 2
> df %>% filter(A<3 && B<5)
A B
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
Nevertheless, the following does work:
> df %>% filter(eval(str2expression("A<3 & B<5")))
A B
1 1 1
2 2 2
> df %>% filter(eval(str2expression("A<6 & B<5")))
A B
1 1 1
2 2 2
3 3 3
4 4 4
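The difference between & and && can be seen without dplyr. This is a small base R sketch under the assumption that filter() keeps the rows where the combined condition is TRUE; the names rowwise, step1, step2 and single are illustrative.

```r
df <- data.frame(A = 1:10, B = 1:10)

# & is vectorised: it returns one TRUE/FALSE per row
rowwise <- df$A < 3 & df$B < 5
rowwise

# Combining conditions with & keeps the same rows as applying the
# conditions one after another, which is what comma-separated
# conditions amount to
step1 <- df[df$A < 3, ]
step2 <- step1[step1$B < 5, ]
all.equal(df[rowwise, ], step2)

# && works on single values only. Older R versions silently used just
# the first element of each side; R 4.3.0 and later raise an error for
# longer operands. The first elements are compared explicitly here.
single <- df$A[1] < 3 && df$B[1] < 5
single

# A single TRUE is recycled, so every row is kept
nrow(df[single, ])
```

That recycling of a single TRUE is consistent with the ten rows returned by the && call in the question.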

Related

ifelse not working in mutate if column not supplied in condition

It seems to me that if a column is not included in the conditional part of the ifelse statement, the ifelse statement in a dplyr mutate function does not work as expected:
mdf <- data.frame(a=c(1,2,3), b=c(3,4,5))
# this works:
> mdf %>% mutate(c=ifelse(a==1,0,1))
a b c
1 1 3 0
2 2 4 1
3 3 5 1
# This does not work (expected column c to be equal to a):
> mdf %>% mutate(c=ifelse(0==1,0,a))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
# This does not work either (expected column c to be equal to a):
> mdf %>% mutate(c=ifelse("a" %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
If using a regular if-statement, it does work:
> mdf %>% mutate(c=if("a" %in% names(.)){a}else{1})
a b c
1 1 3 1
2 2 4 2
3 3 5 3
However, I was hoping to use the ifelse statement, since it has a cleaner syntax. Is there a way to achieve the desired result with the ifelse statement?
I see that the length of the condition determines what is returned. If the condition evaluates to a single TRUE/FALSE value, ifelse returns only one value (instead of the entire column). The value returned seems to be the first value of the desired column:
> mdf %>% mutate(c=ifelse("a" %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
If I increase the length of the condition to the number of rows, the ifelse statement will return the entire column:
> mdf %>% mutate(c=ifelse(rep("a", nrow(.)) %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 2
3 3 5 3
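The observation above matches how base R's ifelse() is documented: the result has the shape of the test argument. A minimal sketch outside of mutate(), with illustrative variable names:

```r
a <- c(1, 2, 3)

# A length-one test gives a length-one result: the first element of `yes`
one_value <- ifelse(TRUE, a, 0)
one_value                      # 1

# A plain if/else returns the whole object it selects
whole <- if (TRUE) a else 0
whole                          # 1 2 3

# Making the test as long as the data restores element-wise behaviour
elementwise <- ifelse(rep(TRUE, length(a)), a, 0)
elementwise                    # 1 2 3
```

Inside mutate(), the single value from the first case is recycled down the column, which produces the column of 1s shown in the question.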

Rename dataframe column names by switching string patterns

I have following dataframe and I want to rename the column names to c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")
> dataf <- data.frame(
+ WBC_D7_MIN=1:4,WBC_D7_MAX=1:4,DBP_D3_MIN=1:4
+ )
> dataf
WBC_D7_MIN WBC_D7_MAX DBP_D3_MIN
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
> names(dataf)
[1] "WBC_D7_MIN" "WBC_D7_MAX" "DBP_D3_MIN"
The rename_with function in the tidyverse can probably do it, but I cannot figure out how.
You can use capture groups with sub to extract values in order -
names(dataf) <- sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', names(dataf))
Same regex can be used in rename_with -
library(dplyr)
dataf %>% rename_with(~ sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', .))
# WBC_MIN_D7 WBC_MAX_D7 DBP_MIN_D3
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
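To see what the capture groups do, the substitution can be tried on the names alone. This sketch uses [^_]+ instead of \\w+, which is an assumption on my part rather than the answer's pattern: \\w also matches underscores, so the stricter class avoids ambiguity if a name ever contains more than two underscores.

```r
old <- c("WBC_D7_MIN", "WBC_D7_MAX", "DBP_D3_MIN")

# Three groups separated by underscores; the replacement swaps groups 2 and 3
pattern <- "^([^_]+)_([^_]+)_([^_]+)$"
new <- sub(pattern, "\\1_\\3_\\2", old)
new   # "WBC_MIN_D7" "WBC_MAX_D7" "DBP_MIN_D3"

# Names that do not match the pattern are returned unchanged
unchanged <- sub(pattern, "\\1_\\3_\\2", "ID")
unchanged   # "ID"
```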
You can rename your dataf with your vector with names(yourDF) <- c("A","B",...,"Z"):
names(dataf) <- c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")

How to apply function to colnames in tidyverse

Just like in title: is there any function that allows applying another function to column names of data frame? I mean something like forcats::fct_relabel that applies some function to factor labels.
To give an example, suppose I have a data.frame as below:
X<-data.frame(
firababst = c(1,1,1),
secababond = c(2,2,2),
thiababrd = c(3,3,3)
)
X
firababst secababond thiababrd
1 1 2 3
2 1 2 3
3 1 2 3
Now I wish to get rid of abab from column names by applying stringr::str_remove. My workaround involves magrittr::set_colnames:
X %>%
set_colnames(colnames(.) %>% str_remove('abab'))
first second third
1 1 2 3
2 1 2 3
3 1 2 3
Can you suggest a more straightforward way? Ideally, something like:
X %>%
magic_foo(str_remove, 'abab')
You can do:
X %>%
rename_all(~ str_remove(., "abab"))
first second third
1 1 2 3
2 1 2 3
3 1 2 3
With base R, we can do
names(X) <- sub("abab", "", names(X))

invalid argument to unary operator -error message - negative dplyr:: select with vector

I've been using a vector to positively subset a data frame, and it's working well. Now I'd like to use the same vector to negatively subset that data frame.
I get an error message (invalid argument to unary operator) but after googling I still don't understand what it means.
Thanks for any help!
# Starting point
df_main <- data.frame(coat=c(1:5),hanger=c(1:5),book=c(1:5),dvd=c(1:5),bookcase=c(1:5),
clock=c(1:5),bottle=c(1:5),curtains=c(1:5),wall=c(1:5))
df_keep <- data.frame(keep_var=c("coat","hanger","book","wall","bottle"),othvar=c("r","w","r","w",NA))
# Vector
library(dplyr)
keep.vec <- as.character(
(df_keep %>% dplyr::filter(is.na(othvar) | othvar == 'r'))$keep_var
)
# Attempts at using vector for negative subsetting
df_main %>% dplyr::select(-keep.vec)
df_main[-keep.vec, ]
We can wrap it with a helper function one_of in tidyselect
df_main %>%
select(-one_of(keep.vec))
# hanger dvd bookcase clock curtains wall
#1 1 1 1 1 1 1
#2 2 2 2 2 2 2
#3 3 3 3 3 3 3
#4 4 4 4 4 4 4
#5 5 5 5 5 5 5
Or another option is setdiff
df_main %>%
select(setdiff(names(.), keep.vec))
which will also work outside the tidyverse
df_main[setdiff(names(df_main), keep.vec)]
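The error message itself can be reproduced in base R: unary minus is defined for numbers, not for character vectors. The sketch below uses a reduced version of df_main and a hand-written keep.vec as assumptions, to keep the example short.

```r
df_main <- data.frame(coat = 1:5, hanger = 1:5, book = 1:5, wall = 1:5)
keep.vec <- c("coat", "book")

# Negating a character vector fails with the error from the question title
msg <- tryCatch(-keep.vec, error = function(e) conditionMessage(e))
msg

# Negative indexing works once the names are translated to positions
by_position <- df_main[-match(keep.vec, names(df_main))]
names(by_position)   # "hanger" "wall"

# A logical index avoids positions altogether
by_logical <- df_main[!(names(df_main) %in% keep.vec)]
names(by_logical)    # "hanger" "wall"
```

The logical form is the more forgiving of the two: match() returns NA for a name that is not present, and negative indexing with NA raises an error.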

How to take the latest entry from a data.frame and store it in new dataframe

I have a data.frame that is full of data in which the values for some parameters are repeated, and I want to use only the latest information that is stored.
Thankfully, I have an index in the files that tells me which duplicate is the current row in the data.frame.
Example for my problem is the following:
A B C D
1 1 2 3 1
2 1 2 2 2
3 3 4 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
A small explanation: columns A and B can be considered the key, and column C holds the value for that key. Column D is the index of the measurement, but it does not have to start from 1; it can start from 3, 6, or any integer. This happens because the data is incomplete.
So at the end the output should be like:
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
Can you please help me write an R program, or point me in the right direction, that saves all the keys with their latest index?
I have tried using for loops but it didn't work ....
Sincerely thanks
If you have any question feel free to ask
Using duplicated and subsetting in base R, you can do
dat[!duplicated(dat[,1:2], fromLast=TRUE),]
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
duplicated returns a logical vector indicating whether a row (here the first two columns) has been duplicated. The fromLast argument initiates this process from the bottom of the data.frame.
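The intermediate logical vector makes the mechanism easier to follow. This sketch rebuilds the example data from the question as dat; the extra sorting step at the end is an assumption for the case where rows are not already ordered by the index D within each key.

```r
dat <- data.frame(A = c(1, 1, 3, 3, 2, 2),
                  B = c(2, 2, 4, 4, 3, 1),
                  C = c(3, 2, 2, 1, 2, 1),
                  D = c(1, 2, 2, 3, 1, 1))

# TRUE marks a row whose key (A, B) appears again further down
dup <- duplicated(dat[, 1:2], fromLast = TRUE)
dup   # TRUE FALSE TRUE FALSE FALSE FALSE

# Keeping the rows that are not marked leaves the last row per key
latest <- dat[!dup, ]
rownames(latest)   # "2" "4" "5" "6"

# If row order cannot be trusted, sort by D first so that "last"
# really means "highest index"
sorted <- dat[order(dat$D), ]
latest_by_index <- sorted[!duplicated(sorted[, 1:2], fromLast = TRUE), ]
```

For this data both variants select the same four rows, although latest_by_index lists them in a different order.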
You can use dplyr verbs to group your data (group_by) and then sort it (arrange). The do verb lets you operate at the group level, and tail grabs the last row of each group:
library(dplyr)
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
do(tail(.,1)) %>%
ungroup()
Thanks to Frank's suggestion, you could also use slice
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
slice(n()) %>%
ungroup()
