I want to choose certain columns of a dataframe with dplyr::select() using contains() more than once. I know there are other ways to solve this, but I wonder whether it is possible inside select(). An example:
df <- data.frame(column1= 1:10, col2= 1:10, c3= 1:10)
library(dplyr)
names(select(df, contains("col") & contains("1")))
This gives an error but I would like the function to give "column1".
I expected that select() would allow a similar approach to filter(), where we can combine multiple conditions with operators, e.g. filter(df, column1 %in% 1:5 & col2 != 2).
EDIT
I notice that my question is more general: I wonder whether it is possible to pass any combination of selection helpers to select(), like select(df, contains("1") | !starts_with("c")), and so on, but I can't figure out how to build such a call.
You can use select_if and grepl
library(dplyr)
df %>%
  select_if(grepl("col", names(.)) & grepl("1", names(.)))
# column1
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
#10 10
If you want to use select with contains you could do something like this:
df %>%
  select(intersect(contains("col"), contains("1")))
This can be combined in other ways, as mentioned in the comments:
df %>%
  select(intersect(contains("1"), starts_with("c")))
You can also chain two select calls:
library(dplyr)
df <- data.frame(column1 = 1:10, col2 = 1:10, c3 = 1:10)
df %>%
  select(contains("col")) %>%
  select(contains("1"))
Not the most elegant solution for one-liner lovers, though.
You could use the dplyr::intersect function
select(df, intersect(contains("col"), contains("1")))
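Worth noting: in more recent versions of dplyr (>= 1.0.0, via the tidyselect package), the boolean operators &, | and ! work directly inside select(), so the exact syntax the question asks for is now supported. A quick sketch, assuming a current dplyr installation:

```r
library(dplyr)

df <- data.frame(column1 = 1:10, col2 = 1:10, c3 = 1:10)

# Intersection of two selection helpers:
names(select(df, contains("col") & contains("1")))
#> [1] "column1"

# Arbitrary combinations also work, e.g. union and negation:
names(select(df, contains("1") | !starts_with("c")))
#> [1] "column1"
```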
This feels like a common enough task that I assume there's an established function/method for accomplishing it. I'm imagining a function like dplyr::filter_after() but there doesn't seem to be one.
Here's the method I'm using as a starting point:
#Setup:
library(dplyr)
threshold <- 3
test.df <- data.frame(num = c(1:5, 1:5), let = letters[1:10])
#Drop the first row where num reaches the threshold, and every row after it:
out.df <- test.df %>%
  mutate(pastThreshold = cumsum(num >= threshold)) %>%
  filter(pastThreshold == 0) %>%
  dplyr::select(-pastThreshold)
This produces the desired output:
> out.df
num let
1 1 a
2 2 b
Is there another solution that's less verbose?
dplyr provides the window functions cumany() and cumall(), which let you keep all rows before (or after) a condition holds for the first time; see the documentation for details.
test.df %>%
  filter(cumall(num < threshold)) # all rows until the condition is violated for the first time
# num let
# 1 1 a
# 2 2 b
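For the mirror-image task (keeping rows from the first match onward rather than before it), cumany() is the complement. A small sketch using the same test.df and threshold from the question:

```r
library(dplyr)

threshold <- 3
test.df <- data.frame(num = c(1:5, 1:5), let = letters[1:10])

# cumany(): keep every row from the first time the condition holds
test.df %>%
  filter(cumany(num >= threshold))
# keeps rows 3 through 10, i.e. everything from the first num >= 3 onward
```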
You can do:
test.df %>%
  slice(seq_len(which.max(num == threshold) - 1))
num let
1 1 a
2 2 b
We can use the same logic directly in filter(), without creating an extra column and removing it later
library(dplyr)
test.df %>%
  filter(cumsum(num >= threshold) == 0)
# num let
#1 1 a
#2 2 b
Or another option is match with slice
test.df %>%
  slice(seq_len(match(threshold - 1, num)))
Or another option is rleid
library(data.table)
test.df %>%
  filter(rleid(num >= threshold) == 1)
I would like to filter a dataframe df based on some filter_phrase using quasiquotation (similar to this question here). However, instead of dynamically setting the column, I would like to evaluate the entire condition:
library(dplyr)
library(rlang)
df <- data.frame(a = 1:5, b = letters[1:5])
filter_phrase <- "a < 4"
df %>% filter(sym(filter_phrase))
The expected output should look like this:
> df %>% filter(a < 4)
a b
1 1 a
2 2 b
3 3 c
Any help is greatly appreciated.
An option would be parse_expr. The 'filter_phrase' is an expression stored as a string. We can convert it to a language object with parse_expr and then evaluate it with the unquote operator (!!)
library(dplyr)
df %>%
  filter(!!rlang::parse_expr(filter_phrase))
# a b
#1 1 a
#2 2 b
#3 3 c
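As an aside, the same idea extends to several string conditions at once via rlang::parse_exprs() and the splice operator !!! (each spliced condition becomes a separate filter() argument, and those are combined with &). A sketch, reusing the df from the question with a hypothetical vector of conditions:

```r
library(dplyr)
library(rlang)

df <- data.frame(a = 1:5, b = letters[1:5])

# Several conditions stored as strings:
filters <- c("a < 4", "b != 'b'")

# parse_exprs() returns a list of expressions; !!! splices them
# into filter() as multiple arguments, joined with AND:
df %>%
  filter(!!!parse_exprs(filters))
# keeps the rows where a < 4 AND b != 'b' (rows 1 and 3)
```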
I have a data.frame df with columns A and B:
df <- data.frame(A = 1:5, B = 11:15)
There's another data.frame, df2, which I'm building by various calculations that ends up having generic column names X1 and X2, which I cannot control directly (because it passes through being a matrix at one point). So it ends up being something like:
mtrx <- matrix(1:10, ncol = 2)
mtrx %>% data.frame()
I would like to rename the columns in df2 to be the same as df. I could, of course, do it after I finish building df2 with a simple assigning:
names(df2) <- names(df)
My question is: is there a way to do this directly within the pipe? I can't use dplyr::rename, because it requires arguments of the form newname = oldname, and I can't seem to vectorize that. The same goes for the data.frame call itself: I can't just give it a vector of column names, as far as I can tell. Is there another option I'm missing? What I'm hoping for is something like
mtrx %>% data.frame() %>% rename(names(df))
but this doesn't work; it gives the error Error: All arguments must be named.
Cheers!
You can use setNames
mtrx %>%
  data.frame() %>%
  setNames(., nm = names(df))
# A B
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Or use purrr's equivalent set_names
mtrx %>%
  data.frame() %>%
  purrr::set_names(., nm = names(df))
A third option is "names<-"
mtrx %>%
  data.frame() %>%
  "names<-"(names(df))
We can use rename_all from tidyverse
library(tidyverse)
mtrx %>%
  as.data.frame() %>%
  rename_all(~ names(df))
# A B
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
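In current dplyr (>= 1.0.0), rename_with() supersedes rename_all(), and the equivalent call is a one-line change. A sketch with the same mtrx and df as above:

```r
library(dplyr)

df <- data.frame(A = 1:5, B = 11:15)
mtrx <- matrix(1:10, ncol = 2)

# rename_with() applies a function to the selected column names;
# here the function simply returns the names of df:
mtrx %>%
  as.data.frame() %>%
  rename_with(~ names(df))
# columns are now named A and B
```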
I have a data frame (a tibble, actually) df, with two columns, a and b, and I want to filter out the rows in which a is a substring of b. I've tried
df %>%
dplyr::filter(grepl(a,b))
but I get a warning that seems to indicate that R is actually applying grepl with the first argument being the whole column a.
Is there any way to apply a regular expression involving two different columns to each row in a tibble (or data frame)?
If you're only interested in by-row comparisons, you can use rowwise():
df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)
df %>%
  rowwise() %>%
  filter(grepl(A, B))
A B
1 b db
2 e ge
---------------------------------------------------------------------------------
If you want to know whether the row entry of A appears anywhere in B:
df %>% rowwise() %>% filter(any(grepl(A,df$B)))
A B
1 b db
2 c ed
3 d fc
4 e ge
Or using base R's sapply with #Chi-Pak's reproducible example:
df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)
matched <- sapply(1:nrow(df), function(i) grepl(df$A[i], df$B[i]))
df[matched, ]
Result
A B
2 b db
5 e ge
You can use stringr::str_detect, which is vectorised over both string and pattern. (Whereas, as you noted, grepl is only vectorised over its string argument.)
Using #Chi Pak's example:
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(B, fixed(A)))
# A B
# 1 b db
# 2 e ge
I'm trying to figure out what I'm doing wrong here. Using the following training data I compute some frequencies using dplyr:
group.count <- c(101,99,4)
data <- data.frame(
  by = rep(3:1, group.count),
  y = rep(letters[1:3], group.count))
data %>%
  group_by(by) %>%
  summarise(non.miss = sum(!is.na(y)))
Which gives me the outcome I'm looking for. However, when I try to do it as a function:
res0 <- function(x1, x2) {
  output <- data %>%
    group_by(x2) %>%
    summarise(non.miss = sum(!is.na(x1)))
}
res0(y,by)
I get an error (index out of bounds).
Can anybody tell me what I'm missing?
Thanks in advance.
You can't do this like that in dplyr.
The problem is that you are passing it a NULL object: by doesn't exist anywhere. Your first thought might be to pass "by" instead, but that won't work with dplyr either. What dplyr is doing here is trying to group_by the variable x2, which is not a column of your data.frame. To see this, create your data.frame like so:
data <- data.frame(
  x2 = rep(3:1, group.count),
  x1 = rep(letters[1:3], group.count)
)
Then call your function again and it will return the expected output.
I suggest changing the name of your dataframe to df.
This is basically what you have done:
df %>%
  group_by(by) %>%
  summarise(non.miss = sum(!is.na(y)))
which produces this:
# by non.miss
#1 1 4
#2 2 99
#3 3 101
but to count the number of observations per group, you could use length, which gives the same answer:
df %>%
  group_by(by) %>%
  summarise(non.miss = length(y))
# by non.miss
#1 1 4
#2 2 99
#3 3 101
or, use tally, which gives this:
df %>%
  group_by(by) %>%
  tally()
# by n
#1 1 4
#2 2 99
#3 3 101
Now, if you really wanted, you could put that into a function. The input would be the dataframe. Like this:
res0 <- function(df) {
  df %>%
    group_by(by) %>%
    tally()
}
res0(df)
# by n
#1 1 4
#2 2 99
#3 3 101
This of course assumes that your dataframe will always have the grouping column named 'by'. I realize these data are fictional, but avoiding 'by' as a column name is a good idea anyway: by is itself a function in R, which can make the code confusing to read.
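For completeness: since rlang 0.4.0 (bundled with current dplyr), the embrace operator {{ }} lets a function forward bare column names into group_by() and summarise(), so the approach the asker originally attempted is now directly supported. A sketch, keeping the asker's data and argument names:

```r
library(dplyr)

group.count <- c(101, 99, 4)
data <- data.frame(
  by = rep(3:1, group.count),
  y = rep(letters[1:3], group.count))

# {{ }} forwards the bare column names supplied by the caller:
res0 <- function(df, x1, x2) {
  df %>%
    group_by({{ x2 }}) %>%
    summarise(non.miss = sum(!is.na({{ x1 }})))
}

res0(data, y, by)
# counts per group: by = 1 -> 4, by = 2 -> 99, by = 3 -> 101
```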