df <- data.frame(
  exp = c(1, 1, 2, 2),
  name = c("gene1", "gene2", "gene1", "gene2"),
  value = c(1, 1, 3, -1)
)
While trying to get accustomed to dplyr and reshape2, I stumbled over a "simple" way to select rows based on several conditions. I want the genes (the name variable) that have value above 0 in experiment 1 (exp == 1) AND, at the same time, value below 0 in experiment 2; in df this would be "gene2". Surely there must be many ways to do this, e.g. subset df for each set of conditions (exp == 1 & value > 0, and exp == 2 & value < 0) and then join the results of these subsets:
library(dplyr)
inner_join(filter(df, exp == 1 & value > 0), filter(df, exp == 2 & value < 0), by = "name")[[1]]
Although this works, it looks very awkward, and I feel that such conditional filtering lies at the heart of reshape2 and dplyr, but I cannot figure out how to do it. Can someone enlighten me here?
One alternative that comes to mind is to transform the data to a "wide" format and then do the filtering.
Here's an example using "data.table" (for the convenience of compound statements):
library(data.table)
dcast.data.table(as.data.table(df), name ~ exp)[`1` > 0 & `2` < 0]
# name 1 2
# 1: gene2 1 -1
Similarly, with "dplyr" and "tidyr":
library(dplyr)
library(tidyr)
df %>%
  spread(exp, value) %>%
  filter(`1` > 0 & `2` < 0)
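In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(); an equivalent sketch:
df %>%
  pivot_wider(names_from = exp, values_from = value) %>%
  filter(`1` > 0 & `2` < 0)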
Another dplyr option is:
group_by(df, name) %>% filter(value[exp == 1] > 0 & value[exp == 2] < 0)
#Source: local data frame [2 x 3]
#Groups: name
#
# exp name value
#1 1 gene2 1
#2 2 gene2 -1
Probably this is even more convoluted than your own solution, but I think it has a "dplyr" feel:
df %>%
  filter((exp == 1 & value > 0) | (exp == 2 & value < 0)) %>%
  group_by(name) %>%
  filter(length(unique(exp)) == 2) %>%
  select(name) %>%
  unique()
#Source: local data frame [1 x 1]
#Groups: name
# name
#1 gene2
filter allows multiple arguments separated by commas, the same as select. Each extra condition is an AND:
group_by(df, name) %>% filter(value[exp == 1] > 0, value[exp == 2] < 0)
From the official documentation: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
The examples shown there are:
flights[flights$month == 1 & flights$day == 1, ] in base R
filter(flights, month == 1, day == 1) in dplyr.
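Applied to the df from the question, the two calls below are equivalent; the comma form is just sugar for &:
filter(df, exp == 1, value > 0)
filter(df, exp == 1 & value > 0)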
Can someone help me understand what the grouping is doing here, please? Why do these two pipelines produce different grouped outputs? The top one returns the 'A' rows from every ID that has more than one row (even IDs whose other rows fall outside the 'A' category), while the bottom one returns rows only where duplicates exist within 'A' itself.
Sample Data:
df <- data.frame(ID = c(1, 1, 3, 4, 5, 6, 6),
                 Acronym = c('A', 'B', 'A', 'A', 'B', 'A', 'A'))
df %>%
  group_by(ID) %>%
  filter(Acronym == 'A', n() > 1)
df %>%
  filter(Acronym == 'A') %>%
  group_by(ID) %>%
  filter(n() > 1)
In the first example, the filter runs on the fully grouped data, so every row of an ID, whatever its Acronym, contributes to the row count n(); that is why ID 1 (one 'A' row and one 'B' row) passes. In the second example, the non-'A' rows are removed before grouping and no longer contribute to n().
If we want the first case to return only ID 6, use sum to count the 'A' values in Acronym per group:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(sum(Acronym == 'A') > 1)
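For the sample df this keeps only the ID 6 rows:
# A tibble: 2 × 2
# Groups: ID [1]
     ID Acronym
  <dbl> <chr>
1     6 A
2     6 A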
As mentioned in the other post, n() is based on the whole group count, not on the number of 'A's. If we are unsure about a filter, create a column with mutate and check the output:
df %>%
  group_by(ID) %>%
  mutate(ind = Acronym == 'A' & n() > 1)
# A tibble: 7 × 3
# Groups: ID [5]
ID Acronym ind
<dbl> <chr> <lgl>
1 1 A TRUE
2 1 B FALSE
3 3 A FALSE
4 4 A FALSE
5 5 B FALSE
6 6 A TRUE
7 6 A TRUE
I plan to filter data on multiple columns with multiple conditions in one line, to reduce the running time of the code. This is the sample data I used to test my code. Basically, I want to remove any rows containing 0, 1, 2, or NA.
test <- data.frame(A = c(1, 0, 2, 3, 4, 0, 5, 6, 0, 7, 0, 8, 0, 9, NA),
                   B = c(0, 1, 0, 2, 3, 4, 0, 5, 0, 7, 8, 0, NA, 9, 0),
                   C = c(1, 2, 3, 0, 0, 4, 5, 6, 0, 7, 0, 8, NA, 0, 9))
I used the following code to clean my data. Although it does the job, it is very tedious and takes quite a while when I run it on a large database.
test %>%
  filter(!is.na(A)) %>%
  filter(!is.na(B)) %>%
  filter(!is.na(C)) %>%
  filter(A != 0) %>%
  filter(A != 1) %>%
  filter(A != 2) %>%
  filter(B != 0) %>%
  filter(B != 1) %>%
  filter(B != 2) %>%
  filter(C != 0) %>%
  filter(C != 1) %>%
  filter(C != 2)
A B C
1 6 5 6
2 7 7 7
I tried to shorten the code using filter, filter_at, and any_vars, but it did not work. Below are my attempts; none of them worked, because they could not delete the rows containing 0, 1, 2, or NA:
df_total <- test %>%
  filter_at(vars(A, B, C), any_vars(!is.na(.))) %>%
  filter_at(vars(A, B, C), any_vars(. != 2)) %>%
  filter_at(vars(A, B, C), any_vars(. != 1)) %>%
  filter_at(vars(A, B, C), any_vars(. != 0))
df_total <- test %>%
  filter_at(vars(A, B, C), any_vars(!is.na(.) | . != 2 | . != 1 | . != 0))
df_total <- test %>%
  filter(!is.na(A) | A != 2 | A != 1 | A != 0) %>%
  filter(!is.na(B) | B != 2 | B != 1 | B != 0) %>%
  filter(!is.na(C) | C != 2 | C != 1 | C != 0)
I cannot figure out what I did incorrectly here. I went back and forth between the documentation and R to solve this problem, but my efforts were useless. Could you please point out what I did wrong in my code? How can I filter multiple columns with multiple conditions in just one line? The point of one line is to speed up the running time in R. Any advice, suggestions, or resources would be appreciated. Thank you.
Another possible solution:
library(dplyr)
test %>%
  filter(complete.cases(.) & if_all(everything(), ~ !(.x %in% 0:2)))
#> A B C
#> 1 6 5 6
#> 2 7 7 7
test %>%
  filter(across(c(A, B, C), function(x) !is.na(x) & !x %in% c(0, 1, 2)))
# A B C
# 6 5 6
# 7 7 7
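As an aside, the filter_at attempts above fail because any_vars() keeps a row when the condition holds for at least one column, and a chain like . != 2 | . != 1 | . != 0 is TRUE for every non-missing value (no value can equal all three at once). all_vars() combined with & expresses the intended logic; filter_at is superseded in current dplyr, but this sketch shows the one-line fix:
test %>%
  filter_at(vars(A, B, C), all_vars(!is.na(.) & !. %in% 0:2))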
I'm struggling with the filter (dplyr) function on a tidy dataframe:
data1 <- data.frame(Time = c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
                    Variable = rep(c("a","b","c","d"), 6),
                    Value = c(0,1,0,0,1,1,1,1,1,3,2,3,10,1,3,7,2,1,4,2,3,1,5,13))
What I want to do is filter the times when variable "a" is equal to 2 and, separately, when variable "a" is at its max.
For the first case my code is:
data1 <- data1 %>%
  group_by(Time) %>%
  filter(any(Variable == "a" & Value == 2))
and it works fine, giving me:
Time Variable Value
2 a 2
2 b 1
2 c 4
2 d 2
I don't know how to do it for a = max(a). I tried:
data1 <- data1 %>%
  group_by(Time) %>%
  filter(any(Variable == "a" & Value == max(Value)))
but it doesn't work (because max is calculated over the whole Value column, not just over variable "a"). I think I need something like
Value == max(Value[Variable == "a"]).
The filter should act this way:
Time Variable Value
3 a 10
3 b 1
3 c 3
3 d 7
I would prefer a solution with dplyr. Can anyone give me a general rule for filtering a tidy data frame with multiple criteria?
Here's a dplyr way:
library(dplyr)
data1 %>%
  filter(Time == Time[Value == max(Value[Variable == "a"])])
And a data.table way
library(data.table)
setDT(data1)
data1[Time == Time[Value == max(Value[Variable == "a"])]]
An additional option:
data1 %>%
  filter(Variable == "a") %>%
  filter(Value == max(Value, na.rm = TRUE)) %>%
  select(Time) %>%
  left_join(., data1, by = "Time")
Based on the edited criteria this should provide the desired results.
data1 <- data1 %>%
  group_by(Time) %>%
  filter(any(Variable == "a" &
               Value == max(data1$Value[data1$Variable == "a"])))
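As a general rule for this kind of tidy filtering: group by the unit you want to keep or drop as a whole, then filter with any() over a row-level condition. Statistics that span all groups (like the overall max for variable "a") are safest computed beforehand, because inside a grouped filter max() is evaluated per group. A sketch:
library(dplyr)
max_a <- max(data1$Value[data1$Variable == "a"])  # overall max among "a" rows
data1 %>%
  group_by(Time) %>%
  filter(any(Variable == "a" & Value == max_a))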
df <- data.frame(Name = c('black', 'white', 'green', 'red', 'brown', 'blue'),
                 Num = c(1, 1, 1, 0, 1, 0))
How many times does 1 change to 0 in the column Num? How can I count this in R?
One way is to use head and tail and count instances where the previous value was 1 and the current value is 0.
sum(head(df$Num, -1) == 1 & tail(df$Num, -1) == 0)
#[1] 2
Using the same logic with dplyr's lead/lag, we can do:
library(dplyr)
df %>% filter(Num == 0 & lag(Num) == 1) %>% nrow()
df %>% filter(Num == 1 & lead(Num) == 0) %>% nrow()
We can just use rle from base R
sum(rle(df$Num)$values)
#[1] 2
Or with rleid from data.table
library(data.table)
nrow(setDT(df)[, .N[any(Num > 0)] , rleid(Num)])
#[1] 2
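For 0/1 data, diff() from base R gives another one-liner, since each 1-to-0 step shows up as a -1 (a small sketch, assuming Num contains no NAs):
sum(diff(df$Num) == -1)
#[1] 2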
I've got several sequential comparative evaluations to conduct with two variables in R in order to check for concordance.
In this example, say I have a boolean ANES_6 and a numeric ANES. The boolean is 1 if the patient had anesthesia for more than 6 hours, 0 else. The numeric value is the time the patient was under anesthesia.
I'm looking to write a function which can replace multiple copy-pastes of the following:
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES < 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES < 6) %>%
  tally()
I could create the following function (non-exhaustive of all cases shown above):
my_func <- function(x, y) {
  if (x == "gt" & y == 1) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES >= 6 & ANES_6 == 1) %>%
      tally()
  } else if (x == "lt" & y == 0) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES < 6 & ANES_6 != 1) %>%
      tally()
  }
}
which takes x and y as input, with values for x in c('lt', 'gt') and y in c(0, 1), in order to evaluate all possible conditions. However, this entails writing more code, not less.
Is there a way to input logical comparisons in the function such that the following works:
my_func <- function(x, y) {
  data %>% select(ANES_6, ANES) %>%
    filter(ANES x 6 & ANES_6 == y)
}
with x replaced by >=, <, etc. in the call. Currently this does not work; are there any workarounds?
Try grouping. The question should normally include reproducible test data but I have provided it this time.
library(dplyr)
data <- data.frame(ANES_6 = c(0, 0, 1, 1), ANES = 5:6) # test data
data %>%
  group_by(ANES_6, ANES >= 6) %>%
  tally %>%
  ungroup
giving:
# A tibble: 4 x 3
ANES_6 `ANES >= 6` n
<dbl> <lgl> <int>
1 0. FALSE 1
2 0. TRUE 1
3 1. FALSE 1
4 1. TRUE 1
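The literal question, passing a comparison operator into a function, can also be answered directly, because operators are ordinary functions in R. A sketch (this my_func is hypothetical; match.fun() accepts either the operator itself or its name as a string):
library(dplyr)
my_func <- function(op, y) {
  op <- match.fun(op)  # turns ">=" or `>=` into a callable function
  data %>%
    select(ANES_6, ANES) %>%
    filter(op(ANES, 6) & ANES_6 == y) %>%
    tally()
}
my_func(">=", 1)  # rows with ANES >= 6 & ANES_6 == 1
my_func("<", 0)   # rows with ANES <  6 & ANES_6 == 0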