Subset data frame based on multiple conditions? - r

I have a data frame:
df = data.frame(sample.id = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7), sample.type = c("U", "S", "S", "U", "U", "D", "D", "U", "U", "D"), cond = c(1.4, 17, 12, 0.45, 1, 7, 1, 9, 0, 14))
I want a data frame that contains only the rows for sample.ids that have both sample.type "U" and sample.type "D".
Desired result:
df.new = data.frame(sample.id = c(4, 4, 7, 7), sample.type = c("U", "D", "U", "D"), cond = c(1, 7, 0, 14))
What's the easiest way to do this? duplicated() doesn't work because it returns sample.ids with U and S as well as those with U and D. I can't figure out how to filter/subset for sample.ids that have both sample.type U and sample.type D. Thanks for any advice!

We can do a filter by group
library(dplyr)
df %>%
group_by(sample.id) %>%
filter(all(c("U", "D") %in% sample.type))
# A tibble: 4 x 3
# Groups: sample.id [2]
# sample.id sample.type cond
# <dbl> <fct> <dbl>
#1 4 U 1
#2 4 D 7
#3 7 U 0
#4 7 D 14

Using filter with any
df %>% group_by(sample.id) %>% filter(any(sample.type == 'U') & any(sample.type == 'D'))
# A tibble: 4 x 3
# Groups: sample.id [2]
# sample.id sample.type cond
# <dbl> <fctr> <dbl>
#1 4 U 1
#2 4 D 7
#3 7 U 0
#4 7 D 14

With data.table
library(data.table)
setDT(df)
df[, if(all(c('U', 'D') %in% sample.type)) .SD, by = sample.id]
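For comparison, here is a minimal base R sketch of the same idea, assuming the sample types are stored as quoted strings. The helper names ids_u, ids_d and both are illustrative and not part of the original answers. It collects the ids that have a "U" row and the ids that have a "D" row, then keeps the rows whose id appears in both sets.

```r
# Data from the question, with sample.type as character strings
df <- data.frame(
  sample.id   = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7),
  sample.type = c("U", "S", "S", "U", "U", "D", "D", "U", "U", "D"),
  cond        = c(1.4, 17, 12, 0.45, 1, 7, 1, 9, 0, 14),
  stringsAsFactors = FALSE
)

# ids that have at least one "U" row, and ids that have at least one "D" row
ids_u <- df$sample.id[df$sample.type == "U"]
ids_d <- df$sample.id[df$sample.type == "D"]

# ids present in both sets
both <- intersect(ids_u, ids_d)

# keep every row belonging to those ids
df.new <- df[df$sample.id %in% both, ]
df.new
```

Like the grouped filters above, this keeps all rows of a qualifying id, so an id with U, D and S rows would keep its S row too.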

Related

Transform subject ID across groups that vary in size

An MWE is as follows:
I have 3 groups with 2, 4, and 3 subjects respectively. So I have:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1, 2, 3, 4, 1, 2, 3)
df <- rbind(Group, Subject_ID)
Since the subjects in different groups are different subjects, I want the subject ID to be unique for each subject in the dataset. What I did was as follows:
Num_Subjects <- c(length(unique(filter(df, Group == 1)$Subject)),
length(unique(filter(df, Group == 2)$Subject)),
length(unique(filter(df, Group == 3)$Subject))
)
# Then I defined a summation function to calculate how many subjects there are in all previous groups.
sumfun <- function(x,start,end){
return(sum(x[start:end]))
}
# Then I defined another function that generates a new subject ID for each subject in each group.
SubjIDFn <- function(x, i) {
x %>% filter(Session == i) %>% mutate(
Sujbect = Subject + sumfun(Num_Subjects, 1, i-1)
)
}
# Then I loop this from group 2 to group 3,
for (i in 2:3) {
df.Corruption.WithoutS1 <- SubjIDFn(df.Corruption.WithoutS1, i)
}
Then the data set has zero observations. I don't know where it went wrong, or what a smart solution to this problem would be. Thanks for your help!
I think you're overcomplicating it... If Subject_ID is unique within groups, you can simply use:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1 ,2, 3, 4, 1, 2, 3)
df <- bind_cols(Group=Group, Subject_ID=Subject_ID)
df %>% mutate(unique_id = paste(Group, Subject_ID, sep="."))
# A tibble: 9 x 3
Group Subject_ID unique_id
<dbl> <dbl> <chr>
1 1 1 1.1
2 1 2 1.2
3 2 1 2.1
4 2 2 2.2
5 2 3 2.3
6 2 4 2.4
7 3 1 3.1
8 3 2 3.2
9 3 3 3.3
Note that I used bind_cols instead of rbind to have a dataframe instead of a matrix.
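If a running numeric ID is preferred over a pasted label, the offset idea from the question can be sketched in base R without a loop. One likely reason the loop in the question ends with zero rows is that each call keeps only the rows of one group, so the second pass filters an already filtered data frame. The names groups, n_per_group and offset below are illustrative assumptions, not taken from the original post.

```r
Group      <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1, 2, 3, 4, 1, 2, 3)
df <- data.frame(Group, Subject_ID)

# number of distinct subjects in each group, in group order
groups      <- sort(unique(df$Group))
n_per_group <- sapply(groups, function(g) length(unique(df$Subject_ID[df$Group == g])))

# offset for a group = number of subjects in all previous groups
offset <- cumsum(c(0, n_per_group[-length(n_per_group)]))

# shift each within-group ID by the offset of its group
df$unique_id <- df$Subject_ID + offset[match(df$Group, groups)]
df
```

This assumes Subject_ID runs from 1 to the group size within each group, as in the example data.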

applying a function across rows in dataframe

I have a dataset of approximate counts of birds of 5 species. I wrote a function to calculate the diversity of species using the Brillouin index. My data and my function look like this:
df <- data.frame(
sp1 = c(2, 3, 4, 5),
sp2 = c(1, 6, 7, 2),
sp3 = c(1, 9, 4, 3),
sp4 = c(2, 2, 2, 4),
sp5 = c(3, 3, 2, 1),
treatment1 = c("A", "B", "C", "A"),
treatment2 = c("D", "E", "D", "E")
)
# function that estimates the Brillouin index
Brillouin_Index <- function(x){
N <- sum(x)
(log10(factorial(N)) - sum(log10(factorial(x)))) / N
}
df2 <- df %>%
mutate(bindex = Brillouin_Index(matrix(df[1:5,])))
How do I apply my function to calculate the Brillouin index across rows? I thought something like the above would work, but no luck yet. The point is to use the diversity index as the response variable in relation to treatment 1 and 2, which is why I'd like to compute a single value per row and store it in a new variable called bindex. Any help will be greatly appreciated. Best,
We can use rowwise to group by row
library(dplyr)
df <- df %>%
rowwise %>%
mutate(bindex = Brillouin_Index(as.matrix(c_across(1:5)))) %>%
ungroup
Output:
df
# A tibble: 4 x 8
# sp1 sp2 sp3 sp4 sp5 treatment1 treatment2 bindex
# <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#1 2 1 1 2 3 A D 0.464
#2 3 6 9 2 3 B E 0.528
#3 4 7 4 2 2 C D 0.527
#4 5 2 3 4 1 A E 0.505
Or use apply in base R
df$bindex <- apply(df[1:5], 1, Brillouin_Index)
df$bindex
#[1] 0.4643946 0.5277420 0.5273780 0.5051951
Or with dapply in collapse
library(collapse)
df$bindex <- dapply(slt(df, 1:5), Brillouin_Index, MARGIN = 1)
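As a sanity check, the first row can be worked through by hand in base R. The sketch below also shows an equivalent form using lfactorial(), which is an assumption-free rewrite of the same formula but is my addition rather than part of the answers. It may be useful when counts are large, since factorial() returns Inf from 171 upwards.

```r
# Function from the question
Brillouin_Index <- function(x) {
  N <- sum(x)
  (log10(factorial(N)) - sum(log10(factorial(x)))) / N
}

# Same formula on the log scale: log10(n!) = lfactorial(n) / log(10)
Brillouin_Index_log <- function(x) {
  N <- sum(x)
  (lfactorial(N) - sum(lfactorial(x))) / (N * log(10))
}

# First row of the example data: counts of the five species
row1 <- c(sp1 = 2, sp2 = 1, sp3 = 1, sp4 = 2, sp5 = 3)

b_direct <- Brillouin_Index(row1)      # (log10(9!) - sum(log10(x!))) / 9
b_log    <- Brillouin_Index_log(row1)  # same value via lfactorial
c(b_direct, b_log)
```

Both calls agree with the first value of the apply() result above (about 0.464).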

How to keep all columns when concatenating rows with dplyr::summarise?

I want to aggregate one column (C) in a data frame according to one grouping variable A, separating the individual values with a comma, while keeping the other column B. However, B can either contain a character (which is always the same within a group) or be empty. In that case, I would like to keep the character whenever it is present in at least one row.
Here is a simplified example:
data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = c("", "", "", "a" , "", "a"), C = c(5:10))
data
Based on the question "Collapse / concatenate / aggregate a column to a single comma separated string within each group", I have the following code:
library(dplyr)
data %>%
group_by(A) %>%
summarise(test = toString(C)) %>%
ungroup()
Here is what I would like to obtain:
A B C
1 111 5,6,7
2 222 a 8,9,10
Use summarise_all()
To keep all your columns, you can use summarise_all():
data %>%
group_by(A) %>%
summarise_all(toString)
# A tibble: 2 x 3
A B C
<dbl> <chr> <chr>
1 111 1, 2, 1 5, 6, 7
2 222 2, 1, 2 8, 9, 10
Edit for updated question
You can add a B column to summarise to achieve the desired result:
data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = c("", "", "", "a" , "", "a"), C = c(5:10))
data
library(dplyr)
data %>%
group_by(A) %>%
summarise(B = names(sort(table(B),decreasing=TRUE))[1],
C = toString(C)) %>%
ungroup()
# A tibble: 2 x 3
A B C
<dbl> <fct> <chr>
1 111 "" 5, 6, 7
2 222 a 8, 9, 10
This returns the most frequent value in column B: table() counts each value and sort(..., decreasing = TRUE) puts the most frequent one first.
Hope this helps.
You could write one function to return unique values
library(dplyr)
get_common_vars <- function(x) {
if(n_distinct(x) > 1) unique(x[x !='']) else unique(x)
}
and then use it on all the columns you are interested in:
data %>%
group_by(A) %>%
mutate(C = toString(C)) %>%
summarise_at(vars(B:C), get_common_vars)
# ^------ Include all columns here
# A tibble: 2 x 3
# A B C
# <dbl> <fct> <chr>
#1 111 "" 5, 6, 7
#2 222 a 8, 9, 10
You can also use the paste() function and leverage the collapse argument.
data %>%
group_by(A) %>%
summarise(
B = paste(unique(B), collapse = ""),
C = paste(C, collapse = ", "))
# A tibble: 2 x 3
A B C
<chr> <chr> <chr>
1 111 "" 5, 6, 7
2 222 a 8, 9, 10
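The same result can be sketched in base R by splitting each column by A and reducing every piece to a single string. The helper first_non_empty is a hypothetical name introduced here for illustration. It returns the first non-empty value of B in a group, or "" when the group has none.

```r
data <- data.frame(A = c(rep(111, 3), rep(222, 3)),
                   B = c("", "", "", "a", "", "a"),
                   C = 5:10,
                   stringsAsFactors = FALSE)

# hypothetical helper: first non-empty string in x, or "" if there is none
first_non_empty <- function(x) {
  x <- x[x != ""]
  if (length(x) > 0) x[1] else ""
}

# one value per group for B and C; split() orders the groups by A
b <- vapply(split(data$B, data$A), first_non_empty, character(1))
cc <- vapply(split(data$C, data$A), toString, character(1))

out <- data.frame(A = as.numeric(names(b)),
                  B = unname(b),
                  C = unname(cc),
                  stringsAsFactors = FALSE)
out
```

This relies on the question's statement that B holds at most one distinct non-empty value per group.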

Subset with all values for a variable in R

I have a data frame in which a variable S takes different values of another variable B.
Like this:
[image: DataFrame]
So, I need the subset in which a value of S contains all the possible values of B. In this example, the subset consists of S = a and S = b:
[image: Subset]
Any ideas? Thanks!
An option would be to group by 'S' and keep only the groups whose 'B' values include every unique value of column 'B':
library(dplyr)
un1 <- unique(df1$B)
df1 %>%
group_by(S) %>%
filter(all(un1 %in% B))
# A tibble: 8 x 2
# Groups: S [2]
# S B
# <fct> <dbl>
#1 a 1
#2 a 2
#3 a 3
#4 a 4
#5 d 1
#6 d 2
#7 d 3
#8 d 4
Or with data.table
library(data.table)
setDT(df1)[, .SD[all(un1 %in% B)], S]
Or using base R
df1[with(df1, ave(B, S, FUN = function(x) all(un1 %in% x)) == 1),]
data
df1 <- data.frame(S = rep(letters[1:4], c(4, 3, 2, 4)),
B = c(1:4, c(1, 3, 4), 1:2, 1:4))
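To see why only some groups survive, it can help to look at the per-group test on its own. The following base R sketch uses the example data above. The names complete, keep and res are illustrative.

```r
df1 <- data.frame(S = rep(letters[1:4], c(4, 3, 2, 4)),
                  B = c(1:4, c(1, 3, 4), 1:2, 1:4))

# all distinct values of B across the whole data frame
un1 <- unique(df1$B)

# for each value of S: does its set of B values cover all of un1?
complete <- vapply(split(df1$B, df1$S),
                   function(b) all(un1 %in% b),
                   logical(1))
complete

# keep the rows of the groups that pass the test
keep <- names(complete)[complete]
res  <- df1[df1$S %in% keep, ]
res
```

With this data, group b lacks the value 2 and group c lacks 3 and 4, so only a and d pass.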

Summarize different Columns with different Functions

I have the following problem: I have a data frame with many rows and columns, the first column being the date. For each date I have more than one observation, and I want to summarize them.
My df looks like that (date replaced by ID for ease of use):
df:
ID Cash Price Weight ...
1 0.4 0 0
1 0.2 0 82 ...
1 0 1 0 ...
1 0 3.2 80 ...
2 0.3 1 70 ...
... ... ... ... ...
I want to group them by the first column and then summarize all rows, BUT with different functions:
The function for Cash and Price should be sum, so I get the sum of Cash and Price for each ID. The function for Weight should be max, so I get only the maximum weight for each ID.
Because I have so many columns, I cannot write out all the functions by hand, but only 2 columns should be summarized by max; the rest should be summarized by sum.
So I am looking for a way to group by ID and summarize all columns with sum, except 2 columns for which I need the max value.
I tried to use the dplyr package with:
df %>% group_by(ID = tolower(ID)) %>% summarise_each(funs(sum))
But I also need to take the max, not the sum, of the 2 specified columns. Any ideas?
To be clear, the output of the example df should be:
ID Cash Price Weight
1 0.6 4.2 82
2 0.3 1 70
As of dplyr 1.0.0 you can use across():
library(dplyr)
tribble(
~ID, ~max1, ~max2, ~sum1, ~sum2, ~sum3,
1, 1, 1, 1, 2, 3,
1, 2, 3, 1, 2, 3,
2, 1, 1, 1, 2, 3,
2, 3, 4, 2, 3, 4,
3, 1, 1, 1, 2, 3,
3, 4, 5, 3, 4, 5,
3, NA, NA, NA, NA, NA
) %>%
group_by(ID) %>%
summarize(
across(matches("max1|max2"), ~ max(.x, na.rm = TRUE)),
across(!matches("max1|max2"), ~ sum(.x, na.rm = TRUE))
)
# ID max1 max2 sum1 sum2 sum3
# 1 2 3 2 4 6
# 2 3 4 3 5 7
# 3 4 5 4 6 8
We can use
df %>%
group_by(ID) %>%
summarise(Cash = sum(Cash), Price = sum(Price), Weight = max(Weight))
If we have many columns, one way would be to do this separately and then join the output together.
df1 <- df %>%
group_by(ID) %>%
summarise_each(funs(sum), Cash:Price)
df2 <- df %>%
group_by(ID) %>%
summarise_each(funs(max), Weight)
inner_join(df1, df2, by = "ID")
# ID Cash Price Weight
# (int) (dbl) (dbl) (int)
#1 1 0.6 4.2 82
#2 2 0.3 1.0 70
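The "summarise separately, then join" idea can also be sketched in base R with aggregate() and merge(). The data frame below is a reconstruction of the rows shown in the question, and max_cols and sum_cols are illustrative names. With many columns, only max_cols needs to be listed by hand.

```r
# Rows shown in the question
df <- data.frame(ID     = c(1, 1, 1, 1, 2),
                 Cash   = c(0.4, 0.2, 0, 0, 0.3),
                 Price  = c(0, 0, 1, 3.2, 1),
                 Weight = c(0, 82, 0, 80, 70))

# columns to summarise with max; everything else (except ID) gets sum
max_cols <- "Weight"
sum_cols <- setdiff(names(df), c("ID", max_cols))

# summarise each set of columns per ID, then join on ID
sums <- aggregate(df[sum_cols], by = df["ID"], FUN = sum)
maxs <- aggregate(df[max_cols], by = df["ID"], FUN = max)
out  <- merge(sums, maxs, by = "ID")
out
```

The result has one row per ID with summed Cash and Price and the maximum Weight, matching the expected output in the question.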
Or do it w/o the double groups:
library(dplyr)
set.seed(1492)
df <- data.frame(id=rep(c(1,2), 3),
cash=rnorm(6, 0.5, 0.1),
price=rnorm(6, 0.5, 0.1)*6,
weight=sample(100, 6))
df
## id cash price weight
## 1 1 0.4410152 2.484082 10
## 2 2 0.4101343 3.032529 93
## 3 1 0.3375889 2.305076 58
## 4 2 0.6047922 3.248851 55
## 5 1 0.4721711 3.209930 34
## 6 2 0.5362493 2.331530 99
custom_summarise <- function(do_df) {
return(bind_cols(
summarise_each(select(do_df, -weight), funs(sum)),
summarise_each(select(do_df, weight), funs(max))
))
}
group_by(df, id) %>% do(custom_summarise(.))
## Source: local data frame [2 x 4]
## Groups: id [2]
##
## id cash price weight
## (dbl) (dbl) (dbl) (int)
## 1 3 1.250775 7.999089 58
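## 2 6 1.551176 8.612910 99
Note that id is summed here as well (3 = 1+1+1, 6 = 2+2+2), because select(do_df, -weight) still contains id.
With data.table:
library(data.table)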
setDT(df)
df[,.(Cash = sum(Cash),Price = sum(Price),Weight = max(Weight)),by=ID]
One way of doing this for 90+ columns can be:
max_col <- 'Weight'
sum_col <- setdiff(colnames(df), c(max_col, 'ID')) # exclude the grouping column ID
query_1 <- paste0(sum_col,' = sum(',sum_col,')')
query_2 <- paste0(max_col,' = max(',max_col,')')
query_3 <- paste(query_1,collapse=',')
query_4 <- paste(query_2,collapse=',')
query_5 <- paste(query_3,query_4,sep=',')
final_query <- paste0('df[,.(',query_5,'),by = ID]')
eval(parse(text = final_query))
Here is a solution based on a comment on an issue in the dplyr repo. I think it's general enough to be applied to more complicated cases.
library(tidyverse)
df <- tribble(
~ID, ~Cash, ~Price, ~Weight,
#----------------------
'a', 4, 6, 8,
'a', 7, 3, 0,
'a', 7, 9, 0,
'b', 2, 8, 8,
'b', 5, 1, 8,
'b', 8, 0, 1,
'c', 2, 1, 1,
'c', 3, 8, 0,
'c', 1, 9, 1
)
out <- list(.vars=lst(vars(-Weight), vars(Weight)),
.funs=lst(sum, max))%>%
pmap(~df%>%group_by(ID)%>%summarise_at(.x, .y)) %>%
reduce(inner_join)
out
# A tibble: 3 x 4
# ID Cash Price Weight
# <chr> <dbl> <dbl> <dbl>
# 1 a 18 18 8
# 2 b 15 9 8
# 3 c 6 18 1
You should specify the variables in the first lst (e.g. vars(-Weight), vars(Weight)) and the respective functions to be applied in the second lst (sum, max). The .x in the summarise_at call refers to the elements of the variable lst, and .y refers to the elements of the function lst.
