I am trying to add a summary column to a dataframe. Although the summary statistic should be applied to every column, the statistic itself should only be calculated based on conditional rows.
As an example, given this dataframe:
x <- data.frame(usernum=rep(c(1,2,3,4),each=3),
final=rep(c(TRUE,TRUE,FALSE,FALSE)),
time=1:12)
I would like to add a user.mean column, where the mean for each user is calculated only from the rows with final=TRUE. I have tried:
library(tidyverse)
x %>%
group_by(usernum) %>%
mutate(user.mean = mean(x$time[x$final==TRUE]))
but this gives an overall mean, rather than by user. I have also tried:
x %>%
group_by(usernum) %>%
filter(final==TRUE) %>%
mutate(user.mean = mean(time))
but this only returns the filtered dataframe:
# A tibble: 6 x 4
# Groups: usernum [4]
usernum final time user.mean
<dbl> <lgl> <int> <dbl>
1 1 TRUE 1 1.5
2 1 TRUE 2 1.5
3 2 TRUE 5 5.5
4 2 TRUE 6 5.5
5 3 TRUE 9 9
6 4 TRUE 10 10
How can I apply those means to every original row?
If we use x$ after the group_by, it returns the entire column instead of only the values in that particular group. Second, final is already a logical (TRUE/FALSE) vector, so we don't need == TRUE and can use it directly as the index.
library(dplyr)
x %>%
group_by(usernum) %>%
mutate(user.mean = mean(time[final]))
The one option where we can use $ is with .data
x %>%
group_by(usernum) %>%
mutate(user.mean = mean(.data$time[.data$final]))
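As a cross-check, the same per-user conditional mean can be sketched in base R without dplyr. This is only an illustration on the question's example data: rows with final == FALSE are masked with NA, and ave() averages what remains within each usernum. The name masked is made up for this sketch.

```r
# Example data from the question (final spelled out for all 12 rows)
x <- data.frame(usernum = rep(c(1, 2, 3, 4), each = 3),
                final = rep(c(TRUE, TRUE, FALSE, FALSE), times = 3),
                time = 1:12)

# Hide the rows that should not contribute to the mean
masked <- ifelse(x$final, as.numeric(x$time), NA_real_)

# ave() applies FUN within each usernum and returns one value per original row
x$user.mean <- ave(masked, x$usernum,
                   FUN = function(v) mean(v, na.rm = TRUE))
x
```

Every original row is kept, and each row carries the mean of the final == TRUE rows of its own user. A user with no final == TRUE rows would get NaN here.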
I am trying to get some unique combinations of two variables.
For each value of x, I would like to keep its unique y value, and drop those x values that have several y values. Several x values may still share the same y value.
For example,
a=data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6)),
and I would like to get the output like:
b=data.frame(x=c(2,4,5),y=c(3,3,6))
I have tried unique(), but it does not help this situation.
Thank you!
First we use unique to omit repeated rows with the same x and y values (keeping only one copy of each). Any repeated x values that are left have different y values, so we want to get rid of them. We use the standard way to remove all copies of any duplicated values as in this R-FAQ.
a=data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6))
b = unique(a)
b = b[!duplicated(b$x) & !duplicated(b$x, fromLast = TRUE), ]
b
# x y
# 3 2 3
# 4 4 3
# 5 5 6
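To see why both duplicated() calls are needed, here is a small sketch that prints the intermediate masks. The variable names dup_forward, dup_backward and keep are only illustrative.

```r
a <- data.frame(x = c(1, 1, 2, 4, 5, 5), y = c(2, 3, 3, 3, 6, 6))
b <- unique(a)  # drops the repeated (5, 6) row

# duplicated() flags every copy except the first one it meets
dup_forward  <- duplicated(b$x)
# scanning from the end flags every copy except the last one
dup_backward <- duplicated(b$x, fromLast = TRUE)

# an x value is kept only if neither scan flags it
keep <- !dup_forward & !dup_backward
data.frame(x = b$x, dup_forward, dup_backward, keep)
b[keep, ]
```

Using only one of the two scans would leave one copy of x = 1 behind, which is not wanted here because x = 1 maps to two different y values.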
Fans of dplyr would probably do it like this, producing the same result.
library(dplyr)
a %>%
group_by(x) %>%
filter(n_distinct(y) == 1) %>%
distinct
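For comparison, the same "keep x only if it has exactly one distinct y" rule can be sketched in base R with ave(). This is an illustrative equivalent, not part of the answer above; n_y and res are made-up names.

```r
a <- data.frame(x = c(1, 1, 2, 4, 5, 5), y = c(2, 3, 3, 3, 6, 6))

# number of distinct y values within each x, repeated on every row of that x
n_y <- ave(a$y, a$x, FUN = function(v) length(unique(v)))

# keep the x groups with a single distinct y, then drop repeated rows
res <- unique(a[n_y == 1, ])
res
```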
Using dplyr:
library(dplyr)
a <- data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6))
a %>%
distinct() %>%
add_count(x) %>% # adds an implicit group_by(x)
filter(n == 1) %>%
select(-n)
#> # A tibble: 3 x 2
#> # Groups: x [3]
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 4 3
#> 3 5 6
Created on 2018-11-14 by the reprex package (v0.2.1)
I have a table where the first two columns are sample identifiers and the third is a measure of distance, e.g.:
df<-data.table(H1=c(1,2,3,4,5),H2=c(7,3,2,8,9), D=c(100,4,55,66,35))
I want to find only the unique pairs across both columns, i.e. 1-7, 2-3, 4-8, 5-9, removing the duplicate 2-3 and 3-2 pairings, which appear in different columns, but keeping the third column (which, being a distance, is identical for 2-3 and 3-2).
# example data
df<-data.frame(H1=c(1,2,3,4,5),
H2=c(7,3,2,8,9),
D=c(100,4,55,66,35), stringsAsFactors = F)
library(dplyr)
df %>%
rowwise() %>% # for each row
mutate(HH = paste0(sort(c(H1,H2)), collapse = ",")) %>% # create a new variable that orders and combines H1 and H2
group_by(HH) %>% # group by that variable
filter(D == max(D)) %>% # keep the row where D is the maximum (assumed logic*)
ungroup() %>% # forget the grouping
select(-HH) # remove unnecessary variable
# # A tibble: 4 x 3
# H1 H2 D
# <dbl> <dbl> <dbl>
# 1 1 7 100
# 2 3 2 55
# 3 4 8 66
# 4 5 9 35
*Note: No idea what your logic is to keep 1 row from the duplicates. I had to use something as an example and here I'm keeping the row with the highest D value. This logic can change if needed.
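A vectorised base R sketch of the same idea, avoiding rowwise(): pmin() and pmax() put each pair into a fixed order, so 2-3 and 3-2 get the same key. Both tie-breaking rules shown (keep the first occurrence, or keep the highest D as assumed above) are illustrations, and the names key, by_d, first_per_pair and max_per_pair are made up for this example.

```r
df <- data.frame(H1 = c(1, 2, 3, 4, 5),
                 H2 = c(7, 3, 2, 8, 9),
                 D  = c(100, 4, 55, 66, 35))

# order-independent key: smaller id first, larger id second
key <- paste(pmin(df$H1, df$H2), pmax(df$H1, df$H2), sep = ",")

# rule 1: keep the first row seen for each pair
first_per_pair <- df[!duplicated(key), ]

# rule 2: keep the row with the largest D for each pair
by_d <- order(df$D, decreasing = TRUE)
max_per_pair <- df[by_d, ][!duplicated(key[by_d]), ]

first_per_pair
max_per_pair
```

If the distance really is identical for both orderings of a pair, the two rules return the same distances and the choice does not matter.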
This is an approximation of the original dataframe. In the original, there are many more columns than are shown here.
id init_cont family description value
1 K S impacteach 1
1 K S impactover 3
1 K S read 2
2 I S impacteach 2
2 I S impactover 4
2 I S read 1
3 K D impacteach 3
3 K D impactover 5
3 K D read 3
I want to combine the values for impacteach and impactover to generate an average value that is just called impact. I would like the final table to look like the following:
id init_cont family description value
1 K S impact 2
1 K S read 2
2 I S impact 3
2 I S read 1
3 K D impact 4
3 K D read 3
I have not been able to figure out how to generate this table. However, I have been able to create a dataframe that looks like this:
id description value
1 impact 2
1 read 2
2 impact 3
2 read 1
3 impact 4
3 read 3
What is the best way for me to take these new values and add them to the original dataframe? I also need to remove the original values (like impacteach and impactover) in the original dataframe. I would prefer to modify the original dataframe as opposed to creating an entirely new dataframe because the original dataframe has many columns.
In case it is useful, this is a summary of the code I used to create the shorter dataframe with impact as a combination of impacteach and impactover:
df %>%
mutate(newdescription = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %>%
group_by(id, newdescription) %>%
summarise(value = mean(as.numeric(value)))
What if you changed the description column first so that it could be included in the grouping:
df %>%
mutate(description = substr(description, 1, 6)) %>%
group_by(id, init_cont, family, description) %>%
summarise(value = mean(value))
# A tibble: 6 x 5
# Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.
# 2 1 K S read 2.
# 3 2 I S impact 3.
# 4 2 I S read 1.
# 5 3 K D impact 4.
# 6 3 K D read 3.
You just need to modify your group_by statement. Try group_by(id, init_cont, family, newdescription)
Because your id seems to be mapped to init_cont and family already, adding in these values won't change your summarization result. Then you have all the columns you want with no extra work.
If you have a lot of columns you could try something like the code below. Essentially, do a left_join of your summarised data onto your original data, using the . placeholder so that no new dataframe is stored. Once joined (by id and description, which we modified in place) you'll have two value columns, suffixed with .x and .y. Drop the original (value.x) and then use distinct to get rid of the duplicate 'impact' rows.
df %>%
mutate(description = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %>%
left_join(., # . is the mutated data; the summary below is built from it
group_by(., id, description) %>%
summarise(value = mean(as.numeric(value))),
by = c('id', 'description')) %>%
select(-value.x) %>% # drop the original values, keeping the group mean in value.y
distinct()
gsub can be used to replace every description that starts with impact by plain impact, and then group_by from the dplyr package will help in summarising the value.
df %>% group_by(id, init_cont, family,
description = gsub("^(impact).*","\\1", description)) %>%
summarise(value = mean(value))
# # A tibble: 6 x 5
# # Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.00
# 2 1 K S read 2.00
# 3 2 I S impact 3.00
# 4 2 I S read 1.00
# 5 3 K D impact 4.00
# 6 3 K D read 3.00
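For reference, a base R sketch of the same collapse-then-average idea, rebuilt on the approximate data from the question. The data frame below is typed in by hand from the table shown there, so column types are an assumption.

```r
df <- data.frame(id = rep(1:3, each = 3),
                 init_cont = rep(c("K", "I", "K"), each = 3),
                 family = rep(c("S", "S", "D"), each = 3),
                 description = rep(c("impacteach", "impactover", "read"), times = 3),
                 value = c(1, 3, 2, 2, 4, 1, 3, 5, 3),
                 stringsAsFactors = FALSE)

# collapse every description starting with "impact" into one label
df$description <- sub("^impact.*", "impact", df$description)

# average value within each id / init_cont / family / description combination
out <- aggregate(value ~ id + init_cont + family + description,
                 data = df, FUN = mean)
out <- out[order(out$id, out$description), ]
out
```

Only the columns named in the formula survive, so with many extra columns they would all have to be listed as grouping variables, as in the dplyr answers.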
I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get an error that my lengths don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seems to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 1 1 3
Then the result would be:
Month Average
1 8/3
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want, using only base R. split() divides the target_vars columns into one data frame per month, and sapply() keeps the results labelled by month. Note that sum() applied to each data frame returns a single grand total and does not keep the column sums separate:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you want the result per variable, divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
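The answers disagree because they divide the same monthly total by different things: the gather() and sapply() versions divide by the number of variables (3), while the nest() version averages the row sums, which divides by the number of rows in the month. A base R sketch of the second reading, which is the one described in the question's edit, might look like this (row_totals and avg are illustrative names):

```r
target_vars <- c("Ozone", "Temp", "Solar.R")

# one sum per row across the selected variables, treating NA as missing
row_totals <- rowSums(airquality[target_vars], na.rm = TRUE)

# mean of those row sums within each month
avg <- tapply(row_totals, airquality$Month, mean)
avg
```

For month 5 this should agree with the mean5 value in the validation above, since it is the monthly total divided by the number of rows in that month.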
How to select groups based on a condition on the individual rows, say keep all groups that contain at least one (ANY) of a certain value, e.g. 4, (or any other condition that is TRUE at least once). Or phrased the other way around: if a group does not have any rows where condition is true, the entire group should be removed.
Let's take a very simple data set with two groups, where I want to select the group that has at least one row with a Value of 4 (i.e. group B here)
library(dplyr)
df <- data.frame(Group = LETTERS[c(1,1,1,2,2,2)], Value=c(1:5, 4))
df
# Group Value
# 1 A 1 # Group A has no values == 4 ~~> remove entire group
# 2 A 2
# 3 A 3
# 4 B 4 # Group B has at least one 4 ~~> keep the whole group
# 5 B 5
# 6 B 4
Doing group_by() and then filter (as in this post) will only select the individual rows that contain a value of 4, not the whole group:
df %>%
group_by(Group) %>%
filter(Value == 4)
# Group Value
# <fctr> <dbl>
# 1 B 4
# 2 B 4
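This turns out to be pretty easy: you just need to use the any() function in the filter call. After group_by(), filter() evaluates its condition once per group:
filter(any(...)) yields a single TRUE/FALSE per group, which is recycled to every row of that group, so the whole group is kept or dropped,
filter(...) with a plain vectorised condition yields one TRUE/FALSE per row, so rows are kept or dropped individually, even when preceded by group_by().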
Hence use:
df %>%
group_by(Group) %>%
filter(any(Value==4))
Group Value
<fctr> <dbl>
1 B 4
2 B 5
3 B 4
Interestingly, the same behaviour appears with mutate; compare:
df %>%
group_by(Group) %>%
mutate(check1=any(Value==4),
check2=Value==4)
Group Value check1 check2
<fctr> <dbl> <lgl> <lgl>
1 A 1 FALSE FALSE
2 A 2 FALSE FALSE
3 A 3 FALSE FALSE
4 B 4 TRUE TRUE
5 B 5 TRUE FALSE
6 B 4 TRUE TRUE
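The difference between a per-group and a per-row condition can also be sketched in base R, which makes the recycling explicit. This is only an illustration on the question's data; any_per_group and keep are made-up names.

```r
df <- data.frame(Group = LETTERS[c(1, 1, 1, 2, 2, 2)], Value = c(1:5, 4))

# one logical per group: does the group contain a 4 anywhere?
any_per_group <- tapply(df$Value == 4, df$Group, any)
any_per_group

# ave() spreads that single per-group answer back over every row of the group
keep <- ave(df$Value == 4, df$Group, FUN = any)

# keeps all rows of every group that has at least one 4
df[keep, ]
```

The per-row condition df$Value == 4 on its own would select only rows 4 and 6, which mirrors the filter(Value == 4) result from the question.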
A data.table option is
library(data.table)
setDT(df)[, if(any(Value==4)) .SD, by = Group]
# Group Value
#1: B 4
#2: B 5
#3: B 4
In base R, without performing any grouping operation we can do :
subset(df, Group %in% unique(Group[Value == 4]))
# Group Value
#4 B 4
#5 B 5
#6 B 4