Consider the situation where I want to apply summarize_each to a data.frame with mixed column types.
> (temp=data.frame(ID=c(1,1,2,2),gender=c("M","M","F","F"),val1=rnorm(4),val2=rnorm(4)))
ID gender val1 val2
1 1 M -1.7944804 0.5232313
2 1 M 0.3938437 -0.8424086
3 2 F -0.3190777 0.3220580
4 2 F 1.3667340 -0.6031376
> temp%>%group_by(ID)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
ID gender val1 val2
(dbl) (lgl) (dbl) (dbl)
1 1 NA -0.7003184 -0.1595886
2 2 NA 0.5238282 -0.1405398
This doesn't work because mean(gender) doesn't make sense.
Question:
If all my non-numeric columns are attributes of ID, and are therefore identical within each ID, can I somehow get summarize_each to return that 'unique' value?
> temp%>%group_by(ID,gender)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
Groups: ID [?]
ID gender val1 val2
(dbl) (fctr) (dbl) (dbl)
1 1 M -0.7003184 -0.1595886
2 2 F 0.5238282 -0.1405398
is the output that I want, but I somehow feel like this is doing unnecessary nested group_by because there really is nothing to group within ID.
One option is gather/spread from tidyr. Reshape to 'long' format with gather, group by 'ID' and 'var', take the first element of 'gender' and the mean of 'val', then spread the result back to 'wide' format.
library(tidyr)
library(dplyr)
gather(temp, var, val, val1:val2) %>%
group_by(ID, var) %>%
summarise(gender = first(gender), val = mean(val)) %>%
spread(var, val)
Another option is mutate_if with unique. After grouping by 'ID', we replace the numeric columns with their group means using mutate_if. As the other columns (i.e. 'gender') remain in the output, we can call unique to keep only the distinct rows.
temp %>%
group_by(ID) %>%
mutate_if(is.numeric, mean) %>%
unique()
# ID gender val1 val2
# <int> <chr> <dbl> <dbl>
#1 1 M -0.7003184 -0.1595886
#2 2 F 0.5238281 -0.1405398
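As a hedged sketch of the same idea in one summarise call, assuming dplyr 1.0.0 or later (where across() and where() are available): take the mean of the numeric columns and the first value of everything else. The set.seed call and the name res are only there for illustration.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed only so the illustration is reproducible
temp <- data.frame(ID = c(1, 1, 2, 2),
                   gender = c("M", "M", "F", "F"),
                   val1 = rnorm(4),
                   val2 = rnorm(4))

res <- temp %>%
  group_by(ID) %>%
  summarise(across(where(is.numeric), mean),    # mean of val1, val2 (ID is the grouping column)
            across(!where(is.numeric), first))  # first value of gender within each ID
res
```

This relies on the non-numeric columns really being constant within each ID; first() does not check that, it just picks the first value.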
Related
a1 <- data.frame(id=c(1,1,1,1,2,2,2,3,3),
var=c("A",NA,NA,"B","B","B",NA,NA,NA))
desired_1 <- data.frame(id=c(1,2,3),
A=c(T,NA,NA),
B=c(T,T,NA),
None=c(NA,NA,T))
desired_2 <- data.frame(id=c(1,1,2,3),
type=c("A","B","B","None"))
what is the most efficient method to generate both desired_1 and desired_2 using either data.table or dplyr?
We can group by 'id' and use summarise to return 'None' if all the elements in 'var' are NA, or else the unique non-NA elements of 'var'.
library(dplyr)
a1 %>%
group_by(id) %>%
summarise(var = if(all(is.na(var))) "None" else unique(var[!is.na(var)]) )
# A tibble: 4 x 2
# Groups: id [3]
# id var
# <dbl> <chr>
#1 1 A
#2 1 B
#3 2 B
#4 3 None
Or using data.table
library(data.table)
setDT(a1)[, .(var = if(all(is.na(var))) "None" else unique(var[!is.na(var)])), id]
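The answers above produce the long form (desired_2). A possible sketch for also getting the wide form (desired_1) is below; it assumes tidyr 1.0.0 or later for pivot_wider, and the names long, wide, type and present are illustrative rather than taken from the original answer. The long form is built here with mutate/filter/distinct so that each step returns a predictable number of rows.

```r
library(dplyr)
library(tidyr)

a1 <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                 var = c("A", NA, NA, "B", "B", "B", NA, NA, NA))

# Long form: one row per id and observed type, "None" when var is all NA
long <- a1 %>%
  group_by(id) %>%
  mutate(type = if (all(is.na(var))) "None" else as.character(var)) %>%
  ungroup() %>%
  filter(!is.na(type)) %>%
  distinct(id, type)

# Wide form: one logical column per type, NA where the type is absent
wide <- long %>%
  mutate(present = TRUE) %>%
  pivot_wider(names_from = type, values_from = present)
```

Here long should correspond to desired_2 and wide to desired_1, up to column classes (tibble rather than data.frame).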
I would like to know how to keep ordering after spread.
data<-tibble(var=c("A","C","D","B"), score=c(1,2,4,3))
data_spread <-data%>%spread(key = var, value = score)
I would like to keep the order of c("A","C","D","B").
An option is to convert 'var' to a factor with levels specified as the unique elements of 'var', which makes sure the column order after spread is the order of occurrence.
library(dplyr)
library(tidyr)
data %>%
mutate(var = factor(var, levels = unique(var))) %>%
spread(var, score)
# A tibble: 1 x 4
# A C D B
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 4 3
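If tidyr 1.0.0 or later is available, pivot_wider is another option worth trying: by default (names_sort = FALSE) it orders the new columns by first appearance, so the factor conversion should not be needed. A minimal sketch, with data_wide as an illustrative name:

```r
library(tibble)
library(tidyr)

data <- tibble(var = c("A", "C", "D", "B"), score = c(1, 2, 4, 3))

# New columns follow the order in which the values of 'var' first appear
data_wide <- pivot_wider(data, names_from = var, values_from = score)
names(data_wide)
```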
If I have tidied data:
df = expand.grid(Name=c("Sub1","Sub2","Sub3"),Vis=c("Yes","No")) %>%
mutate(KPR_mean=c(NA,1,3,2,3,2),KPR_range=c(NA,4,4,2,6,5)) %>%
filter(complete.cases(.))
I'd like to filter out incomplete factor combinations, to be left with a full factorial model. Right now, I'm doing so as follows:
df %>%
unite(KPR_mean_range,KPR_mean,KPR_range) %>%
spread(Vis,KPR_mean_range) %>%
filter(complete.cases(.)) %>%
gather(Win,KPR_mean_range,-Name) %>%
separate(KPR_mean_range,c("KPR_mean","KPR_range"),sep="_")
But that seems really verbose, and also difficult to extend once there are multiple factors and more variables. Is there a way to filter on a grouping variable, instead of a row? I.e., for each level of Name, if filter(complete.cases(.)) would remove a row from that group, then remove the entire group instead?
For the new data, expand the data to all factor combinations with complete, group by whichever variable you want the complete cases in, and filter out groups with NAs:
df %>% complete(Vis, Name) %>% group_by(Name) %>% filter(!any(is.na(KPR_mean)))
# Source: local data frame [4 x 4]
# Groups: Name [2]
#
# Vis Name KPR_mean KPR_range
# (fctr) (fctr) (dbl) (dbl)
# 1 Yes Sub2 1 4
# 2 Yes Sub3 3 4
# 3 No Sub2 3 6
# 4 No Sub3 2 5
Here is one option with data.table. We convert the 'data.frame' to a 'data.table', specifying the key columns (setDT(df, ..)), do a cross join, and then, grouped by 'Name', keep the group's rows only if there are no 'NA' values in 'KPR_mean'.
library(data.table)
setDT(df, key = c("Name", "Vis"))[CJ(Name, Vis, unique=TRUE)][,
if(all(!is.na(KPR_mean))) .SD , Name]
# Name Vis KPR_mean KPR_range
#1: Sub2 Yes 1 4
#2: Sub2 No 3 6
#3: Sub3 Yes 3 4
#4: Sub3 No 2 5
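Because df has already had its incomplete rows removed, a group is "full" exactly when it still contains every observed level of Vis. Under that assumption, a sketch that skips the expansion step and filters whole groups by counting levels could look like this (n_vis and balanced are illustrative names):

```r
library(dplyr)

df <- expand.grid(Name = c("Sub1", "Sub2", "Sub3"), Vis = c("Yes", "No")) %>%
  mutate(KPR_mean = c(NA, 1, 3, 2, 3, 2), KPR_range = c(NA, 4, 4, 2, 6, 5)) %>%
  filter(complete.cases(.))

# Number of Vis levels that a complete group must contain
n_vis <- n_distinct(df$Vis)

# Keep only the Names that still have a row for every Vis level
balanced <- df %>%
  group_by(Name) %>%
  filter(n_distinct(Vis) == n_vis) %>%
  ungroup()
```

With more than one crossing factor, the same comparison would have to be made against the number of distinct combinations of those factors rather than a single count.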
Does anyone know of a fast way to select 'all-but-one' (or 'all-but-a-few') columns when using dplyr::group_by?
Ultimately, I just want to aggregate over all distinct rows after removing a few select columns, but I don't want to have to explicitly list all the grouping columns each time (since those get added and removed somewhat frequently in my analysis).
Example:
> df <- data_frame(a = c(1,1,2,2), b = c("foo", "foo", "bar", "bar"), c = runif(4))
> df
Source: local data frame [4 x 3]
a b c
(dbl) (chr) (dbl)
1 1 foo 0.95460749
2 1 foo 0.05094088
3 2 bar 0.93032589
4 2 bar 0.40081121
Now I want to aggregate by a and b, so I can do this:
> df %>% group_by(a, b) %>% summarize(mean(c))
Source: local data frame [2 x 3]
Groups: a [?]
a b mean(c)
(dbl) (chr) (dbl)
1 1 foo 0.5027742
2 2 bar 0.6655686
Great.
But, I'd really like to be able to do something like just specify not c, similar to dplyr::select(-c):
> df %>% select(-c)
Source: local data frame [4 x 2]
a b
(dbl) (chr)
1 1 foo
2 1 foo
3 2 bar
4 2 bar
But group_by evaluates its arguments as expressions, so -c is treated as the negated values of column c and the equivalent doesn't work:
> df %>% group_by(-c) %>% summarize(mean(c))
Source: local data frame [4 x 2]
-c mean(c)
(dbl) (dbl)
1 -0.95460749 0.95460749
2 -0.93032589 0.93032589
3 -0.40081121 0.40081121
4 -0.05094088 0.05094088
Anyone know if I'm just missing a basic function or shortcut to help me do this quickly?
(Example use case: if df suddenly gains a new column d, I'd like the downstream code to now aggregate over unique combinations of a, b, and d, without me having to explicitly add d to the group_by call.)
In current versions of dplyr, the function group_by_at, together with vars, accomplishes this goal:
df %>% group_by_at(vars(-c)) %>% summarize(mean(c))
# A tibble: 2 x 3
# Groups: a [?]
# a b `mean(c)`
# <dbl> <chr> <dbl>
# 1 1 foo 0.5027742
# 2 2 bar 0.6655686
group_by_at appears to have been introduced in dplyr 0.7.0, in June 2017.
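In dplyr 1.0.0 and later, the scoped verbs such as group_by_at are superseded by across(). A hedged sketch of the same "group by everything except c" idea in that style follows; the seed, the extra column d and the names by_rest/by_rest2 are made up for illustration.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed only for reproducibility
df <- tibble(a = c(1, 1, 2, 2),
             b = c("foo", "foo", "bar", "bar"),
             c = runif(4))

# Group by every column except c, then aggregate c
by_rest <- df %>%
  group_by(across(-c)) %>%
  summarise(mean_c = mean(c), .groups = "drop")

# If a new column d appears, it is picked up as a grouping column automatically
df2 <- df
df2$d <- c("x", "y", "x", "y")
by_rest2 <- df2 %>%
  group_by(across(-c)) %>%
  summarise(mean_c = mean(c), .groups = "drop")
```

The downstream code stays unchanged when d is added; only the set of excluded columns has to be maintained.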
I don't know if I am not searching with the right terms but I can't find a post about this.
I have a df :
df <- data.frame(grouping_letter = c('A', 'A', 'B', 'B', 'C', 'C'), grouping_animal = c('Cat', 'Dog', 'Cat', 'Dog', 'Cat', 'Dog'), value = c(1,2,3,4,5,6))
I want to group by grouping_letter and by grouping_animal. I want to do this using dplyr.
If I did it separately, it would be :
df %>% group_by(grouping_letter) %>% summarise(sum(value))
df %>% group_by(grouping_animal) %>% summarise(sum(value))
Now let's say, I have hundreds of columns I need to group by individually. How can I do this?
I was trying:
results <- NULL
for (i in grouping_columns) {
results[[i]] <- df %>% group_by(df$i) %>% summarize(sum(value))
}
I got a list called results with the output. I am wondering if there is a better way to do this instead of using a for-loop?
We can create an index of the 'grouping' columns (using grep), loop over the index (with lapply), and for each column group by it and get the sum of 'value'. Grouping by position with group_by_at keeps this correct for any number of grouping columns.
library(dplyr)
i1 <- grep('grouping', names(df))
lapply(i1, function(i)
  df %>%
    group_by_at(i) %>%
    summarise(Sumvalue = sum(value)))
#[[1]]
#Source: local data frame [3 x 2]
# grouping_letter Sumvalue
# (fctr) (dbl)
#1 A 3
#2 B 7
#3 C 11
#[[2]]
#Source: local data frame [2 x 2]
# grouping_animal Sumvalue
# (fctr) (dbl)
#1 Cat 9
#2 Dog 12
Or we can do this by converting the dataset from 'wide' to 'long' format, then group by the concerned columns and get the sum of 'value'.
library(tidyr)
gather(df, Var, Group, -value) %>%
group_by(Var, Group) %>%
summarise(Sumvalue = sum(value))
# Var Group Sumvalue
# (chr) (chr) (dbl)
#1 grouping_animal Cat 9
#2 grouping_animal Dog 12
#3 grouping_letter A 3
#4 grouping_letter B 7
#5 grouping_letter C 11
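For reference, the per-column loop can also be sketched with across() and all_of(), assuming dplyr 1.0.0 or later. Looping over column names instead of positions gives a named list, which may be easier to work with when there are hundreds of grouping columns. The object names grouping_columns and results mirror the question; the rest is illustrative.

```r
library(dplyr)

df <- data.frame(grouping_letter = c("A", "A", "B", "B", "C", "C"),
                 grouping_animal = c("Cat", "Dog", "Cat", "Dog", "Cat", "Dog"),
                 value = c(1, 2, 3, 4, 5, 6))

# Names of all columns to group by, one at a time
grouping_columns <- grep("grouping", names(df), value = TRUE)

# One summary table per grouping column, named after that column
results <- lapply(setNames(grouping_columns, grouping_columns), function(col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(Sumvalue = sum(value), .groups = "drop")
})

results$grouping_animal
```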