I may not be searching with the right terms, but I can't find a post about this.
I have a data frame:
df <- data.frame(grouping_letter = c('A', 'A', 'B', 'B', 'C', 'C'), grouping_animal = c('Cat', 'Dog', 'Cat', 'Dog', 'Cat', 'Dog'), value = c(1,2,3,4,5,6))
I want to group by grouping_letter and by grouping_animal. I want to do this using dplyr.
If I did it separately, it would be :
df %>% group_by(grouping_letter) %>% summarise(sum(value))
df %>% group_by(grouping_animal) %>% summarise(sum(value))
Now let's say, I have hundreds of columns I need to group by individually. How can I do this?
I was trying:
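grouping_columns <- c("grouping_letter", "grouping_animal")
results <- list()
for (i in grouping_columns) {
  # df$i looks for a column literally named "i"; all_of(i) uses the name stored in i
  results[[i]] <- df %>% group_by(across(all_of(i))) %>% summarise(sum(value))
}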
I got a list called results with the output. I am wondering if there is a better way to do this instead of using a for-loop?
We can create an index of the 'grouping' columns (using grep), loop over the index (with lapply), and get the sum of 'value' separately for each grouping column.
library(dplyr)
i1 <- grep('grouping', names(df))
lapply(i1, function(i)
  df[setdiff(seq_along(df), i)] %>%
    group_by_(.dots = names(.)[1]) %>%
    summarise(Sumvalue = sum(value)))
#[[1]]
#Source: local data frame [2 x 2]
# grouping_animal Sumvalue
# (fctr) (dbl)
#1 Cat 9
#2 Dog 12
#[[2]]
#Source: local data frame [3 x 2]
# grouping_letter Sumvalue
# (fctr) (dbl)
#1 A 3
#2 B 7
#3 C 11
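The lapply call above drops the i-th grouping column and then groups by the first remaining column, which covers both columns only because there are exactly two of them; group_by_ is also deprecated in current dplyr. A minimal sketch of the same idea for any number of grouping columns, assuming they all share the 'grouping' prefix and that dplyr 1.0 or later is installed (the names grouping_columns and results follow the question):

```r
library(dplyr)

df <- data.frame(
  grouping_letter = c("A", "A", "B", "B", "C", "C"),
  grouping_animal = c("Cat", "Dog", "Cat", "Dog", "Cat", "Dog"),
  value = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Pick the grouping columns by name (assumes a shared 'grouping' prefix)
grouping_columns <- grep("^grouping", names(df), value = TRUE)

# One summary per grouping column, returned as a list named by column
results <- lapply(
  setNames(grouping_columns, grouping_columns),
  function(col) {
    df %>%
      group_by(across(all_of(col))) %>%
      summarise(Sumvalue = sum(value), .groups = "drop")
  }
)
```

Each element can then be looked up by column name, for example results$grouping_animal.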
Or we can do this by converting the dataset from 'wide' to 'long' format, then group by the concerned columns and get the sum of 'value'.
library(tidyr)
gather(df, Var, Group, -value) %>%
group_by(Var, Group) %>%
summarise(Sumvalue = sum(value))
# Var Group Sumvalue
# (chr) (chr) (dbl)
#1 grouping_animal Cat 9
#2 grouping_animal Dog 12
#3 grouping_letter A 3
#4 grouping_letter B 7
#5 grouping_letter C 11
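In current tidyr, gather is superseded by pivot_longer. A sketch of the same wide-to-long approach with pivot_longer, assuming all grouping columns have the same type (character here) so that they can be stacked into one 'Group' column:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  grouping_letter = c("A", "A", "B", "B", "C", "C"),
  grouping_animal = c("Cat", "Dog", "Cat", "Dog", "Cat", "Dog"),
  value = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Stack every column except 'value' into (Var, Group) pairs, then sum per pair
long_summary <- df %>%
  pivot_longer(-value, names_to = "Var", values_to = "Group") %>%
  group_by(Var, Group) %>%
  summarise(Sumvalue = sum(value), .groups = "drop")
```

This returns one data frame with a row per (column, level) pair rather than a list of separate summaries.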
a1 <- data.frame(id=c(1,1,1,1,2,2,2,3,3),
var=c("A",NA,NA,"B","B","B",NA,NA,NA))
desired_1 <- data.frame(id=c(1,2,3),
A=c(T,NA,NA),
B=c(T,T,NA),
None=c(NA,NA,T))
desired_2 <- data.frame(id=c(1,1,2,3),
type=c("A","B","B","None"))
What is the most efficient way to generate both desired_1 and desired_2 using either data.table or dplyr?
We can group by 'id' and use summarise to return 'None' if all the elements of 'var' are NA, or else the unique non-NA elements of 'var'. This gives the long form (desired_2, with the column named 'var' rather than 'type').
library(dplyr)
a1 %>%
group_by(id) %>%
summarise(var = if(all(is.na(var))) "None" else unique(var[!is.na(var)]) )
# A tibble: 4 x 2
# Groups: id [3]
# id var
# <dbl> <chr>
#1 1 A
#2 1 B
#3 2 B
#4 3 None
Or using data.table
library(data.table)
setDT(a1)[, .(var = if(all(is.na(var))) "None" else unique(var[!is.na(var)])), id]
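The answers above only build the long form. A possible sketch that produces both shapes, assuming dplyr 1.1.0 or later (where reframe replaces summarise for results with several rows per group) and tidyr 1.0 or later; the names long, wide and present are illustrative:

```r
library(dplyr)
library(tidyr)

a1 <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                 var = c("A", NA, NA, "B", "B", "B", NA, NA, NA),
                 stringsAsFactors = FALSE)

# Long form (desired_2): one row per id and distinct non-NA value, "None" if all NA
long <- a1 %>%
  group_by(id) %>%
  reframe(type = if (all(is.na(var))) "None" else unique(var[!is.na(var)]))

# Wide form (desired_1): one TRUE/NA indicator column per type
wide <- long %>%
  mutate(present = TRUE) %>%
  pivot_wider(names_from = type, values_from = present)
```

Combinations that do not occur are left as NA by pivot_wider, which matches the layout of desired_1.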
I have the following dataset, and I want to find the min word for each group. If there is no min word (it is NA), I still want to display the group.
df=data.frame(
key=c("A","A","B","B","C"),
word=c(1,2,3,5,NA))
df %>% group_by(key) %>% slice(which.min(word))
This excludes key=C, word=NA which I would want:
df_out=data.frame(
key=c("A","B","C"),
word=c(1,3,NA))
We can create a logical condition with is.na in filter and return the NA rows as well after doing the grouping by 'key'
library(dplyr)
df %>%
group_by(key) %>%
filter(word == min(word)|is.na(word))
Or using slice. We don't need any if/else condition
df %>%
group_by(key) %>%
slice(which(word ==min(word)|is.na(word)))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Or more compactly
df %>%
group_by(key) %>%
slice(match(min(word), word))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
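NOTE: match returns the index of the first match, and NA matches NA, so an all-NA group still yields row 1.
which.min ignores NA values, so for an all-NA group it returns integer(0) and slice drops the group.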
which.min(c(NA, 1, 3))
#[1] 2
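The difference between the two functions can be checked in base R. The vectors mixed and all_na below are illustrative and not part of the original data; note that for a group mixing NA with numbers, min() without na.rm = TRUE is itself NA, so the match approach would select the NA row there:

```r
mixed  <- c(NA, 1, 3)
all_na <- c(NA_real_, NA_real_)

which.min(mixed)            # 2: the NA is skipped
which.min(all_na)           # integer(0): no index, so slice() returns no row
match(min(mixed), mixed)    # 1: min() is NA, and NA matches the first NA
match(min(all_na), all_na)  # 1: the first (NA) row is kept
```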
We can check the condition with if: when all the values of 'word' in a group are NA we return the first row, otherwise we return the row with the minimum.
library(dplyr)
df %>%
group_by(key)%>%
slice(if(all(is.na(word))) 1L else which.min(word))
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Another option is to arrange the data by word and select the 1st row in each group.
df %>% arrange(key, word) %>% group_by(key) %>% slice(1L)
You can create a modified slice function using the tidyverse package, which returns a row of NAs for each NA index:
library(tidyverse)
slice_uneven <- function(.data, .idx) {
  .data_ <- .data %>% add_row()  # add an extra all-NA row
  .idx_ <- .idx %>% c(NA) %>% replace_na(nrow(.data_))  # point NA indices at the extra row
  .data_[.idx_, ] %>% head(-1) %>% remove_rownames()  # subset, drop the padding row, reset row names
}
slice_uneven(cars, c(1, 2, 3, NA, NA, 3, 2))
You can also arrange by word and use distinct from dplyr to get the desired output.
library(dplyr)
df %>%
arrange(word) %>%
distinct(key, .keep_all = TRUE)
# key word
#1 A 1
#2 B 3
#3 C NA
I am looking for a way to get the last element in each group, omitting NA. The standard dplyr solution is not working, and it is not clear when the issue is going to be fixed.
Can anybody suggest a workaround?
Here is an example of what I am looking for
df <- data.frame(col_1 = c('A', 'A', 'B', 'B'), col_2 = c(1, NA, 3, 3))
So I would like to group by col_1 and return 1 for group A and 3 for group B.
One way to do it is to use na.omit and tail:
df %>% group_by(col_1) %>% summarise(last=tail(na.omit(col_2),1))
col_1 last
<fctr> <dbl>
1 A 1
2 B 3
Or you could filter your dataframe, then slice the last row per group:
df %>% filter(!is.na(col_2)) %>% group_by(col_1) %>% slice(n())
After grouping by 'col_1', arrange on the logical vector !is.na(col_2) so that the NA rows sort first while the other rows keep their original order, then slice the last row of each group
library(dplyr)
df %>%
  group_by(col_1) %>%
  arrange(!is.na(col_2)) %>%
  slice(n())
# A tibble: 2 x 2
# Groups: col_1 [2]
# col_1 col_2
# <fctr> <dbl>
#1 A 1
#2 B 3
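If a group may contain only NA values, tail(na.omit(x), 1) returns a zero-length vector and the filter approach drops the group altogether. A minimal sketch of a helper that keeps such groups; last_non_na is a hypothetical name, and group 'C' is added here only to illustrate the all-NA case:

```r
library(dplyr)

# Hypothetical helper: last non-missing value, or an NA of the same type if none exists
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) x[NA_integer_] else keep[length(keep)]
}

df <- data.frame(col_1 = c("A", "A", "B", "B", "C"),
                 col_2 = c(1, NA, 3, 3, NA),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(col_1) %>%
  summarise(last = last_non_na(col_2))
```

Because the helper always returns exactly one value, summarise yields one row per group.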
Consider the situation where I want to summarize_each a data.frame with mixed column types.
> (temp=data.frame(ID=c(1,1,2,2),gender=c("M","M","F","F"),val1=rnorm(4),val2=rnorm(4)))
ID gender val1 val2
1 1 M -1.7944804 0.5232313
2 1 M 0.3938437 -0.8424086
3 2 F -0.3190777 0.3220580
4 2 F 1.3667340 -0.6031376
> temp%>%group_by(ID)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
ID gender val1 val2
(dbl) (lgl) (dbl) (dbl)
1 1 NA -0.7003184 -0.1595886
2 2 NA 0.5238282 -0.1405398
This doesn't work because mean(gender) doesn't make sense.
Question:
If all my non-numeric columns are characteristic of ID, thus are identical within each ID, can I somehow get summarize_each to return that 'unique' value?
> temp%>%group_by(ID,gender)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
Groups: ID [?]
ID gender val1 val2
(dbl) (fctr) (dbl) (dbl)
1 1 M -0.7003184 -0.1595886
2 2 F 0.5238282 -0.1405398
is the output that I want, but I somehow feel like this is doing unnecessary nested group_by because there really is nothing to group within ID.
One option would be gather/spread from tidyr. Reshape to 'long' format with gather, grouped by 'ID', 'var', get the first element of 'gender' and mean of 'val', spread it back to 'wide' format.
library(tidyr)
library(dplyr)
gather(temp, var, val, val1:val2) %>%
group_by(ID, var) %>%
summarise(gender = first(gender), val = mean(val)) %>%
spread(var, val)
Another option is mutate_if with unique. After grouping by 'ID', we replace each numeric column with its group mean using mutate_if. Because the other columns (i.e. 'gender') remain in the output, we can call unique to keep one row per group.
temp %>%
group_by(ID) %>%
mutate_if(is.numeric, mean) %>%
unique()
# ID gender val1 val2
# <int> <chr> <dbl> <dbl>
#1 1 M -0.7003184 -0.1595886
#2 2 F 0.5238281 -0.1405398
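summarize_each and funs are deprecated in current dplyr. A sketch of the same summary with across, assuming dplyr 1.0 or later and that every non-numeric column is constant within 'ID' (so taking the first value loses nothing). Fixed numbers replace rnorm here so that the result is reproducible:

```r
library(dplyr)

temp <- data.frame(ID = c(1, 1, 2, 2),
                   gender = c("M", "M", "F", "F"),
                   val1 = c(1, 3, 5, 7),
                   val2 = c(2, 4, 6, 8),
                   stringsAsFactors = FALSE)

res <- temp %>%
  group_by(ID) %>%
  summarise(across(!where(is.numeric), first),  # constant within ID, keep one value
            across(where(is.numeric), mean))    # average the numeric columns
```

The grouping column 'ID' is excluded from across automatically, so it is not averaged.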
I have following data.frame (df)
ID1 ID2 Col1 Col2 Col3 Grp
A B 1 3 6 G1
C D 3 5 7 G1
E F 4 5 7 G2
G h 5 6 8 G2
What I would like to achieve is the following:
- group by Grp, easy
- and then summarize so that for each group I sum the columns and create the columns with strings with all ID1s and ID2s
It would be something like this:
df %>%
group_by(Grp) %>%
summarize(ID1s=toString(ID1), ID2s=toString(ID2), Col1=sum(Col1), Col2=sum(Col2), Col3=sum(Col3))
Everything is fine when I know the number of columns (Col1, Col2, Col3). However, I would like it to work for a data frame that always has the columns ID1, ID2 and Grp under those names, plus any number of additional numeric columns with unknown names.
Is there a way to do this in dplyr?
You can overwrite the ID columns first and then group by them as well:
df %>%
  group_by(Grp) %>%
  mutate_each(funs(. %>% unique %>% sort %>% toString), ID1, ID2) %>%
  group_by(ID1, ID2, add = TRUE) %>%
  summarise_each(funs(sum))
# Source: local data frame [2 x 6]
# Groups: Grp, ID1 [?]
#
# Grp ID1 ID2 Col1 Col2 Col3
# (chr) (chr) (chr) (int) (int) (int)
# 1 G1 A, C B, D 4 8 13
# 2 G2 E, G F, h 9 11 15
I think you'll want to uniqify and sort before collapsing to a string, so I've added those steps.
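mutate_each, summarise_each and funs are deprecated in current dplyr. The same result can be sketched in a single summarise with across, assuming dplyr 1.0 or later and that every column other than ID1, ID2 and Grp is numeric; the data frame below is retyped from the question:

```r
library(dplyr)

df <- data.frame(ID1 = c("A", "C", "E", "G"),
                 ID2 = c("B", "D", "F", "h"),
                 Col1 = c(1, 3, 4, 5),
                 Col2 = c(3, 5, 5, 6),
                 Col3 = c(6, 7, 7, 8),
                 Grp = c("G1", "G1", "G2", "G2"),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(Grp) %>%
  summarise(across(c(ID1, ID2), ~ toString(sort(unique(.x)))),  # collapse the IDs
            across(where(is.numeric), sum))                     # sum whatever numeric columns exist
```

Because the numeric columns are selected by type rather than by name, the call does not need to change when more of them are added.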
Using data.table, you could try the following:
library(data.table)
setDT(df)
sd_cols <- 3:(ncol(df) - 1)  # the numeric columns sit between the ID columns and Grp
merge(df[, .(ID1s = toString(ID1), ID2s = toString(ID2)), by = Grp],
      df[, lapply(.SD, sum), by = Grp, .SDcols = sd_cols],
      by = "Grp")