I have the following data.frame (df):
ID1 ID2 Col1 Col2 Col3 Grp
A B 1 3 6 G1
C D 3 5 7 G1
E F 4 5 7 G2
G h 5 6 8 G2
What I would like to achieve is the following:
- group by Grp, easy
- then summarize so that, for each group, the numeric columns are summed and all ID1s and ID2s are collapsed into string columns
It would be something like this:
df %>%
group_by(Grp) %>%
summarize(ID1s=toString(ID1), ID2s=toString(ID2), Col1=sum(Col1), Col2=sum(Col2), Col3=sum(Col3))
Everything is fine when I know the number of columns (Col1, Col2, Col3). However, I would like it to work for a data frame that always has the columns ID1, ID2 and Grp, named exactly that, plus any number of additional numeric columns with unknown names.
Is there a way to do this in dplyr?
I would like to be able to implement it so that it would work for a data frame with known and always named the same ID1, ID2, Grp, and any number of additional numeric column with unknown names.
You can overwrite the ID columns first and then group by them as well:
DF %>%
group_by(Grp) %>% mutate_each(funs(. %>% unique %>% sort %>% toString), ID1, ID2) %>%
group_by(ID1, ID2, add=TRUE) %>% summarise_each(funs(sum))
# Source: local data frame [2 x 6]
# Groups: Grp, ID1 [?]
#
# Grp ID1 ID2 Col1 Col2 Col3
# (chr) (chr) (chr) (int) (int) (int)
# 1 G1 A, C B, D 4 8 13
# 2 G2 E, G F, h 9 11 15
I think you'll want to uniqify and sort before collapsing to a string, so I've added those steps.
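mutate_each(), summarise_each() and funs() have since been deprecated in dplyr. A minimal sketch of the same idea with across(), assuming dplyr 1.0.0 or later and the example data from the question (the data frame is rebuilt here only so the snippet is self-contained):

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  ID1  = c("A", "C", "E", "G"),
  ID2  = c("B", "D", "F", "h"),
  Col1 = c(1, 3, 4, 5),
  Col2 = c(3, 5, 5, 6),
  Col3 = c(6, 7, 7, 8),
  Grp  = c("G1", "G1", "G2", "G2")
)

res <- df %>%
  group_by(Grp) %>%
  summarise(
    # collapse the ID columns to a sorted, de-duplicated string
    across(c(ID1, ID2), ~ toString(sort(unique(.x)))),
    # sum every numeric column, whatever it is called
    across(where(is.numeric), sum),
    .groups = "drop"
  )

print(res)
```

Because the numeric columns are selected with where(is.numeric), their names and number do not need to be known in advance.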
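Using data.table, you could try the following:
library(data.table)
setDT(df)
# the numeric columns sit between ID2 and Grp
sd_cols <- 3:(ncol(df) - 1)
# collapse the IDs and sum the numeric columns per group, then join the two results
merge(df[, .(ID1s = toString(ID1), ID2s = toString(ID2)), by = Grp],
      df[, lapply(.SD, sum), by = Grp, .SDcols = sd_cols],
      by = "Grp")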
I've got a dataframe such as this:
df = data.frame(col1=c(1,1,1,2,2,2,3,3,3),
col2=as.factor(c('a','b','b','a','a','a','b','a','b')))
Then I extract all the categories (levels) related to each column:
levels_df = expand.grid(unique(df$col1), unique(df$col2))
colnames(levels_df)=c('col1','col2')
My objective now is to apply a function to the rows belonging to each pair of levels. How can I do that? Something like:
sapply(levels, FUN, dataset=df)
Any other strategy that performs the same task is welcome. The function could be anything, for example a counting function (how many rows belong to each pair of levels), in which case the output would look like this:
In short, I want to subset rows from a dataframe by each pair of levels, so that I can apply a function (such as nrow()) to those rows.
You can skip the levels part, and just use dplyr to group by col1 and col2, then count the rows. Finally, we use complete to add in any combinations that don't appear in our dataset:
library(tidyverse)
df %>%
group_by(col1, col2) %>% # group df by col1 and col2
summarise(n = n()) %>% # make a new column, n, which is the count
complete(col1, col2, fill=list(n=0)) # Fill in missing pairs with 0
The output matches what you expected:
# A tibble: 6 x 3
# Groups: col1 [3]
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
I'm not sure if this specific count example will help you, but here's what you could do in the tidyverse:
library(tidyverse)
df %>%
group_by(col1, col2) %>%
count() %>%
ungroup() %>%
complete(col1, col2, fill = list(n = 0))
which gives:
# A tibble: 6 x 3
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
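Both answers above only count rows. For an arbitrary function per pair of levels, one possible sketch in base R is to split() the data frame on both columns and apply the function to each piece. Here nrow stands in for whatever function is needed, and the names pieces and counts are just illustrative:

```r
df <- data.frame(col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 col2 = as.factor(c('a', 'b', 'b', 'a', 'a', 'a', 'b', 'a', 'b')))

# One data frame per col1/col2 combination; combinations that do not occur
# are kept as zero-row data frames because drop = FALSE is the default.
pieces <- split(df, list(df$col1, df$col2))

# Apply any function to each subset; nrow is only a placeholder.
counts <- sapply(pieces, nrow)
print(counts)
```

The names of the result have the form "col1.col2" (for example "2.b"), so each value can be traced back to its pair of levels.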
When I execute the following code:
data_ikea_wider <- data_ikea_longer %>%
pivot_wider(id_cols = c(Record_no
, Geography
, City
, Country
, City.Country
, Year)
, names_from = Category, values_from = Value)
The columns contain only NAs, as shown in the attached screenshot.
What am I doing wrong? Thanks!
We could use dcast from data.table
library(data.table)
dcast(setDT(dat), col1 ~ col2, value.var = 'val')
Getting NAs from a pivot is not unexpected: it means that not every combination of id values has a value for every new column.
For example,
dat <- data.frame(col1 = c(1,1,2), col2 = c('a', 'b', 'a'), val = 1:3)
dat
# col1 col2 val
# 1 1 a 1
# 2 1 b 2
# 3 2 a 3
If we want to pivot keeping col1 as an id, and turning col2 values into new columns, then it should be apparent that we'll end up with two rows (ids 1 and 2), and two new columns (a and b) to replace col2 and val. Unfortunately, since we only have three rows, the 2 rows × 2 columns = 4 cells cannot all be filled with 3 values, so one will be NA:
pivot_wider(dat, id_cols = col1, names_from = col2, values_from = val)
# # A tibble: 2 x 3
# col1 a b
# <dbl> <int> <int>
# 1 1 1 2
# 2 2 3 NA
If you see this and are surprised, thinking that you actually have the data ... then you should check your data importing and filtering to make sure you did not inadvertently remove it (or it was not provided initially).
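One way to check which id/name combinations are absent from the long data is sketched below, using tidyr::complete() on the small dat example. The name missing_cells is only illustrative, and the check assumes val has no genuine NAs of its own:

```r
library(tidyr)

dat <- data.frame(col1 = c(1, 1, 2), col2 = c("a", "b", "a"), val = 1:3)

# complete() adds a row for every col1/col2 combination; combinations that
# were absent from the data get NA in val.
all_cells <- complete(dat, col1, col2)

# Keep only the combinations that had no row in the original data.
missing_cells <- all_cells[is.na(all_cells$val), ]
print(missing_cells)
```

Each row listed here corresponds to one NA cell in the widened result.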
I have the dataframe below:
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
As you can see I have 4 columns A,B,C,D. What I want to achieve is to create a new dataframe which will include the duplicated pair-rows between A and B, the index of rows where this duplication happens and the column names that make those rows different (C,D,or C and D) in a third column. To make my request more clear I display an example with master2 instead of master which includes only A and B
master2 <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8))
and then with:
library(data.table)
setDT(master2)
master2[master2[, .N, by=names(master2)][ N > 1L ], on=names(master2),
.(N, locs = .(.I)), by=.EACHI]
I get:
# A B N locs
# 1: 2 3 2 3,4
# 2: 4 6 2 7,8
So I want this logic applied to the master dataframe, with an additional column named "Different" holding the names of the columns that make those rows different. If the rows are identical in every column, "Different" should take the value "nothing". If possible, I would also like another column with the positions of the "Different" columns in the original dataframe (3 for C and 4 for D).
The desired output should be something like:
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different position
# <dbl> <dbl> <int> <chr> <chr> <int>
#1 2 3 2 3, 4 C, D 3,4
#2 4 6 2 7, 8 C, D 3,4
If we need the row index, create a sequence column ('rn'), group by the columns of interest, keep only the groups that have more than one row, and summarise to get the number of rows (n()) as well as the pasted row indices for each group. The logic for the 'different' column is not entirely clear from the question. Here is one implementation, using case_when, based on whether different values occur within the same group of 'A' and 'B':
library(tidyverse)
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'))
# A tibble: 2 x 4
# Groups: A [?]
# A B n locs different
# <dbl> <dbl> <int> <chr> <chr>
#1 2 3 2 3, 4 C, D
#2 4 6 2 7, 8 C, D
Update
Based on the comments to include 'position'
master %>%
mutate(rn = row_number()) %>%
group_by(A, B) %>%
filter(n() > 1) %>%
mutate(position = toString(rn[!(duplicated(paste(C, D))|
duplicated(paste(C, D), fromLast = TRUE))])) %>%
summarise(n = n(),
locs = toString(rn),
Different = case_when(n_distinct(C) > 1 & n_distinct(D) > 1 ~ 'C, D',
n_distinct(C) > 1 ~ 'C',
n_distinct(D) > 1 ~ 'D',
TRUE ~ 'Same'),
position = first(position))
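The case_when above hard-codes C and D. A possible generalisation to any number of extra columns is sketched below. It is one reading of the question, not necessarily what the asker intended: "position" is taken to be the column index in master, and the names other_cols, dups and flags are illustrative. It assumes dplyr 1.0.0 or later.

```r
library(dplyr)

master <- data.frame(A = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
                     B = c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8),
                     C = c(5, 2, 5, 7, 7, 5, 7, 9, 7, 8),
                     D = c(1, 2, 5, 3, 7, 5, 9, 6, 7, 0))

# Every column that is not part of the duplicate key
other_cols <- setdiff(names(master), c("A", "B"))

dups <- master %>%
  mutate(rn = row_number()) %>%
  group_by(A, B) %>%
  filter(n() > 1) %>%
  summarise(n = n(),
            locs = toString(rn),
            # one logical flag per extra column: does it vary within the group?
            across(all_of(other_cols), ~ n_distinct(.x) > 1, .names = "diff_{.col}"),
            .groups = "drop")

# Turn the flags into the requested text columns
flags <- as.matrix(dups[paste0("diff_", other_cols)])
dups$different <- apply(flags, 1, function(f)
  if (any(f)) toString(other_cols[f]) else "nothing")
dups$position <- apply(flags, 1, function(f)
  toString(match(other_cols[f], names(master))))

print(dups[c("A", "B", "n", "locs", "different", "position")])
```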
This question is based on the following post with additional requirements (Iterate through columns in dplyr?).
The original code is as follows:
df <- data.frame(col1 = rep(1, 15),
col2 = rep(2, 15),
col3 = rep(3, 15),
group = c(rep("A", 5), rep("B", 5), rep("C", 5)))
for(col in c("col1", "col2", "col3")){
filt.df <- df %>%
filter(group == "A") %>%
select_(.dots = c('group', col))
# do other things, like ggplotting
print(filt.df)
}
My objective is to output a frequency table for each unique COL by GROUP combination. The current example specifies a dplyr filter based on a single GROUP value (A, B, or C). In my case, I want to iterate (loop) through a list of values in GROUP (list <- c("A", "B", "C")) and generate a frequency table for each combination.
The frequency table is based on counts. For Col1 the result would look something like the table below. The example data set is simplified. My real dataset is more complex with multiple 'values' per 'group'. I need to iterate through Col1-Col3 by group.
group value n prop
A 1 5 .1
B 2 5 .1
C 3 5 .1
A better example of the frequency table is here: How to use dplyr to generate a frequency table
I struggled with this for a couple days, and I could have done better with my example. Thanks for the posts. Here is what I ended up doing to solve this. The result is a series of frequency tables for each column and each unique value found in group. I had 3 columns (col1, col2, col3) and 3 unique values in group (A,B,C), 3x3. The result is 9 frequency tables and a frequency table for each group value that is non-sensical. I am sure there is a better way to do this. The output generates some labeling, which is useful.
# Build unique group list
group <- unique(df$group)
# Generate frequency tables via a loop
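iterate_by_group <- function(x) {
  for (i in seq_along(group)) {
    # keep only the rows of the data frame passed in that belong to this group
    filt.df <- x[x$group == group[i], ]
    # freq() is not base R; it has to come from an attached package
    print(lapply(filt.df, freq))
  }
}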
# Run
iterate_by_group(df)
We could gather into long format and then get the frequency (n()) by group
library(tidyverse)
gather(df, value, val, col1:col3) %>%
group_by(group, value = parse_number(value)) %>%
summarise(n = n(), prop = n/nrow(.))
# A tibble: 9 x 4
# Groups: group [?]
# group value n prop
# <fct> <dbl> <int> <dbl>
#1 A 1 5 0.111
#2 A 2 5 0.111
#3 A 3 5 0.111
#4 B 1 5 0.111
#5 B 2 5 0.111
#6 B 3 5 0.111
#7 C 1 5 0.111
#8 C 2 5 0.111
#9 C 3 5 0.111
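In the answer above, prop is each count divided by the total number of rows of the long data (5/45). If the proportion is instead meant to be within each group and column, a sketch along the following lines may be closer. It assumes tidyr 1.0.0 or later for pivot_longer(), and the names freq_tbl and column are illustrative:

```r
library(dplyr)
library(tidyr)

df <- data.frame(col1 = rep(1, 15),
                 col2 = rep(2, 15),
                 col3 = rep(3, 15),
                 group = c(rep("A", 5), rep("B", 5), rep("C", 5)))

freq_tbl <- df %>%
  # one row per original cell, remembering which column it came from
  pivot_longer(col1:col3, names_to = "column", values_to = "value") %>%
  # count each value within every group/column combination
  count(group, column, value) %>%
  # proportion of each value within its group/column combination
  group_by(group, column) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(freq_tbl)
```

With the simplified example data every column holds a single value, so each prop is 1. Real data with several values per group would give proportions that sum to 1 within each group and column.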
Is this what you want?
df %>%
group_by(group) %>%
summarise_all(funs(freq = sum))
I don't know if I am not searching with the right terms but I can't find a post about this.
I have a df :
df <- data.frame(grouping_letter = c('A', 'A', 'B', 'B', 'C', 'C'), grouping_animal = c('Cat', 'Dog', 'Cat', 'Dog', 'Cat', 'Dog'), value = c(1,2,3,4,5,6))
I want to group by grouping_letter and by grouping_animal. I want to do this using dplyr.
If I did it separately, it would be :
df %>% group_by(grouping_letter) %>% summarise(sum(value))
df %>% group_by(grouping_animal) %>% summarise(sum(value))
Now let's say, I have hundreds of columns I need to group by individually. How can I do this?
I was trying:
results <- NULL
for (i in grouping_columns) {
results[[i]] <- df %>% group_by(df$i) %>% summarize(sum(value))
}
I got a list called results with the output. I am wondering if there is a better way to do this instead of using a for-loop?
We can create an index of the 'grouping' columns (using grep), loop over the index (with lapply), and separately get the sum of 'value' after grouping by the column at each index.
library(dplyr)
i1 <- grep('grouping', names(df))
lapply(i1, function(i)
df[setdiff(seq_along(df), i)] %>%
group_by_(.dots=names(.)[1]) %>%
summarise(Sumvalue= sum(value)))
#[[1]]
#Source: local data frame [2 x 2]
# grouping_animal Sumvalue
# (fctr) (dbl)
#1 Cat 9
#2 Dog 12
#[[2]]
#Source: local data frame [3 x 2]
# grouping_letter Sumvalue
# (fctr) (dbl)
#1 A 3
#2 B 7
#3 C 11
Or we can do this by converting the dataset from 'wide' to 'long' format, then group by the concerned columns and get the sum of 'value'.
library(tidyr)
gather(df, Var, Group, -value) %>%
group_by(Var, Group) %>%
summarise(Sumvalue = sum(value))
# Var Group Sumvalue
# (chr) (chr) (dbl)
#1 grouping_animal Cat 9
#2 grouping_animal Dog 12
#3 grouping_letter A 3
#4 grouping_letter B 7
#5 grouping_letter C 11
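The lapply approach above relies on group_by_(), which newer dplyr versions deprecate. A sketch of the same loop using across() and all_of() follows, assuming dplyr 1.0.0 or later. The names grouping_columns and results are taken from the question's own attempt:

```r
library(dplyr)

df <- data.frame(
  grouping_letter = c("A", "A", "B", "B", "C", "C"),
  grouping_animal = c("Cat", "Dog", "Cat", "Dog", "Cat", "Dog"),
  value = c(1, 2, 3, 4, 5, 6)
)

# Names of all columns to group by, one at a time
grouping_columns <- grep("^grouping", names(df), value = TRUE)

# One summary table per grouping column, returned as a named list
results <- lapply(
  setNames(grouping_columns, grouping_columns),
  function(col) {
    df %>%
      group_by(across(all_of(col))) %>%
      summarise(Sumvalue = sum(value), .groups = "drop")
  }
)

print(results)
```

Each element of results is named after its grouping column, so results$grouping_animal gives the per-animal sums.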