I am struggling with one task: I have a data frame where one column is always numeric and the others are always factors. I don't know the index of the numeric column.
My task is to group the data frame by all the factor columns and then find the mean and SD within each group.
I have already done part of the work:
library(dplyr)
library(stats)
df <- data.frame(
  col1 = sample(LETTERS[1:3], 100, replace = TRUE),
  col2 = sample(LETTERS[1:3], 100, replace = TRUE),
  col3 = rnorm(100))
df
find_mean_sd <- function(df){
  numeric <- which(sapply(df, is.numeric) == TRUE)
  columns <- names(df)[-numeric]
  dots <- lapply(columns, as.symbol)
  df %>%
    group_by_(.dots = dots) %>%
    summarise(mean = mean(df[, numeric]), SD = sd(df[, numeric]))
}
find_mean_sd(df)
I am confused by the mean and SD: why are they the same for all groups? I expected to get 9 different means.
In case you want to fix your code, you can try this:
library(dplyr)
find_mean_sd <- function(df){
  numeric <- which(sapply(df, is.numeric) == TRUE)
  columns <- names(df)[-numeric]
  dots <- lapply(columns, as.symbol)
  df %>%
    group_by_(.dots = dots) %>%
    summarise_all(funs(mean, sd)) # here you can summarise by the functions you need
}
find_mean_sd(df)
# A tibble: 9 x 4
# Groups: col1 [3]
col1 col2 mean SD
<fct> <fct> <dbl> <dbl>
1 A A 0.202 1.19
2 A B -0.141 0.950
3 A C 0.585 0.596
4 B A -0.0812 1.20
5 B B -0.380 1.18
6 B C 0.300 0.846
7 C A -0.152 0.705
8 C B 0.136 1.39
9 C C 0.263 0.762
I think the problem was that you referred to df again inside the dplyr chain: indexing df within summarise() computes the grand mean and SD of the whole column and ignores the grouping, which is why every group shows the same values. A minimal fix that keeps your structure is sketched below; that said, A. Suliman's solution is more elegant.
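A minimal sketch of that fix, assuming a single numeric column (as stated in the question) and a recent dplyr (>= 1.0.0), where across() replaces the deprecated group_by_():
library(dplyr)
find_mean_sd <- function(df){
  numeric_name <- names(df)[sapply(df, is.numeric)]  # name of the single numeric column
  factor_names <- setdiff(names(df), numeric_name)   # group by everything else
  df %>%
    group_by(across(all_of(factor_names))) %>%
    summarise(mean = mean(.data[[numeric_name]]),
              SD   = sd(.data[[numeric_name]]),
              .groups = "drop")
}
find_mean_sd(df)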
We can use the dplyr *_if verbs to select the required columns:
library(dplyr)
df %>%
  group_by_if(is.factor) %>%
  summarise_if(is.numeric, list(mean = ~mean(., na.rm = TRUE), SD = ~sd(., na.rm = TRUE)))
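Since the *_if verbs are superseded in dplyr 1.0+, the same idea can be written with across() and where(); a sketch (like the *_if version, it assumes the grouping columns really are factors, so use is.character or a combined predicate if they are plain strings):
library(dplyr)
df %>%
  group_by(across(where(is.factor))) %>%
  summarise(across(where(is.numeric),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        SD   = ~sd(.x, na.rm = TRUE))),
            .groups = "drop")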
Here is my data:
x <- rnorm(0,1, n = 6)
class <- c(1,1,1,2,2,2)
df <- cbind(x, class)
I want to calculate the mean of x by class and have it repeated for all rows (i.e. get a new column with the mean for each class repeated, so that the number of rows of the data frame stays the same).
We can use
library(dplyr)
df <- df %>%
  group_by(class) %>%
  mutate(Mean = mean(x)) %>%
  ungroup()
-output
df
# A tibble: 6 x 3
x class Mean
<dbl> <dbl> <dbl>
1 2.43 1 1.05
2 0.0625 1 1.05
3 0.669 1 1.05
4 0.195 2 -0.0550
5 0.285 2 -0.0550
6 -0.644 2 -0.0550
data
df <- data.frame(x, class)
A base R option using ave
transform(
  df,
  Mean = ave(x, class)
)
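ave() defaults to FUN = mean and recycles each group mean back to the length of the input, which is exactly the "repeated for all rows" behaviour asked for; a tiny illustration with made-up values:
# group means recycled back to the original length (default FUN = mean)
ave(c(1, 2, 3, 10, 20, 30), c("a", "a", "a", "b", "b", "b"))
#> [1]  2  2  2 20 20 20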
I have a data.frame (or tibble or whatever) with an id variable. Often I perform some operation by this id with dplyr::group_by, like so:
data %>%
  group_by(id) %>%
  summarise/mutate/...()
Often I have other non-numeric variables that are unique for each id, such as the project or country to which the id belongs, or other characteristics of the id (such as gender). When I use the summarise call above, these other variables are lost unless I specify either
data %>%
  group_by(id) %>%
  summarise(across(c(project, country, gender, ...), unique), ...)
or
data %>%
  group_by(id, project, country, gender, ...) %>%
  summarise()
Is there a function that detects the variables which are unique for each id, so that one does not have to specify them?
Thank you!
PS: I am asking mainly about dplyr and group_by-related functions, but other approaches like base R or data.table are welcome too.
I did not test it extensively, but it should do the job:
library(dplyr)
myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
                 Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
                 R = rnorm(6))
myData
#> # A tibble: 6 x 3
#> X Y R
#> <dbl> <chr> <dbl>
#> 1 1 A 0.463
#> 2 1 A -0.965
#> 3 2 B -0.403
#> 4 2 B -0.417
#> 5 2 B -2.28
#> 6 3 C 0.423
group_by_id_vars <- function(.data, ...) {
  # group by the prespecified ID variables
  .data <- .data %>% group_by(...)
  # how many groups do these IDs determine?
  ID_groups <- .data %>% n_groups()
  # get the number of groups when the initial grouping variables are combined
  # with each of the other variables
  groupVars <- sapply(substitute(list(...))[-1], deparse) # specified grouping variables
  nms <- names(.data)                                     # all variables in .data
  res <- sapply(nms[!nms %in% groupVars],
                function(x) {
                  .data %>%
                    # important to specify .add = TRUE to combine the variable
                    # with the IDs
                    group_by(across(all_of(x)), .add = TRUE) %>%
                    n_groups()
                })
  # which combinations give an identical count, i.e. the variable does not increase
  # the number of groups in the data when combined with the ID vars
  v <- names(res)[which(res == ID_groups)]
  # group the data accordingly
  .data <- .data %>% ungroup() %>% group_by(across(all_of(c(groupVars, v))))
  return(.data)
}
myData %>%
  group_by_id_vars(X) %>%
  summarise(n = n())
#> `summarise()` regrouping output by 'X' (override with `.groups` argument)
#> # A tibble: 3 x 3
#> # Groups: X [3]
#> X Y n
#> <dbl> <chr> <int>
#> 1 1 A 2
#> 2 2 B 3
#> 3 3 C 1
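If you only need to know which columns are constant within each id (rather than a full group_by() wrapper), here is a shorter base R sketch along the same lines; the function name and the single-string id argument are my own choices:
# names of the columns that take a single value within every id group
id_constant_cols <- function(data, id) {
  other <- setdiff(names(data), id)
  keep <- vapply(other, function(col) {
    all(tapply(data[[col]], data[[id]], function(v) length(unique(v)) == 1L))
  }, logical(1))
  other[keep]
}
id_constant_cols(myData, "X")
#> [1] "Y"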
This is a bit more advanced in application, but what you are looking for are linear combinations of your grouping variables. You can convert these to factors and then use some linear algebra.
You can use findLinearCombos() from caret to locate these. It takes a bit of work to get it all organized how I think you want it though.
Something like this may do the trick. I also have not extensively tested this.
Packages
library(dplyr)
library(caret)
library(purrr)
Function
group_by_lc <- function(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)) {
  # capture the ... and convert to a character vector
  .groups <- rlang::ensyms(...)
  .groups_chr <- map_chr(.groups, rlang::as_name)
  # convert all character and factor variables to integer codes
  d <- .data %>%
    mutate(across(where(is.factor), as.character),
           across(where(is.character), as.factor),
           across(where(is.factor), as.integer))
  # find exact linear combinations among the (now all-numeric) columns
  lc <- caret::findLinearCombos(d)
  # see if any of your grouping variables are involved in a linear combination;
  # when a pair has no overlap, keep the accumulated groups (returning NULL here
  # would discard earlier matches)
  find_group_match <- function(known_groups, lc_pair) {
    if (any(lc_pair %in% known_groups)) unique(c(lc_pair, known_groups)) else known_groups
  }
  # convert column indices to names
  lc_pairs <- map(lc$linearCombos, ~ names(d)[.x])
  # iteratively look for linear combinations of known grouping variables
  lc_cols <- reduce(lc_pairs, find_group_match, .init = .groups_chr)
  # find new grouping variables
  added_groups <- rlang::syms(lc_cols[!(lc_cols %in% .groups_chr)])
  # apply the grouping to your groups and the linear combinations
  group_by(.data, !!!.groups, !!!added_groups, .add = .add, .drop = .drop)
}
Usage
data <- tibble(V = LETTERS[1:10], W = letters[1:10], X = paste0(V, W), Y = rep(LETTERS[1:5], each = 2), Z = runif(10))
group_by_lc(data, W)
Result
You can see how it added in all the other grouping variables. You can rework this in other ways; the key part is building the added_groups list to find them.
# A tibble: 10 x 5
# Groups: W, X, V [10]
V W X Y Z
<chr> <chr> <chr> <chr> <dbl>
1 A a Aa A 0.884
2 B b Bb A 0.133
3 C c Cc B 0.194
4 D d Dd B 0.407
5 E e Ee C 0.256
6 F f Ff C 0.0976
7 G g Gg D 0.635
8 H h Hh D 0.0542
9 I i Ii E 0.0104
10 J j Jj E 0.464
Here is my data. I would like to group it into ten groups, with a minimal sum of var1 within each group and minimal variance of var2 across groups.
data <- data.frame(id=1:100,var1=runif(1000),var2=runif(1000))
I'm not exactly sure how to understand the minimal sum of var1 condition, as the total sum will always be the same. I have assumed it means to have the lowest maximum group sum.
# Create dataset
set.seed(1)
data <- data.frame(var1 = runif(1000), var2 = runif(1000))
# Create 50 splits
# These are numerically balanced by var2 as specified
# meaning that groups will have similar means
library(dplyr)   # needed for the pipe and the verbs used below
data_grouped <- data %>%
  groupdata2::fold(k = 10,
                   num_col = "var2",
                   num_fold_cols = 50)
# Find the split with the lowest maximum sum of a group
data_grouped %>%
  # Note that `gather()` will lead to 50k rows
  # So might need to rethink this step for a bigger dataset
  tidyr::gather(key = "split", value = "group", 3:52) %>%
  dplyr::group_by(split, group) %>%
  # Find sum per group per split
  dplyr::summarise(var1_sum = sum(var1), .groups = "drop_last") %>%
  # Find max group sum per split
  dplyr::summarise(var1_max = max(var1_sum)) %>%
  # Find split with lowest max group sum
  dplyr::filter(var1_max == min(var1_max))
> # A tibble: 1 x 2
> split var1_max
> <chr> <dbl>
> 1 .folds_19 51.9
# Assign best grouping factor to original data frame
data$group <- data_grouped$.folds_19
# Check the means of var2
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(var2_mean = mean(var2))
> # A tibble: 10 x 2
> group var2_mean
> <fct> <dbl>
> 1 1 0.491
> 2 2 0.489
> 3 3 0.490
> 4 4 0.490
> 5 5 0.492
> 6 6 0.490
> 7 7 0.489
> 8 8 0.491
> 9 9 0.491
> 10 10 0.490
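As a quick sanity check, you can also inspect the per-group sums of var1 for the chosen split (a short sketch reusing the same objects as above):
# per-group sums of var1 for the selected grouping
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(var1_sum = sum(var1))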
Example data:
set.seed(99999)
library(dplyr)
Group <- c(rep("A",4),rep("B",4),rep("C",4))
Value <- abs(rnorm(12))
df <- data.frame(Group,Value)
df$Group <- as.character(df$Group)
I would like to filter each group, i.e. A,B,C, based on a different value in the column "Value". In dplyr it would look like this:
df2 <- df %>%
filter(Group=="A" & Value>=0.2 |
Group=="B" & Value>=0.1 |
Group=="C" & Value>=0.6)
However, my real df is much larger with >100 groups, and each one has a unique threshold value to filter by. Therefore I have a separate df3, which only has the thresholds per group:
df3 <- data.frame(Group=c("A","B","C"),Value=c(0.2,0.1,0.6))
How could I filter the df with the respective threshold values in df3 per corresponding group?
A dplyr solution uses group_by(Group) and inner_join() to merge the threshold values by group, and then uses filter() to retain rows where Value exceeds threshold.
set.seed(99999)
library(dplyr)
Group <- c(rep("A",4),rep("B",4),rep("C",4))
Value <- abs(rnorm(12))
df <- data.frame(Group,Value,stringsAsFactors = FALSE)
df$Group <- as.character(df$Group)
df3 <- data.frame(Group=c("A","B","C"),threshold=c(0.2,0.1,0.6),stringsAsFactors = FALSE)
df %>%
  group_by(Group) %>%
  inner_join(df3) %>%
  filter(Value > threshold)
Note that I changed the column name in df3 from Value to threshold to avoid a column name conflict in inner_join().
...and the output:
Joining, by = "Group"
# A tibble: 9 x 3
# Groups: Group [3]
Group Value threshold
<chr> <dbl> <dbl>
1 A 0.426 0.2
2 A 0.283 0.2
3 A 0.899 0.2
4 A 0.707 0.2
5 B 2.09 0.1
6 B 1.64 0.1
7 B 0.540 0.1
8 B 0.604 0.1
9 C 0.956 0.6
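If you would rather keep df3's column named Value, the suffix argument of inner_join() can disambiguate the clashing names instead of renaming up front; a sketch, where Value_threshold is the name the suffix produces:
library(dplyr)
df3 <- data.frame(Group = c("A", "B", "C"), Value = c(0.2, 0.1, 0.6))
df %>%
  inner_join(df3, by = "Group", suffix = c("", "_threshold")) %>%
  filter(Value > Value_threshold)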
I would use the old split - apply - combine method:
library(dplyr)
df %>%
  split(df$Group) %>%
  lapply(filter, Value > df3$Value[df3$Group == Group[1]]) %>%
  bind_rows()
#> Group Value
#> 1 A 0.4255127
#> 2 A 0.2829203
#> 3 A 0.8986773
#> 4 A 0.7065184
#> 5 B 2.0916699
#> 6 B 1.6356643
#> 7 B 0.5401934
#> 8 B 0.6037287
#> 9 C 0.9558980
Say I have a data frame like this:
group1 <- c('a','a','a','a','a','a','b','b','b','b','b','b','b','b')
group2 <- c('x','y','x','y','x','y','x','y','x','y','x','y','x','y')
value <- round(runif(14, min=0, max=1), digits = 2)
df1 <- as.data.frame(cbind(group1,group2,value))
df1$value <- as.numeric(df1$value)
It is easy to get a new data frame with only the maximum values of each group, by using the dplyr package and summarise function:
df2 <- summarise(group_by(df1,group1),max_v = max(value))
But what I want is a new data frame with the 3 maximum values of each group, doing something like that:
df2 <- summarise(group_by(df1,group1),max_v = max(value),max2_v = secondmax(value),max3_v = thirdmax(value))
Is there a way to do that without using the sort function?
We can use an arrange/slice/spread approach to get this:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(group1) %>%
  arrange(desc(value)) %>%
  slice(seq_len(3)) %>%
  mutate(Max = paste0("max_", row_number())) %>%
  select(-group2) %>%
  spread(Max, value)
# A tibble: 2 x 4
# Groups: group1 [2]
# group1 max_1 max_2 max_3
#* <fctr> <dbl> <dbl> <dbl>
#1 a 0.84 0.69 0.41
#2 b 0.89 0.72 0.54
data
df1 <- data.frame(group1,group2,value)
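With dplyr >= 1.0.0 the arrange()/slice() steps can also be written with slice_max(), and tidyr::pivot_wider() replaces the retired spread(); a sketch of the same idea:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(group1) %>%
  slice_max(value, n = 3, with_ties = FALSE) %>%   # top 3 values per group, largest first
  mutate(Max = paste0("max_", row_number())) %>%
  select(-group2) %>%
  pivot_wider(names_from = Max, values_from = value)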