I have a large data frame (more than 200,000 rows) with dozens of columns. I want to distill it down and summarize how many observations have values that fall within given ranges.
For instance, if I have a data.frame that is similar to this:
set.seed(10)
df <- data.frame( age = runif( n = 1000, min = 0, max = 4000 ),
size = rnorm( n = 1000, mean = 10, sd = 1 ),
shape = rnorm( n = 1000, mean = 1000, sd = 1000) )
I would like to get the number of samples within a series of age ranges, along with the mean and median size and shape of the samples in each of those age brackets.
Something like
summary.df <- data.frame( age.group = seq( 0, 3900, by = 100 ),
number = (number of samples in age bin),
mean = ( mean of data in age bin ) )
etc.
Right now I am doing this very bluntly by creating a new data.frame for each age group.
data.1 <- subset( df, age > 0 & age <= 100 )
data.2 <- subset( df, age > 100 & age <= 200 )
data.3 <- subset( df, age > 200 & age <= 300 )
etc.
and then adding a categorical variable
data.1 <- data.frame( data.1, age.group = "100", count.row = nrow( data.1 ) )
data.2 <- data.frame( data.2, age.group = "200", count.row = nrow( data.2 ) )
data.3 <- data.frame( data.3, age.group = "300", count.row = nrow( data.3 ) )
adding them together
data.big <- rbind( data.1, data.2, data.3 )
and then generating summary stats via dplyr
data.summary <- data.big %>%
group_by( age.group ) %>%
summarize( count.row = mean( count.row ),
mean = mean( size, na.rm = TRUE ),
median = median( size, na.rm = TRUE ) )
How would I go about doing this more efficiently with just dplyr? I think there must be a way but I can't wrap my head around it.
Thanks for any help you can give!
You can make use of cut to divide the data in intervals of 100 and calculate summary statistics for each group.
library(dplyr)
df %>%
group_by(age = cut(age, seq( 0, 4000, by = 100))) %>%
summarise(mean = mean( size, na.rm = TRUE),
median = median( size, na.rm = TRUE))
# age mean median
# <fct> <dbl> <dbl>
# 1 (0,100] 10.0 9.92
# 2 (100,200] 9.88 10.2
# 3 (200,300] 10.1 10.1
# 4 (300,400] 9.83 9.80
# 5 (400,500] 9.95 9.72
# 6 (500,600] 9.68 9.78
# 7 (600,700] 10.2 10.5
# 8 (700,800] 10.2 10.4
# 9 (800,900] 9.68 9.47
#10 (900,1e+03] 9.80 9.81
# … with 30 more rows
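The question also asked for the number of samples per bin and for the statistics of shape as well as size. A minimal sketch of how the same cut() approach could cover both, assuming dplyr 1.0 or later for across(); the names summary.df, age.group and number follow the question, and dig.lab = 4 is an optional choice that keeps labels such as (900,1000] out of scientific notation:

```r
library(dplyr)

# Same example data as in the question
set.seed(10)
df <- data.frame(age   = runif(n = 1000, min = 0, max = 4000),
                 size  = rnorm(n = 1000, mean = 10, sd = 1),
                 shape = rnorm(n = 1000, mean = 1000, sd = 1000))

summary.df <- df %>%
  # cut() assigns each age to a 100-wide interval such as (0,100]
  group_by(age.group = cut(age, breaks = seq(0, 4000, by = 100), dig.lab = 4)) %>%
  summarise(number = n(),  # samples in the age bin
            # mean and median of both size and shape, named e.g. size_mean
            across(c(size, shape),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE))),
            .groups = "drop")
```

Ages outside the breaks would land in an NA group, so on real data it is worth checking that the break sequence covers the full age range.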
I am working with the R programming language.
I have the following dataset about people with their weights and asthma (1 = yes, 0 = no):
library(dplyr)
library(purrr)
library(ggplot2)
set.seed(123)
my_data1 = data.frame(Weight = rnorm(500,100,100), asthma = sample(c(0,1), prob = c(0.7,0.3), replace=TRUE, size= 500))
my_data2 = data.frame(Weight = rnorm(500, 200, 50), asthma = sample(c(0,1), prob = c(0.3,0.7), replace=TRUE, size= 500))
my_data_a = rbind(my_data1, my_data2)
my_data_a$gender = "male"
my_data1 = data.frame(Weight = rnorm(500,100,100), asthma = sample(c(0,1), prob = c(0.7,0.3), replace=TRUE, size= 500))
my_data2 = data.frame(Weight = rnorm(500, 200, 50), asthma = sample(c(0,1), prob = c(0.3,0.7), replace=TRUE, size= 500))
my_data_b = rbind(my_data1, my_data2)
my_data_b$gender = "female"
my_data = rbind(my_data_a, my_data_b)
my_data$id = 1:2000
My Question: For both genders, I would like to "bin" people in this dataset into "n" bins (e.g. n = 30) in ascending order based on the available weight ranges (e.g. min_weight_men : min_weight_men+ 30 = bin_1_men, min_weight_women : min_weight_women+ 30 = bin_1_women, min_weight_men+ 30 : min_weight_men+ 60 = bin_2_men, etc.) - and then find out how many people in each bin, as well as the min weight and max weight for each bin.
My Attempt: I tried to do this with the following code:
Part_1 = my_data %>% group_by(gender) %>%
mutate(bins = cut(Weight , breaks = pretty(Weight , n = (max(Weight)-min(Weight))/30), include.lowest = TRUE)) %>%
mutate(rank = dense_rank(bins)) %>%
mutate(new_bins = paste(rank,"_", gender, sep=""))
Part_2 = Part_1 %>% group_by(gender, bins) %>%
summarize(min_weight = min(Weight), max_weight = max(Weight), count = n())
Part_3 = merge(x=Part_1,y=Part_2, by.x=c("gender","bins"), by.y=c("gender","bins"))
While the results are in the format that I want, I am not sure if I have performed the calculations correctly:
> head(Part_3)
gender bins Weight asthma id rank new_bins min_weight max_weight count
1 female (-100,-50] -75.13021 0 1192 4 4_female -99.91774 -51.53241 23
2 female (-100,-50] -55.78222 0 1382 4 4_female -99.91774 -51.53241 23
3 female (-100,-50] -51.53241 0 1232 4 4_female -99.91774 -51.53241 23
4 female (-100,-50] -71.44877 1 1484 4 4_female -99.91774 -51.53241 23
5 female (-100,-50] -93.99402 1 1160 4 4_female -99.91774 -51.53241 23
6 female (-100,-50] -96.49823 0 1378 4 4_female -99.91774 -51.53241 23
Can someone please help me understand if I have done this correctly?
Thanks!
Note: Just to clarify - suppose weights for men are from 70kg to 150kg. I want bins such as bin_1_men = 70-100kg, bin_2_men = 100-130kg, etc. I am aware that this could result in some bins having significantly different counts.
Instead of doing this in 3 steps, it can be done in a single pipe by using mutate after grouping:
library(dplyr)
my_data %>%
group_by(gender) %>%
mutate(bins = cut(Weight , breaks = pretty(Weight ,
n = (max(Weight)-min(Weight))/30), include.lowest = TRUE),
rank = dense_rank(bins),
new_bins = paste(rank,"_", gender, sep="")) %>%
group_by(gender, bins) %>%
mutate(min_weight = min(Weight), max_weight = max(Weight),
count = n()) %>%
ungroup
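If the bins should be exactly 30 units wide and start at each gender's minimum weight, as the note in the question describes, one option is to compute the bin index arithmetically. pretty() chooses "round" breakpoints, so its bins need not start at the minimum or be 30 wide (the sample output above shows 50-wide bins). A minimal sketch, assuming a small stand-in for my_data and left-closed bins; bin_width, bin_id, bin_name and bin_summary are names introduced here for illustration:

```r
library(dplyr)

# Small stand-in for my_data (assumption: same column names as in the question)
set.seed(123)
my_data <- data.frame(Weight = rnorm(200, mean = 150, sd = 50),
                      gender = rep(c("male", "female"), each = 100))

bin_width <- 30  # width of every bin, in the units of Weight

bin_summary <- my_data %>%
  group_by(gender) %>%
  # bin 1 covers [min, min + 30), bin 2 covers [min + 30, min + 60), ...
  mutate(bin_id   = floor((Weight - min(Weight)) / bin_width) + 1,
         bin_name = paste0("bin_", bin_id, "_", gender)) %>%
  group_by(gender, bin_id, bin_name) %>%
  summarise(min_weight = min(Weight),
            max_weight = max(Weight),
            count      = n(),
            .groups    = "drop")
```

Bins that contain no people simply do not appear in bin_summary, so gaps in bin_id are possible.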
I want to calculate the mean and standard deviation for subgroups of every column in my dataset.
The membership of the subgroups is based on the values in the column of interest and these subgroups are specific to each column of interest.
# Example data
set.seed(1)
library(data.table)
df <- data.frame(baseline = runif(100), `Week0_12` = runif(100), `Week12_24` = runif(100))
So for column baseline, a row may be assigned to a different subgroup than for column Week0_12.
I can of course create these 'subgroup columns' manually for each column and then calculate the statistics for each column by column subgroup:
df$baseline_subgroup <- ifelse(df$baseline < 0.2, "subgroup_1", "subgroup_2")
df <- as.data.table(df)
df[, .(mean = mean(baseline), sd = sd(baseline)), by = baseline_subgroup]
Giving this output:
baseline_subgroup mean sd
1: subgroup_2 0.58059314 0.22670071
2: subgroup_1 0.09793105 0.05317809
Doing this for every column separately is too much repetition, especially given that I have many columns in my actual data.
df$Week0_12_subgroup <- ifelse(df$Week0_12 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week0_12), sd = sd(Week0_12)), by = Week0_12_subgroup]
df$Week12_24_subgroup <- ifelse(df$Week12_24 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week12_24), sd = sd(Week12_24)), by = Week12_24_subgroup]
What is a more elegant approach to do this?
Here's a tidyverse method that gives an easy-to-read and easy-to-plot output:
library(tidyverse)
set.seed(1)
df <- data.frame(baseline = runif(100),
`Week0_12` = runif(100),
`Week12_24` = runif(100))
df2 <- df %>%
summarize(across(everything(), list(mean_subgroup1 = ~mean(.x[.x < 0.2]),
sd_subgroup1 = ~sd(.x[.x < 0.2]),
mean_subgroup2 = ~mean(.x[.x >= 0.2]),
sd_subgroup2 = ~sd(.x[.x >= 0.2])))) %>%
pivot_longer(everything(), names_pattern = '^(.*)_(.*)_(.*$)',
names_to = c('time', 'measure', 'subgroup')) %>%
pivot_wider(names_from = measure, values_from = value)
df2
#> # A tibble: 6 x 4
#> time subgroup mean sd
#> <chr> <chr> <dbl> <dbl>
#> 1 baseline subgroup1 0.0979 0.0532
#> 2 baseline subgroup2 0.581 0.227
#> 3 Week0_12 subgroup1 0.117 0.0558
#> 4 Week0_12 subgroup2 0.594 0.225
#> 5 Week12_24 subgroup1 0.121 0.0472
#> 6 Week12_24 subgroup2 0.545 0.239
ggplot(df2, aes(time, mean, group = subgroup)) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd, color = subgroup),
width = 0.1) +
geom_point() +
theme_minimal(base_size = 16)
Created on 2022-07-14 by the reprex package (v2.0.1)
You could use lapply to apply a subgroup function to each column, i.e.
library(data.table)
# summarise one column, splitting it into subgroups at the 0.2 threshold
subgroup <- function(x) {
  # x is the column that we are interested in
  grp <- ifelse(x < 0.2, "subgroup_1", "subgroup_2")
  data.table(value = x, grp = grp)[, .(mean = mean(value), sd = sd(value)), by = grp]
}
# returns a named list with one summary table per column
# (assumes df holds only the numeric measurement columns)
summaries <- lapply(df, subgroup)
You can create a custom function and apply it using .SD, i.e.
library(data.table)
f1 <- function(x){
i_mean <- mean(x);
i_sd <- sd(x);
list(Avg = i_mean, standard_dev = i_sd)
}
setDT(df)[, unlist(lapply(.SD, f1), recursive = FALSE), by = baseline_subgroup][]
baseline_subgroup baseline.Avg baseline.standard_dev Week0.12.Avg Week0.12.standard_dev Week12.24.Avg Week12.24.standard_dev
1: subgroup_2 0.5950020 0.22556590 0.5332555 0.2651810 0.5467046 0.2912027
2: subgroup_1 0.1006693 0.04957005 0.5947161 0.2645519 0.5137543 0.3213723
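Another way to get subgroups that are specific to each column is to reshape the data to long format first, so that the subgroup label is derived from each value itself. A minimal sketch, assuming the example data from the question and the same 0.2 threshold; long and res are names introduced here, and base R's stack() is used for the reshape:

```r
library(data.table)

# Example data from the question
set.seed(1)
df <- data.frame(baseline = runif(100), Week0_12 = runif(100), Week12_24 = runif(100))

# stack() returns one row per value, with the source column name in `ind`
long <- as.data.table(stack(df))
setnames(long, c("values", "ind"), c("value", "time"))

# the subgroup depends only on the value itself, so it is column-specific
long[, subgroup := ifelse(value < 0.2, "subgroup_1", "subgroup_2")]

# one row per (column, subgroup) combination
res <- long[, .(mean = mean(value), sd = sd(value)), by = .(time, subgroup)]
```

This yields one tidy table rather than one set of wide columns per input column, which tends to be easier to plot or join afterwards.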
I'm not very familiar with dplyr in R, but I want to filter my dataset on certain conditions.
Let's say I have more than 100 attributes in my dataset, and I want to filter on multiple conditions.
Can I write the filter using the positions of the columns instead of their names, as follows:
y = filter(retag, c(4:50) != 8 & c(90:110) == 8)
I've tried code similar to this a few times, but I still haven't gotten the result I want.
I also tried the following, but I'm not sure how to add further conditions to the rowSums call.
retag[rowSums((retag!=8)[,c(4:50)])>=1,]
The only examples I found used the column names instead of their positions.
Or is there any other way to filter by column position? My data is quite large.
You can use filter() with if_all() and select the columns by position. (across() also works inside filter() in older dplyr versions, but it has since been deprecated there.) I didn't have your version of the retag dataframe, so I created my own as an example:
library(dplyr)
set.seed(2000)
retag <- tibble(
col1 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col2 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col3 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col4 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col5 = runif(n = 1000, min = 0, max = 10) %>% round(0)
)
# keep rows where the first, second, and third columns all equal 5 and the fourth column does not equal 5
retag %>%
filter(
if_all(1:3, ~ .x == 5),
if_all(4, ~ .x != 5)
)
if_all() and if_any() were introduced in dplyr 1.0.4 for the purpose of filtering across multiple variables.
library(dplyr)
filter(retag, if_all(X:Y, ~ .x > 10 & .x < 35))
# # A tibble: 5 x 2
# X Y
# <int> <int>
# 1 11 30
# 2 12 31
# 3 13 32
# 4 14 33
# 5 15 34
filter(retag, if_any(X:Y, ~ .x == 2 | .x == 25))
# # A tibble: 2 x 2
# X Y
# <int> <int>
# 1 2 21
# 2 6 25
Data
retag <- structure(list(X = 1:20, Y = 20:39), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Here's a base R option.
This will select rows where there is no 8 in column 4 to 50 and there is at least one 8 in column 90 to 110.
result <- retag[rowSums(retag[4:50] == 8, na.rm = TRUE) == 0 &
rowSums(retag[90:110] == 8,na.rm = TRUE) > 0, ]
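The same condition can also be written with dplyr using column positions. A minimal sketch, assuming dplyr 1.0.4 or later and the same reading of the question as the base R option (no 8 in the first block of columns, at least one 8 in the second); retag_small and the positions 2:5 and 8:10 are scaled-down stand-ins for the real data and for 4:50 and 90:110:

```r
library(dplyr)

# Hypothetical stand-in for retag: 20 rows, 10 integer columns
set.seed(1)
retag_small <- as.data.frame(matrix(sample(1:9, 20 * 10, replace = TRUE), nrow = 20))

result <- retag_small %>%
  filter(if_all(2:5, ~ .x != 8),    # no 8 anywhere in columns 2 to 5
         if_any(8:10, ~ .x == 8))   # at least one 8 in columns 8 to 10
```

Unlike the rowSums() version with na.rm = TRUE, these comparisons propagate NA, so rows with missing values in the selected columns may be dropped by filter().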
I have a data frame with several variables I want to get the means of and a variable I want to group by. Then, I would like to get the proportion of each group's mean to the overall mean.
I have put together the following, but it is clumsy.
How would you go about it using dplyr or data.table? Bonus points for the option to return both the intermediate step (group and overall mean) and the final proportions.
library(tidyverse)
set.seed(1)
Data <- data.frame(
X1 = sample(1:10),
X2 = sample(11:20),
X3 = sample(21:30),
Y = sample(c("yes", "no"), 10, replace = TRUE)
)
groupMeans <- Data %>%
group_by(Y) %>%
summarize_all(mean)
overallMeans <- Data %>%
select(-Y) %>%
summarize_all(mean)
index <- sweep(as.matrix(groupMeans[, -1]), MARGIN = 2, as.matrix(overallMeans), FUN = "/")
Here is one more dplyr solution:
index <- as.data.frame(Data %>%
group_by(Y) %>%
summarise_all(mean) %>%
select(-Y) %>%
rbind(Data %>% select(-Y) %>% summarise_all(mean))%>%
mutate_all(~ . / .[3]))[1:2,]
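Here is one possible dplyr solution that contains everything you want. Note that it computes the overall average as the mean of the group means, which equals the true overall mean only when the groups are the same size (as they are here, with 5 rows each):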
Data %>%
group_by(Y) %>%
summarise(
group_avg_X1 = mean(X1),
group_avg_X2 = mean(X2),
group_avg_X3 = mean(X3)
) %>%
mutate(
overall_avg_X1 = mean(group_avg_X1),
overall_avg_X2 = mean(group_avg_X2),
overall_avg_X3 = mean(group_avg_X3),
proportion_X1 = group_avg_X1 / overall_avg_X1,
proportion_X2 = group_avg_X2 / overall_avg_X2,
proportion_X3 = group_avg_X3 / overall_avg_X3
)
# # A tibble: 2 x 10
# Y group_avg_X1 group_avg_X2 group_avg_X3 overall_avg_X1 overall_avg_X2 overall_avg_X3 proportion_X1
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 no 6.6 14.6 25.8 5.5 15.5 25.5 1.2
# 2 yes 4.4 16.4 25.2 5.5 15.5 25.5 0.8
# # ... with 2 more variables: proportion_X2 <dbl>, proportion_X3 <dbl>
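When the groups differ in size, the overall mean has to come from the ungrouped data. A minimal sketch of that variant, assuming dplyr 1.0 or later for across() and cur_column(); it keeps the intermediate results (group_means, overall_means) alongside the final proportions, and those three object names are introduced here for illustration:

```r
library(dplyr)

# Example data from the question
set.seed(1)
Data <- data.frame(X1 = sample(1:10), X2 = sample(11:20), X3 = sample(21:30),
                   Y  = sample(c("yes", "no"), 10, replace = TRUE))

# intermediate step 1: mean of every X column within each group
group_means <- Data %>%
  group_by(Y) %>%
  summarise(across(X1:X3, mean), .groups = "drop")

# intermediate step 2: overall means from the ungrouped data (named vector)
overall_means <- colMeans(Data[c("X1", "X2", "X3")])

# final step: divide each group mean by the overall mean of the same column
proportions <- group_means %>%
  mutate(across(X1:X3, ~ .x / overall_means[[cur_column()]]))
```

Looking up the divisor by column name with cur_column() avoids relying on the columns being in the same order in both objects.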
Here's a method with data.table:
#data
library(data.table)
set.seed(1)
dt <- data.table(
x1 = sample(1:10),
x2 = sample(11:20),
x3 = sample(21:30),
y = sample(c("yes", "no"), 10, replace = TRUE)
)
# group means
group_means <- dt[ , lapply(.SD, mean), by=y, .SDcols=1:3]
# overall means
overall_means <- dt[ , lapply(.SD, mean), .SDcols=1:3]
# clunky combination (sorry!)
group_means[ , perc_x1 := x1 / overall_means[[1]] ]
group_means[ , perc_x2 := x2 / overall_means[[2]] ]
group_means[ , perc_x3 := x3 / overall_means[[3]] ]
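The three near-identical assignments can probably be collapsed into one by pairing each group-mean column with its overall mean via Map(). A minimal sketch under the same example data; cols and perc_cols are helper names introduced here:

```r
library(data.table)

set.seed(1)
dt <- data.table(
  x1 = sample(1:10),
  x2 = sample(11:20),
  x3 = sample(21:30),
  y  = sample(c("yes", "no"), 10, replace = TRUE)
)

cols      <- c("x1", "x2", "x3")
perc_cols <- paste0("perc_", cols)

group_means   <- dt[, lapply(.SD, mean), by = y, .SDcols = cols]
overall_means <- dt[, lapply(.SD, mean), .SDcols = cols]

# divide each group-mean column by the matching overall mean, by reference
group_means[, (perc_cols) := Map(`/`, .SD, overall_means), .SDcols = cols]
```

Because .SDcols and overall_means are both built from cols, the columns line up by construction.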
I have been scratching my head over this. I have two data frames: df
df <- data.frame(group = 1:3,
age = seq(30, 50, length.out = 3),
income = seq(100, 500, length.out = 3),
assets = seq(500, 800, length.out = 3))
and weights
weights <- data.frame(age = 5, income = 10)
I would like to multiply these two data frames only for the same column names. I tried something like this:
colwise(function(x) {x * weights[names(x)]})(df)
but that obviously didn't work, as colwise does not keep the column name inside the function. I looked at various mapply solutions, but I am unable to come up with an answer.
The resulting data.frame should look like this:
structure(list(group = 1:3, age = c(150, 200, 250), income = c(1000,
3000, 5000), assets = c(500, 650, 800)), .Names = c("group",
"age", "income", "assets"), row.names = c(NA, -3L), class = "data.frame")
group age income assets
1 1 150 1000 500
2 2 200 3000 650
3 3 250 5000 800
sweep() is your friend here, for this particular example. It relies upon the names in df and weights being in the right order, but that can be arranged.
> nams <- names(weights)
> df[, nams] <- sweep(df[, nams], 2, unlist(weights), "*")
> df
group age income assets
1 1 150 1000 500
2 2 200 3000 650
3 3 250 5000 800
If the variable names in weights and df are not in the same order, you can make them so:
> df2 <- data.frame(group = 1:3,
+ age = seq(30, 50, length.out = 3),
+ income = seq(100, 500, length.out = 3),
+ assets = seq(500, 800, length.out = 3))
> nams <- c("age", "income") ## order in df2
> weights2 <- weights[, rev(nams)]
> weights2 ## wrong order compared to df2
income age
1 10 5
> df2[, nams] <- sweep(df2[, nams], 2, unlist(weights2[, nams]), "*")
> df2
group age income assets
1 1 150 1000 500
2 2 200 3000 650
3 3 250 5000 800
In other words we reorder all objects so that age and income are in the right order.
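The reordering can also be avoided by indexing both data frames with the same vector of shared names. A minimal sketch in base R, with weights deliberately given in the "wrong" order; shared is a helper name introduced here:

```r
df <- data.frame(group  = 1:3,
                 age    = seq(30, 50, length.out = 3),
                 income = seq(100, 500, length.out = 3),
                 assets = seq(500, 800, length.out = 3))
weights <- data.frame(income = 10, age = 5)  # order differs from df on purpose

# columns present in both data frames, in the order they appear in df
shared <- intersect(names(df), names(weights))

# indexing both sides by `shared` pairs each column with its own weight
df[shared] <- Map(`*`, df[shared], weights[shared])
```

Columns that have no weight, such as group and assets, are left untouched.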
Someone might have a slick way to do it with plyr, but this is probably the most straightforward way in base R.
shared.names <- intersect(names(df), names(weights))
cols <- sapply(names(df), USE.NAMES=TRUE, simplify=FALSE, FUN=function(name)
if (name %in% shared.names) df[[name]] * weights[[name]] else df[[name]])
data.frame(do.call(cbind, cols))
# group age income assets
# 1 1 150 1000 500
# 2 2 200 3000 650
# 3 3 250 5000 800
Your data:
df <- data.frame(group = 1:3,
age = seq(30, 50, length.out = 3),
income = seq(100, 500, length.out = 3),
assets = seq(500, 800, length.out = 3))
weights <- data.frame(age = 5, income = 10)
The logic:
# Basic name matching looks like this
names(df[names(df) %in% names(weights)])
# [1] "age" "income"
# Use that in `sapply()`
sapply(names(df[names(df) %in% names(weights)]),
function(x) df[[x]] * weights[[x]])
# age income
# [1,] 150 1000
# [2,] 200 3000
# [3,] 250 5000
The implementation:
# Put it all together, replacing the original data
df[names(df) %in% names(weights)] <- sapply(names(df[names(df) %in% names(weights)]),
function(x) df[[x]] * weights[[x]])
The result:
df
# group age income assets
# 1 1 150 1000 500
# 2 2 200 3000 650
# 3 3 250 5000 800
Here is a data.table solution
library(data.table)
DT <- data.table(df)
W <- data.table(weights)
Use mapply (or Map) to calculate the new columns and add them both at once by reference:
DT[, (names(W)) := Map(`*`, .SD, W), .SDcols = names(W)]
You could also do this in a for loop using an index resulting from which(%in%). The above approach is much more efficient, but this is an alternative.
results <- list()
# positions in df of the columns that also appear in weights
idx <- which(names(df) %in% names(weights))
for ( i in seq_along(idx) ) {
  idx1 <- idx[i]
  # position of the same column in weights, matched by name
  idx2 <- match(names(df)[idx1], names(weights))
  results[[i]] <- df[, idx1] * weights[[idx2]]
}
unlist(results)