Tidy way to get `summary` output per group? - r

My code frequently uses tapply and summary as shown below:
data <- tibble(
year = rep(2018:2021, 3),
x = runif(length(year))
)
tapply(data$x, data$year, summary)
The output looks like:
$`2018`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3914 0.5696 0.7477 0.6668 0.8045 0.8614
$`2019`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1910 0.2863 0.3816 0.4179 0.5313 0.6809
(etc.)
Is there a way to get such summary-like output in a tibble?
Desired output, using ugly code:
tapply(data$x, data$year, summary)%>%
map(~ as.numeric(round(.x, 2))) %>%
map_dfr(set_names, names(summary(1))) %>%
add_column(year = 2018:2021, .before = 1)
# A tibble: 4 x 7
year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018 0.39 0.570 0.75 0.67 0.8 0.86
2 2019 0.19 0.290 0.38 0.42 0.53 0.68
3 2020 0.01 0.35 0.7 0.55 0.82 0.93
4 2021 0.06 0.15 0.24 0.32 0.45 0.66
I'm hoping that there is a nice combination of dplyr functions to do that better -- my code to get the desired output is hacky.
Of course, I'm hoping not to have to rewrite base R's summary function, as below:
summarise(`Min` = min(x), `1st Qu.` = quantile(x, 0.25), ...)

Here is a concise tidyverse way.
library(dplyr)
library(purrr)
library(tidyr)
data %>%
nest_by(year) %>%
mutate(data = map(data, summary)) %>%
unnest_wider(data)
# # A tibble: 4 x 7
# year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2018 0.105 0.256 0.407 0.307 0.407 0.407
# 2 2019 0.0354 0.205 0.375 0.313 0.452 0.529
# 3 2020 0.272 0.467 0.662 0.546 0.684 0.705
# 4 2021 0.00564 0.107 0.208 0.252 0.375 0.542
You can also just convert the table output from your original line. Note that here it converted year to character, so you would probably want to change that back.
library(purrr)
tapply(data$x, data$year, summary) %>%
map_dfr(c, .id = "year")
# # A tibble: 4 x 7
# year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2018 0.105 0.256 0.407 0.307 0.407 0.407
# 2 2019 0.0354 0.205 0.375 0.313 0.452 0.529
# 3 2020 0.272 0.467 0.662 0.546 0.684 0.705
# 4 2021 0.00564 0.107 0.208 0.252 0.375 0.542

Base R solution
Try with by followed by do.call/rbind.
do.call(rbind, by(data$x, data$year, summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#2018 0.45126737 0.5437956 0.6363238 0.6343376 0.7258727 0.8154215
#2019 0.70134602 0.7425629 0.7837798 0.8227042 0.8833833 0.9829869
#2020 0.02726706 0.3338530 0.6404389 0.4591294 0.6750606 0.7096822
#2021 0.26667973 0.3242120 0.3817443 0.4953048 0.6096173 0.8374904
This returns a "matrix":
class(do.call(rbind, by(data$x, data$year, summary)))
#[1] "matrix" "array"
To get a "data.frame", coerce the return value after, don't use rbind.data.frame, it will loose the column names.
smry <- do.call(rbind, by(data$x, data$year, summary))
as.data.frame(smry)
dplyr solution.
A dplyr and purrr solution could be the following. Note that it doesn't round, it coerces the return value of map_dfr, which is columns of class "table", to numeric instead.
library(purrr)
library(dplyr)
tapply(data$x, data$year, summary)%>%
map_dfr(set_names, names(summary(1))) %>%
mutate(across(everything(), as.numeric))
## A tibble: 4 x 6
# Min. `1st Qu.` Median Mean `3rd Qu.` Max.
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0.451 0.544 0.636 0.634 0.726 0.815
#2 0.701 0.743 0.784 0.823 0.883 0.983
#3 0.0273 0.334 0.640 0.459 0.675 0.710
#4 0.267 0.324 0.382 0.495 0.610 0.837

Another possible tidyverse solution. Same basic idea as Rui's solution above, but a little more verbose since it uses nest() and unnest() before pivoting back to wide data.
library(tidyverse)
data <- tibble(
year = rep(2018:2021, 3),
x = runif(length(year))
)
df_summary <- data %>%
nest_by(year) %>%
mutate(
summary = map(data, ~list(summary(.x))),
df = map(summary, ~data.frame(names = names(.x), values = c(.x))),
) %>%
unnest(df) %>%
select(-data, -summary) %>%
pivot_wider(names_from = names, values_from = values)
year Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018 0.204 0.351 0.498 0.538 0.705 0.912
2 2019 0.548 0.673 0.798 0.767 0.877 0.956
3 2020 0.228 0.416 0.604 0.604 0.792 0.980
4 2021 0.240 0.314 0.388 0.357 0.416 0.443

Related

Loop on several variables with the same suffix in R

I have a database which looks like this but with much more rows and columns.
Several variables (x,y,z) measured at different time (1,2,3).
df <-
tibble(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
y1 = rnorm(10),
y2 = rnorm(10),
y3 = rnorm(10),
z1 = rnorm(10),
z2 = rnorm(10),
z3 = rnorm(10),
)
I am trying to create dummies variables from the variables with the same suffix (measured at the same time) like this:
df <- df %>%
mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>%
mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
mutate(var3= ifelse(x3>0 & (y1<0.5 |z3<0.5),0,1))
I am used to coding in SAS or Stata, so I would like to use a function or a loop because I have many more variables in my database.
But I think I don't have the right approach in R to deal with this.
Thank you very much for your help !
{dplyover} makes this kind of operation easy (disclaimer: I'm the maintainer), given that your desired output contains a typo:
I think you want to use all variables with the same digit (1, 2, 3 and so on) in each calculation:
df <- df %>%
mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>%
mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
mutate(var3= ifelse(x3>0 & (y3<0.5 |z3<0.5),0,1))
If that is the case we can use dplyover::over to apply the same function over a vector. Here we construct the vector with extract_names("[0-9]{1}$") which gets us all ending numbers of our variable names here: c(1,2,3). We can then construct the variable names using a special syntax: .("x{.x}"). Here .x evaluates to the first number in our vector so it would return the object name x1 (not a string!) which we can use inside the function argument of over.
library(dplyr)
library(dplyover) # Only on GitHub: https://github.com/TimTeaFan/dplyover
df %>%
mutate(over(cut_names("^[a-z]{1}"),
~ ifelse(.("x{.x}") > 0 & (.("y{.x}") < 0.5 | .("z{.x}") < 0.5), 0, 1),
.names = "var{x}"
))
#> # A tibble: 10 x 12
#> x1 x2 x3 y1 y2 y3 z1 z2 z3 var1
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.690 0.550 0.911 0.203 -0.111 0.530 -2.09 0.189 0.147 0
#> 2 -0.238 1.32 -0.145 0.744 1.05 -0.448 2.05 -1.04 1.50 1
#> 3 0.888 0.898 -1.46 -1.87 -1.14 1.59 1.91 -0.155 1.46 0
#> 4 -2.78 -1.34 -0.486 -0.0674 0.246 0.141 0.154 1.08 -0.319 1
#> 5 -1.20 0.835 1.28 -1.32 -0.674 0.115 0.362 1.06 0.515 1
#> 6 0.622 -0.713 0.0525 1.79 -0.427 0.819 -1.53 -0.885 0.00237 0
#> 7 -2.54 0.0197 0.942 0.230 -1.37 -1.02 -1.55 -0.721 -1.06 1
#> 8 -0.434 1.97 -0.274 0.848 -0.482 -0.422 0.197 0.497 -0.600 1
#> 9 -0.316 -0.219 0.467 -1.97 -0.718 -0.442 -1.39 -0.877 1.52 1
#> 10 -1.03 0.226 2.04 0.432 -1.02 -0.535 0.954 -1.11 0.804 1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>
Alternatively we can use dplyr::across and use cur_column(), get() and gsub() to alter the name of the column on the fly. To name the new variables correctly we use gsub() in the .names argument of across and wrap it in curly braces {} to evaluate the expression.
library(dplyr)
df %>%
mutate(across(starts_with("x"),
~ {
cur_c <- dplyr::cur_column()
ifelse(.x > 0 & (get(gsub("x","y", cur_c)) < 0.5 | get(gsub("x","z", cur_c)) < 0.5), 0, 1)
},
.names = '{gsub("x", "var", .col)}'
))
#> # A tibble: 10 x 12
#> x1 x2 x3 y1 y2 y3 z1 z2 z3 var1
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.423 -1.42 -1.15 -1.54 1.92 -0.511 -0.739 0.501 0.451 1
#> 2 -0.358 0.164 0.971 -1.61 1.96 -0.675 -0.0188 -1.88 1.63 1
#> 3 -0.453 -0.758 -0.258 -0.449 -0.795 -0.362 -1.81 -0.780 -1.90 1
#> 4 0.855 0.335 -1.36 0.796 -0.674 -1.37 -1.42 -1.03 -0.560 0
#> 5 0.436 -0.0487 -0.639 0.352 -0.325 -0.893 -0.746 0.0548 -0.394 0
#> 6 -0.228 -0.240 -0.854 -0.197 0.884 0.118 -0.0713 1.09 -0.0289 1
#> 7 -0.949 -0.231 0.428 0.290 -0.803 2.15 -1.11 -0.202 -1.21 1
#> 8 1.88 -0.0980 -2.60 -1.86 -0.0258 -0.965 -1.52 -0.539 0.108 0
#> 9 0.221 1.58 -1.46 -0.806 0.749 0.506 1.09 0.523 1.86 0
#> 10 0.0238 -0.389 -0.474 0.512 -0.448 0.178 0.529 1.56 -1.12 1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>
Created on 2022-06-08 by the reprex package (v2.0.1)
You could restructure your data along the principles of tidy data (see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).
Here to a long format and using tidyverse:
library(tidyverse)
df <-
df |>
pivot_longer(everything()) |>
separate(name, c("var", "time"), sep = "(?=[0-9])") |>
pivot_wider(id_col = "time",
names_from = "var",
names_prefix = "var_",
values_from = "value",
values_fn = list) |>
unnest(-time) |>
mutate(new_var = ifelse(var_x > 0 & (var_y < 0.5 | var_z < 0.5), 0, 1))
df
You would probably want to keep the data in a long format, but if you want, you can pivot_wider and get back to the format you started with. E.g.
df |>
pivot_wider(values_from = c(starts_with("var_"), "new_var"),
names_from = "time",
values_fn = list) |>
unnest(everything())
As you suggested, a solution using a loop is definitely possible.
# times as unique non-alphabetical parts of column names
times <- unique(gsub('[[:alpha:]]', '', names(df)))
for (time in times) {
# column names for current time
xyz <- paste0(c('x', 'y', 'z'), time)
df[[paste0('var', time)]] <-
ifelse(df[[xyz[1]]]>0 & (df[[xyz[2]]]<.5 | df[[xyz[3]]]<.5), 0, 1)
}
Another way I can think of is transforming the data into a 3D array (observartion × variable × time) so that you can actually do the computation for all variables at once.
times <- unique(gsub('[[:alpha:]]', '', names(df)))
df.arr <- sapply(c('x', 'y', 'z'),
function(var) as.matrix(df[, paste0(var, times)]),
simplify='array')
new.vars <- ifelse(df.arr[, , 1]>0 & (df.arr[, , 2]<0.5 | df.arr[, , 3]<0.5), 0, 1)
colnames(new.vars) <- paste0('var', times)
cbind(df, new.vars)
Here, sapply creates a matrix from columns of measurings for each variable at different times and stacks them into a 3D array.
If you trust (or ensure) correct ordering of columns in the data frame, instead of using sapply you can create the array just by modifying the object's dimensions. I didn't do any benchmarking but i guess this could be the most computationally efficient solution (if it should matter).
df.arr <- as.matrix(df)
dim(df.arr) <- c(dim(df.arr) / c(1, 3), 3)

How to flip column variable names to row labels? [duplicate]

This question already has answers here:
Transposing a dataframe maintaining the first column as heading
(5 answers)
Transposition of a Tibble Using Pivot_Longer() and Pivot_Wider (Tidyverse) [duplicate]
(1 answer)
Closed 1 year ago.
I have the below tibble.
A tibble: 2 x 6
Trial_Type CT_tib_all CT_lum_all CT_tho_all CT_gps_all CT_vest_all
* <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Pre 0.244 0.209 0.309 0.315 0.310
2 Post 0.254 0.211 0.302 0.313 0.316
I would like to flip the rows and columns so I end up with a 6 x 2 tibble, but I'm not sure of the easiest way to do this. How do I get the column variable names to become row labels and the row labels as column variables (Pre and Post)?
You can use pivot_longer and pivot_wider -
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -Trial_Type) %>%
pivot_wider(names_from = Trial_Type, values_from = value)
# name Pre Post
# <chr> <dbl> <dbl>
#1 CT_tib_all 0.244 0.254
#2 CT_lum_all 0.209 0.211
#3 CT_tho_all 0.309 0.302
#4 CT_gps_all 0.315 0.313
#5 CT_vest_all 0.31 0.316
In data.table -
library(data.table)
dcast(melt(setDT(df), id.vars = 'Trial_Type'),
variable~Trial_Type, vvalue.var = 'value')
t i.e. transpose function in baseR may also be used, in combination with tibble::rownames_to_column and tibble::column_to_rownames
library(tibble)
library(dplyr)
df <- read.table(text = 'Trial_Type CT_tib_all CT_lum_all CT_tho_all CT_gps_all CT_vest_all
Pre 0.244 0.209 0.309 0.315 0.310
Post 0.254 0.211 0.302 0.313 0.316', header = T)
df %>% tibble::column_to_rownames('Trial_Type') %>%
t() %>% as.data.frame() %>%
rownames_to_column('Trial_Type')
#> Trial_Type Pre Post
#> 1 CT_tib_all 0.244 0.254
#> 2 CT_lum_all 0.209 0.211
#> 3 CT_tho_all 0.309 0.302
#> 4 CT_gps_all 0.315 0.313
#> 5 CT_vest_all 0.310 0.316
Created on 2021-05-28 by the reprex package (v2.0.0)
We can use transpose from data.table
data.table::transpose(df, make.names = 'Trial_Type', keep.names = 'name')
# name Pre Post
#1 CT_tib_all 0.244 0.254
#2 CT_lum_all 0.209 0.211
#3 CT_tho_all 0.309 0.302
#4 CT_gps_all 0.315 0.313#
5 CT_vest_all 0.310 0.316
A base R option using reshape
reshape(
cbind(name = df$Trial_Type, stack(df[-1])),
direction = "wide",
idvar = "ind",
timevar = "name"
)
gives
ind values.Pre values.Post
1 CT_tib_all 0.244 0.254
3 CT_lum_all 0.209 0.211
5 CT_tho_all 0.309 0.302
7 CT_gps_all 0.315 0.313
9 CT_vest_all 0.310 0.316

Stack dataframe columns with two distinct suffix into two columns, preferably using tidyverse [duplicate]

This question already has answers here:
tidyverse pivot_longer several sets of columns, but avoid intermediate mutate_wider steps [duplicate]
(3 answers)
Closed 1 year ago.
Suppose I have a list of dataframes, mylist and I want to do the same operation to each dataframes.
Say my dataframes look like this:
set.seed(1)
test.tbl <- tibble(
case1_diff = rnorm(10,0),
case1_avg = rnorm(10,0),
case2_diff = rnorm(10,0),
case2_avg = rnorm(10,0),
case3_diff = rnorm(10,0),
case3_avg = rnorm(10,0),
case4_diff = rnorm(10,0),
case4_avg = rnorm(10,0),
)
> head(test.tbl)
# A tibble: 6 x 8
case1_diff case1_avg case2_diff case2_avg case3_diff case3_avg case4_diff case4_avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.626 1.51 0.919 1.36 -0.165 0.398 2.40 0.476
2 0.184 0.390 0.782 -0.103 -0.253 -0.612 -0.0392 -0.710
3 -0.836 -0.621 0.0746 0.388 0.697 0.341 0.690 0.611
4 1.60 -2.21 -1.99 -0.0538 0.557 -1.13 0.0280 -0.934
5 0.330 1.12 0.620 -1.38 -0.689 1.43 -0.743 -1.25
6 -0.820 -0.0449 -0.0561 -0.415 -0.707 1.98 0.189 0.291
and I wish to stack them into two columns of diff and avg as 40 x 2 dataframe.
Normally, I would just separate it into two objects through select(ends_with("diff")) and select(ends_with("avg")), pivot them, then bind_rows.
However, since my original object is list, I want to do it using map like:
mylist %>%
map(*insertfunction1*) %>%
map(*insertfunction2*)
meaning I would need to do this without separating. I would also need to make sure that diff and avg is correctly paired.
What I have tried so far is
test.tbl %>%
pivot_longer(cols=everything(),
names_to = "metric") %>%
mutate(metric = str_remove(metric,"[0-9]+")) %>%
pivot_wider(id_cols=metric,
values_from=value)
We don't need both pivot_longer and pivot_wider. it can be done within pivot_longer itself by specifying the names_to and the names_sep argument
library(dplyr)
library(tidyr)
test.tbl %>%
pivot_longer(cols = everything(), names_to = c('grp', '.value'),
names_sep = "_") %>%
select(-grp)
-output
# A tibble: 40 x 2
# diff avg
# <dbl> <dbl>
# 1 -0.626 1.51
# 2 0.919 1.36
# 3 -0.165 0.398
# 4 2.40 0.476
# 5 0.184 0.390
# 6 0.782 -0.103
# 7 -0.253 -0.612
# 8 -0.0392 -0.710
# 9 -0.836 -0.621
#10 0.0746 0.388
# … with 30 more rows

Produce a matrix from every row (tidyverse)

I am trying to implement analyses across a posterior of matrices. What I start with is a tibble of k^2 columns, where k is the dimensions of the matrix. The ith row forms the matrix of the ith iteration.
So, for example for a 3x3 matrix, this is:
set.seed(12)
n <- 1000
z1z1 <- rnorm(n, 5, 1)
z2z2 <- rnorm(n, 5, 1)
z3z3 <- rnorm(n, 5, 1)
z1z2 <- rnorm(n, 0, 1)
z1z3 <- rnorm(n, 0, 1)
z2z3 <- rnorm(n, 0, 1)
post3 <- as_tibble(matrix(c(z1z1, z1z2, z1z3,
z1z2, z2z2, z2z3,
z1z3, z2z3, z3z3),
ncol = 9))
post3
Giving:
# A tibble: 1,000 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.52 -0.618 2.96 -0.618 2.48 -0.634 2.96 -0.634 5.98
2 6.58 -0.827 0.0909 -0.827 5.52 -1.84 0.0909 -1.84 6.20
3 4.04 1.48 -1.66 1.48 6.58 0.166 -1.66 0.166 5.58
4 4.08 -1.01 0.809 -1.01 5.49 0.607 0.809 0.607 6.55
5 3.00 0.582 -0.485 0.582 6.20 0.0765 -0.485 0.0765 6.38
6 4.73 0.718 1.97 0.718 4.00 -0.147 1.97 -0.147 4.35
7 4.68 -0.372 0.572 -0.372 4.65 -1.68 0.572 -1.68 3.83
8 4.37 -0.809 0.883 -0.809 3.96 0.985 0.883 0.985 4.97
9 4.89 0.405 0.686 0.405 6.02 0.252 0.686 0.252 6.29
10 5.43 0.124 0.199 0.124 5.75 0.354 0.199 0.354 4.20
# ... with 990 more rows
Where this is the matrix in the first iteration:
k <- sqrt(length(post3))
matrix(post3[1,], nrow = k)
[,1] [,2] [,3]
[1,] 3.519432 -0.618137 2.962622
[2,] -0.618137 2.479522 -0.6338298
[3,] 2.962622 -0.6338298 5.977552
I am then working along this posterior to calculate the dominance of the first eigenvector:
post3 %>%
rowwise %>%
mutate(
pre_eig = list(eigen(matrix(c(V1, V2, V3, V4, V5, V6, V7, V8, V9), nrow = k))),
dom = pre_eig[[1]][1] / sum(pre_eig[[1]][1:k])) %>%
select('dom')
Giving:
# A tibble: 1,000 x 1
dom
<dbl>
1 0.676
2 0.437
3 0.462
4 0.427
5 0.414
6 0.504
7 0.474
8 0.429
9 0.394
10 0.383
# ... with 990 more rows
What I would like to do is make this script versatile so that it can take posteriors for any value of k. The issue I am having is in how to define the matrix without having to hand write all the column names - when applying this to 2000x2000 matrices I don't want to write out V1, V2, V3... V4000000!
I tried a few things (including ...eigen(matrix(c(paste0('V', 1:(k^2))), nrow = k)))..., which I think is not working because it wants V1, V2... rather than "V1", "V2"...) and I all out of ideas. How do I get it to automatically take the column names from the posterior tibble?
I would then be able to use the exact same piece of script for example on post3 <- as_tibble(matrix(c(z1z1, z1z2, z1z2, z2z2), ncol = 4))...
You can avoid naming all the columns explicitly if you gather each row's values into key-value pairs:
library(tidyr)
post3 %>%
# add row ID (so that results can be sorted back into original order)
mutate(row.id = seq(1, n())) %>%
# convert each row to long format, with values sorted from 1st to k^2th column
gather(position, value, -row.id) %>%
mutate(position = as.numeric(gsub("^V", "", position))) %>%
arrange(row.id, position) %>%
select(-position) %>%
# group by row ID & calculate
group_by(row.id) %>%
summarise(pre_eig = list(eigen(matrix(value, nrow = k))[["values"]]),
dom = pre_eig[[1]][1] / sum(pre_eig[[1]][1:k])) %>%
ungroup() %>%
# sort results in original order
arrange(row.id) %>%
select(dom)
The results should be the same as before:
# A tibble: 1,000 x 1
dom
<dbl>
1 0.676
2 0.437
3 0.462
4 0.427
5 0.414
6 0.504
7 0.474
8 0.429
9 0.394
10 0.383
# ... with 990 more rows

Calculate summary statistics of measurements and pivot them to columns in R [duplicate]

This question already has answers here:
Creating a summary statistical table from a data frame
(5 answers)
Closed 4 years ago.
I have a dataframe like this
Step <- c("1","1","4","3","2","2","3","4","4","3","1","3","2","4","3","1","2")
Length <- c(0.1,0.5,0.7,0.8,0.2,0.1,0.3,0.8,0.9,0.15,0.25,0.27,0.28,0.61,0.15,0.37,0.18)
Breadth <- c(0.13,0.35,0.87,0.38,0.52,0.71,0.43,0.8,0.9,0.15,0.45,0.7,0.8,0.11,0.11,0.47,0.28)
Height <- c(0.31,0.35,0.37,0.38,0.32,0.51,0.53,0.48,0.9,0.15,0.35,0.32,0.22,0.11,0.17,0.27,0.38)
Width <- c(0.21,0.25,0.27,0.8,0.2,0.21,0.3,0.28,0.29,0.65,0.55,0.37,0.26,0.31,0.5,0.7,0.8)
df <- data.frame(Step,Length,Breadth,Height,Width)
I am trying to calculate the max, min, mean, median, standard deviation of the measurements grouped by step and then pivot those columns having the measurements as a column.
My desired output is
Measurement max_1 min_1 mean_1 median_1 sd_1 max_2 min_2 mean_2 median_2 sd_2 max_3 min_3 mean_3 median_3 sd_3 max_4 min_4 mean_4 median_4 sd_4
Length 0.50 0.10 0.3050 0.31 0.17058722 0.28 0.10 0.1900 0.190 0.07393691 0.80 0.15 0.334 0.27 0.2693139 0.90 0.61 0.7525 0.750 0.12526638
Breadth 0.47 0.13 0.3500 0.40 0.15577760 0.80 0.28 0.5775 0.615 0.23012680 0.70 0.11 0.354 0.38 0.2383904 0.90 0.11 0.6700 0.835 0.37567720
Height 0.35 0.27 0.3200 0.33 0.03829708 0.51 0.22 0.3575 0.350 0.12120919 0.53 0.15 0.310 0.32 0.1570032 0.90 0.11 0.4650 0.425 0.32888701
Width 0.70 0.21 0.4275 0.40 0.23669601 0.80 0.20 0.3675 0.235 0.28952547 0.80 0.30 0.524 0.50 0.2040343 0.31 0.27 0.2875 0.285 0.01707825
I am trying to do it this way to calculate the summary statistics but its not an efficient way to do it.
library(dplyr)
df1 <- df %>%
group_by(Step) %>%
summarise(Length_Mean = mean(Length),
Breadth_Mean = mean(Breadth),
Height_Mean = mean(Height),
Width_Mean = mean(Width))
How do I accomplish my desired output with minimal code and efficiently? Could someone point me in the right direction?
You can use a "scoped" version of summarize to calculate the same summary
statistics for multiple columns at once. From ?scoped:
The variants suffixed with _if, _at or _all apply an expression
(sometimes several) to all variables within a specified subset. This
subset can contain all variables (_all variants), a vars() selection
(_at variants), or variables selected with a predicate (_if variants).
Here summarize_all could be a good choice; it selects all columns except
for the grouping columns.You can also supply several summary functions to
calculate on each of the variables in the selection.
library(tidyverse)
# Calculate the summary statistics
sums <- df %>%
group_by(Step) %>%
summarize_all(funs(max, min, mean, median, sd))
sums
#> # A tibble: 4 x 21
#> Step Length_max Breadth_max Height_max Width_max Length_min Breadth_min
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.5 0.47 0.35 0.7 0.1 0.13
#> 2 2 0.28 0.8 0.51 0.8 0.1 0.28
#> 3 3 0.8 0.7 0.53 0.8 0.15 0.11
#> 4 4 0.9 0.9 0.9 0.31 0.61 0.11
#> # ... with 14 more variables: Height_min <dbl>, Width_min <dbl>,
#> # Length_mean <dbl>, Breadth_mean <dbl>, Height_mean <dbl>,
#> # Width_mean <dbl>, Length_median <dbl>, Breadth_median <dbl>,
#> # Height_median <dbl>, Width_median <dbl>, Length_sd <dbl>,
#> # Breadth_sd <dbl>, Height_sd <dbl>, Width_sd <dbl>
Now that we have the summary statistics, all that is left to do is to
reshape the data to achieve the desired output. For this, gather, spread,
separate and unite from tidyr come in handy:
sums %>%
# Reshape to long format
gather(col, val, -Step) %>%
# Separate the measurement and the summary statistic
separate(col, into = c("Measurement", "stat")) %>%
arrange(Step) %>%
# Create the desired column headings
unite(col, stat, Step) %>%
# Need to use factors to preserve order
mutate_at(vars(col, Measurement), fct_inorder) %>%
# Reshape back to wide format
spread(col, val)
#> # A tibble: 4 x 21
#> Measurement max_1 min_1 mean_1 median_1 sd_1 max_2 min_2 mean_2
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Length 0.5 0.1 0.305 0.31 0.171 0.28 0.1 0.19
#> 2 Breadth 0.47 0.13 0.35 0.4 0.156 0.8 0.28 0.578
#> 3 Height 0.35 0.27 0.32 0.330 0.0383 0.51 0.22 0.358
#> 4 Width 0.7 0.21 0.428 0.4 0.237 0.8 0.2 0.368
#> # ... with 12 more variables: median_2 <dbl>, sd_2 <dbl>, max_3 <dbl>,
#> # min_3 <dbl>, mean_3 <dbl>, median_3 <dbl>, sd_3 <dbl>, max_4 <dbl>,
#> # min_4 <dbl>, mean_4 <dbl>, median_4 <dbl>, sd_4 <dbl>
Created on 2018-05-24 by the reprex package (v0.2.0).

Resources