Create summary statistics based on condition - r

Normally to make summary statistics on a condition I would say
summary(data$how_fast[data$weight == 'Medium' & data$height == 'High'], basic = T)
But what I would like is to output all of the summary statistics for every variable.
summary(data[data$weight == 'Medium' & data$height == 'High'], basic = T)
So we'd get summary statistics not just for $how_fast, but also for other variables like $start_speed or $medals.
Ideally, the result would be stored in a nicely formatted table (I believe this can be done with the rtf package).

by applies a function to subsets of a data frame defined by one or more grouping variables. The output is an array whose dimensions follow your grouping.
dat <- data.frame(A = rep(1:2, each = 10),
B = rep(1:2, times = 10), C = rpois(20, 1))
by(data = dat, INDICES = dat[c("A", "B")], FUN = summary, basic = TRUE)
# A: 1
# B: 1
# A B C
# Min. :1 Min. :1 Min. :0.0
# 1st Qu.:1 1st Qu.:1 1st Qu.:0.0
# Median :1 Median :1 Median :0.0
# Mean :1 Mean :1 Mean :0.6
# 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1.0
# Max. :1 Max. :1 Max. :2.0
# -------------------------------------------------------------
# ...
This summarizes every grouping in the data.frame. To summarize a single subset instead, select the rows first and then use lapply over the columns. Note the vectorized & (not &&) in the row condition; && does not compare element by element, so it would not select the intended rows.
lapply(X = dat[dat$A == 1 & dat$B == 1, ],
FUN = summary, basic = TRUE)
# $A
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
#
# $B
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1 1 1 1 1 1
#
# $C
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0 0.0 0.0 0.6 1.0 2.0
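Since summary() also has a method for data frames, another option is to subset the rows and summarize the whole subset in one call, which returns a single table with one column per variable. The following is a minimal sketch on the toy data from above; the seed and the names sub and tab are illustrative assumptions, and the basic argument is left out because base R's summary() methods do not document it.

```r
# Toy data as in the answer above; the seed is only there for reproducibility
set.seed(1)
dat <- data.frame(A = rep(1:2, each = 10),
                  B = rep(1:2, times = 10),
                  C = rpois(20, 1))

# Keep only the rows matching the condition (vectorized &, plus the comma
# that selects rows and keeps all columns)
sub <- dat[dat$A == 1 & dat$B == 1, ]

# summary() on a data frame returns a table with one column per variable
tab <- summary(sub)
print(tab)
```

The same pattern should carry over to the original question, for example summary(data[data$weight == 'Medium' & data$height == 'High', ]).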

Related

step_BoxCox() with negative data

My understanding is that step_BoxCox() requires a strictly positive variable. However, when I applied the step to data that has some negative values, I got neither an error nor a warning, and the output had no NA values.
I don't know what is wrong: is my understanding flawed, or am I using the wrong syntax?
library(recipes)
library(skimr)
# create dummy data
set.seed(123)
n <- 2e3
x1 <- rpois(n, lambda = 5) # has some zero vals
x2 <- rnorm(n) # has some -ve vals
x3 <- x1 + 10 # is strictly positive
y <- x1 + x2
data <- tibble(x1, x2, x3, y)
# a Box-Cox recipe
rec <- recipe(y ~ ., data = data) %>%
step_BoxCox(all_predictors())
rec
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 3
#>
#> Operations:
#>
#> Box-Cox transformation on all_predictors()
# bake
processed <- rec %>%
prep() %>%
bake(new_data = NULL)
# check output
summary(data)
#> x1 x2 x3 y
#> Min. : 0.000 Min. :-3.047861 Min. :10.00 Min. :-2.048
#> 1st Qu.: 3.000 1st Qu.:-0.654767 1st Qu.:13.00 1st Qu.: 3.349
#> Median : 5.000 Median :-0.007895 Median :15.00 Median : 4.843
#> Mean : 4.981 Mean : 0.011176 Mean :14.98 Mean : 4.993
#> 3rd Qu.: 6.000 3rd Qu.: 0.688699 3rd Qu.:16.00 3rd Qu.: 6.486
#> Max. :14.000 Max. : 3.421095 Max. :24.00 Max. :15.225
summary(processed)
#> x1 x2 x3 y
#> Min. : 0.000 Min. :-3.047861 Min. :2.076 Min. :-2.048
#> 1st Qu.: 3.000 1st Qu.:-0.654767 1st Qu.:2.285 1st Qu.: 3.349
#> Median : 5.000 Median :-0.007895 Median :2.398 Median : 4.843
#> Mean : 4.981 Mean : 0.011176 Mean :2.388 Mean : 4.993
#> 3rd Qu.: 6.000 3rd Qu.: 0.688699 3rd Qu.:2.448 3rd Qu.: 6.486
#> Max. :14.000 Max. : 3.421095 Max. :2.756 Max. :15.225
sum(is.na(processed$x2))
#> [1] 0
skim(processed)
Created on 2021-04-29 by the reprex package (v0.3.0)
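One thing the two summaries above show is that only x3 differs between data and processed, while x1 and x2 come out unchanged. A quick base R check of which predictors are strictly positive lines up with that. This is only a sketch that regenerates the predictors with the same seed; it does not call recipes, and the names predictors and strictly_positive are illustrative.

```r
# Regenerate the predictors exactly as in the question
set.seed(123)
n <- 2e3
x1 <- rpois(n, lambda = 5)  # has some zero values
x2 <- rnorm(n)              # has some negative values
x3 <- x1 + 10               # strictly positive

predictors <- data.frame(x1, x2, x3)

# TRUE only for columns whose values are all greater than zero
strictly_positive <- vapply(predictors, function(v) all(v > 0), logical(1))
print(strictly_positive)
# x1 and x2 are FALSE, x3 is TRUE
```

This is consistent with the step leaving columns that contain non-positive values untransformed rather than failing, although the recipes documentation is the place to confirm that behavior.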

Summarize the same variables from multiple dataframes in one table

I have voter and party-data from several datasets that I further separated into different dataframes and lists to make it comparable. I could just use the summary command on each of them individually then compare manually, but I was wondering whether there was a way to get them all together and into one table?
Here's a sample of what I have:
> summary(eco$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3 4 4 4 4 5
> summary(ecovoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.744 5.000 10.000 26
> summary(lef$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 3.000 3.692 4.000 7.000
> summary(lefvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 2.000 3.000 3.612 5.000 10.000 332
> summary(soc$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 4.000 4.000 4.143 5.000 6.000
> summary(socvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.674 5.000 10.000 346
Is there a way I can summarize these lists (ecovoters, lefvoters, socvoters etc) and the dataframe variables (eco$rilenew, lef$rilenew, soc$rilenew etc) together and have them in one table?
You could put everything into a list and summarize with a small custom function.
L <- list(eco$rilenew, ecovoters, lef$rilenew,
lefvoters, soc$rilenew, socvoters)
t(sapply(L, function(x) {
s <- summary(x)
length(s) <- 7
names(s)[7] <- "NA's"
s[7] <- ifelse(!any(is.na(x)), 0, s[7])
return(s)
}))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 0.9820673 3.3320662 3.958665 3.949512 4.625109 7.229069 0
[2,] -4.8259384 0.5028293 3.220546 3.301452 6.229384 9.585749 26
[3,] -0.3717391 2.3280366 3.009360 3.013908 3.702156 6.584659 0
[4,] -2.6569493 1.6674330 3.069440 3.015325 4.281100 8.808432 332
[5,] -2.3625651 2.4964361 3.886673 3.912009 5.327401 10.349040 0
[6,] -2.4719404 1.3635785 2.790523 2.854812 4.154936 8.491347 346
Data
set.seed(42)
eco <- data.frame(rilenew=rnorm(800, 4, 1))
ecovoters <- rnorm(75, 4, 4)
ecovoters[sample(length(ecovoters), 26)] <- NA
lef <- data.frame(rilenew=rnorm(900, 3, 1))
lefvoters <- rnorm(700, 3, 2)
lefvoters[sample(length(lefvoters), 332)] <- NA
soc <- data.frame(rilenew=rnorm(900, 4, 2))
socvoters <- rnorm(700, 3, 2)
socvoters[sample(length(socvoters), 346)] <- NA
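The rows of the resulting table are only numbered, which makes it hard to tell which summary belongs to which object. A possible variation, sketched here on two of the objects, is to name the list elements so the names carry through as row names. The helper summarise_one and the list names are hypothetical, and the statistics are computed directly with quantile() and mean() rather than by reshaping the summary() output.

```r
# Two of the example objects from the Data section above
set.seed(42)
eco <- data.frame(rilenew = rnorm(800, 4, 1))
ecovoters <- rnorm(75, 4, 4)
ecovoters[sample(length(ecovoters), 26)] <- NA

# Hypothetical helper: six summary statistics plus an explicit NA count
summarise_one <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1),
                na.rm = TRUE, names = FALSE)
  c("Min." = q[1], "1st Qu." = q[2], "Median" = q[3],
    "Mean" = mean(x, na.rm = TRUE),
    "3rd Qu." = q[4], "Max." = q[5],
    "NA's" = sum(is.na(x)))
}

# A named list gives named columns from sapply(), hence named rows after t()
L <- list(eco_rilenew = eco$rilenew, ecovoters = ecovoters)
tab <- t(sapply(L, summarise_one))
print(tab)
```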
You can use map from purrr (part of the tidyverse) to get a list of summaries; if you want the result as a data frame, plyr::ldply can convert the list:
library(purrr)
ll = map(L, summary)
ll
plyr::ldply(ll, rbind)
> ll = map(L, summary)
> ll
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9821 3.3321 3.9587 3.9495 4.6251 7.2291
[[2]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-4.331 1.347 3.726 3.793 6.653 16.845 26
[[3]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3717 2.3360 3.0125 3.0174 3.7022 6.5847
[[4]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.657 1.795 3.039 3.013 4.395 9.942 332
[[5]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.363 2.503 3.909 3.920 5.327 10.349
[[6]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-3.278 1.449 2.732 2.761 4.062 8.171 346
> plyr::ldply(ll, rbind)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 0.9820673 3.332066 3.958665 3.949512 4.625109 7.229069 NA
2 -4.3312551 1.346532 3.725708 3.793431 6.652917 16.844796 26
3 -0.3717391 2.335959 3.012507 3.017438 3.702156 6.584659 NA
4 -2.6569493 1.795307 3.038905 3.012928 4.395338 9.941819 332
5 -2.3625651 2.503324 3.908727 3.920050 5.327401 10.349040 NA
6 -3.2779863 1.448814 2.732515 2.760569 4.061854 8.170793 346

Return value of the 3rd quartile

Is there a way to return the value of the 3rd Qu. that comes up when you do the summary of a vector?
For example:
summary(data$attribute)
Returns:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0002012 0.0218800 0.0454300 0.0707100 0.0961500 0.4845000
You can also use quantile and specify the probability to be 0.75:
quantile(1:10, probs = 0.75)
# 75%
#7.75
If you want to remove the name attribute:
quantile(1:10, probs = 0.75, names = FALSE)
#7.75
You can access elements of the summary by index:
summary(1:10)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 3.25 5.50 5.50 7.75 10.00
summary(1:10)[5]
# 3rd Qu.
# 7.75
Or by name:
summary(1:10)["3rd Qu."]
# 3rd Qu.
# 7.75
We can use unname() to drop names:
unname(summary(1:10)[5])
# [1] 7.75
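The approaches above should agree with each other, since both rely on the default quantile definition. A small sketch that checks this on a toy vector (the variable names are illustrative):

```r
x <- 1:10

# Third quartile taken from the summary object, with the name dropped
q3_from_summary <- as.numeric(summary(x)["3rd Qu."])

# Third quartile computed directly
q3_from_quantile <- quantile(x, probs = 0.75, names = FALSE)

print(q3_from_summary)   # 7.75
print(q3_from_quantile)  # 7.75
```

One practical difference is that the printed output of summary() is rounded for display, so extracting the element or calling quantile() is the safer route when the value is needed for further computation.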

summary to a data frame

Using summary(var) gives me the following output:
PAY_BACK_ORG
Min. : -16.40
1st Qu.: 0.00
Median : 26.40
Mean : 34.37
3rd Qu.: 53.60
Max. :4033.40
I want it as a dataframe which will look like this:
Min -16.40
1st Qu 0.00
Median 26.40
Mean 34.37
3rd Qu 53.60
Max 4033.40
How can I get it into that form?
Like this?
var <- rnorm(100)
x <- summary(var)
data.frame(x=matrix(x),row.names=names(x))
## x
## Min. -2.68300
## 1st Qu. -0.70930
## Median -0.09732
## Mean -0.00809
## 3rd Qu. 0.71550
## Max. 2.58100
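If the statistic names are wanted as a regular column instead of row names, a variation along these lines may help. This is a sketch on made-up data: the values generated for PAY_BACK_ORG and the column names statistic and value are assumptions for illustration only.

```r
# Hypothetical stand-in for the original data
set.seed(1)
df <- data.frame(PAY_BACK_ORG = rnorm(100, mean = 30, sd = 20))

# Summarize the column as a vector so the result is a named numeric object
s <- summary(df$PAY_BACK_ORG)

# One row per statistic: names in one column, values in another
out <- data.frame(statistic = names(s), value = as.numeric(s))
print(out)
```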

Performing same action on multiple tables in for loop

This is a very simplified version of what I'm trying to do. In brief, I create some matrices that I want to perform the same action on, like in a loop. In this example here I want to print the summary for each matrix, but I don't know how to refer to the matrices in a for loop. Any help is much appreciated:
for (i in 1:3){
x <- paste0('df', i)
assign(x, matrix(sample(1:10, 15, replace = TRUE), ncol = 3))
print(summary(eval(x)))
}
Returns (it is evaluating 'df3' as a string):
Length Class Mode
1 character character
Length Class Mode
1 character character
Length Class Mode
1 character character
How do I get it to return the following?
V1 V2 V3
Min. : 1.0 Min. :3.0 Min. : 5
1st Qu.: 5.0 1st Qu.:3.0 1st Qu.: 5
Median : 6.0 Median :4.0 Median : 7
Mean : 5.6 Mean :5.2 Mean : 7
3rd Qu.: 6.0 3rd Qu.:7.0 3rd Qu.: 8
Max. :10.0 Max. :9.0 Max. :10
V1 V2 V3
Min. :2 Min. :1.0 Min. : 4.0
1st Qu.:4 1st Qu.:3.0 1st Qu.: 4.0
Median :7 Median :3.0 Median : 6.0
Mean :6 Mean :3.4 Mean : 6.6
3rd Qu.:8 3rd Qu.:4.0 3rd Qu.: 9.0
Max. :9 Max. :6.0 Max. :10.0
V1 V2 V3
Min. :1.0 Min. : 5.0 Min. :1.0
1st Qu.:2.0 1st Qu.: 6.0 1st Qu.:2.0
Median :6.0 Median : 6.0 Median :3.0
Mean :5.2 Mean : 6.8 Mean :2.4
3rd Qu.:8.0 3rd Qu.: 7.0 3rd Qu.:3.0
Max. :9.0 Max. :10.0 Max. :3.0
Don’t use distinct variables and paste their names – put your objects into a list:
x = Map(function (i) matrix(sample(1:10, 15, replace = TRUE), ncol = 3), 1 : 3)
Then performing a common operation on them is trivial as well:
Map(summary, x)
Map applies a function to each element of a list. It works similarly to lapply and mapply.
Alternatively, if you want to keep the separate variables, I think you can use eval(as.name("df3")) or get("df3") to look up an object by its name.
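Applied to the loop from the question, the get() suggestion could look like the sketch below. The seed is added only for reproducibility; everything else follows the original loop, with eval(x) swapped for get(x).

```r
set.seed(1)
for (i in 1:3) {
  x <- paste0("df", i)
  # Create an object called df1, df2 or df3 holding a 5 x 3 matrix
  assign(x, matrix(sample(1:10, 15, replace = TRUE), ncol = 3))
  # get() looks up the object by name, so summary() sees the matrix
  # rather than the character string "df1"
  print(summary(get(x)))
}
```

Keeping the matrices in a list, as in the first answer, is usually easier to work with afterwards; get() is mainly useful when the separately named objects already exist.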
