Summary statistics in R, measures as headers

I want to create summary statistics for my dataset. I have searched but haven't found anything that matches what I want. I want the columns listed vertically, with the statistical measures as headings. Here is how I want it to look:
Column      Mean            Standard deviation   25th perc.   Median   75th perc.
Column 1    Mean column 1   Std column 1         ...          ...      ...
Column 2    Mean column 2   ...                  ...          ...      ...
Etc         ...             ...                  ...          ...      ...
How do I do this? Thankful for any help I can get! :)
If there is a specific function that also allows some formatting/styling, some info about that would be appreciated too, but the main point is that the output should look as described. :)

You may want to check out the summarytools package. It has built-in support for both Markdown and HTML output.
library(summarytools)
descr(iris,
      stats = c("mean", "sd", "q1", "med", "q3"),
      transpose = TRUE)
## Non-numerical variable(s) ignored: Species
## Descriptive Statistics
## iris
## N: 150
##
##                       Mean   Std.Dev     Q1   Median     Q3
## ----------------- ------ --------- ------ -------- ------
##      Petal.Length   3.76      1.77   1.60     4.35   5.10
##       Petal.Width   1.20      0.76   0.30     1.30   1.80
##      Sepal.Length   5.84      0.83   5.10     5.80   6.40
##       Sepal.Width   3.06      0.44   2.80     3.00   3.30

We could also use descr() from the collapse package:
library(collapse)
descr(iris)

Your question is missing some important details, but I think you want something like this.
Example with just the numerical variables of the iris dataset:
iris_numerical <- iris[, 1:4]
Calculate the statistics:
new_df <- sapply(iris_numerical, function(x) {
  c(mean = mean(x), SD = sd(x), Q1 = quantile(x, 0.25),
    median = median(x), Q3 = quantile(x, 0.75))
})
This gives you the summary statistics column-wise:
> new_df
       Sepal.Length Sepal.Width Petal.Length Petal.Width
mean      5.8433333   3.0573333     3.758000   1.1993333
SD        0.8280661   0.4358663     1.765298   0.7622377
Q1.25%    5.1000000   2.8000000     1.600000   0.3000000
median    5.8000000   3.0000000     4.350000   1.3000000
Q3.75%    6.4000000   3.3000000     5.100000   1.8000000
Then create the final data frame in the desired format by transposing, so that the original column names become the rows:
new_df <- data.frame(column = colnames(new_df), t(new_df))
> new_df
                   column     mean        SD Q1.25. median Q3.75.
Sepal.Length Sepal.Length 5.843333 0.8280661    5.1   5.80    6.4
Sepal.Width   Sepal.Width 3.057333 0.4358663    2.8   3.00    3.3
Petal.Length Petal.Length 3.758000 1.7652982    1.6   4.35    5.1
Petal.Width   Petal.Width 1.199333 0.7622377    0.3   1.30    1.8
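The `Q1.25%` and `Q3.75%` labels appear because quantile() names its result. As a minimal base-R sketch (assuming numeric columns without missing values; the object name `stats_tbl` is only illustrative), the names can be dropped and the table transposed in one step:

```r
# Numeric columns only (Species is a factor)
iris_numerical <- iris[, 1:4]

# One row per column, one column per statistic
stats_tbl <- t(sapply(iris_numerical, function(x) {
  c(mean   = mean(x),
    SD     = sd(x),
    Q1     = quantile(x, 0.25, names = FALSE),  # names = FALSE avoids "Q1.25%"
    median = median(x),
    Q3     = quantile(x, 0.75, names = FALSE))
}))

# Round for display
round(stats_tbl, 2)
```

The result is a numeric matrix with the variables as row names, so it can be passed on to a table formatter if styling is needed.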

Related

How to obtain a statistic (Cohen's d) for more than two factors

I want to compute Cohen's d for more than one factor at the same time.
So for example, in the iris dataset we could compute Cohen's d for sepal length between virginica and versicolor very easily with:
virginica <- subset(iris, Species == "virginica")
versicolor <- subset(iris, Species == "versicolor")
cohen.d(virginica$Sepal.Length, versicolor$Sepal.Length)
Of course we could repeat this process for the remaining species.
In summary, what I want is the measure for all the factors against one, not all the factors against each other. So it would be like computing several Cohen's d values, but in just one step.
In this case, versicolor vs. setosa and versicolor vs. virginica.
I don't think this is the answer you want, but it is an answer that will work.
Instead of trying to make the function fit your data, make your data fit the function.
First, find all possible combinations of groups. I know you used the iris data, but it's likely that you're just using that as an example.
library(tidyverse)
library(psych)
library(RcppAlgos)
# find unique pairs
iS = RcppAlgos::comboGeneral(unique(iris$Species),
                             2, FALSE) # all unique combinations of 2 options
Then use these possible groups and get the effect size between each of them.
res = map(1:nrow(iS),
          .f = function(x){
            filter(iris, Species %in% c(iS[x, 1], iS[x, 2])) %>%
              cohen.d(., group = "Species")
          })
names(res) <- paste0(iS[,1], "-",iS[,2])
res
The effect sizes between each group:
# [[1]]
# Call: cohen.d(x = ., group = "Species")
# Cohen d statistic of difference between two means
# lower effect upper
# Sepal.Length 1.55 2.13 2.69
# Sepal.Width -2.45 -1.91 -1.36
# Petal.Length 6.35 7.98 9.57
# Petal.Width 5.47 6.89 8.27
#
# Multivariate (Mahalanobis) distance between groups
# [1] 10
# r equivalent of difference between two means
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 0.73 -0.69 0.97 0.96
#
# [[2]]
# Call: cohen.d(x = ., group = "Species")
# Cohen d statistic of difference between two means
# lower effect upper
# Sepal.Length 2.38 3.11 3.83
# Sepal.Width -1.77 -1.30 -0.83
# Petal.Length 8.01 10.10 12.08
# Petal.Width 6.89 8.64 10.36
#
# Multivariate (Mahalanobis) distance between groups
# [1] 14
# r equivalent of difference between two means
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 0.84 -0.55 0.98 0.97
#
# [[3]]
# Call: cohen.d(x = ., group = "Species")
# Cohen d statistic of difference between two means
# lower effect upper
# Sepal.Length 0.68 1.14 1.58
# Sepal.Width 0.23 0.65 1.06
# Petal.Length 1.90 2.55 3.18
# Petal.Width 2.25 2.95 3.65
#
# Multivariate (Mahalanobis) distance between groups
# [1] 3.8
# r equivalent of difference between two means
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 0.49 0.31 0.79 0.83
#
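The question asks for every group against one reference group rather than all pairs. A base-R sketch of that case follows. It assumes the pooled-standard-deviation definition of Cohen's d, so the values can differ slightly from those reported by psych or effsize; `cohens_d`, `reference`, and `d_vs_ref` are illustrative names, not part of any package.

```r
# Cohen's d with a pooled standard deviation (one common definition;
# packages such as psych or effsize may use slightly different formulas)
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

reference <- "versicolor"  # the group everything else is compared against
others <- setdiff(levels(iris$Species), reference)

ref_values <- iris$Sepal.Length[iris$Species == reference]

# One d per remaining species, each versus the reference group
d_vs_ref <- sapply(others, function(sp) {
  cohens_d(iris$Sepal.Length[iris$Species == sp], ref_values)
})
d_vs_ref
```

A negative value means the species has a smaller mean sepal length than the reference group. Wrapping the last step in a loop over column names would extend it to the other measurements.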

How to apply multiple functions to a dataframe in R

I am working with a relatively large dataset with many attributes. Is there a simple way (no extra packages) to get output like this, with the attribute names as row names and the functions as column names?
Min Max
Sepal.Length 4.3 7.9
Sepal.Width 2.0 4.4
Petal.Length 1.0 6.9
Petal.Width 0.1 2.5
Currently, when I sapply multiple functions to the data, the output is this:
Sepal.Length Sepal.Width Petal.Length Petal.Width
min 4.300000 2.000000 1.000 0.100000
max 7.900000 4.400000 6.900 2.500000
mean 5.843333 3.057333 3.758 1.199333
However, with a large number of attributes this output will be too wide to fit in a PDF when knitting.
You can use the t() function to transpose the table:
transposed_table <- t(normal_table)
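As a self-contained sketch of the whole round trip on iris (the function list is only an example; `normal_table` stands for the sapply() result mentioned in the question):

```r
# Apply several functions to every numeric column; attributes end up as columns
normal_table <- sapply(iris[1:4], function(x) c(Min = min(x), Max = max(x)))

# Transpose so attributes become rows and functions become columns
transposed_table <- t(normal_table)
transposed_table
```

Because the transposed table grows downwards rather than sideways as attributes are added, it is less likely to overflow the page width.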

Use first part of value as header in R

I have a data frame generated by t(summary(raw_data())):
Original data frame
However, each cell has a prefix such as Max, Min, Mean, etc. I would like to remove that prefix from each cell and use it as the column header instead. Is there an easy way to do this in R so that the data frame looks like this:
Desired data frame
Also, variables 3 & 18 are factors; I'm less concerned about those.
We can loop through the columns of the dataset, get the summary and then rbind the output
do.call(rbind, lapply(raw_data, summary))
Using a reproducible example
do.call(rbind, lapply(iris[1:4], summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#Sepal.Length 4.3 5.1 5.80 5.843 6.4 7.9
#Sepal.Width 2.0 2.8 3.00 3.057 3.3 4.4
#Petal.Length 1.0 1.6 4.35 3.758 5.1 6.9
#Petal.Width 0.1 0.3 1.30 1.199 1.8 2.5
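Since the question mentions factor columns, here is a hedged variant that restricts the summary to numeric columns first. It uses iris as a stand-in for the asker's data, which is not shown, and the names `raw`, `is_num`, and `summary_tbl` are illustrative.

```r
raw <- iris  # stand-in for the asker's data, which is not shown

# Keep only numeric columns; factors such as Species have a different summary
is_num <- vapply(raw, is.numeric, logical(1))

# One summary per numeric column, stacked into a matrix
summary_tbl <- do.call(rbind, lapply(raw[is_num], summary))
summary_tbl
```

Note that summary() adds an extra `NA's` entry for columns containing missing values, so columns with and without NAs would not stack cleanly this way.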

Regression in R with loops

I need to run a simple regression using lm() in R. It's simple because I have only one independent variable. The catch is that I need to test this independent variable against a number of dependent variables, which are columns in a data frame.
So basically I have one common X and numerous Y's, for which I need to extract the intercept and slope and store them all in a data frame.
In Excel this is possible with the INTERCEPT and SLOPE functions and then dragging across columns. I need something in R that would do the same. I could of course run separate regressions, but the requirement is that I run all of them in one loop and store the intercept and slope estimates together for each.
I'm still learning R and any help on this would be great. Thanks :)
The lmList function in package nlme was designed for this.
Let's use the iris dataset as an example:
DF <- iris[, 1:4]
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
#...
First we have to reshape it. In this example we want Sepal.Length as the dependent variable and each of the other columns, in turn, as the predictor.
library(reshape2)
DF <- melt(DF, id.vars = "Sepal.Length")
# Sepal.Length variable value
#1 5.1 Sepal.Width 3.5
#2 4.9 Sepal.Width 3.0
#3 4.7 Sepal.Width 3.2
#4 4.6 Sepal.Width 3.1
#5 5.0 Sepal.Width 3.6
#6 5.4 Sepal.Width 3.9
#...
Now we can do the fits.
library(nlme)
mods <- lmList(Sepal.Length ~ value | variable,
               data = DF, pool = FALSE)
We can now extract intercept and slope for each model.
coef(mods)
# (Intercept) value
#Sepal.Width 6.526223 -0.2233611
#Petal.Length 4.306603 0.4089223
#Petal.Width 4.777629 0.8885803
And get the usual t-table:
summary(mods)
# Call:
# Model: Sepal.Length ~ value | variable
# Data: DF
#
# Coefficients:
# (Intercept)
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width 6.526223 0.47889634 13.62763 6.469702e-28
# Petal.Length 4.306603 0.07838896 54.93890 2.426713e-100
# Petal.Width 4.777629 0.07293476 65.50552 3.340431e-111
# value
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width -0.2233611 0.15508093 -1.440287 1.518983e-01
# Petal.Length 0.4089223 0.01889134 21.646019 1.038667e-47
# Petal.Width 0.8885803 0.05137355 17.296454 2.325498e-37
Or the R-squared values:
summary(mods)$r.squared
#[1] 0.01382265 0.75995465 0.66902769
However, if you need something more efficient, you can use package data.table together with lm's workhorse lm.fit:
library(data.table)
setDT(DF)
DF[, setNames(as.list(lm.fit(cbind(1, value),
                             Sepal.Length)[["coefficients"]]),
              c("intercept", "slope")), by = variable]
# variable intercept slope
#1: Sepal.Width 6.526223 -0.2233611
#2: Petal.Length 4.306603 0.4089223
#3: Petal.Width 4.777629 0.8885803
And of course the R-squared values of these models are just the squared Pearson correlation coefficients:
DF[, .(r.sq = cor(Sepal.Length, value)^2), by = variable]
# variable r.sq
#1: Sepal.Width 0.01382265
#2: Petal.Length 0.75995465
#3: Petal.Width 0.66902769
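The example above fits one response against several predictors. For the layout described in the question (one common X, many Y's), a base-R sketch is to pass a matrix of responses to lm(), which fits one regression per column. Using iris with Sepal.Length as the common X is an assumption made only for illustration, as are the names `x_name`, `y_names`, and `result`.

```r
# One common predictor (X) and several responses (Y's)
x_name  <- "Sepal.Length"
y_names <- c("Sepal.Width", "Petal.Length", "Petal.Width")

# A matrix on the left-hand side fits one regression per column
fit <- lm(as.matrix(iris[, y_names]) ~ iris[[x_name]])

# coef() returns a 2 x k matrix; transpose to one row per response
coefs <- t(coef(fit))
result <- data.frame(response  = rownames(coefs),
                     intercept = coefs[, 1],
                     slope     = coefs[, 2],
                     row.names = NULL)
result
```

Each row of `result` holds the intercept and slope that a separate lm() call for that response would give, which matches the Excel "drag across columns" workflow.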

Aggregate calculations with and without grouping variable in data.table

I'm producing some summary statistics at the by-group and overall levels.
(Note: the overall statistic cannot necessarily be derived from the group-level stats. A weighted average could work, but not a median.)
Thus far my workarounds use rbindlist on either summary stats or copies of the original data, as in:
library(data.table)
data(iris)
d <- data.table(iris)
# Approach 1)
rbindlist(list(d[, lapply(.SD, median), by=Species, .SDcols=c('Sepal.Length','Petal.Length')],
               d[, lapply(.SD, median), .SDcols=c('Sepal.Length', 'Petal.Length')]),
          fill=TRUE)
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: NA 5.8 4.35
# Approach 2)
# copy() first, so that := does not overwrite Species in d itself by reference
d2 <- rbindlist(list(d, copy(d)[, Species := "Overall"]))
d2[, lapply(.SD, median), by=Species, .SDcols=c('Sepal.Length', 'Petal.Length')]
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: Overall 5.8 4.35
The first approach seems to be faster (avoids copies).
The second approach allows me to use a label "Overall" instead of the NA fill, which is more intelligible if some records were missing the "Species" value (which in the first approach would result in two rows of NA Species.)
Are there any other solutions I should consider?
I think I normally do it like this:
cols = c('Sepal.Length','Petal.Length')
rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
      d[, lapply(.SD, median), .SDcols=cols][, Species := 'Overall'])
# Species Sepal.Length Petal.Length
#1: setosa 5.0 1.50
#2: versicolor 5.9 4.35
#3: virginica 6.5 5.55
#4: Overall 5.8 4.35
I accepted @Eddi's answer but wanted to incorporate the good comment from @Frank. This approach IMO makes the most sense.
library(data.table)
d <- data.table(iris)
cols = c('Sepal.Length','Petal.Length')
rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
      d[, c(Species = 'Overall', lapply(.SD, median)), .SDcols=cols])
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: Overall 5.8 4.35
It may also be slightly faster (1.54 vs. 1.73 milliseconds on microbenchmark) than adding the Species column in a second step.
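Another option worth considering is data.table's own rollup(), which computes the per-group rows and the grand total in one call. This is a sketch rather than a benchmarked recommendation. The overall row has NA in Species, and `id = TRUE` adds a `grouping` column that distinguishes it from records whose Species is genuinely missing.

```r
library(data.table)

d <- data.table(iris)
cols <- c("Sepal.Length", "Petal.Length")

# rollup() evaluates j for each level of `by` and once for the grand total;
# id = TRUE adds a `grouping` column marking the aggregation level
res <- rollup(d, j = lapply(.SD, median), by = "Species",
              .SDcols = cols, id = TRUE)
res
```

For more than one grouping variable, rollup() also returns the intermediate subtotals, which the rbind approaches would need extra calls to produce.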
