I am working with a relatively large dataset with many attributes. Is there any simple way (no extra packages) to produce output like this, with the attribute names as row names and the functions as column names:
             Min Max
Sepal.Length 4.3 7.9
Sepal.Width  2.0 4.4
Petal.Length 1.0 6.9
Petal.Width  0.1 2.5
Currently, when I use sapply to apply multiple functions to the data, the output is this:
     Sepal.Length Sepal.Width Petal.Length Petal.Width
min      4.300000    2.000000        1.000    0.100000
max      7.900000    4.400000        6.900    2.500000
mean     5.843333    3.057333        3.758    1.199333
However, this output is too wide to fit on a PDF page when knitting once the number of attributes gets large.
You can use the t() function to transpose the table:
transposed_table <- t(normal_table)
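For example, with the built-in iris data (assuming the wide table was built with sapply as in the question):

```r
# build the wide summary with sapply, then transpose so attributes become rows
normal_table <- sapply(iris[1:4], function(x) c(Min = min(x), Max = max(x)))
transposed_table <- t(normal_table)
transposed_table
#              Min Max
# Sepal.Length 4.3 7.9
# Sepal.Width  2.0 4.4
# Petal.Length 1.0 6.9
# Petal.Width  0.1 2.5
```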
Related
I want to create summary statistics for my dataset. I have tried searching but haven't found anything that matches what I want. I want the columns to be listed vertically, with the statistical measures as headings. Here is how I want it to look:
Column     Mean           Standard deviation  25th perc.  Median  75th perc.
Column 1   Mean column 1   Std column 1       ...         ...     ...
Column 2   Mean column 2   ...                ...         ...     ...
Etc        ...             ...                ...         ...     ...
How do I do this? Thankful for any help I can get! :)
If there is a specific function that also allows some formatting/styling, info about that would be appreciated too, but the main point is that it should look as described. :)
You may want to check out the summarytools package; it has built-in support for both markdown and html output.
library(summarytools)
descr(iris,
      stats = c("mean", "sd", "q1", "med", "q3"),
      transpose = TRUE)
## Non-numerical variable(s) ignored: Species
## Descriptive Statistics
## iris
## N: 150
##
## Mean Std.Dev Q1 Median Q3
## ----------------- ------ --------- ------ -------- ------
## Petal.Length 3.76 1.77 1.60 4.35 5.10
## Petal.Width 1.20 0.76 0.30 1.30 1.80
## Sepal.Length 5.84 0.83 5.10 5.80 6.40
## Sepal.Width 3.06 0.44 2.80 3.00 3.30
We could use descr from the collapse package:
library(collapse)
descr(iris)
Your question is missing some important details, but I think you want something like this.
Example with just the numerical variables of the iris dataset:
iris_numerical <- iris[, 1:4]
# calculate statistics
new_df <- sapply(iris_numerical, function(x) {
  c(mean = mean(x), SD = sd(x), Q1 = quantile(x, 0.25),
    median = median(x), Q3 = quantile(x, 0.75))
})
This gives you summary statistics column-wise
> new_df
Sepal.Length Sepal.Width Petal.Length Petal.Width
mean 5.8433333 3.0573333 3.758000 1.1993333
SD 0.8280661 0.4358663 1.765298 0.7622377
Q1.25% 5.1000000 2.8000000 1.600000 0.3000000
median 5.8000000 3.0000000 4.350000 1.3000000
Q3.75% 6.4000000 3.3000000 5.100000 1.8000000
Then create the final data frame in the desired format, with the original column names as row names (the apply call here just transposes new_df):
new_df <- data.frame(column = colnames(new_df), apply(new_df, 1, function(x) x))
> new_df
column mean SD Q1.25. median Q3.75.
Sepal.Length Sepal.Length 5.843333 0.8280661 5.1 5.80 6.4
Sepal.Width Sepal.Width 3.057333 0.4358663 2.8 3.00 3.3
Petal.Length Petal.Length 3.758000 1.7652982 1.6 4.35 5.1
Petal.Width Petal.Width 1.199333 0.7622377 0.3 1.30 1.8
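Since the apply step above is effectively a transpose, the same result can be written more directly with t(). A small self-contained sketch using the same statistics (quantile names stripped with unname so the row labels come out as plain Q1/Q3):

```r
# column-wise summary statistics for the numeric iris columns
stats <- sapply(iris[, 1:4], function(x) {
  c(mean = mean(x), SD = sd(x),
    Q1 = unname(quantile(x, 0.25)),
    median = median(x),
    Q3 = unname(quantile(x, 0.75)))
})
# transpose so each variable is a row, with its name in a "column" column
new_df <- data.frame(column = colnames(stats), t(stats))
```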
This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 1 year ago.
I would like to aggregate a data frame while also adding in a new column (N) that counts the number of rows per value of the grouping variable, in base R.
This is trivial in dplyr:
library(dplyr)
data(iris)
combined_summary <- iris %>% group_by(Species) %>% group_by(N=n(), add=TRUE) %>% summarize_all(mean)
> combined_summary
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.01 3.43 1.46 0.246
2 versicolor 50 5.94 2.77 4.26 1.33
3 virginica 50 6.59 2.97 5.55 2.03
I am however in the unfortunate position of having to write this code in an environment that doesn't allow for packages to be used (don't ask; it's not my decision). So I need a way to do this in base R.
I can do it in base R in a long-winded way as follows:
# First create the aggregated tables separately
summary_means <- aggregate(. ~ Species, data=iris, FUN=mean)
summary_count <- aggregate(Sepal.Length ~ Species, data=iris[, c("Species", "Sepal.Length")], FUN=length)
> summary_means
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
> summary_count
Species Sepal.Length
1 setosa 50
2 versicolor 50
3 virginica 50
# Then rename the count column
colnames(summary_count)[2] <- "N"
> summary_count
Species N
1 setosa 50
2 versicolor 50
3 virginica 50
# Finally merge the two dataframes
combined_summary_baseR <- merge(x=summary_count, y=summary_means, by="Species", all.x=TRUE)
> combined_summary_baseR
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 50 5.006 3.428 1.462 0.246
2 versicolor 50 5.936 2.770 4.260 1.326
3 virginica 50 6.588 2.974 5.552 2.026
Is there any way to do this in a more efficient way in base R?
Here is a base R option using a single by call (instead of aggregate):
do.call(rbind, by(
  iris[-ncol(iris)], iris[ncol(iris)],
  function(x) c(N = nrow(x), colMeans(x))
))
# N Sepal.Length Sepal.Width Petal.Length Petal.Width
#setosa 50 5.006 3.428 1.462 0.246
#versicolor 50 5.936 2.770 4.260 1.326
#virginica 50 6.588 2.974 5.552 2.026
Using colMeans ensures that the column names are carried through, which avoids an additional setNames call.
Update
In response to your comment: having the row names as a separate column requires an extra step.
d <- do.call(rbind, by(
  iris[-ncol(iris)], iris[ncol(iris)],
  function(x) c(N = nrow(x), colMeans(x))
))
cbind(Species = rownames(d), as.data.frame(d))
Not as concise as the initial by call. I think we're seeing a clash of philosophies here: in dplyr (and the tidyverse) row names are generally avoided, consistent with the principles of "tidy data", while in base R row names are common and are (more or less) consistently carried through data operations. So you're effectively asking for a mix of dplyr (tidy) and base R data-structure concepts, which may not be the most robust approach.
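For completeness, another base-R route (just one more option, not necessarily better): compute the means with a single aggregate call and attach the group counts from table(), which avoids the second aggregate and the merge:

```r
# group means in one aggregate call
means <- aggregate(. ~ Species, data = iris, FUN = mean)
# look up each group's row count from table(), matched by group label
means$N <- as.integer(table(iris$Species)[as.character(means$Species)])
# reorder so N comes directly after Species
combined <- means[, c("Species", "N",
                      setdiff(names(means), c("Species", "N")))]
combined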
I have a data frame generated by t(summary(raw_data())):
Original data frame
However, each cell has a prefix like Max., Min., Mean, etc., and I would like to remove that prefix from each row and put it in the header. Is there an easy way to do this in R to get the data frame to look like this:
Desired data frame
Also, as for variables 3 & 18, which are factors, I'm less concerned about those.
We can loop through the columns of the dataset, get the summary of each, and then rbind the output:
do.call(rbind, lapply(raw_data, summary))
Using a reproducible example
do.call(rbind, lapply(iris[1:4], summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#Sepal.Length 4.3 5.1 5.80 5.843 6.4 7.9
#Sepal.Width 2.0 2.8 3.00 3.057 3.3 4.4
#Petal.Length 1.0 1.6 4.35 3.758 5.1 6.9
#Petal.Width 0.1 0.3 1.30 1.199 1.8 2.5
I have a data frame subdist.df that has data for sub districts. I am trying to sum up the values of rows based on a common attribute in the data frame i.e DISTRICT column.
The following line of code works
hello2 <- aggregate(. ~ DISTRICT, subdist.df, sum)
But this one does not.
hello <- aggregate(noquote(paste0(".~", "DISTRICT")), subdist.df, sum)
I am unable to understand why this is the case. I need to use it in a function wherein DISTRICT can be any input from the user as an argument.
Using iris data.frame as an example:
aggregate(.~Species, iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
The following paste0 construction doesn't work, because noquote only returns a character string (marked to print without quotes), not the formula object that aggregate requires:
aggregate(noquote(paste0(".~","Species")), iris, sum)
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
Instead, adding as.formula before paste0 would work:
aggregate(as.formula(paste0(".~","Species")), iris, sum)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 250.3 171.4 73.1 12.3
2 versicolor 296.8 138.5 213.0 66.3
3 virginica 329.4 148.7 277.6 101.3
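As a variation on the same idea, reformulate() can build the formula programmatically without string pasting, which works nicely inside a function that takes the grouping column as an argument (agg_by is a made-up helper name for illustration):

```r
agg_by <- function(df, group) {
  # reformulate() returns a proper formula object, e.g. . ~ Species
  aggregate(reformulate(group, response = "."), data = df, FUN = sum)
}
res <- agg_by(iris, "Species")
res
```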
I need to run a simple regression using lm() in R. It's simple because I have only one independent variable. However, the catch is that I need to test this independent variable against a number of dependents, which are columns in a data frame.
So basically I have one common X and numerous Y's, and for each Y I need to extract the intercept and slope and store them all in a data frame.
In Excel this is possible with the INTERCEPT and SLOPE functions, dragging across columns. I need something in R that would do the same. I could of course run separate regressions, but the requirement is that I run all of them in one loop and store the intercept and slope estimates for each.
I'm still learning R and any help on this would be great. Thanks :)
The lmList function in package nlme was designed for this.
Let's use the iris dataset as an example:
DF <- iris[, 1:4]
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
#...
First we have to reshape it. We want Sepal.Length as the dependent and the other columns as predictors in this example.
library(reshape2)
DF <- melt(DF, id.vars = "Sepal.Length")
# Sepal.Length variable value
#1 5.1 Sepal.Width 3.5
#2 4.9 Sepal.Width 3.0
#3 4.7 Sepal.Width 3.2
#4 4.6 Sepal.Width 3.1
#5 5.0 Sepal.Width 3.6
#6 5.4 Sepal.Width 3.9
#...
Now we can do the fits.
library(nlme)
mods <- lmList(Sepal.Length ~ value | variable,
               data = DF, pool = FALSE)
We can now extract intercept and slope for each model.
coef(mods)
# (Intercept) value
#Sepal.Width 6.526223 -0.2233611
#Petal.Length 4.306603 0.4089223
#Petal.Width 4.777629 0.8885803
And get the usual t-table:
summary(mods)
# Call:
# Model: Sepal.Length ~ value | variable
# Data: DF
#
# Coefficients:
# (Intercept)
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width 6.526223 0.47889634 13.62763 6.469702e-28
# Petal.Length 4.306603 0.07838896 54.93890 2.426713e-100
# Petal.Width 4.777629 0.07293476 65.50552 3.340431e-111
# value
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width -0.2233611 0.15508093 -1.440287 1.518983e-01
# Petal.Length 0.4089223 0.01889134 21.646019 1.038667e-47
# Petal.Width 0.8885803 0.05137355 17.296454 2.325498e-37
Or the R-squared values:
summary(mods)$r.squared
#[1] 0.01382265 0.75995465 0.66902769
However, if you need something more efficient, you can use the data.table package together with lm's workhorse, lm.fit:
library(data.table)
setDT(DF)
DF[, setNames(as.list(lm.fit(cbind(1, value),
                             Sepal.Length)[["coefficients"]]),
              c("intercept", "slope")), by = variable]
# variable intercept slope
#1: Sepal.Width 6.526223 -0.2233611
#2: Petal.Length 4.306603 0.4089223
#3: Petal.Width 4.777629 0.8885803
And of course the R-squared values of these models are just the squared Pearson correlation coefficients:
DF[, .(r.sq = cor(Sepal.Length, value)^2), by = variable]
# variable r.sq
#1: Sepal.Width 0.01382265
#2: Petal.Length 0.75995465
#3: Petal.Width 0.66902769
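If no packages are available at all, the question's original setup (one common x, many y columns) can also be handled with a plain sapply loop over lm() fits. A minimal base-R sketch using iris, with Sepal.Length arbitrarily chosen as the common predictor:

```r
x  <- iris$Sepal.Length                 # the common predictor
ys <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]

# one lm() per response column; coef() gives intercept and slope
coefs <- t(sapply(ys, function(y) coef(lm(y ~ x))))
colnames(coefs) <- c("intercept", "slope")

# store as a data frame, one row per response variable
coefs_df <- data.frame(variable = rownames(coefs), coefs, row.names = NULL)
coefs_df
```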