How to convert lapply output to a single matrix in R - r

I have a list of data frames, organized by year. I am using lapply to get the summary for a single variable in each data frame. The output follows the list and gives a summary for each year, one by one. However, I want the output in the form of a single table with years for rows. How do I do this? An example using the iris dataset shows my problem:
x <- split(iris$Sepal.Length, iris$Species)
lapply(x, summary)
And the output is:
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 4.800 5.000 5.006 5.200 5.800
Similarly for the other two.
I want the output organized as a single table like with:
> sapply(x, summary)
setosa versicolor virginica
Min. 4.300 4.900 4.900
1st Qu. 4.800 5.600 6.225
Median 5.000 5.900 6.500
Mean 5.006 5.936 6.588
3rd Qu. 5.200 6.300 6.900
Max. 5.800 7.000 7.900
But with setosa, versicolor, virginica (or years in my case) on the left and Min... Max up top. I can flip the axes around in ggplot, but reading the table as-is is more intuitive with the years on the left. I came across a number of discussions about converting lapply output but the ones I came across were all measuring a single stat like mean or median. Thanks.

This seems like a good time to use by(). It eliminates the need for the call to split(), is all done in one line, and returns a matrix.
with(iris, do.call(rbind, by(Sepal.Length, Species, summary)))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# setosa 4.3 4.800 5.0 5.006 5.2 5.8
# versicolor 4.9 5.600 5.9 5.936 6.3 7.0
# virginica 4.9 6.225 6.5 6.588 6.9 7.9
If you still wish to use the manual split-apply-combine method, it would be
do.call(rbind, lapply(x, summary))
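Since the question already has the sapply() output, another option is simply to transpose it. Here is a minimal sketch of that, plus the same rbind idea applied to a list of data frames. The names by_year and value are made up for illustration and stand in for your yearly list and the variable you summarize.

```r
# Built-in data, as in the question.
x <- split(iris$Sepal.Length, iris$Species)

# sapply() puts the statistics in rows; t() flips them into columns.
tab1 <- t(sapply(x, summary))

# Hypothetical list of data frames keyed by year (by_year and value are assumptions).
by_year <- list(
  "2019" = data.frame(value = c(1, 2, 3, 4)),
  "2020" = data.frame(value = c(2, 4, 6, 8))
)

# Summarize one column per data frame, then stack the results row by row.
tab2 <- do.call(rbind, lapply(by_year, function(d) summary(d$value)))

tab1
tab2
```

Both results are plain matrices with the groups (or years) as row names. Keep in mind that summary() adds an NA's entry when a vector contains missing values, so the rows only line up cleanly if every group is summarized to the same set of statistics.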

If you have a large data.frame, I recommend not splitting it into pieces but using data.table to group by year instead. With the iris data set this could be done as follows:
library(data.table)
setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
# Species Min. 1st Qu. Median Mean 3rd Qu. Max.
#1: setosa 4.3 4.800 5.0 5.006 5.2 5.8
#2: versicolor 4.9 5.600 5.9 5.936 6.3 7.0
#3: virginica 4.9 6.225 6.5 6.588 6.9 7.9
as.list() ensures the output of summary() appears column-wise as requested.
The result is a data.table (not a matrix) which can be used directly in a subsequent ggplot() call.
Note that copy(iris) is only required here because the iris data set is locked to prevent modifying its variable bindings. With your own data.frame df you would simply use setDT(df) to coerce to data.table without copying.
Add-on
The OP mentioned that he uses the result for plotting with ggplot2. Now, ggplot2 works best when data are provided in long format. Reshaping a data.table from wide to long format can be conveniently done with melt()
wideDT <- setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species] # as.list() keeps one column per statistic
longDT <- melt(wideDT, id.vars = "Species")
longDT
# Species variable value
# 1: setosa Min. 4.300
# 2: versicolor Min. 4.900
# 3: virginica Min. 4.900
# 4: setosa 1st Qu. 4.800
# 5: versicolor 1st Qu. 5.600
# 6: virginica 1st Qu. 6.225
# 7: setosa Median 5.000
# 8: versicolor Median 5.900
# 9: virginica Median 6.500
#10: setosa Mean 5.006
#11: versicolor Mean 5.936
#12: virginica Mean 6.588
#13: setosa 3rd Qu. 5.200
#14: versicolor 3rd Qu. 6.300
#15: virginica 3rd Qu. 6.900
#16: setosa Max. 5.800
#17: versicolor Max. 7.000
#18: virginica Max. 7.900

Related

Create Table from Summary() in R

I used the generic summary() function to get some data. Now I want to display some of the summary data in a table and then knit it to PDF. How do I create a table from the results of calling summary()?
Using TeX, kableExtra and ggplot2.
summary(segmentdata)
summary(subset(segmentdata, Segment == "Suburb mix"))
summary(subset(segmentdata, Segment == "Urban hip"))
summary(subset(segmentdata, Segment == "Travelers"))
summary(subset(segmentdata, Segment == "Moving up"))
E.g. data
age gender income kids ownHome
Min. :20.00 Length:300 Min. :-13292 Min. :0.000 Length:300
1st Qu.:32.75 Class :character 1st Qu.: 38122 1st Qu.:0.000 Class :character
Median :39.00 Mode :character Median : 51134 Median :1.000 Mode :character
Mean :40.59 Mean : 50259 Mean :1.163
3rd Qu.:47.00 3rd Qu.: 63001 3rd Qu.:2.000
Max. :70.00 Max. :139679 Max. :5.000
subscribe Segment
Length:300 Length:300
Class :character Class :character
Mode :character Mode :character
Welcome to StackOverflow. When posting a question, it's good practice to provide actual data with a reproducible example so contributors can help you. The reprex package is recommended for this in R.
I'll give you an answer based on what I think you want to achieve. I used the iris data set as an example.
library(tidyverse)
library(kableExtra)
vars <- iris %>% names()
iris %>%
filter(Species == "setosa") %>% # subset data
map_dfr(summary) %>% # apply summary to variables
add_column(vars = vars, .before = 1) # add variable names
#> # A tibble: 5 x 10
#> vars Min. `1st Qu.` Median Mean `3rd Qu.` Max. setosa versicolor virginica
#> <chr> <tab> <table> <tabl> <tab> <table> <tab> <int> <int> <int>
#> 1 Sepa.. 4.3 4.8 5.0 5.006 5.200 5.8 NA NA NA
#> 2 Sepa.. 2.3 3.2 3.4 3.428 3.675 4.4 NA NA NA
#> 3 Peta.. 1.0 1.4 1.5 1.462 1.575 1.9 NA NA NA
#> 4 Peta.. 0.1 0.2 0.2 0.246 0.300 0.6 NA NA NA
#> 5 Spec.. NA NA NA NA NA NA 50 0 0
For more detail on the process, check out the function's documentation.
For the kableExtra output, add kbl() %>% kable_styling() at the end of the pipeline.
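If you would rather avoid the NA columns that the factor variable introduces, a base R sketch along these lines may be enough. It assumes iris stands in for segmentdata and Species for Segment. The names num_cols and tabs are only illustrative.

```r
# iris stands in for segmentdata; Species plays the role of Segment (assumption).
num_cols <- names(iris)[sapply(iris, is.numeric)]

# One matrix per group: variables in rows, statistics in columns.
tabs <- lapply(split(iris[num_cols], iris$Species), function(d) {
  t(sapply(d, summary))
})

round(tabs$setosa, 3)
```

Each element of tabs is an ordinary numeric matrix, so it should be possible to pass it to kbl() for the PDF output, since kable() accepts matrices as well as data frames.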

How to loop over a specific set of columns names in a dataframe with the use of a vector?

I've been struggling with this problem for days and I'm rather new to R. So I give up and I hope anyone of you can help me.
I want to compute summary statistics of a specific variable grouped by different variables which I want to loop. I don't want to copy-paste my syntax and change the grouping variable every time. I used a for loop and lapply with the use of vector (where my different grouping variables are stored).
I think the problem is that my dataframe cannot find the column names I stored in my vector.
My code looks something like this:
snp_EPA <- c('rs3798713_C', 'rs174550_C', 'rs174574_A', 'rs174448_C') #Vector of grouping variables
for (i in snp_EPA) {
FA %>% group_by(as.name(i)) %>% summarise(FA, bce_c20_5n_3)
} #For loop I tried, didn't work
epa <- lapply(snp_EPA, function(x) {describeBy(FA$bce_c20_5n_3, as.name(x))})
lapply(epa, print) #lapply function I used, still didn't work....
We really need more information about your data and a small sample using dput(data). I can show you a couple of ways to get what you want that might get you started. I'll use the iris data set that comes with R:
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The data set consists of 4 measurements on three different species of iris. One simple way to get descriptive statistics is to use split and summary:
iris.split <- split(iris, iris$Species)
lapply(iris.split, summary)
# $setosa
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
# 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
# Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
# Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
# 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
# Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
# . . . results for the other two species
Another approach is to use a summary statistics function that groups the data for you. The numSummary function in package RcmdrMisc is one of many possibilities:
library(RcmdrMisc) # You will have to install it the first time with `install.packages("RcmdrMisc")`.
numSummary(iris[, -5], groups=iris$Species)
#
# Variable: Sepal.Length
# mean sd IQR 0% 25% 50% 75% 100% n
# setosa 5.006 0.3524897 0.400 4.3 4.800 5.0 5.2 5.8 50
# versicolor 5.936 0.5161711 0.700 4.9 5.600 5.9 6.3 7.0 50
# virginica 6.588 0.6358796 0.675 4.9 6.225 6.5 6.9 7.9 50
# . . . results for three other measurements.
These examples use all of the numeric columns, but you can select only some columns with iris[, 1:3] to get just the first three or iris[, c(1,4)] to get the first and the fourth columns.
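To loop over grouping variables whose names are stored in a character vector, as the question asks, something like the following sketch might get you started. It assumes mtcars stands in for FA, mpg for bce_c20_5n_3, and cyl and gear for the SNP columns. The names group_vars and by_group are made up for illustration.

```r
# mtcars stands in for FA; cyl and gear stand in for the SNP columns (assumption).
group_vars <- c("cyl", "gear")

# df[[g]] looks a column up by the name stored in the string g.
by_group <- lapply(group_vars, function(g) {
  do.call(rbind, lapply(split(mtcars$mpg, mtcars[[g]]), summary))
})
names(by_group) <- group_vars

by_group$cyl
```

The key point is that a string holding a column name has to be turned into the column itself, which [[ ]] does here. In the dplyr loop from the question, group_by(.data[[i]]) should play the same role, and the result needs an explicit print() or an assignment inside a for loop, otherwise nothing is shown.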

Use first part of value as header in r

I have a data frame generated by t(summary(raw_data())):
Original data frame
However, each cell has a prefix such as Max, Min, Mean, etc. I would like to remove that prefix from each row and use it as the column header instead. Is there an easy way to do this in R so that the data frame looks like this:
Desired data frame
As for variables 3 and 18, which are factors, I'm less concerned about those.
We can loop through the columns of the dataset, get the summary and then rbind the output
do.call(rbind, lapply(raw_data, summary))
Using a reproducible example
do.call(rbind, lapply(iris[1:4], summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#Sepal.Length 4.3 5.1 5.80 5.843 6.4 7.9
#Sepal.Width 2.0 2.8 3.00 3.057 3.3 4.4
#Petal.Length 1.0 1.6 4.35 3.758 5.1 6.9
#Petal.Width 0.1 0.3 1.30 1.199 1.8 2.5

How to describe and format print equivalents

I am learning Julia by using it as a substitute for R and Python.
I have a Python statement:
df = pd.read_csv('{0}/{1:03.0f}.csv'.format(directory, int(id)))
and am using
filename = length(string(id)) == 1 ? "00"*string(id) :
length(string(id)) == 2 ? "0"*string(id) : string(id)
df = readtable(directory*"/"*filename*".csv")
I quite like this but is there a simpler way?
Similarly with Python I can get a summary (R) of the dataframes statistics by using df.describe(). Is there an equivalent in Julia yet?
@sprintf is the most compact, but just FYI there's also lpad and rpad.
You can use the @sprintf macro like this (on Julia 0.7 and later, run using Printf first):
julia> @sprintf("%s/%03d.csv","foo",1)
"foo/001.csv"
You can get a summary of a DataFrame using the describe function:
julia> using RDatasets
julia> iris = data("datasets","iris");
julia> describe(iris)
Min 1.0
1st Qu. 38.25
Median 75.5
Mean 75.5
3rd Qu. 112.75
Max 150.0
NAs 0
NA% 0.0%
Sepal.Length
Min 4.3
1st Qu. 5.1
Median 5.8
Mean 5.843333333333332
3rd Qu. 6.4
Max 7.9
NAs 0
NA% 0.0%
Sepal.Width
Min 2.0
1st Qu. 2.8
Median 3.0
Mean 3.0573333333333337
3rd Qu. 3.3
Max 4.4
NAs 0
NA% 0.0%
Petal.Length
Min 1.0
1st Qu. 1.6
Median 4.35
Mean 3.758000000000001
3rd Qu. 5.1
Max 6.9
NAs 0
NA% 0.0%
Petal.Width
Min 0.1
1st Qu. 0.3
Median 1.3
Mean 1.1993333333333331
3rd Qu. 1.8
Max 2.5
NAs 0
NA% 0.0%
Species
Length 150
Type UTF8String
NAs 0
NA% 0.0%
Unique 3

function with missing argument

I have data that looks like this:
> x
Date Obs
1/1/2012 4
1/2/2012 40
1/3/2012 50
And a function like this:
myDat <- function(x, summarize)
{
if (summarize == T)
{
print(summary(x))
}
if (missing(summarize) | summarize == F)
{
print(x)
}
}
when I try to run it as:
myDat(x)
I get this error:
Error in summarize == T : 'summarize' is missing
what am I doing here wrong?
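The error occurs because summarize == T is evaluated in the first if statement, before missing(summarize) is ever checked. Use a default for your summarize argument and your function simplifies to one line: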
myDat <- function(x, summarize=FALSE) { if (summarize) summary(x) else x}
Try it:
head(myDat(iris))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
myDat(iris, s=TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
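If you prefer to keep summarize without a default, a variant along these lines should also work. This is only a sketch, and myDat2 is a made-up name so it does not clash with the function above.

```r
# Variant that keeps `summarize` without a default value (illustrative only).
myDat2 <- function(x, summarize) {
  # Check missing() first; using a missing argument in a comparison is an error.
  if (missing(summarize)) summarize <- FALSE
  # isTRUE() treats NA or non-logical input as "do not summarize".
  if (isTRUE(summarize)) summary(x) else x
}

identical(myDat2(iris), iris)
class(myDat2(iris, summarize = TRUE))
```

Like the one-liner, this returns the value instead of calling print(), so the result can be assigned or passed on to other functions such as head().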
