Use first part of value as header in R

I have a data frame generated by t(summary(raw_data())):
[image: original data frame]
However, each cell has a prefix such as Max, Min, Mean, etc. I would like to remove that prefix from each cell and use it as the column header. Is there an easy way to do this in R so that the data frame looks like this:
[image: desired data frame]
As for variables 3 and 18, which are factors, I'm less concerned about those.

We can loop through the columns of the dataset, get the summary and then rbind the output
do.call(rbind, lapply(raw_data, summary))
Using a reproducible example
do.call(rbind, lapply(iris[1:4], summary))
#             Min. 1st Qu. Median  Mean 3rd Qu. Max.
#Sepal.Length  4.3     5.1   5.80 5.843     6.4  7.9
#Sepal.Width   2.0     2.8   3.00 3.057     3.3  4.4
#Petal.Length  1.0     1.6   4.35 3.758     5.1  6.9
#Petal.Width   0.1     0.3   1.30 1.199     1.8  2.5
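Since the question mentions that some columns are factors, one option is to keep only the numeric columns before summarising. This is a sketch rather than part of the answer above: the helper name numeric_summary is hypothetical, and iris stands in for raw_data.

```r
# Hypothetical helper: summarise only the numeric columns of a data frame.
# Assumes no missing values; with NAs, summary() adds an "NA's" entry for
# the affected columns and the rows would no longer have equal length.
numeric_summary <- function(df) {
  num_cols <- Filter(is.numeric, df)         # drop factors, characters, etc.
  do.call(rbind, lapply(num_cols, summary))  # one row per column
}

res <- numeric_summary(iris)
dim(res)
rownames(res)
```

The Species factor is dropped by Filter(), so the result has one row for each of the four measurement columns.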

Related

How to apply multiple functions to a dataframe in R

I am working with a relatively large dataset with many attributes. Is there a simple way (no extra packages) to get output like this, with the attribute names as row names and the function names as column names:
             Min Max
Sepal.Length 4.3 7.9
Sepal.Width  2.0 4.4
Petal.Length 1.0 6.9
Petal.Width  0.1 2.5
Currently, when I sapply multiple functions to the data, the output is this:
     Sepal.Length Sepal.Width Petal.Length Petal.Width
min      4.300000    2.000000        1.000    0.100000
max      7.900000    4.400000        6.900    2.500000
mean     5.843333    3.057333        3.758    1.199333
However, this output will be too wide to fit in a PDF when knitting if there are many attributes.
You can use the t function:
transposed_table <- t(normal_table)
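As a minimal sketch of the transposition, using iris in place of the asker's data (the names stats_wide and stats_long are only illustrative):

```r
# sapply() returns one column per attribute and one row per statistic
stats_wide <- sapply(iris[1:4], function(x) c(Min = min(x), Max = max(x)))

# t() swaps rows and columns: attributes become rows, statistics become columns
stats_long <- t(stats_wide)
stats_long
```

Because the result grows downward rather than sideways as attributes are added, it is more likely to fit on a knitted page.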

Summary statistics in r, measures as headers

I want to create summary statistics for my dataset. I have tried searching but haven't found anything that matches what I want. I want the columns to be listed vertically with the statistical measures as headings. Here is how I want it to look:
Column    Mean           Standard deviation  25th perc.  Median  75th perc.
Column 1  Mean column 1  Std column 1        ...         ...     ...
Column 2  Mean column 2  ...                 ...         ...     ...
Etc       ...            ...                 ...         ...     ...
How do I do this? Thankful for any help I can get!:)
If there is a specific function to use where I can also do some formatting/styling some info about that would also be appreciated, but the main point is that it should look as described. :)
You may want to check out the summarytools package. It has built-in support for both Markdown and HTML.
library(summarytools)
descr(iris,
      stats = c("mean", "sd", "q1", "med", "q3"),
      transpose = TRUE)
## Non-numerical variable(s) ignored: Species
## Descriptive Statistics
## iris
## N: 150
##
##                      Mean   Std.Dev     Q1   Median     Q3
## ----------------- ------ --------- ------ -------- ------
##      Petal.Length   3.76      1.77   1.60     4.35   5.10
##       Petal.Width   1.20      0.76   0.30     1.30   1.80
##      Sepal.Length   5.84      0.83   5.10     5.80   6.40
##       Sepal.Width   3.06      0.44   2.80     3.00   3.30
We could use descr from collapse
library(collapse)
descr(iris)
Your question is missing some important features, but I think you want something like this:
Example with just the numerical variables of the iris dataset:
iris_numerical <- iris[, 1:4]
Calculate the statistics:
new_df <- sapply(iris_numerical, function(x) {
  c(mean = mean(x), SD = sd(x), Q1 = quantile(x, 0.25),
    median = median(x), Q3 = quantile(x, 0.75))
})
This gives you summary statistics column-wise:
> new_df
       Sepal.Length Sepal.Width Petal.Length Petal.Width
mean      5.8433333   3.0573333     3.758000   1.1993333
SD        0.8280661   0.4358663     1.765298   0.7622377
Q1.25%    5.1000000   2.8000000     1.600000   0.3000000
median    5.8000000   3.0000000     4.350000   1.3000000
Q3.75%    6.4000000   3.3000000     5.100000   1.8000000
Then create the final data frame in the desired format, with the original column names as rows:
new_df <- data.frame(column = colnames(new_df), apply(new_df, 1, function(x) x))
> new_df
                   column     mean        SD Q1.25. median Q3.75.
Sepal.Length Sepal.Length 5.843333 0.8280661    5.1   5.80    6.4
Sepal.Width   Sepal.Width 3.057333 0.4358663    2.8   3.00    3.3
Petal.Length Petal.Length 3.758000 1.7652982    1.6   4.35    5.1
Petal.Width   Petal.Width 1.199333 0.7622377    0.3   1.30    1.8
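The Q1.25. and Q3.75. headers arise because quantile() returns a named value and that name is appended to Q1 and Q3. A possible variation, offered as a sketch rather than as the answer's method, strips those names with unname() and transposes with t(); the object names stats and out are illustrative.

```r
# One row per column of iris, one column per statistic.
# unname() drops the "25%" / "75%" names that quantile() attaches,
# so the headers come out as plain Q1 and Q3.
stats <- t(sapply(iris[1:4], function(x) {
  c(mean   = mean(x),
    SD     = sd(x),
    Q1     = unname(quantile(x, 0.25)),
    median = median(x),
    Q3     = unname(quantile(x, 0.75)))
}))

# row.names = NULL replaces the duplicated row names with 1, 2, 3, ...
out <- data.frame(column = rownames(stats), stats, row.names = NULL)
out
```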

How to loop over a specific set of columns names in a dataframe with the use of a vector?

I've been struggling with this problem for days and I'm rather new to R, so I hope someone here can help me.
I want to compute summary statistics of a specific variable, grouped by several different variables that I want to loop over. I don't want to copy-paste my syntax and change the grouping variable every time. I tried a for loop and lapply with a vector in which my grouping variables are stored.
I think the problem is that the column names stored in my vector are not found in my data frame.
My code looks something like this:
snp_EPA <- c('rs3798713_C', 'rs174550_C', 'rs174574_A', 'rs174448_C') #Vector of grouping variables
for (i in snp_EPA) {
  FA %>% group_by(as.name(i)) %>% summarise(FA, bce_c20_5n_3)
} # For loop I tried, didn't work
epa <- lapply(snp_EPA, function(x) {describeBy(FA$bce_c20_5n_3, as.name(x))})
lapply(epa, print) #lapply function I used, still didn't work....
We really need more information about your data and a small sample using dput(data). I can show you a couple of ways to get what you want that might get you started. I'll use the iris data set that comes with R:
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The data set consists of 4 measurements on three different species of iris. One simple way to get descriptive statistics is to use split and summary:
iris.split <- split(iris, iris$Species)
lapply(iris.split, summary)
# $setosa
#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
#  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50
#  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0
#  Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0
#  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
#  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
#  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
# . . . results for the other two species
Another approach is to use a summary statistics function that will group the data for you. The numSummary function in package RcmdrMisc is one of many possibilities:
library(RcmdrMisc) # You will have to install it the first time with `install.packages("RcmdrMisc")`.
numSummary(iris[, -5], groups=iris$Species)
#
# Variable: Sepal.Length
#             mean        sd   IQR  0%   25% 50% 75% 100%  n
# setosa     5.006 0.3524897 0.400 4.3 4.800 5.0 5.2  5.8 50
# versicolor 5.936 0.5161711 0.700 4.9 5.600 5.9 6.3  7.0 50
# virginica  6.588 0.6358796 0.675 4.9 6.225 6.5 6.9  7.9 50
# . . . results for three other measurements.
These examples use all of the numeric columns, but you can select only some columns with iris[, 1:3] to get just the first three or iris[, c(1,4)] to get the first and the fourth columns.
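To loop over grouping columns whose names are stored in a character vector, a column can be looked up by its name with [[ ]]. The asker's data are not available, so this base R sketch uses mtcars as a stand-in: mpg plays the role of bce_c20_5n_3, and cyl, gear, and am play the role of the SNP columns. The names group_vars and results are illustrative.

```r
# Stand-in for snp_EPA: names of the grouping columns, stored as strings
group_vars <- c("cyl", "gear", "am")

# For each grouping column, split the outcome by that column and
# bind the per-group summaries into one matrix (groups as rows)
results <- lapply(group_vars, function(g) {
  groups <- split(mtcars$mpg, mtcars[[g]])  # [[g]] looks the column up by name
  do.call(rbind, lapply(groups, summary))
})
names(results) <- group_vars

results$cyl
```

Each element of results is a matrix for one grouping variable, so results$gear and results$am can be inspected in the same way.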

How to convert lapply output to a single matrix in R

I have a list of data frames, organized by year. I am using lapply to get the summary for a single variable in each data frame. The output follows the list and gives a summary for each year, one by one. However, I want the output in the form of a single table with years for rows. How do I do this? An example using the iris dataset shows my problem:
x <- split(iris$Sepal.Length, iris$Species)
lapply(x, summary)
And the output is:
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   4.800   5.000   5.006   5.200   5.800
Similarly for the other two.
I want the output organized as a single table like with:
> sapply(x, summary)
        setosa versicolor virginica
Min.     4.300      4.900     4.900
1st Qu.  4.800      5.600     6.225
Median   5.000      5.900     6.500
Mean     5.006      5.936     6.588
3rd Qu.  5.200      6.300     6.900
Max.     5.800      7.000     7.900
But with setosa, versicolor, virginica (or years in my case) on the left and Min... Max up top. I can flip the axes around in ggplot, but reading the table as-is is more intuitive with the years on the left. I came across a number of discussions about converting lapply output but the ones I came across were all measuring a single stat like mean or median. Thanks.
This seems like a good time to use by(). It eliminates the need for the call to split(), is all done in one line, and returns a matrix.
with(iris, do.call(rbind, by(Sepal.Length, Species, summary)))
#            Min. 1st Qu. Median  Mean 3rd Qu. Max.
# setosa      4.3   4.800    5.0 5.006     5.2  5.8
# versicolor  4.9   5.600    5.9 5.936     6.3  7.0
# virginica   4.9   6.225    6.5 6.588     6.9  7.9
If you still wish to use the manual split-apply-combine method, then it would be
do.call(rbind, lapply(x, summary))
If you have a large data.frame, I recommend not splitting it into pieces but using data.table to group by year. With the iris data set this could be done as follows:
library(data.table)
setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
#      Species Min. 1st Qu. Median  Mean 3rd Qu. Max.
#1:     setosa  4.3   4.800    5.0 5.006     5.2  5.8
#2: versicolor  4.9   5.600    5.9 5.936     6.3  7.0
#3:  virginica  4.9   6.225    6.5 6.588     6.9  7.9
as.list() ensures the output of summary() appears column-wise as requested.
The result is a data.table (not a matrix) which can be used directly in a subsequent ggplot() call.
Note that copy(iris) is only required here because the iris data set is locked to prevent modifying its variable bindings. With your own data.frame df you would simply use setDT(df) to coerce to data.table without copying.
Add-on
The OP mentioned that the result is used for plotting with ggplot2. Now, ggplot2 works best when data are provided in long format. Reshaping a data.table from wide to long format can be conveniently done with melt():
wideDT <- setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
longDT <- melt(wideDT, id.vars = "Species")
longDT
#       Species variable value
# 1:     setosa     Min. 4.300
# 2: versicolor     Min. 4.900
# 3:  virginica     Min. 4.900
# 4:     setosa  1st Qu. 4.800
# 5: versicolor  1st Qu. 5.600
# 6:  virginica  1st Qu. 6.225
# 7:     setosa   Median 5.000
# 8: versicolor   Median 5.900
# 9:  virginica   Median 6.500
#10:     setosa     Mean 5.006
#11: versicolor     Mean 5.936
#12:  virginica     Mean 6.588
#13:     setosa  3rd Qu. 5.200
#14: versicolor  3rd Qu. 6.300
#15:  virginica  3rd Qu. 6.900
#16:     setosa     Max. 5.800
#17: versicolor     Max. 7.000
#18:  virginica     Max. 7.900

How to describe and format print equivalents

I am learning Julia by using it as a substitute for R and Python.
I have a Python statement:
df = pd.read_csv('{0}/{1:03.0f}.csv'.format(directory, int(id)))
and am using
filename = length(string(id)) == 1 ? "00"*string(id) :
length(string(id)) == 2 ? "0"*string(id) : string(id)
df = readtable(directory*"/"*filename*".csv")
I quite like this but is there a simpler way?
Similarly, in Python I can get a summary (like R's summary) of the data frame's statistics by using df.describe(). Is there an equivalent in Julia yet?
sprintf is the most compact, but just FYI there's also lpad and rpad.
You can use the @sprintf macro like this:
julia> @sprintf("%s/%03d.csv","foo",1)
"foo/001.csv"
You can get a summary of a DataFrame using the describe function:
julia> using RDatasets
julia> iris = data("datasets","iris");
julia> describe(iris)
Min 1.0
1st Qu. 38.25
Median 75.5
Mean 75.5
3rd Qu. 112.75
Max 150.0
NAs 0
NA% 0.0%
Sepal.Length
Min 4.3
1st Qu. 5.1
Median 5.8
Mean 5.843333333333332
3rd Qu. 6.4
Max 7.9
NAs 0
NA% 0.0%
Sepal.Width
Min 2.0
1st Qu. 2.8
Median 3.0
Mean 3.0573333333333337
3rd Qu. 3.3
Max 4.4
NAs 0
NA% 0.0%
Petal.Length
Min 1.0
1st Qu. 1.6
Median 4.35
Mean 3.758000000000001
3rd Qu. 5.1
Max 6.9
NAs 0
NA% 0.0%
Petal.Width
Min 0.1
1st Qu. 0.3
Median 1.3
Mean 1.1993333333333331
3rd Qu. 1.8
Max 2.5
NAs 0
NA% 0.0%
Species
Length 150
Type UTF8String
NAs 0
NA% 0.0%
Unique 3
