I am learning Julia by using it as a substitute for R and Python.
I have a Python statement:
df = pd.read_csv('{0}/{1:03.0f}.csv'.format(directory, int(id)))
and am using
filename = length(string(id)) == 1 ? "00"*string(id) :
length(string(id)) == 2 ? "0"*string(id) : string(id)
df = readtable(directory*"/"*filename*".csv")
I quite like this but is there a simpler way?
Similarly, in Python I can get a summary (like R's summary) of a dataframe's statistics by using df.describe(). Is there an equivalent in Julia yet?
@sprintf is the most compact, but just FYI there's also lpad and rpad.
You can use the @sprintf macro like this (on Julia 0.7 and later, load it first with using Printf):
julia> @sprintf("%s/%03d.csv","foo",1)
"foo/001.csv"
You can get a summary of a DataFrame using the describe function:
julia> using RDatasets
julia> iris = data("datasets","iris");
julia> describe(iris)
Min 1.0
1st Qu. 38.25
Median 75.5
Mean 75.5
3rd Qu. 112.75
Max 150.0
NAs 0
NA% 0.0%
Sepal.Length
Min 4.3
1st Qu. 5.1
Median 5.8
Mean 5.843333333333332
3rd Qu. 6.4
Max 7.9
NAs 0
NA% 0.0%
Sepal.Width
Min 2.0
1st Qu. 2.8
Median 3.0
Mean 3.0573333333333337
3rd Qu. 3.3
Max 4.4
NAs 0
NA% 0.0%
Petal.Length
Min 1.0
1st Qu. 1.6
Median 4.35
Mean 3.758000000000001
3rd Qu. 5.1
Max 6.9
NAs 0
NA% 0.0%
Petal.Width
Min 0.1
1st Qu. 0.3
Median 1.3
Mean 1.1993333333333331
3rd Qu. 1.8
Max 2.5
NAs 0
NA% 0.0%
Species
Length 150
Type UTF8String
NAs 0
NA% 0.0%
Unique 3
My code is:
Normality <- tapply(input$TotalAuthBdNet.USD., input$Country, summary)
The output displayed is:
$Albania
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 1.066e+04 2.730e+04 3.403e+07 5.015e+04 2.720e+09
$Angola
Min. 1st Qu. Median Mean 3rd Qu. Max.
5405 15323 52522 486451 170000 4513196
$`Antigua and Barbuda`
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
22622 22622 22622 22622 22622 22622 2
$Argentina
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 15814 45000 212800 193626 4080293 15
Each country gets its own block of statistics. I want a single table with one row per country instead:
Country Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
Albania 0.000e+00 1.066e+04 2.730e+04 3.403e+07 5.015e+04 2.720e+09
Angola 5405 15323 52522 486451 170000 4513196
Argentina 0 15814 45000 212800 193626 4080293 15
The country name is a list identified from the file.
A simple rbind would do, e.g. (using the mpg data set that ships with ggplot2):
do.call(rbind, tapply(mpg$year, mpg$model, summary))
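One caveat with the question's data: summary() returns seven values for groups that contain NAs and six for groups that do not, so rbind would recycle the shorter rows. A minimal sketch of one way around this, using made-up data and a hypothetical helper full_summary (neither is from the question), is to always return an explicit NA count:

```r
# Hypothetical data: group B contains a missing value, group A does not.
input <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Value   = c(1, 2, 3, 10, NA, 30)
)

# Hypothetical helper: always returns seven named entries so rows line up.
full_summary <- function(v) {
  c(unclass(summary(v[!is.na(v)])), "NA's" = sum(is.na(v)))
}

result <- do.call(rbind, tapply(input$Value, input$Country, full_summary))
result
```

Every row then has the same columns, with 0 in the NA's column for groups without missing values.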
You can also directly call aggregate so you don't need the extra step:
aggregate(Sepal.Length ~ Species, iris, summary)
# Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max.
# 1 setosa 4.300 4.800 5.000 5.006 5.200 5.800
# 2 versicolor 4.900 5.600 5.900 5.936 6.300 7.000
# 3 virginica 4.900 6.225 6.500 6.588 6.900 7.900
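Note that the statistics in the aggregate result are stored as a single matrix column, even though they print like separate columns. If you need ordinary columns, a commonly used idiom (sketched here on iris, not on the question's data) is to rebuild the data frame with do.call:

```r
agg <- aggregate(Sepal.Length ~ Species, iris, summary)
ncol(agg)   # Species plus one matrix column

# Expand the matrix column into one ordinary column per statistic.
flat <- do.call(data.frame, agg)
ncol(flat)  # Species plus six statistic columns
```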
I am getting an error when writing the output of a summary function to a file. I have a column "bin" with three factor levels and want the five-number summary for each level. The summary prints to the screen but is not written to the file. The error I get is:
Empty data.table (0 rows) of 1 col: bin
Data:
A B info C bin
1: 10-60494 0.66392100 0.001833330 1 MAF0.01
2: rs148087467 0.35274000 0.000716240 1 MAF0.01
3: rs187110906 0.40586900 0.004488040 1 MAF0.01
4: rs192025213 0.00743299 0.000000000 1 MAF0.01
5: rs115033199 0.32829300 0.000614316 1 MAF0.01
6: rs183305313 0.51721200 0.002892520 1 MAF0.01
s <- df2[, print(summary(info)), by='bin']
print(s)
write.table(as.data.frame(s),
quote=FALSE,file=paste(i,"sum_out.txt",sep=''))
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0009998 0.0371300 0.2016000 0.2700000 0.4477000 1.0000000
The reason you are getting zero rows is because the only thing you do in j is print the outcome of the summary command.
Considering the following example data:
set.seed(2018)
dt <- data.table(bin = rep(c('A','B'), 5), val = rnorm(10,3,1))
Now when you do (like in your question):
s <- dt[, print(summary(val)), by = bin]
the summary statistics are printed to the console but it results in an empty data.table:
> s <- dt[, print(summary(val)), by = bin]
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.389 2.577 2.936 3.547 4.735 5.099
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.450 2.735 3.271 2.991 3.637 3.863
> s
Empty data.table (0 rows) of 1 col: bin
Removing the print-command doesn't help:
> dt[, summary(val), by = bin]
bin V1
1: A 2.389
2: A 2.577
3: A 2.936
4: A 3.547
5: A 4.735
6: A 5.099
7: B 1.450
8: B 2.735
9: B 3.271
10: B 2.991
11: B 3.637
12: B 3.863
because summary returns a table object, which is treated as a vector by data.table.
Instead of using print, you should use as.list to get the elements of summary as columns in a data.table:
s <- dt[, as.list(summary(val)), by = bin]
now the summary statistics are included in the resulting data.table:
> s
bin Min. 1st Qu. Median Mean 3rd Qu. Max.
1: A 2.389413 2.577016 2.935571 3.547351 4.735284 5.099471
2: B 1.450122 2.735289 3.270881 2.991340 3.637056 3.863351
Because the summary statistics are stored in the non-empty data.table s, you can write s to a file with, for example, fwrite (the fast write function of the data.table package).
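A minimal sketch of that last step, reusing the example data from above. The output path in a temporary directory and the tab separator are assumptions for illustration only:

```r
library(data.table)

set.seed(2018)
dt <- data.table(bin = rep(c('A','B'), 5), val = rnorm(10,3,1))
s <- dt[, as.list(summary(val)), by = bin]

# Hypothetical output location; replace with your own file name.
out_file <- file.path(tempdir(), "sum_out.txt")
fwrite(s, out_file, sep = "\t")

# Read the file back to check what was written.
check <- fread(out_file)
check
```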
This can be achieved using sapply() - here is an example using the iris data frame:
levels <- unique(iris$Species)
result <- data.frame(t(sapply(levels, function (x) summary(subset(iris, Species == levels[x])$Petal.Width))))
> result
Min. X1st.Qu. Median Mean X3rd.Qu. Max.
1 0.1 0.2 0.2 0.246 0.3 0.6
2 1.0 1.2 1.3 1.326 1.5 1.8
3 1.4 1.8 2.0 2.026 2.3 2.5
I have a list of data frames, organized by year. I am using lapply to get the summary for a single variable in each data frame. The output follows the list and gives a summary for each year, one by one. However, I want the output in the form of a single table with years for rows. How do I do this? An example using the iris dataset shows my problem:
x <- split(iris$Sepal.Length, iris$Species)
lapply(x, summary)
And the output is:
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 4.800 5.000 5.006 5.200 5.800
Similarly for the other two.
I want the output organized as a single table like with:
> sapply(x, summary)
setosa versicolor virginica
Min. 4.300 4.900 4.900
1st Qu. 4.800 5.600 6.225
Median 5.000 5.900 6.500
Mean 5.006 5.936 6.588
3rd Qu. 5.200 6.300 6.900
Max. 5.800 7.000 7.900
But with setosa, versicolor, virginica (or years in my case) on the left and Min... Max up top. I can flip the axes around in ggplot, but reading the table as-is is more intuitive with the years on the left. I came across a number of discussions about converting lapply output but the ones I came across were all measuring a single stat like mean or median. Thanks.
This seems like a good time to use by(). It eliminates the need for the call to split(), is all done in one line, and returns a matrix.
with(iris, do.call(rbind, by(Sepal.Length, Species, summary)))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# setosa 4.3 4.800 5.0 5.006 5.2 5.8
# versicolor 4.9 5.600 5.9 5.936 6.3 7.0
# virginica 4.9 6.225 6.5 6.588 6.9 7.9
If you still wish to use the manual split-apply-combine method, then it would be
do.call(rbind, lapply(x, summary))
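Since the question already has the sapply() output, another option is simply to transpose it with t(). A short sketch on the same iris split:

```r
x <- split(iris$Sepal.Length, iris$Species)

# sapply() puts the groups in columns; t() flips them into rows.
tab <- t(sapply(x, summary))
tab
```

Like the rbind version, this returns a matrix with the groups as row names and the statistics as column names.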
If you have a large data.frame, I recommend not splitting it into pieces but using data.table for grouping by year. With the iris data set this could be done as follows:
library(data.table)
setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
# Species Min. 1st Qu. Median Mean 3rd Qu. Max.
#1: setosa 4.3 4.800 5.0 5.006 5.2 5.8
#2: versicolor 4.9 5.600 5.9 5.936 6.3 7.0
#3: virginica 4.9 6.225 6.5 6.588 6.9 7.9
as.list() ensures the output of summary() appears column-wise as requested.
The result is a data.table (not a matrix) which can be used directly in a subsequent ggplot() call.
Note that copy(iris) is only required here because the iris data set is locked to prevent modifying its variable bindings. With your own data.frame df you would simply use setDT(df) to coerce to data.table without copying.
Add-on
The OP mentioned that he uses the result for plotting with ggplot2. Now, ggplot2 works best when data are provided in long format. Reshaping a data.table from wide to long format can be conveniently done with melt()
wideDT <- setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
longDT <- melt(wideDT, id.vars = "Species")
longDT
# Species variable value
# 1: setosa Min. 4.300
# 2: versicolor Min. 4.900
# 3: virginica Min. 4.900
# 4: setosa 1st Qu. 4.800
# 5: versicolor 1st Qu. 5.600
# 6: virginica 1st Qu. 6.225
# 7: setosa Median 5.000
# 8: versicolor Median 5.900
# 9: virginica Median 6.500
#10: setosa Mean 5.006
#11: versicolor Mean 5.936
#12: virginica Mean 6.588
#13: setosa 3rd Qu. 5.200
#14: versicolor 3rd Qu. 6.300
#15: virginica 3rd Qu. 6.900
#16: setosa Max. 5.800
#17: versicolor Max. 7.000
#18: virginica Max. 7.900
I have a data frame generated by t(summary(raw_data())):
Original data frame
However, each cell has a prefix such as Max., Min., Mean, etc. I would like to remove that prefix from each cell and use it as the column header instead. Is there an easy way to do this in R so that the data frame looks like this:
Desired data frame
Variables 3 and 18 are factors; I'm less concerned about those.
We can loop through the columns of the dataset, get the summary and then rbind the output
do.call(rbind, lapply(raw_data, summary))
Using a reproducible example
do.call(rbind, lapply(iris[1:4], summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
#Sepal.Length 4.3 5.1 5.80 5.843 6.4 7.9
#Sepal.Width 2.0 2.8 3.00 3.057 3.3 4.4
#Petal.Length 1.0 1.6 4.35 3.758 5.1 6.9
#Petal.Width 0.1 0.3 1.30 1.199 1.8 2.5
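If the data also contain factor columns (as the question mentions), summary() returns level counts of varying length for them, so they will not line up in rbind. One possible approach, sketched here with iris standing in for raw_data (the question's actual data are not shown), is to keep only the numeric columns first:

```r
raw_data <- iris  # stand-in for the question's data, which is not shown

# Logical vector marking the numeric columns; Species (a factor) is dropped.
num_cols <- sapply(raw_data, is.numeric)

tab <- do.call(rbind, lapply(raw_data[num_cols], summary))
tab
```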
I have data that looks like this:
> x
Date Obs
1/1/2012 4
1/2/2012 40
1/3/2012 50
And a function like this:
myDat <- function(x, summarize)
{
if (summarize == T)
{
print(summary(x))
}
if (missing(summarize) | summarize == F)
{
print(x)
}
}
When I try to run it as:
myDat(x)
I get this error:
Error in summarize == T : 'summarize' is missing
What am I doing wrong here?
Use defaults for your summarize argument and your function simplifies to one line:
myDat <- function(x, summarize=FALSE) { if (summarize) summary(x) else x}
Try it:
head(myDat(iris))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
myDat(iris, s=TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
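As for why the original function fails: the first if evaluates summarize == T before missing(summarize) is ever checked, so R stops as soon as it touches the missing argument. If you would rather keep the missing() approach than use a default, a minimal sketch (myDat2 is a hypothetical name) is to test missing() first and rely on short-circuit evaluation:

```r
myDat2 <- function(x, summarize) {
  # missing() is checked first; || short-circuits, so summarize is
  # never evaluated when it was not supplied.
  if (missing(summarize) || !isTRUE(summarize)) {
    x
  } else {
    summary(x)
  }
}

myDat2(c(1, 2, 3))        # returns x unchanged
myDat2(c(1, 2, 3), TRUE)  # returns the summary
```

The default-argument version is shorter, but this form makes the evaluation order explicit.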