Run a defined function for various variables in R

I have defined a function for descriptive statistics, which is
descriptive_statistics <- function(obj){
  n <- length(obj)
  cat("mean=", mean(obj), "\n")
  cat("s.d.=", sd(obj), "\n")
  cat("min=", min(obj), "\n")
  cat("1. Quartil=", quantile(obj, 0.25), "\n")
  cat("median=", median(obj), "\n")
  cat("3. Quartil=", quantile(obj, 0.75), "\n")
  cat("max=", max(obj), "\n")
  skew <- (sum((obj - mean(obj))^3)/n) / (sum((obj - mean(obj))^2)/n)^(3/2)
  cat("skewness=", skew, "\n")
  kurt <- n * sum((obj - mean(obj))^4) / (sum((obj - mean(obj))^2)^2)
  cat("kurtosis=", kurt, "\n")
  Stat <- n * skew^2/6 + n * (kurt - 3)^2/24
}
For any single variable that I pass to the function, it gives me output like this:
descriptive_statistics(data$ewz_2016)
mean= 232024.4
s.d.= 383172
min= 36514
1. Quartil= 81632
median= 121705
3. Quartil= 230405.5
max= 3516100
skewness= 5.930454
kurtosis= 46.42959
What I would like is to run the function for several variables at once (e.g. the variables in columns 3-25) and afterwards copy the resulting table to Excel for formatting.
Does anyone have any advice?

I would change the function descriptive_statistics to return the values it computes as a named vector. In the version below, the function doesn't print anything; it just computes those values. You can combine the two, your version and this one, if you want.
descriptive_statistics2 <- function(obj){
  n <- length(obj)
  skew <- (sum((obj - mean(obj))^3)/n) / (sum((obj - mean(obj))^2)/n)^(3/2)
  kurt <- n * sum((obj - mean(obj))^4) / (sum((obj - mean(obj))^2)^2)
  Stat <- n * skew^2/6 + n * (kurt - 3)^2/24
  c(mean = mean(obj),
    s.d. = sd(obj),
    min = min(obj),
    Quartil = quantile(obj, 0.25),  # Note: no need to say it's quartile 1 or 3
    median = median(obj),
    Quartil = quantile(obj, 0.75),  # quantile() will append a 25% or 75%
    max = max(obj),
    skewness = skew,
    kurtosis = kurt,
    Stat = Stat
  )
}
Now test it with some data.
set.seed(8085) # make it reproducible
n <- 1e2
dat <- as.data.frame(matrix(rnorm(25*n), ncol = 25))
result <- t(sapply(dat[3:25], descriptive_statistics2))
head(result)
# mean s.d. min Quartil.25% median Quartil.75% max
#V3 0.04496632 0.9454383 -2.114690 -0.6352870 0.13869333 0.6789609 1.901803
#V4 0.08136469 1.0594723 -1.955842 -0.6922518 -0.03305246 0.8274921 2.653876
#V5 -0.05899082 1.0225392 -3.031083 -0.6536158 -0.16813752 0.6497042 2.053943
#V6 0.13282456 0.9746959 -1.715005 -0.5399552 0.05815323 0.8383848 2.308354
#V7 0.03935582 1.1629889 -2.669025 -0.6569735 -0.01753087 0.6976987 3.277311
#V8 -0.08203328 1.1111241 -2.755361 -0.9619138 -0.08806534 0.8730580 2.547580
# skewness kurtosis Stat
#V3 -0.20061108 2.414033 2.1014041
#V4 0.32427896 2.455259 2.9890412
#V5 -0.21501670 2.758610 1.0133248
#V6 0.16328790 2.322679 2.3558991
#V7 0.09352129 2.977775 0.1478287
#V8 -0.05664655 2.509671 1.0552417
Finally, to export the result to Excel, use write.csv if your country uses the period as the decimal separator, or write.csv2 if it uses the comma.
write.csv(result, file = "SCW16.csv")  # keep the default row names so the variable names are exported too
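If you would rather have the variable names as a regular column in the exported file, a small sketch (the name result_df is just illustrative, not from the original answer):
# turn the row names (the variable names) into an explicit column before exporting
result_df <- data.frame(variable = rownames(result), result, row.names = NULL, check.names = FALSE)
write.csv(result_df, file = "SCW16.csv", row.names = FALSE)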


Summarise multiple columns using multiple functions in base R and dplyr

The data is something like this:
> head(r)
area peri shape perm
1 4990 2791.90 0.0903296 6.3
2 7002 3892.60 0.1486220 6.3
3 7558 3930.66 0.1833120 6.3
4 7352 3869.32 0.1170630 6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1
I want to apply multiple functions to each column. What I currently have is this function:
analysis <- function(df){
  measurements <- data.frame(attribute = character(),
                             mean = double(),
                             median = double(),
                             variance = double(),
                             IQR = double())
  for (i in 1:ncol(df)){
    names <- colnames(df)[i]
    temp <- data.frame(attribute = names,
                       mean = mean(df[, i]),
                       median = median(df[, i]),
                       variance = var(df[, i]),
                       IQR = IQR(df[, i]))
    measurements <- rbind(measurements, temp)
  }
  return(measurements)
}
It works well and achieves what I want, giving the following output:
attribute mean median variance IQR
1 area 7187.7291667 7487.000000 7.203045e+06 3564.2500000
2 peri 2682.2119375 2536.195000 2.049654e+06 2574.6150000
3 shape 0.2181104 0.198862 6.971657e-03 0.1004083
4 perm 415.4500000 130.500000 1.916848e+05 701.0500000
However, my supervisor said it is not efficient and is not "thinking in an R way".
I also tried summarise_each() and summarise_all(r, funs(mean, median, var, IQR)), but they don't achieve what I want and the output doesn't look nice.
What are some other ways to achieve that output using only base R or dplyr?
I suspect your supervisor's comment about 'R-style' thinking was about using that for loop. Almost any for loop you write can be replaced by the apply family of functions (e.g. apply, sapply, lapply, etc.).
They make it easier to run functions over vectors/data.frames/lists/etc.
Everything you can do with apply functions could be replicated with for loops (usually with similar performance), so using for loops isn't actually a cardinal sin. Why use apply functions? Once you learn them, you get more succinct code that returns the results of running your functions on your data. Before long you'll find this sort of code very intuitive, and even more readable than for loops.
Base R
df <- data.frame(
  area  = c(4990, 7002, 7558, 7352, 7943),
  peri  = c(2791.9, 3892.6, 3930.66, 3869.32, 3948.54),
  shape = c(.0903296, .148622, .183312, .117063, .122417),
  perm  = c(6.3, 6.3, 6.3, 6.3, 17.1)
)
sapply(df, function(x) c(mean = mean(x), median = median(x), var = var(x), IQR = IQR(x)))
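If you want exactly the layout shown in the question (one row per column, with an attribute field), one possible follow-up to the sapply call above is to transpose the result and turn the row names into a column (the name res is just illustrative):
res <- t(sapply(df, function(x) c(mean = mean(x), median = median(x), variance = var(x), IQR = IQR(x))))
# turn the row names into an explicit 'attribute' column
data.frame(attribute = rownames(res), res, row.names = NULL)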
Your results can be achieved using base::Map:
f <- function(x) {
  desc <- base::summary(x)
  c(
    Mean = unname(desc['Mean']),
    Median = unname(desc['Median']),
    Variance = base::sum((x - desc['Mean'])**2) / (length(x) - 1),
    IQR = unname(desc['3rd Qu.'] - desc['1st Qu.'])
  )
}
t(as.data.frame(base::Map(f, df)))
# Mean Median Variance IQR
# area 7137.3333333 7455.0000000 1.241980e+06 757.25000000
# peri 3740.5283333 3911.6300000 2.183447e+05 68.93000000
# shape 0.1381314 0.1355195 1.192633e-03 0.04403775
# perm 9.9000000 6.3000000 3.110400e+01 8.10000000
Apologies: the base::Map output above was computed with the full six-row data set from the question:
df <- data.frame(
area = c(4990, 7002, 7558, 7352, 7943, 7979),
peri = c(2791.9, 3892.6, 3930.66, 3869.32, 3948.54, 4010.15),
shape = c(.0903296, .148622, .183312, .117063, .122417, .167045),
perm = c(6.3, 6.3, 6.3, 6.3, 17.1, 17.1)
)
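Since the question also asks about dplyr, here is one possible sketch (not part of the original answer; it assumes dplyr >= 1.0 and tidyr are installed) using summarise(across()) and pivot_longer() to get the same one-row-per-column layout:
library(dplyr)
library(tidyr)
df %>%
  summarise(across(everything(),
                   list(mean = mean, median = median, variance = var, IQR = IQR))) %>%
  pivot_longer(everything(),
               names_to = c("attribute", ".value"),
               names_sep = "_")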
Hope that's useful.

Expected return and covariance from return time series

I'm trying to replicate the Matlab ewstats function defined here:
https://it.mathworks.com/help/finance/ewstats.html
The results given by Matlab are the following ones:
> ExpReturn = 1×2
0.1995 0.1002
> ExpCovariance = 2×2
0.0032 -0.0017
-0.0017 0.0010
I’m trying to replicate the example with the RiskPortfolios R package:
https://cran.r-project.org/web/packages/RiskPortfolios/RiskPortfolios.pdf
The R code I’m using is this one:
library(RiskPortfolios)
rets <- as.matrix(cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13)))
w <- 0.98
rets
w
meanEstimation(rets, control = list(type = 'ewma', lambda = w))
covEstimation(rets, control = list(type = 'ewma', lambda = w))
The mean estimate is the same as the one in the example, but the covariance matrix is different:
> rets
[,1] [,2]
[1,] 0.24 0.08
[2,] 0.15 0.13
[3,] 0.27 0.06
[4,] 0.14 0.13
> w
[1] 0.98
>
> meanEstimation(rets, control = list(type = 'ewma', lambda = w))
[1] 0.1995434 0.1002031
>
> covEstimation(rets, control = list(type = 'ewma', lambda = w))
[,1] [,2]
[1,] 0.007045044 -0.003857217
[2,] -0.003857217 0.002123827
Am I missing something?
Thanks
They give the same answer if type = "lw" is used:
round(covEstimation(rets, control = list(type = 'lw')), 4)
## 0.0032 -0.0017
## -0.0017 0.0010
They are using different algorithms. From the RiskPortfolios manual:
ewma ... See RiskMetrics (1996)
From the Matlab help page:
There is no relationship between ewstats function and the RiskMetrics® approach for determining the expected return and covariance from a return time series.
Unfortunately Matlab does not tell us which algorithm is used.
For those who eventually need an equivalent ewstats function in R, here is the code I wrote:
ewstats <- function(RetSeries, DecayFactor=NULL, WindowLength=NULL){
#EWSTATS Expected return and covariance from return time series.
# Optional exponential weighting emphasizes more recent data.
#
# [ExpReturn, ExpCovariance, NumEffObs] = ewstats(RetSeries, ...
# DecayFactor, WindowLength)
#
# Inputs:
# RetSeries : NUMOBS by NASSETS matrix of equally spaced incremental
# return observations. The first row is the oldest observation, and the
# last row is the most recent.
#
# DecayFactor : Controls how much less each observation is weighted than its
# successor. The k'th observation back in time has weight DecayFactor^k.
# DecayFactor must lie in the range: 0 < DecayFactor <= 1.
# The default is DecayFactor = 1, which is the equally weighted linear
# moving average Model (BIS).
#
# WindowLength: The number of recent observations used in
# the computation. The default is all NUMOBS observations.
#
# Outputs:
# ExpReturn : 1 by NASSETS estimated expected returns.
#
# ExpCovariance : NASSETS by NASSETS estimated covariance matrix.
#
# NumEffObs: The number of effective observations is given by the formula:
# NumEffObs = (1-DecayFactor^WindowLength)/(1-DecayFactor). Smaller
# DecayFactors or WindowLengths emphasize recent data more strongly, but
# use less of the available data set.
#
# The standard deviations of the asset return processes are given by:
# STDVec = sqrt(diag(ECov)). The correlation matrix is :
# CorrMat = VarMat./( STDVec*STDVec' )
#
# See also MEAN, COV, COV2CORR.
NumObs <- dim(RetSeries)[1]
NumSeries <- dim(RetSeries)[2]
# size the series and the window
if (is.null(WindowLength)) {
WindowLength <- NumObs
}
if (is.null(DecayFactor)) {
DecayFactor = 1
}
if (DecayFactor <= 0 | DecayFactor > 1) {
stop('Must have 0< decay factor <= 1.')
}
if (WindowLength > NumObs){
stop(sprintf('Window Length %d must be <= number of observations %d',
WindowLength, NumObs))
}
# ------------------------------------------------------------------------
# size the data to the window
RetSeries <- RetSeries[(NumObs - WindowLength + 1):NumObs, ]
# Calculate decay coefficients
DecayPowers <- seq(WindowLength-1, 0, by = -1)
VarWts <- sqrt(DecayFactor)^DecayPowers
RetWts <- (DecayFactor)^DecayPowers
NEff = sum(RetWts) # number of equivalent values in computation
# Compute the exponentially weighted mean return
WtSeries <- matrix(rep(RetWts, times = NumSeries),
nrow = length(RetWts), ncol = NumSeries) * RetSeries
ERet <- colSums(WtSeries)/NEff;
# Subtract the weighted mean from the original Series
CenteredSeries <- RetSeries - matrix(rep(ERet, each = WindowLength),
nrow = WindowLength, ncol = length(ERet))
# Compute the weighted variance
WtSeries <- matrix(rep(VarWts, times = NumSeries),
nrow = length(VarWts), ncol = NumSeries) * CenteredSeries
ECov <- t(WtSeries) %*% WtSeries / NEff
list(ExpReturn = ERet, ExpCovariance = ECov, NumEffObs = NEff)
}
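As a quick sanity check (not part of the original answer), calling the function with the return series from the question should reproduce the ExpReturn and ExpCovariance values quoted from the Matlab documentation above:
rets <- cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13))
res <- ewstats(rets, DecayFactor = 0.98)
res$ExpReturn                  # expected to be close to 0.1995 0.1002
round(res$ExpCovariance, 4)    # expected to be close to the 2x2 matrix shown above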

R: Error in calculating the average of a variable at different time intervals for many factors using for loop

I have a data frame in which a variable (var1) is recorded over time in seconds. I want to calculate the mean of var1 for each sample over 10-second intervals (up to 500 seconds).
The data frame looks like this:
sample time var1
S1 1 3.5
S1 2 6.3
S1 3 7.8
S1 4 20.5
S1 … ...
S1 530 4.5
S2 1 6.7
S2 2 20.3
S2 3 5.4
S2 … ...
S2 710 70.3
...
The data frame that I want to obtain looks like this
Sample var1_mean10:20sec var1_mean20:30sec .... var1_mean490:500sec
S1
S2
..
So I wrote this code:
setwd("…")
A <- read_excel("dati.xlsx")
for (cat in unique(A$sample))
{
  A.s <- subset(A, A$sample == cat)
  cuts <- cut(A.s$time, breaks = seq.int(from = 0, to = 500, by = 10))
  d <- by(A.s$var1, cuts, mean)
  Y <- data.frame(d)
  j <- t(Y)
  write.csv(Y, file = paste(cat, "var1", sep = "_"))
}
But when I run it, I get this error message: Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ""by"" to a data.frame
The plan is to eventually merge all the different CSV files.
If I understood your problem correctly, you are trying to average your data over 10-second intervals. I would like to propose an alternative approach using the function aggregate to compute the mean across the 10-second intervals. The intervals are created through an auxiliary 'times' column, which is then used to group the observations before averaging.
# create some data similar to yours
A <- data.frame(sample = c(rep('A1', 530), rep('A2', 710)),
                time = c(1:530, 1:710), var1 = runif(530 + 710))
A$times <- ceiling(A$time / 10)
Y <- aggregate(var1 ~ sample + times, data = A, FUN = mean)
Then you could export Y straight away.
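If you need the wide layout from the question (one row per sample, one column per 10-second interval), a possible follow-up using base reshape() (not part of the original answer; Y is the aggregate result from above):
# keep only the first 500 seconds (intervals 1-50), then spread the intervals into columns
Y500 <- Y[Y$times <= 50, ]
wide <- reshape(Y500, idvar = "sample", timevar = "times", direction = "wide")
head(wide)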
HTH
Solved:
A <- read_excel("data.xlsx")
n <- subset(A, time <= 500)
d <- data.frame(sample = n$sample, time = n$time, ms = n$var1)
storage.data <- data.frame(matrix(nrow = 50, ncol = 0))  # 50 ten-second intervals
for (cat in unique(d$sample)){
  g <- subset(d, d$sample == cat)
  cuts <- cut(g$time, breaks = seq.int(from = 0, to = 500, by = 10))
  p <- by(g$ms, cuts, mean)
  storage.data[cat] <- p
}
View(storage.data)
storage.data_t <- t(storage.data)
View(storage.data_t)
write.csv(storage.data_t, file = "filename.csv")

Large raster frequency table / counts

I am trying to calculate the frequency/count of pixel values of a raster in R using freq().
Create two example rasters for comparison:
library(raster)
RastSmall <- raster(nrow = 70, ncol = 70)
RastBig <- raster(nrow = 7000, ncol = 7000)
set.seed(0)
RastSmall[] <- round(runif(ncell(RastSmall), 1, 5))
RastBig[] <- round(runif(ncell(RastBig), 1, 5))
Get the pixel counts using freq():
freq(RastBig)
value count
[1,] 1 6540000
[2,] 2 12150000
[3,] 3 12140000
[4,] 4 11720000
[5,] 5 6450000
However, for a fairly large file this takes extremely long, i.e. up to hours. Is there a faster way in R?
Here is the speed difference for a small and a large raster:
system.time(freq(RastSmall))
user system elapsed
0.008 0.000 0.004
system.time(freq(RastBig))
user system elapsed
40.484 0.964 41.445
Is there a way to speed this up? Alternatively, can this be done on the command line using something like the GDAL tools?
I did exactly that last week; however, I couldn't find any faster way to do it in R. I tried doing it with the RQGIS package by calling r.report from GRASS. It works, but it was slower than the native R function. Maybe you'll have better luck. Here is my code with GRASS in case you want to try it:
library(RQGIS)
monqgis <- set_env("C:\\Mrnmicro\\Applic\\OSGeo4W")
find_algorithms(search_term = "report", qgis_env = monqgis)
get_usage(alg = "grass7:r.report", qgis_env = monqgis)
params <- get_args_man(alg = "grass7:r.report", qgis_env = monqgis)
get_usage(alg = "grass7:r.report", qgis_env = monqgis)
params$map <- classif  # classif is the raster layer being summarised
params$units <- 5
params$rawoutput <- "C:\\temp\\outputRQGIS_raw"
params$html <- "C:\\temp\\outputRQGIS"
system.time(asas <- run_qgis(alg = "grass7:r.report", params=params,load_output = params$OUTPUT, qgis_env = monqgis))
Not an amazing saving, but if you use getValues on your raster and then run the base::table function, it saves about 20%. My raster was c. 500 million cells.
# read in raster to obtain frequency table
r <- raster("./path/myraster.tif")
# perform tests; traditional freq() vs. getValues() & table()
require(microbenchmark)
mbm <- microbenchmark(
  Freq = {freqf <- freq(r, useNA = "no");
          freq.df <- data.frame(CODE = freqf[, 1], N = freqf[, 2])},
  GetVals = {v <- getValues(r);
             vt <- table(v);
             getval.df <- data.frame(CODE = as.numeric(names(vt)), N = as.numeric(as.matrix(vt)))},
  times = 5
)
mbm
Unit: seconds
expr min lq mean median uq max neval
Freq 191.1649 191.8001 198.8567 192.5256 193.0986 225.6942 5
GetVals 153.5552 154.8776 156.9173 157.0539 159.0400 160.0598 5
# check the routines have identical results
identical(freq.df,getval.df)
[1] TRUE
A bit of a saving, I guess.
(N.B. the reason I make the data frames is that I go on to process the data that comes out of the frequency analysis.)
I think the most effective way to calculate this is by using GetHistogram() from GDAL. Unfortunately, I can't find a way to call it directly from R. The closest approach is using gdalUtilities::gdalinfo from R with the -hist flag (hist = TRUE), but the calculation is limited to values between 0 and 255.
Another option is using rasterDT::freqDT, which is faster than the regular options. Here is an example:
library(gdalUtilities)
library(raster)
library(rasterDT)
library(microbenchmark)
RastBig <- raster(nrow=7000, ncol=7000)
set.seed(0)
RastBig[] <- round(runif(1:ncell(RastBig), 1, 5))
writeRaster(RastBig, filename = 'C:/temp/RastBig.tif')
mbm <- microbenchmark(times = 50,
freq1 = freq(RastBig),
freq2 = table(RastBig[]),
freq3 = freqDT(RastBig),
freq4 = ({
gdalLog <- capture.output(gdalUtilities::gdalinfo(datasetname = 'C:/temp/RastBig.tif', hist = TRUE));
(bucxml <- as.numeric(sub('buckets.+', '', grep('buckets ', gdalLog, value = TRUE))));
(minxml <- as.numeric(gsub('.+from | to.+', '', grep('buckets ', gdalLog, value = TRUE)) ));
(maxxml <- as.numeric(gsub('.+to |:', '', grep('buckets ', gdalLog, value = TRUE))));
(histxml <- as.numeric(strsplit(split = '[[:space:]]', gsub("^ |^ ", "", gdalLog[grep('buckets', gdalLog)+1]))[[1]]));
labs <- seq(from = minxml, to = maxxml, length.out = bucxml);
df <- data.frame(labs, nwlab = c(ceiling(labs[1]),
round(labs[2:(bucxml-1)]),
floor(labs[bucxml])),
val = histxml);
hist <- aggregate(df$val, by = list(df$nwlab), sum)})
)
Results:
> freq1
value count
[1,] 1 6127755
[2,] 2 12251324
[3,] 3 12249376
[4,] 4 12248938
[5,] 5 6122607
> freq2
1 2 3 4 5
6127755 12251324 12249376 12248938 6122607
> freq3
ID freq
1: 1 6127755
2: 2 12251324
3: 3 12249376
4: 4 12248938
5: 5 6122607
> freq4
Group.1 x
1 1 6127755
2 2 12251324
3 3 12249376
4 4 12248938
5 5 6122607
Unit: milliseconds
expr min lq mean median uq max neval
freq1 58628.486301 59100.539302 59400.304887 59383.913701 59650.412 60841.3975 50
freq2 55912.170401 56663.025202 56954.032395 56919.905051 57202.001 58307.9500 50
freq3 3785.767301 4006.858102 4288.699531 4292.447250 4536.382 4996.0598 50
freq4 7.892201 8.883102 9.255641 9.154001 9.483 15.6072 50
EDIT: using this is quite a bit faster than option 3:
rB <- raster('C:/temp/RastBig.tif')
freq3B <- freqDT(rB)

Store or print results for 't.test' in for loop

I am new to R and am having a problem printing the results of a 'for' loop. Here is my code:
afile <- read.table(file = 'data.txt', head =T)##Has three columns Lab, Store and Batch
lab1 <- afile$Lab[afile$Batch == 1]
lab2 <- afile$Lab[afile$Batch == 2]
lab3 <- afile$Lab[afile$Batch == 3]
lab_list <- list(lab1,lab2,lab3)
for (i in 1:2){
  x <- lab_list[[i]]
  y <- lab_list[[i+1]]
  t.test(x, y, alternative = 'two.sided', conf.level = 0.95)
}
This code runs without any error but produces no output on screen. I tried capturing the results in variables using 'assign', but that produces an error:
for (i in 1:2){x=lab_list[[i]];y=lab_list[[i+1]];assign(paste(res,i,sep=''),t.test(x,y,alternative='two.sided',conf.level=0.95))}
Warning messages:
1: In assign(paste(res, i, sep = ""), t.test(x, y, alternative = "two.sided", :
only the first element is used as variable name
2: In assign(paste(res, i, sep = ""), t.test(x, y, alternative = "two.sided", :
only the first element is used as variable name
Please help me with how I can perform t.test in a loop and get the results, i.e. print them on screen or save them in a variable.
AK
I would rewrite your code like this.
I assume your data is like this:
afile <- data.frame(Batch= sample(1:3,10,rep=TRUE),lab=rnorm(10))
afile
Batch lab
1 2 0.4075675
2 1 0.3006192
3 1 -0.4824655
4 3 1.0656481
5 1 0.1741648
6 2 -1.4911526
7 2 0.2216970
8 1 -0.3862147
9 1 -0.4578520
10 1 -0.6298040
Then, using lapply, you can store your results in a list:
lapply(1:2, function(i){
  x <- subset(afile, Batch == i)
  y <- subset(afile, Batch == i+1)
  t.test(x, y, alternative = 'two.sided', conf.level = 0.95)
})
[[1]]
Welch Two Sample t-test
data: x and y
t = -0.7829, df = 6.257, p-value = 0.4623
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.964637 1.005008
sample estimates:
mean of x mean of y
0.3765373 0.8563520
[[2]]
Welch Two Sample t-test
data: x and y
t = -1.0439, df = 1.797, p-value = 0.4165
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.588720 4.235776
sample estimates:
mean of x mean of y
0.856352 2.032824
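If you assign that lapply result to a variable, you can also pull individual statistics out of the htest objects afterwards; a small sketch (the names res and comparison are just illustrative):
res <- lapply(1:2, function(i){
  x <- subset(afile, Batch == i)
  y <- subset(afile, Batch == i+1)
  t.test(x, y, alternative = 'two.sided', conf.level = 0.95)
})
# each element of res is an htest object with components such as p.value and statistic
data.frame(comparison = paste(1:2, "vs", 2:3),
           p.value    = sapply(res, function(tt) tt$p.value),
           statistic  = sapply(res, function(tt) unname(tt$statistic)))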
In a loop, you need to explicitly print your results in many cases. Try:
print(t.test(x, y, alternative = 'two.sided', conf.level = 0.95))
or
print(summary(t.test(x, y, alternative = 'two.sided', conf.level = 0.95)))
In addition to Hanson's solution of printing, the results can be saved and then printed like this:
result <- vector("list", 6)
for (i in 1:5){
  x <- lab_list[[i]]
  y <- lab_list[[i+1]]
  result[[i]] <- t.test(x, y, alternative = 'two.sided', conf.level = 0.95)
}
result
AK
