Summarise multiple columns with multiple functions using base R and dplyr

The data is something like this:
> head(r)
area peri shape perm
1 4990 2791.90 0.0903296 6.3
2 7002 3892.60 0.1486220 6.3
3 7558 3930.66 0.1833120 6.3
4 7352 3869.32 0.1170630 6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1
I want to apply several summary functions to each column. What I currently have is this function:
analysis = function(df){
  measurements = data.frame(attributes = character(),
                            mean = double(),
                            median = double(),
                            variance = double(),
                            IQR = double())
  for (i in 1:ncol(df)){
    names = colnames(df)[i]
    temp = data.frame(attribute = names,
                      mean = mean(df[,i]),
                      median = median(df[,i]),
                      variance = var(df[,i]),
                      IQR = IQR(df[,i]))
    measurements = rbind(measurements, temp)
  }
  return (measurements)
}
It works well and achieves what I want, giving the following output:
attribute mean median variance IQR
1 area 7187.7291667 7487.000000 7.203045e+06 3564.2500000
2 peri 2682.2119375 2536.195000 2.049654e+06 2574.6150000
3 shape 0.2181104 0.198862 6.971657e-03 0.1004083
4 perm 415.4500000 130.500000 1.916848e+05 701.0500000
However, my supervisor said it is not efficient and is not thinking in an R way.
I also tried summarise_each() and summarise_all(r, funs(mean, median, var, IQR)), but neither achieves what I want and the output doesn't look nice.
What are some other ways to achieve that output using only base R or dplyr?

I suspect your supervisor's comment about 'R'-style thinking was about using that for loop. Almost any for loop you write can be replaced by the apply family of functions (e.g. apply, sapply, lapply, etc.).
They make it easier to run functions on vectors/data.frames/lists/etc.
Everything you can do with apply functions could be replicated with for loops (usually with similar performance), so using for loops isn't actually a cardinal sin. Why use apply functions, then? Once you learn them, you get more succinct code that returns the results of running your functions on your data. Before long you'll find this sort of code very intuitive, and even more readable than for loops.
Base R
df <- data.frame(
  area  = c(4990, 7002, 7558, 7352, 7943, 7979),
  peri  = c(2791.9, 3892.6, 3930.66, 3869.32, 3948.54, 4010.15),
  shape = c(.0903296, .148622, .183312, .117063, .122417, .167045),
  perm  = c(6.3, 6.3, 6.3, 6.3, 17.1, 17.1)
)
sapply(df, function(x) c(mean=mean(x), median=median(x), var=var(x), IQR=IQR(x)))
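And since the question asks about dplyr too, here is a sketch using across() together with tidyr's pivot_longer(); this assumes dplyr >= 1.0 and tidyr >= 1.0 are available (on older versions summarise_all() plus a manual reshape would be the equivalent):
library(dplyr)
library(tidyr)
df %>%
  # one column per attribute/statistic pair, named e.g. "area_mean"
  summarise(across(everything(),
                   list(mean = mean, median = median, variance = var, IQR = IQR))) %>%
  # split those names into an attribute column plus one column per statistic
  pivot_longer(everything(),
               names_to = c("attribute", ".value"),
               names_sep = "_")
This returns a tibble with one row per attribute and one column per statistic, essentially the same layout your analysis() function produces.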

Your results can be achieved using base::Map:
f <- function(x) {
  desc = base::summary(x)
  c(
    Mean = unname(desc['Mean']),
    Median = unname(desc['Median']),
    Variance = base::sum((x - desc['Mean'])^2) / (length(x) - 1),
    IQR = unname(desc['3rd Qu.'] - desc['1st Qu.'])
  )
}
t(as.data.frame(base::Map(f, df)))
# Mean Median Variance IQR
# area 7137.3333333 7455.0000000 1.241980e+06 757.25000000
# peri 3740.5283333 3911.6300000 2.183447e+05 68.93000000
# shape 0.1381314 0.1355195 1.192633e-03 0.04403775
# perm 9.9000000 6.3000000 3.110400e+01 8.10000000
Hope that's useful.

Related

Run a defined function for various variables in R

I have defined a function for descriptive statistics, which is
descriptive_statistics <- function(obj){
  n <- length(obj)
  cat("mean=", mean(obj), "\n")
  cat("s.d.=", sd(obj), "\n")
  cat("min=", min(obj), "\n")
  cat("1. Quartil=", quantile(obj, 0.25), "\n")
  cat("median=", median(obj), "\n")
  cat("3. Quartil=", quantile(obj, 0.75), "\n")
  cat("max=", max(obj), "\n")
  skew <- (sum((obj-mean(obj))^3)/n)/(sum((obj-mean(obj))^2)/n)^(3/2)
  cat("skewness=", skew, "\n")
  kurt <- n*sum((obj-mean(obj))^4)/(sum((obj-mean(obj))^2)^2)
  cat("kurtosis=", kurt, "\n")
  Stat <- n*skew^2/6 + n*(kurt - 3)^2/24
}
For any variable that I put in the command, it gives me this output:
descriptive_statistics(data$ewz_2016)
mean= 232024.4
s.d.= 383172
min= 36514
1. Quartil= 81632
median= 121705
3. Quartil= 230405.5
max= 3516100
skewness= 5.930454
kurtosis= 46.42959
My wish would be to run the function on several variables at once (e.g. the variables in columns 3-25) and afterwards copy the table to Excel to format it.
Does anyone have a piece of advice?
I would change the function descriptive_statistics to return the values it computes in the form of a named vector. In what follows the function doesn't print anything; it just computes those values. You can combine the two, yours and this version of it, if you want.
descriptive_statistics2 <- function(obj){
  n <- length(obj)
  skew <- (sum((obj-mean(obj))^3)/n)/(sum((obj-mean(obj))^2)/n)^(3/2)
  kurt <- n*sum((obj-mean(obj))^4)/(sum((obj-mean(obj))^2)^2)
  Stat <- n*skew^2/6 + n*(kurt - 3)^2/24
  c(mean = mean(obj),
    s.d. = sd(obj),
    min = min(obj),
    Quartil = quantile(obj, 0.25), # no need to say quartile 1 or 3:
    median = median(obj),
    Quartil = quantile(obj, 0.75), # quantile() appends a 25% or 75% itself
    max = max(obj),
    skewness = skew,
    kurtosis = kurt,
    Stat = Stat
  )
}
Now test it with some data.
set.seed(8085) # make it reproducible
n <- 1e2
dat <- as.data.frame(matrix(rnorm(25*n), ncol = 25))
result <- t(sapply(dat[3:25], descriptive_statistics2))
head(result)
# mean s.d. min Quartil.25% median Quartil.75% max
#V3 0.04496632 0.9454383 -2.114690 -0.6352870 0.13869333 0.6789609 1.901803
#V4 0.08136469 1.0594723 -1.955842 -0.6922518 -0.03305246 0.8274921 2.653876
#V5 -0.05899082 1.0225392 -3.031083 -0.6536158 -0.16813752 0.6497042 2.053943
#V6 0.13282456 0.9746959 -1.715005 -0.5399552 0.05815323 0.8383848 2.308354
#V7 0.03935582 1.1629889 -2.669025 -0.6569735 -0.01753087 0.6976987 3.277311
#V8 -0.08203328 1.1111241 -2.755361 -0.9619138 -0.08806534 0.8730580 2.547580
# skewness kurtosis Stat
#V3 -0.20061108 2.414033 2.1014041
#V4 0.32427896 2.455259 2.9890412
#V5 -0.21501670 2.758610 1.0133248
#V6 0.16328790 2.322679 2.3558991
#V7 0.09352129 2.977775 0.1478287
#V8 -0.05664655 2.509671 1.0552417
Finally, to export the result to Excel, use write.csv if your country uses the period as the decimal separator, or write.csv2 if it uses the comma. Keep the default row.names = TRUE so each row stays labelled with its variable name:
write.csv(result, file = "SCW16.csv")
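If you would rather skip the CSV step, one option (my suggestion, not something from the question) is the writexl package, which writes a real .xlsx file. write_xlsx() ignores row names, so move the variable labels into an ordinary column first:
library(writexl)
# rownames(result) hold the variable names (V3, V4, ...), so keep them as a column
write_xlsx(data.frame(variable = rownames(result), result), "result.xlsx")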

R reduce code format

I'll try to explain what I need help with.
Example: I want to add a design parameter to my dataframe "transactionTableMergeCost".
This is my attempt to do so:
transactionTableMergeCost$roi<-(transactionTableMergeCost$revenue-transactionTableMergeCost$spend)/transactionTableMergeCost$spend
Can this code be shorter (something like this)?
transactionTableMergeCost->
#this$roi<-(#this$revenue - #this$spend) / #this$spend
Or is the first way the shortest one possible?
with() is the simplest way to go IMO... And using shorter names wouldn't hurt!
tbl <- data.frame(revenue = rnorm(n = 10, mean = 10000, sd = 1000),
                  spend   = rnorm(n = 10, mean = 9000,  sd = 1000))
tbl$roi <- with(data = tbl, expr = (revenue-spend)/spend)
tbl
revenue spend roi
1 10900.029 8286.808 0.31534715
2 8998.217 11095.703 -0.18903590
3 10204.678 9394.989 0.08618313
4 10218.754 9365.915 0.09105775
5 9147.773 8023.789 0.14008150
6 9573.119 8538.044 0.12123093
7 8991.229 10439.290 -0.13871259
8 11588.986 9844.280 0.17723050
9 9535.081 9055.307 0.05298270
10 10280.902 8352.768 0.23083772
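Another base-R spelling, if you prefer the assignment to happen "inside" the data frame, is within(); this is just a sketch of the same computation:
# within() evaluates the assignment with the columns in scope and
# returns the modified copy, so no tbl$ prefixes are needed
tbl <- within(tbl, roi <- (revenue - spend) / spend)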

Performing simulations and combining the data into one data frame

For each reported study, I want to do 1000 simulations of a parameter X using normal or log-normal distribution (based on a flag) and then combine all the simulations in one data frame. I am looking for an automated way of doing this.
What I have is a data frame with the following columns:
SOURCE NSUB MEAN SD DIST
Study1 10 1.5 0.3 0
Study2 5 2.5 0.4 1
Study1 4 3.5 0.3 0
When DIST==0 it is a normal distribution; when DIST==1 it is log-normal.
I am able to do the simulations and combine them using hard coding, for example:
#for Study1:
set.seed(1)
NSUB <- 10
MEAN <- 1.5
SD <- 0.3
DIST <- 0 #Normal distribution
df1 <- data.frame("SOURCE"="Study1","NSUB"=NSUB,"DIST"=DIST, "VALUE" = rnorm(1000, mean=MEAN, sd=SD))
#For study2
set.seed(2)
NSUB <- 5
MEAN <- 2.5
SD <- 0.4
DIST <- 1 #log-normal distribution
df2 <- data.frame("SOURCE"="Study2","NSUB"=NSUB,"DIST"=DIST, "VALUE" = rlnorm(1000, meanlog=log(MEAN), sdlog=SD))
#Combine all
dfall <- rbind(df1,df2)
However, this would be tedious as I have a lot of reported means and SDs for the parameter. I need help making this automated, so that it does 1000 simulations for each row (using its MEAN and SD) and then combines all the simulated data into one data frame.
In the interest of implementing readable and general code, you should do two things here:
Write a function that takes each row of your simulation configuration dataset and returns the simulated values as a data_frame (doSim below). This makes it easier to test your simulation code separately from your iteration over simulation configurations.
Use dplyr to pass each row of the configuration dataset to this function, and collect up the results as a data_frame.
Here is some sample code:
library(dplyr)
# read in the simultation configuration dataset
dfX = read.table(textConnection("
SOURCE NSUB MEAN SD DIST
Study1 10 1.5 0.3 0
Study2 5 2.5 0.4 1
Study1 4 3.5 0.3 0"),
header = TRUE, stringsAsFactors = FALSE)
# write a function that takes each row of the configuration
# data.frame and returns the simulations
doSim = function(simConfig, seed = 12345) {
  set.seed(seed)
  # pick the sampler based on the DIST flag (0 = normal, 1 = log-normal)
  dist = if(simConfig[["DIST"]] == 0) rnorm else rlnorm
  # rlnorm() wants the mean on the log scale
  mean = if(simConfig[["DIST"]] == 0) simConfig[["MEAN"]] else log(simConfig[["MEAN"]])
  # note: mean/sd partially match rnorm's mean/sd and rlnorm's meanlog/sdlog
  return(
    data_frame(   # data_frame() is superseded by tibble() in current dplyr
      source = simConfig[["SOURCE"]],
      nsub = simConfig[["NSUB"]],
      value = dist(1000, mean = mean, sd = simConfig[["SD"]])
    )
  )
}
# test the function
doSim(dfX[1, ])
# apply over dfX
dfX %>%
  rowwise() %>%
  do(doSim(.))
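The same iteration can be written in base R with lapply() plus do.call(rbind, ...); the per-row seed offset below is my own addition so the studies don't all reuse the identical random stream:
# run doSim() once per configuration row, then stack the pieces
simList <- lapply(seq_len(nrow(dfX)), function(i) doSim(dfX[i, ], seed = 12345 + i))
dfAll <- do.call(rbind, simList)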

trying to perform a t.test for each row and count all rows where p-value is less than 0.05

I've been wrecking my head for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
DataSample[,1:12],1,function(x) t.test(x,DataSample[,13:24], var.equal=T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around Stack Overflow, but I would often end up with syntax errors or a dimensional mismatch. The code above is the best I could get without R returning an error -- but since the code returns the wrong answer I have nothing to feel proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
The problem with your apply() call is that DataSample[,13:24] passes the entire 1000-row control matrix to every test, rather than only the control values from the matching row. One option is to loop over the data set, calculating the t test for each row, though it is not as elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0, nrow(DataSample))
for (i in 1:nrow(DataSample)){
  # equal variance applies, per the question
  pvalue[i] <- t.test(DataSample[i, 1:12], DataSample[i, 13:24], var.equal = TRUE)$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
# unlist() turns the one-row data.tables into the numeric vectors t.test() expects
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
  t.test(unlist(DataSample.dt[x, paste0('Trial', 1:12), with=F]),
         unlist(DataSample.dt[x, paste0('Control', 13:24), with=F]),
         var.equal=T)$p.value) < 0.05)
To do a paired t test, you need to supply the paired = TRUE parameter. The t.test function isn't vectorised, but it's quite simple to do t tests on a whole matrix at a time. Here are three methods (including using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
  manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
# Unit: milliseconds
#       expr        min         lq       mean     median         uq        max neval
#     manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#      apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
#  rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply.
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
  manual = {
    x <- DataSample[, 1:12]
    y <- DataSample[, 13:24]
    ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) /
                       sqrt((rowVars(x) + rowVars(y)) / ncol(x))),
                  ncol(DataSample) - 2)
  },
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note the manual method assumes that the two groups are the same size.
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y var.pooled stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
1 12 12 24 0.30569721 0.160622830 0.145074376 0.5034806 1.0769678 0.7902242 0.3629105 22 0.399752487 0.69319351 -0.6075559 0.89770469 two.sided 0 0.95
2 12 12 24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800 0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651 two.sided 0 0.95
3 12 12 24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078 0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307 two.sided 0 0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4

Running Mean/SD: How can I select within the averaging window based on criteria

I need to calculate a moving average and standard deviation for a moving window. This is simple enough with the caTools package!
However, what I would like to do, having defined my moving window, is take an average from ONLY those values within the window whose corresponding values of other variables meet certain criteria. For example, I would like to calculate a moving temperature average using only the values within the window (e.g. +/- 2 days) where, say, relative humidity is above 80%.
Could anybody help point me in the right direction? Here is some example data:
da <- data.frame(matrix(c(12, 15, 12, 13, 8, 20, 18, 19, 20,
                          80, 79, 91, 92, 70, 94, 80, 80, 90),
                        ncol = 2, byrow = TRUE))
names(da) <- c("Temp", "RH")
Thanks,
Brad
I haven't used caTools, but in the help text for the (presumably) most relevant function in that package, ?runmean, you see that x, the input data, can be either "a numeric vector [...] or matrix with n rows". In your case the matrix alternative is the relevant one: you wish to calculate the mean of a focal variable, Temp, conditional on a second variable, RH, so the function needs access to both variables. However, "[i]f x is a matrix than each column will be processed separately". Thus, I don't think caTools can solve your problem. Instead, I would suggest rollapply in the zoo package. rollapply has the argument by.column, whose default is TRUE: "If TRUE, FUN is applied to each column separately". As explained above, we need access to both columns inside the function, so we set by.column = FALSE.
library(zoo)  # rollapply() lives in the zoo package
# First, specify a function to apply to each window: mean of Temp where RH > 80
meanfun <- function(x) mean(x[(x[ , "RH"] > 80), "Temp"])
# Apply the function to windows of size 3 in your data 'da'.
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE)
meanTemp
# If you want to add the means to 'da',
# you need to make it the same length as number of rows in 'da'.
# This can be achieved by the `fill` argument,
# where we can pad the resulting vector of running means with NA
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
# Add the vector of means to the data frame
da2 <- cbind(da, meanTemp)
da2
# even smaller example to make it easier to see how the function works
da <- data.frame(Temp = 1:9, RH = rep(c(80, 81, 80), each = 3))
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
da2 <- cbind(da, meanTemp)
da2
# Temp RH meanTemp
# 1 1 80 NA
# 2 2 80 NaN
# 3 3 80 4.0
# 4 4 81 4.5
# 5 5 81 5.0
# 6 6 81 5.5
# 7 7 80 6.0
# 8 8 80 NaN
# 9 9 80 NA
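The NaN entries above appear because mean() of an empty selection is NaN, i.e. no value in that window has RH > 80. If you would rather get NA in that case, a small variant of meanfun (my own tweak, not part of the original answer) handles the empty case explicitly:
# return NA instead of NaN when no row in the window meets the criterion
meanfun2 <- function(x) {
  sel <- x[, "RH"] > 80
  if (any(sel)) mean(x[sel, "Temp"]) else NA_real_
}
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun2, by.column = FALSE, fill = NA)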
