R: how to perform more complex calculations from a combn of a dataset? - r

Right now, I have a combn from the built in dataset iris. So far, I have been guided into being able to find the coefficient of lm() of the pair of values.
myPairs <- combn(names(iris[1:4]), 2)
formula <- apply(myPairs, MARGIN=2, FUN=paste, collapse="~")
model <- lapply(formula, function(x) lm(formula=x, data=iris)$coefficients[2])
model
However, I would like to go a few steps further and use the coefficient from lm() to be used in further calculations. I would like to do something like this:
Coefficient <- lm(formula=x, data=iris)$coefficients[2]
Spread <- myPairs[1] - coefficient*myPairs[2]
library(tseries)
adf.test(Spread)
The procedure itself is simple enough, but I haven't been able to find a way to do this for each combn in the data set. (As a sidenote, the adf.test would not be applied to such data, but I'm just using the iris dataset for demonstration).
I'm wondering, would it be better to write a loop for such a procedure?

You can do all of this within combn.
If you just wanted to run the regression over all combinations, and extract the second coefficient you could do
fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2]
combn(names(iris[1:4]), 2, fun)
You can then extend the function to calculate the spread
fun <- function(x) {
est <- coef(lm(paste(x, collapse="~"), data=iris))[2]
spread <- iris[,x[1]] - est*iris[,x[2]]
adf.test(spread)
}
out <- combn(names(iris[1:4]), 2, fun, simplify=FALSE)
out[[1]]
# Augmented Dickey-Fuller Test
#data: spread
#Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
#alternative hypothesis: stationary
Compare results to running the first one manually
est <- coef(lm(Sepal.Length ~ Sepal.Width, data=iris))[2]
spread <- iris[,"Sepal.Length"] - est*iris[,"Sepal.Width"]
adf.test(spread)
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary

Sounds like you would want to write your own function and call it in your myPairs loop (apply):
yourfun <- function(pair){
fm <- paste(pair, collapse='~')
coef <- lm(formula=fm, data=iris)$coefficients[2]
Spread <- iris[,pair[1]] - coef*iris[,pair[2]]
return(Spread)
}
Then you can call this function:
model <- apply(myPairs, 2, yourfun)
I think this is the cleanest way. But I don't know what exactly you want to do, so I was making up the example for Spread. Note that in my example you get warning messages, since column Species is a factor.

A few tips: I wouldn't name things that you with the same name as built-in functions (model, formula come to mind in your original version).
Also, you can simplify the paste you are doing - see the below.
Finally, a more general statement: don't feel like everything needs to be done in a *apply of some kind. Sometimes brevity and short code is actually harder to understand, and remember, the *apply functions offer at best, marginal speed gains over a simple for loop. (This was not always the case with R, but it is at this point).
# Get pairs
myPairs <- combn(x = names(x = iris[1:4]),m = 2)
# Just directly use paste() here
myFormulas <- paste(myPairs[1,],myPairs[2,],sep = "~")
# Store the models themselves into a list
# This lets you go back to the models later if you need something else
myModels <- lapply(X = myFormulas,FUN = lm,data = iris)
# If you use sapply() and this simple function, you get back a named vector
# This seems like it could be useful to what you want to do
myCoeffs <- sapply(X = myModels,FUN = function (x) {return(x$coefficients[2])})
# Now, you can do this using vectorized operations
iris[myPairs[1,]] - iris[myPairs[2,]] * myCoeffs[myPairs[2,]]
If I am understanding right, I believe the above will work. Note that the names on the output at present will be nonsensical, you would need to replace them with something of your own design (maybe the values of myFormulas).

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix because I only care about the correlation of LPS0 vs all other columns, not the correlations of the other columns with themselves. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p.values are returned as NA, so I switched to an if_else loop. However, I am now getting an error. My code is as follows:
for(i in 3:ncol(microbiota_lps_0_morm)) {
morm_0 <- cor.test(microbiota_lps_0_morm$LPS0, microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]], method = "spearman")
if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! true must be length 1 (length of condition), not 8.
Backtrace: 1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm only when p.value <- 0.05?
Here's a long piece of code which aytomates the whole thing. it might be overkill but you can just take the matrix and use whatever you need. it makes use of the tidyverse.
df <- select_if(mtcars,is.numeric)
glimpse(df)
# keeping real names
dict <- cbind(original=names(df),new=paste0("v",1:ncol(df)))
# but changing names for better data viz
colnames(df) <- paste0("v",1:ncol(df))
# correlating between variables + p values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
pvals[[coln]] <- map(df, ~ cor.test(df[,coln], .)$p.value)
corss[[coln]] <- map(df, ~ cor(df[,coln], .))
}
# Keeping both matrices in a list
matrices <- list(
pvalues = matrix(data=unlist(pvals),
ncol=length(names(pvals)),
nrow=length(names(pvals))),
correlations = matrix(data=unlist(corss),
ncol=length(names(corss)),
nrow=length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)
# Creating a combined data frame
long_cors <- expand.grid(Var1=names(df),Var2=names(df)) %>%
mutate(cor=unlist(matrices["correlations"]),
pval=unlist(matrices["pvalues"]),
same=Var1==Var2,
significant=pval<0.05,
dpcate=duplicated(cor)) %>%
# Leaving no duplicants, non-significant or self-correlation results
filter(same ==F,significant==T,dpcate==F) %>%
select(-c(same,dpcate,significant))
# Plotting correlations
long_cors %>%mutate(negative=cor<0) %>%
ggplot(aes(x=Var1,y=Var2,
color=negative,size=abs(cor),fill=Var2,
label=round(cor,2)))+
geom_label(show.legend = F,alpha=0.2)+
scale_color_manual(values = c("black","darkred"))+
# Sizing each correlation by it's magnitude
scale_size_area(seq(1,100,length=length(unique(long_cors$Var1))))+ theme_light()+
theme(axis.text = element_text(face = "bold",size=12))+
labs(title="Correlation between variables",
caption = "p < 0.05")+xlab("")+ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
~ cor.test(mtx[,1], mtx[,.], use="p", method="s")$p.value)
Whenever you feel like you need a for loop in R, chances are, you should be using another approach. for is a very un-R construct, and R gives many better ways of handling the same issues. map_* family of functions from tidyverse is but one of them. Another approach, in base R, would be to use apply:
significant <- apply(mtx[,-1], 2,
\(x) cor.test(x, mtx[,1], method="s", use="p")$p.value)

Using boot on subset of data

From the boot documentation, it looks like it should be possible to pass information about which observations to use for bootstrapping directly via the command, but I cannot figure out how to access the indices.
To take a simple example, say I want to use only cars with automatic transmissions for my bootstrap. One way would be to subset the data directly.
# Bootstrap on a subset of mtcars
library(boot)
my_func <- function(data, i) {
d <- data[i,]
m <- lm(mpg ~ disp + log(hp), data = d)
beta <- m$coefficients['log(hp)']
y_mean <- mean(m$model[, 1])
# Return estimated elasticity
return(beta / y_mean)
}
my_boot <- boot(mtcars[mtcars$am == 1, ], my_func, R = 100)
My question is whether it is possible to achieve the same effect using the options in boot(). In this simple case, it might not make much sense, but imagine a much more complicated set of conditions passed from a list.

Perform two-sample t-test and output the t, p values for two groups of matrices in R

I am trying to perform a two-sample t-test like this
The wt1, wt2, wt3, mut1, mut2, mut3 are 3x3 matrices. After running the t-test, I would like to get t.stat and p.value matrices, in which
t.stat[i,j] <- the t value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
p.value[i,j] <- the p-value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
with i and j indicating the row and column indices.
Is there an efficient way to achieve this without a loop?
Thank you very much for the help, it works!
Now I found that my data in the diagonal directions are all 1, which would result in Error in t.test.default(c(wt1[x], wt2[x], wt3[x]), c(mut1[x], mut2[x], :
data are essentially constant.
In order to pass those errors, I would like to output N/A in the t.stat and p.value. If the matrices have to contain the same type of values, 0 and 1 can be used for t.stat and p.value, respectively. It seems that tryCatch can do the job, but I am not sure how to handle it with sapply?
You can do something like this:
test<- sapply(1:9, function(x) t.test(c(wt1[x], wt2[x], wt3[x]),
c(mut1[x], mut2[x], mut3[x])))
t.stat<- matrix(test["statistic", ], nrow = 3)
p.value<- matrix(test["p.value", ], nrow = 3)
For the second part of your question, I think using tryCatch inside sapply will help. Unfortunately, I couldn't think of a way of pre-allocating test and then creating the 2 matrices while using tryCatch. In order to do that, I am adapting Aaron's answer.
t.stat<- matrix(sapply(1:9, function(x)
tryCatch({t.test(c(wt1[x], wt2[x], wt3[x]),
c(mut1[x], mut2[x], mut3[x]))$statistic},
error = function(err) {return(NA)})), nrow = 3)
p.value<- matrix(sapply(1:9, function(x)
tryCatch({t.test(c(wt1[x], wt2[x], wt3[x]),
c(mut1[x], mut2[x], mut3[x]))$p.value},
error = function(err) {return(NA)})), nrow = 3)

Simulate over an xts object in R

I am looking for a function, or package, that will help me with this goal. I've looked through several packages but can't find what I am looking for:
Lets say I have an xts object with 10 columns and 250 rows.
What I want to do is run a simulation, such that I get a robust calculation of my performance metric over the period.
So, lets say that I have 250 data points, I want to run x number of simulations over random samples of the data computing the Sharpe Ratio using the function (PerformanceAnalytics::SharpeRatio) varying the random samples to be lengths 30-240, and then find the average. Keep in mind I want to do this for every column and I'd rather not have to use apply if possible. I'd also like to find something that processes the information rather quickly.
What package or functions would best serve this purpose?
Thank you!
Subsetting xts objects for the rows you want to randomly sample should be good enough, performance wise, if that is your main concern. If you want some other concrete examples, you may find it useful to look at the monte carlo simulation functions recently added to the R blotter package:
https://github.com/braverock/blotter/blob/master/R/mcsim.R
Your requirements are quite detailed and a little tricky to follow, but I think this example may be what you're after?
This solution does use apply functions though! Because it just makes life easier. If you don't use lapply, the code will expand quickly and distract from achieving the goal quickly (and you risk introducing bugs with longer, messier code; one reason to use apply family functions where you can).
library(quantmod)
library(PerformanceAnalytics)
# Set up the data:
syms <- c("GOOG", "FB", "TSLA", "SNAP", "MU")
getSymbols(syms)
z <- do.call(merge, lapply(syms, function(s) {
x <- get(s)
dailyReturn(Cl(x))
}))
# Here we have 250 rows, 5 columns:
z <- tail(z, 250)
colnames(z) <- paste0(syms, ".rets")
subSample <- function(x, n.sub = 40) {
# Assuming subsampling by row, preserving all returns and cross symbol dependence structure at a given timestamp
ii <- sample(1:NROW(x), size = n.sub, replace = FALSE)
# sort in order to preserve time ordering?
ii <- sort(ii)
xs <- x[ii, ]
xs
}
set.seed(5)
# test:
z2 <- subSample(z, n.sub = 40)
zShrp <- SharpeRatio(z2)[1, ]
# now run simulation:
nSteps <- seq(30, 240, by = 30)
sharpeSimulation <- function(x, n.sub) {
x <- subSample(x, n.sub)
SharpeRatio(x)[1, ]
}
res <- lapply(nSteps, FUN = sharpeSimulation, x = z)
res <- do.call(rbind, res)
resMean <- colMeans(res)
resMean
# GOOG.rets FB.rets TSLA.rets SNAP.rets MU.rets
# 0.085353854 0.059577882 0.009783841 0.026328660 0.080846592
Do you realise that SharpeRatio uses sapply? And it's likely other performance metrics you want to use will as well. Since you seem to have something against apply (possibly all apply functions in R), this might be worth noting.

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different Variables/Vektors and each has about 80 entries. I'm trying to find significant correlations, that means I want to pick one variable and let R calculate all the correlations of that variable to the other 39 variables.
I tried to do this by using a linear modell with one explaining variable that means: Y=a*X+b.
Then the lm() command gives me an estimator for a and p-value of that estimator for a. I would then go on and use one of the other variables I have for X and try again until I find a p-value thats really small.
I'm sure this is a common problem, is there some sort of package or function that can try all these possibilities (Brute force),show them and then maybe even sorts them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
Which will produce:
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (#Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
for (j in 1:m){
if ( !is.na(correlations$P[i,j])){
if ( correlations$P[i,j] < 0.05 ) {
print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
}
}
}
}
Warning
You should not use this for drawing any serious conclusion; only useful for some exploratory analysis and formulate hypothesis. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that do do some adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
the_data[, -1],
function(x)
{
cor(the_data[, 1], x)
},
numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated which which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable than you have to take the one that is mainly correlated with y.
To do this just use the command which.max(abs(cor(x,y))). If you want to use more than one variable in your model then you have to consider something like the lasso estimator
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

Resources