I want to make a scatterplot matrix with points in upper pane and r or r2 values in lower pane, as described here: http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
When there is no missing data, it works fine. But when there are some missing values, it seems unable to calculate R, even when I use code I thought would account for missing values. See the commented-out lines in the code below, which show what I've tried -- those attempts were based on what I found after searching here on Stack Overflow: Dealing with missing values for correlations calculation
Probably something simple, as I'm a pretty simple R user (so I'm hoping for solutions that are more simple than elegant). Talk to me like I'm stupid!
I do not want to remove whole rows just because there is one missing value, as my real dataset (not this example) is rather small.
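For reference, here is a minimal illustration (toy vectors, not my data) of the underlying behaviour: by default cor() returns NA as soon as either vector contains an NA, which is why the panels come out as NA.
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 9)
cor(x, y)  # NA -- the default use = "everything" propagates the missing value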
# --------------------------------------
# Create Dataframes, one with missing values
# --------------------------------------
Alx <- c(13, 9, 5, 17, 2, 8, 11, 4)
Bex <- c(23, 41, 32, 58, 26, 33, 51, 46)
Dex <- c(7,10,6,4,19,6,15,16)
Gax <- c(43,54,31,28,60,30,43,21)
AlxM <- c(NA, 9, 5, 17, 2, 8, 11, 4)
BexM <- c(23, 41, NA, 58, 26, 33, 51, 46)
DexM <- c(7,10,6,4,19,6,15,NA)
GaxM <- c(43,54,31,28,60,30,43,21)
df <- data.frame(Alx,Bex,Dex,Gax) # dataframe that works in scatterplot matrix
df_miss <- data.frame(AlxM,BexM,DexM,GaxM)# dataframe that has missing values
rm(Alx,Bex,Dex,Gax,AlxM,BexM,DexM,GaxM) # removing un-needed garbage
# --------------------------------------
# --------------------------------------
# Scatterplot Matrix - functions for upper and lower
# panels, it is the line "r <- round(cor(x,y), digits=2)"
# that I've been focusing on. Perhaps the wrong approach?
# see: http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
# --------------------------------------
# Upper panel
upper.panel <- function(x, y){
  points(x, y, pch=19)
  r <- round(cor(x,y), digits=2)
  txt <- paste0("R = ", r)
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}
# Correlation panel
panel.cor <- function(x, y){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- round(cor(x, y), digits=2) # gives all NA
  # Neither of these (immediately below) worked for me:
  # see: https://stackoverflow.com/questions/7445639/dealing-with-missing-values-for-correlations-calculation
  # r <- round(cor(na.omit(x, y)), digits=2) # does not work
  # r <- round(cor(x, y), use="pairwise.complete.obs", digits=2) # does not work
  txt <- paste0("R = ", r)
  cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = 0.5)
}
# Scatterplots
pairs(df[,1:4], lower.panel = panel.cor,
upper.panel = upper.panel)
pairs(df_miss[,1:4], lower.panel = panel.cor,
upper.panel = upper.panel)
# --------------------------------------
We can use the use argument inside cor() itself; it shouldn't be outside the cor() call as in the OP's commented line r <- round(cor(x, y), use="pairwise.complete.obs", digits=2), where it ends up being passed to round() instead of cor():
panel.cor <- function(x, y){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- round(cor(x, y, use = "pairwise.complete.obs"), digits=2)
  txt <- paste0("R = ", r)
  cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = 0.5)
}
-testing
pairs(df_miss[,1:4], lower.panel = panel.cor,
upper.panel = upper.panel)
-output
I'm trying to write simulation code that generates data and runs t-test selection on it (discarding the predictors whose t-test p-value exceeds 0.05 and retaining the rest). The simulation is largely an adaptation of Applied Econometrics with R by Kleiber and Zeileis (2008, pp. 183–189).
When running the code, it usually fails. Yet with certain seeds (e.g. 1534) it produces plausible output. When it does not produce output (e.g. with seed 1911), it fails with "Error in x[, ii] : subscript out of bounds", which traces back to na.omit.data.frame(). So, for some reason, the way I attempt to handle the NAs seems to fail, but I'm unable to figure out how.
coef <- rep(coef[,3], length.out = pdim+1)
err <- as.vector(rnorm(nobs, sd = sd))
uX <- c(rep(1, times = nobs))
pX <- matrix(scale(rnorm(nobs)), byrow = TRUE, ncol = pdim, nrow = nobs)
X <- cbind(uX, pX)
y <- coef %*% t(X) + err
y <- matrix(y)
tTp <- (summary(lm(y ~ pX)))$coefficients[,4]
tTp <- tTp[2:length(tTp)]
TTT <- matrix(c(tTp, rep(.7, ncol(pX)-length(tTp))))
tX <- matrix(NA, ncol = ncol(pX), nrow = nrow(pX))
for(i in 1:ncol(pX)) {ifelse(TTT[i,] < ALPHA, tX[,i] <- pX[,i], NA)}
tX <- matrix(Filter(function(x)!all(is.na(x)), tX), nrow = nobs)
TTR <- lm(y ~ tX)
The first block is unlikely to be the cause of the error. It merely generates the data and works well on its own, and with other methods such as PCA as well. The second block pulls the p-values from the regression output, removes the p-value of the intercept (beta_0), and pads the vector with as many 0.7s as necessary so that it has the same length as the number of predictors, to ensure matching dimensions for the matrix calculations. 0.7 is arbitrary and could be any number larger than 0.05, so that those entries fail the test in the loop. This becomes necessary, I believe, when R discards predictors due to multicollinearity.
The final block creates an empty matrix of the original dimensions; inserts the original data wherever the t-test p-value is below 0.05 and otherwise leaves the NA; the penultimate line then removes all columns containing NAs (exclusively NA or a single NA makes no difference here; taken from mnel's answer to Remove columns from dataframe where ALL values are NA); lastly, the reduced data is fed into a linear regression again.
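To make those two steps concrete, here is a small standalone sketch with toy numbers (not the simulation's data) of the padding and the column selection described above:
# toy p-values, as if lm() had dropped two of five predictors
tTp  <- c(0.01, 0.30, 0.02)
pdim <- 5
TTT  <- matrix(c(tTp, rep(0.7, pdim - length(tTp))))   # pad with 0.7 (> 0.05)
# keep a column of pX only if its p-value is below 0.05, otherwise leave the NA
set.seed(1)
pX <- matrix(rnorm(10 * pdim), ncol = pdim)
tX <- matrix(NA, ncol = ncol(pX), nrow = nrow(pX))
for (i in 1:ncol(pX)) if (TTT[i, ] < 0.05) tX[, i] <- pX[, i]
# Filter() is applied element-wise here; because the NAs fill whole columns,
# the surviving elements reshape back into exactly the passing columns
tX <- matrix(Filter(function(x) !all(is.na(x)), tX), nrow = nrow(pX))
dim(tX)   # 10 x 2 in this toy case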
Does anyone know what causes this behavior, or how to make it work as intended? I would expect it either to work or not, but not sort of both. Ideally, the former.
A working version of the code is:
set.seed(1534)
Sim_TTS <- function(nobs = c(1000, 15000), pdim = pdims, coef = coef100,
model = c("MLC", "MHC"), ...){
DGP_TTS <- function(nobs = 1000, model = c("MLC", "MHC"), coef = coef100,
sd = 1, pdim = pdims, ALPHA = 0.05)
{
model <- match.arg(model)
if(model == "MLC") {
coef <- rep(coef[,1], length.out = pdim+1)
err <- as.vector(rnorm(nobs, sd = sd))
uX <- c(rep(1, times = nobs))
pX <- matrix(scale(rnorm(nobs)), byrow = TRUE, ncol = pdim, nrow = nobs)
X <- cbind(uX, pX)
y <- coef %*% t(X) + err
y <- matrix(y)
tTp <- (summary(lm(y ~ pX)))$coefficients[,4]
tTp <- tTp[2:length(tTp)]
TTT <- matrix(c(tTp, rep(.7, ncol(pX)-length(tTp))))
tX <- matrix(NA, ncol = ncol(pX), nrow = nrow(pX))
for(i in 1:ncol(pX)) {ifelse(TTT[i,] < ALPHA, tX[,i] <- pX[,i], NA)}
tX <- matrix(Filter(function(x)!all(is.na(x)), tX), nrow = nobs)
TTR <- lm(y ~ tX)
} else {
coef <- rep(coef[,2], length.out = pdim+1)
err <- as.vector(rnorm(nobs, sd = sd))
uX <- c(rep(1, times = nobs))
pX <- matrix(scale(rnorm(nobs)), byrow = TRUE, ncol = pdim, nrow = nobs)
X <- cbind(uX, pX)
y <- coef %*% t(X) + err
y <- matrix(y)
tTp <- (summary(lm(y ~ pX)))$coefficients[,4]
tTp <- tTp[2:length(tTp)]
TTT <- matrix(c(tTp, rep(.7, ncol(pX)-length(tTp))))
tX <- matrix(NA, ncol = ncol(pX), nrow = nrow(pX))
for(i in 1:ncol(pX)) {ifelse(TTT[i,] < ALPHA, tX[,i] <- pX[,i], NA)}
tX <- matrix(Filter(function(x)!all(is.na(x)), tX), nrow = nobs)
TTR <- lm(y ~ tX)
}
return(TTR)
}
PG_TTS <- function(nrep = 1, ...)
{
rsq <- matrix(rep(NA, nrep), ncol = 1)
rsqad <- matrix(rep(NA, nrep), ncol = 1)
pastr <- matrix(rep(NA, nrep), ncol = 1)
vmat <- cbind(rsq, rsqad, pastr)
colnames(vmat) <- c("R sq.", "adj. R sq.", "p*")
for(i in 1:nrep) {
vmat[i,1] <- summary(DGP_TTS(...))$r.squared
vmat[i,2] <- summary(DGP_TTS(...))$adj.r.squared
vmat[i,3] <- length(DGP_TTS(...)$coefficients)-1
}
return(c(mean(vmat[,1]), mean(vmat[,2]), round(mean(vmat[,3]))))
}
SIM_TTS <- function(...)
{
prs <- expand.grid(pdim = pdim, nobs = nobs, model = model)
nprs <- nrow(prs)
pow <- matrix(rep(NA, 3 * nprs), ncol = 3)
for(i in 1:nprs) pow[i,] <- PG_TTS(pdim = prs[i,1],
nobs = prs[i,2], model = as.character(prs[i,3]), ...)
rval <- rbind(prs, prs, prs)
rval$stat <- factor(rep(1:3, c(nprs, nprs, nprs)),
labels = c("R sq.", "adj. R sq.", "p*"))
rval$power <- c(pow[,1], pow[,2], pow[,3])
rval$nobs <- factor(rval$nobs)
return(rval)
}
psim_TTS <- SIM_TTS()
tab_TTS <- xtabs(power ~ pdim + stat + model + nobs, data = psim_TTS)
ftable(tab_TTS, row.vars = c("model", "nobs", "stat"), col.vars = "pdim")}
FO_TTS <- Sim_TTS()
FO_TTS
}
Preceded by:
pdims <- seq(12, 100, 4)
coefLC12 <- c(0, rep(0.2, 4), rep(0.1, 4), rep(0, 4))/1.3
rtL <- c(0.2, rep(0, 3))/1.3
coefLC100 <- c(coefLC12, rep(rtL, 22))
coefHC12 <- c(0, rep(0.8, 4), rep(0.4, 4), rep(0, 4))/1.1
rtH <- c(0.8, rep(0, 3))/1.1
coefHC100 <- c(coefHC12, rep(rtH, 22))
coef100 <- cbind(coefLC100, coefHC100)
I’m aware that model selection via the significance of individual predictors is not recommended, but that is the whole point – it is meant to be compared to more sophisticated methods.
I want to find the optimal allocation for a set of assets, so I am using the packages fPortfolio and BLCOP. Here is my code:
library(xts);library(fPortfolio);library(BLCOP)
sh_return <- xts(ret[,-1],order.by = as.Date(ret[,1]))
prior_mean <- colMeans(sh_return)
prior_mean
prior_cov_matrix <- cov(sh_return)
## construct the confidence interval
pick_matrix <- matrix(0,2,ncol(sh_return))
colnames(pick_matrix) <- colnames(sh_return)
pick_matrix[1,1:4] <- 1
pick_matrix[2,c(1,2,5,ncol(sh_return)-1)] <- c(0.5,0.6,-1,0.8)
pick_matrix
# views
q <- c(0.4,0.32)
confidence <- c(90,95)
views <- BLViews(pick_matrix,q,confidence,assetNames = colnames(sh_return))
views
# posterior
tau <- 0.3
posterior <- posteriorEst(views,tau=tau,prior_mean,prior_cov_matrix)
# optimization
optimal_portfolio <- optimalPortfolios.fPort(posterior,inputData = NULL,spec = NULL,constraints = "LongOnly",optimizer = "tangencyPortfolio",numSimulations = 100)
And the error turns out to be:
Error in if (STATUS != 0) { : argument is of length zero
The problem is, when I check the source code, it looks fine to me and no such STATUS object exists there. Therefore, I have no idea how this code can go wrong like this. Any ideas would be helpful. (A generic debugging sketch follows the reproducible code below.)
Or if you want to test the data, here is the code from the source file of the package; the same error occurs:
entries <- c(0.001005,0.001328,-0.000579,-0.000675,0.000121,0.000128,
-0.000445, -0.000437, 0.001328,0.007277,-0.001307,-0.000610,
-0.002237,-0.000989,0.001442,-0.001535, -0.000579,-0.001307,
0.059852,0.027588,0.063497,0.023036,0.032967,0.048039,-0.000675,
-0.000610,0.027588,0.029609,0.026572,0.021465,0.020697,0.029854,
0.000121,-0.002237,0.063497,0.026572,0.102488,0.042744,0.039943,
0.065994 ,0.000128,-0.000989,0.023036,0.021465,0.042744,0.032056,
0.019881,0.032235 ,-0.000445,0.001442,0.032967,0.020697,0.039943,
0.019881,0.028355,0.035064 ,-0.000437,-0.001535,0.048039,0.029854,
0.065994,0.032235,0.035064,0.079958 )
varcov <- matrix(entries, ncol = 8, nrow = 8)
mu <- c(0.08, 0.67,6.41, 4.08, 7.43, 3.70, 4.80, 6.60) / 100
pick <- matrix(0, ncol = 8, nrow = 3, dimnames = list(NULL, letters[1:8]))
pick[1,7] <- 1
pick[2,1] <- -1; pick[2,2] <- 1
pick[3, 3:6] <- c(0.9, -0.9, .1, -.1)
confidences <- 1 / c(0.00709, 0.000141, 0.000866)
views <- BLViews(pick, c(0.0525, 0.0025, 0.02), confidences, letters[1:8])
posterior <- posteriorEst(views, tau = 0.025, mu, varcov )
optimalPortfolios.fPort(posterior, optimizer = "tangencyPortfolio")
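For what it's worth, the only way I've found to see where STATUS is being tested is generic error tracing; this is just a debugging sketch, not a fix:
# run the failing call, then inspect the call stack to see which internal
# function performs the if (STATUS != 0) check
optimalPortfolios.fPort(posterior, optimizer = "tangencyPortfolio")
traceback()
# or drop into the failing frame interactively and inspect its variables
options(error = recover)
optimalPortfolios.fPort(posterior, optimizer = "tangencyPortfolio")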
I am trying to create more than three independent contrasts in a PERMANOVA with a factor that has 4 levels, without success. I need to use all possible pairwise combinations of the factor levels in my contr2df object. Is there any way to make this possible?
In my code:
#1st factor
treat <- gl(4, 15, labels = paste("t", 1:4, sep="")); treat
#Variables
set.seed(124)
sp <- cbind(c(rnorm(10, 5, 0.25), rnorm(50, 2.5, 0.25)), rnorm(60, 2.5, 0.25),
c(rnorm(10, 12, 0.25), rnorm(50, 2.5, 0.25)), rnorm(60, 2.5, 0.25))
colnames(sp) <- c("sp1", "sp2", "sp3", "sp4")
head(sp)
#create a design matrix of the contrasts for "treat"
Treat_Imp<-model.matrix(~treat-1)
require(vegan)
fullModel <- adonis(sp ~ treat, method = "euclidean", permutations = 9999)
fullModel
#Comparisons
TI <- model.matrix(~ treat-1)
head(TI)
f <- nlevels(treat)
comb <- t(combn(1:f, 2))
n <- nrow(comb)
contr2 <- NULL
for (x in 1:n) {
  i <- comb[x, 1]
  j <- comb[x, 2]
  tmp <- list(TI[,i] - TI[,j]); names(tmp) <- paste0("TI",i, "_", j)
  contr2 <- c(contr2, tmp)
}
contr2
contr2df <- as.data.frame(contr2)
adonis(
sp ~ ., data = contr2df,
method = "euclidean",
permutations = 9999)
#
Thanks,
Alexandre
I am making correlation matrices in R. I love the corrplot package because of its ease of use and pretty graphics. All the examples on the help page use mirrored rows and columns (the same variables on both axes), but I've been able to make correlation matrices with different numbers of rows and columns too.
The corrplot help page.
The package includes an implementation that lets you blank out correlations whose p-values are too large, and it works really well when the matrix is symmetrical, but I can't figure out how to get it to work when I'm doing a correlation matrix between two different subsets.
I'm a newbie when it comes to R, so your help is really appreciated. Thank you in advance!
So these work out fine:
corrplot(cor(My30variableSubset, My70variableSubset, use = "complete.obs"),
method = "circle", tl.col = "black", tl.cex=0.5, col=col1(200))
CIetPfor30variables <- cor.mtest(My30variableSubset, 0.95)
corrplot(cor(My30variableSubset, use = "complete.obs"),
method = "circle", tl.col = "black", tl.cex=0.5, col=col1(200),
p.mat = CIetPfor30variables[[1]], sig.level = 0.001, insig = "blank")
However, this most definitely doesn't:
corrplot(cor(MyCoolSubset, MyConcentrationSubset, use = "complete.obs"),
method = "circle", tl.col = "black", tl.cex=0.5, col=col1(200),
p.mat = CIetPfor70variables[[1]], sig.level = 0.001, insig = "blank")
The reason is that the cor.mtest function given in the help can't deal with this situation.
cor.mtest <- function(mat, conf.level = 0.95) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp$conf.int[1]
      uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp$conf.int[2]
    }
  }
  return(list(p.mat, lowCI.mat, uppCI.mat))
}
So I guess I would solve my problem if I had a function that does the same as the one above, except that it takes in two different subsets of data and gives the p-values & stuff for their cross-correlations.
Here is the R code for two example datasets. In reality, I have a dataset with 15 000 observations of 200 variables, and I have taken 70- and 30-variable subsets that I have to use for the correlation matrix. The point is that the cor.mtest function only accepts one matrix and gives the p-values & stuff of its intercorrelations. A rough sketch of the kind of function I'm after follows the example data below.
variable1 <- c(25, 30, 56)
variable2 <- c(5, 1, 4)
variable3 <- c(160, 110, 220)
variable4 <- c(60, 11, 20)
variable5 <- c(3, 2, 1)
My30variableSubset <- data.frame(variable1, variable2)
My70variableSubset <- data.frame(variable3, variable4, variable5)
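Something along these lines is roughly what I'm after; an untested sketch (the name cor.mtest2 is made up), returning only the p-value matrix, with rows matching the columns of the first subset and columns matching the second, so that it lines up with cor(mat1, mat2):
cor.mtest2 <- function(mat1, mat2, conf.level = 0.95) {
  mat1 <- as.matrix(mat1); mat2 <- as.matrix(mat2)
  # one p-value per pair of (column of mat1, column of mat2)
  p.mat <- matrix(NA, ncol(mat1), ncol(mat2),
                  dimnames = list(colnames(mat1), colnames(mat2)))
  for (i in 1:ncol(mat1)) {
    for (j in 1:ncol(mat2)) {
      p.mat[i, j] <- cor.test(mat1[, i], mat2[, j], conf.level = conf.level)$p.value
    }
  }
  p.mat
}
# e.g. this p-value matrix could then be passed as p.mat to corrplot() together with
# cor(My30variableSubset, My70variableSubset, use = "complete.obs")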