Incorrect output of findCorrelation (caret package) - r

I use the 'findCorrelation' function from the caret package to keep only the factors whose correlation is at or below a set cutoff (threshold). My script is as follows:
library(caret)
set.seed(123)
# make a matrix to calculate correlation
data <- as.matrix(data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000), w=rnorm(1000)))
# calculate correlation
df2 <- cor(data)
hc <- findCorrelation(as.matrix(df2), cutoff=0.05) # put any value as the "cutoff"
hc <- sort(hc)
print(df2)
print(df2[-hc,-hc])
df2 output (all factors):
print(df2)
x y z w
x 1.00000000 0.086479441 -0.01932954 -0.002994710
y 0.08647944 1.000000000 0.02650333 -0.007029076
z -0.01932954 0.026503334 1.00000000 0.050560850
w -0.00299471 -0.007029076 0.05056085 1.000000000
df2 with applied cutoff of 0.05:
print(df2[-hc,-hc])
x w
x 1.00000000 -0.00299471
w -0.00299471 1.00000000
But if I apply cutoff=0.1, for instance, I get an empty matrix instead of the list of all factors below the cutoff:
hc <- findCorrelation(as.matrix(df2), cutoff=0.1)
hc <- sort(hc)
print(df2[-hc,-hc])
The df2 output with cutoff=0.1:
<0 x 0 matrix>
I have run other examples from my business cases, and it appears that at least one factor must be above the cutoff value for the matrix of factors below the cutoff to be generated; otherwise an empty matrix is produced.
I have dug into the source of 'findCorrelation' and it works as written, so perhaps it is simply not meant to handle this case.
I would be grateful for your hints on how to tackle the issue.
UPDATE of 07/03/16:
Thanks to the useful answer from @topepo I have revised the script:
the part to be replaced:
print(df2[-hc,-hc])
with:
if(length(hc) == 0){
  print(df2)
} else {
  print(df2[-hc,-hc])
}

It is not a bug.
In ?findCorrelation, it describes the value returned as
A vector of indices denoting the columns to remove (when names = TRUE) otherwise a vector of column names. If no correlations meet the criteria, integer(0) is returned.
The issue that you are seeing arises because you need to make sure that the subsetting vector has elements in it, via something like
if(length(hc) > 0) df2 <- df2[-hc, -hc]
Any zero length integer would produce this issue.
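To see why, here is a minimal sketch (using a generic 3x3 matrix rather than the data above): a negative index built from integer(0) is itself integer(0), so the subset keeps nothing.
m <- diag(3)
idx <- integer(0)          # what findCorrelation returns when nothing meets the cutoff
dim(m[-idx, -idx])         # 0 0 -- negative indexing with integer(0) drops everything
if (length(idx) > 0) m <- m[-idx, -idx]
dim(m)                     # 3 3 -- the guarded version keeps the full matrix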

Related

Confusion about calculating sample correlation in r

I have been tasked with manually calculating the sample correlation between two datasets (D$Nload and D$Pload), and then comparing the result with R's built-in cor() function.
I calculate the sample correlation with
cov(D$Nload,D$Pload, use="complete.obs")/(sd(D$Nload)*sd(D$Pload, na.rm=TRUE))
Which gives me the result 0.5693599
Then I try using R's cor() function
cor(D[, c("Nload","Pload")], use="pairwise.complete.obs")
which gives me the result:
Nload Pload
Nload 1.0000000 0.6244952
Pload 0.6244952 1.0000000
Which is a different result. Can anyone see where I've gone wrong?
This happens because when you call sd() on a single vector, it cannot check if the data is pairwise complete. Example:
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA
df <- data.frame(x = x, y = y)
So here we have
df[seq(2), ]
x y
1 1.0879645 NA
2 -0.3919369 0.2191193
We see that while the second row is pairwise complete (all columns used for your computation are not NA), the first row is not. However, if you calculate sd() on just a single column, it doesn't have any information about the pairs. So in your case, sd(df$x) will use all the available data, although it should avoid the first row.
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09301583
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
But if you remove the first row from your computation, the results are equal:
df <- df[complete.cases(df), ]
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09313766
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
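Equivalently, you can keep the original df (with the NA still in it) and restrict every term to the pairwise-complete rows yourself; a small sketch of that:
ok <- complete.cases(df$x, df$y)   # rows where both x and y are observed
cov(df$x[ok], df$y[ok]) / (sd(df$x[ok]) * sd(df$y[ok]))
# [1] 0.09313766                   # matches cor(..., use = "pairwise.complete.obs")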

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
Your data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
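That matrix vs. data.frame distinction is also the likely source of the per-data-point NAs in the question; a short sketch (re-creating the sample matrix from above under a hypothetical name) of what sapply() does when handed a matrix:
dat.mat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
# sapply()/lapply() iterate over the individual elements of a matrix, not its
# columns, so each call is sd(<one number>), which returns NA (with a warning)
head(sapply(dat.mat, sd))
# [1] NA NA NA NA NA NA
# column-wise statistics on a matrix need apply() with MARGIN = 2:
apply(dat.mat, 2, sd)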
A z-transformation is quite easy to do manually.
See below, using a simple numeric vector as the data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
mean(z) == 0
sd(z) == 1
The logarithm transformation (assuming you mean a natural logarithm) is done using the log() function.
log(data)
Hope this helps!
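Putting the two steps together for the question's data, a hedged sketch (assuming musci is a data frame whose columns 4:13 are the numeric variables, as described in the question) that log-transforms and standardizes column-wise without vegan:
# hypothetical reproduction of the question's workflow in base R only
log.musci   <- log10(musci[, 4:13])
stand.musci <- scale(log.musci)      # column-wise: mean 0, sd 1
round(colMeans(stand.musci), 10)     # all ~0
apply(stand.musci, 2, sd)            # all 1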

nls() in R using entire matrix

I have data which I want to fit to the following equation using R:
Z(u,w)=z0*F(w)*[1-exp((-b*u)/F(w))]
where z0 and b are constants and F(w), w=0,...,9 is a decreasing step function that depends on w with F(0)=1 and u=1,...,50.
Z(u,w) is an observed set of data in the form of a 50x10 matrix (u=50,...,1 down the rows and w=0,...,9 along the columns). For example, since I haven't explained that very well: Z(42,3) is the element in the 9th row down and the 4th column along.
Using F(0)=1 I was able to get estimates of b and z0 using just the first column (ie w=0) with the code:
n0=nls(zuw~z0*(1-exp(-b*u)),start=list(z0=283,b=0.03),options(digits=10))
I then found F(w) for w=1,...,9 by going through each column and using the values of b and z0 I found.
However, I wanted to find a way to estimate all 12 parameters at once (b, z0 and the 10 values of F(w)), as b and z0 should be fitted to all the data, not just the first column.
Does anyone know of any way of doing this? All help would be greatly appreciated!
Thanks
James
This may be a case where the formula interface of the nls(...) function works against you. As an alternative, you can use nls.lm(...) in the minpack.lm package to perform non-linear regression with a programmatically defined function. To demonstrate this, first we create an artificial dataset which follows your functional form by design, with random error added (error ~ N[0,1]).
u <- 1:50
w <- 0:9
z0 <- 100
b <- 0.02
F <- 10/(10+w^2)
# matrix containing data, in OP's format: rows are u, cols are w
m <- do.call(cbind, lapply(w, function(w)
  z0*F[w+1]*(1-exp(-b*u/F[w+1])) + rnorm(length(u), 0, 1)))
So now we have a matrix m, which is equivalent to your dataset. This matrix is in the so-called "wide" format - the response for different values of w is in different columns. We need it in "long" format: all responses in a single column, with separate columns identifying u and w. We do this using melt(...) in the reshape2 package.
# prepend values of u
df.wide <- data.frame(u=u, m)
library(reshape2)
# reshape to long format: col1 = u, col2=w, col3=z
df <- melt(df.wide,id="u",variable.name="w", value.name="z")
df$w <- as.numeric(substr(df$w,2,4))-1
Now we have a data frame df with columns u, w, and z. The nls.lm(...) function takes (at least) 4 arguments: par is a vector of initial estimates of the parameters of the fit, fn is a function that calculates the residuals at each step, observed is the dependent variable (z), and xx is a vector or matrix containing the independent variables (u, w).
Next we define a function, f(par, xx), where par is an 11-element vector. The first two elements contain estimates of z0 and b. The next 9 contain estimates of F(w), w=1:9. This is because you state that F(0) is known to be 1. xx is a matrix with two columns: the values for u and w respectively. f(par, xx) then calculates the estimated response z for all values of u and w, for the given parameter estimates.
library(minpack.lm)
# model function
f <- function(pars, xx) {
  z0 <- pars[1]
  b  <- pars[2]
  F  <- c(1, pars[3:11])
  u  <- xx[,1]
  w  <- xx[,2]
  z  <- z0*F[w+1]*(1-exp(-b*u/F[w+1]))
  return(z)
}
# residual function
resids <- function(p, observed, xx) {observed - f(p,xx)}
Next we perform the regression using nls.lm(...), which uses a highly robust fitting algorithm (Levenberg-Marquardt). Consequently, we can set the par argument (containing the initial estimates of z0, b, and F) to all 1's, which is fairly distant from the values used in creating the dataset (the "actual" values). nls.lm(...) returns a list with several components (see the documentation). The par component contains the final estimates of the fit parameters.
# initial parameter estimates; all 1's
par.start <- c(z0=1, b=1, rep(1,9))
# fit using Levenberg-Marquardt algorithm
nls.out <- nls.lm(par=par.start,
                  fn = resids, observed = df$z, xx = df[,c("u","w")],
                  control=nls.lm.control(maxiter=10000, ftol=1e-6, maxfev=1e6))
par.final <- nls.out$par
results <- rbind(predicted=c(par.final[1:2],1,par.final[3:11]),actual=c(z0,b,F))
print(results,digits=5)
# z0 b
# predicted 102.71 0.019337 1 0.90456 0.70788 0.51893 0.37804 0.27789 0.21204 0.16199 0.13131 0.10657
# actual 100.00 0.020000 1 0.90909 0.71429 0.52632 0.38462 0.28571 0.21739 0.16949 0.13514 0.10989
So the regression has done an excellent job at recovering the "actual" parameter values. Finally, we plot the results using ggplot just to make sure this is all correct. I can't overemphasize how important it is to plot the final results.
df$pred <- f(par.final,df[,c("u","w")])
library(ggplot2)
ggplot(df,aes(x=u, color=factor(w)))+
geom_point(aes(y=z))+ geom_line(aes(y=pred))
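If you also want uncertainty estimates for the fitted parameters, minpack.lm supplies a summary method for nls.lm fits; a brief sketch, continuing from the nls.out object above:
summary(nls.out)      # parameter estimates with standard errors and t-statistics
nls.out$deviance      # residual sum of squares at the solution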

R package caret confusionMatrix with missing categories

I am using the function confusionMatrix in the R package caret to calculate some statistics for some data I have. I have been putting my predictions as well as my actual values into the table function to get the table to be used in the confusionMatrix function, like so:
table(predicted,actual)
However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always represent all the possibilities (e.g. only A, B, D). The resulting output of the table function does not include the missing outcome and looks like this:
A B C D
A n1 n2 n3 n4
B n5 n6 n7 n8
D n9 n10 n11 n12
# Note how there is no corresponding row for `C`.
The confusionMatrix function can't handle the missing outcome and gives the error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
Is there a way I can use the table function differently to get the missing rows with zeros or use the confusionMatrix function differently so it will view missing outcomes as zero?
As a note: since I am randomly selecting my data to test with, there are times when a category is also not represented in the actual results, not just the predicted ones. I don't believe this will change the solution.
You can use union to ensure similar levels:
library(caret)
# Sample Data
predicted <- c(1,2,1,2,1,2,1,2,3,4,3,4,6,5) # Levels 1,2,3,4,5,6
reference <- c(1,2,1,2,1,2,1,2,1,2,1,3,3,4) # Levels 1,2,3,4
u <- union(predicted, reference)
t <- table(factor(predicted, u), factor(reference, u))
confusionMatrix(t)
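Equivalently, you can skip the explicit table() call and pass the re-leveled factors straight to confusionMatrix(), which accepts a (predicted, reference) pair as long as both factors share the same levels; a small sketch using the same u as above:
confusionMatrix(factor(predicted, u), factor(reference, u))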
First note that confusionMatrix can be called as confusionMatrix(predicted, actual) in addition to being called with table objects. However, the function throws an error if predicted and actual (both regarded as factors) do not have the same number of levels.
This (and the fact that the caret package spat an error at me because they don't get the dependencies right in the first place) is why I'd suggest creating your own function:
# Create a confusion matrix from the given outcomes, whose rows correspond
# to the predicted and the columns to the actual classes.
createConfusionMatrix <- function(act, pred) {
  # You've mentioned that neither actual nor predicted may give a complete
  # picture of the available classes, hence:
  numClasses <- max(act, pred)
  # Sort predicted and actual as it simplifies what's next. You can make this
  # faster by storing `order(act)` in a temporary variable.
  pred <- pred[order(act)]
  act  <- act[order(act)]
  sapply(split(pred, act), tabulate, nbins=numClasses)
}
# Generate random data since you've not provided an actual example.
actual <- sample(1:4, 1000, replace=TRUE)
predicted <- sample(c(1L,2L,4L), 1000, replace=TRUE)
print( createConfusionMatrix(actual, predicted) )
which will give you:
1 2 3 4
[1,] 85 87 90 77
[2,] 78 78 79 95
[3,] 0 0 0 0
[4,] 89 77 82 83
I had the same problem and here is my solution:
tab <- table(my_prediction, my_real_label)
if(nrow(tab) != ncol(tab)){
  missings <- setdiff(colnames(tab), rownames(tab))
  missing_mat <- mat.or.vec(nr = length(missings), nc = ncol(tab))
  rownames(missing_mat) <- missings
  tab <- as.table(rbind(as.matrix(tab), missing_mat))
  tab <- tab[colnames(tab), ]  # put the rows in the same order as the columns
}
my_conf <- confusionMatrix(tab)
Cheers
Cankut

Impossible to create correlated variables from this correlation matrix?

I would like to generate correlated variables specified by a correlation matrix.
First I generate the correlation matrix:
require(psych)
require(Matrix)
cor.table <- matrix( sample( c(0.9,-0.9) , 2500 , prob = c( 0.8 , 0.2 ) , repl = TRUE ) , 50 , 50 )
k=1
while (k<=length(cor.table[1,])){
cor.table[1,k]<-0.55
k=k+1
}
k=1
while (k<=length(cor.table[,1])){
cor.table[k,1]<-0.55
k=k+1
}
ind<-lower.tri(cor.table)
cor.table[ind]<-t(cor.table)[ind]
diag(cor.table) <- 1
This correlation matrix is not consistent (it is not positive semi-definite), so it cannot be decomposed to generate the variables.
To make it consistent I use nearPD:
c<-nearPD(cor.table)
Once this is done I generate the correlated variables:
fit<-principal(c, nfactors=50,rotate="none")
fit$loadings
loadings<-matrix(fit$loadings[1:50, 1:50],nrow=50,ncol=50,byrow=F)
loadings
cases <- t(replicate(50, rnorm(10)) )
multivar <- loadings %*% cases
T_multivar <- t(multivar)
var<-as.data.frame(T_multivar)
cor(var)
However the resulting correlations are far from anything that I specified initially.
Is it not possible to create such correlations or am I doing something wrong?
UPDATE: From Greg Snow's comment it became clear that the problem is that my initial correlation matrix is unreasonable.
The question then is how I can make the matrix reasonable. The goal is:
each of the 49 variables should correlate >.5 with the first variable.
~40 of the variables should have a high >.8 correlation with each other
the remaining ~9 variables should have a low or negative correlation with each other.
Is this whole requirement impossible?
Try using the mvrnorm function from the MASS package rather than trying to construct the variables yourself.
Edit:
Here is a matrix that is positive definite (so it works as a correlation matrix) and comes close to your criteria; you can tweak the values from there (all the eigenvalues need to stay positive, so you can see how changing a number affects things):
cor.mat <- matrix(0.2,nrow=50, ncol=50)
cor.mat[1,] <- cor.mat[,1] <- 0.55
cor.mat[2:41,2:41] <- 0.9
cor.mat[42:50, 42:50] <- 0.25
diag(cor.mat) <- 1
eigen(cor.mat)$values
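To then actually generate the correlated variables from this matrix, a minimal sketch with mvrnorm from MASS (the sample size and seed here are arbitrary choices):
library(MASS)
set.seed(1)
# draw 1000 observations from a multivariate normal with cor.mat as its
# correlation matrix, then check the sample correlations
x <- mvrnorm(n = 1000, mu = rep(0, 50), Sigma = cor.mat)
round(cor(x)[1:5, 1:5], 2)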
Some numerical experimentation based on your specifications above suggests that the generated matrix will never (what never? well, hardly ever ...) be positive definite, but it also doesn't look far from PD with these values (making lcor below negative will almost certainly make things worse ...)
rmat <- function(n=49, nhcor=40, hcor=0.8, lcor=0) {
  m <- matrix(lcor, n, n)  ## fill matrix with 'lcor'
  ## select high-cor variables
  hcorpos <- sample(n, size=nhcor, replace=FALSE)
  ## make all of these highly correlated
  m[hcorpos, hcorpos] <- hcor
  ## compute min real part of eigenvalues
  min(Re(eigen(m, only.values=TRUE)$values))
}
set.seed(101)
r <- replicate(1000,rmat())
## NEVER pos definite
max(r)
## [1] -1.069413e-15
png("eighist.png")
par(las=1, bty="l")
hist(log10(abs(r)),breaks=50,col="gray",main="")
dev.off()
