How to find significant correlations in a large dataset - r

I'm using R.
My dataset has about 40 variables (vectors), each with about 80 entries. I'm trying to find significant correlations: I want to pick one variable and have R calculate its correlation with each of the other 39 variables.
I tried to do this with a linear model with one explanatory variable, i.e. Y = a*X + b.
The lm() command then gives me an estimate for a and a p-value for that estimate. I would then use one of the other variables for X and try again, until I find a p-value that's really small.
I'm sure this is a common problem. Is there a package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?

You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
This draws a correlogram of all pairwise correlations.

In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from @Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (@Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values, printing the significant correlations:
for (i in 1:m){
  for (j in 1:m){
    if ( !is.na(correlations$P[i,j])){
      if ( correlations$P[i,j] < 0.05 ) {
        print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
      }
    }
  }
}
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods better suited to this that adjust for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
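To make the multiple-testing point concrete, here is a minimal sketch in base R that tests y against each of the other columns, sorts by p-value, and adds a Bonferroni-adjusted column. For a regression with a single predictor, the p-value of the slope from lm() is the same as the p-value from cor.test() with Pearson correlation, so this covers the approach described in the question. The seed and the names predictors, tests and results are assumptions made for this illustration only.

```r
# Same simulated data as above: 40 independent uniform columns, 80 rows
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

# Test y against each of the other columns with cor.test()
predictors <- setdiff(colnames(the_data), "y")
tests <- lapply(predictors, function(v) cor.test(the_data$y, the_data[[v]]))

results <- data.frame(
  variable = predictors,
  r        = vapply(tests, function(tst) unname(tst$estimate), numeric(1)),
  p_value  = vapply(tests, function(tst) tst$p.value, numeric(1))
)

# Adjust for the 39 tests; "bonferroni" multiplies each p-value by 39, capped at 1
results$p_bonferroni <- p.adjust(results$p_value, method = "bonferroni")

# Smallest p-values first
results <- results[order(results$p_value), ]
head(results)
```

Because the columns here are pure noise, any raw p-value below 0.05 is a false positive; the adjusted column is the one to look at when deciding what is worth following up.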

Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
  the_data[, -1],
  function(x)
  {
    cor(the_data[, 1], x)
  },
  numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
Knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.

If you are trying to predict y using only one variable, then you should take the one that is most strongly correlated with y.
To do this, use the command which.max(abs(cor(x,y))), where x holds the candidate predictors as columns. If you want to use more than one variable in your model, then you have to consider something like the lasso estimator.

One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.
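A full matrix for 40 variables is hard to scan by eye, so one option is to flatten it into a list of pairs sorted by the size of the correlation. The following is only a sketch: the toy data frame stands in for data, and idx and pairs_df are names invented for this example.

```r
set.seed(42)  # assumption: toy data standing in for `data`
data <- as.data.frame(matrix(rnorm(80 * 5), ncol = 5))
cor_result <- cor(data)

# Keep each pair once by taking only the upper triangle of the matrix
idx <- which(upper.tri(cor_result), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_result)[idx[, "row"]],
  var2 = colnames(cor_result)[idx[, "col"]],
  r    = cor_result[idx]
)

# Strongest correlations (positive or negative) first
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df)
```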

Related

Mclust() - NAs in model selection

I recently tried to fit a GMM in R to a multivariate matrix (400 observations of 196 variables) whose rows belong to known categories. The Mclust() function (from package mclust) gave very poor results (only around 30% of individuals were classified correctly, whereas k-means reaches more than 90%).
Here is my code :
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
y <- read.csv("y.csv", sep = ",")
nclusters <- 5 # I want 5 clusters
gmm <- Mclust(X, G = nclusters)
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters){
  ii = which(cl_gmm == k) # individuals of group k
  counts=table(y[ii]) # number of occurrences for each label
  imax = which.max(counts) # Majority label
  maj_lab = attributes(counts)$dimnames[[1]][imax]
  print(paste("Group ",k,", majority label = ",maj_lab))
  cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y,cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every other model than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
Until now I did not find any solution to this problem...are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every model when too many explanatory variables are involved. The right thing to do is to first reduce the dimensionality, choosing a number of dimensions that makes it possible to fit GMMs properly while preserving as much information about the data as possible.
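A rough sketch of that idea, using PCA from base R (prcomp) before the mixture model. The simulated matrix is only a stand-in for the real X, the 80% variance threshold is an arbitrary choice for illustration, and the Mclust() call is skipped if mclust is not installed.

```r
set.seed(123)
# Assumption: simulated stand-in for X, 5 groups living in 3 latent dimensions
# that are spread over 196 observed variables plus a little noise
centres  <- matrix(rnorm(5 * 3, sd = 5), nrow = 5)
labels   <- rep(1:5, each = 80)
latent   <- centres[labels, ] + matrix(rnorm(400 * 3), nrow = 400)
loadings <- matrix(rnorm(3 * 196), nrow = 3)
X <- latent %*% loadings + matrix(rnorm(400 * 196, sd = 0.5), nrow = 400)

# Principal components on centred and scaled columns
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Smallest number of components explaining at least 80% of the variance
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(explained >= 0.80)[1]
X_reduced <- pca$x[, seq_len(n_comp), drop = FALSE]

# Fit the mixture on the reduced data (only if mclust is available)
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  gmm <- Mclust(X_reduced, G = 5)
  print(gmm$BIC)  # check which covariance models could be fitted
}
```

On real data the number of components is worth tuning, for instance by comparing the classification quality for a few values of n_comp.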

Calculate Errors using loop function in R

I have two data matrices with the same dimensions. I want to extract the same series of column vectors from each, flatten each series into a vector, and then calculate different error measures, for example mean absolute error (mae), mean absolute percentage error (mape) and root mean square error (rmse). My data matrices are quite large, so I'll explain with an example in which I calculate these errors manually:
mat1<- matrix(6:75,ncol=10,byrow=T)
mat2<- matrix(30:99,ncol=10,byrow=T)
mat1_seri1 <- as.vector(mat1[,c(1+(0:4)*2)])
mat1_seri2<- as.vector(mat1[,c(2+(0:4)*2)])
mat2_seri1 <- as.vector(mat2[,c(1+(0:4)*2)])
mat2_seri2<- as.vector(mat2[,c(2+(0:4)*2)])
mae1<-mean(abs(mat1_seri1-mat2_seri1))
mae2<-mean(abs(mat1_seri2-mat2_seri2))
For mape
mape1<- mean(abs(mat1_seri1-mat2_seri1)/mat1_seri1)*100
mape2<- mean(abs(mat1_seri2-mat2_seri2)/mat1_seri2)*100
Similarly, I calculate rmse from its formula. Since my data matrices are large, doing this manually is quite time-consuming. Is it possible to do this with a loop that outputs the error terms (mae, mape, rmse) for each series separately?
I'm not sure if this is what you are looking for, but here is a function that could automate the process, maybe there is also a better way:
fn <- function(m1, m2) {
  stopifnot(dim(m1) == dim(m2))
  mat1_seri1 <- as.vector(m1[, (1:ncol(m1))[(1:ncol(m1))%%2 != 0]])
  mat1_seri2 <- as.vector(m1[, (1:ncol(m1))[!(1:ncol(m1))%%2]])
  mat2_seri1 <- as.vector(m2[, (1:ncol(m2))[(1:ncol(m2))%%2 != 0]])
  mat2_seri2 <- as.vector(m2[, (1:ncol(m2))[!(1:ncol(m2))%%2]])
  mae1 <- mean(abs(mat1_seri1-mat2_seri1))
  mae2 <- mean(abs(mat1_seri2-mat2_seri2))
  mape1 <- mean(abs(mat1_seri1-mat2_seri1)/mat1_seri1)*100
  mape2 <- mean(abs(mat1_seri2-mat2_seri2)/mat1_seri2)*100
  setNames(as.data.frame(matrix(c(mae1, mae2, mape1, mape2), ncol = 4)),
           c("mae1", "mae2", "mape1", "mape2"))
}
fn(mat1, mat2)
  mae1 mae2    mape1    mape2
1   24   24 92.62581 86.89572
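The function above does not return rmse, which the question also asks for. A possible generalisation is sketched below: it loops over the series, treats every n_series-th column as belonging to the same series, and returns one row per series. The name series_errors and the n_series argument are made up for this example, and m1 is taken as the reference in the mape denominator, as in the question.

```r
series_errors <- function(m1, m2, n_series = 2) {
  stopifnot(identical(dim(m1), dim(m2)), ncol(m1) %% n_series == 0)
  out <- lapply(seq_len(n_series), function(s) {
    cols      <- seq(s, ncol(m1), by = n_series)  # columns s, s + n_series, ...
    actual    <- as.vector(m1[, cols])
    predicted <- as.vector(m2[, cols])
    err       <- actual - predicted
    data.frame(
      series = s,
      mae    = mean(abs(err)),
      mape   = mean(abs(err) / abs(actual)) * 100,
      rmse   = sqrt(mean(err^2))
    )
  })
  do.call(rbind, out)
}

mat1 <- matrix(6:75, ncol = 10, byrow = TRUE)
mat2 <- matrix(30:99, ncol = 10, byrow = TRUE)
res <- series_errors(mat1, mat2)
res
```

For the example matrices every element of mat2 is exactly 24 larger than in mat1, so mae and rmse are both 24 for each series.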

Permutations and combinations of all the columns in R

I want to check all the permutations and combinations of columns while selecting models in R. I have 8 columns in my data set and the below piece of code lets me check some of the models, but not all. Models like column 1+6, 1+2+5 will not be covered by this loop. Is there any better way to accomplish this?
best_model <- rep(0,3) #store the best model in this array
for(i in 1:8){
  for(j in 1:8){
    for(x in k){
      diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
      accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
      if( best_model[1] < accuracy[x] ){
        best_model[1] = accuracy[x]
        best_model[2] = i
        best_model[3] = j
      }
    }
  }
}
Well, this answer isn't complete, but maybe it'll get you started. You want to be able to subset by all possible subsets of columns. So instead of having i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), etc.
Using the sets package, you can form the power set (set of all subsets) of a set. That's the easy part. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, in which they're all the same.
library(sets)
my.set <- 1:8 # you want column indices from 1 to 8
my.power.set <- set_power(my.set) # this creates the set of all subsets of those indices
my.names <- c("a") #I don't know how to index into sets, so I created names (that are numbers, but of type characters)
for(i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
names(my.power.set) <- my.names
my.indices <- vector("list",length(my.power.set)-1)
for(i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} #this is the line I couldn't get to work
I wanted to create a list of lists called my.indices, so that my.indices[i] was a subset of {1,2,3,4,5,6,7,8} that could be used in place of where you have i:j. Then, your for loop would have to run from 1:length(my.indices).
But alas, I have been spoiled by Mathematica, and thus cannot decipher the incredibly complicated world of R data types.
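For what it's worth, the list of index vectors described above can also be built in base R with combn(), without the sets package. This is a sketch of one way to do it; n_cols and subsets are names chosen here for illustration.

```r
n_cols <- 8

# For every subset size 1..8, list all column combinations of that size,
# then flatten the result into a single list of integer vectors
subsets <- unlist(
  lapply(seq_len(n_cols), function(size) combn(n_cols, size, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # 2^8 - 1 non-empty subsets
subsets[[9]]     # first subset of size two
```

Each element, such as c(1, 6) or c(1, 2, 5), can then be used in place of i:j when subsetting the training and test columns.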
Solved it, below is the code with explanatory comments:
library(class)       # knn()
library(binaryLogic) # as.binary()
# find out the best model for this data
# (k and accuracy are assumed to be defined already, as in the question)
number_of_columns_to_model <- ncol(diabetes_training)-1
best_model <- c()
best_model_accuracy = 0
for(i in 1:(2^number_of_columns_to_model-1)){
  # i = 0 is skipped, as it doesn't represent any model (no columns selected)
  # convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 1 0 1
  combination = as.binary(i, n=number_of_columns_to_model) # from the binaryLogic package
  model <- c()
  for(j in 1:length(combination)){
    # choose which columns to consider depending on the combination
    if(combination[j])
      model <- c(model, j)
  }
  for(x in k){
    # for the columns decided by model, find out the accuracies of model for k=1:27
    diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
    accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/length(diabetes_test_labels)
    if( best_model_accuracy < accuracy[x] ){
      best_model_accuracy = accuracy[x]
      best_model = model
      print(model)
    }
  }
}
I trained with Pima.tr and tested with Pima.te. KNN accuracy was 78% with pre-processed predictors and 80% without pre-processing (because of the large influence of some variables).
The 80% performance is on par with a logistic regression model. You don't need to pre-process variables for logistic regression.
Random forest and logistic regression both give hints about which variables to drop, so you don't need to try all possible combinations.
Another way is to look at a scatter plot matrix.
You get a sense that there is a difference between type 0 and type 1 when it comes to npreg, glu, bmi and age.
You also notice the highly skewed ped and age, and that there may be an anomalous data point between skin and the other variables (you may need to remove that observation before going further).
The skin vs. type box plot shows that an extreme outlier exists for type Yes (try removing it).
You also notice that most of the boxes for type Yes are higher than for type No, so the variables may add predictive power to the model (you can confirm this with a Wilcoxon rank sum test).
The high correlation between skin and bmi means that you can use one or the other, or an interaction of both.
Another approach to reducing the number of predictors is to use PCA.
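As a small sketch of two of the checks mentioned above, the Wilcoxon rank sum test and the logistic regression can be run on Pima.tr from the MASS package like this. The choice of glu for the test and of the four predictors in the model is only an example, not a recommendation.

```r
library(MASS)  # provides Pima.tr (training set) and Pima.te (test set)

# Wilcoxon rank sum test: does glu differ between type No and type Yes?
# exact = FALSE uses the normal approximation, which avoids warnings about ties
wt <- wilcox.test(glu ~ type, data = Pima.tr, exact = FALSE)
wt$p.value

# Logistic regression; the coefficient table hints at which variables matter
fit <- glm(type ~ npreg + glu + bmi + age, data = Pima.tr, family = binomial)
summary(fit)$coefficients
```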

Obtain t-statistic for regression coefficients of an “mlm” object returned by `lm()`

I've used lm() to fit regression models for a large number (~1 million) of response variables in R, e.g.
allModels <- lm(t(responseVariablesMatrix) ~ modelMatrix)
This returns an object of class "mlm", which is one large object containing all the models. I want the t-statistic for the first coefficient in each model. I can get it with summary(allModels), but that is very slow on data this large and returns a lot of unwanted information too.
Is there a faster way to calculate the t-statistic manually than using the summary() function?
Thanks!
You can hack the summary.lm() function to get just the bits you need and leave the rest.
If you have
nVariables <- 5
nObs <- 15
y <- rnorm(nObs)
x <- matrix(rnorm(nVariables*nObs),nrow=nObs)
allModels <-lm(y~x)
Then this is the code from the summary.lm() function but with all the excess baggage removed (note, all the error handling has been removed as well).
p <- allModels$rank
rdf <- allModels$df.residual
Qr <- allModels$qr
n <- NROW(Qr$qr)
p1 <- 1L:p
r <- allModels$residuals
f <- allModels$fitted.values
w <- allModels$weights
mss <- if (attr(allModels$terms, "intercept"))
sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
resvar <- rss/rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- allModels$coefficients[Qr$pivot[p1]]
tval <- est/se
tval is now a vector of the t statistics, as also given by
summary(allModels)$coefficients[,3]
If you have problems on the large model you might want to rewrite the code so that it keeps fewer objects by compounding multiple lines/assignments into fewer lines.
Hacky solution I know. But it will be about as fast as possible. I suppose it would be neater to put all the lines of code into a function as well.
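The example above uses a single response. For an "mlm" fit the same idea can be applied to all responses at once, because every response shares the same design matrix and therefore the same unscaled covariance matrix; only the residual variance differs per response. The following is a sketch under the assumption of a full-rank, unweighted fit; the small simulated Y stands in for the real response matrix, and fit, xtx_inv and resvar are names invented here.

```r
set.seed(1)
n_obs  <- 15
n_resp <- 4  # stands in for the ~1 million responses
x <- matrix(rnorm(n_obs * 2), nrow = n_obs)
Y <- matrix(rnorm(n_obs * n_resp), nrow = n_obs)  # one response per column
fit <- lm(Y ~ x)  # object of class c("mlm", "lm")

rdf <- fit$df.residual
p1  <- seq_len(fit$rank)

# Unscaled (X'X)^-1, identical for every response
xtx_inv <- chol2inv(fit$qr$qr[p1, p1, drop = FALSE])

# Residual variance of each response (one value per column of Y)
resvar <- colSums(fit$residuals^2) / rdf

# Standard errors and t-values: rows are coefficients, columns are responses
se   <- sqrt(outer(diag(xtx_inv), resvar))
tval <- fit$coefficients[fit$qr$pivot[p1], , drop = FALSE] / se

tval[1, ]  # t-statistic of the first coefficient (the intercept) in every model
```

This avoids building a summary object per response, so the cost is dominated by one colSums() call over the residual matrix.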

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numeric features) doesn't fit.
I'm looking for a kNN implementation for R in which different distance methods can be selected, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- t(apply(d[test.set, design.set], 1, order)) # one row per test sample, nearest neighbor first
# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as matrix in the case k=1. If not it will be converted to a vector and apply will throw an error.
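On the question of the most common value: one slightly tidier option is to wrap the vote in a helper built on names(which.max(table(v))). The sketch below also shows a base R Hamming distance for binary (logical or 0/1) features, in case installing e1071 is not an option. Both function names are made up for this example, and the distance only applies to binary columns like those in the dummy data.

```r
# Majority vote: the most frequent value in v, returned as a character string
# (ties are resolved in favour of the value that comes first in table() order)
majority <- function(v) names(which.max(table(v)))

# Pairwise Hamming distance between the rows of a logical or 0/1 matrix:
# counts positions where one row has 1 and the other has 0, in either direction
hamming_matrix <- function(x) {
  x <- x * 1  # coerce logical to numeric
  x %*% t(1 - x) + (1 - x) %*% t(x)
}

x <- rbind(c(1, 0, 1, 1),
           c(1, 1, 1, 0),
           c(0, 0, 0, 0))
hamming_matrix(x)
majority(c("a", "b", "b", "c"))
```

Rows 1 and 2 differ in two positions, and each of them differs from the all-zero row in three, which is what the distance matrix reports.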

Resources