Removing specific columns in R

I am using the findCorrelation function (from the caret package) in R:
highCorr <- findCorrelation(correlations, cutoff = .60,names = FALSE)
The function returns the numbers (or names) of the columns that are correlated at 0.6 and above.
I want to remove these columns, but I don't know how to do this: if I remove them one at a time, the column numbers change. Also, I want to try a few cutoff thresholds, so I would like to do this automatically.

If your original data are a correlation matrix you can do the following:
library(caret) #findCorrelation comes from this library
set.seed(1)
#create simulated data for correlation matrix
mydata <- matrix(data = rnorm(100,mean = 100, sd = 3), nrow = 10, ncol = 10)
#create correlation matrix
correlations <- cor(mydata)
#index correlations at cutoff
corr_ind <- findCorrelation(correlations, cutoff = .2)
#remove columns from original data based on index value
remove_corrs <- mydata[, -corr_ind]  # note the comma: drop columns, not elements
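Since several cutoffs are to be tried, here is a minimal sketch (the `cutoffs` vector and `filtered` list are my own names) that re-derives the index from the full correlation matrix for each threshold, so column numbers never shift:

```r
library(caret)  # for findCorrelation()

set.seed(1)
mydata <- matrix(rnorm(100, mean = 100, sd = 3), nrow = 10, ncol = 10)
correlations <- cor(mydata)

# one filtered data set per cutoff; always index the ORIGINAL matrix,
# so removing columns for one threshold never shifts another's indices
cutoffs <- c(0.2, 0.4, 0.6)
filtered <- lapply(cutoffs, function(ct) {
  drop_cols <- findCorrelation(correlations, cutoff = ct)
  if (length(drop_cols) == 0) mydata else mydata[, -drop_cols]
})
names(filtered) <- paste0("cutoff_", cutoffs)
sapply(filtered, ncol)  # columns remaining at each threshold
```

Each element of `filtered` is built independently from the untouched `mydata`, which is what makes trying many thresholds safe.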

Related

How to create correlation matrix after mice multiple imputation

I'm using the mice package to create multiple imputations. I want to create a correlation matrix (and a matrix of p-values for the correlation coefficients). I use miceadds::micombine.cor to do this, but it returns a data frame with the variable pairs in the first two columns, followed by a number of columns containing r, p, t-values, and the like.
I'm looking for a way to turn this data frame into a "good old" matrix with the correlation coefficient between x and y in position [x,y], plus a matching matrix of p-values. Does anyone have an easy way to do this?
Here's some code to reproduce:
library(mice)
library(miceadds)
library(missForest)  # for prodNA()
mt.mis <- prodNA(mtcars, noNA = 0.1)  # introduce 10% missing values
imputed <- mice(mt.mis, m = 5, maxit = 5, method = "pmm")
correlations <- miceadds::micombine.cor(mi.res = imputed, variables = c(1:3))
What I'm looking for is something like the output from cor(mtcars). Who can help?
I ended up writing my own function. Can probably be done much more efficiently, but this is what I made.
cormatrix <- function(r, N){
  x <- 1
  cormatrix <- matrix(nrow = N, ncol = N) # create empty matrix
  for (i in 1:N) {
    for (j in i:N) {
      if (j > i) {
        cormatrix[i, j] <- r[x]
        cormatrix[j, i] <- r[x]
        x <- x + 1
      }
    }
  }
  diag(cormatrix) <- 1
  cormatrix
}
You can call it with the output of micombine.cor and the number of variables in your model as arguments. So, for example, cormatrix(correlations$r, ncol(df)).
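As a quick sanity check (with made-up coefficients rather than real micombine.cor output, and assuming the pairs arrive in row-major upper-triangle order), the function mirrors the vector into a symmetric matrix:

```r
# cormatrix() as defined above, reproduced so this snippet runs on its own
cormatrix <- function(r, N){
  x <- 1
  m <- matrix(nrow = N, ncol = N)
  for (i in 1:N) {
    for (j in i:N) {
      if (j > i) {
        m[i, j] <- r[x]
        m[j, i] <- r[x]
        x <- x + 1
      }
    }
  }
  diag(m) <- 1
  m
}

# three pairwise coefficients for N = 3 variables,
# in the order (1,2), (1,3), (2,3)
res <- cormatrix(c(0.5, 0.3, 0.2), 3)
res[1, 2]  # 0.5
res[3, 2]  # 0.2
```

If micombine.cor orders its rows differently, the vector would need to be reordered before being passed in.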

How to create a formulated table in R?

This is my reproducible example :
#http://gekkoquant.com/2012/05/26/neural-networks-with-r-simple-example/
library("neuralnet")
require(ggplot2)
traininginput <- as.data.frame(runif(50, min=0, max=100))
trainingoutput <- sqrt(traininginput)
trainingdata <- cbind(traininginput,trainingoutput)
colnames(trainingdata) <- c("Input","Output")
Hidden_Layer_1 <- 1 # value is randomly assigned
Hidden_Layer_2 <- 1 # value is randomly assigned
Threshold_Level <- 0.1 # value is randomly assigned
net.sqrt <- neuralnet(Output~Input,trainingdata, hidden=c(Hidden_Layer_1, Hidden_Layer_2), threshold = Threshold_Level)
#Test the neural network on some test data
testdata <- as.data.frame((1:13)^2) #Generate some squared numbers
net.results <- predict(net.sqrt, testdata) #Run them through the neural network
cleanoutput <- cbind(testdata,sqrt(testdata),
as.data.frame(net.results))
colnames(cleanoutput) <- c("Input","ExpectedOutput","NeuralNetOutput")
ggplot(data = cleanoutput, aes(x= ExpectedOutput, y= NeuralNetOutput)) + geom_point() +
geom_abline(intercept = 0, slope = 1
, color="brown", size=0.5)
rmse <- sqrt(sum((sqrt(testdata)- net.results)^2)/length(net.results))
print(rmse)
Here, when my Hidden_Layer_1 is 1, Hidden_Layer_2 is 2, and the Threshold_Level is 0.1, the rmse generated is 0.6717354.
Let's try another example:
when my Hidden_Layer_1 is 2, Hidden_Layer_2 is 3, and the Threshold_Level is 0.2, the rmse generated is 0.8355925.
How can I create a table that automatically calculates the value of rmse when the user assigns values to Hidden_Layer_1, Hidden_Layer_2, and Threshold_Level? (I know how to do it in Excel but not in R, haha.)
The desired table should look like this:
I wish to have Trial(s), Hidden_Layer_1, Hidden_Layer_2, Threshold_Level, and rmse as my columns, and the number of rows should be able to grow indefinitely via some actionButton (if possible), meaning the user can keep trying until they get the rmse they desire.
How can I do that? Can anyone help me? I will definitely learn from this, as I am quite new to R.
Thank you very much to anyone willing to give me a helping hand.
Here is a way to create the table of values that can be displayed with the data frame viewer.
# initialize an object where we can store the parameters as a data frame
data <- NULL
# function to receive a row of parameters and add them to the
# df argument
addModelElements <- function(df, trial, layer1, layer2, threshold, rmse){
  newRow <- data.frame(trial = trial,
                       Hidden_Layer_1 = layer1,
                       Hidden_Layer_2 = layer2,
                       Threshold = threshold,
                       RMSE = rmse)
  rbind(df, newRow)
}
# once a model has been run, call addModelElements() with the
# model parameters
data <- addModelElements(data,1,1,2,0.1,0.671735)
data <- addModelElements(data,2,2,3,0.2,0.835593)
...and the output:
View(data)
Note that if you're going to create scores or hundreds of rows of parameters & RMSE results before displaying any of them to the end user, the code should be altered to improve the efficiency of rbind(). In this scenario, we build a list of sets of parameters, convert them into data frames, and use do.call() to execute rbind() only once.
# version that improves the efficiency of rbind()
addModelElements <- function(trial, layer1, layer2, threshold, rmse){
  # return row as data frame
  data.frame(trial = trial,
             Hidden_Layer_1 = layer1,
             Hidden_Layer_2 = layer2,
             Threshold = threshold,
             RMSE = rmse)
}
# generate list of data frames and rbind() once
inputParms <- list(c(1, 1, 2, 0.1, 0.671735),
                   c(1, 1, 2, 0.3, 0.681935),
                   c(2, 2, 3, 0.2, 0.835593))
parmList <- lapply(inputParms, function(x){
  addModelElements(x[1], x[2], x[3], x[4], x[5])
})
# bind to single data frame
data <- do.call(rbind, parmList)
View(data)
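If the goal is to fill the whole table automatically rather than one trial at a time, a minimal sketch could iterate over a parameter grid with expand.grid(); note that rmse_for() below is a hypothetical stand-in, not the real model fit:

```r
# every combination of layer sizes and thresholds to try
grid <- expand.grid(Hidden_Layer_1  = 1:2,
                    Hidden_Layer_2  = 2:3,
                    Threshold_Level = c(0.1, 0.2))

# hypothetical stand-in: in the real script this would refit
# neuralnet() with these parameters and return the test rmse
rmse_for <- function(l1, l2, thr) {
  (l1 + l2) / 10 + thr  # placeholder value, NOT a real model fit
}

grid$rmse  <- mapply(rmse_for,
                     grid$Hidden_Layer_1,
                     grid$Hidden_Layer_2,
                     grid$Threshold_Level)
grid$Trial <- seq_len(nrow(grid))
grid
```

Replacing the placeholder with the actual neuralnet/predict/rmse pipeline from the question would produce the desired overview in one pass.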

Weighted Pearson's Correlation with one Object

I want to create a correlation matrix using data but weighted based on significant edges.
m <- matrix(data = rnorm(36), nrow = 6, ncol = 6)
x <- LETTERS[1:6]
y <- character(0)  # initialize before appending in the loop
for (a in 1:length(x)) y <- c(y, paste("c", a, sep = ""))
mCor <- cor(t(m))
w <- sample(x = seq(0.5, 0.8, by = 0.01), size = 36)
The object w represents the weights for mCor. I know of other packages that provide weighted correlations, but they require the input vectors x and y to be the same length. I want to calculate a pairwise weighted Pearson's correlation table, using the data in each row across all columns.
I just want to make sure it's correct, but I thought about computing a weighted correlation for each pair of rows A and B by multiplying each value by the given weight. You typically need three vectors, all the same length: two for the data and one for the weights.
I am using the data.table package, so speedy solutions are welcome. Also, I'm not sure whether I should pass a table with two columns for the connections and one for the weights. Do the existing functions preserve order, or do they match automatically?
weight <- data.table(x = rep(LETTERS[1:3], each = 12), y = rep(LETTERS[4:6], times = 3), w = w)
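For what it's worth, base R's stats::cov.wt() already computes a weighted Pearson correlation for equal-length vectors; a minimal sketch for one pair of rows (my own variable names, with the weights rescaled internally by cov.wt()):

```r
set.seed(1)
a <- rnorm(10)            # row A across all columns
b <- rnorm(10)            # row B across all columns
w <- runif(10, 0.5, 0.8)  # one weight per observation

# weighted Pearson correlation of a and b; cov.wt() normalizes
# the weights to sum to 1 internally
wcor <- cov.wt(cbind(a, b), wt = w, cor = TRUE)$cor[1, 2]

# with equal weights this reduces to plain cor()
eq <- all.equal(cov.wt(cbind(a, b), wt = rep(1, 10), cor = TRUE)$cor[1, 2],
                cor(a, b))
```

Looping this over all row pairs would fill the weighted analogue of mCor.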

Association matrix in R

corrplot allows you to plot a correlation matrix in R.
Any idea how I can plot an association matrix in R,
where the method of association is a user-specified method such as Cramer's V?
The answer to your question depends strongly on the data you've got and the specific correlation method. I assume you have a bunch of nominal variables and want to see whether they are correlated, using Cramer's V, on a correlation plot. In that case, one way to do this is the following:
Calculate the Cramer's V coefficient for every pair of
variables. I used the vcd library, as it has a method to calculate Cramer's V.
Put these coefficients together to get, essentially, a correlation matrix.
Visualize the matrix.
Ugly but working code to do this is listed below. I played around with outer - the clearest and most precise way to work with row and column indexes - but encountered problems indexing columns in df using the row and column indexes from m: for some reason it just didn't want to get the variable from df.
install.packages("vcd")
library(vcd)
library(corrplot)
# Simulate some data or paste your own
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
                 x2 = sample(letters[1:5], 20, replace = TRUE),
                 x3 = sample(letters[1:5], 20, replace = TRUE))
# Initialize empty matrix to store coefficients
empty_m <- matrix(ncol = length(df),
                  nrow = length(df),
                  dimnames = list(names(df),
                                  names(df)))
# Function that accepts a matrix for coefficients and data and returns a correlation matrix
calculate_cramer <- function(m, df) {
  for (r in seq(nrow(m))) {
    for (c in seq(ncol(m))) {
      m[[r, c]] <- assocstats(table(df[[r]], df[[c]]))$cramer
    }
  }
  return(m)
}
cor_matrix <- calculate_cramer(empty_m, df)
corrplot(cor_matrix)
Building upon the example by Alexey Knorre:
library(DescTools)
library(corrplot)
# Simulate data
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
x2 = sample(letters[1:5], 20, replace = TRUE),
x3 = sample(letters[1:5], 20, replace = TRUE))
# Use CramerV as input for corrplot
corrplot::corrplot(DescTools::PairApply(df, DescTools::CramerV))
I would suggest corrplot(PairApply(df, CramerV), diag = FALSE, is.corr = FALSE) to change the color scale from (-1, 1) (is.corr = TRUE) to (0, 1) (is.corr = FALSE).

automation of subset process

It is probably easy, but I can't figure it out.
I have a data frame with over 70 variables. I make predictions using all those variables. For sensitivity analysis, I would like to subset the data frame automatically to see how the prediction performs on each specific subset.
I have done this manually, but with over 100 different subset options it is very tedious.
Here is the data/code and my desired solution:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
a = c(1.7, 3.3, 5.1)
df = data.frame(n, s, b, a)
df
To calculate the accuracy of prediction a:
df$calc <- df$a - df$n
df$difference <- sqrt(df$calc * df$calc)
With these values I can now calculate the Mean and SD
Mean <- mean(df$difference)
SD <- sd(df$difference)
Let's say I would like to get an overview of the prediction accuracy for all cases where b = TRUE. (Or other subsets of the data)
Ideally I would like a data frame to look like this:
subset = c("b=TRUE", "b=FALSE", "s=aa")
amount = c(2, 1, 1) # number of times this subset occurs
Mean = c(0.22, 0.3, 0.1)
SD = c(0.1, 0.2, 0.5)
OV = data.frame(subset, amount, Mean, SD)
OV
Considering that I have more than 100 different subsets to create, I need a fast solution that generates an overview like the OV data frame. I tried a loop, but I have trouble defining a vector for subsetting the data.
Thanks!
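One way to sketch this (the subset labels and the `subsets` list are my own) is to keep each subset as a named logical filter and assemble the overview with lapply(), so adding a new subset is just one more list entry:

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
a <- c(1.7, 3.3, 5.1)
df <- data.frame(n, s, b, a)
df$difference <- abs(df$a - df$n)  # same as sqrt(calc * calc)

# each entry: a label mapped to a logical index into df
subsets <- list("b=TRUE"  = df$b,
                "b=FALSE" = !df$b,
                "s=aa"    = df$s == "aa")

rows <- lapply(names(subsets), function(lbl) {
  d <- df$difference[subsets[[lbl]]]
  data.frame(subset = lbl,
             amount = length(d),
             Mean   = mean(d),
             SD     = sd(d))
})
OV <- do.call(rbind, rows)
OV
```

With 100+ subsets, the list could also be generated programmatically, e.g. one entry per level of each factor column.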
