Getting AUCs for several predictors and outcomes in a dataframe - r

I want to be able to do lots of AUCs at once from the pROC package. Here is a simple dataframe with two predictors and a binary outcome and my attempt to use sapply() along with auc() and roc() from the pROC library. What am I doing wrong?
library(pROC)
df <- data.frame(z = rnorm(100,0,1), x=rnorm(100,0,1), y = as.factor(sample(0:1, 100, replace=TRUE)))
#One AUC at a time
auc(roc(df$y, df$x))
auc(roc(df$y, df$z))
#Trying to get multiple
predictors <- c("z","x")
results <- lapply(df, function(x){auc(roc(y, predictors))})
This solution works using a for loop; is that the most elegant method, or can sapply/lapply be used instead?
Linked question: "Calculating multiple ROC curves in R using a for loop and pROC package. What variable to use in the predictor field?"

You can use lapply in the following way -
predictors <- c("z","x")
results <- lapply(predictors, function(x) auc(roc(df$y, df[[x]])))
results
#[[1]]
#Area under the curve: 0.6214
#[[2]]
#Area under the curve: 0.6238
sapply would return a numeric vector.
sapply(predictors, function(x) auc(roc(df$y, df[[x]])))
# z x
#0.6213942 0.6237981
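If you have several outcomes as well as several predictors, the same idea nests. Here is a minimal sketch in base R, assuming a rank-based (Mann-Whitney) AUC in place of pROC so that it runs without extra packages. The helper auc_rank and the columns y1 and y2 are made-up names for illustration. Unlike pROC's default, this helper does not choose the direction automatically, so values below 0.5 are possible.

```r
# Rank-based (Mann-Whitney) AUC; hypothetical helper, not part of pROC
auc_rank <- function(outcome, score) {
  pos <- outcome == levels(factor(outcome))[2]  # second level is treated as the "case"
  n1 <- sum(pos)
  n0 <- sum(!pos)
  (sum(rank(score)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(42)
df <- data.frame(
  z  = rnorm(100),
  x  = rnorm(100),
  y1 = factor(sample(0:1, 100, replace = TRUE)),
  y2 = factor(sample(0:1, 100, replace = TRUE))
)
predictors <- c("z", "x")
outcomes   <- c("y1", "y2")

# Outer sapply loops over predictors (columns), inner over outcomes (rows)
auc_table <- sapply(predictors, function(p)
  sapply(outcomes, function(o) auc_rank(df[[o]], df[[p]])))
auc_table
```

With pROC loaded, swapping auc_rank(df[[o]], df[[p]]) for auc(roc(df[[o]], df[[p]])) should give the same layout.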

Related

Performing linear regressions using columns of two matrices in R

I have two large matrices with the same dimensions e.g.:
#dummy matrices
A <- matrix(c(1:3288),nrow=12)
B <- matrix(c(3289:6576),nrow=12)
For each column I would like to run a linear regression between the two matrices (A and B), and if possible I would like to get the output of lm into a data frame, e.g. for each column's regression I want to know the r^2, the slope, the intercept etc.
Any help appreciated.
Assuming that you'll fit the regression between any two combination of columns this could be a solution. Keep in mind that depending on what you'll finally want in the resulting data.frame the code will change.
A <- matrix(c(1:3288),nrow=12)
B <- matrix(c(3289:6576),nrow=12)
library(broom)
library(dplyr)
results <- NULL
for (i in 1:ncol(A)){
for (j in 1:ncol(B)){
model_<-lm(A[,i]~B[,j])
results<-bind_rows(results,
bind_cols(columnx = i,
columny = j,
glance(model_),
intercept=model_$coefficients[1],
slope=model_$coefficients[2]
)
)
}
}
If you only need pairwise regressions, where column 1 in A is fitted with column 1 in B, 2 with 2 and so on, a more elegant solution can be written using map from the purrr package. Hope this helps.
Edit: only fitting column 1 in A with column 1 in B, and so forth
library(purrr)
library(dplyr)
library(broom)
A<-data.frame(A)
B<-data.frame(B)
results <- map2_df(.x = A,
.y = B, ~ {
model_<-lm(.y ~ .x)
bind_cols(glance(model_),
intercept=model_$coefficients[1],
slope=model_$coefficients[2]
)
})
The purrr documentation explains clearly how map2_df works. It basically loops over two lists at the same time, executing one function on each pair of elements and returning a data.frame.
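If you would rather not depend on purrr and broom, the same column-by-column fit can be sketched in base R. This is only an illustration under assumptions: the matrices are small random ones (the dummy matrices above give a perfect fit, which makes summary() complain), and fit_one is a made-up helper name.

```r
set.seed(1)
A <- matrix(rnorm(12 * 5), nrow = 12)
B <- matrix(rnorm(12 * 5), nrow = 12)

# Fit column j of A against column j of B and keep a few summary numbers
fit_one <- function(j) {
  fit <- lm(A[, j] ~ B[, j])
  c(column    = j,
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
}

# vapply returns a 4 x ncol(A) matrix; transpose it to get one row per column pair
results <- as.data.frame(t(vapply(seq_len(ncol(A)), fit_one, numeric(4))))
results
```

vapply checks that every fit returns exactly four numbers, so a failed model shows up as an error rather than a silently misshapen result.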

How to run many linear regressions/correlations in one data set

I have one data set in an excel/csv form. I wish to run many simple linear regressions/correlations (each with a p-value).
I have several independent variables (x's) and one dependent variable (y).
The variables are all columns of data, not rows. Each column has the name of the data type in the first cell, and all the numerical data in the lower cells.
I want to create a loop instead of manually running each test, but I'm unfamiliar with loops in R. If anyone could help, I would greatly appreciate it. Thanks!
Without more detail it's hard to know for sure, but using dplyr and broom might get you where you need to go.
For example, this runs a linear model for each group:
library(broom)
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(tidy(lm(mpg ~ wt, data = .)))
For more detail, may I suggest: http://r4ds.had.co.nz/many-models.html
Here is my attempt to use a simulated data set to demonstrate how to 1) compute correlations "manually", and 2) compute them iteratively with a for loop in R:
First, simulate data with 2 independent variables, x1 (normally distributed) and x2 (exponentially distributed), and a dependent variable y (normally distributed like x1, but with mean 0.5):
set.seed(1) #reproducibility
## The first column is your DEPENDENT variable
## The rest are independent variables
data <- data.frame(y=rnorm(100,0.5,1), x1=rnorm(100,0,1), x2= rexp(100,0.5))
"Manually" compute correlation:
cor_x1_y <- cor.test(data$x1, data$y)
cor_x2_y <- cor.test(data$x2, data$y)
c(cor_x1_y$estimate, cor_x2_y$estimate) #corr. coefficients
## cor cor
## -0.0009943199 -0.0404557828
c(cor_x1_y$p.value, cor_x2_y$p.value) #p values
## [1] 0.9921663 0.6894252
Iteratively compute correlation and store results in a matrix called results:
results <- NULL # placeholder
for(i in 2:ncol(data)) {
## Perform i^th test:
one_test <- cor.test(data[,i], data$y)
test_cor <- one_test$estimate
p_value <- one_test$p.value
## Add any other parameters you'd like to include
## Update the results matrix
results <- rbind(results, c(test_cor , p_value))
}
colnames(results) <- c("correlation", "p_value")
results
## correlation p_value
## [1,] -0.0009943199 0.9921663
## [2,] -0.0404557828 0.6894252
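Since the question also asks about regressions, here is a minimal sketch of the same loop written with lapply and lm, assuming the simulated data from above. The helper name fit_row is made up for illustration. For a simple regression with one predictor, the slope's p-value should match the one from cor.test.

```r
set.seed(1)
data <- data.frame(y = rnorm(100, 0.5, 1), x1 = rnorm(100, 0, 1), x2 = rexp(100, 0.5))

predictors <- setdiff(names(data), "y")

# Fit y ~ <predictor> and return one row with the slope and its p-value
fit_row <- function(v) {
  fit <- lm(reformulate(v, response = "y"), data = data)
  est <- summary(fit)$coefficients[2, ]  # second row belongs to the predictor
  data.frame(predictor = v,
             slope     = est[["Estimate"]],
             p_value   = est[["Pr(>|t|)"]])
}

results <- do.call(rbind, lapply(predictors, fit_row))
results
```

Building one small data frame per model and binding them at the end avoids growing a matrix inside the loop.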

R: how to perform more complex calculations from a combn of a dataset?

Right now, I have a combn from the built in dataset iris. So far, I have been guided into being able to find the coefficient of lm() of the pair of values.
myPairs <- combn(names(iris[1:4]), 2)
formula <- apply(myPairs, MARGIN=2, FUN=paste, collapse="~")
model <- lapply(formula, function(x) lm(formula=x, data=iris)$coefficients[2])
model
However, I would like to go a few steps further and use the coefficient from lm() to be used in further calculations. I would like to do something like this:
Coefficient <- lm(formula=x, data=iris)$coefficients[2]
Spread <- myPairs[1] - coefficient*myPairs[2]
library(tseries)
adf.test(Spread)
The procedure itself is simple enough, but I haven't been able to find a way to do this for each combn in the data set. (As a sidenote, the adf.test would not be applied to such data, but I'm just using the iris dataset for demonstration).
I'm wondering, would it be better to write a loop for such a procedure?
You can do all of this within combn.
If you just wanted to run the regression over all combinations, and extract the second coefficient you could do
fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2]
combn(names(iris[1:4]), 2, fun)
You can then extend the function to calculate the spread
fun <- function(x) {
est <- coef(lm(paste(x, collapse="~"), data=iris))[2]
spread <- iris[,x[1]] - est*iris[,x[2]]
adf.test(spread)
}
out <- combn(names(iris[1:4]), 2, fun, simplify=FALSE)
out[[1]]
# Augmented Dickey-Fuller Test
#data: spread
#Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
#alternative hypothesis: stationary
Compare results to running the first one manually
est <- coef(lm(Sepal.Length ~ Sepal.Width, data=iris))[2]
spread <- iris[,"Sepal.Length"] - est*iris[,"Sepal.Width"]
adf.test(spread)
# Augmented Dickey-Fuller Test
# data: spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary
Sounds like you would want to write your own function and call it in your myPairs loop (apply):
yourfun <- function(pair){
fm <- paste(pair, collapse='~')
coef <- lm(formula=fm, data=iris)$coefficients[2]
Spread <- iris[,pair[1]] - coef*iris[,pair[2]]
return(Spread)
}
Then you can call this function:
model <- apply(myPairs, 2, yourfun)
I think this is the cleanest way. But I don't know what exactly you want to do, so I was making up the example for Spread. Note that in my example you get warning messages, since column Species is a factor.
A few tips: I wouldn't give your objects the same names as built-in functions (model and formula come to mind in your original version).
Also, you can simplify the paste you are doing - see the below.
Finally, a more general statement: don't feel like everything needs to be done in a *apply of some kind. Sometimes brevity and short code is actually harder to understand, and remember, the *apply functions offer at best, marginal speed gains over a simple for loop. (This was not always the case with R, but it is at this point).
# Get pairs
myPairs <- combn(x = names(x = iris[1:4]),m = 2)
# Just directly use paste() here
myFormulas <- paste(myPairs[1,],myPairs[2,],sep = "~")
# Store the models themselves into a list
# This lets you go back to the models later if you need something else
myModels <- lapply(X = myFormulas,FUN = lm,data = iris)
# If you use sapply() and this simple function, you get back a named vector
# This seems like it could be useful to what you want to do
myCoeffs <- sapply(X = myModels,FUN = function (x) {return(x$coefficients[2])})
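# Now, you can do this using vectorized operations
# sweep() scales each column by its own coefficient, matched by position
# (the coefficient names repeat across pairs, so don't index by name)
as.matrix(iris[myPairs[1,]]) - sweep(as.matrix(iris[myPairs[2,]]), 2, myCoeffs, "*")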
If I am understanding right, I believe the above will work. Note that the names on the output at present will be nonsensical, you would need to replace them with something of your own design (maybe the values of myFormulas).
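To get meaningful names on the output, one option is to compute each spread pair by pair and label the result with its formula. This is a minimal sketch in base R. The names pairs, formulas and spreads are my own, and the adf.test step is left out so that it runs without tseries.

```r
pairs    <- combn(names(iris)[1:4], 2)
formulas <- paste(pairs[1, ], pairs[2, ], sep = "~")

# One spread vector per pair: lhs - slope * rhs
spreads <- Map(function(lhs, rhs) {
  fit   <- lm(reformulate(rhs, response = lhs), data = iris)
  slope <- unname(coef(fit)[2])
  iris[[lhs]] - slope * iris[[rhs]]
}, pairs[1, ], pairs[2, ])

# Label each element with the formula it came from
names(spreads) <- formulas
str(spreads, give.attr = FALSE)
```

Each element can then be passed on individually, e.g. lapply(spreads, adf.test) once tseries is loaded.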

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different variables/vectors and each has about 80 entries. I'm trying to find significant correlations, which means I want to pick one variable and let R calculate all the correlations of that variable with the other 39 variables.
I tried to do this by using a linear model with one explanatory variable, that is: Y=a*X+b.
Then the lm() command gives me an estimate for a and the p-value of that estimate. I would then go on and use one of the other variables I have for X and try again until I find a p-value that's really small.
I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
This produces a correlogram of all the variables.
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from @Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (@Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
for (j in 1:m){
if ( !is.na(correlations$P[i,j])){
if ( correlations$P[i,j] < 0.05 ) {
print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
}
}
}
}
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that adjust for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
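The adjustment mentioned in the warning can be sketched with p.adjust from base R. This is only an illustration on the same kind of random demo data. It tests y against each other column with cor.test rather than using rcorr, and summary_table is a made-up name.

```r
set.seed(123)
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

# Raw p-value for the correlation of y with every other column
p_raw <- vapply(the_data[-1],
                function(x) cor.test(the_data$y, x)$p.value,
                numeric(1))

# Adjust for the 39 tests and sort by raw p-value
summary_table <- data.frame(
  variable     = names(p_raw),
  p_raw        = p_raw,
  p_bonferroni = p.adjust(p_raw, method = "bonferroni"),
  p_bh         = p.adjust(p_raw, method = "BH"),
  row.names    = NULL
)
summary_table <- summary_table[order(summary_table$p_raw), ]
head(summary_table)
```

Because the data are pure noise, any raw p-value below 0.05 here is a false positive, which is exactly what the adjusted columns are meant to guard against.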
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
the_data[, -1],
function(x)
{
cor(the_data[, 1], x)
},
numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you have to take the one that is most strongly correlated with y.
To do this, just use the command which.max(abs(cor(x,y))). If you want to use more than one variable in your model, then you have to consider something like the lasso estimator.
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

Problems with points and apply R for linear discriminant analysis

I have some coding questions that arose while doing some exercises in linear discriminant analysis. We are using the Iris data:
## Read in dataset, set seed, load package
Iris <- iris[,-(1:2)]
grIris <- as.integer(iris[,"Species"])
set.seed(16)
library(MASS)
## Read n
n <- nrow(Iris)
As you can see, we delete the first and second columns of iris. What I want to do is a bootstrap for this data using linear discriminant analysis. Here is my code:
ind <- replicate(B,sample(seq(1:n),n,replace=TRUE))
This generates the indices I want to use. Note B is some large number, e.g. 1000. Now I want to use apply, but why doesn't the following code work?
bst.sample <- apply(ind,2,lda(Species~Petal.Length+Petal.Width,data=Iris[ind,]))
where Species, Petal.Length etc. are the data from iris. If I use a for loop everything works fine, but of course I would like to implement this in a more elegant way.
My second question is about points. I also wanted to calculate the estimated means, which I've done by the following code
est.lda <- vector("list",B)
est.qda <- vector("list",B)
mu_hat_1 <- mu_hat_2 <- mu_hat_3 <- matrix(0,ncol=B,nrow=2)
for (i in 1:B){
est.lda[[i]] <- lda(Species~Petal.Length+Petal.Width,data=Iris[ind[,i],])
mu_hat_1[,i] <- est.lda[[i]]$means[1,]
mu_hat_2[,i] <- est.lda[[i]]$means[2,]
mu_hat_3[,i] <- est.lda[[i]]$means[3,]
est.qda[[i]] <- qda(Species~Petal.Length+Petal.Width,data=Iris[ind[,i],])
}
plot(mu_hat_1[1,],mu_hat_1[2,],pch=4)
points(mu_hat_2[1,],mu_hat_2[2,],pch=4,col=2)
points(mu_hat_3[1,],mu_hat_3[2,],pch=4,col=3)
The plot at the end should show three regions with the estimated means of the three classes. However, only the points from the first plot call are shown.
Thank you for your help.
B <- 10
ind <- replicate(B,sample(seq(1:n),n,replace=TRUE))
#you need to pass a function to apply
bst.sample <- apply(ind,2,
function(i) lda(Species~Petal.Length+Petal.Width,data=Iris[i,]))
#extract means
bst.means <- lapply(bst.sample,function(x) x$means)
#bind means into array
library(abind)
bst.means <- do.call(function(...) abind(..., along=3), bst.means)
#you need to make sure that all points are inside the axis limits
plot(bst.means[1,1,],bst.means[1,2,],
xlim=range(bst.means[,1,]), ylim=range(bst.means[,2,]),
xlab=dimnames(bst.means)[[2]][1],ylab=dimnames(bst.means)[[2]][2],
col=1)
points(bst.means[2,1,],bst.means[2,2,], col=2)
points(bst.means[3,1,],bst.means[3,2,], col=3)
legend("topleft", legend=dimnames(bst.means)[[1]], col=1:3, pch=1)
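If you would rather avoid the abind dependency, the bootstrap means can also be stacked with simplify2array from base R. This is a minimal sketch, assuming MASS is installed and using a small B for speed. The names fits and means are my own.

```r
library(MASS)

set.seed(16)
Iris <- iris[, -(1:2)]
n <- nrow(Iris)
B <- 10

# One column of resampled row indices per bootstrap replicate
ind <- replicate(B, sample.int(n, n, replace = TRUE))

# Fit one LDA per replicate; the function receives the replicate number
fits <- lapply(seq_len(B), function(b)
  lda(Species ~ Petal.Length + Petal.Width, data = Iris[ind[, b], ]))

# Stack the 3 x 2 matrices of class means into a 3 x 2 x B array
means <- simplify2array(lapply(fits, function(f) f$means))
dim(means)
```

The array is indexed as means[class, variable, replicate], so means[2, 1, ] gives the bootstrap distribution of the second class's mean Petal.Length.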
