Calculate many AUCs in R

I am fairly new to R. I am using the ROCR package to calculate the AUC, which works fine for a single predictor. What I am looking to do is perform AUC calculations for 100 different variables.
What I have done so far is the following:
varlist <- names(mydata)[2:101]
formlist <- lapply(varlist, function(x) paste0("prediction(",x,"mydata$V1))
However, the formulas are then just text, and as.formula is giving me an error. Any help appreciated! Thanks in advance!

The function inside your lapply looks like it is just outputting a statement like prediction(varmydata$V1). I am guessing you actually want to run that command. If so, you probably want something like
lapply(varlist, function(x) prediction(mydata[[x]], mydata$V1))
but it is hard to tell without a reproducible situation. Also, it looks like your code has a missing quote.
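In any case, note that prediction() takes the scores as its first argument and the labels as its second, and the AUC then comes from performance(). A minimal sketch, assuming the columns named in varlist hold scores and mydata$V1 holds the binary labels (that reading of your data is a guess):
library(ROCR)
# one AUC per predictor column, scored against the labels in mydata$V1
aucs <- sapply(varlist, function(x) {
  pred <- prediction(mydata[[x]], mydata$V1)
  performance(pred, "auc")@y.values[[1]]
})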

If I understand you correctly, you want to use the first column of mydata as predictions, and all other variables as labels, one after the other.
Is this the correct way to treat mydata? This setup is rather uncommon. It is more common to have the same true labels for several different predictions (e.g. iterated cross-validation, comparison of different classifiers).
However, to answer your original question:
predictions and labels need to have the same shape for ROCR::prediction, e.g.
either as matrices
prediction(matrix(rep(mydata$V1, ncol(mydata) - 1), ncol = ncol(mydata) - 1), mydata[, -1])
or as lists:
prediction(mydata[rep(1, ncol(mydata) - 1)], mydata[-1])
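Either way you get a single prediction object covering all columns, and the AUCs can then be extracted in one go; a sketch, assuming ROCR is loaded:
pred <- prediction(mydata[rep(1, ncol(mydata) - 1)], mydata[-1])
unlist(performance(pred, "auc")@y.values)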

Related

How do I create a simple linear regression function in R that iterates over the entire dataframe?

I am working through ISLR and am stuck on a question. Basically, I am trying to create a function that iterates through an entire dataframe. It is question 3.7, 15a.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
So my thinking is like this:
y = Boston$crim
x = Boston[, -crim]
TestF1 = lm(y ~ x)
summary(TestF1)
But this is nowhere near the right answer. I was hoping to break it down by:
Iterate over the entire dataframe with crim as my response and the others as predictors
Extract the p values that are statistically significant (or extract the ones insignificant)
Move on to the next question (which is considerably easier)
But I am stuck. I've googled but can't find anything. I tried this combn(Boston) thing but it didn't work either. Please help, thank you.
If your problem is to iterate over a data frame, here is an example for mtcars (mpg is the target variable, and the rest are predictors, assuming models with a single predictor). The idea is to generate strings and convert them to formulas:
lms <- vector(mode = "list", length = ncol(mtcars)-1)
for (i in seq_along(lms)){
lms[[i]] <- lm(as.formula(paste0("mpg~",names(mtcars)[-1][i])), data = mtcars)
}
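From there, extracting the p-values (the second step in your list) is just another pass over the fitted models; a sketch reusing the lms list from above:
# slope p-value from each simple regression, named by predictor
pvals <- sapply(lms, function(m) summary(m)$coefficients[2, "Pr(>|t|)"])
names(pvals) <- names(mtcars)[-1]
pvals[pvals < 0.05]  # the statistically significant predictors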
If you want to look at each and every variable combination, start with a model with all variables and then eliminate non-significant predictors to find the best model.

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a google search, came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R? ); some apply to images. However I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy solutions (imageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r<-seq(0,360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience in signal processing!
Thanks.
One option may be using a non-linear prediction method and taking the fitted values from the model.
For example, fitting a polynomial regression to a single column recovers a curve close to the original signal.
Following the same logic, you can do the same thing for all columns of the zz matrix:
predictions <- matrix(, nrow = nrow(zz), ncol = 0)
for(i in 1:ncol(zz)) {
  # fit a quadratic polynomial over the row index and keep the fitted values
  pred <- as.matrix(fitted(lm(zz[,i] ~ poly(1:nrow(zz), 2, raw = TRUE))))
  predictions <- cbind(predictions, pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns. Or you can use some other, more powerful non-linear prediction method (SVM, ANN, etc.) to get a more accurate model.
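Another lightweight option, sticking to base R, is a plain moving-average (box) filter: replace each cell by the mean of its k x k neighbourhood. A sketch, where the lowpass helper and the window size k are my own names, not from any package:
lowpass <- function(m, k = 9) {
  # replace each cell by the mean of its k x k neighbourhood,
  # shrinking the window at the matrix borders
  pad <- (k - 1) %/% 2
  n <- nrow(m); p <- ncol(m)
  out <- m
  for (i in 1:n) {
    for (j in 1:p) {
      ri <- max(1, i - pad):min(n, i + pad)
      rj <- max(1, j - pad):min(p, j + pad)
      out[i, j] <- mean(m[ri, rj])
    }
  }
  out
}
image(lowpass(zz), main = "Box-filtered")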

Bootstrap-t Method for Comparing Trimmed Means in R

I am confused by the different robust methods for comparing independent means. I found a good explanation in statistical textbooks, for example yuen() in the case of equal sample sizes. My sample sizes are rather unequal, thus I would like to try a bootstrap-t method (from Wilcox's book Introduction to Robust Estimation and Hypothesis Testing, p. 163). It says yuenbt() would be a possible solution.
But all textbooks say I can use vectors here:
yuenbt(x,y,tr=0.2,alpha=0.05,nboot=599,side=F)
If I check the local documentation, it says:
yuenbt(formula, data, tr = 0.2, nboot = 599)
What's wrong with my attempt:
x <- c(1,2,3)
y <- c(5,6,12,30,2,2,3,65)
yuenbt(x,y)
Why can't I use the yuenbt function with my two vectors? Thank you very much!
Looking at the help for yuenbt (for those wondering, yuenbt is from the package WRS2), it takes a formula and a data frame as arguments. My impression is that it expects data in long format.
With your example data, we can achieve that like so:
library(WRS2)
x <- c(1,2,3)
y <- c(5,6,12,30,2,2,3,65)
dat <- data.frame(value=c(x,y),group=rep(c("x","y"), c(length(x),length(y))))
We can then use the function:
yuenbt(value~group, data=dat)
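The trimming proportion and bootstrap count from the signature above can be passed in the same call; the values here are just illustrative:
yuenbt(value ~ group, data = dat, tr = 0.2, nboot = 2000)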

Where does R store subset information in glm object?

I'm trying to do some post-processing of a large number of glm models that I am working with, but I need to extract information about the data subset from the glm objects.
As a toy example:
x <- rnorm(100)
y <- rnorm(100,x,0.5)
s<-sample(c(T,F),100,replace=T)
myGlm <- glm(y~x, subset= s)
From this, I need to know which of the 100 observations were used by getting the information out of myGlm. I thought that myGlm$data would have the subsetted data, but it actually has all 100 observations in it. I looked through str(myGlm) to no avail. However, it is quite clear that somewhere in the object, information about the subset s is stored.
This seems like it should be totally trivial!
Thanks in advance.
The model frame stored in the fit keeps only the rows that survived the subset, and its row names are the original row indices:
as.numeric(rownames(myGlm$model))
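As a quick sanity check (a sketch reusing the toy example above), the recovered indices should match the logical subset:
used <- as.numeric(rownames(myGlm$model))
all(used == which(s))  # TRUE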

Using outer() with predict()

I am trying to use the outer function with predict in some classification code in R. For ease, we will assume in this post that we have two vectors named alpha and beta each containing ONLY 0 and 1. I am looking for a simple yet efficient way to pass all combinations of alpha and beta to predict.
I have constructed the code below to mimic the lda function from the MASS library, so rather than "lda", I am using "classifier". It is important to note that the prediction method within predict depends on an (alpha, beta) pair.
Of course, I could use a nested for loop to do this, but I am trying to avoid this method.
Here is what I would like to do ideally:
alpha <- seq(0, 1)
beta <- seq(0, 1)
classifier.out <- classifier(training.data, labels)
outer(X=alpha, Y=beta, FUN="predict", classifier.out, validation.data)
This is a problem because alpha and beta are not the first two parameters in predict.
So, in order to get around this, I changed the last line to
outer(X=alpha, Y=beta, FUN="predict", object=classifier.out, data=validation.data)
Note that my validation data has 40 observations, and also that there are 4 possible pairs of alpha and beta. I get an error though saying
dims [product 4] do not match the length of object [40]
I have tried a few other things, some of which work but are far from simple. Any suggestions?
The problem is that outer expects its function to be vectorized (i.e., it will call predict ONCE with a vector of all the arguments it wants executed). Therefore, when predict is called once, returning its result (which happens to be of length 4), outer complains because it doesn't equal the expected 40.
One way to fix this is to use Vectorize. Untested code:
outer(X=alpha, Y=beta, FUN=Vectorize(predict, vectorize.args=c("alpha", "beta")), object=classifier.out, data=validation.data)
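To see the mechanics with something self-contained, here is a toy sketch, where f stands in for the scalar-only predict method: outer hands the function full-length vectors, so it fails until wrapped in Vectorize:
f <- function(a, b, scale = 1) {
  stopifnot(length(a) == 1, length(b) == 1)  # scalar-only, like predict here
  scale * (a + 2 * b)
}
# outer(0:1, 0:1, f, scale = 10)  # error: f receives length-4 vectors
outer(0:1, 0:1, Vectorize(f, c("a", "b")), scale = 10)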
I figured out one decent way to do this. Here it is:
pairs <- expand.grid(alpha, beta)
names(pairs) <- c("alpha", "beta")
mapply(predict, pairs$alpha, pairs$beta,
MoreArgs=list(object=classifier.out, data=validation.data))
Anyone have something simpler and more efficient? I am very eager to know because I spent a little too long on this problem. :(
