I'm trying to interpret the results I get when cross-validating data for a k-nearest neighbors model. My data set is set up like this:
variable1(int) | variable2(int) | variable3(int) | variable4 (int) | Response (factor)
I split my data 80% into cvdata and 20% for testing once I choose my model.
A single iteration of my code is below:
cv <- cv.kknn(formula = Response~., cvdata, kcv = 10, k = 7, kernel = 'optimal', scale = TRUE)
cv
When I run 'cv' it just returns a list containing what look like random numbers as the row names, plus the observed outcome variable (y) and the predicted outcome variable (yhat). I'm trying to calculate some sort of accuracy against the test set. Should I be comparing y to yhat to validate?
EDIT: output added below
[[1]]
y yhat
492 1 0.724282776
654 0 0.250394372
427 0 0.125159894
283 0 0.098561768
218 1 0.409990851
[[2]]
[1] 0.2267058 0.1060212
The first element in [[2]] is mean absolute error, the second is mean squared error.
Suppose df <- as.data.frame(cv[[1]]) is that output as a dataframe; then those values can easily be verified by mean(abs(df$y - df$yhat)) and mean((df$y - df$yhat)^2).
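Since your Response is a factor, you may want a classification accuracy rather than MAE/MSE. A minimal sketch, assuming a binary 0/1 response and a 0.5 cutoff (both assumptions on my part):
df <- as.data.frame(cv[[1]])
pred <- ifelse(df$yhat > 0.5, 1, 0)        # threshold the cross-validated predictions
mean(pred == df$y)                         # proportion classified correctly
table(observed = df$y, predicted = pred)   # confusion matrix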
I am using LASSO (from package glmnet) to select variables. I have fitted a glmnet model and plotted coefficients against lambda's.
library(glmnet)
set.seed(47)
x = matrix(rnorm(100 * 3), 100, 3)
y = rnorm(100)
fit = glmnet(x, y)
plot(fit, xvar = "lambda", label = TRUE)
Now I want to get the order in which the coefficients become 0. In other words, at what lambda does each coefficient become 0?
I can't find a function in glmnet to extract such a result. How can I get it?
Function glmnetPath in my initial answer is now in an R package called solzy.
## you may need to first install package "remotes" from CRAN
remotes::install_github("ZheyuanLi/solzy")
## Zheyuan Li's R functions on Stack Overflow
library(solzy)
## use function `glmnetPath` for your example
glmnetPath(fit)
#$enter
# i j ord var lambda
#1 3 2 1 V3 0.15604809
#2 2 19 2 V2 0.03209148
#3 1 24 3 V1 0.02015439
#
#$leave
# i j ord var lambda
#1 1 23 1 V1 0.02211941
#2 2 18 2 V2 0.03522036
#3 3 1 3 V3 0.17126258
#
#$ignored
#[1] i var
#<0 rows> (or 0-length row.names)
Interpretation of enter
As lambda decreases, variables (see i for numeric ID and var for variable names) enter the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 3 enters the model at lambda = 0.15604809, the 2nd value in fit$lambda;
variable 2 enters the model at lambda = 0.03209148, the 19th value in fit$lambda;
variable 1 enters the model at lambda = 0.02015439, the 24th value in fit$lambda.
Interpretation of leave
As lambda increases, variables (see i for numeric ID and var for variable names) leave the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 1 leaves the model at lambda = 0.02211941, the 23rd value in fit$lambda;
variable 2 leaves the model at lambda = 0.03522036, the 18th value in fit$lambda;
variable 3 leaves the model at lambda = 0.17126258, the 1st value in fit$lambda.
Interpretation of ignored
If not an empty data.frame, it lists variables that never enter the model. That is, they are effectively ignored. (Yes, this can happen!)
Note: fit$lambda is decreasing, so j is in ascending order in enter but in descending order in leave.
To further explain indices i and j, take variable 2 as an example. It leaves the model (i.e., its coefficient becomes 0) at j = 18 and enters the model (i.e., its coefficient becomes non-zero) at j = 19. You can verify this:
fit$beta[2, 1:18]
## all zeros
fit$beta[2, 19:ncol(fit$beta)]
## all non-zeros
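If you only need the enter order and would rather not install an extra package, here is a minimal sketch working directly on fit$beta (my own construction, not the solzy implementation):
B <- as.matrix(fit$beta) != 0                    # TRUE where a coefficient is non-zero
enter_j <- apply(B, 1, function(z) which(z)[1])  # first lambda index with a non-zero coef (NA = never enters)
enter_j <- sort(enter_j)                         # sort() drops the NAs, i.e. the ignored variables
data.frame(var = names(enter_j), j = enter_j, lambda = fit$lambda[enter_j])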
See Obtain variable selection order from glmnet for a more complicated example.
I'm new here and new to R. I'm wondering whether I've used the R survey package correctly to postStratify my data. Below you can see the data structure of my dataframe (df).
utype | gender | age | regzeit | finanz | sfeld | sindex
pri   | female | 23  | ja      | s      | ARG   | 5
sta   | male   | 23  | nein    | f      | ARG   | -7
sta   | female | 21  | ja      |        | ARG   | 11
pri   | male   | 28  | ja      | t      | ARG   | 1
I've oversampled females on the "gender" variable and students on the "utype" variable, and now want to adjust for the distribution in the population. My n = 383 sample was oversampled to n = 477.
The intended distributions of my n = 383 sample are:
utype | male | female | Sum
pri   | 54   | 68     | 122
sta   | 128  | 133    | 261
Sum   | 187  | 196    | 383
design <- svydesign(id = ~utype+gender, data = df)
Warning message:
In svydesign.default(id = ~utype + gender, data = df) :
No weights or probabilities supplied, assuming equal probability
pop.types <- data.frame(utype=c("sta","pri"), Freq=c(261,122))
designp <- postStratify(design, ~utype, pop.types)
svymean(~sindex, design)
mean | SE
sindex 0.48008 | 0.0192
svymean(~sindex, designp)
mean | SE
sindex 0.47692 | 0
My question now is whether the above code is correct, and how I can postStratify for both variables utype and gender, or whether I have to run the postStratify command twice. I'm especially concerned about the standard error of zero in my weighted sample and about the warning message. Also, are the Freq values correct for what I'm trying to do here?
The last thing I've been trying to figure out is how to use the svymean, svyhist or svyboxplot functions for "sindex", but only for the observations with utype == "pri", so by group basically. This should all be applied to the weighted sindex values.
I hope I'm conforming to all the rules. Many thanks in advance!
You don't want to post-stratify twice (that would give you raking). You want to post-stratify once, using a post-stratum variable that is the combination of your two variables gender and utype, as in your 2x2 table. That is,
design <- update(design, combined_var = interaction(gender, utype))
You now specify a pop.types= argument that has the desired frequencies for each of the four levels of this new variable.
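For example, something like the following (pop.types2 and its level names are my construction; interaction() joins the gender and utype levels with a dot, and the Freq values come from your 2x2 table):
pop.types2 <- data.frame(
  combined_var = c("male.pri", "female.pri", "male.sta", "female.sta"),
  Freq = c(54, 68, 128, 133)
)
designp <- postStratify(design, ~combined_var, pop.types2)
svymean(~sindex, designp)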
With only the four observations you list, you will end up with zero standard errors because there isn't any variation within any of the post-strata.
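As for the by-group part of your question, a sketch using svyby() and subset() from the survey package (applied to the post-stratified design):
svyby(~sindex, ~utype, designp, svymean)           # weighted mean of sindex per utype group
svymean(~sindex, subset(designp, utype == "pri"))  # pri group only
svyhist(~sindex, subset(designp, utype == "pri"))  # weighted histogram for pri
svyboxplot(sindex ~ utype, designp)                # weighted boxplots by group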
I have carried out a Wilcoxon rank sum test to see if there is any significant difference between the expression of 598019 genes in three disease samples vs three control samples. I am in R.
When I check how many genes have a p-value < 0.05, I get 41913 altogether. I set the parameters of the Wilcoxon test as follows:
wilcox.test(currRow[4:6], currRow[1:3], paired=F, alternative="two.sided", exact=F, correct=F)$p.value
(This is within an apply function, and I can provide my total code if necessary; I was a little unsure as to whether alternative="two.sided" was correct.)
However, since I assumed that correcting for multiple comparisons using the Benjamini-Hochberg false discovery rate would lower this number, I then adjusted the p-values via the following code:
pvaluesadjust1 <- p.adjust(pvalues_genes, method="BH")
Re-assessing which p values are less than 0.05 via the below code, I get 0!
p_thresh1 <- 0.05
names(pvaluesadjust1) <- rownames(gene_analysis1)
output <- names(pvaluesadjust1)[pvaluesadjust1 < p_thresh1]
length(output)
I would be grateful if anybody could please explain, or direct me to somewhere which can help me understand what is going on!
Thank-you
(As an extra question: would a t-test be fine given the size of the data? The Anderson-Darling test showed that the underlying data are not normal, and I got far fewer genes below 0.05 with that test than with the Wilcoxon (around 2000).)
The Wilcoxon is a non-parametric test based on ranks. If you have only 6 samples, the best result you can get is ranks 2,2,2 in disease versus 5,5,5 in control, or vice versa.
For example, run the parameters you used in your test on the random values below, and you will see that you get the same p-value, 0.02534732, both times.
wilcox.test(c(100,100,100),c(1,1,1),exact=F, correct=F)$p.value
wilcox.test(c(5,5,5),c(15,15,15),exact=F, correct=F)$p.value
So yes, with 598019 genes you can get 41913 raw p-values < 0.05, but these p-values are not low enough: after FDR adjustment, none will ever pass.
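You can check this with your own numbers: even in the best case, where all 41913 hits sit at the smallest achievable p-value, the BH adjustment cannot bring any of them below 0.05.
p <- rep(1, 598019)                # all other genes at p = 1
p[1:41913] <- 0.02534732           # best case: every hit at the floor p-value
min(p.adjust(p, method = "BH"))    # about 0.36, far above 0.05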
You are using the wrong test. To answer your second question, a t-test does not work well here because you don't have enough samples to estimate the standard deviation correctly. Below I show an example using DESeq2 to find differential genes.
library(zebrafishRNASeq)
data(zfGenes)
# remove spikeins
zfGenes = zfGenes[-grep("^ERCC", rownames(zfGenes)),]
head(zfGenes)
Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000000001 304 129 339 102 16 617
ENSDARG00000000002 605 637 406 82 230 1245
The first three are controls, the last three are treatment, like your dataset. To validate what I said before, you can see that if you run wilcox.test on every gene, the minimum p-value is 0.02534732:
all_pvalues = apply(zfGenes,1,function(i){
wilcox.test(i[1:3],i[4:6],exact=F, correct=F)$p.value
})
min(all_pvalues,na.rm=T)
# returns 0.02534732
So we proceed with DESeq2
library(DESeq2)
#create a data.frame to annotate your samples
DF = data.frame(id=colnames(zfGenes),type=rep(c("ctrl","treat"),each=3))
# run DESeq2
dds = DESeqDataSetFromMatrix(zfGenes,DF,~type)
dds = DESeq(dds)
summary(results(dds),alpha=0.05)
out of 25839 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 69, 0.27%
LFC < 0 (down) : 47, 0.18%
outliers [1] : 1270, 4.9%
low counts [2] : 5930, 23%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
So you do get hits which pass the FDR cutoff. Lastly, we can pull out the list of significant genes:
res = results(dds)
res[which(res$padj < 0.05),]
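If you want the strongest hits first, a small follow-up (my own addition) ordering by adjusted p-value:
res_sig <- res[which(res$padj < 0.05), ]
res_sig[order(res_sig$padj), ]  # most significant genes at the top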
I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically this isn't a problem, but I don't know an efficient method to do it in R. Take for example this fake dataframe and regression model. I attempt to assign predictions in the source dataframe but am unable to do so because of one missing Y value: I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip <- predict.lm(model)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions using algebra, df$y_ip <- B0 + B1*df$x, or by calling the coefficients of the model, df$y_ip <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1]*df$x; however, I am now working with a big data model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R (though it's not necessarily obvious): the na.action argument and the na.exclude function (see ?na.exclude). With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
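Since the prediction vector now has the same length as the data, the assignment from the question works directly:
df$y_ip <- predict(mod2)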
Actually, you are not using the predict.lm function correctly.
Either way, you have to pass the model itself, here model, as its first argument, with or without new data. Without new data, it will only predict on the training data, thus excluding your NA row, and you need this workaround to fill the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or explicitly specifying some new data. Since the new x has one more row than the training x it will fill the missing row with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)
I am trying to build a for() loop to manually conduct leave-one-out cross validations for a binomial GLMM fit with the lme4 pkg. I need to remove an individual, fit the model, use the beta coefficients to predict a response for the individual that was withheld, and repeat the process for all individuals.
I have created some test data to tackle the first step of simply leaving an individual out, fitting the model and repeating for all individuals in a for() loop.
The data have a binary (0,1) Response, an IndID that classifies 4 individuals, a Time variable, and a Binary variable. There are N=100 observations. The IndID is fit as a random effect.
require(lme4)
#Make data
Response <- round(runif(100, 0, 1))
IndID <- as.character(rep(c("AAA", "BBB", "CCC", "DDD"),25))
Time <- round(runif(100, 2,50))
Binary <- round(runif(100, 0, 1))
#Make data.frame
Data <- data.frame(Response, IndID, Time, Binary)
Data <- Data[with(Data, order(IndID)), ] #**Edit**: Added code to sort by IndID
#Look at head()
head(Data)
Response IndID Time Binary
1 0 AAA 31 1
2 1 BBB 34 1
3 1 CCC 6 1
4 0 DDD 48 1
5 1 AAA 36 1
6 0 BBB 46 1
#Build model with all IndID's
fit <- glmer(Response ~ Time + Binary + (1|IndID), data = Data,
             family = binomial)  # glmer() fits the binomial GLMM (lmer() no longer takes a family argument)
summary(fit)
As stated above, my hope is to get four model fits, one with each IndID left out, in a for() loop. This is a new type of application of the for() command for me, and I quickly reached the limit of my coding abilities. My attempt is below.
fit <- list()
for (i in Data$IndID){
fit[[i]] <- glmer(Response ~ Time + Binary + (1|IndID), data = Data[-i],
                  family = binomial)
}
I am not sure storing the model fits as a list is the best option, but I had seen it on a few other help pages. The above attempt results in the error:
Error in -i : invalid argument to unary operator
If I remove the [-i] subset from the data = Data argument, the code runs four fits, but the data for each individual are not removed.
Just as an FYI, I will need to further expand the loop to:
1) extract the beta coefs, 2) apply them to the X matrix of the individual that was withheld and lastly, 3) compare the predicted values (after a logit transformation) to the observed values. As all steps are needed for each IndID, I hope to build them into the loop. I am providing the extra details in case my planned future steps inform the more intimidate question of leave-one-out model fits.
Thanks as always!
The problem you are having is that Data[-i] expects i to be an integer index. Instead, i is "AAA", "BBB", "CCC" or "DDD". To fix the loop, set
data = Data[Data$IndID != i, ]
in your model fit.
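Note also that for (i in Data$IndID) iterates over all 100 rows rather than over the four individuals; looping over unique(Data$IndID) fixes that. A minimal sketch of the corrected loop (glmer() and unique() are my small changes to your code):
fit <- list()
for (i in unique(Data$IndID)) {
  # drop all rows belonging to individual i, then fit on the remainder
  fit[[i]] <- glmer(Response ~ Time + Binary + (1 | IndID),
                    data = Data[Data$IndID != i, ],
                    family = binomial)
}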