Error relating to number of rows using basehaz in prediction modelling?

I am developing a prediction model using 524 records with complete data.
The code below is a sample where everything runs smoothly; I cannot share the data I am using, as it is protected, but I am using the same code.
library(survival)

test1 <- list(time   = c(4, 3, 1, 1, 2, 2, 3),
              status = c(1, 1, 1, 0, 0, 1, 0),
              x      = c(0, 2, 1, 1, 1, 0, 0),
              sex    = c(0, 0, 0, 0, 1, 1, 1))
a <- coxph(Surv(time, status) ~ x + sex, test1, x = TRUE)
b <- basehaz(a, centered = FALSE)
exp(-b$hazard[test1$time])^exp(a$coef[1] * test1$x + a$coef[2] * test1$sex)
In that last line of code, exp(-b$hazard[test1$time]) returns 518 results, while exp(a$coef[1]*test1$x + a$coef[2]*test1$sex) returns 524.
-b$hazard alone has length 518, and test1$time alone has length 524.
If I run the entire line on my data, 518 results come back, with the warning message:
"In exp(-b$hazard[complete$DF.Days_ACCT_followup])^exp(a$coef[1] * :
longer object length is not a multiple of shorter object length"
I cannot figure out why this is happening, and would appreciate any help.
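A likely source of the mismatch (worth checking against your data): basehaz() returns one row per unique event time, not one row per observation, so b$hazard[test1$time] uses the time values as positional indices instead of looking up each observation's time. A minimal sketch on the toy data above, using match() for the lookup (match() here is my assumption about what was intended):

```r
library(survival)

test1 <- list(time   = c(4, 3, 1, 1, 2, 2, 3),
              status = c(1, 1, 1, 0, 0, 1, 0),
              x      = c(0, 2, 1, 1, 1, 0, 0),
              sex    = c(0, 0, 0, 0, 1, 1, 1))

a <- coxph(Surv(time, status) ~ x + sex, test1, x = TRUE)
b <- basehaz(a, centered = FALSE)

# basehaz() has one row per unique event time, so with repeated times
# it is shorter than the number of observations:
nrow(b)
length(test1$time)

# Look up each observation's cumulative baseline hazard by matching on
# time, rather than using the raw time values as positions:
H0 <- b$hazard[match(test1$time, b$time)]
surv <- exp(-H0)^exp(a$coef[1] * test1$x + a$coef[2] * test1$sex)
length(surv)  # one survival estimate per observation
```

With 524 records but only 518 distinct follow-up times, this would produce exactly the 518-vs-524 discrepancy described.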


Error in dataframe *tmp* replacement has x data has y

I'm a beginner in R. Here is a very simple code where I'm trying to save the residual term:
# Create variables for child's EA:
dat$cldeacdi <- rowMeans(dat[,c('cdcresp', 'cdcinv')],na.rm=T)
dat$cldeacu <- rowMeans(dat[,c('cucresp', 'cucinv')],na.rm=T)
# Create a residual score for child EA:
dat$cldearesid <- resid(lm(cldeacu ~ cldeacdi, data = dat))
I'm getting the following message:
Error in `$<-.data.frame`(`*tmp*`, cldearesid, value = c(-0.18608488908881, :
replacement has 366 rows, data has 367
I searched for this error but couldn't find anything that could resolve this. Additionally, I've created the exact same code for mom's EA, and it saved the residual just fine, with no errors. I'd be grateful if someone could help me resolve this.
I have a feeling you have NAs in your data. Look at this example:
#mtcars data set
test <- mtcars
#adding just one NA in the cyl column
test[2, 2] <- NA
#running linear model and adding the residuals to the data.frame
test$residuals <- resid(lm(mpg ~ cyl, test))
Error in `$<-.data.frame`(`*tmp*`, "residuals", value = c(0.382245430809409, :
replacement has 31 rows, data has 32
As you can see this results in a similar error to yours.
As a validation:
length(resid(lm(mpg ~ cyl, test)))
#31
nrow(test)
#32
This happens because lm runs na.omit on the data set prior to fitting the regression, so any rows containing NA are dropped, leaving fewer residuals than rows in the original data frame.
If you run na.omit on your dat data set (i.e. dat <- na.omit(dat)) at the very beginning of your code, then your code should work.
This is an old thread, but maybe this can help someone else facing the same issue. To LyzandeR's point, check for NA's as a first line of defense. In addition, make sure that you don't have any factors in x, as this can also cause the error.
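Alternatively, if you would rather keep the incomplete rows than drop them up front, na.exclude makes resid() pad its result with NA so it lines up with the original data frame again. A minimal sketch on the same mtcars example:

```r
test <- mtcars
test[2, 2] <- NA  # one NA in the cyl column, as before

# With na.action = na.exclude, lm() still drops the NA row for fitting,
# but resid() pads the dropped position with NA, so the length matches:
fit <- lm(mpg ~ cyl, data = test, na.action = na.exclude)
test$residuals <- resid(fit)

length(resid(fit))        # 32, same as nrow(test)
is.na(test$residuals[2])  # TRUE for the row with the missing predictor
```

This avoids shrinking the whole data set just to store one residual column.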

R - RandomForest with two Outcome Variables

Fairly new to using the randomForest package here.
I'm trying to run a model with 2 response variables and 7 predictor variables, but I can't, seemingly because of the lengths of the response variables and/or the nature of fitting a model with 2 response variables.
Let's assume this is my data and model:
> table(data$y1)
 0  1  2  3  4
23 43 75 47 21
> length(data$y1)
[1] 209
> table(data$y2)
  0   2   3   4
104  30  46  29
> length(data$y2)
[1] 209
m1<-randomForest(cbind(y1,y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
When I run this model, I receive this error:
Error in randomForest.default(m, y, ...) :
length of response must be the same as predictors
I did some troubleshooting and found that cbind() on the two response variables simply stacks their values together, doubling the original length and possibly resulting in the above error. As an example,
length(cbind(y1, y2))
[1] 418
sapply(data, length)
  a   b   c   d   e   f   g  y1  y2
209 209 209 209 209 209 209 209 209
I then tried to solve this issue by running randomForest individually on each response variable and then applying combine() to the regression models, but came across these issues:
m2<-randomForest(y1~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m3<-randomForest(y2~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m2,m3)
Warning message:
In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I then decided to treat the randomForest models as classification models, applying as.factor() to both response variables before running randomForest, but then came across this new issue:
m4<-randomForest(as.factor(y1)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m5<-randomForest(as.factor(y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m4,m5)
Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
non-conformable arrays
My guess is that I can't combine() classification models.
I hope this inquiry about running a multivariate Random Forest model makes sense. Let me know if there are further questions; I can also go back and make adjustments.
Combine your columns outside the randomForest formula:
data[["y3"]] <- paste0(data$y1, data$y2)
randomForest(y3~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
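The combined-label approach can be sketched end to end on simulated data shaped like the question's (the predictor names, sizes, and outcome distributions here are assumptions mirroring the question, not the real data):

```r
library(randomForest)
set.seed(1)

# Hypothetical data: two ordinal outcomes y1/y2 and 7 numeric predictors
n <- 209
data <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n),
                   e = rnorm(n), f = rnorm(n), g = rnorm(n),
                   y1 = sample(0:4, n, replace = TRUE),
                   y2 = sample(c(0, 2, 3, 4), n, replace = TRUE))

# Paste the two outcomes into one composite class label ...
data[["y3"]] <- factor(paste0(data$y1, data$y2))

# ... and fit a single classification forest on the composite label
m <- randomForest(y3 ~ a + b + c + d + e + f + g, data,
                  mtry = 7, importance = TRUE)
```

One trade-off to be aware of: the composite label treats every (y1, y2) pair as a distinct class, so combinations that never occur in the training data can never be predicted.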

Df without NA gives error: Error in cov.wt(y, wt = wt) : 'x' must contain finite values only

I'm using the package relaimpo to calculate the proportional contribution of each covariate to the R2 of a full linear model. Recently I updated this package, which now confronts me with the following error message (reproduced below with a small test data set):
SUBJECT Days Volume
P003 1 51640.33
P003 211 55109.29
P004 1 10259.38
P004 140 10269.75
P004 252 10526.75
P004 364 8560.62
P007 1 177.38
P007 368 266.65
> library(relaimpo)
> Full_model <- lm(Volume~SUBJECT+Days, data=Test)
> calc.relimp(Full_model, diff = T, rela = T)
Error in cov.wt(y, wt = wt) : 'x' must contain finite values only
I already checked the posts on this website. They all seem to suggest that this is due to missing values. But even when I remove the missing values from the dataframe (as in this test data set), I still get this error message. Also, when I additionally define x=NULL in the call or add instructions for how to handle NA values, I still get the same error.
Does anybody have an idea how I can solve this?
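A diagnostic sketch (not a confirmed fix): before calling calc.relimp, it can help to inspect what lm() actually built from this data, since the factor SUBJECT is expanded into dummy columns and very few rows per level leave little residual information for the covariance computations. Assumes Test and Full_model as defined in the question:

```r
Full_model <- lm(Volume ~ SUBJECT + Days, data = Test)

# The factor SUBJECT becomes dummy columns in the design matrix:
X <- model.matrix(Full_model)
any(!is.finite(X))       # TRUE would directly explain the cov.wt error

# With only a handful of rows per SUBJECT level, the fit may be nearly
# saturated, which can make the downstream computations degenerate:
df.residual(Full_model)
```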

Using lm() in R in data with many zeroes gives error

I'm new to data analysis, and I have a couple of questions about using lm() in R to create a linear regression model of my data.
My data looks like this:
testID userID timeSpentStudying testGrade
12345 007 10 90
09876 008 0 75
And my model:
model <- lm(testGrade ~ timeSpentStudying, data = data)
I'm getting the following warning (twice), on just under 60 rows of data, in RStudio:
Warning messages:
1: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced
2: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced
My question is: does the problem have to do with the data containing many zero values, such as in the 'timeSpentStudying' column above? If so, how do I handle that? Shouldn't lm() be able to handle values of zero, especially if they are meaningful in the data itself?
Thanks!
So far I have been unable to replicate this, e.g.:
dd <- data.frame(y=rnorm(1000),x=c(rep(0,990),1:10))
model <- lm(y~x, data = dd)
summary(model)
Searching the R code base for the code listed in your error and tracing back indicates that the relevant lines are in plot.lm, the function that plots regression diagnostics, and that the problem is that you are somehow getting a value greater than 1 for the leverage or "hat value" of one of your data points. However, I can't see how you could be achieving that. Sharing your data would make this much clearer!
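To inspect the leverages directly, a quick sketch building on the example above:

```r
# Reproduce the setup and look at the leverage ("hat") values
dd <- data.frame(y = rnorm(1000), x = c(rep(0, 990), 1:10))
model <- lm(y ~ x, data = dd)

h <- hatvalues(model)
range(h)   # every value should lie strictly between 0 and 1
sum(h)     # the leverages sum to the number of coefficients (here 2)

# plot.lm computes sqrt(crit * p * (1 - hh)/hh), which produces NaN when
# a leverage hh falls outside (0, 1) -- so this is the thing to check:
any(h >= 1)
```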

How to overcome "Error in .local(object, ...) : test vector does not match model !"?

I removed 100 records from the original data set, then rebuilt an SVM model using the following code.
uk<-read.csv("Riskx.csv", header=TRUE, sep=",")
attach(uk)
library(e1071)
library(kernlab)
index<-1:nrow(uk)
testindex<-sample(index, trunc(length(index)/3))
testset<-uk[testindex,]
trainset<-uk[-testindex,]
model <- ksvm(Risk ~ ., data = trainset, type = "nu-svc")
pred<-predict(model, testset)
table(pred, testset$Risk)
summary(testset$Risk)
Now I want to bring in those 100 records I set aside from training/testing the new model and check how well the model can identify and classify records it has not seen before. So I ran the following code.
testset <- read.csv("Validation.csv", header = TRUE, sep = ",")
Pred1 <- predict(model, testset)
But R gives me the following error:
Error in .local(object, ...) : test vector does not match model !
Any idea how I could overcome this error? The test set used to build the model has 466 records, so I tried duplicating the validation set up to 466 records as well, but it still gives the same error.
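In kernlab, "test vector does not match model !" usually means the new data's columns (or factor levels) do not line up with what the model was trained on; the number of rows is irrelevant. A sketch of the checks, assuming the fitted object is the ksvm model named in the question and Validation.csv should carry the same predictors as Riskx.csv:

```r
validation <- read.csv("Validation.csv", header = TRUE, sep = ",")

# The predictors must match the training set in name and type;
# any mismatch here is the first thing to fix:
setdiff(colnames(trainset), colnames(validation))  # should be empty

# Factor columns must use the same levels as in training:
for (v in colnames(trainset)) {
  if (is.factor(trainset[[v]]) && v %in% colnames(validation)) {
    validation[[v]] <- factor(validation[[v]],
                              levels = levels(trainset[[v]]))
  }
}

Pred1 <- predict(model, validation)
```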