rms: error using validate() in R

I'm building a linear model using ols() from the rms package in R with:
model <- ols(nallSmells ~ rcs(size, 5) + rcs(minor, 5) + rcs(change_churn, 3) +
             rcs(review_rate, 0), data = quality, x = TRUE, y = TRUE)
When I want to validate my model using:
validate(model,B=100)
I get the following error:
Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 1164 missing values deleted
But if I decrease B, e.g., B=10, it works. Why can't I iterate more? I also noticed that the seed has an effect when I use this method.
Can someone give me some advice?
UPDATE:
I'm using rcs(review_rate, 0) because I want to assign 0 knots to this predictor, according to my DOF budget. I noticed that the problem is with the data in review_rate. Even if I omit the knots argument in rcs() and just give the name of the predictor, I get errors. This is the frequency of the data in review_rate:
count(quality$review_rate)
          x freq
1 0.8571429    1
2 0.9483871    1
3 0.9789474    1
4 0.9887640    1
5 0.9940476    1
6 1.0000000 1159
I wonder if there is a relationship with the values of this vector, because when I built the OLS model I got the following warning:
Warning message:
In rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc, fractied = fractied) :
5 knots requested with 6 unique values of x. knots set to 4 interior values.
The values in the other predictors are real positives, but if I omit the review_rate predictor I don't get any warning or error.
Thanks for your support.
I add the link for a sample of 100 of my data for replication
https://www.dropbox.com/s/oks2ztcse3l8567/examplestackoverflow.csv?dl=0
X represents the dependent variable and Y4 the predictor that is giving me problems.
require(rms)
Data <- read.csv("examplestackoverflow.csv")
testmodel <- ols(X ~ rcs(Y1) + rcs(Y2) + rcs(Y3) + rcs(Y4), data = Data, x = TRUE, y = TRUE)
validate(testmodel, B = 1000)
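As a side note, I suspect the bootstrap resampling itself is involved: with only 6 unique values in review_rate (five of them occurring once), many resamples contain too few distinct values to place spline knots. A quick illustrative simulation, using the frequencies from the count() output above:

```r
set.seed(42)
# review_rate: 5 singleton values plus 1159 ones (per the frequency table)
review_rate <- c(0.8571429, 0.9483871, 0.9789474, 0.9887640, 0.9940476,
                 rep(1, 1159))
# Number of distinct values in each of 100 bootstrap resamples
uniq <- replicate(100, length(unique(sample(review_rate, replace = TRUE))))
table(uniq)
```

Each singleton is absent from a given resample with probability (1 - 1/1164)^1164, roughly 37%, so resamples routinely carry fewer than 6 distinct values.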
Kind regards,


Error "$ operator is invalid for atomic vectors" despite not using atomic vectors or $

Hello fellow Stackers! This is my first question, so I am curious whether you can help me! :)
First: I checked similar questions and unfortunately none of the solutions worked for me; I've been trying for nearly three days now. Since I am working with sensitive data I cannot provide the original table for a reprex, unfortunately, but I will create a small substitute example table for testing.
To get to the problem:
I want to predict a norm value using the package cNORM. It requires raw data, classification data, a model, min/max values, and some other things that are less important. The problem is: whatever I do, and whatever data type and working directory I use, it gives me the error "$ operator is invalid for atomic vectors". To fix that I converted the original .sav file to a data frame, but nothing changed. I checked the type of the data and it said data frame, not atomic vector. I also tried using [1] for position or ["Correct"] for names, but the same error showed up. The same happened when using two separate data frames, or lists. I tried using $ to see whether I would get a different error, but it was the same. I even used another workspace to check whether the old workspace was bugged.
So maybe I just made some really stupid mistake, but I really tried and it did not work out, so I am asking you what the solution might be. Here is some data to test! :)
install.packages("haven")
library(haven)
install.packages("cNORM")
library(cNORM)
SpecificNormValue <- predictNorm(Data_4[1], Data_4[2], model = T,
                                 minNorm = 12, maxNorm = 75,
                                 force = FALSE, covariate = NULL)
So that is one of the commands I used on the data frame Data_4. I also tried leaving out the brackets, or using "xxx" to get the column names, but to no avail.
The following is the example data frame. To test it more realistically I would recommend an Excel file with 2 columns and 900 rows (plus column titles), like the original. The "Correct" values can be randomly generated by Excel; they range from 35 to 50, and the age ranges from 6 to 12.
Correct Age
     40   6
     45   7
     50   6
     35   6
I really hope one of you can figure out the problem and how to get the command running properly; I have no other ideas right now.
Thanks for checking my question and thanks in advance for your time! I would be glad to hear from you!
The source of that error isn't your data, it's the third argument to predictNorm: model = T. According to the predictNorm documentation, this is supposed to be a "regression model or a cnorm object". Instead you are passing in a logical value (T = TRUE) which is an atomic vector and causes this error when predictNorm tries to access the components of the model with $.
I don't know enough about your problem to say what kind of model you need to use to get the answer you want, but for example passing it an object constructed by cnorm() returns without an error using your data and parameters (there are some warnings because of the small size of your test dataset):
library(haven)
library(cNORM)
#> Good morning star-shine, cNORM says 'Hello!'
Data_4 <- data.frame(correct = c(40, 45, 50, 35),
                     age = c(6, 7, 6, 6))
SpecificNormValue <- predictNorm(Data_4$correct,
Data_4$age,
model = cnorm(Data_4$correct, Data_4$age),
minNorm = 12,
maxNorm = 75,
force = FALSE,
covariate = NULL)
#> Warning in rankByGroup(raw = raw, group = group, scale = scale, weights =
#> weights, : The dataset includes cases, whose percentile depends on less than
#> 30 cases (minimum is 1). Please check the distribution of the cases over the
#> grouping variable. The confidence of the norm scores is low in that part of the
#> scale. Consider redividing the cases over the grouping variable. In cases of
#> disorganized percentile curves after modelling, it might help to reduce the 'k'
#> parameter.
#> Multiple R2 between raw score and explanatory variable: R2 = 0.0667
#> Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
#> force.in, : 21 linear dependencies found
#> Reordering variables and trying again:
#> Warning in log(vr): NaNs produced
#> Warning in log(vr): NaNs produced
#> Specified R2 falls below the value of the most primitive model. Falling back to model 1.
#> R-Square Adj. = 0.993999
#> Final regression model: raw ~ L4A3
#> Regression function: raw ~ 30.89167234 + (6.824413606e-09*L4A3)
#> Raw Score RMSE = 0.35358
#>
#> Use 'printSubset(model)' to get detailed information on the different solutions, 'plotPercentiles(model) to display percentile plot, plotSubset(model)' to inspect model fit.
Created on 2020-12-08 by the reprex package (v0.3.0)
Note I used Data_4$correct and Data_4$age for the first two arguments. Data_4[,1] and Data_4[[1]] also work, but Data_4[1] doesn't, because that returns a one-column data frame rather than the vector predictNorm expects.

Error in if (any(co)) { : missing value where TRUE/FALSE needed In addition: Warning messages: 1: In FUN(newX[, i], ...) : NAs introduced by coercion

I am working with a dataset that has approximately 150,000 rows and 25 columns, consisting of numerical and factor variables. The factor variables contain both text and numbers, and I need all of them. The dependent variable is a factor with 20 levels.
I am trying to build a model and feed it into an SVM using the kernlab package in R.
library(kernlab)
n <- nrow(x)
trainInd <- sort(sample(1:nrow(x), n * 0.8))
xtrain <- x[trainInd, ]
xtest  <- x[-trainInd, ]
ytrain <- y[trainInd]
ytest  <- y[-trainInd]
modelclass <- ksvm(x = as.matrix(xtrain), y = as.matrix(ytrain),
                   scaled = TRUE, type = "C-svc", kernel = "rbfdot",
                   kpar = "automatic", C = 1, cross = 0)
Following the code, I get this error:
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In FUN(newX[, i], ...) : NAs introduced by coercion
The xtrain data frame looks like:
Length Gender Age Day Hour Duration Period
5 1 80 5 11 20 3
0.2 2 35 2 18 10 5
1.1 2 55 1 15 120 4
The Gender, Day, and Period variables are categorical (factors), while the rest are numerical.
I have gone through similar questions and through my dataset as well, but I cannot identify any NA values or other mistakes.
I assume that I am doing something wrong with the variable types, particularly the factors. I am unsure how to use them, but I can't see anything wrong.
Any help on how to solve the error, and possibly on how to model factors together with numerical variables, would be appreciated.
The reason for this error message is that the SVM implementations in kernlab and e1071 cannot deal with features of type factor.
The solution is to convert the factor predictors via one-hot encoding. There are then two cases:
Case 1: formula interface
The one-hot encoding is done implicitly when you use the formula interface, e.g. train(form = formula, ...).
Case 2: x,y interface
When using the format train(x = features, y = target, ...), you must perform the one-hot encoding explicitly.
A simple way to do this is:
features <- model.matrix(~ . - 1, data = features)
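For example, here is a minimal sketch of that encoding step, with made-up values mirroring the question's columns, producing an all-numeric matrix that can be passed to ksvm's x,y interface:

```r
# Made-up data mirroring the question's columns; Gender and Day are factors
xtrain <- data.frame(
  Length = c(5, 0.2, 1.1),
  Gender = factor(c(1, 2, 2)),
  Day    = factor(c(5, 2, 1)),
  Age    = c(80, 35, 55)
)
# "~ . - 1" drops the intercept; each factor expands into dummy columns
xnum <- model.matrix(~ . - 1, data = xtrain)
head(xnum)  # all-numeric matrix, safe for ksvm(x = xnum, ...)
```

The numeric columns pass through unchanged; only the factors are expanded.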
I had the same problem with the e1071 package in R. I solved it by changing all variables to numeric instead of factor, except the decision variable (y), which can be either a factor (for classification) or numeric (for regression).
References:
CRAN Package 'e1071'

R - RandomForest with two Outcome Variables

Fairly new to the randomForest package here.
I'm trying to run a model with 2 response variables and 7 predictor variables, but I can't seem to, because of the lengths of the response variables and/or the nature of fitting a model with 2 response variables.
Let's assume this is my data and model:
> table(data$y1)
 0  1  2  3  4
23 43 75 47 21
> length(data$y1)
[1] 209
> table(data$y2)
  0   2   3   4
104  30  46  29
> length(data$y2)
[1] 209
m1<-randomForest(cbind(y1,y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
When I run this model, I receive this error:
Error in randomForest.default(m, y, ...) :
length of response must be the same as predictors
I did some troubleshooting and found that cbind() simply stacks the two response variables' values together, doubling the original length and possibly resulting in the above error. For example:
> length(cbind(y1, y2))
[1] 418
> sapply(data, length)
  a   b   c   d   e   f   g  y1  y2
209 209 209 209 209 209 209 209 209
I then tried to solve this issue by running randomForest individually on each of the response variables and then apply combine() on the regression models, but came across these issues:
m2<-randomForest(y1~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m3<-randomForest(y2~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m2,m3)
Warning message:
In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I then decide to treat the randomForest models as classification models, and apply as.factor() to both response variables before running randomForest, but then came across this new issue:
m4<-randomForest(as.factor(y1)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m5<-randomForest(as.factor(y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m4,m5)
Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
non-conformable arrays
My guess is that I can't combine() classification models.
I hope that my inquiry of trying to run a multivariate Random Forest model makes sense. Let me know if there are further questions. I can also go back and make adjustments.
Combine your columns into a single joint outcome outside the randomForest formula:
data[["y3"]] <- factor(paste0(data$y1, data$y2))
randomForest(y3 ~ a + b + c + d + e + f + g, data, mtry = 7, importance = TRUE)
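To illustrate the idea with made-up values in the question's 0-4 range: pasting the two outcomes yields one joint class label per row, which randomForest can then treat as a single classification target. Wrapping the result in factor() matters, since paste0() returns character; a separator also keeps multi-digit labels unambiguous:

```r
# Made-up outcome pairs; each (y1, y2) pair becomes one joint class label
y1 <- c(0, 1, 2, 0)
y2 <- c(2, 0, 3, 2)
y3 <- factor(paste0(y1, "_", y2))
levels(y3)
#> [1] "0_2" "1_0" "2_3"
```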

Df without NA gives error: Error in cov.wt(y, wt = wt) : 'x' must contain finite values only

I'm using the package relaimpo to calculate the proportional contribution of each covariate to the R² of a full linear model. Recently I updated this package, which now confronts me with the following error message. My test data:
SUBJECT Days Volume
P003 1 51640.33
P003 211 55109.29
P004 1 10259.38
P004 140 10269.75
P004 252 10526.75
P004 364 8560.62
P007 1 177.38
P007 368 266.65
> library(relaimpo)
> Full_model <- lm(Volume~SUBJECT+Days, data=Test)
> calc.relimp(Full_model, diff = T, rela = T)
Error in cov.wt(y, wt = wt) : 'x' must contain finite values only
I have already checked the posts on this website. They all seem to suggest that this is due to missing values. But even when I remove the missing values from the data frame (as in this test dataset) I still get this error message. Also, when I additionally define x=NULL in the call, or try to add instructions for handling NA values, I still get the same error.
Anybody an idea how I can solve this?

Warning message 'newdata' had 1 row but variables found have 16 rows in R

I am supposed to use the predict function to predict when fjbjor is 5.5, and I always get this warning message. I have tried many ways, but it always appears. Can anyone see what I am doing wrong here?
This is my code
fit.lm <- lm(fjbjor~amagn, data=bjor)
summary(fit.lm)
new.bjor<- data.frame(fjbjor=5.5)
predict(fit.lm,new.bjor)
and this comes out
1 2 3 4 5 6 7 8 9 10 11
5.981287 2.864521 9.988559 5.758661 4.645530 2.419269 4.645530 5.313409 6.871792 3.309773 4.200278
12 13 14 15 16
3.755026 5.981287 5.536035 1.974016 3.755026
Warning message: 'newdata' had 1 row but variables found have 16 rows
If anyone can see what is wrong I would be really thankful for the help.
Your model is fjbjor ~ amagn, where fjbjor is the response and amagn is the covariate, yet your newdata is data.frame(fjbjor=5.5).
newdata should provide covariates rather than the response. predict only retains the covariate columns of newdata, and for your specified newdata that leaves nothing. As a result, predict falls back to the internal model frame for prediction, which returns the fitted values.
The warning message is fairly clear: predict determines the expected number of predictions from nrow(newdata), which is 1, but for the reason above 16 fitted values are returned. That mismatch produces the warning.
It looks like the model you really want is amagn ~ fjbjor.
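A minimal sketch with simulated data (the names bjor, fjbjor, and amagn follow the question; the values are made up) shows both the fix for the warning and the flipped model:

```r
set.seed(1)
# Simulated stand-in for the questioner's data (16 rows, as in the output)
bjor <- data.frame(amagn = runif(16, 1, 10))
bjor$fjbjor <- 2 + 0.5 * bjor$amagn + rnorm(16, sd = 0.2)

# newdata must name the *covariate* of the fitted model:
fit.lm <- lm(fjbjor ~ amagn, data = bjor)
predict(fit.lm, newdata = data.frame(amagn = 5.5))   # one prediction, no warning

# If the goal is really to predict when fjbjor is 5.5, flip the model:
fit2 <- lm(amagn ~ fjbjor, data = bjor)
predict(fit2, newdata = data.frame(fjbjor = 5.5))
```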
