Error 'variable lengths differ' while doing hoslem.test() in R

I am trying to run hoslem.test() and draw a ROC curve, but I keep hitting the same error:
Error in model.frame.default(formula = cbind(y0 = 1 - y, y1 = y) ~ cutyhat) : variable lengths differ (found for 'cutyhat')
First, I checked whether the length of montevil$icPM10 and nrow(montevil) were the same. They are (both 8790). After that, I tried omitting NA values from the logistic regression model; nothing changed. On other sites, people suggested adding na.omit() to the model call.
This is my code:
montevil<-read.csv("Montevil.csv")
#Hoslem.test
icPM10<-as.factor(montevil$icPM10)
b <- glm(icPM10 ~ RS_re + vv + PRB + month, data = montevil, family = binomial(link = "logit"))
#install.packages("ResourceSelection")
library(ResourceSelection)
hoslem.test(icPM10, fitted(b), g=10)
#ROC curve
#This 'prob' variable was calculated in order to get the ROC curve
log.df <- data.frame(vv=0.6, RS_re="normal_alta", PRB=1013, month="08")
prob <- predict(b, newdata = log.df, type = "response")
#install.packages("pROC")
library(pROC)
r <- roc(icPM10, prob, data = montevil)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#plot (r)
Here is a link to the csv:
https://drive.google.com/file/d/1ap0Y-QMizgjKf1IB7mm_woJzEGEhlKPl/view?usp=sharing
EDIT
I figured out (I am not sure whether this is the right approach) that I had to pass this to hoslem.test():
hoslem.test(b$y, fitted(b), g=10)
But I am still struggling with the ROC Curve.
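In case it helps, here is a minimal sketch of the same fix applied to the ROC curve, assuming the length mismatch is again the cause: b$y and fitted(b) both come from exactly the rows glm() kept after dropping NAs, so their lengths always agree. Note also that prob above has length 1 (it is a prediction for the single row in log.df), so roc(icPM10, prob, ...) cannot work as written.
library(pROC)
# b$y is the 0/1 response actually used by glm(); fitted(b) are the
# in-sample predicted probabilities, so the two vectors match in length
r <- roc(b$y, fitted(b))
plot(r)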

Related

Unable to plot p values when using facet.by from ggsurvplot. Error message: "variable lengths differ"

I have a problem that I don't know how to solve, and it seems to be related to my data set (or is it?). I am actually able to plot different p values with facet.by when I use your example from issue #205 via the "colon" data set. However, it does not work with my data set, which is available on my GitHub profile here (https://github.com/CroixJeremy2/Data-frame-for-stack-overflow.git).
Expected behavior
I would like to be able to plot different p values, as in issue #205, with my data set.
Actual behavior
I am only able to plot the curves via facet.by; the p values that should be automatically calculated for a log-rank test cannot be plotted. Instead, an error message is returned in my R console:
Error in model.frame.default(formula = Survival ~ Sex, data = list(ID = c(147L, :
les longueurs des variables diffèrent (trouvé pour 'Sex')
(The second line is French for "variable lengths differ (found for 'Sex')".)
Steps to reproduce the problem
library(survival)
Survival = Surv(time = D$Age, event = D$outcome)
library(survminer)
fit = survfit(data = D, formula = Survival ~ Sex + Genotype)
ggsurvplot(fit = fit, data = D, pval = TRUE, facet.by = 'Genotype') #error message
ggsurvplot(fit = fit, data = D, facet.by = 'Genotype') #curves can be plotted
Remarks
Note that the survdiff() function works perfectly on my data sets in order to calculate p values from log-rank tests. Therefore, I do not know if I am doing something wrong in ggsurvplot() (most likely hypothesis) or if there is something wrong in the ggsurvplot() function itself (unlikely).
survdiff(data = D, subset = D$Ctrl, formula = Survival ~ Sex)
survdiff(data = D, subset = D$nKO, formula = Survival ~ Sex)
survdiff(data = D, subset = D$CRE_Ctrl, formula = Survival ~ Sex)
#works fine, p value returned, no message/warning/error returned.
Moreover, the variable lengths seem equal, and I don't have any NA values in my data frame, so I really don't understand why I get this error message.
sapply(D, length)
# length = 298 for all my variables
Thanks in advance for your response and help,
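One hedged guess, sketched below: Survival is a standalone object in the global environment, so when pval = TRUE forces a per-facet refit, the full-length Surv object gets matched against a subsetted Sex column, producing exactly this length mismatch. Building the Surv object inside the formula lets it be re-evaluated on each facet's subset (the column names Age, outcome, Sex, and Genotype are taken from the code above):
library(survival)
library(survminer)
# construct the Surv object inside the formula so it is re-evaluated
# on each facet's subset of D instead of reusing the global object
fit <- survfit(Surv(Age, outcome) ~ Sex + Genotype, data = D)
ggsurvplot(fit, data = D, pval = TRUE, facet.by = "Genotype")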

Issues plotting count distribution with distplot()

I have count data. I'm trying to document my decision to use a negative binomial distribution rather than Poisson (I couldn't get a quasi-Poisson distribution in lme4) and am having graphical issues (the vector is appended to the end of the post).
I've been trying to use the distplot() function to inform my decision about which distribution to use in the model. Here's the outcome variable (physician count):
plot(d1.2$totalmds)
This might look Poisson, but the mean and variance aren't close (the variance is doubled by two extreme values, but either way it is nowhere near the mean):
> var(d1.2$totalmds, na.rm = T)
[1] 114240.7
> mean(d1.2$totalmds, na.rm = T)
[1] 89.3121
My outcome is partly population driven, so I'm using the total population as an offset variable in preliminary models. As I understand it, with a log link the offset enters the linear predictor as log(poptotal), so the rate totalmds/poptotal is essentially what's being modeled.
But when I try to check the distribution using:
library(vcd)            # distplot()
library(fitdistrplus)   # fitdist(), qqcomp()
distplot(x = d1.2$totalmds, type = "poisson")                           # plot 1
distplot(x = d1.2$totalmds, type = "nbinomial")                         # plot 2: looks way off
plot(fitdist(data = d1.2$totalmds, distr = "pois", method = "mle"))     # plot 3
plot(fitdist(data = d1.2$totalmds, distr = "nbinom", method = "mle"))   # plot 4: throws warnings
qqcomp(fitdist(data = d1.2$totalmds, distr = "pois", method = "mle"))   # plot 5
qqcomp(fitdist(data = d1.2$totalmds, distr = "nbinom", method = "mle")) # plot 6: throws warnings
Does anyone have suggestions for why these plots look a little screwy/inconsistent?
As I mentioned I'm using another variable as an offset variable in my actual analysis, if that makes a difference.
Here's the vector:
https://gist.github.com/timothyslau/f95a777b713eb33a2fe6
I'm fairly sure NB is better than Poisson, since the variance-to-mean ratio var(d1.2$totalmds)/mean(d1.2$totalmds) is far greater than 1. But if NB is appropriate, the plots should look a lot cleaner (I think, unless I'm doing something wrong with these plotting functions/packages).
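For what it's worth, here is a minimal sketch of a more direct way to document the Poisson-vs-NB decision, assuming d1.2 has the totalmds and poptotal columns mentioned above (the intercept-only formulas are placeholders for the actual covariates): fit both models with the same offset and compare them by AIC.
library(MASS)  # glm.nb()
# intercept-only models with the population offset; swap in the
# real covariates as needed
m_pois <- glm(totalmds ~ 1 + offset(log(poptotal)),
              family = poisson, data = d1.2)
m_nb   <- glm.nb(totalmds ~ 1 + offset(log(poptotal)), data = d1.2)
AIC(m_pois, m_nb)  # the NB fit should win decisively if VMR >> 1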

Error when using msmFit in R

I'm trying to replicate this paper (Point Forecast Markov Switching Model for U.S. Dollar/Euro Exchange Rate, by Hamidreza Mostafei) in R. The table I'm trying to reproduce is on page 483. Here is a link to a pdf.
I wrote the following code and got an error on the last line:
library(MSwM)  # msmFit()
mydata <- read.csv("C:\\Users\\User\\Downloads\\EURUSD_2.csv", header = TRUE)
mod <- lm(EURUSD ~ EURUSD.1, mydata)
mod.mswm = msmFit(mod, k=2, p=1, sw=c(T,T,T,T), control=list(parallel=F))
Error in if ((max(abs(object["Fit"]["logLikel"] - oldll))/(0.1 + max(abs(object["Fit"]["logLikel"]))) < :
missing value where TRUE/FALSE needed
Basically, the data being used is EURUSD, the level change at monthly frequency. EURUSD.1 is the one-lag variable. Both EURUSD and EURUSD.1 are in my csv file. (I'm not sure how to attach the csv file here; if someone could point that out, that would be great.)
I changed the EURUSD.1 values to something random and the msmFit function seemed to work. But whenever I tried using the original values, i.e. the lagged values, the error came back.
Something degenerate is happening when one variable is simply a lag of the other. Consider a sample data frame where Y is lagged X:
> d = data.frame(X=runif(100))
> d$Y=c(.5, d$X[-100])
> mod <- lm(X~Y,d)
> mod.mswm = msmFit(mod, k=2, p=1, sw=c(T,T,T,T), control=list(parallel=F))
Error in if ((max(abs(object["Fit"]["logLikel"] - oldll))/(0.1 + max(abs(object["Fit"]["logLikel"]))) < :
missing value where TRUE/FALSE needed
That gives your error. Let's add a tiny bit of noise to Y and see what happens:
> d$Y=d$Y+rnorm(100,0,.000001)
> mod <- lm(X~Y,d)
> mod.mswm = msmFit(mod, k=2, p=1, sw=c(T,T,T,T), control=list(parallel=F))
> mod.mswm
Markov Switching Model
Call: msmFit(object = mod, k = 2, sw = c(T, T, T, T), p = 1, control = list(parallel = F))
AIC BIC logLik
4.3109 47.45234 3.84455
Coefficients:
(Intercept)(S) Y(S) X_1(S) Std(S)
Model 1 0.8739622 -22948.89 22948.83 0.08194545
Model 2 0.4220748 77625.21 -77625.17 0.21780764
Transition probabilities:
Regime 1 Regime 2
Regime 1 0.3707261 0.3886715
Regime 2 0.6292739 0.6113285
It works! So now, either:
Having perfectly lagged variables causes some "divide by zero" error because it's a purely degenerate case (like having perfectly collinear variables in a linear model). A little experimenting shows that the resulting output is very sensitive to how much noise you add, so I'm thinking it's on a knife-edge here. I suspect having perfectly lagged variables leads to some singularity or degeneracy.
or
There's some bug in the function.
I have no idea what msmFit does, so that's for you to sort out.
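As a quick sketch of where the degeneracy likely comes from: the fitted output above includes an X_1(S) column, so msmFit with p = 1 evidently adds the lagged response X_1 as a regressor, and since Y was built as the lag of X, the two regressors Y and X_1 are identical (note the huge offsetting Y(S) and X_1(S) coefficients above):
# Y (as constructed above) and the lag of X coincide exactly,
# i.e. the model's two regressors are perfectly collinear
all.equal(d$Y[-1], d$X[-100])  # TRUE before the noise is added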

Error in plot, formula missing when using svm

I am trying to plot my svm model.
library(foreign)
library(e1071)
x <- read.arff("contact-lenses.arff")
#alt: x <- read.arff("http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/contact-lenses.arff")
model <- svm(`contact-lenses` ~ . , data = x, type = "C-classification", kernel = "linear")
The contact-lenses ARFF is one of the built-in data files in Weka.
However, now I run into an error when trying to plot the model.
plot(model, x)
Error in plot.svm(model, x) : missing formula.
The problem is that your model has multiple covariates. plot() will only run automatically if your data= argument has exactly three columns (one of which is a response). For example, in the ?plot.svm help page, you can call
data(cats, package = "MASS")
m1 <- svm(Sex~., data = cats)
plot(m1, cats)
So, since you can only show two dimensions on a plot, you need to specify what to use for x and y when there are more than two predictors to choose from:
cplus<-cats
cplus$Oth<-rnorm(nrow(cplus))
m2 <- svm(Sex~., data = cplus)
plot(m2, cplus) #error
plot(m2, cplus, Bwt~Hwt) #Ok
plot(m2, cplus, Hwt~Oth) #Ok
So that's why you're getting the "Missing Formula" error.
There is another catch as well: plot.svm will only plot continuous variables along the x and y axes, and the contact-lenses data frame has only categorical variables. As far as I can tell, the plot.svm function simply does not support this. You'll have to decide how you want to summarize that information in your own visualization.
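A quick way to see this, sketched assuming x is still the data frame read from the ARFF file above:
# every column of the contact-lenses data is a factor, so there is
# no continuous variable plot.svm could put on either axis
sapply(x, is.factor)  # expected: TRUE for each column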

Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, : zero non-NA points

I am using R's ROCR package to calculate the area under the curve for large datasets. However, the code works for all of the datasets except a few.
The code I have used:
library(ROCR)
pred <- prediction(mydata$Total.Regexes, mydata$actual)
perf <- performance(pred, "tpr", "fpr")
I checked the dataset and there are no NA values present. However, since the dataset is huge, some may have escaped my notice. Is there another way to screen the dataset for NA values (if there are any) without disturbing the remaining values?
This is the error it shows for a few of the datasets:
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
zero non-NA points
I checked using:
is.na(dataset)
dataset <- na.omit(dataset)
but it still doesn't work, and there are no NA values present in the dataset. I can't reproduce the error with a simple dataset, so I've posted the problem dataset in my Dropbox:
https://www.dropbox.com/s/pjko6o6h23m43le/DC4.csv
Please Help!
I had a similar problem.
By "casting" the arguments to prediction I managed to get everything working properly.
Try:
pred <- prediction(as.numeric(mydata$Total.Regexes), as.numeric(mydata$actual))
perf <- performance(pred, "tpr", "fpr")
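If the goal is the AUC itself, here is a short sketch of the rest of the computation (same assumed columns as above):
library(ROCR)
pred <- prediction(as.numeric(mydata$Total.Regexes), as.numeric(mydata$actual))
perf <- performance(pred, "tpr", "fpr")
plot(perf)  # the ROC curve
# performance() with measure "auc" stores the area in the y.values slot
auc <- performance(pred, "auc")@y.values[[1]]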
I had a similar problem. This is a "bad" way to solve it:
MODEL <- glm(y ~ x + z, my_data, family = "binomial")
pred_probab <- predict(MODEL, type = "response")
type = "response"is specified for return prediction sample as probabilities
pr <- prediction(pred_probab, Two_levels_factor)
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
zero non-NA points
The sample "Two_levels_factor" with n = 2000 has only one level value "positive_result". For logit regression it must have two levels.
levels(Two_levels_factor)
[1] "negative_result"
Two_levels_factor[1] <- "positive_result"
levels(Two_levels_factor)
[1] "positive_result" "negative_result"
pr <- prediction(pred_probab, Two_levels_factor)
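As a sanity check before calling prediction(), one can verify that both classes actually occur in the response (a minimal sketch):
# prediction() needs observations from both classes to build an ROC curve
table(Two_levels_factor)  # should show nonzero counts for both levels
stopifnot(nlevels(droplevels(Two_levels_factor)) == 2)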
