Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, : zero non-NA points - r

I am using R's ROCR package to calculate the area under the curve of large data-sets. However, The code does not work for all the datasets except a few.
the code i have used:
pred <- prediction(mydata$Total.Regexes, mydata$actual)
perf <- performance(pred, "tpr", "fpr")
I checked the dataset there is no non-Na points present in the dataset. However, since the dataset is huge, it may go out of my sight. So, is there any other process to refine the dataset (for non-NA values, if there any) without disturbing the remaining values?
And this is the error it shows for a few dataset:
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
zero non-NA points
I checked using:
is.na(dataset)
dataset <- na.omit(dataset)
but still it doesn't work. There is no non-Na values present in the dataset.. I can't reproduce the error with a simple dataset, so I've posted the problem dataset in my dropbox.
https://www.dropbox.com/s/pjko6o6h23m43le/DC4.csv
Please Help!

I had a similar problem.
By "casting" the arguments to prediction I managed to get everything working properly.
Try:
pred <- prediction(as.numeric(mydata$Total.Regexes), as.numeric(mydata$actual))
perf <- performance(pred, "tpr", "fpr")

I had the similar problem. This is a 'bad' way to solve it:
MODEL <- glm(y ~ x + z, my_data, family = "binomial")
pred_probab <- predict(MODEL, type = "response")
type = "response"is specified for return prediction sample as probabilities
pr <- prediction(pred_probab, Two_levels_factor)
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
zero non-NA points
The sample "Two_levels_factor" with n = 2000 has only one level value "positive_result". For logit regression it must have two levels.
levels(Two_levels_factor)
[1] "negative_result"
Two_levels_factor[1] <- "positive_result"
levels(Two_levels_factor)
[1] "positive_result" "negative_result"
pr <- prediction(pred_probab, Two_levels_factor)

Related

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to stackoverflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data I was able to get your model to run by setting rescale = TRUE. The documentation states
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So, one solution maybe is just to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
ElUnder = sample(0:1, 1000, replace = TRUE),
NHISPID = paste0("id", 1:1000),
SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(), # this family avoids warnings
rescale=TRUE) # Weights rescaled to the sum of the sample size.
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking code for this function's method with 'survey:::svyglm.survey.design', it seems like there may be a bug. I could be wrong, but by my read when 'rescale' is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a work around if you assign a vector of numeric values to .survey.prob.weights in the global environment. No idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs to be double the length of the data.)
SAMPdesign=svydesign(ids = ~NHISPID,
data=Under_narm,
weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm<-svyglm(formula = ElUnder~SO2,
design=SAMPdesign,
family=quasibinomial(),
rescale=FALSE)
summary(El_under_glm, correlation = TRUE)

How to Create a loop (when levels do not overlap the reference)

I have written some code in R. This code takes some data and splits it into a training set and a test set. Then, I fit a "survival random forest" model on the training set. After, I use the model to predict observations within the test set.
Due to the type of problem I am dealing with ("survival analysis"), a confusion matrix has to be made for each "unique time" (inside the file "unique.death.time"). For each confusion matrix made for each unique time, I am interested in the corresponding "sensitivity" value (e.g. sensitivity_1001, sensitivity_2005, etc.). I am trying to get all these sensitivity values : I would like to make a plot with them (vs unique death times) and determine the average sensitivity value.
In order to do this, I need to repeatedly calculate the sensitivity for each time point in "unique.death.times". I tried doing this manually and it is taking a long time.
Could someone please show me how to do this with a "loop"?
I have posted my code below:
#load libraries
library(survival)
library(data.table)
library(pec)
library(ranger)
library(caret)
#load data
data(cost)
#split data into train and test
ind <- sample(1:nrow(cost),round(nrow(cost) * 0.7,0))
cost_train <- cost[ind,]
cost_test <- cost[-ind,]
#fit survival random forest model
ranger_fit <- ranger(Surv(time, status) ~ .,
data = cost_train,
mtry = 3,
verbose = TRUE,
write.forest=TRUE,
num.trees= 1000,
importance = 'permutation')
#optional: plot training results
plot(ranger_fit$unique.death.times, ranger_fit$survival[1,], type = 'l', col = 'red') # for first observation
lines(ranger_fit$unique.death.times, ranger_fit$survival[21,], type = 'l', col = 'blue') # for twenty first observation
#predict observations test set using the survival random forest model
ranger_preds <- predict(ranger_fit, cost_test, type = 'response')$survival
ranger_preds <- data.table(ranger_preds)
colnames(ranger_preds) <- as.character(ranger_fit$unique.death.times)
From here, another user (Justin Singh) from a previous post (R: how to repeatedly "loop" the results from a function?) suggested how to create a loop:
sensitivity <- list()
for (time in names(ranger_preds)) {
prediction <- ranger_preds[which(names(ranger_preds) == time)] > 0.5
real <- cost_test$time >= as.numeric(time)
confusion <- confusionMatrix(as.factor(prediction), as.factor(real), positive = 'TRUE')
sensitivity[as.character(i)] <- confusion$byclass[1]
}
But due to some of the observations used in this loop, I get the following error:
Error in confusionMatrix.default(as.factor(prediction), as.factor(real), :
The data must contain some levels that overlap the reference.
Does anyone know how to fix this?
Thanks
Certain values in prediction and/or real have only 1 unique value in them. Make sure the levels of the factors are the same.
sapply(names(ranger_preds), function(x) {
prediction <- factor(ranger_preds[[x]] > 0.5, levels = c(TRUE, FALSE))
real <- factor(cost_test$time >= as.numeric(x), levels = c(TRUE, FALSE))
confusion <- caret::confusionMatrix(prediction, real, positive = 'TRUE')
confusion$byClass[1]
}, USE.NAMES = FALSE) -> result
result

How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.
My problem is that my training and testing sets might not always have the same variable levels.
Here is a reproducible data example:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
summary(data)
Now let's make a glm model:
model_glm <- glm(x~., data = data)
And let's use cross-validation on this model:
library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)
And this is the kind of error output that you will get:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor z has new levels F
if you don't get this error, re-run the cross validation and at some point you will get a similar error.
The nature of the problem here is that when you do cross-validation, the train and test subsets might not have the exact same variable levels. Here our variable z has three levels (D,E,F).
In the total amount of our data there is much more D's than E's and F's.
Thus whenever you take a small subset of the whole data (to do cross-validation).
There is a very good chance that your z variable are all going to be set at the D's level.
Thus Eand F levels gets dropped, thus we get the error (This answer is helpful to understand the problem: https://stackoverflow.com/a/51555998/10972294).
My question is: how to avoid the drop in the first place?
If it is not possible, what are the alternatives?
(Keep in mind that this a reproducible example, the actual data I am using has many variables like z, I would like to avoid deleting them.)
To answer your question in the comment, I don't know if there is a function or not. Most likely there is one, but I have no idea on which package would contain it. For this example, this function should work:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
#optional tag row for later identification:
#data$rowid<-1:nrow(data)
stratified <- function(df, column, percent){
#split dataframe into groups based on column
listdf<-split(df, df[[column]])
testsubgroups<-lapply(listdf, function(x){
#pick the number of samples per group, round up.
numsamples <- ceiling(percent*nrow(x))
#selects the rows
whichones <-sample(1:nrow(x), numsamples, replace = FALSE)
testsubgroup <-x[whichones,]
})
#combine the subgroups into one data frame
testgroup<-do.call(rbind, testsubgroups)
testgroup
}
testgroup<-stratified(data, "z", 0.8)
This will just split the initial data by column z, if you are interested is grouping by multiple columns then this could be extended by using the group_by function from the dplyr package, but that would be another question.
Comment on the statistics: If you just have a few examples for any particular factor, what type of fit do you expect? A poor fit with wide confidence limits.

error when doing the summary of polr in r

I am trying to do a proportional odds logistic regression model of the form:
dsnac <- polr(formula=DS1~AC1, data = pddat1, method=c("logistic"))
summary(dsnac)
The regression ran fine,however, when I implement the summary function I get an error:
svd(X) : infinite or missing values in 'x'
I checked to see if there are any missing values in the "AC1" column (assuming AC1 is "x" as mentioned in the error), but does not have any values missing. The range of AC1 is 1.3 to 170000. DS1 is a factor having the levels 0,1 and 2.
Would be a great help if someone can help me with this. Thanks
A reproducible example is:
pddat1 <- data.frame(cbind(DS1=c(rep(0,400),rep(1,60),rep(2,40)),
AC1=runif(500,1,170000)))
pddat1$DS1 <- as.factor(pddat1$DS1)
dsnac <- polr(formula=DS1~AC1, data = pddat1, method=c("logistic"))
summary(dsnac)
A simple transformation solved the issue. svd(X) refers to singular value decomposition of covariates matrix.
dsnac <- polr(DS1~scale(AC1) , data = pddat1, method=c("logistic"))
summary(dsnac)
However, it is something has to do with your data. Calling clm function from ordinal package lead to the same conclusions with a warning such as "Model is nearly unidentifiable: very large eigenvalue - Rescale variables?"
library(ordinal)
dsnac <- clm(as.factor(DS1) ~ AC1, data=pddat1)
summary(dsnac)
If you downsize the maximum value in the runif command everything works fine
pddat1 <- data.frame(cbind(DS1=factor(c(rep(0,400),rep(1,60),rep(2,40))),
AC1=runif(500,1,15)))
str(pddat1)
pddat1$DS1 <- as.factor(pddat1$DS1)
dsnac <- polr(DS1 ~ AC1, data = pddat1, method=c("logistic"))
summary(dsnac)

Error in rep(1, n.ahead) : invalid 'times' argument in R

I'm working on dataset to forecast with ARIMA, and I'm so close to the last step but I'm getting error and couldn't find reference to figure out what I'm missing.
I keep getting error message when I do the following command:
ForcastData<-forecast(fitModel,testData)
Error in rep(1, n.ahead) : invalid 'times' argument
I'll give brief view on the work I did where I have changed my dataset from data frame to Time series and did all tests to check volatility, and Detect if data stationary or not.
Then I got the DataAsStationary as good clean data to apply ARIMA, but since I wanna train the model on train data and test it on the other part of the data, I splitted dataset into training 70% and testing 30%:
ind <-sample(2, nrow(DataAsStationary), replace = TRUE, prob = c(0.7,0.3))
traingData<- DataStationary1[ind==1,]
testData<- DataStationary1[ind==2,]
I used Automatic Selection Algorithm and found that Arima(2,0,3) is the best.
autoARIMAFastTrain1<- auto.arima(traingData, trace= TRUE, ic ="aicc", approximation = FALSE, stepwise = FALSE)
I have to mentioned that I did check the if residuals are Uncorrelated (White Noise) and deal with it.
library(tseries)
library(astsa)
library(forecast)
After that I used the training dataset to fit the model:
fitModel <- Arima(traingData, order=c(2,0,3))
fitted(fitModel)
ForcastData<-forecast(fitModel,testData)
output <- cbind(testData, ForcastData)
accuracy(testData, ForcastData)
plot(outp)
Couldn't find any resource about the error:
Error in rep(1, n.ahead) : invalid 'times' argument
Any suggestions!! Really
I tried
ForcastData<-forecast.Arima(fitModel,testData)
but I get error that
forecast.Arima not found !
Any idea why I get the error?
You need to specify the arguments to forecast() a little differently; since you didn't post example data, I'll demonstrate with the gold dataset in the forecast package:
library(forecast)
data(gold)
trainingData <- gold[1:554]
testData <- gold[555:1108]
fitModel <- Arima(trainingData, order=c(2, 0, 3))
ForcastData <- forecast(fitModel, testData)
# Error in rep(1, n.ahead) : invalid 'times' argument
ForcastData <- forecast(object=testData, model=fitModel) # no error
accuracy(f=ForcastData) # you only need to give ForcastData; see help(accuracy)
ME RMSE MAE MPE MAPE MASE
Training set 0.4751156 6.951257 3.286692 0.09488746 0.7316996 1.000819
ACF1
Training set -0.2386402
You may want to spend some time with the forecast package documentation to see what the arguments for the various functions are named and in what order they appear.
Regarding your forecast.Arima not found error, you can see this answer to a different question regarding the forecast package -- essentially that function isn't meant to be called by the user, but rather called by the forecast function.
EDIT:
After receiving your comment, it seems the following might help:
library(forecast)
# Read in the data
full_data <- read.csv('~/Downloads/onevalue1.csv')
full_data$UnixHour <- as.Date(full_data$UnixHour)
# Split the sample
training_indices <- 1:floor(0.7 * nrow(full_data))
training_data <- full_data$Lane1Flow[training_indices]
test_data <- full_data$Lane1Flow[-training_indices]
# Use automatic model selection:
autoARIMAFastTrain1 <- auto.arima(training_data, trace=TRUE, ic ="aicc",
approximation=FALSE, stepwise=FALSE)
# Fit the model on test data:
fit_model <- Arima(training_data, order=c(2, 0, 3))
# Do forecasting
forecast_data <- forecast(object=test_data, model=fit_model)
# And plot the forecasted values vs. the actual test data:
plot(x=test_data, y=forecast_data$fitted, xlab='Actual', ylab='Predicted')
# It could help more to look at the following plot:
plot(test_data, type='l', col=rgb(0, 0, 1, alpha=0.7),
xlab='Time', ylab='Value', xaxt='n', ylim=c(0, max(forecast_data$fitted)))
ticks <- seq(from=1, to=length(test_data), by=floor(length(test_data)/4))
times <- full_data$UnixHour[-training_indices]
axis(1, lwd=0, lwd.ticks=1, at=ticks, labels=times[ticks])
lines(forecast_data$fitted, col=rgb(1, 0, 0, alpha=0.7))
legend('topright', legend=c('Actual', 'Predicted'), col=c('blue', 'red'),
lty=1, bty='n')
I was able to run
ForcastData <- forecast(object=testData, model=fitModel)
without no error
and Now want to plot the testData and the forecasting data and check if my model is accurate:
so I did:
output <- cbind(testData, ForcastData)
plot(output)
and gave me the error:
Error in error(x, ...) :
improper length of one or more arguments to merge.xts
So when I checked ForcastData, it gave the output:
> ForcastData
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2293201 -20.2831770 -308.7474 268.1810 -461.4511 420.8847
2296801 -20.1765782 -346.6400 306.2868 -519.4593 479.1061
2300401 -18.3975657 -348.8556 312.0605 -523.7896 486.9945
2304001 -2.2829565 -332.7483 328.1824 -507.6860 503.1201
2307601 2.7023277 -327.8611 333.2658 -502.8509 508.2555
2311201 4.5777316 -328.6756 337.8311 -505.0893 514.2447
2314801 4.3198927 -331.4470 340.0868 -509.1913 517.8310
2318401 3.8277285 -332.7898 340.4453 -510.9844 518.6398
2322001 1.4364973 -335.2403 338.1133 -513.4662 516.3392
2325601 -0.4013561 -337.0807 336.2780 -515.3080 514.5053
I thought I will get list of result as I have in my testData. I need to get the chart that shows 2 lines of actual data(testData), and expected data(ForcastData).
I have really went through many documentation about forcast, but I can't find something explain what I wanna do.

Resources