I am trying to use the lqmm package in R and am receiving the error Error in f(arg, ...) : NA/NaN/Inf in foreign function call (arg 1). I can successfully use it on a version of my data in which a variable called cluster_name is averaged over.
I've tried to verify that there are no NaNs or infinite values in my dataset this way:
na_data = mydata
new_DF <- na_data[rowSums(is.na(mydata)) > 0,] # yields a dataframe with no observations
is.na(na_data) <- sapply(na_data, is.infinite)
new_DF <- na_data[rowSums(is.na(mydata)) > 0,] # still a dataframe with no observations
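An equivalent column-wise check, counting non-finite values in the numeric columns (and NAs in everything else), is another quick sanity check (a rough sketch):
sapply(mydata, function(col) {
  if (is.numeric(col)) sum(!is.finite(col)) else sum(is.na(col))  # count per column
})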
There are no character variables in my dataframe -- every such variable has been converted to a factor.
When I run my model
m1 = lqmm(std_brain ~ std_beh*type*taught, random = ~1, group=subject, data = begin_data, tau=.5, na.action=na.exclude)
on the first 12,528 lines of my dataset, the model works fine. Line 12,529 looks totally normal.
Similarly, if I run tail(mydata, 11943) I get a dataframe that runs without error, but tail(mydata, 11944) gives me a dataframe that generates the error. I can also run the subset 9990:21825 without error, but extending the dataframe on either side generates the error. The whole dataframe is 29,450 observations, so this middle slice contains the supposedly problematic observations. I tried making a smaller version of my dataset that contained just the problem boundaries plus some observations around them, and I can see that 3 of the 4 cases involve the same subject (7645), but I don't know what to make of that. I don't see how to make this reproducible without providing the whole dataframe (in case you were wondering, the small dataset doesn't cause any error). So here is the csv file I used.
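Roughly, the subsets I tried (each passed as the data argument to the lqmm call above) look like this:
head(mydata, 12528)    # model runs fine
head(mydata, 12529)    # model generates the error
tail(mydata, 11943)    # runs without error
tail(mydata, 11944)    # generates the error
mydata[9990:21825, ]   # runs without error; extending either bound brings the error back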
Here is the function that gets the dataframe ready for analysis:
prep_data_set <- function(data_file, brain_var = 'beta', beh_var = 'accuracy') {
  data = read.csv(data_file)
  data$subject <- factor(data$subject)
  data$type <- factor(data$type)
  data$type <- relevel(data$type, ref = "S")
  data$taught <- factor(data$taught)
  data <- subset(data, data$run_num < 13)
  data$run = factor(data$run_num)
  brain_mean <- mean(data[[brain_var]])
  brain_sd <- sd(data[[brain_var]])
  beh_mean <- mean(data[[beh_var]])
  beh_sd <- sd(data[[beh_var]])
  data <- subset(data, data$cluster_name != "")
  data$cluster_name <- factor(data$cluster_name)
  data$mean_centered_brain <- data[[brain_var]]
  data$std_brain <- data$mean_centered_brain/brain_sd
  data$mean_centered_beh <- data[[beh_var]]
  data$std_beh <- data$mean_centered_beh/beh_sd
  return(data)
}
I run
mydata = prep_data_set(file.path(resdir, 'robust0005', 'pos_rel_con__all_clusters.csv'))
m1 = lqmm(std_brain ~ std_beh*type*taught, random = ~1, group=subject, data = mydata, tau=.5, na.action=na.exclude)
to generate the error.
By comparison
regular_model = lmer(std_brain ~ type*taught*std_beh + (1|subject/run) +
(1|subject:cluster_name), data = mydata)
runs fine.
I hope there is something interesting and generalizable in this question; I know it's kind of annoying to post to Stack Overflow with some idiosyncratic problem in a ~30000 line dataset.
I'm running an SVM model on a dataset, and it runs through fine for the train/fitted model. However, when I run it on the prediction/test data, it seems to be dropping rows for some reason, and when I try to add pred_SVM back into the dataset, the lengths are different.
Below is my code
# SVM MODEL
SVM_swim <- svm(racetime_mins ~ event_date + event_month + year + event_id +
                  gender + place + distance + New_Condition +
                  raceNo_Updated + handicap_mins + points +
                  Wind_Speed_knots + Air_Temp_Celsius +
                  Water_Temp_Celsius + Wave_Height_m,
                data = SVMTrain, kernel = 'linear')
summary(SVM_swim)
#Predict Race_Time Using Test Data
pred_SVM <- predict(SVM_swim, SVMTest, type ="response")
View(pred_SVM)
#Add predicted Race_Times back into the test dataset.
SVMTest$Pred_RaceTimes<- pred_SVM
View(SVMTest) #Returns 13214 rows
View(pred_SVM) #Returns 12830
Error in `$<-.data.frame`(`*tmp*`, Pred_RaceTime, value = c(`2` = 27.1766438249356, :
  replacement has 12830 rows, data has 13214
As mentioned in the comments, you need to remove the NA values from your dataset. svm()/predict() handles them for you by dropping those rows, so the pred_SVM output is calculated without the NA rows.
To test whether there are NA values in your data, just run: sum(is.na(SVMTest))
I am pretty sure you will see a number greater than zero.
Before starting to build your SVM model, get rid of all NA values with
dataset <- dataset[complete.cases(dataset), ]
Then, after separating your data into Train and Test sets, you can run
SVM_swim <- svm(.....,data = SVMTrain, kernel='linear')
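For example, applied to the code in the question (a sketch; SVMTest_cc is just an illustrative name, and the second variant relies on predict.svm()'s na.action argument, which defaults to na.omit):
# Option 1: drop incomplete rows from the test set before predicting,
# so the prediction vector and the data frame have the same length
SVMTest_cc <- SVMTest[complete.cases(SVMTest), ]
pred_SVM <- predict(SVM_swim, SVMTest_cc)
SVMTest_cc$Pred_RaceTimes <- pred_SVM

# Option 2 (untested sketch): keep SVMTest intact and ask predict() to pad
# the rows it drops with NA, so the lengths still line up
pred_SVM_padded <- predict(SVM_swim, SVMTest, na.action = na.exclude)
SVMTest$Pred_RaceTimes <- pred_SVM_padded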
I'm trying to use the random forest model to predict Gender based on Height, Weight and Number of siblings. The data comes from a much larger data set that contains dozens of variables, but I've cleaned it down to this clean data.frame, with NA values omitted and only the 4 variables I care about, the last column being Gender.
I've tried fiddling with the code and searching everywhere but I can't find a concrete fix.
Here's the code:
ind <- sample(nrow(clean),0.8*nrow(clean))
train <- clean[ind,]
test <- clean[-ind,]
rf <- randomForest(Gender ~ ., data = train[,1:4], ntree = 20)
pred <- predict(rf, newdata = test[,-c(length(test))])
cm <- table(test$Gender, pred)
cm
and here's the output:
Error in `[.default`(table(observed = y, predicted = out.class), levels(y), : subscript out of bounds
Traceback:
1. randomForest(Gender ~ ., data = train[, 1:4], ntree = 20)
2. randomForest.formula(Gender ~ ., data = train[, 1:4], ntree = 20)
3. randomForest.default(m, y, ...)
4. table(observed = y, predicted = out.class)[levels(y), levels(y)]
5. `[.table`(table(observed = y, predicted = out.class), levels(y),
. levels(y))
6. NextMethod()
The problem is likely that you have a level of some variable in your test data that was not reflected in your training data, so when the model goes to assign the outcome, it has no basis to do so.
It is impossible to say for sure without sample data, but this is the most likely scenario. Try setting a seed with set.seed(3), then change the seed to set.seed(28), and so on a few times, to see whether you end up finding a split where you do not get the error.
Compare a data frame that produces the error with one that does not, to see what is missing.
EDIT:
Also, try running str(train) and str(test) to be sure the fields have remained the same. You can share that if you like by editing your post.
If any of the columns are factors with missing levels (say a factor has 10 levels but only 8 are represented in the train set, with 9 or 10 in the test set), it might be a problem. They should be balanced if you are trying to create a predictor for all possible outcomes.
If nothing else works, you can set a seed and remove predictors one at a time until it runs correctly, then look to see how the train and test sets are different in that removed column.
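For example, something along these lines makes that comparison concrete (a sketch; it assumes clean and Gender as in the question, and droplevels() is there to rule out factor levels left empty by the earlier cleaning):
# drop factor levels that have no observations left after cleaning
clean <- droplevels(clean)

set.seed(3)  # fix the split so the comparison is reproducible
ind   <- sample(nrow(clean), 0.8 * nrow(clean))
train <- clean[ind, ]
test  <- clean[-ind, ]

# outcome levels actually present in each split
table(train$Gender)
table(test$Gender)

# levels that appear in the test outcome but never in the training outcome
setdiff(unique(test$Gender), unique(train$Gender))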
I'm having a problem similar to the one the questioners here had with the linear model predict function, but I am trying to use the "time series linear model" function from Rob Hyndman's forecast package.
Predict.lm in R fails to recognize newdata
predict.lm with newdata
totalConv <- ts(varData[,43])
metaSearch <- ts(varData[,45])
PPCBrand <- ts(varData[,38])
PPCGeneric <- ts(varData[,34])
PPCLocation <- ts(varData[,35])
brandDisplay <- ts(varData[,29])
standardDisplay <- ts(varData[,3])
TV <- ts(varData[,2])
richMedia <- ts(varData[,46])
df.HA <- data.frame(totalConv, metaSearch,
                    PPCBrand, PPCGeneric, PPCLocation,
                    brandDisplay, standardDisplay,
                    TV, richMedia)
As you can see I've tried to avoid the names issues by creating a data frame of the time series objects.
However, I then fit a tslm object (time series linear model) as follows -
fit1 <- tslm(totalConv ~ metaSearch
             + PPCBrand + PPCGeneric + PPCLocation
             + brandDisplay + standardDisplay
             + TV + richMedia, data = df.HA)
Despite having created a data frame and named all the objects properly, when I call forecast(fit1) I get the same dimension error these other users have experienced.
Error in forecast.lm(fit1) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 696 rows
2: 'newdata' had 10 rows but variables found have 696 rows
The model frame seems to give sensible names to all of the variables, so I don't know what is up with the forecast function:
names(model.frame(fit1))
[1] "totalConv" "metaSearch" "PPCBrand" "PPCGeneric" "PPCLocation" "brandDisplay"
[7] "standardDisplay" "TV" "richMedia"
Can anyone suggest any other improvements to my model specification that might help the forecast function to run?
EDIT 1: OK, just so there's a working example, I've used the data given in Irsal's answer to this question (converted to time series objects) and then fitted the tslm. I get the same error (different dimensions, obviously):
Is there an easy way to revert a forecast back into a time series for plotting?
I'm really confused about what I'm doing wrong; my code looks identical to that used in all of the examples on this....
data <- c(11,53,50,53,57,69,70,65,64,66,66,64,61,65,69,61,67,71,74,71,77,75,85,88,95,
93,96,89,95,98,110,134,127,132,107,94,79,72,68,72,70,66,62,62,60,59,61,67,
74,87,112,134,51,50,38,40,44,54,52,51,48,50,49,49,48,57,52,53,50,50,55,50,
55,60,65,67,75,66,65,65,69,72,93,137,125,110,93,72,61,55,51,52,50,46,46,45,
48,44,45,53,55,65,89,112,38,7,39,35,37,41,51,53,57,52,57,51,52,49,48,48,51,
54,48,50,50,53,56,64,71,74,66,69,71,75,84,93,107,111,112,90,75,62,53,51,52,
51,49,48,49,52,50,50,59,58,69,95,148,49,83,40,40,40,53,57,54,52,56,53,55,
55,51,54,45,49,46,52,49,50,57,58,63,73,66,63,72,72,71,77,105,97,104,85,73,
66,55,52,50,52,48,48,46,48,53,49,58,56,72,84,124,76,4,40,39,36,38,48,55,49,
51,48,46,46,47,44,44,45,43,48,46,45,50,50,56,62,53,62,63)
data2 <- c(rnorm(237))
library(forecast)
nData <- ts(data)
nData2 <- ts(data2)
dat.ts <- tslm(nData~nData2)
forecast(dat.ts)
Error in forecast.lm(dat.ts) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 237 rows
2: 'newdata' had 10 rows but variables found have 237 rows
EDIT 2: Same error even if I combine both series into a data frame.
nData.df <- data.frame(nData, nData2)
dat.ts <- tslm(nData~nData2, data = nData.df)
forecast(dat.ts)
tslm fits a linear regression model. If you want to forecast from it, you need to supply the future values of the explanatory variables, via the newdata argument of forecast.lm.
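For the example in EDIT 2, that would look something like the following (a sketch; future_vals is just an illustrative name, and the rnorm() values are placeholders standing in for whatever future regressor values you actually have):
library(forecast)

# supply future values of the regressor so forecast.lm has something to predict from
h <- 10
future_vals <- data.frame(nData2 = rnorm(h))  # placeholder future values, for illustration only

dat.ts <- tslm(nData ~ nData2, data = nData.df)
forecast(dat.ts, newdata = future_vals)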