Error in predict.svm in R

Click here to access the train and test data I used. I'm new to SVM. I was trying the svm package in R to train my data, which consists of 40 attributes and 39 labels. All attributes are of double type (most of them are 0's or 1's because I performed dummy encoding on the categorical attributes). The class label consisted of different strings, which I later converted to a factor, so it is now of integer type.
model=svm(Category~.,data=train1,scale=FALSE)
p1=predict(model,test1,"prob")
This was the result I got once I trained the model using SVM:
Call:
svm(formula = Category ~ ., data = train1, scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.02564103
Number of Support Vectors: 2230
When I used the predict function, I got the following error:
Error in predict.svm(model, test1, "prob") :
NAs in foreign function call (arg 1)
In addition: Warning message:
In predict.svm(model, test1, "prob") : NAs introduced by coercion
I don't understand why this error is appearing; I checked all attributes of my training data and none of them have NAs in them. Please help me with this.
Thanks

I'm assuming you are using the package e1071 (you don't specify which package you are using, and as far as I know there is no package called svm).
The error message is confusing, but the problem is that you are passing "prob" as the third argument, while the function expects a logical there (the decision.values flag); coercing "prob" is what introduces the NAs. Try it like this:
require(e1071)
model=svm(Category~.,data=train1, scale=FALSE, probability=TRUE)
p1=predict(model,test1, probability = TRUE)
head(attr(p1, "probabilities"))
This is a sample of the output I get.
WARRANTS OTHER OFFENSES LARCENY/THEFT VEHICLE THEFT VANDALISM NON-CRIMINAL ROBBERY ASSAULT WEAPON LAWS BURGLARY
1 0.04809877 0.1749634 0.2649921 0.02899535 0.03548131 0.1276913 0.02498949 0.08322866 0.01097913 0.03800846
SUSPICIOUS OCC DRUNKENNESS FORGERY/COUNTERFEITING DRUG/NARCOTIC STOLEN PROPERTY SECONDARY CODES TRESPASS MISSING PERSON
1 0.03255891 0.003790755 0.006249521 0.01944938 0.004843043 0.01305858 0.009727582 0.01840337
FRAUD KIDNAPPING RUNAWAY DRIVING UNDER THE INFLUENCE SEX OFFENSES FORCIBLE PROSTITUTION DISORDERLY CONDUCT ARSON
1 0.01884472 0.006089563 0.001378799 0.003289503 0.01071418 0.004562048 0.003107619 0.002124643
FAMILY OFFENSES LIQUOR LAWS BRIBERY EMBEZZLEMENT SUICIDE
1 0.0004787845 0.001669914 0.0007471968 0.0007465053 0.0007374036
Hope it helps.

Related

Incompatibility between training and test data in a logistic regression model in R

I am using R version 4.2.2 in RStudio version 2022.12.0+353 on an M1 MacBook Air running macOS 13.0.1.
I am a medical doctor performing modelling on data from patients who have stayed in an Intensive Care Unit I work in (I have full ethical approval etc and no results obtained will be used to treat future patients).
When patients are extremely unwell they can be turned face down ('the prone position'). I am working on a logistic regression model to give the likelihood of death within 72 hours based on how a patient responds to this maneuver.
I have successfully trained a (not hugely accurate) model on some training data but am encountering an error message when I try to make predictions with the testing data.
The original dataframe consists of 23 variables and 365 observations. In order to create a train/test split I used initial_split() from the rsample package.
# die_in_72 is a true/false variable showing if the patient did or did not die within 72hrs
mort_72_data <- initial_split(predict_mortality_changes72,
                              prop = 0.8,
                              strata = die_in_72)
There were a small number of NAs. I imputed them using the mice package:
imputed_mort_72 <- mice(data = mort_72_data_train,
                        method = 'pmm',
                        m = 100)
mort_72_data_train_imputed <- complete(imputed_mort_72)
I have trained a logistic regression model using 3 of the variables that lasso regression showed to be relevant.
lr_model_01 <- glm(data = mort_72_data_train_imputed,
                   formula = die_in_72 ~
                     pfr_change_absolute +
                     apache_ii +
                     time_between_abg,
                   family = 'binomial')
I have successfully used the model to predict this outcome from the training data. Accuracy is not optimal yet (but I have some idea why and what to do about it).
pred_model_01 <- select(mort_72_data_train, die_in_72)
pred_model_01$prediction_response <- predict(object = lr_model_01,
                                             newdata = mort_72_data_train_imputed,
                                             type = 'response')
The problem comes when I try to make predictions on the testing data. I get an error message. Here is how I have tried to make new predictions:
mort_72_data_test <- testing(mort_72_data)
pred_model_01$prediction_response_2 <- predict(object = lr_model_01,
                                               newdata = mort_72_data_test,
                                               type = 'response')
The error message I get describes an incompatibility between data. I am absolutely stumped. Here is the error message.
Error:
! Assigned data `predict(object = lr_model_01, newdata = mort_72_data_test, type = "response")` must be compatible with existing data.
✖ Existing data has 291 rows.
✖ Assigned data has 74 rows.
ℹ Only vectors of size 1 are recycled.
Backtrace:
1. base::`$<-`(`*tmp*`, prediction_response_2, value = `<dbl>`)
12. tibble (local) `<fn>`(`<vctrs___>`)
Can anyone shed any light on why this is happening? I have made predictions in other contexts and cannot see what is going wrong here.
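A minimal sketch of what the row counts in the error suggest (a hypothetical fix, not part of the original post): pred_model_01 was built from the 291-row training split, so the 74 test-set predictions cannot be assigned to one of its columns; keeping them in a frame built from the test split avoids the length mismatch.
# Hypothetical illustration: hold the test-set predictions in their own frame,
# built from the test split (74 rows) rather than from the training split (291 rows).
mort_72_data_test <- testing(mort_72_data)
pred_model_01_test <- select(mort_72_data_test, die_in_72)
pred_model_01_test$prediction_response <- predict(object = lr_model_01,
                                                  newdata = mort_72_data_test,
                                                  type = 'response')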

How to correctly interpret glmmTMB models with large z statistics/conflicting error messages?

I am using glmmTMB to run a zero-inflated two-component hurdle model to determine how certain covariates might influence (1) whether or not a fish has food in its stomach and (2) if the stomach contains food, which covariates affect the number of prey items found in its stomach.
My data consists of the year a fish was caught, the season it was caught, sex, condition, place of origin, gross sea age (1SW = one year at sea, MSW = multiple years at sea), its genotype at two different loci, and fork length residuals. Data are available at my GitHub here.
Model interpretation
When I run the model (see code below), I get the following warning message about unusually large z-statistics.
library(glmmTMB)
library(DHARMa)
library(performance)
set.seed(111)
feast_or_famine_all_prey_df <- glmmTMB(num_prey ~ autumn_winter +
                                         fishing_season + sex + condition_scaled +
                                         place_of_origin +
                                         sea_age/(gene1 + gene2 + fork_length_residuals) + (1|location),
                                       data = data_5,
                                       family = nbinom2,
                                       ziformula = ~ .,
                                       dispformula = ~ fishing_season + place_of_origin,
                                       control = glmmTMBControl(optCtrl = list(iter.max = 100000,
                                                                               eval.max = 100000),
                                                                profile = TRUE, collect = FALSE))
summary(feast_or_famine_all_prey_df)
diagnose(feast_or_famine_all_prey_df)
Since the data does display imbalance for the offending variables (e.g. mean number of prey items in autumn = 85.33, mean number of prey items in winter = 10.61), I think the associated model parameters are near the edge of their range, hence, the extreme probabilities suggested by the z-statistics. Since this is an actual reflection of the underlying data structure (please correct me if I'm wrong!) and not a failure of the model itself, is the model output safe to interpret and use?
Conflicting error messages
Both the diagnose() function and the model diagnostics from the DHARMa package seem to suggest the model is okay.
diagnose(feast_or_famine_all_prey_df)
ff_all_prey_residuals_df<- simulateResiduals(feast_or_famine_all_prey_df, n = 1000)
testUniformity(ff_all_prey_residuals_df)
testOutliers(ff_all_prey_residuals_df, type = "bootstrap")
testDispersion(ff_all_prey_residuals_df)
testQuantiles(ff_all_prey_residuals_df)
testZeroInflation(ff_all_prey_residuals_df)
However, if I run the code performance::r2_nakagawa(feast_or_famine_all_prey_df) then I get the following error messages:
> R2 for Mixed Models
Conditional R2: 0.333
Marginal R2: 0.251
Warning messages:
1: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
2: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
3: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
4: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
5: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
None of these appeared when using diagnose(), nor were they (to the best of my knowledge) hinted at by the DHARMa diagnostics. Should these errors be believed?
Short answer: when you run performance::r2_nakagawa it refits the model with the fixed effects components removed. It's possible that your R^2 estimates are unreliable, but this shouldn't affect any of the other model results.
(update after much digging):
The code descends through these functions:
performance::r2_nakagawa
performance:::.compute_random_vars
insight::get_variance
insight:::.compute_variances
insight:::.compute_variance_distribution
insight:::.variance_distributional
insight:::null_model
insight:::.null_model_mixed
at which point it tries to run a null model with no fixed effects (num_prey ~ (1 | location)). This is where the warnings are coming from.
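If you want to reproduce those warnings directly, a rough sketch of that refit (keeping the family from the original model; exactly how insight handles the zero-inflation and dispersion formulas internally is not reproduced here) is:
library(glmmTMB)
# Approximate the internal null-model refit: all fixed effects dropped,
# only the random intercept retained. The NA/NaN and convergence warnings
# reported alongside r2_nakagawa() come from a refit along these lines.
null_mod <- glmmTMB(num_prey ~ (1 | location),
                    data   = data_5,
                    family = nbinom2)
summary(null_mod)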
When I run your code I get R^2 values of 0.308/0.237, which does suggest that this is a somewhat unstable calculation (not that these differences would really change the conclusion much).

Error when trying to fit Hierarchical GAMs (Model GS or S) using mgcv

I have a large dataset (~100k observations) of presence/absence data to which I am trying to fit a Hierarchical GAM with individual effects that have a shared penalty (model 'S' in Pedersen et al. 2019). The data consist of temp as numeric and region (5 groups) as a factor.
Here is a simple version of the model that I am trying to fit.
modS1 <- gam(occurrence ~ s(temp, region), family = binomial,
             data = df, method = "REML")
modS2 <- gam(occurrence ~ s(temp, region, k = c(10, 4)), family = binomial,
             data = df, method = "REML")
In the first case I received the following error, which I assumed was because k was set too high for region, given that there are only 5 different regions in the data set:
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors
In the second case I attempt to lower k for region and receive this error:
Error in if (k < M + 1) { : the condition has length > 1
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors
I can fit models G, GI, and I from Pedersen et al. 2019 with no issues. It is models GS and S where I run into problems.
If anyone has any insights I would really appreciate it!
The bs = "fs" argument in the code you're using as a guide is important. If we start at the ?s help page and click on the link to the ?smooth.terms help page, we see:
Factor smooth interactions
bs="fs" Smooth factor interactions are often produced using by variables (see gam.models), but a special smoother class (see factor.smooth.interaction) is available for the case in which a smooth is required at each of a large number of factor levels (for example a smooth for each patient in a study), and each smooth should have the same smoothing parameter. The "fs" smoothers are set up to be efficient when used with gamm, and have penalties on each null space component (i.e. they are fully ‘random effects’).
You need to use a smoothing basis appropriate for factors.
Notably, if you take the source code you're using as a guide, remove the bs = "fs" argument, and attempt to run gam(log(uptake) ~ s(log(conc), Plant_uo, k=5, m=2), data=CO2, method="REML"), it will produce the same error that you got.
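So, as a rough sketch of model S using the column names from the question (the choice of k here is an assumption), the call would look something like:
library(mgcv)
# Model S: a smooth of temp for each level of region, all sharing a single
# smoothing parameter, via the factor-smooth-interaction basis.
modS <- gam(occurrence ~ s(temp, region, bs = "fs", k = 10),
            family = binomial,
            data   = df,
            method = "REML")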

Error in multiple regression prediction interval

This is the error message:
Error in qt((1 - level)/2, df) : Non-numeric argument to mathematical function
What I am trying to do is to fit a model to check the association between SBP and age with sex and race adjustments.
My code uses the uwIntroStats package: the code to fit the model works. Sex (male) is coded as 0 for female and 1 for male, race is coded 1 to 4.
library(uwIntroStats)
data(mri)
model <- regress("mean", sbp~age*male+as.factor(race), data = mri)
predict(model, data.frame(age=70,male=0,race=2),interval="prediction")
Any reasons why the error occurs and how to fix it? Thanks!
You need to name the newdata argument: otherwise the predict method thinks you're trying to specify the next unmatched argument, which is level. From ?predict.uRegress:
## S3 method for class 'uRegress'
predict(object,interval="prediction",level=0.95, ...)
So
predict(model, newdata=data.frame(age=70,male=0,race=2),
interval="prediction")
works (you don't actually need to specify interval="prediction" - that's the default value).

r support vector machine e1071 training not working

I am playing around with Support Vector Machines in the R-Language. Specifically I am using the e1071 package.
As long as I follow the manual pages or the tutorial at Wikibooks, everything works. But if I try to use my own datasets with those examples, things don't work so well anymore.
It seems that the model creation fails for some reason; at least I am not getting the levels on the target column. Below you will find an example for clarification.
Maybe someone can help me figure out what I am doing wrong here. So here is all the code and data.
Test dataset
target,col1,col2
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
R-Script
library(e1071)
dataset <- read.csv("test.csv", header=TRUE, sep=',')
tuned <- tune.svm(target~., data = dataset, gamma = 10^(-6:-1), cost = 10^(-1:1))
summary(tuned)
model <- svm(target~., data = dataset, kernel="radial", gamma=0.001, cost=10)
summary(model)
Output of the summary(model) statement
> summary(model)
Call:
svm(formula = target ~ ., data = dataset, kernel = "radial", gamma = 0.001,
cost = 10)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 10
gamma: 0.001
epsilon: 0.1
Number of Support Vectors: 28
>
Wikibooks example
If I compare this output to the output of the Wikibooks example, mine is missing some information. Please notice the "Levels" section in the output:
library(MASS)
library(e1071)
data(cats)
model <- svm(Sex~., data = cats)
summary(model)
Output
> summary(model)
Call:
svm(formula = Sex ~ ., data = cats)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.5
Number of Support Vectors: 84
( 39 45 )
Number of Classes: 2
Levels:
F M
Putting Roland's answer in the proper "answer" format:
target is numeric
sex is a factor
Let me give a few more suggestions:
it seems as if target really should be a factor. (It has only 2 levels, 0 & 1, and I suspect you're trying to classify into either 0 or 1.) So stick in a dataset$target <- factor(dataset$target) somewhere (see the sketch at the end of this answer).
right now, because target is numeric, a regression model is being run instead of a classification model.
it's worthwhile to do a similar check on all of your variables before running a model. In the case you gave, for instance, it's not obvious what col1 and col2 are. If either of them is a grouping or classification, you should make it a factor, too.
In R, many functions run differently depending upon the data types fed to them. If you feed svm a factor as the response, it will run classification; if you feed it a numeric, regression. This is actually common in many programming languages, and is called function overloading.
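For completeness, here is a minimal sketch of the factor fix described above, reusing the test.csv data and the hyperparameters from the question:
library(e1071)
dataset <- read.csv("test.csv", header = TRUE)
dataset$target <- factor(dataset$target)   # make the response a factor so svm() runs classification
model <- svm(target ~ ., data = dataset, kernel = "radial", gamma = 0.001, cost = 10)
summary(model)   # should now report C-classification and list the levels 0 and 1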
