I am running an SVM model with 4 numerical columns and 1 column that is a factor. I am able to see a successful summary of the model, and the accuracy is perfect.
However, when trying to plot the model with 4 variables I get a result that does not look right, as the data points are not grouped by classification. Here is the code I've been using, if anyone could help that would be much appreciated. Also, let me know if the dataset is required for you to help me solve this issue.
library(e1071)
View(anthrokids)
anthrokids$Race <- as.factor(anthrokids$Race)
svm_model <- svm(formula = Race ~ ., data = anthrokids)
summary(svm_model)
svm_model$SV
# Plot Height against Weight, holding Age and Sex fixed at the slice values
plot(svm_model, data = anthrokids, Height ~ Weight,
     slice = list(Age = 3, Sex = 4))
prediction = predict(svm_model, anthrokids)
table(Predicted = prediction, Actual = anthrokids$Race)
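One possible cause, offered as an assumption because the data are not shown: plot.svm() draws the decision regions only for the slice where the remaining predictors equal the values given in slice, but it still draws every data point. If Age = 3 and Sex = 4 are not typical values of those columns (they look like column positions), the regions will not match most points. A minimal sketch with iris standing in for anthrokids, using medians as slice values:

```r
library(e1071)

# iris stands in for the (unavailable) anthrokids data.
data(iris)
svm_iris <- svm(Species ~ ., data = iris)

# The two plotted variables go in the formula; every other predictor
# is fixed in `slice` at a value on that variable's own scale.
slice_values <- list(
  Sepal.Length = median(iris$Sepal.Length),
  Sepal.Width  = median(iris$Sepal.Width)
)

plot(svm_iris, data = iris, Petal.Width ~ Petal.Length, slice = slice_values)
```

Even with sensible slice values, points far from the slice can land in a region of a different class, because the plot is a two-dimensional cut through a four-dimensional model.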
I developed a SVM model for a classification problem
install.packages('e1071')
library(e1071)
fit_7 = svm(formula = Heatwave ~ ., data = training_set_7,
            type = 'C-classification', kernel = 'linear')
My question is: how do I get two datasets from this fitted SVM model, the first containing the observations on one side of the hyperplane and the second containing those on the other side? All I want is two datasets with the observations above and below the hyperplane.
Thanks in advance
I am expecting to create two CSV files that contain the observations from above and below the hyperplane.
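A minimal sketch of one way to do this, assuming a two-class problem: predict() for e1071 models accepts decision.values = TRUE, and the sign of the returned decision value tells which side of the hyperplane a row falls on. A two-class subset of iris stands in for training_set_7, and the object and file names below are made up for illustration.

```r
library(e1071)

# Two-class subset of iris stands in for training_set_7 (not available here).
two_class <- droplevels(subset(iris, Species != "setosa"))

fit <- svm(Species ~ ., data = two_class,
           type = "C-classification", kernel = "linear")

# decision.values = TRUE attaches the signed value of the decision function
# for every row; its sign says which side of the hyperplane the row is on.
pred <- predict(fit, two_class, decision.values = TRUE)
dec  <- as.numeric(attr(pred, "decision.values"))

side_positive <- two_class[dec > 0, ]
side_negative <- two_class[dec <= 0, ]

# Write one CSV per side (temporary files, for illustration only).
out_pos <- file.path(tempdir(), "side_positive.csv")
out_neg <- file.path(tempdir(), "side_negative.csv")
write.csv(side_positive, out_pos, row.names = FALSE)
write.csv(side_negative, out_neg, row.names = FALSE)
```

The column name of attr(pred, "decision.values") shows the pair of class labels the value refers to, which helps to work out which class the positive side corresponds to.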
I have an R coding question.
This is my first time asking a question here, so apologies if I am unclear or do something wrong.
I am trying to use a Generalized Linear Mixed Model (GLMM) with Poisson error family to test for any significant effect on a count response variable by three separate dichotomous variables (AGE = ADULT or JUVENILE, SEX = MALE or FEMALE and MEDICATION = NEW or OLD) and an interaction between AGE and MEDICATION (AGE:MEDICATION).
There is some dependency in my data: it was collected from a total of 22 different sites (coded as a SITE vector with 33 distinct levels) over a total of 21 different years (coded as a YEAR vector with 21 distinct levels, and treated as a categorical variable). Unfortunately, not every SITE was sampled in every YEAR, and some were sampled in more years than others.
The data is also quite sparse, in that I do not have a great number of measurements of the response variable (coded as COUNT and an integer vector) per SITE per YEAR.
My Poisson GLMM is constructed using the following code:
model <- glmer(data = mydata,
               family = poisson(link = "log"),
               formula = COUNT ~ SEX + SEX:MEDICATION + AGE + AGE:SEX + MEDICATION + AGE:MEDICATION + (1|SITE/YEAR),
               offset = log(COUNT.SAMPLE.SIZE),
               nAGQ = 0)
In order to try and obtain more reliable estimates for the fixed effect coefficients (particularly given the sparse nature of my data), I am trying to obtain 95% confidence intervals for the fixed effect coefficients through non-parametric bootstrapping.
I have come across the "glmmboot" package, which can be used to conduct non-parametric bootstrapping of GLMMs. I tried to run the non-parametric bootstrapping using the following code:
library(glmmboot)
bootstrap_model(base_model = model,
                base_data = mydata,
                resamples = 1000)
When I run this code, I receive the following message:
Performing case resampling (no random effects)
Naturally, though, my model does have random effects, namely (1|SITE/YEAR).
If I try to tell the function to resample from a specific block, by adding the "resample_specific_blocks" argument, i.e.:
library(glmmboot)
bootstrap_model(base_model = model,
                base_data = mydata,
                resamples = 1000,
                resample_specific_blocks = "YEAR")
Then I get the following error message:
Performing block resampling, over SITE
Error: Invalid grouping factor specification, YEAR:SITE
I get a similar error message if I try to set 'resample_specific_blocks' to "SITE".
If I then try to set 'resample_specific_blocks' to "SITE:YEAR" or "SITE/YEAR" I get the following error message:
Error in bootstrap_model(base_model = model, base_data = mydata, resamples = 1000, :
No random columns from formula found in resample_specific_blocks
I have tried explicitly nesting YEAR within SITE and then adapting the model accordingly using the code:
mydata <- within(mydata, SAMPLE <- factor(SITE:YEAR))
model.refit <- glmer(data = mydata,
                     family = poisson(link = "log"),
                     formula = COUNT ~ SEX + AGE + MEDICATION + AGE:MEDICATION + (1|SITE) + (1|SAMPLE),
                     offset = log(COUNT.SAMPLE.SIZE),
                     nAGQ = 0)
bootstrap_model(base_model = model.refit,
                base_data = mydata,
                resamples = 1000,
                resample_specific_blocks = "SAMPLE")
But unfortunately I just get this error message:
Error: Invalid grouping factor specification, SITE
The same error message comes up if I set resample_specific_blocks argument to SITE, or if I just remove the resample_specific_blocks argument.
I believe that the case_bootstrap() function in the lmeresampler package could be another option, but according to its help page I would need to write my own function, and unfortunately I have no experience with writing functions in R.
If anyone has any suggestions on how I can get the bootstrap_model() function in the glmmboot package to recognise the random effects in my model/dataframe, or any suggestions for alternative methods on conducting non-parametric bootstrapping to create 95% confidence intervals for the coefficients of the fixed effects in my model, it would be greatly appreciated! Many thanks in advance, and for reading such a lengthy question!
For reference, I include links to the RDocumentation and GitHub for the glmmboot package:
https://www.rdocumentation.org/packages/glmmboot/versions/0.6.0
https://github.com/ColmanHumphrey/glmmboot
The following is code that will allow for creation of a reproducible example using the data set from lme4::grouseticks
#Load in required packages
library(tidyverse)
library(lme4)
library(glmmboot)
library(psych)
#Load in the grouseticks dataframe
data("grouseticks")
tibble(grouseticks)
#Create dummy vectors for SEX, AGE and MEDICATION
set.seed(1)
SEX <-sample(1:2, size = 403, replace = TRUE)
SEX <- as.factor(ifelse(SEX == 1, "MALE", "FEMALE"))
set.seed(2)
AGE <- sample(1:2, size = 403, replace = TRUE)
AGE <- as.factor(ifelse(AGE == 1, "ADULT", "JUVENILE"))
set.seed(3)
MEDICATION <- sample(1:2, size = 403, replace = TRUE)
MEDICATION <- as.factor(ifelse(MEDICATION == 1, "OLD", "NEW"))
grouseticks$SEX <- SEX
grouseticks$AGE <- AGE
grouseticks$MEDICATION <- MEDICATION
#Use the INDEX vector to create a vector of sample sizes per LOCATION
#per YEAR
grouseticks$INDEX <- 1
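sample.sizes <- grouseticks %>%
  group_by(LOCATION, YEAR) %>%
  summarise(SAMPLE.SIZE = sum(INDEX))
#Combine the dataframes together into the dataframe to be used in the
#model (the join itself is missing from the post; a left_join on
#LOCATION and YEAR is assumed here)
mydata <- left_join(grouseticks, sample.sizes, by = c("LOCATION", "YEAR"))
mydata$SAMPLE.SIZE <- as.integer(mydata$SAMPLE.SIZE)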
#Create the Poisson GLMM model
model <- glmer(data = mydata,
               family = poisson(link = "log"),
               formula = TICKS ~ SEX + AGE + MEDICATION + AGE:MEDICATION + (1|LOCATION/YEAR),
               nAGQ = 0)
#Attempt non-parametric bootstrapping on the model to get 95%
#confidence intervals for the coefficients of the fixed effects
set.seed(1)
Model.bootstrap <- bootstrap_model(base_model = model,
                                   base_data = mydata,
                                   resamples = 1000)
Model.bootstrap
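As an alternative to glmmboot, a cluster-level case bootstrap can be written by hand with lme4 alone. The sketch below is an illustration under stated assumptions, not a tested replacement for bootstrap_model(): it resamples whole LOCATION clusters with replacement, refits the model each time, and takes percentile intervals of the fixed effects. The function name boot_once, the BLOCK column, the cHEIGHT covariate, and the small number of resamples are choices made for the example, and the dummy SEX, AGE and MEDICATION covariates are left out.

```r
library(lme4)

data("grouseticks", package = "lme4")

# BLOCK is the relabelled cluster id created inside boot_once().
form <- TICKS ~ cHEIGHT + (1 | BLOCK/YEAR)

# Refit the model on one cluster-resampled copy of the data.
boot_once <- function(data, cluster = "LOCATION") {
  ids   <- unique(data[[cluster]])
  drawn <- sample(ids, length(ids), replace = TRUE)
  # Give every drawn cluster a fresh label so that a cluster drawn twice
  # is treated as two separate groups.
  pieces <- lapply(seq_along(drawn), function(i) {
    piece <- data[data[[cluster]] == drawn[i], , drop = FALSE]
    piece$BLOCK <- i
    piece
  })
  resampled <- do.call(rbind, pieces)
  resampled$BLOCK <- factor(resampled$BLOCK)
  fit <- glmer(form, data = resampled,
               family = poisson(link = "log"), nAGQ = 0)
  fixef(fit)
}

set.seed(1)
n_boot <- 20  # kept small for illustration; use many more in practice
boot_coefs <- t(replicate(n_boot, boot_once(grouseticks)))

# Percentile 95% intervals, one column per fixed effect.
boot_ci <- apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
boot_ci
```

Resampling at the highest grouping level keeps the within-site dependence intact in each resample. Whether to also resample YEAR within SITE is a modelling decision this sketch does not address.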
I ran a partial least squares (PLS) in R and I want to extract the variables so that I can run either a decision tree or random forest or some other type of model.
I tried pls1$coefficients
library(pls)
# Split data into 2 parts for PLS training (75%) and prediction (25%)
set.seed(1)
samp <- sample(nrow(newdata), nrow(newdata)*0.75)
analogous.train <- newdata[samp,]
analogous.valid <- newdata[-samp,]
#First use cross validation to find the optimal number of dimensions
pls.model = plsr(meanlog ~ ., data = analogous.train, validation = "CV")
# Find the number of dimensions with lowest cross validation error
cv = RMSEP(pls.model)
best.dims = which.min(cv$val[estimate = "adjCV", , ]) - 1
best.dims
#This told me that 8 dimensions was the best
#Now fit a model with 8 components that includes leave-one-out
#cross-validated predictions
pls1 <- plsr(meanlog ~ ., ncomp = best.dims, data = analogous.train,
             validation = "LOO")
#A fitted model is often used to predict the response values of new
#observations.
predict(pls1, ncomp = 8, newdata = analogous.valid)
I want the actual variables itself that it created. For example PCA creates PC1, PC2, etc. I assumed (maybe incorrectly) that PLS does the same.
It's in $scores. If you do
pls1$scores
You will get a matrix of 8 columns, scores for each of the latent variables, and a row for each observation.
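To use those latent variables in another model, the scores can be turned into a data frame, and scores for new observations can be obtained with predict(..., type = "scores"). The sketch below assumes the gasoline data shipped with pls in place of the asker's newdata, three components, and an rpart tree as the follow-up model. The helper as_score_df is a made-up name.

```r
library(pls)
library(rpart)

data(gasoline)
gas_train <- gasoline[1:50, ]
gas_valid <- gasoline[51:60, ]

fit <- plsr(octane ~ NIR, ncomp = 3, data = gas_train)

# Hypothetical helper: turn a score matrix into a data frame with
# syntactic column names Comp1, Comp2, ...
as_score_df <- function(s) {
  s <- as.data.frame(unclass(s))
  names(s) <- paste0("Comp", seq_along(s))
  s
}

# Scores of the training rows, and scores of new rows projected
# onto the same components.
train_scores <- as_score_df(scores(fit))
valid_scores <- as_score_df(predict(fit, newdata = gas_valid, type = "scores"))

# Use the components as predictors in a regression tree.
train_scores$octane <- gas_train$octane
tree <- rpart(octane ~ ., data = train_scores)
tree_pred <- predict(tree, newdata = valid_scores)
```

Projecting the validation rows with predict() rather than refitting keeps the components identical between the training and validation sets.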
So, I am working with a big dataset (55965 points). I am trying to run an LME model that accounts for correlation, but R returns this error:
Error: 'sumLenSq := sum(table(groups)^2)' = 3.13208e+09 is too large.
Too large or no groups in your correlation structure?
I cannot subset it since I need all the points. My questions are:
Is there some setting I can change in the function?
If not, is there any other package with similar function that would run such a big dataset?
Here is a reproducible example:
require(nlme)
my.data<- matrix(data = 0, nrow = 55965, ncol = 3)
my.data<- as.data.frame(my.data)
dummy <- rep(1, 55965)
my.data$dummy<- dummy
my.data$V1<- seq(780, 56744)
my.data$V2<- seq(1:55965)
my.data$X<- seq(49.708, 56013.708)
my.data$Y<-seq(-12.74094, -55977.7409)
null.model <- lme(fixed = V1~ V2, data = my.data, random = ~ 1 | dummy, method = "ML")
spatial_model <- update(null.model, correlation = corGaus(1, form = ~ X + Y), method = "ML")
Since you have assigned a grouping factor with only one level, there are no groups in the data, which is what the error message reports. If you just want to account for spatial autocorrelation, with no other random effects, use gls from the same package.
Edit: A further note on two different approaches to modelling spatial autocorrelation: corGaus (and the other corSpatial-type functions) implement spatial correlation models for the regression residuals, which is different from, say, a spatial random effect added to the model based on county/district/grid identity.
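The number in the error message can be reproduced with plain arithmetic. With a single grouping level all 55965 rows fall into one group, so sum(table(groups)^2) is 55965 squared. The sketch below checks this in base R and, as a purely hypothetical comparison, shows how much smaller the quantity becomes when the rows are split into blocks of at most 1000. The block variable is made up for illustration, and the comparison with .Machine$integer.max is an assumption about why the value counts as too large.

```r
n <- 55965

# One grouping level (as with `dummy`): every row is in the same group.
one_group <- rep(1, n)
sum_len_sq_one <- sum(as.numeric(table(one_group))^2)
sum_len_sq_one                         # equals 55965^2
sum_len_sq_one > .Machine$integer.max  # larger than R's biggest integer

# Hypothetical grouping: consecutive blocks of at most 1000 rows.
block <- ceiling(seq_len(n) / 1000)
sum_len_sq_blocks <- sum(as.numeric(table(block))^2)
sum_len_sq_blocks
```

Because the quantity depends on the squared group sizes, any specification that leaves all rows in one group is likely to hit the same limit. A real grouping factor in the correlation form (for example form = ~ X + Y | group) is what shrinks it.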
Suppose I have this dataset:
require(rms)
newdata <- data.frame(eduattain = rep(c(1,2,3), times=2), dadedu=rep(c(1,2,3),each=2),
random=rnorm(6, mean(1000),sd=50))
I transform both the dependent and independent variables to factors
newdata$eduattain <- factor(newdata$eduattain, levels = 1:3, labels = c("L1","L2","L3"),
ordered = T)
newdata$dadedu <- factor(newdata$dadedu, levels = 1:3, labels = c("L1","L2","L3"))
and conduct a simple ordinal logistic regression with weights:
model1 <- lrm(eduattain ~ dadedu, data=newdata, weights = random, normwt = T)
Warning message:
In lrm(eduattain ~ dadedu, data = newdata, weights = random, normwt = T) :
currently weights are ignored in model validation and bootstrapping lrm fits
I have reasons to believe that if the weights were being used the results would be quite different.
How can I fix it? Most questions that tackle this warning don't give proper answers to this specific warning (here, here, here).
Someone would need to modify the code for validate.lrm and predab.resample in the rms package. The code is on GitHub at https://github.com/harrelfe/rms