Wrong prediction in linear SVM - r

I am writing an R script that, when run, gives the predicted value of the dependent variable. All of my variables are categorical (as shown in the picture) and each category is assigned a number. There are 101 classes in total (each class is a song name).
So I have a training dataset that contains pairs like {(2,5,6,1)82, (2,5,6,1)45, (2,5,3,1)34, ...}. I trained a linear SVM on this dataset in RStudio, and for some values of (x,y,z,w) it gives correct answers. But even though records like (2,5,6,1)X exist in the training dataset, why doesn't it predict 82 or 45? I am confused because it ignores these records and outputs a completely new value, 23.
training_set = dataset;
library(e1071)
classifier = svm(formula = Song ~ .,
data = training_set,
type = 'C-classification',
kernel = 'linear')
y_pred = predict(classifier, data.frame(Emotion = 2, Pact = 5, Mact = 6, Session = 1))
What I want is for my answer to come as close as possible. What can I do to achieve these goals?
Get at least 10 closest outcomes instead of 1 in R.
Is the linear SVM model doing well here?
How do I get values like 82 and 45, as in the training dataset, and if no entry is present, find the closest one? (Is there any model for this other than simply using Euclidean distance?)

What makes you think that your classifier will predict the same outcome for a set of predictors as your original observation? I think there might be some fundamental misconceptions about how classification works.
Here is a simple counter-example using a linear regression model. The same principle applies to your SVM.
Simulate some data
set.seed(2017);
x <- 1:10;
y <- x + rnorm(10);
We now modify one value of y and show the data of (x,y) pairs.
y[3] = -10;
df <- cbind.data.frame(x = x, y = y);
df;
# x y
#1 1 2.434201
#2 2 1.922708
#3 3 -10.000000
#4 4 2.241395
#5 5 4.930175
#6 6 6.451906
#7 7 5.041634
#8 8 7.998476
#9 9 8.734664
#10 10 11.563223
Fit a model and get predictions.
fit <- lm(y ~ x, data = df);
pred <- predict(fit);
Let's take a look at predicted responses y.pred and compare them to the original data (x, y).
data.frame(df, y.pred = pred)
# x y y.pred
#1 1 2.434201 -2.1343357
#2 2 1.922708 -0.7418526
#3 3 -10.000000 0.6506304
#4 4 2.241395 2.0431135
#5 5 4.930175 3.4355966
#6 6 6.451906 4.8280796
#7 7 5.041634 6.2205627
#8 8 7.998476 7.6130458
#9 9 8.734664 9.0055288
#10 10 11.563223 10.3980119
Note how the predicted response for x=3 is y.pred=0.65 even though you observed y=-10.
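On the follow-up question of getting 82 or 45 back (or the ten most likely songs): a single classifier returns one label per predictor combination, so one option for exact matches is to skip the model and tabulate the songs seen for that combination. The sketch below is a plain lookup, not an SVM; the toy training_set and the helper top_songs are hypothetical, with column names taken from the question.

```r
# Hypothetical toy data using the column names from the question
training_set <- data.frame(
  Emotion = c(2, 2, 2, 1),
  Pact    = c(5, 5, 5, 3),
  Mact    = c(6, 6, 3, 2),
  Session = c(1, 1, 1, 2),
  Song    = factor(c(82, 45, 34, 23))
)

# Return up to k songs observed for exactly this predictor combination,
# most frequent first (hypothetical helper, not part of any package)
top_songs <- function(data, query, k = 10) {
  keep <- data$Emotion == query$Emotion & data$Pact == query$Pact &
          data$Mact == query$Mact & data$Session == query$Session
  if (!any(keep)) return(character(0))  # no exact match in the training data
  counts <- sort(table(droplevels(data$Song[keep])), decreasing = TRUE)
  head(names(counts), k)
}

# Songs "45" and "82" were both observed for (2, 5, 6, 1)
top_songs(training_set, data.frame(Emotion = 2, Pact = 5, Mact = 6, Session = 1))
```

When there is no exact match, this returns an empty result, and a distance-based or probabilistic method would be needed to rank the nearest candidates.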

Related

How can I incorporate the prior weight in to my GLM function?

I am trying to incorporate the prior settings of my dependent variable in my logistic regression in R using the glm function. The data set I am using was created to predict churn.
So far I am using the function below:
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset, family =
binomial(link='logit'))
What I am looking for is how the weights argument works and how to include it in the function, or whether there is another way to incorporate this. The dependent variable is a nominal variable with the options 0 or 1. The data set is imbalanced: only 10% has a value of 1 on the dependent variable CH1 and the other 90% has a value of 0. Therefore the weights are (0.1, 0.9).
My dataset is built up in the following manner:
Where the independent variables vary in data type between continuous and class variables and
Although the ratio of 1s to 0s is 1:9, that does not mean the weights are 0.1 and 0.9. The weights decide how much emphasis you want to give each observation compared to the others.
And in your case, if you want to predict something, it is essential that you split your data into train and test sets and see what influence the weights have on prediction.
The example below uses the Pima Indian diabetes data; I subsample the Yes type so that the training set has a 1:9 ratio.
set.seed(111)
library(MASS)
# we sample 10 from Yes and 90 from No
idx = unlist(mapply(sample,split(1:nrow(Pima.tr),Pima.tr$type),c(90,10)))
Data = Pima.tr
trn = Data[idx,]
test = Data[-idx,]
table(trn$type)
No Yes
90 10
Let's try regressing it with weight 9 if positive, 1 if negative:
library(caret)
W = 9
lvl = levels(trn$type)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ .,data=trn,weights=fit_wts,family=binomial)
# we test it on the test set
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 34 26
Yes 8 32
You can see from the above that it's doing ok, but you are missing 26 of the 58 actual positives and also falsely labeling 8 negatives as positive. Let's say we try W = 3
W = 3
lvl = levels(trn$type)
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ .,data=trn,weights=fit_wts,family=binomial)
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 39 30
Yes 3 28
Now almost all of the positive calls are correct (28 of 31), but we still miss a lot of actual "Yes" cases (30). The bottom line is that the code above might work, but you need to run some checks to figure out the right weight for your data.
You can also look around the other stats provided by confusionMatrix in caret to guide your choice.
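To make concrete what a weight does in glm, here is a minimal sketch on made-up data (the vectors x, y and w are illustrative assumptions, not the churn data). For a binomial model with whole-number weights, weighting a row by w gives the same coefficient estimates as repeating that row w times, which is one way to read "emphasis".

```r
# Deterministic toy data: 5 groups of 20 observations with 1, 2, 4, 8, 12 positives
x <- rep(c(-2, -1, 0, 1, 2), each = 20)
y <- unlist(lapply(c(1, 2, 4, 8, 12), function(k) c(rep(1, k), rep(0, 20 - k))))
d <- data.frame(x = x, y = y)

# Give every positive case a weight of 9 and every negative case a weight of 1
w <- ifelse(d$y == 1, 9, 1)
fit_w <- glm(y ~ x, data = d, family = binomial, weights = w)

# Same model fitted on a data set where each row is repeated w times
d_rep   <- d[rep(seq_len(nrow(d)), times = w), ]
fit_rep <- glm(y ~ x, data = d_rep, family = binomial)

# The coefficient estimates agree up to numerical tolerance
all.equal(coef(fit_w), coef(fit_rep), tolerance = 1e-6)
```

Only the relative size of the weights matters for where the fit puts its emphasis, which is why (0.1, 0.9) is not implied by the class proportions alone.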
In your dataset trainingset create a column called weights_col that contains your weights (.1, .9) and then run
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset, family = binomial(link='logit'), weights = weights_col)

Logistic regression with proportions in R (dependent variable is not binary). What is R doing?

So I stumbled across the following code:
#Importing the data:
seeds.df <-
read.table('http://www.uib.no/People/nzlkj/statkey/data/seeds.csv',header=T)
attach(seeds.df)
#Making a plot of seeds eaten depending on seed type:
plot(Seed.type, Eaten)
#Testing the hypothesis:
fit1.glm <- glm(cbind(Eaten,Not.eaten)~Seed.type, binomial)
summary(fit1.glm)
From https://folk.uib.no/nzlkj/statkey/logistic.html#proportions
which provides a method for doing logistic regression on proportion data.
My question is: what is R actually doing mathematically? The response variable is two columns. As far as I knew, logistic regression is supposed to be performed on a binary dependent variable.
Is R creating a new response variable of length Eaten + Not.eaten,
populated by c(rep(1, Eaten), rep(0, Not.eaten)), and performing logistic regression on that?
e.g. for row 1 in seeds.df, Eaten = 2 and Not.eaten = 48:
row#  eaten.or.not  seed.type  Hamster
1     1             B          1
2     1             B          1
3     0             B          1
4     0             B          1
...
50    0             B          1
Then R would do glm(eaten.or.not ~ seed.type, family = 'binomial').
I tested the above and it didn't produce the same answer.
Or is R doing the following?
ln(prob of being eaten / (1 - prob of being eaten)) = intercept + B1(seed.type)
I also tested this and got something different, but I'm not sure I did it correctly.
Anyway, if someone could shed light on what R is doing mathematically for logistic regression with proportions, that would be great.
thank you for your time
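As a check on the first hypothesis in the question, the sketch below fits both forms on a small made-up data set (the counts in grouped are hypothetical, not the seeds.csv data). The two-column cbind(successes, failures) fit and the fit on the expanded 0/1 rows should give the same coefficient estimates; the reported deviances differ, so a mismatch there does not mean the models differ.

```r
# Hypothetical grouped data: one row per animal, counts of seeds eaten / not eaten
grouped <- data.frame(
  Seed.type = factor(c("A", "A", "B", "B")),
  Eaten     = c(2, 5, 20, 25),
  Not.eaten = c(48, 45, 30, 25)
)

# Grouped form: two-column response of (successes, failures)
fit_grouped <- glm(cbind(Eaten, Not.eaten) ~ Seed.type,
                   family = binomial, data = grouped)

# Expanded form: one row per seed, 1 = eaten, 0 = not eaten
long <- data.frame(
  Seed.type    = rep(grouped$Seed.type, times = grouped$Eaten + grouped$Not.eaten),
  eaten.or.not = unlist(mapply(function(e, n) c(rep(1, e), rep(0, n)),
                               grouped$Eaten, grouped$Not.eaten, SIMPLIFY = FALSE))
)
fit_long <- glm(eaten.or.not ~ Seed.type, family = binomial, data = long)

# Same coefficient estimates up to numerical tolerance
all.equal(coef(fit_grouped), coef(fit_long), tolerance = 1e-6)
```

Both fits estimate the log-odds equation written in the question, ln(p / (1 - p)) = intercept + B1(seed.type), so the two hypotheses describe the same model.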

Zero-inflated negative binomial model in R: Computationally singular

I have been comparing Poisson, negative binomial (NB), and zero-inflated Poisson and NB models in R. My dependent variable is a symptom count for generalized anxiety disorder (GAD), and my predictors are two personality traits (disinhibition [ZDis_winz] and meanness [ZMean_winz]), their interaction, and covariates of age and assessment site (dummy-coded; there are 8 sites so I have 7 of these dummy variables). I have a sample of 1206 with full data (and these are the only individuals included in the data frame).
I am using NB models for this disorder because the variance (~40) far exceeds the mean (~4). I wanted to consider the possibility of a ZINB model as well, given that ~30% of the sample has 0 symptoms.
For other symptom counts (e.g., conduct disorder), I have run ZINB models perfectly fine in R, but I am getting an error when I do the exact same thing with the GAD model. The standard NB model works fine for GAD; it is only the GAD ZINB model that's erroring out.
Here is the error I'm receiving:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 4.80021e-36
Here is the code I'm using for the (working) NB model:
summary(
NB_GAD_uw_int <- glm.nb(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data=eurodata))
Here is the code I'm using for the (not working) ZINB model (which is identical to other ZINB models I've run for other disorders):
summary(
ZINB_GAD_uw_int <- zeroinfl(
dawbac_bl_GAD_sxs_uw ~ ZMean_winz + ZDis_winz + ZMean_winz*ZDis_winz + age_years + Nottingham_dummy + Dublin_dummy + Berlin_dummy + Hamburg_dummy + Mannheim_dummy + Paris_dummy + Dresden_dummy,
data = eurodata,
dist = "negbin",
model = TRUE,
y = TRUE, x = TRUE))
I have seen a few other posts on StackOverflow and other forums about this type of issue. As far as I can tell, people generally say that this is an issue of either 1) collinear predictors or 2) too complex a model for too little data. (Please let me know if I am misinterpreting this! I'm fairly new to Poisson-based models.) However, I am still confused about these answers because: 1) In this case, none of my predictors are correlated more highly than .15, except for the main predictors of interest (ZMean_winz and ZDis_winz), which are correlated about .45. The same predictors are used in other ZINB models that have worked. 2) With 1206 participants, and having run the same ZINB model with similarly distributed count data for other disorders, I'm a little confused how this could be too complex a model for my data.
If anyone has any explanation for why this version of my model will not run and/or any suggestions for troubleshooting, I would really appreciate it! I am also happy to provide more info if needed.
Thank you so much!
The problem may be that zeroinfl is not converting categorical variables into dummy variables.
You can dummify your variables using model.matrix, which is what glm, glm.nb, etc. call internally to dummify categorical variables. This is usually preferred over manually dummifying categorical variables, and should be done to avoid mistakes and to ensure full rank of your model matrix (a full rank matrix is non-singular).
You can of course dummify categorical variables yourself; in that case I would use model.matrix to transform your input data involving categorical variables (and potentially interactions between categorical variables and other variables) into the correct model matrix.
Here is an example:
set.seed(2017)
df <- data.frame(
DV = rnorm(100),
IV1_num = rnorm(100),
IV2_cat = sample(c("catA", "catB", "catC"), 100, replace = TRUE))
head(df)
# DV IV1_num IV2_cat
#1 1.43420148 0.01745491 catC
#2 -0.07729196 1.37688667 catC
#3 0.73913723 -0.06869535 catC
#4 -1.75860473 0.84190898 catC
#5 -0.06982523 -0.96624056 catB
#6 0.45190553 -1.96971566 catC
mat <- model.matrix(DV ~ IV1_num + IV2_cat, data = df)
head(mat)
# (Intercept) IV1_num IV2_catcatB IV2_catcatC
#1 1 0.01745491 0 1
#2 1 1.37688667 0 1
#3 1 -0.06869535 0 1
#4 1 0.84190898 0 1
#5 1 -0.96624056 1 0
#6 1 -1.96971566 0 1
The manually dummified input data would then be
df.dummified = cbind.data.frame(DV = df$DV, mat[, -1])
head(df.dummified)
# DV IV1_num IV2_catcatB IV2_catcatC
#1 1.43420148 0.01745491 0 1
#2 -0.07729196 1.37688667 0 1
#3 0.73913723 -0.06869535 0 1
#4 -1.75860473 0.84190898 0 1
#5 -0.06982523 -0.96624056 1 0
#6 0.45190553 -1.96971566 0 1
which you'd use in e.g.
glm.nb(DV ~ ., data = df.dummified)
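Since the answer hinges on the model matrix being full rank, a quick check is to compare the rank of the design matrix with its number of columns. This is a minimal sketch on made-up data (d, site_A, site_B and site_C are illustrative names); it uses only base R and assumes the default tolerance of qr() is adequate for the data at hand.

```r
# Toy data with a three-level factor
set.seed(1)
d <- data.frame(
  x1   = rnorm(50),
  site = factor(rep(c("A", "B", "C"), length.out = 50))
)

# Hand-made dummies for every level: together with the intercept these are redundant
d$site_A <- as.numeric(d$site == "A")
d$site_B <- as.numeric(d$site == "B")
d$site_C <- as.numeric(d$site == "C")

X_bad  <- model.matrix(~ x1 + site_A + site_B + site_C, data = d)
X_good <- model.matrix(~ x1 + site, data = d)  # one reference level is dropped

qr(X_bad)$rank  < ncol(X_bad)    # TRUE: rank deficient, would be singular
qr(X_good)$rank == ncol(X_good)  # TRUE: full rank
```

Running the same comparison on the design matrix of the failing model is a cheap way to rule collinearity in or out before looking at other causes.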

What is the explicit form of the formula in Predict function in R

I fit my data.frame, g
   Day       V
1 13 211.45
2 15 274.40
3 18 381.15
4 21 499.80
5 26 614.65
6 29 723.75
7 33 931.70
8 36 996.35
9 40 1037.40
10 43 1277.75
by using the following steps
fit <- glm(V ~ Day, data = g, family = gaussian(link = "log"))
pred <- predict(fit, type = "response")
# get Intercept from glm
intercept<-fit$coefficients[[1]]
# Get Slope from glm
slope<-fit$coefficients[[2]]
Now, I want to calculate the Day for a particular value of V, say V = 800.0 based on the fit. Although I know the fit coefficients, I cannot construct a formula to calculate this.
Let's assume this exponential data can be represented by a formula like
V = Vo * exp(m * t)
where
Vo is the intercept
m is the slope of the fit
then the Day for a particular V = 800.0 can be calculated by
t(V) = log(V/intercept)/slope
I do not know which formula the predict function uses to calculate predicted values based on the fit. I tried to do the following
new<-data.frame(V=c(800.00))
p<-predict(fit,newdata=new, type="response")
I get the following error
Error in eval(predvars, data, env) : object 'Day' not found
This is because the function is intended to calculate V from a new Day, not the other way around.
But how can I do this using R?
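One way to do this is to invert the link function by hand. With family = gaussian(link = "log"), the fitted model is log(E[V]) = b0 + b1 * Day, so predict(type = "response") returns exp(b0 + b1 * Day), and Vo in the formula above corresponds to exp(b0) rather than to the raw intercept. The sketch below uses the data from the question; the helper name day_for_v is an assumption introduced for illustration.

```r
# Data from the question
g <- data.frame(
  Day = c(13, 15, 18, 21, 26, 29, 33, 36, 40, 43),
  V   = c(211.45, 274.40, 381.15, 499.80, 614.65,
          723.75, 931.70, 996.35, 1037.40, 1277.75)
)
fit <- glm(V ~ Day, data = g, family = gaussian(link = "log"))

b0 <- coef(fit)[["(Intercept)"]]  # intercept on the log scale
b1 <- coef(fit)[["Day"]]          # slope on the log scale

# Invert log(V) = b0 + b1 * Day to get Day as a function of V
day_for_v <- function(v) (log(v) - b0) / b1

d800 <- day_for_v(800)

# Round trip: predicting at that Day should give back V = 800
predict(fit, newdata = data.frame(Day = d800), type = "response")
```

For a model with more than one predictor there is no closed-form inverse like this, and a numerical root finder such as uniroot() would be the usual fallback.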

Having issues using the lme4 predict function on my mixed models

I’m having a bit of a struggle trying to use the lme4 predict function on my mixed models. When making predictions I want to be able to set some of my explanatory variables to a specified level but average across others.
Here’s some made up data that is a simplified, nonsense version of my original dataset:
a <- data.frame(
TLR4=factor(rep(1:3, each=4, times=4)),
repro.state=factor(rep(c("a","j"),each=6,times=8)),
month=factor(rep(1:2,each=8,times=6)),
sex=factor(rep(1:2, each=4, times=12)),
year=factor(rep(1:3, each =32)),
mwalkeri=(sample(0:15, 96, replace=TRUE)),
AvM=(seq(1:96))
)
The AvM number is the water vole identification number. The response variable (mwalkeri) is a count of the number of fleas on each vole. The main explanatory variable I am interested in is Tlr4 which is a gene with 3 different genotypes (coded 1, 2 and 3). The other explanatory variables included are reproductive state (adult or juvenile), month (1 or 2), sex (1 or 2) and year (1, 2 or 3). My model looks like this (of course this model is now inappropriate for the made up data but that shouldn't matter):
install.packages("lme4")
library(lme4)
mm <- glmer(mwalkeri~TLR4+repro.state+month+sex+year+(1|AvM), data=a,
family=poisson,control=glmerControl(optimizer="bobyqa"))
summary(mm)
I want to make predictions about parasite burden for each different Tlr4 genotype while accounting for all the other covariates. To do this I created a new dataset to specify the level I wanted to set each of the explanatory variables to and used the predict function:
b <- data.frame(
TLR4=factor(1:3),
repro.state=factor(c("a","a","a")),
month=factor(rep(1, times=3)),
sex=factor(rep(1, times=3)),
year=factor(rep(1, times=3))
)
predict(mm, newdata=b, re.form=NA, type="response")
This did work but I would really prefer to average across years instead of setting year to one particular level. However, whenever I attempt to average year I get this error message:
Error in model.frame.default(delete.response(Terms), newdata, na.action = na.action, : factor year has new level
Is it possible for me to average across years instead of selecting a specified level? Also, I've not worked out how to get the standard error associated with these predictions. The only way I've been able to get standard error for predictions was using the lsmeans() function (from the lsmeans package):
c <- lsmeans(mm, "TLR4", type="response")
summary(c, type="response")
This automatically generates the standard error. However, it is generated by averaging across all the other explanatory variables. I'm sure it’s probably possible to change that but I would rather use the predict() function if I can. My goal is to create a graph with Tlr4 genotype on the x-axis and predicted parasite burden on the y-axis to demonstrate the predicted differences in parasite burden for each genotype while all other significant covariates are accounted for.
You might be interested in the merTools package which includes a couple of functions for creating datasets of counterfactuals and then making predictions on that new data to explore the substantive impact of variables on the outcome. A good example of this comes from the README and the package vignette:
Let's take the case where we want to explore the impact of a model with an interaction term between a category and a continuous predictor. First, we fit a model with interactions:
library(lme4)      # glmer and the VerbAgg data
library(merTools)  # draw, wiggle, predictInterval
data(VerbAgg)
fmVA <- glmer(r2 ~ (Anger + Gender + btype + situ)^2 +
(1|id) + (1|item), family = binomial,
data = VerbAgg)
Now we prep the data using the draw function in merTools. Here we draw the average observation from the model frame. We then wiggle the data by expanding the dataframe to include the same observation repeated but with different values of the variable specified by the var parameter. Here, we expand the dataset to all values of btype, situ, and Anger.
# Select the average case
newData <- draw(fmVA, type = "average")
newData <- wiggle(newData, var = "btype", values = unique(VerbAgg$btype))
newData <- wiggle(newData, var = "situ", values = unique(VerbAgg$situ))
newData <- wiggle(newData, var = "Anger", values = unique(VerbAgg$Anger))
head(newData, 10)
#> r2 Anger Gender btype situ id item
#> 1 N 20 F curse other 5 S3WantCurse
#> 2 N 20 F scold other 5 S3WantCurse
#> 3 N 20 F shout other 5 S3WantCurse
#> 4 N 20 F curse self 5 S3WantCurse
#> 5 N 20 F scold self 5 S3WantCurse
#> 6 N 20 F shout self 5 S3WantCurse
#> 7 N 11 F curse other 5 S3WantCurse
#> 8 N 11 F scold other 5 S3WantCurse
#> 9 N 11 F shout other 5 S3WantCurse
#> 10 N 11 F curse self 5 S3WantCurse
Now we simply pass this new dataset to predictInterval in order to generate predictions for these counterfactuals. Then we plot the predicted values against the continuous variable, Anger, and facet and group on the two categorical variables situ and btype respectively.
plotdf <- predictInterval(fmVA, newdata = newData, type = "probability",
stat = "median", n.sims = 1000)
plotdf <- cbind(plotdf, newData)
library(ggplot2)
ggplot(plotdf, aes(y = fit, x = Anger, color = btype, group = btype)) +
geom_point() + geom_smooth(aes(color = btype), method = "lm") +
facet_wrap(~situ) + theme_bw() +
labs(y = "Predicted Probability")
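For the narrower question of averaging over year with predict(), one approach is to predict for every existing year level and then average, instead of passing a year value the model has never seen (which is what triggers the "new level" error). The sketch below uses a plain Poisson glm on made-up data so that it runs with base R only; the same pattern should carry over to the glmer fit via predict(mm, newdata = grid, re.form = NA, type = "response"). Averaging on the response scale is an assumption here; averaging the linear predictor first would give slightly different numbers.

```r
# Made-up data loosely following the structure in the question
set.seed(1)
a <- data.frame(
  TLR4     = factor(rep(1:3, each = 4, times = 8)),
  sex      = factor(rep(1:2, each = 4, times = 12)),
  year     = factor(rep(1:3, each = 32)),
  mwalkeri = sample(0:15, 96, replace = TRUE)
)
fit <- glm(mwalkeri ~ TLR4 + sex + year, family = poisson, data = a)

# One row per genotype and per year level actually present in the data
grid <- expand.grid(TLR4 = levels(a$TLR4), year = levels(a$year))
grid$sex <- factor("1", levels = levels(a$sex))  # hold sex fixed at level 1

# Predict for every row, then average the predictions over year within genotype
grid$pred <- predict(fit, newdata = grid, type = "response")
avg <- aggregate(pred ~ TLR4, data = grid, FUN = mean)
avg
```

The resulting table has one averaged prediction per genotype, which is the quantity needed for the planned plot; standard errors for these averages are not produced by this sketch.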
