logistic regression alternative interpretation - r

I am trying to analyze data on whether people catch a disease or not, i.e. the response is binary, and I applied logistic regression. Assume the result of the logistic regression looks like this:
ID = c(1,2,3,4)
Test_Data = c(0,1,1,0)
Log.Reg_Output = c(0.01,0.4,0.8,0.49)
result = data.frame(ID, Test_Data, Log.Reg_Output)
result
#   ID Test_Data Log.Reg_Output
# 1  1         0           0.01
# 2  2         1           0.40
# 3  3         1           0.80
# 4  4         0           0.49
Can I say that the person with ID = 3 will catch the disease with 80 percent probability? Is this the right approach? If not, why not? I am quite confused; any help would be great!
My second question is: how can I calculate an accuracy rate other than by rounding the model output to 0 or 1? Rounding 0.49 down to 0 does not seem very meaningful to me.
For my example, the model output would become 0, 0, 1, 0 instead of 0.01, 0.4, 0.8, 0.49, based on whether each value is greater or less than 0.5, and the accuracy rate would be 75%. Is there any other calculation method?
Thanks!

Can I say that person who has ID=3 will catch the disease at 80 percent?
It is unclear what you mean by "at"; the traditional/conventional interpretation of the logistic regression output here is that the model estimates person #3 will catch the disease with a probability of 80%. It is also unclear what you mean by "alternative" in your title (you don't elaborate in the question body).
how can I calculate accuracy rate except rounding the model result 0 or 1.
Accuracy by definition requires rounding the model results to 0/1. But, at least in principle, the decision threshold need not be 0.5...
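To make the thresholding explicit, here is a minimal R sketch using the four predictions from your question; the threshold is just a knob you can move, it is not part of the model itself:
# Minimal sketch: accuracy at an arbitrary decision threshold (0.5 is only a convention)
truth     <- c(0, 1, 1, 0)
probs     <- c(0.01, 0.4, 0.8, 0.49)
threshold <- 0.5                      # move this to trade off the two error types
pred      <- as.integer(probs >= threshold)
mean(pred == truth)                   # 0.75 at threshold 0.5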
Because rounding 0.49 to 0 is not so meaningful I think.
Do you think rounding 0.49 to 1 is more meaningful? Because this is the only alternative choice in a binary classification setting (a person either will catch the disease, or not).
Regarding the log loss metric mentioned in the comments: its role is completely different from that of accuracy (a tiny numeric sketch follows the links below). You may find these relevant answers of mine helpful:
Loss & accuracy - Are these reasonable learning curves?
How does Keras evaluate the accuracy? (despite the misleading title, it has nothing particular to do with Keras).
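As a quick illustration (a sketch of my own, not taken from the linked answers), log loss works directly on the predicted probabilities rather than on thresholded 0/1 labels:
# Log loss (cross-entropy) on the question's four predictions; smaller is better
truth <- c(0, 1, 1, 0)
probs <- c(0.01, 0.4, 0.8, 0.49)
-mean(truth * log(probs) + (1 - truth) * log(1 - probs))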
I seriously suggest you have a look at some logistic regression tutorials (there are literally hundreds out there); a highly recommended source is the textbook An Introduction to Statistical Learning (with Applications in R), made freely available by the authors...

Related

Extremely wide confidence interval for a significant coefficient in a GLMM logistic regression. Due to my approach? Or something else?

I have a concern with a GLMM I am running and I would be very grateful if you could help me out.
I am modelling the factors that cause a frog species to make either type 1 or type 2 calls, using a GLMM logistic regression. The data were generated from recordings of individuals in frog choruses of various sizes. For each male in the dataset, I randomly chose 100 of his calls and determined whether each was type 1 or type 2 (type 1 call = 0, type 2 call = 1). So each frog is represented by the same number of calls (100), and some frogs are represented in several choruses of different sizes (total n = 12400). The response variable is whether each call in the dataset is type 1 or 2, and my fixed effects are: the size of the chorus a frog is calling in (2, 3, 4, 5, 6), the body condition of the frog (residuals from an LM of mass on body length), and standardized body length (SVL) (body length and body condition score are not correlated, so no VIF issues). I included frog ID and chorus ID as random intercepts.
Model results
The model looks fine, and the coefficients seem sensible; they are about what I expect. The only thing that worries me is that, when I calculate the 95%CI for the coefficients, the coefficient for body condition has a huge range (-9.7 to 6.3) (see screenshot). This seems crazy. Even when exponentiated, it seems quite crazy (0 to 492). Is this reasonable?
This variable was involved in a significant interaction with chorus size; does this explain a wide CI? Or does this suggest my approach is flawed? Instead of having each male equally represented in the dataset by 100 calls in each chorus he is in, should I instead collapse that down to a proportion (e.g. the proportion of type 2 calls out of the 100 randomly selected calls for each male) and model this as a Poisson regression or something? Is the way I'm doing my logistic regression a reasonable approach? I have run model checks and everything and they all seem to point to logistic regression being suitable for my data, at least as I have set it up currently.
Thanks for any help you can provide!
Values I get after standardizing condition:
                               2.5 %      97.5 %
.sig01                  2.0948132676   3.1943483
.sig02                  0.0000000000   2.0980214
(Intercept)            -3.1595281536  -1.2902779
chorus_size             0.8936643930   1.0418465
cond_resid             -0.8872467384   0.5746653
svl                    -0.0865697646   1.2413117
chorus_size:cond_resid -0.0005998784   0.1383067
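For reference, a minimal sketch of how such profile intervals are typically produced with lme4; the data frame and variable names below (frogs, call_type, frog_id, chorus_id) are assumptions based on the description, not the poster's actual code:
library(lme4)
# Hypothetical names standing in for the real frog data
m <- glmer(call_type ~ chorus_size * cond_resid + svl +
             (1 | frog_id) + (1 | chorus_id),
           data = frogs, family = binomial)
confint(m, method = "profile")  # profile CIs, as in the table above (slow)
confint(m, method = "Wald")     # quicker Wald CIs for a sanity check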

How to determine the correct mixed effects structure in a binomial GLMM (lme4)?

Could someone help me determine the correct random effects structure for my binomial GLMM in lme4?
I will first try to explain my data as best as I can. I have binomial data on seedlings that were eaten (1) or not eaten (0), together with data on vegetation cover. I am trying to figure out whether there is a relationship between vegetation cover and the probability of a tree being eaten, as the other vegetation is a food source that could attract herbivores to a certain forest patch.
The data have been collected in ~90 plots scattered over a national park for 9 years now. Some plots were measured every year, some were measured only in a few years (destroyed/newly added plots). The original dataset is split in two (deciduous vs coniferous), each containing ~55,000 entries. About 100 saplings were measured per plot each time, so the two separate datasets probably contain about 50 trees per plot (though this will not always be the case, since the deciduous:coniferous ratio is not always equal). Each plot consists of 4 subplots.
I am aware that there might be spatial autocorrelation due to plot placement, but we will not correct for this, yet.
Every year the vegetation is surveyed in the same period. Vegetation cover is estimated at plot-level, individual trees (binary) are measured at a subplot-level.
All trees are measured, so the number of responses per subplot will differ between subplots and years, as the forest naturally regenerates.
Unfortunately, I cannot share my original data, but I tried to create an example that captures the essentials:
#set seed for whole procedure
addTaskCallback(function(...) {set.seed(453);TRUE})
# Generate vector containing individual vegetation covers (in %)
cover1vec <- c(sample(0:100, 10, replace = TRUE)) # the second argument (10) is the number of cover values generated
# Create dataset
DT <- data.frame(
eaten = sample(c(0,1), 80, replace = TRUE),
plot = as.factor(rep(c(1:5), each = 16)),
subplot = as.factor(rep(c(1:4), each = 2)),
year = as.factor(rep(c(2012,2013), each = 8)),
cover1 = rep(cover1vec, each = 8)
)
Which will generate this dataset:
>DT
eaten plot subplot year cover1
1 0 1 1 2012 4
2 0 1 1 2012 4
3 1 1 2 2012 4
4 1 1 2 2012 4
5 0 1 3 2012 4
6 1 1 3 2012 4
7 0 1 4 2012 4
8 1 1 4 2012 4
9 1 1 1 2013 77
10 0 1 1 2013 77
11 0 1 2 2013 77
12 1 1 2 2013 77
13 1 1 3 2013 77
14 0 1 3 2013 77
15 1 1 4 2013 77
16 0 1 4 2013 77
17 0 2 1 2012 46
18 0 2 1 2012 46
19 0 2 2 2012 46
20 1 2 2 2012 46
....etc....
80 0 5 4 2013 82
Note 1: to clarify again, in this example the number of responses is the same for every subplot:year combination, making the data balanced, which is not the case in the original dataset.
Note 2: this example cannot actually be run as a GLMM as-is; I get a singularity warning and all my random-effect variance estimates are zero. Apparently my example is not appropriate to actually use (perhaps because using sample() makes the 0s and 1s too evenly distributed for the effects to be large enough?).
As you can see from the example, cover data is the same for every plot:year combination.
Plots are measured multiple years (only 2012 and 2013 in the example), so there are repeated measures.
Additionally, a year effect is likely, given the fact that we have e.g. drier/wetter years.
First I thought about the following model structure:
library(lme4)
mod1 <- glmer(eaten ~ cover1 + (1 | year) + (1 | plot), data = DT, family = binomial)
summary(mod1)
Where (1 | year) should correct for differences between years and (1 | plot) should correct for the repeated measures.
But then I started thinking: all trees measured in plot 1 during 2012 will be more similar to each other than when they are compared with (partially the same) trees from plot 1 during 2013.
So I doubt that this random model structure will correct for this within-plot temporal effect.
So my best guess is to add another random term in which this "interaction" is accounted for.
I know of two ways to possibly achieve this:
Method 1.
Adding the random variable " + (1 | year:plot)"
Method 2.
Adding the random variable " + (1 | year/plot)"
From what other people have told me, I still do not know the difference between the two.
I saw that Method 2 added an extra random term (year.1) compared to Method 1, but I do not know how to interpret that extra term.
As an example, I added the random effects summary using Method 2 (the zeros are due to the singularity issues with my example data):
Random effects:
Groups Name Variance Std.Dev.
plot.year (Intercept) 0 0
plot (Intercept) 0 0
year (Intercept) 0 0
year.1 (Intercept) 0 0
Number of obs: 80, groups: plot:year, 10; plot, 5; year, 2
Can someone explain to me the actual difference between Method 1 and Method 2?
I am trying to understand what is happening, but I cannot grasp it.
I already tried to get advice from a colleague and he mentioned that it is likely more appropriate to use cbind(success, failure) per plot:year combination.
Via this site I found that cbind is used in binomial models when Ntrials > 1, which I think is indeed the case given our sampling procedure.
I wonder, if cbind is already applied per plot:year combination, whether I still need to add a plot:year random term?
When using cbind, the example data would look like this:
>DT3
plot year cover1 Eaten_suc Eaten_fail
8 1 2012 4 4 4
16 1 2013 77 4 4
24 2 2012 46 2 6
32 2 2013 26 6 2
40 3 2012 91 2 6
48 3 2013 40 3 5
56 4 2012 61 5 3
64 4 2013 19 2 6
72 5 2012 19 5 3
80 5 2013 82 2 6
What would be the correct random model structure and why?
I was thinking about:
Possibility A
mod4 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot),
data = DT3, family = binomial)
Possibility B
mod5 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year:plot),
data = DT3, family = binomial)
But doesn't cbind(success, failure) already correct for the year:plot dependence?
Possibility C
mod6 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year/plot),
data = DT3, family = binomial)
(included because I do not yet understand the difference between year:plot and year/plot)
Thus: is it indeed more appropriate to use the cbind method rather than the raw binary data? And what random model structure would be necessary to prevent pseudoreplication and other dependencies?
Thank you in advance for your time and input!
EDIT 7/12/20: I added some extra information about the original data
You are asking quite a few questions. I'll try to cover them all, but I do suggest reading the documentation and vignettes of lme4 and the glmmFAQ page for more information. I'd also highly recommend searching for these topics on Google Scholar, as they are fairly well covered.
I'll start somewhere simple
Note 2 (why is my model singular?)
Your model is highly singular because the way you are simulating your data does not build in any dependency between predictor and outcome. If you wanted to simulate data for a binomial model, you would first form the linear predictor eta = X %*% beta and pass it through the inverse link function to obtain the success probability; you would then use rbinom with that probability to simulate the binary outcome, keeping it dependent on your predictor X. It is thus a two-step process: first use some known X, or randomly simulate X from a prior distribution of your choosing; second, draw the binary outcomes with rbinom using the probabilities implied by X.
In your example you are simulating X and y independently, so the probability of success is equal to some constant p = c for every subgroup.
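A minimal sketch of that two-step simulation (all numbers here are illustrative assumptions, not estimates from your data):
set.seed(453)
n      <- 80
cover1 <- runif(n, 0, 100)              # step 1: a predictor X
eta    <- -2 + 0.04 * cover1            # linear predictor eta = X %*% beta (assumed betas)
p      <- plogis(eta)                   # inverse logit link gives P(eaten = 1)
eaten  <- rbinom(n, size = 1, prob = p) # step 2: the outcome now depends on X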
Can someone explain to me the actual difference between Method 1 and Method 2? ((1 | year:plot) vs (1 | year/plot))
This is explained in the lme4 package vignette "Fitting Linear Mixed-Effects Models Using lme4", in the table on page 7.
(1 | year/plot) specifies two random intercepts: one for year, and one for plot nested within year.
(1 | year:plot) specifies a single random intercept, plot nested within year; i.e. we do not include the main effect of year. It is somewhat similar to fitting a model without an intercept (although less drastic, and the interpretation is not destroyed).
It is more common to see the first rather than the second, and we can write the first in terms of the second: (1 | year/plot) is equivalent to (1 | year) + (1 | year:plot).
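In lme4 syntax (illustrative only; the formulas are the point, and the toy DT from the question is singular anyway):
library(lme4)
# These two specifications describe the same random structure:
m1 <- glmer(eaten ~ cover1 + (1 | year/plot),              data = DT, family = binomial)
m2 <- glmer(eaten ~ cover1 + (1 | year) + (1 | year:plot), data = DT, family = binomial)
# whereas (1 | year:plot) on its own drops the year-level intercept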
Thus: Is it indeed more appropriate to use the cbind-method than the raw binary data?
cbind(successes, failures) in a formula is used for aggregated binomial data (or multivariate responses), while for binary data we use the raw 0/1 vector indicating success/failure (just as we would with glm). If you are not interested in a random or fixed effect of subplot, you could aggregate your data to the plot level, and then the cbind form would make sense; otherwise, stay with your 0/1 outcome vector.
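If you do want the aggregated form, here is a hedged sketch of the bookkeeping; the aggregate() step and the resulting column names are my assumptions, not code from the question, and the fit simply mirrors Possibility A:
library(lme4)
# Collapse the 0/1 rows to success/failure counts per plot:year combination
DT3 <- aggregate(eaten ~ plot + year + cover1, data = DT,
                 FUN = function(z) c(suc = sum(z), fail = sum(1 - z)))
DT3 <- do.call(data.frame, DT3)                 # flatten the matrix column
names(DT3)[4:5] <- c("Eaten_suc", "Eaten_fail")
m_agg <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot),
               data = DT3, family = binomial)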
What would be the correct random model structure and why?
This is a topic that is extremely hard to give a definitive answer to, and one that is still actively researched. Opinions differ greatly depending on your statistical paradigm.
Method 1: The classic approach
Classic mixed modelling is based on knowledge of the data you are working with. In general there are several "rules of thumb" for choosing these parameters; I've gone through a few in my answer here. In general, if you are "not interested" in the systematic effect and it can be thought of as a random sample from some population, it could be a random effect. If it is the population, i.e. the samples would not change if the process were repeated, then it likely should not be.
This approach often yields "decent" choices for those who are new to mixed effect models, but it is highly criticized by authors who lean towards methods similar to those used in non-mixed models (e.g. basing the choice on visualization and significance testing).
Method 2: Using visualization
If you are able to split your data into independent subgroups while keeping the fixed effect structure, a reasonable approach for checking potential random effects is to estimate marginal models (e.g. using glm) within each subgroup and see whether the estimated fixed effects are roughly "normally distributed" across those subgroups. The function lmList (in lme4) is designed for exactly this. In linear models we would indeed expect the estimates to be normally distributed, so this gives an indication of whether a specific grouping "might" be a valid random effect structure. I believe the same is approximately true for generalized linear models, but I lack references. I know that Ben Bolker has advocated for this approach in a prior article of his (the first reference below), which I used during my thesis. However, this is only a valid approach for strictly separable data, and the implementation is not robust when factor levels are not shared across all groups.
So in short: if you have the right data, this approach is simple, fast and seemingly highly reliable.
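A hedged sketch of that idea on the question's (toy) data; with the real data you would of course use the real grouping of interest, and the toy fits are purely illustrative:
library(lme4)
# Fit a separate logistic regression per plot and eyeball the spread of the
# per-plot coefficients; a roughly normal spread hints that plot is a
# reasonable random-intercept grouping
fits <- lmList(eaten ~ cover1 | plot, data = DT, family = binomial)
coef(fits)                          # per-plot intercepts and slopes
hist(coef(fits)[, "(Intercept)"])   # crude check of the between-plot spread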
Method 3: Fitting maximal/minimal models and shrinking/expanding the model based on AIC or AICc (or p-value tests or alternative metrics)
Finally, an alternative is to use a "step-wise"-like procedure. There are advocates of both starting with the maximal model and starting with the minimal one (I'm certain at least one of my references below discusses problems with both; otherwise check glmmFAQ) and then testing your random effects for their validity. Just like classic step-wise regression, this is somewhat of a double-edged sword: the procedure is extremely simple to carry out, but doing it well is surprisingly complex.
For this method to be successful you'd have to perform cross-validation or out-of-sample validation to avoid selection bias, just as with standard models; but unlike standard models, the sampling becomes complicated because:
The fixed effects are conditional on the random structure.
Your training and testing samples need to be independent.
As the sampling depends on your random structure, and the structure is chosen in a step-wise fashion, it is hard to avoid information leakage into some of your models.
The only certain way to avoid problems here is to define the space that you will be testing in advance, and to select samples based on the most restrictive model definition.
Next there is also the problem of choosing a metric for evaluation. If one is interested in the random effects it makes sense to use AICc (the AIC estimate of the conditional model), while for fixed effects it might make more sense to optimize AIC (the AIC estimate of the marginal model). I'd suggest checking the references to AIC and AICc on glmmFAQ, and be wary, since the large-sample results for these may be unreliable outside a very restrictive set of mixed models (namely those with "enough independent samples over random effects").
Another approach is to use p-values instead of an information criterion for the procedure. But one should likely be even more wary of tests on random effects. Even with a Bayesian approach, or bootstrapping with an incredibly high number of resamples, these tests are sometimes just not very good. Again we need "enough independent samples over random effects" to ensure accuracy.
The DHARMa package provides some very interesting testing methods for mixed effects that might be better suited. While I was working in the area, the author was still (seemingly) preparing an article documenting the validity of the chosen method. Even if one does not use it for initial selection, I can only recommend checking it out and deciding whether one believes in its methods. It is by far the simplest approach to a visual test with a simple interpretation (i.e. almost no prior knowledge is needed to interpret the plots).
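A minimal sketch of a basic DHARMa check, assuming a fitted glmer object m (a placeholder, not the poster's model):
library(DHARMa)
sim <- simulateResiduals(fittedModel = m, n = 250)  # simulation-based residuals
plot(sim)            # QQ plot and residual-vs-predicted panels with built-in tests
testDispersion(sim)  # formal check for over/underdispersion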
A final note on this method would thus be: it is indeed an approach, but one I would personally not recommend. It requires either extreme care or accepting ignorance of the model assumptions.
Conclusion
Mixed effect parameter selection is difficult. My experience is that mostly a combination of methods 1 and 2 is used, while method 3 seems to be used mostly by newer authors, who tend either to ignore out-of-sample error (measuring model metrics on the data used for training), to ignore the independence-of-samples problems when fitting random effects, or to restrict themselves to using this method only for testing fixed effect parameters. All three do, however, have some validity. I myself tend to be in the first group, and base my decision on my "experience" within the field, rules of thumb, and the restrictions of my data.
Your specific problem.
Given your specific problem, I would assume a mixed effect structure of (1 | year/plot/subplot) to be the correct one. If you add autoregressive (temporal-spatial) effects, the year term will likely disappear. The reason for this structure is that in geo-analysis and the analysis of land plots, the classic approach is to include an effect for each plot. If each plot can be further divided into subplots, it is natural to think of subplot as nested within plot. Assuming you do not model autoregressive effects, I would treat year as random for the reasons you already stated: some years will be drier and hotter than others. As the plots measured have to be present in a given year, these would be nested in year.
This is what I'd call the maximal model, and it might not be feasible given the amount of data you have. In that case I would try (1 | year) + (1 | plot/subplot). If both are feasible, I would compare the two models, either using bootstrapping methods or approximate LRT tests.
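In code, the two candidate structures would look roughly like this (written against the question's variable names; on the toy DT both will be singular, so treat this purely as a sketch):
library(lme4)
m_max <- glmer(eaten ~ cover1 + (1 | year/plot/subplot),
               data = DT, family = binomial)   # maximal structure
m_red <- glmer(eaten ~ cover1 + (1 | year) + (1 | plot/subplot),
               data = DT, family = binomial)   # fallback if the maximal model is infeasible
anova(m_max, m_red)                            # approximate LRT comparison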
Note: it seems not unlikely that (1 | year/plot/subplot) would result in "individual-level effects", i.e. one random effect per row in your data. For reasons that I have long since forgotten (but once read), it is not plausible to have individual (also called subject-level) effects in binary mixed models. In that case it might also make sense to use the alternative structure, or to test whether your model assumptions still hold when subplot is withheld from the random effects.
Below I've added some useful references, some of which are directly relevant to the question. In addition, check out the glmmFAQ site by Ben Bolker and others.
References
Bolker, B. et al. (2009). "Generalized linear mixed models: a practical guide for ecology and evolution". Trends in Ecology & Evolution 24(3), pp. 127–135.
Bolker, B. et al. (2011). "GLMMs in action: gene-by-environment interaction in total fruit production of wild populations of Arabidopsis thaliana". Revised version, part 1, pp. 127–135.
Eager, C. and Roy, J. (2017). "Mixed effects models are sometimes terrible". arXiv preprint arXiv:1701.04858. URL: https://arxiv.org/abs/1701.04858 (accessed 19.09.2019).
Feng, C. et al. (2017). "Randomized quantile residuals: an omnibus model diagnostic tool with unified reference distribution". arXiv preprint arXiv:1708.08527 (accessed 19.09.2019).
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Hartig, F. (2019). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.2.4. URL: http://florianhartig.github.io/DHARMa/ (accessed 19.09.2019).
Lee, Y. and Nelder, J. A. (2004). "Conditional and Marginal Models: Another View". Statistical Science 19(2), pp. 219–238. DOI: 10.1214/088342304000000305. URL: https://doi.org/10.1214/088342304000000305
Lin, D. Y. et al. (2002). "Model-checking techniques based on cumulative residuals". Biometrics 58(1), pp. 1–12.
Lin, X. (1997). "Variance Component Testing in Generalised Linear Models with Random Effects". Biometrika 84(2), pp. 309–326. URL: http://www.jstor.org/stable/2337459 (accessed 19.09.2019).
Stiratelli, R. et al. (1984). "Random-effects models for serial observations with binary response". Biometrics, pp. 961–971.

In R, how do you impute left-censored data that is below a limit of detection?

This is probably a simple problem but I just can't work it out. I have a dataframe of biochemistry test results. Some of these tests, like base_crp, return values like "<3" because of limits of detection. I need to impute this data before moving forward, and I'd like to do it properly, not just by substituting a value.
I tried multLN from the zCompositions package but it seems to think that all the <3 values are negative (the error says X contains negative values). There also doesn't seem to be much documentation out there; is this an obscure package?
I also looked at LODI, but it wants me to specify covariates for the imputation model; is there a proper way to select these? Anyway, I picked 3 that should theoretically correlate well and used this code:
clmi.out <- clmi(formula = log(base_crp) ~ base_wcc + base_neut + base_lymph, df = all, lod = crplim, seed = 12345, n.imps = 5)
where base_crp is the variable I'm trying to fix. I replaced all the "<3" values with NA and added a new column with all$crplim <- "3". However, this just returns:
Error in sprintf("%s must be numeric.") : too few arguments.
Even if I can get LODI working, I'm not sure it's the right tool. I'm only an undergraduate student with little statistical background, so I don't really understand what I'm doing; I just want something that will populate the column with numbers so I can move forward with Pearson correlations, linear regressions, etc. I would really appreciate some help with this. Thanks in advance.
I've done a bit of statistical modelling of CRP (C-reactive protein) levels before - see this peer-reviewed paper as an example. CRP has an approximately log-normal distribution, and the median value in an unselected population across all testing indications is usually around 3.5 mg/l (most healthy people will be in that "<3 mg/l" category). You probably don't want to be using an imputation model, because those are for missing data. The low CRP data are not missing: you already know each value lies within a certain range, so you lose information if you impute this way.
It is reasonable to want to replace "<3" with a numeric value for regressions etc, as long as you are using this to correlate CRP with clinical findings etc and not (as Ben Norris points out) for CRP machine calibration.
I can tell you from data on over 10,000 samples of high-sensitivity CRP measurements in the study I linked above that the mean CRP in people with CRP < 3 is about 1.3, and it would be reasonable to substitute all of your "CRP < 3" measurements with 1.3 for most real-world clinical observational studies.
If you really need plausible numerical values for the censored CRP measurements, you could impute from the lower part of a lognormal distribution. The following function would give you numbers that would likely be indistinguishable from real-life CRP measurements:
impute_crp <- function(n) {
  # Draw from a lognormal consistent with real CRP data (meanlog 1.355, sdlog 1.45),
  # keep only draws below the detection limit of 3, and return n of them
  x <- exp(rnorm(10 * n, 1.355, 1.45))
  round(x[x < 3][seq(n)], 1)
}
So you could do
impute_crp(10)
#> [1] 1.5 2.0 1.1 0.4 2.5 0.1 0.7 1.5 1.4 0.4
And
base_crp[base_crp == "<3"] <- impute_crp(length(which(base_crp == "<3"))
However, you will notice that I didn't use imputation at all in my own CRP models. Replacing values below the limit with the threshold of detection was good enough for the purposes of modelling - and I'm fairly sure that whether you replace the "<3" values with a lognormal tail, with 1.3, or with 2, it will make no difference to the conclusions you are trying to draw.

Perform qnorm() for p-values of 0 and 1

I'm doing a meta-regression analysis of Granger non-causality tests for my Master's thesis. The effects of interest are F- and chi-square distributed, so to use them in a meta-regression they must be converted to normal variates. Right now I'm using the probit function (the inverse of the standard normal cumulative distribution) for this, which is basically qnorm() applied to the p-values (as far as I know).
My problem is that the underlying studies sometimes report p-values of 0 or 1. Transforming them with qnorm() gives me Inf and -Inf values.
My workaround is to replace p-values of 0 with values near 0, for example 1e-180, and p-values of 1 with values near 1, for example 0.9999999999999999 (only sixteen 9s are possible, because R rounds anything with more 9s to 1).
Does anybody know a better solution to this problem? Is this mathematically reasonable? Excluding the 0 and 1 p-values would change the results completely and is therefore, in my honest opinion, wrong.
My code sample right now:
df$p_val[df$p_val == 0] <- 1e-180
df$p_val[df$p_val == 1] <- 0.9999999999999999
df$probit <- -qnorm(df$p_val)
The minus sign in front of qnorm helps intuition, so that positive values are associated with rejecting the null hypothesis of non-causality at higher levels of significance.
I would be really glad for support / hints / etc.!

Support Vector Machine with 3 outcomes [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
Objective:
I would like to use a classification Support Vector Machine to model three outcomes: Win = 1, Loss = 0, or Draw = 2. The inputs are 50 interval variables and 2 categorical variables: isHome and isAway. The dataset comprises 23,324 instances (rows).
What the data looks like:
Outcome isHome isAway Var1 Var2 Var3 ... Var50
1 1 0 0.23 0.75 0.5 ... 0.34
0 0 1 0.66 0.51 0.23 ... 0.89
2 1 0 0.39 0.67 0.15 ... 0.45
2 0 1 0.55 0.76 0.17 ... 0.91
0 1 0 0.35 0.81 0.27 ... 0.34
The interval variables are within the range 0 to 1, so I believe they do not require scaling, given that they are percentages. The categorical inputs are coded 0/1: isHome is 1 for home games and 0 otherwise, and isAway is 1 for away games and 0 otherwise.
Summary
Create Support Vector Machine Model
Correct for gamma and cost
Questions
I will be honest, this is my first time using an SVM. I have practiced on the Titanic dataset from Kaggle, but I am trying to expand off of that and try new things.
Does the data have to be transformed onto a [0,1] scale? I do not believe it does.
I have found some literature stating it is possible to predict with 3 categories, but this is outside my scope of knowledge. How would I implement this in R?
Are there too many features that I am looking at in order for this to work, or could there be a problem with noise? I know this is not a yes or no question, but I'm curious to hear people's thoughts.
I understand an SVM can split data with a linear, radial, or polynomial kernel. How does one make the best choice for their data?
Reproducible Code
library(e1071)
library(caret)
# set up data
set.seed(500)
isHome<-c(1,0,1,0,1)
isAway<-c(0,1,0,1,0)
Outcome<-c(1,0,2,2,0)
Var1<-abs(rnorm(5,0,1))
Var2<-abs(rnorm(5,0,1))
Var3<-abs(rnorm(5,0,1))
Var4<-abs(rnorm(5,0,1))
Var5<-abs(rnorm(5,0,1))
df<-data.frame(Outcome,isHome,isAway,Var1,Var2,Var3,Var4,Var5)
# split data into train and test
inTrain<-createDataPartition(y=df$Outcome,p=0.50,list=FALSE)
traindata<-df[inTrain,]
testdata<-df[-inTrain,]
# Train the model
svm_model<-svm(Outcome ~.,data=traindata,type='C',kernel="radial")
summary(svm_model)
# predict
pred <- predict(svm_model,testdata[-1])
# Confusion Matrix
table(pred,testdata$Outcome)
# Tune the model to find the best cost and gamma
# define the predictors and response used by tune() (these were missing above)
x <- traindata[, -1]
y <- as.factor(traindata$Outcome)
svm_tune <- tune(svm, train.x = x, train.y = y,
                 kernel = "radial", ranges = list(cost = 10^(-1:2),
                 gamma = c(.5, 1, 2)))
print(svm_tune)
I'll try to answer each point. In my opinion you could arrive at different solutions for your problem(s); as it stands it's a bit "broad". You could also find answers by searching for similar topics on Cross Validated.
Does the data have to be transformed into a scale of [0,1]?
It depends; usually, yes, it would be better to scale Var1, Var2, ... A good approach is to build two pipelines, one where you scale each variable and one where you leave them as they are; the best model on the validation set wins.
Note, you'll frequently encounter this kind of approach for deciding "the best way".
Often what you're really interested in is the performance, so checking via cross-validation is a good way of evaluating your hypotheses.
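For instance, a hedged sketch of the two pipelines using e1071's built-in scale argument (this reuses the train/test split from your code; with only 5 toy rows it is purely illustrative):
library(e1071)
fit_scaled   <- svm(Outcome ~ ., data = traindata, type = "C",
                    kernel = "radial", scale = TRUE)    # internally centers/scales
fit_unscaled <- svm(Outcome ~ ., data = traindata, type = "C",
                    kernel = "radial", scale = FALSE)   # leaves the variables as-is
mean(predict(fit_scaled,   testdata[-1]) == testdata$Outcome)  # accuracy, scaled
mean(predict(fit_unscaled, testdata[-1]) == testdata$Outcome)  # accuracy, unscaled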
I have found some literature stating it is possible to predict with 3 categories, but this is outside my scope of knowledge. How would I implement this in R?
Yes it is; some functions implement this right away, in fact. See the example linked below.
Note, you could always do a multi-class classification by building more models. This is usually called the one-vs-all approach (more here).
In general you would:
First train a model to detect Wins: your labels will be [0,1], so Draws and Losses will both be counted as the "zero" class and Wins will be labeled as the "one" class
Repeat the same principle for the other two classes
Of course, you'll now have three models, and for each observation you'll have three predictions to combine.
You'll assign each observation to the class with the highest probability, or by majority vote.
Note, there are other ways; it really depends on the model you chose.
The good news is you can avoid all this. You can look here to start.
e1071::svm() can easily handle your problem; it does multi-class classification automatically, so there is no need to fit multiple models yourself.
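A minimal sketch using the data frame built in your reproducible code (df); the only requirement is that the response be a factor:
library(e1071)
df$Outcome <- as.factor(df$Outcome)            # 0 = Loss, 1 = Win, 2 = Draw
fit <- svm(Outcome ~ ., data = df, kernel = "radial")
table(predicted = predict(fit, df), actual = df$Outcome)  # in-sample confusion table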
Are there too many features that I am looking at in order for this to work, or could there be a problem with noise? I know this is not a yes or no question, but curious to hear people's thoughts.
It could or could not be the case; again, look at the performance you get via CV. Do you have reason to suspect that Var1, ..., Var50 are too many variables? Then you could build a pipeline where, before fitting, you use PCA to reduce the dimensionality, say to the components explaining 95% of the variance.
How do you know this works? You guessed it: by looking at the performance, once again on the validation set you get via CV.
My suggestion is to follow both solutions and keep the best-performing one (a hedged sketch of the PCA pipeline follows).
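A hedged caret sketch of that PCA pipeline; full_df is a placeholder for your real 23,324-row data with Outcome already converted to a factor, and thresh = 0.95 keeps the components explaining 95% of the variance:
library(caret)
ctrl_pca <- trainControl(method = "cv", number = 5,
                         preProcOptions = list(thresh = 0.95))
fit_pca <- train(Outcome ~ ., data = full_df, method = "svmRadial",
                 trControl = ctrl_pca,
                 preProcess = c("center", "scale", "pca"))
fit_pca$results   # CV performance to compare against the no-PCA pipeline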
I understand an SVM can split data with a linear, radial, or polynomial kernel. How does one make the best choice for their data?
You can treat the kernel choice as another hyperparameter to be tuned. Here, again, you need to look at the performance.
This is the workflow I'd follow, given that you seem to have already selected an SVM as the model of choice. I suggest looking at the caret package; it should simplify all the evaluations you need to do (example of CV with caret):
Scale Data vs not Scale Data
Perform PCA vs keep all variables
Fit all the models on the training set and evaluate via CV
Take your time to test all these pipelines (4 so far)
Use CV again to evaluate which kernel is best, along with the other hyperparameters (C, gamma, ...)
You should find which path leads you to the best result.
If you're familiar with the classic confusion matrix, you can use accuracy as a performance metric even for a multi-class classification problem.
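Putting the pieces together, a hedged caret sketch of the tuning-plus-evaluation step; full_df is again a placeholder for your real data, and the grid values simply mirror the ranges in your tune() call:
library(caret)
full_df$Outcome <- factor(full_df$Outcome, levels = c(0, 1, 2),
                          labels = c("Loss", "Win", "Draw"))
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(C = 10^(-1:2), sigma = c(0.5, 1, 2))
fit  <- train(Outcome ~ ., data = full_df, method = "svmRadial",
              trControl = ctrl, tuneGrid = grid)
fit$bestTune                                              # best C / sigma by CV accuracy
confusionMatrix(predict(fit, full_df), full_df$Outcome)   # multi-class confusion matrix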
