Regarding usage of prediction in RandomForest implementation using Ranger - r

Overview
I am classifying documents using the random forest implementation in the ranger package for R.
I am now facing an issue:
predict() expects every feature in the training set to also be present in the real-time data set, which cannot be guaranteed,
so I am unable to predict for real-time text.
Procedure followed
Aim: to predict which class (i.e., OutputClass) a description belongs to.
Each piece of information (description, features) is converted into a document-term matrix.
Document-term matrix of the training set:
     rpm Velocity Speed OutputClass
doc1   1        0     1 fan
doc2   1        1     1 fan
doc3   1        0     1 refrigerator
doc4   1        1     1 washing machine
doc5   1        1     1 washing machine
Now train the model using the above matrix:
fit <- ranger(trainingColumnNames, data = trainset)
save(fit, file = "C:/TrainedObject.rda")
I then use the stored object to predict the class of real-time descriptions.
load("C:/TrainedObject.rda")
Again, construct the document-term matrix for RealTimeData:
     Velocity Speed OutputClass
doc5        0     1 fan
doc6        1     1 fan
doc7        0     1 refrigerator
doc8        1     1 washing machine
doc9        1     1 washing machine
The real-time data has no term or feature named "rpm".
So the moment I call the predict function
predict(fit, RealTimeData)
it throws an error saying rpm is missing.
In practice it is impossible to have every term or feature of the training set in the real-time data every time.
I tried both random forest implementations in R (ranger, randomForest) with predict parameters such as
newdata
predict.all
treetype
None of these parameters helped to predict when features are missing from the real-time data.
Could someone please help me solve this issue?
Thanks in advance.

predict expects all the features that you provided to ranger at training time. Hence, if features are missing from the test set, you can either remove the problematic features from the training set and run ranger again, or fill in the missing values. For the latter, you may want to have a look at the mice package.
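For a document-term matrix specifically, one simple way to "fill in" a missing feature is to treat an absent term as a count of zero. The sketch below assumes that interpretation is acceptable for your data; `align_to_training` is a hypothetical helper written for this example, not a ranger function, and the toy data only mirrors the matrices in the question.

```r
# Hypothetical helper (not part of ranger): make `newdata` carry exactly the
# feature columns seen during training, filling absent terms with 0.
align_to_training <- function(newdata, train_features) {
  missing_terms <- setdiff(train_features, colnames(newdata))
  for (term in missing_terms) {
    newdata[[term]] <- 0  # assumption: an absent term has a count of zero
  }
  # Drop terms the model never saw and restore the training column order.
  newdata[, train_features, drop = FALSE]
}

# Toy data mirroring the real-time matrix in the question (no "rpm" column).
train_features <- c("rpm", "Velocity", "Speed")
RealTimeData <- data.frame(
  Velocity = c(0, 1, 0, 1, 1),
  Speed    = c(1, 1, 1, 1, 1),
  row.names = paste0("doc", 5:9)
)

aligned <- align_to_training(RealTimeData, train_features)
print(aligned)

# With a fitted ranger model one would then call, e.g.:
# predict(fit, data = aligned)
```

Whether zero is a sensible fill value depends on how the features were built; for imputing genuinely unknown values, a package such as mice is the more principled route.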

Related

How do you organize data for and run multinomial probit in R?

I apologize for the "how do I run this model in R" question. I will be the first to admit that I am a newbie when it comes to statistical models. Hopefully I have enough substantive questions surrounding it to be interesting, and the question will come out more like, "Does this command in R correspond to this statistical model?"
I am trying to estimate a model of the probability that a given Twitter user "follows" a political user from a given political party. My dataframe is at the level of individual users, where each user can choose to follow or not follow a party on Twitter. As alternative-specific variables I have measures of the ideological distance between the Twitter user and the political party, and an interaction term that specifies whether the distance is positive or negative. Thus, the decision to follow a politician on Twitter is a function of your ideological distance.
Initially I tried to estimate a conditional logit model, but I quickly abandoned that idea since the choices are not mutually exclusive, i.e. users can choose to follow more than one party. Now I am unsure whether I should employ a multinomial probit or a multivariate probit, since I want my model to allow individuals to choose more than one alternative. However, when I try to estimate a multinomial probit, my code doesn't work. My code is:
mprobit <- mlogit(Follow ~ F1_Distance + F2_Distance + F1_Distance*F1_interaction + F2_Distance*F2_interaction + strata(id),
                  long, probit = TRUE, seed = 123)
And I get the following error message:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
I've tried looking the error up, but I can't seem to find anything that relates to probit models. Can you tell me what I'm doing wrong? Once again, sorry for my ignorance. Thank you for your help.
Also, I've copied part of my dataframe below. The data shows the first 6 observations for the first Twitter user, but I have a dataset of 5181 users, which corresponds to 51810 observations, since there are 10 parties in Denmark.
  id     Alternative Follow F1_Distance F2_Distance F1_interaction F2_interaction
1  1    alternativet      1  -0.9672566  -1.3101138              0              0
2  1 danskfolkeparti      0   0.6038972   1.3799961              1              1
3  1    konservative      1   1.0759252   0.8665096              1              1
4  1    enhedslisten      0  -1.0831657  -1.0815424              0              0
5  1 liberalalliance      0   1.5389934   0.8470291              1              1
6  1   nyeborgerlige      1   1.4139934   0.9898862              1              1

How to determine the correct mixed effects structure in a binomial GLMM (lme4)?

Could someone help me to determine the correct random variable structure in my binomial GLMM in lme4?
I will first try to explain my data as best as I can. I have binomial data of seedlings that were eaten (1) or not eaten (0), together with data of vegetation cover. I try to figure out if there is a relationship between vegetation cover and the probability of a tree being eaten, as the other vegetation is a food source that could attract herbivores to a certain forest patch.
The data has been collected in ~90 plots scattered over a National Park for 9 years now. Some plots were measured in all years, some only in a few years (destroyed or newly added plots). The original dataset is split in two (deciduous vs. coniferous), each part containing ~55,000 entries. About 100 saplings were measured per plot each time, so the two separate datasets probably contain about 50 trees per plot (though this will not always be the case, since the deciduous:coniferous ratio is not always equal). Each plot consists of 4 subplots.
I am aware that there might be spatial autocorrelation due to plot placement, but we will not correct for this, yet.
Every year the vegetation is surveyed in the same period. Vegetation cover is estimated at plot-level, individual trees (binary) are measured at a subplot-level.
All trees are measured, so the number of responses per subplot will differ between subplots and years, as the forest naturally regenerates.
Unfortunately, I cannot share my original data, but I tried to create an example that captures the essentials:
# set seed for whole procedure
addTaskCallback(function(...) {set.seed(453); TRUE})
# Generate vector containing individual vegetation covers (in %)
cover1vec <- c(sample(0:100, 10, replace = TRUE)) # the second argument is the number of covers generated
# Create dataset
DT <- data.frame(
  eaten   = sample(c(0, 1), 80, replace = TRUE),
  plot    = as.factor(rep(c(1:5), each = 16)),
  subplot = as.factor(rep(c(1:4), each = 2)),
  year    = as.factor(rep(c(2012, 2013), each = 8)),
  cover1  = rep(cover1vec, each = 8)
)
Which will generate this dataset:
> DT
   eaten plot subplot year cover1
1      0    1       1 2012      4
2      0    1       1 2012      4
3      1    1       2 2012      4
4      1    1       2 2012      4
5      0    1       3 2012      4
6      1    1       3 2012      4
7      0    1       4 2012      4
8      1    1       4 2012      4
9      1    1       1 2013     77
10     0    1       1 2013     77
11     0    1       2 2013     77
12     1    1       2 2013     77
13     1    1       3 2013     77
14     0    1       3 2013     77
15     1    1       4 2013     77
16     0    1       4 2013     77
17     0    2       1 2012     46
18     0    2       1 2012     46
19     0    2       2 2012     46
20     1    2       2 2012     46
....etc....
80     0    5       4 2013     82
Note1: to clarify again, in this example the number of responses is the same for every subplot:year combination, making the data balanced, which is not the case in the original dataset.
Note2: this example can not be run in a GLMM, as I get a singularity warning and all my random effect measurements are zero. Apparently my example is not appropriate to actually use (because using sample() caused the 0 and 1 to be in too even amounts to have large enough effects?).
As you can see from the example, cover data is the same for every plot:year combination.
Plots are measured multiple years (only 2012 and 2013 in the example), so there are repeated measures.
Additionally, a year effect is likely, given the fact that we have e.g. drier/wetter years.
First I thought about the following model structure:
library(lme4)
mod1 <- glmer(eaten ~ cover1 + (1 | year) + (1 | plot), data = DT, family = binomial)
summary(mod1)
Where (1 | year) should correct for differences between years and (1 | plot) should correct for the repeated measures.
But then I started thinking: all trees measured in plot 1, during year 2012 will be more similar to each other than when they are compared with (partially the same) trees from plot 1, during year 2013.
So, I doubt that this random model structure will correct for this within plot temporal effect.
So my best guess is to add another random variable, where this "interaction" is accounted for.
I know of two ways to possibly achieve this:
Method 1.
Adding the random variable " + (1 | year:plot)"
Method 2.
Adding the random variable " + (1 | year/plot)"
From what other people told me, I still do not know the difference between the two.
I saw that Method 2 added an extra random variable (year.1) compared to Method 1, but I do not know how to interpret that extra random variable.
As an example, I added the Random effects summary using Method 2 (zeros due to singularity issues with my example data):
Random effects:
 Groups    Name        Variance Std.Dev.
 plot.year (Intercept) 0        0
 plot      (Intercept) 0        0
 year      (Intercept) 0        0
 year.1    (Intercept) 0        0
Number of obs: 80, groups: plot:year, 10; plot, 5; year, 2
Can someone explain to me the actual difference between Method 1 and Method 2?
I am trying to understand what is happening, but cannot grasp it.
I already tried to get advice from a colleague, and he mentioned that it is likely more appropriate to use cbind(success, failure) per plot:year combination.
Via this site I found that cbind is used in binomial models when Ntrials > 1, which I think is indeed the case given our sampling procedure.
I wonder whether, if cbind is already used on a plot:year combination, I still need to add a plot:year random variable.
When using cbind, the example data would look like this:
> DT3
   plot year cover1 Eaten_suc Eaten_fail
8     1 2012      4         4          4
16    1 2013     77         4          4
24    2 2012     46         2          6
32    2 2013     26         6          2
40    3 2012     91         2          6
48    3 2013     40         3          5
56    4 2012     61         5          3
64    4 2013     19         2          6
72    5 2012     19         5          3
80    5 2013     82         2          6
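For reference, a table of this shape can be derived from the tree-level binary data with base R's aggregate(). This is only a sketch: the data below is simulated with an arbitrary seed, so the counts will not match the DT3 shown above, and the column layout is assumed from the example.

```r
set.seed(453)  # assumption: any fixed seed; the question's exact draws are not reproduced

# Binary, tree-level data shaped like the question's DT (values are simulated).
DT <- data.frame(
  eaten   = sample(c(0, 1), 80, replace = TRUE),
  plot    = factor(rep(1:5, each = 16)),
  subplot = factor(rep(rep(1:4, each = 2), times = 10)),
  year    = factor(rep(rep(c(2012, 2013), each = 8), times = 5)),
  cover1  = rep(sample(0:100, 10, replace = TRUE), each = 8)
)

# Per-row success and failure indicators, so that summing them gives counts.
DT$Eaten_suc  <- DT$eaten
DT$Eaten_fail <- 1 - DT$eaten

# One row per plot:year combination with success and failure counts.
DT3 <- aggregate(cbind(Eaten_suc, Eaten_fail) ~ plot + year + cover1,
                 data = DT, FUN = sum)
DT3 <- DT3[order(DT3$plot, DT3$year), ]
print(DT3)
```

Note that aggregating to plot:year discards the subplot identifier, so a subplot effect can no longer be modelled afterwards.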
What would be the correct random model structure and why?
I was thinking about:
Possibility A
mod4 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot),
              data = DT3, family = binomial)
Possibility B
mod5 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year:plot),
              data = DT3, family = binomial)
But doesn't cbind(success, failure) already correct for the year:plot dependence?
Possibility C
mod6 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year/plot),
              data = DT3, family = binomial)
I include this one because I do not yet understand the difference between year:plot and year/plot.
Thus: Is it indeed more appropriate to use the cbind-method than the raw binary data? And what random model structure would be necessary to prevent pseudoreplication and other dependencies?
Thank you in advance for your time and input!
EDIT 7/12/20: I added some extra information about the original data
You are asking quite a few questions in your question. I'll try to cover them all, but I do suggest reading the documentation and vignette from lme4 and the glmmFAQ page for more information. I'd also highly recommend searching for these topics on Google Scholar, as they are fairly well covered.
I'll start somewhere simple.
Note 2 (why is my model singular?)
Your model is singular because the way you simulate your data introduces no dependency within the data. To simulate a binomial model you would compute the linear predictor eta = X %*% beta and obtain the probability of success by applying the inverse link function to eta. You can then use this probability to simulate your binary outcome. This is a two-step process: first, take some known X, or simulate X from a distribution of your choosing, and compute the probability; second, use rbinom to simulate the binary outcome, which keeps it dependent on the predictor X.
In your example you are simulating X and y independently, so the probability of success does not depend on X. Thus, when we look at the outcome y, the probability of success equals some constant c for every subgroup.
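The two-step simulation described above could look roughly like the following. All numbers (sample sizes, coefficients, the random-intercept standard deviation) are assumptions chosen for illustration, and the per-plot random intercept is added here only so that a plot-level random effect would have something to estimate.

```r
set.seed(1)  # assumption: arbitrary seed for reproducibility

n_plot     <- 20   # hypothetical number of plots
n_per_plot <- 50   # hypothetical number of saplings per plot

plot_id <- factor(rep(seq_len(n_plot), each = n_per_plot))
cover1  <- rep(runif(n_plot, 0, 100), each = n_per_plot)  # cover measured per plot

# Step 1: linear predictor eta = X %*% beta plus a random intercept per plot.
X      <- cbind(1, cover1)
beta   <- c(-1, 0.02)                        # assumed "true" fixed effects
u_plot <- rnorm(n_plot, mean = 0, sd = 0.8)  # assumed random-intercept SD
eta    <- as.vector(X %*% beta) + u_plot[as.integer(plot_id)]
p      <- plogis(eta)                        # inverse logit link

# Step 2: draw the binary outcome conditional on p.
eaten <- rbinom(length(p), size = 1, prob = p)

sim <- data.frame(eaten, plot = plot_id, cover1)
head(sim)
```

Because eaten now depends on both cover1 and the plot, a model such as eaten ~ cover1 + (1 | plot) has a real signal to recover, unlike data drawn with sample().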
Can someone explain me the actual difference between Method 1 and Method 2? ((1| year:plot) vs (1|year/plot))
This is explained in the package vignette "Fitting Linear Mixed-Effects Models Using lme4", in the table on page 7.
(1|year/plot) specifies two random intercepts: one for year and one for plot nested within year.
(1|year:plot) specifies a single random intercept for plot nested within year, i.e. we do not include the main effect of year. It is somewhat similar to having a model without an intercept (although less drastic, and the interpretation is not destroyed).
It is more common to see the first rather than the second, but we can write the first in terms of the second: (1|year) + (1|year:plot).
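To make the grouping concrete, the sketch below builds the year:plot grouping factor by hand in base R and writes the formulas out as plain formula objects. Nothing is fitted here (that would need lme4::glmer), and the remark about "year.1" is my reading of the question's output rather than something checked against lme4.

```r
# Two years and five plots, as in the question's example data.
d <- expand.grid(plot = factor(1:5), year = factor(c(2012, 2013)))

# (1 | year:plot) groups observations by each year-plot combination.
d$year_plot <- interaction(d$year, d$plot, sep = ":", drop = TRUE)
nlevels(d$year)       # 2 grouping levels for (1 | year)
nlevels(d$year_plot)  # 10 grouping levels for (1 | year:plot)

# Formula objects only; fitting them would require lme4::glmer.
f_nested   <- eaten ~ cover1 + (1 | year/plot)
f_expanded <- eaten ~ cover1 + (1 | year) + (1 | year:plot)

# "Method 2" in the question adds (1 | year/plot) on top of an existing
# (1 | year), so the year intercept is specified twice; this is presumably
# why the summary lists both "year" and "year.1".
f_method2 <- eaten ~ cover1 + (1 | year) + (1 | plot) + (1 | year/plot)
```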
Thus: Is it indeed more appropriate to use the cbind-method than the raw binary data?
cbind in a formula is used for binomial data, i.e. aggregated counts of successes and failures (or for multivariate analysis), while for binary data we use the raw 0/1 vector indicating failure/success, similar to how we'd use glm. If you are not interested in the random/fixed effect of subplot, you might be able to aggregate your data to the plot level, and then cbind would likely make sense. Otherwise stay with your 0/1 outcome vector.
What would be the correct random model structure and why?
This is a topic that is extremely hard to give a definitive answer to, and one that is still actively researched. Opinions differ greatly depending on your statistical paradigm.
Method 1: The classic approach
Classic mixed modelling is based upon knowledge of the data you are working with. In general there are several "rules of thumb" for choosing these parameters. I've gone through a few in my answer here. In general, if you are "not interested" in the systematic effect and it can be thought of as a random sample from some population, then it could be a random effect. If it is the population, i.e. the samples would not change if the process were repeated, then it likely shouldn't be.
This approach often yields "decent" choices for those who are new to mixed effect models, but it is highly criticized by authors who prefer methods similar to those we'd use in non-mixed models (e.g. visualizing to base our choice and testing for significance).
Method 2: Using visualization
If you are able to split your data into independent subgroups while keeping the fixed effect structure, a reasonable approach for checking potential random effects is to estimate marginal models (e.g. using glm) across these subgroups and see whether the fixed effects are "normally distributed" between the subgroups. The function lmList (in lme4) is designed for this specific approach. In linear models we would indeed expect these to be normally distributed, and thus we can get an indication of whether a specific grouping "might" be a valid random effect structure. I believe the same is approximately true for generalized linear models, but I lack references. I know that Ben Bolker has advocated this approach in a prior article of his (the first reference below) that I used during my thesis. However, this is only a valid approach for strictly separable data, and the implementation is not robust when factor levels are not shared across all groups.
So in short: if you have the right data, this approach is simple, fast and seemingly highly reliable.
Method 3: Fitting maximal/minimal models and decreasing/expanding model based on AIC or AICc (or p-value tests or alternative metrics)
Finally, an alternative is to use a "step-wise"-like procedure. There are advocates of starting with maximal models and of starting with minimal models (I'm certain at least one of my references below talks about problems with both, otherwise check glmmFAQ), and then testing your random effects for their validity. Just like in classic regression, this is somewhat of a double-edged sword. The reason is both extremely simple to understand and amazingly complex to comprehend.
For this method to be successful you'd have to perform cross-validation or out-of-sample validation to avoid selection bias, just as with standard models. Unlike with standard models, however, sampling becomes complicated because:
The fixed effects are conditional on the random structure.
You will need your training and testing samples to be independent.
As this depends on your random structure, and the structure is chosen in a step-wise approach, it is hard to avoid information leakage in some of your models.
The only certain way to avoid problems here is to define the space that you will be testing and to select samples based on the most restrictive model definition.
Next, we also have problems with the choice of metric for evaluation. If one is interested in the random effects, it makes sense to use the conditional AIC (the AIC of the conditional model, often written cAIC), while for fixed effects it might make more sense to optimize the marginal AIC (the AIC of the marginal model). I'd suggest checking the references to AIC on glmmFAQ, and be wary, since the large-sample results for these may be uncertain outside a very restrictive set of mixed models (namely those with "enough independent samples over random effects").
Another approach here is to use p-values instead of some metric for the procedure. But one should likely be even more wary of tests on random effects. Even with a Bayesian approach, or bootstrapping with an incredibly high number of resamples, these are sometimes just not very good. Again we need "enough independent samples over random effects" to ensure accuracy.
The DHARMa package provides some very interesting testing methods for mixed effects that might be better suited. While I was working in the area, the author was still (seemingly) developing an article documenting the validity of their chosen method. Even if one does not use it for initial selection, I can only recommend checking it out and deciding whether one believes in their methods. It is by far the simplest approach for a visual test with simple interpretation (i.e. almost no prior knowledge is needed to interpret the plots).
A final note on this method would thus be: it is indeed an approach, but one I would personally not recommend. It requires either extreme care or the author accepting ignorance of model assumptions.
Conclusion
Mixed effect parameter selection is difficult. My experience tells me that mostly a combination of methods 1 and 2 is used, while method 3 seems to be used mostly by newer authors, who tend to ignore out-of-sample error (measuring model metrics on the data used for training), ignore problems with independence of samples when fitting random effects, or restrict themselves to using this method only for testing fixed effect parameters. All three do, however, have some validity. I myself tend to be in the first group, and base my decision upon my "experience" within the field, rules of thumb and the restrictions of my data.
Your specific problem.
Given your specific problem, I would assume that a mixed effect structure of (1|year/plot/subplot) is the correct structure. If you add autoregressive (time-spatial) effects, year likely disappears. The reason for this structure is that in geo-analysis and analysis of land plots the classic approach is to include an effect for each plot. If each plot can then further be indexed into subplots, it is natural to think of "subplot" as nested in "plot". Assuming you do not model autoregressive effects, I would think of time as random for the reasons you already stated: some years will be drier and hotter than others. As the plots measured have to be present in a given year, these would be nested in year.
This is what I'd call the maximal model, and it might not be feasible depending on your amount of data. In that case I would try using (1|year) + (1|plot/subplot). If both are feasible, I would compare these models, using either bootstrapping methods or approximate LRT tests.
Note: It seems not unlikely that (1|year/plot/subplot) would result in "individual level effects", i.e. one random effect per row in your data. For reasons that I have long since forgotten (but once read), it is not plausible to have individual (also called subject-level) effects in binary mixed models. In that case it might also make sense to use the alternative approach, or to test whether your model assumptions hold when withholding subplot from your random effects.
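One quick way to see whether the innermost term of (1|year/plot/subplot) would amount to one effect per row is to count the year:plot:subplot combinations and compare them with the number of observations. The layout below is a made-up balanced example (2 trees per cell); with real data you would run the same check on your own columns.

```r
# Toy layout: 5 plots x 4 subplots x 2 years, with 2 trees per cell
# (assumed numbers, mirroring the question's balanced example).
d <- expand.grid(tree = 1:2, subplot = factor(1:4),
                 year = factor(c(2012, 2013)), plot = factor(1:5))

# Grouping factor implied by the innermost term of (1 | year/plot/subplot).
cell <- interaction(d$year, d$plot, d$subplot, drop = TRUE)

n_obs    <- nrow(d)
n_groups <- nlevels(cell)
cat("observations:", n_obs, " groups:", n_groups, "\n")

# If n_groups equals n_obs, the innermost term is an observation-level effect.
is_observation_level <- n_groups == n_obs
print(table(table(cell)))  # distribution of group sizes
```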
Below I've added some useful references, some of which are directly relevant to the question. In addition check out the glmmFAQ site by Ben Bolker and more.
References
Bolker, B. et al. (2009). "Generalized linear mixed models: a practical guide for ecology and evolution". In: Trends in ecology & evolution 24.3, p. 127–135.
Bolker, B. et al. (2011). "GLMMs in action: gene-by-environment interaction in total fruit production of wild populations of Arabidopsis thaliana". In: Revised version, part 1 1, p. 127–135.
Eager, C. and J. Roy (2017). "Mixed effects models are sometimes terrible". In: arXiv preprint arXiv:1701.04858. url: https://arxiv.org/abs/1701.04858 (last seen 19.09.2019).
Feng, Cindy et al. (2017). "Randomized quantile residuals: an omnibus model diagnostic tool with unified reference distribution". In: arXiv preprint arXiv:1708.08527. (last seen 19.09.2019).
Gelman, A. and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Hartig, F. (2019). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.2.4. url: http://florianhartig.github.io/DHARMa/ (last seen 19.09.2019).
Lee, Y. and J. A. Nelder (2004). "Conditional and Marginal Models: Another View". In: Statistical Science 19.2, p. 219–238. doi: 10.1214/088342304000000305. url: https://doi.org/10.1214/088342304000000305
Lin, D. Y. et al. (2002). "Model-checking techniques based on cumulative residuals". In: Biometrics 58.1, p. 1–12. (last seen 19.09.2019).
Lin, X. (1997). "Variance Component Testing in Generalised Linear Models with Random Effects". In: Biometrika 84.2, p. 309–326. issn: 00063444. url: http://www.jstor.org/stable/2337459 (last seen 19.09.2019).
Stiratelli, R. et al. (1984). "Random-effects models for serial observations with binary response". In: Biometrics, p. 961–971.

Holt Winters predict results differ from fitted data dramatically

I am seeing a huge difference between my HoltWinters fitted data and the predicted data. I can understand a large difference after several predictions, but shouldn't the first prediction be the same number that the fitted data would show if the data set had one more value?
Please correct me if I'm wrong, and explain why that wouldn't be the case.
Here is an example of the actual data.
1
1
1
2
1
1
-1
1
2
2
2
1
2
1
2
1
1
2
1
2
2
1
1
2
2
2
2
2
1
2
2
2
-1
1
Here is an example of the fitted data.
1.84401709401709
0.760477897417666
1.76593566042741
0.85435674207981
0.978449891674328
2.01079668445307
-0.709049507055536
1.39603638693742
2.42620183925688
2.42819282543689
2.40391946256294
1.29795840410863
2.39684770489517
1.35370435531208
2.38165200319969
1.34590347535205
1.38878761417551
2.36316132796798
1.2226736501825
2.2344269563083
2.24742853293732
1.12409156568888
Here is my R code.
randVal <- read.table("~/Documents/workspace/Roulette/play/randVal.txt", sep = "")
test <- ts(randVal$V1, start = c(1, 133), frequency = 12)
test <- fitted(HoltWinters(test))
test.predict <- predict(HoltWinters(test), n.ahead = 1*1)
Here is the predicted data after I expand it to n.ahead=1*12. Keep in mind that I only really want the first value. I don't understand why all the predict data is so low and close to 0 and -1 while the fitted data is far more accurate to the actual data. Thank you.
0.16860570380268
-0.624454483845195
0.388808753990824
-0.614404235175936
0.285645402877705
-0.746997659036848
-0.736666618626855
0.174830187188718
-1.30499945596422
-0.320145850774167
-0.0917166719596059
-0.63970713627854
Sounds like you need a statistical consultation, since the code is not throwing any errors. You also don't explain why you are dissatisfied with the results, since the first value for those two calls is the same. With that in mind, you should realize that most time-series methods assume input from de-trended and de-meaned data, so they will return estimated parameters and values that in many cases need to be offset by the global mean to predict on the original scale. (It's really bad practice to overwrite intermediate values as you are doing with 'test'.) Nonetheless, if you look at the test object you will see a column of yhat values that are on the scale of the input data.
Your question "I don't understand why all the predict data is so low and close to 0 and -1 while the fitted data is far more accurate to the actual data" doesn't say in what sense you think the "predict data[sic]" is "more accurate" than the actual data. The predict results are some sort of estimate and they are split into components, as you would see if you ran the code on the help page for predict:
plot(test,test.predict12)
It is not "data", either. It's not at all clear how it could be "more accurate" unless you have some sort of gold standard that you are not telling us about.
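As an illustration of keeping each intermediate object under its own name, here is a small sketch using R's built-in co2 series as stand-in data (the question's file is not available). It is worth noticing that in the question's code the second HoltWinters call receives the output of fitted(), not the original series, because test has been overwritten; the sketch below forecasts from the model fitted to the original series instead.

```r
# Built-in monthly CO2 series as stand-in data (assumption: any seasonal ts works).
hw <- HoltWinters(co2)

# Keep each intermediate result under its own name instead of overwriting.
hw_fitted <- fitted(hw)                 # columns: xhat, level, trend, season
pred_1    <- predict(hw, n.ahead = 1)
pred_12   <- predict(hw, n.ahead = 12)

print(colnames(hw_fitted))
print(head(hw_fitted[, "xhat"]))        # fitted values on the scale of the input

# The first forecast is the same whether 1 or 12 steps are requested.
print(all.equal(as.numeric(pred_1)[1], as.numeric(pred_12)[1]))
```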

trying to use package bootstrap to run a jackknife on my Random Forest model

I'm having trouble trying to figure out the following: I am running Random Forest for classification of habitat use and have GPS data from 17 animals. My data frame depicts different habitat variables such as aspect and canopy cover at each used animal location and each unused, random location. Each used location is also identified by the ID number of the animal ( this column is called "lynx"). A column called "usvsa" codes used locations as 1 and unused locations as 0. Here's the top of my spatial points data frame called sdata3:
lynx usvsa   aspect canopy_cover clearcut_area      cti deciduous dist_draw dist_ridge
 311     1 252.3302      55.3704             0 7.311823         0   90.0000  484.66483
 311     1 263.1394      55.1528             0 6.857203         0  324.4996  305.94116
 311     1 249.6992      72.9272             0 6.612025         0  364.9658  212.13203
 311     1 194.4459      50.4428             0 6.330615         0  108.1665   67.08204
Ok. So, I'd like to use Jackknifing to run Random Forest 17 times (since I have 17 individuals), leaving one animal out each run. The idea is to compare the results of each random forest run to make sure no one animal is having a disproportionately large effect on the model results. I've been reading about package "bootstrap" and the jackknife function: jackknife(x, theta, ...)
I get that I need to write a function for theta but I can't figure out how to put it all together so that each run of Random Forest leaves one animal out. Here is my Random Forest Model: randomForest(y ~ ., data=sdata3, ntree=b, importance=TRUE,norm.votes=TRUE, proximity=TRUE) I'd like to compare the importance values and oob error of each run.
Any tips would be appreciated!!
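The leave-one-animal-out idea described above can be written as a plain loop over animal IDs, independent of any jackknife helper. The sketch below only shows that structure: the data is simulated, the animal IDs are made up, and glm is used as a stand-in model so the example runs with base R alone. In practice `fit_one` (a hypothetical helper) would call randomForest(...) and return the importance values and OOB error instead.

```r
set.seed(42)  # assumption: arbitrary seed

# Simulated stand-in for sdata3: 17 animals, used (1) vs. unused (0) locations.
n_animals <- 17
n_each    <- 30
sdata3 <- data.frame(
  lynx         = rep(301:317, each = n_each),   # hypothetical animal IDs
  usvsa        = rbinom(n_animals * n_each, 1, 0.5),
  aspect       = runif(n_animals * n_each, 0, 360),
  canopy_cover = runif(n_animals * n_each, 0, 100)
)

# Placeholder model; swap in randomForest(...) and return importance / OOB error.
fit_one <- function(d) {
  m <- glm(usvsa ~ aspect + canopy_cover, data = d, family = binomial)
  coef(m)
}

# One fit per animal, each time dropping all rows of that animal.
animals <- unique(sdata3$lynx)
loo <- lapply(animals, function(id) fit_one(sdata3[sdata3$lynx != id, ]))
names(loo) <- animals

loo_table <- do.call(rbind, loo)  # one row per left-out animal
print(round(loo_table, 4))
```

Comparing the rows of loo_table then shows whether leaving out any single animal shifts the results noticeably.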

Classification using R in a data set with numeric and categorical variables

I'm working on a very big data set (CSV).
The data set is composed of both numeric and categorical columns.
One of the columns is my "target column", meaning I want to use the other columns to determine which value (out of 3 possible known values) is likely to be in the "target column". In the end I check my classification against the real data.
My question:
I'm using R.
I am trying to find a way to select the subset of features which gives the best classification.
Going over all the subsets is impossible.
Does anyone know an algorithm, or can think of a way to do it in R?
This seems to be a classification problem. Without knowing the number of covariates you have for your target I can't be sure, but wouldn't a neural network solve your problem?
You could use the nnet package, which implements a feed-forward neural network and works with multiple classes. Having categorical columns is not a problem, since you can just use factors.
Without a data sample I can only explain it briefly, but mainly, using the function:
newNet <- nnet(targetColumn ~ ., data = yourDataset, subset = yourDataSubset [..and more values]..)
you obtain a trained neural net. What is also important here is the size of the hidden layer, which is a tricky thing to get right. As a rule of thumb it should be roughly 2/3 of the number of inputs plus the number of outputs (3 in your case).
Then with:
myPrediction <- predict(newNet, newdata = yourDataset(with the other subset))
you obtain the predicted values. As for how to evaluate them, I use the ROCR package, but it currently only supports binary classification; I guess a Google search will turn up some help.
If you are adamant about eliminating some of the covariates, the cor() function may help you identify the less characteristic ones.
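As a rough sketch of how cor() might be used for such screening, the example below flags pairs of numeric predictors that are almost perfectly correlated, so that one of each pair could be dropped. The data is simulated, the 0.9 cutoff is an arbitrary assumption, and cor() only applies to numeric columns, so categorical predictors would need a different check.

```r
set.seed(7)  # assumption: arbitrary seed

# Hypothetical numeric predictors; x2 is nearly a copy of x1.
x1 <- rnorm(200)
features <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.05),
  x3 = rnorm(200)
)

cor_mat <- cor(features)
print(round(cor_mat, 2))

# List pairs whose absolute correlation exceeds an assumed cutoff of 0.9.
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
redundant_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]]
)
print(redundant_pairs)
```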
Edit for a step-by-step guide:
Let's say we have this dataframe:
str(df)
'data.frame': 5 obs. of 3 variables:
 $ a: num 1 2 3 4 5
 $ b: num 1 1.5 2 2.5 3
 $ c: Factor w/ 3 levels "blue","red","yellow": 2 2 1 2 3
The column c has 3 levels, that is, 3 types of values it can take. A dataframe does this by default when a column holds strings instead of numerical values (in R versions before 4.0; in newer versions, pass stringsAsFactors = TRUE or convert with factor()).
Now, using the columns a and b, we want to predict which value c is going to take, using a neural network. The nnet package is simple enough for this example. If you don't have it installed, use:
install.packages("nnet")
Then, to load it:
require(nnet)
After this, let's train the neural network with a sample of the data. For that, the call
portion <- sample(1:nrow(df), 0.7*nrow(df))
will store in portion the indices of 70% of the rows of the dataframe. Now, let's train that net! I recommend checking the documentation for the nnet package with ?nnet for deeper knowledge. Using only the basics:
myNet <- nnet(c ~ a + b, data = df, subset = portion, size = 1)
c ~ a + b is the formula for the prediction. You want to predict the column c using the columns a and b
data= means the data origin, in this case the dataframe df
subset= the rows used for training
size= the size of the hidden layer; as I said, use about 2/3 of the total columns (a+b) + total outputs (1)
We have a trained net now, so let's use it.
With predict you apply the trained net to new values.
newPredictedValues <- predict(myNet, newdata = df[-portion, ])
After that, newPredictedValues will hold the predictions.
Since you have both numerical and categorical data, you may want to try SVM.
I am using SVM and KNN on my numerical data, and I also tried to apply a DNN. A DNN is pretty slow to train in R, especially on big data. KNN does not need to be trained, but is meant for numerical data. The following is what I am using; maybe you can have a look at it.
library(e1071) # provides svm()
# Train the model
y_train <- data[, 1] # first column is the response variable
x_train <- subset(data, select = -1)
# keep the original predictor names so that they match the test data
train_df <- data.frame(x_train, y = y_train)
svm_model <- svm(y ~ ., data = train_df, type = "C-classification")
# Test
y_test <- testdata[, 1]
x_test <- subset(testdata, select = -1)
pred <- predict(svm_model, newdata = x_test)
svm_t <- table(pred, y_test)
sum(diag(svm_t)) / sum(svm_t) # accuracy
