Classification using R in a data set with numeric and categorical variables

I'm working on a very big data set (CSV).
The data set is composed of both numeric and categorical columns.
One of the columns is my "target column", meaning I want to use the other columns to determine which value (out of 3 possible known values) is likely to be in the "target column", and in the end check my classification against the real data.
My question:
I'm using R.
I am trying to find a way to select the subset of features which gives the best classification.
Going over all possible subsets is infeasible.
Does anyone know an algorithm, or can think of a way to do it in R?

This seems to be a classification problem. Without knowing the number of covariates you have for your target I can't be sure, but wouldn't a neural network solve your problem?
You could use the nnet package, which fits a feed-forward neural network and works with multiple classes. Having categorical columns is not a problem, since you can just use factors.
Without a data sample I can only explain it briefly, but mainly you would use the function:
newNet <- nnet(targetColumn ~ ., data = yourDataset, subset = yourDataSubset, [..and more values..])
You obtain a trained neural net. What is also important here is the size of the hidden layer, which is a tricky thing to get right. As a rule of thumb it should be roughly 2/3 of the number of inputs plus the number of outputs (3 in your case).
Then with:
myPrediction <- predict(newNet, newdata = yourDataset)  # using the other (held-out) subset of rows
You obtain the predicted values. As for how to evaluate them, I use the ROCR package, but it currently only supports binary classification; a Google search will turn up some help.
If you are adamant about eliminating some of the covariates, the cor() function may help you identify the less characteristic ones.
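For example, a minimal sketch of that cor() idea (the data frame name yourDataset and the 0.9 cutoff are placeholders chosen for illustration, not part of the original suggestion):
# Hedged sketch: flag pairs of numeric covariates that are highly correlated,
# so that one of each pair could be dropped before fitting.
num_cols <- yourDataset[, sapply(yourDataset, is.numeric)]
cor_mat  <- cor(num_cols, use = "pairwise.complete.obs")
high     <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           r    = cor_mat[high])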
Edit, with a step-by-step guide:
Let's say we have this data frame:
str(df)
'data.frame': 5 obs. of 3 variables:
$ a: num 1 2 3 4 5
$ b: num 1 1.5 2 2.5 3
$ c: Factor w/ 3 levels "blue","red","yellow": 2 2 1 2 3
The column c has 3 levels, that is, 3 types of values it can take. This is what a data frame does by default when a column holds strings instead of numerical values.
Now, using the columns a and b, we want to predict which value c is going to take, using a neural network. The nnet package is simple enough for this example. If you don't have it installed, use:
install.packages("nnet")
Then, to load it:
require(nnet)
After this, let's train the neural network with a sample of the data. For that, the function
portion <- sample(1:nrow(df), 0.7*nrow(df))
will store in portion the indices of 70% of the rows of the data frame. Now, let's train that net! I recommend you check the documentation of the nnet package with ?nnet for deeper knowledge. Using only the basics:
myNet <- nnet(c ~ a + b, data = df, subset = portion, size = 1)
c ~ a + b is the formula for the prediction: you want to predict the column c using the columns a and b.
data= gives the data origin, in this case the data frame df.
subset= is self-explanatory.
size= is the size of the hidden layer; as I said, use about 2/3 of the total number of input columns (a and b) plus the number of outputs (1).
We have a trained net now; let's use it.
Using predict, you apply the trained net to new values.
newPredictedValues <- predict(myNet, newdata = df[-portion, ])
After that, newPredictedValues will have the predictions.
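To get a quick sense of how good those predictions are, one simple option (a common pattern, not part of the original answer) is a confusion matrix against the held-out true values:
# Hedged sketch: compare predicted classes with the true classes of the rows
# that were not used for training. type = "class" returns class labels.
predictedClasses <- predict(myNet, newdata = df[-portion, ], type = "class")
predictedClasses <- factor(predictedClasses, levels = levels(df$c))
confusion <- table(predicted = predictedClasses, actual = df$c[-portion])
confusion
sum(diag(confusion)) / sum(confusion)   # overall accuracy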

Since you have both numerical and categorical data, you may try SVM.
I am using SVM and KNN on my numerical data, and I also tried to apply a DNN. A DNN is pretty slow to train, especially on big data in R. KNN does not need to be trained, but it only works on numerical data. The following is what I am using; maybe you can have a look at it.
library(e1071)                        # provides svm()
# Train the model
y_train <- data[, 1]                  # first column is the response variable
x_train <- subset(data, select = -1)
train_df <- data.frame(x = x_train, y = y_train)
svm_model <- svm(y ~ ., data = train_df, type = "C")
# Test
y_test <- testdata[, 1]
x_test <- subset(testdata, select = -1)
test_df <- data.frame(x = x_test, y = y_test)   # build the test frame the same way, so column names match
pred <- predict(svm_model, newdata = test_df)
svm_t <- table(pred, y_test)
sum(diag(svm_t)) / sum(svm_t)         # accuracy
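Since the answer mentions KNN on numerical data, here is a hedged sketch with class::knn; the scaling step and k = 5 are my own choices, not from the original code, and it assumes x_train and x_test contain only numeric columns.
library(class)   # provides knn()
# Hedged sketch: scale the training predictors, apply the same centering and
# scaling to the test predictors, then classify by the 5 nearest neighbours.
x_train_num <- scale(x_train)
x_test_num  <- scale(x_test,
                     center = attr(x_train_num, "scaled:center"),
                     scale  = attr(x_train_num, "scaled:scale"))
knn_pred <- knn(train = x_train_num, test = x_test_num, cl = y_train, k = 5)
knn_t <- table(knn_pred, y_test)
sum(diag(knn_t)) / sum(knn_t)   # accuracy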

Related

Negative Binomial model offset seems to be creating a 2 level factor

I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data using a sample size of either 15 or 20, which seems to produce a factor of either 15 or 20. The bad.data pairwise comparison produces way too many comparisons compared to the good.data, even though they should produce the same number?
set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)
# make data that works
data.frame(site = c(rep("A", 24),
                    rep("B", 24),
                    rep("C", 24),
                    rep("D", 24),
                    rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 17, 20), 24*5, T),  # more than 2 sample sizes...
           bad = sample(c(1:7), 24*5, T, c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> good.data
# make data that doesn't work
data.frame(site = c(rep("A", 24),
                    rep("B", 24),
                    rep("C", 24),
                    rep("D", 24),
                    rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 20), 24*5, T),  # only 2 sample sizes...
           bad = sample(c(1:7), 24*5, T, c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> bad.data
# fit models
good.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.good
bad.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.bad
# pairwise comparison
emmeans::emmeans(mod.good, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
emmeans::emmeans(mod.bad, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
First, I think you should look up how to use emmeans. The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.
However, that is not the issue here. What emmeans does first is to set up a reference grid that consists of all combinations of:
the levels of each factor
the average of each numeric predictor, except that if a numeric predictor has just two different values, both of its values are included.
It is that exception you have run up against. Since num_sample has just the 2 values 15 and 20, both values are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call, as sketched below. It has nothing to do with offsets you specify in emmeans-related functions; it has to do with the fact that num_sample is a predictor in your model.
The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.
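A minimal sketch of that cov.keep = 1 suggestion applied to the model from the question (whether this alone resolves the contrast count here is something I have not checked):
# Hedged sketch: cov.keep = 1 makes emmeans average over num_sample instead of
# keeping its two observed values as separate reference-grid levels.
emmeans::emmeans(mod.bad,
                 pairwise ~ trt_time:pre_trt:storage_time,
                 cov.keep = 1)$contrasts |> as.data.frame()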
To be honest I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24, 2); the 'incorrect' number of contrasts is 1128 = choose(48, 2)), but I would say that you should probably follow the guidance in the "offsets" section of one of the emmeans vignettes, where it says:
If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ...
However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...
The most natural choice is to set the offset to 0 (i.e. make predictions etc. for a sample size of 1), but in this case I don't think it matters.
get_contr <- function(x) as_tibble(x$contrasts)
cfun <- function(m) {
  emmeans::emmeans(m,
                   pairwise ~ trt_time:pre_trt:storage_time, offset = 0) |>
    get_contr()
}
nrow(cfun(mod.good))  ## 276
nrow(cfun(mod.bad))   ## 276
From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...

Unsupervised Supervised Clusters with NAs and Qualitative Data in R

I have basketball player data that looks like the following:
Player Weight Height Shots School
A NA 70 23 AB
B 130 62 10 AB
C 180 66 NA BC
D 157 65 22 CD
and I want to do unsupervised and supervised (based on height) clustering. Looking into online resources, I found that I can use kmeans for the unsupervised part, but I don't know how to handle NAs without losing a good amount of data. I also don't know how to handle the qualitative variable "School". Are there any ways to resolve both issues for unsupervised and supervised clustering?
K-means cannot be used for categorical data. One workaround would be to instead use numeric data about the schools, such as number of enrollments or local SES data.
kmeans() in R cannot handle NAs, so you could either omit them (and you should check that the NAs are distributed fairly evenly among the other factors) or look into using cluster::clara() from the cluster package, as sketched below.
You have not asked anything specifically about supervised learning, so I cannot address that part of the question.
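A minimal sketch of the cluster::clara() suggestion, assuming the example rows above sit in a data frame called players; the choice of k = 2 is arbitrary.
library(cluster)
# Hedged sketch: cluster only the numeric columns; clara() documents that NAs
# are allowed as long as every pair of rows shares a non-missing column.
num_cols <- players[, c("Weight", "Height", "Shots")]
cl <- clara(num_cols, k = 2)
cl$clustering   # cluster assignment for each player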
The problem you are facing is known as missing data, and you have to decide how to handle it before starting the clustering. In most cases the samples with missing data (NAs here) are simply omitted; that happens in the data preparation and cleaning step of data mining. In R you can do it using the following code:
na.omit(yourdata)
It omits the records or samples (rows) that contain NAs.
But if you want to include them in the clustering process, you can instead impute a missing value with the average value of that feature.
In your case, consider Weight:
For player A you can set (130 + 180 + 157) / 3 as his weight.
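In code, a minimal sketch of that mean imputation, again assuming the example data frame is called players and has a numeric Weight column:
# Hedged sketch: replace each missing Weight with the mean of the observed
# weights; the same pattern works for any numeric column.
players$Weight[is.na(players$Weight)] <- mean(players$Weight, na.rm = TRUE)
players$Weight   # for player A this is (130 + 180 + 157) / 3 = 155.67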
As for the other question: it seems you are a little bit confused about the meaning of supervised and unsupervised learning. In supervised learning you need to define the class label of the samples; then you build a model (classifier) and train it to learn about each class of samples. After training you can use the model to predict the label of a test sample: you give it a player with the values (W = 100, H = 190, shots = 55) and it will give you the predicted class label.
For unsupervised learning you just need to cluster the data to find groupings of the samples. For this you do not need a class label; you only define the features on which the samples are to be clustered. For example, you can cluster players based only on their weights, or just on their heights, ... or you can use all of the height, weight and shots features for clustering. This is possible in R using the following code:
clus <- kmeans(na.omit(data$weight), 5)   # cluster into 5 clusters based on weight only
clus <- kmeans(na.omit(data[, 1:3]), 5)   # cluster into 5 clusters based on weight, height and shots
Note the use of na.omit here, which drops the NA values (and, for the data frame, the rows that contain them).
Let me know if this helps you.

In R, what is the difference between ICCbare and ICCbareF in the ICC package?

I am not sure if this is the right place to ask a question like this, but I'm not sure where else to ask it.
I am currently doing some research on data and have been asked to find the intraclass correlation of the observations within patients. In the data, some patients have 2 observations, some only have 1 and I have an ID variable to assign each observation to the corresponding patient.
I have come across the ICC package in R, which calculates the intraclass correlation coefficient, but there are 2 commands available: ICCbare and ICCbareF.
I do not understand what the difference between them is, as they give completely different ICC values on the same variables. For example, on the same variable x:
ICCbare(ID,x) gave me a value of -0.01035216
ICCbareF(ID,x) gave me a value of 0.475403
The second one using ICCbareF is almost the same as the estimated correlation I get when using random effects models.
So I am just confused and would like to understand the algorithm behind them so I could explain them in my research. I know one is to be used when the data is balanced and there are no NA values.
In the description it says that the ICC is either calculated by hand or using ANOVA - what does that mean?
From the documentation (https://www.rdocumentation.org/packages/ICC/versions/2.3.0/topics/ICCbare):
ICCbare can be used on balanced or unbalanced datasets with NAs. ICCbareF is similar, however ICCbareF should not be used with unbalanced datasets.
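To make the "by hand or using ANOVA" part concrete, here is a hedged sketch of the classic one-way ANOVA intraclass correlation; this is the textbook formula, not necessarily exactly what ICCbare or ICCbareF computes, and dat, ID and x are placeholder names.
# Hedged sketch: one-way ANOVA ICC, assuming a data frame `dat` with a grouping
# factor ID (the patient) and a numeric measurement x.
fit <- aov(x ~ ID, data = dat)
ms  <- summary(fit)[[1]][["Mean Sq"]]
MSB <- ms[1]                  # between-patient mean square
MSW <- ms[2]                  # within-patient (residual) mean square
k   <- mean(table(dat$ID))    # (average) number of observations per patient
(MSB - MSW) / (MSB + (k - 1) * MSW)   # the ICC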

Error : Sets of levels in train and test don't match (knncat R)

I am trying to do knn classification using knncat in R since I have categorical attributes in my data set.
knncat(FinalData, FinalTestData, k=10, classcol = 15)
When I execute the above statement, it gives me the error: Sets of levels in train and test do not match.
On checking the levels of all the attributes, I did find a difference. I have a country attribute which can take 41 values (1-41) in the train data set.
However, in the test data set one particular country never appears, and this is what is causing the error. How am I supposed to deal with that?
I'm not sure, but you may match the factor levels as below.
train <- factor(c("a", "b", "c"))
test <- factor(c("a", "b"))
# re-declare the test factor with the training levels (matching by name is
# safer than assigning levels(test) <- levels(train), which matches by position)
test <- factor(test, levels = levels(train))
test
[1] a b
Levels: a b c
Perhaps I am wrong, but wouldn't this still be problematic, because the KNN algorithm bases its tuning on Euclidean distance calculations?
Wouldn't you still need to create a binary variable for each level of your categorical features? That would mean you have an issue if certain levels do not appear in both the training and test sets (see the sketch below).
Could someone enlighten me with regard to this?
Also, as a note, this is meant to be more of a spur than a hijack.
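A hedged sketch of that dummy-variable idea, making sure the train and test sets produce the same dummy columns; the data frame and country values here are placeholders, not taken from the question.
# Hedged sketch: one-hot encode a factor so that train and test yield the same
# dummy columns even when some levels are absent from the test set.
train_df <- data.frame(country = factor(c("US", "DE", "FR")))
test_df <- data.frame(country = factor(c("US", "FR")))
test_df$country <- factor(test_df$country, levels = levels(train_df$country))
x_train <- model.matrix(~ country - 1, data = train_df)
x_test <- model.matrix(~ country - 1, data = test_df)   # includes an all-zero column for the unseen level
colnames(x_train)   # same column set as colnames(x_test)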

How to replicate Stata "factor" command in R

I'm trying to replicate some Stata results in R and am having a lot of trouble. Specifically, I want to recover the same eigenvalues as Stata does in exploratory factor analysis. To provide a specific example, the factor help in Stata uses bg2 data (something about physician costs) and gives you the following results:
webuse bg2
factor bg2cost1-bg2cost6
(obs=568)
Factor analysis/correlation Number of obs = 568
Method: principal factors Retained factors = 3
Rotation: (unrotated) Number of params = 15
--------------------------------------------------------------------------
Factor | Eigenvalue Difference Proportion Cumulative
-------------+------------------------------------------------------------
Factor1 | 0.85389 0.31282 1.0310 1.0310
Factor2 | 0.54107 0.51786 0.6533 1.6844
Factor3 | 0.02321 0.17288 0.0280 1.7124
Factor4 | -0.14967 0.03951 -0.1807 1.5317
Factor5 | -0.18918 0.06197 -0.2284 1.3033
Factor6 | -0.25115 . -0.3033 1.0000
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(15) = 269.07 Prob>chi2 = 0.0000
I'm interested in the eigenvalues in the first column of the table. When I use the same data in R, I get the following results:
library(foreign)   # for read.dta()
bg2 = read.dta("bg2.dta")
eigen(cor(bg2))
$values
[1] 1.7110112 1.4036760 1.0600963 0.8609456 0.7164879 0.6642889 0.5834942
As you can see, these values are quite different from Stata's results. It is likely that the two programs are using different means of calculating the eigenvalues, but I've tried a wide variety of methods of extracting them, including most (if not all) of the options in the R functions fa, factanal, and principal, and maybe some other R functions. I simply cannot extract the same eigenvalues as Stata. I've also read through Stata's manual to try to figure out exactly what method Stata uses, but couldn't pin it down with enough specificity.
I'd love any help! Please let me know if you need any additional information to answer the question.
I would advise against carrying out a factor analysis on all the variables in the bg2 data as one of the variables is clinid, which is an arbitrary identifier 1..568 and carries no information, except by accident.
Sensibly or not, you are not using the same data, as you worked on the 6 cost variables in Stata and those PLUS the identifier in R.
Another way to notice that would be to spot that you got 6 eigenvalues in one case and 7 in the other.
Nevertheless the important principle is that eigen(cor(bg2)) is just going to give you the eigenvalues from a principal component analysis based on the correlation matrix. So you can verify that pca in Stata would match what you report from R, as in the sketch below.
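For reference, a small hedged sketch of dropping the identifier first; the column selection assumes the cost variables are named bg2cost1 through bg2cost6, as in the Stata command.
# Hedged sketch: PCA-style eigenvalues restricted to the 6 cost variables,
# leaving out the clinid identifier.
costs <- bg2[, grep("^bg2cost", names(bg2))]
eigen(cor(costs))$values   # 6 eigenvalues, comparable to Stata's pca command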
So far, so clear.
But your larger question remains. I don't know how to mimic Stata's (default) factor analysis in R. You may need a factor analysis expert, if any hang around here.
In short, PCA is not equal to principal axis method factor analysis.
Different methods of calculating eigenvalues are not the issue here. I'd bet that given the same matrix Stata and R match up well in reporting eigenvalues. The point is that different techniques mean different eigenvalues in principle.
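If it helps, here is a hedged sketch of what a principal-factor eigen-decomposition looks like, under the assumption (not verified against Stata) that the "principal factors" method places squared multiple correlations on the diagonal of the correlation matrix before extracting eigenvalues:
# Hedged sketch: eigenvalues of the "reduced" correlation matrix of the 6 cost
# variables, with squared multiple correlations (SMCs) on the diagonal.
R <- cor(bg2[, grep("^bg2cost", names(bg2))])
smc <- 1 - 1 / diag(solve(R))   # squared multiple correlation of each variable
R_reduced <- R
diag(R_reduced) <- smc
eigen(R_reduced)$values         # can include negative values, unlike PCA eigenvalues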
P.S. I am not an R person, but I think what you call R commands are strictly R functions. In turn I am open to correction on that small point.
