What does "positive class" mean in a classification tree in R?

So I'm a newbie in data science, and I have a question about tree models.
Below is the result of my classification modeling, but I'm having trouble interpreting it.
As you can see on the very bottom line in the bottom-left part of the screen, it says 'Positive class : 1'. Our target attribute takes the value 1 or 0. What does 'Positive class : 1' mean in this case?
I very much appreciate your help. Thanks. :)

Positive Class: 1 indicates that the positive class, i.e. the class you are most interested in, is the one labeled 1 in your dataset. It is not an internal numeric code R assigns to factor levels, but the value as it is actually written in the dataset.
For more information, see the documentation of the confusionMatrix() function from caret, which I believe is the one you used. Look for the optional argument named positive.
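To make the role of that argument concrete, here is a small sketch (the prediction and observation vectors are made up; this assumes caret is installed):

```r
library(caret)

pred <- factor(c(1, 1, 0, 0, 1), levels = c(0, 1))
obs  <- factor(c(1, 0, 0, 0, 1), levels = c(0, 1))

# By default confusionMatrix() treats the FIRST factor level ("0" here)
# as the positive class; the `positive` argument overrides that:
cm <- confusionMatrix(pred, obs, positive = "1")
cm$positive                 # "1"
cm$byClass["Sensitivity"]   # computed with "1" as the positive class
```

Metrics such as sensitivity and specificity flip depending on which level is declared positive, which is why the summary prints it explicitly.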

The positive class is the class tied to your objective. For example, suppose you want to classify whether an object is present in a given scene: all samples where the object is present form the positive class. In your problem, you want to identify the cases where the target variable is 1, so 1 is treated as the positive class.

Related

How does matchit subclass look so complicated?

I am trying to match test/control using a string that concatenates different values (for example: 1_4_5). I ran matchit using exact matching for the variable above and Mahalanobis matching for 2 or 3 other variables. The matching results return a few new columns, such as weights and subclass. I understand that subclass tells you which test units are matched with which control units. However, my subclass column looks very complicated compared to the subclass column usually produced (just numbers); mine looks like 1_1_1_1_1_1_1_1_1_1_4, etc. Did I do anything incorrectly for the subclass to be produced this way?
Thanks

FancyRpartPlot - What does the number inside the node mean?

Does anyone know what these numbers mean? I'm confused by the information inside the decision tree below. I tried to trace the numbers back to the variables, but I could not find anything.
Normally, when you plot with fancyRpartPlot, each node shows the probability of each class and the percentage of cases falling in that node; in your case the target attribute is probably numeric, so the box shows the mean of the responses for that split.
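As an illustration of the regression-tree case, a minimal sketch with rpart on the built-in iris data (the rattle call is commented out since it only matters for plotting):

```r
library(rpart)

# With a numeric target, rpart() fits a regression tree, and each node
# of fancyRpartPlot() is labeled with the mean response in that node
# plus the percentage of cases reaching it.
fit <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
# rattle::fancyRpartPlot(fit)  # node label = mean Sepal.Length, % of cases
print(fit)                     # the same node means appear in the text dump
```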

Prediction with cpdist using "probabilities" as evidence

I have a very quick question, with an easy reproducible example, related to my work on prediction with bnlearn.
library(bnlearn)
Learning.set4 <- cbind(c("Yes","Yes","Yes","No","No","No"), c(9,10,8,3,2,1))
Learning.set4 <- as.data.frame(Learning.set4)
Learning.set4[, 2] <- as.numeric(as.character(Learning.set4[, 2]))
colnames(Learning.set4) <- c("Cause","Cons")
b.network <- empty.graph(colnames(Learning.set4))
struct.mat <- matrix(0, 2, 2)
colnames(struct.mat) <- colnames(Learning.set4)
rownames(struct.mat) <- colnames(struct.mat)
struct.mat[1, 2] <- 1
bnlearn::amat(b.network) <- struct.mat
haha <- bn.fit(b.network, Learning.set4)
#Some predictions with "lw" method
#Here is the approach I know, with one SET modality
#(so it happens with certainty; here, for example, I know Cause is "Yes")
classic_prediction <- cpdist(haha, nodes = "Cons", evidence = list(Cause = "Yes"), method = "lw")
print(mean(classic_prediction[, 1]))
#What if I wanted to predict the value of Cons when Cause has a 60% chance of being "Yes" and 40% of being "No"?
#I decided to do this, according to the help.
#(I could also write a function that generates "Yes" or "No" with the proper probabilities.)
prediction_idea <- cpdist(haha, nodes = "Cons", evidence = list(Cause = c("Yes", "Yes", "Yes", "No", "No")), method = "lw")
print(mean(prediction_idea[, 1]))
Here is what the help says:
"In the case of a discrete or ordinal node, two or more values can also be provided. In that case, the value for that node will be sampled with uniform probability from the set of specified values"
When predicting with categorical evidence, so far I have just used a single fixed modality of the variable, as in the first prediction in the example. (Setting the evidence to "Yes" makes Cons take a high value.)
But if I wanted to predict Cons without knowing the exact modality of the variable Cause with certainty, could I use what I did in the second prediction (knowing only the probabilities)?
Is this an elegant way, or are there better implemented ones I don't know of?
I got in touch with the creator of the package, and I will paste his answer related to the question here:
The call to cpdist() is wrong:
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
A query with the 40%-60% soft evidence requires you to place these new probabilities in the network first:
haha$Cause = c(0.40, 0.60)
and then run the query without an evidence argument. (Because you do not have any hard evidence, really, just a different probability distribution for Cause.)
Here is the code that let me do what I wanted with the fitted network from the example:
change <- haha$Cause$prob
change[1] <- 0.4   # first level is "No" (factor levels are alphabetical)
change[2] <- 0.6   # second level is "Yes"
haha$Cause <- change
new_prediction <- cpdist(haha, nodes = "Cons", evidence = TRUE, method = "lw")
print(mean(new_prediction[, 1]))

Atomic vectors in R and applying function to them

So I have a data set from UC Irvine's website (the "Wine Quality" dataset), and I want to look at a plot of the residuals. The reason is to check whether the variance increases, in which case I would run a log-based regression. To look at the residuals I use this code:
residuals(white.wine)
white.wine is the name of my data frame. However, this just returns NULL. And if I try to look at the residuals of a specific predictor variable, like Fixed Acidity, I get this error:
Error: $ operator is invalid for atomic vectors.
Any way around this? Thanks for any help!
@Hugh was right in saying that residuals() must be used on a model, but I think your question was also asking how to apply a function over a data frame. If you just want the variance of each predictor variable, you can use something like:
apply(white.wine, 2, var)
As the ?apply documentation says, you need to provide the data, the margin, and the function. The margin refers to applying over rows or columns, with 1 signaling to apply a function over the rows, and 2 indicating that the function should be applied over columns. I'm assuming you have predictor variables in columns, so I used a 2 in the code above.
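On the residuals() part of the question: residuals() is defined for a fitted model object, not a raw data frame, which is why it returns NULL. A minimal sketch with synthetic data standing in for the wine data (replace x and y with e.g. fixed.acidity and quality):

```r
set.seed(42)
white.wine <- data.frame(x = runif(100, 0, 10))
white.wine$y <- 3 + 0.5 * white.wine$x + rnorm(100, sd = white.wine$x / 4)

residuals(white.wine)          # NULL: a data frame has no residuals
fit <- lm(y ~ x, data = white.wine)
res <- residuals(fit)          # works: one residual per row
plot(fitted(fit), res)         # a fan shape here motivates a log model
```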

Make use of available data and neglect missing data for building classifier

I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.
I also have about 5 additional variables with missing information. These variables take the value 1 or 0: 1 means something is present, while 0 means it is unknown whether it is present or absent. It is widely known that these variables are the most important for classification when they equal 1 (they increase the reliability of classification and make it more likely that the sample lies in the positive class), but they are useless when they equal 0. Only 5% of the rows have the value 1, so each variable is informative for only 5% of the cases. The 5 variables are independent of each other, so I expect them to be highly useful for 15-25% of my data.
Is there a way to make use of the available data but neglect the missing/unknown data in a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forests or R; approaches using other machine learning techniques or other platforms are also most welcome.
Thank you for your time.
Regards
I can see at least the following approaches. Personally, I prefer the third option.
1) Discard the extra columns
You can choose to discard those 5 extra columns. Obviously this is not optimal, but it is good to know the performance of this option, to compare with the following.
2) Use the data as it is
In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying "if I'm not sure whether something is present or absent, I'll treat it as absent". I know this is obvious, but if you haven't tried this, you should, to compare it to option 1.
3) Use separate classifiers
If around 95% of each of those 5 columns is zeroes, and they are roughly independent of each other, then 0.95^5 ≈ 77.4% of the data (roughly 23,200 rows) has zeroes in ALL of those columns. You can train a classifier on those 23,200 rows, dropping the 5 columns that are all zeroes (since they are constant there, they have zero predictive utility anyway). You can then train a separate classifier on the remaining points, which have at least one of those columns set to 1; for these points, you leave the data as it is.
Then, for your test point, if all those columns are zeroes you use the first classifier, otherwise you use the second.
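A rough sketch of this routing idea, using glm() in place of random forests so the example stays self-contained (all data and names here are synthetic):

```r
set.seed(1)
n  <- 500
df <- data.frame(
  x1     = rnorm(n),
  x2     = rnorm(n),
  extra1 = rbinom(n, 1, 0.05),   # the "1 = present, 0 = unknown" columns
  extra2 = rbinom(n, 1, 0.05)
)
df$y <- as.integer(df$x1 + 2 * (df$extra1 | df$extra2) + rnorm(n) > 0)

extras   <- c("extra1", "extra2")
all_zero <- rowSums(df[extras]) == 0

# Classifier A: rows where every extra column is 0 (drop those columns).
fit_a <- glm(y ~ x1 + x2, data = df[all_zero, ], family = binomial)
# Classifier B: rows with at least one 1 (keep the extra columns).
fit_b <- glm(y ~ x1 + x2 + extra1 + extra2,
             data = df[!all_zero, ], family = binomial)

# Route each new row to the classifier matching its pattern of extras.
predict_routed <- function(newdata) {
  zero <- rowSums(newdata[extras]) == 0
  p <- numeric(nrow(newdata))
  p[zero]  <- predict(fit_a, newdata[zero, , drop = FALSE],  type = "response")
  p[!zero] <- predict(fit_b, newdata[!zero, , drop = FALSE], type = "response")
  p
}
```

With 5 extra columns you could in principle split further by pattern, but two classifiers keep each training set large enough to fit reliably.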
Other tips
If the 15 "normal" variables are not binary, make sure you use a classifier which can handle variables with different normalizations. If you're not sure, normalize the 15 "normal" variables to lie in the interval [0,1] -- you probably won't lose anything by doing this.
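A quick sketch of that [0,1] rescaling (min-max normalization; the toy data frame is made up):

```r
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

d <- data.frame(a = c(3, 5, 7), b = c(10, 20, 40))
d[] <- lapply(d, rescale01)    # rescale every column in place
d$a                            # 0.0 0.5 1.0
```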
I'd like to add a further suggestion to Herr Kapput's: if you use a probabilistic approach, you can treat "missing" as a value which you have a certain probability of observing, either globally or within each class (not sure which makes more sense). If it's missing, it has probability of occurring p(missing), and if it's present it has probability p(not missing) * p(val | not missing). This allows you to gracefully handle the case where the values have arbitrary range when they are present.
