Explanation of output for Naive Bayes algorithm in R

I am new to statistics and data analysis in R.
Today I was trying the Naive Bayes algorithm in R.
The problem I am facing is that I am unable to understand the output of the prediction.
The code is as follows:
install.packages('ElemStatLearn')
library('ElemStatLearn')
library("klaR")
library("caret")
sub = sample(nrow(spam), floor(nrow(spam) * 0.9))
train = spam[sub,]
test = spam[-sub,]
xTrain = train[,-58]
yTrain = train$spam
xTest = test[,-58]
yTest = test$spam
model = train(xTrain,yTrain,'nb',trControl=trainControl(method='cv',number=10))
prop.table(table(predict(model$finalModel, xTest)$class, yTest))
The result displayed is as follows:
        yTest
             email       spam
  email 0.33405640 0.02603037
  spam  0.24945770 0.39045553
You can refer to this link for the original example: http://joshwalters.com/2012/11/27/naive-bayes-classification-in-r.html

The result that you have displayed is called a 'confusion matrix'. It is used to verify how well your classifier has performed.
You will need to understand a few terms here: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Compare the standard confusion-matrix layout with your case: the diagonal from top left to bottom right gives you the percentage of correct predictions, and the other two values indicate the percentage of cases where your classifier got "confused".
Hope this gives you an initial idea.
Search for "confusion matrix" and you will find more.
One good link is here: https://classeval.wordpress.com/introduction/basic-evaluation-measures/
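For example, the common summary measures can be read straight off this table. Here is a minimal sketch, reusing the objects from the question (model, xTest, yTest) and treating "spam" as the positive class; caret (already loaded above) also computes these in one call with confusionMatrix():
pred <- predict(model$finalModel, xTest)$class
cm   <- table(Predicted = pred, Actual = yTest)

accuracy  <- sum(diag(cm)) / sum(cm)                  # overall proportion of correct predictions
precision <- cm["spam", "spam"] / sum(cm["spam", ])   # of messages predicted as spam, share truly spam
recall    <- cm["spam", "spam"] / sum(cm[, "spam"])   # of actual spam, share correctly caught
c(accuracy = accuracy, precision = precision, recall = recall)

caret::confusionMatrix(pred, yTest, positive = "spam")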

This is not the naive Bayes model's output.
Once you have used predict, you don't really "care" about the model any more, because you have already obtained the prediction.
prop.table converts the count for each combination into a proportion of the entire population. You might want to look at the table without the proportion part to see the actual numbers.
For example, 33.4% of the test cases were predicted as email and actually are email, while 2.6% were predicted as email but are actually spam.
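As a quick illustration (a sketch reusing the same objects from the question), the counts and the proportions can be printed side by side:
pred <- predict(model$finalModel, xTest)$class
table(pred, yTest)              # actual number of test messages in each cell
prop.table(table(pred, yTest))  # the same table expressed as shares of the whole test set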

Related

ezANOVA not providing Greenhouse-Geisser corrected df though sphericity is violated

I've noticed that sometimes when I use ezANOVA from the ez package I get columns stating what the Greenhouse-Geisser corrected df values are, and other times the tables with the sphericity corrections do not include the new df values, even though there are violations. For example, I just ran a two-way repeated measures ANOVA, and my table output looks like this:
I wish I could give reproducible data, but I genuinely don't know why it does or doesn't do it sometimes. Does anyone else know? I'll show my code below in case there's something I'm missing regarding the actual ezANOVA function. I could work out the df values by hand, but I haven't found a good resource online showing how to correct them using the epsilon value, and I unfortunately was never taught that.
ez::ezANOVA(data = IntA2, wid = rat, dv = numReinforcers, within = .(component, minute))
Edit: A very nice person on the internet has explained to me how to calculate the new df values by hand (multiplying the GG epsilon by the old dfs, in case anyone else was wondering!), but I'm still unclear on why the function sometimes does it for you and other times does not.
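For the record, here is a hedged sketch of that by-hand correction; it assumes ezANOVA's usual output components ($ANOVA and $`Sphericity Corrections`, with columns Effect, DFn, DFd and GGe) and reuses the call from the question:
res <- ez::ezANOVA(data = IntA2, wid = rat, dv = numReinforcers,
                   within = .(component, minute))
aov_tab <- res$ANOVA
gg_tab  <- res$`Sphericity Corrections`

# Multiply the uncorrected df by the Greenhouse-Geisser epsilon for each effect
idx <- match(gg_tab$Effect, aov_tab$Effect)
gg_tab$DFn_GG <- aov_tab$DFn[idx] * gg_tab$GGe
gg_tab$DFd_GG <- aov_tab$DFd[idx] * gg_tab$GGe
gg_tab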

Estimation to plot person-item map not feasible because items "have no 0-responses" in data matrix

I am trying to create a person-item map that organizes the questions from a dataset in order of difficulty. I am using the eRm package and the output should look like this:
[person-item map](https://hansjoerg.me/post/2018-04-23-rasch-in-r-tutorial_files/figure-html/unnamed-chunk-3-1.png)
In one of the previous steps, before running the function that outputs the map, I have to fit the dataset to obtain the object that the plotting function uses to create the actual map, but I am getting an error when creating that object.
I have already tried to follow and review some documentation; these links might be useful if you want some extra information:
[Tutorial] https://hansjoerg.me/2018/04/23/rasch-in-r-tutorial/#plots
[Plotting function] https://rdrr.io/rforge/eRm/man/plotPImap.html
[Documentation] https://eeecon.uibk.ac.at/psychoco/2010/slides/Hatzinger.pdf
Now, this is the code that I am using. First, I install and load the respective libraries and the data:
> library(eRm)
> library(ltm)
Loading required package: MASS
Loading required package: msm
Loading required package: polycor
> library(difR)
Then I fit the PCM and generate the object of class Rm, and here is the error:
*The PCM function here is specific to polytomous data; if I use a different one, the output says that I am not using a dichotomous dataset.
> res <- PCM(my.data)
Warning:
The following items have no 0-responses:
AUT_10_04 AUN_07_01 AUN_07_02 AUN_09_01 AUN_10_01 AUT_11_01 AUT_17_01
AUT_20_03 CRE_05_02 CRE_07_04 CRE_10_01 CRE_16_02 EFEC_03_07 EFEC_05
EFEC_09_02 EFEC_16_03 EVA_02_01 EVA_07_01 EVA_12_02 EVA_15_06 FLX_04_01
... [rest of items]
Responses are shifted such that lowest category is 0.
Warning:
The following items do not have responses on
each category:
EFEC_03_07 LC_07_03 LC_11_05
Estimation may not be feasible. Please check
data matrix
I must clarify that the entire dataset has a range from 1 to 5; it is a polytomous Likert dataset.
Finally, I try to use the plot function and it does not produce any output; the system just keeps loading ad infinitum with no answer.
>plotPImap(res, sorted=TRUE)
I would like to add the description of that particular function and the arguments:
>PCM(X, W, se = TRUE, sum0 = TRUE, etaStart)
#X
Input data matrix or data frame with item responses (starting from 0);
rows represent individuals, columns represent items. Missing values are
inserted as NA.
#W
Design matrix for the PCM. If omitted, the function will compute W
automatically.
#se
If TRUE, the standard errors are computed.
#sum0
If TRUE, the parameters are normed to sum-0 by specifying an appropriate
W.
If FALSE, the first parameter is restricted to 0.
#etaStart
A vector of starting values for the eta parameters can be specified. If
missing, the 0-vector is used.
I do not understand why it is necessary to have scores beginning from 0; I think that is what the error is trying to say, but I don't quite understand that output.
I highly appreciate any hint that you can provide.
Feel free to ask for any information that could be useful to reach a solution to this issue.
The problem is not caused by the fact that some items have no 0-responses. The model automatically corrects for this by shifting the response categories so that the lowest is zero. (You'll notice that the PI map you linked to is centered on zero. Also, I believe the map you linked to is of dichotomous data; polytomous data should include the scale categories on the PI map, I believe.)
Without being able to see your data, it is impossible to know the exact cause though.
It may be that the model is not converging. That may be what this warning was alluding to: "Estimation may not be feasible. Please check data matrix." You could check by entering > res at the prompt. If the model was able to converge, you should see something like:
Conditional log-likelihood: -2.23709
Number of iterations: 27
Number of parameters: 8
...
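The second warning (items that do not have responses on each category) can also be checked directly on the data matrix. A minimal sketch, assuming my.data is the 1-to-5 Likert data frame from the question:
# Count how many responses each item has in every category (1 to 5);
# zeros flag the items named in the warning.
category_counts <- sapply(my.data, function(x) table(factor(x, levels = 1:5)))
category_counts[, colSums(category_counts == 0) > 0]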
Does your data contain answers with decimal numbers? I ran into the same error and solved it with the dplyr::dense_rank() function:
df_ranked <- sapply(df_decimal_data, dplyr::dense_rank)
This worked for me.

GLMM's for meta-analysis - error using metabin

I'm trying to run a generalised linear mixed effects (binomial-normal) meta-analysis for 7 randomised studies, where each study records the presence of an adverse event within the treatment and placebo populations (exposure and control).
To do this, I'm hoping to use the metabin function (meta package). However, I'm getting an error and I'm not sure why. E.g. running this code:
install.packages('meta')
library(meta)
# Data
data <- data.frame(exposure.events=c(11,34,152,4,60,3,25), exposure.population=c(184,152,9500,77,2012,15,60), control.events=c(3,33,4729,133,1441,1,25), control.population=c(184,375,613978,15865,480485,105,238), Study=c("1","2","3","4","5","6","7"))
# Calling metabin
metabin(event.e=exposure.events, n.e=exposure.population, event.c=control.events, n.c=control.population, studlab=Study, data=data, method="GLMM", model.glmm="CM.AL", method.tau="ML")
I get this output:
Error in metafor::rma.glmm(ai = event.e[!exclude], n1i = n.e[!exclude], :
Cannot fit ML model.
I've also tried calling the rma.glmm function directly (instead of doing this via metabin), but I get the same error message. I've also tried reading the source code for rma.glmm, but I'm not sure I understand what's going on. I think the issue is related to the third study (the largest), and in particular the size of the control population, as both of the following run smoothly:
# Modifying 3rd row's control population
data<-data.frame(exposure.events=c(11,34,152,4,60,3,25), exposure.population=c(184,152,9500,77,2012,15,60), control.events=c(3,33,4729,133,1441,1,25), control.population=c(184,375,61378,15865,480485,105,238), Study=c("1","2","3","4","5","6","7"))
metabin(event.e=exposure.events, n.e=exposure.population, event.c=control.events, n.c=control.population, studlab=Study, data=data, method="GLMM",model.glmm = "CM.AL",method.tau = "ML")
# Deleting 3rd row
data<-data.frame(exposure.events=c(11,34,4,60,3,25), exposure.population=c(184,152,77,2012,15,60), control.events=c(3,33,133,1441,1,25), control.population=c(184,375,15865,480485,105,238), Study=c("1","2","3","4","5","6"))
metabin(event.e=exposure.events, n.e=exposure.population, event.c=control.events, n.c=control.population, studlab=Study, data=data, method="GLMM",model.glmm = "CM.AL",method.tau = "ML")
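For reference, the direct metafor call mentioned above looks roughly like this sketch; the argument names follow the error message, measure = "OR" is my assumption about the intended effect measure, and data here means the original seven-study data frame from the first chunk:
# Sketch: argument names taken from the error message above; measure = "OR" is an assumption.
library(metafor)
rma.glmm(ai = data$exposure.events, n1i = data$exposure.population,
         ci = data$control.events,  n2i = data$control.population,
         measure = "OR", model = "CM.AL", method = "ML",
         slab = data$Study)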
Is this a convergence problem, and does anyone know if there is any way around this? The only other thing I can find about this error message is for a problem (and thus solution) which does not apply to me.
Any help would be really appreciated :)

Prediction with cpdist using "probabilities" as evidence

I have a very quick question, with an easy reproducible example, that is related to my work on prediction with bnlearn.
library(bnlearn)
Learning.set4=cbind(c("Yes","Yes","Yes","No","No","No"),c(9,10,8,3,2,1))
Learning.set4=as.data.frame(Learning.set4)
Learning.set4[,c(2)]=as.numeric(as.character(Learning.set4[,c(2)]))
colnames(Learning.set4)=c("Cause","Cons")
b.network=empty.graph(colnames(Learning.set4))
struct.mat=matrix(0,2,2)
colnames(struct.mat)=colnames(Learning.set4)
rownames(struct.mat)=colnames(struct.mat)
struct.mat[1,2]=1
bnlearn::amat(b.network)=struct.mat
haha=bn.fit(b.network,Learning.set4)
#Some predictions with "lw" method
#Here is the approach I know with a SET particular modality.
#(So it's happening with certainty, here for example I know Cause is "Yes")
classic_prediction=cpdist(haha,nodes="Cons",evidence=list("Cause"="Yes"),method="lw")
print(mean(classic_prediction[,c(1)]))
#What if I wanted to predict the value of Cons, when Cause has a 60% chance of being Yes and 40% of being no?
#I decided to do this, according the help
#I could also make a function that generates "Yes" or "No" with proper probabilities.
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
Here is what the help says:
"In the case of a discrete or ordinal node, two or more values can also be provided. In that case, the value for that node will be sampled with uniform probability from the set of specified values"
When I predict the value of a variable using categorical variables, I have so far just used one particular modality of said variable, as in the first prediction in the example. (Having the evidence set at "Yes" gets Cons to take a high value.)
But if I wanted to predict Cons without knowing the exact modality of the variable Cause with certainty, could I use what I did in the second prediction (just knowing the probabilities)?
Is this an elegant way, or are there better implemented ones I don't know of?
I got in touch with the creator of the package, and I will paste his answer related to the question here:
The call to cpdist() is wrong:
Prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
A query with the 40%-60% soft evidence requires you to place these new probabilities in the network first
haha$Cause = c(0.40, 0.60)
and then run the query without an evidence argument. (Because you do not have any hard evidence, really, just a different probability distribution for Cause.)
Here is the code that lets me do what I wanted with the fitted network from the example.
# Take the current probability table of Cause and overwrite it with the desired distribution
change = haha$Cause$prob
change[1] = 0.4   # first level of Cause ("No")
change[2] = 0.6   # second level of Cause ("Yes")
haha$Cause = change
# Query without hard evidence: Cause is now sampled from the new 40/60 distribution
new_prediction = cpdist(haha, nodes = "Cons", evidence = TRUE, method = "lw")
print(mean(new_prediction[, c(1)]))
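As a quick sanity check (a sketch reusing the objects above), you can confirm the new marginal distribution of Cause and verify that the soft-evidence mean lands at roughly 0.6 * mean(Cons | Yes) + 0.4 * mean(Cons | No):
haha$Cause$prob   # should now show No = 0.4, Yes = 0.6
yes_pred <- cpdist(haha, nodes = "Cons", evidence = list(Cause = "Yes"), method = "lw")
no_pred  <- cpdist(haha, nodes = "Cons", evidence = list(Cause = "No"),  method = "lw")
0.6 * mean(yes_pred[, 1]) + 0.4 * mean(no_pred[, 1])   # should be close to mean(new_prediction[, 1])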

Piecewise regression : Reproducibility problems on breakpoints detection in segmented R package

I'm trying to fit a three-piece regression to my data with the help of the segmented package, and I'm a bit lost...
First, here is a reproducible example:
library(segmented)
y = c(520.0000, 620.0000, 653.3333, 853.3333, 1220.0000, 1553.3333, 1586.6667, 1586.6667, 1586.6667, 1586.6667, 1586.6667)
x = c(33320, 41020, 49020, 56920, 69220, 76320, 86320, 95420, 103720, 111520, 120320)
plot(y ~ x)
out = lm(y ~ x)
My data, with 2 visible breakpoints:
- First I tried specifying the known number of breakpoints with K=2:
mdl2=segmented(out, seg.Z =~x, psi=NA, control=seg.control(K=2,n.boot=0,it.max=500,stop.if.error=FALSE,display=T))
plot(mdl2)
points(y~x)
Which gives me a one-breakpoint result:
- But if I set 2 < K < 8 (so a wrong value...), I'm able to detect the right number of breakpoints:
- And a last point which puzzles me:
If I set K=4, the display=T option shows me a result with 3 breakpoints, but in the function output I still have two breakpoints...
EDIT (09/19/2016):
I also tried specifying psi directly, as I have some priors on the breakpoint locations (but that's not my goal), and the results are still really bad with segmented...
For some regressions I have to run the function many times before the algorithm succeeds in finding a solution. Also, the proposed solutions often have reproducibility problems...
Does anyone know a way to robustly estimate these breakpoints? It doesn't look like my data are that hard to fit, are they?
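For what it's worth, here is a minimal sketch of the psi-based call mentioned in the edit, supplying explicit starting guesses for both breakpoints instead of K (the 55000 and 80000 values are rough guesses read off the plotted data, not values from the question):
library(segmented)
out  <- lm(y ~ x)
# psi gives starting values for the two breakpoints (guessed here), so K is not needed
mdl2 <- segmented(out, seg.Z = ~x, psi = list(x = c(55000, 80000)))
plot(mdl2)
points(y ~ x)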
