Probability and classification in the svm function of the e1071 package in R

I'm using the SVM implementation in the e1071 package for binary classification.
I'm comparing the probability attribute against the class returned by the predict function. What puzzles me is that the predicted classification (0 or 1) doesn't seem consistent with the probabilities listed in the attribute: for some very high probabilities for level 1 the SVM classification is level 0, and for some low probabilities for level 1 the SVM classification is level 1.
Here is some sample code and the results:
svm_model <- svm(as.factor(CHURNED) ~ .,
                 scale = FALSE,
                 data = train,
                 cost = 1,
                 gamma = 0.1,
                 kernel = "radial",
                 probability = TRUE)
test$Pred_Class <- predict(svm_model, test, probability = TRUE)
test$Pred_Prob <- attr(test$Pred_Class, "probabilities")[,1]
Here are the results (rows have been reordered to show various examples):
CHURNED: the response variable being predicted
Pred_Class: the class predicted by the SVM
Pred_Prob: the predicted probability, on which the SVM presumably bases its classification
CHURNED Pred_Class Pred_Prob
1 0 0.03968526 # --> makes sense
1 0 0.03968526
1 0 0.07033222
1 0 0.11711195
1 0 0.12477983
1 0 0.12827296
1 0 0.12829345
1 0 0.12829345
1 0 0.12829345
1 0 0.12829444
1 0 0.12829927
1 0 0.12829927
1 0 0.12831169
1 0 0.12831169
1 0 0.12831428
1 1 0.13053475 # --> doesn't make sense. Prob is less than 0.5
1 1 0.13053475
1 1 0.13053475
1 1 0.1305348
1 1 0.1305348
1 1 0.1305348
1 1 0.1690807
1 1 0.2206993
1 1 0.2321171
0 0 0.998289 # --> doesn't make sense. Prob is almost 1!
0 0 0.9982887
0 0 0.993133
0 0 0.9898889
1 0 0.9849951
0 0 0.9849951
1 0 0.546427
0 0 0.5440994 # --> doesn't make sense. Prob is more than 0.5
0 0 0.5437889
1 0 0.5417848
0 0 0.5284112
0 0 0.5252177
0 1 0.5180776 # --> makes sense but is not consistent with above example
0 1 0.5180704
1 1 0.5180436
1 1 0.5180436
0 1 0.518043
This result doesn't make sense to me at all. The predicted class and predicted probabilities don't match. I've checked to make sure that I'm referencing the right column from the "probabilities" attribute matrix:
test$Pred_Class
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[98] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
attr(,"probabilities")
1 0
6442 0.2369796 0.7630204
6443 0.2520246 0.7479754
6513 0.2322581 0.7677419
6801 0.2309437 0.7690563
6802 0.2244768 0.7755232
6954 0.2322450 0.7677550
6968 0.2537544 0.7462456
6989 0.2352477 0.7647523
7072 0.2322308 0.7677692
...
...
...
Maybe I am interpreting the probability incorrectly?
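Two things may be worth checking here (a sketch, not a definitive diagnosis). First, the columns of the "probabilities" matrix are named after the factor levels of the response, so position [,1] is not necessarily class 0; it is safer to select the column by name. Second, as far as I can tell, the hard class labels and the probabilities in e1071 come from different mechanisms (the labels from the SVM decision values, the probabilities from a Platt-scaling model fitted via cross-validation), so the two can disagree, especially near the boundary. Something along these lines makes the mapping explicit (Pred_Prob_1 and argmax_class are hypothetical names used for illustration):
# Inspect which factor level each probability column refers to;
# the columns are named after the response's factor levels.
prob_matrix <- attr(test$Pred_Class, "probabilities")
colnames(prob_matrix)

# Select the probability of level "1" by name rather than by position:
test$Pred_Prob_1 <- prob_matrix[, "1"]

# Compare the hard predictions with the argmax of the probabilities:
argmax_class <- colnames(prob_matrix)[max.col(prob_matrix, ties.method = "first")]
table(Predicted = test$Pred_Class, ArgmaxProb = argmax_class)
If the cross-tabulation shows off-diagonal entries, the disagreement comes from the two prediction mechanisms rather than from reading the wrong column.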

Related

Genetic Algorithm in R: Specify number of 1s in binary chromosomes

I am using the rbga.bin function from the genalg package, but my question also applies to other genetic algorithm implementations in R. Is there a way to specify the number of 1s in binary chromosomes?
I have the following example, taken from the library's documentation.
data(iris)
library(MASS)
library(genalg)  # provides rbga.bin

X <- as.data.frame(cbind(scale(iris[, 1:4]), matrix(rnorm(36 * 150), 150, 36)))
Y <- iris[, 5]

iris.evaluate <- function(indices) {
  print("Chromosome")
  print(indices)
  print("================================")
  result <- 1
  if (sum(indices) > 2) {
    huhn <- lda(X[, indices == 1], Y, CV = TRUE)$posterior
    result <- sum(Y != dimnames(huhn)[[2]][apply(huhn, 1,
                    function(x) which(x == max(x)))]) / length(Y)
  }
  result
}

# (defined here but not passed to rbga.bin in this example)
monitor <- function(obj) {
  minEval <- min(obj$evaluations)
  plot(obj, type = "hist")
}

woppa <- rbga.bin(size = 40, mutationChance = 0.05, zeroToOneRatio = 10,
                  evalFunc = iris.evaluate, showSettings = TRUE, verbose = TRUE)
Here are some of the chromosomes.
"Chromosome"
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
"================================"
The 1s (i.e., the chosen characteristics) are 5, 8, 5 and 4 respectively.
I am trying to follow the technique described in a paper, in which the authors apply a genetic algorithm and, in the end, select a specific number of characteristics.
Is it possible to specify in a genetic algorithm the number of characteristics that I want my solution(s)/chromosome(s) to have?
Could this be done on the final solution/chromosome and if yes how?
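As far as I know, rbga.bin itself exposes no argument that constrains the number of 1s. A common workaround is a repair operator that forces every chromosome to exactly k ones before it is scored. A minimal sketch under that assumption (repair_chromosome and iris.evaluate.fixed are hypothetical helper names, reusing the evaluation function from above):
# Hypothetical repair operator: force a binary chromosome to exactly k ones
repair_chromosome <- function(chrom, k) {
  ones  <- which(chrom == 1)
  zeros <- which(chrom == 0)
  if (length(ones) > k) {
    # too many 1s: randomly switch off the surplus
    chrom[ones[sample.int(length(ones), length(ones) - k)]] <- 0
  } else if (length(ones) < k) {
    # too few 1s: randomly switch on some 0s
    chrom[zeros[sample.int(length(zeros), k - length(ones))]] <- 1
  }
  chrom
}

# Wrap the evaluation so every chromosome is scored as a k-feature solution:
iris.evaluate.fixed <- function(indices) {
  iris.evaluate(repair_chromosome(indices, k = 5))
}

woppa <- rbga.bin(size = 40, mutationChance = 0.05, zeroToOneRatio = 10,
                  evalFunc = iris.evaluate.fixed, showSettings = TRUE,
                  verbose = TRUE)
Note that the population itself still holds the unrepaired chromosomes; the same repair would have to be applied to the final best chromosome before reading off the selected features.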

plm regression with fixed effects on two variables

I have the following simplified df:
problem <- data.frame(
stringsAsFactors = FALSE,
fkeycompany = c("0000001961",
"0000003570","0000003570","0000003570",
"0000003570","0000003570","0000003570",
"0000003570","0000004187","0000004187","0000004187",
"0000004187","0000016058","0000022872",
"0000022872","0000022872","0000022872","0000024071",
"0000050471","0000052971","0000052971",
"0000056679","0000058592","0000058592","0000058592",
"0000063330","0000099047","0000099047",
"0000099047","0000316206","0000316537",
"0000319697","0000351917","0000351917","0000351917",
"0000356037","0000356037","0000356037",
"0000700815","0000700815","0000700815","0000700815",
"0000704415","0000704415","0000704415",
"0000705003","0000720154","0000720154","0000720154",
"0000720154"),
fiscalyear = c(2018,2002,
2002,2004,2006,2007,2007,2014,2005,2005,
2009,2017,2003,2002,2004,2004,2010,2002,
2016,2008,2008,2002,2005,2005,2010,2014,
2000,2005,2005,2002,2002,2001,2005,2005,
2006,2007,2012,2015,2006,2006,2007,2008,
2003,2014,2014,2000,2004,2006,2008,2013),
zmijewskiscore = c(-0.295998372490631,-3.0604522838509,-3.0604522838509,
-9.70437199970406,-0.836774487816746,
0.500903351523752,0.500903351523752,-1.29210741224579,
-1.96529713996165,-1.96529713996165,
-1.60831783946871,-2.12343231229296,-3.99767761748961,
0.561261861396196,4.13793269655047,4.13793269655047,
5.61803398400963,-0.000195582736436772,
-3.93766039340527,-0.540037039625719,
-0.540037039625719,-1.93767533120689,-4.54446419505987,
-4.54446419505987,1.94389244672183,
0.941272649148121,-3.88427264672157,-0.342812414189714,
-0.342812414189714,-1.35074505582686,
-4.52746658422071,-0.130671284507204,-0.223517713694019,
-0.223517713694019,0.0149617517859735,
-2.95100357094774,-2.55146691134187,-1.86846592111008,
2.92283100206773,2.92283100206773,
4.65325023636937,6.1585365469118,-4.54449586848866,
-1.49969162335521,-1.49969162335521,-3.34071706450412,
-1.72382101559976,-1.53076052307727,
-1.77582320023177,-1.57280701642882),
lloss = c(0,1,1,1,1,
1,1,1,0,0,0,1,0,0,1,1,1,1,0,1,1,
1,0,0,1,0,0,1,1,0,0,1,1,1,1,0,0,
1,1,1,1,1,0,1,1,0,1,1,1,0),
GCO_prev = c(1,1,1,0,0,
0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,0),
GCO = c(0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,
0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,
0,0,0,1,1,0,0,0,0,0,0,0,0),
industry = c(9,5,5,5,5,
5,5,5,6,6,6,6,9,9,9,9,9,6,9,6,6,
9,8,8,8,8,9,9,9,9,8,9,5,5,5,9,9,
9,6,6,6,6,9,9,9,9,9,9,9,9))
I would like to run a plm regression on this with fixed effects on year and industry.
library(plm)
summary(plm(GCO ~ GCO_prev + lloss + zmijewskiscore, index=c("fiscalyear", "industry"), data=problem, model="within" ))
However, I get this error while running:
Error in pdim.default(index[[1L]], index[[2L]]) :
duplicate couples (id-time)
In addition: Warning message:
In pdata.frame(data, index) :
duplicate couples (id-time) in resulting pdata.frame
to find out which, use, e.g., table(index(your_pdataframe), useNA = "ifany")
I do not quite know how to fix this. If I understand correctly, it has something to do with more than one company (fkeycompany) sharing the same industry and fiscal year: for industry = 9 and fiscalyear = 2003, for example, there are two companies (0000016058 and 0000704415). At least, that is what I think the issue is; am I wrong? I believe the same thing happens with more industries and years in my main dataset. How do I fix this error message?
Also, besides this issue, is the code I am running correct? Am I indeed regressing with year and industry fixed effects?
Given your data, the observational unit for panel data is firms (fkeycompany). You might want to add the industry as another fixed effect, but it certainly is not the time index (the time index goes into the second slot of the index argument, and I will assume it is fiscalyear). There are plenty of questions with answers on this topic. Also, do read the package's first vignette, where the data specification for the index argument is explained.
I suggest converting to a pdata.frame first.
However, there are duplicate combinations of fkeycompany and fiscalyear; see the code below, where any value > 1 in the output of table() points you to the offending combinations.
library(plm)
pdat.problem <- pdata.frame(problem, index = c("fkeycompany", "fiscalyear"))
#> Warning in pdata.frame(problem, index = c("fkeycompany", "fiscalyear")): duplicate couples (id-time) in resulting pdata.frame
#> to find out which, use, e.g., table(index(your_pdataframe), useNA = "ifany")
table(index(pdat.problem), useNA = "ifany")
#> fiscalyear
#> fkeycompany 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2012 2013
#> 0000001961 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000003570 0 0 2 0 1 0 1 2 0 0 0 0 0
#> 0000004187 0 0 0 0 0 2 0 0 0 1 0 0 0
#> 0000016058 0 0 0 1 0 0 0 0 0 0 0 0 0
#> 0000022872 0 0 1 0 2 0 0 0 0 0 1 0 0
#> 0000024071 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000050471 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000052971 0 0 0 0 0 0 0 0 2 0 0 0 0
#> 0000056679 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000058592 0 0 0 0 0 2 0 0 0 0 1 0 0
#> 0000063330 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000099047 1 0 0 0 0 2 0 0 0 0 0 0 0
#> 0000316206 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000316537 0 0 1 0 0 0 0 0 0 0 0 0 0
#> 0000319697 0 1 0 0 0 0 0 0 0 0 0 0 0
#> 0000351917 0 0 0 0 0 2 1 0 0 0 0 0 0
#> 0000356037 0 0 0 0 0 0 0 1 0 0 0 1 0
#> 0000700815 0 0 0 0 0 0 2 1 1 0 0 0 0
#> 0000704415 0 0 0 1 0 0 0 0 0 0 0 0 0
#> 0000705003 1 0 0 0 0 0 0 0 0 0 0 0 0
#> 0000720154 0 0 0 0 1 0 1 0 1 0 0 0 1
#> fiscalyear
#> fkeycompany 2014 2015 2016 2017 2018
#> 0000001961 0 0 0 0 1
#> 0000003570 1 0 0 0 0
#> 0000004187 0 0 0 1 0
#> 0000016058 0 0 0 0 0
#> 0000022872 0 0 0 0 0
#> 0000024071 0 0 0 0 0
#> 0000050471 0 0 1 0 0
#> 0000052971 0 0 0 0 0
#> 0000056679 0 0 0 0 0
#> 0000058592 0 0 0 0 0
#> 0000063330 1 0 0 0 0
#> 0000099047 0 0 0 0 0
#> 0000316206 0 0 0 0 0
#> 0000316537 0 0 0 0 0
#> 0000319697 0 0 0 0 0
#> 0000351917 0 0 0 0 0
#> 0000356037 0 1 0 0 0
#> 0000700815 0 0 0 0 0
#> 0000704415 2 0 0 0 0
#> 0000705003 0 0 0 0 0
#> 0000720154 0 0 0 0 0
Once fixed, you will be able to run a model along these lines.
For a time-fixed-effects model:
model <- plm(GCO ~ GCO_prev + lloss + zmijewskiscore, data = pdat.problem, model="within", effect = "time")
Or time-fixed-effects with industry as an additional fixed effect:
model2 <- plm(GCO ~ GCO_prev + lloss + zmijewskiscore + factor(industry), data = pdat.problem, model="within", effect = "time")
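If the duplicate rows in the real data are genuinely redundant, one way to get past the error is to deduplicate on the id-time pair first. This is only a sketch; whether dropping, aggregating, or keeping duplicates under a different index is appropriate depends on your data (problem_dedup is a hypothetical name):
# Keep only the first row for each (fkeycompany, fiscalyear) combination
problem_dedup <- problem[!duplicated(problem[, c("fkeycompany", "fiscalyear")]), ]

pdat.problem <- pdata.frame(problem_dedup, index = c("fkeycompany", "fiscalyear"))
model <- plm(GCO ~ GCO_prev + lloss + zmijewskiscore,
             data = pdat.problem, model = "within", effect = "time")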

R design.matrix issue -- dropped column in design matrix?

I'm having an odd problem while trying to set up a design matrix to do downstream pairwise differential expression analysis on RNAseq data.
For the design matrix, I have both the donor information and each condition:
group  <- factor(y$samples$group)    # 44 samples, 6 different conditions
sample <- factor(y$samples$samples)  # 44 samples, 11 different donors
design <- model.matrix(~ 0 + sample + group)
head(design)
Donor11.CD8 Donor12.CD8 Donor14.CD8 Donor15.CD8 Donor16.CD8
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Donor17.CD8 Donor18.CD8 Donor19.CD8 Donor20.CD8 Donor3.CD8
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
Donor4.CD8 Treatment2 Treatment3 Treatment4 Treatment5
1 0 0 0 0 0
2 0 0 0 0 1
3 0 0 0 1 0
4 0 0 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
Treatment6
1 1
2 0
3 0
4 0
5 0
6 0
The issue is that I seem to be losing a condition (treatment 1) when I form the design matrix, and I'm not sure why.
Many thanks, in advance, for your help!
That's not a problem. Treatment 1 is indicated by all zeros in the treatment columns of the design matrix. Look at row 4: zero for Treatments 2 through 6, which means it is Treatment 1. This is called a "treatment contrast" because the model's coefficients contrast each named treatment against the "base" level; in this case the base level is Treatment 1.
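A self-contained toy illustration (a made-up three-level factor, not your data): model.matrix() creates no column for the first factor level, so rows that are all zeros in the treatment columns belong to that base level.
treat <- factor(c("T1", "T2", "T3", "T1"))
model.matrix(~ treat)
#   (Intercept) treatT2 treatT3
# 1           1       0       0   # T1: all-zero treatment columns
# 2           1       1       0
# 3           1       0       1
# 4           1       0       0
# (contrast attributes omitted)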

Kappam.light from irr package in R: warning in sqrt(varkappa), NaNs produced, kappa = NaN, z = NaN and p-value = NaN

I'm trying to calculate the inter-observer reliability in R for a scoring system using Light's kappa provided by the irr package. It's a fully crossed design in which fifteen observers scored 20 subjects for something being present ("1") or something not being present ("0").
This is my data frame (imported from an excel sheet):
library(irr)
my.df #my dataframe
a b c d e f g h i j k l m n o
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
4 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0
5 0 1 0 0 1 1 0 0 0 1 1 0 0 1 0
6 0 1 0 0 1 1 0 0 0 0 0 1 1 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
11 0 1 1 1 0 1 0 0 0 1 0 0 0 0 1
12 0 1 0 0 0 1 0 1 0 1 0 0 1 0 0
13 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
14 0 1 0 1 0 1 1 0 0 1 1 1 1 1 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
17 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
Next, I try to calculate the kappa value and get the following output:
kappam.light(my.df) #calculating the kappa-value
Light's Kappa for m Raters
Subjects = 20
Raters = 15
Kappa = NaN
z = NaN
p-value = NaN
Warning messages:
1: In sqrt(varkappa) : NaNs produced
2: In sqrt(varkappa) : NaNs produced
3: In sqrt(varkappa) : NaNs produced
4: In sqrt(varkappa) : NaNs produced
5: In sqrt(varkappa) : NaNs produced
6: In sqrt(varkappa) : NaNs produced
7: In sqrt(varkappa) : NaNs produced
8: In sqrt(varkappa) : NaNs produced
9: In sqrt(varkappa) : NaNs produced
10: In sqrt(varkappa) : NaNs produced
I already tried changing the class of all the variables to factors, characters, numeric, boolean. Nothing works. I suspect it has something to do with the relatively low numbers of "1" scores. Any suggestions?
EDIT: I found a solution to the problem without having to exclude data.
To calculate a prevalence- and bias-adjusted kappa, PABAK can be used for two-rater problems. For multi-rater problems like this one, you should use Randolph's kappa. It is based on Fleiss' kappa and therefore does not take variance into consideration, which is ideal for the problem I had.
An online calculator can be found here
In R, the raters package can be used. I've compared the outcomes of the two methods, and the results are virtually the same (a difference in the sixth decimal).
You are getting this error because you have no variability in the columns a and i.
First, check the variability across the columns:
apply(df,2,sd)
a b c d e f g h i j k l m n o
0.0000000 0.5104178 0.3663475 0.4103913 0.3663475 0.4893605 0.3077935 0.2236068 0.0000000 0.4701623 0.3663475 0.4103913 0.4103913 0.4103913 0.2236068
You see that columns a and i have no variability. Variability is needed because kappa measures inter-rater reliability while correcting for chance agreement; when a rater gives the same score to every subject, that correction cannot be computed.
Therefore, you get output without errors if you remove these 2 columns.
df$a=NULL
df$i=NULL
kappam.light(df)
Light's Kappa for m Raters
Subjects = 20
Raters = 13
Kappa = 0.19
z = 0
p-value = 1
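As a small generalization of the fix above (a sketch reusing my.df from the question), the zero-variance columns can be dropped programmatically instead of by name:
# Keep only the raters whose scores actually vary across subjects
varying <- apply(my.df, 2, sd) > 0
kappam.light(my.df[, varying])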

How can I calculate an empirical CDF in R?

I'm reading a sparse table from a file which looks like:
1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1
Note row lengths are different.
Each row represents a single simulation. The value in the i-th column in each row says how many times value i-1 was observed in this simulation. For example, in the first simulation (first row), we got a single result with value '0' (first column), 7 results with value '2' (third column) etc.
I wish to create an average cumulative distribution function (CDF) for all the simulation results, so I could later use it to calculate an empirical p-value for true results.
To do this I can first sum each column, but I need to treat the missing (undefined) columns as zeros.
How do I read such a table with different row lengths? How do I sum the columns, replacing 'undef' values with 0? And finally, how do I create the CDF? (I can do this manually, but I guess there is some package that can do it.)
This will read the data in:
dat <- textConnection("1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1")
df <- data.frame(scan(dat, fill = TRUE, what = as.list(rep(1, 29))))
names(df) <- paste("Val", 1:29)
close(dat)
Resulting in:
> head(df)
Val 1 Val 2 Val 3 Val 4 Val 5 Val 6 Val 7 Val 8 Val 9 Val 10 Val 11 Val 12
1 1 0 7 0 0 1 0 0 0 5 0 0
2 1 0 0 1 0 0 0 3 0 0 0 0
3 0 0 0 1 0 0 0 2 0 0 0 0
4 1 0 0 1 0 3 0 0 0 0 1 0
5 0 0 0 1 0 0 0 2 0 0 0 0
....
If the data are in a file, provide the file name instead of dat. This code presumes that there are a maximum of 29 columns, as per the data you supplied. Alter the 29 to suit the real data.
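If you would rather not hard-code the 29, count.fields() can read the maximum row length from the file first (a sketch, assuming the data live in a file called "data.txt"):
# Count the whitespace-separated fields on each line and take the maximum
ncols <- max(count.fields("data.txt"))
df <- data.frame(scan("data.txt", fill = TRUE, what = as.list(rep(1, ncols))))
names(df) <- paste("Val", seq_len(ncols))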
We get the column sums using
df.csum <- colSums(df, na.rm = TRUE)
the ecdf() function generates the ECDF you wanted,
df.ecdf <- ecdf(df.csum)
and we can plot it using the plot() method:
plot(df.ecdf, verticals = TRUE)
You can use the ecdf() (in base R) or Ecdf() (from the Hmisc package) functions.
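Since the original goal was an empirical p-value: the object returned by ecdf() is itself a function, so it can be evaluated directly at an observed value (a sketch; observed is a hypothetical number standing in for your true result):
observed <- 4                      # hypothetical observed value
p_value  <- 1 - df.ecdf(observed)  # empirical P(X > observed)
p_value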
