R confusion matrix error - r

I have two lists and I'm creating a confusion matrix like this
conf.mat <- table(x,y)
but the accuracy computed with
accuracy <- sum(diag(conf.mat)) / length(y) * 100
comes out as 0, even though I know for sure that some of the entries match.
x is a long list that ends like this
[1546] data mining
25 Levels: clustering algorithms ...
and y ends like this
[1546] mixed discrete-continuous optimization
646 Levels: access control ... world wide web
The thing is, although I expected diag(conf.mat) to cover all 1546 entries, it only contains 25 values.
Any ideas what's happening? I assume it has something to do with the levels but I'm not sure how to fix this.
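One likely cause, assuming x and y are factors with different level sets (25 vs 646, as the printed output suggests): table(x, y) is then a 25 x 646 table, and diag() pairs the i-th level of x with the i-th level of y, which are usually different labels. A minimal sketch of a fix is to give both factors the same levels before tabulating. The toy vectors below are made up for illustration.

```r
# Toy stand-ins for the real x and y (hypothetical values)
x <- factor(c("data mining", "clustering algorithms", "data mining"))
y <- factor(c("data mining", "access control", "world wide web"))

# Use one shared set of levels so the table is square and
# row i and column i refer to the same label
lv <- union(levels(x), levels(y))
conf.mat <- table(factor(x, levels = lv), factor(y, levels = lv))

accuracy <- sum(diag(conf.mat)) / length(y) * 100

# Equivalent check that does not depend on levels at all
accuracy2 <- mean(as.character(x) == as.character(y)) * 100
```

If only the accuracy is needed, the second form avoids building the table altogether.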

Related

Converting a numeric dataframe with fractions to integers in R via interpolation

I have collected data which looks something like this:
wl Spec.94
299.784 57.95
300.151 57.18
300.517 88.18
300.884 18.71
301.252 100.90
301.617 127.06
301.983 75.02
302.349 54.20
302.715 50.93
303.082 50.43
However, the program I use to analyze the data can only handle whole numbers for wl. I have an Excel sheet I inherited that interpolates this data and produces this:
wl Spec
300 41.03
301 61.77
302 51.84
I really don't know how that spreadsheet works, but the column titles that it auto-populates are Target Wl, Nearest Smaller Index, Nearest Smaller Wl, Upper Wl, Bias, Low-side value, High-side Value, and Interpolated value.
I need to be able to replicate this process in my R code, to make the analysis reproducible, but I have no idea where to start. How do I interpolate my data in R to get the values for Spec.94 at whole number values in wl?
You could do a sequence over the rounded range and feed approx with it.
wl <- do.call(seq, as.list(round(range(dat$wl))))
cbind(wl, Spec.94=approx(dat$wl, dat$Spec.94, wl)$y)
# wl Spec.94
# [1,] 300 57.49681
# [2,] 301 44.61772
# [3,] 302 74.05295
# [4,] 303 50.54172
However, the values are somewhat different, and I'm not sure how your specific Excel sheet arrives at 41.03 for wl = 300, since a linear interpolation there should land somewhere between 57.95 and 57.18. Maybe you could figure that out?
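To compare against the spreadsheet column by column, the linear interpolation can also be written out by hand. This is only a sketch: it assumes the sheet's "Bias" is the fractional position of the target between the nearest smaller and the next larger wavelength, which is a guess based on the column titles, and the data frame column names below are made up.

```r
dat <- data.frame(
  wl      = c(299.784, 300.151, 300.517, 300.884, 301.252,
              301.617, 301.983, 302.349, 302.715, 303.082),
  Spec.94 = c(57.95, 57.18, 88.18, 18.71, 100.90,
              127.06, 75.02, 54.20, 50.93, 50.43)
)

# Whole-number targets that lie inside the measured range
target <- seq(ceiling(min(dat$wl)), floor(max(dat$wl)))

lo <- findInterval(target, dat$wl)  # index of the nearest smaller wl
hi <- lo + 1                        # index of the next larger wl
# (a target equal to max(dat$wl) would need special handling)

# Fractional position of the target between the two neighbours
bias <- (target - dat$wl[lo]) / (dat$wl[hi] - dat$wl[lo])

manual <- data.frame(
  target_wl    = target,
  lower_index  = lo,
  lower_wl     = dat$wl[lo],
  upper_wl     = dat$wl[hi],
  bias         = bias,
  low_value    = dat$Spec.94[lo],
  high_value   = dat$Spec.94[hi],
  interpolated = dat$Spec.94[lo] + bias * (dat$Spec.94[hi] - dat$Spec.94[lo])
)
```

The interpolated column should agree with approx(); if the spreadsheet gives something else, comparing the intermediate columns may show where it deviates.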

What is the best way to manage/store results from either posthoc.kruskal.dunn.test() or dunn.test() - where my input data is in dataframe format?

I am a newbie in R programming and seek help in analyzing the Metabolomics data - 118 metabolites with 4 conditions (3 replicates per condition). I would like to know, for each metabolite, which condition(s) is significantly different from which. Here is part of my data
> head(mydata)
Conditions HMDB03331 HMDB00699 HMDB00606 HMDB00707 HMDB00725 HMDB00017 HMDB01173
1 DMSO_BASAL 0.001289121 0.001578235 0.001612297 0.0007772231 3.475837e-06 0.0001221674 0.02691318
2 DMSO_BASAL 0.001158363 0.001413287 0.001541713 0.0007278363 3.345166e-04 0.0001037669 0.03471329
3 DMSO_BASAL 0.001043537 0.002380287 0.001240891 0.0008595932 4.007387e-04 0.0002033625 0.07426482
4 DMSO_G30 0.001195253 0.002338346 0.002133992 0.0007924157 4.189224e-06 0.0002131131 0.05000778
5 DMSO_G30 0.001511538 0.002264779 0.002535853 0.0011580857 3.639661e-06 0.0001700157 0.02657079
6 DMSO_G30 0.001554804 0.001262859 0.002047611 0.0008419137 6.350990e-04 0.0000851638 0.04752020
This is what I have so far.
I learned the first line from this post
kwtest_pvl = apply(mydata[,-1], 2, function(x) kruskal.test(x,as.factor(mydata$Conditions))$p.value)
and this is where I loop through the metabolites that passed the KW test
tCol = colnames(mydata[,-1])[kwtest_pvl <= 0.05]
for (k in tCol){
output = posthoc.kruskal.dunn.test(mydata[,k],as.factor(mydata$Conditions),p.adjust.method = "BH")
}
I am not sure how to manage my output such that it is easier to manage for all the metabolites that passed KW test. Perhaps saving the output from each iteration appending to excel? I also tried dunn.test package since it has an option of table or list output. However, it still leaves me at the same point. Kinda stuck here.
Moreover, should I also adjust the p-values (e.g. to control the FWER, or the FDR with BH) right after the KW test, before performing the posthoc test?
Any suggestion(s) would be greatly appreciated.
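One way to keep the per-metabolite output is to collect it in a named list instead of overwriting output on every iteration, then stack the pairwise p-values into one long data frame. The sketch below uses made-up data and swaps in base R's pairwise.wilcox.test() as a stand-in for the Dunn test so that it runs without extra packages; the assumption is that the Dunn test result also carries a p.value matrix that can be handled the same way. Adjusting the KW p-values with p.adjust() first is shown as one option, not a recommendation.

```r
# Made-up data: 4 conditions x 3 replicates, 2 "metabolites"
mydata <- data.frame(
  Conditions = rep(c("A", "B", "C", "D"), each = 3),
  m1 = c(1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12),         # no group effect
  m2 = c(1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33)    # clear group effect
)
grp <- factor(mydata$Conditions)

# Kruskal-Wallis p-value per metabolite, then BH adjustment across metabolites
kw_p   <- sapply(mydata[, -1], function(x) kruskal.test(x, grp)$p.value)
kw_adj <- p.adjust(kw_p, method = "BH")
tCol   <- names(kw_adj)[kw_adj <= 0.05]

# One p-value matrix per metabolite, stored in a named list
results <- lapply(setNames(tCol, tCol), function(k) {
  pairwise.wilcox.test(mydata[[k]], grp, p.adjust.method = "BH")$p.value
})

# Stack everything into one long table: metabolite, group1, group2, p.adj
long <- do.call(rbind, lapply(names(results), function(k) {
  d <- as.data.frame(as.table(results[[k]]))
  names(d) <- c("group1", "group2", "p.adj")
  cbind(metabolite = k, d[!is.na(d$p.adj), ])
}))
```

The long table can then be written out in one go, e.g. with write.csv(long, "posthoc_results.csv"), rather than appending to a spreadsheet inside the loop.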

Problems with Naive Bayes

I'm trying to run Naive Bayes in R for making predictions from textual data (by building a Document Term Matrix).
I read several posts warning about terms that could be missing in both the training and the testing set, so I decided to work with only one data frame and split it afterwards. The code I'm using is this:
data <- read.csv(file="path",header=TRUE)
########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)
# CREATE DATA FRAME AND TRAINING AND
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999,c(8,27)])
testdata <- as.data.frame(data[14000:14999,c(8,27)])
complete <- as.data.frame(data[13000:14999,c(8,27)])
# SEPARATE TEXT VECTOR TO CREATE Source(),
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM
# MATRIX TAKES Source()
completevector <- as.vector(complete$Text)
# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)
# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)
# STEM WORDS, REMOVE STOPWORDS, TRIM WHITESPACE
completecorpus <- tm_map(completecorpus,tolower)
completecorpus <- tm_map(completecorpus,PlainTextDocument)
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords,stopwords("english"))
completecorpus <- tm_map(completecorpus,removePunctuation)
completecorpus <- tm_map(completecorpus,removeNumbers)
completecorpus <- tm_map(completecorpus,stripWhitespace)
# CREATE DOCUMENT TERM MATRIX
completematrix<-DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]
# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1)
# PREDICTION
results <- predict(model,as.matrix(testmatrix))
conf.matrix<-table(results, testdata$InfoType,dnn=list('predicted','actual'))
conf.matrix
The problem is that I'm getting weird results like this:
         actual
predicted   1   2   3
        1  60 833 107
        2   0   0   0
        3   0   0   0
Any idea why this is happening?
The raw data looks like this:
head(complete)
Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer. easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer. I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well
InfoType
13000 2
13001 2
13002 2
13003 3
13004 2
13005 2
Seemingly the problem is that the document term matrix is far too sparse, so the rarely occurring terms need to be removed. So I added:
completematrix<-removeSparseTerms(completematrix, 0.95)
And it started working!!
         actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
Thank you all for your ideas (thank you Chelsey Hill!!)
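To get a feel for what the 0.95 threshold does, here is a rough base R sketch on a made-up count matrix. It assumes removeSparseTerms() keeps a term only if the share of documents in which it does not occur is below the threshold; this approximates the idea and is not the tm implementation.

```r
# Made-up document-term counts: 40 documents, 3 terms
m <- matrix(0, nrow = 40, ncol = 3,
            dimnames = list(NULL, c("cup", "milkshake", "rare")))
m[1:20, "cup"]      <- 1   # occurs in 50% of documents
m[1:4, "milkshake"] <- 2   # occurs in 10% of documents
m[1, "rare"]        <- 1   # occurs in 2.5% of documents

# Share of documents in which each term is absent
sparsity <- colMeans(m == 0)

# Keep only terms whose sparsity is below 0.95
kept <- m[, sparsity < 0.95, drop = FALSE]
colnames(kept)
```

Here "rare" is dropped because it is absent from 97.5% of the documents, while the other two terms stay.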

Discriminant analysis and column name in the code

I have been writing some code to make it easier to perform a discriminant analysis using the lda function, but there is one step I cannot solve: supplying the name of the categorical column in the code. Imagine we have the following table (called Smoke), in which the column Factor represents the groups (in our case, smoker and nsmok).
Smoke
Factor Lung Heart Blood
1 smoker 7 22 15
2 smoker 8 21 12
3 nsmok 22 9 5
This is the code I have been preparing. Please, look at the XXXX's in the code (it appears twice). I want them to write automatically the name of the categorical column, instead of writing directly it twice.
lda=lda(XXXX~.,data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
Tabla=table(Smoke$XXXX,lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
I thought that writing...
colnames(Table)[1]
... would solve it. But actually there still exist some errors when running the code.
Otherwise, I thought that introducing the name directly in this way:
Column_Factor-> Factor
and writing Column_Factor in the two places in the code would solve it. But it doesn't.
Any ideas?
You could do something like this:
library(MASS)
# get the name of the factor column (assumes exactly one; a character column must be converted with factor() first)
Column_Factor <- names(Smoke)[sapply(Smoke, is.factor)]
# build the formula by pasting the column name onto the right-hand side
lda <- lda(as.formula(paste(Column_Factor, "~ .")), data = Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
#selects the column using the variable
Tabla=table(Smoke[,Column_Factor],lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
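The same idea can be wrapped in a small helper so the column name never has to be typed. This is only a sketch: run_lda is a made-up name, it assumes the data frame has exactly one factor column, and it uses the built-in iris data in place of Smoke. reformulate() is used here as an alternative to pasting the formula together.

```r
library(MASS)

# Hypothetical helper: fits lda() on the single factor column of df
run_lda <- function(df) {
  fac <- names(df)[sapply(df, is.factor)]
  stopifnot(length(fac) == 1)          # assumption: exactly one factor column

  # Builds e.g. Species ~ . without typing the column name
  fit  <- lda(reformulate(".", response = fac), data = df)
  pred <- predict(fit)

  # Rows: actual groups, columns: predicted groups
  tab <- table(actual = df[[fac]], predicted = pred$class)
  list(fit = fit, table = tab, accuracy = sum(diag(prop.table(tab))))
}

res <- run_lda(iris)
res$table
res$accuracy
```

With your own data, run_lda(Smoke) should work the same way as long as Factor is stored as a factor.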

Interpret knn.cv (R) results after applying on data set

I have encountered a problem while using the k-nearest neighbors algorithm (with cross validation) on a data set in R, the knn.cv from the FNN package.
The data set consists of 4601 emails with 58 attributes: 57 numerical attributes based on character or word frequencies in the emails (range [0,100]), and a last one indicating whether the email is spam (value 1) or ham (value 0).
After indicating train and cl variables and using 10 neighbors, running the package presents a list of all the emails with values like 7.4032 at each column, which I don't know how to use. I need to find the percentage of spam and ham the package classifies and compare it with the correct percentage. How should I interpret these results?
Given that the data set you describe matches (exactly) the spam data set in the ElemStatLearn package accompanying the well-known book by the same title, I'm wondering if this is in fact a homework assignment. If that's the case, it's ok, but you should add the homework tag to your question.
Here are some pointers.
The documentation for the function knn.cv says that it returns a vector of classifications, along with the distances and indices of the k nearest neighbors as "attributes". So when I run this:
out <- knn.cv(spam[,-58],spam[,58],k = 10)
The object out looks sort of like this:
> head(out)
[1] spam spam spam spam spam email
Levels: email spam
The other values you refer to are sort of "hidden" as attributes, but you can see that they are there using str:
> str(out)
Factor w/ 2 levels "email","spam": 2 2 2 2 2 1 1 1 2 2 ...
- attr(*, "nn.index")= int [1:4601, 1:10] 446 1449 500 5 4 4338 2550 4383 1470 53 ...
- attr(*, "nn.dist")= num [1:4601, 1:10] 8.10e-01 2.89 1.50e+02 2.83e-03 2.83e-03 ...
You can access those additional attributes via something like this:
nn.index <- attr(out,'nn.index')
nn.dist <- attr(out,'nn.dist')
Note that both of these objects end up being matrices of dimension 4601 x 10, which makes sense, since the documentation said that they recorded the index (i.e. row number) of the k = 10 nearest neighbors as well as the distances to each.
For the last bit, you will probably find the table() function useful, as well as prop.table().
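As a rough sketch of that last step, with short made-up vectors standing in for the real labels and for the out returned by knn.cv:

```r
# Made-up stand-ins: true labels and cross-validated predictions
actual <- factor(c("spam", "spam", "email", "email", "email", "spam"),
                 levels = c("email", "spam"))
out    <- factor(c("spam", "email", "email", "email", "spam", "spam"),
                 levels = c("email", "spam"))

# Share of each class among the predictions and among the true labels
pred_share   <- prop.table(table(out))
actual_share <- prop.table(table(actual))

# Confusion matrix and overall proportion classified correctly
conf     <- table(predicted = out, actual = actual)
accuracy <- sum(diag(conf)) / sum(conf)
```

Comparing pred_share with actual_share gives the percentages asked about, and conf shows where the two disagree.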
