Interpret knn.cv (R) results after applying it to a data set

I have encountered a problem while using the k-nearest neighbors algorithm with cross-validation on a data set in R, via knn.cv from the FNN package.
The data set consists of 4601 email cases with 58 attributes. The first 57 are character or word frequencies in the emails (numerical, range [0,100]), and the last one indicates whether the email is spam (value 1) or ham (value 0).
After specifying the train and cl arguments and using 10 neighbors, the function returns a list of all the emails with values like 7.4032 in each column, which I don't know how to use. I need to find the percentage of spam and ham that the function classifies and compare it with the correct percentage. How should I interpret these results?

Given that the data set you describe matches (exactly) the spam data set in the ElemStatLearn package accompanying the well-known book by the same title, I'm wondering if this is in fact a homework assignment. If that's the case, it's ok, but you should add the homework tag to your question.
Here are some pointers.
The documentation for the function knn.cv says that it returns a vector of classifications, along with the distances and indices of the k nearest neighbors as "attributes". So when I run this:
library(FNN)
out <- knn.cv(spam[,-58], spam[,58], k = 10)
The object out looks sort of like this:
> head(out)
[1] spam spam spam spam spam email
Levels: email spam
The other values you refer to are sort of "hidden" as attributes, but you can see that they are there using str:
> str(out)
Factor w/ 2 levels "email","spam": 2 2 2 2 2 1 1 1 2 2 ...
- attr(*, "nn.index")= int [1:4601, 1:10] 446 1449 500 5 4 4338 2550 4383 1470 53 ...
- attr(*, "nn.dist")= num [1:4601, 1:10] 8.10e-01 2.89 1.50e+02 2.83e-03 2.83e-03 ...
You can access those additional attributes via something like this:
nn.index <- attr(out,'nn.index')
nn.dist <- attr(out,'nn.dist')
Note that both of these objects end up being matrices of dimension 4601 x 10, which makes sense, since the documentation said that they recorded the index (i.e. row number) of the k = 10 nearest neighbors as well as the distances to each.
For the last bit, you will probably find the table() function useful, as well as prop.table().
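To make that last pointer concrete, here is a minimal sketch using small toy vectors rather than the real data. The names `predicted` and `actual` are hypothetical stand-ins: with the spam data they would presumably be `out` and `spam[,58]`.

```r
# Toy stand-ins (hypothetical): `predicted` plays the role of `out`,
# `actual` the role of spam[, 58].
actual    <- factor(c("spam", "spam", "email", "email", "email", "spam"),
                    levels = c("email", "spam"))
predicted <- factor(c("spam", "email", "email", "email", "spam", "spam"),
                    levels = c("email", "spam"))

# Share of each class among the predictions and among the true labels
pred_share   <- prop.table(table(predicted))
actual_share <- prop.table(table(actual))

# Cross-tabulation of predicted vs. actual classes (confusion matrix)
conf <- table(predicted, actual)

# Proportion of cases on the diagonal, i.e. classified correctly
accuracy <- sum(diag(conf)) / sum(conf)

print(pred_share)
print(actual_share)
print(conf)
print(accuracy)
```

Comparing `pred_share` with `actual_share` gives the overall percentages, while `conf` shows where the two disagree case by case.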

number of items to replace is not a multiple of replacement length in weighting

I am using multinomial regression to get the probability of belonging to four sub-groups for 500,000 regions.
The data.frame looks like this:
Regions  groupadmit  mid_pop
1        2           1764
2        3           1254
25       1           1452
674      4           2665
3001     2           1097
56       3           9864
98       1           2675
500,000  ....        .....
I wrote the following code:
library(nnet)
mlogit <- multinom(groupadmit ~ mid_pop, data = admissionLSOA1)
probs <- predict(mlogit, type = "probs")
The code works fine up to this point, giving the probability of belonging to each group (1, 2, 3, 4) for each observation (region).
Probabilities:
Regions  groupadmit1  groupadmit2  groupadmit3  groupadmit4
52       0.2484091    0.2494408    0.2505393    0.2516109
97       0.2483949    0.2494358    0.2505441    0.2516252
1300     0.2483253    0.2494112    0.2505676    0.251695
287      0.2483623    0.2494242    0.2505551    0.2516584
500,000  ....         .....        ....         ....
But when I try to weight the sample (regions) according to their probabilities, I get the following warning:
Warning message:
In wts[groupadmit == 1] <- probs[groupadmit == 1, 1]/probs[groupadmit == :
number of items to replace is not a multiple of replacement length
What I am doing is weighting each region by the ratio of its probability of belonging to groupadmit 1 to its probability of belonging to its own groupadmit, in order to balance out possible selection bias. It is very similar to inverse probability weighting. The code is:
wts[groupadmit==1] <- probs[groupadmit==1,1]/probs[groupadmit==1,1]
wts[groupadmit==2] <- probs[groupadmit==2,1]/probs[groupadmit==2,2]
wts[groupadmit==3] <- probs[groupadmit==3,1]/probs[groupadmit==3,3]
wts[groupadmit==4] <- probs[groupadmit==4,1]/probs[groupadmit==4,4]
But the above warning comes up whenever I run the analysis.
Could someone please help me understand why I get this warning and how I can solve it?
Many thanks in advance
Why does R complain?
Warning message:
In wts[groupadmit == 1] <- probs[groupadmit == 1, 1]/probs[groupadmit == :
number of items to replace is not a multiple of replacement length
It means that the number of elements selected on the left-hand side of the assignment (<-), which is wts[groupadmit==1], is not a multiple of the length of the right-hand side, so the two sides do not line up.
Therefore, I suggest you run:
length(probs[groupadmit==1,1]/probs[groupadmit==1,1])
and then
length(wts[groupadmit==1])
I suppose this will show that the left-hand side is smaller.
Then simply run
wts[groupadmit==1] <- probs[groupadmit==1,1]/probs[groupadmit==1,1]
and finally print
wts[groupadmit==1]
Solution:
A quick fix is to use rbind to build your wts:
wts <- rbind(probs[groupadmit==1,1]/probs[groupadmit==1,1],
             probs[groupadmit==2,1]/probs[groupadmit==2,2],
             probs[groupadmit==3,1]/probs[groupadmit==3,3],
             probs[groupadmit==4,1]/probs[groupadmit==4,4])
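As an alternative to building the weights group by group, the following is a minimal sketch that computes them in one step with matrix indexing. It uses toy data and assumes that groupadmit is coded 1 to 4 and has exactly one entry per row of probs. If the lengths differ in the real data (one possible cause is rows with missing values being dropped when the model is fitted), that mismatch would need to be resolved first. The name `own_prob` is introduced here purely for illustration.

```r
# Toy stand-ins (hypothetical): 5 regions, 4 groups.
groupadmit <- c(2, 3, 1, 4, 2)
probs <- matrix(c(0.1,  0.2,  0.3,   0.4,
                  0.4,  0.3,  0.2,   0.1,
                  0.25, 0.25, 0.25,  0.25,
                  0.2,  0.2,  0.2,   0.4,
                  0.5,  0.25, 0.125, 0.125),
                ncol = 4, byrow = TRUE)

# The indexing only makes sense if both objects describe the same rows
stopifnot(length(groupadmit) == nrow(probs))

# Probability of the group each region actually belongs to,
# picked with a (row, column) index matrix
own_prob <- probs[cbind(seq_len(nrow(probs)), groupadmit)]

# Weight = P(group 1) / P(own group); regions in group 1 get weight 1
wts <- probs[, 1] / own_prob
print(wts)
```

Because `wts` is created in a single vectorised step, it has one weight per region in the original row order and no subset assignment is involved.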

R confusion matrix error

I have two lists and I'm creating a confusion matrix like this
conf.mat <- table(x,y)
but the
accuracy <- sum(diag(conf.mat)) / length(y) * 100
is giving me 0 when I know for sure that some of the entries match.
x is a long list that ends like this
[1546] data mining
25 Levels: clustering algorithms ...
and y ends like this
[1546] mixed discrete-continuous optimization
646 Levels: access control ... world wide web
The thing is, even though I expected diag(conf.mat) to cover all 1546 cases, it only contains 25 entries.
Any ideas what's happening? I assume it has something to do with the levels but I'm not sure how to fix this.
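One likely explanation is that table(x, y) is 25 x 646 here, so diag() returns only 25 cells and pairs the i-th level of x with the i-th level of y, which need not be the same label. A minimal sketch of one way around this, on toy data with hypothetical labels, is to put both factors on a shared set of levels before tabulating:

```r
# Toy stand-ins (hypothetical) for the two label vectors
x <- factor(c("data mining", "clustering", "data mining", "clustering"))
y <- factor(c("data mining", "web search", "clustering", "clustering"))

# Put both factors on the same, identically ordered set of levels
all_levels <- union(levels(x), levels(y))
x2 <- factor(x, levels = all_levels)
y2 <- factor(y, levels = all_levels)

# Square table: the diagonal now pairs each label with itself
conf.mat <- table(x2, y2)
accuracy <- sum(diag(conf.mat)) / length(y2) * 100

# Equivalent check that does not depend on the table layout at all
accuracy_direct <- mean(as.character(x) == as.character(y)) * 100
print(c(accuracy, accuracy_direct))
```

The direct comparison is a useful sanity check: if it also gives 0, the labels themselves never agree (for example because of differing spelling or whitespace), and the table is not the problem.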

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
   SYMBOL   ENTREZID  GENENAME
1  AFAP1    60312     actin filament associated protein 1
2  ANAPC11  51529     anaphase promoting complex subunit 11
3  ANAPC5   51433     anaphase promoting complex subunit 5
4  ATL2     64225     atlastin GTPase 2
5  AURKA    6790      aurora kinase A
6  CCNB2    9133      cyclin B2
7  CCND2    894       cyclin D2
8  CDCA2    157313    cell division cycle associated 2
9  CDCA7    83879     cell division cycle associated 7
10 CDCA7L   55536     cell division cycle associated 7-like
I simply want to find their functions, and it was suggested that I use GO analysis tools.
I am not sure if that is the correct way to do so.
Here is my attempt:
library(org.Hs.eg.db)
x <- org.Hs.egGO
# Get the GO annotations for the Entrez gene identifiers in my list
xx <- as.list(x[gene_list$ENTREZID])
So I get a list, keyed by Entrez ID, in which each gene is assigned several GO terms.
for example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is:
How can I find the function of each of these genes in a simpler way? I also wonder whether I am doing this right,
because I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I understand what you are aiming for here.
BTW, for bioinformatics-related topics, you can also have a look at biostar, which has the same purpose as SO but for bioinformatics.
If you just want a list of the functions related to each gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an API for querying BioMart databases.
You will need an internet connection to run the query, though.
Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes that take you through the different steps of the analysis (and even highlight how you should design your data and what some of the pitfalls are).
In your case, this comes directly from the biomaRt vignette, task 2 in particular.
Note: there are slightly quicker ways than the one I report below:
# load the library
library("biomaRt")
# I prefer ensembl, so that is the one I will query, but you can
# query other databases; try out: listMarts()
ensembl <- useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# if you want other model organisms, have a look at:
# listDatasets(ensembl)
You need to create your query (your list of ENTREZ ids). To see which filters you can query on:
filters <- listFilters(ensembl)
Then you want to retrieve attributes: your GO ID and description. To see the list of available attributes:
attributes <- listAttributes(ensembl)
For you, the query would look something like this:
goids <- getBM(
  # you want entrezgene so you know which is which, plus the GO ID;
  # name_1006 is actually the identifier of 'GO term name'
  attributes = c('entrezgene', 'go_id', 'name_1006'),
  filters = 'entrezgene',
  values = gene_list$ENTREZID,
  mart = ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend it for anything other than reporting purposes).
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  # all GO rows returned for this gene
  tempo <- goids[goids$entrezgene == x, ]
  data.frame('ENTREZGENE' = x,
             'Go.ID' = paste(tempo$go_id, collapse = ' ; '),
             'GO.term' = paste(tempo$name_1006, collapse = ' ; '))
}))
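Since the goal in the question was to add the GO information to gene_list as a column, here is a minimal sketch of the same collapsing step followed by a merge, using base R only. The `goids` data frame below is a made-up stand-in for the getBM() result: the two GO IDs for gene 60312 are taken from the question, while the third row and all term names are placeholders.

```r
# Toy stand-in (hypothetical) for the getBM() result
goids <- data.frame(
  entrezgene = c(60312, 60312, 6790),
  go_id      = c("GO:0009966", "GO:0051493", "GO:0000000"),
  name_1006  = c("placeholder term A", "placeholder term B",
                 "placeholder term C"),
  stringsAsFactors = FALSE
)
gene_list <- data.frame(SYMBOL = c("AFAP1", "AURKA"),
                        ENTREZID = c(60312, 6790),
                        stringsAsFactors = FALSE)

# One row per gene, with GO IDs and term names pasted together
go_collapsed <- aggregate(goids[c("go_id", "name_1006")],
                          by = list(entrezgene = goids$entrezgene),
                          FUN = paste, collapse = " ; ")

# Attach the collapsed columns to the gene list; all.x = TRUE keeps
# genes for which no GO term was returned
annotated <- merge(gene_list, go_collapsed,
                   by.x = "ENTREZID", by.y = "entrezgene", all.x = TRUE)
print(annotated)
```

Note that merge() sorts the result by the key column, so the row order of `annotated` may differ from that of gene_list.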
Edit:
If you want to query a past version of the ensembl database:
ens82 <- useMart(host = 'sep2015.archive.ensembl.org',
                 biomart = 'ENSEMBL_MART_ENSEMBL',
                 dataset = 'hsapiens_gene_ensembl')
and then the query would be:
goids <- getBM(attributes = c('entrezgene', 'go_id', 'name_1006'),
               filters = 'entrezgene', values = gene_list$ENTREZID,
               mart = ens82)
However, if you had in mind to do a GO enrichment analysis, your list of genes is too short.

To use the correct test for independence

I have two groups (data.frames) in R called good and bad, which contain good users and bad users respectively.
The group good contains game_id, which is the id of a computer game, and number, which is how many times this game has been played.
For example, good$game_id gives 1 2 3 ... 20, so we have 20 games.
Similarly, good$number gives 45214 1254 23 ... 8914, which is the number of times each game has been played. For example, game_id==1 has been played 45214 times in group good.
The same holds for bad.
We also have the same number of users in the two groups.
So for head(good,20) we get
game_id number
1 45214
2 1254
...
20 8914
I want to investigate whether there is dependence between the two groups in the number of times a fixed computer game has been played.
For game_id==1 I would try to use Pearson's chi-squared test of independence.
In R I type chisq.test(good[1,2], bad[1,2]) to see if there is independence between good and bad for game_id==1, but I get an error message: x and y must have same levels.
How can this problem be solved?
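When chisq.test() is given two vectors, it cross-tabulates them as categories, so two single counts do not describe a contingency table. One possible framing, sketched below with made-up counts for 3 games instead of 20, is to pass a matrix of counts instead. Whether individual plays can be treated as independent observations is a modelling assumption that the sketch does not check.

```r
# Toy stand-ins (hypothetical counts) with 3 games instead of 20
good <- data.frame(game_id = 1:3, number = c(45214, 1254, 23))
bad  <- data.frame(game_id = 1:3, number = c(40100, 2300, 60))

# Games x groups table of play counts
counts <- cbind(good = good$number, bad = bad$number)
rownames(counts) <- good$game_id

# Test across all games: is the distribution of plays over games
# the same in both groups?
test_all <- chisq.test(counts)

# Test for game 1 only: game 1 vs. all other games, by group
game1 <- rbind(game1 = counts[1, ],
               other = colSums(counts[-1, , drop = FALSE]))
test_game1 <- chisq.test(game1)

print(test_all$p.value)
print(test_game1$p.value)
```

For the 2 x 2 table, chisq.test() applies Yates' continuity correction by default; it can be switched off with correct = FALSE.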

Testing recurrences and orders in strings matlab

I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups 1:5 and calculated the probability of touching each of 1:5 (stored as PDF).
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
seq = zeros(1000, max(observed_seq_length)); % one row per simulated nurse
for i = 1:1000 % 1000 different nurses
    % sample with replacement, weighted by PDF
    seq(i,:) = randsample(1:5, max(observed_seq_length), true, PDF);
end
e.g.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length), seq(1,:)); hold all
I'd like to compare my empirical sequences with my predicted ones. What would you suggest as the best strategy or property to look at?
Regards,
EDIT: I put r as a tag as this may well fall more easily under that category due to the nature of the question rather than the matlab code.
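Because the simulated sequences draw every contact independently, one property they cannot reproduce is order dependence between consecutive contacts. As a minimal sketch in R of one way to look at this, the code below tabulates first-order transitions (current surface to next surface) for an observed and a simulated sequence. The `observed` vector and the helper `transition_counts` are hypothetical and for illustration only.

```r
set.seed(1)
pdf_obs <- c(0.255202629, 0.186199343, 0.104052574, 0.201533406, 0.253012048)

# Toy stand-in (hypothetical) for one observed sequence of surface categories
observed <- c(1, 5, 2, 3, 4, 2, 5, 5, 2, 5, 1, 1, 5, 2, 4)

# Simulated sequence drawn independently from the category probabilities
simulated <- sample(1:5, length(observed), replace = TRUE, prob = pdf_obs)

# First-order transition counts: rows = current surface, cols = next surface
transition_counts <- function(s, k = 5) {
  table(factor(head(s, -1), levels = 1:k),
        factor(tail(s, -1), levels = 1:k))
}

obs_trans <- transition_counts(observed)
sim_trans <- transition_counts(simulated)

# Row-normalised transition probabilities for the observed sequence
obs_prob <- prop.table(obs_trans, margin = 1)
print(obs_trans)
print(round(obs_prob, 2))
```

With the real data, the counts would be summed over all 400 episodes; rows of the observed transition probabilities that differ clearly from PDF would suggest that the next surface depends on the current one.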
