ROCR Plot using R

I have a CSV file (tab-delimited) which contains 2 columns and looks like this:
5 0
6 0
9 0
8 1
(+5000 similar lines)
I am attempting to create a ROC plot using ROCR.
This is what I have tried so far:
p<-read.csv(file="forROC.csv", head=TRUE, sep="\t")
pred<-prediction(p[1],p[2])
The second line gives me an error: Error in prediction(p[1], p[2]) : Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.
I am not sure what the error is. Is there something wrong with my CSV file?

My guess is that your array indexing isn't set up properly. If you read in that CSV file, you should expect a data.frame (think matrix or 2D array, depending on your background) with two columns and 5,000+ rows.
So your current calls to p[1] and p[2] aren't especially meaningful. You probably want to access the first and second columns of that data.frame, which you can do using the syntax p[,1] for the first column and p[,2] for the second.
The specific error you're encountering, however, is a complaint that the "truth" variable you're using isn't binary. It seems that your file is set up to have an output of 1 and 0, so this error may go away once you properly access your array. But if you encounter it in the future, just be sure to binarize your truth data before you use it. For instance:
p[,2] <- p[,2] != 0
This would set the value to FALSE for every zero and TRUE for each non-zero cell in the column.
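Putting the pieces together, a minimal end-to-end sketch might look like the following. The small data.frame stands in for the contents of forROC.csv, and the ROCR calls are guarded so the snippet only runs them where the package is installed:

```r
# Stand-in for: p <- read.csv("forROC.csv", header = TRUE, sep = "\t")
p <- data.frame(score = c(5, 6, 9, 8), label = c(0, 0, 0, 1))

p[, 2] <- p[, 2] != 0                        # binarize the truth column

if (requireNamespace("ROCR", quietly = TRUE)) {
  pred <- ROCR::prediction(p[, 1], p[, 2])   # note p[, 1] / p[, 2], not p[1] / p[2]
  perf <- ROCR::performance(pred, "tpr", "fpr")
  plot(perf)                                 # the ROC curve
}
```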

Related

Importing stata files into R and variables have to be filtered using labels and not actual numerical values

I have a very basic question on importing STATA files into R, and I tried searching the forum but could not find what I was looking for.
I have a DHS file (AR - HIV test results), and it has only a few fields after importing into R using the foreign package:
AR_HIV_dataset <- read.dta("RWAR71FL.DTA") #HIV test result file
My question is how to filter cases with dplyr based on the value of a variable, e.g. HIV03. Using str(), the variable HIV03 is displayed as "HIV negative", "HIV positive", etc.:
$ hiv03 : Factor w/ 8 levels "hiv negative",..: 1 1 1 1 1 1 1 1 1 1 ...
but the actual data values stored are just 0 or 1. However, I cannot refer to these numerical values as the filter command seems to need me to specify the label values, e.g.
filter(AR_HIV_dataset,hiv03=="hiv negative")
this will return the required cases, but I would like to be able to use the following command instead (using the actual values)
filter(AR_HIV_dataset, hiv03==0)
But if I do that, it returns an error.
Can you let me know what I need to change in order to use the second line of code instead?
Thanks in advance for your kind support.
Using the haven package (http://haven.tidyverse.org/) to import the Stata file may be helpful, especially if you are using dplyr, given that both packages are part of the tidyverse. The vignette on semantics may be particularly useful for seeing how Stata variables are treated when imported into R.
Two possible solutions are:
filter(AR_HIV_dataset, as.numeric(hiv03) == 1) # factor codes start at 1, and "hiv negative" is level 1
Or, somewhat nicer
filter(AR_HIV_dataset, hiv03 == levels(hiv03)[1])
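A quick sketch of the haven route. The tibble below is a made-up stand-in for read_dta("RWAR71FL.DTA"), and the calls are guarded in case haven/dplyr are not installed; the point is that haven keeps the Stata value labels on top of the original numeric codes, so hiv03 == 0 works directly:

```r
if (requireNamespace("haven", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  # Stand-in for: AR_HIV_dataset <- haven::read_dta("RWAR71FL.DTA")
  AR_HIV_dataset <- dplyr::tibble(
    hiv03 = haven::labelled(c(0, 0, 1, 0),
                            c("hiv negative" = 0, "hiv positive" = 1))
  )
  # Filter by the underlying Stata code...
  neg_by_code  <- dplyr::filter(AR_HIV_dataset, hiv03 == 0)
  # ...or by the label, after converting to a factor:
  neg_by_label <- dplyr::filter(AR_HIV_dataset,
                                haven::as_factor(hiv03) == "hiv negative")
}
```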

Return 0 for 'subscript out of bounds error'

Is there a simple way to deal with the “subscript out of bounds” error? I’d like to return 0 where it occurs, instead of having the error interrupt the code.
I understand the nature of the error in my context and it’s a perfectly legitimate finding that I'd like to report:
• I’m capturing the difference between two data items.
• So if, e.g., there are 0 instances of case "a" (which might be where the difference is 0), then I get the subscript out of bounds error when looking up the matrix row named “a”, whereas I’d like to report this finding as 0.
In the following simplified example, both CASE 1 and CASE 2 occur across my various matrices, but CASE 2 returns the 'subscript out of bounds' error.
# CASE 1
distribution <- matrix(c(1:12), nrow=12, ncol=1)
rownames(distribution) <- letters[1:12]
distribution["a",]
# CASE 2
distribution <- matrix(c(1:11), nrow=11, ncol=1)
rownames(distribution) <- letters[2:12]
distribution["a",]
and want to report the finding...
distribution["a",]
...for each of my matrices.
Something equivalent to the IFERROR formula in Excel is what I'm after, I guess.
Any thoughts / alternative suggestion to the problem are much appreciated.
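One base-R sketch of an IFERROR-style lookup (the helper name iferror is made up here): either test for the row name before indexing, or trap the error with tryCatch.

```r
# CASE 2 from the question: no "a" row
distribution <- matrix(1:11, nrow = 11, ncol = 1)
rownames(distribution) <- letters[2:12]

# Option 1: check membership before indexing
value <- if ("a" %in% rownames(distribution)) distribution["a", ] else 0

# Option 2: a general IFERROR-like wrapper
iferror <- function(expr, otherwise = 0) {
  tryCatch(expr, error = function(e) otherwise)
}
value2 <- iferror(distribution["a", ])    # 0 instead of an error
value3 <- iferror(distribution["b", ])    # the normal lookup still works
```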

Convert/transform an abundance (OTU) table/data.frame (to a fasta file) in R

I'm working on a large dataset at the moment, and so far I could solve all my ideas/problems via countless Google searches and long trial & error sessions very well. I've managed to use plyr and reshape functions for some transformations of my different datasets and learned a lot, but I think I've reached a point where my present R knowledge won't help me anymore.
Even if my question sounds very specific (i.e. OTU table and fasta file) I guess my attempt is a common R application across many different fields (and not just bioinformatics).
Right now, I have merged a reference sequence file with an abundance table, and I would like to generate a specific file based on the information in this data.frame: a fasta file.
My df looks a bit like this at the moment:
repSeq sw.1.102 sw.3.1021 sw.30.101 sw.5.1042 ...
ACCT-AGGA 3 0 1 0
ACCT-AGGG 1 1 2 0
ACTT-AGGG 0 1 0 25
...
The resulting file should look like this:
>sw.1.102_1
ACCT-AGGA
>sw.1.102_2
ACCT-AGGA
>sw.1.102_3
ACCT-AGGA
>sw.1.102_4
ACCT-AGGG
>sw.3.1021_1
ACCT-AGGG
>sw.3.1021_2
ACTT-AGGG
>sw.30.101_1
ACCT-AGGA
>sw.30.101_2
ACCT-AGGG
...
As you can see I would like to use the information about the number of (reference) sequences for each sample (i.e. sw.n) to create a (fasta) file.
I have no experience with loops in R (I have used only basic loops during simple processing attempts), but I assume they could do the trick here. I have found the write.fasta function from the SeqinR package, but I could not find any solution there. The deunique.seqs command in mothur won't work, because it needs a fasta file as input (which I obviously don't have). It could very well be that there is something on Bioconductor (OTUbase?), but to be honest, I don't know where to begin, and I'm glad about any help.
And I really would like to do this in R, since I enjoy working with it, but any other ideas are also very welcome.
//small edit:
Both answers below work very well (see my comments) - I also found two possible not-so-elegant & non-R workarounds (not tested yet):
since I already have a taxonomy file and an abundance OTU table, I think the mothur command make.biom could be used to create a biom-format file. I haven't worked with biom files yet, but I think there are some tools and scripts available to save the biom-file data as fasta again
convert Qiime files to oligotyping format - this also needs a taxonomy file and an Otu table
Not sure if both ways work - therefore, please correct me if I'm wrong.
Here's your data, coerced to a matrix (which is a more natural representation for rectangular data of homogeneous type).
df <- read.delim(textConnection(
"repSeq sw.1.102 sw.3.1021 sw.30.101 sw.5.1042
ACCT-AGGA 3 0 1 0
ACCT-AGGG 1 1 2 0
ACTT-AGGG 0 1 0 25"
), sep="", row.names=1)
m <- as.matrix(df)
The tricky part is to figure out how to number the duplicated column name entries. I did this by creating sequences of the appropriate length and un-listing. I then created a matrix with two rows, the first (from replicating the colnames() as required by entries in the original matrix) is the id, and the second the sequence.
csum <- colSums(m)
idx <- unlist(lapply(csum, seq_len), use.names=FALSE)
res <- matrix(c(sprintf(">%s_%d", rep(colnames(m), csum), idx), # id
rep(rownames(m)[row(m)], m)), # sequence
nrow=2, byrow=TRUE)
Use writeLines(res, "your.fasta") to write out the results, or setNames(res[2,], res[1,]) to get a named vector of sequences.
Try this; it goes through the data frame row by row and concatenates repetitions of sequences. The per-sample offsets are precomputed up front so the ids keep counting up within each sample across rows (this assumes repSeq is the first column of df, as in your printout):
offs <- apply(df[-1], 2, cumsum) - as.matrix(df[-1])  # reads already emitted per sample
fasta_seq <- sapply(seq_len(nrow(df)), function(i) {
  p <- df[i, 1]
  paste(unlist(mapply(function(s, y, z) {
    if (y > 0) paste0(">", s, "_", (z + 1):(z + y), "\n", p, "\n")
  }, colnames(df)[-1], unlist(df[i, -1]), offs[i, ], USE.NAMES = FALSE)), collapse = "")
})
write(paste(fasta_seq,collapse=""),"your_file.txt")

Importing edge list in igraph in R

I'm trying to import an edge list into igraph's graph object in R. Here's how I'm trying to do so:
graph <- read.graph(edgeListFile, directed=FALSE)
I've used this method before a million times, but it just won't work for this specific data set:
294834289 476607837
560992068 2352984973
560992068 575083378
229711468 204058748
2432968663 2172432571
2473095109 2601551818
...
R throws me this error:
Error in read.graph.edgelist(file, ...) :
At structure_generators.c:84 : Invalid (negative) vertex id, Invalid vertex id
The only difference I see between this dataset and the ones I previously used is that those were in sorted form, starting from 1:
1 1
1 2
2 4
...
Any clues?
It seems likely that igraph is trying to interpret the values as vertex indices rather than node names, and is storing them in a signed 32-bit integer field that the larger ids overflow into negative numbers (e.g. 2352984973 exceeds the maximum of 2147483647). One potential workaround is
library("igraph")
dd <- read.table("test.txt")
gg <- graph.data.frame(dd, directed=FALSE)
plot(gg)
It seems this method doesn't have the overflow problem (assuming that's what it was).
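Another way to sidestep the issue, if that diagnosis is right, is to force the ids to character when reading, so igraph can only treat them as vertex names (guarded here in case igraph is not installed; the inline text stands in for your edge list file):

```r
if (requireNamespace("igraph", quietly = TRUE)) {
  # A few rows from the question, read as character via colClasses
  dd <- read.table(text = "294834289 476607837
560992068 2352984973", colClasses = "character")
  gg <- igraph::graph_from_data_frame(dd, directed = FALSE)
  # four distinct ids and two edges survive intact as vertex names
}
```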

Create warnings in R

I want to write a script that makes R usable for "everybody" for this particular kind of analysis. Is there a way to create warnings?
time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3
For example, if the value is 0 at least 3 times in a row (or better, within a set period of time, e.g. 3 days), issue a warning and name the dates. Maybe create something like a report if I am combining conditions.
In general: measurement data are read via read.csv, and the dates are then set with as.POSIXct and handled as xts/zoo objects. I want the "user" to get a clear message if the values are changing, if they are 0 for a long time, etc.
The second step would be sending emails - maybe running on a server later.
Additional Questions:
I now have my data as an xts object. Is it possible to check whether a value is greater than a threshold? My attempt is not working because the object is not an atomic vector.
Thanks
Try this.
x <- read.table(text = "time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3", header = TRUE, sep = ",")
r <- rle(x$value)
if (any(r$lengths >= 3 & r$values == 0)) warning("I noticed some dates have value 0 at least three times.")
Warning message:
I noticed some dates have value 0 at least three times.
I'll leave it to you as a training exercise to paste a warning message that would also give you the date(s).
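For reference, here is one way that exercise might come out, using the run boundaries from rle() to recover the dates (x as read in the answer above):

```r
x <- read.table(text = "time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3", header = TRUE, sep = ",")

r <- rle(x$value == 0)                      # runs of zero / non-zero values
run_end   <- cumsum(r$lengths)
run_start <- run_end - r$lengths + 1
hits <- which(r$values & r$lengths >= 3)    # runs of at least 3 zeros
for (i in hits) {
  dates <- x$time[run_start[i]:run_end[i]]
  warning("Value was 0 on: ", paste(dates, collapse = ", "))
}
```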
