DADA2 - Calculating percent of reads that merged - r

I have been following the DADA2 tutorial in R for a 16S data set, and everything runs smoothly; however, I do have a question on how to calculate the total percentage of merged reads. After the merging step with the following code:
merger <- mergePairs(dadaF1, derepF1, dadaR1, derepR1, verbose=TRUE)
and then tracking the reads through each step:
getN <- function(x) sum(getUniques(x))
track <- cbind(out_2, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
I get a table that looks like this (viewing the resulting track data frame made with the above code):
Here, input is the total number of sequences I put in (after demultiplexing) and filtered is the total number of sequences remaining after filtering on parameters of my choosing. The denoisedF and denoisedR columns are the denoised sequences (forward and reverse reads, respectively), merged is the total number of merged reads (from the mergePairs command above), and nonchim is the total number of sequences that are not chimeras.
My question is this: to calculate the percentage of merged reads, is it a simple division? Take the first row, for example: (417/908) * 100 = 46%. Or should I somehow incorporate the denoisedF and denoisedR columns into this calculation?
Thank you very much in advance!

The track object is a matrix (see class(track)), thus you can run operations accordingly. In your case:
track[, "merged"]/track[, "input"] * 100
Or, you could convert the track object into a data frame for a "table" output.
However, I usually export the track output as an Excel file and then do my modifications there. It is easier to share and comment on with non-R users.
I find the write_xlsx function from the writexl package particularly convenient.
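For example, a minimal sketch of that workflow, assuming track carries the column names from the tutorial (input, filtered, denoisedF, denoisedR, merged, nonchim); the output file name is just a placeholder:
library(writexl)  # provides write_xlsx()
# keep the sample names as a column, since write_xlsx does not write row names
track_df <- cbind(sample = rownames(track), as.data.frame(track))
# percent of input reads that survived all the way through merging, per sample
track_df$pct_merged <- track_df$merged / track_df$input * 100
write_xlsx(track_df, "read_tracking.xlsx")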
Cheers

Related

R Updating A Column In a Large Dataframe

I've got a dataframe, stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game and each column holds details about the game (e.g. who played, what their rankings were, when it was played, etc). I have a column called "Analyzed", which records whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading it in and writing it out, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood of accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. Even if the API call itself took no time, the sleep alone means the program would need to run for about 90 hours. My goal is to make it easy to run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great for getting whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet is to rewrite the dataframe and replace the old Games.csv every 1,000 or so API calls. See the commented code below.
My overall question is: when I need to update a column in a large csv, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)
df <- read.csv("Games.csv")

for(i in 1:nrow(df)){
  data <- read_json(df$urls[i])
  if(data$analysisLogExists == TRUE){
    df$Analyzed[i] <- 1
  }
  if(data$analysisLogExists == FALSE){
    df$Analyzed[i] <- 0
  }
  Sys.sleep(.25)
  ## This won't work because the second time I run it I'll just reread the original lines.
  ## If I try to account for this by subsetting only the rows that haven't been updated,
  ## then this still doesn't work, because the write command below will no longer be writing the whole dataset to the csv.
  if(i %% 1000 == 0){
    write.csv(df, "Games.csv", row.names = F)
  }
}
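Not an authoritative answer, but for illustration, here is one possible shape of the resume-in-chunks idea described above. It assumes the Analyzed column starts out as NA for games that have not been checked yet (that initialisation is an assumption, not part of the original code), so the loop can be stopped and restarted without redoing earlier API calls:
library(jsonlite)

df <- read.csv("Games.csv")
todo <- which(is.na(df$Analyzed))  # only the games not checked yet

for (k in seq_along(todo)) {
  i <- todo[k]
  data <- read_json(df$urls[i])
  df$Analyzed[i] <- as.integer(isTRUE(data$analysisLogExists))
  Sys.sleep(0.25)
  if (k %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)  # checkpoint every 1000 calls
  }
}
write.csv(df, "Games.csv", row.names = FALSE)  # final write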

correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to do: I just need to calculate the correlation of the columns (or rows, if transposed). Simple enough? Yet I have been unable to get the results for a whole week, and I have looked through most of the solutions here.
My laptop has 4 GB of RAM. I do have access to a server with 32 nodes. My data cannot be posted here as it is huge (411k columns and 100 rows). If you need any other information, or maybe part of the data, I can try to put it up here, but the problem can easily be explained without seeing the data. I simply need a correlation matrix of size 411k x 411k, which means computing the correlation among all 411k variables in my data.
Concepts I have tried to code (all of them either give me memory issues or run forever):
The simplest way: one row against all, writing the result out with append = TRUE. (Runs forever.)
bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using an ff matrix. (Unable to allocate memory to assign the corMAT matrix using ff() on my server.)
Splitting the data into sets (every 10,000 consecutive rows forms a set) and correlating each set against the others (same logic as bigcorPar), but I am unable to find a way to store them all together to finally generate the full 411k x 411k matrix.
I am attempting this now: bigcorPar.r on 10,000 rows against 411k (so the 10,000 is divided into blocks), saving the results in separate csv files.
I am also running every 1,000 vs 411k on one node of my server; today is my third day and I am still on row 71.
I am not an R pro, so I could only attempt this much. Either my code runs forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in binary format (economical compared to text) they would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this is what you want and you are OK with the storage required, I can provide my code for such calculations and storage.
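For illustration, a generic sketch of the block-wise idea from the question (this is not MatrixEQTL and not the answerer's storage code; the file names and block size are made up, and it assumes the 100 x 411k input matrix itself fits in memory, roughly 330 MB as doubles):
dat <- as.matrix(read.csv("data.csv"))   # 100 rows x 411k columns (assumed layout)
block_size <- 5000
blocks <- split(seq_len(ncol(dat)), ceiling(seq_len(ncol(dat)) / block_size))

# Correlate every pair of column blocks and write each tile of the
# 411k x 411k matrix to its own file (upper triangle only).
for (i in seq_along(blocks)) {
  for (j in i:length(blocks)) {
    cm <- cor(dat[, blocks[[i]]], dat[, blocks[[j]]])
    saveRDS(cm, sprintf("cor_block_%03d_%03d.rds", i, j))
  }
}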

Sample A CSV File Too Large To Load Into R?

I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, then sooner or later you will need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data has a consistent format and each row is one record, you can do:
sort -R data | head -n 1000 >data.sample
This will randomly sort all the rows and put the first 1000 of them into a separate file, data.sample.
(2) Use a database if the data cannot comfortably fit into memory.
For example, I have many tables stored in a MySQL database in a nice tabular format, and I can take a sample with:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also verify the mean or standard deviation of the whole dataset against your sample if you want, letting the database do the heavy lifting.
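For example, a minimal sketch of pulling such a sample straight into R with RMySQL (connection details and table name are placeholders):
library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")
sample_df <- dbGetQuery(con, "SELECT * FROM tablename ORDER BY RAND() LIMIT 1000")
dbDisconnect(con)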
These are the two ways I most commonly use to deal with 'big' data, based on my experience.

CSV file to Histogram in R

I'm a total newbie with R, and I'm trying to create a histogram (with value and frequency as the axes) from a csv file (just one row of values). Any idea how I can do this?
I'm also an R newbie, and I ran into the same thing. I made two separate mistakes, actually, so I'll describe them both here.
Mistake 1: Passing a frequency table to hist(). Originally I was trying to pass a frequency table to hist() instead of passing in the raw data. One way to fix this is to use the rep() ("replicate") function to explode your frequency table back into a raw dataset, as described here:
Creating a histogram using aggregated data
Simple R (histogram) from counted csv file
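For example, a minimal sketch of the rep() approach, assuming a frequency table with hypothetical numeric columns value and count:
freq <- read.csv("counts.csv")               # columns: value, count (assumed names)
raw  <- rep(freq$value, times = freq$count)  # explode the counts back into raw observations
hist(raw)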
Instead of that, though, I just decided to read in my original dataset instead of the frequency table.
Mistake 2: Wrong data type. My raw data CSV file contains two columns, hostname and bookings (the idea is to count the number of bookings each host generated during a given time period). I read it into a table:
> tbl <- read.csv('bookingsdata.csv')
Then when I tried to generate a histogram off the second column, I did this:
> hist(tbl[2])
This gave me the "'x' must be numeric" error you mention in a comment. (It was trying to read the "bookings" column header in as a data value.)
This fixed it:
> hist(tbl$bookings)
You should really start by reading a basic R manual...
CRAN offers a lot of them (look in the Manuals and Contributed sections).
In any case:
setwd("path/to/csv/file")
myvalues <- read.csv("filename.csv")
hist(myvalues[[1]], 100) # hist() needs a numeric vector, not a data frame; 100 breaks as an example, specify them at will
See the manual pages for those functions for more help (accessible through ?read.table, ?read.csv and ?hist).
To plot the histogram, the values must be of numeric class, i.e. the data must be numeric. Here the value of x seems to be of some other class.
Run the following command and see:
sapply(myvalues[1,],class)

Import Large Unusual File To R

First time poster here, so I'll try to make myself as clear as possible about the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data covering about 2.5 years, with each day in its own file. The files are .txt, consist of approximately 20-30 million rows, and average roughly 360 MB each. I am working on one file at a time for now. I don't need all the data these files contain, and I was hoping I could use R to trim the files down a bit.
Now my problem is that I am having some difficulty writing the proper code so that R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see, each line begins with a letter, and each letter denotes what the line means. For instance, R means order book directory message, M means milliseconds after the last second, and H means stock trading action message. There are 14 different letters used in total.
I have used the readLines function to import the data into R. However, this seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of if logic that says: if the first letter is R, then offsets 1 to 4 mean Market Segment Identifier, and so on, and have R add columns for these so I can work with the data in a more structured fashion.
What is the best way of importing such data and creating some form of structure - i.e. using the unique ID information in each line to analyze one stock at a time, for instance?
You can try something like this:
options(stringsAsFactors = FALSE)
# Build one row of the "A" table from a raw line (split on spaces, take fields 2-5)
f_A <- function(line, tab_A){
  values <- unlist(strsplit(line, " "))[2:5]
  rbind(tab_A, list(name_1 = as.character(values[1]), name_2 = as.numeric(values[2]),
                    name_3 = as.numeric(values[3]), name_4 = as.numeric(values[4])))
}
# Empty accumulator for the "A" messages
tab_A <- data.frame(name_1 = character(), name_2 = numeric(), name_3 = numeric(),
                    name_4 = numeric(), stringsAsFactors = FALSE)
# Dispatch on the first character of each line
for(i in readLines(con = "/home/data.txt")){
  switch(strsplit(x = i, split = "")[[1]][1],
         M = cat("1\n"), R = cat("2\n"), D = cat("3\n"),
         A = (tab_A <- f_A(i, tab_A)))
}
Then replace cat() with functions that add values to each type of data frame. Use the pattern of f_A() to construct the other functions, and do the same for the other table structures.
You can combine your readLines() command with regular expressions. To get more information about regular expressions, look at the R help for grep():
> ?grep
So you can go through all the lines, check what each line means, and then handle or store its content however you like. (Regular expressions are also useful for splitting the data within a single line.)
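For example, a minimal sketch of that idea, keeping only the R (order book directory) lines and slicing them into columns; the field offsets here are purely illustrative, the real ones come from the message specification:
lines   <- readLines("data.txt")
r_lines <- grep("^R", lines, value = TRUE)   # keep only the order book directory messages
r_tab <- data.frame(
  segment = substr(r_lines, 2, 5),           # an "offset 1 to 4" style fixed-width slice (illustrative)
  rest    = substring(r_lines, 6),           # everything after that field, to be split further
  stringsAsFactors = FALSE
)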
