How can I read in a fasta file (~4 Gb) and calculate nucleotide frequencies in a window of 4 bps in length?
it takes too long to read in the fasta file using
library(ShortRead)
readFasta('myfile.fa')
I have tried to index it using (and there are many of them)
library(Rsamtools)
indexFa('myfile.fa')
fa = FaFile('myfile.fa')
however I do not know how to access the file in this format
I would guess that 'slow' to read in a file that size would be a minute; longer than that and something other than software is the problem. Maybe it's appropriate to ask where your file comes from, your operating system, and whether you have manipulated the files (e.g., trying to open them in a text editor) before processing.
If 'too slow' is because you are running out of memory, then reading in chunks might help. With Rsamtools
fa = "my.fasta"
## indexFa(fa) if the index does not already exist
idx = scanFaIndex(fa)
create chunks of index, e.g., into n=10 chunks
chunks = snow::splitIndices(length(idx), 10)
and then process the file
res = lapply(chunks, function(chunk, fa, idx) {
dna = scanFa(fa, idx[chunk])
## ...
}, fa, idx)
Use do.call(c, res) or similar to concatenate the final result, or perhaps use a for loop if you're accumulating a single value. Indexing the fasta file is via a call to the samtools library; using samtools on the command line is also an option, on non-Windows.
An alternative is to use Biostrings::fasta.index() to index the file, then chunk through with that
idx = fasta.index(fa, seqtype="DNA")
chunks = snow::splitIndices(nrow(fai), 10)
res = lapply(chunks, function(chunk) {
dna = readDNAStringSet(idx[chunk, ])
## ...
}, idx)
If each record consists of a single line of DNA sequence, then reading the records in to R, in (even-numbered) chunks via readLines() and processing from there is relatively easy
con = file(fa)
open(fa)
chunkSize = 10000000
while (TRUE) {
lines = readLines(fa, chunkSize)
if (length(lines) == 0)
break
dna = DNAStringSet(lines[c(FALSE, TRUE)])
## ...
}
close(fa)
Load the Biostrings Package and then use the readDNAStringSet() method
From example("readDNAStringSet"), slightly modified:
library(Biostrings)
# example("readDNAStringSet") #optional
filepath1 <- system.file("extdata", "someORF.fa", package="Biostrings")
head(fasta.seqlengths(filepath1, seqtype="DNA")) #
x1 <- readDNAStringSet(filepath1)
head(x1)
Related
I've first figured out how to read and name multiple H5 files from my directory, but I'm running into actually being able to graph with them. My problem is multiple - with this type of file, I do not know how to make the columns have the same number of rows and I do not know how to call on specific files.
My initial setup is as followed
library("rhdf5")
library("ggplot2")
library("fs")
library("tidyverse")
wd <- "D:/Data/1282-1329/"
setwd(wd)
testh5 <- H5Fopen("1282.h5")
H5Fclose(testh5)
y <- h5read(file = "1282.h5",
name = "/Signal")
x <- h5read(file = "1282.h5",
name = "/Scan")
The / refers to the H5 files 'Group' and the Signal or Scan refers to the 'Name', thus "/Signal" creates a numerical list with a length of 48 (number of files within 1282-1329). I make multiple lists from each of these by doing
file_paths <- fs::dir_ls("D:/Data/1282-1329/H5")
file_paths
file_Scan <- list()
for (i in seq_along(file_paths)) {
file_Scan[[i]] <- h5read(
file = file_paths[[i]],
name = "/Scan"
)
}
file_Signal <- list()
for (i in seq_along(file_paths)) {
file_Signal[[i]] <- h5read(
file = file_paths[[i]],
name = "/Signal"
)
}
file_Scan <- setNames(file_Scan, file_paths)
file_Signal <- setNames(file_Signal, file_paths)
Thus str(file_Signal) gives me something like..
List of 48
$ D:/Data/1282-1329/H5/1282.h5: num [1:8044(1d)] 11569527 11576106 10848312 11007212 11074822 ...
$ D:/Data/1282-1329/H5/1283.h5: num [1:8045(1d)] 9746633 9886735 10000637 9617273 ...
So my first problem here is [1:8044(1d)] and [1:8045(1d)] - they're one row off. But I'm unable to add in NAs or make the lengths the same as I would a normal list. Is it because I'm thinking about this wrong? I feel like the solution is simple.
My ultimate goal will be to create multiple single plots for each of these files in the directory using something like
for (i in seq_along(file_paths)) {
plots[[i]] = ggplot(file_paths, aes(x=file_Signal, y=file_Scan))+
geom_point(size=1)
}
Then use these to create a rolling gif of the files with Even numbers (1282, 1284, 1286, etc) and Odd numbers (1283, 1285, 1287, etc.)
Thank you for any help or resources to might have to offer.
I'm working with limited RAM (AWS free tier EC2 server - 1GB).
I have a relatively large txt file "vectors.txt" (800mb) I'm trying to read into R. Having tried various methods I have failed to read in this vector to memory.
So, I was researching ways of reading it in in chunks. I know that the dim of the resulting data frame should be 300K * 300. If I was able to read in the file e.g. 10K lines at a time and then save each chunk as an RDS file I would be able to loop over the results and get what I need, albeit just a little slower with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving as a data frame and then to rds? Or any other alternatives?
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
every_nlines,
table_name,
dbname = sub("\\.txt$", ".sqlite", tsv),
...) {
# Prepare reading
con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
init <- TRUE
fill_sqlite <- function(df) {
if (init) {
RSQLite::dbCreateTable(con, table_name, df)
init <<- FALSE
}
RSQLite::dbAppendTable(con, table_name, df)
NULL
}
# Read and fill by parts
bigreadr::big_fread1(tsv, every_nlines,
.transform = fill_sqlite,
.combine = unlist,
... = ...)
# Returns
con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
I am using the R package msa, a core Bioconductor package, for multiple sequence alignment. Within msa, I am using the MUSCLE alignment algorithm to align protein sequences.
library(msa)
myalign <- msa("test.fa", method=c("Muscle"), type="protein",verbose=FALSE)
The test.fa file is a standard fasta as follows (truncated, for brevity):
>sp|P31749|AKT1_HUMAN_RAC
MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
>sp|P31799|AKT1_HUMAN_RAC
MSVVAIVKEGWLHKRGEYIKTWRFLL
When I run the code on the file, I get:
MUSCLE 3.8.31
Call:
msa("test.fa", method = c("Muscle"), type = "protein", verbose = FALSE)
MsaAAMultipleAlignment with 2 rows and 480 columns
aln
[1] MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
[2] MSVVAIVKEGWLHKRGEYIKTWR---FLL
Con MS?VAIVKEGWLHKRGEYIKTWR???FLL
As you can see, a very reasonable alignment.
I want to write the gapped alignment, preferably without the consensus sequence (e.g., Con row), to a fasta file. So, I want:
>sp|P31749|AKT1_HUMAN_RAC
MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
>sp|P31799|AKT1_HUMAN_RAC
MSVVAIVKEGWLHKRGEYIKTWR---FLL
I checked the msa help, and the package does not seem to have a built in method for writing out to any file type, fasta or otherwise.
The seqinr package looks somewhat promising, because maybe it could read this output as an msf format, albeit a weird one. However, seqinr seems to need a file read in as a starting point. I can't even save this using write(myalign, ...).
I wrote a function:
alignment2Fasta <- function(alignment, filename) {
sink(filename)
n <- length(rownames(alignment))
for(i in seq(1, n)) {
cat(paste0('>', rownames(alignment)[i]))
cat('\n')
the.sequence <- toString(unmasked(alignment)[[i]])
cat(the.sequence)
cat('\n')
}
sink(NULL)
}
Usage:
mySeqs <- readAAStringSet('test.fa')
myAlignment <- msa(mySeqs)
alignment2Fasta(myAlignment, 'out.fasta')
I think you ought to follow the examples in the help pages that show input with a specific read function first, then work with the alignment:
mySeqs <- readAAStringSet("test.fa")
myAlignment <- msa(mySeqs)
Then the rownames function will deliver the sequence names:
rownames(myAlignment)
[1] "sp|P31749|AKT1_HUMAN_RAC" "sp|P31799|AKT1_HUMAN_RAC"
(Not what you asked for but possibly useful in the future.) Then if you execute:
detail(myAlignment) #function actually in Biostrings
.... you get a text file in interactive mode that you can save
2 29
sp|P31749|AKT1_HUMAN_RAC MSDVAIVKEG WLHKRGEYIK TWRPRYFLL
sp|P31799|AKT1_HUMAN_RAC MSVVAIVKEG WLHKRGEYIK TWR---FLL
If you wnat to try hacking a function for which you can get a file written in code, then look at the Biostrings detail function code that is being used
> showMethods( f= 'detail')
Function: detail (package Biostrings)
x="ANY"
x="MsaAAMultipleAlignment"
(inherited from: x="MultipleAlignment")
x="MultipleAlignment"
showMethods( f= 'detail', classes='MultipleAlignment', includeDefs=TRUE)
Function: detail (package Biostrings)
x="MultipleAlignment"
function (x, ...)
{
.local <- function (x, invertColMask = FALSE, hideMaskedCols = TRUE)
{
FH <- tempfile(pattern = "tmpFile", tmpdir = tempdir())
.write.MultAlign(x, FH, invertColMask = invertColMask,
showRowNames = TRUE, hideMaskedCols = hideMaskedCols)
file.show(FH)
}
.local(x, ...)
}
You may use export.fasta function from bio2mds library.
# reading of the multiple sequence alignment of human GPCRS in FASTA format:
aln <- import.fasta(system.file("msa/human_gpcr.fa", package = "bios2mds"))
export.fasta(aln)
You can convert your msa alignment first ("AAStringSet") into an "align" object first, and then export as fasta as follows:
library(msa)
library(bios2mds)
mysequences <-readAAStringSet("test.fa")
alignCW <- msa(mysequences)
#https://rdrr.io/bioc/msa/man/msaConvert.html
alignCW_as_align <- msaConvert(alignCW, "bios2mds::align")
export.fasta(alignCW_as_align, outfile = "test_alignment.fa", ncol = 60, open = "w")
If I had 1000 slices of a 64x64 image, I could write in 64x64x1 chunks like this:
using HDF5
filename = "test.h5"
# open file
fmode ="w"
# get a file object
fid = h5open(filename, fmode)
# matrix to write in chunks
B = rand(64,64,1000)
# figure out its dimensions
sizeTuple = size(B)
ndims = length(sizeTuple)
# set up to write in chunks of sizeArray
sizeArray = ones(Int, ndims)
[sizeArray[i] = sizeTuple[i] for i in 1:(ndims-1)] # last value of size array is :...:,1
# create a dataset models within root
dset = d_create(fid, "models", datatype(Float64), dataspace(size(B)), "chunk", sizeArray)
[dset[:,:,i] = slicedim(B, ndims, i) for i in 1:size(B, ndims)]
close(fid)
And this works just fine, but the assignment syntax in dset[:,:,i] is specific to ndims = 3. How can I change that if I had 1000 slices of an arbitrary hyperrectangle specified at runtime? E.g., for B = rand(64,64,3,1000) or rand(64,64,64,3,1000)?
Thanks
So I got the answer from the julia users group on Google after a couple of misfires here. It's delightfully simple:
using HDF5
filename = "test.h5"
# open file
fmode ="w"
# get a file object
fid = h5open(filename, fmode)
# matrix to write in chunks
B = rand(64,64,1000)
# figure out its dimensions
Ndims = ndims(B)
# set up to write in chunks of sizeArray
sizeArray = ones(Int, Ndims)
[sizeArray[i] = size(B, i) for i in 1:(Ndims-1)] # last value of size array is :...:,1
# create a dataset models within root
dset = d_create(fid, "models", datatype(Float64), dataspace(size(B)), "chunk", sizeArray)
# write in slices of (:,:,i)
[dset[(fill(:,Ndims-1))...,i] = slicedim(B, Ndims, i) for i in 1:size(B, Ndims)]
close(fid)
The fill and splat ... syntax is very, very useful.
The following is my code. I am trying get the list of all the files (~20000) that end with .idat and read each file using the function illuminaio::readIDAT.
library(illuminaio)
library(parallel)
library(data.table)
# number of cores to use
ncores = 8
# this gets all the files with .idat extension ~20000 files
files <- list.files(path = './',
pattern = "*.idat",
full.names = TRUE)
# function to read the idat file and create a data.table of filename, and two more columns
# write out as csv using fwrite
get.chiptype <- function(x)
{
idat <- readIDAT(x)
res <- data.table(filename = x, nSNPs = nrow(idat$Quants), Chip = idat$ChipType)
fwrite(res, file.path = 'output.csv', append = TRUE)
}
# using mclapply call the function get.chiptype on all 20000 files.
# use 8 cores at a time
mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores)
After reading and writing info about 1200 files, I get the following message:
Warning message:
In mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores) :
all scheduled cores encountered errors in user code
How do I resolve it?
Calling mclapply() in some instances requires you to specify a random number generator that allows for multiple streams of random numbers.
R version 2.14.0 has an implementation of Pierre L'Ecuyer's multiple pseudo-random number generator.
Try adding the following before the mclapply() call, with a pre-specified value for 'my.seed':
set.seed( my.seed, kind = "L'Ecuyer-CMRG" );