ResourceExhaustedError while generating tf records in VGGish model - runtime-error

I have been working with the VGGish model. When I change the window size of the spectrogram, far more examples are produced, and when I then generate the TFRecords I get the following error:
Resource exhausted: OOM when allocating tensor with shape[24373,64,96,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node vggish/conv1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
I searched Stack Overflow for a possible solution; some answers said that reducing the batch size might help. Since I am new to data science, I don't really know how to do that. Can someone help me a little more with this problem?
Here is my code.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
r"""A simple demonstration of running VGGish in inference mode.
This is intended as a toy example that demonstrates how the various building
blocks (feature extraction, model definition and loading, postprocessing) work
together in an inference context.
A WAV file (assumed to contain signed 16-bit PCM samples) is read in, converted
into log mel spectrogram examples, fed into VGGish, the raw embedding output is
whitened and quantized, and the postprocessed embeddings are optionally written
in a SequenceExample to a TFRecord file (using the same format as the embedding
features released in AudioSet).
Usage:
# Run a WAV file through the model and print the embeddings. The model
# checkpoint is loaded from vggish_model.ckpt and the PCA parameters are
# loaded from vggish_pca_params.npz in the current directory.
$ python vggish_inference_demo.py --wav_file /path/to/a/wav/file
# Run a WAV file through the model and also write the embeddings to
# a TFRecord file. The model checkpoint and PCA parameters are explicitly
# passed in as well.
$ python vggish_inference_demo.py --wav_file /path/to/a/wav/file \
--tfrecord_file /path/to/tfrecord/file \
--checkpoint /path/to/model/checkpoint \
--pca_params /path/to/pca/params
# Run a built-in input (a sine wav) through the model and print the
# embeddings. Associated model files are read from the current directory.
$ python vggish_inference_demo.py
"""
from __future__ import print_function
import numpy as np
import six
import soundfile
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import vggish_input
import vggish_params
import vggish_postprocess
import vggish_slim
flags = tf.app.flags
flags.DEFINE_string(
    'wav_file', 'input.wav',
    'Path to a wav file. Should contain signed 16-bit PCM samples. '
    'If none is provided, a synthetic sound is used.')
flags.DEFINE_string(
    'checkpoint', 'vggish_model.ckpt',
    'Path to the VGGish checkpoint file.')
flags.DEFINE_string(
    'pca_params', 'vggish_pca_params.npz',
    'Path to the VGGish PCA parameters file.')
flags.DEFINE_string(
    'tfrecord_file', 'records_for_0.025sec_window.tfrecord',
    'Path to a TFRecord file where embeddings will be written.')
FLAGS = flags.FLAGS
# function to run the postprocessing and add the embeddings to the file
def main(_):
  # In this simple example, we run the examples from a single audio file through
  # the model. If none is provided, we generate a synthetic input.
  if FLAGS.wav_file:
    wav_file = FLAGS.wav_file
  else:
    # Write a WAV of a sine wav into an in-memory file object.
    num_secs = 5
    freq = 1000
    sr = 44100
    t = np.linspace(0, num_secs, int(num_secs * sr))
    x = np.sin(2 * np.pi * freq * t)
    # Convert to signed 16-bit samples.
    samples = np.clip(x * 32768, -32768, 32767).astype(np.int16)
    wav_file = six.BytesIO()
    soundfile.write(wav_file, samples, sr, format='WAV', subtype='PCM_16')
    wav_file.seek(0)

  examples_batch = vggish_input.wavfile_to_examples(wav_file)
  print(examples_batch)
  print("I reached step 1. ", examples_batch.shape)

  # Prepare a postprocessor to munge the model embeddings.
  pproc = vggish_postprocess.Postprocessor(FLAGS.pca_params)

  # If needed, prepare a record writer to store the postprocessed embeddings.
  writer = tf.python_io.TFRecordWriter(
      FLAGS.tfrecord_file) if FLAGS.tfrecord_file else None

  with tf.Graph().as_default(), tf.Session() as sess:
    # Define the model in inference mode, load the checkpoint, and
    # locate input and output tensors.
    vggish_slim.define_vggish_slim(training=False)
    vggish_slim.load_vggish_slim_checkpoint(sess, FLAGS.checkpoint)
    features_tensor = sess.graph.get_tensor_by_name(
        vggish_params.INPUT_TENSOR_NAME)
    embedding_tensor = sess.graph.get_tensor_by_name(
        vggish_params.OUTPUT_TENSOR_NAME)
    print("I reached step 2")

    # Run inference and postprocessing.
    [embedding_batch1] = sess.run([embedding_tensor],
                                  feed_dict={features_tensor: examples_batch[0:6000, :, :]})
    postprocessed_batch = pproc.postprocess(embedding_batch1)
    print("I reached step 3", postprocessed_batch.shape)

    # Write the postprocessed embeddings as a SequenceExample, in a similar
    # format as the features released in AudioSet. Each row of the batch of
    # embeddings corresponds to roughly a second of audio (96 10ms frames), and
    # the rows are written as a sequence of bytes-valued features, where each
    # feature value contains the 128 bytes of the whitened quantized embedding.
    seq_example1 = tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(
            feature_list={
                vggish_params.AUDIO_EMBEDDING_FEATURE_NAME:
                    tf.train.FeatureList(
                        feature=[
                            tf.train.Feature(
                                bytes_list=tf.train.BytesList(
                                    value=[embedding.tobytes()]))
                            for embedding in postprocessed_batch
                        ]
                    )
            }
        )
    )

    [embedding_batch1] = sess.run([embedding_tensor],
                                  feed_dict={features_tensor: examples_batch[6000:12000, :, :]})
    postprocessed_batch = pproc.postprocess(embedding_batch1)
    print("I reached step 3", postprocessed_batch.shape)

    # Write the postprocessed embeddings as a SequenceExample, in a similar
    # format as the features released in AudioSet. Each row of the batch of
    # embeddings corresponds to roughly a second of audio (96 10ms frames), and
    # the rows are written as a sequence of bytes-valued features, where each
    # feature value contains the 128 bytes of the whitened quantized embedding.
    seq_example2 = tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(
            feature_list={
                vggish_params.AUDIO_EMBEDDING_FEATURE_NAME:
                    tf.train.FeatureList(
                        feature=[
                            tf.train.Feature(
                                bytes_list=tf.train.BytesList(
                                    value=[embedding.tobytes()]))
                            for embedding in postprocessed_batch
                        ]
                    )
            }
        )
    )

    if writer:
      writer.write(seq_example1.SerializeToString())
      writer.write(seq_example2.SerializeToString())

  if writer:
    writer.close()


if __name__ == '__main__':
  tf.app.run()
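The tensor shape in the error, [24373, 64, 96, 64], is the output of vggish/conv1 for all 24373 spectrogram examples fed at once; at 4 bytes per float that is roughly 38 GB of activations for that single layer, far more than a GPU can hold, which is why feeding fewer examples per sess.run call helps. Rather than hard-coding two slices, one option is to run inference in a loop over small fixed-size chunks and write one SequenceExample per chunk. The sketch below is not part of the official demo; it assumes the examples_batch, sess, features_tensor, embedding_tensor, pproc and writer variables defined above and would replace the two sess.run blocks inside the with ... as sess: section. The chunk size of 128 is an arbitrary starting point to tune for your GPU memory.

chunk_size = 128  # tune this so one chunk fits in GPU memory
for start in range(0, examples_batch.shape[0], chunk_size):
  chunk = examples_batch[start:start + chunk_size, :, :]
  # Run inference on just this chunk of examples.
  [embedding_chunk] = sess.run([embedding_tensor],
                               feed_dict={features_tensor: chunk})
  postprocessed_chunk = pproc.postprocess(embedding_chunk)
  # One SequenceExample per chunk, same byte layout as above.
  seq_example = tf.train.SequenceExample(
      feature_lists=tf.train.FeatureLists(
          feature_list={
              vggish_params.AUDIO_EMBEDDING_FEATURE_NAME:
                  tf.train.FeatureList(
                      feature=[
                          tf.train.Feature(
                              bytes_list=tf.train.BytesList(
                                  value=[embedding.tobytes()]))
                          for embedding in postprocessed_chunk
                      ])
          }))
  if writer:
    writer.write(seq_example.SerializeToString())

If even a chunk of 128 examples does not fit, lower chunk_size further (or run on CPU); the embeddings written to the TFRecord are the same regardless of how the input is split.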

Related

Generating fastQC report

I am using the fastqcr R package to generate multi-qc and single-qc reports of FASTQ files for RNA-seq analysis. While my multi-qc report works fine, I get the following error when trying to generate a single-qc report from the zipped FastQC results file.
Error in switch(status, PASS = "#00AFBB", WARN = "#E7B800", FAIL =
"#FC4E07") : EXPR must be a length 1 vector
The code I am using is from Step 6 - Building the final report, which creates an HTML file containing FastQC reports of one or multiple samples:
#for multi-qc
qc_report(qc.dir, result.file = "F:/SUDI#UCSF01/COURSES/RNA seq Analysis/scRNA seq by R/My Tutorials/Made by Sudi/Trial Analysis files/FastQC/fastqc_results/multi_qc_report",
experiment = "Exome sequencing of colon cancer cell lines", interpret = TRUE)
# For single-qc
qc.file1 <- "F:/SUDI#UCSF01/COURSES/RNA seq Analysis/scRNA seq by R/My Tutorials/Made by Sudi/Trial Analysis files/FastQC/fastqc_results/ERR522959_2_fastqc.zip"
qc.file1
qc_report(qc.file1, result.file = "F:/SUDI#UCSF01/COURSES/RNA seq Analysis/scRNA seq by R/My Tutorials/Made by Sudi/Trial Analysis files/FastQC/fastqc_results/single_qc_report", interpret = TRUE, preview = TRUE)
Can somebody help me troubleshoot this?
Thank you
I guess it is a version problem, meaning that the scanning of the FastQC report depends on old versions and can't handle new ones. It is just a guess, because if you try to run
qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr")
qc_report(qc.file, result.file = "~/Desktop/result", interpret = TRUE)
This will work.
I am wondering what the advantages of fastqcr are, as MultiQC already has a very clear presentation of the data, and in the newer versions of MultiQC you can also see an overview of failed and succeeded modules.

Calling getSibling() in R to extract single nodes from XML file causes crash

I am attempting to extract one node at a time from a very large (~620 MB) XML file using an R script. Each main node that I want to access corresponds to a different drug, and all of the nodes are parallel to each other. My aim is to process the entire file, one node at a time, since trying to read the entire file into memory does not work with the XML parser in R.
I have significantly truncated my large XML file into a much smaller example file that contains only 4 nodes; the beginning of this XML file looks like:
<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.0" exported-on="2017-07-06">
<drug type="biotech" created="2005-06-13" updated="2016-08-17">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<name>Lepirudin</name>
<description>Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.</description>
<cas-number>138068-37-8</cas-number>
<unii>Y43GF64R34</unii>
<state>liquid</state>
<groups>
<group>approved</group>
</groups>
Having reviewed the available options, and wanting to use the R script that I have already written that extracts desired fields from the XML file (it works for small XML files, but fails for the large file), it seems that using the getSibling() function in the XML library in R is my best choice. The following example code (from http://svitsrv25.epfl.ch/R-doc/library/XML/html/addSibling.html ) works to extract the single node in this example file:
f = system.file("exampleData", "job.xml", package = "XML")
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- getSibling(x)
# I'm still not sure how to "walk" to the next sibling after the above process completes, since this example file only contains one node, and there is no simple way to increment a counter using the above code
That example job.xml file begins as follows:
<!-- Initial Comment -->
<gjob:Helping xmlns:gjob="http://www.gnome.org/some-location">
<gjob:Jobs>
<gjob:Job>
<gjob:Project ID="3"/>
<gjob:Application>GBackup</gjob:Application>
<gjob:Category>Development</gjob:Category>
<gjob:Update>
<gjob:Status>Open</gjob:Status>
<gjob:Modified>Mon, 07 Jun 1999 20:27:45 -0400 MET DST</gjob:Modified>
<gjob:Salary>USD 0.00</gjob:Salary>
</gjob:Update>
<gjob:Developers>
<gjob:Developer>
</gjob:Developer>
</gjob:Developers>
However, if I substitute my own XML file (small version of the full file; I have checked that it is legitimate XML format, as my R script correctly runs to process it), the following code crashes R:
f = "MyTruncatedExampleFile.xml" -> this line causes R to crash
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- getSibling(x)
Can anyone suggest why my own small XML file would cause a crash, but the example job.xml file runs correctly? Thanks in advance for your insights.
Apparently it is the undeclared namespace prefix in the truncated XML that causes the crash: xmlns="http://www.drugbank.ca". If you remove this, the method does not crash R. Do note: undeclared namespaces are valid in XML documents, so this issue should be raised with the XML package authors. Also, since <drug> does not have a sibling in the truncated XML, xmlChildren() is used below in place of getSibling().
# CRASHES
f = "DrugBank.xml"
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- xmlChildren(x)[[1]]
DesiredOutput
# NO CRASHES
f = "DrugBank_oth.xml" # REMOVED UNDEFINED NAMESPACE PREFIX
tt = as(xmlParse(f), "XMLHashTree")
x = xmlRoot(tt, skip = FALSE)
DesiredOutput <- xmlChildren(x)[[1]]
DesiredOutput
As a workaround without modifying the original XML, consider getNodeSet, where you define a prefix for the special namespace and XPath to the children level of the root with /*/*. The index [[1]] is used to take the first node instead of all nodes. Here, web is used as the prefix, but it can be anything.
# NO CRASHES
f = "DrugBank.xml"
doc = xmlParse(f)
nmsp = c(web="http://www.drugbank.ca") # DEFINE NAMESPACE PREFIX
DesiredOutput <- getNodeSet(doc, "/*/*", nmsp)[[1]]
DesiredOutput
# <drug type="biotech" created="2005-06-13" updated="2016-08-17">
# <drugbank-id primary="true">DB00001</drugbank-id>
# <drugbank-id>BTD00024</drugbank-id>
# <drugbank-id>BIOD00024</drugbank-id>
# <name>Lepirudin</name>
# <description>Lepirudin is identical to natural hirudin except for
# substitution of leucine for isoleucine at the N-terminal
# end of the molecule and the absence of a sulfate group on
# the tyrosine at position 63. It is produced via yeast
# cells. Bayer ceased the production of lepirudin (Refludan)
# effective May 31, 2012.</description>
# <cas-number>138068-37-8</cas-number>
# <unii>Y43GF64R34</unii>
# <state>liquid</state>
# <groups>
# <group>approved</group>
# </groups>
# </drug>

How to see the actual memory size of a big.matrix object of bigmemory package?

I am using the bigmemory package to load a heavy dataset, but when I check the size of the object (with the function object.size), it always returns 664 bytes. As far as I understand, the size should be almost the same as for a classic R matrix, depending on the class (double or integer). Why, then, do I obtain 664 bytes as an answer? Below is reproducible code. The first chunk is really slow, so feel free to reduce the number of simulated values; 10^6 * 20 will be enough.
# CREATE BIG DATABASE -----------------------------------------------------
data <- as.data.frame(matrix(rnorm(6 * 10^6 * 20), ncol = 20))
write.table(data, file = "big-data.csv", sep = ",", row.names = FALSE)
format(object.size(data), units = "auto")
rm(list = ls())
# BIGMEMORY READ ----------------------------------------------------------
library(bigmemory)
ini <- Sys.time()
data <- read.big.matrix(file = "big-data.csv", header = TRUE, type = "double")
print(Sys.time() - ini)
print(object.size(data), units = "auto")
To determine the size of the bigmemory matrix use:
> GetMatrixSize(data)
[1] 9.6e+08
Explanation
Data stored in big.matrix objects can be of type double (8 bytes, the default), integer (4 bytes), short (2 bytes), or char (1 byte). For the example above, 6 * 10^6 * 20 doubles at 8 bytes each comes to 9.6e8 bytes, which is exactly what GetMatrixSize() reports.
The reason for the size disparity is that data stores a pointer to a memory-mapped file. You should be able to find the new file in the temporary directory of your machine. - [Paragraph quoted from R High Performance Programming]
Essentially, bigmatrix maintains a binary data file on the disk called a backing file that holds all of the values in a data set. When values from a bigmatrix object are needed by R, a check is performed to see if they are already in RAM (cached). If they are, then the cached values are returned. If they are not cached, then they are retrieved from the backing file. These caching operations reduce the amount of time needed to access and manipulate the data across separate calls, and they are transparent to the statistician.
See page 8 of the documentation for a description
https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf
Ref:
R High Performance Programming by Aloysius Lim and William Tjhi
Data Science in R by Duncan Temple Lang and Deborah Nolan

Predict memory usage in R

I have downloaded a huge file (~300 MB) from the UCI Machine Learning dataset library.
Is there a way to predict the memory required to load the dataset, before loading it into R memory?
I have googled a lot, but all I could find is how to calculate memory usage with the R profiler and several other packages, and only after the objects have already been loaded into R.
based on "R programming" coursera course, U can calculate the proximate memory usage using number of rows and columns within the data" U can get that info from the codebox/meta file"
memory required = no. of column * no. of rows * 8 bytes/numeric
so for example if you have 1,500,00 rows and 120 column you will need more than 1.34 GB of spare memory required
U also can apply the same approach on other types of data with attention to number of bytes used to store different data types.
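As a quick sanity check of that figure, here is the arithmetic spelled out (a minimal sketch in Python only for brevity; the arithmetic is language-independent, and the R answers below compute the same estimate natively):

rows, cols, bytes_per_value = 1500000, 120, 8   # 8 bytes per double-precision numeric
total_bytes = rows * cols * bytes_per_value
print(total_bytes)            # 1440000000 bytes
print(total_bytes / 1024**3)  # ~1.34, i.e. about 1.34 GiB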
If your data's stored in a csv file, you could first read in a subset of the file and calculate the memory usage in bytes with the object.size function. Then, you could compute the total number of lines in the file with the wc command-line utility and use the line count to scale the memory usage of your subset to get an estimate of the total usage:
top.size <- object.size(read.csv("simulations.csv", nrow=1000))
lines <- as.numeric(gsub("[^0-9]", "", system("wc -l simulations.csv", intern=T)))
size.estimate <- lines / 1000 * top.size
Presumably there's some object overhead, so I would expect size.estimate to be an overestimate of the total memory usage when you load the whole csv file; this effect will be diminished if you use more lines to compute top.size. Of course, this approach could be inaccurate if the first 1000 lines of your file are not representative of the overall file contents.
R has the function object.size(), which provides an estimate of the memory being used to store an R object.
You can use it like this:
predict_data_size <- function(numeric_size, number_type = "numeric") {
  if (number_type == "integer") {
    byte_per_number = 4
  } else if (number_type == "numeric") {
    byte_per_number = 8  # 8 bytes per number
  } else {
    stop(sprintf("Unknown number_type: %s", number_type))
  }
  estimate_size_in_bytes = (numeric_size * byte_per_number)
  class(estimate_size_in_bytes) = "object_size"
  print(estimate_size_in_bytes, units = "auto")
}
# Example
# Matrix (rows=2000000, cols=100)
predict_data_size(2000000*100, "numeric") # 1.5 Gb

Extracting binary data from a mixed data file

I am trying to read binary data from a mixed (ASCII and binary) data file using R; the file is constructed in a pseudo-XML format. The idea I had was to use the scan function, read the specific lines and then convert the binary to numerical values, but I can't seem to do this in R. I have a Python script that does this (included below), but I would like to do the job in R. The binary section within the data file is enclosed by the start and end tags <BinData> and </BinData>.
The data file is a proprietary format containing spectroscopic data, a link to an example data file is included below. To quote the user manual:
Data of BinData elements are written as a binary array of bytes. Each
8 bytes of the binary array represent a one double-precision
floating-point value. Therefore the size of the binary array is
NumberOfPoints * 8 bytes. For two-dimensional arrays, data layout
follows row-major form used by SafeArrays. This means that moving to
next array element increments the last index. For example, if a
two-dimensional array (e.g. Data(i,j)) is written in such
one-dimensional binary byte array form, moving to the next 8 byte
element of the binary array increments last index of the original
two-dimensional array (i.e. Data(i,j+1)). After the last element of
the binary array the combination of carriage return and linefeed
characters (ANSI characters 13 and 10) is written.
Thanks for any suggestions in advance!
Link to example data file:
https://docs.google.com/file/d/0B5F27d7b1eMfQWg0QVRHUWUwdk0/edit?usp=sharing
Python script:
import sys, struct, csv
f=open(sys.argv[1], 'rb')
#
t = f.read()
i = t.find("<BinData>") + len("<BinData>") + 2 # add \r\n line end
header = t[:i]
#
t = t[i:]
i = t.find("\r\n</BinData>")
bin = t[:i]
#
doubles=[]
for i in range(len(bin)/8):
    doubles.append(struct.unpack('d', bin[i*8:(i+1)*8])[0])
#
footer = t[i+2:]
#
myfile = open("output.csv", 'wb')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(doubles)
I wrote the pack package to make this easier. You still have to search for the start/end of the binary data though.
b <- readBin("120713b01.ols", "raw", 4000)
# raw version of the start of the BinData tag
beg.raw <- charToRaw("<BinData>\r\n")
# only take first match, in case binary data randomly contains "<BinData>\r\n"
beg.loc <- grepRaw(beg.raw,b,fixed=TRUE)[1] + length(beg.raw)
# convert header to text
header <- scan(text=rawToChar(b[1:beg.loc]),what="",sep="\n")
# search for "<Number of Points"> tags and calculate total number of points
numPts <- prod(as.numeric(header[grep("<Number of Points>",header)+1]))
library(pack)
Data <- unlist(unpack(rep("d", numPts), b[beg.loc:length(b)]))
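For reference, the byte layout described in the manual quote (NumberOfPoints * 8 bytes between <BinData> and </BinData>, row-major for two-dimensional data) can also be made explicit with NumPy in the original Python approach. This is a minimal sketch, not part of the pack-based answer above; it assumes the same file name used in that answer and native byte order, mirroring struct.unpack('d') in the question's script:

import numpy as np

raw = open("120713b01.ols", "rb").read()                 # whole file as bytes
beg = raw.find(b"<BinData>\r\n") + len(b"<BinData>\r\n")
end = raw.find(b"\r\n</BinData>", beg)
payload = raw[beg:end]                                   # NumberOfPoints * 8 bytes

doubles = np.frombuffer(payload, dtype=np.float64)       # one value per 8 bytes
print(len(doubles), doubles[:5])
# For two-dimensional data, reshape row-major (the last index varies fastest),
# e.g. doubles.reshape(n_rows, n_cols) with the two <Number of Points> values.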
