Efficiently split a large audio file in R

Previously I asked this question on SO about splitting an audio file. The answer I got from @Jean V. Adams worked relatively well for small sound objects (downside: the input was stereo but the output was mono):
library(seewave)
# your audio file (using example file from seewave package)
data(tico)
audio <- tico # this is an S4 class object
# the frequency of your audio file
freq <- 22050
# the length and duration of your audio file
totlen <- length(audio)
totsec <- totlen/freq
# the duration that you want to chop the file into
seglen <- 0.5
# defining the break points
breaks <- unique(c(seq(0, totsec, seglen), totsec))
index <- 1:(length(breaks)-1)
# a list of all the segments
subsamps <- lapply(index, function(i) cutw(audio, f=freq, from=breaks[i], to=breaks[i+1]))
I applied this solution to one of the roughly 300 files I'm preparing for analysis (~150 MB), and my computer worked on it for more than 5 hours before I ended up closing the session without it finishing.
Does anyone have any thoughts or solutions to efficiently perform this task of splitting up a large audio file (specifically, an S4 class Wave object) into smaller pieces using R? I'm hoping to cut down drastically on the time it takes to make smaller files out of these larger files, and I'm hoping to use R. However, if I cannot get R to do the task efficiently, I would appreciate suggestions of other tools for the job. The example data above is mono, but my data is in stereo. The example data can be made to be stereo using:
tico@stereo <- TRUE
tico@right <- tico@left
UPDATE
I identified another solution that builds on work from the first solution:
lapply(index, function(i) audio[(breaks[i]*freq):(breaks[i+1]*freq)])
Comparing the performance of three solutions:
# Solution suggested by @Jean V. Adams
system.time(replicate(100,lapply(index, function(i) cutw(audio, f=freq, from=breaks[i], to=breaks[i+1], output="Wave"))))
user system elapsed
1.19 0.00 1.19
# my modification of the previous solution
system.time(replicate(100,lapply(index, function(i) audio[(breaks[i]*freq):(breaks[i+1]*freq)])))
user system elapsed
0.86 0.00 0.85
# solution suggested by @CarlWitthoft
audiomod <- audio[(freq*breaks[1]):(freq*breaks[length(breaks)-1])] # remove unequal part at end
system.time(replicate(100,matrix(audiomod@left,ncol=length(breaks))))+
system.time(replicate(100,matrix(audiomod@right,ncol=length(breaks))))
user system elapsed
0.25 0.00 0.26
The method using indexing (i.e. [) seems to be faster (3-4x). @CarlWitthoft's solution is even faster; the downside is that it puts the data into a matrix rather than multiple Wave objects, which I will be saving using writeWave. Presumably, converting from the matrix format to separate Wave objects will be relatively trivial if I properly understand how to create this type of S4 object. Any further room for improvement?

The approach I ended up using builds off of the solutions offered by @CarlWitthoft and @JeanV.Adams. It is quite fast compared to the other techniques I was using, and it has allowed me to split a large number of my files in a matter of hours, rather than days.
Here is the whole process using a small Wave object as an example (my current audio files range up to 150 MB in size, but in the future I may receive much larger files, e.g. recordings covering 12-24 hours, where memory management will become more important):
library(seewave)
library(tuneR)
data(tico)
# force to stereo
tico@stereo <- TRUE
tico@right <- tico@left
audio <- tico # this is an S4 class object
# the frequency of your audio file
freq <- 22050
# the length and duration of your audio file
totlen <- length(audio)
totsec <- totlen/freq
# the duration that you want to chop the file into (in seconds)
seglen <- 0.5
# defining the break points
breaks <- unique(c(seq(0, totsec, seglen), totsec))
index <- 1:(length(breaks)-1)
# the split
leftmat  <- matrix(audio@left,  ncol = length(breaks) - 2, nrow = seglen * freq)
rightmat <- matrix(audio@right, ncol = length(breaks) - 2, nrow = seglen * freq)
# the warnings are nothing to worry about here...
# convert to list of Wave objects.
subsamps0409_180629 <- lapply(1:ncol(leftmat), function(x)
  Wave(left = leftmat[, x], right = rightmat[, x],
       samp.rate = audio@samp.rate, bit = audio@bit))
# get the last part of the audio file (the part that is shorter than seglen)
lastbitleft  <- audio@left[(breaks[length(breaks)-1]*freq):length(audio)]
lastbitright <- audio@right[(breaks[length(breaks)-1]*freq):length(audio)]
# convert and add the last bit to the list of Wave objects
subsamps0409_180629[[length(subsamps0409_180629)+1]] <-
  Wave(left = lastbitleft, right = lastbitright,
       samp.rate = audio@samp.rate, bit = audio@bit)
This wasn't part of my original question, but my ultimate goal was to save these new, smaller Wave objects.
# finally, save the Wave objects
setwd("C:/Users/Whatever/Wave_object_folder")
# I had some memory management issues on my computer when doing this
# process with large (~ 130-150 MB) audio files so I used rm() and gc(),
# which seemed to resolve the problems I had with allocating memory.
rm("breaks","audio","freq","index","lastbitleft","lastbitright","leftmat",
"rightmat","seglen","totlen","totsec")
gc()
filenames <- paste0("audio_split", 1:length(subsamps0409_180629), ".wav") # breaks was removed above, so index off the list itself
# Save the files
sapply(1:length(subsamps0409_180629),
function(x)writeWave(subsamps0409_180629[[x]],
filename=filenames[x]))
The only real downside here is that the output files are pretty big. For example, I put in a 130 MB file and split it into 18 files, each approximately 50 MB. I think this is because my input file is .mp3 and the output is .wav. I posted this answer to my own question in order to wrap up the problem with the full solution I used, but other answers are appreciated, and I will take the time to look at each one and evaluate what it offers. I am sure there are better ways to accomplish this task, and methods that will work better with very large audio files. In solving this problem, I barely scratched the surface of memory management.
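For the much longer recordings mentioned above (12-24 hours), one option worth sketching (not part of the workflow above; the file name is made up, and it assumes the recordings are .wav, since readWave cannot read .mp3 directly) is to never load the whole file at once: tuneR::readWave() can read a time window straight from disk via its from/to/units arguments.
library(tuneR)
infile <- "long_recording.wav"   # placeholder name
seglen <- 0.5                    # segment length in seconds
hdr    <- readWave(infile, header = TRUE)   # read the header only, no audio data
totsec <- hdr$samples / hdr$sample.rate
starts <- seq(0, totsec, by = seglen)
for (i in seq_len(length(starts) - 1)) {
  # only this one segment is held in memory
  seg <- readWave(infile, from = starts[i], to = starts[i + 1], units = "seconds")
  writeWave(seg, filename = sprintf("audio_split%04d.wav", i))
}
# (the final partial segment and exact boundary handling are omitted for brevity)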

Per Frank's request, here's one possible approach.
Extract the audio@left and audio@right slots' vectors of sound data, then break each up into equal-length sections in one step, something like:
leftsong <- audio@left
leftmat  <- matrix(leftsong, ncol = seglen * freq, byrow = TRUE)  # byrow so each row holds one contiguous segment
Here I've assumed seglen is the distance between breaks[i] and breaks[i+1].
New wave objects can then be created and processed from the matching rows in leftmat and rightmat.
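As a rough sketch of that last step (object names follow the code above; note that matrix() fills column-wise, so byrow = TRUE is what keeps each segment contiguous within its row):
rightmat <- matrix(audio@right, ncol = seglen * freq, byrow = TRUE)
segs <- lapply(seq_len(nrow(leftmat)), function(i)
  Wave(left      = leftmat[i, ],
       right     = rightmat[i, ],
       samp.rate = audio@samp.rate,
       bit       = audio@bit))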

Related

Why does my computer's memory rapidly disappear when I try to process rasters?

I am working with a set of 13 .tif raster files, 116.7 MB each, containing data on mangrove forest distributions in West Africa. Each file holds the distribution for one year (2000-2012). The rasters load into R without any problems and plot relatively easily as well, taking ~20 seconds using base plot() and ~30 seconds using ggplot().
I am running into problems when I try to do any sort of processing or analysis of the rasters. I am trying to do simple raster math, subtracting the 2000 mangrove distribution raster from the 2012 raster to show deforestation hotspots, but as soon as I do, the memory on my computer starts rapidly disappearing.
I have 48 GB of drive space free, but when I start running the raster math, I start to lose a GB of storage every few seconds. This continues until my storage is almost empty, my computer warns me that storage is critically low, and I have to stop R from running. I am running a MacBook Pro with 121 GB of storage and 8 GB of RAM on Big Sur 11.0.1. Does anyone know what could be causing this?
Here's my code:
#import cropped rasters
crop2000 <- raster("cropped2000.tif")
crop2001 <- raster("cropped2001.tif")
crop2002 <- raster("cropped2002.tif")
crop2003 <- raster("cropped2003.tif")
crop2004 <- raster("cropped2004.tif")
crop2005 <- raster("cropped2005.tif")
crop2006 <- raster("cropped2006.tif")
crop2007 <- raster("cropped2007.tif")
crop2008 <- raster("cropped2008.tif")
crop2009 <- raster("cropped2009.tif")
crop2010 <- raster("cropped2010.tif")
crop2011 <- raster("cropped2011.tif")
crop2012 <- raster("cropped2012.tif")
#look at 2000 distribution
plot(crop2000)
#look at 2012 distribuion
plot(crop2012)
#subtract 2000 from 2012 to look at change
chg00_12 <- crop2012 - crop2000
If you work with large datasets that cannot all be kept in RAM, raster will save them to temporary files. This can be especially demanding with raster math, as each step creates a new file. For example, with a Raster* object x,
y <- 3 * (x + 2) - 5
would create three temp files: first for (x + 2), then for the * 3, and then for the - 5. You can avoid that by using functions like calc and overlay:
y <- raster::calc(x, function(i) 3 * (i + 2) - 5)
That would create only one temp file, or none if you provide a filename (which also makes it easier to delete), and perhaps use compression (see ?writeRaster).
Also see ?raster::removeTmpFiles
You can also increase the amount of RAM that raster is allowed to use. See ?raster::rasterOptions.
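For example (a sketch only; the output file name is made up), overlay() does the subtraction in a single pass and can write straight to a compressed file, with the temp-file and memory helpers used alongside it:
library(raster)
chg00_12 <- overlay(crop2012, crop2000,
                    fun = function(a, b) a - b,
                    filename  = "chg00_12.tif",
                    options   = c("COMPRESS=LZW"),  # GDAL compression via writeRaster
                    overwrite = TRUE)
removeTmpFiles(h = 0)            # clear raster's temp files right away
rasterOptions(maxmemory = 1e9)   # tune how much raster keeps in RAM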

Efficient way to read and write data into files over a loop using R

I am trying to read and write data into files at each time step.
To do this, I am using the h5 package to store large datasets, but I find that my code using this package runs slowly. I am working with very large datasets, so I have memory limit issues. Here is a reproducible example:
library(ff)
library(h5)
set.seed(12345)
for (t in 1:3650) {
  print(t)
  ## Initialize the matrix to fill
  mat_to_fill <- ff(-999, dim = c(7200000, 48),
                    dimnames = list(NULL, paste0("P", as.character(seq(1, 48, 1)))),
                    vmode = "double", overwrite = TRUE)
  ## print(mat_to_fill)
  ## summary(mat_to_fill[,])
  ## Create the output file
  f_t <- h5file(paste0("file", t, ".h5"))
  ## Retrieve the matrix at t - 1 if t > 1
  if (t > 1) {
    f_t_1 <- h5file(paste0("file", t - 1, ".h5"))
    mat_t_1 <- f_t_1["testmat"][] ## *********** ##
    ## f_t_1["testmat"][]
  } else {
    mat_t_1 <- 0
  }
  ## Fill the matrix
  mat_to_fill[,] <- matrix(data = sample(1:100, 7200000 * 48, replace = TRUE),
                           nrow = 7200000, ncol = 48) + mat_t_1
  ## mat_to_fill[1:3,]
  ## Write data
  system.time(f_t["testmat"] <- mat_to_fill[,]) ## *********** ##
  ## f_t["testmat"][]
  h5close(f_t)
}
Is there an efficient way to speed up my code (see the lines marked ## *********** ##)? Any advice would be much appreciated.
EDIT
I have tried to create a data frame with the createDataFrame function from the SparkR package, but I get this error message:
Error in writeBin(batch, con, endian = "big") :
long vectors not supported yet: connections.c:4418
I have also tested other functions for writing huge data to file:
test <- mat_to_fill[,]
library(data.table)
system.time(fwrite(test, file = "Test.csv", row.names=FALSE))
user system elapsed
33.74 2.10 13.06
system.time(save(test, file = "Test.RData"))
user system elapsed
223.49 0.67 224.75
system.time(saveRDS(test, "Test.Rds"))
user system elapsed
197.42 0.98 199.01
library(feather)
test <- data.frame(mat_to_fill[,])
system.time(write_feather(test, "Test.feather"))
user system elapsed
0.99 1.22 10.00
If possible, I would like to reduce the elapsed time to <= 1 sec.
SUPPLEMENTARY INFORMATION
I am building an agent-based model with R but I have memory issues because I work with large 3D arrays. In the 3D arrays, the first dimension corresponds to the time (each array has 3650 rows), the second dimension defines the properties of individuals or landscape cells (each array has 48 columns) and the third dimension represents each individual (in total, there are 720000 individuals) or landscape cell (in total, there are 90000 cells). In total, I have 8 3D arrays. Currently, the 3D arrays are defined at initialization so that data are stored in the array at each time step (1 day) using several functions. However, to fill one 3D array at t from the model, I need to only keep data at t – 1 and t – tf – 1, where tf is a duration parameter that is fixed (e.g., tf = 320 days). However, I don’t know how to manage these 3D arrays in the ABM at each time step. My first solution to avoid memory issues was thus to save data that are contained in the 3D array for each individual or cell at each time step (thus 2D array) and to retrieve data (thus read data from files) at t – 1 and t – tf – 1.
Your matrix is 7200000 x 48; with 4-byte values that is 7200000 * 48 * 4 bytes, or about 1.3 GB. At a typical HDD read/write speed of ~120 MB/s, you would be lucky to get 10 seconds per write with an average HDD. With a good SSD you should be able to get 2-3 GB/s, and therefore about 0.5 seconds using the fwrite or write_feather approaches you tried. I assume you don't have an SSD, as it is not mentioned. You have 32 GB of memory, which seems to be enough for 8 datasets of that size, so chances are you are spending time copying this data around in memory. You could try to optimize your memory usage instead of writing to disk, or work with a portion of the dataset at a time, although both approaches probably present implementation challenges. Splitting the data and merging results is a common pattern in distributed computing, which requires splitting datasets and then merging results from multiple workers. Using a database will be slower than plain disk operations, unless it is an in-memory database (which you have said won't fit in memory), or unless you have very specific sparse data that can be easily compressed and extracted.
You can try using:
library(fst)
write.fst(x, path, compress = 50, uniform_encoding = TRUE)
You can find a more detailed comparison here:
https://www.fstpackage.org/
Note: you can tune the compress parameter to trade write speed against file size.
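For example (a sketch; the file name is made up and the column names follow the P1...P48 scheme in the question), the read side can pull back just a slice of the t - 1 file instead of the whole ~1.3 GB matrix, because fst files support random access:
library(fst)
test <- as.data.frame(mat_to_fill[, ])
write_fst(test, "file_t.fst", compress = 50)
part <- read_fst("file_t.fst",
                 columns = c("P1", "P2"),   # only a couple of the 48 columns
                 from    = 1,
                 to      = 100000)          # and only the first 1e5 rows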

R - Changing specific cell values in a large raster layer

I am working with the R "raster" package and have a large raster layer (62,460,098 cells; 12 MB for the object). My cell values range from -1 to 1. I have to replace all negative values with 0 (for example, a cell with the value -1 has to become 0). I tried this:
raster[raster < 0] <- 0
But it keeps overloading my RAM because of the raster size.
OS: Windows 7 64-bits
RAM size: 8GB
Thanks!
You can do
r <- reclassify(raster, c(-Inf, 0, 0))
This will work on rasters of any size (no memory limitation).
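For example (sketch only; the output file name is made up), with an explicit from/to/becomes matrix and a filename, reclassify() processes the raster in chunks and writes straight to disk:
library(raster)
r <- reclassify(raster,
                rcl = cbind(from = -Inf, to = 0, becomes = 0),
                filename  = "raster_nonneg.tif",
                overwrite = TRUE)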
There are several postings that discuss memory issues, and it's not clear whether you have attempted any of those approaches, but you should. The physical constraints are not clear either, so you should edit your question to include the size of the machine and the name of the OS being tortured. I don't know how to construct a toy example that lets me do any testing, but one approach that might not blow up RAM use (as much) would be to first construct a set of indices marking the locations to be "zeroed":
idxs <- which(raster <0, arr.ind=TRUE)
gc() # may not be necessary
Then incrementally replace some fraction of locations, say a quarter or a tenth at a time.
raster[ idxs[ 1:(nrow(idxs)/10), ] ] <- 0
The likely problem with any of this is that R's approach to replacement is not "in place" but rather involves the creation of a temporary copy of the object, which is then reassigned to the original. Good luck.

Very slow raster::sampleRandom, what can I do as a workaround?

tl;dr: why does raster::sampleRandom take so much time, e.g. to extract 3k cells out of 30k cells (over 10k timesteps)? Is there anything I can do to improve the situation?
EDIT: workaround at bottom.
Consider an R script in which I have to read a big file (usually more than 2-3 GB) and perform a quantile calculation over the data. I use the raster package to read the (netCDF) file. I'm using R 3.1.2 under 64-bit GNU/Linux with 4 GB of RAM, 3.5 GB available most of the time.
As the files are often too big to fit into memory (even 2 GB files for some reason will NOT fit into 3 GB of available memory: unable to allocate vector of size 2GB), I cannot always do this, which is what I would do if I had 16 GB of RAM:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(getValues(pr)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
But instead I can sample a smaller number of cells in my files using the sampleRandom() function from the raster package, and still get good statistics.
e.g.:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(sampleRandom(pr, cnsample)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
I perform this over 6 different files (i goes from 1 to 6) which all have about 30k cells and 10k timesteps (so 300M values). Files are:
1.4GB, 1 variable, filesystem 1
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
1.2GB, 1 variable, filesystem 3
1.2GB, 1 variable, filesystem 3
Note that:
the files are on three different NFS filesystems, whose performance I'm not sure of. I cannot rule out that the NFS filesystems' performance varies greatly from one moment to the next.
RAM usage is 100% all of the time when the script runs, but the system does not use all of its swap.
sampleRandom(dataset, N) takes N non-NA random cells from one layer (= one timestep), and reads their content. Does so for the same N cells for each layer. If you visualize the dataset as a 3D matrix, with Z as timesteps, the function takes N random non-NA columns. However, I guess the function does not know that all the layers have the NAs in the same positions, so it has to check that any column it chooses does not have NAs in it.
When using the same commands on files with 8393 cells (about 340MB in total) and reading all the cells, the computing time is a fraction of trying to read 1000 cells from a file with 30k cells.
The full script which produces the output below is here, with comments etc.
If I try to read all the 30k cells:
cannot allocate vector of size 2.6 Gb
If I read 1000 cells:
5 min
45 min
30 min
30 min
20 min
20 min
If I read 3000 cells:
15 min
18 min
35 min
34 min
60 min
60 min
If I try to read 5000 cells:
2.5 h
22 h
for the files after the second one, I had to stop after 18 h because I needed the workstation for other tasks
With more tests, I've been able to find out that it's the sampleRandom() function that's taking most of the computing time, not the calculation of the quantile (which I can speed up using other quantile functions, such as kuantile()).
Why is sampleRandom() taking so long? Why does it perform so strangely, sometimes fast and sometimes very slow?
What is the best workaround? I guess I could manually generate N random cells for the 1st layer and then manually raster::extract for all timesteps.
EDIT:
Working workaround is to do:
cells <- sampleRandom(pr[[1]], cnsample, cells=T) # extract cnsample random cells from the first layer, excluding NAs
cells[,1]
prvals <- pr[cells[,1]] #Read those cells from all layers
qs <- quantile(prvals, probs=qprobs, na.rm=T, type=8, names=F) #Compute quantile
This works and is very fast because all layers have NAs in the same positions. I think this should be an option that sampleRandom() could implement.
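For convenience, the workaround can be wrapped in a small helper (the function name and arguments are mine, not part of the raster package):
sample_quantiles <- function(pr, cnsample, qprobs, gain = 1) {
  cells  <- sampleRandom(pr[[1]], cnsample, cells = TRUE)  # non-NA cells from layer 1 only
  prvals <- pr[cells[, 1]]                                 # read exactly those cells from all layers
  quantile(prvals * gain, probs = qprobs, na.rm = TRUE, type = 8, names = FALSE)
}
# e.g. qs <- sample_quantiles(pr, cnsample, qprobs, gain[i])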

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations that I'm retrieving from a Mongo database. Time is epoch time, and we're recording many thousands of observations per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset, in ten-second intervals (using the order of the factor levels, TCP_HIT = 1 and TCP_MISS = 2, since alphabetical order is used by default):
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
FUN=function(x) sum((2-as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
function(l) list(uri=levels(u$uri)[l],
hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
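For example, a rough sketch of that filter-first route, keeping only the 50 most frequent URIs before applying ratio() (names follow the code above):
top50 <- names(sort(table(u$uri), decreasing = TRUE))[1:50]
u50   <- droplevels(u[u$uri %in% top50, ])
hits_by_uri <- lapply(levels(u50$uri),
                      function(l) list(uri  = l,
                                       hits = ratio(u50[u50$uri == l, ])))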
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here: data.table is just faster.
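As a rough sketch (column names follow the example data above; not tested at the full 20M rows), the same hit percentage per URI in ten-second buckets looks like this with data.table:
library(data.table)
dt    <- as.data.table(u)
top50 <- dt[, .N, by = uri][order(-N)][1:50, uri]    # the 50 most frequent URIs
hit_pct <- dt[uri %in% top50,
              .(hit_pct = mean(action == "TCP_HIT")),
              by = .(uri, bucket = time %/% 10)]     # ten-second buckets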
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine through here, but this is what I came up with. It makes a basic line plot. It's plotting the actual values; I haven't done any conversions.
for (i in seq_along(h)) {
  name   <- unlist(h[[i]][1])
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")
  title(main = name)
}
