Fast way for splitting large .wav file using R

For my work I need to analyze large .wav files (>208 MB), and I make use of the R packages seewave and tuneR. I bring each file into the R environment in 30-second chunks, using the readWave function as follows:
tr1_1 = readWave("TR1_edit.WAV", from = 0,   to = 0.5, units = "minutes")
tr1_2 = readWave("TR1_edit.WAV", from = 0.5, to = 1,   units = "minutes")
tr1_3 = readWave("TR1_edit.WAV", from = 1,   to = 1.5, units = "minutes")
tr1_4 = readWave("TR1_edit.WAV", from = 1.5, to = 2,   units = "minutes")
tr1_5 = readWave("TR1_edit.WAV", from = 2,   to = 2.5, units = "minutes")
and so on. This method works, but it is neither efficient nor pretty. Is there a way to import and split up a large .wav file more efficiently?

If you're loading all of these into memory at the same time, you should be using a list rather than sequential variable names.
tr1 = list()
duration = 0.5
start_times = seq(0, 2, by = duration)
for (i in seq_along(start_times)) {
  tr1[[i]] = readWave('TR1_edit.WAV',
                      from = start_times[i],
                      to = start_times[i] + duration,
                      units = 'minutes')
}
This is the same principle as why you should use a list of data frames rather than sequentially named data frames.
You could easily wrap this in a function that takes the name of a WAV file as input, gets the file's length from its metadata, imports it in 30-second (or parameterized) segments, and returns the list of segments.
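For example, here is a minimal sketch of such a function; the name split_wave and its segment_sec argument are invented for illustration, and it relies on readWave(header = TRUE) returning the sample rate and sample count without loading any audio:
split_wave <- function(file, segment_sec = 30) {
  hdr <- readWave(file, header = TRUE)        # metadata only, no audio data
  total_sec <- hdr$samples / hdr$sample.rate  # file duration in seconds
  n_seg <- ceiling(total_sec / segment_sec)   # last chunk may be shorter, never empty
  starts <- (seq_len(n_seg) - 1) * segment_sec
  lapply(starts, function(s) {
    readWave(file, from = s, to = min(s + segment_sec, total_sec),
             units = "seconds")
  })
}
segments <- split_wave("TR1_edit.WAV")
Because the number of segments is computed from the file's own length, files of different durations produce lists of different lengths with no blank elements.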

@Gregor and @AkselA, thanks for your input. The biggest issue with the for-loop solution was that the wave files I'm working with are of varying sizes, so I would end up with blank elements in the resulting lists. My current solution imports the entire file, then breaks it up into 30 s pieces from there:
duration = 1.44e6  # 30 s of samples at 48 kHz
tr1 <- readWave("TR1_edit.wav", from = 0, to = 1, units = "minutes")
tr1 <- as.matrix(tr1@left)  # keep the left channel as a one-column matrix
tr1 <- cbind(tr1, rep(1:(length(tr1) / duration), each = duration))  # tag each sample with its 30 s block
tr1 <- lapply(split(tr1[, 1], tr1[, 2]), matrix, ncol = 1)  # one matrix per block
From there I can use mapply to return the pieces to the Wave class:
w <- function(s) {
  Wave(s, right = numeric(0), samp.rate = 48000, bit = 16, pcm = TRUE)
}
tr1 <- mapply(w, tr1)
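As an aside, tuneR's extractWave can slice an already-imported Wave object directly, which would avoid the matrix juggling; a sketch, with the 30 s window and file name assumed from above:
full <- readWave("TR1_edit.wav")
# xunit = "time" interprets from/to as seconds; interact = FALSE for scripted use
chunk1 <- extractWave(full, from = 0, to = 30, xunit = "time", interact = FALSE)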

Related

Automate User-Defined Function to Process Dataframes

I'd like to write code that takes a varying number of data frames X1, X2, X3, ..., XN, each containing four columns of data (time, height, start, group), and automatically runs them through HL.plot, a function from the VulnToolkit package. I'm guessing I could use a for loop, but I'm still very new to R and a little stuck on this step. If I use the function manually, the code looks like this:
HL.plot(level = X1[, 2], time = X1[, 1], period = 0.3,
        phantom = TRUE, tides = "H")
HL.plot(level = X2[, 2], time = X2[, 1], period = 0.3,
        phantom = TRUE, tides = "H")
HL.plot(level = X3[, 2], time = X3[, 1], period = 0.3,
        phantom = TRUE, tides = "H")
HL.plot(level = XN[, 2], time = XN[, 1], period = 0.3,
        phantom = TRUE, tides = "H")
The function plots the second (height) column against the first (time) column of each data frame.
Assuming you have many data frames named X1, X2, X3, ..., X100, ..., XN, the best way I can think of to concatenate them automatically (rather than typing each data frame into rbind() one by one) is to build the call as a string and evaluate it:
eval_expression = "Xnew <- rbind("
# example if n is 200 (200 data frames available)
n <- 200
for (a in 1:n) {
  if (a == n) {
    eval_expression <- paste0(eval_expression, "X", a, ")")
  } else {
    eval_expression <- paste0(eval_expression, "X", a, ",")
  }
}
You can inspect eval_expression after running the code above, and then do the final execution:
eval(parse(text = eval_expression))
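For what it's worth, a more idiomatic sketch that avoids eval(parse()) is mget() plus do.call(), assuming the data frames live in the global environment; the resulting list can also drive HL.plot directly, which is what the question asks for:
# collect X1..Xn into a list, then row-bind them in a single call
dfs <- mget(paste0("X", 1:n), envir = globalenv())
Xnew <- do.call(rbind, dfs)
# or run HL.plot on each data frame without combining them
invisible(lapply(dfs, function(d) {
  HL.plot(level = d[, 2], time = d[, 1], period = 0.3,
          phantom = TRUE, tides = "H")
}))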

Aligning Multiple Files in R by Pairwise Alignment

I have 15 protein sequences in FASTA format in one file. I have to align them pairwise, globally and locally, and then generate a 15x15 distance score matrix to construct a dendrogram.
But when I do, a sequence does not align with itself and I get an NA result. Moreover, A vs. B gives a score of 12131 but B vs. A gives NA, so R cannot construct the phylogenetic tree.
What should I do?
I'm using this script for the loop, but it only fills one direction:
for (i in 1:150) {
  globalpwa <- pairwiseAlignment(toString(ProtDF[D[1, i], 2]),
                                 toString(ProtDF[D[2, i], 2]),
                                 substitutionMatrix = "BLOSUM62",
                                 gapOpening = 0,
                                 gapExtension = -2,
                                 scoreOnly = FALSE,
                                 type = "global")
  ScoreX[i] <- globalpwa@score
  nameSeq1[i] <- as.character(ProtDF[D[1, i], 1])
  nameSeq2[i] <- as.character(ProtDF[D[2, i], 1])
}
I used an example FASTA file: the protein sequence of RPS29 in fungi.
ProtDF <-
c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD",
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR",
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL",
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK",
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR",
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL",
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR",
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD",
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR",
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR",
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR",
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR",
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR"
)
So you got it right to use combn. As you said, you need a distance score matrix for the dendrogram, so it's better to store the values in a matrix. See below: I simply name the matrix rows and columns after the FASTA names and slot in the pairwise values.
library(Biostrings)

# you can also read in your file
# like ProtDF = readAAStringSet("fasta")
ProtDF = AAStringSet(ProtDF)

# combinations like you want; here we just use the names
D = combn(names(ProtDF), 2)

# make the pairwise matrix
mat = matrix(NA, ncol = length(ProtDF), nrow = length(ProtDF))
rownames(mat) = names(ProtDF)
colnames(mat) = names(ProtDF)

# loop through D
for (idx in 1:ncol(D)) {
  i <- D[1, idx]
  j <- D[2, idx]
  globalpwa <- pairwiseAlignment(ProtDF[i],
                                 ProtDF[j],
                                 substitutionMatrix = "BLOSUM62",
                                 gapOpening = 0,
                                 gapExtension = -2,
                                 scoreOnly = FALSE,
                                 type = "global")
  mat[i, j] <- globalpwa@score
  mat[j, i] <- globalpwa@score
}
# if you need to make diagonal zero
diag(mat) <- 0
# plot
plot(hclust(as.dist(mat)))
An alternate method, if you're interested, using the same example as above:
library(DECIPHER)
# ProtDF: the same named character vector of RPS29 sequences as defined above
# All pairwise alignments:
# convert characters to an AAStringSet
ProtDF <- AAStringSet(ProtDF)

# initialize some outputs
AliMat <- matrix(data = list(),
                 ncol = length(ProtDF),
                 nrow = length(ProtDF))
DistMat <- matrix(data = 0,
                  ncol = length(ProtDF),
                  nrow = length(ProtDF))

# loop through the upper triangle of the matrix
for (m1 in seq_len(length(ProtDF) - 1L)) {
  for (m2 in (m1 + 1L):length(ProtDF)) {
    # align each pair
    AliMat[[m1, m2]] <- AlignSeqs(myXStringSet = ProtDF[c(m1, m2)],
                                  verbose = FALSE)
    # assign a distance to each alignment, filling both triangles of the matrix
    DistMat[m1, m2] <- DistMat[m2, m1] <- DistanceMatrix(myXStringSet = AliMat[[m1, m2]],
                                                         type = "dist",               # return a single value
                                                         includeTerminalGaps = TRUE,  # return a global distance
                                                         verbose = FALSE)
  }
}
dimnames(DistMat) <- list(names(ProtDF), names(ProtDF))
Dend01 <- IdClusters(myDistMatrix = DistMat,
                     method = "NJ",
                     type = "dendrogram",
                     showPlot = FALSE,
                     verbose = FALSE)

# A single multiple alignment:
AllAli <- AlignSeqs(myXStringSet = ProtDF,
                    verbose = FALSE)
AllDist <- DistanceMatrix(myXStringSet = AllAli,
                          type = "matrix",
                          verbose = FALSE,
                          includeTerminalGaps = TRUE)
Dend02 <- IdClusters(myDistMatrix = AllDist,
                     method = "NJ",
                     type = "dendrogram",
                     showPlot = FALSE,
                     verbose = FALSE)
Dend01, from all the pairwise alignments, and Dend02, from a single multiple alignment, can each be plotted and compared (the dendrogram images are omitted here).
Finally, DECIPHER has a function, BrowseSeqs, for loading up your alignment in the browser just to look at it. If your alignments are huge this can be a bit of a mistake, but in this case (and in cases up to a few hundred short sequences) it's just fine:
BrowseSeqs(AllAli)
A side note about BrowseSeqs: for some reason it doesn't behave well with Safari, but it plays just fine with Chrome. Sequences are displayed in the order in which they exist in the aligned string set.
EDIT: BrowseSeqs DOES play fine with Safari directly, but it has a weird issue when incorporated into HTML knitted together with RMarkdown. Weird, but not applicable here.
Another big aside: all of the functions I've used have a processors argument, which is set to 1 by default. If you'd like to get greedy with your cores, you can set it to NULL and it will just grab everything available. If you're aligning very large string sets this can be pretty useful; for trivially small things like this example, not so much.
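For instance, a sketch of the parallel call using the same objects as above:
# let DECIPHER use all available cores for the multiple alignment
AllAli <- AlignSeqs(myXStringSet = ProtDF, processors = NULL, verbose = FALSE)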
combn is a great function and I use it all the time. However for these really simple set ups I like looping through the upper triangle, but that's just a personal preference.

R crashes when cspade is trained on a large data set

The code below works to extract sequences using the cspade algorithm.
library("arulesSequences")
df <- data.frame(personID = c(1, 1, 2, 2, 2),
eventID = c(100, 101, 102, 103, 104),
site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
sequence = c(1, 2, 1, 2, 3))
df.trans <- as(df[,"site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID),]
seq <- cspade(df.trans, parameter = list(support = 0.2),
control = list(verbose = TRUE))
The problem is that my actual data has ~2 million rows, with sequence increasing to ~20 for each person. Using the code above, cspade quickly consumes all RAM and R crashes. Does anyone have tips on how to perform sequence mining on large datasets like mine? Thanks!
How many unique IDs do you have in df$sequence? The last column of your sample dataset shows 3 sequence options. Do you think sequences of up to 20 are necessary? One thing you could do is set the maxlen parameter in your cspade call to something like 4 or 5 and evaluate your predictive accuracy, assuming that's what you are after. You would then have something like:
seq <- cspade(df.trans, parameter = list(support = 0.2, maxlen = 4),
              control = list(verbose = TRUE))
Hope that helps.

readBin seems to round data in R

I have structured data comprising several floats and an integer that I want to process in R. So far, I have been able to read the data and create a list like this:
rawData <- readBin(path, what = "raw", n = fileSize);
dim(rawData) <- c(recordSize, cnt);
x <- readBin(con = rawData[1:4,], what = "double", size = 4, n = cnt);
y <- readBin(con = rawData[5:8,], what = "double", size = 4, n = cnt);
z <- readBin(con = rawData[9:12,], what = "double", size = 4, n = cnt);
The result seems almost right, except that some of the floats appear to be rounded to an integer. For instance, the very first value is -5813186.5, but if I print x[1] the output is [1] -5813187. I also tried playing around with options(digits = 2), but this had no effect. As I am new to R, I do not even know whether this is a display issue or whether the in-memory data are wrong. I do know that typeof(x[1]) yields [1] "double", as expected.
How can I (i) print the data with full precision, or (ii) ensure that the data are not rounded?
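A likely explanation, offered as an assumption rather than a confirmed diagnosis: R prints doubles with 7 significant digits by default, so the stored value may be intact and only its display rounded. A quick sketch of how to check, using the x read above:
print(x[1], digits = 10)   # show more significant digits for one value
options(digits = 10)       # or raise the session-wide default (accepts up to 22)
x[1] == -5813186.5         # TRUE if the value was stored exactly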

R apply function to data based on index column value

Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each = 5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old = c("V1", "V2"), new = c("target", "index"))
example

threshold = 100
accumulating_cost = function(x, y) { x - cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to consist of the difference between threshold and the cumulative sum of the values in example$target for which example$index equals 1, 2, and 3 respectively. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for (i in 1:max(example$index)) {
  whats_left[[i]] = accumulating_cost(threshold, example$target[example$index == i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left ~ c(1:15))
I know for loops aren't the devil in R, but I'm training myself to use vectorization where possible (and to move away from apply, which is just a for-loop wrapper). I'm pretty sure it's possible here, but I can't figure out how. Any help would be much appreciated.
All you're trying to do is accumulate the cost by index, so you can use data.table's by argument:
example[, accumulating_cost(threshold, target), by = index]
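That call returns a two-column data.table: index plus a default column named V1 holding the running remainder. A sketch of naming the result explicitly, or of storing it back in the table by reference (the column name whats_left is invented here):
# name the computed column explicitly
example[, .(whats_left = accumulating_cost(threshold, target)), by = index]
# or add the column to the table in place
example[, whats_left := accumulating_cost(threshold, target), by = index]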
