readBin seems to round data in R

I have structured data comprising several floats and an integer which I want to process in R. So far, I was able to read the data and create a list like this:
rawData <- readBin(path, what = "raw", n = fileSize);
dim(rawData) <- c(recordSize, cnt);
x <- readBin(con = rawData[1:4,], what = "double", size = 4, n = cnt);
y <- readBin(con = rawData[5:8,], what = "double", size = 4, n = cnt);
z <- readBin(con = rawData[9:12,], what = "double", size = 4, n = cnt);
The result seems almost OK, except that some of the floats appear to be rounded to an integer. For instance, the very first value is -5813186.5, but if I print x[1], the output is [1] -5813187. I also tried playing around with options(digits = 2), but this had no effect. As I am new to R, I do not even know whether this is an issue with the output or whether the in-memory data are wrong. I know that typeof(x[1]) yields [1] "double" as expected.
How can I (i) print the data with full precision, or (ii) ensure that the data is not rounded?
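A minimal sketch of how one might check this, assuming the value from the question and a round trip through a 4-byte float with writeBin/readBin. R prints 7 significant digits by default (getOption("digits")), and -5813186.5 needs 8, so the display is rounded while the stored double is not. Setting options(digits = 2) only lowers the number of digits further.

```r
# Round-trip the value from the question through a 4-byte float.
bytes <- writeBin(-5813186.5, raw(), size = 4)
x <- readBin(bytes, what = "double", size = 4, n = 1)

# The in-memory value still carries its fractional part.
still_exact <- (x == -5813186.5)

# Ask for more significant digits when displaying the value.
print(x, digits = 10)
full <- sprintf("%.1f", x)   # fixed number of decimals, independent of options()
```

To change the default for the whole session, something like options(digits = 10) should have the same effect on printing.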

Related

Aligning Multiple Files in R by Pairwise Alignment

I have 15 protein sequences in FASTA format in one file. I have to pairwise align them globally and locally, then generate a 15x15 distance score matrix to construct a dendrogram.
But when I do this, a sequence does not align with itself and I get an NA result. Moreover, AxB gives a score of 12131 but BxA gives NA, so R cannot construct the phylogenetic tree.
What should I do?
I'm using this script for the loop, but it reads one way only:
for (i in 1:150) {
globalpwa<-pairwiseAlignment(toString(ProtDF[D[1,i],2])
,toString(ProtDF[D[2,i],2]),
substitutionMatrix = "BLOSUM62",
gapOpening = 0,
gapExtension = -2,
scoreOnly=FALSE,
type="global")
ScoreX[i]<-c(globalpwa@score)
nameSeq1[i]<-c(as.character(ProtDF[D[1,i],1]))
nameSeq2[i]<-c(as.character(ProtDF[D[2,i],1]))
}
I used an example fasta file, protein sequence of RPS29 in fungi.
ProtDF <-
c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD",
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR",
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL",
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK",
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR",
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL",
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR",
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD",
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR",
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR",
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR",
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR",
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR"
)
You were right to use combn. Since you need a distance score matrix for the dendrogram, it is better to store the values in a matrix. Below, I simply name the rows and columns of the matrix after the names in the FASTA file and slot in the pairwise values.
library(Biostrings)
# you can also read in your file
# like ProtDF = readAAStringSet("fasta")
ProtDF=AAStringSet(ProtDF)
# combination like you want
# here we just use the names
D = combn(names(ProtDF),2)
#make the pairwise matrix
mat = matrix(NA,ncol=length(ProtDF),nrow=length(ProtDF))
rownames(mat) = names(ProtDF)
colnames(mat) = names(ProtDF)
# loop through D
for(idx in 1:ncol(D)){
i <- D[1,idx]
j <- D[2,idx]
globalpwa<-pairwiseAlignment(ProtDF[i],
ProtDF[j],
substitutionMatrix = "BLOSUM62",
gapOpening = 0,
gapExtension = -2,
scoreOnly=FALSE,
type="global")
mat[i,j]<-globalpwa@score
mat[j,i]<-globalpwa@score
}
# if you need to make diagonal zero
diag(mat) <- 0
# plot
plot(hclust(as.dist(mat)))
An alternate method, if you're interested, using the same example as above:
library(DECIPHER)
ProtDF <- c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD",
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR",
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL",
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK",
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR",
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL",
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR",
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD",
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR",
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR",
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR",
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR",
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR")
# All pairwise alignments:
# Convert characters to an AA String Set
ProtDF <- AAStringSet(ProtDF)
# Initialize some outputs
AliMat <- matrix(data = list(),
ncol = length(ProtDF),
nrow = length(ProtDF))
DistMat <- matrix(data = 0,
ncol = length(ProtDF),
nrow = length(ProtDF))
# loop through the upper triangle of your matrix
for (m1 in seq_len(length(ProtDF) - 1L)) {
for (m2 in (m1 + 1L):length(ProtDF)) {
# Align each pair
AliMat[[m1, m2]] <- AlignSeqs(myXStringSet = ProtDF[c(m1, m2)],
verbose = FALSE)
# Assign a distance to each alignment, fill both triangles of the matrix
DistMat[m1, m2] <- DistMat[m2, m1] <- DistanceMatrix(myXStringSet = AliMat[[m1, m2]],
type = "dist", # return a single value
includeTerminalGaps = TRUE, # return a global distance
verbose = FALSE)
}
}
dimnames(DistMat) <- list(names(ProtDF),
names(ProtDF))
Dend01 <- IdClusters(myDistMatrix = DistMat,
method = "NJ",
type = "dendrogram",
showPlot = FALSE,
verbose = FALSE)
# A single multiple alignment:
AllAli <- AlignSeqs(myXStringSet = ProtDF,
verbose = FALSE)
AllDist <- DistanceMatrix(myXStringSet = AllAli,
type = "matrix",
verbose = FALSE,
includeTerminalGaps = TRUE)
Dend02 <- IdClusters(myDistMatrix = AllDist,
method = "NJ",
type = "dendrogram",
showPlot = FALSE,
verbose = FALSE)
[Figure: Dend01, the dendrogram from all the pairwise alignments]
[Figure: Dend02, the dendrogram from a single multiple alignment]
Finally, DECIPHER has a function for loading up your alignment in your browser just to look at it, which, if your alignments are huge, can be a bit of a mistake, but in this case (and in cases up to a few hundred short sequences) is just fine:
BrowseSeqs(AllAli)
A side note about BrowseSeqs, for some reason it doesn't behave well with Safari, but it plays just fine with Chrome. Sequences are displayed in the order in which they exist in the aligned string set.
EDIT: BrowseSeqs DOES play fine with Safari directly, but it does have a weird issue when incorporated into HTML files knitted with RMarkdown. Weird, but not applicable here.
Another big aside: all of the functions I've used have a processors argument, which is set to 1 by default. If you'd like to get greedy with your cores, you can set it to NULL and it'll just grab everything available. If you're aligning very large string sets, this can be pretty useful; if you're doing trivially small things like this example, not so much.
combn is a great function and I use it all the time. However for these really simple set ups I like looping through the upper triangle, but that's just a personal preference.
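As a small illustration of that last point, here is a sketch (plain base R, with a toy n = 4 standing in for the number of sequences) showing that combn and the nested upper-triangle loop enumerate the same index pairs in the same order. The variable names are made up for this example.

```r
n <- 4L  # toy size; the example above has 15 sequences

# Pairs from combn: one row per pair after transposing.
pairs_combn <- t(combn(n, 2))

# Pairs from looping through the upper triangle.
pairs_loop <- matrix(integer(0), ncol = 2)
for (m1 in seq_len(n - 1L)) {
  for (m2 in (m1 + 1L):n) {
    pairs_loop <- rbind(pairs_loop, c(m1, m2))
  }
}

# Both give n * (n - 1) / 2 pairs, in the same order.
n_pairs <- nrow(pairs_combn)
same_pairs <- all(dim(pairs_combn) == dim(pairs_loop)) &&
  all(pairs_combn == pairs_loop)
```

For 15 sequences, choose(15, 2) gives 105 pairs, so either approach runs 105 alignments.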

Fast way for splitting large .wav file using R

For my work I need to analyze large .wav files (>208 MB), and I make use of the R packages seewave and tuneR. I bring each file into the R environment in 30 s chunks, using the readWave function as follows:
tr1_1 = readWave("TR1_edit.WAV", from = 0, to = 0.5, units = "minutes")
tr1_2= readWave("TR1_edit.WAV", from = 0.5, to = 1, units = "minutes")
tr1_3= readWave("TR1_edit.WAV", from = 1, to = 1.5, units = "minutes")
tr1_4= readWave("TR1_edit.WAV", from = 1.5, to = 2, units = "minutes")
tr1_5= readWave("TR1_edit.WAV", from = 2, to = 2.5, units = "minutes")
and so on. This method works, but is not efficient or pretty. Is there a way to import and split up a large .wav class file more efficiently?
If you're loading all of these into memory at the same time, rather than sequential variable names you should be using a list.
tr1 = list()
duration = 0.5
start_times = seq(0, 2, by = duration)
for (i in seq_along(start_times)) {
tr1[[i]] = readWave('TR1_edit.WAV',
from = start_times[i],
to = start_times[i] + duration,
units = 'minutes')
}
This is the same principle as why you should use a list of data frames rather than sequentially named data frames.
You could easily wrap this into a function that takes the name of a WAV file as input, gets its length from the metadata, and imports it in 30-second (or a parameterized argument) segments, and returns the list of segments.
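A rough sketch of the bookkeeping such a function would need, in base R only. The helper name segment_bounds and its arguments are hypothetical; it only computes the from/to times and truncates the last chunk, so files of varying length do not produce empty segments.

```r
# Hypothetical helper: start/end times (in seconds) for fixed-length chunks.
segment_bounds <- function(total_sec, chunk_sec = 30) {
  stopifnot(total_sec > 0, chunk_sec > 0)
  from <- seq(0, total_sec, by = chunk_sec)
  from <- from[from < total_sec]            # drop a start that falls on the end
  to <- pmin(from + chunk_sec, total_sec)   # shorten the final chunk if needed
  data.frame(from = from, to = to)
}

# A 75-second recording split into 30-second pieces: 0-30, 30-60, 60-75.
bounds <- segment_bounds(total_sec = 75, chunk_sec = 30)
```

Assuming tuneR's readWave(file, header = TRUE) returns the sample count and sample rate as I remember it, total_sec could be taken as samples / sample.rate, and each row of bounds then passed to readWave with units = "seconds" inside lapply.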
@Gregor and @AkselA thanks for your input. The biggest issue with the for loop solution was that the wave files I'm working with are of varying sizes, so I would end up with blank elements in the resultant lists. My current solution imports the entire file, then breaks it up into 30s pieces from there:
duration = 1.44e6
tr1 <- readWave("TR1_edit.wav", from = 0, to = 1, units = "minutes")
tr1 <- as.matrix(tr1@left)
tr1 <- cbind(tr1, (rep(1:(length(tr1)/duration), each = duration)))
tr1 <- lapply(split(tr1[,1],tr1[,2]),matrix, ncol = 1)
From there I can use mapply to return the vectors to wave class
w <- function(s){
Wave(s, right = numeric(0), samp.rate = 48000, bit = 16, pcm = TRUE)
}
tr1 <- mapply(w, tr1)

Why is the actual number of generation not as specified for genetic algorithms in R

I am working with the genalg library for R, and I am trying to save all the generations when I run a binary genetic algorithm. There does not seem to be a built-in method for that in the library, so my attempt was to save each chromosome, x, as it comes through the evaluation function.
To test this method I inserted print(x) in the evaluation function to be able to see all the evaluated chromosomes. However, the number of printed chromosomes does not always match what I expect.
I thought that the number of printed chromosomes would equal the number of iterations times the population size, but this does not seem to be true all the time.
The problem is that I want to know which generation (or iteration) each chromosome belongs to, which I can't tell if the number of chromosomes differs from iter times popSize.
What is the reason for this, and how can I "fix" it? Or is there another way of saving each chromosome along with the iteration it belongs to?
Below is an example where I thought that the evaluation function would print 2x5 chromosomes, but it only prints 8.
library(genalg)
library(ggplot2)
dataset <- data.frame(
item = c("pocketknife", "beans", "potatoes", "unions", "sleeping bag", "rope", "compass"),
survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
weight = c(1, 5, 10, 1, 7, 5, 1))
weightlimit <- 20
evalFunc <- function(x) {
print(x)
current_solution_survivalpoints <- x %*% dataset$survivalpoints
current_solution_weight <- x %*% dataset$weight
if (current_solution_weight > weightlimit)
return(0) else return(-current_solution_survivalpoints)
}
iter = 2
popSize = 5
set.seed(1)
GAmodel <- rbga.bin(size = 7, popSize = popSize, iters = iter, mutationChance = 0.1,elitism = T, evalFunc = evalFunc)
Looking at the function code, it seems that at each iteration (generation) a subset of chromosomes is chosen from the population (5 chromosomes in your example) with a certain probability (0.1 in your case) and mutated. The evaluation function is called only for the mutated chromosomes at each generation (and, of course, for all the chromosomes in the first iteration to get their initial values).
Note that this subset does not include the elitist group, which in your example has size 1 (you have erroneously passed elitism = TRUE, and TRUE is implicitly converted to 1).
Anyway, to know the population at each generation, you can pass a monitor function through the monitorFunc parameter, e.g.:
# obj contains a lot of information; try printing it
monitor <- function(obj) {
print(paste(" GENERATION :", obj$iter))
print("POPULATION:")
print(obj$population)
print("VALUES:")
print(obj$evaluations)
}
iter = 2
popSize = 5
set.seed(1)
GAmodel <- rbga.bin(size = 7, popSize = popSize,
iters = iter, mutationChance = 0.1,
elitism = 1, evalFunc = evalFunc, monitorFunc = monitor)
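If the goal is to keep the populations rather than print them, one option is a monitor function that stores each generation in a list. This is only a sketch: make_recorder is a made-up helper, the field names obj$iter and obj$population are the ones used in the monitor above, and the loop at the end fakes the calls that rbga.bin would make so the snippet runs without genalg.

```r
# Hypothetical helper: returns a monitor function plus a getter for its history.
make_recorder <- function() {
  history <- list()
  monitor <- function(obj) {
    # Append this generation's number and population to the history.
    history[[length(history) + 1L]] <<- list(iter = obj$iter,
                                             population = obj$population)
  }
  list(monitor = monitor, get = function() history)
}

rec <- make_recorder()

# Stand-in for the calls rbga.bin would make: two generations of a
# 5 x 7 binary population (made-up values, not real GA output).
for (g in 1:2) {
  fake_population <- matrix((seq_len(35) + g) %% 2, nrow = 5, ncol = 7)
  rec$monitor(list(iter = g, population = fake_population))
}

saved <- rec$get()   # one entry per generation
```

With genalg loaded, passing monitorFunc = rec$monitor to rbga.bin should fill the same list, and each saved[[k]]$population can then be tied to saved[[k]]$iter.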

How to compute the overall mean for several files in R?

I have 365 files for one year (each treated as a matrix with nrows = 500 and ncol = 700), and I want to compute the overall mean for that year.
To read one file:
con <- file("C:\\Users\\data.img","rb")
dat<- readBin(con, numeric(), size=4, n=700*500, signed=TRUE)
str(dat)
num [1:810438] 0.5 0.2 0.1...
To list all files:
dir1<- list.files("C:\\Users\\datsets", "*.img", full.names = TRUE)
To loop through the files:
for (.files in seq_along(dir1)){
file1 <- readBin(dir1[.files], numeric(), size = 4, n = 700*500, signed = T)}
Any idea how to compute the mean of all values (pixel by pixel), so that I end up with one file of mean values?
Edit: I forgot to mention, I only want to compute the mean among elements (pixels) that have a positive value.
Here are two methods I can think of:
1) Using a for loop (memory efficient):
sum.dat <- rep(0, 810438)
sum.pos <- rep(0, 810438)
for (.file in dir1) {
dat <- readBin(.file, numeric(), size = 4, n = 700*500, signed = TRUE)
pos <- dat >= 0
sum.dat <- sum.dat + dat * pos
sum.pos <- sum.pos + pos
}
mean.dat <- sum.dat / sum.pos
2) Using vapply (concise code but not memory efficient as it loads all the data into memory at once. This might be what you want though if you plan to do further processing on all the data.)
dats <- vapply(dir1, readBin, FUN.VALUE = numeric(810438),
what = numeric(), size = 4, n = 700*500, signed = TRUE)
mean.dat <- rowMeans(ifelse(dats >= 0, dats, NA), na.rm = TRUE)
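To see that the two methods agree, here is a toy check in base R, with a small in-memory matrix standing in for the files (one column per "file", one row per pixel). The values are made up, and this sketch uses a strict > 0 test because the edit to the question asks for positive values; the answer above uses >= 0, which also counts zeros.

```r
# Toy data: 4 pixels (rows) in 3 "files" (columns); values are made up.
dats <- cbind(c(1, -1, 2, 0.5),
              c(3, -2, -4, 1.5),
              c(5, -3, 6, -9))

# Method 1: running sums, one "file" at a time.
sum.dat <- numeric(nrow(dats))
sum.pos <- numeric(nrow(dats))
for (k in seq_len(ncol(dats))) {
  dat <- dats[, k]
  pos <- dat > 0
  sum.dat <- sum.dat + ifelse(pos, dat, 0)
  sum.pos <- sum.pos + pos
}
mean.loop <- sum.dat / sum.pos   # 0/0 gives NaN where a pixel is never positive

# Method 2: everything in memory at once.
mean.mat <- rowMeans(ifelse(dats > 0, dats, NA), na.rm = TRUE)
```

Pixel 2 is never positive here, so both methods return NaN for it; the other three pixels come out as 3, 4 and 1.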

Fill elements of a list without looping

I am trying not to use a for loop to assign values to the elements of a list.
Here, I create an empty list, give it a length of 20, and name each of the 20 elements.
mylist <- list()
length(mylist) <- 20
names(mylist) <- paste0("element", 1:20, sep = "")
I want each element of mylist to contain samples drawn from a pool of randomly generated numbers denoted as x:
x <- runif(100, 0, 1)
I tried the following codes, which do not get to the desired result:
mylist[[]] <- sample(x = x, size = 20, replace = TRUE) # Gives an error
mylist[[1:length(mylist)]] <- sample(x = x, size = 20, replace = TRUE) # Does not give the desired result
mylist[1:length(mylist)] <- sample(x = x, size = 20, replace = TRUE) # Gives the same undesired result as the previous line of code
mylist[] <- sample(x = x, size = 20, replace = TRUE) # Gives the same undesired result as the previous line of code
P.S. As explained above, the desired result is a list of 20 elements, which individually contains 20 numeric values. I can do it using a for loop, but I would like to become a better R user and use vectorized operations as much as possible.
Thank you for your help.
Maybe replicate is what you're looking for.
mylist <- replicate(20, sample(x = x, size = 20, replace = TRUE), simplify=FALSE)
names(mylist) <- paste0("element", 1:20)
Note that there is no need to first create a list, replicate will do it for you.
Since you're using replace=TRUE, you could also generate all 400 at once and then split them up. If you were doing this many times, this would probably be faster than replicate. For only 20 times, the speed difference will hardly matter at all, and the code using replicate is perhaps easier to read and understand, so it might be preferred for that reason.
foo <- sample(x = x, size = 20*20, replace = TRUE)
mylist <- split(foo, rep(1:20, each=20))
Alternatively, you could split them by converting to a data frame first. Not sure which would be faster.
mylist <- as.list(as.data.frame(matrix(foo, ncol=20)))
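A quick sanity check of the two splitting approaches, as a sketch with an arbitrary seed: both should give a list of 20 numeric vectors of length 20, holding the same values in the same order and differing only in their element names.

```r
set.seed(1)   # arbitrary seed, only so the sketch is reproducible
x <- runif(100, 0, 1)
foo <- sample(x = x, size = 20 * 20, replace = TRUE)

by_split  <- split(foo, rep(1:20, each = 20))
by_matrix <- as.list(as.data.frame(matrix(foo, ncol = 20)))

# Each list has 20 elements of 20 values.
sizes_ok <- length(by_split) == 20 && all(lengths(by_split) == 20)

# Names aside, the two lists hold the same values in the same order.
same_values <- isTRUE(all.equal(unname(by_split), unname(by_matrix)))
```

The element names differ, so whichever version is used, names(mylist) still needs to be set afterwards if names like element1 are wanted.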
