Decode Garmin RSD sonar binary data in R

I've been working with a Garmin chartplotter lately, but all of the data requires an iOS app just to view it. I'm interested in taking the sonar/sounder data and pulling it into a CSV to simply extract depth by time (so I can merge it with the data from the GPX file for a depth track).
Does anyone have any experience or suggestions for doing so?
to.read <- file("TestSonar.RSD", "rb")
a <- readBin(to.read,
             what = raw(),
             n = file.size("TestSonar.RSD"),
             endian = "little", signed = FALSE)  # signed/endian have no effect when reading raw()
close(to.read)
This produces a nice big vector of raw bytes... but I'm not sure where to go from here.
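Since the RSD layout isn't documented here, the snippet below is only a generic sketch of how one might probe the raw vector once it is loaded: reinterpret byte slices at candidate offsets with readBin(). The offset and field type are hypothetical placeholders, not the actual Garmin record format.
offset <- 1025  # made-up position; the real record layout would have to be reverse-engineered
# e.g. by looking for a repeating record length in the hex dump, then decoding candidate
# fields (uint16/uint32/float) at fixed offsets within each record
field <- readBin(a[offset:(offset + 3)],
                 what = "integer", size = 4, endian = "little")
field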

Related

R running very slowly when importing and manipulating 100,000 KB dataset

I'm working with a dataset in R with dimensions of about 7000 x 5000. The file size is about 100,000 KB. It takes about half an hour to load it into R. When I try to create a correlation table in order to run a PCA, R freezes. Then I have to reopen it and import the data again.
I'm surprised that it's so slow with a dataset of this size. I thought datasets had to be much larger to affect the speed to this degree. I'm using a Microsoft Surface Pro 3.
Does anyone have any ideas for why this might be happening and what I can do about it? Is it my laptop? Or is this kind of thing common with datasets of this size?
Edit in response to comments: My computer has 8 GB RAM. This is the code I am using:
library(readxl)    # read_excel()
library(optmatch)  # fill.NAs()
library(psych)     # principal()

nlsy_training_set <- read_excel("nlsy training set.xlsx")
df <- nlsy_training_set
full <- df[, 2:4886]
corf <- cor(full)  # note: this result is overwritten on the next line
corf <- fill.NAs(full, data = NULL, all.covs = FALSE, contrasts.arg = NULL)
corf <- as.data.frame(corf)
pcaf <- principal(corf, nfactors = 100, rotate = "varimax")$loadings
dfpcaf <- as.data.frame(pcaf)
This was very slow because I was using read_excel() and had converted the original data file into an Excel workbook. Once I switched back to the original CSV format and used read.csv(), I was able to import the data into R relatively quickly.
read.csv() works better than read_excel() for large datasets.
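A minimal sketch of that faster import path (the CSV filename below is assumed to be the original export of the same data); data.table::fread() is usually faster still for flat files of this size.
# Assumed filename: the original CSV version of the workbook above
nlsy_training_set <- read.csv("nlsy training set.csv")

# Often faster alternative for large flat files:
# library(data.table)
# nlsy_training_set <- fread("nlsy training set.csv")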

ncvar_get "cannot allocate vector of size" for netcdf4 subset no matter how small

I'm trying to extract a subset of depth data from GEBCO's global ocean bathymetry dataset, which is a 10.9 GB netCDF-4 (.nc) file (direct link).
I open a connection to the file, which doesn't load it into memory:
library(ncdf4)
GEBCO <- nc_open(filename = "GEBCO_2019.nc", verbose = T)
Find the lat & lon indices corresponding to my subset area:
LonIdx <- which(GEBCO$dim$lon$vals < -80 & GEBCO$dim$lon$vals > -81.7) #n=408 long
LatIdx <- which(GEBCO$dim$lat$vals < 26 & GEBCO$dim$lat$vals > 25) #n=240; 240*408=97920
Then get Z data for those extents:
z <- ncvar_get(GEBCO, GEBCO$var$elevation)[LonIdx, LatIdx]
Resulting in:
Error: cannot allocate vector of size 27.8 Gb
However, it does this regardless of the size of the subset, even down to a 14x14 matrix. I presume, therefore, that ncvar_get() is pulling in the whole variable before extracting the indices... even though I was under the impression that the entire point of netCDF files is that you can extract subsets by matrix indexing without loading the whole thing into memory.
FWIW I'm on a 32 GB Linux machine, so it should work anyway? [Edit: and the file is 10.9 GB in the first place, so one would think a subset would be smaller.]
Any ideas/intel/insights gratefully received. Thanks in advance.
Edit: other times it crashes RStudio rather than giving the error: "R Session Aborted, fatal error, session terminated".
OK, solved. It turns out the approach I'd found online, subsetting with [LonIdx, LatIdx], indexes the object only after the whole variable has been read into memory. (I still don't know why that failed, given the file is a third of my memory and even the expanded size should fit, but it's the wrong way to go regardless.)
Assuming one's rows and columns are contiguous (they should be in netCDF), the solution is to pass start and count to ncvar_get():
z <- ncvar_get(nc = GEBCO,
               varid = GEBCO$var$elevation,
               start = c(LonIdx[1], LatIdx[1]),
               count = c(length(LonIdx), length(LatIdx)),
               verbose = TRUE)
To convert to long format:
lon <- GEBCO$dim$lon$vals[LonIdx]
lat <- GEBCO$dim$lat$vals[LatIdx]
rownames(z) <- as.character(lon)
colnames(z) <- as.character(lat)
library(tidyr)
library(magrittr)
ztbl <- as_tibble(z, rownames = "lon")
ztbl %<>% pivot_longer(-lon, names_to = "lat", values_to = "depth")
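One small follow-up not in the original answer: the pivoted lon and lat columns are character (they came from the dimnames), so you will probably want to convert them back to numeric before plotting or joining.
# Convert the character lon/lat columns back to numeric
ztbl$lon <- as.numeric(ztbl$lon)
ztbl$lat <- as.numeric(ztbl$lat)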

Why does write.FCS in the flowCore package corrupt my FCS file?

I am analyzing FCS files from a CyTOF experiment using the flowCore package. When I import and export my FCS files using read.FCS and write.FCS, I find that these functions corrupt my FCS file: all channels are affected and the data look like the tSNE in the picture below (not what is expected or meaningful).
I'm using R (ver. 3.6), RStudio (1.2.1335), and flowCore (ver. 3.9).
Here is the code I have used:
library(flowCore)

# Import FCS file
myfilename <- "export_MIX_NT_Ungated_viSNE.fcs"
myfile_fcs <- read.FCS(myfilename,
                       transformation = "linearize", which.lines = NULL,
                       alter.names = FALSE, column.pattern = NULL)

# I plan to do some data analysis here in the final version before exporting below

# Export the FCS file and rename it to T_<filename>
write.FCS(myfile_fcs, paste("T_", keyword(myfile_fcs)$"$FIL", sep = ""), what = "numeric")
This is what the original file looks like before import into R:
And this is what the result looks like after export:
Here is the file that we have used for this code: dropbox link for the example file
I've looked into your problem, and at first I was skeptical about the transformation applied by read.FCS. Looking at your example file, I also see that there are already columns for your original (full plot) tSNE, so I'm assuming FlowJo is rewriting the tSNE values after you read/write the file in R. Since flowCore is generally targeted more at flow data than at CyTOF, I took a few pieces of this Bioc2017 walkthrough and recreated the transformations, which seems to work better, although I'm not sure how FlowJo will handle the data now. If you are going to do more work on the data, though, you now have it at an accessible low level, so you can basically do whatever you want. Here's my code.
# Read the file without transformation or truncation
fcs_raw <- read.flowSet("~/Downloads/export_MIX_NT_Ungated_viSNE.fcs",
                        transformation = FALSE,
                        truncate_max_range = FALSE)

# Apply the arcsinh transformation (cofactor 5, as is typical for CyTOF)
fcs <- fsApply(fcs_raw, function(x, cofactor = 5) {
  expr <- exprs(x)
  expr <- asinh(expr[, ] / cofactor)
  exprs(x) <- expr
  x
})

# Pull out the expression matrix and rescale each channel to the 0-1 range
# using its 1st and 99th percentiles
expr <- fsApply(fcs, exprs)
library(matrixStats)
rng <- colQuantiles(expr, probs = c(0.01, 0.99))
expr01 <- t((t(expr) - rng[, 1]) / (rng[, 2] - rng[, 1]))
expr01[expr01 < 0] <- 0
expr01[expr01 > 1] <- 1
expr01
summary(expr01)
Be aware that this does mess up your original tSNE column numbers, so if these were important to you, I would read the flowset, make a copy of those columns, and move on with the data analysis in the code. If you have future questions or analysis with flow data feel free to contact me directly.
@csugai, thanks for your answer. The truncate_max_range = FALSE argument in the read.flowSet function caught my eye, so I included it in my read.FCS call and that fixed the problem! Although I didn't really understand the other parts of your code that resulted in binned data.
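For completeness, a minimal sketch of the fix this comment describes: the original read.FCS()/write.FCS() round trip from the question, with truncate_max_range = FALSE added (an argument read.FCS accepts); everything else is unchanged.
library(flowCore)

myfile_fcs <- read.FCS(myfilename,
                       transformation = "linearize",
                       truncate_max_range = FALSE)

write.FCS(myfile_fcs, paste0("T_", keyword(myfile_fcs)$"$FIL"))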

Use ape to phase a FASTA file and create a DNAbin object as output, then test Tajima's D using pegas

I'm trying to complete the very simple task of reading in an unphased FASTA file and phasing it using ape, and then calculating Tajima's D using pegas, but my data don't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the object, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I want to do
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could check this by trying:
dist.dna(DNAbin8c18, "N")
One possibility for you is to randomly sample some observations, and repeat this many times, e.g.:
tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 1000), ])
This could be:
N <- 1000  # number of repeats
RES <- matrix(NA_real_, nrow = N, ncol = 3)  # columns: D, normal p-value, beta p-value
for (i in 1:N)
    RES[i, ] <- unlist(tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of the columns of RES.
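As a quick sketch of that last step (assuming RES was filled as above, with the first column holding the D statistic for each subsample):
# Distribution of Tajima's D across the random subsamples
D_vals <- RES[, 1]
summary(D_vals)
hist(D_vals, main = "Tajima's D across random subsamples", xlab = "D")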

Reading binary data from accelerometer device into R

Essentially, I want to know if there is a practical way to read a particular kind of binary file into R. I have some Matlab code which does what I want, but ideally I want to be able to do this in R.
The Matlab code is:
fid = fopen('filename');
A(:) = fread(fid, size*2, '2*uint8=>uint8',510,'ieee-le');
and so far in R I've been using:
to.read = file("filename", "rb")
bin = readBin(to.read, integer(), n = 76288, endian = "little")
The confusion I'm having is with the 3rd and 5th arguments in the Matlab function fread(): I don't understand exactly what '2*uint8=>uint8' or 'ieee-le' mean in terms of interpreting the binary data. This is what is holding me back from implementing it in R.
Also, the file extension is .cwa; apparently this is a very efficient format for recording high-frequency (100 Hz) activity data.
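In case it helps, here is a hedged sketch of how that Matlab call could be mirrored in R, under my reading of fread(): '2*uint8=>uint8' means "read 2 unsigned bytes at a time and keep them as uint8", the 4th argument (510) is the number of bytes to skip after each pair, and 'ieee-le' is little-endian byte order (irrelevant for single bytes). The n_pairs value below is a placeholder; the real count would come from the device/file spec.
# Sketch only: read 2 unsigned bytes, skip 510 bytes, repeat
con <- file("filename", "rb")
n_pairs <- 76288 / 2                          # placeholder for the number of 2-byte reads
vals <- integer(2 * n_pairs)
for (i in seq_len(n_pairs)) {
  two <- readBin(con, what = "integer", n = 2, size = 1, signed = FALSE)
  if (length(two) < 2) break                  # ran out of file
  vals[(2 * i - 1):(2 * i)] <- two
  seek(con, where = 510, origin = "current")  # skip 510 bytes, like Matlab's skip argument
}
close(con)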

Resources