Extracting binary data from a mixed data file - r

I am trying to read binary data from a mixed data file (ASCII and binary) using R; the data file is constructed in a pseudo-XML format. My idea was to use the scan function to read the specific lines and then convert the binary to numerical values, but I can't seem to do this in R. I have a Python script that does this (included below), but I would like to do the job in R. The binary section within the data file is enclosed by the start and end tags <BinData> and </BinData>.
The data file is a proprietary format containing spectroscopic data; a link to an example data file is included below. To quote the user manual:
Data of BinData elements are written as a binary array of bytes. Each
8 bytes of the binary array represent a one double-precision
floating-point value. Therefore the size of the binary array is
NumberOfPoints * 8 bytes. For two-dimensional arrays, data layout
follows row-major form used by SafeArrays. This means that moving to
next array element increments the last index. For example, if a
two-dimensional array (e.g. Data(i,j)) is written in such
one-dimensional binary byte array form, moving to the next 8 byte
element of the binary array increments last index of the original
two-dimensional array (i.e. Data(i,j+1)). After the last element of
the binary array the combination of carriage return and linefeed
characters (ANSI characters 13 and 10) is written.
Thanks for any suggestions in advance!
Link to example data file:
https://docs.google.com/file/d/0B5F27d7b1eMfQWg0QVRHUWUwdk0/edit?usp=sharing
Python script:
import sys, struct, csv

f = open(sys.argv[1], 'rb')
t = f.read()
f.close()

# locate the end of the opening <BinData> tag (plus the \r\n line end)
i = t.find("<BinData>") + len("<BinData>") + 2
header = t[:i]

t = t[i:]
i = t.find("\r\n</BinData>")
bin = t[:i]
footer = t[i + 2:]

# each 8 bytes of the binary block is one double-precision value
doubles = []
for j in range(len(bin) / 8):
    doubles.append(struct.unpack('d', bin[j*8:(j+1)*8])[0])

myfile = open("output.csv", 'wb')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(doubles)
myfile.close()

I wrote the pack package to make this easier. You still have to search for the start/end of the binary data though.
b <- readBin("120713b01.ols", "raw", file.size("120713b01.ols"))
# raw version of the opening BinData tag plus its line end
beg.raw <- charToRaw("<BinData>\r\n")
# only take the first match, in case the binary data randomly contains "<BinData>\r\n"
beg.loc <- grepRaw(beg.raw, b, fixed=TRUE)[1] + length(beg.raw)
# convert the header to text
header <- scan(text=rawToChar(b[1:(beg.loc-1)]), what="", sep="\n")
# find the "<Number of Points>" tags and calculate the total number of points
numPts <- prod(as.numeric(header[grep("<Number of Points>", header) + 1]))
library(pack)
Data <- unlist(unpack(rep("d", numPts), b[beg.loc:length(b)]))
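If you would rather not depend on the pack package, base R's readBin() can interpret the raw bytes directly; and because the manual quoted above says two-dimensional data is laid out with the last index moving fastest, matrix(..., byrow = TRUE) rebuilds the original shape. A sketch, assuming little-endian doubles; nRows and nCols are hypothetical dimensions taken from the header:
# interpret the remaining raw bytes as 8-byte doubles
Data <- readBin(b[beg.loc:length(b)], what = "double", n = numPts,
                size = 8, endian = "little")
# for two-dimensional data, rebuild the matrix row by row
# (nRows and nCols would come from the two <Number of Points> header values)
DataMat <- matrix(Data, nrow = nRows, ncol = nCols, byrow = TRUE)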

Related

Set read_csv() to a fixed number of columns?

TL;DR: How do I get RStudio to import a CSV as a tibble exactly the way Microsoft Excel reads it (RStudio for Mac version 1.3.959, Excel for Mac version 16.33, if that helps)? If this is not possible, or it should already behave the same, how do I read in a CSV file with no more than 8 columns, filling in blank values in rows so I can tidy it?
Long version:
I have a dozen CSV files (collected from archival animal tags) that are messy (inconsistent width, multiple blocks of data in one file) and need to be read in. For workflow reasons, I would like to take the raw data and bring it straight into R. The data has a consistent structure between files: a metadata block, a summary by day that is 6 columns wide, and 2 blocks of constant logging that are 2 columns wide. Counting the cells in each section, the widths and lengths are:
Section          Width   Length
Metadata         8       37
Summary Block    7       N days
Block 1          2       N*72
Block 2          2       N*72
The last three blocks of data can be thousands of entries long. I am unable to get this data to load into R as anything other than a single 1x500,000+ dataframe. Using tag1 = read_csv('file', skip = 37) to just start with the data I want crashes R. It works with read.csv(), but that removes the metadata block that I would like to keep.
Attempting to read the file into Excel shows the correct format (width, length, etc.) but will not load all of the data; it cuts off a good chunk of the last block. Reading the data in a tabular format with readxl::read_excel() presents the same issue.
Ultimately, I'd like to either import the data as a nested tibble with these different sections, or better yet, automate this process so it can read in an entire folder's worth of CSV files, automatically assign them to variables, and split them into sections. However, for now I just want to get this data into a workable format intact, and I would appreciate any help you can give me.
Get the number of lines in the file, n, and from that derive N. Then read the blocks one by one. Use the same connection so each read starts off from where the prior one ended.
n <- length(count.fields("myfile", sep = ""))
N <- (n - 37) / (1 + 2 * 72)
con <- file("myfile", open = "r")
meta <- readLines(con, 37)
summary_block <- read.csv(con, header = FALSE, nrows = N)
block1 <- read.csv(con, header = FALSE, nrows = N * 72)
block2 <- read.csv(con, header = FALSE, nrows = N * 72)
close(con)
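Since you also mention wanting to process a whole folder, one way is to wrap the reads above in a function and lapply() over list.files(). A sketch under the same assumptions about the block sizes; read_tag_file and the folder name tag_folder are hypothetical:
read_tag_file <- function(path) {
  n <- length(count.fields(path, sep = ""))
  N <- (n - 37) / (1 + 2 * 72)
  con <- file(path, open = "r")
  on.exit(close(con))
  list(meta    = readLines(con, 37),
       summary = read.csv(con, header = FALSE, nrows = N),
       block1  = read.csv(con, header = FALSE, nrows = N * 72),
       block2  = read.csv(con, header = FALSE, nrows = N * 72))
}
tags <- lapply(list.files("tag_folder", pattern = "\\.csv$", full.names = TRUE),
               read_tag_file)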

Iterating through multiple substrings in a .txt file in base R

I've been tasked with calculating the GC content of a FASTA file using base R (no packages). My problem is that I don't know how to programmatically iterate through the sequences while storing the sequence name and also the number of Cs and Gs.
Example FASTA file I can read in (as a .txt file):
>T7_promoter
ATTAGACGAG
>T3_promoter
TTTGCGCGAAATTTTTTTTT
*There are no quotes here, but the > designates the start of a distinct sequence.
Such that my output will be something conceptually similar to -
T7_promoter: 0.4 (ratio of GC from # of Gs and Cs)
T3_promoter: 0.25
Any and all help is much appreciated. I am currently using readLines() to pass the file through. I tried using unlist(strsplit()) per element that strsplit() naturally produces to try and store each sequence as an element in a list. Then I could iterate through each element to get calculations but my executions have not been successful.
You could read the file with readLines() and pair each header line with the sequence that follows it in a data frame (this assumes each sequence sits on a single line, as in the example):
lines <- readLines("file.txt")
dat <- data.frame(V1 = sub("^>", "", lines[c(TRUE, FALSE)]),
                  V2 = lines[c(FALSE, TRUE)])
Then you can count the number of Gs and Cs with
dat$Gs <- lengths(regmatches(dat$V2, gregexpr("G", dat$V2)))
dat$Cs <- lengths(regmatches(dat$V2, gregexpr("C", dat$V2)))
The last needed thing is the ratio, i.e. the Gs and Cs together divided by the sequence length:
dat$ratio <- (dat$Gs + dat$Cs) / nchar(dat$V2)
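If a sequence can wrap across several lines, as in many real FASTA files, here is a base-R sketch along the same lines (the function name gc_content is hypothetical; the expected output for the example file is shown as comments):
gc_content <- function(path) {
  lines <- readLines(path)
  is_header <- startsWith(lines, ">")
  # group every sequence line with the most recent header line
  grp <- cumsum(is_header)
  seqs <- tapply(lines[!is_header], grp[!is_header], paste, collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  # count Gs and Cs together and divide by the sequence length
  gc <- lengths(regmatches(seqs, gregexpr("[GC]", seqs)))
  gc / nchar(seqs)
}
gc_content("file.txt")
# T7_promoter T3_promoter
#        0.40        0.25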

Write SAS XPORT file in R specifying length larger than the largest actual value for a character variable

How would one write an R data frame to the SAS xpt format and specify the length of each column? For example, in a column of text variables the longest string is 157 characters; however, I'd like the field length attribute to be 200 characters.
The package haven does not seem to have this option and the package SASxport's documentation is less than clear on this issue.
The SASformat() and SASiformat() functions are used to set an attribute on an R object that sets its format when written to a SAS xport file. To set a data frame column to a 200 character format, use the following approach:
SASformat(mydata$var) <- 'CHAR200.'
SASiformat(mydata$var) <- 'CHAR200.'
Then use write.xport() to write the data frame to a SAS xport format.
See page 17 of the SASxport package documentation for details.
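A minimal sketch putting those two steps together (the data frame mydata and the column var are hypothetical; the format name is taken from above, and note the caveat below about SAS deriving the length from the longest observed string):
library(SASxport)
# hypothetical data frame with one character column
mydata <- data.frame(var = c("abc", "de"), stringsAsFactors = FALSE)
SASformat(mydata$var) <- 'CHAR200.'
SASiformat(mydata$var) <- 'CHAR200.'
write.xport(mydata, file = "mydata.xpt")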
SASxport is an old package, so you'll need to load an older version of Hmisc to get it to work properly, per another SO question.
However, on reading the file into SAS it uses the length of the longest string in any observation to set the length of the column, regardless of the format and informat attributes. Therefore, one must write at least one observation containing trailing blanks to the desired length in order for SAS to set the length to the desired size. Ironically, this makes the format and informat superfluous.
This can be accomplished with the str_c() function from the stringr package.
Putting it all together...
library("devtools")
install_version("Hmisc", version = "3.17-2")
library(SASxport)
library(Hmisc)
## manually create a data set
data <- data.frame(x = c(1, 2, NA, NA),
                   y = c('a', 'B', NA, '*'),
                   z = c("this is a test", "line 2", "another text string", "bottom line"))
# workaround - extend the string variable to desired length (30 characters) by
# adding trailing blanks, using stringr::str_c() function
library(stringr)
data$z <- str_c(data$z, str_dup(" ", 30 - nchar(data$z)))
nchar(data$z)
# write to SAS XPORT file
tmp <- tempfile(fileext = ".dat")
write.xport( data, file = tmp )
We'll read the file into SAS and use lengthc() to check the size of the z column.
libname testlib xport '/folders/myfolders/xport.dat';

proc copy in=testlib out=work;
run;

data data;
  set data;
  lenZ = lengthc(z);
run;
...and the output shows lenZ = 30 for every observation, confirming that SAS assigned the padded length to the column.

Reading files from folder and storing them to dataframe in R

My goal eventually is to build a classifier -- something like a spam detector.
I am not sure however how to read the text files that contain the text that will feed the classifier and store it to a dataframe.
So suppose that I have assembled in a folder text files -- raw text initially stored in a Notepad and then saved from it in a txt file-- whose names are indicative of their content, e.g. xx_xx_xx__yyyyyyyyyyy_zzzzz, where xx will be numbers representing a date, yyyyyyyyy will be a character string representing a theme and zzzz will be a character string representing a source. yyyyyyyyyyy and zzzzzzz will be of variable lengths.
My objective would be to create a function that would loop through the files, read them, store the information contained in their names in separate columns of a dataframe --e.g. "Date", "Theme", "Source" -- and the text content in a fourth column (e.g. "Content").
Any ideas how this could be achieved?
Your advice will be appreciated.
Hi, here is a possible answer. I'm storing the results in a list instead of a data frame, but you can convert from one to the other with do.call(rbind.data.frame, result).
require(stringr)
datawd <- "C:/my/path/to/folder/" # your data directory
listoffiles <- list.files(datawd, pattern = "\\.txt$") # only keep .txt files
my_paths <- str_c(datawd, listoffiles) # vector of full paths
# the following progress bar works on Windows only
progress <- winProgressBar(title = "loading text files",
                           label = "progression %",
                           min = 0,
                           max = length(my_paths),
                           initial = 0,
                           width = 400)
result <- list()
#000000000000000000000000000000000000000 loop
for (i in 1:length(my_paths)){
  setWinProgressBar(progress, i, label = listoffiles[i])
  # split the file name on "_" to recover date, theme and source
  the_date <- sapply(strsplit(listoffiles[i], "_"), "[[", 1)
  the_theme <- sapply(strsplit(listoffiles[i], "_"), "[[", 2)
  the_source <- sapply(strsplit(listoffiles[i], "_"), "[[", 3)
  # open a connection for reading
  con <- file(my_paths[i], open = "r")
  # readLines returns one element per line; here I'm concatenating them all,
  # you will do what you need....
  the_text <- str_c(readLines(con, warn = FALSE), collapse = " ")
  close(con) # close the connection
  result[[i]] <- list(date = the_date,
                      source = the_source,
                      theme = the_theme,
                      text = the_text)
}
#000000000000000000000000000000000000000 end loop
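Don't forget to close the progress bar, and then, as mentioned at the top of this answer, the list can be collapsed into a data frame:
close(progress) # close the Windows progress bar
result_df <- do.call(rbind.data.frame, result)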

How to read csv files in matlab as you would in R?

I have a data set that is saved as a .csv file that looks like the following:
Name,Age,Password
John,9,\i1iiu1h8
Kelly,20,\771jk8
Bob,33,\kljhjj
In R I could open this file by the following:
X = read.csv("file.csv",header=TRUE)
Is there a default command in Matlab that reads .csv files with both numeric and string variables? csvread seems to only like numeric variables.
One step further, in R I could use the attach function to create variables associated with the columns and column headers of the data set, i.e.,
attach(X)
Is there something similar in Matlab?
Although this question is close to being an exact duplicate, the solution suggested in the link provided by @NathanG (i.e., using xlsread) is only one possible way to solve your problem. The author in the link also suggests using textscan, but doesn't provide any information about how to do it, so I thought I'd add an example here:
%# First we need to get the header-line
fid1 = fopen('file.csv', 'r');
Header = fgetl(fid1);
fclose(fid1);
%# Convert Header to cell array
Header = regexp(Header, '([^,]*)', 'tokens');
Header = cat(2, Header{:});
%# Read in the data
fid1 = fopen('file.csv', 'r');
D = textscan(fid1, '%s%d%s', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid1);
Header should now be a row vector of cells, where each cell stores a header. D is a row vector of cells, where each cell stores a column of data.
There is no way I'm aware of to "attach" D to Header. If you wanted, you could put them both in the same structure though, ie:
S.Header = Header;
S.Data = D;
Matlab's new table class makes this easy:
X = readtable('file.csv');
By default this will parse the headers, and use them as column names (also called variable names):
>> x
x =
      Name     Age     Password
    _______    ___    ___________
    'John'      9     '\i1iiu1h8'
    'Kelly'    20     '\771jk8'
    'Bob'      33     '\kljhjj'
You can select a column using its name etc.:
>> x.Name
ans =
'John'
'Kelly'
'Bob'
Available since Matlab 2013b.
See www.mathworks.com/help/matlab/ref/readtable.html
I liked this approach, which is supported by Matlab 2012.
path = 'C:\folder1\folder2\';
data = 'data.csv';
data = dataset('xlsfile', fullfile(path, data));
Of course you could also do the following:
[data, path] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(path, data));
