I'm trying to read a binary file into R, but this file stores its data as rows of binary records rather than as one complete column of values after another. Here's what my data looks like:
Bytes 1-4: int ID
Byte 5: char response character
Bytes 6-9: int Resp Dollars
Byte 10: char Type char
Can anyone help me figure out how to read this file into R?
Here is the code I have tried so far. I tried a couple of things with limited success. Unfortunately, I can't post any of the data on public sites, apologies. I'm relatively new to R, so I need some help with how to improve the code.
> binfile = file("File Location", "rb")
> IDvals = readBin(binfile, integer(), size=4, endian = "little")
> Responsevals = readBin(binfile, character (), size = 5)
> ResponseDollarsvals = readBin (binfile, integer (), size = 9, endian= "little")
Error in readBin(binfile, integer(), size = 9, endian = "little") :
size 9 is unknown on this machine
> Typevals = readBin (binfile, character (), size=4)
> binfile1= cbind(IDvals, Responsevals, ResponseDollarsvals, Typevals)
> dimnames(binfile1)[[2]]
[1] "IDvals" "Responsevals" "ResponseDollarsvals" "Typevals"
> colnames(binfile1)=binfile
Error in `colnames<-`(`*tmp*`, value = 4L) :
length of 'dimnames' [2] not equal to array extent
You could open the file as a raw connection, then issue readBin or readChar calls to read each field, appending each value to its column as you go.
my.file <- file('path', 'rb')
id <- integer(0)
response <- character(0)
...
Loop around this block:
id = c(id, readBin(my.file, integer(), size = 4, endian = 'little'))
response = c(response, readChar(my.file, 1))
...
readChar(my.file, nchars = 1) # for UNIX newlines; use nchars = 2 for Windows newlines
Then create your data frame.
See here: http://www.ats.ucla.edu/stat/r/faq/read_binary.htm
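Putting that together, here is a minimal sketch of the whole loop. It assumes, as above, that each 10-byte record is followed by a single newline byte (drop that final readChar if the records are packed back-to-back); "File Location" is a placeholder path:

my.file <- file("File Location", "rb")
id <- integer(0); response <- character(0)
dollars <- integer(0); type <- character(0)

repeat {
  rec_id <- readBin(my.file, integer(), size = 4, endian = "little")
  if (length(rec_id) == 0) break                # end of file reached
  id       <- c(id, rec_id)                                               # bytes 1-4: int ID
  response <- c(response, readChar(my.file, nchars = 1))                  # byte 5: response char
  dollars  <- c(dollars, readBin(my.file, integer(), size = 4, endian = "little"))  # bytes 6-9
  type     <- c(type, readChar(my.file, nchars = 1))                      # byte 10: type char
  readChar(my.file, nchars = 1)                 # skip the UNIX newline; nchars = 2 on Windows
}
close(my.file)

my.data <- data.frame(id, response, dollars, type)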
I'm new to using GIS with R, and I'm trying to open an ENVI file containing hyperspectral data, following the suggestions from this post: R how to read ENVI .hdr-file?. But I don't seem to be able to do so. I tried three different approaches, and all of them failed. I also can't find any other posts that describe my problem.
# install.packages("rgdal")
# install.packages("raster")
# install.packages("caTools")
library("rgdal")
library("raster")
library("caTools")
dirname <- "S:/LAB-cavender/4_Project_Folders/oakWilt/oak_wilt_image_analyses/R_input/6.15.2021 - Revisions/ENVI export/AISA/Resampled_flights"
filename <- file.path(dirname, "AISA_Flight_4_resampled")
file.exists(filename)
The first option I tried was using the file name only:
x <- read.ENVI(filename)
But I got the following error message:
#Error in read.ENVI(filename) :
# read.ENVI: Could not open input file: S:/LAB-cavender/4_Project_Folders/oakWilt/oak_wilt_image_analyses/R_input/6.15.2021 - Revisions/ENVI export/AISA/Resampled_flights/AISA_Flight_4_resampled
#In addition: Warning message:
# In nRow * nCol * nBand : NAs produced by integer overflow
I then tried the second option, which is passing the file name plus a header file name built with file.path:
headerfile <- file.path(dirname, "AISA_Flight_4_resampled")
x <- read.ENVI(filename = filename,headerfile = headerfile)
Again, I got an error message that says:
#Error in read.ENVI(filename = filename, headerfile = headerfile) :
# read.ENVI: Could not open input header file: S:/LAB-cavender/4_Project_Folders/oakWilt/oak_wilt_image_analyses/R_input/6.15.2021 - Revisions/ENVI export/AISA/Resampled_flights/AISA_Flight_4_resampled
Finally, I tried the third option: the file name plus a header file read with readLines:
hdr_file <- readLines(con = "S:/LAB-cavender/4_Project_Folders/oakWilt/oak_wilt_image_analyses/R_input/6.15.2021 - Revisions/ENVI export/AISA/Resampled_flights/AISA_Flight_4_resampled.hdr")
x <- read.ENVI(filename = filename,headerfile = hdr_file)
But I got the error message:
#Error in read.ENVI(filename = filename, headerfile = hdr_file) :
# read.ENVI: Could not open input header file: ENVIdescription = { Spectrally Resampled File. Input number of bands: 63, output number of bands: 115. [Fri Jun 25 16:57:21 2021]}samples = 5187lines = 6111bands = 115header offset = 0file type = ENVI Standarddata type = 4interleave = bilsensor type = Unknownbyte order = 0map info = {UTM, 1.000, 1.000, 482828.358, 5029367.353, 7.5000000000e-001, 7.5000000000e-001, 15, North, WGS-84, units=Meters}coordinate system string = {PROJCS["UTM_Zone_15N",GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Transverse_Mercator"],PARAMETER["False_Easting",500000.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",-93.0],PARAMETER["Scale_Factor",0.9996],PARAMETER["Latitude_Of_Origin",0.0],UNIT["Meter",1.0]]}default bands = {46,31,16}wavelength units = Nanometersdata ignore value = -9999.00000000e+000band names = { Resampled
# In addition: Warning message:
# In if (!file.exists(headerfile)) stop("read.ENVI: Could not open input header file: ", :
# the condition has length > 1 and only the first element will be used
Any help would be really appreciated!
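Although no answer is recorded for this question, two things stand out in the output. First, read.ENVI never receives a real header path: in the second attempt headerfile is the data file's own path with no .hdr extension, and in the third it is the header's text content (from readLines) rather than a path. Second, the warning In nRow * nCol * nBand : NAs produced by integer overflow suggests that an image of 5187 x 6111 x 115 values overflows caTools::read.ENVI's integer arithmetic regardless. A hedged alternative sketch, not a confirmed fix: let GDAL read the ENVI data/header pair directly through the raster package, which is already loaded above.

library(raster)
# GDAL's ENVI driver looks for AISA_Flight_4_resampled.hdr next to the data file
x <- brick(filename)
x              # inspect dimensions, band count, and CRS
plot(x[[1]])   # quick look at the first band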
I'm reading MNIST image files into R using a function that relies on readBin(). However, running the function line by line, I see that readBin() returns different values for the same line of code (without any change of parameters). How come?
#Getting the data
> download.file("http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz",
+ "t10k-images-idx3-ubyte.gz")
#unzipped the .gz file manually outside of R. The extracted file is 'train-images.idx3-ubyte'
#Using file() to read the 'train-images.idx3-ubyte' file
> f = file("train-images.idx3-ubyte", 'rb')
#this is what 'f' is:
> f
A connection with
description "train-images.idx3-ubyte"
class "file"
mode "rb"
text "binary"
opened "opened"
can read "yes"
can write "no"
#The following lines show readBin executed repeatedly with identical parameters, yet giving a different value each time
> readBin(f, 'integer', n = 1, size = 4, endian = 'big')
[1] 2051
> readBin(f, 'integer', n = 1, size = 4, endian = 'big')
[1] 60000
> readBin(f, 'integer', n = 1, size = 4, endian = 'big')
[1] 28
> readBin(f, 'integer', n = 1, size = 4, endian = 'big')
[1] 28
> readBin(f, 'integer', n = 1, size = 4, endian = 'big')
[1] 0
You open the file connection and never close it, so the result is what you experienced: each call picks up where the previous one stopped, and you simply read the next number.
Repeating the sequence
open connection
readBin
close connection
will give you consistent results.
Alternatively, use readBin("the_file__not_the_connection", 'integer', n = 1, size = 4, endian = 'big'): given a file path rather than a connection, readBin opens and closes the file itself, so you get the first value every time.
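For completeness, a minimal sketch that reads the whole IDX file in one pass, which makes it clear the "changing" values are really the successive header fields (magic number 2051, image count, rows, columns) followed by the pixel data:

f <- file("train-images.idx3-ubyte", "rb")
magic <- readBin(f, "integer", n = 1, size = 4, endian = "big")  # 2051 for image files
n_img <- readBin(f, "integer", n = 1, size = 4, endian = "big")
n_row <- readBin(f, "integer", n = 1, size = 4, endian = "big")
n_col <- readBin(f, "integer", n = 1, size = 4, endian = "big")
# pixel values are unsigned bytes (0-255), n_img * n_row * n_col of them
pixels <- readBin(f, "integer", n = n_img * n_row * n_col,
                  size = 1, signed = FALSE)
close(f)
str(pixels)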
I am having some trouble parsing a Unicode string in a JSON object pulled from an API. As the string has Encoding() "unknown", I need to parse it so the system knows what it's dealing with. The string represents a decoded .png file in UTF-8, which I then need to decode back to latin1 before writing it to a .png file (I know, very backwards; it would be much better if the API pushed a base64 string).
I get the string from the API as a chr object and try to let fromJSON do the job, but no dice: it cuts the string at the first nul (\u0000).
> library(httr)
> library(jsonlite)
> library(tidyverse)
> m
Response [https://...]
Date: 2018-04-10 11:47
Status: 200
Content-Type: application/json
Size: 24.3 kB
{"artifact": "\u0089PNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0000\u0092\u0000\u0000\u0000\u00e3...
> x <- content(m, encoding = "UTF-8", as = "text")
> ## substring of the complete x:
> x <- "{\"artifact\": \"\\u0089PNG\\r\\n\\u001a\\n\\u0000\\u0000\\u0000\\rIHDR\\u0000\\u0000\\u0000\\u0092\\u0000\\u0000\\u0000\\u00e3\\b\\u0006\\u0000\\u0000\\u0000n\\u0005M\\u00ea\\u0000\\u0000\\u0000\\u0006bKGD\\u0000\\u00ff\\u0000\\u00ff\\u0000\\u00ff\\u00a0\\u00bd\\u00a7\\u0093\\u0000\\u0000\\u0016\\u00e7IDATx\\u009c\\u00ed\"}\n"
>
> ## the problem
> "\u0000"
Error: nul character not allowed (line 1)
> ## this works fine
> "\\u0000"
[1] "\\u0000"
>
> y <- fromJSON(txt = x)
> y # note how the string is cut!
$artifact
[1] "\u0089PNG\r\n\032\n"
When I replace the \\u0000 with chr(0), everything works fine. The problem is that the nuls seem to play an important role in the binary representation of the file that I write at the end, so removing them corrupts the resulting image in the viewer.
> x <- str_replace_all(string = x, pattern = "\\\\u0000", replacement = chr(0))
> y <- fromJSON(txt = x)
> y
$artifact
[1] "\u0089PNG\r\n\032\n\rIHDR\u0092ã\b\006n\005Mê\006bKGDÿÿÿ ½§\u0093\026çIDATx\u009cí"
> str(y$artifact)
chr "<U+0089>PNG\r\n\032\n\rIHDR<U+0092>ã\b\006n\005Mê\006bKGDÿÿÿ ½§<U+0093>\026çIDATx<U+009C>í"
> Encoding(y$artifact)
[1] "UTF-8"
> z <- iconv(y$artifact, from = "UTF-8", to = "latin1")
> writeBin(object = z, con = "test.png", useBytes = TRUE)
I have tried these commands on the original string, to no avail:
> library(stringi)
> stri_unescape_unicode(str = x)
Error in stri_unescape_unicode(str = x) :
embedded nul in string: '{"artifact": "<U+0089>PNG\r\n\032\n'
> ## and
> parse(text = x)
Error in parse(text = x) : nul character not allowed (line 1)
Is there no way for R to handle this nul character?
Any idea on how I can get the complete encoded string and write it to a file?
The same story works just fine in Python, which uses a \x convention instead of \u00:
response = r.json()
artifact = response['artifact']
artifact
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
artifact_encoded = artifact.encode('latin-1')
artifact_encoded # note the binary form!
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
fh = open("abc.png", "wb")
fh.write(artifact_encoded)
fh.close()
FYI: I have cut most of the actual string out, but kept enough for testing purposes. The actual string contained other symbols, and it seemed impossible to copy-paste the string into a script and assign it to a new variable (e.g. y <- "{\"artifact\": \"\\u0089PNG\\..."). So I don't know what I would do if I had to read the string from e.g. a .csv file.
Any pointers in any of my struggles would be appreciated :)
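No resolution is recorded here, but since R character vectors simply cannot hold an embedded nul, one workaround is to never let the decoded bytes become an R string at all: unescape the JSON \uXXXX sequences by hand into a raw vector and write that with writeBin. A rough sketch using the test substring x from above; the regex-based unescaper is illustrative only, not a full JSON parser:

# strip the JSON wrapper from the test string x defined earlier
esc <- sub('^\\{"artifact": "', "", x)
esc <- sub('"\\}\n$', "", esc)

# split into \uXXXX escapes, simple backslash escapes, and plain characters
tokens <- regmatches(esc, gregexpr('\\\\u[0-9a-fA-F]{4}|\\\\.|.', esc))[[1]]
to_byte <- function(tok) {
  if (startsWith(tok, "\\u")) as.raw(strtoi(substr(tok, 3, 6), 16L))
  else if (tok == "\\r")  as.raw(0x0d)
  else if (tok == "\\n")  as.raw(0x0a)
  else if (tok == "\\b")  as.raw(0x08)
  else if (tok == "\\\\") as.raw(0x5c)
  else if (tok == '\\"')  as.raw(0x22)
  else charToRaw(tok)
}
bytes <- do.call(c, lapply(tokens, to_byte))
writeBin(bytes, "test.png")   # raw bytes, nuls included, as in the Python version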
I am currently facing a strange problem when saving lists and 'sublists' with R. The title may not be very clear, but here is what is troubling me:
Given some data (here the data is totally artificial, but the relevance of the model isn't the point):
set.seed(1)
a0 = rnorm(10000,10,2)
b1 = rnorm(10000,10,2)
b2 = rnorm(10000,10,2)
b3 = rnorm(10000,10,2)
data = data.frame(a0,b1,b2,b3)
And a function returning a list of complex objects (let's say lm() objects) :
test = function(k){
tt = vector('list',k)
for(i in 1:k) tt[[i]] = lm(a0~b1+b2+b3,data = data)
tt
}
Our test function returns a list of lm() objects. Let's look at the size of this object:
ok = test(2)
> object.size(ok)
4019336 bytes
Let's create ok2, an exactly similar object, but built outside a function:
ok2 = vector('list',2)
ok2[[1]] = lm(a0~b1+b2+b3,data = data)
ok2[[2]] = lm(a0~b1+b2+b3,data = data)
... and check its size:
> object.size(ok2)
4019336 bytes
Here we are: ok and ok2 are exactly the same, and R tells us so.
The problem: if we save these objects to the hard drive as R objects (with save() or saveRDS()):
save(ok,file='ok.RData')
save(ok2,file='ok2.RData')
Their sizes on the hard drive are 3,366,005 bytes and 1,678,851 bytes respectively.
ok is twice as big as ok2, although they are exactly the same!
Even stranger, if you save a 'sublist' of our objects, let's say ok[[1]] and ok2[[1]] (once again totally identical objects):
a = ok[[1]]
a2 = ok2[[1]]
save(a,file='console/a.RData')
save(a2,file='console/a2.RData')
Their sizes on the hard drive are 2,523,284 bytes and 838,977 bytes respectively.
Two things :
Why does the size of a differ from the size of a2 on hard drive? Why does the size of ok differ from the size of ok2 on hard drive?
And why does a, which is exactly half of ok, take 2,523,284 bytes while ok takes 3,366,005 bytes on disk?
Am I missing something?
PS: I ran this test under Windows 7 32-bit with R 2.15.1, 2.15.2, 2.15.3, and 3.0.0, and under Debian with R 2.15.1 and 2.15.2. I hit this problem every time.
EDIT
Thanks to @user1609452, here is a little trick which seems to work:
test2 = function(k){
tt = vector('list',k)
for(i in 1:k){
tt[[i]] = lm(a0~b1+b2+b3,data = data)
attr(tt[[i]]$terms,".Environment") = .GlobalEnv
attr(attr(tt[[i]]$model,"terms"),".Environment") = .GlobalEnv
}
tt
}
Formula objects carry their own environment, with a lot of stuff in it. Setting that environment to NULL or to .GlobalEnv seems to work: functions like predict.lm() still behave, and our saved objects have the right size on the hard drive. I am not sure why, though.
Look at:
> attr(ok[[1]]$terms,".Environment")
<environment: 0x9bcf3f8>
> attr(ok2[[1]]$terms,".Environment")
<environment: R_GlobalEnv>
Also:
> ls(envir = attr(ok[[1]]$terms,".Environment"))
[1] "i" "k" "tt"
So ok is dragging the environment of the function around with it.
Also read ?object.size:
The calculation is of the size of the object, and excludes the
space needed to store its name in the symbol table.
Associated space (e.g. the environment of a function and what the
pointer in a ‘EXTPTRSXP’ points to) is not included in the
calculation.
For example, define a test2 and an ok3:
test2 = function(k){
tt = vector('list',k)
for(i in 1:k) tt[[i]] = lm(a0~b1+b2+b3,data = data)
rr = tt
tt
}
ok3 <- test2(2)
save(ok3, file = 'ok3.RData')
> file.info('ok3.RData')$size
[1] 5043933
> file.info('ok.RData')$size
[1] 3366005
> file.info('ok2.RData')$size
[1] 1678851
> ls(envir = attr(ok3[[1]]$terms,".Environment"))
[1] "i" "k" "rr" "tt"
So ok is roughly twice as big as ok2 because it carries the extra tt, and ok3 is three times as big because it carries both tt and rr.
> c(object.size(ok),object.size(ok2),object.size(ok3))
[1] 4019336 4019336 4019336
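A quick way to see the hidden payload without touching the disk, sketched here: serialize() (the same machinery save() uses) includes captured environments, unlike object.size(), so the serialized lengths roughly mirror the file sizes:

length(serialize(ok,  NULL))   # large: includes the function environment holding tt
length(serialize(ok2, NULL))   # smaller: terms reference the global environment
length(serialize(ok3, NULL))   # largest: the environment holds both tt and rr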
There is related discussion here.
I am trying to parse some online weather data with R. The data is a binary file that has been gzipped. An example file is:
ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz
If I download the file to my computer and manually unzip it, I can easily do the following:
myFile <- "/tmp/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101"
to.read = file( myFile, "rb")
myPoints <- readBin(to.read, double(), n = 1e6, size = 4, endian = "little")
What I would prefer to do is automate both the download/unzip along with the read. So I thought that would be as simple as the following:
p <- gzcon( url( "ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz" ) )
myPoints <- readBin(p, double(), n = 1e6, size = 4, endian = "little")
This seems to work just dandy, but in the manual step the vector myPoints has length 518400, which is accurate. However, if R handles the download and read as in the second example, I get a different length of vector every time I run the code. Seriously. I'm not smoking anything. I swear. I ran it multiple times, and each time the vector had a different length, always less than the expected 518400.
I also tried getting R to download the gzip file using the following:
temp <- tempfile()
myFile <- download.file("ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz",temp)
I found that this would often return a warning about the file not being the expected size, like the following:
Warning message:
In download.file("ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz", :
downloaded length 162176 != reported length 179058
Any tips you can throw my way that might help me solve this?
-J
Try this:
R> remfname <- "ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz"
R> locfname <- "/tmp/data.gz"
R> download.file(remfname, locfname)
trying URL 'ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz'
ftp data connection made, file length 179058 bytes
opened URL
==================================================
downloaded 174 Kb
R> con <- gzcon(file(locfname, "rb"))
R> myPoints <- readBin(con, double(), n = 1e6, size = 4, endian = "little")
R> close(con)
R> str(myPoints)
num [1:518400] 0 0 0 0 0 0 0 0 0 0 ...
R>
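A hedged follow-up on why the streaming version misbehaves: reading straight from gzcon(url(...)) most likely returns only the bytes the connection has delivered so far, so readBin stops short of the full 518400 values; downloading to a local file first sidesteps that. Here is the same recipe with a temporary file instead of a hard-coded path (mode = "wb" matters on Windows):

remfname <- "ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz"
locfname <- tempfile(fileext = ".gz")
download.file(remfname, locfname, mode = "wb")   # binary mode to avoid corruption
con <- gzcon(file(locfname, "rb"))
myPoints <- readBin(con, double(), n = 1e6, size = 4, endian = "little")
close(con)
unlink(locfname)                                 # clean up the temp file
stopifnot(length(myPoints) == 518400)            # the full grid, as in the manual run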