How can I get the number of significant digits of the data stored in a NetCDF file? - netcdf

I need to know the precision of the data stored in a NetCDF file.
I think that it is possible to know this precision because, when I dump a NetCDF file using ncdump, the number of significant digits displayed depends on the particular NetCDF file that I am using.
So, for one file I get:
Ts = -0.2121478, -0.08816089, -0.4285178, -0.3446428, -0.4800949,
-0.4332879, -0.2057121, -0.06589077, -0.001647412, 0.007711744,
And for another one:
Ts = -2.01, -3.6, -1, -0.53, -1.07, -0.7, -0.56, -1.3, -0.93, -1.41, -0.83,
-0.8, -2.13, -2.91, -1.13, -1.2, -2.23, -1.77, -2.93, -0.7, -2.14, -1.36,
I also have to say that there is no information about precision in any attribute, neither global nor local to the variable. You can see this in the dump of the header of the NetCDF file:
netcdf pdo {
dimensions:
	time = UNLIMITED ; // (809 currently)
variables:
	double time(time) ;
		time:units = "months since 1900-01-01" ;
		time:calendar = "gregorian" ;
		time:axis = "T" ;
	double Ts(time) ;
		Ts:missing_value = NaN ;
		Ts:name = "Ts" ;

// global attributes:
		:Conventions = "CF-1.0" ;
}
Does anybody know how I can get the number of significant digits of the data stored in a NetCDF file?

This is a tricky question: what ncdump (and many other pretty-printers) does is simply strip the trailing zeros from the fractional part, but does that say anything about the real (observed/calculated/...) precision of the values? Something measured with three decimals of accuracy might be 1.100, yet ncdump will still print it as 1.1. If you want to know the true (physical?) significance, it would indeed have to be included as an attribute, or documented elsewhere.
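A two-line Python check makes the point: the textual form of a float simply drops trailing fractional zeros, so the printed width tells you nothing about measurement precision.

```python
# Trailing zeros in the fractional part are not preserved when a float
# is converted back to text, so 1.100 and 1.1 are indistinguishable.
x = 1.100
print(repr(x))   # prints '1.1'
print(x == 1.1)  # prints True
```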
For a large set of numbers, counting the maximum number of significant digits in the fractional part of the numbers could be a first indication of the precision. If that is what you are looking for, something like this might work in Python:
import numpy as np

a = np.array([1.01, 2.0])
b = np.array([1.10, 1])
c = np.array([10., 200.0001])
d = np.array([1, 2])

def count_max_significant_fraction(array):
    # Return zero for any integer type array:
    if issubclass(array.dtype.type, np.integer):
        return 0
    decimals = [s.rstrip('0').split('.')[1] for s in array.astype('str')]
    return len(max(decimals, key=len))

print( count_max_significant_fraction(a) )  # prints "2"
print( count_max_significant_fraction(b) )  # prints "1"
print( count_max_significant_fraction(c) )  # prints "4"
print( count_max_significant_fraction(d) )  # prints "0"
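Another rough upper bound comes from the storage type itself: a float32 variable can carry at most about 7 meaningful decimal digits, a float64 about 15-16. Since Ts above is declared double, this only bounds the precision rather than determining it, but it is a quick sanity check:

```python
import numpy as np

# np.finfo reports the number of guaranteed-exact decimal digits
# for each floating-point storage type.
p32 = np.finfo(np.float32).precision
p64 = np.finfo(np.float64).precision
print(p32, p64)  # prints "6 15"
```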

I suggest you adopt the convention NCO uses and name the precision attribute "number_of_significant_digits" and/or "least_significant_digit". Terms are defined in the lengthy precision discussion that starts here.

Related

How to read a specific element from a binary file in Julia?

I have a binary file. If I want to read all the numeric data in an array at once, the code is below:
y = Array{Float32}(undef, 1000000, 1);
read!("myfile.bin", y)
I get an array y, a 1000000×1 Array{Float32, 2}.
My question is: I don't want to read all the data into an array at once, since that uses a lot of memory. I want to read a specific element from the binary file each time. For example, how can I read only the third element in the binary file, i.e. the third element of array y?
If you just want to read a single element, you don't need to read into an array:
io = open("myfile.bin", "r") # open file for reading
Nbytes = sizeof(Float32) # number of bytes per element
seek(io, (3-1)*Nbytes) # move to the 3rd element
val = read(io, Float32) # read a Float32 element
close(io)
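For comparison, the same seek-and-read pattern looks like this in Python's standard library (the file name and sample values here are made up for the demo):

```python
import os
import struct
import tempfile

# Write a small demo file of four float32 values.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(struct.pack("<4f", 1.5, 2.5, 3.5, 4.5))

nbytes = struct.calcsize("f")        # bytes per element (4)
with open(path, "rb") as f:
    f.seek((3 - 1) * nbytes)         # move to the 3rd element
    (val,) = struct.unpack("<f", f.read(nbytes))
print(val)  # prints 3.5
os.remove(path)
```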
BTW: if you want an array for your data, you should probably use a length-1000000 Array{Float32, 1} instead of a 1000000×1 Array{Float32, 2}:
y = Array{Float32}(undef, 1000000)
# or
y = Array{Float32, 1}(undef, 1000000)
# or
y = Vector{Float32}(undef, 1000000)
Alternatively, you could mmap the file to access it as an array (remember to load the Mmap standard library first):
using Mmap
fd = open("myfile.bin")
y = Mmap.mmap(fd, Vector{Float32}, 1000000)
println(y[3])
This only maps the file into virtual memory; pages are loaded into RAM on demand as you access them. You can also make it writeable.
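The memory-mapping alternative has a stdlib counterpart in Python too; this sketch writes a made-up demo file and then reads one element through the mapping:

```python
import mmap
import os
import struct
import tempfile

# Write a small demo file of four float32 values.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(struct.pack("<4f", 1.5, 2.5, 3.5, 4.5))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Read the 3rd float32 without copying the whole file into memory.
    (val,) = struct.unpack_from("<f", mm, (3 - 1) * 4)
    mm.close()
print(val)  # prints 3.5
os.remove(path)
```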

Getting R observations back to NodeJS using "sort"

I am having a weird issue with r-script (the npm module) and passing its output to NodeJS.
Using:
needs("arules")
data <- read.transactions(input[[1]], sep = ",")
library(arules)
# default settings result in zero rules learned
groceryrules <- apriori(data, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
summary(groceryrules)
inspect(groceryrules[1:5])
I get the result fine in nodeJS as:
[ { '2': '=>', lhs: '{potted plants}', rhs: '{whole milk}', support: 0.0069, confidence: 0.4, lift: 1.5655, count: 68, _row: '[1]' }, { '2': '=>', lhs: '{pasta}', rhs: '{whole milk}', support: 0.0061, confidence: 0.4054, lift: 1.5866, count: 60, _row: '[2]' } ...]
However, changing the last line to:
inspect(sort(groceryrules, by = "lift")[1:5])
I get no output. If I set the interval to 1:2, it correctly prints the top two observations (by lift).
Why can't I view more than 2 items when using sort?
My code in NodeJS:
var R = require("r-script");
var out = R("tests.R");
out = out.data(__dirname+"\\groceries.csv");
out = out.callSync();
console.log(out)
Thanks!
I managed to find the solution.
Using:
out <- capture.output(inspect(sort(groceryrules,by="lift")[1:10]))
out
This captures the inspect output as a character vector, which is then passed to the NodeJS server as:
[' lhs rhs support confidence lift count',
'[1] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477 69',...]
A simple split on each string should now make the data manageable.
EDIT:
Managed to find a better solution that gets the JSON in the correct format straight away, by using:
data <- sort(groceryrules, by = "lift")
as(data, "data.frame")
This way it correctly converts the frame to JSON.

Representing a float or a binary as a 32 bit signed integer in R

I've been given a task to write an API for the AR.Drone 2.0 in R. I know it's probably not the wisest choice of language as there are good validated APIs written in Python and JS, but I took the challenge anyway.
I was doing pretty well until I had to format the AT* command string that is sent for the drone.
The commands accepts arguments that can be either a quoted string, an integer, which are sent as is, or binary and float (single precision IEEE754 floating point value between -1 to 1) that must be represented as 32 bit signed integers.
I was able to do the conversion through 2 online converters, converting first from float or binary to hex, and then hex to 32 bit signed integer, so I have the basic conversion for the most common values; however, I would like to use R's built-in functions or added packages to do the conversion.
Python's struct function handles this easily:
import struct
print "Float , Signed Integer"
for i in range(-10, 11, 2):
    z = float(i) / 10
    Y = struct.unpack('i', struct.pack('f', z))[0]
    print "%.1f , %d" % (z, Y)

land = 0b10001010101000000000000000000
take_off = land + 0b1000000000
print "Binary representation is simple as just using the %d format:"
print "Land code: %d, Take off code: %d" % (land, take_off)
This code will produce the following output:
Float , Signed Integer
-1.0 , -1082130432
-0.8 , -1085485875
-0.6 , -1088841318
-0.4 , -1093874483
-0.2 , -1102263091
0.0 , 0
0.2 , 1045220557
0.4 , 1053609165
0.6 , 1058642330
0.8 , 1061997773
1.0 , 1065353216
Binary representation is simple as just using the %d format
Land code: 290717696, Take off code: 290718208
Finally, to my question: how do I reproduce this functionality/conversion in R?
Thanks heaps, Ido
Edit 1
I found a solution in R for the binary to Int conversion, using the base function strtoi
(land <- strtoi("10001010101000000000000000000", base=2))
[1] 290717696
(takeOff <- land + strtoi("1000000000", base=2))
[1] 290718208
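For what it's worth, the Python counterpart of R's strtoi(x, base = 2) is int(x, 2), which reproduces the same two codes:

```python
# Python analogue of strtoi(..., base = 2): parse binary strings as integers.
land = int("10001010101000000000000000000", 2)
take_off = land + int("1000000000", 2)
print(land, take_off)  # prints "290717696 290718208"
```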
Edit 2
I kept scanning the web for solutions and found a few that, combined, get me my desired conversion, though probably not in the most elegant way.
First, I wrote a function to calculate the mantissa and the exponent:
find_mantissa <- function(f) {
  i <- 1; rem <- abs(f); mantissa <- NULL
  man_dec_point <- FALSE
  while (length(mantissa) < 24) {
    temp_rem <- rem - 2^-i
    if (!isTRUE(man_dec_point)) {
      if (temp_rem >= 0) {
        man_dec_point <- TRUE
        mantissa <- "."
      }
    } else {
      mantissa <- c(mantissa, ifelse(temp_rem >= 0, "1", "0"))
    }
    rem <- ifelse(temp_rem >= 0, temp_rem, rem)
    i <- i + 1
    next
  }
  return(c(paste0(mantissa[-1], collapse = ""), 24 - i))
}
find_mantissa(0.68)
[1] "01011100001010001111010" "-1"
Then I use a function that I adapted from r-bloggers (apologies for not including the link, I'm limited to 2 links only):
dec2binstr <- function(p_number) {
  bin_str <- NULL
  while (p_number > 0) {
    digit <- p_number %% 2
    p_number <- floor(p_number / 2)
    bin_str <- paste0(digit, bin_str)
  }
  return(bin_str)
}
dec2binstr(127-1)
[1] "1111110"
Finally, I wrapped it all in a function that converts the float to binary and then to a 32 bit signed integer, by pasting together the sign bit (always "0", but for negative numbers I inverted the entire string and added a minus sign to the resulting integer), the exponent bits (left-padded with 0s) and the mantissa. I also replaced strtoi, which can't handle negative signed binaries, with another function (thanks to a solution I found here on SO).
library(stringi)  # provides stri_length()

str2num <- function(x) {
  y <- as.numeric(strsplit(x, "")[[1]])
  sum(y * 2^rev((seq_along(y) - 1)))
}

float2Int <- function(decfloat) {
  bit32 <- ifelse(decfloat < 0, -1, 1)
  mantissa <- find_mantissa(decfloat)
  exp <- dec2binstr(127 + as.numeric(mantissa[2]))
  long_exp <- paste0(rep("0", 8 - stri_length(exp)), exp)
  unsigned <- paste0(long_exp, mantissa[1])
  if (decfloat < 0) {
    unsigned <- sapply(lapply(strsplit(unsigned, split = NULL),
                              function(x) ifelse(x == "0", "1", "0")),
                       paste0, collapse = "")
  }
  binary <- paste0("0", unsigned)
  return(c(binary, bit32 * str2num(binary)))
}
float2Int(0.468)
[1] "00111110111011111001110110110010" "1055890866"
> float2Int(-0.468)
[1] "01000001000100000110001001001101" "-1091592781"
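One caveat worth checking (sketched in Python, mirroring the struct snippet at the top of the question): inverting every bit and negating is a one's-complement negation, so for negative inputs the result differs by one from the raw IEEE-754 bit pattern reinterpreted as a two's-complement 32-bit integer:

```python
import struct

def float_to_int32(x):
    # Pack as IEEE-754 float32, then reinterpret the same 4 bytes
    # as a signed (two's-complement) 32-bit integer.
    return struct.unpack("<i", struct.pack("<f", x))[0]

print(float_to_int32(0.468))   # prints 1055890866, same as float2Int(0.468)
print(float_to_int32(-0.468))  # prints -1091592782, one off from float2Int(-0.468)
```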
It's been quite a journey getting to this solution, and I'm sure it's not the most efficient implementation (and I'm not sure it's correct for floats greater than 1 or smaller than -1), but it works for my purposes.
Any comments on how to improve this code would be highly appreciated.
Cheers, Ido

Which of these is pythonic? and Pythonic vs. Speed

I'm new to python and just wrote this module level function:
def _interval(patt):
    """ Converts a string pattern of the form '1y 42d 14h56m'
    to a timedelta object.
    y - years (365 days), M - months (30 days), w - weeks, d - days,
    h - hours, m - minutes, s - seconds"""
    m = _re.findall(r'([+-]?\d*(?:\.\d+)?)([yMwdhms])', patt)
    args = {'weeks': 0.0,
            'days': 0.0,
            'hours': 0.0,
            'minutes': 0.0,
            'seconds': 0.0}
    for (n, q) in m:
        if q == 'y':
            args['days'] += float(n)*365
        elif q == 'M':
            args['days'] += float(n)*30
        elif q == 'w':
            args['weeks'] += float(n)
        elif q == 'd':
            args['days'] += float(n)
        elif q == 'h':
            args['hours'] += float(n)
        elif q == 'm':
            args['minutes'] += float(n)
        elif q == 's':
            args['seconds'] += float(n)
    return _dt.timedelta(**args)
My issue is with the for loop here, i.e. the long if/elif block, and I was wondering if there is a more pythonic way of doing it.
So I re-wrote the function as:
def _interval2(patt):
    m = _re.findall(r'([+-]?\d*(?:\.\d+)?)([yMwdhms])', patt)
    args = {'weeks': 0.0,
            'days': 0.0,
            'hours': 0.0,
            'minutes': 0.0,
            'seconds': 0.0}
    argsmap = {'y': ('days', lambda x: float(x)*365),
               'M': ('days', lambda x: float(x)*30),
               'w': ('weeks', lambda x: float(x)),
               'd': ('days', lambda x: float(x)),
               'h': ('hours', lambda x: float(x)),
               'm': ('minutes', lambda x: float(x)),
               's': ('seconds', lambda x: float(x))}
    for (n, q) in m:
        args[argsmap[q][0]] += argsmap[q][1](n)
    return _dt.timedelta(**args)
I tested the execution times of both the codes using timeit module and found that the second one took about 5-6 seconds longer (for the default number of repeats).
So my questions are:
1. Which code is considered more pythonic?
2. Is there a still more pythonic way of writing this function?
3. What about the trade-offs between pythonicity and other aspects (like speed, in this case) of programming?
p.s. I kinda have an OCD for elegant code.
EDITED _interval2 after seeing this answer:
argsmap = {'y': ('days', 365),
           'M': ('days', 30),
           'w': ('weeks', 1),
           'd': ('days', 1),
           'h': ('hours', 1),
           'm': ('minutes', 1),
           's': ('seconds', 1)}
for (n, q) in m:
    args[argsmap[q][0]] += float(n)*argsmap[q][1]
You seem to create a lot of lambdas every time you parse. You really don't need a lambda, just a multiplier. Try this:
def _factor_for(what):
    if what == 'y': return 365
    elif what == 'M': return 30
    elif what in ('w', 'd', 'h', 's', 'm'): return 1
    else: raise ValueError("Invalid specifier %r" % what)

for (n, q) in m:
    args[argsmap[q][0]] += _factor_for(q) * float(n)
To speed things up, though, don't make _factor_for a local function or a method.
(I have not timed this, but) if you're going to use this function often it might be worth pre-compiling the regex expression.
Here's my take on your function:
re_timestr = re.compile("""
    ((?P<years>\d+)y)?\s*
    ((?P<months>\d+)M)?\s*
    ((?P<weeks>\d+)w)?\s*
    ((?P<days>\d+)d)?\s*
    ((?P<hours>\d+)h)?\s*
    ((?P<minutes>\d+)m)?\s*
    ((?P<seconds>\d+)s)?
    """, re.VERBOSE)

def interval3(patt):
    p = {}
    match = re_timestr.match(patt)
    if not match:
        raise ValueError("invalid pattern : %s" % (patt))
    for k, v in match.groupdict("0").iteritems():
        p[k] = int(v)  # cast string to int
    p["days"] += p.pop("years") * 365   # convert years to days
    p["days"] += p.pop("months") * 30   # convert months to days
    return datetime.timedelta(**p)
update
From this question, it looks like precompiling regex patterns does not bring about noticeable performance improvement since Python caches and reuses them anyway. You only save the time it takes to check the cache which, unless you are repeating it numerous times, is negligible.
update2
As you quite rightly pointed out, this solution does not support interval3("1h 30s" + "2h 10m"). However, timedelta supports arithmetic operations, which means you can still express it as interval3("1h 30s") + interval3("2h 10m").
Also, as mentioned by some of the comments on the question, you may want to avoid supporting "years" and "months" in the inputs. There's a reason why timedelta does not support those arguments: they cannot be handled correctly (and incorrect code is almost never elegant).
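The composition point can be shown directly: timedelta objects add cleanly, so parsing the parts separately and summing them is equivalent.

```python
from datetime import timedelta

# Equivalent of interval3("1h 30s") + interval3("2h 10m"),
# written with timedelta arithmetic directly.
a = timedelta(hours=1, seconds=30)
b = timedelta(hours=2, minutes=10)
print(a + b)  # prints "3:10:30"
```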
Here's another version, this time with support for float, negative values, and some error checking.
re_timestr = re.compile("""
    ^\s*
    ((?P<weeks>[+-]?\d+(\.\d*)?)w)?\s*
    ((?P<days>[+-]?\d+(\.\d*)?)d)?\s*
    ((?P<hours>[+-]?\d+(\.\d*)?)h)?\s*
    ((?P<minutes>[+-]?\d+(\.\d*)?)m)?\s*
    ((?P<seconds>[+-]?\d+(\.\d*)?)s)?\s*
    $
    """, re.VERBOSE)

def interval4(patt):
    p = {}
    match = re_timestr.match(patt)
    if not match:
        raise ValueError("invalid pattern : %s" % (patt))
    for k, v in match.groupdict("0").iteritems():
        p[k] = float(v)  # cast string to float
    return datetime.timedelta(**p)
Example use cases:
>>> print interval4("1w 2d 3h4m") # basic use
9 days, 3:04:00
>>> print interval4("1w") - interval4("2d 3h 4m") # timedelta arithmetic
4 days, 20:56:00
>>> print interval4("0.3w -2.d +1.01h") # +ve and -ve floats
3:24:36
>>> print interval4("0.3x") # reject invalid input
Traceback (most recent call last):
File "date.py", line 19, in interval4
raise ValueError("invalid pattern : %s" % (patt))
ValueError: invalid pattern : 0.3x
>>> print interval4("1h 2w") # order matters
Traceback (most recent call last):
File "date.py", line 19, in interval4
raise ValueError("invalid pattern : %s" % (patt))
ValueError: invalid pattern : 1h 2w
Yes, there is. Use time.strptime instead:
Parse a string representing a time
according to a format. The return
value is a struct_time as returned
by gmtime() or localtime().

How to arrange elements of vector in Fortran?

I have two p*n arrays, y and ymiss. y contains real numbers and NA's. ymiss contains 1's and 0's, so that ymiss(i,j)==0 if y(i,j)==NA, and 1 otherwise. I also have a 1*n array ydim which tells how many real numbers there are in y(1:p,n), so ydim has values from 0 to p.
In the R programming language, I can do the following:
if(ydim!=p && ydim!=0)
y(1:ydim(t), t) = y(ymiss(,t), t)
That code moves all the real numbers of y(,t) to the front, like this:
first there's, for example,
y(,t) = (3,1,NA,6,2,NA)
and after the code it's
y(,t) = (3,1,6,2,2,NA)
Now I will only need the first 1:ydim(t) elements, so it doesn't matter what the rest are.
The question is, how can I do something like that in Fortran?
Thanks,
Jouni
The "where statement" and the "merge" intrinsic function are powerful, operating on selected positions in arrays, but they don't move items to the front of an array. You can do it with old-fashioned code and explicit indexing (which could be packaged into a function), e.g.:
k = 1
do i = 1, n
   if (ymiss(i) == 1) then
      y(k) = y(i)
      k = k + 1
   end if
end do
What you want could be done with array intrinsics using the "pack" intrinsic. Convert ymiss into a logical array: 0 --> .false., 1 --> .true.. Then use code like (tested without the second index):
y(1:ydim(t), t) = pack (y (:,t), ymiss (:,t))
Edit to add example code, showing use of Fortran intrinsics "where", "count" and "pack". "where" alone can't solve the problem, but "pack" can. I used "< -90" as NaN for this example. The step "y (ydim+1:LEN) = -99.0" isn't required by the OP, who doesn't need to use these elements.
program test1

   integer, parameter :: LEN = 6
   real, dimension (1:LEN) :: y = [3.0, 1.0, -99.0, 6.0, 2.0, -99.0]
   real, dimension (1:LEN) :: y2
   logical, dimension (1:LEN) :: ymiss
   integer :: ydim

   y2 = y
   write (*, '(/ "The input array:" / 6(F6.1) )' ) y

   where (y < -90.0)
      ymiss = .false.
   elsewhere
      ymiss = .true.
   end where

   ydim = count (ymiss)

   where (ymiss) y2 = y
   write (*, '(/ "Masking with where does not rearrange:" / 6(F6.1) )' ) y2

   y (1:ydim) = pack (y, ymiss)
   y (ydim+1:LEN) = -99.0
   write (*, '(/ "After using pack, and ""erasing"" the end:" / 6(F6.1) )' ) y

   stop
end program test1
Output is:
The input array:
3.0 1.0 -99.0 6.0 2.0 -99.0
Masking with where does not rearrange:
3.0 1.0 -99.0 6.0 2.0 -99.0
After using pack, and "erasing" the end:
3.0 1.0 6.0 2.0 -99.0 -99.0
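Since the question came from R, it may help to see the same pack-and-compact idea in numpy (a sketch using NaN in the role of NA, as in the original example):

```python
import numpy as np

# numpy analogue of Fortran's pack: a boolean mask selects the valid
# values, and the slice assignment compacts them to the front.
y = np.array([3.0, 1.0, np.nan, 6.0, 2.0, np.nan])
keep = ~np.isnan(y)          # the logical ymiss mask
ydim = int(keep.sum())       # number of real values
y[:ydim] = y[keep]           # move them to the front
print(y[:ydim])  # prints "[3. 1. 6. 2.]"
```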
In Fortran you can't store NA in an array of real numbers; you can only store real numbers. So you'll probably want to replace NA's with some value not likely to be present in your data: huge() might be suitable. 2D arrays are no problem at all for Fortran. You might want to use a 2D array of logicals to replace ymiss, rather than a 2D array of 1s and 0s.
There is no simple intrinsic to achieve what you want; you'd need to write a function. However, a more Fortran way of doing things would be to use the array of logicals as a mask for the operations you want to carry out.
So, here's some fragmentary Fortran code, not tested:
! Declarations
real(8), dimension(m,n) :: y, ynew
logical, dimension(m,n) :: ymiss
! Executable
where (ymiss) ynew = func(y) ! here func() is whatever your function is
