netcdf dimension variable interpretation

I'm trying to understand whether this is allowed by the NetCDF standard. It does not make sense to me, but maybe there is a reason why it is not forbidden at the library level. The ncdump output:
netcdf tt {
dimensions:
one = 2 ;
two = 1 ;
variables:
int64 one(two) ;
data:
one = 1 ;
}
And code to produce this file in python:
from netCDF4 import Dataset
rr = Dataset('tt.nc', 'w')
rr.createDimension('one', 2)
rr.createDimension('two', 1)
# Note: ('two',) is a one-element tuple; ('two') is just the string 'two'
var1 = rr.createVariable('one', 'i8', ('two',))
var1[:] = 1
rr.close()
Note that the variable has the same name as one dimension but is declared on a different dimension?!
So, two questions:
Is this allowed by the standard?
If not, should it be restricted by the libraries?

It's valid because the names of attributes, names of dimensions, and names of variables all exist in different namespaces.

It's valid, but it obviously makes for confusing code and output, and it would not be acceptable in professional practice. Note, though, that one-dimensional variables that have the same name and size as the dimension they are defined on are called "coordinate variables."
For example, you'll often see a variable named latitude that is 1D and has a dimension named latitude. ncks or ncdump should show a (CRD) next to that variable, indicating that it is indeed the coordinate array of latitudes.
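To make the contrast concrete, here is a minimal netCDF4-python sketch of a proper coordinate variable (the file name and values are purely illustrative):
from netCDF4 import Dataset

nc = Dataset('coords.nc', 'w')
nc.createDimension('latitude', 3)
# Same name AND same dimension as 'latitude' -> a coordinate variable
lat = nc.createVariable('latitude', 'f8', ('latitude',))
lat[:] = [-10.0, 0.0, 10.0]  # the coordinate values along that dimension
nc.close()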

Related

In Julia, how do I find out why Dict{String, Any} is Any?

I am very new to Julia and mostly code in Python these days. I am using Julia to work with and manipulate HDF5 files.
When I get to writing out (h5write), I get an error because the data argument is of mixed type, and I need to find out why.
The error message says I am trying to pass an Array{Dict{String,Any},4}, but when I look at the values (and it is a huge structure), I see a lot of 0xff and values like that. How do I quickly find out why the element type is Any and not a single concrete type?
Just to make this an answer:
If my_dicts is an Array{Dict{String, Any}, 4}, then one way of working out what types are hiding in the Any part of the dict is:
unique(typeof.(values(my_dicts[1])))
To explain:
my_dicts[1] picks out the first element of your Array, i.e. one of your Dict{String, Any}
values then extracts the values, which is the Any part of the dictionary,
typeof. (notice the dot) broadcasts the typeof function over all elements returned by values, returning the types of all of these elements; and
unique takes the list of all these types and reduces it to its unique elements, so you'll end up with a list of each separate type contained in the Any part of your dictionary.
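Since you mention mostly coding in Python, the analogous check there - assuming my_dicts is a plain list of dicts - would be:
# Collect the distinct value types in the first dict
distinct_types = set(type(v) for v in my_dicts[0].values())
print(distinct_types)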

Expand a netCDF Variable into an additional Dimension or multiple Variables

I am working with a very large netCDF file in three dimensions (lat/lon/time). The resolution is 300 meters and the time variable has 25 steps, which leads to 64800x129600x25 cells.
The one variable contained in the file is an integer (ranging from -36 to 120), but it represents an underlying factor, which is the problem.
It is a land cover data set, so for example: -20 means the cell is of the land type Forest, or 10 means the cell is covered by water.
I want to reshape the netCDF file such that there is an additional dimension representing every factor level of the original variable. The variable would then be just a 1 or 0 per cell, indicating the presence or absence of each factor level at a certain lat/lon/time.
The dimensions would then be lat/lon/time/land type.
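To make the intended transform concrete, here is a toy NumPy sketch of the expansion (purely illustrative; lc is a hypothetical land-cover array):
import numpy as np

# Toy land-cover data with dimensions (lat, lon, time)
lc = np.random.choice([-20, 10, 30], size=(4, 5, 2))
levels = np.unique(lc)  # the factor levels present in the data
# One 0/1 layer per level -> shape (lat, lon, time, level)
onehot = (lc[..., None] == levels).astype(np.int8)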
Here is an example data set that does not concern land type but is small enough to be used for testing, and here is some code to read it in:
library(ncdf4)
# Download the data
download.file("http://schubert.atmos.colostate.edu/~cslocum/code/air.sig995.2012.nc",
              mode = "wb", destfile = "test.nc")
test.ncdf <- nc_open("test.nc", write = TRUE)
# See the lon, lat, time dimensions
print(test.ncdf)
tmp.array <- ncvar_get(test.ncdf, varid = "air")
I'm not sure whether the raster package is better suited for this task. For very small netCDF files I have managed the intended result to some extent, by extracting the data and then stacking it as a data.frame.
Any help or pointing in the right direction would be greatly appreciated.
Thanks in advance.
If I understand correctly, you want a set of fields, one per type, that are 1 or 0 as a function of lat/lon/time; e.g. for forest, you want an array which is 1 where the factor equals -20 and 0 otherwise.
I know you want this in a 4-dimensional array, and for that I expect you will need R, as you tagged the question. But if you don't mind having a series of 3D arrays, one per type, a quick and easy way is to use CDO to process the integer array:
cdo eqc,-20 air.sig995.2012.nc test.nc
The issue with this is that the output variable still has the same name (you don't say what it is called, so I refer to it as sfctype), and so you would need to change the metadata with NCO.
A better way is therefore to use expr in cdo:
cdo expr,"forest=sfctype==-20" air.sig995.2012.nc forest.nc
This makes a new variable called forest which is 1 or 0.
You could now process all the types you want, and then merge them into one file:
cdo expr,"forest=(sfctype==-20)" air.sig995.2012.nc type_forest.nc
cdo expr,"forest=(sfctype==10)" air.sig995.2012.nc type_water.nc
...etc...
cdo merge type_*.nc combined_file.nc
(I don't think you need the parentheses, but they make the syntax clearer.)
...almost what you wanted in a few lines, but not quite... I am not sure how to "stack" these new variables into a 4D array if you really need that, but perhaps nco can do it.
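If you do need the 4D array, one option outside R is xarray in Python, which can concatenate the per-type files produced above along a new dimension. A sketch, assuming the file and variable names from the cdo commands:
import xarray as xr

# Open the per-type 0/1 fields produced by cdo expr
forest = xr.open_dataset('type_forest.nc')['forest']
water = xr.open_dataset('type_water.nc')['water']
# Stack them along a new 'type' dimension -> lat/lon/time/type
combined = xr.concat([forest, water], dim='type')
combined = combined.assign_coords(type=['forest', 'water'])
combined.to_netcdf('landtype_4d.nc')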

Either unformatted I/O is giving absurd values, or I'm reading them incorrectly in R

I have a problem somewhere in my handling of unformatted data, and I don't know where, so I will post my entire workflow.
I'm integrating my own code into an existing climate model, written in Fortran, to generate a custom variable from the model output. I have been successful in getting sensible and readable formatted output (values up to the thousands), but when I write unformatted output the values I get are absurd (on the scale of 1E10).
Would anyone be able to take a look at my process and see where I might be going wrong?
I'm unable to make a functional replication of the entire code used to output the data; however, the relevant snippet is:
c write customvar to file [UNFORMATTED]
open (unit=10,file="~/output_test_u",form="unformatted")
write (10)customvar
close(10)
c write customvar to file [FORMATTED]
c open (unit=10,file="~/output_test_f")
c write (10,*)customvar
c close(10)
The model was run twice, once with the FORMATTED code commented out and once with the UNFORMATTED code commented out (I now realise I could have done it in one run by using different unit numbers). Either way, different runs should not produce different values.
The files produced are available here:
unformatted(9kb)
formatted (31kb)
In order to interpret these files, I am using R. The following code is what I used to read each file, and shape them into comparable matrices.
##Read in FORMATTED data
# Note: what="numeric" is a character string, so scan() returns character
# data here - hence the as.numeric conversion below
formatted <- scan(file="output_test_f", what="numeric")
formatted <- matrix(formatted, ncol=64, byrow=TRUE)
formatted <- apply(formatted, 1:2, as.numeric)
##Read in UNFORMATTED data
to.read <- file("output_test_u", "rb")
unformatted <- readBin(to.read, integer(), n=10000)
close(to.read)
unformatted <- unformatted[c(-1,-2050)] # drop the leading and trailing padding
unformatted <- matrix(unformatted, ncol=64, byrow=TRUE)
unformatted <- apply(unformatted, 1:2, as.numeric)
To check that the general structure of the data is the same between the two files, I verified that zero and non-zero values were in the same position in each matrix (each value represents a grid square; zeros represent sea) using:
as.logical(unformatted) - as.logical(formatted)
and an array of zeros was returned, indicating that only the values differ between the two, not the way I've shaped them.
To see how the values relate to each other, I tried plotting formatted vs unformatted values (with all zero values removed). They have some sort of relationship, so the inflation of the values is not random.
I am completely stumped as to why the unformatted values are so inflated. Is there an error in the way I'm reading and interpreting the file? Is there something about the way Fortran writes unformatted data that alters the values?
The usual method that Fortran uses to write unformatted files is:
A leading record marker, usually four bytes, with the length of the following record
The actual data
A trailing record marker, the same number of bytes as the leading record marker, with the same information (used for BACKSPACE)
The usual number of bytes in the record marker is four bytes, but eight bytes have also been sighted (e.g. very old versions of gfortran for 64-bit systems).
If you don't want to deal with these complications, just use stream access. On the Fortran side, open the file with
OPEN(unit=10,file="foo.dat",form="unformatted",access="stream")
This will give you a stream-oriented I/O model like C's binary streams.
Otherwise, you would have to look at your compiler's documentation to see how exactly unformatted I/O is implemented, and take care of the record markers from the R side. A word of caution here: Different compilers have different methods of dealing with very long records of more than 2^31 bytes, even if they have four-byte record markers.
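To illustrate what those markers look like, here is a minimal Python sketch that reads one record of such a sequential unformatted file, assuming four-byte little-endian markers and a payload of 4-byte integers:
import struct

with open('output_test_u', 'rb') as f:
    (nbytes,) = struct.unpack('<i', f.read(4))   # leading record marker
    payload = struct.unpack('<%di' % (nbytes // 4), f.read(nbytes))
    (trailer,) = struct.unpack('<i', f.read(4))  # trailing marker, same value
    assert trailer == nbytes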
Following on from the comments of @Stibu and @IanH, I experimented with the R code and found that the source of the error was incorrect handling of the byte size in R. Explicitly specifying a byte size of 4, i.e.
unformatted <- readBin(to.read, integer(), size = 4, n = 10000)
allows the data to be read in perfectly.

Fortran90 created allocatable arrays but elements incorrect

I'm trying to create an array from an xyz data file. The data file is arranged so that the x, y, z of each atom is on a new line, and I want the array to reflect this.
I then want to use this array to find the distance from each atom in the list to all the others.
To do this, the array has been copied, so atom1 and atom2 should both be identical to the input file.
length is simply the number of atoms in the list.
The write statement WRITE(20,'(3F12.9)') atom1 actually gives the matrix I want, but when I try to access individual elements they're all wrong!
Any help would be really appreciated!
Thanks guys.
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: atom1, atom2
ALLOCATE(atom1(length,3), atom2(length,3))
READ(10,*) ((atom1(i,j), i=1,length), j=1,3)
atom2 = atom1
distn = 0
distc = 0
DO n = 1, length
  x1 = atom1(n,1)
  y1 = atom1(n,2) ! 1st atom
  z1 = atom1(n,3)
  DO m = 1, length
    x2 = atom2(m,1)
    y2 = atom2(m,2) ! 2nd atom
    z2 = atom2(m,3)
Your READ statement reads all the x coordinates for all atoms from however many records, then all the y coordinates, then all the z coordinates. That's inconsistent with your description of the input file. You have the nesting of the io-implied-do's in the READ statement the wrong way around - it should be ((atom1(i,j),j=1,3),i=1,length).
Similarly, as per the comment, your diagnostic write misled you - you were outputting all x ordinates, followed by all y ordinates, etc. Array element order of a whole array reference varies the first (leftmost) dimension fastest (colloquially known as column major order).
(There are various pitfalls associated with list directed formatting that mean I wouldn't recommend it for production code, except perhaps for input specifically written with knowledge of, and defence against, those pitfalls. One of those pitfalls is that a READ under list directed formatting will pull in as many records as it needs to satisfy the input list. You might have detected the problem earlier if you had used an explicit format that nominated the number of fields per record.)
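As an aside, the read-then-distance task itself is compact in NumPy, assuming one x y z triple per line (illustrative only, not the poster's code):
import numpy as np

atoms = np.loadtxt('atoms.txt')                # shape (length, 3): one atom per row
diffs = atoms[:, None, :] - atoms[None, :, :]  # pairwise coordinate differences
dists = np.sqrt((diffs ** 2).sum(axis=-1))     # (length, length) distance matrix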

Markov Algorithm for Random Writing

I've got a little problem understanding conceptually the structure of a random-writing program (one that takes input in the form of a text file) and uses the Markov algorithm to create somewhat sensible output.
The data structure I am using is based on cases ranging from 0-10. At case 0, I count the number of times each letter/symbol/digit appears and base my new text on this to simulate the input. I have already implemented this using a Map that holds each unique letter in the input text and an array of how many of each there are. So I can simply ask for the size of the array for a specific letter and create output text easily this way.
But now I need to create case 1/2/3 and so on... case 1 also holds which letter is most likely to appear after any given letter. Do I need to create 10 separate arrays for these cases, or is there an easier way?
There are a lot of ways to model this. One approach is as you describe, with a multi-dimensional array where each index is the following character in the chain and the final value is the count.
# Two-character sample:
int counts[][] = new int[26][26]
# ... initialize all entries to zero
# 'a' => 0, 'b' => 1, ... 'z' => 25
# For example, for the string 'apple'
# Note: I'm only writing it out like this to show the result; it should be in a
# loop or function ...
counts['a'-'a']['p'-'a']++
counts['p'-'a']['p'-'a']++
counts['p'-'a']['l'-'a']++
counts['l'-'a']['e'-'a']++
Then, to randomly generate text, you would count the total number of outcomes for a given character (e.g. 2 outcomes for 'p' in the previous example) and pick a weighted random number for one of the possible outcomes.
For smaller sizes (say up to 4 characters) that should work fine. For anything larger you may start to run into memory issues, since (assuming you're using A-Z) there are 26^N entries for an N-length chain.
I wrote something like this a couple of years ago. I think I used random pages from Wikipedia as seed data to generate the weights.
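A compact Python sketch of the whole idea - counting bigrams, then sampling with weights - purely for illustration:
import random
from collections import defaultdict

def build_counts(text):
    # counts[prev][nxt] = number of times nxt followed prev
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char(counts, prev):
    # Weighted random pick among the observed followers of prev
    followers = counts[prev]
    return random.choices(list(followers), weights=list(followers.values()))[0]

counts = build_counts('apple')
print(next_char(counts, 'p'))  # 'p' or 'l', each with weight 1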
