Expand a netCDF Variable into an additional Dimension or multiple Variables - r

I am working with a very large netCDF file in three dimensions (lat/lon/time). The resolution is 300 meters and the time variable has 25 steps, which leads to 64800x129600x25 cells.
The single variable contained in the file is an integer (ranging from -36 to 120), but it represents an underlying factor, which is the problem.
It is a land cover data set, so, for example, -20 means the cell is of the land type Forest and 10 means the cell is covered by water.
I want to reshape the netCDF file so that there is an additional dimension representing every factor level of the original variable. The variable would then be just a 1 or 0 per cell, indicating the presence or absence of each factor level at a given lat/lon/time.
The dimensions would then be lat/lon/time/land type.
Here is an example data set that does not concern land type but is small enough to be used for testing, along with some code to read it in:
library(ncdf4)
# Download the data
download.file("http://schubert.atmos.colostate.edu/~cslocum/code/air.sig995.2012.nc",
              mode = "wb", destfile = "test.nc")
test.ncdf <- nc_open("test.nc", write = TRUE)
# See the lon, lat, time dimensions
print(test.ncdf)
tmp.array <- ncvar_get(test.ncdf, varid = "air")
I'm not sure whether the raster package is better suited for this task. For very small netCDF files I have managed to get the intended result to some extent by extracting the data and then stacking it as a data.frame.
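For illustration, a minimal sketch of that data.frame-stacking approach on a toy 3D array (the land-type codes and dimensions here are made up):
# toy stand-in for the real data: 4 lon x 3 lat x 2 time cells
lc <- array(sample(c(-20, 10, 50), 24, replace = TRUE), dim = c(4, 3, 2))
# flatten to long format: one row per lon/lat/time cell
df <- expand.grid(lon = 1:4, lat = 1:3, time = 1:2)
df$code <- as.vector(lc)
# one 0/1 indicator column per land-type level
for (lev in sort(unique(df$code))) {
  df[[paste0("type_", lev)]] <- as.integer(df$code == lev)
}
head(df)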
Any help or pointing in the right direction would be greatly appreciated.
Thanks in advance.

If I understand correctly, you want to have a set of fields for each type that are 1 or 0 as a function of lat/lon/time, e.g. if you are looking at forest you want an array which is 1 where the factor equals -20 and 0 otherwise.
I know you want to do this as a 4-dimensional array, and for that I expect you will need R, as you tagged the question. But if you don't mind having a series of 3D arrays, one per type, a quick and easy way is to use CDO to process the integer array:
cdo eqc,-20 air.sig995.2012.nc test.nc
The issue with this is that the output variable still has the same name (you don't say what it is called, so I refer to it as sfctype), and so you would need to change the metadata with NCO.
Therefore a better way is to use expr in CDO:
cdo expr,"forest=sfctype==-20" air.sig995.2012.nc forest.nc
This makes a new variable called forest which is 1 or 0.
You could now process all the types you want, and then merge them into one file:
cdo expr,"forest=(sfctype==-20)" air.sig995.2012.nc type_forest.nc
cdo expr,"forest=(sfctype==10)" air.sig995.2012.nc type_water.nc
...etc...
cdo merge type_*.nc combined_file.nc
(I don't think you need the parentheses, but they make the syntax clearer.)
...almost what you wanted in a few lines, but not quite... I am not sure how to "stack" these new variables into a 4D array if you really need that, but perhaps NCO can do it.
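If you do need a single 4D array on the R side, here is a hedged sketch (the file and variable names are assumptions; at the full 64800 x 129600 resolution it will only fit in memory for a spatial or temporal subset):
library(ncdf4)
nc <- nc_open("landcover.nc")            # hypothetical input file
lc <- ncvar_get(nc, varid = "sfctype")   # 3D array: lon x lat x time
nc_close(nc)
levels <- sort(unique(as.vector(lc)))    # every land-type code present
# 4D array: lon x lat x time x land type, 1 where the cell has that type
presence <- array(0L, dim = c(dim(lc), length(levels)))
for (k in seq_along(levels)) {
  presence[, , , k] <- as.integer(lc == levels[k])
}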

Related

counting oligonucleotides and reversed complementary

I'm starting to use R and I need some help if possible. I need to read FASTA files and count, for each species, the frequency of each nucleotide, each dinucleotide, and so on up to words of length 10, as well as the frequency of the reverse complement. I'm using the package Biostrings. Can you help me? Thank you.
The Bioconductor Biostrings manual contains some pretty descriptive methods that match what you are looking for, with attached examples. Otherwise, you could just read in the FASTA file and keep track of how many of each base occurs (if you can't figure out the Biostrings package).
For the frequencies, simply reading from a text file (the FASTA file after removing the sequence names) is also sufficient, as long as you keep count of how often each oligonucleotide appears.
I'm not exactly sure how you want to measure the reverse-complement frequency, but if you kept all the possible words of size 10 in an array, that array wouldn't be too large (4^10 = 1,048,576 entries), so if you add the data to the array in a logical way you could pretty easily compare them algorithmically.
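For what it is worth, a minimal Biostrings sketch along those lines (the FASTA file name is an assumption):
library(Biostrings)
seqs <- readDNAStringSet("species.fasta")   # hypothetical file, one per species
oligonucleotideFrequency(seqs, width = 1)   # single-nucleotide counts per sequence
oligonucleotideFrequency(seqs, width = 2)   # dinucleotide counts per sequence
k10   <- oligonucleotideFrequency(seqs, width = 10)                     # all 4^10 possible 10-mers
k10rc <- oligonucleotideFrequency(reverseComplement(seqs), width = 10)  # same, on the reverse complement
colSums(k10) then gives the total count of each 10-mer across all sequences in a file.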

netcdf dimension variable interpretation

I'm trying to understand whether this is allowed by the netCDF standard. It does not make sense to me, but maybe there is a reason why it is not forbidden at the library level. Ncdump output:
netcdf tt {
dimensions:
one = 2 ;
two = 1 ;
variables:
int64 one(two) ;
data:
one = 1 ;
}
And the Python code that produces this file:
from netCDF4 import Dataset
rr=Dataset('tt.nc','w')
rr.createDimension('one',2)
rr.createDimension('two',1)
var1=rr.createVariable('one','i8',('two'))
var1[:]=1
rr.close()
Note the variable with the same name as a dimension, but defined on a different dimension than the one it is named after?!
So, two questions:
Is this allowed by the standard?
If not, should it be restricted by libraries?
It's valid because the names of attributes, names of dimensions, and names of variables all exist in different namespaces.
It's valid, but it obviously makes for confusing code and output and would not be acceptable in a professional setting. Note, though, that one-dimensional variables that have the same name and size as the dimension they are assigned to are called "coordinate variables."
For example, you'll often see a variable named latitude that is 1D and has a dimension named latitude. ncks or ncdump should reveal a (CRD) next to that variable display, indicating that it is indeed coordinated to the array of latitudes.
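For contrast, a minimal ncdf4 sketch (file name and values made up) of a proper coordinate variable, i.e. a 1D variable with the same name and length as its dimension:
library(ncdf4)
# ncdim_def() with explicit values creates the dimension and its coordinate variable
lat <- ncdim_def("latitude", units = "degrees_north", vals = seq(-90, 90, by = 2.5))
air <- ncvar_def("air", units = "K", dim = list(lat), missval = -999)
nc  <- nc_create("coord_example.nc", air)
ncvar_put(nc, air, rnorm(lat$len, mean = 280, sd = 10))
nc_close(nc)
# ncdump -h coord_example.nc now shows "double latitude(latitude)"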

Is there a way to read a raw netcdf file and tell what layer a value belongs to?

I'm in the process of evaluating how successful a script I wrote is, and a quick and dirty method I've employed is to look at the first few and last few values of a single variable and do a few calculations with them, based on the same values in another netCDF file.
I know that there are better ways to approach this, but again, this is a really quick and dirty method that has worked for me so far. My question, though, is: by looking at the raw data through ncdump, is there a way to tell which vertical layer the data belongs to? In my example, the file has 14 layers. I'm assuming that the first few values are part of the surface layer and the last few values are part of the top layer, but I suspect that this assumption is wrong, at least in part.
As a follow-up question, what would then be the easiest 'proper' way to tell which layer the data belongs to? Thank you in advance!
ncview and NCO are both very powerful and quick tools for viewing data inside a netCDF file.
ncview: http://meteora.ucsd.edu/~pierce/ncview_home_page.html
NCO: http://nco.sourceforge.net/
You can easily show variables over all layers for example with
ncks -d layer,0,13 some_infile.nc
ncdump dumps the data with the last dimension varying fastest (http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/CDL-Syntax.html) so if 'layer' is the slowest/first dimension, the earlier values are all in the first layer, while the last few values are in the last layer.
As to whether the first layer is the top or bottom layer, you'd have to look at the 'layer' dimension and its data.
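If you end up checking this from R anyway, here is a hedged ncdf4 sketch (the file and variable names are assumptions) that reads a single vertical layer via start/count, so you never have to guess from the raw dump:
library(ncdf4)
nc <- nc_open("some_infile.nc")
# start/count follow the dimension order that print(nc) reports for the variable;
# here we assume (lon, lat, layer, time) and pull only the first layer
surface <- ncvar_get(nc, varid = "temp",        # hypothetical variable name
                     start = c(1, 1, 1, 1),
                     count = c(-1, -1, 1, -1))  # -1 = all values along that dimension
nc_close(nc)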

transformation of flow cytometry data using flowCore?

I want to know whether anyone knows how to transform channel four (FLH 4) without using the standard transformations offered by the flowCore package.
The values of channel four are between 1 and 4096, and I need to convert them to values between 1 and 246 with the rule 10^(x/1024).
Thank you.
It is better to use the flowTrans mclMultivArcSinh transformation.
trans <- flowTrans(flowData, "mclMultivArcSinh", colnames(flowData)[3:12], n2f=FALSE, parameters.only=FALSE)
You must not transform FSC-A, SSC-A, and Time; that's why I use colnames(flowData)[3:12].
You could get a custom transform in by doing something like:
plot(transform(someFlowFrame, `FSC-H` = 10^(`FSC-H`/1024), `SSC-H` = 10^(`SSC-H`/1024)), c("FSC-H", "SSC-H"))
However, as 10^(4096/1024) returns a maximum value of 10000 for your hypothetical example, the plot with your ranges,
plot(transform(someFlowFrame, `FSC-H` = 10^(`FSC-H`/1024), `SSC-H` = 10^(`SSC-H`/1024)), c("FSC-H", "SSC-H"), xlim = c(0, 256), ylim = c(0, 256))
doesn't look good.
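Alternatively, a hedged sketch that wraps the 10^(x/1024) rule in a flowCore transformList (the channel name "FL4-H" and the object someFlowFrame are assumptions; check colnames(yourFlowFrame) for the real channel name):
library(flowCore)
# custom transform applied to a single, guessed channel name
myTrans <- transformList("FL4-H", function(x) 10^(x / 1024))
transformedFrame <- transform(someFlowFrame, myTrans)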

Import Large Unusual File To R

First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data for about 2.5 years; each day has its own file. The files are .txt and consist of approximately 20-30 million rows, averaging, I guess, 360 MB each. I am working with one file at a time for now. I don't need all the data these files contain, and I was hoping to use the program to shrink my files a bit.
Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see each line begins with a letter. Each letter denotes what the line means. For instance R means order book directory message, M means milliseconds after last second, H means stock trading action message. There are 14 different letters used in total.
I have used the readLines function to import the data into R. This however seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of if logic that says: if the first letter is R, then offsets 1 to 4 contain the Market Segment Identifier, etc., and have R add columns accordingly so I can work with the data in a more structured fashion.
What is the best way to import such data and create some form of structure, i.e. using the unique ID information in each line to analyze one stock at a time, for instance?
You can try something like this:
options(stringsAsFactors = FALSE)
# build one row of the "A" table from one line of the file
f_A <- function(line, tab_A){
  values <- unlist(strsplit(line, " "))[2:5]
  rbind(tab_A, list(name_1 = as.character(values[1]), name_2 = as.numeric(values[2]),
                    name_3 = as.numeric(values[3]), name_4 = as.numeric(values[4])))
}
# empty table that f_A() appends to
tab_A <- data.frame(name_1 = character(), name_2 = numeric(), name_3 = numeric(),
                    name_4 = numeric(), stringsAsFactors = FALSE)
# dispatch on the first character of each line
for(i in readLines(con = "/home/data.txt")){
  switch(strsplit(x = i, split = "")[[1]][1],
         M = cat("1\n"), R = cat("2\n"), D = cat("3\n"),
         A = (tab_A <- f_A(i, tab_A)))
}
Then replace the cat() calls with functions that add values to each type of data.frame. Use the pattern of the f_A() function to construct the other functions, and do the same for the other table structures.
You can combine your readLines() command with regular expressions. To get more information about regular expressions, look at the R help site for grep()
> ?grep
So you can go through all the lines, check what each line means, and then handle or store the content of the line however you like. (Regular expressions are also useful for splitting the data within one line...)
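As an illustration of that idea, a hedged sketch that keeps only the 'R' lines and cuts them into columns (the file name, field names, and fixed-width offsets are made up; substitute the ones from your message specification):
lines <- readLines("day1.txt")               # hypothetical input file
r_lines <- grep("^R", lines, value = TRUE)   # keep only the order book directory messages
r_tab <- data.frame(
  msg_type  = substr(r_lines, 1, 1),
  timestamp = as.numeric(substr(r_lines, 3, 7)),   # assumed offsets
  isin      = trimws(substr(r_lines, 19, 30)),     # assumed offsets
  stringsAsFactors = FALSE
)
head(r_tab)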
