Not able to append two netcdf files using nco - netcdf

I am using netcdf operators to append two NCEP netCDF files together.
These files are of different sizes, but they represent the same atmospheric variable, i.e. geopotential height. One is at 1000 hPa and the other is at 925 hPa. They have the same dimensions and the same latitudinal and longitudinal extent, and both represent the same time instant.
This is the command I am using: ncks -A hgt_1000.nc hgt_925.nc
The command runs without any issue, but when I look at the output of hgt_925.nc it looks like the files have not been merged. Looking at the NCO documentation, it seems they have to be the same size to append. Is there another way forward, or should I write my own code to append? These are netCDF-4 classic files downloaded using nccopy.

new answer, based on new user information:
Since your input files already have a level dimension, the procedure to follow is the one described in the NCO manual: turn level into a record dimension, concatenate the files along it with ncrcat, then permute back with ncpdq. The manual has examples.
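A minimal sketch of that recipe, assuming the dimensions are named level and time (check with ncdump -h; the intermediate file names are placeholders):
# make level the record dimension in each input
ncpdq -a level,time hgt_1000.nc hgt_1000_lev.nc
ncpdq -a level,time hgt_925.nc hgt_925_lev.nc
# concatenate along the (now record) level dimension
ncrcat hgt_1000_lev.nc hgt_925_lev.nc hgt_both_lev.nc
# make time the record dimension again
ncpdq -a time,level hgt_both_lev.nc hgt_both.nc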
old answer:
What you want to do seems to be what NCO would handle with ncecat (appending is for copying new variables to existing files). Concatenate the files together and name the resulting record dimension, e.g., level, with
ncecat -u level hgt_1000.nc hgt_925.nc out.nc

You can also use CDO to merge the netCDF files:
cdo merge hgt_1000.nc hgt_925.nc out.nc

Related

Vroom/fread won't read LARGE .csv file - cannot memory map it

I have a .csv file that is 112 GB on disk, but neither vroom nor data.table::fread will open it. Even if I ask to read in only 10 rows or just a couple of columns, it complains with mapping error: Cannot allocate memory.
df<-data.table::fread("FINAL_data_Bus.csv", select = c(1:2),nrows=10)
System errno 22 unmapping file: Invalid argument
Error in data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10) :
Opened 112.3GB (120565605488 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.
read.csv on the other hand will read the ten rows happily.
Why won't vroom or fread read it using the usual altrep, even for 10 rows?
This matter has been discussed by the main author of the data.table package at https://github.com/Rdatatable/data.table/issues/3526. See the comment by Matt Dowle himself at https://github.com/Rdatatable/data.table/issues/3526#issuecomment-488364641. From what I understand, the gist of the matter is that to read even 10 lines from a huge csv file with fread, the entire file needs to be memory-mapped. So fread cannot be used on its own if your csv file is too big for your machine. Please correct me if I'm wrong.
Also, I haven't been able to use vroom with big more-than-RAM csv files. Any pointers towards this end will be appreciated.
For me, the most convenient way to check out a huge (gzipped) csv file is by using a small command line tool csvtk from https://bioinf.shenwei.me/csvtk/
e.g., check dimensions with
csvtk dim BigFile.csv.gz
and check out the head (top 100 rows) with
csvtk head -n100 BigFile.csv.gz
get a better view of the above with
csvtk head -n100 BigFile.csv.gz | csvtk pretty | less -SN
Here I've used the less command available with "Gnu On Windows" at https://github.com/bmatzelle/gow
A word of caution - many people suggest using command
wc -l BigFile.csv
to check the no. of lines in a big csv file. In most cases, it will be equal to the no. of rows. But in case the big csv file contains newline characters within a cell, to use a spreadsheet term, the above command will not give the correct no. of rows. In such cases the no. of lines is different from the no. of rows, so it is advisable to use csvtk dim or csvtk nrow. Other csv command line tools like xsv and miller will also show correct results.
Another word of caution - the short command fread(cmd="head -n 10 BigFile.csv") is not advisable for previewing the top few lines in case some columns contain significant leading zeros in the data, such as 0301, 0542, etc., since without a column specification fread will interpret them as integers and drop the leading zeros from such columns. For example, in some databases that I have to analyse, a first digit of zero in a particular column means that it is a Revenue Receipt. So it is better to use a command line tool like csvtk, miller or xsv with less -SN to preview a big csv file, which shows the file "as is" without any potentially wrong interpretation.
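If you do want a quick fread preview anyway, forcing every column to character keeps such values intact. A sketch (colClasses = "character" reads all columns as text; BigFile.csv is the placeholder name from above):
library(data.table)
# header + 10 data rows, without memory-mapping the full file
preview <- fread(cmd = "head -n 11 BigFile.csv", colClasses = "character")
print(preview)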
PS1: Even spreadsheets like MS Excel and LibreOffice Calc lose leading zeroes in csv files by default. LibreOffice Calc actually shows leading zeroes in the preview window but loses them when you load the file! I have yet to find a spreadsheet that does not lose leading zeroes in csv files by default.
PS2: I've posted my approach to querying very large csv files at https://stackoverflow.com/a/68693819/8079808
EDIT:
vroom does have difficulty when dealing with huge files, since it needs to store the index in memory as well as any data you read from the file. See the development thread at https://github.com/r-lib/vroom/issues/203

looping through paired end fastq reads

How can you loop through a paired-end fastq file? For single-end reads you can do the following:
library(ShortRead)
strm <- FastqStreamer("./my.fastq.gz")
repeat {
    fq <- yield(strm)
    if (length(fq) == 0)
        break
    # do things
    writeFasta(fq, 'output.fq', mode = "a")
}
However, if I edit one paired-end file, I somehow need to keep track of the second file so that the two files continue to correspond with each other.
Paired-end fastq files are typically ordered, so you could keep track of the reads that are removed and remove them from the paired file. But this isn't a great method, and if your data is line-wrapped you will be in pain.
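One way to avoid that bookkeeping is to stream both files in the same loop and apply the same filter to each chunk, so the two outputs stay in sync. A sketch, assuming the two files list the mates in the same order; the file names and the minimum-length criterion are placeholders:
library(ShortRead)
strm1 <- FastqStreamer("my_R1.fastq.gz")
strm2 <- FastqStreamer("my_R2.fastq.gz")
repeat {
    fq1 <- yield(strm1)
    fq2 <- yield(strm2)
    if (length(fq1) == 0)
        break
    # the same logical filter applied to both mates keeps the pairs aligned
    keep <- width(fq1) >= 50 & width(fq2) >= 50
    writeFastq(fq1[keep], "out_R1.fastq.gz", mode = "a")
    writeFastq(fq2[keep], "out_R2.fastq.gz", mode = "a")
}
close(strm1)
close(strm2)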
A better way would be to use the header information.
The headers for the paired reads in the two files are identical, except for the field that specifies whether the read is forward or reverse (1 or 2)...
first read from file 1:
@M02621:7:000000000-ARATH:1:1101:15643:1043 1:N:0:12
first read from file 2:
@M02621:7:000000000-ARATH:1:1101:15643:1043 2:N:0:12
The numbers 1101:15643:1043 refer to the tile and the x, y coordinates on that tile, respectively.
These numbers uniquely identify each read pair for the given run.
Using this information, you can remove reads from the second file if they are not in the first file.
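A sketch of that idea with ShortRead, matching on the identifier before the space (the file names are placeholders, and both files are read fully into memory, so this suits modestly sized data):
library(ShortRead)
fq1 <- readFastq("filtered_R1.fastq.gz")   # already-filtered forward reads
fq2 <- readFastq("raw_R2.fastq.gz")        # unfiltered reverse reads
# the identifier is everything before the space, e.g. "M02621:7:000000000-ARATH:1:1101:15643:1043"
ids1 <- sub(" .*", "", as.character(id(fq1)))
ids2 <- sub(" .*", "", as.character(id(fq2)))
writeFastq(fq2[ids2 %in% ids1], "filtered_R2.fastq.gz")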
Alternatively, if you are doing quality trimming... Trimmomatic can perform quality/length filtering on paired-end data, and it's fast...

unix file fetch by timestamp

I have a list of files that get added to my work stream. They are csv with a date time stamp to indicate when they are created. I need to pick up each file in the order of the datetime in the file name to process it. Here is a sample list that I get:
Workprocess_2016_11_11T02_00_12.csv
Workprocess_2016_11_11T06_50_45.csv
Workprocess_2016_11_11T10_06_18.csv
Workprocess_2016_11_11T14_23_00.csv
How would I compare the files to search for the oldest one and work towards the chronologically newest file? The files are all dumped on the same day, so I can only go by the timestamp in the file name.
The beneficial aspect of that date time format is that it sorts the same lexically and chronologically. So all you need is
for file in *.csv; do
    mv "$file" xyz
    process xyz
done

Extract exactly one file (any) from each 7zip archive, in bulk (Unix)

I have 1,500 7zip archives, each archive contains 2 to 10 files, with no subdirectories.
Each file has the same extension, however the filename varies.
I only want one file out of each archive, but I'd like to perform this in bulk. I do not care which file is taken out, as long as only one file is taken out. It can be the first file, the newest, the biggest, the smallest, it doesn't matter.
Here's an example:
aa.7z {blah 56.smc, blah 57.smc, 1 blah 58.smc}
ab.7z {xx.smc, xx 1.smc, xx_2.smc}
ac.7z {1.smc}
I want to run something equivalent to:
7z e *.7z # But somehow only extract one file
Thank you!
Ultimately my solution was to extract all files and run the following in the directory:
for n in *; do echo "$n"; done > files.txt
I then imported that list into Excel and split the filenames on a special character that divided the title of the file from the qualifying data inside the filename (for example: Some Title (V1) [X2].smc); specifically, I used a bracket delimiter.
Then I removed all duplicates, leaving me with only one edition of each file from the archives. I finally re-merged the columns (unfortunately the bracket was deleted during the splitting, so I wrote a function to add it back on the condition that there was content in the next column), re-saved files.txt and, after a bit of reviewing Stack Overflow for answers, deleted files based on that input file (files.txt). A word of warning on this: spaces in filenames cause problems with rm and xargs, so I had to wrap the variable in quotes.
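For reference, the deletion step can be done safely with quoted names, something like:
while IFS= read -r f; do
    rm -- "$f"
done < files.txt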
Ultimately this still didn't serve me well enough, so I just used a different resource entirely.
Posting this answer so others who find themselves in a similar predicament can find an alternative resolution.
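For anyone who still wants to script the original goal directly, here is a sketch that lists each archive with 7z l -slt, takes the first member path (the first "Path = " line in that listing is the archive itself, so the second one is the first member) and extracts only that file; it assumes p7zip's 7z is on the PATH:
for archive in *.7z; do
    member=$(7z l -slt "$archive" | sed -n 's/^Path = //p' | sed -n '2p')
    [ -n "$member" ] && 7z e -y "$archive" "$member"
done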

Batch read netcdf files and average one variable

I'm a new R user. I now have daily netcdf data for year 1979 such as these:
sm19790101.1.nc
sm19790102.1.nc
.
.
.
sm19791231.1.nc
I need to average a variable called "sm" to monthly resolution. I can now do this:
glob2rx("sm197901*.1.nc")
jan<-list.files(pattern=glob2rx("sm197901*.1.nc"),full.names=TRUE)
to port all January data to jan, but I don't know how to open each file and get a specific variable (I have the RNetCDF package installed). If I were to do this manually, it would be:
s19790101<-open.nc("sm19790101.1.nc")
sm19790101<-var.get.nc(s19790101,"sm",na.mode=0)
and then average them...
I guess the question is how to read files with a varying part (e.g. 01-31) in the file name and then loop through the whole month.
If you have a lot of data to summarize, you can summarize the daily data into monthly means with the NCO record averager, ncra: http://nco.sourceforge.net/nco.html#ncra-netCDF-Record-Averager
ncra DAILY/sm197901??.1.nc MONTHLY/sm197901.1.nc
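If you want all twelve months in one go, a loop along these lines should work (assuming the DAILY/ and MONTHLY/ layout above and GNU seq):
for mm in $(seq -w 1 12); do
    ncra DAILY/sm1979${mm}??.1.nc MONTHLY/sm1979${mm}.1.nc
done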
It looks like you can paste together the filename components "sm197901", day, and ".1.nc" to construct the desired filename.
library(RNetCDF)

# make sure the day has a leading 0
days <- formatC(1:31, width = 2, flag = "0")
ncfiles <- lapply(days, function(d) {
    filename <- paste("sm197901", d, ".1.nc", sep = "")
    # print(filename)
    open.nc(filename)
})
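To finish the averaging step, something along these lines should work (a sketch, assuming "sm" is a regular lon x lat field in every daily file):
sm_daily <- lapply(ncfiles, function(nc) var.get.nc(nc, "sm", na.mode = 0))
sm_jan <- Reduce(`+`, sm_daily) / length(sm_daily)  # January mean on the same grid
invisible(lapply(ncfiles, close.nc))                # close the file handles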
In parallel to Dave's ncra answer, you can also do it with CDO:
cdo mergetime sm1979????.1.nc year.nc
# you only need this next step if there is more than one variable in the file:
cdo selvar,sm year.nc yearsm.nc
cdo monmean year.nc month.nc
On some systems the number of open files is limited to 256. If this is your case, you can replace "mergetime" with "cat", and I think it should still work, since the files will be listed in time order.
