How to remove seasonality from time series data? - netcdf

How can I remove the seasonal cycle from a time series stored in a netCDF file? I would like a solution that runs on Linux; I have been using GrADS and Ferret for visualization.
Thanks a lot!

You can use CDO to calculate the average for each day/month of the year and subtract it from the original file:
If the file contains daily data:
cdo sub in.nc -ydaymean in.nc deseasonalized.nc
Likewise if the data is monthly:
cdo sub in.nc -ymonmean in.nc deseasonalized.nc
The ydaymean and ymonmean operators calculate the annual cycle over the dataset in.nc, i.e. ymonmean returns 12 time slices, the averages of all the Januaries, all the Februaries and so on, which are then subtracted from the original file using sub. I've used piping, but it may be easier to understand as two separate commands:
cdo ymonmean in.nc annual_cycle.nc
cdo sub in.nc annual_cycle.nc deseasonalized.nc
This does exactly the same thing: deseasonalized.nc will be identical (well, almost; there will be a few bytes of difference due to the different "history" log in the netCDF global metadata header), but you will also have a new file, annual_cycle.nc, containing the annual cycle (which might also be useful).
When doing the subtraction, CDO detects that the second file has fewer time slices than the first and therefore loops/cycles through it. Note that because the seasonal cycle is calculated from the same file as the original data, it is fine to simply use "sub": if the data starts in e.g. April, the results of ymonmean will also start from April. However, if you want to remove a seasonal cycle calculated from a different source, the start day/month may be different and you could end up subtracting e.g. the April mean from January! To avoid this, you can use the ymonsub operator instead:
cdo ymonsub full_timeseries.nc seasonal_file.nc deseasonalised.nc
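As a rough illustration of the idea behind ymonmean followed by a month-matched subtraction, here is a minimal sketch in plain Python for a single grid point. It is not how CDO is implemented; the function names and the made-up data are assumptions for illustration only.

```python
from collections import defaultdict

def monthly_climatology(months, values):
    """Mean of all values sharing a calendar month (the ymonmean idea)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for m, v in zip(months, values):
        sums[m] += v
        counts[m] += 1
    return {m: sums[m] / counts[m] for m in sums}

def subtract_by_month(months, values, climatology):
    """Subtract the climatology entry matching each step's calendar month
    (the ymonsub idea), regardless of which month the series starts in."""
    return [v - climatology[m] for m, v in zip(months, values)]

# Two years of made-up monthly data starting in April:
# a seasonal cycle plus a small linear trend.
months = [(3 + i) % 12 + 1 for i in range(24)]   # 4, 5, ..., 12, 1, 2, 3, 4, ...
seasonal = {m: 10.0 * m for m in range(1, 13)}
values = [seasonal[m] + 0.5 * i for i, m in enumerate(months)]

clim = monthly_climatology(months, values)
anomalies = subtract_by_month(months, values, clim)
print(anomalies)  # the seasonal cycle is gone; only the trend signal remains
```

Because the subtraction looks up the climatology by calendar month rather than by position, it does not matter whether the two inputs start in the same month, which is the point of using ymonsub with a seasonal cycle from a different source.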
In addition, there are now also packages in both R and python to allow you to access the full functionality of cdo from within those languages without having to resort to using shell access tools.
Edit 2021: I now have a video on this topic, which you can view here: https://youtu.be/jKlA1ouoQIs

Related

Making shapefiles with zero values eliminated

I am trying to make a shapefile where the zero values are taken out in order to reduce the amount of space used. I attempted to do this through the following procedure but only got more zero values displayed:
cdo gtc,0.000008 precip_2022110815_0p50.nc mask1.nc
cdo setctomiss,0 mask1.nc mask2.nc
cdo mul precip_2022110815_0p50.nc mask2.nc precip_2022110815_0p50_adjust.nc
cdo setctomiss,0 precip_2022110815_0p50_adjust.nc precip_2022110815_0p50_final.nc
gdal_contour -a precip -i 0.00001 precip_2022110815_0p50_final.nc precip_2022110815_0p50.shp
I got the netcdf from a grib file that was obtained from ftp.ncep.noaa.gov.
Is there any way I could tweak this code, or are there other methods I could use, to get a shapefile where all zero values, or even all values below a certain threshold, are filtered out? Is there a way to filter out values below a certain amount directly from a grib2 file?

Merge Two Datasets with Different Number of Variables per Timestep Using CDO

I am analyzing a time series of data that is split by time into two NetCDF files (infiles). These files have a different number of variables/fields, by design. Traditionally I have been using Climate Data Operators (CDO) to easily merge two datasets sorted by time using the following command in a terminal:
cdo mergetime <infiles> <outfile>
This command merges any number of files "infiles" sorted by time and writes a new "outfile" containing a time series of all the data in each. However, this doesn't appear to work by default with cdo in my case, as it kicks back the following:
cdo select (Abort): Input streams have different number of variables per timestep!
The statement is true: each file does have a different number of variables per timestep. But it prevents me from looking at the dataset as a whole. I have also tried the following modifications to the cdo command I use to merge the time series, without success:
cdo mergetime -select,name=<variable> <infiles> <outfile>
cdo -select,name=<variable> <infiles> <outfile>
I have read through the CDO Userguide and have not found any alternative solutions yet. I would be very grateful if anyone could offer a workaround for joining the two files into a single time-series of data (preferably in cdo but not necessarily) as I am running out of ideas.
You could delete the extra variables from the files with NCO like this:
ncks -x -v var1,var2 in.nc out.nc
And then merge as usual. I think you can use the cdo delete operator to do the same thing.
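To decide which variables to pass to the -x -v list for each file, one option is to compare the variable names of the two files. The sketch below assumes you already have the variable names as lists (the names used here are hypothetical placeholders, not taken from the question).

```python
def variables_to_drop(vars_a, vars_b):
    """Return the shared variables and the extras to remove from each file."""
    common = set(vars_a) & set(vars_b)
    drop_a = sorted(set(vars_a) - common)
    drop_b = sorted(set(vars_b) - common)
    return sorted(common), drop_a, drop_b

# Hypothetical variable lists for the two input files.
file_a_vars = ["tas", "pr", "var1", "var2"]
file_b_vars = ["tas", "pr"]

common, drop_a, drop_b = variables_to_drop(file_a_vars, file_b_vars)
print("keep:", common)
print("drop from file A:", ",".join(drop_a))
print("drop from file B:", ",".join(drop_b))
```

The comma-joined drop list has the same form as the argument given to ncks -x -v above.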

Number of hot days from daily temperature

I am following this answer to calculate the number of hot days in a year (temperature exceeding 35 °C) from daily tmax data.
I am using tmax from CHELSA for 2000-2016, and I have cropped it to my bounding box.
Here are the steps I have done (example using 2001 data - 1 month 1 nc file):
Merge monthly data to annual: cdo mergetime chelsa_daily_2001*.nc chelsa_annual_2001.nc
Calculate hot days: cdo gec,308.15 chelsa_annual_2001.nc chelsa_hotdays_2001.nc (CHELSA temperatures are in Kelvin, so the threshold for hot days is 308.15.)
Sum number of days in a year: cdo yearsum chelsa_hotdays_2001.nc chelsa_hotdays_yearsum_2001.nc
Below is the result, which unfortunately is not what I expected.
Why is the number of days not an integer? Did I miss something in the script?
UPDATE1 (following response from Adrian)
I have installed ncview via Homebrew, but unfortunately it won't open. I get the following error:
Note: could not open file /Users/xxx/.ncviewrc for reading
Error: Can't open display:
I tried opening the nc output in QGIS, and the result is still in float.
UPDATE2
OK, I managed to check it using ncdump, and here's the first line that contains the value. I am a bit confused, because I used one year of data and the total is more than 365. How did that happen?
I strongly suspect Panoply is performing some kind of spatial interpolation on the data on import.
Please take a look at the raw field directly using ncdump like this:
ncdump chelsa_hotdays_yearsum_2001.nc | less
(I pipe to less so you can scroll down through the data.) Alternatively, you can open the file in ncview and move the cursor over the data, and you will see the field values displayed in the dialog box.
ncview chelsa_hotdays_yearsum_2001.nc
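For reference, the threshold-and-sum logic from the question (gec followed by yearsum) can be sketched for a single grid cell in plain Python. The sample temperatures are made up, and this only illustrates the arithmetic, not CDO's implementation.

```python
THRESHOLD_K = 308.15  # 35 degC expressed in Kelvin

def count_hot_days(daily_tmax_k, threshold=THRESHOLD_K):
    """Build a 0/1 mask (value >= threshold, like gec) and sum it (like yearsum)."""
    mask = [1 if t >= threshold else 0 for t in daily_tmax_k]
    return sum(mask)

# Made-up daily maximum temperatures in Kelvin for one grid cell.
sample = [300.0, 308.15, 310.2, 307.9, 309.0]
hot_days = count_hot_days(sample)
print(hot_days)
```

For one grid cell the result is always a whole number and can never exceed the number of time steps, so non-integer values or counts above 365 point to something outside this arithmetic, such as how the viewer displays the field or how many time steps the input file really contains.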

How to make time variable continuous in a netcdf forcing climatology file?

I have a forcing NetCDF climatology file that has 4 dimensions, i.e. time, lat, long and lev. I am reading this file using the Flexible Modelling System (FMS) from GFDL. I want to keep the time continuous, something like a periodic boundary condition. For example, I would provide just a one-year monthly file, and the model should pick up the corresponding month from my file without checking the specific year. (For context: I am performing experiments in which some forcings are kept fixed while others change.)
You can use the following nco command to add modulo attribute to your time dimension
ncatted -a modulo,time,c,c," " filename.nc
You could create a netcdf file with the same year repeated many times and the timestamp overwritten to give a "dummy" year using cdo. Let's say your example netcdf file "oneyear.nc" has the times for the year 2002, whereas your other "changing" files cover the years from 1980 to 2010 (if I understand your question, some variables will change from year to year, it is just one specific file you want to keep invariant).
Then you could create a set of year files with the dummy timestamp in this way
for offset in $(seq -22 8) ; do
cdo shifttime,${offset}years oneyear.nc year${offset}.nc
done
and you can then merge them to one file using mergetime:
cdo mergetime year*.nc mydriverfile.nc
mydriverfile.nc should now contain the same monthly data repeated for 31 years (1980 to 2010) with dummy "fake" years, which should then be easy to read in the same way as your other driver files.
Edit: Concerning the discussion above on leap years, obviously for monthly data this is not an issue, but if you want to perform this task with daily data, then with this solution you should select a sample year that is a leap year, or insert a dummy day for Feb 29. I also suspect that the modulo solution above suffers from the leap year issue when applied to daily input.
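The year arithmetic behind the loop, and the reason Feb 29 needs care with daily data, can be checked with a short sketch using Python's datetime module. This only illustrates the calendar logic under the assumptions of the example (base year 2002, offsets -22 to 8); it does not show how cdo shifttime itself treats Feb 29.

```python
from datetime import date

def shift_year(d, offset):
    """Return d moved by `offset` years, or None if that date does not exist."""
    try:
        return d.replace(year=d.year + offset)
    except ValueError:  # Feb 29 shifted into a non-leap year
        return None

base_year = 2002
offsets = range(-22, 9)  # same values as `seq -22 8`
target_years = [base_year + o for o in offsets]
print(target_years[0], target_years[-1], len(target_years))

# Monthly time stamps shift cleanly into every target year.
mid_months = [date(base_year, m, 15) for m in range(1, 13)]
shifted = [[shift_year(d, o) for d in mid_months] for o in offsets]

# A daily series from a leap year contains Feb 29, which has no
# counterpart in a non-leap target year.
leap_day = date(2004, 2, 29)
print(shift_year(leap_day, 1))  # no such date in 2005
print(shift_year(leap_day, 4))  # exists in 2008
```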

count number of missing values in netcdf file - R

Is there a quick way to know how many missing values are in a netcdf file? Possibly using R.
Currently I have to
hum<-nc_open("rhum.sig995.2008.nc")
rhum<-ncvar_get(hum, "rhum")
then manually look up the missing value by typing 'hum' and copy it into this operation
sum(abs(rhum - 9.96920996838687e+36) < -9.96920996838687e+36)
Is there a more direct way, especially if I have to work with hundreds of files? I would like to avoid copying and pasting the missing value, and also I am not sure with what kind of precision the number should be handled.
My suggestion is to use the excellent raster package:
install.packages("raster")
library(raster)
r <- raster("rhum.sig995.2008.nc", varname="rhum")
NAnum <- summary(r)[6]
The total number of missing data points for a variable named "var" can be stored in a new additional variable using
ncap2 -s "nmiss=number_miss(var)" in.nc out.nc
or
ncap2 -s "nmiss=var.number_miss()" in.nc out.nc
If your data has a time dimension and you want to see the total number of missing points summed over the space dimensions, then you can see this with
cdo info in.nc
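On the precision question: rather than pasting the fill value and comparing for exact equality, one option is to compare with a relative tolerance. The sketch below is plain Python on a made-up list of values; in practice the fill value would be read from the variable's _FillValue attribute, and the tolerance of 1e-6 is an assumption, not a standard.

```python
import math

def count_missing(values, fill_value, rel_tol=1e-6):
    """Count entries that are NaN or within rel_tol of the fill value."""
    n = 0
    for v in values:
        if math.isnan(v) or math.isclose(v, fill_value, rel_tol=rel_tol):
            n += 1
    return n

FILL = 9.96920996838687e+36  # fill value quoted in the question

# Made-up data: two fill values (one with fewer stored digits) and one NaN.
data = [45.2, 9.96920996838687e+36, float("nan"), 80.1, 9.969209968e+36]
n_missing = count_missing(data, FILL)
print(n_missing)
```

A relative tolerance is used because the fill value is huge; an absolute tolerance small enough for ordinary data would be meaningless at a magnitude of 1e36.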
