How to make the time variable continuous in a NetCDF forcing climatology file? - netcdf

I have a forcing NetCDF climatology file with 4 dimensions: time, lat, lon and lev. I am reading this file with the Flexible Modelling System (FMS) from GFDL. I want the time axis to be continuous, something like a periodic boundary condition. For example, I would provide a monthly file covering just one year, and the model should pick up the corresponding month from my file without checking the specific year. For context, I am running experiments in which some forcings are kept fixed while others change.

You can use the following NCO command to add a modulo attribute to your time variable:
ncatted -a modulo,time,c,c," " filename.nc
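As a rough illustration of what a periodic (modulo) time axis means, the sketch below maps any model date onto one of twelve monthly records regardless of the year. This is not FMS code: the function name `climatology_record` and the sample values are hypothetical and only show the idea.

```python
from datetime import date

# Hypothetical one-year monthly climatology: 12 records, one per month.
monthly_forcing = [10.0, 11.5, 14.0, 17.5, 21.0, 24.5,
                   26.0, 25.5, 22.0, 18.0, 13.5, 10.5]

def climatology_record(model_date, records=monthly_forcing):
    """Return the record for the model date's month, ignoring the year."""
    return records[(model_date.month - 1) % len(records)]

# The same month in two different years picks the same record.
print(climatology_record(date(1985, 3, 15)))
print(climatology_record(date(2044, 3, 15)))
```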

You could create a netcdf file with the same year repeated many times and the timestamp overwritten to give a "dummy" year using cdo. Let's say your example netcdf file "oneyear.nc" has the times for the year 2002, whereas your other "changing" files cover the years from 1980 to 2010 (if I understand your question, some variables will change from year to year, it is just one specific file you want to keep invariant).
Then you could create a set of yearly files with dummy timestamps in this way:
for offset in $(seq -22 8) ; do
cdo shifttime,${offset}years oneyear.nc year${offset}.nc
done
and you can then merge them to one file using mergetime:
cdo mergetime year*.nc mydriverfile.nc
mydriverfile.nc should now contain the same monthly data repeated for 31 years (1980 to 2010) with dummy "fake" years, which should then be easy to read in the same way as your other driver files.
edit: Concerning the discussion above on leap years, this is obviously not an issue for monthly data. If you want to perform this task with daily data, then with this solution you should select a sample year that is a leap year, or insert a dummy day for Feb 29. I also suspect that the modulo solution above suffers from the leap year issue when applied to daily input.
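To see which dummy years the shifttime loop produces, and where the leap year caveat would bite for daily data, here is a small sketch. It only reproduces the arithmetic of the offsets (base year 2002, offsets -22 to 8, as in the example above) and does not call cdo.

```python
import calendar

base_year = 2002                   # year stored in oneyear.nc in the example
offsets = list(range(-22, 9))      # same values as `seq -22 8`
dummy_years = [base_year + k for k in offsets]

# First year, last year, and how many yearly files the loop writes.
print(dummy_years[0], dummy_years[-1], len(dummy_years))

# For daily data, these target years contain a Feb 29 that a
# non-leap sample year such as 2002 cannot supply.
leap_targets = [y for y in dummy_years if calendar.isleap(y)]
print(calendar.isleap(base_year), leap_targets)
```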

Related

Efficient way to compile NC4 file information from separate files in R

I am currently trying to compile temperature information from the WFDE5 data set, which is quite large, and I am struggling to find an efficient way to meet my goal. My main goals are to:
Determine the max temperature for individual days for each individual grid cell
Change the time step from hourly to daily and from UTC to MST.
The data set is stored in monthly NC4 files and contains the temperature data in a 3-dimensional matrix (time, lat, lon). My main question is whether there is an efficient way to compile this data to meet my goals, or to manipulate the NC4 files so that they are easier to work with (somehow merge the monthly files into one mega file?).
I have tried two rather convoluted ways to catch holes between months (example: due to the time conversion, some days end up spanning two months, which requires me to read in the next file and then continue reading the data).
My first try was to read one month (one file) at a time, using pmax() to get the max value of each grid cell while comparing time steps over 24 hours, and then repeating the process. I have been using ncvar_get() with start and count to read only one time step at a time. To catch days that span two months, I wrote a convoluted function to merge the two by calculating the number of 1-hour periods left in one month and how many would be needed from the next.
My second try still involved pmax(), but I tried a different method to fill in any holes between months. I built a date vector from the time variable for each hourly time step and matched on the same day. While this seems better, it still has to read multiple NC4 files, which gets very convoluted compared to just reading one NC4 file with all the needed information.
In the end, I tested a few cases and both solutions seem to work, but they run extremely slowly and seem very overcomplicated to me. I was wondering if anyone had suggestions on how to better set up the NC4 files for reading and time conversion.
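The grouping logic described in the question (shift hourly UTC timestamps to MST, then take the maximum per local calendar day) can be sketched as follows for a single grid cell. This is a plain-Python illustration, not the R/ncdf4 workflow itself; the function name `daily_max_mst` and the sample values are hypothetical. Because days are keyed on the shifted timestamp, a local day that straddles two monthly files needs no special handling once the hourly series are concatenated.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MST_OFFSET = timedelta(hours=-7)  # MST is UTC-7 (no daylight saving)

def daily_max_mst(times_utc, values):
    """Group hourly values by MST calendar day and return each day's maximum."""
    daily = defaultdict(lambda: float("-inf"))
    for t, v in zip(times_utc, values):
        local_day = (t + MST_OFFSET).date()
        daily[local_day] = max(daily[local_day], v)
    return dict(sorted(daily.items()))

# Hypothetical 48 hourly values spanning a month boundary (31 Jan to 1 Feb, UTC).
start = datetime(2001, 1, 31, 0)
times = [start + timedelta(hours=h) for h in range(48)]
temps = [270.0 + h % 24 for h in range(48)]

result = daily_max_mst(times, temps)
for day, tmax in result.items():
    print(day, tmax)
```

Note that the first and last local days in such a result are only partially covered by the input hours, so their maxima should be treated with care.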

Number of hot days from daily temperature

I am following this answer to calculate the number of hot days in a year (temperature exceeding 35 degC) from daily tmax data.
I am using tmax from CHELSA for 2000-2016, and I have cropped it to my bounding box.
Here are the steps I have done (example using 2001 data - 1 month 1 nc file):
Merge monthly data to annual: cdo mergetime chelsa_daily_2001*.nc chelsa_annual_2001.nc
Calculate hot days: cdo gec,308.15 chelsa_annual_2001.nc chelsa_hotdays_2001.nc (CHELSA temperature is in kelvin, so the threshold for hot days is 308.15)
Sum number of days in a year: cdo yearsum chelsa_hotdays_2001.nc chelsa_hotdays_yearsum_2001.nc
And below is the result and unfortunately not what I expected.
Why is the number of days not an integer? Did I miss something in the script?
UPDATE1 (following response from Adrian)
I have installed ncview via Homebrew, but unfortunately it will not open. I get the following error:
Note: could not open file /Users/xxx/.ncviewrc for reading
Error: Can't open display:
I tried opening the nc output in QGIS, and the result is still a float.
UPDATE2
OK, I managed to check it using ncdump, and here is the first line that contains the value. I am a bit confused, because I used 1 year of data and the total is more than 365. How did that happen?
I strongly suspect Panoply is performing some kind of spatial interpolation on the data on import.
Please take a look at the raw field directly using ncdump like this:
ncdump chelsa_hotdays_yearsum_2001.nc | less
(I pipe to less so you can scroll down through the data.) Alternatively, you can open the file in ncview and move the cursor over the data, and you will see the field values displayed in the dialog box:
ncview chelsa_hotdays_yearsum_2001.nc
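As a sanity check on what to expect in the raw field, the threshold-then-sum steps from the question can be mimicked for one grid cell. The sketch below is plain Python, not cdo; the function name `hot_day_count` and the sample values are hypothetical. A 0/1 mask summed over a year can only give whole numbers no larger than the number of days, so fractional values would have to come from a later processing or display step.

```python
THRESHOLD_K = 308.15  # 35 degC expressed in kelvin

def hot_day_count(tmax_kelvin, threshold=THRESHOLD_K):
    """Count days whose maximum temperature is at or above the threshold."""
    mask = [1 if t >= threshold else 0 for t in tmax_kelvin]  # role of gec
    return sum(mask)                                          # role of yearsum

# Hypothetical daily tmax values (kelvin) for one grid cell.
tmax = [305.2, 308.15, 309.0, 307.9, 310.4]
print(hot_day_count(tmax))
```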

How to remove seasonality from time series data?

How can I remove the seasonality from a time series stored in a NetCDF file? I would like a solution that works on Linux; so far I have used GrADS and Ferret for visualization.
Thanks a lot!
You can use CDO to calculate the average for each day/month of the year and subtract it from the original file:
If the file contains daily data:
cdo sub in.nc -ydaymean in.nc deseasonalized.nc
Likewise if the data is monthly:
cdo sub in.nc -ymonmean in.nc deseasonalized.nc
The ydaymean and ymonmean commands calculate the annual cycle over the dataset in.nc, i.e. ymonmean returns 12 time slices (the average of all the Januarys, all the Februarys and so on), which are then subtracted from the original file using sub. I've used piping, but it may be easier to understand on two separate lines:
cdo ymonmean in.nc annual_cycle.nc
cdo sub in.nc annual_cycle.nc deseasonalized.nc
This does exactly the same thing; deseasonalized.nc will be identical (well, almost: there will be a few bytes of difference due to the different "history" log in the NetCDF global metadata header), but you will also have a new file, annual_cycle.nc, containing the annual cycle (which might also be useful).
When doing the subtraction, CDO detects that the number of time slices is smaller in the second file and thus cycles through it. Note that, because the seasonal cycle is calculated from the same file as the original data, it is fine to simply use "sub": if the data start in e.g. April, the results of ymonmean will also start from April. However, if you want to remove a seasonal cycle calculated from a different source, the start day/month may be different and you end up subtracting e.g. the April mean from January! To avoid this, you can use the ymonsub command instead:
cdo ymonsub full_timeseries.nc seasonal_file.nc deseasonalised.nc
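The month-matched subtraction can be sketched in plain Python to make the logic explicit. This is an illustration of the idea only, not how CDO is implemented; the function names and the sample series are hypothetical. Each value is paired with its own calendar month, so the series may start in any month.

```python
from collections import defaultdict

def monthly_climatology(months, values):
    """Mean of all values sharing the same calendar month (role of ymonmean)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for m, v in zip(months, values):
        sums[m] += v
        counts[m] += 1
    return {m: sums[m] / counts[m] for m in sums}

def remove_seasonal_cycle(months, values, climatology):
    """Subtract each value's own calendar-month mean (role of ymonsub)."""
    return [v - climatology[m] for m, v in zip(months, values)]

# Hypothetical two-year monthly series starting in April:
# a seasonal cycle equal to the month number, minus 1 in year one, plus 1 in year two.
months = [(3 + i) % 12 + 1 for i in range(24)]
values = [float(m) + (1.0 if i >= 12 else -1.0) for i, m in enumerate(months)]

clim = monthly_climatology(months, values)
anomalies = remove_seasonal_cycle(months, values, clim)
print(anomalies[:3], anomalies[12:15])
```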
In addition, there are now also packages in both R and python to allow you to access the full functionality of cdo from within those languages without having to resort to using shell access tools.
Edit 2021: I now have a video on this topic, which you can view here: https://youtu.be/jKlA1ouoQIs

What are the consequences of choosing different frequencies for ts objects?

To create a ts-object in R, one has to specify a data frame, a start date and the frequency of the time series.
When searching the internet (e.g. Role of frequency parameter in ts), I get the impression that by choosing the frequency, one can emphasise whatever periodic pattern one believes is the most important in the data. However, I doubt that this is actually true. My impression is that it is solely used to compute the dates of the time series on-the-fly. E.g. when I set the start date “2015-08-01”, R automatically transforms it into a decimal date and I get something like 2015.58. If I now choose a frequency of 365 (or 365.25), R divides one unit by 365 and assigns this fraction to each day as one unit ahead, so the entry 366 days later is exactly 2016.58. However, if I choose frequency=7, the fraction assigned to each day is 1/7th, so the date assigned to the 8th day after my start date corresponds to a decimal number between 2016 and 2017. So the only choice for a data set with 365 entries per year is 365, isn’t it? And it is only used to actually create the time series?
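The date arithmetic described above can be written out as a small sketch. It assumes the time value of observation i is start + i / frequency, which is the behaviour the question describes; the helper `ts_times` is hypothetical and written in Python rather than R, and it covers only the time index, not any other use of frequency.

```python
def ts_times(start, frequency, n):
    """Time value of each of n observations: start + i / frequency."""
    return [start + i / frequency for i in range(n)]

start = 2015.58  # approximate decimal date for 2015-08-01, as in the question

# With frequency=365, one full time unit (a year) passes after 365 steps.
print(ts_times(start, 365, 366)[-1])

# With frequency=7, one full time unit already passes after 7 steps.
print(ts_times(start, 7, 8)[-1])
```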
Otherwise, if I choose the xts-class, an xts-object is built from a vector and a matrix where the vector has to be created in advance. So here there is no need to compute the date on-the-fly using a start date and a frequency and that is the reason why no frequency has to be assigned at all.
In both cases I can apply forecasting functions (such as ARIMA, ets, stl, bats etc.) to either ts or xts objects without specifying anything else, which suggests that the frequency is not used for anything else. Or am I missing something here?
Thanks in advance for your comments!

Can I make a time series work with date objects rather than integers?

I have time series data that I'm trying to analyse in R. It was provided as a CSV from excel, which I subsequently read as a data.frame all. Let's say it has two columns: all$date and all$people, representing the count of people on a particular date. The frequency is hence daily.
Being from Excel, the dates are integers representing the number of days since 1900-01-01.
I could read the data as people = ts(all$people, start=c(all$date[1], 1), frequency=365); but that gives a silly start value of almost 40000 because the data starts in 2006. The start parameter doesn't take a date object, according to ?ts, so I can't just use as.Date():
ts - ...
start: the time of the first observation. Either a single number
or a vector of two integers, which specify a natural time unit and
a (1-based) number of samples into the time unit. See the examples
for the use of the second form.
I could of course set start=1, but it's a bit painful to figure out what season we're in when the plot tells me interesting things are happening around day 2100. (To be clear, setting frequency=365 does tell me what year we're in, but isn't useful for more precise dates.) Is there a useful way of expressing the date in ts in a human-readable form so that I don't have to keep calling as.Date() to understand when the interesting features are happening?
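One possible starting point is to convert the Excel serial numbers to calendar dates and derive a (year, day-of-year) pair, which has the two-integer shape that the start argument accepts. The sketch below is in Python rather than R and rests on a stated assumption: the serials come from Excel's default 1900 date system, for which 1899-12-30 acts as day zero for any date after February 1900. The function names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: Excel's default 1900 date system; 1899-12-30 works as day zero
# for dates after February 1900.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel day serial to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

def ts_start(serial):
    """(year, day-of-year) pair for use as a start value with frequency=365."""
    d = excel_serial_to_date(serial)
    return d.year, d.timetuple().tm_yday

print(excel_serial_to_date(38718), ts_start(38718))
```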
