Number of hot days from daily temperature - netCDF

I am following this answer to calculate the number of hot days in a year (daily maximum temperature exceeding 35 °C) from daily tmax data.
I am using tmax from CHELSA for 2000-2016, and I have cropped it to my bounding box.
Here are the steps I have done (example using 2001 data, one nc file per month):
Merge the monthly files into an annual file: cdo mergetime chelsa_daily_2001*.nc chelsa_annual_2001.nc
Flag hot days: cdo gec,308.15 chelsa_annual_2001.nc chelsa_hotdays_2001.nc (CHELSA temperatures are in Kelvin, so the 35 °C threshold is 308.15 K)
Sum the number of hot days in the year: cdo yearsum chelsa_hotdays_2001.nc chelsa_hotdays_yearsum_2001.nc
Below is the result, and unfortunately it is not what I expected.
Why is the number of days not an integer? Did I miss something in the script?
UPDATE 1 (following the response from Adrian)
I have installed ncview via Homebrew but unfortunately it won't open. I get the following error:
Note: could not open file /Users/xxx/.ncviewrc for reading
Error: Can't open display:
I tried opening the nc output in QGIS, and the values are still floats.
UPDATE 2
OK, I managed to check it using ncdump, and here is the first line that contains values. A bit confusing, because I used only one year of data and yet the total is more than 365. How did that happen?

I strongly suspect Panoply is performing some kind of spatial interpolation on the data on import.
Please take a look at the raw field directly using ncdump like this
ncdump chelsa_hotdays_yearsum_2001.nc | less
(I pipe to less so you can scroll down through the data.) Alternatively, you can open the file in ncview and move the cursor over the data, and the field values will be displayed in the dialog box.
ncview chelsa_hotdays_yearsum_2001.nc
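If ncview won't start (the "Can't open display:" message in UPDATE 1 is the standard X11 error, so on macOS an X server such as XQuartz would be needed), the same spot-check can be done from R. A minimal sketch, assuming the ncdf4 package is installed; the variable is looked up from the file since I don't know what name the CHELSA output carries:

library(ncdf4)

nc   <- nc_open("chelsa_hotdays_yearsum_2001.nc")
var  <- names(nc$var)[1]     # first data variable in the file
vals <- ncvar_get(nc, var)
nc_close(nc)

range(vals, na.rm = TRUE)    # a yearly count should lie within 0..366
table(vals %% 1 == 0)        # how many cells hold non-integer values?

If the non-integer values are already present in the file itself, the problem lies in the processing chain rather than in Panoply; checking the number of timesteps in the merged annual file (more than 365 would also explain a yearsum above 365) would be a sensible next step.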

Related

Efficient way to compile NC4 file information from separate files in R

I am currently trying to compile temperature information from the WFDE5 dataset, which is quite large, and I am struggling to find an efficient way to meet my goals. My main goals are to:
Determine the max temperature for each individual day in each individual grid cell
Change the time step from hourly to daily, and from UTC to MST.
The dataset is stored in monthly NC4 files and holds the temperature data in a 3-dimensional matrix (time, lat, lon). My main question is whether there is an efficient way to compile this data to meet my goals, or to manipulate the NC4 files to be easier to work with (somehow merge the monthly files into one mega file?).
I have tried two rather convoluted ways of catching the holes between months (for example, due to the time conversion, some days end up spanning two months, which requires me to read in the next file and then continue reading the data).
My first try was to read one month/file at a time, using pmax() to get the max value of each grid cell while comparing time steps in 24-hour blocks, then repeating the process. I have been using ncvar_get() with start and count to read only one time step at a time. To catch days that span two months, I wrote a convoluted function that merges the two by calculating the number of one-hour periods left in one month and how many are needed from the next.
My second try still involved pmax(), but I used a different method to fill the holes between months: I built a date vector from the time variable for each hourly time step and matched slices falling on the same day. While this seems better, it still has to read multiple NC4 files, which gets very convoluted compared to simply reading one NC4 file with all the needed information.
In the end, I ran a few test cases and both solutions seem to work, but they run extremely slowly and feel very overcomplicated. I was wondering if anyone had suggestions on how to better set up the NC4 files for reading and time conversion.
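A minimal sketch of one way to restructure this, assuming the ncdf4 and abind packages and hourly files holding a variable named Tair (the file pattern and variable name are hypothetical; adjust them to your WFDE5 download). The idea is to concatenate all months along the time axis first, so that days pushed across a month boundary by the UTC-to-MST shift need no special handling:

library(ncdf4)
library(abind)

files <- sort(list.files(pattern = "Tair_WFDE5_.*\\.nc4?$"))

# Read one month: the full (lon, lat, time) cube plus POSIXct timestamps
read_month <- function(f) {
  nc <- nc_open(f)
  on.exit(nc_close(nc))
  tair  <- ncvar_get(nc, "Tair")                 # lon x lat x time
  hours <- ncvar_get(nc, "time")
  units <- ncatt_get(nc, "time", "units")$value  # e.g. "hours since 1900-01-01"
  origin <- as.POSIXct(sub("hours since ", "", units), tz = "UTC")
  list(tair = tair, stamp = origin + hours * 3600)
}

months <- lapply(files, read_month)

# One cube for the whole record; the month boundaries disappear here
tair  <- abind(lapply(months, `[[`, "tair"), along = 3)
stamp <- do.call(c, lapply(months, `[[`, "stamp"))

# Shift UTC to MST (UTC-7) and group hourly slices by local calendar day
mst_day <- as.Date(stamp - 7 * 3600)
days    <- unique(mst_day)

# Daily maximum per grid cell, giving a lon x lat x day array
tmax <- vapply(days, function(d)
  apply(tair[, , mst_day == d, drop = FALSE], c(1, 2), max),
  FUN.VALUE = tair[, , 1])

This holds the whole record in memory, which may be too much for the full WFDE5 grid; the same grouping logic still works if applied to pairs of adjacent months, carrying only the spill-over hours forward.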

How to remove seasonality from time series data?

How can I remove seasonality from a time series stored in a netCDF file? I would like a solution that works on Linux; I have used GrADS and Ferret for visualization.
Thanks a lot!
You can use CDO to calculate the average for each day/month of the year and subtract it from the original file:
If the file contains daily data:
cdo sub in.nc -ydaymean in.nc deseasonalized.nc
Likewise if the data is monthly:
cdo sub in.nc -ymonmean in.nc deseasonalized.nc
The ydaymean and ymonmean commands calculate the annual cycle over the dataset in.nc; for example, ymonmean returns 12 time slices (the average of all the Januaries, all the Februaries, and so on), which is then subtracted from the original file using sub. I've used piping, but it may be easier to understand as two separate commands:
cdo ymonmean in.nc annual_cycle.nc
cdo sub in.nc annual_cycle.nc deseasonalized.nc
This does exactly the same thing; deseasonalized.nc will be identical (well, almost: there will be a few bytes' difference due to the different "history" log in the netCDF global metadata header), but you also get the annual cycle itself in annual_cycle.nc (which might be useful too).
When doing the subtraction, CDO detects that the number of time slices is smaller in the second file and thus loops/cycles through it. Note that because the seasonal cycle is calculated from the same file as the original data, it is fine to simply use sub: if the data starts in, say, April, the results of ymonmean will also start from April. However, if you want to remove a seasonal cycle calculated from a different source, the start day/month may differ and you could end up subtracting, e.g., the April mean from January! To avoid this, you can use the ymonsub command instead:
cdo ymonsub full_timeseries.nc seasonal_file.nc deseasonalised.nc
In addition, there are now packages in both R and Python that give you access to the full functionality of CDO from within those languages, without having to resort to shell access tools.
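Even without a dedicated package, the one-liner above can be driven from base R; a minimal sketch using system2(), assuming only that cdo is on the PATH (file names are placeholders):

# Equivalent to: cdo sub in.nc -ymonmean in.nc deseasonalized.nc
deseasonalize <- function(infile, outfile) {
  status <- system2("cdo", c("sub", infile, "-ymonmean", infile, outfile))
  if (status != 0) stop("cdo exited with status ", status)
  invisible(outfile)
}

deseasonalize("in.nc", "deseasonalized.nc")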
Edit 2021: I now have a video on this topic, which you can view here: https://youtu.be/jKlA1ouoQIs

How to make the time variable continuous in a netCDF forcing climatology file?

I have a forcing NetCDF climatology file with 4 dimensions: time, lat, lon, and lev. I am reading this file using the Flexible Modelling System (FMS) from GFDL. I want to make the time continuous, something like a periodic boundary condition. For example, I should be able to provide just one year of monthly data, and the model should pick up the corresponding month from my file without checking the specific year. (To give context for why I want this: I am performing experiments in which some forcings are kept fixed while others change.)
You can use the following NCO command to add a modulo attribute to your time dimension:
ncatted -a modulo,time,c,c," " filename.nc
Alternatively, you could use cdo to create a netCDF file with the same year repeated many times and the timestamps overwritten to give "dummy" years. Let's say your example netCDF file "oneyear.nc" holds times for the year 2002, whereas your other "changing" files cover the years 1980 to 2010 (if I understand your question, some variables change from year to year; it is just this one specific file you want to keep invariant).
Then you could create a set of per-year files with dummy timestamps this way:
for offset in $(seq -22 8) ; do
  cdo shifttime,${offset}years oneyear.nc year${offset}.nc
done
and you can then merge them into one file using mergetime:
cdo mergetime year*.nc mydriverfile.nc
mydriverfile.nc should now contain the same monthly data repeated over the 31 years with dummy "fake" years, which should then be easy to read in the same way as your other driver files.
Edit: Concerning the discussion above on leap years: for monthly data this is obviously not an issue, but if you want to perform this task with daily data, you should pick a sample year that is a leap year, or insert a dummy day for Feb 29. I also suspect that the modulo solution above suffers from the leap-year issue when applied to daily input.

How can I generate a fully-formatted Excel document as a final output?

I have built a script that extracts data from numerous backend systems and I want the output to be formatted like this:
Which package is the best and/or easiest to do this with?
I am aware of the xlsx package, but I am also aware that there are others available, so I would like to know which is best in terms of ease and/or simplicity for achieving my desired output.
A little more detail:
If I run the report across seven days, the resulting data frame is 168 rows deep (1 row per hour, 168 hours per week). I want each date (00:00 - 23:00) broken out into day-long blocks, as per the image I have provided.
(Also note that I am in London, England, and as such am currently in timezone UTC+1, which means that right now the hourly breakdown for each date runs from 01:00 to 00:00 the next day, because our backend systems run on UTC; that is fine.)
At present I copy and paste (transpose) the values across manually, but I want to fully automate the process, so that I can run the script (function) and have the resulting output look like the image.
This is what the current final output looks like:
Try the package openxlsx. The package offers a lot of custom formatting for .xlsx documents and is actively developed / fairly responsive to GitHub issues. The vignettes on the CRAN website are particularly useful.
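A minimal openxlsx sketch to show the flavour of the API; the data frame and styling here are placeholders, not the poster's actual report layout:

library(openxlsx)

# Dummy hourly data standing in for one day-long block of the report
df <- data.frame(Hour  = sprintf("%02d:00", 0:23),
                 Calls = sample(0:100, 24))

wb <- createWorkbook()
addWorksheet(wb, "Report")

# Bold, shaded header row, applied as the data is written
hdr <- createStyle(textDecoration = "bold", fgFill = "#DCE6F1",
                   border = "bottom")
writeData(wb, "Report", df, startRow = 1, headerStyle = hdr)

setColWidths(wb, "Report", cols = 1:2, widths = "auto")
saveWorkbook(wb, "report.xlsx", overwrite = TRUE)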

SPSS date format when imported into R

I have not worked with SPSS (.sav) files before and am trying to work with some data files provided to me by importing them into R. I did not receive any explanation of the files, and because communication is difficult I am trying to figure out as much as I can on my own.
Here's my first question. This is what the Date field looks like in an R data frame after import:
> dataset2$Date[1:4]
[1] 13608172800 13608259200 13608345600 13608345600
I don't know what dates the data is supposed to be for, but I found that if I divide the above numbers by 10, that seems to give a reasonable date (in February 2013). Can anyone confirm this is indeed what the above represents?
My second question is regarding another column called Begin_time. Here's what that looks like:
> dataset2$Begin_time[1:4]
[1] 29520 61800 21480 55080
Any idea what this represents? I want to believe it is some representation of time of day, because the records are wildlife observations, but I have no more information than that to go on. I noticed that if I take the difference between End_Time and Begin_time, I get numbers like 120 and 180, which seem like minutes to me (three hours seems reasonable for observing a wild animal), but the absolute numbers are far greater than the number of minutes in a day (1440), so that leaves me puzzled. Is this some timekeeping format from SPSS? If so, what is the logic?
Unfortunately, I don't have access to SPSS, so any help would be much appreciated.
I had the same problem and this function is a good solution:
pss2date <- function(x) as.Date(x/86400, origin = "1582-10-14")
This is where I found the answer:
http://scs.math.yorku.ca/index.php/R:_Importing_dates_from_SPSS
Dates in SPSS Statistics are represented as floating-point doubles holding the number of seconds since October 14, 1582. If you use the SPSS R plugin APIs, they can be converted to R dates automatically, but any proper converter should be able to do this for you.
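A quick check of the conversion against the values quoted in the question (the dates shown are simply what the formula yields, given the SPSS epoch of 1582-10-14):

pss2date <- function(x) as.Date(x / 86400, origin = "1582-10-14")

pss2date(c(13608172800, 13608259200, 13608345600, 13608345600))
# [1] "2014-01-04" "2014-01-05" "2014-01-06" "2014-01-06"

# Begin_time is plausibly an SPSS time of day, i.e. seconds since midnight
# (an assumption on my part, not confirmed by the poster's data):
format(as.POSIXct(29520, origin = "1970-01-01", tz = "UTC"), "%H:%M")
# [1] "08:12"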
