Monthly sum of wet days from daily data using Climate Data Operators (CDO) - netcdf

I have climate data with a daily temporal resolution and would like a count of days that have precipitation (e.g., greater than 1mm/day) by month and by year.
I've tried eca_pd,1 and eca_rr1, but these commands return wet-day totals for all years.
For example, cdo eca_pd,1 infile outfile
Is there a command to return wet-days for each month and/or year?

You can accomplish this task with CDO's masking functions. For more details beyond the answer below, you can also refer to my video guide on masking using cdo.
The first step is to make an equivalent file containing 1 if P>=threshold (1mm/day in your case) and 0 otherwise. For this we use the "greater than or equal to a constant" gec function (or gtc, "greater than a constant", if you prefer a strict inequality):
cdo gec,1 input.nc mask.nc
(assuming units are mm/day in your input file).
Then you can simply sum this mask over the period (months, years, etc.) for which you want the statistic:
cdo monsum mask.nc nwetdays_mon.nc
cdo yearsum mask.nc nwetdays_year.nc
Of course, you can chain the operators if you would like to do this on one line, e.g.
cdo monsum -gec,1 input.nc nwetdays_mon.nc
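To make the mask-then-sum logic concrete, here is a minimal pure-Python sketch of the same idea on a short one-dimensional daily series. The function name wet_day_counts and the sample values are hypothetical and only for illustration; CDO applies the same steps to every grid cell of the NetCDF file.

```python
from collections import Counter
from datetime import date, timedelta

def wet_day_counts(start, precip_mm, threshold=1.0):
    """Count days with precipitation >= threshold per (year, month) and per year."""
    monthly = Counter()
    yearly = Counter()
    for offset, value in enumerate(precip_mm):
        day = start + timedelta(days=offset)
        wet = 1 if value >= threshold else 0    # same test as `gec,1`
        monthly[(day.year, day.month)] += wet   # analogous to `monsum`
        yearly[day.year] += wet                 # analogous to `yearsum`
    return dict(monthly), dict(yearly)

# Hypothetical data: 31 days of January 2020 followed by 3 days of February
precip = [0.0, 2.5, 1.0, 0.4] + [0.0] * 27 + [5.0, 0.9, 3.2]
monthly, yearly = wet_day_counts(date(2020, 1, 1), precip)
print(monthly)
print(yearly)
```

Replacing >= with > in the sketch corresponds to switching from gec to gtc.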
We can take this even further if you want to work out the climatology for a particular month. If you have a multiyear dataset then you can use the wonderful "ymonstat" commands. So for example, once you have calculated your monthly series of wet days above, you can calculate the average for each month with
cdo ymonmean nwetdays_mon.nc nwetdays_mon_clim.nc
You can then difference the series from this monthly climatology to give you the anomaly of wet days in each month over the series
cdo ymonsub nwetdays_mon.nc nwetdays_mon_clim.nc nwetdays_mon_anom.nc
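As a rough illustration of what the ymonmean and ymonsub steps compute, the following Python sketch works on a hypothetical dictionary of monthly wet-day counts keyed by (year, month). The helper names and the numbers are made up for the example; they are not part of CDO.

```python
from collections import defaultdict

def monthly_climatology(series):
    """Average all Januaries, all Februaries, etc. (the idea behind `ymonmean`)."""
    buckets = defaultdict(list)
    for (year, month), value in series.items():
        buckets[month].append(value)
    return {month: sum(vals) / len(vals) for month, vals in buckets.items()}

def monthly_anomaly(series, clim):
    """Subtract each calendar month's climatology (the idea behind `ymonsub`)."""
    return {key: value - clim[key[1]] for key, value in series.items()}

# Hypothetical wet-day counts for two years, two months each
nwet = {(2001, 1): 10, (2001, 2): 4, (2002, 1): 14, (2002, 2): 8}
clim = monthly_climatology(nwet)
anom = monthly_anomaly(nwet, clim)
print(clim)
print(anom)
```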
I hope that helps!
(PS: I usually find it easier to calculate these kinds of statistics directly with CDO in this way; I rarely find that the built-in climate functions calculate exactly the statistic I want.)

With NCO's ncap2, create a binary flag then total it in the desired dimension(s):
ncap2 -s 'rainy=(precip > 1);rainy_days=rainy.total($time)' in.nc out.nc

You can also do this in cf-python, essentially using the same methodology as the CDO example above, but in a Python environment, using the where and collapse methods:
import cf
# Read the dataset
f = cf.read('filename.nc')[0]
# Mask out dry days (assuming that your data
# units are 'mm day-1' or 'kg m-2 day-1', etc.)
wet = f.where(cf.le(1), cf.masked)
# If the data are in units of 'metres/day', say, then you could do:
# wet = f.where(cf.le(0.001), cf.masked)
# or
# wet = f.where(cf.le(1, 'mm day-1'), cf.masked)
# etc.
# Count the wet day occurrences by month
count_monthly = wet.collapse('T: sample_size', group=cf.M())
# Count the wet day occurrences by year
count_yearly = wet.collapse('T: sample_size', group=cf.Y())
# Get the data as numpy arrays
print(count_monthly.array)
print(count_yearly.array)
# Sum the precipitation on wet days by month
wet_day_sum_monthly = wet.collapse('T: sum', group=cf.M())
# Sum the precipitation on wet days by year
wet_day_sum_yearly = wet.collapse('T: sum', group=cf.Y())

Related

Accumulated precipitation between 12 UTC and 12 UTC of the next day

I have a .nc file that contains precipitation data every 6 hours for 1 full year. My interest is to calculate the daily precipitation and compare it with observed data, so I must make the two coincide temporally. To achieve this, the precipitation should be accumulated between 12 UTC of one day and 12 UTC of the next day. Does anyone have a suggestion on how to achieve this with CDO?
Thank you!
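Well, if the first slice covers 12-18 UTC, then essentially you want to average the timeseries 4 slices at a time (right?), in which case you can use the command below (use timselsum instead if each slice holds a 6-hour accumulation and you want the daily total):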
cdo timselmean,4 infile.nc outfile.nc
If the timeseries starts instead at 00, you may need to remove the first two timeslices before you start (cdo seltimestep)
Another method is a bit more of a fudge, in that you can shift the series by 12 hours, and then use the day mean function. This would have the advantage of working for any frequency of data (i.e. you don't hardwire the factor "4" based on the data frequency)
cdo daymean -shifttime,-12hours infile.nc outfile.nc
The answer Adrian Tompkins gives should work well. One additional point to note is that you can remove time steps in CDO. So, if your time series starts at 0 UTC and ends at 24 UTC, you do not want the first two and last two time steps in Adrian's first answer, and you could modify it as follows:
cdo -timselmean,4 -delete,timestep=-1,-2,1,2 infile.nc outfile.nc
This will require a 2.x version of CDO.
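The shift-then-group idea behind the shifttime approach can be sketched in plain Python. This is only an illustration under two assumptions that are not stated in the question: each time stamp marks the start of its 6-hour interval, and each value is an accumulation over that interval, so the sketch sums rather than averages. The function name accumulate_12_to_12 is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def accumulate_12_to_12(times, values):
    """Sum values into 12 UTC to 12 UTC windows by shifting stamps back 12 hours."""
    totals = defaultdict(float)
    for t, v in zip(times, values):
        shifted = t - timedelta(hours=12)   # 12 UTC becomes 00 UTC of the same date
        totals[shifted.date()] += v         # group by the shifted calendar day
    return dict(totals)

# Hypothetical 6-hourly series starting at 12 UTC on 1 January 2020
start = datetime(2020, 1, 1, 12)
times = [start + timedelta(hours=6 * i) for i in range(8)]
values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
totals = accumulate_12_to_12(times, values)
print(totals)
```

If the time stamps instead mark the end of each interval, the shift would need to be adjusted by one time step so that the stamp at 12 UTC is grouped with the preceding window.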

Calculate a 1000 year mean using decadal data from NetCDF file in R or CDO

I have a netCDF file for temperature going back the last 22 thousand years at a decadal average (TraCE dataset). I want to calculate 100 or 1000 year averages.
I am really stuck, if anyone could help then that would be great. I am mostly using R, but if it is simple in cdo then I can try this too.
I don't have any code to show as I really don't know where to start. Most examples I have seen have been on daily or yearly data... not decadal
Your data is decadal averages, so it should be easy to do this in CDO. You want to calculate a rolling average which is averaged over every 10 time steps. For this runmean is your friend. Just do the following:
cdo runmean,10 infile.nc outfile.nc
You might need to subset time afterwards, depending on the exact output you want. It sounds like the time you have may be non-standard, but runmean should still be OK.
Robert's solution is useful if you want a smoothed output at the 100 or 1000 year timescale. Your original dataset has 2200 timesteps, and runmean,10 smooths this and produces an output with 2200-9=2191 timesteps, each of which is an average over a 100 year window either centered on the slice itself or lagged/lead, depending on the option used.
However, from your question, I think you are more likely to want an output where the first slice is the average over the first century, the second is for the second century and so on, that is, an output with 220 timeslices, each a century average (or 22 time-slices of 1000 year averages). In other words, you want a command analogous to daymean, monmean and yearmean, but as there is no command called centurymean, then you can instead resort to the more generic command timselmean and manually define your window length:
# Centurial average:
cdo timselmean,10 infile.nc outfile.nc
# Millennial Average:
cdo timselmean,100 infile.nc outfile.nc
I think this should still work despite the non-CF compliant time units you mention in the comment (but let us know if it doesn't)
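The difference between the two answers can be checked with a small Python sketch that mimics a running mean and a block mean on a synthetic series of 2200 values. The function names and the data are hypothetical, and the block-mean helper drops any incomplete trailing block, which is a choice made for this sketch rather than a statement about CDO's behaviour.

```python
def running_mean(values, window):
    """Mean over every window of consecutive values (the idea behind `runmean`)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def block_mean(values, window):
    """Mean over non-overlapping blocks (the idea behind `timselmean`)."""
    full = len(values) // window * window   # this sketch drops a partial last block
    return [sum(values[i:i + window]) / window
            for i in range(0, full, window)]

# Synthetic stand-in for 2200 decadal means
decadal = [float(i) for i in range(2200)]
smoothed = running_mean(decadal, 10)     # smoothed series at the 100-year scale
centuries = block_mean(decadal, 10)      # one value per century
millennia = block_mean(decadal, 100)     # one value per millennium
print(len(smoothed), len(centuries), len(millennia))
```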

How do I transform half-hourly data that does not span the whole day to a Time Series in R?

This is my first question on stackoverflow, sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. I currently have data that looks like this:
The menge column represents how much water a person has actually drunk in 30 minutes (So first value represents amount from 8:00 till before 8:30 etc..). This is a 1 day sample from 3 months of data. The day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency defines the number of (usually regularly spaced) observations within a given time period. For your example, your observations are every 30 minutes between 8AM and 8PM, and your time period is 1 day. Choosing a period of 1 day assumes that the patterns within each day are of most interest here; you could also use 1 week.
So within each day of your data (8AM-8PM) you have 24 observations (24 half hours). So a suitable frequency for this data would be 24.
You can also pad the data with 0 values, however this isn't necessary and would complicate the model. If you padded the data so that it has observations for all half-hours of the day, the frequency would then be 48.

How to convert annual netCDF data to daily from the command line?

Before I resort to using Python, I would like to know if there is a simple way from the command line to convert an annual netCDF file into a file with daily data simply by duplication (including leap years), i.e. each annual value is duplicated 365 (366) times with the appropriate date stamp.
In the file each data value has the date stamped for the first day of each year:
cdo showdate population_ssp2soc_0p5deg_annual_2006-2100_density.nc4
2006-01-01 2007-01-01 2008-01-01 2009-01-01 2010-01-01 ...etc
I know it seems like a strange thing to do (the file size will be 365.25 times bigger!), but I want to read the data into a Fortran program that uses a daily timestep and do not want to mess around with dates in the Fortran code.
There might be a more efficient way of doing this, but you could first merge the original file with a second copy of it whose time stamps are shifted to the final day of each year, and then interpolate in time:
cdo -mergetime population_ssp2soc_0p5deg_annual_2006-2100_density.nc4 -shifttime,-1day -shifttime,1year population_ssp2soc_0p5deg_annual_2006-2100_density.nc4 temp.nc
cdo inttime,2006-01-01,12:00:00,1day temp.nc outfile.nc
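For comparison, the duplication itself is simple to express in Python. The sketch below is a minimal illustration on a hypothetical dictionary of annual values, not a NetCDF workflow; the function name annual_to_daily is made up for the example, and leap years are handled by stepping through the calendar one day at a time.

```python
from datetime import date, timedelta

def annual_to_daily(annual):
    """Repeat each year's value once per calendar day of that year."""
    daily = []
    for year in sorted(annual):
        day = date(year, 1, 1)
        while day.year == year:             # 365 or 366 iterations
            daily.append((day, annual[year]))
            day += timedelta(days=1)
    return daily

# Hypothetical annual values; 2008 is a leap year
daily = annual_to_daily({2007: 1.5, 2008: 2.5})
print(len(daily), daily[0], daily[-1])
```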

gnuplot, calculating and plotting monthly averages

I have a datafile with several months of minute data with lines like "2016-02-02 13:21(\t)value(\n)".
I need to plot the data (no problem with that) and calculate + plot an average for each month.
Is it possible in gnuplot?
I am able to get an overall average using
fit a "datafile" using 1:3 via a
I am also able to specify some time range for the fit using
fit [now_secs-3600*24*31:now_secs] b "datafile" using 1:3 via b
... and then plot them with
plot a t "Total average",b t "Last 31 days"
But no idea how to calculate and plot an average for each month (= one stepped line showing each month average)
Here is a way to do it purely in gnuplot. This method can be adapted (with a not small amount of effort) to work with files that cross a year boundary or span more than one year. It works just fine if the data starts with January or not. It computes the ordinary average for each month (the arithmetic mean) treating each data point as one value for the month. With somewhat significant modification, it can be used to work with weighted averages as well.
This makes a significant use of the stats function to compute values. It is a little long, partly because I commented it heavily. It uses 5.0 features (NaN for undefined values and in-memory datablocks instead of temporary files), but comments note how to change these for earlier versions.
Note: This script must be run before setting time mode. The stats function will not work in time mode. Time conversions are handled by the script functions.
data_time_format = "%Y-%m-%d %H:%M" #date format in file
date_cols = 2 # Number of columns consumed by date format
# get numeric month value of time - 1=January, 12=December
get_month(x) = 0+strftime("%m",strptime(data_time_format,x))
# get numeric year value of time
get_year(x) = 0+strftime("%Y",strptime(data_time_format,x))
# get internal time representation of day 1 of month x in year y
get_month_first(x,y) = strptime("%Y-%m-%d",sprintf("%d-%d-01",y,x))
# get internal time representation of date
get_date(x) = strptime(data_time_format,x)
# get date string in file format corresponding to day y in month x of year z
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-%02d",z,x,y)))
# determine if date represented by z is in month x of year y
check_valid(x,y,z) = (get_date(z)>=get_month_first(x,y))&(get_date(z)<get_month_first(x+1,y))
# Determine year and month range represented by file
year = 0
stats datafile u (year=get_year(strcol(1)),get_month(strcol(1))) nooutput
month_min = STATS_min
month_max = STATS_max
# list of average values for each month
aves = ""
# fill missing months at beginning of year with 0
do for[i=1:(month_min-1)] {
aves = sprintf("%s %d",aves,0)
}
# compute average of each month and store it at the end of aves
do for[i=month_min:month_max] {
# In versions prior to 5.0, replace NaN with 1/0
stats datafile u (check_valid(i,year,strcol(1))?column(date_cols+1):NaN) nooutput
aves = sprintf("%s %f",aves,STATS_mean)
}
# day on which to plot average
baseday = 15
# In version prior to 5.0, replace $k with a temporary file name
set print $k
# Change this to start at 1 if we want to fill in prior months
do for [i=month_min:month_max] {
print sprintf("%s %s",get_date_string(i,baseday,year),word(aves,i))
}
set print
This script will create either an in-memory datablock or, for earlier versions (with the noted changes), a temporary file that has the same layout as the original but contains one entry per month with the value of the monthly average.
At the beginning we need to define our date format and the number of columns that the date format consumes. From then on it is assumed that the data file is structured as datetime value. Several functions are defined which make extensive use of the strptime function (to convert a date string to the internal time representation) and the strftime function (to convert the internal representation to a string). Some of these functions convert both ways in order to extract the necessary values. Note the addition of 0 in the get_month and get_year functions to convert a string value to an integer.
We do several steps with the data in order to build our resulting datablock/file.
Use the stats function to compute the first and last month and the year. We are assuming only one year is present. This step needs to be modified heavily if we need to work with more than one year. In particular months in a second year would need to be numbered 13 - 24 and in a third year 25 - 36 and so on. We would need to modify this line to capture multiple years as well. Probably two passes would be needed.
Build up a string which contains space separated values for the average value for each month. This is done by applying the stats function once for each month. The check_valid function checks if a value is in the month of interest, and a value that isn't is assigned NaN which causes the stats function to ignore it.
Loop over the months of interest and build a datablock/temporary file with one entry for each month with the average value for that month. In this case, the average value is assigned to the start of the 15th day of the month. This can be easily changed to any other desired time. The get_date_string function is used for assigning the value to a time.
Now to demonstrate this, suppose that we have the following data
2016-02-03 15:22 95
2016-02-20 18:03 23
2016-03-10 16:03 200
2016-03-15 03:02 100
2016-03-18 02:02 200
We wish to plot this data along with the average value for each month. We can run the above script, and we will get a datablock $k (make the commented change near the bottom to use a temporary file instead) containing the following
2016-02-15 00:00 59.000000
2016-03-15 00:00 166.666667
These are exactly the average values for each month. Now we can plot with
set xdata time
set timefmt data_time_format
set key outside top right
plot $k u 1:3 w points pt 7 t "Monthly Average",\
datafile u 1:3 with lines t "Original Data"
Here, just for illustration, I used points with the averages. Feel free to use any style that you want. If you choose to use steps, you will very likely want to adjust the day that is assigned† in the datablock/temporary file (probably the first or last day in the month depending on how you want to do it).
It is usually easier with a task like this to do some outside preprocessing, but this demonstrates that it is possible in pure gnuplot.
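As one possible form of such outside preprocessing, here is a minimal Python sketch that computes the monthly means for the sample data shown earlier and prints them in the same layout as the datablock. It assumes the tab-separated format described in the question; the function name monthly_means and the choice of the 15th as the plotting day are illustrative only.

```python
from collections import defaultdict
from datetime import datetime

def monthly_means(lines, fmt="%Y-%m-%d %H:%M"):
    """Average the value column per (year, month) of a 'datetime<TAB>value' file."""
    buckets = defaultdict(list)
    for line in lines:
        stamp, value = line.rstrip("\n").split("\t")
        t = datetime.strptime(stamp, fmt)
        buckets[(t.year, t.month)].append(float(value))
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}

# The sample data from the answer, written as tab-separated lines
lines = [
    "2016-02-03 15:22\t95",
    "2016-02-20 18:03\t23",
    "2016-03-10 16:03\t200",
    "2016-03-15 03:02\t100",
    "2016-03-18 02:02\t200",
]
means = monthly_means(lines)
for (year, month), mean in means.items():
    # One line per month, stamped on the 15th like the gnuplot script
    print(f"{year:04d}-{month:02d}-15 00:00\t{mean:f}")
```

Because the key includes the year, this approach also handles files that span more than one year without further changes.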
† Regarding changing the day that is assigned, using any specific day in the month is easy, as long as it is a day that occurs in every month (dates from the 1st to the 28th) - just change baseday. For other values modifications to the get_date_string function need to be made.
For example, to use the last day, the function can be defined as
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-01",z,x+1))-24*60*60)
This version actually computes the first day of the next month, and then subtracts one whole day from that. The second argument is ignored in this version, but preserved to allow it to be used without having to make any additional changes to the script.
With a recent version of gnuplot, you have the stats command and you can do something like this:
stats "datafile" using 1:3 name "m0"
month_sec=3600*24*30.5
do for [month=1:12] {
stats [now_secs-month*month_sec:now_secs-(month-1)*month_sec] "datafile" using 1:3 name sprintf("m%d",month)
}
Because two columns are analysed, the mean of the value column is stored with a _y suffix: you get m0_mean_y for the total mean, and the variables m1_mean_y, m2_mean_y, etc. for the previous months, all defined in gnuplot.
Finally, to plot them you should do something like:
plot 'datafile', for [month=0:12] value(sprintf("m%d_mean_y",month))
See help stats, help for, help value, and help sprintf for more information on the above commands.

Resources