Calculate monthly average and annual sum of precipitation from a netCDF file with CDO

I am learning netCDF and CDO using CRU TS 4.04 data. I want to calculate the monthly average and annual sum of precipitation for London. I wrote:
#!/usr/bin/bash
lon=-0.11
lat=51.49
PREFNAME="/myHD/CRU4.04/pre/cru_ts4.04.1901.2019.pre.dat.nc"
OUTFNAME="outfile-"
echo "1970-2000 monthly average and annual sum of precipitations in London"
cdo remapnn,lon=$lon/lat=$lat $PREFNAME $OUTFNAME"place.nc"
cdo selvar,pre $OUTFNAME"place.nc" $OUTFNAME"var.nc"
cdo selyear,1970/2000 $OUTFNAME"var.nc" $OUTFNAME"years.nc"
cdo ymonmean $OUTFNAME"years.nc" $OUTFNAME"yearsmean.nc"
cdo timcumsum $OUTFNAME"yearsmean.nc" $OUTFNAME"yearsum.nc"
cdo info $OUTFNAME"yearsum.nc"
cdo info $OUTFNAME"yearsmean.nc"
exit
I get:
MyPC:~/workbench01$ ./gotit2.sh
1970-2000 monthly average and annual sum of precipitations in London
cdo remapnn: Nearest neighbor weights from lonlat (720x360) to lonlat (1x1) grid, with source mask (67420)
cdo remapnn: Processed 2 variables over 1428 timesteps [2.64s 60MB].
cdo selname: Processed 2 variables over 1428 timesteps [0.01s 52MB].
cdo selyear: Processed 1 variable over 1428 timesteps [0.00s 51MB].
cdo ymonmean: Processed 1 variable over 372 timesteps [0.00s 51MB].
cdo timcumsum: Processed 1 variable over 12 timesteps [0.00s 51MB].
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter ID
1 : 2000-01-16 00:00:00 0 1 0 : 81.229 : -1
2 : 2000-02-15 00:00:00 0 1 0 : 132.26 : -1
3 : 2000-03-16 00:00:00 0 1 0 : 189.70 : -1
4 : 2000-04-16 00:00:00 0 1 0 : 244.82 : -1
5 : 2000-05-16 00:00:00 0 1 0 : 298.52 : -1
6 : 2000-06-16 00:00:00 0 1 0 : 356.92 : -1
7 : 2000-07-16 00:00:00 0 1 0 : 402.39 : -1
8 : 2000-08-16 00:00:00 0 1 0 : 456.03 : -1
9 : 2000-09-16 00:00:00 0 1 0 : 527.19 : -1
10 : 2000-10-16 00:00:00 0 1 0 : 605.04 : -1
11 : 2000-11-16 00:00:00 0 1 0 : 682.58 : -1
12 : 2000-12-16 00:00:00 0 1 0 : 762.59 : -1
cdo info: Processed 1 variable over 12 timesteps [0.00s 50MB].
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter ID
1 : 2000-01-16 00:00:00 0 1 0 : 81.229 : -1
2 : 2000-02-15 00:00:00 0 1 0 : 51.032 : -1
3 : 2000-03-16 00:00:00 0 1 0 : 57.439 : -1
4 : 2000-04-16 00:00:00 0 1 0 : 55.116 : -1
5 : 2000-05-16 00:00:00 0 1 0 : 53.700 : -1
6 : 2000-06-16 00:00:00 0 1 0 : 58.400 : -1
7 : 2000-07-16 00:00:00 0 1 0 : 45.471 : -1
8 : 2000-08-16 00:00:00 0 1 0 : 53.642 : -1
9 : 2000-09-16 00:00:00 0 1 0 : 71.161 : -1
10 : 2000-10-16 00:00:00 0 1 0 : 77.845 : -1
11 : 2000-11-16 00:00:00 0 1 0 : 77.545 : -1
12 : 2000-12-16 00:00:00 0 1 0 : 80.006 : -1
cdo info: Processed 1 variable over 12 timesteps [0.00s 51MB].
This looks good, but the results don't match those shown on https://climatecharts.net/ for London over 1970-2000 with CRU TS 4.04.
My question is: am I actually calculating the monthly average and the yearly sum of precipitation?
Thank you for any help.

cdo function:
ymonmean
calculates the average of each calendar month, i.e. the average of all the Januarys, the average of all the Februarys, etc. The resulting file will have 12 timesteps.
timcumsum
then produces the cumulative sum over these 12 steps: step 1 is still your January average, step 2 holds the sum of the January and February averages, and so on. The resulting file still has 12 timesteps, and the value you need is the last step.
However, if you simply want to know the mean annual rainfall, you can calculate it directly with
cdo yearsum in.nc out.nc # calculate total for each year
cdo timmean out.nc year_average.nc # average over the totals
or piped using one line:
cdo timmean -yearsum in.nc year_average.nc
A word of warning with the above: make sure your series contains only full calendar years. If the first year starts in e.g. July, then that year's sum would obviously only include 6 months' worth of rainfall, which would skew your statistics; likewise for the final months of the final year.
Last thing: I see on the data page of climatecharts that it uses station observations directly and not the gridded CRU data, so you cannot expect the results to match exactly in any case.
Edit 2021: I have now made video guides on these topics:
Calculating temporal statistics
Calculating diurnal and seasonal cycles

Related

Apply a function with if inside to a dataframe to take a value in a list in R

Hello everybody, and thank you in advance for any help.
I imported a txt file named "project" into R. The data frame, called "data", consists of 12 columns with information on 999 households.
head(data)
im iw r am af a1c a2c a3c a4c a5c a6c a7c
1 0.00 20064.970 5984.282 0 38 0 0 0 0 0 0 0
2 15395.61 7397.191 0.000 42 30 1 0 0 0 0 0 0
3 16536.74 18380.770 0.000 33 28 1 0 0 0 0 0 0
4 20251.87 14042.250 0.000 38 38 1 1 0 0 0 0 0
5 17967.04 12693.240 0.000 24 39 1 0 0 0 0 0 0
6 12686.43 21170.450 0.000 62 42 0 0 0 0 0 0 0
im=male income
iw=female income
r=rent
am=male age
af=female age
a1c, a2c, ..., a7c take the value 1 when there is a child under age 18
in the household, and 0 when there is not.
Now I have to calculate the taxed income separately for the male and the female in each household, based on some criteria. So I am trying to create one function that calculates two numbers, then apply this function to my data frame and get back a list with these numbers.
Specifically, I want something like this:
fact<-function(im,iw,r,am,af,a1c,a2c,a3c,a4c,a5c,a6c,a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
How can I fix this function so I can apply it to my data frame?
Can a function return 2 values?
How can I apply the function?
Then I tried this:
fact<-function(im=data$im,iw=data$iw,r=data$r,am=data$am,af=data$af,a1c=data$a1c,a2c=data$a2c,a3c=data$a3c,a4c=data$a4c,a5c=data$a5c,a6c=data$a6c,a7c=data$a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
fact(data[1,])
but I get this error: Error in fact(data[1, ]) : object 'mti' not found
When I tried the function for "fti" only, it ran, but gave wrong results.
Besides the need to return multiple values using c(mti, fti), your function doesn't have a default value if none of the conditions in the function are TRUE. So mti is never created.
Add mti <- NA at the start of your function, so NA is the default value.
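Putting both fixes together, a sketch of the corrected function might look like the following. Note two assumptions: the duplicated a5c term (it appears twice in every branch of the original) is treated as a typo, and the last condition in each pair is assumed to have been meant as >= 67 rather than <= 67; the helper name kids is made up for readability.

```r
fact <- function(im, iw, r, am, af, a1c, a2c, a3c, a4c, a5c, a6c, a7c) {
  mti <- NA  # defaults, so both objects exist even when no condition matches
  fti <- NA
  kids <- 500 * (a1c + a2c + a3c + a4c + a5c + a6c + a7c)  # child deduction
  if (am > 0 && am < 67 && af > 0)  mti <- im - r / 2 - kids
  if (am > 0 && am < 67 && af == 0) mti <- im - r - kids
  if (am >= 67 && af > 0)           mti <- im - 1000 - r / 2 - kids
  if (am >= 67 && af == 0)          mti <- im - 1000 - r - kids
  if (af > 0 && af < 67 && am > 0)  fti <- iw - r / 2 - kids
  if (af > 0 && af < 67 && am == 0) fti <- iw - r - kids
  if (af >= 67 && am > 0)           fti <- iw - 1000 - r / 2 - kids
  if (af >= 67 && am == 0)          fti <- iw - 1000 - r - kids
  c(mti = mti, fti = fti)  # a function returns ONE object; combine with c()
}

# Apply row by row; t() turns the 2 x n result into an n x 2 matrix:
# res <- t(apply(data, 1, function(row) do.call(fact, as.list(row))))
```

Passing data[1, ] as a single argument, as in the question, fills only the first parameter; do.call with as.list spreads the row across all twelve parameters instead.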

R-tree() function not using any variables

My code:
library(tree)
tree_model <- tree(readmitted_bin ~ num_medications_01, data = tree_trainData)
plot(tree_model)
where readmitted_bin is a factor of 0 and 1 and num_medications_01 is scaled between 0 and 1. I have included other variables that also return the following error:
Error in plot.tree(tree_model) : cannot plot singlenode tree
3. stop("cannot plot singlenode tree")
2. plot.tree(tree_model)
1. plot(tree_model)
The dataframe:
  readmitted_bin num_medications_01
1              0             0.3375
2              0             0.2125
3              0             0.0875
4              1             0.2000
5              1             0.1875
6              1             0.1250
Any help is appreciated on how tree() works.
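For what it's worth, this error means tree() never made a split, so the fitted tree is just its root node. That happens when there are too few observations for tree.control's defaults (minsize = 10, mincut = 5) or when no split reduces the deviance by the mindev threshold. A hedged sketch with hypothetical data mirroring the six rows shown above:

```r
library(tree)

# Six observations is below tree()'s default minsize of 10, so no split
# can be made and the result is a single-node (root-only) tree
d <- data.frame(
  readmitted_bin     = factor(c(0, 0, 0, 1, 1, 1)),
  num_medications_01 = c(0.3375, 0.2125, 0.0875, 0.2000, 0.1875, 0.1250)
)
m <- tree(readmitted_bin ~ num_medications_01, data = d)
nrow(m$frame)  # one row in the tree frame: root only, so plot(m) errors
```

Checking nrow(tree_model$frame) before plotting, or relaxing tree.control, would show whether the real data has the same problem.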

convert factor to date in R to create dummy variable

I need to create a dummy variable for "before and after 04/11/2020" based on the variable "date" in the dataset "counties". There are over a hundred dates in the dataset. I am trying to convert the dates from factor to Date with the as.Date function, but I get NA. Could you please help me find where I am making an error? I kept the other dummy variable I created, in case it affects the overall outcome.
counties <- read.csv('C:/Users/matpo/Desktop/us-counties.csv')
str(counties)
as.Date(counties$date, format = '%m/%d/%y')
#create dummy variables forNew York, New Jersey, California, and Illinois
counties$state = ifelse(counties$state == 'New Jersey' &
counties$state == 'New York'& counties$state == 'California' &
counties$state == 'Illinois', 1, 0)
counties$date = ifelse(counties$date >= "4/11/2020", 1, 0)
str output:
$ date : logi NA NA NA NA NA NA ...
$ county: Factor w/ 1774 levels "Abbeville","Acadia",..: 1468 1468 1468 379 1468 1178 379 1468 979 942 ...
$ state : num 0 0 0 0 0 0 0 0 0 0 ...
$ fips : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
$ cases : int 1 1 1 1 1 1 1 1 1 1 ...
$ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
Thank you!
You have an incorrect format in as.Date; you should use "%Y" for a 4-digit year.
You also need to assign the result back (<-) for the values to change.
"4/11/2020" is just a string; if you are comparing dates you need to convert it to a Date object first. You can also avoid ifelse here.
Try :
counties$date <- as.Date(counties$date, format = '%m/%d/%Y')
counties$dummy <- as.integer(counties$date >= as.Date('2020-04-11'))
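Putting the two lines together on a small hypothetical data frame (standing in for us-counties.csv):

```r
# Factor column of dates in m/d/Y form, as produced by read.csv by default
counties <- data.frame(date = factor(c("4/10/2020", "4/11/2020", "4/12/2020")))

# as.Date handles factors by converting through as.character first;
# %Y (capital) matches the 4-digit year
counties$date  <- as.Date(counties$date, format = "%m/%d/%Y")

# Comparing Date objects gives a logical; as.integer turns it into 0/1
counties$dummy <- as.integer(counties$date >= as.Date("2020-04-11"))
counties$dummy
# 0 1 1
```

The comparison in the question failed silently because ">=" on the strings "4/9/2020" and "4/11/2020" compares character by character, not chronologically.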

Using lag function gives an atomic vector with all zeroes

I have been trying to use the "lag" function in base R to calculate rainfall accumulations over a 6-hour period. I have hourly rainfall; I calculate cumulative rainfall using the cumsum function, then use the lag function to calculate 6-hour accumulations, as below.
Event_Data<-dbGetQuery(con, "select feature_id, TO_CHAR(datetime, 'MM/DD/YYYY HH24:MI') as DATE_TIME, value_ms as RAINFALL_IN from Rain_HOURLY")
Event_Data$cume<-cumsum(Event_Data$RAINFALL_IN)
Event_Data$six_hr<-Event_Data$cume-lag(Event_Data$cume, 6)
But the lag function gives me all zeroes, and the structure of the data frame looks like this:
'data.frame': 169 obs. of 5 variables:
$ feature_id : num 80 80 80 80 80 ...
$ DATE_TIME : chr "09/10/2017 00:00" "09/10/2017 01:00" "09/10/2017 02:00" "09/10/2017 03:00" ...
$ RAINFALL_IN: num 0.251 0.09 0.017 0.071 0.016 0.01 0.136 0.651 0.185 0.072 ...
$ cume : num 0.251 0.341 0.358 0.429 0.445 ...
$ six_hr : atomic 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "tsp")= num -23 145 1
This code has worked fine in several of my other projects, but I have no clue why I am getting zeroes here. Any help is greatly appreciated.
Thanks.
There might be a conflict with a lag function from another package; that would explain why this code worked in other scripts but not in this one.
Try stats::lag instead of just lag to make explicit which package you want to use (or dplyr::lag, which seems to work better for me, at least).
I think you have a misconception about what lag() from the stats package does. It returns zeros because you take the full cumulative-rainfall vector and then subtract it from itself again. Check this small example for an illustration:
x <- 1:20
y <- lag(x,3) ;y
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#attr(,"tsp")
#[1] -2 17 1
x-y #x is a vector
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#attr(,"tsp")
#[1] -2 17 1
As you can see, lag() simply keeps the vector values and just adds a time-series attribute with the values "starting time, ending time, frequency". Because you passed in a plain vector, it used the default values 1, length(Event_Data$cume), 1 and subtracted the lag from the starting and ending times, which is 3 in the example and apparently 24 in your code output (which doesn't match the code input above it, by the way).
The problem is that your vector doesn't have any time attribute assigned to it, so R doesn't know which the corresponding values of your data and lagged data are. Thus, it simply subtracts the vector values and adds the time attribute of the lagged variable. To fix this, you just need to assign times to Event_Data$cume, by converting it to a time-series object, i.e. try Event_Data$six_hr<-as.numeric(ts(Event_Data$cume) - lag(ts(Event_Data$cume), 6))
It works fine for the small example above:
x <- ts(1:20)
y <- lag(x,3)
x-y #x is a ts
#Time Series:
#Start = 1
#End = 17
#Frequency = 1
# [1] -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3
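Alternatively, dplyr::lag (suggested above) sidesteps the ts machinery entirely: it shifts the values within the vector, padding the start with NA, so the subtraction lines up element by element. A minimal sketch with made-up hourly values echoing the data shown in the question:

```r
library(dplyr)

rain <- c(0.251, 0.090, 0.017, 0.071, 0.016, 0.010, 0.136, 0.651)
cume <- cumsum(rain)

# dplyr::lag(x, 6) returns the value 6 positions earlier (NA where undefined),
# so cume - dplyr::lag(cume, 6) is the rainfall accumulated over the last 6 hours
six_hr <- cume - dplyr::lag(cume, 6)
round(six_hr, 3)
# NA NA NA NA NA NA 0.340 0.901
```

The first six entries are NA rather than 0, which is arguably the more honest answer for hours where a full 6-hour window does not yet exist.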

How do I automatically specify the correct regression model when the number of variables differs between input data sets?

I have a working R program that will be used by my internal client for analysing their nutrient intake data. For each dataset that they have, they will re-run the R program.
A key part of the analysis is a nonlinear mixed-model fit, using nlmer from the lme4 package, that incorporates dummy variables for age. Depending on whether they are analysing children or adults, the number of age-band dummies in the formula will differ, although the reference age band will always be the youngest. I think the number of possible age bands ranges from 4 to about 6, so it's not a large range. It is a trivial matter to count the number of age-band dummies, if I need to condition on that.
What is the most efficient way for me to wrap the model-based code (the lmer that provides the starting parameter values, the function for the nlmer model, and the model specification in nlmer itself) so that the correct function and models are applied based on the number of age band dummies in the model? The other variables in the model are constant across datasets.
I've already got the program set up to automatically generate the relevant dummies and drop those that aren't used in the current analysis. The part of the program after the model is pretty well automated too. I'm just stuck on automating the two lme4-based analyses and the function. These will only be run once per dataset.
I've been wondering whether I need to write a function to contain all the lme4 related code, or whether there was an easier way. I would appreciate some pointers on how to do this. It took me one day to work out how to get the function working that I needed for the nlmer model, so I am still at a beginner level with functions.
I've searched for other R related automation questions on the site and I didn't find anything similar to what I would like to do.
Thanks in advance.
Update in response to suggestion in the comments about using a string. That sounds like an easy way forward for me, except that I don't then know how to apply the string content in a function as each dummy variable level (excluding the reference category) is used in the function for nlmer. How can I pull apart the string and use only the dummy variables that I have in a function? For example, one analysis might have AgeBand2, AgeBand3, AgeBand4, and another analysis might have AgeBand5 as well as those 3? If this was VBA, I would just create subfunctions based on the number of age dummy variables. I have no idea how to do this efficiently in R.
Can I just wrap a while loop around the lmer, function, and nlmer parts, so I have a series of while loops?
This is the section of code I wish to automate; the number of AgeBand dummy variables differs depending on the dataset being analysed (children vs. adults). This uses the dataset on which I have been testing a SAS-to-R translation, but the real datasets will be very similar. It is necessary to have a nonlinear model, as this is the basis of the peer-reviewed published method I am working from.
library(lme4)
Male.lmer <- lmer(BoxCoxXY ~ AgeBand4 + AgeBand5 + AgeBand6 + AgeBand7 +
AgeBand8 + Race1 + Race3 + Weekend + IntakeDay + (1|RespondentID),
data=Male.AddSugar,
weights=Replicates)
Male.lmer.fixef <- fixef(Male.lmer)
Male.lmer.fixef <- as.data.frame(Male.lmer.fixef)
bA <- Male.lmer.fixef[1,1]
bB <- Male.lmer.fixef[2,1]
bC <- Male.lmer.fixef[3,1]
bD <- Male.lmer.fixef[4,1]
bE <- Male.lmer.fixef[5,1]
bF <- Male.lmer.fixef[6,1]
bG <- Male.lmer.fixef[7,1]
bH <- Male.lmer.fixef[8,1]
bI <- Male.lmer.fixef[9,1]
bJ <- Male.lmer.fixef[10,1]
MD <- deriv(expression(b0 + b1*AgeBand4 + b2*AgeBand5 + b3*AgeBand6 +
b4*AgeBand7 + b5*AgeBand8 + b6*Race1 + b7*Race3 + b8*Weekend + b9*IntakeDay),
namevec=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9"),
function.arg=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9",
"AgeBand4","AgeBand5","AgeBand6","AgeBand7","AgeBand8",
"Race1","Race3","Weekend","IntakeDay"))
Male.nlmer <- nlmer(BoxCoxXY ~ MD(b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,AgeBand4,AgeBand5,AgeBand6,AgeBand7,AgeBand8,
Race1,Race3,Weekend,IntakeDay)
~ b0|RespondentID,
data=Male.AddSugar,
start=c(b0=bA, b1=bB, b2=bC, b3=bD, b4=bE, b5=bF, b6=bG, b7=bH, b8=bI, b9=bJ),
weights=Replicates
)
These will be the required changes between the datasets:
the number of fixed-effect coefficients that I need to extract from the lmer will change;
in the function, the expression, namevec, and function.arg parts will change;
in nlmer, the model statement and the start parameter list will change.
I can change the lmer model statement so it takes AgeBand as a factor with levels, but I still need to pull out the values of the coefficients afterwards.
str(Male.AddSugar) gives:
'data.frame': 10287 obs. of 23 variables:
$ RespondentID: int 9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
$ RACE : int 2 3 2 2 3 2 2 2 2 1 ...
$ RNDW : int 26290 7237 10067 75391 1133 31298 20718 23908 7905 1091 ...
$ Replicates : num 41067 2322 17434 21723 375 ...
$ DRXTNUMF : int 27 11 13 18 17 13 13 19 11 11 ...
$ DRDDAYCD : int 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 33.45 2.53 9.58 43.34 55.66 ...
$ RIAGENDR : int 1 1 1 1 1 1 1 1 1 1 ...
$ RIDAGEYR : int 39 23 16 44 13 36 16 60 13 16 ...
$ Subgroup : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
$ WKEND : int 1 1 1 0 1 0 0 1 1 1 ...
$ AmtInd : num 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeDay : num 0 0 0 0 0 0 0 0 0 0 ...
$ Weekend : int 1 1 1 0 1 0 0 1 1 1 ...
$ Race1 : num 0 0 0 0 0 0 0 0 0 1 ...
$ Race3 : num 0 1 0 0 1 0 0 0 0 0 ...
$ AgeBand4 : num 0 0 1 0 0 0 1 0 0 1 ...
$ AgeBand5 : num 0 1 0 0 0 0 0 0 0 0 ...
$ AgeBand6 : num 1 0 0 1 0 1 0 0 0 0 ...
$ AgeBand7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ AgeBand8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ YN : num 1 1 1 1 1 1 1 1 1 1 ...
$ BoxCoxXY : num 7.68 1.13 3.67 8.79 9.98 ...
The AgeBand data is incorrectly stored as the ordered factor Subgroup. Because I haven't used it, I haven't gone back and corrected it to a plain factor.
This assumes that you have one variable, "ageband", which is a factor with levels AgeBand2, AgeBand3, AgeBand4, and perhaps others that you want ignored. Since factors are generally treated by R regression functions using the lowest lexicographic value as the reference level, you would get your correct reference level chosen automagically. You pick your desired levels by creating a dataset that has only the desired levels.
agelevs <- c("AgeBand2", "AgeBand3", "AgeBand4")
dsub <- subset(inpdat, ageband %in% agelevs)
res <- nlmer(y ~ ageband + <other-parameters>, data=dsub, ...)
If you have gone to the trouble of creating separate variables, then you need to learn to use factors correctly rather than holding to inefficient habits enforced by training in SPSS or other clunky macro processors.
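Along the same lines, if the separate dummy columns are kept, the lmer formula itself can be built programmatically from whichever AgeBand columns the current dataset contains, so nothing needs to be hard-coded. A hypothetical sketch (the column names mimic those in str(Male.AddSugar) above; the lmer call is shown commented since it needs the real data):

```r
# Names standing in for names(Male.AddSugar) in the current analysis
cols <- c("BoxCoxXY", "AgeBand4", "AgeBand5", "AgeBand6", "Race1", "Race3",
          "Weekend", "IntakeDay", "RespondentID", "Replicates")

# Pick up however many AgeBand dummies this dataset has (4 to 6 bands)
age_dummies <- grep("^AgeBand", cols, value = TRUE)
fixed_terms <- c(age_dummies, "Race1", "Race3", "Weekend", "IntakeDay")

# reformulate() pastes the term labels into a formula object
f <- reformulate(c(fixed_terms, "(1 | RespondentID)"), response = "BoxCoxXY")

# Male.lmer <- lmer(f, data = Male.AddSugar, weights = Replicates)
# b <- fixef(Male.lmer)   # length(b) adapts to the dummies present
```

The same age_dummies vector can then drive the deriv() expression, namevec, and function.arg for nlmer, so all three pieces stay in sync with the dataset.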
