Simple water demand supply model in R

I am trying to work out a simple soil water balance in R. Here's the step I need to do:
For a given year, starting from doy 200, I need to determine the soil water (SW), which is calculated by the following formula:
`SW(i) = SW(i-1) + Rain(i) - ETo(i)`
where SW(i-1) is the water content on the previous day, Rain(i) is the rainfall, and ETo(i) is the evapotranspiration on day i.
The conditions are that SW(i) cannot be negative or exceed SW(max), which is 20.
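As a minimal sketch of a single update step (the helper name sw_step is mine, purely for illustration), the clipped recursion looks like:
# One step of the water balance, clipped to [0, sw_max].
# sw_step() is a hypothetical helper name, not part of the original post.
sw_step <- function(sw_prev, rain, eto, sw_max = 20) {
  min(max(sw_prev + rain - eto, 0), sw_max)
}
sw_step(5, 10, 3)   # 12
sw_step(5, 0, 9)    # 0  (clipped at the lower bound)
sw_step(18, 10, 2)  # 20 (clipped at SW.max)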
Here's some sample data:
library(tidyverse)
set.seed(123)
dat <- tibble(
  year = rep(1980:2015, each = 100),
  day = rep(200:299, times = 36),
  rain = sample(0:17, size = 100*36, replace = TRUE),
  eto = sample(2:9, size = 100*36, replace = TRUE))
SW.initial <- data.frame(year = 1980:2015, SW.199 = sample(1:10, 36, replace = TRUE))
SW.initial holds the water content on doy 199 for each year.
SW.max <- 20
dat$SW.fin <- NA
Taking the example of year 1980:
dat.1980 <- dat[dat$year == 1980,]
SW.initial.1980 <- SW.initial[SW.initial$year== 1980,"SW.199"]
for(doy in dat.1980$day){
  SW <- SW.initial.1980
  SW <- SW + dat.1980[dat.1980$day == doy, "rain"] - dat.1980[dat.1980$day == doy, "eto"]
  SW <- ifelse(SW < 0, 0, ifelse(SW >= SW.max, SW.max, SW))
  dat[dat$year == 1980 & dat$day == doy, "SW.fin"] <- SW
  SW.initial.1980 <- SW
}
This loop gives me the SW for each day from doy 200 through 299 using
`SW(i) = SW(i-1) + Rain(i) - ETo(i)`
where, for doy 200, SW(i-1) comes from the SW.initial data frame.
I can loop through all years:
for(years in unique(dat$year)){
  test <- dat[dat$year == years,]
  SW.in <- SW.initial[SW.initial$year == years, "SW.199"]
  for(doy in test$day){
    SW <- SW.in
    SW <- SW + test[test$day == doy, "rain"] - test[test$day == doy, "eto"]
    SW <- ifelse(SW < 0, 0, ifelse(SW >= SW.max, SW.max, SW))
    dat[dat$year == years & dat$day == doy, "SW.fin"] <- SW
    SW.in <- SW
  }
}
I really want to avoid this long loop and was wondering if there is a cleverer (and faster) way to do this.

Does this give what you want?
Edit: added group_by(year).
dat %>%
  group_by(year) %>%
  mutate(sw_oneless = lag(day),
         sw_oneless + rain - eto)
# A tibble: 3,600 x 6
# Groups: year [36]
year day rain eto sw_oneless `sw_oneless + rain - eto`
<int> <int> <int> <int> <int> <int>
1 1980 200 5 2 NA NA
2 1980 201 14 6 200 208
3 1980 202 7 4 201 204
4 1980 203 15 5 202 212
5 1980 204 16 5 203 214
6 1980 205 0 8 204 196
7 1980 206 9 9 205 205
8 1980 207 16 6 206 216
9 1980 208 9 9 207 207
10 1980 209 8 4 208 212
# ... with 3,590 more rows
To solve the "problem" with day 200, why don't you just take days 199-300 from your original data? You can then run my code and na.omit() or filter again, and the rows with day 199 are gone.
Or, if you can't do that, merge your SW.initial with your dat data frame.
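Since each SW(i) depends on the already-clipped SW(i-1), the recursion cannot be expressed with a plain lag. A vectorised sketch, assuming SW.initial has been joined onto dat first (it reuses the purrr::accumulate idea from the panel-data answer further down):
library(tidyverse)
dat %>%
  left_join(SW.initial, by = "year") %>%
  group_by(year) %>%
  # accumulate() carries the clipped balance forward, seeded with SW.199;
  # [-1] drops the seed so the result lines up with days 200-299
  mutate(SW.fin = accumulate(rain - eto,
                             ~ min(max(.x + .y, 0), SW.max),
                             .init = first(SW.199))[-1]) %>%
  ungroup()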

Related

Which is first day above and below threshold?

I have a dataframe that represents a two-year daily time series of temperature for two rivers. For each river and each year, I would like to know on what day of year:
temperature is greater than or equal to 15 degrees
temperature is sustained greater than or equal to 15 degrees (sustained is when there are no more dips below 15 until the autumn)
temperature is less than or equal to 15 degrees
temperature is sustained less than or equal to 15 degrees (sustained is when there are no more peaks above 15 until the following spring)
Example timeseries
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- c("01428500","01432805") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "2018-01-01"
endDate <- "2019-10-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# To view at the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
geom_point() +
theme_bw() +
facet_wrap(~site_no)
# Adds a new column for year and day of year (doy; Jan 1 = 1, Dec 31 = 365)
dat <- dat %>%
mutate(year = year(Date),
doy = yday(Date))
I have tried using the dplyr filter() function but have had little success
dat %>%
group_by(site_no,year) %>%
filter(temperature >= 15 & temperature <= 15)
The ideal output would look something like this:
site_no year doy_firstabove15 doy_sustainedAbove15 doy_firstbelow15 doy_sustainedBelow15
1 01428500 2018 136 144 253 286
2 01428500 2019 140 146 279 289
3 01432805 2018 143 143 272 276
4 01432805 2019 140 140 278 278
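For the two simple columns (the first crossing, ignoring "sustained"), a minimal sketch of the idea, assuming dat already carries the year and doy columns added above and that temperatures only reach 15 during the warm season:
library(dplyr)
dat %>%
  group_by(site_no, year) %>%
  # earliest day of year on which the threshold is reached
  summarise(doy_firstabove15 = min(doy[temperature >= 15]), .groups = "drop")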
I like this question. It's specific, tricky, and has a good example. I would go about this by creating a rolling variable that checks for changes in temperature above and below 15. Then you can group by site and year and check for the first instance of the change in group id. Note that I make several assumptions here that might be bad: 1) I group by calendar year, but it would likely be better to group by water year. You don't explicitly say that the year is important, but I suspect it is. 2) I also assume that autumn is defined by the first day of month 9; maybe you want the sustained period to run through the fall instead of starting on its first day. The same assumption applies to spring.
library(tidyverse)
library(dataRetrieval)
dat |>
  mutate(month = lubridate::month(Date),
         year = lubridate::year(Date)) |>
  group_by(site_no, year) |>
  mutate(gt_15 = temperature > 15,
         lt_15 = temperature < 15,
         chng1 = cumsum(gt_15 != lag(gt_15, default = first(gt_15))),
         chng2 = cumsum(lt_15 != lag(lt_15, default = first(lt_15)))) |>
  summarise(first_above_15 = Date[chng1 == 1][1],
            indx_autum = chng1[month == 9][1],
            first_above_15_sustained = Date[chng1 == indx_autum][1],
            first_below_15 = Date[chng2 == 1][1],
            indx_spring = chng1[month == 3][1],
            first_below_15_sustained = Date[chng2 == indx_spring][1],
            .groups = "drop") |>
  select(-contains("indx"))
#> # A tibble: 4 x 6
#> site_no year first_above_15 first_above_15_sustained first_belo~1 first_be~2
#> <chr> <dbl> <date> <date> <date> <date>
#> 1 01428500 2018 2018-05-18 2018-05-24 2018-05-16 2018-01-01
#> 2 01428500 2019 2019-05-20 2019-05-26 2019-05-20 2019-01-01
#> 3 01432805 2018 2018-05-23 2018-05-23 2018-05-23 2018-01-01
#> 4 01432805 2019 2019-05-20 2019-05-20 2019-05-20 2019-02-22
#> # ... with abbreviated variable names 1: first_below_15,
#> # 2: first_below_15_sustained
This works by splitting the annual time series into the periods before and after the peak temperature. It then returns the day of year (doy) on which temperature is first above 15 before the peak and first below 15 after the peak.
Similarly, sustain_above and sustain_below are determined by first creating a run-length ID column (run), then, within the subset of run where below_peak or after_peak is TRUE, extracting the doy values where run is at its maximum and taking the first element of doy.
dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>%
  group_by(site_no, year) %>%
  mutate(gt_15 = temperature >= 15,
         lt_15 = temperature <= 15,
         peak_doy = doy[which.max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         run = data.table::rleid(lt_15)) %>%
  summarise(first_above = doy[below_peak & gt_15][1],
            sustain_above = first(doy[run == max(run[below_peak])]),
            first_below = doy[after_peak & lt_15][1],
            sustain_below = first(doy[run == max(run[after_peak])]), .groups = 'drop')
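As a quick illustration of how the run-length ID behaves (a toy example, not from the answer), each unbroken run of identical values gets its own id:
data.table::rleid(c(TRUE, TRUE, FALSE, TRUE))
# [1] 1 1 2 3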

Identify first unique value when multiple conditions are met using dplyr in R

I have a dataframe representing a two-year daily time series of temperature for two rivers. I have identified when the temperature is either above or below the peak temperature. I have also created a run-length ID column for when temperature is either above or below a threshold temperature of 10 degrees.
How can I get the first day of year, for each site and year, under the following conditions:
maximum run-length & below peak = TRUE
maximum run-length & above peak = TRUE
Example Data:
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- c("01432805","01388000") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "1996-01-01"
endDate <- "1997-12-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# To view at the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
geom_point() +
theme_bw() +
facet_wrap(~site_no)
To create the columns described above:
dat <- dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>% # doy = day of year
  group_by(site_no, year) %>%
  mutate(lt_10 = temperature <= 10,
         peak_doy = doy[which.max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         run = data.table::rleid(lt_10))
View(dat)
The ideal output would look as follows:
site_no year doy_below doy_after
1 01388000 1996 111 317
2 01388000 1997 112 312
3 01432805 1996 137 315
4 01432805 1997 130 294
doy_after = the first row for which after_peak == TRUE & run == max(run), within group_by(site_no, year)
doy_below = the first row for which below_peak == TRUE & run == max(run), within group_by(site_no, year)
For site_no = 01388000 in year = 1996, max(run) when below_peak == TRUE is 4. The first row where run == 4 and below_peak == TRUE corresponds to date 1996-04-20, which has doy = 111.
As the data is already grouped, just summarise: for the subset of run where below_peak or after_peak is TRUE, extract the doy values where run is at its maximum, and take the first element of doy.
library(dplyr)
dat %>%
  summarise(doy_below = first(doy[run == max(run[below_peak])]),
            doy_above = first(doy[run == max(run[after_peak])]), .groups = 'drop')
Output:
# A tibble: 4 × 4
site_no year doy_below doy_above
<chr> <dbl> <dbl> <dbl>
1 01388000 1996 111 317
2 01388000 1997 112 312
3 01432805 1996 137 315
4 01432805 1997 130 294

Create a table out of a tibble

I have the following dataframe with 45 million observations:
year month variable
1992 1 0
1992 1 1
1992 1 1
1992 2 0
1992 2 1
1992 2 0
My goal is to count the frequency of the variable for each month of a year.
I was already able to generate these sums, with cps_data as my dataframe and SKILL_1 as my variable:
cps_data %>%
  group_by(YEAR, MONTH) %>%
  summarise_at(vars(SKILL_1),
               list(name = sum))
Logically, I obtained 349 rows as a tibble. Now I struggle to create a new table with these values. My new table should look similar to my tibble. How can I do that? Is there even a way? I've already tried to read in an Excel file with a date range from 01/1992 - 01/2021 in order to obtain exactly 349 rows and then merge it with the rows of the tibble, but it did not work.
# A tibble: 349 x 3
# Groups: YEAR [30]
YEAR MONTH name
<dbl> <int+lbl> <dbl>
1 1992 1 [January] 499
2 1992 2 [February] 482
3 1992 3 [March] 485
4 1992 4 [April] 457
5 1992 5 [May] 434
6 1992 6 [June] 470
7 1992 7 [July] 450
8 1992 8 [August] 438
9 1992 9 [September] 442
10 1992 10 [October] 427
# ... with 339 more rows
many thanks in advance!!
library(zoo)
# Build one row per month between start_date and end_date,
# carrying the month name and year for joining later.
createmonthyear <- function(start_date, end_date){
  ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
  data.frame(start = pmax(start_date, as.Date(ym)),
             end = pmin(end_date, as.Date(ym, frac = 1)),
             month = month.name[cycle(ym)],
             year = as.integer(ym),
             stringsAsFactors = FALSE)
}
Once you create the function, you can specify the start and end date you want:
left_table <- createmonthyear(as.Date("1991-01-01"), as.Date("2021-01-01"))
then left join the output with what you have
library(dplyr)
right_table <- data.frame(cps_data %>%
                            group_by(YEAR, MONTH) %>%
                            summarise_at(vars(SKILL_1),
                                         list(name = sum)))
# convert the numeric month to the month-name column produced by createmonthyear()
right_table$month <- month.name[as.integer(right_table$MONTH)]
results <- left_join(left_table, right_table,
                     by = c("year" = "YEAR", "month" = "month"))
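Alternatively (a sketch, assuming YEAR and MONTH behave as plain integers once the labels are stripped), tidyr::complete() can pad the missing year-month combinations directly, without building a lookup table:
library(dplyr)
library(tidyr)
cps_data %>%
  group_by(YEAR, MONTH) %>%
  summarise(name = sum(SKILL_1), .groups = "drop") %>%
  # fill in every YEAR x MONTH combination ...
  complete(YEAR = 1992:2021, MONTH = 1:12) %>%
  # ... then trim to the 01/1992 - 01/2021 window (349 rows)
  filter(YEAR < 2021 | MONTH == 1)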

R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
each firm_id should be handled separately: group_by(firm_id)
where fyear is at minimum, calculate xrd_stock as: xrd/(growth + depr)
otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I have already succeeded with steps 1 and 2 and parts of step 3.
df2 <- df %>%
  ungroup() %>%
  group_by(firm_id) %>%
  arrange(firm_id, fyear) %>% # ensure rows are in ascending fyear order; not required in this specific example as df is already in the correct order
  mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock)))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just created xrd_stock in the previous row (logical when considering/assuming that R is not strictly working from top to bottom); however, when simply putting a 9 in the else part, my above code runs without any errors.
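A minimal illustration of the underlying issue (a toy example, not from the original question): mutate() evaluates the whole expression as one vectorised call, so the column being defined does not yet exist when lag() looks for it:
library(dplyr)
tibble(x = 1:3) %>% mutate(y = lag(y))
# fails with "object 'y' not found", just like the error above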
Can anyone help me with this problem so that the results eventually look as shown below? I am more than happy to answer additional questions if required. Thank you very much in advance to everyone who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …
Arrange the data by fyear so the minimum year is always the first row; you can then use purrr::accumulate to carry the stock forward.
library(dplyr)
df %>%
  arrange(firm_id, fyear) %>%
  group_by(firm_id) %>%
  mutate(xrd_stock = purrr::accumulate(xrd[-1], ~ .y + (1-depr) * .x,
                                       .init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.
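A quick hand-check of the first two Cosco rows against the Excel targets (my own verification, not part of the original answer):
49 / (0.08 + 0.15)                      # 213.04 -> matches xrd_stock in row 1
93 + (1 - 0.15) * (49 / (0.08 + 0.15))  # 274.09 -> matches row 2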

Summing a dataframe based on another dataframe

I have daily data of rainfall from 10 locations across 10 years
set.seed(123)
df <- data.frame(loc.id = rep(1:10, each = 10*365),years = rep(rep(2001:2010,each = 365),times = 10),
day = rep(rep(1:365,times = 10),times = 10), rain = runif(min = 0 , max = 35, 10*10*365))
I have a separate data frame with certain days that I want to use to sum the rainfall in df
df.ref <- data.frame(loc.id = rep(1:10, each = 10),
years = rep(2001:2010,times = 10),
index1 = rep(250,times = 10*10),
index2 = sample(260:270, size = 10*10,replace = T),
index3 = sample(280:290, size = 10*10,replace = T),
index4 = sample(291:300, size= 10*10,replace = T))
df.ref
loc.id years index1 index2 index3 index4
1: 1 2001 250 264 280 296
2: 1 2002 250 269 284 298
3: 1 2003 250 268 289 293
4: 1 2004 250 266 281 295
5: 1 2005 250 260 289 293
What I want to do is, for each row in df.ref, use the index values to
sum the rainfall in df between index1 and index2, index1 and index3, and index1 and index4. For example:
Using df.ref, for loc.id == 1 and year == 2001, sum the rainfall in df from day 250 to 264, 250 to 280, and 250 to 296 (as shown in df.ref).
Similarly, for year 2002 and loc.id == 1, sum the rainfall from 250 to 269, 250 to 284, and 250 to 298.
I did this:
library(dplyr)
ptm <- proc.time()
dat <- df.ref %>% left_join(df)
index1.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index2) %>% summarise(sum.rain1 = sum(rain))
index2.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))
index3.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))
all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal)
proc.time() - ptm
user system elapsed
2.36 0.64 3.06
I am looking to make my code faster since my actual df.ref is quite large. Could anyone advise me on how to make this quicker?
A non-equi join from the data.table package can be both faster and more memory-efficient than dplyr::left_join.
For each row in df.ref, find all the rain values in df whose day lies between index1 and index2, then calculate the sum of rain by loc.id and years.
library(data.table)
setDT(df); setDT(df.ref) # the sample data are data.frames; convert them first
df1 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index2)
                 , by = .EACHI][
                   , c("sum_1") := .(sum(rain)), by = .(loc.id, years)][
                   # remove all redundant columns
                   , day := NULL][
                   , day := NULL][
                   , rain := NULL])
df2 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index3)
                 , by = .EACHI][
                   , c("sum_2") := .(sum(rain)), by = .(loc.id, years)][
                   , day := NULL][
                   , day := NULL][
                   , rain := NULL])
df3 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index4)
                 , by = .EACHI][
                   , c("sum_3") := .(sum(rain)), by = .(loc.id, years)][
                   , day := NULL][
                   , day := NULL][
                   , rain := NULL])
Merge all three data.tables together:
df1[df2, on = .(loc.id, years)][
  df3, on = .(loc.id, years)]
loc.id years sum_1 sum_2 sum_3
1: 1 1950 104159.11 222345.4 271587.1
2: 1 1951 118689.90 257450.2 347624.3
3: 1 1952 99262.27 212923.7 280877.6
4: 1 1953 72435.50 192072.7 251593.6
5: 1 1954 104021.19 242525.3 326463.4
6: 1 1955 93436.32 232653.1 304921.4
7: 1 1956 89122.79 190424.4 255535.0
8: 1 1957 135658.11 262918.7 346361.4
9: 1 1958 80064.18 220454.8 292966.4
10: 1 1959 114231.19 273181.0 349489.2
11: 2 1950 94360.69 238296.8 301751.8
12: 2 1951 93845.50 195273.7 289686.0
13: 2 1952 107692.53 245019.4 308093.7
14: 2 1953 86650.14 257225.1 332674.1
15: 2 1954 104085.83 238859.4 286350.7
16: 2 1955 101602.16 223107.3 300958.4
17: 2 1956 73912.77 198087.2 276590.1
18: 2 1957 117780.86 228299.8 305348.5
19: 2 1958 98625.45 220902.6 291583.7
20: 2 1959 109851.38 266745.2 324246.8
[ reached getOption("max.print") -- omitted 81 rows ]
Compare processing time and memory used
> time_dplyr; time_datatable
user system elapsed
2.17 0.27 2.61
user system elapsed
0.45 0.00 0.69
rowname Class MB
1 dat data.frame 508
2 df3 data.table 26
3 df2 data.table 20
4 df1 data.table 9
When testing for about 100 years of data, dplyr used more than 50 GB of memory while data.table consumed only 5 GB. dplyr also took about 4 times longer to finish.
'data.frame': 3650000 obs. of 4 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ rain : num 10.1 27.6 14.3 30.9 32.9 ...
'data.frame': 3650000 obs. of 6 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
$ index1: num 250 250 250 250 250 250 250 250 250 250 ...
$ index2: int 270 265 262 267 266 265 262 268 260 268 ...
$ index3: int 290 287 286 289 281 285 286 285 284 283 ...
$ index4: int 298 297 296 295 298 294 296 298 298 300 ...
> time_dplyr; time_datatable
user system elapsed
95.010 33.704 128.722
user system elapsed
26.175 3.147 29.312
rowname Class MB
1 dat data.frame 50821
2 df3 data.table 2588
3 df2 data.table 2004
4 df1 data.table 888
5 df.ref data.table 97
6 df data.table 70
If I increased the number of years to 150, dplyr broke, even on an HPC cluster node with 256 GB of RAM:
Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted
Here's a starting point that will be much faster. It should be trivial to figure out the rest.
library(data.table)
setDT(df)
df[df.ref, on = .(loc.id, years, day >= index1, day <= index2), sum(rain), by = .EACHI]
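One hedged way to finish (my sketch, not the answerer's code; the sum_win helper and the output column names are mine): wrap the non-equi join for each upper bound and bind the three sums onto df.ref, relying on by = .EACHI returning one row per df.ref row, in order:
library(data.table)
setDT(df); setDT(df.ref)
# Sum rain between index1 and the given upper-bound column, one value per df.ref row.
sum_win <- function(upper) {
  df[df.ref,
     on = c("loc.id", "years", "day>=index1", paste0("day<=", upper)),
     sum(rain), by = .EACHI]$V1
}
out <- df.ref[, .(loc.id, years)]
out[, `:=`(sum.rain1 = sum_win("index2"),
           sum.rain2 = sum_win("index3"),
           sum.rain3 = sum_win("index4"))]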
