Interpolate Time Series Using Weighted Loess in R - r

I have a dataset of multiple lakes with water level elevations through time. The observations are not regularly spaced and have many large gaps. Further, some of the older observations may be of lower or unknown quality. I created a separate model that does a reasonably good job of predicting water levels across time, but still misses the actual observations by varying amounts.
I would like to create a third imputed/interpolated set of data in which the solution:
informed by the modeled values where observations are missing
crosses the highly weighted observations
and is informed by the lower weighted observations
So far, I have used the fable package's TSLM->interpolate to perform this. It works reasonably well, but I cannot see a way to introduce weighting to the process. Further, it relies too heavily on the global coefficient and intercept, making it a bit too volatile when the modeled value significantly misses the observed one. I am thinking that I need to use some sort of weighted loess that relies on local coefficients and can accommodate weighting.
library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)
test_data <- data.frame(obs_year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009),
site_name = c("Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2"),
observed = c(100,200,NA, NA, NA, NA, 220, NA, NA, 125, NA,NA,425, NA, 475, NA, 450, 450, 475, 500),
weights = c(1,1,NA, NA, NA, NA, 2, NA, NA, 2, NA,NA,2, NA, 1, NA, 2, 2, 2, 2),
modeled = c(110,120,165,150, 200, 225, 240, 250, 150, 130, 450,430,415,400, 425, 450, 460, 460, 470, 490))
test_tsibble <- as_tsibble(test_data, key = site_name, index = obs_year)
tslm_interpolate <- test_tsibble %>%
group_by(site_name) %>%
model(lm = TSLM(observed~modeled)) %>%
fabletools::interpolate(test_tsibble)
tslm_interpolate <- left_join(tslm_interpolate, test_data, by = c("site_name", "obs_year")) %>%
dplyr::select(obs_year, site_name, observed = observed.y, imputed = observed.x, modeled, weights)
tslm_interpolate %>%
ggplot(aes(x=obs_year))+
geom_line(aes(y = imputed), color = "blue")+
geom_line(aes(y = modeled), color = "red")+
geom_point(aes(y = observed), color = "green")+
facet_wrap(~site_name, scales = "free_y")
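One way to bring in weighting without a global fit is to correct the modeled series by its error at the observed years. The sketch below is not loess: it swaps in plain linear interpolation of the residuals with base R's approx. The helper name blend_site and the rule of shrinking each residual by weight / max(weight) are assumptions made for illustration, so the highest-weighted observations are crossed exactly and lower-weighted ones only pull the curve part of the way.

```r
test_data <- data.frame(
  obs_year = rep(2000:2009, times = 2),
  site_name = rep(c("Lake1", "Lake2"), each = 10),
  observed = c(100, 200, NA, NA, NA, NA, 220, NA, NA, 125,
               NA, NA, 425, NA, 475, NA, 450, 450, 475, 500),
  weights = c(1, 1, NA, NA, NA, NA, 2, NA, NA, 2,
              NA, NA, 2, NA, 1, NA, 2, 2, 2, 2),
  modeled = c(110, 120, 165, 150, 200, 225, 240, 250, 150, 130,
              450, 430, 415, 400, 425, 450, 460, 460, 470, 490)
)

# Hypothetical helper: blend one site's modeled values with its observations
blend_site <- function(d) {
  d <- d[order(d$obs_year), ]
  has_obs <- !is.na(d$observed)
  # Assumed trust rule: scale weights so the largest weight equals 1
  trust <- d$weights[has_obs] / max(d$weights[has_obs])
  # Model error at the observed years, shrunk by trust
  resid <- (d$observed[has_obs] - d$modeled[has_obs]) * trust
  # Interpolate the correction linearly; rule = 2 holds the end values constant
  correction <- approx(d$obs_year[has_obs], resid,
                       xout = d$obs_year, rule = 2)$y
  d$imputed <- d$modeled + correction
  d
}

imputed <- do.call(rbind, lapply(split(test_data, test_data$site_name), blend_site))
```

Because the correction is local, a large miss at one observation only affects the years between its neighbouring observations. A smoother correction (for example a weighted loess on the residuals) could replace approx if there are enough observations per lake.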

Related

Interpolating in R restricted by a character variable

I have a data set that consists of an ID, years and an index:
ID = c("ABW", "ABW", "FRA", "FRA", "FRA", "GER", "GER", "GER")
year = c(2000, 2005, 2000, 2002, 2008, 2005, 2008, 2010)
index = c(NA, NA, 4, NA, 8, NA, 6, NA)
df <- data.frame(ID, year, index)
I am trying to interpolate the missing values in the index but I want the interpolation to be restricted by the ID - e.g. I want R to interpolate the index for all rows with the ID "FRA" and then start the interpolation over again for all rows with the ID "GER". And if there are no values at all for a specific ID (like for the ID "ABW") then I want R to return no interpolated values either.
I have tried this code (but it does not take the ID into consideration):
df <- df %>% mutate(index = na.approx(index, rule = 2))
After the interpolation I want my index column to look like this:
index = c(NA, NA, 4, 6, 8, 6, 6, 6)
Does anyone know how I can do this?
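A minimal base R sketch of per-ID interpolation follows. It uses ave to apply a helper to each ID separately; the helper name fill_group is hypothetical, and the choice to interpolate by row position rather than by year is an assumption made because that is what reproduces the desired output (4, 6, 8 for "FRA").

```r
df <- data.frame(
  ID = c("ABW", "ABW", "FRA", "FRA", "FRA", "GER", "GER", "GER"),
  year = c(2000, 2005, 2000, 2002, 2008, 2005, 2008, 2010),
  index = c(NA, NA, 4, NA, 8, NA, 6, NA)
)

# Hypothetical helper: interpolate one group's values by row position
fill_group <- function(x) {
  ok <- !is.na(x)
  if (!any(ok)) return(x)                          # no data for this ID: keep NA
  if (sum(ok) == 1) return(rep(x[ok], length(x)))  # one value: carry it across
  # Linear interpolation inside the group, end values extended outward
  approx(which(ok), x[ok], xout = seq_along(x), rule = 2)$y
}

# ave() runs fill_group within each ID and puts results back in row order
df$index <- ave(df$index, df$ID, FUN = fill_group)
df$index
```

Grouping is what keeps "GER" from borrowing values from "FRA"; the same helper could be used inside group_by(ID) %>% mutate(...) if dplyr is preferred.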

Fitting curves with DRC package in R?

I'm trying to fit curves with the DRC package in R.
Example:
x_yrs<-c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
2015, 2016, 2017)
y<-c(1.89, 0.34, 0.47, 2.46, 2.13, 7.49, 47.24, 117.84, 202.8, 322.7,
540.72, 744.22, 1148.7)
MaxPop<-110000
Y_Adj<-y/MaxPop
EV<-drm(y~ x_yrs,fct = LL.3(fixed = c(NA, NA, NA)))
plot(EV, broken = TRUE, type = "all")
EV<-drm(y~ x_yrs,fct = LL.5(fixed = c(NA, NA, NA, NA, NA)))
plot(EV, broken = TRUE, type = "all")
x_yrs_Adj<- x_yrs-2004
EV<-drm(Y_Adj~ x_yrs_Adj,fct = LL.5(fixed = c(NA, NA, NA, NA, NA)))
plot(EV, broken = TRUE, type = "all",xlim = c(0, 40), ylim = c(0, 1))
I would like the max value of the curve to be "1" or the "MaxPop", i.e. the upper asymptote.
How would I go about changing the drm model to accomplish this?
"I would like to set the future population size to reach 110,000." I don't think it will be possible to fit a model with that constraint based on the data you give. The response that you have for the support of the function doesn't even get near to that (potentially?) asymptotic region. So I think you need to rethink your approach.
That aside, in drc you can realise constraints by specifying values for specific parameters through the fixed function argument.
EV <- drm(Y_Adj ~ x_yrs_Adj, fct = LL.5(fixed = c(NA, 0, 1, NA, NA)))
You can find out about the individual parameters if you do e.g. ?LL.5:
LL.5(fixed = c(NA, NA, NA, NA, NA), names = c("b", "c", "d", "e", "f"), ...)
[...]
The five-parameter logistic function is given by the expression
f(x) = c + \frac{d-c}{(1+\exp(b(\log(x)-\log(e))))^f}
So in this case, we set c to zero and then fix d = 1.
Let's show the plot
plot(EV, broken = TRUE, type = "all", xlim = c(0, 40000), ylim = c(0, 1))
You can see the issue here. As you don't have any support of values x_yrs_Adj closer to the function's asymptotic behaviour, your fit (and the resulting estimated parameters) will be poor.
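To see what fixing c and d does, the five-parameter expression above can be written out in plain R. This is only a sketch of the formula, not drc code: the function name ll5 and the values chosen for b, e and f are made up for illustration.

```r
# Five-parameter log-logistic curve, transcribed from the expression in ?LL.5
ll5 <- function(x, b, c, d, e, f) {
  c + (d - c) / (1 + exp(b * (log(x) - log(e))))^f
}

# Illustrative (made-up) shape parameters with c = 0 and d = 1 fixed
x <- c(1, 10, 20, 40, 1e6)
y <- ll5(x, b = -3, c = 0, d = 1, e = 20, f = 1)
round(y, 4)
```

With a negative b the curve rises from c towards d as x grows, so fixing c = 0 and d = 1 pins the lower and upper asymptotes, while b, e and f still have to be estimated from data that covers the bend of the curve.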

R: casting multiple columns according

I have this dataset (run it in the command line, to have a look at it)
structure(list(Staz = c("Carmagnola", "Chieri", "Chivasso", "Ivrea",
"Moncalieri", "Orbassano"), Year = c(2004, 2004, 2004, 2004,
2004, 2004), Season = c("Autumn", "Autumn", "Autumn", "Autumn",
"Autumn", "Autumn"), Avg_T = c(11.7361111111111, 11.7361111111111,
11.7361111111111, 11.7361111111111, 11.7361111111111, 11.7361111111111
), Min_T = c(7.27222222222222, 7.27222222222222, 7.27222222222222,
7.27222222222222, 7.27222222222222, 7.27222222222222), Max_T = c(16.6722222222222,
16.6722222222222, 16.6722222222222, 16.6722222222222, 16.6722222222222,
16.6722222222222), Moisture = c(69.6388888888889, 69.6388888888889,
69.6388888888889, 69.6388888888889, 69.6388888888889, 69.6388888888889
), Rain = c(79.2, 79.2, 79.2, 79.2, 79.2, 79.2), Year_Bef = c(2004,
2004, 2004, 2004, 2004, 2004), Year_Bef_Two = c(2004, 2004, 2004,
2004, 2004, 2004)), .Names = c("Staz", "Year", "Season", "Avg_T",
"Min_T", "Max_T", "Moisture", "Rain", "Year_Bef", "Year_Bef_Two"
), row.names = c(NA, 6L), class = "data.frame")
As you can see, there is a variable named 'Season', defining the season of the data. I would like to split the weather variables ('Avg_T', 'Min_T', 'Max_T', 'Moisture', 'Rain') for every season, but have them in the same row. So, I would have just one row per study area for every year, containing information about seasonal data.
I tried to do that with the 'cast' and 'dcast' commands in the 'reshape' and 'reshape2' packages but it didn't work.
Could somebody help me?
Thanks,
Jacopo
First, let's say your data lives in df. I rbind df to df and change the season in the latter half of df to summer so that we have more than 1 season present:
df <- rbind(df, df)
df[7:12,]$Season = 'Summer'
Then I get rid of the last two columns in df (they don't seem to be doing anything):
df = df[, -c(9,10)]
Now, we're ready to use the reshape function:
r_df <- reshape(df, timevar = 'Season', idvar = c('Staz', 'Year'), direction = 'wide')
I think that should give you what you're looking for.
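A smaller, self-contained version of the same reshape call makes the resulting column layout easier to see. The station names come from the question, but the numbers and the reduced set of columns below are made up for illustration.

```r
# Toy data: two stations, two seasons, two weather variables (values invented)
df <- data.frame(
  Staz = rep(c("Carmagnola", "Chieri"), times = 2),
  Year = 2004,
  Season = rep(c("Autumn", "Summer"), each = 2),
  Avg_T = c(11.7, 11.9, 22.4, 22.8),
  Rain = c(79.2, 81.0, 40.5, 38.1)
)

# One row per station and year; each weather variable gets a column per season
r_df <- reshape(df, timevar = "Season", idvar = c("Staz", "Year"),
                direction = "wide")
names(r_df)
```

Every column that is neither an idvar nor the timevar is spread, and the new names are built as variable.season, for example Avg_T.Summer.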

Sum over rows (rollapply) with time decay

This is a follow-on question to a question I posted earlier (see Sum over rows with multiple changing conditions R data.table for more details). I want to calculate how many times the 3 subjects have experienced an event in the last 5 years. So far I have been summing over a rolling window using rollapply from the zoo package. This assumes that the experience 5 years ago is as important as the experience 1 year ago (same weighting), so now I want to include a time decay for the experience that enters the sum. This basically means that the experience 5 years ago does not enter into the sum with the same weighting as the experience 1 year ago.
In my case I want to include an age-dependent decay (even though for other applications faster or slower decays such as square root or squares could be possible).
For example, let's assume I have the following data (I build on the previous data for clarity):
mydf <- data.frame (Year = c(2000, 2001, 2002, 2004, 2005,
2007, 2000, 2001, 2002, 2003,
2003, 2004, 2005, 2006, 2006, 2007),
Name = c("Tom", "Tom", "Tom", "Fred", "Gill",
"Fred", "Gill", "Gill", "Tom", "Tom",
"Fred", "Fred", "Gill", "Fred", "Gill", "Gill"))
# Create an indicator for the experience
mydf$Ind <- 1
# Load required packages
library(data.table)
library(zoo)
# Set data.table
setDT(mydf)
setkey(mydf, Name,Year)
# Perform cartesian join to calculate experience. I2 is the new experience indicator
m <- mydf[CJ(unique(Name),seq(min(Year)-5, max(Year))),allow.cartesian=TRUE][,
list(Ind = unique(Ind), I2 = sum(Ind,na.rm=TRUE)),
keyby=list(Name,Year)]
# This is the approach I have been taking so far. Note that this is a simple rolling sum of I2
m[,Exp := rollapply(I2, 5, function(x) sum(head(x,-1)),
align = 'right', fill=0),by=Name]
So the question now is, how can I include an age-dependent decay in this calculation? To model this I need to divide the experience by the age of the experience before it enters the sum.
I have been trying to get it to work using something along these lines:
m[,Exp_age := rollapply(I2, 5, function(x) sum(head(x,-1)/(tail((Year))-head(Year,-1))),
align = 'right', fill=0),by=Name]
But it does not work. I think my main problem is that I cannot get the age of the experience right so I can divide by the age in the sum. The result should look like the Exp_age column in the myres data.frame below
myres <- data.frame(Name = c("Fred", "Fred", "Fred", "Fred", "Fred",
"Gill", "Gill", "Gill", "Gill", "Gill", "Gill",
"Tom", "Tom", "Tom", "Tom", "Tom"),
Year = c(2003, 2004, 2004, 2006, 2007, 2000, 2001, 2005,
2005, 2006, 2007, 2000, 2001, 2002, 2002, 2003),
Ind = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Exp = c(0, 1, 1, 3, 4, 0, 1, 1, 1, 2, 3, 0, 1, 2, 2, 4),
Exp_age = c(0, 1, 1, 1.333333333, 1.916666667, 0, 1, 0.45,
0.45, 2.2, 2, 0, 1, 1.5, 1.5, 2.833333333))
Any pointers would be greatly appreciated!
If I understand you correctly, you are trying to do a rollapply with width=5 and rather than do a simple sum, you want to do a weighted sum. The weights are the age of the experience relative to the 5 year window. I would do this: first set the key in your data.table so that it has proper increasing order by Name, then you know that the last item in your x variable is the youngest and the first item is the oldest (you do this in your code already). I can't quite tell which way you want the weights to go (youngest to have greatest weight or oldest) but you get the point:
setkey(m, Name, Year)
my_fun = function(x) { w = 1:length(x); sum(x*w)}
m[,Exp_age:=rollapply(I2, width=5, by=1, fill=NA, FUN=my_fun, by.column=FALSE, align="right") ,by=Name]
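If the weights should be one over the age of the experience, as the question describes, the window function can build the ages from the window position alone, because the window is right-aligned and its last element is the current year. The sketch below uses base R only; the function name decayed_sum is hypothetical, and the input is Tom's yearly event counts padded with zeros, taken from the question's data.

```r
# Hypothetical window function: sum of past experience divided by its age
decayed_sum <- function(x) {
  past <- head(x, -1)          # drop the current year, as in the question
  age <- rev(seq_along(past))  # oldest element first: 4, 3, 2, 1 for width 5
  sum(past / age)
}

# Tom's event counts per year, 1996 to 2003 (zeros before his first event)
tom <- c(0, 0, 0, 0, 1, 1, 2, 1)
names(tom) <- 1996:2003

# Right-aligned rolling window of width 5, written out with sapply
width <- 5
exp_age <- sapply(width:length(tom),
                  function(i) decayed_sum(tom[(i - width + 1):i]))
names(exp_age) <- names(tom)[width:length(tom)]
exp_age
```

For 2000 to 2003 this gives 0, 1, 1.5 and about 2.83, which agrees with the Tom rows of myres. The same function can be handed to rollapply as FUN, for example rollapply(I2, 5, decayed_sum, align = "right", fill = 0) inside the by = Name call.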

Adding different vertical lines for each panel in xyplot using lattice in R

I have a graph of plant species frequency by year for several sites that I am plotting using xyplot in the lattice package. I've figured out how to get the scatterplot for each species-site combo. However, I want to add an abline representing each year in which a chemical treatment was done. Chemical treatments were added in different years at each site, and I'd like to add a vertical abline for each species-site graph in which a chemical treatment was performed at that site. Here is my xyplot code:
library(lattice)
library(plyr)
sp.1 <- data.frame(site=rep('a', 10), year=seq(2001, 2010, 1), year.trt=c(NA, NA, NA, NA, 2005, NA, NA, 2008, NA, NA), pl.1=rnorm(10, 4, 1), pl.2=rnorm(10, 6, 2))
sp.2 <- data.frame(site=rep('b', 10), year=seq(2001, 2010, 1), year.trt=c(2001, NA, NA, NA, NA, 2006, NA, NA, NA, NA), pl.1=rnorm(10, 5, 2), pl.2=rnorm(10, 4, 1))
sp.3 <- data.frame(site=rep('c', 10), year=seq(2001, 2010, 1), year.trt=c(NA, NA, NA, 2004, NA, NA, NA, NA, 2009, NA), pl.1=rnorm(10, 8, 1), pl.2=rnorm(10, 3, 3))
data <- rbind.fill(sp.1, sp.2, sp.3)
xy.plot <- xyplot(pl.1 + pl.2 ~ year | site, data=data, outer=T, type='l',
as.table=T, xlab=c('Year'), ylab=c('Spp. Frequency (%)'),
panel=function(x, y,...){
panel.xyplot(x,y, type='l')
panel.abline(v=data$year.trt)
})
print(xy.plot)
So, the important line of code in this block is 'panel.abline(v=data$year.trt)'. Currently, this plots every year in my dataset in which a chemical treatment was done; however, I'd like each panel to show only the years in which a treatment was done at that specific site.
Any insight would be greatly appreciated.
Thanks,
Paul
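One possible approach is to let the panel function look up the treatment years for its own rows through the subscripts argument that lattice passes to panel functions. The sketch below is a variation on the question's code rather than a drop-in fix: it assumes the two species columns are first stacked into a long data frame (called long here, with made-up column names species and freq), so that subscripts index rows of that data frame directly.

```r
library(lattice)
set.seed(1)

# Rebuild data shaped like the question's (random frequencies, same treatment years)
make_site <- function(site, trt) {
  data.frame(site = site, year = 2001:2010, year.trt = trt,
             pl.1 = rnorm(10, 5, 1), pl.2 = rnorm(10, 4, 1))
}
dat <- rbind(
  make_site("a", c(NA, NA, NA, NA, 2005, NA, NA, 2008, NA, NA)),
  make_site("b", c(2001, NA, NA, NA, NA, 2006, NA, NA, NA, NA)),
  make_site("c", c(NA, NA, NA, 2004, NA, NA, NA, NA, 2009, NA))
)

# Stack the species columns so each row is one site-year-species record
id_cols <- dat[c("site", "year", "year.trt")]
long <- rbind(
  data.frame(id_cols, species = "pl.1", freq = dat$pl.1),
  data.frame(id_cols, species = "pl.2", freq = dat$pl.2)
)
long$site <- factor(long$site)
long$species <- factor(long$species)

xy.plot <- xyplot(freq ~ year | site + species, data = long, as.table = TRUE,
  xlab = "Year", ylab = "Spp. Frequency (%)",
  panel = function(x, y, subscripts, ...) {
    panel.xyplot(x, y, type = "l")
    # subscripts holds the rows of 'long' that belong to this panel
    yrs <- long$year.trt[subscripts]
    yrs <- yrs[!is.na(yrs)]
    if (length(yrs) > 0) panel.abline(v = yrs, lty = 2)
  })
print(xy.plot)
```

Each panel then draws dashed lines only at the treatment years of its own site, since the rows selected by subscripts never include other sites.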
