Interpolating in R restricted by a character variable

I have a data set that consists of an ID, years and an index:
ID = c("ABW", "ABW", "FRA", "FRA", "FRA", "GER", "GER", "GER")
year = c(2000, 2005, 2000, 2002, 2008, 2005, 2008, 2010)
index = c(NA, NA, 4, NA, 8, NA, 6, NA)
df <- data.frame(ID, year, index)
I am trying to interpolate the missing values in the index but I want the interpolation to be restricted by the ID - e.g. I want R to interpolate the index for all rows with the ID "FRA" and then start the interpolation over again for all rows with the ID "GER". And if there are no values at all for a specific ID (like for the ID "ABW") then I want R to return no interpolated values either.
I have tried this code (but it does not take the ID into consideration):
df <- df %>% mutate(index = na.approx(index, rule = 2))
After the interpolation I want my index column to look like this:
index = c(NA, NA, 4, 6, 8, 6, 6, 6)
Does anyone know how I can do this?
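One possible approach, sketched here in base R rather than with na.approx: write a small helper that interpolates a single group and apply it per ID with ave(). The helper name fill_group is hypothetical, and the sketch assumes that interpolation should run over row position within each ID (which is what reproduces the desired output above), that an ID with a single known value should carry that value to all of its rows, and that an ID with no values stays NA.

```r
# Example data from the question
ID <- c("ABW", "ABW", "FRA", "FRA", "FRA", "GER", "GER", "GER")
year <- c(2000, 2005, 2000, 2002, 2008, 2005, 2008, 2010)
index <- c(NA, NA, 4, NA, 8, NA, 6, NA)
df <- data.frame(ID, year, index)

# Hypothetical helper: interpolate the values of one group by row position
fill_group <- function(v) {
  ok <- !is.na(v)
  if (!any(ok)) return(v)                          # no data at all: keep the NAs
  if (sum(ok) == 1) return(rep(v[ok], length(v)))  # one value: carry it to every row
  # linear interpolation between known points; rule = 2 extends the end values
  approx(x = which(ok), y = v[ok], xout = seq_along(v), rule = 2)$y
}

# ave() applies the helper separately within each ID and keeps the row order
df$index <- ave(df$index, df$ID, FUN = fill_group)
print(df)
```

Because the interpolation uses row position, the missing FRA value for 2002 becomes 6. Passing the year as x instead of which(ok) would weight by the time gap and give 5 for that row, so which variant is right depends on the intended meaning.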

Related

Is there an easy way to extend a single column in an R dataframe?

I want to extend my dataset with missing observations in order to compute forecasts.
This means I want to extend my 'time' column and set all the new cells from the other columns to NA:
Time1 <- c(2019, 2020, 2021, 2022)
data1 <- c(3, 4, 1, 4)
df1 <- cbind(Time1, data1)
Time2 <- c(2019, 2020, 2021, 2022, 2023, 2024, 2025)
data2 <- c(3, 4, 1, 4, NA, NA, NA)
df2 <- cbind(Time2, data2)
Is there an easy way to get from df1 to df2 without creating a new dataframe?
You can do it like this:
library(dplyr)
df1 <- as_tibble(df1)
df1 <- df1 %>% add_row(Time1 = seq(from = 2023, to = 2025, by = 1))
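As an alternative sketch that avoids extra packages, the same result can be obtained in base R by merging against the full range of years. This assumes df1 is stored as a data frame (not the matrix that cbind() returns); the name all_years is introduced only for illustration.

```r
Time1 <- c(2019, 2020, 2021, 2022)
data1 <- c(3, 4, 1, 4)
df1 <- data.frame(Time1, data1)

# Every year that should appear in the extended data
all_years <- data.frame(Time1 = seq(from = 2019, to = 2025, by = 1))

# all.x = TRUE keeps years with no match in df1 and fills the other columns with NA
df2 <- merge(all_years, df1, by = "Time1", all.x = TRUE)
print(df2)
```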

Interpolate Time Series Using Weighted Loess in R

I have a dataset of multiple lakes with water level elevations through time. The observations are not regularly spaced and have many large gaps. Further, some of the older observations may be of lower or unknown quality. I created a separate model that does a reasonably good job of predicting water levels across time, but still misses the actual observations by varying amounts.
I would like to create a third imputed/interpolated set of data in which the solution:
informed by the modeled values where observations are missing
crosses the highly weighted observations
and is informed by the lower weighted observations
So far, I have used the fable package's TSLM->interpolate to perform this. It works reasonably well, but I cannot see a way to introduce weighting to the process. Further, it relies too heavily on the global coefficient and intercept, making it a bit too volatile when the modeled value significantly misses the observed one. I am thinking that I need to use some sort of weighted loess that relies on local coefficients and can accommodate weighting.
library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)
test_data <- data.frame(obs_year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009),
site_name = c("Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2"),
observed = c(100,200,NA, NA, NA, NA, 220, NA, NA, 125, NA,NA,425, NA, 475, NA, 450, 450, 475, 500),
weights = c(1,1,NA, NA, NA, NA, 2, NA, NA, 2, NA,NA,2, NA, 1, NA, 2, 2, 2, 2),
modeled = c(110,120,165,150, 200, 225, 240, 250, 150, 130, 450,430,415,400, 425, 450, 460, 460, 470, 490))
test_tsibble <- as_tsibble(test_data, key = site_name, index = obs_year)
tslm_interpolate <- test_tsibble %>%
  group_by(site_name) %>%
  model(lm = TSLM(observed ~ modeled)) %>%
  fabletools::interpolate(test_tsibble)
tslm_interpolate <- left_join(tslm_interpolate, test_data, by = c("site_name", "obs_year")) %>%
  dplyr::select(obs_year, site_name, observed = observed.y, imputed = observed.x, modeled, weights)
tslm_interpolate %>%
  ggplot(aes(x = obs_year)) +
  geom_line(aes(y = imputed), color = "blue") +
  geom_line(aes(y = modeled), color = "red") +
  geom_point(aes(y = observed), color = "green") +
  facet_wrap(~site_name, scales = "free_y")

Do a loop with different versions of a dataset based on a variable id and save the result after each loop

I have a dataset with x countries over y years.
I would like to run a certain analysis (shown below; this code is not the problem).
The problem: I would like to run the analysis I already have a number of times, each time with a different dataset that contains another combination of the x countries and y years. To be clear: I would like to do the analysis for EACH possible combination of the x countries and the y years.
The code that I would like to execute for each version of the dataset (the dataset is explained further below):
library(stats)
##### the analysis for one dataset ####
o <- lapply(1:999, function(i) {
  # keep only the observations with irrigation up to the current threshold
  Alldata_Rainfed <- subset(Alldata, irrigation <= i)
  # seasonal totals for this subset (an empty subset sums to 0)
  c(outcome_spring = sum(Alldata_Rainfed$spring),
    outcome_summer = sum(Alldata_Rainfed$summer),
    outcome_autumn = sum(Alldata_Rainfed$autumn),
    outcome_winter = sum(Alldata_Rainfed$winter))
})
combination <- as.data.frame(do.call(rbind, o)) # the output I want is another dataset like this for each unique dataset
#### the end of the analysis for one dataset ####
Desired output
That means that as output I need as many datasets (named "combination" in the example) as there are possible combinations of the x countries and y years.
As an example, imagine having the following dataset (the real dataset has over 500000 observations, 15 countries, 9 years):
> dput(Alldata)
structure(list(country = c("belgium", "belgium", "belgium", "belgium",
"germany", "germany", "germany", "germany"), year = c(2004, 2005,
2005, 2013, 2005, 2009, 2013, 2013), spring = c(23, 24, 45, 23,
1, 34, 5, 23), summer = c(25, 43, 654, 565, 23, 1, 23, 435),
autumn = c(23, 12, 4, 12, 24, 64, 23, 12), winter = c(34,
45, 64, 13, 346, 74, 54, 45), irrigation = c(10, 30, 40,
300, 288, 500, 996, 235), id = c(1, 2, 2, 3, 4, 5, 6, 6)), datalabel = "", time.stamp = "14 Nov 2016 20:09", .Names = c("country",
"year", "spring", "summer", "autumn", "winter", "irrigation",
"id"), formats = c("%9s", "%9.0g", "%9.0g", "%9.0g", "%9.0g",
"%9.0g", "%9.0g", "%9.0g"), types = c(7L, 254L, 254L, 254L, 254L,
254L, 254L, 254L), val.labels = c("", "", "", "", "", "", "",
""), var.labels = c("", "", "", "", "", "", "", "group(country year)"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8"), version = 12L, class = "data.frame")
In the example above, I already made an id for combining country and year. That means I want to make datasets with all observations that have combinations of the following ids:
dataset 1_2_3_4_5: ids 1, 2, 3, 4, 5 (so this dataset only misses the observations with id = 6)
dataset 1_2_3_4_6: ids 1, 2, 3, 4, 6 (but not 5)
dataset 1_2: ids 1, 2 (but not all the rest)
dataset 3_4_5: ids 3, 4, 5 (but not all the rest)
....
etc etc... Note that I gave the name of the dataset the name of the ids that are included. Otherwise it will be hard for me to distinguish all the different datasets from each other. Other names are fine too, as long as I can distinguish between the datasets!
Thanks for your help!
EDIT: it is possible that certain datasets give no results (because in the second loop irrigation is used to loop, and certain combinations might not have irrigation), but then the output should just be a dataset with missing values.
Not sure if this is the most efficient way of doing this, but I think it should work:
# create a df to store the results of all combinations
result=data.frame()
The next loops are based on the combn() function, which creates all possible combinations of the elements of a vector (here the ids), taking m elements at a time.
ids <- unique(Alldata$id)
for (i in 2:length(ids)) {
  combis <- combn(ids, i) # one combination of i ids per column
  for (j in 1:ncol(combis)) {
    sub <- Alldata[Alldata$id %in% combis[, j], ]
    out <- sub[1, ] # placeholder: use your function on 'sub' here
    out$label <- paste(combis[, j], collapse = "_") # provide a label so you know which combination this result is for
    result <- rbind(result, out) # append it to the previous output
  }
}
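To make the combn() idea concrete, here is a minimal, self-contained sketch on toy data. The names toy, subsets and results are made up for illustration, and the sum of spring stands in for the real analysis. It enumerates every non-empty subset of ids, labels each one in the 1_2_3 style asked for in the question, and stores one result per label.

```r
# Toy data: 'id' marks each country-year combination
toy <- data.frame(id = c(1, 2, 2, 3), spring = c(23, 24, 45, 23))
ids <- unique(toy$id)

# All non-empty subsets of the ids, for subset sizes 1 .. length(ids)
subsets <- unlist(
  lapply(seq_along(ids), function(m) combn(ids, m, simplify = FALSE)),
  recursive = FALSE
)

# Label every subset by the ids it contains, e.g. "1_2"
names(subsets) <- vapply(subsets, paste, character(1), collapse = "_")

# Run a placeholder analysis on each subset; the list keeps the labels as names
results <- lapply(subsets, function(s) {
  sub <- toy[toy$id %in% s, ]
  sum(sub$spring)
})
print(names(results))
```

Note that n ids have 2^n - 1 non-empty subsets. With 15 countries and 9 years there can be up to 135 ids, which is far too many to enumerate, so in practice the set of combinations would need to be restricted (for example to whole countries or whole years).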

Sum over rows (rollapply) with time decay

This is a follow-on question to a question I posted earlier (see Sum over rows with multiple changing conditions R data.table for more details). I want to calculate how many times the 3 subjects have experienced an event in the last 5 years. So far I have been summing over a rolling window using rollapply from the zoo package. This assumes that the experience 5 years ago is as important as the experience 1 year ago (same weighting), so now I want to include a time decay for the experience that enters the sum. This basically means that the experience 5 years ago does not enter into the sum with the same weighting as the experience 1 year ago.
In my case I want to include an age-dependent decay (even though for other applications faster or slower decays, such as square roots or squares, could be possible).
For example, let's assume I have the following data (I build on the previous data for clarity):
mydf <- data.frame (Year = c(2000, 2001, 2002, 2004, 2005,
2007, 2000, 2001, 2002, 2003,
2003, 2004, 2005, 2006, 2006, 2007),
Name = c("Tom", "Tom", "Tom", "Fred", "Gill",
"Fred", "Gill", "Gill", "Tom", "Tom",
"Fred", "Fred", "Gill", "Fred", "Gill", "Gill"))
# Create an indicator for the experience
mydf$Ind <- 1
# Load required packages
library(data.table)
library(zoo)
# Set data.table
setDT(mydf)
setkey(mydf, Name,Year)
# Perform cartesian join to calculate experience. I2 is the new experience indicator
m <- mydf[CJ(unique(Name),seq(min(Year)-5, max(Year))),allow.cartesian=TRUE][,
list(Ind = unique(Ind), I2 = sum(Ind,na.rm=TRUE)),
keyby=list(Name,Year)]
# This is the approach I have been taking so far. Note that this is a simple rolling sum of I2
m[,Exp := rollapply(I2, 5, function(x) sum(head(x,-1)),
align = 'right', fill=0),by=Name]
So the question now is: how can I include an age-dependent decay in this calculation? To model this I need to divide the experience by the age of the experience before it enters the sum.
I have been trying to get it to work using something along these lines:
m[,Exp_age := rollapply(I2, 5, function(x) sum(head(x,-1)/(tail((Year))-head(Year,-1))),
align = 'right', fill=0),by=Name]
But it does not work. I think my main problem is that I cannot get the age of the experience right so I can divide by the age in the sum. The result should look like the Exp_age column in the myres data.frame below
myres <- data.frame(Name = c("Fred", "Fred", "Fred", "Fred", "Fred",
"Gill", "Gill", "Gill", "Gill", "Gill", "Gill",
"Tom", "Tom", "Tom", "Tom", "Tom"),
Year = c(2003, 2004, 2004, 2006, 2007, 2000, 2001, 2005,
2005, 2006, 2007, 2000, 2001, 2002, 2002, 2003),
Ind = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Exp = c(0, 1, 1, 3, 4, 0, 1, 1, 1, 2, 3, 0, 1, 2, 2, 4),
Exp_age = c(0, 1, 1, 1.333333333, 1.916666667, 0, 1, 0.45,
0.45, 2.2, 2, 0, 1, 1.5, 1.5, 2.833333333))
Any pointers would be greatly appreciated!
If I understand you correctly, you are trying to do a rollapply with width=5 and rather than do a simple sum, you want to do a weighted sum. The weights are the age of the experience relative to the 5 year window. I would do this: first set the key in your data.table so that it has proper increasing order by Name, then you know that the last item in your x variable is the youngest and the first item is the oldest (you do this in your code already). I can't quite tell which way you want the weights to go (youngest to have greatest weight or oldest) but you get the point:
setkey(m, Name, Year)
my_fun = function(x) { w = 1:length(x); sum(x*w)}
m[,Exp_age:=rollapply(I2, width=5, by=1, fill=NA, FUN=my_fun, by.column=FALSE, align="right") ,by=Name]
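If the weights are meant to be 1/age, as the question describes, the calculation can be sketched in base R as follows. This is an illustration under stated assumptions, not the answerer's code: the helper decayed_sum is hypothetical, it expects one count per consecutive year for a single person (like I2 within one Name), and it looks back up to 5 years, which is what the Exp_age values in myres appear to use (for example 0.45 = 1/4 + 1/5 for Gill in 2005).

```r
# Hypothetical helper: for each year, sum earlier counts divided by their age in years
decayed_sum <- function(counts, max_age = 5) {
  vapply(seq_along(counts), function(t) {
    ages <- seq_len(min(max_age, t - 1))  # 1, 2, ... years back, as far as data exists
    sum(counts[t - ages] / ages)          # an event 'a' years ago contributes count / a
  }, numeric(1))
}

# Yearly event counts taken from mydf
tom  <- c(1, 1, 2, 1)              # 2000..2003
gill <- c(1, 1, 0, 0, 0, 2, 1, 1)  # 2000..2007

print(decayed_sum(tom))
print(decayed_sum(gill))
```

With rollapply, the equivalent window function would presumably be function(x) sum(head(x, -1) / 5:1) with width = 6 and align = 'right', since a width of 5 only reaches 4 years back once the current year is dropped.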

Adding different vertical lines for each panel in xyplot using lattice in R

I have a graph of plant species frequency by year for several sites that I am plotting using xyplot in the lattice package. I've figured out how to get the scatterplot for each species-site combo. However, I want to add an abline representing each year in which a chemical treatment was done. Chemical treatments were added in different years at each site, and I'd like to add a vertical abline for each species-site graph in which a chemical treatment was performed at that site. Here is my xyplot code:
library(lattice) # provides xyplot and the panel functions
library(plyr) # provides rbind.fill
sp.1 <- data.frame(site=rep('a', 10), year=seq(2001, 2010, 1), year.trt=c(NA, NA, NA, NA, 2005, NA, NA, 2008, NA, NA), pl.1=rnorm(10, 4, 1), pl.2=rnorm(10, 6, 2))
sp.2 <- data.frame(site=rep('b', 10), year=seq(2001, 2010, 1), year.trt=c(2001, NA, NA, NA, NA, 2006, NA, NA, NA, NA), pl.1=rnorm(10, 5, 2), pl.2=rnorm(10, 4, 1))
sp.3 <- data.frame(site=rep('c', 10), year=seq(2001, 2010, 1), year.trt=c(NA, NA, NA, 2004, NA, NA, NA, NA, 2009, NA), pl.1=rnorm(10, 8, 1), pl.2=rnorm(10, 3, 3))
data <- rbind.fill(sp.1, sp.2, sp.3)
xy.plot <- xyplot(pl.1 + pl.2 ~ year | site, data=data, outer=T, type='l',
as.table=T, xlab=c('Year'), ylab=c('Spp. Frequency (%)'),
panel=function(x, y,...){
panel.xyplot(x,y, type='l')
panel.abline(v=data$year.trt)
})
print(xy.plot)
So, the important line of code in this block is 'panel.abline(v=data$year.trt)'. Currently, this draws every treatment year in my dataset in every panel; however, I'd like each panel to show only the years in which a treatment was done at that specific site.
Any insight would be greatly appreciated.
Thanks,
Paul
