Using Kalman smoothing in R's KFAS package to impute missing data

I have a data frame (reproducible example at the bottom) containing a column of values representing precipitation volume, a column of date-of-measurement values, and a column each for lat, lon, and elevation coordinates. The data covers 10 years of measurement, and 10 different lat/long/elev points (levels which I will call "stations").
The precipitation column is missing 3.4% of its values completely at random (MCAR). My goal is to impute the missing values, taking into account both the temporal correlation (the NA's position within its station's time series) and the spatial correlation (the NA's geographic relationship to the rest of the points).
I do not think typical ARIMA-based techniques, such as those found in Amelia or imputeTS, will suffice, because they are limited to univariate data.
I am interested in using the KFAS package because I believe it will allow me to treat these different "stations" as "states" within the "state space", and enable me to use Kalman smoothing to "predict" the missing values based on the correlation of both the spatial and temporal variables.
My trouble is that I am having a VERY hard time getting over KFAS' learning curve and implementing this model. The documentation is sparse and there are next to no tutorials or beginner-focused materials out there. I feel like I don't even know how to get started.
Can KFAS be used this way? How would you approach this challenge? What would the basic steps look like in KFAS?
Since I barely know how to frame this question, I have made an effort to make good reproducible data. This sample data covers three "stations" over 1 month, which I'm thinking should be sufficient for demonstration. The values are realistic but not accurate.
#defining the precip variable
set.seed(76)
precip <- sample(0:7, 30, replace=TRUE)
#defining the categorical variables
lon1 <- (-123.7500)
lon2 <- (-124.1197)
lon3 <- (-124.0961)
lat1 <- (43.9956)
lat2 <- (44.0069)
lat3 <- (44.0272)
elev1 <- 76.2
elev2 <- 115.8
elev3 <- 3.7
date1 <- seq(as.Date('2011-01-01'), as.Date('2011-01-10'),by=1)
date2 <- seq(as.Date('2011-01-11'), as.Date('2011-01-20'),by=1)
date3 <- seq(as.Date('2011-01-21'), as.Date('2011-01-30'),by=1)
#creating the df
reprex.data <- NULL
reprex.data$precip <- precip
#inserting NA's randomly into the precip vector now to easily avoid doing it to the other variables
reprex.data <- as.data.frame(lapply(reprex.data, function(cc) cc[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)]))
#creating the rest of the df
reprex.data$lon[1:10] <- lon1
reprex.data$lon[11:20] <- lon2
reprex.data$lon[21:30] <- lon3
reprex.data$lat[1:10] <- lat1
reprex.data$lat[11:20] <- lat2
reprex.data$lat[21:30] <- lat3
reprex.data$elev[1:10] <- elev1
reprex.data$elev[11:20] <- elev2
reprex.data$elev[21:30] <- elev3
reprex.data$date <- c(date1, date2, date3) #assigning the full Date vector at once keeps the Date class
#voilà
head(reprex.data)
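For orientation, here is one way the basic steps might look (a rough sketch under my own assumptions, not a validated model): reshape the stations into the columns of a multivariate series, give each station a local-level state, and let full covariance matrices in Q and H carry the between-station (spatial) dependence. KFAS handles the NAs in the observations natively; precipitation counts are treated as approximately Gaussian here. The names y, updatefn, cov_from_pars and imputed are mine. With the three-station reprex above:

library(KFAS)

# one column per station, one row per day (stations are stored in blocks of 10 rows)
y <- matrix(reprex.data$precip, ncol = 3)

# multivariate local-level model; NA entries mark the covariances to estimate
model <- SSModel(y ~ -1 + SSMtrend(degree = 1, Q = list(matrix(NA, 3, 3))),
                 H = matrix(NA, 3, 3))

# rebuild Q and H from an unconstrained parameter vector
# (log-Cholesky parameterisation keeps both matrices positive semi-definite)
updatefn <- function(pars, model) {
  cov_from_pars <- function(p) {
    L <- diag(exp(p[1:3]))
    L[lower.tri(L)] <- p[4:6]
    L %*% t(L)
  }
  model$Q[, , 1] <- cov_from_pars(pars[1:6])
  model$H[, , 1] <- cov_from_pars(pars[7:12])
  model
}

fit <- fitSSM(model, inits = rep(0, 12), updatefn = updatefn, method = "BFGS")
out <- KFS(fit$model, smoothing = c("state", "mean"))

# smoothed means of the observation series; plug them in where precip is NA
imputed <- y
imputed[is.na(y)] <- out$muhat[is.na(y)]

With only a month of toy data this is over-parameterised, but on the real ten-year, ten-station series the same structure applies with 10x10 covariance matrices, and the estimated off-diagonal elements are what carry the spatial dependence between stations.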

Related

How can I make my script for correcting a logger's seasonal drift in R more efficient?

I have installed a bunch of CO2 loggers in water that log CO2 every hour for the open water season. I have characterized the loggers at 3 different concentrations of CO2 before and after installing them.
I assume that the seasonal drift in error will be linear
I assume that the error between my characterization points will be linear
My script is based on a for loop that goes through each timestamp and corrects the value. This works but is unfortunately not fast enough. I know that this can be done within a second, but I am not sure how. I seek some advice and I would be grateful if someone could show me how.
Reproducible example based on base R:
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#generate dummy dataframe
dummy <- data.frame(dt,co2)
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
diff.pre <- measured.pre-actual#diff at precharacterization
diff.post <- measured.post-actual#diff at post
#linear interpolation of how deviance from actual values change throughout the season
#I assume that the temporal drift is linear
diff.0 <- seq(diff.pre[1],diff.post[1],length.out=length(dummy$dt))
diff.400 <- seq(diff.pre[2],diff.post[2],length.out = length(dummy$dt))
diff.1000 <- seq(diff.pre[3],diff.post[3],length.out = length(dummy$dt))
#creates a data frame with the assumed drift at each increment throughout the season
dummy <- data.frame(dummy,diff.0,diff.400,diff.1000)
#this loop makes a 3-point calibration at each day in the dummy data set
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
  print(paste0("row: ",i))#to show the progress of the loop
  diff.0 <- dummy$diff.0[i]#get the differences at characterization increments
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  #values below are only used for encompassing the range of measured values in the characterization
  #this is based on the interpolated difference at the given time point and the known concentrations used
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  #linear difference between calibration at 0 and 400
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  #linear difference between calibration at 400 and 1000
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  #bind them together to get one vector
  correction.ppm <- c(seg1,seg2)
  #the complete range of measured co2 in the characterization.
  #in reality it can not be below 0 and thus it can not be below the minimum measured in the range
  measured.co2.range <- round(seq(measured.0,measured.1000,length.out=length(correction.ppm)))
  #generate a table from which we can look up the correction for each measured value
  correction.table <- data.frame(measured.co2.range,correction.ppm)
  co2 <- dummy$co2[i] #measured co2 at the current row
  #find the measured value in the table and extract the difference
  diff <- correction.table$correction.ppm[match(co2,correction.table$measured.co2.range)]
  #correct the value and save it to vector
  co2.corrected[i] <- co2-diff
}
#generate column with calibrated values
dummy$co2.corrected <- co2.corrected
This is what I understand after reviewing the code. You have a series of CO2 concentration readings, but they need to be corrected based on characterization measurements taken at the beginning of the timeseries and at the end of the timeseries. Both sets of characterization measurements were made using three known concentrations: 0, 400, and 1000.
Your code appears to be applying bilinear interpolation (over time and concentration) to compute the needed correction. This is easy to vectorize:
set.seed(1)
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
# interpolate the reference concentrations over time
cref <- mapply(seq, measured.pre, measured.post, length.out = length(dt))
#generate dummy dataframe with corrected values
dummy <- data.frame(
  dt,
  co2,
  co2.corrected = ifelse(
    co2 < cref[,2],
    actual[1] + (co2 - cref[,1])*(actual[2] - actual[1])/(cref[,2] - cref[,1]),
    actual[2] + (co2 - cref[,2])*(actual[3] - actual[2])/(cref[,3] - cref[,2])
  )
)
head(dummy)
#> dt co2 co2.corrected
#> 1 2022-08-01 00:00:00 537 416.1905
#> 2 2022-08-01 01:00:00 618 493.2432
#> 3 2022-08-01 02:00:00 516 395.9776
#> 4 2022-08-01 03:00:00 760 628.2707
#> 5 2022-08-01 04:00:00 633 507.2542
#> 6 2022-08-01 05:00:00 518 397.6533
I do not know exactly what you are calculating (I feel this could be done differently), but you can increase speed by:
- removing the print() call, which takes a lot of time inside the loop
- removing the data.frame creation in each iteration, which is slow and not needed here
This loop should be faster:
for(i in 1:nrow(dummy)){
  diff.0 <- dummy$diff.0[i]
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  correction.ppm <- c(seg1,seg2)
  s <- seq(measured.0,measured.1000,length.out=length(correction.ppm))
  measured.co2.range <- round(s)
  co2 <- dummy$co2[i]
  diff <- correction.ppm[match(co2, measured.co2.range)]
  co2.corrected[i] <- co2-diff
}
P.S. The slowest remaining part, from my testing, is round(s). Maybe that can be removed or rewritten...
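Going one step further (a sketch of my own, not the answerer's code): because correction.table is just a piecewise-linear relationship between measured CO2 and the correction, approx() can interpolate the correction directly, so round(), match() and the lookup table disappear entirely:

co2.corrected <- numeric(nrow(dummy))
for (i in seq_len(nrow(dummy))) {
  # drift and measured values at the three calibration concentrations at this time step
  drift    <- c(dummy$diff.0[i], dummy$diff.400[i], dummy$diff.1000[i])
  measured <- actual + drift
  # piecewise-linear interpolation of the correction at the observed value
  corr <- approx(x = measured, y = drift, xout = dummy$co2[i], rule = 2)$y
  co2.corrected[i] <- dummy$co2[i] - corr
}

Note that rule = 2 clamps values outside the calibration range to the nearest endpoint, whereas the table/match approach would return NA there.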

Subsetting a rasterbrick to give mean of three minimum months in each year

I'm interested in creating two variables from a time series of spatial raster data in R read from a NetCDF file. Opening the data, subsetting by Z, and creating a mean value is straightforward:
# Working example below using a ~16mb nc file of sea ice from HadISST
library(R.utils)
library(raster)
library(tidyverse)
download.file("https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz","HadISST_ice.nc.gz")
gunzip("HadISST_ice.nc.gz", ext="gz", FUN=gzfile)
hadISST <- brick('HadISST_ice.nc')
# subset to a decade time period and create decadal mean from monthly data
hadISST_a <- hadISST %>% subset(., which(getZ(.) >= as.Date("1900-01-01") & getZ(.) <= as.Date("1909-12-31"))) %>% mean(., na.rm = TRUE)
But I'm interested in extracting 1) annual mean values, and 2) the annual mean of the three minimum monthly values over the subsetted time period. My current workflow uses nc_open() and ncvar_get() to open the data, raster::extract() to get the values, and then tidyverse group_by() and slice_min() to get the annual coolest months, but it's a slow and CPU-intensive approach. Is there a more efficient way of doing this without converting from raster to data.frame?
Questions:
Using the above code, how can I extract annual means rather than a mean of ALL months over the decadal period?
Is there a way of using slice_min(order_by = sst, n = 3) or similar with brick objects to get the minimum three values per year prior to annual averaging?
Example data
if (!file.exists("HadISST_ice.nc")) {
  download.file("https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz", "HadISST_ice.nc.gz")
  R.utils::gunzip("HadISST_ice.nc.gz")
}
library(terra)
hadISST <- rast('HadISST_ice.nc')
Annual mean
y <- format(time(hadISST), "%Y")
m <- tapp(hadISST, y, mean)
Mean of the lowest three monthly values by year (this takes much longer because a user-defined R function is used). I now see that there is a bug in the CRAN version; you can instead use version 1.5-47, which you can install like this: install.packages('terra', repos='https://rspatial.r-universe.dev').
f <- function(i) mean(sort(i)[1:3])
m3 <- tapp(hadISST, y, f)
To make this faster (if you have multiple cores):
m3 <- tapp(hadISST, y, f, cores=4)
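If you prefer to stay with the raster package the question already uses, a rough equivalent is raster::stackApply(), which groups layers by an integer index (one value per layer), much like terra::tapp(). A sketch, untested here, so treat it as a starting point:

library(raster)
hadISST_r <- brick("HadISST_ice.nc")
yr  <- format(as.Date(getZ(hadISST_r)), "%Y")
idx <- as.integer(factor(yr))                      # one integer index per layer

annual_mean  <- stackApply(hadISST_r, idx, fun = mean)
lowest3_mean <- stackApply(hadISST_r, idx,
                           fun = function(x, na.rm) mean(sort(x)[1:3]))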
There are likely much more intelligent ways to do this, but here is a thought on your annual means, calling hadISST hadI here:
v <- getValues(hadI)
v_t <- t(v) # this takes a while
v_mean <- vector(mode='numeric')
for(k in 1:nrow(v_t)) {
  v_mean[k] = mean(v_t[k, ], na.rm = TRUE)
}
length(v_mean)
[1] 1828
v_mean[1:11]
[1] 0.1651351 0.1593368 0.1600364 0.1890360 0.1931470 0.1995657 0.1982052
[8] 0.1917534 0.1917840 0.1911638 0.1911657
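Incidentally, the transpose and the loop can be replaced by a single vectorized call on v (a small sketch; the columns of getValues() are the layers):

# per-layer means without the loop or the transpose
v_mean <- colMeans(v, na.rm = TRUE)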
I'm a little unclear on proposition 2, as it seems it would be the average of 3 0's...

How to form an LP-based clustering problem in R?

Trying to form an LP-based clustering problem in R with binary variables.
sample dataset:
set.seed(123)
id<- seq(1:50)
lon <- rnorm(50, 88.5, 0.125)
lat <- rnorm(50, 22.4, 0.15)
demand <- round(runif(50, min=20, max=40))
df<- data.frame(id, lon, lat, demand)
The problem statement:
where y_ij is binary (it is 1 if point i belongs to cluster j, and 0 otherwise), a_i is the position of point i, x̄_j is the centroid of cluster j, Q_j is the maximum load of cluster j, and q_i is the demand of each point.
I have used lpSolve in R for optimization problems, but I can't find a way to model this one. The main issue is x̄_j: how do I incorporate a variable like that in the objective function?
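One hedged workaround (a sketch, not the formulation as posed): the model is only nonlinear because x̄_j is itself a decision variable. If candidate centroids are fixed in advance (for example from kmeans()), the remaining capacitated assignment is a plain binary LP that lpSolve can handle. The cluster count k and the capacities Q below are assumptions.

library(lpSolve)

k <- 5                                                       # assumed number of clusters
cent <- kmeans(df[, c("lon", "lat")], centers = k)$centers   # fixed candidate centroids
Q <- rep(350, k)                                             # assumed capacity of each cluster
n <- nrow(df)

# cost of assigning point i to cluster j: squared distance to the fixed centroid
cost <- outer(seq_len(n), seq_len(k),
              Vectorize(function(i, j) sum((c(df$lon[i], df$lat[i]) - cent[j, ])^2)))

# decision variables y_ij stored column by column: y_11, ..., y_n1, y_12, ...
obj <- as.vector(cost)

# each point belongs to exactly one cluster: sum_j y_ij = 1
A_assign <- t(sapply(seq_len(n), function(i) {
  v <- matrix(0, n, k); v[i, ] <- 1; as.vector(v)
}))
# cluster capacity: sum_i q_i * y_ij <= Q_j
A_cap <- t(sapply(seq_len(k), function(j) {
  v <- matrix(0, n, k); v[, j] <- df$demand; as.vector(v)
}))

res <- lp(direction = "min", objective.in = obj,
          const.mat = rbind(A_assign, A_cap),
          const.dir = c(rep("=", n), rep("<=", k)),
          const.rhs = c(rep(1, n), Q),
          all.bin = TRUE)

cluster <- max.col(matrix(res$solution, n, k))   # cluster assigned to each point

Recovering the original problem with free centroids would require either iterating this assignment step with recomputed centroids (k-means style) or a solver that handles quadratic or nonlinear terms.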

Bandpass filter in R using fft

I have a time series z with sampling frequency fs = 12 (monthly data), and I would like to apply a bandpass filter using the fft at periods of 10 to 15 months. This is how I would proceed:
y <- as.data.frame(fft(z))
y$freq <- ..
y$y <- ifelse(y$freq >= 1/15 & y$freq <= 1/10, y$y, 0)
zz <- fft(y$y, inverse = TRUE)/length(z)
plot zz in the time domain...
However, I don't know how to derive the frequencies of the fft and I don't know how to plot zz in the time domain. Can someone help me?
I have a function that wraps fft() a bit:
function(y, samp.freq, ...){
  N <- length(y)
  fk <- fft(y)
  # keep the positive-frequency half of the spectrum (drop the DC component)
  fk <- fk[2:(length(fk)/2 + 1)]
  # scale to amplitudes
  fk <- 2*fk/N
  freq <- (1:length(fk)) * samp.freq/(2*length(fk))
  return(data.frame(fur = fk, freq = freq))
}
y is the vector of values of your signal, and samp.freq is its sampling frequency. The output is a data.frame with two columns: fur contains the complex numbers we get from the fast Fourier transform (Mod(fur) is the amplitude, Arg(fur) the phase) and freq is the vector of corresponding frequencies.
But for frequency filtering I highly recommend using the signal package.
For example using Butterworth filter:
library('signal')
bf <- butter(2, c(low, high), type = "pass")
signal.filtered <- filtfilt(bf, signal.noisy)
In this case the interval should be defined as c(Low.freq, High.freq) * (2/samp.freq), where Low.freq and High.freq are the borders of the frequency interval. More information can be found in the package documentation and the Octave reference guide.
Also, notice that with fft you can get only frequencies up to (sample frequency)/2.
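For example, applied to the asker's monthly series z (so samp.freq = 12 samples per year) with the pass band set to periods between 10 and 15 months, a sketch would be:

library(signal)

samp.freq <- 12                  # monthly data: 12 samples per year
Low.freq  <- 12 / 15             # 15-month period, in cycles per year
High.freq <- 12 / 10             # 10-month period, in cycles per year

bf <- butter(2, c(Low.freq, High.freq) * (2 / samp.freq), type = "pass")
z.filtered <- filtfilt(bf, z)    # z is the asker's time series

# the filtered series in the time domain
plot(as.numeric(z), type = "l", col = "grey", ylab = "z")
lines(z.filtered, col = "red")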

Simple DLNM in R

What I am trying to do is find the relative risk of mortality at the 10th, 50th and 90th percentiles of diurnal temperature range, and its additive effects at lags of 0, 1, 3 and 5 days. I'm doing this for a subset of months, May-Sept (the mortality data are subsetted below; the temperature data are already subsetted when read in). I have code that works below, but no matter what city and what lag I introduce, I get an RR of essentially 1.0, so I believe something is off or I am missing an argument somewhere. If anyone has more experience with these problems than I do, your help would be greatly appreciated.
library('dlnm')
library('splines')
mortdata <- read.table('STLmort.txt', sep="\t", header=T)
morts <- subset(mortdata, Month %in% 5:9)
deaths <- morts$AllMort
tempdata <- read.csv('STLRanges.csv',sep=',',header=T)
temp <- tempdata$Trange
HI <- tempdata$HIrange
#basis.var <- onebasis(1:5, knots=3)
#mklagbasis(maxlag=5, type="poly", degree=3)
basis.temp <- crossbasis(temp,vardegree=3,lag=5)
summary(basis.temp)
model <- glm (deaths ~ basis.temp, family=quasipoisson())
pred.temp <- crosspred(basis.temp, model, at=quantile(temp,c(.10,.50,.90),na.rm=TRUE) , cumul=T)
plot(pred.temp, "slices", var=c(quantile(temp, c(.10, .50, .90),na.rm=TRUE)) ,lag=c(0,1,5))
The problem is that you did not include any time variables to control for the long-term and seasonal trends in the time series when using DLNM.
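A hedged sketch of that suggestion (not a validated model): add a smooth function of time and a day-of-week term to the regression. The column name morts$Date, the assumption of roughly 10 years of data, and the spline df are mine; adjust them to the actual data.

library(splines)

# running time index over the May-September records and a day-of-week factor
time <- seq_along(deaths)
dow  <- factor(weekdays(as.Date(morts$Date)))   # assumes morts has a Date column

# natural spline of time to absorb long-term and seasonal trends;
# a few df per year of data is a common starting point (10 years assumed here)
model2 <- glm(deaths ~ basis.temp + ns(time, df = 4 * 10) + dow,
              family = quasipoisson())

pred.temp2 <- crosspred(basis.temp, model2,
                        at = quantile(temp, c(.10, .50, .90), na.rm = TRUE),
                        cumul = TRUE)

With the trend and day-of-week terms in place, the relative risks from crosspred() should no longer collapse to essentially 1.0 purely because the confounding seasonal signal is absorbing the temperature effect.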
