How can I make my script for correcting a logger's seasonal drift in R more efficient?

I have installed a bunch of CO2 loggers in water that log CO2 every hour for the open water season. I have characterized the loggers at 3 different concentrations of CO2 before and after installing them.
I assume that the seasonal drift in the error is linear.
I assume that the error between my characterization points is linear.
My script is based on a for loop that goes through each timestamp and corrects the value. This works but is unfortunately not fast enough. I know that this can be done within a second, but I am not sure how. I seek some advice and would be grateful if someone could show me how.
Reproducible example based on base R:
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#generate dummy dataframe
dummy <- data.frame(dt,co2)
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
diff.pre <- measured.pre-actual#diff at precharacterization
diff.post <- measured.post-actual#diff at post
#linear interpolation of how deviance from actual values change throughout the season
#I assume that the temporal drift is linear
diff.0 <- seq(diff.pre[1],diff.post[1],length.out=length(dummy$dt))
diff.400 <- seq(diff.pre[2],diff.post[2],length.out = length(dummy$dt))
diff.1000 <- seq(diff.pre[3],diff.post[3],length.out = length(dummy$dt))
#creates a data frame with the assumed drift at each increment throughout the season
dummy <- data.frame(dummy,diff.0,diff.400,diff.1000)
#this loop makes a 3-point calibration at each time step in the dummy data set
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
  print(paste0("row: ",i)) #to show the progress of the loop
  diff.0 <- dummy$diff.0[i] #get the differences at the characterization increments
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  #values below are only used to encompass the range of measured values in the characterization
  #this is based on the interpolated difference at the given time point and the known concentrations used
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  #linear difference between the calibration at 0 and 400
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  #linear difference between the calibration at 400 and 1000
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  #bind them together to get one vector
  correction.ppm <- c(seg1,seg2)
  #the complete range of measured co2 in the characterization.
  #in reality it cannot be below 0 and thus it cannot be below the minimum measured in the range
  measured.co2.range <- round(seq(measured.0,measured.1000,length.out=length(correction.ppm)))
  #generate a table from which we can correct the measured values
  correction.table <- data.frame(measured.co2.range,correction.ppm)
  co2 <- dummy$co2[i] #measured co2 at the current row
  #find the measured value in the table and extract the difference
  diff <- correction.table$correction.ppm[match(co2,correction.table$measured.co2.range)]
  #correct the value and save it to the vector
  co2.corrected[i] <- co2-diff
}
#generate column with calibrated values
dummy$co2.corrected <- co2.corrected

This is what I understand after reviewing the code. You have a series of CO2 concentration readings that need to be corrected based on characterization measurements taken at the beginning and at the end of the time series. Both sets of characterization measurements were made using three known concentrations: 0, 400, and 1000.
Your code appears to apply bilinear interpolation (over time and concentration) to compute the needed correction. This is easy to vectorize:
set.seed(1)
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
# interpolate the reference concentrations over time
cref <- mapply(seq, measured.pre, measured.post, length.out = length(dt))
#generate dummy dataframe with corrected values
dummy <- data.frame(
  dt,
  co2,
  co2.corrected = ifelse(
    co2 < cref[,2],
    actual[1] + (co2 - cref[,1])*(actual[2] - actual[1])/(cref[,2] - cref[,1]),
    actual[2] + (co2 - cref[,2])*(actual[3] - actual[2])/(cref[,3] - cref[,2])
  )
)
head(dummy)
#>                    dt co2 co2.corrected
#> 1 2022-08-01 00:00:00 537      416.1905
#> 2 2022-08-01 01:00:00 618      493.2432
#> 3 2022-08-01 02:00:00 516      395.9776
#> 4 2022-08-01 03:00:00 760      628.2707
#> 5 2022-08-01 04:00:00 633      507.2542
#> 6 2022-08-01 05:00:00 518      397.6533

I do not know exactly what you are calculating (I feel this could be done differently), but you can increase the speed by:
removing the print() call, which takes a lot of time inside the loop
removing the data.frame creation in each iteration, which is slow and not needed here
This loop should be faster:
for(i in 1:nrow(dummy)){
  diff.0 <- dummy$diff.0[i]
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  correction.ppm <- c(seg1,seg2)
  s <- seq(measured.0,measured.1000,length.out=length(correction.ppm))
  measured.co2.range <- round(s)
  co2 <- dummy$co2[i]
  diff <- correction.ppm[match(co2, measured.co2.range)]
  co2.corrected[i] <- co2-diff
}
P.S. In my testing the slowest remaining part is round(s). Maybe that can be removed or rewritten...
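One way to remove both round() and match() entirely (a sketch, assuming the same dummy data frame with the diff.0/diff.400/diff.1000 columns built above) is to let approx() do the piecewise-linear lookup of the correction directly:
co2.corrected <- numeric(nrow(dummy))
for (i in 1:nrow(dummy)) {
  # corrections at the three characterization points at this time step
  diffs <- c(dummy$diff.0[i], dummy$diff.400[i], dummy$diff.1000[i])
  # the measured values those corrections correspond to
  measured <- c(0, 400, 1000) + diffs
  # interpolate the correction at the measured co2 value
  correction <- approx(x = measured, y = diffs, xout = dummy$co2[i], rule = 2)$y
  co2.corrected[i] <- dummy$co2[i] - correction
}
Note that rule = 2 clamps values falling outside the characterized range, where the original match() would return NA.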

Related

Creating a for-loop to store LDA misclassification rates

I have a dataset of 104 samples (2 classes) and 182 variables. I am to carry out LDA on the dataset. My strategy involves first carrying out PCA in order to reduce dimensionality; this leaves me with 104 PCs. Now, what I want to do is carry out LDA on the PCs. I want to carry it out first with the number of PCs equal to 1, and store the misclassification rates in a data frame object. I will then do the same for 2, 3 and so on up to ~50 PCs; the exact number is not important. I have created a for-loop to try to solve this, but I end up with a data frame whose only row is the final value I chose for my PCs. Here is the code I have so far:
# required packages
library(MASS)
library(class)
library(tidyverse)
# reading in and cleaning data
og_data <- read.csv("data.csv")
og_data <- og_data[, -1]
og_data$tumour <- unclass(as.factor(og_data$tumour))
# standardizing
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
k=10 # this is for the for-loop
grouping <- c(rep(1, 62), rep(2, 42)) # a vector denoting the true class of the samples
pca <- prcomp(st_data[, -1])
df_misclassification <- tibble(i=as.numeric(),
                               misclassification_rate_1=as.numeric(),
                               misclassification_rate_2=as.numeric())
for (i in k){
  a <- as.data.frame(pca$x[, 1:i])
  b <- lda(a, grouping=grouping, CV=TRUE)
  c <- table(list(predicted=b$class, observed=grouping)) # confusion matrix
  d <- t(as.data.frame(diag(c) / rowSums(c))) # misclassification rate for each class
  df_misclassification <- df_misclassification %>%
    add_row(i=i,
            misclassification_rate_1=d[, 1],
            misclassification_rate_2=d[, 2])
}
Running the above for k=10 leaves me with the following data frame:
# A tibble: 1 x 3
      i misclassification_rate_1 misclassification_rate_2
  <dbl>                    <dbl>                    <dbl>
1    10                    0.952                    0.951
I would like the table to have 10 rows, one for each number of PCs used. There is some overwriting in the for-loop but I have no idea how to fix this. Any help would be much appreciated. Thank you.
My for-loop was wrong. It should have been for (i in 1:k).
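To make the difference concrete (a minimal illustration, separate from the code above):
k <- 10
for (i in k) print(i)    # iterates once, over the single value 10
for (i in 1:k) print(i)  # iterates over 1, 2, ..., 10, i.e. one iteration per number of PCs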

Reconstruct seasonally (and non seasonally) differenced data in R

I've achieved stationary data for use in ARIMA forecasts using seasonal and non-seasonal differencing. Now how do I revert back to the original data using the differenced data?
library(forecast) # provides nsdiffs() and ndiffs()
raw <- read.csv("https://raw.githubusercontent.com/thistleknot/Python-Stock/master/data/combined_set.csv",row.names=1,header=TRUE)
temp <- raw$CSUSHPINSA
season <- 4 # quarterly data
#tells me to seasonally difference 1 time
print(nsdiffs(ts(temp,frequency=4)))
temp_1 <- temp-dplyr::lag(temp,1*season)
#tells me I need to difference it once more
print(ndiffs(na.omit(temp_1)))
temp_2 <- temp_1-dplyr::lag(temp_1,1)
#shows data is somewhat stationary
plot(temp_2)
#gives me back the original dataset if I only had seasonal differencing
na.omit(dplyr::lag(raw$CSUSHPINSA ,4)+temp_1)
#how to do this with temp_2?
Some references
Pandas reverse of diff()
Reverse Diff function in R
Nevermind, I got it
dplyr::lag(raw$CSUSHPINSA ,4) + dplyr::lag(temp_1,1)+temp_2
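As a quick sanity check (assuming temp, temp_1 and temp_2 as defined above), the reconstruction reproduces the original series wherever the lags are defined:
reconstructed <- dplyr::lag(temp, 4) + dplyr::lag(temp_1, 1) + temp_2
max(abs(reconstructed - temp), na.rm = TRUE) # effectively 0; the first few entries are NA from the lags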
More complete examples
temp <- raw$MSPUS
#print(nsdiffs(ts(temp,frequency=4)))
#temp_1 <- temp-dplyr::lag(temp,1*season)
temp_1 <- temp-dplyr::lag(temp,1)
print(ndiffs(na.omit(temp_1)))
temp_2 <- temp_1-dplyr::lag(temp_1,1)
#forecast values of temp_2
temp_3 <- dplyr::lag(temp_1,1)+temp_2
temp_4 = (dplyr::lag(raw$MSPUS ,1) + temp_3)
new_temp_2_values = c(8000,10000)
extended <- c(temp_4,tail(c(c(temp_3),tail(temp_4,1)+cumsum(tail(temp_3,1)+cumsum(new_temp_2_values))),length(new_temp_2_values)))
print(extended)
Wrote a more involved version here
https://gist.github.com/thistleknot/eeaf1631f736e20806c37107f344d50e

Using Kalman smoothing in R's KFAS package to impute missing data

I have a data frame (reproducible example at the bottom) containing a column of values representing precipitation volume, a column of date-of-measurement values, and a column each for lat, lon, and elevation coordinates. The data covers 10 years of measurement, and 10 different lat/long/elev points (levels which I will call "stations").
The precipitation column is MCAR missing 3.4% of its values. My goal is to impute the missing values, taking into account both the temporal correlation (the NA's position within its station's time series) and the spatial correlation (the NA's geographic relationship to the rest of the points.)
I do not think typical ARIMA-based techniques, such as those found in Amelia or ImputeTS, will suffice, because they are limited to univariate data.
I am interested in using the KFAS package because I believe it will allow me to treat these different "stations" as "states" within the "state space", and enable me to use Kalman smoothing to "predict" the missing values based on the correlation of the both spatial and temporal variables.
My trouble is that I am having a VERY hard time getting over KFAS' learning curve and implementing this model. The documentation is sparse and there are next to no tutorials or beginner focused material out there. I'm feeling like I don't even know how to get started.
Can KFAS be used this way? How would you approach this challenge? What would the basic steps look like in KFAS?
Since I barely know how to frame this question, I have made an effort to make good reproducible data. This sample data covers three "stations" over 1 month, which I'm thinking should be sufficient for demonstration. The values are realistic but not accurate.
#defining the precip variable
set.seed(76)
precip <- sample(0:7, 30, replace=TRUE)
#defining the categorical variables
lon1 <- (-123.7500)
lon2 <- (-124.1197)
lon3 <- (-124.0961)
lat1 <- (43.9956)
lat2 <- (44.0069)
lat3 <- (44.0272)
elev1 <- 76.2
elev2 <- 115.8
elev3 <- 3.7
date1 <- seq(as.Date('2011-01-01'), as.Date('2011-01-10'),by=1)
date2 <- seq(as.Date('2011-01-11'), as.Date('2011-01-20'),by=1)
date3 <- seq(as.Date('2011-01-21'), as.Date('2011-01-30'),by=1)
#creating the df
reprex.data <- NULL
reprex.data$precip <- precip
#inserting NA's randomly into the precip vector now to easily avoid doing it to the other variables
reprex.data <- as.data.frame(lapply(reprex.data, function(cc) cc[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)]))
#creating the rest of the df
reprex.data$lon[1:10] <- lon1
reprex.data$lon[11:20] <- lon2
reprex.data$lon[21:30] <- lon3
reprex.data$lat[1:10] <- lat1
reprex.data$lat[11:20] <- lat2
reprex.data$lat[21:30] <- lat3
reprex.data$elev[1:10] <- elev1
reprex.data$elev[11:20] <- elev2
reprex.data$elev[21:30] <- elev3
reprex.data$date[1:10] <- date1
reprex.data$date[11:20] <- date2
reprex.data$date[21:30] <- date3
#voila
head(reprex.data)
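For reference, a minimal sketch of what the basic KFAS steps could look like for this reprex. It treats the three stations as a 3-column multivariate series with independent local-level components and fixed, arbitrary variances, and it ignores the lat/lon/elev covariates, so it is only a starting point rather than a full spatio-temporal model:
library(KFAS)
# rows = days, columns = stations (rows 1-10, 11-20, 21-30 of the reprex)
y <- matrix(reprex.data$precip, ncol = 3)
# local-level (random walk) component per station; Q and H are fixed here purely
# for illustration, in practice they would be estimated (e.g. via fitSSM)
model <- SSModel(y ~ SSMtrend(degree = 1, Q = list(diag(0.5, 3))), H = diag(1, 3))
out <- KFS(model, smoothing = "mean")
# smoothed means of the observations, including values at the NA positions
imputed <- out$muhat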

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
  data.frame(
    id = 1:nPerson,
    BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
    BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
  )
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 observations but hours for thousands of observations). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > error. I use xb = xbsmall+4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know to solve the problem of computation speed).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for(t in (AgeMonthMin+1):AgeMonthMax){
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
    # xb = xbsmall + 4.47 if Outcome was 1 last month
    # Otherwise, xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > 0
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
  }
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computer time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::gather_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
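To reproduce the timing comparison on your own machine (assuming the panel data frame from the setup above):
system.time({
  out <- lapply(split(panel, panel$id), generate_outcome)
  out <- do.call(rbind, out)
})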

Applying an lm function to different ranges of data and separate groups using data.table

How do I perform a linear regression using different intervals for data in different groups in a data.table?
I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.
I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.
I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot and day combination using a different interval of counts during each period. Period 1 should regress CO2 during counts 1-5, period 2 using 1-7 and period 3 using 1-9.
CO2 <- rep((runif(10, 350,359)), 300) # 10 days, 10 plots, 3 fences
count <- rep((1:10), 300) # 10 days, 10 plots, 3 fences
DOY <-rep(rep(152:161, each=10),30) # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each=1000) # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each=100),3) # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)
I expect an output which gives me the R2 fit and slope of the line for each plot, fence and DOY.
The data I have provided is a small subsample; my real data has 1*10^6 rows. The following works, but is slow:
model <- function(df) {
  lm(CO2 ~ count, data = subset(df, ifelse(df$period == 1, count>1 & count<5,
                                    ifelse(df$period == 2, count>1 & count<7,
                                                           count>1 & count<9))))
}
model_flux <- dlply(flux, .(fence, plot, DOY), model)
rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")
Here is a "data.table" way to do this:
library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[,include:=(period==1 & count %in% 2:4) |
               (period==2 & count %in% 2:6) |
               (period==3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)
model <- function(df) {
  fit <- lm(CO2 ~ count, data = df)
  return(list(intercept=coef(fit)[1],
              slope=coef(fit)[2],
              rsquare=summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]
Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that your ldply(...) step returns a data frame while this code produces a data.table. It isn't much faster on this test dataset.
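For a rough timing of the data.table approach on this sample (the absolute numbers will of course differ on the real 1*10^6-row data):
system.time(coefs_flux <- flux.subset[, model(.SD), by = "fence,plot,DOY"])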
