Applying an lm function to different ranges of data and separate groups using data.table

How do I perform a linear regression using different intervals for data in different groups in a data.table?
I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.
I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.
I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot and day combination using a different interval of counts during each period. Period 1 should regress CO2 during counts 1-5, period 2 using 1-7 and period 3 using 1-9.
CO2 <- rep((runif(10, 350,359)), 300) # 10 days, 10 plots, 3 fences
count <- rep((1:10), 300) # 10 days, 10 plots, 3 fences
DOY <-rep(rep(152:161, each=10),30) # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each=1000) # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each=100),3) # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)
I expect an output which gives me the R2 fit and slope of the line for each plot, fence and DOY.
The data I have provided is a small subsample, my real data has 1*10^6 rows. The following works, but is slow:
model <- function(df) {
  lm(CO2 ~ count,
     data = subset(df, ifelse(df$period == 1, count > 1 & count < 5,
                       ifelse(df$period == 2, count > 1 & count < 7,
                                              count > 1 & count < 9))))
}
model_flux <- dlply(flux, .(fence, plot, DOY), model)
rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")

Here is a "data.table" way to do this:
library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[,include:=(period==1 & count %in% 2:4) |
(period==2 & count %in% 2:6) |
(period==3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)
model <- function(df) {
  fit <- lm(CO2 ~ count, data = df)
  return(list(intercept = coef(fit)[1],
              slope = coef(fit)[2],
              rsquare = summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]
Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that your ldply(...) step returns a data frame while this code produces a data.table. It isn't much faster on this test dataset.
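If you want to see how much this buys you on your full data, a minimal timing sketch could look like the following. It assumes plyr and data.table are loaded and that flux, flux.subset and the model() function above are in the workspace; the plyr version of the model is renamed model_plyr here only to avoid the name clash.
# Hypothetical timing comparison of the plyr and data.table pipelines
model_plyr <- function(df) {
  lm(CO2 ~ count,
     data = subset(df, ifelse(period == 1, count > 1 & count < 5,
                       ifelse(period == 2, count > 1 & count < 7,
                                           count > 1 & count < 9))))
}
system.time(dlply(as.data.frame(flux), .(fence, plot, DOY), model_plyr))  # plyr
system.time(flux.subset[, model(.SD), by = "fence,plot,DOY"])             # data.table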

Related

Creating a for-loop to store LDA misclassification rates

I have a dataset of 104 samples (2 classes) and 182 variables. I am to carry out LDA on the dataset. My strategy involves first carrying out PCA in order to reduce dimensionality; this leaves me with 104 PCs. Now, what I want to do is carry out LDA on the PCs. I want to carry it out first where the number of PCs equal to 1, and store the misclassification rates into a data frame object. I will then do the same for 2, 3 and so on until ~50 PCs; the number is not important. I have created a for-loop to try solve this but I end up with a data frame where the only row is the final value I choose for my PCs. Here is the code I have so far:
# required packages
library(MASS)
library(class)
library(tidyverse)
# reading in and cleaning data
og_data <- read.csv("data.csv")
og_data <- og_data[, -1]
og_data$tumour <- unclass(as.factor(og_data$tumour))
# standardizing
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
k=10 # this is for the for-loop
grouping <- c(rep(1, 62), rep(2, 42)) # a vector denoting the true class of the samples
pca <- prcomp(st_data[, -1])
df_misclassification <- tibble(i=as.numeric(),
misclassification_rate_1=as.numeric(),
misclassification_rate_2=as.numeric())
for (i in k){
a <- as.data.frame(pca$x[, 1:i])
b <- lda(a, grouping=grouping, CV=TRUE)
c <- table(list(predicted=b$class, observed=grouping)) # confusion matrix
d <- t(as.data.frame(diag(c) / rowSums(c))) # misclassification rate for each class
df_misclassification <- df_misclassification %>%
add_row(i=i,
misclassification_rate_1=d[, 1],
misclassification_rate_2=d[, 2])
}
Running the above for k=10 leaves me with the following data frame:
# A tibble: 1 x 3
i misclassification_rate_1 misclassification_rate_2
<dbl> <dbl> <dbl>
1 10 0.952 0.951
I would like the table to have 10 rows, one for each number of PCs used. There is some overwriting in the for-loop but I have no idea how to fix this. Any help would be much appreciated. Thank you.
My for-loop was wrong. It should have been for (i in 1:k).
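For completeness, a sketch of the corrected loop, assuming the objects from the question (pca, grouping, df_misclassification, k) and the loaded packages are already in the workspace; seq_len(k) is a slightly safer equivalent of 1:k.
for (i in seq_len(k)) {                      # was: for (i in k), which runs the body once with i = k
  a <- as.data.frame(pca$x[, 1:i, drop = FALSE])
  b <- lda(a, grouping = grouping, CV = TRUE)
  c <- table(list(predicted = b$class, observed = grouping))  # confusion matrix
  # note: diag(c) / rowSums(c) is the proportion *correctly* classified per class;
  # subtract it from 1 if you want the misclassification rate itself
  d <- t(as.data.frame(diag(c) / rowSums(c)))
  df_misclassification <- df_misclassification %>%
    add_row(i = i,
            misclassification_rate_1 = d[, 1],
            misclassification_rate_2 = d[, 2])
}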

How can I effectivize my script for correcting a logger's seasonal drift in R?

I have installed a bunch of CO2 loggers in water that log CO2 every hour for the open water season. I have characterized the loggers at 3 different concentrations of CO2 before and after installing them.
I assume that the seasonal drift in error will be linear
I assume that the error between my characterization points will be linear
My script is based on a for loop that goes through each timestamp and corrects the value. This works, but is unfortunately not fast enough. I know this can be done within a second, but I am not sure how; I would be grateful if someone could show me.
Reproducible example using base R:
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#generate dummy dataframe
dummy <- data.frame(dt,co2)
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
diff.pre <- measured.pre-actual#diff at precharacterization
diff.post <- measured.post-actual#diff at post
#linear interpolation of how deviance from actual values change throughout the season
#I assume that the temporal drift is linear
diff.0 <- seq(diff.pre[1],diff.post[1],length.out=length(dummy$dt))
diff.400 <- seq(diff.pre[2],diff.post[2],length.out = length(dummy$dt))
diff.1000 <- seq(diff.pre[3],diff.post[3],length.out = length(dummy$dt))
#creates a data frame with the assumed drift at each increment throughout the season
dummy <- data.frame(dummy,diff.0,diff.400,diff.1000)
#this loop makes a 3-point calibration at each day in the dummy data set
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
print(paste0("row: ",i))#to show the progress of the loop
diff.0 <- dummy$diff.0[i]#get the differences at characterization increments
diff.400 <- dummy$diff.400[i]
diff.1000 <- dummy$diff.1000[i]
#values below are only used for encompassing the range of measured values in the characterization
#this is based on the interpolated difference at the given time point and the known concentrations used
measured.0 <- diff.0+0
measured.400 <- diff.400+400
measured.1000 <- diff.1000+1000
#linear difference between calibration at 0 and 400
seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
#linear difference between calibration at 400 and 1000
seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
#bind them together to get one vector
correction.ppm <- c(seg1,seg2)
#the complete range of measured co2 in the characterization.
#in reality it can not be below 0 and thus it can not be below the minimum measured in the range
measured.co2.range <- round(seq(measured.0,measured.1000,length.out=length(correction.ppm)))
#generate a table from which we can characterize the measured values from
correction.table <- data.frame(measured.co2.range,correction.ppm)
co2 <- dummy$co2[i] #measured co2 at the current row
#find the measured value in the table and extract the difference
diff <- correction.table$correction.ppm[match(co2,correction.table$measured.co2.range)]
#correct the value and save it to vector
co2.corrected[i] <- co2-diff
}
#generate column with calibrated values
dummy$co2.corrected <- co2.corrected
This is what I understand after reviewing the code. You have a series of CO2 concentration readings, but they need to be corrected based on characterization measurements taken at the beginning and at the end of the time series. Both sets of characterization measurements were made at three known concentrations: 0, 400, and 1000.
Your code appears to be applying bilinear interpolation (over time and concentration) to compute the needed correction. This is easy to vectorize:
set.seed(1)
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
# interpolate the reference concentrations over time
cref <- mapply(seq, measured.pre, measured.post, length.out = length(dt))
#generate dummy dataframe with corrected values
dummy <- data.frame(
dt,
co2,
co2.corrected = ifelse(
co2 < cref[,2],
actual[1] + (co2 - cref[,1])*(actual[2] - actual[1])/(cref[,2] - cref[,1]),
actual[2] + (co2 - cref[,2])*(actual[3] - actual[2])/(cref[,3] - cref[,2])
)
)
head(dummy)
#> dt co2 co2.corrected
#> 1 2022-08-01 00:00:00 537 416.1905
#> 2 2022-08-01 01:00:00 618 493.2432
#> 3 2022-08-01 02:00:00 516 395.9776
#> 4 2022-08-01 03:00:00 760 628.2707
#> 5 2022-08-01 04:00:00 633 507.2542
#> 6 2022-08-01 05:00:00 518 397.6533
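If you want to convince yourself that this reproduces your loop, a quick check might be the following, assuming both versions were run on the same simulated co2 values and that co2.corrected from your original loop is still in the workspace.
# small differences (and possibly a few NAs) come from the loop rounding its lookup table
summary(dummy$co2.corrected - co2.corrected)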
I do not know exactly what you are calculating (I suspect it could be done differently), but you can increase speed by:
removing the print() call, which takes a lot of time inside a loop
removing the data.frame creation in each iteration, which is slow and not needed here
This loop should be faster:
for(i in 1:nrow(dummy)){
diff.0 <- dummy$diff.0[i]
diff.400 <- dummy$diff.400[i]
diff.1000 <- dummy$diff.1000[i]
measured.0 <- diff.0+0
measured.400 <- diff.400+400
measured.1000 <- diff.1000+1000
seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
correction.ppm <- c(seg1,seg2)
s <- seq(measured.0,measured.1000,length.out=length(correction.ppm))
measured.co2.range <- round(s)
co2 <- dummy$co2[i]
diff <- correction.ppm[match(co2, measured.co2.range)]
co2.corrected[i] <- co2-diff
}
P.S. In my testing the slowest remaining part is now round(s); maybe that can be removed or rewritten...
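One way to avoid round() and the long seq() vectors entirely is to interpolate the correction directly at the single measured value with approx(). This is only a sketch under the same setup as above; note that rule = 2 clamps readings outside the characterized range instead of returning NA, which differs slightly from the match() behaviour.
co2.corrected <- numeric(nrow(dummy))
for (i in 1:nrow(dummy)) {
  co2.i <- dummy$co2[i]
  # piecewise-linear correction at this timestamp, anchored at the drift-adjusted
  # characterization points for 0, 400 and 1000 ppm
  diff.i <- approx(x = c(dummy$diff.0[i] + 0, dummy$diff.400[i] + 400, dummy$diff.1000[i] + 1000),
                   y = c(dummy$diff.0[i], dummy$diff.400[i], dummy$diff.1000[i]),
                   xout = co2.i, rule = 2)$y
  co2.corrected[i] <- co2.i - diff.i
}
dummy$co2.corrected <- co2.corrected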

Nest models dplyr calibration and validation

I have used R quite a bit, but I'm starting my journey in the tidyverse.
I'm trying to create a function that lets me bias-correct daily precipitation series.
I want to break the time series in two (for calibration and validation), fit the model on the calibration period, and then apply it to the validation period, keeping the result together with the observed and modelled data.
So far I have been able to do this with two for loops, but I was wondering whether it would be possible to do this in a "tidier" way with nest(); I can't figure it out.
Also, how could I use apply to run this over many precipitation time series in a data.frame?
My current code is below,
Thanks in advance!
Libraries
library(lubridate)
library(qmap)
library(dplyr)
Simulate data
obs_ <- runif(min=0,max=157,n=14975)
sim_ <- obs_ + 20
date_ <- seq(as.Date("1979-01-01"), as.Date("2019-12-31"),by="days")
db <- data.frame(obs=obs_, sim=sim_, date=date_, month=month(date_), year=year(date_))
Sample years
ss<- seq(from=1979, to=2019, by=1)
samp <- sample(ss, length(ss)/2)
samp <- samp[order(samp)]
samp1 <- subset(ss, !(ss %in% samp))
Model
list_mod <- list()
for(i in 1:12){
# retrieves the data for the calibration period
model_fit <-db %>%
mutate(id = case_when( year %in% samp ~ "cal",
year %in% samp1 ~ "val")) %>%
filter(month== i, id== "cal")
# fits the model to each month and stores it in a list
list_mod[[i]] <- fitQmap(model_fit$obs,model_fit$sim)
}
Retrieves the data for the validation period
model1 <- db %>%
mutate(id = case_when( year %in% samp ~ "cal",
year %in% samp1 ~ "val")) %>%
filter(id=="val")
Estimates the new data and stores it with the observations and simulations
for( i in 1:12){
temp__ <- model1[model1$month ==i,"sim"]
model1[model1$month ==i,"model"] <- doQmap(temp__, list_mod[[i]])
}
If you're not wedded to tidy, here is a solution using data.table.
Using your db:
library(data.table)
library(qmap)
##
setDT(db)[, Set:='cal']
db[sample(.N, .N/2), Set:='val']
db[, pred:=doQmap(sim, fitQmap(obs[Set=='cal'], sim[Set=='cal'])), by=.(month)]
result <- db[Set=='val']
The first line converts your db to a data.table and creates a column, Set, to define calibration/validation. The second line assigns a random 1/2 of the data to the validation set.
The third line does all the work: it groups the rows by month ( by=.(month) ), then generates fits with fitQmap(...) on the calibration set, and then generates debiased predictions using doQmap(...) on the full dataset.
The final line just filters out the calibration rows.
I notice in this example that Qmap reduces but does not eliminate bias. Is that what you expect?
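A quick way to quantify that remaining bias on the validation rows is sketched below, using the column names from the code above.
# mean bias per month before and after quantile mapping, validation set only
result[, .(bias_raw  = mean(sim - obs),
           bias_qmap = mean(pred - obs)), by = month]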

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
data.frame(
id = 1:nPerson,
BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
)
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 observations but hours for thousands of observations). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > error. I use xb = xbsmall+4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know to solve the problem of computation speed).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
# Determine the range of monthly ages to loop over for this person
AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
# Loop over the monthly ages for this person and determine the outcome
for(t in (AgeMonthMin+1):AgeMonthMax){
# Indicator for whether Outcome was 1 last month
panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
# xb = xbsmall + 4.47 if Outcome was 1 last month
# Otherwise, xb = xbsmall
panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
# Outcome == 1 if xb > 0
panel$Outcome[panel$id==i & panel$AgeMonths==t] =
ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
}
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computer time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::group_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
}
x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
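The same split-apply-combine idea can also be written with data.table grouping, which avoids the explicit split()/do.call(rbind, ...) step. This is only a sketch, reusing generate_outcome() and assuming rows are ordered by AgeMonths within each id (as they are when panel is built above).
library(data.table)
panelDT <- as.data.table(panel)
# run the per-person recursion group by group; .SD is that person's sub-data-frame
out2 <- panelDT[, generate_outcome(as.data.frame(.SD)), by = id]
The gain over the original code comes mainly from dropping the repeated panel$id==i & panel$AgeMonths==t lookups rather than from which grouping tool is used, so expect timings similar to the split()/lapply() version.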

R: Calculating Pearson correlation and R-squared by group

I am trying to extend the answer of a question R: filtering data and calculating correlation.
To obtain the correlation of temperature and humidity for each month of the year (1 = January), we would have to do the same for each month (12 times).
cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
Is there any way to do each month automatically?
In my case I have more than 30 groups (not months but species) to which I would like to test for correlations, I just wanted to know if there is a faster way than doing it one by one.
Thank you!
cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
gives you a 2 x 2 correlation matrix rather than a single number. I bet you want a single number for each Month, so use
## cor(Temp, Humidity | Month)
with(airquality, mapply(cor, split(Temp, Month), split(Humidity, Month)) )
and you will obtain a vector.
Have a read around ?split and ?mapply; they are very useful for "by group" operations, although they are not the only option. Also read around ?cor, and compare the difference between
a <- rnorm(10)
b <- rnorm(10)
cor(a, b)
cor(cbind(a, b))
The answer you linked in your question is doing something similar to cor(cbind(a, b)).
Reproducible example
The airquality dataset in R does not have Humidity column, so I will use Wind for testing:
## cor(Temp, Wind | Month)
x <- with(airquality, mapply(cor, split(Temp, Month), split(Wind, Month)) )
# 5 6 7 8 9
#-0.3732760 -0.1210353 -0.3052355 -0.5076146 -0.5704701
We get a named vector, where names(x) gives Month, and unname(x) gives correlation.
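split() + mapply() is not the only way to express this; the same per-month correlation can also be written by splitting the data frame itself, for example (a sketch, same Temp/Wind example):
# one correlation per Month, returned as a named vector
sapply(split(airquality, airquality$Month),
       function(d) cor(d$Temp, d$Wind))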
Thank you very much! It worked just perfectly! I was trying to figure out how to obtain a vector with the R^2 for each correlation too, but I can't... Any ideas?
cor(x, y) is like fitting a standardised linear regression model:
coef(lm(scale(y) ~ scale(x) - 1)) ## remember to drop intercept
The R-squared in this simple linear regression is just the square of the slope. Previously we had x storing the correlation per group, so the per-group R-squared is just x ^ 2.
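Continuing the Temp/Wind example, here is a quick check for a single month (June, picked arbitrarily) that the squared correlation matches the R-squared reported by lm():
m6 <- subset(airquality, Month == 6)
summary(lm(Temp ~ Wind, data = m6))$r.squared  # R-squared from an ordinary regression
unname(x["6"])^2                               # squared per-month correlation from above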
