R - overestimation by the predict function in a sinusoidal linear model
Background
I have a data file containing sea levels near the Dutch Storm Barrier over time, for various days. The goal is to fit a linear model that describes this evolution of the sea level and, given a certain time frame, to make a 5-minute-ahead prediction of the sea level (forecasting).
Approach
Given a certain day (chosen beforehand), I choose a time frame on which to fit/train the linear model. After some technical adjustments (see below for the actual code), I fit the model. The linear model object and the 5-minute time range are then passed to predict() for a prediction, and the forecast, together with a 95% prediction interval, is graphed just after the fitted window, all in one plot (see below for an example plot).
Problem
The forecast of the model always over- or under-predicts hugely: in terms of magnitude, the forecast is a factor of about 10^10 (or, equivalently, 1e+10) off. At the same time, R^2 and adjusted R^2 are quite high (0.972 and 0.9334, respectively) and the model diagnostics (leverages, fitted vs. residuals, normal Q-Q) look reasonably good. Hence my problem/question is: how can a model that fits the data so well predict/forecast so badly? My only explanation is a mistake in the code, but I can't spot it.
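For intuition, here is a toy sketch (synthetic data, not the sea-level series) of why a high in-sample R^2 says nothing about extrapolation: a low-order polynomial trend fitted to a short window can leave the range of the data immediately outside it.
# Toy sketch: cubic trend fitted to 20 points, then extrapolated
set.seed(1)
t_fit <- 1:20
y_fit <- sin(t_fit / 5) + rnorm(20, sd = 0.05)
toyFit <- lm(y_fit ~ poly(t_fit, 3))
summary(toyFit)$r.squared                            # high in-sample R^2
predict(toyFit, newdata = data.frame(t_fit = 21:50)) # wanders far from sin(t/5)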
The data set
More specifically, the dataset is a data frame which consists (apart from an index column) of 3 columns: 'date' ("yyyy-mm-dd" format), 'time' ("hh:mm:ss" format) and 'water' (an integer between approximately -150 and 350: the sea level in cm). Here is a small slice of the data which already gives rise to the problem described above:
> SeaLvlAug30[fitRngAug, ]
date time water
1574161 2010-08-30 04:40:00 253
1574162 2010-08-30 04:40:10 254
1574163 2010-08-30 04:40:20 253
1574164 2010-08-30 04:40:30 250
1574165 2010-08-30 04:40:40 250
1574166 2010-08-30 04:40:50 252
1574167 2010-08-30 04:41:00 250
1574168 2010-08-30 04:41:10 247
1574169 2010-08-30 04:41:20 246
1574170 2010-08-30 04:41:30 245
1574171 2010-08-30 04:41:40 242
1574172 2010-08-30 04:41:50 241
1574173 2010-08-30 04:42:00 242
1574174 2010-08-30 04:42:10 244
1574175 2010-08-30 04:42:20 245
1574176 2010-08-30 04:42:30 247
1574177 2010-08-30 04:42:40 247
1574178 2010-08-30 04:42:50 249
1574179 2010-08-30 04:43:00 250
1574180 2010-08-30 04:43:10 250
Minimal runnable R code
# Construct a time frame of a day with steps of 10 seconds
SeaLvlDayTm <- c(1:8640)*10
# Construct the desired fit range and prediction range
ftRng <- 1:20
prdtRng <- 21:50
# Construct the desired columns for the data frame
date <- rep("2010-08-30", length(c(ftRng, prdtRng)))
time <- c("04:40:00", "04:40:10", "04:40:20", "04:40:30", "04:40:40", "04:40:50", "04:41:00",
"04:41:10", "04:41:20", "04:41:30", "04:41:40", "04:41:50", "04:42:00", "04:42:10",
"04:42:20", "04:42:30", "04:42:40", "04:42:50", "04:43:00", "04:43:10", "04:43:20",
"04:43:30", "04:43:40", "04:43:50", "04:44:00", "04:44:10", "04:44:20", "04:44:30",
"04:44:40", "04:44:50", "04:45:00", "04:45:10", "04:45:20", "04:45:30", "04:45:40",
"04:45:50", "04:46:00", "04:46:10", "04:46:20", "04:46:30", "04:46:40", "04:46:50",
"04:47:00", "04:47:10", "04:47:20", "04:47:30", "04:47:40", "04:47:50", "04:48:00",
"04:48:10")
timeSec <- c(1681:1730)*10
water <- c(253, 254, 253, 250, 250, 252, 250, 247, 246, 245, 242, 241, 242, 244, 245, 247,
247, 249, 250, 250, 249, 249, 250, 249, 246, 246, 248, 248, 245, 247, 251, 250,
251, 255, 256, 256, 257, 259, 257, 256, 260, 260, 257, 260, 261, 258, 256, 256,
258, 258)
# Construct the data frame
SeaLvlAugStp2 <- data.frame(date, time, timeSec, water)
# Change the index set of the data frame to correspond that of a year
rownames(SeaLvlAugStp2) <- c(1574161:1574210)
# Use a separate variable for the time (because of a weird error)
SeaLvlAugFtTm <- SeaLvlAugStp2$timeSec[ftRng]
# Fit the linear model
lmObjAug <- lm(SeaLvlAugStp2$water[ftRng] ~ sin((2*pi)/44700 * SeaLvlAugFtTm)
+ cos((2*pi)/44700 * SeaLvlAugFtTm) + poly(SeaLvlAugFtTm, 3)
+ cos((2*pi)/545 * SeaLvlAugFtTm) + sin((2*pi)/545 * SeaLvlAugFtTm)
+ cos((2*pi)/205 * SeaLvlAugFtTm) + sin((2*pi)/205 * SeaLvlAugFtTm)
+ cos((2*pi)/85 * SeaLvlAugFtTm) + sin((2*pi)/85 * SeaLvlAugFtTm),
               data = SeaLvlAugStp2[ftRng, ])
# Get information about the linear model fit
summary(lmObjAug)
plot(lmObjAug)
# Compute the time ranges for the fit and the prediction
prdtRngTm <- timeSec[prdtRng]
ftRngTm <- timeSec[ftRng]
# Compute the prediction/forecast from the fitted linear model
prdtAug <- predict(lmObjAug, newdata=data.frame(SeaLvlAugFtTm = prdtRngTm), interval="prediction", level=0.95)
# Extract the lower and upper bounds of the prediction interval
lwrAug <- prdtAug[, 2]
uprAug <- prdtAug[, 3]
# Calculate the minimum and maximum of the y axis for the plot
yminAug <- min(SeaLvlAugStp2$water[ftRng], SeaLvlAugStp2$water[prdtRng], lwrAug)
ymaxAug <- max(SeaLvlAugStp2$water[ftRng], SeaLvlAugStp2$water[prdtRng], uprAug)
# Make the plot
plot((timeSec/10)[ftRng], SeaLvlAugStp2$water[ftRng], xlim = c(min(timeSec/10), max(prdtRngTm/10)), ylim = c(yminAug, ymaxAug), col = 'green', pch = 19, main = "Sea Level high water & prediction August 30", xlab = "Time (steps of 10 s)", ylab = "Sea Level (cm)")
polygon(c(sort(prdtRngTm/10), rev(sort(prdtRngTm/10))), c(uprAug, rev(lwrAug)), col = "gray", border = "gray")
points(prdtRngTm/10, SeaLvlAugStp2$water[prdtRng], col = 'green', pch = 19)
lines(ftRngTm/10, fitted(lmObjAug), col = 'blue', lwd = 2)
lines(prdtRngTm/10, prdtAug[, 1], col = 'blue', lwd = 2)
legend("topleft", legend = c("Observ.", "Predicted", "Conf. Int."), lwd = 2, col=c("green", "blue", "gray"), lty = c(NA, 1, 1), pch = c(19, NA, NA))
Example plot
[figure: "Sea Level high water & prediction August 30" - observations (green), fit and forecast (blue), interval band (gray)]
Until you post a question with code that we can run, we won't be able to help you more, but in the meantime here is a quick and dirty forecast using Rob J Hyndman's forecast package:
string_data <- "row date time water
1574161 2010-08-30 04:40:00 253
1574162 2010-08-30 04:40:10 254
1574163 2010-08-30 04:40:20 253
1574164 2010-08-30 04:40:30 250
1574165 2010-08-30 04:40:40 250
1574166 2010-08-30 04:40:50 252
1574167 2010-08-30 04:41:00 250
1574168 2010-08-30 04:41:10 247
1574169 2010-08-30 04:41:20 246
1574170 2010-08-30 04:41:30 245
1574171 2010-08-30 04:41:40 242
1574172 2010-08-30 04:41:50 241
1574173 2010-08-30 04:42:00 242
1574174 2010-08-30 04:42:10 244
1574175 2010-08-30 04:42:20 245
1574176 2010-08-30 04:42:30 247
1574177 2010-08-30 04:42:40 247
1574178 2010-08-30 04:42:50 249
1574179 2010-08-30 04:43:00 250
1574180 2010-08-30 04:43:10 250"
SeaLvlAug30 <- read.table(textConnection(string_data), header=TRUE)
library(forecast)
fit <- auto.arima(SeaLvlAug30$water)
summary(fit)
preds <- forecast(fit, h = 25)
preds
# Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
# 21 249.7563 247.7314 251.7812 246.6595 252.8531
# 22 249.4394 246.1177 252.7611 244.3593 254.5195
# 23 249.1388 244.9831 253.2945 242.7832 255.4944
# 24 248.8930 244.2626 253.5234 241.8114 255.9746
# 25 248.7110 243.8397 253.5822 241.2610 256.1609
# 26 248.5867 243.6073 253.5661 240.9713 256.2021
# 27 248.5085 243.4867 253.5302 240.8284 256.1885
# 28 248.4636 243.4280 253.4991 240.7624 256.1647
# 29 248.4410 243.4020 253.4800 240.7345 256.1475
# 30 248.4322 243.3927 253.4718 240.7249 256.1396
# 31 248.4311 243.3916 253.4707 240.7238 256.1385
# 32 248.4337 243.3941 253.4733 240.7263 256.1411
# 33 248.4376 243.3979 253.4773 240.7300 256.1452
# 34 248.4414 243.4016 253.4812 240.7337 256.1492
# 35 248.4447 243.4048 253.4845 240.7368 256.1525
# 36 248.4471 243.4072 253.4870 240.7392 256.1550
# 37 248.4488 243.4089 253.4887 240.7409 256.1567
# 38 248.4499 243.4100 253.4898 240.7420 256.1578
# 39 248.4505 243.4106 253.4905 240.7426 256.1585
# 40 248.4509 243.4109 253.4908 240.7429 256.1588
# 41 248.4510 243.4111 253.4910 240.7431 256.1589
# 42 248.4510 243.4111 253.4910 240.7431 256.1590
# 43 248.4510 243.4111 253.4910 240.7431 256.1590
# 44 248.4510 243.4110 253.4909 240.7430 256.1589
# 45 248.4509 243.4110 253.4909 240.7430 256.1589
plot(preds)
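To sanity-check this (a quick sketch, assuming the full 50-value water vector from the question's minimal code is available), the held-out observations can be overlaid on the forecast plot:
# Values 21:50 of the question's `water` vector, i.e. the hold-out period
water_holdout <- c(249, 249, 250, 249, 246, 246, 248, 248, 245, 247, 251, 250,
                   251, 255, 256, 256, 257, 259, 257, 256, 260, 260, 257, 260,
                   261, 258, 256, 256, 258, 258)
plot(preds)
points(21:45, water_holdout[1:25], col = "red", pch = 19)  # h = 25 steps ahead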
Related
Optimize a function using gradient descent
Growing degree days (GDD) is a concept in plant phenology where a given crop needs to accumulate a certain amount of thermal units every day in order to move from one stage to the next. I have thermal-unit data available at daily resolution for a given site for 10 years, as follows:

set.seed(1)
avg_temp <- data.frame(year_ref = rep(2001:2010, each = 365),
                       doy = rep(1:365, times = 10),
                       thermal.units = sample(0:40, 3650, replace = TRUE))

I also have a crop grown at this site that should take 110 days to mature if planted on day 152:

planting_date <- 152
observed_days_to_mature <- 110

I also have an initial random guess at how many thermal units this crop might accumulate in each stage, from planting to full maturity. For example, below, stage 1 needs to accumulate 50 thermal units since planting, stage 2 needs 120, stage 3 needs 190, and so on:

gdd_data <- data.frame(stage_id = 1:4, gdd_required = c(50, 120, 190, 250))

So, given the GDD requirement, I can calculate for each year how many days this crop takes to mature:

library(dplyr)
library(data.table)

days_to_mature_func <- function(gdd_data_df, avg_temp_df, planting_date_d){
  gdd.vec <- gdd_data_df$gdd_required
  year_vec <- sort(unique(avg_temp_df$year_ref))
  temp_ls <- list()
  for(y in seq_along(year_vec)){
    year_id <- year_vec[y]
    weather_sub <- avg_temp_df %>%
      dplyr::filter(year_ref == year_id & doy >= planting_date_d)
    stage_vec <- unlist(lapply(1:length(gdd.vec), function(x)
      planting_date_d - 1 + which.max(cumsum(weather_sub$thermal.units) >= gdd.vec[x])))
    stage_vec[length(stage_vec)] <- ifelse(stage_vec[length(stage_vec)] <= stage_vec[length(stage_vec) - 1],
                                           NA, stage_vec[length(stage_vec)])
    gdd_doy <- as.data.frame(t(as.data.frame(stage_vec)))
    names(gdd_doy) <- paste0('stage_doy', 1:length(stage_vec))
    gdd_doy$year_ref <- year_id
    temp_ls[[y]] <- gdd_doy
  }
  days_to_mature_mod <- rbindlist(temp_ls)
  return(days_to_mature_mod)
}

days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
days_to_mature_mod
    stage_doy1 stage_doy2 stage_doy3 stage_doy4 year_ref
 1:        154        160        164        167     2001
 2:        154        157        159        163     2002
 3:        154        157        160        162     2003
 4:        155        157        163        165     2004
 5:        154        156        160        164     2005
 6:        154        161        164        168     2006
 7:        154        156        159        161     2007
 8:        155        158        161        164     2008
 9:        154        156        160        163     2009
10:        154        158        160        163     2010

Since the crop should take 110 days to mature, I define the error as:

error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2

My question is how to optimise gdd_required in gdd_data so as to minimise this error. One method I have implemented is to loop over a sequence of factors that scale gdd_required in each step and to calculate the error; the factor with the lowest error is the final factor applied to the gdd_required data. I am reading about the gradient descent algorithm, which might make this process quicker, but unfortunately I don't yet have enough technical expertise to implement it.

From a comment: I do have a condition that wasn't explicit - the x values in the function I am optimising are ordered, i.e. x[1] < x[2] < x[3] < x[4], since they are cumulative.
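For concreteness, the factor search described above could look like this (my reconstruction, hypothetical code, since the poster did not show it):
# Grid search over a scaling factor applied to the initial gdd_required guess
factors <- seq(0.5, 1.5, by = 0.05)
errors <- sapply(factors, function(f){
  gdd_try <- data.frame(stage_id = 1:4, gdd_required = f * c(50, 120, 190, 250))
  mod <- days_to_mature_func(gdd_try, avg_temp, planting_date)
  mean(mod$stage_doy4 - observed_days_to_mature)^2
})
best_factor <- factors[which.min(errors)]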
Building on your example, you can define a function that takes an arbitrary gdd_required vector and returns the fit:

optim_function <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = x)
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}

The function optim allows you to find the parameters that reach a minimum, starting from the initial set you used, e.g.

optim(c(50, 120, 190, 250), optim_function)
#$par
#[1] 266.35738 199.59795 -28.35870  30.21135
#
#$value
#[1] 1866.24
#
#$counts
#function gradient
#      91       NA
#
#$convergence
#[1] 0
#
#$message
#NULL

So a best fit of around 1866 is found with parameters 266.35738, 199.59795, -28.35870, 30.21135. The help page gives some pointers on doing constrained optimisation if it is important that the parameters stay in a specific range.

Given your comment that the parameters should be strictly increasing, you can transform arbitrary values into increasing ones with cumsum(exp()), so your code would become:

optim_function_plus <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = cumsum(exp(x)))
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}

opt <- optim(log(c(50, 70, 70, 60)), optim_function_plus)
opt
# $par
# [1] 1.578174 2.057647 2.392850 3.241456
#
# $value
# [1] 1953.64
#
# $counts
# function gradient
#       57       NA
#
# $convergence
# [1] 0
#
# $message
# NULL

To get the parameters back on the scale you're interested in, you'd need to do:

cumsum(exp(opt$par))
# [1]  4.846097 12.673626 23.618263 49.189184
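Two further options, as sketches rather than tested solutions: box constraints via method = "L-BFGS-B" (which bounds the range but does not enforce the ordering, and can stall here because the error surface is step-like), or a simple rejection penalty inside the objective that keeps the default Nelder-Mead search away from non-increasing candidates:
# Sketch: reject candidates that violate x[1] < x[2] < x[3] < x[4]
optim_function_pen <- function(x){
  if (is.unsorted(x, strictly = TRUE)) return(1e10)  # large penalty
  optim_function(x)
}
optim(c(50, 120, 190, 250), optim_function_pen)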
Confidence interval for each estimate of a multiple regression with a factor
I want to get the confidence interval (p < 0.05) for each of the estimates of a multiple regression with factors. Here is an example:

# Create data
df1 <- cbind.data.frame(region = rep(c("North", "South"), 6),
                        height = c(30, 35, 28, 31, 29, 32, 25, 27, 23, 26, 28, 29),
                        calories = c(300, 390, 282, 310, 215, 320, 252, 271, 440, 235, 235, 230))
> head(df1)
  region height calories
1  North     30      300
2  South     35      390
3  North     28      282
4  South     31      310
5  North     29      215
6  South     32      320

# Fit a model considering the interaction region*calories and get the confidence intervals
m1 <- lm(height ~ region * calories, data = df1)
m1.coef <- cbind(estimate = summary(m1)$coef[, 1], confint(m1))
> m1.coef
                         estimate        2.5 %       97.5 %
(Intercept)           33.35270923  26.08014451  40.62527394
regionSouth          -18.06358078 -30.24845497  -5.87870659
calories              -0.02152915  -0.04604315   0.00298485
regionSouth:calories   0.07179409   0.03082360   0.11276457

confint(m1) gives the confidence interval for calories (corresponding to the reference level "North") and for regionSouth:calories (i.e. the difference between the slopes for "South" and "North"). My question is: how do I get the actual confidence interval of calories for the South region? One way to do it is to change the reference level with relevel(), but this is tedious when one is working with several factor levels:

m2 <- lm(height ~ relevel(region, "South") * calories, data = df1)
m2.coef <- cbind(estimate = summary(m2)$coef[, 1], confint(m2))
> m2.coef
                                           estimate       2.5 %      97.5 %
(Intercept)                             15.28912845  5.51257684 25.06568006
relevel(region, "South")North           18.06358078  5.87870659 30.24845497
calories                                 0.05026494  0.01743744  0.08309243
relevel(region, "South")North:calories  -0.07179409 -0.11276457 -0.03082360

> confint(m2)["calories", ]
     2.5 %     97.5 %
0.01743744 0.08309243

Hope I was clear. Thanks in advance.
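No answer is recorded in this excerpt; one common route (my suggestion, not from the original thread) is the emmeans package, whose emtrends() reports the calories slope within each region, each with its own confidence interval, without any releveling:
library(emmeans)
# Per-region slopes of calories, each with a 95% confidence interval
emtrends(m1, ~ region, var = "calories")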
Store coefficients from several regressions in R then call coefficients into second loop
I am trying to output coefficients from multiple multi-linear regressions, store each of them, and then multiply the coefficients by a future data set to predict future revenue. There are 91 regressions in total, one for each 'DBA' numbered 0 to 90, run against 680 dates. I have the loop that runs all of the regressions and outputs the coefficients; I need help storing each of the 91 unique coefficient vectors.

x = 0
while(x < 91) {
  pa.coef <- lm(formula = Final_Rev ~ OTB_Revenue + ADR + Sessions,
                data = subset(data, DBA == x))
  y <- coef(pa.coef)
  print(cbind(x, y))
  x = x + 1
}

After storing each of the unique vectors, I need to multiply the vectors by future 'dates' to output 'predicted revenue'. Any help would be greatly appreciated! Thanks!
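A minimal way to do the storing itself (a sketch using the column names from the question; the answer below shows a more idiomatic grouped approach):
# Store the 91 coefficient vectors in a list, one element per DBA
coef_list <- vector("list", 91)
for (x in 0:90) {
  fit <- lm(Final_Rev ~ OTB_Revenue + ADR + Sessions, data = subset(data, DBA == x))
  coef_list[[x + 1]] <- coef(fit)
}
names(coef_list) <- 0:90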
Since you need to store data from each iteration, consider an apply-family function over standard loops such as for or while. And because you need to subset by a group, consider using by (the object-oriented wrapper to tapply), which slices a dataframe by factor(s) and passes the subsets into a function. The needed function would call lm and predict.lm. Below demonstrates with random data and other_data dataframes (10 rows per DBA group) to return a named list of predicted Final_Rev vectors (each of length 10, as per their DBA group).

Data

set.seed(51718)
data <- data.frame(DBA = rep(seq(0, 90), 10),
                   Sessions = sample(100:200, 910, replace = TRUE),
                   ADR = abs(rnorm(910)) * 100,
                   OTB_Revenue = abs(rnorm(910)) * 1000,
                   Final_Rev = abs(rnorm(910)) * 1000)

set.seed(8888)
other_data <- data.frame(DBA = rep(seq(0, 90), 10),
                         Sessions = sample(100:200, 910, replace = TRUE),
                         ADR = abs(rnorm(910)) * 100,
                         OTB_Revenue = abs(rnorm(910)) * 1000,
                         Final_Rev = abs(rnorm(910)) * 1000)

Prediction

final_rev_predict_list <- by(data, data$DBA, function(sub){
  pa.model <- lm(formula = Final_Rev ~ OTB_Revenue + ADR + Sessions, data = sub)
  # predict for the rows of other_data belonging to the same DBA group
  predict.lm(pa.model, newdata = other_data[other_data$DBA == sub$DBA[1], ])
})

final_rev_predict_list[['0']]
#         1        92       183       274       365       456       547       638       729       820
#  831.3382 1108.0749 1404.8833 1024.4387  784.5980  455.0259  536.9992  100.5486  575.0234  492.1356

final_rev_predict_list[['45']]
#        46       137       228       319       410       501       592       683       774       865
# 1168.1625  961.9151  536.2392 1125.5452 1440.8600 1008.1956  609.7389  728.3272 1474.5348  700.1708

final_rev_predict_list[['90']]
#        91       182       273       364       455       546       637       728       819       910
#  749.9693  726.6120  488.7858  830.1254  659.7508  618.7387  929.6969  584.3375  628.9795  929.3194
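As a usage note (a sketch, not part of the original answer): the named list that by() returns can be flattened into a single structure for downstream use:
# One data frame of predictions keyed by DBA
pred_df <- data.frame(DBA = rep(names(final_rev_predict_list),
                                lengths(final_rev_predict_list)),
                      predicted_rev = unlist(final_rev_predict_list))
head(pred_df)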
Represent interval between values in ggplot2 geom_line()
I need to plot a large amount of data, but most of the values are equal to 0. My idea was, in order to save space and computation time, to not store values equal to 0. Furthermore, I want to use the geom_line() function of the ggplot2 package in R, because with my data this representation is the best one and has the aesthetics that I want. My problem is: how, between two values of my X axis, can I plot a line at 0? Do I have to generate the associated data frame, or is there a trick to plot this?

Example:

   X  Y
 117  1
 158 14
 179  4
 187  1
 190  1
 194  2
 197  1
 200  4
 203  3
 208  1
 211  1
 212  5
 218  1
 992 15
1001  1
1035  1
1037 28
1046  1
1048  1
1064 14
1078  1

# To generate the DF
X <- c(117, 158, 179, 187, 190, 194, 197, 200, 203, 208, 211, 212, 218,
       992, 1001, 1035, 1037, 1046, 1048, 1064, 1078)
Y <- c(1, 14, 4, 1, 1, 2, 1, 4, 3, 1, 1, 5, 1, 15, 1, 1, 28, 1, 1, 14, 1)
data <- data.frame(X, Y)
g <- ggplot(data = data, aes(x = X, y = Y))
g <- g + geom_line()
g

To give you an idea, I am trying to convert the plot above, where the line is drawn directly between consecutive stored points, into one where the line drops back to 0 between peaks (the original post linked an example image at http://www.hostingpics.net/viewer.php?id=407269stack2.png). To generate that second figure, I would have to define two positions around each peak in order to get the right shape. I tried changing the scale to continuous or discrete, but I did not get good peaks. So, is there a trick to tell ggplot2 that if a position on the X axis lies between two stored values of X, it should be displayed at 0? Thank you a lot, any kind of help will be highly appreciated.
Your problem is that R doesn't see any of the in-between values of X. You can fix that as follows:

X <- c(117, 158, 179, 187, 190, 194, 197, 200, 203, 208, 211, 212, 218,
       992, 1001, 1035, 1037, 1046, 1048, 1064, 1078)
Y <- c(1, 14, 4, 1, 1, 2, 1, 4, 3, 1, 1, 5, 1, 15, 1, 1, 28, 1, 1, 14, 1)

This is your original data.

Z <- data.frame(X = seq(min(X), max(X)))

This creates a data frame holding every X value in the full range, with the column named "X" so it can be merged with your "data" dataframe.

data <- data.frame(X, Y)
data <- merge(Z, data, all.x = TRUE)

This creates a new data frame containing all of the in-between X values (all.x = TRUE keeps every row of Z, leaving Y as NA wherever there was no observation).

data[is.na(data)] <- 0

This sets the Y values that are NA to 0.

g <- ggplot(data = data, aes(x = X, y = Y))
g <- g + geom_line()
g

This now plots it.
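A more compact alternative (a sketch, not from the original answer, assuming the tidyr package is acceptable): complete() inserts the missing X positions and the fill argument sets their Y to 0 in one step:
library(tidyr)
library(ggplot2)
data_full <- complete(data.frame(X, Y), X = seq(min(X), max(X)), fill = list(Y = 0))
ggplot(data_full, aes(x = X, y = Y)) + geom_line()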
R nls function and starting values
I'm wondering how I can find/choose the starting values for the nls function, as I'm getting errors with any I put in. I also want to confirm that I can actually use the nls function with my data set.

data
 [1] 108 128  93  96 107 126 105 137  78 117 111 131 106 123 112  90  79 106 120
[20]  91 100 103 112 138  61  57  97 100  95  92  78

week = (1:31)
> data.fit = nls(data ~ M*(((P+Q)^2/P)*exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2),
+                start = c(M = ?, P = ?, Q = ?))
If we change the function a bit and use nls2 to get starting values, then we can get it to converge. The model we are using is:

log(data) = .lin1 + .lin2 * log(exp((P+Q)*week) / (1 + (Q/P)*exp(-(P+Q)*week))^2) + error

In this model .lin1 = log(M*((P+Q)^2/P)), and when .lin2 = 1 it reduces to the model in the question (except for the multiplicative rather than additive error, and the fact that the parameterization is different but, when appropriately reduced, gives the same predictions). This is a 4-parameter rather than a 3-parameter model. The linear parameters are .lin1 and .lin2. We are using algorithm = "plinear", which does not require starting values for these parameters. The RHS of a plinear formula is specified as a matrix with one column for each linear parameter, giving its coefficient (which may be a nonlinear function of the nonlinear parameters). The code is:

data <- c(108, 128, 93, 96, 107, 126, 105, 137, 78, 117, 111, 131, 106, 123,
          112, 90, 79, 106, 120, 91, 100, 103, 112, 138, 61, 57, 97, 100, 95,
          92, 78)
week <- 1:31

if (exists("fit2")) rm(fit2)
library(nls2)

fo <- log(data) ~ cbind(1, log((exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2)))

# try maxiter random starting values
set.seed(123)
fit2 = nls2(fo, alg = "plinear-random",
            start = data.frame(P = c(-10, 10), Q = c(-10, 10)),
            control = nls.control(maxiter = 1000))

# use starting value found by nls2
fit = nls(fo, alg = "plinear", start = coef(fit2)[1:2])

plot(log(data) ~ week)
lines(fitted(fit) ~ week, col = "red")

giving:

> fit
Nonlinear regression model
  model: log(data) ~ cbind(1, log((exp((P + Q) * week)/(1 + (Q/P) * exp(-(P + Q) * week))^2)))
   data: parent.frame()
       P        Q    .lin1    .lin2
 0.05974 -0.02538  5.63199 -0.87963
 residual sum-of-squares: 1.069

Number of iterations to convergence: 16
Achieved convergence tolerance: 9.421e-06
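Since the model is fitted on the log scale, a quick way to inspect the fit on the original scale (a usage sketch; it ignores retransformation bias) is:
plot(data ~ week)
lines(exp(fitted(fit)) ~ week, col = "red")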