I am using the Biomod2 package to run a series of species distribution models in R. One of the modelling techniques I am using is classification tree analysis (CTA), which uses the rpart package.
The response in these models is presence/absence of a plant species, and the predictor variables are contained in a rasterStack. Most of the variables in the rasterStack are continuous numeric variables, with the exception of one land cover variable, geology, which is a factor. I stacked each individual rasterLayer and afterwards used as.factor() to convert the geology layer to a factor.
I am running into an error message when trying to predict from the CTA. The CTA model was built with a data frame in which "geology" is a factor (see below), and I used the raster predict function on a rasterStack in which "geology" is also a factor (see below). However, when running the predict function, I get an error saying I supplied a numeric instead of a factor. I have checked all possible points to see if somehow "geology" gets converted back to numeric, but it seems to be a factor (as it should be) everywhere I look.
EDIT: Changed data to make reproducible.
library(raster)
library(rpart)
set.seed(123)
# Create sample rasterStack
data.rast <- stack(system.file("external/rlogo.grd", package = "raster"))
# Create one layer as a factor
data.rast$geology <- as.factor(sample.int(7, ncell(data.rast), replace = TRUE))
# Create sample presence/absence data by randomly selecting cells of raster
data <- as.data.frame(data.rast)
data <- data[sample(nrow(data), 300, replace = FALSE), ]
data$pa <- as.factor(sample(0:1, nrow(data), replace = TRUE))
names(data)[4] <- "geology"
head(data)
# red green blue geology pa
#2463 251 255 255 7 1
#1944 191 190 186 5 0
#5016 162 174 226 7 0
#5771 255 255 253 4 1
#3739 204 205 199 7 0
#5483 131 133 122 3 0
# Build CTA model using presence/absence dataframe
# Parameters set as the defaults in Biomod2 modeling options
cta <- rpart(pa ~ .,
             data = data,
             na.action = na.omit,
             method = "class",
             control = list(xval = 5,
                            minbucket = 5,
                            minsplit = 5,
                            cp = 0.001,
                            maxdepth = 25))
# Confirm classes of data before running predict function
data.frame(ctaClass = attr(terms(cta), "dataClasses")[2:5],
           rasterFactor = is.factor(data.rast))
# ctaClass rasterFactor
#red numeric FALSE
#green numeric FALSE
#blue numeric FALSE
#geology factor TRUE
# Once again confirming this rasterLayer is a factor
levels(data.rast$geology)
#[[1]]
# ID VALUE
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
#6 6 6
#7 7 7
# Run predict function on rasterStack
cta.predict <- predict(object = data.rast,
                       model = cta,
                       type = "class")
#Error: variable 'geology' was fitted with type "factor" but type "numeric" was supplied
#In addition: Warning message:
#In model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, :
#  variable 'geology' is not a factor
EDIT: added proof that it works with a randomForest model
library(randomForest)
rf <- randomForest(pa ~ .,
                   data = data,
                   na.action = na.omit)
rf.predict <- predict(data.rast, rf)
rf.predict
#class : RasterLayer
#dimensions : 77, 101, 7777 (nrow, ncol, ncell)
#resolution : 1, 1 (x, y)
#extent : 0, 101, 0, 77 (xmin, xmax, ymin, ymax)
#crs : +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
#source : memory
#names : layer
#values : 0, 1 (min, max)
#attributes :
# ID value
# 1 0
# 2 1
In this case you need to help predict a bit by providing the factor name(s) and levels:
data$geology <- as.factor(data$geology)
cta.predict <- predict(data.rast, cta, type="class", factors=list(geology=levels(data$geology)))
Also note the type= in type="class"; you cannot just pass "class" unnamed (unless you want the filename to be class.grd).
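To make that concrete, a short sketch (my illustration, relying on raster::predict matching an unnamed third argument to filename=):
# unnamed, "class" is matched to filename= and a file class.grd gets written:
# cta.predict <- predict(data.rast, cta, "class")
# named, type= is passed through to predict.rpart instead:
cta.predict <- predict(data.rast, cta, type = "class",
                       factors = list(geology = levels(data$geology)))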
With terra this works a little better, I think (hope)
library(terra)
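# `data.rast*1` yields a plain numeric RasterStack (arithmetic drops the factor table),
# so the factor can be redeclared cleanly on the terra side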
x <- rast(data.rast*1)
x$geology <- as.factor(x$geology)
cta.predict <- predict(x, cta, type="class")
I have a timeseries of geotiff files I'd like to stack in R using stars. Here's the first two:
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(stars)
stars::read_stars(urls, along="time")
Errors with:
Error in c.stars_proxy(`3` = list(gep01.t00z.pgrb2a.0p50.f003.tif = "/vsicurl/https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/noaa/gefs-v12/cogs/gefs.20221201/gep01.t00z.pgrb2a.0p50.f003.tif"), :
don't know how to merge arrays: please specify parameter along
Context: bands contain both time+band info
This fails because the dimensions do not match, which happens because the files have concatenated temporal information into the band names:
x <- lapply(urls, read_stars)
x
produces:
[[1]]
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max.
gep01.t00z.pgrb2a.0p50.f003.ti... 50026.01 98094.81 101138 98347.42 101845.2 104605.2
dimension(s):
from to offset delta refsys point
x 1 720 -180.25 0.5 Coordinate System importe... FALSE
y 1 361 90.25 -0.5 Coordinate System importe... FALSE
band 1 8 NA NA NA NA
values x/y
x NULL [x]
y NULL [y]
band PRES:surface:3 hour fcst,...,DLWRF:surface:0-3 hour ave fcst
[[2]]
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max.
gep01.t00z.pgrb2a.0p50.f006.ti... 50029.83 98101.83 101170.6 98337.52 101825 104588.2
dimension(s):
from to offset delta refsys point
x 1 720 -180.25 0.5 Coordinate System importe... FALSE
y 1 361 90.25 -0.5 Coordinate System importe... FALSE
band 1 8 NA NA NA NA
values x/y
x NULL [x]
y NULL [y]
band PRES:surface:6 hour fcst,...,DLWRF:surface:0-6 hour ave fcst
Note the band names would align except for the existence of the timestamp being tacked on, e.g. PRES:surface:3 hour fcst vs PRES:surface:6 hour fcst.
How can I best read in these files so that I have dimensions of x,y,band, and time in my stars object?
alternatives: terra?
How about terra? Note that terra is happy to read these files in directly, but treats this as 16 unique bands. Can I re-align that so that I have the original 8 bands along a new "time" dimension? (I recognize stars emphasizes 'spatio-temporal'; maybe such a cube is out of scope for terra?) Also note that terra for some reason mangles the timestamp in these band names:
x <- terra::rast(urls)
x
class : SpatRaster
dimensions : 361, 720, 16 (nrow, ncol, nlyr)
resolution : 0.5, 0.5 (x, y)
extent : -180.25, 179.75, -90.25, 90.25 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat Coordinate System imported from GRIB file
sources : gep01.t00z.pgrb2a.0p50.f003.tif (8 layers)
gep01.t00z.pgrb2a.0p50.f006.tif (8 layers)
names : PRES:~ fcst, TMP:2~ fcst, RH:2 ~ fcst, UGRD:~ fcst, VGRD:~ fcst, APCP:~ fcst, ...
With terra it is pretty easy to make a time-series for each variable as I show below.
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(terra)
r <- rast(urls)
Extract two variables of interest
nms <- names(r)
tmp <- r[[grep("TMP", nms)]]
rh <- r[[grep("RH", nms)]]
# set time
tm <- as.POSIXct("2022-12-01", tz="GMT") + c(3,6) * 3600
time(rh) <- tm
time(tmp) <- tm
And you could combine them into a SpatRasterDataset like this:
s <- sds(list(tmp=tmp, rh=rh))
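Each variable can then be pulled back out as a SpatRaster (a small usage note of mine, not part of the original answer):
s$tmp  # the temperature time series, back as a SpatRaster with its time stamps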
An alternative path to get to the same point would be to start with a SpatRasterDataset and subset it.
sd <- sds(urls)
nl <- 1:length(sd)
nms <- names(sd[1])
tmp2 <- rast(sd[nl, grep("TMP", nms)])
time(tmp2) <- tm
rh2 <- rast(sd[nl, grep("RH", nms)])
time(rh2) <- tm
I made the subsetting work a little nicer in terra version 1.7-5
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/", c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(terra)
#terra 1.7.5
sd <- sds(urls)
tmp <- sd[,2]
tmp
#class : SpatRaster
#dimensions : 361, 720, 2 (nrow, ncol, nlyr)
#resolution : 0.5, 0.5 (x, y)
#extent : -180.25, 179.75, -90.25, 90.25 (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat Coordinate System imported from GRIB file
#sources : gep01.t00z.pgrb2a.0p50.f003.tif
# gep01.t00z.pgrb2a.0p50.f006.tif
#names : TMP:2 m above g~Temperature [C], TMP:2 m above g~Temperature [C]
#unit : C, C
#time : 2022-12-01 03:00:00 to 2022-12-01 06:00:00 UTC
As for the layer names containing the forecast time, that is just because that is what is in the tif metadata. It looks like that was a decision made when they were created from the original GRIB files.
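If the embedded forecast time in the names is a nuisance in terra as well, a hypothetical cleanup (my sketch; the example name is illustrative) is to truncate the layer names to the variable code, mirroring the band-name fix shown for stars below:
nms <- names(r)                   # e.g. "TMP:2 m above ground:3 hour fcst"
names(r) <- sub(":.*$", "", nms)  # keep only the variable code, e.g. "TMP"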
The latitude extent going beyond the north and south poles is an interesting feature of this dataset.
Just wanted to share some additional possible solutions for comparison. With larger numbers of files some of these differences become more relevant. This expands a bit beyond my original question.
terra
Prof Hijmans gives a very nice solution in terra. He also asked about the original upstream sources, which I didn't explain properly -- these are originally GRIB files from the NOAA GEFS forecast.
Notably, we can work directly from the GRIB files. GEFS is a 35-day forecast, so let's try going more than 6 hrs into the future:
library(terra)
# original GRIB sources, AWS mirror
gribs <- paste0("/vsicurl/https://noaa-gefs-pds.s3.amazonaws.com/gefs.20220314/00/atmos/pgrb2ap5/geavg.t00z.pgrb2a.0p50.f",
stringr::str_pad(seq(3,240,by=3), 3, pad="0"))
bench::bench_time({
cube <- terra::sds(gribs)
})
cube[1,63] |> plot()
very nice!
gdalcubes
gdalcubes is another package that can leverage the GDAL virtual filesystem when working with these large-ish remote files. It also lets us define an abstract cube at a potentially different resolution in space & time than the original sources (averaging or interpolating). Lazy operations mean this may run a bit faster(?)
library(gdalcubes)
date <- as.Date("2023-01-26")
date_time = date + lubridate::hours(seq(3,240,by=3))
# USA box
v <- cube_view(srs = "EPSG:4326",
               extent = list(left = -125, right = -66, top = 49, bottom = 25,
                             t0 = as.character(min(date_time)), t1 = as.character(max(date_time))),
               dx = 0.5, dy = 0.5, dt = "PT3H")
gribs <- paste0("/vsicurl/https://noaa-gefs-pds.s3.amazonaws.com/gefs.20220314/00/atmos/pgrb2ap5/geavg.t00z.pgrb2a.0p50.f",
stringr::str_pad(seq(3,240,by=3), 3, pad="0"))
bench::bench_time({
cube <- gdalcubes::create_image_collection(gribs, date_time = date_time)
})
bench::bench_time({
raster_cube(cube, v) |>
select_bands("band63") |> # temperature
animate(col = viridisLite::viridis, nbreaks=50, fps=10, save_as = "temp.gif")
})
stars
I didn't translate a full stars example, but here at least is the band name correction; a bit more cumbersome than the examples above.
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(stars)
#stars::read_stars(urls, along="time") # no luck!
## grab unstacked proxy object for each geotiff
x <- lapply(urls, read_stars)
# extract band-names-part
band_names <- st_get_dimension_values(x[[1]], "band") |>
  stringr::str_extract("([A-Z]+):") |>
  stringr::str_remove(":")
# apply corrected band-names
x1 <- lapply(x, st_set_dimensions, "band", band_names)
# at last, we can stack into a cube:
x1 <- do.call(c, c(x1, along="time"))
# and add correct date timestamps to the new time dimension
dates <- as.Date("2022-12-01") + lubridate::hours(c(3,6))
x1 <- st_set_dimensions(x1, "time", dates)
x1
I was given data with 3 columns (lon, lat, data) which I would like to turn into a SpatRaster.
library(terra)
#> terra 1.5.21
df <- read.csv("data.csv")
head(df)
#> lon lat mean_daily_par_mol
#> 1 -7.52993 61.91259 0.5463984
#> 2 -7.16531 61.77965 0.5635693
#> 3 -6.80426 61.64570 0.6404579
#> 4 -6.44676 61.51075 0.6602439
#> 5 -6.09277 61.37483 0.6801313
#> 6 -5.74228 61.23794 0.6358355
Unfortunately, the data is not on a regular grid, so I cannot create a raster directly. The data is on the MODIS ISIN grid.
rast(df)
#> Error: [raster,matrix(xyz)] x cell sizes are not regular
e <- ext(c(min(df$lon), max(df$lon), min(df$lat), max(df$lat)))
e
#> SpatExtent : -180, 179.69196, 60.00299, 89.41164 (xmin, xmax, ymin, ymax)
r <- rast(e, ncols = 100, nrows = 100)
r <- rasterize(
  as.matrix(df[c("lon", "lat")]),
  r,
  df$mean_daily_par_mol,
  fun = mean,
  na.rm = TRUE
)
plot(r)
Obviously, changing ncols and nrows considerably influences how the raster is created.
r <- rast(e, ncols = 300, nrows = 300)
r <- rasterize(
  as.matrix(df[c("lon", "lat")]),
  r,
  df$mean_daily_par_mol,
  fun = mean,
  na.rm = TRUE
)
plot(r)
My question is: how can I find the optimal ncols and nrows so that there is approximately one value per pixel? The ultimate goal is to reproject this data onto the EASE2 grid.
The data is available here
Created on 2022-05-29 by the reprex package (v2.0.1)
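A minimal heuristic sketch for the "one value per pixel" target (my own approach, not from the thread): size the grid so that the cell count is roughly the number of points, splitting columns and rows according to the extent's aspect ratio.
# aim for ncols * nrows ~= nrow(df)
asp   <- (ymax(e) - ymin(e)) / (xmax(e) - xmin(e))  # height/width of the extent
ncols <- round(sqrt(nrow(df) / asp))
nrows <- round(ncols * asp)
r <- rast(e, ncols = ncols, nrows = nrows)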
I want to calculate the area of every possible combination of two classified rasters. I am using the following code:
library(terra)
#First create two rasters
r1 <- r2 <- rast(nrow=100, ncol=100)
#Assign random cell values
set.seed(123)
values(r1) <- runif(ncell(r1), min=0, max=1)
values(r2) <- runif(ncell(r2), min=0, max=1)
# classify the values into two groups
m_r1 <- c(min(global(r1, "min", na.rm=TRUE)), 0.2, 1,
          0.2, max(global(r1, "max", na.rm=TRUE)), 2)
m_r2 <- c(min(global(r2, "min", na.rm=TRUE)), 0.2, 1,
          0.2, max(global(r2, "max", na.rm=TRUE)), 2)
#Reclassify the rasters
rclmat_r1 <- matrix(m_r1, ncol=3, byrow=TRUE)
rc_r1 <- classify(r1, rclmat_r1, include.lowest=TRUE)
rclmat_r2 <- matrix(m_r2, ncol=3, byrow=TRUE)
rc_r2 <- classify(r2, rclmat_r2, include.lowest=TRUE)
plot(rc_r1)
plot(rc_r2)
#Convert to polygons
r1_poly <- as.polygons(rc_r1, dissolve=TRUE)
r2_poly <- as.polygons(rc_r2, dissolve=TRUE)
plot(r1_poly)
plot(r2_poly)
#Perform intersections
x <- intersect(r1_poly, r2_poly)
x
#> class : SpatVector
#> geometry : polygons
#> dimensions : 2747, 2 (geometries, attributes)
#> extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84
#> names : lyr.1 lyr.1
#> type : <int> <int>
#> values : 1 1
#> 1 2
#> 2 1
As you can see from the output, one combination, i.e. 2-2, is missing. Why is this happening?
When I try to calculate the area for each combination using expanse(x), it returns a long result. How can I get the area in km2 for the following combinations?
Combination Area (km2)
1-1
1-2
2-1
2-2
With this example it would be better to stay with raster data.
x = 10 * rc_r1 + rc_r2
a = cellSize(x, unit="km")
zonal(a, x, sum)
# lyr.1 area
#1 11 19886611
#2 12 81946082
#3 21 84763905
#4 22 323469024
By multiplying with 10, the values in the first layer become 10 (if they were 1) or 20 (if they were 2). If you then add the second layer, you get 10 + 1 or 2 and 20 + 1 or 2, so you end up with four classes: 11, 12, 21, and 22. These show the value in the first raster (first digit) and in the second raster (second digit).
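To present that in the requested combination/area format, a minimal sketch building on the zonal() result above:
z <- zonal(a, x, sum)
data.frame(Combination = sub("^(\\d)(\\d)$", "\\1-\\2", z$lyr.1),
           Area_km2 = z$area)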
When you show a SpatVector, only the first three records are printed, and there is a 2-2 record. Nevertheless, intersect() did not work properly, and I have now fixed this.
I have loaded a data.frame object in R named "prec" with 1009549 rows and 8 variables. I want to create 60 raster layers of the cumulative "prec" variable values for each x-y coordinate pair at every four time steps ("tstep" variable, from index 2 to 241), as summarized in the code below. I performed a single function to create each file in 3 steps to achieve it. However, is it possible to write a single function for each step, or a single function for the entire code (steps 1 to 4)?
load required packages
library(data.table)
library(raster)
structure of the "prec" data.frame
> headTail(prec)
x y prec index tstep variable level date
1 -47.8 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.1 -47.6 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.2 -47.4 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.3 -47.2 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
... ... ... ... ... ... <NA> ... <NA>
241.4185 -36.8 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4186 -36.6 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4187 -36.4 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4188 -36.2 -7.2 0 241 241 prec 1000 2015-01-01 00:01:00
step 1: subset by tstep
prec_1 <- prec[prec$tstep %in% c(2, 3, 4, 5),]
prec_2 <- prec[prec$tstep %in% c(6, 7, 8, 9),]
prec_3 <- prec[prec$tstep %in% c(10, 11, 12, 13),]
...
prec_60 <- prec[prec$tstep %in% c( 238 , 239 , 240 , 241),]
step 2: coerce to data.table
prec_1_sum <- setDT(prec_1)[, list(prec_sum_1 = sum(prec*1000)), list(x, y)]
prec_2_sum <- setDT(prec_2)[, list(prec_sum_2 = sum(prec*1000)), list(x, y)]
prec_3_sum <- setDT(prec_3)[, list(prec_sum_3 = sum(prec*1000)), list(x, y)]
...
prec_60_sum <- setDT(prec_60)[, list(prec_sum_60 = sum(prec*1000)), list(x, y)]
step 3: create n raster layers
layer_1 <- rasterFromXYZ(prec_1_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
layer_2 <- rasterFromXYZ(prec_2_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
layer_3 <- rasterFromXYZ(prec_3_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
...
layer_60 <- rasterFromXYZ(prec_60_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
step 4: stack raster layers
stack_prec <- stack(layer_1, layer_2, layer_3, layer_4, layer_5, layer_6, layer_7, layer_8, layer_9, layer_10,
                    layer_11, layer_12, layer_13, layer_14, layer_15, layer_16, layer_17, layer_18, layer_19, layer_20,
                    layer_21, layer_22, layer_23, layer_24, layer_25, layer_26, layer_27, layer_28, layer_29, layer_30,
                    layer_31, layer_32, layer_33, layer_34, layer_35, layer_36, layer_37, layer_38, layer_39, layer_40,
                    layer_41, layer_42, layer_43, layer_44, layer_45, layer_46, layer_47, layer_48, layer_49, layer_50,
                    layer_51, layer_52, layer_53, layer_54, layer_55, layer_56, layer_57, layer_58, layer_59, layer_60)
It’s always much easier to help when we have sample data we can use. In the future you can use dput(prec) and copy and paste that output for people to use. At the very least some sample data is useful, particularly when you’re using functions that have certain specifications for what the data should look like. Here we generate some data to work with.
library(raster)
#> Loading required package: sp
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:raster':
#>
#> shift
set.seed(1)
dat <-
data.frame(
x = rep(seq(-47.8, -47.2, by = 0.2), 241),
y = -21.2,
prec = runif(964),
tstep = rep(1:241, each = 4),
date = c(rep(as.Date("2015-01-01"), 4), rep(seq(as.Date("2015-01-01"), by = "day", length.out = 60), each = 16))
)
For your process, it seems a bit more straightforward to group the data rather than break it up. That way you only have to perform the operations on one data set rather than do it many times over. Steps 1 and 2 can be reduced to only a few lines that way. Without thinking too much about optimizing this, I’ve looped over the groups created in the first step to create the raster layers.
raster_layers <- function(dat){
## some flexibility if there is a differing number of tsteps
## it will by default exclude the first tstep as in your example
min_tstep <- min(dat$tstep)
max_tstep <- max(dat$tstep)
breaks <- seq(min_tstep, max_tstep, by = 4)
## Step 1
dat$group <- cut(dat$tstep, breaks)
dat <- dat[!is.na(dat$group), ]
## Step 2
prec <- setDT(dat)[ , list(prec_sum = sum(prec * 1000)), by = list(group, x, y)]
## Step 3
layer <- list()
group <- unique(prec$group)
j <- 1
for (i in group){
raster_dat <- prec[prec$group %in% i , c("x", "y", "prec_sum")]
## looks like your plot uses the names for changing labels??
colnames(raster_dat)[colnames(raster_dat) == "prec_sum"] <- paste0("prec_sum_", j)
layer[[j]] <-
rasterFromXYZ(raster_dat,
res = c(0.20, 0.20),
crs = sp::CRS("+init=epsg:4326"))
j <- j + 1
}
## Step 4
stack_prec <- stack(unlist(layer))
return(stack_prec)
}
Example
stack_prec <- raster_layers(dat = dat)
stack_prec
#> class : RasterStack
#> dimensions : 1, 4, 4, 60 (nrow, ncol, ncell, nlayers)
#> resolution : 0.2, 0.2 (x, y)
#> extent : -47.9, -47.1, -21.3, -21.1 (xmin, xmax, ymin, ymax)
#> crs : +init=epsg:4326
#> names : prec_sum_1, prec_sum_2, prec_sum_3, prec_sum_4, prec_sum_5, prec_sum_6, prec_sum_7, prec_sum_8, prec_sum_9, prec_sum_10, prec_sum_11, prec_sum_12, prec_sum_13, prec_sum_14, prec_sum_15, ...
#> min values : 2112.4990, 1124.8232, 2007.5945, 1315.0517, 1729.9294, 1582.8684, 1524.0147, 1098.1529, 2008.5390, 1248.1860, 1680.0199, 1855.4024, 815.4047, 1204.8576, 1416.3943, ...
#> max values : 2336.186, 2565.158, 2877.219, 2318.115, 3017.609, 2540.536, 2569.019, 2690.884, 2327.706, 2288.046, 3104.792, 2639.530, 2358.953, 2599.245, 2618.676, ...
Using R and estimating a simple equation by least squares that has the dependent variable as an independent (explanatory, right-hand-side) variable, I want to forecast out of sample and use the dependent-variable forecasts in the out-of-sample period as a lag for each step ahead.
I.e., I want to extend forecasts of y to be outside the data period
a <- lm(y ~ x + lag(y,1), data = dset1)
b <- forecast(a, newdata = dset2)
where dset2 has the full period of extra x variables, but not the lagged y.
Here is an example using the AirPassengers data set, where dset2 was created with some missing ap data. The results below show that only row 143 gets filled in, not 144, because forecast did not have the 143 lag.
I looked at the dyn, dynlm, and forecast packages but none seem to work with this type of model. (I do not want to restate it as an ARMA or a VAR.)
What package can easily do this, or am I using forecast incorrectly?
I can loop and step ahead one period at a time, but I would rather not do that.
##Example case using airline data
library(dynlm)     # for dynlm() below
library(dyn)       # for dyn() below
library(forecast)  # for forecast() below
data("AirPassengers", package = "datasets")
ap <- log(AirPassengers)
ap <- as.ts(ap)
d1 <- data.frame(ap, index= as.Date(ap))
m1 <- lm(ap ~ lag(ap,1), data=d1)
m2 <- dynlm(ap ~ lag(ap,1), data=d1)
m3 <- dyn(lm(ap ~ lag(ap,1), data=d1))
summary(m3)
##Neither lm nor dyn nor dynlm objects worked as I want
## Try forecast missing values, 2 steps, rows 143 and 144
d2 <- d1
d2$apx = d2$ap
d2$apx[143:144]= NA
mx <- lm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Results:
> b
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
1 NA NA NA NA NA
2 4.756850 4.619213 4.894488 4.545513 4.968188
3 4.807218 4.669783 4.944653 4.596191 5.018245
....
140 6.411559 6.273546 6.549572 6.199644 6.623474
141 6.386407 6.248507 6.524306 6.174667 6.598146
142 6.216154 6.078941 6.353368 6.005467 6.426841
143 6.122453 5.985553 6.259354 5.912247 6.332659
144 NA NA NA NA NA
Other lm-like objects produced errors with forecast:
mx <- dynlm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Error in forecast.lm(mx, newdata = d2) : invalid type/length
(symbol/0) in vector allocation
mx <- dyn(lm(apx ~ lag(apx,1), data=d2))
b <- forecast(mx,newdata=d2)
Error in predict.lm(object, newdata = newdata, se.fit = TRUE, interval
= "prediction", : formal argument "se.fit" matched by multiple actual arguments