R: Problem with raster prediction from a linear model

I am using the function raster::predict to extract the prediction part of a linear model as a raster but I am getting this error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : object is not a matrix
In addition: Warning message:
'newdata' had 622 rows but variables found have 91 rows
My data set is a RasterStack of two satellite images (same CRS and data type). I have found this question but I couldn't solve my problem.
Here is the code and the data:
library(raster)
ntl = raster ("path/ntl.tif")
vals_ntl <- as.data.frame(values(ntl))
ntl_coords = as.data.frame(xyFromCell(ntl, 1:ncell(ntl)))
combine <- as.data.frame(cbind(ntl_coords,vals_ntl))
ebbi = raster ("path/ebbi.tif")
ebbi <- resample(ebbi, ntl, method = "bilinear")
vals_ebbi <- as.data.frame(values(ebbi))
s = stack(ntl, ebbi)
block.data <- as.data.frame(cbind(combine, vals_ebbi))
names(block.data)[3] <- "ntl"
names(block.data)[4] <- "ebbi"
block.data <- na.omit(block.data)
model <- lm(formula = ntl ~ ebbi, data = block.data)
#predict to a raster
r1 <- raster::predict(s, model, progress = 'text', na.rm = T)
plot(r1)
writeRaster(r1, filename = "path/lm_predict.tif")
The data can be downloaded from here (I don't know if by sharing a smaller dataset the problem would still exist so I decided to share the full dataset which is quite big when using the dput command to copy-paste it)

You are correct that dput is generally not very useful for spatial data, and that you should avoid using it. However, in most cases there is no need to share data, because you can create example data with code or use data that ships with R, as most examples in the help files and questions on this site do. Saying "I don't know if by sharing a smaller dataset the problem would still exist" suggests that the first thing you should do is find out.
If you have a SpatRaster x that you want to reproduce, you can start with as.character(x), which is what I did to get the below.
library(terra)
ntl <- rast(ncols=48, nrows=91, nlyrs=1, xmin=582360, xmax=604440, ymin=1005560, ymax=1047420, names=c('avg_rad'), crs='EPSG:7767')
ebbi <- rast(ncols=48, nrows=91, nlyrs=1, xmin=582360, xmax=604440, ymin=1005560, ymax=1047420, names=c('B6_median'), crs='EPSG:7767')
values(ntl) <- sample(100, ncell(ntl), replace=TRUE)
values(ebbi) <- runif(ncell(ebbi))
Combine, set the names, and get the values into a data.frame. For larger datasets you could take a sample with spatSample(x, type="regular").
x <- c(ntl, ebbi)
names(x) <- c("ntl", "ebbi")
Fit the model. You can do that in two steps
v <- as.data.frame(x, na.rm=TRUE)
model <- lm(ntl ~ ebbi, data=v)
Or in one step
model <- lm(ntl ~ ebbi, data=x)
And now predict (set a filename if you want to save the raster to disk).
p <- predict(x$ebbi, model, filename="")
It is important that the first (SpatRaster) argument to predict has names that match the names in the model. So in this case you can use x$ebbi or x[[2]], but if you use ebbi you get a mysterious error message
p <- predict(ebbi, model)
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : object is not a matrix
#In addition: Warning message:
#'newdata' had 48 rows but variables found have 91 rows
unless you first do
names(ebbi) <- "ebbi"
p <- predict(ebbi, model)
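The name-matching requirement can be checked before calling predict. The following is a minimal base R sketch, assuming made-up training data in place of the raster values; missing_predictors is a hypothetical helper written for this illustration, not a function from terra or raster.

```r
# Hypothetical training data standing in for the raster cell values.
set.seed(1)
train <- data.frame(ebbi = runif(91))
train$ntl <- 2 * train$ebbi + rnorm(91, sd = 0.1)
model <- lm(ntl ~ ebbi, data = train)

# Return the predictors the model expects that are absent from `newdata`.
# `missing_predictors` is an illustrative helper, not part of any package.
missing_predictors <- function(model, newdata) {
  needed <- all.vars(delete.response(terms(model)))
  setdiff(needed, names(newdata))
}

good <- data.frame(ebbi = runif(48))        # name matches the formula
bad  <- data.frame(B6_median = good$ebbi)   # same values, wrong name

missing_predictors(model, good)   # nothing missing
missing_predictors(model, bad)    # reports "ebbi"

# Prediction is only attempted on the correctly named data.
p_good <- predict(model, newdata = good)
```

Because names() also works on SpatRaster and RasterStack objects, the same comparison should carry over to raster input, although that is not tested here.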

Alternatively, using the raster package, the solution is:
library(raster)
ntl = raster ("path/ntl.tif")
ebbi = raster ("path/ebbi.tif")
ebbi <- resample(ebbi, ntl, method = "bilinear")
s = stack(ntl, ebbi)
names(s) = c('ntl', 'ebbi') # important step in order to run the predict function successfully
block.data = data.frame(na.omit(values(s)))
names(block.data) <- c('ntl', 'ebbi')
model <- lm(formula = ntl ~ ebbi, data = block.data)
#predict to a raster
r1 <- raster::predict(s, model, progress = 'text', na.rm = T)
plot(r1)
writeRaster(r1, filename = "path/lm_predict.tif")
I found the answer based on this post.

Related

Problems with raster prediction from linear model in r

I'm having problems predicting a raster using a linear model.
First, I create my model from the data found in my polygons.
# create model
poly <- st_read("polygon.shp")
df <- na.omit(poly)
df <- df[df$gdp > 0 & df$ntl2 > 0 & df$pop2 > 0,]
x <- log(df$ntl2)
y <- log(df$gdp*df$pop2)
c <- df$iso
d <- data.frame(x,y,c)
m <- lm(y~x+c,data=d)
Then I want to use raster::predict to estimate an output raster:
# raster data
iso <- raster("iso.tif")
viirs <- raster("viirs.tif")
x <- log(viirs)
c <- iso
## predict with models
s <- stack(x,c)
predicted <- raster::predict(x,model=m)
However, I get the following response:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
object is not a matrix
I don't know what the problem is or how to fix it. My current thoughts are that it has something to do with the factors/country codes:
My model includes country codes, as I would like to include some country fixed effects. Maybe there is a problem with including these. However, even when I exclude the country codes from the model and the entire data frame, I still get the same error message.
Furthermore, my model is based on regional values from the whole world, and the prediction datasets only cover the extent of Turkey. Maybe this is the problem?
And here is the data:
https://drive.google.com/open?id=16cy7CJFrxQCTLhx-hXDNHJz8ej3vTEED
Perhaps it works if you do it like this:
iso <- raster("iso.tif")
viirs <- raster("viirs.tif")
s <- stack(log(viirs), iso)
names(s) <- c("x", "c")
predicted <- raster::predict(s, model=m)
It won't work if the values in df$iso and iso.tif don't match (is one a factor, and the other numeric?).
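The factor-level concern can be checked with base R alone. The following is a minimal sketch, assuming a small made-up data frame in which iso plays the role of df$iso; the country codes and numbers are hypothetical.

```r
# Hypothetical training data; `iso` plays the role of df$iso in the question.
d <- data.frame(
  x   = c(1.0, 2.0, 1.5, 3.0, 2.5, 4.0),
  iso = factor(c("TUR", "TUR", "GRC", "GRC", "ITA", "ITA")),
  y   = c(2.1, 3.9, 3.2, 6.1, 5.4, 8.2)
)
m <- lm(y ~ x + iso, data = d)

# Levels the model was fitted with.
known <- m$xlevels$iso

# Codes found in the prediction data (for example, decoded from a raster).
codes <- c("TUR", "GRC", "FRA")
unknown <- setdiff(codes, known)   # codes the model has never seen

# Predict only for codes the model knows, reusing the training levels.
keep <- codes %in% known
new <- data.frame(
  x   = c(1.2, 2.2, 3.0)[keep],
  iso = factor(codes[keep], levels = known)
)
p <- predict(m, newdata = new)
```

If the raster stores the codes as numbers while the model was fitted on a factor, the values would first have to be mapped to the factor's labels; how to do that depends on the data and is not shown here.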

Warning "Variable is not a factor" using predict for One-Hot encoding using caret

I'm following this tutorial to learn the very basics of the caret package in R and machine learning.
I get a warning message I don't understand, and I don't know if it's a problem. This happens both when I apply the tutorial steps to my own data and when I follow the tutorial.
orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv')
trainRowNumbers <- createDataPartition(orange$Purchase, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainData <- orange[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- orange[-trainRowNumbers,]
#Impute
preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')
preProcess_missingdata_model
library(RANN) # required for knnImpute
trainData <- predict(preProcess_missingdata_model, newdata = trainData)
#One-hot encoding
dummies_model <- dummyVars(Purchase ~ ., data=trainData)
trainData_mat <- predict(dummies_model, newdata = trainData)
I get:
Warning message:
In model.frame.default(Terms, newdata, na.action = na.action, xlev = object$lvls) :
variable 'Purchase' is not a factor
But:
is.factor(trainData$Purchase)
[1] TRUE
I have two questions:
What is going on?
Is this important?
(For extra points) why are R warning/error messages so bad and uninformative?
You can easily fix it by removing the response variable (here Purchase) from the left-hand side of the ~.
In this case, your code would look like:
#One-hot encoding
dummies_model <- dummyVars(~ ., data=trainData)
trainData_mat <- predict(dummies_model, newdata = trainData)
I just found this question because I had the same problem in the tutorial. In the end my solution for the tutorial is to do it like this:
dummies_model <- dummyVars(~ ., data = trainData[, names(trainData) != "Purchase"])
This way you exclude the Purchase column, which would otherwise also be split into dummy variables. With my solution you can just continue with the tutorial without getting any warning.
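The idea of encoding only the predictors and leaving the response alone can be illustrated without caret. The following is a rough base R sketch using model.matrix on a hypothetical miniature data set; it is an analogue of the approach above, not a description of what dummyVars does internally.

```r
# Hypothetical miniature of the tutorial data.
trainData <- data.frame(
  Purchase = factor(c("CH", "MM", "CH", "MM")),
  Store    = factor(c("a", "b", "a", "c")),
  Price    = c(1.5, 2.0, 1.7, 1.9)
)

# Drop the response, then expand the remaining factor into indicator columns.
predictors <- trainData[, names(trainData) != "Purchase", drop = FALSE]
X <- model.matrix(~ . - 1, data = predictors)

# Reattach the untouched response as a factor.
encoded <- data.frame(Purchase = trainData$Purchase, X)
```

Note that model.matrix keeps all levels only for the first factor when the intercept is removed, so with several factor columns the result would differ from a full one-hot encoding.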

R: error with autofitVariogram (automap package)

Using the autofitVariogram() function from the automap package, I get the following error:
Error in vgm_list[[which.min(SSerr_list)]] : attempt to select less
than one element in get1index
Example code:
model <- as.formula(Value ~ Elevation)
data <- matrix(c(11.07,42.75,5,62.5,
8.73,45.62,234,75,
12.62,44.03,12,75,
10.87,45.38,67,75,
8.79,42.53,64,75),
nrow = 5, byrow = TRUE)
data <- as.data.frame(data)
names(data) <- c('Lon', 'Lat', 'Elevation', 'Value')
library('sp')
coordinates(data) = ~Lon+Lat
library('automap')
autofitVariogram(model, data)
What causes this error? Do interpolated values cause some kind of 'singularity'?
Thx!
This error is caused by the fact that gstat cannot generate an experimental variogram given this number of observations:
library(gstat)
library(sp)
data <- matrix(c(11.07,42.75,5,62.5,
8.73,45.62,234,75,
12.62,44.03,12,75,
10.87,45.38,67,75,
8.79,42.53,64,75),
nrow = 5, byrow = TRUE)
data <- as.data.frame(data)
names(data) <- c('Lon', 'Lat', 'Elevation', 'Value')
coordinates(data) = ~Lon+Lat
variogram(Value ~ Elevation, data)
## NULL
When given insufficient observations, gstat::variogram returns NULL. This in turn causes autofitVariogram to fail.
The solution is to simply have more data if you want to use kriging. A rule of thumb is that you need about 30 observations to generate a meaningful variogram to fit a variogram model to.
I recently came across this problem as well. In my case the reason was that there were some Inf values in my data; after I deleted them, the package worked well. I hope this helps.
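Both causes mentioned in these answers (too few observations, non-finite values) can be checked before fitting. The following is a minimal sketch, assuming a hypothetical helper check_variogram_input; the default threshold of 30 is the rule of thumb from the answer above, not a limit enforced by gstat or automap.

```r
# Illustrative helper (not part of automap or gstat).
# `min_n` defaults to the rule of thumb of about 30 observations.
check_variogram_input <- function(values, min_n = 30) {
  finite <- is.finite(values)   # FALSE for NA, NaN, Inf and -Inf
  list(
    n_finite  = sum(finite),
    n_dropped = sum(!finite),
    enough    = sum(finite) >= min_n
  )
}

# The five values from the question: all finite, but far too few.
small <- check_variogram_input(c(62.5, 75, 75, 75, 75))

# A vector with one Inf and one NA: two values would have to be dropped.
with_inf <- check_variogram_input(c(1, Inf, 3, NA))
```

With the sp object from the question, the check could be applied to data$Value before calling autofitVariogram.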

NAs in rasters and randomForest::predict()

New here, please let me know if you need more info.
My goal: I am using Rehfeldt climate data and eBird presence/absence data to produce niche models using Random Forest models.
My problem: I want to predict niche models for the entirety of North America. The Rehfeldt climate rasters have data values for every cell on the continent, but these are surrounded by NAs in the "ocean cells". See the plot here, where I have colored the NAs dark green. randomForest::predict() does not run if the independent dataset contains NAs. Thus, I want to crop my climate rasters (or set a working extent?) so that the predict() function only operates over the cells which contain data.
Troubleshooting:
I've run the Random Forest model using a smaller extent which does not include the "NA oceans" of the rasters and the model runs just fine. So, I know the NAs are the problem. However, I don't want to predict my niche models for just a rectangular chunk of North America.
I used flowla's approach here for cropping and masking rasters using a polygon shapefile for North America. I hoped that this would remove the NAs but it doesn't. Is there something similar I can do to remove the NAs?
I've done some reading but can't figure out a way to adjust the Random Forest code itself so that predict() ignores NAs. This post looks relevant but I'm not sure whether it helps in my case.
Data
My rasters, the input presence/absence text file, and code for additional functions are here. Use with the main code below for a reproducible example.
Code
require(sp)
require(rgdal)
require(raster)
library(maptools)
library(mapproj)
library(dismo)
library(maps)
library(proj4)
data(stateMapEnv)
# This source code has all of the functions necessary for running the Random Forest models, as well as the code for the function detecting multi-collinearity
source("Functions.R")
# Read in Rehfeldt climate rasters
# these rasters were converted to .img and given WGS 84 projection in ArcGIS
d100 <- raster("d100.img")
dd0 <- raster("dd0.img")
dd5 <- raster("dd5.img")
fday <- raster("fday.img")
ffp <- raster("ffp.img")
gsdd5 <- raster("gsdd5.img")
gsp <- raster("gsp.img")
map <- raster("map.img")
mat <- raster("mat_tenths.img")
mmax <- raster("mmax_tenths.img")
mmin <- raster("mmin_tenths.img")
mmindd0 <- raster("mmindd0.img")
mtcm <- raster("mtcm_tenths.img")
mtwm <- raster("mtwm_tenths.img")
sday <- raster("sday.img")
smrpb <- raster("smrpb.img")
# add separate raster files into one big raster, with each file being a different layer.
rehfeldt <- addLayer(d100, dd0, dd5, fday, ffp, gsdd5, gsp, map, mat, mmax, mmin, mmindd0, mtcm, mtwm, sday, smrpb)
# plot some rasters to make sure everything worked
plot(d100)
plot(rehfeldt)
# read in presence/absence data
LAZB.INBUtemp <- read.table("LAZB.INBU.txt", header=T, sep = "\t")
colnames(LAZB.INBUtemp) <- c("Lat", "Long", "LAZB", "INBU")
LAZB.INBUtemp <- LAZB.INBUtemp[c(2,1,3,4)]
LAZB.INBU <- LAZB.INBUtemp
latpr <- (LAZB.INBU$Lat)
lonpr <- (LAZB.INBU$Long)
sites <- SpatialPoints(cbind(lonpr, latpr))
LAZB.INBU.spatial <- SpatialPointsDataFrame(sites, LAZB.INBU, match.ID=TRUE)
# The below function extracts raster values for each of the different layers for each of the eBird locations
pred <- raster::extract(rehfeldt, LAZB.INBU.spatial)
LAZB.INBU.spatial@data = data.frame(LAZB.INBU.spatial@data, pred)
LAZB.INBU.spatial@data <- na.omit(LAZB.INBU.spatial@data)
# ITERATIVE TEST FOR MULTI-COLINEARITY
# Determines which variables show multicolinearity
cl <- MultiColinear(LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)], p=0.05)
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
for(l in cl) {
cl.test <- xdata[,-which(names(xdata)==l)]
print(paste("REMOVE VARIABLE", l, sep=": "))
MultiColinear(cl.test, p=0.05)
}
# REMOVE MULTI-COLINEAR VARIABLES
for(l in cl) { LAZB.INBU.spatial@data <- LAZB.INBU.spatial@data[,-which(names(LAZB.INBU.spatial@data)==l)] }
################################################################################################
# FOR LAZB
# RANDOM FOREST MODEL AND RASTER PREDICTION
require(randomForest)
# NUMBER OF BOOTSTRAP REPLICATES
b=1001
# CREATE X,Y DATA
# use column 3 for LAZB and 4 for INBU
ydata <- as.factor(LAZB.INBU.spatial@data[,3])
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
# PERCENT OF PRESENCE OBSERVATIONS
( dim(LAZB.INBU.spatial[LAZB.INBU.spatial$LAZB == 1, ])[1] / dim(LAZB.INBU.spatial)[1] ) * 100
# RUN RANDOM FORESTS MODEL SELECTION FUNCTION
# This model is using the model improvement ratio to select a final model.
pdf(file = "LAZB Random Forest Model Rehfeldt.pdf")
( rf.model <- rf.modelSel(x=xdata, y=ydata, imp.scale="mir", ntree=b) )
dev.off()
# RUN RANDOM FORESTS CLASS BALANCE BASED ON SELECTED VARIABLES
# This code would help in the case of imbalanced sample
mdata <- data.frame(y=ydata, xdata[,rf.model$SELVARS])
rf.BalModel <- rfClassBalance(mdata[,1], mdata[,2:ncol(mdata)], "y", ntree=b)
# CREATE NEW XDATA BASED ON SELECTED MODEL AND RUN FINAL RF MODEL
sel.vars <- rf.model$PARAMETERS[[3]]
rf.data <- data.frame(y=ydata, xdata[,sel.vars])
write.table(rf.data, "rf.data.txt", sep = ",", row.names = F)
# This the code given to me; takes forever to run for my dataset (I haven't tried to let it finish)
# ( rf.final <- randomForest(y ~ ., data=rf.data, ntree=b, importance=TRUE, norm.votes=TRUE, proximity=TRUE) )
# I use this form because it's a lot faster
( rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F) )
################################################################################################
# MODEL VALIDATION
# PREDICT TO VALIDATION DATA
# Determines the percent correctly classified
rf.pred <- predict(rf.final, rf.data[,2:ncol(rf.data)], type="response")
rf.prob <- as.data.frame(predict(rf.final, rf.data[,2:ncol(rf.data)], type="prob"))
ObsPred <- data.frame(cbind(Observed=as.numeric(as.character(ydata)),
PRED=as.numeric(as.character(rf.pred)), Prob1=rf.prob[,2],
Prob0=rf.prob[,1]) )
op <- (ObsPred$Observed == ObsPred$PRED)
( pcc <- (length(op[op == "TRUE"]) / length(op))*100 )
# PREDICT MODEL PROBABILITIES RASTER
# The first line of code says what directory I'm working, and then what folder in that directory has the raster files that I'm using to predict the range
# The second line defines the x variable, which is my final Random Forest model
rpath=paste('~/YOURPATH', "example", sep="/")
xvars <- stack(paste(rpath, paste(rownames(rf.final$importance), "img", sep="."), sep="/"))
tr <- blockSize(xvars)
s <- writeStart(xvars[[1]], filename=paste('~/YOURPATH', "prob_LAZB_Rehfeldt.img", sep="/"), overwrite=TRUE)
for (i in 1:tr$n) {
v <- getValuesBlock(xvars, row=tr$row[i], nrows=tr$nrows[i])
v <- as.data.frame(v)
rf.pred <- predict(rf.final, v, type="prob")[,2]
writeValues(s, rf.pred, tr$row[i])
}
s <- writeStop(s)
prob_LAZB <- raster("prob_LAZB_Rehfeldt.img")
# Write range prediction raster to .pdf
pdf(file="LAZB_range_pred.pdf")
plot(prob_LAZB)
map("state", add = TRUE)
dev.off()
Thanks!!
Did you try setting `na.action` in your call to RF? The option is clearly labelled in the randomForest R manual. Your call to RF would look like this:
rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F, na.action = na.omit)
This will tell RF to omit rows where NA exists, thereby throwing out those observations. This is not necessarily the best approach, but it might be handy in your situation.
Option 2: rfImpute or na.roughfix: This will fill in your NAs so that you can go ahead with your prediction. Watch out as this can give you spurious predictions wherever the NAs are being imputed/"fixed".
Option 3: Start with Option 2, and after you get your prediction, bring your raster into your GIS/Image processing software of choice, and mask out the areas you don't want. In your case, masking out water bodies would be pretty simple.
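Another way to keep the NA cells out of the prediction step, sketched here under stated assumptions, is to predict only on rows without NAs and leave NA in the output everywhere else. To stay runnable with base R alone, lm() stands in for the random forest and the data are made up; the indexing pattern is the point, not the model.

```r
# Hypothetical stand-in: lm() replaces the random forest so the sketch
# runs with base R only.
set.seed(42)
train <- data.frame(a = runif(50), b = runif(50))
train$y <- train$a + 2 * train$b + rnorm(50, sd = 0.1)
fit <- lm(y ~ a + b, data = train)

# One block of cell values; the NA rows stand in for the ocean cells.
v <- data.frame(a = c(0.1, NA, 0.5, 0.9), b = c(0.2, 0.4, NA, 0.8))

# Predict only where every predictor is present; keep NA elsewhere.
ok <- complete.cases(v)
pred <- rep(NA_real_, nrow(v))
pred[ok] <- predict(fit, newdata = v[ok, , drop = FALSE])
```

In the block loop from the question, the same pattern would wrap the existing predict(rf.final, v, type="prob")[,2] call, so that the vector passed to writeValues keeps one entry per cell. Blocks that contain no complete rows may need to be skipped explicitly; that case is not handled here.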

R quantmod data merge regression error

This R code throws an error, namely
Error in .xts(e, .index(e1), .indexCLASS = indexClass(e1),
.indexFORMAT = indexFormat(e1), : index length must match number
of observations
Code:
library('quantmod')
library('foreach')
JNK <- getSymbols('JNK', from='2010-01-01',auto.assign=FALSE)[,6]
GSPC <- getSymbols('^GSPC', from='2010-01-01',auto.assign=FALSE)[,6]
JNK <- diff(log(JNK))
GSPC <- diff(log(GSPC))
Data <- na.omit(merge(JNK,GSPC, all=FALSE))
m <- lm(JNK ~ GSPC, data=Data)
plot(m)
Could anyone help me figure out what I'm doing wrong?
The actual column names of Data are JNK.Adjusted and GSPC.Adjusted. Hence, you should specify the complete names in the lm call:
m <- lm(JNK.Adjusted ~ GSPC.Adjusted, data=Data)
plot(m)
Otherwise, the names JNK and GSPC are not found as columns in Data, and R falls back to the objects of the same names in the calling environment (here the unmerged xts series).
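A quick way to spot this kind of mismatch is to compare the variables in the formula with the column names of the data. The following is a minimal sketch, assuming a plain data frame with random numbers in place of the merged xts object; renaming the columns is shown as an alternative to spelling out the full names in the formula.

```r
# Hypothetical data frame standing in for the merged xts object.
set.seed(7)
Data <- data.frame(JNK.Adjusted = rnorm(20), GSPC.Adjusted = rnorm(20))

# Variables named in the formula but absent from the data.
f <- JNK ~ GSPC
setdiff(all.vars(f), colnames(Data))   # both names are reported as missing

# Renaming the columns lets the short formula work.
colnames(Data) <- c("JNK", "GSPC")
m <- lm(f, data = Data)
```

colnames<- is also available for xts objects, so the same renaming should work on the merged series from the question, although that is not run here.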
