R: Random forest with raster as response and explanatory variable - r

I want to create a fire occurrence probability map with the random forest method. My response variable is a raster with the mean annual burned area per grid cell. My explanatory variables are multiple rasters (temperature, elevation, land use and population density). Is it possible to use a raster as the response variable, and what would a basic line of code look like? I couldn't find any information on that.
files <- list.files(path="C:/Users/fsorb/OneDrive/Desktop/test/fire_prob", pattern="grd", all.files=FALSE, full.names=TRUE,recursive=TRUE)
predictors <- stack(files)
fire <- raster("C:/Users/fsorb/OneDrive/Desktop/test/env_data/fire.tif")
fire_occ_prob <- randomForest(fire ~ ., data = predictors, ntree=500)
This is the code I have so far, but I get the error:
Error in as.data.frame.default(data) :
can not transform ‘structure("RasterStack", package = "raster")’ into data.frame
I tried to convert the fire raster with as.data.frame, but all grid cells only get NA values.

I would try to:
1. convert the response (fire) raster to points,
2. extract the values of the predictors at the points,
3. train a random forest model using the resulting data frame.
require(raster)
require(sf)
require(dplyr)
require(randomForest)
files <- list.files(path="C:/Users/fsorb/OneDrive/Desktop/test/fire_prob", pattern="grd", all.files=FALSE, full.names=TRUE,recursive=TRUE)
predictors <- stack(files)
fire <- raster("C:/Users/fsorb/OneDrive/Desktop/test/env_data/fire.tif")
# convert raster to point
response <- rasterToPoints(fire, spatial = TRUE) %>% st_as_sf()
response$ID <- c(1:nrow(response))
colnames(response)[1] <- "response"
# combine predictor values with the response
# (the predictors are a raster RasterStack, so raster::extract is used on it)
rs_preds <- full_join(raster::extract(x=predictors, y=response, df=TRUE),
st_drop_geometry(response), by="ID")
# train random forest
fire_occ_prob <- randomForest(response ~ .,
data = rs_preds[,!names(rs_preds) %in% "ID"],
ntree=500,
importance = TRUE)
# plot variable importance
varImpPlot(fire_occ_prob)
# make spatial predictions
sp_pred <- raster::predict(predictors, model=fire_occ_prob)
If your aim is to make spatial (temporal) predictions, make sure to use a spatial (temporal) (cross-) validation strategy. For further information take a look at e.g. Roberts et al. (2016): https://doi.org/10.1111/ecog.02881
Greetings, Jan
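As an alternative to the point-based steps above, the rasters can also be turned into a data frame directly when the response and the predictors share the same grid. The following is only a minimal sketch on synthetic data: the layer names (temp, elev, fire), the grid size and the simulated values are assumptions for illustration and do not come from the original files.

```r
library(raster)
library(randomForest)

set.seed(1)

# Synthetic 20 x 20 predictor rasters (hypothetical stand-ins for real layers)
temp <- raster(nrows = 20, ncols = 20, xmn = 0, xmx = 20, ymn = 0, ymx = 20)
values(temp) <- runif(ncell(temp), 5, 30)
elev <- temp
values(elev) <- runif(ncell(elev), 0, 2000)
predictors <- stack(temp, elev)
names(predictors) <- c("temp", "elev")

# Synthetic response raster on the same grid
fire <- temp
values(fire) <- 0.1 * values(temp) + 0.001 * values(elev) +
  rnorm(ncell(temp), sd = 0.1)
names(fire) <- "fire"

# One row per grid cell; cells with NA in any layer are dropped
cells <- as.data.frame(stack(fire, predictors), na.rm = TRUE)

# Fit on the data frame, then predict back onto the predictor stack
rf <- randomForest(fire ~ ., data = cells, ntree = 100)
fire_pred <- predict(predictors, rf)
```

With real data, the response raster would first need to be aligned with the predictors (same extent, resolution and CRS) before stacking.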

Related

Using randomForest package in R, how to map Random forest prediction?

I am trying to use randomForest to generate a spatial prediction map.
I developed my model using random forest regression, but I ran into difficulty in the last step, where the best predictors are used to build the prediction map.
My code:
library(raster)
library(randomForest)
set.seed(12)
s <- stack("Density.tif", "Aqui.tif", "Rech.tif", "Rainfall.tif","Land Use.tif", "Cond.tif", "Nitrogen.tif", "Regions.tif","Soil.tif","Topo.tif", "Climatclass.tif", "Depth.tif")
points <- read.table("Coordonnées3.txt",header=TRUE, sep="\t", dec=",",strip.white=TRUE)
d <- extract(s, points)
rf <-randomForest(nitrate~ . , data=d, importance=TRUE, ntree=500, na.action = na.roughfix)
p <- predict(s, rf)
plot(p)
Sample Data:
> head(points)
LAT LONG
1 -13.057007 27.549580
2 -4.255000 15.233745
3 5.300000 -1.983610
4 7.245675 -4.233336
5 12.096330 15.036016
6 -4.255000 15.233745
The error when I run my short code is:
Error in eval(expr, envir, enclos) : object 'nitrate' not found.
I am guessing the error happens when you fit the model.
Why would there be a variable called nitrate? Given how you create your RasterStack, perhaps there is one called Nitrogen. Either way you can find out by looking at names(s) and colnames(d).
NOTE that your points are not good! They are in reverse order. The order should be (longitude, latitude).
Based on your comments (please edit your question instead), you should
add nitrate to the points file (as the third column) or something like that. Then do
xy <- points[, 2:1]
nitrate <- points[,3]
Extract points and combine with your observed data
d <- extract(s, xy)
d <- cbind(nitrate=nitrate, d)
Build model and predict
rf <-randomForest(nitrate~ . , data=d, importance=TRUE, ntree=500, na.action = na.roughfix)
p <- predict(s, rf)
It sounds like the error is coming when you are trying to build the forest. It may be most helpful to not use the formula interface. Also, if d is large, then using the formula interface is not advisable. From the help file on randomForest: "For large data sets, especially those with large number of variables, calling randomForest via the formula interface is not advised: There may be too much overhead in handling the formula."
Assuming d$nitrate exists then the solution is randomForest(y = d$nitrate, x = subset(d, select = -nitrate), importance=TRUE, ntree=500, na.action = na.roughfix)
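A small sketch of the x/y interface on made-up data follows. The data frame d and its columns here are hypothetical stand-ins for the extracted raster values. To my understanding, na.action is only handled by the formula method of randomForest, so in this sketch missing predictor values are imputed with na.roughfix before fitting; treat that as an assumption to check against the package documentation.

```r
library(randomForest)

set.seed(12)

# Hypothetical stand-in for the extracted values plus the observed response
d <- data.frame(
  nitrate  = rnorm(50, mean = 10),
  Density  = runif(50),
  Rainfall = runif(50, 200, 1200)
)
d$Rainfall[3] <- NA  # simulate a missing predictor value

# Impute missing predictor values (column medians for numeric columns)
x <- na.roughfix(d[, names(d) != "nitrate"])

# Non-formula interface: predictors in x, response in y
rf <- randomForest(x = x, y = d$nitrate, ntree = 100, importance = TRUE)
```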

How to store forecasted values using 'forecast' library in R into a CSV file?

I have fitted a TBATS model around my seasonal time-series data and used the forecast package to obtain predictions. My R code is:
library("forecast")
data = read.csv("data.csv")
season_info <- msts(data,seasonal.periods=c(24,168))
model <- tbats(season_info)
forecasted <- forecast(model,h=24,level=90)
forecasted
Now, I have a variable called 'forecasted' that outputs as such:
> forecasted
Point Forecast Lo 90 Hi 90
6.940476 5080.641 4734.760 5426.523
6.946429 5024.803 4550.111 5499.496
6.952381 4697.625 4156.516 5238.733
6.958333 4419.105 3832.765 5005.446
6.964286 4262.782 3643.528 4882.037
6.970238 4187.629 3543.062 4832.196
6.976190 4349.196 3684.444 5013.947
6.982143 4484.108 3802.574 5165.642
6.988095 4247.858 3551.955 4943.761
6.994048 3851.379 3142.831 4559.927
7.000000 3575.951 2855.962 4295.941
7.005952 3494.943 2764.438 4225.449
7.011905 3501.354 2760.968 4241.739
7.017857 3445.563 2695.781 4195.345
I need to gather the forecasted values from the column 'Forecast' and store it in a CSV file. I tried to read the page for the TBATS and 'forecast' method online, but they do not say how a particular column of forecasted values could be extracted, ignoring the other columns such as 'Hi' 'Lo' and 'Point'.
I'm looking for this output in my CSV:
hour,forecasted_value
0,5080.641
1,5024.803
2,4697.625
...
They are stored in three parts. You can look at the object structure with str(ret):
library(forecast)
fit <- tbats(USAccDeaths)
ret <- forecast(fit)
ret$upper # Upper interval
ret$lower # Lower interval
ret$mean # Point forecast
You can obtain the output shown by using print():
library("forecast")
data = read.csv("data.csv")
season_info <- msts(data,seasonal.periods=c(24,168))
model <- tbats(season_info)
forecasted <- forecast(model,h=24,level=90)
dfForec <- print(forecasted)
This will give you the data.frame. Now you can pick out the columns you want, i.e. dfForec[, 1] for only the point forecast, then use write.csv(dfForec[, 1, drop = FALSE], ...) to write it to a flat file.
Use the mean element of the forecast object to get the point forecast:
library("forecast")
data = read.csv("data.csv")
season_info <- msts(data,seasonal.periods=c(24,168))
model <- tbats(season_info)
forecasted <- forecast(model,h=24,level=90)$mean
or, if forecasted already holds the full forecast object,
forecasted$mean
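Putting the pieces together, a minimal sketch of writing only the point forecasts to a CSV in the requested hour,forecasted_value layout might look like this. It uses the built-in USAccDeaths series instead of data.csv, so the "hour" column is just a 0-based step index here, and the output path in tempdir() is an assumption for illustration.

```r
library(forecast)

# Built-in monthly series used as a stand-in for the original data
fit <- tbats(USAccDeaths)
fc  <- forecast(fit, h = 24, level = 90)

# Keep only the point forecasts, with a 0-based step index
out <- data.frame(
  hour             = seq_along(fc$mean) - 1,
  forecasted_value = as.numeric(fc$mean)
)

# Write without row names so the file has exactly two columns
csv_path <- file.path(tempdir(), "forecast.csv")
write.csv(out, csv_path, row.names = FALSE)
```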

How to bind/merge prcomp and predict data in r?

To plot a predicted validation/test data set within a training dataset in ggbiplot as addressed here, I would like to bind/merge the two datasets.
The given mwe is:
library(ggbiplot)
data(wine)
##pca on the wine dataset used as training data
wine.pca <- prcomp(wine, center = TRUE, scale. = TRUE)
##add the wine.classes as a column to the dataset
wine$class <- wine.class
##simulate test data by generating three new wine classes
wine.new.1 <- wine[c(sample(1:nrow(wine), 25)),]
wine.new.2 <- wine[c(sample(1:nrow(wine), 43)),]
wine.new.3 <- wine[c(sample(1:nrow(wine), 36)),]
##Predict PCs for the new classes by transforming
#them using the predict.prcomp function
pred.new.1 <- predict(wine.pca, newdata = wine.new.1)
pred.new.2 <- predict(wine.pca, newdata = wine.new.2)
pred.new.3 <- predict(wine.pca, newdata = wine.new.3)
##simulate the classes for the new sorts
wine.new.1$class <- rep("new.wine.1", nrow(wine.new.1))
wine.new.2$class <- rep("new.wine.2", nrow(wine.new.2))
wine.new.3$class <- rep("new.wine.3", nrow(wine.new.3))
And I've been using:
df.train.pred <- rbind(wine.pca$x, pred.new.1, pred.new.2, pred.new.3)
to fuse the two, but ggbiplot returned the error "Expected a object of class prcomp, princomp, PCA, or lda".
How can I consolidate the two so they become an object ggbiplot accepts?
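One possible direction, offered only as a sketch: the row-bound score matrix is a plain matrix, not a prcomp object, so instead of forcing it back into ggbiplot the scores can be collected in a data frame with a class column and plotted directly. The example below uses the built-in iris data rather than the wine data, and the label "new.sample" is a made-up name.

```r
set.seed(42)

# PCA on the training data (numeric columns of iris as a stand-in)
train <- iris[, 1:4]
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Simulated "new" samples, projected with the training rotation
new_rows <- iris[sample(nrow(iris), 25), 1:4]
pred_new <- predict(pca, newdata = new_rows)

# Combine training scores and projected scores, tagged by class
scores <- rbind(
  data.frame(pca$x, class = as.character(iris$Species), row.names = NULL),
  data.frame(pred_new, class = "new.sample", row.names = NULL)
)
```

The combined scores data frame can then be plotted with, for example, ggplot(scores, aes(PC1, PC2, colour = class)) + geom_point(), while the original prcomp object stays available for ggbiplot.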

SVM is not generating forecast using R

I have sales data for 5 different products along with weather information. The data contain daily sales at a particular store and daily weather information, such as the temperature and the average speed for the area where the store is located.
I am using a Support Vector Machine for prediction. It works well for all the products except one, for which the prediction comes back empty:
tunedModelLOG
named numeric(0)
Below is the code:
# load the packages
library(zoo)
library(MASS)
library(e1071)
library(rpart)
library(caret)
normalize <- function(x) {
a <- min(x, na.rm=TRUE)
b <- max(x, na.rm=TRUE)
(x - a)/(b - a)
}
# Define the train and test data
test_data <- train[1:23,]
train_data<-train[24:nrow(train),]
# Define the factors for the categorical data
names<-c("year","month","dom","holiday","blackfriday","after1","back1","after2","back2","after3","back3","is_weekend","weeday")
train_data[,names]<- lapply(train_data[,names],factor)
test_data[,names] <- lapply(test_data[,names],factor)
# Normalized the continuous data
normalized<-c("snowfall","depart","cool","preciptotal","sealevel","stnpressure","resultspeed","resultdir")
train_data[,normalized] <- data.frame(lapply(train_data[,normalized], normalize))
test_data[,normalized] <- data.frame(lapply(test_data[,normalized], normalize))
# Define the same level in train and test data
levels(test_data$month)<-levels(train_data$month)
levels(test_data$dom)<-levels(train_data$dom)
levels(test_data$year)<-levels(train_data$year)
levels(test_data$after1)<-levels(train_data$after1)
levels(test_data$after2)<-levels(train_data$after2)
levels(test_data$after3)<-levels(train_data$after3)
levels(test_data$back1)<-levels(train_data$back1)
levels(test_data$back2)<-levels(train_data$back2)
levels(test_data$back3)<-levels(train_data$back3)
levels(test_data$holiday)<-levels(train_data$holiday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$blackfriday)<-levels(train_data$blackfriday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$weeday)<-levels(train_data$weeday)
# Fit the SVM model and tune the parameters
svmReFitLOG=tune(svm,logunits~year+month+dom+holiday+blackfriday+after1+after2+after3+back1+back2+back3+is_weekend+depart+cool+preciptotal+sealevel+stnpressure+resultspeed+resultdir,data=train_data,ranges = list(epsilon = c(0,0.1,0.01,0.001), cost = 2^(2:9)))
retunedModeLOG <- svmReFitLOG$best.model
tunedModelLOG <- predict(retunedModeLOG,test_data)
Working file is available at the below link
https://drive.google.com/file/d/0BzCJ8ytbECPMVVJ1UUg2RHhQNFk/view?usp=sharing
What am I doing wrong? I would appreciate any kind of help.
Thanks in advance.
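One possible cause, stated as an assumption that has not been checked against the linked file: the normalize function returns NaN for a column that is constant within the data it is applied to, and predict for an e1071 svm model omits rows containing missing values by default, which would leave an empty result if every test row is affected. The sketch below reproduces the NaN and shows scaling the test data with the training range instead; normalize_with and the snowfall values are hypothetical.

```r
# The normalize function from the question
normalize <- function(x) {
  a <- min(x, na.rm = TRUE)
  b <- max(x, na.rm = TRUE)
  (x - a) / (b - a)
}

# A column that is constant in the (small) test set: 0 / 0 gives NaN
constant_col <- c(0, 0, 0)
nan_result <- normalize(constant_col)

# Hypothetical helper: scale with a range taken from the training data,
# falling back to zeros when the training range is degenerate
normalize_with <- function(x, lo, hi) {
  if (hi > lo) (x - lo) / (hi - lo) else rep(0, length(x))
}

train_snow <- c(0, 1.5, 3)   # made-up training values
test_snow  <- c(0, 0, 0)     # made-up test values
scaled <- normalize_with(test_snow, min(train_snow), max(train_snow))
```

Checking colSums(is.na(test_data)) before calling predict would confirm or rule out this explanation.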

NAs in rasters and randomForest::predict()

New here, please let me know if you need more info.
My goal: I am using Rehfeldt climate data and eBird presence/absence data to produce niche models using Random Forest models.
My problem: I want to predict niche models for the entirety of North America. The Rehfeldt climate rasters have data values for every cell on the continent, but these are surrounded by NAs in the "ocean cells". See the plot here, where I have colored the NAs dark green. randomForest::predict() does not run if the independent dataset contains NAs. Thus, I want to crop my climate rasters (or set a working extent?) so that the predict() function only operates over the cells which contain data.
Troubleshooting:
I've run the Random Forest model using a smaller extent which does not include the "NA oceans" of the rasters and the model runs just fine. So, I know the NAs are the problem. However, I don't want to predict my niche models for just a rectangular chunk of North America.
I used flowla's approach here for cropping and masking rasters using a polygon shapefile for North America. I hoped that this would remove the NAs but it doesn't. Is there something similar I can do to remove the NAs?
I've done some reading but can't figure out a way to adjust the Random Forest code itself so that predict() ignores NAs. This post looks relevant but I'm not sure whether it helps in my case.
Data
My rasters, the input presence/absence text file, and code for additional functions are here. Use with the main code below for a reproducible example.
Code
require(sp)
require(rgdal)
require(raster)
library(maptools)
library(mapproj)
library(dismo)
library(maps)
library(proj4)
data(stateMapEnv)
# This source code has all of the functions necessary for running the Random Forest models, as well as the code for the function detecting multi-collinearity
source("Functions.R")
# Read in Rehfeldt climate rasters
# these rasters were converted to .img and given WGS 84 projection in ArcGIS
d100 <- raster("d100.img")
dd0 <- raster("dd0.img")
dd5 <- raster("dd5.img")
fday <- raster("fday.img")
ffp <- raster("ffp.img")
gsdd5 <- raster("gsdd5.img")
gsp <- raster("gsp.img")
map <- raster("map.img")
mat <- raster("mat_tenths.img")
mmax <- raster("mmax_tenths.img")
mmin <- raster("mmin_tenths.img")
mmindd0 <- raster("mmindd0.img")
mtcm <- raster("mtcm_tenths.img")
mtwm <- raster("mtwm_tenths.img")
sday <- raster("sday.img")
smrpb <- raster("smrpb.img")
# add separate raster files into one big raster, with each file being a different layer.
rehfeldt <- addLayer(d100, dd0, dd5, fday, ffp, gsdd5, gsp, map, mat, mmax, mmin, mmindd0, mtcm, mtwm, sday, smrpb)
# plot some rasters to make sure everything worked
plot(d100)
plot(rehfeldt)
# read in presence/absence data
LAZB.INBUtemp <- read.table("LAZB.INBU.txt", header=T, sep = "\t")
colnames(LAZB.INBUtemp) <- c("Lat", "Long", "LAZB", "INBU")
LAZB.INBUtemp <- LAZB.INBUtemp[c(2,1,3,4)]
LAZB.INBU <- LAZB.INBUtemp
latpr <- (LAZB.INBU$Lat)
lonpr <- (LAZB.INBU$Long)
sites <- SpatialPoints(cbind(lonpr, latpr))
LAZB.INBU.spatial <- SpatialPointsDataFrame(sites, LAZB.INBU, match.ID=TRUE)
# The below function extracts raster values for each of the different layers for each of the eBird locations
pred <- raster::extract(rehfeldt, LAZB.INBU.spatial)
LAZB.INBU.spatial@data = data.frame(LAZB.INBU.spatial@data, pred)
LAZB.INBU.spatial@data <- na.omit(LAZB.INBU.spatial@data)
# ITERATIVE TEST FOR MULTI-COLINEARITY
# Determines which variables show multicolinearity
cl <- MultiColinear(LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)], p=0.05)
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
for(l in cl) {
cl.test <- xdata[,-which(names(xdata)==l)]
print(paste("REMOVE VARIABLE", l, sep=": "))
MultiColinear(cl.test, p=0.05)
}
# REMOVE MULTI-COLINEAR VARIABLES
for(l in cl) { LAZB.INBU.spatial@data <- LAZB.INBU.spatial@data[,-which(names(LAZB.INBU.spatial@data)==l)] }
################################################################################################
# FOR LAZB
# RANDOM FOREST MODEL AND RASTER PREDICTION
require(randomForest)
# NUMBER OF BOOTSTRAP REPLICATES
b=1001
# CREATE X,Y DATA
# use column 3 for LAZB and 4 for INBU
ydata <- as.factor(LAZB.INBU.spatial@data[,3])
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
# PERCENT OF PRESENCE OBSERVATIONS
( dim(LAZB.INBU.spatial[LAZB.INBU.spatial$LAZB == 1, ])[1] / dim(LAZB.INBU.spatial)[1] ) * 100
# RUN RANDOM FORESTS MODEL SELECTION FUNCTION
# This model is using the model improvement ratio to select a final model.
pdf(file = "LAZB Random Forest Model Rehfeldt.pdf")
( rf.model <- rf.modelSel(x=xdata, y=ydata, imp.scale="mir", ntree=b) )
dev.off()
# RUN RANDOM FORESTS CLASS BALANCE BASED ON SELECTED VARIABLES
# This code would help in the case of imbalanced sample
mdata <- data.frame(y=ydata, xdata[,rf.model$SELVARS])
rf.BalModel <- rfClassBalance(mdata[,1], mdata[,2:ncol(mdata)], "y", ntree=b)
# CREATE NEW XDATA BASED ON SELECTED MODEL AND RUN FINAL RF MODEL
sel.vars <- rf.model$PARAMETERS[[3]]
rf.data <- data.frame(y=ydata, xdata[,sel.vars])
write.table(rf.data, "rf.data.txt", sep = ",", row.names = F)
# This the code given to me; takes forever to run for my dataset (I haven't tried to let it finish)
# ( rf.final <- randomForest(y ~ ., data=rf.data, ntree=b, importance=TRUE, norm.votes=TRUE, proximity=TRUE) )
# I use this form because it's a lot faster
( rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F) )
################################################################################################
# MODEL VALIDATION
# PREDICT TO VALIDATION DATA
# Determines the percent correctly classified
rf.pred <- predict(rf.final, rf.data[,2:ncol(rf.data)], type="response")
rf.prob <- as.data.frame(predict(rf.final, rf.data[,2:ncol(rf.data)], type="prob"))
ObsPred <- data.frame(cbind(Observed=as.numeric(as.character(ydata)),
PRED=as.numeric(as.character(rf.pred)), Prob1=rf.prob[,2],
Prob0=rf.prob[,1]) )
op <- (ObsPred$Observed == ObsPred$PRED)
( pcc <- (length(op[op == "TRUE"]) / length(op))*100 )
# PREDICT MODEL PROBABILITIES RASTER
# The first line of code says what directory I'm working, and then what folder in that directory has the raster files that I'm using to predict the range
# The second line defines the x variables, which are the predictors used by my final Random Forest model
rpath=paste('~/YOURPATH', "example", sep="/")
xvars <- stack(paste(rpath, paste(rownames(rf.final$importance), "img", sep="."), sep="/"))
tr <- blockSize(xvars)
s <- writeStart(xvars[[1]], filename=paste('~/YOURPATH', "prob_LAZB_Rehfeldt.img", sep="/"), overwrite=TRUE)
for (i in 1:tr$n) {
v <- getValuesBlock(xvars, row=tr$row[i], nrows=tr$nrows[i])
v <- as.data.frame(v)
rf.pred <- predict(rf.final, v, type="prob")[,2]
writeValues(s, rf.pred, tr$row[i])
}
s <- writeStop(s)
prob_LAZB <- raster("prob_LAZB_Rehfeldt.img")
# Write range prediction raster to .pdf
pdf(file="LAZB_range_pred.pdf")
plot(prob_LAZB)
map("state", add = TRUE)
dev.off()
Thanks!!
Did you try setting na.action in your call to RF? The option is documented in the randomForest R manual. Your call to RF would look like this:
rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F, na.action = na.omit)
This will tell RF to omit rows where NA exists, thereby throwing out those observations. This is not necessarily the best approach, but it might be handy in your situation.
Option 2: rfImpute or na.roughfix: This will fill in your NAs so that you can go ahead with your prediction. Watch out as this can give you spurious predictions wherever the NAs are being imputed/"fixed".
Option 3: Start with Option 2, and after you get your prediction, bring your raster into your GIS/Image processing software of choice, and mask out the areas you don't want. In your case, masking out water bodies would be pretty simple.
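Another way to handle the ocean cells, sketched here under assumptions, is to leave the model untouched and skip the NA rows at prediction time: predict only for rows where all predictors are present and write NA for the rest. The training data, the block v and the column names a and b below are made up; in the block-wise loop from the question, v would be the data frame built from getValuesBlock.

```r
library(randomForest)

set.seed(7)

# Made-up presence/absence training data with two predictors
train <- data.frame(
  y = factor(rep(c(0, 1), each = 30)),
  a = c(rnorm(30), rnorm(30, mean = 2)),
  b = runif(60)
)
rf <- randomForest(y ~ ., data = train, ntree = 100)

# Made-up block of raster values; NA rows stand in for ocean cells
v <- data.frame(a = c(0.1, NA, 2.2, NA), b = c(0.5, 0.2, NA, NA))

# Predict only complete rows, keep NA everywhere else
ok <- complete.cases(v)
prob <- rep(NA_real_, nrow(v))
if (any(ok)) {
  prob[ok] <- predict(rf, v[ok, , drop = FALSE], type = "prob")[, 2]
}
```

The resulting vector has one value per cell, so it can be passed to writeValues in place of the direct predict output. As far as I know, raster::predict applied to the stack also drops NA cells by default (its na.rm argument), which may avoid the manual loop altogether.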

Resources