Problems with raster prediction from linear model in R

I'm having problems predicting a raster using a linear model.
First, I create my model from the data found in my polygons.
# create model
poly <- st_read("polygon.shp")
df <- na.omit(poly)
df <- df[df$gdp > 0 & df$ntl2 > 0 & df$pop2 > 0,]
x <- log(df$ntl2)
y <- log(df$gdp*df$pop2)
c <- df$iso
d <- data.frame(x,y,c)
m <- lm(y~x+c,data=d)
Then I want to use raster::predict to estimate an output raster:
# raster data
iso <- raster("iso.tif")
viirs <- raster("viirs.tif")
x <- log(viirs)
c <- iso
## predict with models
s <- stack(x,c)
predicted <- raster::predict(x,model=m)
However, I get the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
object is not a matrix
I don't know what the problem is or how to fix it. My current thought is that it has something to do with the factors/country codes:
My model includes country codes, as I would like to include country fixed effects. Maybe there is a problem with including these. However, even when I exclude the country codes from the model and from the entire data frame, I still get the same error message.
Furthermore, my model is based on regional values from the whole world, while the prediction datasets only cover the extent of Turkey. Maybe this is the problem?
And here is the data:
https://drive.google.com/open?id=16cy7CJFrxQCTLhx-hXDNHJz8ej3vTEED

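Perhaps it works if you do it like this (note that your call passes the single layer x rather than the stack s, and that the layer names must match the variable names used in the model):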
iso <- raster("iso.tif")
viirs <- raster("viirs.tif")
s <- stack(log(viirs), iso)
names(s) <- c("x", "c")
predicted <- raster::predict(s, model=m)
It won't work if the values in df$iso and iso.tif don't match (is one a factor, and the other numeric?).
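The name-matching requirement can be illustrated without rasters. The following is a minimal sketch on made-up data, using plain data frames in place of raster layers (an assumption for illustration; the values and country codes are invented). It shows that predict() looks variables up by name, and that a factor in the new data should only carry levels that were seen when fitting.

```r
# Synthetic stand-in for the polygon data (values invented for illustration)
set.seed(42)
d <- data.frame(
  x = log(runif(60, 1, 100)),
  c = factor(sample(c("TUR", "GRC", "BGR"), 60, replace = TRUE))
)
d$y <- 2 + 0.8 * d$x + as.integer(d$c) + rnorm(60, sd = 0.1)
m <- lm(y ~ x + c, data = d)

# New data must use the same variable names as the model ("x" and "c"),
# and "c" must be a factor with levels the model has already seen.
newdata <- data.frame(
  x = log(c(5, 50)),
  c = factor(c("TUR", "TUR"), levels = m$xlevels$c)
)
pred <- predict(m, newdata = newdata)

# Columns with other names fail, because predict() cannot find "x" and "c".
bad <- data.frame(viirs = log(c(5, 50)), iso = c("TUR", "TUR"))
bad_result <- tryCatch(predict(m, newdata = bad), error = function(e) "error")
```

With a raster stack the same idea applies: names(s) plays the role of the column names of newdata.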

Related

Fixing missing data- how to transform table into ts object that works with KalmanRun?

I'm working with data from SteamCharts on the game Warframe (https://steamdb.info/app/230410/graphs/)
Edit: The data is a .csv file that can be downloaded near "Steam charts for every day"
I'm modelling this timeseries data, but the package I'm using requires no missing values. To resolve this, I'm using arima to predict the missing values (instructions from link reproduced below)
https://stats.stackexchange.com/questions/104565/how-to-use-auto-arima-to-impute-missing-values
require(forecast)
# sample series
x0 <- x <- log(AirPassengers)
y <- x
# set some missing values
x[c(10,60:71,100,130)] <- NA
# fit model
fit <- auto.arima(x)
# Kalman filter
kr <- KalmanRun(x, fit$model)
# impute missing values Z %*% alpha at each missing observation
id.na <- which(is.na(x))
for (i in id.na)
  y[i] <- fit$model$Z %*% kr$states[i,]
# alternative to the explicit loop above
sapply(id.na, FUN = function(x, Z, alpha) Z %*% alpha[x,],
       Z = fit$model$Z, alpha = kr$states)
So far, I've managed to do the following.
Convert the date strings to Date objects in my data frame:
df <- read.csv(file="chart.csv", header=TRUE, sep=",")
df = df %>% select(DateTime, Players)  # select() and %>% come from dplyr
df[['DateTime']] <- as.Date(strptime(df[['DateTime']], format='%Y-%m-%d %H:%M:%S'))
Get an xts object of my data (I believe arima only works with ts though)
df = xts(df$Players, df$DateTime)
df = ts(df)
The arima model fits, but when I try to use the KalmanRun, I get the following error:
Error in KalmanRun(x, fit$model) : invalid argument type
I believe there's an issue in how I'm converting it to a timeseries object, but don't know how to resolve it. Any help would be greatly appreciated. Thanks!
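One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: player counts read from a CSV are often stored as integers, and KalmanRun may reject a series that is not stored as double. Below is a minimal sketch with made-up daily counts (the dates and values are invented, and arima() from stats stands in for auto.arima()). It builds a regular daily series with explicit NA gaps and coerces it to double before filtering.

```r
# Made-up daily player counts with two days absent (values invented)
dates   <- as.Date("2020-01-01") + c(0:9, 12:29)
players <- as.integer(round(1000 + 50 * sin(seq_along(dates) / 3)))
df <- data.frame(DateTime = dates, Players = players)

# Put the data on a complete daily grid so that gaps become explicit NAs
full <- data.frame(DateTime = seq(min(df$DateTime), max(df$DateTime), by = "day"))
full <- merge(full, df, by = "DateTime", all.x = TRUE)

# Coerce to double before building the ts object
x <- ts(as.numeric(full$Players))

# Fit a simple ARIMA model (arima() allows NAs) and run the Kalman filter
fit <- arima(x, order = c(0, 1, 1), method = "ML")
kr  <- KalmanRun(x, fit$model)

# Impute each missing value as Z %*% state, as in the linked recipe
y <- x
id.na <- which(is.na(x))
for (i in id.na) y[i] <- fit$model$Z %*% kr$states[i, ]
```

The model order here is arbitrary; the point of the sketch is the regular grid and the storage type, not the choice of model.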

"Error in model.frame.default(data = train, formula = cost ~ .) : variable lengths differ", but all variables are length 76?

I'm modeling burrito prices in San Diego to determine whether some burritos are over/under priced (according to the model). I'm attempting to use regsubsets() to determine the best linear model, using the BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that variable lengths differ, and thus a linear model doesn't work.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs
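A possible explanation, stated as an assumption since the data set is not reproduced here: the formula uses lowercase cost, which is not a column of train (the column appears to be Cost), so R falls back to the global vector cost of length 76 while the columns of train have 50 rows. The following minimal sketch uses invented data and lm() in place of regsubsets() to reproduce the mismatch.

```r
# Invented data shaped like the question: 76 rows, capitalised column names
set.seed(565)
burritos <- data.frame(Cost = runif(76, 5, 10),
                       Yelp = runif(76, 3, 5),
                       Hunger = runif(76, 1, 5))
cost <- burritos$Cost                          # global copy, length 76

train <- burritos[sample(seq_len(nrow(burritos)), 50), ]

# "cost" is not a column of train, so the length-76 global vector is used
# against 50-row columns, and model.frame() raises an error.
msg <- tryCatch(lm(cost ~ ., data = train),
                error = function(e) conditionMessage(e))
print(msg)

# Referring to the column by its actual name keeps everything inside train
fit <- lm(Cost ~ ., data = train)
nobs(fit)
```

If this is the cause, writing the formula as Cost ~ . in the regsubsets() call and dropping the standalone copies of the columns would avoid the clash.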

How to run a loop inside a loop for a gam object

I am trying to predict new observations after multiple imputation. Both the new data and the models are list objects. The question is not whether the approach is correct, but how to use the predict function after multiple imputation when the new data is a list. Below is my code.
library(betareg)
library(mice)
library(mgcv)
data(GasolineYield)
dat1 <- GasolineYield
dat1$yield <- with(dat1,
ifelse(yield > 0.40 | yield < 0.17,NA,yield)) # created missing values
datim <- mice(dat1,m=30) #imputing missing values
mod1 <- with(datim,gam(yield ~ batch + emp,family=betar(link="logit"))) #fit models using gam
Creating the data set to be used for prediction:
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
The code below just tests predict without newdata. The problem I observed is that tp is saved as a 1 by 32 matrix instead of a 30 by 32 matrix. The print call prints all 30 sets of 32 predictions, but I couldn't save them as such.
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,type = "response")
print(tp)
}
The code below is my attempt to predict new observations using newdata. Here I am lost and not sure how to go about it.
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,newdata=datsplit[[i]], type = "response")
print(tp)
}
Can someone help me out on how best to go about it?
I finally solved the problem. Here is the solution:
datnew <- complete(datim,"long") # stack all the imputed data sets
# Note that this should be your new data set; I am assuming that it was not used in building the model.
# My aim in opening this thread was to address how to predict observations from new data using a model built on multiply imputed data sets.
datsplit <- split(datnew,datnew$.imp)
tot <- list()
tot_ <- list()
for(i in 1:30){
for(j in 1:30){
tot[[j]] <- predict.gam(mod1$analyses[[i]],newdata=datsplit[[j]])
}
tot_[[i]] <- tot
}
# flatten the lists within lists (flatten() and %>% come from the purrr package)
totfl <- tot_ %>% flatten()
#nrow is the number of observations to be predicted as contained in the
#newdata set (datsplit)
totn <- matrix(unlist(totfl),nrow=32)
apply(totn,1,mean) #takes the means of prediction across the 30 data set
I hope this helps those with similar questions. I once came across a question on how to predict newdata after multiple imputation, I guess this will answer some of the questions contained in that thread.
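The averaging step can also be written without flattening nested lists. This is a minimal sketch under stated assumptions: lm() fits on resampled copies of the built-in mtcars data stand in for the gam fits in mod1$analyses, and new_obs is a hypothetical new data set. Neither comes from the original post.

```r
# Stand-ins for the m fitted models (in the post these are mod1$analyses)
set.seed(1)
m <- 5
fits <- lapply(seq_len(m), function(i) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  lm(mpg ~ wt + hp, data = boot)
})

# Hypothetical new observations to predict
new_obs <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))

# One column of predictions per model: rows are observations, columns are models
pred_mat <- sapply(fits, function(f) predict(f, newdata = new_obs))

# Pooled prediction: average across the m models for each observation
pooled <- rowMeans(pred_mat)
```

For gam fits, type = "response" can be passed to predict() inside the sapply() call if predictions on the response scale are wanted.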

SVM is not generating forecast using R

I have sales data for 5 different products along with weather information. To explain the data: we have daily sales at a particular store and daily weather information, such as the temperature and average speed, for the area where the store is located.
I am using a support vector machine for prediction. It works well for all the products except one, for which it gives me the following output instead of a forecast:
tunedModelLOG
named numeric(0)
Below is the code:
# load the packages
library(zoo)
library(MASS)
library(e1071)
library(rpart)
library(caret)
normalize <- function(x) {
a <- min(x, na.rm=TRUE)
b <- max(x, na.rm=TRUE)
(x - a)/(b - a)
}
# Define the train and test data
test_data <- train[1:23,]
train_data<-train[24:nrow(train),]
# Define the factors for the categorical data
names<-c("year","month","dom","holiday","blackfriday","after1","back1","after2","back2","after3","back3","is_weekend","weeday")
train_data[,names]<- lapply(train_data[,names],factor)
test_data[,names] <- lapply(test_data[,names],factor)
# Normalized the continuous data
normalized<-c("snowfall","depart","cool","preciptotal","sealevel","stnpressure","resultspeed","resultdir")
train_data[,normalized] <- data.frame(lapply(train_data[,normalized], normalize))
test_data[,normalized] <- data.frame(lapply(test_data[,normalized], normalize))
# Define the same level in train and test data
levels(test_data$month)<-levels(train_data$month)
levels(test_data$dom)<-levels(train_data$dom)
levels(test_data$year)<-levels(train_data$year)
levels(test_data$after1)<-levels(train_data$after1)
levels(test_data$after2)<-levels(train_data$after2)
levels(test_data$after3)<-levels(train_data$after3)
levels(test_data$back1)<-levels(train_data$back1)
levels(test_data$back2)<-levels(train_data$back2)
levels(test_data$back3)<-levels(train_data$back3)
levels(test_data$holiday)<-levels(train_data$holiday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$blackfriday)<-levels(train_data$blackfriday)
levels(test_data$weeday)<-levels(train_data$weeday)
# Fit the SVM model and tune the parameters
svmReFitLOG=tune(svm,logunits~year+month+dom+holiday+blackfriday+after1+after2+after3+back1+back2+back3+is_weekend+depart+cool+preciptotal+sealevel+stnpressure+resultspeed+resultdir,data=train_data,ranges = list(epsilon = c(0,0.1,0.01,0.001), cost = 2^(2:9)))
retunedModeLOG <- svmReFitLOG$best.model
tunedModelLOG <- predict(retunedModeLOG,test_data)
Working file is available at the below link
https://drive.google.com/file/d/0BzCJ8ytbECPMVVJ1UUg2RHhQNFk/view?usp=sharing
What am I doing wrong? I would appreciate any kind of help.
Thanks in advance.
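One possible cause, offered as an assumption rather than a confirmed diagnosis: predict() for an svm model omits rows with missing values by default, and normalize() returns NaN for any column that is constant within the 23 test rows (for example snowfall, if it never snowed on those days). That would leave no rows to predict. The sketch below shows the effect on invented values, together with a hypothetical variant normalize_with() that reuses the training range.

```r
normalize <- function(x) {
  a <- min(x, na.rm = TRUE)
  b <- max(x, na.rm = TRUE)
  (x - a) / (b - a)
}

# A column that is constant in the test rows gives 0/0 = NaN everywhere
test_snowfall <- c(0, 0, 0, 0)
normalize(test_snowfall)

# Hypothetical alternative: scale new data with the range seen in training,
# so train and test share one scale and a constant test column stays finite
normalize_with <- function(x, ref) {
  a <- min(ref, na.rm = TRUE)
  b <- max(ref, na.rm = TRUE)
  (x - a) / (b - a)
}
train_snowfall <- c(0, 1.5, 0, 3)
scaled <- normalize_with(test_snowfall, train_snowfall)

# Rows containing NA or NaN can be counted before predicting
n_complete <- sum(complete.cases(data.frame(snow = normalize(test_snowfall))))
```

Checking sum(complete.cases(test_data)) before calling predict() would show whether any usable rows remain for the failing product.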

NAs in rasters and randomForest::predict()

New here, please let me know if you need more info.
My goal: I am using Rehfeldt climate data and eBird presence/absence data to produce niche models using Random Forest models.
My problem: I want to predict niche models for the entirety of North America. The Rehfeldt climate rasters have data values for every cell on the continent, but these are surrounded by NAs in the "ocean cells". See the plot here, where I have colored the NAs dark green. randomForest::predict() does not run if the independent dataset contains NAs. Thus, I want to crop my climate rasters (or set a working extent?) so that the predict() function only operates over the cells which contain data.
Troubleshooting:
I've run the Random Forest model using a smaller extent which does not include the "NA oceans" of the rasters and the model runs just fine. So, I know the NAs are the problem. However, I don't want to predict my niche models for just a rectangular chunk of North America.
I used flowla's approach here for cropping and masking rasters using a polygon shapefile for North America. I hoped that this would remove the NAs but it doesn't. Is there something similar I can do to remove the NAs?
I've done some reading but can't figure out a way to adjust the Random Forest code itself so that predict() ignores NAs. This post looks relevant but I'm not sure whether it helps in my case.
Data
My rasters, the input presence/absence text file, and code for additional functions are here. Use with the main code below for a reproducible example.
Code
require(sp)
require(rgdal)
require(raster)
library(maptools)
library(mapproj)
library(dismo)
library(maps)
library(proj4)
data(stateMapEnv)
# This source code has all of the functions necessary for running the Random Forest models, as well as the code for the function detecting multi-collinearity
source("Functions.R")
# Read in Rehfeldt climate rasters
# these rasters were converted to .img and given WGS 84 projection in ArcGIS
d100 <- raster("d100.img")
dd0 <- raster("dd0.img")
dd5 <- raster("dd5.img")
fday <- raster("fday.img")
ffp <- raster("ffp.img")
gsdd5 <- raster("gsdd5.img")
gsp <- raster("gsp.img")
map <- raster("map.img")
mat <- raster("mat_tenths.img")
mmax <- raster("mmax_tenths.img")
mmin <- raster("mmin_tenths.img")
mmindd0 <- raster("mmindd0.img")
mtcm <- raster("mtcm_tenths.img")
mtwm <- raster("mtwm_tenths.img")
sday <- raster("sday.img")
smrpb <- raster("smrpb.img")
# add separate raster files into one big raster, with each file being a different layer.
rehfeldt <- addLayer(d100, dd0, dd5, fday, ffp, gsdd5, gsp, map, mat, mmax, mmin, mmindd0, mtcm, mtwm, sday, smrpb)
# plot some rasters to make sure everything worked
plot(d100)
plot(rehfeldt)
# read in presence/absence data
LAZB.INBUtemp <- read.table("LAZB.INBU.txt", header=T, sep = "\t")
colnames(LAZB.INBUtemp) <- c("Lat", "Long", "LAZB", "INBU")
LAZB.INBUtemp <- LAZB.INBUtemp[c(2,1,3,4)]
LAZB.INBU <- LAZB.INBUtemp
latpr <- (LAZB.INBU$Lat)
lonpr <- (LAZB.INBU$Long)
sites <- SpatialPoints(cbind(lonpr, latpr))
LAZB.INBU.spatial <- SpatialPointsDataFrame(sites, LAZB.INBU, match.ID=TRUE)
# The below function extracts raster values for each of the different layers for each of the eBird locations
pred <- raster::extract(rehfeldt, LAZB.INBU.spatial)
LAZB.INBU.spatial@data = data.frame(LAZB.INBU.spatial@data, pred)
LAZB.INBU.spatial@data <- na.omit(LAZB.INBU.spatial@data)
# ITERATIVE TEST FOR MULTI-COLINEARITY
# Determines which variables show multicolinearity
cl <- MultiColinear(LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)], p=0.05)
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
for(l in cl) {
cl.test <- xdata[,-which(names(xdata)==l)]
print(paste("REMOVE VARIABLE", l, sep=": "))
MultiColinear(cl.test, p=0.05)
}
# REMOVE MULTI-COLINEAR VARIABLES
for(l in cl) { LAZB.INBU.spatial@data <- LAZB.INBU.spatial@data[,-which(names(LAZB.INBU.spatial@data)==l)] }
################################################################################################
# FOR LAZB
# RANDOM FOREST MODEL AND RASTER PREDICTION
require(randomForest)
# NUMBER OF BOOTSTRAP REPLICATES
b=1001
# CREATE X,Y DATA
# use column 3 for LAZB and 4 for INBU
ydata <- as.factor(LAZB.INBU.spatial@data[,3])
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
# PERCENT OF PRESENCE OBSERVATIONS
( dim(LAZB.INBU.spatial[LAZB.INBU.spatial$LAZB == 1, ])[1] / dim(LAZB.INBU.spatial)[1] ) * 100
# RUN RANDOM FORESTS MODEL SELECTION FUNCTION
# This model is using the model improvement ratio to select a final model.
pdf(file = "LAZB Random Forest Model Rehfeldt.pdf")
( rf.model <- rf.modelSel(x=xdata, y=ydata, imp.scale="mir", ntree=b) )
dev.off()
# RUN RANDOM FORESTS CLASS BALANCE BASED ON SELECTED VARIABLES
# This code would help in the case of imbalanced sample
mdata <- data.frame(y=ydata, xdata[,rf.model$SELVARS])
rf.BalModel <- rfClassBalance(mdata[,1], mdata[,2:ncol(mdata)], "y", ntree=b)
# CREATE NEW XDATA BASED ON SELECTED MODEL AND RUN FINAL RF MODEL
sel.vars <- rf.model$PARAMETERS[[3]]
rf.data <- data.frame(y=ydata, xdata[,sel.vars])
write.table(rf.data, "rf.data.txt", sep = ",", row.names = F)
# This is the code given to me; it takes forever to run for my dataset (I haven't tried to let it finish)
# ( rf.final <- randomForest(y ~ ., data=rf.data, ntree=b, importance=TRUE, norm.votes=TRUE, proximity=TRUE) )
# I use this form because it's a lot faster
( rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F) )
################################################################################################
# MODEL VALIDATION
# PREDICT TO VALIDATION DATA
# Determines the percent correctly classified
rf.pred <- predict(rf.final, rf.data[,2:ncol(rf.data)], type="response")
rf.prob <- as.data.frame(predict(rf.final, rf.data[,2:ncol(rf.data)], type="prob"))
ObsPred <- data.frame(cbind(Observed=as.numeric(as.character(ydata)),
PRED=as.numeric(as.character(rf.pred)), Prob1=rf.prob[,2],
Prob0=rf.prob[,1]) )
op <- (ObsPred$Observed == ObsPred$PRED)
( pcc <- (length(op[op == "TRUE"]) / length(op))*100 )
# PREDICT MODEL PROBABILITIES RASTER
# The first line of code says what directory I'm working, and then what folder in that directory has the raster files that I'm using to predict the range
# The second line defines the x variables, which come from my final Random Forest model
rpath=paste('~/YOURPATH', "example", sep="/")
xvars <- stack(paste(rpath, paste(rownames(rf.final$importance), "img", sep="."), sep="/"))
tr <- blockSize(xvars)
s <- writeStart(xvars[[1]], filename=paste('~/YOURPATH', "prob_LAZB_Rehfeldt.img", sep="/"), overwrite=TRUE)
for (i in 1:tr$n) {
v <- getValuesBlock(xvars, row=tr$row[i], nrows=tr$nrows[i])
v <- as.data.frame(v)
rf.pred <- predict(rf.final, v, type="prob")[,2]
writeValues(s, rf.pred, tr$row[i])
}
s <- writeStop(s)
prob_LAZB <- raster("prob_LAZB_Rehfeldt.img")
# Write range prediction raster to .pdf
pdf(file="LAZB_range_pred.pdf")
plot(prob_LAZB)
map("state", add = TRUE)
dev.off()
Thanks!!
Did you try setting na.action in your call to randomForest? The option is described in the randomForest manual. Your call would look like this:
rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F, na.action = na.omit)
This tells randomForest to omit rows that contain NAs, thereby throwing out those observations. Note that the manual lists na.action under the formula interface, so it may have no effect with the x/y form of the call. This is not necessarily the best approach, but it might be handy in your situation.
Option 2: rfImpute or na.roughfix: This will fill in your NAs so that you can go ahead with your prediction. Watch out as this can give you spurious predictions wherever the NAs are being imputed/"fixed".
Option 3: Start with Option 2, and after you get your prediction, bring your raster into your GIS/Image processing software of choice, and mask out the areas you don't want. In your case, masking out water bodies would be pretty simple.
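Another way to handle the NA cells, sketched here under stated assumptions, is to predict only for the rows of each block that have no missing values and to leave NA everywhere else. In this sketch lm() on invented data stands in for rf.final, v stands in for one block returned by getValuesBlock(), and the helper name predict_complete is hypothetical.

```r
# Stand-in model (in the post this is rf.final)
set.seed(7)
train <- data.frame(dd5 = runif(40), map = runif(40))
train$y <- 1 + 2 * train$dd5 - train$map + rnorm(40, sd = 0.1)
model <- lm(y ~ dd5 + map, data = train)

# Stand-in for one block of raster cell values; ocean cells are NA
v <- data.frame(dd5 = c(0.2, NA, 0.8, NA), map = c(0.5, NA, 0.1, 0.3))

# Hypothetical helper: predict on complete rows only, keep NA elsewhere
predict_complete <- function(model, v, ...) {
  out <- rep(NA_real_, nrow(v))
  ok <- complete.cases(v)
  if (any(ok)) out[ok] <- predict(model, v[ok, , drop = FALSE], ...)
  out
}

block_pred <- predict_complete(model, v)
```

For a classification forest predicted with type = "prob", the helper would need to select the relevant probability column before assigning into out, since predict() then returns a matrix.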
