I'm working on a project investigating the effects of risk-taking behaviour on entrepreneurship, i.e. how culture affects entrepreneurial activity. I run regressions to see the impact of the independent variables on the dependent variables, which represent the presence of entrepreneurial intention.
I would like to export the regression tables and summary tables I have constructed to the LyX document processor, in order to give them a more scientific presentation.
What process must I follow to do this?
EDIT:
MY DATASET:
My dataset is quite big, and even if I use the command dput(head(GemData,10)) the result is too big to post here! Is there any other way?
MY CODE:
## you need the 'haven' package for loading a .dta file
library(haven)
GemData <- read_dta(("C:/Users/ILIAS/Documents/Bachelors Thesis/GEM Dataset.dta"))
#### Stepwise Regression for y1 = 'all_high_stat_entre' and y2 = 'all_fear_fail' ####
library(MASS)
index<-which(is.na(GemData$all_high_stat_entre)==F)
n = nrow(GemData)
r<-NULL
for(i in 2:n){
r[i-1]=cor(GemData$all_high_stat_entre[index],GemData[index,i])
}
index.r<-which(is.na(r)==F)
## 'res' is that number of column which the response 'all_high_stat_entre' ##
res = which(r==1)
#---------------------------------------------------------------------------------------
index_fail<-which(is.na(GemData$all_fear_fail)==F)
r_fail<-NULL
for(i in 2:n){
r_fail[i-1]=cor(GemData$all_fear_fail[index_fail],GemData[index_fail,i])
}
index.r.fail<-which(is.na(r_fail)==F)
## 'res.fail' is that number of column which the response 'all_fear_fail' ##
res.fail = which(r_fail==1)
#### Stepwise regression of 'all_high_stat_entre' ####
index.r.mod = index.r[-res]
index.r.mod.1=which(abs(r)>0.3)
n.all_high = length(index.r.mod.1)
data.subset=GemData[index,index.r.mod.1]
data.subset[,(n.all_high + 1)]=GemData$all_high_stat_entre[index]
colnames(data.subset)=c(names(data.subset)[1:n.all_high],"all_high_stat_entre")
## fit a full model
full.model <- lm(all_high_stat_entre~.,data=data.subset)
min.model <- lm(all_high_stat_entre~1,data=data.subset)
## ols_step_all_possible(full.model)
library(olsrr)
ols_step_forward_p(full.model)
model.all.high = lm(all_high_stat_entre ~ all_entre_des+all_estab_bus_age2+all_est_bus_fem+all_fut_startbus+all_startbus_job+all_know_entre+all_est_bus_sect4,data=data.subset)
summary(model.all.high)
library(stargazer)  # must be loaded before the first stargazer() call
stargazer(model.all.high, title="Results",type='text')
fwd.model <- stepAIC(min.model, direction='forward', scope=(~all_entre_des+all_estab_bus_age2+all_est_bus_fem+all_fut_startbus+all_startbus_job+all_know_entre+all_est_bus_sect4),data=data.subset)
library(stargazer)
stargazer(fwd.model, title="Results",type='text')
#--------------------------------------------------------------------------------------------
#### Modeling for the response 'all_fear_fail' ####
index.r.mod.fail = index.r[-res.fail]
index.r.mod.fail.1=which(abs(r_fail)>0.3)
n.all_fail = length(index.r.mod.fail.1)
data.subset.fail=GemData[index_fail,index.r.mod.fail.1]
data.subset.fail[,(n.all_fail + 1)]=GemData$all_fear_fail[index_fail]
colnames(data.subset.fail)=c(names(data.subset.fail)[1:(n.all_fail)],"all_fear_fail")
## fit a full model
full.model.fail <- lm(all_fear_fail~.,data=data.subset.fail)
min.model.fail <- lm(all_fear_fail~1,data=data.subset.fail)
## ols_step_all_possible(full.model)
library(olsrr)
ols_step_forward_p(full.model.fail)
fwd.model.fail <- stepAIC(min.model.fail, direction='forward', scope=(~all_per_cap+all_know_entre+all_per_opp),data=data.subset.fail)
library(stargazer)
stargazer(fwd.model.fail, title="Results" , type='text')
Thanks in advance !
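One possible route, sketched under assumptions: stargazer() can emit LaTeX instead of plain text (type = "latex") and write it to a file through its out argument. As far as I know, LyX can then pull that file in via Insert > File > Child Document, or the LaTeX can be pasted into a TeX Code box. The model, the mtcars data and the file name below are stand-ins, not the GEM models from the question.

```r
# Stand-in model on built-in data; replace with model.all.high or fwd.model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical output location; any path that LyX can read will do.
tex_file <- file.path(tempdir(), "regression_table.tex")

if (requireNamespace("stargazer", quietly = TRUE)) {
  # type = "latex" produces a LaTeX table; out = writes it to the file.
  # capture.output() only keeps the LaTeX source from flooding the console.
  invisible(capture.output(
    stargazer::stargazer(fit, title = "Results", type = "latex", out = tex_file)
  ))
}
```

The same call with several models (for example stargazer(fwd.model, fwd.model.fail, ...)) puts them side by side in one table.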
I have used the R package 'flexmix' to create some regression models. I now want to export the results to Tex.
Unlike conventional models created with lm(), the flexmix models are not saved as named numerics but as FLXRoptim objects.
When I now use the normal syntax from the 'texreg' package in order to create Tex code from the model results, I am getting error messages:
"unable to find an inherited method for function ‘extract’ for signature ‘"FLXRoptim"’"
I have to access the models directly, these are stored as 'Coefmat' and I did not manage to make this usable for texreg().
library(flexmix)
library(texreg)
data("patent")
## 1. Flexmix model ##
flex.model <- flexmix(formula = Patents ~ lgRD, data = patent, k = 3,
model = FLXMRglm(family = "poisson"), concomitant = FLXPmultinom(~RDS))
re.flex.model <- refit(flex.model)
## 2. Approach of results extraction ##
comp1.flex <- re.flex.model@components[[1]][["Comp.1"]]
## 3. Not working: Tex Export ##
texreg(comp1.flex)
Do you guys have an idea how to make these model results usable for Tex export?
I have now found a workaround: 'Texreg' allows us to create Texreg models with manually specified columns.
createTexreg(coef.names, coef, se, pvalues)
Using the example from above:
## Take estimates, SEs, and p-values for Comp1 ##
est1 <- re.flex.model@components[[1]][["Comp.1"]][,1]
se1 <- re.flex.model@components[[1]][["Comp.1"]][,2]
pval1 <- re.flex.model@components[[1]][["Comp.1"]][,4]
## Take estimates, SEs, and p-values for Comp2 ##
est2 <- re.flex.model@components[[1]][["Comp.2"]][,1]
se2 <- re.flex.model@components[[1]][["Comp.2"]][,2]
pval2 <- re.flex.model@components[[1]][["Comp.2"]][,4]
## Create Texreg objects and export into Tex ##
mymodel1 <- createTexreg(row.names(comp1.flex), est1, se1, pval1)
mymodel2 <- createTexreg(row.names(comp1.flex), est2, se2, pval2)
models.flex = list(mymodel1, mymodel2)
texreg(models.flex)
That's probably the most practical way to turn such specific models into a conventional Tex output.
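The column picking above can be wrapped in a small helper so it is not repeated per component. This is only a sketch: coef_parts is a hypothetical name, and an lm coefficient matrix stands in for the flexmix one so that the example runs with base R alone. It assumes the same column layout as above (estimate in column 1, standard error in column 2, p-value in column 4).

```r
# Stand-in coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
fit <- lm(mpg ~ wt + hp, data = mtcars)
cm <- summary(fit)$coefficients

# Hypothetical helper: collect the pieces that createTexreg() takes.
coef_parts <- function(cm) {
  list(
    coef.names = rownames(cm),
    coef       = cm[, 1],
    se         = cm[, 2],
    pvalues    = cm[, 4]
  )
}

parts <- coef_parts(cm)
str(parts)

# With texreg installed, the list can be handed over directly:
# mymodel <- do.call(texreg::createTexreg, parts)
```

For the flexmix case, something like lapply(re.flex.model@components[[1]], coef_parts) should give one such list per component, assuming each element is a coefficient matrix as in the workaround above.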
I am very new to R and I am trying to write general functions instead of very specific ones for a certain file. I am trying to do Land Use Regression (using the olsrr package). My code looks like this:
library("olsrr")
library("car") #check VIF
library("heplots")
dataset <- read.csv("data.csv")
View(dataset)
dim(dataset) #991 observations with 79 variables
summary(dataset)
summary(dataset[,c("PM25","NOx","PM10","O3")]) # The outcome variables are not in numeric format
###################
# Data Management #
###################
# Convert air pollution data (PM2.5) into numeric format #
dataset$PM25 <- as.numeric(as.character(dataset$PM25))
dataset$NOx <- as.numeric(as.character(dataset$NOx))
dataset$PM10 <- as.numeric(as.character(dataset$PM10))
dataset$O3 <- as.numeric(as.character(dataset$O3))
summary(dataset[,c("PM25","NOx","PM10","O3")]) # issue solved!!
#############
# Remove NA #
#############
dataset <- na.omit(dataset)
dim(dataset) #957 observations remained
###############################
# Stepwise Regression Model #
# (olsrr Package) #
###############################
#### Set the variables that will be used for LUR ####
Y <- colnames(dataset)[8] # Outcome variable [PM2.5]
X <- colnames(dataset)[9:ncol(dataset)] # predictors
allX <- paste(X, collapse = "+") # put all predictors together
as.formula(paste(Y, "~", allX)) # Check formula for linear model
temp <- lm(as.formula(paste(Y, "~", allX)), singular.ok=TRUE, data=dataset)
summary(temp)
#### Stepwise regression ####
stepResult <- ols_step_both_p(model=temp, pent=0.1, prem=0.3, details=FALSE)
But when I run the 'ols_step_both_p' function, R gives me this message:
Error in if (pvals[minp] <= pent) { : argument is of length zero
So what should I do?
This looks like a bug in the olsrr package. Please share a reproducible example using reprex.
PS. I am the author of olsrr. You can directly contact me using the mail id provided on the CRAN page.
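While preparing a reprex, it may help to see where the message itself comes from. The sketch below does not reproduce olsrr's internals; it only shows that base R raises this error whenever the condition of an if is empty, and how to check a fit for NA (aliased) coefficients, which singular.ok=TRUE allows. Whether that is what happens inside ols_step_both_p is an assumption. The mtcars model is a stand-in.

```r
# `if` needs a length-one condition; an empty vector triggers the error.
pvals <- numeric(0)        # hypothetical: no candidate p-values available
minp  <- which.min(pvals)  # integer(0) for an empty input
msg <- tryCatch(
  {
    if (pvals[minp] <= 0.1) "enter" else "skip"
  },
  error = function(e) conditionMessage(e)
)
msg

# Stand-in model with a perfectly collinear term: lm() keeps it with an NA coefficient.
fit <- lm(mpg ~ wt + I(2 * wt), data = mtcars)
aliased <- names(which(is.na(coef(fit))))
aliased
```

Running the same names(which(is.na(coef(temp)))) check on the full model shows whether any predictors would need to be dropped before the stepwise call.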
I’m working on building predictive classifiers in R on a cancer dataset.
I’m using random forest, support vector machine and naive Bayes classifiers. I’m unable to calculate variable importance on SVM and NB models
I end up receiving the following error.
Error in UseMethod("varImp") :
no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
I would greatly appreciate it if anyone could help me.
Given
library(e1071)
model <- svm(Species ~ ., data = iris)
class(model)
# [1] "svm.formula" "svm"
library(caret)
varImp(model)
# Error in UseMethod("varImp") :
# no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
methods(varImp)
# [1] varImp.bagEarth varImp.bagFDA varImp.C5.0* varImp.classbagg*
# [5] varImp.cubist* varImp.dsa* varImp.earth* varImp.fda*
# [9] varImp.gafs* varImp.gam* varImp.gbm* varImp.glm*
# [13] varImp.glmnet* varImp.JRip* varImp.lm* varImp.multinom*
# [17] varImp.mvr* varImp.nnet* varImp.pamrtrained* varImp.PART*
# [21] varImp.plsda varImp.randomForest* varImp.RandomForest* varImp.regbagg*
# [25] varImp.rfe* varImp.rpart* varImp.RRF* varImp.safs*
# [29] varImp.sbf* varImp.train*
There is no function varImp.svm in methods(varImp), therefore the error. You might want to have a look at this post on Cross Validated, too.
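The dispatch failure can be reproduced without caret or e1071. In this sketch the generic importance and its methods are hypothetical toy names; only the mechanism (UseMethod finding no method for the object's classes) mirrors what varImp does.

```r
# A toy S3 generic with a single method for lm objects.
importance <- function(object, ...) UseMethod("importance")
importance.lm <- function(object, ...) abs(coef(object))[-1]

fit <- lm(mpg ~ wt + hp, data = mtcars)
importance(fit)  # dispatches to importance.lm

# An empty object carrying the same classes as an e1071 SVM fit.
fake_svm <- structure(list(), class = c("svm.formula", "svm"))
msg <- tryCatch(importance(fake_svm), error = function(e) conditionMessage(e))
msg  # the same kind of "no applicable method" message as above

# Once a method for the class (or a default method) exists, the call succeeds.
importance.svm <- function(object, ...) "svm method called"
res <- importance(fake_svm)
res
```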
If you use R, the variable importance can be calculated with the Importance method in the rminer package. This is my sample code:
library(rminer)
M <- fit(y~., data=train, model="svm", kpar=list(sigma=0.10), C=2)
svm.imp <- Importance(M, data=train)
For details, refer to the following link: https://cran.r-project.org/web/packages/rminer/rminer.pdf
I have created a loop that iteratively removes one predictor at a time and captures in a data frame various performance measures derived from the confusion matrix. This is not supposed to be a one size fits all solution, I don't have the time for it, but it should not be difficult to apply modifications.
Make sure that the predicted variable is last in the data frame.
I mainly needed specificity values from the models. By removing one predictor at a time, I can evaluate the importance of each predictor: the model with the smallest specificity (the one fitted without predictor i) indicates that the removed predictor is the most important. You need to decide which indicator you will base importance on.
You can also add another for loop inside to change between kernels, i.e. linear, polynomial, radial, but you might have to account for the other parameters,e.g. gamma. Change "label_fake" with your target variable and df_final with your data frame.
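Before the full versions, here is the idea on built-in data with base R only. This is a sketch under assumptions: a logistic regression stands in for the SVM and naive Bayes models, two iris species stand in for the real data, and a single train/test split is reused for every fit so that differences come from the dropped predictor rather than from resampling.

```r
# Two-class toy data; the target is moved to the last column.
df <- iris[iris$Species != "setosa", ]
df$target <- as.integer(df$Species == "virginica")  # 1 = positive class
df$Species <- NULL

set.seed(1)
train_idx  <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
predictors <- setdiff(names(df), "target")

drop_one <- do.call(rbind, lapply(predictors, function(p) {
  keep <- setdiff(names(df), p)  # all columns except the dropped predictor
  # glm may warn about near-perfect separation on such clean data.
  fit  <- suppressWarnings(glm(target ~ ., data = df[train_idx, keep], family = binomial))
  prob <- suppressWarnings(predict(fit, newdata = df[-train_idx, keep], type = "response"))
  pred <- as.integer(prob > 0.5)
  obs  <- df$target[-train_idx]
  data.frame(
    dropped     = p,
    accuracy    = mean(pred == obs),
    specificity = sum(pred == 0 & obs == 0) / sum(obs == 0)
  )
}))

# Lowest specificity first: dropping that predictor hurt the most.
drop_one[order(drop_one$specificity), ]
```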
SVM version:
library(e1071) # provides svm() and tune()
set.seed(1)
varimp_df <- NULL # df with results
ptm1 <- proc.time() # Start the clock!
for(i in 1:(ncol(df_final)-1)) { # the last var is the dep var, hence the -1
smp_size <- floor(0.70 * nrow(df_final)) # 70/30 split
train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
training <- df_final[train_ind, -c(i)] # receives all the df less 1 var
testing <- df_final[-train_ind, -c(i)]
tune.out.linear <- tune(svm, label_fake ~ .,
data = training,
kernel = "linear",
ranges = list(cost =10^seq(1, 3, by = 0.5))) # you can choose any range you see fit
svm.linear <- svm(label_fake ~ .,
kernel = "linear",
data = training,
cost = tune.out.linear[["best.parameters"]][["cost"]])
train.pred.linear <- predict(svm.linear, testing)
testing_y <- as.factor(testing$label_fake)
conf.matrix.svm.linear <- caret::confusionMatrix(train.pred.linear, testing_y)
varimp_df <- rbind(varimp_df,data.frame(
var_no=i,
variable=colnames(df_final)[i], # works for data frames and tibbles alike
cost_param=tune.out.linear[["best.parameters"]][["cost"]],
accuracy=conf.matrix.svm.linear[["overall"]][["Accuracy"]],
kappa=conf.matrix.svm.linear[["overall"]][["Kappa"]],
sensitivity=conf.matrix.svm.linear[["byClass"]][["Sensitivity"]],
specificity=conf.matrix.svm.linear[["byClass"]][["Specificity"]]))
runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed # time for running this loop
runtime1 # divide by 60 and you get minutes, /3600 you get hours
}
Naive Bayes version:
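library(caret) # provides train(), trainControl() and confusionMatrix()
ctrl <- trainControl(method = "cv", number = 10) # assumption: 'ctrl' is not defined in the original post
varimp_nb_df <- NULL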
ptm1 <- proc.time() # Start the clock!
for(i in 1:(ncol(df_final)-1)) {
smp_size <- floor(0.70 * nrow(df_final))
train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
training <- df_final[train_ind, -c(i)]
testing <- df_final[-train_ind, -c(i)]
x = training[, names(training) != "label_fake"]
y = training$label_fake
model_nb_var = train(x,y,'nb', trControl=ctrl)
predict_nb_var <- predict(model_nb_var, newdata = testing )
confusion_matrix_nb_1 <- caret::confusionMatrix(predict_nb_var, testing$label_fake)
varimp_nb_df <- rbind(varimp_nb_df, data.frame(
var_no=i,
variable=colnames(df_final)[i], # works for data frames and tibbles alike
accuracy=confusion_matrix_nb_1[["overall"]][["Accuracy"]],
kappa=confusion_matrix_nb_1[["overall"]][["Kappa"]],
sensitivity=confusion_matrix_nb_1[["byClass"]][["Sensitivity"]],
specificity=confusion_matrix_nb_1[["byClass"]][["Specificity"]]))
runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed # time for running this loop
runtime1 # divide by 60 and you get minutes, /3600 you get hours
}
Have fun!
I have sales data for 5 different products along with weather information. To explain the data: we have daily sales data for a particular store and daily weather information, such as the temperature and average speed, for the area where the store is located.
I am using a Support Vector Machine for prediction. It works well for all the products except one. It's giving me the following error:
tunedModelLOG
named numeric(0)
Below is the code:
# load the packages
library(zoo)
library(MASS)
library(e1071)
library(rpart)
library(caret)
normalize <- function(x) {
a <- min(x, na.rm=TRUE)
b <- max(x, na.rm=TRUE)
(x - a)/(b - a)
}
# Define the train and test data
test_data <- train[1:23,]
train_data<-train[24:nrow(train),]
# Define the factors for the categorical data
names<-c("year","month","dom","holiday","blackfriday","after1","back1","after2","back2","after3","back3","is_weekend","weeday")
train_data[,names]<- lapply(train_data[,names],factor)
test_data[,names] <- lapply(test_data[,names],factor)
# Normalize the continuous data
normalized<-c("snowfall","depart","cool","preciptotal","sealevel","stnpressure","resultspeed","resultdir")
train_data[,normalized] <- data.frame(lapply(train_data[,normalized], normalize))
test_data[,normalized] <- data.frame(lapply(test_data[,normalized], normalize))
# Define the same level in train and test data
levels(test_data$month)<-levels(train_data$month)
levels(test_data$dom)<-levels(train_data$dom)
levels(test_data$year)<-levels(train_data$year)
levels(test_data$after1)<-levels(train_data$after1)
levels(test_data$after2)<-levels(train_data$after2)
levels(test_data$after3)<-levels(train_data$after3)
levels(test_data$back1)<-levels(train_data$back1)
levels(test_data$back2)<-levels(train_data$back2)
levels(test_data$back3)<-levels(train_data$back3)
levels(test_data$holiday)<-levels(train_data$holiday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$blackfriday)<-levels(train_data$blackfriday)
levels(test_data$is_weekend)<-levels(train_data$is_weekend)
levels(test_data$weeday)<-levels(train_data$weeday)
# Fit the SVM model and tune the parameters
svmReFitLOG=tune(svm,logunits~year+month+dom+holiday+blackfriday+after1+after2+after3+back1+back2+back3+is_weekend+depart+cool+preciptotal+sealevel+stnpressure+resultspeed+resultdir,data=train_data,ranges = list(epsilon = c(0,0.1,0.01,0.001), cost = 2^(2:9)))
retunedModeLOG <- svmReFitLOG$best.model
tunedModelLOG <- predict(retunedModeLOG,test_data)
Working file is available at the below link
https://drive.google.com/file/d/0BzCJ8ytbECPMVVJ1UUg2RHhQNFk/view?usp=sharing
What am I doing wrong? I would appreciate any kind of help.
Thanks in advance.
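One thing worth ruling out, and this is an assumption since the data is not shown here: predict() for e1071 SVM models drops newdata rows containing NA by default (na.action = na.omit), so a test set in which every row has an NA yields an empty prediction. The normalize() function above produces NaN for any column that is constant within the 23 test rows, because it divides by max minus min. The toy data frames and the rescale_with helper below are hypothetical.

```r
normalize <- function(x) {
  a <- min(x, na.rm = TRUE)
  b <- max(x, na.rm = TRUE)
  (x - a) / (b - a)
}

# Toy data: snowfall is constant (all zero) in the short test window.
train_toy <- data.frame(snowfall = c(0, 2, 5, 1), cool = c(3, 8, 1, 6))
test_toy  <- data.frame(snowfall = c(0, 0, 0),    cool = c(2, 4, 7))

test_scaled <- as.data.frame(lapply(test_toy, normalize))
test_scaled$snowfall              # 0 / 0 for every row
sum(complete.cases(test_scaled))  # rows that would survive na.omit

# Alternative: rescale the test set with the training minimum and maximum.
rescale_with <- function(x, ref) {
  lo <- min(ref, na.rm = TRUE)
  hi <- max(ref, na.rm = TRUE)
  (x - lo) / (hi - lo)
}
test_fixed <- as.data.frame(Map(rescale_with, test_toy, train_toy))
sum(complete.cases(test_fixed))
```

Running sum(complete.cases(test_data)) on the real test set just before predict() would confirm or rule this out.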
New here, please let me know if you need more info.
My goal: I am using Rehfeldt climate data and eBird presence/absence data to produce niche models using Random Forest models.
My problem: I want to predict niche models for the entirety of North America. The Rehfeldt climate rasters have data values for every cell on the continent, but these are surrounded by NAs in the "ocean cells". See the plot here, where I have colored the NAs dark green. randomForest::predict() does not run if the independent dataset contains NAs. Thus, I want to crop my climate rasters (or set a working extent?) so that the predict() function only operates over the cells which contain data.
Troubleshooting:
I've run the Random Forest model using a smaller extent which does not include the "NA oceans" of the rasters and the model runs just fine. So, I know the NAs are the problem. However, I don't want to predict my niche models for just a rectangular chunk of North America.
I used flowla's approach here for cropping and masking rasters using a polygon shapefile for North America. I hoped that this would remove the NAs but it doesn't. Is there something similar I can do to remove the NAs?
I've done some reading but can't figure out a way to adjust the Random Forest code itself so that predict() ignores NAs. This post looks relevant but I'm not sure whether it helps in my case.
Data
My rasters, the input presence/absence text file, and code for additional functions are here. Use with the main code below for a reproducible example.
Code
require(sp)
require(rgdal)
require(raster)
library(maptools)
library(mapproj)
library(dismo)
library(maps)
library(proj4)
data(stateMapEnv)
# This source code has all of the functions necessary for running the Random Forest models, as well as the code for the function detecting multi-collinearity
source("Functions.R")
# Read in Rehfeldt climate rasters
# these rasters were converted to .img and given WGS 84 projection in ArcGIS
d100 <- raster("d100.img")
dd0 <- raster("dd0.img")
dd5 <- raster("dd5.img")
fday <- raster("fday.img")
ffp <- raster("ffp.img")
gsdd5 <- raster("gsdd5.img")
gsp <- raster("gsp.img")
map <- raster("map.img")
mat <- raster("mat_tenths.img")
mmax <- raster("mmax_tenths.img")
mmin <- raster("mmin_tenths.img")
mmindd0 <- raster("mmindd0.img")
mtcm <- raster("mtcm_tenths.img")
mtwm <- raster("mtwm_tenths.img")
sday <- raster("sday.img")
smrpb <- raster("smrpb.img")
# add separate raster files into one big raster, with each file being a different layer.
rehfeldt <- addLayer(d100, dd0, dd5, fday, ffp, gsdd5, gsp, map, mat, mmax, mmin, mmindd0, mtcm, mtwm, sday, smrpb)
# plot some rasters to make sure everything worked
plot(d100)
plot(rehfeldt)
# read in presence/absence data
LAZB.INBUtemp <- read.table("LAZB.INBU.txt", header=T, sep = "\t")
colnames(LAZB.INBUtemp) <- c("Lat", "Long", "LAZB", "INBU")
LAZB.INBUtemp <- LAZB.INBUtemp[c(2,1,3,4)]
LAZB.INBU <- LAZB.INBUtemp
latpr <- (LAZB.INBU$Lat)
lonpr <- (LAZB.INBU$Long)
sites <- SpatialPoints(cbind(lonpr, latpr))
LAZB.INBU.spatial <- SpatialPointsDataFrame(sites, LAZB.INBU, match.ID=TRUE)
# The below function extracts raster values for each of the different layers for each of the eBird locations
pred <- raster::extract(rehfeldt, LAZB.INBU.spatial)
LAZB.INBU.spatial@data = data.frame(LAZB.INBU.spatial@data, pred)
LAZB.INBU.spatial@data <- na.omit(LAZB.INBU.spatial@data)
# ITERATIVE TEST FOR MULTI-COLINEARITY
# Determines which variables show multicolinearity
cl <- MultiColinear(LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)], p=0.05)
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
for(l in cl) {
cl.test <- xdata[,-which(names(xdata)==l)]
print(paste("REMOVE VARIABLE", l, sep=": "))
MultiColinear(cl.test, p=0.05)
}
# REMOVE MULTI-COLINEAR VARIABLES
for(l in cl) { LAZB.INBU.spatial@data <- LAZB.INBU.spatial@data[,-which(names(LAZB.INBU.spatial@data)==l)] }
################################################################################################
# FOR LAZB
# RANDOM FOREST MODEL AND RASTER PREDICTION
require(randomForest)
# NUMBER OF BOOTSTRAP REPLICATES
b=1001
# CREATE X,Y DATA
# use column 3 for LAZB and 4 for INBU
ydata <- as.factor(LAZB.INBU.spatial@data[,3])
xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
# PERCENT OF PRESENCE OBSERVATIONS
( dim(LAZB.INBU.spatial[LAZB.INBU.spatial$LAZB == 1, ])[1] / dim(LAZB.INBU.spatial)[1] ) * 100
# RUN RANDOM FORESTS MODEL SELECTION FUNCTION
# This model is using the model improvement ratio to select a final model.
pdf(file = "LAZB Random Forest Model Rehfeldt.pdf")
( rf.model <- rf.modelSel(x=xdata, y=ydata, imp.scale="mir", ntree=b) )
dev.off()
# RUN RANDOM FORESTS CLASS BALANCE BASED ON SELECTED VARIABLES
# This code would help in the case of imbalanced sample
mdata <- data.frame(y=ydata, xdata[,rf.model$SELVARS])
rf.BalModel <- rfClassBalance(mdata[,1], mdata[,2:ncol(mdata)], "y", ntree=b)
# CREATE NEW XDATA BASED ON SELECTED MODEL AND RUN FINAL RF MODEL
sel.vars <- rf.model$PARAMETERS[[3]]
rf.data <- data.frame(y=ydata, xdata[,sel.vars])
write.table(rf.data, "rf.data.txt", sep = ",", row.names = F)
# This is the code given to me; it takes forever to run for my dataset (I haven't tried to let it finish)
# ( rf.final <- randomForest(y ~ ., data=rf.data, ntree=b, importance=TRUE, norm.votes=TRUE, proximity=TRUE) )
# I use this form because it's a lot faster
( rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F) )
################################################################################################
# MODEL VALIDATION
# PREDICT TO VALIDATION DATA
# Determines the percent correctly classified
rf.pred <- predict(rf.final, rf.data[,2:ncol(rf.data)], type="response")
rf.prob <- as.data.frame(predict(rf.final, rf.data[,2:ncol(rf.data)], type="prob"))
ObsPred <- data.frame(cbind(Observed=as.numeric(as.character(ydata)),
PRED=as.numeric(as.character(rf.pred)), Prob1=rf.prob[,2],
Prob0=rf.prob[,1]) )
op <- (ObsPred$Observed == ObsPred$PRED)
( pcc <- (length(op[op == "TRUE"]) / length(op))*100 )
# PREDICT MODEL PROBABILITIES RASTER
# The first line of code says what directory I'm working, and then what folder in that directory has the raster files that I'm using to predict the range
# The second line defines the x variable, which is my final Random Forest model
rpath=paste('~/YOURPATH', "example", sep="/")
xvars <- stack(paste(rpath, paste(rownames(rf.final$importance), "img", sep="."), sep="/"))
tr <- blockSize(xvars)
s <- writeStart(xvars[[1]], filename=paste('~/YOURPATH', "prob_LAZB_Rehfeldt.img", sep="/"), overwrite=TRUE)
for (i in 1:tr$n) {
v <- getValuesBlock(xvars, row=tr$row[i], nrows=tr$nrows[i])
v <- as.data.frame(v)
rf.pred <- predict(rf.final, v, type="prob")[,2]
writeValues(s, rf.pred, tr$row[i])
}
s <- writeStop(s)
prob_LAZB <- raster("prob_LAZB_Rehfeldt.img")
# Write range prediction raster to .pdf
pdf(file="LAZB_range_pred.pdf")
plot(prob_LAZB)
map("state", add = TRUE)
dev.off()
Thanks!!
Did you try setting `na.action` in your call to RF? The option is documented in the randomForest R manual. Your call to RF would look like this:
rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F, na.action = na.omit)
This will tell RF to omit rows where NA exists, thereby throwing out those observations. This is not necessarily the best approach, but it might be handy in your situation.
Option 2: rfImpute or na.roughfix: This will fill in your NAs so that you can go ahead with your prediction. Watch out as this can give you spurious predictions wherever the NAs are being imputed/"fixed".
Option 3: Start with Option 2, and after you get your prediction, bring your raster into your GIS/Image processing software of choice, and mask out the areas you don't want. In your case, masking out water bodies would be pretty simple.
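A further variation, sketched under assumptions: inside the block loop from the question, predict only for the rows without NAs and leave NA in the output everywhere else, so the ocean cells stay empty in the written raster. To keep the sketch runnable with base R, an lm fit stands in for rf.final and a small hand-made data frame stands in for the values returned by getValuesBlock().

```r
# Stand-in model; rf.final would take its place in the real loop.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical block of raster values: rows 2 and 4 mimic ocean cells.
v <- data.frame(wt = c(2.5, NA, 3.2, NA), hp = c(110, NA, 150, 95))

ok  <- complete.cases(v)       # TRUE where every layer has a value
out <- rep(NA_real_, nrow(v))  # default to NA for every cell
if (any(ok)) {
  out[ok] <- predict(fit, newdata = v[ok, , drop = FALSE])
}
out
```

In the raster loop the assignment would become out[ok] <- predict(rf.final, v[ok, , drop = FALSE], type = "prob")[, 2], followed by writeValues(s, out, tr$row[i]). The any(ok) guard skips blocks that consist entirely of NA cells.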