How to create a Naive Bayes model in R for numerical and categorical variables

I am trying to implement a Naive Bayes model in R based on known information:
Age group, e.g. "18-24" and "25-34", etc.
Gender, "male" and "female"
Region, "London" and "Wales", etc.
Income, "£10,000 - £15,000", etc.
Job, "Full Time" and "Part Time", etc.
I am getting errors when I run it. My code is below:
library(readxl)
iphone <- read_excel("~/Documents/iPhone_1k.xlsx")
View(iphone)
summary(iphone)
iphone
library(caTools)
library(e1071)
set.seed(101)
sample = sample.split(iphone$Gender, SplitRatio = .7)
train = subset(iphone, sample == TRUE)
test = subset(iphone, sample == FALSE)
nB_model <- naiveBayes(Gender ~ Region + Retailer, data = train)
pred <- predict(nB_model, test, type="raw")
In the above scenario, I have an Excel file called iPhone_1k (1,000 entries relating to people who have visited a website to buy an iPhone). Each row is a person visiting the website, and the above demographics are known for each.
I have been trying to make the model work and have resorted to following the link below, which uses only two variables (I would like to use a minimum of four, and introduce more if possible):
https://rpubs.com/dvorakt/144238
I want to be able to use these demographics to predict which retailer they will go to (also known for each entry in the iPhone_1k file). There are only three options. Can you please advise how to complete this?
P.S. Below is a screenshot of a simplified version of the data I have used to keep it simple in R. Once I get some code to work, I'll expand the number of variables and entries.

You are setting up the problem incorrectly. It should be:
naiveBayes(Retailer ~ Gender + Region + AgeGroup, data = train)
or in short
naiveBayes(Retailer ~ ., data = train)
Also, you might need to convert the columns into factors if they are characters. You can do it for all columns, right after reading from Excel, with:
iphone[] <- lapply(iphone, factor)
Note that if you add numeric variables in the future, you should not apply this step on them.
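Putting the answer's pieces together, a minimal, self-contained sketch of the corrected workflow might look like the following. Note this is only an illustration: simulated data stands in for iPhone_1k.xlsx, and the level names are invented.

```r
library(e1071)
library(caTools)

# Simulated stand-in for the iPhone_1k data (invented levels)
set.seed(101)
n <- 1000
iphone <- data.frame(
  AgeGroup = sample(c("18-24", "25-34", "35-44"), n, replace = TRUE),
  Gender   = sample(c("male", "female"), n, replace = TRUE),
  Region   = sample(c("London", "Wales", "Midlands"), n, replace = TRUE),
  Retailer = sample(c("RetailerA", "RetailerB", "RetailerC"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Convert all character columns to factors, as the answer suggests
iphone[] <- lapply(iphone, factor)

# Stratify the split on the outcome you want to predict
sample <- sample.split(iphone$Retailer, SplitRatio = 0.7)
train <- subset(iphone, sample == TRUE)
test  <- subset(iphone, sample == FALSE)

# Retailer is the outcome; everything else is a predictor
nB_model <- naiveBayes(Retailer ~ ., data = train)

# type = "class" gives predicted labels; type = "raw" gives per-class probabilities
pred <- predict(nB_model, test, type = "class")
table(pred, test$Retailer)
```

With purely simulated data the confusion matrix will be near-random; the point is only the shape of the workflow.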

R and multiple time series and Error in model.frame.default: variable lengths differ

I am new to R and I am using it to analyse time series data (I am also new to this).
I have quarterly data for 15 years and I am interested in exploring the interplay between drinking and smoking rates in young people - treating smoking as the outcome variable. I was advised to use the gls command in the nlme package as this would allow me to include AR and MA terms. I know I could use more complex approaches like ARIMAX but as a first step, I would like to use simpler models.
After loading the data, I specify the time series:
data.ts = ts(data=data$smoke, frequency=4, start=c(data[1, "Year"], data[1, "Quarter"]))
data.ts.dec = decompose(data.ts)
After decomposing the data and some tests (KPSS and ADF test), it is clear that the data are not stationary so I differenced the data:
diff_dv<-diff(data$smoke, difference=1)
plot.ts(diff_dv, main="differenced")
data.diff.ts = ts(diff_dv, frequency=4, start=c(data[1, "Year"], data[1, "Quarter"]))
The ACF and PACF plots suggest AR(2) should also be included so I set up the model as:
mod.gls = gls(diff_dv ~ drink + time, data = data,
              correlation = corARMA(p = 2), method = "ML")
However, when I run this command I get the following:
"Error in model.frame.default: variable lengths differ".
I understand from previous posts that this is due to the differencing and the fact that the diff_dv is now shorter. I have attempted fixing this by modifying the code but neither approach works:
mod.gls = gls(diff_dv ~ drink + time, data = data[1:(length(data)-1), ],
              correlation = corARMA(p = 2), method = "ML")
mod.gls = gls(I(c(diff(smoke), NA)) ~ drink + time + as.factor(quarterly), data = data,
              correlation = corARMA(p = 2), method = "ML")
Can anyone help with this? Is there a workaround that would allow me to run the gls command, or is there an alternative approach that would be equivalent to it?
As a side question, is it OK to include time as I do, as a variable with values 1 to 60? Similarly, is it OK to include the quarters as dummies to adjust for possible seasonality?
Your help is greatly appreciated!
Specify na.action = na.omit or na.action = na.exclude to omit the rows with NAs. Here is an example using the built-in Ovary data set. See ?na.fail for info on the differences between the two.
Ovary2 <- transform(Ovary, dfoll = c(NA, diff(follicles)))
gls(dfoll ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary2,
    correlation = corAR1(form = ~ 1 | Mare), na.action = na.exclude)
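Applied to the question's own setup, the same idea might look like the sketch below. This is an assumption-laden illustration: the data frame is simulated, and the column names (smoke, drink, time, quarterly) are taken from the question rather than from any real data.

```r
library(nlme)

# Simulated quarterly data standing in for the question's data frame
set.seed(1)
n <- 60
data <- data.frame(
  smoke     = cumsum(rnorm(n)),   # a non-stationary outcome, like the question's
  drink     = rnorm(n),
  time      = 1:n,
  quarterly = rep(1:4, length.out = n)
)

# Pad the differenced outcome with a leading NA so it keeps the same
# length as the other columns, then let gls drop the incomplete row
data$diff_smoke <- c(NA, diff(data$smoke))

mod.gls <- gls(diff_smoke ~ drink + time + as.factor(quarterly),
               data = data,
               correlation = corARMA(p = 2),
               method = "ML",
               na.action = na.exclude)
summary(mod.gls)
```

The key points are the NA padding (so all variables have equal length) and na.action = na.exclude (so gls silently drops the incomplete first row).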

Error with rda test in vegan R package: variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame.
datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to be read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look at the effects of the environmental variables on a different, composite variable:
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been through several tutorials and tried many iterations of each issue. What I have provided here is, I think, the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single-column data frame to a vector).
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
The problem is that rda expects a separate df for the first part of the formula (ALL in your code), and does not use the one in the data = argument.
As mentioned above, you can create a new data frame with the variable needed for the analysis, but here's a one-line solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin Simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and the rowid function to convert the categorical variables to unique integers or, preferably, not use them at all. I also wanted to set the ID vector as site names, but I am too lazy now.
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers (rowid doesn't do this for ID; why?).
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data=df, scale=T)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and depth have the highest variation, which is almost logical, since only three vectors are used. The assignment of integers to the categorical vectors probably has no meaning at all: the function simply assigns unique integers, from top to bottom, to each unique character string. I am also not really sure which question you want to answer; based on that, you can organize the data frame.

Excluding ID field when fitting model in R

I have a simple random forest model I have created and tested in R. For now I have excluded an internal company ID from my training/testing data frames. Is there a way in R that I could include this column in my data and have the training/execution of my model ignore the field?
I obviously would not want the model to try to incorporate it as a variable, but on export of the data, with a column added for the predicted outcome, I would need that internal ID to tie back to other customer data so I know how each customer has been categorized.
I am just using the out-of-the-box random forest function from the randomForest library.
#divide data into training and test sets
set.seed(3)
id<-sample(2,nrow(Churn_Model_Data_v2),prob=c(0.7,0.3),replace = TRUE)
churn_train<-Churn_Model_Data_v2[id==1,]
churn_test<-Churn_Model_Data_v2[id==2,]
#changes Churn data 1/2 to a factor for model
Churn_Model_Data_v2$`Churn` <- as.factor(Churn_Model_Data_v2$`Churn`)
churn_train$`Churn` <- as.factor(churn_train$`Churn`)
#churn_test$`Churn` <- as.factor(churn_test$`Churn`)
bestmtry <- tuneRF(churn_train, churn_train$`Churn`, stepFactor = 1.2,
                   improve = 0.01, trace = T, plot = T)
#creates model based on training data, views model
churn_forest <- randomForest(`Churn`~. , data= churn_train )
churn_forest
#shows us what variables are most important
importance(churn_forest)
varImpPlot(churn_forest)
#predicts churn diagnosis on test data
predict_churn <- predict(churn_forest, newdata = churn_test, type="class")
predict_churn
A simple example of excluding a particular column or set of columns is as follows:
library(MASS)
temp<-petrol
randomForest(No ~ .,data = temp[, !(colnames(temp) %in% c("SG"))]) # One Way
randomForest(No ~ .-SG,data = temp) #Another way with similar result
This method of exclusion is commonly valid across other functions/algorithms in R too.
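Since the end goal is to tie predictions back to customers, one common pattern is to exclude the ID from the predictors while keeping it in the data frame, so it is still there when you export the predictions. A sketch with simulated data and an invented CustomerID column (the real column names may differ):

```r
library(randomForest)

# Simulated stand-in for Churn_Model_Data_v2, with an internal ID column
set.seed(3)
n <- 200
Churn_Model_Data_v2 <- data.frame(
  CustomerID = sprintf("C%03d", 1:n),
  Tenure     = runif(n, 0, 60),
  Spend      = rnorm(n, 100, 25),
  Churn      = factor(sample(c("1", "2"), n, replace = TRUE))
)

id <- sample(2, n, prob = c(0.7, 0.3), replace = TRUE)
churn_train <- Churn_Model_Data_v2[id == 1, ]
churn_test  <- Churn_Model_Data_v2[id == 2, ]

# Exclude the ID from the predictors, but keep it in the data frame
churn_forest <- randomForest(Churn ~ . - CustomerID, data = churn_train)

# Predict on the test set, then export predictions alongside the ID
churn_test$PredictedChurn <- predict(churn_forest, newdata = churn_test,
                                     type = "response")
head(churn_test[, c("CustomerID", "PredictedChurn")])
```

Because the ID never enters the formula, it plays no role in the fit, yet every predicted row still carries its CustomerID for joining back to other customer data.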

Data set for regression: different response values for same combination of input variables

Hey dear stackoverflowers,
I would like to perform (multiple) regression analysis on a large customer data set, trying to predict amount spent after initial purchase based on various independent variables, observed during the first purchase.
In this data set, for the same combination of input variable values (say gender=male, age=30, income=40k, first_purchase_value=99.90), I can have multiple observations with varying y values (i.e. multiple customers share the same independent variable attributes, but behave differently according to their observed y values).
Is this a problem for regression analysis, i.e. do I have to condense these observations by e.g. averaging? I am getting negative R2 values, which is why I'm asking (I know that a linear model might also just be the wrong assumption here)...
Thank you for helping me. I tried using the search function, but was unable to find similar topics (probably because the question is silly?).
Cheers!
Edit: This is the code I'm using:
library(caTools)
spl <- sample.split(data$spent, SplitRatio = 0.75)
data_train <- subset(data, spl == TRUE)
data_test <- subset(data, spl == FALSE)
model_lm_spent <- lm(spent ~ ., data = data_train)
summary(model_lm_spent)
model_lm_predictions_spent <- predict(model_lm_spent, newdata = data_test)
SSE_spent = sum((data_test$spent - model_lm_predictions_spent)^2)
SST_spent = sum((data_test$spent - mean(data$spent))^2)
1 - SSE_spent/SST_spent

How are BRR weights used in the survey package for R?

Does anyone know how to use BRR weights in Lumley's survey package for estimating variance if your dataset already has BRR weights in it?
I am working with PISA data, and they already include 80 BRR replicates in their dataset. How can I get as.svrepdesign to use these, instead of trying to create its own? I tried the following and got the subsequent error:
dstrat <- svydesign(id = ~uniqueID, strata = ~strataVar, weights = ~studentWeight,
                    data = data, nest = TRUE)
dstrat <- as.svrepdesign(dstrat, type="BRR")
Error in brrweights(design$strata[, 1], design$cluster[, 1], ...,
fay.rho = fay.rho, : Can't split with odd numbers of PSUs in a stratum
Any help would be greatly appreciated, thanks.
no need to use as.svrepdesign() if you have a data frame with the replicate weights already :) you can create the replicate weighted design directly from your data frame.
say you have data with a main weight column called mainwgt and 80 replicate weight columns called repwgt1 through repwgt80 you could use this --
yoursurvey <-
    svrepdesign(
        weights = ~mainwgt,
        repweights = "repwgt[0-9]+",
        type = "BRR",
        data = yourdata,
        combined.weights = TRUE
    )
-- this way, you don't have to identify the exact column numbers. then you can run normal survey commands like --
svymean( ~variable , design = yoursurvey )
if you'd like another example, here's some example code and an explanatory blog post using the current population survey.
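As a self-contained toy version of that pattern (simulated weights, and only 4 replicate columns instead of PISA's 80, purely for illustration):

```r
library(survey)

# Toy data: a main weight plus 4 replicate weight columns named repwgt1..repwgt4
set.seed(1)
n <- 100
yourdata <- data.frame(
  variable = rnorm(n, 50, 10),
  mainwgt  = runif(n, 0.5, 2)
)
for (i in 1:4) {
  yourdata[[paste0("repwgt", i)]] <- yourdata$mainwgt * runif(n, 0.5, 1.5)
}

# The regex picks up every repwgtN column, so you never list column numbers
yoursurvey <- svrepdesign(
  weights = ~mainwgt,
  repweights = "repwgt[0-9]+",
  type = "BRR",
  data = yourdata,
  combined.weights = TRUE
)

# Replicate-based mean and standard error
svymean(~variable, design = yoursurvey)
```

The repweights regex is the convenience being demonstrated: svrepdesign matches it against the data frame's column names, so adding or reordering replicate columns does not break the call.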
I haven't used the PISA data, but I used the svrepdesign method last year with the Public Use Microdata Sample from the American Community Survey (US Census Bureau), which also shipped with 80 replicate weights. They state to use the Fay method for that specific survey, so here is how one can construct the svrep object using that data:
pums_p.rep <- svrepdesign(variables = pums_p[, 2:7],
                          repweights = pums_p[8:87],
                          weights = pums_p[, 1], combined.weights = TRUE,
                          type = "Fay", rho = (1 - 1/sqrt(4)), scale = 1, rscales = 1)
attach(pums_p.rep)
#CROSS - TABS
#unweighted
xtabs(~ is5to17youth + withinAMILimit)
table(is5to17youth + withinAMILimit)
#weighted, mean income by sex by race for select age groups
svyby(~PINCP, ~RAC1P + SEX, subset(pums_p.rep, AGEP > 25 & AGEP < 35),
      na.rm = TRUE, svymean, vartype = c("se", "cv"))
In getting this to work, I found the article from A. Damico helpful: Damico, A. (2009). Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data. The R Journal, 1(2), 37–44.