Error while trying to do a prediction with bnlearn package - Bayesian network - r

I'm trying to build a prediction model with the bnlearn package, but I get the error "Error in check.data(data) : the data are missing".
Here is my example data set:
dat <- read.table(text = " category birds wolfs snakes
yes 3 9 7
no 3 8 4
no 1 2 8
yes 1 2 3
yes 1 8 3
no 6 1 2
yes 6 7 1
no 6 1 5
yes 5 9 7
no 3 8 7
no 4 2 7
notsure 1 2 3
notsure 7 6 3
no 6 1 1
notsure 6 3 9
no 6 1 1 ",header = TRUE)
Here are the lines of code that I used to get the prediction:
dat$birds<-as.numeric(dat$birds)
dat$wolfs<-as.numeric(dat$wolfs)
dat$snakes<-as.numeric(dat$snakes)
training.set = dat[1:8,2:4 ]
demo.set = dat[8:16,2:4 ]
res <- hc(training.set)
fitted = bn.fit(res, training.set)
pred = predict(fitted, demo.set) # I get an error: "Error in check.data(data) : the data are missing."
Any idea how to solve it?

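predict() for a fitted bnlearn network needs to know which node to predict. In your call, demo.set is the second positional argument, so it is matched to node and data is never supplied. predict(fittedbn, node="column name to predict", data=testdata) worked for me.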
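The positional-matching problem can be reproduced without bnlearn. The sketch below uses predict_like, a hypothetical stand-in that only mimics the argument order (object, node, data); it is not bnlearn's implementation, and the error text is copied from the question for illustration.

```r
# Hypothetical stand-in: same leading argument order as the call in the answer,
# but NOT bnlearn code.
predict_like <- function(object, node, data) {
  if (missing(data)) stop("the data are missing")
  paste("predicting", node, "for", nrow(data), "rows")
}

demo.set <- data.frame(birds = c(6, 5), wolfs = c(1, 9), snakes = c(5, 7))

# Second positional argument is matched to `node`, so `data` stays missing
err <- tryCatch(predict_like("fitted", demo.set),
                error = function(e) conditionMessage(e))

# Naming the arguments removes the ambiguity
ok <- predict_like("fitted", node = "snakes", data = demo.set)
```

With the real package, the equivalent fix for the question's code would presumably be predict(fitted, node = "snakes", data = demo.set), with node set to whichever column is being predicted.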

I don't have bnlearn installed, but from your code I guess the problem is that you didn't include the output (the category column) in the training set. Change:
training.set = dat[1:8,]
and see if it works.

Related

Cannot introduce offset into negative binomial regression

I'm performing a count data analysis in R, dealing with a data called 'doctor' which is:
V2 V3 L4 V5
1 1 32 10.866795 1
2 2 104 10.674706 1
3 3 206 10.261581 1
4 4 186 9.446440 1
5 5 102 8.578665 1
6 1 2 9.841080 2
7 2 12 9.275472 2
8 3 28 8.649974 2
9 4 28 7.857481 2
10 5 31 7.287561 2
The best model by stepwise AIC was V3~V2+L4+V5+V2:L4:V5. Now I want to set L4 as the offset and fit a negative binomial regression including the interaction, so I used nbinom=glm.nb(V3~V2+V5+V2:V5,offset=L4), but I get the error: Error in glm.control(...) : unused argument (offset = L4). What have I done wrong here?
Offsets are entered using an offset term in the model formula:
nbinom=glm.nb(V3~V2+V5+V2:V5+offset(L4))
You can also write V2*V5 as shorthand for V2+V5+V2:V5.
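Both points can be checked with base R alone, without fitting anything. This is a minimal sketch that rebuilds the doctor data from the question (the data frame construction is my own transcription) and inspects how the formula is parsed.

```r
# Data transcribed from the question
doctor <- data.frame(
  V2 = rep(1:5, times = 2),
  V3 = c(32, 104, 206, 186, 102, 2, 12, 28, 28, 31),
  L4 = c(10.866795, 10.674706, 10.261581, 9.446440, 8.578665,
         9.841080, 9.275472, 8.649974, 7.857481, 7.287561),
  V5 = rep(1:2, each = 5)
)

f <- V3 ~ V2 * V5 + offset(L4)

# V2 * V5 expands to main effects plus interaction; the offset is not a regular term
term_labels <- attr(terms(f), "term.labels")

# The offset is carried in the model frame and can be extracted from it
mf  <- model.frame(f, data = doctor)
off <- model.offset(mf)
```

With MASS loaded, the same formula can then be passed on as glm.nb(f, data = doctor).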

R: Error in contrasts when fitting linear models with `lm`

I've found the question "Error in contrasts when defining a linear model in R" and have followed the suggestions there, but none of my factor variables takes on only one value and I am still experiencing the same issue.
This is the dataset I'm using: https://www.dropbox.com/s/em7xphbeaxykgla/train.csv?dl=0.
This is the code I'm trying to run:
simplelm <- lm(log_SalePrice ~ ., data = train)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
What is the issue?
Thanks for providing your dataset (I hope that link stays valid so that everyone can access it). I read it into a data frame train.
Using the debug_contr_error, debug_contr_error2 and NA_preproc helper functions provided by How to debug "contrasts can be applied only to factors with 2 or more levels" error?, we can easily analyze the problem.
info <- debug_contr_error2(log_SalePrice ~ ., train)
## the data frame that is actually used by `lm`
dat <- info$mf
## number of cases in your dataset
nrow(train)
#[1] 1460
## number of complete cases used by `lm`
nrow(dat)
#[1] 1112
## number of levels for all factor variables in `dat`
info$nlevels
#      MSZoning        Street         Alley      LotShape   LandContour
#             4             2             3             4             4
#     Utilities     LotConfig     LandSlope  Neighborhood    Condition1
#             1             5             3            25             9
#    Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl
#             6             5             8             5             7
#   Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond
#            14            16             4             4             4
#    Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1
#             6             5             5             5             7
#  BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical
#             7             5             5             2             5
#   KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish
#             4             6             6             6             3
#    GarageQual    GarageCond    PavedDrive        PoolQC         Fence
#             5             5             3             4             5
#   MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold
#             4             9             6             2            55
As you can see, Utilities is the offending variable here as it has only 1 level.
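The helper functions come from the linked answer and are not reproduced here. As a rough base-R approximation of the same check (my own sketch, not those helpers), one can drop incomplete rows as lm does and then count the levels that remain. The toy data frame below is hypothetical and only borrows column names from the question.

```r
# Hypothetical toy data: the only "NoSeWa" row also has a missing numeric value
toy <- data.frame(
  y           = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.4),
  Utilities   = factor(c("AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "NoSeWa")),
  Street      = factor(c("Pave", "Grvl", "Pave", "Grvl", "Pave", "Grvl")),
  LotFrontage = c(65, 80, 68, 60, 84, NA)
)

# Keep complete cases only, then drop factor levels that no longer occur
complete <- droplevels(na.omit(toy))

# Count remaining levels per factor column and flag those with fewer than two
n_levels     <- vapply(Filter(is.factor, complete), nlevels, integer(1))
single_level <- names(n_levels)[n_levels < 2]
```

In the full data, Utilities has two levels, but the row carrying the second level is removed with the incomplete cases, which is the situation that triggers the contrasts error.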
Since you have many character / factor variables in train, I wonder whether you have NA for them. If we add NA as a valid level, we could possibly get more complete cases.
new_train <- NA_preproc(train)
new_info <- debug_contr_error2(log_SalePrice ~ ., new_train)
new_dat <- new_info$mf
nrow(new_dat)
#[1] 1121
new_info$nlevels
#      MSZoning        Street         Alley      LotShape   LandContour
#             5             2             3             4             4
#     Utilities     LotConfig     LandSlope  Neighborhood    Condition1
#             1             5             3            25             9
#    Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl
#             6             5             8             5             7
#   Exterior1st   Exterior2nd    MasVnrType     ExterQual     ExterCond
#            14            16             4             4             4
#    Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1
#             6             5             5             5             7
#  BsmtFinType2       Heating     HeatingQC    CentralAir    Electrical
#             7             5             5             2             6
#   KitchenQual    Functional   FireplaceQu    GarageType  GarageFinish
#             4             6             6             6             3
#    GarageQual    GarageCond    PavedDrive        PoolQC         Fence
#             5             5             3             4             5
#   MiscFeature      SaleType SaleCondition  MiscVal_bool      MoYrSold
#             4             9             6             2            55
We do get more complete cases, but Utilities still has one level. This means that most incomplete cases are actually caused by NA in your numerical variables, about which we can do nothing (unless you have a statistically valid way to impute those missing values).
As you only have one single-level factor variable, the same method as given in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"? will work.
new_dat$Utilities <- 1
simplelm <- lm(log_SalePrice ~ 0 + ., data = new_dat)
The model now runs successfully. However, it is rank-deficient. You may want to address that, but leaving it as is also works.
b <- coef(simplelm)
length(b)
#[1] 301
sum(is.na(b))
#[1] 9
simplelm$rank
#[1] 292

R 3.4.1 gsub on Windows 10 - find and replace all strings except for

I am trying to clean up data for a class project. The data deals with NOAA Storm data from 1950 to 2011. The storm types (EVTYPE) are only supposed to be 48 different levels, but there are over 1000 unique entries. I am trying to find all the snow related entries, which gives me:
table(grep("snow", temp$EVTYPE, ignore.case = TRUE, value = TRUE))
ACCUMULATED.SNOWFALL  BLOWING.SNOW  COLD.AND.SNOW  DRIFTING.SNOW
                   4             5              1              1
EARLY.SNOWFALL  EXCESSIVE.SNOW  FALLING.SNOW.ICE  FIRST.SNOW
             7              25                 2           2
HEAVY.SNOW  HEAVY.SNOW.SHOWER  HEAVY.SNOW.SQUALLS  ICE.SNOW
     13988                  1                   1         4
LAKE.EFFECT.SNOW  LATE.SEASON.SNOW  LATE.SEASON.SNOWFALL  LATE.SNOW
             656                 1                     3          2
LIGHT.SNOW  LIGHT.SNOW.FLURRIES  LIGHT.SNOW.FREEZING.PRECIP  LIGHT.SNOWFALL
       174                    3                           1               1
MODERATE.SNOW  MODERATE.SNOWFALL  MONTHLY.SNOWFALL  MOUNTAIN.SNOWS
            1                101                 1               1
RECORD.MAY.SNOW  RECORD.SNOW  RECORD.SNOWFALL  RECORD.WINTER.SNOW
              1            2                2                   3
SEASONAL.SNOWFALL  SNOW  SNOW.ACCUMULATION  SNOW.ADVISORY
                1   425                  1              1
SNOW.AND.ICE  SNOW.AND.SLEET  SNOW.BLOWING.SNOW  SNOW.DROUGHT
           4               5                  6             4
SNOW.ICE  SNOW.SHOWERS  SNOW.SLEET  SNOW.SQUALL
       1             5           5            5
SNOW.SQUALLS  THUNDERSNOW.SHOWER  UNUSUALLY.LATE.SNOW
          14                   1                    1
There is a storm type called "Lake.Effect.Snow", which is one of the 48 storm types. How can I replace all of the other entries while excluding that particular storm type? I've tried:
table(grep("([^lake]?)snow", temp$EVTYPE, ignore.case = TRUE, value = TRUE))
to try and ignore the Lake.Effect.Snow entries, but no good.
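Use stringr::str_detect with ifelse. Compare case-insensitively, since the entries shown are upper case.
library("stringr")
evtype <- as.character(temp$EVTYPE)  # ifelse() on a factor would return integer codes
is_snow <- str_detect(evtype, regex("snow", ignore_case = TRUE))
temp$EVTYPE <- ifelse(is_snow & toupper(evtype) != "LAKE.EFFECT.SNOW", "Snow", evtype)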
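The attempted pattern fails because [^lake] is a character class (any single character other than l, a, k or e), not "not preceded by lake". A base-R sketch of two ways to exclude the one entry is shown below; the evtype vector is made-up sample data, and "Snow" as the replacement label follows the answer above.

```r
# Made-up sample of event types
evtype <- c("HEAVY.SNOW", "LAKE.EFFECT.SNOW", "Snow", "TORNADO",
            "Lake.Effect.Snow", "THUNDERSNOW.SHOWER")

# Option 1: two plain tests combined
is_snow <- grepl("snow", evtype, ignore.case = TRUE)
is_lake <- grepl("^lake\\.effect\\.snow$", evtype, ignore.case = TRUE)
cleaned <- ifelse(is_snow & !is_lake, "Snow", evtype)

# Option 2: one pattern with a negative lookahead (needs perl = TRUE)
hit <- grepl("^(?!lake\\.effect\\.snow$).*snow", evtype,
             ignore.case = TRUE, perl = TRUE)
```

Both options select the same entries; the first is easier to read and extend if more storm types need to be protected.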

Using predict() to predict response variable in test dataset

Question: What R code should one use to predict a response variable in a completely separate test data set (not a test set drawn from the same original data as the training set) that doesn't have a response variable?
I have been stuck on this for two days and any help is highly appreciated!
My training set has 100 observations and 27 variables. "units" is the response variable. The test set has 6000 observations and 26 variables. I am showing only a part of both data sets to keep the length of my question manageable.
I am using ISLR and MASS packages.
Training set:
age V1 V2 V3 V4 V5 V6 units
10 1 3 0 5 5 5 5828
7 4 5 4 4 1 2 2698
5 6 6 4 7 8 10 2578
4 4 5 4 4 1 3 2548
15 3 5 4 4 2 5 9922
5 2 4 4 5 1 3 6791
Test set:
age V1 V2 V3 V4 V5 V6
2 3 4 4 4 2 2
2 2 5 4 5 2 3
10 5 4 4 4 1 3
4 15 7 6 3 4 8
7 2 5 4 4 2 2
4 6 5 4 5 2 2
18 2 5 4 5 1 3
6 3 5 5 6 4 5
R Code:
library(ISLR)
library(MASS)
train = read.csv(".../train.csv", header = T)
train.pca = train[c(-27)]
pr.out = prcomp(train.pca, scale = TRUE, center = TRUE, retx = TRUE) # Conducting PCA
plot(pr.out, type = 'l')
summary(pr.out)
pred.tr = predict(pr.out, newdata = train) # Predicting on the train data
dat.tr = cbind(train, pred.tr) # Appending PCA output to the train data
glm.fit.pca = glm(units ~ PC2 + PC3 + PC4 + PC5 +
PC6 + PC7 + PC8 + PC9 + PC10 +
PC11 + PC12 + PC13 + PC14 + PC15,
data = dat.tr) # Conducting glm on train data with PCs
test = read.csv(".../test.csv", header = T) # Reading in test data
pred.test = predict(pr.out, newdata = test, type = "response") # Predicting
# on test data. With this code, I get the following error message - "Error
# in predict.prcomp(pr.out, newdata = y, type = "response") :
# 'newdata' does not have named columns matching one or more of the original
# columns" I understand why because the test set doesn't have the response
# variable
So I tried the following:
pred.test = predict(pr.out, newdata = test) # This doesn't give me any error
dat.test = cbind(test_numr, pred.test) # Appending PCA output to test data
I don't understand how I can conduct a glm on the test data, the way I did on train data because test data set doesn't have a response variable (i.e., "units"). I tried initializing the response variable in the test data by doing the following to add the response variable in the test data set:
dat.test$units = rep(0, nrow(dat.test))
Now when I try to run the glm model on the dat.test data set, I get all zeros. I can understand why, but I don't understand what changes I should make to my code to get predictions for the test data set.
Any guidance is highly appreciated! Thank you!
EDIT: I edited and ran the code again based on the comment from @csgillespie. I still have the same issue. Thanks for catching the error!
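One possible approach, sketched here under assumptions: the model does not need to be refitted on the test data. The test predictors are projected onto the principal components learned from the training set, and the already fitted glm is then asked for predictions on those scores. The simulated train and test data frames below are hypothetical stand-ins for the CSV files, and only three components are used for brevity.

```r
set.seed(1)

# Hypothetical stand-ins for train.csv (with response) and test.csv (without)
train <- data.frame(age = rnorm(100), V1 = rnorm(100), V2 = rnorm(100), V3 = rnorm(100))
train$units <- 50 + 3 * train$age - 2 * train$V1 + rnorm(100)
test <- data.frame(age = rnorm(20), V1 = rnorm(20), V2 = rnorm(20), V3 = rnorm(20))

# PCA on the training predictors only
predictors <- setdiff(names(train), "units")
pr.out <- prcomp(train[predictors], center = TRUE, scale. = TRUE)

# Fit the glm on the training scores
dat.tr <- data.frame(units = train$units,
                     predict(pr.out, newdata = train[predictors]))
glm.fit.pca <- glm(units ~ PC1 + PC2 + PC3, data = dat.tr)

# Project the test set with the same rotation, then predict with the fitted glm
test.pcs   <- as.data.frame(predict(pr.out, newdata = test))
pred.units <- predict(glm.fit.pca, newdata = test.pcs, type = "response")
```

Note that type = "response" belongs to the glm prediction, not to the prcomp projection, and that no placeholder units column is needed in the test data.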

Converting R data frame with RDS package: recruitment id error?

I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpt of my data frame, where the first case (id=1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id (id number of the respondent who recruited him/her), netsize (respondent's network size) and population (estimate of whole population size).
df<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
recruit.id=c(-1,1,1,2,2,4,5,3,8,3),
netsize=c(6,6,6,5,5,4,4,3,4,6), population=rep(22,000, 10))
I then (try to) apply the relevant function:
new.df <-as.rds.data.frame(df,id=df$id,
recruiter.id=df$recruit.id,
network.size=df$netsize,
population.size=df$population,
max.coupons=2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id,: Invalid id
and the warning
In addition: Warning message:In if (!(id %in% names(x))) stop("Invalid id") :
the condition has length > 1 and only the first element will be used
I have tried assigning various 'recruiter id' values for seed participants, including -1, 0, or their own id number, but I still get the same message. I have also tried dropping function arguments (max.coupons, population.size) and deleting seed respondents, with the same result.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
This seems to work:
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df,max.coupons=2)
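Alternatively, pass the column names as strings rather than the columns themselves (the failing check in the error message is id %in% names(x)). This gives a result, with a warning: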
as.rds.data.frame(df, id="id", recruiter.id="recruit.id",
network.size="netsize", population.size="population", max.coupons=2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion
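The check quoted in the error message, id %in% names(x), can be tried directly in base R to see why passing df$id fails while passing "id" works. This is only an illustration of that one condition, not of the RDS package internals.

```r
# Small data frame with the same first two columns as in the question
df <- data.frame(id = 1:10,
                 recruit.id = c(-1, 1, 1, 2, 2, 4, 5, 3, 8, 3))

# A column name passes the check
name_ok <- "id" %in% names(df)

# The column itself yields ten values, none of which is a column name,
# hence both "Invalid id" and the "condition has length > 1" warning
values_ok <- df$id %in% names(df)
```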
