Error in code using the neuralnet package - R

I have to predict the TargetBuy variable, which is coded as 0 and 1.
I have the following code:
library(neuralnet)
library(NeuralNetTools)
n <- names(train)
f <- as.formula(paste("TargetBuy ~", paste(n[!n %in% "TargetBuy"], collapse = " + ")))
parse_train <- model.matrix(~ ID + DemAffl + DemAge + DemCluster +
DemClusterGroup + DemGender + DemReg +
DemTVReg + PromClass + PromSpend + PromTime +
TargetBuy,
data = train)
head(parse_train)
nn <- neuralnet(f, data = parse_train,
hidden = 2,
err.fct = "ce",
threshold = 0.01,
linear.output = FALSE)
I am getting the following error:
Error in eval(expr, envir, enclos) : object 'TargetBuy' not found
Here is the output of str(train):
'data.frame': 15556 obs. of 12 variables:
$ ID : int 140 620 868 1120 2313 2771 3131 4529 5886 7420 ...
$ DemAffl : int 10 4 5 10 11 9 11 10 14 7 ...
$ DemAge : int 76 49 70 65 68 72 74 62 43 60 ...
$ DemCluster : int 16 35 27 51 4 28 3 49 49 52 ...
$ DemClusterGroup: Factor w/ 8 levels "","A","B","C",..: 4 5 5 7 2 5 2 7 7 7 ...
$ DemGender : Factor w/ 4 levels "","F","M","U": 4 4 2 3 2 4 2 3 2 2 ...
$ DemReg : Factor w/ 6 levels "","Midlands",..: 2 2 2 2 2 3 2 2 1 3 ...
$ DemTVReg : Factor w/ 14 levels "","Border","C Scotland",..: 13 13 13 6 6 9 4 4 1 7 ...
$ PromClass : Factor w/ 4 levels "Gold","Platinum",..: 1 1 3 4 4 2 4 3 1 1 ...
$ PromSpend : num 16000 6000 0.02 0.01 0.01 ...
$ PromTime : int 4 5 8 7 8 3 8 3 1 2 ...
$ TargetBuy : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...
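A likely cause, offered as a sketch rather than a confirmed diagnosis: model.matrix() expands factor columns into dummy columns and renames them, so the matrix passed to neuralnet has a column such as TargetBuy1 rather than TargetBuy, while the formula f was built from names(train). The toy data frame below is hypothetical and only stands in for train; it uses base R alone to show the renaming and one way to build the formula from the matrix's own column names.

```r
# Hypothetical stand-in for the asker's `train` data frame
train <- data.frame(
  DemAge    = c(76, 49, 70, 65),
  DemGender = factor(c("U", "F", "M", "F")),
  TargetBuy = factor(c("0", "0", "1", "1"))
)

# model.matrix() turns each factor into dummy columns with new names
mm <- model.matrix(~ DemAge + DemGender + TargetBuy, data = train)
print(colnames(mm))  # note "TargetBuy1", not "TargetBuy"

# Drop the intercept column and build the formula from the matrix columns
mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
response   <- "TargetBuy1"
predictors <- setdiff(colnames(mm), response)
f <- as.formula(paste(response, "~", paste(predictors, collapse = " + ")))
print(f)
```

With the real data, some dummy names may contain spaces (for example levels of DemTVReg), so they may need cleaning, e.g. with make.names(), before they can be used in a formula.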

How to combine training and testing dataset in same format

I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income
I loaded training & testing data.
# Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
# Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
I need to combine the two into one, but there is a problem: the two data frames do not have the same structure.
Display structure of the training data
> str (training)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Display structure of the testing data
> str (testing)
'data.frame': 16282 obs. of 15 variables:
$ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
$ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
$ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
$ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
$ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
$ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
$ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
$ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
$ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
$ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
$ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
$ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
$ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
$ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
Problem 1:
age has become a factor in testing, and every other factor in testing has one more level than the same factor in training. This is because the first row of testing is an unnecessary row:
|1x3 Cross validator
I tried to get rid of this by re-assigning testing:
testing = testing[-1,]
but after running str() again, I don't see any change.
Problem 2:
As I said earlier, I need to combine those two data frames into one. So I ran this:
combined <- rbind(training , testing)
Besides problem 1, I can see a new problem after running str():
> str(combined)
'data.frame': 48842 obs. of 15 variables:
$ age : chr "39" "50" "38" "53" ...
$ workclass : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
The target variable (incomelevel) has 5 factor levels in the combined data frame, whereas it has 2 (which is correct) in the training data frame and 3 (one extra because of problem 1) in the testing data frame. This is because there is a . (dot) after each value of incomelevel in the testing data frame (<=50K., <=50K., >50K., ...). So I need to remove that dot, but I have no idea how. Is there a function for this?
I am very new to data analysis and R, which is why I am running into basic issues like this. Can you please help me solve them?
I think you can skip the first line of the test file, which looks like a header. This will solve the issue of age being a factor:
head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
Reusing your code, we can call read.csv with skip = 1 for the test file:
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.csv(trainFile, header = FALSE, col.names = colNames,
                    stringsAsFactors = TRUE, na.strings = "?", strip.white = TRUE)
# Reading testing data, skipping the "|1x3 Cross validator" line
testing = read.csv(testFile, header = FALSE, col.names = colNames,
                   stringsAsFactors = TRUE, na.strings = "?", strip.white = TRUE, skip = 1)
Unfortunately, the income level has to be corrected manually; it's a good thing you checked:
testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
Checking the levels, the only remaining difference is in nativecountry:
all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"
I don't think there's much you can do about it; you may have to remove that level before or after joining:
setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"
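As for why testing = testing[-1,] appeared to change nothing: dropping a row does not drop the factor levels that row used, and it does not turn a factor back into a number. The sketch below uses a tiny hypothetical data frame (testing_raw, not the real file) to show the effect and one way to clean up if the stray row has already been read in.

```r
# Hypothetical miniature of the badly read test file
testing_raw <- data.frame(
  age         = factor(c("|1x3 Cross validator", "25", "38")),
  incomelevel = factor(c("", "<=50K.", ">50K."))
)

# Removing the first row keeps the unused levels, so str() looks unchanged
dropped <- testing_raw[-1, ]
str(dropped)

# Drop unused levels, restore the numeric type, and strip the trailing dot
cleaned <- droplevels(dropped)
cleaned$age <- as.integer(as.character(cleaned$age))
cleaned$incomelevel <- factor(sub("\\.$", "", as.character(cleaned$incomelevel)))
str(cleaned)
```

Converting through as.character() matters here: calling as.integer() directly on a factor returns the internal level codes, not the original numbers.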

Issues producing a ROC curve with a KNN Model - undefined columns

I apologize if this is rudimentary, but I have gone through the traceback and tried googling to no avail. Every time I run my code to produce a ROC curve, I get:
Error in `[.data.frame`(data, , class) : undefined columns selected
I have checked the data, and they are single-column character vectors (as required):
library(cutpointr)
Temp1 <- predict(KnnModel, newdata=TestData, type="prob")
KnnProbs <- predict(object = KnnModel, newdata = TestData, type = "prob")
KnnProbs <- as.character(KnnProbs$`0`)
clch <- as.character(TrainData$loan_status)
KnnROC <- roc(data = TestData$loan_status, x = KnnProbs, class = clch)
plot(KnnROC, print.auc = T)
Any ideas as to what I am doing wrong and how to fix this?
EDIT: The structure of TrainData is as follows:
'data.frame': 1500 obs. of 13 variables:
$ loan_amnt : num 6000 17625 8500 5000 10000 ...
$ loan_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ int_rate : num 13.33 15.61 6.68 6.92 14.98 ...
$ term : num 1 1 1 1 1 1 2 1 1 1 ...
$ installment : num 203 616 261 154 347 ...
$ grade : num 3 4 1 1 3 4 4 3 2 1 ...
$ emp_length : num 10 11 3 3 8 3 3 3 2 1 ...
$ annual_inc : num 30000 49000 53100 60000 37000 ...
$ dti : num 25.5 12.2 26.2 27.8 31.4 ...
$ sub_grade : num 13 16 3 4 13 16 18 12 8 4 ...
$ verification_status: num 1 2 2 3 3 3 3 1 3 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 5 2 2 2 2 6 2 ...
$ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
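The error text itself shows data being indexed by class (`[.data.frame`(data, , class)), which suggests the function expects a data frame plus column names rather than loose vectors. Independently of that, the labels in the code come from TrainData while the probabilities come from TestData, and the probabilities are converted to character. The sketch below does not use cutpointr at all; it is a base-R substitute on hypothetical values, meant only to show the shape the inputs need: one row per test observation, a numeric score, and the matching class label.

```r
# Hypothetical stand-ins for TestData$loan_status and the predicted P(class "1")
labels <- factor(c("0", "0", "1", "0", "1", "1"), levels = c("0", "1"))
prob_1 <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.90)

# Scores and labels must come from the same rows and stay numeric / factor
roc_df <- data.frame(prob_1 = prob_1, loan_status = labels)
is_pos <- roc_df$loan_status == "1"

# ROC points: true and false positive rates at each observed threshold
thresholds <- sort(unique(roc_df$prob_1), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(roc_df$prob_1[is_pos] >= t))
fpr <- sapply(thresholds, function(t) mean(roc_df$prob_1[!is_pos] >= t))

# AUC from the rank-sum (Mann-Whitney) formula
n1  <- sum(is_pos)
n0  <- sum(!is_pos)
auc <- (sum(rank(roc_df$prob_1)[is_pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
print(auc)
```

A data frame shaped like roc_df, built from TestData, is also the kind of input to try passing to the package function, with the score and class given as column names; the exact argument conventions should be checked against the package documentation.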

Dataframe reading problems

So, I have a DataFrame generated by the following block:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult <- read.csv(url ,strip.white = TRUE ,header = FALSE )
colnames(adult) <- c("age", "workclass", "final weight", "education", "education-num", "martial-status",
                     "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
                     "hours-per-week", "native-country", "income")
The values in the "income" column are either "<=50k" or ">50k". When I try to select the people with income ">50k", I use the following command:
richs = adult[adult["income"] == ">50k",]
However, the richs DataFrame is always empty. What am I doing wrong?
Thanks.
First, I will download the data into a data frame with strings as factors:
> adults <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = FALSE, stringsAsFactors = TRUE)
> str(adults)
'data.frame': 32561 obs. of 15 variables:
$ V1 : int 39 50 38 53 28 37 49 52 31 42 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
$ V3 : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ V5 : int 13 13 9 7 13 14 5 9 14 13 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
$ V11: int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 40 13 40 40 40 40 16 45 50 40 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
If you look at the data closely, you will notice that the feature you are working on is a factor with two classes: 1 = " <=50K" and 2 = " >50K". Note the capital K (and, when strip.white is not set, the leading space), so a comparison against ">50k" never matches. One fast way to extract the samples with class 2 of this feature is to convert it to integer and filter on that:
> richadults = adults[as.integer(adults$V15) == 2, ]
> str(richadults)
'data.frame': 7841 obs. of 15 variables:
$ V1 : int 52 31 42 37 30 40 43 40 56 54 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 7 5 5 5 8 5 7 5 3 1 ...
$ V3 : int 209642 45781 159449 280464 141297 121772 292175 193524 216851 180211 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 12 13 10 16 10 9 13 11 10 16 ...
$ V5 : int 9 14 13 10 13 11 14 16 13 10 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 5 3 3 3 3 1 3 3 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 5 11 5 5 11 4 5 11 14 1 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 1 2 1 1 1 1 5 1 1 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 2 2 5 5 5 2 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 1 2 2 2 2 1 2 2 2 ...
$ V11: int 0 14084 5178 0 0 0 0 0 0 0 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 45 50 40 80 40 40 45 60 40 60 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 20 1 40 40 40 36 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 2 2 2 2 2 2 2 2 2 2 ...
The new data frame (richadults) contains only the 7,841 individuals whose income is >50K, out of the 32,561 samples in the original data set.
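Filtering on the integer code depends on the order of the factor levels. A more explicit alternative, sketched here on a small hypothetical data frame rather than the downloaded file, is to compare against the trimmed label with the correct capitalisation.

```r
# Hypothetical miniature of the adult data; note the leading spaces and capital K
adults <- data.frame(
  V1  = c(39, 52, 31),
  V15 = factor(c(" <=50K", " >50K", " >50K"))
)

# The original comparison finds nothing: wrong case, and the space is in the way
empty <- adults[adults$V15 == ">50k", ]
print(nrow(empty))

# Trim whitespace and match the label exactly as it appears in the data
richadults <- adults[trimws(as.character(adults$V15)) == ">50K", ]
print(richadults)
```

If the file is read with strip.white = TRUE, as in the question, the trimws() call becomes unnecessary and only the capitalisation needs fixing.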

Extracting complete dataframe from Hmisc package in R

I've used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code:
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y), it gives me a data frame, but it still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not even need completed at all.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
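If the original columns should be kept alongside the imputed ones, a common pattern is to copy the data frame and overwrite only the imputed columns by name. The sketch below avoids Hmisc entirely: imputed is a hypothetical hand-written list standing in for what impute.transcan returns with list.out=TRUE, and the values in it are made up.

```r
# Hypothetical miniature of mov.miss with some missing values
mov.miss <- data.frame(
  age     = c(30, NA, 35),
  balance = c(NA, 4789, 1350),
  y       = factor(c("no", "no", "yes"))
)

# Stand-in for the named list of filled-in variables (made-up values)
imputed <- list(
  age     = c(30, 33, 35),
  balance = c(1200, 4789, 1350)
)

# Copy the original, then replace the imputed columns by name
completed <- mov.miss
completed[names(imputed)] <- imputed

print(anyNA(completed))
str(completed)
```

Note the direction of the assignment: the list is written into the copy, whereas the line in the question reads from completed without ever filling it.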

Must a dataset contain all factor levels for SVM in R?

I'm trying to find class probabilities of new input vectors with support vector machines in R.
Training the model shows no errors.
fit <-svm(device~.,data=dataframetrain,
kernel="polynomial",probability=TRUE)
But predicting some input vector shows some errors.
predict(fit,dataframetest,probability=prob)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
dataframetrain looks like:
> str(dataframetrain)
'data.frame': 24577 obs. of 5 variables:
$ device : Factor w/ 3 levels "mob","pc","tab": 1 1 1 1 1 1 1 1 1 1 ...
$ geslacht : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 1 ...
$ leeftijd : num 77 67 67 66 64 64 63 61 61 58 ...
$ invultijd: num 12 12 12 12 12 12 12 12 12 12 ...
$ type : Factor w/ 8 levels "A","B","C","D",..: 5 5 5 5 5 5 5 5 5 5 ...
and dataframetest looks like:
> str(dataframetest)
'data.frame': 8 obs. of 4 variables:
$ geslacht : Factor w/ 1 level "M": 1 1 1 1 1 1 1 1
$ leeftijd : num 20 60 30 25 36 52 145 25
$ invultijd: num 6 12 2 5 6 8 69 7
$ type : Factor w/ 8 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8
I trained the model with 2 levels for 'geslacht', but sometimes I have to predict on data that contains only 1 level of 'geslacht'.
Is it possible to predict the class probabilities with a test set that has only 1 level of 'geslacht'?
I hope someone can help me!
Add another level (but not data) to geslacht.
x <- factor(c("A", "A"), levels = c("A", "B"))
x
[1] A A
Levels: A B
or
x <- factor(c("A", "A"))
levels(x) <- c("A", "B")
x
[1] A A
Levels: A B
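Applied to the data in the question, the same idea can be sketched by reusing the training levels for the test column. The two data frames below are small hypothetical stand-ins for dataframetrain and dataframetest.

```r
# Hypothetical stand-ins for the training and test data frames
dataframetrain <- data.frame(geslacht = factor(c("M", "V", "M", "V")))
dataframetest  <- data.frame(geslacht = factor(c("M", "M", "M")))

print(nlevels(dataframetest$geslacht))  # only one level before the fix

# Rebuild the test factor with the level set seen during training
dataframetest$geslacht <- factor(dataframetest$geslacht,
                                 levels = levels(dataframetrain$geslacht))

print(levels(dataframetest$geslacht))
print(table(dataframetest$geslacht))
```

Taking the levels from the training data, rather than typing them by hand, keeps their order identical between the two data frames.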
