I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income
I loaded the training and testing data:
# Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
# Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
I need to combine the two into one, but there is a problem: the structure of the two data frames is not the same.
Display structure of the training data
> str (training)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Display structure of the testing data
> str (testing)
'data.frame': 16282 obs. of 15 variables:
$ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
$ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
$ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
$ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
$ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
$ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
$ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
$ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
$ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
$ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
$ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
$ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
$ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
$ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
Problem 1:
age has become a factor in testing, and every factor in testing has one more level than the same factor in training. This is because the first row of the test file is an unnecessary line:
|1x3 Cross validator
I tried to get rid of it by re-assigning testing:
testing = testing[-1,]
but after running str() again, I don't see any change.
Problem 2:
As I said earlier, I need to combine the two data frames into one, so I ran this:
combined <- rbind(training , testing)
Besides problem 1, I see a new problem after running str():
> str(combined)
'data.frame': 48842 obs. of 15 variables:
$ age : chr "39" "50" "38" "53" ...
$ workclass : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
The target variable (incomelevel) has 5 factor levels in the combined data frame, whereas it has 2 (which is correct) in the training data frame and 3 (one extra because of problem 1) in the testing data frame. This is because every incomelevel value in the testing data frame ends with a . (dot): <=50K., <=50K., >50K., and so on. So I need to remove that dot, but I don't know how. Is there a function for this?
I am very new to data analysis and R, which is why I am running into basic issues like these. Can you please help me solve them?
I think you can skip the first line of the test file. It looks like a header, and skipping it solves the issue of age being a factor:
head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
Reusing your code, we can switch to read.csv, with skip = 1 for the test file:
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.csv (trainFile, header = FALSE, col.names = colNames,stringsAsFactors = TRUE,na.strings = "?",strip.white = TRUE)
testing = read.csv (testFile, header = FALSE, col.names = colNames,na.strings = "?",stringsAsFactors = TRUE,skip=1,strip.white = TRUE)
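To see why dropping the row with testing[-1, ] left str() unchanged, here is a minimal sketch on a three-line stand-in for adult.test (the text in txt and the three-column layout are made up for illustration). Removing a row does not remove its factor level or change the column type, whereas skipping the line at read time lets age be parsed as an integer.

```r
# Toy stand-in for adult.test: one junk line followed by two real rows
txt  <- "|1x3 Cross validator\n25, Private, <=50K.\n38, Private, >50K."
cols <- c("age", "workclass", "incomelevel")

# Reading every line: the junk line forces 'age' to be read as text
bad <- read.csv(text = txt, header = FALSE, col.names = cols,
                strip.white = TRUE, stringsAsFactors = TRUE)
bad <- bad[-1, ]      # the row is gone ...
class(bad$age)        # ... but the column is still a factor
nlevels(bad$age)      # and the junk value is still one of its levels

# Skipping the junk line at read time avoids the problem entirely
good <- read.csv(text = txt, header = FALSE, col.names = cols,
                 strip.white = TRUE, skip = 1, stringsAsFactors = TRUE)
class(good$age)

# Repairing an already-loaded frame: drop unused levels, then convert
fixed <- droplevels(bad)
fixed$age <- as.integer(as.character(fixed$age))
```

Converting through as.character first matters: calling as.integer directly on a factor returns the internal level codes rather than the ages.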
Now the income level: unfortunately we have to correct it manually. It's a good thing you checked:
testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
We check the levels; the only difference is the native country:
all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"
I don't think there's much you can do about that level, which exists only in the training data. You may have to remove it before or after joining:
setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"
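As a small illustration of the combining step, the sketch below uses two made-up two-row data frames in place of the real ones. It assumes the trailing dot has been stripped as above, and relies on the documented behaviour of rbind for data frames, which expands factor levels as needed, so a level that appears in only one of the inputs is kept in the result.

```r
# Made-up miniature versions of the two data frames
training <- data.frame(
  nativecountry = factor(c("United-States", "Holand-Netherlands")),
  incomelevel   = factor(c("<=50K", ">50K"))
)
testing <- data.frame(
  nativecountry = factor(c("United-States", "Cuba")),
  incomelevel   = factor(c("<=50K.", ">50K."))
)

# Strip only a trailing dot and reuse the training levels
testing$incomelevel <- factor(
  sub("\\.$", "", as.character(testing$incomelevel)),
  levels = levels(training$incomelevel)
)

# rbind merges the level sets of each factor column
combined <- rbind(training, testing)
levels(combined$incomelevel)
levels(combined$nativecountry)
```

Whether to keep a level with no test rows depends on the model being fitted, so treat this as one option rather than a recommendation.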
I apologize if this is rudimentary, but I have gone through the traceback and tried googling to no real avail. Every time I run my code to produce a ROC curve, I get this error:
Error in `[.data.frame`(data, , class) : undefined columns selected
I have checked the data, and the inputs are single-column character vectors (as required).
library(cutpointr)
Temp1 <- predict(KnnModel, newdata=TestData, type="prob")
KnnProbs <- predict(object = KnnModel, newdata = TestData, type = "prob")
KnnProbs <- as.character(KnnProbs$`0`)
clch <- as.character(TrainData$loan_status)
KnnROC <- roc(data = TestData$loan_status, x = KnnProbs, class = clch)
plot(KnnROC, print.auc = T)
Any ideas as to what I am doing wrong and how to fix it?
EDIT: The structure of TrainData is as follows:
'data.frame': 1500 obs. of 13 variables:
$ loan_amnt : num 6000 17625 8500 5000 10000 ...
$ loan_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ int_rate : num 13.33 15.61 6.68 6.92 14.98 ...
$ term : num 1 1 1 1 1 1 2 1 1 1 ...
$ installment : num 203 616 261 154 347 ...
$ grade : num 3 4 1 1 3 4 4 3 2 1 ...
$ emp_length : num 10 11 3 3 8 3 3 3 2 1 ...
$ annual_inc : num 30000 49000 53100 60000 37000 ...
$ dti : num 25.5 12.2 26.2 27.8 31.4 ...
$ sub_grade : num 13 16 3 4 13 16 18 12 8 4 ...
$ verification_status: num 1 2 2 3 3 3 3 1 3 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 5 2 2 2 2 6 2 ...
$ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
So, I have a DataFrame generated by the following block:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult <- read.csv(url ,strip.white = TRUE ,header = FALSE )
colnames(adult) <- c("age", "workclass", "final weight", "education", "education-num", "martial-status",
                     "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week",
                     "native-country", "income")
The values in the "income" column are either "<=50k" or ">50k". When I try to select the people with income ">50k", I use the following command:
richs = adult[adult["income"] == ">50k",]
However, the richs data frame is always empty. What am I doing wrong?
Thanks.
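First, I will download the data into a data frame with strings as factors (in R 4.0 and later this has to be requested with stringsAsFactors = TRUE, because the default changed to FALSE):
> adults <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = FALSE, stringsAsFactors = TRUE)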
> str(adults)
'data.frame': 32561 obs. of 15 variables:
$ V1 : int 39 50 38 53 28 37 49 52 31 42 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
$ V3 : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ V5 : int 13 13 9 7 13 14 5 9 14 13 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
$ V11: int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 40 13 40 40 40 40 16 45 50 40 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
If you look closely at the data, you will notice that the feature you are working with is a factor with two levels: 1 = " <=50K" and 2 = " >50K" (note the leading space and the capital K). One quick way to extract the rows with level 2 is to convert the factor to its integer codes and filter on those:
> richadults = adults[as.integer(adults$V15) == 2, ]
> str(richadults)
'data.frame': 7841 obs. of 15 variables:
$ V1 : int 52 31 42 37 30 40 43 40 56 54 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 7 5 5 5 8 5 7 5 3 1 ...
$ V3 : int 209642 45781 159449 280464 141297 121772 292175 193524 216851 180211 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 12 13 10 16 10 9 13 11 10 16 ...
$ V5 : int 9 14 13 10 13 11 14 16 13 10 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 5 3 3 3 3 1 3 3 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 5 11 5 5 11 4 5 11 14 1 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 1 2 1 1 1 1 5 1 1 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 2 2 5 5 5 2 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 1 2 2 2 2 1 2 2 2 ...
$ V11: int 0 14084 5178 0 0 0 0 0 0 0 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 45 50 40 80 40 40 45 60 40 60 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 20 1 40 40 40 36 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 2 2 2 2 2 2 2 2 2 2 ...
The new data frame (richadults) contains only the 7,841 individuals whose income is >50K. The original data set has 32,561 samples.
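The integer-code approach depends on the order of the levels, so a label-based comparison may be easier to read. The sketch below uses a made-up three-row frame in place of the real adults data; it trims the leading space with trimws and compares against ">50K" with a capital K, since a lowercase ">50k" matches nothing.

```r
# Made-up miniature of the adults data, with the leading spaces kept
adults <- data.frame(
  V1  = c(39L, 52L, 31L),
  V15 = factor(c(" <=50K", " >50K", " >50K"))
)

# Compare on the trimmed label instead of the integer level code
income     <- trimws(as.character(adults$V15))
richadults <- adults[income == ">50K", ]
nrow(richadults)

# The comparison is case-sensitive: lowercase "k" selects no rows
sum(income == ">50k")
```

Reading the file with strip.white = TRUE, as in the question, removes the leading space up front, in which case only the capital K matters.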
I used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code:
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y), it gives me a data frame, but it still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not need completed at all.
Besides, if you want to get a data.frame from the impute.transcan function, wrap the call in as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you want to inspect your missing-data pattern, you can also use the md.pattern function provided by the mice package.
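If installing another package is not convenient, a rough base-R check can confirm whether any NAs remain. The sketch below runs on a made-up three-row frame (df_check is a hypothetical name, not an object from the question); the same calls could be applied to the imputed data frame.

```r
# Hypothetical small frame with a few missing values
df_check <- data.frame(
  age     = c(30L, NA, 35L),
  balance = c(NA, 4789L, 1350L),
  y       = factor(c("no", "no", "yes"))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(df_check))
na_per_column

# Number of rows without any missing value
complete_rows <- sum(complete.cases(df_check))
complete_rows

# Single TRUE/FALSE answer: is anything still missing?
anyNA(df_check)
```

After a successful imputation, anyNA should return FALSE and every entry of na_per_column should be zero.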
I'm trying to find class probabilities of new input vectors with support vector machines in R.
Training the model shows no errors.
fit <-svm(device~.,data=dataframetrain,
kernel="polynomial",probability=TRUE)
But predicting on a new input gives an error:
predict(fit,dataframetest,probability=prob)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
dataframetrain looks like:
> str(dataframetrain)
'data.frame': 24577 obs. of 5 variables:
$ device : Factor w/ 3 levels "mob","pc","tab": 1 1 1 1 1 1 1 1 1 1 ...
$ geslacht : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 1 ...
$ leeftijd : num 77 67 67 66 64 64 63 61 61 58 ...
$ invultijd: num 12 12 12 12 12 12 12 12 12 12 ...
$ type : Factor w/ 8 levels "A","B","C","D",..: 5 5 5 5 5 5 5 5 5 5 ...
and dataframetest looks like:
> str(dataframetest)
'data.frame': 8 obs. of 4 variables:
$ geslacht : Factor w/ 1 level "M": 1 1 1 1 1 1 1 1
$ leeftijd : num 20 60 30 25 36 52 145 25
$ invultijd: num 6 12 2 5 6 8 69 7
$ type : Factor w/ 8 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8
I trained the model with 2 levels for 'geslacht', but sometimes I have to predict on data that contains only 1 level of 'geslacht'.
Is it possible to predict the class probabilities for a test set with only 1 level of 'geslacht'?
I hope someone can help me!!
Add the missing level (but no data) to geslacht, so that the factor in the test set has the same levels as in the training set:
x <- factor(c("A", "A"), levels = c("A", "B"))
x
[1] A A
Levels: A B
or
x <- factor(c("A", "A"))
levels(x) <- c("A", "B")
x
[1] A A
Levels: A B
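Applied to the data frames in the question, this could look like the sketch below. It uses made-up miniature versions of dataframetrain and dataframetest, and copies the levels from the training factor instead of typing them by hand, which keeps the level order consistent with the fitted model.

```r
# Made-up miniatures of the two data frames from the question
dataframetrain <- data.frame(geslacht = factor(c("M", "V", "M")))
dataframetest  <- data.frame(geslacht = factor(c("M", "M")))

nlevels(dataframetest$geslacht)   # only one level before the fix

# Rebuild the test factor with the training levels; the values stay the same
dataframetest$geslacht <- factor(dataframetest$geslacht,
                                 levels = levels(dataframetrain$geslacht))

levels(dataframetest$geslacht)
as.character(dataframetest$geslacht)
```

Using factor(..., levels = ) matches values by label, whereas assigning to levels(x) renames the existing levels by position, so the former is the safer choice when the test set's levels are not a prefix of the training levels.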