How to combine training and testing dataset in same format - r

I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income
I loaded training & testing data.
# Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
# Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
I needed to combined two into one. But, there is a problem. I am seeing structure of the two data is not same.
Display structure of the training data
> str (training)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Display structure of the testing data
> str (testing)
'data.frame': 16282 obs. of 15 variables:
$ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
$ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
$ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
$ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
$ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
$ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
$ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
$ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
$ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
$ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
$ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
$ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
$ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
$ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
Problem 1:
age has become factor at testing. and all other levels of factor in testing is being increased by 1 than levels of factor in training. This is because first row is an unnecessary row in testing.
|1x3 Cross validator
I tried to get rid of this by re-assigning testing:
testing = testing[-1,]
but, after running str() command again, I don't see any change.
Problem 2:
Like I said at previous, I needed to combine those two data-frame into one data-frame. So, I run this:
combined <- rbind(training , testing)
Besides the problem-1, I can see new a problem after running str()
> str(combined)
'data.frame': 48842 obs. of 15 variables:
$ age : chr "39" "50" "38" "53" ...
$ workclass : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
factor levels at target variable (incomelevel) in combined data-frame is 5 where it's 2 (which is correct) in the training data-frame and 3 (increased by 1 for problem-1) in testing data-frame. This is because there is a . (dot) after each value at incomelevel in testing data-frame (<=50K., <=50K., >50K.,......). So, I need to remove that .(dot) But, I am not getting idea how to remove it. Is there any function?
I am very in data and r. That's why, facing this type of basic issues. Can you please help me to solve the issue I am facing?

I think you can ignore the first line of test, this will solve the issue of age being a factor, because it seems like a header:
head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
We run your code, we can use read.csv, with skip=1 for test:
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.csv (trainFile, header = FALSE, col.names = colNames,stringsAsFactors = TRUE,na.strings = "?",strip.white = TRUE)
testing = read.csv (testFile, header = FALSE, col.names = colNames,na.strings = "?",stringsAsFactors = TRUE,skip=1,strip.white = TRUE)
Now, the income level, unfortunately we have to correct it manually, it's a good thing you check:
testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
We check levels, only difference is native country:
all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"
And I don't think there's much you can do, maybe you have to remove it before / after joining:
setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"

Related

Error when trying to use one_hot encoding

I know this may be a potential duplicate question, but I found other answers didn't work in my situation.
I am using the following dataset:
> str(total_data)
'data.frame': 32260 obs. of 13 variables:
$ age : int 40 42 44 32 25 31 30 30 27 28 ...
$ workclass : Factor w/ 4 levels "Other-Unknown",..: 3 2 2 1 2 2 2 3 2 3 ...
$ education : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 2 3 2 2 2 3 2 2 2 2 ...
$ marital.status : Factor w/ 5 levels "Divorced","Married",..: 2 1 2 3 3 3 3 2 2 3 ...
$ occupation : Factor w/ 6 levels "Blue-Collar",..: 5 3 6 2 1 6 6 1 1 6 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 1 5 1 1 5 5 5 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 1 1 ...
$ hours.per.week : int 84 40 40 38 40 38 48 70 35 38 ...
$ naitive.country: Factor w/ 41 levels "?","Cambodia",..: 39 39 39 39 39 39 39 12 39 39 ...
$ classifier : chr "<=50K" "<=50K" ">50K" "<=50K" ...
$ class_num : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 1 2 1 1 ...
$ age_norm : num 0.315 0.342 0.37 0.205 0.11 ...
$ hours_norm : num 0.847 0.398 0.398 0.378 0.398 ...
I'm trying to encode the factors into binary using one_hot() but receive the following error message:
encoded_data <- one_hot(total_data, dropCols = FALSE)
ERROR MESSAGE:
Error in `[.data.frame`(dt, , cols, with = FALSE) :
unused argument (with = FALSE)
I'm not sure what the "with" argument is as I don't see it in the R documentation.
I also saw that someone suggested to use model.matrix. However, when I use that, my ordered factor gets encoded as well, which is what I'm trying to avoid.
This is what happens to my ordered factor variable:
education.L education.Q education.C education^4 education^5 education^6
-3.779645e-01 9.690821e-17 4.082483e-01 -0.5640761 4.364358e-01 -0.19738551
-1.889822e-01 -3.273268e-01 4.082483e-01 0.0805823 -5.455447e-01 0.49346377
I'm also not sure why there are sometimes letters or numbers after the attribute name. i.e. education**.L** vs education**^5**
Convert the data.frame into a data.table and it should work fine.
library(data.table)
dt = data.table(total_data)
one_hot(dt)

Dataframe reading problems

So, I have a DataFrame generated by the following block:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult <- read.csv(url ,strip.white = TRUE ,header = FALSE )
colnames( adult ) <- c("age"," workclass "," final weight ","education "," education -num"," martial - status ","
occupation "," relationship "," race ","sex"," capital-gain "," capital - loss ","hours -per - week ","native -
country ","income")
The values in the "income" column are either "<=50k" or ">50k". when I try to select the people with income ">50k", I use the following comand:
richs = adult[adult["income"] == ">50k",]
however, the richs DataFrame is always empty. What am I doing wrong?
thanks.
First, I will download the data into a data frame with strings as factors:
>adults <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = FALSE)
> str(adults)
'data.frame': 32561 obs. of 15 variables:
$ V1 : int 39 50 38 53 28 37 49 52 31 42 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
$ V3 : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ V5 : int 13 13 9 7 13 14 5 9 14 13 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
$ V11: int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 40 13 40 40 40 40 16 45 50 40 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
If you look at the data close, you will notice that the feature you are working on is a factor having two classes: 1 = "<=50K" and 2 = ">50K". One fast way to extract the samples with class 2 of this feature is to convert it to integer and perform the operation on it:
> richadults = adults[as.integer(adults$V15) == 2, ]
> str(richadults)
'data.frame': 7841 obs. of 15 variables:
$ V1 : int 52 31 42 37 30 40 43 40 56 54 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 7 5 5 5 8 5 7 5 3 1 ...
$ V3 : int 209642 45781 159449 280464 141297 121772 292175 193524 216851 180211 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 12 13 10 16 10 9 13 11 10 16 ...
$ V5 : int 9 14 13 10 13 11 14 16 13 10 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 5 3 3 3 3 1 3 3 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 5 11 5 5 11 4 5 11 14 1 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 1 2 1 1 1 1 5 1 1 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 2 2 5 5 5 2 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 1 2 2 2 2 1 2 2 2 ...
$ V11: int 0 14084 5178 0 0 0 0 0 0 0 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 45 50 40 80 40 40 45 60 40 60 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 20 1 40 40 40 36 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 2 2 2 2 2 2 2 2 2 2 ...
In the new data frame (richadults) you will have 7 841 samples only with those individuals that have their income >50K. The original data set has 32 561 samples.

Contingency table with conditional probability using R

I have those data: http://www.unige.ch/ses/spo/static/simonhug/madi/Mitchell_et_al_1984.csv
> str(dataset)
'data.frame': 135 obs. of 13 variables:
$ CCode : int 2 20 40 41 42 51 52 70 90 91 ...
$ StateAbb : Factor w/ 130 levels "AFG","ALB","ALG",..: 124 19 28 52 33 62 117 75 49 53 ...
$ StateNme : Factor w/ 130 levels "Afghanistan",..: 122 20 27 51 33 62 116 76 47 52 ...
$ prison_score : Factor w/ 5 levels "never","often",..: 1 1 2 4 5 1 NA 4 5 4 ...
$ torture_score : Factor w/ 5 levels "never","often",..: 1 3 1 4 2 1 NA 2 5 2 ...
$ ht_colonial : Factor w/ 10 levels "0. Never colonized by a Western overseas colonial power",..: 1 1 4 8 4 7 7 4 4 4 ...
$ british : int NA NA 0 0 0 1 1 0 0 0 ...
$ british_colony : Factor w/ 2 levels "no","yes": NA NA 1 1 1 2 2 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
$ region_wb : Factor w/ 19 levels "Australia and New Zealand",..: 10 10 2 2 2 2 2 3 3 3 ...
$ gdppc_l1 : num 25839 23550 10095 1846 4758 ...
$ colonialExperience: chr NA NA "Other Colonial Background" "Other Colonial Background" ...
And have to create a similar result
With this code
# Copy the torture_score in a new col
dataset$torture_score_new = dataset$torture_score
# Add a level to the factor torture_score_new so we can t
levels(dataset$torture_score_new) = c(levels(dataset$torture_score_new), "rarely or never")
### Recode variables
# Torture score
dataset$torture_score_new[dataset$torture_score == "rarely"] = "rarely or never"
dataset$torture_score_new[dataset$torture_score == "never"] = "rarely or never"
dataset$torture_score_new = droplevels(dataset$torture_score_new)
dataset$torture_score_new = ordered(dataset$torture_score_new, levels =c("rarely or never", "somtimes", "often", "very often"))
### Text
dataset$colonialExperience = ifelse(dataset$british_colony == "yes",
"Former British Colony",
"Other Colonial Background")
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
addmargins(round(prop.table(useOfTortureByColonialExperience)*100,2),1)
and get this result
Former British Colony Other Colonial Background
rarely or never 9.76 20.73
somtimes 10.98 15.85
often 6.10 18.29
very often 10.98 7.32
Sum 37.82 62.19
But I don't understand how to use conditional stat and get the Chi Square.
(I'm a programmer, but a total newbe to R)
Ok it's what I end up doing.
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
# Get the number of observation
addmargins(useOfTortureByColonialExperience,1);
# Contingency table with conditional probability
useOfTortureByColonialExperienceProp = prop.table(useOfTortureByColonialExperience,2)
print(addmargins(useOfTortureByColonialExperienceProp*100,1),3)
## Chisq
chisq.test(useOfTortureByColonialExperience)
cramersV(useOfTortureByColonialExperience)

Error in code Neuralnet pacakge

I have to predict TragetBuy Variable which is coded as 0 and 1
I have the following code
library(neuralnet)
library(NeuralNetTools)
n <- names(train)
f <- as.formula(paste("TargetBuy ~", paste(n[!n %in% "TargetBuy"], collapse = " + ")))
parse_train <- model.matrix(~ ID + DemAffl + DemAge + DemCluster +
DemClusterGroup + DemGender + DemReg +
DemTVReg + PromClass + PromSpend + PromTime +
TargetBuy,
data = train)
head(parse_train)
nn <- neuralnet(f, data = parse_train,
hidden = 2,
err.fct = "ce",
threshold = 0.01,
linear.output = FALSE)
I am getting following error:
Error in eval(expr, envir, enclos) : object 'TargetBuy' not found
here I am providing str(train)
'data.frame': 15556 obs. of 12 variables:
$ ID : int 140 620 868 1120 2313 2771 3131 4529 5886 7420 ...
$ DemAffl : int 10 4 5 10 11 9 11 10 14 7 ...
$ DemAge : int 76 49 70 65 68 72 74 62 43 60 ...
$ DemCluster : int 16 35 27 51 4 28 3 49 49 52 ...
$ DemClusterGroup: Factor w/ 8 levels "","A","B","C",..: 4 5 5 7 2 5 2 7 7 7 ...
$ DemGender : Factor w/ 4 levels "","F","M","U": 4 4 2 3 2 4 2 3 2 2 ...
$ DemReg : Factor w/ 6 levels "","Midlands",..: 2 2 2 2 2 3 2 2 1 3 ...
$ DemTVReg : Factor w/ 14 levels "","Border","C Scotland",..: 13 13 13 6 6 9 4 4 1 7 ...
$ PromClass : Factor w/ 4 levels "Gold","Platinum",..: 1 1 3 4 4 2 4 3 1 1 ...
$ PromSpend : num 16000 6000 0.02 0.01 0.01 ...
$ PromTime : int 4 5 8 7 8 3 8 3 1 2 ...
$ TargetBuy : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...

Change type of variables in multiple data frames

I have a list of data frames:
str(df.list)
List of 34
$ :'data.frame': 506 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:506] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALAT": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:506] 23 23 23 24 25 24 20 34 28 17 ...
..$ Index : Factor w/ 502 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ...
$ :'data.frame': 505 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:505] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALB": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:505] 45 46 47 47 49 47 46 46 44 43 ...
..$ Index : Factor w/ 501 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ..
The list contains 34 data frames with equal variable names. The variables Time and ResultAssay are of the wrong type: I would like to have Time as factor and ResultAssay as numerical.
I am trying to generate a function to use together with lapply to convert the variable type of this list of 34 data frames in one go, but so far i am unsuccessful.
I have tried things in parallel to:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")]
}
lapply(df.list, ChangeType)
What you have tried is nearly correct, but you also need to return the new data.frame and also store it to your existing variable, as so:
ChangeType <- function(DF){
DF[,2] <- as.factor(DF[,2])
DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")]
DF #return the data.frame
}
# store the returned value to df.list,
# thus updating your existing data.frame
df.list <- lapply(df.list, ChangeType)

Resources