I'm looking to plot the results of a support vector classification in which the predictor variables are categorical (Gender and School). School takes one of 20 values and Gender one of 2. The response variable is whether the student attended school (y/n).
The attendance.df is as follows:
gender school attendance
1 male 1 1
2 male 2 0
3 male 1 1
4 female 2 1
5 male 1 1
6 female 2 0
7 female 3 1
8 male 4 0
9 female 5 1
10 female 6 1
11 female 7 1
12 male 8 1
13 male 9 1
14 male 10 1
15 male 10 1
16 male 11 0
17 male 12 1
18 female 13 1
19 male 14 1
20 female 15 0
21 female 16 1
22 male 17 0
23 female 18 1
24 female 19 0
25 female 4 1
26 male 5 1
27 male 5 1
28 male 20 0
The code for the SVM and plot is:
library(e1071) # provides svm() and tune()
# Builds linear SVM model
svm.response = tune(svm, attendance ~ ., data=attendance.df, kernel="linear", ranges=list(cost=c(0.001,0.01,0.1,1,10,100,1000)))
svm.response$best.parameters # Identifies best parameters to rebuild model below
svm.response = svm(attendance ~ ., data=attendance.df, kernel="linear", cost=10)
# Plots SVM classification plot
plot(svm.response, attendance.df, fill=TRUE)
I want the predictor variables (gender and school) on the x and y axes, with the linear SVM boundary separating attendance, which should be a black dot for yes and a red dot for no.
With the plot() call I have included, the error "‘min’ not meaningful for factors" appears because all variables are factors. I'm just unsure how to plot this.
Note: It may help to expand the df; I have deliberately shortened it in order to post.
I want to convert categorical columns in the dataset to be numerical values (1,2,3, etc).
How can I do this in R?
## Load vcd package
library(vcd)
## Load Arthritis dataset (data frame)
data(Arthritis)
Arthritis <- Arthritis[,2:5]
head(Arthritis)
Treatment Sex Age Improved
1 Treated Male 27 Some
2 Treated Male 29 None
3 Treated Male 30 None
4 Treated Male 32 Marked
5 Treated Male 46 Marked
6 Treated Male 58 Marked
Resulting dataset would look like this:
Treatment Sex Age Improved
[1,] 1 1 27 1
[2,] 1 1 29 0
[3,] 1 1 30 0
[4,] 1 1 32 2
[5,] 1 1 46 2
[6,] 1 1 58 2
If the number of variables is large, you may consider automating the conversion:
Arthritis2 <- sapply(Arthritis, unclass)
Edit:
Arthritis2 <- sapply(Arthritis, unclass) - 1
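One thing to watch with the one-liner above: subtracting 1 from the whole matrix also shifts numeric columns such as Age. Below is a minimal sketch that recodes only the factor columns, assuming a small made-up data frame (`toy`, with hypothetical values) in place of Arthritis. The codes follow the order of the factor levels, so the levels of Improved are set explicitly here.

```r
# Toy data frame standing in for Arthritis (hypothetical values)
toy <- data.frame(
  Sex      = factor(c("Male", "Female", "Male")),
  Age      = c(27, 29, 30),
  Improved = factor(c("Some", "None", "Marked"),
                    levels = c("None", "Some", "Marked"))
)

# Find the factor columns and recode only those to 0-based integer codes;
# numeric columns such as Age are left untouched.
is_fac <- sapply(toy, is.factor)
toy_num <- toy
toy_num[is_fac] <- lapply(toy[is_fac], function(x) as.integer(x) - 1L)

toy_num
```

The result stays a data frame rather than a matrix, which may or may not be what you want; `as.matrix(toy_num)` would convert it.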
Solution using named list and match function:
scores <- list("0" = "None", "1" = "Some", "2" = "Marked" )
Arthritis$Scores <- names(scores)[match(Arthritis$Improved, scores)]
head(Arthritis)
Sex Age Improved Scores
1 Male 27 Some 1
2 Male 29 None 0
3 Male 30 None 0
4 Male 32 Marked 2
5 Male 46 Marked 2
6 Male 58 Marked 2
If you don't want to keep the Improved column, simply do this instead:
Arthritis$Improved <- names(scores)[match(Arthritis$Improved, scores)]
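Note that `names()` returns character strings, so the Scores column above holds "0", "1", "2" as text. If numeric values are needed, one possible variant is a named integer vector used as a lookup table. This is only a sketch: `score_map` and `improved` are names made up for illustration.

```r
# Lookup table: category label -> numeric score (hypothetical name)
score_map <- c("None" = 0L, "Some" = 1L, "Marked" = 2L)

# Example values; a factor is used on purpose to show the pitfall
improved <- factor(c("Some", "None", "None", "Marked"))

# Index by the labels, not by the factor itself: indexing with a factor
# would use its internal integer codes instead of the label names.
scores_num <- unname(score_map[as.character(improved)])

scores_num
```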
I am attempting to use Random Forest. The training data has 7000 observations with 12 variables. These variables include both categorical and continuous variables. When I submit the code I receive the following warning:
Warning message: In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression?
The data is structured as such:
CustomerId CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
15634602 619 France Female 42 2 0 1 1 1 101348.88 1
15647311 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0
15619304 502 France Female 42 8 159660.8 3 1 0 113931.57 1
15701354 699 France Female 39 1 0 2 0 0 93826.63 0
15737888 850 Spain Female 43 2 125510.82 1 1 1 79084.1 0
15574012 645 Spain Male 44 8 113755.78 2 1 0 149756.71 1
15592531 822 France Male 50 7 0 2 1 1 10062.8 0
15656148 376 Germany Female 29 4 115046.74 4 1 0 119346.88 1
15792365 501 France Male 44 4 142051.07 2 0 1 74940.5 0
Based on research, I have attempted to change variables to factors, but this has not corrected the issue.
The random forest model code that I am using is as follows:
rfModel=randomForest(Exited~.,data=train)
I have been unable to proceed past the warning to this point.
Try converting the outcome variable to a factor. For example, if the outcome variable in train is y, run this before fitting the model:
train$y <- as.factor(train$y)
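Applied to the data in the question, the outcome column is Exited. Here is a minimal sketch, assuming a tiny made-up training set (`train_toy`, hypothetical values) and fitting the forest only if the randomForest package happens to be installed:

```r
# Toy stand-in for the training data (hypothetical values)
train_toy <- data.frame(
  CreditScore = c(619, 608, 502, 699, 850, 645, 822, 376, 501, 684),
  Age         = c(42, 41, 42, 39, 43, 44, 50, 29, 44, 27),
  Exited      = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 0)
)

# A numeric 0/1 response is treated as a regression target
class(train_toy$Exited)

# Converting the response to a factor requests classification instead
train_toy$Exited <- as.factor(train_toy$Exited)
class(train_toy$Exited)
levels(train_toy$Exited)

# Fit only when the package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_toy <- randomForest::randomForest(Exited ~ ., data = train_toy)
  print(rf_toy$type)
}
```

An identifier column such as CustomerId carries no predictive information, so it is probably worth dropping before fitting, though that is a separate issue from the warning.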
I'm starting to learn R and I'm playing around with the Titanic dataset on Kaggle. When attempting logistic regression on survival, I expected the predictions to be 0 and 1?
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked Title FamSize FamSizeD AgeRange
1 0 3 male 22 1 0 7.25 S Mr 2 small 21-30
2 1 1 female 38 1 0 71.2833 C Mrs 2 small 31-40
3 1 3 female 26 0 0 7.925 S Miss 1 single 21-30
4 1 1 female 35 1 0 53.1 S Mrs 2 small 31-40
5 0 3 male 35 0 0 8.05 S Mr 1 single 31-40
6 0 3 male 28 0 0 8.4583 Q Mr 1 single 21-30
7 0 1 male 54 0 0 51.8625 S Mr 1 single 51-60
8 0 3 male 2 3 1 21.075 S Master 5 large 0-20
9 1 3 female 27 0 2 11.1333 S Mrs 3 small 21-30
10 1 2 female 14 1 0 30.0708 C Mrs 2 small 0-20
logistix <- glm(Survived ~ Pclass + Sex + Embarked + Title + FamSizeD + AgeRange,
data = train,
family = 'binomial')
Prediction <- predict(logistix, test, type = "response")
Prediction
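For what it's worth, `predict()` with `type = "response"` on a binomial glm returns fitted probabilities between 0 and 1 rather than class labels. A minimal sketch of turning those into 0/1 predictions, assuming a 0.5 cutoff (an arbitrary choice, not something from the question) and using the built-in mtcars data as a stand-in for the Titanic data:

```r
# Built-in mtcars data stands in for the Titanic data;
# am is a 0/1 outcome, wt is a numeric predictor.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" gives probabilities, not 0/1 labels
prob <- predict(fit, newdata = mtcars, type = "response")
range(prob)

# Threshold the probabilities to obtain class labels (0.5 is an assumption)
pred_class <- ifelse(prob > 0.5, 1L, 0L)

# Cross-tabulate predicted against observed classes
table(predicted = pred_class, actual = mtcars$am)
```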
Imagine you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it this way:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
This allows me to complete my ultimate goal of compiling a bar chart in the way I want it:
barplot(con_P)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array!
Through another question here on this forum, I learned that the following will help me to overcome this issue:
con_P <- lapply(con, function(x) x/sum(x))
However, if I now run
barplot(con_P)
R does not create a barplot: Error in -0.01 * height : non-numeric argument to binary operator. I assume this is because con_P is not an array.
My question is what to do now (how would I transform con_P in the second example into an array?). Secondly, how can I make the entire step of creating prop.tables and then a bar chart more efficient? Any help is much appreciated.
We can do this by converting the columns to factors with the levels specified. In the second example, the drink columns hold only 0 and 1 values, so we use 0:1 as the levels, then get the table, convert it to proportions with prop.table, and draw the barplot:
barplot(prop.table(sapply(df[2:4],
function(x) table(factor(x, levels=0:1))),2))
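To see why fixing the levels helps, here is a small sketch on a made-up data frame (`toy`, hypothetical values). With explicit levels, `table()` keeps a zero count for the missing value, so every column yields a table of the same length and `sapply()` can return a matrix.

```r
# Toy data: Water contains only 1s, as in the second example
toy <- data.frame(Wine  = c(0, 1, 0, 1),
                  Beer  = c(1, 0, 0, 1),
                  Water = c(1, 1, 1, 1))

# Without explicit levels the table for Water has a single cell
table(toy$Water)

# With levels 0:1 the empty category is kept with a count of zero
table(factor(toy$Water, levels = 0:1))

# All tables now have length 2, so sapply() simplifies to a 2 x 3 matrix
counts <- sapply(toy, function(x) table(factor(x, levels = 0:1)))
props  <- prop.table(counts, 2)  # column-wise proportions

props
```

Passing `props` to `barplot()` then gives one stacked bar per column.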
Reproducing your data:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
con <-lapply(df[,c(2:4)], table)
con_P <- lapply(con, function(x) x/sum(x))
You can use reshape2 to melt the data:
library(reshape2)
df <- melt(con_P)
Now, if you want to use ggplot2, you can use df to draw the bar plot:
library(ggplot2)
ggplot(df, aes(x = L1, y = value, fill = factor(Var1) )) +
geom_bar(stat= "identity") +
theme_bw()
If you want to use barplot you can reshape the data.frame into an array:
array <- acast( df, Var1~L1)
array[is.na(array)] <- 0
barplot(array)
Imagine you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it this way:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array! Why?
Why does it make a difference if all data points within a variable have all the same outcome? Also, what can I do to circumvent this problem? Thanks guys!
The function prop.table uses the function sweep, which takes an array as its first argument. Since your second con is a list and not an array, prop.table fails.
Why is your second con a list? Because the table for the Water column has just one element (only the value 1 occurs), while the tables for the other columns have two elements. When the results have different lengths, apply cannot simplify them to an array and gives you a list instead.
In the example you gave us, a safer way is to work with lapply instead, which always returns a list:
con <- lapply(df, table)
con_P <- lapply(con, function(x) x/sum(x))
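The behaviour described above can be checked on a small made-up data frame (`toy`, hypothetical values). This is only a sketch to illustrate the simplification rule, not a replacement for the code above.

```r
# Toy data: Wine and Beer contain both 0 and 1, Water contains only 1
toy <- data.frame(Wine  = c(0, 1, 0, 1),
                  Beer  = c(1, 0, 0, 1),
                  Water = c(1, 1, 1, 1))

# Every table has length 2, so apply() simplifies to a matrix
same_len <- apply(toy[, c("Wine", "Beer")], 2, table)
is.matrix(same_len)

# The Water table has length 1, so apply() falls back to a list
diff_len <- apply(toy, 2, table)
is.list(diff_len)

# lapply() returns a list in both situations, so downstream code
# never has to deal with two different result types
con   <- lapply(toy, table)
con_P <- lapply(con, function(x) x / sum(x))
is.list(con_P)
```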