Converting categorical columns to numerical values

I want to convert categorical columns in the dataset to be numerical values (1,2,3, etc).
How can I do this in R?
## Load vcd package
library(vcd)
## Load Arthritis dataset (data frame)
data(Arthritis)
Arthritis <- Arthritis[,2:5]
head(Arthritis)
Treatment Sex Age Improved
1 Treated Male 27 Some
2 Treated Male 29 None
3 Treated Male 30 None
4 Treated Male 32 Marked
5 Treated Male 46 Marked
6 Treated Male 58 Marked
Resulting dataset would look like this:
Treatment Sex Age Improved
[1,] 1 1 27 1
[2,] 1 1 29 0
[3,] 1 1 30 0
[4,] 1 1 32 2
[5,] 1 1 46 2
[6,] 1 1 58 2

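If the number of variables is large, you can convert every column at once:
Arthritis2 <- sapply(Arthritis, unclass)
Edit: this gives 1-based codes. To get the 0-based codes shown in the question, subtract 1 from the factor columns only, since subtracting 1 from the whole matrix would also change Age:
is_fac <- sapply(Arthritis, is.factor)
Arthritis2[, is_fac] <- Arthritis2[, is_fac] - 1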
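A minimal sketch of the same idea that keeps the result a data frame: only the factor columns are turned into 0-based integer codes, and numeric columns such as Age are left alone. The data frame toy and the names is_fac and toy_num are made up for illustration; the factor levels are set by hand to mirror those of Arthritis.

```r
# Hypothetical stand-in for Arthritis, so the example runs without vcd
toy <- data.frame(
  Treatment = factor(c("Treated", "Placebo", "Treated"),
                     levels = c("Placebo", "Treated")),
  Sex       = factor(c("Male", "Female", "Male"),
                     levels = c("Female", "Male")),
  Age       = c(27, 29, 30),
  Improved  = factor(c("Some", "None", "Marked"),
                     levels = c("None", "Some", "Marked"))
)

# Find the factor columns
is_fac <- vapply(toy, is.factor, logical(1))

# Replace each factor by its integer code minus 1; other columns are untouched
toy_num <- toy
toy_num[is_fac] <- lapply(toy[is_fac], function(x) as.integer(x) - 1L)

toy_num
```

The codes follow the order of the factor levels, so it is worth checking levels() on each column before relying on them.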

Solution using a named list and the match function:
scores <- list("0" = "None", "1" = "Some", "2" = "Marked" )
Arthritis$Scores <- names(scores)[match(Arthritis$Improved, scores)]
head(Arthritis)
Sex Age Improved Scores
1 Male 27 Some 1
2 Male 29 None 0
3 Male 30 None 0
4 Male 32 Marked 2
5 Male 46 Marked 2
6 Male 58 Marked 2
If you don't want to keep the Improved column, overwrite it instead:
Arthritis$Improved <- names(scores)[match(Arthritis$Improved, scores)]
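Note that names() returns character strings, so the Scores column above holds "0", "1", "2" as text. If you need actual numbers, one possible variant (a sketch; score_map and improved are illustrative names, not part of the original answer) is to look the labels up in a named numeric vector:

```r
# Small stand-in for Arthritis$Improved
improved <- factor(c("Some", "None", "None", "Marked"),
                   levels = c("None", "Some", "Marked"))

# Named numeric vector: label -> score
score_map <- c(None = 0, Some = 1, Marked = 2)

# Index by the label text, then drop the names to get a plain numeric vector
scores <- unname(score_map[as.character(improved)])

scores              # 1 0 0 2
is.numeric(scores)  # TRUE
```

The as.character() call matters: indexing with the factor itself would use its integer codes as positions rather than its labels.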

Related

How to change an NA value in a specific row in R?

I am very new to R and still learning. My data is Titanic.csv, which has 891 observations and 13 variables. I would like to change the NA value in column 12 (column name "Embarked") to "S" for observation 62 (PassengerId 62) and to "C" for observation 830.
I found similar postings, but they didn't give me what I need.
How to replace certain values in a specific rows and columns with NA in R?
How to change NA value in a specific row and column?
My assignment asks me to use the function below.
boat<-within(boat,Embarked[is.na(Embarked)]<-"your choice here")
If I do this
boat<-within(boat,Embarked[is.na(Embarked)]<- "S")
or put "C" where it says "your choice here", it replaces both observations with the same value, either "S" or "C".
Below is the example of the Titanic.csv file.
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Owen male 22 1 0 A/5 1717.25 S
2 1 1 Cumings,John female 38 1 0 PC 9971.28 C85 C
17 0 3 Rice, Eugene male 2 4 1 382 29.125 Q
18 1 2 Williams,Charles male 0 0 2443 13 S
60 0 3 Goodwin, William male 11 5 2 CA 21 46.9 S
61 0 3 Sirayanian, Orsen male 22 0 0 2669 7.2292 C
62 1 1 Icard, Amelie female 38 0 0 11357 80 B28 NA
63 0 1 Harris, Henry male 45 1 0 36973 83.475 C83 S
My apologies if the sample dataframe is somewhat condensed.

# df is your data frame; the first index is the row (e.g. 62), the second is the column (e.g. 12)
df[62, 12]
# Now assign "S" with the `<-` operator
df[62, 12] <- "S"
# and check if NA is changed to S
df[62, 12]
#Embarked
#<chr>
# 1 S
# Same with
df[830, 12] <- "C"
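If the assignment requires within(), one way to stay with it is to add the passenger ID to the condition, so that each call changes only one row. This is a sketch under two assumptions: the data frame is called boat as in the question, and Embarked is a character column rather than a factor. The four-row boat below is a made-up stand-in for Titanic.csv.

```r
# Hypothetical miniature of the Titanic data with two missing ports
boat <- data.frame(
  PassengerId = c(61, 62, 63, 830),
  Embarked    = c("C", NA, "S", NA),
  stringsAsFactors = FALSE
)

# Restrict each replacement to one passenger by combining both conditions
boat <- within(boat, Embarked[is.na(Embarked) & PassengerId == 62]  <- "S")
boat <- within(boat, Embarked[is.na(Embarked) & PassengerId == 830] <- "C")

boat
```

Selecting by PassengerId instead of by row number keeps working if the rows are later reordered or filtered.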

Plot support vector machine with categorical predictor variables

I'm looking to plot the results of a support vector classification in which the predictor variables are categorical (Gender and School). School can be 1 of 20, Gender 1 of 2. The response variable is Attended school (y/n).
The attendance.df is as follows:
gender school attendance
1 male 1 1
2 male 2 0
3 male 1 1
4 female 2 1
5 male 1 1
6 female 2 0
7 female 3 1
8 male 4 0
9 female 5 1
10 female 6 1
11 female 7 1
12 male 8 1
13 male 9 1
14 male 10 1
15 male 10 1
16 male 11 0
17 male 12 1
18 female 13 1
19 male 14 1
20 female 15 0
21 female 16 1
22 male 17 0
23 female 18 1
24 female 19 0
25 female 4 1
26 male 5 1
27 male 5 1
28 male 20 0
The code for the SVM and plot is:
# Load e1071, which provides svm() and tune()
library(e1071)
# Builds linear SVM model
svm.response = tune(svm,attendance~ ., data=attendance.df, kernel="linear", ranges=list(cost=c(0.001,0.01,0.1,1,10,100,1000)))
svm.response$best.parameters # Identifies best parameters to rebuild model below
svm.response = svm(attendance ~ ., data=attendance.df, kernel="linear", cost=10)
# Plots SVM classification plot
plot(svm.response, attendance.df, fill=TRUE)
I want the predictor variables (gender and school) to be on the x and y, with the linear line (the SVM) to separate attendance which will be a black dot for yes, and red dot for no.
With the plot() I have included, the error "‘min’ not meaningful for factors" appears due to all variables being factors. I'm just unsure how to plot this.
Note: It may help to expand the df; I have deliberately shortened it in order to post.

R: apply/ lapply: How to Create a bar chart if all entries in one column are 1's?

imagine, you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it this way.
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
This allows me to complete my ultimate goal of compiling a bar chart in the way I want it:
barplot(con_P)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array!
Through another question here on this forum, I learned that the following will help me to overcome this issue:
con_P <- lapply(con, function(x) x/sum(x))
However, if I now run
barplot(con_P)
R does not create a barplot: Error in -0.01 * height : non-numeric argument to binary operator. I assume it is because it is no array!
My question is what to do now: how would I transform con_P in the second example into an array? Secondly, how can I make the entire step of creating prop.tables and then a bar chart more efficient? Any help is much appreciated.

We can do this by converting the columns to factors with the levels specified. In the second example, the columns take only the values 0 and 1, so we use 0:1 as the levels, then get the table for each column, convert it to proportions with prop.table, and draw the barplot:
barplot(prop.table(sapply(df[2:4],
function(x) table(factor(x, levels=0:1))),2))
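To see why fixing the levels helps, here is the one-liner above split into steps on the three drink columns from the question (the names drinks, counts and props are just for illustration). Because every column is tabulated against the same two levels, each table has exactly two entries, even for Water, so sapply can always simplify to a matrix.

```r
# The Wine, Beer and Water columns from the second data set in the question
drinks <- data.frame(
  Wine  = c(0, 1, 0, 1, 1, 1, 0, 0, 0, 0),
  Beer  = c(1, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  Water = rep(1, 10)
)

# Tabulate each column against the fixed levels 0 and 1;
# a level that never occurs is counted as 0 instead of being dropped
counts <- sapply(drinks, function(x) table(factor(x, levels = 0:1)))

# counts is a 2 x 3 matrix, so prop.table works column by column
props <- prop.table(counts, 2)

counts
props
# barplot(props) can then be called as before
```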

Reproducing your data:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
con <-lapply(df[,c(2:4)], table)
con_P <- lapply(con, function(x) x/sum(x))
You can use reshape2 to melt the data:
library(reshape2)
df <- melt(con_P)
Now, if you want to use ggplot2, you can use df to plot the bar plot:
library(ggplot2)
ggplot(df, aes(x = L1, y = value, fill = factor(Var1) )) +
geom_bar(stat= "identity") +
theme_bw()
If you want to use barplot you can reshape the data.frame into an array:
array <- acast( df, Var1~L1)
array[is.na(array)] <- 0
barplot(array)

R Frequency Tables: prop.table does not work if all data points within variable share the outcome?

imagine, you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it this way.
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array! Why?
Why does it make a difference if all data points within a variable have all the same outcome? Also, what can I do to circumvent this problem? Thanks guys!

The function prop.table uses the function sweep which takes an array as first argument. Since your second con is a list and not an array, prop.table will fail.
Why is your second con a list? Because the table for the Water column has just one entry, while the tables for the other columns have two. When the results differ in length, apply can't simplify them to an array and returns a list instead.
In the example you gave us, a safer way is to work with lapply instead, since it always returns a list:
con <- lapply(df[2:4], table)
con_P <- lapply(con, function(x) x/sum(x))
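The difference is easy to reproduce on a tiny example. The sketch below uses two made-up data frames (both_vary and one_constant are illustrative names) to show apply returning a matrix in one case and a list in the other, while lapply behaves the same way in both.

```r
# Every column contains both 0 and 1: each table has two entries
both_vary <- data.frame(Wine = c(0, 1, 0, 1), Beer = c(1, 0, 0, 1))

# Water is constant: its table has a single entry
one_constant <- data.frame(Wine = c(0, 1, 0, 1), Water = c(1, 1, 1, 1))

res_matrix <- apply(both_vary, 2, table)     # equal lengths: simplified
res_list   <- apply(one_constant, 2, table)  # unequal lengths: not simplified

is.matrix(res_matrix)  # TRUE
is.list(res_list)      # TRUE

# lapply returns a list in both cases, so the follow-up code never changes
props <- lapply(lapply(one_constant, table), function(x) x / sum(x))
props
```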

Retaining variables in dcast in R

I am using the dcast function in R to turn a long-format dataset into a wide-format dataset. I have an ID number, a categorical variable (CAT), and a continuous variable (AMT). However, I also have a variable SEX, which is the same for all rows of a given ID number. This code works to create the wide-format dataset, but I lose SEX. How can I retain it?
PC1cast <- dcast(PC1, ID~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
If I add SEX to the ID~CAT line, it gives me SEX-CAT combinations. I want SEX to just be one value for each row.
Sample data:
ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male

For that, you need to add SEX to the ID side of your formula:
dcast(PC1, ID + SEX~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
# results in:
ID SEX A B C
1 1 Female 46 22 31
2 2 Male 17 25 44
Things on the left-hand side of the formula are kept as-is; things on the right-hand side are cast.
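For comparison, the same rule can be sketched in base R without reshape2. This is not what dcast does internally, just an analogous approach: aggregate sums AMT within each ID/SEX/CAT group, and reshape spreads CAT into columns while treating ID and SEX together as the identifier. The names sums and wide are illustrative.

```r
# Sample data from the question
PC1 <- data.frame(
  ID  = c(1, 1, 1, 2, 2, 2),
  CAT = c("A", "B", "C", "A", "B", "C"),
  AMT = c(46, 22, 31, 17, 25, 44),
  SEX = rep(c("Female", "Male"), each = 3)
)

# Sum AMT within each ID/SEX/CAT combination
sums <- aggregate(AMT ~ ID + SEX + CAT, data = PC1, FUN = sum)

# Spread CAT into columns; ID and SEX identify a row and are carried along
wide <- reshape(sums, idvar = c("ID", "SEX"), timevar = "CAT",
                direction = "wide")

wide  # columns: ID, SEX, AMT.A, AMT.B, AMT.C
```

Note that the formula interface of aggregate drops rows with a missing AMT by default, which is similar in effect to na.rm=TRUE here.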

I added some extra data lines to clarify some parts of this. But the gist is that you just need to put SEX on the left hand side (i.e., of ~):
PC2 <- read.table(text="ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
3 A 47 Female
3 B 27 Female
3 C 37 Female
4 A 17 Male
4 A 17 Male
4 B 22 Male
4 B NA Male
4 C 44 Male", header=TRUE)
library(reshape2)
PC1cast2 <- dcast(PC2, ID+SEX~CAT, value.var='AMT', fun.aggregate=sum,
na.rm=TRUE)
PC1cast2
# ID SEX A B C
# 1 1 Female 46 22 31
# 2 2 Male 17 25 44
# 3 3 Female 47 27 37
# 4 4 Male 34 22 44
In your example data, you only have one instance of each combination and no NAs, so the fun.aggregate=sum, na.rm=TRUE doesn't do anything. When some are duplicated (e.g., there are two 4 As and two 4 Bs), the values are summed, but the NAs are dropped first. Make sure that is what you want.
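The aggregation for ID 4 can be checked by hand in base R. The sketch below (amt, category and the other names are illustrative) reproduces the 34, 22 and 44 in the last row of the output, and also shows a side effect worth knowing: a group that contains only NAs sums to 0 rather than NA when na.rm=TRUE.

```r
# The AMT values for ID 4 from the extended example, with their categories
amt      <- c(17, 17, 22, NA, 44)
category <- c("A", "A", "B", "B", "C")

# Sum within each category, dropping NAs first
sums <- tapply(amt, category, sum, na.rm = TRUE)
sums  # A = 34, B = 22, C = 44

# A group consisting only of NAs yields 0, not NA
all_missing <- sum(c(NA_real_, NA_real_), na.rm = TRUE)
all_missing  # 0
```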