Convert DataFrame Column to Factors - r

R, RStudio
How do I convert a data frame column to a factor?
I want 0 to be "North", 1 to be "South", 2 to be "East" and 3 to be "West".
directions <- data.frame(
state=c("New York","New Jersey","Deleware","Texax","Alaska"),
travel=c(0,0,3,2,1)
)
head(directions)
Outputs
state travel
1 New York 0
2 New Jersey 0
3 Deleware 3
4 Texax 2
5 Alaska 1
I tried the following, but the entire travel column is NA
directions$travel <- factor(directions$travel,levels=c("North","South","East","West"))
head(directions)
Outputs
state travel
1 New York <NA>
2 New Jersey <NA>
3 Deleware <NA>
4 Texax <NA>
5 Alaska <NA>

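The values 0 to 3 are the levels already present in the data; the names you want to display go in labels. Passing the names as levels gives NA because none of the values in travel match them.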
factor(directions$travel,labels=c("North","South","East","West"))
#[1] North North West East South
#Levels: North South East West
To control explicitly which value gets which label, specify the levels as well:
factor(directions$travel,levels = c(0, 1, 2, 3),
labels=c("North","South","East","West"))
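As a minimal sketch of the full round trip, the snippet below rebuilds the data frame from the question and assigns the labelled factor back to the column. Listing the levels explicitly is a design choice rather than a requirement here: with labels alone, factor() derives the levels from whatever values happen to be in the column, so a missing code would make the labels vector the wrong length.

```r
# Data frame from the question (state names kept exactly as posted).
directions <- data.frame(
  state  = c("New York", "New Jersey", "Deleware", "Texax", "Alaska"),
  travel = c(0, 0, 3, 2, 1)
)

# levels = the codes as stored in the data, labels = the names to display.
directions$travel <- factor(directions$travel,
                            levels = c(0, 1, 2, 3),
                            labels = c("North", "South", "East", "West"))

as.character(directions$travel)
levels(directions$travel)
```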

Related

R Split strings into two columns based on TWO regular expressions

I have some data that is structured something like this:
ID Region Value
1 Europe 8
2 Europe: Class 1 6
3 Asia: System 2 6
4 North America 7
5 Europe: System 1 5
6 Africa 7
7 Africa: Class 2 5
8 South America 9
9 Europe: System 1 3
10 Europe 7
What I want to do is create a new column called Class which adds instances of where "Class" AND "System" are mentioned in the Region column - if it's not clear what I mean, take a look at my expected output below. I know this can be done with the separate function but I think you can only specify one value for the separator part of the code. E.g. sep = ": Class" will only split instances that mention "class" but I also want to split any instances where "system" is mentioned too. Can this be done in one line of code, or do I need to do something a bit more complicated here? Here's how my final data should look:
ID  Region         Class  Value
1   Europe                8
2   Europe         1      6
3   Asia           2      6
4   North America         7
5   Europe         1      5
6   Africa                7
7   Africa         2      5
8   South America         9
9   Europe         1      3
10  Europe                7
Please note, I want to remove any reference to "class" or "system" (including colons) from the Region column, and simply add the numerical value to a new Class column.
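You can do it with base functions by using strsplit with a regular expression that treats either ": Class " or ": System " as the separator. sapply is used rather than lapply so that the new columns are plain character vectors instead of lists, and the trailing space in the pattern keeps a leading space out of Class:
splitted <- strsplit(df$Region, ": (Class|System) ")
# first piece is the region, second piece (if any) is the number
df$Region <- sapply(splitted, function(x) x[1])
df$Class <- sapply(splitted, function(x) x[2])
The result is:
> df
   ID        Region Value Class
1   1        Europe     8  <NA>
2   2        Europe     6     1
3   3          Asia     6     2
4   4 North America     7  <NA>
5   5        Europe     5     1
6   6        Africa     7  <NA>
7   7        Africa     5     2
8   8 South America     9  <NA>
9   9        Europe     3     1
10 10        Europe     7  <NA>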
You can use str_extract to extract the number and str_remove to drop the text that you don't want.
library(dplyr)
library(stringr)
df %>%
mutate(Class = str_extract(Region, '(?<=(Class|System)\\s)\\d+'),
Region = str_remove(Region, ':\\s*(Class|System)\\s*\\d+'))
# ID Region Value Class
#1 1 Europe 8 <NA>
#2 2 Europe 6 1
#3 3 Asia 6 2
#4 4 North America 7 <NA>
#5 5 Europe 5 1
#6 6 Africa 7 <NA>
#7 7 Africa 5 2
#8 8 South America 9 <NA>
#9 9 Europe 3 1
#10 10 Europe 7 <NA>
str_extract extracts the number that comes after 'Class' or 'System'. If neither word is present, it returns NA.
str_remove removes a colon followed by zero or more whitespace characters (\\s*), then either 'Class' or 'System', then optional whitespace and a number (\\d+).
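For readers without stringr, the same extract-then-remove idea can be sketched in base R with grepl() and sub(). This is only an illustration on a four-row subset of the question's data; the variable names pattern and has_tag are made up for the example.

```r
# Small subset of the question's data, for illustration only.
df <- data.frame(
  ID     = 1:4,
  Region = c("Europe", "Europe: Class 1", "Asia: System 2", "North America"),
  Value  = c(8L, 6L, 6L, 7L)
)

# Colon, optional spaces, "Class" or "System", optional spaces, digits at the end.
pattern <- ":\\s*(Class|System)\\s*(\\d+)$"
has_tag <- grepl(pattern, df$Region, perl = TRUE)

# Keep only the captured number where the tag exists, NA otherwise.
df$Class <- ifelse(has_tag,
                   sub(paste0(".*", pattern), "\\2", df$Region, perl = TRUE),
                   NA_character_)

# Drop the tag from Region; rows without a tag are left unchanged.
df$Region <- sub(pattern, "", df$Region, perl = TRUE)
df
```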
data
It is easier to help if you provide data in a reproducible format which is easier to copy.
df <- structure(list(ID = 1:10, Region = c("Europe", "Europe: Class 1",
"Asia: System 2", "North America", "Europe: System 1", "Africa",
"Africa: Class 2", "South America", "Europe: System 1", "Europe"
), Value = c(8L, 6L, 6L, 7L, 5L, 7L, 5L, 9L, 3L, 7L)),
class = "data.frame", row.names = c(NA, -10L))

How do you match a numeric value to a categorical value in another data set

I have two data sets. One with a numeric value assigned to individual categorical variables (country name) and a second with survey responses including a person's nationality. How do I assign the numeric value to a new column in the survey dataset with matching nationality/country name?
Here is the head of data set 1 (my.data1):
EN HCI
1 South Korea 0.845
2 UK 0.781
3 USA 0.762
Here is the head of data set 2 (my.data2):
Nationality OIS IR
1 South Korea 2 2
2 South Korea 3 3
3 USA 3 4
4 UK 3 3
I would like to make it look like this:
Nationality OIS IR HCI
1 South Korea 2 2 0.845
2 South Korea 3 3 0.845
3 USA 3 4 0.762
4 UK 3 3 0.781
I have tried this but unsuccessfully:
my.data2$HCI <- NA
for (i in i:nrow(my.data2)) {
my.data2$HCI[i] <- my.data1$HCI[my.data1$EN == my.data2$Nationality[i]]
}
We can use a left_join
library(dplyr)
left_join(my.data2, my.data1, by = c("Nationality" = "EN"))
Or with merge from base R (all.x = TRUE keeps every row of my.data2, like a left join):
merge(my.data2, my.data1, by.x = "Nationality", by.y = "EN", all.x = TRUE)
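A third option, sketched here on the rows shown in the question, is a plain lookup with match(). Unlike merge(), which sorts the result by the key column by default, this keeps the survey rows in their original order. The name idx is just an illustrative helper.

```r
# The two data sets as shown in the question.
my.data1 <- data.frame(EN  = c("South Korea", "UK", "USA"),
                       HCI = c(0.845, 0.781, 0.762))
my.data2 <- data.frame(Nationality = c("South Korea", "South Korea", "USA", "UK"),
                       OIS = c(2, 3, 3, 3),
                       IR  = c(2, 3, 4, 3))

# For each survey row, find the position of its nationality in my.data1$EN.
idx <- match(my.data2$Nationality, my.data1$EN)

# Pull the HCI value at that position; unmatched nationalities become NA.
my.data2$HCI <- my.data1$HCI[idx]
my.data2
```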

Recode by comparing a value to numbers in a vector

I want to code the values in a column into fewer values in another column.
For example,
if the value in zipcode column is one of the following c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028),
code it as "west" in district column.
How can I do it in R?
You can use the ifelse() function.
Set up the data in a dataframe:
df <- data.frame(zipcode = c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028))
Then use ifelse() to code a new value based on the values of zipcode.
df$district <- ifelse(df$zipcode %in% c(90272,90049,90077,90210,90046,90069,90024,90025,90048,90036,90038,90028),
"west",
NA)
> df
zipcode district
1 90272 west
2 90049 west
3 90077 west
4 90210 west
5 90046 west
6 90069 west
7 90024 west
8 90025 west
9 90048 west
10 90036 west
11 90038 west
12 90028 west
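Since every zip code in that data frame is in the list, the output cannot show what happens to a non-matching value. The sketch below stores the zip list once and adds one zip code outside the list (90001, chosen purely for illustration) together with a hypothetical fallback label "other" in place of NA.

```r
# Zip codes from the question that should be coded as "west".
west_zips <- c(90272, 90049, 90077, 90210, 90046, 90069,
               90024, 90025, 90048, 90036, 90038, 90028)

# 90001 is an illustrative value that is not in the list.
df <- data.frame(zipcode = c(90272, 90210, 90001, 90028))

# "other" is a hypothetical fallback label; use NA if you prefer missing values.
df$district <- ifelse(df$zipcode %in% west_zips, "west", "other")
df
```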

R-How to obtain relationships between cutree groups?

Hopefully title is not too badly worded. I have a tree that I used cutree to obtain groups from, but it is clear that the groups are not numbered left-to-right or right-to-left (I know the orientation within a branch doesn't matter so much, was hoping the grouping would be the same as the ordering in the hclust object). Is it possible to extract groups from a tree (using the height option of cutree) and know which of those groups are more related to one another? I walk through an example using USArrests below.
hc <- hclust(dist(USArrests), "ave")
plot(hc)
cutree(hc,h=60)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 3 1 4 2
Hawaii Idaho Illinois Indiana Iowa
3 3 1 3 3
Kansas Kentucky Louisiana Maine Maryland
3 3 1 3 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 3 1 2
Montana Nebraska Nevada New Hampshire New Jersey
3 3 1 3 2
New Mexico New York North Carolina North Dakota Ohio
1 1 4 3 3
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 3 2 1
South Dakota Tennessee Texas Utah Vermont
3 2 2 3 3
Virginia Washington West Virginia Wisconsin Wyoming
2 2 3 3 2
If you plot the tree, it is clear that groups 1 and 4 are more closely related to each other, as are groups 2 and 3. However, when I just print the contents of each group there is no way to know what that relationship is. Is there a function or standard process I am missing? In the real data I'm working with, I split 36k values into 10 groups, so it would be tough to validate the relationships visually as I do with the example data, and I want to code it as a script for future analyses. Thanks ahead of time.
I think you want to use
hc <- hclust(dist(USArrests), "ave")
cuthc <- cut(as.dendrogram(hc), h=60)
This will return a list with an $upper showing the tree above the cut, and a $lower element which is a list of each of the subtrees made from the cut. We can plot them with
layout(matrix(1:4, ncol=2))
sapply(1:4, function(i) plot(cuthc$lower[[i]]))
Then, if you want to extract the names and groups in the order they appear in the dendrograms, you can do
stack(setNames(Map(labels, cuthc$lower),seq_along(cuthc$lower)))
Here I use stack() and setNames() just to assign a unique ID to each element in the $lower list; stack() doesn't like it when the list isn't named.
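To tie the subtrees back to the group numbers that cutree reports, one possible sketch is to index the cutree result by the plotting order stored in hc$order, and by the labels of each subtree. The names left_to_right and subtree_group are made up for this illustration.

```r
hc <- hclust(dist(USArrests), "ave")
groups <- cutree(hc, h = 60)

# hc$order holds the observations in the order they are drawn in the plot,
# so this lists the cutree group numbers as they appear along the plot axis.
left_to_right <- unique(groups[hc$order])

# Map each subtree below the cut to the cutree group its members belong to.
cuthc <- cut(as.dendrogram(hc), h = 60)
subtree_group <- sapply(cuthc$lower, function(d) unique(groups[labels(d)]))

left_to_right
subtree_group
```

Groups that sit next to each other in left_to_right are not necessarily the closest ones; plotting or printing cuthc$upper shows how the groups are actually joined above the cut.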

Clustering list for hclust function

Using plot(hclust(dist(x))), I was able to draw a cluster tree. It works. Yet I would like to get a list of all clusters, not a tree diagram, because I have a huge amount of data (around 150K nodes) and the plot gets messy.
In other words, let's say a b c is a cluster and d e f g is a cluster; then I would like to get something like this:
1 a,b,c
2 d,e,f,g
Please note that this is not exactly what I want to get as an "output". It is just an example. I would just like to be able to get a list of clusters instead of a tree plot. It could be a vector, a matrix or just simple numbers that show which group each element belongs to.
How is this possible?
I will use the dataset available in R to demonstrate how to cut a tree into desired number of pieces. Result is a table.
Construct a hclust object.
hc <- hclust(dist(USArrests), "ave")
#plot(hc)
You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of groups with the k parameter. See ?cutree and the use of parameter h (the height at which to cut), which may be more useful to you (see cutree(hc, k = 2) == cutree(hc, h = 110)).
cutree(hc, k = 2)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 2 1 1 2
Hawaii Idaho Illinois Indiana Iowa
2 2 1 2 2
Kansas Kentucky Louisiana Maine Maryland
2 2 1 2 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 2 1 2
Montana Nebraska Nevada New Hampshire New Jersey
2 2 1 2 2
New Mexico New York North Carolina North Dakota Ohio
1 1 1 2 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 2 2 1
South Dakota Tennessee Texas Utah Vermont
2 2 2 2 2
Virginia Washington West Virginia Wisconsin Wyoming
2 2 2 2 2
Let's say x is your data:
y<-dist(x)
clust<-hclust(y)
groups<-cutree(clust, k=3)
x<-cbind(x,groups)
Now each record has its cluster group attached.
You can subset the dataset as well:
x1<- subset(x, groups==1)
x2<- subset(x, groups==2)
x3<- subset(x, groups==3)
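To get something close to the "1 a,b,c" layout from the question, one option is to split the observation names by their cutree group. This is a sketch on USArrests, assuming the observations have names (row names here); clusters and cluster_table are illustrative names.

```r
hc <- hclust(dist(USArrests), "ave")
groups <- cutree(hc, k = 3)

# One list element per cluster, each holding the names of its members.
clusters <- split(names(groups), groups)

# Collapse each cluster into a single comma-separated string.
cluster_table <- data.frame(
  cluster = names(clusters),
  members = sapply(clusters, paste, collapse = ",")
)
cluster_table
```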
