I'd like to create a grouping variable based on how similar a selection of names is. I have started by using the stringdist package to generate a distance measure, but I'm not sure how to use that output to generate a grouping variable. I've looked at hclust, but it seems that clustering functions require you to know how many groups you want in the end, and I do not know that. The code I started with is below:
library(stringdist)
name_list <- c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_dist <- stringdistmatrix(name_list)
name_dist
name_dist2 <- stringdistmatrix(name_list, method="soundex")
name_dist2
I would like to end up with a data frame whose two columns look like this:
name = c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_group = c(1, 1, 1, 2, 2, 2, 3, 3, 4)
The groups might be slightly different depending obviously on what distance measure I use (I've suggested two above) but I would probably choose one or the other to run.
Basically, how do I get from the distance matrix to a group variable without knowing the number of clusters I'd like?
You can also use adist(...) in base R to calculate the Levenshtein distances, and cluster based on that.
n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)
rownames(d) <- n
cl <- hclust(as.dist(d))
plot(cl)
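To get from the tree to a grouping variable without fixing the number of clusters, one option is to cut the tree at a distance threshold instead. The following is a minimal sketch, assuming that names within an edit distance of 2 of each other should share a group; the threshold and the names max_dist and groups are illustrative choices, not part of the answer above.

```r
n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)                      # Levenshtein distance matrix (base R, utils)
rownames(d) <- n
cl <- hclust(as.dist(d))           # complete linkage by default

# Assumption: 2 edits is "similar enough"; tune this for your data.
max_dist <- 2
groups <- cutree(cl, h = max_dist) # cut by height, so no cluster count is needed

result <- data.frame(name = n, name_group = unname(groups))
print(result)
```

With complete linkage, every pair of names inside a group is at most max_dist edits apart, so raising or lowering the threshold makes the groups coarser or finer.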
You could use a cluster analysis like this:
# load the package
library(stringdist)
# group selection by number of classes or by height
num.class <- 5
num.height <- 0.5
# define names
n <- c("Mary", "Mery", "Mari", "Joe",
       "Jo", "Joey", "Bob", "Beb", "Paul")
# calculate distances
d <- stringdistmatrix(n, method = "soundex")
# cluster the names
h <- hclust(d)
# cut the tree into a fixed number of classes
m <- cutree(h, k = num.class)
# cut the tree at a given height
p <- cutree(h, h = num.height)
# build the resulting data frame
df <- data.frame(names = n,
                 group.class = m,
                 group.prob = p)
It produces:
df;
names group.class group.prob
1 Mary 1 1
2 Mery 1 1
3 Mari 1 1
4 Joe 2 2
5 Jo 2 2
6 Joey 2 2
7 Bob 3 3
8 Beb 4 3
9 Paul 5 4
And the chart gives you an overview:
plot(h, labels=n);
Regards huck
I have two vectors:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
I need to show the slowest person. I did it with if and else if statements, but I wonder whether there is an easier way, such as automatically assigning "Slow" = 1, "Average" = 2, and so on. In other words, attaching values to the labels.
At the end there should be a vector like
names_speeds <- c(names_of_p, speeds)
so that I can compare people and find out who is faster.
You could turn speeds into an ordered factor, which would preserve the labeling while also creating an underlying numerical representation:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
speeds <- factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
names_of_p[order(speeds)]
[1] "John" "Robert" "James" "Adam"
names_of_p[as.numeric(speeds) < 3]
[1] "John" "James" "Robert"
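Because an ordered factor supports min(), max() and the comparison operators, the slowest person can be picked out directly. A short sketch along those lines (the names slowest, fastest and james_faster are made up for illustration):

```r
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- factor(c("Slow", "Fast", "Average", "Slow"),
                 levels = c("Slow", "Average", "Fast"), ordered = TRUE)

# Everyone tied for the lowest / highest level
slowest <- names_of_p[speeds == min(speeds)]
fastest <- names_of_p[speeds == max(speeds)]

# Compare two people directly
james_faster <- speeds[names_of_p == "James"] > speeds[names_of_p == "John"]

print(slowest)
print(fastest)
print(james_faster)
```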
It might also be a good idea to store the data in a data frame rather than in separate vectors:
library(tidyverse)
df <- data.frame(
  names_of_p = names_of_p,
  speeds = factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
)
df %>%
  arrange(speeds)
names_of_p speeds
<chr> <ord>
1 John Slow
2 Robert Slow
3 James Average
4 Adam Fast
df %>%
  filter(as.numeric(speeds) < 3)
names_of_p speeds
<chr> <ord>
1 John Slow
2 James Average
3 Robert Slow
First assign names to the speeds vector, which gives you a named vector.
After that you can use which:
names(speeds) <- names_of_p
which(speeds=="Slow")
John Robert
1 4
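If you literally want numbers attached to the labels, as the question describes, a named lookup vector is one way to do it. This is a sketch only; speed_value and speed_num are hypothetical names, and the 1/2/3 coding is an assumption taken from the question.

```r
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")

# Lookup table: label -> numeric value (assumed coding)
speed_value <- c(Slow = 1, Average = 2, Fast = 3)

# Index the lookup by label, then name the result by person
speed_num <- setNames(speed_value[speeds], names_of_p)
print(speed_num)

# People with the lowest value
slowest <- names(speed_num)[speed_num == min(speed_num)]
print(slowest)
```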
I have two datasets that I want to merge. One of the columns that I want to use as a key to merge has the values in a list. If any of those values appear in the second dataset’s column, I want the value in the other column to be merged into the first dataset – which might mean there are multiple values, which should be presented as a list.
That is quite hard to explain but hopefully this example data makes it clearer.
Example data
library(data.table)
mother_dt <- data.table(mother = c("Penny", "Penny", "Anya", "Sam", "Sam", "Sam"),
child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"))
mother_dt [, children := .(list(unique(child))), by = mother]
mother_dt [, child := NULL]
mother_dt <- unique(mother_dt , by = "mother")
child_dt <- data.table(child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"),
age = c(10, 8, 9, 6, 5, 2))
So for example, the first row in my new dataset would have “Penny” in the mother column, a list containing “Violet” and “Prudence” in the children column, and a list containing 10 and 8 in the age column.
I've tried the following:
combined_dt <- mother_dt[, child_age := ifelse(child_dt$child %in% children,
                                               .(list(unique(child_dt$age))), NA)]
But that just contains a list of all the ages in the final row.
I appreciate this is probably quite unusual behaviour but is there a way to achieve it?
Edit: The final data.table would look like this:
final_dt <- data.table(mother = c("Penny", "Anya", "Sam"),
children = c(list(c("Violet", "Prudence")), list(c("Erika")), list(c("Jake", "Wolf", "Red"))),
age = c(list(c(10, 8)), list(c(9)), list(c(6, 5, 2))))
The easiest way I can think of is to first unlist the children, then merge, then list again:
mother1 <- mother_dt[,.(children=unlist(children)),by=mother]
mother1[child_dt,on=c(children='child')][,.(children=list(children),age=list(age)),by=mother]
You can do something like this:
library(splitstackshape)
newm <- mother_dt[,.(children=unlist(children)),by=mother]
final_dt <- merge(newm,child_dt,by.x = "children",by.y = "child")
aggregate(. ~ mother, data = final_dt, toString)
mother children age
1 Anya Erika 9
2 Penny Prudence, Violet 8, 10
3 Sam Jake, Red, Wolf 6, 2, 5
You could do it the following way, which has the advantage of preserving duplicates in the mother column when they exist.
mother_dt$age <- lapply(
mother_dt$children,
function(x,y) y[x],
y = setNames(child_dt$age, child_dt$child))
mother_dt
# mother children age
# 1: Penny Violet,Prudence 10, 8
# 2: Anya Erika 9
# 3: Sam Jake,Wolf,Red 6,5,2
It translates nicely into tidyverse syntax:
library(tidyverse)
mutate(mother_dt, age = map(children,~.y[.], deframe(child_dt)))
# mother children age
# 1 Penny Violet, Prudence 10, 8
# 2 Anya Erika 9
# 3 Sam Jake, Wolf, Red 6, 5, 2
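The lapply() answer above works because indexing a named vector by a character vector looks the values up by name. Here is a minimal base R sketch of just that step, using plain lists and vectors rather than a data.table (the object names children, age_lookup and ages are illustrative):

```r
# One character vector of children per mother
children <- list(Penny = c("Violet", "Prudence"),
                 Anya  = "Erika",
                 Sam   = c("Jake", "Wolf", "Red"))

# Named vector: child name -> age
age_lookup <- c(Violet = 10, Prudence = 8, Erika = 9,
                Jake = 6, Wolf = 5, Red = 2)

# For each mother, pull her children's ages by name
ages <- lapply(children, function(x) unname(age_lookup[x]))
str(ages)

# A child missing from the lookup yields NA rather than an error
missing_age <- unname(age_lookup["Nobody"])
print(missing_age)
```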
I have the following data:
DF <- data.frame(Members = c("Eva", "Charlie1", "Fred", "Charlie2", "Adam", "Eva", "Charlie2", "David", "Adam", "David", "Charlie1"))
I would like to create a function that returns a specific value if the member's name meets certain criteria:
Return "Group1" if the member name equals Eva or Adam
Return "Group2" if the member name contains the string "Charlie"
Return "Group3" if the member name matches neither of the first two rules
I'd like to return "Group1", "Group2", or "Group3" into a new column in DF called "Team".
I've accomplished it with the following code, but I'm interested in how to accomplish it with a function:
DF$Team <- with(DF, ifelse((DF$Members=="Eva"|DF$Members=="Adam"),"Group1",
ifelse((grepl("Charlie", DF$Members)),"Group2","Group3")))
Do you mean to create a function? Sort of like this:
DF <- data.frame(Members = c("Eva", "Charlie1", "Fred", "Charlie2", "Adam", "Eva", "Charlie2", "David", "Adam", "David", "Charlie1"))
get_group <- function(members) {
  # vectorised: returns one group label per element of members
  ifelse(members %in% c("Eva", "Adam"), "Group1",
         ifelse(grepl("Charlie", members), "Group2", "Group3"))
}
DF$Group <- get_group(DF$Members)
In my own experience, the most challenging part of dealing with matters like this has been the "everything else" bucket. I usually have a good sense of what I want elsewhere.
The conventional approach is to use ifelse. This is generally efficient, but I find it difficult to read. My preferred approach, assuming DF$Members is a factor (the data.frame default before R 4.0.0; in later versions, convert it with factor() first), is to use something like
levels(DF$Members) <- list(Group1 = c("Eva", "Adam"),
Group2 = c("Charlie1", "Charlie2"),
Group3 = c("David", "Fred"))
The problem with this approach is I have to explicitly name all of the values that map to each group. That doesn't help resolve the "everything else" issue.
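To see why unlisted values are a problem for the list form of levels<-, here is a small sketch on a toy factor (f is a hypothetical stand-in for DF$Members). Any level that is not named in the list ends up as NA:

```r
f <- factor(c("Eva", "Charlie1", "Fred", "Adam", "Charlie2"))

# "Fred" is deliberately left out of the mapping
levels(f) <- list(Group1 = c("Eva", "Adam"),
                  Group2 = c("Charlie1", "Charlie2"))

print(as.character(f))
print(levels(f))
```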
We can modify this approach a little to identify the groups programmatically.
g1 <- c("Eva", "Adam")
g2 <- levels(DF$Members)[grepl("Charlie", levels(DF$Members))]
g3 <- levels(DF$Members)[!levels(DF$Members) %in% c(g1, g2)]
levels(DF$Members) <- list(Group1 = g1,
Group2 = g2,
Group3 = g3)
This is reasonably tolerable, and helps me understand the group definitions a little better than reading nested ifelse calls.
Since you brought it up, I decided it'd be nice to have a function that handles the "everything else" scenario without my intervention. I came up with the following, which allows you to name as many groups as you want, and then use Other = NULL to indicate "everything else goes into Other".
group_levels <- function(x, ...)
{
  x <- as.character(x)
  group <- list(...)
  # groups passed as NULL collect everything not matched elsewhere
  which_group_null <- vapply(group, is.null, logical(1))
  name_null <- names(group)[which_group_null]
  group <- group[!which_group_null]
  null_group <- list(unique(x[! x %in% unlist(group)]))
  null_group <- setNames(null_group, name_null)
  x <- factor(x)
  levels(x) <- c(group, null_group)
  x
}
group_levels(DF$Members,
             Group1 = c("Eva", "Adam"),
             Group2 = levels(DF$Members)[grepl("Charlie", levels(DF$Members))],
             Group3 = NULL)
If you leave out the Group3 = NULL, the unmatched levels are given NA values.
It's probably slower than using ifelse, but I like how it reads.
Maybe you mean:
group_function <- function(name_string) {
  # scalar function: expects a single name
  if (name_string == "Eva" || name_string == "Adam")
    return("Group 1")
  if (grepl("Charlie", name_string))
    return("Group 2")
  return("Group 3")
}
and then call this function on every member
DF$Team <- sapply(DF$Members, group_function)
DF
# Members Team
#1 Eva Group 1
#2 Charlie1 Group 2
#3 Fred Group 3
#4 Charlie2 Group 2
#5 Adam Group 1
#6 Eva Group 1
#7 Charlie2 Group 2
#8 David Group 3
#9 Adam Group 1
#10 David Group 3
#11 Charlie1 Group 2
I have the following problem:
names <- c("Peter", "Gabriel", "James", "Philip")
city <- c("LA", "NY","Chicago","Chicago")
number <- seq(1, length(names))
from <- c("Peter", "Peter", "Gabriel", "James", "James")
to <- c("James","Gabriel", "Philip", "Gabriel", "Philip")
nodes <- data.frame(names, city, number)
edges <- data.frame(from, to)
How do I change the values of edges$from to match those in nodes$number?
You can use the following:
edges$from <- sapply(edges$from, function(i)nodes$number[match(i, nodes$names)])
edges
# from to
#1 1 James
#2 1 Gabriel
#3 2 Philip
#4 3 Gabriel
#5 3 Philip
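Since match() is already vectorised, the sapply() wrapper is optional. A sketch of the same lookup applied to both columns at once, rebuilding the question's data so it runs on its own:

```r
names  <- c("Peter", "Gabriel", "James", "Philip")
city   <- c("LA", "NY", "Chicago", "Chicago")
number <- seq_along(names)
nodes  <- data.frame(names, city, number)
edges  <- data.frame(from = c("Peter", "Peter", "Gabriel", "James", "James"),
                     to   = c("James", "Gabriel", "Philip", "Gabriel", "Philip"))

# match() returns the row position of each name in nodes$names;
# indexing nodes$number with it translates names to numbers.
edges$from <- nodes$number[match(edges$from, nodes$names)]
edges$to   <- nodes$number[match(edges$to,   nodes$names)]
print(edges)
```

Names that do not occur in nodes$names come back as NA, which makes unmatched edges easy to spot.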
I am trying to create a new variable in R that gives a unique (ordered) numeric value to each observation based on the duplicate values in another variable. I have put below what the data looks like and what I would like it to look like. Can anyone help?
name <- c("Alex", "Alex", "Alex", "Bill", "Bill", "Cathy")
purchase <- c("hat", "bag", "book", "bag", "book", "book")
individual_purchase_No <- c(1, 2, 3, 1, 2, 1)
What the data looks like:
purchase.data <- data.frame(name, purchase)
What I want the data to look like:
purchase_order.data <- data.frame(name, purchase, individual_purchase_No)
You can do this with dplyr:
library(dplyr)
purchase.data %>% group_by(name) %>%
mutate(individual_purchase_No = 1:n())
## Source: local data frame [6 x 3]
## Groups: name [3]
##
## name purchase individual_purchase_No
## (fctr) (fctr) (int)
## 1 Alex hat 1
## 2 Alex bag 2
## 3 Alex book 3
## 4 Bill bag 1
## 5 Bill book 2
## 6 Cathy book 1
A base R solution is for instance:
purchase.data$individual_purchase_No <- sequence(table(purchase.data$name))
table counts the number of appearances of each name, and sequence then creates, for each count n, the sequence 1:n.
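One caveat with sequence(table(...)): it emits the counters in sorted name order, so it only lines up with the rows when the data is already sorted by name. A base R sketch using ave(), which numbers rows within each name regardless of row order (the shuffled example data here is made up for illustration):

```r
# Rows deliberately not sorted by name
purchase.data <- data.frame(
  name     = c("Bill", "Alex", "Alex", "Cathy", "Bill", "Alex"),
  purchase = c("bag", "hat", "bag", "book", "book", "book")
)

# Within each name, replace the row indices with 1, 2, 3, ...
purchase.data$individual_purchase_No <- ave(
  seq_along(purchase.data$name),
  purchase.data$name,
  FUN = seq_along
)
print(purchase.data)
```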