I have the following data:
DF <- data.frame(Members = c("Eva", "Charlie1", "Fred", "Charlie2", "Adam", "Eva", "Charlie2", "David", "Adam", "David", "Charlie1"))
I would like to create a function that returns a specific value if the member's name meets certain criteria:
Return "Group1" if the member name equals Eva or Adam
Return "Group2" if the member name contains the string "Charlie"
Return "Group3" if the member name matches neither of the first two rules
I'd like to store "Group1", "Group2", "Group3" in a new column in DF called "Team".
I've accomplished it with the following code, but I'm interested in how to accomplish it with a function:
DF$Team <- with(DF, ifelse((DF$Members=="Eva"|DF$Members=="Adam"),"Group1",
ifelse((grepl("Charlie", DF$Members)),"Group2","Group3")))
Do you mean to create a function? Sort of like this:
DF <- data.frame(Members = c("Eva", "Charlie1", "Fred", "Charlie2", "Adam", "Eva", "Charlie2", "David", "Adam", "David", "Charlie1"))
# Take the vector of names as the argument, so the function does not rely on a global DF
get_group <- function(members) {
  ifelse(members == "Eva" | members == "Adam", "Group1",
         ifelse(grepl("Charlie", members), "Group2", "Group3"))
}
DF$Group <- get_group(DF$Members)
In my own experience, the most challenging part of dealing with matters like this has been the "everything else" bucket. I usually have a good sense of what I want elsewhere.
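The conventional approach is to use ifelse. This is generally efficient, but I find it difficult to read. My preferred approach is to reassign the factor levels with a named list. Members has to be a factor for this to work; since R 4.0.0, data.frame() no longer converts strings to factors by default, so convert it first:
DF$Members <- factor(DF$Members)
levels(DF$Members) <- list(Group1 = c("Eva", "Adam"),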
Group2 = c("Charlie1", "Charlie2"),
Group3 = c("David", "Fred"))
The problem with this approach is I have to explicitly name all of the values that map to each group. That doesn't help resolve the "everything else" issue.
We can modify this approach a little to identify the groups programmatically.
g1 <- c("Eva", "Adam")
g2 <- levels(DF$Members)[grepl("Charlie", levels(DF$Members))]
g3 <- levels(DF$Members)[!levels(DF$Members) %in% c(g1, g2)]
levels(DF$Members) <- list(Group1 = g1,
Group2 = g2,
Group3 = g3)
This is reasonably tolerable, and helps me understand the group definitions a little better than reading nested ifelse calls.
Since you brought it up, I decided it'd be nice to have a function that handles the "everything else" scenario without my intervention. I came up with the following, which allows you to name as many groups as you want, and then use Other = NULL to indicate "everything else goes into Other".
group_levels <- function(x, ...)
{
x <- as.character(x)
group <- list(...)
which_group_null <- vapply(group, is.null, logical(1))
name_null <- names(group)[which_group_null]
group <- group[!which_group_null]
null_group <- list(unique(x[! x %in% unlist(group)]))
null_group <- setNames(null_group, name_null)
x <- factor(x)
levels(x) <- c(group, null_group)
x
}
group_levels(DF$Members,
Group1 = c("Eva", "Adam"),
Group2 = levels(DF$Members)[grepl("Charlie", levels(DF$Members))],
Group3 = NULL)
If you leave out the Group3 = NULL, the unmatched levels are given NA values.
It's probably slower than using ifelse, but I like how it reads.
Maybe you mean:
group_function <- function(name_string) {
if (name_string == "Eva" || name_string == "Adam")
return("Group 1")
if (grepl("Charlie", name_string))
return("Group 2")
return("Group 3")
}
and then call this function on every member
DF$Team <- sapply(DF$Members, group_function)
DF
# Members Team
#1 Eva Group 1
#2 Charlie1 Group 2
#3 Fred Group 3
#4 Charlie2 Group 2
#5 Adam Group 1
#6 Eva Group 1
#7 Charlie2 Group 2
#8 David Group 3
#9 Adam Group 1
#10 David Group 3
#11 Charlie1 Group 2
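Since the rules are simple, the same grouping can probably also be done without sapply, by starting from the "everything else" value and overwriting the rows that match a rule. This is only a sketch: the name assign_team is made up for illustration, and it uses the "Group1" style labels from the question rather than the "Group 1" labels above.

```r
# assign_team is a hypothetical helper, not taken from the answers above
assign_team <- function(members) {
  members <- as.character(members)             # also accepts a factor column
  team <- rep("Group3", length(members))       # default: "everything else"
  team[grepl("Charlie", members)] <- "Group2"  # names containing "Charlie"
  team[members %in% c("Eva", "Adam")] <- "Group1"
  team
}

DF <- data.frame(Members = c("Eva", "Charlie1", "Fred", "Charlie2", "Adam"))
DF$Team <- assign_team(DF$Members)
DF
```

Because every step works on the whole vector, there is no need to call the function once per row.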
I have two vectors:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
I need to show the slowest person. I did it with if and else if statements, but I wonder if there is an easier way, such as automatically assigning "Slow" = 1, "Average" = 2 and so on. In other words, attaching values to the labels.
At the end it should be a vector like
names_speeds <- c(names_of_p, speeds)
so that I can compare persons and find out who is faster.
You could turn speeds into an ordered factor, which would preserve the labeling while also creating an underlying numerical representation:
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- c("Slow", "Fast", "Average", "Slow")
speeds <- factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
names_of_p[order(speeds)]
[1] "John" "Robert" "James" "Adam"
names_of_p[as.numeric(speeds) < 3]
[1] "John" "James" "Robert"
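Because the factor is ordered, the comparison can also be written against the labels themselves instead of the underlying integers. A minimal sketch, reusing the vectors from the question; the variable names slowest and slower_than_fast are only for illustration:

```r
names_of_p <- c("John", "Adam", "James", "Robert")
speeds <- factor(c("Slow", "Fast", "Average", "Slow"),
                 levels = c("Slow", "Average", "Fast"), ordered = TRUE)

# min() respects the level order of an ordered factor
slowest <- names_of_p[speeds == min(speeds)]

# an ordered factor can be compared directly with one of its labels
slower_than_fast <- names_of_p[speeds < "Fast"]

slowest
slower_than_fast
```

This keeps the code readable if the number of levels changes later, since no level position such as 3 is hard-coded.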
It might also be a good idea to store the data in a data frame rather than in separate vectors:
library(tidyverse)
df <- data.frame(
names_of_p = names_of_p,
speeds = factor(speeds, levels = c('Slow', 'Average', 'Fast'), ordered = TRUE)
)
df %>%
arrange(speeds)
names_of_p speeds
<chr> <ord>
1 John Slow
2 Robert Slow
3 James Average
4 Adam Fast
df %>%
filter(as.numeric(speeds) < 3)
names_of_p speeds
<chr> <ord>
1 John Slow
2 James Average
3 Robert Slow
First assign the names to the vector speeds, which gives you a named vector.
After that you can use which:
names(speeds) <- names_of_p
which(speeds=="Slow")
John Robert
1 4
This is a sports analysis question: how many times was a player on the court?
I have a list of players I am interested in
names <- c('John','Bill','Peter')
and a list of actions during multiple matches
team <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team,player1,player2)
I want to build a column that counts the actions during which each player was on the court, something like this (pseudocode):
actions_when_player_on_court <- df %>% group_by(team) %>%
calculate({nb of observations where the player is either player1 or player2} )
so I end up with a new vector like
actions_when_player_on_court <- c(4,3,1)
so I can create a new data frame like this
new_df <- data.frame(names,actions_when_player_on_court)
where John appears 4 times on the court, Bill three times, and Peter once
I feel I may need to intersect the names and c(player1,player2), especially since
names are unique: John, Bill and Peter cannot belong to other teams and are unique in df.
I may have 0 to n players on the field, so 0 to n columns (player1, player2, ..., playern).
The following code should do what you need.
We first need to create a new data frame to store all names and an empty actions_when_player_on_court variable.
names = c()
for (i in 2:ncol(df)) {
names = c(names, unique(df[,i]))
}
names = data.frame(name = unique(names), actions_when_player_on_court = 0)
Then, we can fill the actions_when_player_on_court variable using a for loop:
df$n = 1
for (i in 2:(ncol(df)-1)) {
tmp = aggregate(cbind(n = n) ~ df[, i], data = df[, c(i, ncol(df))], FUN="sum")
names(tmp)[1] = "name"
names = merge(names, tmp, all=T)
names[is.na(names)] = 0
names$actions_when_player_on_court = names$actions_when_player_on_court + names$n
names = names[-ncol(names)]
}
You can have as many players as you want, as long as they start with the second column and run until the end of the data frame. Note that the resulting data frame does not include the team variable. I think you can deal with that yourself. Here is the result:
> names
name actions_when_player_on_court
1 Bill 3
2 Bob 1
3 John 4
4 Mike 5
5 Peter 1
6 Steve 2
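If only the players of interest are needed, the loop can probably be replaced by stacking the player columns and counting with table(). This is a sketch under the assumption that every column except team holds player names; players_of_interest, on_court and result are names made up for this example.

```r
team    <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team, player1, player2, stringsAsFactors = FALSE)

players_of_interest <- c("John", "Bill", "Peter")

# stack every player column (everything except team) into one vector
on_court <- unlist(df[setdiff(names(df), "team")], use.names = FALSE)

# fixing the levels keeps the requested order and gives 0 for absent players
counts <- table(factor(on_court, levels = players_of_interest))

result <- data.frame(names = players_of_interest,
                     actions_when_player_on_court = as.integer(counts))
result
```

Adding player3, player4 and so on to df should not require any change, because the columns are selected by exclusion.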
I have a df below
df <- data.frame(LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"))
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
Some first names are not consistent. I want to find and replace them, but when I put my code in a function, it doesn't work. Also, my data is big: there are hundreds of names, so is there a better way to do it?
My code works well on its own (not in a function), but I have not found a way to do it when I have 100 names to find and replace. I found a reference here, but it does not resolve my problem. Any suggestions would be appreciated.
fil_name <- function(last,first,alternative){
df %>%
mutate(FIRSTNAME = ifelse(LASTNAME == "last" & FIRSTNAME == "first", "alternative", FIRSTNAME))
}
fil_name(Robinson,Dave,David)
Expected output:
LASTNAME FIRSTNAME
1 Robinson David
2 Anderson Adien
3 Beckham Joana
4 Wickham Kesley
5 Carlos Anberto
6 Robinson David
7 Beckham Joana
8 Anderson Adien
9 Carlos Anberto
We can convert the unquoted arguments to character strings inside the function, and it should work:
library(dplyr)
fil_name <- function(df, last, first, alternative){
  # turn the bare names (e.g. Robinson) into strings
  last <- rlang::as_string(rlang::ensym(last))
  first <- rlang::as_string(rlang::ensym(first))
  alternative <- rlang::as_string(rlang::ensym(alternative))
  df %>%
    dplyr::mutate(FIRSTNAME = dplyr::case_when(LASTNAME == last &
      FIRSTNAME == first ~ alternative, TRUE ~ FIRSTNAME))
}
fil_name(df, Robinson, Dave, David)
Another approach is to create a separate data frame including the FIRSTNAME alternative name pairings, merge it into the original data, and update FIRSTNAME for those rows where ALTNAME is not NA.
This allows one to update the data with a vectorized process, rather than changing the names one by one.
# create data frame with a column to maintain original sort order
df <- data.frame(obs = 1:9,
LASTNAME = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos", "Robinson", "Beckham", "Anderson", "Carlos"),
FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto", "Dave", "Joana", "Adien", "An"),
stringsAsFactors = FALSE)
# create firstname / altname pairs
altnames <- data.frame(FIRSTNAME = c("Dave","Adi","Joan","An"),
ALTNAME = c("David","Adien","Joana","Anberto"),
stringsAsFactors = FALSE)
# merge by firstname, keeping all rows from original data frame
combined <- merge(df,altnames,by="FIRSTNAME",all.x=TRUE)
# update rows where ALTNAME is not NA
combined[!is.na(combined$ALTNAME),"FIRSTNAME"] <- combined[!is.na(combined$ALTNAME),"ALTNAME"]
# print the result, ordered by sequence in original data frame
combined[order(combined$obs),c("LASTNAME","FIRSTNAME")]
...and the output:
> combined[order(combined$obs),c("LASTNAME","FIRSTNAME")]
LASTNAME FIRSTNAME
6 Robinson David
1 Anderson Adien
7 Beckham Joana
9 Wickham Kesley
4 Carlos Anberto
5 Robinson David
8 Beckham Joana
2 Anderson Adien
3 Carlos Anberto
>
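The same pairing idea can also be sketched with a named character vector as the lookup table, which avoids the merge and the re-sorting. This is an illustration only; lookup and needs_fix are hypothetical names, and like the merge above it matches on the first name alone.

```r
df <- data.frame(
  LASTNAME  = c("Robinson", "Anderson", "Beckham", "Wickham", "Carlos",
                "Robinson", "Beckham", "Anderson", "Carlos"),
  FIRSTNAME = c("David", "Adi", "Joan", "Kesley", "Anberto",
                "Dave", "Joana", "Adien", "An"),
  stringsAsFactors = FALSE)

# names are the spellings to replace, values are the preferred spellings
lookup <- c(Dave = "David", Adi = "Adien", Joan = "Joana", An = "Anberto")

# only touch rows whose first name has an entry in the lookup
needs_fix <- df$FIRSTNAME %in% names(lookup)
df$FIRSTNAME[needs_fix] <- unname(lookup[df$FIRSTNAME[needs_fix]])
df
```

With hundreds of names, the lookup vector could be built from a two-column file instead of being typed by hand.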
I'm trying to create a custom function that generates new binary variables in an existing data frame. The idea is to be able to feed the function with the diagnosis description (string), ICD9 diagnosis code (number), and patient database. The function would then generate new variables for all the diagnoses of interest and assign a 0 or 1 depending on whether the patient (row or observation) has the diagnosis.
Below are the function variables:
x<-c("2851") #ICD9 for Anemia
y<-c("diag_1") #Primary diagnosis
z<-"Anemia" #Name of new binary variable for patient dataframe
i<-patient_db #patient dataframe
patient<-c("a","b","c")
diag_1<-c("8661", "2851","8651")
diag_2<-c("8651","8674","2866")
diag_3<-c("2430","3456","9089")
patient_db<-data_frame(patient,diag_1,diag_2,diag_3)
patient diag_1 diag_2 diag_3
1 a 8661 8651 2430
2 b 2851 8674 3456
3 c 8651 2866 9089
Below is the function:
diagnosis_func<-function(x,y,z,i){
pattern = paste("^(", paste0(x, collapse = "|"), ")", sep = "")
i$z<-ifelse(rowSums(sapply(i[y], grepl, pattern = pattern)) != 0,"1","0")
}
This is what I would like to get at after running the function:
patient diag_1 diag_2 diag_3 Anemia
1 a 8661 8651 2430 0
2 b 2851 8674 3456 1
3 c 8651 2866 9089 0
The lines within the function have been tested outside the function and are working. Where I'm stuck is trying to get the function working. Any help would be greatly appreciated.
Happy New Year
Albit
If you are intending to only work with one diagnosis at a time, this will work. I took the liberty of renaming arguments to be a little easier to work with in the code.
diagnosis_func <- function(data, target_col, icd, new_col){
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
data[[new_col]] <- grepl(pattern = pattern,
x = data[[target_col]]) + 0L
data
}
diagnosis_func(patient_db, "diag_1", "2851", "Anemia")
# Multiple codes for a single diagnosis
diagnosis_func(patient_db, "diag_1", c("8661", "8651"), "Dx")
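The question's own attempt searched several diagnosis columns at once, so here is a hedged variant of the function above that accepts more than one target column. The name flag_diagnosis_any and the argument target_cols are assumptions made for this sketch, and patient_db is rebuilt with data.frame() so the example runs on its own.

```r
patient_db <- data.frame(patient = c("a", "b", "c"),
                         diag_1  = c("8661", "2851", "8651"),
                         diag_2  = c("8651", "8674", "2866"),
                         diag_3  = c("2430", "3456", "9089"),
                         stringsAsFactors = FALSE)

# flag_diagnosis_any is a hypothetical name for this illustration
flag_diagnosis_any <- function(data, target_cols, icd, new_col) {
  pattern <- sprintf("^(%s)", paste0(icd, collapse = "|"))
  # one logical vector per column: does the code start with any ICD value?
  hits <- lapply(data[target_cols], function(col) grepl(pattern, col))
  # a row is flagged when at least one of its columns matches
  data[[new_col]] <- as.integer(Reduce(`|`, hits))
  data
}

out <- flag_diagnosis_any(patient_db, c("diag_1", "diag_2"), "8651", "Dx")
out
```

Calling it with a single column name should behave like the one-column version.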
If you want to spruce it up a little to prevent inadvertent mistakes, you can install the checkmate package and use this version, which checks the arguments before doing any work.
diagnosis_func <- function(data, target_col, icd, new_col){
coll <- checkmate::makeAssertCollection()
checkmate::assert_class(x = data,
classes = "data.frame",
add = coll)
checkmate::assert_character(x = target_col,
len = 1,
add = coll)
checkmate::assert_character(x = icd,
add = coll)
checkmate::assert_character(x = new_col,
len = 1,
add = coll)
checkmate::reportAssertions(coll)
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
data[[new_col]] <- grepl(pattern = pattern,
x = data[[target_col]]) + 0L
data
}
diagnosis_func(patient_db, "diag_1", "2851", "Anemia")
I'd like to create a grouping variable based on how similar a selection of names is. I have started by using the stringdist package to generate a measure of distance, but I'm not sure how to use that output to generate a grouping variable. I've looked at hclust, but it seems that to use clustering functions you need to know how many groups you want in the end, and I do not know that. The code I start with is below:
name_list <- c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_dist <- stringdistmatrix(name_list)
name_dist
name_dist2 <- stringdistmatrix(name_list, method="soundex")
name_dist2
I would like to see a dataframe with two columns that look like
name = c("Mary", "Mery", "Mary", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
name_group = c(1, 1, 1, 2, 2, 2, 3, 3, 4)
The groups might be slightly different depending obviously on what distance measure I use (I've suggested two above) but I would probably choose one or the other to run.
Basically, how do I get from the distance matrix to a group variable without knowing the number of clusters I'd like?
You can also use adist(...) in base R to calculate the Levenshtein distances, and cluster based on that.
n<- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)
rownames(d) <- n
cl <- hclust(as.dist(d))
plot(cl)
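To get from this tree to a group variable without fixing the number of clusters, one option is to cut it at a height with cutree(). The sketch below assumes a threshold of 2.5 edits, which is an arbitrary choice for this example and would need tuning on real data.

```r
n <- c("Mary", "Mery", "Mari", "Joe", "Jo", "Joey", "Bob", "Beb", "Paul")
d <- adist(n)               # Levenshtein distances (utils package)
rownames(d) <- n
cl <- hclust(as.dist(d))    # complete linkage by default

# cut by height instead of by number of groups; 2.5 is an assumed threshold
groups <- cutree(cl, h = 2.5)

data.frame(name = n, name_group = unname(groups))
```

With these nine names and this threshold, the cut should give four groups: the Mary variants, the Joe variants, Bob and Beb, and Paul on its own.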
You could use a cluster analysis like this:
# loading the package
require(stringdist);
# Group selection by class numbers or height
num.class <- 5;
num.height <-0.5;
# define names
n <- c("Mary", "Mery", "Mari", "Joe",
"Jo", "Joey", "Bob", "Beb", "Paul");
# calculate distances
d <- stringdistmatrix(n, method="soundex");
# cluster the stuff
h <- hclust(d);
# cut the cluster by num classes
m <- cutree(h, k = num.class);
# cut the cluster by height
p <- cutree(h, h = num.height);
# build the resulting frame
df <- data.frame(names = n,
group.class = m,
group.prob = p);
It produces:
df;
names group.class group.prob
1 Mary 1 1
2 Mery 1 1
3 Mari 1 1
4 Joe 2 2
5 Jo 2 2
6 Joey 2 2
7 Bob 3 3
8 Beb 4 3
9 Paul 5 4
And the chart gives you an overview:
plot(h, labels=n);
Regards huck