Matching strings to values in a different data frame - r

Consider this data frame, containing multiple entries for a person named Steve/Stephan Jones and a person named Steve/Steven Smith (as well as Jane Jones and Matt/Matthew Smith):
df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
What I'd like is to match values of First to the appropriate value of Name in this data frame.
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
N1 = c(rep("Steve", 3), "Matt"))
To yield this target
target <- data.frame(First = c("Stephan", "Stephan", "Stephan", "Jane", "Steven", "Steven", "Matthew"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
The issue is that there are multiple values of Name corresponding to an N1 (or First) value of "Steve", so I need to check within each group based on df$Last to see which version of Steven/Stephan/Stephen is correct.
Using something like this
library(dplyr)
library(stringr)
df %>%
group_by(Last) %>%
mutate(First = First[which.max(str_length(First))])
won't work, because the value "Jane" in row 4 would be converted to "Stephan".

I'm not sure if this solves your problem and is consistent with your desired output:
library(dplyr)
df %>%
mutate(id = row_number()) %>%
left_join(nicknames, by=c("First" = "N1")) %>%
mutate(real_name = coalesce(Name, First)) %>%
group_by(Last, real_name) %>%
mutate(id = n()) %>%
group_by(Last, First) %>%
filter(id==max(id)) %>%
select(-Name, -id)
returns
# A tibble: 7 x 3
# Groups: Last, First [6]
First Last real_name
<chr> <chr> <chr>
1 Steve Jones Stephan
2 Stephan Jones Stephan
3 Steve Jones Stephan
4 Jane Jones Jane
5 Steve Smith Steven
6 Steven Smith Steven
7 Matt Smith Matthew
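If the goal is the target data frame itself (First replaced in place), the same idea can be written without a join. This is only a minimal base R sketch. It assumes an ambiguous nickname should resolve to whichever full name is spelled out within the same Last group, and stay unchanged when none is. resolve_first is a hypothetical helper, not part of any package:

```r
df <- data.frame(
  First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
  Last  = c(rep("Jones", 4), rep("Smith", 3)),
  stringsAsFactors = FALSE
)
nicknames <- data.frame(
  Name = c("Stephan", "Steven", "Stephen", "Matthew"),
  N1   = c(rep("Steve", 3), "Matt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the full name for one (first, last) pair
resolve_first <- function(first, last, df, nicknames) {
  candidates <- nicknames$Name[nicknames$N1 == first]
  if (length(candidates) == 0) return(first)       # not a known nickname
  if (length(candidates) == 1) return(candidates)  # unambiguous nickname
  # Ambiguous: prefer a candidate that appears in the same surname group
  present <- candidates[candidates %in% df$First[df$Last == last]]
  if (length(present) >= 1) present[1] else first
}

result <- df
result$First <- mapply(resolve_first, df$First, df$Last,
                       MoreArgs = list(df = df, nicknames = nicknames),
                       USE.NAMES = FALSE)
result
```

On the example data this reproduces the First column of target. If two different full names for the same nickname share a surname, it simply takes the first match, which may not be what you want.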

Related

Count multi-response answers against a vector in R

I have a multi-response question from a survey.
The data look like this:
| respondent | friend           |
|------------|------------------|
| 001        | John, Mary       |
| 002        | Sue, John, Peter |
Then, I want to count, for each respondent, how many male and female friends they have.
I imagine I need to create separate vectors of male and female names, then check each cell in the friend column against these vectors and count.
Any help is appreciated.
This should be heavily caveated, because many common names are used by more than one gender. Here I use the sex recorded in American Social Security data, from the babynames package, as a proxy. I then merge that with my data and compute a weighted count based on likelihood. In the dataset, fairly common names including Casey, Riley, Jessie, Jackie, Peyton, Jaime, Kerry, and Quinn are almost evenly split between genders, so in my approach each of those adds about half a female friend and half a male friend, which seems to me the most sensible approach when the name alone doesn't add much information about gender.
library(tidyverse) # using dplyr, tidyr
gender_freq <- babynames::babynames %>%
filter(year >= 1930) %>% # limiting to people <= 92 y.o.
count(name, sex, wt = n) %>%
group_by(name) %>%
mutate(share = n / sum(n)) %>%
ungroup()
tribble(
~respondent, ~friend,
"001", "John, Mary, Riley",
"002", "Sue, John, Peter") %>%
separate_rows(friend, sep = ", ") %>%
left_join(gender_freq, by = c("friend" = "name")) %>%
count(respondent, sex, wt = share)
## A tibble: 4 x 3
# respondent sex n
# <chr> <chr> <dbl>
#1 001 F 1.53
#2 001 M 1.47
#3 002 F 1.00
#4 002 M 2.00
Assuming you have a list that links a name with gender, you can split up your friend column, merge the result with your list and summarise on the gender:
library(tidyverse)
df <- tibble(
respondent = c('001', '002'),
friend = c('John, Mary', 'Sue, John, Peter')
)
names_df <- tibble(
name = c('John', 'Mary', 'Sue','Peter'),
gender = c('M', 'F', 'F', 'M')
)
df %>%
mutate(friend = strsplit(as.character(friend), ", ")) %>%
unnest(friend) %>%
left_join(names_df, by = c('friend' = 'name')) %>%
group_by(respondent) %>%
summarise(male_friends = sum(gender == 'M'),
female_friends = sum(gender == 'F'))
resulting in
# A tibble: 2 x 3
respondent male_friends female_friends
* <chr> <int> <int>
1 001 1 1
2 002 2 1
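One caveat with the approach above: a friend who is missing from the lookup table joins to NA, and sum(gender == 'M') then returns NA for that respondent. Here is a base R sketch that counts unmatched names separately. The named vector gender_lookup and the friend "Alexis" (deliberately absent from it) are made up for illustration:

```r
survey <- data.frame(
  respondent = c("001", "002"),
  friend     = c("John, Mary", "Sue, John, Alexis"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: names are first names, values are gender codes
gender_lookup <- c(John = "M", Mary = "F", Sue = "F", Peter = "M")

# One character vector of friends per respondent
friends <- strsplit(survey$friend, ",\\s*")

# Count friends of gender g, ignoring names that are not in the lookup
count_gender <- function(f, g) sum(gender_lookup[f] == g, na.rm = TRUE)

counts <- data.frame(
  respondent     = survey$respondent,
  male_friends   = vapply(friends, count_gender, integer(1), g = "M"),
  female_friends = vapply(friends, count_gender, integer(1), g = "F"),
  unknown        = vapply(friends, function(f) sum(is.na(gender_lookup[f])), integer(1)),
  stringsAsFactors = FALSE
)
counts
```

Reporting the unknown column makes it obvious when the lookup list needs extending, instead of silently dropping or NA-ing those respondents.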

R function for finding similar names?

I'm working with a big dataset of names and need to be able to group by the individual. It's possible that in the dataset there are names that appear different but are the same person, such as John Doe or John A. Doe, or Michael Smith and Mike Smith. Is there a way for R to find instances like these and recognize them as the same person?
df <- data.frame(
name = c("John Doe", "John A. Doe", "Jane Smith", "Jane Anderson", "Jane Anderson Lowell",
"Jane B. Smith", "John Doe", "Jane Smith", "Michael Smith",
"Mike Smith", "A.K. Ross", "Ana Kristina Ross"),
rating = c(1,2,1,1,2,3,1,4,2,1,3,2)
)
Here, there are multiple repeated individuals, whether the variant be a middle initial, a shortened name, a lengthened name, or someone whose last name changed. I've been trying to find a function that could give a similarity percentage of characters in name matches, and from there I could manually examine cases of high percentage to evaluate if they are indeed the same person. My end goal is to find the average rating by person, where I would need to sort by the individual.
There are many algorithms that measure string distance. Here is a simple approach for this example dataset using the stringdist package. As suggested by the documentation of the stringdist() function, Jaro-Winkler distance is used to measure the distance between each pair of names. Note that I only paired names that share the same first two letters. From eyeballing the results, a string distance of 0.15 seems to be a reasonable threshold to define a match.
library(tidyverse)
library(stringdist)
get_string_distance <- function(x) {
x <- unique(x) # deduplicate first so combn() always receives at least two names
if (length(x) < 2) {
data.frame(name1 = x, name2 = x, string_distance = NA_real_)
} else {
x %>%
combn(2) %>%
t() %>%
as.data.frame() %>%
setNames(c("name1", "name2")) %>%
mutate(string_distance = stringdist(name1, name2, method = "jw"))
}
}
dat <- df %>%
mutate(two_letters = str_sub(name, 1, 2)) %>%
nest_by(two_letters) %>%
mutate(same_name = list(get_string_distance(data$name))) %>%
ungroup()
dat1 <- dat %>%
unnest(same_name) %>%
filter(string_distance < 0.15) %>%
select(name1, name2, string_distance)
dat1
# # A tibble: 4 x 3
# name1 name2 string_distance
# <chr> <chr> <dbl>
# 1 Jane Smith Jane B. Smith 0.0769
# 2 Jane Anderson Jane Anderson Lowell 0.117
# 3 John Doe John A. Doe 0.0909
# 4 Michael Smith Mike Smith 0.136
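Since the question asks for a similarity percentage, here is a rough base R alternative that needs no extra package. It uses adist() from utils (Levenshtein edit distance) rather than Jaro-Winkler, and scales by the length of the longer name. That normalisation is my own assumption, not a standard definition:

```r
person <- c("John Doe", "John A. Doe", "Michael Smith", "Mike Smith")

# Pairwise Levenshtein edit distances (insertions, deletions, substitutions)
edit_dist <- adist(person)

# Length of the longer name in every pair, used to scale distances to [0, 1]
longer <- outer(nchar(person), nchar(person), pmax)

similarity <- 1 - edit_dist / longer
dimnames(similarity) <- list(person, person)

# List each unordered pair once, most similar first
idx <- which(upper.tri(similarity), arr.ind = TRUE)
pairs <- data.frame(
  name1      = person[idx[, "row"]],
  name2      = person[idx[, "col"]],
  similarity = similarity[idx],
  stringsAsFactors = FALSE
)
pairs <- pairs[order(-pairs$similarity), ]
pairs
```

The sorted pairs table can then be reviewed by hand from the top down. Comparing every name with every other name grows quadratically, so for a big dataset some blocking step (like the first-two-letters grouping above) is still worth keeping.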

Manipulating variables to produce a new dataset in R

I'm a relatively new R user. I would really appreciate any help with my dataset please.
I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.
Some patients appear in the dataset more than once (ie. they have picked up medications from different pharmacies at different time points).
The data frame looks like this:
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.
For example: for Tom his most frequent pharmacy is Pharmacy B because he has picked up 13 medications from there (5+8 meds). The dataset I would like to generate:
data.frame(name = c("Tom", "Rob", "Amy"),
pharmacy = c("B", "B", "C"),
meds = c(13, 2, 2))
Can someone please help me write code to do this?
I have tried various packages and functions in R, such as dplyr, tidyr, and aggregate(), with no success. Any help would be genuinely appreciated.
Thank you very much
Alex
Your question is not reproducible. But here is one solution:
# create reproducible example of data
dataset1 <- data.frame(
name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
meds_count = c(3, 2, 5, 8, 2))
library(dplyr) #load dplyr
dataset2 <- dataset1 %>% group_by(name, pharmacy) %>% # group by your grouping variables
summarise(meds_count = sum(meds_count)) %>% # sum no. of meds by your grouping variables
top_n(1, meds_count) %>% # filter for only the top 1 count
ungroup()
Resulting dataframe:
> dataset2
# A tibble: 3 x 3
name pharmacy meds_count
<fct> <fct> <dbl>
1 Amy pharmacy_C 2.00
2 Rob pharmacy_B 2.00
3 Tom pharmacy_B 13.0
If I understood you correctly, I think you're looking for something like this.
require(tidyverse)
#Sample data. I copied yours.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
Edit. I changed the group_by(), summarise() and added filter.
df %>%
group_by(name, pharmacy) %>%
summarise(SumMeds = sum(meds, na.rm = TRUE)) %>%
filter(SumMeds == max(SumMeds))
Results:
name pharmacy SumMeds
<fct> <fct> <dbl>
1 Amy C 2.
2 Rob B 2.
3 Tom B 13.
Generating your dataset:
patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)
df is your dataframe
library(dplyr)
df = df %>% group_by(patient,pharmacy) %>%
summarize(meds =sum(meds)) %>%
group_by(patient) %>%
filter(meds == max(meds))
Take your df, group by patient and pharmacy
calculate total medicines bought by each patient from different pharmacies by taking the sum of medicines.
Then group_by patient
Finally filter for max.
Print the dataframe
print(df)
You can do it in base R with aggregate twice followed by merge.
It seems to me a bit complicated to have to use aggregate twice. Maybe dplyr solutions run more quickly, especially with a dataset with 24 million rows.
agg <- aggregate(meds ~ name + pharmacy, df, FUN = function(x) sum(x))
agg2 <- aggregate(meds ~ name, agg, function(x) x[which.max(x)])
merge(agg, agg2)[c(1, 3, 2)]
# name pharmacy meds
#1 Amy C 2
#2 Rob B 2
#3 Tom B 13
Data.
This is the dataset in the question after the edit.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
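The answers above keep every tied pharmacy, so a patient whose top total is shared by two pharmacies would get two rows. If exactly one row per patient is required, one base R sketch is to sort by total and keep the first row per name. The extra Amy/D visit below is made up to create a tie, and which tied pharmacy wins (here the alphabetically first) is an arbitrary choice:

```r
visits <- data.frame(
  name     = c("Tom", "Rob", "Tom", "Tom", "Amy", "Amy"),
  pharmacy = c("A", "B", "B", "B", "C", "D"),
  meds     = c(3, 2, 5, 8, 2, 2),
  stringsAsFactors = FALSE
)

# Total medications per patient / pharmacy pair
totals <- aggregate(meds ~ name + pharmacy, data = visits, FUN = sum)

# Sort so each patient's largest total comes first (ties broken by pharmacy name)
totals <- totals[order(totals$name, -totals$meds, totals$pharmacy), ]

# Keep only the first row per patient
top_pharmacy <- totals[!duplicated(totals$name), ]
rownames(top_pharmacy) <- NULL
top_pharmacy
```

If a different tie-break makes more sense for your data (most recent visit, most visits), add that column to the order() call instead.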
Assuming the following dataset:
df <- tribble(
~patient, ~pharmacy, ~medication,
"Tom", "Pharmacy A", "3 meds",
"Rob", "Pharmacy B", "2 meds",
"Tom", "Pharmacy B", "5 meds",
"Tom", "Pharmacy B", "8 meds",
"Amy", "Pharmacy C", "2 meds"
)
A tidyverse-friendly option could be:
df %>%
mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>% # 1 ("+" so multi-digit counts are captured)
group_by(patient, pharmacy) %>% # 2
mutate(med_sum = sum(med_n)) %>% # 3
group_by(patient) %>% # 4
filter(med_sum == max(med_sum)) %>% # 5
select(patient, pharmacy, med_sum) %>% # 6
distinct() # 7
1. create a numeric variable, as you can't add strings
2. among all patient / pharmacy couples
3. find the total number of medications
4. then among all patients
5. keep only pharmacies with the highest patient / pharm totals
6. discard useless variables
7. discard duplicated lines (several lines per patient / pharmacy couple)
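The first step deserves a closer look: a pattern that matches a single digit would read "13 meds" as 1, so the quantifier + matters once counts reach double digits. Here is a base R sketch of the extraction, with made-up input strings:

```r
medication <- c("3 meds", "13 meds", "no data")

# Locate the first run of digits in each string (-1 where there is none)
m <- regexpr("[0-9]+", medication)

# Fill in the matched numbers, leaving NA where no digits were found
med_n <- rep(NA_real_, length(medication))
med_n[m > 0] <- as.numeric(regmatches(medication, m))
med_n

# For comparison: without "+", only the first digit of "13" is captured
first_digit_only <- regmatches("13 meds", regexpr("[0-9]", "13 meds"))
first_digit_only
```

Strings without any number come back as NA rather than being dropped, so the result stays aligned with the original rows.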

Using a variable number of groups with do in function

I would like to understand if and how this could be achieved using the tidyverse framework.
Assume I have the following simple function:
my_fn <- function(list_char) {
data.frame(comma_separated = rep(paste0(list_char, collapse = ","),2),
second_col = "test",
stringsAsFactors = FALSE)
}
Given the below list:
list_char <- list(name = "Chris", city = "London", language = "R")
my function works fine if you run:
my_fn(list_char)
However, if we replace some of the list's elements with character vectors, we can use the dplyr::do function as follows:
list_char_mult <- list(name = c("Chris", "Mike"),
city = c("New York", "London"), language = "R")
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
tbl_df() %>%
group_by_all() %>%
do(my_fn(list(name = .$name, city = .$city, language = "R")))
The question is how to write a function that could do this for a list with a variable number of elements. For example:
my_fn_generic <- function(list_char_mult) {
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
tbl_df() %>%
group_by_all() %>%
do(my_fn(...))
}
Thanks
Regarding how to use the function with a variable number of arguments:
my_fn_generic <- function(list_char) {
expand.grid(list_char, stringsAsFactors = FALSE) %>%
tbl_df() %>%
group_by_all() %>%
do(do.call(my_fn, list(.)))
}
my_fn_generic(list_char_mult)
# A tibble: 4 x 4
# Groups: name, city, language [4]
# name city language comma_separated
# <chr> <chr> <chr> <chr>
#1 Chris London R Chris,London,R
#2 Chris New York R Chris,New York,R
#3 Mike London R Mike,London,R
#4 Mike New York R Mike,New York,R
Or use pmap:
library(tidyverse)
list_char_mult %>%
expand.grid(., stringsAsFactors = FALSE) %>%
mutate(comma_separated = purrr::pmap_chr(.l = ., .f = paste, sep=", ") )
# name city language comma_separated
#1 Chris New York R Chris, New York, R
#2 Mike New York R Mike, New York, R
#3 Chris London R Chris, London, R
#4 Mike London R Mike, London, R
If I understand your question, you could use apply without grouping:
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
mutate(comma_separated = apply(., 1, paste, collapse=","))
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
mutate(comma_separated = apply(., 1, my_fn))
name city language comma_separated
1 Chris London R Chris,London,R
2 Chris New York R Chris,New York,R
3 Mike London R Mike,London,R
4 Mike New York R Mike,New York,R
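For the narrower task of building the comma-separated column for any number of list elements, a base R sketch with do.call() also works. It passes every column of the grid to paste() at once, so nothing about the column names or their count is hard-coded. It does not call my_fn, so treat it as an alternative rather than a drop-in replacement:

```r
list_char_mult <- list(name = c("Chris", "Mike"),
                       city = c("New York", "London"),
                       language = "R")

# One row per combination of the list elements
grid <- expand.grid(list_char_mult, stringsAsFactors = FALSE)

# Paste all columns row by row; unname() avoids clashes if a column
# happened to be called "sep" or "collapse"
grid$comma_separated <- do.call(paste, c(unname(as.list(grid)), sep = ","))
grid
```

Adding or removing elements in list_char_mult changes the number of columns in grid, and the paste step adapts without any edits.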

Undirected combinations of actors in the same movie

I'm not exactly sure how to describe the operation that I'm trying to do. I have a data frame with two columns (movies and actors). I want to create from this a list of unique 2-actor combinations based on movies they are in together. Below is code that creates an example of the data frame that I have, and another data frame which is the results that I want.
start_data <- tibble::tribble(
~movie, ~actor,
"titanic", "john",
"star wars", "john",
"baby driver", "john",
"shawshank", "billy",
"titanic", "billy",
"star wars", "sarah",
"titanic", "sarah"
)
end_data <- tibble::tribble(
~movie, ~actor1, ~actor2,
"titanic", "john", "billy",
"titanic", "john", "sarah",
"titanic", "billy", "sarah",
"star wars", "john", "sarah"
)
Any help is appreciated, thanks! Bonus points if it is short++
You can use combn(..., 2) to find every two-actor combination within a movie, convert the result to a two-column tibble, and store it in a list column with summarise; to get a flat data frame, use unnest:
library(tidyverse)
start_data %>%
group_by(movie) %>%
summarise(acts = list(
if(length(actor) > 1) set_names(as_tibble(t(combn(actor, 2))), c('actor1', 'actor2'))
else tibble()
)) %>%
unnest(acts)
# A tibble: 4 x 3
# movie actor1 actor2
# <chr> <chr> <chr>
#1 star wars john sarah
#2 titanic john billy
#3 titanic john sarah
#4 titanic billy sarah
library(tidyverse)
library(stringr)
inner_join(start_data, start_data, by = "movie") %>%
filter(actor.x != actor.y) %>%
rowwise() %>%
mutate(combo = str_c(min(actor.x, actor.y), "_", max(actor.x, actor.y))) %>%
ungroup() %>%
select(movie, combo) %>%
distinct() %>%
separate(combo, c("actor1", "actor2"), sep = "_")
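Both answers boil down to pairing actors within a movie and dropping mirror-image duplicates. Here is a short base R sketch of the self-join variant, assuming actor names are unique within a movie. Keeping only rows where actor1 sorts before actor2 removes self-pairs and reversed pairs in one step:

```r
start_data <- data.frame(
  movie = c("titanic", "star wars", "baby driver", "shawshank",
            "titanic", "star wars", "titanic"),
  actor = c("john", "john", "john", "billy", "billy", "sarah", "sarah"),
  stringsAsFactors = FALSE
)

# Self-join on movie: every ordered pair of actors who share a film
pairs <- merge(start_data, start_data, by = "movie", suffixes = c("1", "2"))

# Keep each undirected pair once (and drop an actor paired with themselves)
pairs <- pairs[pairs$actor1 < pairs$actor2, ]
rownames(pairs) <- NULL
pairs
```

The pairs come out with actor1 alphabetically before actor2, so the row and column order differs from end_data even though the set of pairs is the same.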
