Undirected combinations of actors in the same movie - r

I'm not exactly sure how to describe the operation that I'm trying to do. I have a data frame with two columns (movies and actors). I want to create from this a list of unique 2-actor combinations based on movies they are in together. Below is code that creates an example of the data frame that I have, and another data frame which is the results that I want.
start_data <- tibble::tribble(
~movie, ~actor,
"titanic", "john",
"star wars", "john",
"baby driver", "john",
"shawshank", "billy",
"titanic", "billy",
"star wars", "sarah",
"titanic", "sarah"
)
end_data <- tibble::tribble(
~movie, ~actor1, ~actor2,
"titanic", "john", "billy",
"titanic", "john", "sarah",
"titanic", "billy", "sarah",
"star wars", "john", "sarah"
)
Any help is appreciated, thanks! Bonus points if it is short++

You can use combn(..., 2) to find every two-actor combination within a movie, convert each result to a two-column tibble, and store it in a list column with summarise. To get a flat data frame, use unnest:
library(tidyverse)
start_data %>%
  group_by(movie) %>%
  summarise(acts = list(
    # combn() needs at least two actors; a movie with one actor yields an empty tibble
    if (length(actor) > 1) set_names(as_tibble(t(combn(actor, 2))), c("actor1", "actor2"))
    else tibble()
  )) %>%
  unnest(acts)
# A tibble: 4 x 3
# movie actor1 actor2
# <chr> <chr> <chr>
#1 star wars john sarah
#2 titanic john billy
#3 titanic john sarah
#4 titanic billy sarah
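For comparison, here is a minimal base R sketch of the same combn() idea without the tidyverse. It assumes the example data from the question; the names by_movie and pairs are made up for illustration and do not come from the answer above.

```r
# Example data from the question, as a plain data frame
start_data <- data.frame(
  movie = c("titanic", "star wars", "baby driver", "shawshank",
            "titanic", "star wars", "titanic"),
  actor = c("john", "john", "john", "billy", "billy", "sarah", "sarah"),
  stringsAsFactors = FALSE
)

# Split the actors by movie and keep only movies with at least two actors
by_movie <- split(start_data$actor, start_data$movie)
by_movie <- by_movie[lengths(by_movie) > 1]

# Build one row per unordered pair with combn(), then stack the pieces
pairs <- do.call(rbind, lapply(names(by_movie), function(m) {
  p <- t(combn(by_movie[[m]], 2))
  data.frame(movie = m, actor1 = p[, 1], actor2 = p[, 2],
             stringsAsFactors = FALSE)
}))
print(pairs)
```

Because combn() never emits both (a, b) and (b, a), each pair shows up once per movie with no extra de-duplication step.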

Alternatively, join the data to itself on movie, then keep each unordered pair once by storing the two names in sorted order:
library(tidyverse)
library(stringr)
inner_join(start_data, start_data, by = "movie") %>%
  filter(actor.x != actor.y) %>%
  rowwise() %>%
  mutate(combo = str_c(min(actor.x, actor.y), "_", max(actor.x, actor.y))) %>%
  ungroup() %>%
  select(movie, combo) %>%
  distinct() %>%
  separate(combo, c("actor1", "actor2"), sep = "_")

Related

Is there an R function that can aggregate the count of a specific row in a categorical column?

I hope everyone is doing well. I am having a bit of a brain fart trying to aggregate in R. Lets say I have this df:
student  subject
Amber    math
Colin    math
Bob      science
Amber    math
Amber    science
And I want to get a count of the number of times the student's subject is math and add that to the data frame, so the result would look like this:
student  subject  total 'math'
Amber    math     2
Colin    math     1
Bob      science  0
Amber    math     2
Amber    science  2
Is this possible? I tried aggregate(subject["math"] ~ student, data = df, length) just to get the first part done, but I get "Error in model.frame.default(formula = subject["math"] ~ : variable lengths differ (found for 'student')".
Thank you in advance!
I think that you want something like this:
library(magrittr)
library(dplyr)
df <- data.frame(
student = c("Amber", "Colin", "Bob", "Amber", "Amber"),
subject = c("math", "math", "science", "math", "science")
)
df %>% group_by(student,subject) %>% mutate(`Total math` = n()) %>% filter(`Total math` > 0) %>% filter (subject=="math") %>% distinct -> df2
merge(x=df, y=df2, by="student", all.x = TRUE) %>% mutate(`Total math` = ifelse(!is.na(`Total math`), `Total math`,0)) %>% rename(subject="subject.x") %>% select(student, subject, `Total math`) %>% print
I've tried a different approach. Its output differs from your desired output, but does it work for you?
library(dplyr)
my_df <- data.frame("Student" = c("Amber", "Colin", "Bob", "Amber", "Amber"),
"Subject" = c("math", "math", "science", "math", "science"),
stringsAsFactors = FALSE)
my_df <- my_df %>% group_by(Student, Subject) %>% summarise("Total" = n())
library(dplyr)
df_with_count<-df%>%group_by(student,subject)%>%mutate(count=n())
found here:
https://www.tutorialspoint.com/how-to-add-a-new-column-in-an-r-data-frame-with-count-based-on-factor-column
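If the goal is to keep every row and attach each student's math count, a minimal base R sketch with ave() might look like this. The column name total_math is an assumption chosen for illustration; the data is the example from the question.

```r
df <- data.frame(
  student = c("Amber", "Colin", "Bob", "Amber", "Amber"),
  subject = c("math", "math", "science", "math", "science"),
  stringsAsFactors = FALSE
)

# Flag the math rows as 1/0, then sum the flags within each student.
# ave() returns one value per original row, so no merge is needed.
df$total_math <- ave(as.numeric(df$subject == "math"), df$student, FUN = sum)
print(df)
```

Bob gets 0 because none of his rows are math, and Amber's science row still carries her math total of 2.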

Matching strings to values in a different data frame

Consider this data frame, containing multiple entries for a person named Steve/Stephan Jones and a person named Steve/Steven Smith (as well as Jane Jones and Matt/Matthew Smith).
df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
What I'd like is to match values of First to the appropriate value of Name in this data frame.
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
N1 = c(rep("Steve", 3), "Matt"))
To yield this target
target <- data.frame(First = c("Stephan", "Stephan", "Stephan", "Jane", "Steven", "Steven", "Matthew"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
The issue is that there are multiple values of Name corresponding to an N1 (or First) value of "Steve", so I need to check within each group based on df$Last to see which version of Steven/Stephan/Stephen is correct.
Using something like this
library(dplyr)
library(stringr)
df %>%
group_by(Last) %>%
mutate(First = First[which.max(str_length(First))])
won't work because the value of "Jane" in row 4 will be converted to "Stephan".
I'm not sure whether this solves your problem and is consistent with your desired output:
library(dplyr)
df %>%
mutate(id = row_number()) %>%
left_join(nicknames, by=c("First" = "N1")) %>%
mutate(real_name = coalesce(Name, First)) %>%
group_by(Last, real_name) %>%
mutate(id = n()) %>%
group_by(Last, First) %>%
filter(id==max(id)) %>%
select(-Name, -id)
returns
# A tibble: 7 x 3
# Groups: Last, First [6]
First Last real_name
<chr> <chr> <chr>
1 Steve Jones Stephan
2 Stephan Jones Stephan
3 Steve Jones Stephan
4 Jane Jones Jane
5 Steve Smith Steven
6 Steven Smith Steven
7 Matt Smith Matthew
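To overwrite First directly, as in the target, one possible base R sketch is below. The helper resolve_first is hypothetical, not part of any package. It assumes a nickname should map to the full name already seen with the same surname, fall back to the only candidate when the nickname is unambiguous, and otherwise stay unchanged.

```r
df <- data.frame(
  First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
  Last = c(rep("Jones", 4), rep("Smith", 3)),
  stringsAsFactors = FALSE
)
nicknames <- data.frame(
  Name = c("Stephan", "Steven", "Stephen", "Matthew"),
  N1 = c(rep("Steve", 3), "Matt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: resolve each first name within its surname group
resolve_first <- function(first, last) {
  mapply(function(f, l) {
    candidates <- nicknames$Name[nicknames$N1 == f]
    if (length(candidates) == 0) return(f)           # not a known nickname
    in_family <- intersect(candidates, first[last == l])
    if (length(in_family) == 1) return(in_family)    # full name seen with same surname
    if (length(candidates) == 1) return(candidates)  # nickname has a single full name
    f                                                # ambiguous: leave unchanged
  }, first, last, USE.NAMES = FALSE)
}

df$First <- resolve_first(df$First, df$Last)
print(df)
```

"Jane" is left alone because it never appears in nicknames$N1, which avoids the problem with the which.max approach in the question.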

Pivot_longer: Rotating multiple columns of data with same data types

I'm trying to rotate multiple columns of data into single, data-type consistent columns.
I've created a minimum example below.
library(tibble)
library(dplyr)
library(tidyr) # pivot_longer() and pivot_wider() come from tidyr
# I have data like this
df <- tibble(contact_1_prefix=c('Mr.','Mrs.','Dr.'),
contact_2_prefix=c('Dr.','Mr.','Mrs.'),
contact_1 = c('Bob Johnson','Robert Johnson','Bobby Johnson'),
contact_2 = c('Tommy Two Tones','Tommy Three Tones','Tommy No Tones'),
contact_1_loc = c('Earth','New York','Los Angeles'),
contact_2_loc = c('London','Geneva','Paris'))
# My attempt at a solution:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)") %>%
pivot_wider(names_from='dat',values_from='contact')
#What I want is to widen that data to achieve a tibble with these two example lines
df_desired <- tibble(name=c('Bob Johnson','Tommy Two Tones'),
loc =c('Earth','London'),
prefix=c('Mr.','Dr.'))
I want all names under name, all locations under loc, and all prefixes under prefix.
If I use just this snippet from the middle statement:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)")
The dput of the output is:
structure(list(dat = c("prefix", "prefix", "name", "name", "loc",
"loc", "prefix", "prefix", "name", "name", "loc", "loc", "prefix",
"prefix", "name", "name", "loc", "loc"), contact = c("Mr.", "Dr.",
"Bob Johnson", "Tommy Two Tones", "Earth", "London", "Mrs.",
"Mr.", "Robert Johnson", "Tommy Three Tones", "New York", "Geneva",
"Dr.", "Mrs.", "Bobby Johnson", "Tommy No Tones", "Los Angeles",
"Paris")), row.names = c(NA, -18L), class = c("tbl_df", "tbl",
"data.frame"))
From that, I thought for sure pivot_wider was the solution, but there is a name conflict.
I assume a single pivot_longer statement will achieve the task. I studied Gathering wide columns into multiple long columns using pivot_longer carefully but can't quite figure this out. I have to admit I don't quite understand what the names_to = c(".value", "group") phrase does.
In any event, any help is appreciated.
Thanks
You were on the right path. Renaming is needed because the name columns are the only ones without a suffix to identify them. In names_to, .value means that the matched part of each original column name becomes the name of an output column. Dropping everything up to the last underscore leaves exactly those new column names, which you can capture with a regex in names_pattern.
library(dplyr)
library(tidyr)
df %>%
rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols = everything(),
names_to = '.value',
names_pattern = '.*_(\\w+)')
# prefix name loc
# <chr> <chr> <chr>
#1 Mr. Bob Johnson Earth
#2 Dr. Tommy Two Tones London
#3 Mrs. Robert Johnson New York
#4 Mr. Tommy Three Tones Geneva
#5 Dr. Bobby Johnson Los Angeles
#6 Mrs. Tommy No Tones Paris
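To see which part of each column name the pattern captures, here is a small base R sketch using sub(). It only illustrates the regex; tidyr does its own matching internally, and the vector cols simply lists the column names after the rename step.

```r
# Column names after renaming contact_1 / contact_2
cols <- c("contact_1_prefix", "contact_2_prefix", "contact_1_name",
          "contact_2_name", "contact_1_loc", "contact_2_loc")

# The greedy ".*_" consumes everything up to the last underscore,
# so the capture group holds only the trailing piece
value_names <- sub(".*_(\\w+)", "\\1", cols)
print(value_names)

# The distinct captured pieces are what .value turns into output columns
print(unique(value_names))
```

Columns sharing a captured piece (for example both prefix columns) are stacked into the same output column, which is why three columns come out.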
Here is a solution using split.default:
data.table::rbindlist(
  lapply(split.default(df, gsub("[^0-9]+", "", names(df))),
         data.table::setnames,
         new = c("prefix", "name", "loc")))
# prefix name loc
# 1: Mr. Bob Johnson Earth
# 2: Mrs. Robert Johnson New York
# 3: Dr. Bobby Johnson Los Angeles
# 4: Dr. Tommy Two Tones London
# 5: Mr. Tommy Three Tones Geneva
# 6: Mrs. Tommy No Tones Paris

Add a grouping variable based on ranked data

Consider the following dataframe:
name <- c("Sally", "Dave", "Aaron", "Jane", "Michael")
rank <- c(1,2,1,2,3)
df <- data.frame(name, rank, stringsAsFactors = FALSE)
I'd like to create a grouping variable (event) based on the rank column, as such:
event <- c("Hurdles", "Hurdles", "Long Jump", "Long Jump", "Long Jump")
df_desired <- data.frame(name, rank, event, stringsAsFactors = FALSE)
There are lots of examples of going the other way (making a ranking variable based on a group) but I can't seem to find one doing what I'd like.
It's possible to use filter, full_join and then fill as shown below, but is there a simpler way?
library(tidyverse)
df <- df %>%
mutate(order = row_number())
df_1 <- df %>%
filter(rank == 1)
df_1$event <- c("Hurdles", "Long Jump")
df %>%
filter(rank != 1) %>%
mutate(event = as.character(NA)) %>%
full_join(df_1, by = c("order", "name", "rank", "event")) %>%
arrange(order) %>%
fill(event) %>%
select(-order)
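We can use cumsum to create the index. rank == 1 marks the first row of each event, so cumsum(rank == 1) gives a running group number that indexes the vector of event names: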
library(dplyr)
df %>%
mutate(event = c("Hurdles", "Long Jump")[cumsum(rank == 1)])
# name rank event
#1 Sally 1 Hurdles
#2 Dave 2 Hurdles
#3 Aaron 1 Long Jump
#4 Jane 2 Long Jump
#5 Michael 3 Long Jump
Or in base R (just in case):
df$event <- c("Hurdles", "Long Jump")[cumsum(df$rank == 1)]
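A short sketch of the intermediate values may help show why this works. It uses the rank vector from the question; group_index is a name made up for illustration.

```r
rank <- c(1, 2, 1, 2, 3)

# TRUE wherever a new event starts (rank resets to 1)
starts <- rank == 1

# Running count of starts: every row gets the number of its event
group_index <- cumsum(starts)
print(group_index)

# Use the group number to look up the event name
event <- c("Hurdles", "Long Jump")[group_index]
print(event)
```

This relies on the rows being ordered so that each event begins with its rank 1 row, and on the names vector having one entry per event.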

Manipulating variables to produce a new dataset in R

I'm a relatively new R user. I would really appreciate any help with my dataset please.
I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.
Some patients appear in the dataset more than once (ie. they have picked up medications from different pharmacies at different time points).
The data frame looks like this:
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.
For example: for Tom his most frequent pharmacy is Pharmacy B because he has picked up 13 medications from there (5+8 meds). The dataset I would like to generate:
data.frame(name = c("Tom", "Rob", "Amy"),
pharmacy = c("B", "B", "C"),
meds = c(13, 2, 2))
Can someone please help me with writing a code to do this?
I have tried various tools in R, such as dplyr, tidyr, and aggregate(), with no success. Any help would be genuinely appreciated.
Thank you very much
Alex
Your question is not reproducible. But here is one solution:
# create reproducible example of data
dataset1 <- data.frame(
name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
meds_count = c(3, 2, 5, 8, 2))
library(dplyr) #load dplyr
dataset2 <- dataset1 %>% group_by(name, pharmacy) %>% # group by your grouping variables
summarise(meds_count = sum(meds_count)) %>% # sum no. of meds by your grouping variables
top_n(1, meds_count) %>% # filter for only the top 1 count
ungroup()
Resulting dataframe:
> dataset2
# A tibble: 3 x 3
name pharmacy meds_count
<fct> <fct> <dbl>
1 Amy pharmacy_C 2.00
2 Rob pharmacy_B 2.00
3 Tom pharmacy_B 13.0
If I understood you correctly, I think you're looking for something like this.
require(tidyverse)
#Sample data. I copied yours.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
Edit. I changed the group_by(), summarise() and added filter.
df %>%
group_by(name, pharmacy) %>%
summarise(SumMeds = sum(meds, na.rm = TRUE)) %>%
filter(SumMeds == max(SumMeds))
Results:
name pharmacy SumMeds
<fct> <fct> <dbl>
1 Amy C 2.
2 Rob B 2.
3 Tom B 13.
Generating your dataset:
patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)
df is your dataframe
library(dplyr)
df = df %>% group_by(patient,pharmacy) %>%
summarize(meds =sum(meds)) %>%
group_by(patient) %>%
filter(meds == max(meds))
Take your df and group by patient and pharmacy, then calculate the total medicines each patient bought from each pharmacy by summing meds. Then group by patient and, finally, filter for the max.
Print the data frame:
print(df)
You can do it in base R with aggregate twice followed by merge.
It seems to me a bit complicated to have to use aggregate twice. Maybe dplyr solutions run more quickly, especially with a dataset with 24 million rows.
agg <- aggregate(meds ~ name + pharmacy, df, FUN = sum)
agg2 <- aggregate(meds ~ name, agg, FUN = max)
merge(agg, agg2)[c(1, 3, 2)]
# name pharmacy meds
#1 Amy C 2
#2 Rob B 2
#3 Tom B 13
Data.
This is the dataset in the question after the edit.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
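If exactly one pharmacy per patient is required even when two pharmacies tie on the total, a possible base R sketch is to sort the totals and keep the first row per patient. The names totals and top are assumptions for illustration, and which tied pharmacy wins is simply whichever sorts first.

```r
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
                 pharmacy = c("A", "B", "B", "B", "C"),
                 meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)

# Total meds per patient / pharmacy pair
totals <- aggregate(meds ~ name + pharmacy, data = df, FUN = sum)

# Sort by patient, largest total first, then keep the first row per patient
totals <- totals[order(totals$name, -totals$meds), ]
top <- totals[!duplicated(totals$name), ]
rownames(top) <- NULL
print(top)
```

Filtering on meds == max(meds) would return every tied pharmacy, whereas !duplicated() always returns a single row per patient.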
Assuming the following dataset:
library(tidyverse)
df <- tribble(
~patient, ~pharmacy, ~medication,
"Tom", "Pharmacy A", "3 meds",
"Rob", "Pharmacy B", "2 meds",
"Tom", "Pharmacy B", "5 meds",
"Tom", "Pharmacy B", "8 meds",
"Amy", "Pharmacy C", "2 meds"
)
A tidyverse-friendly option could be:
df %>%
mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>% # 1 ("+" keeps multi-digit counts intact)
group_by(patient, pharmacy) %>% # 2
mutate(med_sum = sum(med_n)) %>% # 3
group_by(patient) %>% # 4
filter(med_sum == max(med_sum)) %>% # 5
select(patient, pharmacy, med_sum) %>% # 6
distinct() # 7
1. create a numeric variable, as you can't add strings
2. among all patient / pharmacy couples
3. find the total number of medications
4. then among all patients
5. keep only pharmacies with the highest patient / pharmacy totals
6. discard useless variables
7. discard duplicated lines (several lines per patient / pharmacy couple)

Resources