Finding closest matching number between 2 dataframes using a grouped dplyr identifier - r

I have 2 datasets each with a 'Patient ID` and a collection date measured from the same date "from start." In order to join these dataframes together, I'd like to match each sample in d1 to it's closest neighbor in d2. How can this be done with a function in dplyr?
d1<-data.frame(`Patient ID`=c(rep("001",4),rep("002",5)),`fromstart`=c(-5,30,90,150,-10,15,45,100,250),check.names = F)
d2<-data.frame(`Patient ID`=c(rep("001",7),rep("002",4)),`fromstart`=c(-20,10,30,50,90,110,150,-10,15,45,100),check.names = F)
closest_date<-function(cases,d2) {
return(d2 %>% select(`Patient ID`,fromstart) %>% unique() %>% filter(`Patient ID`==cases$`Patient ID`) %>% rowwise() %>% mutate(date_match=as.numeric(cases$fromstart[which.min(abs(fromstart - cases$fromstart))])))
}
d1 %>% select(`Patient ID`,fromstart) %>% unique() %>% group_by(`Patient ID`) %>% rowwise() %>% mutate(closest=closest_date(.,d2))

If I understood your problem correctly you want to join by patient ID and then select those lines where the difference between fromstart is the smallest? If so this would be a solution
library(dplyr)
d1 %>%
dplyr::full_join(d2, by = c("Patient ID"), suffix = c("_1", "_2")) %>%
dplyr::mutate(DIF = abs(fromstart_1 - fromstart_2)) %>%
dplyr::group_by(`Patient ID`, fromstart_1) %>%
dplyr::filter(DIF == min(DIF))
As you can see this does not really work well if you want unique combinations because there can be cases where the distance is the same... than again maybe I did not get your questoin right
Instead of using the absolute value you could filter for positive differences/distances as well if you want fromstart of the second table to be larger than from the first, this would reduce double entries to a certain degree

Related

Unique rows based on two logical conditions

I want my dataframe to return unique rows based on two logical conditions (OR not AND).
But when I ran this, df %>% group_by(sex) %>% distinct(state, education) %>% summarise(n=n()) I got deduplicated rows based on the two conditions joined by AND not OR.
Is there a way to get something like this df %>% group_by(sex) %>% distinct(state | education) %>% summarise(n=n()) so that the deduplicated rows will be joined by OR not AND?
Thank you.
You can use tidyr::pivot_longer and then distinct afterwards:
df %>%
pivot_longer(c(state, education), names_to = "type", values_to = "value")
group_by(sex) %>%
distinct(value) %>%
summarise(n = n())
In this case, pivot_longer simply puts state and education into one column called value.

Finding the top n represented entries in a grouped dataframe in R

I am a beginner in R and would be very thankful for a response as I am stuck on this code (this is my attempt at solving the problem but it does not work):
personal_spotify_df <- fromJSON("data/StreamingHistory0.json")
personal_spotify_df = personal_spotify_df %>%
mutate(minutesPlayed = msPlayed/1000/60)
personal_spotify_df_ranked <- personal_spotify_df %>%
group_by(artistName) %>%
filter(top_n(15, max(nrows())))
I have a dataframe (see below for a screenshot on how its structured) which is my spotity listening history. I want to group this dataframe by artists and afterwards arrange the new dataframe to show the top 15 artists with the most songs listened to. I am stuck on how to get from grouping by artistName to actually filtering out the top 15 represented artists from the dataframe.
The dataframe
We may use slice_max, with n specified as 15 and the order column created with add_count
library(dplyr)
personal_spotify_df %>%
add_count(artistName, name = "Count") %>%
slice_max(n = 15, order_by = "Count") %>%
select(-Count)
If we want to get only the top 15 distinct 'artistName',
personal_spotify_df %>%
count(artistName, name = "Count") %>%
slice_max(n = 15, order_by = "Count")
Or an option with filter after arrangeing the rows based on the count
personal_spotify_df %>%
add_count(artistName) %>%
arrange(desc(n)) %>%
filter(artistName %in% head(unique(artistName), 15))
In base R, you can make use of table, sort and head to get top 15 artists with their count
table(personal_spotify_df$artistName) |>
sort(decreasing = TRUE) |>
head(15) |>
stack()
The pipe operator (|>) requires R 4.1 if you have a lower version use -
stack(head(sort(table(personal_spotify_df$artistName), decreasing = TRUE), 15))

transform() to add rows with dplyr()

I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr() to group my data by site and purchase, and get the counts and percentages for the grouped data. I'd however also like the tibble to feature rows called ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble looking similar to dfgoal.
The problem's that my current code doesn't get me the ALLSITES rows. I've tried adding a base R function into dplyr(), which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),purchase=c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site=c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),purchase=c("a1","a2","a1","a2","a1","a1","a2"),bin=c(1,2,6,2,4,11,4),pin_per=c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66666))
Current code:
library(dplyr)
df %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
df %>%
rbind(df, transform(df, site = "ALLSITES") %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
We can start from the first output code block, after grouping by 'site' with a created string of 'ALLSITES' and 'purchase' get the sum of 'bin' and later 'bin_per', then with bind_rows row bind the two datasets
df1 %>%
ungroup() %>%
group_by(site = 'ALLSITES', purchase) %>%
summarise(bin = sum(bin)) %>%
ungroup %>%
mutate(bin_per = 100*(bin/sum(bin))) %>%
bind_rows(df1, .)

Finding elements from multiple columns of one dataframe that are not in multiple columns of another

library(tidyverse)
I have two dataframes (see sample code at bottom) called Df1 and Df2. I want to find phone numbers in Df1 (from all the columns) that are not in any of the phone number columns in Df2.
First, I restructure Df1 so that there is only one Id per row.
Df1<-Df1 %>%
gather(key, value, -Id) %>%
filter(!is.na(value)) %>%
select(-key) %>%
group_by(Id) %>%
filter(!duplicated(value)) %>%
mutate(Phone=paste0("Phone_",1:n())) %>%
spread(Phone, value)
Next, I rename Df2 and then use a join to find only Ids in Df1 that are in Df2.
Df2<-Df2%>%set_names(c("Id","Ph1","Ph2"))
DfJoin<-left_join(Df2,Df1,by="Id")
This is where I'm stuck. I want to find all the numbers in Df1 (Phone1 Phone2, and Phone 3) that are not in Df2 (Ph1 and Ph2). Below are some ideas for code. I tried many variations of this idea but could not find a way to achieve what I want. The final product should just be a table with the phone numbers(s) in any Df1 column that are not in any Df2 column together with the associated Id. I'm also wondering if there is another join or set operation that would achieve this in a more efficient way?
DfJoin<-DfJoin%>%mutate(New=if_else(! DfJoin[2:3] %in% DfJoin[4:6]),1,0)
DfJoin<-DfJoin%>%filter(! DfJoin[2:3] %in% DfJoin[2:4])
Sample Data:
Dataframe 1:
Id<-c(199,148,148,145,177,165,144,121,188,188,188,111)
Ph1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
Ph2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df1<-data.frame(Id,Ph1,Ph2)
Dataframe 2:
Id2<-c(199,148,142,145,177,165,144,121,182,109,188,111)
Phone1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
Phone2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df2<-data.frame(Id2,Phone1,Phone2)
One way to think about this problem:
You have a set of phone numbers in df1 for each ID number.
You have a set of phone numbers in df2 for each ID number.
You want to find, within each ID, the set difference between df1 and df2.
You can do this by mapping the base R function setdiff() onto your joined dataframe. To do this, you need to convert your data frames into list-column format, where all the phone numbers for each ID are present as a list in a "cell" of the dataframe. This is easily done by combining group_by(), summarize() and list().
# create example data
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)
Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id=Id2, phone1, phone2)
# convert the data to list-column format
df1.listcol <- df1 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list1 = list(phone))
df2.listcol <- df2 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list2 = list(phone))
Take a look at these dataframes to make sure you understand how we've reformatted them. Obviously, we could save a few lines of code by making this conversion process into a function, and then calling the function on each of df1 and df2, but I didn't do that here.
# join the two listcol dfs by Id, then map setdiff on the two columns
result <-
df1.listcol %>%
left_join(df2.listcol, by='Id') %>%
mutate(only_list_1 = map2(phone_list1, phone_list2, ~setdiff(.x, .y))) %>%
select(Id, only_list_1) %>%
unnest()
result
The result is
Id only_list_1
148 6541132112
188 7890986543
188 6785554444
Have you tried anti_join(a, b, by = "x1")
This basically gives you all rows in a which are not in b
DfJoin <- anti_join(Df1, Df2, by = "Id")
tidyr_dplyr cheatsheet
Use the above cheatsheet for data manipulation in tidyverse

Same difference in alternate rows per sub group

I have data frame where i need to find difference but for every alternate row the difference should stay same as the things to do are same like this:
but I have used this:
things <- data.frame( category = c("A","B","A","B","A","B","A","B","A","B"),
things2do = c("ball","ball","bat","bat","hockey","hockey","volley ball","volley ball","foos ball","foos ball"),
number = c(12,5,4,1,0,2,2,0,0,2))
things %>%
mutate(diff = number - lead(number,order_by=things2do))
but it is not helpful,as I am getting this:
Can i get some help here?
library(tidyverse)
things2 <- things %>%
spread(category, number) %>%
mutate(diff = B - A) %>%
gather(category, number, A:B) %>%
select(category, things2do, number, diff) %>%
arrange(things2do)
One way is to group the data by things2do and subsequently take an iterated difference.
library(dplyr)
things %>%
group_by(things2do) %>%
mutate(diff = diff(number))

Resources