Merge dataframes using an extra condition

I know there should be an easier or smarter way of doing what I need, but I haven't found it yet after several days.
I have 2 dataframes that I need to merge using an extra condition. For example:
df1 <- data.frame(Username = c("user1", "user2", "user3", "user4", "user5", "user6"))
df2 <- data.frame(File_Name = c(rep("StudyABC", 5), rep("AnotherStudyCDE", 4)), Username = c("user1", rep(c("user2", "user3", "user4", "user5"),2)))
print(df1)
print(df2)
What I need is to create 2 new columns in df1, called ABC and CDE, that contain the matching "File_Name" values. Of course the real data has hundreds of rows and is not ordered, so selecting by range is not an option.
One (inelegant) solution that I have found is:
library(dplyr)
library(stringr)
df2_filtered <- df2 %>% filter(str_detect(File_Name, "ABC"))
df1 <- left_join(df1, df2_filtered, by = "Username")
names(df1)[2] <- "ABC"
df2_filtered <- df2 %>% filter(str_detect(File_Name, "CDE"))
df1 <- left_join(df1, df2_filtered, by = "Username")
names(df1)[3] <- "CDE"
print(df1)
Is there a shorter way of doing it? I have to repeat the same logic 160 times.
Thanks

You can extract either "ABC" or "CDE" from File_Name and cast the data into wide format. Joining the result with df1 then keeps every Username in the final dataframe.
library(dplyr)
df2 %>%
  mutate(name = stringr::str_extract(File_Name, 'ABC|CDE')) %>%
  tidyr::pivot_wider(names_from = name, values_from = File_Name) %>%
  right_join(df1, by = 'Username')
#  Username ABC      CDE
#  <chr>    <chr>    <chr>
#1 user1    StudyABC NA
#2 user2    StudyABC AnotherStudyCDE
#3 user3    StudyABC AnotherStudyCDE
#4 user4    StudyABC AnotherStudyCDE
#5 user5    StudyABC AnotherStudyCDE
#6 user6    NA       NA
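With 160 studies, typing the alternation by hand gets tedious. A minimal sketch, assuming the study codes are available as a character vector (the name `codes` is made up here) and contain no regex metacharacters, could build the pattern programmatically:

```r
library(dplyr)
library(tidyr)
library(stringr)

df1 <- data.frame(Username = paste0("user", 1:6))
df2 <- data.frame(
  File_Name = c(rep("StudyABC", 5), rep("AnotherStudyCDE", 4)),
  Username  = c("user1", rep(c("user2", "user3", "user4", "user5"), 2))
)

# Hypothetical vector of study codes; in the real data it would hold all 160
codes <- c("ABC", "CDE")
pattern <- paste(codes, collapse = "|")

result <- df2 %>%
  mutate(study = str_extract(File_Name, pattern)) %>%   # which code each file belongs to
  pivot_wider(names_from = study, values_from = File_Name) %>%
  right_join(df1, by = "Username")                      # keep users without any file
```

If one code is a substring of another, it may help to list the longer code first, since the alternation tries its alternatives in order.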

What you're looking for is a way of casting data from long to wide. For example, using the data.table package I would do this:
library(data.table)
# converts data.frame to data.table
dt <- as.data.table(df2)
# copy File_Name: one copy drives the long-to-wide pivot, the other fills in the cell values
dt[, study := File_Name]
dt_wide <- dcast(dt, Username ~ File_Name, value.var = "study")
# have a look at df2 in wide format
dt_wide[]
# now it's just a direct merge to pull it back into df1, then convert
# back to a data.frame for you
out <- merge(as.data.table(df1), dt_wide, by="Username", all.x=TRUE)
setDF(out)
out
There are plenty of tutorials on melting/casting, even without data.table. It's just a matter of knowing what to search for, e.g. Google throws up https://ademos.people.uic.edu/Chapter8.html as the first result.
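The dcast call above names the wide columns after the full File_Name values. If the short codes ABC and CDE are wanted as column names instead, a possible variant (a sketch, assuming every File_Name contains exactly one of the codes) extracts the code first and casts on that:

```r
library(data.table)

df1 <- data.frame(Username = paste0("user", 1:6))
df2 <- data.frame(
  File_Name = c(rep("StudyABC", 5), rep("AnotherStudyCDE", 4)),
  Username  = c("user1", rep(c("user2", "user3", "user4", "user5"), 2))
)

dt <- as.data.table(df2)
# keep only the captured code; a File_Name without a code would be left unchanged
dt[, study := sub(".*(ABC|CDE).*", "\\1", File_Name)]
# one column per study code, filled with the file names
dt_wide <- dcast(dt, Username ~ study, value.var = "File_Name")
# left join so that users without any file are kept
out <- merge(as.data.table(df1), dt_wide, by = "Username", all.x = TRUE)
setDF(out)
```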

If one study can have more than one file path (which I assume is the case from your previous attempts), simply converting your data to a wide format before joining won't work, as you'll get one column per file path rather than one per study.
One method in this case is to use a for-loop to create an additional column in df2 holding the study name, then convert the data to a wide format using pivot_wider.
It's not a very idiomatic R method though, so I'd welcome suggestions that avoid creating the empty study column and the for-loop.
library(dplyr)
library(tidyr)
studies <- c("ABC", "CDE")
# create an empty column named "study"
df2 <- df2 %>%
  mutate(study = NA_character_)
# fill in the study code whenever it appears in File_Name
for (i in studies) {
  df2 <- df2 %>%
    mutate(study = if_else(grepl(i, File_Name), i, study))
}
df2 <- df2 %>%
  pivot_wider(names_from = study, values_from = File_Name)
> df2
# A tibble: 5 x 3
  Username ABC      CDE
  <chr>    <chr>    <chr>
1 user1    StudyABC NA
2 user2    StudyABC AnotherStudyCDE
3 user3    StudyABC AnotherStudyCDE
4 user4    StudyABC AnotherStudyCDE
5 user5    StudyABC AnotherStudyCDE
df2 is now in a wide format and you can join it to df1 as before to get your desired output.
df3 <- left_join(df1, df2, by = "Username")
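One possible way to drop both the empty column and the loop is sketched below. It assumes the study codes contain no regex metacharacters and a reasonably recent tidyr in which values_fn accepts a single function. The data frame `files` is made up for illustration: one user has two different files for the same study, and values_fn collapses them into a single cell instead of producing a list-column.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical data: user2 has two files for study ABC
files <- data.frame(
  File_Name = c("StudyABC", "StudyABC_v2", "AnotherStudyCDE"),
  Username  = c("user2", "user2", "user2")
)
studies <- c("ABC", "CDE")

wide <- files %>%
  # vectorised replacement for the for-loop: pull the study code out of File_Name
  mutate(study = str_extract(File_Name, paste(studies, collapse = "|"))) %>%
  pivot_wider(
    names_from  = study,
    values_from = File_Name,
    # several files for the same user and study end up in one string
    values_fn   = function(x) paste(x, collapse = "; ")
  )
```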

Related

Replace values in dataframe based on other dataframe with column name and value

Let's say I have a dataframe of scores
library(dplyr)
id <- c(1 , 2)
name <- c('John', 'Ninaa')
score1 <- c(8, 6)
score2 <- c(NA, 7)
df <- data.frame(id, name, score1, score2)
Some mistakes have been made so I want to correct them. My corrections are in a different dataframe.
id <- c(2,1)
column <- c('name', 'score2')
new_value <- c('Nina', 9)
corrections <- data.frame(id, column, new_value)
I want to search the dataframe for the correct id and column and change the value.
I have tried something with match but I don't know how to mutate the correct column.
df %>% mutate(corrections$column = replace(corrections$column, match(corrections$id, id), corrections$new_value))
We could join by 'id', then mutate across the columns named in column, replacing the element in each of those columns (identified via cur_column()) with the matching new_value:
library(dplyr)
df %>%
  left_join(corrections) %>%
  mutate(across(all_of(column), ~ replace(.x, match(cur_column(), column),
                                          new_value[match(cur_column(), column)]))) %>%
  select(names(df))
Output:
  id name score1 score2
1  1 John      8      9
2  2 Nina      6      7
This implements a feasible idea based on dplyr::rows_update, though it involves functions from several packages. In practice I prefer a more parsimonious approach.
library(tidyverse)
corrections %>%
  group_by(id) %>%
  group_map(
    ~ pivot_wider(.x, names_from = column, values_from = new_value) %>% type_convert,
    .keep = TRUE) %>%
  reduce(rows_update, by = 'id', .init = df)
#   id name score1 score2
# 1  1 John      8      9
# 2  2 Nina      6      7
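For comparison, the same corrections can be applied cell by cell in base R. This is only a sketch, not a replacement for the answers above; it assumes each id occurs once in df and that the target columns are plain character or numeric vectors (factors would need extra handling). The name `fixed` is made up.

```r
df <- data.frame(id = c(1, 2), name = c("John", "Ninaa"),
                 score1 = c(8, 6), score2 = c(NA, 7))
corrections <- data.frame(id = c(2, 1),
                          column = c("name", "score2"),
                          new_value = c("Nina", "9"))

fixed <- df
for (i in seq_len(nrow(corrections))) {
  row <- match(corrections$id[i], fixed$id)   # row holding this id
  col <- corrections$column[i]                # column to overwrite
  # new_value is stored as character, so convert it to the target column's class
  fixed[row, col] <- methods::as(corrections$new_value[i], class(fixed[[col]]))
}
```

Converting each value to the class of its target column keeps score2 numeric rather than turning it into character.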

How to group_by more elegantly my data frame

I have two tables:
One where I know for sure all users of this table df1 have used a feature called "Folder"
The other where I don't know if users of this table df2 have used a feature called "Folder"
I want to build a graph showing the number of users that used the Folder feature on each date.
So the main data frame I want to build is a data frame with ALL the Dates included (from df1 and df2) and for each date the number of users who used this feature "Folder".
Here is a reproducible example:
df1 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-20"),user_ID=c("RZ625","TDH65","EJ7336"))
colnames(df1) <- c("Date", "user_ID")
df2 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-22"),user_ID=c("IZ823","TDH65","SI826"))
colnames(df2) <- c("Date", "user_ID")
The only way I found so far was to create a flag Folder_True that is 1 if we know this user used the Folder feature on this date and 0 if we don't know. I then combined it with dplyr's group_by and sum. But I think it's not very elegant, and I would like to learn a more logical/efficient way to do this data wrangling.
Thanks!
df1 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-20"),user_ID=c("RZ625","TDH65","EJ7336"), Folder_True=c(0,0,0))
df2 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-22"),user_ID=c("IZ823","TDH65","SI826"), Folder_True=c(1,0,1))
library(dplyr)
combined_df <- rbind(df1, df2)
combined_df <- combined_df %>%
  group_by(Date, user_ID) %>%
  summarise(Folder_True = sum(Folder_True))
final_df <- combined_df %>%
  group_by(Date) %>%
  summarise(Nb_Users_Folder_True = sum(Folder_True))
For each Date you can count the unique users who have Folder_True = 1.
library(dplyr)
combined_df <- rbind(df1, df2)
combined_df %>%
  group_by(Date) %>%
  summarise(Nb_Users_Folder_True = n_distinct(user_ID[Folder_True == 1]))
#  Date       Nb_Users_Folder_True
#  <chr>                     <int>
#1 2021-05-12                    1
#2 2021-05-15                    0
#3 2021-05-20                    0
#4 2021-05-22                    1
Using uniqueN from data.table:
library(data.table)
rbindlist(list(df1, df2))[,
  .(Nb_Users_Folder_True = uniqueN(user_ID[as.logical(Folder_True)])), Date]
Output:
#         Date Nb_Users_Folder_True
#1: 2021-05-12                    1
#2: 2021-05-15                    0
#3: 2021-05-20                    0
#4: 2021-05-22                    1
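The same count can also be sketched in base R without extra packages. It uses the flagged df1 and df2 from the question; the helper names `all_dates`, `flagged` and `counts` are made up. Dates with no flagged user come back as NA from tapply and are set to 0.

```r
df1 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-20"),
                  user_ID = c("RZ625", "TDH65", "EJ7336"),
                  Folder_True = c(0, 0, 0))
df2 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-22"),
                  user_ID = c("IZ823", "TDH65", "SI826"),
                  Folder_True = c(1, 0, 1))
combined_df <- rbind(df1, df2)

all_dates <- sort(unique(combined_df$Date))             # every date from both tables
flagged <- combined_df[combined_df$Folder_True == 1, ]  # rows known to have used Folder
# distinct flagged users per date; the factor levels keep dates with no flagged user
counts <- tapply(flagged$user_ID,
                 factor(flagged$Date, levels = all_dates),
                 function(u) length(unique(u)))
counts[is.na(counts)] <- 0L
final_df <- data.frame(Date = all_dates,
                       Nb_Users_Folder_True = as.integer(counts))
```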

Is there a way to create and insert new rows in one column based on another column

I have some event data that I want to gather in one column. At the moment the data include columns for events and other columns that contain the outcome of certain events. I want to include the outcomes as events in the data and also preserve the order. The data look like df in the example below and I want to transform them so that they look like the desired df.
a <- c("event1","event2","event3","event4")
b <- c("outcome1",'','','')
c <- c('','',"outcome3",'')
df <- data.frame(a,b,c)
d <- c("event1","outcome1","event2","event3","outcome3","event4")
desired <- data.frame(d)
We can transpose the data into a matrix, collapse it into one vector, and remove the empty values:
vals <- c(t(df))
data.frame(d = vals[vals!= ""])
#         d
#1   event1
#2 outcome1
#3   event2
#4   event3
#5 outcome3
#6   event4
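The transpose matters because R stores matrices column by column. A small illustration on a cut-down version of the data (the names `small`, `by_col` and `by_row` are made up):

```r
small <- data.frame(a = c("event1", "event2"),
                    b = c("outcome1", ""))

# Without transposing, the values come out one column at a time
by_col <- c(as.matrix(small))
# After transposing, they come out one row at a time, which keeps
# each outcome right after its event
by_row <- c(t(small))

print(by_col)
print(by_row)
```

Here by_col lists both events before the outcome, while by_row places outcome1 directly after event1.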
Using tidyverse:
library(dplyr)
tidyr::pivot_longer(df, cols = names(df)) %>%
  filter(value != "") %>%
  select(value)

Select only rows where the last date is present

Let's say that I have the following data.
df = data.frame(name = c("A","A","A","B","B","B","B"),
                date = c("2011-01-01","2011-03-01","2011-05-01",
                         "2011-01-01","2011-05-01","2011-06-01",
                         "2011-07-01"))
df
I know the last date in the data set and only want to pick those names where data is available for the last date. So in the above example, the last date is only available for name B. Thus, I want to select only the rows for name B.
I can do simple hacks like this to get the desired result.
last_date = "2011-07-01"
#unique(df$name[df$date %in% last_date])
df[df$name %in% unique(df$name[df$date %in% last_date]),]
However, I was wondering if there was a dplyr/tidyverse or data.table solution for this task.
There are multiple ways you can do this. With dplyr, we can keep only those groups that contain last_date:
library(dplyr)
df %>%
  group_by(name) %>%
  filter(last_date %in% date)
#  name  date
#  <fct> <fct>
#1 B     2011-01-01
#2 B     2011-05-01
#3 B     2011-06-01
#4 B     2011-07-01
Or similarly in base R:
df[ave(df$date == last_date, df$name, FUN = any), ]
Alternatively, we can collect every name that has a row for last_date and keep those names from the original dataframe.
df[with(df, name %in% name[date %in% last_date]), ]
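Since the question also asks about data.table, one way it might look is sketched below (the names `dt` and `res` are made up). For each name, the group's rows (.SD) are returned only if last_date occurs among its dates; otherwise the group contributes nothing.

```r
library(data.table)

df <- data.frame(name = c("A", "A", "A", "B", "B", "B", "B"),
                 date = c("2011-01-01", "2011-03-01", "2011-05-01",
                          "2011-01-01", "2011-05-01", "2011-06-01",
                          "2011-07-01"))
last_date <- "2011-07-01"

dt <- as.data.table(df)
# keep a group's rows only when last_date is present in that group
res <- dt[, if (last_date %in% date) .SD, by = name]
```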

Create a loop for pasting or removing elements based on different scenarios

Say I have the following data set:
mydf <- data.frame("MemberID"=c("111","0111A","0111B","112","0112A","113","0113B"),
                   "resign.date"=c("2013/01/01",NA,NA,"2014/03/01",NA,NA,NA))
Note: 111,112 and 113 are the IDs for the family representative.
I would like to do two things:
a) if I have the resign dates for a family representative for instance in the case of 111, I want to paste the same resign dates for 0111A and 0111B (These represent spouse and children of 111 if you're wondering)
b) if I don't have resign dates for the family representative, for instance 113, I would simply like to remove the rows 113 and 0113B.
My resulting data frame should look like this:
mydf <- data.frame("MemberID"=c("111","0111A","0111B","112","0112A"),
"resign.date"=c("2013/01/01","2013/01/01","2013/01/01","2014/03/01","2014/03/01"))
Thanks in advance.
If resign.date is only present for (some) MemberIDs without trailing letters, here is a solution using data.table:
library(data.table)
df <- data.table("MemberID"=c("0111","0111A","0111B","0112","0112A","0113","0113B"),
                 "resign.date"=c("2013/01/01",NA,NA,"2014/03/01",NA,NA,NA))
df <- df[order(MemberID)]                          ## order data: MemberIDs without trailing letters come first within each ID
df[, myID := gsub("\\D+", "", MemberID)]           ## create myID col: MemberID without trailing letters
df[, my.resign.date := resign.date[1L], by = myID] ## assign first occurrence of resign date by myID
df <- df[!is.na(my.resign.date)]                   ## drop rows if my.resign.date is missing
EDIT
If MemberID is inconsistent (some values have a leading 0, some don't), you can try a workaround like the following:
df <- data.table( "MemberID"=c("111","0111A","0111B","112","0112A","113","0113B"),
"resign.date"=c("2013/01/01",NA,NA,"2014/03/01",NA,NA,NA))
df[, myID := gsub("(?<![0-9])0+", "", gsub("\\D+", "", MemberID), perl = TRUE)]
df <- df[order(myID, -MemberID)]
df[ , my.resign.date := resign.date[1L], by = myID]
df <- df[!is.na(my.resign.date)]
We can also use tidyverse:
library(tidyverse)
mydf %>%
  group_by(grp = parse_number(MemberID)) %>%
  mutate(resign.date = first(resign.date)) %>%
  na.omit() %>%
  ungroup() %>%
  select(-grp)
# A tibble: 5 x 2
#  MemberID resign.date
#  <fctr>   <fctr>
#1 0111     2013/01/01
#2 0111A    2013/01/01
#3 0111B    2013/01/01
#4 0112     2014/03/01
#5 0112A    2014/03/01
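Both answers above take the first resign.date in each family, so they rely on the representative's row coming first. A base R sketch that does not depend on row order is shown below. It assumes a representative is any MemberID without letters and that each family has exactly one; the names `family`, `is_rep`, `rep_date` and `result` are made up.

```r
mydf <- data.frame(
  MemberID    = c("111", "0111A", "0111B", "112", "0112A", "113", "0113B"),
  resign.date = c("2013/01/01", NA, NA, "2014/03/01", NA, NA, NA),
  stringsAsFactors = FALSE
)

# family number: digits only, leading zeros dropped by the integer conversion
family <- as.integer(gsub("\\D", "", mydf$MemberID))
# representatives are the IDs that contain no letters
is_rep <- !grepl("[A-Za-z]", mydf$MemberID)
# lookup table: family number -> representative's resign date
rep_date <- setNames(mydf$resign.date[is_rep], family[is_rep])

# a) copy the representative's date to every family member
mydf$resign.date <- unname(rep_date[as.character(family)])
# b) drop families whose representative has no resign date
result <- mydf[!is.na(mydf$resign.date), ]
```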
