inner_join or merge: keep the second id column - r

I have performed an inner_join (or merge) in R.
I want to keep the second id column "DI" in the result.
library(dplyr)
ab<-data.frame(ID=(c("PDM.999993856","PDM.999960488")),oi=rep("r",2),stringsAsFactors = FALSE)
to<-data.frame(DI=c("PDM.999993856","PDM.999960488"),kl=rep("foo",2),stringsAsFactors=FALSE)
inner_join(ab,to, by=c("ID"="DI"))

We can try
to %>%
mutate(ID = DI) %>%
inner_join(., ab, by = "ID")
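As an alternative sketch, assuming dplyr 1.0.0 or later (where the join functions gained a keep argument), both key columns can be retained directly without copying DI first. The data below is the example from the question.

```r
library(dplyr)

# Example data from the question
ab <- data.frame(ID = c("PDM.999993856", "PDM.999960488"),
                 oi = rep("r", 2), stringsAsFactors = FALSE)
to <- data.frame(DI = c("PDM.999993856", "PDM.999960488"),
                 kl = rep("foo", 2), stringsAsFactors = FALSE)

# keep = TRUE retains the join key from both inputs,
# so the result contains ID as well as DI
res <- inner_join(ab, to, by = c("ID" = "DI"), keep = TRUE)
names(res)
```

With base merge() there is no equivalent switch as far as I know, so copying the key column before merging (as in the answer above) remains the simplest route there.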

Related

Drop a column that was used as 'by' argument in join

I have the following query:
library(dplyr)
FinalQueryDplyr <- PostsWithFavorite %>%
inner_join(Users, by = c("OwnerUserId" = "Id"), keep = FALSE) %>%
select(DisplayName, Age, Location, FavoriteTotal, MostFavoriteQuestion, MostFavoriteQuestionLikes) %>%
select(-c(OwnerUserId)) %>%
arrange(desc(FavoriteTotal))
As you can see, I use the OwnerUserId column as the joining column between 2 data frames.
I want the result data frame to only have other columns, without the OwnerUserId column visible.
Even though I 'deselect' the OwnerUserId column 2 times in said query:
once by not including it in the first select clause
once by explicitly deselecting it with select(-c(OwnerUserId))
It is still visible in the result:
OwnerUserId DisplayName Age Location FavoriteTotal MostFavoriteQuestion MostFavoriteQuestionLikes
How can I get rid of the column that was used as a joining column in dplyr?
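If PostsWithFavorite is grouped by OwnerUserId, that would explain the behaviour: select() always keeps grouping variables. One option is to remove the grouping attribute by converting to a data.frame: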
library(dplyr)
PostsWithFavorite %>%
inner_join(Users, by = c("OwnerUserId" = "Id"), keep = FALSE) %>%
select(DisplayName, Age, Location, FavoriteTotal,
MostFavoriteQuestion, MostFavoriteQuestionLikes) %>%
as.data.frame %>%
select(-c(OwnerUserId)) %>%
arrange(desc(FavoriteTotal))
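A minimal sketch of the same idea with ungroup(), assuming the join column survives because the left table is grouped by it. The tables posts and users are hypothetical stand-ins for PostsWithFavorite and Users.

```r
library(dplyr)

# Hypothetical stand-ins for PostsWithFavorite and Users
posts <- tibble(OwnerUserId = c(1, 2), FavoriteTotal = c(8, 7)) %>%
  group_by(OwnerUserId)
users <- tibble(Id = c(1, 2), DisplayName = c("ann", "bob"))

# While the data is grouped, select() adds the grouping column back
kept <- posts %>%
  inner_join(users, by = c("OwnerUserId" = "Id")) %>%
  select(DisplayName, FavoriteTotal)

# Dropping the grouping first lets select() remove the join column
res <- posts %>%
  inner_join(users, by = c("OwnerUserId" = "Id")) %>%
  ungroup() %>%
  select(DisplayName, FavoriteTotal) %>%
  arrange(desc(FavoriteTotal))

names(kept)
names(res)
```

ungroup() keeps the result a tibble, whereas as.data.frame() also changes the class; which one is preferable depends on what the rest of the pipeline expects.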

How to divide this string in multiple columns?

I've this string and I need to split it into different columns
legend = "Frequency..Derivatives.measure...Derivatives.instrument...Derivatives.risk.category...Derivatives.reporting.country...Derivatives.counterparty.sector...Derivatives.counterparty.country...Derivatives.underlying.risk.sector...Derivatives.currency.leg.1...Derivatives.currency.leg.2...Derivatives.maturity...Derivatives.rating...Derivatives.execution.method...Derivatives.basis...Period..30.06.1998.31.12.1998.30.06.1999.31.12.1999.30.06.2000.31.12.2000.30.06.2001.31.12.2001.30.06.2002.31.12.2002.30.06.2003.31.12.2003.30.06.2004.31.12.2004.30.06.2005.31.12.2005.30.06.2006.31.12.2006.30.06.2007.31.12.2007.30.06.2008.31.12.2008.30.06.2009.31.12.2009.30.06.2010.31.12.2010.30.06.2011.31.12.2011.30.06.2012.31.12.2012.30.06.2013.31.12.2013.30.06.2014.31.12.2014.30.06.2015.31.12.2015.30.06.2016.31.12.2016.30.06.2017.31.12.2017.30.06.2018.31.12.2018.30.06.2019"
Every three dots should start a new column, up to the word Period. Note that the first word, Frequency, is separated from the second, Derivatives.measure, by only two dots, not three.
After that there is a series of dates (at 6-month intervals), which should be split this way: "every time there's a 4-digit number, perform a split".
How can I do this? Thank You
We can use strsplit to split at the ... with fixed = TRUE into a list of vectors and then rbind the vectors to create a data.frame
df1 <- do.call(rbind.data.frame, strsplit(legend, "...", fixed = TRUE))
names(df1) <- paste0("V", seq_along(df1))
If we also need to include the last condition to split the "Period"
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
tibble(col = legend) %>%
mutate(rn = row_number()) %>%
separate_rows(col, sep= "[.]{3}") %>%
mutate(rn2 = str_c("V", rowid(rn))) %>%
pivot_wider(names_from = rn2, values_from = col) %>%
rename_at(ncol(.), ~ "Period") %>%
mutate(Period = str_remove(Period, "Period\\.+")) %>%
separate_rows(Period, sep="(?<=\\.[0-9]{4})\\.")
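For comparison, here is a base R sketch of the two splitting rules on a shortened version of the string. It deliberately differs from the answer above in one respect: it splits on any run of two or more dots, so Frequency becomes its own piece. The shortened legend and the variable names are assumptions made for illustration.

```r
# Shortened version of the string from the question
legend <- "Frequency..Derivatives.measure...Derivatives.instrument...Period..30.06.1998.31.12.1998.30.06.1999"

# Rule 1: split wherever two or more consecutive dots occur
parts <- strsplit(legend, "\\.{2,}")[[1]]

# The last piece holds the run of dates; everything before "Period" is a header
headers <- parts[seq_len(length(parts) - 2)]
date_run <- parts[length(parts)]

# Rule 2: extract each dd.mm.yyyy date, i.e. split after every 4-digit year
dates <- regmatches(date_run, gregexpr("[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}", date_run))[[1]]

headers
dates
```

Extracting the dates with a pattern avoids the lookbehind used in the tidyr version, at the cost of assuming every date has exactly the dd.mm.yyyy shape.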

Looking up multiple values in separate table, but only returning one unique row

I have two data frames that look like this:
Table1:
Gender<-c("M","F","M","M","F")
CPTCodes<-c("15777, 19328, 19342, 19366, 19370, 19371, 19380","15777, 19357","19367, 49568","15777, 19357","15777, 19357")
Df<-tibble(Gender,CPTCodes)
Table2:
Code<-c(19328,19342,15777,49568,12345)
Value<-c(0.5,7,9,35,2)
Df2<-tibble(Code,Value)
I had previously asked a question about how to sum the values from Table 2 into a new column in Table 1, based on the codes listed in the CPTCodes column of Table 1. It turned out to be a duplicate of another question, but either way the solutions there worked great and did exactly what I asked.
The problem is that I didn't realize that, buried deep in the thousands of rows of Table 2, there were some duplicate codes. That is, Table 2 really looks like this:
Code<-c(19357,19342,15777,49568,12345,15777,19357)
Modifier<-c("","","","","","a","a")
Value<-c(0.5,7,9,35,2,3,45)
Df2<-tibble(Code,Modifier,Value)
So when I use the suggested code:
Df %>%
  mutate(id = row_number()) %>%
  separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
  left_join(Df2, by = c("CPTCodes" = "Code")) %>%
  group_by(id, Gender) %>%
  summarize(total = sum(Value, na.rm = TRUE))
It sums ALL of the matching codes it finds in Table 2, and I really just want the rows that don't have anything in the "Modifier" column. Any ideas?
Lastly, the current code returns the summarized total in its own data frame, but it would be nice if everything from the original Table 1 was still there, with just an extra column holding the new sum.
I'm not entirely sure of your expected output. But you should be able to filter and then join the new column to the original df.
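# add a row id first so the totals can be joined back row by row
# (joining back by Gender would duplicate rows, since Gender is not unique)
Df <- Df %>% mutate(id = row_number())
totals <- Df %>%
  separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
  # keep only the Table 2 rows without a modifier before joining
  left_join(filter(Df2, Modifier == ""), by = c("CPTCodes" = "Code")) %>%
  group_by(id) %>%
  summarize(total = sum(Value, na.rm = TRUE))
# original columns plus the new total
Df <- left_join(Df, totals, by = "id")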

Finding elements from multiple columns of one dataframe that are not in multiple columns of another

library(tidyverse)
I have two dataframes (see sample code at bottom) called Df1 and Df2. I want to find phone numbers in Df1 (from all the columns) that are not in any of the phone number columns in Df2.
First, I restructure Df1 so that there is only one Id per row.
Df1<-Df1 %>%
gather(key, value, -Id) %>%
filter(!is.na(value)) %>%
select(-key) %>%
group_by(Id) %>%
filter(!duplicated(value)) %>%
mutate(Phone=paste0("Phone_",1:n())) %>%
spread(Phone, value)
Next, I rename Df2 and then use a join to find only Ids in Df1 that are in Df2.
Df2<-Df2%>%set_names(c("Id","Ph1","Ph2"))
DfJoin<-left_join(Df2,Df1,by="Id")
This is where I'm stuck. I want to find all the numbers in Df1 (Phone_1, Phone_2, and Phone_3) that are not in Df2 (Ph1 and Ph2). Below are some ideas for code. I tried many variations of this idea but could not find a way to achieve what I want. The final product should just be a table with the phone number(s) in any Df1 column that are not in any Df2 column, together with the associated Id. I'm also wondering whether another join or set operation would achieve this more efficiently.
DfJoin<-DfJoin%>%mutate(New=if_else(! DfJoin[2:3] %in% DfJoin[4:6]),1,0)
DfJoin<-DfJoin%>%filter(! DfJoin[2:3] %in% DfJoin[2:4])
Sample Data:
Dataframe 1:
Id<-c(199,148,148,145,177,165,144,121,188,188,188,111)
Ph1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
Ph2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df1<-data.frame(Id,Ph1,Ph2)
Dataframe 2:
Id2<-c(199,148,142,145,177,165,144,121,182,109,188,111)
Phone1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
Phone2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df2<-data.frame(Id2,Phone1,Phone2)
One way to think about this problem:
You have a set of phone numbers in df1 for each ID number.
You have a set of phone numbers in df2 for each ID number.
You want to find, within each ID, the set difference between df1 and df2.
You can do this by mapping the base R function setdiff() onto your joined dataframe. To do this, you need to convert your data frames into list-column format, where all the phone numbers for each ID are present as a list in a "cell" of the dataframe. This is easily done by combining group_by(), summarize() and list().
# create example data
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)
Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id=Id2, phone1, phone2)
# convert the data to list-column format
df1.listcol <- df1 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list1 = list(phone))
df2.listcol <- df2 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list2 = list(phone))
Take a look at these dataframes to make sure you understand how we've reformatted them. Obviously, we could save a few lines of code by making this conversion process into a function, and then calling the function on each of df1 and df2, but I didn't do that here.
# join the two listcol dfs by Id, then map setdiff on the two columns
result <-
df1.listcol %>%
left_join(df2.listcol, by='Id') %>%
mutate(only_list_1 = map2(phone_list1, phone_list2, ~setdiff(.x, .y))) %>%
select(Id, only_list_1) %>%
unnest(cols = only_list_1)
result
The result is
Id   only_list_1
148  6541132112
188  7890986543
188  6785554444
Have you tried anti_join(a, b, by = "x1")?
This gives you all rows in a that have no match in b. Note that joining by "Id" alone compares only the Ids, not the phone numbers.
DfJoin <- anti_join(Df1, Df2, by = "Id")
The tidyr/dplyr cheatsheet is a useful reference for data manipulation in the tidyverse.
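On the question of whether a single join could do this: one possible sketch is to reshape both tables to one (Id, phone) pair per row and anti_join on both columns. The helper to_long() is a hypothetical name, and the data is a small subset of the sample data from the question.

```r
library(dplyr)
library(tidyr)

# Small subset of the sample data from the question
df1 <- data.frame(Id  = c(199, 148, 148, 188, 188),
                  ph1 = c(6532881717, 6572231223, 6541132112, 7890986543, 8764443344),
                  ph2 = c(NA, NA, NA, 6785554444, NA))
df2 <- data.frame(Id     = c(199, 148, 188),
                  phone1 = c(6532881717, 6572231223, 8764443344),
                  phone2 = c(NA, NA, NA))

# Hypothetical helper: one row per (Id, phone) pair, NAs and duplicates dropped
to_long <- function(df) {
  df %>%
    pivot_longer(-Id, names_to = "col", values_to = "phone",
                 values_drop_na = TRUE) %>%
    select(Id, phone) %>%
    distinct()
}

# Pairs present in df1 but absent from df2
only_in_df1 <- anti_join(to_long(df1), to_long(df2), by = c("Id", "phone"))
only_in_df1
```

This compares numbers within each Id, like the setdiff() approach; joining on "phone" alone would instead find numbers that appear nowhere in df2 under any Id.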

Can I create a data.frame in R from an existing data.frame by assigning a list of col.names?

I have a data.frame where I assign each column.name a vector of variables:
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
I want to create a new data.frame but instead of assigning each column individually, I want to assign them all at once. For example, if I wanted to rename them all:
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1)
This obviously doesn't work. Is there a way to make it work?
I understand I can just rename using names(), but the scenario where this actually seems useful is if combining multiple data sets that share the same col.names (and in which I don't want to simply rbind):
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
dat2 <- data.frame(a=6:10,b=6:10,c=6:10)
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1, paste(names(dat1),'2',sep='') = dat2)
library(dplyr)
library(tidyr)
library(magrittr)
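Ok, here's the first part (stored as dat.new, so that the dat2 from the question is not overwritten before the second part uses it):
dat.new <-
  dat1 %>%
  setNames(paste0(names(.), "1"))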
Here's the second part. The reshaping is a bit complex but more flexible, especially if you have row id's already with different amounts of rows:
list(dat1, dat2) %>%
bind_rows(.id = "number") %>%
group_by(number) %>%
mutate(id = 1:n()) %>%
gather(variable, value, -number, -id) %>%
unite(new_variable, variable, number) %>%
spread(new_variable, value)
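If both data sets have the same number of rows, a shorter base R sketch may be enough: rename each data frame with setNames() and bind the columns side by side. This is an alternative to the reshaping above, not a replacement for it, since cbind() requires equal row counts.

```r
# Example data from the question
dat1 <- data.frame(a = 1:5, b = 1:5, c = 1:5)
dat2 <- data.frame(a = 6:10, b = 6:10, c = 6:10)

# Suffix each set of column names, then combine column-wise
dat.new <- cbind(
  setNames(dat1, paste0(names(dat1), "1")),
  setNames(dat2, paste0(names(dat2), "2"))
)

names(dat.new)
```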
