I have one fixed data frame that I would like to join to several other data frames. Here is my example:
df1 <- data.frame (first_column = c("key", "key"),
second_column = c("a", "a")
)
df2 <- data.frame (first_column = c("key", "key"),
second_column = c("b", "b")
)
df3 <- data.frame (first_column = c("key", "key"),
second_column = c("c", "c")
)
join <- data.frame (first_column = c("key", "key"),
join_column = c("join", "join")
)
# df1, df2 and df3 are the data frames that need to be joined with join
I tried to write a for loop to do the join:
for (i in 1:length(df.list)) {
df <- df.list[[i]]
assign(paste(names(df.list[[i]])),"_joined"),left_join(df, join, by = c("first_column"= "first_column"))
}
However, I have run into two problems:
I cannot build each variable name from the name of the i-th data frame in the loop.
I cannot create three separate data frames with this for loop.
This is the result I want:
> df1_joined
first_column second_column join_column
1 key a join
2 key a join
> df2_joined
first_column second_column join_column
1 key b join
2 key b join
> df3_joined
first_column second_column join_column
1 key c join
2 key c join
Many Thanks!
You can put the dataframes in a list and join them using lapply.
list_df <- list(df1, df2, df3)
result <- lapply(list_df, function(x)
merge(x, join, by = 'first_column', all.x = TRUE))
To get separate joined data frames, name the list elements and use list2env.
names(result) <- sprintf('df%d_joined', seq_along(result))
list2env(result, .GlobalEnv)
Your paste(names(df.list[[i]])),"_joined") gets the names of a single data frame, which are its column names.
Switching to paste0("df", i, "_joined") should give you the correct names (plain paste would insert spaces and give "df 1 _joined").
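For reference, here is a minimal sketch of the corrected loop. It assumes df.list is a named list (the question never shows how it was built) and swaps in base merge with all.x = TRUE for left_join so that it runs without extra packages.

```r
# Sample data from the question
df1 <- data.frame(first_column = c("key", "key"), second_column = c("a", "a"))
df2 <- data.frame(first_column = c("key", "key"), second_column = c("b", "b"))
df3 <- data.frame(first_column = c("key", "key"), second_column = c("c", "c"))
join <- data.frame(first_column = c("key", "key"), join_column = c("join", "join"))

# Assumption: df.list is a *named* list, so names(df.list) is "df1", "df2", "df3"
df.list <- list(df1 = df1, df2 = df2, df3 = df3)

for (nm in names(df.list)) {
  # Left join: keep every row of the current data frame
  joined <- merge(df.list[[nm]], join, by = "first_column", all.x = TRUE)
  # Creates df1_joined, df2_joined, df3_joined in the current environment
  assign(paste0(nm, "_joined"), joined)
}
```

Note that "key" appears twice on both sides of the join, so each joined data frame has 4 rows (2 x 2 matches) rather than 2. If you want one row per input row, the key in join has to be unique.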
We can also do this as
library(purrr)
library(dplyr)
out <- mget(ls(pattern = '^df\\d+$')) %>%
map(~ .x %>%
left_join(join, by = 'first_column'))
Related
I have a list of protein names like the one in df1:
df1 <- data.frame( names = c("Gen1", "Gen2", "Gen3"))
I need to replace those names with their IDs using a protein table. The mapping between them is summarized in df2:
df2 <- data.frame(
Protein.name = c("Gen1", "Gen2", "Gen3"),
Protein.product = c("id1", "id2" , "id3"))
In the end I want a list of protein IDs instead of protein names, as in df3:
df3 <- data.frame( ID = c("id1", "id2" , "id3"))
I've tried cbind, but for that to work both data frames need the same number of rows, which is not the case.
You probably want something like left_join() from dplyr, which is part of the tidyverse:
library(tidyverse)
df1 %>%
left_join(df2, by = c("names" = "Protein.name"))
This code takes the protein names in df1 and looks up their IDs in df2, keeping the order of df1. To build df3:
library(tidyverse)
df3 <- df1 %>%
left_join(df2, by = c("names" = "Protein.name")) %>%
select(ID = Protein.product)
(The final select() keeps only the Protein.product column and renames it to ID, which gives the output you wanted.)
Example (with the order of the df1 items changed to check that it works):
library(tidyverse)
df1 <- data.frame(names = c("Gen3", "Gen1", "Gen2"))
df2 <- data.frame(
Protein.name = c("Gen1", "Gen2", "Gen3"),
Protein.product = c("id1", "id2" , "id3")
)
df3 <- df1 %>%
left_join(df2, by = c("names" = "Protein.name")) %>%
select(ID = Protein.product)
df3
Result:
ID
1 id3
2 id1
3 id2
inner_join(df1,df2,by=c("name"="gene")) %>% select(name = name.y)
In base R, we can do
out <- merge(df1, df2, by.x = 'name', by.y = 'gene')
Or with match
data.frame(name = df2$name[match(df1$name, df2$gene)])
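The match idea can also be written with the column names from the question (names, Protein.name, Protein.product). This is only a sketch; "GenX" is a made-up name added to show what happens when a protein is missing from the lookup table.

```r
# Lookup table from the question
df2 <- data.frame(
  Protein.name = c("Gen1", "Gen2", "Gen3"),
  Protein.product = c("id1", "id2", "id3"),
  stringsAsFactors = FALSE
)

# Names to translate; "GenX" is hypothetical and has no entry in df2
df1 <- data.frame(names = c("Gen3", "Gen1", "GenX"), stringsAsFactors = FALSE)

# match() returns, for each name in df1, its row position in df2 (NA if absent)
idx <- match(df1$names, df2$Protein.name)

# Index the ID column with those positions; unmatched names become NA
df3 <- data.frame(ID = df2$Protein.product[idx], stringsAsFactors = FALSE)
```

Because match works on positions, the two data frames do not need the same number of rows, and the result keeps the order of df1.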
In Base-R
df1$names <- sapply(df1$names, function(x) df2$Protein.product[df2$Protein.name %in% x])
> df1
names
1 id1
2 id2
3 id3
I would like to "copy-paste" the values of one column from data frame A underneath a column of data frame B.
Below I've visualized what I'm trying to achieve:
One option is to use bind_rows on the selected column after converting it to the same type as the target column:
library(dplyr)
bind_rows(df2, df1[1] %>%
transmute(ColumnC = as.character(ColumnA)))
# ColumnC ColumnD
#1 a b
#2 1 <NA>
#3 2 <NA>
#4 3 <NA>
data
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = 'a', ColumnD = 'b',
stringsAsFactors = FALSE)
You can also use base R for this. With all = TRUE, merge performs a full outer join of the first column of df1 with df2:
df1 <- data.frame(1:3, 4:6)
names(df1) <- paste0("c", 1:2)
df2 <- data.frame("a", "b")
names(df2) <- paste0("c", 3:4)
# renaming column to join on
names(df2)[1] <- "c1"
merge(x = df1[,1,drop=FALSE], y = df2, by.y = c("c1"), all = TRUE)
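If the goal is simply to stack the values under the existing column, a plain rbind in base R may be easier to read than a join. A minimal sketch, assuming the df1 and df2 from the bind_rows answer (the helper name extra is made up here):

```r
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = "a", ColumnD = "b", stringsAsFactors = FALSE)

# Build rows with the same column names and types as df2;
# ColumnD has no source values, so it is filled with NA
extra <- data.frame(
  ColumnC = as.character(df1$ColumnA),
  ColumnD = NA_character_,
  stringsAsFactors = FALSE
)

# rbind requires matching column names, which extra now has
out <- rbind(df2, extra)
```

Unlike bind_rows, rbind will not fill in missing columns for you, which is why ColumnD is created explicitly.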
I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names that I'm not joining by get the .x and .y suffixes:
id val.x val1 val.y val2
1 G -0.05612874 0.2914462 2.087167 0.7876396
2 G -0.05612874 0.2914462 -0.255027 1.4411577
3 J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared columns that I'm not joining by, it is good enough to take them from a single data.frame in the list, whichever one they exist in for the joined id.
I don't know these shared column names in advance, but they are not difficult to find:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select(dplyr::all_of(c("id",repeating.colnames))))) %>%
unique()
I can then drop these columns from each data.frame in the list:
for(r in seq_along(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select(-dplyr::all_of(repeating.colnames))
And then join them as above:
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
And now I'm left with adding repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df, and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
fun <- function(df1, df2, by_col = "id"){
df1_names <- names(df1)
df2_names <- names(df2)
dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols), drop = FALSE], by = by_col)
return(out)
}
df_chase <- df.list %>% reduce(fun,by_col="id")
Created on 2019-01-15 by the reprex package (v0.2.1)
If I compare df_chase to your final solution, I get the same answer:
> all.equal(df_chase, df_orig)
[1] TRUE
If you don't really care which copy of the duplicate columns you keep, you can simply drop them from one of the data frames and use base::merge:
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
# flag df2 columns that also exist in df1 (other than the key), regardless of column position
duplicates = names(df2) %in% setdiff(names(df1), "id")
df2 = df2[,!duplicates]
df12 = base::merge.data.frame(df1, df2, by = "id")
head(df12)
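The same idea extends to a whole list with base R's Reduce. The sketch below is one possible way to write it; the function name merge_keep_left and the small data frames a, b and c3 are made up for illustration.

```r
# Merge y into x by the key, dropping from y any non-key column x already has
merge_keep_left <- function(x, y, by_col = "id") {
  dup_cols <- setdiff(intersect(names(x), names(y)), by_col)
  y <- y[, !(names(y) %in% dup_cols), drop = FALSE]
  merge(x, y, by = by_col)  # inner join; result is sorted by the key
}

# Hypothetical data: "val" is shared by all three, val1/val2/val3 are unique
a  <- data.frame(id = c("A", "B", "C"), val = 1:3, val1 = 4:6,
                 stringsAsFactors = FALSE)
b  <- data.frame(id = c("B", "C", "D"), val = 7:9, val2 = 10:12,
                 stringsAsFactors = FALSE)
c3 <- data.frame(id = c("C", "B"), val = c(0, 0), val3 = c(13, 14),
                 stringsAsFactors = FALSE)

# Reduce folds the list left to right, so "val" always comes from a
out <- Reduce(merge_keep_left, list(a, b, c3))
```

Since the left-most table wins, the order of the list decides which copy of a shared column survives.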
When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?
Ok, using this example from one answer:
set.seed(24)
library(dplyr)
df1 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
sskjs = letters[1:3], val = rnorm(3))
df2 <- tibble(alala = LETTERS[1:3], skks = letters[1:3],
sskjs = letters[1:3], val = rnorm(3))
I want to join them on all variables except val. I'm looking for a more general solution: assume there are 1000 variables and I only remember the name of the one I want to exclude from the join, not its index. How can I perform the join knowing only the names of the variables to exclude? I understand I could find the column index first, but is there a simple way to add exclusions in by =?
We create a named vector to do this
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note that the 'grps' vector is created with paste because the OP's post suggested a pattern. If there is no pattern, but we know the column that should not be used in the join
nogroupColumn <- "someColumn"
# names of the vector are the df1 columns, values are the matching df2 columns
grps <- setNames(setdiff(names(df2), nogroupColumn),
setdiff(names(df1), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- tibble(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- tibble(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584
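When the column names are identical in both data frames, as in the question's own example, the "join by exclusion" reduces to a setdiff on the shared names. A minimal sketch, using base merge so it runs without packages (the fixed val values are made up so the result is predictable):

```r
df1 <- data.frame(alala = LETTERS[1:3], skks = letters[1:3],
                  sskjs = letters[1:3], val = c(1, 2, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(alala = LETTERS[1:3], skks = letters[1:3],
                  sskjs = letters[1:3], val = c(10, 20, 30),
                  stringsAsFactors = FALSE)

# Only the names to leave out of the join need to be known
exclude <- "val"

# Join on every shared column except the excluded ones
by_cols <- setdiff(intersect(names(df1), names(df2)), exclude)

out <- merge(df1, df2, by = by_cols)
```

The same by_cols character vector can be passed to the by argument of dplyr's join functions. The excluded column then shows up twice, with the .x and .y suffixes.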
To exclude one or more fields, you need to identify the indices of the columns you want. Here's one way:
which(!names(df1) %in% "sskjs" ) #<this excludes the column "sskjs"
[1] 1 2 4 #<and shows only the desired index columns
Use unite (from tidyr) to create a join_id in each data frame, and join by it.
df1 <- df1 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
df2 <- df2 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
left_join(df1, df2, by = "join_id" )
I have two data frames and I want to merge them using two columns that are like below:
a <- data.frame(A = c("Ali", "Should Be", "Calif"))
b <- data.frame(B = c("ALI", "CALIF", "SHOULD BE"))
Could you please let me know if it is possible to do this in R?
One way would be to lowercase your character values using tolower from base R and then do a merge (df1 and df2 below stand for your a and b):
library(dplyr) # for mutating
df1 <- df1 %>%
mutate(A = tolower(A))
df2 <- df2 %>%
mutate(B = tolower(B))
df3 <- merge(df1, df2, by.x = "A", by.y = "B")
df3
A
1 ali
2 calif
3 should be
Is this what you needed?
Edit: The dplyr bit is of course not necessary. If everything is to be done in base R, df1$A=tolower(df1$A) and df2$B=tolower(df2$B) - as suggested in the comments - work just as well.
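If you would rather keep the original spelling in both columns, one option is to merge on a separate lowercased key instead of overwriting A and B. A minimal sketch with the a and b from the question (the column name key is an assumption made here):

```r
a <- data.frame(A = c("Ali", "Should Be", "Calif"), stringsAsFactors = FALSE)
b <- data.frame(B = c("ALI", "CALIF", "SHOULD BE"), stringsAsFactors = FALSE)

# Add a lowercased helper column to each side; A and B stay untouched
a$key <- tolower(a$A)
b$key <- tolower(b$B)

# Join on the helper column; the result is sorted by key
out <- merge(a, b, by = "key")
```

The result has one row per matching pair, with key, A and B side by side, so you can see which original values were matched.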