Consider the following two data.tables:
df1=data.table(a=1:3, b=4:6, c=7:9)
df2=data.table(a=c(T,F,T), c=c(F,F,T), d=c(T,F,F))
What is the best way to update columns a and c of df1 with the corresponding values from df2?
df1[,c("a","c"),with=FALSE] and df2[,c("a","c"),with=FALSE] return the corresponding parts of each data.table;
but df1[,c("a","c"),with=FALSE] <- df2[,c("a","c"),with=FALSE] returns an error!
Here's another solution:
library(tidyverse)
df1 <- tibble(a = 1:3, b = 4:6, c = 7:9)
df2 <- tibble(a = c(T,F,T), c = c(F,F,T), d = c(T,F,F))
bind_cols(df1, df2) %>%
transmute(a = a1, b, c = c1)
This creates a table with all six columns and then the transmute call selects and renames the ones you're interested in.
Related
I have a fixed dataframe and would like to join it to multiple dataframes. Please see my example below:
df1 <- data.frame (first_column = c("key", "key"),
second_column = c("a", "a")
)
df2 <- data.frame (first_column = c("key", "key"),
second_column = c("b", "b")
)
df3 <- data.frame (first_column = c("key", "key"),
second_column = c("c", "c")
)
join <- data.frame (first_column = c("key", "key"),
join_column = c("join", "join")
)
# df1, df2 and df3 are the dataframes that need to be joined with join
I tried to create a for loop to join them:
for (i in 1:length(df.list)) {
df <- df.list[[i]]
assign(paste(names(df.list[[i]])),"_joined"),left_join(df, join, by = c("first_column"= "first_column"))
}
However, I have encountered 2 problems:
I cannot create a variable name from the loop index i.
I cannot create 3 different dataframes with this for loop.
Here is the result I want to get:
> df1_joined
first_column second_column join_column
1 key a join
2 key a join
> df2_joined
first_column second_column join_column
1 key b join
2 key b join
> df3_joined
first_column second_column join_column
1 key c join
2 key c join
Many Thanks!
You can put the dataframes in a list and join them using lapply.
list_df <- list(df1, df2, df3)
result <- lapply(list_df, function(x)
merge(x, join, by = 'first_column', all.x = TRUE))
To get separate joined dataframes, assign names to the list and use list2env.
names(result) <- sprintf('df%d_joined', seq_along(result))
list2env(result, .GlobalEnv)
Your paste(names(df.list[[i]])),"_joined") is getting the names of a single dataframe, which are the column names.
Just switching to paste0("df", i, "_joined") should give you the correct name (paste0 avoids the spaces that paste inserts by default).
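Putting it together, a corrected version of the loop might look like this (a sketch; df.list is not defined in the question, so I'm assuming it is a named list holding the three dataframes):
library(dplyr)
df.list <- list(df1 = df1, df2 = df2, df3 = df3)

for (i in seq_along(df.list)) {
  # build the target name, e.g. "df1_joined", and assign the join result to it
  assign(paste0(names(df.list)[i], "_joined"),
         left_join(df.list[[i]], join, by = "first_column"))
}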
We can also do this as
library(purrr)
library(dplyr)
out <- mget(ls(pattern = '^df\\d+$')) %>%
map(~ .x %>%
left_join(join, by = 'first_column'))
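If you also want the individual *_joined objects in the global environment, a small sketch reusing the list2env idea from the answer above:
# out is a named list (df1, df2, df3), so just rename and export it
names(out) <- paste0(names(out), "_joined")
list2env(out, .GlobalEnv)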
I would like to "copy paste" one column's value from df A under DF B's column values.
Below is I've visualized on what I'm trying to achieve
An option is to use bind_rows on the selected columns after making the column types the same:
library(dplyr)
bind_rows(df2, df1[1] %>%
transmute(ColumnC = as.character(ColumnA)))
# ColumnC ColumnD
#1 a b
#2 1 <NA>
#3 2 <NA>
#4 3 <NA>
data
df1 <- data.frame(ColumnA = 1:3, ColumnB = 4:6)
df2 <- data.frame(ColumnC = 'a', ColumnD = 'b',
stringsAsFactors = FALSE)
You may also use base R for this. You actually want to right join df2 with df1:
df1 <- data.frame(1:3, 4:6)
names(df1) <- paste0("c", 1:2)
df2 <- data.frame("a", "b")
names(df2) <- paste0("c", 3:4)
# renaming column to join on
names(df2)[1] <- "c1"
merge(x = df1[,1,drop=FALSE], y = df2, by.y = c("c1"), all = TRUE)
I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names that I'm not joining by get mutated with the .x and .y suffixes:
id val.x val1 val.y val2
1 G -0.05612874 0.2914462 2.087167 0.7876396
2 G -0.05612874 0.2914462 -0.255027 1.4411577
3 J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared column names that I'm not joining by, it's good enough to select them from a single data.frame in the list - whichever one they exist in with respect to the joined id.
I don't know these shared column names in advance, but that's not difficult to find out:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select_(.dots = c("id",repeating.colnames)))) %>%
unique()
I can then exclude these columns from the data.frames in the list and join them as above:
for(r in 1:length(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select_(.dots = paste0("-",repeating.colnames))
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
And now I'm left with adding the repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df, and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
fun <- function(df1, df2, by_col = "id"){
df1_names <- names(df1)
df2_names <- names(df2)
dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols)], by = by_col)
return(out)
}
df_chase <- df.list %>% reduce(fun,by_col="id")
If I compare df_chase to your final solution, I get the same answer:
> all.equal(df_chase, df_orig)
[1] TRUE
You can just get rid of the duplicate columns from one of the data frames, since you say you don't really care about them, and simply use base::merge:
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
duplicates = names(df1) == names(df2) & names(df1) !="id"
df2 = df2[,!duplicates]
df12 = base::merge.data.frame(df1, df2, by = "id")
head(df12)
When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?
Ok, using this example from one answer:
set.seed(24)
df1 <- data_frame(alala = LETTERS[1:3], skks = letters[1:3], sskjs = letters[1:3], val = rnorm(3))
df2 <- data_frame(alala = LETTERS[1:3], skks = letters[1:3], sskjs = letters[1:3], val = rnorm(3))
I want to join them using all variables excluding val. I'm looking for a more general solution: assume there are 1000 variables and I only remember the name of the one I want to exclude from the join, not its index. How can I perform the join while only knowing the variable names to exclude? I understand I can find the column indices first, but is there a simple way to add exclusions in by =?
We create a named vector to do this
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note the 'grps' vector is created with paste as the OP's post suggested a pattern. If there is no pattern but we know the column that is not to be joined on:
nogroupColumn <- "someColumn"
grps <- setNames(setdiff(names(df2), nogroupColumn),
                 setdiff(names(df1), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- data_frame(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- data_frame(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584
To exclude a certain field (or fields), you need to identify the indices of the columns you want. Here's one way:
which(!names(df1) %in% "sskjs")  # this excludes the column "sskjs"
[1] 1 2 4                        # and shows only the desired column indices
Use unite to create a join_id in each dataframe, and join by it.
library(dplyr)
library(tidyr)

df1 <- df1 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
df2 <- df2 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
left_join(df1, df2, by = "join_id" )
Consider these three dataframes:
df1 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
df2 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
df3 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
I want to run cor.test between column a and column a, b and b, and c and c across all the dfs. I can do it for each pair of dataframes by modifying the code below, but I want to loop over all three dataframes in one go:
for (i in 1:length(df1)){
cor.test(df1[,i],df2[,i])
}
How would I go about doing that?
We could get the combinations of object names with combn, fetch the objects with mget as a list, then apply cor.test on each pair of columns and extract the p.value:
combn(paste0("df", 1:3), 2, FUN = function(x) {
x1 <- mget(x, envir = .GlobalEnv)
Map(function(x,y) cor.test(x,y)$p.value, x1[[1]], x1[[2]])})
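If you want to keep track of which pair of dataframes each set of p-values belongs to, a small sketch of the same idea with simplify = FALSE so the result stays a named list:
res <- combn(paste0("df", 1:3), 2, FUN = function(x) {
  x1 <- mget(x, envir = .GlobalEnv)
  Map(function(x, y) cor.test(x, y)$p.value, x1[[1]], x1[[2]])
}, simplify = FALSE)
# label each element with the pair of dataframe names it compares
names(res) <- combn(paste0("df", 1:3), 2, FUN = paste, collapse = "_vs_")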
Or another option is corr.test from psych
library(psych)
t(sapply(names(df1), function(nm) {
  x1 <- corr.test(data.frame(df1[nm], df2[nm], df3[nm]))$p
  x1[lower.tri(x1)]
}))