Is it possible to use column indices in merge? - r

If I have two dataframes that I wish to merge, is there a way to merge by the column index rather than the name of the column?
For instance if I have these two dfs, and want to merge on x.x1 and y.x2.
dtest <- data.frame(x1 = 1:10, y = 2:11)
dtest2 <- data.frame(x2 = 1:10, y1 = 11:20)
I've tried the following but I can't get it to work
xy <- merge(dtest, dtest2, by.x = x[,1], by.y = y[,1], all.x = TRUE, all.y = TRUE)

Here you go:
xy <- merge(dtest, dtest2, by.x = 1, by.y = 1, all.x = TRUE, all.y = TRUE)
From help(merge): Columns to merge on can be specified by name, number or by a logical vector...
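As a minimal sketch of the three options that help(merge) mentions, the calls below reuse dtest and dtest2 from the question and select the key column by number, by name, and by a logical vector. The result names (by_number, by_name, by_logical) are illustrative only.

```r
dtest  <- data.frame(x1 = 1:10, y = 2:11)
dtest2 <- data.frame(x2 = 1:10, y1 = 11:20)

# By position: the first column of each data frame is the key
by_number <- merge(dtest, dtest2, by.x = 1, by.y = 1, all = TRUE)

# By name: the same key columns, spelled out
by_name <- merge(dtest, dtest2, by.x = "x1", by.y = "x2", all = TRUE)

# By logical vector: one entry per column, TRUE marks the key column
by_logical <- merge(dtest, dtest2,
                    by.x = c(TRUE, FALSE), by.y = c(TRUE, FALSE),
                    all = TRUE)

# All three calls describe the same join
identical(by_number, by_name)
identical(by_number, by_logical)
```

In each case the key column in the result keeps its name from the first data frame (x1). A logical vector has to have one entry per column of the data frame it refers to.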

Related

Create R function

Could you help me convert all of this code into a single function? I want to avoid rewriting the code for every data frame.
data <- merge(x = data_2021, y = corr_df, by = "XCode", all.x = TRUE)
#Drop column
data = subset(data, select = -c(XCode))
# Rename columns
names(data)[names(data) == "Zvar"] <- "XCode"
# Reorder column by name
col_order <- c("XCode", "x2" , "x3")
data <- data[,col_order]
Maybe something like this:
fn <- function(x, y) {
  data <- merge(x = x, y = y, by = "XCode", all.x = TRUE)
  ## Drop column
  data <- subset(data, select = -c(XCode))
  ## Rename columns
  names(data)[names(data) == "Zvar"] <- "XCode"
  ## Reorder column by name
  col_order <- c("XCode", "x2", "x3")
  data <- data[, col_order]
  data
}
data <- fn(data_2021, corr_df)
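If the key name, the replacement column, or the column order differ between data frames, one possible variation is to pass them as arguments and apply the function over a list. This is only a sketch: recode_key and its argument names are hypothetical, and the toy data below is invented for illustration, since the original data_2021 and corr_df are not shown.

```r
# Hypothetical helper; the argument names are illustrative, not from the original post
recode_key <- function(x, lookup, key = "XCode", new_key = "Zvar", col_order = NULL) {
  out <- merge(x = x, y = lookup, by = key, all.x = TRUE)
  out[[key]] <- NULL                               # drop the old key column
  names(out)[names(out) == new_key] <- key         # rename the replacement column
  if (!is.null(col_order)) {
    out <- out[, col_order, drop = FALSE]          # reorder columns by name
  }
  out
}

# Toy data, invented for illustration
data_2021 <- data.frame(XCode = c("a", "b", "c"), x2 = 1:3, x3 = 4:6)
corr_df   <- data.frame(XCode = c("a", "b", "c"), Zvar = c("A", "B", "C"))

# Apply the same steps to every data frame in a named list
years   <- list(y2021 = data_2021, y2022 = data_2021)
results <- lapply(years, recode_key,
                  lookup = corr_df,
                  col_order = c("XCode", "x2", "x3"))
```

Keeping the data frames in a list means a new year only needs a new list entry rather than another copy of the code.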

Multiple merges using base R

Using the "merge" function in (base) R, I have figured out how to do joins on a single column, if that column has the same name in both tables (in this example, I do a left join):
result = merge( x = table_a, y = table_b, by = "col_a", all.x = TRUE)
But is there a way to do this if the column names are not the same?
e.g.
result_1 = merge( x = table_a, y = table_b, by = "table_a$col_a = table_b$col_b", all.X = TRUE)
Could this also be done using multiple conditions?
result_2 = merge( x = table_a, y = table_b, by = c("table_a$col_a = table_b$col_b" & table_a$col_c = table_b$col_d" & table_a$col_e= table_b$col_f" ), all.X = TRUE)
Thanks
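With merge, use the by.x and by.y arguments when the column names differ. For several conditions, pass vectors of equal length to by.x and by.y. Note that the argument is all.x (lowercase); R is case-sensitive.
merge( x = table_a, y = table_b, by.x = "col_a", by.y = "col_b", all.x = TRUE)
The named-vector form of by (without the "table_a$" prefixes), such as by = c("col_a" = "col_b"), is the join syntax used in dplyr, not in merge.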
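For the multiple-condition case in the question, a minimal sketch looks like the following. The tables here are toy data invented for illustration, using two key pairs (col_a with col_b, col_c with col_d).

```r
# Toy tables, invented for illustration
table_a <- data.frame(col_a = c(1, 2, 3),
                      col_c = c("p", "q", "r"),
                      val_a = c(10, 20, 30))
table_b <- data.frame(col_b = c(1, 2, 4),
                      col_d = c("p", "q", "s"),
                      val_b = c("x", "y", "z"))

# Left join on two key pairs: col_a matches col_b AND col_c matches col_d
result_2 <- merge(x = table_a, y = table_b,
                  by.x = c("col_a", "col_c"),
                  by.y = c("col_b", "col_d"),
                  all.x = TRUE)

# A misspelled all.X is not an error: it falls into `...` and is ignored,
# so this call is an inner join rather than a left join
inner_by_accident <- merge(x = table_a, y = table_b,
                           by.x = c("col_a", "col_c"),
                           by.y = c("col_b", "col_d"),
                           all.X = TRUE)
```

The key columns in the result take their names from table_a. Comparing nrow(result_2) with nrow(inner_by_accident) is a quick way to check that unmatched rows of table_a were actually kept.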

How to automatically adjust R script based on # of variables (var1, var2, etc.)

Main dataset:
df <- data.frame(var1 =c(1, 2, 1), var2 = c(2, 3, 3))
My mapping table:
mt <- data.frame(var1 = c(1, 2, 1), var2 = c(2, 3,3), color = c('red', 'blue', 'yellow'))
To merge df to mt, preserving all rows in df:
df <- merge(x = df, y=mt, by=c("var1", "var2"), all.x = TRUE)
QUESTION: How can I change the code dynamically so that if I have 4 vars (i.e. var1, var2, var3, var4), it automatically adjusts to the following?
df <- merge(x = df, y=mt, by=c("var1", "var2", "var3", "var4"), all.x = TRUE)
Similarly, with 5 vars, it should automatically adjust to:
df <- merge(x = df, y=mt, by=c("var1", "var2", "var3", "var4", "var5"), all.x = TRUE)
If the key columns have the same names in both datasets and no other columns are shared, by can be omitted: merge defaults to the column names the two datasets have in common.
merge(df, mt, all.x = TRUE)
But if the datasets share other columns and you want to join only on the "var" columns, you can select them with startsWith
merge(x = df, y=mt, by= names(df)[startsWith(names(df), "var")], all.x = TRUE)
or grep
merge(x = df, y=mt, by= grep("^var\\d+$", names(df), value = TRUE), all.x = TRUE)
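As a small sketch of the pattern-based approach with more key columns, the example below extends the question's data to three var columns plus one unrelated column. The extra columns (var3, other) and the name key_cols are assumptions made for illustration.

```r
# Question's data extended with a third key and an unrelated column (illustrative)
df <- data.frame(var1 = c(1, 2, 1), var2 = c(2, 3, 3), var3 = c(5, 6, 7),
                 other = c("a", "b", "c"))
mt <- data.frame(var1 = c(1, 2, 1), var2 = c(2, 3, 3), var3 = c(5, 6, 7),
                 color = c("red", "blue", "yellow"))

# Collect every column named var<number> that exists in both data frames
key_cols <- intersect(grep("^var[0-9]+$", names(df), value = TRUE), names(mt))

# The same call works however many var columns there are
merged <- merge(x = df, y = mt, by = key_cols, all.x = TRUE)
```

Taking the intersection with names(mt) guards against a var column that exists in df but not in the mapping table, which would otherwise make merge stop with an error.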

Join a long data frame to a tidy data frame

I have two dataframes like below:
df1 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"))
df2 <- data.frame(Category = c("Construction","Construction","Construction",
"Industry","Industry","Industry",
"Size","Size","Size","Size"),
Type = c("Frame","Masonry","Fire Resistive",
"Apartments","Restaurant","Condos",
"[0-3)","[3-6)","[6-9)","9+"),
Score1 = rnorm(10),
Score2 = rnorm(10),
Score3 = rnorm(10))
I want to join df2 to df1 so that Construction, Industry, and Size each have their respective Score.
I can do it manually by making a key equal to Category concatenated with Type and then doing a left-join for each column, but I want a way to automate it so I can add/remove variables easily.
Here's the format I want it to look like: (note: Score numbers don't match.)
df3 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Construction_Score1 = rnorm(5),
Construction_Score2 = rnorm(5),
Construction_Score3 = rnorm(5),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Industry_Score1 = rnorm(5),
Industry_Score2 = rnorm(5),
Industry_Score3 = rnorm(5),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"),
Size_Score1 = rnorm(5),
Size_Score2 = rnorm(5),
Size_Score3 = rnorm(5))
The idea is to merge df1 with df2 once per column, matching each of c("Construction","Industry","Size") against Type, then stack the merged data frames into one long data frame and finally reshape it to wide to get the format you want.
mylist <- lapply(names(df1), function(col){
  merge(x = df1, y = df2,
        by.x = col, by.y = "Type",
        all.x = TRUE)})
mydf <- do.call(rbind, mylist)
df3 <- reshape(mydf, idvar = c("Construction","Industry","Size"),
               timevar = "Category",
               direction = "wide")
One thing to note is that you have Score as the value of your Category column in df2 which I think should be Size instead to match what you have in df3 and also what has been hinted in df1.
Update: Answering OP's follow-up question;
What if there are other columns that are in df1, but not df2?
Let's make df11 which has another column and apply the same approach on that:
df11 <- cbind(df1, a=1:5)
mydf <- do.call(rbind,
                lapply(names(df11[1:3]), function(col){
                  merge(x = df11, y = df2,
                        by.x = col, by.y = "Type",
                        all.x = TRUE)}))
df33 <- reshape(mydf, idvar = names(df11),
                timevar = "Category",
                direction = "wide")
So, you just need to specify in lapply which columns of df11 you are using to merge with df2 and in the reshape you include all the columns from df11 whether they match with df2 or not.
Another possibility using the tidyverse packages (thanks to @akrun for reminding me about map_df):
library(tidyverse)  # purrr (map_df, set_names), dplyr (left_join), tidyr (gather, unite, spread)
map_df(names(df11)[1:3], ~ left_join(df11, df2, by = set_names("Type", .x))) %>%
  gather(mvar, mval, Score1:Score3) %>%
  unite(var, mvar, Category) %>%
  spread(var, mval)

Map() and dplyr joins

I have two lists, both of which contain similar datasets corresponding to different years. I wish to merge the datasets in both lists, element by element. When I use mapply, alongside dplyr::full_join, in the instance where the variable names don't match and I need to use the by argument, R is unable to perform the join.
library(dplyr)
set.seed(100)
first_list <- list(data.frame(x = 1:3, y = rnorm(3)),
data.frame(x = 4:6, y = rnorm(3)))
second_list <- list(data.frame(z = 1:3, w = rnorm(3)),
data.frame(z = 4:6, w = rnorm(3)))
Map(full_join, by = c("x" = "z"), first_list, second_list)
#Error: 'z' column not found in rhs, cannot join
However,
Map(function(x, y) full_join(x, y, by = c("x" = "z")), first_list, second_list)
works successfully. I am curious about this behaviour and wonder if anyone could provide some explanation.
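Map is a wrapper around mapply, which vectorizes over every argument passed through ..., so by is iterated over element by element instead of being passed whole. Put arguments that should stay fixed in MoreArgs and keep only the lists to iterate over in ... (see ?mapply):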
test1 <- Map(full_join, first_list, second_list, MoreArgs=list(by = c("x" = "z")))
test2 <- Map(function(x, y) full_join(x, y, by = c("x" = "z")), first_list, second_list)
all.equal(test1, test2)
# [1] TRUE
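The same pattern can be tried without dplyr. The sketch below swaps in base R's merge for full_join (all = TRUE gives a full outer join) and uses fixed numbers instead of rnorm so the example is reproducible. The names via_moreargs and via_wrapper are illustrative.

```r
first_list  <- list(data.frame(x = 1:3, y = c(0.1, 0.2, 0.3)),
                    data.frame(x = 4:6, y = c(0.4, 0.5, 0.6)))
second_list <- list(data.frame(z = 1:3, w = c(1.1, 1.2, 1.3)),
                    data.frame(z = 4:6, w = c(1.4, 1.5, 1.6)))

# Fixed arguments go in MoreArgs; only the two lists are iterated over
via_moreargs <- Map(merge, first_list, second_list,
                    MoreArgs = list(by.x = "x", by.y = "z", all = TRUE))

# Equivalent form: an anonymous function closes over the fixed arguments
via_wrapper <- Map(function(a, b) merge(a, b, by.x = "x", by.y = "z", all = TRUE),
                   first_list, second_list)

identical(via_moreargs, via_wrapper)
```

Both forms keep the join specification out of the vectorized arguments, which is the point of the answer above. Which one reads better is largely a matter of taste.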

Resources