Join a long data frame to a tidy data frame - r

I have two dataframes like below:
df1 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"))
df2 <- data.frame(Category = c("Construction","Construction","Construction",
"Industry","Industry","Industry",
"Size","Size","Size","Size"),
Type = c("Frame","Masonry","Fire Resistive",
"Apartments","Restaurant","Condos",
"[0-3)","[3-6)","[6-9)","9+"),
Score1 = rnorm(10),
Score2 = rnorm(10),
Score3 = rnorm(10))
I want to join df2 to df1 so that Construction, Industry, and Size each have their respective Score.
I can do it manually by making a key equal to Category concatenated with Type and then doing a left-join for each column, but I want a way to automate it so I can add/remove variables easily.
Here's the format I want it to look like: (note: Score numbers don't match.)
df3 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Construction_Score1 = rnorm(5),
Construction_Score2 = rnorm(5),
Construction_Score3 = rnorm(5),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Industry_Score1 = rnorm(5),
Industry_Score2 = rnorm(5),
Industry_Score3 = rnorm(5),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"),
Size_Score1 = rnorm(5),
Size_Score2 = rnorm(5),
Size_Score3 = rnorm(5))

The idea is to merge df1 with df2 once per column of df1 (Construction, Industry, Size), matching that column against Type, then stack the merged results into one long data frame and reshape it to wide to get the format you want.
mylist <- lapply(names(df1), function(col){
merge(x = df1, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)})
mydf <- do.call(rbind, mylist)
df3 <- reshape(mydf, idvar = c("Construction","Industry","Size"),
timevar = "Category",
direction = "wide")
One thing to note is that you have Score as the value of your Category column in df2 which I think should be Size instead to match what you have in df3 and also what has been hinted in df1.
Update: Answering OP's follow-up question;
What if there are other columns that are in df1, but not df2?
Let's make df11 which has another column and apply the same approach on that:
df11 <- cbind(df1, a=1:5)
mydf <- do.call(rbind,
lapply(names(df11[1:3]), function(col){
merge(x = df11, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)}))
df33 <- reshape(mydf, idvar = names(df11),
timevar = "Category",
direction = "wide")
So, you just need to specify in lapply which columns of df11 you are using to merge with df2 and in the reshape you include all the columns from df11 whether they match with df2 or not.
Another possibility using the tidyverse (thanks to @akrun for reminding me about map_df):
library(tidyverse)
map_df(names(df11)[1:3], ~ left_join(df11, df2, by = set_names("Type", .x))) %>%
gather(mvar, mval, Score1:Score3) %>%
unite(var, mvar, Category) %>%
spread(var, mval)
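A minimal sketch of another way to automate this in base R: for each column, take the rows of df2 whose Category equals the column name and attach their scores with match(). The helper name join_scores is hypothetical (it is not from any package), and the data is a copy of the question's example with a fixed seed added. Unlike the merge() approach above, it filters on Category, so it does not rely on Type values being unique across categories.

```r
set.seed(1)
df1 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
                  Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
                  Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"))
df2 <- data.frame(Category = c(rep("Construction", 3), rep("Industry", 3), rep("Size", 4)),
                  Type = c("Frame","Masonry","Fire Resistive",
                           "Apartments","Restaurant","Condos",
                           "[0-3)","[3-6)","[6-9)","9+"),
                  Score1 = rnorm(10), Score2 = rnorm(10), Score3 = rnorm(10))

# Hypothetical helper: append <col>_<Score> columns for every column in `cols`
join_scores <- function(data, scores, cols) {
  score_cols <- setdiff(names(scores), c("Category", "Type"))
  for (col in cols) {
    # Keep only the score rows that belong to this category
    lookup <- scores[scores$Category == col, ]
    # Position of each value of data[[col]] within the lookup table
    idx <- match(as.character(data[[col]]), as.character(lookup$Type))
    matched <- lookup[idx, score_cols, drop = FALSE]
    names(matched) <- paste(col, score_cols, sep = "_")
    rownames(matched) <- NULL
    data <- cbind(data, matched)
  }
  data
}

df3 <- join_scores(df1, df2, cols = c("Construction", "Industry", "Size"))
```

To add or drop a variable, change the cols argument, for example cols = c("Construction", "Size"). Extra columns in df1 that are not listed in cols are carried through untouched, and the original row order is preserved.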

Related

Copying a column into another dataframe based on matching columns

I have two data frames. The data frames are different lengths, but have the same IDs in their ID columns. I would like to create a column in df called Classification based on the Classification in df2. I would like the Classification column in the df to match up with the appropriate ID listed in df2. Is there a good way to do this?
#Example data set
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
ID2 <- rep(seq(1,5), 1)
Classification <- c("A", "B", "C", "D", "E")
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df2 <- data.frame(ID2, Classification)
A dplyr solution using left_join():
library(dplyr)
left_join(df, df2, by = c("ID" = "ID2"))
Are you looking for a merge between df and df2? Assuming Classification is a column in df2:
df2 <- merge(df2, df, by.x = "ID2", by.y = "ID", all.x = TRUE)
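If only the one column is needed, a base R lookup with match() is a possible alternative to a join. This is a sketch on a smaller, made-up version of the example data (the x column and the 15-row size are assumptions for illustration); it keeps the row order of df and adds nothing but the new column.

```r
# Small stand-in for the question's data
df <- data.frame(ID = rep(1:5, times = 3), x = seq_len(15))
df2 <- data.frame(ID2 = 1:5,
                  Classification = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

# For each ID in df, find its row in df2 and copy that row's Classification
df$Classification <- df2$Classification[match(df$ID, df2$ID2)]
```

IDs in df that have no counterpart in df2 would get NA here, which matches the behaviour of a left join.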

Group dataframe row and column wise based on other dataframe?

I have a dataframe that I would like to group in both directions, first row-wise and then column-wise. The first part worked well, but I am stuck on the second. I would appreciate any help or advice for a solution that does both steps at the same time.
This is the dataframe:
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
This is the second dataframe, which holds the "recipe" for grouping:
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
Row-wise grouping:
df1_grouped<-bind_cols(df1[1:2], map_df(df2, ~rowSums(df1[unique(.x)])))
Now I would like to apply the same grouping to the ID2 column and sum the values in the other columns. My idea was to add another column (e.g. "group") that contains the name of the final group for each ID2 value. After that I can use group_by() and summarise() to calculate the sum for each group. However, I can't figure out an automated way to do it:
bind_cols(df1_grouped,
#add group label
data.frame(
group = rep(c("Group_2","Group_1","Group_1","Group_2","Group_3"),2))) %>%
#remove temporary label column and make ID a character column
mutate(ID2=group,
ID=as.character(ID))%>%
select(-group) %>%
#summarise
group_by(ID,ID2)%>%
summarise_if(is.numeric, sum, na.rm = TRUE)
This is the final table I need, but I had to assign the groups manually, which is impossible for big datasets.
Here is one possible solution:
library(tidyverse)
set.seed(1)
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
df2 <- df2 %>% pivot_longer(everything())
df1 %>%
pivot_longer(-c(ID, ID2)) %>%
mutate(gr_r = df2$name[match(ID2, table = df2$value)],
gr_c = df2$name[match(name, table = df2$value)]) %>%
arrange(ID, gr_r, gr_c) %>%
pivot_wider(id_cols = c(ID, gr_r), names_from = gr_c, values_from = value, values_fn = list(value = sum))
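The key step in the code above is turning the "recipe" in df2 into a lookup from member to group name, which replaces the manual rep(c("Group_2", "Group_1", ...)) labelling from the question. A minimal base R sketch of just that step, where group_of and ids are hypothetical names introduced for illustration:

```r
df2 <- data.frame(Group_1 = c("B", "C"),
                  Group_2 = c("D", "A"),
                  Group_3 = "E", stringsAsFactors = FALSE)

# Named vector: names are the members (A..E), values are their group
group_of <- setNames(rep(names(df2), each = nrow(df2)),
                     unlist(df2, use.names = FALSE))
# "E" appears twice because the single value was recycled; keep one entry
group_of <- group_of[!duplicated(names(group_of))]

# Look up the group for each value of an ID2-like column
ids <- rep(c("A", "B", "C", "D", "E"), 2)
groups <- unname(group_of[ids])
```

The same vector can label both the ID2 values and the column names A to E, which is what the two match() calls in the mutate() step do.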

Fill missing values from another dataframe with the same columns

I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
As you can see, df1 is missing some info that is present in df2, while df2 covers only a subset of the ids, but they share some columns. Is there a way to fill the missing values in df1 based on matching ids from df2?
I found a similar question that recommended using merge, but when I tried it, it dropped all the ids that were not present in both dataframes. It also required manually dropping duplicate columns, and in my real dataset there will be a large number of these, which makes that cumbersome. Even ignoring that, both of the recommended solutions:
df1 <- setNames(merge(df1, df2)[-2], names(df1))
and
df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]
did not work for me, throwing various errors.
An alternate solution I have thought of is using rbind then dropping incomplete cases. The problem is that in my real dataset, while there are shared columns, there are also non-shared columns so I would have to create intermediate objects of just the shared columns, rbind, then drop incomplete cases, then join with the original object to regain the dropped columns. This seems unnecessarily roundabout.
In this example it would look like
df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")
and, in case there are any fully duplicated rows between the two dataframes, I would need to add
df2 = unique(df2)
This solution will work, but it is cumbersome and as the number of columns that are being matched on increase, it gets even worse. Is there a better solution?
-edit- fixed a problem in my example data pointed out by Sathish
-edit2- Expanded example data
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
These dataframes represent the case where many columns have incomplete data and a second dataframe has all of the missing data. Ideally, we would not need to list each column separately with wq2 := i.wq2 etc.
If you want to join only by id column, you can remove phase in the on clause of code below.
Also your data in the question has discrepancies, which are corrected in the data posted in this answer.
library('data.table')
setDT(df1) # make data table by reference
setDT(df2) # make data table by reference
df1[ i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 by id and phase, and replace the color values of df1 with those of df2 in matching rows
tail(df1)
# id color phase rand.col
# 1: 95 green gas 1.5868335
# 2: 96 green gas 0.5584864
# 3: 97 green gas -1.2765922
# 4: 98 green gas -0.5732654
# 5: 99 green gas -1.2246126
# 6: 100 green gas -0.4734006
one-liner:
setDT(df1)[df2, color := i.color, on = .(id, phase)]
Data:
set.seed(1L)
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
EDIT: based on new data posted in the question
Data:
set.seed(1L)
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
Code:
library('data.table')
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983
setDT(df2)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
df1[df2, `:=` ( wq2 = i.wq2,
wq3 = i.wq3,
wq4 = i.wq4,
wq5 = i.wq5), on = .(id)]
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
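The update above names every column explicitly and overwrites df1's values for all matching ids. A possible way to avoid listing the columns, sketched here in base R rather than data.table, is to loop over the columns the two data frames share and fill only the NA cells. The small data frames below are made up for illustration, and `shared`, `idx` and `missing` are hypothetical names.

```r
df1 <- data.frame(id = 1:6,
                  wq2 = c(1, 2, 3, NA, NA, NA),
                  wq3 = c(10, 20, 30, 40, NA, NA),
                  rand.col = letters[1:6],
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 4:6,
                  wq2 = c(4.5, 5.5, 6.5),
                  wq3 = c(99, 50, 60))

# Columns present in both data frames, excluding the key
shared <- setdiff(intersect(names(df1), names(df2)), "id")
# Row of df2 that corresponds to each row of df1 (NA when there is none)
idx <- match(df1$id, df2$id)

for (col in shared) {
  missing <- is.na(df1[[col]])
  # Only cells that are NA in df1 are replaced; existing values are kept
  df1[[col]][missing] <- df2[[col]][idx[missing]]
}
```

Here wq3 for id 4 stays 40 even though df2 holds 99, because that cell was not missing. Columns that exist only in df1 (rand.col) are left alone, and ids without a match in df2 simply remain NA.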

Is it possible to use column indices in merge?

If I have two dataframes that I wish to merge, is there a way to merge by the column index rather than the name of the column?
For instance if I have these two dfs, and want to merge on x.x1 and y.x2.
dtest <- data.frame(x1 = 1:10, y = 2:11)
dtest2 <- data.frame(x2 = 1:10, y1 = 11:20)
I've tried the following but I can't get it to work
xy <- merge(dtest, dtest2, by.x = x[,1], by.y = y[,1], all.x = TRUE, all.y = TRUE)
Here you go:
xy <- merge(dtest, dtest2, by.x = 1, by.y = 1, all.x = TRUE, all.y = TRUE)
From help(merge): Columns to merge on can be specified by name, number or by a logical vector...
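A quick check, using the data from the question, that merging by position gives the same result as merging by name (by_index and by_name are names introduced here for the comparison; all = TRUE is shorthand for all.x = TRUE, all.y = TRUE):

```r
dtest <- data.frame(x1 = 1:10, y = 2:11)
dtest2 <- data.frame(x2 = 1:10, y1 = 11:20)

# Merge on the first column of each data frame, given by position
by_index <- merge(dtest, dtest2, by.x = 1, by.y = 1, all = TRUE)
# The same merge with the columns given by name
by_name <- merge(dtest, dtest2, by.x = "x1", by.y = "x2", all = TRUE)

identical(by_index, by_name)
```

The key column in the result takes its name from the first data frame (x1), as it does when merging by name.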

Want to loop through the columns of dataframes in a list

I would like to loop through a list of dataframes and change the column names (I want each of the columns to have the same name)
Does anyone have a solution using the following data?
df <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df2 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df3 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
x <- list(df, df2, df3)
Either using a for loop or apply? Would actually love to see both if possible
Thanks,
Ben
Both hrbrmstr and David Arenburg's answers are perfect.
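The referenced answers are not reproduced here, so the following is only a sketch of the two approaches the question asks for, not a copy of them. The target names in new_names are hypothetical.

```r
df <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df2 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df3 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
x <- list(df, df2, df3)

new_names <- c("a", "b", "c")  # hypothetical target column names

# for loop: rename the columns of each list element in place
renamed_loop <- x
for (i in seq_along(renamed_loop)) {
  names(renamed_loop[[i]]) <- new_names
}

# lapply: setNames() returns a renamed copy of each data frame
renamed_apply <- lapply(x, setNames, new_names)
```

Both versions leave the original list x unchanged; assign the result back to x if the renamed list should replace it.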
