I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 25)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
As you can see, df1 is missing some info that is present in df2, while df2 covers only a subset of the ids, but the two share several columns. Is there a way to fill the missing values in df1 based on matching IDs from df2?
I found a similar question that recommended using merge, but when I tried it, it dropped all the IDs that were not present in both dataframes. It also required manually dropping duplicate columns, and since my real dataset will have a large number of these, doing so would be cumbersome. Even setting that aside, both of the recommended solutions:
df1 <- setNames(merge(df1, df2)[-2], names(df1))
and
df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]
did not work for me, throwing various errors.
An alternative I have considered is using rbind and then dropping incomplete cases. The problem is that my real dataset has non-shared columns as well as shared ones, so I would have to create intermediate objects of just the shared columns, rbind them, drop incomplete cases, and then join back with the original object to regain the dropped columns. This seems unnecessarily roundabout.
In this example it would look like
df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")
and, in case there are any fully duplicated rows between the two dataframes, I would need to add
df2 = unique(df2)
This solution will work, but it is cumbersome, and it gets even worse as the number of columns being matched on increases. Is there a better solution?
-edit- fixed a problem in my example data pointed out by Sathish
-edit2- Expanded example data
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
These dataframes represent the case where many columns have incomplete data and a second dataframe holds all of the missing values. Ideally, we would not need to list each column separately with wq2 := i.wq2 etc.
If you want to join only by the id column, you can remove phase from the on clause in the code below.
Also, your data in the question has discrepancies, which are corrected in the data posted in this answer.
library('data.table')
setDT(df1) # make data table by reference
setDT(df2) # make data table by reference
df1[ i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 on id and phase, and replace df1's color values with df2's i.color
tail(df1)
# id color phase rand.col
# 1: 95 green gas 1.5868335
# 2: 96 green gas 0.5584864
# 3: 97 green gas -1.2765922
# 4: 98 green gas -0.5732654
# 5: 99 green gas -1.2246126
# 6: 100 green gas -0.4734006
one-liner:
setDT(df1)[df2, color := i.color, on = .(id, phase)]
Data:
set.seed(1L)
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
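For comparison, the same NA-filling update can be done in base R with match(), with no packages. This is a sketch using the corrected data above, not part of the original answer:

```r
set.seed(1L)
df1 <- data.frame(id = 1:100,
                  color = c(rep("blue", 25), rep("red", 25), rep(NA, 50)),
                  phase = c(rep("liquid", 50), rep("gas", 50)),
                  rand.col = rnorm(100))
df2 <- data.frame(id = 51:100, color = "green", phase = "gas")

idx  <- match(df1$id, df2$id)           # position of each df1 id in df2 (NA if absent)
fill <- is.na(df1$color) & !is.na(idx)  # only fill rows that are NA and have a match
df1$color[fill] <- df2$color[idx[fill]]
```

On R < 4.0, add stringsAsFactors = FALSE to the data.frame() calls so color is a character column rather than a factor.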
EDIT: based on new data posted in the question
Data:
set.seed(1L)
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
Code:
library('data.table')
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983
setDT(df2)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
df1[df2, `:=` ( wq2 = i.wq2,
wq3 = i.wq3,
wq4 = i.wq4,
wq5 = i.wq5), on = .(id)]
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
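If there are many wq columns, the wq2 = i.wq2, ... list can be generated instead of typed out, using mget() on the i.-prefixed names. A sketch, not from the original answer, using the same data:

```r
library(data.table)
set.seed(1L)
df1 <- data.frame(id = 1:100, wq2 = rnorm(50), wq3 = rnorm(50),
                  wq4 = rnorm(50), wq5 = rnorm(50))
set.seed(2423L)
df2 <- data.frame(id = 51:100, wq2 = rnorm(50), wq3 = rnorm(50),
                  wq4 = rnorm(50), wq5 = rnorm(50))
setDT(df1)
setDT(df2)

cols <- setdiff(names(df2), "id")  # every shared column except the join key
df1[df2, (cols) := mget(paste0("i.", cols)), on = .(id)]
```

Adding or removing wq columns then requires no change to the join line.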
Related
I have two data frames. They have different lengths but share the same IDs in their ID columns. I would like to create a column in df called Classification based on the Classification in df2, matched up with the appropriate ID listed in df2. Is there a good way to do this?
#Example data set
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
ID2 <- rep(seq(1,5), 1)
Classification2 <- c("A", "B", "C", "D", "E")
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df2 <- data.frame(ID2, Classification = Classification2)
A dplyr solution using left_join().
left_join(df, df2, c("ID" = "ID2"))
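A self-contained sketch of the left_join() approach, building df2 so that its classification column is named Classification:

```r
library(dplyr)

df  <- data.frame(ID = rep(seq(1, 5), 100))  # 500 rows of IDs 1..5
df2 <- data.frame(ID2 = seq(1, 5),
                  Classification = c("A", "B", "C", "D", "E"))

df <- left_join(df, df2, by = c("ID" = "ID2"))
```

Every row of df keeps its original ID and gains the matching Classification.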
Are you looking for a merge between df and df2? Assuming Classification is a column in df2:
df2 <- merge(df2, df, by.x = "ID2", by.y = "ID", all.x = TRUE)
I am trying to aggregate multiple columns given certain conditions using R (data.table?)...
I have one data frame df1 with columns 12:262 that contains species abundance (each column) for each sample (rows)
sample species1 species2
sample1 1 21
sample2 47 36
sample3 8 32
In another data frame df2, I have the phylum, genus, etc. for each species (rows).
species phylum genus
species1 X A
species2 Y B
I would like to aggregate all columns from df1 whose species belong to the same phylum (defined in df2)...
Does that make sense?
thank you!
The first thing to do is to reshape df1. If you convert the data from a 'wide' format to a 'long' format, you will have multiple rows for each sample. You can then merge this with your second data set by the species variable. You haven't given enough detail on exactly how you want to aggregate the data, so I have provided two simple examples; you should be able to adjust that aggregation code to include whatever you need.
library(tidyr)
library(dplyr)
df1 <- data.frame(
sample = c("sample1", "sample2", "sample3"),
species1 = c(1, 47, 8),
species2 = c(21, 36, 32))
df2 <- data.frame(
species = c("species1", "species2"),
phylum = c("X", "Y"),
genus = c("A", "B")
)
df1_long <- tidyr::pivot_longer(df1, starts_with("species"),
names_to = "species", values_to = "abundance")
df3 <- dplyr::left_join(df1_long, df2, by = "species")
df3 %>%
group_by(phylum) %>%
summarize(total_abundance = sum(abundance),
avg_abundance = mean(abundance))
A data.table version
library(data.table)
dt1 <- data.table(
sample = c("sample1", "sample2", "sample3"),
species1 = c(1, 47, 8),
species2 = c(21, 36, 32))
dt2 <- data.table(
species = c("species1", "species2"),
phylum = c("X", "Y"),
genus = c("A", "B")
)
# long format
dt1_long <-
melt(
dt1,
id.vars = 'sample',
variable.name = "species",
value.name = "abundance"
)
# join, then aggregate by phylum
dt1_long[dt2, on = "species"][
, .(total_abundance = sum(abundance),
avg_abundance = mean(abundance)),
by = phylum]
I have two dataframes like below:
df1 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"))
df2 <- data.frame(Category = c("Construction","Construction","Construction",
"Industry","Industry","Industry",
"Size","Size","Size","Size"),
Type = c("Frame","Masonry","Fire Resistive",
"Apartments","Restaurant","Condos",
"[0-3)","[3-6)","[6-9)","9+"),
Score1 = rnorm(10),
Score2 = rnorm(10),
Score3 = rnorm(10))
I want to join df2 to df1 so that Construction, Industry, and Size each have their respective Score.
I can do it manually by making a key equal to Category concatenated with Type and then doing a left-join for each column, but I want a way to automate it so I can add/remove variables easily.
Here's the format I want it to look like: (note: Score numbers don't match.)
df3 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Construction_Score1 = rnorm(5),
Construction_Score2 = rnorm(5),
Construction_Score3 = rnorm(5),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Industry_Score1 = rnorm(5),
Industry_Score2 = rnorm(5),
Industry_Score3 = rnorm(5),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"),
Size_Score1 = rnorm(5),
Size_Score2 = rnorm(5),
Size_Score3 = rnorm(5))
The idea here is to join df1 and df2 on each of c("Construction","Industry","Size") against Type, stack the merged dataframes into one long dataframe, and then convert it to wide to get the format you want.
mylist <- lapply(names(df1), function(col){
merge(x = df1, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)})
mydf <- do.call(rbind, mylist)
df3 <- reshape(mydf, idvar = c("Construction","Industry","Size"),
timevar = "Category",
direction = "wide")
One thing to note: the question originally had Score as a value of the Category column in df2, which I think should be Size instead, to match what you have in df3 and what is hinted at in df1.
Update: answering OP's follow-up question:
What if there are other columns that are in df1, but not df2?
Let's make df11 which has another column and apply the same approach on that:
df11 <- cbind(df1, a=1:5)
mydf <- do.call(rbind,
lapply(names(df11[1:3]), function(col){
merge(x = df11, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)}))
df33 <- reshape(mydf, idvar = names(df11),
timevar = "Category",
direction = "wide")
So, you just need to specify in lapply which columns of df11 you are using to merge with df2 and in the reshape you include all the columns from df11 whether they match with df2 or not.
Another possibility using the tidyverse (thanks to @akrun for reminding me about map_df):
map_df(names(df11)[1:3], ~ left_join(df11, df2, by = set_names("Type", .x))) %>%
gather(mvar, mval, Score1:Score3) %>%
unite(var, mvar, Category) %>%
spread(var, mval)
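Since gather() and spread() are superseded in tidyr 1.0+, the same pipeline can be written with pivot_longer() and pivot_wider(). A self-contained sketch; the reshaping is equivalent, but this variant is not from the original answer:

```r
library(dplyr)
library(tidyr)
library(purrr)

df1 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
                  Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
                  Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"))
df2 <- data.frame(Category = rep(c("Construction","Industry","Size"), c(3, 3, 4)),
                  Type = c("Frame","Masonry","Fire Resistive",
                           "Apartments","Restaurant","Condos",
                           "[0-3)","[3-6)","[6-9)","9+"),
                  Score1 = rnorm(10), Score2 = rnorm(10), Score3 = rnorm(10))
df11 <- cbind(df1, a = 1:5)  # an extra column present only in df11

res <- map_df(names(df11)[1:3],
              ~ left_join(df11, df2, by = set_names("Type", .x))) %>%
  pivot_longer(Score1:Score3, names_to = "mvar", values_to = "mval") %>%
  unite(var, mvar, Category) %>%
  pivot_wider(names_from = var, values_from = mval)
```

The result has one row per df11 row and one column per Score/Category pair, e.g. Score1_Construction.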
I would like to convert a data.table from wide to long format.
A "normal" melt works fine, but I would like to melt my data into two different columns.
I found information about this here:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html (Point 3)
I tried this with my code, but somehow it is not working, and so far I have not been able to figure out the problem.
It would be great if you could explain the mistake in the following example.
Thanks
#Create fake data
a = c("l1","l2","l2","l3")
b = c(10, 10, 20, 10)
c = c(30, 30, 30, 30)
d = c(40.2, 32.1, 24.1, 33.0)
e = c(1,2,3,4)
f = c(1.1, 1.2, 1.3, 1.5)
df <- data.frame(a,b,c,d,e,f)
colnames(df) <- c("fac_a", "fac_b", "fac_c", "m1", "m2.1", "m2.2")
#install.packages("data.table")
require(data.table)
TB <- setDT(df)
#Standard melt - works
TB.m1 = melt(TB, id.vars = c("fac_a", "fac_b", "fac_c"), measure.vars = c(4:ncol(TB)))
#Melt into two columns
colA = paste("m1", 4, sep = "")
colB = paste("m2", 5:ncol(TB), sep = "")
DT = melt(TB, measure = list(colA, colB), value.name = c("a", "b"))
#Not working, error: Error in melt.data.table(TB, measure = list(colA, colB), value.name = c("a", : One or more values in 'measure.vars' is invalid.
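A sketch of the diagnosis (not an answer from the thread): the paste() calls build names that do not exist in TB, which is exactly what the error message says.

```r
# Reconstruct the question's table
TB <- data.frame(fac_a = c("l1", "l2", "l2", "l3"),
                 fac_b = c(10, 10, 20, 10),
                 fac_c = c(30, 30, 30, 30),
                 m1   = c(40.2, 32.1, 24.1, 33.0),
                 m2.1 = c(1, 2, 3, 4),
                 m2.2 = c(1.1, 1.2, 1.3, 1.5))

paste("m1", 4, sep = "")           # "m14"
paste("m2", 5:ncol(TB), sep = "")  # "m25" "m26"
# Neither "m14" nor "m25"/"m26" is a column of TB, so melt() rejects
# them as invalid measure.vars. The intended groups are the real names:
colA <- "m1"
colB <- c("m2.1", "m2.2")
```

Whether melt() then accepts groups of unequal length (one m1 column vs two m2 columns) depends on the data.table version; if yours refuses, keep m1 among the id.vars and melt only the m2 pair.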
I have two different datasets,
df1 <- data.frame(
x = c(1.25:10.25),
y = c(1.25:10.25),
val = sample(50:150, 100, replace = FALSE)
)
df2 <- data.frame(
x = c(1:10),
y = c(1:10),
val_2 = sample(50:150, 100, replace = FALSE)
)
library(ggplot2)
ggplot(df1, aes(x=x, y=y)) +
geom_tile(aes(fill=val)) + coord_equal() +
scale_fill_gradient(low = "yellow", high="red") +
geom_point(data = df2, aes(x = x, y = y, size = val_2), shape = 21, colour ="purple")
the resulting plot looks like this,
I would like to assign the values from df1 to df2 based on the box in which the df2 bubbles lie. The result I am looking for will be a copy of df2, but with an added column of df1 values. So something like
df2$val_1 <-
and the right-hand side code might have some distance criteria.
Considering the sample data and the reproducible example presented, a solution is:
require(dplyr)
df2$val_1 <- left_join(df2,
df1 %>% mutate(x = round(x,0), y = round(y,0)),
by = c("x" = "x", "y" = "y")) %>%
pull(val)
If you instead want a more generalizable approach based on distance, I would suggest the following.
First, it is important to assign a primary key to both data.frames, df1 and df2:
df1 <- data.frame(
ID = seq.int(1:100),
x = c(1.25:10.25),
y = c(1.25:10.25),
val = sample(50:150, 100, replace = FALSE)
)
df2 <- data.frame(
ID = seq.int(1:100),
x = c(1:10),
y = c(1:10),
val_2 = sample(50:150, 100, replace = FALSE)
)
We need the pdist package because it allows computing a distance matrix; in this solution we use the Euclidean distance on the variables x and y.
require(pdist)
dists <- pdist(df2[c("x", "y")],
df1[c("x", "y")])
Let's convert the output of the pdist() function to a matrix:
dists <- as.matrix(dists)
Now, from the resulting matrix, we build a data.frame that, for each element of df2, gives the ID of the nearest element of df1:
assign_value <- data.frame(ID_df2 = df2$ID,
ID_df1 = apply(dists, 1, which.min))
We need to integrate the resulting 2-column data.frame with the val feature of df1:
assign_value <- left_join(assign_value,
df1[c("ID", "val")],
by = c("ID_df1" = "ID"))
Finally, we have a data.frame with the following structure: each row refers to a unique element of df2 and is linked to the ID of the nearest element in df1, along with its val:
ID_df2 ID_df1 val
1 1 1 70
2 2 2 132
To obtain the final data.frame we just have to perform a simple left_join using the desired features.
alternative_solution <- dplyr::left_join(df2,
assign_value[c("ID_df2", "val")],
by = c("ID" = "ID_df2"))
> identical(df2$val_2, alternative_solution$val)
[1] TRUE