Copying a column into another dataframe based on matching columns - r

I have two data frames. The data frames are different lengths, but have the same IDs in their ID columns. I would like to create a column in df called Classification based on the Classification in df2. I would like the Classification column in the df to match up with the appropriate ID listed in df2. Is there a good way to do this?
#Example data set
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
ID2 <- rep(seq(1,5), 1)
Classification <- c("A", "B", "C", "D", "E")
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df2 <- data.frame(ID2, Classification)

A dplyr solution using left_join().
library(dplyr)
df <- left_join(df, df2, by = c("ID" = "ID2"))

Are you looking for a merge between df and df2? Assuming Classification is a column in df2, keep every row of df and pull in the matching Classification:
df <- merge(df, df2, by.x = "ID", by.y = "ID2", all.x = TRUE)
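For comparison with the join-based answers above, here is a minimal base R sketch that uses match() as a lookup. It assumes each ID appears only once in df2; the small data frames are made up for illustration and are not the asker's data.

```r
# Illustrative data: df has repeated IDs, df2 has one row per ID
df  <- data.frame(ID = rep(1:5, 3), x = seq_len(15))
df2 <- data.frame(ID2 = 1:5,
                  Classification = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

# match() returns, for each df$ID, the row position of that ID in df2$ID2;
# indexing df2$Classification with those positions copies the labels over
df$Classification <- df2$Classification[match(df$ID, df2$ID2)]

head(df)
```

Unlike merge(), this keeps the rows of df in their original order and adds no extra key column. IDs missing from df2 would get NA.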

Related

Partial match string between columns for multiple dataframes

I have a list of dataframes (df1, df2, df3) whose columns I would like to match against another dataframe (df), substituting values only where there is a match. The match should be based on a string passed to the function and treated as a partial match. In other words, here it should apply only to fields containing the string "TEXT" and should work on cases like TEXT123 and TEXTabc. I did not get very far myself...
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
list<-c(df1, df2, df3)
example for df1
partial_match <- function(column_A$df1, column_B, TEXT, df) {
df1_new <-df1
df1_new[, column_B] <- ifelse(grepl("TEXT.*", df1[, column_A]),
df[, column_B] - nchar(TEXT),
df[, column_B])
df1_new
}
Outcome for df1:
name column_A column_B
TEXT333 1 11
b 2 b
c 3 c
Here's one approach using a for loop. You were close! Note that I named your list of dataframes dfs to avoid masking the list() function.
Do you think you might encounter a situation where you might match multiple times in the same dataframe? If so, what I show below won't work without a couple more lines.
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
dfs <- list(df1, df2, df3)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
# loop over all dataframes in your list
for (i in seq_along(dfs)) {
  # get the name that contains "TEXT" (assumes exactly one match per dataframe)
  val <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  # use name to update value from reference df
  dfs[[i]][dfs[[i]]$name == val, "column_A"] <- df[df$name == val, "column_B"]
}
Updated answer that can account for multiple matches in the same df:
for (i in seq_along(dfs)) {
  # all names containing "TEXT" in this dataframe (may be zero, one or many)
  vals <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  for (val in vals) {
    dfs[[i]][dfs[[i]]$name == val, "column_A"] <- df[df$name == val, "column_B"]
  }
}
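The nested loops can also be written as one helper applied with lapply(). This is a sketch under the same assumptions as the answer: the matched values overwrite column_A and names are unique in the reference dataframe. The helper name update_matches is hypothetical, not part of the original answer.

```r
df1 <- data.frame(name = c("TEXT333", "b", "c"), column_A = 1:3, stringsAsFactors = FALSE)
df2 <- data.frame(name = c("b", "TEXT345", "d"), column_A = 4:6, stringsAsFactors = FALSE)
df3 <- data.frame(name = c("c", "TEXT123", "a"), column_A = 7:9, stringsAsFactors = FALSE)
df  <- data.frame(name = c("TEXT333", "TEXT123", "a", "TEXT345", "k", "l", "b", "c", "f"),
                  column_B = 11:19, stringsAsFactors = FALSE)

# Hypothetical helper: update column_A for every row whose name contains `pattern`
update_matches <- function(d, ref, pattern) {
  hit <- grepl(pattern, d$name, fixed = TRUE)  # rows with a partial match
  idx <- match(d$name[hit], ref$name)          # their positions in the reference df
  ok  <- !is.na(idx)                           # skip names absent from the reference
  d$column_A[which(hit)[ok]] <- ref$column_B[idx[ok]]
  d
}

dfs <- lapply(list(df1, df2, df3), update_matches, ref = df, pattern = "TEXT")
dfs[[1]]
```

Because the pattern is an argument, the same helper works for any other substring, and dataframes with no match are returned unchanged.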

Aggregate columns with 2 data frames

I am trying to aggregate multiple columns given certain conditions using R (data.table?)...
I have one data frame df1 with columns 12:262 that contains species abundance (each column) for each sample (rows)
sample species1 species2
sample1 1 21
sample2 47 36
sample3 8 32
In another data frame df2, I have the phylum, genus, etc.. for each species (rows) .
species phylum genus
species1 X A
species2 Y B
I would like to aggregate all columns from df1 whose species belong to the same phylum (defined in df2)...
Does that make sense?
thank you!
The first thing to do is to reshape df1. If you convert the data from a 'wide' format to a 'long' format you will have multiple rows for each sample. You can then merge this with your second data set by the species variable. From here, you haven't given enough detail on exactly how you want to aggregate the data, but I provided two simple examples. You should be able to easily adjust that aggregation code to include whatever you need.
library(tidyr)
library(dplyr)
df1 <- data.frame(
sample = c("sample1", "sample2", "sample3"),
species1 = c(1, 47, 8),
species2 = c(21, 36, 32))
df2 <- data.frame(
species = c("species1", "species2"),
phylum = c("X", "Y"),
genus = c("A", "B")
)
df1_long <- tidyr::pivot_longer(df1, starts_with("species"),
names_to = "species", values_to = "abundance")
df3 <- dplyr::left_join(df1_long, df2, by = "species")
df3 %>%
group_by(phylum) %>%
summarize(total_abundance = sum(abundance),
avg_abundance = mean(abundance))
A data.table version
library(data.table)
dt1 <- data.table(
sample = c("sample1", "sample2", "sample3"),
species1 = c(1, 47, 8),
species2 = c(21, 36, 32))
dt2 <- data.table(
species = c("species1", "species2"),
phylum = c("X", "Y"),
genus = c("A", "B")
)
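# long format (keep species as character so it joins cleanly with dt2)
dt1_long <-
  melt(
    dt1,
    id.vars = "sample",
    variable.name = "species",
    value.name = "abundance",
    variable.factor = FALSE
  )
# join, then aggregate by phylum
dt3 <- dt1_long[dt2, on = "species"]
dt3[, .(total_abundance = sum(abundance), avg_abundance = mean(abundance)), by = phylum]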
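If the goal is one summed column per phylum for each sample (one possible reading of the question), a base R sketch with rowsum() avoids reshaping altogether. The species3 column and its phylum are made up here so that one phylum has more than one species; they are not in the original data.

```r
df1 <- data.frame(
  sample   = c("sample1", "sample2", "sample3"),
  species1 = c(1, 47, 8),
  species2 = c(21, 36, 32),
  species3 = c(4, 0, 2))          # hypothetical extra species
df2 <- data.frame(
  species = c("species1", "species2", "species3"),
  phylum  = c("X", "Y", "X"),     # species3 assigned to X for illustration
  stringsAsFactors = FALSE)

# abundance matrix: samples in rows, species in columns
abund <- as.matrix(df1[, -1])
rownames(abund) <- df1$sample

# phylum of each abundance column, looked up in df2
phylum <- df2$phylum[match(colnames(abund), df2$species)]

# rowsum() sums rows by group, so transpose, sum by phylum, transpose back
by_phylum <- t(rowsum(t(abund), group = phylum))
by_phylum
```

With the real data, df1[, -1] would become the abundance columns only (12:262 in the question).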

Join a long data frame to a tidy data frame

I have two dataframes like below:
df1 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"))
df2 <- data.frame(Category = c("Construction","Construction","Construction",
"Industry","Industry","Industry",
"Size","Size","Size","Size"),
Type = c("Frame","Masonry","Fire Resistive",
"Apartments","Restaurant","Condos",
"[0-3)","[3-6)","[6-9)","9+"),
Score1 = rnorm(10),
Score2 = rnorm(10),
Score3 = rnorm(10))
I want to join df2 to df1 so that Construction, Industry, and Size each have their respective Score.
I can do it manually by making a key equal to Category concatenated with Type and then doing a left-join for each column, but I want a way to automate it so I can add/remove variables easily.
Here's the format I want it to look like: (note: Score numbers don't match.)
df3 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Construction_Score1 = rnorm(5),
Construction_Score2 = rnorm(5),
Construction_Score3 = rnorm(5),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Industry_Score1 = rnorm(5),
Industry_Score2 = rnorm(5),
Industry_Score3 = rnorm(5),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"),
Size_Score1 = rnorm(5),
Size_Score2 = rnorm(5),
Size_Score3 = rnorm(5))
The idea here is to merge df1 with df2 once for each of the columns c("Construction","Industry","Size"), matching that column against Type, then stack the merged dataframes into one long dataframe, which we finally convert to wide to get the format you want.
mylist <- lapply(names(df1), function(col){
merge(x = df1, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)})
mydf <- do.call(rbind, mylist)
df3 <- reshape(mydf, idvar = c("Construction","Industry","Size"),
timevar = "Category",
direction = "wide")
One thing to note is that you have Score as the value of your Category column in df2 which I think should be Size instead to match what you have in df3 and also what has been hinted in df1.
Update: Answering OP's follow-up question;
What if there are other columns that are in df1, but not df2?
Let's make df11 which has another column and apply the same approach on that:
df11 <- cbind(df1, a=1:5)
mydf <- do.call(rbind,
lapply(names(df11[1:3]), function(col){
merge(x = df11, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)}))
df33 <- reshape(mydf, idvar = names(df11),
timevar = "Category",
direction = "wide")
So, you just need to specify in lapply which columns of df11 you are using to merge with df2 and in the reshape you include all the columns from df11 whether they match with df2 or not.
Another possibility using the tidyverse (thanks to @akrun for reminding me about map_df):
library(tidyverse)
map_df(names(df11)[1:3], ~ left_join(df11, df2, by = set_names("Type", .x))) %>%
gather(mvar, mval, Score1:Score3) %>%
unite(var, mvar, Category) %>%
spread(var, mval)
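The same lookup can be sketched in base R without any reshaping: loop over the columns of df1, look each value up in the rows of df2 for that Category, and bind the scores on with prefixed names. This assumes every column of df1 is a Category in df2; the columns come out grouped after the originals rather than interleaved as in the desired df3.

```r
set.seed(1)
df1 <- data.frame(Construction = c("Frame", "Frame", "Masonry", "Fire Resistive", "Masonry"),
                  Industry = c("Apartments", "Restaurant", "Condos", "Condos", "Condos"),
                  Size = c("[0-3)", "[6-9)", "[3-6)", "[3-6)", "9+"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Category = rep(c("Construction", "Industry", "Size"), c(3, 3, 4)),
                  Type = c("Frame", "Masonry", "Fire Resistive",
                           "Apartments", "Restaurant", "Condos",
                           "[0-3)", "[3-6)", "[6-9)", "9+"),
                  Score1 = rnorm(10), Score2 = rnorm(10), Score3 = rnorm(10),
                  stringsAsFactors = FALSE)

score_cols <- c("Score1", "Score2", "Score3")
out <- df1
for (col in names(df1)) {
  lookup <- df2[df2$Category == col, ]                 # rows of df2 for this variable
  idx    <- match(df1[[col]], lookup$Type)             # position of each value in the lookup
  scores <- lookup[idx, score_cols, drop = FALSE]      # scores aligned to df1's rows
  names(scores) <- paste(col, score_cols, sep = "_")   # e.g. Size_Score1
  rownames(scores) <- NULL
  out <- cbind(out, scores)
}
out
```

To add or remove variables, change the columns looped over, for example names(df1)[1:3] when df1 also holds columns that are not in df2.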

Fill missing values from another dataframe with the same columns

I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 25)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
As you can see, df1 is missing some info that is present in df2, while df2 covers only a subset of the ids, but they both have some similar columns. Is there a way to fill the missing values in df1 based on matching ids from df2?
I found a similar question that recommended using merge, but when I tried it, it dropped all the ids that were not present in both dataframes. It also required manually dropping duplicate columns, and in my real dataset there will be a large number of these, which makes that cumbersome. Even ignoring that, though,
both the recommended solutions:
df1 <- setNames(merge(df1, df2)[-2], names(df1))
and
df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]
did not work for me, throwing various errors.
An alternate solution I have thought of is using rbind then dropping incomplete cases. The problem is that in my real dataset, while there are shared columns, there are also non-shared columns so I would have to create intermediate objects of just the shared columns, rbind, then drop incomplete cases, then join with the original object to regain the dropped columns. This seems unnecessarily roundabout.
In this example it would look like
df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")
and, in case there are any fully duplicated rows between the two dataframes, I would need to add
df2 = unique(df2)
This solution will work, but it is cumbersome, and it gets even worse as the number of columns being matched on increases. Is there a better solution?
-edit- fixed a problem in my example data pointed out by Sathish
-edit2- Expanded example data
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
These dataframes represent the case where there are many columns that have incomplete data and a second dataframe that has all of the missing data. Ideally, we would not need to separately list each column with wq2 := i.wq2 etc.
If you want to join only by id column, you can remove phase in the on clause of code below.
Also your data in the question has discrepancies, which are corrected in the data posted in this answer.
library('data.table')
setDT(df1) # make data table by reference
setDT(df2) # make data table by reference
df1[ i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 by id and phase values, and replace color values of df1 with color values of df2
tail(df1)
# id color phase rand.col
# 1: 95 green gas 1.5868335
# 2: 96 green gas 0.5584864
# 3: 97 green gas -1.2765922
# 4: 98 green gas -0.5732654
# 5: 99 green gas -1.2246126
# 6: 100 green gas -0.4734006
one-liner:
setDT(df1)[df2, color := i.color, on = .(id, phase)]
Data:
set.seed(1L)
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25),
rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
EDIT: based on new data posted in the question
Data:
set.seed(1L)
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50),
wq5 = rnorm(50))
Code:
library('data.table')
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983
setDT(df2)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
df1[df2, `:=` ( wq2 = i.wq2,
wq3 = i.wq3,
wq4 = i.wq4,
wq5 = i.wq5), on = .(id)]
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
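To avoid listing every column by name, the shared columns can be found with intersect() and filled in a loop. The following is a base R sketch, not the data.table method above. It fills only values that are NA in df1 and uses a small made-up dataset instead of the question's data.

```r
df1 <- data.frame(id = 1:6,
                  color = c("blue", "red", NA, NA, NA, "red"),
                  phase = c("liquid", "liquid", "gas", NA, "gas", "gas"),
                  rand.col = 1:6,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 3:6,
                  color = rep("green", 4),
                  phase = rep("gas", 4),
                  stringsAsFactors = FALSE)

# columns present in both dataframes, excluding the key
shared <- setdiff(intersect(names(df1), names(df2)), "id")

# for each row of df1, the matching row of df2 (NA when the id is absent)
idx <- match(df1$id, df2$id)

for (col in shared) {
  fill <- is.na(df1[[col]]) & !is.na(idx)   # missing in df1 and available in df2
  df1[[col]][fill] <- df2[[col]][idx[fill]]
}
df1
```

Columns that exist only in df1, such as rand.col, are left untouched, and ids absent from df2 keep their original values.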

Is it possible to use the by variable from the left and right dataframes separately after a left_join procedure

After joining two dataframes (df1 and df2) I would like to flag in column check which ids are in df1 but not df2 (example below).
df1 <-
data.frame(id = c(1, 2, 3))
df2 <-
data.frame(id = c(1, 2))
df3 <-
left_join(df1, df2)
Required result
id check
1 1 Y
2 2 Y
3 3 N
I can achieve the result using a temp column (example below)
df1 <-
data.frame(id = c(1, 2, 3))
df2 <-
data.frame(id = c(1, 2), temp = "Y")
df3 <-
left_join(df1, df2) %>%
mutate(check = ifelse(is.na(temp), "N", "Y")) %>%
select(-temp)
but I was hoping for a solution where a temp column isn't required. I've tried some different approaches (for example the one below), but haven't been able to find a better solution.
df3 <-
left_join(x = df1, y = df2) %>%
mutate(check = ifelse(is.na(y.id), "N", "Y"))
but this errors with...
Joining, by = "id"
Error: object 'y.id' not found
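The error arises because left_join() returns a single id column, so there is no y.id in the result to refer to. One way to get the flag without a temp column is to skip the join and test membership directly with %in%. This is a minimal base R sketch using the question's example data.

```r
df1 <- data.frame(id = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2))

# flag each id in df1 according to whether it also occurs in df2
df3 <- df1
df3$check <- ifelse(df1$id %in% df2$id, "Y", "N")
df3
```

If the join itself is still needed for other columns, the keep argument of dplyr's left_join() is documented as retaining the key columns from both inputs, which may be worth checking as an alternative.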
