Merging different data frames depending on column value - r

I have a data frame df1
df1<- data.frame(ID = c("A","B","A","A","B"),CLASS = c(1,1,2,1,4))
ID CLASS
1 A 1
2 B 1
3 A 2
4 A 1
5 B 4
and another two data frames A and B
> A<- data.frame(CLASS = c(1,2,3), DESCRIPTION = c("Unknown", "Tall", "Short"))
CLASS DESCRIPTION
1 1 Unknown
2 2 Tall
3 3 Short
> B <- data.frame(CLASS = c(1,2,3,4), DESCRIPTION = c("Big", "Small", "Medium", "Very Big"))
CLASS DESCRIPTION
1 1 Big
2 2 Small
3 3 Medium
4 4 Very Big
I want to merge these three data frames depending on the ID and class of df1 to have something like this:
ID CLASS DESCRIPTION
1 A 1 Unknown
2 B 1 Big
3 A 2 Tall
4 A 1 Unknown
5 B 4 Very Big
I know I can merge it as df1 <- merge(df1, A, by = "CLASS") but I can't find a way to add the conditional (maybe an "if" is too much) to also merge B according to the ID.
I need to have an efficient way to do this as I am applying it to over 2M rows.

Add the ID variable to A and B, rbind A and B together, and use ID and CLASS to merge:
A$ID = 'A'
B$ID = 'B'
AB <- rbind(A, B)
merge(df1, AB, by = c('ID', 'CLASS'))
ID CLASS DESCRIPTION
1 A 1 Unknown
2 A 1 Unknown
3 A 2 Tall
4 B 1 Big
5 B 4 Very Big
I would suggest using stringsAsFactors = FALSE when creating the data (this is the default from R 4.0.0 onward):
df1 <- data.frame(ID = c("A","B","A","A","B"),CLASS = c(1,1,2,1,4),
stringsAsFactors = FALSE)
A <- data.frame(CLASS = c(1,2,3),
DESCRIPTION = c("Unknown", "Tall", "Short"),
stringsAsFactors = FALSE)
B <- data.frame(CLASS = c(1,2,3,4),
DESCRIPTION = c("Big", "Small", "Medium", "Very Big"),
stringsAsFactors = FALSE)
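Note that merge() sorts its result on the key columns, so the output above no longer follows the row order of df1. If the original order matters, a lookup with match() on a combined key is one option. This is only a sketch: key_df1 and key_AB are names made up for this example, and it assumes each ID/CLASS pair appears at most once in the combined lookup table and that the space used as separator never occurs inside an ID.

```r
df1 <- data.frame(ID = c("A", "B", "A", "A", "B"), CLASS = c(1, 1, 2, 1, 4),
                  stringsAsFactors = FALSE)
A <- data.frame(CLASS = c(1, 2, 3),
                DESCRIPTION = c("Unknown", "Tall", "Short"),
                stringsAsFactors = FALSE)
B <- data.frame(CLASS = c(1, 2, 3, 4),
                DESCRIPTION = c("Big", "Small", "Medium", "Very Big"),
                stringsAsFactors = FALSE)

# Tag each lookup table with the ID it belongs to and stack them
A$ID <- "A"
B$ID <- "B"
AB <- rbind(A, B)

# Build one combined key per row (hypothetical helper names)
key_df1 <- paste(df1$ID, df1$CLASS)
key_AB  <- paste(AB$ID, AB$CLASS)

# match() gives, for each row of df1, the row of AB with the same key;
# rows without a match would get NA
df1$DESCRIPTION <- AB$DESCRIPTION[match(key_df1, key_AB)]
df1
```

Because nothing is sorted or re-joined, df1 keeps its five rows in their original order. Whether this is faster than merge() on 2M rows is worth benchmarking on the real data.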

To merge multiple dataframes in one go, Reduce is often helpful:
out <- Reduce(function(x, y) merge(x, y, by = "CLASS", all.x = TRUE), list(df1, A, B))
out
CLASS ID DESCRIPTION.x DESCRIPTION.y
1 1 A Unknown Big
2 1 B Unknown Big
3 1 A Unknown Big
4 2 A Tall Small
5 4 B <NA> Very Big
As you can see, columns that were present in more than one data frame (other than the by column) were given a suffix (default merge behavior). This allows you to apply whatever logic you want to get the final column you wish for. For instance,
out$Description <- ifelse(out$ID == "A", as.character(out$DESCRIPTION.x), as.character(out$DESCRIPTION.y))
> out
CLASS ID DESCRIPTION.x DESCRIPTION.y Description
1 1 A Unknown Big Unknown
2 1 B Unknown Big Big
3 1 A Unknown Big Unknown
4 2 A Tall Small Tall
5 4 B <NA> Very Big Very Big
Note that ifelse is vectorized and quite efficient.

A dplyr solution:
library(dplyr)
bind_rows(lst(A,B),.id="ID") %>% inner_join(df1)
# ID CLASS DESCRIPTION
# 1 A 1 Unknown
# 2 A 1 Unknown
# 3 A 2 Tall
# 4 B 1 Big
# 5 B 4 Very Big
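For readers without dplyr, roughly the same idea can be sketched in base R: keep the lookup tables in a named list, add each list name as an ID column, stack them, and merge once. The names lookups and AB are made up for this example, and the sketch assumes every lookup table has the same columns.

```r
df1 <- data.frame(ID = c("A", "B", "A", "A", "B"), CLASS = c(1, 1, 2, 1, 4),
                  stringsAsFactors = FALSE)
A <- data.frame(CLASS = c(1, 2, 3),
                DESCRIPTION = c("Unknown", "Tall", "Short"),
                stringsAsFactors = FALSE)
B <- data.frame(CLASS = c(1, 2, 3, 4),
                DESCRIPTION = c("Big", "Small", "Medium", "Very Big"),
                stringsAsFactors = FALSE)

# Named list: the names become the values of the ID column
lookups <- list(A = A, B = B)

# Prepend the list name as an ID column to each table, then stack them
AB <- do.call(rbind, Map(function(tab, id) {
  data.frame(ID = id, tab, stringsAsFactors = FALSE)
}, lookups, names(lookups)))

# A single merge on both keys; the result is sorted by ID, then CLASS
out <- merge(df1, AB, by = c("ID", "CLASS"))
out
```

Adding a third lookup table would then only mean adding one more element to the list.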

Related

Arranging data.frame's columns based on a reference vector [duplicate]

I have a data.frame that looks like this:
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
The vector is sorted by the cluster_id (which goes up to 11).
I want to sort the columns in the data frame such that the columns are in the order of the names in the vector.
A simple example of what I want is that:
Data:
A B C
1 2 3
4 5 6
Vector:
c("B","C","A")
Sorted:
B C A
2 3 1
5 6 4
Is there a fast way to do this?
UPDATE, with reproducible data added by OP:
df <- read.table(h=T, text="A B C
1 2 3
4 5 6")
vec <- c("B", "C", "A")
df[vec]
Results in:
B C A
1 2 3 1
2 5 6 4
As OP desires.
How about:
df[df.clust$mutation_id]
Where df is the data.frame you want to sort the columns of and df.clust is the data frame that contains the vector with the column order (mutation_id).
This basically treats df as a list and uses standard vector indexing techniques to re-order it.
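One caveat with indexing by name: if the vector contains a name that is not a column of the data frame, the subset stops with an "undefined columns selected" error. A small sketch of a more defensive version follows; the name "Z" and the variables present and rest are made up for illustration.

```r
df <- data.frame(A = c(1, 4), B = c(2, 5), C = c(3, 6))
vec <- c("B", "Z", "A")   # "Z" is a hypothetical name that is not in df

# Keep only the names that exist; intersect() preserves the order of vec
present <- intersect(vec, names(df))

# Columns not mentioned in vec, to be appended at the end
rest <- setdiff(names(df), present)

ordered <- df[c(present, rest)]
names(ordered)
```

Here the columns named in vec come first in the requested order, and any remaining columns follow in their original order instead of being dropped.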
Brodie's answer does exactly what you're asking for. However, you imply that your data are large, so I will provide an alternative using "data.table", which has a function called setcolorder that will change the column order by reference.
Here's a reproducible example.
Start with some simple data:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
matches <- data.frame(X = 1:3, Y = c("C", "A", "B"), Z = 4:6)
mydf
# A B C
# 1 1 3 5
# 2 2 4 6
matches
# X Y Z
# 1 1 C 4
# 2 2 A 5
# 3 3 B 6
Provide proof that Brodie's answer works:
out <- mydf[matches$Y]
out
# C A B
# 1 5 1 3
# 2 6 2 4
Show a more memory efficient way to do the same thing.
library(data.table)
setDT(mydf)
mydf
# A B C
# 1: 1 3 5
# 2: 2 4 6
setcolorder(mydf, as.character(matches$Y))
mydf
# C A B
# 1: 5 1 3
# 2: 6 2 4
A5C1D2H2I1M1N2O1R2T1's solution didn't work for my data (I have a problem similar to Yilun Zhang's), so I found another option:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
# A B C
# 1 1 3 5
# 2 2 4 6
matches <- c("B", "C", "A") #desired order
mydf_reorder <- mydf[,match(matches, colnames(mydf))]
colnames(mydf_reorder)
#[1] "B" "C" "A"
match() returns the position of each element of its first argument within its second argument:
match(matches, colnames(mydf))
#[1] 2 3 1
I hope this can offer another solution if anyone is having problems!

subsetting a data frame using another data frame with var names and other condition

I'd like to subset my main data frame (df_main) based on a list AND a condition that sit in a separate data frame (df_keep), so that I end up with a data frame like df_goal.
I'd like to keep a var in df_main if it's on the list of var names (df_keep$keep_var) AND if it is NA or "r" (df_keep$othvar).
My approach seems to work up until the last line, and I don't know why.
Thanks for any help!
# Starting point
df_main <- data.frame(coat=c(1:5),hanger=c(1:5),book=c(1:5),dvd=c(1:5),bookcase=c(1:5),
clock=c(1:5),bottle=c(1:5),curtains=c(1:5),wall=c(1:5))
df_keep <- data.frame(keep_var=c("coat","hanger","book","wall","bottle"),othvar=c("r","w","r","w",NA))
# Goal
df_goal <- data.frame(coat=c(1:5),book=c(1:5),bottle=c(1:5))
# Attempt
df_keep$othvar[is.na(df_keep$othvar)] <- "r" # everything in othvar that's NA I want to keep so I recode it to "r"
df_keep <- df_keep %>% filter(othvar == "r") # keep everything that's "r"
df_main <- df_main[df_keep$keep_var] # subset my df_main using updated df_keep
You can get rows where your conditions are met in df_keep like this:
conditions_met <- df_keep$othvar == "r" | is.na(df_keep$othvar)
> conditions_met
[1] TRUE FALSE TRUE FALSE TRUE
You can then use these to get only the correct rows in df_keep$keep_var:
kept_rows <- df_keep$keep_var[conditions_met]
> kept_rows
[1] coat book bottle
Now, just return only the columns in df_main whose names match those in kept_rows:
df_main[, as.character(kept_rows)]
coat book bottle
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
Or in one line:
> df_main[, as.character(df_keep$keep_var[df_keep$othvar == "r" |
+ is.na(df_keep$othvar)])]
coat book bottle
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
Note that as.character() is needed because your example data set was created without stringsAsFactors = FALSE, so keep_var is a factor (this was the default before R 4.0.0). If your real data holds character vectors rather than factors, you can drop the as.character() call. E.g.:
df_main <-
  data.frame(
    coat = c(1:5),
    hanger = c(1:5),
    book = c(1:5),
    dvd = c(1:5),
    bookcase = c(1:5),
    clock = c(1:5),
    bottle = c(1:5),
    curtains = c(1:5),
    wall = c(1:5),
    stringsAsFactors = FALSE
  )
df_keep <-
  data.frame(
    keep_var = c("coat", "hanger", "book", "wall", "bottle"),
    othvar = c("r", "w", "r", "w", NA),
    stringsAsFactors = FALSE
  )
df_goal <- data.frame(coat = c(1:5),
                      book = c(1:5),
                      bottle = c(1:5))
df_main[, df_keep$keep_var[df_keep$othvar == "r" |
is.na(df_keep$othvar)]]
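The reason the factor case goes wrong can be shown on a smaller example: when a factor is used as an index, R uses its integer codes, not its labels. The sketch below uses a cut-down df_main and a made-up factor named keep.

```r
df_main <- data.frame(coat = 1:5, hanger = 1:5, book = 1:5)

# Levels are sorted alphabetically: "book" = 1, "coat" = 2
keep <- factor(c("coat", "book"))

# Indexing by the factor uses the codes 2 and 1, i.e. columns 2 and 1
by_factor <- names(df_main[keep])

# Converting to character selects by name, which is what was intended
by_name <- names(df_main[as.character(keep)])

by_factor   # "hanger" "coat"
by_name     # "coat"   "book"
```

No error is raised in the factor case, which is what makes this easy to miss: the subset succeeds but returns the wrong columns.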
Here is a dplyr solution
library(dplyr)
# Filter based on 'othvar' and convert factor to string.
keep.vec <- as.character(
(df_keep %>% dplyr::filter(is.na(othvar) | othvar == 'r'))$keep_var
)
df_main %>% dplyr::select(dplyr::all_of(keep.vec))
## coat book bottle
## 1 1 1 1
## 2 2 2 2
## 3 3 3 3
## 4 4 4 4
## 5 5 5 5

Counting the Number of Dates When Larger Than Another Date - R [duplicate]

I have no idea where to start with this, but what I am trying to do is create a new value based on the number of times another value is represented in another column.
For example
# Existing Data
key newcol
a ?
a ?
a ?
b ?
b ?
c ?
c ?
c ?
Would like the output to look like
key newcol
a 3
a 3
a 3
b 2
b 2
c 3
c 3
c 3
Thanks!
This can be achieved with the doBy package like so:
require(doBy)
#original data frame
df <- data.frame(key = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'))
#add counter
df$count <- 1
#use summaryBy to count number of instances of key
counts <- summaryBy(count ~ key, data = df, FUN = sum, var.names = 'newcol', keep.names = TRUE)
#merge counts into original data frame
df <- merge(df, counts, by = 'key', all.x = TRUE)
df then looks like:
> df
key count newcol
1 a 1 3
2 a 1 3
3 a 1 3
4 b 1 2
5 b 1 2
6 c 1 3
7 c 1 3
8 c 1 3
If key is a vector like this key <- rep(c("a", "b", "c"), c(3,2,3)), then you can get what you want by using table to count occurrences of key elements
> N <- table(key)
> data.frame(key, newcol=rep(N,N))
key newcol
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
On the other hand, if key is a data.frame, then...
key.df <- data.frame(key = rep(letters[1:3], c(3, 2, 3)))
N <- table(key.df$key)
data.frame(key=key.df, newcol=rep(N, N))
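The rep(N, N) trick relies on the rows already being grouped by key in the same order as the table. If the keys are interleaved, a sketch like the following, using ave() or a name-based lookup into the table, should give the per-group count for each row without needing any sorting. The data here are made up for illustration.

```r
# Keys deliberately not grouped or sorted
df <- data.frame(key = c("a", "b", "a", "c", "a", "c", "b", "c"),
                 stringsAsFactors = FALSE)

# ave() applies length() within each group and returns one value per row
df$newcol <- ave(seq_along(df$key), df$key, FUN = length)

# Alternative: look up each row's key by name in the frequency table
counts <- table(df$key)
alt <- as.vector(counts[df$key])

df
```

Both approaches leave the row order of df untouched.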

merge data.frame but keep only unique columns?

Let's say I want to merge two data.frames but some of the columns are redundant (the same). How would I merge those data.frames but drop the redundant columns?
X1 = data.frame(id = c("a","b","c"), same = c(1,2,3), different1 = c(4,5,6))
X2 = data.frame(id = c("b","c","a"), same = c(2,3,1), different2 = c(7,8,9))
merge(X1,X2, by="id", all = TRUE, sort = FALSE)
id same.x different1 same.y different2
1 a 1 4 1 9
2 b 2 5 2 7
3 c 3 6 3 8
But how would I get just the different1 and different2 columns?
id same different1 different2
1 a 1 4 9
2 b 2 5 7
3 c 3 6 8
You could include the column same in your by argument. The default is by=intersect(names(x), names(y)). Try merge(X1, X2) (it is the same as merge(X1, X2, by=c("id", "same"))):
merge(X1, X2)
# id same different1 different2
#1 a 1 4 9
#2 b 2 5 7
#3 c 3 6 8
Just subset via indexing in the merge statement. There are many ways to subset, e.g. by name or by position. There is even a subset function, but the [] notation works well for almost all cases:
merge(X1[,c("id","same","different1")], X2[,c("id","different2")], by="id", all = TRUE, sort = FALSE)
As shown in the other answer, you could put the redundant column into the by argument, but this will become an issue once you leave the realm of one-to-one merges and enter one-to-many or many-to-many merges.
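With many columns, typing out the names to keep becomes tedious. One way to sketch the same subsetting programmatically is to use setdiff() to find the columns that exist only in the second data frame. The variable names extra and merged are made up here, and the sketch assumes the shared non-key columns really are redundant.

```r
X1 <- data.frame(id = c("a", "b", "c"), same = c(1, 2, 3), different1 = c(4, 5, 6))
X2 <- data.frame(id = c("b", "c", "a"), same = c(2, 3, 1), different2 = c(7, 8, 9))

# Columns of X2 that X1 does not already have
extra <- setdiff(names(X2), names(X1))

# Merge on id only, bringing in just the key and the new columns from X2
merged <- merge(X1, X2[c("id", extra)], by = "id")
merged
```

Since the join is still on id alone, the redundant column is taken from X1 and never compared, so this does not check that the two copies actually agree.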
