I would like to create a new data frame that borrows an ID variable from another data frame. The data frame I would like to merge has repeated observations in the ID column which is causing me some problems.
DF1<-data.frame(ID1=rep(c("A","B", "C", "D", "E") , 2), X1=rnorm(10))
DF2<-data.frame(ID1=c("A", "B", "C", "D", "E"), ID2=c("V","W","X","Y" ,"Z"), X2=rnorm(5), X3=rnorm(5))
What I would like is to append DF2$ID2 onto DF1 by the ID1 column. My goal is something that looks like this (I do not want DF2$X2 and DF2$X3 in the 'Goal' data frame):
Goal<-data.frame(ID2=DF2$ID2, DF1)
I have tried merge but it complains because DF1$ID1 is not unique. I know R can gobble this up in one line of code, but I can't seem to make the functions I know work. Any help would be greatly appreciated!
There should be no problem with a simple merge. Using your sample data
merge(DF1, DF2[,c("ID1","ID2")], by="ID1")
produces
ID1 X1 ID2
1 A 0.03594331 V
2 A 0.42814900 V
3 B -2.17161263 W
4 B -0.33403550 W
5 C 0.95407844 X
6 C -0.23186723 X
7 D 0.46395514 Y
8 D -1.49919961 Y
9 E -0.20342430 Z
10 E -0.49847569 Z
You could also use left_join from library(dplyr)
library(dplyr)
left_join(DF1, DF2[,c("ID1", "ID2")])
# ID1 X1 ID2
#1 A -1.20927237 V
#2 B -0.03003128 W
#3 C -0.75799708 X
#4 D 0.53946986 Y
#5 E -0.52009921 Z
#6 A 1.15822659 V
#7 B -0.91976194 W
#8 C 0.74620142 X
#9 D -2.46452560 Y
#10 E 0.80015219 Z
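If you prefer to make the join key explicit (and avoid the message left_join prints when it has to guess the join columns), the by argument can be supplied directly; a small variation on the code above:
left_join(DF1, DF2[, c("ID1", "ID2")], by = "ID1")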
I have a "wide" dataset where for each observation I measure a value from a bunch of categorical variables. It is presented just like this:
V1 V2 V3
a  z  f
a  z  f
b  y  g
b  y  g
a  y  g
b  y  f
This means that V1 has two categories, "a" and "b", V2 has two categories, "z" and "y", and so on. But suppose that I have 30 variables (a much bigger dataset).
I want to obtain a dataset in this form
V1 V2 V3 Freq
a  z  f  2
b  y  g  2
a  y  g  1
b  y  f  1
How can I get this in R? With smaller datasets I use transform(table(data.frame(data))), but it doesn't work with bigger datasets since it requires building giant tables. Can somebody help please?
I would like "general" code that does not depend on the variable names, since I will be using it in a function. Moreover, since the datasets will be big, I would prefer to avoid the table function.
Thanks
In base R, with interaction:
as.data.frame(table(interaction(df, sep = "", drop = TRUE)))
Or, with table:
subset(data.frame(table(df)), Freq > 0)
# V1 V2 V3 Freq
#2 b y f 1
#3 a z f 2
#5 a y g 1
#6 b y g 2
With dplyr:
library(dplyr)
df %>%
  count(V1, V2, V3, name = "Freq")
# V1 V2 V3 Freq
#1 a y g 1
#2 a z f 2
#3 b y f 1
#4 b y g 2
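Since the question asks for code that does not depend on the variable names, a possible tweak (assuming dplyr 1.0 or later for across()) is to count over every column:
df %>%
  count(across(everything()), name = "Freq")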
I assume your dataset dt contains only categorical variables and Freq represents the number of observations for each unique combination of the categorical variables.
As you want code "without using dplyr," here is an alternative using data.table.
library(data.table)
dt[, Freq:=.N, by=c(colnames(dt))]
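Note that this keeps every original row, with Freq attached to each one. If the goal is one row per unique combination, as in the desired output, a variant along these lines should work (still data.table only):
dt[, .(Freq = .N), by = names(dt)]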
I apologize as I'm not sure how to word this title exactly.
I have two data frames. df1 is a series of paths with columns "source" and "destination". df2 stores values associated with the destinations. Below is some sample data:
df1
row source destination
1   A      B
2   C      B
3   H      F
4   G      B
df2
row destination n
1   B           26
2   F           44
3   L           12
I would like to compare the two data frames and add the n column to df1 so that df1 has the correct n value for each destination. df1 should look like:
row source destination n
1   A      B           26
2   C      B           26
3   H      F           44
4   G      B           26
The data that I'm actually working with is much larger, and is never the same number of rows when I run the program. The furthest I've gotten with this is using the which command to get the right values, but only each value once.
df2[ which(df2$destination %in% df1$destination), ]$n
[1] 26 44
What I need is the vector (26, 26, 44, 26) so I can save it to df1$n.
We can use a merge or left join. With data.table, an update join adds n to df1 by reference:
library(data.table)
setDT(df1)[df2, n := i.n, on = .(destination)]
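Alternatively, a dplyr left_join gives the same result as a new data frame rather than updating df1 in place; a minimal sketch (subsetting df2 first so its row column is not carried along):
library(dplyr)
left_join(df1, df2[, c("destination", "n")], by = "destination")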
A base R option using match
transform(
  df1,
  n = df2$n[match(destination, df2$destination)]
)
which gives
row source destination n
1 1 A B 26
2 2 C B 26
3 3 H F 44
4 4 G B 26
Data
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"), destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))
I am trying to "tidy" a large dataset where multiple different types of data are merged in the columns and some data are stored in the column names. This is a common scenario in biological datasets.
My data table has replicate measurements which I want to collapse into a mean. Converting the data into tidy format, these replicate values become additional rows. If I try to aggregate/group by several columns and calculate the mean of the replicates:
collapsed.data <- tidy.dt[, mean(expression, na.rm = T), by=list(Sequence.window,Gene.names,ratio,enrichment.type,condition)]
I get a resulting table that has only the columns used in the by statement, followed by mean(expression) as column V1. Is it possible to keep all the other (unchanged) columns as well?
A minimalist example showing what I am trying to achieve is as follows:
library(data.table)
dt <- data.table(a = c("a", "a", "b", "b", "c", "a", "c", "a"), b = rnorm(8),
                 c = c(1,1,1,1,1,2,1,2), d = rep('x', 8), e = rep('test', 8))
dt[, mean(b), by = list(a, c)]
# a c V1
#1: a 1 -0.7597186
#2: b 1 -0.3001626
#3: c 1 -0.6893773
#4: a 2 -0.1589146
As you can see the columns d and e are dropped.
One possibility is to include d and e in the grouping:
res <- dt[, mean(b), by = list(a, c, d, e)]
res
# a c d e V1
#1: a 1 x test 0.9271986
#2: b 1 x test -0.3161799
#3: c 1 x test 1.3709635
#4: a 2 x test 0.1543337
If you want to keep all columns except the one you want to aggregate, you can do this in a more programmatic way:
cols_to_group_by <- setdiff(colnames(dt), "b")
res <- dt[, mean(b), by = cols_to_group_by]
The result is the same as above.
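If you would rather have a descriptive column name than the default V1, the aggregation can be named inside .(); mean_b below is just an illustrative name:
res <- dt[, .(mean_b = mean(b)), by = cols_to_group_by]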
Either way, you have reduced the number of rows. If you want to keep all rows, you can add an additional column instead:
dt[, mean_b := mean(b), by = list(a, c)]
dt
# a b c d e mean_b
#1: a 1.1127632 1 x test 0.9271986
#2: a 0.7416341 1 x test 0.9271986
#3: b 0.9040880 1 x test -0.3161799
#4: b -1.5364479 1 x test -0.3161799
#5: c 1.9846982 1 x test 1.3709635
#6: a 0.2615139 2 x test 0.1543337
#7: c 0.7572287 1 x test 1.3709635
#8: a 0.0471535 2 x test 0.1543337
Here, dt is modified by reference, i.e., without copying all of dt, which might save time on large data.
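If you would rather leave dt unchanged, one option is to work on a copy; a minimal sketch using the same dt as above:
res <- copy(dt)[, mean_b := mean(b), by = list(a, c)]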
If I had two simple dataframes:
a <- 1:10
b <- c("a","b","c","d","e","f","g","h","i","j")
df1 <-data.frame(a,b)
c <- 1:7
d <- c("k","l","m","n","o","p","q")
df2 <-data.frame(c,d)
... and I wanted to merge them by "a" and "c" for df1 and df2 respectively using:
df3= merge(df1, df2, by.x = "a", by.y = "c")
How would I go about producing a dataframe of rows in df1 which didn't merge? For example:
a b
8 8 h
9 9 i
10 10 j
Any help would be gratefully received.
EDIT
Using the suggestion in the comment, I can do:
check = setdiff(df1$a, df2$c)
This is great, as I get 8:10, which is correct, but I do need the other column in df1 listed too... Can this be done with setdiff too?
Look up the all argument.
df3= merge(df1, df2, by.x = "a", by.y = "c", all.x = TRUE)
will return this. Now you can filter on the rows where d is NA to get the entries you're looking for.
a b d
1 1 a k
2 2 b l
3 3 c m
4 4 d n
5 5 e o
6 6 f p
7 7 g q
8 8 h <NA>
9 9 i <NA>
10 10 j <NA>
Per comments:
check = setdiff(df1$a, df2$c)
alldiff <- df1[1:dim(df1)[1] %in% check,]
(note that dim(df1)[1] is the same as, say, length(df1$a) )
With credit to Codoremifa for the second line.
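For completeness, "rows of df1 with no match in df2" is exactly an anti-join, so either of these should also return the full unmatched rows (the dplyr line assumes the dplyr package is available):
df1[!(df1$a %in% df2$c), ]                      # base R
dplyr::anti_join(df1, df2, by = c("a" = "c"))   # dplyr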
I'm trying to figure out how to sort a data frame like the one below by c1 in increasing order and c2 in decreasing order.
c1 <- c("a", "b", "c", "d", "d", "e", "f", "g", "h", "i")
c2 <- c("29-JAN-08", "29-JAN-08", "29-JAN-08", "29-JAN-08", "20-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08")
example <- data.frame(c1, c2)
I can't use the - sign with a date vector:
> example <- example[order(example$c1, -example$c2),]
Error: unexpected input in "example <- example[order(example$c1, -1ex"
And I haven't been able to figure out how to use the 'decreasing' argument:
> example <- example[order(example$c1, example$c2, decreasing = c(F, T)),]
Error: unexpected input in "example <- example[order(example$c1, -1ex"
Is there a way I can order this data frame by these two columns, in increasing order by the first one and decreasing order by the second when the columns are character and date types, respectively?
Here's an answer using the data.table package, which shows off its benefits in terms of cleaner code:
library(data.table)
example <- as.data.table(example)
# set the date variable as an actual date first (two-digit year, hence %y)
example$c2 <- as.Date(example$c2, format = "%d-%b-%y")
# then sort - notice no need to keep referencing example$...
example[order(c1,-as.numeric(c2))]
A base R version of how to do this would use with
example[with(example,order(c1,-as.numeric(c2))),]
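If dplyr is an option here too, arrange() with desc() reads naturally; a minimal sketch, assuming c2 has already been converted to Date as in the data.table answer above:
library(dplyr)
example %>% arrange(c1, desc(c2))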
This would do a reverse lexical sort (here using -xtfrm() to flip the sort order of the character column), but it may not be what you were intending since you have not yet converted to Date values: the reverse sorting will first be done on the character day "field":
example[ order(example$c1, -xtfrm(example$c2)) , ]
#-------
c1 c2
1 a 29-JAN-08
2 b 29-JAN-08
3 c 29-JAN-08
4 d 29-JAN-08
5 d 20-MAR-08
6 e 28-MAR-08
7 f 28-MAR-08
8 g 28-MAR-08
9 h 28-MAR-08
10 i 28-MAR-08
If you want to do the sort in reverse "true" date-order:
example[ order(example$c1, -as.numeric(as.Date(example$c2, format="%d-%b-%y"))) , ]
#-----
c1 c2
1 a 29-JAN-08
2 b 29-JAN-08
3 c 29-JAN-08
5 d 20-MAR-08
4 d 29-JAN-08
6 e 28-MAR-08
7 f 28-MAR-08
8 g 28-MAR-08
9 h 28-MAR-08
10 i 28-MAR-08
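Finally, regarding the original attempt with decreasing = c(F, T): newer versions of R (3.3.0 onwards, if I remember correctly) let order() take a vector for decreasing when method = "radix" is specified, so something like the following should work directly:
example[order(example$c1,
              as.numeric(as.Date(example$c2, format = "%d-%b-%y")),
              decreasing = c(FALSE, TRUE),
              method = "radix"), ]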