Backfilling from another data.frame - r

I frequently have situations where I have to "fill in" information from another data source.
For example:
x <- data.frame(c1=letters[1:26],c2=letters[26:1])
x[x$c1 == "m","c2"] <- NA
x[x$c1 == "a","c2"] <- NA
c1 c2
1 a <NA>
2 b y
3 c x
4 d w
5 e v
6 f u
7 g t
8 h s
9 i r
10 j q
11 k p
12 l o
13 m <NA>
...
Now, with that missing variable, I'd like to check and fill it in using a seperate data.frame, lets' call it y
y <- data.frame(c1=c("m","a"),c2=c("n","z"))
So, what I would like to happen is for x to be filled in with y. (row 13 should be c("m","n"), row 1 should be c("a","z"))
The method I use to deal with this currently seems convoluted and indirect. What would your approach be? Keeping in mind that my data is not necessarily in a nice order like this one is but the order should be maintained in x. My preference would be for a solution that does not rely on anything but base R.

This will be a far simpler proposition if you deal with character variables, not factors.
I will present a simple data.table solution (for elegant and easy to use syntax amongst many other advantages)
x <- data.frame(c1=letters[1:26],c2=letters[26:1], stringsAsFactors =FALSE)
x[x$c1 == "m","c2"] <- NA
y <- data.frame(c1="m",c2="n", stringsAsFactors = FALSE)
library(data.table)
X <- as.data.table(x)
Y <- as.data.table(y)
For simplicity of merging, I will create a column that indicating
X[,missing_c2 := is.na(c2)]
# a similar column in Y
Y[,missing_c2 := TRUE]
setkey(X, c2, missing_c2)
setkey(Y, c2, missing_c2)
# merge and replace (by reference) those values in X with the the values in `Y`
X[Y, c2 := i.c2]
The i.c2 means that we use the values of c2 from the i argument to [
This approach assumes that not all values where c1 = 'm' will be missing in X and you don't want to replace all values in c2 with 'm' where c1='m', only those which are missing
A base solution
Here is a base solution -- I use merge so that the y data.frame can contain more missing replacements than actually needed (i.e. could have values for all c1 values, although only c1=m`` is required.
# add a second missing value row because to make the solution more generalizable
x <- rbind(x, data.frame(c1 = 'm',c2 = NA, stringsAsFactors = FALSE) )
missing <- x[is.na(x$c2),]
merged <- merge(missing, y, by = 'c1')
x[is.na(x$c2),] <- with(merged, data.frame(c1 = c1, c2 = c2.y, stringsAsFactors = FALSE))
If you use factors you will come up against a wall of pain ensuring that the levels correspond.

In base R, I believe this will work for you:
nas <- is.na(x$c2)
x[nas, ] <- y[y$c1 %in% x[nas, 1], ]

Related

Problem in merging a gff file and a csv file in R [duplicate]

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).
Here are some sample data (all values from class column are defined in both data frames):
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
If I do:
merge(df.2, df.1)
Output is:
class object prob
1 1 B 0.5
2 1 C 0.5
3 2 A 0.7
4 2 D 0.7
5 3 F 0.3
If I add sort = FALSE:
merge(df.2, df.1, sort = F)
Result is:
class object prob
1 2 A 0.7
2 2 D 0.7
3 1 B 0.5
4 1 C 0.5
5 3 F 0.3
But what I would like is:
class object prob
1 2 A 0.7
2 1 B 0.5
3 2 D 0.7
4 3 F 0.3
5 1 C 0.5
You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example :
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id <- 1:nrow(df.2)
out <- merge(df.2,df.1, by = "class")
out[order(out$id), ]
Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.
Using your example data, we would use join like this:
> join(df.2,df.1)
Joining by: class
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
Here are a couple of links describing fixes to the merge function for keeping the row order:
http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/
http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html
You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.
Using your example above, the inner_join result looks like this:
inner_join(df.2,df.1)
Joining by: "class"
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
From data.table v1.9.5+, you can do:
require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]
The performs a join on column class by finding out matching rows in df.1 for each row in df.2 and extracting corresponding columns.
For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:
library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]
object class prob
1: A 2 0.7
2: B 1 0.5
3: D 2 0.7
4: F 3 0.3
5: C 1 0.5
Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.
The accepted answer proposes a manual way to keep order when using merge, which works most of the times but requires unnecessary manual work. This solution comes on the back of How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:
This came up on the plyr mailing list a while back (raised by #kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
col <- ".sortColumn"
data[,col] <- 1:nrow(data)
out <- fn(data, ...)
if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
out <- out[order(out[,col]),]
out[,col] <- NULL
out
}
So now you can use this generic keeping.order function to keep the original row order of a merge call:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")
Which will yield, as requested:
> keeping.order(df.2, merge, y=df.1, by = "class")
class object id prob
3 2 A 1 0.7
1 1 B 2 0.5
4 2 D 3 0.7
5 3 F 4 0.3
2 1 C 5 0.5
So keeping.order effectively automates the approach in the accepted answer.
Thanks to #PAC , I came up with something like this:
merge_sameord = function(x, y, ...) {
UseMethod('merge_sameord')
}
merge_sameord.data.frame = function(x, y, ...) {
rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
x[, rstr] = 1:nrow(x)
res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
res = res[order(res[, rstr]), ]
res[, rstr] = NULL
res
}
This assumes that you want to preserve the order the first data frame, and the merged data frame will have the same number of rows as the first data frame. It will give you the clean data frame without extra columns.
In this specific case you could us factor for a compact base solution:
df.2$prob = factor(df.2$class,labels=df.1$prob)
df.2
# object class prob
# 1 A 2 0.7
# 2 B 1 0.5
# 3 D 2 0.7
# 4 F 3 0.3
# 5 C 1 0.5
Not a general solution however, it works if:
You have a lookup table containing unique values
You want to update a table, not create a new one
the lookup table is sorted by the merging column
The lookup table doesn't have extra levels
You want a left_join
If you're fine with factors
1 is not negotiable, for the rest we can do:
df.3 <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3
df.1b <- df.1b[df.1$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class,labels=df.1b$prob)
df.3 <- df3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.
For package developers
As a package developer, you want to be dependent on as few other packages as possible. Especially tidyverse functions, that change way too often for package developers IMHO.
To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).
left_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all = TRUE, ...)
}
full_join <- function(x, y, ...) {
merge_exec(x = x, y = y, ...)
}
# workhorse:
merge_exec <- function(x, y, ...) {
# set index
x$join_id_ <- 1:nrow(x)
# do the join
joined <- merge(x = x, y = y, sort = FALSE, ...)
# get suffices (yes, I prefer this over suffixes)
if ("suffixes" %in% names(list(...))) {
suffixes <- list(...)$suffixes
} else {
suffixes <- c("", "")
}
# get columns names in right order, so the 'by' column won't be forced first
cols <- unique(c(colnames(x),
paste0(colnames(x), suffixes[1]),
colnames(y),
paste0(colnames(y), suffixes[2])))
# get the original row and column index
joined[order(joined$join_id),
cols[cols %in% colnames(joined) & cols != "join_id_"]]
}
The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])
I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is
data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
row.names = NULL)
I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.
Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berend's clever post-merge rescrambling code.
There are several uses cases in which a simple subset will do:
# Use the key variable as row.names
row.names(df.1) = df.1$key
# Sort df.1 so that it's rows match df.2
df.3 = df.1[df.2$key, ]
# Create a data.frame with cariables from df.1 and (the sorted) df.2
df.4 = cbind(df.1, df.3)
This code will preserve df.2 and it's order and add only matching data from df.1
If only one variable is to be added, the cbind() ist not required:
row.names(df.1) = df.1$key
df.2$data = df.1[df.2$key, "data"]
I had the same problem with it but I simply used a dummy vector c(1:5) applied to a new column 'num'
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
df.2$num <- c(1:5) # This range you can order in the last step.
dfm <- merge(df.2, df.1) # merged
dfm <- dfm[order(dfm$num),] # ascending order
There may be a more efficient way in base. This would be fairly simple to make into a function.
varorder <- names(mydata) # --- Merge
mydata <- merge(mydata, otherData, by="commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])
mydata[c(varorder,restOfvars)]

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d")
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match. :
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, specially with an example of expected output. Nice one!
You have several options, but lets look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.
As r2evens answered, he used the stringsAsFactors to disable using strings as Factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will be disabled as default in forthcoming R4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
Call df1$x %in% df2$x returns a logical vector (boolean) of which elements in df1$x are found ind df2 - i.e. the first two and not the second two. But it doesn't relate which positions in the first vector corresponds to which in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to which indices were TRUE. Again, we do not now which elements correspond to which.
For solutions, I would recommend r2evans, as it doesn't rely on extra packages (although data.table or dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
# df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (in R 4.0.0 - it is by default though) or else we need to have all the levels common in both datasets

Drop columns when splitting data frame in R

I am trying to split data table by column, however once I get list of data tables, they still contains the column which data table was split by. How would I drop this column once the split is complete. Or more preferably, is there a way how do I drop multiple columns.
This is my code:
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(10, mean = 5, sd = 2)
z <- sample(5, 10, replace = TRUE)
dt <- data.table(x, y, z)
split(dt, dt$z)
The resulting data table subsets looks like that
$`1`
x y z
1: 6.179790 5.776683 1
2: 5.725441 4.896294 1
3: 8.690388 5.394973 1
$`2`
x y z
1: 5.768285 3.951733 2
2: 4.572454 5.487236 2
$`3`
x y z
1: 5.183101 8.328322 3
2: 2.830511 3.526044 3
$`4`
x y z
1: 5.043010 5.566391 4
2: 5.744546 2.780889 4
$`5`
x y z
1: 6.771102 0.09301977 5
Thanks
Splitting a data.table is really not worthwhile unless you have some fancy parallelization step to follow. And even then, you might be better off sticking with a single table.
That said, I think you want
split( dt[, !"z"], dt$z )
# or more generally
mysplitDT <- function(x, bycols)
split( x[, !..bycols], x[, ..bycols] )
mysplitDT(dt, "z")
You would run into the same problem if you had a data.frame:
df = data.frame(dt)
split( df[-which(names(df)=="z")], df$z )
First thing that came to mind was to iterate through the list and drop the z column.
lapply(split(dt, dt$z), function(d) { d$z <- NULL; d })
And I just noticed that you use the data.table package, so there is probably a better, data.table way of achieving your desired result.

Function for pasting corrected values inside existing dataframe

Does something like the 'paste_over' function below already exist within base R or one of the standard R packages?
paste_over <- function(original, corrected, key){
corrected <- corrected[order(corrected[[key]]),]
output <- original
output[
original[[key]] %in% corrected[[key]],
names(corrected)
] <- corrected
return(output)
}
An example:
D1 <- data.frame(
k = 1:5,
A = runif(5),
B = runif(5),
C = runif(5),
D = runif(5),
E = runif(5)
)
D2 <- data.frame(
k=c(4,1,3),
D=runif(3),
E=runif(3),
A=runif(3)
)
D2 <- D2[order(D2$k),]
D3 <- D1
D3[
D1$k %in% D2$k,
names(D2)
] <- D2
D4 <- paste_over(D1, D2, "k")
all(D4==D3)
In the example D2 contains some values that I want to paste over corresponding cells within D1. However D2 is not in the same order and does not have the same dimension as D1.
The motivation for this is that I was given a very large dataset, reported some errors within it, and received a subset of the original dataset with some corrected values. I would like to be able to 'paste over' the new, corrected values into the old dataset without changing the old dataset in terms of structure. (As the rest of the code I've written assume's the old dataset's structure.)
Although the paste_over function seems to work I can't help but think this must have been tackled before, and so maybe there's already a well known function that's both faster and has error checking. If there is then please let me know what it is.
Thanks.
We can accomplish this using data.table as follows:
setkeyv(setDT(D1), "k")
cols = c("D", "E", "A")
D1[D2, (cols) := D2[, cols]]
setDT() converts a data.frame to data.table by reference (without actually copying the data). We want D1 to be a data.table.
setkey() sorts the data.table by the column specified (here k) and marks that column as sorted (by setting the attribute sorted) by reference. This allows us to perform joins using binary search.
x[i] in data.table performs a join. You can read more about it here. Briefly, for each row of column k in D2, it finds the matching row indices in D1 by matching on D1's key column (here k).
x[i, LHS := RHS] performs the join to find matching rows, and the LHS := RHS part adds/updates x with the columns specified in LHS with the values specified in RHS by reference. LHS should be a a vector of column names or numbers, and RHS should be a list of values.
So, D1[D2, (cols) := D2[, cols]] finds matching rows in D1 for k=c(1,3,4) from D2 and updates the columns D,E,A specified in cols by the list (a data.frame is also a list) of corresponding columns from D2 on RHS.
D1 will now be modified in-place.
HTH
You could use the replacement method for data frames in your function, like this maybe. It does adequate checking for you. I chose to pass the logical row subset as an argument, but you can change that
pasteOver <- function(original, corrected, key) {
"[<-.data.frame"(original, key, names(corrected), corrected)
}
(p1 <- pasteOver(D1, D2, D1$k %in% D2$k))
k A B C D E
1 1 0.18827167 0.006275082 0.3754535 0.8690591 0.73774065
2 2 0.54335829 0.122160101 0.6213813 0.9931259 0.38941407
3 3 0.62946977 0.323090601 0.4464805 0.5069766 0.41443988
4 4 0.66155954 0.201218532 0.1345516 0.2990733 0.05296677
5 5 0.09400961 0.087096652 0.2327039 0.7268058 0.63687025
p2 <- paste_over(D1, D2, "k")
identical(p1, p2)
# [1] TRUE

Merge two data frames while keeping the original row order

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).
Here are some sample data (all values from class column are defined in both data frames):
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
If I do:
merge(df.2, df.1)
Output is:
class object prob
1 1 B 0.5
2 1 C 0.5
3 2 A 0.7
4 2 D 0.7
5 3 F 0.3
If I add sort = FALSE:
merge(df.2, df.1, sort = F)
Result is:
class object prob
1 2 A 0.7
2 2 D 0.7
3 1 B 0.5
4 1 C 0.5
5 3 F 0.3
But what I would like is:
class object prob
1 2 A 0.7
2 1 B 0.5
3 2 D 0.7
4 3 F 0.3
5 1 C 0.5
You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example :
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id <- 1:nrow(df.2)
out <- merge(df.2,df.1, by = "class")
out[order(out$id), ]
Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.
Using your example data, we would use join like this:
> join(df.2,df.1)
Joining by: class
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
Here are a couple of links describing fixes to the merge function for keeping the row order:
http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/
http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html
You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.
Using your example above, the inner_join result looks like this:
inner_join(df.2,df.1)
Joining by: "class"
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
From data.table v1.9.5+, you can do:
require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]
The performs a join on column class by finding out matching rows in df.1 for each row in df.2 and extracting corresponding columns.
For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:
library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]
object class prob
1: A 2 0.7
2: B 1 0.5
3: D 2 0.7
4: F 3 0.3
5: C 1 0.5
Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.
The accepted answer proposes a manual way to keep order when using merge, which works most of the times but requires unnecessary manual work. This solution comes on the back of How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:
This came up on the plyr mailing list a while back (raised by #kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
col <- ".sortColumn"
data[,col] <- 1:nrow(data)
out <- fn(data, ...)
if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
out <- out[order(out[,col]),]
out[,col] <- NULL
out
}
So now you can use this generic keeping.order function to keep the original row order of a merge call:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")
Which will yield, as requested:
> keeping.order(df.2, merge, y=df.1, by = "class")
class object id prob
3 2 A 1 0.7
1 1 B 2 0.5
4 2 D 3 0.7
5 3 F 4 0.3
2 1 C 5 0.5
So keeping.order effectively automates the approach in the accepted answer.
Thanks to #PAC , I came up with something like this:
merge_sameord = function(x, y, ...) {
UseMethod('merge_sameord')
}
merge_sameord.data.frame = function(x, y, ...) {
rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
x[, rstr] = 1:nrow(x)
res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
res = res[order(res[, rstr]), ]
res[, rstr] = NULL
res
}
This assumes that you want to preserve the order the first data frame, and the merged data frame will have the same number of rows as the first data frame. It will give you the clean data frame without extra columns.
In this specific case you could us factor for a compact base solution:
df.2$prob = factor(df.2$class,labels=df.1$prob)
df.2
# object class prob
# 1 A 2 0.7
# 2 B 1 0.5
# 3 D 2 0.7
# 4 F 3 0.3
# 5 C 1 0.5
Not a general solution however, it works if:
You have a lookup table containing unique values
You want to update a table, not create a new one
the lookup table is sorted by the merging column
The lookup table doesn't have extra levels
You want a left_join
If you're fine with factors
1 is not negotiable, for the rest we can do:
df.3 <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3
df.1b <- df.1b[df.1$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class,labels=df.1b$prob)
df.3 <- df3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.
For package developers
As a package developer, you want to be dependent on as few other packages as possible. Especially tidyverse functions, that change way too often for package developers IMHO.
To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).
left_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all = TRUE, ...)
}
full_join <- function(x, y, ...) {
merge_exec(x = x, y = y, ...)
}
# workhorse:
merge_exec <- function(x, y, ...) {
# set index
x$join_id_ <- 1:nrow(x)
# do the join
joined <- merge(x = x, y = y, sort = FALSE, ...)
# get suffices (yes, I prefer this over suffixes)
if ("suffixes" %in% names(list(...))) {
suffixes <- list(...)$suffixes
} else {
suffixes <- c("", "")
}
# get columns names in right order, so the 'by' column won't be forced first
cols <- unique(c(colnames(x),
paste0(colnames(x), suffixes[1]),
colnames(y),
paste0(colnames(y), suffixes[2])))
# get the original row and column index
joined[order(joined$join_id),
cols[cols %in% colnames(joined) & cols != "join_id_"]]
}
The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])
I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is
data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
row.names = NULL)
I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.
Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berend's clever post-merge rescrambling code.
There are several uses cases in which a simple subset will do:
# Use the key variable as row.names
row.names(df.1) = df.1$key
# Sort df.1 so that it's rows match df.2
df.3 = df.1[df.2$key, ]
# Create a data.frame with cariables from df.1 and (the sorted) df.2
df.4 = cbind(df.1, df.3)
This code will preserve df.2 and it's order and add only matching data from df.1
If only one variable is to be added, the cbind() ist not required:
row.names(df.1) = df.1$key
df.2$data = df.1[df.2$key, "data"]
I had the same problem with it but I simply used a dummy vector c(1:5) applied to a new column 'num'
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
df.2$num <- c(1:5) # This range you can order in the last step.
dfm <- merge(df.2, df.1) # merged
dfm <- dfm[order(dfm$num),] # ascending order
There may be a more efficient way in base. This would be fairly simple to make into a function.
varorder <- names(mydata) # --- Merge
mydata <- merge(mydata, otherData, by="commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])
mydata[c(varorder,restOfvars)]

Resources