Fill in data frame with values from rows above - r

Say I have a data frame like this:
ID, ID_2, FIRST, VALUE
-----------------------
'a', 'aa', TRUE, 2
'a', 'ab', FALSE, NA
'a', 'ac', FALSE, NA
'b', 'aa', TRUE, 5
'b', 'ab', FALSE, NA
So VALUE is only set once per ID, on the row where FIRST = TRUE. ID_2 may be duplicated between IDs, but doesn't have to be.
How do I put the numbers from the first rows of each ID into all rows of that ID, such that the VALUE column becomes 2, 2, 2, 5, 5?
I know I could simply loop over all IDs with a for loop, but I am looking for a more efficient way.

The question asks for efficiency compared with a loop. Here is a comparison of five solutions:
1. zoo::na.locf, which introduces a package dependency. It handles many edge cases, but requires that the 'blank' values are NA; the other solutions are easily adapted to non-NA blanks.
2. A simple loop in base R.
3. A recursive function in base R.
4. My own vectorised solution in base R.
5. The new fill() function in tidyr version 0.3.0, which works on data.frames.
Note that most of these solutions are for vectors, not data frames, so they don't check any ID column. If the data frame isn't grouped by ID, with the value to be filled down at the top of each group, then you could try a windowing function in dplyr or data.table; a minimal grouped sketch follows.
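For instance, here is a grouped-fill sketch using dplyr with tidyr's fill(), assuming the question's data frame is stored in df (my own addition, not one of the benchmarked solutions):
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>% # fill() operates within each ID group
  fill(VALUE) %>%  # carry the last non-NA VALUE downwards
  ungroup()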
# A popular solution
f1 <- zoo::na.locf
# A loop, adapted from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f2 <- function(x) {
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}
# Recursion, also from https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
f3 <- function(z) {
  y <- c(NA, head(z, -1))
  z <- ifelse(is.na(z), y, z)
  if (any(is.na(z))) Recall(z) else z
}
# My own effort
f4 <- function(x, blank = is.na) {
  # Find the values
  if (is.function(blank)) {
    isnotblank <- !blank(x)
  } else {
    isnotblank <- x != blank
  }
  # Fill down
  x[which(isnotblank)][cumsum(isnotblank)]
}
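# For example, f4 also fills down past non-NA blanks (my own illustration):
# f4(c("a", "", "b", ""), blank = "")
# [1] "a" "a" "b" "b"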
# fill() from tidyr version 0.3.0
library(tidyr)
f5 <- function(y) {
  fill(y, column)
}
# Test data, 2600 values, ~58% blanks
x <- rep(LETTERS, 100)
set.seed(2015-09-12)
x[sample(1:2600, 1500)] <- NA
x <- c("A", x) # Ensure the first element is not blank
y <- data.frame(column = x, stringsAsFactors = FALSE) # data.frame version of x for tidyr
# Check that they all work (they do)
identical(f1(x), f2(x))
identical(f1(x), f3(x))
identical(f1(x), f4(x))
identical(f1(x), f5(y)$column)
library(microbenchmark)
microbenchmark(f1(x), f2(x), f3(x), f4(x), f5(y))
Results:
Unit: microseconds
  expr      min        lq       mean    median        uq       max neval
 f1(x)  422.762  466.6355  508.57284  505.6760  527.2540   837.626   100
 f2(x) 2118.914 2206.7370 2501.04597 2312.8000 2497.2285  5377.018   100
 f3(x) 7800.509 7832.0130 8127.06761 7882.7010 8395.3725 14128.107   100
 f4(x)   52.841   58.7645   63.98657   62.1410   65.2655   104.886   100
 f5(y)  183.494  225.9380  305.21337  331.0035  350.4040   529.064   100

If you need only to carry forward the values from the VALUE column, then I think you can use the na.locf() function from the zoo package. Here is an example:
library(zoo)
a <- c(1, NA, NA, 2, NA)
na.locf(a)
[1] 1 1 1 2 2
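To apply it per ID group, you could combine it with ave() (a sketch; it assumes, as in your data, that each group starts with a non-NA value, because na.locf() drops leading NAs by default):
# fill VALUE downwards within each ID group
df$VALUE <- ave(df$VALUE, df$ID, FUN = na.locf)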

If the VALUE for a specific ID always appears in the first record, which seems to be the case for your data, you can use match to find that record:
df <- read.csv(textConnection("
ID, ID_2, FIRST, VALUE
'a', 'aa', TRUE, 2
'a', 'ab', FALSE, NA
'a', 'ac', FALSE, NA
'b', 'aa', TRUE, 5
'b', 'ab', FALSE, NA
"))
df$VALUE <- df$VALUE[match(df$ID, df$ID)]
df
# ID ID_2 FIRST VALUE
# 1 'a' 'aa' TRUE 2
# 2 'a' 'ab' FALSE 2
# 3 'a' 'ac' FALSE 2
# 4 'b' 'aa' TRUE 5
# 5 'b' 'ab' FALSE 5
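If the row with FIRST = TRUE were not necessarily the first record of its ID, the same match() idea could be restricted to the FIRST rows (a sketch, assuming FIRST was read in as a logical column):
# look up each row's ID among the FIRST rows only
df$VALUE <- df$VALUE[df$FIRST][match(df$ID, df$ID[df$FIRST])]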

+1 for @nacnudus
This version handles leading blanks:
f4 <- function(x, blank = is.na) {
  # Find the values
  if (is.function(blank)) {
    isnotblank <- !blank(x)
  } else {
    isnotblank <- x != blank
  }
  # Fill down
  xfill <- cumsum(isnotblank)
  xfill[xfill == 0] <- NA
  # Replace blanks; leading positions are already NA when blank
  # is a function such as is.na, so only restore a literal blank value
  xnew <- x[which(isnotblank)][xfill]
  if (!is.function(blank)) xnew[is.na(xnew)] <- blank
  return(xnew)
}


Data frame: How to compare current row to some other rows without looping?

I have the following df and use case: I'd like to find, and set something in, all rows for which there exists another row satisfying a condition, e.g.
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'))
> df
X Y
1 a a
2 b c
3 c d
I'd like to find those rows whose Y value is the same as the X value in another row. In the example above, row #2 satisfies the condition because its Y = c and row #3 has X = c. Note that row #1 does not satisfy the condition, because its only match is in the same row.
Something like:
df$Flag <- find(df, Y == X_in_another_row(df))
Solution 1
For each Y, we check if any value in X (other than in the same row) matches.
sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i])
#[1] FALSE TRUE FALSE
If indices are necessary, wrap the whole thing in which
which(sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i]))
#[1] 2
Solution 2 (not tested well)
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = FALSE)
temp = outer(df$X, df$Y, "==") #Check equality among values of X and Y
diag(temp) = FALSE #Set diagonal values as FALSE (for same row)
colSums(temp) > 0
#[1] FALSE TRUE FALSE
I think this should work:
which(match(df$Y, df$X) != 1:nrow(df))
df <- data.frame(X= c(1,2,3,4,5,3,2,1), Y = c(1,2,3,4,5,6,7,8))
which(with(df, (X %in% Y) & (X != Y)))
This also works on the original data.frame, if we set stringsAsFactors = FALSE:
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = F)
which(with(df, (X %in% Y) & (X != Y)))
Quite convoluted but I'll put it here anyway. This should work even if there are repeated values in X.
For example with the following dataframe df2:
df2 = data.frame(X=c('a','b','c','a','d'), Y=c('a','c','d','e','b'))
X Y
1 a a
2 b c
3 c d
4 a e
5 d b
## Specifying the same factor levels allows us to get a square matrix
df2$X = factor(df2$X,levels=union(df2$X,df2$Y))
df2$Y = factor(df2$Y,levels=union(df2$X,df2$Y))
m = as.matrix(table(df2))
valY = rowSums(m)*colSums(m)-diag(m)
which(df2$Y %in% names(valY)[as.logical(valY)])
[1] 1 2 3 5
Essentially you want to know whether Y is in X but you want the condition to be FALSE when X == Y:
df$Z <- with(df, (Y != X) & (Y %in% X))
# Assume you want to use position 4, value 'c', to find all the rows where Y is 'c'
df <- data.frame(X = c('a', 'b', 'd', 'c'),
                 Y = c('a', 'c', 'c', 'd'))
row <- 4 # assume the desired row is position 4
val <- as.character(df[row, 'X']) # get the value and coerce it to character
df[df$Y == val, ]
# Result
# X Y
# 2 b c
# 3 d c

extract highest and lowest values for columns in R, as well as row identifiers

Say I have some data of the following kind:
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame will have dimensions m by 10, where m is very likely more than 20, but every column will contain only 20 valid values.
The only way I can think of doing this is doing it manually per column, using dplyr and arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
# function to keep only the lowest 10 and highest 10 values
lowHigh <- function(x)
{
  test <- x
  # rank() rather than order(), so positions refer to the original vector
  test[!(rank(x) <= 10 | rank(x) >= length(x) - 9)] <- NA
  test
}
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows where there is value of interest
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
Please note that there is definitely some smarter way of doing it...
Here is the data matrix with 10 highest and 10 lowest in each column,
x<-apply(df,2,function(k) k[order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))]])
x is your 20 by 10 matrix.
Your rownames requirement conflicts column by column: this matrix has only 20 rownames altogether, and they cannot be the same for all 10 columns. Instead, here is your order matrix,
x_roworder<-apply(df,2,function(k) order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))])
This will give you corresponding rows in original data matrix within each column.
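If you want the actual row identifiers rather than integer positions, you can translate the order matrix into row names (a sketch built on x_roworder above):
# each column holds the row names of that column's 10 highest and 10 lowest values
x_rownames <- apply(x_roworder, 2, function(i) rownames(df)[i])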
I offer a couple of answers to this.
A base R implementation (I have used %>% from the magrittr package to make it easier to read):
library(magrittr)
ix = lapply(df, function(x) order(x)[-(1:(length(x)-20)+10)]) %>%
  unlist %>% unique %>% sort
df[ix, ]
This abuses the fact that data frames are lists, finds the row ids satisfying the condition for each column, then takes the unique ones in order as the row indices you want to keep. This should retain any row names attached to df.
An alternative using dplyr (since you mentioned it), which if I remember correctly doesn't particularly like row names:
library(dplyr)
library(tidyr)
# add id as a variable
df$id = 1:nrow(df) # or row names
df %>%
  gather("col", value, -id) %>%
  group_by(col) %>%
  filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
  ungroup %>%
  select(id) %>%
  left_join(df)
Edited: To fix code alignment and make a neater filter
I'm not entirely sure what you're expecting for your return / output. But this will get you the appropriate indices
# example data
set.seed(41234L)
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract ID's for top 10 and bottom 10 values
l1 <- lapply(df[, 2:11], function(x, y, n) {
  xy <- data.frame(x, y)
  xy <- xy[order(xy[, 1]), ]
  return(xy[c(1:10, (n-9):n), 2])
}, y = df[, 1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't 0 / NA the columns which you're not searching on for each row. But it's unclear what your question is trying to do.
Note
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
Benchmark
library(microbenchmark)
microbenchmark(ira = {
  test2 <- apply(df[, 2:11], 2, lowHigh)
  rownames(test2) <- rownames(df)
  finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex = {
  l1 <- lapply(df[, 2:11], function(x, y, n) {
    xy <- data.frame(x, y)
    xy <- xy[order(xy[, 1]), ]
    return(xy[c(1:10, (n-9):n), 2])
  }, y = df[, 1], n = N)
  l2 <- unique(do.call("c", l1))
  df2 <- df[l2, ]
}, times = 50L)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
  ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525    50   b
 alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280    50  a

R Minimum Value from Datatable Not Equal to a Particular Value

How do I find the minimum value from an R data table other than a particular value?
For example, there could be zeroes in the data table and the goal would be to find the minimum non zero value.
I tried using sapply with min, but am not sure how to specify the extra criterion that the minimum must not equal a certain value.
More generally, how do we find the minimum from a data table that is not equal to any element from a list of possible values?
If you want to find the minimum value from a vector while excluding certain values from that vector, then you can use %in%:
v <- c(1:10) # values 1 .. 10
v.exclude <- c(1, 2) # exclude the values 1 and 2 from consideration
min.exclude <- min(v[!v %in% v.exclude])
The logic won't change much if you are using a column from a data table/frame. In this case you can just replace the vector v with the appropriate column. If you have your excluded values in a list, then you can flatten it to produce your v.exclude vector, as sketched below.
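A minimal sketch of that flattening, applied column by column (the exclusion values here are illustrative, mirroring the example data further down):
exclude_list <- list(0, 1, c(2, 3)) # excluded values held in a list
v.exclude <- unlist(exclude_list)   # flatten to a vector
sapply(df, function(col) min(col[!col %in% v.exclude]))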
This can be done with data.table (as the OP mentioned data table in the post) after setting the key:
library(data.table)
setDT(df, key='a')[!.(exclude)]
# a b
#1: 4 40
#2: 5 50
#3: 6 60
If we need the min value of 'a'
min(setDT(df, key='a')[!.(exclude)]$a)
#[1] 4
For finding the min in all the columns (using the setkey method), we loop over the columns of the dataset, set the key to each column in turn, subset the dataset, and store the min value in a previously created list object.
setDT(df)
MinVal <- vector('list', length(df))
for (j in seq_along(df)) {
  setkeyv(df, names(df)[j])
  MinVal[[j]] <- min(df[!.(exclude)][[j]])
}
MinVal
#[[1]]
#[1] 4
#[[2]]
#[1] 10
data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
b = c(10,10,20,20,30,30,40,40,50,60))
exclude <- c(0,1,2,3)
Assuming you are working with a data.frame
Data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
b = c(10,10,20,20,30,30,40,40,50,60))
Values to exclude from our minimum search
exclude <- c(0,1,2,3)
we can find the minimum value from column a excluding our exclude vector
## minimum from column a
min(df[!df$a %in% exclude,]$a)
# [1] 4
Or from b
exclude <- c(10, 20, 30, 40)
min(df[!df$b %in% exclude,]$b)
# [1] 50
To return the row that corresponds to the minimum value
df[df$b == min( df[ !df$b %in% exclude, ]$b ),]
# a b
# 9 5 50
Update
To find the minimum across multiple rows we can do it this way:
## values to exclude
exclude_a <- c(0,1)
exclude_b <- c(10)
## exclude rows/values from each column we don't want
df2 <- df[!(df$a %in% exclude_a) & !(df$b %in% exclude_b),]
## order the data
df3 <- df2[with(df2, order(a,b)),]
## take the first row
df3[1,]
# > df3[1,]
# a b
#4 2 20
Update 2
To select from multiple columns we can iterate over them as @akrun has shown, or alternatively we can construct our subsetting formula using an expression and evaluate it inside our [ operation
exclude <- c(0,1,2, 10)
## construct a formula/expression using the column names
n <- names(df)
expr <- paste0("(", paste0(" !(df$", n, " %in% exclude) ", collapse = "&") ,")")
# [1] "( !(df$a %in% exclude) & !(df$b %in% exclude) )"
expr <- parse(text=expr)
df2 <- df[eval(expr),]
## order and select first row as before
df2 <- df2[with(df2, order(a,b)),]
df2 <- df2[1,]
And if we wanted to use data.table for this:
library(data.table)
setDT(df)[ eval(expr) ][order(a, b),][1,]
comparison of methods
library(microbenchmark)
fun_1 <- function(x) {
  df2 <- x[eval(expr), ]
  ## order and select first row as before
  df2 <- df2[with(df2, order(a, b)), ]
  df2 <- df2[1, ]
  return(df2)
}
fun_2 <- function(x) {
  df2 <- setDT(x)[eval(expr)][order(a, b), ][1, ]
  return(df2)
}
## including @akrun's solution
fun_3 <- function(x) {
  setDT(x)
  MinVal <- vector('list', length(x))
  for (j in seq_along(x)) {
    setkeyv(x, names(x)[j])
    MinVal[[j]] <- min(x[!.(exclude)][[j]])
  }
  return(MinVal)
}
microbenchmark(fun_1(df), fun_2(df), fun_3(df) , times=1000)
# Unit: microseconds
#      expr      min        lq      mean   median        uq      max neval
# fun_1(df)  770.376  804.5715  866.3499  833.071  869.2195 2728.740  1000
# fun_2(df)  854.862  893.1220  952.1207  925.200  962.6820 3115.119  1000
# fun_3(df) 1108.316 1148.3340 1233.1268 1186.938 1234.3570 5400.544  1000

Unstacking a stacked dataframe unstacks columns in a different order

Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?
The last few lines of the stack function (see utils:::stack.data.frame) create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
  if (!missing(select)) {
    nl <- as.list(1L:ncol(x))
    names(nl) <- names(x)
    vars <- eval(substitute(select), nl, parent.frame())
    x <- x[, vars, drop = FALSE]
  }
  keep <- unlist(lapply(x, is.vector))
  if (!sum(keep))
    stop("no vector columns were selected")
  if (!all(keep))
    warning("non-vector columns will be ignored")
  x <- x[, keep, drop = FALSE]
  data.frame(values = unlist(unname(x)),
             # REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
             # AND ADD THIS:
             ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
             stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE
You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-)create the columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE

Merge two data frames while keeping the original row order

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).
Here are some sample data (all values from class column are defined in both data frames):
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
If I do:
merge(df.2, df.1)
Output is:
class object prob
1 1 B 0.5
2 1 C 0.5
3 2 A 0.7
4 2 D 0.7
5 3 F 0.3
If I add sort = FALSE:
merge(df.2, df.1, sort = F)
Result is:
class object prob
1 2 A 0.7
2 2 D 0.7
3 1 B 0.5
4 1 C 0.5
5 3 F 0.3
But what I would like is:
class object prob
1 2 A 0.7
2 1 B 0.5
3 2 D 0.7
4 3 F 0.3
5 1 C 0.5
You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id <- 1:nrow(df.2)
out <- merge(df.2,df.1, by = "class")
out[order(out$id), ]
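For this particular case, a match()-based lookup (my own addition, not part of the answer above) avoids merge() entirely and trivially preserves the row order of df.2:
# copy prob from the row of df.1 whose class matches
df.2$prob <- df.1$prob[match(df.2$class, df.1$class)]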
Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.
Using your example data, we would use join like this:
> join(df.2,df.1)
Joining by: class
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
Here are a couple of links describing fixes to the merge function for keeping the row order:
http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/
http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html
You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.
Using your example above, the inner_join result looks like this:
inner_join(df.2,df.1)
Joining by: "class"
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
From data.table v1.9.5+, you can do:
require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]
This performs a join on column class, finding the matching rows in df.1 for each row in df.2 and extracting the corresponding columns.
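On the example data this should return df.2's rows in their original order, with df.1's columns first:
   class prob object
1:     2  0.7      A
2:     1  0.5      B
3:     2  0.7      D
4:     3  0.3      F
5:     1  0.5      C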
For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:
library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]
object class prob
1: A 2 0.7
2: B 1 0.5
3: D 2 0.7
4: F 3 0.3
5: C 1 0.5
Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.
The accepted answer proposes a manual way to keep order when using merge, which works most of the time but requires unnecessary manual work. This solution comes on the back of How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
# Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  data[, col] <- 1:nrow(data)
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[, col]), ]
  out[, col] <- NULL
  out
}
So now you can use this generic keeping.order function to keep the original row order of a merge call:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")
Which will yield, as requested:
> keeping.order(df.2, merge, y=df.1, by = "class")
class object id prob
3 2 A 1 0.7
1 1 B 2 0.5
4 2 D 3 0.7
5 3 F 4 0.3
2 1 C 5 0.5
So keeping.order effectively automates the approach in the accepted answer.
Thanks to @PAC, I came up with something like this:
merge_sameord = function(x, y, ...) {
  UseMethod('merge_sameord')
}
merge_sameord.data.frame = function(x, y, ...) {
  rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace = TRUE), collapse = '')
  x[, rstr] = 1:nrow(x)
  res = merge(x, y, all.x = TRUE, sort = FALSE, ...)
  res = res[order(res[, rstr]), ]
  res[, rstr] = NULL
  res
}
This assumes that you want to preserve the order the first data frame, and the merged data frame will have the same number of rows as the first data frame. It will give you the clean data frame without extra columns.
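Usage mirrors merge(), e.g. on the question's data:
merge_sameord(df.2, df.1, by = "class")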
In this specific case you could use factor for a compact base solution:
df.2$prob = factor(df.2$class,labels=df.1$prob)
df.2
# object class prob
# 1 A 2 0.7
# 2 B 1 0.5
# 3 D 2 0.7
# 4 F 3 0.3
# 5 C 1 0.5
Not a general solution, however. It works if:
1. You have a lookup table containing unique values
2. You want to update a table, not create a new one
3. The lookup table is sorted by the merging column
4. The lookup table doesn't have extra levels
5. You want a left_join
6. You're fine with factors
1 is not negotiable; for the rest we can do:
df.3 <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class), ] # deal with 3.
df.1b <- df.1b[df.1b$class %in% df.2$class, ] # deal with 4.
df.3$prob = factor(df.3$class, labels = df.1b$prob)
df.3 <- df.3[!is.na(df.3$prob), ] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.
For package developers
As a package developer, you want to depend on as few other packages as possible. Especially tidyverse functions, which change way too often for package developers IMHO.
To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).
left_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
  # merge()'s default, all = FALSE, is an inner join
  merge_exec(x = x, y = y, ...)
}
full_join <- function(x, y, ...) {
  # all = TRUE gives a full outer join
  merge_exec(x = x, y = y, all = TRUE, ...)
}
# workhorse:
merge_exec <- function(x, y, ...) {
  # set index
  x$join_id_ <- 1:nrow(x)
  # do the join
  joined <- merge(x = x, y = y, sort = FALSE, ...)
  # get suffices (yes, I prefer this over suffixes)
  if ("suffixes" %in% names(list(...))) {
    suffixes <- list(...)$suffixes
  } else {
    suffixes <- c("", "")
  }
  # get column names in the right order, so the 'by' column won't be forced first
  cols <- unique(c(colnames(x),
                   paste0(colnames(x), suffixes[1]),
                   colnames(y),
                   paste0(colnames(y), suffixes[2])))
  # restore the original row order and drop the index column
  joined[order(joined$join_id_),
         cols[cols %in% colnames(joined) & cols != "join_id_"]]
}
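A quick sanity check with the question's data (using the functions defined above):
left_join(df.2, df.1, by = "class")
# rows come back in df.2's original order: A, B, D, F, C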
The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])
I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is
data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
row.names = NULL)
I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.
Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berend's clever post-merge rescrambling code.
There are several use cases in which a simple subset will do:
# Use the key variable as row.names
row.names(df.1) = df.1$key
# Sort df.1 so that its rows match df.2
df.3 = df.1[df.2$key, ]
# Create a data.frame with variables from df.2 and (the sorted) df.1
df.4 = cbind(df.2, df.3)
This code will preserve df.2 and its order and add only the matching data from df.1.
If only one variable is to be added, the cbind() is not required:
row.names(df.1) = df.1$key
df.2$data = df.1[df.2$key, "data"]
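Applied to the question's data, that looks like this (a sketch; the key is indexed as character, because a numeric index would subset by position rather than by row name):
row.names(df.1) <- df.1$class
df.2$prob <- df.1[as.character(df.2$class), "prob"]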
I had the same problem, but I simply used a dummy vector c(1:5) applied to a new column 'num':
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
df.2$num <- 1:5 # a range you can order by in the last step
dfm <- merge(df.2, df.1) # merged
dfm <- dfm[order(dfm$num), ] # ascending order
There may be a more efficient way in base R. This would be fairly simple to make into a function.
varorder <- names(mydata) # record the column order before merging
mydata <- merge(mydata, otherData, by = "commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])
mydata[c(varorder, restOfvars)]
