Unstacking a stacked dataframe unstacks columns in a different order

Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?

The last few lines of the stack function (see utils:::stack.data.frame) create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
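To make the effect of the levels argument concrete, here is a minimal sketch (the variable names x, default_levels and kept_levels are only illustrative). It contrasts the default, sorted levels with levels supplied in order of first appearance via unique().

```r
x <- paste0("A", c(1, 2, 3, 10, 4))

# Default: levels are the sorted unique values (a character sort,
# so "A10" typically comes before "A2")
default_levels <- levels(factor(x))

# Explicit: levels follow the order of first appearance in x
kept_levels <- levels(factor(x, levels = unique(x)))

print(default_levels)
print(kept_levels)
```

Passing unique(x) as the levels is what keeps the original ordering; nothing else about the factor changes.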
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
    if (!missing(select)) {
        nl <- as.list(1L:ncol(x))
        names(nl) <- names(x)
        vars <- eval(substitute(select), nl, parent.frame())
        x <- x[, vars, drop = FALSE]
    }
    keep <- unlist(lapply(x, is.vector))
    if (!sum(keep))
        stop("no vector columns were selected")
    if (!all(keep))
        warning("non-vector columns will be ignored")
    x <- x[, keep, drop = FALSE]
    data.frame(values = unlist(unname(x)),
               # REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
               # AND ADD THIS:
               ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
               stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE

You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-) create columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE
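The re-levelling step can also be wrapped in a small helper. This is only a sketch: the function name stack_keep_order is hypothetical, and it assumes every column of the input is a plain vector of the same length, as in the example above.

```r
# Hypothetical helper: stack a data frame, then re-level 'ind' so that
# its levels follow the original column order instead of sorted order.
stack_keep_order <- function(df) {
  stacked <- stack(df)
  stacked$ind <- factor(stacked$ind, levels = names(df))
  stacked
}

# Twelve columns, V1..V12, whose names do not sort in column order
a <- as.data.frame(do.call(cbind, lapply(1:12, function(x) c(1, 2, 3))))
b <- unstack(stack_keep_order(a))
print(identical(names(a), names(b)))
```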

Related

How to rbind several named dataframes but keep only common columns?

I have several data frames named a32, a33,..., a63 in the namespace which I have to rbind to a single dataframe. Each has several (about 20) columns. They were supposed to have common column names but unfortunately a few have some columns missing. This leads to an error when I try to rbind them.
l <- 32:63
l<- as.character(l) ## create a list of characters
A <- do.call(rbind.data.frame,mget(paste0("a",l))) ## "colnames not matching" error
Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors(), :
numbers of columns of arguments do not match
I want to rbind them by only taking the common columns. I tried using paste0 inside a for loop to list column names for all dataframes and see which dataframes have missing columns but got nowhere. How can I avoid manually searching for missing columns by listing column names of each data frame one-by-one?
As a small example, say:
a32 <- data.frame(AB = 1, CD = 2, EF = 3, GH = 4)
a33 <- data.frame(AB = 6, EF = 7)
a34 <- data.frame(AB = 8, CD = 9, EF = 10, GH = 11)
a35 <- data.frame(AB = 12,CD = 13, GH = 14)
a36 <- data.frame(AB = 15,CD = 16,EF = 17,GH = 18)
and so on
Is there an efficient way to rbind all the 32 data frames in the namespace?

1. Get the data frames in a list.
2. Find the common columns using Reduce + intersect.
3. Subset each data frame in the list to the common columns.
4. Combine all the data together.
list_data <- mget(paste0("a",l))
common_cols <- Reduce(intersect, lapply(list_data, colnames))
result <- do.call(rbind, lapply(list_data, `[`, common_cols))
You can also make use of purrr::map_df which will make this shorter.
result <- purrr::map_df(list_data, `[`, common_cols)
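As a self-contained check of the steps above, here is a minimal sketch using three of the example data frames from the question. The list is built by hand instead of with mget so the block runs on its own, and drop = FALSE is added as a precaution (an addition not in the answer) so that the subset stays a data frame even when only one column is common.

```r
a32 <- data.frame(AB = 1, CD = 2, EF = 3, GH = 4)
a33 <- data.frame(AB = 6, EF = 7)
a35 <- data.frame(AB = 12, CD = 13, GH = 14)

list_data <- list(a32 = a32, a33 = a33, a35 = a35)

# Columns present in every data frame
common_cols <- Reduce(intersect, lapply(list_data, colnames))

# Keep only those columns, then stack the rows
result <- do.call(
  rbind,
  lapply(list_data, function(d) d[, common_cols, drop = FALSE])
)
print(result)
```

Here only AB appears in all three data frames, so the result has a single column.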

A base R solution:
# get the names of the data frames from the workspace
dat_names <- ls(pattern = "^a[0-9]{2}$")
# get the data
df <- lapply(dat_names, get)
# get the common columns
common_col <- Reduce(intersect, lapply(df, colnames))
# select the common columns and rbind
dat <- lapply(df, function(x) x[, common_col, drop = FALSE])
dat <- do.call("rbind", dat)
dat
#   AB
# 1  1
# 2  6
# 3  8
# 4 12
# 5 15

Problem in merging a gff file and a csv file in R [duplicate]

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).
Here are some sample data (all values from class column are defined in both data frames):
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
If I do:
merge(df.2, df.1)
Output is:
class object prob
1 1 B 0.5
2 1 C 0.5
3 2 A 0.7
4 2 D 0.7
5 3 F 0.3
If I add sort = FALSE:
merge(df.2, df.1, sort = F)
Result is:
class object prob
1 2 A 0.7
2 2 D 0.7
3 1 B 0.5
4 1 C 0.5
5 3 F 0.3
But what I would like is:
class object prob
1 2 A 0.7
2 1 B 0.5
3 2 D 0.7
4 3 F 0.3
5 1 C 0.5

You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id <- 1:nrow(df.2)
out <- merge(df.2,df.1, by = "class")
out[order(out$id), ]

Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.
Using your example data, we would use join like this:
> join(df.2,df.1)
Joining by: class
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
Here are a couple of links describing fixes to the merge function for keeping the row order:
http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/
http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html

You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.
Using your example above, the inner_join result looks like this:
inner_join(df.2,df.1)
Joining by: "class"
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5

From data.table v1.9.5+, you can do:
require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]
This performs a join on column class by finding the matching rows in df.1 for each row in df.2 and extracting the corresponding columns.

For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:
library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]
object class prob
1: A 2 0.7
2: B 1 0.5
3: D 2 0.7
4: F 3 0.3
5: C 1 0.5
Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.

The accepted answer proposes a manual way to keep order when using merge, which works most of the time but requires unnecessary manual work. This solution comes on the back of How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  data[,col] <- 1:nrow(data)
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[,col]),]
  out[,col] <- NULL
  out
}
So now you can use this generic keeping.order function to keep the original row order of a merge call:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")
Which will yield, as requested:
> keeping.order(df.2, merge, y=df.1, by = "class")
class object prob
3 2 A 0.7
1 1 B 0.5
4 2 D 0.7
5 3 F 0.3
2 1 C 0.5
So keeping.order effectively automates the approach in the accepted answer.

Thanks to @PAC, I came up with something like this:
merge_sameord = function(x, y, ...) {
  UseMethod('merge_sameord')
}
merge_sameord.data.frame = function(x, y, ...) {
  # random column name, so it is unlikely to clash with an existing column
  rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
  x[, rstr] = 1:nrow(x)
  res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
  res = res[order(res[, rstr]), ]
  res[, rstr] = NULL
  res
}
This assumes that you want to preserve the order of the first data frame, and that the merged data frame will have the same number of rows as the first data frame. It will give you a clean data frame without extra columns.

In this specific case you could use factor for a compact base solution:
df.2$prob = factor(df.2$class,labels=df.1$prob)
df.2
# object class prob
# 1 A 2 0.7
# 2 B 1 0.5
# 3 D 2 0.7
# 4 F 3 0.3
# 5 C 1 0.5
This is not a general solution, however. It works if:
1. You have a lookup table containing unique values
2. You want to update a table, not create a new one
3. The lookup table is sorted by the merging column
4. The lookup table doesn't have extra levels
5. You want a left_join
6. You're fine with factors
1 is not negotiable; for the rest we can do:
df.3 <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3.
df.1b <- df.1b[df.1b$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class, levels=df.1b$class, labels=df.1b$prob)
df.3 <- df.3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.

For package developers
As a package developer, you want to depend on as few other packages as possible, especially tidyverse functions, which change way too often for package developers IMHO.
To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing about merge()).
left_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
  # merge() keeps only matching rows by default (all = FALSE)
  merge_exec(x = x, y = y, ...)
}
full_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all = TRUE, ...)
}
# workhorse:
merge_exec <- function(x, y, ...) {
  # set index
  x$join_id_ <- seq_len(nrow(x))
  # do the join
  joined <- merge(x = x, y = y, sort = FALSE, ...)
  # get suffixes, falling back to merge()'s defaults
  if ("suffixes" %in% names(list(...))) {
    suffixes <- list(...)$suffixes
  } else {
    suffixes <- c(".x", ".y")
  }
  # get column names in the right order, so the 'by' column won't be forced first
  cols <- unique(c(colnames(x),
                   paste0(colnames(x), suffixes[1]),
                   colnames(y),
                   paste0(colnames(y), suffixes[2])))
  # restore the original row and column order
  joined[order(joined$join_id_),
         cols[cols %in% colnames(joined) & cols != "join_id_"],
         drop = FALSE]
}

The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])
I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is
data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
row.names = NULL)
I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.
Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berends' clever post-merge rescrambling code.
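A minimal sketch of the match-based lookup, under the assumption that df.1$class has no duplicates. The extra object "G" with class 4 is an addition (not in the question) to show that a class missing from df.1 yields NA, i.e. the lookup behaves like a left join.

```r
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c("A", "B", "D", "F", "C", "G"),
                   class  = c(2, 1, 2, 3, 1, 4))

# Position in df.1 of each class in df.2 (NA where there is no match)
idx <- match(df.2$class, df.1$class)

# Rows stay in df.2's order; unmatched rows get prob = NA
out <- cbind(df.2, prob = df.1$prob[idx])
print(out)
```

Because match returns one position per row of df.2, the row order of df.2 is preserved without any sorting step.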

There are several use cases in which a simple subset will do:
# Use the key variable as row.names
row.names(df.1) = df.1$key
# Sort df.1 so that its rows match df.2 (index by name, not by position)
df.3 = df.1[as.character(df.2$key), ]
# Create a data.frame with the variables from df.2 and (the sorted) df.1
df.4 = cbind(df.2, df.3)
This code will preserve df.2 and its order and add only matching data from df.1.
If only one variable is to be added, the cbind() is not required:
row.names(df.1) = df.1$key
df.2$data = df.1[as.character(df.2$key), "data"]

I had the same problem, but I simply stored a dummy index vector in a new column 'num':
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
df.2$num <- seq_len(nrow(df.2)) # Row index used to restore the order in the last step.
dfm <- merge(df.2, df.1) # merged
dfm <- dfm[order(dfm$num),] # ascending order

There may be a more efficient way in base. This would be fairly simple to make into a function.
varorder <- names(mydata) # --- Merge
mydata <- merge(mydata, otherData, by="commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])
mydata[c(varorder,restOfvars)]

Identify difference in 2 data frame with missing values

Suppose I have 2 data frames:
a1 <- data.frame(a = 1:5, b=2:6)
a2 <- data.frame(a = 1:5, b=c(2:5,NA))
I would like to identify which columns are not identical (I will need the column number later). I thought that this would do the trick:
apply(!a1==a2, 2, sum, na.rm=TRUE)
However, because the last entry in a2 is an NA, it doesn't work.

Not sure why you're using sum, but to identify which columns are not identical you could use mapply with identical and negate the result.
which(!mapply(identical, a1, a2))
# b
# 2
for the column number. Or more simply for use in a column subset
!mapply(identical, a1, a2)
# a b
# FALSE TRUE
Just as a note, the word identical has a meaning in R that may be different from the result of ==, so it's possible you may need to clarify your question a bit.
x <- 1
y <- 1L
x == y
# [1] TRUE
identical(x, y)
# [1] FALSE
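If identical is too strict for the data at hand (for example when one column is integer and the other double), a more tolerant comparison can be sketched with all.equal wrapped in isTRUE. This is one possible variant rather than part of the answer above; the name differs is illustrative, and a2$a is deliberately stored as double here to show the difference.

```r
a1 <- data.frame(a = 1:5, b = 2:6)
a2 <- data.frame(a = as.numeric(1:5), b = c(2:5, NA))

# all.equal() ignores the integer/double distinction but still reports
# a mismatch where only one side is NA; isTRUE() turns its result
# (TRUE or a character description) into a plain TRUE/FALSE
differs <- !mapply(function(x, y) isTRUE(all.equal(x, y)), a1, a2)
print(which(differs))
```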

If you wanted to use sum, you could try
colSums(a1==a2, na.rm=TRUE)!=nrow(a1)
# a b
#FALSE TRUE
Or using your code
apply(a1==a2, 2, sum, na.rm=TRUE)!=nrow(a1)
# a b
#FALSE TRUE

data.table loses factor ordering after rbind, R

When rbinding two data.table with ordered factors, the ordering seems to be lost:
dtb1 = data.table(id = factor(c("a", "b"), levels = c("a", "c", "b"), ordered=T), key="id")
dtb2 = data.table(id = factor(c("c"), levels = c("a", "c", "b"), ordered=T), key="id")
test = rbind(dtb1, dtb2)
is.ordered(test$id)
#[1] FALSE
Any thoughts or ideas?

data.table does some fancy footwork that means that data.table:::.rbind.data.table is called when rbind is called on objects including data.tables. .rbind.data.table utilizes the speedups associated with rbindlist, with a bit of extra checking to match by name etc.
.rbind.data.table deals with factor columns by using c to combine them (hence retaining the levels attribute)
# the relevant code is
l = lapply(seq_along(allargs[[1L]]), function(i) do.call("c",
lapply(allargs, "[[", i)))
In base R (at least in the versions current at the time, before R 4.1.0), using c in this manner does not retain the "ordered" attribute; it doesn't even return a factor!
For example (in base R)
f <- factor(1:2, levels = 2:1, ordered=TRUE)
g <- factor(1:2, levels = 2:1, ordered=TRUE)
# it isn't ordered!
is.ordered(c(f,g))
# [1] FALSE
# no surprise as it isn't even a factor!
is.factor(c(f,g))
# [1] FALSE
However data.table has an S3 method c.factor, which is used to ensure that a factor is returned and the levels are retained. Unfortunately this method does not retain the ordered attribute.
getAnywhere('c.factor')
# A single object matching ‘c.factor’ was found
# It was found in the following places
# namespace:data.table
# with value
#
# function (...)
# {
# args <- list(...)
# for (i in seq_along(args)) if (!is.factor(args[[i]]))
# args[[i]] = as.factor(args[[i]])
# newlevels = unique(unlist(lapply(args, levels), recursive = TRUE,
# use.names = TRUE))
# ind <- fastorder(list(newlevels))
# newlevels <- newlevels[ind]
# nm <- names(unlist(args, recursive = TRUE, use.names = TRUE))
# ans = unlist(lapply(args, function(x) {
# m = match(levels(x), newlevels)
# m[as.integer(x)]
# }))
# structure(ans, levels = newlevels, names = nm, class = "factor")
# }
# <bytecode: 0x073f7f70>
# <environment: namespace:data.table>
So yes, this is a bug. It is now reported as #5019.

As of version 1.8.11 data.table will combine ordered factors to result in ordered if a global order exists, and will complain and result in a factor if it doesn't exist:
DT1 = data.table(ordered('a', levels = c('a','b','c')))
DT2 = data.table(ordered('a', levels = c('a','d','b')))
rbind(DT1, DT2)$V1
#[1] a a
#Levels: a < d < b < c
DT3 = data.table(ordered('a', levels = c('b','a','c')))
rbind(DT1, DT3)$V1
#[1] a a
#Levels: a b c
#Warning message:
#In rbindlist(lapply(seq_along(allargs), function(x) { :
# ordered factor levels cannot be combined, going to convert to simple factor instead
To contrast, here's what base R does:
rbind(data.frame(DT1), data.frame(DT2))$V1
#[1] a a
#Levels: a < b < c < d
# Notice that the resulting order does not respect the suborder for DT2
rbind(data.frame(DT1), data.frame(DT3))$V1
#[1] a a
#Levels: a < b < c
# Again, suborders are not respected and new order is created

I ran into the same problem after rbind. Just re-assign the ordered levels for the column:
test$id <- factor(test$id, levels = c("a", "c", "b"), ordered = TRUE)
It's better to define the factor after rbind.
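A minimal sketch of that workaround, written with base data frames and character columns so it does not depend on any particular data.table version. The names lv, d1, d2 and combined are illustrative only.

```r
# The intended ordering of the levels: a < c < b
lv <- c("a", "c", "b")

# Keep the column as plain character while binding
d1 <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
d2 <- data.frame(id = "c", stringsAsFactors = FALSE)
combined <- rbind(d1, d2)

# Define the ordered factor once, after rbind
combined$id <- factor(combined$id, levels = lv, ordered = TRUE)

print(is.ordered(combined$id))
print(sort(combined$id))
```

Defining the factor only once, after binding, avoids having to reconcile level orders across the pieces.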
