Drop columns when splitting data frame in R

I am trying to split a data table by a column, but the data tables in the resulting list still contain the column that the table was split by. How do I drop this column once the split is complete? Or, preferably, is there a way to drop multiple columns?
This is my code:
library(data.table)
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(10, mean = 5, sd = 2)
z <- sample(5, 10, replace = TRUE)
dt <- data.table(x, y, z)
split(dt, dt$z)
The resulting data table subsets look like this:
$`1`
x y z
1: 6.179790 5.776683 1
2: 5.725441 4.896294 1
3: 8.690388 5.394973 1
$`2`
x y z
1: 5.768285 3.951733 2
2: 4.572454 5.487236 2
$`3`
x y z
1: 5.183101 8.328322 3
2: 2.830511 3.526044 3
$`4`
x y z
1: 5.043010 5.566391 4
2: 5.744546 2.780889 4
$`5`
x y z
1: 6.771102 0.09301977 5
Thanks

Splitting a data.table is really not worthwhile unless you have some fancy parallelization step to follow. And even then, you might be better off sticking with a single table.
That said, I think you want
split( dt[, !"z"], dt$z )
# or more generally
mysplitDT <- function(x, bycols)
split( x[, !..bycols], x[, ..bycols] )
mysplitDT(dt, "z")
You would run into the same problem if you had a data.frame:
df = data.frame(dt)
split( df[names(df) != "z"], df$z )

First thing that came to mind was to iterate through the list and drop the z column.
lapply(split(dt, dt$z), function(d) { d$z <- NULL; d })
And I just noticed that you use the data.table package, so there is probably a better, data.table way of achieving your desired result.
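A possible data.table-flavoured version of the above is sketched here. It assumes a data.table version whose split() method accepts the by and keep.by arguments; the name drop_cols and the choice of columns to drop are only illustrative.

```r
library(data.table)

set.seed(42)  # only so the example is reproducible
dt <- data.table(
  x = rnorm(10, mean = 5, sd = 2),
  y = rnorm(10, mean = 5, sd = 2),
  z = sample(5, 10, replace = TRUE)
)

# Split by z and drop the grouping column in one step
parts <- split(dt, by = "z", keep.by = FALSE)

# To drop several columns, remove them from each piece after splitting
drop_cols <- c("y", "z")  # illustrative choice of columns to drop
parts_x_only <- lapply(
  split(dt, by = "z"),
  function(d) d[, setdiff(names(d), drop_cols), with = FALSE]
)
```

Each element of parts then holds only x and y, and each element of parts_x_only holds only x.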

Related

How to get the sum of the product of selected column in a data frame?

This is probably very simple but I couldn't think of a solution.
I have the following data frame, and I want to multiply column y with column z and sum the answer.
> df <- data.frame(x = c(1,2,3), y = c(2,4,6), z = c(2,3,4))
> df
x y z
1 1 2 2
2 2 4 3
3 3 6 4
The value found should be equal to 40.
with would be an option here if we don't want to repeat df$ or df[[ to extract the column
with(df, sum( y * z))
#[1] 40
Or %*%
c(df$y %*% df$z)
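As a quick check that both approaches compute the same dot product, here is a small sketch on the question's data. The variable names total_sum and total_mat are only illustrative.

```r
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6), z = c(2, 3, 4))

# Element-wise product, then sum: 2*2 + 4*3 + 6*4
total_sum <- sum(df$y * df$z)

# The same dot product via matrix multiplication;
# drop() turns the resulting 1x1 matrix into a plain number
total_mat <- drop(df$y %*% df$z)
```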
Additionally, you could use data.table. The second argument inside the brackets, after the first comma, is the column expression (j). You don't need the spaces; they're just there to line up with the comment.
library(data.table)
a <- data.table(x = c(1,2,3), y = c(2,4,6), z = c(2,3,4))
#dt i j by
a[ , sum(y*z), ]

Perform a semi-join with data.table

How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:
x <- data.table(x = 1:2, y = c("a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
x[y]
# x y z
# 1: 1 a 10
# 2: 1 a 11
A semi-join would return just x[1]
More possibilities :
w = unique(x[y,which=TRUE]) # the row numbers in x which have a match from y
x[w]
If there are duplicate key values in x, then that needs :
w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]
Or, the other way around :
setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]
If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
But the anti anti join appeals too :-)
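The y[x] idea above can be wrapped in a small helper. This is only a sketch: semi_join_dt is a hypothetical name, the sample data are made up, and it uses the on= argument instead of keys, which assumes a data.table version that supports on=.

```r
library(data.table)

x <- data.table(x = c(1, 1, 2, 3), y = c("a", "b", "c", "d"))
y <- data.table(x = c(1, 1, 3, 5), z = 10:13)

# Hypothetical helper: keep the rows of `a` whose join columns appear in `b`
semi_join_dt <- function(a, b, cols) {
  # For each row of `a`, find the first matching row number in `b` (NA if none)
  hit <- b[a, on = cols, which = TRUE, mult = "first"]
  # Keep matched rows of `a`, in their original order and without repetition
  a[!is.na(hit)]
}

res <- semi_join_dt(x, y, "x")
```

Because the lookup returns exactly one entry per row of x, duplicate keys in either table do not multiply the rows of the result.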
One solution I can think of is:
tmp <- x[!y]
x[!tmp]
In data.table, you can have another data table as an i expression (i.e., the first expression in the [.data.table call), and that will perform a join, e.g.:
x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)
> x[y]
x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4
The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of data.table documentation. So the first assignment evaluates to a subset of x that doesn't have any rows where the key (column x) is present in y:
> x[!y]
x y
1: 2 b
2: 4 d
3: 6 f
4: 7 g
5: 8 h
6: 9 i
7: 10 j
It is similar to setdiff in this regard. And therefore the second statement returns all the rows in x where the key is present in y.
The ! feature was added in data.table 1.8.4 with the following note in NEWS:
o A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
DT[-DT["a", which=TRUE, nomatch=0]] # old not-join idiom, still works
DT[!"a"] # same result, now preferred.
DT[!J(6),...] # !J == not-join
DT[!2:3,...] # ! on all types of i
DT[colA!=6L | colB!=23L,...] # multiple vector scanning approach (slow)
DT[!J(6L,23L)] # same result, faster binary search
'!' has been used rather than '-' :
* to match the 'not-join'/'not-where' nomenclature
* with '-', DT[-0] would return DT rather than DT[0] and not be backwards
compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
base R) and after this new feature.
* to leave DT[+J...] and DT[-J...] available for future use
For some reason, the following doesn't work x[!(x[!y])] - probably data.table is too smart about parsing the argument.
P.S. As Josh O'Brien pointed in another answer, a one-line would be x[!eval(x[!y])].
I'm confused with all the not-joins above, isn't what you want simply:
unique(x[y, .SD])
# x y
#1: 1 a
If x can have duplicate keys, then you can unique y instead:
## Create an example data.table 'x' whose first row is repeated three times
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
setkey(y, x)
x[eval(unique(y, by = key(y))), .SD] # data.table >= 1.9.8 requires by=key(y)
# x y
# 1: 1 a
# 2: 1 a
# 3: 1 a
Update. Based on all the discussion here, I would do something like this, which should be fast and work in the most general case:
x[eval(unique(y[, key(x), with = FALSE]))]
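Here is another, more direct solution (note that a numeric i selects rows by position, so this relies on the key x$x coinciding with the row number, as it does in these examples):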
unique(x[eval(y$x)])
It's more direct and runs faster - here is the comparison in run times with my previous solution:
# Generate some large data
N <- 1000000 * 26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
system.time(r1 <- x[!eval(x[!y])])
user system elapsed
7.772 1.217 11.998
system.time(r2 <- unique(x[eval(y$x)]))
user system elapsed
0.540 0.142 0.723
In a more general case, you can do something like
x[eval(y[, key(x), with = FALSE])]
I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.
sJ <- function(x,y){
ycols <- 1:min(ncol(y),length(key(x)))
yjoin <- unique(y[, ..ycols])
yjoin
}
x[eval(sJ(x,y))]
For Victor's simpler example, this gives the desired output:
x y
1: 1 a
2: 3 c
3: 5 e
This is ~30% slower than Victor's way.
EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:
N <- 1e5*26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
require(microbenchmark)
microbenchmark(
sJ=x[eval(sJ(x,y))],
dolla=unique(x[eval(y$x)]),
brack=x[eval(unique(y[['x']]))]
)
Unit: milliseconds
expr min lq median uq max neval
# sJ 120.22700 125.04900 126.50704 132.35326 217.6566 100
# dolla 105.05373 108.33804 109.16249 118.17613 285.9814 100
# brack 53.95656 61.32669 61.88227 65.21571 235.8048 100
I'm guessing the [[ vs $ doesn't help the speed, but didn't check.
This thread is so old. But I noticed that the solution can be easily derived from the definition of semi-join given in the original post:
"A semi-join is like an inner join except that it only returns the
columns of X (not also those of Y), and does not repeat the rows of X
to match the rows of Y"
library(data.table)
dt1 <- data.table(ProdId = 1:4,
Product = c("Bread", "Cheese", "Pizza", "Butter"))
dt2 <- data.table(ProdId = c(1, 1, 3, 4, 5),
Company = c("A", "B", "C", "D", "E"))
# semi-join
unique(merge(dt1, dt2, by="ProdId")[, names(dt1), with=FALSE])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
I've simply applied the syntax of inner-join, followed by filtering columns from first table only, with unique() to remove rows of first table which were repeated to match rows of second table.
Edit: The above approach will match dplyr::semi_join() output only if we have unique rows in the first table. If we need to output all the rows including duplicates from first table, then we may use fsetdiff() method shown below.
Another one line data.table solution:
fsetdiff(dt1, dt1[!dt2, on="ProdId"])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
I've just removed from first table the anti-join of first and second. Seems simpler to me. If the first table has duplicate rows, we will need:
fsetdiff(dt1, dt1[!dt2, on="ProdId"], all=T)
The fsetdiff() result with ,all=T matches the output from dplyr:
dplyr::semi_join(dt1, dt2, by="ProdId")
ProdId Product
1 1 Bread
2 3 Pizza
3 4 Butter
Using another set of data taken from one of previous posts:
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
y <- data.table(x = c(1, 1), z = 10:11)
With dplyr:
dplyr::semi_join(x, y, by="x")
x y
1 1 a
2 1 a
3 1 a
With data.table:
fsetdiff(x, x[!y, on="x"], all=T)
x y
1: 1 a
2: 1 a
3: 1 a
Without ,all=T, the duplicate rows are removed:
fsetdiff(x, x[!y, on="x"])
x y
1: 1 a
The dplyr package provides several join functions, including:
inner_join, left_join, semi_join, anti_join
So for the semi-join try the following code
library("data.table")
library("dplyr")
table1 <- data.table(x = 1:2, y = c("a", "b"))
table2 <- data.table(x = c(1, 1), z = 10:11)
semi_join(table1, table2)
The output is as expected:
# Joining by: "x"
# Source: local data table [1 x 2]
#
# x y
# (int) (chr)
# 1 1 a
Try the following:
w <- y[,unique(x)]
x[x %in% w]
Output will be:
x y
1: 1 a

Merge two data frames while keeping the original row order

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).
Here are some sample data (all values from class column are defined in both data frames):
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
If I do:
merge(df.2, df.1)
Output is:
class object prob
1 1 B 0.5
2 1 C 0.5
3 2 A 0.7
4 2 D 0.7
5 3 F 0.3
If I add sort = FALSE:
merge(df.2, df.1, sort = F)
Result is:
class object prob
1 2 A 0.7
2 2 D 0.7
3 1 B 0.5
4 1 C 0.5
5 3 F 0.3
But what I would like is:
class object prob
1 2 A 0.7
2 1 B 0.5
3 2 D 0.7
4 3 F 0.3
5 1 C 0.5
You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example :
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id <- 1:nrow(df.2)
out <- merge(df.2,df.1, by = "class")
out[order(out$id), ]
Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.
Using your example data, we would use join like this:
> join(df.2,df.1)
Joining by: class
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
Here are a couple of links describing fixes to the merge function for keeping the row order:
http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/
http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html
You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.
Using your example above, the inner_join result looks like this:
inner_join(df.2,df.1)
Joining by: "class"
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
From data.table v1.9.5+, you can do:
require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]
This performs a join on column class by finding the matching rows in df.1 for each row in df.2 and extracting the corresponding columns.
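To see that the result follows the row order of df.2, here is a small sketch. It works on copies made with as.data.table() (an assumption made here so the original data frames are not converted by reference); the names dt.1, dt.2 and res are only illustrative.

```r
library(data.table)

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c("A", "B", "D", "F", "C"), class = c(2, 1, 2, 3, 1))

# as.data.table() copies, so df.1 and df.2 stay plain data frames
dt.1 <- as.data.table(df.1)
dt.2 <- as.data.table(df.2)

# For each row of dt.2, in order, look up the matching row of dt.1
res <- dt.1[dt.2, on = "class"]
```

The rows of res come out in the order A, B, D, F, C, as in df.2.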
For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:
library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]
object class prob
1: A 2 0.7
2: B 1 0.5
3: D 2 0.7
4: F 3 0.3
5: C 1 0.5
Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.
The accepted answer proposes a manual way to keep order when using merge, which works most of the time but requires unnecessary manual work. This solution comes on the back of How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
col <- ".sortColumn"
data[,col] <- 1:nrow(data)
out <- fn(data, ...)
if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
out <- out[order(out[,col]),]
out[,col] <- NULL
out
}
So now you can use this generic keeping.order function to keep the original row order of a merge call:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")
Which will yield, as requested:
> keeping.order(df.2, merge, y=df.1, by = "class")
class object prob
3 2 A 0.7
1 1 B 0.5
4 2 D 0.7
5 3 F 0.3
2 1 C 0.5
So keeping.order effectively automates the approach in the accepted answer.
Thanks to @PAC, I came up with something like this:
merge_sameord = function(x, y, ...) {
UseMethod('merge_sameord')
}
merge_sameord.data.frame = function(x, y, ...) {
rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
x[, rstr] = 1:nrow(x)
res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
res = res[order(res[, rstr]), ]
res[, rstr] = NULL
res
}
This assumes that you want to preserve the order of the first data frame, and that the merged data frame will have the same number of rows as the first data frame. It will give you a clean data frame without extra columns.
In this specific case you could use factor for a compact base solution:
df.2$prob = factor(df.2$class,labels=df.1$prob)
df.2
# object class prob
# 1 A 2 0.7
# 2 B 1 0.5
# 3 D 2 0.7
# 4 F 3 0.3
# 5 C 1 0.5
This is not a general solution, however; it works if:
1. You have a lookup table containing unique values
2. You want to update a table, not create a new one
3. The lookup table is sorted by the merging column
4. The lookup table doesn't have extra levels
5. You want a left_join
6. You're fine with factors
Condition 1 is not negotiable; for the rest we can do:
df.3 <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3
df.1b <- df.1b[df.1b$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class,labels=df.1b$prob)
df.3 <- df.3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.
For package developers
As a package developer, you want to be dependent on as few other packages as possible. Especially tidyverse functions, that change way too often for package developers IMHO.
To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).
left_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
merge_exec(x = x, y = y, ...)
}
full_join <- function(x, y, ...) {
merge_exec(x = x, y = y, all = TRUE, ...)
}
# workhorse:
merge_exec <- function(x, y, ...) {
# set index
x$join_id_ <- 1:nrow(x)
# do the join
joined <- merge(x = x, y = y, sort = FALSE, ...)
# get suffices (yes, I prefer this over suffixes)
if ("suffixes" %in% names(list(...))) {
suffixes <- list(...)$suffixes
} else {
suffixes <- c(".x", ".y") # the defaults used by merge()
}
# get columns names in right order, so the 'by' column won't be forced first
cols <- unique(c(colnames(x),
paste0(colnames(x), suffixes[1]),
colnames(y),
paste0(colnames(y), suffixes[2])))
# get the original row and column index
joined[order(joined$join_id_),
cols[cols %in% colnames(joined) & cols != "join_id_"]]
}
The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])
I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is
data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
row.names = NULL)
I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.
Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berends's clever post-merge rescrambling code.
There are several use cases in which a simple subset will do:
# Use the key variable as row.names
row.names(df.1) = df.1$key
# Sort df.1 so that its rows match df.2 (index by name, not by position)
df.3 = df.1[as.character(df.2$key), ]
# Create a data.frame with variables from df.2 and (the sorted) df.1
df.4 = cbind(df.2, df.3)
This code will preserve df.2 and its order and add only matching data from df.1.
If only one variable is to be added, the cbind() is not required:
row.names(df.1) = df.1$key
df.2$data = df.1[as.character(df.2$key), "data"]
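Applied to the question's data, the row-name lookup could look like the sketch below. It assumes class plays the role of the key column and prob the role of the data column. Converting the key to character matters: a numeric index would select rows by position rather than by row name.

```r
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c("A", "B", "D", "F", "C"), class = c(2, 1, 2, 3, 1))

# Use the key values as row names of the lookup table
row.names(df.1) <- df.1$class

# Look up prob by row name, in the row order of df.2
df.2$prob <- df.1[as.character(df.2$class), "prob"]
```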
I had the same problem with it but I simply used a dummy vector c(1:5) applied to a new column 'num'
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
df.2$num <- c(1:5) # This range you can order in the last step.
dfm <- merge(df.2, df.1) # merged
dfm <- dfm[order(dfm$num),] # ascending order
There may be a more efficient way in base. This would be fairly simple to make into a function.
varorder <- names(mydata) # --- Merge
mydata <- merge(mydata, otherData, by="commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])
mydata[c(varorder,restOfvars)]
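Turned into a function, the column-order idea might look like this sketch. The name merge_keep_col_order is hypothetical, and the helper only restores the column order; the rows still come back in whatever order merge() produces.

```r
# Hypothetical helper: merge, then put x's columns first in their original order
merge_keep_col_order <- function(x, y, ...) {
  merged <- merge(x, y, ...)
  # Columns of x that survived the merge, in x's order
  first <- intersect(names(x), names(merged))
  # Everything else (columns contributed by y, suffixed columns)
  rest <- setdiff(names(merged), first)
  merged[c(first, rest)]
}

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c("A", "B", "D", "F", "C"), class = c(2, 1, 2, 3, 1))

res <- merge_keep_col_order(df.2, df.1, by = "class")
```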

How I can select rows from a dataframe that do not match?

I'm trying to identify the values in a data frame that do not match, but can't figure out how to do this.
# make data frame
a <- data.frame( x = c(1,2,3,4))
b <- data.frame( y = c(1,2,3,4,5,6))
# select only values from b that are not in 'a'
# attempt 1:
results1 <- b$y[ !a$x ]
# attempt 2:
results2 <- b[b$y != a$x,]
If a = c(1,2,3) this works, as the length of b is a multiple of the length of a. However, I'm trying to select all the values from column y of data frame b that are not in column x of data frame a, and I don't understand what function to use.
If I understand correctly, you need the negation of the %in% operator. Something like this should work:
subset(b, !(y %in% a$x))
> subset(b, !(y %in% a$x))
y
5 5
6 6
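To illustrate why != is not a membership test, here is a small sketch. It uses a reordered a$x (an assumption made here, not the question's exact data) so that the difference shows up: != compares position by position and recycles the shorter vector, while %in% checks each value against the whole set.

```r
a <- data.frame(x = c(4, 2, 1, 3))           # same values as in the question, reordered
b <- data.frame(y = c(1, 2, 3, 4, 5, 6))

# Position-wise comparison: a$x is recycled to c(4, 2, 1, 3, 4, 2),
# which triggers a length warning that is suppressed here
elementwise <- suppressWarnings(b$y != a$x)

# Membership test: TRUE where the value of b$y does not occur anywhere in a$x
not_in_a <- !(b$y %in% a$x)

wrong <- b$y[elementwise]
right <- b$y[not_in_a]
```

Here wrong contains values that do occur in a$x, while right contains only 5 and 6.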
Try the set difference function setdiff. So you would have
results1 = setdiff(a$x, b$y) # elements in a$x NOT in b$y
results2 = setdiff(b$y, a$x) # elements in b$y NOT in a$x
You could also use dplyr for this task. To find what is in b but not a:
library(dplyr)
anti_join(b, a, by = c("y" = "x"))
# y
# 1 5
# 2 6
