I would like to conditionally merge two tables with the following formats:
id1 <- c('S001', 'S002', 'S003', 'S004', 'S004')
id2 <- c('S001', 'S001', 'S002', 'S002', 'S001')
ids <- data.frame(id1, id2)
and
bad_id_key <- c('S002', 'S004')
bad_id_val <- c('a', 'b')
bad_ids <- data.frame(bad_id_key, bad_id_val)
The conditional rules are:
If both IDs are in the "bad" list, drop that row
If neither ID is in the "bad" list, drop that row
If only one of the IDs is bad, add the bad value to the row.
The resulting table would look like:
id1 id2 bad_id_val
2 S002 S001 a
3 S003 S002 a
5 S004 S001 b
I was able to accomplish this with the following code snippet:
conditionalJoin <- function(row){
  if(row$id1 %in% bad_id_key & row$id2 %in% bad_id_key){
    # do nothing
  }
  else if(row$id1 %in% bad_id_key){
    merge(x=row, y=bad_ids, by.x="id1", by.y="bad_id_key", all.x=TRUE)
  }
  else if(row$id2 %in% bad_id_key){
    merge(x=row, y=bad_ids, by.x="id2", by.y="bad_id_key", all.x=TRUE)
  }
}
out <- do.call("rbind", as.list(by(ids, 1:nrow(ids), conditionalJoin)))
However, this approach scales extremely poorly as the ids data frame grows, which I think is because of the repeated rbind calls. The chain of if/else statements is also not very elegant R code.
Does anyone know of an R command to do this kind of row-wise conditional joining that is more efficient than rbind? Thanks in advance.
Using the data.table package, I would approach it as follows:
library(data.table)
ids <- setDT(ids)[xor(id1 %in% bad_ids$bad_id_key, id2 %in% bad_ids$bad_id_key)
][, bad_id_val := ifelse(id1 %in% bad_ids$bad_id_key,
as.character(bad_ids$bad_id_val[match(id1, bad_ids$bad_id_key)]),
as.character(bad_ids$bad_id_val[match(id2, bad_ids$bad_id_key)]))]
which gives the desired result:
> ids
id1 id2 bad_id_val
1: S002 S001 a
2: S003 S002 a
3: S004 S001 b
Tested on the larger dataset of @jeremycg, this gives the following outcome with regard to speed:
Unit: milliseconds
expr min lq mean median uq max neval cld
jeremy 9.196898 9.386950 9.854132 9.603002 9.749256 16.764747 100 b
OP 974.933816 985.813821 996.770067 992.145890 1000.411484 1143.402837 100 c
jaap 3.572531 3.612401 3.779686 3.679115 3.790707 9.803782 100 a
This is the fastest I can get it using dplyr. It's considerably faster because there are only two match calls; everything else is quick. See the benchmark below.
library(dplyr)
ids %>% mutate(x = match(id1, bad_ids$bad_id_key), # get the first match of id1
               y = match(id2, bad_ids$bad_id_key)) %>% # and id2
  filter(xor(is.na(x), is.na(y))) %>% # filter to make sure we have 1 match
  mutate(val = ifelse(is.na(x), # if x didn't match
                      as.character(bad_ids$bad_id_val[y]), # get the y
                      as.character(bad_ids$bad_id_val[x]))) # otherwise get the x
Here's a benchmark on larger data:
#5000 lines of ids
set.seed(12345)
ids <- data.frame(id1 = sample(1:50, 5000, replace = TRUE), id2 = sample(1:50, 5000, replace = TRUE))
bad_ids <- data.frame(bad_id_key = 1:20, bad_id_val = letters[1:20])
microbenchmark::microbenchmark(
  me = {
    ids %>% mutate(x = match(id1, bad_ids$bad_id_key),
                   y = match(id2, bad_ids$bad_id_key)) %>%
      filter(xor(is.na(x), is.na(y))) %>%
      mutate(val = ifelse(is.na(x),
                          as.character(bad_ids$bad_id_val[y]),
                          as.character(bad_ids$bad_id_val[x])))},
  OP = {out <- do.call("rbind", as.list(by(ids, 1:nrow(ids), conditionalJoin)))}
)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
   me   11.92924   12.41934   15.36524   13.07722   15.71085   63.14211   100
   OP 1831.34599 1910.90149 2369.70980 2112.57251 2340.88428 5549.01191   100
Instead of using ifelse functions, it is often better to just work within the data.frame or data.table itself to identify the records you want to keep. For your example you could do this with the following code:
ids[xor(ids$id1 %in% bad_id_key, ids$id2 %in% bad_id_key),]
After running this code you just need to merge ids and bad_ids to append the bad id value.
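For example, a hedged base R sketch of that follow-up step, done here with match() rather than merge() so the lookup doesn't depend on which of the two columns held the bad ID:
keep <- ids[xor(ids$id1 %in% bad_ids$bad_id_key,
                ids$id2 %in% bad_ids$bad_id_key), ]
# whichever column matched supplies the lookup key
bad_key <- ifelse(keep$id1 %in% bad_ids$bad_id_key,
                  as.character(keep$id1), as.character(keep$id2))
keep$bad_id_val <- bad_ids$bad_id_val[match(bad_key, bad_ids$bad_id_key)]
keep  # rows 2, 3 and 5, each with its matching bad_id_val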
I have a data.table like the following:
dt <- data.table(data.frame(
id = rep(1:3, each=5),
age = rep(10*(2:4), each=5),
var = rnorm(15)
))
I want to aggregate over var using, say, a sum, but I must keep "age", a one-to-many variable, in the output.
One way to do this is:
dt <- merge(dt[, .(vsum=sum(var)), by=id], unique(dt[, c('id', 'age')]), by='id')
Another way is
dt <- dt[, .(vsum=sum(var)), by=c('id', 'age')]
My gut says the second approach loses time because by= has to look for differing values of age within each id, which could be problematic if age were 20 or more variables. My gut also says merge is problematic because of the overall larger number of operations, only a subset of which is the by= grouping within a [.data.table call.
I can explore toy cases like this, but I don't have a sense of the general operating characteristics, for example when there are many many-to-one variables (like age), or when the data are dense with observations (many rows and few IDs) versus dense with individuals (the same number of rows but many IDs).
Is there any general, efficient approach to doing summary datasets such as this type?
It all depends on the implementation. However, why not do this?
dt[, .(vsum=sum(var), age=age[1]), by="id"]
Edit: Benchmarking below.
dt <- data.table(data.frame(
id = rep(1:10000, each=5),
age = rep(10*(1:10000), each=5),
var = rnorm(150000)
))
res1 <- function() {merge(dt[, .(vsum=sum(var)), by="id"], unique(dt[, c('id', 'age')]), by='id')}
res2 <- function() {dt[, .(vsum=sum(var)), by=c('id', 'age')]}
res3 <- function() {dt[, .(vsum=sum(var), age=unique(age)), by="id"]}
res4 <- function() {dt[, .(vsum=sum(var), age=age[1]), by="id"]}
library(microbenchmark)
microbenchmark(res1(),res2(), res3(), res4(), times=10)
Unit: milliseconds
expr min lq mean median uq max neval cld
res1() 6.940417 7.949203 9.250408 8.791923 9.695110 13.448288 10 b
res2() 3.796992 3.898165 4.889812 4.507141 4.790384 9.477044 10 a
res3() 48.259783 52.026664 55.401017 54.986112 59.375380 60.804102 10 c
res4() 2.646796 2.853593 3.709116 3.252362 3.391909 6.321708 10 a
So it turns out, contrary to intuition, that the second approach is quite fast, and the fourth approach is the fastest.
How do I find the minimum value from an R data table other than a particular value?
For example, there could be zeroes in the data table and the goal would be to find the minimum non zero value.
I tried using sapply with min, but am not sure how to specify the extra criterion so that the minimum is not equal to a certain value.
More generally, how do we find the minimum from a data table that is not equal to any element from a list of possible values?
If you want to find the minimum value from a vector while excluding certain values from that vector, then you can use %in%:
v <- c(1:10) # values 1 .. 10
v.exclude <- c(1, 2) # exclude the values 1 and 2 from consideration
min.exclude <- min(v[!v %in% v.exclude])
The logic won't change much if you are using a column from a data table/frame. In this case you can just replace the vector v with the appropriate column. If you have your excluded values in a list, then you can flatten it to produce your v.exclude vector.
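For example, a hedged sketch with a data frame column and the excluded values held in a list (the column name and values are made up for illustration):
df <- data.frame(a = c(0, 2, 3, 2, 1, 4, 5))
exclude_list <- list(0, 1, 2)         # excluded values stored in a list
exclude_vec  <- unlist(exclude_list)  # flatten the list into a plain vector
min(df$a[!df$a %in% exclude_vec])
# [1] 3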
This can be done with data.table (as the OP mentioned a data table in the post) after setting the key:
library(data.table)
setDT(df, key='a')[!.(exclude)]
# a b
#1: 4 40
#2: 5 50
#3: 6 60
If we need the min value of 'a'
min(setDT(df, key='a')[!.(exclude)]$a)
#[1] 4
For finding the min in all the columns (using the setkey method), we loop over the columns of the dataset, set the key to each column in turn, subset the dataset, and store the min value in a previously created list object.
setDT(df)
MinVal <- vector('list', length(df))
for(j in seq_along(df)){
  setkeyv(df, names(df)[j])
  MinVal[[j]] <- min(df[!.(exclude)][[j]])
}
MinVal
#[[1]]
#[1] 4
#[[2]]
#[1] 10
data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
b = c(10,10,20,20,30,30,40,40,50,60))
exclude <- c(0,1,2,3)
Assuming you are working with a data.frame
Data
df <- data.frame(a = c(0,2,3,2,1,2,3,4,5,6),
b = c(10,10,20,20,30,30,40,40,50,60))
Values to exclude from our minimum search:
exclude <- c(0,1,2,3)
We can find the minimum value from column a, excluding the values in our exclude vector:
## minimum from column a
min(df[!df$a %in% exclude,]$a)
# [1] 4
Or from b
exclude <- c(10, 20, 30, 40)
min(df[!df$b %in% exclude,]$b)
# [1] 50
To return the row that corresponds to the minimum value
df[df$b == min( df[ !df$b %in% exclude, ]$b ),]
# a b
# 9 5 50
Update
To find the minimum row while excluding values from multiple columns, we can do it this way:
## values to exclude
exclude_a <- c(0,1)
exclude_b <- c(10)
## exclude rows/values from each column we don't want
df2 <- df[!(df$a %in% exclude_a) & !(df$b %in% exclude_b),]
## order the data
df3 <- df2[with(df2, order(a,b)),]
## take the first row
df3[1,]
# > df3[1,]
# a b
#4 2 20
Update 2
To select from multiple columns we can iterate over them as @akrun has shown, or alternatively we can construct our subsetting formula as an expression and evaluate it inside our [ operation:
exclude <- c(0,1,2, 10)
## construct a formula/expression using the column names
n <- names(df)
expr <- paste0("(", paste0(" !(df$", n, " %in% exclude) ", collapse = "&") ,")")
# [1] "( !(df$a %in% exclude) & !(df$b %in% exclude) )"
expr <- parse(text=expr)
df2 <- df[eval(expr),]
## order and select first row as before
df2 <- df2[with(df2, order(a,b)),]
df2 <- df2[1,]
And if we wanted to use data.table for this:
library(data.table)
setDT(df)[ eval(expr) ][order(a, b),][1,]
comparison of methods
library(microbenchmark)
fun_1 <- function(x){
  df2 <- x[eval(expr),]
  ## order and select first row as before
  df2 <- df2[with(df2, order(a,b)),]
  df2 <- df2[1,]
  return(df2)
}
fun_2 <- function(x){
  df2 <- setDT(x)[ eval(expr) ][order(a, b),][1,]
  return(df2)
}
## including @akrun's solution
fun_3 <- function(x){
  setDT(df)
  MinVal <- vector('list', length(df))
  for(j in seq_along(df)){
    setkeyv(df, names(df)[j])
    MinVal[[j]] <- min(df[!.(exclude)][[j]])
  }
  return(MinVal)
}
microbenchmark(fun_1(df), fun_2(df), fun_3(df) , times=1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# fun_1(df) 770.376 804.5715 866.3499 833.071 869.2195 2728.740 1000
# fun_2(df) 854.862 893.1220 952.1207 925.200 962.6820 3115.119 1000
# fun_3(df) 1108.316 1148.3340 1233.1268 1186.938 1234.3570 5400.544 1000
At one stage in a longer chain of dplyr functions, I need to replace parts of a variable using numeric indices to specify which elements to replace.
My data looks like this:
df1 <- data.frame(grp = rep(1:2, each = 3),
a = 1:6,
b = rep(c(10, 20), each = 3))
df1
# grp a b
# 1 1 1 10
# 2 1 2 10
# 3 1 3 10
# 4 2 4 20
# 5 2 5 20
# 6 2 6 20
Assume that we, within each group, wish to replace elements in variable a with the corresponding elements in b, at one or more positions. In this simple example I use a single index (id), but this could be a vector of indices. First, here's how I would do it with ddply:
library(plyr)
id <- 2
ddply(.data = df1, .variables = .(grp), function(x){
  x$a[id] <- x$b[id]
  x
})
# grp a b
# 1 1 1 10
# 2 1 10 10
# 3 1 3 10
# 4 2 4 20
# 5 2 20 20
# 6 2 6 20
In dplyr I could think of some different ways to perform the replacement. (1) Use do with an anonymous function, similar to the one used in ddply. (2) Use mutate: concatenate a vector where the replacement is 'inserted' using numeric indexing. This is probably only fruitful for a single index. (3) Use mutate: create an index vector and use conditional replacement with ifelse (see e.g. here, here, here, and here).
detach("package:plyr", unload = TRUE)
library(dplyr)
# (1)
fun_do <- function(df){
  l <- df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[id] <- dat$b[id]
      dat
    })
  do.call(rbind, l)
}
# (2)
fun_mut <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = c(a[1:(id - 1)], b[id], a[(id + 1):length(a)])
    )
}
# (3)
fun_mut_ifelse <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      idx = 1:n(),
      a = ifelse(idx %in% id, b, a)) %.%
    select(-idx)
}
fun_do(df1)
fun_mut(df1)
fun_mut_ifelse(df1)
In a benchmark with a slightly larger data set, the 'jigsaw puzzle insertion' is fastest, but again, this method is probably only suited for single replacements. And it doesn't look very clean...
set.seed(123)
df2 <- data.frame(grp = rep(1:200, each = 3),
a = rnorm(600),
b = rnorm(600))
library(microbenchmark)
microbenchmark(fun_do(df2),
fun_mut(df2),
fun_mut_ifelse(df2),
times = 10)
# Unit: microseconds
# expr min lq median uq max neval
# fun_do(df2) 48443.075 49912.682 51356.631 53369.644 55108.769 10
# fun_mut(df2) 891.420 933.996 1019.906 1066.663 1155.235 10
# fun_mut_ifelse(df2) 2503.579 2667.798 2869.270 3027.407 3138.787 10
Just to check the influence of the do.call(rbind, ...) part in the do function, try without it:
fun_do2 <- function(df){
  df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[2] <- dat$b[2]
      dat
    })
}
fun_do2(df1)
Then a new benchmark on a larger data set:
df3 <- data.frame(grp = rep(1:2000, each = 3),
a = rnorm(6000),
b = rnorm(6000))
microbenchmark(fun_do(df3),
fun_do2(df3),
fun_mut(df3),
fun_mut_ifelse(df3),
times = 10)
Again, a simple 'insertion' is fastest, while the do function loses ground. In the help text, do is described as "a general purpose complement" to the other dplyr functions. To me it seemed a natural choice for an anonymous function. However, I was surprised that do was so much slower, even when the non-dplyr rbinding part was skipped. Currently the do documentation is rather scarce, so I wonder whether I am abusing the function and whether there are more appropriate (undocumented?) ways to do it.
I got no hits on index/indices when I searched the dplyr help text or vignette. So now I wonder:
Are there other dplyr methods to replace parts of a variable using numeric indices which I have overlooked? Specifically, is the creation of an index column in combination with ifelse the way to go, or are there more direct a[i] <- b[i]-like alternatives?
Edit following a comment from @G.Grothendieck (thanks!): added a replace alternative (a candidate for 'See also' in ?[).
fun_replace <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = replace(a, id, b[id]))
}
fun_replace(df1)
microbenchmark(fun_do(df3),
fun_do2(df3),
fun_mut(df3),
fun_mut_ifelse(df3),
fun_replace(df3),
times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun_do(df3) 685.154605 693.327160 706.055271 712.180410 851.757790 10
# fun_do2(df3) 291.787455 294.047747 297.753888 299.624730 302.368554 10
# fun_mut(df3) 5.736640 5.883753 6.206679 6.353222 7.381871 10
# fun_mut_ifelse(df3) 24.321894 26.091049 29.361553 32.649924 52.981525 10
# fun_replace(df3) 4.616757 4.748665 4.981689 5.279716 5.911503 10
The replace function is fastest, and certainly easier to use than fun_mut when there is more than one index.
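For example, a hedged sketch with two replacement positions per group (the positions are made up for illustration; the %.% operator matches the pre-0.2 dplyr syntax used above):
id_multi <- c(1, 3)  # hypothetical positions to replace within each group
df1 %.%
  group_by(grp) %.%
  mutate(a = replace(a, id_multi, b[id_multi]))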
Edit 2: fun_do and fun_do2 no longer work in dplyr 0.2; Error: Results are not data frames at positions:
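For reference, a hedged sketch of how the do() step could be rewritten for dplyr >= 0.2, where do() takes an expression rather than a function and . stands for the current group's data (my reconstruction, not code from the original post):
fun_do3 <- function(df){
  df %>%
    group_by(grp) %>%
    do({
      dat <- .                # data for the current group
      dat$a[id] <- dat$b[id]  # same replacement as before
      dat                     # do() row-binds the returned data frames
    })
}
fun_do3(df1)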
Here's a much faster modify-in-place approach:
library(data.table)
# select rows we want, then assign b to a for those rows, in place
fun_dt = function(dt) dt[dt[, .I[id], by = grp]$V1, a := b]
# benchmark
df4 = data.frame(grp = rep(1:20000, each = 3),
a = rnorm(60000),
b = rnorm(60000))
dt4 = as.data.table(df4)
library(microbenchmark)
# using fastest function from OP
microbenchmark(fun_dt(dt4), fun_replace(df4), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_dt(dt4) 15.62325 17.22828 18.42445 20.83768 21.25371 10
# fun_replace(df4) 99.03505 107.31529 116.74830 188.89134 286.50199 10
I have a data frame that looks something like this (with a lot more observations)
df <- structure(list(session_user_id = c("1803f6c3625c397afb4619804861f75268dfc567",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"198b83b365fef0ed637576fe1bde786fc09817b2", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13",
"1b64c142b1540c43e3f813ccec09cb2dd7907c14", "1b7346d13f714c97725ba2e1c21b600535164291"
), raw_score = c(1, 1, 1, 1, 1, 0.2, NA, 1, 1, 1, 1, 0.2, 1),
submission_time = c(1389707078L, 1389694184L, 1389694188L,
1389694189L, 1389694194L, 1390115495L, 1389696939L, 1389696971L,
1389741306L, 1389985033L, 1389983862L, 1389854836L, 1389692240L
)), .Names = c("session_user_id", "raw_score", "submission_time"
), row.names = 28:40, class = "data.frame")
I want to create a new data frame with only one observation per "session_user_id", keeping the one with the latest "submission_time".
The only idea I have in mind is to create a list of unique users, write a loop to find the max submission_time for each user, and then write a loop that gets the raw score for that user and time.
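In code, that naive idea would look something like the following (a hedged base R sketch of the loop approach, not an efficient solution):
users <- unique(df$session_user_id)
res <- do.call(rbind, lapply(users, function(u) {
  rows <- df[df$session_user_id == u, ]
  rows[which.max(rows$submission_time), ]  # keep the latest submission per user
}))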
Can somebody show me a better way of doing this in R?
Thanks!
You could first order your data.frame by submission_time and remove all duplicated session_user_id entries afterwards:
## order by submission_time
df <- df[order(df$submission_time, decreasing=TRUE),]
## remove duplicated user_id
df <- df[!duplicated(df$session_user_id),]
# session_user_id raw_score submission_time
#33 198b83b365fef0ed637576fe1bde786fc09817b2 0.2 1390115495
#37 19fd8069c094fb0697508cc9646513596bea30c4 1.0 1389985033
#38 1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13 1.0 1389983862
#39 1b64c142b1540c43e3f813ccec09cb2dd7907c14 0.2 1389854836
#28 1803f6c3625c397afb4619804861f75268dfc567 1.0 1389707078
#32 1924cb2ebdf29f052187b9a2d21673e4d314199b 1.0 1389694194
#40 1b7346d13f714c97725ba2e1c21b600535164291 1.0 1389692240
This is simple to express with dplyr: first group by session id, then filter, selecting the row in each group with the maximum time:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
Alternatively, if you don't want to keep all maximum times (if duplicated), you could do:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(row_number(desc(submission_time)) == 1)
I'll add a data.table solution as well, and out of curiosity benchmark against dplyr on bigger data:
require(data.table)
DT <- as.data.table(df)
DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
Here I'm assuming that the OP needs just one observation, even for multiple identical "max" values. If not, check out the function f2 below.
Benchmarks on bigger data vs. dplyr:
Benchmarking against @hadley's dplyr solutions on bigger data, I'll assume there are about 50e3 user IDs and a total of 1e7 rows.
require(data.table) # 1.8.11 commit 1142
require(dplyr) # latest commit from github
set.seed(45L)
DT <- data.table(session_user_id = sample(paste0("id", 1:5e4), 1e7, TRUE),
raw_score = sample(10, 1e7, TRUE),
submission_time = sample(1e5:5e5, 1e7, TRUE))
DF <- tbl_df(as.data.frame(DT))
f1 <- function(DT) {
  DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
}
f2 <- function(DT) {
  DT[DT[, .I[submission_time == max(submission_time)],
        by=list(session_user_id)]$V1]
}
f3 <- function(DF) {
  DF %.%
    group_by(session_user_id) %.%
    filter(submission_time == max(submission_time))
}
f4 <- function(DF) {
  DF %.%
    group_by(session_user_id) %.%
    filter(row_number(desc(submission_time)) == 1)
}
And here are the timings. All are minimum of three runs:
system.time(a1 <- f1(DT))
# user system elapsed
# 1.044 0.056 1.101
system.time(a2 <- f2(DT))
# user system elapsed
# 1.384 0.080 1.475
system.time(a3 <- f3(DF))
# user system elapsed
# 4.513 0.044 4.555
system.time(a4 <- f4(DF))
# user system elapsed
# 6.312 0.004 6.314
As expected, f4 is the slowest because it uses desc (which I'm guessing involves an ordering or sorting per group, a more computationally expensive operation than just getting max or which.max).
Here, a1 and a4 (only one observation even if multiple max values are present) give identical results, and so do a2 and a3 (all max values).
data.table is at least 3x faster here (comparing f2 to f3) and about 5.7x faster when comparing f1 to f4.
You could use the "plyr" package to summarize the data. Something like this should work:
max_subs<-ddply(df,"session_user_id",summarize,max_sub=max(submission_time))
ddply takes a data frame in and returns a data frame, and this will give you the user and submission times you want.
To return the original data frame rows corresponding to these you could do
df2<-df[df$session_user_id %in% max_subs$session_user_id & df$submission_time %in% max_subs$max_sub,]
First find the max submission time by session_user_id. This table will be unique by session_user_id.
Then just merge (sql-speak: inner join) back to your original table joining on submission_time & session_user_id (R automatically picks up common names across the two data frames).
maxSessions<-aggregate(submission_time~session_user_id , df, max)
mySubset<-merge(df, maxSessions)
mySubset #this table has the data your are looking for
If you are looking for speed and your dataset is large then have a look at this How to summarize data by group in R? data.table & plyr are good choices.
This is just an extended comment because I was interested in how fast each of the solutions was:
library(microbenchmark)
library(plyr)
library(dplyr)
library(data.table)
df <- df[sample(1:nrow(df),10000,replace=TRUE),] # 10k records
fun.test1 <- function(df) {
  df <- df[order(df$submission_time, decreasing = TRUE),]
  df <- df[!duplicated(df$session_user_id),]
  return(df)
}
fun.test2 <- function(df) {
  max_subs <- ddply(df, "session_user_id", summarize, max_sub=max(submission_time))
  df2 <- df[df$session_user_id %in% max_subs$session_user_id &
              df$submission_time %in% max_subs$max_sub,]
  return(df2)
}
fun.test3 <- function(df) {
  df <- df %.%
    group_by(session_user_id) %.%
    filter(submission_time == max(submission_time))
  return(df)
}
fun.test4 <- function(df) {
  maxSessions <- aggregate(submission_time~session_user_id, df, max)
  mySubset <- merge(df, maxSessions)
  return(mySubset)
}
fun.test5 <- function(df) {
  df <- df[df$submission_time %in% by(df, df$session_user_id,
                                      function(x) max(x$submission_time)),]
  return(df)
}
dt <- as.data.table(df) # Assuming you're working with data.table to begin with
# Don't know a lot about data.table so I'm sure there's a faster solution
fun.test6 <- function(dt) {
  dt <- unique(
    dt[,
       list(raw_score, submission_time=max(submission_time)),
       by=session_user_id]
  )
  return(dt)
}
Looks like the most basic solution with !duplicated() wins by a significant margin for small data (under 1k rows), followed by dplyr. dplyr wins for larger samples (over 1k rows).
microbenchmark(
fun.test1(df),
fun.test2(df),
fun.test3(df),
fun.test4(df),
fun.test5(df),
fun.test6(dt)
)
expr min lq median uq max neval
fun.test1(df) 2476.712 2660.0805 2740.083 2832.588 9162.339 100
fun.test2(df) 5847.393 6215.1420 6335.932 6477.745 12499.775 100
fun.test3(df) 815.886 924.1405 1003.585 1050.169 1128.915 100
fun.test4(df) 161822.674 167238.5165 172712.746 173254.052 225317.480 100
fun.test5(df) 5611.329 5899.8085 6000.555 6120.123 57572.615 100
fun.test6(dt) 511481.105 541534.7175 553155.852 578643.172 627739.674 100
I need to create a column with a unique ID, basically adding the row number as its own column. My current data frame looks like this:
V1 V2
1 23 45
2 45 45
3 56 67
How to make it look like this:
V1 V2 V3
1 23 45
2 45 45
3 56 67
?
Many thanks
Two tidyverse alternatives (using sgibb's example data):
tibble::rowid_to_column(d, "ID")
which gives:
ID V1 V2
1 1 23 45
2 2 45 45
3 3 56 67
Or:
dplyr::mutate(d, ID = row_number())
which gives:
V1 V2 ID
1 23 45 1
2 45 45 2
3 56 67 3
As you can see, the rowid_to_column() function adds the new column in front of the other ones, while the mutate() + row_number() combination adds the new column after the others.
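If you prefer the mutate() route but want the ID in front, a hedged option (assuming dplyr >= 1.0, which introduced relocate()) is:
library(dplyr)
d %>%
  mutate(ID = row_number()) %>%
  relocate(ID)  # move the new ID column to the front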
And another base R alternative:
d$ID <- seq_along(d[,1])
You could use cbind:
d <- data.frame(V1=c(23, 45, 56), V2=c(45, 45, 67))
## enter id here, you could also use 1:nrow(d) instead of rownames
id <- rownames(d)
d <- cbind(id=id, d)
## set colnames to OP's wishes
colnames(d) <- paste0("V", 1:ncol(d))
EDIT: Here is a comparison of @dacko's suggestions. d$id <- seq_len(nrow(d)) is slightly faster, but the order of the columns is different (id is the last column; reordering them seems to be slower than using cbind):
library("microbenchmark")
set.seed(1)
d <- data.frame(V1=rnorm(1e6), V2=rnorm(1e6))
cbindSeqLen <- function(x) {
  return(cbind(id=seq_len(nrow(x)), x))
}
dickoa <- function(x) {
  x$id <- seq_len(nrow(x))
  return(x)
}
dickoaReorder <- function(x) {
  x$id <- seq_len(nrow(x))
  nc <- ncol(x)
  x <- x[, c(nc, 1:(nc-1))]
  return(x)
}
microbenchmark(cbindSeqLen(d), dickoa(d), dickoaReorder(d), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# cbindSeqLen(d) 23.00683 38.54196 40.24093 42.60020 47.73816 100
# dickoa(d) 10.70718 36.12495 37.58526 40.22163 72.92796 100
# dickoaReorder(d) 19.25399 68.46162 72.45006 76.51468 88.99620 100
Many presented their ideas, but I think this is the shortest and simplest code for this task:
data$ID <- 1:nrow(data)
One line. The one and only.
You could also do this using dplyr:
DF <- mutate(DF, id = rownames(DF))
data.table solution
Easier syntax and much faster
library(data.table)
dt <- data.table(V1=c(23, 45, 56), V2=c(45, 45, 67))
setnames(dt, c("V2", "V3")) # changing column names
dt[, V1 := .I] # Adding ID column
Hope this will help. The shortest and best way to create an ID column is:
dataframe$ID <- seq.int(nrow(dataframe))
If you're starting without named rows in your df, the tidy way is:
df %>%
mutate(id = row_number()) %>%
select(id, everything())
Here is a solution that keeps the dplyr piping format and places id in the first column, which may be preferred.
d %>%
mutate(id = rownames(.)) %>%
select(id, everything())
The function rownames_to_column() moves row names into a column; it is found in the tibble package (docs).
rownames_to_column(DF, "my_column_name")
Use column_to_rownames() for the reverse operation.
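For example, a small hedged round trip with these helpers (the column name is arbitrary, as above):
library(tibble)
d  <- data.frame(V1 = c(23, 45, 56), V2 = c(45, 45, 67))
d2 <- rownames_to_column(d, "my_column_name")   # row names -> first column
d3 <- column_to_rownames(d2, "my_column_name")  # and back to row names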
If your data frame is not too large, this will work:
# Load sample data
library(tibble)
Dt1 <- tibble(V1 = c(23, 45, 56), V2 = c(45, 45, 67))
# Create a separate tibble with row numbers
Dt2 <- tibble(id = seq_len(nrow(Dt1)))
# Join together
Dt3 <- cbind(Dt2, Dt1)