Subset to unique observations in R, keeping the last entry

I have a data frame that looks something like this (with a lot more observations)
df <- structure(list(session_user_id = c("1803f6c3625c397afb4619804861f75268dfc567",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"198b83b365fef0ed637576fe1bde786fc09817b2", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13",
"1b64c142b1540c43e3f813ccec09cb2dd7907c14", "1b7346d13f714c97725ba2e1c21b600535164291"
), raw_score = c(1, 1, 1, 1, 1, 0.2, NA, 1, 1, 1, 1, 0.2, 1),
submission_time = c(1389707078L, 1389694184L, 1389694188L,
1389694189L, 1389694194L, 1390115495L, 1389696939L, 1389696971L,
1389741306L, 1389985033L, 1389983862L, 1389854836L, 1389692240L
)), .Names = c("session_user_id", "raw_score", "submission_time"
), row.names = 28:40, class = "data.frame")
I want to create a new data frame with only one observation per "session_user_id", keeping the one with the latest "submission_time".
The only idea I have in mind is to create a list of unique users, write a loop to find the max of submission_time for each user, and then write another loop that gets the raw score for that user and time.
Can somebody show me a better way of doing this in R?
Thanks!
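For reference, a rough sketch of the loop approach described in the question (my illustration only; the answers below are both shorter and faster):
users <- unique(df$session_user_id)
out <- df[0, ]                                   # empty frame with the same columns
for (u in users) {
  rows <- df[df$session_user_id == u, ]
  out <- rbind(out, rows[which.max(rows$submission_time), ])
}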

You could first order your data.frame by submission_time and remove all duplicated session_user_id entries afterwards:
## order by submission_time
df <- df[order(df$submission_time, decreasing=TRUE),]
## remove duplicated user_id
df <- df[!duplicated(df$session_user_id),]
# session_user_id raw_score submission_time
#33 198b83b365fef0ed637576fe1bde786fc09817b2 0.2 1390115495
#37 19fd8069c094fb0697508cc9646513596bea30c4 1.0 1389985033
#38 1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13 1.0 1389983862
#39 1b64c142b1540c43e3f813ccec09cb2dd7907c14 0.2 1389854836
#28 1803f6c3625c397afb4619804861f75268dfc567 1.0 1389707078
#32 1924cb2ebdf29f052187b9a2d21673e4d314199b 1.0 1389694194
#40 1b7346d13f714c97725ba2e1c21b600535164291 1.0 1389692240

This is simple to express with dplyr: first group by session id, then filter, selecting the row in each group with the maximum time:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
Alternatively, if you don't want to keep all maximum times (if duplicated), you could do:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(row_number(desc(submission_time)) == 1)
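A note for readers on current dplyr: the %.% chaining operator used above is long-deprecated in favour of %>% (or base R's |>), and the "one row per group by maximum" idea is now usually written with slice_max(). A rough sketch, not part of the original answer:
library(dplyr)
df %>%
  group_by(session_user_id) %>%
  slice_max(submission_time, n = 1, with_ties = FALSE) %>%
  ungroup()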

I'll add a data.table solution as well, and out of curiosity benchmark against dplyr on bigger data:
require(data.table)
DT <- as.data.table(df)
DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
Here I'm assuming that the OP needs just one observation, even for multiple identical "max" values. If not, check out the function f2 below.
Benchmarks on bigger data vs dplyr:
Benchmarking against @hadley's dplyr solutions, I'll assume there are about 50e3 user ids and a total of 1e7 rows.
require(data.table) # 1.8.11 commit 1142
require(dplyr) # latest commit from github
set.seed(45L)
DT <- data.table(session_user_id = sample(paste0("id", 1:5e4), 1e7, TRUE),
raw_score = sample(10, 1e7, TRUE),
submission_time = sample(1e5:5e5, 1e7, TRUE))
DF <- tbl_df(as.data.frame(DT))
f1 <- function(DT) {
DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
}
f2 <- function(DT) {
DT[DT[, .I[submission_time == max(submission_time)],
by=list(session_user_id)]$V1]
}
f3 <- function(DF) {
DF %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
}
f4 <- function(DF) {
DF %.%
group_by(session_user_id) %.%
filter(row_number(desc(submission_time)) == 1)
}
And here are the timings. All are the minimum of three runs:
system.time(a1 <- f1(DT))
# user system elapsed
# 1.044 0.056 1.101
system.time(a2 <- f2(DT))
# user system elapsed
# 1.384 0.080 1.475
system.time(a3 <- f3(DF))
# user system elapsed
# 4.513 0.044 4.555
system.time(a4 <- f4(DF))
# user system elapsed
# 6.312 0.004 6.314
As expected, f4 is the slowest because it uses desc (which I'm guessing involves an ordering or sort per group - a more computationally expensive operation than just getting max or which.max).
Here, a1 and a4 (only one observation even if multiple max values are present) give identical results, and so do a2 and a3 (all max values).
data.table is at least 3x faster here (comparing a2 to a3) and about 5.7x faster when comparing f1 to f4.

You could use the "plyr" package to summarize the data. Something like this should work:
max_subs<-ddply(df,"session_user_id",summarize,max_sub=max(submission_time))
ddply takes a data frame in and returns a data frame, and this will give you the user and submission times you want.
To return the original data frame rows corresponding to these you could do
df2<-df[df$session_user_id %in% max_subs$session_user_id & df$submission_time %in% max_subs$max_sub,]
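One caveat (my note, not part of the original answer): the two independent %in% filters can also keep a row whose submission_time equals some other user's maximum. A stricter sketch that matches each user only against their own maximum time is to merge on both columns:
df2 <- merge(df, max_subs,
             by.x = c("session_user_id", "submission_time"),
             by.y = c("session_user_id", "max_sub"))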

First find the max submission time by session_user_id. This table will be unique by session_user_id.
Then just merge (sql-speak: inner join) back to your original table joining on submission_time & session_user_id (R automatically picks up common names across the two data frames).
maxSessions<-aggregate(submission_time~session_user_id , df, max)
mySubset<-merge(df, maxSessions)
mySubset # this table has the data you are looking for
If you are looking for speed and your dataset is large, have a look at "How to summarize data by group in R?"; data.table and plyr are good choices there.

This is just an extended comment, since I was interested in how fast each of the solutions is.
library(microbenchmark)
library(plyr)
library(dplyr)
library(data.table)
df <- df[sample(1:nrow(df),10000,replace=TRUE),] # 10k records
fun.test1 <- function(df) {
df <- df[order(df$submission_time, decreasing = TRUE),]
df <- df[!duplicated(df$session_user_id),]
return(df)
}
fun.test2 <- function(df) {
max_subs<-ddply(df,"session_user_id",summarize,max_sub=max(submission_time))
df2<-df[df$session_user_id %in% max_subs$session_user_id &
df$submission_time %in% max_subs$max_sub,]
return(df2)
}
fun.test3 <- function(df) {
df <- df %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
return(df)
}
fun.test4 <- function(df) {
maxSessions<-aggregate(submission_time~session_user_id , df, max)
mySubset<-merge(df, maxSessions)
return(mySubset)
}
fun.test5 <- function(df) {
df <- df[df$submission_time %in% by(df, df$session_user_id,
function(x) max(x$submission_time)),]
return(df)
}
dt <- as.data.table(df) # Assuming you're working with data.table to begin with
# Don't know a lot about data.table so I'm sure there's a faster solution
fun.test6 <- function(dt) {
dt <- unique(
dt[,
list(raw_score,submission_time=max(submission_time)),
by=session_user_id]
)
return(dt)
}
Looks like the most basic solution with !duplicated() wins by a significant margin for small data (under 1k rows), followed by dplyr; dplyr wins for larger samples (over 1k rows).
microbenchmark(
fun.test1(df),
fun.test2(df),
fun.test3(df),
fun.test4(df),
fun.test5(df),
fun.test6(dt)
)
expr min lq median uq max neval
fun.test1(df) 2476.712 2660.0805 2740.083 2832.588 9162.339 100
fun.test2(df) 5847.393 6215.1420 6335.932 6477.745 12499.775 100
fun.test3(df) 815.886 924.1405 1003.585 1050.169 1128.915 100
fun.test4(df) 161822.674 167238.5165 172712.746 173254.052 225317.480 100
fun.test5(df) 5611.329 5899.8085 6000.555 6120.123 57572.615 100
fun.test6(dt) 511481.105 541534.7175 553155.852 578643.172 627739.674 100

Related

R: Suggestion to speed up a function (remove duplicates in data frame)

I've run into a bit of trouble with my code and would welcome any suggestions to make it run faster.
I have a data frame that looks like this:
Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
d <- data.frame(Name,Category,More_info)
So I have duplicated entries for each row in column Name (the number of duplicates can vary). For each entry (a, b, ...) I want to count the occurrences of each corresponding element in the Category column and keep only the category that appears most often. If an entry has several categories tied for the most occurrences, I want to pick one of those categories at random.
So in this case, the output dataframe would look like this:
Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)
a has the sun entry kept because it appears the most; b could be dog or any of its other values, as they all appear once; and c wouldn't be changed.
My function looks like this:
my_choosing_function <- function(x){
tmp = dbSNP_hapmap[dbSNP_hapmap$refsnp_id==list_of_snps[x],]
snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
best_hit <- snp_freq[order(-snp_freq$Freq),]
best_hit$SNP<-list_of_snps[x]
top<-best_hit[1,]
return(top)
}
trst <- lapply(1:length(list_of_snps), function(x) my_choosing_function(x))
final <- do.call("rbind",trst)
I start from a list of unique elements (that would be Name in our case); for each element I build a table of the duplicated entries, order the table by descending frequency, and keep the top element. I lapply over each element of the list of unique values, then rbind the whole thing together.
As I have 2,500,000 rows in my initial data frame and 1,500,000 unique elements, it takes forever to run: 4 seconds for 100 lines, which would be a total of about 34 hours for the lapply.
I'm sure packages like dplyr can do it in a few minutes, but I can't find a solution. Does anyone have an idea?
Thanks a lot for your help!
Note: this is really an extended comment, because I use data.table instead of dplyr.
I suggest using data.table because it runs faster. And in the data.table approach shown below, it randomly chooses one category in case of a tie, not always the first one.
library(data.table)
library(dplyr)
library(microbenchmark)
d <- data.frame(
Name = as.character(sample.int(10000, 2.5e6, replace = T)),
Category = as.character(sample.int(10000, 2.5e6, replace = T)),
More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 45.932 0.808 46.745
system.time({
dt <- as.data.table(d)
dt.max <- dt[, .N, by = .(Name, Category)]
dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
dt.max <- dt.max[r == 1, .(Name, Category)]
dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 2.424 0.004 2.426
We can modify the Mode function from here and then do a group by filter
library(dplyr)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
A couple of slight tweaks on @mt1022's solution can produce a marginal speedup; nothing to write home about, but it might be of use if your data grows by another order of magnitude.
library(data.table)
library(dplyr)
d <- data.frame(
Name = as.character(sample.int(10000, 2.5e6, replace = T)),
Category = as.character(sample.int(5000, 2.5e6, replace = T)),
More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 40.459 0.180 40.743
system.time({
dt <- as.data.table(d)
dt.max <- dt[, .N, by = .(Name, Category)]
dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
dt.max <- dt.max[r == 1, .(Name, Category)]
dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 4.196 0.052 4.267
Tweaks include:
Using setDT() instead of as.data.table() to avoid making a copy
Using stats::runif() to generate the random tiebreaker directly, which is roughly what data.table does internally for the 'random' option of frank()
Using setkey() to sort the table
Subsetting the table by the row indices .I where the row within each group equals the number of observations .N in that group (this returns the last row of each group)
Results:
system.time({
dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
dt.max[,rand := stats::runif(.N)]
setkey(dt.max,Name,Count, rand)
dt.max[dt.max[,.I[.N],by = .(Name)]$V1,.(Name,Category,Count)] # group by Name so the last (highest-count) row per Name is kept
})
# user system elapsed
# 1.722 0.057 1.750

Conditionally Join Dataframes by Row in R

I would like to conditionally merge two tables with the following formats:
id1 <- c('S001', 'S002', 'S003', 'S004', 'S004')
id2 <- c('S001', 'S001', 'S002', 'S002', 'S001')
ids <- data.frame(id1, id2)
and
bad_id_key <- c('S002', 'S004')
bad_id_val <- c('a', 'b')
bad_ids <- data.frame(bad_id_key, bad_id_val)
The conditional rules are:
If both IDs are in the "bad" list, drop that row
If neither ID is in the "bad" list, drop that row
If only one of the IDs is bad, add the bad value to the row.
The resulting table would look like:
id1 id2 bad_id_val
2 S002 S001 a
3 S003 S002 a
5 S004 S001 b
I was able to accomplish this with the following code snippet:
conditionalJoin <- function(row){
if(row$id1 %in% bad_id_key & row$id2 %in% bad_id_key){
# do nothing
}
else if(row$id1 %in% bad_id_key){
merge(x=row, y=bad_ids, by.x="id1", by.y="bad_id_key", all.x=TRUE)
}
else if(row$id2 %in% bad_id_key){
merge(x=row, y=bad_ids, by.x="id2", by.y="bad_id_key", all.x=TRUE)
}
}
out <- do.call("rbind", as.list(by(ids, 1:nrow(ids), conditionalJoin)))
However, this approach scales extremely poorly as the size of the ids data frame grows. I think this is because of the rbind function. Also, the if/else chain is not very elegant R code.
Does anyone know of an R command to do this kind of row-wise conditional joining that is more efficient than rbind? Thanks in advance.
Using the data.table package, I would approach it as follows:
library(data.table)
ids <- setDT(ids)[xor(id1 %in% bad_ids$bad_id_key, id2 %in% bad_ids$bad_id_key)
][, bad_id_val := ifelse(id1 %in% bad_ids$bad_id_key,
as.character(bad_ids$bad_id_val[match(id1, bad_ids$bad_id_key)]),
as.character(bad_ids$bad_id_val[match(id2, bad_ids$bad_id_key)]))]
which gives the desired result:
> ids
id1 id2 bad_id_val
1: S002 S001 a
2: S003 S002 a
3: S004 S001 b
Tested on @jeremycg's larger dataset, this gives the following outcome with regard to speed:
Unit: milliseconds
expr min lq mean median uq max neval cld
jeremy 9.196898 9.386950 9.854132 9.603002 9.749256 16.764747 100 b
OP 974.933816 985.813821 996.770067 992.145890 1000.411484 1143.402837 100 c
jaap 3.572531 3.612401 3.779686 3.679115 3.790707 9.803782 100 a
This is the fastest I can get it using dplyr. It's considerably faster because there are only two match calls; everything else is quick. See the benchmark below.
library(dplyr)
ids %>% mutate(x = match(id1, bad_ids$bad_id_key), #get the first match of id1
y = match(id2, bad_ids$bad_id_key)) %>% #and id2
filter(xor(is.na(x), is.na(y))) %>% #filter to make sure we have 1 match
mutate(val = ifelse(is.na(x), #if x didn't match
as.character(bad_ids$bad_id_val[y]), #get the y
as.character(bad_ids$bad_id_val[x]))) # otherwise get the x
Here's a benchmark on larger data:
#5000 lines of ids
set.seed(12345)
ids <- data.frame(id1 = sample(1:50, 5000, replace = TRUE), id2 = sample(1:50, 5000, replace = TRUE))
bad_ids <- data.frame(bad_id_key = 1:20, bad_id_val = letters[1:20])
microbenchmark::microbenchmark(
me = {
ids %>% mutate(x = match(id1, bad_ids$bad_id_key),
y = match(id2, bad_ids$bad_id_key)) %>%
filter(xor(is.na(x), is.na(y))) %>%
mutate(val = ifelse(is.na(x),
as.character(bad_ids$bad_id_val[y]),
as.character(bad_ids$bad_id_val[x])))},
OP = {out <- do.call("rbind", as.list(by(ids, 1:nrow(ids), conditionalJoin)))}
)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
   me   11.92924   12.41934   15.36524   13.07722   15.71085   63.14211   100
   OP 1831.34599 1910.90149 2369.70980 2112.57251 2340.88428 5549.01191   100
Instead of using ifelse functions, it is often better to work within the data.frame or data.table itself to identify the records you want to keep. For your example you could do this with the following code:
ids[xor(ids$id1 %in% bad_id_key, ids$id2 %in% bad_id_key),]
After running this code you just need to merge ids and bad_ids to append the bad id value.
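For completeness, a sketch of that final step (my illustration on the same example data, not code from the original answer):
keep <- ids[xor(ids$id1 %in% bad_ids$bad_id_key, ids$id2 %in% bad_ids$bad_id_key), ]
## pick whichever id column is the bad one, then look up its value
bad_key <- ifelse(keep$id1 %in% bad_ids$bad_id_key,
                  as.character(keep$id1), as.character(keep$id2))
keep$bad_id_val <- bad_ids$bad_id_val[match(bad_key, bad_ids$bad_id_key)]
keep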

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a vector of the above values, split using strsplit. I then loop through the entire database, getting the ith value from those three vectors and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to a 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit to split the columns by specifying the split characters, and unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns and we want to split only these 3, we can specify .SDcols = 1:3 (i.e. the column indices, or the actual column names), assign (:=) the output to new columns, and subset only the columns needed in the output.
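A sketch of that variant, assuming the real data has extra columns beyond these three; the new column names here are placeholders of my choosing, not from the original answer:
library(data.table)
setDT(df1)[, c("ref", "alt", "chrom", "base", "af1", "af2", "af3") :=
               unlist(lapply(.SD, tstrsplit, split='[>:/]', type.convert=TRUE),
                      recursive=FALSE),
           .SDcols = 1:3]
## then keep only the columns needed in the output, e.g.:
## df1[, .(chrom, base, ref, alt, af = af3)]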
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name, so you may have to rename it or quote it depending on how exactly R has stored it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
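For instance, if the column came through with its raw name rather than the syntactic MAFinPercent.EA.AA.All., one option (a sketch, assuming dplyr's rename()) is to quote it with backticks or rename it up front:
library(dplyr)
total.esp <- rename(total.esp, MAFinPercent = `MAFinPercent(EA/AA/All)`)
## or refer to it with backticks directly inside separate()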
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from #akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3) # 'fraction' is the strsplit result from the question
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)

Apply a function to each row of a data.table in R

I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:
Data
Given two data.tables (dt and dt_lookup)
library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n),
thisTime=sample(t, n, replace=TRUE),
thisLocation=sample(la,n,replace=TRUE),
finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)
set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
lkpTime=sample(t, 10000, replace=TRUE),
lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposely recycled
setkey(dt_lookup, lkpLocation)
I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime)
Function
## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){
## filter lookup based on thisLocation and finalLocation,
## and only return values where the lkpId has both 'this' and 'final' locations
tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
availServices <- tempThis[tempThis %in% tempFinal]
tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]
## calculate time difference between 'thisTime' and 'lkpTime' (from thisLocation)
temp2 <- thisTime - tempThisFinal$lkpTime
## take the lkpId with the minimum non-negative difference
selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
selectedId
}
Attempts at a solution
I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) when n/nrow > 1,000,000. So I've tried to implement a data.table solution to see if it's faster:
selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]
However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:
lkpIds <- apply(dt, 1, function(x){
thisLocation <- as.character(x[["thisLocation"]])
finalLocation <- as.character(x[["finalLocation"]])
thisTime <- as.numeric(x[["thisTime"]])
myId <- getId(thisTime, thisLocation, finalLocation)
})
both taking ~30 seconds for n = 10,000.
Question
Is there a better way of using data.table to apply the getId function over each row of dt ?
Update 12/08/2015
Thanks to the pointer from @eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.
Having spent the time since asking this question looking into what data.table has to offer, and researching data.table joins thanks to @eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.
One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.
And, there will no doubt be better ways of programming this, but here's my attempt.
## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'
## find all lookup id's where 'thisLocation' matches 'lookupLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
dt_this <- dt[dt_lookup, {
idx = thisTime - i.lkpTime > 0
.(id = id[idx],
lkpId = i.lkpId,
thisTime = thisTime[idx],
lkpTime = i.lkpTime)
},
by=.EACHI]
## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]
## find all matching 'finalLocation' and 'lookupLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]
## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)
dt_join <- dt_this[dt_final, nomatch=0]
## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]
dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]
## equivalent dplyr code
# library(dplyr)
# dt_this <- dt_this %>%
# group_by(id) %>%
# arrange(timeDiff) %>%
# slice(1) %>%
# ungroup
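As a side note, the "nearest non-negative time difference" part on its own can also be written as a data.table rolling join. A minimal sketch of that idea (my illustration, not part of the solution above; it ignores the finalLocation constraint, which the second join above takes care of):
## for each row of dt, take the lkpId with the latest lkpTime at or before thisTime,
## within the matching location (roll = TRUE rolls the last observation forward)
nearest <- dt_lookup[dt,
                     on = .(lkpLocation = thisLocation, lkpTime = thisTime),
                     roll = TRUE]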

Using R's plyr package to reorder groups within a dataframe

I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
group value
2 b 0.1408790
6 b 1.1450040 #2nd b is smaller than 1st
1 c 5.7433568
3 c 2.2109819
4 d 0.5384659
5 d 4.5382979
What I would like is this.
group value
b 1.1450040 #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659
So, what I need plyr to do is go through each group and apply something like order to the numeric data, reorganize by that order, save the reordered subset of data, and put it all back together at the end.
I can process this "by hand" with a list and some loops, but it takes a very long time. Can this be done by plyr in a couple of lines?
Example data
df.sz <- 6;groups <-c("a","b","c","d")
df <- data.frame(group = sample(groups,df.sz,replace = TRUE),
value = runif(df.sz,0,10),stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by groups, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way, using base R functions, to turn a list into a dataframe? A short sketch follows the loop code below.)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
i.working <- which(df$group == groups.u[i])
my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
order.x <- order(my.list[[i]]$value,na.last = TRUE, decreasing = TRUE)
my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally rebuild df from the list. 1st, make seed for loop
new.df <- my.list[[1]][1,]; new.df[1,] <- NA
for(i in 1:length(my.list)){
new.df <- rbind(new.df,my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
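Regarding the parenthetical question about turning a list back into a data frame with base R: the usual idiom is do.call() with rbind, which replaces the seed-row loop above. A short sketch:
new.df <- do.call(rbind, my.list)   # stack all list elements in one call
rownames(new.df) <- NULL            # optional: drop the compound row names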
You could use dplyr which is a newer version of plyr that focuses on data frames:
library(dplyr)
arrange(df, group, desc(value))
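Since the question asks about plyr specifically, here is a sketch of the plyr equivalent (my addition, not part of the original answer; note that loading plyr after dplyr masks some dplyr functions):
library(plyr)
arrange(df, group, desc(value))   # plyr also provides arrange() and desc()
## or group-wise with ddply:
ddply(df, "group", function(x) x[order(x$value, decreasing = TRUE), ])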
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts your columns by reference (again, no copying).
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
group = sample(groups, df.sz, replace = TRUE),
value = runif(df.sz,0,10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1() 5.749002 5.981274 7.725225 6.270664 8.831899 67.402052 100
# dt2() 4.956020 5.096143 5.750724 5.229124 5.663545 8.620155 100
# dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736 100
