Apply a function to each row of a data.table

I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:
Data
Given two data.tables (dt and dt_lookup)
library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n),
                 thisTime=sample(t, n, replace=TRUE),
                 thisLocation=sample(la, n, replace=TRUE),
                 finalLocation=sample(lb, n, replace=TRUE))
setkey(dt, thisLocation)
set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
                        lkpTime=sample(t, 10000, replace=TRUE),
                        lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposely recycled
setkey(dt_lookup, lkpLocation)
I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime)
Function
## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){
  ## filter lookup based on thisLocation and finalLocation,
  ## and only return values where the lkpId has both 'this' and 'final' locations
  tempThis <- unique(dt_lookup[lkpLocation == thisLocation, lkpId])
  tempFinal <- unique(dt_lookup[lkpLocation == finalLocation, lkpId])
  availServices <- tempThis[tempThis %in% tempFinal]
  tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation == thisLocation, .(lkpId, lkpTime)]
  ## calculate time difference between 'thisTime' and 'lkpTime' (from thisLocation)
  temp2 <- thisTime - tempThisFinal$lkpTime
  ## take the lkpId with the minimum non-negative difference
  selectedId <- tempThisFinal[min(which(temp2 == min(temp2[temp2 > 0]))), lkpId]
  selectedId
}
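For reference, calling it for a single row of dt looks like this (an illustration only; which lkpId comes back depends on the simulated data above):
getId(thisTime      = dt$thisTime[1],
      thisLocation  = dt$thisLocation[1],
      finalLocation = dt$finalLocation[1])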
Attempts at a solution
I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) once nrow(dt) exceeded 1,000,000. So I've tried to implement a data.table solution to see if it's faster:
selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]
However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:
lkpIds <- apply(dt, 1, function(x){
  thisLocation <- as.character(x[["thisLocation"]])
  finalLocation <- as.character(x[["finalLocation"]])
  thisTime <- as.numeric(x[["thisTime"]])
  myId <- getId(thisTime, thisLocation, finalLocation)
})
both taking ~30 seconds for n = 10,000.
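As an aside, part of the cost of the apply route is that apply coerces each row to a single atomic vector, so the mixed-type dt becomes character and every value has to be converted back (hence the as.character/as.numeric calls above). A tiny demonstration:
## apply over rows sees a plain character vector, not a data.table row
apply(data.table(a = 1:2, b = c("x", "y")), 1, class)
## both elements come back as "character"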
Question
Is there a better way of using data.table to apply the getId function over each row of dt?
Update 12/08/2015
Thanks to the pointer from @eddi I've redesigned my whole algorithm and am making use of rolling joins, thus making proper use of data.table. I'll write up an answer later.

Having spent the time since asking this question looking into what data.table has to offer, and researching data.table joins thanks to @eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.
One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.
And, there will no doubt be better ways of programming this, but here's my attempt.
## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'
## find all lookup ids where 'thisLocation' matches 'lkpLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
dt_this <- dt[dt_lookup, {
  idx = thisTime - i.lkpTime > 0
  .(id = id[idx],
    lkpId = i.lkpId,
    thisTime = thisTime[idx],
    lkpTime = i.lkpTime)
},
by = .EACHI]
## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]
## find all matching 'finalLocation' and 'lkpLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]
## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)
dt_join <- dt_this[dt_final, nomatch=0]
## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]
dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]
## equivalent dplyr code for this last step
# library(dplyr)
# dt_join <- dt_join %>%
#   group_by(id) %>%
#   arrange(timeDiff) %>%
#   slice(1) %>%
#   ungroup()
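For completeness, here is a minimal standalone sketch of the rolling-join idea mentioned in the update above; lookup and query are toy tables made up for illustration (not dt/dt_lookup), and roll = Inf picks, for each query row, the lookup row with the largest lkpTime not exceeding thisTime within the same location:
lookup <- data.table(loc = c("a", "a", "b"),
                     lkpTime = c(10, 50, 30),
                     lkpId = c("l-1", "l-2", "l-3"))
query <- data.table(loc = c("a", "b"), thisTime = c(60, 35))
## match loc exactly, then roll lkpTime forward (LOCF) to the nearest value <= thisTime
lookup[query, on = .(loc, lkpTime = thisTime), roll = Inf]
## result: (a, 60) -> l-2, (b, 35) -> l-3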

Related

R: Create variable using iteratively updated values of previous row (=lag) similar to cumsum (depreciation)

My goal is to add the value to the cumulative sum csum of the previous row multiplied by a scalar delta. Coming from Stata (where this is straightforward), I didn't know that it would not work with dplyr's lag function (see: Lag doesn't see the effects of mutate on previous rows), nor with data.table's shift, because both commands do not iteratively update the csum. It seems some kind of loop is required, but I can't figure out how to write it.
Code that does not do the job
delta= 0.94
df <- df %>% mutate(dcsum1 = value + delta*lag(csum, default=0))
setDT(df) #library(data.table) required
df$dcsumR2=df$csum
Example data & desired output:
library(dplyr)
df <- tribble(~date,        ~value, ~csum, ~dplyr_lag_output, ~desired_output,
              "2017-01-01",      1,     1,              1,              1,
              "2017-01-02",      2,     3,              2.94,           2.94,
              "2017-01-03",     NA,     3,              2.94,           2.76,
              "2017-01-04",      3,     6,              5.82,           5.49,
              "2017-01-05",      4,    10,              9.64,           8.94)
You could use Reduce with accumulate = T:
Reduce(function(prev, value) delta * prev + ifelse(is.na(value), 0, value),
       x = df$value[-1],
       init = ifelse(is.na(df$value[1]), 0, df$value[1]),
       accumulate = T)
[1] 1.000000 2.940000 2.763600 5.597784 9.261917
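The same recursion written as an explicit base R loop (a sketch using df and delta from above), which may make the Reduce call easier to follow:
v <- ifelse(is.na(df$value), 0, df$value)   # treat NA as 0, as in the Reduce call
out <- numeric(length(v))
out[1] <- v[1]
for (i in 2:length(v)) out[i] <- delta * out[i - 1] + v[i]
out
# [1] 1.000000 2.940000 2.763600 5.597784 9.261917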
With data.table:
library(data.table)
setDT(df)
df[, output := Reduce(function(prev, value) delta * prev + ifelse(is.na(value), 0, value),
                      x = value[-1],
                      init = ifelse(is.na(value[1]), 0, value[1]),
                      accumulate = T)]

How to insert sequential rows in data.table in R (example given)?

df is a data.table and df_expected is the desired data.table. I want to add an hour column running from 0 to 23, with the visits value filled in as 0 for the newly added hours.
df <- data.table(customer=c("x","x","x","y","y"),
                 location_id=c(1,1,1,2,3),
                 hour=c(2,5,7,0,4),
                 visits=c(40,50,60,70,80))
df_expected <- data.table(
  customer=c("x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x",
             "y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y",
             "y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y"),
  location_id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
                2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
                3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
  hour=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
         0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
         0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
  visits=c(0,0,40,0,0,50,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
           70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
           0,0,0,0,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
This is what I tried to obtain my result, but it did not work:
df1<-df[,':='(hour=seq(0:23)),by=(customer)]
Error in `[.data.table`(df, , `:=`(hour = seq(0L:23L)), by = (customer)) :
Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact
performance too much for the fastest cases. Either change the type of the target column, or
coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Here's an approach that creates the target and then uses a join to add in the visits information. The ifelse statement just helps us clean up the NAs from the merge. You could also leave them in and replace them with := in the new data.table.
target <- data.table(
  customer = rep(unique(df$customer), each = 24),
  hour = 0:23)
df_join <- df[target, on = c("customer", "hour"),
              .(customer, hour, visits = ifelse(is.na(visits), 0, visits))]
all.equal(df_expected, df_join)
Edit:
This addresses the request to include the location_id column. One way to do this is with by = location_id in the creation of the target. I've also added in some of the code from chinsoon12's answer.
target <- df[, .("customer" = rep(unique(customer), each = 24L),
                 "hour" = rep(0L:23L, times = uniqueN(customer))),
             by = location_id]
df_join <- df[target, on = .NATURAL,
              .(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, df_join)
Another option using CJ to generate your universe, on=.NATURAL for joining on identically named columns, and fcoalesce to handle NAs:
df[CJ(customer, hour=0L:23L, unique=TRUE), on=.NATURAL, allow.cartesian=TRUE,
   .(customer=i.customer, hour=i.hour, visits=fcoalesce(visits, 0))]
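As a side note, the CJ() call in i simply builds the full customer-by-hour grid to join against; a standalone sketch of what it produces:
## every unique customer paired with every hour: 2 x 24 = 48 rows
CJ(customer = c("x", "y"), hour = 0L:23L, unique = TRUE)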
Here's a for-loop answer.
df_final <- data.table()
for(i in 0:23){
  if(i %in% df[, hour]){
    a <- df[hour == i]
  } else {
    a <- data.table(customer = "x", hour = i, visits = 0)
  }
  df_final <- rbind(df_final, a, fill = TRUE)
}
df_final
You can wrap this in another for-loop to handle your multiple customers x, y, etc. (the following loop isn't very clean but gets the job done).
df_final <- data.table()
for(j in unique(df[, customer])){
  for(i in 0:23){
    if(i %in% df[, hour]){
      if(df[hour == i, customer] %in% j){
        a <- df[hour == i]
      } else {
        a <- data.table(customer = j, hour = i, visits = 0)
      }
    } else {
      a <- data.table(customer = j, hour = i, visits = 0)
    }
    df_final <- rbind(df_final, a, fill = TRUE)
  }
}
df_final

R: Suggestion to speed up a function (remove duplicates in data frame)

I run into a bit of trouble with my code and would welcome any suggestion to make it run faster.
I have a data frame that looks like that:
Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
d <- data.frame(Name,Category,More_info)
So I have duplicated entries for each row in column Name (the number of duplicates can vary). For each entry (a, b, ...) I want to count the occurrences of each corresponding element in the Category column and keep only the category that appears most often. If an entry's categories are tied in count, I want to pick one of them at random.
So in this case, the output dataframe would look like this:
Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)
a has its sun entry kept because it appears the most, b could be dog or any of the other values since they all appear once for b, and c wouldn't be changed.
My function looks like this:
my_choosing_function <- function(x){
  tmp <- dbSNP_hapmap[dbSNP_hapmap$refsnp_id == list_of_snps[x], ]
  snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
  best_hit <- snp_freq[order(-snp_freq$Freq), ]
  best_hit$SNP <- list_of_snps[x]
  top <- best_hit[1, ]
  return(top)
}
trst <- lapply(1:length(list_of_snps), function(x) my_choosing_function(x))
final <- do.call("rbind", trst)
Where I start from a list of unique elements (that would be Name in our case), for each element I do a table of the duplicated entries, I order the table by descending values and keep the top element. I do a lapply for each element of the list of unique values, then do a rbind of the whole thing.
As I have 2500000 rows in my initial data frame and 1500000 unique elements, it takes forever to run. 4 seconds for 100 lines, that would be a total of 34 hours for the lapply.
I'm sure packages like dplyr can do it in a few minutes, but I can't find a way to do it. Does anyone have an idea?
Thanks a lot for your help!
Note: This should be a very long comment because I use data.table instead of dplyr.
I suggest using data.table because it runs faster. In the data.table approach shown below, it randomly chooses one category in case of a tie, rather than always taking the first one.
library(data.table)
library(dplyr)
library(microbenchmark)
d <- data.frame(
  Name = as.character(sample.int(10000, 2.5e6, replace = T)),
  Category = as.character(sample.int(10000, 2.5e6, replace = T)),
  More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
  ux <- unique(x)
  fr1 <- tabulate(match(x, ux))
  if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
  d %>%
    group_by(Name) %>%
    slice(which(Category == Mode(Category))[1])
})
#   user  system elapsed
# 45.932   0.808  46.745
system.time({
  dt <- as.data.table(d)
  dt.max <- dt[, .N, by = .(Name, Category)]
  dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
  dt.max <- dt.max[r == 1, .(Name, Category)]
  dt[dt.max, on = .(Name, Category), mult = 'first']
})
#   user  system elapsed
#  2.424   0.004   2.426
We can modify the Mode function from here and then do a grouped slice.
library(dplyr)
Mode <- function(x) {
  ux <- unique(x)
  fr1 <- tabulate(match(x, ux))
  if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
d %>%
  group_by(Name) %>%
  slice(which(Category == Mode(Category))[1])
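To see what Mode does on its own, here is a quick check on small vectors (a sketch; the tied case returns a random element):
Mode(c("sun", "cat", "sun", "sun", "sea"))
# [1] "sun"   (highest count wins)
Mode(c("sun", "sea", "cat", "dog"))
# one of the four values, chosen at random (all counts tied)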
A couple of slight tweaks on @mt1022's solution can produce a marginal speedup, nothing to phone home about, but might be of use if you find your data grows another order of magnitude.
library(data.table)
library(dplyr)
d <- data.frame(
  Name = as.character(sample.int(10000, 2.5e6, replace = T)),
  Category = as.character(sample.int(5000, 2.5e6, replace = T)),
  More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
  ux <- unique(x)
  fr1 <- tabulate(match(x, ux))
  if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
  d %>%
    group_by(Name) %>%
    slice(which(Category == Mode(Category))[1])
})
#   user  system elapsed
# 40.459   0.180  40.743
system.time({
  dt <- as.data.table(d)
  dt.max <- dt[, .N, by = .(Name, Category)]
  dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
  dt.max <- dt.max[r == 1, .(Name, Category)]
  dt[dt.max, on = .(Name, Category), mult = 'first']
})
#   user  system elapsed
#  4.196   0.052   4.267
Tweaks include
Use setDT() instead of as.data.table() to avoid making a copy
Using stats::runif() to generate the random tiebreaker directly, which is roughly what data.table does internally for the 'random' ties method of frank()
Using setkey() to sort the table
Subsetting the table by the row indices .I, keeping the row whose position within each group equals the number of observations .N in that group (this returns the last row of each group)
Results:
system.time({
  dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
  dt.max[, rand := stats::runif(.N)]
  setkey(dt.max, Name, Count, rand)
  dt.max[dt.max[, .I[.N], by = .(Name)]$V1, .(Name, Category, Count)]
})
#   user  system elapsed
#  1.722   0.057   1.750

Using predicted values to make predictions in data.table

I'm trying to solve a problem in data.table which requires me to use the value just predicted in the next step of the prediction.
I have the data set up like this, with NA rows generated ready to receive the predictions. Each NA is calculated by multiplying the value preceding it by the current parameter:
library(data.table)
dt <- data.table(
  date = as.Date(paste(rep(c(2015, 2016), each = 12, times = 2), 1:12, 1, sep = "-")),
  val = c(rnorm(12, 50, 5), rep(NA, 12)),
  param1 = runif(48),
  cat = rep(c("a", "b"), each = 24)
)
I can't do it this way
dt[, {
  dt_in <- .SD
  lapply(dt_in[year(date) > 2015, date], function(d){
    dt_sub <- dt_in[date <= d]
    pred <- dt_sub[.N-1, val] * dt_sub[.N, param1]
    dt_in[date == d, val := pred]
  })
}, by = cat]
Trying to update .SD within {} gives me the '.SD is locked...' error. My current solution involves breaking the data.table into a list and updating each list item row by row:
# Create a list of data.tables, one for each category
break_list <- lapply(dt[, unique(cat)], function(c){
  dt[cat == c]
})
l_out <- lapply(break_list, function(dt_in){
  # Select the dates requiring prediction
  lapply(dt_in[year(date) > 2015, date], function(d){
    # Subset by date
    dt_sub <- dt_in[date <= d]
    # Prediction = value from the second-to-last row * parameter in the last row
    pred <- dt_sub[.N-1, val] * dt_sub[.N, param1]
    # Update data.table
    dt_in[date == d, val := pred]
  })
  dt_in
})
dt_out <- rbindlist(l_out)
This works and gives me the desired solution, but it can be slow and feels like I've broken all the data.table rules. Is there a better way?
You are looking to iteratively update rows of a data.table with values computed from rows updated in a previous iteration. It is generally better to find an explicit formulation of the problem that makes the updates independent, and that is possible in your case using a helper column holding the cumprod of param1 and a rolling join (dt[dt[...], ..., roll=TRUE]). However, since that is not always easy or possible, I will show how to do iterative updates of a data.table efficiently using data.table::set:
setkey(dt, cat, date) # sort by cat first, then by date, so the reference value for each calculation sits in the row above
val_col_nr <- which(colnames(dt) == "val") # set() requires a column number
dt[is.na(val), # we want to compute new values for val where val currently is NA
   # .I is a vector of the row numbers (in dt) of each row in .SD
   for (ii in .I) set(dt, i = ii, j = val_col_nr, value = dt[ii, param1] * dt[ii - 1L, val]),
   by = cat] # for every 'cat'
You can use identical(dt, setkey(dt_out,cat,date)) to check the result.
Please also note that it is generally a bad idea to use names of base functions (cat in your case) as variable names (even in a distinct namespace).
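For reference, here is a sketch of the non-iterative formulation mentioned at the start of this answer (without the rolling join). It assumes, as in the example data, that within each cat all NA rows come after the observed rows once sorted by date, so each prediction is simply the last observed val times the cumulative product of param1 over the NA rows:
## run on a fresh dt, i.e. before the iterative fill above
setkey(dt, cat, date)
dt[, val := {
  known <- !is.na(val)
  last_val <- val[max(which(known))]           # last observed value in this cat
  pred <- last_val * cumprod(param1[!known])   # chained param1 multipliers
  replace(val, !known, pred)
}, by = cat]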

Using R's plyr package to reorder groups within a dataframe

I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
  group     value
2     b 0.1408790
6     b 1.1450040   # 2nd b is smaller than 1st
1     c 5.7433568
3     c 2.2109819
4     d 0.5384659
5     d 4.5382979
What I would like is this.
  group     value
      b 1.1450040   # 1st b is largest
      b 0.1408790
      c 5.7433568
      c 2.2109819
      d 4.5382979
      d 0.5384659
So, what I need plyr to do is go through each group & apply something like order on the numeric data, reorganize by order, save the reordered subset of data, & put it all back together at the end.
I can process this "by hand" with a list & some loops, but it takes a long long time. Can this be done by plyr in a couple of lines?
Example data
df.sz <- 6; groups <- c("a","b","c","d")
df <- data.frame(group = sample(groups, df.sz, replace = TRUE),
                 value = runif(df.sz, 0, 10), stringsAsFactors = FALSE)
df <- df[order(df$group),]  # order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by groups, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way using base R functions to turn a list into a dataframe? A short base R sketch follows the loop code below.)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
  i.working <- which(df$group == groups.u[i])
  my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
  order.x <- order(my.list[[i]]$value, na.last = TRUE, decreasing = TRUE)
  my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally rebuild df from the list. 1st, make seed for loop
new.df <- my.list[[1]][1,]; new.df[1,] <- NA
for(i in 1:length(my.list)){
  new.df <- rbind(new.df, my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
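As an aside on the parenthetical question above, two base R one-liners cover both steps (a sketch): ordering within groups directly, and collapsing the list back into a single data frame without a seed row:
## sort by group, then by value descending within each group
df_sorted <- df[order(df$group, -df$value), ]
## rebuild one data frame from the list in a single call
new.df <- do.call(rbind, my.list)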
You could use dplyr, the successor to plyr that focuses on data frames:
library(dplyr)
arrange(df, group, desc(value))
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts the rows by the specified columns by reference (again, no copying).
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
  group = sample(groups, df.sz, replace = TRUE),
  value = runif(df.sz, 0, 10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval
#  dt1()  5.749002  5.981274  7.725225  6.270664  8.831899 67.402052   100
#  dt2()  4.956020  5.096143  5.750724  5.229124  5.663545  8.620155   100
#  dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736   100
