I have to find indices for 1MM numeric values within a vector of roughly 10MM values. I found the package fastmatch, but when I use the function fmatch(), it only returns the index of the first match.
Can someone help me use this function to find all matches, not just the first? I realize this is a basic question, but online documentation is pretty sparse and fmatch has cut down the computing time considerably.
Thanks so much!
Here is some sample data - for the purposes of this exercise, let's call this data frame A:
DateTime Address Type ID
1 2014-03-04 20:21:03 982076970 1 2752394
2 2014-03-04 20:21:07 98174238211 1 2752394
3 2014-03-04 20:21:08 76126162197 1 2752394
4 2014-03-04 20:21:16 6718053253 1 2752394
5 2014-03-04 20:21:17 98210219176 1 2752510
6 2014-03-04 20:21:20 7622877100 1 2752510
7 2014-03-04 20:21:23 2425126157 1 2752510
8 2014-03-04 20:21:23 2425126157 1 2752510
9 2014-03-04 20:21:25 701838650 1 2752394
10 2014-03-04 20:21:27 98210219176 1 2752394
What I wish to do is to find the number of unique Type values for each Address. There are several million rows of data with roughly 1MM unique Address values... on average, each Address appears about 6 times in the data set. And, though the Type values listed above are all 1, they can take any value from 0:5. I also realize the Address values are quite long, which adds to the time required for the matching.
I have tried the following:
uvals <- unique(A$Address)
utypes <- matrix(0,length(uvals),2)
utypes[,1] <- uvals
for (i in seq_along(uvals)) {
b <- which(A$Address %in% uvals[i])
c <- length(unique(A$Type[b]))
utypes[i,2] <- c
}
However, the code above is not very efficient - if I am looping over 1MM values, I estimate this will take 10-15 hours.
I have tried this, as well, within the loop... but it is not considerably faster.
b <- which(A$Address == uvals[i])
I know there is a more elegant/faster way, I am fairly new to R and would appreciate any help.
This can be done using the unique function in data.table, followed by an aggregation. I'll illustrate it using more or less the sample data generated by @Chinmay:
Create sample data:
set.seed(100L)
dat = data.frame(
address = sample(1e6L, 1e7L, TRUE),
value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
)
data.table solution:
require(data.table) ## >= 1.9.2
dat.u = unique(setDT(dat), by=c("address", "value"))
ans = dat.u[, .N, by=address]
Explanation:
The setDT function converts a data.frame to data.table by reference (which is very fast).
Calling unique on a data.table invokes the unique.data.table method, which is incredibly fast compared to base::unique. Now we have only the unique values of type for every address.
All that's left to do is to aggregate or group-by address and get the number of observations that are there in each group. The by=address part groups by address and .N is an in-built data.table variable that provides the number of observations for that group.
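To make the two steps concrete, here is a minimal base R sketch of the same idea: deduplicate the Address/Type pairs, then count the rows left per Address. The toy values are made up for illustration and are not from the question's data. Base R is used only so the snippet runs without extra packages; expect it to be much slower than data.table on 10MM rows.

```r
# Toy data (hypothetical values, not the question's data)
A <- data.frame(
  Address = c(10, 10, 20, 20, 20, 30),
  Type    = c(1, 1, 0, 3, 3, 5)
)

# Step 1: keep one row per distinct (Address, Type) pair
A_u <- unique(A[, c("Address", "Type")])

# Step 2: count the remaining rows per Address,
# i.e. the number of distinct Type values for each Address
n_types <- table(A_u$Address)
n_types
```

Address 20 appears with Type 0 and Type 3, so it is the only one with a count of 2 here.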
Benchmarks:
I'll create functions to generate data as data.table and data.frame to benchmark the data.table answer against the dplyr solution (a) proposed by @beginneR, although I don't see the need for arrange(.) there and will therefore skip that part.
## function to create data
foo <- function(type = "df") {
set.seed(100L)
dat = data.frame(
address = sample(1e6L, 1e7L, TRUE),
value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
)
if (type == "dt") setDT(dat)
dat
}
## DT function
dt_sol <- function(x) {
unique(x, by=c("address", "value"))[, .N, by=address]
}
## dplyr function
dplyr_sol <- function(x) {
distinct(x) %>% group_by(address) %>% summarise(N = n_distinct(value))
}
The timings reported here are three consecutive runs of system.time(.) on each function.
## benchmark timings in seconds
## pkg run-01 run-02 run-03 command
## data.table 2.4 2.3 2.4 system.time(ans1 <- dt_sol(foo("dt")))
## dplyr 15.3 16.3 15.7 system.time(ans2 <- dplyr_sol(foo()))
For some reason, dplyr automatically orders the result by the grouping variable. So in order to compare the results, I'll also order them in the result from data.table:
system.time(setkey(ans1, address)) ## 0.102 seconds
identical(as.data.frame(ans1), as.data.frame(ans2)) ## TRUE
So, data.table is ~6x faster here.
Note that bit64::integer64 is also supported in data.table. Since you mention the address values are quite long, you could also store them as integer64.
You can try creating an index of your 10MM values and sort that. Then looking for your 1MM values in that indexed vector should be faster.
For example, using data.table package you can do that by using setkey function which indexes given column of data.table.
require(data.table)
set.seed(100)
dat <- sample(1:1e+07, size = 1e+07, replace = T)
searchval <- sample(dat, size = 1e+06)
DT <- data.table(dat, index = seq_along(dat))
setkey(DT, dat)
DT
## dat index
## 1: 1 169458
## 2: 1 4604823
## 3: 1 7793446
## 4: 2 5372388
## 5: 3 2036622
## ---
## 9999996: 9999996 1271426
## 9999997: 9999998 530029
## 9999998: 10000000 556672
## 9999999: 10000000 6776063
## 10000000: 10000000 6949665
lookup <- data.table(val = searchval)
setkey(lookup, val)
lookup
## val
## 1: 2
## 2: 16
## 3: 24
## 4: 33
## 5: 36
## ---
## 999996: 9999970
## 999997: 9999973
## 999998: 9999988
## 999999: 9999996
## 1000000: 9999998
Now you can lookup all the values from lookup in DT by simply using
DT[lookup]
## dat index
## 1: 2 5372388
## 2: 16 537927
## 3: 16 1721233
## 4: 24 7286522
## 5: 33 7448516
## ---
## 2000298: 9999973 8008610
## 2000299: 9999988 3099060
## 2000300: 9999988 7996302
## 2000301: 9999996 1271426
## 2000302: 9999998 530029
The fmatch documentation states clearly that it only finds the first match. Given that it uses a hashing strategy under the hood, I imagine it's unlikely that it stores multiple items per key, which is one of the reasons it stays so fast (match works the same way).
Do you have many duplicate values? Perhaps you could store those in a separate place/table and create a fast index to a list of possible matches. It would be more helpful if you provided sample data representative of what you're trying to do and the code you tried to see if it would be easy to extend.
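One way to sketch the "index to a list of possible matches" idea in base R is to split the positions of the big vector by value once, then look values up by name. This is only an illustration on toy vectors (x and lookup are hypothetical stand-ins for the 10MM and 1MM vectors), and it relies on the values converting to character consistently, which I'd double-check for very long numbers.

```r
x <- c(5, 3, 5, 7, 3, 5)   # vector to search (toy stand-in)
lookup <- c(5, 7, 9)       # values whose positions we want

# Build the index once: a named list mapping each value to ALL its positions
idx <- split(seq_along(x), x)

# Look up every value; values that never occur come back as NULL
hits <- idx[as.character(lookup)]
names(hits) <- as.character(lookup)
hits
```

Unlike match() or fmatch(), each element of hits holds every position of that value, not just the first.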
If I understand your question correctly, you can also do this with dplyr:
I will include two different ways, since I am not entirely sure which is your desired output.
First create some sample data:
Address <- rep(letters, 5)
Type <- sample(1:5, size=5*26, replace=T)
A <- data.frame(Address, Type)
Then install and load dplyr
require(dplyr)
a) To find the number of different Type values for each Address value:
A %>% arrange(Address, Type) %>% group_by(Address) %>% summarize(NoOfTypes = length(unique(Type)))
b) To find all unique combinations of Address and Type:
A %>% arrange(Address, Type) %>% group_by(Address, Type) %>% filter(1:n() == 1)
I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)
Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] "character"
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; See the Introduction.
I would create your data as:
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because the order of your data has changed, but note that the answer is still correct.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
The prodlim package has a function for that:
library(prodlim)
row.match(data,Cases)
[1] 2 3 1
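If you'd rather avoid extra packages, the same row lookup can be sketched in base R by pasting each row into a single key and calling match(). The helper row_key is a name I made up for this sketch, and the "\r" separator is an assumption that this character never occurs in the data.

```r
crit1 <- c(1, 1, 2)
crit2 <- c("yes", "no", "no")
data1 <- c(1, 2, 1)
data2 <- c("no", "no", "yes")

cases_df <- data.frame(crit1, crit2, stringsAsFactors = FALSE)
data_df  <- data.frame(data1, data2, stringsAsFactors = FALSE)

# Hypothetical helper: collapse every row into one string key.
# unname() keeps column names from being read as paste() arguments.
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

# For each row of data_df, the index of the matching row in cases_df
idx <- match(row_key(data_df), row_key(cases_df))
idx
```

Rows of data_df with no counterpart in cases_df would come back as NA.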
I have a data frame, where each rows represents data for a specific category on a specific day:
set.seed(1)
k <- 10
df <- data.frame(
name = c(rep('a',k), rep('b',k)),
date = rep(seq(as.Date('2017-01-01'),as.Date('2017-01-01')+k-1, 'days'),2),
x = runif(2*k,1,20),
y = runif(2*k,100,300)
)
View(df)
Head:
head(df)
name date x y
1 a 2017-01-01 6.044665 286.9410
2 a 2017-01-02 8.070354 142.4285
3 a 2017-01-03 11.884214 230.3348
4 a 2017-01-04 18.255948 125.1110
5 a 2017-01-05 4.831957 153.4441
6 a 2017-01-06 18.069404 177.2228
Structure:
str(df)
'data.frame': 20 obs. of 4 variables:
$ name: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
$ date: Date, format: "2017-01-01" "2017-01-02" "2017-01-03" "2017-01-04" ...
$ x : num 6.04 8.07 11.88 18.26 4.83 ...
$ y : num 287 142 230 125 153 ...
I need to scale x and y variables of this data over a specific date window.
The script I came up with is the following:
library(dplyr)
library(lubridate)
df2 <- df
moving_window_days <- 4
##Iterate over each row in df
for(i in 1:nrow(df)){
df2[i,] <- df %>%
##Give me only rows for 'name' on the current row
##which are within the date window of interest
filter(date <= date(df[i,"date"]) &
date >= date(df[i,"date"]) - moving_window_days &
name == df[i,"name"]
) %>%
##Now scale x and y on this date window
mutate(x = percent_rank(x),
y = percent_rank(y)
) %>%
##Get rid of the rest of the rows - leave only the row we are looking at
filter(date == date(df[i,"date"]))
}
It works as intended (well, I initially wanted to get each observation's percentile in a moving window, but scaled values will work just fine)
The problem is that the real dataset is much larger:
'name' column has 30 local branch offices
'date' is at least a year worth of data for each branch
instead of 'x' and 'y' I have 6 variables
the moving window is 90 days
I ran this script on the real data, and out of 30,000 rows it was able to go over only 5,000 in 4 hours...
This is the first time I run into a problem like this.
I am sure my script is highly inefficient (I'm sure because I am not a pro in R. I am just assuming there is always a better way)
Any way this script can be optimized / improved?
Any way to 'purrrify' (use some of the map functions in purrr)?
Nested dataframe? nest()? Thinking this is a solution... Not sure how to implement...
Anything I can do to perhaps tackle the problem in a different manner?
One thing you can do is parallel processing. I use the future package for this. This may annoy some people, who may consider it a hack, because the future package is intended, well, for futures (or "promises" if you're a front-end developer). This approach is finicky, but works very well.
library(future)
# Create a function that iterates over each row in the df:
my_function <- function(df, x) {
x <- df
for(i in 1:nrow(df)){
x[i, ] <- df %>%
##Give me only rows for 'name' on the current row
##which are within the date window of interest
filter(date <= date(df[i,"date"]) &
date >= date(df[i,"date"]) - moving_window_days &
name == df[i,"name"]
) %>%
##Now scale x and y on this date window
mutate(x = percent_rank(x),
y = percent_rank(y)
) %>%
##Get rid of the rest of the rows - leave only the row we are looking at
filter(date == date(df[i,"date"]))
}
return(x)
}
plan(multisession) # make sure to always include this in a run of the code (multiprocess is deprecated in recent versions of future).
# Divide df evenly into three separate dataframes:
df1 %<-% my_function(df[1:7, ], df1)
df2 %<-% my_function(df = df[(8 - moving_window_days):14, ], df2) # But from here on out, go back 4 days to include that data in the moving average calculation.
df3 %<-% my_function(df = df[(15 - moving_window_days):20, ], df3)
# See if your computer is able to split df into 4 or 5 separate dataframes.
# Now bind the dataframes together, but get the indexing right:
rbind(df1, df2[(nrow(df2) - 6):nrow(df2), ], df3[(nrow(df3) - 5):nrow(df3), ])
Parallel processing is one of many ways to optimize code for efficiency. This exact technique has substantially sped up code for me in the past. It has reduced the run time of a program from a day and a half, down to 3 or 4 hours!
Now, ideally, we'd like to work with the apply family and matrices. This answer is just one of many ways we can speed up code. Also, the future package allows us to parallel process without learning a new looping structure, such as in the parallel package (which, nonetheless, is still an amazing package).
Also check out the Rcpp package. It'll take some time to learn, but is incredible for unlocking the speed of C++.
zoo::rollapply can be quite fast.
df2 <- df %>%
group_by(name) %>%
mutate(x2 = zoo::rollapply(x, width = 4, FUN = percent_rank, fill = "extend")[,1],
y2 = zoo::rollapply(y, width = 4, FUN = percent_rank, fill = "extend")[,1])
Each call to rollapply generates a matrix with n=width columns. The first column is the value of the function for the window beginning with that observation, while the nth column is the value of the function for the window ending with that observation. You can change the [,1] to whichever column you want (the percentile in the middle of the window? at the end? at the beginning?).
The argument fill = "extend" duplicates the observations at the beginning or end of windows, since for the last n-k observations there are k missings from the window.
I expanded your dataset to a dummy of 28,496 rows, covering 26 names and 3 years of data, and ran this snippet with a width of 90 days. On my 4 year old desktop this took less than a minute for two variables:
user system elapsed
37.66 0.01 37.77
You could certainly use purrr::map2 to iterate over 6 variables (instead of calling rollapply 6 times in mutate), but I'm not sure it would speed it up substantially.
@OP: you should be CAUTIOUS with the answers provided
--My original answer--
library(tidyverse)
I first split the data frame into a list of data frames grouped by name
split.df <- split(df, df$name)
Using the split data, use lapply and map_df to iterate over rows of each grouped df, filter for dates between relevant window of time using between, then mutate as you did before, and then filter for the relevant row again (I tried to 'copy' your code as closely as possible):
new <- lapply(split.df, function(z) map_df(1:nrow(z), ~z %>%
filter(between(date, z$date[.x]-moving_window_days, z$date[.x])) %>%
mutate(x=percent_rank(x),y=percent_rank(y)) %>%
filter(date==z$date[.x])))
This results in a list. To convert back to a single data frame
final <- Reduce("rbind",new)
Output (head)
name date x y
1 a 2017-01-01 0.0000000 0.00
2 a 2017-01-02 1.0000000 0.00
3 a 2017-01-03 1.0000000 0.50
4 a 2017-01-04 1.0000000 0.00
Let's make sure my result matches that of yours.
identical(final$x, OP.output$x)
[1] TRUE
--END of my original answer--
----------------------------COMPARING SOLUTIONS----------------------------
--@Brian's answer--
@Brian's answer does not give the result you expect. You said your function works as intended, so let's compare Brian's result with yours. The first table shows Brian's result; the second shows yours.
name date x y x2 y2
1 a 2017-01-01 6.044665 286.9410 0.0000000 1.0000000
2 a 2017-01-02 8.070354 142.4285 0.0000000 1.0000000
3 a 2017-01-03 11.884214 230.3348 0.3333333 0.3333333
4 a 2017-01-04 18.255948 125.1110 0.3333333 1.0000000
name date x y
1 a 2017-01-01 0.0000000 0.00
2 a 2017-01-02 1.0000000 0.00
3 a 2017-01-03 1.0000000 0.50
4 a 2017-01-04 1.0000000 0.00
identical(Brian.output$x2, OP.output$x)
[1] FALSE
--END @Brian's answer--
--@Odysseus's answer--
@Odysseus's answer returns the right result since it uses your same function, but you have to split the data frame manually. See his code below, which calls my_function:
df1 %<-% my_function(df[1:7, ], df1)
df2 %<-% my_function(df = df[(8 - moving_window_days):14, ], df2) # But from here on out, go back 4 days to include that data in the moving average calculation.
df3 %<-% my_function(df = df[(15 - moving_window_days):20, ], df3)
--END @Odysseus's answer--
You're likely to get the best performance gain from @Odysseus's answer, but you'll need to benchmark it yourself since it will depend on the number of cores you have. Parallelization is not always faster than a vectorized operation. You'll also need to extend his solution to the rest of your data frame.
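For reference, here is a rough base R sketch of the trailing-window percent rank that the question's loop computes, without any data frame filtering inside the loop. It assumes one row per day with no gaps and rows already sorted by date within each name, so a window of 4 days means "this row plus up to 4 earlier rows". The function name trailing_percent_rank is mine, and the rank is computed as (number of window values strictly below the current one) / (window size - 1), which gives NaN when the window holds a single value.

```r
# Hypothetical helper: percent rank of each x[i] within its trailing window
trailing_percent_rank <- function(x, window) {
  vapply(seq_along(x), function(i) {
    w <- x[max(1L, i - window):i]      # current value plus up to `window` earlier ones
    sum(w < x[i]) / (length(w) - 1)    # 0/0 = NaN for a single-value window
  }, numeric(1))
}

# First five x values of name 'a' from the question, rounded
x <- c(6.04, 8.07, 11.88, 18.26, 4.83)
res <- trailing_percent_rank(x, window = 4)
res
```

To apply it per branch, something like ave(df$x, df$name, FUN = function(v) trailing_percent_rank(v, 4)) should work under the same assumptions, but I haven't benchmarked it against the other answers.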
We have a very large data frame df that can be split by factors. On each subset of the data frame created by this split, we need to perform an operation to increase the number of rows of that subset until it's a certain length. Afterwards, we rbind the subsets to get a bigger version of df.
Is there a way of doing this quickly without using an inner function?
Let's say our subset operation (in a separate .R file) is:
foo <- function(df) { magic }
We've come up with a few ways of doing this:
1)
df <- split(df, factor)
df <- lapply(df, foo)
rbindlist(df)
2)
assign('list.df', list(), envir=.GlobalEnv)
assign('i', 1, envir=.GlobalEnv)
dplyr::group_by(df, factor)
dplyr::mutate(df, foo.list(df.col))
df <- rbindlist(list.df)
rm('list.df', envir=.GlobalEnv)
rm('i', envir=.GlobalEnv)
(In a separate file)
foo.list <- function(df.cols) {
magic;
list.df[[i]] <<- magic.df
i <<- i + 1
return(dummy)
}
The issue with the first approach is time. The lapply simply takes too long to really be desirable (on the order of an hour with our data set).
The issue with the second approach is the extremely undesirable side-effect of tampering with the user's global environment. It's significantly faster, but this is something we'd rather avoid if we can.
We've also tried passing in the list and count variables and then trying to substitute them with the variables in the parent environment (A sort of hack to get around R's lack of pass-by-reference).
We've looked at a number of possibly relevant SO questions (R applying a function to a subset of a data frame, Calculations on subsets of a data frame, R: Pass by reference, etc.) but none of them dealt with our question too well.
If you want to run code, here's something you can copy and paste:
x <- runif(n=10, min=0, max=3)
y <- sample(x=10, replace=FALSE)
factors <- runif(n=10, min=0, max=2)
factors <- floor(factors)
df <- data.frame(factors, x, y)
df now looks like this (length 10):
## We group by factor, then run foo on the groups.
foo <- function(df.subset) {
min <- min(df.subset$y)
max <- max(df.subset$y)
## We fill out df.subset to have everything between the min and
## max values of y. Then we assign the old values of df.subset
## to the corresponding spots.
df.fill <- data.frame(x=rep(0, max-min+1),
y=min:max,
factors=rep(df.subset$factors[1], max-min+1))
df.fill$x[which(df.subset$y %in%(min:max))] <- df.subset$x
df.fill
}
So I can take my sample code in the first approach to build a new df (length 18):
Using data.table this doesn't take long due to speedy functionality. If you can, rewrite your function to work with specific variables. The split-apply-combine processing may get a performance boost:
library(data.table)
system.time(
df2 <- setDT(df)[,foo(df), factors]
)
# user system elapsed
# 1.63 0.39 2.03
Another variation using data.table.. First get the min(y):max(y) part and then join+update:
require(data.table)
ans = setDT(df)[, .(x=0, y=min(y):max(y)), by=factors
][df, x := i.x, on=c("factors", "y")][]
ans
# factors x y
# 1: 0 1.25104362 1
# 2: 0 0.16729068 2
# 3: 0 0.00000000 3
# 4: 0 0.02533907 4
# 5: 0 0.00000000 5
# 6: 0 0.00000000 6
# 7: 0 1.80547980 7
# 8: 1 0.34043937 3
# 9: 1 0.00000000 4
# 10: 1 1.51742163 5
# 11: 1 0.15709287 6
# 12: 1 0.00000000 7
# 13: 1 1.26282241 8
# 14: 1 2.88292354 9
# 15: 1 1.78573288 10
Pierre and Roland have already provided nice solutions.
If you need scalability not only in time but also in memory, you can spread the data across a number of remote R instances.
The most basic setup requires only Rserve/RSclient, so no non-CRAN dependencies.
Spread data across R instances
For easier reproducibility, the example below starts two R instances on a single localhost machine. For real scalability you need to start Rserve nodes on remote machines.
# start R nodes
library(Rserve)
port = 6311:6312
invisible(sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save"))))
# populate data
set.seed(123)
x = runif(n=5e6,min=0, max=3)
y = sample(x=5e6,replace=FALSE)
factors = runif(n=5e6, min=0, max=2)
factors = floor(factors)
df = data.frame(factors, x, y)
# connect Rserve nodes
library(RSclient)
rscl = sapply(port, function(port) RS.connect(port = port))
# assign chunks to R nodes
sapply(seq_along(rscl), function(i) RS.assign(rscl[[i]], name = "x", value = df[df$factors == (i-1),]))
# assign magic function to R nodes
foo = function(df) df
sapply(rscl, RS.assign, name = "foo", value = foo)
All processes on remote machines can be run in parallel (using wait=FALSE and RS.collect), which further reduces computing time.
Using lapply + RS.eval
# sequentially
l = lapply(rscl, RS.eval, foo(x))
rbindlist(l)
# in parallel
invisible(sapply(rscl, RS.eval, foo(x), wait=FALSE))
l = lapply(rscl, RS.collect)
rbindlist(l)
Using big.data.table::rscl.*
The big.data.table package provides a few wrappers around the RSclient::RS.* functions that let them accept a list of connections to R nodes.
They don't use data.table in any way, so they can be applied to a data.frame, a vector, or any R type that can be split into chunks. The example below uses a basic data.frame.
library(big.data.table)
# sequentially
l = rscl.eval(rscl, foo(x), simplify=FALSE)
rbindlist(l)
# in parallel
invisible(rscl.eval(rscl, foo(x), wait=FALSE))
l = rscl.collect(rscl, simplify=FALSE)
rbindlist(l)
Using big.data.table
This example requires the data on the nodes to be stored as data.tables, but gives a convenient API and a lot of other features.
library(big.data.table)
rscl.require(rscl, "data.table")
rscl.eval(rscl, is.data.table(setDT(x))) # is.data.table to suppress collection of `setDT` results
bdt = big.data.table(rscl = rscl)
# in parallel by default
bdt[, foo(.SD), factors]
# considering we have data partitioned using `factors` field, the `by` is redundant in that case
bdt[, foo(.SD)]
# optionally use `[[` to access R nodes environment directly
bdt[[expr = foo(x)]]
Clean workspace
# disconnect
rscl.close(rscl)
# shutdown nodes started from R
l = lapply(setNames(nm = port), function(port) tryCatch(RSconnect(port = port), error = function(e) e, warning = function(w) w))
invisible(lapply(l, function(rsc) if(inherits(rsc, "sockconn")) RSshutdown(rsc)))
I don't think your function works as intended. It relies on y being ordered.
Try using a data.table join with grouping:
library(data.table)
setDT(df)
df2 <- df[, .SD[data.table(y=seq(.SD[, min(y)], .SD[, max(y)], by = 1)), .SD,
on = "y"], #data.table join
by = factors] #grouping
df2[is.na(x), x:= 0]
setkey(df2, factors, y, x)
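To illustrate the ordering issue mentioned above without data.table, here is a small base R sketch of a fill function that places each x by the position of its y in the full min:max range (via match), so it does not depend on the rows being sorted. The name fill_range and the toy values are mine, not from the question.

```r
# Hypothetical replacement for foo(): fill y out to min(y):max(y),
# putting each original x at the position of its own y
fill_range <- function(sub) {
  ys  <- seq(min(sub$y), max(sub$y))
  out <- data.frame(factors = sub$factors[1], x = 0, y = ys)
  out$x[match(sub$y, ys)] <- sub$x   # order-independent placement
  out
}

# Toy data with deliberately unsorted y
df <- data.frame(
  factors = c(0, 0, 0, 1, 1),
  x       = c(0.5, 0.2, 0.9, 0.3, 0.7),
  y       = c(4, 1, 2, 6, 5)
)

# Split-apply-combine in base R
filled <- do.call(rbind, lapply(split(df, df$factors), fill_range))
filled
```

This is the slow lapply pattern from the question, so it is only meant to show the intended result, not to compete with the join above on speed.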
I have a large data.table that I am collapsing to the month level using ,by.
There are 5 by vars, with # of levels: c(4,3,106,3,1380). The 106 is months, the 1380 is a geographic unit. As it turns out there are some 0's, in that some cells have no values. by drops these, but I'd like it to keep them.
Reproducible example:
require(data.table)
set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are
Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?
I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:
keycols <- c("g1", "g2", "g3") ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, lapply(dat[, ..keycols], unique)) ## CJ() to form index
datCollapsed <- dat[ii, list(nv=.N), by=.EACHI] ## Aggregate (by=.EACHI is required since data.table 1.9.4)
## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
# 0 1 2 3 4 5 6
# 135 191 162 82 39 13 3
This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument:
Advanced: Aggregation for a subset of known groups is
particularly efficient when passing those groups in i. When
i is a data.table, DT[i,j] evaluates j for each row
of i. We call this by without by or grouping by i.
Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
Make a cartesian join of the unique values, and use that to join back to your results
dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys]) # effectively a left join of datCollapsed onto dat.keys
# [1] 625
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
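As a quick sanity check on small data, base R's table() already counts every combination of the grouping columns, including the empty ones, so converting it to a data frame gives the zero-filled result directly. This is only a sketch on a made-up two-column example (d and its values are hypothetical); I would not expect it to beat the keyed joins above on a table with millions of cells.

```r
# Toy data: the combination (b, y) never occurs
d <- data.frame(
  g1 = c("a", "a", "b"),
  g2 = c("x", "y", "x")
)

# table() enumerates all g1 x g2 cells; empty cells get a count of 0
counts <- as.data.frame(table(d), responseName = "nv")
counts
```

The result has one row per cell of the full cross product, with nv equal to 0 for the missing combination.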
I'd like to add a new column to my data.table, which contains data from one of the other columns. The choice of column, however, varies per row - depending on the contents of another column. So:
for the data set:
a_data b_data column_choice
[1,] 55 1 a
[2,] 56 2 a
[3,] 57 3 b
generated by:
dat=data.table(a_data = c(55, 56, 57),
b_data = c(1, 2, 3),
column_choice = c("a", "a", "b"))
I'd like a new column, 'chosen', which contains (per row) either the data from "a_data" or "b_data", depending on the value of "column_choice". The resulting data table will therefore be:
a_data b_data column_choice chosen
[1,] 55 1 a 55
[2,] 56 2 a 56
[3,] 57 3 b 3
I have managed to get the desired effect using:
dat=dat[, data.table(.SD, chosen=.SD[[paste0(.SD$column_choice, "_data")]]),
by=1:nrow(dat)]
dat$nrow = NULL
however this feels quite clunky; perhaps there's a simpler way to do it (that will no doubt also teach me something about R)?
In practice, the data frame also has lots of other columns that need to be preserved, more choices than just 'a or b', and several of these types of column to generate, so I'd rather not use the basic ifelse solution that may be appropriate for the basic example above.
Thank you very much for your help.
I think I've now found a properly vectorised one liner, that's also faster than the other answers in this case.
petesFun2 uses data.table aggregation as petesFun, however now vectorised across column_choice (rather than per item, as previously).
While petesFun2 is fine for my purposes, it does leave both the rows and columns in a different order. In the interests of comparison with the other answers, therefore, I've added petesFun2Clean which maintains the same ordering as the other answers.
petesFun2 <-function(myDat) {
return(myDat[, cbind(.SD, chosen=.SD[[paste0(.BY$column_choice, "_data")]]),
by=column_choice])
}
petesFun2Clean <-function(myDat) {
myDat = copy(myDat) # To prevent reference issues
myDat[, id := seq_len(nrow(myDat))] # Assign an id
result = myDat[, cbind(.SD, chosen=.SD[[.BY$choice]]),
by=list(column_choice, choice=paste0(column_choice, "_data"))]
# recover ordering and column order.
return(result[order(id),
list(a_data, b_data, c_data, column_choice, chosen)])
}
benchmark(benRes<- myFun(test.dat),
petesRes<- petesFun(test.dat),
dowleRes<- dowleFun(test.dat),
petesRes2<-petesFun2(test.dat),
petesRes2Clean<- petesFun2Clean(test.dat),
replications=25,
columns=c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1 benRes <- myFun(test.dat) 25 0.337 4.160494
# 3 dowleRes <- dowleFun(test.dat) 25 0.191 2.358025
# 5 petesRes2Clean <- petesFun2Clean(test.dat) 25 0.122 1.506173
# 4 petesRes2 <- petesFun2(test.dat) 25 0.081 1.000000
# 2 petesRes <- petesFun(test.dat) 25 4.018 49.604938
identical(petesRes2, benRes)
# FALSE (due to row and column ordering)
identical(petesRes2Clean, benRes)
# TRUE
EDIT: I just noticed (as mentioned by Matthew in the comments) that we now have by group :=. So we can drop the cbind and simply do:
myDat[, chosen := .SD[[paste0(.BY$column_choice, "_data")]],
by=column_choice]
When I think of clunky, things like old bicycles or old cars come to mind, but also doing things in R by iterating over rows. So the below turned out to look clunkier than what you posted in your question, but it goes after a solution in what I think is a more vectorized way. The following appears to be about 10 times faster than the sleeker code you posted above, and returns identical results.
This suggestion relies on the reshape2 package:
library(data.table)
library(reshape2)
I've added "c" as a possible column_choice to make things a bit more interesting:
dat=data.table(a_data = c(55,56,57,65),
b_data = c(1,2,3,4),c_data=c(1000,1001,1002,1003),
column_choice = c("a", "c", "a", "b"))
Below are the steps, wrapped in a function to prepare them for benchmarking.
myFun<-function(myDat){
# convert data.table to data.frame for melt()ing
dat1<-data.frame(myDat)
# add ID variable to keep track of things
dat1$ID<-seq_len(nrow(dat1))
# melt data - because of this line, it's important to only
# pass those variables that are used to select the appropriate value
# i.e., a_data,b_data,c_data,column_choice
dat2<-melt(dat1,id.vars=c("ID","column_choice"))
# Determine which value to choose: a, b, or c
dat2$chosen<-as.numeric(dat2$column_choice==substr(dat2$variable,
1,1))*dat2$value
# cast the data back into the original form
dat_cast<-dcast(dat2,ID+column_choice~.,
fun.aggregate=sum,value.var="chosen")
# rename the last variable
names(dat_cast)[ncol(dat_cast)]<-"chosen"
# merge data back together and return results as a data.table
datOUT<-merge(dat1,dat_cast,by=c("ID","column_choice"),sort=FALSE)
return(data.table(datOUT[,c(names(myDat),"chosen")]))
}
Here is your solution packaged into a function:
petesFun<-function(myDat){
datOUT=myDat[, data.table(.SD,
chosen=.SD[[paste0(.SD$column_choice, "_data")]]),
by=1:nrow(myDat)]
datOUT$nrow = NULL
return(datOUT)
}
This looks much more elegant than myFun. The benchmarking results show a large difference, however:
Make a larger data.table:
test.df<-data.frame(lapply(dat,rep,100))
test.dat<-data.table(test.df)
and benchmark:
library(rbenchmark)
benchmark(myRes<-myFun(test.dat),petesRes<-petesFun(test.dat),
replications=25,columns=c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1 myRes <- myFun(test.dat) 25 0.412 1.00000
# 2 petesRes <- petesFun(test.dat) 25 5.429 13.17718
identical(myRes,petesRes)
# [1] TRUE
I propose that "clunky" can be interpreted in different ways :)
We're starting to use for loops more and more for this kind of task with data.table. Building on Ben's answer and using his benchmark, how about the following :
dowleFun = function(DT) {
DT = copy(DT) # Faster to remove this line to add column by reference, but
# included copy() because benchmark repeats test 25 times and
# the other tests use the same input table
w = match(paste0(DT$column_choice,"_data"),names(DT))
DT[,chosen:=NA_real_] # allocate new column (or clear it if already exists)
j = match("chosen",names(DT))
for (i in 1:nrow(DT))
set(DT,i,j,DT[[w[i]]][i])
DT
}
benchmark(benRes<-myFun(test.dat),
petesRes<-petesFun(test.dat),
dowleRes<-dowleFun(test.dat),
replications=25,columns=c("test", "replications", "elapsed", "relative"),
order="elapsed")
# test replications elapsed relative
# 3 dowleRes <- dowleFun(test.dat) 25 0.30 1.0
# 1 benRes <- myFun(test.dat) 25 0.39 1.3
# 2 petesRes <- petesFun(test.dat) 25 5.79 19.3
If you can remove the copy() then it should be faster and scale better to larger datasets. To test that, perhaps create a very large table and time how long a single run takes.
In this case a simple for loop can be easier to follow.
Having said that, if i could be a 2-column matrix, then A[B] syntax in base could be used (where B contains the row and column positions to select) and it's a one liner :
DT[,chosen:=DT[cbind(1:nrow(DT),paste0(column_choice,"_data"))]]
At the moment you get this :
> DT[cbind(1:3,c(4,4,5))]
Error in `[.data.table`(test.dat, cbind(1:3, c(4, 4, 5))) :
i is invalid type (matrix). Perhaps in future a 2 column matrix could return
a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let
maintainer('data.table') know if you'd like this, or add your comments to
FR #1611.
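Until something like that exists, the base A[B] idea can be sketched by going through a plain matrix of the candidate columns. This is only an illustration on the question's three-row example, using a data.frame; it assumes all the *_data columns share one type (numeric here), since as.matrix would otherwise coerce them.

```r
dat <- data.frame(
  a_data = c(55, 56, 57),
  b_data = c(1, 2, 3),
  column_choice = c("a", "a", "b"),
  stringsAsFactors = FALSE
)

# Matrix holding only the columns we may choose from
m <- as.matrix(dat[, c("a_data", "b_data")])

# Column position to pick in each row
col_idx <- match(paste0(dat$column_choice, "_data"), colnames(m))

# Two-column (row, column) index matrix: one element picked per row
dat$chosen <- m[cbind(seq_len(nrow(m)), col_idx)]
dat
```

With more choices than 'a' and 'b', only the vector of column names passed to as.matrix needs to grow.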