Finding number of unique values (quickly) across a large data.table - r

I have a 1.5Mx7 data.table that I need to process. The code I have written runs very slowly (0.18 s per row, an estimated 75 hours to complete), and I'm hoping I can optimize it.
I'll put the pseudo-example code at the end, because it's long.
str(review)
Classes ‘data.table’ and 'data.frame': 1500000 obs. of 7 variables:
$ user_id : Factor w/ 375000 levels "aA1aJ9lJ1lB5yH5uR6jR7",..: 275929 313114 99332 277686 57473 31780 236964 44371 210127 217770 ...
$ stars : int 2 1 3 3 1 1 2 1 2 2 ...
$ business_id : Factor w/ 60000 levels "aA1kR2bK6nH8yQ9gU2uI9",..: 40806 29885 43018 58297 58444 31626 26018 2493 37883 34204 ...
$ votes.funny : int 3 0 0 7 2 9 6 8 2 7 ...
$ votes.useful: int 4 1 0 5 9 2 4 7 4 9 ...
$ votes.cool : int 5 3 6 8 3 2 0 8 10 9 ...
$ IDate : IDate, format: "2012-01-01" "2012-01-01" "2012-01-01" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "IDate"
I need to subset the dataset by date, and then compute several columns by business_id.
setkey(review, IDate)
system.time(
  review[
    #(IDate >= window.start) & (IDate <= window.end),
    1:10,
    .SD,
    keyby = business_id
  ][
    ,
    list(
      review.num = .N,
      review.users = length(unique(user_id)),
      review.stars = mean(stars),
      review.votes.funny = sum(votes.funny),
      review.votes.useful = sum(votes.useful),
      review.votes.cool = sum(votes.cool)
    ),
    by = business_id
  ]
)
user system elapsed
1.534 0.000 1.534
Timing for smaller versions of the example dataset is
# 1% of original size - 15000 rows
user system elapsed
0.02 0.00 0.02
# 10% of original size - 150000 rows
user system elapsed
0.25 0.00 0.25
So, even though I'm only processing 10 rows, the time increases with the size of the original dataset.
I tried commenting out the review.users variable above, and the computation time on the original dataset fell tremendously:
user system elapsed
0 0 0
So, my challenge is making unique() work more quickly.
I need to count the unique values in user_id for each grouping of business_id.
Not sure what else to specify, but I'm happy to answer questions.
Here is some code to create a pseudo-example dataset. I'm not sure exactly what causes the slowdown, so I've tried to recreate the data as faithfully as possible, but because generating the random IDs takes so long I've reduced the size by ~90%.
z <- c()
x <- c()
for (i in 1:6000) {
  z <- c(z, paste0(
    letters[floor(runif(7, min = 1, max = 26))],
    LETTERS[floor(runif(7, min = 1, max = 26))],
    floor(runif(7, min = 1, max = 10)),
    collapse = ""
  ))
}
z <- rep(z, 25)
for (i in 1:37500) {
  x <- c(x, paste0(
    letters[floor(runif(7, min = 1, max = 26))],
    LETTERS[floor(runif(7, min = 1, max = 26))],
    floor(runif(7, min = 1, max = 10)),
    collapse = ""
  ))
}
x <- rep(x, 4)
review2 <- data.table(
  user_id = factor(x),
  stars = as.integer(round(runif(150000) * 5, digits = 0)),
  business_id = factor(z),
  votes.funny = as.integer(round(runif(150000) * 10, digits = 0)),
  votes.useful = as.integer(round(runif(150000) * 10, digits = 0)),
  votes.cool = as.integer(round(runif(150000) * 10, digits = 0)),
  IDate = rep(as.IDate("2012-01-01"), 150000)
)
setkey(review2, IDate)

How about this: an alternative to unique() that builds an extra data.table inside an anonymous function:
review2[, {
  uid <- data.table(user_id)
  rev_user <- uid[, .N, by = user_id][, .N]
  #browser()
  list(
    review.num = .N,
    review.users = rev_user,
    review.stars = mean(stars),
    review.votes.funny = sum(votes.funny),
    review.votes.useful = sum(votes.useful),
    review.votes.cool = sum(votes.cool)
  )
}, by = business_id]

It seems that length(unique()) becomes inefficient at counting the distinct values of a factor as the number of levels grows very large.
Using uniqueN() instead (thanks @Frank):
user system elapsed
0.12 0.00 0.12
Using set(review, NULL, "user_id", as.character(review$user_id)) and length(unique()):
user system elapsed
0.11 0.00 0.11
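For reference, a minimal sketch of the grouped summary with uniqueN() in place of length(unique()) (using the column names from the question; uniqueN() ships with data.table):
review[
  ,
  list(
    review.num = .N,
    review.users = uniqueN(user_id),
    review.stars = mean(stars),
    review.votes.funny = sum(votes.funny),
    review.votes.useful = sum(votes.useful),
    review.votes.cool = sum(votes.cool)
  ),
  by = business_id
]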

Related

Combine information from two data frames with dplyr

I need some help with dplyr.
I have two data frames - one huge, with several time series A,B,... in there (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).
I would like to add another column to LargeDF, labeled leftBoundary, containing the appropriate boundary value, like so:
LargeDF
ts timestamp signal # left_boundary
1 A 0.3209338 10.43279 # 0
2 A 1.4791524 10.34295 # 1
3 A 2.6007494 10.71601 # 2
and
Categories
ts left right
1 A 0 1
2 A 1 2
3 A 2 3
The code I came up with is
LargeDF %>%
  group_by(ts) %>%
  do(myFUN(., Categories))
# ... which calls this:
myFUN <- function(Large, Categ) {
  CategTS <- Categ %>%
    filter(ts == Large[1, "ts"][[1]])
  Large %>%
    group_by(timestamp) %>%  # this is bothering me...
    mutate(left_boundary = CategTS$left[CategTS$left < timestamp
                                        & timestamp < CategTS$right])
}
but it is super slow for large time series. I would really like to lose the group_by(timestamp), since timestamps are unique within each ts anyway.
Does someone see a better solution? That would be much appreciated.
# Code for making the example data frames ...
library("dplyr")
n <- 10; series <- c("A", "B", "C")
LargeDF <- data.frame(
ts = rep(series, each = n)
, timestamp = runif(n*length(series), max = 4)
, signal = runif(n*length(series), min = 10, max = 11)
) %>% group_by(ts) %>% arrange(timestamp)
m <- 7
Categories <- data.frame(
ts = rep(series, each = m)
, left = rep(seq(1 : m) - 1, length(series))
, right = rep(seq(1 : m), length(series))
)
Update (data.table and my slightly modified mockup)
So, I tried the suggestions from @DavidArenburg on a quick-and-dirty mockup example first, but had the problem that some timestamps were binned twice (into successive categories/intervals).
> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
left right value timestamp timestamp2
1: 0.9 1.9 0.1885459 1 1
2: 0.9 1.9 0.0542375 2 2 # binned here
3: 1.9 2.9 0.0542375 2 2 # and here as well
13: 19.9 25.9 0.4579986 20 20
I then read that minoverlap = 1L is the default and realized that a normal timestamp is >> 1.
> as.numeric(Sys.time())
[1] 1429022267
Therefore, if I shifted everything to larger values (e.g. n <- 10 in the example below), everything went fine.
left right value timestamp timestamp2
1: 9 19 0.64971126 10 10
2: 19 29 0.75994751 20 20
3: 29 99 0.98276462 30 30
9: 199 259 0.89816165 200 200
With my real data, everything went smoothly, so thanks again.
## Code for my data.table example -----
n <- 1  # n <- 10 scales the timestamps up and avoids the double-binning shown above
d <- data.table(value = runif(9),
                timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n,
                timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n)
c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n,
                right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n)
setkey(c, left, right)
foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))
Update 2 (JOIN, then FILTER, within dplyr)
I tested the suggestion from @aosmith to use the dplyr function left_join() to create one (very) large data frame and then filter() it again. Very quickly, I ran into memory issues:
Error: std::bad_alloc
This approach would probably work well for smaller tables, as the syntax is very nice (but that, again, is personal preference). I'll go with the data.table solution in this case. Thanks again for all the suggestions.
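For reference, the join-then-filter idea looked roughly like this (a sketch only, using the column names from the example data; this is the variant that ran out of memory on the full data, because the join materialises every timestamp crossed with every interval of its ts):
library(dplyr)
LargeDF %>%
  left_join(Categories, by = "ts") %>%              # every timestamp x every interval of its ts
  filter(left < timestamp, timestamp < right) %>%   # keep only the enclosing interval
  mutate(left_boundary = left) %>%
  select(-left, -right)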
dplyr isn't well suited to this kind of operation; try data.table's foverlaps() function instead:
library(data.table)
class(LargeDF) <- "data.frame" ## Removing all the dplyr classes
setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table
setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work)
LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary
# ts timestamp signal left right left_boundary
# 1: A 0.46771516 10.72175 0.46771516 0.46771516 0
# 2: A 0.58841492 10.35459 0.58841492 0.58841492 0
# 3: A 1.14494484 10.50301 1.14494484 1.14494484 1
# 4: A 1.18298225 10.82431 1.18298225 1.18298225 1
# 5: A 1.69822678 10.04780 1.69822678 1.69822678 1
# 6: A 1.83189609 10.75001 1.83189609 1.83189609 1
# 7: A 1.90947475 10.94715 1.90947475 1.90947475 1
# 8: A 2.73305266 10.14449 2.73305266 2.73305266 2
# 9: A 3.02371968 10.17724 3.02371968 3.02371968 3
# ...

Creating averaged time-bins from an existing dataframe

I have the following dataframe called 'EasyScaled';
str(EasyScaled)
'data.frame': 675045 obs. of 3 variables:
$ Trial : chr "1_easy.wav" "1_easy.wav" "1_easy.wav" "1_easy.wav" ...
$ TrialTime : num 3000 3001 3002 3003 3004 ...
$ PupilBaseCorrect: num 0.784 0.781 0.78 0.778 0.777 ...
The 'TrialTime' numeric variable denotes the time of each data point (3000 = 3000ms, 3001 = 3001 ms, etc.), 'PupilBaseCorrect' is my dependent variable, and the 'Trial' variable refers to the experimental trial.
I would like to create a new object which firstly divides my data into 3 time-bins (TimeBin1 = 3000-8000ms, TimeBin2 = 8001-13000ms, TimeBin3 = 13001 - 18000ms) and then calculate an average value for each timebin (for each trial) so that I would end up with something that looks like this (with the value given reflecting 'PupilBaseCorrect');
Trial TimeBin1 TimeBin2 TimeBin3
1_easy 0.784 0.876 0.767
34_easy 0.781 0.872 0.765
35_easy 0.78 0.871 0.762
...etc ...etc ...etc ....etc
I have tried using cut(), ddply() and some of the suggestions on this blog http://lamages.blogspot.co.uk/2012/01/say-it-in-r-with-by-apply-and-friends.html but haven't been able to find the correct code. I also tried this;
EasyTimeBin <- aggregate(PupilBaseCorrect ~ Trial + TrialTime[3000:8000, 8001:1300,1301:1800], data=EasyScaled, mean)
But got the following error;
Error in TrialTime[3000:8000, 8001:1300, 1301:1800] :
incorrect number of dimensions
Any suggestions or advice would be much appreciated.
cut() and ddply() are the right idea, but here's some vanilla-R chicken scratch that will do what you need.
# Generate example data
EasyScaled <- data.frame(
  Trial = paste0(c(sapply(1:3, function(x) rep(x, 9))), "_easy.wav"),
  TrialTime = c(sapply(seq_len(9)-1, function(x) (floor(x/3))*5000 + x%%3 + 3000)),
  PupilBaseCorrect = rnorm(27, 0.78, 0.1)
)
# group means of PupilBaseCorrect by Trial + filename
tmp <- tapply(EasyScaled$PupilBaseCorrect,
              paste0(EasyScaled$Trial, ',',
                     as.integer((EasyScaled$TrialTime - 3000)/5000)+1), mean)
# melt & recast the array manually into a dataframe
EasyTimeBin <- do.call(data.frame,
  append(
    list(row.names = NULL,
         Trial = gsub('.wav,.*', '', names(tmp)[3*seq_len(length(tmp)/3)])),
    structure(
      lapply(seq_len(3),
             function(x) tmp[3*(seq_len(length(tmp)/3)-1) + x]),
      .Names = paste0("TimeBin", seq_len(3))
    )
  )
)
print(EasyTimeBin)
# Trial TimeBin1 TimeBin2 TimeBin3
# 1 1_easy 0.7471973 0.7850524 0.8939581
# 2 2_easy 0.8096973 0.8390587 0.7757359
# 3 3_easy 0.8151430 0.7855042 0.8081268
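For comparison, a sketch of the cut()-based route the question alludes to (the breaks follow the bin boundaries given in the question; the reshape step is just one way to get the wide layout):
EasyScaled$TimeBin <- cut(EasyScaled$TrialTime,
                          breaks = c(3000, 8000, 13000, 18000),
                          labels = paste0("TimeBin", 1:3),
                          include.lowest = TRUE)
binned <- aggregate(PupilBaseCorrect ~ Trial + TimeBin, data = EasyScaled, mean)
EasyTimeBin <- reshape(binned, idvar = "Trial", timevar = "TimeBin", direction = "wide")
The wide columns come out named PupilBaseCorrect.TimeBin1 and so on, so a final rename may be wanted.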

fast subsetting in data.table in R

Given a data.table, I would like to subset the items in there quickly. For example:
dt = data.table(a=1:10, key="a")
dt[a > 3 & a <= 7]
This is still pretty slow. I know I can do joins to get individual rows, but is there a way to exploit the fact that the data.table is sorted to get quick subsets of this kind?
This is what I'm doing:
dt1 = data.table(id = 1, ym = c(199001, 199006, 199009, 199012), last_ym = c(NA, 199001, 199006, 199009), v = 1:4, key=c("id", "ym"))
dt2 = data.table(id = 1, ym = c(199001, 199002, 199003, 199004, 199005, 199006, 199007, 199008, 199009, 199010, 199011, 199012), v2 = 1:12, key=c("id","ym"))
For each id (here there is only one) and ym in dt1, I would like to sum the values of v2 between the current ym and the previous ym in dt1. That is, for ym == 199006 in dt1 I would like to return list(v2 = 2 + 3 + 4 + 5 + 6): the values of v2 in dt2 whose ym is at or before the current ym and strictly after the previous one. In code:
expr = expression({ #browser();
cur_id = id;
cur_ym = ym;
cur_dtb = dt2[J(cur_id)][ym <= cur_ym & ym > last_ym];
setkey(cur_dtb , ym);
list(r = sum(cur_dtb$v2))
})
dt1[,eval(expr ),by=list(id, ym)]
To avoid the logical condition, perform a rolling join of dt1 and dt2. Then shift ym forward by one position within id. Finally, sum over v2 by id and ym:
setkey(dt1, id, last_ym)
setkey(dt2, id, ym)
dt1[dt2,, roll = TRUE][
, list(v2 = v2, ym = c(last_ym[1], head(ym, -1))), by = id][
, list(v2 = sum(v2)), by = list(id, ym)]
Note that we want to sum everything since the last_ym so the key on dt1 must be last_ym rather than ym.
The result is:
id ym v2
1: 1 199001 1
2: 1 199006 20
3: 1 199009 24
4: 1 199012 33
UPDATE: correction
Regardless of the fact that data.table is sorted, you will be limited to the amount of time it takes to evaluate a > 3 & a <= 7 in the first place:
> dt = data.table(a=1:10000000, key="a")
> system.time(dt$a > 3 & dt$a <= 7)
user system elapsed
0.18 0.01 0.20
> system.time(dt[,a > 3 & a <= 7])
user system elapsed
0.18 0.05 0.24
> system.time(dt[a > 3 & a <= 7])
user system elapsed
0.25 0.07 0.31
Alternative approach:
> system.time({Indices = dt$a > 3 & dt$a <= 7 ; dt[Indices]})
user system elapsed
0.28 0.03 0.31
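Since the example column is an integer key, a binary-search join on the key avoids the vector scan altogether when the wanted values can be enumerated (a sketch, not from the original answer):
dt[J(4:7)]   # same rows as dt[a > 3 & a <= 7] for an integer key, found by binary search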
Multiple Subsets
There can be a speed issue here if you subset one factor level at a time on an ad hoc basis rather than doing it all at once:
> dt <- data.table(A=sample(letters, 10000, replace=T))
> system.time(for(i in unique(dt$A)) dt[A==i])
user system elapsed
5.16 0.42 5.59
> system.time(dt[,.SD,by=A])
user system elapsed
0.32 0.03 0.36

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the data that falls within it. I would then like to rewrite the data without those values and use the new columns in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way to write this so that the threshold is easy to change by passing arguments from Java (as I do with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from the csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one fails my quartile threshold (the 0.25 quantile). For example, if the lower quartile for O were 45000, the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read both columns in as a data frame, can I test each column separately? Or should I find the quartiles first and then use those values to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
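Applied to the two-column data in the question, a minimal sketch (assuming the columns are named Abundance_O and Abundance_S as in the csv) that drops a row if either value falls in its column's bottom quartile:
Values <- read.csv(inputFile, header = TRUE)
keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)
Values_trimmed <- Values[keep, ]
cor(Values_trimmed$Abundance_O, Values_trimmed$Abundance_S)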
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numeric vector into its quantile groups. The parameter n sets the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
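The trimmed table can then feed the correlation the question started from:
cor(dt$A0, dt$A1)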

When I use eval with aggregate, I lose the variable names

I have a bit of code which aggregates data:
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
set.size = relevantSS,
stimulus = stimulus
),
mean
)
)
I would like to make this more flexible by specifying different column names to take the place of "relevantSS". I thought I could do this with eval:
set.size.options <- c("relevantSS","irrelevantSS")
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
eval(parse(text = paste("set.size = ", set.size.options[relevant.index]))),
stimulus = stimulus
),
mean
)
)
However, when I run the second bit of code, while it does correctly aggregate the data, I lose the variable name "set.size". If I call str, I get output like this:
'data.frame': 48 obs. of 4 variables:
$ identifier: Factor w/ 9 levels "aks","ejr","ejr3",..: 1 2 4 5 6 7 8 9 1 2 ...
$ Group.2 : int 4 4 4 4 4 4 4 4 8 8 ...
$ stimulus : Factor w/ 2 levels "moving","stationary": 1 1 1 1 1 1 1 1 1 1 ...
$ RT : num 1161 1026 1257 1264 1324 ...
If I run the original code, it correctly identifies the second variable as "set.size".
Any idea what I'm missing here?
I think get might be more appropriate than eval/parse.
set.size.options <- c("relevantSS","irrelevantSS")
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
set.size = get(set.size.options[relevant.index]),
stimulus = stimulus
),
mean
)
)
That said, I'd probably prefer something like this:
d2 <- subset(correct.data, relevantTarget == 1)
doby <- subset(d2, select=c("identifier", set.size.options[relevant.index], "stimulus"))
names(doby) <- c("identifier", "set.size", "stimulus")
aggregate(d2[,"RT",drop=FALSE], doby, mean)
And others will undoubtedly chime in with plyr solutions...
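In that spirit, a dplyr-flavoured sketch (hypothetical, not from the original answers; it assumes the same data and that set.size.options and relevant.index are defined as above):
library(dplyr)
set.size.col <- set.size.options[relevant.index]
pivot.present.RT <- correct.data %>%
  filter(relevantTarget == 1) %>%
  group_by(identifier, set.size = .data[[set.size.col]], stimulus) %>%
  summarise(RT = mean(RT), .groups = "drop")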
Put the grouping variable name outside of eval(parse(...)), like this:
set.size.options <- c("relevantSS","irrelevantSS")
pivot.present.RT <- with(
subset(correct.data, relevantTarget == 1),
aggregate(
data.frame(RT = RT),
list(
identifier = identifier,
set.size = eval(parse(text = set.size.options[relevant.index])),
stimulus = stimulus
),
mean
)
)
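Why the name disappears in the first place: because the whole name = value pair sits inside parse(), eval() treats it as an assignment and returns only the value, so the list element comes back unnamed and aggregate() falls back to Group.2. A quick toy illustration (the object b here is hypothetical):
names(list(a = 1, eval(parse(text = "b = 2"))))
# [1] "a" ""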
