Sample random rows within each group in a data.table

How would you use data.table to efficiently take a sample of rows within each group in a data frame?
DT = data.table(a = sample(1:2), b = sample(1:1000,20))
DT
a b
1: 2 562
2: 1 183
3: 2 180
4: 1 874
5: 2 533
6: 1 21
7: 2 57
8: 1 20
9: 2 39
10: 1 948
11: 2 799
12: 1 893
13: 2 993
14: 1 69
15: 2 906
16: 1 347
17: 2 969
18: 1 130
19: 2 118
20: 1 732
I was thinking of something like: DT[ , sample(??, 3), by = a] that would return a sample of three rows for each "a" (the order of the returned rows isn't significant):
a b
1: 2 180
2: 2 57
3: 2 799
4: 1 69
5: 1 347
6: 1 732

Maybe something like this?
> DT[,.SD[sample(.N, min(3,.N))],by = a]
a b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343
(Thanks to Josh for the correction, below.)

I believe joran's answer can be further generalized. The details are in "How do you sample groups in a data.table with a caveat" (included below), but the point is to account for groups that have fewer than 3 rows to sample from.
A plain sample(.N, 3) errors when it tries to draw x samples from a group with fewer than x rows (here x = 3); taking min(.N, 3) as the sample size handles that caveat. (Solution by nrussell.)
set.seed(123)
##
DT <- data.table(
a=c(1,1,1,1:15,1,1),
b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42

There are two subtle considerations that impact the answer to this question, and these are mentioned by Josh O'Brien and Valentin in comments. The first is that subsetting via .SD is very inefficient, and it is better to sample .I directly (see the benchmark below).
The second consideration, if we do sample from .I, is that calling sample(.I, size = 1) leads to unexpected behavior when a group consists of a single row whose index is greater than 1. In that case, sample() behaves as if we had called sample(1:.I, size = 1), which is surely not what we want. As Valentin notes, it's better to use the construct .I[sample(.N, size = 1)] in this case.
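A minimal illustration of the pitfall, using a toy table where group "b" consists of the single row 3:
library(data.table)
toy <- data.table(g = c("a", "a", "b"), x = 1:3)
# Wrong: for group "b", .I is the single value 3, so sample(.I, 1)
# behaves like sample(1:3, 1) and can select row 1 or 2 instead.
toy[toy[, sample(.I, 1), by = g]$V1]
# Right: sample a within-group position via .N, then index into .I.
toy[toy[, .I[sample(.N, 1)], by = g]$V1]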
As a benchmark, we build a simple 1,000 x 1 data.table and sample randomly per group. Even with such a small data.table the .I method is roughly 20x faster.
library(microbenchmark)
library(data.table)
set.seed(1L)
DT <- data.table(id = sample(1e3, 1e3, replace = TRUE))
microbenchmark(
`.I` = DT[DT[, .I[sample(.N, 1)], by = id][[2]]],
`.SD` = DT[, .SD[sample(.N, 1)], by = id]
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> .I 2.396166 2.588275 3.22504 2.794152 3.118135 19.73236 100
#> .SD 55.798177 59.152000 63.72131 61.213650 64.205399 102.26781 100
Created on 2020-12-02 by the reprex package (v0.3.0)

Inspired by this answer by David Arenburg, another method to avoid the .SD allocation would be to sample the groups, then join back onto the original data using .EACHI
DT[ DT[, sample(.N, 3), by=a], b[i.V1], on="a", by=.EACHI]
# a V1
# 1: 2 42
# 2: 2 498
# 3: 2 179
# 4: 1 469
# 5: 1 93
# 6: 1 898
where the inner DT[, sample(.N, 3), by=a] gives us three sampled within-group row numbers for each group
# a V1
# 1: 1 9
# 2: 1 3
# 3: 1 2
# 4: 2 4
# 5: 2 9
# ---
so we can then use V1 to give us the b it corresponds to.
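If you need every column rather than just b, a variant of the same idea (a sketch reusing the .I construct from the benchmark above) samples row numbers per group and subsets once:
DT[DT[, .I[sample(.N, 3)], by = a]$V1]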

Stratified sampling / oversampling: count the y == 1 cases within each stratum x, merge those counts back onto the data, then draw that many rows within each (y, x) group. (This assumes a data.table don with a binary outcome y, a stratum variable x, and an id column iden; the original merge selected y instead of x, which is fixed here.)
size <- don[y == 1, .(strata = length(iden)), by = .(y, x)]  # count of iden per stratum
table(don$x, don$y)                                          # inspect the class balance
don <- merge(don, size[, .(x, strata)], by = "x")            # merge strata counts back
don_strata <- don[, .SD[sample(.N, strata)], by = .(y, x)]
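Since don isn't defined in this thread, here is a self-contained sketch of the same idea (every name below is hypothetical):
library(data.table)
set.seed(42)
don <- data.table(iden = 1:100,
                  x = sample(c("s1", "s2"), 100, replace = TRUE),
                  y = rbinom(100, 1, 0.2))
size <- don[y == 1, .(strata = .N), by = x]   # positives per stratum
don <- merge(don, size, by = "x")
# draw strata rows within each (y, x) group, guarding small groups with min()
don_strata <- don[, .SD[sample(.N, min(.N, strata))], by = .(y, x)]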

Related

R data.table .SD unexpected behaviour

I was trying to calculate some basic statistics on a data table and came across this (for me) unexpected behaviour.
If I compute everything using "explicit" indexes, everything works as expected, as in the following example:
library(data.table)
n <- 100; reps <- 6; n1 <- 2
df <- as.data.frame(cbind(matrix(seq_len(n*n1), ncol=n1),
matrix(sample(0:1000, n*reps, replace=TRUE), ncol=reps)))
dt <- data.table(df)
dtmean <- dt[, lapply(.SD[,c(seq(2,5))], mean, na.rm=TRUE), by=c("V1")]
but if I use
a=2
b=5
dtmean <- dt[, lapply(.SD[,c(seq(a,b))], mean, na.rm=TRUE), by=c("V1")]
the result is not what I expect (compare with the previous version).
Is this how data.table is intended to work?
So the first piece of code, run with n = 10, gives
V1 V3 V4 V5 V6
1: 1 504 399 430 564
2: 2 547 294 274 700
3: 3 555 305 781 326
4: 4 144 840 983 221
5: 5 894 659 169 38
6: 6 788 289 598 433
7: 7 810 378 86 22
8: 8 848 212 701 565
9: 9 412 707 890 160
10: 10 82 580 927 607
while the second
V1 V1 V2 V3 V4
1: 1 2 3 4 5
2: 2 2 3 4 5
3: 3 2 3 4 5
4: 4 2 3 4 5
5: 5 2 3 4 5
6: 6 2 3 4 5
7: 7 2 3 4 5
8: 8 2 3 4 5
9: 9 2 3 4 5
10: 10 2 3 4 5
Shouldn't they give the same results?
The mean here doesn't actually aggregate anything, since every V1 value is distinct; the question is about how the column indexes are selected, and I don't understand why the two versions work differently.
You should use .SDcols to control what's included in .SD in this case:
dtmean <- dt[, lapply(.SD, mean, na.rm=TRUE), by="V1", .SDcols=seq(a,b)]
To do it your way, use with=FALSE in the inner .SD call:
dtmean <- dt[, lapply(.SD[, seq(a,b), with=FALSE], mean, na.rm=TRUE), by=c("V1")]
.SD is itself a data.table, so [ has the same semantics, i.e., the issue is the same as the difference between
dt[ , seq(a,b)]
and
dt[ , seq(a,b), with=FALSE]
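Concretely, with a = 2 and b = 5 as defined above:
dt[, seq(a, b)]                # j is evaluated as an expression: returns the vector 2 3 4 5
dt[, seq(a, b), with = FALSE]  # column positions: returns columns V2 through V5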
Addendum: .SDcols can also be used to determine a:b inline in some cases, e.g., if a:b are just the numeric columns of the table, we can use:
dt[ , lapply(.SD, mean, na.rm=TRUE), by=V1, .SDcols=is.numeric]
Or if the columns a:b share a naming pattern, e.g.:
dt[ , lapply(.SD, mean, na.rm=TRUE), by=V1, .SDcols=patterns("ends_with_x$")]

Calculate mean of all groups except the current group

I have a data frame with two grouping variables, 'mkt' and 'mdl', and some values 'pr':
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2),
mdl = c('a','a','b','b','b','a','b','a','b'),
pr = c(120,120,110,110,145,130,145,130, 145))
df
mkt mdl pr
1 1 a 120
2 1 a 120
3 1 b 110
4 1 b 110
5 2 b 145
6 2 a 130
7 2 b 145
8 2 a 130
9 2 b 145
Within each 'mkt', the mean 'pr' for each 'mdl' should be calculated over all the other 'mdl' values in the same 'mkt', excluding the current 'mdl'.
For example, for the group defined by mkt == 1 and mdl == a, 'avgother' is calculated as the average of 'pr' for mkt == 1 (same 'mkt') and mdl == b (all 'mdl' other than the current group a).
Desired result:
# mkt mdl pr avgother
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
First get the average for each mkt/mdl combination; then, within each mkt, exclude the current group's value and average the remaining ones.
library(dplyr)
library(purrr)
df %>%
group_by(mkt, mdl) %>%
summarise(avgother = mean(pr)) %>%
mutate(avgother = map_dbl(row_number(), ~mean(avgother[-.x]))) %>%
ungroup %>%
inner_join(df, by = c('mkt', 'mdl'))
# mkt mdl avgother pr
# <dbl> <chr> <dbl> <dbl>
#1 1 a 110 120
#2 1 a 110 120
#3 1 b 120 110
#4 1 b 120 110
#5 2 a 145 130
#6 2 a 145 130
#7 2 b 130 145
#8 2 b 130 145
#9 2 b 130 145
Using data.table, calculate sum and length by 'mkt'. Then, within each mkt-mdl group, calculate mean as (mkt sum - group sum) / (mkt length - group length)
library(data.table)
setDT(df)[ , `:=`(s = sum(pr), n = .N), by = mkt]
df[ , avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)]
df[ , `:=`(s = NULL, n = NULL)]
# mkt mdl pr avgother
# 1: 1 a 120 110
# 2: 1 a 120 110
# 3: 1 b 110 120
# 4: 1 b 110 120
# 5: 2 b 145 130
# 6: 2 a 130 145
# 7: 2 b 145 130
# 8: 2 a 130 145
# 9: 2 b 145 130
Consider base R with multiple ave calls at the two grouping levels, using the decomposed version of mean as sum / count:
df <- within(df, {
avgoth <- (ave(pr, mkt, FUN=sum) - ave(pr, mkt, mdl, FUN=sum)) /
(ave(pr, mkt, FUN=length) - ave(pr, mkt, mdl, FUN=length))
})
df
# mkt mdl pr avgoth
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
For the sake of completeness, here is another data.table approach which uses grouping by each i, i.e., join and aggregate simultaneously.
For demonstration, an enhanced sample dataset is used which has a third market with 3 products:
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3),
mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3]),
pr = c(120,120,110,110,145,130,145,130, 145, 1:3))
library(data.table)
mdt <- setDT(df)[, .(mdl, s = sum(pr), .N), by = .(mkt)]
df[mdt, on = .(mkt, mdl), avgother := (sum(pr) - s) / (.N - N), by = .EACHI][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
The temporary table mdt contains the sum and count of prices within each mkt, replicated for each product mdl within the market:
mdt
mkt mdl s N
1: 1 a 460 4
2: 1 a 460 4
3: 1 b 460 4
4: 1 b 460 4
5: 2 b 695 5
6: 2 a 695 5
7: 2 b 695 5
8: 2 a 695 5
9: 2 b 695 5
10: 3 a 6 3
11: 3 b 6 3
12: 3 c 6 3
Having mkt and mdl in mdt allows for grouping by each i (by = .EACHI).
Here is an approach which computes avgother directly, by subsetting the pr values which do not belong to the current value of mdl before computing the averages.
This is quite different from the other answers posted so far, which justifies posting it as a separate answer, IMHO.
# enhanced sample dataset covering more corner cases
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3,4),
mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3],'d'),
pr = c(120,120,110,110,145,130,145,130, 145, 1:3, 9))
library(data.table)
setDT(df)[, avgother := sapply(mdl, function(m) mean(pr[m != mdl])), by = mkt][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
13: 4 d 9 NaN
Difference between approaches
The other answers share more or less the same approach (although implemented in different manners):
compute sums and counts of pr for each mkt
compute sums and counts of pr for each mkt and mdl
subtract the mkt/mdl sums and counts from the mkt sums and counts
compute avgother
This approach:
groups by mkt
loops through mdl within each mkt
subsets pr to drop the values which belong to the current value of mdl
and computes mean() directly.
Caveat concerning performance: although the code is essentially a one-liner, that does not mean it is the fastest option; see the benchmark sketch below.
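To put a rough number on that caveat, here is a quick benchmark sketch (assuming the enhanced df from above; copy() keeps the runs from modifying each other's input):
library(microbenchmark)
microbenchmark(
  sums   = { d <- copy(df)
             d[, `:=`(s = sum(pr), n = .N), by = mkt]
             d[, avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)] },
  subset = { d <- copy(df)
             d[, avgother := sapply(mdl, function(m) mean(pr[m != mdl])), by = mkt] },
  times = 10L
)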

Find non-overlapping values from two tables in R

I have two tables as follows:
library(data.table)
Input <- data.table("Date" = 1:10, "Cycle" = c(90,100,130,180,200,230,250,260,300,320))
Date Cycle
1: 1 90
2: 2 100
3: 3 130
4: 4 180
5: 5 200
6: 6 230
7: 7 250
8: 8 260
9: 9 300
10: 10 320
FDate <- data.table("Date" = 1:9, "Cycle" = c(90,100,130,180,200,230,250,260,300), "Task" = c("D","A","B,C",NA,"A,D","D","C","D","A,C,D"))
Date Cycle Task
1: 1 90 D
2: 2 100 A
3: 3 130 B,C
4: 4 180 <NA>
5: 5 200 A,D
6: 6 230 D
7: 7 250 C
8: 8 260 D
9: 9 300 A,C,D
I just want an output table with the non-overlapping Date and its corresponding Cycle.
I tried setdiff but it doesn't work. I expect output like this:
Date Cycle
10 320
When I tried setdiff(FDate$Date, Input$Date), it returned integer(0).
We can use fsetdiff from data.table by including only the common columns in both datasets
fsetdiff(Input, FDate[ , names(Input), with = FALSE])
# Date Cycle
#1: 10 320
Or a join as #Frank mentioned
Input[!FDate, on=.(Date)]
# Date Cycle
#1: 10 320
In the OP's code,
setdiff(FDate$Date, Input$Date)
the first argument is the 'Date' column of 'FDate'. All of the elements in that column are also present in the master data 'Input$Date', so it returns integer(0).
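Reversing the arguments gives the expected answer (using the tables defined above):
setdiff(Input$Date, FDate$Date)
# [1] 10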

How to randomly sample dataframe rows with unique column values

The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) from the avg_score for 10 trials of incrementally sized random samples per word from a dataset similar to:
library (data.table)
set.seed(1)
df <- data.frame(
word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)),
true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
item_score = runif(40),
avg_score = rnorm(40),
line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
validity = sample(c("T", "F"), 40, replace = T)
)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt$numerator[dt$validity == "F"] <- 0
df <- dt
df
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
4: 1 cat 234 234 0.93296878 -0.19981748 34 T 4 2
5: 2 house 39 34 0.49471189 0.68924373 25 F 1 0
6: 2 house 39 34 0.64499368 0.03614551 32 T 2 1
7: 2 house 39 34 0.17580259 1.94353631 573 F 3 0
8: 3 sung 876 876 0.60299465 0.73721373 3 T 1 1
9: 3 sung 876 876 0.88775767 2.32133393 673 F 2 0
10: 4 door 4 4 0.49020940 0.34890935 899 T 1 1
11: 4 door 4 4 0.01838357 -1.13391666 904 T 2 2
The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The sample size should count unique column values (line), but all rows sharing a sampled line number should be returned, meaning the actual number of rows returned can exceed the specified sample size. So, for a single trial with sample size two for the word "cat", the desired result would look like:
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
My basic iteration (found on this site) currently looks like:
a2 <- list(); b3 <- list()
for (i in 1:10) {
a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = TRUE), ])
b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = TRUE), ])
}
So I can do standard random sample sizes, but I am unsure how to approach the goal stated above (I couldn't find anything similar, or perhaps wasn't searching the right way). Is there a straightforward way to do this?
Here is a data.table solution that uses a join on a sampled data.table.
set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]
The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value for line. The values for line are returned by sample which is fed the unique values of line and is performed separately for each word (using by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is joined onto the main data.table in order to select the desired rows.
In this instance, you get
word_ID word true_count occurrences item_score avg_score line validity
1: 1 cat 234 234 0.26550866 0.91897737 71 F
2: 1 cat 234 234 0.57285336 0.07456498 71 T
3: 1 cat 234 234 0.37212390 0.78213630 234 T
4: 2 house 39 34 0.89838968 -0.05612874 32 T
5: 2 house 39 34 0.94467527 -0.15579551 573 F
6: 3 sung 876 876 0.62911404 -0.47815006 673 T
7: 3 sung 876 876 0.66079779 -1.47075238 3 T
8: 4 door 4 4 0.06178627 0.41794156 899 F
9: 4 door 4 4 0.38410372 -0.05380504 55 F
10: 5 pretty 67 65 0.71761851 -0.39428995 100 F
11: 5 pretty 67 65 0.38003518 1.10002537 100 F
12: 5 pretty 67 65 0.49769924 -0.41499456 1003 F
13: 6 towel 81 81 0.21214252 -0.25336168 326 F
14: 6 towel 81 81 0.93470523 -0.16452360 87 F
15: 7 car 90 90 0.12555510 0.55666320 32 T
16: 7 car 90 90 0.26722067 -0.68875569 54 F
17: 8 island 43 43 0.01339033 0.36458196 87 T
18: 8 island 43 43 0.38238796 0.76853292 988 F
19: 8 island 43 43 0.86969085 -0.11234621 988 T
20: 9 ran 54 54 0.59956583 -0.61202639 754 F
21: 9 ran 54 54 0.82737332 1.43302370 4276 F
22: 10 pizza 53 51 0.79423986 -0.36722148 93 F
23: 10 pizza 53 51 0.41127443 -0.13505460 49 T
word_ID word true_count occurrences item_score avg_score line validity
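One caveat with the approach above: if a word has fewer distinct lines than the requested sample size, sample(unique(line), 2) either errors or, when there is exactly one distinct line, silently draws from 1:line (the same scalar sample() pitfall noted in the benchmark discussion earlier). A guarded variant, as a sketch (k and samp are just illustrative names):
k <- 2
samp <- df[, .(line = {ul <- unique(line); ul[sample(length(ul), min(length(ul), k))]}), by = word]
df[samp, on = .(word, line)]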
If you sample from a de-duplicated data.frame and then do a left join back onto the original data, you can get what you need.
I'm not proficient with data.table, so I'll use base functions. (dplyr would work well here too, but since you're using data.table, I'll avoid it for now.) (As I'm about to hit submit, @lmo provided a data.table-specific answer ...)
By "de-duplicate", I mean:
subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36 2
head(subdf)
# word_ID line
# 1 1 71
# 2 1 234
# 4 1 34
# 5 2 25
# 6 2 32
# 7 2 573
Note that subdf has only three rows for word_ID 1, whereas the original data has 4:
df[1:4,]
# word_ID word true_count occurrences item_score avg_score line validity
# 1 1 cat 234 234 0.2655087 0.91897737 71 F
# 2 1 cat 234 234 0.3721239 0.78213630 234 T
# 3 1 cat 234 234 0.5728534 0.07456498 71 T
# 4 1 cat 234 234 0.9082078 -1.98935170 34 T
I'm using by here instead of lapply/split, but the results should be the same:
out <- by(subdf, subdf$word_ID, function(x) merge(x[sample(nrow(x), 2, replace=TRUE),], df, by=c("word_ID", "line")))
out[1]
# $`1`
# word_ID line word true_count occurrences item_score avg_score validity
# 1 1 34 cat 234 234 0.9082078 -1.98935170 T
# 2 1 71 cat 234 234 0.5728534 0.07456498 T
# 3 1 71 cat 234 234 0.2655087 0.91897737 F

How do you sample groups in a data.table with a caveat

This question is very similar to Sample random rows within each group in a data.table.
The difference is in a minor subtlety that I did not have enough reputation to discuss for that question itself.
Let's change Christopher Manning's initial data a little bit:
> DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 2 632
6: 3 186
7: 4 761
8: 5 150
9: 6 423
10: 7 832
11: 8 883
12: 9 247
13: 10 894
14: 11 141
15: 12 891
16: 13 488
17: 14 101
18: 15 677
19: 1 400
20: 1 467
If we tried the question's solution:
> DT[,.SD[sample(.N,3)],by = a]
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
This is because there are values in column a that occur only once. We cannot draw 3 samples from a group with fewer than three rows without using replacement (which we do not want to do).
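The group sizes make this concrete:
DT[, .N, by = a]
# a = 1 appears 6 times; every other value of a appears exactly once,
# so sample(.N, 3) fails for those single-row groups.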
I am struggling to deal with this scenario. We want to sample 3 times when the number of occurrences is >= 3, but pull the number of occurrences if it is < 3. For example with our DT above we would want:
a b
1: 1 102
2: 1 5
3: 1 658
4: 2 632
5: 3 186
6: 4 761
7: 5 150
8: 6 423
9: 7 832
10: 8 883
11: 9 247
12: 10 894
13: 11 141
14: 12 891
15: 13 488
16: 14 101
17: 15 677
Maybe a solution could involve sorting the data.table like this, then using rle() lengths to find out which n to use in the sample function above:
> DT <- DT[order(DT$a),]
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 1 400
6: 1 467
7: 2 632
8: 3 186
9: 4 761
10: 5 150
11: 6 423
12: 7 832
13: 8 883
14: 9 247
15: 10 894
16: 11 141
17: 12 891
18: 13 488
19: 14 101
20: 15 677
> ifelse(rle(DT$a)$lengths >= 3, 3,rle(DT$a)$lengths)
[1] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If we replace "3" with n, this returns how many rows we should sample for a = 1, a = 2, a = 3...
I have yet to find a way to incorporate this into a final solution. Any help would be appreciated!
I might be misunderstanding your question, but are you looking for something like this?
set.seed(123)
##
DT <- data.table(
a=c(1,1,1,1:15,1,1),
b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42
where we are drawing 3 samples from b for group a_i if a_i contains three or more values, else we draw only n values, where n (n < 3) is the size of group a_i.
Just for demonstration, here are the 6 possible values of b for a=1 that we are sampling from (assuming you use the same random seed as above):
R> DT[order(a)][1:6,]
a b
1: 1 288
2: 1 788
3: 1 409
4: 1 881
5: 1 323
6: 1 996
