How do you sample groups in a data.table with a caveat

This question is very similar to Sample random rows within each group in a data.table.
The difference is in a minor subtlety that I did not have enough reputation to discuss for that question itself.
Let's change Christopher Manning's initial data a little bit:
> DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 2 632
6: 3 186
7: 4 761
8: 5 150
9: 6 423
10: 7 832
11: 8 883
12: 9 247
13: 10 894
14: 11 141
15: 12 891
16: 13 488
17: 14 101
18: 15 677
19: 1 400
20: 1 467
If we tried the question's solution:
> DT[,.SD[sample(.N,3)],by = a]
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
This is because there are values in column a that occur only once. We cannot sample 3 rows from values that occur fewer than three times without using replacement (which we do not want to do).
I am struggling to deal with this scenario. We want to sample 3 rows when the number of occurrences is >= 3, but take all occurrences when there are fewer than 3. For example, with our DT above we would want:
a b
1: 1 102
2: 1 5
3: 1 658
4: 2 632
5: 3 186
6: 4 761
7: 5 150
8: 6 423
9: 7 832
10: 8 883
11: 9 247
12: 10 894
13: 11 141
14: 12 891
15: 13 488
16: 14 101
17: 15 677
Maybe a solution could involve sorting the data.table like this, then using rle() lengths to find out which n to use in the sample function above:
> DT <- DT[order(DT$a),]
> DT
a b
1: 1 102
2: 1 5
3: 1 658
4: 1 499
5: 1 400
6: 1 467
7: 2 632
8: 3 186
9: 4 761
10: 5 150
11: 6 423
12: 7 832
13: 8 883
14: 9 247
15: 10 894
16: 11 141
17: 12 891
18: 13 488
19: 14 101
20: 15 677
> ifelse(rle(DT$a)$lengths >= 3, 3, rle(DT$a)$lengths)
[1] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If we replace "3" with n, this returns how many rows we should sample for a=1, a=2, a=3, and so on.
I have yet to find a clean way to incorporate this into a final solution.
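The closest I can manage is a clumsy stitch-up of those pieces (a rough sketch: it assumes DT is already sorted by a so the rle() runs line up with split()'s groups, and it samples row indices via sample.int() to avoid sample()'s surprising behaviour on a length-one vector):
sizes <- pmin(rle(DT$a)$lengths, 3)        # per-group sample size, capped at 3
rows  <- split(seq_len(nrow(DT)), DT$a)    # row indices of each group
picks <- Map(function(idx, n) idx[sample.int(length(idx), n)], rows, sizes)
DT[unlist(picks)]                          # at most 3 rows per value of a
But I suspect there is a cleaner data.table idiom. Any help would be appreciated!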

I might be misunderstanding your question, but are you looking for something like this?
set.seed(123)
##
DT <- data.table(
  a = c(1,1,1,1:15,1,1),
  b = sample(1:1000, 20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42
where we are drawing 3 samples from b for group a_i if a_i contains three or more values, else we draw only n values, where n (n < 3) is the size of group a_i.
Just for demonstration, here are the 6 possible values of b for a=1 that we are sampling from (assuming you use the same random seed as above):
R> DT[order(a)][1:6,]
a b
1: 1 288
2: 1 788
3: 1 409
4: 1 881
5: 1 323
6: 1 996
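The cap also generalizes if you want it as a parameter rather than a literal; the same logic with a variable n (just a usage sketch):
n <- 3
DT[, .SD[sample(.N, min(.N, n))], by = a]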

Extend numerical series in data frame

Data
Let's take a look at a simple dataset (mine is actually >200,000 rows):
df <- data.frame(
  id = c(rep(1, 11), rep(2, 6)),
  ref.pos = c(NA, NA, NA, 301, 302, 303, 800, 801, NA, NA, NA, 500, 501, 502, NA, NA, NA),
  pos = c(1:11, 30:35)
)
Which thus looks like this:
id ref.pos pos
1 1 NA 1
2 1 NA 2
3 1 NA 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 NA 9
10 1 NA 10
11 1 NA 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 NA 33
16 2 NA 34
17 2 NA 35
What I want to achieve
Per id I want to extend the numbers in ref.pos to fill the whole column: the ref.pos numbers should count down as you move up the data frame and count up as you move down the column. This would result in the following data frame:
id ref.pos pos
1 1 298 1
2 1 299 2
3 1 300 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 802 9
10 1 803 10
11 1 804 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 503 33
16 2 504 34
17 2 505 35
What I tried
I wish I could provide some code here, but I haven't figured out a proper way in two days, especially not one applicable to large datasets. I found df %>% group_by(id) %>% tidyr::fill(ref.pos, .direction = "downup") interesting, however it repeats the boundary numbers rather than counting down and up for me.
I hope my question is clear, otherwise let me know in the comments!
An option using data.table:
fillends <- function(x) nafill(nafill(x, "locf"), "nocb")
setDT(df)[, ref.pos2 := {
  dif <- fillends(c(diff(ref.pos), NA_integer_))              # local step size, carried into the NA runs
  frp <- fillends(ref.pos)                                    # nearest observed ref.pos (the anchor value)
  fp  <- fillends(replace(pos, is.na(ref.pos), NA_integer_))  # pos at that anchor
  fifelse(is.na(ref.pos), frp + dif*(pos - fp), ref.pos)      # extrapolate from the anchor by the step size
}, id]
output:
id ref.pos pos ref.pos2
1: 1 NA 1 298
2: 1 NA 2 299
3: 1 NA 3 300
4: 1 301 4 301
5: 1 302 5 302
6: 1 303 6 303
7: 1 802 7 802
8: 1 801 8 801
9: 1 NA 9 800
10: 1 NA 10 799
11: 1 NA 11 798
12: 2 500 30 500
13: 2 501 31 501
14: 2 502 32 502
15: 2 NA 33 503
16: 2 NA 34 504
17: 2 NA 35 505
data:
df <- data.frame(
  id = c(rep(1, 11), rep(2, 6)),
  ref.pos = c(NA, NA, NA, 301, 302, 303, 802, 801, NA, NA, NA, 500, 501, 502, NA, NA, NA),
  pos = c(1:11, 30:35)
)
A base R option is to define a custom function fill, which is applied per group via ave:
fill <- function(v) {
  inds <- range(which(!is.na(v)))        # positions of the first and last non-NA values
  l <- 1:inds[1]                         # leading run, up to the first non-NA
  u <- inds[2]:length(v)                 # trailing run, from the last non-NA
  v[l] <- v[inds[1]] - rev(l) + 1        # count down moving up the column
  v[u] <- v[inds[2]] + seq_along(u) - 1  # count up moving down the column
  v
}
df <- within(df, ref.pos <- ave(ref.pos, id, FUN = fill))
such that
> df
id ref.pos pos
1 1 298 1
2 1 299 2
3 1 300 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 802 9
10 1 803 10
11 1 804 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 503 33
16 2 504 34
17 2 505 35
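A design note on fill(): it only extends the leading and trailing NA runs from the first and last observed values; any interior NA sitting between two observed values would be left untouched (which is fine for this data). A quick sanity check on a bare vector, with hypothetical values:
fill(c(NA, NA, 301, 302, 303, NA, NA))
# [1] 299 300 301 302 303 304 305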

How to randomly sample dataframe rows with unique column values

The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) against the avg_score, over 10 trials of incrementally sized random samples per word, from a dataset similar to:
library(data.table)
set.seed(1)
df <- data.frame(
  word_ID = c(rep(1,4), rep(2,3), rep(3,2), rep(4,5), rep(5,5), rep(6,3), rep(7,4), rep(8,4), rep(9,6), rep(10,4)),
  word = c(rep("cat",4), rep("house",3), rep("sung",2), rep("door",5), rep("pretty",5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza",4)),
  true_count = c(rep(234,4), rep(39,3), rep(876,2), rep(4,5), rep(67,5), rep(81,3), rep(90,4), rep(43,4), rep(54,6), rep(53,4)),
  occurrences = c(rep(234,4), rep(34,3), rep(876,2), rep(4,5), rep(65,5), rep(81,3), rep(90,4), rep(43,4), rep(54,6), rep(51,4)),
  item_score = runif(40),
  avg_score = rnorm(40),
  line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
  validity = sample(c("T", "F"), 40, replace = TRUE)
)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt$numerator[dt$validity == "F"] <- 0
df <- dt
> df
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
4: 1 cat 234 234 0.93296878 -0.19981748 34 T 4 2
5: 2 house 39 34 0.49471189 0.68924373 25 F 1 0
6: 2 house 39 34 0.64499368 0.03614551 32 T 2 1
7: 2 house 39 34 0.17580259 1.94353631 573 F 3 0
8: 3 sung 876 876 0.60299465 0.73721373 3 T 1 1
9: 3 sung 876 876 0.88775767 2.32133393 673 F 2 0
10: 4 door 4 4 0.49020940 0.34890935 899 T 1 1
11: 4 door 4 4 0.01838357 -1.13391666 904 T 2 2
The data represent each detection of a word in a document, so a word can appear on the same line more than once. The task is for the sample size to count unique column values (line), but to return all instances where the line number is the same, meaning the actual number of rows returned could exceed the specified sample size. So, for a single trial sampling two unique lines for "cat", the desired result would take the form:
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
My basic iteration (found on this site) currently looks like:
a2 <- b3 <- vector("list", 10)  # pre-allocate the result lists
for (i in 1:10) {
  a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = TRUE), ])
  b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = TRUE), ])
}
So, I can draw standard random samples, but I am unsure how to approach the goal stated above (I couldn't find anything similar, or perhaps wasn't searching the right way). Is there a straightforward way to approach this?
Thanks,
Here is a data.table solution that uses a join on a sampled data.table.
set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]
The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value for line. The values for line are returned by sample which is fed the unique values of line and is performed separately for each word (using by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is joined onto the main data.table in order to select the desired rows.
In this instance, you get
word_ID word true_count occurrences item_score avg_score line validity
1: 1 cat 234 234 0.26550866 0.91897737 71 F
2: 1 cat 234 234 0.57285336 0.07456498 71 T
3: 1 cat 234 234 0.37212390 0.78213630 234 T
4: 2 house 39 34 0.89838968 -0.05612874 32 T
5: 2 house 39 34 0.94467527 -0.15579551 573 F
6: 3 sung 876 876 0.62911404 -0.47815006 673 T
7: 3 sung 876 876 0.66079779 -1.47075238 3 T
8: 4 door 4 4 0.06178627 0.41794156 899 F
9: 4 door 4 4 0.38410372 -0.05380504 55 F
10: 5 pretty 67 65 0.71761851 -0.39428995 100 F
11: 5 pretty 67 65 0.38003518 1.10002537 100 F
12: 5 pretty 67 65 0.49769924 -0.41499456 1003 F
13: 6 towel 81 81 0.21214252 -0.25336168 326 F
14: 6 towel 81 81 0.93470523 -0.16452360 87 F
15: 7 car 90 90 0.12555510 0.55666320 32 T
16: 7 car 90 90 0.26722067 -0.68875569 54 F
17: 8 island 43 43 0.01339033 0.36458196 87 T
18: 8 island 43 43 0.38238796 0.76853292 988 F
19: 8 island 43 43 0.86969085 -0.11234621 988 T
20: 9 ran 54 54 0.59956583 -0.61202639 754 F
21: 9 ran 54 54 0.82737332 1.43302370 4276 F
22: 10 pizza 53 51 0.79423986 -0.36722148 93 F
23: 10 pizza 53 51 0.41127443 -0.13505460 49 T
word_ID word true_count occurrences item_score avg_score line validity
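One caveat worth adding: if some word had fewer unique line values than the requested sample size, sample(unique(line), 2) would fail or, worse, silently sample from 1:x when unique(line) is a single number. A guarded variant of the same join (a sketch, indexing positions via sample.int() instead):
df[df[, .(line = unique(line)[sample.int(uniqueN(line), min(uniqueN(line), 2))]), by=word],
   on=.(word, line)]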
If you sample from a de-duplicated data.frame and do a subsequent left-join with the original data, you can get exactly what you need.
I'm not proficient with data.table, so I'll use base functions. (dplyr would work well here, too, but since you're using data.table, I'll avoid it for now.) (As I'm about to hit submit, @lmo has provided a data.table-specific answer ...)
By "de-duplicate", I mean:
subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36 2
head(subdf)
# word_ID line
# 1 1 71
# 2 1 234
# 4 1 34
# 5 2 25
# 6 2 32
# 7 2 573
Note that the subdf only has three rows for 1, whereas the original data has 4:
df[1:4,]
# word_ID word true_count occurrences item_score avg_score line validity
# 1 1 cat 234 234 0.2655087 0.91897737 71 F
# 2 1 cat 234 234 0.3721239 0.78213630 234 T
# 3 1 cat 234 234 0.5728534 0.07456498 71 T
# 4 1 cat 234 234 0.9082078 -1.98935170 34 T
I'm using by here instead of lapply/split, but the results should be the same:
out <- by(subdf, subdf$word_ID, function(x) merge(x[sample(nrow(x), 2, replace=TRUE),], df, by=c("word_ID", "line")))
out[1]
# $`1`
# word_ID line word true_count occurrences item_score avg_score validity
# 1 1 34 cat 234 234 0.9082078 -1.98935170 T
# 2 1 71 cat 234 234 0.5728534 0.07456498 T
# 3 1 71 cat 234 234 0.2655087 0.91897737 F
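And since dplyr was mentioned above, a minimal sketch of the same de-duplicate-then-join idea (assuming dplyr >= 1.0, whose slice_sample() truncates n to the group size on its own):
library(dplyr)
df %>%
  distinct(word_ID, line) %>%                 # de-duplicate the word/line pairs
  group_by(word_ID) %>%
  slice_sample(n = 2) %>%                     # at most 2 unique lines per word
  inner_join(df, by = c("word_ID", "line"))   # pull back every matching row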

How to write a function which can count individuals once even with multiple matches across columns

I have a data set which has patient diagnostic (ICD-9) codes, which can have a length between 3-5 digits, where the first three digits represent a classification of diagnosis, and the 4th and 5th represent a further refinement of the classification. For example:
zz<-" dx1 dx2 dx3
1 64251 82381 8100
2 8052 8730 51881
3 64421 431 81601
4 3041 29690 9920
5 72888 8782 59080
6 7245 60886 8479
7 291 4659 4739
8 30410 30400 95901
9 2929 30500 8208
10 7840 6268 8052"
df<-read.table(text=zz, header=TRUE)
Each row of codes represents multiple diagnoses of the same individual. I have written a series of ifelse statements to create a new variable in which the codes I'm interested in are mapped to numbers representing different diagnoses of interest:
df$x <- ifelse(grepl("^291", df$dx1), 1, ifelse(grepl("^292", df$dx1), 1,
        ifelse(grepl("^3040", df$dx1), 2, ifelse(grepl("^3047", df$dx1), 2,
        ifelse(grepl("^3051", df$dx1), 3, ifelse(grepl("^98984", df$dx1), 3, 0))))))
Where I run into trouble is when I want to check for these select codes across each of the columns containing diagnostic codes. I attempted to write a function for this:
df$alldx <- apply(df[, c(1:3)], MARGIN = 2, function(dx) {
  ifelse(grepl("^291", dx), 1, ifelse(grepl("^292", dx), 1,
  ifelse(grepl("^3040", dx), 2, ifelse(grepl("^3047", dx), 2,
  ifelse(grepl("^3051", dx), 3, ifelse(grepl("^98984", dx), 3, 0))))))
})
The problem is I only want to count an individual once if they have one of the codes of interest; in the case of multiple code matches, then that person’s code should be whichever diagnosis was given first. I feel like there must be a way to do this, but it’s well beyond my coding abilities!
Here's what I would do, using the data.table package for convenience:
library(data.table)
setDT(df)[, id := .I]
DF = melt(df, id="id")[,
`:=`(diag = substr(value, 1, 3), ref = substr(value, 4, 5))][order(id)]
So now the data looks like
id variable value diag ref
1: 1 dx1 64251 642 51
2: 1 dx2 82381 823 81
3: 1 dx3 8100 810 0
4: 2 dx1 8052 805 2
5: 2 dx2 8730 873 0
6: 2 dx3 51881 518 81
7: 3 dx1 64421 644 21
8: 3 dx2 431 431
9: 3 dx3 81601 816 01
10: 4 dx1 3041 304 1
11: 4 dx2 29690 296 90
12: 4 dx3 9920 992 0
13: 5 dx1 72888 728 88
14: 5 dx2 8782 878 2
15: 5 dx3 59080 590 80
16: 6 dx1 7245 724 5
17: 6 dx2 60886 608 86
18: 6 dx3 8479 847 9
19: 7 dx1 291 291
20: 7 dx2 4659 465 9
21: 7 dx3 4739 473 9
22: 8 dx1 30410 304 10
23: 8 dx2 30400 304 00
24: 8 dx3 95901 959 01
25: 9 dx1 2929 292 9
26: 9 dx2 30500 305 00
27: 9 dx3 8208 820 8
28: 10 dx1 7840 784 0
29: 10 dx2 6268 626 8
30: 10 dx3 8052 805 2
id variable value diag ref
where id is the patient ID, diag is the diagnosis and ref is the refinement, if any. We can see that there is not a lot of duplication in diagnoses in the example data:
DF[, sum(duplicated(diag)), by=id]
id V1
1: 1 0
2: 2 0
3: 3 0
4: 4 0
5: 5 0
6: 6 0
7: 7 0
8: 8 1
9: 9 0
10: 10 0
Now, if we wanted to count how many patients have a diagnosis of 304, we can do:
DF[diag == "304", uniqueN(id)]
[1] 2
If you wish to map codes to new codes (which might be more confusing than anything else), I'd recommend doing so by making a separate table describing the mapping and merging on it to assign the new codes. I think data.table is also handy for that. The intro materials for the package are a good place to start with it.
A brief example for merging:
m = fread("
diag new_code
291 1
292 1
304 2
305 3
989 3", colClasses=c(diag = "character"))
DF[m, on="diag", new_code := i.new_code ]
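To get the "count each individual once, using whichever matching diagnosis came first" behaviour from the question, here is one sketch building on the update join above (it relies on the rows within each id still being in dx1, dx2, dx3 order, which order(id)'s stable sort preserved):
DF[!is.na(new_code), .(first_code = new_code[1]), by = id]
# with the example data this keeps one row each for ids 4, 7, 8 and 9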

How do you sample groups with different sample sizes with data.table

I am trying to use data.table to speed some calculations on a relatively large dataset. The example below replicates the situation:
DT = data.table(a=sample(1:2), b=sample(1:1000,20))
> DT
a b
1: 2 440
2: 1 5
3: 2 795
4: 1 138
5: 2 941
6: 1 929
7: 2 759
8: 1 192
9: 2 994
10: 1 176
11: 2 152
12: 1 893
13: 2 28
14: 1 884
15: 2 467
16: 1 761
17: 2 879
18: 1 964
19: 2 802
20: 1 271
I want to sample different numbers of replicates from groups a==1 and a==2, e.g., n1=5 and n2=3 replicates without replacement, and obtain something like
a b
1: 2 440
2: 2 879
3: 2 994
4: 2 152
5: 2 879
6: 1 884
7: 1 964
8: 1 929
But I cannot seem to get around it with data.table, i.e., I cannot insert the different sample sizes into a data.table command. Is there any way to do it? I'm new to data.table and R, so any constructive guidance would be greatly appreciated.
One option would be to split the 'b' column by 'a', pass the 'size' as a vector in Map and get the sample of 'b' using the corresponding 'size'. The output is a list, which can be converted to a 'data.frame' with 2 columns using stack.
set.seed(24)
stack(Map(sample, split(DT$b, DT$a), size=c(5,3),MoreArgs=list(replace=FALSE)))
# values ind
#1 279 1
#2 93 1
#3 665 1
#4 797 1
#5 317 1
#6 542 2
#7 761 2
#8 893 2
Or using data.table methods, we melt the list output we got with Map.
set.seed(24)
DT[, melt(Map(sample, split(b, a), size=c(5,3), MoreArgs=list(replace=FALSE)))]
# value L1
#1 279 1
#2 93 1
#3 665 1
#4 797 1
#5 317 1
#6 542 2
#7 761 2
#8 893 2
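Staying entirely inside data.table is also possible; here is a sketch that looks each group's size up from a named vector via .BY (the sizes vector is my own assumption about how n1 and n2 might be supplied, and min(.N, ...) guards against groups smaller than the requested size):
sizes <- c("1" = 5, "2" = 3)
DT[, .SD[sample(.N, min(.N, sizes[as.character(.BY$a)]))], by = a]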

Use previous calculated row value in r

I have a data.table that looks like this:
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to calculate a new column "D" whose first value is the average of the first 20 rows of column B; after that, each row of column D should use the previous row of D to calculate the next value.
Say the average of the first 20 rows of column B is 105, and the formula for the next row in column D is DT$D[1] + DT$C[2]:
I take the previous row's value of D and add the current row's value of C.
The third row would then be DT$D[2] + DT$C[3], and so on:
A B C D
1: 1 10 100 105
2: 2 20 200 305
3: 3 30 300 605
4: 4 40 400 1005
5: 5 50 500 1505
...
20: 20 200 2000 21005
Any ideas on how this could be done?
I think shift would be a great help for the lag, but I don't know how to get rid of the NA it produces in the first row.
We can take the mean of the first 20 rows of column B and add the cumulative sum of C. The cumulative sum needs one special consideration: we concatenate a 0 in place of C's first value (cumsum(c(0, C[-1]))), so the first row contributes only the mean.
DT[, D := mean(B[1:20]) + cumsum(c(0, C[-1]))][]
# A B C D
# 1: 1 10 100 105
# 2: 2 20 200 305
# 3: 3 30 300 605
# 4: 4 40 400 1005
# 5: 5 50 500 1505
# 6: 6 60 600 2105
# 7: 7 70 700 2805
# 8: 8 80 800 3605
# 9: 9 90 900 4505
# 10: 10 100 1000 5505
# 11: 11 110 1100 6605
# 12: 12 120 1200 7805
# 13: 13 130 1300 9105
# 14: 14 140 1400 10505
# 15: 15 150 1500 12005
# 16: 16 160 1600 13605
# 17: 17 170 1700 15305
# 18: 18 180 1800 17105
# 19: 19 190 1900 19005
# 20: 20 200 2000 21005
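For intuition, the same recursion can be written explicitly with Reduce(), seeding the accumulator with the mean and then adding each subsequent value of C. This is slower than cumsum() but makes the previous-row dependence visible, and it sidesteps the shift()/NA issue raised in the question since no lagged column is needed:
DT[, D2 := Reduce(`+`, C[-1], init = mean(B[1:20]), accumulate = TRUE)]
identical(DT$D, DT$D2)
# [1] TRUE (exact here, since everything is a whole number)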
