R optimize loop with Tidyverse

I am working with a large data set and have written working code using for loops. Because of the size of the data I want to make the code more efficient, and I think there must be a way to do this with the tidyverse.
Briefly, I have a data frame with object IDs (seed IDs) and their x and y coordinates.
object <- c('A','B','C')
x <- c(147, 146, 143)
y <- c(17, 80, 155)
df_Seeds <- data.frame(object, x, y)
df_Seeds$object <- as.character(df_Seeds$object)
I also have an array with another set of objects (radicles) and their x and y coordinates.
x1 <- c(180, 146, 143, 17, 17, 155, 30, 30, 30)
array_Radicles <- array(x1,dim = c(3,3))
The following code outputs an array with the index of any radicle objects within a certain distance of each seed and another array with the seed object ID. Lastly, I cbind the arrays.
seedID_Array <-array(dim=(0:1)) #blank array for seedID
radicleIndex_Array <-array(dim=(0:1)) #blank array for radicle index
for(i in 1:dim(df_Seeds)[1]) { #loops through each seed object
indexRadicles <- which(abs(array_Radicles[,1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[,2]- df_Seeds[i,3]) <= 25) #generates vector index of any radicle within distance of seed
if (length(indexRadicles) > 0) { #some seed objects will not have an associated radicle
for (j in 1:length(indexRadicles)) { #loops through each radicle index
singleIndexRadicles <- indexRadicles[j]
seedID_Array <- rbind(seedID_Array, df_Seeds[i,1]) #adds seed object ID to array
radicleIndex_Array <- rbind(radicleIndex_Array, singleIndexRadicles) #adds radicle index to array
}
}
}
combinedArray <- cbind(seedID_Array, radicleIndex_Array)
I appreciate any suggestions or direction to another similar problem that has been solved.

Well, first of all, I highly recommend you take a look at this resource and this other resource. They're both good references on how to avoid some simple mistakes and get your R code going faster.
For your specific problem, I'd say the biggest performance bottleneck is your use of rbind and cbind inside the loop. Those functions create a copy of the original object and then fill it in with the second argument, so the cost grows with every iteration. This is not very efficient.
Also, in your inner loop you are essentially adding the indices in indexRadicles one at a time and repeatedly rbinding df_Seeds[i,1].
To solve this, a possible solution would be to use a list, indexing by Seed ID. For example:
output <- list()
for(i in 1:dim(df_Seeds)[1]) { #loops through each seed object
indexRadicles <- which(abs(array_Radicles[,1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[,2]- df_Seeds[i,3]) <= 25) #generates vector index of any radicle within distance of seed
if (length(indexRadicles) > 0) { #some seed objects will not have an associated radicle
output[df_Seeds$object[i]] <- list(indexRadicles)
}
}
seeds_that_had_index_radicles <- names(output)
all_index_radicles <- unlist(output)
Note that we did not use any tidyverse functions here. I believe it is wrong to assume that tidyverse solutions are always faster or more efficient. I personally think they help you understand some operations better, or at least visualize them better, but you can usually do the same things with the same performance using base R.
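For example, here is a minimal base R sketch that avoids the explicit loops altogether (assuming the same df_Seeds and array_Radicles objects and the 50/25 thresholds from your question):
# distance checks for every seed/radicle pair at once
dx <- abs(outer(df_Seeds$x, array_Radicles[, 1], "-")) <= 50
dy <- abs(outer(df_Seeds$y, array_Radicles[, 2], "-")) <= 25
hits <- which(dx & dy, arr.ind = TRUE)  # row = seed index, col = radicle index
combinedArray <- data.frame(seedID = df_Seeds$object[hits[, "row"]],
                            radicleIndex = hits[, "col"])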
Bonus: On a side note, you can always use profvis to help you find the performance bottlenecks in your code. It will show you which lines take the longest and which lines are called the most. Highly recommend taking a look at it: https://rstudio.github.io/profvis/
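A minimal sketch of how you might profile the original loop with it (assuming profvis is installed; you simply wrap the code you want to inspect):
library(profvis)
profvis({
  # paste the original for loop here to get per-line timings
  for (i in 1:nrow(df_Seeds)) {
    indexRadicles <- which(abs(array_Radicles[, 1] - df_Seeds$x[i]) <= 50 &
                           abs(array_Radicles[, 2] - df_Seeds$y[i]) <= 25)
  }
})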

Your question is an example of a non-equi join, where your distance requirement constrains the matches between two tables. dplyr does not currently allow non-equi joins, but in many cases my data is small enough (eg cartesian product table still fits in memory) for a brute force method to work fine and be fast enough.
Option 1: cartesian product then filter
Here I join every radicle to every seedID (this could be untenable if your data is big enough) and then filter out the ones I don't need.
library(tidyverse)
df_Radicles <- tibble(x = array_Radicles[,1],
y = array_Radicles[,2],
misc = array_Radicles[,3],
rad_idx = 1:length(array_Radicles[,1]))
# brute force non-equi join: join all then filter
crossing(object = df_Seeds$object, df_Radicles) %>%
left_join(df_Seeds, by = "object") %>%
filter(abs(x.x - x.y) <= 50, abs(y.x - y.y) <= 25) %>%
select(object, rad_idx)
# A tibble: 3 x 2
object rad_idx
<chr> <int>
1 A 2
2 A 1
3 C 3
Option 2: fuzzyjoin
The fuzzyjoin package allows non-equi joins and has built-in methods for distance joins. In this case you're using a Manhattan distance metric, but since your y tolerance (25) is half your x tolerance (50), I scale y by 2 here so that it can be evaluated on the same +/- 50 scale as your x distance. There's also a geo_join option if you're dealing with lat/lon coordinates.
library(fuzzyjoin)
df_Seeds %>%
mutate(y = y * 2) %>% # to use manhattan distance with x + y on same scale
distance_inner_join(
df_Radicles %>% mutate(y = y*2),
by = c("x", "y"),
method = "manhattan",
max_dist = 50) %>%
select(object, rad_idx)
object rad_idx
1 A 1
2 A 2
3 C 3
If these approaches aren't performant on your data, I'd recommend using data.table, which is phenomenally fast for this sort of thing.
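For reference, here is a rough sketch of how that non-equi join might look in data.table (my own assumption about the setup, not benchmarked): build lower/upper bounds on the seed side and join on them.
library(data.table)
seeds <- as.data.table(df_Seeds)
seeds[, `:=`(x_lo = x - 50, x_hi = x + 50, y_lo = y - 25, y_hi = y + 25)]
radicles <- data.table(rad_idx = seq_len(nrow(array_Radicles)),
                       rx = array_Radicles[, 1],
                       ry = array_Radicles[, 2])
# non-equi join: radicles falling inside each seed's x/y window
radicles[seeds,
         .(object = i.object, rad_idx = rad_idx),
         on = .(rx >= x_lo, rx <= x_hi, ry >= y_lo, ry <= y_hi),
         nomatch = 0L]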

Related

In R, is it possible to use a pair, tuple or equivalent in a matrix?

I am trying to create a matrix of coordinates (indexes) that I randomly pick one from using the sample function. I then use these to select a cell in another matrix. What is the best way to do this? The trouble is how to store these integers in the matrix so that they are easy to separate. Right now I have them stored as strings with a comma, which I then split. Someone suggested I use a pair, or a string, but I cannot seem to get these to work with a matrix. Thanks!
EDIT: What I currently have looks like this (changed a little to make sense out of context):
probs <- matrix(c(0,0,0.6,0,0,
0,0.7,1,0.7,0,
0.6,1,0,1,0.6,
0,0.7,1,0.7,0,
0,0,0.6,0,0),5,5)
cordsMat <- matrix("",5,5)
for (x in 1:5){
for (y in 1:5){
cordsMat[x,y] = paste(x,y,sep=",")
}
}
cords <- sample(cordsMat,1,,probs)
cordsVec <- unlist(strsplit(cords,split = ","))
cordX <- as.numeric(cordsVec[1])
cordY <- as.numeric(cordsVec[2])
otherMat[cordX,cordY]
It sort of works, but I would also be interested in a better way, as this will get repeated a lot.
If you want to set the probabilities, it can easily be done by providing them to sample:
# creating the matrix
matrix(sample(rep(1:6, 15:20), 25), 5) -> other.mat
# set the probs vec
probs <- c(0,0,0.6,0,0,
0,0.7,1,0.7,0,
0.6,1,0,1,0.6,
0,0.7,1,0.7,0,
0,0,0.6,0,0)
# the coordinates matrix
mat <- as.matrix(expand.grid(1:nrow(other.mat),1:ncol(other.mat)))
# sampling a row of the coordinates matrix randomly (sample over row numbers so prob matches)
sample(nrow(mat), 1, prob=probs) -> rand
# getting the value
other.mat[mat[rand,1], mat[rand,2]]
[1] 6
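A small alternative sketch of the same idea (not the method above, just another way to do it): sample a single linear index with the probabilities and convert it to row/column with arrayInd.
# linear index into the 5x5 grid, weighted by probs
idx <- sample(length(probs), 1, prob = probs)
xy <- arrayInd(idx, dim(other.mat))  # convert to (row, col)
other.mat[xy[1], xy[2]]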

R unique combinations from given ranges quickly and using less system resource

This is a follow-up question to this one:
https://stackoverflow.com/a/55912086/3988575
I have a dataset like this:
ID=as.character(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
IQ=c(120.5,128.1,126.5,122.5,127.1,129.7,124.2,123.7,121.7,122.3,120.9,122.4,125.7,126.4,128.2,129.1,121.2,128.4,127.6,125.1)
Section=c("A","A","B","B","A","B","B","A","B","A","B","B","A","A","B","B","A","B","B","A")
zz=data.frame(ID,IQ,Section)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
What I would like to do is to match people by the range of their IQ (which was the previous question).
Now, I want to create multiple levels of the ranges. For example, one level can be 10 IQ classes: 120-121, 121-122, 122-123, ..., 129-130. Another example is a single IQ class: 120-130. All the possible combinations of the above can be obtained by:
IQ_Class=c(120,121,122,123,124,125,126,127,128,129,130)
n = length(IQ_Class)-2
all_combin=expand.grid(replicate(n, 0:1, simplify = FALSE))
all_combin$First=1
all_combin$Last=1
all_combin_new=all_combin[c("First",names(all_combin)[1:(length(names(all_combin))-2)],"Last")] #Reorder columns
all_combin_new = t((apply(all_combin_new,1,function(x)(x*IQ_Class)))) #Multiply by IQ classes
all_combin_new = apply(all_combin_new, 1, function(x) { x[x!=0] })
Note that the final object all_combin_new is a list of all the possible sets of class breakpoints (512 combinations in total).
Now what I want to do is to take one class (one element from all_combin_new) and create all the combinations of ID's in that particular IQ class by their section. Save this dataset and take the next class from all_combin_new and repeat the operation.
From the previous answer, I was able to modify the code to consider the combinations by Section by changing the following in the previous question:
zz1=vector("list",length(all_combin_new)) #pre-allocate the output list
for (i in 1:length(all_combin_new)){ #changed this line to run for all combinations in all_combin_new
zz2=all_combin_new[[i]]
zz11=zz_new%>%
mutate(ID=as.character(ID),vec=as.character(cut(IQ,zz2,right=F)))%>%
group_by(vec,Section)%>% #Changed this line
summarize(if(n()>1)list(data.frame(t(combn(ID,2)),stringsAsFactors = F))
else list(data.frame(X1=ID,X2=ID,stringsAsFactors = F)))%>%
unnest()%>%
bind_cols(read.csv(text=gsub("[^0-9,]","",.$vec),h=F))
zz1[[i]]=as.data.frame(zz11)
}
My actual dataset has about 10K observations (compared to zz_new here) with 20 Sections, leading to 2^18 = 262144 IQ ranges (compared to the length of the all_combin_new list here, 512). This causes two main issues:
a) Time: The speed is extremely slow. Is there a way to increase the speed?
b) Size of objects created: In my tests, even with far fewer combinations than that, the lists grow too big and the code fails. What alternative approaches could I use here? Note that in the list of lists that I obtain here, I also need to do further computations.
Any help will be appreciated. Thanks in advance.
P.S.Please let me know if any part is unclear or any part of the code has some inadvertent errors.
Edit: Now with loop to go through all IQ combos and to include Section as a key on join.
I used the sample data in the linked question. Instead of making a list and looping, this does everything at once.
Note there is a cartesian product, so it may still run into memory issues. If you're having trouble, you can always try data.table, since it supports non-equi joins.
library(tidyverse)
zz <- tibble(ID=1:12
,IQ=c(120.5,123,125,122.5,122.1,121.7,123.2,123.7,120.7,122.3,120.1,122)
,Section=c("A","A","B","B","A","B","B","A","B","A","B","B")
)
IQ_Class <- c(120,122,124,126)
IQ_Classes <- data.frame(First = 1
,expand.grid(replicate(length(IQ_Class)-2, 0:1, simplify = FALSE))
,Last = 1)
IQ_Classes <- IQ_Classes * IQ_Class[col(IQ_Classes)]
IQ_Classes_List <- apply(IQ_Classes, 1, function(x) { x[x!=0] })
all_combos <- lapply(IQ_Classes_List
, function(IQs)
{
z_cut <- zz%>%
mutate(cut_range = cut(IQ, IQ_Class, right = F, labels = F))
inner_join(z_cut
, z_cut %>%
select(V2 = ID, cut_range, Section)
, by = c('cut_range', 'Section'))%>%
filter(V2 > ID) %>%
mutate(Previous_IQ_class = IQs[cut_range],
Next_Class = IQs[cut_range+1])
}
)%>%
bind_rows(.id = 'IQ_List')

Block bootstrap for genomic data

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR POS var_A var_B
1 192 0.9 0.7
1 2000 0.8 0.3
2 3 0.21 0.76
2 30009 0.36 0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that the rows are not entirely independent of one another; the closer they are, the more dependent they are, and so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
df=freq[0,]
for (i in unique(freq$CHR)) {
t = freq[freq$CHR==i,]
fim = t[nrow(t),2]
i = t[1,2]
f = i + x
while(f < fim) {
rows = which(t$POS>=i & t$POS<f)
s = sample(rows)
df = rbind(df,t[s,])
i = f
f = f + x
}
}
cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient approach to try would be the 'boot' package, whose functions include parallel processing capabilities.
In particular, the 'tsboot', or time series boot function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below (data is your data.frame with CHR, POS, var_A and var_B). The sim = "fixed" and l = 1000 arguments indicate that you want fixed blocks of length (l) 1000. However, you could use blocks of any size, say 5 or 10, if you're trying to capture nearest neighbor dynamics moving over time. The parallel = "multicore" argument speaks for itself, but it may need to be "snow" if you are using Windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
I hope I understood you right:
# needed for round_any()
library(plyr)
res <- lapply(unique(freq$CHR),function(x){
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
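For example, to turn one result into a bootstrap correlation estimate (a small usage sketch, assuming the var_A/var_B columns from the question):
boot_df <- do.call(rbind, res)      # one sampled row per block, all chromosomes
cor(boot_df$var_A, boot_df$var_B)   # the statistic for this bootstrap replicate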
EDIT:
library(doParallel)
library(foreach)
library(plyr)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(x=unique(freq$CHR),.packages = 'plyr') %dopar% {
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach over the chromosomes. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations, or maybe the blocks). In any case, I can only stress again what I said in my comment: before you work on parallelising your code, you should be sure that it's as efficient as possible. That means you might want to look into the boot package or similar to gain efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
l = 1000
teste = freq %>%
mutate(w = ceiling(POS/l)) %>%
group_by(CHR, w) %>%
sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and apply whatever statistic of interest that you want.
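A minimal sketch of repeating this to build the bootstrap distribution of the correlation (assuming your data frame is called freq, as in the question):
library(dplyr)
l <- 1000
iter <- 1000
cors <- replicate(iter, {
  freq %>%
    mutate(w = ceiling(POS / l)) %>%
    group_by(CHR, w) %>%
    sample_n(1) %>%
    ungroup() %>%
    summarise(r = cor(var_A, var_B)) %>%
    pull(r)
})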

FAST way to sum up neighbors' attributes in a large graph in R

I have a large igraph object with almost 1M nodes and 1.5M edges. After researching for a while I could not find a procedure to sum a node's neighbors' attributes (in this case a binary one). At the moment, the best solution I found was the following:
V(g)$sum = sapply( ego(g, 1, V(g), mode = 'all', mindist = 1), function(v) sum(V(g)[v]$attr) )
However, after 12 hours it's still crunching.
Any suggestions?
UPDATE 1: Let's consider the following graph
library(igraph)
G <- graph.formula(1-+2,1-+3,2-+4,2-+5,3-+6,5-+7,7-+8,8-+9,9+-7, 9-+10,
6-+9,1-+5,3-+9,10-+11,11-+12,11-+5,12-+4,4-+10,10-+4,11-+10)
V(G)$attr = c(1,1,0,0,1,0,1,0,1,0,1,0)
plot(G, vertex.label.color = "white", edge.width=E(G)$weight, layout = layout.circle(G))
and the desired outcome should be this...
sapply( ego(G,1,V(G),mode = 'all',mindist = 1), function(v) sum(V(G)[v]$attr) )
[1] 2 2 2 1 4 1 2 2 1 2 1 1
@Tamás, I tried to use the neighbors function without a loop, but instead of the outcome described above I got this...
sapply(neighbors(G,V(G)),function (v) sum(V(G)[v]$attr))
2 3 5
1 0 1
I am also working with large networks and I'm having some problems with the time it takes igraph to do "simple" stuff, like calculating betweenness and closeness. In your case, however, I think you can work around this issue outside the network framework.
First, convert your network into a data.frame and use the data.table library, which is really fast for working with large data sets, to calculate the sum of the attributes.
library(igraph)
library(magrittr)
library(data.table)
# simple network
g<- graph.formula(1-+2,1-+3,2-+4,2-+5,3-+6,5-+7,7-+8,8-+9,9+-7, 9-+10,
6-+9,1-+5,3-+9,10-+11,11-+12,11-+5,12-+4,4-+10,10-+4,11-+10)
V(g)$attr = c(1,1,0,0,1,0,1,0,1,0,1,0)
# convert the network to data.table
dt <- as_long_data_frame(g) %>% setDT()
# Calculate the sum of neighbors' attributes by origin (from). This is really fast in data.table
mysum <- dt[, .(attr_sum = sum(to_attr)), by= from]
# get the sum result back in the data doing a simple merge
dt <- dt[mysum, on=.(from)]
# get the sum into the network object
E(g)$attr_sum <- dt$attr_sum
The bottleneck is almost surely the ego() function. Try using neighbors() instead; it is specialized to get the first-order neighbors only so it is faster - and you don't need to construct V(g) in every iteration either.
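A minimal sketch of that suggestion (my own wording, using the toy graph G and attribute attr from the update; unique() guards against double counting neighbors connected in both directions):
V(G)$sum <- sapply(seq_len(vcount(G)), function(v)
  sum(V(G)$attr[unique(as.numeric(neighbors(G, v, mode = "all")))]))
V(G)$sum
# [1] 2 2 2 1 4 1 2 2 1 2 1 1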
As noted by @Tamás, the bottleneck lies in the ego function (neighbors would create a similar bottleneck). For adjacent nodes (i.e., neighbors of order 1), this bottleneck can be avoided by pulling the adjacency matrix using get.adjacency and then multiplying the matrix by the attribute vector using %*%:
library(igraph)
set.seed(42)
g <- erdos.renyi.game(1000000, 1500000, type = "gnm")
V(g)$att <- as.logical(rbinom(vcount(g), 1, 0.5))
system.time({
ma <- get.adjacency(g)
att <- V(g)$att
res <- as.numeric(ma %*% att)
})
# user system elapsed
# 0.642 0.138 0.786
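To put the result back on the graph and spot-check it against the ego() version on a few vertices (a small follow-up sketch using the same objects):
V(g)$sum <- res
head(res)
head(sapply(ego(g, 1, V(g)[1:6], mode = "all", mindist = 1),
            function(v) sum(V(g)[v]$att)))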

R data.table efficient replication by group

I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.
Here is some sample data:
ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")
What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.
The following code works and gives me the answer I want:
good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))
good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]
The issue is that when I scale this up to 40mm+ rows, I get into memory issues for both possible solutions.
If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.
Does anyone have a better solution?
I looked at set() for data.table but was running into issues because I wanted to keep the tokens in the same order for each replication.
One way is:
require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]
Everything except the last line should be straightforward. The last line subsets by the key column with the help of J(.). For each value in J(.), the corresponding value is matched against the key column and the matched subset is returned.
That is, if you do dt[J(1)] you'll get the subset where multiple = 1. And if you look carefully, doing dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between passing dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter subsets by matching (1 and 1) against the first key column of the data.table.
So, if we pass the same value of the key column 2 times in J(.), the matching row gets duplicated twice. We use this trick to pass 1 one time, 2 two times, etc., and that's what the rep(.) part does: rep(.) gives 1,2,2,3,3,3,4,4,4,4.
And if the join results in more rows than max(nrow(dt), nrow(i)) (i is the rep vector inside J(.)), you have to explicitly use allow.cartesian = TRUE to perform the join (I believe this is a new feature from data.table 1.8.8).
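A quick toy illustration of that J(.) duplication trick (the table and values here are just for demonstration):
library(data.table)
toy <- data.table(multiple = c(1, 2, 3), token = c("a", "b", "c"), key = "multiple")
toy[J(2)]                                 # the multiple == 2 row, once
toy[J(rep(2, 3))]                         # the same row, three times
toy[J(rep(toy$multiple, toy$multiple))]   # each row repeated 'multiple' times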
Edit: Here's some benchmarking I did on "relatively" big data. I don't see any spike in memory allocation in either method, but I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it slips my mind at the moment. I'll write back again. For now, here's the test data and some preliminary results in case anyone is interested/wants to run it for themselves.
# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)
DF <- data.frame(yr = rep(yr, sz),
token = token,
multiple = multiple, stringsAsFactors=FALSE)
# Arun's solution
ARUN.DT <- function(dt) {
setkey(dt, "multiple")
idx <- unique(dt$multiple)
dt[J(rep(idx,idx)), allow.cartesian=TRUE]
}
# Ricardo's solution
RICARDO.DT <- function(dt) {
setkey(dt, key="yr")
newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
newDT
}
# create data.table
require(data.table)
DT <- data.table(DF)
# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")
# test replications elapsed relative user.self sys.self
# 1 res1 <- ARUN.DT(DT) 10 9.542 1.000 7.218 1.394
# 2 res2 <- RICARDO.DT(DT) 10 17.484 1.832 14.270 2.888
But as Ricardo says, it may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used in both methods here to say definitively if using Join is better.
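One rough base-R way to approximate that (a sketch, not a precise profiler): reset the gc statistics before the call and read the "max used" column afterwards.
gc(reset = TRUE)          # reset the "max used" counters
invisible(ARUN.DT(DT))    # run the function under test
gc()                      # "max used" now shows the peak since the reset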
You can try allocating the memory for all the rows first, and then populating them iteratively. For example:
# make sure `sample_data$multiple` is an integer
sample_data$multiple <- as.integer(sample_data$multiple)
# create data.table
S <- data.table(sample_data, key='yr')
## Allocate the memory first
newDT <- data.table(yr = rep(sample_data$yr, sample_data$multiple), key="yr")
# optionally, drop the original data.frame once it is no longer needed
rm(sample_data)
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := S[.(y)][, token], by=list(y=yr)]
Two notes:
(1) sample_data$multiple is currently a character and is therefore getting coerced when passed to rep (in your original example). It might be worth double-checking whether that is also the case in your real data.
(2) I used the following to determine the number of rows needed per year
S[, list(rows=length(token) * unique(multiple)), by=yr]
