I am trying to run a for loop to extract multiple substrings, in order, from a FASTA sequence.
Here is an example (of course, the real sequence is more than 10 thousand bases long):
eg <- ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG
And here is my code
forsubseq <- function(dna){
sta <- for (i in 1:floor(width(dna)/100)) {
seqGC <- Biostrings::subseq(dna, start = 100*i - 99, width = 100) %>%
Biostrings::letterFrequency(letters = "GC", as.prob = TRUE)
}
return(sta)
}
forsubseq(eg)
However, nothing is returned when I run it, which really confuses me. What I want is the GC content of each 100 bp window.
Could anyone kindly offer advice? Thanks.
I don't have Biostrings installed (it is distributed through Bioconductor rather than CRAN), but one simplified approach is to split eg at every nth character and then use lapply to analyze each piece. In this example I counted the occurrences of the dinucleotide "GC" with stringr::str_count; you can swap in Biostrings::letterFrequency, which counts the letters G and C instead:
eg <- "ACGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
n <- 10 # you would change to 100
blocks <- seq(1, nchar(eg), n) # prep to separate every n base pairs
splits <- substring(eg, blocks, blocks + n - 1) # separate every n base pairs
lapply(splits,
function(x) stringr::str_count(x, "GC")) # replace with Biostrings::letterFrequency
The output is a list with the number of "GC" pairs in each block of n characters (here, 10). If you want a vector of integers instead, wrap the call as unlist(lapply(...)).
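As for why the original function appeared to do nothing: in R a for loop always evaluates to NULL (invisibly), so sta <- for (...) stores NULL regardless of what is computed inside the loop. Below is a minimal base R sketch of the per-window GC fraction, assuming the sequence is held as a plain character string. The helper name gc_by_window is hypothetical, and it counts the letters G and C individually rather than the dinucleotide "GC".

```r
# Hypothetical helper: GC fraction for each complete window of `width` bases.
# Assumes `seq_string` is a single plain character string, not a DNAString.
gc_by_window <- function(seq_string, width = 100) {
  n_windows <- floor(nchar(seq_string) / width)
  if (n_windows == 0) return(numeric(0))
  starts <- (seq_len(n_windows) - 1) * width + 1
  windows <- substring(seq_string, starts, starts + width - 1)
  # Collect one value per window instead of relying on the loop's value
  vapply(windows, function(w) {
    bases <- strsplit(w, "", fixed = TRUE)[[1]]
    mean(bases %in% c("G", "C"))
  }, numeric(1), USE.NAMES = FALSE)
}

eg <- "ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
gc_by_window(eg, width = 10)
# [1] 0.6 0.5 0.4 0.6
```

The trailing 5 bases are dropped here because they do not fill a complete window, which mirrors the floor() in the original loop.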
I am trying to generate the term frequency matrix of a document and subsequently look up the frequency of a certain word in a given query in that matrix. In the end I want to sum the frequencies found of the words in the query.
However, I am getting the error message: Error in feature[i] <- x : replacement has length zero
I do not have a lot of coding experience in general, and this is my first time working with R, thus I am having difficulties solving this error. I presume it has something to do with a null-value. I already tried to avoid the nested for-loop with an apply function because I thought that might help (not sure though), but I could not quite get the hang of how to convert the for-loop into an apply function.
termfreqname <- function(queries,docs){
n <- length(queries)
feature <- vector(length=n)
for(i in 1:n){
query <- queries[i]
documentcorpus <- c(docs[i])
tdm <- TermDocumentMatrix(tm_corpus) #creates the term frequency matrix per document
m <- sapply(strsplit(query, " "), length) #length of the query in words
totalfreq <- list(0) #initialize list
freq_counter <- rowSums(as.matrix(tdm)) #counts the occurrence of a given word in the tdm matrix
for(j in 1:m){
freq <- freq_counter[word(query,j)] #finds frequency of each word in the given query, in the term frequency matrix
totalfreq[[j]] <- freq #adds this frequency to position j in the list
}
x <- reduce(totalfreq,'+') #sums all the numbers in the list
feature[i] <- x #adds this number to feature list
feature
}
}
It depends on your needs, but the bottom line is that you need to add an if statement. How you use it depends on whether you want the default value of the vector to persist. In your code, feature starts as a logical vector (all FALSE), but it is coerced to integer or numeric once you overwrite its first value with a number. After that, the default value in all untouched positions of the vector will be 0 (or 0L, if integer). That's going to influence your decision on how to use the if statement.
if (length(x)) feature[i] <- x
This will only attempt to overwrite the ith value of feature if the x object has nonzero length (it is equivalent to if (length(x) > 0)). Since the default value in the vector will be zero, when you are done you will not be able to distinguish between an element known to be 0 and an element for which nothing was found.
The alternative (and my preference/recommendation):
feature[i] <- if (length(x)) x else NA
In this case, when you are done, you can clearly distinguish between known-zero (0) and uncertain/unknown values (NA). When doing math operations on that vector, you might want/need na.rm=TRUE ... but it all depends on your use.
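A minimal sketch of that guard in base R. The results list is a hypothetical stand-in for the per-query sums, with NULL playing the role of a lookup that found nothing, and feature is pre-filled with NA rather than created with vector(length = n).

```r
# Hypothetical stand-ins for the per-query sums; NULL mimics a failed lookup
results <- list(3, NULL, 0)

feature <- rep(NA_real_, length(results))
for (i in seq_along(results)) {
  x <- results[[i]]
  # Assign the value when there is one, otherwise mark the slot as unknown
  feature[i] <- if (length(x)) x else NA
}
feature
# [1]  3 NA  0
```

The known zero in the third position stays distinguishable from the missing second value.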
BTW, as MartinGal noted, your use of reduce(totalfreq, '+') is a little flawed: the quoted string '+' may not be (is not?) recognized as a function. The first fix is to put backticks around the function instead of quotes:
totalfreq <- 5:7
reduce(totalfreq, '+')
# NULL
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# [1] 18
The last of these is by far the preferred method. Why? With a vector of length 4, for instance, reduce takes the first two elements and adds them, then adds that result to the third, then adds that result to the fourth: three operations. With 100 elements it makes 99 individual additions. sum makes a single vectorized call, and this does affect performance as the input grows.
However, if totalfreq is instead a list, then this changes slightly:
totalfreq <- as.list(5:7)
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# Error in sum(totalfreq) : invalid 'type' (list) of argument
# x
sum(unlist(totalfreq))
# [1] 18
The reduce code still works, and the sum by itself fails, but we can unlist the list first, effectively creating a vector, and then call sum on that. Much much faster asymptotically. And perhaps clearer, more declarative.
(I'm assuming purrr::reduce, btw ...)
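Taking the sum(...) idea one step further, the inner loop over the words of a query can be replaced by a single vectorized lookup. This is only a sketch under assumptions: freq_counter below is a hypothetical named vector standing in for rowSums(as.matrix(tdm)), and base R strsplit replaces word().

```r
# Hypothetical term frequencies, named by term (stand-in for rowSums(as.matrix(tdm)))
freq_counter <- c(apple = 2, pear = 5)

query <- "apple pear kiwi"
words <- strsplit(query, " ", fixed = TRUE)[[1]]

# Indexing by name returns NA for terms that are absent ("kiwi" here),
# so na.rm = TRUE treats them as contributing nothing to the total
total <- sum(freq_counter[words], na.rm = TRUE)
total
# [1] 7
```

Whether a missing term should count as 0 or make the whole total NA is a design choice; drop na.rm = TRUE if the latter is preferred.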
I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR POS var_A var_B
1 192 0.9 0.7
1 2000 0.8 0.3
2 3 0.21 0.76
2 30009 0.36 0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that each row is not entirely independent to one another, depending on the distance between them (the closer the more dependent), and so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
df=freq[0,]
for (i in unique(freq$CHR)) {
t = freq[freq$CHR==i,]
fim = t[nrow(t),2]
i = t[1,2]
f = i + x
while(f < fim) {
rows = which(t$POS>=i & t$POS<f)
s = sample(rows)
df = rbind(df,t[s,])
i = f
f = f + x
}
}
cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient way to try would be to use the 'boot' package, of which functions include parallel processing capabilities.
In particular, the 'tsboot', or time series boot function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments indicate that you want fixed blocks of length (l) 1000. However, you could use blocks of any size, such as 5 or 10, if you're trying to capture nearest-neighbor dynamics moving over time. The parallel = "multicore" argument speaks for itself, but it may need to be "snow" if you are using Windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
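To see what the index argument does without running a full bootstrap, the wrapper can be called by hand. This is a rough illustration of the (data, index) convention that boot::boot uses for its statistic; the toy data frame is hypothetical and only borrows the question's column names.

```r
set.seed(1)
# Hypothetical toy data with the question's column names
toy <- data.frame(var_A = rnorm(50), var_B = rnorm(50))

cor_hat <- function(data, index) cor(x = data[index, ]$var_A, y = data[index, ]$var_B)

# Passing every row once reproduces the ordinary correlation
full <- cor_hat(toy, seq_len(nrow(toy)))

# Passing a resampled index gives one bootstrap replicate of the statistic
resampled <- cor_hat(toy, sample(nrow(toy), replace = TRUE))
```

A bootstrap routine simply repeats the second call many times with indices it generates itself.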
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
I hope I understood you right:
# needed for round_any()
library(plyr)
res <- lapply(unique(freq$CHR),function(x){
freq_sel <- freq[freq$CHR==x,]
# use >= so that positions exactly on a block start (1, 1001, ...) are not dropped
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS >= ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
EDIT:
library(doParallel)
library(foreach)
library(plyr)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(x=unique(freq$CHR),.packages = 'plyr') %dopar% {
freq_sel <- freq[freq$CHR==x,]
# use >= so that positions exactly on a block start (1, 1001, ...) are not dropped
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS >= ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach on each Chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: Before you work on parallelising your code, you should be sure that it's as efficient as possible. Meaning you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
library(dplyr)
l = 1000
teste = freq %>%
mutate(w = ceiling(POS/l)) %>%
group_by(CHR, w) %>%
sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and apply whatever statistic of interest that you want.
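The repetition step might look like the following base R sketch, which assigns rows to windows once and then draws one row per window on each replicate. The toy freq data frame and the helper one_replicate are hypothetical, and base R is used here instead of dplyr so the block has no package dependencies.

```r
set.seed(42)
# Hypothetical toy data with the question's column names
freq <- data.frame(
  CHR = rep(1:2, each = 200),
  POS = c(sort(sample(50000, 200)), sort(sample(50000, 200))),
  var_A = rnorm(400)
)
freq$var_B <- freq$var_A + rnorm(400)

l <- 1000
# One group per chromosome/window combination; empty combinations are dropped
window_id <- interaction(freq$CHR, ceiling(freq$POS / l), drop = TRUE)
rows_by_window <- split(seq_len(nrow(freq)), window_id)

one_replicate <- function() {
  # Pick one row index per window; indexing by position avoids the
  # surprise of sample(r, 1) when r happens to have length one
  picked <- vapply(rows_by_window,
                   function(r) r[sample.int(length(r), 1)],
                   integer(1))
  cor(freq$var_A[picked], freq$var_B[picked])
}

cors <- replicate(200, one_replicate())
quantile(cors, c(0.025, 0.5, 0.975))
```

Because the window assignment is computed once outside the loop, each replicate only costs one draw per window plus a single cor() call.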
I am a novice R programmer. I have written code that needs to apply a function to a data frame split by a factor. The data frame contains 1,324,961 observations, and the variable we use to slice it has 64,376 levels.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
# Here we drop the Variables that are not required in creating the Fuzzy-Blau index
bod_sample <- as.data.frame(bod_sample)
bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
# Calculating the Probabilites of a director belonging to a caste
bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
#Dropping the Total Occurances column, as we do not need it anymore
bod_sample$tot_occur<- NULL
# Here we replace all the blanks with NA
bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
bod_sample <- as.data.frame(bod_sample)
# Here we push all the NAs in the caste names and caste probabilities to the end of the row
# So if there are only two castes against a name, then they become caste1 and caste2
caste_list<-data.frame(bod_sample$caste1,bod_sample$caste2,bod_sample$caste3,bod_sample$caste4)
caste_list = as.data.frame(t(apply(caste_list,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
caste_list_prob<-data.frame(bod_sample$caste1_occ,bod_sample$caste2_occ,bod_sample$caste3_occ,bod_sample$caste4_occ)
caste_list_prob = as.data.frame(t(apply(caste_list_prob,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
# Here we write two functions: 1. gen_castelist
# 2. gen_caste_prob
# gen_castelist: This function takes the row number (serial number of the direcor)
# and returns the names of all the castes for which he has a non-zero
# probability.
# gen_caste_prob: This function takes the row number (serial number of the director)
# and returns the probability with which he belongs to the caste
#
gen_castelist <- function(x){
y <- caste_list[x,]
y <- as.vector(y[!is.na(y)])
return(y)
}
gen_caste_prob <- function(x){
z <- caste_list_prob[x,]
z <- z[!is.na(z)]
z <- as.numeric(z)
return(z)
}
caste_ls <-list()
caste_prob_ls <- list()
for(i in 1:nrow(bod_sample))
{
caste_ls[[i]]<- gen_castelist(i)
caste_prob_ls[[i]]<- gen_caste_prob(i)
}
gridcaste <- expand.grid(caste_ls)
gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors=FALSE)
gridcasteprob <- expand.grid(caste_prob_ls)
# Generating the Joint Probability
gridcasteprob$JP <- apply(gridcasteprob,1,prod)
# Generating the Similarity Index
gen_sim_index <- function(x){
x <- t(x)
a <- as.data.frame(table(x))
sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
return(sim_index)
}
gridcaste$sim_index <- apply(gridcaste,1,gen_sim_index)
# Generating fuzzyblau
gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
fuzzy_blau_index <- sum(gridcaste$fb)
remove_list <- c("gridcaste","")
return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a dataframe bod_fb. I need to apply the gen_fuzzy_blau function to this dataframe, slicing it by the levels of bod_fb$code_year.
Since the function is very large, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric value fuzzy_blau_index for each code_year of the dataframe. I use by to apply the function to each slice. I would like to know whether there is a way to run this code in parallel, so that multiple instances of the function run at once on different slices of the dataframe. I did not find a by implementation in the parallel package, and I do not know how to pass the dataframes as iterators when using the foreach and doParallel packages.
I have a AMD A8 laptop with 4GB RAM and windows 7 sp1 home basic. I have given 20GB as page file memory (this was after I got the memory error).
Thank you
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is still being spent in gen_sim_index. I am using the proc.time() function to gauge how long each part of the code takes.
The function is supposed to do the following to a row:
if we have a row (not a vector), say: a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e. the sum over the unique elements of the row of (number of occurrences of that element / total number of elements in the row)^2
I am unable to use the table() function directly on the row because each element in the row has different factor levels, so table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
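One possible way to write it, sketched in base R under the assumption that each row can be treated as a plain character vector (which sidesteps the differing factor levels). The name sim_index_row and the small grid data frame are hypothetical.

```r
# Hypothetical replacement for gen_sim_index: works on one row's values
sim_index_row <- function(x) {
  x <- as.character(x)                     # ignore factor levels entirely
  counts <- tabulate(match(x, unique(x)))  # occurrences of each unique value
  sum(counts^2) / length(x)^2
}

sim_index_row(c("a", "a", "b", "c"))
# [1] 0.375

# Applied row-wise to a small, made-up grid of caste labels
grid <- data.frame(V1 = c("a", "b"), V2 = c("a", "c"),
                   V3 = c("b", "c"), V4 = c("c", "c"),
                   stringsAsFactors = FALSE)
sim <- unname(apply(as.matrix(grid), 1, sim_index_row))
sim
# [1] 0.375 0.625
```

match() plus tabulate() avoids building a table object and a data frame for every row, which is where much of the per-row overhead in the original comes from.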
You're saving your data 6 times over in 6 different variables. Try not doing that.
It takes a day because you're running character indexing on a ridiculous amount of data with gsub().
Take your code out of your gen_fuzzy_blau function as it provides no value to wrap it up into one function rather than running it all independently. Then run it all one line at a time. If it takes too long to run, reconsider your method. Your code is incredibly inefficient.
I am trying to get to grips with R and as an experiment I thought that I would try to play around with some cricket data. In its rawest format it is a yaml file, which I used the yaml R package to turn into an R object.
However, I now have a number of nested lists of uneven length that I want to try and turn into a data frame in R. I have tried a few methods such as writing some loops to parse the data and some of the functions in the tidyr package. However, I can't seem to get it to work nicely.
I wondered if people knew of the best way to tackle this? Replicating the data structure would be difficult here, because the complexity comes from the multiple nested lists and the unevenness of their lengths (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I was using the ODI internationals).
Thanks in advance!
Update
I have tried this:
1) Using tidyr - separate
d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")
This uses the tidyr package to return a single-column data frame that I then try to separate. However, the problem is that extras variables sometimes appear in the middle of some rows and not others, which throws off the separation.
2) Creating vectors
ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector
The problem here is that the resulting vectors are uneven in length and therefore can't be combined into a data frame. It will also suffer from the issue that there are sporadic variables in the middle of the dataset (as above).
Caveat: not complete answer; attempt to arrange the innings data
plyr::rbind.fill may offer a solution to binding rows with a different number of columns.
I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this over all the yaml files in the directory.
# Download and unzip data
download.file("http://cricsheet.org/downloads/odis.zip", temp<- tempfile())
tmp <- unzip(temp)
# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)
# Function to process list into dataframe
p_fun <- function(X) {
team = X[[1]][["team"]]
# function to process each list subelement that represents each throw
fn <- function(...) {
tmp = unlist(...)
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
}
# loop over all throws
lst = lapply(X[[1]][["deliveries"]], fn )
cbind(team, plyr::rbind.fill(lst))
}
# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))
Some explanation
The list structure and subsetting it. To get an idea of the structure of the list use
str(raw_dat) # but this gives a really long list of data
You can truncate this, to make it a bit more useful
str(raw_dat, 3)
length(raw_dat)
So there are three main list elements - meta, info, and innings. You can also see this with
names(raw_dat)
To access the meta data, you can use
raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements eg [["data_version"]]
The main data is in the innings element.
str(raw_dat$innings, 3)
Look at the names in the list element
lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)
There are two list elements, each with sub-elements. You can access these as
raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]
The above function parsed the deliveries data in raw_dat$innings. To see what it does, work through it from the inside.
Use one record to see how it works
(note the lapply, with p_fun, looped over raw_dat$innings[[1]] and raw_dat$innings[[2]] ; so this is the outer loop, and the lapply, with fn, loops through the deliveries, within an innings ; the inner loop)
X <- raw_dat$innings[[1]]
tmp <- X[[1]][["deliveries"]][[1]]
tmp
#create a named vector
tmp <- unlist(tmp)
tmp
# 0.1.batsman 0.1.bowler 0.1.non_striker 0.1.runs.batsman 0.1.runs.extras 0.1.runs.total
# "IR Bell" "DW Steyn" "MJ Prior" "0" "0" "0"
To use rbind.fill, the elements to bind together need to be data.frames. We also want to remove the leading numbers / deliveries from the names, as otherwise we will have lots of uniquely named columns
# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp))
# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if i was better at regex I would have also removed the leading dots)
gsub("[0-9]", "", names(tmp))
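A possible regex for stripping the leading dots as well, sketched on names shaped like the ones printed above (for example 0.1.runs.batsman). It assumes every name starts with an over.delivery prefix followed by a dot; the nm vector is made up for illustration, and the patterns would be applied to names(tmp) before the data.frame() call alters the names.

```r
# Made-up names in the same "over.delivery.field" shape as names(tmp)
nm <- c("0.1.batsman", "0.1.runs.batsman", "12.3.wicket.kind")

# Keep only the leading over.delivery part
ball <- sub("^([0-9]+\\.[0-9]+)\\..*$", "\\1", nm)
ball
# [1] "0.1"  "0.1"  "12.3"

# Drop the over.delivery prefix together with its trailing dot
field <- sub("^[0-9]+\\.[0-9]+\\.", "", nm)
field
# [1] "batsman"      "runs.batsman" "wicket.kind"

# Split the ball label into over and delivery numbers
over <- as.integer(sub("\\..*$", "", ball))
delivery <- as.integer(sub("^.*\\.", "", ball))
```

Anchoring the pattern with ^ keeps digits that occur later in a field name intact, which the blanket [0-9] replacement does not.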
So for the first delivery in the first innings we have
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
To see how the lapply works, use the first three deliveries (you will need to run the function fn in your workspace)
lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[2]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 02 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[3]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 03 IR Bell DW Steyn MJ Prior 3 0 3
So we end up with a list element for every delivery within an innings. We then use rbind.fill to create one data.frame.
If I was going to try and parse every yaml file I would use a loop.
Use the first three records as an example, and also add the match date.
tmp <- unzip(temp)[2:4]
all_raw_dat <- vector("list", length=length(tmp))
for(i in seq_along(tmp)) {
d = yaml.load_file(tmp[i])
all_raw_dat[[i]] <- cbind(date=d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}
Then use rbind.fill.
Q1. from comments
A small example with rbind.fill
a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)
rbind(a,b) # error as names don't match
plyr::rbind.fill(a, b)
rbind.fill doesn't go back and add/update rows with the extra columns where needed (a still doesn't have column z). Think of it as creating an empty data frame with the number of columns equal to the number of unique columns found in the list of data frames - unique(c(names(a), names(b))). The values are then filled in for each row where possible, and left missing (NA) otherwise.
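That description can be mimicked in a few lines of base R. This is only a rough sketch of the idea, not how plyr implements it, and the function name rbind_fill_sketch is hypothetical.

```r
# Rough base R imitation of the behaviour described above (not plyr's code)
rbind_fill_sketch <- function(...) {
  dfs <- list(...)
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    # Add any column this data frame lacks, filled with NA
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[all_cols]
  })
  do.call(rbind, padded)
}

a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 2, z = 1)
res <- rbind_fill_sketch(a, b)
res
#   x  y  z
# 1 1  2 NA
# 2 2 NA  1
```

The inputs a and b are left untouched; only the padded copies inside the function gain the extra columns.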
We all know that appending a vector to a vector within a for loop in R is a bad thing because it costs time. A solution would be to do it in a vectorized style. Here is a nice example by Joshua Ulrich. It is important to first create a vector with known length and then fill it up, instead of appending each new piece to an existing piece within the loop.
Still, in his example he demonstrates 'only' how to append one data piece at a time. I am now fighting with the idea to fill a vector with vectors - not scalars.
Imagine I have a vector with a length of 100
vector <- numeric(length=100)
and a smaller vector that would fit 10 times into the first vector
vec <- seq(1,10,1)
How would I have to construct a loop that adds the smaller vector to the large vector without using c() or append ?
EDIT: This example is simplified - vec does not always consist of the same sequence but is generated within a for loop and should be added to vector.
You could just use normal vector indexing within the loop to accomplish this:
vector <- numeric(length=100)
for (i in 1:10) {
vector[(10*i-9):(10*i)] <- 1:10
}
all.equal(vector, rep(1:10, 10))
# [1] TRUE
Of course if you were just trying to repeat a vector a certain number of times rep(vec, 10) would be the preferred solution.
A similar approach, perhaps a little more clear if your new vectors are of variable length:
# Let's over-allocate so that we know the big vector is big enough
big_vec = numeric(1e4)
this.index = 1
for (i in 1:10) {
# Generate a new vector of random length
new_vec = runif(sample(1:20, size = 1))
# Stick it in big_vec by index
big_vec[this.index:(this.index + length(new_vec) - 1)] = new_vec
# update the starting index
this.index = this.index + length(new_vec)
}
# truncate to only include the added values
big_vec = big_vec[1:(this.index - 1)]
As @josilber suggested in comments, lists would be more R-ish. This is a much cleaner approach, unless the new vector generation depends on the previous vectors, in which case the for loop might be necessary.
vec_list = list()
for (i in 1:10) {
# Generate a new vector of random length
vec_list[[i]] = runif(sample(1:20, size = 1))
}
# Or, use lapply
vec_list = lapply(1:10, FUN = function(x) {runif(sample(1:20, size = 1))})
# Then combine with do.call
do.call(c, vec_list)
# or more simply, just unlist
unlist(vec_list)