I currently have this for-loop which I want to vectorize. It calculates the percentage of 6's in different subvectors of a, starting with a[1:100], then a[1:200], ... always in steps of 100.
rolls.max <- 100000
a <- sample(1:6, size=rolls.max, replace=TRUE)
sixes.ratio <- c()
for(i in 1:(rolls.max/100)) {
  sixes.count <- table(a[1:(i*100)])[6]
  ratio <- sixes.count/(i*100)
  sixes.ratio <- c(sixes.ratio, ratio)
}
I think the most difficult part is to get the count of 6's from a for each subvector. I tried this:
rolls.max <- 100000
a <- matrix(sample(1:6, size=rolls.max, replace=TRUE))
subset.creator <- function(x, c) if (c!=0 && c%%100==0) { as.vector(table(x[1:(rolls[c/100])]))[6] }
sixes.count <- mapply(subset.creator, a, col(a))
# Converting the other lines won't be difficult I think
What I wanted to achieve with this was to create a subvector of a on every 100th call of the function subset.creator, then build a table, take its sixth entry to get the count of 6's, and extract only the bare count using as.vector().
But this just gives me rubbish instead of a vector with counts of 6's.
If you want to create a "rolling tally" at every hundredth chunk of your simulated rolls, one way to solve the problem is to create a vector of "stops" that represents your cutoff points, then use sapply to perform the calculation (in this case, counting up the 6s) at each stop:
rolls.max <- 100000
a <- sample(1:6, size=rolls.max, replace=TRUE)
# a vector of "stops" at every hundredth entry of 'a'
stops <- seq(0, rolls.max, 100)[-1]
# counts of 6s from the first entry in 'a' to the values in 'stops'
count.6 <- sapply(stops, function(x) sum(a[1:x] == 6))
# ...or just as easily, the rolling proportion of 6s
prop.6 <- sapply(stops, function(x) mean(a[1:x] == 6))
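If you want to avoid re-scanning a at every stop, a fully vectorized variant (a sketch of the same idea, built on the stops vector above) keeps a running count with cumsum:
# running count of 6s at every position, read off at each stop
count.6 <- cumsum(a == 6)[stops]
prop.6 <- count.6 / stops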
I'm trying to write a function that transforms a dataframe by high-pass filtering each row entry against some percentile of that row's values. The function is written for single-cell RNA-sequencing data, but in principle anything works. It transposes the result at the end because that makes some downstream code cleaner.
topquantile.binarize <- function(scRNAseq_data, percentile){
  # takes in data that is gene by cell
  # returns dataframe of cell by gene
  # calculates the quantile of each gene (row);
  # if a gene's value in a cell is above that quantile,
  # it is set to 1, otherwise to 0
  for (i in c(1:dim(scRNAseq_data)[1])){
    filter_value <- quantile(scRNAseq_data[i, ], percentile)
    filter_value <- as.numeric(filter_value)
    high_pass <- function(x) {
      if (x > filter_value) {
        x <- 1
      } else {
        x <- 0
      }
      return(x)
    }
    scRNAseq_data[i, ] <- apply(scRNAseq_data[i, ], 2, high_pass)
  }
  return(t(scRNAseq_data))
}
EXAMPLE DATA
library(tictoc)
tic()
set.seed(42)
scRNAseq_data <- data.frame(matrix(rnorm(1000*100, mean=0, sd=1), 1000, 100))
res <- topquantile.binarize(scRNAseq_data, 0.9)
toc()
You will notice that even at 100 columns of 1000 rows each it runs pretty slowly; using tictoc you'll see it takes around 4 seconds (possibly a little more) to do that.
I realize that technically the function does more than just look for values in the top quantile but whatever.
Use matrixStats::rowQuantiles and exploit the vectorization of the R language. Runs in the blink of an eye.
res1 <- t(+(scRNAseq_data > matrixStats::rowQuantiles(as.matrix(scRNAseq_data), probs=.9)))
stopifnot(all.equal(res, res1))
MatrixGenerics::rowQuantiles from Bioconductor might also work.
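If you'd rather avoid the extra dependency, a base-R sketch of the same idea (assuming all-numeric data, as here) computes the row quantiles with apply; the comparison recycles the quantile vector down the columns, which lines it up with the rows:
# row-wise 90th percentiles, then the same binarize-and-transpose step
q <- apply(scRNAseq_data, 1, quantile, probs = 0.9)
res2 <- t(+(as.matrix(scRNAseq_data) > q))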
dataset <- matrix(rnorm(100), 20, 5)
My dataset is a matrix of 100 returns, of 5 assets over 20 days.
I want to calculate the average return of each asset, over rows 1:10 and over rows 11:20.
Then I want to store the returns so computed in two vectors, which are in turn included in a list.
The following list should include the two vectors of returns computed between rows 1:10 and 11:20.
returns <- vector(mode="list", 2)
I have implemented a for-loop, as reported below, to calculate the mean of returns only between 1:10.
assets <- 5
r <- rep(0, assets) # this vector should include the returns over 1:10
for(i in 1:assets){
  r[i] <- mean(dataset[1:10, i])
}
returns[[1]] <- r
How could I adapt this for-loop to also calculate the mean of returns over rows 11:20?
I have tried to "index" the rows of the dataset, in the following way.
time <- c(1, 10, 11, 20)
and then implement a double for-loop, but the lengths are different. Moreover, in this case I have difficulties managing the vector "r", because I should now have two vectors and no longer only one as before.
for(j in 1:length(time)){
  for(i in 1:assets){
    r[i] <- mean(dataset[1:10, i])
  }
}
returns[[1]] <- r
You don't even need a for loop. You can use colMeans
returns <- vector(mode="list", 2)
returns[[1]] <- colMeans(dataset[1:10,])
returns[[2]] <- colMeans(dataset[11:20,])
Using a for loop, your solution could be something like the following
for(i in 1:assets){
  returns[[1]] <- c(returns[[1]], mean(dataset[1:10, i]))
  returns[[2]] <- c(returns[[2]], mean(dataset[11:20, i]))
}
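If you ever need more than two windows, the same colMeans idea generalizes with lapply over a list of row ranges (a sketch using the objects above):
# one colMeans call per window of rows
windows <- list(1:10, 11:20)
returns <- lapply(windows, function(idx) colMeans(dataset[idx, ]))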
I have a vector on which I want to do block resampling to get, say, 1000 samples of the same size of the vector, and then save all this samples in a list.
This is the code that performs normal resampling, i.e. randomly draws one observation per time, and saves the result in a list:
myvector <- c(1:200)
mylist <- list()
for(i in 1:1000){
  mylist[[i]] <- sample(myvector, length(myvector), replace=TRUE)
}
I need a code that does exactly the same thing, except that instead of drawing single observations it draws blocks of observations (let's use blocks of dimension equal to 5).
I know there are packages that perform bootstrap operations, but I don't need statistics or confidence intervals or anything, just all the samples in a list. Both overlapping and non-overlapping blocks are fine, so code for just one of the two procedures is enough. Of course, if you are so kind as to give me the code for both, it's appreciated. Thanks to anybody who can help me with this.
Not sure how you're wanting to store the final structure.
The following takes a block dimension, samples your vector in chunks of that size (e.g. a 200-element vector with block size 5 gives 40 blocks of randomly sampled elements) and stores those blocks under a named entry of the final list. Using your example, the final result is a list with 1000 entries, each containing 40 blocks of 5 sampled elements.
myvector <- c(1:200)
rm(.Random.seed, envir=globalenv())
block_dimension <- 5
res = list()
for(i in 1:1000) {
  name <- paste('sample_', i, sep='')
  rep_num <- length(myvector) / block_dimension
  all_blocks <- replicate(rep_num, sample(myvector, block_dimension))
  tmp <- split(all_blocks, ceiling(seq_along(all_blocks)/block_dimension))
  res[[name]] <- tmp
}
How about the following? Note that you can use lapply, which should be slightly faster than filling the list in a for loop in this case.
As reference, here is the case where you sample individual observations.
# Sample individual observations
set.seed(2017);
mylist <- lapply(1:1000, function(x) sample(myvector, length(myvector), replace = TRUE));
Next we sample blocks of 5 observations.
# Sample blocks of n observations
n <- 5;
set.seed(2017);
mylist <- lapply(1:1000, function(x) {
  idx <- sample(1:(length(myvector) - n), length(myvector) / n, replace = TRUE);
  idx <- c(t(sapply(0:(n - 1), function(i) idx + i)));
  myvector[idx];
})
One solution, assuming blocks consist of contiguous elements of myvector, is to pre-define the blocks as rows of a data frame with start/end columns (e.g. blocks <- data.frame(start=seq(1,196,5), end=seq(5,200,5))). Create a set of sampled block indexes (with replacement) from 1:number-of-blocks and concatenate the values of myvector between the start/end bounds of each sampled block. You can add randomization within blocks as well, if you need to. This gives you control over the block contents, overlap, size, etc. A sketch follows below.
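A minimal sketch of that idea, using the blocks definition above:
# resample whole blocks by index, then pull out each block's elements
blocks <- data.frame(start = seq(1, 196, 5), end = seq(5, 200, 5))
mylist <- lapply(1:1000, function(x) {
  picked <- sample(nrow(blocks), nrow(blocks), replace = TRUE)
  unlist(lapply(picked, function(b) myvector[blocks$start[b]:blocks$end[b]]))
})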
I found a way to perform the task with non-overlapping blocks:
myvector <- c(1:200)
n <- 5
mymatrix <- matrix(myvector, nrow = length(myvector)/n, byrow = TRUE)
mylist <- list()
for(i in 1:1000){
  mylist[[i]] <- as.vector(t(mymatrix[sample(nrow(mymatrix), size = length(myvector)/n, replace = TRUE), ]))
}
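The overlapping analogue of this matrix trick works with embed, which lists every contiguous length-n window as a row (a sketch, assuming the same myvector and n as above):
# every row of 'windows' is one contiguous block of n elements
windows <- embed(myvector, n)[, n:1]
mylist <- lapply(1:1000, function(i) {
  as.vector(t(windows[sample(nrow(windows), length(myvector)/n, replace = TRUE), ]))
})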
I've looked through previous help threads and haven't found something that has helped me with this specific problem. I know that a for loop would be a better way to generate the same data, but I'm interested in making this work with a repeat loop (mostly just as an exercise) and am struggling with the solution.
So I'm looping to create 3 iterations of 100 rnorm observations, changing the means each time from 5, to 25, to 45.
i <- 1
repeat{
  x <- rnorm(100, mean = j, sd = 3)
  j <- 5*i
  i <- i + 4
  if (j > 45) break
  cat(x, "\n", j, "\n")
}
All of my tinkering to get a combined saved output for each iteration (for a total of 300 values) has failed. Help!
You can use lapply to get this:
lapply(c(5, 25, 45), function(x){
  rnorm(100, mean = x, sd = 3)
})
This will give you a list with 3 elements, each containing 100 observations drawn from the respective normal distribution.
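Since you wanted a combined output of all 300 values, you can also flatten that list (a small extension of the answer above):
# one vector of 300 values: 100 per mean, in the order 5, 25, 45
samples <- lapply(c(5, 25, 45), function(x) rnorm(100, mean = x, sd = 3))
all.values <- unlist(samples)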
It depends on what data structure you want.
For a list it would be:
r <- list()
repeat{
  # ... your existing loop body that computes x and j and breaks ...
  r[[length(r)+1]] <- list(x, j)
}
Then r[[1]][[1]] will be x from the first iteration and r[[1]][[2]] will be j.
Since you know how many observations you want to store, you can pre-allocate a matrix of that size, and store the data in it as it's generated.
# preallocate the space for the values you want to store
x <- matrix(nrow=100, ncol=3)
# save the three means in a vector
j_vals <- c(5,25,45)
# if you really need a repeat loop you can do it like so:
i <- 1
repeat {
  # save the random sample in a column of the matrix x
  x[,i] <- rnorm(100, mean = j_vals[i], sd = 3)
  # print the random sample to the console (you can omit this)
  cat(x[,i], "\n", j_vals[i], "\n")
  i <- i + 1
  if (i > 3) break
}
You should get out a matrix x with the random samples stored in the columns. You can access each column like x[,1], x[,2] etc.
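If you want all 300 values combined into a single vector afterwards, the columns can be stacked:
# stack the columns of x in order of the means (5, 25, 45)
all.values <- as.vector(x)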
I am generating a data vector to sample from, using sample without replacement.
If the dataset I am generating from is large enough, the vector exceeds R's memory limits.
How can I represent these data in such a way that I can sample without replacement but can still handle huge datasets?
Generating the vector of counts:
counts <- vector()
for (i in 1:1024) {
  counts <- c(counts, rep(i, times=data[i,]$readCount))
}
Sampling:
trial_fn <- function(counts) {
  replicate(num_trials, sample(counts, size=trial_size, replace=F), simplify=F)
}
trials <- trial_fn(counts)
Error: cannot allocate vector of size 32.0 Mb
Is there a more sparse or compressed way I can represent this and still be able to sample without replacement?
If I understand correctly, your data has 1024 rows with different readCount.
The vector you build has the first readCount value repeated once, the second readCount repeated twice and so on.
Then you want to sample from this vector without replacement. So basically, you're sampling the first readCount with a probability of 1 / sum(1:1024), the second readCount with a probability of 2 / sum(1:1024) and so on, and each time you extract one value, it is removed from the set.
Of course the fastest and easiest approach is yours, but you can also do it with much less memory, at a significant cost in speed. This can be done by passing extraction probabilities to the sample function, extracting one value at a time and manually "removing" the extracted value.
Here's an example:
# an example of your data
data <- data.frame(readCount=1:1024)
# custom function to sample
mySample <- function(values, size, nElementsPerValue){
  nElementsPerValue <- as.integer(nElementsPerValue)
  if(sum(nElementsPerValue) < size)
    stop("Total number of elements per value is lower than the sample size")
  if(length(values) != length(nElementsPerValue))
    stop("nElementsPerValue must have the same length as values")
  if(any(nElementsPerValue < 0))
    stop("nElementsPerValue cannot contain negative numbers")
  # remove values having zero elements inside
  # (compute the index once, then subset both vectors with it)
  keep <- which(nElementsPerValue > 0)
  nElementsPerValue <- nElementsPerValue[keep]
  values <- values[keep]
  # pre-allocate the result vector
  res <- rep.int(0.0, size)
  for(i in 1:size){
    idx <- sample(1:length(values), size=1, replace=F, prob=nElementsPerValue)
    res[i] <- values[idx]
    # remove sampled value from nElementsPerValue
    nElementsPerValue[idx] <- nElementsPerValue[idx] - 1
    # if zero elements remain, remove it also from values
    if(nElementsPerValue[idx] == 0){
      values <- values[-idx]
      nElementsPerValue <- nElementsPerValue[-idx]
    }
  }
  return(res)
}
# just for reproducibility
set.seed(123)
# sample 100k values from readCount
system.time(
  a <- mySample(data$readCount, 100000, 1:1024),
  gcFirst=T)
# on my machine it gives :
# user system elapsed
# 10.63 0.00 10.67
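Another memory-frugal route (a sketch, not from the answer above): keep the expanded vector implicit, sample positions in it without replacement, and map each position back to its row with findInterval. Since each physical copy is drawn at most once, this matches sampling without replacement from the expanded vector without ever building it:
counts <- data$readCount
bounds <- c(0, cumsum(counts))
set.seed(123)
# positions in the implicit expanded vector, drawn without replacement
pos <- sample.int(sum(counts), 100000)
# the row whose block of positions contains each sampled position
b <- findInterval(pos, bounds, left.open = TRUE)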