Suppose I have the following data frame
set.seed(36)
n <- 300
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
For each value of d$y <= 300, I have to create a variable res in which the numerator is the sum of the indicator (d$x <= d$y[i]) and the denominator is the sum of the indicator (d$y >= d$y[i]). I have written this as a for loop:
res <- NULL
for( i in seq_len(sum(d$y<=300)) ){
numerator <- sum(d$x <= d$y[i])
denominator <- sum(d$y >= d$y[i])
res[i] <- numerator / denominator
}
But my concern is that when the number of observations of x and y is large, i.e. when the number of rows of the data frame increases, the for loop will be slow. Additionally, if I simulate the data 1000 times and run the for loop each time, the program will be very inefficient.
What would be a more efficient solution?
This depends on d already being sorted as it is:
# example data
set.seed(36)
n <- 1e5
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
My suggestion (thanks to #alexis_laz for the denominator):
system.time(res3 <- {
xs <- sort(d$x) # sorted x
yt <- d$y[d$y <= 300] # truncated y
num = findInterval(yt, xs)
den = length(d$y) - match(yt, d$y) + 1L
num/den
})
# user system elapsed
# 0 0 0
OP's approach:
system.time(res <- {
res <- NULL
for( i in seq_len(sum(d$y<=300)) ){
numerator <- sum(d$x <= d$y[i])
denominator <- sum(d$y >= d$y[i])
res[i] <- numerator / denominator
}
res
})
# user system elapsed
# 50.77 1.13 52.10
# verify it matched
all.equal(res,res3) # TRUE
#d.b's approach:
system.time(res2 <- {
numerator = rowSums(outer(d$y, d$x, ">="))
denominator = rowSums(outer(d$y, d$y, "<="))
res2 = numerator/denominator
res2 = res2[d$y <= 300]
res2
})
# Error: cannot allocate vector of size 74.5 Gb
# ^ This error is common when using outer() on large-ish problems
Vectorization. Generally, tasks are faster in R if they can be vectorized. The key functions related to ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all work on vectors.
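For instance, a tiny illustration (with made-up values) of how findInterval answers the "how many sorted values are <= each query" question in a single vectorized call:
xs_demo <- c(1, 3, 3, 7, 10)       # must be sorted
findInterval(c(0, 3, 8), xs_demo)  # 0 3 4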
Continuous vs discrete. The match above should be a fast way to compute the denominator whether the data is continuous or has mass points / repeating values. If the data is continuous (and so has no repeats), the denominator can just be seq(length(xs), length = length(yt), by=-1). If it is fully discrete and has a lot of repetition (like the example here), there might be some way to make that faster as well, maybe like one of these:
den2 <- inverse.rle(with(rle(yt), list(
values = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
lengths = lengths)))
tab <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
# verify
all.equal(den,den2) # TRUE
all.equal(den,den3) # TRUE
findInterval will still work for the numerator with continuous data. It's probably not ideal for the repeated-values case considered here (since we're redundantly finding the interval for many repeated yt values), but similar ideas for speeding that up apply; see the sketch below.
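For example, one possible sketch (not benchmarked) that looks up each distinct yt value only once and then expands the result with the tab counts from the den3 computation above; this relies on yt being sorted, which it is here since d is ordered by y:
uy <- unique(yt)                 # distinct yt values, ascending
num2 <- rep(findInterval(uy, xs), tab)
all.equal(num, num2)             # should be TRUE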
Other options. As #chinsoon suggested, the data.table package might be a good fit if findInterval is too slow, since it has a lot of features focused on sorted data, but it's not obvious to me how to apply it here.
Instead of running a loop, generate all the numerators and denominators at once. This also lets you keep track of which res is associated with which x and y; later, you can keep only the ones you want.
You can use outer for element wise comparison between vectors.
numerator = rowSums(outer(d$y, d$x, ">=")) #Compare all y against all x
denominator = rowSums(outer(d$y, d$y, "<=")) #Compare all y against itself
res2 = numerator/denominator #Obtain 'res' for all rows
#I would first 'cbind' res2 to d and only then remove the ones for 'y <=300'
res2 = res2[d$y <= 300] #Keep only those 'res' that you want
Since this uses rowSums, it should be faster.
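One caveat worth adding (my note, not part of this answer): outer() materializes n-by-n intermediate objects, so memory, not speed, becomes the limiting factor as n grows. A quick back-of-the-envelope check for the n = 1e5 benchmark above:
n <- 1e5
n * n * 8 / 2^30   # GiB for one n-by-n double object; roughly 74.5
This matches the allocation error shown in the benchmarks above.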
Hello,
I've had problems with the following while loop in R. I'm trying to find out how many samples (n) I need to achieve a variance less than 0.01 (dtest), and to obtain the resulting values of n, m, s and d:
n <- 100
x <- rnorm(n,0,1)
sd(x)
d <- sd(x)/sqrt(n)
dtest <- 0.01
while(dtest <=0.01) {
x <- rnorm(n,0,1)
n <- n+1
m <- mean(x)
s <- sd(x)
d <- s/sqrt(n)
return(output <- data.frame(n,m,s,d))
}
The first time I ran the loop it finished without problems and returned an n of approximately 27K. But now every time I execute the loop it just keeps running and never finishes.
There are a number of issues:
Your condition should compare d to dtest. Currently it compares two values that never change inside the loop, so it will run forever.
Increment n at the start of the loop. Otherwise you're using a different n to compute x and d.
Create your results data frame once, after the loop, rather than creating and discarding one on every iteration. And don't use return(), which is meant for use inside functions.
Note that sd(x)/sqrt(n) is standard error, not variance. Variance would be sd(x)^2.
set.seed(13)
n <- 99
x <- rnorm(n,0,1)
d <- sd(x)/sqrt(n)
dtest <- 0.01
while(dtest <= d) {
n <- n+1
x <- rnorm(n,0,1)
s <- sd(x)
d <- s/sqrt(n)
}
output <- data.frame(n,m = mean(x),s,d)
output
n m s d
1 9700 0.01906923 0.9848469 0.009999605
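If you do want this wrapped in a function (the context where return() is appropriate), a minimal sketch along the same lines could look like this (the function name is just illustrative):
sim_until <- function(dtest = 0.01, n_start = 99) {
  n <- n_start
  x <- rnorm(n, 0, 1)
  d <- sd(x) / sqrt(n)
  while (dtest <= d) {
    n <- n + 1
    x <- rnorm(n, 0, 1)
    d <- sd(x) / sqrt(n)
  }
  return(data.frame(n = n, m = mean(x), s = sd(x), d = d))
}
sim_until()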
I would like to apply the rejection sampling method to simulate a random vector Y = (Y_1, Y_2) uniformly distributed on the unit disc D = \{ (y_1, y_2) \in R^2 : \sqrt{y_1^2 + y_2^2} \le 1 \}, where X = (X_1, X_2) is a random vector uniformly distributed on the square S = [-1, 1]^2 and the joint density is f(y_1, y_2) = \frac{1}{\pi} 1_D(y_1, y_2).
In the rejection method, we require an envelope f(x) \le C g(x) and accept a proposed sample with probability f(x) / (C g(x)); for a uniform target and a uniform proposal this reduces to keeping the points that fall inside the disc. I am using the following code:
x=runif(100,-1,1)
y=runif(100,-1,1)
d=data.frame(x=x,y=y)
disc_sample=d[(d$x^2+d$y^2)<1,]
plot(disc_sample)
I have two questions:
Using the above code, logically the size of d should be greater than the size of disc_sample, but when I call both of them I see 100 elements in each one. How is this possible? Why are the sizes the same? THIS PART IS SOLVED, thanks to the comment below.
The question now:
How could I reformulate my code to give me the total number of samples needed to get 100 samples that satisfy the condition, i.e. the number of samples drawn (and rejected) until I get the 100 samples I need?
Thanks to r2evans for the answer, but I am looking for something simpler: a while loop that stores all candidate samples in a matrix or data frame instead of a list, and without the sapply function, from which I then keep just the samples that satisfy the condition. I modified the code from the answer to avoid lists and sapply, but it is not giving the needed result; it yields only one row.
i=0
samps <- data.frame()
goods <- data.frame()
nr <- 0L
sampsize <- 100L
needs <- 100L
while (i < needs) {
samps <- data.frame(x = runif(1, -1, 1), y = runif(1, -1, 1))
goods <- samps[(samps$x^2+samps$y^2)<1, ]
i = i+1
}
and I also thought about this:
i=0
j=0
samps <- matrix()
goods <- matrix()
needs <- 100
while (j < needs) {
samps[i,1] <- runif(1, -1, 1)
samps[i,2] <- runif(1, -1, 1)
if (( (samps[i,1])**2+(samps[i,2])**2)<1){
goods[j,1] <- samps[i,1]
goods[j,2] <- samps[i,2]
}
else{
i = i+1
}
}
but it is not working.
I would be very grateful for any help to modify the code.
As to your second question ... you cannot reformulate your code to know in advance precisely how many draws it will take to get (at least) 100 accepted combinations. You can use a while loop and concatenate results until you have at least 100 such rows, and then truncate those over 100. Because drawing random numbers piecemeal (at scale) is "expensive", you might prefer to always over-estimate the rows you need and grab them all at once (see the one-shot sketch after the code below).
(Edited to reduce "complexity" based on homework constraints.)
set.seed(42)
samps <- vector(mode = "list")
goods <- vector(mode = "list")
nr <- 0L
iter <- 0L
sampsize <- 100L
needs <- 100L
while (nr < needs && iter < 50) {
iter <- iter + 1L
samps[[iter]] <- data.frame(x = runif(sampsize, -1, 1), y = runif(sampsize, -1, 1))
rows <- (samps[[iter]]$x^2 + samps[[iter]]$y^2) < 1
goods[[iter]] <- samps[[iter]][rows, ]
nr <- nr + sum(rows)
}
iter # number of times we looped
# [1] 2
out <- head(do.call(rbind, goods), n = 100)
NROW(out)
# [1] 100
head(out) ; tail(out)
# x y
# 1 0.8296121 0.2524907
# 3 -0.4277209 -0.5668654
# 4 0.6608953 -0.2221099
# 5 0.2834910 0.8849114
# 6 0.0381919 0.9252160
# 7 0.4731766 0.4797106
# x y
# 221 -0.65673577 -0.2124462
# 231 0.08606199 -0.7161822
# 251 -0.37263236 0.1296444
# 271 -0.38589120 -0.2831997
# 28 -0.62909284 0.6840144
# 301 -0.50865171 0.5014720
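As a one-shot sketch of the over-estimation idea mentioned above (my addition, not a guaranteed bound): the acceptance probability for the unit disc inside [-1, 1]^2 is pi/4 ≈ 0.785, so drawing a comfortable multiple of what you need will almost always be enough in a single pass:
needs  <- 100L
n_draw <- ceiling(needs / (pi / 4) * 1.5)  # over-draw by 50% for safety
cand   <- data.frame(x = runif(n_draw, -1, 1), y = runif(n_draw, -1, 1))
inside <- (cand$x^2 + cand$y^2) < 1
out    <- head(cand[inside, ], needs)
c(drawn = n_draw, accepted = sum(inside), kept = nrow(out))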
Title's a little rough, open to suggestions to improve.
I'm trying to calculate time-average covariances for a 500 length vector.
The equation we're using is the lag-k time-average covariance (as implemented in the loop below): \bar{\gamma}(k) = \frac{1}{n-k} \sum_{t=k+1}^{n} (x_t - \bar{x})(x_{t-k} - \bar{x}).
The result I'm hoping for is a vector with an entry for each k from 0 to 500 (k = 0 would just be the variance of the whole set).
I've started with something like this, but I know I'll need to reference the gap (i) in the first mean comparison as well:
x <- rnorm(500)
xMean <-mean(x)
i <- seq(1, 500)
dfGam <- data.frame(i)
dfGam$gamma <- (1/(500-dfGam$i))*(sum((x-xMean)*(x[-dfGam$i]-xMean)))
Is it possible to do this using vector math or will I need to use some sort of for loop?
Here's the for loop that I've come up with for the solution:
gamma_func <- function(input_vec) {
output_vec <- c()
input_mean <- mean(input_vec)
iter <- seq(1, length(input_vec)-1)
for(val in iter){
iter2 <- seq((val+1), length(input_vec))
gamma_sum <- 0
for(val2 in iter2){
gamma_sum <- gamma_sum + (input_vec[val2]-input_mean)*(input_vec[val2-val]-input_mean)
}
output_vec[val] <- (1/length(iter2))*gamma_sum
}
return(output_vec)
}
Thanks
Using data.table, mostly for the shift function to make x_{t - k}, you can do this:
library(data.table)
gammabar <- function(k, x){
xbar <- mean(x)
n <- length(x)
df <- data.table(xt = x, xtk = shift(x, k))[!is.na(xtk)]
df[, sum((xt - xbar)*(xtk - xbar))/n]
}
gammabar(k = 10, x)
# [1] -0.1553118
The filter [!is.na(xtk)] starts the sum at t = k + 1, because xtk will be NA for the first k indices due to being shifted by k.
Reproducible x
x <- c(0.376972124936433, 0.301548373935665, -1.0980231706536, -1.13040590360378,
-2.79653431987176, 0.720573498411587, 0.93912102300901, -0.229377746707471,
1.75913134696347, 0.117366786802848, -0.853122822287008, 0.909259181618213,
1.19637295955276, -0.371583903741348, -0.123260233287436, 1.80004311672545,
1.70399587729432, -3.03876460529759, -2.28897494991878, 0.0583034949929225,
2.17436525195634, 1.09818265352131, 0.318220322390854, -0.0731475581637693,
0.834268741278827, 0.198750636733429, 1.29784138432631, 0.936718306241348,
-0.147433193833294, 0.110431994640128, -0.812504663900505, -0.743702167768748,
1.09534507180741, 2.43537370755095, 0.38811846676708, 0.290627670295127,
-0.285598287083935, 0.0760147178373681, -0.560298603759627, 0.447188372143361,
0.908501134499943, -0.505059597708343, -0.301004012157305, -0.726035976548133,
-1.18007702699501, 0.253074712637114, -0.370711296884049, 0.0221795637601637,
0.660044122429767, 0.48879363533552)
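If you want the whole vector the question describes (one estimate per lag k), one simple option is to loop gammabar over k with the reproducible x above; note that it uses a 1/n normalization, whereas the question's loop divides by n - k:
# lags 1 through 20 (k = 0 is just the variance term mean((x - mean(x))^2))
gammas <- sapply(1:20, gammabar, x = x)
gammas[10]   # same value as gammabar(k = 10, x) above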
I am trying to create a function that computes the sum of digits of large numbers, of the order of 100^100. The approach described in this question does not work, as shown below. I tried to come up with a function that does the job, but have not been able to get very far.
The inputs would be of the form a^b, where 1 < a, b < 100 and a and b are integers. So, in that sense, I am open to making digitSumLarge a function that accepts two arguments.
digitSumLarge <- function(x) {
pow <- floor(log10(x)) + 1L
rem <- x
i <- 1L
num <- integer(length = pow)
# Individually isolate each digit starting from the largest and store it in num
while(rem > 0) {
num[i] <- rem%/%(10^(pow - i))
rem <- rem%%(10^(pow - i))
i <- i + 1L
}
return(num)
}
# Function in the highest voted answer of the linked question.
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)
Consider the following tests:
x <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
as.numeric(paste(x, collapse = ''))
# [1] 1.234568e+17
sum(x)
# 90
digitSumLarge(as.numeric(paste(x, collapse = '')))
# 85
digitsum(as.numeric(paste(x, collapse = '')))
# 81, with warning message about loss of accuracy
Is there any way I can write such a function in R?
You need arbitrary-precision numbers. With R's numerics (double-precision floats), a^b can only be represented with limited precision, and not exactly for sufficiently large input.
library(gmp)
a <- as.bigz(13)
b <- as.bigz(67)
sum(as.numeric(strsplit(as.character(a^b), split = "")[[1]]))
#[1] 328
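If you'd rather avoid the string round-trip, the same digit extraction can be done directly with bigz arithmetic (a sketch; the function name is made up):
library(gmp)
digitSumBig <- function(a, b) {
  z <- as.bigz(a)^b
  s <- 0
  while (z > 0) {
    s <- s + as.numeric(z %% 10)  # last digit
    z <- z %/% 10                 # drop it
  }
  s
}
digitSumBig(13, 67)
#[1] 328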
Does anyone have an idea or suggestion on how to improve the efficiency of the following code, which eats up all my RAM, using a "kind-of" double rolling window?
I first go through a simple example defining the problem, with a full MWE (implementation) at the bottom of this post.
Consider the following "random" test vector (usually of length > 25000):
A <- c(1.23,5.44,6.3,8.45,NaN,3.663,2.63,1.32,6.623,234.6,252.36)
A is sectioned into a "kind-of" train and test set, both with rolling windows. In this MWE, a train set of length 4 and a test set of length 2 are considered (usually both are of length > 200). Initially, the following values are part of the train and test set:
train_1 <- A[1:4]
test_1 <- A[5:6]
Next, I want to subtract test_1 from train_1 at each possible consecutive location of train_1 (hence the first rolling window), generating the run_1_sub matrix.
run_1_sub <- matrix(NaN,3,2)
run_1_sub[1,] <- train_1[1:2] - test_1
run_1_sub[2,] <- train_1[2:3] - test_1
run_1_sub[3,] <- train_1[3:4] - test_1
Afterwards, for each row of run_1_sub, I want the row sum divided by the number of non-NaN entries in that row.
run_1_sum <-
sapply(1:3, function(x) {
sum(run_1_sub[x,], na.rm = T) / sum(!is.na(run_1_sub[x,]))
})
In the next step, the "kind-of" train and test sets are updated by shifting their start positions in A by one (hence the second rolling window):
train_2 <- A[2:5]
test_2 <- A[6:7]
As previously, test_2 is subtracted at each possible location in train_2, and run_2_sub and run_2_sum are computed. This procedure continues until the test set covers the last two values of A, so that (in this MWE) I finally end up with 6 run_sum matrices. My implementation, however, is very slow, and I was wondering whether anyone could help me increase its efficiency?
Here's my implementation:
# Initialization
library(zoo)
#rm(list = ls())
A <- c(1.23, 5.44, 6.3, 8.45, NaN, 3.663, 2.63, 1.32, 6.623, 234.6, 252.36) # test vector
train.length <- 4
test.length <- 2
run.length <- length(A) - train.length - test.length + 1
# Form test sets
test.sets <- sapply(1:run.length, function(x) {
A[(train.length + x):(train.length + test.length + x - 1)]
})
# Generate run_sub_matrices
run_matrix <- lapply(1:run.length, function(x) {
rollapply(A[x:(train.length + x - 1)], width = test.length, by = 1,
function(y) {
y - test.sets[, x]
})
})
# Generate run_sum_matrices
run_sum <- sapply(1:length(run_matrix), function(x) {
rowSums(run_matrix[[x]], na.rm = T) / apply(run_matrix[[x]], 1, function(y) {
sum(!is.na(y))})
})
Naturally, the following initialization slows down the generation of run_sub and run_sum significantly:
A <- runif(25000)*400
train.length <- 400
test.length <- 200
Here, the elapsed time for generating run_sub is 120.04 s and for run_sum 28.69 s, respectively.
Any suggestions on how to improve the speed of this code?
Usually the first two steps of code optimization in R are:
Do less;
Use vectorization.
We will go through both of these steps. Let's agree to denote the input vector by x (A in your example).
The key functional unit in your problem can be formulated as follows: given train_start (the start index of a subset of the train set; we will call this subset 'train'), test_start (the start index of the test set) and test_length (the length of the test set), compute:
train_inds <- train_start + 0:(test_length-1)
test_inds <- test_start + 0:(test_length-1)
run_diff <- x[train_inds] - x[test_inds]
sum(run_diff, na.rm = TRUE) / sum(!is.na(run_diff))
This unit is invoked many times, and so are the computations of the sums and of !is.na. We will do less: instead of repeatedly computing differences and their sums, we precompute the cumulative sums once and reuse them. See 'Preparatory computations' in run_mean_diff.
res now contains the needed sums of differences of x_mod (a copy of x with 0 in place of NAs and NaNs). We should now subtract all over-counted elements, i.e. those which shouldn't enter the sums because the respective element in the other set is NA or NaN. While computing this information we also compute the denominator. See 'Info about extra elements' in run_mean_diff.
The beauty of this code is that train_start, test_start and test_length can now be vectors: the ith element of each vector is treated as a single instance of our task. This is vectorization. Our job is now to construct these vectors suited to our task. See the function generate_run_data.
The presented code uses much less RAM, doesn't need the extra zoo dependency, and is considerably faster than the original for small train_length and test_length. For big *_lengths it is also faster, but not by very much.
One of the next steps might be writing this code using Rcpp.
The code:
run_mean_diff <- function(x, train_start, test_start, test_length) {
# Preparatory computations
x_isna <- is.na(x)
x_mod <- ifelse(x_isna, 0, x)
x_cumsum <- c(0, cumsum(x_mod))
res <- x_cumsum[train_start + test_length] - x_cumsum[train_start] -
(x_cumsum[test_start + test_length] - x_cumsum[test_start])
# Info about extra elements
extra <- mapply(
function(cur_train_start, cur_test_start, cur_test_length) {
train_inds <- cur_train_start + 0:(cur_test_length-1)
test_inds <- cur_test_start + 0:(cur_test_length-1)
train_isna <- x_isna[train_inds]
test_isna <- x_isna[test_inds]
c(
# Correction for extra elements
sum(x_mod[train_inds][test_isna]) -
sum(x_mod[test_inds][train_isna]),
# Number of extra elements
sum(train_isna | test_isna)
)
},
train_start, test_start, test_length, SIMPLIFY = TRUE
)
(res - extra[1, ]) / (test_length - extra[2, ])
}
generate_run_data <- function(n, train_length, test_length) {
run_length <- n - train_length - test_length + 1
num_per_run <- train_length - test_length + 1
train_start <- rep(1:num_per_run, run_length) +
rep(0:(run_length - 1), each = num_per_run)
test_start <- rep((train_length + 1):(n - test_length + 1),
each = num_per_run)
data.frame(train_start = train_start,
test_start = test_start,
test_length = rep(test_length, length(train_start)))
}
A <- c(1.23, 5.44, 6.3, 8.45, NaN, 3.663,
2.63, 1.32, 6.623, 234.6, 252.36)
train_length <- 4
test_length <- 2
run_data <- generate_run_data(length(A), train_length, test_length)
run_sum_new <- matrix(
run_mean_diff(A, run_data$train_start, run_data$test_start,
run_data$test_length),
nrow = train_length - test_length + 1
)
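As a quick sanity check (my addition, not part of the original answer), the result can be compared against the run_sum produced by the question's rollapply code on the same toy A; the identical() checks in the other answer below confirm the same on larger data:
# assumes run_sum has been computed with the question's code for this same
# A, train_length = 4 and test_length = 2
all.equal(unname(run_sum), unname(run_sum_new))  # should be TRUE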
The reason your code uses so much RAM is because you keep a lot of intermediate objects, mainly all the elements in run_matrix. And profiling via Rprof shows that most of the time is spent in rollapply.
The easiest and simplest way to avoid all the intermediate objects is to use a for loop; it also makes the code clearer. Then you just need to replace the call to rollapply with something faster.
The function you want to apply to each rolling subset is simple: subtract the test set. You can use the stats::embed function to create the matrix of lags, and then take advantage of R's recycling rules to subtract the test vector from each column. The function I created is:
calc_run_sum <- function(A, train_length, test_length) {
run_length <- length(A) - train_length - test_length + 1L
window_size <- train_length - test_length + 1L
# Essentially what embed() does, but with column order reversed
# (part of my adaptation of echasnovski's correction)
train_lags <- 1L:test_length +
rep.int(1L:window_size, rep.int(test_length, window_size)) - 1L
dims <- c(test_length, window_size) # lag matrix dims are always the same
# pre-allocate result matrix
run_sum <- matrix(NA, window_size, run_length)
# loop over each run length
for (i in seq_len(run_length)) {
# test set indices and vector
test_beg <- (train_length + i)
test_end <- (train_length + test_length + i - 1)
# echasnovski's correction
#test_set <- rep(test_set, each = train_length - test_length + 1)
#lag_matrix <- embed(A[i:(test_beg - 1)], test_length)
#run_sum[,i] <- rowMeans(lag_matrix - test_set, na.rm = TRUE)
# My adaptation of echasnovski's correction
# (requires train_lags object created outside the loop)
test_set <- A[test_beg:test_end]
train_set <- A[i:(test_beg - 1L)]
lag_matrix <- train_set[train_lags]
dim(lag_matrix) <- dims
run_sum[,i] <- colMeans(lag_matrix - test_set, na.rm = TRUE)
}
run_sum
}
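As a tiny illustration (with made-up numbers) of the lag-matrix-plus-recycling idea described above: when the windows are laid out as columns and the subtracted vector's length equals the number of rows, R's column-major recycling subtracts it from every window at once, which is exactly what lag_matrix - test_set does inside the loop.
lags <- matrix(c(1, 2, 3,  2, 3, 4,  3, 4, 5), nrow = 3)  # three length-3 windows as columns
lags - c(10, 20, 30)             # the vector is recycled down each column
colMeans(lags - c(10, 20, 30))   # one mean per window, like run_sum[, i]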
Now, for some benchmarks. I used the following input data:
library(zoo)
set.seed(21)
A <- runif(10000)*200
train.length <- 200
test.length <- 100
Here are the timings for your original approach:
system.time({
run.length <- length(A) - train.length - test.length + 1
# Form test sets
test.sets <- sapply(1:run.length, function(x) {
A[(train.length + x):(train.length + test.length + x - 1)]
})
# Generate run_sub_matrices
run_matrix <- lapply(1:run.length, function(x) {
rm <- rollapply(A[x:(train.length + x - 1)], width = test.length, by = 1,
FUN = function(y) { y - test.sets[, x] })
})
# Generate run_sum_matrices
run_sum <- sapply(run_matrix, function(x) {
rowSums(x, na.rm = T) / apply(x, 1, function(y) {
sum(!is.na(y))})
})
})
# user system elapsed
# 19.868 0.104 19.974
And here are the timings for echasnovski's approach:
system.time({
run_data <- generate_run_data(length(A), train.length, test.length)
run_sum_new <- matrix(
run_mean_diff(A, run_data$train_start, run_data$test_start,
run_data$test_length),
nrow = train.length - test.length + 1
)
})
# user system elapsed
# 10.552 0.048 10.602
And the timings from my approach:
system.time(run_sum_jmu <- calc_run_sum(A, train.length, test.length))
# user system elapsed
# 1.544 0.000 1.548
The outputs from all 3 approaches are identical.
identical(run_sum, run_sum_new)
# [1] TRUE
identical(run_sum, run_sum_jmu)
# [1] TRUE