I just saw a YouTube video from Numberphile on the Yellowstone sequence (A098550). It's based on a sequence starting with 1 and 2, with subsequent terms generated by these rules:
no repeated terms
always pick the lowest integer
gcd(a_n, a_(n-1)) = 1
gcd(a_n, a_(n-2)) > 1
The first 15 terms would be: 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7
A quick-and-dirty approach in R could be something like this, but understandably it becomes very slow when generating longer sequences. It also makes some assumptions about the highest number that can occur in the sequence (for reference: the first 10,000 terms never go higher than 5,000).
What can we do to make this faster?
library(DescTools)

a <- c(1, 2, 3)
p <- length(a)
# all natural numbers
all_ints <- 1:5000

for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already
  next_a_set <- all_ints[which(!all_ints %in% a)]
  # rule 3 - keep the numbers with gcd == 1 relative to the previous term
  next_a_option <- next_a_set[which(
    sapply(next_a_set, function(x) GCD(a[n], x)) == 1
  )]
  # rule 4 - keep the numbers with gcd > 1 relative to the term before that
  next_a <- next_a_option[which(
    sapply(next_a_option, function(x) GCD(a[n - 1], x)) > 1
  )]
  # rule 2 - select the lowest
  a <- c(a, min(next_a))
}
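As a sanity check (assuming the loop above ran to completion), the first 15 terms can be compared against the list given earlier:
a[1:15]
#> [1]  1  2  3  4  9  8 15 14  5  6 25 12 35 16  7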
Here's a version that's about 20 times faster than yours, with comments about the changes:
# Set a to the final length from the start
# (terms 4 through 1001 get filled in below).
a <- c(1, 2, 3, rep(NA, 998))
p <- 3

# Define a vectorized gcd() function. We'll be testing
# lots of gcds at once. This uses the Euclidean algorithm.
gcd <- function(x, y) { # vectorized gcd
  while (any(y != 0)) {
    x1 <- ifelse(y == 0, x, y)
    y <- ifelse(y == 0, 0, x %% y)
    x <- x1
  }
  x
}
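# Quick sanity check of the vectorized gcd() (a usage sketch):
# gcd(12, c(8, 9, 10)) returns c(4, 3, 2)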
# Guess at a reasonably large vector to work from,
# but we'll grow it later if it's not big enough.
allnum <- 1:1000
# Keep a logical record of what has been used
used <- c(rep(TRUE, 3), rep(FALSE, length(allnum) - 3))

for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already:
  # nothing to do -- used already records that.
  repeat {
    # rule 3 - search the remaining set for numbers that have gcd == 1
    keep <- !used & gcd(a[n], allnum) == 1
    # rule 4 - search the remaining numbers for gcd > 1
    keep <- keep & gcd(a[n - 1], allnum) > 1
    # If we found anything, break out of this loop
    if (any(keep))
      break
    # Otherwise, make the set of possible values twice as big,
    # and try again
    allnum <- seq_len(2 * length(allnum))
    used <- c(used, rep(FALSE, length(used)))
  }
  # select the lowest: which.max() returns the index of the first TRUE
  newval <- which.max(keep)
  # Assign into the appropriate place
  a[n + 1] <- newval
  # Record that it has been used
  used[newval] <- TRUE
}
If you profile it, you'll see it spends most of its time in the gcd() function. You could probably make that a lot faster by redoing it in C or C++.
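For illustration, here is a minimal sketch of that idea using Rcpp (assuming the Rcpp package is installed; gcd_vec is a hypothetical name, not something defined above):
library(Rcpp)
# Hypothetical helper: gcd of one scalar against a whole vector,
# Euclidean algorithm in compiled code.
cppFunction('
IntegerVector gcd_vec(int a, IntegerVector y) {
  int n = y.size();
  IntegerVector out(n);
  for (int i = 0; i < n; ++i) {
    int x = a, z = y[i];
    while (z != 0) { int t = x % z; x = z; z = t; }
    out[i] = x;
  }
  return out;
}')
gcd_vec(12L, c(8L, 9L, 10L))  # 4 3 2
In the loop above, gcd(a[n], allnum) would then become gcd_vec(a[n], allnum).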
The biggest change here is pre-allocation and restricting the search to numbers that have not yet been used.
library(numbers)

N <- 5e3
a <- integer(N)
a[1:3] <- 1:3
b <- logical(N) # which numbers have been used already?
b[1:3] <- TRUE
NN <- 1:N

system.time({
  for (n in 4:N) {
    a1 <- a[n - 1L]
    a2 <- a[n - 2L]
    for (k in NN[!b]) {
      if (GCD(k, a1) == 1L && GCD(k, a2) > 1L) {
        a[n] <- k
        b[k] <- TRUE
        break
      }
    }
    if (!a[n]) {
      a <- a[1:(n - 1L)]
      break
    }
  }
})
#> user system elapsed
#> 1.28 0.00 1.28
length(a)
#> [1] 1137
For a fast C++ algorithm, see here.
I am working with the R programming language. I am trying to build a loop that performs the following :
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many random numbers had to be generated for Step 1 to be completed
Step 3: Repeat Step 1 and Step 2 100 times
Since I do not know how to keep generating random numbers until a condition is met, I tried generating a large batch of random numbers, hoping that the condition would be met somewhere within it (there is probably a better way to write this):
results <- list()

for (i in 1:100){
  # do until break
  repeat {
    # generate many random numbers at once
    a = rnorm(10000, 10, 1)
    b = rnorm(10000, 10, 1)
    # does any pair meet the requirement?
    if (any(a > 12 & b > 12)) {
      # put it in a data.frame
      d_i = data.frame(a, b)
      # end repeat
      break
    }
  }
  # select all rows until the first time the requirement is met
  # it must be met, otherwise the loop would not have ended
  d_i <- d_i[1:which(d_i$a > 10 & d_i$b > 10)[1], ]
  # prep other variables and only keep last row (i.e. the row where the condition was met)
  d_i$index = seq_len(nrow(d_i))
  d_i$iteration = as.factor(i)
  e_i = d_i[nrow(d_i), ]
  results[[i]] <- e_i
}
results_df <- do.call(rbind.data.frame, results)
Problem: When I look at the results, I noticed that the loop is incorrectly considering the condition to be met, for example:
head(results_df)
a b index iteration
4 10.29053 10.56263 4 1
5 10.95308 10.32236 5 2
3 10.74808 10.50135 3 3
13 11.87705 10.75067 13 4
1 10.17850 10.58678 1 5
14 10.14741 11.07238 1 6
For instance, in each one of these rows, both "a" and "b" are smaller than 12.
Does anyone know why this is happening and can someone please show me how to fix this problem?
Thanks!
The reason this happens in your code is the row-selection step: d_i[1:which(d_i$a > 10 & d_i$b > 10)[1], ] tests against 10 instead of 12, so the loop stops at the first pair where both values merely exceed 10. Since you tagged while-loop, here is an approach using one:
res <- matrix(0, nrow = 0, ncol = 3)

for (j in 1:100){
  a <- rnorm(1, 10, 1)
  b <- rnorm(1, 10, 1)
  i <- 1
  while (a < 12 || b < 12) {
    a <- rnorm(1, 10, 1)
    b <- rnorm(1, 10, 1)
    i <- i + 1
  }
  x <- c(a, b, i)
  res <- rbind(res, x)
}
head(res)
[,1] [,2] [,3]
x 12.14232 12.08977 399
x 12.27158 12.01319 1695
x 12.57345 12.42135 302
x 12.07494 12.64841 600
x 12.03210 12.07949 82
x 12.34006 12.00365 782
dim(res)
[1] 100 3
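For completeness, the same logic can be wrapped in a helper function and run with replicate; this is just a sketch (draw_until is a made-up name), not a faster method:
draw_until <- function(threshold = 12) {
  n <- 0
  repeat {
    n <- n + 1
    a <- rnorm(1, 10, 1)
    b <- rnorm(1, 10, 1)
    # stop once both draws exceed the threshold
    if (a > threshold && b > threshold) return(c(a = a, b = b, tries = n))
  }
}
res <- t(replicate(100, draw_until()))
head(res)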
I'm trying to calculate the rolling mean of the previous k non-NA values within the dplyr/tidyverse framework. I've written a function that seems to work but was wondering if there's already a function from some package (which will probably be much more efficient than my attempt) doing exactly this. An example dataset:
tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)
Let's say I want the rolling mean of the previous 3 non-NA values. Then the output y should be:
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
The first 5 elements of y are NAs because the first time x has 3 previous non-NA values is on row 6 and the average of those 3 elements is 2. The next y elements are self-explanatory. Row 9 gets a 4 because the 3 previous non-NA values of x are in rows 5, 6, and 7 and so on.
My attempt is this:
roll_mean_previous_k <- function(x, k){
require(dplyr)
res <- NA
lagged_vector <- dplyr::lag(x)
lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
previous_k_values <- tail(lagged_vector_without_na, k)
if (length(previous_k_values) >= k) res <- mean(previous_k_values)
res
}
to be used as follows (using the slide_dbl function from the slider package):
library(dplyr)

tmp.df %>%
  mutate(
    y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
  )
which gives the desired output. However, I'm wondering if there's a ready-made and (as mentioned before) more efficient way of doing this. I should mention that I know of rollmean and roll_mean from the zoo and RcppRoll packages respectively, but unless I'm mistaken, they work on a fixed rolling window, with options for dealing with NA values (e.g. ignoring them). In my case, I want to "extend" the window until it includes k non-NA values.
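To illustrate the difference, a fixed-width right-aligned mean that merely ignores NAs (a quick sketch with zoo::rollapplyr) does not produce the desired y, because its window counts positions rather than non-NA values:
library(zoo)
rollapplyr(tmp.df$x, width = 3, FUN = mean, na.rm = TRUE, fill = NA)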
Any thoughts/suggestions are welcome.
EDIT - SIMULATION RESULTS
Thank you to all contributors. First of all, I had not mentioned that my datasets are indeed much larger and this runs often, so any performance improvements are most welcome. I therefore ran the following simulation to check execution times before deciding which answer to accept. Note that some of the answers needed small tweaks to return the desired output, but if you feel that your solution is misrepresented (and therefore less efficient than intended), feel free to let me know and I'll edit accordingly. I've used G. Grothendieck's trick from his answer below to remove the need for if-else checks regarding the length of the lagged, non-NA vector.
So here's the simulation code:
library(tidyverse)
library(runner)
library(zoo)
library(slider)
library(purrr)
library(microbenchmark)

set.seed(20211004)
test_vector <- sample(x = 100, size = 1000, replace = TRUE)
test_vector[sample(1000, size = 250)] <- NA

# Based on GoGonzo's answer and the runner package
f_runner <- function(z, k){
  runner(
    x = z,
    f = function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    }
  )
}

# Based on my initial answer (but simplified), also mentioned by GoGonzo
f_slider <- function(z, k){
  slide_dbl(
    z,
    function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    .before = Inf
  )
}

# Based on helios' answer. Returns the correct results but with a warning.
f_helios <- function(z, k){
  reduced_vec <- na.omit(z)
  unique_means <- rollapply(reduced_vec, width = k, mean)
  start <- which(!is.na(z))[k] + 1
  repeater <- which(is.na(z)) + 1
  repeater_cut <- repeater[(repeater > start - 1) & (repeater <= length(z))]
  final <- as.numeric(rep(NA, length(z)))
  index <- start:length(z)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[start:length(final)] <- na.locf(final)
  final
}

# Based on G. Grothendieck's answer (but I couldn't get it to run with the performance improvements)
f_zoo <- function(z, k){
  rollapplyr(
    z,
    seq_along(z),
    function(x, k){
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    k
  )
}

# Based on AnilGoyal's answer
f_purrr <- function(z, k){
  map_dbl(
    seq_along(z),
    ~ ifelse(
      length(tail(na.omit(z[1:(.x - 1)]), k)) == k,
      mean(tail(na.omit(z[1:(.x - 1)]), k)),
      NA
    )
  )
}

# Check if all are identical #
all(
  sapply(
    list(
      # f_helios(test_vector, 10),
      f_purrr(test_vector, 10),
      f_runner(test_vector, 10),
      f_zoo(test_vector, 10)
    ),
    FUN = identical,
    f_slider(test_vector, 10)
  )
)

# Run benchmarking #
microbenchmark(
  # f_helios(test_vector, 10),
  f_purrr(test_vector, 10),
  f_runner(test_vector, 10),
  f_slider(test_vector, 10),
  f_zoo(test_vector, 10)
)
And the results:
Unit: milliseconds
                      expr     min       lq     mean   median       uq      max neval cld
  f_purrr(test_vector, 10) 31.9377 37.79045 39.64343 38.53030 39.65085 104.9613   100   c
 f_runner(test_vector, 10) 23.7419 24.25170 29.12785 29.23515 30.32485  98.7239   100   b
 f_slider(test_vector, 10) 20.6797 21.71945 24.93189 26.52460 27.67250  32.1847   100   a
    f_zoo(test_vector, 10) 43.4041 48.95725 52.64707 49.59475 50.75450 122.0793   100   d
Based on the above, and unless the code can be further improved, it seems that the slider and runner solutions are the fastest. Any final suggestions are more than welcome.
Many thanks for your time!!
With runner it will be something like the mean of a 3-element tail window of non-NA values. You can achieve the same result with slider:
library(runner)

tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)

# using runner
tmp.df$y_runner <- runner(
  x = tmp.df$x,
  f = function(x) {
    mean(tail(x[!is.na(x)], 3))
  }
)

# using slider
tmp.df$y_slider <- slider::slide_dbl(
  tmp.df$x,
  function(x) {
    mean(tail(x[!is.na(x)], 3))
  },
  .before = Inf
)

tmp.df
# x y_runner y_slider
# 1 NA NaN NaN
# 2 1 1.0 1.0
# 3 2 1.5 1.5
# 4 NA 1.5 1.5
# 5 3 2.0 2.0
# 6 4 3.0 3.0
# 7 5 4.0 4.0
# 8 NA 4.0 4.0
# 9 NA 4.0 4.0
# 10 NA 4.0 4.0
# 11 6 5.0 5.0
# 12 7 6.0 6.0
# 13 NA 6.0 6.0
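Note that the first rows differ from the expected output in the question (no leading NAs) because this version averages however many non-NA values are available and includes the current element. As the question's edit shows, a small tweak -- dropping the current element with head(x, -1) and padding to exactly 3 values with `length<-` -- reproduces the expected y, e.g. with slider:
tmp.df$y <- slider::slide_dbl(
  tmp.df$x,
  function(x) mean(`length<-`(tail(na.omit(head(x, -1)), 3), 3)),
  .before = Inf
)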
rollapplyr. Regarding the comment about rollmean in the question, zoo also has rollapply and rollapplyr (right aligned), and those allow different widths (and offsets) for each component of the input by specifying a vector (as we do here) or list for width -- see ?rollapply for more info. We use a relatively naive vector of widths below and also show some improved width vectors which run faster.
Operation. Create a Mean function which takes a vector, removes the last element and all NAs, and takes the last k elements of what is left, extending it to k elements with NAs as needed. Finally, take the mean of that. We use rollapplyr to apply it to x with a width of seq_along(x).
Performance improvements. With this small data the following likely don't make much difference, but if you have larger data you could try these, which might improve the speed:
replace na.omit with na_rm from the collapse package
replace the second argument of rollapplyr with the expression shown below. The idea here is that the sum of the lengths of the k+1 longest runs of NA, plus k+1, bounds the number of elements that we need to consider. This (plus using na_rm) ran about 25% faster than the code in the question on a problem with 1300 rows (formed from 100 copies of the data in the question) and does not add much extra code.
pmin(with(rle(is.na(x)), sum(tail(sort(lengths[values]), k+1)))+k+1, seq_along(x))
replace the second argument of rollapplyr with w, where w is computed as shown below. The idea here is to use findInterval to find the element k non-NAs back, which provides an even tighter bound. This one (plus using na_rm) ran nearly twice as fast as the code in the question when tried on the same 1300 rows, at the expense of adding 2 more lines of code.
tt <- length(x) - rev(cumsum(rev(!is.na(x))))
w <- seq_along(tt) - findInterval(tt - k - 1, tt)
Code. With the data in the question, the code below (not using the above improvements) ran slightly faster than the code in the question in my benchmarking, and it is only two lines of code.
library(dplyr)
library(zoo)
Mean <- function(x, k) mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
tmp.df %>% mutate(y = rollapplyr(x, seq_along(x), Mean, k = 3))
giving:
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
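For larger data, here is a sketch of how the two improvements above might be combined (assuming the collapse package for na_rm; Mean2 is just a renamed Mean, and this sketch is not benchmarked here):
library(zoo)
library(collapse)
k <- 3
x <- tmp.df$x
# Mean as above, but with collapse::na_rm instead of na.omit
Mean2 <- function(x, k) mean(`length<-`(tail(na_rm(head(x, -1)), k), k))
# tighter per-element widths via findInterval, as described above
tt <- length(x) - rev(cumsum(rev(!is.na(x))))
w <- seq_along(tt) - findInterval(tt - k - 1, tt)
rollapplyr(x, w, Mean2, k = k)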
Since I am not aware of a ready-made way of computing your output in any standard library, I came up with the implementation roll_mean_k_efficient below, which seems to speed up your computations considerably. Note that this implementation makes use of the rollapply and na.locf functions from the zoo package.
rm(list = ls())

library("zoo")
library("rbenchmark")
library("dplyr")

# your sample (extended)
x <- rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
tmp.df <- data.frame(x = x)

# enhanced implementation
roll_mean_k_efficient <- function(x, k){
  reduced_vec <- na.omit(x)
  unique_means <- rollapply(reduced_vec, width = k, mean)
  start <- which(!is.na(x))[k] + 1
  repeater <- which(is.na(x)) + 1
  repeater_cut <- repeater[(repeater > start - 1) & (repeater <= length(x))]
  final <- as.numeric(rep(NA, length(x)))
  index <- start:length(x)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[start:length(final)] <- na.locf(final)
  final
}

# old implementation
roll_mean_previous_k <- function(x, k){
  res <- NA
  lagged_vector <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}

# wrapper function for the benchmarking below
roll_mean_benchmark <- function(){
  res <- tmp.df %>%
    mutate(
      y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
    )
  return(res)
}

# some benchmarking
benchmark(
  roll_mean_k_efficient(x = x, k = 3),
  roll_mean_benchmark(),
  columns = c('test', 'elapsed', 'replications'),
  replications = 100
)
Furthermore, I extended your example vector x to get some more reliable benchmark results via the benchmark function from the rbenchmark package.
In my case the benchmark runtimes that are printed after running the code are:
test elapsed replications
2 roll_mean_benchmark() 4.463 100
1 roll_mean_k_efficient(x = x, k = 3) 0.039 100
Without using zoo. In tidyverse fashion, you can also do it using purrr::map_dbl:
tmp.df %>%
  mutate(
    y = purrr::map_dbl(
      seq_along(x),
      ~ ifelse(
        length(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)) == 3,
        mean(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)),
        NA
      )
    )
  )
x y
1 NA NA
2 1 NA
3 2 NA
4 NA NA
5 3 NA
6 4 2
7 5 3
8 NA 4
9 NA 4
10 NA 4
11 6 4
12 7 5
13 NA 6
I am a newbie in R.
I am trying to calculate the deviation between columns, with several rules applied:
the deviation is calculated as the current value minus the previous value
if the current value is NA, return NA without calculating
if the previous value is NA, subtract the value before the previous one instead, and so on, until a valid value is reached
the value in the first column is always valid
For example:
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
test
start a b
1 1 2 4
2 2 NA 5
3 3 5 NA
4 4 6 8
Expected:
result
a_delta b_delta
1 1 2
2 NA 3
3 2 NA
4 2 2
Note:
cell (2, 1) in result is NA because cell (2, 2) in test is NA
cell (2, 2) in result is 3 because cell (2, 3) minus cell (2, 1) in test gives 3
Here is my broken code. Any suggestions are welcome:
f <- function(data){
  cn <- colnames(data)
  cl <- ncol(data)
  for (i in 2:cl)){
    if (is.na(data$i)) {a <- NA}
    else if (!is.na(data$(i-1))) {paste(cn[i], "_delta") <- data$cn[i] - data$cn[i-1]}
    else { # check repeatedly whether the previous value is NA
      t < i - 1
      while (is.na(data$cn[t])) {
        t <- t - 1
      }
      paste(cn[i], "_delta") <- data$cn[i] - data$cn[t]
    }
  }
}
f(test)
Main function:
CalcDev <- function(x){
  x <- unlist(x)
  if (!any(is.na(x))){
    return(diff(x, 1))
  } else {
    tmp <- x[-1]
    res <- diff(na.omit(x), 1)
    tmp[!is.na(tmp)] <- res
    return(tmp)
  }
}
And how to use:
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
plyr::adply(test, 1, CalcDev)[, -1]
Result:
a b
1 1 2
2 NA 3
3 2 NA
4 2 2
You just need to rename columns.
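For example, a small sketch of that final step (the new names are my choice, matching the expected output in the question):
result <- plyr::adply(test, 1, CalcDev)[, -1]
names(result) <- c("a_delta", "b_delta")
result
#   a_delta b_delta
# 1       1       2
# 2      NA       3
# 3       2      NA
# 4       2       2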
I was unable to run your code, so no benchmark.
EDIT:
Answering your comment: you can use the CalcDev function in a dplyr chain if you vectorize it:
CalcDev.Vect <- Vectorize(CalcDev)

test %>%
  CalcDev.Vect %>%
  .[, -1] %>%
  as.data.frame
You will get similar results, and it will be much faster, especially for bigger data sets.
There are two alternatives: using CalcDev inside do({}) or adply directly in the chain, but both will be slower.
Benchmarks for a small data set:
expr min lq mean median uq max neval cld
foo.plyr(test) 2240.34 2392.08 2511.3 2490.13 2577.32 3199.16 100 b
foo.do_dplyr(test) 2680.34 2933.70 3104.4 3015.15 3109.48 5771.83 100 d
foo.plyr_in_dplyr(test) 2471.51 2635.04 2805.7 2702.99 2802.29 9422.46 100 c
foo.Vect(test) 441.55 490.58 539.7 539.92 564.74 928.41 100 a
And for bigger data sets the difference in evaluation time will be even more drastic.
Good morning,
I have the following problem.
My data.frame "data" has the format:
Type amount
1 2
2 0
3 3
I would like to create a vector with the format:
1
1
3
3
3
This means I would like to transform my data.
I created a vector and wrote the following code for my transformation in R:
vector <- numeric(5)
for (i in 1:3){
  k <- 1
  while (k <= data[i,2]){
    vector[k] <- data[i,1]
    k <- k+1
  }
}
The problem is, I get the following result and I have no idea where I go wrong:
3
3
3
0
0
There might be many different ways of solving this particular problem in R, but I am curious why my solution doesn't work. I am thankful for alternatives, but I would really like to know what my mistake is.
Thanks for your help!
Try this solution:
df <- data.frame(type = c(1, 2, 3), amount = c(2, 0, 3))
result <- unlist(mapply(function(x, y) rep.int(x, y), df[, "type"], df[, "amount"]))
result
Output is following:
# [1] 1 1 3 3 3
Exactly, your code is buggy. The corrected code looks like the following:
df <- data.frame(type = c(1, 2, 3), amount = c(2, 0, 3))
vector <- numeric(5)
k <- 1
for (i in 1:3) {
  j <- 1
  while (j <= df[i, 2]) {
    vector[k] <- df[i, 1]
    k <- k + 1
    j <- j + 1
  }
}
vector
# [1] 1 1 3 3 3
Probably the fastest and most elegant way to obtain this result has been posted before in a comment by @akrun:
with(data, rep(Type, amount))
[1] 1 1 3 3 3
However, if you want to do this with for/while loops, it could be helpful to use a list for such cases, where the number of entries is not known at the beginning.
Here is an example with minimal modifications of your code:
my_list <- vector("list", 3)
for (i in 1:3) {
  k <- 1
  while (k <= data[i,2]){
    my_list[[i]][k] <- data[i,1]
    k <- k + 1
  }
}
vector <- unlist(my_list)
#> vector
#[1] 1 1 3 3 3
The reason why your code didn't work is essentially that you were trying to put too much information into a single variable, k. It cannot serve as both an index into your output vector and a counter for the individual entries in the first column of data, a counter which is reset to 1 each time the while loop finishes. As a result, each row's values overwrite the previous row's starting from position 1, which is why you only see the last row's values in your output.