I wrote this function to find the factorial of a number:
fact <- function(n) {
  if (n < 0) {
    cat("Sorry, factorial does not exist for negative numbers", "\n")
  } else if (n == 0) {
    cat("The factorial of 0 is 1", "\n")
  } else {
    results <- 1
    for (i in 1:n) {
      results <- results * i
    }
    cat(paste("The factorial of", n, "is", results, "\n"))
  }
}
Now I want to implement memoization in R. I have a basic knowledge of R and am trying to implement it with what I know, but I am not sure this is the way forward. Could you please also elaborate on this topic? Thanks in advance.
Memoized Factorial
fact_tbl <- c(1, rep(NA, 100))   # fact_tbl[n] caches factorial(n)
fact_mem <- function(n){
  stopifnot(n > 0)
  if (!is.na(fact_tbl[n])) {
    fact_tbl[n]
  } else {
    fact_tbl[n] <<- fact_mem(n - 1) * n
  }
}
print(fact_mem(4))
First of all, if you need an efficient implementation, use R's factorial function; don't write it yourself. That said, the factorial is a good exercise for understanding recursion:
myfactorial <- function(n) {
  if (n == 1) return(1)
  n * myfactorial(n - 1)
}
myfactorial(10)
#[1] 3628800
With this function, memoization is only useful if you intend to call the function repeatedly. You can implement memoization in R using closures; Hadley explains these in his book.
createMemFactorial <- function() {
  res <- 1
  memFactorial <- function(n) {
    if (n == 1) return(1)
    # grow res if necessary
    if (length(res) < n) res <<- `length<-`(res, n)
    # return pre-calculated value
    if (!is.na(res[n])) return(res[n])
    # calculate new values
    res[n] <<- n * factorial(n - 1)
    res[n]
  }
  memFactorial
}
memFactorial <- createMemFactorial()
memFactorial(10)
#[1] 3628800
Is it actually faster?
library(microbenchmark)
microbenchmark(factorial(10),
myfactorial(10),
memFactorial(10))
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# factorial(10) 235 264.0 348.02 304.5 378.5 2463 100 a
# myfactorial(10) 4799 5279.5 6491.94 5629.0 6044.5 15955 100 c
# memFactorial(10) 950 1025.0 1344.51 1134.5 1292.0 7942 100 b
Note that microbenchmark evaluates the functions (by default) 100 times. Since we have stored the value for n = 10 when testing the memFactorial, we time only the if conditions and the lookup here. As you can also see, R's implementation, which is mostly written in C, is faster.
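If you instead want to time the first, uncached computation, one option (a small sketch reusing only the definitions above) is to rebuild the closure inside the benchmarked expression:
# Sketch: an uncached first call (fresh closure, so the value for n = 10 is
# recomputed each time) versus the cached lookup in the memFactorial above
microbenchmark(uncached = createMemFactorial()(10),
               cached   = memFactorial(10))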
A better (and easier) example implements Fibonacci numbers. Here the algorithm itself benefits from memoization.
# naive recursive implementation
fib <- function(n) {
  if (n == 1 || n == 2) return(1)
  fib(n - 1) + fib(n - 2)
}

# with memoization
fibm <- function(n) {
  if (n == 1 || n == 2) return(1)
  seq <- integer(n)
  seq[1:2] <- 1
  calc <- function(n) {
    if (seq[n] != 0) return(seq[n])
    seq[n] <<- calc(n - 1) + calc(n - 2)
    seq[n]
  }
  calc(n)
}
#try it:
fib(20)
#[1] 6765
fibm(20)
#[1] 6765
#Is memoization faster?
microbenchmark(fib(20),
fibm(20))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# fib(20) 8005.314 8804.130 9758.75325 9301.6210 9798.8500 46867.182 100 b
#fibm(20) 38.991 44.798 54.12626 53.6725 60.4035 97.089 100 a
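Note that fibm rebuilds its cache on every top-level call; if you want the cache to persist across calls, you can wrap it in a closure just like createMemFactorial above. A sketch (createMemFib and fibm2 are names made up here):
# memoized Fibonacci whose cache survives across calls
createMemFib <- function() {
  seq <- c(1, 1)
  function(n) {
    if (length(seq) < n) seq <<- `length<-`(seq, n)
    calc <- function(k) {
      if (!is.na(seq[k])) return(seq[k])
      seq[k] <<- calc(k - 1) + calc(k - 2)
      seq[k]
    }
    calc(n)
  }
}
fibm2 <- createMemFib()
fibm2(20)
#[1] 6765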
Related
I am trying to sum the odd numbers below a specific number N (excluding N itself); for example, for N = 5 the result is 1 + 3 = 4.
a <- 5
sum <- function(x){
  k <- 0
  for (n in x) {
    if (n %% 2 == 1)
      k <- k + 1
  }
  return(k)
}
sum(a)
# [1] 1
But the function is not working, because it counts the odd numbers instead of summing them.
We may use a vectorized approach:
a1 <- head(seq_len(a), -1)
sum(a1[a1%%2 == 1])
[1] 4
If we want a loop, perhaps
f1 <- function(x) {
  s <- 0
  k <- 1
  while (k < x) {
    if (k %% 2 == 1) {
      s <- s + k
    }
    k <- k + 1
  }
  s
}
f1(5)
The issue in the OP's code is
for(n in x)
where x is just a single value, so the loop body runs only once - i.e. if the input is 5, 'n' takes the single value 5. Instead, iterate over seq_len(x - 1). The corrected loop would be something like
f2 <- function(x){
  k <- 0
  for (n in seq_len(x - 1)) {
    if (n %% 2 == 1) {
      k <- k + n
    }
  }
  k
}
f2(5)
NOTE: sum is a base R function, so it is better to give the custom function a different name.
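For example, with the definition above the base function is masked:
sum(1:3)        # 2 -- the custom function counts the odd values instead of summing
base::sum(1:3)  # 6
rm(sum)         # removes the mask and restores base::sum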
Mathematically, we can use the following to calculate the sum (N can be odd or even): there are ceiling((N - 1) / 2) odd numbers below N, and the sum of the first k odd numbers is k^2.
(ceiling((N - 1) / 2))^2
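A quick sanity check of the closed form against the f2() loop defined earlier (just a sketch; N_vals is only a helper name):
N_vals <- 1:10
all(sapply(N_vals, function(N) (ceiling((N - 1) / 2))^2) == sapply(N_vals, f2))
# [1] TRUE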
It's simple and it does what it says:
sum(seq(1, length.out = floor(N/2), by = 2))
The multiplication solution is probably going to be quicker, though.
NB - an earlier version of this answer was
sum(seq(1, N - 1, 2))
which, as #tjebo points out, silently gives the wrong answer for N = 1.
We could use a recycled logical vector to select every other value:
a <- 5
a1 <- head(seq_len(a), -1)
sum(a1[c(TRUE, FALSE)])
output:
[1] 4
Fun benchmarking. Is it any surprise that Thomas' simple formula is by far the fastest solution...?
count_odds_thomas <- function(x){
  (ceiling((x - 1) / 2))^2
}
count_odds_akrun <- function(x){
  a1 <- head(seq_len(x), -1)
  sum(a1[a1 %% 2 == 1])
}
count_odds_dash2 <- function(x){
  sum(seq(1, x - 1, 2))
}
m <- microbenchmark::microbenchmark(
  akrun = count_odds_akrun(10^6),
  dash2 = count_odds_dash2(10^6),
  thomas = count_odds_thomas(10^6)
)
m
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> akrun 22117564 26299922.0 30052362.16 28653712 31891621 70721894 100
#> dash2 4016254 4384944.0 7159095.88 4767401 8202516 52423322 100
#> thomas 439 935.5 27599.34 6223 8482 2205286 100
ggplot2::autoplot(m)
Moreover, Thomas' solution works on really big numbers (also no surprise)... on my machine, count_odds_akrun exhausts the memory at a “mere” 10^10, but Thomas' formula works fine up to Infinity…
count_odds_thomas(10^10)
#> [1] 2.5e+19
count_odds_akrun(10^10)
#> Error: vector memory exhausted (limit reached?)
Let's say you have a function that takes a number as input and outputs a vector. However, the size of the output vector depends on the input, and you can't compute it before running the function.
For example, take the famous 3N+1 algorithm. A simple implementation, returning the whole path until it reaches 1, could look like this:
compute <- function(x) {
  if (x %% 2 == 0)
    return(x / 2)
  return(3 * x + 1)
}

algo <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  while (x != 1) {
    x <- compute(x)
    output <- c(output, x)
  }
  return(output)
}
The algo function returns the whole path from an input x down to 1. As you can tell, the output variable grows dynamically, using the c() (combine) function.
Are there any alternatives to this? Is growing a list faster? Should I adopt some classic dynamic vector logic, such as initializing an empty N-sized vector and doubling it every time it fills up?
EDIT: Please don't bother trying to optimize the way my helper functions are structured. I get it, but that's not the point here! I am only concerned with the c() function and alternatives to it.
Update
As per your edit, maybe you can try the following solution:
algo_TIC2 <- function(x) {
  res <- x
  repeat {
    u <- tail(res, 1)
    if (u != 1) {
      res[length(res) + 1] <- if (u %% 2) 3 * u + 1 else u / 2
    } else {
      return(res)
    }
  }
}
You can use recursion as below
compute <- function(x) if (x %% 2) 3*x + 1 else x / 2
algo_TIC1 <- function(x) {
  if (x == 1) {
    return(1)
  }
  c(x, algo_TIC1(compute(x)))
}
and you will see
> algo_TIC1(3000)
[1] 3000 1500 750 375 1126 563 1690 845 2536 1268 634 317 952 476 238
[16] 119 358 179 538 269 808 404 202 101 304 152 76 38 19 58
[31] 29 88 44 22 11 34 17 52 26 13 40 20 10 5 16
[46] 8 4 2 1
If you don't want any helper function, i.e., compute, you can try
algo_TIC1 <- function(x) {
  if (x == 1) {
    return(1)
  }
  c(x, algo_TIC1(if (x %% 2) 3*x + 1 else x / 2))
}
So, what bothers you is reallocation, and you are right. Let's see.
library(microbenchmark)
microbenchmark({
  a <- c()
  for (i in seq(1e4)) {
    a <- c(a, i)
  }
})

microbenchmark({
  a <- numeric(1e4)
  for (i in seq(1e4)) {
    a[[i]] <- i
  }
})

microbenchmark({
  a <- numeric(1)
  k <- 1
  for (i in seq(1e4)) {
    if (i > k) {
      a <- c(a, numeric(k))
      k <- k + k
    }
    a[[i]] <- i
  }
  a <- head(a, 1e4)
})
And the timings:
Append
min lq mean median uq max neval
78.0162 78.67925 83.36224 79.54515 81.79535 166.6988 100
Preallocate
min lq mean median uq max neval
1.484901 1.516051 1.567897 1.5552 1.569451 1.895601 100
Amortize
min lq mean median uq max neval
3.316501 3.377201 3.62415 3.484351 3.585701 11.7596 100
Never append many elements to a vector. If possible, preallocate, otherwise amortized allocation will do.
Even if you don't know the actual size beforehand, you may have an upper bound. Then you can still preallocate and truncate in the end. Even a reasonable estimate is useful: preallocate that size, and then resort to amortization if needed.
A remark: R is not good at loops. For small loops, for instance over the variables of a data frame or the files in a directory, there is usually no problem. But if you have a long computation that really needs many loops and cannot be vectorized, R might not be the right tool. On occasion, writing a function in C, C++, Fortran or Java can help: it's fairly easy to build plugins or to use Rcpp, and the performance gain is considerable.
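For instance, a minimal Rcpp sketch of the same Collatz path (assuming the Rcpp package is installed; the name algo_cpp is made up here). The C++ std::vector grows with amortized doubling, so appending inside the loop is cheap:
Rcpp::cppFunction('
NumericVector algo_cpp(double x) {
  std::vector<double> out;
  out.push_back(x);
  while (x != 1) {
    x = ((long long)x % 2 == 0) ? x / 2 : 3 * x + 1;
    out.push_back(x);
  }
  return wrap(out);
}')
algo_cpp(27)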
You can set the length of a vector and then make assignments to specific elements. Your code would look like this:
algo2 <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  index <- 1
  while (x != 1) {
    x <- compute(x)
    index <- index + 1
    if (index > length(output))
      length(output) <- 2 * length(output)
    output[index] <- x
  }
  return(output[seq_len(index)])
}
This makes a difference, though not a big one in your example, because all those calls to compute() (and return()!) are quite costly. If you folded that calculation into algo you'd see more improvement. You could also initialize output to a length that is likely to be good enough for most cases, and rarely need doubling.
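For completeness, a sketch of algo2 with compute() folded in (algo3 is just an illustrative name):
algo3 <- function(x) {
  output <- x
  index <- 1L
  while (x != 1) {
    x <- if (x %% 2) 3 * x + 1 else x / 2   # compute() inlined
    index <- index + 1L
    if (index > length(output))
      length(output) <- 2 * length(output)
    output[index] <- x
  }
  output[seq_len(index)]
}
algo3(27)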
I want to multiply all the elements of a matrix together. I can do it with two for loops or with apply. My intuition was that the for loops would be faster. Apply has to create a temporary vector to store the results of the product of rows, then apply product to that.
It still has to execute for loops to multiply all the elements so it's just an extra operation of storing the intermediate results that the for loops approach doesn't have to do. Yet it's still about 4 times faster. Why is that?
cols <- 1000
rows <- 1000
a <- matrix(runif(cols * rows, 1, 2), nrow = rows)
system.time({
  result <- 1
  for (i in 1:nrow(a)) {
    for (j in 1:ncol(a)) {
      result <- result * a[i, j]
    }
  }
})
# 0.09s
system.time(result <- prod(apply(a, 1, prod)))
# 0.01s
Here's what I got from an effort to benchmark various methods. I have some concerns about the fact that Inf was the result of many of the calculations, and I wonder whether restricting the values to the range 0-1 might give different results. Like #badmax, I was surprised that prod(a) was relatively slow; it seemed to me that it should have been coded in C and be more efficient. I also reasoned that a column-oriented approach might be faster than a row-oriented approach, since that is how matrices in R are stored, and that turned out to be correct:
library(microbenchmark)
cols <- 1000
rows <- 1000
a <- matrix(runif(cols * rows, 1, 2), nrow = rows)
microbenchmark(
  loop1 = {
    result <- 1
    for (i in 1:nrow(a)) {
      for (j in 1:ncol(a)) {
        result <- result * a[i, j]
      }
    }
  },
  loop2 = {
    result <- 1
    for (j in 1:ncol(a)) {
      result <- result * prod(a[, j])
    }
  },
  loop3 = {
    result <- 1
    for (i in 1:nrow(a)) {
      result <- result * prod(a[i, ])
    }
  },
  apply_test = {result <- prod(apply(a, 1, prod))},
  prod_test = {result <- prod(a)},
  Reduce_test = {result <- Reduce("*", a)},
  log_sum = {result <- exp(sum(log(a)))})  # since sum of logs == log of prod
#====================
Unit: milliseconds
expr min lq mean median uq max neval cld
loop1 58.872740 59.782277 60.665321 60.246169 61.156176 67.33558 100 c
loop2 5.314437 5.843748 7.316167 6.024948 6.626402 57.36532 100 a
loop3 9.614727 10.248335 11.521343 10.541872 10.947829 45.08280 100 ab
apply_test 8.336721 8.924148 9.960122 9.166424 9.429118 17.35621 100 ab
prod_test 94.314333 95.438939 95.956394 95.911858 96.286444 98.54229 100 d
Reduce_test 292.907175 312.754959 389.959756 354.369616 511.151578 545.80829 100 e
log_sum 19.258281 19.916965 22.333617 20.134510 20.551704 180.18492 100 b
I think the apply_test outcome is essentially doing the same work as loop2, perhaps with a bit of overhead penalty for apply. Here are the results for a test case where the range of random values was restricted to [0, 1] (instead of [1, 2]); they do confirm my suspicion that some of the difference lies in the handling of Inf values:
Unit: milliseconds
expr min lq mean median uq max neval cld
loop1 56.693831 58.846847 59.688896 59.448108 60.208619 63.005431 100 c
loop2 5.667955 5.907125 10.090634 6.109151 6.532552 183.985617 100 ab
loop3 9.533779 10.080330 12.760057 10.431867 10.734991 183.262217 100 ab
apply_test 8.144015 8.622861 9.940263 8.904425 9.962390 17.470028 100 ab
prod_test 1.327710 1.371449 1.411990 1.394160 1.432646 1.677596 100 a
Reduce_test 293.697339 312.384739 385.437743 356.997439 500.446356 557.253762 100 d
log_sum 22.444015 23.224879 24.064932 23.539085 24.210656 29.774315 100 b
The prod function is now rescued from its lowly position relative to the loops and apply methods.
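To see where the Inf concern comes from, a quick check (a12 and a01 are just illustrative names; whether Inf arithmetic is actually slower will depend on the platform's floating-point handling):
# with entries in [1, 2] the running product overflows to Inf long before the
# end, so most multiplications involve Inf; with entries in [0, 1] it
# underflows towards 0 instead
a12 <- matrix(runif(1000 * 1000, 1, 2), nrow = 1000)
a01 <- matrix(runif(1000 * 1000, 0, 1), nrow = 1000)
prod(a12)
# [1] Inf
prod(a01)
# [1] 0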
To vectorize instead of using apply, you can use rowProds from matrixStats:
library(matrixStats)
microbenchmark::microbenchmark({AA = prod(rowProds(a))}, times = 10)
takes around 18 milliseconds
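Since R matrices are stored column-major, the column-wise counterpart may be marginally faster (a sketch; colProds() also comes from matrixStats):
microbenchmark::microbenchmark({BB = prod(colProds(a))}, times = 10)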
It looks like apply is faster because there is still some vectorization going on. Consider this for loop:
system.time({
  result <- 1
  for (i in 1:nrow(a)) {
    result <- result * prod(a[i, ])
  }
})
which, for me, runs about as fast as the apply version.
I have a random vector vec and want to make a new vector L without using a loop. Each new element of L depends on previous elements of L and on vec.
set.seed(0)
vec <- rnorm(20, 0)
i <- 2
N <- length(vec) - 1
L <- numeric(N - 1)
constant <- 0.6
while (i < N) {
  L[i] <- vec[i + 1] - vec[i] - constant * L[i - 1]
  i <- i + 1
}
L
# [1] 0.0000000 1.6560326 -1.0509895 -0.2271942 -1.8182750 1.7023480 -0.3875622 0.5214906 2.0975262 -2.8995756 0.1771427
# [12] -0.4549334 1.1311555 -0.6884468 0.3007724 0.4832709 -1.4341071 2.1880687
You want
L[1] = 0
L[i] = -constant * L[i - 1] + (vec[i + 1] - vec[i]), i = 2, 3, ...
Letting dv <- diff(vec), the second line becomes
L[i] = -constant * L[i - 1] + dv[i], i = 2, 3, ...
that is, an AR(1) process with lag-1 auto-correlation -constant and innovations dv[-1]. An AR(1) process can be generated efficiently by filter() with the "recursive" method.
dv <- diff(vec)
L <- c(0, filter(dv[-1], -constant, "recursive"))
# [1] 0.0000000 1.6560326 -1.0509895 -0.2271942 -1.8182750 1.7023480
# [7] -0.3875622 0.5214906 2.0975262 -2.8995756 0.1771427 -0.4549334
#[13] 1.1311555 -0.6884468 0.3007724 0.4832709 -1.4341071 2.1880687
#[19] -2.9860629
I guess you mean while (i <= N) in your question. If you do want i < N, then you have to drop the last element above, which can be done by
dv <- diff(vec)
L <- c(0, filter(dv[2:(length(dv) - 1)], -constant, "recursive"))
hours later...
Rui Barradas's benchmark brought this back to my attention. For a short vec, any method is fast enough. For a long vec, filter is definitely faster, but in practice it suffers from coercion, as filter expects and returns a "ts" (time series) object. It is better to call its workhorse C routine straight away:
AR1_FILTER <- function(x, filter, full = TRUE) {
  n <- length(x)
  AR1 <- .Call(stats:::C_rfilter, as.double(x), as.double(filter), double(n + 1L))
  if (!full) AR1 <- AR1[-1L]
  AR1
}
dv <- diff(vec)
L <- AR1_FILTER(dv[-1], -constant)
#L <- AR1_FILTER(dv[2:(length(dv) - 1)], -constant)
I am not interested in comparing AR1_FILTER with R-level loop. I will just compare it with filter.
library(microbenchmark)
v <- runif(100000)
microbenchmark("R" = c(0, filter(v, -0.6, "recursive")),
"C" = AR1_FILTER(v, -0.6))
Unit: milliseconds
expr min lq mean median uq max neval
R 6.803945 7.987209 11.08361 8.074241 9.131967 54.672610 100
C 2.586143 2.606998 2.76218 2.644068 2.660831 3.845041 100
When you have to compute values based on previous values, the general-purpose answer is: no, there is no way around a loop.
In your case I would use a for loop. It's simpler.
M <- numeric(N - 1)
for (i in seq_len(N)[-c(1, N)])
  M[i] <- vec[i + 1] - vec[i] - constant * M[i - 1]
identical(L, M)
#[1] TRUE
Note the use of seq_len, not 2:(N - 1).
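The reason, in short: 2:(N - 1) silently counts down when N is small, whereas the seq_len() form gives an empty index vector. A small illustration:
N <- 2
2:(N - 1)             # 2 1  -- counts down, looping over unintended indices
seq_len(N)[-c(1, N)]  # integer(0) -- no iterations, as intended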
Edit.
I have timed both my solution and the one by user 李哲源. The results clearly favor my solution.
f1 <- function(vec, constant = 0.6){
  N <- length(vec) - 1
  M <- numeric(N - 1)
  for (i in seq_len(N)[-c(1, N)]) {
    M[i] <- vec[i + 1] - vec[i] - constant * M[i - 1]
  }
  M
}

f2 <- function(vec, constant = 0.6){
  dv <- diff(vec)
  c(0, c(stats::filter(dv[2:(length(dv) - 1)], -constant, "recursive")))
}
L1 <- f1(vec)
L2 <- f2(vec)
identical(L, L1)
identical(L, L2)
microbenchmark::microbenchmark(
  loop = f1(vec),
  filter = f2(vec)
)
On my PC, the ratio of the medians shows my code to be about 11 times faster.
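(For reference, a sketch of how that ratio can be read off a stored benchmark result; bm and s are just names used here:)
bm <- microbenchmark::microbenchmark(loop = f1(vec), filter = f2(vec))
s <- summary(bm)
s$median[s$expr == "filter"] / s$median[s$expr == "loop"]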
I was thinking about using Rcpp for this, but one of the answers mentioned the rfilter routine built into R, so I had a look:
/* recursive filtering */
SEXP rfilter(SEXP x, SEXP filter, SEXP out)
{
    if (TYPEOF(x) != REALSXP || TYPEOF(filter) != REALSXP
        || TYPEOF(out) != REALSXP) error("invalid input");
    R_xlen_t nx = XLENGTH(x), nf = XLENGTH(filter);
    double sum, tmp, *r = REAL(out), *rx = REAL(x), *rf = REAL(filter);
    for (R_xlen_t i = 0; i < nx; i++) {
        sum = rx[i];
        for (R_xlen_t j = 0; j < nf; j++) {
            tmp = r[nf + i - j - 1];
            if (my_isok(tmp)) sum += tmp * rf[j];
            else { r[nf + i] = NA_REAL; goto bad3; }
        }
        r[nf + i] = sum;
    bad3:
        continue;
    }
    return out;
}
This function already looks pretty good, and I don't think I could write an Rcpp version that beats it by much. I did a benchmark with this rfilter and the f1 function in the accepted answer:
f1 <- function(vec, constant = 0.6){
  N <- length(vec) - 1
  M <- numeric(N - 1)
  for (i in seq_len(N)[-c(1, N)]) {
    M[i] <- vec[i + 1] - vec[i] - constant * M[i - 1]
  }
  M
}

AR1_FILTER <- function(x, filter, full = TRUE) {
  n <- length(x)
  AR1 <- .Call(stats:::C_rfilter, as.double(x), as.double(filter), double(n + 1L))
  if (!full) AR1 <- AR1[-1L]
  AR1
}

f2 <- function(vec, constant) {
  dv <- diff(vec)
  AR1_FILTER(dv[2:(length(dv) - 1)], -constant)
}

library(microbenchmark)
Bench <- function(n) {
  vec <- runif(n)
  microbenchmark("R" = f1(vec, 0.6), "C" = f2(vec, 0.6))
}
For short vectors with length 100, I got
Bench(100)
Unit: microseconds
expr min lq mean median uq max neval
R 68.098 69.8585 79.05593 72.456 74.6210 244.148 100
C 66.423 68.5925 73.18702 69.793 71.1745 150.029 100
For large vectors with length 10000, I got
Bench(10000)
Unit: microseconds
expr min lq mean median uq max neval
R 6168.742 6699.9170 6870.277 6786.0415 6997.992 8921.279 100
C 876.934 904.6175 1192.000 931.9345 1034.273 2962.006 100
Yeah, there is no way that R is going to beat a compiled language.
library(dplyr)
L2 <- c(0,lead(vec) - vec - constant * lag(L))
L2 <- L2[!is.na(L2)]
L2
[1] 0.00000000 1.09605531 -0.62765133 1.81529867 -2.10535596 3.10864280 -4.36975556 1.41375965
[9] -1.08809820 2.16767510 -1.82140234 1.14748512 -0.89245650 0.03962074 -0.10930073 1.48162072
[17] -1.63074832 2.21593009
all.equal(L,L2)
[1] TRUE
With the data set here:
https://www.dropbox.com/s/gyimxbz5f3v0uq3/kfg.RData?dl=0
And executing the below code:
matrix(nrow=1600, ncol=8) -> ctw
for (k in 1:8) {
  for (i in 1:1600) {
    which(kfg[,9] == i) -> aj
    if (length(aj) != 0) {
      sample(kfg[aj,11], prob=kfg[aj,k], size=1) -> ctw[i,k]
    }
    ctw[i,k]
  }
}
This is doable, but the real data set has over 800k rows and it takes very long. Is there a way, in data.table or another package, to do this faster? The which() step is very slow.
I had to revise your original code to check for non-zero probabilities. I also removed the statement ctw[i,k] from the last line of the inner loop, because it has no effect. Your code is
matrix(nrow=1600, ncol=8) -> ctw
for (k in 1:8) {
  for (i in 1:1600) {
    which(kfg[,9] == i) -> aj
    if ((length(aj) != 0) && any(kfg[aj, k] > 0)) {
      sample(kfg[aj,11], prob=kfg[aj,k], size=1) -> ctw[i,k]
    }
  }
}
ctw
I reversed the order of the loops, so that kfg[,9] == i is only evaluated once instead of 8 times. I also took the test for length(aj) != 0 outside the loops using tabulate(). My revised code is
matrix(nrow=1600, ncol=8) -> ctw
which(tabulate(kfg[, 9], 1600) != 0) -> ii
for (i in ii) {
  kfg[,9] == i -> aj
  for (k in 1:8)
    if (any(kfg[aj, k] > 0))
      sample(kfg[aj,11], 1, prob=kfg[aj,k]) -> ctw[i,k]
}
ctw
This is approximately 5x faster for your sample data.
It is much faster to extract the vector of sample values kfg[[11]] (the same as kfg[, 11]) once, and to work with a matrix of probabilities, as.matrix(kfg[, 1:8]), rather than a data.frame. For the sample data it is marginally faster to hoist the split on column 9 out of the loop, and to avoid the conditional inside the k loop by doing a vectorized calculation outside the loop to identify the relevant indices:
nrow <- 1600
matrix(nrow=nrow, ncol=8) -> ctw
x <- kfg[[11]]
pr <- as.matrix(kfg[, 1:8])
ajs <- split(seq_len(nrow(kfg)), factor(kfg[[9]], levels=seq_len(nrow)))
ii <- seq_along(ajs)[lengths(ajs) > 0]
for (i in ii) {
  aj <- ajs[[i]]
  kk <- which(colSums(pr[aj,, drop=FALSE]) > 0)
  for (k in kk)
    sample(x[aj], 1, prob=pr[aj,k]) -> ctw[i,k]
}
ctw
These lead to a further 5x speed-up, so 25 times faster than the original.
To measure the speed, I enclosed each of the above in a function, e.g.,
f0 <- function() {
  matrix(nrow=1600, ncol=8) -> ctw
  for (k in 1:8) {
    for (i in 1:1600) {
      which(kfg[,9] == i) -> aj
      if ((length(aj) != 0) && any(kfg[aj, k] > 0)) {
        sample(kfg[aj,11], prob=kfg[aj,k], size=1) -> ctw[i,k]
      }
    }
  }
  ctw
}
and used the microbenchmark package
> library(microbenchmark)
> microbenchmark(f0(), f1(), f2(), times=10)
Unit: milliseconds
expr min lq mean median uq max neval cld
f0() 466.12527 483.43954 484.34258 483.74805 484.21627 521.19957 10 c
f1() 92.77415 94.79052 94.99273 95.10352 95.45368 96.10641 10 b
f2() 17.33708 17.83257 17.87095 17.87205 18.01723 18.16400 10 a
f1() and f2() should be identical, but they are not
> set.seed(123); res1 <- f1(); set.seed(123); res2 <- f2()
> all.equal(res1, res2)
[1] "'is.NA' value mismatch: 12096 in current 12133 in target"
Investigating, this is because the values in column 9 are numeric but not exactly whole numbers, so a comparison like kfg[, 9] == i (with an integer i) can fail even when the printed values look equal. For instance,
> kfg[[9]][(kfg[[9]] > 28 & kfg[[9]] <= 29)]
[1] 29 29 29
> kfg[[9]][(kfg[[9]] > 28 & kfg[[9]] <= 29)] == 29
[1] FALSE FALSE FALSE
Perhaps the intention is
kfg[[9]] = round(kfg[[9]])
With this change, we have
> all.equal(res1, res2)
[1] TRUE
> identical(res1, res2)
[1] TRUE