Sum the odd numbers below a number - R

I am trying to sum the odd numbers below a specific number (excluding the number itself). For example: N = 5 gives 1 + 3 = 4.
a <- 5
sum <- function(x) {
  k <- 0
  for (n in x) {
    if (n %% 2 == 1)
      k <- k + 1
  }
  return(k)
}
sum(a)
# [1] 1
But the function is not working: it counts the odd numbers instead of summing them.

We may use a vectorized approach:
a1 <- head(seq_len(a), -1)
sum(a1[a1 %% 2 == 1])
[1] 4
If we want a loop, perhaps
f1 <- function(x) {
  s <- 0
  k <- 1
  while (k < x) {
    if (k %% 2 == 1) {
      s <- s + k
    }
    k <- k + 1
  }
  s
}
f1(5)
The issue in the OP's code is
for(n in x)
where x is just a single value, so the loop body runs only once - i.e. if the input is 5, 'n' takes the single value 5. The loop should instead run over seq_len(x - 1). The correct loop would be something like:
f2 <- function(x) {
  k <- 0
  for (n in seq_len(x - 1)) {
    if (n %% 2 == 1) {
      k <- k + n
    }
  }
  k
}
f2(5)
NOTE: sum is a base R function, so it is better to give the custom function a different name.

Mathematically, we can use the following closed form to calculate the sum (N may be odd or even):
(ceiling((N - 1) / 2))^2
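This works because there are k = ceiling((N - 1) / 2) odd numbers below N, and the sum of the first k odd numbers is k^2. A quick sanity check against the brute-force sum (a throwaway sketch; check_N is just an illustrative name):
check_N <- function(N) {
  odds <- seq_len(N - 1)[seq_len(N - 1) %% 2 == 1]
  sum(odds) == (ceiling((N - 1) / 2))^2
}
all(sapply(2:100, check_N))
# [1] TRUE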

It's simple and it does what it says:
sum(seq(1, length.out = floor(N/2), by = 2))
The multiplication solution is probably going to be quicker, though.
NB - an earlier version of this answer was
sum(seq(1, N - 1, 2))
which, as @tjebo points out, fails for N = 1 (the length.out form correctly returns 0 there).

We could use a logical vector to index the values:
a <- 5
a1 <- head(seq_len(a), -1)
sum(a1[c(TRUE, FALSE)])
output:
[1] 4
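This works because the logical index c(TRUE, FALSE) is recycled along a1, keeping every other element starting with the first; since a1 starts at 1, those are exactly the odd values:
a1[c(TRUE, FALSE)]
# [1] 1 3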

Fun benchmarking. Is it a surprise that Thomas' simple formula is by far the fastest solution...?
count_odds_thomas <- function(x) {
  (ceiling((x - 1) / 2))^2
}
count_odds_akrun <- function(x) {
  a1 <- head(seq_len(x), -1)
  sum(a1[a1 %% 2 == 1])
}
count_odds_dash2 <- function(x) {
  sum(seq(1, x - 1, 2))
}
m <- microbenchmark::microbenchmark(
  akrun = count_odds_akrun(10^6),
  dash2 = count_odds_dash2(10^6),
  thomas = count_odds_thomas(10^6)
)
m
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> akrun 22117564 26299922.0 30052362.16 28653712 31891621 70721894 100
#> dash2 4016254 4384944.0 7159095.88 4767401 8202516 52423322 100
#> thomas 439 935.5 27599.34 6223 8482 2205286 100
ggplot2::autoplot(m)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Moreover, Thomas' solution works on really big numbers (also no surprise)... on my machine, count_odds_akrun exhausts the memory at a “mere” 10^10, but Thomas' works fine till Infinity…
count_odds_thomas(10^10)
#> [1] 2.5e+19
count_odds_akrun(10^10)
#> Error: vector memory exhausted (limit reached?)


How to efficiently grow a vector in R?

Let's say you have a function that takes a number as input and outputs a vector. However, the output vector's size depends on the input, and you can't calculate it before calling the function.
For example, take the famous 3N+1 (Collatz) algorithm. A simple implementation of that algorithm, returning the whole path down to 1, could look like this:
compute <- function(x) {
  if (x %% 2 == 0)
    return(x / 2)
  return(3*x + 1)
}

algo <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  while (x != 1) {
    x <- compute(x)
    output <- c(output, x)
  }
  return(output)
}
The algo function returns the whole path from an input x down to 1, according to the rule. As you can tell, the output variable grows dynamically, using the c() (combine) function.
Are there any alternatives to this? Is growing a list faster? Should I adopt some classic dynamic-vector logic, such as initializing an empty N-sized vector and doubling it every time it gets full?
EDIT: Please don't mind trying to optimize the way my helper functions are structured. I get it, but that's not the point here! I am only concerned about the c() function and an alternative to it.
Update
As per your edit, maybe you can check the following solution
algo_TIC2 <- function(x) {
  res <- x
  repeat {
    u <- tail(res, 1)
    if (u != 1) {
      res[length(res) + 1] <- if (u %% 2) 3 * u + 1 else u / 2
    } else {
      return(res)
    }
  }
}
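As a quick check, the tail of the path from 3000 matches the output of the recursive version shown below:
tail(algo_TIC2(3000))
# [1]  5 16  8  4  2  1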
You can use recursion like below:
compute <- function(x) if (x %% 2) 3*x + 1 else x / 2

algo_TIC1 <- function(x) {
  if (x == 1) {
    return(1)
  }
  c(x, algo_TIC1(compute(x)))
}
and you will see
> algo_TIC1(3000)
[1] 3000 1500 750 375 1126 563 1690 845 2536 1268 634 317 952 476 238
[16] 119 358 179 538 269 808 404 202 101 304 152 76 38 19 58
[31] 29 88 44 22 11 34 17 52 26 13 40 20 10 5 16
[46] 8 4 2 1
If you don't want any helper function, i.e., compute, you can try
algo_TIC1 <- function(x) {
  if (x == 1) {
    return(1)
  }
  c(x, algo_TIC1(if (x %% 2) 3*x + 1 else x / 2))
}
So, what bothers you is reallocation, and you are right. Let's see.
library(microbenchmark)
microbenchmark({
  a <- c()
  for (i in seq(1e4)) {
    a <- c(a, i)
  }
})

microbenchmark({
  a <- numeric(1e4)
  for (i in seq(1e4)) {
    a[[i]] <- i
  }
})

microbenchmark({
  a <- numeric(1)
  k <- 1
  for (i in seq(1e4)) {
    if (i > k) {
      a <- c(a, numeric(k))
      k <- k + k
    }
    a[[i]] <- i
  }
  a <- head(a, 1e4)
})
And the timings:
Append
min lq mean median uq max neval
78.0162 78.67925 83.36224 79.54515 81.79535 166.6988 100
Preallocate
min lq mean median uq max neval
1.484901 1.516051 1.567897 1.5552 1.569451 1.895601 100
Amortize
min lq mean median uq max neval
3.316501 3.377201 3.62415 3.484351 3.585701 11.7596 100
Never append many elements to a vector. If possible, preallocate, otherwise amortized allocation will do.
Even if you don't know the actual size beforehand, you may have an upper bound. Then you can still preallocate and truncate in the end. Even a reasonable estimate is useful: preallocate that size, and then resort to amortization if needed.
A remark: R is not good at loops. For small loops, for instance over the variables of a dataframe or the files in a directory, there is usually no problem. But if you have a long computation that really needs many loops and can't be vectorized, R might not be the right tool. On occasion, writing a function in C, C++, Fortran or Java can help: it's fairly easy to build plugins or to use Rcpp, and the performance gain is considerable.
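For instance, here is a minimal Rcpp sketch of the same Collatz walk (an illustration assuming the Rcpp package is available; algo_cpp is just an illustrative name). The std::vector grows with amortized reallocation inside C++, so no R-level copying happens at all:
library(Rcpp)
cppFunction("
NumericVector algo_cpp(double x) {
  std::vector<double> out;
  out.push_back(x);
  while (x != 1) {
    x = (std::fmod(x, 2) == 0) ? x / 2 : 3 * x + 1;
    out.push_back(x);
  }
  return wrap(out);
}")
algo_cpp(3000)  # same path as algo(3000)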
You can set the length of a vector and then make assignments to specific elements. Your code would look like this:
algo2 <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  index <- 1
  while (x != 1) {
    x <- compute(x)
    index <- index + 1
    if (index > length(output))
      length(output) <- 2*length(output)
    output[index] <- x
  }
  return(output[seq_len(index)])
}
This makes a difference, though not a big one in your example, because all those calls to compute() (and return()!) are quite costly. If you folded that calculation into algo you'd see more improvement. You could also initialize output to a length that is likely to be good enough for most cases, and rarely need doubling.
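A sketch of what folding compute() in might look like (same doubling strategy as algo2, just without the per-step function-call overhead; algo3 is an illustrative name):
algo3 <- function(x) {
  output <- x
  index <- 1
  while (x != 1) {
    x <- if (x %% 2) 3 * x + 1 else x / 2
    index <- index + 1
    if (index > length(output))
      length(output) <- 2 * length(output)
    output[index] <- x
  }
  output[seq_len(index)]
}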

Finding an inversion score using R

Sorry in advance if "inversion score" isn't the proper terminology. Here's a wiki entry.
Consider a list of values, for instance:
1 2 3 4 7 6 9 10 8
This would have three penalties (a score of 3):
The 6 comes after the 7
The 8 comes after the 9
The 8 comes after the 10
How can I calculate this inversion for a given vector of numbers in R? Note that some values will be NA, and I just want to skip these.
Your "inversion score" is a central component of Kendall's tau statistic. According to Wikipedia (see link), the tau statistic is (# concordant pairs-#discordant pairs)/(n*(n-1)/2). I believe that what R reports as T is the number of concordant pairs. Therefore, we should be able to reconstruct the number of discordant pairs (which I think is what you want) via n*(n-1)/2-T, as follows
x <- c(1,2,3,4,7,6,9,10,8)
(cc <- cor.test(sort(x),x,method="kendall"))
## Kendall's rank correlation tau
## data: sort(x) and x
## T = 33, p-value = 0.0008543
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.8333333
So this function should work:
ff <- function(x) {
  cc <- cor.test(sort(x), x, method = "kendall")
  n <- length(x)
  n*(n-1)/2 - unname(cc$statistic["T"])
}
ff(x) is 3, as requested (it would be good if you gave more examples of desired output ...). I haven't checked speed, but this has the advantage of being implemented in underlying C code.
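To skip NAs as the question asks, one could drop them before calling ff (a sketch; ff_na is an illustrative name, and cor.test() would otherwise fail because sort() silently drops NAs, leaving vectors of unequal length):
ff_na <- function(x) ff(na.omit(x))
ff_na(c(1, 2, 3, 4, NA, 7, 6, 9, 10, 8))
# [1] 3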
I quickly came up with two strategies: a naive one and a cleverer one using the outer function.
We look at two vectors of numbers A and B, where A is your example.
A <- scan(text = "1 2 3 4 7 6 9 10 8")
B <- sample(1:2321)
Define and try the naive inversion counting:
simpleInversion <- function(A) {
  sum <- 0
  n <- length(A)
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      sum <- sum + (A[i] > A[j])
    }
  }
  return(sum)
}
simpleInversion(A)
simpleInversion(B)
Define and try the slightly more clever inversion counting:
cleverInversion <- function(A) {
  tab <- outer(A, A, FUN = ">")
  return(sum(tab[upper.tri(tab)]))
}
cleverInversion(A)
cleverInversion(B)
For the version which ignores NAs we can simply add an na.omit:
cleverInversion2 <- function(A) {
  AA <- na.omit(A)
  tab <- outer(AA, AA, FUN = ">")
  return(sum(tab[upper.tri(tab)]))
}
A[2] <- NA
cleverInversion2(A)
Hope this helps.
Edit: A faster version
Both functions quickly become quite slow as the size of the vector grows, so I came up with a faster version:
fastInversion <- function(A) {
  return(sum(cbind(1, -1) %*% combn(na.omit(A), 2) > 0))
}
C <- sample(c(1:500, NA))
library("microbenchmark")
microbenchmark(
simpleInversion(C),
cleverInversion(C),
fastInversion(C))
#Unit: microseconds
# expr min lq median uq max neval
# simpleInversion(C) 128538.770 130483.626 133999.272 144660.116 185767.208 100
# cleverInversion(C) 9546.897 9893.358 10513.799 12564.298 17041.789 100
# fastInversion(C) 104.632 114.229 193.144 198.209 324.614 100
So we gain a speed-up of nearly two orders of magnitude, and the speed-up is even greater for larger vectors.
You could test each pair of values in your vector, counting the number that are inverted:
inversion.score <- function(vec) {
  sum(apply(combn(length(vec), 2), 2, function(x) vec[x[2]] < vec[x[1]]), na.rm = TRUE)
}
inversion.score(c(1, 2, 3, 7, 6, 9, 10, 8, NA))
# [1] 3

More Efficient Way To Do A Conditional Running Total In R

As this is my first time asking a question on SO, I apologize in advance for any improper formatting.
I am very new to R and am trying to create a function that will return the row value of a data frame column once a running total in another column has met or exceeded a given value (the row that the running sum begins in is also an argument).
For example, given the following data frame, a start parameter of x = 3, and a stop parameter of y = 17, the function should return 5 (the X value of the row where the running sum of Y first reaches or exceeds 17).
X Y
1 5
2 10
3 5
4 10
5 5
6 10
7 5
8 10
The function as I've currently written it returns the correct answer, but I have to believe there is a much more 'R-ish' way to accomplish this than using loops and incrementing temporary variables. I would like to learn the right way rather than form bad habits that I will have to correct later.
A very simplified version of the function:
myFunction <- function(DataFrame, StartRow, Total) {
  df <- DataFrame[DataFrame[[1]] >= StartRow, ]
  i <- 0
  j <- 0
  while (j < Total) {
    i <- i + 1
    j <- sum(df[[2]][1:i])
  }
  x <- df[[1]][i]
  return(x)
}
All the solutions posted so far compute the cumulative sum of the entire Y variable, which can be inefficient in cases where the data frame is really large but the index is near the beginning. In this case, a solution with Rcpp could be more efficient:
library(Rcpp)
get_min_cum2 <- cppFunction("
int gmc2(NumericVector X, NumericVector Y, int start, int total) {
  double running = 0.0;
  for (int idx = 0; idx < Y.size(); ++idx) {
    if (X[idx] >= start) {
      running += Y[idx];
      if (running >= total) {
        return X[idx];
      }
    }
  }
  return -1;  // running total never reached the limit
}")
Comparison with microbenchmark:
library(data.table)
get_min_cum <- function(start, total)
  with(dat[dat$X >= start, ], X[min(which(cumsum(Y) > total))])
get_min_dt <- function(start, total)
  dt[X >= start, X[cumsum(Y) >= total][1]]
set.seed(144)
dat <- data.frame(X = 1:1000000, Y = abs(rnorm(1000000)))
dt <- data.table(dat)
get_min_cum(3, 17)
# [1] 29
get_min_dt(3, 17)
# [1] 29
get_min_cum2(dat$X, dat$Y, 3, 17)
# [1] 29
library(microbenchmark)
microbenchmark(get_min_cum(3, 17), get_min_dt(3, 17),
get_min_cum2(dat$X, dat$Y, 3, 17))
# Unit: milliseconds
# expr min lq median uq max neval
# get_min_cum(3, 17) 125.324976 170.052885 180.72279 193.986953 418.9554 100
# get_min_dt(3, 17) 100.990098 149.593250 162.24523 176.661079 399.7531 100
# get_min_cum2(dat$X, dat$Y, 3, 17) 1.157059 1.646184 2.30323 4.628371 256.2487 100
In this case, it's about 100x faster to use the Rcpp solution than other approaches.
Try this, for example; I am using cumsum and vectorized logical subsetting:
get_min_cum <- function(start, total)
  with(dat[dat$X >= start, ], X[min(which(cumsum(Y) > total))])
get_min_cum(3, 17)
# [1] 5
Here you go (using data.table because of ease of syntax):
library(data.table)
dt = data.table(df)
dt[X >= 3, X[cumsum(Y) >= 17][1]]
#[1] 5
Well, here's one way:
i <- 3
j <- 17
min(df[i:nrow(df),]$X[cumsum(df$Y[i:nrow(df)])>j])
# [1] 5
This takes df$X for rows i:nrow(df) and indexes it by cumsum(df$Y) > j, also computed from row i. That returns all values of df$X for which the cumulative sum exceeds j; min(...) then returns the smallest of those.
with(df, which(cumsum((X >= 3) * Y) >= 17)[1])
Multiplying Y by the logical (X >= 3) zeroes out the rows before the start, so the cumulative sum only accumulates from row 3 onward; which(...)[1] then picks the first row index where it reaches 17.

Euler Project #1 in R

Problem
Find the sum of all numbers below 1000 that are divisible by 3 or 5.
One solution I created:
x <- c(1:999)
values <- x[x %% 3 == 0 | x %% 5 == 0]
sum(values)
Second solution I can't get to work and need help with. I've pasted it below.
I'm trying to use a loop (here, I use while() and after this I'll try for()). I am still struggling with keeping references to indexes (locations in a vector) separate from values/observations within vectors. Loops seem to make it more challenging for me to distinguish the two.
Why does this not produce the answer to Euler #1?
x <- 0
i <- 1
while (i < 100) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- c(x, i)
  }
  i <- i + 1
}
sum(x)
And in words, line by line this is what I understand is happening:
x gets value 0
i gets value 1
while object i's value (not the index #) is < 1000
if is divisible by 3 or 5
add that number i to the vector x
add 1 to i (in order to keep the loop going to the defined limit of 1e3)
sum all items in vector x
I am guessing x[i] <- c(x, i) is not the right way to add an element to vector x. How do I fix this and what else is not accurate?
First, your loop runs until i < 100, not i < 1000.
Second, replace x[i] <- c(x, i) with x <- c(x, i) to add an element to the vector.
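Putting both fixes together:
x <- c()
i <- 1
while (i < 1000) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x <- c(x, i)
  }
  i <- i + 1
}
sum(x)
# [1] 233168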
Here is a shortcut that performs this sum, which is probably more in the spirit of the problem:
3*(333*334/2) + 5*(199*200/2) - 15*(66*67/2)
## [1] 233168
Here's why this works:
In the set of integers [1,999] there are:
333 values that are divisible by 3. Their sum is 3*sum(1:333) or 3*(333*334/2).
199 values that are divisible by 5. Their sum is 5*sum(1:199) or 5*(199*200/2).
Adding these up gives a number that is too high by their intersection: the values divisible by 15. There are 66 such values, and their sum is 15*sum(1:66) or 15*(66*67/2).
As a function of N, this can be written:
f <- function(N) {
  threes <- floor(N/3)
  fives <- floor(N/5)
  fifteens <- floor(N/15)
  3*(threes*(threes+1)/2) + 5*(fives*(fives+1)/2) - 15*(fifteens*(fifteens+1)/2)
}
Giving:
f(999)
## [1] 233168
f(99)
## [1] 2318
And another way (this works here because each value of x equals its index, so summing the matching indices equals summing the values):
x <- 1:999
sum(which(x %% 5 == 0 | x %% 3 == 0))
# [1] 233168
A very efficient approach is the following:
div_sum <- function(x, n) {
  # calculates double the sum of all integers from 1 to n
  # that are divisible by x
  max_num <- n %/% x
  x * (max_num + 1) * max_num
}
n <- 999
a <- 3
b <- 5
(div_sum(a, n) + div_sum(b, n) - div_sum(a * b, n)) / 2
# [1] 233168
In contrast, a very short version is the following (!x %% 3 is TRUE exactly when x %% 3 is 0, i.e. when x is divisible by 3):
x <- 1:999
sum(x[!x %% 3 | !x %% 5])
Here is an alternative that I think gives the same answer (using 99 instead of 999 as the upper bound):
iters <- 100
x <- rep(0, iters - 1)
i <- 1
while (i < iters) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- i
  }
  i <- i + 1
}
sum(x)
# [1] 2318
Here is the for-loop mentioned in the original post (a for loop manages its own index, so no manual i <- i + 1 is needed):
iters <- 99
x <- rep(0, iters)
for (i in 1:iters) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- i
  }
}
sum(x)
# [1] 2318

Fastest way to find *the index* of the second (third...) highest/lowest value in vector or column

Fastest way to find the index of the second (third...) highest/lowest value in vector or column ?
i.e. what sort(x, partial = n-1)[n-1] is to max(), but for which.max().
Related: Fastest way to find second (third...) highest/lowest value in vector or column
One possible route is to use the index.return argument to sort. I'm not sure whether this is the fastest, though.
set.seed(21)
x <- rnorm(10)
ind <- 2
sapply(sort(x, index.return=TRUE), `[`, length(x)-ind+1)
# x ix
# 1.746222 3.000000
EDIT 2:
As Joshua pointed out, none of the given solutions performs correctly when there is a tie on the maxima, so:
X <- c(11:19,19)
n <- length(unique(X))
which(X == sort(unique(X),partial=n-1)[n-1])
is then the fastest way of doing it correctly. I deleted the order approach, as it doesn't work and is a lot slower, so it is not a good answer by the OP's criteria.
To illustrate the issue we ran into:
> X <- c(11:19, 19)
> n <- length(X)
> which(X == sort(X, partial = n-1)[n-1])
[1] 9 10   # the indices of the duplicated maximum 19
> n <- length(unique(X))
> which(X == sort(unique(X), partial = n-1)[n-1])
[1] 8      # the correct index of 18
The timings of the valid solutions:
> x <- runif(1000000)
> ind <- 2
> n <- length(unique(x))
> system.time(which(x == sort(unique(x),partial=n-ind+1)[n-ind+1]))
user system elapsed
0.11 0.00 0.11
> system.time(sapply(sort(unique(x), index.return=TRUE), `[`, n-ind+1))
user system elapsed
0.69 0.00 0.69
The Rfast library has implemented the nth-element function with a return-index option.
UPDATE (28/FEB/21): package kit offers a faster implementation (topn), as shown in the simulations below.
x <- runif(1e+6)
n <- 2

which_nth_highest_richie <- function(x, n)
{
  for (i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
  which(x == max(x))
}

which_nth_highest_joris <- function(x, n)
{
  ux <- unique(x)
  nux <- length(ux)
  which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}

microbenchmark::microbenchmark(
  topn = kit::topn(x, n, decreasing = TRUE)[n],
  Rfast = Rfast::nth(x, n, descending = TRUE, index.return = TRUE),
  order = order(x, decreasing = TRUE)[n],
  richie = which_nth_highest_richie(x, n),
  joris = which_nth_highest_joris(x, n))
Unit: milliseconds
expr min lq mean median uq max neval
topn 3.741101 3.7917 4.517201 4.060752 5.108901 7.403901 100
Rfast 15.8121 16.7586 20.64204 17.73010 20.7083 47.6832 100
order 110.5416 113.4774 120.45807 116.84005 121.2291 164.5618 100
richie 22.7846 24.1552 39.35303 27.10075 42.0132 179.289 100
joris 131.7838 140.4611 158.20704 156.61610 165.1735 243.9258 100
topn is the clear winner at finding the index of the 2nd biggest value among 1 million numbers.
Further, simulations were run to estimate the running time of finding the nth biggest number for varying n. The variable x was repopulated for each n, but its size was always 1 million numbers.
As those simulations show, topn is the best option for finding the nth biggest element and its index, provided n is not too big; topn becomes slower than Rfast's nth for bigger n.
It is worth noting that topn has not been implemented for n > 1000 and will throw an error in such cases.
Method: Set all max values to -Inf, then find the indices of the max. No sorting required.
X <- runif(1e7)
system.time({
  X[X == max(X)] <- -Inf
  which(X == max(X))
})
Works with ties and is very fast.
If you can guarantee no ties, then an even faster version is
system.time({
  X[which.max(X)] <- -Inf
  which.max(X)
})
EDIT: As Joris mentioned, this method doesn't scale that well for finding third, fourth, etc., highest values.
which_nth_highest_richie <- function(x, n)
{
  for (i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
  which(x == max(x))
}

which_nth_highest_joris <- function(x, n)
{
  ux <- unique(x)
  nux <- length(ux)
  which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}
Using x <- runif(1e7) and n = 2, Richie wins
system.time(which_nth_highest_richie(x, 2)) #about half a second
system.time(which_nth_highest_joris(x, 2)) #about 2 seconds
For n = 100, Joris wins
system.time(which_nth_highest_richie(x, 100)) #about 20 seconds, ouch!
system.time(which_nth_highest_joris(x, 100)) #still about 2 seconds
The balance point, where they take the same length of time, is about n = 10.
No ties: which() is probably your friend here. Combine the output from the sort() solution with which() to find the index that matches the output of the sort() step.
> set.seed(1)
> x <- sample(1000, 250)
> n <- length(x)
> sort(x, partial = n-1)[n-1]
[1] 992
> which(x == sort(x, partial = n-1)[n-1])
[1] 145
Ties handling: The solution above doesn't work properly (and wasn't intended to) if there are ties among the ith largest or larger values. We need to take the unique values of the vector before sorting; then the above solution works:
> set.seed(1)
> x <- sample(1000, 1000, replace = TRUE)
> length(unique(x))
[1] 639
> n <- length(x)
> i <- which(x == sort(x,partial=n-1)[n-1])
> sum(x > x[i])
[1] 0
> x.uni <- unique(x)
> n.uni <- length(x.uni)
> i <- which(x == sort(x.uni, partial = n.uni-1)[n.uni-1])
> sum(x > x[i])
[1] 2
> tail(sort(x))
[1] 994 996 997 997 1000 1000
order() is also very useful here:
> head(ord <- order(x, decreasing = TRUE))
[1] 220 145 209 202 211 163
So the solution here is ord[2] for the index of the 2nd highest/largest element of x.
Some timings:
> set.seed(1)
> X <- sample(1e7, 1e7)
> system.time({n <- length(X); which(X == sort(X, partial = n-1)[n-1])})
user system elapsed
0.319 0.058 0.378
> system.time({ord <- order(X, decreasing = TRUE); ord[2]})
user system elapsed
14.578 0.084 14.708
> system.time({order(X, decreasing = TRUE)[2]})
user system elapsed
14.647 0.084 14.779
As the linked post was getting at and the timings above show, order() is much slower, but both provide the same results:
> all.equal(which(X == sort(X, partial = n-1)[n-1]),
+ order(X, decreasing = TRUE)[2])
[1] TRUE
And for the ties-handling version:
foo <- function(x, i) {
  X <- unique(x)
  N <- length(X)
  i <- i - 1
  which(x == sort(X, partial = N-i)[N-i])
}
> system.time(foo(X, 2))
user system elapsed
1.249 0.176 1.454
So the extra steps slow this solution down a bit, but it is still very competitive with order().
Use the maxN function given by Zach to find the next max value, and use which() with arr.ind = TRUE:
which(x == maxN(x, 4), arr.ind = TRUE)
Using arr.ind = TRUE will return the index position in any of the above solutions as well and simplifies the code.
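maxN itself isn't reproduced in this excerpt; a sketch of such a helper, reusing the partial-sort idea from the answers above (the default N = 2 is an assumption), might be:
maxN <- function(x, N = 2) {
  len <- length(x)
  sort(x, partial = len - N + 1)[len - N + 1]
}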
This is my solution for finding the indices of the top N highest values in a vector (not exactly what the OP wanted, but it might help other people):
index.top.N <- function(xs, N = 10) {
  if (length(xs) > 0) {
    o <- order(xs, na.last = FALSE)
    o.length <- length(o)
    if (N > o.length) N <- o.length
    o[(o.length - N + 1):o.length]
  } else {
    0
  }
}
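For example (a quick illustrative check; v is just a made-up vector):
set.seed(1)
v <- sample(100, 10)
index.top.N(v, 3)  # indices of the 3 largest values, ordered from smallest to largest of those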
