Function: sapply in apply, removing outliers - r

I'm working on a function that removes outliers from a given data set based on the 3-sigma rule. My code is presented below; "data" is the data set to be processed.
rm.outlier <- function(data){
  apply(data, 2, function(var) {
    sigma3.plus <- mean(var) + 3 * sd(var)
    sigma3.min  <- mean(var) - 3 * sd(var)
    sapply(var, function(y) {
      if (y > sigma3.plus){
        y <- sigma3.plus
      } else if (y < sigma3.min){
        y <- sigma3.min
      } else {y <- y}
    })
  })
  as.data.frame(data)
}
In order to check if the function works I wrote a short test:
set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
b <- a
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of outliers in a
As a result, I get:
[1] 12
So the variable var1 in the data frame a has 12 outliers. Next, I try to apply my function to this object:
a2 <- rm.outlier(a)
sum(b$var1 - a2$var1)
Unfortunately, it gives 0, which clearly indicates that something does not work. I have already worked out that the sapply part is correct, so the mistake must be in my apply call. Any help would be appreciated.

If runtime is important for you, you might consider another approach. You could vectorize this filtering, e.g. by using pmin and pmax, which is equally readable and more than 15 times faster. If you don't mind a little more complexity, you could use findInterval and get even more speed:
rm.outlier2 <- function(x) {
  ## calculate -3/3 * sigma borders
  s <- mean(x) + c(-3, 3) * sd(x)
  pmin(pmax(x, s[1]), s[2])
}
rm.outlier3 <- function(x) {
  ## calculate -3/3 * sigma borders
  s <- mean(x) + c(-3, 3) * sd(x)
  ## sorts x into intervals: 0 == left of s[1], 2 == right of s[2], 1 == between both
  i <- findInterval(x, s)
  ## which values are left/right of the interval
  j <- which(i != 1L)
  ## add a value between s to directly use output of findInterval for subsetting
  s2 <- c(s[1], 0, s[2])
  ## replace all values that are left/right of the interval
  x[j] <- s2[i[j] + 1L]
  x
}
Benchmarking the stuff:
## slightly modified OP version
rm.outlier <- function(x) {
  sigma3 <- mean(x) + c(-3, 3) * sd(x)
  sapply(x, function(y) {
    if (y > sigma3[2]){
      y <- sigma3[2]
    } else if (y < sigma3[1]){
      y <- sigma3[1]
    } else {y <- y}
  })
}
set.seed(123)
a <- rnorm(10000, 0, 1)
# check output
all.equal(rm.outlier(a), rm.outlier2(a))
all.equal(rm.outlier2(a), rm.outlier3(a))
library("rbenchmark")
benchmark(rm.outlier(a), rm.outlier2(a), rm.outlier3(a),
          order = "relative",
          columns = c("test", "replications", "elapsed", "relative"))
#             test replications elapsed relative
# 3 rm.outlier3(a)          100   0.028    1.000
# 2 rm.outlier2(a)          100   0.102    3.643
# 1  rm.outlier(a)          100   1.825   65.179
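Since the question actually works with a data frame rather than a bare vector, the vectorized versions can simply be applied column-wise; a small sketch, assuming a is the data frame from the question:
a2 <- as.data.frame(lapply(a, rm.outlier2))  # cap every column at +/- 3 sigma
sum(a2$var1 != a$var1)                       # number of values that were capped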

It seems like you just forgot to assign the result of your apply call to a new data frame. (Compare the 3rd line with your code.)
rm.outlier <- function(data){
  # Assign the result to a new data frame
  data_new <- apply(data, 2, function(var) {
    sigma3.plus <- mean(var) + 3 * sd(var)
    sigma3.min  <- mean(var) - 3 * sd(var)
    sapply(var, function(y) {
      if (y > sigma3.plus){
        y <- sigma3.plus
      } else if (y < sigma3.min){
        y <- sigma3.min
      } else {y <- y}
    })
  })
  # Return the new data frame
  as.data.frame(data_new)
}
set.seed(123)
a <- data.frame("var1" = rnorm(10000, 0, 1))
sum(a$var1 > mean(a$var1) + 3 * sd(a$var1)) # number of too big outliers
# 15
sum(a$var1 < mean(a$var1) - 3 * sd(a$var1)) # number of too small outliers
# 13
# Overall 28 outliers
# Check the function for the number of outliers
a2 <- rm.outlier(a)
sum(a2$var1 == a$var1) - length(a$var1)
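That last expression returns minus the number of values that were changed (-28 here, matching the 28 outliers counted above). A slightly more direct check, just as a sketch:
sum(a2$var1 != a$var1)  # 28, the number of capped values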

Related

Scale raster cell in stack from -1 to 1 R

I have a raster stack with 364 layers containing a daily rate of change in NDVI values.
I want to scale the values in every cell so that positive values map to 0 to 1 and negative values map to -1 to 0. So far I have only found a solution that scales values in single layers (see here: Replace specific value in each band of raster brick in R) and not along the cells of multilayer objects. Additionally, I have a decent number of cells with NA for the entire time series and I'm not quite sure how to deal with that either.
I took the code from the previously mentioned post and tried to get it working for my problem:
norm <- function(x){-1 + (x - min) * ((1 - (-1)) / (max - min))}
for(j in 1:ncell(tif)){
  if(is.na(sum(tif[j]))){
    NULL
  } else {
    cat(paste("Currently processing layer:", j, "/", ncell(tif), "\n"))
    min <- cellStats(tif[j], 'min')
    max <- cellStats(tif[j], 'max')
    # initialize cluster
    # number of cores to use for clusterR function (max recommended: ncores - 1)
    beginCluster(31)
    # normalize
    tif[j] <- clusterR(tif[j], calc, args=list(fun=norm), export=c('min', "max"))
    # end cluster
    endCluster()
  }
}
I'm not quite certain if this produces the desired output. Any help is very much appreciated!
Some example data
library(raster)
r <- raster(ncol=10, nrow=10)
s <- stack(lapply(1:5, function(i) setValues(r, runif(100, -1, 1))))
# adding NAs
s[[2]][sample(100, 25, TRUE)] <- NA
For scaling (or any other operation) by cell (as requested) you can use calc together with a function that works on a vector. For example:
ff <- function(i) {
  p <- which(i >= 0)
  n <- which(i <= 0)
  # positive values
  if (length(p) > 0) {
    i[p] <- i[p] - min(i[p], na.rm=TRUE)
    i[p] <- i[p] / max(i[p])
  }
  # negative values
  if (length(n) > 0) {
    i[n] <- i[n] - max(i[n], na.rm=TRUE)
    i[n] <- i[n] / abs(min(i[n]))
  }
  i
}
Test it
ff(c(-.3, -.1, .1, .4, .8))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000
ff(c(-.3, -.1, .1, .4, .8, NA))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000 NA
ff(c(-2,-1))
#[1] -1 0
ff(c(NA, NA))
#[1] NA NA
And use it
z <- calc(s, ff)
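A quick sanity check (just a sketch): after the per-cell scaling every layer of z should only contain values inside [-1, 1], which cellStats can confirm:
cellStats(z, 'min')  # all >= -1
cellStats(z, 'max')  # all <=  1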
See below to scale by layer, based on the min and max of all cell values (I first thought that this was what was asked for). Note that the functions I use below scale values from -1 to 1, but do not map the lowest positive value and the highest negative value to zero.
minv <- abs(cellStats(s, 'min'))
maxv <- cellStats(s, 'max')
f1 <- function(i, mn, mx) {
  j <- i < 0
  j[is.na(j)] <- TRUE
  i[j] <- i[j] / abs(mn)
  i[!j] <- i[!j] / mx
  i
}
ss <- list()
for (i in 1:nlayers(s)) {
  ss[[i]] <- calc(s[[i]], fun=function(x) f1(x, minv[i], maxv[i]))
}
ss1 <- stack(ss)
Or without a loop
f2 <- function(x, mn, mx) {
  x <- t(x)
  i <- which(x > 0)
  i[is.na(i)] <- FALSE
  mxx <- x / mx
  x <- x / mn
  x[i] <- mxx[i]
  t(x)
}
ss2 <- calc(s, fun=function(x) f2(x, minv, maxv))
For reference, to simply scale between 0 and 1
mnv <- cellStats(s,'min')
mxv <- cellStats(s,'max')
x <- (s - mnv) / (mxv - mnv)
To get values between -1 and 1 you can then do
y <- 2 * x - 1
But that way previously negative values can become positive and vice versa.
See ?raster::scale for other types of scaling.

R, for loop, scalable solutions

I have data that looks like this:
    char_column date_column1 date_column2 integer_column
415  18JT9R6EKV   2014-08-28   2014-09-06              1
26   18JT9R6EKV   2014-12-08   2014-12-11              2
374  18JT9R6EKV   2015-03-03   2015-03-09              1
139  1PEGXAVCN5   2014-05-06   2014-05-10              3
969  1PEGXAVCN5   2014-06-11   2014-06-15              2
649  1PEGXAVCN5   2014-08-12   2014-08-16              3
I want to write a loop that checks every row against the preceding row and, given certain conditions, assigns them the same number (so I can group them later). The point is that if the date segments are close enough, I want to collapse them into one segment.
My attempt is the following:
i <- 1
z <- 1
v <- 1
for (i in 2:nrow(df)){
  z[i] <- ifelse(df[i,'char_column'] == df[i-1,'char_column'],
                 ifelse((df[i,'date_column1'] - df[i-1,'date_column2']) <= 5,
                        ifelse(df[i,'integer_column'] == df[i-1,'integer_column'],
                               v, v <- v+1),
                        v <- v+1),
                 v <- v+1)}
df$grouping <- z
Then I would just group using min(date_column1) and max(date_column2).
This method works perfectly for, say, 100,000 rows (22.86 seconds),
but for a million rows it takes 33.18 minutes! I have over 60m rows to process,
so is there a way I can make the process more efficient?
PS: to generate a similar table you can use the following code:
x <- NULL
for (i in 1:200) { x[i] <- paste(sample(c(LETTERS, 1:9), 10), collapse = '') }
y <- sample((as.Date('2014-01-01')):as.Date('2015-05-01'), 1000, replace = T)
y2 <- y + sample(1:10)
df <- data.frame(char_column = sample(x, 1000, rep = T),
                 date_column1 = as.Date(y, origin = '1970-01-01'),
                 date_column2 = as.Date(y2, origin = '1970-01-01'),
                 integer_column = sample(1:3, 1000, replace = T),
                 row.names = NULL)
df <- df[order(df$char_column, df$date_column1),]
Since data.table::rleid does not work, I post another (hopefully) fast solution
1. Get rid of nested ifelse
ifelse is often slow, especially for scalar evaluation; use if instead.
Nested ifelse should be avoided whenever possible: observe that ifelse(A, ifelse(B, x, y), y) can be suitably replaced by if (A & B) x else y.
f1 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col  <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col   <- df[, 'integer_column']
  for (i in 2:nrow(df)){
    if((char_col[i] == char_col[i-1]) & ((date_col1[i] - date_col2[i-1]) <= 5) & (int_col[i] == int_col[i-1])){
      z[i] <- z[i-1]
    } else {
      z[i] <- z[i-1] + 1
    }
  }
  z
}
f1 is about 40% faster than the original solution for 10,000 rows.
system.time(f1(df))
user system elapsed
2.72 0.00 2.79
2. Vectorize
Upon closer inspection the conditions inside if can be vectorized
library(data.table)
f2 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col  <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col   <- df[, 'integer_column']
  cond <- (char_col == shift(char_col)) & (date_col1 - shift(date_col2) <= 5) & (int_col == shift(int_col))
  for (i in 2:nrow(df)){
    if(cond[i]){
      z[i] <- z[i-1]
    } else {
      z[i] <- z[i-1] + 1
    }
  }
  z
}
# for 10000 rows
system.time(f2(df))
# user system elapsed
# 0.01 0.00 0.02
3. Vectorize, Vectorize
While f2 is already quite fast, a further vectorization is possible. Observe how z is calculated: cond is a logical vector, and z[i] = z[i-1] + 1 when cond is FALSE. This is none other than cumsum(!cond).
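A toy illustration of the idea (the cond vector here is made up; its first element plays the role of the NA produced by shift):
cond <- c(NA, TRUE, FALSE, TRUE, FALSE)
cumsum(!c(FALSE, cond[-1L]))
# [1] 1 1 2 2 3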
f3 <- function(df){
  setDT(df)
  df[, cond := (char_column == shift(char_column)) &
               (date_column1 - shift(date_column2) <= 5) &
               (integer_column == shift(integer_column)), ]
  df[, group := cumsum(!c(FALSE, cond[-1L])), ]
}
For 1M rows
system.time(f3(df))
# user system elapsed
# 0.05 0.05 0.09
system.time(f2(df))
# user system elapsed
# 1.83 0.05 1.87
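Once the group column exists, the collapsing step mentioned in the question (min of date_column1 and max of date_column2 per segment) is a one-liner in data.table syntax; a sketch, assuming the column names from the question and that f3 has already turned df into a data.table:
df[, .(char_column = char_column[1L],
       date_column1 = min(date_column1),
       date_column2 = max(date_column2)), by = group]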

Nested for loop in R

I wrote the following code, and I need to repeat it 100 times. I know I need to use another for loop, but I don't know how to do it. Here is the code:
mean <- c(5,5,10,10,5,5,5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL
for (i in 1:7) {
  x[i] <- rexp(1, rate = mean[i])
  u[i] <- (1/1.2)*runif(1, min=0, max=1)
  y1 <- min(x,u)
  if (y1 == min(x)) {
    delta1 <- 1
  } else {
    delta1 <- 0
  }
  if (delta1 == 0) {
    w1 <- NULL
  } else {
    if(y1 == x[[1]]) {
      w1 <- "x1"
    }
  }
}
output <- cbind(delta1,w1)
output
I want the final output to be a 100-row by 3-column matrix containing the run number, delta1, and w1.
Any thoughts will be truly appreciated.
Here's what I gather you're trying to achieve from your code:
Given two vectors drawn from different distributions (Exponential and Uniform)
Find out which distribution the smallest number comes from
Repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5)  # 'mean' is an inbuilt function
# Initialise the output data frame:
output <- data.frame(number=rep(0, 100), delta1=rep(1, 100), w1=rep("x1", 100))
for (i in 1:100) {
  # Generating u doesn't require a for loop. Additionally, we can bring
  # the (1/1.2) out to the front.
  u <- runif(7, min=0, max=5/6)
  # Generating x doesn't need a loop either. It's better to use apply functions
  # when you can!
  x <- sapply(rates, function(x) { rexp(1, rate=x) })
  y1 <- min(x, u)
  # Now we can store the output
  output[i, "number"] <- y1
  # Two things here:
  # 1) use all.equal instead of == to compare floating point numbers
  # 2) We initialised the data frame to assume they always came from x,
  #    so we only need to overwrite it where it comes from u.
  if (isTRUE(all.equal(y1, min(u)))) {
    output[i, "delta1"] <- 0
    output[i, "w1"] <- NA  # Can't use NULL in a character vector.
  }
}
output
Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100  # number of runs
res <- t(replicate(n, {
  x <- rexp(n = length(Mean), rate = Mean)
  u <- runif(n = length(Mean), min = 0, max = 1/1.2)
  mx <- min(x)
  delta1 <- mx <= min(u)
  w1 <- delta1 & mx == x[1]
  c(delta1, w1)
}))
output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
                     w1 = c(NA, "x1")[res[ , 2] + 1])
The result:
head(output)
#   run delta1   w1
# 1   1      1 <NA>
# 2   2      1 <NA>
# 3   3      1 <NA>
# 4   4      1   x1
# 5   5      1 <NA>
# 6   6      0 <NA>

How to implement a recursive process in R?

Say I have a vector v = c(250, 1200, 700), a starting value n, and a function, e.g.
f = function(v){
  g = function(v){
    cases(
      v <= 20           -> 0.1,
      v > 20 & v <= 100 -> 0.075,
      v > 100           -> .05
    )
  }
  suppressWarnings(g(v))
}
f is written using cases from the memisc package - I'm still new to R and would be keen to hear if f can be coded in a 'better' way. Anyway, I am looking for code that will perform the following recursive process (including for vectors of a 'large' length):
f(n),
f(n)*v[1]+n,
f(f(n)*v[1]+n)*v[2] + f(n)*v[1]+n,
f(f(f(n)*v[1]+n)*v[2] + f(n)*v[1]+n)*v[3] + f(f(n)*v[1]+n)*v[2] + f(n)*v[1]+n
Ultimately I am interested in the value of the last line.
Cheers for any help
If I understood you right, this is the process you're talking about:
X1 = f(n)
X2 = X1*v[1] + n
X3 = f(X2)*v[2] + X2
X4 = f(X3)*v[3] + X3
...
If you need all the in-between steps, a recursive function is rather useless, as the in-between steps have to be stored in the result anyway. So you can easily code that using basic R:
Thefun <- function(v, n){
  l <- length(v)
  res <- numeric(l+1)
  res[1] <- g(n)
  res[2] <- res[1]*v[1] + n
  for(i in seq(2, l)){
    res[i+1] <- res[i] + g(res[i])*v[i]
  }
  return(res)
}
The last value of the result is the one you need. If you only need the result of the last step, you can do it recursively using Recall:
Recfunc <- function(v, n){
  l <- length(v)
  if(l > 0){
    res <- Recall(v[-l], n)
    return(g(res)*v[l] + res)
  } else {
    return(n)
  }
}
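A quick usage sketch for both versions. They rely on g being visible in the global environment, so it is (re)defined here with the same breakpoints as in the question; n = 500 is just an arbitrary starting value:
g <- function(v) 0.1*(v <= 20) + 0.075*(v > 20 & v <= 100) + 0.05*(v > 100)
v <- c(250, 1200, 700)
n <- 500
Thefun(v, n)   # all intermediate values; the last element is the final answer
Recfunc(v, n)  # only the final value (607.5 for these inputs, matching Thefun)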
On a sidenote
You can define your function g differently, like this (I call it fv):
fv <- function(v){
  0.1*(v <= 20) + 0.075*(v > 20 & v <= 100) + 0.05*(v > 100)
}
Compared to your function, you gain a 6-fold increase in speed.
vec <- sample(1:150, 1e5, TRUE)
benchmark(
  fv(vec),
  g(vec),
  columns=c("test", "replications", "elapsed", "relative"),
  replications = 1000
)
     test replications elapsed relative
1 fv(vec)         1000    9.39    1.000
2  g(vec)         1000   56.30    5.996
I assume here that n is the length of v.
I rewrite the recursion like this:
y1 <- n  ## slight change here
y2 <- f(y1)*v[1] + y1
y3 <- f(y2)*v[2] + y2
y4 <- f(y3)*v[3] + y3
.... I can't see terms beyond length(v), hence my first assumption.
So, for example, you can implement this like:
filter.f <- function(func=f, coef=v){
  n <- length(coef)
  y <- numeric(n)
  y[1] <- n
  for(i in 2:n)
    y[i] <- func(y[i-1])*coef[i-1] + y[i-1]  ## here the recursion
  y[1] <- f(n)
  y
}
filter.f()
[1] 0.1 124.0 159.0 191.5
v=c(250, 1200, 700)
filter.f()
[1] 0.1 28.0 118.0

R Function for returning ALL factors

My normal search foo is failing me. I'm trying to find an R function that returns ALL of the factors of an integer. There are at least 2 packages with factorize() functions: gmp and conf.design; however, these functions return only prime factors. I'd like a function that returns all factors.
Obviously searching for this is made difficult since R has a construct called factors, which puts a lot of noise in the search.
To follow up on my comment (thanks to @Ramnath for my typo), the brute force method seems to work reasonably well here on my 64-bit, 8-gig machine:
FUN <- function(x) {
  x <- as.integer(x)
  div <- seq_len(abs(x))
  factors <- div[x %% div == 0L]
  factors <- list(neg = -factors, pos = factors)
  return(factors)
}
A few examples:
> FUN(100)
$neg
[1] -1 -2 -4 -5 -10 -20 -25 -50 -100
$pos
[1] 1 2 4 5 10 20 25 50 100
> FUN(-42)
$neg
[1] -1 -2 -3 -6 -7 -14 -21 -42
$pos
[1] 1 2 3 6 7 14 21 42
#and big number
> system.time(FUN(1e8))
user system elapsed
1.95 0.18 2.14
You can get all factors from the prime factors. gmp calculates these very quickly.
library(gmp)
library(plyr)
get_all_factors <- function(n)
{
  prime_factor_tables <- lapply(
    setNames(n, n),
    function(i)
    {
      if(i == 1) return(data.frame(x = 1L, freq = 1L))
      plyr::count(as.integer(gmp::factorize(i)))
    }
  )
  lapply(
    prime_factor_tables,
    function(pft)
    {
      powers <- plyr::alply(pft, 1, function(row) row$x ^ seq.int(0L, row$freq))
      power_grid <- do.call(expand.grid, powers)
      sort(unique(apply(power_grid, 1, prod)))
    }
  )
}
get_all_factors(c(1, 7, 60, 663, 2520, 75600, 15876000, 174636000, 403409160000))
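For a single input the result is a named list holding one sorted vector of divisors; for example (easy to verify by hand for 60):
get_all_factors(60)
# $`60`
# [1]  1  2  3  4  5  6 10 12 15 20 30 60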
Update
This is now implemented in the package RcppBigIntAlgos. See this answer for more details.
Original Post
The algorithm has been fully updated and now implements multiple polynomials as well as some clever sieving techniques that eliminates millions of checks. In addition to the original links, this paper along with this post from primo were very helpful for this last stage (many kudos to primo). Primo does a great job of explaining the guts of the QS in a relatively short space and also wrote a pretty amazing algorithm (it will factor the number at the bottom, 38! + 1, in under 2 secs!! Insane!!).
As promised, below is my humble R implementation of the Quadratic Sieve. I have been working on this algorithm sporadically since I promised it in late January. I will not try to explain it fully (unless requested... also, the links below do a very good job) as it is very complicated and hopefully, my function names speak for themselves. This has proved to be one of the most challenging algorithms I have ever attempted to execute as it is demanding both from a programmer's point of view as well as mathematically. I have read countless papers and ultimately, I found these five to be the most helpful (QSieve1, QSieve2, QSieve3, QSieve4, QSieve5).
N.B. This algorithm, as it stands, does not serve very well as a general prime factorization algorithm. If it were optimized further, it would need to be accompanied by a section of code that factors out smaller primes (i.e. less than 10^5 as suggested by this post), then calls QuadSieveAll, checks to see if these are primes, and if not, calls QuadSieveAll on both of these factors, etc. until you are left with all primes (all of these steps are not that difficult). However, the main point of this post is to highlight the heart of the Quadratic Sieve, so the examples below are all semiprimes (even though it will factor most odd numbers not containing a square… Also, I haven’t seen an example of the QS that didn’t demonstrate a non-semiprime). I know the OP was looking for a method to return all factors and not the prime factorization, but this algorithm (if optimized further) coupled with one of the algorithms above would be a force to reckon with as a general factoring algorithm (especially given that the OP needed something for Project Euler, which usually requires much more than brute force methods). By the way, the MyIntToBit function is a variation of this answer and the PrimeSieve is from a post that @Dontas appeared on a while back (kudos on that as well).
QuadSieveMultiPolysAll <- function(MyN, fudge1=0L, fudge2=0L, LenB=0L) {
### 'MyN' is the number to be factored; 'fudge1' is an arbitrary number
### that is used to determine the size of your prime base for sieving;
### 'fudge2' is used to set a threshold for sieving;
### 'LenB' is the size of the sieving interval. The last three
### arguments are optional (they are determined based off of the
### size of MyN if left blank)
### The first 8 functions are helper functions
PrimeSieve <- function(n) {
n <- as.integer(n)
if (n > 1e9) stop("n too large")
primes <- rep(TRUE, n)
primes[1] <- FALSE
last.prime <- 2L
fsqr <- floor(sqrt(n))
while (last.prime <= fsqr) {
primes[seq.int(last.prime^2, n, last.prime)] <- FALSE
sel <- which(primes[(last.prime + 1):(fsqr + 1)])
if (any(sel)) {
last.prime <- last.prime + min(sel)
} else {
last.prime <- fsqr + 1
}
}
MyPs <- which(primes)
rm(primes)
gc()
MyPs
}
MyIntToBit <- function(x, dig) {
i <- 0L
string <- numeric(dig)
while (x > 0) {
string[dig - i] <- x %% 2L
x <- x %/% 2L
i <- i + 1L
}
string
}
ExpBySquaringBig <- function(x, n, p) {
if (n == 1) {
MyAns <- mod.bigz(x,p)
} else if (mod.bigz(n,2)==0) {
MyAns <- ExpBySquaringBig(mod.bigz(pow.bigz(x,2),p),div.bigz(n,2),p)
} else {
MyAns <- mod.bigz(mul.bigz(x,ExpBySquaringBig(mod.bigz(
pow.bigz(x,2),p), div.bigz(sub.bigz(n,1),2),p)),p)
}
MyAns
}
TonelliShanks <- function(a,p) {
P1 <- sub.bigz(p,1); j <- 0L; s <- P1
while (mod.bigz(s,2)==0L) {s <- s/2; j <- j+1L}
if (j==1L) {
MyAns1 <- ExpBySquaringBig(a,(p+1L)/4,p)
MyAns2 <- mod.bigz(-1 * ExpBySquaringBig(a,(p+1L)/4,p),p)
} else {
n <- 2L
Legendre2 <- ExpBySquaringBig(n,P1/2,p)
while (Legendre2==1L) {n <- n+1L; Legendre2 <- ExpBySquaringBig(n,P1/2,p)}
x <- ExpBySquaringBig(a,(s+1L)/2,p)
b <- ExpBySquaringBig(a,s,p)
g <- ExpBySquaringBig(n,s,p)
r <- j; m <- 1L
Test <- mod.bigz(b,p)
while (!(Test==1L) && !(m==0L)) {
m <- 0L
Test <- mod.bigz(b,p)
while (!(Test==1L)) {m <- m+1L; Test <- ExpBySquaringBig(b,pow.bigz(2,m),p)}
if (!m==0) {
x <- mod.bigz(x * ExpBySquaringBig(g,pow.bigz(2,r-m-1L),p),p)
g <- ExpBySquaringBig(g,pow.bigz(2,r-m),p)
b <- mod.bigz(b*g,p); r <- m
}; Test <- 0L
}; MyAns1 <- x; MyAns2 <- mod.bigz(p-x,p)
}
c(MyAns1, MyAns2)
}
SieveLists <- function(facLim, FBase, vecLen, sieveD, MInt) {
vLen <- ceiling(vecLen/2); SecondHalf <- (vLen+1L):vecLen
MInt1 <- MInt[1:vLen]; MInt2 <- MInt[SecondHalf]
tl <- vector("list",length=facLim)
for (m in 3:facLim) {
st1 <- mod.bigz(MInt1[1],FBase[m])
m1 <- 1L+as.integer(mod.bigz(sieveD[[m]][1] - st1,FBase[m]))
m2 <- 1L+as.integer(mod.bigz(sieveD[[m]][2] - st1,FBase[m]))
sl1 <- seq.int(m1,vLen,FBase[m])
sl2 <- seq.int(m2,vLen,FBase[m])
tl1 <- list(sl1,sl2)
st2 <- mod.bigz(MInt2[1],FBase[m])
m3 <- vLen+1L+as.integer(mod.bigz(sieveD[[m]][1] - st2,FBase[m]))
m4 <- vLen+1L+as.integer(mod.bigz(sieveD[[m]][2] - st2,FBase[m]))
sl3 <- seq.int(m3,vecLen,FBase[m])
sl4 <- seq.int(m4,vecLen,FBase[m])
tl2 <- list(sl3,sl4)
tl[[m]] <- list(tl1,tl2)
}
tl
}
SieverMod <- function(facLim, FBase, vecLen, SD, MInt, FList, LogFB, Lim, myCol) {
MyLogs <- rep(0,nrow(SD))
for (m in 3:facLim) {
MyBool <- rep(FALSE,vecLen)
MyBool[c(FList[[m]][[1]][[1]],FList[[m]][[2]][[1]])] <- TRUE
MyBool[c(FList[[m]][[1]][[2]],FList[[m]][[2]][[2]])] <- TRUE
temp <- which(MyBool)
MyLogs[temp] <- MyLogs[temp] + LogFB[m]
}
MySieve <- which(MyLogs > Lim)
MInt <- MInt[MySieve]; NewSD <- SD[MySieve,]
newLen <- length(MySieve); GoForIT <- FALSE
MyMat <- matrix(integer(0),nrow=newLen,ncol=myCol)
MyMat[which(NewSD[,1L] < 0),1L] <- 1L; MyMat[which(NewSD[,1L] > 0),1L] <- 0L
if ((myCol-1L) - (facLim+1L) > 0L) {MyMat[,((facLim+2L):(myCol-1L))] <- 0L}
if (newLen==1L) {MyMat <- matrix(MyMat,nrow=1,byrow=TRUE)}
if (newLen > 0L) {
GoForIT <- TRUE
for (m in 1:facLim) {
vec <- rep(0L,newLen)
temp <- which((NewSD[,1L]%%FBase[m])==0L)
NewSD[temp,] <- NewSD[temp,]/FBase[m]; vec[temp] <- 1L
test <- temp[which((NewSD[temp,]%%FBase[m])==0L)]
while (length(test)>0L) {
NewSD[test,] <- NewSD[test,]/FBase[m]
vec[test] <- (vec[test]+1L)
test <- test[which((NewSD[test,]%%FBase[m])==0L)]
}
MyMat[,m+1L] <- vec
}
}
list(MyMat,NewSD,MInt,GoForIT)
}
reduceMatrix <- function(mat) {
tempMin <- 0L; n1 <- ncol(mat); n2 <- nrow(mat)
mymax <- 1L
for (i in 1:n1) {
temp <- which(mat[,i]==1L)
t <- which(temp >= mymax)
if (length(temp)>0L && length(t)>0L) {
MyMin <- min(temp[t])
if (!(MyMin==mymax)) {
vec <- mat[MyMin,]
mat[MyMin,] <- mat[mymax,]
mat[mymax,] <- vec
}
t <- t[-1]; temp <- temp[t]
for (j in temp) {mat[j,] <- (mat[j,]+mat[mymax,])%%2L}
mymax <- mymax+1L
}
}
if (mymax<n2) {simpMat <- mat[-(mymax:n2),]} else {simpMat <- mat}
lenSimp <- nrow(simpMat)
if (is.null(lenSimp)) {lenSimp <- 0L}
mycols <- 1:n1
if (lenSimp>1L) {
## "Diagonalizing" Matrix
for (i in 1:lenSimp) {
if (all(simpMat[i,]==0L)) {simpMat <- simpMat[-i,]; next}
if (!simpMat[i,i]==1L) {
t <- min(which(simpMat[i,]==1L))
vec <- simpMat[,i]; tempCol <- mycols[i]
simpMat[,i] <- simpMat[,t]; mycols[i] <- mycols[t]
simpMat[,t] <- vec; mycols[t] <- tempCol
}
}
lenSimp <- nrow(simpMat); MyList <- vector("list",length=n1)
MyFree <- mycols[which((1:n1)>lenSimp)]; for (i in MyFree) {MyList[[i]] <- i}
if (is.null(lenSimp)) {lenSimp <- 0L}
if (lenSimp>1L) {
for (i in lenSimp:1L) {
t <- which(simpMat[i,]==1L)
if (length(t)==1L) {
simpMat[ ,t] <- 0L
MyList[[mycols[i]]] <- 0L
} else {
t1 <- t[t>i]
if (all(t1 > lenSimp)) {
MyList[[mycols[i]]] <- MyList[[mycols[t1[1]]]]
if (length(t1)>1) {
for (j in 2:length(t1)) {MyList[[mycols[i]]] <- c(MyList[[mycols[i]]], MyList[[mycols[t1[j]]]])}
}
}
else {
for (j in t1) {
if (length(MyList[[mycols[i]]])==0L) {MyList[[mycols[i]]] <- MyList[[mycols[j]]]}
else {
e1 <- which(MyList[[mycols[i]]]%in%MyList[[mycols[j]]])
if (length(e1)==0) {
MyList[[mycols[i]]] <- c(MyList[[mycols[i]]],MyList[[mycols[j]]])
} else {
e2 <- which(!MyList[[mycols[j]]]%in%MyList[[mycols[i]]])
MyList[[mycols[i]]] <- MyList[[mycols[i]]][-e1]
if (length(e2)>0L) {MyList[[mycols[i]]] <- c(MyList[[mycols[i]]], MyList[[mycols[j]]][e2])}
}
}
}
}
}
}
TheList <- lapply(MyList, function(x) {if (length(x)==0L) {0} else {x}})
list(TheList,MyFree)
} else {
list(NULL,NULL)
}
} else {
list(NULL,NULL)
}
}
GetFacs <- function(vec1, vec2, n) {
x <- mod.bigz(prod.bigz(vec1),n)
y <- mod.bigz(prod.bigz(vec2),n)
MyAns <- c(gcd.bigz(x-y,n),gcd.bigz(x+y,n))
MyAns[sort.list(asNumeric(MyAns))]
}
SolutionSearch <- function(mymat, M2, n, FB) {
colTest <- which(apply(mymat, 2, sum) == 0)
if (length(colTest) > 0) {solmat <- mymat[ ,-colTest]} else {solmat <- mymat}
if (length(nrow(solmat)) > 0) {
nullMat <- reduceMatrix(t(solmat %% 2L))
listSol <- nullMat[[1]]; freeVar <- nullMat[[2]]; LF <- length(freeVar)
} else {LF <- 0L}
if (LF > 0L) {
for (i in 2:min(10^8,(2^LF + 1L))) {
PosAns <- MyIntToBit(i, LF)
posVec <- sapply(listSol, function(x) {
t <- which(freeVar %in% x)
if (length(t)==0L) {
0
} else {
sum(PosAns[t])%%2L
}
})
ansVec <- which(posVec==1L)
if (length(ansVec)>0) {
if (length(ansVec) > 1L) {
myY <- apply(mymat[ansVec,],2,sum)
} else {
myY <- mymat[ansVec,]
}
if (sum(myY %% 2) < 1) {
myY <- as.integer(myY/2)
myY <- pow.bigz(FB,myY[-1])
temp <- GetFacs(M2[ansVec], myY, n)
if (!(1==temp[1]) && !(1==temp[2])) {
return(temp)
}
}
}
}
}
}
### Below is the main portion of the Quadratic Sieve
BegTime <- Sys.time(); MyNum <- as.bigz(MyN); DigCount <- nchar(as.character(MyN))
P <- PrimeSieve(10^5)
SqrtInt <- .mpfr2bigz(trunc(sqrt(mpfr(MyNum,sizeinbase(MyNum,b=2)+5L))))
if (DigCount < 24) {
DigSize <- c(4,10,15,20,23)
f_Pos <- c(0.5,0.25,0.15,0.1,0.05)
MSize <- c(5000,7000,10000,12500,15000)
if (fudge1==0L) {
LM1 <- lm(f_Pos ~ DigSize)
m1 <- summary(LM1)$coefficients[2,1]
b1 <- summary(LM1)$coefficients[1,1]
fudge1 <- DigCount*m1 + b1
}
if (LenB==0L) {
LM2 <- lm(MSize ~ DigSize)
m2 <- summary(LM2)$coefficients[2,1]
b2 <- summary(LM2)$coefficients[1,1]
LenB <- ceiling(DigCount*m2 + b2)
}
LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
B <- P[P<=LimB]; B <- B[-1]
facBase <- P[which(sapply(B, function(x) ExpBySquaringBig(MyNum,(x-1)/2,x)==1L))+1L]
LenFBase <- length(facBase)+1L
} else if (DigCount < 67) {
## These values were obtained from "The Multiple Polynomial
## Quadratic Sieve" by Robert D. Silverman
DigSize <- c(24,30,36,42,48,54,60,66)
FBSize <- c(100,200,400,900,1200,2000,3000,4500)
MSize <- c(5,25,25,50,100,250,350,500)
LM1 <- loess(FBSize ~ DigSize)
LM2 <- loess(MSize ~ DigSize)
if (fudge1==0L) {
fudge1 <- -0.4
LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
myTarget <- ceiling(predict(LM1, DigCount))
while (LimB < myTarget) {
LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
fudge1 <- fudge1+0.001
}
B <- P[P<=LimB]; B <- B[-1]
facBase <- P[which(sapply(B, function(x) ExpBySquaringBig(MyNum,(x-1)/2,x)==1L))+1L]
LenFBase <- length(facBase)+1L
while (LenFBase < myTarget) {
fudge1 <- fudge1+0.005
LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
myind <- which(P==max(B))+1L
myset <- tempP <- P[myind]
while (tempP < LimB) {
myind <- myind + 1L
tempP <- P[myind]
myset <- c(myset, tempP)
}
for (p in myset) {
t <- ExpBySquaringBig(MyNum,(p-1)/2,p)==1L
if (t) {facBase <- c(facBase,p)}
}
B <- c(B, myset)
LenFBase <- length(facBase)+1L
}
} else {
LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
B <- P[P<=LimB]; B <- B[-1]
facBase <- P[which(sapply(B, function(x) ExpBySquaringBig(MyNum,(x-1)/2,x)==1L))+1L]
LenFBase <- length(facBase)+1L
}
if (LenB==0L) {LenB <- 1000*ceiling(predict(LM2, DigCount))}
} else {
return("The number you've entered is currently too big for this algorithm!!")
}
SieveDist <- lapply(facBase, function(x) TonelliShanks(MyNum,x))
SieveDist <- c(1L,SieveDist); SieveDist[[1]] <- c(SieveDist[[1]],1L); facBase <- c(2L,facBase)
Lower <- -LenB; Upper <- LenB; LenB2 <- 2*LenB+1L; MyInterval <- Lower:Upper
M <- MyInterval + SqrtInt ## Set that will be tested
SqrDiff <- matrix(sub.bigz(pow.bigz(M,2),MyNum),nrow=length(M),ncol=1L)
maxM <- max(MyInterval)
LnFB <- log(facBase)
## N.B. primo uses 0.735, as his siever
## is more efficient than the one employed here
if (fudge2==0L) {
if (DigCount < 8) {
fudge2 <- 0
} else if (DigCount < 12) {
fudge2 <- .7
} else if (DigCount < 20) {
fudge2 <- 1.3
} else {
fudge2 <- 1.6
}
}
TheCut <- log10(maxM*sqrt(2*asNumeric(MyNum)))*fudge2
myPrimes <- as.bigz(facBase)
CoolList <- SieveLists(LenFBase, facBase, LenB2, SieveDist, MyInterval)
GetMatrix <- SieverMod(LenFBase, facBase, LenB2, SqrDiff, M, CoolList, LnFB, TheCut, LenFBase+1L)
if (GetMatrix[[4]]) {
newmat <- GetMatrix[[1]]; NewSD <- GetMatrix[[2]]; M <- GetMatrix[[3]]
NonSplitFacs <- which(abs(NewSD[,1L])>1L)
newmat <- newmat[-NonSplitFacs, ]
M <- M[-NonSplitFacs]
lenM <- length(M)
if (class(newmat) == "matrix") {
if (nrow(newmat) > 0) {
PosAns <- SolutionSearch(newmat,M,MyNum,myPrimes)
} else {
PosAns <- vector()
}
} else {
newmat <- matrix(newmat, nrow = 1)
PosAns <- vector()
}
} else {
newmat <- matrix(integer(0),ncol=(LenFBase+1L))
PosAns <- vector()
}
Atemp <- .mpfr2bigz(trunc(sqrt(sqrt(mpfr(2*MyNum))/maxM)))
if (Atemp < max(facBase)) {Atemp <- max(facBase)}; myPoly <- 0L
while (length(PosAns)==0L) {LegTest <- TRUE
while (LegTest) {
Atemp <- nextprime(Atemp)
Legendre <- asNumeric(ExpBySquaringBig(MyNum,(Atemp-1L)/2,Atemp))
if (Legendre == 1) {LegTest <- FALSE}
}
A <- Atemp^2
Btemp <- max(TonelliShanks(MyNum, Atemp))
B2 <- (Btemp + (MyNum - Btemp^2) * inv.bigz(2*Btemp,Atemp))%%A
C <- as.bigz((B2^2 - MyNum)/A)
myPoly <- myPoly + 1L
polySieveD <- lapply(1:LenFBase, function(x) {
AInv <- inv.bigz(A,facBase[x])
asNumeric(c(((SieveDist[[x]][1]-B2)*AInv)%%facBase[x],
((SieveDist[[x]][2]-B2)*AInv)%%facBase[x]))
})
M1 <- A*MyInterval + B2
SqrDiff <- matrix(A*pow.bigz(MyInterval,2) + 2*B2*MyInterval + C,nrow=length(M1),ncol=1L)
CoolList <- SieveLists(LenFBase, facBase, LenB2, polySieveD, MyInterval)
myPrimes <- c(myPrimes,Atemp)
LenP <- length(myPrimes)
GetMatrix <- SieverMod(LenFBase, facBase, LenB2, SqrDiff, M1, CoolList, LnFB, TheCut, LenP+1L)
if (GetMatrix[[4]]) {
n2mat <- GetMatrix[[1]]; N2SD <- GetMatrix[[2]]; M1 <- GetMatrix[[3]]
n2mat[,LenP+1L] <- rep(2L,nrow(N2SD))
if (length(N2SD) > 0) {NonSplitFacs <- which(abs(N2SD[,1L])>1L)} else {NonSplitFacs <- LenB2}
if (length(NonSplitFacs)<2*LenB) {
M1 <- M1[-NonSplitFacs]; lenM1 <- length(M1)
n2mat <- n2mat[-NonSplitFacs,]
if (lenM1==1L) {n2mat <- matrix(n2mat,nrow=1)}
if (ncol(newmat) < (LenP+1L)) {
numCol <- (LenP + 1L) - ncol(newmat)
newmat <- cbind(newmat,matrix(rep(0L,numCol*nrow(newmat)),ncol=numCol))
}
newmat <- rbind(newmat,n2mat); lenM <- lenM+lenM1; M <- c(M,M1)
if (class(newmat) == "matrix") {
if (nrow(newmat) > 0) {
PosAns <- SolutionSearch(newmat,M,MyNum,myPrimes)
}
}
}
}
}
EndTime <- Sys.time()
TotTime <- EndTime - BegTime
print(format(TotTime))
return(PosAns)
}
With Old QS algorithm
> library(gmp)
> library(Rmpfr)
> n3 <- prod(nextprime(urand.bigz(2, 40, 17)))
> system.time(t5 <- QuadSieveAll(n3,0.1,myps))
user system elapsed
164.72 0.77 165.63
> system.time(t6 <- factorize(n3))
user system elapsed
0.1 0.0 0.1
> all(t5[sort.list(asNumeric(t5))]==t6[sort.list(asNumeric(t6))])
[1] TRUE
With New Multi-Polynomial QS algorithm
> QuadSieveMultiPolysAll(n3)
[1] "4.952 secs"
Big Integer ('bigz') object of length 2:
[1] 342086446909 483830424611
> n4 <- prod(nextprime(urand.bigz(2,50,5)))
> QuadSieveMultiPolysAll(n4) ## With old algo, it took over 4 hours
[1] "1.131717 mins"
Big Integer ('bigz') object of length 2:
[1] 166543958545561 880194119571287
> n5 <- as.bigz("94968915845307373740134800567566911") ## 35 digits
> QuadSieveMultiPolysAll(n5)
[1] "3.813167 mins"
Big Integer ('bigz') object of length 2:
[1] 216366620575959221 438925910071081891
> system.time(factorize(n5)) ## It appears we are reaching the limits of factorize
user system elapsed
131.97 0.00 131.98
Side note: The number n5 above is a very interesting number. Check it out here
The Breaking Point!!!!
> n6 <- factorialZ(38) + 1L ## 45 digits
> QuadSieveMultiPolysAll(n6)
[1] "22.79092 mins"
Big Integer ('bigz') object of length 2:
[1] 14029308060317546154181 37280713718589679646221
> system.time(factorize(n6)) ## Shut it down after 2 days of running
Latest Triumph (50 digits)
> n9 <- prod(nextprime(urand.bigz(2,82,42)))
> QuadSieveMultiPolysAll(n9)
[1] "12.9297 hours"
Big Integer ('bigz') object of length 2:
[1] 2128750292720207278230259 4721136619794898059404993
## Based off of some crude test, factorize(n9) would take more than a year.
It should be noted that the QS generally doesn't perform as well as Pollard's rho algorithm on smaller numbers; the power of the QS starts to become apparent as the numbers get larger.
The following approach delivers correct results, even in cases of really big numbers (which should be passed as strings). And it's really fast.
# TEST
# x <- as.bigz("12345678987654321")
# all_divisors(x)
# all_divisors(x*x)
# x <- pow.bigz(2,89)-1
# all_divisors(x)
library(gmp)
options(scipen =30)
sort_listz <- function(z) {
#==========================
z <- z[order(as.numeric(z))] # sort(z)
} # function sort_listz
mult_listz <- function(x,y) {
do.call('c', lapply(y, function(i) i*x))
}
all_divisors <- function(x) {
#==========================
if (abs(x)<=1) return(x)
else {
factorsz <- as.bigz(factorize(as.bigz(x))) # factorize returns up to
# e.g. x= 12345678987654321 factors: 3 3 3 3 37 37 333667 333667
factorsz <- sort_listz(factorsz) # vector of primes, sorted
prime_factorsz <- unique(factorsz)
#prime_ekt <- sapply(prime_factorsz, function(i) length( factorsz [factorsz==i]))
prime_ekt <- vapply(prime_factorsz, function(i) sum(factorsz==i), integer(1), USE.NAMES=FALSE)
spz <- vector() # keep all divisors
all <-1
n <- length(prime_factorsz)
for (i in 1:n) {
pr <- prime_factorsz[i]
pe <- prime_ekt[i]
all <- all*(pe+1) #counts all divisors
prz <- as.bigz(pr)
pse <- vector(mode="raw",length=pe+1)
pse <- c( as.bigz(1), prz)
if (pe>1) {
for (k in 2:pe) {
prz <- prz*pr
pse[k+1] <- prz
} # for k
} # if pe>1
if (i>1) {
spz <- mult_listz (spz, pse)
} else {
spz <- pse;
} # if i>1
} #for n
spz <- sort_listz (spz)
return (spz)
}
} # function factors_all_divisors
#====================================
Refined version, very fast. Code remains simple, readable & clean.
TEST
#Test 4 (big prime factor)
x <- pow.bigz(2,256)+1 # = 1238926361552897 * 93461639715357977769163558199606896584051237541638188580280321
system.time(z2 <- all_divisors(x))
# user system elapsed
# 19.27 1.27 20.56
#Test 5 (big prime factor)
x <- as.bigz("12345678987654321321") # = 3 * 19 * 216590859432531953
system.time(x2 <- all_divisors(x^2))
#user system elapsed
#25.65 0.00 25.67
Major Update
Below is my latest R factorization algorithm. It is way faster and pays homage to the rle function.
Algorithm 3 (Updated)
library(gmp)
MyFactors <- function(MyN) {
  myRle <- function(x1) {
    n1 <- length(x1)
    y1 <- x1[-1L] != x1[-n1]
    i <- c(which(y1), n1)
    list(lengths = diff(c(0L, i)), values = x1[i], uni = sum(y1) + 1L)
  }
  if (MyN==1L) return(MyN)
  else {
    pfacs <- myRle(factorize(MyN))
    unip <- pfacs$values
    pv <- pfacs$lengths
    n <- pfacs$uni
    myf <- unip[1L]^(0L:pv[1L])
    if (n > 1L) {
      for (j in 2L:n) {
        myf <- c(myf, do.call(c, lapply(unip[j]^(1L:pv[j]), function(x) x*myf)))
      }
    }
  }
  myf[order(asNumeric(myf))]  ## 'order' is faster than 'sort.list'
}
Below are the new benchmarks (As Dirk Eddelbuettel says here, "Can't argue with empirics."):
Case 1 (large prime factors)
set.seed(100)
myList <- lapply(1:10^3, function(x) sample(10^6, 10^5))
benchmark(SortList=lapply(myList, function(x) sort.list(x)),
OrderFun=lapply(myList, function(x) order(x)),
replications=3,
columns = c("test", "replications", "elapsed", "relative"))
test replications elapsed relative
2 OrderFun 3 59.41 1.000
1 SortList 3 61.52 1.036
## The times are limited by "gmp::factorize" and since it relies on
## pseudo-random numbers, the times can vary (i.e. one pseudo random
## number may lead to a factorization faster than others). With this
## in mind, any differences less than a half of second
## (or so) should be viewed as the same.
x <- pow.bigz(2,256)+1
system.time(z1 <- MyFactors(x))
user system elapsed
14.94 0.00 14.94
system.time(z2 <- all_divisors(x)) ## system.time(factorize(x))
user system elapsed ## user system elapsed
14.94 0.00 14.96 ## 14.94 0.00 14.94
all(z1==z2)
[1] TRUE
x <- as.bigz("12345678987654321321")
system.time(x1 <- MyFactors(x^2))
user system elapsed
20.66 0.02 20.71
system.time(x2 <- all_divisors(x^2)) ## system.time(factorize(x^2))
user system elapsed ## user system elapsed
20.69 0.00 20.69 ## 20.67 0.00 20.67
all(x1==x2)
[1] TRUE
Case 2 (smaller numbers)
set.seed(199)
samp <- sample(10^9, 10^5)
benchmark(JosephDivs=sapply(samp, MyFactors),
DontasDivs=sapply(samp, all_divisors),
OldDontas=sapply(samp, Oldall_divisors),
replications=10,
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
test replications elapsed relative
1 JosephDivs 10 470.31 1.000
2 DontasDivs 10 567.10 1.206 ## with vapply(..., USE.NAMES = FALSE)
3 OldDontas 10 626.19 1.331 ## with sapply
Case 3 (for complete thoroughness)
set.seed(97)
samp <- sample(10^6, 10^4)
benchmark(JosephDivs=sapply(samp, MyFactors),
DontasDivs=sapply(samp, all_divisors),
CottonDivs=sapply(samp, get_all_factors),
ChaseDivs=sapply(samp, FUN),
replications=5,
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
test replications elapsed relative
1 JosephDivs 5 22.68 1.000
2 DontasDivs 5 27.66 1.220
3 CottonDivs 5 126.66 5.585
4 ChaseDivs 5 554.25 24.438
Original Post
The algorithm by @RichieCotton is a very nice R implementation. The brute force method will only get you so far and fails with large numbers. I have provided three algorithms that meet different needs. The first one (the original algorithm I posted in Jan 15, updated slightly) is a stand-alone factorization algorithm which offers a combinatorial approach that is efficient, accurate, and can be easily translated into other languages. The second algorithm is more of a sieve that is very fast and extremely useful when you need the factorization of thousands of numbers quickly. The third is a short (posted above), yet powerful stand-alone algorithm that is superior for any number less than 2^70 (I scrapped almost everything from my original code). I drew inspiration from Richie Cotton's use of the plyr::count function (it inspired me to write my own rle function that has a very similar return as plyr::count), George Dontas's clean way of handling the trivial case (i.e. if (n==1) return(1)), and the solution provided by @Zelazny7 to a question I had regarding bigz vectors.
Algorithm 1 (original)
library(gmp)
factor2 <- function(MyN) {
if (MyN == 1) return(1L)
else {
max_p_div <- factorize(MyN)
prime_vec <- max_p_div <- max_p_div[sort.list(asNumeric(max_p_div))]
my_factors <- powers <- as.bigz(vector())
uni_p <- unique(prime_vec); maxp <- max(prime_vec)
for (i in 1:length(uni_p)) {
temp_size <- length(which(prime_vec == uni_p[i]))
powers <- c(powers, pow.bigz(uni_p[i], 1:temp_size))
}
my_factors <- c(as.bigz(1L), my_factors, powers)
temp_facs <- powers; r <- 2L
temp_facs2 <- max_p_div2 <- as.bigz(vector())
while (r <= length(uni_p)) {
for (i in 1:length(temp_facs)) {
a <- which(prime_vec > max_p_div[i])
temp <- mul.bigz(temp_facs[i], powers[a])
temp_facs2 <- c(temp_facs2, temp)
max_p_div2 <- c(max_p_div2, prime_vec[a])
}
my_sort <- sort.list(asNumeric(max_p_div2))
temp_facs <- temp_facs2[my_sort]
max_p_div <- max_p_div2[my_sort]
my_factors <- c(my_factors, temp_facs)
temp_facs2 <- max_p_div2 <- as.bigz(vector()); r <- r+1L
}
}
my_factors[sort.list(asNumeric(my_factors))]
}
Algorithm 2 (sieve)
EfficientFactorList <- function(n) {
  MyFactsList <- lapply(1:n, function(x) 1)
  for (j in 2:n) {
    for (r in seq.int(j, n, j)) {MyFactsList[[r]] <- c(MyFactsList[[r]], j)}
  }
  MyFactsList
}
It gives all factors of every number between 1 and 100,000 in less than 2 seconds. To give you an idea of the efficiency of this algorithm, factoring 1 through 100,000 with the brute force method takes close to 3 minutes.
system.time(t1 <- EfficientFactorList(10^5))
user system elapsed
1.04 0.00 1.05
system.time(t2 <- sapply(1:10^5, MyFactors))
user system elapsed
39.21 0.00 39.23
system.time(t3 <- sapply(1:10^5, all_divisors))
user system elapsed
49.03 0.02 49.05
TheTest <- sapply(1:10^5, function(x) all(t2[[x]]==t3[[x]]) && all(asNumeric(t2[[x]])==t1[[x]]) && all(asNumeric(t3[[x]])==t1[[x]]))
all(TheTest)
[1] TRUE
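The return value of EfficientFactorList is a plain list indexed by the integer itself, so individual lookups afterwards are trivial; a small sketch:
facs <- EfficientFactorList(100)
facs[[60]]
# [1]  1  2  3  4  5  6 10 12 15 20 30 60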
Final Thoughts
@Dontas's original comment about factoring large numbers got me thinking: what about really, really large numbers, like numbers greater than 2^200? You will see that whichever algorithm you choose on this page, they will all take a very long time because most of them rely on gmp::factorize, which uses the Pollard-Rho algorithm. From this question, that algorithm is only reasonable for numbers less than 2^70. I am currently working on my own factorization algorithm implementing the Quadratic Sieve, which should take all of these algorithms to the next level.
A lot has changed in the R language since this question was originally asked. In version 0.6-3 of the numbers package, the function divisors was included, which is very useful for getting all of the factors of a number. It will meet the needs of most users; however, if you are looking for raw speed or you are working with larger numbers, you will need an alternative method. I have authored two new packages (partially inspired by this question, I might add) that contain highly optimized functions aimed at problems just like this. The first one is RcppAlgos and the other is RcppBigIntAlgos (formerly called bigIntegerAlgos).
RcppAlgos
RcppAlgos contains two functions for obtaining divisors of numbers less than 2^53 - 1: divisorsRcpp (a vectorized function for quickly obtaining the complete factorization of many numbers) and divisorsSieve (which quickly generates the complete factorization over a range). First up, we factor many random numbers using divisorsRcpp:
library(gmp)  ## for all_divisors by @GeorgeDontas
library(RcppAlgos)
library(numbers)
options(scipen = 999)
set.seed(42)
testSamp <- sample(10^10, 10)
## vectorized so you can pass the entire vector as an argument
testRcpp <- divisorsRcpp(testSamp)
testDontas <- lapply(testSamp, all_divisors)
identical(lapply(testDontas, as.numeric), testRcpp)
[1] TRUE
And now, factor many numbers over a range using divisorsSieve:
system.time(testSieve <- divisorsSieve(10^13, 10^13 + 10^5))
user system elapsed
0.242 0.006 0.247
system.time(testDontasSieve <- lapply((10^13):(10^13 + 10^5), all_divisors))
user system elapsed
47.880 0.132 47.922
identical(lapply(testDontasSieve, asNumeric), testSieve)
[1] TRUE
Both divisorsRcpp and divisorsSieve are nice functions that are flexible and efficient, however they are limited to 2^53 - 1.
RcppBigIntAlgos
The RcppBigIntAlgos package (formerly called bigIntegerAlgos prior to version 0.2.0) links directly to the C library gmp and features divisorsBig which is designed for very large numbers.
library(RcppBigIntAlgos)
## testSamp is defined above... N.B. divisorsBig is not quite as
## efficient as divisorsRcpp. This is so because divisorsRcpp
## can take advantage of more efficient data types.
testBig <- divisorsBig(testSamp)
identical(testDontas, testBig)
[1] TRUE
And here are the benchmarks as defined in my original post (N.B. MyFactors is replaced by divisorsRcpp and divisorsBig).
## Case 2
library(rbenchmark)
set.seed(199)
samp <- sample(10^9, 10^5)
benchmark(RcppAlgos=divisorsRcpp(samp),
RcppBigIntAlgos=divisorsBig(samp),
DontasDivs=lapply(samp, all_divisors),
replications=10,
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
test replications elapsed relative
1 RcppAlgos 10 5.236 1.000
2 RcppBigIntAlgos 10 12.846 2.453
3 DontasDivs 10 383.742 73.289
## Case 3
set.seed(97)
samp <- sample(10^6, 10^4)
benchmark(RcppAlgos=divisorsRcpp(samp),
RcppBigIntAlgos=divisorsBig(samp),
numbers=lapply(samp, divisors), ## From the numbers package
DontasDivs=lapply(samp, all_divisors),
CottonDivs=lapply(samp, get_all_factors),
ChaseDivs=lapply(samp, FUN),
replications=5,
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
test replications elapsed relative
1 RcppAlgos 5 0.083 1.000
2 RcppBigIntAlgos 5 0.265 3.193
3 numbers 5 12.913 155.578
4 DontasDivs 5 15.813 190.518
5 CottonDivs 5 60.745 731.867
6 ChaseDivs 5 299.520 3608.675
The next benchmarks demonstrate the true power of the underlying algorithm in the divisorsBig function. The number being factored is a power of 10, so the prime factoring step can almost be completely ignored (e.g. system.time(factorize(pow.bigz(10,30))) registers 0 on my machine). Thus, the difference in timing is due solely to how quickly the prime factors can be combined to produce all factors.
library(microbenchmark)
powTen <- pow.bigz(10, 30)
microbenchmark(divisorsBig(powTen), all_divisors(powTen), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
divisorsBig(powTen) 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
all_divisors(powTen) 21.49849 21.27973 21.13085 20.63345 21.18834 20.38772 100
## Negative numbers show an even greater increase in efficiency
negPowTen <- powTen * -1
microbenchmark(divisorsBig(negPowTen), all_divisors(negPowTen), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
divisorsBig(negPowTen) 1.00000 1.0000 1.0000 1.00000 1.00000 1.00000 100
all_divisors(negPowTen) 28.75275 28.1864 27.9335 27.57434 27.91376 30.16962 100
Very Large Numbers
With divisorsBig, obtaining the complete factorization with very large inputs is no problem. The algorithm dynamically adjusts based off of the input and applies different algorithms in different situations. We can also take advantage of multithreading if Lenstra's Elliptic Curve method or the Quadratic Sieve is utilized.
Here are some examples using n5 and n9 defined in this answer.
n5 <- as.bigz("94968915845307373740134800567566911")
system.time(print(divisorsBig(n5)))
Big Integer ('bigz') object of length 4:
[1] 1 216366620575959221 438925910071081891
[4] 94968915845307373740134800567566911
user system elapsed
0.162 0.003 0.164
n9 <- prod(nextprime(urand.bigz(2, 82, 42)))
system.time(print(divisorsBig(n9, nThreads=4)))
Big Integer ('bigz') object of length 4:
[1] 1
[2] 2128750292720207278230259
[3] 4721136619794898059404993
[4] 10050120961360479179164300841596861740399588283187
user system elapsed
1.776 0.011 0.757
Here is an example provided by @Dontas with one large prime and one smaller prime:
x <- pow.bigz(2, 256) + 1
divisorsBig(x, showStats=TRUE, nThreads=8)
Summary Statistics for Factoring:
115792089237316195423570985008687907853269984665640564039457584007913129639937
| Pollard Rho Time |
|--------------------|
| 479ms |
| Lenstra ECM Time | Number of Curves |
|--------------------|--------------------|
| 1s 870ms | 2584 |
| Total Time |
|--------------------|
| 2s 402ms |
Big Integer ('bigz') object of length 4:
[1] 1
[2] 1238926361552897
[3] 93461639715357977769163558199606896584051237541638188580280321
[4] 115792089237316195423570985008687907853269984665640564039457584007913129639937
Compare this to finding the prime factorization using gmp::factorize:
system.time(factorize(x))
user system elapsed
9.199 0.036 9.248
Lastly, here is an example with a large semiprime (N.B. since we know it's a semiprime, we skip the extended Pollard's rho algorithm as well as Lenstra's elliptic curve method).
## https://members.loria.fr/PZimmermann/records/rsa.html
rsa79 <- as.bigz("7293469445285646172092483905177589838606665884410340391954917800303813280275279")
divisorsBig(rsa79, nThreads=8, showStats=TRUE, skipPolRho=T, skipECM=T)
Summary Statistics for Factoring:
7293469445285646172092483905177589838606665884410340391954917800303813280275279
| MPQS Time | Complete | Polynomials | Smooths | Partials |
|--------------------|----------|-------------|------------|------------|
| 2m 49s 174ms | 100 | 91221 | 5651 | 7096 |
| Mat Algebra Time | Mat Dimension |
|--------------------|--------------------|
| 14s 863ms | 12625 x 12747 |
| Total Time |
|--------------------|
| 3m 4s 754ms |
Big Integer ('bigz') object of length 4:
[1] 1
[2] 848184382919488993608481009313734808977
[3] 8598919753958678882400042972133646037727
[4] 7293469445285646172092483905177589838606665884410340391954917800303813280275279
