Limit rpois distribution?

I'm looking to use rpois, but I'm having trouble with the lower limits causing errors in my model.
A toy example:
a <- 1
repeat {
  if (a >= 1) {
    x <- rpois(a, 1 * .15)
  } else {
    x <- 0
  }
  print(x)
  if (a - x < 0) {
    break
  }
}
Basically, I can't have a - x be negative. Is there a way to set a min/max limit on rpois?

You have found a solution in extraDistr::rtpois.
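For reference, a minimal sketch of that approach, assuming the extraDistr package is installed and using its rtpois(n, lambda, a, b) interface (check ?rtpois for the exact inclusivity of the truncation bounds):
library(extraDistr)
set.seed(1)
# draw from a Poisson(0.15) truncated above so that no value exceeds 1;
# the bound argument used here is an assumption -- verify it in ?rtpois
x <- rtpois(10, lambda = 0.15, b = 1)
table(x)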
This would have worked as a base R alternative:
rlimpois <- function(n, lambda, lowlimit, toplimit) {
  sample(x = lowlimit:toplimit, size = n,
         prob = dpois(lowlimit:toplimit, lambda), replace = TRUE)
}
So, for example, a sample of size 20 from a Poisson distribution with parameter 2, where none of the values exceed 3, might be:
set.seed(1)
rlimpois(20, 2, 0, 3)
# 1 2 2 0 1 0 0 3 2 1 1 1 3 2 3 2 3 0 2 3
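Plugged back into the toy example, one could cap the draw at the current value of a so that a - x can never go negative (a sketch using the rlimpois helper above):
# cap each draw at a so that a - x stays non-negative
a <- 1
x <- rlimpois(1, lambda = 1 * .15, lowlimit = 0, toplimit = a)
a - x  # guaranteed >= 0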

Related

How to create binary constraints for optimization in R?

I have a function f(x) that I intend to minimize. "x" is a vector containing 50 parameters. The function has several constraints: the first is that all parameters in x should be binary, so that x = (1,1,0,1,...); the second is that the sum of "x" should be exactly 25, so that sum(x) = 25. The problem can be illustrated as:
min f(x)
s.t. sum(x) = 25,
x = 0 or 1
However, when I try to solve this problem in R, I run into trouble. Common packages such as "optim" and "constrOptim" from "stats" only accept the coefficients of the objective function (in my case the function is a bit complex and cannot simply be expressed as a coefficient matrix), and "donlp2" from "Rdonlp" does not support constraining parameters to be binary. Does anyone have an idea of how to set binary constraints for this case?
Expanding my comment, here is an example of a Local Search, as implemented in package NMOF. (I borrow Stéphane's objective function).
library("NMOF")
library("neighbours")
## Stéphane's objective function
f <- function(x)
    sum(1:20 * x)
nb <- neighbourfun(type = "logical", kmin = 10, kmax = 10)
x0 <- c(rep(FALSE, 10), rep(TRUE, 10))
sol <- LSopt(f, list(x0 = x0, neighbour = nb, nI = 1000))
## initial solution
as.numeric(x0)
## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
## final solution
as.numeric(sol$xbest)
## [1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
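A quick sanity check on the run above (a sketch; it assumes the f and sol objects created there): the neighbourhood enforces exactly ten selected variables, and the best solution attains the minimal weighted sum.
sum(sol$xbest)  # number of selected variables -- should be 10
f(sol$xbest)    # objective value; 55 = sum(1:10) for the solution shown above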
(Disclosure: I am the maintainer of packages NMOF and neighbours.)
You can try the amazing package rgenoud. Below is an example.
I take 20 binary variables instead of your 50 for easier reading. I take f(x) = sum(1:20 * x); this is a weighted sum with increasing weights, so the best solution (restricted to sum(x) = 10) is clearly 1, 1, ..., 1, 0, 0, ..., 0. And rgenoud brilliantly finds it.
library(rgenoud)
f <- function(x) {  # the function to be minimized
  sum(1:20 * x)
}
g <- function(x) {
  c(
    ifelse(sum(x) == 10, 0, 1),  # set the constraint (here sum(x) = 10) in this way
    f(x)                         # the objective function (to minimize/maximize)
  )
}
solution <- genoud(
  g,
  pop.size = 3000,
  lexical = 2,          # see ?genoud for explanations
  nvars = 20,           # number of x_i's
  starting.values = c(rep(0, 10), rep(1, 10)),
  Domains = cbind(rep(0, 20), rep(1, 20)),  # lower and upper bounds
  data.type.int = TRUE  # the x_i's are integers
)
solution$par # the values of x
## [1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
solution$value
## [1]  0 55
(0 is the value of the constraint term ifelse(sum(x) == 10, 0, 1), and 55 is the value of f(x).)
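A quick check on the reported solution (a sketch; it assumes the solution object and f from the code above):
sum(solution$par)  # the constraint: should be 10
f(solution$par)    # the objective: 55, matching the second element of solution$value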

how to make double for loops faster in R

I am trying to do the calculation below using R. My function is recursive and uses a double for loop to fill the values of the "result" matrix. Is there a way to replace the for loops, or to evaluate the if condition faster?
x <- rnorm(2400, 0, 3)
y <- rnorm(400, 0, 3)
no_row <- length(x)
no_col <- length(y)
input <- matrix(data = 1, nrow = no_row, ncol = no_col)
result <- matrix(nrow = no_row, ncol = no_col)
calculation <- function(x, y)
{
  for (i in 1:no_row)
  {
    for (j in 1:no_col)
    {
      z <- exp(x[i] - y[j])
      result[i, j] <- (z/1 + z)
    }
  }
  new_x <- x - 1
  new_y <- y - 1
  residual <- input - result
  sq_sum_residulas <- sum((rowSums(residual, na.rm = T))^2)
  if (sq_sum_residulas >= 1) {calculation(new_x, new_y)}
  else (return(residual))
}
output <- calculation(x, y)
To complete Benjamin's answer: you shouldn't use a recursive function here. Use a while loop with a max_iter parameter instead.
Reusing Benjamin's function:
calculation2 <- function(x, y) {
  result <- outer(x, y, function(x, y) { z <- exp(x - y); z / 1 + z })
  result
}
calculation <- function(x, y, max_iter = 10) {
  input <- matrix(data = 1, nrow = length(x), ncol = length(y))
  sq_sum_residulas <- 1  # initialize it to enter the while loop
  new_x <- x             # working copy of x: updated at each iteration
  new_y <- y             # working copy of y
  n_iter <- 1            # iteration counter
  while (sq_sum_residulas >= 1 & n_iter < max_iter) {
    result <- calculation2(new_x, new_y)
    new_x <- x - 1
    new_y <- y - 1
    residual <- input - result
    sq_sum_residulas <- sum((rowSums(residual, na.rm = T))^2)
    n_iter <- n_iter + 1
  }
  if (n_iter == max_iter) {
    stop("Didn't converge")
  }
  return(residual)
}
If you try to run this code, you will see that it doesn't converge. I guess there is a mistake in your computation, especially in z/1 + z?
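For what it's worth, z/1 + z is parsed as (z/1) + z, i.e. simply 2*z; the logistic form z/(1 + z), which the asker's follow-up solution further down uses, is probably what was intended. A tiny illustration:
z <- exp(1.2)
z / 1 + z    # parsed as (z / 1) + z, i.e. 2 * z
z / (1 + z)  # logistic transform exp(d) / (1 + exp(d))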
The outer function is the tool you are looking for.
Compare these two functions, which only generate the result matrix:
x <- rnorm(100, 0, 3)
y <- rnorm(100, 0, 3)
calculation <- function(x, y)
{
  result <- matrix(nrow = length(x), ncol = length(y))
  for (i in seq_along(x))
  {
    for (j in seq_along(y))
    {
      z <- exp(x[i] - y[j])
      result[i, j] <- (z/1 + z)
    }
  }
  result
}
calculation2 <- function(x, y) {
  result <- outer(x, y, function(x, y) { z <- exp(x - y); z / 1 + z })
  result
}
library(microbenchmark)
microbenchmark(
calculation(x, y),
calculation2(x, y)
)
Unit: microseconds
               expr     min       lq      mean   median        uq      max neval
  calculation(x, y) 1862.40 1868.119 1941.5523 1871.490 1876.1825 8375.666   100
 calculation2(x, y)  466.26  469.192  515.3696  471.392  480.9225 4481.371   100
That discrepancy in time seems to grow as the length of the vectors increases.
Note that this addresses the speed of your double for loop, but there seem to be other issues in your function. It isn't clear to me what you are trying to do, or why you are calling calculation from within itself. As you have it written, there are no changes to x and y before it gets to calling itself again, so it would be stuck in a loop forever, if it worked at all (it doesn't on my machine).
@Benjamin @Emmanuel-Lin Thanks for the solutions :) I was able to solve the issue with your inputs. Please find the sample data set and code below. The solution converges when sq_sum_residual becomes less than 0.01, and it is more than 12x faster than my code with double for loops. Sorry for the confusion created by the sample data and the new_x, new_y calculation provided in the question.
The input is a dichotomous 9x10 matrix:
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA  1  1  1  1  1  1  1  0   1
2  1  1  1  1  1  1  1  0  1   0
3  1  1  1  1  1  1  0  1  0   0
4  1  1  1  1  1  1  0  1  0   0
5  1  1  1  1  1  1  0  1  0   0
6  1  1  1  1  1  0  1  0  0   0
7  1  1  1  1  0  1  0  0  0   0
8  1  0  1  0  1  0  0  0  0   0
9  0  1  0  1  0  0  0  0  0   0
x <- c(2.0794415, 1.3862944, 0.8472979, 0.8472979, 0.8472979, 0.4054651,
       0.0000000, -0.8472979, -1.3862944)
y <- c(-1.4404130, -1.5739444, -1.5739444, -1.5739444, -0.7472659, -0.1876501,
       1.1986443, 0.7286407, 2.5849387, 2.5849387)
result <- matrix(nrow = length(x), ncol = length(y))
calculation <- function(x, y)
{
  result <- outer(x, y, function(x, y) { z <- exp(x - y); z/(1 + z) })
  result[!is.finite(result)] <- NA
  variance_result <- result * (1 - result)
  row_var <- (-1) * rowSums(variance_result, na.rm = T)
  col_var <- (-1) * colSums(variance_result, na.rm = T)
  residual <- input - result  # input: the dichotomous matrix shown above
  row_residual <- rowSums(residual, na.rm = T)  # (not to be multiplied by -1)
  col_residual <- (-1) * colSums(residual, na.rm = T)
  new_x <- x - (row_residual / row_var)
  new_x[!is.finite(new_x)] <- NA
  new_x <- as.array(new_x)
  new_y <- y - (col_residual / col_var)
  new_y[!is.finite(new_y)] <- NA
  avg_new_y <- mean(new_y, na.rm = T)
  new_y <- new_y - avg_new_y
  new_y <- as.array(new_y)
  sq_sum_residual <- round(sum(row_residual^2), 5)
  if (sq_sum_residual >= .01) {
    calculation(new_x, new_y)
  } else (return(residual))
}
calculation(x, y)

round but .5 should be floored

From R help function: Note that for rounding off a 5, the IEC 60559 standard is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2.
> round(0.5)
[1] 0
> round(1.5)
[1] 2
> round(2.5)
[1] 2
> round(3.5)
[1] 4
> round(4.5)
[1] 4
But I need all values ending in .5 to be rounded down. All other values should be rounded as they are by the round() function.
Example:
round(3.5) = 3
round(8.6) = 9
round(8.1) = 8
round(4.5) = 4
Is there a fast way to do it?
Per Dietrich Epp's comment, you can use the ceiling() function with an offset to get a fast, vectorized, correct solution:
round_down <- function(x) ceiling(x - 0.5)
round_down(seq(-2, 3, by = 0.5))
## [1] -2 -2 -1 -1 0 0 1 1 2 2 3
I think this is faster and much simpler than many of the other solutions shown here.
As noted by Carl Witthoft, this adds much more bias to your data than simple rounding. Compare:
mean(round_down(seq(-2, 3, by = 0.5)))
## [1] 0.2727273
mean(round(seq(-2, 3, by = 0.5)))
## [1] 0.4545455
mean(seq(-2, 3, by = 0.5))
## [1] 0.5
What is the application for such a rounding procedure?
Check if the remainder of x %% 1 is equal to .5 and then floor or round the numbers:
x <- seq(1, 3, 0.1)
ifelse(x %% 1 == 0.5, floor(x), round(x))
[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
I'll join the circus too:
rndflr <- function(x) {
  sel <- vapply(x - floor(x), function(y) isTRUE(all.equal(y, 0.5)),
                FUN.VALUE = logical(1))
  x[sel] <- floor(x[sel])
  x[!sel] <- round(x[!sel])
  x
}
rndflr(c(3.5,8.6,8.1,4.5))
#[1] 3 9 8 4
The next function works by finding elements whose decimal part equals 0.5 and adding a small negative number to them before rounding, ensuring that they are rounded downwards. (It relies -- harmlessly, but in a slightly obfuscated manner -- on the fact that a Boolean vector in R is converted to a vector of 0s and 1s when multiplied by a numeric vector.)
f <- function(x) {
  round(x - .1 * (x %% 1 == .5))
}
x <- c(0.5,1,1.5,2,2.5,2.01,2.99)
f(x)
[1] 0 1 1 2 2 2 3
The function below (not golfed) is very simple: it checks whether the remaining decimals are .5 or less. You could easily make it more general and take 0.5 as an argument:
nice.round <- function(x, myLimit = 0.5) {
  bX <- x
  intX <- as.integer(x)
  decimals <- x %% intX
  if (is.na(decimals)) {
    decimals <- 0
  }
  if (decimals <= myLimit) {
    x <- floor(x)
  } else {
    x <- round(x)
  }
  if (bX > 0.5 & bX < 1) {
    x <- 1
  }
  return(x)
}
Tests
Currently, this function does not work properly with values between 0.5 and 1.0.
> nice.round(1.5)
[1] 1
> nice.round(1.6)
[1] 2
> nice.round(10000.624541)
[1] 10001
> nice.round(0.4)
[1] 0
> nice.round(0.6)
[1] 1

R convert fractions to integer percentages adding up to 100

I have computed a vector of the frequencies of different events, represented as fractions and sorted in descending order. I need to interface to a tool that requires positive integer percentages that sum to exactly 100. I would like to generate the percentages in a fashion that best represents the input distribution. That is, I would like the relationships (ratios) among the percentages to match those in the input fractions as closely as possible, despite any non-linearities that result from cutting off a long tail.
I have a function that generates these percentages, but I don't think it is optimal or elegant. In particular, I would like to do more of the work in numeric space before resorting to "stupid integer tricks".
Here is an example frequency vector:
fractionals <- 1 / (2 ^ c(2, 5:6, 8, rep(9,358)))
And here is my function:
# Convert a vector of fractions to integer percentages summing to 100
percentize <- function(fractionals) {
  # fractionals is sorted descending and adds up to 1
  # drop elements that wouldn't round up to 1% vs. the running total
  pctOfCum <- fractionals / cumsum(fractionals)
  fractionals <- fractionals[pctOfCum > 0.005]
  # calculate initial percentages
  percentages <- round((fractionals / sum(fractionals)) * 100)
  # if the sum of percentages exceeds 100, remove proportionally
  i <- 1
  while (sum(percentages) > 100) {
    excess <- sum(percentages) - 100
    if (i > length(percentages)) {
      i <- 1
    }
    partialExcess <- max(1, round((excess * percentages[i]) / 100))
    percentages[i] <- percentages[i] - min(partialExcess, percentages[i] - 1)
    i <- i + 1
  }
  # if the sum of percentages falls short of 100, add proportionally
  i <- 1
  while (sum(percentages) < 100) {
    shortage <- 100 - sum(percentages)
    if (i > length(percentages)) {
      i <- 1
    }
    partialShortage <- max(1, round((shortage * percentages[i]) / 100))
    percentages[i] <- percentages[i] + partialShortage
    i <- i + 1
  }
  return(percentages)
}
Any ideas?
How about this? It rescales the values so that they should add up to 100, but if rounding brings the total to 99, it adds 1 to the largest frequency.
fractionals <- 1 / (2 ^ c(2, 5:6, 8, rep(9, 358)))
pctOfCum <- fractionals / cumsum(fractionals)
fractionals <- fractionals[pctOfCum > 0.005]
bunnies <- as.integer(fractionals / sum(fractionals) * 100) + 1
bunnies[bunnies > 1] <- round(bunnies[bunnies > 1] *
  (100 - sum(bunnies[bunnies == 1])) / sum(bunnies[bunnies > 1]))
if (sum(bunnies) < 100) bunnies[1] <- bunnies[1] + 1
> bunnies
[1] 45 6 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
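A quick sanity check on the result (a sketch; it assumes the bunnies vector from the code above):
length(bunnies)  # number of categories kept after dropping the long tail
sum(bunnies)     # the rescaling plus the +1 fix aims for exactly 100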

Frequency table comparison using R

I have two frequency tables created using R's table() function:
freq1 <- table(unlist(strsplit(topic_list1, split=";")))
freq2 <- table(unlist(strsplit(topic_list2, split=";")))
topic_list1 and topic_list2 are strings that contain textual representations of topics separated by ;.
I want a way to compare the two frequencies, graphically if possible.
So if the two lists contain the same topic with different frequencies, I would like to be able to see it. The same goes for topics present in one frequency table, but not in the other.
There's probably a more elegant way to do this, but this ought to work:
# here I'm generating some example data
set.seed(5)
topic_list1 <- paste(sample(letters, 20, replace = TRUE), collapse = ";")
topic_list2 <- paste(sample(letters, 15, replace = TRUE), collapse = ";")
# I don't make the tables right away
tl1 <- unlist(strsplit(topic_list1, split=";"))
tl2 <- unlist(strsplit(topic_list2, split=";"))
big_list <- unique(c(tl1, tl2))
# this computes your frequencies
lbl <- length(big_list)
tMat1 <- matrix(rep(tl1, lbl), byrow=T, nrow=lbl)
tMat2 <- matrix(rep(tl2, lbl), byrow=T, nrow=lbl)
tMat1 <- cbind(big_list, tMat1)
tMat2 <- cbind(big_list, tMat2)
counts1 <- apply(tMat1, 1, function(x){sum(x[1]==x[2:length(x)])})
counts2 <- apply(tMat2, 1, function(x){sum(x[1]==x[2:length(x)])})
total_freqs <- rbind(counts1, counts2, counts1-counts2)
# this makes it nice looking & user friendly
colnames(total_freqs) <- big_list
rownames(total_freqs) <- c("topics1", "topics2", "difference")
total_freqs <- total_freqs[ ,order(total_freqs[3,])]
total_freqs
             d  l  a  z  b  f  s  y  m  r  x  h  n  i  g  k  c  v  o
topics1      0  0  0  0  0  2  1  1  1  1  2  2  1  1  1  1  2  2  2
topics2      2  2  2  1  1  2  1  1  1  0  1  1  0  0  0  0  0  0  0
difference  -2 -2 -2 -1 -1  0  0  0  0  1  1  1  1  1  1  1  2  2  2
From there you could just use the straight numbers, or visualize them however you want (e.g., dotplots). Here's a simple dotplot:
windows()
dotchart(t(total_freqs)[,3], main="Frequencies of topics1 - topics2")
abline(v=0)
You can simply barplot them (with the beside=T argument), which gives you a way to visually compare the counts per level.
below is an example:
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, col=c("darkblue","red"), legend=rownames(counts), beside=T)
