Optimization in R using constraints

I have some problems setting up an optimization process using R. My data set is as follows:
set.seed(123)
library(lpSolve)
num_data <- 1000
bal_max <- .2/100
ind_max <- 10.5/100
data <- data.frame(id = 1:num_data,
                   balance = pmax(0, runif(num_data, 0, 1000)),
                   industry = rep(1:10, num_data/10))
data$risk <- pmax(0, data$balance + rnorm(num_data, 100, 10))
As you can see, there are 1000 ids and 10 different industries. The objective is to maximize the sum of the column "risk" while at the same time making sure that the fraction of each individual loan and of each industry in the total balance does not exceed 0.2% and 10.5%, respectively.
In the current data set these conditions are not met:
max(data$balance) / sum(data$balance)
#[1] 0.002009751
industry <- aggregate(balance ~ industry, FUN=sum,data=data)
max(industry$balance) / sum(industry$balance)
#[1] 0.1093997
Thus, before we can maximize the column risk, these two conditions need to be met. The remainder of my code is as follows:
# set up linear prog problem
num_x <- nrow(data)
num_ind <- length(unique(data$industry))
objective.in <- data$risk
# define quantity to be maximized
# construct right-hand-side of constraint vector
# - sum of balances = 1
# - each balance <= bal_max
# - sum of balances for each industry <= ind_max
# - lp solver function imposes constraint that each balance >= 0
const.rhs <- c( 1, rep(bal_max, num_x), rep(ind_max, num_ind))
# construct constraint matrix for same constraints
mat_ind <- matrix(0,nrow=num_ind, ncol=num_x)
for( i in 1:num_ind) mat_ind[i,which(data$industry == i)] <- 1
const.mat <- rbind( matrix(1, nrow=1,ncol=num_x), diag(num_x), mat_ind )
# define directions for each constraint equation
const.dir <- c("=", rep("<=",num_x), rep("<=", num_ind))
# find balances for max risk
#
max_risk <- lp(direction="max", objective.in=objective.in, const.mat=const.mat,
               const.dir=const.dir, const.rhs=const.rhs)
max_risk
# add data balances with optimum solution
data$balance <- max_risk$solution
# each balance should be smaller than bal_max
max(data$balance)
# industry should be smaller than 10.5% each
industry <- aggregate(balance ~ industry, FUN=sum,data=data)
industry
As you can see, each industry and each individual id no longer exceeds its boundary (10.5% and 0.2%). The problem is that this code simply fills in either 0.2% or 0% for each loan id (so that the fractions sum to 1). However, the absolute value of a balance must not increase either, and in this example the original balances often do increase (they are bumped up to 0.2%).
In short, I want to optimize the column "risk" where each individual id is capped at 0.2% of the total balance (after the optimization process) and each industry is capped at 10.5% of the total balance (after the optimization process). The sum of all fractions should be one, and the absolute value of a balance cannot increase.
The idea is to decrease the balances such that all conditions are met and "risk" is optimized.
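One possible linear reformulation (a sketch under assumptions, not a verified solution): keep the new absolute balances b_i as the decision variables with 0 <= b_i <= original balance_i, and write the caps relative to the new total, so that b_i - bal_max*sum(b) <= 0 and (industry sum of b) - ind_max*sum(b) <= 0; the fractions b_i/sum(b) then sum to one by construction. The objective weights below (risk per unit of original balance retained) are an assumption about what "maximize risk" should mean here. The snippet reuses num_x, num_ind and mat_ind from above and must be run while data$balance still holds the original balances, i.e. before the overwrite above.
library(lpSolve)
orig_bal <- data$balance                                   # original balances
obj      <- data$risk / orig_bal                           # assumed objective: risk retained per unit of balance kept
mat_loan <- diag(num_x) - matrix(bal_max, num_x, num_x)    # b_i - bal_max * sum(b) <= 0
mat_ind2 <- mat_ind - matrix(ind_max, num_ind, num_x)      # industry sum - ind_max * sum(b) <= 0
mat_ub   <- diag(num_x)                                    # b_i <= original balance_i
res <- lp(direction = "max", objective.in = obj,
          const.mat = rbind(mat_loan, mat_ind2, mat_ub),
          const.dir = rep("<=", num_x + num_ind + num_x),
          const.rhs = c(rep(0, num_x + num_ind), orig_bal))
new_bal <- res$solution
max(new_bal) / sum(new_bal)                                # should be <= bal_max
max(tapply(new_bal, data$industry, sum)) / sum(new_bal)    # should be <= ind_max
all(new_bal <= orig_bal + 1e-9)                            # balances never increase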

Related

Is there a way in R to find a maximum value during a three point estimate

I am using the R programming language. Suppose I have the following three-point estimate data: Data
Here, Task 1 & Task 2 are done in parallel, whereas Task 3 and Task 4 are done in series, with Task 4 dependent on the completion of Task 3. So the minimum time for Task 1 & Task 2 is '10', the most likely is '20' and the maximum is '40', which is then added to Task 3 & Task 4 to give the total time.
For the three-point cost estimation, the min, most likely and max costs are simply added together and a simulation (1000, 10000, ... trials) is run. But in the case of time, the general rule is: times for tasks in series are added, while the time for tasks in parallel equals the time taken by the longest task.
How can the time estimation be executed in R, given that we are adding up rows for multiple simulations in one go?
code:
inv_triangle_cdf <- function(P, vmin, vml, vmax){
  Pvml <- (vml-vmin)/(vmax-vmin)
  return(ifelse(P < Pvml,
                vmin + sqrt(P*(vml-vmin)*(vmax-vmin)),
                vmax - sqrt((1-P)*(vmax-vml)*(vmax-vmin))))
}
#no of simulation trials
n=1000
#read in cost data
task_costs <- read.csv(file="task_costs.csv", stringsAsFactors = F)
str(task_costs)
#set seed for reproducibility
set.seed(42)
#create data frame with rows = number of trials and cols = number of tasks
csim <- as.data.frame(matrix(nrow=n,ncol=nrow(task_costs)))
# for each task
for (i in 1:nrow(task_costs)){
  #set task costs
  vmin <- task_costs$cmin[i]
  vml <- task_costs$cml[i]
  vmax <- task_costs$cmax[i]
  #generate n random numbers (one per trial)
  psim <- runif(n)
  #simulate n instances of task
  csim[,i] <- inv_triangle_cdf(psim,vmin,vml,vmax)
}
#sum costs for each trial
ctot <- csim[,1] + csim[,2] + csim[,3] + csim[,4] #costs add
ctot
How can I update this in order to accommodate time duration from the data given above?
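A sketch of how the rule above could be applied to simulated task times, assuming a hypothetical task_durations.csv with columns tmin, tml and tmax (one row per task, in the order Task 1 to Task 4) and reusing n and inv_triangle_cdf from the cost code:
#read in duration data (hypothetical file and column names)
task_times <- read.csv(file="task_durations.csv", stringsAsFactors = FALSE)
tsim <- as.data.frame(matrix(nrow=n, ncol=nrow(task_times)))
for (i in 1:nrow(task_times)){
  psim <- runif(n)
  tsim[,i] <- inv_triangle_cdf(psim, task_times$tmin[i], task_times$tml[i], task_times$tmax[i])
}
#Task 1 & Task 2 run in parallel: take the longer of the two per trial,
#then add Task 3 and Task 4, which run in series
ttot <- pmax(tsim[,1], tsim[,2]) + tsim[,3] + tsim[,4]
summary(ttot)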

Optimising assignments of individuals to groups (with defined maximum capacity) according to individuals' preferences

I am attempting to populate a set of forests with individuals from various species. Each forest has a defined capacity which it cannot exceed (given by Forest Area * Organism Density). Each species has a set population size, which is a fraction of the total population, itself determined by the sum of all forest areas * organism density.
The species also have defined preferences in regards to which forests they are assigned to, according to a relationship with a characteristic that varies between the forests, say rainfall. This results in a matrix of probabilities of a given species individual being assigned to a given forest. In order to give an equal chance for each species to be allocated to its preferred forest, I am assigning one individual at a time, repeatedly iterating through all the species in order, until both the non-assigned species populations have been exhausted and all the forests are at their maximum capacity.
Ideally the proportion of a species' population in a forest will be as close as possible to the probability of that species being assigned to that forest. This will involve a compromise between all species to minimise the total error between the population proportions and the species specific probabilities of assignment. (Thanks to commenters for making this a clearer problem)
At the moment I am doing this with nested for, if/else and while loops. When a forest reaches its capacity it is removed from the selection process, and when a species' population has been fully assigned, that species is skipped. The species population sizes are stored in one data.frame and the forest capacities in another, and both are adjusted as individuals are assigned.
These criteria have left me struggling to see an alternative to the loops (reproducible example below). However, it is very slow, as I often have total population sizes in the hundreds of millions. I feel there must be a much neater and faster alternative, perhaps in the way I structure the input data or in the way I give each species an equal opportunity (i.e. a way that doesn't have to be sequential), but I cannot figure one out. All help is greatly appreciated.
set.seed(999)
#Generate data
nSpecies <- 50 #Number of species
max_area <- 10000 #Maximum Area of a forest
nForests <- 20 #Number of different forests
areas <- round(rbeta(nForests, 1, 2) * max_area) #Generate random forest areas
total_area <- sum(areas) #Find total area of all forests
density <- 10 #Set organism density
total_population <- total_area * density #Find total population size across all forests
pop_structure <- table(sample(1:nSpecies, total_population, replace = T)) #Generate species populations
forests <- data.frame(Name = 1:nForests,
                      Capacity = (areas * density), #Find max population size of each forest
                      Rainfall = sample(0:10000, nForests, replace = T)) #Generate forest characteristic variable (e.g. rainfall)
species <- data.frame(Species = 1:nSpecies,
                      Individuals = as.numeric(pop_structure),
                      Rain_Response = rnorm(nSpecies, 0, 2)) #Generate species rainfall response
#Generate probabilities of assignment to each forest for each species
assignment_probs <- matrix(NA, nrow = nSpecies, ncol = nForests)
for(i in 1:nSpecies){
  for(x in 1:nForests){
    #Probability of assignment to forest = exp(species rain response * log(Rainfall in forest))
    assignment_probs[i,x] <- exp(species$Rain_Response[i] * log(forests$Rainfall[x]))
  }
  #Scale to sum to 1
  assignment_probs[i,] <- (assignment_probs[i,] / sum(assignment_probs[i,]))
}
#Allocate species individuals to a forest
forest_comms <- matrix(0, nrow = nForests, ncol = nSpecies) #Empty community matrix
possible_forests <- 1:nForests #Vector to remove forests from selection without affecting other data
forests$Individuals <- forests$Capacity #Remaining vacancies in each forest, decremented as individuals are assigned
done <- FALSE #Used to exit loop when finished
while(sum(species$Individuals) > 0){ #While individuals in the species pool remain to be assigned...
  for(sp in 1:nSpecies){ #Repeatedly assign one individual from each species until all done
    if(species$Individuals[sp] > 0){ #If species individuals remain to be assigned, proceed. Else, skip
      vacancies <- 0 #Set vacancies to 0 to enter next loop
      while(vacancies == 0){ #If there are 0 vacancies in forest selected in next section, retry assignment
        forest <- sample(possible_forests, 1, prob = assignment_probs[sp, possible_forests]) #Randomly select forest according to generated assignment probabilities
        vacancies <- forests$Individuals[forest] #Find no. of individual vacancies yet to be filled in the forest
        if(vacancies > 0){ #If vacancies available in forest...
          forest_comms[forest, sp] <- (forest_comms[forest, sp] + 1) #Assign an individual to the forest
          species$Individuals[sp] <- (species$Individuals[sp] - 1) #Decrease species count by 1
          forests$Individuals[forest] <- (forests$Individuals[forest] - 1) #Decrease remaining vacancies in forest by 1
        } else { #If forest is already full...
          possible_forests <- possible_forests[!possible_forests %in% forest] #Remove forest from selection process
        }
        if(length(possible_forests) == 1){ #If only one forest has vacancies...
          for(i in 1:nrow(species)){ #Assign all remaining individuals to that forest
            forest_comms[possible_forests, i] <- (forest_comms[possible_forests, i] + species$Individuals[i])
          }
          species$Individuals <- 0 #Set population to 0 (all individuals have been assigned)
          done <- TRUE #Convert 'done' to true to end loop
          break
        }
      }
    }
  }
  cat('\n', sum(species$Individuals))
  if(done){break}
}
sum(forest_comms) == total_population
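For what it's worth, one non-sequential alternative is to treat the allocation as a transportation problem and solve it in a single step with lpSolve::lp.transport: minimise a total assignment "cost" (here -log of the assignment probability, one possible choice) subject to the species population sizes as row totals and the forest capacities as column totals. This is a sketch of that idea rather than a drop-in replacement for the stochastic loop above; it is deterministic and optimises a slightly different criterion, and it uses the original counts in pop_structure because species$Individuals is zeroed by the loop.
library(lpSolve)
cost <- -log(pmax(assignment_probs, 1e-300))     #cost of assigning one individual of species i to forest x
trans <- lp.transport(cost.mat = cost, direction = "min",
                      row.signs = rep("=", nSpecies),  row.rhs = as.numeric(pop_structure),
                      col.signs = rep("<=", nForests), col.rhs = forests$Capacity)
alloc <- trans$solution                          #species in rows, forests in columns
sum(alloc) == total_population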

lpsolve with constraints in r

I would like to use R to solve an optimization problem with the lpSolve package, which can handle problems similar to the Solver add-in in Excel. Below is a simple case where I would like to maximize the NPV value using lpSolve.
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8),
                     Revenue = c(109, 111, 122, 139, 156, 140, 137, 167)),
                row.names = c(NA, 8L), class = "data.frame")
dcf <- function(x, r, t0=FALSE){
  # calculates discounted cash flows (DCF) given cash flow and discount rate
  #
  # x - cash flows vector
  # r - vector of discount rates, in decimals. Single values will be recycled
  # t0 - cash flow starts in year 0; if TRUE, the discount rate in the first period is set to zero. Default is FALSE.
  if(length(r)==1){
    r <- rep(r, length(x))
    if(t0==TRUE){r[1]<-0}
  }
  x/cumprod(1+r)
}
npv <- function(x, r, t0=FALSE){
  # calculates net present value (NPV) given cash flow and discount rate
  #
  # x - cash flows vector
  # r - discount rate, in decimals
  # t0 - cash flow starts in year 0, default is FALSE
  sum(dcf(x, r, t0))
}
npv(df$Revenue,.2)
#Non-optimized npv yields a value of 492.
#How can I use lpSolve to optimize my table? Said another way, how can I rearrange the table to maximize npv using lpSolve?
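For the unpenalized version, one way to pose this for lpSolve is as an assignment problem: assign each project to a period so that the sum of discounted revenues is maximal. This is only a sketch; with no penalty the optimum is simply the revenues sorted in decreasing order, so it mainly illustrates the lpSolve mechanics.
library(lpSolve)
r <- 0.2
periods <- 1:nrow(df)
#value of scheduling project i in period t (discounting matches npv() with t0=FALSE)
value_mat <- outer(df$Revenue, periods, function(rev, t) rev / (1 + r)^t)
assign_sol <- lp.assign(value_mat, direction = "max")
order_idx <- apply(assign_sol$solution, 2, which.max)   #project chosen for each period
npv(df$Revenue[order_idx], r)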
A more complicated problem involves a penalizing column with the following rule:
Ids represent projects.
If a project is not in the starting period (row 1), check whether the previous id is within a delta of 2 (the absolute value of the difference between the current row's id and the previous row's id). If so, penalize that row's Revenue by 20%. I think this problem still involves solving for the correct order. How can I optimize this function?
#Randomize the order to give a base npv. Now I need to optimize the order to find the max value
library(dplyr)
df <- df %>% mutate(random_sort = sample(nrow(df)))
x = function(i){
  df_fcn <- i
  df_fcn <- df_fcn %>% mutate(Penalty = if_else(abs(random_sort - lag(random_sort)) > 2, 1, .8)) %>%
    mutate(Penalty = ifelse(is.na(Penalty), 1, Penalty))
  df_fcn <- df_fcn %>% mutate(Revenue_Penalized = Revenue * Penalty)
  npv(df_fcn$Revenue_Penalized, .2)
}
Best I've come up with is to randomly rearrange the data and find the maximum value.
schedule_function = function(i){
  i <- i %>% mutate(random_sort = sample(random_sort))
  df_fcn <- i %>% mutate(Penalty = if_else(abs(random_sort - lag(random_sort)) > 2, 1, .8)) %>%
    mutate(Penalty = ifelse(is.na(Penalty), 1, Penalty))
  df_fcn <- df_fcn %>% mutate(Revenue_Penalized = Revenue * Penalty)
  final_df <- print(df_fcn)
  npv(df_fcn$Revenue_Penalized, .2)
}
n <- 1:10000
MAX = -Inf ## initialize maximum
for (i in 1:length(n)) {
  x <- schedule_function(df)
  if (x > MAX) MAX <- x
}
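Since there are only 8 projects there are just 8! = 40,320 orderings, so the penalized problem can also be solved exactly by enumeration rather than by random search. A sketch, assuming the gtools package and one reading of the penalty rule (a 20% penalty whenever consecutive project ids differ by 2 or less):
library(gtools)
perms <- permutations(nrow(df), nrow(df))       #40320 x 8 matrix of candidate orderings
penalized_npv <- function(ord) {
  penalty <- c(1, ifelse(abs(diff(df$id[ord])) > 2, 1, 0.8))
  npv(df$Revenue[ord] * penalty, 0.2)
}
vals <- apply(perms, 1, penalized_npv)
perms[which.max(vals), ]                        #best ordering
max(vals)                                       #its NPV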

Optimize algorithm that calculates expected allele richness at different sub-sampling levels

I have implemented an algorithm for calculating allele richness based on formulas 1 - 3 presented in:
Counting alleles with rarefaction: Private alleles and hierarchical
sampling designs - Steven T. Kalinowski, Link to PDF
and the same formula from:
Diverging Trends Between Heterozygosity and Allelic Richness During Postglacial Colonization in the European Beech - B. Comps, Link to paper
and need help with optimizing it fully.
Allele richness is a measure of genetic diversity but is impacted by sample size. This formula allows us to estimate the expected allele richness at smaller sample sizes without resampling. I use it to estimate allele richness at all possible sub-sampling sizes allowing me to draw rarefaction curves.
As a first step in the algorithm, I calculate the probability of not observing an allele at each count level at each sub-sampling level, to make a look-up table for calculating the actual probabilities. I calculate as few values as possible (I think), taking advantage of the fact that many values are just 1 minus previously calculated values. This is still the slowest part, but I think I got it to scale at roughly n*log(n). I mostly want to know whether there is a more efficient way to create the vectors and join them into a table (data frame).
The second step is to use the lookup table to calculate the expected allele richness at each sub-sampling level. I have changed this part to a much faster implementation and updated the code below.
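For reference, the quantity being computed can be written directly with choose(): the expected number of distinct alleles at a marker when g of the n observed allele copies are drawn is the sum over alleles of 1 - choose(n - n_i, g) / choose(n, g), where n_i is the observed count of allele i. A naive version of that formula (equivalent to the cumulative products built from the look-up table below, just slower for large n) might look like:
expected_richness <- function(allele_counts, g) {
  n <- sum(allele_counts)
  sum(1 - choose(n - allele_counts, g) / choose(n, g))
}
#e.g. one marker with two alleles counted 30 and 10 times, at every sub-sampling level
sapply(1:40, expected_richness, allele_counts = c(30, 10))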
Here is the code as it currently is. You can find it on my GitHub, DiDeoxy/PGDA: calc_allele_richness.R.
To run the code you can install the package with: devtools::install_github("https://github.com/DiDeoxy/PGDA") and use it with library(PGDA)
#' Calculate mean allele richness at all sampling levels
#'
#' Calculates the mean allele richness across all markers for a sample at all
#' sampling levels. Missing data is not allowed. Based on the formula presented
#' in https://www.genetics.org/content/157/1/389
#'
#' @param pop a data frame with individuals in columns and markers in rows,
#'   there must be at least two individuals
#' @param allele_coding the coding used for indicating the different alleles
#' @param num_cores the number of cores to use, must be 1 on Windows, can use
#'   detectCores() of the parallel package on Linux
#'
#' @importFrom magrittr %>%
#' @importFrom parallel mclapply
#' @importFrom scrime rowTables
#'
#' @return a table of expected allele richness for each marker at each
#'   sub-sampling level with markers in rows and sampling levels in columns
#'
#' @export
allele_richness <- function (pop, allele_coding = 1:2, num_cores = 1) {
  # the total number of alleles observed at each marker
  n <- ncol(pop)
  # probs contains the probability of not observing allele i at each
  # sub-sampling level (n - k) for each possible count of allele i, with
  # allele count in rows and k in columns
  #
  # for each subsampling level
  probs <- mclapply(0:(n - 1), function (k) {
    # a vector for containing the probs of not observing allele i at each count
    # level at each subsampling (n - k) level
    inter <- rep(0, n)
    # if n - k <= 1 then the prob of not observing allele i is 0 at all count
    # levels; the smaller k is compared to n, the more levels will have probs of
    # not observing allele i greater than 0
    if (n - k > 1) {
      # probs of not observing allele i are linearly decreasing, therefore the top
      # half and bottom half are 1 - mirrors; we can use this fact to skip a lot
      # of computation
      temp <- lapply(1:floor((n - k) / 2), function (n_i) {
        (n - n_i - k) / (n - k)
      }) %>% do.call(c, .)
      # concatenate the calculated probs with their 1 - mirror; if n - k is odd
      # the middle value will equal 0.5, which we do not need to mirror
      temp <- c(temp, rev(1 - temp[which(temp != 0.5)]))
      inter[1:length(temp)] <- temp
      inter
    } else {
      inter
    }
  }, mc.cores = num_cores) %>% do.call(cbind, .)
  # creates a data frame containing the counts of each allele for each marker
  marker_allele_counts <- rowTables(pop, allele_coding)
  # we calculate the mean allele richness across all markers at each subsampling
  # level (n - k) by calculating the product of not observing each allele at
  # each sub-sampling level, then taking the sum of these for each marker and
  # then taking the mean across all markers
  #
  # for each marker
  mclapply(1:nrow(marker_allele_counts), function (marker) {
    (1 - lapply(1:length(marker_allele_counts[marker, ]), function (allele) {
      # for each allele, calc the probability of not observing the allele at
      # each sub-sampling level
      cumprod(probs[marker_allele_counts[[marker, allele]], ])
      # rbind the probabilities for each allele at each sub-sampling level,
      # subtract from one to turn them into probabilities of observing the allele,
      # and sum the alleles together
    }) %>% do.call(rbind, .)) %>% colSums()
    # return a table with markers in rows and sub-sampling levels in columns
  }, mc.cores = num_cores) %>% do.call(rbind, .)
}
Thanks for any help you can give, it's the first time I've programmed something like this.
Cheers,
DiDeoxy.

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
  data.frame(
    id = 1:nPerson,
    BornYear = floor(runif(nPerson, min=BornMin, max=BornMax+1)),
    BornMonth = ceiling(runif(nPerson, min=0, max=12))
  )
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 people, but hours for thousands). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > -error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > -error. I use xb = xbsmall + 4.47*Outcome1LastMonth in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know in order to solve the computation-speed problem).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for(t in (AgeMonthMin+1):AgeMonthMax){
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
    # xb = xbsmall + 4.47 if Outcome was 1 last month
    # Otherwise, xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > -error
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
  }
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computation time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::gather_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
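The recursion can also be vectorized outright within each person: because xbsmall > -error implies xbsmall + 4.47 > -error, the outcome inside any run of months where the carry-over condition (xbsmall + 4.47 > -error) holds is simply 1 from the first month in that run where the unconditional condition (xbsmall > -error) holds. A sketch (it assumes each person's rows are sorted by AgeMonths and only fills Outcome, not xb or Outcome1LastMonth):
vectorize_outcome <- function(x) {
  a <- as.integer(x$xbsmall > -x$error)          # outcome if last month's Outcome was 0
  b <- as.integer(x$xbsmall + 4.47 > -x$error)   # outcome if last month's Outcome was 1
  a[1] <- 0; b[1] <- 0                           # first month is left at 0, as in the loop
  run_id <- cumsum(b == 0)                       # break the months into runs of consecutive b == 1
  x$Outcome <- as.integer(b & ave(a, run_id, FUN = cummax))
  x
}
out2 <- do.call(rbind, lapply(split(panel, panel$id), vectorize_outcome))
all.equal(out$Outcome, out2$Outcome)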
