Optimising assignments of individuals to groups (with defined maximum capacity) according to individuals' preferences - r

I am attempting to populate a set of forests with individuals from various species. Each forest has a defined capacity which it cannot exceed (given by Forest Area * Organism Density). Each species has a set population size, which is a fraction of the total population, itself determined by the sum of all forest areas * organism density.
The species also have defined preferences regarding which forests they are assigned to, according to a relationship with a characteristic that varies between the forests, say rainfall. This results in a matrix of probabilities of a given species' individual being assigned to a given forest. To give each species an equal chance of being allocated to its preferred forests, I assign one individual at a time, repeatedly iterating through all the species in order, until both the non-assigned species populations have been exhausted and all the forests are at their maximum capacity.
Ideally the proportion of a species' population in a forest will be as close as possible to the probability of that species being assigned to that forest. This will involve a compromise between all species to minimise the total error between the population proportions and the species specific probabilities of assignment. (Thanks to commenters for making this a clearer problem)
At the moment I am doing this with a combination of for, if/else, and while loops. When a forest reaches its capacity it is removed from the selection process, and when a species' population has been fully assigned, that species is skipped. The species population sizes are stored in one data.frame and the forest capacities in another, and both are adjusted as individuals are assigned.
These criteria have left me struggling to see an alternative to the loops (reproducible example below). However, the loops are very slow, as I often have total population sizes in the hundreds of millions. I feel there must be a much neater and faster alternative, perhaps residing in the way I have structured the input data or in the way I give each species an equal opportunity (i.e. a way this doesn't have to be sequential), but I cannot figure one out. All help is greatly appreciated.
set.seed(999)
#Generate data
nSpecies <- 50 #Number of species
max_area <- 10000 #Maximum Area of a forest
nForests <- 20 #Number of different forests
areas <- round(rbeta(nForests, 1, 2) * max_area) #Generate random forest areas
total_area <- sum(areas) #Find total area of all forests
density <- 10 #Set organism density
total_population <- total_area * density #Find total population size across all forests
pop_structure <- table(sample(1:nSpecies, total_population, replace = T)) #Generate species populations
forests <- data.frame(Name = 1:nForests,
                      Capacity = (areas * density), #Find max population size of each forest
                      Rainfall = sample(0:10000, nForests, replace = T)) #Generate forest characteristic variable (e.g. rainfall)
species <- data.frame(Species = 1:nSpecies,
                      Individuals = as.numeric(pop_structure),
                      Rain_Response = rnorm(nSpecies, 0, 2)) #Generate species rainfall response
#Generate probabilities of assignment to each forest for each species
assignment_probs <- matrix(NA, nrow = nSpecies, ncol = nForests)
for(i in 1:nSpecies){
  for(x in 1:nForests){
    #Probability of assignment to forest = Exponent of species rain response * log(Rainfall in Forest)
    assignment_probs[i,x] <- exp(species$Rain_Response[i] * log(forests$Rainfall[x]))
  }
  #Scale to sum to 1
  assignment_probs[i,] <- (assignment_probs[i,] / sum(assignment_probs[i,]))
}
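#Note: the nested loop above can be vectorised (a suggested alternative, not part of the
#original question); outer() builds the same nSpecies x nForests matrix in one call
assignment_probs <- exp(outer(species$Rain_Response, log(forests$Rainfall)))
assignment_probs <- assignment_probs / rowSums(assignment_probs) #Scale each row to sum to 1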
#Allocate species individuals to a forest
forest_comms <- matrix(0, nrow = nForests, ncol = nSpecies) #Empty community matrix
possible_forests <- 1:nForests #Vector to remove forests from selection without affecting other data
done <- FALSE #Used to exit loop when finished
while(sum(species$Individuals) > 0){ #While individuals in the species pool remain to be assigned...
  for(sp in 1:nSpecies){ #Repeatedly assign one individual from each species until all done
    if(species$Individuals[sp] > 0){ #If species individuals remain to be assigned, proceed. Else, skip
      vacancies <- 0 #Set vacancies to 0 to enter next loop
      while(vacancies == 0){ #If there are 0 vacancies in the forest selected below, retry assignment
        forest <- sample(possible_forests, 1, prob = assignment_probs[sp, possible_forests]) #Randomly select a forest according to the generated assignment probabilities
        vacancies <- forests$Capacity[forest] #Find no. of vacancies yet to be filled in that forest (remaining capacity)
        if(vacancies > 0){ #If vacancies are available in the forest...
          forest_comms[forest, sp] <- (forest_comms[forest, sp] + 1) #Assign an individual to the forest
          species$Individuals[sp] <- (species$Individuals[sp] - 1) #Decrease species count by 1
          forests$Capacity[forest] <- (forests$Capacity[forest] - 1) #Decrease remaining vacancies in the forest by 1
        } else { #If the forest is already full...
          possible_forests <- possible_forests[!possible_forests %in% forest] #Remove the forest from the selection process
        }
        if(length(possible_forests) == 1){ #If only one forest has vacancies...
          for(i in 1:nrow(species)){ #Assign all remaining individuals to that forest
            forest_comms[possible_forests, i] <- (forest_comms[possible_forests, i] + species$Individuals[i])
          }
          species$Individuals <- 0 #Set populations to 0 (all individuals have been assigned)
          done <- TRUE #Set 'done' to TRUE to end the outer loop
          break
        }
      }
    }
  }
  cat('\n', sum(species$Individuals))
  if(done){break}
}
sum(forest_comms) == total_population
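One possible non-sequential alternative (a sketch, not the asker's method, and only an approximation of the stated objective of minimising total deviation from the assignment probabilities): treat the allocation as an integer transportation problem, where placing one individual of species i in forest j costs -log(assignment_probs[i, j]). lp.transport() from the lpSolve package then fills the forests up to capacity while favouring each species' preferred forests, with no need to assign individuals one at a time. It uses the original population sizes and capacities, because the loop above overwrites species$Individuals and forests$Capacity.
library(lpSolve)
cost <- -log(assignment_probs) #nSpecies x nForests cost matrix: cheap where assignment probability is high
sol <- lp.transport(cost,
                    direction = "min",
                    row.signs = rep("=", nSpecies), #every individual of every species must be placed
                    row.rhs = as.numeric(pop_structure), #original species population sizes
                    col.signs = rep("<=", nForests), #forests cannot exceed their capacity
                    col.rhs = areas * density) #original forest capacities
forest_comms_lp <- t(sol$solution) #nForests x nSpecies matrix, comparable to forest_comms
sum(forest_comms_lp) == total_population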

Related

Alternative to brute force estimation of parameter in an ecological time series model

I am modeling a hydrologic process (water levels [stage] in lakes, measured in mm) that can be described as:
S_t = S_(t-1) + Beta1 * X_t + Beta2 * S_(t-1)
where Beta1 is estimated from a different model and used as a constant here (X_t is the daily driver variable). Beta2 is the unknown and its value is expected to be between (-0.001, 0.001). The starting value of S doesn't matter so long as it is greater than 10 m (10000 mm). The model runs on a daily time step. I have observed stage from multiple different lakes and fit each lake independently.
Currently, I am brute-force identifying the parameter value by:
1. Creating a 100-value sequence of parameter values spanning (-0.001, 0.001).
2. Predicting stage using the above equation and estimating RMSE between modelled and observed data (there are significantly fewer observations than modelled data points).
3. Identifying the Beta2 with the lowest RMSE and selecting the Beta2 values on either side of it to create a new sequence of parameter values to search over.
Steps 2 and 3 are repeated until RMSE decreases by less than 0.01 or increases.
The code for the brute force approach I've been using is below along with the data associated with a single lake.
Is there an alternative approach to estimating the unknown parameter Beta2 given the model above and the fact that I only have observed data for a limited number of days?
Thanks!
library(tidyverse)
library(lubridate)
library(Metrics)
#The Data
dat <- read_csv("https://www.dropbox.com/s/skg8wfpu9274npb/driver_data.csv?dl=1")
observeddata <- read_csv("https://www.dropbox.com/s/bhh27g5rupoqps3/observeddata.csv?dl=1") %>% select(Date,Value)
#Setup initial values and vectors
S = matrix(nrow = nrow(dat),ncol = 1) #create an empty matrix for predicted values
S[1,1] = 10*1000 #set initial value (mm)
rmse.diff <- 10^100 #arbitrarily high starting value for the difference between min RMSE of successive
#parameter searches
b.levels <- seq(from = -0.001, to = 0.001, length.out = 100) #initial parameter sequence that should contain
#the value being estimated
n = 0 # counter
#Loop to bruteforce search for best parameter estimate
while(rmse.diff > 0.01) {
  rmse.vec = rep(NA, length(b.levels))
  for(t in 1:length(b.levels)){
    for(z in 2:nrow(S)){
      S[z,1] <- S[(z-1),1] + (1.071663*(dat$X[z])) + (b.levels[t]*(S[(z-1),1])) #-1.532236
    } #end of time series loop
    extrap_level <- data.frame(Date = dat$Date, level = S) # predicted lake levels
    #calculate an offset to center observed data on extrapolated data
    dat.offset = observeddata %>% left_join(extrap_level) %>%
      mutate(offset = level - Value) %>% drop_na()
    offset <- mean(dat.offset$offset)
    dat.compare <- observeddata %>% left_join(extrap_level) %>%
      mutate(Value = Value + offset) %>% drop_na()
    #calculate RMSE between observed and extrapolated values
    rmse.vec[t] <- rmse(actual = dat.compare$Value, predicted = dat.compare$level)
    #plot the data to watch how parameter choice influences fit while looping
    #plots have a hard time keeping up
    if(t == 1 | t == 50 | t == 100) {
      plot(extrap_level$Date, extrap_level$level, type = "l")
      lines(dat.compare$Date, dat.compare$Value, col = "red")
    }
  }
  #find the index of the minimum RMSE value (which.min avoids returning multiple indices on ties)
  min.rmse <- which.min(rmse.vec)
  if(n == 0) rmse.best <- rmse.vec[min.rmse] else rmse.best = c(rmse.best, rmse.vec[min.rmse])
  if(n >= 1) rmse.diff <- (rmse.best[n] - rmse.best[n+1])
  if(rmse.diff < 0) break
  best.b <- b.levels[min.rmse]
  #take the parameter values on either side of the best prior RMSE and use those as the new search range
  b.levels <- seq(from = (b.levels[min.rmse-1]), to = (b.levels[min.rmse+1]), length.out = 100)
  n = n + 1
}
rmse.best #vector of RMSE for each parameter search
best.b #Last identified best parameter value
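An alternative to the manual grid refinement (a sketch reusing the data and packages loaded above, not the original approach): wrap the RMSE calculation in a function of Beta2 and pass it to base R's one-dimensional optimiser, optimize(), which searches the interval directly and typically needs far fewer model evaluations than repeated 100-point grids. This assumes the RMSE curve is roughly unimodal over the interval, which the grid-search results suggest.
rmse_for_b <- function(b) {
  S <- numeric(nrow(dat))
  S[1] <- 10 * 1000 #initial stage (mm)
  for (z in 2:nrow(dat)) {
    S[z] <- S[z - 1] + 1.071663 * dat$X[z] + b * S[z - 1] #same model as above
  }
  extrap_level <- data.frame(Date = dat$Date, level = S)
  dat.compare <- observeddata %>% left_join(extrap_level, by = "Date") %>% drop_na()
  offset <- mean(dat.compare$level - dat.compare$Value) #centre observed data on the extrapolated data
  rmse(actual = dat.compare$Value + offset, predicted = dat.compare$level)
}
fit <- optimize(rmse_for_b, interval = c(-0.001, 0.001))
fit$minimum #estimated Beta2
fit$objective #RMSE at that estimate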

Sampling from a multivariate distribution including gender in R

I'm trying to simulate a wider population from a small one in R as follows:
idata <- subset(data, select = c(WT, AGE, HT, BFP, SEX))
M <- cor(idata)
mu <- sapply(idata, mean)
sd <- sapply(idata, sd)                    # base R's sd(); there is no stdev() function
sigma <- cor2cov(M, sd)                    # cor2cov() comes from a package such as lavaan or MBESS
simulation <- as.data.frame(mvrnorm(1000, mu, sigma))  # mvrnorm() from MASS
But the problem is that, for SEX, the code assumes a continuous distribution, while it has to be binary, and the effect of sex has to be either fully considered (SEX == 1) or not at all (SEX == 0). I'd appreciate any help in this regard.
Thanks
What you should do is consider that your data consists of two sub-populations, and then draw data from them, based on their proportions.
So, first estimate the proportions, pi_m and pi_f (= 1 - pi_m), which are the proportions of SEX == 1 and SEX == 0 respectively. This should be something like
pi_m = sum(idata$SEX == 1)/ nrow(idata)
Then estimate parameters for the two populations, mu_m, mu_f, sigma_m and sigma_f, which are the mean and covariance parameters of the two SEX sub-populations (now without the SEX variable).
Then draw a random number r <- runif(1); if it is less than or equal to pi_m, generate a sample from N(mu_m, sigma_m), else from N(mu_f, sigma_f).
You can do this step 1000 times to get 1000 samples from your distribution.
Of course, you can vectorise this by first generating 1000 samples from runif. For example
n_m <- sum(runif(1000) <= pi_m)
n_f <- 1000 - n_m
X_m <- rmvnorm(n_m, mu_m, sigma_m)
X_f <- rmvnorm(n_f, mu_f, sigma_f)
X <- rbind(X_m, X_f)
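A fuller sketch of the same idea (an assumption: it uses the column names from the question and rmvnorm() from the mvtnorm package, estimates the two sets of parameters from idata, and re-attaches the SEX column afterwards):
library(mvtnorm)
cont_vars <- c("WT", "AGE", "HT", "BFP")   # continuous variables only
males   <- idata[idata$SEX == 1, cont_vars]
females <- idata[idata$SEX == 0, cont_vars]
pi_m <- nrow(males) / nrow(idata)          # proportion of SEX == 1
n_m <- sum(runif(1000) <= pi_m)            # number of simulated males
n_f <- 1000 - n_m
X_m <- setNames(as.data.frame(rmvnorm(n_m, colMeans(males), cov(males))), cont_vars)
X_f <- setNames(as.data.frame(rmvnorm(n_f, colMeans(females), cov(females))), cont_vars)
simulation <- rbind(cbind(X_m, SEX = 1), cbind(X_f, SEX = 0))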

MCMC in R Modify Proposal

I've been working with MCMC for population genetics and I have some doubts.
I'm not experienced in statistics, and because of that I'm having difficulty.
I have code to run an MCMC with 1000 iterations. I start by creating a matrix of 0's (50 columns for 50 individuals and 1000 rows for 1000 iterations).
Then I create a random vector to replace the first row of the matrix. This vector has 1's and 2's, representing population 1 or population 2.
I also have genotype frequencies and the genotypes of the 50 individuals.
What I want is to, according to the genotype frequencies and genotypes, determine to what population an individual belongs.
Then, I'll keep changing the population assigned to a random individual and checking if the new value should be accepted.
niter <- 1000
z <- matrix(0, nrow = niter, ncol = ncol(targetinds))
z[1,] <- sample(1:2, size = ncol(z), replace = T)
lhood <- numeric(niter)
lhood[1] <- compute_lhood_K2(targetinds, z[1,], freqPops)
accepted <- 0
priorz <- c(1e-6, 0.999999)
for(i in 2:niter) {
  z[i,] <- z[i-1,]
  # propose a new vector z by selecting a random individual and proposing a new zi value
  selind <- sample(1:nind, size = 1)
  # proposal probability of selecting an individual at random
  proposal_ratio_ind <- log(1/nind) - log(1/nind)
  # propose a new index for the selected individual
  if(z[i,selind] == 1) {
    z[i,selind] <- 2
  } else {
    z[i,selind] <- 1
  }
  # proposal probability of changing the index of an individual is 1/2
  proposal_ratio_cluster <- log(1/2) - log(1/2)
  propratio <- proposal_ratio_ind + proposal_ratio_cluster
  # compute f(x_i|z_i*, p)
  # the probability of the selected individual given the two clusters
  probindcluster <- compute_lhood_ind_K2(targetinds[,selind], freqPops)
  # likelihood ratio f(x_i|z_i*,p)/f(x_i|z_i, p)
  lhoodratio <- probindcluster[z[i,selind]] - probindcluster[z[i-1,selind]]
  # prior ratio pi(z_i*)/pi(z_i)
  priorratio <- log(priorz[z[i,selind]]) - log(priorz[z[i-1,selind]])
  # accept the new value according to the MH ratio
  mh <- lhoodratio + propratio + priorratio
  # reject if the random value is larger than the MH ratio
  if(runif(1) > exp(mh)) {
    z[i,] <- z[i-1,] # keep the same z
    lhood[i] <- lhood[i-1] # keep the same likelihood
  } else { # if accepted
    lhood[i] <- lhood[i-1] + lhoodratio # update the likelihood
    accepted <- accepted + 1 # increase the number of accepted proposals
  }
}
I've been asked to change the proposal probability so that the newly proposed values are proportional to the likelihood, which supposedly turns this into a Gibbs sampling MCMC algorithm.
I don't know what to change in the code to do this. I also don't understand very well the concept of a proposal probability or how to choose the prior.
I'd be grateful if someone could clarify these doubts.
Your current proposal is done here:
# propose a new index for the selected individual
if(z[i,selind]==1) {
z[i,selind] <- 2
} else {
z[i,selind] <- 1
}
if the individual is assigned to cluster 1, then you propose to switch assignment deterministically by assigning them to cluster 2 (and vice versa).
You didn't show us what freqPops is, but if you want to propose according to freqPops then I believe the above code has to be replaced by
z[i,selind] <- sample(c(1,2),size=1,prob=freqPops)
(at least that is what I understand when you say you want to propose based on the likelihood - however, that statement of yours is unclear).
For this now to be a valid MCMC Gibbs sampling algorithm you also need to change the next line of code:
proposal_ratio_cluster <- log(freqPops[z[i-1,selind]])-log(freqPops[z[i,selind]])
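For reference, a sketch of what a full Gibbs update for the selected individual could look like (an assumption about the intent; it reuses compute_lhood_ind_K2() and priorz from the question and assumes that function returns log-likelihoods per cluster, as the original code suggests). Here z_i is drawn directly from its full conditional, proportional to prior times likelihood, so there is no accept/reject step:
# inside the for loop, replacing the propose/accept block
probindcluster <- compute_lhood_ind_K2(targetinds[, selind], freqPops) # log-likelihood of the individual under clusters 1 and 2
logpost <- log(priorz) + probindcluster # unnormalised log full conditional
post <- exp(logpost - max(logpost)) # subtract the max for numerical stability
z[i, selind] <- sample(1:2, size = 1, prob = post / sum(post)) # draw directly; the draw is always kept
lhood[i] <- lhood[i-1] + probindcluster[z[i, selind]] - probindcluster[z[i-1, selind]] # update the running log-likelihood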

How to predict cluster labeling using DBSCAN object and Gower distance matrix for new data in R

I'm having an issue with predicting cluster labels for a test dataset, based on a DBSCAN clustering model built on the training data.
I used a Gower distance matrix when creating the model:
> gowerdist_train <- daisy(analdata_train,
                           metric = "gower",
                           stand = FALSE,
                           type = list(asymm = c(5,6)))
Using this gowerdist matrix, the dbscan clustering model created was:
> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)
Then I try to use predict to label a test dataset using the above dbscan object:
> predict(sb, newdata = analdata_test, data = analdata_train)
But I receive the following error:
Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE,
...) : x has to be a numeric matrix
I can take a guess at where this error might be coming from: it is probably due to the absence of a Gower distance matrix for the test data.
My question is: should I create a separate Gower distance matrix for all data (analdata_train + analdata_test) and feed it into predict? How else would the algorithm know what the distances of the test data from the train data are, in order to label them?
In that case, would the newdata parameter be the new Gower distance matrix that contains ALL (train + test) data, and would the data parameter in predict be the training distance matrix, gowerdist_train?
What I am not quite sure about is how the predict algorithm would distinguish between the test and train data sets in the newly created gowerdist_all matrix.
The two matrices (the new Gower matrix for all data and gowerdist_train) would obviously not have the same dimensions. Also, it doesn't make sense to me to create a Gower distance matrix only for the test data, because the distances must be relative to the training data, not just within the test data itself.
Edit:
I tried using gower distance matrix for all data (train + test) as my new data and received an error when fed to predict:
> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
                         metric = "gower",
                         stand = FALSE,
                         type = list(asymm = c(5,6)))
> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)
ERROR: Error in 1:nrow(data) : argument of length 0 In addition:
Warning message: In rbind(data, newdata) : number of columns of
result is not a multiple of vector length (arg 1)
So, my suggested solution doesn't work.
I decided to write code that uses the kNN algorithm from the dbscan package to predict cluster labels using the Gower distance matrix. The code is not very pretty and definitely not programmatically efficient, but it works. I'm happy for any suggestions that would improve it.
The pseudocode is:
1) calculate new gower distance matrix for all data, including test and train
2) use the above distance matrix in kNN function (dbscan package) to determine the k nearest neighbours to each test data point.
3) determine the cluster labels for all those nearest points for each test point. Some of them will have no cluster labeling because they are test points themselves
4) create a count matrix to count the frequency of clusters for the k nearest points for each test point
5) use a very simple likelihood calculation to choose the cluster for each test point based on its neighbours' clusters (the maximum frequency). This part also considers the neighbouring test points: the cluster for a test point is chosen only when the maximum frequency is still the largest after adding the number of neighbouring test points to each of the other clusters. Otherwise, the cluster for that test point is left undecided until the next iteration, when hopefully more of its neighbouring test points have had their cluster labels decided based on their own neighbours.
6) repeat above (steps 2-5) until you've decided all clusters
** Note: this algorithm doesn't converge all the time (once you do the math, it's obvious why). So, in the code I break out of the algorithm when the number of unclustered test points doesn't change after a while. Then I repeat steps 2-6 with a new kNN (change the number of nearest neighbours and run the code again). This ensures more points are involved in the decision in the next round. I've tried both larger and smaller k's and both work; it would be good to know which one is better. So far I haven't had to run the code more than twice to decide the clusters for the test data points.
Here is the code:
#calculate gower distance for all data (test + train)
gowerdist_test <- daisy(all_data[rangeofdataforgowerdist],
                        metric = "gower",
                        stand = FALSE,
                        type = list(asymm = listofasymmvars),
                        weights = Weights)
summary(gowerdist_test)
Then use the code below to label clusters for test data.
#library(dbscan)
# find the k nearest neighbours for each point and order them by distance
iteration_MAX <- 50
iteration_current <- 0
maxUnclusterRepeatNum <- 10
repeatedUnclustNum <- 0
unclusteredNum <- sum(is.na(all_data$Cluster))
previousUnclustereNum <- sum(is.na(all_data$Cluster))
nn_k = 30 #number of nearest neighbours
while (anyNA(all_data$Cluster) & iteration_current < iteration_MAX)
{
  if (repeatedUnclustNum >= maxUnclusterRepeatNum) {
    print(paste("Max number of repetitions (", maxUnclusterRepeatNum, ") for the same unclustered data has been reached. Clustering terminated unsuccessfully."))
    invisible(gc())
    break;
  }
  nn_test <- kNN(gowerdist_test, k = nn_k, sort = TRUE)
  # for the TEST points in all_data, find the closest TRAIN points and decide statistically which cluster they could belong to, based on the clusters of the nearest TRAIN points
  test_matrix <- nn_test$id[1:nrow(analdata_test),] #create matrix of test data knn id's
  numClusts <- nlevels(as.factor(sb_train$cluster))
  NameClusts <- as.character(levels(as.factor(sb_train$cluster)))
  count_clusters <- matrix(0, nrow = nrow(analdata_test), ncol = numClusts + 1) #create a count matrix that counts the number of clusters + NA
  colnames(count_clusters) <- c("NA", NameClusts) #name each column of the count matrix with the cluster numbers
  # get the cluster number of each of the k nearest neighbours of each test point
  for (i in 1:nrow(analdata_test))
    for (j in 1:nn_k)
    {
      test_matrix[i,j] <- all_data[nn_test$id[i,j], "Cluster"]
    }
  # populate the count matrix with the total clusters of the neighbours for each test point
  for (i in 1:nrow(analdata_test))
    for (j in 1:nn_k)
    {
      if (!is.na(test_matrix[i,j]))
        count_clusters[i, c(as.character(test_matrix[i,j]))] <- count_clusters[i, c(as.character(test_matrix[i,j]))] + 1
      else
        count_clusters[i, c("NA")] <- count_clusters[i, c("NA")] + 1
    }
  # add NA's (TEST points) to the other clusters for comparison
  count_clusters_withNA <- count_clusters
  for (i in 2:ncol(count_clusters))
  {
    count_clusters_withNA[,i] <- t(rowSums(count_clusters[,c(1,i)]))
  }
  # This block decides the maximum cluster count for each row, considering the number of other test points (NA clusters) in the neighbourhood
  max_col_countclusters <- apply(count_clusters, 1, which.max) #get the column that corresponds to the maximum value of each row
  for (i in 1:length(max_col_countclusters)) #insert the maximum value of each row in its associated column in count_clusters_withNA
    count_clusters_withNA[i, max_col_countclusters[i]] <- count_clusters[i, max_col_countclusters[i]]
  max_col_countclusters_withNA <- apply(count_clusters_withNA, 1, which.max) #get the column that corresponds to the maximum value of each row with NA added
  compareCountClust <- max_col_countclusters_withNA == max_col_countclusters #compare the two count matrices
  all_data$Cluster[1:nrow(analdata_test)] <- ifelse(compareCountClust, NameClusts[max_col_countclusters - 1], all_data$Cluster) #subtract one because of the additional NA column
  iteration_current <- iteration_current + 1
  unclusteredNum <- sum(is.na(all_data$Cluster))
  if (previousUnclustereNum == unclusteredNum)
    repeatedUnclustNum <- repeatedUnclustNum + 1
  else {
    repeatedUnclustNum <- 0
    previousUnclustereNum <- unclusteredNum
  }
  print(paste("Iteration: ", iteration_current, " - Number of remaining unclustered:", sum(is.na(all_data$Cluster))))
  if (unclusteredNum == 0)
    print("Cluster labeling successfully completed.")
  invisible(gc())
}
I guess you can use this for any other type of clustering algorithm; it doesn't matter how you decided the cluster labels for the train data, as long as they are in your all_data before running the code.
Hope this helps.
It's not the most efficient or rigorous code, so I'm happy to see suggestions on how to improve it.
*Note: I used t-SNE to compare the clustering of the train data with the test data and it looks impressively clean, so it seems to be working.
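A simpler manual alternative worth comparing (a sketch built on the objects from the question, not tested on the asker's data): compute the Gower distances from each test row to each training row, then copy the DBSCAN label of the nearest training row whenever that distance is within eps, and mark the test row as noise (0) otherwise. This mimics how DBSCAN would absorb a new point into an existing cluster, though it ignores the core/border distinction.
library(cluster) # daisy()
library(dbscan)
gowerdist_all <- as.matrix(daisy(rbind(analdata_train, analdata_test),
                                 metric = "gower",
                                 stand = FALSE,
                                 type = list(asymm = c(5, 6))))
n_train <- nrow(analdata_train)
# distances from every test row (rows) to every training row (columns)
test_to_train <- gowerdist_all[(n_train + 1):nrow(gowerdist_all), 1:n_train, drop = FALSE]
nearest <- apply(test_to_train, 1, which.min) # index of the closest training point
nearest_dist <- test_to_train[cbind(seq_along(nearest), nearest)]
test_labels <- ifelse(nearest_dist <= sb$eps, sb$cluster[nearest], 0) # 0 = noise, as in dbscan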

Loop structure for basic simulation model in R

I'm trying to write a basic model that simulates the growth of a population (whose initial size is drawn randomly from a normal distribution) and then grows it by a user-defined amount each 'year' (currently 2 individuals in the code below, for argument's sake). The output only shows the results of one simulation, and within that simulation the population hasn't grown at all, i.e. each 'year' the population doesn't add to the previous year's population. I'm assuming I've stuffed something up in the loop structure and am keen for any advice!
n.years <- 3
n.sim <- 5
store.growth <- matrix(ncol=3,nrow= (n.years * n.sim))
for (i in 1:n.sim) {
  init.pop.size <- rnorm(1, 100, 10)
  for (j in 1:n.years){
    #grow population
    grow.pop <- init.pop.size + 5
    store.growth[j,] <- cbind(grow.pop, n.years, n.sim)
  }
}
store.growth
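A possible fix (a sketch of one way to do it): carry the population forward each year instead of always adding to the initial size, and index the storage matrix by both simulation and year so the runs don't overwrite each other. It also stores the current year and simulation number rather than the constants n.years and n.sim, which makes the output easier to read:
n.years <- 3
n.sim <- 5
growth <- 5 # yearly increment (the question's text says 2; the code above adds 5)
store.growth <- matrix(ncol = 3, nrow = n.years * n.sim)
colnames(store.growth) <- c("pop.size", "year", "sim")
for (i in 1:n.sim) {
  pop.size <- rnorm(1, 100, 10) # initial population size for this simulation
  for (j in 1:n.years) {
    pop.size <- pop.size + growth # grow last year's population, not the initial one
    store.growth[(i - 1) * n.years + j, ] <- c(pop.size, j, i) # unique row per sim/year
  }
}
store.growth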
