Keep rank over time in R

Is there a way in R to check if ranks are kept over time for individual observations?
I have measured a number of plants over the years and want to check if large plants stay large and small plants stay small (i.e. if the large plants prevent other plants from growing). The plants are ranked 1-5 in size (from small to large).
I have measured about 1000 plants.
Very grateful for any answers or comments.
/Stina

Maybe you could do something like this?
# create random data
plantId <- sample(1:50, 1000, replace = TRUE)
rank <- sample(1:5, 1000, replace = TRUE)
time <- as.POSIXct(sample(1000000:10000000, 1000, replace = FALSE) + 10000000 * rank, origin = "1970-01-01")
myData <- data.frame(plantId, rank, time)
# function to calculate the time a plant has a given rank
getRankTime <- function(id, testRank, data = myData){
  plantData <- data[data$plantId == id, ]
  if(nrow(plantData) < 2){ # only one observed value for this plant
    return(NA)
  } else if(all(plantData$rank != testRank)){ # plant was never of the rank under consideration
    return(NA)
  } else { # calculate the (censored) time the plant stay(ed) in rank 'testRank'
    startObsTimeInRank <- min(plantData$time[plantData$rank == testRank])
    if(any(plantData$rank > testRank)){
      endObsTimeInRank <- min(plantData$time[plantData$rank > testRank])
    } else {
      # either take the last observed time in this rank
      endObsTimeInRank <- max(plantData$time[plantData$rank == testRank])
      # alternatively use the current time
      # endObsTimeInRank <- Sys.time()
    }
    return(as.numeric(endObsTimeInRank - startObsTimeInRank))
  }
}
# calculate the average time plants stay in a rank
allPlantIds <- unique(myData$plantId)
stayInRankTime <- list()
for(runRank in 1:5){
  stayInRankTime[[runRank]] <- sapply(allPlantIds, function(runPlantId) getRankTime(runPlantId, runRank))
}
# average time plants stay in a certain rank
avgRankTime <- lapply(stayInRankTime, function(x) mean(x, na.rm = TRUE))
avgRankTime
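A complementary quick check, not part of the answer above: if the question is only whether rank order is preserved between the first and the last measurement of each plant, a Spearman correlation on those two ranks may already be informative (a sketch using the simulated myData above):
# sketch: compare each plant's first and last recorded rank
firstLast <- do.call(rbind, lapply(split(myData, myData$plantId), function(d){
  d <- d[order(d$time), ]
  data.frame(plantId = d$plantId[1], firstRank = d$rank[1], lastRank = d$rank[nrow(d)])
}))
# a value close to 1 suggests plants tend to keep their rank over time
cor(firstLast$firstRank, firstLast$lastRank, method = "spearman")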

Related

Optimising assignments of individuals to groups (with defined maximum capacity) according to individuals' preferences

I am attempting to populate a set of forests with individuals from various species. Each forest has a defined capacity which it cannot exceed (given by Forest Area * Organism Density). Each species has a set population size, which is a fraction of the total population, itself determined by the sum of all forest areas * organism density.
The species also have defined preferences with regard to which forests they are assigned to, according to a relationship with a characteristic that varies between the forests, say rainfall. This results in a matrix of probabilities of an individual of a given species being assigned to a given forest. In order to give an equal chance for each species to be allocated to its preferred forest, I am assigning one individual at a time, repeatedly iterating through all the species in order, until both the non-assigned species populations have been exhausted and all the forests are at their maximum capacity.
Ideally the proportion of a species' population in a forest will be as close as possible to the probability of that species being assigned to that forest. This will involve a compromise between all species to minimise the total error between the population proportions and the species specific probabilities of assignment. (Thanks to commenters for making this a clearer problem)
At the moment I am doing this with a for, if/else, while loop. Here, when a forest reaches its capacity it is removed from the selection process, and when a species' population has all been assigned, it is skipped in later iterations. The species population sizes are stored in one data.frame and the forest capacities are stored in another, and are adjusted accordingly when individuals are assigned.
These criteria have led me to struggle to see an alternative method to the loops (reproducible example below). However, it is very slow as I often have total population sizes in the 100s of millions. I feel there must be a much neater and faster alternative to using loops, perhaps residing in the way I have structured the input data or in the way I provide equal opportunity to each species (i.e. a way this doesn't have to be sequential), but I cannot figure one out. All help is greatly appreciated.
set.seed(999)
#Generate data
nSpecies <- 50 #Number of species
max_area <- 10000 #Maximum Area of a forest
nForests <- 20 #Number of different forests
areas <- round(rbeta(nForests, 1, 2) * max_area) #Generate random forest areas
total_area <- sum(areas) #Find total area of all forests
density <- 10 #Set organism density
total_population <- total_area * density #Find total population size across all forests
pop_structure <- table(sample(1:nSpecies, total_population, replace = T)) #Generate species populations
forests <- data.frame(Name = 1:nForests,
                      Capacity = (areas * density), #Find max population size of each forest
                      Rainfall = sample(0:10000, nForests, replace = T)) #Generate forest characteristic variable (e.g. rainfall)
species <- data.frame(Species = 1:nSpecies,
                      Individuals = as.numeric(pop_structure),
                      Rain_Response = rnorm(nSpecies, 0, 2)) #Generate species rainfall response
#Generate probabilities of assignment to each forest for each species
assignment_probs <- matrix(NA, nrow = nSpecies, ncol = nForests)
for(i in 1:nSpecies){
  for(x in 1:nForests){
    #Probability of assignment to forest = Exponent of species rain response * log(Rainfall in Forest)
    assignment_probs[i,x] <- exp(species$Rain_Response[i] * log(forests$Rainfall[x]))
  }
  #Scale to sum to 1
  assignment_probs[i,] <- (assignment_probs[i,] / sum(assignment_probs[i,]))
}
#Allocate species individuals to a forest
forest_comms <- matrix(0, nrow = nForests, ncol = nSpecies) #Empty community matrix
possible_forests <- 1:nForests #Vector to remove forests from selection without affecting other data
done <- FALSE #Used to exit loop when finished
while(sum(species$Individuals) > 0){ #While individuals in the species pool remain to be assigned...
  for(sp in 1:nSpecies){ #Repeatedly assign one individual from each species until all done
    if(species$Individuals[sp] > 0){ #If species individuals remain to be assigned, proceed. Else, skip
      vacancies <- 0 #Set vacancies to 0 to enter next loop
      while(vacancies == 0){ #If there are 0 vacancies in forest selected in next section, retry assignment
        forest <- sample(possible_forests, 1, prob = assignment_probs[sp, possible_forests]) #Randomly select forest according to generated assignment probabilities
        vacancies <- forests$Capacity[forest] #Find no. of individual vacancies yet to be filled in the forest
        if(vacancies > 0){ #If vacancies available in forest...
          forest_comms[forest, sp] <- (forest_comms[forest, sp] + 1) #Assign an individual to the forest
          species$Individuals[sp] <- (species$Individuals[sp] - 1) #Decrease species count by 1
          forests$Capacity[forest] <- (forests$Capacity[forest] - 1) #Decrease remaining vacancies in forest by 1
        } else { #If forest is already full...
          possible_forests <- possible_forests[!possible_forests %in% forest] #Remove forest from selection process
        }
        if(length(possible_forests) == 1){ #If only one forest has vacancies...
          for(i in 1:nrow(species)){ #Assign all remaining individuals to that forest
            forest_comms[possible_forests, i] <- (forest_comms[possible_forests, i] + species$Individuals[i])
          }
          species$Individuals <- 0 #Set population to 0 (all individuals have been assigned)
          done <- TRUE #Convert 'done' to TRUE to end loop
          break
        }
      }
    }
  }
  cat('\n', sum(species$Individuals))
  if(done){break}
}
sum(forest_comms) == total_population
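One possible direction for speeding this up (a rough sketch, not part of the original question): run a vectorised first pass before the sequential loop that allocates the bulk of each species' expected counts in one go, leaving only a small remainder for the loop. The 0.9 slack factor below is an arbitrary assumption to keep forests from overflowing in the first pass.
# Sketch of a bulk first pass, run on fresh species/forests data frames before the loop.
expected <- assignment_probs * species$Individuals    # expected counts, nSpecies x nForests
bulk <- floor(expected * 0.9)                         # allocate ~90% up front, keep slack for capacity
overfull <- colSums(bulk) > forests$Capacity          # forests that would still overflow
if(any(overfull)){                                    # scale those columns down to capacity
  shrink <- forests$Capacity[overfull] / colSums(bulk)[overfull]
  bulk[, overfull] <- floor(sweep(bulk[, overfull, drop = FALSE], 2, shrink, `*`))
}
forest_comms <- forest_comms + t(bulk)                # community matrix is nForests x nSpecies
species$Individuals <- species$Individuals - rowSums(bulk)
forests$Capacity <- forests$Capacity - colSums(bulk)
# The remaining ~10% (plus rounding leftovers) can then be assigned with the
# sequential loop above, which now has far fewer iterations.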

Creating transition matrix per indexed Time interval

I have a dataset on the transitions of users between states (9 in total) within specific time intervals. This dataset will be used for a Markov chain model. In total there are 96 time intervals, so for every user there are 96 observations, each of which provides the time interval, a start location and an end location. A state that combines two locations simply means that the user is still in transition between the two states.
Below is a fictional dataset. In this example, unlike the actual dataset, start and end location are not necessarily linked, but I believe this will serve just as well as an illustration of the problem.
ID <- rep(1:10, each = 96)
TimeInterval <- rep(1:96, 10)
Locations <- c("Home", "Bakery", "Grocery", "Home-Bakery", "Home-Grocery", "Bakery-Home", "Bakery-Grocery", "Grocery-Home", "Grocery-Bakery")
startLocation <- sample(Locations, 960, replace = TRUE)
endLocation <- sample(Locations, 960, replace = TRUE)
df <- data.frame(ID, TimeInterval, startLocation, endLocation)
I want to calculate a transition matrix for every time interval, where the transition probability is calculated by the probability of transitioning into a state/location given the state/location at the previous time interval. For instance, to calculate the transition probability matrix for TimeInterval 37 the probability of being in a certain state in TimeInterval 37 given the state in TimeInterval 36 is taken.
This will result in a total of 96 transition matrices. The probability of transitioning from one state (Location) to another in a specific time interval is then computed from the transitions of all users combined.
However, I do not know how to aggregate the results of the individual transitions. What would be an efficient way to calculate these matrices?
The transition matrix per time interval should be a 9x9 matrix that includes all the states.
Edit:
A (very ugly) dplyr solution that worked for a single transition matrix:
Interval36 <- df %>% filter(TimeInterval == 36)
Interval37 <- df %>% filter(TimeInterval == 37)
timeBlock37 <- as.data.frame(cbind(Interval37$journey, Interval36$journey))
mTimeBlock37 <- as.data.frame.matrix(table(timeBlock37))
timeBlock <- prop.table(mTimeBlock37)
timeBlock
I solved it myself, although not in the most elegant way, with quite an unstructured for-loop.
matrixList <- list()
states <- Locations
for(i in 1:96){
  t1 <- df %>% filter(timeBlock == i)
  j <- ifelse(i < 96, i + 1, 96)
  t2 <- df %>% filter(timeBlock == j)
  timeBlock <- as.data.frame(cbind(t1[,8], t2[,8]))
  mTime <- as.data.frame.matrix(table(timeBlock))
  timeBlock <- prop.table(mTime)
  timeBlock <- as.matrix(timeBlock)
  mat1 <- matrix(0, nrow = 9, ncol = 9)
  colnames(mat1) <- states
  rownames(mat1) <- states
  colsNeeded <- colnames(mat1)[colnames(mat1) %in% colnames(timeBlock)]
  rowsNeeded <- rownames(mat1)[rownames(mat1) %in% rownames(timeBlock)]
  mat1[rowsNeeded, colsNeeded] <- timeBlock[rowsNeeded, colsNeeded]
  matrixList[[i]] <- mat1
}
I created an empty matrix inside the for-loop that could be populated with the transition matrix, to account for situations where not all states appear within a certain time interval.
If someone still has a more elegant/cleaner solution, feel free to contribute for future readers.
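For future readers, here is one possible cleaner variant (a sketch based on the fictional df above, taking endLocation as a user's state in an interval; column names in the real dataset may differ): lag each user's state with dplyr, then tabulate against fixed factor levels so every matrix comes out 9x9.
library(dplyr)
df2 <- df %>%
  arrange(ID, TimeInterval) %>%
  group_by(ID) %>%
  mutate(prevLocation = lag(endLocation)) %>%  # state in the previous interval
  ungroup()
matrixList2 <- lapply(2:96, function(t){
  d <- df2 %>% filter(TimeInterval == t)
  counts <- table(factor(d$prevLocation, levels = Locations),
                  factor(d$endLocation, levels = Locations))
  prop.table(counts, margin = 1)  # row-conditional probabilities; rows with no observations give NaN
})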

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
  data.frame(
    id = 1:nPerson,
    BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
    BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
  )
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
panel$Outcome1LastMonth = rep(0, dim(panel)[1]) # initialized here so the loop below can fill it in
Now that we have data, here is the part that is slow (around a second on my computer for only 4 people, but hours for thousands of people). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > -error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > -error. I use xb = xbsmall + 4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know to solve the problem of computation speed).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for(t in (AgeMonthMin+1):AgeMonthMax){
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
    # xb = xbsmall + 4.47 if Outcome was 1 last month
    # Otherwise, xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > -error
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
  }
}
end_time = Sys.time()
end_time - start_time
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computer time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::gather_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
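If further speed is needed, one more incremental idea (a sketch of the same logic, not benchmarked here): run the recursion on plain vectors instead of repeatedly indexing data.frame columns, which removes most of the per-iteration overhead.
# Same recursion as generate_outcome, but on plain vectors (sketch only).
generate_outcome_vec <- function(xbsmall, error) {
  Outcome <- numeric(length(xbsmall))  # first month stays 0, as in the original
  for (i in 2:length(xbsmall)) {
    xb <- xbsmall[i] + 4.47 * Outcome[i - 1]
    Outcome[i] <- as.numeric(xb > -error[i])
  }
  Outcome
}
out2 <- lapply(split(panel, panel$id), function(x) {
  x$Outcome <- generate_outcome_vec(x$xbsmall, x$error)
  x
})
out2 <- do.call(rbind, out2)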

Optimization in R using constraints

I have some problems setting up an optimization process in R. My data set is as follows:
set.seed(123)
library(lpSolve)
num_data <- 1000
bal_max <- .2/100
ind_max <- 10.5/100
data <- data.frame(id = 1:num_data,
                   balance = pmax(0, runif(num_data, 0, 1000)),
                   industry = rep(seq(1:10), num_data/10))
data$risk <- pmax(0, data$balance + rnorm(num_data,100,10))
As you can see, there are 1000 ids and 10 different industries. The objective is to maximize the sum of the column "risk" while making sure that the fraction of each individual loan and of each industry does not exceed 0.2% and 10.5% of the total balance, respectively.
In the current data set these conditions are not met:
max(data$balance) / sum(data$balance)
#[1] 0.002009751
industry <- aggregate(balance ~ industry, FUN=sum,data=data)
max(industry$balance) / sum(industry$balance)
#[1] 0.1093997
Thus, before we can maximize the column risk, these two conditions need to be met. The remainder of my code is as follows:
# set up linear prog problem
num_x <- nrow(data)
num_ind <- length(unique(data$industry))
objective.in <- data$risk
# define quantity to be maximized
# construct right-hand-side of constraint vector
# - sum of balances = 1
# - each balance <= bal_max
# - sum of balances for each industry <= ind_max
# - lp solver function imposes constraint that each balance >= 0
const.rhs <- c( 1, rep(bal_max, num_x), rep(ind_max, num_ind))
# construct constraint matrix for same constraints
mat_ind <- matrix(0,nrow=num_ind, ncol=num_x)
for( i in 1:num_ind) mat_ind[i,which(data$industry == i)] <- 1
const.mat <- rbind( matrix(1, nrow=1,ncol=num_x), diag(num_x), mat_ind )
# define directions for each constraint equation
const.dir <- c("=", rep("<=",num_x), rep("<=", num_ind))
# find balances for max risk
max_risk <- lp(direction="max", objective.in=objective.in, const.mat=const.mat,
               const.dir=const.dir, const.rhs=const.rhs)
max_risk
# add data balances with optimum solution
data$balance <- max_risk$solution
# each balance should be smaller than bal_max
max(data$balance)
# industry should be smaller than 10.5% each
industry <- aggregate(balance ~ industry, FUN=sum,data=data)
industry
As you can see, each industry and individual id no longer exceeds its boundary (10.5% and 0.2%, respectively). The problem is that this code fills in either 0.2% or 0% for each loan id (such that the fractions sum to 1). However, the absolute value of the initial balance should not rise either. In this example the original balance often rises (the solver fills in 0.2% of the total).
In short, I want to optimize the column "risk" where each individual ID is capped at 0.2% of the total balance (after the optimization) and each industry is capped at 10.5% of the total balance (after the optimization). The sum of all fractions should be one and the absolute value of each balance cannot increase.
The idea is to decrease the balances such that all conditions are met and "risk" is optimized.
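One possible way to encode the extra requirement (a sketch, assuming the decision variables are the new absolute balances and that the original balances were saved in a hypothetical vector orig_balance before data$balance is overwritten above): drop the sum-to-one constraint and express both caps relative to the new total, which stays linear because x_i <= bal_max * sum(x) can be rewritten as x_i - bal_max * sum(x) <= 0.
# Sketch only: orig_balance is assumed to hold the balances generated at the top,
# saved before the line data$balance <- max_risk$solution runs.
cap_id <- diag(num_x) - matrix(bal_max, num_x, num_x)  # x_i - bal_max * sum(x) <= 0
cap_ind <- mat_ind - matrix(ind_max, num_ind, num_x)   # industry sum - ind_max * sum(x) <= 0
no_increase <- diag(num_x)                             # x_i <= orig_balance_i
const.mat2 <- rbind(cap_id, cap_ind, no_increase)
const.dir2 <- rep("<=", nrow(const.mat2))
const.rhs2 <- c(rep(0, num_x), rep(0, num_ind), orig_balance)
max_risk2 <- lp(direction = "max", objective.in = data$risk,
                const.mat = const.mat2, const.dir = const.dir2, const.rhs = const.rhs2)
# max_risk2$solution holds the new absolute balances; fractions of the new total
# can be checked with max_risk2$solution / sum(max_risk2$solution).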

Running into problems with running null model of genetic drift in R

I've been trying to wrangle a basic model of genetic drift in R.
However, every time I try to run the program it won't stop, and I have to manually stop it.
My complete code:
trials <- 100 #initialize the number of times you'll generate the time to fixation
fixation <- trials #Create a vector that records the number of generations until fixation of the alleles.
genVector <- numeric(trials)
for(i in 1:trials){
  pop <- c(rep('a',20), rep('b',20)) #Initialize the population with equal numbers of both a and b alleles, for twenty individuals, or 40 alleles.
  genTime <- 1 #Number of generations
  freq <- length(pop[grep('a', pop)])/length(pop)
  while(freq > 0 | freq < 1){ #While the frequency of a in the population is greater than 0 or less than 1, perform the following calculations
    pop <- sample(pop, length(pop), replace = TRUE) #Randomly select 40 alleles with constant replacement
    freq <- length(pop[grep('a', pop)])/length(pop)
    genTime <- genTime + 1 #Add one to the generation time
  }
  genVector[i] <- genTime
}
I believe I have isolated the problem to the while loop I am using, within the for loop. I have no idea why it won't stop running though. Any comments or suggestions would be greatly appreciated!
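For what it's worth, the while condition looks like the culprit: every possible value of freq satisfies freq > 0 | freq < 1 (any number is either greater than 0 or less than 1), so the loop can never exit. Requiring both parts to hold, i.e. looping only while both alleles are still segregating, lets it stop at fixation or loss:
while(freq > 0 & freq < 1){ #Keep resampling only while both alleles are still present
  pop <- sample(pop, length(pop), replace = TRUE)
  freq <- length(pop[grep('a', pop)])/length(pop)
  genTime <- genTime + 1
}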
