R n most similar time series - dtw clustering / nearest neighbour - r

The data attached is a simplified example, as in reality I have hundreds of people and hundreds of points in time.
I am looking for a way to determine similar time series.
I have some code here to determine clusters, but this isn't exactly what I want.
What I would like is if I selected one person it would return the names of the n most similar time series.
For example, if n = 1 and I enter Bob, it would return Dave, whereas if I entered Sam it would return Bob (with these names going into a new column in df). If n = 2, the first column would contain the most similar time series and the second would contain the next most similar. This is similar to K nearest neighbours but across time series, so that each individual person has a different set of "neighbours".
If this is unfeasible or too difficult, I would alternatively like to specify the number of people in each group, rather than the number of groups.
In this example I specified 4 groups, but this does not make 4 groups of 2.
Group B contains 4 people, whilst C and D have only 1 person each.
         hc@cluster
James    A
Dave     B
Bob      B
Joe      C
Robert   A
Michael  B
Sam      B
Steve    D
library(dtwclust)
df <- data.frame(
row.names = c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve"),
Monday = c(82, 46, 96, 57, 69, 28, 100, 10),
Tuesday = c(77, 62, 112, 66, 54, 34, 107, 20),
Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
Thursday = c(73, 92, 142, 77, 54, 30, 128, 40),
Friday = c(74, 49, 99, 90, 50, 25, 111, 50),
Saturday = c(68, 26, 76, 81, 42, 28, 63, 60),
Sunday = c(79, 37, 87, 73, 53, 33, 79, 70)
)
hc<- tsclust(df, type = "h", k = 4,
preproc = zscore, seed = 899,
distance = "sbd", centroid = shape_extraction,
control = hierarchical_control(method = "average"))
plot(hc)
yo <- as.data.frame(hc@cluster)
yo$`hc@cluster` <- LETTERS[yo$`hc@cluster`]
print(yo)

You do not want to cluster the data; you want to rank the other series by their distance to one specific time series. To do that, you first have to choose a measure of "distance", for example Euclidean distance or correlation. The example below provides code for both measures. It simply calculates the distance between the selected time series and every other one, sorts the results, and picks the N closest (smallest Euclidean distance, or highest correlation). Note that the choice of distance measure will change your results.
df <- data.frame(
Monday = c(82, 46, 96, 57, 69, 28, 100, 10),
Tuesday = c(77, 62, 112, 66, 54, 34, 107, 20),
Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
Thursday = c(73, 92, 142, 77, 54, 30, 128, 40),
Friday = c(74, 49, 99, 90, 50, 25, 111, 50),
Saturday = c(68, 26, 76, 81, 42, 28, 63, 60),
Sunday = c(79, 37, 87, 73, 53, 33, 79, 70)
)
df <- as.data.frame(t(df))
colnames(df) <- c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve")
get_nearest_n <- function(data, name, n = 1){
#' n must be positive and integer
#' name must be a column name of data
#' data must be a dataframe
serie <- data[,name]
data <- data[,-which(colnames(data) == name)]
dist <- sqrt(colSums((data-serie)**2))
sorted_names <- names(sort(dist)[1:n])
return(data[,sorted_names])
}
get_nearest_n2 <- function(data, name, n = 1){
#' n must be positive and integer
#' name must be a column name of data
#' data must be a dataframe
serie <- data[,name]
data <- data[,-which(colnames(data) == name)]
dist <- as.data.frame(cor(serie,data))
sorted_names <- names(sort(dist,decreasing = T)[1:n])
return(data[,sorted_names])
}
get_nearest_n(data = df, name = 'Bob', n = 3)
get_nearest_n2(data = df, name = 'Bob', n = 3)
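The question asks for the names of the neighbours, for every person at once, rather than the series themselves. A minimal sketch of that, assuming Euclidean distance between rows of the original (people in rows) layout, optionally after z-scoring each person's series so that shape rather than level drives the match. The helper `nearest_names` and the `nearest_1`, `nearest_2`, ... column names are hypothetical, not part of dtwclust:

```r
df <- data.frame(
  row.names = c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve"),
  Monday    = c(82, 46, 96, 57, 69, 28, 100, 10),
  Tuesday   = c(77, 62, 112, 66, 54, 34, 107, 20),
  Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
  Thursday  = c(73, 92, 142, 77, 54, 30, 128, 40),
  Friday    = c(74, 49, 99, 90, 50, 25, 111, 50),
  Saturday  = c(68, 26, 76, 81, 42, 28, 63, 60),
  Sunday    = c(79, 37, 87, 73, 53, 33, 79, 70)
)

# Hypothetical helper: one row per person, one column per neighbour rank
nearest_names <- function(data, n = 1, zscore = FALSE) {
  m <- as.matrix(data)
  if (zscore) m <- t(scale(t(m)))   # centre and scale each person's series
  d <- as.matrix(dist(m))           # pairwise Euclidean distances between people
  diag(d) <- Inf                    # a person is never their own neighbour
  nn <- lapply(rownames(d), function(p) names(sort(d[p, ]))[seq_len(n)])
  out <- do.call(rbind, nn)
  dimnames(out) <- list(rownames(d), paste0("nearest_", seq_len(n)))
  as.data.frame(out)
}

res_raw <- nearest_names(df, n = 2)                 # compares levels
res_z   <- nearest_names(df, n = 2, zscore = TRUE)  # compares shapes
```

On the raw values Bob's nearest neighbour is Sam, whereas after z-scoring it is Dave, because Bob's series is Dave's series shifted up by a constant 50. The result can be attached to the data with `cbind(df, res_z)`.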


selecting multiple max values from multiple ranges

I'm looking to find multiple max values using multiple ranges from a single vector without using a loop.
It's difficult to explain, but here's an example:
list_of_value <- c(100, 110, 54, 64, 73, 23, 102)
beginning_of_max_range <- c(1, 2, 4)
end_of_max_range <- c(3, 5, 6)
Expected output:
110, 110, 73
that is, the results of:
max(100, 110, 54)
max(110, 54, 64, 73)
max(64, 73, 23)
You can do this with mapply:
list_of_value <- c(100, 110, 54, 64, 73, 23, 102)
beginning_of_max_range <- c(1, 2, 4)
end_of_max_range <- c(3, 5, 6)
mapply(function(x, y) max(list_of_value[x:y]), beginning_of_max_range, end_of_max_range)
#[1] 110 110 73
For each pair of start and end positions, we create the index sequence from beginning_of_max_range to end_of_max_range, use it to subset list_of_value, and take the max of that subset.
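The same steps can be written out one at a time, which makes the index ranges visible. This is a sketch of an equivalent alternative using `Map` and `vapply` (the names `idx` and `maxes` are illustrative only); `vapply` additionally guarantees a numeric vector as the result:

```r
list_of_value <- c(100, 110, 54, 64, 73, 23, 102)
beginning_of_max_range <- c(1, 2, 4)
end_of_max_range <- c(3, 5, 6)

# Step 1: one index sequence per (start, end) pair: 1:3, 2:5, 4:6
idx <- Map(seq, beginning_of_max_range, end_of_max_range)

# Step 2: subset the values with each sequence and take the max
maxes <- vapply(idx, function(i) max(list_of_value[i]), numeric(1))
maxes
# [1] 110 110  73
```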

Fit custom function to data

I have data that were produced by a special function:
y(t) = C * (t - t0)^alpha / (1 + (q - 1) * beta * (t - t0)^gamma)^(1/(q - 1))
where t0 = 1, and alpha, q, gamma, C and beta are unknown parameters.
The question is: how do I fit the above function to the following data in R?
mydata<-structure(list(x = 1:100, y = c(0, 0, 2, 1, 3, 4, 4, 3, 7, 8,
9, 11, 12, 11, 15, 15, 17, 21, 49, 43, 117, 75, 85, 97, 113,
129, 135, 147, 149, 149, 123, 129, 127, 122, 143, 157, 144, 139,
123, 117, 141, 138, 124, 134, 158, 151, 136, 133, 121, 117, 122,
125, 117, 111, 98, 94, 92, 89, 73, 87, 91, 88, 94, 90, 93, 76,
60, 96, 71, 80, 71, 63, 65, 47, 74, 63, 78, 68, 55, 48, 51, 45,
48, 50, 71, 48, 35, 51, 69, 62, 64, 66, 51, 59, 58, 34, 57, 56,
63, 50)), class = "data.frame", row.names = c(NA, -100L))
I defined the function as follows:
t0 <- 1
fyy<-function(t,cc0,alpha0,qq0,beta0,gamma0){
ret<-cc0*((t-t0)^alpha0)/(((1+(qq0-1)*beta0*(t-t0)^gamma0))^(1/(qq0-1)))
return(ret)
}
but I don't know how to continue.
As @mhovd mentioned, I used the "nls" function, but I got the following error:
> fit <- nls(y~fyy(x,cc0 ,alpha0 ,beta0 ,gamma0 ,qq0 ),
+ data=data.frame(mydata), start=list(cc0 = .01,alpha0 =1,beta0 =.3,gamma0 = 2,qq0 = 1))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In the comments, @masoud references a paper about the specific function in the question. It suggests fixing gamma0 and qq0, and if we do that we do get a solution: fm, shown in red in the plot. We have also shown an alternative parametric curve, fm2, in blue. It also has 3 optimized parameters but a lower residual sum of squares (lower is better).
fyy <- function(t,cc0,alpha0,qq0,beta0,gamma0){
cc0 * ((t-t0)^alpha0) / (((1+(qq0-1)*beta0*(t-t0)^gamma0))^(1/(qq0-1)))
}
mydata0 <- subset(mydata, y > 0)
# fixed values
t0 <- 1
gamma0 <- 3
qq0 <- 1.2
st <- list(cc0 = 1, alpha0 = 1, beta0 = 1) # starting values
fm <- nls(y ~ fyy(x, cc0, alpha0, qq0, beta0, gamma0), mydata0,
lower = list(cc0 = 0.1, alpha0 = 0.1, beta0 = 0.00001),
start = st, algorithm = "port")
deviance(fm) # residual sum of squares
## [1] 61458.5
st2 <- list(a = 1, b = 1, c = 1)
fm2 <- nls(y ~ exp(a + b/x + c*log(x)), mydata0, start = st2)
deviance(fm2) # residual sum of squares
## [1] 16669.24
plot(mydata0, ylab = "y", xlab = "t")
lines(fitted(fm) ~ x, mydata0, col = "red")
lines(fitted(fm2) ~ x, mydata0, col = "blue")
legend("topright", legend = c("fm", "fm2"), lty = 1, col = c("red", "blue"))
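One likely contributor to the error in the question's nls call is positional argument matching: fyy is defined with the order (t, cc0, alpha0, qq0, beta0, gamma0), but it is called as fyy(x, cc0, alpha0, beta0, gamma0, qq0), so the starting value meant for beta0 lands in qq0, and so on. The sketch below evaluates the function at the question's starting values both ways; the value qq0 = 1.2 in the named call is an assumption borrowed from the fixed value used above, not from the question:

```r
t0 <- 1
fyy <- function(t, cc0, alpha0, qq0, beta0, gamma0) {
  cc0 * ((t - t0)^alpha0) /
    ((1 + (qq0 - 1) * beta0 * (t - t0)^gamma0)^(1 / (qq0 - 1)))
}

# Positional call in the question's order: 0.3 lands in qq0, 2 in beta0, 1 in gamma0.
# The base 1 + (qq0 - 1) * beta0 * (t - t0) then goes negative for t >= 2, and a
# negative base raised to a non-integer power is NaN.
bad <- fyy(1:5, 0.01, 1, 0.3, 2, 1)

# Naming the arguments removes the dependence on order
# (qq0 = 1.2 is an assumed value, kept away from 1).
ok <- fyy(1:5, cc0 = 0.01, alpha0 = 1, beta0 = 0.3, gamma0 = 2, qq0 = 1.2)

anyNA(bad)          # TRUE: the model cannot be evaluated at these values
all(is.finite(ok))  # TRUE
```

Passing the parameters by name in the formula, as in `y ~ fyy(x, cc0 = cc0, alpha0 = alpha0, qq0 = qq0, beta0 = beta0, gamma0 = gamma0)`, avoids this class of mistake.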

Optimize my regression using vectorization instead of a for loop

How would I vectorize this loop? When I have the loop with the backward stepwise regression, it takes over 15 minutes to run through the regression. (My full dataset has over 4000 observations and 20+ independent variables.) Any idea how I would vectorize this? I'm new to the whole concept.
I've looked into making this a function, and then using an ifelse statement for the training and validation. But, I haven't been able to get this to work in the code. Any ideas?
Here is a small dataset:
name <- c("Joe I.", "Joe I.", "Joe I.", "Joe I.", "Jane P.", "Jane P.", "Jane P.", "Jane P.",
"John K.", "John K.", "John K.", "John K.")
name_id <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
grade <- c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, 89, 67)
score <- c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62)
attendance <- c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99)
participation <- c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
# data.frame() keeps the numeric columns numeric, so no conversion is needed;
# the undefined `class` variable has been dropped
df <- data.frame(name, name_id, grade, score, attendance, participation)
Here is the loop:
library(magicfor) # magic_for(), magic_result_as_dataframe()
library(MASS)     # stepAIC()
magic_for(print, silent = TRUE)
for(i in 1:3){
validation = df[df$name_id == (i),]
training = df[df$name_id != (i),]
m = lm(score ~ grade + attendance + participation, data = training)
stepm <- stepAIC(m, direction = "backward", trace = FALSE)
pred1 <- predict(stepm, validation)
print(pred1)
}
options(max.print=999999)
pred1 <- magic_result_as_dataframe()
I am not sure whether the following code will speed up your program, but please give it a try. Here df is first split by df$name_id, so that you have one chunk per name_id.
dfs <- split(df,df$name_id)
lapply(seq_along(dfs), function(k) {
validation <- dfs[[k]]
m <- lm(score ~ grade + attendance + participation, data = do.call(rbind, dfs[-k]))
stepm <- stepAIC(m, direction = "backward", trace = FALSE)
pred1 <- predict(stepm, validation)
})
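Since most of the time is probably spent refitting models inside stepAIC, replacing the for loop with lapply is unlikely to change the run time much; what it does simplify is collecting the results. Below is a minimal sketch that returns the predictions as one data frame without magicfor. The helper name `logo_predict` and its output columns are hypothetical, and the formula assumes that participation was meant to be a third predictor:

```r
library(MASS)  # stepAIC()

df <- data.frame(
  name_id       = rep(1:3, each = 4),
  grade         = c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, 89, 67),
  score         = c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62),
  attendance    = c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99),
  participation = c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
)

# Hypothetical helper: leave one group out, fit on the rest, predict the held-out group
logo_predict <- function(data, group) {
  folds <- split(data, data[[group]])
  preds <- lapply(names(folds), function(g) {
    training <- data[data[[group]] != g, ]
    m <- lm(score ~ grade + attendance + participation, data = training)
    stepm <- stepAIC(m, direction = "backward", trace = FALSE)
    data.frame(group = g,
               row   = rownames(folds[[g]]),
               pred  = unname(predict(stepm, folds[[g]])))
  })
  do.call(rbind, preds)
}

out <- logo_predict(df, "name_id")
```

Because each fold is independent of the others, the `lapply` call could be swapped for `parallel::mclapply` on systems that support forking, which is where a real speed-up would be more likely to come from.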

Simulate data from a Gompertz curve in R

I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
I have the data, and I know that the growth follows a Gompertz curve with formula:
y = a*exp(-exp(-(x-x0)/b)), with a = 88.8, b = 11.7, and x0 = 15.1.
The data I have are:
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165)
Any help would be appreciated, thank you.
(Will edit when more information is given.)
I am a little confused by your question. I have compiled what you have written into R. Please elaborate so that I can help you:
gompertz <- function(x, x0, a, b){
a*exp(-exp(-(x-x0)/b))
}
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165) # means of 30 samples of grape cluster weights?
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112) # ?
#??
gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)
gompertz(y, x0 = 15.1, a = 88.8, b = 11.7)
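If the goal is simply to draw new y values at the same x positions, one common approach is to evaluate the curve and add random noise. The sketch below assumes independent Gaussian noise with a made-up standard deviation `sigma = 5`; neither the noise model nor that value comes from the question. Note also that with a = 88.8 the curve levels off just below 88.8, whereas the observed y values reach 177, so the stated parameters may not correspond to these data and are worth double-checking:

```r
gompertz <- function(x, x0, a, b) {
  a * exp(-exp(-(x - x0) / b))
}

x <- c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)

set.seed(42)   # for reproducibility
sigma <- 5     # hypothetical noise standard deviation (an assumption)

mu    <- gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)        # mean curve at each x
y_sim <- mu + rnorm(length(x), mean = 0, sd = sigma)       # one simulated data set

# Several simulated data sets, one per column
y_sims <- replicate(100, mu + rnorm(length(x), mean = 0, sd = sigma))
```

Since each observed y is a mean of 30 samples, a noise standard deviation estimated from those samples (the standard error of the mean) would be a more defensible choice than an arbitrary value.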

Simulate vectors conditional on custom distribution

I am measuring the duration of episodes (vector ep.dur, in minutes) per day over an observation period of T = 364 days. The vector ep.dur has length T = 364, with zeros on days when no episode occurred, and its values range between 0 and 1440.
The sum of the episode durations over the period T is a <- sum(ep.dur).
Now I have a vector den with length(den) = 99. The vector den shows how many days are required for the development of each 1% (1%, 2%, 3%, ...) of a.
Given den and a, I would like to simulate multiple ep.dur vectors.
Is this possible?
Clarification 1 (first comment of danas.zuokas): The elements of den represent durations, NOT exact days. In Example 1, that means 1% of a (= 1195.08) is developed in 1 day, 2% in 2 days, 3% in 3 days, 4% in 4 days, 5% in 5 days, 6% in 5 days, and so on. The episodes can take place anywhere in T.
Clarification 2 (second comment of danas.zuokas): Unfortunately, no assumptions can be made about how the duration develops. That is why I have to simulate numerous ep.dur vectors. HOWEVER, I can expand the den vector to a finer resolution (that is, 0.1% jumps instead of 1% jumps) if this is of any help.
Description of the algorithm
The algorithm should satisfy all the information the den vector provides. I imagine the algorithm going as follows (Example 3):
Each 1% jump of a is 335.46 min. den[1] tells us that 1% of a is developed in 1 day, so let's say we generate ep.dur[1] = 335.46. We go to den[2]: 2% of a is developed in den[2] = 1 day. So ep.dur[1] cannot be 335.46 and is rejected (2% of a should also occur within one day). Let's say we had instead generated ep.dur[1] = 1440. den[1] is satisfied, den[2] is satisfied (at least 2% of the total duration is developed in den[2] = 1 day), and den[3] = 1 is also satisfied. A keeper? No: den[4] = 2 is not satisfied if ep.dur[1] = 1440, because it states that 4% of a (= 1341.84) should take 2 days to occur. So ep.dur[1] is rejected. Now let's say that ep.dur[1] = 1200. den[1:3] are accepted. Then we generate ep.dur[2], and so on, making sure that the generated ep.dur values all satisfy the information provided by den.
Is this programmatically feasible? I really do not know where to start with this problem. I will provide a generous bounty once the bounty start period is over.
Example 1:
a<-119508
den<-c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15,
16, 17, 18, 19, 20, 20, 21, 22, 23, 24, 25, 25, 26, 27, 28, 29,
30, 30, 31, 32, 33, 34, 35, 35, 36, 37, 38, 39, 40, 40, 41, 42,
43, 44, 45, 45, 46, 47, 48, 49, 50, 50, 51, 52, 53, 54, 55, 55,
56, 57, 58, 59, 60, 60, 61, 62, 63, 64, 65, 65, 66, 67, 68, 69,
70, 70, 71, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 80, 81, 82,
83)
Example 2:
a<-78624
den<-c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11,
11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 21, 22, 23,
28, 32, 35, 36, 37, 38, 43, 52, 55, 59, 62, 67, 76, 82, 89, 96,
101, 104, 115, 120, 126, 131, 134, 139, 143, 146, 153, 160, 165,
180, 193, 205, 212, 214, 221, 223, 227, 230, 233, 234, 235, 237,
239, 250, 253, 263, 269, 274, 279, 286, 288, 296, 298, 302, 307,
309, 315, 320, 324, 333, 337, 342, 347, 352)
Example 3:
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37,
42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81,
87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127,
132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174,
176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214,
218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257,
259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322,
342)
Example 4:
a<-198132
den<-c(2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 23, 24,
25, 27, 28, 29, 31, 32, 34, 35, 36, 38, 39, 40, 42, 43, 45, 46,
47, 49, 50, 51, 53, 54, 56, 57, 58, 60, 61, 62, 64, 65, 67, 68,
69, 71, 72, 74, 75, 76, 78, 79, 80, 82, 83, 85, 86, 87, 89, 90,
91, 93, 94, 96, 97, 98, 100, 101, 102, 104, 105, 107, 108, 109,
111, 112, 113, 115, 116, 120, 123, 130, 139, 155, 165, 172, 176,
178, 181, 185, 190, 192, 198, 218)
As far as I understand what you're after, I would start by converting den to an rle object (here using the data from your Example 3).
EDIT: Add 100% at day 364 to den.
if(max(den)!=364) den <- c(den, 364)
(rleDen <- rle(den))
# Run Length Encoding
# lengths: int [1:92] 3 1 1 1 1 1 1 1 1 1 ... # 92 intervals
# values : num [1:92] 1 2 4 6 8 9 12 15 17 21 ...
percDur <- rleDen$lengths # Percentage of total duration in each interval
atDay <- rleDen$values # What day that percentage was reached
intWidth <- diff(c(0, atDay), k = 1) # Interval width
durPerDay <- 1440 # Max observation time per day
percPerDay <- durPerDay/a*100 # Max percentage per day
cumPercDur <- cumsum(percDur) # Cumulative percentage in each interval
maxPerInt <- pmin(percPerDay * diff(c(0, atDay), 1),
percDur + 1) # Max percent observation per interval
set.seed(1)
nsims <- 10 # Desired number of simulations
sampMat <- matrix(0, ncol = length(percDur), nrow = nsims) # Matrix to hold sim results
To allow for randomness while respecting the limit of 1440 minutes of observation per day, check whether there are any long intervals (i.e., any intervals in which the jump in percentage cannot be completely achieved within that interval):
if(any(percDur > maxPerInt)){
longDays <- percDur > maxPerInt
morePerInt <- maxPerInt - percDur
perEnd <- c(which(diff(longDays,1) < 0), length(longDays))
# Group intervals into periods bounded by "long" days
# and determine if there are any long periods (i.e., where
# the jump in percentage can't be achieved in that period)
perInd <- rep(seq_along(perEnd), diff(c(0, perEnd)))
perSums <- tapply(percDur, perInd, sum)
maxPerPer <- tapply(maxPerInt, perInd, sum)
longPers <- perSums > maxPerPer
# If there are long periods, determine, starting with the last period, when the
# excess can be covered. Each group of periods is recorded in the persToWatch
# object
if(any(longPers)) {
maxLongPer <- perEnd[max(which(longPers))]
persToWatch <- rep(NA, length(maxLongPer))
for(kk in rev(seq_len(maxLongPer))) {
if(kk < maxLongPer && min(persToWatch, na.rm = TRUE) <= kk) next
theSums <- cumsum(morePerInt[order(seq_len(kk),
decreasing = TRUE)])
above0 <- which(rev(theSums) > 0)
persToWatch[kk] <- max(above0[which(!perInd[above0] %in% c(perInd[kk],
which(longPers)) & !above0 %in% which(longDays))])
}
}
}
Now we can start the randomness. The first component of the sampling determines the overall proportion of a that occurs in each of the intervals. How much? Let runif decide. The upper and lower limits must reflect the maximum observation time per day and the excess amount of any long days and periods
for(jj in seq_along(percDur[-1])) {
upperBound <- pmin(sampMat[, jj] + maxPerInt[jj],
cumPercDur[jj] + 1)
lowerBound <- cumPercDur[jj]
# If there are long days, determine the interval over which the
# excess observation time may be spread
if(any(percDur > maxPerInt) && any(which(longDays) >= jj)) {
curLongDay <- max(which(perInd %in% perInd[jj]))
prevLongDay <- max(0, min(which(!longDays)[which(!longDays) <= jj]))
curInt <- prevLongDay : curLongDay
# If there are also long periods, determine how much excess observation time there is
if(any(longPers) && maxLongPer >= jj) {
curLongPerHigh <- min(which(!is.na(persToWatch))[
which(!is.na(persToWatch)) >= jj])
curLongPerLow <- persToWatch[curLongPerHigh]
longInt <- curLongPerLow : curLongPerHigh
curExtra <- max(0,
cumPercDur[curLongPerHigh] -
sum(maxPerInt[longInt[longInt > jj]]) -
sampMat[, jj, drop = FALSE])
} else {
curExtra <- cumPercDur[curLongDay] -
(sum(maxPerInt[curInt[curInt > jj]]) +
sampMat[, jj, drop = FALSE])
}
# Set the lower limit for runif appropriately
lowerBound <- sampMat[, jj, drop = FALSE] + curExtra
}
# There may be tolerance errors when the observations are tightly
# packed
if(any(lowerBound - upperBound > 0)) {
if(all((lowerBound - upperBound) <= .Machine$double.eps*2*32)) {
upperBound <- pmax(lowerBound, upperBound)
} else {
stop("\nUpper and lower bounds are on the wrong side of each other\n",
jj,max(lowerBound - upperBound))
}
}
sampMat[, jj + 1] <- runif(nsims, lowerBound, upperBound)
}
Then add 100 percent to the end of the results and calculate the interval-specific percentage
sampMat2 <- cbind(sampMat[, seq_along(percDur)], 100)
sampPercDiff <- t(apply(sampMat2, 1, diff, k = 1))
The second component of the randomness determines the distribution of sampPercDiff over the interval widths intWidth. This still requires more thought in my opinion. For instance, how long does a typical episode last compared to the unit of time under consideration?
For each interval, determine if the random percentage needs to be allocated over multiple time units (in this case days). EDIT: Changed the following code to limit percentage increase when intWidth > 1.
library(foreach)
ep.dur<-foreach(ii = seq_along(intWidth),.combine=cbind)%do%{
if(intWidth[ii]==1){
ret<-sampPercDiff[, ii, drop = FALSE] * a / 100
dimnames(ret)<-list(NULL,atDay[ii])
ret
} else {
theDist<-matrix(numeric(0), ncol = intWidth[ii], nrow = nsims)
for(jj in seq_len(intWidth[ii]-1)){
theDist[, jj] <- floor(runif(nsims, 0, pmax(0,
min(sampPercDiff[, ii], floor(sampMat2[,ii + 1])-.Machine$double.eps -
sampMat2[,ii]) * a / 100 - rowSums(theDist, na.rm = TRUE))))
}
theDist[, intWidth[ii]] <- sampPercDiff[, ii] * a / 100 - rowSums(theDist,
na.rm = TRUE)
distOrder <- replicate(nsims, c(sample.int(intWidth[ii] - 1),
intWidth[ii]), simplify = FALSE)
ret <- lapply(seq_len(nrow(theDist)), function(x) {
theDist[x, order(distOrder[[x]])]
})
ans <- do.call(rbind, ret)
dimnames(ans) <- list(NULL, atDay[ii]-((intWidth[ii]:1)-1))
ans
}
}
The duration time is sampled randomly for each time unit (day) in the interval to which it is to be distributed. After breaking up the total duration into daily observed times, these are then assigned randomly to the days in the interval.
Then, multiply the sampled and distributed percentages by a and divide by 100
ep.dur[1, 1 : 6]
# 1 2 3 4 5 6
# 1095.4475 315.4887 1.0000 578.9200 13.0000 170.6224
ncol(ep.dur)
# [1] 364
apply(ep.dur, 1, function(x) length(which(x == 0)))
# [1] 131 133 132 117 127 116 139 124 124 129
rowSums(ep.dur)/a
# [1] 1 1 1 1 1 1 1 1 1 1
plot(ep.dur[1, ], type = "h", ylab = "obs time")
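A simulated vector can be checked against den by recomputing den from it. The sketch below assumes the reading used in the question's walk-through, namely that den[p] is the first day on which the cumulative duration reaches p% of a. The helper name `den_from_epdur` and the small tolerance `tol` are assumptions made for illustration:

```r
# Hypothetical helper: recover the 99-element den vector from a daily duration vector
den_from_epdur <- function(ep.dur, a = sum(ep.dur), tol = 1e-9) {
  cum_perc <- cumsum(ep.dur) / a * 100   # cumulative percentage of a after each day
  vapply(1:99,
         function(p) which(cum_perc >= p - tol)[1],  # first day reaching p%
         integer(1))
}

# Toy example: half of the total on day 1, nothing on day 2, the rest on day 3
toy <- c(50, 0, 50)
den_toy <- den_from_epdur(toy)
den_toy[c(1, 50, 51, 99)]
# [1] 1 1 3 3
```

For a simulated row, `all(den_from_epdur(ep.dur[1, ], a) == den[1:99])` then reports whether every percentage step is reached on the day den specifies. Whether the simulations above pass such a strict check exactly has not been verified here.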
I would most probably do this with a Ruby script, but it can be done in R too. I am not sure whether this is a homework problem or not. To answer your question: can this be done programmatically? Yes, of course!
My solution is to define, for each day, the minimum and maximum limits within which I randomly pick a percentage that satisfies the conditions given by the den vector and the value of a.
Since the den vector only covers values up to 99%, we cannot be sure when 100% is reached. Because of this, my solution is split into 3 parts: 1) the given den vector up to 98%, 2) the 99% step, 3) beyond 99%. I could define another function to hold the code common to all 3 parts, but I haven't done so.
Since I use runif to generate the random numbers, it is unlikely to return exactly the lower limit. Hence, I have defined a threshold value: if the generated number falls below it, I set it to 0. You can keep this or remove it. Also, consider Example 4, where the first 1% is reached on the 2nd day. This means the 1st day can contain up to a maximum of 0.999999% of the episode time, and the 1% is then reached on the 2nd day. This is why the maximum limit is defined by subtracting a smallestdiff value, which can be changed.
FindMinutes=function(a,den){
if (a>1440*364){
print("Invalid value for a")
return("Invalid value for a")
}
threshold=1E-7
smallestdiff=1E-6
sum_perc=0.0
start=1 #day 1
min=0 #minimum percentage value for a day
max=0 #maximum percentage value for a day
days=rep(c(0),364) #day vector with percentage of minutes - initialized to 0
maxperc=1440*100/a #maximum percentage wrto 1440 minutes/day
#############################################################
#############################################################
############ For the length of den vector ###################
for (i in 1:length(den)){
if (den[i]>start){
min=(i-1)-sum_perc
for(j in start:(den[i]-1)){#number of days in-between
if (j>start){ min=0 }
if (i-smallestdiff-sum_perc>=maxperc){
max=maxperc
if ((i-smallestdiff-sum_perc)/(den[i]-j)>=maxperc){
min=maxperc
}else{
if ((i-smallestdiff-sum_perc)/(den[i]-j-1)<maxperc){
min=maxperc-(i-smallestdiff-sum_perc)/(den[i]-j-1)
}else{
min=maxperc
}
}
}else{
max=i-smallestdiff-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
sum_perc=sum_perc+days[j]
}else{
days[j]=0.0
}
}
start=den[i]
}
}
#############################################################
#############################################################
#####################For the 99% ############################
min=99-sum_perc
for(j in start:den[length(den)]){
if (j>start){
min=0
}
max=100-sum_perc
if (100-sum_perc>=maxperc){
max=maxperc
if ((100-sum_perc)/(364+1-j)>=maxperc){
min=maxperc
}else{
if ((100-sum_perc)/(364-j)<maxperc){
min=maxperc-(100-sum_perc)/(364-j)
}else{
min=maxperc
}
}
}else{
max=100-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
sum_perc=sum_perc+days[j]
}else{
days[j]=0.0
}
}
#############################################################
#############################################################
##################### For the remaining 1%###################
min=0
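# days after the last den entry; ':' binds tighter than '+', so build the range explicitly
for(j in seq_len(364-den[length(den)])+den[length(den)]){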
max=100-sum_perc
if (j==364){
min=max
days[j]=min
}else{
if (100-sum_perc>maxperc){
max=maxperc
if ((100-sum_perc)/(364+1-j)>=maxperc){
min=maxperc
}else{
if ((100-sum_perc)/(364-j)<maxperc){
min=maxperc-(100-sum_perc)/(364-j)
}else{
min=maxperc
}
}
}else{
max=100-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
}else{
days[j]=0.0
}
}
sum_perc=sum_perc+days[j]
if (sum_perc>=100.00){
break
}
}
return(days*a/100) #return as minutes vector corresponding to each 364 days
}#function
In my code, I randomly generate the percentage of episode time for each day between the minimum and maximum values. The condition given by the den vector holds when the cumulative percentages are rounded to integers, but extra tuning may be needed (checking the den vector further ahead and re-tuning the minimum percentage) if you want it accurate to a few decimal places. You can also check that sum(FindMinutes(a,den)) is equal to a. If you want to define den in steps of 0.1%, you can do so, but you need to change the corresponding equations (for min and max).
As a worst-case example, take a to be the maximum value it can have, with a corresponding den vector:
a=1440*364
den<-c(0)
cc=1
for(i in 1:363){
if (trunc(i*1440*100/(1440*364))==cc){
den[cc]=i
cc=cc+1
}
}
You can run the above example by calling the function: maxexamplemin=FindMinutes(a,den)
You can then check that every day has the maximum of 1440 minutes, which is the only possible scenario here.
As an illustration, let me run your Example 3:
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37, 42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81, 87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127, 132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174, 176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214, 218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257, 259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322, 342)
rmin=FindMinutes(a,den)
sum(rmin)
# [1] 33546
rmin2=FindMinutes(a,den)
rmin3=FindMinutes(a,den)
plot(rmin,type="h")
par(new=TRUE)
plot(rmin2,col="red",type="h")
par(new=TRUE)
plot(rmin3,col="red",type="h")
The 3 superimposed plots are shown below:
