How to sum values of two unequal vectors in R?

I have two vectors of unequal length, and I want to add every value of TT to every value of FF.
For example:
TT <- c(1:10)
FF <- c(0, 60, 120, 180)
I would expect the following result:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190
I would appreciate if you could give me some advice.
Thanks in advance

We can use outer:
c(outer(TT, FF, FUN = `+`))
or sapply, iterating over FF so that the result comes out in the expected order:
c(sapply(FF, `+`, TT))
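As a cross-check, the same all-pairs sum can be sketched with rep() and R's vector recycling. This is an alternative formulation rather than part of the answer above, and the name res is introduced here only for illustration.

```r
TT <- 1:10
FF <- c(0, 60, 120, 180)

# Repeat each FF value once per TT element; TT is then recycled along it
res <- rep(FF, each = length(TT)) + TT

# Compare with outer(), which fills column by column (one column per FF value)
all.equal(res, c(outer(TT, FF, `+`)))
head(res, 12)
```

The rep() form makes the ordering explicit: the values are grouped by FF first, then by TT within each group.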


FCT_Collapse using a range

I'm trying to use a range (160:280) instead of listing '160', '161' and so on. How would I do that?
group_by(disp = fct_collapse(as.character(disp), Group1 = c(160:280), Group2 = c(281:400)) %>%
summarise(meanHP = mean(hp)))
Error: Problem adding computed columns in `group_by()`.
x Problem with `mutate()` column `disp`.
i `disp = `%>%`(...)`.
x Each input to fct_recode must be a single named string. Problems at positions: 1, 2, 3, 4, 5, ...
For a range of values it is better to use cut, where you can define breaks and labels.
library(dplyr)
library(forcats)
mtcars %>%
group_by(disp = cut(disp, c(0, 160, 280, 400, Inf), paste0('Group', 1:4))) %>%
summarise(meanHP = mean(hp))
# disp meanHP
# <fct> <dbl>
#1 Group1 93.1
#2 Group2 143
#3 Group3 217.
#4 Group4 217.
So here (0, 160] becomes 'Group1', (160, 280] becomes 'Group2', and so on; by default each interval includes its upper bound.
With fct_collapse you can do:
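A small sketch of how cut() treats values at the break points, using the same breaks as above. The test values in x are made up for illustration.

```r
# Boundary behaviour of cut() with its default right = TRUE
x <- c(160, 160.1, 280, 281, 472)
grp <- cut(x, breaks = c(0, 160, 280, 400, Inf), labels = paste0("Group", 1:4))
data.frame(x, grp)
```

If a group is meant to start at 160 rather than end there, the breaks have to be shifted accordingly, or right = FALSE can be passed to cut().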
mtcars %>%
group_by(disp = fct_collapse(as.character(disp), Group1 = as.character(160:280), Group2 = as.character(281:400))) %>%
summarise(meanHP = mean(hp)) %>%
suppressWarnings()
However, this works only for exact values that are present in the data, so 160 would be in Group1 but 160.1 would not.
We could also do
library(dplyr)
library(stringr)
mtcars %>%
group_by(disp = cut(disp, c(0, 160, 280, 400, Inf), str_c('Group', 1:4))) %>%
summarise(meanHP = mean(hp))

update valence shifter in sentimentr package in r

I am trying to remove certain rows from lexicon::hash_valence_shifters, which is used by the sentimentr package. Specifically, I want to keep only rows:
c( 1 , 2 , 3 , 6 , 7 , 13, 14 , 16 , 19, 24, 25 , 26, 27 , 28, 36, 38, 39, 41, 42, 43, 45, 46, 53, 54, 55, 56, 57, 59, 60, 65, 70, 71, 73, 74, 79, 84, 85, 87, 88, 89, 94, 95, 96, 97, 98, 99, 100, 102, 103, 104, 105, 106, 107, 114, 115, 119, 120, 123, 124, 125, 126, 127, 128, 129, 135, 136, 138)
I have tried the below approach:
vsm = lexicon::hash_valence_shifters[c, ]
vsm[ , y := as.numeric(y)]
vsm = sentimentr::as_key(vsm, comparison = NULL, sentiment = FALSE)
sentimentr::is_key(vsm)
vsn = sentimentr::update_valence_shifter_table(vsm, drop = c(dropvalue$x), x= lexicon::hash_valence_shifters, sentiment = FALSE, comparison = TRUE )
However, when I am calculating the sentiment using the updated valence shifter table "vsn", it is giving the sentiment as 0.
Can someone please let me know how to just keep specific rows of the valence shifter table ?
Thanks!

In R, how can I properly subset a data frame based on a list of values inside of a function?

I have a function that is attempting to select rows from a dataframe based on a list of values.
For instance, some values might be:
> subset_ids
[1] "JUL_0003_rep1" "JUL_0003_rep2"
[3] "JUL_0003_rep3" "JUL_0007_rep1"
[5] "JUL_0007_rep2" "JUL_0007_rep3"
I have a data frame called "targets" with a column called "LongName". It has many other columns, but they don't matter here. I want to select the rows from targets where LongName is in subset_ids.
I can do this fine with either:
targets[is.element(targets$LongName, subset_ids),]
or
targets[targets$LongName %in% subset_ids,]
The problem is that I want to do this in a function, and I don't know what the column will be called in advance.
So I tried using the eval/parse method, which upon recent reading may not be the best way to do it. When I do the following:
sub1 <- paste("targets[is.element(targets$", column_name, ", subset_ids),]", sep="")
targets_subset <- as.character(eval(parse(text = sub1)))
It returns some strange concatenation of row numbers. It looks like this:
[1] "c(5, 6, 7, 17, 18, 19, 26, 27, 28, 35, 36, 46, 47, 48, 54, 55, 61, 62, 63, 64, 73, 74, 75, 76, 77, 78, 91, 92, 93, 102, 103, 104, 114, 117, 118, 129, 136, 137, 140, 141, 151, 152, 153, 157, 158, 159, 169, 172, 173, 183, 187, 188, 199, 200, 201, 208, 209, 210, 232, 233, 241, 242, 243, 252, 253, 254, 264, 265, 270, 271, 285, 286, 296, 297, 298)"
[2] "c(5, 6, 7, 17, 18, 19, 26, 27, 28, 35, 36, 46, 47, 48, 54, 55, 61, 62, 63, 64, 73, 74, 75, 76, 77, 78, 91, 92, 93, 102, 103, 104, 114, 117, 118, 129, 136, 137, 140, 141, 151, 152, 153, 157, 158, 159, 169, 172, 173, 183, 187, 188, 199, 200, 201, 208, 209, 210, 232, 233, 241, 242, 243, 252, 253, 254, 264, 265, 270, 271, 285, 286, 296, 297, 298)"
[3] "c(3, 3, 3, 7, 7, 7, 11, 11, 11, 15, 15, 19, 19, 19, 22, 22, 26, 26, 27, 27, 31, 31, 31, 32, 32, 32, 39, 39, 39, 43, 43, 43, 47, 49, 49, 53, 57, 57, 59, 59, 63, 63, 63, 65, 65, 65, 70, 72, 72, 76, 78, 78, 83, 83, 83, 86, 86, 86, 97, 97, 100, 100, 100, 104, 104, 104, 108, 108, 111, 111, 117, 117, 121, 121, 121)"
So 5, 6, 7, 17 ... appear to be the right rows for the target I'm trying to pick, but I don't understand why it returned this in the first place, or what item [3] is at all.
If I manually execute the line generated by the above "sub1 <- ...", then it returns the proper data. If I ask the function to do it, it returns this garbage.
My question is two-fold. 1: Why is the data being returned this way? 2: Is there a better way than eval/parse to do what I'm trying to do?
I suspect some strange scope or environment level issue, but it is unclear to me at this point. I appreciate any advice anyone has.
The data are returned that way because you are coercing the dataframe to a character object. Try
as.character(head(targets))
to see a short example.
So, your method works if you eliminate the as.character(). Here it is as a MWE:
targets <- data.frame(LongName = sample(letters, 1000, replace = TRUE),
SeqNum= 1:1000,
X = rnorm(1000))
subset_ids <- c("a","f")
targets[is.element(targets$LongName, subset_ids),]
targets[targets$LongName %in% subset_ids,]
testfun <- function(targets, column_name, subset_ids){
sub1 <- paste("targets[is.element(targets$", column_name, ", subset_ids),]", sep="")
targets_subset <- eval(parse(text = sub1))
return(targets_subset)
}
testfun(targets, column_name = "LongName", subset_ids)
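On the second part of the question, one commonly used alternative to eval/parse is to index the column by name with [[ ]]. A minimal sketch, where subset_by_column is a hypothetical helper name and the toy data are invented for illustration:

```r
# Hypothetical helper: keep rows whose value in `column_name` is in `ids`
subset_by_column <- function(df, column_name, ids) {
  stopifnot(column_name %in% names(df))
  df[df[[column_name]] %in% ids, , drop = FALSE]
}

# Toy data standing in for the real `targets` data frame
targets <- data.frame(
  LongName = c("JUL_0003_rep1", "JUL_0005_rep1", "JUL_0007_rep2"),
  SeqNum = 1:3
)
subset_ids <- c("JUL_0003_rep1", "JUL_0007_rep2")

out <- subset_by_column(targets, "LongName", subset_ids)
out
```

Because the column name is passed as an ordinary string, no code has to be pasted together and parsed, and the result stays a data frame.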

Reverse leaf order in dendrogram using R

I have tried for several days to just flip a dendrogram so that the last gene is the first in the figure and the first the last. But even when I have managed to move leaves around the internal ordering is not the same. Here is my script:
cluster.hosts <- read.table("Norm_0_to1_heatmap.txt", header = TRUE, sep="", quote="/", row.names = 1)
# A table with 8 columns and 229 rows corresponding to gene expression
hosts.dist <- dist(cluster.hosts, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
hc <- hclust(hosts.dist, method = "average")
dd <- as.dendrogram(hc)
order.dendrogram(dd)
X11()
par(cex=0.5,font=3)
plot(dd, main="Dendrogram of Syn9 genes")
order.dd <- order.dendrogram(dd) #the numbers in the order indicate the position of the gene in the original table
# Then I generate a vector with the reverse of the order obtained
y <- c(206, 204, 210, 209, 213, 212, 211, 207, 208, 94, 199, 192, 195, 198, 193, 201, 203, 200, 185, 61, 191, 190, 197, 189, 188, 196, 187, 215, 214, 202, 217, 220, 219, 218, 95, 180, 179, 181, 182, 186, 178, 132, 133, 122, 66, 65, 64, 58, 91, 88, 92, 89, 62, 184, 103, 128, 127, 229, 231, 230, 148, 63, 228, 116, 134, 104, 221, 78, 20, 232, 160, 159, 225, 112, 167, 164, 166, 140, 222, 51, 149, 227, 79, 68, 90, 131, 130, 136, 135, 105, 147, 172, 150, 176, 175, 174, 177, 152, 151, 165, 137, 168, 163, 52, 146, 141, 145, 82, 81, 56, 161, 120, 144, 129, 84, 1, 173, 143, 142, 86, 85, 83, 194, 183, 111, 55, 53, 54, 224, 171, 170, 223, 169, 93, 59, 60, 123, 121, 124, 87, 125, 226, 3, 158, 47, 10, 162, 138, 139, 154, 153, 119, 118, 117, 106, 80, 45, 70, 69, 126, 205, 77, 67, 19, 102, 46, 13, 108, 107, 109, 72, 71, 73, 23, 22, 25, 57, 48, 216, 155, 29, 24, 101, 35, 113, 115, 36, 37, 114, 110, 2, 14, 6, 16, 15, 17, 18, 74, 31, 30, 76, 12, 75, 8, 11, 5, 7, 99, 98, 100, 39, 38, 33, 32, 97, 96, 49, 44, 34, 50, 156, 26, 157, 42, 41, 43, 4, 28, 27, 9, 40, 21)
rx <- reorder(dd, y, agglo.FUN=mean)
order.rx <- order.dendrogram(rx)
write(order.rx, file="order_hosts_rx.txt", sep="\t")
write(labels(rx), file="labels_order_hosts_rx.txt", sep="\t")
X11()
par(cex=0.5)
plot(rx, main="Dendrogram of Syn9 genes")
I guess it has something to do with the heights of the leaves but I just want to flip the dendrogram...
Thanks in advance!
Miguel
You can use rev(dd); rev.dendrogram simply returns the dendrogram with reversed nodes:
hc <- hclust(dist(USArrests), "ave")
dd <- as.dendrogram(hc)
plot(dd)
plot(rev(dd))
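To confirm that the leaves really come out in the opposite order, the labels of the two dendrograms can be compared. A short sketch on the same USArrests example; the name flipped is introduced here.

```r
hc <- hclust(dist(USArrests), "ave")
dd <- as.dendrogram(hc)

# Leaf labels of the reversed dendrogram, read left to right
flipped <- labels(rev(dd))

# TRUE when the plotting order is exactly the original order reversed
identical(flipped, rev(labels(dd)))
```

The same comparison can be made on order.dendrogram() if row indices are needed instead of labels.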

Simulate vectors conditional on custom distribution

I am measuring the duration of episodes per day (vector ep.dur, in minutes) over an observation period of T=364 days. The vector ep.dur has length(ep.dur)=364, with zeros on days when no episode occurred, and range(ep.dur) lies between 0 and 1440.
The sum of the episode durations over the period T is a<-sum(ep.dur)
Now I have a vector den, with length(den)=99. The vector den shows how many days are required for the development of each 1% (1%, 2%, 3%, ...) of a.
Now, given den and a, I would like to simulate multiple ep.dur vectors.
Is this possible?
Clarification 1: (first comment of danas.zuokas) The elements of den represent durations, NOT exact days. In Example 1, for instance, 1% of a (=1195.08) is developed in 1 day, 2% in 2 days, 3% in 3 days, 4% in 4 days, 5% in 5 days, 6% in 5 days, and so on. The episodes can take place anywhere in T.
Clarification 2: (second comment of danas.zuokas) Unfortunately there can be no assumptions about how the duration develops. That is why I have to simulate numerous ep.dur vectors. HOWEVER, I can expand the den vector to a finer resolution (that is, 0.1% jumps instead of 1% jumps) if this is of any help.
Description of the algorithm
The algorithm should satisfy all information the den vector provides. I have imagined the algorithm going as following (Example 3):
Each 1% jump of a is 335.46 min. den[1] tells us that 1% of a is developed in 1 day, so let's say we generate ep.dur[1]=335.46. OK. We go to den[2]: 2% of a is developed in den[2]=1 day. So ep.dur[1] cannot be 335.46 and is rejected (2% of a should still occur in one day). Let's say we had instead generated ep.dur[1]=1440. den[1] is satisfied, den[2] is satisfied (at least 2% of the total duration is developed in den[2]=1 day), and den[3]=1 is also satisfied. Keeper? However, den[4]=2 is not satisfied if ep.dur[1]=1440, because it states that 4% of a (=1341.84) should occur in 2 days. So ep.dur[1] is rejected. Now let's say that ep.dur[1]=1200. den[1:3] are accepted. Then we generate ep.dur[2], and so on, making sure that the generated ep.dur values all satisfy the information provided by den.
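One way to make this acceptance rule concrete is a function that recomputes den from a candidate ep.dur and compares it with the target. The sketch below assumes a chronological reading (den[k] is the first day on which the cumulative duration reaches k% of a), which is how the walk-through above proceeds. The helper den_from_epdur and the toy series are hypothetical.

```r
# Hypothetical helper: first day on which cumsum(ep.dur) reaches k% of the total
den_from_epdur <- function(ep.dur, percents = 1:99) {
  a <- sum(ep.dur)
  cs <- cumsum(ep.dur)
  targets <- a * percents / 100
  vapply(targets, function(t) which(cs >= t)[1], integer(1))
}

# Toy series: 100 minutes on each of the first 100 days, nothing afterwards
ep.dur <- c(rep(100, 100), rep(0, 264))
den_toy <- den_from_epdur(ep.dur)
head(den_toy)

# A candidate would be accepted when its recomputed den matches the target den
identical(den_toy, 1:99)
```

With real, non-integer durations the comparison cs >= t may need a small tolerance, since the percentage thresholds are computed in floating point.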
Is this programmatically feasible? I really do not know where to start with this problem. I will provide a generous bounty once bounty start period is over
Example 1:
a<-119508
den<-c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15,
16, 17, 18, 19, 20, 20, 21, 22, 23, 24, 25, 25, 26, 27, 28, 29,
30, 30, 31, 32, 33, 34, 35, 35, 36, 37, 38, 39, 40, 40, 41, 42,
43, 44, 45, 45, 46, 47, 48, 49, 50, 50, 51, 52, 53, 54, 55, 55,
56, 57, 58, 59, 60, 60, 61, 62, 63, 64, 65, 65, 66, 67, 68, 69,
70, 70, 71, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 80, 81, 82,
83)
Example 2:
a<-78624
den<-c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11,
11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 21, 22, 23,
28, 32, 35, 36, 37, 38, 43, 52, 55, 59, 62, 67, 76, 82, 89, 96,
101, 104, 115, 120, 126, 131, 134, 139, 143, 146, 153, 160, 165,
180, 193, 205, 212, 214, 221, 223, 227, 230, 233, 234, 235, 237,
239, 250, 253, 263, 269, 274, 279, 286, 288, 296, 298, 302, 307,
309, 315, 320, 324, 333, 337, 342, 347, 352)
Example 3:
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37,
42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81,
87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127,
132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174,
176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214,
218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257,
259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322,
342)
Example 4:
a<-198132
den<-c(2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 23, 24,
25, 27, 28, 29, 31, 32, 34, 35, 36, 38, 39, 40, 42, 43, 45, 46,
47, 49, 50, 51, 53, 54, 56, 57, 58, 60, 61, 62, 64, 65, 67, 68,
69, 71, 72, 74, 75, 76, 78, 79, 80, 82, 83, 85, 86, 87, 89, 90,
91, 93, 94, 96, 97, 98, 100, 101, 102, 104, 105, 107, 108, 109,
111, 112, 113, 115, 116, 120, 123, 130, 139, 155, 165, 172, 176,
178, 181, 185, 190, 192, 198, 218)
As far as I understand what you're after, I would start by converting den to an rle object. (Here using data from your Example 3)
EDIT: Add 100% at day 364 to den
if(max(den)!=364) den <- c(den, 364)
(rleDen <- rle(den))
# Run Length Encoding
# lengths: int [1:92] 3 1 1 1 1 1 1 1 1 1 ... # 92 intervals
# values : num [1:92] 1 2 4 6 8 9 12 15 17 21 ...
percDur <- rleDen$lengths # Percentage of total duration in each interval
atDay <- rleDen$values # What day that percentage was reached
intWidth <- diff(c(0, atDay)) # Interval width (days spanned by each interval)
durPerDay <- 1440 # Max observation time per day
percPerDay <- durPerDay/a*100 # Max percentage per day
cumPercDur <- cumsum(percDur) # Cumulative percentage in each interval
maxPerInt <- pmin(percPerDay * diff(c(0, atDay), 1),
percDur + 1) # Max percent observation per interval
set.seed(1)
nsims <- 10 # Desired number of simulations
sampMat <- matrix(0, ncol = length(percDur), nrow = nsims) # Matrix to hold sim results
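To see what the rle step yields, here is a small illustration on the first six entries of den from Example 3. The names den_head and r are introduced here.

```r
den_head <- c(1, 1, 1, 2, 4, 6)   # first six entries of den in Example 3
r <- rle(den_head)

r$lengths             # 3 1 1 1: percentage points reached on each distinct day
r$values              # 1 2 4 6: the day on which they were reached
diff(c(0, r$values))  # 1 1 2 2: number of days spanned by each interval
```

So the first three percent all fall on day 1, while the fifth percent is only reached on day 4, giving an interval two days wide.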
To allow for randomness while considering the limitation of a maximum 1440 minutes of observation per day, check to see if there are any long intervals (i.e., any intervals in which the jump in percentage cannot be completely achieved in that interval)
if(any(percDur > maxPerInt)){
longDays <- percDur > maxPerInt
morePerInt <- maxPerInt - percDur
perEnd <- c(which(diff(longDays,1) < 0), length(longDays))
# Group intervals into periods bounded by "long" days
# and determine if there are any long periods (i.e., where
# the jump in percentage can't be achieved in that period)
perInd <- rep(seq_along(perEnd), diff(c(0, perEnd)))
perSums <- tapply(percDur, perInd, sum)
maxPerPer <- tapply(maxPerInt, perInd, sum)
longPers <- perSums > maxPerPer
# If there are long periods, determine, starting with the last period, when the
# excess can be covered. Each group of periods is recorded in the persToWatch
# object
if(any(longPers)) {
maxLongPer <- perEnd[max(which(longPers))]
persToWatch <- rep(NA, length(maxLongPer))
for(kk in rev(seq_len(maxLongPer))) {
if(kk < maxLongPer && min(persToWatch, na.rm = TRUE) <= kk) next
theSums <- cumsum(morePerInt[order(seq_len(kk),
decreasing = TRUE)])
above0 <- which(rev(theSums) > 0)
persToWatch[kk] <- max(above0[which(!perInd[above0] %in% c(perInd[kk],
which(longPers)) & !above0 %in% which(longDays))])
}
}
}
Now we can start the randomness. The first component of the sampling determines the overall proportion of a that occurs in each of the intervals. How much? Let runif decide. The upper and lower limits must reflect the maximum observation time per day and the excess amount of any long days and periods
for(jj in seq_along(percDur[-1])) {
upperBound <- pmin(sampMat[, jj] + maxPerInt[jj],
cumPercDur[jj] + 1)
lowerBound <- cumPercDur[jj]
# If there are long days, determine the interval over which the
# excess observation time may be spread
if(any(percDur > maxPerInt) && any(which(longDays) >= jj)) {
curLongDay <- max(which(perInd %in% perInd[jj]))
prevLongDay <- max(0, min(which(!longDays)[which(!longDays) <= jj]))
curInt <- prevLongDay : curLongDay
# If there are also long periods, determine how much excess observation time there is
if(any(longPers) && maxLongPer >= jj) {
curLongPerHigh <- min(which(!is.na(persToWatch))[
which(!is.na(persToWatch)) >= jj])
curLongPerLow <- persToWatch[curLongPerHigh]
longInt <- curLongPerLow : curLongPerHigh
curExtra <- max(0,
cumPercDur[curLongPerHigh] -
sum(maxPerInt[longInt[longInt > jj]]) -
sampMat[, jj, drop = FALSE])
} else {
curExtra <- cumPercDur[curLongDay] -
(sum(maxPerInt[curInt[curInt > jj]]) +
sampMat[, jj, drop = FALSE])
}
# Set the lower limit for runif appropriately
lowerBound <- sampMat[, jj, drop = FALSE] + curExtra
}
# There may be tolerance errors when the observations are tightly
# packed
if(any(lowerBound - upperBound > 0)) {
if(all((lowerBound - upperBound) <= .Machine$double.eps*2*32)) {
upperBound <- pmax(lowerBound, upperBound)
} else {
stop("\nUpper and lower bounds are on the wrong side of each other\n",
jj,max(lowerBound - upperBound))
}
}
sampMat[, jj + 1] <- runif(nsims, lowerBound, upperBound)
}
Then add 100 percent to the end of the results and calculate the interval-specific percentage
sampMat2 <- cbind(sampMat[, seq_along(percDur)], 100)
sampPercDiff <- t(apply(sampMat2, 1, diff, k = 1))
The second component of the randomness determines the distribution of sampPercDiff over the interval widths intWidth. This still requires more thought in my opinion. For instance, how long does a typical episode last compared to the unit of time under consideration?
For each interval, determine if the random percentage needs to be allocated over multiple time units (in this case days). EDIT: Changed the following code to limit percentage increase when intWidth > 1.
library(foreach)
ep.dur<-foreach(ii = seq_along(intWidth),.combine=cbind)%do%{
if(intWidth[ii]==1){
ret<-sampPercDiff[, ii, drop = FALSE] * a / 100
dimnames(ret)<-list(NULL,atDay[ii])
ret
} else {
theDist<-matrix(numeric(0), ncol = intWidth[ii], nrow = nsims)
for(jj in seq_len(intWidth[ii]-1)){
theDist[, jj] <- floor(runif(nsims, 0, pmax(0,
min(sampPercDiff[, ii], floor(sampMat2[,ii + 1])-.Machine$double.eps -
sampMat2[,ii]) * a / 100 - rowSums(theDist, na.rm = TRUE))))
}
theDist[, intWidth[ii]] <- sampPercDiff[, ii] * a / 100 - rowSums(theDist,
na.rm = TRUE)
distOrder <- replicate(nsims, c(sample.int(intWidth[ii] - 1),
intWidth[ii]), simplify = FALSE)
ret <- lapply(seq_len(nrow(theDist)), function(x) {
theDist[x, order(distOrder[[x]])]
})
ans <- do.call(rbind, ret)
dimnames(ans) <- list(NULL, atDay[ii]-((intWidth[ii]:1)-1))
ans
}
}
The duration time is sampled randomly for each time unit (day) in the interval to which it is to be distributed. After breaking up the total duration into daily observed times, these are then assigned randomly to the days in the interval.
Then, multiply the sampled and distributed percentages by a and divide by 100
ep.dur[1, 1 : 6]
# 1 2 3 4 5 6
# 1095.4475 315.4887 1.0000 578.9200 13.0000 170.6224
ncol(ep.dur)
# [1] 364
apply(ep.dur, 1, function(x) length(which(x == 0)))
# [1] 131 133 132 117 127 116 139 124 124 129
rowSums(ep.dur)/a
# [1] 1 1 1 1 1 1 1 1 1 1
plot(ep.dur[1, ], type = "h", ylab = "obs time")
I would most probably do this with a Ruby script, but it can be done in R too. I am not sure whether this is a homework problem or not. To answer your question (can this be done programmatically?): yes, of course!
My solution is to define the minimum and maximum limits within which I randomly pick a percentage that satisfies the conditions given by the den vector and the value a.
Since the den vector only contains values up to 99%, we cannot be sure when 100% is reached. My solution is therefore split into 3 parts: 1) the given den vector up to 98%, 2) the 99% step, 3) beyond 99%. I could define another function holding the code common to all 3 parts, but I haven't done so.
Since I use runif to generate random numbers, it is unlikely to return exactly the lower limit. Hence I have defined a threshold value: if a generated value falls below it, I set it to 0. You can keep this or remove it. Also, in Example 4 the first 1% is reached on the 2nd day. This means the 1st day could contain up to 0.999999% of the total episode duration, with 1% then reached on the 2nd day. This is why the maximum limit is defined by subtracting a smallestdiff value, which can be changed.
FindMinutes=function(a,den){
if (a>1440*364){
print("Invalid value for a")
return("Invalid value for a")
}
threshold=1E-7
smallestdiff=1E-6
sum_perc=0.0
start=1 #day 1
min=0 #minimum percentage value for a day
max=0 #maximum percentage value for a day
days=rep(c(0),364) #day vector with percentage of minutes - initialized to 0
maxperc=1440*100/a #maximum percentage wrto 1440 minutes/day
#############################################################
#############################################################
############ For the length of den vector ###################
for (i in 1:length(den)){
if (den[i]>start){
min=(i-1)-sum_perc
for(j in start:(den[i]-1)){#number of days in-between
if (j>start){ min=0 }
if (i-smallestdiff-sum_perc>=maxperc){
max=maxperc
if ((i-smallestdiff-sum_perc)/(den[i]-j)>=maxperc){
min=maxperc
}else{
if ((i-smallestdiff-sum_perc)/(den[i]-j-1)<maxperc){
min=maxperc-(i-smallestdiff-sum_perc)/(den[i]-j-1)
}else{
min=maxperc
}
}
}else{
max=i-smallestdiff-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
sum_perc=sum_perc+days[j]
}else{
days[j]=0.0
}
}
start=den[i]
}
}
#############################################################
#############################################################
#####################For the 99% ############################
min=99-sum_perc
for(j in start:den[length(den)]){
if (j>start){
min=0
}
max=100-sum_perc
if (100-sum_perc>=maxperc){
max=maxperc
if ((100-sum_perc)/(364+1-j)>=maxperc){
min=maxperc
}else{
if ((100-sum_perc)/(364-j)<maxperc){
min=maxperc-(100-sum_perc)/(364-j)
}else{
min=maxperc
}
}
}else{
max=100-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
sum_perc=sum_perc+days[j]
}else{
days[j]=0.0
}
}
#############################################################
#############################################################
##################### For the remaining 1%###################
min=0
for(j in (den[length(den)]+1):364){ # parentheses needed: ':' binds tighter than '+' (assumes den[length(den)] < 364)
max=100-sum_perc
if (j==364){
min=max
days[j]=min
}else{
if (100-sum_perc>maxperc){
max=maxperc
if ((100-sum_perc)/(364+1-j)>=maxperc){
min=maxperc
}else{
if ((100-sum_perc)/(364-j)<maxperc){
min=maxperc-(100-sum_perc)/(364-j)
}else{
min=maxperc
}
}
}else{
max=100-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
}else{
days[j]=0.0
}
}
sum_perc=sum_perc+days[j]
if (sum_perc>=100.00){
break
}
}
return(days*a/100) #return as minutes vector corresponding to each 364 days
}#function
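A note on loop ranges like the ones in the function above: in R the `:` operator binds more tightly than `+`, so an unparenthesised lower bound shifts the whole sequence rather than just its start. A tiny illustration with a made-up value, where last_day is a hypothetical stand-in for den[length(den)]:

```r
last_day <- 360                 # illustrative stand-in for den[length(den)]

range(last_day + 1:364)         # parsed as last_day + (1:364)
range((last_day + 1):364)       # starts after last_day and stops at day 364

length(last_day + 1:364)
length((last_day + 1):364)
```

Only the parenthesised form stays within the 364-day observation period.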
In my code, I randomly generate the percentage of the total episode duration for each day according to the minimum and maximum values. The condition (den vector) holds when you round the percentage values to integers (days vector), but you might need extra tuning (checking the den vector further ahead and then re-tuning the minimum percentage) if you want it accurate to a few decimal places. You can also check that sum(FindMinutes(a,den)) is equal to a. If you want to define den in steps of 0.1%, you can do so, but you need to change the corresponding equations (for min and max).
As a worst-case example, take the maximum value a can have and the corresponding den vector:
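The suggested sum check is best done with a tolerance, since the minutes are computed in floating point. A sketch of such a validation, where check_minutes is a hypothetical helper and the input is a made-up vector rather than real FindMinutes output:

```r
# Hypothetical helper: basic sanity checks for one simulated 364-day vector
check_minutes <- function(x, a, max_per_day = 1440, tol = 1e-6) {
  c(
    length_ok = length(x) == 364,
    nonneg_ok = all(x >= 0),
    daily_ok  = all(x <= max_per_day + tol),
    sum_ok    = isTRUE(all.equal(sum(x), a, tolerance = tol))
  )
}

# Made-up example: two full days, one partial day, zeros elsewhere
x <- c(1440, 1440, 120.5, rep(0, 361))
check_minutes(x, a = 3000.5)
check_minutes(x, a = 3000)     # the sum check fails for a mismatching total
```

Returning a named logical vector makes it easy to see which condition a simulated series violates.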
a=1440*364
den<-c(0)
cc=1
for(i in 1:363){
if (trunc(i*1440*100/(1440*364))==cc){
den[cc]=i
cc=cc+1
}
}
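The loop above can also be written in closed form. This is only a sketch, under the assumption that every day contributes the full 1440 minutes, so that k% of a is first reached on day ceiling(k * 364 / 100). The names den_loop and den_closed are introduced here.

```r
# Reproduce the loop from above
den_loop <- c(0)
cc <- 1
for (i in 1:363) {
  if (trunc(i * 1440 * 100 / (1440 * 364)) == cc) {
    den_loop[cc] <- i
    cc <- cc + 1
  }
}

# Closed form: first day on which the cumulative share reaches k percent
den_closed <- ceiling((1:99) * 364 / 100)

length(den_loop)
all(den_loop == den_closed)
```

The closed form avoids growing the vector inside a loop and makes the 99-element length explicit.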
You can run the above example by calling the function:
maxexamplemin=FindMinutes(a,den)
and check that every day has the maximum of 1440 minutes, which is the only possible scenario here.
As an illustration, let me run your Example 3:
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37, 42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81, 87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127, 132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174, 176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214, 218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257, 259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322, 342)
rmin=FindMinutes(a,den)
sum(rmin)
[1] 33546
rmin2=FindMinutes(a,den)
rmin3=FindMinutes(a,den)
plot(rmin,type="h")
par(new=TRUE)
plot(rmin2,col="red",type="h")
par(new=TRUE)
plot(rmin3,col="blue",type="h")
This superimposes the three simulated series on one plot.
