I have data produced by a special function:
y(t) = C (t - t0)^alpha / (1 + (q - 1) beta (t - t0)^gamma)^(1/(q - 1))
where t0 = 1 and alpha, q, gamma, C and beta are unknown parameters.
The question is how to fit the above function to the following data in R.
mydata<-structure(list(x = 1:100, y = c(0, 0, 2, 1, 3, 4, 4, 3, 7, 8,
9, 11, 12, 11, 15, 15, 17, 21, 49, 43, 117, 75, 85, 97, 113,
129, 135, 147, 149, 149, 123, 129, 127, 122, 143, 157, 144, 139,
123, 117, 141, 138, 124, 134, 158, 151, 136, 133, 121, 117, 122,
125, 117, 111, 98, 94, 92, 89, 73, 87, 91, 88, 94, 90, 93, 76,
60, 96, 71, 80, 71, 63, 65, 47, 74, 63, 78, 68, 55, 48, 51, 45,
48, 50, 71, 48, 35, 51, 69, 62, 64, 66, 51, 59, 58, 34, 57, 56,
63, 50)), class = "data.frame", row.names = c(NA, -100L))
I defined the function as follows:
t0<<-1
fyy<-function(t,cc0,alpha0,qq0,beta0,gamma0){
ret<-cc0*((t-t0)^alpha0)/(((1+(qq0-1)*beta0*(t-t0)^gamma0))^(1/(qq0-1)))
return(ret)
}
but I don't know how to continue.
As @mhovd mentioned, I used the "nls" function, but I got the following error:
> fit <- nls(y~fyy(x,cc0 ,alpha0 ,beta0 ,gamma0 ,qq0 ),
data=data.frame(mydata), start=list(cc0 = .01,alpha0 =1,beta0 =.3,gamma0
= 2,qq0 = 1))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In the comments, @masoud references a paper about the specific function in the question. It suggests fixing gamma0 and qq0, and if we do that we do get a solution: fm, shown in red in the plot. We also show an alternate parametric curve, fm2, in blue. It also has 3 optimized parameters but a lower residual sum of squares (lower is better).
fyy <- function(t,cc0,alpha0,qq0,beta0,gamma0){
cc0 * ((t-t0)^alpha0) / (((1+(qq0-1)*beta0*(t-t0)^gamma0))^(1/(qq0-1)))
}
mydata0 <- subset(mydata, y > 0)
# fixed values
t0 <- 1
gamma0 <- 3
qq0 <- 1.2
st <- list(cc0 = 1, alpha0 = 1, beta0 = 1) # starting values
fm <- nls(y ~ fyy(x, cc0, alpha0, qq0, beta0, gamma0), mydata0,
lower = list(cc0 = 0.1, alpha0 = 0.1, beta0 = 0.00001),
start = st, algorithm = "port")
deviance(fm) # residual sum of squares
## [1] 61458.5
st2 <- list(a = 1, b = 1, c = 1)
fm2 <- nls(y ~ exp(a + b/x + c*log(x)), mydata0, start = st2)
deviance(fm2) # residual sum of squares
## [1] 16669.24
plot(mydata0, ylab = "y", xlab = "t")
lines(fitted(fm) ~ x, mydata0, col = "red")
lines(fitted(fm2) ~ x, mydata0, col = "blue")
legend("topright", legend = c("fm", "fm2"), lty = 1, col = c("red", "blue"))
I have generated a list in R. I would like to save it to an external file and then read that file in Python.
I used:
write.table(list, file="~/test.txt")
But I get this error message :
Error in write.table(list, file = "~/test.txt") :
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 9312, 628, 317, 27, 244, 894, 540, 280, 739, 866, 239, 2432, 969, 2033, 75, 509, 539, 321, 6637, 24, 116, 2006, 18, 2695, 16, 47, 32, 34, 28, 102, 44, 462, 84, 30, 73, 77, 29, 400, 60, 80, 31, 101, 680, 100, 58, 126, 112, 122, 155, 123, 167, 138, 149, 202, 246, 296, 240, 68, 350, 583, 46, 701, 467, 636, 654, 56, 418, 230, 64, 90, 74, 72, 67, 61, 55, 41, 40, 35, 25, 23, 22, 20, 19, 17
Edit:
This is the structure of my list (I only show a part):
$`5`
[1] "OTU1159" "OTU1158" "UniRef90_A0A1B2YRW1"
[4] "UniRef90_A0A1B2Z315" "UniRef90_A0A1B2Z316" "UniRef90_A0A1Z9FR83"
[7] "UniRef90_A0A1Z9FRN0" "UniRef90_A0A1Z9FRT6" "UniRef90_A0A1Z9FSZ6"
[10] "UniRef90_A0A1Z9FTY0" "UniRef90_A0A1Z9FU92" "UniRef90_A0A1Z9FVK9"
[13] "UniRef90_A0A1Z9FVU0" "UniRef90_A0A1Z9FWD5" "UniRef90_A0A1Z9FYC5"
$`6`
[1] "OTU4451" "OTU4536" "OTU4458"
[4] "OTU4430" "OTU4435" "OTU2156"
[7] "UniRef90_A0A081FUN4" "UniRef90_A0A081FUN8" "UniRef90_A0A0F5AQ41"
[10] "UniRef90_A0A0R2PEV0" "UniRef90_A0A0R2U5F4" "UniRef90_A0A0R2UTD5"
[13] "UniRef90_A0A0R2UUB4" "UniRef90_A0A0R2UW29" "UniRef90_A0A0R2UWJ2"
[16] "UniRef90_A0A0R2UXE1" "UniRef90_A0A0R2UXE3" "UniRef90_A0A0S8BGF5"
Any help?
Try opening a connection in append text mode (open = "at") and then use lapply with writeLines to write one vector at a time.
fl <- file("~/test.txt", open = "at")
lapply(thelist, writeLines, con = fl)
close(fl)
The code above will write the vectors vertically, one element per line, one vector after the other. To write each vector on its own line of text instead, use something along the lines of this:
fl <- file("~/test.txt", open = "at")
lapply(thelist, function(x){
x <- paste(x, collapse = " ")
writeLines(x, con = fl, sep = "\n")
})
close(fl)
The collapse string, which acts as the field separator, can be any other character, such as a tab "\t" or a comma.
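On the Python side, a file written one vector per line can be read back with a few lines of standard-library code. This is a minimal sketch, assuming a single space was used as the collapse string; the helper name read_vectors and the temporary demo file are made up for illustration. Note that the list names (such as `5` and `6`) are not stored in this format.

```python
import os
import tempfile


def read_vectors(path, sep=" "):
    """Return one list of strings per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n").split(sep) for line in handle if line.strip()]


# Self-contained demonstration: write a tiny file, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("OTU1159 OTU1158 UniRef90_A0A1B2YRW1\n")
    handle.write("OTU4451 OTU4536\n")

vectors = read_vectors(demo_path)
print(vectors)
```

For a file written to "~/test.txt" from R, the path would need os.path.expanduser("~/test.txt") in Python, since open does not expand the tilde itself.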
This is pretty much what JSON is for. Do the following in R, where l is your list:
library(jsonlite)
write_json(l, "test.json")
And then do this in Python:
import json
with open("test.json") as f:
    l = json.load(f)
I have to make a 4*4 Kohonen map for a project.
However, I get the error
Error in win_index[1, ] : subscript out of bounds
In addition: There were 16 warnings (use warnings() to see them)
After testing my code, I guess it's the lapply call (line 77) that is not
executed correctly, because the matrix it creates contains only NA.
Since this matrix is used later on, the result is not correct because
NA is present throughout the program.
#############################################
##Function for distance calculation (RMSDA)##
#############################################
RMSDA<-function(data_phipsi,Kohonen_matrix)
{
difference<-data_phipsi-Kohonen_matrix
for(j in 1:length(difference)){
if (difference[j]< -180) {difference[j]=difference[j]+360}
if (difference[j]> +180) {difference[j]=difference[j]-360}
}
distance=mean(sqrt(difference^2))
return(distance)
}
##############################
## Program ###
##############################
for(step in 1:iteration)
{
data_phipsi<-data_phipsi[sample(nrow(data_phipsi)),] # Sample vectors of training (samples of lines of the dataframe)
print(step) #Visualize where we are in loops
for(k_row in 1:nrow(data_phipsi))
{
#Update learn_rate and radius at each row of each iteration
learn_rate<-learning(initial_rate,((step-1)*nrow(data_phipsi))+k_row,data_phipsi)
learn_radius<-learning(initial_rate,((step-1)*nrow(data_phipsi))+k_row,data_phipsi)
#Find distance between each vectors of angles of Kohonen Map and the training vector
phipsi_RMSDA<-lapply(random_list, RMSDA, data_phipsi=data_phipsi[k_row,])
} # end of the loop over rows
} # end of the loop over iterations
How is it possible to fix this lapply error?
Thank you.
Edit: These are the only elements needed for the lapply call.
For the random list we can use:
random_list <- list(
c(88, 148, 60, 83, -119, -59, -96, 169),
c(104, 101, 174, -48, 18, 10, -159, 158),
c(164, -80, 137, -170, -172, 52, -149, 96),
c(88, 18, -115, 48, -3, -158, -92, -154),
c(170, -107, -109, -14, -142, -77, -120, 76),
c(-121, 15, -46, -145, -128, 74, -166, 44),
c(46, -178, 67, -88, -125, -130, 88, -11),
c(131, 147, -32, 103, -16, 116, 78, -125),
c(75, -95, -137, 133, -97, -134, 126, -105),
c(115, 173, -82, -135, 134, 82, -143, -43),
c(111, 13, -54, -53, 103, 132, -13, -43),
c(-143, 89, -91, -137, -63, 14, -166, 83),
c(-98, 178, 14, -80, -122, -25, 19, 117),
c(-113, -97, 34, -178, -56, 18, -167, 84),
c(49, 82, 50, 168, -157, -154, 51, 78),
c(173, -4, 164, 125, 31, 115, -74, -92)
)
and an example for data_phipsi[1,]:
data_phipsi <- read.table(header = TRUE, text = "
phi1 psi2 phy2 psi3 phy3 psi4 phy4 psi5
-24.5 81.9 -155.2 -81.4 127.7 -118 166 -82.1")
data_phipsi
# phi1 psi2 phy2 psi3 phy3 psi4 phy4 psi5
# 1 -24.5 81.9 -155.2 -81.4 127.7 -118 166 -82.1
The main issue was that a follow-on command (see the question's edit history for the win_index line of code, if interested) was failing because the results from lapply were suspect.
Troubleshooting found the problem:
when the error hits, the values of both random_list and data_phipsi[k_row,] are reasonable and not suspect;
however, the output from lapply is all NA.
Diving into the function RMSDA, difference looked fine up until
distance=mean(sqrt(difference^2))
at which point it turned all-NA. The problem is that the data was still in the form of a data frame, so mean returned NA with a warning (just try mean(mtcars) to see). (Note also that sqrt(difference^2) is equivalent to abs(difference).)
Replacing that line of RMSDA with
distance=mean(abs(unlist(difference)))
seems to have fixed the problem (the critical part being unlist).
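For comparison, the same distance logic can be sketched in Python with plain lists, where the data frame issue does not arise. This is only an illustration of what RMSDA computes as written (wrap each angular difference into [-180, 180], then take the mean absolute value); the function names wrap_angle and rmsda are chosen here and are not part of the original code.

```python
def wrap_angle(diff):
    """Map an angular difference in degrees into the range [-180, 180]."""
    if diff < -180:
        return diff + 360
    if diff > 180:
        return diff - 360
    return diff


def rmsda(angles_a, angles_b):
    """Mean absolute wrapped difference between two equal-length angle vectors."""
    diffs = [wrap_angle(a - b) for a, b in zip(angles_a, angles_b)]
    return sum(abs(d) for d in diffs) / len(diffs)


# The training row and the first vector of random_list from the question
training_row = [-24.5, 81.9, -155.2, -81.4, 127.7, -118, 166, -82.1]
map_vector = [88, 148, 60, 83, -119, -59, -96, 169]

distance = rmsda(training_row, map_vector)
print(distance)  # about 108.375
```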
Having trouble fitting an appropriate curve to this data.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 21, 31, 41, 51, 61, 71,
81, 91, 110, 210, 310, 410, 510, 610, 710, 810, 910, 1100, 2100,
3100, 4100, 5100, 6100, 7100, 8100, 9100)
y <- c(75, 84, 85, 89, 88, 91, 92, 92, 93, 92, 94, 95, 95, 96, 95,
95, 94, 97, 97, 97, 98, 98, 98, 99, 99, 99, 99, 99, 99, 99, 99,
99, 99, 99, 99, 99, 99)
Tried so far:
fit1 <- lm(y~log(x)+I(1/x))
fit2 <- lm(y~log(x)+I(1/x)+x)
plot(x,y, log="x")
lines(0.01:10000, predict(fit1, newdata = data.frame(x=0.01:10000)))
lines(0.01:10000, predict(fit2, newdata = data.frame(x=0.01:10000)), col='red')
The fits are OK, but they were arrived at entirely empirically and there is room for improvement. I did not find loess or splines to be any better.
The concrete goal is to increase the R^2 of the fit and improve regression diagnostics (e.g. Q-Q plots of residuals).
Edit: Expected Model: this is sampling data, where more samples (x) improve the accuracy of the estimate (y); it would saturate at 100%.
This would be my guess at a function, and the corresponding fit in Python:
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as so
def f( x, a, b , s, p ):
    return a + b * s * ( x - 1 ) / ( 1 + ( s * ( x - 1 ) )**( abs( 1 / p ) ) )**abs( p )
def g( x, a , s, p ):
    return a * s * x / ( 1 + ( s * x )**( abs( 1 / p ) ) )**abs( p )
def h( x, s, p ):
    return 100 * s * x / ( 1 + ( s * x )**( abs( 1 / p ) ) )**abs( p )
xData = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 21, 31, 41, 51, 61, 71,
81, 91, 110, 210, 310, 410, 510, 610, 710, 810, 910, 1100, 2100,
3100, 4100, 5100, 6100, 7100, 8100, 9100 ]
yData = [ 75, 84, 85, 89, 88, 91, 92, 92, 93, 92, 94, 95, 95, 96, 95,
95, 94, 97, 97, 97, 98, 98, 98, 99, 99, 99, 99, 99, 99, 99, 99,
99, 99, 99, 99, 99, 99 ]
xList = np.logspace( 0, 5, 100 )
bestFitF, err = so.curve_fit( f , xData, yData, p0=[ 75, 25, 1, 1])
bestFitG, err = so.curve_fit( g , xData, yData)
bestFitH, err = so.curve_fit( h , xData, yData)
# np.float was removed in NumPy 1.24; the builtin float is equivalent here
fList = np.fromiter( ( f(x, *bestFitF ) for x in xList ), float)
gList = np.fromiter( ( g(x, *bestFitG ) for x in xList ), float)
hList = np.fromiter( ( h(x, *bestFitH ) for x in xList ), float)
fig = plt.figure()
ax = fig.add_subplot( 1, 1, 1 )
ax.plot( xData, yData, marker='o', linestyle='')
ax.plot( xList, fList, linestyle='-.', label='f')
ax.plot( xList, gList, linestyle='-.', label='g')
ax.plot( xList, hList, linestyle='-.', label='h')
ax.set_xscale( 'log' )
ax.legend( loc=0 )
plt.show()
Function f requires start values; g and h don't. It should be possible to write some code to guess the parameters: the first one is basically yData[0], the second is yData[-1] - yData[0], and the others don't matter much and are just set to 1. Here I did it manually.
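The guessing rule described above can be sketched as a small helper. This is only an illustration, assuming the data are sorted by x so that the first and last y values approximate the offset and the saturation level; the name guess_start_values is made up here.

```python
def guess_start_values(y_data):
    """Heuristic p0 for f(x, a, b, s, p): offset, span, then 1 for s and p."""
    offset = y_data[0]               # value at the smallest x
    span = y_data[-1] - y_data[0]    # rise from first to last observation
    return [offset, span, 1.0, 1.0]


# Shortened version of yData, keeping the first and last values
y_data = [75, 84, 85, 89, 99]
p0 = guess_start_values(y_data)
print(p0)
```

The result could then be passed as p0 to so.curve_fit. For this data it gives a span of 24 rather than the hand-picked 25, which should make little difference to the optimizer.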
Both g and h pass through ( 0, 0 ).
Additionally, h saturates at 100.
Note: Sure, the more parameters the better the fit, but if the data are, e.g., a CDF, you probably want a fixed saturation value and maybe the pass through ( 0, 0 ) as well.
A fit of the Gunary equation might be acceptable, with an R-squared value of 0.976:
y = x / (a + bx + cx^0.5)
Fitting target of lowest sum of squared absolute error = 2.4509677507601545E+01
a = 1.2327255760994933E-03
b = 1.0083740273268828E-02
c = 1.9179200839782879E-03
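To see what these coefficients imply, the fitted curve can be evaluated directly. The sketch below only plugs the reported values of a, b and c into the equation above; the function name gunary is chosen for illustration. For large x the model approaches 1/b, so with these values it levels off at about 99.17 rather than exactly 100.

```python
import math

# Coefficients as reported above
A = 1.2327255760994933e-03
B = 1.0083740273268828e-02
C = 1.9179200839782879e-03


def gunary(x, a=A, b=B, c=C):
    """Gunary equation: y = x / (a + b*x + c*sqrt(x))."""
    return x / (a + b * x + c * math.sqrt(x))


first = gunary(1)       # prediction at the smallest x in the data
last = gunary(9100)     # prediction at the largest x in the data
plateau = 1 / B         # limit of the model as x grows without bound
print(first, last, plateau)
```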
R package drc has many options.
Here is a 5-parameter log-logistic model, which yields residuals lower than the fits in the question.
BONUS: It has a self-starter function, so you avoid the challenge of finding initial values for non-linear regression.
library(drc)
dosefit <- drm(y ~ x, fct = LL2.5())
I am measuring the duration of episodes (vector ep.dur, in minutes) per day, over an observation period of T=364 days. The vector ep.dur has a length(ep.dur) of T=364, with zeros on days when no episode occurred, and range(ep.dur) is between 0 and 1440.
The sum of the episode durations over the period T is a<-sum(ep.dur)
Now I have a vector den, with length(den)=99. The vector den shows how many days are required for the development of each 1% (1%, 2%, 3%, ...) of a.
Now, given den and a, I would like to simulate multiple ep.dur vectors.
Is this possible?
Clarification 1 (first comment of danas.zuokas): The elements of den represent durations, NOT exact days. In Example 1, for instance, 1% of a (= 1195.08 min) is developed in 1 day, 2% in 2 days, 3% in 3 days, 4% in 4 days, 5% in 5 days, 6% in 5 days, and so on. The episodes can take place anywhere in T.
Clarification 2 (second comment of danas.zuokas): Unfortunately, no assumptions can be made about how the duration develops. That is why I have to simulate numerous ep.dur vectors. HOWEVER, I can expand the den vector to a finer resolution (that is, 0.1% jumps instead of 1% jumps) if this is of any help.
Description of the algorithm
The algorithm should satisfy all the information the den vector provides. I imagine the algorithm going as follows (Example 3):
Each 1% jump of a is 335.46 min. den[1] tells us that 1% of a is developed in 1 day, so let's say we generate ep.dur[1]=335.46. We go to den[2]: 2% of a is developed in den[2]=1 day. So ep.dur[1] cannot be 335.46 and is rejected (2% of a must also occur within one day). Let's say we had generated ep.dur[1]=1440 instead. den[1] is satisfied, den[2] is satisfied (at least 2% of the total duration is developed in den[2]=1 day), and den[3]=1 is also satisfied. A keeper? No: den[4]=2 is not satisfied if ep.dur[1]=1440, because it states that 4% of a (=1341.84) should only be reached after 2 days. So ep.dur[1]=1440 is rejected. Now let's say that ep.dur[1]=1200. den[1:3] are accepted. Then we generate ep.dur[2] and so on, making sure that the generated ep.dur values all satisfy the information provided by den.
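One way to make the acceptance test concrete is to recompute den from a candidate ep.dur and compare it with the given den. The sketch below is a minimal illustration, under the assumption that "k% is developed in d days" means the cumulative duration first reaches k% of the total on day d; the function name percent_days and the toy vector are made up.

```python
def percent_days(ep_dur, percents=range(1, 100)):
    """For each k in percents (increasing, below 100), return the first day
    (1-based) on which the cumulative duration reaches k% of the total."""
    total = sum(ep_dur)
    days = []
    cumulative = 0.0
    day = 0
    for k in percents:
        # Advance day by day until the running total covers k% of the total
        while cumulative * 100 < k * total:
            cumulative += ep_dur[day]
            day += 1
        days.append(day)
    return days


# Toy example: 4 days, total of 100 minutes
candidate = [50, 0, 30, 20]
den_from_candidate = percent_days(candidate)
print(den_from_candidate[:3], den_from_candidate[-3:])
```

A simulated ep.dur would then be accepted only if percent_days(ep_dur) equals the target den.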
Is this programmatically feasible? I really do not know where to start with this problem. I will provide a generous bounty once the bounty start period is over.
Example 1:
a<-119508
den<-c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15,
16, 17, 18, 19, 20, 20, 21, 22, 23, 24, 25, 25, 26, 27, 28, 29,
30, 30, 31, 32, 33, 34, 35, 35, 36, 37, 38, 39, 40, 40, 41, 42,
43, 44, 45, 45, 46, 47, 48, 49, 50, 50, 51, 52, 53, 54, 55, 55,
56, 57, 58, 59, 60, 60, 61, 62, 63, 64, 65, 65, 66, 67, 68, 69,
70, 70, 71, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 80, 81, 82,
83)
Example 2:
a<-78624
den<-c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11,
11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 21, 22, 23,
28, 32, 35, 36, 37, 38, 43, 52, 55, 59, 62, 67, 76, 82, 89, 96,
101, 104, 115, 120, 126, 131, 134, 139, 143, 146, 153, 160, 165,
180, 193, 205, 212, 214, 221, 223, 227, 230, 233, 234, 235, 237,
239, 250, 253, 263, 269, 274, 279, 286, 288, 296, 298, 302, 307,
309, 315, 320, 324, 333, 337, 342, 347, 352)
Example 3
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37,
42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81,
87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127,
132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174,
176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214,
218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257,
259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322,
342)
Example 4
a<-198132
den<-c(2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 23, 24,
25, 27, 28, 29, 31, 32, 34, 35, 36, 38, 39, 40, 42, 43, 45, 46,
47, 49, 50, 51, 53, 54, 56, 57, 58, 60, 61, 62, 64, 65, 67, 68,
69, 71, 72, 74, 75, 76, 78, 79, 80, 82, 83, 85, 86, 87, 89, 90,
91, 93, 94, 96, 97, 98, 100, 101, 102, 104, 105, 107, 108, 109,
111, 112, 113, 115, 116, 120, 123, 130, 139, 155, 165, 172, 176,
178, 181, 185, 190, 192, 198, 218)
As far as I understand what you're after, I would start by converting den to an rle object. (Here using data from your Example 3)
EDIT: Add 100% at day 364 to den
if(max(den)!=364) den <- c(den, 364)
(rleDen <- rle(den))
# Run Length Encoding
# lengths: int [1:92] 3 1 1 1 1 1 1 1 1 1 ... # 92 intervals
# values : num [1:92] 1 2 4 6 8 9 12 15 17 21 ...
percDur <- rleDen$lengths # Percentage of total duration in each interval
atDay <- rleDen$values # What day that percentage was reached
intWidth <- diff(c(0, atDay), k = 1) # Interval width
durPerDay <- 1440 # Max observation time per day
percPerDay <- durPerDay/a*100 # Max percentage per day
cumPercDur <- cumsum(percDur) # Cumulative percentage in each interval
maxPerInt <- pmin(percPerDay * diff(c(0, atDay), 1),
percDur + 1) # Max percent observation per interval
set.seed(1)
nsims <- 10 # Desired number of simulations
sampMat <- matrix(0, ncol = length(percDur), nrow = nsims) # Matrix to hold sim results
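For readers more at home in Python, the run-length step above can be sketched with itertools.groupby. This is only an illustration of what rle and diff(c(0, atDay)) produce, applied to the first ten elements of the Example 3 den; the helper name run_length_encode is made up.

```python
from itertools import groupby


def run_length_encode(values):
    """Return (lengths, values) of consecutive runs, like R's rle()."""
    lengths, uniques = [], []
    for value, run in groupby(values):
        uniques.append(value)
        lengths.append(sum(1 for _ in run))
    return lengths, uniques


den_head = [1, 1, 1, 2, 4, 6, 8, 9, 12, 15]   # start of den, Example 3
perc_dur, at_day = run_length_encode(den_head)

# Interval width in days, i.e. diff(c(0, atDay)) in the R code
int_width = [cur - prev for prev, cur in zip([0] + at_day[:-1], at_day)]
print(perc_dur, at_day, int_width)
```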
To allow for randomness while considering the limitation of a maximum 1440 minutes of observation per day, check to see if there are any long intervals (i.e., any intervals in which the jump in percentage cannot be completely achieved in that interval)
if(any(percDur > maxPerInt)){
longDays <- percDur > maxPerInt
morePerInt <- maxPerInt - percDur
perEnd <- c(which(diff(longDays,1) < 0), length(longDays))
# Group intervals into periods bounded by "long" days
# and determine if there are any long periods (i.e., where
# the jump in percentage can't be achieved in that period)
perInd <- rep(seq_along(perEnd), diff(c(0, perEnd)))
perSums <- tapply(percDur, perInd, sum)
maxPerPer <- tapply(maxPerInt, perInd, sum)
longPers <- perSums > maxPerPer
# If there are long periods, determine, starting with the last period, when the
# excess can be covered. Each group of periods is recorded in the persToWatch
# object
if(any(longPers)) {
maxLongPer <- perEnd[max(which(longPers))]
persToWatch <- rep(NA, length(maxLongPer))
for(kk in rev(seq_len(maxLongPer))) {
if(kk < maxLongPer && min(persToWatch, na.rm = TRUE) <= kk) next
theSums <- cumsum(morePerInt[order(seq_len(kk),
decreasing = TRUE)])
above0 <- which(rev(theSums) > 0)
persToWatch[kk] <- max(above0[which(!perInd[above0] %in% c(perInd[kk],
which(longPers)) & !above0 %in% which(longDays))])
}
}
}
Now we can start the randomness. The first component of the sampling determines the overall proportion of a that occurs in each of the intervals. How much? Let runif decide. The upper and lower limits must reflect the maximum observation time per day and the excess amount of any long days and periods
for(jj in seq_along(percDur[-1])) {
upperBound <- pmin(sampMat[, jj] + maxPerInt[jj],
cumPercDur[jj] + 1)
lowerBound <- cumPercDur[jj]
# If there are long days, determine the interval over which the
# excess observation time may be spread
if(any(percDur > maxPerInt) && any(which(longDays) >= jj)) {
curLongDay <- max(which(perInd %in% perInd[jj]))
prevLongDay <- max(0, min(which(!longDays)[which(!longDays) <= jj]))
curInt <- prevLongDay : curLongDay
# If there are also long periods, determine how much excess observation time there is
if(any(longPers) && maxLongPer >= jj) {
curLongPerHigh <- min(which(!is.na(persToWatch))[
which(!is.na(persToWatch)) >= jj])
curLongPerLow <- persToWatch[curLongPerHigh]
longInt <- curLongPerLow : curLongPerHigh
curExtra <- max(0,
cumPercDur[curLongPerHigh] -
sum(maxPerInt[longInt[longInt > jj]]) -
sampMat[, jj, drop = FALSE])
} else {
curExtra <- cumPercDur[curLongDay] -
(sum(maxPerInt[curInt[curInt > jj]]) +
sampMat[, jj, drop = FALSE])
}
# Set the lower limit for runif appropriately
lowerBound <- sampMat[, jj, drop = FALSE] + curExtra
}
# There may be tolerance errors when the observations are tightly
# packed
if(any(lowerBound - upperBound > 0)) {
if(all((lowerBound - upperBound) <= .Machine$double.eps*2*32)) {
upperBound <- pmax(lowerBound, upperBound)
} else {
stop("\nUpper and lower bounds are on the wrong side of each other\n",
jj,max(lowerBound - upperBound))
}
}
sampMat[, jj + 1] <- runif(nsims, lowerBound, upperBound)
}
Then add 100 percent to the end of the results and calculate the interval-specific percentage
sampMat2 <- cbind(sampMat[, seq_along(percDur)], 100)
sampPercDiff <- t(apply(sampMat2, 1, diff, k = 1))
The second component of the randomness determines the distribution of sampPercDiff over the interval widths intWidth. This still requires more thought in my opinion. For instance, how long does a typical episode last compared to the unit of time under consideration?
For each interval, determine if the random percentage needs to be allocated over multiple time units (in this case days). EDIT: Changed the following code to limit percentage increase when intWidth > 1.
library(foreach)
ep.dur<-foreach(ii = seq_along(intWidth),.combine=cbind)%do%{
if(intWidth[ii]==1){
ret<-sampPercDiff[, ii, drop = FALSE] * a / 100
dimnames(ret)<-list(NULL,atDay[ii])
ret
} else {
theDist<-matrix(numeric(0), ncol = intWidth[ii], nrow = nsims)
for(jj in seq_len(intWidth[ii]-1)){
theDist[, jj] <- floor(runif(nsims, 0, pmax(0,
min(sampPercDiff[, ii], floor(sampMat2[,ii + 1])-.Machine$double.eps -
sampMat2[,ii]) * a / 100 - rowSums(theDist, na.rm = TRUE))))
}
theDist[, intWidth[ii]] <- sampPercDiff[, ii] * a / 100 - rowSums(theDist,
na.rm = TRUE)
distOrder <- replicate(nsims, c(sample.int(intWidth[ii] - 1),
intWidth[ii]), simplify = FALSE)
ret <- lapply(seq_len(nrow(theDist)), function(x) {
theDist[x, order(distOrder[[x]])]
})
ans <- do.call(rbind, ret)
dimnames(ans) <- list(NULL, atDay[ii]-((intWidth[ii]:1)-1))
ans
}
}
The duration time is sampled randomly for each time unit (day) in the interval to which it is to be distributed. After breaking up the total duration into daily observed times, these are then assigned randomly to the days in the interval.
Then, multiply the sampled and distributed percentages by a and divide by 100
ep.dur[1, 1 : 6]
# 1 2 3 4 5 6
# 1095.4475 315.4887 1.0000 578.9200 13.0000 170.6224
ncol(ep.dur)
# [1] 364
apply(ep.dur, 1, function(x) length(which(x == 0)))
# [1] 131 133 132 117 127 116 139 124 124 129
rowSums(ep.dur)/a
# [1] 1 1 1 1 1 1 1 1 1 1
plot(ep.dur[1, ], type = "h", ylab = "obs time")
I would most probably do this with a Ruby script, but it can be done in R too. I am not sure whether this is a homework problem or not. To answer your question: can this be done programmatically? Yes, of course!
For your problem, my solution is to define the minimum and maximum limits within which I randomly pick a percentage that satisfies the conditions given by the den vector and the value a.
Since the den vector only contains values up to 99%, we cannot be sure when 100% is going to happen. This splits my solution into 3 parts: 1) the given den vector up to 98%, 2) the 99% value, 3) beyond 99%. I could define another function holding the code common to all 3 parts, but I haven't done so.
Since I use runif to generate random numbers given the lower limit, it is unlikely to generate the exact lower limit value. Hence, I have defined a threshold value to check against; if the draw falls below it, I set it to 0. You can keep this or remove it. Also, in Example 4 the first 1% happens on the 2nd day. This means the 1st day can contain at most 0.999999% of the episode, and then the 1% is reached on the 2nd day. This is why the maximum limit is defined by subtracting a smallestdiff value, which can be changed.
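The idea of drawing each day's amount between a lower and an upper limit can be shown in isolation. The sketch below is not the function that follows; it is a simplified Python illustration, under the assumption that a fixed total has to be spread over a number of days with at most 1440 minutes per day. The name split_with_cap and its parameters are hypothetical.

```python
import random


def split_with_cap(total, n_days, cap=1440.0, rng=random):
    """Randomly split total over n_days so that no day exceeds cap."""
    if n_days < 1 or total < 0 or total > cap * n_days:
        raise ValueError("total does not fit into n_days at the given cap")
    amounts = []
    remaining = float(total)
    for day in range(n_days - 1):
        days_after = n_days - day - 1
        # Lower limit: what must be placed today so the rest still fits later
        low = max(0.0, remaining - cap * days_after)
        # Upper limit: the daily cap, or whatever is left if that is smaller
        high = min(cap, remaining)
        value = rng.uniform(low, high)
        amounts.append(value)
        remaining -= value
    amounts.append(remaining)  # the last day takes what is left
    return amounts


parts = split_with_cap(3000, 3, rng=random.Random(1))
print(parts, sum(parts))
```

When the total equals cap * n_days, the lower and upper limits coincide and every day is forced to the cap, which mirrors the worst-case scenario discussed further down.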
FindMinutes=function(a,den){
if (a>1440*364){
print("Invalid value for a")
return("Invalid value for a")
}
threshold=1E-7
smallestdiff=1E-6
sum_perc=0.0
start=1 #day 1
min=0 #minimum percentage value for a day
max=0 #maximum percentage value for a day
days=rep(c(0),364) #day vector with percentage of minutes - initialized to 0
maxperc=1440*100/a #maximum percentage wrto 1440 minutes/day
#############################################################
#############################################################
############ For the length of den vector ###################
for (i in 1:length(den)){
if (den[i]>start){
min=(i-1)-sum_perc
for(j in start:(den[i]-1)){#number of days in-between
if (j>start){ min=0 }
if (i-smallestdiff-sum_perc>=maxperc){
max=maxperc
if ((i-smallestdiff-sum_perc)/(den[i]-j)>=maxperc){
min=maxperc
}else{
if ((i-smallestdiff-sum_perc)/(den[i]-j-1)<maxperc){
min=maxperc-(i-smallestdiff-sum_perc)/(den[i]-j-1)
}else{
min=maxperc
}
}
}else{
max=i-smallestdiff-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
sum_perc=sum_perc+days[j]
}else{
days[j]=0.0
}
}
start=den[i]
}
}
#############################################################
#############################################################
#####################For the 99% ############################
min=99-sum_perc
for(j in start:den[length(den)]){
if (j>start){
min=0
}
max=100-sum_perc
if (100-sum_perc>=maxperc){
max=maxperc
if ((100-sum_perc)/(364+1-j)>=maxperc){
min=maxperc
}else{
if ((100-sum_perc)/(364-j)<maxperc){
min=maxperc-(100-sum_perc)/(364-j)
}else{
min=maxperc
}
}
}else{
max=100-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
sum_perc=sum_perc+days[j]
}else{
days[j]=0.0
}
}
#############################################################
#############################################################
##################### For the remaining 1%###################
min=0
for(j in (den[length(den)]+1):364){ # parentheses needed: ':' binds tighter than '+'
max=100-sum_perc
if (j==364){
min=max
days[j]=min
}else{
if (100-sum_perc>maxperc){
max=maxperc
if ((100-sum_perc)/(364+1-j)>=maxperc){
min=maxperc
}else{
if ((100-sum_perc)/(364-j)<maxperc){
min=maxperc-(100-sum_perc)/(364-j)
}else{
min=maxperc
}
}
}else{
max=100-sum_perc
}
if ((r=runif(1,min,max))>=threshold){
days[j]=r
}else{
days[j]=0.0
}
}
sum_perc=sum_perc+days[j]
if (sum_perc>=100.00){
break
}
}
return(days*a/100) #return as minutes vector corresponding to each 364 days
}#function
In my code, I randomly generate the percentage of the episodes for each day according to the minimum and maximum values. The condition (the den vector) holds when you round the percentage values to integers (days vector), but you might need extra tuning (checking the den vector further ahead and then re-tuning the minimum percentage) if you want it accurate up to a few decimal places. You can also check that sum(FindMinutes(a,den)) is equal to a. If you want to define den in steps of 0.1%, you can do so, but you need to change the corresponding equations (in min and max).
As a worst-case example, make a the maximum value it can take and build the corresponding den vector:
a=1440*364
den<-c(0)
cc=1
for(i in 1:363){
if (trunc(i*1440*100/(1440*364))==cc){
den[cc]=i
cc=cc+1
}
}
You can run the above example by calling the function: maxexamplemin=FindMinutes(a,den)
and you can check to see that all the days have the maximum minutes of 1440 which is the only possible scenario here.
As an illustration, let me run your example 3:
a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37, 42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81, 87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127, 132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174, 176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214, 218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257, 259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322, 342)
rmin=FindMinutes(a,den)
sum(rmin)
[1] 33546
rmin2=FindMinutes(a,den)
rmin3=FindMinutes(a,den)
plot(rmin,type="h")
par(new=TRUE)
plot(rmin2,col="red",type="h")
par(new=TRUE)
plot(rmin3,col="red",type="h")
The 3 superimposed plots are shown below: