Optimize a function using gradient descent - r

Growing degree days is a concept in plant phenology where a given crop needs to accumulate a certain amount of thermal units each day in order to move from one stage to the next.
I have thermal units data available at daily resolution for a given site for 10 years as follows:
set.seed(1)
avg_temp <- data.frame(year_ref = rep(2001:2010, each = 365),
                       doy = rep(1:365, times = 10),
                       thermal.units = sample(0:40, 3650, replace = TRUE))
I also have a crop grown at this site that should take 110 days to mature if planted on day 152:
planting_date <- 152
observed_days_to_mature <- 110
I also have an initial rough guess of how many thermal units this crop might accumulate in each stage, from planting through to full maturity. For example, in the data below, stage 1 needs to accumulate 50 thermal units since planting, stage 2 needs 120 thermal units since planting, stage 3 needs 190 thermal units since planting, and so on.
gdd_data <- data.frame(stage_id = 1:4,
                       gdd_required = c(50, 120, 190, 250))
So given the gdd requirement, I can calculate, for each year, how many days this crop takes to mature.
library(dplyr)
library(data.table)
days_to_mature_func <- function(gdd_data_df, avg_temp_df, planting_date_d){
  gdd.vec <- gdd_data_df$gdd_required
  year_vec <- sort(unique(avg_temp_df$year_ref))
  temp_ls <- list()

  for(y in seq_along(year_vec)){
    year_id <- year_vec[y]
    weather_sub <- avg_temp_df %>%
      dplyr::filter(year_ref == year_id &
                      doy >= planting_date_d)

    # day of year on which each cumulative GDD threshold is first reached
    stage_vec <- unlist(lapply(1:length(gdd.vec), function(x)
      planting_date_d - 1 + which.max(cumsum(weather_sub$thermal.units) >= gdd.vec[x])))

    # if the final threshold is never reached, which.max() falls back to 1,
    # so the last stage would not exceed the previous one; flag it as NA
    stage_vec[length(stage_vec)] <- ifelse(stage_vec[length(stage_vec)] <= stage_vec[length(stage_vec) - 1],
                                           NA, stage_vec[length(stage_vec)])

    gdd_doy <- as.data.frame(t(as.data.frame(stage_vec)))
    names(gdd_doy) <- paste0('stage_doy', 1:length(stage_vec))
    gdd_doy$year_ref <- year_id
    temp_ls[[y]] <- gdd_doy
  }

  days_to_mature_mod <- rbindlist(temp_ls)
  return(days_to_mature_mod)
}
days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
days_to_mature_mod
stage_doy1 stage_doy2 stage_doy3 stage_doy4 year_ref
1: 154 160 164 167 2001
2: 154 157 159 163 2002
3: 154 157 160 162 2003
4: 155 157 163 165 2004
5: 154 156 160 164 2005
6: 154 161 164 168 2006
7: 154 156 159 161 2007
8: 155 158 161 164 2008
9: 154 156 160 163 2009
10: 154 158 160 163 2010
Since the crop should be taking 110 days to mature, I define the error as:
error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
My question is: how do I optimise gdd_required in gdd_data to minimise this error?
One method I have implemented is to loop over a sequence of factors that scale gdd_required, calculating the error at each step; the factor with the lowest error is the one I finally apply to the gdd_required data. I am reading about the gradient descent algorithm, which might make this process quicker, but unfortunately I don't have enough technical expertise yet to achieve this.
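A simplified sketch of that factor-scan approach (illustrative only, not my exact code):
# scan a grid of scaling factors, score each one, and keep the best
factors <- seq(0.5, 1.5, by = 0.05)
errors <- sapply(factors, function(f) {
  scaled_gdd <- transform(gdd_data, gdd_required = gdd_required * f)
  mod <- days_to_mature_func(scaled_gdd, avg_temp, planting_date)
  mean(mod$stage_doy4 - observed_days_to_mature)^2
})
best_factor <- factors[which.min(errors)]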
From a comment: I do have a condition that wasn't explicit - the x values in the function I am optimising are ordered, i.e. x[1] < x[2] < x[3] < x[4], since they are cumulative.

Building on your example, you can define a function that takes arbitrary gdd_required and returns the fit:
optim_function <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = x)
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}
The function optim allows you to find the parameters that reach a minimum, starting from the initial set you used e.g.
optim(c(50, 120, 190, 250), optim_function)
#$par
#[1] 266.35738 199.59795 -28.35870 30.21135
#
#$value
#[1] 1866.24
#
#$counts
#function gradient
# 91 NA
#
#$convergence
#[1] 0
#
#$message
#NULL
So a best fit of around 1866 is found with parameters 266.35738, 199.59795, -28.35870, 30.21135.
The help page gives some pointers on doing constrained optimisation if it is important that they are in a specific range.
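For instance, simple box constraints can be supplied via method = "L-BFGS-B" (a sketch only; the bounds below are arbitrary choices, and since this particular objective is piecewise constant, the default derivative-free Nelder-Mead may still behave better):
optim(c(50, 120, 190, 250), optim_function,
      method = "L-BFGS-B", lower = rep(1, 4), upper = rep(400, 4))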
Given your comment that the parameters should be strictly increasing, you can transform arbitrary values into increasing ones with cumsum(exp()) so your code would become
optim_function_plus <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = cumsum(exp(x)))
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}
opt <- optim(log(c(50, 70, 70, 60)), optim_function_plus)
opt
# $par
# [1] 1.578174 2.057647 2.392850 3.241456
#
# $value
# [1] 1953.64
#
# $counts
# function gradient
# 57 NA
#
# $convergence
# [1] 0
#
# $message
# NULL
To get the parameters back on the scale you're interested in, you'd need to do:
cumsum(exp(opt$par))
# [1] 4.846097 12.673626 23.618263 49.189184
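For reference, the starting vector log(c(50, 70, 70, 60)) used above is just this transform applied to your original guess, so you can move between the two scales directly:
x0 <- c(50, 120, 190, 250)        # original guess on the gdd scale
start <- log(c(x0[1], diff(x0)))  # = log(c(50, 70, 70, 60)), the optim() starting values
cumsum(exp(start))                # recovers 50 120 190 250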

Related

Store coefficients from several regressions in R then call coefficients into second loop

I am trying to output coefficients from multiple multi-linear regressions, store each of them and then multiply the coefficients by a future data set to predict future revenue.
There are 91 regressions in total, one for each 'DBA' numbered 0 to 90. These are run against 680 dates.
I have the loop that runs all of the regressions and outputs the coefficients. I need help storing each of the unique 91 coefficient vectors.
x <- 0
while (x < 91) {
  pa.coef <- lm(formula = Final_Rev ~ OTB_Revenue + ADR + Sessions, data = subset(data, DBA == x))
  y <- coef(pa.coef)
  print(cbind(x, y))
  x <- x + 1
}
After storing each of the unique vectors I need to multiply the vectors by future 'dates' to output 'predicted revenue.'
Any help would be greatly appreciated!
Thanks!
Since you need to store data from each iteration, consider an apply-family function over standard loops such as for or while. And because you need to subset by a group, consider using by (the object-oriented wrapper to tapply), which slices a data frame by factor(s) and passes the subsets into a function. Such a function would call lm and predict.lm.
The below demonstrates with random data and other_data data frames (10 rows per DBA group), returning a named list of predicted Final_Rev vectors (each of length 10, as per their DBA group).
Data
set.seed(51718)
data <- data.frame(DBA = rep(seq(0,90), 10),
                   Sessions = sample(100:200, 910, replace=TRUE),
                   ADR = abs(rnorm(910)) * 100,
                   OTB_Revenue = abs(rnorm(910)) * 1000,
                   Final_Rev = abs(rnorm(910)) * 1000)
set.seed(8888)
other_data <- data.frame(DBA = rep(seq(0,90), 10),
                         Sessions = sample(100:200, 910, replace=TRUE),
                         ADR = abs(rnorm(910)) * 100,
                         OTB_Revenue = abs(rnorm(910)) * 1000,
                         Final_Rev = abs(rnorm(910)) * 1000)
Prediction
final_rev_predict_list <- by(data, data$DBA, function(sub){
  pa.model <- lm(formula = Final_Rev ~ OTB_Revenue + ADR + Sessions, data = sub)
  # predict on the matching DBA rows of other_data (the argument name is "newdata")
  predict.lm(pa.model, newdata = other_data[other_data$DBA == sub$DBA[1], ])
})
final_rev_predict_list[['0']]
# 1 92 183 274 365 456 547 638 729 820
# 831.3382 1108.0749 1404.8833 1024.4387 784.5980 455.0259 536.9992 100.5486 575.0234 492.1356
final_rev_predict_list[['45']]
# 46 137 228 319 410 501 592 683 774 865
# 1168.1625 961.9151 536.2392 1125.5452 1440.8600 1008.1956 609.7389 728.3272 1474.5348 700.1708
final_rev_predict_list[['90']]
# 91 182 273 364 455 546 637 728 819 910
# 749.9693 726.6120 488.7858 830.1254 659.7508 618.7387 929.6969 584.3375 628.9795 929.3194

tseries - block bootstrap two series same order of resampling

For example:
require(tseries)
series1 <- c(100,140,150,200,150,260,267,280,300,350)
series2 <- c(500,600,250,300,350,500,100,130,50,60)
data <- data.frame("series1" = series1, "series2" = series2)
ts = tsbootstrap(data$series1, m=1, b=2, type="block", nb=10)
ts <- as.data.frame(ts)
head(ts)
> head(ts)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 280 280 150 200 100 300 150 140 100 260
2 300 300 200 150 140 350 260 150 140 267
3 140 260 140 260 267 200 150 150 260 300
4 150 267 150 267 280 150 200 200 267 350
5 260 100 260 150 300 100 150 267 100 200
6 267 140 267 200 350 140 260 280 140 150
We now have blocks of two, stitched together in a different order. My question is: how can I 'reshuffle' series1 and series2 by block bootstrap whilst keeping the blocks of both series in the same order?
For example, if we set the block size to 2, the bootstrap grabs a block, say positions 5 and 6 out of 10, and moves elements 5 and 6 of series1 to positions 1 and 2; for series2 it should grab elements 5 and 6 and move them to positions 1 and 2 as well. That way the two series stay aligned. Is this possible?
So far I have tried to merge series1 and series2 into a new column, so that when I use the bootstrap the two series move to the same position:
data <- transform(data, ts.merge=paste(series1, series2, sep=","))
head(data)
series1 series2 ts.merge
1 100 500 100,500
2 140 600 140,600
3 150 250 150,250
4 200 300 200,300
5 150 350 150,350
6 260 500 260,500
However, the , separator is not compatible with tseries...
Error in FUN(newX[, i], ...) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In as.vector(x, mode = "double") : NAs introduced by coercion
2: In as.vector(x, mode = "double") : NAs introduced by coercion
I also tried the separator "", but then I'm not sure how to distinguish between the two numerical values afterwards in order to split them apart again (note that my real-life data are not simply triple-digit values as shown above, otherwise I could just split them in half).
Try:
# bootstrap the positions 1..n rather than the values, then apply the same
# resampled positions to both series
ts.index <- tsbootstrap(seq_along(series1), m=1, b=2, type="block", nb=10)
series1[ts.index[,1]]
series2[ts.index[,1]]
Next, you can manage your final dataframe as you want.
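For example, a resampled data frame built from the first replicate of indices might look like this (a sketch using the objects defined above):
boot_df <- data.frame(series1 = series1[ts.index[, 1]],
                      series2 = series2[ts.index[, 1]])
head(boot_df)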
Took me all day, but here is a manual solution which resamples the data in blocks of rows:
# Random data
data <- matrix(rnorm(20*100), ncol = 2)
data <- as.data.frame(data)

# Set block size
reps <- NROW(data)/5              # number of groups
data$id <- rep(1:reps, each = 5)  # each = 5 is the number of rows per block

# Id data
IDs <- unique(data$id)
runs <- 1:1000
temp <- list()

# Function to bootstrap the data frame once:
# subsets the data by id number, then resamples the subsets with replacement
bootSTRAP <- function(x){
  for (i in 1:length(IDs)){
    temp[i] <- list(data[data$id == IDs[i], ])
  }
  out <- sample(temp, replace = TRUE)
  df <- do.call(rbind, out)
}
# Loop for running it a 1000 times
runs <- 1:1000
run.output <- list()

for (i in 1:length(runs)){ # Length of optimization
  tryCatch({
    temp.1 <- bootSTRAP(runs[i])
    #cum_ret <- rbind.data.frame(cum_ret, temp)
    run.output[[i]] <- cbind.data.frame(temp.1)
    ptm0 <- proc.time()
    Sys.sleep(0.1)
    ptm1 <- proc.time() - ptm0
    time <- as.numeric(ptm1[3])
    cat('\n', 'Iteration', i, 'took', time, "seconds to complete")
  }, error = function(e) { print(paste("i =", i, "failed:")) })
}
# cbind outputs
master <- do.call(cbind, run.output)
# Rename columns
col.ids <- rep(1:1000,each=3)
cnames <- paste(col.ids)
colnames(master) <- cnames
If the goal is to keep series1 and series2 rows in sync, you can add an index when creating the 'data' as follows:
data <- data.frame("series1" = series1, "series2" = series2, index =
seq(1:length(series1)))
Then change the data field to bootstrap to 'index' as follows:
ts = tsbootstrap(data$index, m=1, b=2, type="block", nb=10)
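Each bootstrap replicate of resampled index values can then be used to pull the corresponding rows of both series at once, e.g. with the first replicate (a sketch):
boot_rows <- ts[, 1]   # first replicate of resampled indices
head(data[boot_rows, c("series1", "series2")])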

Code for interest calculation in R for systematic investments [closed]

Suppose I deposit 1000 (P) dollars in bank every month for 60 (n) months and bank pays me 1 (rate) percent per month as interest. Hence I can calculate the interest on each deposit as:
P=1000
n=60
rate=1
P*(rate/100)*(n:1)
[1] 600 590 580 570 560 550 540 530 520 510 500 490 480 470 460 450 440 430 420 410 400 390 380 370 360 350 340 330 320 310 300
[32] 290 280 270 260 250 240 230 220 210 200 190 180 170 160 150 140 130 120 110 100 90 80 70 60 50 40 30 20 10
and total interest (totalInt) as
totalInt = sum(P*(rate/100)*(n:1))
totalInt
[1] 18300
How can I calculate 'rate' if I know P, n and totalInt? I tried the following formula, but it produces a series of values which do not make sense:
rate = totalInt * 100 / (P*(n:1))
> rate
[1] 30.50000 31.01695 31.55172 32.10526 32.67857 33.27273 33.88889 34.52830 35.19231 35.88235 36.60000
[12] 37.34694 38.12500 38.93617 39.78261 40.66667 41.59091 42.55814 43.57143 44.63415 45.75000 46.92308
[23] 48.15789 49.45946 50.83333 52.28571 53.82353 55.45455 57.18750 59.03226 61.00000 63.10345 65.35714
[34] 67.77778 70.38462 73.20000 76.25000 79.56522 83.18182 87.14286 91.50000 96.31579 101.66667 107.64706
[45] 114.37500 122.00000 130.71429 140.76923 152.50000 166.36364 183.00000 203.33333 228.75000 261.42857 305.00000
[56] 366.00000 457.50000 610.00000 915.00000 1830.00000
>
mean(rate)
[1] 142.736
The mean value is also too high and does not make sense.
You seem to be new to R, so here are a couple of ways to do this.
If this was another programming language, you'd calculate compound interest this way:
# approach using loops - very inefficient in R
totalInt <- 0
prin <- P
for (i in 1:n) {
  totalInt <- totalInt + prin*rate/100
  prin <- prin * (1+rate/100)
}
totalInt
# [1] 816.6967
Since R is a vectorized language, this is the preferred way in R.
# vectorized approach - very efficient in R
prin <- P * (1+rate/100)^(0:(n-1))
int <- prin * rate/100
totalInt <- sum(int)
totalInt
# [1] 816.6967
This code creates a vector prin with the principal at the beginning of each period, and then a vector int containing the interest earned in that period. The approach below is a more compact version of the same thing.
# vectorized approach simplified
P * sum((1+rate/100)^(0:(n-1))*rate/100)
# [1] 816.6967
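Since this is a geometric series, it also collapses to a closed form that agrees with the loop and vectorised versions above:
# closed-form total compound interest earned over n periods
P * ((1 + rate/100)^n - 1)
# [1] 816.6967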
So to do the reverse, e.g. calculate the rate given P, n, and totalInt, we use the uniroot(...) function applied to the function f(...), below. Read the documentation on uniroot(...).
f <- function(rate, P, n, totalInt) {
  totalInt - P * sum((1+rate/100)^(0:(n-1))*rate/100)
}
result <- uniroot(f,P=1000,n=60,totalInt=816.7, lower=0, upper=100)$root
result
# [1] 1.000003
For clarity, let's add a variable: fracRate = rate/100
Since totalInt = sum(P*fracRate*(n:1)) then totalInt = P*fracRate*(n+1)*n/2.
Now we can rearrange to get fracRate = 2*totalInt /(P*(n+1)*n). You can calculate percent rate with rate = 100*fracRate.
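A quick check of that rearrangement with the numbers from the question:
# solve for the (simple, non-compounded) rate directly from the closed form
P <- 1000; n <- 60; totalInt <- 18300
fracRate <- 2 * totalInt / (P * (n + 1) * n)
100 * fracRate   # rate in percent
# [1] 1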
The calculation isn't really R specific, so perhaps you should try asking on Math.SE, where answers can also include nicer LaTeX equation formatting.
Your method for calculating assumes that the bank calculates interest from principal only. In reality, interest is usually compounded.

How to calculate the median of a grouped dataset?

My dataset is as following:
salary number
1500-1600 110
1600-1700 180
1700-1800 320
1800-1900 460
1900-2000 850
2000-2100 250
2100-2200 130
2200-2300 70
2300-2400 20
2400-2500 10
How can I calculate the median of this dataset? Here's what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))
numbers cumsum
[1500-1600] 110 110
(1600-1700] 180 290
(1700-1800] 320 610
(1800-1900] 460 1070
(1900-2000] 850 1920
(2000,2100] 250 2170
(2100-2200] 130 2300
(2200-2300] 70 2370
(2300-2400] 20 2390
(2400-2500] 10 2400
Here, you can see the half-way frequency is 2400/2=1200. It is between 1070 and 1920. Thus the median class is the (1900-2000] group. You can use the formula below to get this result:
Median = L + (h/f) * (n/2 - c)
where:
L is the lower class boundary of the median class
h is the size of the median class, i.e. the difference between the upper and lower class boundaries of the median class
f is the frequency of the median class
c is the cumulative frequency of the class preceding the median class
n/2 is the total number of observations divided by 2 (i.e. sum f / 2)
Alternatively, median class is defined by the following method:
Locate n/2 in the column of cumulative frequency.
Get the class in which this lies.
And in code:
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)
[1] 1915.294
Now what I want to do is to make the above expression more elegant - i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I achieve this?
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example)
intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly, trim) to have the function automatically create the required matrix for you.
sep: The separator character in your "intervals" column in your data.frame.
trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the
  # required "intervals" matrix. "trim" removes any unwanted
  # characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- cf[Midrow - 1]          # cumulative frequency of the class before the median class
  n_2 <- max(cf)/2               # total observations divided by 2
  unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, with the sample data that you provided, where I think there was a mistake in one of your ranges (all were separated by dashes except one, which was separated by a comma), since strsplit uses a regular expression by default to split on, you can use the function like this:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- c("numbers")
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
              "(1900-2000]", " (2000,2100]", "(2100-2200]", "(2200-2300]",
              "(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
I've written it like this to clearly explain how it's being worked out. A more compact version is appended.
library(data.table)

# constructing the dataset with the salary range split into low and high
salarydata <- data.table(
  salaries_low = 100*c(15:24),
  salaries_high = 100*c(16:25),
  numbers = c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
)

# calculating cumulative number of observations
salarydata <- salarydata[, cumnumbers := cumsum(numbers)]
salarydata
# salaries_low salaries_high numbers cumnumbers
# 1: 1500 1600 110 110
# 2: 1600 1700 180 290
# 3: 1700 1800 320 610
# 4: 1800 1900 460 1070
# 5: 1900 2000 850 1920
# 6: 2000 2100 250 2170
# 7: 2100 2200 130 2300
# 8: 2200 2300 70 2370
# 9: 2300 2400 20 2390
# 10: 2400 2500 10 2400
#identifying median group
mediangroup <- salarydata[
  (cumnumbers - numbers) <= (max(cumnumbers)/2) &
    cumnumbers >= (max(cumnumbers)/2)]
mediangroup
# salaries_low salaries_high numbers cumnumbers
# 1: 1900 2000 850 1920
#creating the variables needed to calculate median
mediangroup[,l := salaries_low]
mediangroup[,h := salaries_high - salaries_low]
mediangroup[,f := numbers]
mediangroup[,c := cumnumbers- numbers]
n = salarydata[,sum(numbers)]
#calculating median
median <- mediangroup[,l + ((h/f)*((n/2)-c))]
median
# [1] 1915.294
The compact version -
EDIT: Changed to a function at @AnandaMahto's suggestion. Also, using more general variable names.
library(data.table)

# Creating the function
CalculateMedian <- function(LowerBound, UpperBound, Obs) {
  # calculating cumulative number of observations and n
  dataset <- data.table(UpperBound, LowerBound, Obs)
  dataset <- dataset[, cumObs := cumsum(Obs)]
  n <- dataset[, max(cumObs)]

  # identifying the median group and dynamically calculating l, h, f, c (we already have n)
  median <- dataset[
    (cumObs - Obs) <= (max(cumObs)/2) &
      cumObs >= (max(cumObs)/2),
    LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs - Obs))
  ]
  return(median)
}
# Using function
CalculateMedian(
LowerBound = 100*c(15:24),
UpperBound = 100*c(16:25),
Obs = c(110,180,320,460,850,250,130,70,20,10)
)
# [1] 1915.294
# (dat is assumed to be the question's two-column table: salary range and number)
(Sal <- sapply(strsplit(as.character(dat[[1]]), "-"),
               function(x) mean(as.numeric(x))))
[1] 1550 1650 1750 1850 1950 2050 2150 2250 2350 2450
require(Hmisc)
wtd.mean(Sal, weights = dat[[2]])
[1] 1898.75
wtd.quantile(Sal, weights=dat[[2]], probs=0.5)
Generalization to a weighted median might require looking for a package that provides one.
Have you tried median, or apply(yourobject, 2, median) if it is a matrix or data.frame?
What about this way? Create a vector for each salary bracket, assuming an even spread over each band. Then make one big vector from those vectors and take the median. Similar to yours, but with a slightly different result. I'm not a mathematician, so the method could be incorrect.
dat <- matrix(c(seq(1500, 2400, 100), seq(1600, 2500, 100),
                c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)), ncol = 3)
median(unlist(apply(dat, 1, function(x) { ((1:x[3])/x[3]) * (x[2]-x[1]) + x[1] })))
Returns 1915.353
I think this concept should work for you (the example below is in PHP):
$salaries = array(
    array("1500","1600"),
    array("1600","1700"),
    array("1700","1800"),
    array("1800","1900"),
    array("1900","2000"),
    array("2000","2100"),
    array("2100","2200"),
    array("2200","2300"),
    array("2300","2400"),
    array("2400","2500"),
);
$numbers = array("110","180","320","460","850","250","130","70","20","10");

// build the cumulative frequencies and the total n
$cumsum = array();
$n = 0;
$count = 0;
foreach($numbers as $key=>$number){
    $cumsum[$key] = $number;
    $n += $number;
    if($count > 0){
        $cumsum[$key] += $cumsum[$key-1];
    }
    ++$count;
}

// locate the median class: the first class whose cumulative frequency reaches n/2
$classIndex = 0;
foreach($cumsum as $key=>$cum){
    if($cum < ($n/2)){
        $classIndex = $key+1;
    }
}

$classRange = $salaries[$classIndex];
$L = $classRange[0];
$h = (float) $classRange[1] - $classRange[0];
$f = $numbers[$classIndex];
$c = $cumsum[$classIndex-1]; // cumulative frequency of the class before the median class
$Median = $L + ($h/$f)*(($n/2)-$c);
echo $Median; // ~1915.29

Probability heatmap in ggplot

I asked this question a year ago and got code for this "probability heatmap":
numbet <- 32
numtri <- 1e5
prob=5/6
# Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
                 idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
                ymin=c(0, head(seq_along(V1)/length(V1), -1)),
                ymax=seq_along(V1)/length(V1),
                fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
library(scales)  # percent() used for percentage-formatted labels

p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", labels=percent, low="red", high="blue") +
  scale_y_continuous(labels=percent) +
  xlab("Bet")
print(p)
(The formatter argument from the original answer has since been removed from ggplot2, hence the labels = percent form used above.)
This is almost exactly what I want, except that each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), where shaft 7 should have 8 (N+1).
If I'm right, the reason the code does this is that it uses the observed data, and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequency of heads for n = 1:32 trials and plotted the graph; it should be what you expect. I have read some of your earlier posts here on SO and on math.stackexchange, but I still don't understand why you'd want to simulate the experiment rather than generate from a binomial random variable. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.
set.seed(42)
numbet <- 32
numtri <- 1e5
prob=5/6
require(plyr)
out <- ldply(1:numbet, function(idx) {
  outcome <- dbinom(idx:0, size=idx, prob=prob)
  bet <- rep(idx, length(outcome))
  N <- round(outcome * numtri)
  ymin <- c(0, head(seq_along(N)/length(N), -1))
  ymax <- seq_along(N)/length(N)
  data.frame(bet, fill=outcome, ymin, ymax)
})
require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", low="red", high="blue") +
  xlab("Bet")
The plot (not reproduced here) shows the heatmap with the full n+1 bins per bet.
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather, one way to look at it) is to use the fact that if you have two binomial distributions, X ~ B(n, p) and Y ~ B(m, p), where n, m = size and p = probability of success, then their sum X + Y ~ B(n + m, p) (1). So the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become obvious, and it can be constructed in no time (without any need for a for-loop or computing a cumsum every time).
If you have followed me so far, the idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability 5/6 and 1/6 respectively. That is, if you have numtri = 1000, then you'll have ~834 0's and ~166 1's in each of the numbet columns (= 32 here). Let's construct this and test it first.
numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
163 164 176 162 160 177 157 163 166 146 170
# So, the count of 1's are "approximately" what we expect (around 166).
Now, each of these columns is a sample from a binomial distribution with n = 1 and size = numtri. If we were to add the first two columns and replace the second column with this sum, then, from (1), since the probabilities are equal, we'd end up with a binomial distribution with n = 2. Similarly, if instead you added the first three columns and replaced the 3rd column with this sum, you would obtain a binomial distribution with n = 3, and so on...
The concept is that if you cumulatively add each column, then you end up with numbet number of binomial distributions (1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) over each row, in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
this will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
Now on to why you have difficulties obtaining the graph I had attached (with n+1 bins):
For a binomial distribution with n=1:32 trials, 5/6 as probability of tails (failures) and 1/6 as the probability of heads (successes), the probability of k heads is given by:
nCk * (1/6)^k * (5/6)^(n-k) # where nCk is n choose k
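A quick way to check this formula against R's built-in dbinom (for, say, k = 2 heads in n = 7 tosses):
n <- 7; k <- 2
choose(n, k) * (1/6)^k * (5/6)^(n - k)  # ~0.234
dbinom(k, size = n, prob = 1/6)         # same value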
For the test data we've generated, the observed proportions of k heads for n=7 and n=8 (trials) are:
# n=7
0 1 2 3 4 5
.278 .394 .233 .077 .016 .002
# n=8
0 1 2 3 4 5
.229 .375 .254 .111 .025 .006
Why do they both have 6 bins rather than 8 and 9 bins? Of course this has to do with the value of numtri=1000. Let's see what the probabilities of each of these 8 and 9 bins are by generating them directly from the binomial distribution using dbinom, to understand why this happens.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities corresponding to k=6,7 (for n=7) and k=6,7,8 (for n=8) are ~0. They are very low in value; the minimum here is actually about 5.8 * 1e-7 (n=8, k=8). This means you'd expect to see such a value about once only if you simulated roughly 1/(5.8 * 1e-7), i.e. about 1.7 million, times. If you check the same for n=32 and k=32, the value is 1.256493 * 1e-25, so you'd have to simulate that many trials to get at least one result where all 32 outcomes are heads for n=32.
This is why your results have no values for certain bins: the probability of observing them is very low for the given numtri. And for the same reason, generating the probabilities directly from the binomial distribution overcomes this limitation.
I hope I've managed to write with enough clarity for you to follow. Let me know if you've trouble going through.
Edit 2:
When I run the code I've just edited above with numtri=1e6 and count the number of heads for k=0:7 and k=0:8, I get this for n=7 and n=8:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that there are now counts at k=6 and k=7 for both n=7 and n=8. Also, for n=8, you have a single observation at k=8. With increasing numtri you'll populate more of the missing bins, but it will require a huge amount of time/memory (if it is feasible at all).
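For reference, the expected counts of the rarest bins at numtri = 1e6 follow directly from dbinom, and they line up with the simulated tables above:
1e6 * dbinom(7, size = 7, prob = 1/6)  # ~3.6 expected all-heads runs for n = 7 (3 observed)
1e6 * dbinom(8, size = 8, prob = 1/6)  # ~0.6 for n = 8 (1 observed)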
