Suppose I have a monthly income of 1,200.
The interest rate is 1%, so after one year the income increases by 1%.
I would like to find out how many years it will take for each investment to break even.
Suppose an investment costs 200,000 with a monthly income of 1,200.
My first-year and subsequent years' annual income will be:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:5)
c(firstYear, additionalYears)
14400.00 14544.00 14689.44 14836.33 14984.70 15134.54
I would like to make the "5" in the above example dynamic until it finds the break-even point.
In this example I have:
sum(c(firstYear, additionalYears))
88589.01
So the investment has not broken even yet. Extending the exponent range to 1:13 gives the break-even point:
firstYear = 1200 * 12
additionalYears = (1200*12)*(1+0.01)^c(1:13)
sum(c(firstYear, additionalYears))
215242.9
If possible, I would also like to determine the month of that year in which it breaks even (so given this example it breaks even early in year 14, but another investment might break even in month 8 of year 6, etc.).
Instead of using a loop, you can use vectors. Set the maximum horizon to 100 years, create a vector of the annual incomes for years 0 through 100 and a matching vector of growth factors, multiply the two, and take the cumulative sum of the (negative) cost plus those incomes. The number of years for which that cumulative sum is still negative is the number of full years before break-even.
cost = -200000 # negative cost
income = 1200*12 # annual income
i = 0.01 # interest rate to grow income after year 0.
# repeat 14400 101 times and multiply by (1+i)^n - R is vectorised
income100 = rep(income, 101) * ((1+i) ^ seq(0,100))
# subtract the cost from the cumulative sum of income
cumincome = cost + cumsum(income100)
# how many are negative?
yrs = sum(cumincome < 0)
# how much left to recover in next year
yrs + (-cumincome[yrs] / income100[yrs+1])
[1] 13.06991
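If you also want the month within that final year (as asked in the question), here is a rough sketch of one way to get it, assuming the income accrues evenly over the year:
frac <- yrs + (-cumincome[yrs] / income100[yrs + 1])  # same value as above, 13.06991
ceiling((frac - floor(frac)) * 12)                    # month within the final (yrs + 1)-th year
# [1] 1   i.e. roughly month 1 of year 14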
Putting this into a function
break_even_years <- function(cost, income, interest=0, period = "monthly"){
if(cost >= 0) cost = -cost
if(period == "monthly") income = income * 12
income100 = rep(income, 101) * ((1+interest) ^ seq(0,100))
cumincome = cost + cumsum(income100)
# how many are negative?
yrs = sum(cumincome < 0)
# how much left to recover in next year
yrs + (-cumincome[yrs] / income100[yrs+1])
}
Using the function (with the cost and investment vectors defined in the question's code further below):
purrr::map2(
.x = cost,
.y = investment,
~ break_even_years(.x, .y, interest = 0.03, "annual"))
[[1]]
[1] 28.90435
[[2]]
[1] 23.75858
[[3]]
[1] 6.391264
[[4]]
[1] 3.505453
Benchmarking
library(microbenchmark)
microbenchmark(break_even_years(200000,1000,0.01),
find_break_even_year(1000, 200000, 0.01, 100), times = 1000)
Unit: microseconds
                                          expr   min      lq      mean median      uq      max neval
         break_even_years(200000, 1000, 0.01)  50.9   87.10  257.4185  119.0  159.05  36938.0  1000
 find_break_even_year(1000, 200000, 0.01, 100) 853.5 1247.05 3432.5157 1556.2 2391.35 145980.6  1000
I think this answers my question. If anybody can help with avoiding the for-loop in the function below, that would be very helpful.
library(tidyverse)
investment = c(1000, 2000, 5000, 27000)
interest_rate = 0.03
cost = c(45000, 67900, 34678, 98367)
max_years = 100
find_break_even_year <- function(CF, investment, interest_rate, max_years){
  future_value <- list()  # grown cash flow for each year
  for (year in 1:max_years){
    future_value[[year]] <- CF * (1 + interest_rate)^year
    future_value_sums <- sum(unlist(future_value))
    if (future_value_sums >= investment)
      return(year)
  }
}
purrr::map2(
.x = investment,
.y = cost,
~ find_break_even_year(.x, .y, interest_rate = 0.03, max_years = 100)
)
I am using the boxplot function in R 3.1.1, and I am trying to understand what is happening behind the scenes rather than fix my code.
png(file = "plot1.png")
par(mfrow= c(1,2))
par(mar = c(3,4,4,1))
boxplot(emissions ~ year, col = "blue", xlab="Year", ylab ="Emissions", title = "Pm25 Emissions 1999 and 2008", bg ="white",ylim=c(0,6000))
boxplot(emissions2 ~ year2, col = "blue", xlab="Year", ylab ="Emissions", title = "Pm25 Emissions per Year", bg ="white",ylim=c(0,6000))
dev.off()
The resulting output is not a box-and-whisker plot. Under most situations, from what I have read, the code should return a box and whiskers, but it returns this linear mess of aligned dots that is no better than a bar chart. Any clues on what I have done wrong?
Thanks. (The image is not posted because I don't have 10 reputation points.)
Full code to download and read the data set:
url = "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip"
#######Erased to encourage the learning process...
NEI <- readRDS(mydata[2])
SCC <- readRDS(mydata[1])
year <- (NEI[,6])
emissions <-( NEI[,4])
mat <- cbind(year,emissions)
png(file = "plot1.png")
....
summary(NEI) results:
   Emissions            year
 Min.   :     0.0   Min.   :1999
 1st Qu.:     0.0   1st Qu.:2002
 Median :     0.0   Median :2005
 Mean   :     3.4   Mean   :2004
 3rd Qu.:     0.1   3rd Qu.:2008
 Max.   :646952.0   Max.   :2008
As you may have noticed, the Emissions variable in NEI is strongly skewed.
library(dplyr)
nei <- as.tbl(NEI)
nei%>%
group_by(year) %>%
summarise(
min = min(Emissions),
max = max(Emissions),
mean = mean(Emissions),
median = median(Emissions),
Q25 = quantile (Emissions, probs=0.25),
Q75 = quantile (Emissions, probs=0.75)
)
The summary:
Source: local data frame [4 x 7]

  year   min       max     mean      median          Q25        Q75
1 1999     0  66696.32 6.615401 0.040000000 0.0100000000 0.25600000
2 2002     0 646951.97 3.317747 0.007164684 0.0005436423 0.08000000
3 2005     0  58896.10 3.182719 0.006741885 0.0005283287 0.07000000
4 2008     0  20799.70 1.752560 0.005273130 0.0003983980 0.06162755
A boxplot is a representation of your data's distribution. More precisely, it depends on your data's quantile values.
For example, if your quantiles overlap, you will see only one horizontal line (the box and whiskers are flat), with the outliers forming a vertical line of points.
You can easily imagine your data being distributed like this example:
set.seed(1)
boxplot(count ~ spray,
data = data.frame(count=c(rep(0,800),runif(200)),
spray=sample(1:2,1000,rep=TRUE)), col = "lightgray")
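If you do want to see more of the distribution when the data are this skewed, one option (my addition, not part of the original answer) is to log-transform before plotting, for example with the emissions and year vectors built in the question:
# log10(x + 1) keeps the zero values while spreading out the long right tail
boxplot(log10(emissions + 1) ~ year, col = "blue",
        xlab = "Year", ylab = "log10(Emissions + 1)",
        main = "PM2.5 Emissions (log scale)")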
I have a very large dataframe (N = 107,251), that I wish to split into relatively equal halves (~53,625). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (pertaining to Gender, Age Category with 6 levels, and Region with 5 levels).
I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$Gender))) or in combination (e.g., via prop.table(xtabs(~dat$Gender + dat$Region + dat$Age))), but I'm not sure how to utilise this information to actually do the sampling.
Sample dataset:
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
Probabilities:
round(prop.table(xtabs(~dat$Gender)), 3) # 48.5% Female; 51.5% Male
round(prop.table(xtabs(~dat$Age)), 3) # 16.8, 18.2, ..., 16.0%
round(prop.table(xtabs(~dat$Region)), 3) # 21.5%, 17.7, ..., 21.9%
# Multidimensional probabilities:
round(prop.table(xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
The end goal for this dummy example would be two data frames with ~500 observations in each (completely independent, no participant appearing in both), and approximately equivalent in terms of gender/region/age splits. In the real analysis, there is more disparity between the age and region weights, so doing a single random split-half isn't appropriate. In real world applications, I'm not sure if every observation needs to be used or if it is better to get the splits more even.
I have been reading over the documentation from package:sampling but I'm not sure it is designed to do exactly what I require.
You can check out my stratified function, which you should be able to use like this:
set.seed(1) ## just so you can reproduce this
## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)
## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]
summary(sample1)
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
summary(sample2)
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
Compare the following and see if they are within your expectations.
x1 <- round(prop.table(
xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.
Update:
stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.
set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
#
# $SET2
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
The following code basically creates a key based on group membership, then loops through each group, putting half of the rows into one set and (roughly) half into the other. If you compare the resulting proportions, they are within 0.001 of each other. The downside is that the split is biased toward a larger second set, because of how the extra row from odd-sized groups is handled; in this case the first sample has 488 observations and the second has 512. You can add some logic to account for that and even it out.
EDIT: Added that logic and it split it up evenly.
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
dat$group <- with(dat, paste(Gender, Region, Age))
groups <- unique(dat$group)
setA <- dat[NULL,]
setB <- dat[NULL,]
for (i in 1:length(groups)){
  temp <- dat[dat$group == groups[i], ]
  # alternate which set receives the extra row from odd-sized groups
  if (nrow(setA) > nrow(setB)){
    tempA <- temp[1:floor(nrow(temp)/2), ]
    tempB <- temp[(1 + floor(nrow(temp)/2)):nrow(temp), ]
  } else {
    tempA <- temp[1:ceiling(nrow(temp)/2), ]
    tempB <- temp[(1 + ceiling(nrow(temp)/2)):nrow(temp), ]
  }
  setA <- rbind(setA, tempA)
  setB <- rbind(setB, tempB)
}
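For what it's worth, here is a more compact sketch of the same group-wise split using split() and lapply() (my addition; it shuffles within each group and simply gives any extra row from an odd-sized group to the first set, rather than alternating as the loop above does):
set.seed(42)
halves <- lapply(split(dat, dat$group), function(g) {
  idx  <- sample(nrow(g))              # shuffle rows within the group
  half <- ceiling(nrow(g) / 2)
  list(A = g[idx[seq_len(half)], ], B = g[idx[-seq_len(half)], ])
})
setA2 <- do.call(rbind, lapply(halves, `[[`, "A"))
setB2 <- do.call(rbind, lapply(halves, `[[`, "B"))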
I have a very large data frame (150,000,000 rows) with a format like this:
df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15))
pnr is person id and x is some data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?
The following is a solution, but it is slow because of the merge statement:
prns = as.data.frame(unique(df$pnr))
names(prns)[1] = "pnr"
prns$s = rbinom(nrow(prns), 1, 0.1)
df = merge(df, prns)
df2 = df[df$s == 1, ]
I would actually suggest the "data.table" package over "dplyr" for this. Here's an example with some big-ish sample data (not much smaller than your own 15 million rows).
I'll also show some right and wrong ways to do things :-)
Here's the sample data.
library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- DT <- data.frame(person = sample(10000, 1e7, TRUE),
value = runif(1e7))
We'll also create a "data.table" and set the key to "person". Creating the "data.table" takes no significant time, but setting the key can.
system.time(setDT(DT))
# user system elapsed
# 0.001 0.000 0.001
## Setting the key takes some time, but is worth it
system.time(setkey(DT, person))
# user system elapsed
# 0.620 0.025 0.646
I can't think of a more efficient way to select your "person" values than the following, so I've removed these from the benchmarks--they are common to all approaches.
## Common to all tests...
A <- unique(mydf$person)
B <- sample(A, ceiling(.1 * length(A)), FALSE)
For convenience, the different tests are presented as functions...
## Base R #1
fun1a <- function() {
mydf[mydf$person %in% B, ]
}
## Base R #2--sometimes using `which` makes things quicker
fun1b <- function() {
mydf[which(mydf$person %in% B), ]
}
## `filter` from "dplyr"
fun2 <- function() {
filter(mydf, person %in% B)
}
## The "wrong" way to do this with "data.table"
fun3a <- function() {
DT[which(person %in% B)]
}
## The "right" (I think) way to do this with "data.table"
fun3b <- function() {
DT[J(B)]
}
Now, we can benchmark:
## The benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220 20
# fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169 20
# fun2() 381.78909 394.16716 395.49341 399.01202 417.79044 20
# fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128 20
# fun3b() 28.77801 28.91648 29.01535 29.37596 42.34043 20
Look at the performance we get from using "data.table" the right way! All the other approaches are impressively fast though.
summary shows the results to be the same. (The row order for the "data.table" solution would be different since it has been sorted.)
summary(fun1a())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun2())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun3b())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
In base R, to sample 10% of the rows, rounding up to the next row
> df[sample(nrow(df), ceiling(0.1*nrow(df)), FALSE), ]
## pnr x
## 16 512 0.9781232
## 21 514 0.5279925
## 33 522 0.8332834
## 14 510 0.7989481
## 4 504 0.7825318
or rounding down to the next row
> df[sample(nrow(df), floor(0.1*nrow(df)), FALSE), ]
## pnr x
## 43 530 0.449985180
## 35 524 0.996350657
## 2 502 0.499871966
## 25 518 0.005199058
or sample 10% of the pnr column, rounding up
> sample(df$pnr, ceiling(0.1*length(df$pnr)), FALSE)
## [1] 530 516 526 518 514
ADD:
If you're looking to sample 10% of the people (unique pnr ID), and return those people and their respective data, I think you want
> S <- sample(unique(df$pnr), ceiling(0.1*length(unique(df$pnr))), FALSE)
> df[df$pnr %in% S, ]
## pnr x
## 1 502 0.7630667
## 2 502 0.4998720
## 3 502 0.4839460
## 22 516 0.8248153
## 23 516 0.5795991
## 24 516 0.1572472
PS: I would wait for a dplyr answer. It will likely be quicker on 15mil rows.
If you don't necessarily want a thoroughly random sample, then you could do
filter(df, pnr %% 10 == 0)
which takes every 10th person (you can get 10 different samples by changing to == 1, == 2, and so on). You could make this random by re-allocating IDs randomly; that is fairly trivial for your toy example using sample(15)[(df$pnr - 500) / 2], but reversing the mapping of pnr onto a set suitable for sample might be harder in the real-world case.
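For reference, here is a minimal dplyr sketch of the same idea as the %in% approach above (my addition; slice_sample() requires dplyr >= 1.0.0):
library(dplyr)
set.seed(1)
# sample 10% of the unique person ids, then keep every row belonging to those persons
ids <- df %>% distinct(pnr) %>% slice_sample(prop = 0.1)
df2 <- df %>% semi_join(ids, by = "pnr")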
My dataset is as following:
salary number
1500-1600 110
1600-1700 180
1700-1800 320
1800-1900 460
1900-2000 850
2000-2100 250
2100-2200 130
2200-2300 70
2300-2400 20
2400-2500 10
How can I calculate the median of this dataset? Here's what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))
numbers cumsum
[1500-1600] 110 110
(1600-1700] 180 290
(1700-1800] 320 610
(1800-1900] 460 1070
(1900-2000] 850 1920
(2000,2100] 250 2170
(2100-2200] 130 2300
(2200-2300] 70 2370
(2300-2400] 20 2390
(2400-2500] 10 2400
Here, you can see the half-way frequency is 2400/2=1200. It is between 1070 and 1920. Thus the median class is the (1900-2000] group. You can use the formula below to get this result:
Median = L + (h/f) * (n/2 - c)
where:
L is the lower class boundary of median class
h is the size of the median class i.e. difference between upper and lower class boundaries of median class
f is the frequency of median class
c is previous cumulative frequency of the median class
n/2 is total no. of observations divided by 2 (i.e. sum f / 2)
Alternatively, median class is defined by the following method:
Locate n/2 in the column of cumulative frequency.
Get the class in which this lies.
And in code:
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)
[1] 1915.294
Now what I want to do is to make the above expression more elegant - i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I achieve this?
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example)
intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly, trim) to have the function automatically create the required matrix for you.
sep: The separator character in your "intervals" column in your data.frame.
trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
# If "sep" is specified, the function will try to create the
# required "intervals" matrix. "trim" removes any unwanted
# characters before attempting to convert the ranges to numeric.
if (!is.null(sep)) {
if (is.null(trim)) pattern <- ""
else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
else pattern <- trim
intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
}
Midpoints <- rowMeans(intervals)
cf <- cumsum(frequencies)
Midrow <- findInterval(max(cf)/2, cf) + 1
L <- intervals[1, Midrow] # lower class boundary of median class
h <- diff(intervals[, Midrow]) # size of median class
f <- frequencies[Midrow] # frequency of median class
cf2 <- cf[Midrow - 1] # cumulative frequency class before median class
n_2 <- max(cf)/2 # total observations divided by 2
unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, there seems to be a mistake in one of the ranges in the sample data you provided (all are separated by dashes except one, which is separated by a comma). Since strsplit splits on a regular expression by default, you can handle that by calling the function like this:
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
I've written it like this to clearly explain how it's being worked out. A more compact version is appended.
library(data.table)
#constructing the dataset with the salary range split into low and high
salarydata <- data.table(
salaries_low = 100*c(15:24),
salaries_high = 100*c(16:25),
numbers = c(110,180,320,460,850,250,130,70,20,10)
)
#calculating cumulative number of observations
salarydata <- salarydata[,cumnumbers := cumsum(numbers)]
salarydata
# salaries_low salaries_high numbers cumnumbers
# 1: 1500 1600 110 110
# 2: 1600 1700 180 290
# 3: 1700 1800 320 610
# 4: 1800 1900 460 1070
# 5: 1900 2000 850 1920
# 6: 2000 2100 250 2170
# 7: 2100 2200 130 2300
# 8: 2200 2300 70 2370
# 9: 2300 2400 20 2390
# 10: 2400 2500 10 2400
#identifying median group
mediangroup <- salarydata[
(cumnumbers - numbers) <= (max(cumnumbers)/2) &
cumnumbers >= (max(cumnumbers)/2)]
mediangroup
# salaries_low salaries_high numbers cumnumbers
# 1: 1900 2000 850 1920
#creating the variables needed to calculate median
mediangroup[,l := salaries_low]
mediangroup[,h := salaries_high - salaries_low]
mediangroup[,f := numbers]
mediangroup[,c := cumnumbers- numbers]
n = salarydata[,sum(numbers)]
#calculating median
median <- mediangroup[,l + ((h/f)*((n/2)-c))]
median
# [1] 1915.294
The compact version -
EDIT: Changed to a function at #AnandaMahto's suggestion. Also, using more general variable names.
library(data.table)
#Creating function
CalculateMedian <- function(
LowerBound,
UpperBound,
Obs
)
{
#calculating cumulative number of observations and n
dataset <- data.table(UpperBound, LowerBound, Obs)
dataset <- dataset[,cumObs := cumsum(Obs)]
n = dataset[,max(cumObs)]
#identifying mediangroup and dynamically calculating l,h,f,c. We already have n.
median <- dataset[
(cumObs - Obs) <= (max(cumObs)/2) &
cumObs >= (max(cumObs)/2),
LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs- Obs))
]
return(median)
}
# Using function
CalculateMedian(
LowerBound = 100*c(15:24),
UpperBound = 100*c(16:25),
Obs = c(110,180,320,460,850,250,130,70,20,10)
)
# [1] 1915.294
Using the midpoint of each salary band (here dat is a two-column data frame with the salary ranges and the counts):
(Sal <- sapply( strsplit(as.character(dat[[1]]), "-"),
                function(x) mean( as.numeric(x) ) ) )
[1] 1550 1650 1750 1850 1950 2050 2150 2250 2350 2450
require(Hmisc)
wtd.mean(Sal, weights = dat[[2]])
[1] 1898.75
wtd.quantile(Sal, weights=dat[[2]], probs=0.5)
Generalization to a weighted median might require looking for a package that implements one.
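As a quick cross-check without extra packages (my addition), you can expand the band midpoints by their counts and take an ordinary median; this lands on the midpoint of the median class rather than on the interpolated value:
median(rep(Sal, times = dat[[2]]))
# [1] 1950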
Have you tried median, or apply(yourobject, 2, median) if it is a matrix or data frame?
What about this way? Create vectors for each salary bracket, assuming an even spread over each band. Then make one big vector from those vectors, and take the median. Similar to you, but a slightly different result. I'm not a mathematician, so the method could be incorrect.
dat <- matrix(c(seq(1500, 2400, 100), seq(1600, 2500, 100), c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)), ncol=3)
median(unlist(apply(dat, 1, function(x) { ((1:x[3])/x[3])*(x[2]-x[1])+x[1] })))
Returns 1915.353
I think this concept should work for you.
$salaries = array(
array("1500","1600"),
array("1600","1700"),
array("1700","1800"),
array("1800","1900"),
array("1900","2000"),
array("2000","2100"),
array("2100","2200"),
array("2200","2300"),
array("2300","2400"),
array("2400","2500"),
);
$numbers = array("110","180","320","460","850","250","130","70","20","10");
$cumsum = array();
$n = 0;
$count = 0;
foreach($numbers as $key=>$number){
$cumsum[$key] = $number;
$n += $number;
if($count > 0){
$cumsum[$key] += $cumsum[$key-1];
}
++$count;
}
$classIndex = 0;
foreach($cumsum as $key=>$cum){
if($cum < ($n/2)){
$classIndex = $key+1;
}
}
$classRange = $salaries[$classIndex];
$L = $classRange[0];
$h = (float) $classRange[1] - $classRange[0];
$f = $numbers[$classIndex];
$c = $cumsum[$classIndex-1]; // previous cumulative frequency of the median class
$Median = $L + ($h/$f)*(($n/2)-$c);
echo $Median;