I am trying to map values one-to-one by row, returning a dataframe that keeps each value only when it lies between two bounds. For example, on this dataset:
Coastal_Cities Summer_2009 Summer_2010 Summer_2011 Summer_2012 Summer_2013 Summer_2014 Summer_2015 Summer_2016 Summer_2017 Summer_2018 Summer_2019
1 Aberdeen City 497 434 437 310 541 556 556 492 474 616 526
2 Barrow-in-Furness 552 555 637 445 671 726 616 514 547 773 627
3 Blackpool 551 550 623 433 664 700 585 493 535 738 611
For each row, I want to keep every value that lies between that row's lower and upper confidence bounds, and set it to 0 otherwise.
lower_confidence upper_confidence
1 479.8784 509.0307
2 588.6927 622.7619
3 573.3041 605.4232
So row 1 should map to row 1, row 2 should map to row 2 etc ...
For example, given a set of values:
structure(list(Coastal_Cities = c("Aberdeen City", "Barrow-in-Furness",
"Blackpool", "Bournemouth, Christchurch and Poole", "Caerdydd - Cardiff"
), Summer_2009 = c(497, 552, 551, 654, 529), Summer_2010 = c(434,
555, 550, 642, 598), Summer_2011 = c(437, 637, 623, 567, 549),
Summer_2012 = c(310, 445, 433, 481, 433), Summer_2013 = c(541,
671, 664, 776, 733), Summer_2014 = c(556, 726, 700, 799,
741), Summer_2015 = c(556, 616, 585, 619, 621), Summer_2016 = c(492,
514, 493, 598, 524), Summer_2017 = c(474, 547, 535, 659,
569), Summer_2018 = c(616, 773, 738, 806, 730), Summer_2019 = c(526,
627, 611, 688, 561)), row.names = c(NA, 5L), class = "data.frame")
#Get its 95% confidence interval by rows
ci <- function(x){
z= rowMeans(x[,-1])-1.96*(apply(x[, -1], 1, sd)/length(x[,-1]));
v =rowMeans(x[,-1])+1.96*(apply(x[, -1], 1, sd)/length(x[,-1])) ;
y=data.frame(lower_confidence = z, upper_confidence = v);
return(y)}
Now I am trying to produce a function that stores those values in each row that are between the upper and lower interval:
diff_ci <- function(x, y) {
for(i in nrow(x)) {
for(j in length(x[, -1])){
t = x[j] > ci(y)[1][[1]][i]
p = x[j] < ci(y)[2][[1]][i]
}
}
e = data.frame(t, p)
return(e)
}
However, when I call the function I get output that I cannot interpret:
diff_ci(weather[, -1], weather)
Summer_2018 Summer_2018.1
1 FALSE TRUE
2 TRUE FALSE
3 TRUE FALSE
4 TRUE FALSE
5 TRUE FALSE
It is not clear to me that it has done anything I had in mind.
And when I run this through lapply:
> lapply(weather[, -1], diff_ci, y=weather)
I get the error:
Error in data.frame(t, p) : object 'p' not found
I am guessing I messed up with the allocation of arguments within the for-loop?
My expected output (for first 2 rows):
Coastal_Cities Summer_2009 Summer_2010 Summer_2011 Summer_2012 Summer_2013 Summer_2014 Summer_2015 Summer_2016 Summer_2017 Summer_2018 Summer_2019
1 Aberdeen City 497 0 0 0 0 0 0 492 0 0 0
2 Barrow-in-Furness 0 0 0 0 0 0 616 0 0 0 0
I would also appreciate a cleaner version of my code, with an explanation of where I went wrong.
EDIT:
I have also tried this which fixes some of the i /j uses, however it just prints the entire output ...:
diff_ci <- function(x, y) {
for(i in nrow(x)) {
for(j in length(x[, -1])){
if(x[[j]][i] > ci(y)[1][[1]][i] | x[[j]][i] < ci(y)[2][[1]][i]){
print(x)
}
}
}
}
diff_ci(ten_year.average[, -1], ten_year.average)
Reading your question, it seems to me that what you need is fairly straightforward: if a value falls within the row-wise confidence interval, keep that value; if it falls outside the interval, return 0. I named the data you supplied reprex.
I'm not sure what your use case is, so I didn't really fiddle with your ci calculation except to pull it out of a function so you can step through building the dataframe.
min <-
sapply(seq_along(reprex$Coastal_Cities),
function(x) {
rowMeans(reprex[x,-1])-1.96*(apply(reprex[x, -1], 1, sd)/length(reprex[x,-1]))
}
)
max <-
sapply(seq_along(reprex$Coastal_Cities),
function(x) {
rowMeans(reprex[x,-1])+1.96*(apply(reprex[x, -1], 1, sd)/length(reprex[x,-1]))
}
)
confint <- data.frame(min = min, max = max)
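On the question of what went wrong in the original loops: `for (i in nrow(x))` iterates over a single value (the row count), not over 1 up to that count, so the body only ever runs for the last row and last column, and `t` and `p` are overwritten rather than stored. With lapply each column arrives as a plain vector, for which nrow() is NULL, so the loop never runs and `p` is never created, which would explain the "object 'p' not found" error. A minimal sketch on a toy data frame (`toy` is a made-up name, not part of your data):

```r
toy <- data.frame(a = 1:3, b = 4:6)

# nrow(toy) is the single number 3, so this loop body runs once, with i = 3
visited_wrong <- c()
for (i in nrow(toy)) visited_wrong <- c(visited_wrong, i)

# seq_len() expands the count into 1, 2, 3, which is what was intended
visited_right <- c()
for (i in seq_len(nrow(toy))) visited_right <- c(visited_right, i)

print(visited_wrong)   # only the last row index
print(visited_right)   # every row index

# A single column is a plain vector, so nrow() gives NULL and a loop over it never runs
print(is.null(nrow(toy$a)))
```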
Instead of stepping into indexing hell via for loops, you can take advantage of the dataframe structure. A dataframe is a list of vectors and so it is easy to work with lapply and sapply to iterate through.
If I use lapply on a dataframe object itself, it will iterate through each column. You can see this in action by running the simple example below:
lapply(data.frame(a = 1:3, b = 4:6, c = 7:9), print)
So for your use case you want to iterate through each element of the vector we get from an initial lapply. We can use a nested sapply to keep the vector structure.
Using our really simple example above, let's say we want to paste "b" onto each observation of the dataframe:
lapply(data.frame(a = 1:3, b = 4:6, c = 7:9),
function(x) {
sapply(x,
function(y) {
paste(y, "b") }
)
})
# if we print the above output
$a
[1] "1 b" "2 b" "3 b"
$b
[1] "4 b" "5 b" "6 b"
$c
[1] "7 b" "8 b" "9 b"
So we apply the same logic, except that instead of pasting "b" we use an if else statement to either return the original value, or return 0 if the value falls outside the interval given by the corresponding row (y) of the confint dataframe.
Finally, we want to get a dataframe from our output. So we use do.call to supply the arguments (a list of vectors) to the data.frame() function call.
conditional <-
lapply(reprex[-1], function(x) {
sapply(seq_along(x), function(y) {
if(x[y] > confint$min[y] & x[y] < confint$max[y]) {x[y]} else {0}
})
})
do.call(data.frame, conditional)
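If you would rather avoid the nested apply calls altogether, the same masking can probably be done on a matrix in one step. This is only a sketch under my own assumptions: `vals` and `bounds` are made-up stand-ins for reprex and confint, with invented numbers.

```r
# Toy stand-ins for the `reprex` and `confint` objects (values made up)
vals <- data.frame(city = c("A", "B"),
                   s1 = c(10, 50), s2 = c(20, 70), s3 = c(30, 60))
bounds <- data.frame(min = c(15, 55), max = c(25, 65))

m <- as.matrix(vals[-1])

# A matrix is stored column by column, so the bound vectors (one entry per row)
# recycle down each column and stay aligned with the rows
inside <- m > bounds$min & m < bounds$max
m[!inside] <- 0

result <- data.frame(city = vals$city, m)
print(result)
```

This relies on the bounds having exactly one entry per row of the data; if the lengths differ, the recycling silently misaligns, so it is worth checking nrow() first.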
But for real, if you do this kind of thing often, I recommend taking some time to learn the tidyverse. Tons of time saving tools for tasks like this. Using the tidyverse you could tackle this problem like so:
library(tidyverse)
reprex %>%
pivot_longer(starts_with("Summer")) %>%
group_by(Coastal_Cities) %>%
mutate(sd = sd(value),
mean = mean(value),
ci_min = mean - 1.96 * sd/n(),
ci_max = mean + 1.96 * sd/n()) %>%
ungroup() %>%
mutate(value = case_when(
ci_min <= value & value <= ci_max ~ value,
TRUE ~ 0
))
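Note that the pipeline above leaves the data in long format. If you need one row per city again, something like the following should work. It is a sketch on made-up toy data (`toy` and its values are my own), and I swapped case_when for if_else just to keep it short.

```r
library(dplyr)
library(tidyr)

# Made-up data in the same shape as the question's data frame
toy <- data.frame(Coastal_Cities = c("A", "B"),
                  Summer_2009 = c(10, 50),
                  Summer_2010 = c(20, 70),
                  Summer_2011 = c(30, 60))

wide <- toy %>%
  pivot_longer(starts_with("Summer")) %>%
  group_by(Coastal_Cities) %>%
  mutate(ci_min = mean(value) - 1.96 * sd(value) / n(),
         ci_max = mean(value) + 1.96 * sd(value) / n()) %>%
  ungroup() %>%
  mutate(value = if_else(ci_min <= value & value <= ci_max, value, 0)) %>%
  select(Coastal_Cities, name, value) %>%           # drop the helper columns
  pivot_wider(names_from = name, values_from = value)

print(wide)
```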
This function returns people's BMIs. The function that returns the BMIs does what I want, but I cannot figure out how to print the WEIGHTS of those people that have a BMI over 25. I feel like I should be subscripting, but how do I get that association? Thanks in advance.
heights_in_cms <- c(180, 165, 160, 193)
weights_in_kg <- c(87, 58, 65, 100)
bmi_calc <- function(h, w) {
bmi = w / (h / 100)^2
cat("BMIs are:", bmi, "\n")
for (b in bmi) {
if (b > 25) {
print(b)
}
}
}
bmi_seq <- bmi_calc(heights_in_cms, weights_in_kg)
Using the same method as tmfmnk's answer, you can make the output of the function a dataframe if you prefer.
bmi_calc <- function(h, w) {
bmi <- w/(h/100)^2
return(data.frame(BMI=bmi[bmi > 25], Weight = w[bmi > 25]))
}
> bmi_calc(heights_in_cms, weights_in_kg)
BMI Weight
1 26.85185 87
2 25.39062 65
3 26.84636 100
Edit: from the discussion in the comments
data.frame(Height.cm=heights_in_cms, Weight.kg=weights_in_kg, BMI=weights_in_kg/(heights_in_cms/100)^2)
Height.cm Weight.kg BMI
1 180 87 26.85185
2 165 58 21.30395
3 160 65 25.39062
4 193 100 26.84636
You can do:
bmi_calc <- function(h, w) {
bmi = w/(h/100)^2
print(list(`BMIs are:` = bmi[bmi > 25],
`Weights are:` = w[bmi > 25]))
}
bmi_calc(heights_in_cms, weights_in_kg)
$`BMIs are:`
[1] 26.85185 25.39062 26.84636
$`Weights are:`
[1] 87 65 100
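Both answers rest on the same idea: the comparison `bmi > 25` gives a logical vector, and that vector can index any other vector of the same length. A small sketch of just that step (the names `over_25`, `weights_over_25` and `positions` are mine):

```r
heights_in_cms <- c(180, 165, 160, 193)
weights_in_kg <- c(87, 58, 65, 100)

bmi <- weights_in_kg / (heights_in_cms / 100)^2
over_25 <- bmi > 25               # one TRUE/FALSE per person

# The same logical vector picks the matching entries of a parallel vector
weights_over_25 <- weights_in_kg[over_25]

# which() turns the logical vector into positions, if those are needed
positions <- which(over_25)

print(weights_over_25)
print(positions)
```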
I am trying to calculate the value of the function EQ_A0 for each row of df and add it as a new column, but I get a "longer object length is not a multiple of shorter object length" error. When I filter out a row and do it for just one row, there is no error. The result of EQ_A0 for each row is numeric and has just one dimension. I don't understand why I get this error. I would appreciate your help. Here is my code:
EQ_A0 <- function(S_a, lambda_a, c){
integrate(integrand2, 0, 30, S_a, lambda_a, c, subdivisions=2000, rel.tol=.Machine$double.eps^.05)$value
}
integrand2 <- function(tau, S_a, lambda_a, c){
exp(log(tau)+h_A0(tau, S_a, lambda_a, c))
}
h_A0 <- function(tau, S_a, lambda_a, c){
dgamma(tau, shape=S_a, scale = lambda_a*c, log = TRUE) - pgamma(30, shape=S_a, scale = lambda_a*c, lower.tail = TRUE, log.p=TRUE)
}
df <- data.frame(cc=c(0.06329820, 0.05141647), ya=c(31, 256), Sa=c(31,256), yb=c(2865, 742), Sb=c(2993, 1348))
df %>%
mutate(asd=EQ_A0(Sa, 350, cc))
The following worked, but I still don't understand why mutate does not work.
mapply(EQ_A0, df$Sa, 350, df$cc)
cbind(df,f = mapply(EQ_A0, df$Sa, 350, df$cc) )
The problem is that EQ_A0 is not a vectorized function for the parameters S_a and c. The warning (not an error)
Warning message:
In dgamma(tau, shape = S_a, scale = lambda_a * c, log = TRUE) - :
longer object length is not a multiple of shorter object length
is raised inside h_A0 and is related to the fact that S_a and c are vectors of length 2 instead of scalars (you can check this by adding a browser() statement).
As I already mentioned in my comment, the result of dgamma(tau, shape=S_a, scale = lambda_a*c, log = TRUE) is a vector of length 21, while the result of pgamma(30, shape=S_a, scale = lambda_a*c, lower.tail = TRUE, log.p=TRUE) is a vector of length 2. Hence, when subtracting the latter from the former, R raises the warning, as 21 (the length of the longer object) is not a multiple of 2 (the length of the shorter object). R still performs the computation but gives you an incorrect result.
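You can reproduce the warning without any of the gamma functions. In this sketch `long` and `short` are just placeholders for the two intermediate results:

```r
long  <- seq_len(21)   # stands in for the 21 dgamma() values
short <- c(1, 2)       # stands in for the 2 pgamma() values

# Capture the warning message instead of letting it print
msg <- tryCatch(long - short, warning = function(w) conditionMessage(w))
print(msg)
```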
Also, this is not related to mutate, which you can check using EQ_A0(df$Sa, 350, df$cc) (This is what you are trying to do with mutate).
To solve this issue you have to loop over the rows of parameters in your dataframe, either with map2_dbl (the equivalent of mapply(EQ_A0, df$Sa, 350, df$cc)) or using rowwise:
library(dplyr)
library(purrr)
EQ_A0 <- function(S_a, lambda_a, c){
integrate(integrand2, 0, 30, S_a, lambda_a, c, subdivisions=2000, rel.tol=.Machine$double.eps^.05)$value
}
integrand2 <- function(tau, S_a, lambda_a, c){
exp(log(tau)+h_A0(tau, S_a, lambda_a, c))
}
h_A0 <- function(tau, S_a, lambda_a, c){
#browser()
dgamma(tau, shape=S_a, scale = lambda_a*c, log = TRUE) - pgamma(30, shape=S_a, scale = lambda_a*c, lower.tail = TRUE, log.p=TRUE)
}
df <- data.frame(cc=c(0.06329820, 0.05141647), ya=c(31, 256), Sa=c(31,256), yb=c(2865, 742), Sb=c(2993, 1348))
# Solution 1: use map2_dbl to loop over parameters
df %>%
mutate(asd = map2_dbl(Sa, cc, ~ EQ_A0(.x, 350, .y)))
#> cc ya Sa yb Sb asd
#> 1 0.06329820 31 31 2865 2993 29.02379
#> 2 0.05141647 256 256 742 1348 29.88251
# Solution 2: use rowwise to loop over parameters
df %>%
rowwise() %>%
mutate(asd=EQ_A0(Sa, 350, cc)) %>%
ungroup()
#> # A tibble: 2 x 6
#> cc ya Sa yb Sb asd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0633 31 31 2865 2993 29.0
#> 2 0.0514 256 256 742 1348 29.9
Created on 2020-04-04 by the reprex package (v0.3.0)
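A third option, if you would rather keep the plain mutate call, is to wrap the function with base R's Vectorize(), which builds the mapply loop for you. The sketch below uses a toy scalar-only function of my own (`area_under_power`) in place of EQ_A0, so the numbers are not yours:

```r
# Toy scalar-only function standing in for EQ_A0 (not the question's integrand)
area_under_power <- function(a) {
  integrate(function(t) t^a, lower = 0, upper = 1)$value
}

# Vectorize() wraps the function in mapply() over the named argument
area_vec <- Vectorize(area_under_power, vectorize.args = "a")

res <- area_vec(c(1, 2, 3))   # integrals of t, t^2 and t^3 on [0, 1]
print(res)
```

For your function I would expect the wrapper to look like Vectorize(EQ_A0, vectorize.args = c("S_a", "c")), which could then be called inside mutate directly, but I have not run that on your data.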
I'm more experienced with R than many of my peers, yet it sometimes takes hours to move a novel-to-me concept into the code line, and usually a few more to get a successful output. I don't know how to describe this in R language, so I hope you can help me- either with sample code, or pointing me in the right direction.
I have c(X1,X2,X3,...Xn) for starting variable, a non-random numeric value.
I have c(Y1,Y2,Y3,...Yn) for change variable, a non-random numeric value denoting by how much to change X, give or take, and a value between 0-10.
I have c(Z1,Z2,Z3,...Zn) which is the min and max range of X.
What I want to observe is the random sampling of all numbers X, which have all randomly had corresponding Y variable subtracted or added to them. What I'm trying to ask in this problem, is how many times will I draw X values which are exactly the X values which I initially input as well as give or take only a low Y value.
For instance,
Exes<-c(135,462,579,222)
Whys<-c(1,3,3,2)
Zees<-c(c(115,155),c(450,474),c(510,648),c(200,244))
First iteration: X=c(135,462,579,222), second iteration: X=c(130,471,585,230)<- as you can see, X of second iteration has changed by (-5*Y1), (+3*Y2), (+2*Y3), and (+4*Y4)
What I want to output is a list of randomized X values which have changed by only a factor of their corresponding Y value, and always fall within the range of given Z values. Further, I want to examine how many times at least one- and only one- X value will be significantly different from the corresponding starting input X.
I feel like I'm not wording the question succinctly, but I also feel that this is why I've posted. I'm not trying to ask for hand-holding, but rather seeking advice.
I am not sure that I understood the question. Do you want to repeat the process numerous times? Is it for the purpose of simulation? Here is a start of a solution.
library(dplyr)
x <- c(135,462,579,222)
y <- c(1,3,3,2)
z.lower <- c(115, 450, 510, 200)
z.upper <- c(155, 474, 648, 244)
temp.df <- data.frame(x, y, z.lower, z.upper)
temp.df %>%
mutate(samp = sample(seq(-10, 10, 1), nrow(temp.df))) %>% ### Sample integer multipliers between -10 and 10
mutate(new.val = x + samp * y) %>% ### Create new X
mutate(is.bound = new.val < z.upper & new.val > z.lower) ### Check that falls in bounds
x y z.lower z.upper samp new.val is.bound
1 135 1 115 155 -10 125 TRUE
2 462 3 450 474 10 492 FALSE
3 579 3 510 648 8 603 TRUE
4 222 2 200 244 6 234 TRUE
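If the goal is a simulation, the single draw above could be repeated with replicate(). This is a rough sketch and makes two assumptions of mine: the multipliers are drawn with replacement (the code above draws without), and 1000 repetitions with an arbitrary seed are enough for illustration.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

x <- c(135, 462, 579, 222)
y <- c(1, 3, 3, 2)
z.lower <- c(115, 450, 510, 200)
z.upper <- c(155, 474, 648, 244)

one_draw <- function() {
  samp <- sample(-10:10, length(x), replace = TRUE)  # one multiplier per element
  new.val <- x + samp * y
  new.val > z.lower & new.val < z.upper              # in-bounds flag per element
}

n_sims <- 1000
in_bounds <- replicate(n_sims, one_draw())  # logical matrix, 4 rows by n_sims columns

print(rowMeans(in_bounds))            # share of draws in bounds, per element
print(mean(colSums(in_bounds) == 4))  # share of draws where every element is in bounds
```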
For this dataset, this is a possibility:
Exes<-c(135,462,579,222)
Whys<-c(1,3,3,2)
Zees<-c(c(115,155),c(450,474),c(510,648),c(200,244))
n = 10000
x_range_l <- split(Zees, rep(seq_len(length(Zees) / 2), each = 2))
mapply(function(y, x_range) sample(seq(from = x_range[1], to = x_range[2], by = y), size = n, replace = T),
Whys, x_range_l)
Note that this option depends more on the Zees than the Exes. A more complete way to do it would be:
Exes<-c(135,462,579,222)
Whys<-c(1,3,3,2)
Why_Range <- c(20, 4, 13, 11)
x_range_l <- Map(function(x, y, rng) c(x - y * rng, x + y * rng), Exes, Whys, Why_Range)
n = 10000
mapply(function(y, x_range) sample(seq(from = x_range[1], to = x_range[2], by = y), size = n, replace = T),
Whys, x_range_l)
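One thing worth knowing about the original Zees: nesting c() calls does not keep the pairs together, it flattens everything into one vector, which is why the split() step was needed above. A sketch of an alternative that stores the pairs as a matrix (the name `z_mat` and the column names are mine):

```r
Zees <- c(c(115, 155), c(450, 474), c(510, 648), c(200, 244))
print(length(Zees))   # the inner c() calls are flattened into a single vector

# Rebuild the pairs as a two-column matrix, one row per X value
z_mat <- matrix(Zees, ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("lower", "upper")))
print(z_mat)
```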
I am trying to filter a vector of integers.
My condition is that the distance between 2 consecutive elements should be at least 100 ; if not, remove the element and look at the next candidate.
Here is an example :
set.seed(42)
input <- sort(sample(1:1000, 20))
head(input, 20)
[1] 24 49 74 128 146 153 165 228 303 321 356 410 532 561 601 622 634 839 882 997
If I start from the first element 24, I would like to keep the first element that has a distance at least 100 from it.
In this case, it would be 128.
Then, from 128, repeat the same process.
The result should be :
24 128 228 356 532 634 839 997
I managed to create a quick and dirty loop that gives the correct result, but I can guess that it would not be very efficient for very large vectors...
result <- integer(length(input))
result[1] <- input[1]
for(i in seq_along(input)[-1]) {
if(is.na(input[2])) break
if(input[2] - input[1] < 100) {
input <- input[-2]
} else {
result[i] <- input[2]
input <- input[-1]
}
}
result <- result[result != 0]
What would be an efficient way to get the expected result ? Can it be done using vectorization ?
unique(Reduce(function(x,y)ifelse(y-x>=100,y,x),input,accumulate = T))
[1] 24 128 228 356 532 634 839 997
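To see why the one-liner works, it may help to look at the intermediate vector before unique() is applied. In this sketch I spelled out the function and used a plain if instead of ifelse; `keep_or_carry` and `running` are my own names.

```r
input <- c(24, 49, 74, 128, 146, 153, 165, 228, 303, 321,
           356, 410, 532, 561, 601, 622, 634, 839, 882, 997)

# x is the last value kept so far, y is the next candidate
keep_or_carry <- function(x, y) if (y - x >= 100) y else x

# Each position holds the most recently kept value at that point
running <- Reduce(keep_or_carry, input, accumulate = TRUE)
print(running)

result <- unique(running)
print(result)
```

Note that the unique() step assumes the input has no repeated values that should both be kept, which holds here because any two kept values are at least 100 apart.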
Not thoroughly tested, but I believe this gets you there. I am using purrr::accumulate. This is a pretty neat problem :-) hoping to see some other solutions/approaches, so maybe leave this open (unanswered) for a bit...
library(purrr)
input <- c(24, 49, 74, 128, 146, 153, 165, 228, 303, 321, 356, 410, 532, 561, 601, 622, 634, 839, 882, 997)
idx <- which(accumulate(diff(input), ~ ifelse(.x >= 100, .y, .x + .y)) >= 100)
input[c(1, idx + 1)]
#> [1] 24 128 228 356 532 634 839 997
And to make this read a little more purrr, I suppose we could do the following (note that if_else comes from dplyr, so that package needs to be loaded as well):
accumulate(diff(input), ~ if_else(.x >= 100, .y, .x + .y)) %>%
map_lgl(~ . >= 100) %>%
which %>%
{ input[c(1, . + 1)] }
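Since each decision depends on the element kept just before it, the problem is sequential by nature, and I suspect a plain loop is hard to beat as long as it does not shrink the input on every step. A sketch of such a loop (`filter_min_gap` is a made-up name, and the default gap of 100 comes from the question):

```r
filter_min_gap <- function(v, gap = 100) {
  if (length(v) == 0) return(v)
  keep <- logical(length(v))   # preallocated flags, FALSE by default
  keep[1] <- TRUE
  last <- v[1]                 # most recently kept value
  for (i in seq_along(v)[-1]) {
    if (v[i] - last >= gap) {
      keep[i] <- TRUE
      last <- v[i]
    }
  }
  v[keep]
}

input <- c(24, 49, 74, 128, 146, 153, 165, 228, 303, 321,
           356, 410, 532, 561, 601, 622, 634, 839, 882, 997)
result <- filter_min_gap(input)
print(result)
```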
My dataset is as following:
salary number
1500-1600 110
1600-1700 180
1700-1800 320
1800-1900 460
1900-2000 850
2000-2100 250
2100-2200 130
2200-2300 70
2300-2400 20
2400-2500 10
How can I calculate the median of this dataset? Here's what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))
numbers cumsum
[1500-1600] 110 110
(1600-1700] 180 290
(1700-1800] 320 610
(1800-1900] 460 1070
(1900-2000] 850 1920
(2000,2100] 250 2170
(2100-2200] 130 2300
(2200-2300] 70 2370
(2300-2400] 20 2390
(2400-2500] 10 2400
Here, you can see the half-way frequency is 2400/2=1200. It is between 1070 and 1920. Thus the median class is the (1900-2000] group. You can use the formula below to get this result:
Median = L + h/f (n/2 - c)
where:
L is the lower class boundary of median class
h is the size of the median class i.e. difference between upper and lower class boundaries of median class
f is the frequency of median class
c is previous cumulative frequency of the median class
n/2 is total no. of observations divided by 2 (i.e. sum f / 2)
Alternatively, median class is defined by the following method:
Locate n/2 in the column of cumulative frequency.
Get the class in which this lies.
And in code:
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)
[1] 1915.294
Now what I want to do is to make the above expression more elegant - i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I achieve this?
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example)
intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly, trim) to have the function automatically create the required matrix for you.
sep: The separator character in your "intervals" column in your data.frame.
trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
# If "sep" is specified, the function will try to create the
# required "intervals" matrix. "trim" removes any unwanted
# characters before attempting to convert the ranges to numeric.
if (!is.null(sep)) {
if (is.null(trim)) pattern <- ""
else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
else pattern <- trim
intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
}
Midpoints <- rowMeans(intervals)
cf <- cumsum(frequencies)
Midrow <- findInterval(max(cf)/2, cf) + 1
L <- intervals[1, Midrow] # lower class boundary of median class
h <- diff(intervals[, Midrow]) # size of median class
f <- frequencies[Midrow] # frequency of median class
cf2 <- cf[Midrow - 1] # cumulative frequency class before median class
n_2 <- max(cf)/2 # total observations divided by 2
unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
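If the sep handling looks opaque, here is roughly what that step does to the salary strings on its own. The three ranges are taken from your data; the variable names are just for this sketch.

```r
ranges <- c("1500-1600", "1600-1700", "1700-1800")

# strsplit() gives one character pair per range; sapply() binds the numeric
# pairs as columns, so row 1 holds lower bounds and row 2 holds upper bounds
intervals <- sapply(strsplit(ranges, "-"), as.numeric)
print(intervals)

# Class width of the second interval, as computed inside the function
print(diff(intervals[, 2]))
```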
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, I think there was a mistake in one of the ranges in the sample data you provided: all were separated by dashes except one, which was separated by a comma. Since strsplit splits on a regular expression by default, you can still use the function like this:
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
I've written it like this to clearly explain how it's being worked out. A more compact version is appended.
library(data.table)
#constructing the dataset with the salary range split into low and high
salarydata <- data.table(
salaries_low = 100*c(15:24),
salaries_high = 100*c(16:25),
numbers = c(110,180,320,460,850,250,130,70,20,10)
)
#calculating cumulative number of observations
salarydata <- salarydata[,cumnumbers := cumsum(numbers)]
salarydata
# salaries_low salaries_high numbers cumnumbers
# 1: 1500 1600 110 110
# 2: 1600 1700 180 290
# 3: 1700 1800 320 610
# 4: 1800 1900 460 1070
# 5: 1900 2000 850 1920
# 6: 2000 2100 250 2170
# 7: 2100 2200 130 2300
# 8: 2200 2300 70 2370
# 9: 2300 2400 20 2390
# 10: 2400 2500 10 2400
#identifying median group
mediangroup <- salarydata[
(cumnumbers - numbers) <= (max(cumnumbers)/2) &
cumnumbers >= (max(cumnumbers)/2)]
mediangroup
# salaries_low salaries_high numbers cumnumbers
# 1: 1900 2000 850 1920
#creating the variables needed to calculate median
mediangroup[,l := salaries_low]
mediangroup[,h := salaries_high - salaries_low]
mediangroup[,f := numbers]
mediangroup[,c := cumnumbers- numbers]
n = salarydata[,sum(numbers)]
#calculating median
median <- mediangroup[,l + ((h/f)*((n/2)-c))]
median
# [1] 1915.294
The compact version -
EDIT: Changed to a function at @AnandaMahto's suggestion. Also, using more general variable names.
library(data.table)
#Creating function
CalculateMedian <- function(
LowerBound,
UpperBound,
Obs
)
{
#calculating cumulative number of observations and n
dataset <- data.table(UpperBound, LowerBound, Obs)
dataset <- dataset[,cumObs := cumsum(Obs)]
n = dataset[,max(cumObs)]
#identifying mediangroup and dynamically calculating l,h,f,c. We already have n.
median <- dataset[
(cumObs - Obs) <= (max(cumObs)/2) &
cumObs >= (max(cumObs)/2),
LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs- Obs))
]
return(median)
}
# Using function
CalculateMedian(
LowerBound = 100*c(15:24),
UpperBound = 100*c(16:25),
Obs = c(110,180,320,460,850,250,130,70,20,10)
)
# [1] 1915.294
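For anyone who does not want the data.table dependency, the same formula can be sketched in base R. `grouped_median` is a name I made up, and the guard for the first class is my own addition rather than something from the code above.

```r
grouped_median <- function(lower, upper, freq) {
  cum <- cumsum(freq)
  half <- sum(freq) / 2
  k <- which(cum >= half)[1]               # first class whose cumulative count reaches n/2
  before <- if (k == 1) 0 else cum[k - 1]  # cumulative count before that class
  lower[k] + (upper[k] - lower[k]) / freq[k] * (half - before)
}

med <- grouped_median(lower = 100 * 15:24,
                      upper = 100 * 16:25,
                      freq = c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10))
print(med)
```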
(Sal <- sapply( strsplit(as.character(dat[[1]]), "-"),
function(x) mean( as.numeric(x) ) ) )
[1] 1550 1650 1750 1850 1950 2050 2150 2250 2350 2450
require(Hmisc)
wtd.mean(Sal, weights = dat[[2]])
[1] 1898.75
wtd.quantile(Sal, weights=dat[[2]], probs=0.5)
Generalization to a weighted median might require looking for a package that has such.
Have you tried median or apply(yourobject,2,median) if it is a matrix or data.frame ?
What about this way? Create vectors for each salary bracket, assuming an even spread over each band. Then make one big vector from those vectors, and take the median. Similar to yours, but with a slightly different result. I'm not a mathematician, so the method could be incorrect.
dat <- matrix(c(seq(1500, 2400, 100), seq(1600, 2500, 100), c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)), ncol=3)
median(unlist(apply(dat, 1, function(x) { ((1:x[3])/x[3])*(x[2]-x[1])+x[1] })))
Returns 1915.353
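As far as I can tell, the small gap from 1915.294 comes from median() averaging the 1200th and 1201st of the 2400 expanded values, whereas the textbook formula takes the point at position n/2 = 1200 exactly. Both land in the 1900-2000 band, at the 130th and 131st of its 850 points. A quick check of that arithmetic (`interp` is just a helper for this sketch):

```r
# Position k of 850 evenly spread points within the 1900-2000 band
interp <- function(k) 1900 + k / 850 * 100

formula_median  <- interp(130)                       # point at position n/2
expanded_median <- mean(c(interp(130), interp(131))) # average of the two middle points

print(formula_median)
print(expanded_median)
```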
I think this concept should work for you.
$salaries = array(
array("1500","1600"),
array("1600","1700"),
array("1700","1800"),
array("1800","1900"),
array("1900","2000"),
array("2000","2100"),
array("2100","2200"),
array("2200","2300"),
array("2300","2400"),
array("2400","2500"),
);
$numbers = array("110","180","320","460","850","250","130","70","20","10");
$cumsum = array();
$n = 0;
$count = 0;
foreach($numbers as $key=>$number){
$cumsum[$key] = $number;
$n += $number;
if($count > 0){
$cumsum[$key] += $cumsum[$key-1];
}
++$count;
}
$classIndex = 0;
foreach($cumsum as $key=>$cum){
if($cum < ($n/2)){
$classIndex = $key+1;
}
}
$classRange = $salaries[$classIndex];
$L = $classRange[0];
$h = (float) $classRange[1] - $classRange[0];
$f = $numbers[$classIndex];
$c = $cumsum[$classIndex-1]; // cumulative frequency before the median class
$Median = $L + ($h/$f)*(($n/2)-$c);
echo $Median;