Growing degree days (GDD) is a concept in plant phenology: a crop needs to accumulate a certain number of thermal units, summed day by day, in order to move from one growth stage to the next.
I have thermal units data available at daily resolution for a given site for 10 years as follows:
set.seed(1)
avg_temp <- data.frame(year_ref = rep(2001:2010, each = 365),
doy = rep(1:365, times = 10),
thermal.units = sample(0:40, 3650, replace=TRUE))
I also have a crop grown at this site that should take 110 days to mature if planted on day 152:
planting_date <- 152
observed_days_to_mature <- 110
I also have an initial guess of how many thermal units this crop might need to accumulate to reach each stage, from planting to full maturity. In the example below, stage 1 needs 50 thermal units since planting, stage 2 needs 120, stage 3 needs 190, and so on.
gdd_data <- data.frame(stage_id = 1:4,
gdd_required = c(50, 120, 190, 250))
Given the GDD requirements, I can calculate, for each year, the day of year on which the crop reaches each stage:
library(dplyr)
library(data.table)
days_to_mature_func <- function(gdd_data_df, avg_temp_df, planting_date_d){
gdd.vec <- gdd_data_df$gdd_required
year_vec <- sort(unique(avg_temp_df$year_ref))
temp_ls <- list()
for(y in seq_along(year_vec)){
year_id <- year_vec[y]
weather_sub <- avg_temp_df %>%
dplyr::filter(year_ref == year_id &
doy >= planting_date_d)
stage_vec <- unlist(lapply(1:length(gdd.vec), function(x) planting_date_d - 1 + which.max(cumsum(weather_sub$thermal.units) >= gdd.vec[x])))
stage_vec[length(stage_vec)] <- ifelse(stage_vec[length(stage_vec)] <= stage_vec[length(stage_vec) - 1], NA, stage_vec[length(stage_vec)])
gdd_doy <- as.data.frame(t(as.data.frame(stage_vec)))
names(gdd_doy) <- paste0('stage_doy', 1:length(stage_vec))
gdd_doy$year_ref <- year_id
temp_ls[[y]] <- gdd_doy
}
days_to_mature_mod <- rbindlist(temp_ls)
return(days_to_mature_mod)
}
days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
days_to_mature_mod
stage_doy1 stage_doy2 stage_doy3 stage_doy4 year_ref
1: 154 160 164 167 2001
2: 154 157 159 163 2002
3: 154 157 160 162 2003
4: 155 157 163 165 2004
5: 154 156 160 164 2005
6: 154 161 164 168 2006
7: 154 156 159 161 2007
8: 155 158 161 164 2008
9: 154 156 160 163 2009
10: 154 158 160 163 2010
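To make the cumsum/which.max step inside the loop concrete, here is a minimal sketch for a single season. The thermal_units values are made up for illustration (they are not taken from avg_temp); the thresholds and planting day are the ones used above.

```r
# Hypothetical daily thermal units for the first 10 days after planting
thermal_units <- c(20, 15, 30, 10, 40, 25, 35, 5, 30, 45)
gdd_required  <- c(50, 120, 190, 250)
planting_date <- 152

# Running total of thermal units since planting
cum_units <- cumsum(thermal_units)

# For each stage, the first day of year on which the running total
# reaches that stage's threshold
stage_doy <- sapply(gdd_required, function(g) {
  planting_date - 1 + which.max(cum_units >= g)
})
stage_doy
# [1] 154 157 160 161

# Caveat: which.max() returns 1 when the threshold is never reached,
# because every element of the logical vector is FALSE
never_reached <- which.max(cum_units >= 1000)
never_reached
# [1] 1
```

The last call suggests why some guard is needed for thresholds that are never reached within the season: without one, such a stage would silently be reported as reached on the planting day.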
Since the crop should be taking 110 days to mature, I define the error as:
error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
My question is: how do I optimise gdd_required in gdd_data to produce the minimal error?
One method I have implemented is to loop over a sequence of factors that scale down gdd_required at each step and calculate the error; the factor with the lowest error is the one I apply to gdd_required. I have been reading about the gradient descent algorithm, which might make this process quicker, but unfortunately I don't yet have the technical expertise to implement it.
From comment: I do have a condition that wasn't explicit - the x in the function that I am optimising are ordered i.e. x[1] < x[2] < x[3] < x[4] since they are cumulative.
Building on your example, you can define a function that takes arbitrary gdd_required and returns the fit:
optim_function <- function(x){
gdd_data <- data.frame(stage_id = 1:4, gdd_required = x)
days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}
The function optim searches for the parameters that minimise this function, starting from the initial values you used, e.g.:
optim(c(50, 120, 190, 250), optim_function)
#$par
#[1] 266.35738 199.59795 -28.35870 30.21135
#
#$value
#[1] 1866.24
#
#$counts
#function gradient
# 91 NA
#
#$convergence
#[1] 0
#
#$message
#NULL
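So a best fit of around 1866 is found with parameters 266.35738, 199.59795, -28.35870, 30.21135. Note that nothing constrains these values, so they are not in increasing order and one of them is negative.
The help page (?optim) gives some pointers on constrained optimisation if it is important that the parameters stay within a specific range.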
Given your comment that the parameters should be strictly increasing, you can transform arbitrary values into increasing ones with cumsum(exp()) so your code would become
optim_function_plus <- function(x){
gdd_data <- data.frame(stage_id = 1:4, gdd_required = cumsum(exp(x)))
days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}
opt <- optim(log(c(50, 70, 70, 60)), optim_function_plus)
opt
# $par
# [1] 1.578174 2.057647 2.392850 3.241456
#
# $value
# [1] 1953.64
#
# $counts
# function gradient
# 57 NA
#
# $convergence
# [1] 0
#
# $message
# NULL
To get the parameters back on the scale you're interested in, you'd need to do:
cumsum(exp(opt$par))
# [1] 4.846097 12.673626 23.618263 49.189184
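As a quick sanity check of the cumsum(exp()) reparameterisation, the sketch below shows that any real-valued vector maps to a strictly increasing one, and that the inverse transform (the log of the increments) reproduces the starting values log(c(50, 70, 70, 60)) used above. The vector x is arbitrary and only for illustration.

```r
# Any real-valued vector maps to a strictly increasing, positive one,
# because every increment exp(x[i]) is positive
x <- c(-2, 0.5, 3, -1)
increasing <- cumsum(exp(x))
all(diff(increasing) > 0)
# [1] TRUE

# Inverse transform: take the log of the increments between thresholds
target <- c(50, 120, 190, 250)
start  <- log(diff(c(0, target)))
all.equal(cumsum(exp(start)), target)
# [1] TRUE
all.equal(start, log(c(50, 70, 70, 60)))
# [1] TRUE
```

This is why the optimiser is started from log(c(50, 70, 70, 60)): those are the logged increments between the original cumulative requirements 50, 120, 190 and 250.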
I have Landsat data for 31 years in 6 NetCDF files.
Each file has about 4 million rows of data.
The number of timesteps in each file is not yet known, except for the first file, which has 60 timesteps. I am using that first file to write my script, which will then be used to process all the files.
I have never used R before this wee exercise.
My task is to create one set of land only data for a statistician to work on.
How can I loop through the timesteps calculating ndwi?
I have the code below working, but it requires a lot of find/replace and copy/paste to work through the 60 timesteps, and then there are 5 more files to process.
###############################################################
# Calculate ndwi for each timestep
###############################################################
#convert the timestep_1 for green to numeric value
green_nir_df02$t_1 <- as.numeric(as.character(green_nir_df02$t_1))
head(na.omit(green_nir_df02$t_1), 20)
#convert the timestep_1 for nir to numeric value
green_nir_df02$t_1.1 <- as.numeric(as.character(green_nir_df02$t_1.1))
head(na.omit(green_nir_df02$t_1.1), 20)
# calculate green minus nir for timestep_1
grnMinusNir_t_1 <- green_nir_df02$t_1 - green_nir_df02$t_1.1
head(grnMinusNir_t_1, 20)
# calculate green plus nir for timestep_1
grnPlusNir_t_1 <- green_nir_df02$t_1 + green_nir_df02$t_1.1
head(grnPlusNir_t_1, 20)
# calculate ndwi from greenMinusNir divided by greenPlusNir for timestep_1
ndwi_t_1 <- grnMinusNir_t_1 / grnPlusNir_t_1
head(ndwi_t_1, 20)
# write ndwi to the green_nir_df02 dataframe for timestep_1
green_nir_df02$ndwi_t_1 <- ndwi_t_1
####################################################
#convert the timestep_2 for green to numeric value
green_nir_df02$t_2 <- as.numeric(as.character(green_nir_df02$t_2))
head(na.omit(green_nir_df02$t_2), 20)
#convert the timestep_2 for nir to numeric value
green_nir_df02$t_2.1 <- as.numeric(as.character(green_nir_df02$t_2.1))
head(na.omit(green_nir_df02$t_2.1), 20)
# calculate green minus nir for timestep_2
grnMinusNir_t_2 <- green_nir_df02$t_2 - green_nir_df02$t_2.1
head(grnMinusNir_t_2, 20)
# calculate green plus nir for timestep_2
grnPlusNir_t_2 <- green_nir_df02$t_2 + green_nir_df02$t_2.1
head(grnPlusNir_t_2, 20)
# calculate ndwi from greenMinusNir divided by greenPlusNir for timestep_2
ndwi_t_2 <- grnMinusNir_t_2 / grnPlusNir_t_2
head(ndwi_t_2, 20)
# write ndwi to the green_nir_df02 dataframe for timestep_2
green_nir_df02$ndwi_t_2 <- ndwi_t_2
... etc to t_60
###############################################################
# END: Calculate ndwi for each timestep
###############################################################
and the next file starts at t_61 and so on
I have tried using this loop, which is not working because I can't point to the value in the column of t_1 in the dataframe rather than the value in the array of column names.
# initialize the "for" loop to calculate ndwi columns
seq <- 1:nt
i <- 0
green_t <- array(1:nt)
nir_t <- array(1:nt)
ndwi_t <- array(1:nt)
greenMinusNir <- array(1:nt)
greenPlusNir <- array(1:nt)
ndwi <- array(1:nt)
# set the value of the count
# count = nt from previous file #nt = num timesteps#THIS IS THE VALUE TO USE
count <- 0 ################################## EDIT THIS VALUE BEFORE RUNNING
# START: Loop for each timestep
# Step 1: create variable names for green, nir, ndwi
# Step 2: convert t_[i] and t_[i].1 to numeric
# Step 3: calculate green minus nir
# Step 4: calculate green plus nir
# Step 5: calculate ndwi = g-nir/g+nir
# Step 6: write ndwi to the green_nir_df02 dataframe
for(i in seq){
# Step 1: create variable names for green, nir, ndwi
green_t[i] <- paste("green_nir_df02$t_",count+i,sep="")
nir_t[i] <- paste("green_nir_df02$t_",count+i,".1",sep="")
ndwi_t[i] <- paste("green_nir_df02$ndwi_t_",count+i,sep="")
}
# initialize the "for" loop to calculate ndwi columns
seq <- 1:nt
i <- 0
for(i in seq){
# Step 2: convert green (i.e., t_[i]) and nir (i.e., t_[i].1) to numeric
green <- as.numeric(as.character(green_t[i]))
##
#ERROR when i=1 > green <-as.numeric(as.character("green_nir_df02$t_1"))
##
nir_t <- as.numeric(as.character(nir_t))
# Step 3: calculate green minus nir
greenMinusNir[i] <- green_t - nir_t
# Step 4: calculate green plus nir
greenPlusNir[i] <- green_t[i] + nir_t[i]
# Step 5: calculate ndwi = g-nir/g+nir
ndwi[i] <- greenMinusNir[i] / greenPlusNir[i]
# Step 6: write ndwi to the green_nir_df02 dataframe ndwi timestep
ndwi_t[i] <- ndwi[i]
# i <- i+1
}
Sample data
lon_lat t_1 t_2 t_60 t_1.1 t_2.1 t_60.1 ndwi_t_1 ndwi_t_2
1 -1609787.5_-2275087.5 180 216 247 80 197 192 0.3846154 0.04600484
2 -1609762.5_-2275087.5 102 252 281 80 197 227 0.1208791 0.12249443
3 -1609737.5_-2275087.5 102 216 281 80 156 227 0.1208791 0.16129032
4 -1609712.5_-2275087.5 141 216 281 80 156 227 0.2760181 0.16129032
5 -1609687.5_-2275087.5 180 181 281 80 156 227 0.3846154 0.07418398
6 -1609662.5_-2275087.5 180 216 281 80 197 227 0.3846154 0.04600484
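The loop attempt above builds strings such as "green_nir_df02$t_1", which R treats as plain text rather than as a reference to a column. One possible way around this, sketched here under the assumption that the columns really are named t_<i> (green) and t_<i>.1 (nir) as in the sample data, is to build only the column name and look it up with [[ ]]. The small data frame below copies the first three rows of the sample; nt and count play the same roles as in the question.

```r
# First three rows of the sample data (two timesteps only, for brevity)
green_nir_df02 <- data.frame(
  lon_lat = c("-1609787.5_-2275087.5", "-1609762.5_-2275087.5",
              "-1609737.5_-2275087.5"),
  t_1   = c(180, 102, 102),
  t_2   = c(216, 252, 216),
  t_1.1 = c(80, 80, 80),
  t_2.1 = c(197, 197, 156)
)

nt    <- 2   # number of timesteps in this file
count <- 0   # number of timesteps already processed in earlier files

for (i in seq_len(nt)) {
  step      <- count + i
  green_col <- paste0("t_", step)         # e.g. "t_1"
  nir_col   <- paste0("t_", step, ".1")   # e.g. "t_1.1"

  # [[ ]] looks a column up by the name stored in a variable
  green <- as.numeric(as.character(green_nir_df02[[green_col]]))
  nir   <- as.numeric(as.character(green_nir_df02[[nir_col]]))

  # ndwi = (green - nir) / (green + nir), written back as a new column
  green_nir_df02[[paste0("ndwi_t_", step)]] <- (green - nir) / (green + nir)
}

round(green_nir_df02$ndwi_t_1, 7)
# [1] 0.3846154 0.1208791 0.1208791
```

For the next file, only count (and nt) would need to change, provided that file's columns continue the same numbering.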
I have a dataframe with many rows and a single column; each entry is a string of variable length, ranging from 30000 to 200000 characters (a DNA sequence). [Below is a sample of 150 characters]
TTCCCCAAACAGCAACTTTAAGGAGCAGCTTCCTTTATGATCCCTGATTGCCTCCCCTTTGTTCCCATAACAAGTAGTTTAAATTTTCTGTTAAAGTCCAAACCACATATTTACAATACCTCGCACC
Here is the full dataset: https://drive.google.com/open?id=1f9prtKW5NnS-BLI5lqsl4FEi4PvRfxGR
I have code in R which divides each row into 20 bins depending on its length, counts the occurrences of G's and C's in each bin, and gives me back a matrix of 20 columns. Here is the code:
library(data.table)
library(stringr)  # provides str_count()
data <- fread("string.fa", header = F)
loopchar <- function(data){
  bins <- sapply(seq(1, nchar(data), nchar(data)/20), function(x) substr(data, x, x + nchar(data)/20 - 1))
  output <- (str_count(bins, c("G"))/nchar(bins) + str_count(bins, c("C"))/nchar(bins))*100
}
result <- data.frame(t(apply(data,1,loopchar)))
However, now I want to do something different. Instead of always using 20 bins (nchar(data)/20), I want the number of bins to come from a list I have. For example, the first row should be divided into 22 bins/segments, so the code would use nchar(data)/22.
The second row should be divided into 21 bins, using nchar(data)/21, and so on. I want the function to use a different number of bins for each row. My data frame of strings and my vector of bin counts have the same length.
What is the best way to do this?
It's more natural to use one of the Bioconductor packages for such tasks. In my case I use Biostrings, but maybe you could find another way.
Data
Your file is too big, so I have created a text file (in memory), which contains random DNA for each line:
# set seed to create reproducible example
set.seed(53101614)
# create an example text file in memory
temp <- tempfile()
writeLines(
sapply(1:100, function(i){
paste(sample(c("A", "T", "C", "G"), sample(100:6000),
replace = T), collapse = "")
}),
con = temp
)
# read lines from tmp file
dna <- readLines(temp)
# unlink file
unlink(temp)
Data preprocessing
Creating Biostrings::DNAStringSet object
Using the Biostrings::DNAStringSet() function we can convert a character vector into a DNAStringSet object. Note that I assume all records use the standard DNA alphabet, i.e. each string contains only the symbols A, T, C and G. If that does not hold in your case, refer to the Biostrings documentation.
library(Biostrings)
dna <- DNAStringSet(dna, use.names = F)
# inspect the output
dna
A DNAStringSet instance of length 100
width seq
[1] 2235 GGGCTTCCGGTGGTTGTAGGCCCATAAGGTGGGAAATATACA...GAAACGTCGACAAGATACAAACGAGTGGTCAACAGGCCAGCC
[2] 1507 ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCC...AACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC
[3] 1462 CATTGGAGTACATAGGGTATTCCCTCTCGTTGTATAACTCCA...TCCTACTTGCGAAGGCAGTCGCACACAAGGGTCTATTTCGTC
[4] 1440 ATGCTACGTTGGTAGGGTAACGCAGACTAGAACCACACGGGA...ATAAAGCCGTCACAAGGAATGTTAGCACTCAATGGCTCGCTA
[5] 3976 AAGCGGAAGTACACGTACCCGCGTAGATTACGTATAGTCGCC...TTACGCGTTGCTCAAATCGTTCGGTGCAGTTTTATAGTGATG
... ... ...
[96] 4924 AGTAAGCAGATCCAGAGTACTGTGAAAGACGTCAGATCCCGA...TATAATGGGTTGCGTGTTTGATTCTGCCATGAATCCTATGTT
[97] 5702 CCTGAAGAGGACGTTTCCCCCTACATCCAGTAGTATTGGTGT...TCTGCTTTGCGCGGCGGGGCCGGACTGTCCATGGCTCACTTG
[98] 5603 GCGGCTGATTATTGCCCGTCTGCCTGCATGATCGAGCAGAAC...CTCTTTACATGCTCATAGGAATCGGCAACGAAGGAGAGAGTC
[99] 3775 GGCAAGACGGTCAGATGTTTTGATGTCCGGGCGGATATCCTT...CGCTGCCCGTGACAATAGTTATCATAAGGAGACCTGGATGGT
[100] 407 TGTCGCAACCTCTCTTGCACGTCCAATTCCCCGACGGTTCTA...GCGACATTCCGGAGTCTGCGCAGCCTATGTATACCCTACAGA
Create the vector of random N numbers of bins
set.seed(53101614)
k <- sample(100, 100, replace = T)
# inspect the output
head(k)
[1] 37 32 63 76 19 41
Create Views objects where each DNA sequence is represented by N = k[i] chunks
It is much easier to solve your problem using the IRanges::Views container. This thing is furiously fast and beautiful.
First of all, we divide each DNA sequence into k[i] ranges:
seqviews <- lapply(seq_along(dna), function(i){
seq = dna[[i]]
seq_length = length(seq)
starts = seq(1, seq_length - seq_length %% k[i], seq_length %/% k[i])
Views(seq, start = starts, end = c(starts[-1] - 1, seq_length))
}
)
# inspect the output for k[2] and seqviews[2]
k[2]
seqviews[2]
32
Views on a 1507-letter DNAString subject
subject: ATGCGGTCTATCTACTTG...GTCAGAAGTAACAGTTTAG
views:
start end width
[1] 1 47 47 [ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCCAGCTA]
[2] 48 94 47 [ACCGCCGGAGACCTGAGTCCACCACACCCATTCGATCTCCATGGTTG]
[3] 95 141 47 [GCGCTCTCCGAGGTGCCACGTCAAGTTGTACTACTCTCTCAGACCTC]
[4] 142 188 47 [TTGTTAGAAGTCCCGAGGTATATGCGCAATACCTCAACCGAAGCGCC]
[5] 189 235 47 [TGATGAGCAAACGTTTCTTATAGTCGCGACCTTGTCCCGAGGACTTG]
... ... ... ... ...
[28] 1270 1316 47 [AGGCGAGGGCAGGGCACATGTTTCTACAGTGAGGCGTGATCCGCTCC]
[29] 1317 1363 47 [GAGGCAAGCTCGTGAACTGTCGTGGCAAGTTACTTATGAGGATGTCA]
[30] 1364 1410 47 [TGGGCAGATGCAACAGACTGCTATTGGCGGGAGAGAGGCATCGACAT]
[31] 1411 1457 47 [ACCGTCTCAAGTACCACAGCTGAGAGGCTCTCGTGGAGATGCGCACA]
[32] 1458 1507 50 [TGAGTCGTAACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC]
After that, we check that every sequence has been divided into the desired number of chunks:
all(sapply(seq_along(k), function(i) k[i] == length(seqviews[[i]])))
[1] TRUE
Important observation
Before we proceed, there is one important observation about your function.
Your function produces N chunks of variable length, because the indices it generates are non-integer (floating-point) values, and substr() truncates them to integers when you call it.
As an example, extracting the 1st record from the dna set and splitting it into 37 bins using your code produces the following results:
dna_1 <- as.character(dna[[1]])
sprintf("DNA#1: %d bp long, 37 chunks", nchar(dna_1))
[1] "DNA#1: 2235 bp long, 37 chunks"
bins <- sapply(seq(1, nchar(dna_1), nchar(dna_1)/37),
function(x){
substr(dna_1, x, x + nchar(dna_1)/37 - 1)
}
)
bins_length <- sapply(bins, nchar)
barplot(table(bins_length),
xlab = "Bin's length",
ylab = "Count",
main = "Bin's length variability"
)
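The effect of non-integer indices can be seen on a toy string. substr() coerces start and stop with as.integer(), which drops the fractional part, so the sketch below (the string and the indices are made up for illustration) returns the same result as the integer indices 2 and 5.

```r
x <- "ABCDEFGHIJ"

# Fractional indices are truncated: 2.9 becomes 2 and 5.9 becomes 5
truncated <- substr(x, 2.9, 5.9)
truncated
# [1] "BCDE"

# Same result as using the truncated integer indices directly
identical(truncated, substr(x, 2L, 5L))
# [1] TRUE
```

Because consecutive fractional start positions are truncated by different amounts, neighbouring bins can end up with slightly different lengths.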
The approach I use in my code, when length(dna[[i]]) %% k[i] != 0 (i.e. there is a remainder), produces k[i] - 1 bins of equal length; only the last bin is longer, with its length equal to length(dna[[i]]) %/% k[i] + length(dna[[i]]) %% k[i]:
bins_length <- width(seqviews[[1]])  # widths of the bins of the first sequence
barplot(table(bins_length),
xlab = "Bin's length",
ylab = "Count",
main = "Bin's length variability"
)
GC content calculation
Biostrings::letterFrequency() applied to IRanges::Views lets you calculate GC content easily:
Find the GC frequency for each bin in every DNA sequence
GC <- lapply(seqviews, letterFrequency, letters = "GC", as.prob = TRUE)
Convert to percents
GC <- lapply(GC, "*", 100)
Inspect the output
head(GC[[1]])
G|C
[1,] 53.33333
[2,] 46.66667
[3,] 50.00000
[4,] 55.00000
[5,] 60.00000
[6,] 45.00000
Plot GC content for DNAs 1:9
par(mfrow = c(3, 3))
invisible(
lapply(1:9, function(i){
plot(GC[[i]],
type = "l",
main = sprintf("DNA #%d, %d bp, %d bins", i, length(dna[[i]]), k[i]),
xlab = "N bins",
ylab = "GC content, %",
ylim = c(0, 100)
)
abline(h = 50, lty = 2, col = "red")
}
)
)
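For readers who cannot install Bioconductor, the same idea (a different number of bins per sequence, with the remainder absorbed by the last bin) can be sketched in base R. The function name gc_by_bins and the two toy sequences are hypothetical and only serve as an illustration; the sketch assumes every sequence is at least as long as its bin count.

```r
# GC percentage per bin for one sequence split into n_bins chunks;
# the last bin absorbs the remainder, as in the Views-based approach
gc_by_bins <- function(sequence, n_bins) {
  len    <- nchar(sequence)
  width  <- len %/% n_bins
  starts <- seq(1, by = width, length.out = n_bins)
  ends   <- c(starts[-1] - 1, len)
  bins   <- substring(sequence, starts, ends)
  # count G and C by deleting every other character
  gc_counts <- nchar(gsub("[^GC]", "", bins))
  100 * gc_counts / nchar(bins)
}

# Toy input: one bin count per sequence
seqs <- c("GGCCAATT", "GCGCGCATATAT")
k    <- c(2, 3)

gc_list <- mapply(gc_by_bins, seqs, k, SIMPLIFY = FALSE, USE.NAMES = FALSE)
gc_list[[1]]
# [1] 100   0
gc_list[[2]]
# [1] 100  50   0
```

Since each row can have a different number of bins, the result is kept as a list rather than forced into a matrix.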
I am trying to output coefficients from multiple multi-linear regressions, store each of them and then multiply the coefficients by a future data set to predict future revenue.
There are 91 regressions in total, one for each 'DBA' numbered 0 to 90. These are run against 680 dates.
I have the loop that runs all of the regressions and outputs the coefficients. I need help storing each of the unique 91 coefficient vectors.
x = 0
while(x<91) {
pa.coef <- lm(formula = Final_Rev ~ OTB_Revenue + ADR + Sessions,data=subset(data, DBA == x))
y <- coef(pa.coef)
print(cbind(x,y))
x = x + 1
}
After storing each of the unique vectors I need to multiply the vectors by future 'dates' to output 'predicted revenue.'
Any help would be greatly appreciated!
Thanks!
Since you need to store results from each iteration, consider an apply-family function instead of a standard loop such as for or while. And because you need to subset by group, consider using by (the object-oriented wrapper to tapply), which slices a dataframe by factor(s) and passes each subset into a function. That function would call lm and predict.lm.
The example below uses random data and other_data dataframes (10 rows per DBA group) and returns a named list of predicted Final_Rev vectors (each of length 10, one per row of the DBA group).
Data
set.seed(51718)
data <- data.frame(DBA = rep(seq(0,90), 10),
Sessions = sample(100:200, 910, replace=TRUE),
ADR = abs(rnorm(910)) * 100,
OTB_Revenue = abs(rnorm(910)) * 1000,
Final_Rev = abs(rnorm(910)) * 1000)
set.seed(8888)
other_data <- data.frame(DBA = rep(seq(0,90), 10),
Sessions = sample(100:200, 910, replace=TRUE),
ADR = abs(rnorm(910)) * 100,
OTB_Revenue = abs(rnorm(910)) * 1000,
Final_Rev = abs(rnorm(910)) * 1000)
Prediction
final_rev_predict_list <- by(data, data$DBA, function(sub){
pa.model <- lm(formula = Final_Rev ~ OTB_Revenue + ADR + Sessions, data=sub)
# predict() takes `newdata` (no underscore); a misspelled `new_data` is silently
# ignored and the fitted values for the training rows are returned instead.
# Predict only for the rows of other_data that belong to the same DBA group.
predict(pa.model, newdata = subset(other_data, DBA == sub$DBA[1]))
})
final_rev_predict_list[['0']]
# named numeric vector of 10 predictions for DBA 0 (other_data rows 1, 92, ..., 820)
final_rev_predict_list[['45']]
# named numeric vector of 10 predictions for DBA 45 (other_data rows 46, 137, ..., 865)
final_rev_predict_list[['90']]
# named numeric vector of 10 predictions for DBA 90 (other_data rows 91, 182, ..., 910)
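The question also asks how to store the 91 coefficient vectors themselves. A minimal sketch of one way to do that is to split the data by DBA, collect the coefficients in a list, and bind them into a matrix with one row per DBA. The data frame rev_data and the two future rows are simulated here purely for illustration; the column names follow the question.

```r
set.seed(51718)
# Simulated stand-in for the real data: 10 rows for each DBA 0..90
rev_data <- data.frame(DBA = rep(0:90, 10),
                       Sessions = sample(100:200, 910, replace = TRUE),
                       ADR = abs(rnorm(910)) * 100,
                       OTB_Revenue = abs(rnorm(910)) * 1000,
                       Final_Rev = abs(rnorm(910)) * 1000)

# One coefficient vector per DBA, stored in a named list
coef_list <- lapply(split(rev_data, rev_data$DBA), function(sub) {
  coef(lm(Final_Rev ~ OTB_Revenue + ADR + Sessions, data = sub))
})

# Bind into a 91 x 4 matrix; row names are the DBA values "0".."90"
coef_mat <- do.call(rbind, coef_list)
dim(coef_mat)
# [1] 91  4

# Multiply stored coefficients by (hypothetical) future values for DBA 0;
# the leading 1 is the intercept column
future <- data.frame(OTB_Revenue = c(500, 800),
                     ADR = c(90, 120),
                     Sessions = c(150, 160))
X <- cbind(1, as.matrix(future[, c("OTB_Revenue", "ADR", "Sessions")]))
pred <- drop(X %*% coef_mat["0", ])
```

The column order of X has to match the coefficient order ((Intercept), OTB_Revenue, ADR, Sessions); keeping the fitted lm objects and calling predict() with newdata avoids that bookkeeping.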
My dataset is as follows:
salary number
1500-1600 110
1600-1700 180
1700-1800 320
1800-1900 460
1900-2000 850
2000-2100 250
2100-2200 130
2200-2300 70
2300-2400 20
2400-2500 10
How can I calculate the median of this dataset? Here's what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))
numbers cumsum
[1500-1600] 110 110
(1600-1700] 180 290
(1700-1800] 320 610
(1800-1900] 460 1070
(1900-2000] 850 1920
(2000,2100] 250 2170
(2100-2200] 130 2300
(2200-2300] 70 2370
(2300-2400] 20 2390
(2400-2500] 10 2400
Here, you can see the half-way frequency is 2400/2=1200. It is between 1070 and 1920. Thus the median class is the (1900-2000] group. You can use the formula below to get this result:
Median = L + h/f (n/2 - c)
where:
L is the lower class boundary of median class
h is the size of the median class i.e. difference between upper and lower class boundaries of median class
f is the frequency of median class
c is previous cumulative frequency of the median class
n/2 is total no. of observations divided by 2 (i.e. sum f / 2)
Alternatively, median class is defined by the following method:
Locate n/2 in the column of cumulative frequency.
Get the class in which this lies.
And in code:
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)
[1] 1915.294
Now what I want to do is to make the above expression more elegant - i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I achieve this?
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example)
intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly, trim) to have the function automatically create the required matrix for you.
sep: The separator character in your "intervals" column in your data.frame.
trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
# If "sep" is specified, the function will try to create the
# required "intervals" matrix. "trim" removes any unwanted
# characters before attempting to convert the ranges to numeric.
if (!is.null(sep)) {
if (is.null(trim)) pattern <- ""
else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
else pattern <- trim
intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
}
Midpoints <- colMeans(intervals) # class midpoints (one per column; not used below)
cf <- cumsum(frequencies)
Midrow <- findInterval(max(cf)/2, cf) + 1
L <- intervals[1, Midrow] # lower class boundary of median class
h <- diff(intervals[, Midrow]) # size of median class
f <- frequencies[Midrow] # frequency of median class
cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0 # cumulative frequency of the class before the median class (0 if the median class is the first)
n_2 <- max(cf)/2 # total observations divided by 2
unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, in the sample data that you provided, I think there was a mistake in one of your ranges (all were separated by dashes except one, which was separated by a comma). Since strsplit treats its split argument as a regular expression by default, you can still use the function like this:
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
I've written it like this to clearly explain how it's being worked out. A more compact version is appended.
library(data.table)
#constructing the dataset with the salary range split into low and high
salarydata <- data.table(
salaries_low = 100*c(15:24),
salaries_high = 100*c(16:25),
numbers = c(110,180,320,460,850,250,130,70,20,10)
)
#calculating cumulative number of observations
salarydata <- salarydata[,cumnumbers := cumsum(numbers)]
salarydata
# salaries_low salaries_high numbers cumnumbers
# 1: 1500 1600 110 110
# 2: 1600 1700 180 290
# 3: 1700 1800 320 610
# 4: 1800 1900 460 1070
# 5: 1900 2000 850 1920
# 6: 2000 2100 250 2170
# 7: 2100 2200 130 2300
# 8: 2200 2300 70 2370
# 9: 2300 2400 20 2390
# 10: 2400 2500 10 2400
#identifying median group
mediangroup <- salarydata[
(cumnumbers - numbers) <= (max(cumnumbers)/2) &
cumnumbers >= (max(cumnumbers)/2)]
mediangroup
# salaries_low salaries_high numbers cumnumbers
# 1: 1900 2000 850 1920
#creating the variables needed to calculate median
mediangroup[,l := salaries_low]
mediangroup[,h := salaries_high - salaries_low]
mediangroup[,f := numbers]
mediangroup[,c := cumnumbers- numbers]
n = salarydata[,sum(numbers)]
#calculating median
median <- mediangroup[,l + ((h/f)*((n/2)-c))]
median
# [1] 1915.294
The compact version -
EDIT: Changed to a function at @AnandaMahto's suggestion. Also, using more general variable names.
library(data.table)
#Creating function
CalculateMedian <- function(
LowerBound,
UpperBound,
Obs
)
{
#calculating cumulative number of observations and n
dataset <- data.table(UpperBound, LowerBound, Obs)
dataset <- dataset[,cumObs := cumsum(Obs)]
n = dataset[,max(cumObs)]
#identifying mediangroup and dynamically calculating l,h,f,c. We already have n.
median <- dataset[
(cumObs - Obs) <= (max(cumObs)/2) &
cumObs >= (max(cumObs)/2),
LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs- Obs))
]
return(median)
}
# Using function
CalculateMedian(
LowerBound = 100*c(15:24),
UpperBound = 100*c(16:25),
Obs = c(110,180,320,460,850,250,130,70,20,10)
)
# [1] 1915.294
(Sal <- sapply( strsplit(as.character(dat[[1]]), "-"),
function(x) mean( as.numeric(x) ) ) )
[1] 1550 1650 1750 1850 1950 2050 2150 2250 2350 2450
require(Hmisc)
wtd.mean(Sal, weights = dat[[2]])
[1] 1898.75
wtd.quantile(Sal, weights=dat[[2]], probs=0.5)
Generalization to a weighted median might require looking for a package that provides one.
Have you tried median or apply(yourobject,2,median) if it is a matrix or data.frame ?
What about this way? Create vectors for each salary bracket, assuming an even spread over each band. Then make one big vector from those vectors, and take the median. Similar to yours, but with a slightly different result. I'm not a mathematician, so the method could be incorrect.
dat <- matrix(c(seq(1500, 2400, 100), seq(1600, 2500, 100), c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)), ncol=3)
median(unlist(apply(dat, 1, function(x) { ((1:x[3])/x[3])*(x[2]-x[1])+x[1] })))
Returns 1915.353
I think this concept should work for you.
$salaries = array(
array("1500","1600"),
array("1600","1700"),
array("1700","1800"),
array("1800","1900"),
array("1900","2000"),
array("2000","2100"),
array("2100","2200"),
array("2200","2300"),
array("2300","2400"),
array("2400","2500"),
);
$numbers = array("110","180","320","460","850","250","130","70","20","10");
$cumsum = array();
$n = 0;
$count = 0;
foreach($numbers as $key=>$number){
$cumsum[$key] = $number;
$n += $number;
if($count > 0){
$cumsum[$key] += $cumsum[$key-1];
}
++$count;
}
$classIndex = 0;
foreach($cumsum as $key=>$cum){
if($cum < ($n/2)){
$classIndex = $key+1;
}
}
$classRange = $salaries[$classIndex];
$L = $classRange[0];
$h = (float) $classRange[1] - $classRange[0];
$f = $numbers[$classIndex];
$c = $classIndex > 0 ? $cumsum[$classIndex-1] : 0; // cumulative frequency before the median class
$Median = $L + ($h/$f)*(($n/2)-$c);
echo $Median;