I have a very large data frame (N = 107,251) that I wish to split into two roughly equal halves (~53,625 rows each). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (pertaining to Gender, Age Category with 6 levels, and Region with 5 levels).
I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$Gender))) or in combination (e.g., via prop.table(xtabs(~dat$Gender + dat$Region + dat$Age))), but I'm not sure how to utilise this information to actually do the sampling.
Sample dataset:
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
Probabilities:
round(prop.table(xtabs(~dat$Gender)), 3) # 48.5% Female; 51.5% Male
round(prop.table(xtabs(~dat$Age)), 3) # 16.8, 18.2, ..., 16.0%
round(prop.table(xtabs(~dat$Region)), 3) # 21.5%, 17.7, ..., 21.9%
# Multidimensional probabilities:
round(prop.table(xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
The end goal for this dummy example would be two data frames with ~500 observations each (completely independent, with no participant appearing in both), and approximately equivalent gender/region/age splits. In the real analysis there is more disparity between the age and region weights, so a single random split-half isn't appropriate. For the real data, I'm also not sure whether every observation needs to be used or whether it is better to drop a few observations to make the splits more even.
I have been reading over the documentation from package:sampling but I'm not sure it is designed to do exactly what I require.
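For reference, a minimal base-R sketch of this kind of split (an illustration only, with no extra packages): divide the row indices by the full Gender x Region x Age combination and sample half of each cell, so both halves inherit the joint proportions.
# Sketch only: stratified split-half over the joint Gender/Region/Age strata
set.seed(42)
idx <- split(seq_len(nrow(dat)),
             interaction(dat$Gender, dat$Region, dat$Age, drop = TRUE))
half <- unlist(lapply(idx, function(i) i[sample(length(i), floor(length(i) / 2))]))
split1 <- dat[half, ]   # first half
split2 <- dat[-half, ]  # everyone else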
You can check out my stratified function, which you should be able to use like this:
set.seed(1) ## just so you can reproduce this
## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)
## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]
summary(sample1)
#  Gender Region  Age            X1
#  F:235  1:112   1:84   Min.   :-2.82847
#  M:259  2: 90   2:78   1st Qu.:-0.69711
#         3: 94   3:82   Median :-0.03200
#         4: 97   4:80   Mean   :-0.01401
#         5:101   5:90   3rd Qu.: 0.63844
#                 6:80   Max.   : 2.90422
summary(sample2)
#  Gender Region  Age            X1
#  F:238  1:114   1:85   Min.   :-2.76808
#  M:268  2: 92   2:81   1st Qu.:-0.55173
#         3: 97   3:83   Median : 0.02559
#         4: 99   4:83   Mean   : 0.05789
#         5:104   5:91   3rd Qu.: 0.74102
#                 6:83   Max.   : 3.58466
Compare the following and see if they are within your expectations.
x1 <- round(prop.table(
xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.
Update:
stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.
set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
#  Gender Region  Age            X1
#  F:235  1:112   1:84   Min.   :-2.82847
#  M:259  2: 90   2:78   1st Qu.:-0.69711
#         3: 94   3:82   Median :-0.03200
#         4: 97   4:80   Mean   :-0.01401
#         5:101   5:90   3rd Qu.: 0.63844
#                 6:80   Max.   : 2.90422
#
# $SET2
#  Gender Region  Age            X1
#  F:238  1:114   1:85   Min.   :-2.76808
#  M:268  2: 92   2:81   1st Qu.:-0.55173
#         3: 97   3:83   Median : 0.02559
#         4: 99   4:83   Mean   : 0.05789
#         5:104   5:91   3rd Qu.: 0.74102
#                 6:83   Max.   : 3.58466
The following code basically creates a key based on group membership and then loops through each group, assigning half the rows to one set and (roughly) half to the other. If you compare the resulting probabilities, they are within 0.001 of each other. The downside is that it's biased toward a larger second sample because of how odd-sized groups are rounded; in this case the first sample has 488 observations and the second has 512. You can add some logic to account for that and even things out.
EDIT: Added that logic and it split it up evenly.
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
dat$group <- with(dat, paste(Gender, Region, Age))  # key for each stratum
groups <- unique(dat$group)

setA <- dat[NULL, ]
setB <- dat[NULL, ]

for (i in seq_along(groups)) {
  temp <- dat[dat$group == groups[i], ]
  # When the group size is odd, give the extra row to whichever set is
  # currently smaller (setA on a tie)
  if (nrow(setA) > nrow(setB)) {
    tempA <- temp[1:floor(nrow(temp) / 2), ]
    tempB <- temp[(1 + floor(nrow(temp) / 2)):nrow(temp), ]
  } else {
    tempA <- temp[1:ceiling(nrow(temp) / 2), ]
    tempB <- temp[(1 + ceiling(nrow(temp) / 2)):nrow(temp), ]
  }
  setA <- rbind(setA, tempA)
  setB <- rbind(setB, tempB)
}
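One quick way to check the "within 0.001" claim (a small sketch; it assumes every level appears in both sets so the tables have matching dimensions):
pA <- prop.table(table(setA$Gender, setA$Region, setA$Age))
pB <- prop.table(table(setB$Gender, setB$Region, setB$Age))
max(abs(pA - pB))  # largest difference in joint cell proportions between the two sets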
I am looking for a way to visualize inventory throughout a day. The dataset looks as follows, with the summaries of the last two columns below:
    Time                 Price Inventory Duration
1   9/1/2016 9:25:06 AM 13.960       318        0
2   9/1/2016 9:36:42 AM 13.980       106      696
3   9/1/2016 9:40:52 AM 13.990      -599      250
4   9/1/2016 9:52:54 AM 14.015        68      722
5   9/1/2016 9:52:54 AM 14.015       321        0
6   9/1/2016 9:54:17 AM 14.010        74       83

Inventory
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-1120.00   -98.75     9.00     0.00   100.00  1988.00

Duration
  Min.  1st Qu.   Median     Mean  3rd Qu.       Max.
  0.00    40.25   205.50  2100.00  529.00  272700.00
I want to visualize the data by showing how much time was spent on various inventory levels. What would you recommend as a function for this? So far I found only histograms based on frequency, not time. My intended result would look similar to this:
https://postimg.org/image/z074waij1/
Thanks in advance
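One rough sketch of a time-weighted histogram, assuming a data frame inv with numeric Inventory and Duration columns like the summary above (and Duration measured in seconds): bin the inventory levels, sum the time spent in each bin, and draw the totals as a barplot.
brks  <- pretty(inv$Inventory, n = 20)                          # bin boundaries
bins  <- cut(inv$Inventory, breaks = brks, include.lowest = TRUE)
hours <- tapply(inv$Duration, bins, sum, na.rm = TRUE) / 3600   # seconds -> hours per bin
hours[is.na(hours)] <- 0                                        # bins with no observations
barplot(hours, las = 2, xlab = "Inventory", ylab = "Hours",
        main = "Time spent at each inventory level")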
I wrote the following function for my needs. Hope it helps.
inv.barplot.IDs <- function(inv.list, IDs = 1:1620) {
  # Gather Volume and Duration for the requested IDs
  # (inv.list is a list of per-ID data frames with Volume and Duration columns;
  #  inv.names is assumed to exist in the calling environment)
  myinvs <- as.data.frame(matrix(nrow = 0, ncol = 14))
  names(myinvs) <- inv.names
  Volume <- Duration <- vector("numeric")
  for (i in IDs) {
    # myinvs <- rbind(myinvs, inv.list[[i]])
    Volume <- c(Volume, as.numeric(inv.list[[i]]$Volume))
    Duration <- c(Duration, as.numeric(inv.list[[i]]$Duration))
  }
  # Design a sequence of bins
  minimum <- min(Volume)
  maximum <- max(Volume)
  width <- (maximum + abs(minimum)) / 18
  width <- round(width, -1)
  seq.pos <- seq(width, maximum + width, by = width)
  seq.neg <- -seq(0, abs(minimum) + width, by = width)
  seq <- c(rev(seq.neg), seq.pos)
  # Assign each observation to a bin
  Skatule <- numeric(length = length(Volume))
  for (i in 1:length(Volume)) {
    Skatule[i] <- seq[head(which(seq > Volume[i]), 1) - 1]
  }
  # Total duration per bin
  barplot.data <- tapply(Duration, Skatule, sum)
  # Draw (and optionally save) the barplot
  # jpeg(filename = file.barplot, width = 480 * (16/9))
  inv.barplot <- barplot(barplot.data, border = NA,
                         ylim = c(0, max(barplot.data)),
                         main = "Total time spent on various inventory levels",
                         xlab = "Inventory", ylab = "Log of Hours")
  # print(inv.barplot)
  # dev.off()
}
Let's consider the following vectors in the dataframe:
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data = c(ctrl, x1, x2, x3),
                Group = c(rep("ctrl", length(ctrl)),
                          rep("x1", length(x1)),
                          rep("x2", length(x2)),
                          rep("x3", length(x3))))
I know I could use
pairwise.t.test(x$data,
x$Group,
pool.sd=FALSE)
to get pairwise comparisons like
Pairwise comparisons using t tests with non-pooled SD
data: x$data and x$Group
     ctrl    x1      x2
x1   0.08522 -       -
x2   0.99678 0.10469 -
x3   0.00065 0.99678 2.8e-05
P value adjustment method: holm
However, I am not interested in every possible combination of vectors. I am seeking a way to compare the ctrl vector with every other vector, while taking alpha inflation into account. I'd like to avoid
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x1']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x2']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x3']), var.equal=T)
and then performing a manual correction for multiple comparisons. What would be the best way to do this?
You can use p.adjust to adjust a set of p-values for multiple comparisons (the default method is Holm; Bonferroni is available via method = "bonferroni"). You should not bundle those unequal-length vectors into a data frame; use a list instead.
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
> lapply( list(x1,x2,x3), function(x) t.test(x,ctrl)$p.value)
[[1]]
[1] 0.2464039
[[2]]
[1] 0.8576423
[[3]]
[1] 0.0144275
> p.adjust( .Last.value)
[1] 0.4928077 0.8576423 0.0432825
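A small variation on the same idea (a sketch, not part of the original answer): keep the p-values named by comparison and choose the adjustment method explicitly.
pvals <- sapply(list(x1 = x1, x2 = x2, x3 = x3),
                function(x) t.test(x, ctrl)$p.value)  # raw p-values, named by group
p.adjust(pvals, method = "holm")        # Holm, p.adjust's default
p.adjust(pvals, method = "bonferroni")  # Bonferroni, if preferred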
@BondedDust's answer looks great. Here is a somewhat more involved solution, in case you really need to work with data frames.
library(dplyr)
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data = c(ctrl, x1, x2, x3),
                Group = c(rep("ctrl", length(ctrl)),
                          rep("x1", length(x1)),
                          rep("x2", length(x2)),
                          rep("x3", length(x3))),
                stringsAsFactors = FALSE)
# provide the combinations you want
# set1 with all from set2
set1 = c("ctrl")
set2 = c("x1","x2","x3")
dt_res <-
  data.frame(expand.grid(set1, set2)) %>%  # create combinations
  mutate(test_id = row_number()) %>%       # create a test id
  group_by(test_id) %>%                    # group by test id, so everything below runs once per test
  do({ x_temp <- x[(x$Group == .$Var1 | x$Group == .$Var2), ]  # for each test id keep the groups of interest
       data.frame(x_temp) }) %>%
  do(test = t.test(data ~ Group, data = .))  # perform the test and save it
# dt_res is a dataset with the test id and a column whose elements are t.test results
dt_res
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# test_id test
# 1 1 <S3:htest>
# 2 2 <S3:htest>
# 3 3 <S3:htest>
# get all tests as a list
dt_res$test
# [[1]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -1.9776, df = 58.36, p-value = 0.05271
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.894829477 0.005371207
# sample estimates:
# mean in group ctrl mean in group x1
# -0.447213560 -0.002484425
#
#
# [[2]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -2.3549, df = 100.68, p-value = 0.02047
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.71174095 -0.06087081
# sample estimates:
# mean in group ctrl mean in group x2
# -0.44721356 -0.06090768
#
#
# [[3]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -5.4235, df = 101.12, p-value = 4.001e-07
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -1.2171386 -0.5652189
# sample estimates:
# mean in group ctrl mean in group x3
# -0.4472136 0.4439652
PS: Working with p-values and alpha corrections is always interesting. How to approach it is partly a philosophical issue, and people disagree about it. Personally, I tend to correct alpha for all of the comparisons I could make after an experiment, because you never know when you'll come back to investigate other pairs. Imagine that in the future you have to go back and compare the winning group (say x1) with x2 and x3. You would focus on those pairs and again correct alpha for just those comparisons, yet overall you would have performed every possible comparison except x2 vs x3. You might end up reporting or publishing findings that should have used a stricter alpha correction.
I have a very large data frame (150,000,000 rows) with a format like this:
df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15))
pnr is person id and x is some data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?
The following is a solution, but it is slow because of the merge statement:
prns <- as.data.frame(unique(df$pnr))
names(prns)[1] <- "pnr"
prns$s <- rbinom(nrow(prns), 1, 0.1)
df <- merge(df, prns)
df2 <- df[df$s == 1, ]
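For what it's worth, a minimal dplyr sketch of the same idea (sample 10% of the distinct pnr values, then keep every row belonging to those people) might be:
library(dplyr)
keep <- sample(unique(df$pnr), ceiling(0.1 * n_distinct(df$pnr)))  # 10% of the people
df2 <- df %>% filter(pnr %in% keep)                                # all of their rows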
I would actually suggest the "data.table" package over "dplyr" for this. Here's an example with some big-ish sample data (not much smaller than your own 15 million rows).
I'll also show some right and wrong ways to do things :-)
Here's the sample data.
library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- DT <- data.frame(person = sample(10000, 1e7, TRUE),
                         value = runif(1e7))
We'll also create a "data.table" and set the key to "person". Creating the "data.table" takes no significant time, but setting the key can.
system.time(setDT(DT))
# user system elapsed
# 0.001 0.000 0.001
## Setting the key takes some time, but is worth it
system.time(setkey(DT, person))
# user system elapsed
# 0.620 0.025 0.646
I can't think of a more efficient way to select your "person" values than the following, so I've removed these from the benchmarks--they are common to all approaches.
## Common to all tests...
A <- unique(mydf$person)
B <- sample(A, ceiling(.1 * length(A)), FALSE)
For convenience, the different tests are presented as functions...
## Base R #1
fun1a <- function() {
  mydf[mydf$person %in% B, ]
}

## Base R #2 -- sometimes using `which` makes things quicker
fun1b <- function() {
  mydf[which(mydf$person %in% B), ]
}

## `filter` from "dplyr"
fun2 <- function() {
  filter(mydf, person %in% B)
}

## The "wrong" way to do this with "data.table"
fun3a <- function() {
  DT[which(person %in% B)]
}

## The "right" (I think) way to do this with "data.table"
fun3b <- function() {
  DT[J(B)]
}
Now, we can benchmark:
## The benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
#     expr       min        lq    median        uq       max neval
#  fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220    20
#  fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169    20
#   fun2() 381.78909 394.16716 395.49341 399.01202 417.79044    20
#  fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128    20
#  fun3b()  28.77801  28.91648  29.01535  29.37596  42.34043    20
Look at the performance we get from using "data.table" the right way! All the other approaches are impressively fast though.
summary shows the results to be the same. (The row order for the "data.table" solution would be different since it has been sorted.)
summary(fun1a())
#      person         value
#  Min.   :  16   Min.   :0.000002
#  1st Qu.:2424   1st Qu.:0.250988
#  Median :5075   Median :0.500259
#  Mean   :4958   Mean   :0.500349
#  3rd Qu.:7434   3rd Qu.:0.749601
#  Max.   :9973   Max.   :1.000000
summary(fun2())
#      person         value
#  Min.   :  16   Min.   :0.000002
#  1st Qu.:2424   1st Qu.:0.250988
#  Median :5075   Median :0.500259
#  Mean   :4958   Mean   :0.500349
#  3rd Qu.:7434   3rd Qu.:0.749601
#  Max.   :9973   Max.   :1.000000
summary(fun3b())
#      person         value
#  Min.   :  16   Min.   :0.000002
#  1st Qu.:2424   1st Qu.:0.250988
#  Median :5075   Median :0.500259
#  Mean   :4958   Mean   :0.500349
#  3rd Qu.:7434   3rd Qu.:0.749601
#  Max.   :9973   Max.   :1.000000
In base R, to sample 10% of the rows, rounding up to the next row
> df[sample(nrow(df), ceiling(0.1*nrow(df)), FALSE), ]
## pnr x
## 16 512 0.9781232
## 21 514 0.5279925
## 33 522 0.8332834
## 14 510 0.7989481
## 4 504 0.7825318
or rounding down to the next row
> df[sample(nrow(df), floor(0.1*nrow(df)), FALSE), ]
## pnr x
## 43 530 0.449985180
## 35 524 0.996350657
## 2 502 0.499871966
## 25 518 0.005199058
or sample 10% of the pnr column, rounding up
> sample(df$pnr, ceiling(0.1*length(df$pnr)), FALSE)
## [1] 530 516 526 518 514
ADD:
If you're looking to sample 10% of the people (unique pnr ID), and return those people and their respective data, I think you want
> S <- sample(unique(df$pnr), ceiling(0.1*length(unique(df$pnr))), FALSE)
> df[df$pnr %in% S, ]
## pnr x
## 1 502 0.7630667
## 2 502 0.4998720
## 3 502 0.4839460
## 22 516 0.8248153
## 23 516 0.5795991
## 24 516 0.1572472
PS: I would wait for a dplyr answer. It will likely be quicker on 15mil rows.
If you don't necessarily want a thoroughly random sample, you could do
filter(df, pnr %% 10 == 0)
which takes every 10th person (you can get 10 different samples by changing the right-hand side to == 1, and so on). You could make this random by re-allocating IDs randomly; that is fairly trivial for your toy example using sample(15)[(df$pnr - 500)/2], but reversing the mapping of pnr onto a set that's suitable for sample might be less easy in the real-world case.
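A sketch of that "re-allocate IDs randomly" idea for the general case (an illustration only): give every distinct pnr a random rank, then keep every 10th rank.
library(dplyr)
ids   <- unique(df$pnr)
newid <- setNames(sample(seq_along(ids)), ids)            # random 1..K label per person
df10  <- filter(df, newid[as.character(pnr)] %% 10 == 0)  # roughly 10% of the people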
My dataset is as follows:
salary      number
1500-1600      110
1600-1700      180
1700-1800      320
1800-1900      460
1900-2000      850
2000-2100      250
2100-2200      130
2200-2300       70
2300-2400       20
2400-2500       10
How can I calculate the median of this dataset? Here's what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))
            numbers cumsum
[1500-1600]     110    110
(1600-1700]     180    290
(1700-1800]     320    610
(1800-1900]     460   1070
(1900-2000]     850   1920
(2000,2100]     250   2170
(2100-2200]     130   2300
(2200-2300]      70   2370
(2300-2400]      20   2390
(2400-2500]      10   2400
Here, you can see the half-way frequency is 2400/2=1200. It is between 1070 and 1920. Thus the median class is the (1900-2000] group. You can use the formula below to get this result:
Median = L + (h / f) * (n/2 - c)
where:
L is the lower class boundary of the median class
h is the width of the median class, i.e. the difference between its upper and lower class boundaries
f is the frequency of the median class
c is the cumulative frequency of the class preceding the median class
n/2 is the total number of observations divided by 2 (i.e. sum(f) / 2)
Alternatively, median class is defined by the following method:
Locate n/2 in the column of cumulative frequency.
Get the class in which this lies.
And in code:
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)
[1] 1915.294
Now what I want is to make the above expression, i.e. 1900 + (1200-1070)/(1920-1070) * (2000-1900), more elegant. How can I achieve this?
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example)
intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly, trim) to have the function automatically create the required matrix for you.
sep: The separator character in your "intervals" column in your data.frame.
trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the
  # required "intervals" matrix. "trim" removes any unwanted
  # characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- cf[Midrow - 1]          # cumulative frequency of the class before the median class
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, with the sample data that you provided, I think there was a mistake in one of your ranges: all were separated by dashes except one, which was separated by a comma. Since strsplit splits on a regular expression by default, you can handle that like this:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- c("numbers")
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
              "(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
              "(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow = length(x), dimnames = list(rownames, colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep = "-|,", trim = "cut")
# [1] 1915.294
I've written it like this to clearly explain how it's being worked out. A more compact version is appended.
library(data.table)

# constructing the dataset with the salary range split into low and high
salarydata <- data.table(
  salaries_low  = 100 * c(15:24),
  salaries_high = 100 * c(16:25),
  numbers       = c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
)

# calculating cumulative number of observations
salarydata <- salarydata[, cumnumbers := cumsum(numbers)]
salarydata
#     salaries_low salaries_high numbers cumnumbers
#  1:         1500          1600     110        110
#  2:         1600          1700     180        290
#  3:         1700          1800     320        610
#  4:         1800          1900     460       1070
#  5:         1900          2000     850       1920
#  6:         2000          2100     250       2170
#  7:         2100          2200     130       2300
#  8:         2200          2300      70       2370
#  9:         2300          2400      20       2390
# 10:         2400          2500      10       2400

# identifying the median group
mediangroup <- salarydata[
  (cumnumbers - numbers) <= (max(cumnumbers)/2) &
    cumnumbers >= (max(cumnumbers)/2)]
mediangroup
#    salaries_low salaries_high numbers cumnumbers
# 1:         1900          2000     850       1920

# creating the variables needed to calculate the median
mediangroup[, l := salaries_low]
mediangroup[, h := salaries_high - salaries_low]
mediangroup[, f := numbers]
mediangroup[, c := cumnumbers - numbers]
n <- salarydata[, sum(numbers)]

# calculating the median
median <- mediangroup[, l + ((h/f) * ((n/2) - c))]
median
# [1] 1915.294
The compact version -
EDIT: Changed to a function at #AnandaMahto's suggestion. Also, using more general variable names.
library(data.table)

# Creating the function
CalculateMedian <- function(LowerBound, UpperBound, Obs) {
  # calculating cumulative number of observations and n
  dataset <- data.table(UpperBound, LowerBound, Obs)
  dataset <- dataset[, cumObs := cumsum(Obs)]
  n <- dataset[, max(cumObs)]

  # identifying the median group and dynamically calculating l, h, f, c (we already have n)
  median <- dataset[
    (cumObs - Obs) <= (max(cumObs)/2) &
      cumObs >= (max(cumObs)/2),
    LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs - Obs))
  ]
  return(median)
}

# Using the function
CalculateMedian(
  LowerBound = 100 * c(15:24),
  UpperBound = 100 * c(16:25),
  Obs = c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
)
# [1] 1915.294
(Sal <- sapply( strsplit(as.character(dat[[1]]), "-"),
function(x) mean( as.numeric(x) ) ) )
[1] 1550 1650 1750 1850 1950 2050 2150 2250 2350 2450
require(Hmisc)
wtd.mean(Sal, weights = dat[[2]])
[1] 1898.75
wtd.quantile(Sal, weights=dat[[2]], probs=0.5)
Generalization to a weighted median might require looking for a package that has one.
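For what it's worth, a package-free sketch of a weighted median of the bracket midpoints (using Sal and dat from above) is the first midpoint whose cumulative weight reaches half the total weight. Note that this returns the midpoint of the median bracket, not the interpolated grouped median computed elsewhere in this thread.
w <- dat[[2]]
Sal[which(cumsum(w) >= sum(w) / 2)[1]]
# [1] 1950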
Have you tried median, or apply(yourobject, 2, median) if it is a matrix or data.frame?
What about this way? Create a vector for each salary bracket, assuming an even spread over each band, then combine them into one big vector and take the median. Similar to your approach, but with a slightly different result. I'm not a mathematician, so the method could be incorrect.
dat <- matrix(c(seq(1500, 2400, 100), seq(1600, 2500, 100), c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)), ncol=3)
median(unlist(apply(dat, 1, function(x) { ((1:x[3])/x[3])*(x[2]-x[1])+x[1] })))
Returns 1915.353
I think this concept should work for you.
$salaries = array(
    array("1500","1600"),
    array("1600","1700"),
    array("1700","1800"),
    array("1800","1900"),
    array("1900","2000"),
    array("2000","2100"),
    array("2100","2200"),
    array("2200","2300"),
    array("2300","2400"),
    array("2400","2500"),
);
$numbers = array("110","180","320","460","850","250","130","70","20","10");

// cumulative frequencies and total n
$cumsum = array();
$n = 0;
$count = 0;
foreach($numbers as $key=>$number){
    $cumsum[$key] = $number;
    $n += $number;
    if($count > 0){
        $cumsum[$key] += $cumsum[$key-1];
    }
    ++$count;
}

// index of the median class: first class whose cumulative frequency reaches n/2
$classIndex = 0;
foreach($cumsum as $key=>$cum){
    if($cum < ($n/2)){
        $classIndex = $key+1;
    }
}

$classRange = $salaries[$classIndex];
$L = $classRange[0];                                // lower boundary of the median class
$h = (float) $classRange[1] - $classRange[0];       // class width
$f = $numbers[$classIndex];                         // frequency of the median class
$c = ($classIndex > 0) ? $cumsum[$classIndex-1] : 0; // cumulative frequency before the median class
$Median = $L + ($h/$f)*(($n/2)-$c);
echo $Median; // 1915.294...
Let's consider this data set:
df <- data.frame(age     = sample(c(20:90), 20, rep = T),
                 sex     = sample(c("m", "f"), 20, rep = T),
                 smoker  = sample(c("never", "former", "active"), 20, rep = T),
                 size    = sample(c(8:40), 20, rep = T),
                 fac     = as.factor(sample(c("neg", "lo", "med", "hi"), 20, rep = T)),
                 outcome = sample(c(0, 1), 20, rep = T))
# let's introduce some missing data
for (i in (1:3)) {
  df[sample(c(1:20), 1), sample(c(1:6), 1)] <- NA
}
In a medical manuscript, the first table summarizes the population (or its subgroups, as appropriate); here the rows would be age, sex, smoking status, etc., and the two outcomes would be listed in separate columns. Continuous variables are reported as means and categorical variables as counts.
I was wondering if there is a function that I am missing that creates such contingency tables. I can do this manually, but I would love to be able to update the table automatically if the data set changes, and ultimately I need the output in LaTeX. The function would need to ignore missing data without deleting those rows. Am I asking too much?!
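A minimal base-R sketch of the kind of summary described above (means for the continuous variables, counts for the categorical ones, split by outcome; NAs are ignored in the summaries but the rows are kept), using the df defined above:
by_outcome <- split(df, df$outcome)  # rows with a missing outcome are dropped from the split
lapply(by_outcome, function(d)
  c(mean.age = mean(d$age, na.rm = TRUE), mean.size = mean(d$size, na.rm = TRUE)))
lapply(by_outcome, function(d) list(sex = table(d$sex), smoker = table(d$smoker)))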
In medical articles, 'Table 1' summarizes the demographics of the study population, usually broken down between subgroups.
Generate data set
n <- 100
df <- data.frame(
  age      = sample(c(20:90), n, rep = T),
  sex      = sample(c("m", "f"), 20, rep = T, prob = c(0.55, 0.45)),
  smoker   = sample(c("never", "former", "active"), n, rep = T, prob = c(0.4, 0.45, 0.15)),
  size     = abs(rnorm(n, 20, 8)),
  logitest = sample(c(TRUE, FALSE), n, rep = T, prob = c(0.1, 0.9)),
  labtest  = as.factor(sample(c("neg", "lo", "med", "hi"), n, rep = T, prob = c(0.4, 0.3, 0.2, 0.1))),
  outcome  = sample(c(0, 1), n, rep = T, prob = c(0.8, 0.2))
)

# let's introduce some missing data
for (i in (1:floor(n/6))) {
  df[sample(c(1:n), 1), sample(c(1:ncol(df)), 1)] <- NA
}
head(df)
## age sex smoker size logitest labtest outcome
## 1 70 m former 39.17 NA med NA
## 2 51 f former 33.64 FALSE hi 1
## 3 58 f former 10.10 FALSE neg 1
## 4 30 m former 43.24 FALSE med 0
## 5 54 m former 22.78 FALSE lo 0
## 6 86 f former 8.20 FALSE neg 0
If working with a real data set, use it instead:
# df <- read.csv()
#you may need to eliminate some columns
#colnames(df)
#df0<-df #backup
#df <- df[,-c(1,...,27:38)]
Change this as needed: the column with the diagnosis has to be removed from the variables list!
dx <- 7 #index of outcome/diagnosis
####################################
summary(df[, -dx])
## age sex smoker size logitest
## Min. :20.0 f :44 active:19 Min. : 0.91 Mode :logical
## 1st Qu.:42.5 m :54 former:49 1st Qu.:15.00 FALSE:85
## Median :58.0 NA's: 2 never :30 Median :20.12 TRUE :12
## Mean :57.3 NA's : 2 Mean :20.44 NA's :3
## 3rd Qu.:74.0 3rd Qu.:27.10
## Max. :88.0 Max. :43.24
## NA's :1 NA's :2
## labtest
## hi : 4
## lo :29
## med :20
## neg :45
## NA's: 2
##
##
attach(df)
Build list of vars
vars <- colnames(df)
vars
## [1] "age" "sex" "smoker" "size" "logitest" "labtest"
## [7] "outcome"
catvars <- NULL #categorical variables
contvars <- NULL #continuous variables
logivars <- NULL #logic variables
vars <- vars[-dx]
vars
## [1] "age" "sex" "smoker" "size" "logitest" "labtest"
for (i in 1:length(vars)) {
  ifelse(is.factor(df[, i]),
         catvars <- c(catvars, vars[i]),
         ifelse(is.logical(df[, i]),
                logivars <- c(logivars, vars[i]),
                contvars <- c(contvars, vars[i])))
}
contvars
## [1] "age" "size"
catvars
## [1] "sex" "smoker" "labtest"
logivars
## [1] "logitest"
Create the subgroups
bg <- df[df[, dx] == 0 & !is.na(df[, dx]), ]
nrow(bg) #; bg
## [1] 73
mg <- df[df[, dx] == 1 & !is.na(df[, dx]), ]
nrow(mg) #; mg
## [1] 23
indet <- df[is.na(df[, dx]), ]
nrow(indet)
## [1] 4
indet
## age sex smoker size logitest labtest outcome
## 1 70 m former 39.173 NA med NA
## 9 87 m former 23.621 FALSE lo NA
## 18 65 m former 2.466 FALSE <NA> NA
## 67 88 f former 17.575 FALSE med NA
For continuous variables
normality testing
normality <- NULL
for (i in 1:length(contvars)) {
  j <- which(vars == contvars[i])        # position of the variable in the original data frame and its subsets
  st <- shapiro.test(df[, j])            # normality testing on all patients, bg and mg alike
  normality <- c(normality, st$p.value)
}
normality
## [1] 0.00125 0.73602
comparing the means of two samples; if normal, use t-test, otherwise wilcoxon
ttpvalue <- NULL
for (i in 1:length(contvars)) {
  j <- which(vars == contvars[i])  # position of the variable in the original data frame and its subsets
  ## if normal, use the t-test, otherwise the Wilcoxon test
  ## (if the Shapiro p < .05 the population is likely NOT normally distributed)
  ifelse(normality[i] < 0.05,
         tt <- wilcox.test(bg[, j], mg[, j]),
         tt <- t.test(bg[, j], mg[, j]))
  ttpvalue <- c(ttpvalue, tt$p.value)  ## if the test p < .05 the groups likely have different means
}
ttpvalue
## [1] 0.6358 0.3673
contvarlist <- list(variables = contvars, normality = normality, ttest.by.subgroup = ttpvalue)
For categorical variables
chisqpvalue <- NULL
for (i in 1:length(catvars)) {
  j <- which(vars == catvars[i])  # position of the variable in the original data frame and its subsets
  tbl <- table(df[, j], df[, dx])
  chisqtest <- summary(tbl)
  chisqpvalue <- c(chisqpvalue, chisqtest$p.value)
}
chisqpvalue
## [1] 0.01579 0.77116 0.39484
catvarlist <- list(variables = catvars, chisq.by.subgroup = chisqpvalue)
For logic variables
proppvalue <- NULL
for (i in 1:length(logivars)) {
  j <- which(vars == logivars[i])  # position of the variable in the original data frame and its subsets
  tbl <- table(df[, j], df[, dx])
  chisqtest <- summary(tbl)
  proppvalue <- c(proppvalue, chisqtest$p.value)
}
proppvalue
## [1] 0.5551
logivarlist <- list(variables = logivars, chisq.by.subgroup = proppvalue)
And now, the results!
str(contvarlist) #if shapiro p<.05 then pop likely NOT normally dist; if t-test p<.05 then pop likely have different means
## List of 3
## $ variables : chr [1:2] "age" "size"
## $ normality : num [1:2] 0.00125 0.73602
## $ ttest.by.subgroup: num [1:2] 0.636 0.367
str(catvarlist) #if chisq p<.05 then variables are likely NOT independent
## List of 2
## $ variables : chr [1:3] "sex" "smoker" "labtest"
## $ chisq.by.subgroup: num [1:3] 0.0158 0.7712 0.3948
str(logivarlist) #if chisq p<.05 then variables are likely NOT independent
## List of 2
## $ variables : chr "logitest"
## $ chisq.by.subgroup: num 0.555
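A possible last step (a sketch, not part of the original answer): collect the per-variable p-values into one small data frame, which can then be passed to xtable::xtable or knitr::kable for the LaTeX output mentioned in the question.
table1 <- rbind(
  data.frame(variable = contvarlist$variables, p.value = contvarlist$ttest.by.subgroup),
  data.frame(variable = catvarlist$variables,  p.value = catvarlist$chisq.by.subgroup),
  data.frame(variable = logivarlist$variables, p.value = logivarlist$chisq.by.subgroup)
)
table1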