Finding differences between populations - R

I have equivalent data from 2019 and 2020. The proportions of diagnoses in 2020 look like they differ from 2019, but I'd like to ...
a) statistically test the populations are different.
b) determine which categories are the most different.
I've worked out I can do 'a' using:
chisq.test(test$count.2020, test$count.2019)
I don't know how to find out which categories differ the most between 2020 and 2019. Any help would be amazing, thanks!
diagnosis <- data.frame(mf_label = c("Audiovestibular", "Autonomic", "Cardiovascular",
"Cerebral palsy", "Cerebrovascular", "COVID", "Cranial nerves",
"CSF disorders", "Developmental", "Epilepsy and consciousness",
"Functional", "Head injury", "Headache", "Hearing loss", "Infection",
"Maxillofacial", "Movement disorders", "Muscle and NMJ", "Musculoskeletal",
"Myelopathy", "Neurodegenerative", "Neuroinflammatory", "Peripheral nerve",
"Plexopathy", "Psychiatric", "Radiculopathy", "Spinal", "Syncope",
"Toxic and nutritional", "Tumour", "Visual system"),
count.2019 = c(5, 0, 1, 1, 2, 0, 4, 3, 0, 7, 4, 0, 24, 0, 0, 2, 22, 3, 3, 0, 3, 18, 12, 0, 0, 2, 2, 0, 1, 4, 0),
count.2020 = c(5, 1, 1, 3, 28, 9, 11, 13, 1, 13, 30, 5, 68, 1, 1, 2, 57, 14, 5, 8, 16, 37, 27, 3, 13, 17, 3, 1, 8, 13, 11))

Your chi-square test is not correct: you need to provide the counts as a table or matrix, not as two separate vectors. Because half of the cells have very small expected values, you should use simulation to estimate the p-value:
results <- chisq.test(diagnosis[, 2:3], simulate.p.value=TRUE)
The overall table is barely significant at .05. The chisq.test function returns a list including the original data, the expected values, residuals, and standardized residuals. The manual page describes these (?chisq.test) and provides some citations for more details.
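For part (b), a minimal sketch (not from the original answer) of one way to use those standardized residuals from the results object above; the largest absolute values mark the diagnoses that shifted most between years:
resid_by_dx <- data.frame(diagnosis = diagnosis$mf_label,
                          round(results$stdres, 2))
# For a two-column table the two residuals in each row have equal magnitude,
# so sorting on either column ranks the categories by how much they changed.
resid_by_dx[order(-abs(resid_by_dx$count.2019)), ]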


How do I interpret the coefficients of a glm with binomial error distribution?

I would be happy if someone could help me understand a glm with binomial error distribution.
Let's assume the following df:
year<-c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
success<-c(1, 0, 3, 1, 1, 2, 6, 0, 1, 1, 12, 2, NA, 6, 12, 0, 10,
7, 4, 10, 13, 1, 2, 1, 18, 6, 3, 8, 3, 1, 9, 15, 6, 12,
6, 15, 13, 6, 8, 6, 2, 11, 6, 1, 12, 0, 4, 15, 0, 3, 18,
5, 6, 17, 5, 3, 17, 8, 0, 7, 12, 10, 26, 12, 4, 17, 1, 8,
2, 7, 14, 8)
no_success<-c(1, 9, 5, 4, 6, 1, 4, 4, 6, 10, 16, 4, NA, 3, NA, 3,
5, 5, 6, 10, 0, 5, 3, 10, 1, 7, 11, 8, 20, 4, 3, 3,
19, 1, 11, 4, 6, 4, 9, 4, 10, 4, 2, 8, 3, 1, 13, 3,
5, 7, 5, 9, 3, 6, 3, 4, 3, 13, 6, 5, 10, 3, 1, 0,
18, 6, 13, 0, 3, 2, 2, 2)
df<-data.frame(year,success,no_success)
df$success<-as.integer(df$success)
df$no_success<-as.integer(df$no_success)
If I want to know whether there is a linear increase or decrease across years in the success or no_success of a made-up treatment, I apply a binomial glm:
m <- glm(cbind(success, no_success) ~ year,
         data = df, family = "quasibinomial",
         na.action = na.exclude)
summary(m)
I changed to "quasibinomial" here because of overdispersion.
From the summary I see that there is a significant effect: P: 0.0219 *
As the coefficients in a binomial glm represent log odds,
I get exp(estimate) = exp(0.3099) = 1.363
So, there is an increase in the odds of success by a factor of 1.363 per year.
My Questions are:
1.) When I exp() a negative estimate, the result is always positive - this cannot be right. There must be a way to express negative relationships.
2.) When I want to visualize multiple linear models, I like to display the estimates.
In a "normal" lm I would display the estimate and confidence interval like this: divide the estimate by the mean of the observation and than substract and add the mean of observation/Std. Error times 1.96.
# note: df$or is not defined in the example data above
Estimate.mean <- exp(0.3099) / mean(df$or, na.rm = TRUE)
Std.Error.mean <- exp(0.1321) / mean(df$or, na.rm = TRUE)
low <- Estimate.mean - Std.Error.mean * 1.96
high <- Estimate.mean + Std.Error.mean * 1.96
If this confidence interval does not touch the zero line, the effect should be significant; if it does, the effect is not significantly different from zero.
But here the low bound is -0.3901804 and the high bound is 1.608095. This does not appear to be a significant linear relationship despite the low p-value from the glm (0.0219).
What have I mixed up here?
I am happy for any suggestions
The "zero line" in this case is x=1 and not x=0.
Question 2:
the question is. Is there a effect that is different from zero?
But odds of 1 basicaly means zero.
Question 1:
When the estimate is exp the result can not be negative.But odds below 1 express a negative effect.
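For example (a small illustration using the magnitude of the estimate from the question; the negative version is hypothetical):
exp(0.3099)    # 1.363: the odds are multiplied by 1.363 per year (an increase)
exp(-0.3099)   # about 0.73: the odds are multiplied by ~0.73 per year (a decrease)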
Here are some sources on calculating the confidence interval for anyone stumbling over this post (a short sketch follows the links).
https://fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
https://stats.stackexchange.com/questions/304833/how-to-calculate-odds-ratio-and-95-confidence-interval-for-logistic-regression
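Following those sources, a minimal sketch (assuming the quasibinomial model m fitted above) of getting a confidence interval on the odds-ratio scale, by exponentiating the endpoints of an interval built on the log-odds scale rather than exponentiating the estimate and standard error separately:
est <- summary(m)$coefficients["year", "Estimate"]
se  <- summary(m)$coefficients["year", "Std. Error"]

# Wald-type interval on the log-odds scale, then exponentiated:
exp(est + c(lower = -1.96, upper = 1.96) * se)

# Profile-based intervals via confint(), also exponentiated:
exp(confint(m))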

How to plot the expected value of the difference in values in paired data using ggplot2

I have paired data as below and I want to plot the expected value of the difference in the value (the column called value) within pairs. In each pair, one member has the disease and the other does not, as you can see from the data. In other words, I want the expected value of the difference in value between one sibling and the other.
The description of the variable in the data are:
id = individual ID
familyID = family ID showing their dependency (i.e. which two individuals form a pair)
status = 1 means disease and status = 0 means no-disease
Any guidance is appreciated.
d <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
familyID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
status = c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
I'm not certain if this is what you are looking for, but I used pivot_wider from tidyr to spread the values into two columns, those with status 0 and those with status 1. Then I used mutate to take the difference between the two columns, and plotted familyID against the newly created difference with ggplot. Note that I removed the id column for pivot_wider to work.
library(dplyr)
library(tidyr)
library(ggplot2)
d %>%
  select(-id) %>%
  pivot_wider(values_from = value, names_from = status) %>%
  mutate(Diff = `0` - `1`) %>%
  ggplot() +
  aes(as.character(familyID), Diff) +
  geom_point()
You can group by familyID, then use summarize() from the dplyr package to find the differences.
Also note the conversion of id, familyID, and status to factors, which may make life easier since they won't be mistaken for numeric variables.
library(dplyr)
library(forcats)
library(ggplot2)
d <- structure(list(id = as.factor(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)),
familyID = as.factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10)),
status = as.factor(c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1)),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
diffs <- group_by(d, familyID) %>%
  summarize(diff = value[status == 0] - value[status == 1])
Reordering the families by difference can help you get a sense of the distribution of differences:
diffs$familyID <- fct_reorder(diffs$familyID, diffs$diff, .desc = TRUE)
ggplot(diffs, aes(x = familyID, y = diff)) +
geom_bar(stat="identity")
If you really have a lot of families you may want to display a summary of the differences.
One option is with a histogram (modifying binwidth can control how fine the bins are):
ggplot(diffs, aes(x = diff)) +
geom_histogram(binwidth = 3)
Similar to a histogram is a density plot:
ggplot(diffs, aes(x = diff)) +
geom_density()
Finally, a boxplot is also a familiar summary. Boxplots are mostly meant for comparing multiple groups, but one works okay with a single group. I've added the individual points using the geom_jitter() function.
ggplot(diffs, aes(y = diff)) + #If using multiple groups add x=group inside the aes() function.
geom_boxplot() +
geom_jitter(aes(x = 0))

survival::tmerge: how to manage events that are not actually intended as multiple events?

I'm struggling to set up a data frame for multistate survival analysis. Here is a reproducible example with only 3 individuals (ID); it is only part of a larger multistate setup.
f <- structure(list(ID = c(3, 4, 5), time_to_end = c(30, 36, 36)),
               row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
f.long <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5),
                         resp_pois = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                         time = c(6, 12, 18, 24, 30, 36, 42, 48, 6, 12, 18, 24, 30, 36, 42, 48, 6, 12, 18, 24, 30, 36, 42, 48)),
                    row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"))
f includes 3 individuals taking a new drug and observed until time_to_end.
f.long includes the biochemical response at different time points.
E.g., ID 3 responds to the drug immediately and up to 24 months, but at 30 months there is evidence of a lack of biochemical response, while ID 4 shows only an isolated response at 24 months, never before or after.
I'm trying to manage this reversible condition between response and no-response with tmerge, as follows:
library(survival)
library(dplyr)

f.merge <- tmerge(f %>% select(ID), f, id = ID, tstart = 0, tstop = time_to_end)
f.merge <- tmerge(f.merge, f.long, id = ID, response = event(time, resp_pois))
survfit(Surv(tstart, tstop, response) ~ 1, data = f.merge)
The problem is that tmerge interprets every resp_pois == 1 as a new event, so in the end the survfit function gives 5 events instead of 2.
Can someone suggest a solution? Am I misusing tmerge?
I got this answer from Prof Terry Therneau by e-mail
Okay,
0. When debugging a purported issue with the survival package, my first step is to run in an R session with the survival package loaded, and NOTHING else.
I can't read your data set: I'm getting a syntax error somewhere. But by cutting and pasting, I was able to create a "de tibbled" version of the data.
But what exactly is the data? The variable name "resp_pois" means nothing to me. You gave some information on where you want to go, but none about where you are starting. I will hazard a guess that you are dealing with panel data, i.e., subjects come in at regular intervals and you measure their state at each visit?
Now, tmerge has no way to distinguish a data set with multiple heart attacks and one row per attack from a panel-study data set.
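One way to act on that distinction (my own sketch, not part of the emailed reply, and assuming the goal is to carry the biochemical response along as a reversible state rather than count every measurement as an event) is to bring resp_pois in as a time-dependent covariate with tdc() instead of event():
library(survival)

# De-tibble the data, as suggested above.
f2      <- as.data.frame(f)
f.long2 <- as.data.frame(f.long)

# tdc() updates a covariate at each measurement time instead of registering
# a new event, so repeated resp_pois == 1 rows no longer inflate the event count.
f.merge <- tmerge(f2["ID"], f2, id = ID, tstart = 0, tstop = time_to_end)
f.merge <- tmerge(f.merge, f.long2, id = ID, resp = tdc(time, resp_pois))
head(f.merge)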

Generating predictions from an aggregated binomial regression

Assessing model accuracy is reasonably easy with Bernoulli outcomes, but I am unsure how to generate meaningful predictions from an aggregated binomial regression.
Take this example. We want to model the number of drug counselling sessions (variable numCouns) a client attends over a twelve-week period based on: (1) how many years they had been using cannabis regularly prior to starting treatment (variable durationRegUse) and (2) the number of grams of cannabis they used on an average day (variable gms). The maximum number of counselling sessions each client can attend is six.
Here is the data
df <- data.frame(durationRegUse = c(19, 9, 13, 19, 10, 13, 2, 14, 11, 12, 7, 6, 3, 18, 17, 9, 9, 10, 0, 20, 4, 4, 8, 5, 4, 19, 25, 10, 27, 1, 10, 25, 8, 24, 8, 18, 15, 10, 6, 14, 16, 13, 4, 4, 5, 17, 13, 21, 8, 7, 10, 17, 13, 12, 28, 38, 23, 19, 36, 3, 14, 14, 22, 11, 26, 17, 4, 8, 25, 35, 14, 28, 32, 29, 22, 21, 2, 23, 35, 34, 31, 34, 15, 14, 26, 6, 3, 25, 24, 31, 31, 27, 30, 14.5, 12, 9, 3, 13, 5, 6, 23, 21, 27, 7, 36, 19, 22, 15, 11, 17, 11, 26, 21, 15),
gms = c(3.5, 2, 0.5, 10, 3, 3, 4, 4, 2, 2, 2, 2, 2, 2, 1, 1.75, 4, 1.75, 0.33, 5, 2.5, 1.25, 1, 0.5, 3, 2, 5, 3, 3, 0.571, 1, 0.5, 2, 4, 2.5, 1.25, 1.5, 1, 2.5, 2, 1, 2, 1.5, 2, 0.2, 1, 1, 2, 14, 2, 3.5, 3, 2, 1.75, 2, 0.55, 1, 2, 6, 0.5, 0.5, 0.5, 3, 1, 2.75, 4.5, 3, 3, 3, 2, 2, 1, 2.5, 1.75, 1, 1.5, 2, 0.7, 7, 0.5, 2, 1.2, 0.4, 3, 0.8, 1.3, 1.2, 2, 1.5, 3, 2, 2, 4, 3, 1, 6, 1, 0.5, 1.5, 2.5, 1, 2.5, 1.5, 1, 1.5, 2.5, 1.5, 2.5, 10, 1.5, 1.5, 0.5, 5, 1.5),
numCouns = c(6, 1, 2, 6, 0, 6, 0, 0, 2, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 2, 5, 6, 0, 0, 6, 0, 6, 3, 6, 0, 0, 0, 4, 5, 0, 0, 4, 0, 4, 3, 0, 1, 2, 6, 4, 2, 4, 3, 1, 0, 2, 2, 5, 2, 0, 1, 3, 0, 3, 2, 1, 6, 0, 0, 1, 0, 1, 2, 0, 0, 5, 1, 1, 1, 5, 3, 5, 6, 6, 5, 3, 6, 2, 4, 3, 4, 6, 1, 0, 6, 4, 3, 3, 1, 5, 0, 1, 1, 6, 6, 6, 3, 3, 2, 0, 0, 5, 1, 6, 3, 0, 0))
To model it as an aggregated binomial regression we need to create a coverage variable (the maximum number of sessions):
df$coverage <- 6
Now we can create the aggregated binomial regression model
aggBinMod <- glm(
  formula = cbind(numCouns, coverage - numCouns) ~ durationRegUse + gms,
  data = df,
  family = binomial(link = "logit"))
And here is the output
summary(aggBinMod)
#output
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.157570 0.183116 -6.322 2.59e-10 ***
# durationRegUse 0.035975 0.008455 4.255 2.09e-05 ***
# gms 0.075838 0.039273 1.931 0.0535 .
Now for the part I am unsure of: how to generate predictions with which to assess model accuracy. As I understand it, if we use the predict() function with type = "response" we get a predicted per-trial probability of drawing a 1, i.e. a value on the Bernoulli response scale [0, 1].
predBin <- predict(aggBinMod, type = "response")
predBin
# (predicted Bernoulli probability for the first 16 participants)
# 1 2 3 4 5 6 7 8
# 0.4480346 0.3357882 0.3425441 0.5706073 0.3611657 0.3864206 0.3138308 0.4132440
# 9 10 11 12 13 14 15 16
# 0.3520203 0.3602692 0.3199350 0.3121589 0.2894678 0.4113600 0.3845787 0.3315728
Following that logic, to generate predictions of the number of sessions for each client from the aggregated binomial model, we should be able to simply multiply this probability by the number of trials we wish to predict, in our case 6. So to generate the predictions we would run
predBin6 <- predict(aggBinMod, type = "response")*6
predBin6
# predicted number of sessions (out of a possible 6) for the first 18 clients
# 1 2 3 4 5 6 7 8 9
# 2.688208 2.014729 2.055265 3.423644 2.166994 2.318524 1.882985 2.479464 2.112122
# 10 11 12 13 14 15 16 17 18
# 2.161615 1.919610 1.872954 1.736807 2.468160 2.307472 1.989437 2.222478 2.037563
And from there it is straightforward to assess model accuracy via the mean squared error
error <- predBin6 - df$numCouns
mse <- mean(error^2)
mse
# output
# [1] 4.871892
So my question is: is this the correct way to generate predictions from an aggregated binomial regression?
More or less, yes.
Instead of hard-coding the fact that there are 6 trials per observation (in some applications the number of trials differs from observation to observation), I would recommend
predBin6 <- predict(aggBinMod, type = "response")*weights(aggBinMod)
(which should give the same answer in your case).
I would also say that MSE is reasonable, but not necessarily the best measure of predictive accuracy for a binomial model (it doesn't take the dependence of the variance on the mean into account). (I don't have a particular alternative recommendation, but the deviance (deviance(aggBinMod)) or something similar might be appropriate.)
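For completeness, a small sketch (using the objects already defined in the question) of both suggestions:
# weights() returns the number of trials per observation (all 6 here),
# so this reproduces predBin6 without hard-coding the 6:
predBin6w <- predict(aggBinMod, type = "response") * weights(aggBinMod)
all.equal(predBin6w, predBin6)

# Model deviance as an alternative summary of fit:
deviance(aggBinMod)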

How can I calculate weighted standard errors and plot them in a bar plot?

I have a data frame of counts. I would like to calculate weighted proportions, plot the proportions, and also plot standard error bars for these weighted proportions.
Sample of my data frame:
head(df[1:4,])
badge year total b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 b_10
1 15 2014 14 3 2 1 1 1 1 1 1 1 1
2 15 2015 157 13 12 11 8 6 6 6 5 5 5
3 15 2016 15 5 3 1 1 1 1 1 1 1 0
4 2581 2014 13 1 1 1 1 1 1 1 1 1 1
The data contain counts of 911 calls officers respond to in ten different police beats (b_1, b_2,...) in a given year. So officer 15 responds to 14 calls total in 2014, 3 of which were in beat 1, 2 in beat 2, and so on.
Essentially, what I want is to get the overall proportion of calls that occur within each beat. But I want these proportions to be weighted by the total number of calls.
So far, I've been able to calculate this by just adding the values within each b_ column and the total column, and calculating proportions. I have plotted these in a simple bar plot. I haven't been able to figure out how to calculate standard errors that are weighted by total.
I have no preference for how the data are plotted. I'm mainly interested in getting the right standard errors.
Here is the code I have so far:
sums_by_beat <- apply(df[, grep('b_', colnames(df))], 2, sum)
props_by_beat <- sums_by_beat / sum(df$total)
# Bar plot of proportions by beat
barplot(props_by_beat, main='Distribution of Calls by Beat',
xlab="Nth Most Common Division", ylim=c(0,1),
names.arg=1:length(props_by_beat), ylab="Percent of Total Calls")
And a 30-row sample of my data:
df <- structure(list(badge = c(15, 15, 15, 2581, 2581, 2745, 2745,
3162, 3162, 3162, 3396, 3650, 3650, 3688, 3688, 3688, 3698, 3698,
3698, 3717, 3717, 3717, 3740, 3740, 3740, 3813, 3873, 3907, 3930,
4007), year = c(2014, 2015, 2016, 2014, 2015, 2015, 2016, 2014,
2015, 2016, 2016, 2014, 2015, 2014, 2015, 2016, 2014, 2015, 2016,
2014, 2015, 2016, 2014, 2015, 2016, 2016, 2015, 2014, 2014, 2014
), total = c(14, 157, 15, 13, 29, 1, 1, 754, 1172, 1039, 14,
1, 2, 34, 57, 146, 3, 7, 28, 593, 1036, 1303, 461, 952, 1370,
1, 4, 41, 5, 451), b_1 = c(3, 13, 5, 1, 3, 1, 1, 33, 84, 83,
2, 1, 2, 5, 10, 14, 2, 7, 7, 39, 72, 75, 42, 69, 81, 1, 1, 7,
1, 36), b_2 = c(2, 12, 3, 1, 2, 0, 0, 33, 61, 52, 2, 0, 0, 3,
6, 8, 1, 0, 2, 37, 65, 70, 29, 65, 75, 0, 1, 5, 1, 23), b_3 = c(1,
11, 1, 1, 2, 0, 0, 32, 57, 45, 2, 0, 0, 3, 5, 8, 0, 0, 2, 34,
62, 67, 28, 50, 73, 0, 1, 3, 1, 22), b_4 = c(1, 8, 1, 1, 2, 0,
0, 31, 44, 39, 2, 0, 0, 3, 3, 7, 0, 0, 2, 34, 61, 67, 26, 42,
72, 0, 1, 3, 1, 21), b_5 = c(1, 6, 1, 1, 1, 0, 0, 30, 42, 37,
1, 0, 0, 3, 3, 7, 0, 0, 1, 33, 53, 61, 23, 42, 67, 0, 0, 2, 1,
21), b_6 = c(1, 6, 1, 1, 1, 0, 0, 30, 40, 36, 1, 0, 0, 2, 2,
6, 0, 0, 1, 32, 53, 61, 22, 41, 63, 0, 0, 2, 0, 21), b_7 = c(1,
6, 1, 1, 1, 0, 0, 26, 39, 35, 1, 0, 0, 2, 2, 6, 0, 0, 1, 30,
47, 58, 22, 39, 62, 0, 0, 2, 0, 21), b_8 = c(1, 5, 1, 1, 1, 0,
0, 26, 39, 33, 1, 0, 0, 2, 2, 6, 0, 0, 1, 30, 47, 58, 21, 38,
59, 0, 0, 2, 0, 19), b_9 = c(1, 5, 1, 1, 1, 0, 0, 24, 34, 33,
1, 0, 0, 2, 2, 5, 0, 0, 1, 30, 43, 57, 20, 37, 57, 0, 0, 2, 0,
15), b_10 = c(1, 5, 0, 1, 1, 0, 0, 23, 34, 32, 1, 0, 0, 1, 2,
5, 0, 0, 1, 27, 40, 56, 18, 36, 55, 0, 0, 2, 0, 14)), row.names = c(NA,
30L), class = "data.frame")
There isn't (as far as I know) a built-in R function to calculate the standard error of a weighted mean, but it is fairly straightforward to calculate - with some assumptions that are probably valid in the case you describe.
See, for instance:
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Standard_error
Standard error of the weighted mean
If the elements used to calculate the weighted mean are samples from populations that all have the same variance v, then the variance of the weighted sample mean is estimated as:
var_m = v * sum( wnorm^2 ) # wnorm = weights normalized to sum to 1
And the standard error of the weighted mean is equal to the square root of the variance.
sem = sqrt( var_m )
So, we need to calculate the sample variance from the weighted data.
Weighted variance
The weighted population variance (or biased sample variance) is calculated as:
pop_v = sum( w * (x-mean)^2 ) / sum( w )
However, if (as in the case you describe) we are working with samples taken from the population, rather than with the population itself, we need to make an adjustment to obtain an unbiased sample variance.
If the weights represent the frequencies of observations underlying each of the elements used to calculate the weighted mean & variance, then the adjustment is:
v = pop_v * sum( w ) / ( sum( w ) -1 )
However, this is not the case here, as the weights are the total frequencies of 911 calls for each policeman, not the calls for each beat. So in this case the weights correspond to the reliabilities of each element, and the adjustment is:
v = pop_v * sum( w )^2 / ( sum( w )^2 - sum( w^2) )
weighted.var and weighted.sem functions
Putting all this together, we can define weighted.var and weighted.sem functions, similar to the base R weighted.mean function (note that several R packages, for instance "Hmisc", already include more-versatile functions to calculate the weighted variance):
weighted.var = function(x,w,type="reliability") {
m=weighted.mean(x,w)
if(type=="frequency"){ return( sum(w*(x-m)^2)/(sum(w)-1) ) }
else { return( sum(w*(x-m)^2)*sum(w)/(sum(w)^2-sum(w^2)) ) }
}
weighted.sem = function(x,w,...) { return( sqrt(weighted.var(x,w,...)*sum(w^2)/sum(w)^2) ) }
Applied to the 911 call data in the question
In the case of the question, the elements from which we want to calculate the weighted mean and weighted sem correspond to the proportions of calls in each beat, for each policeman.
So (finally...):
props <- t(apply(df, 1, function(row) row[-(1:3)] / row[3]))
wmean_props <- apply(props, 2, function(col) weighted.mean(col, w = df[, 3]))
wsem_props  <- apply(props, 2, function(col) weighted.sem(col, w = df[, 3]))
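To put these on a plot in the same style as the question's barplot, a brief sketch with approximate 95% error bars (wmean_props +/- 1.96 * wsem_props):
# Bar plot of the weighted proportions, storing the x-positions for arrows()
b <- barplot(wmean_props, main = 'Distribution of Calls by Beat',
             xlab = "Nth Most Common Division", ylim = c(0, 1),
             names.arg = 1:length(wmean_props), ylab = "Percent of Total Calls")
arrows(b, wmean_props - 1.96 * wsem_props, b, wmean_props + 1.96 * wsem_props,
       length = .02, angle = 90, code = 3)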
Aren't your "proportions" actually the mean of the weighted (by total) observations? Then we could simply calculate the weighted colMeans accordingly.
df2 <- df[, grep('b_', colnames(df))]
means.w <- colMeans(df2 / df$total)
For the error bars we could use the quantiles at alpha/2 and 1 - alpha/2, i.e. for alpha = .05 we use c(.025, .975). (Error bars based on the analytical SDs would extend below zero.)
q.w <- t(apply(df2 / df$total, 2, quantile, c(.025, .975)))
Now we store the x-positions that barplot returns invisibly,
# Bar plot of proportions by beat
b <- barplot(means.w, main='Distribution of Calls by Beat',
xlab="Nth Most Common Division", ylim=c(0,1),
names.arg=1:length(means.w), ylab="Percent of Total Calls")
and construct the error bars with arrows.
arrows(b, q.w[,1], b, q.w[,2], length=.02, angle=90, code=3)
