fit a normal distribution to grouped data, giving expected frequencies - r

I have a frequency distribution of observations, grouped into counts within class intervals.
I want to fit a normal (or other continuous) distribution, and find the expected frequencies in each interval according to that distribution.
For example, take the data below. I want to calculate another column, expected, giving the
expected number of soldiers with chest circumferences in the interval given by chest, where each
interval is assumed to be centered on the nominal value, e.g., 35 means 34.5 <= y < 35.5. One analysis I've seen gives the expected frequency in this cell as 72.5 vs. the observed 81.
> data(ChestSizes, package="HistData")
>
> ChestSizes
chest count
1 33 3
2 34 18
3 35 81
4 36 185
5 37 420
6 38 749
7 39 1073
8 40 1079
9 41 934
10 42 658
11 43 370
12 44 92
13 45 50
14 46 21
15 47 4
16 48 1
>
> # ungroup to a vector of values
> chests <- vcdExtra::expand.dft(ChestSizes, freq="count")
There are quite a number of variations of this question, most of which relate to plotting the normal density on top of a histogram, scaled to represent counts not density. But none explicitly show the calculation of the expected frequencies. One close question is R: add normal fits to grouped histograms in ggplot2
I can perfectly well do the standard plot (below), but for other things, like a Chi-square test or a vcd::rootogram plot, I need the expected frequencies in the same class intervals.
library(ggplot2)
bw <- 1
n_obs <- nrow(chests)
xbar <- mean(chests$chest)
std <- sd(chests$chest)
plt <-
ggplot(chests, aes(chest)) +
geom_histogram(color="black", fill="lightblue", binwidth = bw) +
stat_function(fun = function(x)
dnorm(x, mean = xbar, sd = std) * bw * n_obs,
color = "darkred", size = 1)
plt

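Here is how you could calculate the expected frequencies for each group, assuming normality. The class boundaries are the nominal values plus or minus 0.5, so the expected count in a class is the difference of the normal CDF at its two boundaries, multiplied by the total count.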
xbar <- with(ChestSizes, weighted.mean(chest, count))
sdx <- with(ChestSizes, sd(rep(chest, count)))
transform(ChestSizes, Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))
chest count Expected
1 33 3 4.7600583
2 34 18 20.8822328
3 35 81 72.5129162
4 36 185 199.3338028
5 37 420 433.8292832
6 38 749 747.5926687
7 39 1073 1020.1058521
8 40 1079 1102.2356155
9 41 934 943.0970605
10 42 658 638.9745241
11 43 370 342.7971793
12 44 92 145.6089948
13 45 50 48.9662992
14 46 21 13.0351612
15 47 4 2.7465640
16 48 1 0.4579888
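For the Chi-square test mentioned in the question, the expected counts should add up to the total, which means the first and last classes have to absorb the tails of the distribution. A minimal sketch, assuming open-ended end classes and a Pearson statistic with two estimated parameters; the names breaks_open, expected_open, chisq_stat and dof are made up for this example, and the data are typed in from the listing above:

```r
# ChestSizes data entered by hand so the example is self-contained
ChestSizes <- data.frame(
  chest = 33:48,
  count = c(3, 18, 81, 185, 420, 749, 1073, 1079, 934, 658,
            370, 92, 50, 21, 4, 1)
)

n    <- sum(ChestSizes$count)
xbar <- with(ChestSizes, weighted.mean(chest, count))
sdx  <- with(ChestSizes, sd(rep(chest, count)))

# Closed classes: boundaries at 32.5, 33.5, ..., 48.5 (same as the answer above)
breaks   <- c(ChestSizes$chest - 0.5, max(ChestSizes$chest) + 0.5)
expected <- diff(pnorm(breaks, xbar, sdx)) * n

# Open-ended end classes, so that the expected counts sum to n
breaks_open <- breaks
breaks_open[c(1, length(breaks_open))] <- c(-Inf, Inf)
expected_open <- diff(pnorm(breaks_open, xbar, sdx)) * n

# Pearson chi-square; dof = classes - 1 - 2 estimated parameters (mean, sd)
chisq_stat <- sum((ChestSizes$count - expected_open)^2 / expected_open)
dof        <- nrow(ChestSizes) - 1 - 2
p_value    <- pchisq(chisq_stat, dof, lower.tail = FALSE)
```

The end classes have small expected counts, so pooling adjacent classes before testing may be worth considering.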

Related

Add category mean value to faceted scatter plots in ggplot

I am using facet wrap to plot Weight Gain versus Caloric Intake for four different diets. Diet is a four-level factor, Weight Gain and Caloric Intake are numeric. I am adding a regression line to each plot facet. What I want to do is add a horizontal line for the group mean weight gain for each diet in the plot (4 different mean values). The problem is when I use the geom_hline function it puts the global mean on all of the plots, which is not what I want.
I tried using stat_summary(fun.y=mean,geom="line"), but it gives me line segments joining each of the points in every plot.
Below is the code I am using that is giving me the single global mean on all plots. Also the data set I am using. I've included the labeller code for completeness but I really just need help with drawing the group mean lines.
Thanks in advance for any help.
# Calculate slopes and means to use for facet labels
#
wgSlope<-rep(NA,nlevels(vitaminData$Diet))
dietMeans<-rep(NA,nlevels(vitaminData$Diet))
for (i in 1:nlevels(vitaminData$Diet)){
dietMeans[i]<-mean(filter(vitaminData,Diet==i)$WeightGain)
#
# Get regression lines and coefficients for each facet
#
lm<-lm(WeightGain~CaloricIntake,data=filter(vitaminData,Diet==i))
wgSlope[i]<-lm$coefficients[2]
}
#
# Build facet labels
#
dietLabel<-c(`1`=
paste("Diet 1, Slope=",round(wgSlope[1],2),", Mean=",round(dietMeans[1],1)),
`2`=paste("Diet 2, Slope=",round(wgSlope[2],2),", Mean=",round(dietMeans[2],1)),
`3`=paste("Diet 3, Slope =",round(wgSlope[3],2),", Mean=",round(dietMeans[3],1)),
`4`=paste("Diet 4, Slope =",round(wgSlope[4],2),", Mean=",round(dietMeans[4],1)))
#
# Draw the plots
#
ggplot(data=vitaminData,
aes(y=WeightGain,x=CaloricIntake,color=Diet))+
theme_bw()+
geom_point(aes(color=Diet,fill=Diet,shape=Diet))+
geom_smooth(method="lm",se=FALSE,linetype=2,alpha=0.5)+
labs(x="Caloric Intake",y="Weight Gain")+
scale_color_manual(values=c("red","blue","orange","darkgreen"))+
geom_hline(yintercept=mean(vitaminData$WeightGain))+
facet_wrap(~Diet,labeller=labeller(Diet=dietLabel))+
theme(legend.position="none")
Diet WeightGain CaloricIntake
<fct> <dbl> <dbl>
1 1 48 35
2 1 67 44
3 1 78 44
4 1 69 51
5 1 53 47
6 2 65 40
7 2 49 45
8 2 37 37
9 2 73 53
10 2 63 42
11 3 79 51
12 3 52 41
13 3 63 47
14 3 65 47
15 3 67 48
16 4 59 53
17 4 50 52
18 4 59 52
19 4 42 45
20 4 34 38
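Here's an approach using dplyr: give geom_hline its own data frame with one mean per Diet, so each facet gets its own line. Use this layer in place of the existing geom_hline call. (Add library(dplyr) or library(tidyverse) if not already loaded.)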
geom_hline(data = vitaminData %>%
group_by(Diet) %>%
summarize(mean = mean(WeightGain)),
aes(yintercept = mean)) +
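
The same idea as a complete, runnable plot. This is only a sketch: it computes the per-diet means with base aggregate() instead of dplyr, drops the custom labeller for brevity, and rebuilds vitaminData from the listing in the question.

```r
library(ggplot2)

# Data typed in from the question
vitaminData <- data.frame(
  Diet = factor(rep(1:4, each = 5)),
  WeightGain = c(48, 67, 78, 69, 53, 65, 49, 37, 73, 63,
                 79, 52, 63, 65, 67, 59, 50, 59, 42, 34),
  CaloricIntake = c(35, 44, 44, 51, 47, 40, 45, 37, 53, 42,
                    51, 41, 47, 47, 48, 53, 52, 52, 45, 38)
)

# One row per diet, holding that diet's mean weight gain
dietMeans <- aggregate(WeightGain ~ Diet, data = vitaminData, FUN = mean)

p <- ggplot(vitaminData, aes(x = CaloricIntake, y = WeightGain, color = Diet)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = 2) +
  # This layer's data has a Diet column, so each facet gets its own line
  geom_hline(data = dietMeans, aes(yintercept = WeightGain)) +
  facet_wrap(~Diet) +
  theme_bw() +
  theme(legend.position = "none")
p
```

Because the summary data frame carries the faceting variable, ggplot2 splits the horizontal lines across panels the same way it splits the points.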

how to add regression lines for each factor on a plot

I've created a model and I'm trying to add curves that fit the two parts of the data, insulation and no insulation. I was thinking about using the insulation coefficient as a true/false term, but I'm not sure how to translate that into code. Entries 1:56 are "w/o" and 57:101 are "w/". I'm not sure how to include the data I'm using but here's the head and tail:
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
1 8 2003 476 21 a 33.32 69 -8 22.66667 1 w/o
2 9 2003 1052 30 e 112.33 73 -1 35.05172 2 w/o
3 10 2003 981 28 a 24.98 60 -6 35.05172 3 w/o
4 11 2003 1094 32 a 73.51 53 2 34.18750 4 w/o
5 12 2003 1409 32 a 93.23 44 6 44.03125 5 w/o
6 1 2004 1083 32 a 72.84 34 3 33.84375 6 w/o
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
96 7 2011 551 29 e 55.56 72 0 19.00000 96 w/
97 8 2011 552 27 a 61.17 78 1 20.44444 97 w/
98 9 2011 666 34 e 73.87 71 -2 19.58824 98 w/
99 10 2011 416 27 a 48.03 64 0 15.40741 99 w/
100 11 2011 653 31 e 72.80 53 1 21.06452 100 w/
101 12 2011 751 33 a 83.94 45 2 22.75758 101 w/
bill$id <- seq(1:101)
bill$insulation <- as.factor(ifelse(bill$id > 56, c("w/"), c("w/o")))
m1 <- lm(kWhd.1 ~ avgT + insulation + I(avgT^2), data=bill)
with(bill, plot(kWhd.1 ~ avgT, xlab="Average Temperature (F)",
ylab="Daily Energy Use (kWh/d)", col=insulation))
no_ins <- data.frame(bill$avgT[1:56], bill$insulation[1:56])
curve(predict(m1, no_ins=x), add=TRUE, col="red")
ins <- data.frame(bill$avgT[57:101], bill$insulation[57:101])
curve(predict(m1, ins=x), add=TRUE, lty=2)
legend("topright", inset=0.01, pch=21, col=c("red", "black"),
legend=c("No Insulation", "Insulation"))
ggplot2 makes this a lot easier than base plotting. Something like this should work:
ggplot(bill, aes(x = avgT, y = kWhd.1, color = insulation)) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) +
geom_point()
In base, I'd create a data frame with the points you want to predict on. It needs the model's predictors (avgT and insulation), not the response, something like
pred_data = expand.grid(
avgT = seq(min(bill$avgT), max(bill$avgT), length.out = 100),
insulation = c("w/", "w/o")
)
pred_data$prediction = predict(m1, newdata = pred_data)
And then use lines to add the predictions to your plot. My base graphics is pretty rusty, so I'll leave that to you (or another answerer) if you want it.
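A sketch of that last step in base graphics. Since the bill data are not fully available, this uses the built-in mtcars data as a stand-in (an assumption of this example): mpg plays the role of kWhd.1, wt the role of avgT, and transmission type am the role of insulation.

```r
# Stand-in data: mtcars, with am as a two-level factor
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Same model shape as m1: linear and quadratic term plus a factor shift
fit <- lm(mpg ~ wt + am + I(wt^2), data = cars)

plot(cars$wt, cars$mpg, col = as.integer(cars$am),
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Grid of predictor values: a sequence over x, crossed with each factor level
pred_data <- expand.grid(
  wt = seq(min(cars$wt), max(cars$wt), length.out = 100),
  am = levels(cars$am)
)
pred_data$prediction <- predict(fit, newdata = pred_data)

# One fitted curve per factor level, colored to match the points
for (i in seq_along(levels(cars$am))) {
  sub <- pred_data[pred_data$am == levels(cars$am)[i], ]
  lines(sub$wt, sub$prediction, col = i)
}
legend("topright", legend = levels(cars$am), col = 1:2, lty = 1, pch = 1)
```

Because the grid is generated in increasing order of the x variable, the lines are drawn without any need to sort first.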
In base R it's important to order the x-values. Since this is to be done on multiple factors, we can do this with by, resulting in a list L.
Since your example data is not complete, here's an example with iris where we consider Species as the "factor".
L <- by(iris, iris$Species, function(x) x[order(x$Petal.Length), ])
Now we can do the plot and add loess predictions as lines with a sapply.
with(iris, plot(Sepal.Width ~ Petal.Length, col=Species))
sapply(seq(L), function(x)
lines(L[[x]]$Petal.Length,
predict(loess(Sepal.Width ~ Petal.Length, L[[x]], span=1.1)), # span=1.1 for smoothing
col=x))

cluster analysis with weight

I have a data frame 'heat' demonstrating people's performance across time.
'Var1' represents the code of persons.
'Var2' represents a time line (measured by number of days from the starting point).
'value' is the score they get at a given time point.
Var1 Var2 value
1 1 36 -0.6941826
2 2 36 -0.5585414
3 3 36 0.8032384
4 4 36 0.7973031
5 5 36 0.7536959
6 6 36 -0.5942059
....
54 10 73 0.7063218
55 11 73 -0.6949616
56 12 73 -0.6641516
57 13 73 0.6890433
58 14 73 0.6310124
59 15 73 -0.6305091
60 16 73 0.6809655
61 17 73 0.8957870
....
101 13 110 0.6495796
102 14 110 0.5990869
103 15 110 -0.6210600
104 16 110 0.6441960
105 17 110 0.7838654
....
Now I want to cluster their performance and reflect it on a heatmap. So I used the functions dist() and hclust() to cluster the data frame and plotted it with ggplot2:
ggplot(data = heat) + geom_tile(aes(x = Var2, y = Var1 %>% as.character(),
fill = value)) +
scale_fill_gradient(low = "yellow",high = "red") +
geom_vline(xintercept = c(746, 2142, 2917))
It looks like this:
However, I am more interested in what happened around day 746, day 2142 and day 2917 (the black lines). I would like the scores around these days bearing more weight in the clustering. I want people demonstrating similar performance around these days to have more priority to be clustered together. Is there a way of doing this?
As long as your weights are integers, you should be able to just replicate those days artificially, i.e., repeat each day's column as many times as its weight before computing distances.
If you want more control, just compute the distance matrix yourself, with whatever weighted distance you want to use.
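A minimal sketch of the second suggestion, assuming a weighted Euclidean distance. The scores matrix here is simulated (one row per person, one column per day), and the window of 40 days and the weight of 3 are arbitrary choices for illustration. Scaling each column by the square root of its weight and then calling dist() gives the weighted distance, which for integer weights matches replicating the columns.

```r
set.seed(1)

# Simulated wide matrix: rows are persons, columns are days (made-up values)
days   <- c(36, 73, 110, 746, 783, 2142, 2917)
scores <- matrix(rnorm(6 * length(days)), nrow = 6,
                 dimnames = list(paste0("person", 1:6), days))

# Days close to a key day get a larger weight (hypothetical window and weight)
key_days <- c(746, 2142, 2917)
near_key <- sapply(days, function(d) any(abs(d - key_days) <= 40))
w        <- ifelse(near_key, 3, 1)

# Weighted Euclidean distance: sqrt(sum(w * (x - y)^2))
weighted <- sweep(scores, 2, sqrt(w), `*`)
d_w      <- dist(weighted)

# Hierarchical clustering on the weighted distances
hc <- hclust(d_w)
plot(hc)
```

The row order in hc$order can then be used to arrange the persons on the y axis of the heatmap.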

Use vector to make probability table

I'd like to build a probability table from a vector of values, showing which values are divisible by 7 and by 5: the marginal probabilities for each, and the conditional probability of being divisible by 5 given divisible by 7.
Let's assume this is my data:
> prop.table(table(x)) # discrete number and its probability
20 22 23 24 25 26 27 28 29 30 31
0.000152 0.000625 0.000796 0.001224 0.003138 0.003043 0.004549 0.006444 0.005938 0.009301 0.009456
32 33 34 35 36 37 38 39 40 41 42
0.013448 0.019839 0.018596 0.026613 0.028902 0.027377 0.035156 0.041379 0.041092 0.047733 0.055827
43 44 45 46 47 48 49 50 51 52 53
0.046099 0.051624 0.055131 0.049779 0.056992 0.049801 0.052912 0.031924 0.049114 0.022880 0.042279
54 55 56 57 58 59 61 63 65
0.013946 0.032340 0.003466 0.021240 0.001227 0.011734 0.005115 0.001491 0.000278
How can I turn this into a two-way probability table that shows which numbers are divisible by 7 and/or 5 for marginal and conditional probability?
This is what I'd hope the table to look like
                     Divisible by 7
                     Yes      No
Divisible by 5  Yes  0.02754  0.02886
                No   0.02656  0.02831
x <- sample(1:100, 100, replace = TRUE)
# %% is the mod operator, which gives the remainder after the division of the left-hand side by the right-hand side. x %% y == 0 therefore returns TRUE if x is divisible by y
db5 <- x %% 5 == 0
db7 <- x %% 7 == 0
table(db5, db7) / length(x)
# db7
# db5 FALSE TRUE
# FALSE 0.62 0.13
# TRUE 0.24 0.01
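If only the probability table is available rather than the raw vector, the same two-way table can be built by summing the probabilities within each cell. A sketch under that assumption, using simulated data; the names vals, d, joint and cond are made up for this example. The conditional probabilities of "divisible by 5" given "divisible by 7" come from normalising each column.

```r
set.seed(42)
x <- sample(1:100, 1000, replace = TRUE)

# Probability table, as in the question
p    <- prop.table(table(x))
vals <- as.integer(names(p))

# One row per distinct value, with its probability and the two flags
d <- data.frame(prob = as.numeric(p),
                db5  = vals %% 5 == 0,
                db7  = vals %% 7 == 0)

# Joint probabilities: sum of prob within each (db5, db7) cell
joint <- xtabs(prob ~ db5 + db7, data = d)

# Marginal probabilities
rowSums(joint)   # P(divisible by 5) and its complement
colSums(joint)   # P(divisible by 7) and its complement

# Conditional: each column sums to 1, so column "TRUE" is P(db5 | db7)
cond <- prop.table(joint, margin = 2)
cond["TRUE", "TRUE"]
```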

R error type "Subscript out of bounds"

I am simulating a correlation matrix, where the 60 variables correlate in the following way:
more highly (0.6) for every two variables (1-2, 3-4... 59-60)
moderate (0.3) for every group of 12 variables (1-12,13-24...)
mc <- matrix(0,60,60)
diag(mc) <- 1
for (c in seq(1,59,2)){ # every pair of variables in order are given 0.6 correlation
mc[c,c+1] <- 0.6
mc[c+1,c] <- 0.6
}
for (n in seq(1,51,10)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these are variables 11-12, 21-22 and such.
mc[n:n+1,c(n+2,w)] <- 0.2
mc[c(n+2,w),n:n+1] <- 0.2
}
}
for (m in seq(3,9,2)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these variables are the rest.
mc[m:m+1,c(1:m-1,m+2:w)] <- 0.2
mc[c(1:m-1,m+2:w),m:m+1] <- 0.2
}
}
The first loop works well, but not the second and third ones. I get this error message:
Error in `[<-`(`*tmp*`, m:m + 1, c(1:m - 1, m + 2:w), value = 0.2) :
subscript out of bounds
I would really appreciate any hints, since I don't see the loop commands get to exceed the matrix dimensions. Thanks a lot in advance!
Note that : takes precedence over +. E.g., n:n+1 is the same as n+1. I guess you want n:(n+1).
The maximal value of w is 60:
w <- 60
m <- 1
m+2:w
#[1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#[49] 51 52 53 54 55 56 57 58 59 60 61
And 61 is out of bounds. You need to add a lot of parentheses.
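To make the precedence issue concrete, here is a small sketch comparing the unparenthesised and parenthesised index expressions for one pair of loop values. It only shows that the indices stay inside the matrix; whether the resulting pattern is the intended correlation structure is for the asker to check.

```r
n <- 1
n:n + 1      # parsed as (n:n) + 1, a single value
n:(n + 1)    # the two consecutive indices

m <- 3
w <- 60
bad  <- c(1:m - 1, m + 2:w)        # (1:m) - 1 and m + (2:w)
good <- c(1:(m - 1), (m + 2):w)    # everything up to w except m and m + 1

range(bad)    # 0 63: index 0 is silently dropped, 61 to 63 are out of bounds
range(good)   # 1 60

mc <- matrix(0, 60, 60)
diag(mc) <- 1
mc[m:(m + 1), good] <- 0.2   # no "subscript out of bounds"
mc[good, m:(m + 1)] <- 0.2
```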
