Calculate confidence intervals (binomial) within data frame - r

I want to get the confidence intervals for proportions within my tibble. Is there a way of doing this?
library(tidyverse)
library(Hmisc)
library(broom)
df <- tibble(id = c(1, 2, 3, 4, 5, 6),
             count = c(4, 1, 22, 4545, 33, 23),
             n = c(22, 65, 34, 6323, 35, 45))
Which looks like this:
# A tibble: 6 x 3
id count n
<dbl> <dbl> <dbl>
1 1 4 22
2 2 1 65
3 3 22 34
4 4 4545 6323
5 5 33 35
6 6 23 45
Below I use binconf from Hmisc and tidy from broom, but the solution could come from any package.
The intervals for the first row:
tidy(binconf(4, 22))
# A tibble: 1 x 4
.rownames PointEst Lower Upper
<chr> <dbl> <dbl> <dbl>
1 "" 0.182 0.0731 0.385
I have tried using map in purrr but get errors:
map(df, tidy(binconf(count, n)))
Error in x[i] : object of type 'closure' is not subsettable
I could just calculate them using dplyr, but I get values below zero (e.g. row 2) or above one (e.g. row 5), which I don't want. For example:
df %>%
  mutate(prop = count / n) %>%
  mutate(se = sqrt(prop * (1 - prop) / n)) %>%
  mutate(lower = prop - (se * 1.96)) %>%
  mutate(upper = prop + (se * 1.96))
# A tibble: 6 x 7
id count n prop se lower upper
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 22 0.182 0.0822 0.0206 0.343
2 2 1 65 0.0154 0.0153 -0.0145 0.0453
3 3 22 34 0.647 0.0820 0.486 0.808
4 4 4545 6323 0.719 0.00565 0.708 0.730
5 5 33 35 0.943 0.0392 0.866 1.02
6 6 23 45 0.511 0.0745 0.365 0.657
Is there a good way of doing this? I did have a look at the confint_tidy() function, but could not get that to work. Any ideas?

It may not be tidy but
> as.tibble(cbind(df, binconf(df$count, df$n)))
# A tibble: 6 x 6
id count n PointEst Lower Upper
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 22 0.182 0.0731 0.385
2 2 1 65 0.0154 0.000789 0.0821
3 3 22 34 0.647 0.479 0.785
4 4 4545 6323 0.719 0.708 0.730
5 5 33 35 0.943 0.814 0.984
6 6 23 45 0.511 0.370 0.650
seems to work
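For what it's worth, the map() attempt fails because the second argument to map() is evaluated immediately, so count and n are looked up as functions rather than as columns of df. Since binconf() is vectorised over x and n, two tidier alternatives are possible; a sketch, assuming the default Wilson intervals are what you want:
library(tidyverse)
library(Hmisc)
# Vectorised: one binconf() call covers every row
df %>% bind_cols(as_tibble(binconf(df$count, df$n)))
# Row-wise with purrr, if you prefer to stay inside the pipe
df %>%
  mutate(ci = map2(count, n, ~ as_tibble(binconf(.x, .y)))) %>%
  unnest(ci)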

Related

R dplyr::ntile vs ggplot2::cut_number

I need to divide a vector into quantiles, i.e. bins with the same number of observations. I am currently looking at these two functions: dplyr::ntile and ggplot2::cut_number. It looks like they do roughly the same thing. The only difference is that ntile gives back the quantile to which the observation belongs, i.e. 1, 2, 3, ..., whereas cut_number returns the limits of the interval, i.e. (0, 0.5], (0.5, 1], ... .
I did some experiments and it looks like I roughly get the same answers:
library(tidyverse)
df <- tibble(vec = runif(1000))
df %>% mutate(vec_cut = cut_number(vec, 10)) %>% count(vec_cut)
#> # A tibble: 10 x 2
#> vec_cut n
#> <fct> <int>
#> 1 [7.29e-05,0.0905] 100
#> 2 (0.0905,0.211] 100
#> 3 (0.211,0.325] 100
#> 4 (0.325,0.423] 100
#> 5 (0.423,0.5] 100
#> 6 (0.5,0.602] 100
#> 7 (0.602,0.699] 100
#> 8 (0.699,0.806] 100
#> 9 (0.806,0.91] 100
#> 10 (0.91,0.997] 100
df %>% mutate(vec_tile = ntile(vec, 10)) %>%
  group_by(vec_tile) %>%
  summarise(count = n(),
            min = min(vec),
            max = max(vec))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 10 x 4
#> vec_tile count min max
#> <int> <int> <dbl> <dbl>
#> 1 1 100 0.0000729 0.0905
#> 2 2 100 0.0905 0.210
#> 3 3 100 0.211 0.324
#> 4 4 100 0.325 0.422
#> 5 5 100 0.423 0.499
#> 6 6 100 0.501 0.602
#> 7 7 100 0.602 0.699
#> 8 8 100 0.702 0.806
#> 9 9 100 0.806 0.910
#> 10 10 100 0.911 0.997
The problem is that sometimes cut_number fails where ntile does not.
vec <- c(rep(0,100), seq(1:100))
table(cut_number(vec, 10))
#> Error: Insufficient data values to produce 10 bins.
#> Run `rlang::last_error()` to see where the error occurred.
table(ntile(vec,10))
#> 1 2 3 4 5 6 7 8 9 10
#> 20 20 20 20 20 20 20 20 20 20
I could use ntile; however, it is nice to have the interval limits and not just the index of the quantile. Am I doing something wrong?
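One possible workaround when cut_number() chokes on heavily tied data like this is to let ntile() do the binning and derive each bin's limits afterwards, in the same spirit as the summarise() above (a sketch):
library(dplyr)
vec <- c(rep(0, 100), 1:100)
tibble(vec = vec) %>%
  mutate(bin = ntile(vec, 10)) %>%
  group_by(bin) %>%
  summarise(n = n(), lower = min(vec), upper = max(vec))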

Group data by intervals and condition

I have the following problem: I want to bin the income (random 0-100) into intervals and group them by sex (showing how many cases in each interval for each sex), plus I want to know the proportion and percentage:
ingresos <- sample (0:100, 30, replace = T)
sexo <- sample (1:2, 30, replace = T)
base <- tibble(
  Ingresos = ingresos <- case_when(
    ingresos >= 0 & ingresos < 20 ~ "(0, 19]",
    ingresos >= 20 & ingresos < 50 ~ "(20, 49]",
    ingresos >= 50 & ingresos < 70 ~ "(50, 69]",
    ingresos >= 70 ~ "(70 ó +)"
  ),
  Sexo = sexo,
  Proporción = ingresos / sum(ingresos),
  Porcentaje = Proporción * 100
)
I ended up with:
> show(base)
# A tibble: 30 x 4
Ingresos Sexo Proporción Porcentaje
<chr> <int> <dbl> <dbl>
1 (0, 19] 2 0.00583 0.583
2 (50, 69] 1 0.0343 3.43
3 (20, 49] 2 0.0233 2.33
4 (20, 49] 1 0.0188 1.88
5 (20, 49] 2 0.0311 3.11
6 (50, 69] 2 0.0369 3.69
7 (20, 49] 1 0.0278 2.78
8 (20, 49] 1 0.0142 1.42
9 (70 ó +) 1 0.0628 6.28
10 (20, 49] 1 0.0130 1.30
# … with 20 more rows
And I'm looking for something like:
Ingresos Sexo Cases Proporción Porcentaje
(0,19] 1 12 .xxx x.xxx
(0,19] 2 20 .xxx x.xxx
(20,49] 1 17 .xxx x.xxx
(20,49] 2 30 .xxx x.xxx
Cutting the ingresos vector into ranges can be done with cut(), the frequencies can be derived with dplyr::count(), and the proportions and percentages can be added with dplyr::mutate(). Like this:
ingresos <- sample(0:100, 30, replace = T)
sexo <- sample(1:2, 30, replace = T)
library(dplyr)
tibble(ingresos, sexo) %>%
  mutate(ingresos = cut(ingresos, c(0, 20, 50, 70, 100))) %>%
  count(ingresos, sexo) %>%
  mutate(Proporción = n / sum(n), Porcentaje = Proporción * 100)
#> # A tibble: 8 x 5
#> ingresos sexo n Proporción Porcentaje
#> <fct> <int> <int> <dbl> <dbl>
#> 1 (0,20] 1 3 0.1 10
#> 2 (0,20] 2 4 0.133 13.3
#> 3 (20,50] 1 2 0.0667 6.67
#> 4 (20,50] 2 5 0.167 16.7
#> 5 (50,70] 1 3 0.1 10
#> 6 (50,70] 2 1 0.0333 3.33
#> 7 (70,100] 1 4 0.133 13.3
#> 8 (70,100] 2 8 0.267 26.7
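If you also want the bins to match the original right-open groups and labels, cut() takes right and labels arguments; a sketch, assuming the incomes run 0 to 100:
tibble(ingresos, sexo) %>%
  mutate(ingresos = cut(ingresos, breaks = c(0, 20, 50, 70, 101), right = FALSE,
                        labels = c("(0, 19]", "(20, 49]", "(50, 69]", "(70 ó +)"))) %>%
  count(ingresos, sexo)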

How to make grouped summary statistics based off of densities in R

Goal: I would like to generate grouped percentiles for each group (hrzn)
I have the following data
# A tibble: 3,500 x 3
hrzn parameter density
<dbl> <dbl> <dbl>
1 1 0.0183 0.00914
2 1 0.0185 0.00905
3 1 0.0187 0.00897
4 1 0.0189 0.00888
5 1 0.0191 0.00880
6 1 0.0193 0.00872
7 1 0.0194 0.00864
8 1 0.0196 0.00855
9 1 0.0198 0.00847
10 1 0.0200 0.00839
The hrzn is the group, the parameter is a grid of parameter space, and the density is the density for the value in the parameter column.
I would like to generate summary statistics: percentiles 10 to 90, in steps of 10, by hrzn. I am trying to keep this computationally efficient. I know I could sample the parameter with the density as weights, but I am curious whether there is a faster way to generate the percentiles from the density without doing a sample.
The data may be obtained with the following
df <- readr::read_csv("https://raw.githubusercontent.com/alexhallam/density_data/master/data.csv")
When I load the data from your csv, each of the 5 groups has identical values for parameter and density:
df
#># A tibble: 3,500 x 3
#> hrzn parameter density
#> <int> <dbl> <dbl>
#> 1 1 0.0183 0.00914
#> 2 1 0.0185 0.00905
#> 3 1 0.0187 0.00897
#> 4 1 0.0189 0.00888
#> 5 1 0.0191 0.00880
#> 6 1 0.0193 0.00872
#> 7 1 0.0194 0.00864
#> 8 1 0.0196 0.00855
#> 9 1 0.0198 0.00847
#>10 1 0.0200 0.00839
#># ... with 3,490 more rows
sapply(1:5, function(x) all(df$parameter[df$hrzn == x] == df$parameter[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE
sapply(1:5, function(x) all(df$density[df$hrzn == x] == df$density[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE
I'm not sure if this is a mistake or not, but clearly if you're worried about computation, anything you want to do on all the groups can be done 5 times faster by only doing it on a single group.
Anyway, to get the 10th and 90th centiles for each hrzn, you just need to see which parameter is adjacent to 0.1 and 0.9 on the cumulative distribution function. Let's generalize to working it out for all the groups in case there's an issue with the data or you want to repeat it with different data:
library(dplyr)
df %>%
  mutate(hrzn = factor(hrzn)) %>%
  group_by(hrzn) %>%
  summarise(centile_10 = parameter[which(cumsum(density) > .1)[1]],
            centile_90 = parameter[which(cumsum(density) > .9)[1]])
#># A tibble: 5 x 3
#> hrzn centile_10 centile_90
#> <fct> <dbl> <dbl>
#>1 1 0.0204 0.200
#>2 2 0.0204 0.200
#>3 3 0.0204 0.200
#>4 4 0.0204 0.200
#>5 5 0.0204 0.200
Of course, they're all the same for the reasons mentioned above.
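If you want all nine deciles per group in one tidy frame, the same cumulative-sum idea generalises. A sketch, assuming density sums to 1 within each hrzn; findInterval(p, cumsum(density)) + 1 is the first index at which the cumulative density exceeds p:
probs <- seq(0.1, 0.9, by = 0.1)
df %>%
  group_by(hrzn) %>%
  summarise(centile = probs * 100,
            parameter = parameter[findInterval(probs, cumsum(density)) + 1],
            .groups = "drop")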
If you're worried about computation time (even though the above only takes a few milliseconds), and you don't mind opaque code, you could take advantage of the ordering to cut the cumsum of your entire density column between 0 and 5 in steps of 0.1, to get all the 10th centiles, like this:
# cumsum(df$density) runs from 0 to 5 across the ordered groups; cutting it into
# 0.1-wide bins and keeping the rows where the bin index jumps picks out each decile crossing
summary <- df[which((diff(as.numeric(cut(cumsum(df$density), seq(0,5,.1))) - 1) != 0)) + 1,]
# drop the rows at each group boundary (the 100% crossings), leaving nine centiles per group
summary <- summary[-(1:5)*10,]
summary$centile <- rep(1:9*10, 5)
summary
#> # A tibble: 45 x 4
#> hrzn parameter density centile
#> <int> <dbl> <dbl> <dbl>
#> 1 1 0.0204 0.00824 10
#> 2 1 0.0233 0.00729 20
#> 3 1 0.0271 0.00634 30
#> 4 1 0.0321 0.00542 40
#> 5 1 0.0392 0.00453 50
#> 6 1 0.0498 0.00366 60
#> 7 1 0.0679 0.00281 70
#> 8 1 0.103 0.00199 80
#> 9 1 0.200 0.00114 90
#> 10 2 0.0204 0.00824 10
#> # ... with 35 more rows
Perhaps I have misunderstood you and you are actually working in a 5-dimensional parameter space and want to know the parameter values at the 10th and 90th centiles of 5d density. In that case, you can take advantage of the fact that all groups are the same to calculate the 10th and 90th centiles for the 5-d density by simply taking the 5th root of these two centiles:
df %>%
  mutate(hrzn = factor(hrzn)) %>%
  group_by(hrzn) %>%
  summarise(centile_10 = parameter[which(cumsum(density) > .1^.2)[1]],
            centile_90 = parameter[which(cumsum(density) > .9^.2)[1]])
#> # A tibble: 5 x 3
#> hrzn centile_10 centile_90
#> <fct> <dbl> <dbl>
#> 1 1 0.0545 0.664
#> 2 2 0.0545 0.664
#> 3 3 0.0545 0.664
#> 4 4 0.0545 0.664
#> 5 5 0.0545 0.664

Create Multiple 2-dimensional Tables from Multiple Columns in R Using dplyr

I'm looking for an efficient way to create multiple 2-dimensional tables from an R data frame of chi-square statistics. The code below builds on this answer to a previous question of mine about getting chi-square stats by groups. Now I want to create tables from the output by group. Here's what I have so far using the hsbdemo data frame from the UCLA R site:
ml <- foreign::read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
str(ml)
'data.frame': 200 obs. of 13 variables:
$ id : num 45 108 15 67 153 51 164 133 2 53 ...
$ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
$ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
$ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
$ prog : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
ml %>%
  dplyr::select(prog, ses, schtyp) %>%
  table() %>%
  apply(3, chisq.test, simulate.p.value = TRUE) %>%
  lapply(`[`, c(6, 7, 9)) %>%
  reshape2::melt() %>%
  tidyr::spread(key = L2, value = value) %>%
  dplyr::rename(SchoolType = L1) %>%
  dplyr::arrange(SchoolType, prog) %>%
  dplyr::select(-observed, -expected) %>%
  reshape2::acast(., prog ~ ses ~ SchoolType) %>%
  tbl_df()
The output after the last arrange statement produces this tibble (showing only the first five rows):
prog ses SchoolType expected observed stdres
1 general low private 0.37500 2 3.0404678
2 general middle private 3.56250 3 -0.5187244
3 general high private 2.06250 1 -1.0131777
4 academic low private 1.50000 0 -2.5298221
5 academic middle private 14.25000 14 -0.2078097
It's easy to select one column, for example, stdres, and pass it to acast and tbl_df, which gets pretty much what I'm after:
# A tibble: 3 x 6
low.private middle.private high.private low.public middle.public high.public
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.04 -0.519 -1.01 1.47 -0.236 -1.18
2 -2.53 -0.208 1.50 -0.940 -2.06 3.21
3 -0.377 1.21 -1.06 -0.331 2.50 -2.45
Now I can repeat these steps for observed and expected frequencies and bind them by rows, but that seems inefficient. The output would be observed frequencies stacked on expected frequencies, stacked on the standardized residuals. Something like this:
low.private middle.private high.private low.public middle.public high.public
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 3 1 14 17 8
2 0 14 10 19 30 32
3 0 2 0 12 29 7
4 0.375 3.56 2.06 10.4 17.6 10.9
5 1.5 14.2 8.25 21.7 36.6 22.7
6 0.125 1.19 0.688 12.9 21.7 13.4
7 3.04 -0.519 -1.01 1.47 -0.236 -1.18
8 -2.53 -0.208 1.50 -0.940 -2.06 3.21
9 -0.377 1.21 -1.06 -0.331 2.50 -2.45
Seems there ought to be a way to do this without repeating the code for each column, probably by creating and processing a list. Thanks in advance.
Might this be the answer?
ml1 <- ml %>%
  dplyr::select(prog, ses, schtyp) %>%
  table() %>%
  apply(3, chisq.test, simulate.p.value = TRUE) %>%
  lapply(`[`, c(6, 7, 9)) %>%
  reshape2::melt()
ml2 <- ml1 %>%
  dplyr::mutate(type = paste(ses, L1, sep = ".")) %>%
  dplyr::select(-ses, -L1) %>%
  tidyr::spread(type, value)
This gives you
prog L2 high.private high.public low.private low.public middle.private middle.public
1 general expected 2.062500 10.910714 0.3750000 10.4464286 3.5625000 17.6428571
2 general observed 1.000000 8.000000 2.0000000 14.0000000 3.0000000 17.0000000
3 general stdres -1.013178 -1.184936 3.0404678 1.4663681 -0.5187244 -0.2360209
4 academic expected 8.250000 22.660714 1.5000000 21.6964286 14.2500000 36.6428571
5 academic observed 10.000000 32.000000 0.0000000 19.0000000 14.0000000 30.0000000
6 academic stdres 1.504203 3.212431 -2.5298221 -0.9401386 -0.2078097 -2.0607058
7 vocation expected 0.687500 13.428571 0.1250000 12.8571429 1.1875000 21.7142857
8 vocation observed 0.000000 7.000000 0.0000000 12.0000000 2.0000000 29.0000000
9 vocation stdres -1.057100 -2.445826 -0.3771236 -0.3305575 1.2081594 2.4999085
I am not sure I understand completely what you are after... But basically, create a new variable from SES and school type, and spread based on that. And obviously, reorder it as you wish :-)
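For what it's worth, the paste() + spread() step can also be written with the newer tidyr::pivot_wider(), which accepts several names_from columns at once; a sketch, assuming ml1 as built above:
ml2 <- ml1 %>%
  tidyr::pivot_wider(names_from = c(ses, L1), values_from = value, names_sep = ".")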

Calculate mean euclidean distances between items within groups

I would like to calculate the mean euclidean distances between each item and all other items in a group within a data frame. I'd like to do this within the tidyverse, but can't seem to get it to work how I want.
Example data:
library(tidyverse)
DF <- data.frame(Item = letters[1:20], Grp = rep(1:4, each = 5),
                 x = runif(20, -0.5, 0.5),
                 y = runif(20, -0.5, 0.5))
I think euclidean distances are calculated using:
sqrt(((x[i] - x[i + 1]) ^ 2) + ((y[i] - y[i + 1]) ^ 2))
I've tried, without success, to do something with mutate.
DF %>%
  group_by(Grp, Item) %>%
  mutate(Dist = mean(sqrt(((x - lag(x, default = x[1])) ^ 2) +
                            (y - lag(y, default = y[1])) ^ 2)))
But, it doesn't work and only returns NA's.
# A tibble: 20 x 5
# Groups: Grp, Item [20]
Item Grp x y Dist
<fct> <int> <dbl> <dbl> <dbl>
1 a 1 -0.212 0.390 NA
2 b 1 0.288 0.193 NA
3 c 1 -0.0910 0.141 NA
4 d 1 0.383 0.494 NA
5 e 1 0.440 0.156 NA
6 f 2 -0.454 0.209 NA
7 g 2 0.0281 0.0441 NA
8 h 2 0.392 0.0941 NA
9 i 2 0.0514 -0.211 NA
10 j 2 -0.0434 -0.353 NA
11 k 3 0.457 0.463 NA
12 l 3 -0.0467 0.402 NA
13 m 3 0.178 0.191 NA
14 n 3 0.0726 0.295 NA
15 o 3 -0.397 -0.475 NA
16 p 4 0.400 -0.0222 NA
17 q 4 -0.254 0.258 NA
18 r 4 -0.458 -0.284 NA
19 s 4 -0.172 -0.182 NA
20 t 4 0.455 -0.268 NA
If I understand lag correctly it would still be sequential (if it worked), rather than computing distances between all pairs within a group.
How can I get the mean of all 4 distances for each item in a group?
Does anyone have any suggestions?
DF %>%
  group_by(Grp) %>%
  mutate(Dist = colMeans(as.matrix(dist(cbind(x, y)))))
# # A tibble: 20 x 5
# # Groups: Grp [4]
# Item Grp x y Dist
# <fctr> <int> <dbl> <dbl> <dbl>
# 1 a 1 -0.197904299 0.363086055 0.4659160
# 2 b 1 0.090540444 -0.006314185 0.2031230
# 3 c 1 0.101018893 -0.025062949 0.2011672
# 4 d 1 0.006358616 -0.149784267 0.2323359
# 5 e 1 0.219596250 -0.341440596 0.3605274
# 6 f 2 -0.493124602 -0.002935820 0.5155365
# ...
To see how it works, start with one data subset and go piece by piece:
# run these one line at a time and have a look at ?dist
dd = DF[DF$Grp == "1", c("x", "y")]
dist(dd)
as.matrix(dist(dd))
colMeans(as.matrix(dist(dd)))
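One detail worth noting: colMeans() divides by the group size, so the zero distance from each item to itself is included in the average (dividing by 5 rather than 4 here). If you want the mean over the other items only, divide the column sums by n() - 1 instead:
DF %>%
  group_by(Grp) %>%
  mutate(Dist = colSums(as.matrix(dist(cbind(x, y)))) / (n() - 1))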
