Calculate mean Euclidean distances between items within groups - r

I would like to calculate the mean euclidean distances between each item and all other items in a group within a data frame. I'd like to do this within the tidyverse, but can't seem to get it to work how I want.
Example data:
library(tidyverse)
DF <- data.frame(Item = letters[1:20], Grp = rep(1:4, each = 5),
                 x = runif(20, -0.5, 0.5),
                 y = runif(20, -0.5, 0.5))
I think Euclidean distances are calculated using:
sqrt(((x[i] - x[i + 1]) ^ 2) + ((y[i] - y[i + 1]) ^ 2))
I've tried, without success, to do something with mutate.
DF %>%
  group_by(Grp, Item) %>%
  mutate(Dist = mean(sqrt(((x - lag(x, default = x[1]))^2) +
                          ((y - lag(y, default = y[1]))^2))))
But it doesn't work; it only returns NAs.
# A tibble: 20 x 5
# Groups: Grp, Item [20]
Item Grp x y Dist
<fct> <int> <dbl> <dbl> <dbl>
1 a 1 -0.212 0.390 NA
2 b 1 0.288 0.193 NA
3 c 1 -0.0910 0.141 NA
4 d 1 0.383 0.494 NA
5 e 1 0.440 0.156 NA
6 f 2 -0.454 0.209 NA
7 g 2 0.0281 0.0441 NA
8 h 2 0.392 0.0941 NA
9 i 2 0.0514 -0.211 NA
10 j 2 -0.0434 -0.353 NA
11 k 3 0.457 0.463 NA
12 l 3 -0.0467 0.402 NA
13 m 3 0.178 0.191 NA
14 n 3 0.0726 0.295 NA
15 o 3 -0.397 -0.475 NA
16 p 4 0.400 -0.0222 NA
17 q 4 -0.254 0.258 NA
18 r 4 -0.458 -0.284 NA
19 s 4 -0.172 -0.182 NA
20 t 4 0.455 -0.268 NA
If I understand lag() correctly, it would still compute sequential distances (if it worked at all), rather than distances between all pairs within a group.
How can I get, for each item, the mean of its 4 distances to the other items in its group?
Does anyone have any suggestions?

DF %>%
  group_by(Grp) %>%
  mutate(Dist = colMeans(as.matrix(dist(cbind(x, y)))))
# # A tibble: 20 x 5
# # Groups: Grp [4]
# Item Grp x y Dist
# <fctr> <int> <dbl> <dbl> <dbl>
# 1 a 1 -0.197904299 0.363086055 0.4659160
# 2 b 1 0.090540444 -0.006314185 0.2031230
# 3 c 1 0.101018893 -0.025062949 0.2011672
# 4 d 1 0.006358616 -0.149784267 0.2323359
# 5 e 1 0.219596250 -0.341440596 0.3605274
# 6 f 2 -0.493124602 -0.002935820 0.5155365
# ...
To see how it works, start with one data subset and go piece by piece:
# run these one line at a time and have a look at ?dist
dd = DF[DF$Grp == "1", c("x", "y")]
dist(dd)
as.matrix(dist(dd))
colMeans(as.matrix(dist(dd)))
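One detail to be aware of: colMeans() of the full distance matrix includes each item's zero distance to itself, so it divides by the group size (5) rather than by the number of other items (4). If you want the mean over just the other 4 items, a small tweak works (a sketch of the same idea using rowSums(); dist matrices are symmetric, so rows and columns are interchangeable):
DF %>%
  group_by(Grp) %>%
  mutate(Dist = rowSums(as.matrix(dist(cbind(x, y)))) / (n() - 1))  # exclude the zero self-distance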

Related

Why is the recast function omitting key columns when spreading rows in R?

I am trying to spread my data so that the months become columns, associated with both site and species (spA-spD). I tried to use recast but I lose the information about species. What do I do to get the expected output (shown below)?
set.seed(111)
month <- rep(c("J","F","M"), each = 6)
site <- rep(c(1,2,3,4,5,6), times = 3)
spA <- rnorm(18,0,2)
spB <- rnorm(18,0,2)
spC <- rnorm(18,0,2)
spD <- rnorm(18,0,2)
df <- data.frame(month, site, spA, spB, spC, spD)
df.test <- reshape2::recast(df, site ~ month)
Here is what I am getting. (Every cell is 5 because, with no aggregation function specified, recast defaults to counting cells with length.)
site F J M
1 1 5 5 5
2 2 5 5 5
3 3 5 5 5
4 4 5 5 5
5 5 5 5 5
6 6 5 5 5
#Expected output (It's dummy data)
site sp J F M
1 A 5 6 7
1 B 2 3 4
..
6 D 1 2 3
If the intention is not to aggregate but just to transpose, we can use pivot_longer to reshape to long form and then pivot_wider to reshape back to wide:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with('sp'), names_prefix = 'sp',
               names_to = 'sp') %>%
  pivot_wider(names_from = month, values_from = value)
Output:
# A tibble: 24 × 5
site sp J F M
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 A 0.470 -2.99 3.69
2 1 B -2.39 0.653 -6.23
3 1 C -0.232 -2.72 4.97
4 1 D 0.350 -0.433 0.405
5 2 A -0.661 -2.02 0.788
6 2 B 0.728 1.20 -1.88
7 2 C 0.669 0.962 3.92
8 2 D -1.69 2.89 -1.61
9 3 A -0.623 -1.90 1.60
10 3 B 0.723 -3.68 2.80
# … with 14 more rows
Or using recast: specify the id.var, and also include variable in the formula:
library(reshape2)
reshape2::recast(df, site + variable ~ month, id.var = c("month", "site"))
site variable F J M
1 1 spA -2.99485331 0.4704414 3.6912725
2 1 spB 0.65309848 -2.3872179 -6.2264346
3 1 spC -2.72380897 -0.2323101 4.9713231
4 1 spD -0.43285732 0.3501913 0.4046144
5 2 spA -2.02037684 -0.6614717 0.7881082
6 2 spB 1.19650840 0.7283735 -1.8827148
7 2 spC 0.96224916 0.6685120 3.9199634
8 2 spD 2.89295633 -1.6945355 -1.6123984
9 3 spA -1.89695121 -0.6232476 1.5950570
10 3 spB -3.68306860 0.7233249 2.8005176
11 3 spC 1.48394325 -1.2417162 0.3833268
12 3 spD 0.81941960 1.9564633 0.5892684
13 4 spA -0.98792443 -4.6046913 -3.1333307
14 4 spB 5.43611120 0.6939287 -3.2409401
15 4 spC 0.05564925 -2.6196898 3.1050885
16 4 spD 1.82183314 3.6117365 2.8097662
...

Using the accumulate function with the second-to-last value as the .init argument

I have recently come across an interesting problem: calculating a vector's values using its penultimate value, in the spirit of an .init argument, plus another vector's current value. Here is the sample data set:
library(dplyr)  # for if_else()
set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- if_else(dt$time == 1, 120, if_else(dt$time == 2, 125, as.numeric(NA)))
id time ret ind
1 a 1 0.005543269 120
2 a 2 -0.002802719 125
3 a 3 0.017751634 NA
4 a 4 0.001873201 NA
5 a 5 0.011425261 NA
6 b 1 0.004155261 120
7 b 2 0.012295066 125
8 b 3 0.002366797 NA
9 b 4 -0.003653828 NA
10 b 5 0.011051443 NA
What I would like to calculate is:
ind[t] = ind[t-2] * (1 + ret[t])
I tried the following code. Since .init is of no use here, I tried to nullify the original .init and create a virtual one, but unfortunately it won't drag the newly created values (from the third row downward) into the calculation:
dt %>%
  group_by(id) %>%
  mutate(ind = c(120, accumulate(3:n(), .init = 125,
                                 ~ .x * 1/.x * ind[.y - 2] * (1 + ret[.y]))))
# A tibble: 10 x 4
# Groups: id [2]
id time ret ind
<chr> <int> <dbl> <dbl>
1 a 1 0.00554 120
2 a 2 -0.00280 125
3 a 3 0.0178 122.
4 a 4 0.00187 125.
5 a 5 0.0114 NA
6 b 1 0.00416 120
7 b 2 0.0123 125
8 b 3 0.00237 120.
9 b 4 -0.00365 125.
10 b 5 0.0111 NA
I was wondering if there was a tweak I could make to this code to make it work completely.
I would greatly appreciate your help.
Use a state vector consisting of the current value of ind and the prior value of ind; that way, the prior state carries the second-prior value forward. We encode the pair as a complex number whose real part is the current ind and whose imaginary part is the prior ind, and take the real part at the end.
library(dplyr)
library(purrr)
dt %>%
  group_by(id) %>%
  mutate(result = c(ind[1],
                    Re(accumulate(.x = tail(ret, -2),  # ret[3], ret[4], ...
                                  # state: Re = current ind, Im = prior ind;
                                  # new current = prior * (1 + ret), new prior = old current
                                  .f = ~ Im(.x) * (1 + .y) + Re(.x) * 1i,
                                  .init = ind[2] + ind[1] * 1i)))) %>%
  ungroup
giving:
# A tibble: 10 x 5
id time ret ind result
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0.00554 120 120
2 a 2 -0.00280 125 125
3 a 3 0.0178 NA 122.
4 a 4 0.00187 NA 125.
5 a 5 0.0114 NA 124.
6 b 1 0.00416 120 120
7 b 2 0.0123 125 125
8 b 3 0.00237 NA 120.
9 b 4 -0.00365 NA 125.
10 b 5 0.0111 NA 122.
Variation
This variation eliminates the complex numbers and instead uses a length-2 vector for each state: the first element corresponds to the real part in the prior solution and the second to the imaginary part. This extends naturally to cases that need more than 2 numbers per state, or where the dependence involves all of the last N values. The downside, for this question, is the extra line needed to extract the result from the list of pairs, which is more involved than using Re in the prior solution.
dt %>%
  group_by(id) %>%
  mutate(result = c(ind[1],
                    accumulate(.x = tail(ret, -2),
                               .f = ~ c(.x[2] * (1 + .y), .x[1]),
                               .init = ind[2:1])),
         result = map_dbl(result, first)) %>%
  ungroup
Check
We check that the results above are correct. Alternatively, this loop could be used as a straightforward solution in its own right.
calc <- function(ind, ret) {
  for (i in seq(3, length(ret))) ind[i] <- ind[i - 2] * (1 + ret[i])
  ind
}
dt %>%
  group_by(id) %>%
  mutate(result = calc(ind, ret)) %>%
  ungroup
giving:
# A tibble: 10 x 5
id time ret ind result
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0.00554 120 120
2 a 2 -0.00280 125 125
3 a 3 0.0178 NA 122.
4 a 4 0.00187 NA 125.
5 a 5 0.0114 NA 124.
6 b 1 0.00416 120 120
7 b 2 0.0123 125 125
8 b 3 0.00237 NA 120.
9 b 4 -0.00365 NA 125.
10 b 5 0.0111 NA 122.
I would have done it by creating dummy groups for the interleaved sequences, so that it works for any lag N. Demonstrating it on new, more elaborate data:
df <- data.frame(
stringsAsFactors = FALSE,
grp = c("a","a","a","a",
"a","a","a","a","a","b","b","b","b","b",
"b","b","b","b"),
rate = c(0.082322056,
0.098491104,0.07294593,0.08741672,0.030179747,
0.061389031,0.011232314,0.08553277,0.091272669,
0.031577847,0.024039791,0.091719552,0.032540636,
0.020411727,0.094521716,0.081729178,0.066429708,
0.04985793),
ind = c(11000L,12000L,
13000L,NA,NA,NA,NA,NA,NA,10000L,13000L,12000L,
NA,NA,NA,NA,NA,NA)
)
df
#> grp rate ind
#> 1 a 0.08232206 11000
#> 2 a 0.09849110 12000
#> 3 a 0.07294593 13000
#> 4 a 0.08741672 NA
#> 5 a 0.03017975 NA
#> 6 a 0.06138903 NA
#> 7 a 0.01123231 NA
#> 8 a 0.08553277 NA
#> 9 a 0.09127267 NA
#> 10 b 0.03157785 10000
#> 11 b 0.02403979 13000
#> 12 b 0.09171955 12000
#> 13 b 0.03254064 NA
#> 14 b 0.02041173 NA
#> 15 b 0.09452172 NA
#> 16 b 0.08172918 NA
#> 17 b 0.06642971 NA
#> 18 b 0.04985793 NA
library(tidyverse)
N = 3
df %>%
  group_by(grp) %>%
  group_by(d = row_number() %% N, .add = TRUE) %>%
  mutate(ind = accumulate(rate[-1] + 1, .init = ind[1], ~ .x * .y))
#> # A tibble: 18 x 4
#> # Groups: grp, d [6]
#> grp rate ind d
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 0.0823 11000 1
#> 2 a 0.0985 12000 2
#> 3 a 0.0729 13000 0
#> 4 a 0.0874 11962. 1
#> 5 a 0.0302 12362. 2
#> 6 a 0.0614 13798. 0
#> 7 a 0.0112 12096. 1
#> 8 a 0.0855 13420. 2
#> 9 a 0.0913 15057. 0
#> 10 b 0.0316 10000 1
#> 11 b 0.0240 13000 2
#> 12 b 0.0917 12000 0
#> 13 b 0.0325 10325. 1
#> 14 b 0.0204 13265. 2
#> 15 b 0.0945 13134. 0
#> 16 b 0.0817 11169. 1
#> 17 b 0.0664 14147. 2
#> 18 b 0.0499 13789. 0
An alternate answer in dplyr, using your own data (modified only slightly). Here row_number() %% 2 splits each id into two interleaved sub-groups, so each value compounds over every second row; duplicated(id) is FALSE on each sub-group's first row, zeroing out ret there so the starting ind values pass through unchanged.
set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- ifelse(dt$time == 1, 12000, ifelse(dt$time == 2, 12500, as.numeric(NA)))
library(dplyr, warn.conflicts = F)
dt %>%
  group_by(id) %>%
  group_by(d = row_number() %% 2, .add = TRUE) %>%
  mutate(ind = cumprod(1 + duplicated(id) * ret) * ind[1])
#> # A tibble: 10 x 5
#> # Groups: id, d [4]
#> id time ret ind d
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 a 1 0.00554 12000 1
#> 2 a 2 -0.00280 12500 0
#> 3 a 3 0.0178 12213. 1
#> 4 a 4 0.00187 12523. 0
#> 5 a 5 0.0114 12353. 1
#> 6 b 1 0.00416 12000 0
#> 7 b 2 0.0123 12500 1
#> 8 b 3 0.00237 12028. 0
#> 9 b 4 -0.00365 12454. 1
#> 10 b 5 0.0111 12161. 0

How to create a new variable based on one condition (is.na) at the minimum of another variable in dplyr?

I need to create a new variable (obs.new) that keeps the original value of obs, except when obs is missing at the group's minimum date. In those cases, the obs.new value should be the mean.obs value. The other instances where obs is NA should remain NA.
This is a reproducible example of what I did:
library(dplyr)
data.1 <- read.csv(text = "
site ,treat,date,obs,mean.obs,
1,a,33,0.585581765,0.4,
1,a,34,0.871886986,0.4,
1,a,35,,0.4,
1,a,36,,0.4,
1,a,37,,0.4,
1,a,38,,0.4,
1,a,39,0.628236902,0.4,
1,a,40,0.041956742,0.4,
1,b,36,,0.52,
1,b,37,0.327067686,0.52,
1,b,38,,0.52,
1,b,39,,0.52,
1,b,40,,0.52,
1,b,41,0.982637394,0.52,
1,b,42,0.80141212,0.52,
1,b,43,0.739522519,0.52,
2,a,56,,0.48,
2,a,57,0.724849037,0.48,
2,a,58,0.050617254,0.48,
2,a,59,,0.48,
2,a,60,,0.48,
2,a,61,,0.48,
2,a,62,,0.48,
2,a,63,0.269993451,0.48,
2,b,23,0.216291392,0.49,
2,b,24,,0.49,
2,b,25,,0.49,
2,b,26,,0.49,
2,b,27,,0.49,
2,b,28,,0.49,
2,b,29,0.951644067,0.49,
2,b,30,0.745131113,0.49")
data.1.1 <- data.1 %>%
  group(site, treat) %>%
  mutate(obs.new = if_else(is.na(slice(which.min(date))),
                           mean.obs, obs))
This is the error I got:
Error: Problem with `mutate()` input `obs.new`.
x no applicable method for 'slice' applied to an object of class "c('integer', 'numeric')"
i Input `obs.new` is `if_else(is.na(slice(which.min(date))), mean.obs, obs)`.
i The error occurred in group 1: site = 1, treat = "a".
Run `rlang::last_error()` to see where the error occurred.
The expected result keeps obs as obs.new everywhere, except that the missing obs at each group's minimum date is replaced by mean.obs.
Thanks for any hint.
You could replace the obs value when obs is NA and date is the minimum date in the group.
library(dplyr)
data.1 %>%
  group_by(site, treat) %>%
  mutate(mean.obs.new = ifelse(is.na(obs) & date == min(date), mean.obs, obs))
# site treat date obs mean.obs mean.obs.new
# <int> <chr> <int> <dbl> <dbl> <dbl>
# 1 1 a 33 0.586 0.4 0.586
# 2 1 a 34 0.872 0.4 0.872
# 3 1 a 35 NA 0.4 NA
# 4 1 a 36 NA 0.4 NA
# 5 1 a 37 NA 0.4 NA
# 6 1 a 38 NA 0.4 NA
# 7 1 a 39 0.628 0.4 0.628
# 8 1 a 40 0.0420 0.4 0.0420
# 9 1 b 36 NA 0.52 0.52
#10 1 b 37 0.327 0.52 0.327
# … with 22 more rows
data.1 %>%
  group_by(site, treat) %>%
  mutate(obs.new = if_else(!is.na(obs),
                           obs,
                           if_else(date == min(date),
                                   mean.obs,
                                   NA_real_)))
data.1 %>%
  group_by(site, treat) %>%
  mutate(obs.new = coalesce(obs,
                            if_else(row_number() == 1, mean.obs, NA_real_))) %>%
  ungroup()
# A tibble: 32 x 7
site treat date obs mean.obs X obs.new
<int> <chr> <int> <dbl> <dbl> <lgl> <dbl>
1 1 a 33 0.586 0.4 NA 0.586
2 1 a 34 0.872 0.4 NA 0.872
3 1 a 35 NA 0.4 NA NA
4 1 a 36 NA 0.4 NA NA
5 1 a 37 NA 0.4 NA NA
6 1 a 38 NA 0.4 NA NA
7 1 a 39 0.628 0.4 NA 0.628
8 1 a 40 0.0420 0.4 NA 0.0420
9 1 b 36 NA 0.52 NA 0.52
10 1 b 37 0.327 0.52 NA 0.327
# … with 22 more rows
Change group to group_by; then you can use case_when():
library(dplyr)
data.1 %>%
  group_by(site, treat) %>%
  mutate(obs.new = case_when(!is.na(obs) ~ obs,
                             date == min(date) ~ mean.obs,
                             TRUE ~ NA_real_))
Using data.table
library(data.table)
setDT(data.1)[, mean.obs.new := fifelse(is.na(obs) & date == min(date), mean.obs, obs), .(site, treat)]

How to make grouped summary statistics based on densities in R

Goal: I would like to generate grouped percentiles for each group (hrzn)
I have the following data
# A tibble: 3,500 x 3
hrzn parameter density
<dbl> <dbl> <dbl>
1 1 0.0183 0.00914
2 1 0.0185 0.00905
3 1 0.0187 0.00897
4 1 0.0189 0.00888
5 1 0.0191 0.00880
6 1 0.0193 0.00872
7 1 0.0194 0.00864
8 1 0.0196 0.00855
9 1 0.0198 0.00847
10 1 0.0200 0.00839
The hrzn is the group, the parameter is a grid of parameter space, and the density is the density for the value in the parameter column.
I would like to generate summary statistics, the 10th through 90th percentiles in steps of 10, by hrzn. I am trying to keep this computationally efficient. I know I could sample parameter with density as weights, but I am curious whether there is a faster way to generate the percentiles from the density without sampling.
The data may be obtained with the following
df <- readr::read_csv("https://raw.githubusercontent.com/alexhallam/density_data/master/data.csv")
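For reference, the weighted-sampling approach I would like to avoid looks something like this (a sketch; quantile() on a large sample drawn with density as weights):
library(dplyr)
df %>%
  group_by(hrzn) %>%
  summarise(centiles = list(quantile(sample(parameter, 1e5, replace = TRUE, prob = density),
                                     probs = seq(0.1, 0.9, 0.1))))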
When I load the data from your csv, each of the 5 groups has identical values for parameter and density:
df
#># A tibble: 3,500 x 3
#> hrzn parameter density
#> <int> <dbl> <dbl>
#> 1 1 0.0183 0.00914
#> 2 1 0.0185 0.00905
#> 3 1 0.0187 0.00897
#> 4 1 0.0189 0.00888
#> 5 1 0.0191 0.00880
#> 6 1 0.0193 0.00872
#> 7 1 0.0194 0.00864
#> 8 1 0.0196 0.00855
#> 9 1 0.0198 0.00847
#>10 1 0.0200 0.00839
#># ... with 3,490 more rows
sapply(1:5, function(x) all(df$parameter[df$hrzn == x] == df$parameter[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE
sapply(1:5, function(x) all(df$density[df$hrzn == x] == df$density[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE
I'm not sure if this is a mistake or not, but clearly, if you're worried about computation, anything you want to do on all the groups can be done 5 times faster by doing it on a single group.
Anyway, to get the 10th and 90th centiles for each hrzn, you just need to find the parameter value where the cumulative distribution function crosses 0.1 and 0.9; cumsum(density) serves as the CDF here because each group's density column sums to 1. Let's generalize to all the groups, in case there's an issue with the data or you want to repeat this with different data:
library(dplyr)
df %>%
  mutate(hrzn = factor(hrzn)) %>%
  group_by(hrzn) %>%
  summarise(centile_10 = parameter[which(cumsum(density) > .1)[1]],
            centile_90 = parameter[which(cumsum(density) > .9)[1]])
#># A tibble: 5 x 3
#> hrzn centile_10 centile_90
#> <fct> <dbl> <dbl>
#>1 1 0.0204 0.200
#>2 2 0.0204 0.200
#>3 3 0.0204 0.200
#>4 4 0.0204 0.200
#>5 5 0.0204 0.200
Of course, they're all the same for the reasons mentioned above.
If you're worried about computation time (even though the above takes only a few milliseconds), and you don't mind opaque code, you can take advantage of the ordering: cutting the cumsum of the entire density column into steps of 0.1 between 0 and 5 marks every decile boundary in all five groups at once:
summary <- df[which((diff(as.numeric(cut(cumsum(df$density), seq(0, 5, .1))) - 1) != 0)) + 1,]  # rows where cumsum(density) first crosses each 0.1 boundary
summary <- summary[-(1:5)*10,]       # drop the rows at the whole-number crossings between groups
summary$centile <- rep(1:9*10, 5)    # label the remaining rows 10, 20, ..., 90 within each hrzn
summary
#> # A tibble: 45 x 4
#> hrzn parameter density centile
#> <int> <dbl> <dbl> <dbl>
#> 1 1 0.0204 0.00824 10
#> 2 1 0.0233 0.00729 20
#> 3 1 0.0271 0.00634 30
#> 4 1 0.0321 0.00542 40
#> 5 1 0.0392 0.00453 50
#> 6 1 0.0498 0.00366 60
#> 7 1 0.0679 0.00281 70
#> 8 1 0.103 0.00199 80
#> 9 1 0.200 0.00114 90
#> 10 2 0.0204 0.00824 10
#> # ... with 35 more rows
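A side note on precision: the lookups above return the first grid point past each boundary. If values between grid points matter, linear interpolation of the cumulative density is a cheap refinement (a sketch, again assuming each group's density column sums to 1):
library(dplyr)
df %>%
  group_by(hrzn) %>%
  summarise(centiles = list(approx(x = cumsum(density), y = parameter,
                                   xout = seq(0.1, 0.9, 0.1))$y))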
Perhaps I have misunderstood you, and you are actually working in a 5-dimensional parameter space and want the parameter values at the 10th and 90th centiles of the 5-d density. In that case, since all five groups are identical, you can get those centiles by taking the 5th root of the target probabilities (0.1^(1/5) and 0.9^(1/5)):
df %>%
  mutate(hrzn = factor(hrzn)) %>%
  group_by(hrzn) %>%
  summarise(centile_10 = parameter[which(cumsum(density) > .1^.2)[1]],
            centile_90 = parameter[which(cumsum(density) > .9^.2)[1]])
#> # A tibble: 5 x 3
#> hrzn centile_10 centile_90
#> <fct> <dbl> <dbl>
#> 1 1 0.0545 0.664
#> 2 2 0.0545 0.664
#> 3 3 0.0545 0.664
#> 4 4 0.0545 0.664
#> 5 5 0.0545 0.664

Calculate confidence intervals (binomial) within data frame

I want to get the confidence intervals for proportions within my tibble. Is there a way of doing this?
library(tidyverse)
library(Hmisc)
library(broom)
df <- tibble(id = c(1, 2, 3, 4, 5, 6),
             count = c(4, 1, 22, 4545, 33, 23),
             n = c(22, 65, 34, 6323, 35, 45))
Which looks like this:
# A tibble: 6 x 3
id count n
<dbl> <dbl> <dbl>
1 1 4 22
2 2 1 65
3 3 22 34
4 4 4545 6323
5 5 33 35
6 6 23 45
I have been trying binconf from Hmisc together with tidy from broom, though a solution from any package would be fine.
The intervals for the first row:
tidy(binconf(4, 22))
# A tibble: 1 x 4
.rownames PointEst Lower Upper
<chr> <dbl> <dbl> <dbl>
1 "" 0.182 0.0731 0.385
I have tried using map in purrr but get errors:
map(df, tidy(binconf(count, n)))
Error in x[i] : object of type 'closure' is not subsettable
I could just calculate them with dplyr, but then I get values below zero (e.g. row 2) or above one (e.g. row 5), which I don't want:
df %>%
  mutate(prop = count / n) %>%
  mutate(se = sqrt(prop * (1 - prop) / n)) %>%
  mutate(lower = prop - (se * 1.96)) %>%
  mutate(upper = prop + (se * 1.96))
# A tibble: 6 x 7
id count n prop se lower upper
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 22 0.182 0.0822 0.0206 0.343
2 2 1 65 0.0154 0.0153 -0.0145 0.0453
3 3 22 34 0.647 0.0820 0.486 0.808
4 4 4545 6323 0.719 0.00565 0.708 0.730
5 5 33 35 0.943 0.0392 0.866 1.02
6 6 23 45 0.511 0.0745 0.365 0.657
Is there a good way of doing this? I did have a look at the confint_tidy() function, but could not get that to work. Any ideas?
It may not be tidy, but binconf() is vectorised over its first two arguments, so you can bind its output straight onto the data frame. (Incidentally, the map error arises because tidy(binconf(count, n)) is evaluated immediately; outside the data frame, count and n resolve to the dplyr functions of those names, hence "object of type 'closure' is not subsettable".)
as_tibble(cbind(df, binconf(df$count, df$n)))
# A tibble: 6 x 6
id count n PointEst Lower Upper
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 22 0.182 0.0731 0.385
2 2 1 65 0.0154 0.000789 0.0821
3 3 22 34 0.647 0.479 0.785
4 4 4545 6323 0.719 0.708 0.730
5 5 33 35 0.943 0.814 0.984
6 6 23 45 0.511 0.370 0.650
seems to work
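For a version closer to the map attempt in the question, a sketch (purrr's map2 and tidyr's unnest both come with the already-loaded tidyverse; binconf returns a 1 x 3 matrix per row, which as_tibble turns into columns):
df %>%
  mutate(ci = map2(count, n, ~ as_tibble(binconf(.x, .y)))) %>%  # one interval per row
  unnest(ci)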
