Weighting an already calculated average in R

I have this kind of data:
category <- c("A", "B", "C", "B", "B", "B", "C")
mean <- c(5,4,5,4,3,1,5)
counts <- c(5, 200, 300, 150, 400, 200,250)
df <- data.frame(category, mean, counts)
category is a factor-like grouping variable, mean is an already calculated average on a 1-5 scale, and counts is the number of observations each mean was computed from. I only have the calculated means, not the single values they were calculated from.
The goal is to aggregate the different means over the different categories, like this:
library(dplyr)
df %>% group_by(category) %>%
summarise(weighted.mean(mean, counts))
A 5.000000
B 2.947368
C 5.000000
The problem is that, in my case, an average based on higher counts like C (550) is more valuable than one based on lower counts like A (5). Any idea how to take this into account?
My solution would be this one, but I don't know if it's valid:
df %>% mutate(y = mean * counts) %>%
mutate(category = as.factor(category)) %>%
group_by(category) %>%
summarise(X = sum(y)) %>%
arrange(desc(X))
B 2800
C 2750
A 25
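For what it's worth, sum(mean * counts) gives a total score, not an average, so it conflates the level of the ratings with their volume. A minimal sketch of an alternative (same toy data and column names as in the question, the names avg and n are my own): keep the count-weighted mean per category and carry the total counts alongside, so the sample size stays visible without distorting the average.

```r
library(dplyr)

category <- c("A", "B", "C", "B", "B", "B", "C")
mean <- c(5, 4, 5, 4, 3, 1, 5)
counts <- c(5, 200, 300, 150, 400, 200, 250)
df <- data.frame(category, mean, counts)

# Weighted mean per category, plus the total counts it is based on.
res <- df %>%
  group_by(category) %>%
  summarise(avg = weighted.mean(mean, counts),
            n = sum(counts),
            .groups = "drop")
```

Downstream you can then rank or filter by n (e.g. filter(n >= 100)) instead of folding the counts into the average itself.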

Related

R: Average rows together within a group to match a second dataframe

This question is an extension of one of my previous questions. I have two long-form dataframes, small and big, that have the same groups (id) and a combination of numeric and character variables. In big, the number of rows per group is greater than the number of rows per group in small. My goal is to average rows together in big so that the number of rows per group matches the number of rows per group in small as closely as possible.
I have created a reprex below, which gets close but not as close as I think is possible. I believe the issue is that in big, each group may need its own sum_ref value (which refers to how many n rows should be averaged together), but I am unsure of how to implement that. Any advice is appreciated.
set.seed(123)
library(tidyverse)
id <- c(rep("101", 10), rep("102", 21), rep("103", 15))
color <- c(rep("red", 10), rep("blue", 21), rep("green", 15))
time <- c(1:10, 1:21, 1:15)
V1 <- sample(1:3, 10+21+15, replace=TRUE)
V2 <- sample(1:3, 10+21+15, replace=TRUE)
V3 <- sample(1:3, 10+21+15, replace=TRUE)
small <- data.frame(id,color,time,V1,V2,V3) %>%
mutate(time = 1:length(V1)) %>%
select(id, time, everything())
id <- c(rep("101", 32), rep("102", 45), rep("103", 27))
color <- c(rep("red", 32), rep("blue", 45), rep("green", 27))
time <- c(1:32, 1:45, 1:27)
V1 <- sample(1:3, 32+45+27, replace=TRUE)
V2 <- sample(1:3, 32+45+27, replace=TRUE)
V3 <- sample(1:3, 32+45+27, replace=TRUE)
big <- data.frame(id,color,time,V1,V2,V3) %>%
mutate(time = 1:length(V1)) %>%
select(id, time, everything())
rm(V1,V2,V3,color,id,time)
small_size <- nrow(small)
big_size <- nrow(big)
sum_ref <- big_size/small_size
# `new` should have the same number of rows as `small`
# also for each ID, the number of rows in `small` should equal the number of rows in `new`
new <- big %>%
group_by(id, color, new_time = as.integer(gl(n(), sum_ref, n()))) %>%
summarise(across(starts_with('V'), mean), .groups = 'drop')
print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 53
# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 10
new %>% filter(id == "101") %>% nrow()
#> [1] 16
You are correct: "each group may need its own sum_ref value". My solution is to create a variable that stores the size of each group in small:
small_size <- small %>% group_by(id, color) %>% summarise(size = n())
Then, for each group in big, we create the column that determines which values should be averaged together. In your code you did that with as.integer(gl(n(), sum_ref, n())), but since sum_ref is a decimal number, that doesn't ensure the column runs from 1 to the size of the corresponding small group, so I made a new version:
seq(1, small_size$size[cur_group_id()], length = n()) %>% trunc()
This makes a sequence that goes from 1 to the small group size stored in small_size, using cur_group_id() to access the correct entry of the array. The sequence has length n() (the big group size) and contains only integers because of %>% trunc() (which plays the same role as your as.integer()). There might be a better way to do this, as with my method the last value only appears once. But regardless of how you choose to make this vector transformation, the essence of the answer is how to make a different transformation for each group with small_size$size[cur_group_id()].
new <- big %>%
group_by(id, color) %>%
mutate(new_time = seq(1, small_size$size[cur_group_id()], length = n()) %>% trunc()) %>%
group_by(new_time, .add = TRUE) %>%
summarise(across(starts_with('V'), mean), .groups = 'drop')
print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 46
# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 10
new %>% filter(id == "101") %>% nrow()
#> [1] 10
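As a minimal illustration of the binning trick above (the numbers 5 and 3 are chosen for clarity, not taken from the question's data): mapping 5 "big" rows onto 3 "small" slots with seq() and trunc() shows both the behaviour and the caveat that the last bin ends up smaller.

```r
# 5 big rows -> 3 small slots: a fractional sequence, truncated to bin labels.
bins <- trunc(seq(1, 3, length.out = 5))
print(bins)  # 1 1 2 2 3 -- the last label appears only once
```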

How can I calculate the sum for specific cells?

I want to sum up Population and Householder for NH_AmIn, NH_PI, NH_Other, NH_More as a new row for each county. How can I do that?
A dplyr approach using dummy data; you would have to expand on this. It filters to the focal races, groups by county, gets the sum of population for the filtered rows within each group, and appends the result to the initial data.
library(dplyr)
set.seed(1)
# demo data
df <- data.frame(county=rep(c("A","B"), each=4), race=c("a", "b", "c", "d"), population=sample(2000:15000, size=8))
# sum by state for subset
df %>%
filter(race %in% c("c", "d")) %>%
group_by(county) %>%
summarise("race"="total", "population"=sum(population)) %>%
rbind(df)
For your data, if df is the name of your data.frame, the solution is
df %>%
filter(Race %in% c("NH_AmIn", "NH_PI", "NH_Other", "NH_More")) %>%
group_by(County) %>%
summarise("Race"="total", "Population"=sum(Population), "Householder"=sum(Householder)) %>%
rbind(df)
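As a side note, bind_rows() is the dplyr counterpart of rbind() and matches columns by name rather than by position, which makes the appending step a little more robust. A sketch with the same demo data (the names totals and out are my own):

```r
library(dplyr)
set.seed(1)
df <- data.frame(county = rep(c("A", "B"), each = 4),
                 race = c("a", "b", "c", "d"),
                 population = sample(2000:15000, size = 8))

# Per-county totals for the focal subset.
totals <- df %>%
  filter(race %in% c("c", "d")) %>%
  group_by(county) %>%
  summarise(race = "total", population = sum(population), .groups = "drop")

# bind_rows() matches columns by name, so column order doesn't matter.
out <- bind_rows(df, totals)
```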

dplyr: how to ignore NA in grouping variable

Using dplyr, I'm trying to group by two variables. Now, if there is an NA in one variable but the other variable matches, I'd still like those rows grouped, with the NA taking on the non-NA value. So if I have a data frame like this:
variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)
And if I wanted to group by variable_A and variable_B, rows 1 and 4 normally wouldn't group together, but I'd like them to, with the NA overridden to "a". How can I achieve this? The below doesn't do the job.
df2 <- df %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
You can group by B first, and then fill in the missing A values. Then proceed with what you wanted to do:
df_filled = df %>%
group_by(variable_B) %>%
mutate(variable_A = first(na.omit(variable_A)))
df_filled %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
You could do the missing value imputation using base R as follows:
ii <- which(is.na(df$variable_A))                # row(s) with missing variable_A
jj <- which(df$variable_B == df$variable_B[ii])  # rows sharing that variable_B value
df_filled <- df
df_filled$variable_A[ii] <- na.omit(df$variable_A[jj])[1]  # first non-NA match
Then group and summarize as planned with dplyr
df_filled %>%
group_by(variable_A, variable_B) %>%
dplyr::summarise(total=sum(variable_C))
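An alternative for the imputation step, if you're happy to add tidyr: fill() replaces NAs with the nearest non-NA value within each group. It picks up "a" for row 4 here, matching first(na.omit(...)) in this example only because "a" happens to come first within the group.

```r
library(dplyr)
library(tidyr)

variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)

# "downup" looks at earlier rows first, then later ones, within each group.
df_filled <- df %>%
  group_by(variable_B) %>%
  fill(variable_A, .direction = "downup") %>%
  ungroup()
```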

Using dplyr collapse rows taking condition from another numeric column

An example df:
experiment = c("A", "A", "A", "A", "A", "B", "B", "B")
count = c(1,2,3,4,5,1,2,1)
df = cbind.data.frame(experiment, count)
Desired output:
experiment_1 = c("A", "A", "A", "B", "B")
freq = c(1,1,3,2,1) # frequency
freq_per = c(20,20,60,66.6,33.3) # frequency percent
df_1 = cbind.data.frame(experiment_1, freq, freq_per)
I want to do the following:
Group df using experiment
Calculate freq using the count column
Calculate freq_per
Calculate sum of freq_per for all observations with count >= 3
I have the following code. How do I do the step 4?
freq_count = df %>% dplyr::group_by(experiment, count) %>% summarize(freq=n()) %>% na.omit() %>% mutate(freq_per=freq/sum(freq)*100)
Thank you very much.
There may be a more concise approach but I would suggest collapsing your count in a new column using mutate() and ifelse() and then summarising:
freq_count %>%
mutate(collapsed_count = ifelse(count >= 3, 3, count)) %>%
group_by(collapsed_count, .add = TRUE) %>% # adds a 2nd grouping var (`add =` is deprecated)
summarise(freq = sum(freq), freq_per = sum(freq_per)) %>%
select(-collapsed_count) # dropped to match your df_1.
Also, just FYI: for step 2 you might consider the count() function if you're keen to save some keystrokes. Likewise, tibble() or data.frame() are better options than explicitly calling the data-frame method of cbind() to create a data frame.
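If all you need from step 4 is one number per experiment (rather than a collapsed row), a logical subset inside summarise() does it directly. A sketch on the example data (the names freq_per_3plus and per_3plus are my own):

```r
library(dplyr)

experiment <- c("A", "A", "A", "A", "A", "B", "B", "B")
count <- c(1, 2, 3, 4, 5, 1, 2, 1)
df <- data.frame(experiment, count)

# Steps 1-3 as in the question: frequency and percent per (experiment, count).
freq_count <- df %>%
  group_by(experiment, count) %>%
  summarise(freq = n(), .groups = "drop_last") %>%
  mutate(freq_per = freq / sum(freq) * 100)

# Step 4: sum freq_per only over the rows of each experiment where count >= 3.
per_3plus <- freq_count %>%
  summarise(freq_per_3plus = sum(freq_per[count >= 3]), .groups = "drop")
```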

overlapping groups in dplyr

I'm trying to calculate "rolling" summary statistics based on a grouping factor. Is there a nice way to process by (overlapping) groups, based on (say) an ordered factor?
As an example, say I want to calculate the sum of val by groups
df <- data.frame(grp = c("a", "a", "b", "b", "c", "c", "c"),
val = rnorm(7))
For groups based on grp, it's easy:
df %>% group_by(grp) %>% summarise(total = sum(val))
# result:
grp total
1 a 1.6388
2 b 0.7421
3 c 1.1707
However, what I want to do is calculate "rolling" sums for successive groups ("a" & "b", then "b" & "c", etc.). The desired output would be something like this:
grp1 grp2 total
1 a b 1.6388
2 b c 0.7421
I'm having trouble doing this in dplyr. In particular, I can't seem to figure out how to get "overlapping" groups - the "b" rows in the above example should end up in two output groups.
Try lag:
df %>%
group_by(grp) %>%
arrange(grp) %>%
summarise(total = sum(val)) %>%
mutate(grp1 = lag(grp), grp2 = grp, total = total + lag(total)) %>%
select(grp1, grp2, total) %>%
na.omit
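The lag() trick is specific to windows of two groups. One way to generalise to a window of k consecutive groups is to collapse to one total per group first, then build the overlapping index windows with base embed(); a sketch under that approach (k and the column names are my own choices, not from the question):

```r
library(dplyr)

set.seed(42)
df <- data.frame(grp = c("a", "a", "b", "b", "c", "c", "c"),
                 val = rnorm(7))

# One total per group, in group order.
totals <- df %>%
  group_by(grp) %>%
  summarise(total = sum(val), .groups = "drop")

k <- 2
# embed() yields one row per window of k consecutive positions (reversed),
# so flip the columns to get ascending order within each window.
idx <- embed(seq_len(nrow(totals)), k)[, k:1, drop = FALSE]
rolling <- data.frame(
  grp_first = totals$grp[idx[, 1]],
  grp_last  = totals$grp[idx[, k]],
  total     = rowSums(matrix(totals$total[idx], ncol = k))
)
```

With k = 2 this reproduces the lag()-based result: one row for "a"+"b" and one for "b"+"c".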
