dplyr: User defined Function in summarise() involving two input vectors - r

I have a data frame with, say, 20 columns. Column 1 is the group, column 2 holds the weights (not normalized to 1 or 100), and columns 3 to 20 contain the data to be aggregated. There are some 250 rows but just 15 groups, so each group has around 16-17 rows on average.
For each of columns 3 to 20, I need the group-wise weighted mean, with column 2 as the weights.
Without NAs this is easy: multiply all columns by column 2 and then run
group_by(df, column1)%>%
summarise_all(sum_na)
Here sum_na is the usual sum function with na.rm = TRUE. The summed columns 3 to 20 are then divided by the summed column 2.
The problem is that there are NAs scattered through the data frame. For example, say row 150 (belonging to group 5) has NA in column 12. When calculating the weighted mean for group 5 and column 12, the denominator should exclude the weight in row 150 of column 2.
How can I do this? Sorry for the long post. I am unable to provide sample data because Stack Overflow is inaccessible at my office (posting from mobile).

Would something like this work?
library(dplyr)
df %>%
group_by(group) %>%
summarise_at(vars(col1:col18), ~weighted.mean(., wt, na.rm = TRUE))
You can select a range of columns in vars(). With na.rm = TRUE, weighted.mean() drops the NA values in each of col1 to col18 together with their corresponding weights in wt.
I tried this on the following example:
df <- data.frame(group = rep(1:3, each = 3), wt = 1:9,
col1 = c(2:5, NA, 6:9), col2 = c(NA, 3:6, NA, 2:4))
df %>%
group_by(group) %>%
summarise_at(vars(col1:col2), ~weighted.mean(., wt, na.rm = TRUE))
# group col1 col2
# <int> <dbl> <dbl>
#1 1 3.33 3.6
#2 2 5.6 5.56
#3 3 8.08 3.08
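To see why the denominator comes out right, here is a small base R sketch (not part of the original answer) that computes the group 2, col1 value from the example by hand and compares it with weighted.mean(). The names x, wt_g and keep are only illustrative.

```r
# Group 2 of the example above: col1 has one NA, the weights are 4, 5, 6
x    <- c(5, NA, 6)
wt_g <- c(4, 5, 6)

# Keep only the rows where x is observed, so the weight paired with the NA is excluded
keep   <- !is.na(x)
manual <- sum(x[keep] * wt_g[keep]) / sum(wt_g[keep])

# weighted.mean() with na.rm = TRUE drops the same x/weight pairs
builtin <- weighted.mean(x, wt_g, na.rm = TRUE)

print(c(manual = manual, builtin = builtin))  # both are 56 / 10 = 5.6
```

The weight 5 that belongs to the NA never enters the denominator, which is the behaviour the question asks for.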

We can use data.table methods
library(data.table)
setDT(df)[, lapply(.SD, function(x) weighted.mean(x, wt, na.rm = TRUE)),
by = group, .SDcols = col1:col18]

Related

How to partition into equal sum subsets in R?

I have a dataset with a column, X1, of various values. I would like to order this dataset by the value of X1, and then partition into K number of equal sum subsets. How can this be accomplished in R? I am able to find quartiles for X1 and append the quartile groupings as a new column to the dataset, however, quartile is not quite what I'm looking for. Thank you in advance!
df <- data.frame(replicate(10,sample(0:1000,1000,rep=TRUE)))
df <- within(df, quartile <- as.integer(cut(X1, quantile(X1, probs=0:4/4), include.lowest=TRUE)))
Here's a rough solution (using set.seed(47) if you want to reproduce exactly). I calculate the proportion of the sum for each row, and do the cumsum of that proportion, and then cut that into the desired number of buckets.
library(dplyr)
n_groups = 10
df %>% arrange(X1) %>%
mutate(
prop = X1 / sum(X1),
cprop = cumsum(prop),
bins = cut(cprop, breaks = n_groups - 1)
) %>%
group_by(bins) %>%
summarize(
group_n = n(),
group_sum = sum(X1)
)
# # A tibble: 9 × 3
# bins group_n group_sum
# <fct> <int> <int>
# 1 (-0.001,0.111] 322 54959
# 2 (0.111,0.222] 141 54867
# 3 (0.222,0.333] 111 55186
# 4 (0.333,0.444] 92 55074
# 5 (0.444,0.556] 80 54976
# 6 (0.556,0.667] 71 54574
# 7 (0.667,0.778] 66 55531
# 8 (0.778,0.889] 60 54731
# 9 (0.889,1] 57 55397
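This could of course be simplified: you don't need to keep the extra columns. Just mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups - 1)) adds the bins column to the original data (and no other columns); the group_by() %>% summarize() is only there to diagnose the result. Note that a single number passed to breaks is the number of intervals, so breaks = n_groups - 1 gives 9 bins here; use breaks = n_groups if you want exactly n_groups subsets.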
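The same cumulative-share idea can be sketched in base R without dplyr. This is only an illustration under assumptions: it uses deterministic toy data (1:100) instead of the random sample, and the names x_sorted, cprop and group_sums are made up for the example. The resulting sums are only approximately equal, as in the answer above.

```r
# Deterministic toy data instead of the random sample in the question
x <- 1:100
n_groups <- 4

x_sorted <- sort(x)
# Cumulative share of the total sum, running from just above 0 up to 1
cprop <- cumsum(x_sorted) / sum(x_sorted)
# A single number for 'breaks' asks cut() for that many equal-width intervals
bins <- cut(cprop, breaks = n_groups, labels = FALSE)

# Diagnose the result: size and sum of each bucket
group_sums <- tapply(x_sorted, bins, sum)
group_n    <- tapply(x_sorted, bins, length)
print(rbind(group_n, group_sums))
```

Because the values are sorted in increasing order, the first bucket needs many small values and the last bucket only a few large ones to reach a similar sum.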

Compute accordance of column values grouped by another column [duplicate]

I have a data frame with a column of IDs spanning multiple rows (col_id) and another column of assessments for this row (col_assessment), like so:
df <- data.frame(col_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
col_assessment = c("Pos", "Pos", "Neu", "Neu", "Neg", "Neu", "Pos", "Neu", "Neg"))
I now want to calculate how much the assessments agree for each ID (i.e. what share of the assessments per ID are the same). For this, I have the following function. (I do not have to use this function and am also open to other solutions.)
compute_ICR <- function(coding_values){
### takes in list of coding values and returns number of the share of agreement (up to 1 if all are in agreement)
most_common_value <- coding_values %>% table() %>% sort(decreasing = TRUE) %>% magrittr::extract(1) %>% names()
share_accordance <- length(which(coding_values == most_common_value)) / coding_values %>% nrow()
# number of matching, most common values divided by number of total values
return(share_accordance)
}
I would now like to apply this to df by group of col_id, like so (not working pseudo-code!)
df %>% group_by(col_id) %>% summarize(share_accordance = compute_ICR(df$col_assessment))
This should give me the following data frame for the above example:
data.frame(col_id = c(1,2,3), share_accordance = c(.6667, .6667, .333))
Can someone point out how to achieve this result? Thanks in advance.
I would change the function to -
compute_ICR <- function(x){
sort(table(x), decreasing = TRUE)[1]/length(x)
}
and apply it for each ID.
library(dplyr)
df %>%
group_by(col_id) %>%
summarize(share_accordance = compute_ICR(col_assessment))
# col_id share_accordance
# <dbl> <dbl>
#1 1 0.667
#2 2 0.667
#3 3 0.333
Or in base R -
aggregate(col_assessment~col_id, df, compute_ICR)
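For what it's worth, here is my reading of why the original function fails (a sketch, not part of the answer above). Inside summarize() the function receives a plain vector, and nrow() of a vector is NULL, so the division yields an empty result; length() is the count that is needed. Separately, the attempt passed df$col_assessment, which is the whole column rather than the current group's values. The variable names below are only illustrative.

```r
coding_values <- c("Pos", "Pos", "Neu")  # the assessments for col_id 1

# nrow() is only defined for objects with dimensions; for a vector it is NULL
n_rows <- nrow(coding_values)
print(is.null(n_rows))  # TRUE

# Dividing by NULL gives a zero-length result rather than a share
broken <- sum(coding_values == "Pos") / n_rows
print(length(broken))  # 0

# length() gives the number of values, so the share is well defined
share <- max(table(coding_values)) / length(coding_values)
print(share)  # 2/3
```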
As I understand your question, you want the largest proportion of answers per ID? The code below gives this regardless of how many distinct values col_assessment can take.
library(dplyr)
df %>%
group_by(col_id) %>%
summarise(prop = max(prop.table(table(col_assessment))))
Returns:
col_id prop
<dbl> <dbl>
1 1 0.667
2 2 0.667
3 3 0.333

Create a loop for calculating values from a dataframe in R?

Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply the function of summing the second and third columns, adding them, and dividing by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(everything(), sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
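Since the question asked for a loop, here is a plain for-loop sketch of the same calculation. It assumes the number of columns is a multiple of 3; the names starts, value and result are just illustrative.

```r
X <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)

starts <- seq(1, ncol(X), by = 3)   # index of the first column in each block of 3
value  <- numeric(length(starts))   # preallocate one result per block

for (k in seq_along(starts)) {
  i <- starts[k]
  # (sum of 2nd column + sum of 3rd column) / sum of 1st column of the block
  value[k] <- (sum(X[[i + 1]]) + sum(X[[i + 2]])) / sum(X[[i]])
}

result <- data.frame(value = value)
print(result)
```

The next function is not needed here: stepping through the starting index of each block already skips to the following group of three columns.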

adding values using rowSums and tidyverse

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.
Here's what the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns: for example, all the values in the first row between the first column and column 100, all the values in the first row between column 1 and column 200, all the values in the second row between the first column and column 100, etc.
Using the sample data I've come up with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the columns a1 to a100 instead of using this approach .[,paste("a", 1:3, sep="")]? I am interested in selecting the columns by name. I do not want to select them by position, because a100 is not always column 100.
Also, I am converting the NAs and NaNs to 0 in order to be able to sum the rows. I am doing it with mutate_all(funs(replace_na(., 0))), which loses my first column that contains the names of the values. What would be the best way to replace NA and NaN without mutating the string values of the first column to 0?
The columns I am adding are integer, as I converted them beforehand with mutate_if(~!is.numeric(.x), as.numeric). Should I follow the same approach if I have dbl columns?
Thank you!
Here is one way to do this: after transforming the data to long format, for each name we create groups of n rows and take the sum of each group.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name) %>%
group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>% # dplyr >= 1.0.0; older versions use add = TRUE
summarise(value = sum(value, na.rm = TRUE)) %>%
#If needed in wider format again
pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
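If you would rather keep the rowSums approach and pick the columns by name, one possible sketch is to intersect the wanted names with the names that actually exist, so gaps such as the missing a7 are skipped. This is an illustration under assumptions: it uses a plain data.frame copy of the sample data (a data.table would need different column indexing), and sum_bucket is a hypothetical helper, not an existing function.

```r
# Plain data.frame version of the sample data (assumption: not a data.table)
dataset <- data.frame(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1, 2, NaN, 5),
                      a3 = 1:4, a4 = 1:4, a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4)

# Hypothetical helper: row-wise sum over columns <prefix><idx> that exist in 'data'
sum_bucket <- function(data, idx, prefix = "a") {
  cols <- intersect(paste0(prefix, idx), names(data))  # drops names like "a7" that are absent
  rowSums(data[, cols, drop = FALSE], na.rm = TRUE)    # na.rm skips both NA and NaN
}

dataset$sum  <- sum_bucket(dataset, 1:3)
dataset$sum1 <- sum_bucket(dataset, 4:5)
dataset$sum2 <- sum_bucket(dataset, 6:8)
print(dataset[, c("name", "sum", "sum1", "sum2")])
```

With na.rm = TRUE there is no need to replace NA or NaN by 0 first, so the name column is never touched, and rowSums handles integer and double columns alike.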

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example: the above is the data I have. I want to group ages 1-2 and count the values; in this data the count is 4 for age group 1-2. Similarly, I want to group ages 3-4 and count the values; here the count for age group 3-4 is 6.
How can I group ages and aggregate the values corresponding to each group?
I know this way: code-
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values for each individual age.
I want the values aggregated over groups of several ages, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second, a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
library(dplyr)
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = FALSE)
# cut() puts each age into the right-closed interval (0,2] or (2,4]
df %>%
count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
# grp n
# <fct> <int>
#1 (0,2] 4
#2 (2,4] 6
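The same cut() idea also works in base R alone, and the labels argument gives readable group names instead of interval notation. A minimal sketch, assuming the same ages as above; the label strings "1-2" and "3-4" are my own choice.

```r
age <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4)

# Intervals are right-closed by default: (0,2] holds ages 1 and 2, (2,4] holds ages 3 and 4
grp <- cut(age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

counts <- table(grp)
print(counts)
```

Ages outside the breaks would become NA and be left out of the table, so extend the breaks if the real data has a wider range.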
