Calculate ratio for subsets within subsets using dplyr - r

I have a set of data for many authors (AU), spanning multiple years (Year) and multiple topics (Topic). For each AU, Year, and Topic combination I want to calculate a ratio of the total FL by Topic / total FL for the year.
The data will look like this:
Data <- data.frame("AU" = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
                   "Year" = c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,2010,2010,2010,2011,2011,2011,2011,2010,2011,2011),
                   "Topic" = c(1,1,1,2,2,2,1,1,2,2,2,2,2,1,1,1,1,1,1,1),
                   "FL" = c(1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,1,1,1))
I've been playing around with dplyr trying to figure out how to do this. I can group_by easily enough, but I'm not sure how to calculate a ratio that uses one group for the numerator and the total across all groups for the denominator.
Results <- Data %>%
  group_by(Year, AU) %>%
  summarise(ratio = ???) # Should be (Sum(FL) by Topic) / (Sum(FL) across all Topics)

If I understand your desired output correctly, you can calculate the total by Topic, Year, and AU and the total by Year and AU separately, then join them together using left_join.
left_join(
  Data %>%
    group_by(AU, Year, Topic) %>%
    summarise(FL_topic = sum(FL)) %>%
    ungroup(),
  Data %>%
    group_by(AU, Year) %>%
    summarise(FL_total = sum(FL)) %>%
    ungroup(),
  by = c("AU", "Year")
) %>%
  mutate(ratio = FL_topic/FL_total)
# A tibble: 7 x 6
# AU Year Topic FL_topic FL_total ratio
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2010 1 2 4 0.5
# 2 1 2010 2 2 4 0.5
# 3 1 2011 1 0 2 0
# 4 1 2011 2 2 2 1
# 5 2 2010 1 1 4 0.25
# 6 2 2010 2 3 4 0.75
# 7 2 2011 1 4 4 1
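For completeness, the same table can be computed in a single pipeline without a join: summarise per Topic, keep the AU/Year grouping, and let mutate() compute the group total. A sketch, assuming dplyr >= 1.0 for the `.groups` argument:

```r
library(dplyr)

Data <- data.frame(
  AU    = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
  Year  = c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,
            2010,2010,2010,2011,2011,2011,2011,2010,2011,2011),
  Topic = c(1,1,1,2,2,2,1,1,2,2,2,2,2,1,1,1,1,1,1,1),
  FL    = c(1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,1,1,1)
)

Results <- Data %>%
  group_by(AU, Year, Topic) %>%
  summarise(FL_topic = sum(FL), .groups = "drop_last") %>%  # still grouped by AU, Year
  mutate(FL_total = sum(FL_topic),        # total across all Topics within AU/Year
         ratio    = FL_topic / FL_total) %>%
  ungroup()
```

This avoids scanning the data twice, at the cost of being slightly less explicit than the two-summary join.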

How to drop NA's out of the summarise(count = n()) function in R?

I have a dataset containing 4 organisation units (org_unit) with different numbers of participants and 2 questions (Q1, Q2) on a 2-point scale (1:2). I want to know how many people per unit answered the respective question with [1], and divide that by the total number of participants per unit.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(-9,-9,-9,-9,-9,-9,-9,-9,-9,-9)
The problem is, my Q2 consists only of [-9], which stands for non-response. I therefore recoded [-9] as NA.
DF <- data.frame(Org_unit, Q1, Q2)
DF[DF == -9] <- NA
DF
Org_unit Q1 Q2
1 1 1 NA
2 1 2 NA
3 1 1 NA
4 1 2 NA
5 2 1 NA
6 2 2 NA
7 2 1 NA
8 3 2 NA
9 3 1 NA
10 4 2 NA
Next I calculated the proportion of people who answered Q1 with [1], which works fine.
prop_q1 <- DF %>%
  group_by(Org_unit) %>%
  summarise(count = n(),
            prop = mean(Q1 == 1))
prop_q1
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 0.5
2 2 3 0.667
3 3 2 0.5
4 4 1 0
When I run the same code for Q2, however, I get the same number of members per unit as before (count = c(4, 3, 2, 1)), although nobody answered the question. I don't want them to be registered as participants, since they technically didn't participate in the study.
prop_q2 <- DF %>%
  group_by(Org_unit) %>%
  summarise(count = n(),
            prop = mean(Q2 == 1))
prop_q2
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 NA
2 2 3 NA
3 3 2 NA
4 4 1 NA
Is there a way to calculate the right number of members per unit when facing NAs?
Thanks!
Would
prop_q2 <- DF %>%
  filter(!is.na(Q2)) %>%
  group_by(Org_unit) %>%
  summarise(count = n(),
            prop = mean(Q2 == 1))
do the job?
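One caveat worth noting: with filter(), units in which nobody answered Q2 disappear from the output entirely. If every Org_unit should remain in the result with a count of 0, the NA handling can instead be moved inside summarise(). A sketch under that assumption:

```r
library(dplyr)

Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(-9,-9,-9,-9,-9,-9,-9,-9,-9,-9)
DF <- data.frame(Org_unit, Q1, Q2)
DF[DF == -9] <- NA

prop_q2 <- DF %>%
  group_by(Org_unit) %>%
  summarise(count = sum(!is.na(Q2)),            # respondents only
            prop  = mean(Q2 == 1, na.rm = TRUE)) # NaN when nobody answered
```

Here all four units survive with count = 0, and prop becomes NaN where there were no responses.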
Given that you want to do this across multiple columns, I think that using across() within the dplyr verbs will be better for you. I explain the solution below.
library(dplyr)
library(tibble)

Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(1,-9,-9,-9,-9,-9,-9,-9,-9,-9) # note one response
df <- tibble(Org_unit, Q1, Q2)
df %>%
  mutate(across(starts_with("Q"), ~na_if(., -9))) %>%
  group_by(Org_unit) %>%
  summarize(across(starts_with("Q"),
                   list(N = ~sum(!is.na(.)),
                        prop = ~sum(. == 1, na.rm = TRUE)/sum(!is.na(.)))))
# A tibble: 4 x 5
Org_unit Q1_N Q1_prop Q2_N Q2_prop
* <dbl> <int> <dbl> <int> <dbl>
1 1 4 0.5 1 1
2 2 3 0.667 0 NaN
3 3 2 0.5 0 NaN
4 4 1 0 0 NaN
First, we take the data frame (created here as a tibble) and substitute NA for every value equal to -9 in all columns that start with a capital "Q". This converts all question columns to have NAs in place of -9s.
Second, we group by the organizational unit and summarize with two functions. The first counts the non-NA responses to each question; the suffix _N is appended to these columns. The second calculates the proportion of non-NA responses equal to 1; these columns get the suffix _prop.

Using dplyr and group_by to calculate number of repetition for a value

I have a dataset with seller_ID, product_ID, and the year the product was sold. For each seller, I am trying to find the year in which they sold the most products and the number sold in that year. Here is an example of the data:
seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
                1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data <- data.frame(seller_ID, Product_ID, year)
seller_ID Product_ID year
1 1 1000 2015
2 1 1000 2016
3 1 1005 2015
4 2 1004 2020
5 2 1005 2020
6 3 1003 2000
7 4 1010 2000
8 4 1000 2001
9 4 1001 2001
10 4 1019 2001
11 4 1017 2005
so the ideal result would be:
seller_ID Max_sold_num_year Max_year
1 1 2 2015
2 2 2 2020
3 3 1 2000
4 4 3 2001
I tried the approach below, and it works ...
df_temp <- data %>%
  group_by(seller_ID, year) %>%
  summarize(Sold_in_Year = length(Product_ID))
unique_seller = unique(data$seller_ID)
ID_list = c()
Max_list = c()
Max_Sold_Year = c()
j = 1
for (ID in unique_seller) {
  df_temp_2 <- subset(df_temp, df_temp$seller_ID == ID)
  Max_year <- subset(df_temp_2, df_temp_2$Sold_in_Year == max(df_temp_2$Sold_in_Year))
  if (nrow(Max_year) > 1) {
    ID_list[j] <- Max_year[1, 1]
    Max_Sold_Year[j] <- Max_year[1, 2]
    Max_list[j] <- Max_year[1, 3]
    j <- j + 1
  }
  else {
    ID_list[j] <- Max_year[1, 1]
    Max_Sold_Year[j] <- Max_year[1, 2]
    Max_list[j] <- Max_year[1, 3]
    j <- j + 1
  }
}
# changing the above lists to a data frame
mm = length(ID_list)
df_test_list <- data.frame(seller_ID = numeric(mm),
                           Max_sold_num_year = numeric(mm),
                           Max_year = numeric(mm))
for (i in 1:mm) {
  df_test_list$seller_ID[i] <- ID_list[[i]]
  df_test_list$Max_sold_num_year[i] <- Max_list[[i]]
  df_test_list$Max_year[i] <- Max_Sold_Year[[i]]
}
However, due to the repeated subsetting and the for loop, this approach is slow for a large dataset. Do you have any suggestions on how to improve my code? Is there another way to calculate the desired result without a for loop?
Thanks
Try this
library(dplyr)
seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
                1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data <- data.frame(seller_ID, Product_ID, year)
data %>%
  dplyr::count(seller_ID, year) %>%
  dplyr::group_by(seller_ID) %>%
  dplyr::filter(n == max(n)) %>%
  dplyr::rename(Max_sold_num_year = n, Max_year = year)
#> # A tibble: 4 x 3
#> # Groups: seller_ID [4]
#> seller_ID Max_year Max_sold_num_year
#> <dbl> <dbl> <int>
#> 1 1 2015 2
#> 2 2 2020 2
#> 3 3 2000 1
#> 4 4 2001 3
And thanks to the comment by #yung_febreze, this can be written even more compactly with
data %>%
  dplyr::count(seller_ID, year) %>%
  dplyr::group_by(seller_ID) %>%
  dplyr::top_n(1)
EDIT: In case of tied maximum values, one can append dplyr::top_n(1, wt = year), which keeps the latest (maximum) year:
data %>%
  dplyr::count(seller_ID, year) %>%
  dplyr::group_by(seller_ID) %>%
  dplyr::top_n(1, wt = n) %>%
  dplyr::top_n(1, wt = year) %>%
  dplyr::rename(Max_sold_num_year = n, Max_year = year)
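Note that top_n() has since been superseded; in dplyr 1.0+ the same result can be sketched with slice_max() on the same example data:

```r
library(dplyr)

seller_ID  <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,1000,1001,1019,1017)
year       <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data <- data.frame(seller_ID, Product_ID, year)

result <- data %>%
  count(seller_ID, year) %>%                     # sales per seller/year
  group_by(seller_ID) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%     # best year per seller
  ungroup() %>%
  rename(Max_sold_num_year = n, Max_year = year)
```

With with_ties = FALSE, ties are broken by row order (so the earliest tied year wins); to keep the latest year instead, add a second slice_max(year) before dropping ties, mirroring the top_n(1, wt = year) trick above.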

Counting how often x occurs per y and visualizing it in R

I would like to count certain things in my dataset. I have panel data and ideally would like to count the number of activities per person.
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
So my output would tell me, for example, that person 4 has 2 tasks:
people  frequency activity
     4                   2
Would I need to group something? Ideally I would also like to visualize this as a histogram.
I have tried this:
## activity per person
cllw %>%
  ## group observations by people
  group_by(id_user) %>%
  ## count activities per person - and I am not sure how to create frequencies at all
Like this?
library(dplyr)
df %>%
  group_by(people) %>%
  summarise("frequency activity" = n())
# A tibble: 5 x 2
people `frequency activity`
<dbl> <int>
1 1 3
2 2 2
3 3 2
4 4 2
5 5 2
Or like this if you only want "active" tasks:
df %>%
  filter(completion != 1) %>%
  group_by(people) %>%
  summarise("frequency activity" = n())
# A tibble: 4 x 2
people `frequency activity`
<dbl> <int>
1 1 2
2 2 1
3 4 2
4 5 1
Edit for unique tasks per person:
df %>%
  filter(completion != 1) %>%
  distinct(people, activity) %>%
  group_by(people) %>%
  summarise("frequency activity" = n())
# A tibble: 4 x 2
people `frequency activity`
<dbl> <int>
1 1 1
2 2 1
3 4 1
4 5 1
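On the visualization half of the question: since the frequencies are already aggregated per person, a bar chart (geom_col) is the natural choice rather than a histogram, which expects raw values. A sketch with ggplot2, assuming the example vectors above assembled into a data frame df:

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  people     = c(1,1,1,2,2,3,3,4,4,5,5),
  activity   = c(1,1,1,2,2,3,4,5,5,6,6),
  completion = c(0,0,1,0,1,1,1,0,0,0,1)
)

counts <- df %>%
  group_by(people) %>%
  summarise(frequency = n())

p <- ggplot(counts, aes(x = factor(people), y = frequency)) +
  geom_col() +                                   # bars at the pre-computed heights
  labs(x = "person", y = "number of activities")
```

Printing p draws one bar per person; swap counts for any of the filtered summaries above to plot those instead.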

R - how to sum each column of a df

I have this df
df <- read.table(text="
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header=TRUE)
What I'd like to do is calculate the sum of gas and tickets (and another 50+ columns in my real df) for each month. Usually I would do something like
result <- df %>%
  group_by(month) %>%
  summarise(
    gas = sum(gas),
    tickets = sum(tickets)
  ) %>%
  ungroup()
But since I have a really large number of columns in my data frame, I don't want to repeat myself by writing a sum for each column. I'm wondering if it is possible to do something more elegant - a function or something that will sum every column except id and month, grouped by month.
You can use summarise_at() to ignore id and sum the rest:
df %>%
  group_by(month) %>%
  summarise_at(vars(-id), list(sum = ~sum))
# A tibble: 3 x 3
month gas_sum tickets_sum
<int> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
You can use aggregate as markus recommends in the comments. If you want to stick to the tidyverse you could try something like this:
df %>%
  select(-id) %>%
  group_by(month) %>%
  summarise_if(is.numeric, sum)
#### OUTPUT ####
# A tibble: 3 x 3
month gas tickets
<fct> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
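Both summarise_at() and summarise_if() have since been superseded; in dplyr 1.0+ the idiomatic form is across() inside a plain summarise(). A sketch on the same data:

```r
library(dplyr)

df <- read.table(text = "
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header = TRUE)

result <- df %>%
  group_by(month) %>%
  summarise(across(-id, sum))   # sum every remaining column except id
```

The tidyselect expression inside across() scales to any number of columns, so the 50+ real columns need no individual mention.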

Count number of observations without N/A per year in R

I have a dataset and I want to summarize the number of observations without the missing values (denoted by NA).
My data is similar as the following:
data <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")
I was planning to use the package dplyr, but that only takes the years into account, not the different variables:
library(dplyr)
data %>%
  group_by(Year) %>%
  summarise(number = n())
How can I obtain the following outcome?
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
To get the counts, you can start by using:
library(dplyr)
data %>%
  group_by(Year) %>%
  summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
# Year ExplanatoryVariable1 ExplanatoryVariable2
# <int> <int> <int>
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:
library(tidyr)
data %>%
  group_by(Year) %>%
  summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>%
  gather(var, count, -Year) %>%
  spread(Year, count)
## A tibble: 2 x 4
# var `2000` `2001` `2002`
#* <chr> <int> <int> <int>
#1 ExplanatoryVariable1 2 2 1
#2 ExplanatoryVariable2 2 2 2
Just to let the OP know, since they have ~200 explanatory variables to select: summarise_at offers several other ways to pick the variables. You can simply write first:last if the variables are ordered consecutively in the data, for example:
data %>%
  group_by(Year) %>%
  summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.)))
Or:
data %>%
  group_by(Year) %>%
  summarise_at(3:4, ~sum(!is.na(.)))
Or store the variable names in a vector and use that:
vars <- names(data)[4:5]
data %>%
  group_by(Year) %>%
  summarise_at(vars, ~sum(!is.na(.)))
data %>%
  gather(cat, val, -(1:3)) %>%
  filter(complete.cases(.)) %>%
  group_by(Year, cat) %>%
  summarize(n = n()) %>%
  spread(Year, n)
# # A tibble: 2 x 4
# cat `2000` `2001` `2002`
# * <chr> <int> <int> <int>
# 1 ExplanatoryVariable1 2 2 1
# 2 ExplanatoryVariable2 2 2 2
This should do it. You start by stacking the data, then simply calculate n for each Year and explanatory-variable combination. If you want the data back in wide format, use spread; even without spread, you still get the counts for both variables.
Using base R:
do.call(cbind, by(data[3:5], data$Year, function(x) colSums(!is.na(x[-1]))))
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
For aggregate:
aggregate(. ~ Year, data[3:5], function(x) sum(!is.na(x)), na.action = function(x) x)
You could do it with aggregate in base R.
aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
               ExplanatoryVariable2 = data$ExplanatoryVariable2),
          list(Year = data$Year),
          function(x) length(x[!is.na(x)]))
# Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
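As a final note, gather() and spread() have been retired in favor of pivot_longer() and pivot_wider() (tidyr 1.0+). The reshaped count table can be sketched in the current idiom as:

```r
library(dplyr)
library(tidyr)

data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")

result <- data %>%
  pivot_longer(starts_with("Expla"), names_to = "var", values_to = "val") %>%
  filter(!is.na(val)) %>%                      # drop missing observations
  count(var, Year) %>%                         # non-NA count per variable/year
  pivot_wider(names_from = Year, values_from = n)
```

This is a direct translation of the gather/spread pipeline above, with the counts per variable in rows and the years as columns.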
