group_by and summarize() multiple things in R using dplyr/tidyverse

I am trying to find the country with the highest average age, but I also need to filter out countries with fewer than 5 entries in the data frame. I tried the following, but it does not work:
bil %>%
group_by(citizenship,age) %>%
mutate(n=count(citizenship), theMean=mean(age,na.rm=T)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
bil is the dataset. I am trying to count how many entries I have for each country, filter out countries with fewer than 5 entries, find the average age for each country, and then find the country with the highest average. I am confused about how to do both things at the same time. If I do one summarize at a time, I lose the rest of my data.

Perhaps this could help. Note that the parameter 'x' in count is a tbl/data.frame, not a column. So, instead of count, we group by 'citizenship' only (grouping by 'age' as well would make the mean of 'age' within each group equal to that age), get the number of rows per group with n(), get the mean of 'age', and then filter:
bil %>%
group_by(citizenship) %>%
mutate(n = n()) %>%
mutate(theMean = mean(age, na.rm=TRUE)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
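If only one row per country is needed, a summarise() variant may be closer to the stated goal. The following is a minimal sketch on made-up data; `bil_toy` and `top_country` are hypothetical names, and the values are invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for `bil` (values are invented)
bil_toy <- data.frame(
  citizenship = c(rep("A", 5), rep("B", 6), rep("C", 2)),
  age = c(50, 60, 70, 80, 90, 40, 45, 50, 55, 60, NA, 95, 99)
)

top_country <- bil_toy %>%
  group_by(citizenship) %>%
  # one row per country: row count and mean age computed together
  summarise(n = n(), theMean = mean(age, na.rm = TRUE)) %>%
  filter(n >= 5) %>%          # drop countries with fewer than 5 entries
  arrange(desc(theMean)) %>%
  slice(1)                    # keep the country with the highest mean
```

Here country "C" has the oldest people but is dropped because it has only 2 rows. Note that n() counts rows with a missing age as well.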

Related

Count occurrences of elements and sum up their values in R

I have a data frame with 2 columns (Company and Amount). There are multiple companies (string) and a value in another column as numeric.
I want to count the occurrences of each company and sum the amount for each company.
So in the end I want to have a new data frame with 3 columns (Company, Occurrences, Total Amount)
I could count all the occurrences and saved it in a new data frame but have no idea how to get the amount.
library(dplyr)
df_companies %>% count(df_companies$Geldgeber) %>% top_n(3, n) %>% arrange(desc(n))
You can use n() to get the count
library(dplyr)
df_companies %>%
group_by(Geldgeber) %>%
summarize(ct = n(), total = sum(Betrag))
You will want to consider how to account for missing values in Betrag, if any. For example, you could first filter(!is.na(Betrag)), or you could use na.rm = TRUE in the call to sum(). Note that filtering first also removes those rows from the count.
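The effect of missing values can be sketched on a small made-up data frame. `df_toy` and the column `non_missing` are hypothetical names introduced only for this illustration; Geldgeber and Betrag follow the question.

```r
library(dplyr)

# Hypothetical toy data; one Betrag value is missing
df_toy <- data.frame(
  Geldgeber = c("X", "X", "Y", "Y", "Y"),
  Betrag = c(10, NA, 1, 2, 3)
)

res <- df_toy %>%
  group_by(Geldgeber) %>%
  summarise(
    Occurrences = n(),                    # all rows, including missing Betrag
    non_missing = sum(!is.na(Betrag)),    # rows that contribute to the total
    total = sum(Betrag, na.rm = TRUE)     # sum ignoring missing values
  )
```

Keeping both counts makes it visible when a total is based on fewer rows than the number of occurrences.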

grouping_by_condition_and_mutate in R

I need your help:
The general idea is to create one new column:
Group the data by a column (DEP), then count the total number of rows per group using the column (id).
Then filter the data on another column (dely), keeping only dely>=60, and count the id values again.
Then calculate the share as
(number of filtered rows) / (total number calculated at the beginning).
total = count(id by group)
share = count(dely>=60) / total
I was able to do it in 3 steps, but I would like to know whether it can be done more concisely.
# group the data by DEP
Total_group <- df %>%
group_by(DEP) %>%
summarise(n = n())
# filter the data to T_depart >= 60
Filter_60 <- df %>% filter(T_depart >= 60)
# then group the filtered data by DEP, as was done for the total
Filter_60_group <- Filter_60 %>%
group_by(DEP) %>%
summarise(n = n())
# then calculate the share (share_dep): n.x is the total count, n.y the filtered count
share_data <- left_join(Total_group, Filter_60_group, by = "DEP") %>% mutate(share_dep = n.y / n.x)
Any idea how to do all of these steps in one or two steps?
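One possible way to collapse the steps, sketched here on made-up data, is to count the rows meeting the condition inside the same summarise() call, since summing a logical vector counts its TRUE values. `df_toy` and `n_60` are hypothetical names; DEP and T_depart follow the code in the question, and the sketch assumes T_depart has no missing values.

```r
library(dplyr)

# Hypothetical toy data (values are invented)
df_toy <- data.frame(
  DEP = c("a", "a", "a", "a", "b", "b", "b"),
  T_depart = c(10, 60, 75, 30, 90, 5, 61)
)

share_data <- df_toy %>%
  group_by(DEP) %>%
  summarise(
    total = n(),                    # all rows in the group
    n_60 = sum(T_depart >= 60),     # rows meeting the condition
    share_dep = n_60 / total        # filtered count / total count
  )
```

Unlike the join-based version, a group with no rows at or above 60 gets a share of 0 rather than NA. If T_depart can be missing, na.rm = TRUE in sum() would be needed.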

How do you count the number of observations in multiple columns and use mutate to make the counts as new columns in R?

I have a dataset that has multiple lines of survey responses from different years and from different organizations. There are 100 questions in the survey and people can skip them. I am trying to get the average for each question by year and by organization (so grouped by organization and year). I also want to get the number of people included in each of those averages, since people can skip questions. I want these two data points as new columns as well, so it will add 200 columns in total. I figured out how to get the average (see the code below), but I can't seem to use the same approach to get the count of observations.
This is how I successfully got the average.
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Question'), mean, na.rm = TRUE, .names = "{.col}_average")) %>%
ungroup()
I am now trying to use a similar setup to get the count of observations. I duplicated the columns with the raw data and added Count to their names so that the new average columns are not picked up when computing the counts.
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'), function(x){sum(!is.na(.))}, .names = "{.col}_ncount")) %>%
ungroup()
The code above does give me the new columns, but the count is the same for all columns and all rows. Any thoughts?
The issue is in the lambda function: it is declared as function(x), but the sum is taken over . instead of x. Inside a %>% pipeline, . by itself can be evaluated as the whole data rather than the current column.
library(dplyr)
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
function(x){sum(!is.na(x))},
.names = "{.col}_ncount")) %>%
ungroup()
If we want to use . or .x, write the lambda function as a formula with ~
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
~ sum(!is.na(.)),
.names = "{.col}_ncount")) %>%
ungroup()
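As a possible alternative to duplicating the raw columns, both statistics could be computed in a single across() call by passing a named list of functions. This is only a sketch on made-up data; `df_toy` and the two question columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy survey data with skipped (NA) answers
df_toy <- data.frame(
  Organization = c("Org1", "Org1", "Org1", "Org2"),
  Year = c(2020, 2020, 2020, 2020),
  Question1 = c(1, NA, 3, 4),
  Question2 = c(NA, NA, 5, 2)
)

res <- df_toy %>%
  group_by(Organization, Year) %>%
  mutate(across(
    starts_with("Question"),
    list(
      average = ~ mean(.x, na.rm = TRUE),   # mean of the answered values
      ncount = ~ sum(!is.na(.x))            # number of answered values
    ),
    .names = "{.col}_{.fn}"                 # e.g. Question1_average
  )) %>%
  ungroup()
```

Each lambda receives one column at a time as .x, so the counts differ per column and per group.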

How to count grouped negative values by year

I have a data set of prices in which the first column is year and the rest of the columns are regions. I'm trying to count the number of negative values by year by region.
I've tried to use dplyr to group_by(year) then summarise_at() but I can't figure out the exact code to use.
Neg_Count <- select (BaseCase, Year, 'Hub1': 'Hub15')
Neg_Count <- Neg_Count %>%
group_by(Year) %>%
What is the best way to do this?
We can use summarise_at: create a logical vector (value < 0) and take its sum
library(dplyr)
Neg_Count %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Hub")), ~ sum(. < 0))
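In more recent dplyr versions the same idea can be written with across(). The sketch below uses made-up data; `prices_toy` is a hypothetical stand-in for the price table, and na.rm = TRUE is an added assumption about how missing prices should be treated.

```r
library(dplyr)

# Hypothetical toy price data (values are invented)
prices_toy <- data.frame(
  Year = c(2020, 2020, 2021, 2021),
  Hub1 = c(-1, 2, -3, -4),
  Hub2 = c(5, NA, -6, 7)
)

neg_count <- prices_toy %>%
  group_by(Year) %>%
  # .x < 0 gives TRUE/FALSE per row; summing counts the TRUE values
  summarise(across(starts_with("Hub"), ~ sum(.x < 0, na.rm = TRUE)))
```

Without na.rm = TRUE, any missing price in a year would make the count for that hub NA.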

What does n=n( ) mean in R?

The other day I was reading the following lines in R and I don't understand what the %>% and summarise(n=n()) and summarise(total=n()) meant. I understand the group_by and ungroup methods though.
Can someone help out? There isn't any documentation for this either.
library(dplyr)
net.multiplicity <- group_by(net, nodeid, epoch) %>% summarise(n=n()) %>%
ungroup() %>% group_by(n) %>% summarise(total=n())
This is from the dplyr package. n=n() means that a variable named n will be assigned the number of rows (think number of observations) in each group of the summarized data.
The %>% is read as "and then" and is a way of listing your functions sequentially rather than nesting them. So that command says: do the grouping, and then summarize the result of the grouping by the number of rows in each group, and then ungroup that result, and then group the ungrouped data based on n, and then summarize that by the total number of rows in each of the new groups.
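The two stages can be traced on a small made-up data frame. `net_toy`, `per_group`, and `multiplicity` are hypothetical names; nodeid and epoch follow the question, and `.groups = "drop"` is used here in place of the explicit ungroup().

```r
library(dplyr)

# Hypothetical toy data (values are invented)
net_toy <- data.frame(
  nodeid = c(1, 1, 1, 2, 2, 3),
  epoch  = c(1, 1, 2, 1, 1, 1)
)

# Stage 1: number of rows for each (nodeid, epoch) combination
per_group <- net_toy %>%
  group_by(nodeid, epoch) %>%
  summarise(n = n(), .groups = "drop")

# Stage 2: how many combinations share each row count
multiplicity <- per_group %>%
  group_by(n) %>%
  summarise(total = n())
```

In this toy data, two combinations occur once and two occur twice, so `multiplicity` has one row for n = 1 and one for n = 2, each with total = 2.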
