Count occurrences of elements and sum up their values in R

I have a data frame with two columns, Company and Amount. The Company column holds company names (strings), and the Amount column holds a numeric value.
I want to count the occurrences of each company and sum the amount for each company.
In the end I want a new data frame with three columns (Company, Occurrences, Total Amount).
I was able to count the occurrences and save them in a new data frame, but I have no idea how to get the total amount.
library(dplyr)
df_companies %>% count(Geldgeber) %>% top_n(3, n) %>% arrange(desc(n))

You can use n() inside summarize() to get the count per group, and sum() to get the total:
library(dplyr)
df_companies %>%
group_by(Geldgeber) %>%
summarize(ct = n(), total = sum(Betrag))
You will want to consider how to account for missing values in Betrag, if any. For example, you could first drop them with filter(!is.na(Betrag)), or pass na.rm = TRUE in the call to sum().
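As a minimal sketch of the na.rm option, assuming a toy data frame that reuses the column names Geldgeber and Betrag (the values and the output column names Occurrences and Total_Amount are made up for illustration):

```r
library(dplyr)

# Hypothetical sample data; only the column names come from the question
df_companies <- data.frame(
  Geldgeber = c("A", "A", "B", "B", "B"),
  Betrag    = c(10, NA, 5, 5, 20)
)

result <- df_companies %>%
  group_by(Geldgeber) %>%
  summarize(
    Occurrences  = n(),                       # counts every row, including rows with NA in Betrag
    Total_Amount = sum(Betrag, na.rm = TRUE), # missing amounts are ignored in the sum
    .groups = "drop"
  )

print(result)
```

Note that n() still counts the row with the missing amount. Filtering with filter(!is.na(Betrag)) first would drop that row from the count as well, so the two options are not interchangeable.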

Related

grouping_by_condition_and_mutate in R

I need your help.
The general idea is to create one new column:
Group the data by a column (DEP), then count the total number of rows per group using the id column.
Then filter the data on another column (dely), keeping only dely >= 60, and count the ids again.
Then calculate the share as (number of rows after filtering) / (total number of rows counted at the beginning).
total = count(id by group)
share = count(dely >= 60) / total
I was able to do it in three steps, but I would like to know whether it can be done more concisely.
library(dplyr)
# group the data by DEP
Total_group <- df %>%
group_by(DEP) %>%
summarise(n = n())
# filter the data: T_depart >= 60
Filter_60 <- df %>% filter(T_depart >= 60)
# then group the filtered data by DEP, as done for the total
Filter_60_group <- Filter_60 %>%
group_by(DEP) %>%
summarise(n = n())
# then calculate the share (share_dep): filtered count (n.y) over total count (n.x)
share_data <- left_join(Total_group, Filter_60_group, by = "DEP") %>% mutate(share_dep = n.y / n.x)
Any idea how to combine all these steps into one or two steps?
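One possible way to collapse the three steps, sketched under the assumption that the column is called T_depart as in the code above and that the share is meant as filtered rows over total rows. The sample values and the helper column n_60 are hypothetical:

```r
library(dplyr)

# Hypothetical sample data using the column names from the code above
df <- data.frame(
  DEP      = c("X", "X", "X", "Y", "Y"),
  T_depart = c(30, 60, 90, 10, 75)
)

share_data <- df %>%
  group_by(DEP) %>%
  summarise(
    total     = n(),                 # all rows in the group
    n_60      = sum(T_depart >= 60), # rows that meet the condition
    share_dep = n_60 / total,        # same value as mean(T_depart >= 60)
    .groups = "drop"
  )

print(share_data)
```

Because a logical vector sums as 0/1, no separate filter and join are needed. Unlike the join-based version, a group with no rows at or above 60 gets a share of 0 rather than NA. If T_depart can contain missing values, sum() would need na.rm = TRUE.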

How do you count the number of observations in multiple columns and use mutate to make the counts as new columns in R?

I have a dataset that has multiple lines of survey responses from different years and from different organizations. There are 100 questions in the survey and people can skip them. I am trying to get the average for each question by year and by organization (so grouped by organization and year). I also want to get the count of the number of people behind each average, since people can skip questions. I want these two data points as new columns as well, so it will add 200 columns in total. I figured out how to get the average (see code below), but I can't seem to use the same function to get the count of observations.
This is how I successfully got the average.
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Question'), mean, na.rm = TRUE, .names = "{.col}_average")) %>%
ungroup()
I am now trying to use a similar setup to get the count of observations. I duplicated the columns with the raw data and added Count to their names so that the new average columns are not picked up as columns that R needs to compute the count for.
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'), function(x){sum(!is.na(.))}, .names = "{.col}_ncount")) %>%
ungroup()
The code above does give me the new columns, but the count is the same for all columns and all rows. Any thoughts?
The issue is in the lambda function: it is declared as function(x), but the sum is computed on . instead of x. Inside a %>% pipeline, . by itself can be evaluated as the whole data, which is why every column gets the same count. Use x inside the function:
library(dplyr)
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
function(x){sum(!is.na(x))},
.names = "{.col}_ncount")) %>%
ungroup()
If we want to use . or .x, specify the lambda as a formula with ~:
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
~ sum(!is.na(.)),
.names = "{.col}_ncount")) %>%
ungroup()
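As a hedged alternative that avoids duplicating the raw columns, across() also accepts a named list of functions, so the average and the count can be computed in a single call. This is a sketch on made-up data; the column names Question1 and Question2 and the suffixes average and ncount are assumptions for illustration:

```r
library(dplyr)

# Hypothetical survey data; names and values are for illustration only
df <- data.frame(
  Organization = c("Org1", "Org1", "Org1", "Org2"),
  Year         = c(2020, 2020, 2020, 2020),
  Question1    = c(1, NA, 3, 4),
  Question2    = c(NA, NA, 5, 2)
)

out <- df %>%
  group_by(Organization, Year) %>%
  mutate(across(
    contains("Question"),
    list(
      average = ~ mean(.x, na.rm = TRUE), # mean of the answered values
      ncount  = ~ sum(!is.na(.x))         # number of non-missing answers
    ),
    .names = "{.col}_{.fn}"
  )) %>%
  ungroup()

print(out)
```

The {.fn} placeholder in .names is filled with the names given in the list, so each question column yields one _average and one _ncount column.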

Counting rows that match result of calculation in R

I have a line of code that calculates the maximum value for a number of products
data2019 %>%
group_by(PRODUCT) %>%
summarise(max_amt = max(AMOUNT))
I want to then count the number of rows where AMOUNT == max_amt for that particular product, but if I try to wrap it in a count or sum function it gives me the max value for the whole set, and the total number of rows for each product, which isn't very helpful, especially as the values vary considerably. How can I get it to produce the answer for each specific product?
You can do a count on a condition by writing your summarize as sum(CONDITION), like so:
data2019 %>%
group_by(PRODUCT) %>%
summarize(max_count = sum(AMOUNT == max(AMOUNT)))
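To see why this works per product, here is a small sketch on made-up data (only the names data2019, PRODUCT and AMOUNT come from the question). Inside a grouped summarize, max(AMOUNT) is evaluated within each group, and the comparison yields a logical vector whose TRUE values sum to the count:

```r
library(dplyr)

# Hypothetical data; the values are invented for illustration
data2019 <- data.frame(
  PRODUCT = c("a", "a", "a", "b", "b"),
  AMOUNT  = c(5, 9, 9, 2, 1)
)

res <- data2019 %>%
  group_by(PRODUCT) %>%
  summarize(
    max_amt   = max(AMOUNT),            # maximum within each product
    max_count = sum(AMOUNT == max_amt), # rows that reach that maximum
    .groups = "drop"
  )

print(res)
```

Keeping max_amt alongside max_count makes it easy to check the count against the value it refers to.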

Counting the rows based on two other column values, and manipulate the value in a loop through one of these column values in R

There are three columns: website, Date ("%Y %m"), and click_tracking (T/F). I would like to add a variable giving, for each month, the number of websites with click_tracking = T divided by the number of all websites in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column, TRUE/FALSE) in mutate:
library(dplyr)
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(click_tracking))
If the column is a factor, convert it to logical with as.logical:
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(as.logical(click_tracking)))
If the goal is a summarised output instead:
df %>%
group_by(Date) %>%
summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE), and there is no need to do a comparison on a logical column at all:
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, data = df, FUN = sum)
Because the mean of a logical column is the fraction of TRUE values, this gives the proportion of websites with click tracking (out of all websites) per month:
aggregate(click_tracking ~ Date, data = df, FUN = mean)
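For comparison, a dplyr sketch that returns both counts and the proportion in one table. It assumes click_tracking is logical and that each website appears at most once per month; the sample data and the output column names are hypothetical:

```r
library(dplyr)

# Hypothetical data with the three columns described in the question
df <- data.frame(
  website        = c("w1", "w2", "w3", "w1", "w2"),
  Date           = c("2020 01", "2020 01", "2020 01", "2020 02", "2020 02"),
  click_tracking = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

share_by_month <- df %>%
  group_by(Date) %>%
  summarise(
    n_tracking = sum(click_tracking),  # websites with click tracking
    n_websites = n(),                  # all websites in that month
    share      = mean(click_tracking), # n_tracking / n_websites
    .groups = "drop"
  )

print(share_by_month)
```

No explicit loop over Date is needed, since group_by() applies the summary to each month separately.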

group_by and summarize() multiple things in R using dplyr/tidyverse

I am trying to find the country with the highest average age, but I also need to filter out countries with fewer than 5 entries in the data frame. I tried the following, but it does not work:
bil %>%
group_by(citizenship,age) %>%
mutate(n=count(citizenship), theMean=mean(age,na.rm=T)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
bil is the dataset. I am trying to count how many entries I have for each country, filter out countries with fewer than 5 entries, find the average age for each country, and then find the country with the highest average. I am confused about how to do both things at the same time. If I do one summarize at a time, I lose the rest of my data.
Perhaps this could help. Note that the first parameter 'x' of count is a tbl/data.frame, so it cannot be applied to a column inside mutate. Instead of count, we group by 'citizenship' only (grouping by 'age' as well does not look intended), get the number of rows per group with n(), get the mean of 'age', and then do the filter:
bil %>%
group_by(citizenship) %>%
mutate(n = n()) %>%
mutate(theMean = mean(age, na.rm=TRUE)) %>%
filter(n>=5) %>%
arrange(desc(theMean))
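If only one row per country is needed, a summarise-based variant may be more convenient than mutate, since both statistics can be computed in the same call. This is a sketch on invented data; the name top_country and the use of slice(1) to pick the first row are illustrative choices, not part of the original answer:

```r
library(dplyr)

# Hypothetical data; only the names bil, citizenship and age come from the question
bil <- data.frame(
  citizenship = c(rep("US", 5), rep("DE", 6), rep("FR", 2)),
  age         = c(50, 60, 70, 80, 90, 40, 50, 60, 70, 80, NA, 95, 99)
)

top_country <- bil %>%
  group_by(citizenship) %>%
  summarise(
    n       = n(),                    # entries per country
    theMean = mean(age, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n >= 5) %>%                  # drop countries with fewer than 5 entries
  arrange(desc(theMean)) %>%
  slice(1)                            # keep the country with the highest average

print(top_country)
```

Here FR has the highest mean age but only two entries, so it is removed by the filter before the ranking.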

Resources