grouping_by_condition_and_mutate in R - r

I need your help.
The general idea is to create one new column:
Group the data by a column (DEP), then count the total number of rows per group using the column (id).
Then filter the data on another column (dely), keeping only dely >= 60, and count the ids again.
Then calculate the share as
(number of filtered rows) / (total number calculated at the beginning).
total = count(id by group)
share = count(dely >= 60) / total
I was able to do it in 3 steps, but I wanted to know whether there is a faster way to do it.
#group the data by DEP
Total_group <- df %>%
group_by(DEP) %>%
summarise(n = n())
#filter the data to T_depart >= 60
Filter_60 <- df %>% filter(T_depart >= 60)
#then group the filtered data by DEP as I did for the total
Filter_60_group <- Filter_60 %>%
group_by(DEP) %>%
summarise(n = n())
#then calculate the share (share_dep); n.x is the total, n.y the filtered count
share_data <- left_join(Total_group, Filter_60_group, by = "DEP") %>% mutate(share_dep = n.y / n.x)
Any idea how to combine all these steps into one or two?
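One possible way to collapse the three steps into a single summarise() call, as a minimal sketch: a logical vector such as T_depart >= 60 sums to the number of rows that meet the condition, so no separate filter or join should be needed. The toy data frame below is made up for illustration; only the column names DEP and T_depart come from the question.

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the question
df <- data.frame(
  DEP      = c("A", "A", "A", "A", "B", "B", "B"),
  T_depart = c(10, 60, 75, 30, 90, 5, 20)
)

share_data <- df %>%
  group_by(DEP) %>%
  summarise(
    total     = n(),                  # all rows in the group
    n_60      = sum(T_depart >= 60),  # rows with a delay of at least 60
    share_dep = n_60 / total          # equivalent to mean(T_depart >= 60)
  )

print(share_data)
```

Unlike the join-based version, a group with no rows at or above 60 gets a share of 0 here rather than NA. If T_depart can contain missing values, sum() would need na.rm = TRUE, and the denominator would need a decision about whether those rows count.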

Related

Count occurences of elements and sum up its value in R

I have a data frame with 2 columns (Company and Amount). There are multiple companies (string) and a value in another column as numeric.
I want to count the occurrences of each company and sum the amount for each company.
So in the end I want to have a new data frame with 3 columns (Company, Occurrences, Total Amount).
I could count all the occurrences and save them in a new data frame, but I have no idea how to get the amount.
library(dplyr)
df_companies %>% count(df_companies$Geldgeber) %>% top_n(3, n) %>% arrange(desc(n))
You can use n() to get the count and sum() to get the total:
library(dplyr)
df_companies %>%
group_by(Geldgeber) %>%
summarize(ct = n(), total = sum(Betrag))
You will want to consider how to account for missing values in Betrag, if any. For example, you could first filter(!is.na(Betrag)), or you can pass na.rm = TRUE in the call to sum().
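To illustrate the difference between the two options, here is a small sketch with made-up data (only the column names Geldgeber and Betrag come from the question). With na.rm = TRUE the row with a missing amount still counts as an occurrence; filtering first would drop it from the count as well.

```r
library(dplyr)

# Hypothetical data; one amount for company "X" is missing
df_companies <- data.frame(
  Geldgeber = c("X", "X", "Y", "Y", "Y"),
  Betrag    = c(100, NA, 50, 25, 25)
)

result <- df_companies %>%
  group_by(Geldgeber) %>%
  summarize(
    ct    = n(),                       # counts every row, including the NA amount
    total = sum(Betrag, na.rm = TRUE)  # NA amounts are skipped in the sum
  )

# Filtering first removes the NA row from both the count and the sum
result_filtered <- df_companies %>%
  filter(!is.na(Betrag)) %>%
  group_by(Geldgeber) %>%
  summarize(ct = n(), total = sum(Betrag))
```

Which variant is appropriate depends on whether a row without an amount should still count as an occurrence of the company.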

How do you count the number of observations in multiple columns and use mutate to make the counts as new columns in R?

I have a dataset that has multiple lines of survey responses from different years and from different organizations. There are 100 questions in the survey and people can skip them. I am trying to get the average for each question by year and by organization (so grouped by organization and year). I also want to get the count of the number of people in those averages, since people can skip questions. I want these two data points as new columns as well, so it will add 200 columns in total. I figured out how to get the average; see the code below. I can't seem to use the same function to get the count of observations.
This is how I successfully got the average.
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Question'), mean, na.rm = TRUE, .names = "{.col}_average")) %>%
ungroup()
I am now trying to use a similar setup to get the count of observations. I duplicated the columns with the raw data and added Count to their names so that the new average columns are not picked up as columns that R needs to compute the count for.
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'), function(x){sum(!is.na(.))}, .names = "{.col}_ncount")) %>%
ungroup()
The code above does give me the new columns, but the count is the same for all columns and all rows. Any thoughts?
The issue is in the lambda function: it is declared as function(x), but the sum is taken over . instead of x. Inside a %>% pipeline, . on its own refers to the whole piped data frame, so every column gets the same count.
library(dplyr)
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
function(x){sum(!is.na(x))},
.names = "{.col}_ncount")) %>%
ungroup()
If we want to use . or .x instead, write the lambda with the ~ (formula) shorthand:
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
~ sum(!is.na(.)),
.names = "{.col}_ncount")) %>%
ungroup()

Counting occurrence of diagnosis code across multiple columns in large R dataset

I'm using two years of NIS data (already combined) to search for a diagnosis code across all of the DX columns. The columns run from I10_DX1 to I10_DX40 (which are columns #18-57). I want to create a new dataset containing the observations that have this diagnosis code in any of these columns.
I've tried loops and the ICD packages but haven't been able to get it right. Most recently I tried the following code:
get_icd_labels(icd3 = c("J80"), year = 2018:2019) %>%
arrange(year, icd_sub) %>%
filter(icd_sub %in% c("J80")) %>%
select(year, icd_normcode, label) %>%
knitr::kable(row.names = FALSE)
This is a tidyverse (dplyr) solution. If you don't already have a unique id for each record, I'd start out by adding one.
df <-
df %>%
mutate(my_id = row_number())
Next, I'd gather the diagnosis codes into a table where each record is a single diagnosis.
diagnoses <-
df %>%
select(my_id, 18:57) %>%
gather("diag_num","diag_code",2:ncol(.)) %>%
filter(!is.na(diag_code)) #No need to keep a bunch of empty rows
Finally, I would join my original df to the diagnoses data frame and filter for the code I want.
df %>%
inner_join(diagnoses, by = "my_id") %>%
filter(diag_code == "J80")
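As an alternative sketch, assuming dplyr 1.0.4 or later, if_any() can test all the diagnosis columns in one filter() without reshaping. The toy data below is made up and uses only three DX columns; the I10_DX prefix is taken from the question, and starts_with() is used here in place of the positional range 18:57.

```r
library(dplyr)

# Hypothetical toy data with three diagnosis columns instead of forty
df <- data.frame(
  id      = 1:4,
  I10_DX1 = c("J80", "A41", "I10", NA),
  I10_DX2 = c(NA, "J80", "E11", NA),
  I10_DX3 = c(NA, "J80", NA, NA),
  stringsAsFactors = FALSE
)

# Keep rows where any I10_DX column equals "J80";
# %in% returns FALSE (not NA) for missing codes
with_j80 <- df %>%
  filter(if_any(starts_with("I10_DX"), ~ .x %in% "J80"))

print(with_j80)
```

Each matching record appears once here even if the code occurs in several columns, whereas the join-based approach returns one row per matching diagnosis.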

How do I compare group means to individual observations and make a new TRUE/FALSE column?

I am new to R and this is my first post on SO - so please bear with me.
I am trying to identify outliers in my dataset. I have two data.frames:
(1 - original data set, 192 rows): observations and their value (AvgConc)
(2 - created with dplyr, 24 rows): Group averages from the original data set, along with quantiles, minimum, and maximum values
I want to create a new column within the original data set that gives TRUE/FALSE based on whether (AvgConc) is greater than the maximum or less than the minimum I have calculated in the second data.frame. How do I go about doing this?
Failed attempt:
Outliers <- Original.Data %>%
group_by(Status, Stim, Treatment) %>%
mutate(Outlier = Original.Data$AvgConc > Quantiles.Data$Maximum | Original.Data$AvgConc < Quantiles.Data$Minimum) %>%
as.data.frame()
Error: Column Outlier must be length 8 (the group size) or one, not 192
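Here, we need to drop the Original.Data$ and Quantiles.Data$ prefixes: inside a grouped mutate() they pull in the full columns (192 and 24 values) instead of the current group's 8 rows. Joining 'Quantiles.Data' to 'Original.Data' by 'Status', 'Stim' and 'Treatment' puts Maximum and Minimum on each row: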
library(dplyr)
Original.Data %>%
inner_join(Quantiles.Data %>%
select(Status, Stim, Treatment, Maximum, Minimum),
by = c("Status", "Stim", "Treatment")) %>%
group_by(Status, Stim, Treatment) %>%
mutate(Outlier = (AvgConc > Maximum) |(AvgConc < Minimum)) %>%
as.data.frame()

What does n=n( ) mean in R?

The other day I was reading the following lines in R and I don't understand what %>%, summarise(n=n()) and summarise(total=n()) mean. I understand the group_by and ungroup methods though.
Can someone help out? There isn't any documentation for this either.
library(dplyr)
net.multiplicity <- group_by(net, nodeid, epoch) %>% summarise(n=n()) %>%
ungroup() %>% group_by(n) %>% summarise(total=n())
This is from the dplyr package. n=n() means that a column named n will be assigned the number of rows (think number of observations) in each group of the summarized data.
The %>% is read as "and then" and is a way of listing your functions sequentially rather than nesting them. So that command says: do the grouping, and then summarize each group by its number of rows, and then ungroup that result, and then group the ungrouped data by n, and then summarize that by the number of rows in each of the new groups.
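A small worked example may make the two summarise() steps easier to follow. The net data frame below is made up; only the column names nodeid and epoch come from the question. The sketch uses .groups = "drop" (available in dplyr 1.0.0 and later) in place of the explicit ungroup().

```r
library(dplyr)

# Hypothetical data: seven rows spread over four (nodeid, epoch) groups
net <- data.frame(
  nodeid = c(1, 1, 1, 2, 2, 3, 3),
  epoch  = c(1, 1, 2, 1, 1, 1, 1)
)

# Step 1: n = number of rows in each (nodeid, epoch) group
per_group <- net %>%
  group_by(nodeid, epoch) %>%
  summarise(n = n(), .groups = "drop")

# Step 2: total = number of groups that share the same n
net.multiplicity <- per_group %>%
  group_by(n) %>%
  summarise(total = n())

print(per_group)
print(net.multiplicity)
```

In this toy data one group has a single row and three groups have two rows, so the second step counts how many groups exist for each group size.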
