How to summarize data by several groups (R)

I have a dataframe that looks like this:
And I want to get this: a single row per Group, with one column for the % of As among all the ID_1_Subgroup values in that Group, and another for the sum of ValueSubgroup in that Group.
Can someone help? I have seen other questions (like this one: Summarizing by group and subgroup) which are similar, but not for R.

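We can do this with dplyr:
library(dplyr)
df1 %>%
  group_by(Group) %>%
  # mean() of a logical vector is the proportion of TRUE values;
  # multiply by 100 if a percentage is wanted.
  # R is case-sensitive: the column name must match the data (ID_1_Subgroup in the question).
  summarise(PercA = mean(ID_1_Subgroup == "A"),
            SumOfValueSubgroup = sum(ValueSubgroup))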
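Since the original data is not shown, here is a small sketch on invented data to make the result concrete. The column names follow the question text (assuming the subgroup column is spelled ID_1_Subgroup); the values are made up for illustration.

```r
library(dplyr)

# Toy data: column names follow the question, values are invented
df1 <- data.frame(
  Group = c("G1", "G1", "G1", "G2", "G2"),
  ID_1_Subgroup = c("A", "B", "A", "B", "B"),
  ValueSubgroup = c(10, 20, 30, 5, 15)
)

out <- df1 %>%
  group_by(Group) %>%
  summarise(
    # share of rows whose subgroup is "A", expressed as a percentage
    PercA = 100 * mean(ID_1_Subgroup == "A"),
    SumOfValueSubgroup = sum(ValueSubgroup)
  )

out
```

This should give one row per Group, which is the shape the question asks for.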

Related

How do you count the number of observations in multiple columns and use mutate to make the counts as new columns in R?

I have a dataset with multiple rows of survey responses from different years and different organizations. There are 100 questions in the survey, and people can skip them. I am trying to get the average for each question by organization and year (so grouped by Organization and Year). Because people can skip questions, I also want the count of responses that went into each average. I want both values as new columns, which adds 200 columns in total. I figured out how to get the average (see the code below), but I can't seem to use the same approach to get the count of observations.
This is how I successfully got the average:
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Question'), mean, na.rm = TRUE, .names = "{.col}_average")) %>%
ungroup()
I am now trying to use a similar setup to get the count of observations. I duplicated the raw-data columns and added Count to their names, so that the new average columns are not picked up as columns to count:
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'), function(x){sum(!is.na(.))}, .names = "{.col}_ncount")) %>%
ungroup()
The code above does create the new columns, but the count is the same in every column and every row. Any thoughts?
The issue is in the lambda function: it is declared as function(x), but the sum is taken over . instead of x. Inside a %>% chain, . refers to the whole data frame being piped in, so every column gets the same count. Use x in the body:
library(dplyr)
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
function(x){sum(!is.na(x))},
.names = "{.col}_ncount")) %>%
ungroup()
If we want to use . or .x instead, write the lambda with the ~ shorthand:
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
~ sum(!is.na(.)),
.names = "{.col}_ncount")) %>%
ungroup()
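As a possible variation (not part of the answer above), across() also accepts a named list of functions, so the average and the count could be computed in one step without duplicating the raw columns. The sketch below uses invented data; the column names Question1 and Question2 are assumptions for illustration.

```r
library(dplyr)

# Toy survey data: names and values are invented for illustration
df <- data.frame(
  Organization = c("X", "X", "X", "Y", "Y"),
  Year = c(2020, 2020, 2020, 2020, 2020),
  Question1 = c(1, NA, 3, 4, 5),
  Question2 = c(NA, NA, 2, 4, NA)
)

out <- df %>%
  group_by(Organization, Year) %>%
  mutate(across(starts_with("Question"),
                list(average = ~ mean(.x, na.rm = TRUE),  # mean of answered items
                     ncount  = ~ sum(!is.na(.x))),        # number of answered items
                .names = "{.col}_{.fn}")) %>%
  ungroup()

out
```

The {.fn} placeholder in .names picks up the list names, so each question gets a _average and a _ncount column computed per Organization and Year.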

Counting occurrence of diagnosis code across multiple columns in large R dataset

I'm using two years of NIS data (already combined) to search for a diagnosis code across all of the DX columns. The columns run from I10_DX1 to I10_DX40 (columns #18-57). I want to create a new dataset containing the observations that have this diagnosis code in any of these columns.
I've tried loops and the ICD packages but haven't been able to get it right. Most recently I tried the following code:
get_icd_labels(icd3 = c("J80"), year = 2018:2019) %>%
arrange(year, icd_sub) %>%
filter(icd_sub %in% c("J80")) %>%
select(year, icd_normcode, label) %>%
knitr::kable(row.names = FALSE)
This is a tidyverse (dplyr) solution. If you don't already have a unique id for each record, I'd start out by adding one.
df <-
df %>%
mutate(my_id = row_number())
Next, I'd gather the diagnosis codes into a table where each record is a single diagnosis.
diagnoses <-
df %>%
select(my_id, 18:57) %>%
gather("diag_num","diag_code",2:ncol(.)) %>%
filter(!is.na(diag_code)) #No need to keep a bunch of empty rows
Finally, I would join my original df to the diagnoses data frame and filter for the code I want.
df %>%
inner_join(diagnoses, by = "my_id") %>%
filter(diag_code == "J80")
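A different technique from the reshape-and-join approach above: with dplyr 1.0.4 or later, if_any() can test all the diagnosis columns directly inside filter(). This is only a sketch on invented data; the KEY column and the three-column layout are assumptions standing in for the real 40 DX columns.

```r
library(dplyr)

# Toy stand-in for the NIS data: KEY and the codes are invented
df <- data.frame(
  KEY = 1:4,
  I10_DX1 = c("J80", "A419", "I10", "E119"),
  I10_DX2 = c(NA, "J80", "E785", NA),
  I10_DX3 = c(NA, NA, NA, "J80"),
  stringsAsFactors = FALSE
)

# Keep rows where any I10_DX column equals "J80".
# %in% returns FALSE (not NA) for missing values, so empty slots are ignored.
j80_rows <- df %>%
  filter(if_any(starts_with("I10_DX"), ~ .x %in% "J80"))

j80_rows
```

This keeps one row per matching record and avoids building the long intermediate table, which may matter on a large dataset.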

Lead and lag issue using dplyr

I have a data frame with 365 rows, one for each day of the calendar year, that looks like this. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA:
covid_shift <- covid_pivot %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?
Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
ungroup() %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across() to shift every column except date:
covid_pivot %>%
ungroup() %>%
mutate(across(-date, ~lag(.x)))
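The effect of the grouping can be reproduced on a tiny example. The data below is invented; only the column names are borrowed from the question. Note also that lag() moves values down by one row, while lead() moves them up, which may be closer to what the question describes.

```r
library(dplyr)

# Toy version of covid_pivot, grouped by date so that each group has one row
covid_pivot <- data.frame(
  date = as.Date("2020-01-01") + 0:2,
  Maricopa = c(1, 2, 3),
  Cook = c(4, 5, 6)
) %>%
  group_by(date)

# Within one-row groups there is no previous row, so every value becomes NA
grouped <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa))

# After ungroup(), lag() sees the whole column
ungrouped <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lag(Maricopa))

# lead() shifts in the opposite direction (values move up one row)
shifted_up <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lead(Maricopa))
```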

How might I summarize the sum of all columns in a filtered dataset using dplyr?

I'm having trouble getting the sum of a column from a filtered dataset. Would someone be able to show me where I am going wrong? This summarize method worked before, but now I get an error. Thank you,
select("STNAME", "CTYNAME", "YEAR", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
save(popSample, file="./datafiles/popSample.rdata" )
load("./datafiles/popSample.rdata")
# We want to see Total Population for all years and all age groups
set1filter <- popSample %>%
filter(AGEGRP == 0) %>%
summarize(set1filter, set1 = sum(TOT_POP))
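set1
The chain that creates set1filter pipes filter() straight into summarize(), and summarize() is then given set1filter as an argument. Either drop the extra %>% after filter() so that set1filter is created first and summarize() is called as a separate statement, or, if we keep a single chain, remove set1filter from the summarize() call: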
library(dplyr)
popSample %>%
filter(AGEGRP == 0) %>%
summarise(set1 = sum(TOT_POP))
We can't refer to set1filter inside summarize() in the same chain that creates it, because the object does not exist until the assignment has completed.
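The alternative, ending the chain after filter() and calling summarize() on the saved object, might look like the sketch below. The popSample values are invented for illustration; only the column names come from the question.

```r
library(dplyr)

# Toy stand-in for popSample: values are invented
popSample <- data.frame(
  AGEGRP = c(0, 0, 1, 2),
  TOT_POP = c(100, 250, 40, 60)
)

# Step 1: create the filtered object (the chain ends here)
set1filter <- popSample %>%
  filter(AGEGRP == 0)

# Step 2: summarize the object, which now exists
set1 <- summarize(set1filter, set1 = sum(TOT_POP))
set1
```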

R spread across multiple value columns

My dataset looks like this -
dataset = data.frame(
  Site = c(rep('A', 6), rep('B', 6)),
  Date = c(rep(c('2019-05-31', '2019-04-30', '2019-03-31'), 4)),
  Question = c(rep('Q1', 3), rep('Q2', 3)),
  Score = runif(12, 0.5, 1),
  Average = runif(12, 0.5, 1)
)
I'd like to spread the columns so that the first two columns contain the Site and Question, and the remaining columns hold Score_Date and Average_Date.
Here's an example of what the first line of the resulting table would look like:
Site Question Score_2019.03.31 Score_2019.04.30 Score_2019.05.31 Average_2019.03.31 Average_2019.04.30 Average_2019.05.31
A Q1 0.9117566 0.8661078 0.5624139 0.7246694 0.8870703 0.6401099
I tried using unite and spread from tidyr but got nowhere close to the result.
Any input would be highly appreciated.
Using tidyr and dplyr from the tidyverse, you could do the following:
library(tidyverse)
dataset %>%
nest(Score, Average, .key = 'value_col') %>%
spread(key = Date, value = value_col) %>%
unnest(`2019-03-31`, `2019-04-30`, `2019-05-31`, .sep = "_")
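With tidyr 1.0.0 or later, pivot_wider() is another way to get this shape, since it accepts several value columns at once. This is a different technique from the nest/spread/unnest answer above, offered only as a sketch; set.seed() is added here so the random values are repeatable.

```r
library(tidyr)

set.seed(1)  # assumption: added so the example is reproducible
dataset <- data.frame(
  Site = c(rep("A", 6), rep("B", 6)),
  Date = rep(c("2019-05-31", "2019-04-30", "2019-03-31"), 4),
  Question = rep(c(rep("Q1", 3), rep("Q2", 3)), 2),
  Score = runif(12, 0.5, 1),
  Average = runif(12, 0.5, 1)
)

# One row per Site/Question; one column per value column and Date,
# named <value>_<Date>, e.g. Score_2019-03-31
wide <- pivot_wider(
  dataset,
  names_from = Date,
  values_from = c(Score, Average)
)

wide
```

The resulting names use the dates as written (with hyphens), so they need backticks when referenced in code, for example wide$`Score_2019-03-31`.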

Resources