I have a dataframe that looks like this:
And I want to get this, that is, a single row per Group, with a column for the percentage of "A" values in ID_1_Subgroup for each Group, together with the sum of ValueSubgroup for each Group:
Can someone help? I have seen other questions (like this one: Summarizing by group and subgroup) which are similar, but not for R.
We can do
library(dplyr)
df1 %>%
  group_by(Group) %>%
  summarise(PercA = mean(id_1_Subgroup == "A"),
            SumOfValueSubgroup = sum(ValueSubgroup))
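For example, with a small made-up df1 (the original data isn't shown in the question), the mean of the logical comparison gives the share of "A" rows per group:
library(dplyr)
# Hypothetical data, since the real df1 is not shown
df1 <- data.frame(Group = c("G1", "G1", "G1", "G2", "G2"),
                  id_1_Subgroup = c("A", "A", "B", "A", "B"),
                  ValueSubgroup = c(10, 20, 30, 5, 15))
df1 %>%
  group_by(Group) %>%
  summarise(PercA = mean(id_1_Subgroup == "A"),        # 2/3 for G1, 1/2 for G2
            SumOfValueSubgroup = sum(ValueSubgroup))   # 60 for G1, 20 for G2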
I have a dataset that has multiple lines of survey responses from different years and from different organizations. There are 100 questions in the survey and people can skip them. I am trying to get the average for each question by year and by organization (so grouped by organization and year). I also want to get the count of the number of people included in those averages, since people can skip questions. I want these two data points as new columns as well, so it will add 200 columns total. I figured out how to get the average; see the code below. I can't seem to use the same approach to get the count of observations.
This is how I successfully got the average.
df <- df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Question'), mean, na.rm = TRUE, .names = "{.col}_average")) %>%
  ungroup()
I am now trying to use a similar setup to get the count of observations. I duplicated the columns holding the raw data and added 'Count' to their names, so that the new average columns are not included among the columns R needs to compute the count for.
df <- df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Count'), function(x){sum(!is.na(.))}, .names = "{.col}_ncount")) %>%
  ungroup()
The code above does create the new columns, but the count is the same for all columns and all rows. Any thoughts?
The issue is in the anonymous function, i.e. function(x){...}: the sum is taken over . instead of x. Inside function(x), . is not the current column; in the pipe it is evaluated as the whole data, which is why every cell gets the same count.
library(dplyr)
df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Count'),
                function(x){sum(!is.na(x))},
                .names = "{.col}_ncount")) %>%
  ungroup()
If we want to use . or .x, specify the lambda with the ~ shorthand:
df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Count'),
                ~ sum(!is.na(.)),
                .names = "{.col}_ncount")) %>%
  ungroup()
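As a side note, a possible sketch that avoids duplicating the raw columns altogether is to pass a named list of functions to across(), creating the averages and counts in one pass. This assumes we start from the original data and that the raw response columns are the only ones whose names contain 'Question':
library(dplyr)
df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Question'),
                list(average = ~ mean(.x, na.rm = TRUE),   # per-group mean, ignoring skipped answers
                     ncount  = ~ sum(!is.na(.x))),         # number of non-missing responses
                .names = "{.col}_{.fn}")) %>%
  ungroup()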
I'm using two years of NIS data (already combined) to search for a diagnosis code across all of the DX columns. The columns run from I10_DX1 to I10_DX40 (columns #18-57). I want to create a new dataset containing the observations that have this diagnosis code in any of these columns.
I've tried loops and the ICD packages but haven't been able to get it right. The most recent code I tried is as follows:
get_icd_labels(icd3 = c("J80"), year = 2018:2019) %>%
arrange(year, icd_sub) %>%
filter(icd_sub %in% c("J80") %>%
select(year, icd_normcode, label) %>%
knitr::kable(row.names = FALSE)
This is a tidyverse (dplyr) solution. If you don't already have a unique id for each record, I'd start out by adding one.
df <- df %>%
  mutate(my_id = row_number())
Next, I'd gather the diagnosis codes into a table where each record is a single diagnosis.
diagnoses <- df %>%
  select(my_id, 18:57) %>%
  gather("diag_num", "diag_code", 2:ncol(.)) %>%
  filter(!is.na(diag_code))  # No need to keep a bunch of empty rows
Finally, I would join my original df to the diagnoses data frame and filter for the code I want.
df %>%
  inner_join(diagnoses, by = "my_id") %>%
  filter(diag_code == "J80")
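A possible more compact sketch, if your dplyr version has if_any() (dplyr 1.0.4 or later), filters directly on the diagnosis columns without reshaping; %in% is used so NA codes are simply treated as non-matches:
library(dplyr)
df %>%
  filter(if_any(starts_with("I10_DX"), ~ .x %in% "J80"))  # keep rows with J80 in any DX column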
I have a data frame that looks like this, with 365 rows reflecting the calendar year. I am trying to shift the county columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?
Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across()
covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, ~lag(.x)))
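Note also that lag() shifts values down (each row gets the previous row's value). If the goal really is to shift the county columns up by one row, lead() is the function to use, as in this sketch with the same column names:
covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, ~ lead(.x)))  # each row gets the next day's value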
I'm having trouble getting the sum of a column from a filtered dataset. Would someone be able to show me where I am going wrong? This summarize method worked before, but now I get an error. Thank you.
select("STNAME", "CTYNAME", "YEAR", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
save(popSample, file="./datafiles/popSample.rdata" )
load("./datafiles/popSample.rdata")
# We want to see Total Population for all years and all age groups
set1filter <- popSample %>%
filter(AGEGRP == 0) %>%
summarize(set1filter, set1 = sum(TOT_POP))
set1```
Either there is an extra %>% at the end of filter() when creating set1filter, or, if we are using the same chain, remove set1filter from inside summarize():
library(dplyr)
popSample %>%
  filter(AGEGRP == 0) %>%
  summarise(set1 = sum(TOT_POP))
We can't refer to an object inside summarize() that has not been created yet.
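If the intermediate filtered data is wanted as its own object, a two-step sketch along these lines (using the same column names as above) would also work:
library(dplyr)
set1filter <- popSample %>%
  filter(AGEGRP == 0)               # keep only the all-ages rows
set1 <- set1filter %>%
  summarise(set1 = sum(TOT_POP))    # total population across all years
set1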
My dataset looks like this -
dataset = data.frame(Site = c(rep('A', 6), rep('B', 6)),
                     Date = c(rep(c('2019-05-31', '2019-04-30', '2019-03-31'), 4)),
                     Question = c(rep('Q1', 3), rep('Q2', 3)),
                     Score = runif(12, 0.5, 1),
                     Average = runif(12, 0.5, 1))
I'd like to spread the columns in such a way that the first two columns contain Site and Question and the remaining columns are Score_Date and Average_Date.
Here's an example of what the first line of the resulting table would look like
Site Question Score_2019.03.31 Score_2019.04.30 Score_2019.05.31 Average_2019.03.31 Average_2019.04.30 Average_2019.05.31
A Q1 0.9117566 0.8661078 0.5624139 0.7246694 0.8870703 0.6401099
I tried using unite and spread from tidyr but got nowhere close to the result.
Any input would be highly appreciated.
Using tidyr and dplyr from the tidyverse, you could do the following:
library(tidyverse)
dataset %>%
  nest(Score, Average, .key = 'value_col') %>%
  spread(key = Date, value = value_col) %>%
  unnest(`2019-03-31`, `2019-04-30`, `2019-05-31`, .sep = "_")
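With more recent tidyr (1.0.0 or later), the same reshape can be sketched with pivot_wider(), which accepts several value columns at once; the resulting names come out as Score_2019-05-31 and so on rather than with dots:
library(tidyr)
dataset %>%
  pivot_wider(names_from = Date,
              values_from = c(Score, Average),
              names_sep = "_")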