R: dplyr summarize, sum only values of uniques - r

I am having trouble with a pesky command I would like to have for an analysis of a summary, for which I'm using the dplyr package. It's easiest to explain with some example data:
structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George",
"Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L,
1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L,
100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L,
20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year",
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA,
-9L))
Two simple summaries are my goal: first, I'd like to summarize just by Date, with the code seen below. The part that is wrong is the total_balance_sum calculation, in which I want to sum the balance of each person but only one time for each person. So for instance, the result of my command for Date=1 is total_balance_sum=100, but what it should be is 150 (add total_balance of 100 for Jack once to total_balance of Mary of 50 once). This wrong calculation obviously messes up the final pct calc.
example_data %>%
group_by(Date) %>%
summarise(
total_people=n_distinct(Name),
total_loan_exposures=n(),
special_sum=sum(Special_Balance,na.rm=TRUE),
total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
total_pct=special_sum/total_balance_sum
) -> example_summary
In the second summary (below), I group by both date and birth year, and again am calculating total_balance_sum incorrectly.
example_data %>%
group_by(Date,Birth.Year) %>%
summarise(
total_people=n_distinct(Name),
total_loan_exposures=n(),
special_sum=sum(Special_Balance,na.rm=TRUE),
total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
total_pct=special_sum/total_balance_sum
) -> example_summary_birthyear
What is the correct way to achieve my goal? Clearly the n_distinct I'm using is only taking one of the values and not summing it properly across names.
Thanks for your help.

I'm a little unclear on what you may be asking for, but does this do what you'd like?: (just for the first example)
example_data %>%
group_by(Date, Name) %>%
summarise(
total_loan_exposures=n(),
total_SpecialPerson=sum(Special_Balance,na.rm=TRUE),
total_balance_sumPerson=Total_Balance[1])%>%
ungroup() %>%
group_by(Date) %>%
summarise(
total_people=n(),
total_loan_exposures=sum(total_loan_exposures),
special_sum=sum(total_SpecialPerson,na.rm=TRUE),
total_balance_sum=sum(total_balance_sumPerson)) %>%
mutate(total_pct=(special_sum/total_balance_sum))-> example_summary
> example_summary
Source: local data frame [3 x 6]
Date total_people total_loan_exposures special_sum total_balance_sum total_pct
1 1 2 3 80 150 0.53333333
2 2 2 4 32 220 0.14545455
3 3 2 2 101 1700 0.05941176

For the second example (for the first, just remove the Birth.Year):
library(dplyr)
example_data %>% group_by(Date, Birth.Year) %>%
mutate(special_sum = sum(Special_Balance),
total_loan_exposure = n( )) %>%
distinct(Name, Total_Balance) %>%
summarise(Total_balance_sum = sum(Total_Balance),
special_sum = special_sum[1],
total_people = n(),
total_loan_exposure = total_loan_exposure[1],
special_sum/Total_balance_sum)

Related

When to use which vs subset function when calculating mean

I am trying to make a find the mean of a variable (age) by gender in R. Age and gender are each columns in the a dataset and I want to find the mean of age for women and for men. I believe I need to use the which function, but am unsure how to actually do that and was wondering if using the subset function would make more sense.
You're probably looking for aggregate().
aggregate(age ~ sex, dat, mean)
# sex age
# 1 1 24.5
# 2 2 24.0
Data
dat <- structure(list(sex = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), age = c(28L,
19L, 26L, 25L, 22L, 27L, 23L), x = c(0.978226428385824, 0.117487361654639,
0.474997081561014, 0.560332746244967, 0.904031387297437, 0.138710167724639,
0.988891728920862)), class = "data.frame", row.names = c(NA,
-7L))
We can also use dplyr
library(dplyr)
df1 %>%
group_by(sex) %>%
summarise(age = mean(age))

How to aggregate a data frame by columns and rows?

I have the following data set:
Class Total AC Final_Coverage
A 1000 1 55
A 1000 2 66
B 1000 1 77
A 1000 3 88
B 1000 2 99
C 1000 1 11
B 1000 3 12
B 1000 4 13
B 1000 5 22
C 1000 2 33
C 1000 3 44
C 1000 4 55
C 1000 5 102
A 1000 4 105
A 1000 5 109
I would like to get the average of the AC and the Final_Coverage for the first three rows of each class. Then, I want to store the average values along with the class name in a new dataframe. To do that, I did the following:
dataset <- read_csv("/home/ad/Desktop/testt.csv")
classes <- unique(dataset$Class)
new_data <- data.frame(Class = character(0), AC = numeric(0), Coverage = numeric(0))
for(class in classes){
new_data$Class <- class
dataClass <- subset(dataset, Class == class)
tenRows <- dataClass[1:3,]
coverageMean <- mean(tenRows$Final_Coverage)
acMean <- mean(tenRows$AC)
new_data$Coverage <- coverageMean
new_data$AC <- acMean
}
Everything works fine except entering the average value into the new_data frame. I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "Class", value = "A") :
replacement has 1 row, data has 0
Do you know how to solve this?
This should get you the new dataframe by using dplyr.
dataset %>% group_by(Class) %>% slice(1:3) %>% summarise(AC= mean(AC),
Coverage= mean(Final_Coverage))
In your method the error is that you initiated your new dataframe with 0 rows and try to assign a single value to it. This is reflected by the error. You want to replace one row to a dataframe with 0 rows. This would work, though:
new_data <- data.frame(Class = classes, AC = NA, Coverage = NA)
for(class in classes){
new_data$Class <- class
dataClass <- subset(dataset, Class == class)
tenRows <- dataClass[1:3,]
coverageMean <- mean(tenRows$Final_Coverage)
acMean <- mean(tenRows$AC)
new_data$Coverage[classes == class] <- coverageMean
new_data$AC[classes == class] <- acMean
}
You could look into aggregate().
> aggregate(df1[df1$AC <= 3, 3:4], by=list(Class=df1[df1$AC <= 3, 1]), FUN=mean)
Class AC Final_Coverage
1 A 2 69.66667
2 B 2 62.66667
3 C 2 29.33333
DATA
df1 <- structure(list(Class = structure(c(1L, 1L, 2L, 1L, 2L, 3L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Total = c(1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
AC = c(1L, 2L, 1L, 3L, 2L, 1L, 3L, 4L, 5L, 2L, 3L, 4L, 5L,
4L, 5L), Final_Coverage = c(55L, 66L, 77L, 88L, 99L, 11L,
12L, 13L, 22L, 33L, 44L, 55L, 102L, 105L, 109L)), class = "data.frame", row.names = c(NA,
-15L))

How to calculate percentage of mising data in a time series in R dplyr

In the following sample data and script,
How can I calculate the % of missing data between start date strtdt and end date enddt for each ID. What I want to get is: add the missing days with NA between strtdt and enddt separately for each IDs than calculated the % of NA.
I tried following using dplyr but for no luck. Any suggestion will be highly appreciated.
Note: I can achieve same by calculating individually for each ID however that is not possible because I have more than 10000 IDs.
Ultimate goal is to get % of NA between start date and end date for each ID; If the dates are missing completely than i have to add missing date with NA values.
library(dplyr
df<-structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L
), .Label = c("xx", "xyz", "yy", "zz"), class = "factor"), Date = structure(c(8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 19L, 20L, 21L, 22L, 23L), .Label = c("1989-09-12",
"1989-09-13", "1989-09-14", "1989-09-19", "1989-09-23", "1990-01-12",
"1990-01-13", "1996-09-12", "1996-09-13", "1996-09-16", "1996-09-17",
"1996-09-18", "1996-09-19", "2000-09-12", "2000-09-13", "2000-11-10",
"2000-11-11", "2000-11-12", "2001-09-07", "2001-09-08", "2001-09-09",
"2001-09-10", "2001-09-11"), class = "factor"), val = c(3, 5,
9, 3, 5, 6, 8, 7, 9, 5, 3, 2, 8, 8, 5, 3, 2, 1, 5, 7, NA, NA,
NA, NA)), .Names = c("ID", "Date", "val"), row.names = c(NA,
-24L), class = "data.frame")
df$Date<-as.Date(df$Date,format="%Y-%m-%d")
df
df_mis<-df %>%
group_by(ID)%>%
dplyr::mutate(strtdt=min(Date),
enddt=max(Date))
df_mis
df_mis2<-df_mis %>%
group_by(ID) %>%
dplyr::do( data.frame(., Date1= seq(.$strtdt,.$enddt, by = '1 day')))
df_mis2
I assume from the sequence generation in the question's code, that the expected observations are one per day between the first observed date and last observed date per ID. Here's a clunky piece by piece calculation to count the % missing data.
1. Make a data frame of all expected dates for each ID
library(dplyr)
# df as in the question, but coerce Date column
df$Date <- as.Date(df$Date)
# Data frame with date ranges per id
ranges_df <- df %>%
group_by(ID) %>%
summarize(min=min(Date), max=max(Date))
# Data frame with IDs and date for every day expected.
alldays <- ranges_df %>%
group_by(ID) %>%
do(., data.frame(
Date = seq(.$dmin,.$dmax, by = '1 day')
)
)
2. JOIN the expected dates table with the observed dates table.
imputed_df <- left_join(alldays, df)
3. Count NAs
imputed_df %>%
group_by(ID) %>%
summarize(total=n(),
missing=sum(is.na(val)),
percent_missing=missing/total*100
)
result:
# A tibble: 4 x 4
ID total missing percent_missing
<fctr> <int> <int> <dbl>
1 xx 8 2 25.00000
2 xyz 4 4 100.00000
3 yy 62 57 91.93548
4 zz 4380 4371 99.794
Assuming that NAs in the original data should be counted as missing data, this will do so.
Calculate the number of days between the min and max of dates as an intermediate variable.
Then, calculate the number of missing days as number of days - number of observations. Then, calculate percentages.
df %>%
group_by(ID) %>%
mutate(numdays = as.numeric(max(Date) - min(Date)) + 1,
pctmissing = (numdays - n()) / numdays)

Mutate repeats first row value

I have a dataset with taxonomy assignment and I want to extract the genus in a new column.
library(tidyverse)
library(magrittr)
library(stringr)
df <- structure(list(C043 = c(18361L, 59646L, 27575L, 163L, 863L, 3319L,
0L, 6L), C057 = c(20020L, 97610L, 13427L, 1L, 161L, 237L, 2L,
105L), taxonomy = structure(c(3L, 2L, 1L, 6L, 4L, 4L, 5L, 2L), .Label = c("k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri"
), class = "factor")), .Names = c("C043", "C057", "taxonomy"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 8L, 10L), class = "data.frame")
So this is my function (it works)
extract_genus <- function(str){
genus <- str_split(str, pattern = ";")[[1]][6]
genus %<>% str_sub(start = 4) #%>% as.character
return(genus)
}
But when I applied it in mutate (with or without as.character), it repeats first row value in the new column.
df %>% mutate(genus = extract_genus(taxonomy))
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Escherichia
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Escherichia
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Escherichia
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Escherichia
When I use sapply (but I don't want to, I want a solution with dplyr pipeline), it works.
df_group_gen$genus <- sapply(df_group_gen$taxonomy, extract_genus)
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Enterobacter
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Enterobacter
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Pseudomonas
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Klebsiella
Why mutate doesn't compute as we can expect? I find this question but no answer is provided, only a had hoc code.
Thank you :)
You can Vectorize your function to allow mutate to occur on every row:
ex_gen <- Vectorize(extract_genus, vectorize.args='str')
df %>% mutate(genus=ex_gen(taxonomy))
Alternatively, you can use rowwise to mutate each row:
df %>%
rowwise() %>%
mutate(genus = extract_genus(taxonomy))

Spread with duplicate identifiers (using tidyverse and %>%) [duplicate]

This question already has answers here:
Reshaping data in R with "login" "logout" times
(6 answers)
Closed 5 years ago.
My data looks like this:
I am trying to make it look like this:
I would like to do this in tidyverse using %>%-chaining.
df <-
structure(list(id = c(2L, 2L, 4L, 5L, 5L, 5L, 5L), start_end = structure(c(2L,
1L, 2L, 2L, 1L, 2L, 1L), .Label = c("end", "start"), class = "factor"),
date = structure(c(6L, 7L, 3L, 8L, 9L, 10L, 11L), .Label = c("1979-01-03",
"1979-06-21", "1979-07-18", "1989-09-12", "1991-01-04", "1994-05-01",
"1996-11-04", "2005-02-01", "2009-09-17", "2010-10-01", "2012-10-06"
), class = "factor")), .Names = c("id", "start_end", "date"
), row.names = c(3L, 4L, 7L, 8L, 9L, 10L, 11L), class = "data.frame")
What I have tried:
data.table::dcast( df, formula = id ~ start_end, value.var = "date", drop = FALSE ) # does not work because it summarises the data
tidyr::spread( df, start_end, date ) # does not work because of duplicate values
df$id2 <- 1:nrow(df)
tidyr::spread( df, start_end, date ) # does not work because the dataset now has too many rows.
These questions do not answer my question:
Using spread with duplicate identifiers for rows (because they summarise)
R: spread function on data frame with duplicates (because they paste the values together)
Reshaping data in R with "login" "logout" times (because not specifically asking for/answered using tidyverse and chaining)
We can use tidyverse. After grouping by 'start_end', 'id', create a sequence column 'ind' , then spread from 'long' to 'wide' format
library(dplyr)
library(tidyr)
df %>%
group_by(start_end, id) %>%
mutate(ind = row_number()) %>%
spread(start_end, date) %>%
select(start, end)
# id start end
#* <int> <fctr> <fctr>
#1 2 1994-05-01 1996-11-04
#2 4 1979-07-18 NA
#3 5 2005-02-01 2009-09-17
#4 5 2010-10-01 2012-10-06
Or using tidyr_1.0.0
chop(df, date) %>%
spread(start_end, date) %>%
unnest(c(start, end))

Resources