Group by function query - R

Hi, I am new to R.
I have attached a screenshot of the df I am working with (https://i.stack.imgur.com/CUz4l.png); here is a short description:
I have a data frame with 7 columns. One is a month column; the remaining 6 columns hold (integer) values and also contain empty (NA) rows.
I need to summarise each of the 6 columns by count, grouped by month.
I tried the following code:
group_by(Month) %>% summarise(count=n(),na.omit())
and got the following error:
Error: Problem with summarise() input ..2.
x argument "object" is missing, with no default
i Input ..2 is na.omit().
i The error occurred in group 1: Month = "1".
Run rlang::last_error() to see where the error occurred.
Can someone please assist?
Head of data: https://i.stack.imgur.com/stfoG.png
> dput(head(Dropoff))
structure(list(Start.Date = c("01-11-2019 06:07", "01-11-2019 06:07",
"01-11-2019 06:08", "01-11-2019 06:08", "02-11-2019 06:08", "02-11-2019 06:07"
), End.Date = c("01-11-2019 06:12", "01-11-2019 09:28", "01-11-2019 10:02",
"01-11-2019 13:05", "02-11-2019 06:13", "02-11-2019 06:16"),
Month = structure(c(3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1",
"2", "11"), class = "factor"), nps = c(9L, 10L, 9L, 8L, 9L,
9L), effort = c(9L, 10L, 9L, 9L, 9L, 8L), knowledge = c(NA,
NA, 5L, NA, NA, 5L), confidence = c(5L, 5L, NA, NA, 5L, NA
), listening = c(NA, NA, NA, 5L, NA, NA), fcr = c(1L, 1L,
1L, 1L, 1L, 1L), fixing.issues = c(NA, NA, NA, NA, NA, NA
)), row.names = c(NA, 6L), class = "data.frame")
I'd like the output to look something like this:
Month   count of nps   count of effort
1       xxx            xxx
2       xxx            xxx
11      6              6
...and so on, with a count for all the variables.
The following
df %>% group_by(Month) %>% summarise(count = n())
produces this output:
https://i.stack.imgur.com/u3nxv.png
but this is not what I am hoping for.

It looks like na.omit() causes the problem here: summarise() treats na.omit() as a second summary expression, and it is then called without the object argument it requires. Given that you want to count the NAs but not have them in any following sum, you might use
df[is.na(df)] = 0
and then
df %>% group_by(Month) %>% summarise(count=n())

Thanks for the clarifications. The semi-manual solution
df %>% group_by(Month) %>% summarize(
  c_nps        = sum(!is.na(nps)),
  c_effort     = sum(!is.na(effort)),
  c_knowledge  = sum(!is.na(knowledge)),
  c_confidence = sum(!is.na(confidence)),
  c_listening  = sum(!is.na(listening)),
  c_fcr        = sum(!is.na(fcr))
)
should do the trick. Since it's only 6 columns to be summarized, I would use the manual specification over an automated implementation (i.e. count non-NA in all other columns).
It results in
# A tibble: 1 x 7
  Month c_nps c_effort c_knowledge c_confidence c_listening c_fcr
  <fct> <int>    <int>       <int>        <int>       <int> <int>
1 11        6        6           2            3           1     6
Cheers and good luck!

From your example I understand that you want to count the non-NA values in every column.
Dropoff %>% group_by(Month) %>%
summarise_at(vars(nps:fixing.issues), list(count=~sum(!is.na(.x))))
summarise_at(): this performs a summarise on every column given in the vars() expression. Here I chose all columns from nps to fixing.issues.
As the summarising function (which describes how the data is summarised), I count all non-NA values. The syntax is to pass the functions as a named list; the ~ does the same as function(x), so a lengthier way to write it would be function(x) sum(!is.na(x)).
The "count" expression works as follows: is.na() checks each element of the column vector (x) for NA, and ! negates this, marking the non-NA values. Since the result is a logical vector of TRUE/FALSE values, you can count the TRUE values with sum().
The expression works for all kinds of column types (text, numbers, ...).
Giving the result:
# A tibble: 1 x 8
  Month nps_count effort_count knowledge_count confidence_count listening_count fcr_count fixing.issues_count
  <fct>     <int>        <int>           <int>            <int>           <int>     <int>               <int>
1 11            6            6               2                3               1         6                   0
If that is not what you are aiming at, please clarify your question.
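Side note (not part of the original answers): in dplyr 1.0+, summarise_at() is superseded by across(). A sketch of the equivalent call, assuming the data frame is named Dropoff as above:
library(dplyr)
# Count non-NA values per column, grouped by Month, using the newer across() syntax
Dropoff %>%
  group_by(Month) %>%
  summarise(across(nps:fixing.issues, ~ sum(!is.na(.x)), .names = "{.col}_count"))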

Related

R aggregate column until one condition is met

I have a dataframe of this form:
ID Var1 Var2
1 1 1
1 2 2
1 3 3
1 4 2
1 5 2
2 1 4
2 2 8
2 3 10
2 4 10
2 5 7
I would like to filter the Var1 values by group for their maximum, on the condition that the maximum value of Var2 has not yet been reached. This will be part of a new dataframe containing only one row per ID, so the outcome should be something like this:
ID Var1
1  2
2  2
So the function should filter the dataframe for the maximum, but only consider the rows before Var2 reaches its maximum. The row containing the maximum itself should not be included, and neither should the rows after it.
I tried building something with a while loop, but it didn't work out. Also, I'd be thankful if the solution doesn't employ data.table.
Thanks in advance
Maybe you could do something like this:
DF <- structure(list(
ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)),
class = "data.frame", row.names = c(NA, -10L))
library(dplyr)
DF %>% group_by(ID) %>%
slice(1:(which.max(Var2)-1)) %>%
slice_max(Var1) %>%
select(ID, Var1)
#> # A tibble: 2 x 2
#> # Groups: ID [2]
#> ID Var1
#> <int> <int>
#> 1 1 2
#> 2 2 2
Created on 2020-08-04 by the reprex package (v0.3.0)
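One edge case worth flagging (my addition, not from the answer above): when the maximum of Var2 sits in a group's first row, 1:(which.max(Var2) - 1) evaluates to 1:0, i.e. c(1, 0), so slice() still keeps row 1. A slightly safer sketch uses seq_len(), which returns an empty vector in that case:
library(dplyr)
DF %>%
  group_by(ID) %>%
  # seq_len(0) is integer(0), so groups whose max is in row 1 drop out entirely
  slice(seq_len(which.max(Var2) - 1)) %>%
  slice_max(Var1) %>%
  select(ID, Var1)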

How to combine transmute with grep function?

I'm trying to find a way to create a new table of variables from an existing dataframe using the rowSums() function. For example, my existing dataframe is called 'asn', and I want to sum up, for each row, the values of all variables which contain "2011" in the variable title. I want a new table consisting of just one column called asn_y2011 which contains each row's sum over the variables containing "2011".
Data
structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L,
0L, 4L), south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L
), north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
The existing 'asn' dataframe looks like this
row south_2010 south_2011 south_2012 north_2010 north_2011 north_2012
1 1 4 5 3 2 1
2 5 0 8 4 6 1
3 7 4 6 1 0 2
I'm trying to use the following function:
asn %>%
transmute(asn_y2011 = rowSums(, grep("2011")))
to get something like this
row asn_y2011
1 6
2 6
3 4
Continuing with your code, grep() should work like this:
library(dplyr)
asn %>%
transmute(row, asn_y2011 = rowSums(.[grep("2011", names(.))]))
# row asn_y2011
# 1 1 6
# 2 2 6
# 3 3 4
Or you can use tidy selection in c_across():
asn %>%
rowwise() %>%
transmute(row, asn_y2011 = sum(c_across(contains("2011")))) %>%
ungroup()
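A vectorized variant (my addition, assuming dplyr 1.0+) combines rowSums() with across(), avoiding the per-row overhead of rowwise():
library(dplyr)
# across() returns the matching columns as a data frame, which rowSums() accepts
asn %>%
  transmute(row, asn_y2011 = rowSums(across(contains("2011"))))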
Another base R option using rowSums
cbind(asn[1],asn_y2011 = rowSums(asn[grep("2011",names(asn))]))
which gives
row asn_y2011
1 1 6
2 2 6
3 3 4
An option in base R with Reduce
cbind(df['row'], asn_y2011 = Reduce(`+`, df[endsWith(names(df), '2011')]))
# row asn_y2011
#1 1 6
#2 2 6
#3 3 4
data
df <- structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L,
0L, 4L), south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L
), north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-3L))
I think that this code will do what you want:
library(magrittr)
tibble::tibble(row = 1:3, south_2011 = c(4, 0, 4), north_2011 = c(2, 6, 0)) %>%
tidyr::gather(- row, key = "key", value = "value") %>%
dplyr::mutate(year = purrr::map_chr(.x = key, .f = function(x)stringr::str_split(x, pattern = "_")[[1]][2])) %>%
dplyr::group_by(row, year) %>%
dplyr::summarise(sum(value))
I first load the package magrittr so that I can use the pipe, %>%. I've explicitly listed the packages from which the functions are exported, but you are welcome to load the packages with library if you like.
I then create a tibble, or data frame, like what you specify.
I use gather to reorganize the data frame before creating a new variable, year. I then summarise the counts by value of row and year.
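Since gather() is superseded in current tidyr, a pivot_longer() version of the same idea might look like this (a sketch, my addition):
library(dplyr)
library(tidyr)
tibble(row = 1:3, south_2011 = c(4, 0, 4), north_2011 = c(2, 6, 0)) %>%
  # split names like "south_2011" into region and year in one step
  pivot_longer(-row, names_to = c("region", "year"), names_sep = "_") %>%
  group_by(row, year) %>%
  summarise(value = sum(value), .groups = "drop")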
You can try this approach
library(tidyverse)
df2 <- df %>%
select(grep("_2011|row", names(df), value = TRUE)) %>%
rowwise() %>%
mutate(asn_y2011 = sum(c_across(south_2011:north_2011))) %>%
select(row, asn_y2011)
# row asn_y2011
# <int> <int>
# 1 1 6
# 2 2 6
# 3 3 4
Data
df <- structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L, 0L, 4L), south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L), north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)), class = "data.frame", row.names = c(NA,-3L))

Create a list that contains a numeric value for each response to a categorical variable

I have a df with a categorical variable which has 2 levels: Feed, Food.
structure(c(2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Feed", "Food"), class = "factor")
I want to create a list with a numeric value matching each level of the categorical variable (i.e. Feed = 0, Food = 1).
The list lines up with the categorical variable to form 2 columns.
It's probably very simple... every time I've attempted to put the two together, both columns have ended up numeric.
Like this?
library(dplyr)
df <- structure(c(2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Feed", "Food"), class = "factor")
df <- tibble("foods" = df)
df %>%
mutate(numeric = as.numeric(foods))
# A tibble: 6 x 2
foods numeric
<fct> <dbl>
1 Food 2
2 Food 2
3 Feed 1
4 Food 2
5 Feed 1
6 Food 2
Or like this, if you want 0/1 as the numbers:
df %>%
mutate(numeric = as.numeric(foods) - 1)
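If the 0/1 coding should not silently depend on the order of the factor levels, an explicit lookup vector is a safer sketch (codes is a hypothetical name, my addition):
library(dplyr)
# Map each level by name so reordering the levels cannot change the coding
codes <- c(Feed = 0, Food = 1)
df %>%
  mutate(numeric = unname(codes[as.character(foods)]))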

How can I keep a tally of the count associated with each type of record? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have some data:
structure(list(date = structure(c(17888, 17888, 17888, 17888,
17889, 17889, 17891, 17891, 17891, 17891, 17891, 17892, 17894
), class = "Date"), type = structure(c(4L, 6L, 15L, 16L, 2L,
5L, 2L, 3L, 5L, 6L, 8L, 2L, 2L), .Label = c("aborted-live-lead",
"conversation-archived", "conversation-auto-archived", "conversation-auto-archived-store-offline-or-busy",
"conversation-claimed", "conversation-created", "conversation-dropped",
"conversation-restarted", "conversation-transfered", "cs-transfer-connected",
"cs-transfer-ended", "cs-transfer-failed", "cs-transfer-initiate",
"cs-transfer-request", "getnotified-requested", "lead-created",
"lead-expired"), class = "factor"), count = c(1L, 1L, 1L, 1L,
3L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L)), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))
It looks like this:
> head(dat)
# A tibble: 6 x 3
date type count
<date> <fct> <int>
1 2018-12-23 conversation-auto-archived-store-offline-or-busy 1
2 2018-12-23 conversation-created 1
3 2018-12-23 getnotified-requested 1
4 2018-12-23 lead-created 1
5 2018-12-24 conversation-archived 3
6 2018-12-24 conversation-claimed 1
For each unique type value, there is an associated count per day.
How can I count all of the values of each type (regardless of the date) and list them in a two-column data frame (in a format like this):
type count
------ ------
conversation-created 10
conversation-archived 4
lead-created 2
...
The reason for this is to show an overall count of each event type over the entire date range.
I presume that I have to use the select() function from dplyr but I am sure I am missing something.
This is what I have so far; it sums every value in the count column, which isn't what I want, as I want it broken down by type:
dat %>%
select(type, count) %>%
summarise(count = sum(count)) %>%
ungroup()
Seems like a combination of group_by and summarize with sum does the job:
dat %>% group_by(type) %>% summarise(count = sum(count))
# A tibble: 8 x 2
# type count
# <fct> <int>
# 1 conversation-archived 7
# 2 conversation-auto-archived 1
# 3 conversation-auto-archived-store-offline-or-busy 1
# 4 conversation-claimed 3
# 5 conversation-created 3
# 6 conversation-restarted 1
# 7 getnotified-requested 1
# 8 lead-created 1
There is no need for select() here, as summarize() drops all the other variables anyway. Perhaps you are confusing select with group_by; group_by is what we want in this case: to sum those values of count where type takes the same value.
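As a side note (my addition), dplyr's count() with a weight column should give the same result in one line:
library(dplyr)
# wt = count sums the existing count column instead of counting rows
dat %>% count(type, wt = count, name = "count")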

How to calculate percentage of missing data in a time series in R dplyr

In the following sample data and script, how can I calculate the % of missing data between the start date strtdt and end date enddt for each ID? What I want to get is: add the missing days with NA between strtdt and enddt separately for each ID, then calculate the % of NA.
I tried the following using dplyr but with no luck. Any suggestion will be highly appreciated.
Note: I can achieve the same by calculating individually for each ID; however, that is not feasible because I have more than 10000 IDs.
The ultimate goal is to get the % of NA between the start date and end date for each ID; if dates are missing completely, then I have to add the missing dates with NA values.
library(dplyr)
df<-structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L
), .Label = c("xx", "xyz", "yy", "zz"), class = "factor"), Date = structure(c(8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 19L, 20L, 21L, 22L, 23L), .Label = c("1989-09-12",
"1989-09-13", "1989-09-14", "1989-09-19", "1989-09-23", "1990-01-12",
"1990-01-13", "1996-09-12", "1996-09-13", "1996-09-16", "1996-09-17",
"1996-09-18", "1996-09-19", "2000-09-12", "2000-09-13", "2000-11-10",
"2000-11-11", "2000-11-12", "2001-09-07", "2001-09-08", "2001-09-09",
"2001-09-10", "2001-09-11"), class = "factor"), val = c(3, 5,
9, 3, 5, 6, 8, 7, 9, 5, 3, 2, 8, 8, 5, 3, 2, 1, 5, 7, NA, NA,
NA, NA)), .Names = c("ID", "Date", "val"), row.names = c(NA,
-24L), class = "data.frame")
df$Date<-as.Date(df$Date,format="%Y-%m-%d")
df
df_mis<-df %>%
group_by(ID)%>%
dplyr::mutate(strtdt=min(Date),
enddt=max(Date))
df_mis
df_mis2<-df_mis %>%
group_by(ID) %>%
dplyr::do( data.frame(., Date1= seq(.$strtdt,.$enddt, by = '1 day')))
df_mis2
I assume from the sequence generation in the question's code that the expected observations are one per day between the first and last observed date per ID. Here's a clunky piece-by-piece calculation to count the % of missing data.
1. Make a data frame of all expected dates for each ID
library(dplyr)
# df as in the question, but coerce Date column
df$Date <- as.Date(df$Date)
# Data frame with date ranges per id
ranges_df <- df %>%
  group_by(ID) %>%
  summarize(min = min(Date), max = max(Date))
# Data frame with IDs and a date for every day expected.
# (The range columns are named min/max above, so reference .$min/.$max here.)
alldays <- ranges_df %>%
  group_by(ID) %>%
  do(data.frame(Date = seq(.$min, .$max, by = '1 day')))
2. JOIN the expected dates table with the observed dates table.
imputed_df <- left_join(alldays, df)
3. Count NAs
imputed_df %>%
  group_by(ID) %>%
  summarize(total = n(),
            missing = sum(is.na(val)),
            percent_missing = missing / total * 100)
result:
# A tibble: 4 x 4
      ID total missing percent_missing
  <fctr> <int>   <int>           <dbl>
1     xx     8       2        25.00000
2    xyz     4       4       100.00000
3     yy    62      57        91.93548
4     zz  4380    4371          99.794
Note that, unlike the approach above, this counts a day as missing only if its row is absent entirely; NA values of val on observed dates are not counted.
Calculate the number of days between the min and max of the dates as an intermediate variable.
Then calculate the number of missing days as the number of days minus the number of observations, and finally the percentage.
df %>%
  group_by(ID) %>%
  mutate(numdays = as.numeric(max(Date) - min(Date)) + 1,
         pctmissing = (numdays - n()) / numdays)
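A more compact alternative (my addition, assuming tidyr is available): complete() can insert the missing dates per ID, after which the NA percentage falls out of a single summarise():
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  # add a row (with val = NA) for every unobserved day in each ID's date range
  complete(Date = seq(min(Date), max(Date), by = "1 day")) %>%
  summarise(total = n(),
            missing = sum(is.na(val)),
            percent_missing = 100 * missing / total)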
