Flip group_by variable to columns, and flip columns to rows dplyr - r

Thank you in advance for your response! I am working in RStudio, trying to create a specific table format that my customer is looking for. Specifically, I would like to show each metric as a row and the group_by variable (in this case, application type) as a column. I'm using group_by to consolidate all my data by application type, and the summarise function to create the new variables.
subs <- data.frame(
App_type = c('A','A','A','B','B','B','C','C','C','C'),
Has_error = c(1,1,1,0,0,1,1,0,1,1),
Has_critical_error = c(1,0,1,0,0,1,0,0,1,1)
)
I'm able to group the submissions together by application type to see total submissions with errors and total with critical errors. Here's what I've done -
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
# A tibble: 3 x 4
App_type total_sub total_error total_critical_error
<fct> <int> <dbl> <dbl>
1 A 3 3 2
2 B 3 1 1
3 C 4 3 2
However, my customer would like to see it this way with application totals.
A B C TOTAL
1 total_sub 3 3 4 10
2 total_error 3 1 3 7
3 total_critical_error 2 1 2 5

We can reshape the summarised data to 'long' format, pivot it back to 'wide' format with one column per application type, and then move the 'name' column into the row names
library(dplyr)
library(tidyr)
library(tibble)
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)) %>%
pivot_longer(cols = -App_type) %>%
pivot_wider(names_from = App_type, values_from = value) %>%
mutate(TOTAL = A + B + C) %>%
column_to_rownames("name")
# A B C TOTAL
#total_sub 3 3 4 10
#total_error 3 1 3 7
#total_critical_error 2 1 2 5
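The TOTAL column above is built from hardcoded column names (A + B + C). As a rough sketch, assuming the set of application types may change over time, the total could instead be computed over whatever columns the pivot produces. Using rowSums(across(-name)) here is a swapped-in alternative rather than part of the answer above, and res is only an illustrative name.

```r
library(dplyr)
library(tidyr)
library(tibble)

subs <- data.frame(
  App_type = c('A','A','A','B','B','B','C','C','C','C'),
  Has_error = c(1,1,1,0,0,1,1,0,1,1),
  Has_critical_error = c(1,0,1,0,0,1,0,0,1,1)
)

res <- subs %>%
  group_by(App_type) %>%
  summarise(
    total_sub = n(),
    total_error = sum(Has_error),
    total_critical_error = sum(Has_critical_error)
  ) %>%
  pivot_longer(cols = -App_type) %>%
  pivot_wider(names_from = App_type, values_from = value) %>%
  # sum every column except the metric label, whatever the app types are
  mutate(TOTAL = rowSums(across(-name))) %>%
  column_to_rownames("name")

res
```

With this form, a new application type in subs should show up as an extra column and be included in TOTAL without editing the code.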
Or another option is transpose from data.table
library(data.table)
data.table::transpose(setDT(out), make.names = 'App_type',
keep.names = 'name')[, TOTAL := A + B + C][]
where out is the OP's summarised output
out <- subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
Or with base R
addmargins(t(cbind(total_sub = as.integer(table(subs$App_type)),
rowsum(subs[-1], subs$App_type))), 2)
# A B C Sum
#total_sub 3 3 4 10
#Has_error 3 1 3 7
#Has_critical_error 2 1 2 5
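The base R one-liner is fairly dense, so here is the same computation unpacked into steps, as a sketch for readability only. The intermediate names (counts, sums, combined, out_base) are made up for illustration.

```r
subs <- data.frame(
  App_type = c('A','A','A','B','B','B','C','C','C','C'),
  Has_error = c(1,1,1,0,0,1,1,0,1,1),
  Has_critical_error = c(1,0,1,0,0,1,0,0,1,1)
)

# number of submissions per application type
counts <- table(subs$App_type)

# column sums of the indicator columns, one row per application type
sums <- rowsum(subs[-1], subs$App_type)

# one row per application type, one column per metric
combined <- cbind(total_sub = as.integer(counts), sums)

# transpose so metrics become rows, then append a "Sum" column
out_base <- addmargins(t(combined), 2)
out_base
```

Note that the row labels here keep the original column names (Has_error, Has_critical_error), as in the output shown above.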

Related

How to summarize in R the number of first occurrences of a character string in a dataframe column?

I am trying to figure out a fast way to calculate the number of "first times" a specified character appears in a dataframe column, by groups. In this example, I am trying to summarize (sum) the number of first times, for each Period, the State of "X" appears, grouped by ID. I am looking for a fast way to process this because it is to be run against a database of several million rows. Maybe there is a good solution using the data.table package?
Immediately below I illustrate what I am trying to achieve, and at the bottom I post the code for the dataframe called testDF.
Code:
testDF <-
data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
We can group by 'ID' first and flag the row where 'X' first appears, then group by 'Period' and summarise the flags
library(dplyr)
testDF %>%
group_by(ID) %>%
mutate(`1stStateX` = row_number() == which(State == "X")[1]) %>%
group_by(Period) %>%
summarise(`1stStateX` = sum(`1stStateX`, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Another option is to slice the first 'X' row after grouping by 'ID', count by 'Period', and use complete to fill in the 'Period' values that have no first occurrence
library(tidyr)
testDF %>%
group_by(ID) %>%
slice(match('X', State)) %>%
ungroup %>%
count(Period, sort = TRUE, name = "1stStateX") %>%
complete(Period = unique(testDF$Period),
fill = list(`1stStateX` = 0))
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Or similar option in data.table
library(data.table)
setDT(testDF)[, `1stStateX` := .I == .I[State == 'X'][1],
ID][, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]
-output
Period 1stStateX
<int> <int>
1: 1 0
2: 2 1
3: 3 1
4: 4 0
5: 5 0
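For comparison, a base R sketch of the same idea, not taken from the answer above. It assumes the rows are already ordered by Period within each ID, so the first 'X' row per ID is the earliest one. The names x_rows, first_rows and first_x_counts are illustrative only.

```r
testDF <- data.frame(
  ID = c(rep(10, 5), rep(50, 5), rep(60, 5)),
  Period = c(1:5, 1:5, 1:5),
  State = c("A","B","X","X","X",
            "A","A","A","A","A",
            "A","X","A","X","B")
)

# keep only the rows where State is "X"
x_rows <- testDF[testDF$State == "X", ]

# first "X" row for each ID (relies on the row order within ID)
first_rows <- x_rows[!duplicated(x_rows$ID), ]

# count first occurrences per Period, keeping Periods with zero counts
first_x_counts <- table(factor(first_rows$Period,
                               levels = sort(unique(testDF$Period))))
first_x_counts
```

Whether this is faster than the data.table version on several million rows would need to be benchmarked on the real data.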

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way on achieving this instead of using rep() on first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can use the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr)
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
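Another possible route, sketched here under the assumption that the data frame has a single row as in the question, is to split the label/count pairs and let tidyr::uncount() do the repetition. This is an alternative technique rather than any of the answers above, and labels, values and result are illustrative names.

```r
library(dplyr)
library(tidyr)

separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')

# one row per label/count pair, then repeat each label `n` times
labels <- separate_on_condition %>%
  select(first) %>%
  separate_rows(first, sep = ",") %>%
  tidyr::extract(first, into = c("first", "n"),
                 regex = "([A-Za-z]+)(\\d+)", convert = TRUE) %>%
  uncount(n)

# one row per value of `second`
values <- separate_on_condition %>%
  select(second) %>%
  separate_rows(second, sep = ",")

# both pieces have six rows here, so they can be bound side by side
result <- bind_cols(labels, values)
result
```

bind_cols() will fail if the repeat counts in first do not add up to the number of values in second, which can serve as a basic sanity check on the input.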

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number()))) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3, fill = "extend")) %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package:
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Because of partial = TRUE, the rows that do not yet have a full window of three observations are summed as well, using however many values are available.
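Since the question mentions groups with two or fewer rows, here is a small self-contained sketch of the partial = TRUE behaviour on such a group. The data frame below is made-up example data (the question does not show the full data), and rollapplyr() is used as the right-aligned shorthand for rollapply(..., align = "right").

```r
library(dplyr)
library(zoo)

# hypothetical data: ID 1 has four rows, ID 2 has only two
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                 VAL = c(2, 1, 3, 4, 5, 6))

res <- df %>%
  group_by(ID) %>%
  # partial = TRUE sums whatever is available when fewer than 3 values exist
  mutate(SUM = rollapplyr(VAL, width = 3, FUN = sum, partial = TRUE)) %>%
  ungroup()

res
```

In this sketch the short group simply gets running sums, so no NA filling step is needed afterwards.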
Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or, since you know the window size beforehand, you could check with row_number() as well
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))

Create a list of all values of a variable grouped by another variable in R

I have a data frame that contains two variables, like this:
df <- data.frame(group=c(1,1,1,2,2,3,3,4),
type=c("a","b","a", "b", "c", "c","b","a"))
> df
group type
1 1 a
2 1 b
3 1 a
4 2 b
5 2 c
6 3 c
7 3 b
8 4 a
I want to produce a table showing for each group the combination of types it has in the data frame as one variable e.g.
group alltypes
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
The output would always list the types in the same order (e.g. groups 2 and 3 get the same result) and there would be no repetition (e.g. group 1 is not "a, b, a").
I tried doing this using dplyr and summarize, but I can't work out how to get it to meet these two conditions - the code I tried was:
> df %>%
+ group_by(group) %>%
+ summarise(
+ alltypes = paste(type, collapse=", ")
+ )
# A tibble: 4 × 2
group alltypes
<dbl> <chr>
1 1 a, b, a
2 2 b, c
3 3 c, b
4 4 a
I also tried turning type into a set of individual counts, but not sure if that's actually useful:
> df %>%
+ group_by(group, type) %>%
+ tally %>%
+ spread(type, n, fill=0)
Source: local data frame [4 x 4]
Groups: group [4]
group a b c
* <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0
2 2 0 1 1
3 3 0 1 1
4 4 1 0 0
Any suggestions would be greatly appreciated.
I think you were very close. You could call the sort and unique functions to make sure your result adheres to your conditions as follows:
df %>% group_by(group) %>%
summarize(type = paste(sort(unique(type)),collapse=", "))
returns:
# A tibble: 4 x 2
group type
<int> <chr>
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
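If dplyr is not available, the same sort(unique()) idea can be sketched in base R with aggregate(). This is only an illustration of the technique, and res_types is a made-up name.

```r
df <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 4),
                 type = c("a", "b", "a", "b", "c", "c", "b", "a"))

# de-duplicate and sort the types within each group, then collapse to one string
res_types <- aggregate(type ~ group, data = df,
                       FUN = function(x) paste(sort(unique(x)), collapse = ", "))
names(res_types)[2] <- "alltypes"
res_types
```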
To expand on Florian's answer this could be extended to generating an ordered list based on values in your data set. An example could be determining the order of dates:
library(lubridate)
library(tidyverse)
# Generate random dates
set.seed(123)
Date = ymd("2018-01-01") + sort(sample(1:200, 10))
A = ymd("2018-01-01") + sort(sample(1:200, 10))
B = ymd("2018-01-01") + sort(sample(1:200, 10))
C = ymd("2018-01-01") + sort(sample(1:200, 10))
# Combine to data set
data = bind_cols(as.data.frame(Date), as.data.frame(A), as.data.frame(B), as.data.frame(C))
# Get order of dates for each row
data %>%
mutate(D = Date) %>%
gather(key = Var, value = D, -Date) %>%
arrange(Date, D) %>%
group_by(Date) %>%
summarize(Ord = paste(Var, collapse=">"))
Somewhat tangential to the original question but hopefully helpful to someone.

sum by group including intermediate groups

I have:
df <- data.frame(group=c(1,1,2,4,4,5), value=c(3,1,5,2,3,6))
aggregate(value ~ group, data = df, FUN = 'sum')
group value
1 1 4
2 2 5
3 4 5
4 5 6
Is there a way to include intermediate groups to return the below? I realise this could be done by creating a data frame with all the desired groups and matching in the results from aggregate(), but I am hoping there is a cleaner way to do this. It would need to be as fast as using aggregate and only use base R packages, due to restrictions in my workplace.
group value
1 1 4
2 2 5
3 3 0
4 4 5
5 5 6
You can try this.
library(tidyr)
library(dplyr)
df %>%
mutate(group = factor(group, 1:5)) %>%
complete(group) %>%
group_by(group) %>%
dplyr::summarise(value = sum(value, na.rm = TRUE))
group value
<fctr> <dbl>
1 1 4
2 2 5
3 3 0
4 4 5
5 5 6
You can do this easily with the tidyverse:
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
summarise(valuesum = sum(value)) %>%
full_join(., expand(df, group = 1:5)) %>%
complete(group, fill = list(valuesum = 0))
The result:
# A tibble: 5 x 2
group valuesum
<dbl> <dbl>
1 1 4
2 2 5
3 3 0
4 4 5
5 5 6
Or a bit more difficult to understand with data.table:
library(data.table)
setDT(df)[.(group = 1:5), on = 'group', sum(value, na.rm = TRUE), by = .EACHI]
You can use merge from base R. I've changed the name of your data.frame to dat, since df is the name of an R function.
dat <- read.table(text = "
group value
1 4
2 5
4 5
5 6
", header = TRUE)
str(dat)
res <- aggregate(value ~ group, data = dat, FUN = 'sum')
merge(res, data.frame(group = seq(from = min(res$group), to = max(res$group))), all = TRUE)
Note that there will be a NA, not a zero. I believe that you should solve that by leaving it as a missing value.
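If a zero is preferred over NA and only base R is allowed, one possible sketch is to turn group into a factor with all the desired levels and sum with xtabs(), which reports empty levels as 0. This is an alternative to the merge approach above, and all_groups, totals and res_full are illustrative names.

```r
df <- data.frame(group = c(1, 1, 2, 4, 4, 5),
                 value = c(3, 1, 5, 2, 3, 6))

# every group between the smallest and largest observed value
all_groups <- seq(min(df$group), max(df$group))

# sum `value` per group; levels with no rows come out as 0
totals <- xtabs(value ~ factor(group, levels = all_groups), data = df)

res_full <- data.frame(group = all_groups, value = as.vector(totals))
res_full
```

Whether a zero or a missing value is the right answer for an absent group depends on what the sums are used for.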

Resources