rollsumr with window length > 1: filling missing values

My data frame looks something like the first two columns of the output below.
I want to add a third column equal to the sum of the last three VAL observations within each ID group.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to fill the NAs that result in the first two rows of each group, so that the output looks like this:
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number()))) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3, fill = "extend")) %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?

Alternatively, you can use rollapply() from the same package:
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Because partial = TRUE, windows that contain fewer than three observations (the first two rows of each group) are summed as well instead of returning NA.
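To see what a partial window amounts to, here is a minimal base R sketch that needs no packages. The helper name partial_rollsum and the second group (ID 2, with only two rows) are made up for illustration; the idea is a cumulative sum minus the cumulative sum as it stood k rows earlier.

```r
# Hypothetical helper: right-aligned rolling sum whose window shrinks
# at the start of the vector instead of returning NA.
partial_rollsum <- function(x, k) {
  cs <- cumsum(x)
  # Cumulative sum as it stood k positions earlier (0 before the start).
  lagged <- c(rep(0, k), cs)[seq_along(cs)]
  cs - lagged
}

# Assumed example data: group 2 is shorter than the window.
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2), VAL = c(2, 1, 3, 4, 5, 6))

# ave() applies the helper within each ID group and keeps the row order.
df$SUM <- ave(df$VAL, df$ID, FUN = function(v) partial_rollsum(v, 3))
df
```

Because the window is simply truncated, groups with fewer than three rows need no special handling in this sketch.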

Not a direct answer, but one way would be to replace the NA values with the cumulative sum of VAL:
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or, since you know the window size beforehand, you could check with row_number() instead:
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))

How to summarize in R the number of first occurrences of a character string in a dataframe column?

I am trying to figure out a fast way to calculate the number of "first times" a specified character appears in a dataframe column, by groups. In this example, I am trying to summarize (sum) the number of first times, for each Period, the State of "X" appears, grouped by ID. I am looking for a fast way to process this because it is to be run against a database of several million rows. Maybe there is a good solution using the data.table package?
Immediately below I illustrate what I am trying to achieve, and at the bottom I post the code for the dataframe called testDF.
Code:
testDF <-
data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
We can group by 'ID' first to flag the first row where State is "X", then group by 'Period' and summarise the flags:
library(dplyr)
testDF %>%
group_by(ID) %>%
mutate(`1stStateX` = row_number() == which(State == "X")[1]) %>%
group_by(Period) %>%
summarise(`1stStateX` = sum(`1stStateX`, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Another option is to slice the first "X" row after grouping by 'ID', count the rows per 'Period', and use complete to fill in the 'Period' values that are missing with 0:
library(tidyr)
testDF %>%
group_by(ID) %>%
slice(match('X', State)) %>%
ungroup %>%
count(Period, sort = TRUE ,name = "1stStateX") %>%
complete(Period = unique(testDF$Period),
fill = list(`1stStateX` = 0))
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Or similar option in data.table
library(data.table)
setDT(testDF)[, `1stStateX` := .I == .I[State == 'X'][1],
ID][, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]
-output
Period 1stStateX
<int> <int>
1: 1 0
2: 2 1
3: 3 1
4: 4 0
5: 5 0
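For comparison, the same count can be sketched in base R without any packages. This assumes "first" means the smallest Period at which an ID shows "X"; the names x_rows, first_x, and n_first_x are only illustrative.

```r
testDF <- data.frame(
  ID = c(rep(10, 5), rep(50, 5), rep(60, 5)),
  Period = c(1:5, 1:5, 1:5),
  State = c("A", "B", "X", "X", "X",
            "A", "A", "A", "A", "A",
            "A", "X", "A", "X", "B")
)

# Keep only the rows where State is "X".
x_rows <- testDF[testDF$State == "X", ]

# Earliest Period with an "X" for each ID that has one.
first_x <- tapply(x_rows$Period, x_rows$ID, min)

# Tabulate against all periods so periods without a first "X" count as 0.
all_periods <- sort(unique(testDF$Period))
counts <- table(factor(first_x, levels = all_periods))
result <- data.frame(Period = all_periods, n_first_x = as.integer(counts))
result
```

IDs that never reach "X" (ID 50 here) drop out at the filtering step, so they contribute nothing to any period.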

Creating counts of subset with dplyr

I'm trying to summarize a data set with not only total counts per group, but also counts of subsets. So starting with something like this:
df <- data.frame(
Group=c('A','A','B','B','B'),
Size=c('Large','Large','Large','Small','Small')
)
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n())
I can get a summary of the number of observations for each group:
> df_summary
# A tibble: 2 x 2
Group group_n
<chr> <int>
1 A 2
2 B 3
Is there any way I can add some sort of subsetting information to n() to get, say, a count of how many observations per group were Large in this example? In other words, ending up with something like:
Group group_n Large_n
1 A 2 2
2 B 3 1
Thank you!
We could use count:
count(xyz) is the same as group_by(xyz) %>% summarise(n = n())
library(dplyr)
df %>%
count(Group, Size)
Group Size n
1 A Large 2
2 B Large 1
3 B Small 2
OR
library(dplyr)
library(tidyr)
df %>%
count(Group, Size) %>%
pivot_wider(names_from = Size, values_from = n)
Group Large Small
<chr> <int> <int>
1 A 2 NA
2 B 1 2
I approach this problem using an ifelse and a sum:
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n(),
Large_n = sum(ifelse(Size == "Large", 1, 0)))
The last line turns Size into a binary indicator taking the value 1 if Size == "Large" and 0 otherwise. Summing this indicator is equivalent to counting the number of rows with "Large".
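Since R treats TRUE as 1 and FALSE as 0 when summing, the ifelse() step can be dropped; inside summarize() that would read Large_n = sum(Size == "Large"). The sketch below shows the same idea in base R only, with tapply() standing in for group_by() and summarize(); treat it as an illustration rather than a replacement for the dplyr code.

```r
df <- data.frame(
  Group = c("A", "A", "B", "B", "B"),
  Size = c("Large", "Large", "Large", "Small", "Small")
)

# Rows per group.
group_n <- tapply(df$Size, df$Group, length)

# Summing a logical vector counts its TRUE values.
large_n <- tapply(df$Size == "Large", df$Group, sum)

df_summary <- data.frame(
  Group = names(group_n),
  group_n = as.integer(group_n),
  Large_n = as.integer(large_n)
)
df_summary
```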
df_summary <- df %>%
group_by(Group) %>%
mutate(group_n=n())%>%
ungroup() %>%
group_by(Group,Size) %>%
mutate(Large_n=n()) %>%
ungroup() %>%
distinct(Group, .keep_all = TRUE)
# A tibble: 2 x 4
Group Size group_n Large_n
<chr> <chr> <int> <int>
1 A Large 2 2
2 B Large 3 1

Organizing a data frame with multiple entries per sample

I have the following database with several entries per individual:
record_id<-c(21,21,21,15,15,15,2,2,2,2,3,3,3)
var<-c(0,0,0,1,0,0,1,1,0,0,1,1,0)
data<-data.frame(cbind(record_id,var))
I want to create a new data frame with just one row per record_id. If any of an individual's rows has data$var == 1, the outcome data frame must show 1 for that record_id; otherwise it shows 0.
So, the outcome would be like this:
record_id<-c(21,15,2,3)
var<-c(0,1,1,1)
data_sol<-data.frame(cbind(record_id,var))
I have tried this:
DF1 <- data %>%
group_by(record_id) %>%
mutate(class = ifelse(var==1,1,0)) %>%
ungroup
I know it's not the best way; I was planning to extract the unique values afterwards, but it did not do the trick.
If your 'var' is all zeroes or ones, you can also use max():
data %>%
group_by(record_id) %>%
summarise(new_var = max(var))
# A tibble: 4 x 2
record_id new_var
<dbl> <dbl>
1 2 1
2 3 1
3 15 1
4 21 0
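If 'var' might hold values other than 0 and 1, testing for the value 1 explicitly with any() is a little more direct than max(). A minimal base R sketch of that variant follows; the name has_one is only illustrative, and like summarise() it returns the groups sorted by record_id rather than in their original order.

```r
record_id <- c(21, 21, 21, 15, 15, 15, 2, 2, 2, 2, 3, 3, 3)
var <- c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0)
data <- data.frame(record_id, var)

# TRUE for every record_id that has at least one row with var == 1.
has_one <- tapply(data$var == 1, data$record_id, any)

data_sol <- data.frame(
  record_id = as.numeric(names(has_one)),
  var = as.integer(has_one)
)
data_sol
```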
You can use mean() inside mutate() to detect whether any non-zero value exists within a group:
data %>%
group_by(record_id) %>%
mutate(var = ifelse(mean(var)!=0,1,0)) %>%
distinct(record_id,var)
gives,
# A tibble: 4 x 2
# Groups: record_id [4]
# record_id var
# <dbl> <dbl>
# 1 21 0
# 2 15 1
# 3 2 1
# 4 3 1
We can do
library(dplyr)
data %>%
group_by(record_id) %>%
summarise(var = +(mean(var) != 0))
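Or using slice_max (with_ties = FALSE keeps a single row per group even when the maximum is tied)
data %>%
group_by(record_id) %>%
slice_max(order_by = var, n = 1, with_ties = FALSE)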

Mark row before count starts again

shift = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count =c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift,count)
So I am trying to mark the last row of every shift (that is, the rows with count = c(8,10,7)) with a binary 1 and every other row with 0. Right now I am thinking that might be possible with a left join, but I am not quite sure. I would prefer not to work with loops but rather use some techniques from dplyr. Thanks guys!
Assuming that you want to add a new 0/1 column last that contains a 1 in the last row of each shift and that the shifts are contiguous, here are two base R approaches:
transform(test, last = ave(count, shift, FUN = function(x) x == max(x)))
transform(test, last = +!duplicated(shift, fromLast = TRUE))
or with dplyr use mutate:
test %>%
as.data.frame %>%
group_by(shift) %>%
mutate(last = +(1:n() == n())) %>%
ungroup
test %>%
as.data.frame %>%
mutate(last = +!duplicated(shift, fromLast = TRUE))
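Taking the title literally, the row before the count starts again can also be found from count alone. This is a sketch under the assumption that count strictly increases within a shift and drops (or stays at 1) when a new shift begins; it does not look at the shift column at all.

```r
shift <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count <- c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift, count)

# diff() compares each row with the next one; a non-positive step means
# the count has restarted, so the current row is the last of its shift.
# The final row is always a last row, hence the trailing TRUE.
last <- as.integer(c(diff(test[, "count"]) <= 0, TRUE))

result <- cbind(test, last)
result[result[, "last"] == 1, ]
```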
Try this one
library(dplyr)
test %>%
as_tibble() %>%
group_by(shift) %>%
mutate(is_last = ifelse( row_number() == max(row_number()), 1, 0)) %>%
ungroup()
# A tibble: 25 x 3
shift count is_last
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 2 1 0
10 2 2 0
# … with 15 more rows

Flip group_by variable to columns, and flip columns to rows dplyr

Thank you in advance for your response! I am working in RStudio, trying to create a specific table format that my customer is looking for. Specifically, I would like to show each metric as a row and the group_by variable, in this case application type, as a column. I'm using group_by to consolidate all my data by application type, and I'm using the summarise function to create the new variables.
subs <- data.frame(
App_type = c('A','A','A','B','B','B','C','C','C','C'),
Has_error = c(1,1,1,0,0,1,1,0,1,1),
Has_critical_error = c(1,0,1,0,0,1,0,0,1,1)
)
I'm able to group the submissions together by application type to see total submissions with errors and total with critical errors. Here's what I've done -
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
# A tibble: 3 x 4
App_type total_sub total_error total_critical_error
<fct> <int> <dbl> <dbl>
1 A 3 3 2
2 B 3 1 1
3 C 4 3 2
However, my customer would like to see it this way with application totals.
A B C TOTAL
1 total_sub 3 3 4 10
2 total_error 3 1 3 7
3 total_critical_error 2 1 2 5
We can reshape to 'long' format, pivot back to 'wide' with one column per App_type, and then turn the 'name' column into row names:
library(dplyr)
library(tidyr)
library(tibble)
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)) %>%
pivot_longer(cols = -App_type) %>%
pivot_wider(names_from = App_type, values_from = value) %>%
mutate(TOTAL = A + B + C) %>%
column_to_rownames("name")
# A B C TOTAL
#total_sub 3 3 4 10
#total_error 3 1 3 7
#total_critical_error 2 1 2 5
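Note that TOTAL = A + B + C hard-codes the application types. If the set of types can change, a row sum over whatever columns exist avoids that. Here is a minimal base R sketch of the idea; the names by_app and per_app are only illustrative.

```r
subs <- data.frame(
  App_type = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  Has_error = c(1, 1, 1, 0, 0, 1, 1, 0, 1, 1),
  Has_critical_error = c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1)
)

# One data frame per application type.
by_app <- split(subs, subs$App_type)

# sapply() returns a matrix: one row per metric, one column per type.
per_app <- sapply(by_app, function(d) {
  c(total_sub = nrow(d),
    total_error = sum(d$Has_error),
    total_critical_error = sum(d$Has_critical_error))
})

# rowSums() totals across however many application types there are.
result <- cbind(per_app, TOTAL = rowSums(per_app))
result
```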
Or another option is transpose from data.table
library(data.table)
data.table::transpose(setDT(out), make.names = 'App_type',
keep.names = 'name')[, TOTAL := A + B + C][]
where out is the OP's summarised output
out <- subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
Or with base R
addmargins(t(cbind(total_sub = as.integer(table(subs$App_type)),
rowsum(subs[-1], subs$App_type))), 2)
# A B C Sum
#total_sub 3 3 4 10
#Has_error 3 1 3 7
#Has_critical_error 2 1 2 5
