R: Extract nested cumulatives from a data frame

Given a data frame:
period<-c(1,1,1,3,3,3,3)
item<-c("a","b","b","a","b","c","c")
quantity<-c(1,3,2,4,5,3,7)
df<-data.frame(period,item,quantity)
df
period item quantity
1 1 a 1
2 1 b 3
3 1 b 2
4 3 a 4
5 3 b 5
6 3 c 3
7 3 c 7
I want to obtain
period item cumulative
1 a 1
1 b 5
1 c 0
2 a 0
2 b 0
2 c 0
3 a 4
3 b 5
3 c 10
I'm not sure what an efficient way to do this in R would be. The file has approximately 500k records and 10,000 different items.
Thanks!!

You can use complete to create the missing sequence of period and item and for each combination sum the quantity value.
library(dplyr)
library(tidyr)
df %>%
complete(period = min(period):max(period), item) %>%
group_by(period, item) %>%
summarise(quantity = sum(quantity, na.rm = TRUE)) %>%
ungroup
# period item quantity
# <dbl> <chr> <dbl>
#1 1 a 1
#2 1 b 5
#3 1 c 0
#4 2 a 0
#5 2 b 0
#6 2 c 0
#7 3 a 4
#8 3 b 5
#9 3 c 10
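With roughly 500k rows, it may be cheaper to aggregate first and only then fill in the missing period/item combinations, so that complete() works on the already-summarised table. A minimal sketch of that ordering, assuming the same df as in the question; the column name cumulative follows the desired output, and this variant has not been benchmarked:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  period   = c(1, 1, 1, 3, 3, 3, 3),
  item     = c("a", "b", "b", "a", "b", "c", "c"),
  quantity = c(1, 3, 2, 4, 5, 3, 7)
)

res <- df %>%
  # Sum quantity per observed (period, item) pair
  count(period, item, wt = quantity, name = "cumulative") %>%
  # Add the missing combinations and set their totals to 0
  complete(period = full_seq(period, 1), item,
           fill = list(cumulative = 0))

res
```

The result has one row per period/item combination, including period 2, which never occurs in the input.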

Related

R cummax function with NA

data
data=data.frame("person"=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
"score"=c(1,2,1,2,3,1,3,NA,4,2,1,NA,2,NA,3,1,2,4),
"want"=c(1,2,1,2,3,3,3,3,4,2,1,1,2,2,3,3,3,4))
attempt
library(dplyr)
data = data %>%
group_by(person) %>%
mutate(wantTEST = ifelse(score >= 3 | (row_number() >= which.max(score == 3)),
cummax(score), score),
wantTEST = replace(wantTEST, duplicated(wantTEST == 4) & wantTEST == 4, NA))
I am trying to use the cummax function, but only under specific circumstances. I want to keep values as they are (1-2-1-1) until a 3 or 4 appears; from then on the running maximum should apply, so (1-2-1-3-2-1-4) should become (1-2-1-3-3-3-4). If there is an NA value, I want to carry forward the previous value. Thank you.
Here's one way with the tidyverse. You may want to call fill() after group_by() so that values are not carried across persons, but the question leaves that unclear.
library(tidyverse)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(
w = ifelse(cummax(score) > 2, cummax(score), score)
) %>%
ungroup()
# A tibble: 18 x 4
person score want w
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 2 2
3 1 1 1 1
4 1 2 2 2
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 1 3 3 3
9 1 4 4 4
10 2 2 2 2
11 2 1 1 1
12 2 1 1 1
13 2 2 2 2
14 2 2 2 2
15 2 3 3 3
16 2 1 3 3
17 2 2 3 3
18 2 4 4 4
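The remark about calling fill() after group_by() can be sketched as follows. This is a variant of the code above, assuming that an NA should only ever be filled from the same person; the column names running_max and w are illustrative. A group that starts with NA would still yield NA, since there is nothing to carry forward.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
data <- data.frame(
  person = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  score  = c(1, 2, 1, 2, 3, 1, 3, NA, 4, 2, 1, NA, 2, NA, 3, 1, 2, 4),
  want   = c(1, 2, 1, 2, 3, 3, 3, 3, 4, 2, 1, 1, 2, 2, 3, 3, 3, 4)
)

res <- data %>%
  group_by(person) %>%
  fill(score) %>%                    # carry forward within each person only
  mutate(
    running_max = cummax(score),     # running maximum per person
    # use the running maximum once it exceeds 2, otherwise keep the score
    w = ifelse(running_max > 2, running_max, score)
  ) %>%
  ungroup()

all(res$w == res$want)
```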
One way to do this is to first fill the NA values and then, for each row, check whether a score of 3 has already been reached within the group. From the first 3 onward we take the maximum score up to that row; otherwise we return the score unchanged.
library(tidyverse)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(want1 = map_dbl(seq_len(n()), ~if(. >= which.max(score == 3))
max(score[seq_len(.)]) else score[.]))
# person score want want1
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 1 2 2 2
# 3 1 1 1 1
# 4 1 2 2 2
# 5 1 3 3 3
# 6 1 1 3 3
# 7 1 3 3 3
# 8 1 3 3 3
# 9 1 4 4 4
#10 2 2 2 2
#11 2 1 1 1
#12 2 1 1 1
#13 2 2 2 2
#14 2 2 2 2
#15 2 3 3 3
#16 2 1 3 3
#17 2 2 3 3
#18 2 4 4 4
Another way is to use accumulate from purrr. I use if_else_ from hablar for type stability:
library(tidyverse)
library(hablar)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(wt = accumulate(score, ~if_else_(.x > 2, max(.x, .y), .y)))

Identifying duplicate within groups by latest date

I currently have a data frame that looks like this:
ID Value Date
1 1 A 1/1/2018
2 1 B 2/3/1988
3 1 B 6/3/1994
4 2 A 12/6/1999
5 2 B 24/12/1957
6 3 A 9/8/1968
7 3 B 20/9/2016
8 3 C 15/4/1993
9 3 C 9/8/1994
10 4 A 8/8/1988
11 4 C 6/4/2001
Within each ID I would like to identify a row where there is a duplicate Value. The Value that I would like to identify is the duplicate with the most recent Date.
The resulting data frame should look like this:
ID Value Date mostRecentDuplicate
1 1 A 1/1/2018 0
2 1 B 2/3/1988 0
3 1 B 6/3/1994 1
4 2 A 12/6/1999 0
5 2 B 24/12/1957 0
6 3 A 9/8/1968 0
7 3 B 20/9/2016 0
8 3 C 15/4/1993 0
9 3 C 9/8/1994 1
10 4 A 8/8/1988 0
11 4 C 6/4/2001 0
How do I go about doing this?
Using dplyr, we can first convert Date to an actual date, then group by ID and Value and assign 1 to the row that belongs to a group with more than one row and whose row number matches that of the maximum Date.
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%d/%m/%Y")) %>%
group_by(ID, Value) %>%
mutate(mostRecentDuplicate = +(n() > 1 & row_number() == which.max(Date))) %>%
ungroup()
# A tibble: 11 x 4
# ID Value Date mostRecentDuplicate
# <int> <fct> <date> <int>
# 1 1 A 2018-01-01 0
# 2 1 B 1988-03-02 0
# 3 1 B 1994-03-06 1
# 4 2 A 1999-06-12 0
# 5 2 B 1957-12-24 0
# 6 3 A 1968-08-09 0
# 7 3 B 2016-09-20 0
# 8 3 C 1993-04-15 0
# 9 3 C 1994-08-09 1
#10 4 A 1988-08-08 0
#11 4 C 2001-04-06 0
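For readers without dplyr, the same rule can be sketched in base R with ave(). This is an alternative formulation rather than the answer's method; the data frame is rebuilt from the question's table, and the helper names d, grp, n_in_group and is_latest are illustrative.

```r
# Data from the question, with Date still stored as text (day/month/year)
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  Value = c("A", "B", "B", "A", "B", "A", "B", "C", "C", "A", "C"),
  Date  = c("1/1/2018", "2/3/1988", "6/3/1994", "12/6/1999", "24/12/1957",
            "9/8/1968", "20/9/2016", "15/4/1993", "9/8/1994", "8/8/1988",
            "6/4/2001")
)

d   <- as.Date(df$Date, "%d/%m/%Y")
grp <- interaction(df$ID, df$Value, drop = TRUE)   # one level per ID/Value pair

# Number of rows in each row's ID/Value group
n_in_group <- ave(seq_along(d), grp, FUN = length)

# 1 where the row holds the latest date of its group, 0 otherwise
is_latest <- ave(as.numeric(d), grp,
                 FUN = function(x) seq_along(x) == which.max(x)) == 1

df$mostRecentDuplicate <- as.integer(n_in_group > 1 & is_latest)
df
```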

Count number of values which are less than current value

I'd like to count, for each row in the column input, how many values are smaller than the value in the current row (see the desired results below). The difficulty is that the condition depends on the current row's value, unlike the general case where the condition is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect to get are like this. For example, for observations 5 and 6 (with value 2), there are 4 observations with value 1 less than their value 2. Hence count is given value 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data in dplyr, the result I ultimately want is shown below; that is, the condition should be evaluated dynamically within each group.
data <- data.frame(id = c(1,1,2,2,2,3,3,4,4,4,4,4),
input = c(1,1,1,1,2,2,3,5,5,5,5,6),
count=c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4
Here is an option with tidyverse
library(tidyverse)
data %>%
mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add the group by 'id' in the above code
data %>%
group_by(id) %>%
mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4
In base R, we can use sapply and, for each input, count how many values are smaller than it.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr, one way is to use rowwise() and follow the same logic.
library(dplyr)
data %>%
rowwise() %>%
mutate(count = sum(input > data$input))
1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
tt <- cumsum(table(data$input))
v <- setNames(c(0, head(tt, -1)), names(tt))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
Perhaps more efficient with a non-equi join in data.table. Count number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4
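The pairwise comparisons above grow quadratically with the number of rows. As a different technique from those answers, the count of strictly smaller values can also be read off the minimum rank, which only needs a sort. A minimal sketch, assuming input contains no NA values; count_smaller is a hypothetical helper name:

```r
# Grouped sample data from the question
data <- data.frame(
  id    = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4),
  input = c(1, 1, 1, 1, 2, 2, 3, 5, 5, 5, 5, 6)
)

# With ties.method = "min", tied values all get the smallest rank of the tie,
# so rank - 1 is the number of strictly smaller values.
count_smaller <- function(x) rank(x, ties.method = "min") - 1

# Over the whole column
data$count_all <- count_smaller(data$input)

# Within each id
data$count_by_id <- ave(data$input, data$id, FUN = count_smaller)

data
```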

Fill in rows based on condition for grouped data using tidyr

I have the following data frame, in which I am trying to create the 'index2' field conditional on the 'index1' field:
This data represents a succession of behaviours for individual penguins (ID). I am trying to index groups of behaviour (index2) that incorporate all other behaviours between, and including, dives (which have already been indexed into dive bouts as index1). I would appreciate a tidyverse solution grouping by ID.
Reproducible:
df <- data.frame(
  ID = c(rep('A', 9), rep('B', 14)),
  behaviour = c('surface', 'dive', 'dive', 'dive', 'surface', 'commute', 'surface', 'dive', 'dive',
                'dive', 'dive', 'surface', 'dive', 'dive', 'commute', 'commute', 'surface',
                'dive', 'dive', 'surface', 'dive', 'dive', 'surface'),
  index1 = c(0, 1, 1, 1, 0, 0, 0, 1, 1, 2, 2, 0, 3, 3, 0, 0, 0, 3, 3, 0, 3, 3, 0),
  index2 = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0)
)
We could create a function with rle
frle <- function(x) {
rl <- rle(x)
i1 <- cummax(rl$values)
i2 <- c(i1[-1] != i1[-length(i1)], FALSE)
i1[i2] <- 0
as.integer(inverse.rle(within.list(rl, values <- i1)))
}
After grouping by 'ID', apply the function to 'Index1' inside mutate to get the expected column
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Index2New = frle(Index1))
# A tibble: 19 x 5
# Groups: ID [2]
# ID behaviour Index1 Index2 Index2New
# <chr> <chr> <int> <int> <int>
# 1 A surface 0 0 0
# 2 A dive 1 1 1
# 3 A dive 1 1 1
# 4 A dive 1 1 1
# 5 A surface 0 1 1
# 6 A commute 0 1 1
# 7 A surface 0 1 1
# 8 A dive 1 1 1
# 9 A dive 1 1 1
#10 B dive 2 2 2
#11 B dive 2 2 2
#12 B surface 0 0 0
#13 B dive 3 3 3
#14 B dive 3 3 3
#15 B commute 0 3 3
#16 B commute 0 3 3
#17 B surface 0 3 3
#18 B dive 3 3 3
#19 B dive 3 3 3
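To see what the rle-based function does, its steps can be traced on a single vector. The sketch below uses the Index1 values of ID 'B' from the output above; the variable names mirror those inside frle, and out is an illustrative name for the result.

```r
x <- c(2, 2, 0, 3, 3, 0, 0, 0, 3, 3)

rl <- rle(x)
# rl$values:  2 0 3 0 3
# rl$lengths: 2 1 2 3 2

# Running maximum over the runs: zeros inherit the last bout number
i1 <- cummax(rl$values)                     # 2 2 3 3 3

# Flag the run that sits immediately before an increase of the maximum,
# i.e. the gap between two different bouts
i2 <- c(i1[-1] != i1[-length(i1)], FALSE)   # FALSE TRUE FALSE FALSE FALSE

# Reset the flagged runs to 0
i1[i2] <- 0                                 # 2 0 3 3 3

# Expand the runs back to the original length
rl$values <- i1
out <- as.integer(inverse.rle(rl))
out
```

Zeros between two runs of the same bout number are filled in, while the zero run that separates bout 2 from bout 3 stays 0.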

How to copy value of a cell to other rows based on the value of other two columns?

I have a data frame that looks like this:
zz = "Sub Item Answer
1 A 1 NA
2 A 1 0
3 A 2 NA
4 A 2 1
5 B 1 NA
6 B 1 1
7 B 2 NA
8 B 2 0"
Data = read.table(text=zz, header = TRUE)
The desirable result is to have the value under "Answer" (0 or 1) copied to the NA cells of the same Subject and the same Item. For instance, the answer = 0 in row 2 should be copied to the answer cell in row 1, but not other rows. The output should be like this:
zz2 = "Sub Item Answer
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0"
Data2 = read.table(text=zz2, header = TRUE)
How should I do this? I noticed that there are some previous questions that asked how to copy a cell to other cells such as replace NA value with the group value, but it was done based on the value of one column only. Also, this question is slightly different from Replace missing values (NA) with most recent non-NA by group, which aims to copy the most-recent numeric value to NAs.
Thanks for all your answers!
You can use zoo::na.locf.
library(tidyverse)
library(zoo)
Data %>% group_by(Sub, Item) %>% mutate(Answer = na.locf(Answer))
# A tibble: 8 x 3
# Groups: Sub, Item [4]
# Sub Item Answer
# <fct> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
Thanks to @steveb, here is an alternative without having to rely on zoo::na.locf:
Data %>% group_by(Sub, Item) %>% mutate(Answer = Answer[!is.na(Answer)])
library(tidyverse)
Data %>% group_by(Sub, Item) %>% fill(Answer, .direction = "up")
# A tibble: 8 x 3
# Groups: Sub, Item [4]
Sub Item Answer
<fctr> <int> <int>
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0
Though it was not the OP's intention, I thought about situations where a Sub, Item group contains only NA values, or where there are multiple non-NA values in a group.
One way to handle such situations is to take the max (or min) of the group and return NA when that max/min is infinite, which happens when every value in the group is NA.
A solution could be:
library(dplyr)
Data %>% group_by(Sub, Item) %>%
mutate(Answer = ifelse(max(Answer, na.rm=TRUE)== -Inf, NA,
as.integer(max(Answer, na.rm=TRUE))))
#Result
# Sub Item Answer
# <fctr> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
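The two edge cases mentioned above can also be handled without going through max() and its -Inf result. A minimal sketch that takes the first non-NA value of each group instead of the maximum, which is a different rule from the one above; the input below is hypothetical data made up to contain an all-NA group and a group with two non-NA values:

```r
library(dplyr)

# Hypothetical data: group A/2 is all NA, group B/1 has two non-NA values
zz <- "Sub Item Answer
1 A 1 NA
2 A 1 0
3 A 2 NA
4 A 2 NA
5 B 1 1
6 B 1 0"
Data <- read.table(text = zz, header = TRUE)

res <- Data %>%
  group_by(Sub, Item) %>%
  # Indexing an empty vector with [1] gives NA, so all-NA groups stay NA
  mutate(Answer = Answer[!is.na(Answer)][1]) %>%
  ungroup()

res
```

Here group A/1 becomes 0, group A/2 stays NA, and group B/1 takes its first non-NA value, 1.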

Resources