R: sum consecutive duplicate odd rows and remove all but first

I am stuck on a question: how to sum consecutive duplicate odd rows and keep only the first row of each run. I have already found how to sum consecutive duplicate rows and remove all but the first row (link: https://stackoverflow.com/a/32588960/11323232). But in this project, I would like to sum only the consecutive duplicate odd rows, not all of the consecutive duplicate rows.
ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
to the desired output:
a
ia time
1 1 6.9
3 2 3.6
5 2 1.2
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
How should I edit the following code to sum consecutive duplicate odd rows and remove all but the first row?
library(dplyr)
library(zoo)   # for na.locf
result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  group_by(group = cumsum(change), ia) %>%
  # this part
  summarise(time = sum(time))

One dplyr possibility could be:
a %>%
  group_by(grp = with(rle(ia), rep(seq_along(lengths), lengths))) %>%
  mutate(grp2 = ia %/% 2 == 0,
         time = sum(time)) %>%
  filter(!grp2 | (grp2 & row_number() == 1)) %>%
  ungroup() %>%
  select(-grp, -grp2)
ia time
<dbl> <dbl>
1 1 6.9
2 2 3.6
3 2 1.2
4 1 20.4
5 2 7.3
6 1 2.3
7 2 4.3
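If ia could contain odd values other than 1, one possible generalization of the odd-row test is the modulo check ia %% 2 == 1. A minimal sketch, otherwise identical to the pipeline above and assuming the same a and dplyr:
a %>%
  group_by(grp = with(rle(ia), rep(seq_along(lengths), lengths))) %>%
  mutate(grp2 = ia %% 2 == 1,   # TRUE for any odd ia, not just 1
         time = sum(time)) %>%
  filter(!grp2 | (grp2 & row_number() == 1)) %>%
  ungroup() %>%
  select(-grp, -grp2)
For the sample data this gives the same seven rows as above.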

You could try the following using data.table:
library(data.table)
ia <- c(1,1,2,NA,2,1,1,1,1,2,1,2)
time <- c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a <- data.table(ia, time)
a[, sum(time), by=.(ia, rleid(!ia %% 2 == 0))]
Gives
## ia rleid V1
##1: 1 1 6.9
##2: 2 2 3.6
##3: NA 3 1.5
##4: 2 4 1.2
##5: 1 5 20.4
##6: 2 6 7.3
##7: 1 7 2.3
##8: 2 8 4.3
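If you also want a named time column and to drop the NA row so the result matches the desired output above, one possible follow-up (a sketch using the same a):
a[, .(time = sum(time)), by = .(ia, grp = rleid(!ia %% 2 == 0))][!is.na(ia), .(ia, time)]
This should give the seven ia/time rows shown in the desired output.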

Related

Randomly sampling 1 ID from each pair in column

Say I have something like the following:
df <- data.frame (ID = c("2330", "2331", "2333", "2334", "2336", "2337", "4430", "4431", "4510", "4511"), length = c(8.4,6,3,9,3,4,1,7,4,2))
> df
ID length
1 2330 8.4
2 2331 6.0
3 2333 3.0
4 2334 9.0
5 2336 3.0
6 2337 4.0
7 4430 1.0
8 4431 7.0
9 4510 4.0
10 4511 2.0
IDs that are in a pair are +/- 1 of each other. (2330, 2331), (2333, 2334), (2336, 2337), (4430, 4431), & (4510, 4511) are the pairs in my example. I would like to randomly sample 1 ID from each pair to get a dataframe that looks like the following...
> df
ID length
1 2330 8.4
2 2334 9.0
3 2336 3.0
4 4430 1.0
5 4510 4.0
How would I accomplish this with base R? Thank you.
We may create a grouping column with gl for every 2 adjacent elements and then use slice_sample with n = 1
library(dplyr)
df %>%
  group_by(grp = as.integer(gl(n(), 2, n()))) %>%
  slice_sample(n = 1) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 5 × 2
ID length
<chr> <dbl>
1 2330 8.4
2 2333 3
3 2337 4
4 4430 1
5 4510 4
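For reference, the grouping index built by gl(n(), 2, n()) simply pairs off adjacent rows; for the 10 rows here it evaluates to:
as.integer(gl(10, 2, 10))
# [1] 1 1 2 2 3 3 4 4 5 5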
Or using base R
do.call(rbind, lapply(split(df, gl(nrow(df), 2, nrow(df)), drop = TRUE),
  function(x) x[sample(nrow(x), 1), ]))
-output
ID length
1 2330 8.4
2 2333 3.0
3 2337 4.0
4 4430 1.0
5 4510 4.0
Or with aggregate in base R
aggregate(. ~ grp, transform(df, grp = cumsum(c(TRUE,
  diff(as.numeric(ID)) != 1))), FUN = sample, 1)[-1]
ID length
1 2331 8.4
2 2334 3
3 2337 3
4 4431 7
5 4510 2
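For reference, the grouping index built from the ID differences pairs the rows like this (using the df defined above):
cumsum(c(TRUE, diff(as.numeric(df$ID)) != 1))
# [1] 1 1 2 2 3 3 4 4 5 5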
Or with tapply
df[with(df, tapply(seq_along(ID), rep(seq_along(ID), each = 2,
  length.out = nrow(df)), FUN = sample, 1)), ]
ID length
1 2330 8.4
4 2334 9.0
5 2336 3.0
7 4430 1.0
10 4511 2.0

R: Split all groups in half (dplyr)

My data is grouped, but I would like to split each group in two, as illustrated in the example below. It doesn't really matter what the content of group_half is; it can be anything like 1a/1b or 1.1/1.2. Any recommendations on how to do this using dplyr? Thanks!
col_1 <- c(23,31,98,76,47,65,23,76,3,47,54,56)
group <- c(1,1,1,1,2,2,2,2,3,3,3,3)
group_half <- c(1.1, 1.1, 1.2, 1.2, 2.1, 2.1, 2.2, 2.2, 3.1, 3.1, 3.2, 3.2)
df1 <- data.frame(col_1, group, group_half)
# col_1 group group_half
# 23 1 1.1
# 31 1 1.1
# 98 1 1.2
# 76 1 1.2
# 47 2 2.1
# 65 2 2.1
# 23 2 2.2
# 76 2 2.2
# 3 3 3.1
# 47 3 3.1
# 54 3 3.2
# 56 3 3.2
Here are two options.
If you always have an even number of rows in each group:
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(group_half = paste(group, rep(1:2, each = n()/2), sep = '.')) %>%
  ungroup
# col_1 group group_half
# <dbl> <dbl> <chr>
# 1 23 1 1.1
# 2 31 1 1.1
# 3 98 1 1.2
# 4 76 1 1.2
# 5 47 2 2.1
# 6 65 2 2.1
# 7 23 2 2.2
# 8 76 2 2.2
# 9 3 3 3.1
#10 47 3 3.1
#11 54 3 3.2
#12 56 3 3.2
This will work irrespective of the number of rows in the group:
df1 %>%
  group_by(group) %>%
  mutate(group_half = paste(group, as.integer(row_number() > n()/2) + 1, sep = '.')) %>%
  ungroup
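For instance, a quick check with an odd-sized (3-row) group on made-up data, assuming dplyr is loaded as above, splits it 1/2:
data.frame(col_1 = c(10, 20, 30), group = 4) %>%
  group_by(group) %>%
  mutate(group_half = paste(group, as.integer(row_number() > n()/2) + 1, sep = '.')) %>%
  ungroup
# col_1 group group_half
#    10     4 4.1
#    20     4 4.2
#    30     4 4.2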

Row-wise iteration in a dataframe where each row depends on the previous row's calculation in R

In a dataframe, I'd like to use the previous row's calculated result to obtain the current row's calculated result, together with other values from the current row. I also need to apply some conditions, and it has to be done per product_id. The key point is that the target column is used in its own calculation. I reproduced a sample in Excel and it looks like this:
product_id <- c(rep(1,each=9), rep(2,each=8))
dates <- c("24/09/20","25/09/20","26/09/20","27/09/20","28/09/20","29/09/20", "30/09/20","01/10/20","02/10/20","08/10/20","09/10/20","10/10/20","11/10/20","12/10/20","13/10/20","14/10/20","15/10/20")
date <- as.Date(dates, "%d/%m/%y")
num_day <- c(1:9, 1:8)
production <- c(rep(4,each=9), rep(3.5,each=8))
demand <- c(0,0,3,1,3,20,0,1,3,0,1,2,5,0,15,1,3)
df <- data.frame (product_id,date,num_day,production,demand)
The target column to be created is stock. The df is sorted by product_id and then by date, so the order of the rows is meaningful.
Conditions (both can be applied with one statement, but I split them to make it clearer):
Condition 1: if (stock previous day + production current day - demand current day <= 0, 0, stock previous day + production current day - demand current day)
Condition 2: if num_day = 1, stock = production current day - demand current day, and it cannot be negative either: if production current day - demand current day < 0, stock = 0
In Excel it's a pretty straightforward formula, but when dealing with a large amount of data (more than 1 million rows) it's not possible to do it there. I'm trying to build a function in R, but maybe it's not the best approach. Is there any way to do it in R?
I tried to use an auxiliary column with a cumulative sum and shifted columns, but it does not work. I think it's more complex than that.
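In other words, the two conditions reduce to the recursion stock[i] = max(stock[i-1] + production[i] - demand[i], 0) per product_id, with stock starting at 0. A plain-loop sketch of that logic, purely for illustration (it relies on num_day restarting at 1 for each product, as in the sample data; the answers below do the same thing more idiomatically):
stock <- numeric(nrow(df))
prev <- 0
for (i in seq_len(nrow(df))) {
  if (df$num_day[i] == 1) prev <- 0   # reset at the start of each product_id
  prev <- max(prev + df$production[i] - df$demand[i], 0)
  stock[i] <- prev
}
df$stock_loop <- stock   # illustrative column name, just for checking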
This can easily be done using purrr's accumulate:
library(dplyr)
library(purrr)
df %>%
  group_by(product_id) %>%
  mutate(stock = accumulate(production - demand, ~ max(.x + .y, 0))) %>%
  ungroup()
Result:
# A tibble: 17 x 6
product_id date num_day production demand stock
<dbl> <date> <int> <dbl> <dbl> <dbl>
1 1 2020-09-24 1 4 0 4
2 1 2020-09-25 2 4 0 8
3 1 2020-09-26 3 4 3 9
4 1 2020-09-27 4 4 1 12
5 1 2020-09-28 5 4 3 13
6 1 2020-09-29 6 4 20 0
7 1 2020-09-30 7 4 0 4
8 1 2020-10-01 8 4 1 7
9 1 2020-10-02 9 4 3 8
10 2 2020-10-08 1 3.5 0 3.5
11 2 2020-10-09 2 3.5 1 6
12 2 2020-10-10 3 3.5 2 7.5
13 2 2020-10-11 4 3.5 5 6
14 2 2020-10-12 5 3.5 0 9.5
15 2 2020-10-13 6 3.5 15 0
16 2 2020-10-14 7 3.5 1 2.5
17 2 2020-10-15 8 3.5 3 3
The result matches yours and #rjen's, so I am relatively sure this is correct.
Explanation: with accumulate, a simple cumulative sum could be implemented as accumulate(production - demand, ~.x + .y) (or even shorter as accumulate(production - demand, `+`)). Using the max function here ensures the result never gets lower than 0, which is what you intended.
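As a tiny illustration of that clamping behaviour on made-up numbers:
library(purrr)
accumulate(c(4, 4, -19, 4), `+`)                 # 4 8 -11 -7  (plain cumulative sum)
accumulate(c(4, 4, -19, 4), ~ max(.x + .y, 0))   # 4 8   0  4  (never drops below 0)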
Until you find a more elegant solution, you can do the following.
library(dplyr)
df %>%
  group_by(product_id) %>%
  mutate(group = if_else(cumsum(production - demand) < 0 |
                           num_day == min(num_day), 1, 0)) %>%
  ungroup() %>%
  mutate(group = if_else(group != 0, row_number(), as.integer(0)),
         group = cumsum(group)) %>%
  group_by(group) %>%
  mutate(modDiff = if_else(num_day == min(num_day), 0, production - demand)) %>%
  ungroup() %>%
  group_by(product_id) %>%
  mutate(modDiff = if_else(num_day == min(num_day), production - demand, modDiff),
         modDiff = if_else(num_day == min(num_day) & modDiff < 0, 0, modDiff)) %>%
  group_by(group) %>%
  mutate(stock = cumsum(modDiff)) %>%
  ungroup() %>%
  select(-modDiff, -group)
# # A tibble: 17 x 6
# product_id date num_day production demand stock
# <dbl> <date> <int> <dbl> <dbl> <dbl>
# 1 1 2020-09-24 1 4 0 4
# 2 1 2020-09-25 2 4 0 8
# 3 1 2020-09-26 3 4 3 9
# 4 1 2020-09-27 4 4 1 12
# 5 1 2020-09-28 5 4 3 13
# 6 1 2020-09-29 6 4 20 0
# 7 1 2020-09-30 7 4 0 4
# 8 1 2020-10-01 8 4 1 7
# 9 1 2020-10-02 9 4 3 8
# 10 2 2020-10-08 1 3.5 0 3.5
# 11 2 2020-10-09 2 3.5 1 6
# 12 2 2020-10-10 3 3.5 2 7.5
# 13 2 2020-10-11 4 3.5 5 6
# 14 2 2020-10-12 5 3.5 0 9.5
# 15 2 2020-10-13 6 3.5 15 0
# 16 2 2020-10-14 7 3.5 1 2.5
# 17 2 2020-10-15 8 3.5 3 3

Select varying number of top_n for different groups using dplyr

I have the following dataframe. I would prefer to use dplyr to solve this problem.
For each zone I want a minimum of two values; values > 4.0 are preferred.
Therefore, for zone 10 all values (all being > 4.0) are kept. For zone 20, the top two values are picked, and similarly for zone 30.
zone <- c(rep(10,4), rep(20, 4), rep(30, 4))
set.seed(1)
value <- c(4.5,4.3,4.6, 5,5, rep(3,7)) + round(rnorm(12, sd = 0.1),1)
df <- data.frame(zone, value)
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 2.9
7 20 3.0
8 20 3.1
9 30 3.1
10 30 3.0
11 30 3.2
12 30 3.0
The desired output is as follows
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 3.1
7 30 3.1
8 30 3.2
I thought of using top_n, but it picks the same number of rows for each zone.
You could dynamically calculate n in top_n:
library(dplyr)
df %>% group_by(zone) %>% top_n(max(sum(value > 4), 2), value)
# zone value
# <dbl> <dbl>
#1 10 4.4
#2 10 4.3
#3 10 4.5
#4 10 5.2
#5 20 5
#6 20 3.1
#7 30 3.1
#8 30 3.2
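To see what n is being used per zone, you can evaluate the expression on its own (assuming dplyr is loaded and df is as above):
df %>% group_by(zone) %>% summarise(n = max(sum(value > 4), 2))
# zone 10 -> 4, zone 20 -> 2, zone 30 -> 2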
You can also do it like this:
library(tidyverse)
df %>%
  group_by(zone) %>%
  filter(row_number(-value) <= 2 | head(value > 4))

R: How to create a new column for 90th quantile based off previous rows in a data frame

data.frame(c = c(1,7,11,4,5,5))
c
1 1
2 7
3 11
4 4
5 5
6 5
Desired dataframe:
c c.90th
1 1 NA
2 7 1
3 11 6.4
4 4 10.2
5 5 9.8
6 5 9.4
For the first row, I want it to look at the previous rows (none) and get the 90th quantile: NA.
For the second row, I want it to look at the previous row (1) and get the 90th quantile: 1.
For the third row, I want it to look at the previous rows (1, 7) and get the 90th quantile: 6.4.
etc.
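Those expected values are simply R's default (type 7) quantile of the preceding values, e.g.:
quantile(c(1, 7), 0.9)       # 6.4
quantile(c(1, 7, 11), 0.9)   # 10.2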
A solution using data.table that also works by groups:
library(data.table)
dt <- data.table(c = c(1,7,11,4,5,5),
                 group = c(1, 1, 1, 2, 2, 2))

# 90th quantile of all values strictly before the current one
cumquantile <- function(y, prob) {
  sapply(seq_along(y), function(x) quantile(y[0:(x - 1)], prob))
}

dt[, c90 := cumquantile(c, 0.9)]
dt[, c90_by_group := cumquantile(c, 0.9), by = group]
> dt
c group c90 c90_by_group
1: 1 1 NA NA
2: 7 1 1.0 1.0
3: 11 1 6.4 6.4
4: 4 2 10.2 NA
5: 5 2 9.8 4.0
6: 5 2 9.4 4.9
Try:
dff <- data.frame(c = c(1,7,11,4,5,5))
dff$c.90th <- sapply(1:nrow(dff),function(x) quantile(dff$c[0:(x-1)],0.9,names=F))
Output:
c c.90th
1 NA
7 1.0
11 6.4
4 10.2
5 9.8
5 9.4
