R dataframe with special cumsum - r

I have a dateframe like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp) within the values of maxCumSum and minCumSum. The respective table dt2 should look something like this:
grp t value CumSum
a 1 -1 0
a 2 5 5
a 3 9 8
a 4 -15 0
a 5 6 6
b 1 5 5
b 2 1 6
b 3 7 8
b 4 -11 0
b 5 9 8
Think of CumSum as a water storage with has a certain maximum capacity and the level of which cannot sink below zero.
The normal cumsum does obviously not do the trick since there are no limitations to maximum or minimum. Has anyone a suggestion how to achieve this? In the real dataframe there are of course more than 2 groups and far more than 5 times.
Many thanks!

What you can do is create a function which calculate the cumsum until it reach the max value and start again at the min value like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
library(dplyr)
maxCumSum <- 8
minCumSum <- 0
f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)
df %>%
group_by(grp) %>%
mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups: grp [2]
#> grp t value CumSum
#> <chr> <int> <dbl> <dbl>
#> 1 a 1 -1 0
#> 2 a 2 5 5
#> 3 a 3 9 8
#> 4 a 4 -15 0
#> 5 a 5 6 6
#> 6 b 1 5 5
#> 7 b 2 1 6
#> 8 b 3 7 8
#> 9 b 4 -11 0
#> 10 b 5 9 8
Created on 2022-07-04 by the reprex package (v2.0.1)

Related

Convert list of named vectors of different length to tibble

I have the following list
example <- list(a = c(1, 2, 3),
b = c(2, 3),
c = c(3, 4, 5, 6))
that I'd like to transform into the following tibble
# A tibble: 9 × 2
name value
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 2
5 b 3
6 c 3
7 c 4
8 c 5
9 c 6
I've found multiple StackOverflow questions on this subject like here, here or here, but none is adressing this particular case where the name of the vector is not expected to become a column name.
I managed to achieve the desired result with a good old loop like below, but I'm looking for a faster and more elegant way.
library(dplyr)
example_list <- list(a = c(1, 2, 3),
b = c(2, 3),
c = c(3, 4, 5, 6))
example_tibble <- tibble()
for (i in 1:length(example_list)) {
example_tibble <- example_tibble %>%
bind_rows(as_tibble(example_list[[i]]) %>%
mutate(name = names(example_list)[[i]]))
}
example_tibble <- example_tibble %>%
relocate(name)
Try stack
> stack(example)
values ind
1 1 a
2 2 a
3 3 a
4 2 b
5 3 b
6 3 c
7 4 c
8 5 c
9 6 c
example <- list(a = c(1, 2, 3),
b = c(2, 3),
c = c(3, 4, 5, 6))
library(tidyverse)
enframe(example) %>%
unnest(value)
#> # A tibble: 9 x 2
#> name value
#> <chr> <dbl>
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 2
#> 5 b 3
#> 6 c 3
#> 7 c 4
#> 8 c 5
#> 9 c 6
Created on 2021-11-04 by the reprex package (v2.0.1)

How to find the next occurrence in a data.frame in R?

Assume we have an email dataset with a sender and a recipient in every row. We want to find the next occurrence in the dataset for which the sender and the recipient are interchanged. So if sender==x & recipient==y, we are looking for the next row that has sender==y & recipient==x. Subsequently, we want to calculate the difference between counts for those observations. See the column diff_count for the desired output.
# creating the data.frame
id = 1:10
sender = c(1, 2, 3, 2, 3, 1, 2, 1, 2, 3)
recipient = c(2, 1, 2, 3, 1, 2, 3, 3, 1, 1)
count = c(1, 4, 5, 7, 12, 17, 24, 31, 34, 41)
df <- data.frame(id, sender, recipient, count)
# output should look like this
df$diff_count <- c(3, 13, 2, NA, 19, 17, NA, 10, NA, NA)
If there are no more observations that satisfy the requirement, then we simply fill in NA. Solution should be relatively easy with tidyverse, but I seem not to be able to do it.
Another dplyr-way without a custom function but several self joins:
library(dplyr)
data %>%
left_join(data,
by = c("sender" = "recipient", "recipient" = "sender"),
suffix = c("", ".y")) %>%
filter(id < id.y) %>%
group_by(id) %>%
slice_min(id.y) %>%
ungroup() %>%
mutate(diff_count = count.y - count) %>%
right_join(data) %>%
select(-matches("\\.(y|x)")) %>%
arrange(id)
returns
Joining, by = c("id", "sender", "recipient", "count")
# A tibble: 10 x 5
id sender recipient count diff_count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 1 3
2 2 2 1 4 13
3 3 3 2 5 2
4 4 2 3 7 NA
5 5 3 1 12 19
6 6 1 2 17 17
7 7 2 3 24 NA
8 8 1 3 31 10
9 9 2 1 34 NA
10 10 3 1 41 NA
There should be easier ways, but below is one way using a custom function in tidyverse style:
library(dplyr)
calc_diff <- function(df, send, recp, cnt) {
df %>%
slice_tail(n = nrow(df) - cur_group_rows()) %>%
filter(sender == send, recipient == recp) %>%
slice_head(n = 1) %>%
pull(count) %>%
{ifelse(length(.) == 0, NA, .)} %>%
`-`(., cnt)
}
df %>%
rowwise(id) %>%
mutate(diff_count = calc_diff(df,
send = recipient,
recp = sender,
cnt = count))
#> # A tibble: 10 x 5
#> # Rowwise: id
#> id sender recipient count diff_count
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1 3
#> 2 2 2 1 4 13
#> 3 3 3 2 5 2
#> 4 4 2 3 7 NA
#> 5 5 3 1 12 19
#> 6 6 1 2 17 17
#> 7 7 2 3 24 NA
#> 8 8 1 3 31 10
#> 9 9 2 1 34 NA
#> 10 10 3 1 41 NA
Created on 2021-08-20 by the reprex package (v2.0.1)

Creating a column based on existing column where new column has values plus or minus certain value of old one

I am trying to create a column, where the new column has values plus or minus some fixed number or existing number of old column. For example, my old column is a and new column is b.
data = data.frame(a = 2:11)
new_data = data.frame(a = 2:11, b = c(1, 4, 5, 5, 6, 8, 9, 8, 11, 12))
new_data
#> a b
#> 1 2 1
#> 2 3 4
#> 3 4 5
#> 4 5 5
#> 5 6 6
#> 6 7 8
#> 7 8 9
#> 8 9 8
#> 9 10 11
#> 10 11 12
data$b <- data$a + sample(c(0, -1, +1), nrow(data), replace = T)
so If fixed number is say x do this
x <- 1
data$b <- data$a + sample(c(0, -1*x, x), nrow(data), replace = T)
Edit based on requirements stated in comments below. Use pmin and pmax. seed fixed in order to demonstrate
set.seed(19)
data %>% mutate(b = pmin(11, pmax(2, a + sample(-1:1, nrow(.), T)))) %>% pull(b) %>% cat
2 3 4 6 5 7 7 10 9 11
#otherwise
set.seed(19)
data %>% mutate(b = a + sample(-1:1, nrow(.), T))
a b
1 2 1
2 3 3
3 4 4
4 5 6
5 6 5
6 7 7
7 8 7
8 9 10
9 10 9
10 11 12

Mutate within a for loop

I have a dataframe like this
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("Love_ABC", "Love_CNN", "Hate_ABC", "Hate_CNN", "Love_CNBC", "Hate_CNBC"), row.names = c(NA,
8L), class = "data.frame")
I have made the following for loop
channels = c("ABC", "CNN", "CNBC")
for (channel in channels) {
dataframe <- dataframe %>%
mutate(ALL_channel = Love_channel + Hate_channel)
}
But when i run the for loop R tells me " object Love_channel" not found. Have i done something wrong in the for loop?
Here's a way with rlang. Note, reshaping the data is likely more straightforward. Non-standard evaluation (NSE) is a complicated topic.
for (channel in channels) {
DF <- DF %>%
mutate(!!sym(paste0("ALL_", channel)) := !!sym(paste0("Love_", channel)) + !!sym(paste0("Hate_", channel)))
}
DF
## Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC ALL_ABC ALL_CNN ALL_CNBC
## 1 1 1 6 6 1 2 7 7 3
## 2 3 3 3 2 2 3 6 5 5
## 3 4 4 6 4 4 4 10 8 8
## 4 6 2 5 5 5 2 11 7 7
## 5 3 6 3 3 6 2 6 9 8
## 6 2 7 6 7 7 7 8 14 14
## 7 5 2 5 2 6 5 10 4 11
## 8 1 6 3 6 3 2 4 12 5
This is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dataframe <- dataframe %>%
tibble::rowid_to_column()
dataframe %>%
pivot_longer(-rowid, names_to = c(NA, "channel"), names_sep = "_") %>%
pivot_wider(names_from = channel, names_prefix = "ALL_", values_from = value, values_fn = sum) %>%
right_join(dataframe, by = "rowid") %>%
select(-rowid)
#> # A tibble: 8 x 9
#> ALL_ABC ALL_CNN ALL_CNBC Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 7 7 3 1 1 6 6 1 2
#> 2 6 5 5 3 3 3 2 2 3
#> 3 10 8 8 4 4 6 4 4 4
#> 4 11 7 7 6 2 5 5 5 2
#> 5 6 9 8 3 6 3 3 6 2
#> 6 8 14 14 2 7 6 7 7 7
#> 7 10 4 11 5 2 5 2 6 5
#> 8 4 12 5 1 6 3 6 3 2
The idea is to reshape it to make the sums easier. Then you can join the final result back to the initial dataframe.
start by uniquely identifying each row with a rowid.
reshape with pivot_longer so to have all values neatly in one column. In this step you also separate the names Love/Hate_channel in two and you remove the Love/Hate part (you are interested only on the channel) [that is what the NA does!].
reshape again: this time you want to get one column for each channel. In this step you also sum up what previously was Love and Hate together for each rowid and channel (that's what values_fn=sum does!). Also you add a prefix (names_prefix = "ALL_") to each new column name to have names that respect your expected final result.
with right_join you add the values back to the original dataframe. You have no need for rowid now, so you can remove it.

Ratio of largest value per row in dataframe in R

I have a dataframe somewhat similar to the one below (df). I need to add a new column indicating the ratio of the largest value for each row (= largest value in row divided by sum of all values in the row). The output should look similar to df1.
df <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2))
df1 <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2), 'ratio' = c(0.5, 0.462, 0.625, 0.444, 0.5)
Thank you!
Here is a solution using dplyr:
df %>%
rowwise() %>%
mutate(max_value = max(x,y,z),
sum_values = sum(x,y,z),
ratio = max_value / sum_values) #%>%
#select(-max_value, -sum_values) #uncomment this line if you want to df1 as in your question
# A tibble: 5 x 6
x y z max_value sum_values ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
library(tidyverse)
df %>%
rowwise() %>%
mutate(MAX = max(x,y,z, na.rm = TRUE ),
SUM = sum(x,y,z, na.rm = TRUE),
ratio = MAX / SUM)
# A tibble: 5 x 6
x y z MAX SUM ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
Another option with rowSums and pmax
library(dplyr)
library(purrr)
df %>%
mutate(ratio = reduce(., pmax)/rowSums(.))
# x y z ratio
#1 1 4 5 0.5000000
#2 4 6 3 0.4615385
#3 1 5 2 0.6250000
#4 4 2 3 0.4444444
#5 1 3 2 0.5000000
Or in base R
df$ratio <- do.call(pmax, df)/rowSums(df)
Additional solution
df$ratio <- apply(df, 1, function(x) max(x, na.rm = T) / sum(x, na.rm = T))

Resources