cumsum with a condition to restart in R

I have a dataset with multiple columns. I want to use cumsum() on one column, conditioning the sum on another column: when event X happens, I want the sum to restart from zero, but I also want the row where X occurs to be included in the running sum. Here is an example to make this precise.
inv ass port G cumsum(G)
i x 2 1 1
i x 2 0 1
i x 0 1 2
i x 3 0 0
i x 3 1 1
So the condition port == 0 occurs in the 3rd row. I want cumsum(G) to still include the value of G on the 3rd row and to restart the count from the following row.
I'm using dplyr with group_by(investor, asset), but I'm stuck. This is what I have:
res1 <- res %>%
group_by(investor, asset) %>%
mutate(posdays = ifelse(operation < 0 & portfolio == 0, 0, cumsum(G))) %>%
ungroup()
This restarts the cumsum() but excludes the value of the 3rd row from the sum.
I think I need something that says "cumsum(G), but when condition X holds in a row, restart the sum on the following row".
Can you help me?

You may use cumsum() to create the groups as well: lagging the condition by one row makes a new group start on the row after port == 0.
library(dplyr)
df <- df %>%
group_by(group = cumsum(dplyr::lag(port == 0, default = 0))) %>%
mutate(cumsum_G = cumsum(G)) %>%
ungroup()
df
# inv ass port G group cumsum_G
# <chr> <chr> <int> <int> <dbl> <int>
#1 i x 2 1 0 1
#2 i x 2 0 0 1
#3 i x 0 1 0 2
#4 i x 3 0 1 0
#5 i x 3 1 1 1
You may remove the group column from output using %>% select(-group).
data
df <- structure(list(inv = c("i", "i", "i", "i", "i"), ass = c("x",
"x", "x", "x", "x"), port = c(2L, 2L, 0L, 3L, 3L), G = c(1L,
0L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L))
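
The question also groups by investor and asset, which the snippet above does not do. A minimal base R sketch of the same lag-then-cumsum idea applied within each inv/ass pair could look like this. The second investor "j" and the name run_id are hypothetical, added only for illustration.

```r
# Data from the answer above, plus a hypothetical second investor "j"
# (added only to show that the running sum does not leak across groups).
df <- data.frame(
  inv  = c("i", "i", "i", "i", "i", "j", "j"),
  ass  = c("x", "x", "x", "x", "x", "x", "x"),
  port = c(2L, 2L, 0L, 3L, 3L, 0L, 1L),
  G    = c(1L, 0L, 1L, 0L, 1L, 1L, 1L)
)

# Within each inv/ass pair, shift the condition down one row and take its
# cumulative sum, so a new run starts on the row AFTER port == 0.
run_id <- ave(as.integer(df$port == 0), df$inv, df$ass,
              FUN = function(z) cumsum(c(0L, head(z, -1))))

# Running sum of G inside each (inv, ass, run) block.
df$cumsum_G <- ave(df$G, df$inv, df$ass, run_id, FUN = cumsum)
df$cumsum_G
# 1 1 2 0 1 1 1
```

Because the lag is computed per inv/ass pair, a port == 0 on the last row of one pair does not affect the first row of the next pair.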

Related

R subsetting by unique observation and prioritizing a value [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
I have a coding problem regarding subsetting my dataset. I would like to subset my data under the following conditions: (1) keep one observation per ID, and (2) if "event" = 1 occurs at any time for an ID, keep a row with "event" = 1, while still not losing any IDs.
An example dataset looks like this:
ID event
A 1
A 1
A 0
A 1
B 0
B 0
B 0
C 0
C 1
Desired output
A 1
B 0
C 1
I imagine this would be done using dplyr, df %>% group_by(ID), but I'm unsure how to prioritize selecting any row that contains event = 1 without losing the IDs where event = 0 only. I do not want to lose any of the IDs.
Any help would be appreciated - thank you very much.
We may use
aggregate(event ~ ID, df1, max)
ID event
1 A 1
2 B 0
3 C 1
Or with dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
slice_max(event, n = 1, with_ties = FALSE) %>%
ungroup()
# A tibble: 3 × 2
ID event
<chr> <int>
1 A 1
2 B 0
3 C 1
data
df1 <- structure(list(ID = c("A", "A", "A", "A", "B", "B", "B", "C",
"C"), event = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-9L))
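
If the real data has further columns that should travel with the selected row, aggregate() with max is not enough, since it only returns the maximum itself. One possible base R sketch, assuming event is coded 0/1 as in the example, sorts each ID so that event = 1 comes first and then keeps the first row per ID. The names ord and res are illustrative.

```r
df1 <- data.frame(
  ID    = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  event = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L)
)

# Sort by ID, and within each ID put the largest event value first.
ord <- df1[order(df1$ID, -df1$event), ]

# Keep the first row of every ID; IDs with only event = 0 are retained.
res <- ord[!duplicated(ord$ID), ]
rownames(res) <- NULL
res
```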

Replacing positive table values with positive values from a list of uneven vectors

Background
I recently asked this question. I however made the example slightly too simple, so I am adding some complexity here, where the vector length is no longer equal to the table length.
Problem
I have a table as follows:
tableA <- structure(c(1L, 0L, 0L, 0L, 4L, 6L, 0L, 6L, 1L, 3L, 0L, 0L, 0L, 0L, 1L), dim = c(3L,
5L), dimnames = structure(list(c("X", "Y",
"Z"), c("A", "B", "C","D", "E")), names = c("", "")), class = "table")
A B C D E
X 1 0 0 3 0 (two positive numbers)
Y 0 4 6 0 0 (two positive numbers)
Z 0 6 1 0 1 (three positive numbers)
And a list of vectors as follows:
listB <- list(
"X" = c(0, 4),
"Y" = c(4, 5),
"Z" = c(7, 1, 0))
The vectors are not of equal length, but the number of values in each vector equals the number of positive values in the corresponding row of the table.
I would like to replace all values in tableA, of columns B,C, and D, that are bigger than zero, with the corresponding values of listB.
Desired output:
A B C D E
X 1 0 0 4 0
Y 0 4 5 0 0
Z 0 7 1 0 1
Previous answer
The original answer by sindri_baldur suggested the following:
cols2replace = match(c('B', 'C', 'D'), colnames(tableA))
cells2replace = tableA[, cols2replace] > 0
tableB = matrix(unlist(listB), nrow = 3, byrow = TRUE)
tableA[, cols2replace][cells2replace] = tableB[, cols2replace][cells2replace]
In my actual data, however, the vectors in the list have unequal lengths. Therefore the line
tableB = matrix(unlist(listB), nrow = 3, byrow = TRUE)
does not work.
Suggestion
I am wondering whether I could simply take all positive values from tableA, replace them with all positive values of listB (so that values which are 0 in listB are not replaced), and then put them back in the table. I started as follows:
# Get all positive values from the table
library(dplyr)
library(tidyr)
library(stringr)
out <- tableA %>%
pivot_longer(cols = -rn) %>%
filter(str_detect(value, '\\b0\\b', negate = TRUE)) %>%
group_by(rn) %>%
summarise(freq = list(value), .groups = 'drop')
But the code does not work on a table.
Assuming the number of values in each vector in listB equals the number of positive values in the corresponding row of tableA, you could solve it by
tab <- t(tableA)
tab[tab > 0] <- unlist(listB)
tableA[, c('B', 'C', 'D')] <- t(tab)[, c('B', 'C', 'D')]
tableA
# A B C D E
# X 1 0 0 4 0
# Y 0 4 5 0 0
# Z 0 7 1 0 1
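
The transpose matters because R fills a logical-index assignment in column-major order, so after t() each column holds one row of tableA and lines up with one vector of listB. The sketch below repeats the same steps with an explicit length check in front. The names pos, repl, target and result are illustrative, and the table is rebuilt with as.table() for brevity.

```r
tableA <- as.table(matrix(
  c(1L, 0L, 0L, 0L, 4L, 6L, 0L, 6L, 1L, 3L, 0L, 0L, 0L, 0L, 1L),
  nrow = 3,
  dimnames = list(c("X", "Y", "Z"), c("A", "B", "C", "D", "E"))
))
listB <- list(X = c(0, 4), Y = c(4, 5), Z = c(7, 1, 0))
target <- c("B", "C", "D")

# After transposing, each column corresponds to one row of tableA.
pos <- t(tableA) > 0

# Guard: every vector must match the number of positive cells in its row.
stopifnot(colSums(pos) == lengths(listB))

# Fill the positive cells column by column, i.e. row by row of tableA.
repl <- t(tableA)
repl[pos] <- unlist(listB)

# Copy back only the target columns; A and E keep their original values.
result <- tableA
result[, target] <- t(repl)[, target]
result
```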
You could also solve your problem as follows:
x = unlist(listB)
tableA = t(tableA)
tableA[c("B", "C", "D"), ][which(tableA[c("B", "C", "D"), ]>0)] = x[x>0]
tableA = t(tableA)
A B C D E
X 1 0 0 4 0
Y 0 4 5 0 0
Z 0 7 1 0 1

R: count times per column a condition is met and row names appear in a list

I have a dataframe with count information (df1)
rownames  sample1  sample2  sample3
m1        0        5        1
m2        1        7        5
m3        6        2        0
m4        3        1        0
and a second with sample information (df2)
rownames  batch  total count
sample1   a      10
sample2   b      15
sample3   a      6
I also have two lists with information about the m values (these could easily be turned into another data frame if necessary, but I would rather not add to the count information, as it is quite large). There is no pattern (such as even and odd); I am just using a very simple example:
x <- c("m1", "m3") and y <- c("m2", "m4")
What I would like to do is add two more columns to the sample information: for each sample, a count of the m rows that have a value of 5 or above (as in the expected output below) and appear in list x or y
rownames  batch  total count  x  y
sample1   a      10           1  0
sample2   b      15           1  1
sample3   a      6            0  1
My current strategy is to make a list of values for both x and y and then append them to df2. Here are my attempts so far:
numX <- colSums(df1[sum(rownames(df1)>10 %in% x),]) and numX <- colSums(df1[sum(rownames(df1)>10 %in% x),]) both return a list of 0s
numX <- colSums(df1[rownames(df1)>10 %in% x,]) returns a list of the sum of count values meeting the conditions for each column
numX <- length(df1[rownames(df1)>10 %in% novel,]) returns the number of times the condition is met (in this example 2L)
I am not really sure how to approach this so I have just been throwing around attempts. I've tried looking for answers but maybe I am just struggling to find the proper wording.
We may do this with rowwise(), counting for each sample the values that are >= 5 among the rows named in x and in y
library(dplyr)
df2 %>%
rowwise() %>%
mutate(x = sum(df1[[rownames]][df1$rownames %in% x] >= 5),
y = sum(df1[[rownames]][df1$rownames %in% y] >= 5)) %>%
ungroup()
-output
# A tibble: 3 × 5
rownames batch totalcount x y
<chr> <chr> <int> <int> <int>
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
Or, based on the data, a base R option would be
out <- aggregate(. ~ grp, FUN = function(v) sum(v >= 5),
transform(df1, grp = c('x', 'y')[1 + (rownames %in% y)] )[-1])
df2[out$grp] <- t(out[-1])
-output
> df2
rownames batch totalcount x y
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
data
df1 <- structure(list(rownames = c("m1", "m2", "m3", "m4"), sample1 = c(0L,
1L, 6L, 3L), sample2 = c(5L, 7L, 2L, 1L), sample3 = c(1L, 5L,
0L, 0L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(rownames = c("sample1", "sample2", "sample3"),
batch = c("a", "b", "a"), totalcount = c(10L, 15L, 6L)),
class = "data.frame", row.names = c(NA,
-3L))
How about using dplyr and reshape2::melt?
library(dplyr)
library(reshape2)
df3 <- df1 %>%
melt(id.vars = "rownames") %>%
filter(value >= 5) %>%
mutate(x = as.numeric(rownames %in% c("m1", "m3")),
y = as.numeric(rownames %in% c("m2", "m4"))) %>%
select(-rownames, - value) %>%
group_by(variable) %>%
summarise(x = sum(x), y = sum(y))
df2 %>% left_join(df3, by = c("rownames" = "variable"))
rownames batch total_count x y
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
You can create a named list of vectors and, for each entry of rownames, count how many values of x and y in the respective sample are >= 5.
Base R option -
list_vec <- list(x = x, y = y)
cbind(df2, do.call(rbind, lapply(df2$rownames, function(x)
sapply(list_vec, function(y) {
sum(df1[[x]][df1$rownames %in% y] >= 5)
}))))
# rownames batch total.count x y
#1 sample1 a 10 1 0
#2 sample2 b 15 1 1
#3 sample3 a 6 0 1
Using tidyverse -
library(dplyr)
library(purrr)
list_vec <- lst(x, y)
df2 %>%
bind_cols(map_df(df2$rownames, function(x)
map(list_vec, ~sum(df1[[x]][df1$rownames %in% .x] >= 5))))
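
For a large count table, it may help to avoid looping over samples. The following is a base R sketch, not taken from the answers above, that subsets the rows belonging to each list once and lets colSums() count the values >= 5 per sample. The names counts and res are illustrative.

```r
df1 <- data.frame(
  rownames = c("m1", "m2", "m3", "m4"),
  sample1  = c(0L, 1L, 6L, 3L),
  sample2  = c(5L, 7L, 2L, 1L),
  sample3  = c(1L, 5L, 0L, 0L)
)
df2 <- data.frame(
  rownames   = c("sample1", "sample2", "sample3"),
  batch      = c("a", "b", "a"),
  totalcount = c(10L, 15L, 6L)
)
x <- c("m1", "m3")
y <- c("m2", "m4")

# For each list, keep its rows, test every cell against the threshold,
# and count the TRUE cells per sample column (in the order of df2).
counts <- sapply(list(x = x, y = y), function(ids) {
  colSums(df1[df1$rownames %in% ids, df2$rownames, drop = FALSE] >= 5)
})

res <- cbind(df2, counts)
res
```

Each value is tested against the threshold individually, so a sample with two qualifying rows in the same list would get a count of 2.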

Getting column-wise means and standard deviations for positive and negative values separately in R

I can get column-wise means and standard deviations (sample) of a dataframe as follows:
means <- apply(df , 2, mean)
sdevs <- apply(df , 2, sd)
However, my data frame contains both positive and negative values, and I need the means and standard deviations for the positive and negative values separately.
Example Input:
COL1 COL2
1 1
2 1
3 1
-1 -1
-5 -1
-9 -1
Example Output:
positive_means = [2,1]
positive_sdevs = [1,0]
negative_means = [-5,-1]
negative_sdevs = [4,0]
I do not want to write a for loop because my data frame contains too many values and columns.
Thanks.
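You can try creating a group for positive and negative values and then summarising with dplyr functions. Note that this groups whole rows, so it assumes every column has the same sign within a row, as in the example data: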
library(dplyr)
#Code
new <- df %>% mutate(group=ifelse(COL1>0&COL2>0,'Pos','Neg')) %>%
group_by(group) %>% summarise_all(list('mean'=mean,'sd'=sd))
Output:
# A tibble: 2 x 5
group COL1_mean COL2_mean COL1_sd COL2_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 Neg -5 -1 4 0
2 Pos 2 1 1 0
Some data used:
#Data
df <- structure(list(COL1 = c(1L, 2L, 3L, -1L, -5L, -9L), COL2 = c(1L,
1L, 1L, -1L, -1L, -1L)), class = "data.frame", row.names = c(NA,
-6L))
Another option can be using apply() and rowSums():
#Code1
as.data.frame(apply(df[rowSums(df)>0,],2,function(x) {data.frame(Mean=mean(x),SD=sd(x))}))
Output:
COL1.Mean COL1.SD COL2.Mean COL2.SD
1 2 1 1 0
#Code2
as.data.frame(apply(df[!rowSums(df)>0,],2,function(x) {data.frame(Mean=mean(x),SD=sd(x))}))
Output:
COL1.Mean COL1.SD COL2.Mean COL2.SD
1 -5 4 -1 0
Here's another base R option to add to Duck's excellent answer:
as.data.frame(lapply(df, function(x) c(mean_pos = mean(x[x > 0]),
mean_neg = mean(x[x <= 0]),
sd_pos = sd(x[x > 0 ]),
sd_neg = sd(x[x <= 0]))))
#> COL1 COL2
#> mean_pos 2 1
#> mean_neg -5 -1
#> sd_pos 1 0
#> sd_neg 4 0
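
To get the four vectors in the shape shown in the question, the per-column filtering can also be written with sapply(). This is a small sketch under the assumption that zeros should be excluded from both groups (strict > 0 and < 0), which the question does not specify.

```r
df <- data.frame(
  COL1 = c(1, 2, 3, -1, -5, -9),
  COL2 = c(1, 1, 1, -1, -1, -1)
)

# Each column is filtered by its own sign, independently of other columns.
positive_means <- sapply(df, function(v) mean(v[v > 0]))
positive_sdevs <- sapply(df, function(v) sd(v[v > 0]))
negative_means <- sapply(df, function(v) mean(v[v < 0]))
negative_sdevs <- sapply(df, function(v) sd(v[v < 0]))

positive_means  # COL1 = 2,  COL2 = 1
negative_sdevs  # COL1 = 4,  COL2 = 0
```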

R find consecutive months

I'd like to find consecutive months by client. I thought this would be easy, but I still can't find a solution.
My goal is to count the consecutive months of purchases for each client.
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1
In base R, we can use ave
df$consecutive <- with(df, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
FUN = seq_along))
df
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
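
To see why cumsum(c(TRUE, diff(Month) > 1)) works, it can help to look at the intermediate vectors for client A alone. This is only an illustrative breakdown; the names month, gap and run_id are not part of the answer.

```r
month <- c(1L, 1L, 2L, 5L, 6L, 8L)  # client A from the example

# TRUE marks the start of a new run: the first row, and every row whose
# month is more than 1 after the previous row's month.
gap <- c(TRUE, diff(month) > 1)
gap
# TRUE FALSE FALSE TRUE FALSE TRUE

# The cumulative sum turns those markers into a run identifier.
run_id <- cumsum(gap)
run_id
# 1 1 1 2 2 3

# Numbering the rows inside each run gives the consecutive counter.
consecutive <- ave(month, run_id, FUN = seq_along)
consecutive
# 1 2 3 1 2 1
```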
In dplyr, we can create a new group with lag to compare the current month with the previous month and assign row_number() in each group.
library(dplyr)
df %>%
group_by(Client,group=cumsum(Month-lag(Month, default = first(Month)) > 1)) %>%
mutate(consecutive = row_number()) %>%
ungroup %>%
select(-group)
We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
group_by(Client) %>%
group_by(grp = cumsum(c(TRUE, diff(Month) > 1)), .add = TRUE) %>%
mutate(consec = row_number()) %>%
ungroup() %>%
select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
