Conditional rolling sum based on another column - r

I would like to compute the conditional rolling sum of a column, but based on the values of another column.
I have a table like this:
data_frame <- data.frame(category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
                         category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
                         value = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
                         desired_output = c(0, 0, 0, 4, 4, 4, 8, 8, 11))

library(dplyr)

data_frame2 <- data_frame %>%
  group_by(category1) %>%
  mutate(cumsum = cumsum(value))
category1 category2 value cumsum desired_output
A         B             1      1              0
A         B             2      3              0
A         B             1      4              0
B         A             2      2              4
B         A             1      3              4
B         A             5      8              4
A         B             3      7              8
A         B             4     11              8
B         A             2     10             11
I can compute a rolling sum of value grouped by category1 or category2 using cumsum, but what I need is a column that, for each row, sums value over all previous rows where category1 equals the current row's category2. For example, in the last row above, category2 is A, so it sums value over all earlier rows where category1 == A.
I have tried various hacky ifelse/lag/fill solutions, but nothing gets close to what I need. I have also tried adding a condition into the ave function, as below, but I am not sure what the syntax should be...
data_frame2$desired_output <- ave(data_frame2$value, data_frame2$category1 = data_frame2$category2, FUN=cumsum)
Thanks in advance - first question so apologies about anything I missed/got wrong!
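A minimal sketch of one way to do this with dplyr (not from the original thread; the column name conditional_cumsum and the row-by-row approach are my own): for each row, sum value over all earlier rows whose category1 equals that row's category2.

library(dplyr)

# illustrative sketch, not from the original thread
data_frame %>%
  mutate(conditional_cumsum = sapply(seq_len(n()), function(i) {
    prev <- seq_len(i - 1)
    # sum the values of all earlier rows whose category1
    # matches the current row's category2
    sum(value[prev][category1[prev] == category2[i]])
  }))

On the sample data this reproduces desired_output. It is O(n^2), so for large tables a per-category cumulative lookup would be worth the extra effort.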


2-group heterogeneity index

I have a dataset with two distinct companies (A and B) observed across 3 different groups (1, 2, 3):
library(tidyverse)
set.seed(100)
df <- tibble(Group = sample(c(1, 2, 3), 20, replace = TRUE),
             Company = sample(c('A', 'B'), 20, replace = TRUE))
I want to come up with a metric that characterizes group composition across the timespan.
Thus far I have used an index based on Shannon's index, which gives a measure of heterogeneity varying between 0 and 1, with 1 being perfectly heterogeneous (equal representation of each company) and 0 completely homogeneous (only one company represented):
df %>%
  group_by(Group, Company) %>%
  summarise(n = n()) %>%
  mutate(p = n / sum(n)) %>%
  mutate(Shannon = -(p * log2(p) + (1 - p) * log2(1 - p)))
Yielding:
Group Company     n         p   Shannon
<dbl> <chr>   <int>     <dbl>     <dbl>
    1 A           2 0.6666667 0.9182958
    1 B           1 0.3333333 0.9182958
    2 A           4 0.5000000 1.0000000
    2 B           4 0.5000000 1.0000000
    3 A           1 0.1111111 0.5032583
    3 B           8 0.8888889 0.5032583
However, I am looking for an index on [-1, +1] that yields -1 when only company A is present at a time point, +1 when only company B is present, and 0 for equal representation.
How can I create such an index? I have looked at measures such as Moran's I as inspiration, but they do not seem to suit the need.
A simple solution might be to calculate a mean.
I transformed Company into a numeric value, with A = -1 and B = 1, and calculated the mean by Group.
The result is an index for each Group that equals -1 when Company contains only "A"s and 1 when it contains only "B"s.
Data
df <- structure(list(Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1,
2, 2, 3, 2, 2, 1, 1, 3), Company = c("A", "A", "A", "A", "B",
"B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "A", "B", "B",
"A", "B")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
Code
df %>%
  mutate(value = ifelse(Company == "A", -1, 1)) %>%
  group_by(Group) %>%
  summarise(index = mean(value))
Output
# A tibble: 3 x 2
Group index
<dbl> <dbl>
1 1 0.333
2 2 -0.429
3 3 0.429
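Equivalently, this index is just p(B) - p(A), the difference between the two companies' proportions within each Group, which is why it reaches -1 and +1 exactly when one company is absent and 0 at equal representation.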

Logic for filtering dependent on two columns [duplicate]

This question already has answers here:
Select groups which have at least one of a certain value
(3 answers)
Closed 2 years ago.
I am struggling to write the right logic to filter on a condition in one column while keeping all rows of the matching groups. I have multiple ids, and if an id appears in 2020, I want all the data from the other years that id was measured to come along as well.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                  pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
                  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
  filter(pop == 3) %>%
  distinct(group) %>%
  pull(group) -> groups

df4 %>%
  filter(group %in% groups)
or if you want to combine the two steps:
df4 %>%
  filter(group %in% (df4 %>%
                       filter(pop == 3) %>%
                       distinct(group) %>%
                       pull(group)))
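(The parentheses around the inner pipeline are required: %in% and %>% are both user-level operators with equal precedence in R and associate left to right, so without them the expression does not parse as intended.)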
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can also do this with dplyr's group_by() and filter() combined with any(). any() returns TRUE if the condition holds for at least one row, and group_by() makes it evaluate separately within each subgroup of the grouping variable.
Follow these steps:
First pipe the data to group_by() to group by your group variable.
Then pipe to filter() with any(pop == 3) to keep only the groups in which some pop equals 3.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                  pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
                  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5))
# load the library
library(dplyr)
threes <- df4 %>%
  group_by(group) %>%
  filter(any(pop == 3))
# print the result
threes
Output:
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
  df4,
  ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
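Here ave(pop == 3, group, FUN = any) computes any(pop == 3) within each group and expands the result back to the full number of rows, so subset() receives a logical vector that marks every row of the qualifying groups.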
Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))

Rearrange observations to add up to desired value - data.table?

I have a data table with a number of observations, each with a non-unique ID and an integer value X. The observations are grouped by ID, and I want the sum of X for each group to be at least 10. To achieve this, an observation is allowed to change its ID as long as the sum of X for its original group does not drop below 10.
Below is an example of what I mean and a very manual solution:
# sample data
library(data.table)

input <- data.table(ID = c("A", "A", "A", "B", "B", "B",
                           "C", "C", "C", "D", "D", "D"),
                    X = c(1, 3, 1, 5, 1, 5,
                          6, 10, 2, 3, 3, 4))
# summarise X by ID
input[, .(X = sum(X)), by = ID]
ID X
A 5
B 11
C 18
D 10
# what the output should look like
output <- data.table(ID = c("A", "A", "A", "B", "B", "B",
                            "A", "C", "C", "D", "D", "D"),
                     X = c(1, 3, 1, 5, 1, 5,
                           6, 10, 2, 3, 3, 4))
output[, .(X = sum(X)), by = ID]
ID X
A 11
B 11
C 12
D 10
output
ID X
A  1
A  3
A  1
B  5
B  1
B  5
A  6   - this observation changed ID from C to A to get group A to 11
C  10
C  2
D  3
D  3
D  4
This is obviously a very simple example, as only group C has enough spare capacity to donate observations. In reality there may be cases where several groups can give away observations and where more than one observation has to be modified; however, there will only ever be up to 4 groups between which movement can happen.
Is it possible to automate this process in R?
Here is a possible heuristic approach. First, subset to the IDs whose totals are below or above 10 (say there are N of them). Then distribute the values from these IDs, largest first, always into whichever of the N buckets currently has the smallest sum.
b <- 10
input <- data.table(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"),
                    X = c(1, 3, 1, 5, 1, 5, 6, 10, 2, 3, 3, 4))
# find the IDs that need redistribution
scramID <- input[, sum(X), by = ID][V1 != b, ID]

# extract the values for those IDs
x <- input[ID %in% scramID, sort(X, decreasing = TRUE)]

# create a list of empty vectors as buckets
l <- replicate(length(scramID), vector('integer'))

# assign the values, starting with the largest, to the bucket with the smallest sum
for (k in x) {
  i <- which.min(sapply(l, sum))
  l[[i]] <- c(l[[i]], k)
}

# rbind to the original dataset to get the result
names(l) <- scramID
ans <- rbindlist(list(
  setnames(setDT(stack(l)), c("X", "ID")),
  input[!ID %in% scramID]
), use.names = TRUE)
Output:
X ID
1: 10 A
2: 1 A
3: 1 A
4: 6 B
5: 3 B
6: 2 B
7: 5 C
8: 5 C
9: 1 C
10: 3 D
11: 3 D
12: 4 D
and the distribution of values:
> ans[, sum(X), by = ID]
ID V1
1: A 12
2: B 11
3: C 11
4: D 10
For this example, the IDs chosen for redistribution are A, B and C, so there are three buckets. The values for these IDs in descending order are 10, 6, 5, 5, 3, 2, 1, 1 and 1.
In the first 3 iterations, 10, 6 and 5 are assigned to the first, second and third buckets respectively.
The next number, 5, is also assigned to the third bucket, since its sum (5) is the smallest of the three.
The next number, 3, is assigned to the second bucket, since its sum (6) is now the smallest.
And so on for the rest of the numbers.
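One caveat worth adding: this is a greedy balancing heuristic, so it evens out the totals of the redistributed IDs but does not by itself guarantee that every group clears 10 when the combined total is too small; a hard constraint would need something like an integer programming formulation.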

Subset of dataframe for which 2 variables match another dataframe in R

I'm looking to obtain a subset of my first, larger, dataframe 'df1' by selecting rows which contain particular combinations in the first two variables, as specified in a smaller 'df2'. For example:
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                  day = c(1, 2, 2, 1, 2, 3),
                  value = seq(4, 9))
df1 # my actual df has 20 variables
ID day value
A 1 4
A 2 5
A 2 6
B 1 7
B 2 8
B 3 9
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
df2 # this df remains at 2 variables
ID day
A 2
B 1
The desired output would be:
ID day value
A 2 5
A 2 6
B 1 7
Any help would be much appreciated, thanks!
This is a good use of the merge function.
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                  day = c(1, 2, 2, 1, 2, 3),
                  value = seq(4, 9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))

merge(df1,
      df2,
      by = c("ID", "day"))
Which gives output:
ID day value
1 A 2 5
2 A 2 6
3 B 1 7
Here is a dplyr solution:
library("dplyr")
semi_join(df1, df2, by = c("ID", "day"))
# ID day value
# 1 A 2 5
# 2 A 2 6
# 3 B 1 7
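Note that semi_join() keeps only df1's columns and never duplicates rows of df1, even if df2 contained repeated ID/day combinations, whereas merge() would return one row per match.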

Unique body count column

I'm trying to add a count column that marks each unique person once. Each person has multiple data points.
df <- data.frame(PERSON = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 Y = c(2, 5, 4, 1, 2, 5, 3, 7, 1))
This is what I'd like it to look like:
PERSON Y UNIQ_CT
1 A 2 1
2 A 5 0
3 A 4 0
4 B 1 1
5 B 2 0
6 C 5 1
7 C 3 0
8 C 7 0
9 C 1 0
You can use duplicated and negate it:
transform(df, UNIQ_CT = as.integer(!duplicated(PERSON)))
Since the question has a dplyr tag, here is an option:
library(dplyr)
df %>%
  group_by(PERSON) %>%
  mutate(UNIQ_CT = ifelse(row_number() == 1, 1, 0))
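A slightly more compact equivalent is mutate(UNIQ_CT = as.integer(row_number() == 1)).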
