Cumulative sums by month in R - r

I want to transform my data from this
Month Expenditures
1 1
1 2
2 3
2 6
3 2
3 5
to this:
Month Cumulative_expenditures
1 3
2 12
3 19
but I can't seem to figure out how to do it.
I tried using the cumsum() function, but it counts each observation; it doesn't distinguish between groups.
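For illustration, plain cumsum() on the Expenditures column just gives one running total over all rows, ignoring Month:
cumsum(c(1, 2, 3, 6, 2, 5))
# [1]  1  3  6 12 14 19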
Any help would be much appreciated!

A two-step base R solution would be:
#Code
df1 <- aggregate(Expenditures~Month,data=mydf,sum)
#Create cum sum
df1$Expenditures <- cumsum(df1$Expenditures)
Output:
Month Expenditures
1 1 3
2 2 12
3 3 19
Some data used:
#Data
mydf <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))
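For reference, the same two steps can be chained into a single expression (a sketch using the aggregate/cumsum logic above):
df1 <- transform(aggregate(Expenditures ~ Month, data = mydf, sum),
                 Expenditures = cumsum(Expenditures))
df1
#   Month Expenditures
# 1     1            3
# 2     2           12
# 3     3           19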

Using dplyr:
library(dplyr)
df %>%
group_by(Month) %>%
summarise(Expenditures = sum(Expenditures), .groups = "drop") %>%
mutate(Expenditures = cumsum(Expenditures))
#> # A tibble: 3 x 2
#> Month Expenditures
#> <int> <int>
#> 1 1 3
#> 2 2 12
#> 3 3 19
Or in base R:
data.frame(Month = unique(df$Month),
Expenditure = cumsum(tapply(df$Expenditures, df$Month, sum)))
#> Month Expenditure
#> 1 1 3
#> 2 2 12
#> 3 3 19

Here is another base R option using subset + ave
subset(
transform(df, Expenditures = cumsum(Expenditures)),
ave(rep(FALSE, nrow(df)), Month, FUN = function(x) seq_along(x) == length(x))
)
which gives
Month Expenditures
2 1 3
4 2 12
6 3 19
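To see what the ave(...) filter is doing, here is the logical index it builds on its own (TRUE only on the last row of each Month):
ave(rep(FALSE, nrow(df)), df$Month, FUN = function(x) seq_along(x) == length(x))
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE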

We can use base R
out <- with(df1, rowsum(Expenditures, Month))
data.frame(Month = row.names(out), Expenditure = cumsum(out))
# Month Expenditure
#1 1 3
#2 2 12
#3 3 19
Or more compactly
with(df1, stack(cumsum(rowsum(Expenditures, Month)[,1])))[2:1]
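For clarity, the intermediate rowsum() result (using the df1 data below) looks like this; the row names are the Month values:
with(df1, rowsum(Expenditures, Month))
#   [,1]
# 1    3
# 2    9
# 3    7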
data
df1 <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))

Related

How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I ran into the problem of not being able to calculate the sum of the absolute differences between consecutive columns:
sum_diff = ABS(A-B) + ABS(B-C) + ABS(C-D) + ...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate over all columns with a for loop, but given the size of my data frame, I would prefer a more elegant and faster solution.
Any help?
Thank you
We can subtract the data without its first column from the data without its last column, and use rowSums on the absolute values, in base R. This could be very efficient compared to a package solution.
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
Output:
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
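If you also want to attach this as a column, a sketch (selecting only the value columns, in case sum_diff was already added by the base R step above):
library(matrixStats)
df1$sum_diff <- rowSums(abs(rowDiffs(as.matrix(df1[c("A", "B", "C", "D")]))))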
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is a bit complicated. The idea is to generate a list of the column combinations; I tried it with combn, but then I get all possible combinations, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling after that:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~ {
  df <- tibble(a = data[[.[[1]]]] - data[[.[[2]]]])
  names(df) <- paste0(.[[1]], "_v_", .[[2]])
  df
}) %>%
  transmute(sum_diff = rowSums(abs(.))) %>%
  bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
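For reference, the intermediate tibble produced by map_dfc() (before the transmute step) looks roughly like this:
# A tibble: 4 x 3
#   A_v_B B_v_C C_v_D
#   <int> <int> <int>
# 1    -1    -1    -1
# 2     1    -2    -1
# 3    -1     1     0
# 4     3    -1     1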
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Here is a dplyr version of @akrun's elegant approach that calculates the diff of the dataframe with its shifted variant:
df %>%
  mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1)) -
                                identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is used as a vector that gets subtracted from its shifted self.
df %>%
rowwise() %>%
mutate(sum_diff = map2_int(c_across(1:last_col(1)),
c_across(2:last_col()),
~ abs(.x - .y)) %>% sum())

How to get the difference between groups with a dataframe in long format in R?

I have a simple dataframe with 2 IDs (N = 2) and 2 periods (T = 2), for example:
year id points
1 1 10
1 2 12
2 1 20
2 2 18
How does one achieve the following dataframe (preferably using dplyr or any tidyverse solution)?
id points_difference
1 10
2 6
Notice that the points_difference column is the difference within each ID across time (namely T2 - T1).
Additionally, how would one generalize this to multiple columns and multiple IDs (with only 2 periods)?
year id points scores
1 1 10 7
1 ... ... ...
1 N 12 8
2 1 20 9
2 ... ... ...
2 N 12 9
id points_difference scores_difference
1 10 2
... ... ...
N 0 1
If you are on dplyr 1.0.0 (or higher), summarise can return multiple rows in the output, so this will also work if you have more than 2 periods. You can do:
library(dplyr)
df %>%
arrange(id, year) %>%
group_by(id) %>%
summarise(across(c(points, scores), diff, .names = '{col}_difference'))
# id points_difference scores_difference
# <int> <int> <int>
#1 1 10 2
#2 1 -7 1
#3 2 6 2
#4 2 -3 3
data
df <- structure(list(year = c(1L, 1L, 2L, 2L, 3L, 3L), id = c(1L, 2L,
1L, 2L, 1L, 2L), points = c(10L, 12L, 20L, 18L, 13L, 15L), scores = c(2L,
3L, 4L, 5L, 5L, 8L)), class = "data.frame", row.names = c(NA, -6L))
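For the original two-period example in the question, the same approach gives the requested output (a quick sketch, reconstructing that smaller data set as df2):
library(dplyr)
df2 <- data.frame(year = c(1, 1, 2, 2), id = c(1, 2, 1, 2),
                  points = c(10, 12, 20, 18))
df2 %>%
  arrange(id, year) %>%
  group_by(id) %>%
  summarise(points_difference = diff(points), .groups = "drop")
# # A tibble: 2 x 2
#      id points_difference
#   <dbl>             <dbl>
# 1     1                10
# 2     2                 6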

counting number of times an id has duplicated years

I have the following data frame:
df =
id Year Value
1 1 3
1 2 4
2 1 6
2 2 2
2 2 3
3 1 7
3 2 3
I want to count the number of times an individual id has a duplicated year.
Desired Outcome:
1
ID 2 has year 2 twice, which is why the outcome is 1.
So far I have tried:
library("dplyr")
df %>% group_by(id, Year) %>% summarize(count=n())
but I cannot get a single number out of the counts.
Cheers
We can use table to create counts of observations for each id and year combination, and then count how many combinations occur more than once.
sum(table(df$id, df$Year) > 1)
#[1] 1
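To see why this works, here is the intermediate contingency table (only one cell is greater than 1):
table(df$id, df$Year)
#     1 2
#   1 1 1
#   2 1 2
#   3 1 1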
Just for completeness, if we want to do this in dplyr:
library(dplyr)
df %>%
group_by(id, Year) %>%
summarise(count= n()) %>%
ungroup() %>%
summarise(new_count = sum(count > 1))
# new_count
# <int>
#1 1
Just for fun:
data.table solution:
data:
library(data.table)
dt <- fread("id Year Value
1 1 3
1 2 4
2 1 6
2 2 2
2 2 3
3 1 7
3 2 3")
code:
dt[, .N > 1, by = c("id", "Year")]$V1 %>% sum  # %>% comes from magrittr/dplyr
A (fast) alternative:
sum(sapply(split(df$Year, df$id), function(x) any(duplicated(x))))
Where:
df <- data.frame(
id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
Year = c(1L, 2L, 1L, 2L, 2L, 1L, 2L),
Value = c(3L, 4L, 6L, 2L, 3L, 7L, 3L)
)

Cumulative Count Paste

I have this dataset:
ID Set Type Count
1 1 1 A NA
2 2 1 R NA
3 3 1 R NA
4 4 1 U NA
5 5 1 U NA
6 6 1 U NA
7 7 2 A NA
8 8 3 R NA
9 9 3 R NA
As dputs:
mystart <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
By using dplyr package how can I obtain this:
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
Again dputs:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 7L, 8L), .Label = c("A1", "A1R1", "A1R2",
"A1R2U1", "A1R2U2", "A1R2U3", "R1", "R2"), class = "factor")), .Names = c("ID",
"Set", "Type", "Count"), class = "data.frame", row.names = c(NA,
-9L))
In short, I want to count the observations of the column "Type" within column "Set" and paste this count as text, cumulatively.
Examining similar posts, I got close to this:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(1L,
1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
With the code:
library(dplyr)
myend <- read.table("mydata.txt", header=TRUE, fill=TRUE)
myend %>%
group_by(Set, Type) %>%
mutate(Count = seq(n())) %>%
ungroup()
Thank you very much for your help,
Base R version:
aggregateGroup <- function(x){
vecs <- Reduce(f=function(a,b){ a[b] <- sum(a[b],1L,na.rm=TRUE); a },
init=integer(0),
as.character(x),
accumulate = TRUE)
# vecs is a list with something like this :
# [[1]]
# integer(0)
# [[2]]
# A
# 1
# [[3]]
# A R
# 1 1
# ...
# so we simply turn those vectors into characters using vapply and paste
# (excluding the first)
vapply(vecs,function(y) paste0(names(y),y,collapse=''),FUN.VALUE='')[-1]
}
split(mystart$Count,mystart$Set) <- lapply(split(mystart$Type,mystart$Set), aggregateGroup)
> mystart
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
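A quick check of the helper on its own, using the Type values of the first Set:
aggregateGroup(c("A", "R", "R", "U", "U", "U"))
# [1] "A1"     "A1R1"   "A1R2"   "A1R2U1" "A1R2U2" "A1R2U3"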
A dplyr version:
mystart %>%
group_by(Set) %>%
mutate(Count = paste0('A', cumsum(Type == 'A'),
'R', cumsum(Type == 'R'),
'U', cumsum(Type == 'U'))) %>%
ungroup()
Which yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1R0U0
2 2 1 R A1R1U0
3 3 1 R A1R2U0
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1R0U0
8 8 3 R A0R1U0
9 9 3 R A0R2U0
If you want to omit the variables with count zero, you'd need to wrap a function around it like so
mygroup <- function(lst) {
name <- names(lst)
vectors <- lapply(seq_along(lst), function(i) {
x <- lst[[i]]
char <- name[i]
x <- ifelse(x == 0, "", paste0(char, x))
return(x)
})
return(do.call("paste0", vectors))
}
mystart %>%
group_by(Set) %>%
mutate(Count = mygroup(list(A = cumsum(Type == 'A'),
R = cumsum(Type == 'R'),
U = cumsum(Type == 'U')))) %>%
ungroup()
This yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
A one-line solution with data.table.
You first need to do
require(data.table)
mystart <- as.data.table(mystart)
and then just use one line:
mystart[, .(Type,
            count = paste0('A', cumsum(Type == 'A'),
                           'R', cumsum(Type == 'R'),
                           'U', cumsum(Type == 'U'))),
        by = c('Set')]
First you want to cumsum each Type and paste the counts together by 'Set'.
cumsum(Type == 'A') equals the count, since when Type == 'A' it's 1, otherwise it's 0.
You also wanted to paste them into one column, so paste0() is the tool to use.
You still wanted the Type column, so I included Type in the expression.
The output:
Set Type count
1: 1 A A1R0U0
2: 1 R A1R1U0
3: 1 R A1R2U0
4: 1 U A1R2U1
5: 1 U A1R2U2
6: 1 U A1R2U3
7: 2 A A1R0U0
8: 3 R A0R1U0
9: 3 R A0R2U0
Hope this helps.
By the way, if you want count 0 ignored, you'd have to design some if-else logic yourself:
basically, if cumsum(something) == 0, drop that piece, else paste0('something', cumsum(something)), and then paste0() the pieces together.
It gets a bit nasty, but you get the idea; a rough sketch follows.
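For what it's worth, here is one possible sketch of that zero-omitting variant, reusing the ifelse idea from the dplyr mygroup() answer above (piece() is just a hypothetical helper, not a data.table function):
piece <- function(letter, type) {
  n <- cumsum(type == letter)
  ifelse(n == 0, "", paste0(letter, n))  # drop a piece while its count is still 0
}
mystart[, count := paste0(piece("A", Type), piece("R", Type), piece("U", Type)),
        by = Set]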
Here's a base solution.
We can paste raw letters to seq_along of the letter groups to get the last 2 characters, then paste the result onto the last element of the previous result, using Reduce.
On top of this we use ave to compute by group.
fun <- function(x,y) paste0(x[length(x)],y,seq_along(y))
mystart$Count <- ave(as.character(mystart$Type),mystart$Set,
FUN = function(x) unlist(Reduce(fun,split(x,x),init=NULL,accumulate = TRUE)))
# ID Set Type Count
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
Details
split(x,x) splits letters as shown here for first Set:
with(subset(mystart,Set==1),split(Type,Type))
# $A
# [1] "A"
#
# $R
# [1] "R" "R"
#
# $U
# [1] "U" "U" "U"
Then fun does this type of operation, helped by Reduce:
fun(NULL,"A") # [1] "A1"
fun("A1",c("R","R")) # [1] "A1R1" "A1R2"
fun(c("A1R1","A1R2"),c("U","U","U")) # [1] "A1R2U1" "A1R2U2" "A1R2U3"
Bonus solution
This other base solution, using rle and avoiding split, gives the same output for the given example (and whenever Type values are grouped within Sets), but not with mystart2 <- rbind(mystart, mystart), for instance.
fun2 <- function(x){
  rle_ <- rle(x)
  suffix <- paste0(x, sequence(rle_$lengths))
  # lag() here is dplyr::lag(), which shifts the accumulated prefixes down
  prefix <- unlist(mapply(rep,
                          dplyr::lag(unlist(
                            Reduce(paste0, paste0(rle_$values, rle_$lengths), accumulate = TRUE)
                          ), rle_$lengths[1]),
                          each = rle_$lengths))
  prefix[is.na(prefix)] <- ""
  paste0(prefix, suffix)
}
mystart$Count2 <- ave(as.character(mystart$Type), mystart$Set, FUN = fun2)
Many elegant solutions have been provided for the problem. Still, I was looking for a dplyr way (without cumsum on fixed types). The function is generic enough to handle additional values of Type.
A solution with the help of a custom function (defined first so the pipeline runs as written):
library(dplyr)
cumCat <- function(x, y){
  retVal <- character(length(x))
  prevVal <- ""
  lastGrpVal <- ""
  for (i in seq_along(x)){
    if (y[i] == 1){
      lastGrpVal <- prevVal
    }
    retVal[i] <- paste0(lastGrpVal, x[i])
    prevVal <- retVal[i]
  }
  retVal
}
mystart %>% group_by(Set, Type) %>%
  mutate(type_count = row_number()) %>%
  mutate(TypeMod = paste0(Type, type_count)) %>%
  group_by(Set) %>%
  mutate(Count = cumCat(TypeMod, type_count)) %>%
  select(-type_count, -TypeMod)
# # Groups: Set [3]
# ID Set Type Count
# <int> <int> <fctr> <chr>
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
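A quick sanity check of cumCat() on its own, using the per-type labels and counts from Set 1:
cumCat(c("A1", "R1", "R2", "U1", "U2", "U3"), c(1, 1, 2, 1, 2, 3))
# [1] "A1"     "A1R1"   "A1R2"   "A1R2U1" "A1R2U2" "A1R2U3"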

Finding running maximum by group

I need to find a running maximum of a variable by group using R. The variable is sorted by time within group using df[order(df$group, df$time),].
My variable has some NAs, but I can deal with them by replacing them with zeros for this computation.
This is what the data frame df looks like:
(df <- structure(list(var = c(5L, 2L, 3L, 4L, 0L, 3L, 6L, 4L, 8L, 4L),
group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
.Label = c("a", "b"), class = "factor"),
time = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)),
.Names = c("var", "group","time"),
class = "data.frame", row.names = c(NA, -10L)))
# var group time
# 1 5 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 0 a 5
# 6 3 b 1
# 7 6 b 2
# 8 4 b 3
# 9 8 b 4
# 10 4 b 5
And I want a variable curMax as:
var group time curMax
  5     a    1      5
  2     a    2      5
  3     a    3      5
  4     a    4      5
  0     a    5      5
  3     b    1      3
  6     b    2      6
  4     b    3      6
  8     b    4      8
  4     b    5      8
Please let me know if you have any idea how to implement it in R.
We can try data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)); grouped by 'group', we get the cummax of 'var' and assign (:=) it to a new variable ('curMax').
library(data.table)
setDT(df1)[, curMax := cummax(var), by = group]
As commented by @Michael Chirico, if the data is not ordered by 'time', we can order it in the 'i':
setDT(df1)[order(time), curMax:=cummax(var), by = group]
Or with dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(curMax = cummax(var))
If df1 is a tbl_sql, explicit ordering might be required, using arrange:
df1 %>%
group_by(group) %>%
arrange(time, .by_group=TRUE) %>%
mutate(curMax = cummax(var))
or dbplyr::window_order
library(dbplyr)
df1 %>%
group_by(group) %>%
window_order(time) %>%
mutate(curMax = cummax(var))
You can do it like so:
df$curMax <- ave(df$var, df$group, FUN=cummax)
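If var contains NAs, the replace-with-zero approach mentioned in the question could be applied first (a sketch):
df$curMax <- ave(ifelse(is.na(df$var), 0, df$var), df$group, FUN = cummax)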
