Merging dataframe every x row - r

I am trying to merge values in a dataframe by every nth row.
The data structure looks as follows:
id value
1 1
2 2
3 1
4 2
5 3
6 4
7 1
8 2
9 4
10 4
11 2
12 1
I would like to aggregate the values by their position within each block of 4 rows. The dataset describes measurements taken over repeating 4-day periods:
id"1" = day1,
id"2" = day2,
id"3" = day3,
id"4" = day4,
id"5" = day1,
...
So perhaps a column that counts cyclically from 1 to 4 could be used?
The result should look like (sums):
day sum
1 8
2 10
3 4
4 5

This can be achieved by using %% to create a grouping variable and then summing with aggregate:
n <- 4
aggregate(value ~cbind(day = (seq_along(df1$id)-1) %% n + 1), df1, FUN = sum)
# day value
#1 1 8
#2 2 10
#3 3 4
#4 4 5
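To make the grouping variable easier to inspect, here is a minimal self-contained sketch in base R. It assumes the question's table is rebuilt as a data frame named df1 (the name used in this answer) and uses tapply instead of aggregate purely for illustration.

```r
# Rebuild the example data from the question (df1 is the name assumed by the answer)
df1 <- data.frame(id = 1:12,
                  value = c(1, 2, 1, 2, 3, 4, 1, 2, 4, 4, 2, 1))

n <- 4
# (row index - 1) %% n + 1 cycles through 1, 2, ..., n, 1, 2, ...
day <- (seq_len(nrow(df1)) - 1) %% n + 1
day

# Sum the values that share the same position in the cycle
sums <- tapply(df1$value, day, sum)
sums
```

Subtracting 1 before the modulo and adding it back afterwards keeps the labels in 1..n rather than 0..n-1.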
This approach can also be used with dplyr/data.table
library(dplyr)
df1 %>%
group_by(day = (seq_along(id)-1) %% 4 +1) %>%
summarise(value = sum(value))
# day value
# <dbl> <int>
#1 1 8
#2 2 10
#3 3 4
#4 4 5
or
library(data.table)
setDT(df1)[, .(value = sum(value)), .(day = (seq_along(id) - 1) %% 4 + 1)]
# day value
#1: 1 8
#2: 2 10
#3: 3 4
#4: 4 5

You need to make a sequence to group by, e.g.
rep(1:4, length = nrow(df))
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
In aggregate:
aggregate(value ~ cbind(day = rep(1:4, length = nrow(df))), df, FUN = sum)
## day value
## 1 1 8
## 2 2 10
## 3 3 4
## 4 4 5
or dplyr:
library(dplyr)
df %>% group_by(day = rep(1:4, length = n())) %>% summarise(sum = sum(value))
## # A tibble: 4 x 2
## day sum
## <int> <int>
## 1 1 8
## 2 2 10
## 3 3 4
## 4 4 5
or data.table:
library(data.table)
setDT(df)[, .(sum = sum(value)), by = .(day = rep(1:4, length = nrow(df)))]
## day sum
## 1: 1 8
## 2: 2 10
## 3: 3 4
## 4: 4 5

Related

Roll max in R. From first row to current row

I would like to calculate max value from first row to current row
df <- data.frame(id = c(1,1,1,1,2,2,2), value = c(2,5,3,2,4,5,4), result = c(NA,2,5,5,NA,4,5))
I have tried grouping by id with dplyr and using the rollmax function from zoo, but did not succeed.
1) rollmax is used with a fixed width but here we have a variable width so using rollapplyr, which seems close to the approach of the question, we have:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(out = lag(rollapplyr(value, 1:n(), max))) %>%
ungroup
giving:
# A tibble: 7 x 4
# Groups: id [2]
id value result out
<dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
2) It is also possible to perform the grouping via the width (second) argument of rollapplyr like this eliminating dplyr. In this case the widths are 1, 2, 3, 4, 1, 2, 3 and Max is like max except it does not use the last element of its argument x. (An alternate expression for the width would be seq_along(id) - match(id, id) + 1).
library(zoo)
Max <- function(x) if (length(x) == 1) NA else max(head(x, -1))
transform(df, out = rollapplyr(value, sequence(rle(id)$lengths), Max))
giving:
id value result out
1 1 2 NA NA
2 1 5 2 2
3 1 3 5 5
4 1 2 5 5
5 2 4 NA NA
6 2 5 4 4
7 2 4 5 5
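The two width expressions mentioned in approach 2) can be compared directly. This is only a quick check in base R on the id column from the question, not part of the original answer.

```r
id <- c(1, 1, 1, 1, 2, 2, 2)

# Widths from run lengths: counts 1..k within each run of equal ids
w1 <- sequence(rle(id)$lengths)

# Alternate expression: position minus the first position of that id, plus 1
w2 <- seq_along(id) - match(id, id) + 1

w1
w2
```

Note that the rle version relies on equal ids being adjacent, while the match version measures the distance from the first occurrence of each id.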
A data.table option using shift + cummax
> setDT(df)[, result2 := shift(cummax(value)), id][]
id value result result2
1: 1 2 NA NA
2: 1 5 2 2
3: 1 3 5 5
4: 1 2 5 5
5: 2 4 NA NA
6: 2 5 4 4
7: 2 4 5 5
library(dplyr)
df |>
group_by(id) |>
mutate(result = lag(cummax(value)))
# # A tibble: 7 x 3
# # Groups: id [2]
# id value result
# <dbl> <dbl> <dbl>
# 1 1 2 NA
# 2 1 5 2
# 3 1 3 5
# 4 1 2 5
# 5 2 4 NA
# 6 2 5 4
# 7 2 4 5
Here is a base R solution. This would just get you the cumulative maximum:
df$result = ave(df$value, df$id, FUN=cummax)
To get the cumulative maximum with the lag you wanted:
df$result = ave(df$value, df$id, FUN=function(x) c(NA,cummax(x[-(length(x))])))
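As a sanity check, the lagged cumulative maximum can be compared against the result column supplied in the question. The helper name lagged_cummax and the column name out are assumptions made for this sketch.

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2),
                 value = c(2, 5, 3, 2, 4, 5, 4),
                 result = c(NA, 2, 5, 5, NA, 4, 5))

# Hypothetical helper: running maximum of all earlier rows, NA for the first row
lagged_cummax <- function(x) c(NA, head(cummax(x), -1))

# ave() applies the helper within each id and returns a vector aligned with df
df$out <- ave(df$value, df$id, FUN = lagged_cummax)

# Compare with the expected column from the question
identical(df$out, df$result)
```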

How to create a new variable that is the sum of a column, by group, in R? [duplicate]

I am trying to create a new variable in my dataframe that is the group-specific sum of a variable. For example:
df <- data.frame (group = c(1, 1, 1, 2, 2, 2),
variable = c(1, 2, 1, 3, 4, 5)
)
df
group variable
1 1 1
2 1 2
3 1 1
4 2 3
5 2 4
6 2 5
I would like a new variable that sums variable by group to get something that looks like this:
group variable sum
1 1 1 4
2 1 2 4
3 1 1 4
4 2 3 12
5 2 4 12
6 2 5 12
Thank you!
Base R
with(df, ave(variable, group, FUN = sum))
# [1] 4 4 4 12 12 12
(Reassign into the frame with df$sum <- with(df, ...).)
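Spelled out, the reassignment might look like the following sketch, which rebuilds the question's data frame so that it runs on its own.

```r
df <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                 variable = c(1, 2, 1, 3, 4, 5))

# ave() returns one value per row, so it can be assigned straight into a new column
df$sum <- with(df, ave(variable, group, FUN = sum))
df
```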
dplyr
library(dplyr)
df %>%
group_by(group) %>%
mutate(sum = sum(variable)) %>%
ungroup()
# # A tibble: 6 x 3
# group variable sum
# <dbl> <dbl> <dbl>
# 1 1 1 4
# 2 1 2 4
# 3 1 1 4
# 4 2 3 12
# 5 2 4 12
# 6 2 5 12
data.table
library(data.table)
DF <- as.data.table(df)
DF[, sum := sum(variable), by = .(group) ]
DF
# group variable sum
# 1: 1 1 4
# 2: 1 2 4
# 3: 1 1 4
# 4: 2 3 12
# 5: 2 4 12
# 6: 2 5 12

Splitting and creating 2 rows out of one in R data table

I have a dataset (dt) like this in R:
n id val
1 1&&2 10
2 3 20
3 4&&5 30
And what I want to get is
n id val
1 1 10
2 2 10
3 3 20
4 4 30
5 5 30
I know that to split ids I need to do something like this:
id_split <- strsplit(dt$id,"&&")
But how do I create new rows with the same val for ids which were initially together in a row?
You may cbind the splits to get a column which you cbind again to the val (recycling).
res <- do.call(rbind, Map(data.frame, id=lapply(strsplit(dat$id, "&&"), cbind),
val=dat$val))
res <- cbind(n=1:nrow(res), res)
res
# n id val
# 1 1 1 10
# 2 2 2 10
# 3 3 3 20
# 4 4 4 30
# 5 5 5 30
You can use the lengths from the split of id to expand your rows. Then set n to the sequence along the rows of the expanded data frame, i.e.
l1 <- strsplit(as.character(df$id), '&&')
res_df <- transform(df[rep(seq_len(nrow(df)), lengths(l1)),],
id = unlist(l1),
n = seq_along(unlist(l1)))
which gives,
n id val
1 1 1 10
1.1 2 2 10
2 3 3 20
3 4 4 30
3.1 5 5 30
You can remove the rownames with rownames(res_df) <- NULL
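For a version that runs without any setup, here is a minimal base R sketch of the same expand-by-lengths idea. It assumes the question's table is rebuilt as a data frame named dt with id stored as character; the intermediate names parts and res are made up for the example.

```r
# Rebuild the example data from the question (id kept as character)
dt <- data.frame(n = 1:3,
                 id = c("1&&2", "3", "4&&5"),
                 val = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# Split each id on the literal separator "&&"
parts <- strsplit(dt$id, "&&", fixed = TRUE)

# Repeat each val once per piece of its id, then renumber the rows
res <- data.frame(id = unlist(parts),
                  val = rep(dt$val, lengths(parts)))
res$n <- seq_len(nrow(res))
res <- res[, c("n", "id", "val")]
res
```

Building the result from scratch avoids the 1.1 and 3.1 row names that row replication produces.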
A data.table solution.
library(data.table)
DT <- fread('n id val
1 1&&2 10
2 3 20
3 4&&5 30')
DT[,.(id=unlist(strsplit(id,split ="&&"))),by=.(n,val)][,n:=.I][]
#> n val id
#> 1: 1 10 1
#> 2: 2 10 2
#> 3: 3 20 3
#> 4: 4 30 4
#> 5: 5 30 5
Created on 2020-05-08 by the reprex package (v0.3.0)
Note:
A more robust solution is by = 1:nrow(DT), but you then need to handle your other columns yourself.
If anyone is looking for a tidyverse solution:
library(dplyr)
library(tidyr)
dt %>%
separate(id, into = paste0("id", 1:2),sep = "&&") %>%
pivot_longer(cols = c(id1,id2), names_to = "id_name", values_to = "id") %>%
drop_na(id) %>%
select(n, id, val)
output as
# A tibble: 5 x 3
n id val
<dbl> <chr> <dbl>
1 1 1 10
2 1 2 10
3 2 3 20
4 3 4 30
5 3 5 30
Edit:
As suggested by @sotos, and completely missed by me, a one-liner solution:
dt %>% separate_rows(id, sep = "&&")
gives the same output:
# A tibble: 5 x 3
n id val
<dbl> <chr> <dbl>
1 1 1 10
2 1 2 10
3 2 3 20
4 3 4 30
5 3 5 30
tstrsplit on id from data.table can do the job
library(data.table)
df <- setDT(df)[,.('id' = tstrsplit(id, "&&")), by = c('n','val')]
df[,'n' := seq(.N)]
df
n val id
1: 1 10 1
2: 2 10 2
3: 3 20 3
4: 4 30 4
5: 5 30 5

is there a way in R to subtract two rows within a group by specifying another grouping var?

Say I have something like this:
ID = c("a","a","a","a","a", "b","b","b","b","b")
Group = c("1","2","3","4","5", "1","2","3","4","5")
Value = c(3, 4,2,4,3, 6, 1, 8, 9, 10)
df<-data.frame(ID,Group,Value)
I want to subtract the Group 3 value from the Group 5 value within each ID, with an output column that holds this difference for each ID, like so:
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
Also, if that calculation cannot be done (i.e. group 5 is missing), NA values for the 'want' column would be ideal.
As each 'Group' appears only once per 'ID', we can subset directly:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(want = Value[Group == 5] - Value[Group == 3])
# A tibble: 10 x 4
# Groups: ID [2]
# ID Group Value want
# <fct> <fct> <dbl> <dbl>
# 1 a 1 3 1
# 2 a 2 4 1
# 3 a 3 2 1
# 4 a 4 4 1
# 5 a 5 3 1
# 6 b 1 6 2
# 7 b 2 1 2
# 8 b 3 8 2
# 9 b 4 9 2
#10 b 5 10 2
The above can be made more robust by converting the condition to a numeric index with which() and taking the first element. When there is no TRUE, [1] returns NA, so the subtraction yields NA:
df %>%
slice(-10) %>%
group_by(ID) %>%
mutate(want = Value[which(Group == 5)[1]] - Value[which(Group == 3)[1]])
Or use match, which returns NA if there is no match; indexing with NA returns NA, which then propagates through the subtraction (e.g. NA - 3):
df %>%
slice(-10) %>% # drop row 10 (ID "b", Group 5) to simulate a missing group
group_by(ID) %>%
mutate(want = Value[match(5, Group)] - Value[match(3, Group)])
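The difference between the two indexing styles can be seen on a single group in plain base R. The vectors below are the ID "b" rows from the question with the Group 5 row removed; this is only an illustrative sketch.

```r
Value <- c(6, 1, 8, 9)
Group <- c("1", "2", "3", "4")

# Logical subsetting with no match gives a zero-length vector
no_match <- Value[Group == 5]
length(no_match)

# match() gives NA when nothing matches, and indexing with NA yields NA
idx5 <- match(5, Group)
idx3 <- match(3, Group)

# NA propagates through the subtraction, which is the behaviour asked for
want <- Value[idx5] - Value[idx3]
want
```

A zero-length result is what makes the plain Value[Group == 5] version fail inside mutate when a group is missing, whereas the NA from match is recycled across the rows of the group.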
Here is a base R solution
dfout <- Reduce(rbind,
lapply(split(df,df$ID),
function(x) within(x, Want <-diff(subset(Value, Group %in% c("3","5"))))))
such that
> dfout
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
A data.table method:
library(data.table)
setDT(df)[, want := (Value[Group == 5] - Value[Group == 3]), by = .(ID)]
df
# ID Group Value want
# 1: a 1 3 1
# 2: a 2 4 1
# 3: a 3 2 1
# 4: a 4 4 1
# 5: a 5 3 1
# 6: b 1 6 2
# 7: b 2 1 2
# 8: b 3 8 2
# 9: b 4 9 2
# 10: b 5 10 2
Here is a solution using base R.
unsplit(
lapply(
split(df, df$ID),
function(d) {
x5 = d$Value[d$Group == "5"]
x5 = ifelse(length(x5) == 1, x5, NA)
x3 = d$Value[d$Group == "3"]
x3 = ifelse(length(x3) == 1, x3, NA)
d$Want = x5 - x3
d
}),
df$ID)

adding grouping indicator for repeating sequences

I thought this was a simple thing, but I failed and can't find an answer anywhere.
Example data looks like this. I have nro running from 1 to x, restarting at random points. I would like to create an ind variable which would be 1 for the first run, 2 for the second, and so on.
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
End result should look like this:
tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)),
ind = c(rep(1, 3), rep(2, 5), rep(3, 4)))
# A tibble: 12 x 2
nro ind
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 4 2
8 5 2
9 1 3
10 2 3
11 3 3
12 4 3
I thought I could do something with ifelse but failed miserably.
tbl %>%
mutate(ind = ifelse(nro < lag(nro), 1 + lag(ind), 1))
I assume this needs some kind of loop.
for sequences of the same length
You could use group_by on your nro variable and then just take the row_number() (the output below is for runs of equal length, e.g. nro = rep(1:4, 3)):
tbl %>%
group_by(nro) %>%
mutate(ind = row_number())
# A tibble: 12 x 2
# Groups: nro [4]
# nro ind
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 1 2
# 6 2 2
# 7 3 2
# 8 4 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
for varying length of the sequences
inspired by docendo discimus's comment
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
tbl %>%
mutate(ind = cumsum(nro == 1))
However, this is limited to sequences that begin with 1, since only the TRUE values of nro == 1 are accumulated.
Thus, you should consider using this instead:
tbl %>% mutate(dif = nro - lag(nro)) %>%
mutate(dif = ifelse(is.na(dif), nro, dif)) %>%
mutate(ind = cumsum(dif < 0) + 1) %>%
select(-dif)
# A tibble: 12 x 2
# nro ind
# <int> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 4 2
# 8 5 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
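The same idea can be written without dplyr. This is a minimal base R sketch, assuming that a new run starts whenever the value drops relative to the previous row.

```r
nro <- c(1:3, 1:5, 1:4)

# diff() is negative exactly where the sequence restarts;
# the leading TRUE opens the first run
ind <- cumsum(c(TRUE, diff(nro) < 0))
ind
```

Like the dif-based pipeline, this does not require the runs to start at 1, but it would miss a restart that does not decrease the value (for example a run of length one repeated twice).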
