Sum column over specific row numbers in grouped dataframe in R

I have a dataframe like this:
library(dplyr)

df <- data.frame(
  x = 1:100,
  y = rep(1:10, each = 10)
) %>%
  group_by(y)
And I would like to compute the sum of x from the 3rd to the 6th row of each group of y.
I think this should be easy, but I just can not figure it out at the moment.
In pseudocode I imagine something like this:
df %>%
  mutate(
    sum(x, ifelse(between(row_number(), 3, 6)))
  )
But this of course does not work. I would like to solve it with some dplyr function, but I cannot think of a fast base R solution either.
For the first group the sum would be 3 + 4 + 5 + 6 = 18.

One option could be:
df %>%
  group_by(y) %>%
  mutate(z = sum(x[row_number() %in% 3:6]))

# A tibble: 100 x 3
# Groups:   y [10]
       x     y     z
   <int> <int> <int>
 1     1     1    18
 2     2     1    18
 3     3     1    18
 4     4     1    18
 5     5     1    18
 6     6     1    18
 7     7     1    18
 8     8     1    18
 9     9     1    18
10    10     1    18
# ... with 90 more rows

You could also do this with filter() and summarise() and obtain a group-wise summary:
df %>%
  group_by(y) %>%
  mutate(rn = 1:n()) %>%
  filter(rn %in% 3:6) %>%
  summarise(x_sum = sum(x))

# A tibble: 10 x 2
       y x_sum
   <int> <int>
 1     1    18
 2     2    58
 3     3    98
 4     4   138
 5     5   178
 6     6   218
 7     7   258
 8     8   298
 9     9   338
10    10   378
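The OP also asked about base R. For completeness, here is one base R sketch (assuming the intended 100-row frame with `y = rep(1:10, each = 10)`): compute within-group row numbers with ave() and sum with tapply():

```r
# Base R sketch: per-group sum of rows 3 to 6
df <- data.frame(x = 1:100, y = rep(1:10, each = 10))

rn   <- ave(df$x, df$y, FUN = seq_along)     # within-group row numbers
keep <- rn %in% 3:6                          # rows 3..6 of each group
sums <- tapply(df$x[keep], df$y[keep], sum)  # named vector of group sums
sums
#>   1   2   3   4   5   6   7   8   9  10
#>  18  58  98 138 178 218 258 298 338 378
```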

Update: if you want to sum multiple row ranges of x, you can subset by index:
df %>%
  group_by(y) %>%
  mutate(sum_row3to6 = sum(x[3:6]),
         sum_row1to4 = sum(x[1:4]))

Output:

       x     y sum_row3to6 sum_row1to4
   <int> <int>       <int>       <int>
 1     1     1          18          10
 2     2     1          18          10
 3     3     1          18          10
 4     4     1          18          10
 5     5     1          18          10
 6     6     1          18          10
 7     7     1          18          10
 8     8     1          18          10
 9     9     1          18          10
10    10     1          18          10
First answer:
We could use slice() followed by summarise():
library(dplyr)
df %>%
  group_by(y) %>%
  slice(3:6) %>%
  summarise(sum = sum(x))

Output:

       y   sum
   <int> <int>
 1     1    18
 2     2    58
 3     3    98
 4     4   138
 5     5   178
 6     6   218
 7     7   258
 8     8   298
 9     9   338
10    10   378

data.table
library(data.table)
df <- data.frame(
  x = 1:100,
  y = rep(1:10, each = 10)
)
setDT(df)[rowid(y) %in% 3:6, list(sum_x = sum(x)), by = y][]
#>      y sum_x
#>  1:  1    18
#>  2:  2    58
#>  3:  3    98
#>  4:  4   138
#>  5:  5   178
#>  6:  6   218
#>  7:  7   258
#>  8:  8   298
#>  9:  9   338
#> 10: 10   378
Created on 2021-05-21 by the reprex package (v2.0.0)

Related

R reshape dataset

I am trying to reshape a dataset to wide format using R. Below is my code; I would like to obtain df2, but I am struggling a bit.
value <- seq(1, 20, 1)
country <- c("AT", "AT", "AT", "AT",
             "BE", "BE", "BE", "BE",
             "CY", "CY", "CY", "CY",
             "DE", "DE", "DE", "DE",
             "EE", "EE", "EE", "EE")
df <- data.frame(country, value)
df
df
# country value
# 1 AT 1
# 2 AT 2
# 3 AT 3
# 4 AT 4
# 5 BE 5
# 6 BE 6
# 7 BE 7
# 8 BE 8
# 9 CY 9
# 10 CY 10
# 11 CY 11
# 12 CY 12
# 13 DE 13
# 14 DE 14
# 15 DE 15
# 16 DE 16
# 17 EE 17
# 18 EE 18
# 19 EE 19
# 20 EE 20
# new, desired dataset
AT <- seq(1, 4, 1)
BE <- seq(5, 8, 1)
# etc.
df2 <- data.frame(AT, BE)
df2
# AT BE
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
Any help?
Using the tidyverse (dplyr and tidyr):
df %>%
  group_by(country) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = country, values_from = value)

# A tibble: 4 x 6
    row    AT    BE    CY    DE    EE
  <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     5     9    13    17
2     2     2     6    10    14    18
3     3     3     7    11    15    19
4     4     4     8    12    16    20
We can reshape to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
  group_by(country) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = country, values_from = value)

# A tibble: 4 x 6
#     rn    AT    BE    CY    DE    EE
#  <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    1     1     5     9    13    17
# 2    2     2     6    10    14    18
# 3    3     3     7    11    15    19
# 4    4     4     8    12    16    20
Or using base R
out <- unstack(df, value ~ country)
str(out)
#'data.frame': 4 obs. of 5 variables:
# $ AT: num 1 2 3 4
# $ BE: num 5 6 7 8
# $ CY: num 9 10 11 12
# $ DE: num 13 14 15 16
# $ EE: num 17 18 19 20
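Since every country here has the same number of rows, another base R option (a sketch that relies on that equal-length assumption) is split(), whose pieces data.frame() will bind into columns:

```r
# Base R sketch: split value by country and bind the equal-length pieces
# (only valid because each country has exactly 4 rows)
df <- data.frame(country = rep(c("AT", "BE", "CY", "DE", "EE"), each = 4),
                 value = 1:20)
df2 <- data.frame(split(df$value, df$country))
df2
#>   AT BE CY DE EE
#> 1  1  5  9 13 17
#> 2  2  6 10 14 18
#> 3  3  7 11 15 19
#> 4  4  8 12 16 20
```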

How do I combine dataframes of unequal length based on a condition?

I would like to know how to best combine the two following dataframes:
df1 <- data.frame(Date = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                  Altitude = c(100, 101, 101, 102, 103, 99, 98, 99, 89, 70))
> df1
Date Altitude
1 1 100
2 2 101
3 3 101
4 4 102
5 5 103
6 6 99
7 7 98
8 8 99
9 9 89
10 10 70
df2 <- data.frame(Start = c(1, 4, 8), Stop = c(3, 7, 10), Longitude = c(10, 12, 13))
> df2
Start Stop Longitude
1 1 3 10
2 4 7 12
3 8 10 13
I basically need an extra Longitude column in df1, filled based on whether the Date falls between Start and Stop, resulting in something like this:
Date Altitude Longitude
1 1 100 10
2 2 101 10
3 3 101 10
4 4 102 12
5 5 103 12
6 6 99 12
7 7 98 12
8 8 99 13
9 9 89 13
10 10 70 13
I've been trying all kinds of subsetting, filtering, ... but I just can't figure it out. Any help would be appreciated!
Kind regards
An idea via dplyr is to complete the start:stop sequence, unnest and merge, i.e.
library(dplyr)
df2 %>%
  mutate(Date = mapply(seq, Start, Stop)) %>%
  tidyr::unnest(cols = Date) %>%
  select(-c(1, 2)) %>%
  right_join(df1, by = 'Date')
which gives,
   Longitude Date Altitude
1         10    1      100
2         10    2      101
3         10    3      101
4         12    4      102
5         12    5      103
6         12    6       99
7         12    7       98
8         13    8       99
9         13    9       89
10        13   10       70
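Because the Start/Stop intervals here are sorted and contiguous, a one-line base R sketch with findInterval() also works (an assumption that would not hold for arbitrary or overlapping intervals):

```r
df1 <- data.frame(Date = 1:10,
                  Altitude = c(100, 101, 101, 102, 103, 99, 98, 99, 89, 70))
df2 <- data.frame(Start = c(1, 4, 8), Stop = c(3, 7, 10), Longitude = c(10, 12, 13))

# findInterval() maps each Date to the index of the last Start it is >= to
df1$Longitude <- df2$Longitude[findInterval(df1$Date, df2$Start)]
df1$Longitude
#> [1] 10 10 10 12 12 12 12 13 13 13
```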
Here is a tidyverse answer using the group_by and group_modify functions in the dplyr package (introduced in version 0.8.1 in May 2019).
library(dplyr)
df1 %>%
  group_by(Date, Altitude) %>%
  group_modify(~ data.frame(df2 %>%
                              filter(.x$Date >= Start, .x$Date <= Stop)) %>%
                 select(Longitude),
               keep = TRUE)
For each unique combination in df1 of date and altitude (i.e. for each row), this finds the longitude corresponding to the date range in df2.
The output is a tibble:
# A tibble: 10 x 3
# Groups:   Date, Altitude [10]
    Date Altitude Longitude
   <dbl>    <dbl>     <dbl>
 1     1      100        10
 2     2      101        10
 3     3      101        10
 4     4      102        12
 5     5      103        12
 6     6       99        12
 7     7       98        12
 8     8       99        13
 9     9       89        13
10    10       70        13
Base R solution:
ind <- apply(df2, 1, function(x) which(df1$Date >= x[1] & df1$Date <= x[2]))
df1$Longitude <- unlist(Map(function(x,y) rep(y, length(x)), ind, df2$Longitude))
Output
   Date Altitude Longitude
1     1      100        10
2     2      101        10
3     3      101        10
4     4      102        12
5     5      103        12
6     6       99        12
7     7       98        12
8     8       99        13
9     9       89        13
10   10       70        13
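For larger data, a data.table non-equi update join is another option (a sketch, not one of the original answers): it assigns each df2 Longitude to the df1 rows whose Date falls in [Start, Stop].

```r
library(data.table)

df1 <- data.table(Date = 1:10,
                  Altitude = c(100, 101, 101, 102, 103, 99, 98, 99, 89, 70))
df2 <- data.table(Start = c(1, 4, 8), Stop = c(3, 7, 10), Longitude = c(10, 12, 13))

# Non-equi join: match df1 rows to df2 intervals and copy Longitude by reference
df1[df2, Longitude := i.Longitude, on = .(Date >= Start, Date <= Stop)]
df1$Longitude
#> [1] 10 10 10 12 12 12 12 13 13 13
```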

Collapsing a data.frame by group and interval coordinates

I have a data.frame which specifies linear intervals (along chromosomes), where each interval is assigned to a group:
df <- data.frame(chr = c(rep("1", 5), rep("2", 4), rep("3", 5)),
                 start = c(seq(1, 50, 10), seq(1, 40, 10), seq(1, 50, 10)),
                 end = c(seq(10, 50, 10), seq(10, 40, 10), seq(10, 50, 10)),
                 group = c("g1.1", "g1.1", "g1.2", "g1.3", "g1.1",
                           "g2.1", "g2.2", "g2.3", "g2.2",
                           "g3.1", "g3.2", "g3.2", "g3.2", "g3.3"),
                 stringsAsFactors = FALSE)
I'm looking for a fast way to collapse df by chr and by group such that consecutive intervals along a chr that are assigned to the same group are combined and their start and end coordinates are modified accordingly.
Here's the desired outcome for this example:
res.df <- data.frame(chr = c(rep("1", 4), rep("2", 4), rep("3", 3)),
                     start = c(1, 21, 31, 41, 1, 11, 21, 31, 1, 11, 41),
                     end = c(20, 30, 40, 50, 10, 20, 30, 40, 10, 40, 50),
                     group = c("g1.1", "g1.2", "g1.3", "g1.1",
                               "g2.1", "g2.2", "g2.3", "g2.2",
                               "g3.1", "g3.2", "g3.3"),
                     stringsAsFactors = FALSE)
Edit: To account for the consecutive requirement you can use the same approach as earlier but add an extra grouping variable based on consecutive values.
library(dplyr)
df %>%
  group_by(chr, group, temp.grp = with(rle(group), rep(seq_along(lengths), lengths))) %>%
  summarise(start = min(start),
            end = max(end)) %>%
  arrange(chr, start) %>%
  select(chr, start, end, group)

# A tibble: 11 x 4
# Groups:   chr, group [9]
   chr   start   end group
   <chr> <dbl> <dbl> <chr>
 1 1         1    20 g1.1
 2 1        21    30 g1.2
 3 1        31    40 g1.3
 4 1        41    50 g1.1
 5 2         1    10 g2.1
 6 2        11    20 g2.2
 7 2        21    30 g2.3
 8 2        31    40 g2.2
 9 3         1    10 g3.1
10 3        11    40 g3.2
11 3        41    50 g3.3
A different tidyverse approach could be:
df %>%
  gather(var, val, -c(chr, group)) %>%
  group_by(chr, group) %>%
  filter(val == min(val) | val == max(val)) %>%
  spread(var, val)

  chr   group   end start
  <chr> <chr> <dbl> <dbl>
1 1     g1.1     20     1
2 1     g1.2     30    21
3 1     g1.3     50    31
4 2     g2.1     10     1
5 2     g2.2     20    11
6 2     g2.3     40    21
7 3     g3.1     10     1
8 3     g3.2     40    11
9 3     g3.3     50    41
Or:
df %>%
  group_by(chr, group) %>%
  summarise_all(funs(min, max)) %>%
  select(-end_min, -start_max)

  chr   group start_min end_max
  <chr> <chr>     <dbl>   <dbl>
1 1     g1.1          1      20
2 1     g1.2         21      30
3 1     g1.3         31      50
4 2     g2.1          1      10
5 2     g2.2         11      20
6 2     g2.3         21      40
7 3     g3.1          1      10
8 3     g3.2         11      40
9 3     g3.3         41      50
A solution to the updated post, also using rleid() from data.table, could be:
library(data.table)
df %>%
  group_by(chr, group, group2 = rleid(group)) %>%
  summarise_all(funs(min, max)) %>%
  select(-end_min, -start_max)

   chr   group group2 start_min end_max
   <chr> <chr>  <int>     <dbl>   <dbl>
 1 1     g1.1       1         1      20
 2 1     g1.1       4        41      50
 3 1     g1.2       2        21      30
 4 1     g1.3       3        31      40
 5 2     g2.1       5         1      10
 6 2     g2.2       6        11      20
 7 2     g2.2       8        31      40
 8 2     g2.3       7        21      30
 9 3     g3.1       9         1      10
10 3     g3.2      10        11      40
11 3     g3.3      11        41      50
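The same consecutive-run collapse can also be sketched entirely in data.table (an alternative, not from the original answers): rleid() assigns one id per consecutive run of (chr, group), and aggregation by that id collapses each run:

```r
library(data.table)

df <- data.frame(chr = c(rep("1", 5), rep("2", 4), rep("3", 5)),
                 start = c(seq(1, 50, 10), seq(1, 40, 10), seq(1, 50, 10)),
                 end = c(seq(10, 50, 10), seq(10, 40, 10), seq(10, 50, 10)),
                 group = c("g1.1", "g1.1", "g1.2", "g1.3", "g1.1",
                           "g2.1", "g2.2", "g2.3", "g2.2",
                           "g3.1", "g3.2", "g3.2", "g3.2", "g3.3"))

# Collapse each consecutive run to its min start / max end
res <- setDT(df)[, .(chr = chr[1], start = min(start), end = max(end),
                     group = group[1]),
                 by = .(run = rleid(chr, group))][, run := NULL][]
res
```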

Different filter rules for groups using dplyr

Sample data:
df <- data.frame(loc.id = rep(1:2, each = 11),
                 x = c(35, 51, 68, 79, 86, 90, 92, 93, 95, 98, 100,
                       35, 51, 68, 79, 86, 90, 92, 92, 93, 94, 94))
For each loc.id, I want to filter out x <= 95. My attempt:
df %>%
  group_by(loc.id) %>%
  filter(row_number() <= which.max(x >= 95))

   loc.id     x
    <int> <dbl>
 1      1    35
 2      1    51
 3      1    68
 4      1    79
 5      1    86
 6      1    90
 7      1    92
 8      1    93
 9      1    95
10      2    35
However, the issue is that for group 2 all the values are less than 95, so I want to keep all values of x for group 2. The line above does not do that.
Perhaps something like this?
df %>%
  group_by(loc.id) %>%
  mutate(n = sum(x > 95)) %>%
  filter(n == 0 | (x > 0 & x > 95)) %>%
  ungroup() %>%
  select(-n)

# A tibble: 13 x 2
#    loc.id     x
#     <int> <dbl>
#  1      1    98
#  2      1   100
#  3      2    35
#  4      2    51
#  5      2    68
#  6      2    79
#  7      2    86
#  8      2    90
#  9      2    92
# 10      2    92
# 11      2    93
# 12      2    94
# 13      2    94
Note that removing entries where x <= 95 corresponds to retaining entries where x > 95 (not x >= 95).
You can use match() to get the first TRUE index, returning the group length via the nomatch parameter when no match is found:
df %>%
  group_by(loc.id) %>%
  filter(row_number() <= match(TRUE, x >= 95, nomatch = n()))
# A tibble: 20 x 2
# Groups: loc.id [2]
# loc.id x
# <int> <dbl>
# 1 1 35
# 2 1 51
# 3 1 68
# 4 1 79
# 5 1 86
# 6 1 90
# 7 1 92
# 8 1 93
# 9 1 95
#10 2 35
#11 2 51
#12 2 68
#13 2 79
#14 2 86
#15 2 90
#16 2 92
#17 2 92
#18 2 93
#19 2 94
#20 2 94
Or use a lagged cumulative sum as the filter condition:
df %>%
  group_by(loc.id) %>%
  filter(!lag(cumsum(x >= 95), default = FALSE))
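To see why the lagged cumsum works, here is the condition evaluated on a small hypothetical vector: the cumulative count of x >= 95 stays at zero until the first hit, and lagging it by one position keeps the first hit itself while dropping everything after it:

```r
library(dplyr)

x <- c(35, 90, 95, 98, 100)       # hypothetical single group
cumsum(x >= 95)                    # 0 0 1 2 3
!lag(cumsum(x >= 95), default = FALSE)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
```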
A solution using all() together with dplyr:
library(dplyr)
df %>%
  group_by(loc.id) %>%
  filter(x > 95 | all(x <= 95))  # x > 95, OR all x in the group are <= 95
# Groups: loc.id [2]
# loc.id x
# <int> <dbl>
# 1 1 98.0
# 2 1 100
# 3 2 35.0
# 4 2 51.0
# 5 2 68.0
# 6 2 79.0
# 7 2 86.0
# 8 2 90.0
# 9 2 92.0
# 10 2 92.0
# 11 2 93.0
# 12 2 94.0
# 13 2 94.0

Sum of group but keep the same value for each row in r

I have a data frame and want to create new variables holding the sum of x and of y within each ID and Group. If I summarise normally, the dimension of the data is reduced; in my case I need to keep every row and repeat the group sum.
ID <- c(rep(1, 3), rep(3, 5), rep(4, 4))
Group <- c(1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2)
x <- 1:12
y <- 12:23
df <- data.frame(ID, Group, x, y)
ID Group x y
1 1 1 1 12
2 1 1 2 13
3 1 2 3 14
4 3 1 4 15
5 3 1 5 16
6 3 1 6 17
7 3 2 7 18
8 3 2 8 19
9 4 1 9 20
10 4 1 10 21
11 4 1 11 22
12 4 2 12 23
The desired output has two more variables, sumx and sumy, summed by (ID, Group):
   ID Group  x  y sumx sumy
1   1     1  1 12    3   25
2   1     1  2 13    3   25
3   1     2  3 14    3   14
4   3     1  4 15   15   48
5   3     1  5 16   15   48
6   3     1  6 17   15   48
7   3     2  7 18   15   37
8   3     2  8 19   15   37
9   4     1  9 20   30   63
10  4     1 10 21   30   63
11  4     1 11 22   30   63
12  4     2 12 23   12   23
Any idea?
As short as:
df$sumx <- with(df, ave(x, ID, Group, FUN = sum))
df$sumy <- with(df, ave(y, ID, Group, FUN = sum))
We can use dplyr
library(dplyr)
df %>%
  group_by(ID, Group) %>%
  mutate_each(funs(sum)) %>%
  rename(sumx = x, sumy = y) %>%
  bind_cols(., df[c("x", "y")])
If there are only two columns to sum, then
df %>%
  group_by(ID, Group) %>%
  mutate(sumx = sum(x), sumy = sum(y))
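mutate_each() and funs() are long deprecated; in current dplyr (>= 1.0) the same repeated group sums can be sketched with across():

```r
library(dplyr)

df <- data.frame(ID = c(rep(1, 3), rep(3, 5), rep(4, 4)),
                 Group = c(1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2),
                 x = 1:12, y = 12:23)

# across() applies sum to both columns; .names keeps x and y and adds sumx/sumy
df %>%
  group_by(ID, Group) %>%
  mutate(across(c(x, y), sum, .names = "sum{.col}")) %>%
  ungroup()
```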
You can use the code below for a single column (add more mutate() calls for additional columns); note that cumsum() gives a running total within each group rather than a repeated group total:
library(dplyr)
data13 <- data12 %>%
  group_by(Category) %>%
  mutate(cum_Cat_GMR = cumsum(GrossMarginRs))
