Aggregate sum of column within groups [duplicate] - r

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 1 year ago.
Hello everyone, I'm looking for help aggregating the sum of each column within the groups defined by df$Names.
Here is the df:
Names COL1 COL2 COL3 COL4
A 2 2 0 1
A 3 1 1 1
A 3 2 0 1
A 4 0 4 0
B 1 1 0 0
B 3 1 1 1
The expected output is :
Names COL1 COL2 COL3 COL4
A 12 5 5 3
B 4 2 1 1
Here are the data :
structure(list(Names = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), COL1 = c(2L, 3L, 3L, 4L, 1L, 3L), COL2 = c(2L,
1L, 2L, 0L, 1L, 1L), COL3 = c(0L, 1L, 0L, 4L, 0L, 1L), COL4 = c(1L,
1L, 1L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
I tried:
aggregate(cbind(COL1,COL2,COL3,COL4) ~ Names, data = df, sum, na.rm = TRUE)

Does this work:
library(dplyr)
df %>% group_by(Names) %>% summarise(across(starts_with('COL'), ~ sum(.)))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 5
Names COL1 COL2 COL3 COL4
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 12 5 5 3
2 B 4 2 1 1
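For readers who prefer to avoid a package dependency, the same grouped sum can be sketched in base R. This is a minimal sketch that rebuilds the question's data under the name df; the formula `. ~ Names` is shorthand for "every other column, grouped by Names".

```r
# Rebuild the example data from the question (same values as the dput output).
df <- data.frame(
  Names = factor(c("A", "A", "A", "A", "B", "B")),
  COL1 = c(2L, 3L, 3L, 4L, 1L, 3L),
  COL2 = c(2L, 1L, 2L, 0L, 1L, 1L),
  COL3 = c(0L, 1L, 0L, 4L, 0L, 1L),
  COL4 = c(1L, 1L, 1L, 0L, 0L, 1L)
)

# Sum every non-grouping column within each level of Names.
res <- aggregate(. ~ Names, data = df, FUN = sum)
print(res)
```

Note that the formula interface of aggregate() drops rows containing NA by default, which may or may not be what you want on real data.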

Related

transform dataset to tidy format combining column and row in R

I have a dataset that is in a somewhat unfortunate structure:
Species site 2001 2002 2003
a 1 0 1 4
a 2 1 1 0
a 3 5 5 5
b 1 3 0 4
b 2 1 1 1
b 3 4 5 5
After trying for hours to get it in the correct format using R, I did it in Excel and transformed it to the format below.
ID a b
1_2001 0 3
1_2002 1 0
1_2003 4 4
2_2001 1 1
2_2002 1 1
2_2003 0 1
3_2001 5 4
3_2002 5 5
3_2003 5 5
The original dataset is rather large, and it bothers me that I don't know how to do this quickly in R.
Can someone explain to me how this transformation can be done in R?
Using tidyr and dplyr, you can first reshape the year columns into a longer format, then use pivot_wider to create the "a" and "b" columns, paste "site" and the year together into "ID", and finally keep only the desired columns:
library(tidyr)
library(dplyr)
df %>% pivot_longer(.,-c(Species, site), names_to = "ID", values_to = "val") %>%
pivot_wider(.,names_from = Species, values_from = val) %>%
rowwise() %>%
mutate(ID = paste(site,ID, sep = "_")) %>%
select(ID, a, b)
Source: local data frame [9 x 3]
Groups: <by row>
# A tibble: 9 x 3
ID a b
<chr> <int> <int>
1 1_2001 0 3
2 1_2002 1 0
3 1_2003 4 4
4 2_2001 1 1
5 2_2002 1 1
6 2_2003 0 1
7 3_2001 5 4
8 3_2002 5 5
9 3_2003 5 5
Data
structure(list(Species = c("a", "a", "a", "b", "b", "b"), site = c(1L,
2L, 3L, 1L, 2L, 3L), `2001` = c(0L, 1L, 5L, 3L, 1L, 4L), `2002` = c(1L,
1L, 5L, 0L, 1L, 5L), `2003` = c(4L, 0L, 5L, 4L, 1L, 5L)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
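Here is another solution using gather and spread from the tidyr package (both are superseded by pivot_longer and pivot_wider in current tidyr, but they still work):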
tibble::tibble(Species = c("a", "a", "a", "b", "b", "b"),
site = c(1L, 2L, 3L, 1L, 2L, 3L),
`2001` = c(0L, 1L, 5L, 3L, 1L, 4L),
`2002` = c(1L, 1L, 5L, 0L, 1L, 5L),
`2003` = c(4L, 0L, 5L, 4L, 1L, 5L)) %>%
tidyr::gather(-Species, -site, key = "key", value = "value") %>%
tidyr::spread(key = "Species", value = "value")
Output:
# A tibble: 9 x 4
site key a b
<int> <chr> <int> <int>
1 1 2001 0 3
2 1 2002 1 0
3 1 2003 4 4
4 2 2001 1 1
5 2 2002 1 1
6 2 2003 0 1
7 3 2001 5 4
8 3 2002 5 5
9 3 2003 5 5
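The same long-then-wide idea can also be sketched with base R only. This is an illustrative sketch, not part of the answers above: the helper names years, long and wide are made up here, and the long table is built by hand before reshape() spreads Species into columns.

```r
# Example data from the question; check.names = FALSE keeps the year column names.
df <- data.frame(
  Species = c("a", "a", "a", "b", "b", "b"),
  site = c(1L, 2L, 3L, 1L, 2L, 3L),
  `2001` = c(0L, 1L, 5L, 3L, 1L, 4L),
  `2002` = c(1L, 1L, 5L, 0L, 1L, 5L),
  `2003` = c(4L, 0L, 5L, 4L, 1L, 5L),
  check.names = FALSE
)
years <- c("2001", "2002", "2003")

# Step 1: stack the year columns into one long table with a site_year ID.
long <- data.frame(
  Species = rep(df$Species, times = length(years)),
  ID = paste(rep(df$site, times = length(years)),
             rep(years, each = nrow(df)), sep = "_"),
  val = unlist(df[years], use.names = FALSE)
)

# Step 2: spread Species into columns, one row per ID.
wide <- reshape(long, idvar = "ID", timevar = "Species", direction = "wide")
names(wide) <- sub("^val\\.", "", names(wide))  # val.a -> a, val.b -> b
wide <- wide[order(wide$ID), ]
rownames(wide) <- NULL
print(wide)
```

Sorting by ID here is a plain character sort, which is enough for single-digit sites; with site numbers of 10 or more you would probably want to sort on site and year separately.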

Subsetting data based on the condition of the current and previous row in R

I have data with a status column. I want to keep every row whose status is 'f', together with the row immediately before it within the same id.
To simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
library(dplyr)
df %>%
group_by(id) %>%
filter(status == "f" | lead(status) == "f") %>%
ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
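The filter above keeps a row when it is an 'f' row or when the next row in the same id is an 'f' row. A base R sketch of the same logic, assuming the question's data is stored as df, uses ave() to shift a logical flag by one position within each id (the names is_f and next_is_f are illustrative only):

```r
# Example data from the question.
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  status = factor(c("n", "n", "f", "n", "f", "n", "n", "n", "f", "f")),
  time = c(1L, 2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)
)

# TRUE where the current row has status "f".
is_f <- df$status == "f"

# Within each id, shift the flag up by one row: TRUE where the NEXT row is "f".
# The last row of each id has no successor, so it gets FALSE.
next_is_f <- as.logical(ave(is_f, df$id, FUN = function(x) c(x[-1], FALSE)))

res <- df[is_f | next_is_f, ]
print(res)
```

Padding with FALSE instead of NA means no NA handling is needed when subsetting; dplyr's filter() gets the same effect because it drops rows where the condition is NA.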
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))

get max value of x in relation of two variables in R [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 4 years ago.
In my data
data=structure(list(v1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
v2 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), x = c(10L,
1L, 2L, 3L, 4L, 3L, 2L, 30L, 3L, 5L)), .Names = c("v1", "v2",
"x"), class = "data.frame", row.names = c(NA, -10L))
There are three variables.
For each category of v1, I need only the row in which x has its maximum value.
For example, take the first category of v1 and find the category of v2 for which x is largest.
That is
v1=1, v2=1, x=10
Then take the second category of v1 and do the same.
That is v1=2, v2=3, x=30
So the desired output is:
v1 v2 x
1 1 10
2 3 30
How can I do this?
Here is a solution using data.table:
library(data.table)
setDT(data)
data[, .SD[which.max(x)], keyby = v1]
v1 v2 x
1: 1 1 10
2: 2 3 30
And for completeness an ugly base-R solution:
t(sapply(split(data, data[["v1"]]), function(s) s[which.max(s[["x"]]),]))
v1 v2 x
1 1 1 10
2 2 3 30
Using dplyr:
library(dplyr)
data %>%
group_by(v1) %>%
filter(x == max(x))
# A tibble: 2 x 3
# Groups: v1 [2]
v1 v2 x
<int> <int> <int>
1 1 1 10
2 2 3 30
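A somewhat tidier base R variant of the split-and-pick idea is sketched below. It returns an ordinary data frame rather than a transposed matrix; the name res is illustrative only.

```r
# Example data from the question.
data <- data.frame(
  v1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  v2 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
  x  = c(10L, 1L, 2L, 3L, 4L, 3L, 2L, 30L, 3L, 5L)
)

# Split by v1, keep the row with the largest x in each piece, then stack the pieces.
pieces <- lapply(split(data, data$v1), function(g) g[which.max(g$x), ])
res <- do.call(rbind, pieces)
rownames(res) <- NULL
print(res)
```

One difference worth knowing: which.max() returns only the first maximum, so tied rows are dropped, whereas filter(x == max(x)) keeps every row that ties for the maximum.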

Subsetting a data frame according to recursive rows and creating a column for ordering

Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has a varying number of entries for columns A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there are at most 3 entries for each id, and then to create another column (labelled C) that numbers the rows within each id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
Here is one option with dplyr, taking the top 3 values based on A (based on the comments of @Ronak Shah).
library(dplyr)
df %>%
group_by(id) %>%
top_n(n = 3, wt = A) %>% # top 3 values based on A
mutate(C = rank(id, ties.method = "first")) # C consists of the order of each id
# A tibble: 6 x 4
# Groups: id [3]
id A B C
<int> <int> <int> <int>
1 1 20 1 1
2 1 12 1 2
3 1 13 0 3
4 2 11 1 1
5 2 21 0 2
6 3 17 0 1
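If the goal is simply the first three rows of each id in their original order, a base R sketch with ave() may be enough. It assumes the question's data is stored as df; res is an illustrative name.

```r
# Example data from the question.
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
  A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
  B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
)

# Number the rows 1, 2, 3, ... within each id.
df$C <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Keep at most the first three rows per id.
res <- df[df$C <= 3, ]
rownames(res) <- NULL
print(res)
```

This keeps rows by position, not by the size of A, so it matches the data.table answer; it coincides with the top-3-by-A approach on this sample only because the dropped row of id 1 also has the smallest A.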

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations for time and gender. For each id, I want to keep the rows up to and including the first row at which the cumulative sum of time reaches 25 or more. Notice that for id 2 all observations will be included, and for id 3 only the first observation is included. The expected results would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is using lag of cumsum as:
library(dplyr)
df %>% group_by(id,gender) %>%
filter(lag(cumsum(time), default = 0) < 25 )
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table (updated based on feedback from @Renu):
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE once the cumulative sum of 'time' is greater than or equal to 25.
Then you can filter for rows where the cumsum of this vector is less than or equal to 1, i.e. keep the entries up to and including the first TRUE for each 'id'.
df %>%
group_by(id) %>%
filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
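The double-cumsum idea is easier to follow when the intermediate vectors are written out. The sketch below first traces it for id 1 and then applies it per id in base R with ave(); the names hits and res are illustrative only.

```r
# Example data from the question.
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
  gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
)

# Trace for id 1, where time is 21, 3, 4, 9:
#   cumsum(time)              -> 21 24 28 37
#   cumsum(time) >= 25        -> FALSE FALSE TRUE TRUE
#   cumsum(of the flags)      -> 0 0 1 2
#   <= 1                      -> TRUE TRUE TRUE FALSE  (keeps the crossing row)

# Per id: count how many rows so far have a running total of at least 25.
hits <- ave(df$time, df$id, FUN = function(t) cumsum(cumsum(t) >= 25))

# Keep rows before the threshold is reached, plus the first row that reaches it.
res <- df[hits <= 1, ]
rownames(res) <- NULL
print(res)
```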
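You can try this dplyr construction:
library(dplyr)
dt <- group_by(df, id) %>%
# running total of time within each id
mutate(sum_time = cumsum(time)) %>%
# keep rows while the total before the current row is below 25,
# so the row that first reaches 25 is kept as well
filter(sum_time - time < 25) %>%
# exclude the sum_time column from the result
select(-sum_time)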
