I am trying to summarize a list of variables by group. Some variables need to be summed and others need to be averaged.
I have this:
Group Variable1 Variable2
1 10 2
1 12 6
2 6 7
2 4 9
I'd like the sum of variable 1 and mean of variable 2:
Group Variable1 Variable2
1 22 4
2 10 8
I've been using dplyr to get the group sum:
sum <- (df %>%
group_by(Group) %>%
summarise_all(funs(sum)))
I'm trying to find a way to choose which columns are summed and which are averaged for the summarize function.
Thank you!
With dplyr 1.0.0 or later (the devel version when this was written), across makes it possible to apply different functions to different sets of variables:
library(dplyr)
df %>%
group_by(Group) %>%
summarise(across(Variable1:Variable2, sum), across(Variable3:Variable5, mean))
# A tibble: 2 x 6
# Group Variable1 Variable2 Variable3 Variable4 Variable5
# <int> <int> <int> <dbl> <dbl> <dbl>
#1 1 22 8 18.5 5 24
#2 2 10 16 11 7 20.5
data
df <- structure(list(Group = c(1L, 1L, 2L, 2L), Variable1 = c(10L,
12L, 6L, 4L), Variable2 = c(2L, 6L, 7L, 9L), Variable3 = c(24L,
13L, 10L, 12L), Variable4 = c(3L, 7L, 9L, 5L), Variable5 = c(26L,
22L, 23L, 18L)), class = "data.frame", row.names = c(NA, -4L))
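For the two-column case in the question, naming each summary explicitly is probably the simplest route. A minimal sketch, assuming the question's data is stored in a data frame called df_q (a name chosen here for illustration):

```r
library(dplyr)

# The question's example data (df_q is an illustrative name)
df_q <- data.frame(
  Group     = c(1L, 1L, 2L, 2L),
  Variable1 = c(10L, 12L, 6L, 4L),
  Variable2 = c(2L, 6L, 7L, 9L)
)

# Sum Variable1 and average Variable2 within each group
res_q <- df_q %>%
  group_by(Group) %>%
  summarise(Variable1 = sum(Variable1),
            Variable2 = mean(Variable2))
res_q
```

This gives 22 and 4 for group 1, and 10 and 8 for group 2, matching the desired output in the question.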
Example data with more columns:
df <- structure(list(Group = c(1L, 1L, 2L, 2L), Variable1 = c(10L,
12L, 6L, 4L), Variable2 = c(2L, 6L, 7L, 9L), Variable3 = c(9L,
8L, 10L, 2L), Variable4 = c(8L, 7L, 9L, 5L)), row.names = c(NA,
-4L), class = "data.frame")
# Group Variable1 Variable2 Variable3 Variable4
# 1: 1 10 2 9 8
# 2: 1 12 6 8 7
# 3: 2 6 7 10 9
# 4: 2 4 9 2 5
Create vectors of variable names and use mget + lapply in data.table
library(data.table)
setDT(df)
df[, c(lapply(mget(paste0('Variable', 1:2)), sum),
lapply(mget(paste0('Variable', 3:4)), mean)),
by = Group]
# Group Variable1 Variable2 Variable3 Variable4
# 1: 1 22 8 8.5 7.5
# 2: 2 10 16 6.0 7.0
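The column names can also be stored in vectors first, which may be easier to maintain when the lists are long. A sketch of the same idea, where dt4, sum_cols and mean_cols are names introduced here for illustration:

```r
library(data.table)

# Same example data as above, built directly as a data.table
dt4 <- data.table(
  Group     = c(1L, 1L, 2L, 2L),
  Variable1 = c(10L, 12L, 6L, 4L),
  Variable2 = c(2L, 6L, 7L, 9L),
  Variable3 = c(9L, 8L, 10L, 2L),
  Variable4 = c(8L, 7L, 9L, 5L)
)

# Illustrative name vectors: which columns to sum and which to average
sum_cols  <- paste0("Variable", 1:2)
mean_cols <- paste0("Variable", 3:4)

# mget() fetches the named columns within each group
res_dt <- dt4[, c(lapply(mget(sum_cols), sum),
                  lapply(mget(mean_cols), mean)),
              by = Group]
res_dt
```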
Here is a base R solution using merge + aggregate:
dfout <- merge(aggregate(Variable1~Group,df,sum),
aggregate(Variable2~Group,df,mean))
such that
> dfout
Group Variable1 Variable2
1 1 22 4
2 2 10 8
DATA
df <- structure(list(Group = c(1L, 1L, 2L, 2L), Variable1 = c(10L,
12L, 6L, 4L), Variable2 = c(2L, 6L, 7L, 9L)), class = "data.frame", row.names = c(NA,
-4L))
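The same pattern should extend to several columns per function by putting cbind on the left-hand side of the formula. A sketch using the four-variable example data from the data.table answer above (df4 and dfout4 are names chosen here):

```r
# Four-variable example data (df4 is an illustrative name)
df4 <- data.frame(
  Group     = c(1L, 1L, 2L, 2L),
  Variable1 = c(10L, 12L, 6L, 4L),
  Variable2 = c(2L, 6L, 7L, 9L),
  Variable3 = c(9L, 8L, 10L, 2L),
  Variable4 = c(8L, 7L, 9L, 5L)
)

# Sum Variable1 and Variable2, average Variable3 and Variable4,
# then join the two results on the shared Group column
dfout4 <- merge(
  aggregate(cbind(Variable1, Variable2) ~ Group, df4, sum),
  aggregate(cbind(Variable3, Variable4) ~ Group, df4, mean)
)
dfout4
```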
We can use mutate_at to apply functions to multiple columns and then select the first row in each group to get the summarised values.
library(dplyr)
df %>%
group_by(Group) %>%
mutate_at(vars(Variable1:Variable2), sum) %>%
mutate_at(vars(Variable3:Variable4), mean) %>%
slice(1L)
# Group Variable1 Variable2 Variable3 Variable4
# <int> <int> <int> <dbl> <dbl>
#1 1 22 8 8.5 7.5
#2 2 10 16 6 7
Related
I have this file:
ID P
1 10
1 12
1 11
2 9
2 8
2 10
3 11
3 12
3 14
4 15
4 16
4 8
5 11
5 13
5 10
6 14
6 16
6 11
I would like to randomly assign the values a, b, c to the rows, like this:
ID P Group
1 10 a
1 12 b
1 11 c
2 9 c
2 8 a
2 10 b
3 11 a
3 12 c
3 14 b
4 15 c
4 16 a
4 8 b
5 11 b
5 13 c
5 10 a
6 14 b
6 16 c
6 11 a
I need to do this several times, with a new random assignment each time. I tried this:
df %>% group_by(ID) %>% replicate(1,sample(df$group))
but it didn't work. Any suggestions?
Here is an option with sample
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Group = sample(c('a', 'b', 'c'), n(), replace = TRUE))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), P = c(10L, 12L, 11L, 9L, 8L,
10L, 11L, 12L, 14L, 15L, 16L, 8L, 11L, 13L, 10L, 14L, 16L, 11L
)), class = "data.frame", row.names = c(NA, -18L))
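Note that replace = TRUE allows the same letter to appear more than once within an ID, whereas the desired output in the question shows each of a, b, c exactly once per ID. If that is the intent, sampling without replacement should give a random permutation per group. A sketch, assuming every ID has exactly three rows as in the example (res_perm is an illustrative name):

```r
library(dplyr)

# Same data as above, rebuilt compactly
df1 <- data.frame(
  ID = rep(1:6, each = 3),
  P  = c(10, 12, 11, 9, 8, 10, 11, 12, 14,
         15, 16, 8, 11, 13, 10, 14, 16, 11)
)

# sample() without replacement returns a random permutation of a, b, c;
# it would error if a group had more than three rows
res_perm <- df1 %>%
  group_by(ID) %>%
  mutate(Group = sample(c("a", "b", "c"), n())) %>%
  ungroup()
res_perm
```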
Two solutions, one with grouping, the other without
library(tidyverse)
df <- dplyr::tribble(
~ID, ~P,
1,10,
1,12,
1,11,
2,9,
2,8,
2,10,
3,11,
3,12,
3,14,
4,15,
4,16,
4,8,
5,11,
5,13,
5,10,
6,14,
6,16,
6,11
)
sample_vector <- c("a","b","c")
##Without grouping id
df_2 <- df %>%
mutate(Group = sample(sample_vector, nrow(df), replace = TRUE))
##With grouping by ID
df_2 <- df %>% group_by(ID) %>%
mutate(Group = sample(sample_vector, n(), replace = TRUE))
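Since the question asks to repeat the random assignment several times, one possible approach is to wrap the grouped mutate in a small function and call it repeatedly. The function name assign_groups and the count of 5 below are illustrative assumptions, not part of the original answer:

```r
library(dplyr)

df <- data.frame(
  ID = rep(1:6, each = 3),
  P  = c(10, 12, 11, 9, 8, 10, 11, 12, 14,
         15, 16, 8, 11, 13, 10, 14, 16, 11)
)
sample_vector <- c("a", "b", "c")

# Hypothetical helper: produces one random assignment per call
assign_groups <- function(data, labels) {
  data %>%
    group_by(ID) %>%
    mutate(Group = sample(labels, n(), replace = TRUE)) %>%
    ungroup()
}

# Five independent random assignments, returned as a list of data frames
runs <- replicate(5, assign_groups(df, sample_vector), simplify = FALSE)
```

Each element of runs is a full copy of the data with its own Group column, so runs[[1]], runs[[2]] and so on can be inspected or analysed separately.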
I have a data frame like this:
Disease Genemutation Mean. Total No of pateints No.of pateints.
cancertype1 BRCA1 1 10 2
cancertype2 BRCA2 5 10 3
cancertype3 BRCA2 7 10 4
cancertype1 BRCA1 8 10 1
cancertype3 BRCA2 4 10 4
cancertype2 BRCA1 6 10 1
How do I create a new category called cancertype4 (by merging cancertype3 and cancertype2) that contains the total number of patients from the two merged categories?
We can use replace with %in% to replace those values (assuming 'Disease' is character class)
library(dplyr)
df %>%
  group_by(Disease = replace(Disease,
    Disease %in% c("cancertype2", "cancertype3"), "cancertype4")) %>%
  summarise(Total.No.of.pateints = sum(Total.No.of.pateints))
Output:
# A tibble: 2 x 2
# Disease Total.No.of.pateints
# <chr> <int>
#1 cancertype1 20
#2 cancertype4 40
Here is a base R option using aggregate
aggregate(
Total.No.of.pateints ~ Disease,
transform(
df,
Disease = replace(Disease, Disease %in% c("cancertype2", "cancertype3"), "cancertype4")
),
sum
)
giving
Disease Total.No.of.pateints
1 cancertype1 20
2 cancertype4 40
Data
> dput(df)
structure(list(Disease = c("cancertype1", "cancertype2", "cancertype3",
"cancertype1", "cancertype3", "cancertype2"), Genemutation = c("BRCA1",
"BRCA2", "BRCA2", "BRCA1", "BRCA2", "BRCA1"), Mean. = c(1L, 5L,
7L, 8L, 4L, 6L), Total.No.of.pateints = c(10L, 10L, 10L, 10L,
10L, 10L), No.of.pateints. = c(2L, 3L, 4L, 1L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
I have a simple data frame with 2 IDs (N = 2) and 2 periods (T = 2), for example:
year id points
1 1 10
1 2 12
2 1 20
2 2 18
How does one achieve the following data frame (preferably using dplyr or any tidyverse solution)?
id points_difference
1 10
2 6
Notice that the points_difference column is the difference across time for each ID (namely T2 - T1).
Additionally, how to generalize for multiple columns and multiple ID (with only 2 periods)?
year id points scores
1 1 10 7
1 ... ... ...
1 N 12 8
2 1 20 9
2 ... ... ...
2 N 12 9
id points_difference scores_difference
1 10 2
... ... ...
N 0 1
If you are on dplyr 1.0.0 or higher, summarise can return multiple rows per group, so this also works when you have more than 2 periods. You can do:
library(dplyr)
df %>%
arrange(id, year) %>%
group_by(id) %>%
summarise(across(c(points, scores), diff, .names = '{col}_difference'))
# id points_difference scores_difference
# <int> <int> <int>
#1 1 10 2
#2 1 -7 1
#3 2 6 2
#4 2 -3 3
data
df <- structure(list(year = c(1L, 1L, 2L, 2L, 3L, 3L), id = c(1L, 2L,
1L, 2L, 1L, 2L), points = c(10L, 12L, 20L, 18L, 13L, 15L), scores = c(2L,
3L, 4L, 5L, 5L, 8L)), class = "data.frame", row.names = c(NA, -6L))
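When there are exactly two periods, as in the original question, one row per id can be obtained by subtracting the first observation from the last. This also sidesteps the multi-row summarise behaviour, which is deprecated in favour of reframe in dplyr 1.1.0 and later. A sketch under the two-period assumption, using the first two years of the data above (df2p and res_diff are illustrative names):

```r
library(dplyr)

# First two years of the example data above
df2p <- data.frame(
  year   = c(1L, 1L, 2L, 2L),
  id     = c(1L, 2L, 1L, 2L),
  points = c(10L, 12L, 20L, 18L),
  scores = c(2L, 3L, 4L, 5L)
)

# After sorting by year within id, last - first is T2 - T1
res_diff <- df2p %>%
  arrange(id, year) %>%
  group_by(id) %>%
  summarise(across(c(points, scores), ~ last(.x) - first(.x),
                   .names = "{.col}_difference"))
res_diff
```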
I want to transform my data from this
Month Expenditures
1 1
1 2
2 3
2 6
3 2
3 5
to this:
Month Cumulative_expenditures
1 3
2 12
3 19
but I can't seem to figure out how to do it.
I tried using the cumsum() function, but it counts each observation - it doesn't distinguish between groups.
Any help would be much appreciated!
A two-step base R solution would be:
#Code
df1 <- aggregate(Expenditures~Month,data=mydf,sum)
#Create cum sum
df1$Expenditures <- cumsum(df1$Expenditures)
Output:
Month Expenditures
1 1 3
2 2 12
3 3 19
Some data used:
#Data
mydf <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))
Using dplyr:
library(dplyr)
df %>%
group_by(Month) %>%
summarise(Expenditures = sum(Expenditures), .groups = "drop") %>%
mutate(Expenditures = cumsum(Expenditures))
#> # A tibble: 3 x 2
#> Month Expenditures
#> <int> <int>
#> 1 1 3
#> 2 2 12
#> 3 3 19
Or in base R:
data.frame(Month = unique(df$Month),
Expenditure = cumsum(tapply(df$Expenditures, df$Month, sum)))
#> Month Expenditure
#> 1 1 3
#> 2 2 12
#> 3 3 19
Here is another base R option using subset + ave
subset(
transform(df, Expenditures = cumsum(Expenditures)),
ave(rep(FALSE, nrow(df)), Month, FUN = function(x) seq_along(x) == length(x))
)
which gives
Month Expenditures
2 1 3
4 2 12
6 3 19
We can use base R
out <- with(df1, rowsum(Expenditures, Month))
data.frame(Month = row.names(out), Expenditure = cumsum(out))
# Month Expenditure
#1 1 3
#2 2 12
#3 3 19
Or more compactly
with(df1, stack(cumsum(rowsum(Expenditures, Month)[,1])))[2:1]
data
df1 <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))
Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has a varying number of entries for columns A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there are at most 3 entries for each id, and then create another column (labelled C) that gives the order of each entry within its id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
Here is one option with dplyr that keeps the top 3 values based on A (based on the comments of @Ronak Shah).
library(dplyr)
df %>%
group_by(id) %>%
top_n(n = 3, wt = A) %>% # top 3 values based on A
mutate(C = rank(id, ties.method = "first")) # C consists of the order of each id
# A tibble: 6 x 4
# Groups: id [3]
id A B C
<int> <int> <int> <int>
1 1 20 1 1
2 1 12 1 2
3 1 13 0 3
4 2 11 1 1
5 2 21 0 2
6 3 17 0 1
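If the goal is simply the first three rows of each id in their original order, rather than the three largest values of A, slice_head with row_number may be a closer fit. A sketch assuming dplyr 1.0.0 or later, where slice_head is available (res_first3 is an illustrative name):

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
  A  = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
  B  = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
)

res_first3 <- df %>%
  group_by(id) %>%
  slice_head(n = 3) %>%          # keep at most the first 3 rows per id
  mutate(C = row_number()) %>%   # position of each row within its id
  ungroup()
res_first3
```

On this data the two approaches agree, because the dropped row (A = 8) happens to be both the fourth row and the smallest value for id 1. They would differ whenever the smallest value of A is not the last row of a group.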