I want to create a new column containing the means of specific columns. Which columns are selected should depend on the group.
group  first  second  third
0      3      2       4
0      0      NA      5
0      2      7       1
1      3      1       6
1      4      0       NA
1      2      3       3
0      5      5       0
0      6      2       2
1      NA     1       3
As an example: I want a mean column with the following conditions:
if a row contains a "0" in group, the mean should be calculated from "first" and "second"
if a row contains a "1" in group, the mean should be calculated from "first" and "third"
if a cell contains NA it should be ignored
So the final dataframe should look something like this:
group  first  second  third  mean
0      3      2       4      2.5
0      0      NA      5      0
0      2      7       1      4.5
1      3      1       6      4.5
1      4      0       NA     4
1      2      3       3      2.5
0      5      5       0      5
0      6      2       2      4
1      NA     1       3      3
Since my dataframe contains over 50 variables (and a few thousand rows), not just the ones I want the mean of, I can't select specific columns by their column or row number (like c(2,5),). I was thinking about adding a condition that tells R to calculate the mean from "first" and "second" only for rows that have a "0" in group, and then to apply the same principle for group = 1. I have no idea how to combine these conditions or how to do this in several steps.
library(tidyverse)
tribble(
~group, ~first, ~second, ~third,
0, 3, 2, 4,
0, 0, NA, 5,
0, 2, 7, 1,
1, 3, 1, 6,
1, 4, 0, NA,
1, 2, 3, 3,
0, 5, 5, 0,
0, 6, 2, 2,
1, NA, 1, 3
) |>
rowwise() |>
mutate(mean = if_else(group == 0, mean(c_across(c(first, second)), na.rm = TRUE),
mean(c_across(c(first, third)), na.rm = TRUE)))
#> # A tibble: 9 × 5
#> # Rowwise:
#> group first second third mean
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 3 2 4 2.5
#> 2 0 0 NA 5 0
#> 3 0 2 7 1 4.5
#> 4 1 3 1 6 4.5
#> 5 1 4 0 NA 4
#> 6 1 2 3 3 2.5
#> 7 0 5 5 0 5
#> 8 0 6 2 2 4
#> 9 1 NA 1 3 3
Created on 2022-06-08 by the reprex package (v2.0.1)
One way to do this would be to pivot the data to long format and use case_when() to add a weight variable of 0 (for values you want ignored) and 1 (for values you want included) according to your conditions. Use weighted.mean() to calculate your mean and pivot back to wide.
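library(tibble)  # provides rowid_to_column()
library(tidyr)
library(dplyr)
df %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, group)) %>%
  # weight 0 = ignore this value for the row's group, weight 1 = include it
  mutate(weight = case_when(group == 0 & name == "third" ~ 0,
                            group == 1 & name == "second" ~ 0,
                            TRUE ~ 1)) %>%
  group_by(rowid) %>%
  mutate(mean = weighted.mean(value, weight, na.rm = TRUE)) %>%
  # name id_cols explicitly; newer tidyr versions no longer accept it by position
  pivot_wider(id_cols = -weight) %>%
  ungroup() %>%
  relocate(mean, .after = last_col())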
# A tibble: 9 × 6
rowid group first second third mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 3 2 4 2.5
2 2 0 0 NA 5 0
3 3 0 2 7 1 4.5
4 4 1 3 1 6 4.5
5 5 1 4 0 NA 4
6 6 1 2 3 3 2.5
7 7 0 5 5 0 5
8 8 0 6 2 2 4
9 9 1 NA 1 3 3
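To see why a weight of 0 drops a value, here is a minimal base R sketch of weighted.mean() on single rows of the example data. The variable names m1, m2 and m3 are only illustrative.

```r
# Weights: 1 = include the value, 0 = ignore it.
# Row "0, 3, 2, 4" (group 0): average first and second, ignore third.
m1 <- weighted.mean(c(3, 2, 4), w = c(1, 1, 0), na.rm = TRUE)

# Row "0, 0, NA, 5" (group 0): the NA in second is dropped by na.rm = TRUE,
# so only first (0) contributes.
m2 <- weighted.mean(c(0, NA, 5), w = c(1, 1, 0), na.rm = TRUE)

# Edge case: every included value is NA, so the total weight is 0
# and the result is NaN rather than NA.
m3 <- weighted.mean(c(NA, NA, 5), w = c(1, 1, 0), na.rm = TRUE)

print(c(m1, m2, m3))
```

The NaN edge case is worth checking for in real data if some rows may have all of their selected columns missing.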
If you have many groups and many columns, then I would recommend a more programmatic approach. You can define a code list code_ls where you define which columns should be used for which group numbers. Then we can subset this with dplyr::cur_group()$group and use it in an across statement to select those columns and wrap that into rowMeans(). Note that we use all_of() inside across() to select columns based on a character vector. Since your groups are numeric and we want to subset code_ls by name we wrap cur_group()$group into as.character.
library(dplyr)
code_ls <- list(`0` = c("first", "second"),
`1` = c("first", "third"))
dat %>%
group_by(group) %>%
mutate(mean = rowMeans(across(
all_of(code_ls[[as.character(cur_group()$group)]])
), na.rm = TRUE))
#> # A tibble: 9 x 5
#> # Groups: group [2]
#> group first second third mean
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 3 2 4 2.5
#> 2 0 0 NA 5 0
#> 3 0 2 7 1 4.5
#> 4 1 3 1 6 4.5
#> 5 1 4 0 NA 4
#> 6 1 2 3 3 2.5
#> 7 0 5 5 0 5
#> 8 0 6 2 2 4
#> 9 1 NA 1 3 3
# the data
dat <- tribble(
~group, ~first, ~second, ~third,
0, 3, 2, 4,
0, 0, NA, 5,
0, 2, 7, 1,
1, 3, 1, 6,
1, 4, 0, NA,
1, 2, 3, 3,
0, 5, 5, 0,
0, 6, 2, 2,
1, NA, 1, 3
)
Created on 2022-06-08 by the reprex package (v2.0.1)
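The same lookup-list idea can be sketched in base R without any packages. This is only an illustration under the assumption that every value in group has an entry in the list; cols_by_group is a name made up for this example.

```r
# Example data from the question
dat <- data.frame(
  group  = c(0, 0, 0, 1, 1, 1, 0, 0, 1),
  first  = c(3, 0, 2, 3, 4, 2, 5, 6, NA),
  second = c(2, NA, 7, 1, 0, 3, 5, 2, 1),
  third  = c(4, 5, 1, 6, NA, 3, 0, 2, 3)
)

# Hypothetical lookup: which columns to average for each group value
cols_by_group <- list(`0` = c("first", "second"),
                      `1` = c("first", "third"))

dat$mean <- NA_real_
for (g in names(cols_by_group)) {
  rows <- as.character(dat$group) == g
  # drop = FALSE keeps a data frame even if only one row or column is selected
  dat$mean[rows] <- rowMeans(dat[rows, cols_by_group[[g]], drop = FALSE],
                             na.rm = TRUE)
}
print(dat)
```

Because columns are chosen by name, the other 50 or so columns in the real data do not matter, and adding a group only means adding one entry to the list.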
Related
I have a large matrix, in which every row corresponds to a sample, and samples belong to a population. For example, the row name s1-2 means population 1 - sample 2. I would like to calculate the mean for every population, such as in the illustration (unfortunately, I cannot create a sample):
Is this possible in R? May I kindly ask for guidance?
It's not clear why you can't create a sample. Here's one for the purposes of exposition:
set.seed(1)
dimnames <- paste(rep(paste0('s', 1:3), each = 3), rep(1:3, 3), sep = '-')
m <- matrix(sample(0:5, 81, TRUE), 9, dimnames = list(dimnames, dimnames))
m
#> s1-1 s1-2 s1-3 s2-1 s2-2 s2-3 s3-1 s3-2 s3-3
#> s1-1 0 2 4 5 5 3 4 3 3
#> s1-2 3 0 4 0 3 0 1 5 4
#> s1-3 0 4 0 3 3 5 5 2 4
#> s2-1 1 4 0 0 3 1 5 0 3
#> s2-2 4 1 5 3 1 2 5 3 5
#> s2-3 2 5 4 2 3 1 0 4 4
#> s3-1 5 5 4 5 0 5 2 0 3
#> s3-2 1 1 1 1 5 5 2 0 3
#> s3-3 2 0 1 1 0 1 5 5 0
To get the mean of each row / column group, then assuming we can identify the group by the first two characters of the row or column name (as in your example), we could do:
groups <- expand.grid(row = unique(substr(rownames(m), 1, 2)),
                      col = unique(substr(colnames(m), 1, 2)))
m2 <- matrix(unlist(Map(function(r, c) {
  # rows are matched against row names, columns against column names
  mean(m[grep(r, rownames(m)), grep(c, colnames(m))])
}, r = groups$row, c = groups$col)), 3,
dimnames = list(unique(substr(rownames(m), 1, 2)),
                unique(substr(colnames(m), 1, 2))))
Resulting in
m2
#> s1 s2 s3
#> s1 1.888889 3.000000 3.444444
#> s2 2.888889 1.777778 3.222222
#> s3 2.222222 2.555556 2.222222
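As an alternative sketch, the block means can also be computed with tapply() by labelling every cell with its row population and column population. This assumes the population is everything before the first "-" in the name, so it would also handle names such as s10-1. The small matrix and the names row_grp, col_grp and block_means are made up for illustration.

```r
# Small illustrative matrix: 2 populations with 2 samples each
nms <- c("s1-1", "s1-2", "s2-1", "s2-2")
m <- matrix(1:16, 4, dimnames = list(nms, nms))

# Population label = everything before the first "-"
row_grp <- sub("-.*$", "", rownames(m))
col_grp <- sub("-.*$", "", colnames(m))

# row(m) and col(m) give the row/column index of every cell, so each cell
# gets a (row population, column population) pair to group by
block_means <- tapply(m, list(row_grp[row(m)], col_grp[col(m)]), mean)
print(block_means)
```

The result is a matrix with one row and one column per population, laid out like m2 above.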
An example dataframe with 2 columns:
groupID <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
index_ad <- c( 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
df <- data.frame(groupID, index_ad)
I want to add another column with a sequence for each group starting at the row where index_ad = 1 and then adding sequential positive/negative numbers depending on whether the row comes before or after the row where index_ad = 1.
ep_id <- c(0, 1, 2, 3, -2, -1, 0, 1, 2, -1, 0, 1, 2)
df1 <- data.frame(groupID, index_ad, ep_id)
I've tried using row_number, but that always starts from the first row in each group.
df <- df %>% group_by(groupID) %>% mutate(ep_num = row_number()) %>% ungroup()
The real dataset has >10,000 rows and multiple other variables including date/times. The groups are arranged/sorted by date/time and the 'index_ad' variable refers to whether the case/row should be considered the index case for that group. All cases/rows before the index case have date/times that occurred before it and all cases/rows after it have date/times that occurred after it.
Please help me figure out how to add the 'ep_id' numeric sequence using R! Thank you!
You can try
library(dplyr)
df |> group_by(groupID) |> mutate(ep_id = 1:n() - which(index_ad == 1))
output
# A tibble: 13 × 3
# Groups: groupID [3]
groupID index_ad ep_id
<dbl> <dbl> <int>
1 1 1 0
2 1 0 1
3 1 0 2
4 1 0 3
5 2 0 -2
6 2 0 -1
7 2 1 0
8 2 0 1
9 2 0 2
10 3 0 -1
11 3 1 0
12 3 0 1
13 3 0 2
df %>%
group_by(groupID) %>%
mutate(row = row_number(),
ep_num = row - row[index_ad == 1]) %>%
ungroup()
# A tibble: 13 × 4
groupID index_ad row ep_num
<dbl> <dbl> <int> <int>
1 1 1 1 0
2 1 0 2 1
3 1 0 3 2
4 1 0 4 3
5 2 0 1 -2
6 2 0 2 -1
7 2 1 3 0
8 2 0 4 1
9 2 0 5 2
10 3 0 1 -1
11 3 1 2 0
12 3 0 3 1
13 3 0 4 2
Here is a way. Within each group, subtract the position of the row where index_ad equals 1 from the row number to get the result.
groupID <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
index_ad <- c( 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
df <- data.frame(groupID, index_ad)
suppressPackageStartupMessages(library(dplyr))
df %>%
group_by(groupID) %>%
mutate(ep_num = row_number(),
ep_num = ep_num - which(index_ad == 1)) %>%
ungroup()
#> # A tibble: 13 × 3
#> groupID index_ad ep_num
#> <dbl> <dbl> <int>
#> 1 1 1 0
#> 2 1 0 1
#> 3 1 0 2
#> 4 1 0 3
#> 5 2 0 -2
#> 6 2 0 -1
#> 7 2 1 0
#> 8 2 0 1
#> 9 2 0 2
#> 10 3 0 -1
#> 11 3 1 0
#> 12 3 0 1
#> 13 3 0 2
Created on 2022-08-12 by the reprex package (v2.0.1)
I have coded the mutate above in two lines to make it clearer but it can be simplified to
df %>%
group_by(groupID) %>%
mutate(ep_num = row_number() - which(index_ad == 1)) %>%
ungroup()
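For completeness, a base R sketch of the same subtraction using ave(). It assumes each group has exactly one row with index_ad == 1; the [1] picks the first such row if there are several, and a group without any index row would get NA.

```r
groupID  <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
index_ad <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
df <- data.frame(groupID, index_ad)

# Within each group: position of the row minus position of the index row
df$ep_id <- ave(df$index_ad, df$groupID,
                FUN = function(x) seq_along(x) - which(x == 1)[1])
print(df)
```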
I have a dataset in R like this one:
and I want to keep the same dataset while adding a column that gives, for each ID, the number of rows where A = B = 1.
This is the required dataset:
I tried the following R code but it doesn't give the result I want:
library(dplyr)
data1<-data%>% group_by(ID) %>%
mutate(result=case_when(A==1 & B==1 ~ sum(A),TRUE ~ 0)) %>% ungroup()
Not as neat and clean, but still:
data %>%
mutate(row_sum = apply(across(A:B), 1, sum)) %>%
group_by(ID) %>%
mutate(result = sum(row_sum == 2)) %>%
ungroup() %>%
select(-row_sum)
which gives:
# A tibble: 10 x 4
ID A B result
<dbl> <dbl> <dbl> <int>
1 1 1 0 3
2 1 1 1 3
3 1 0 1 3
4 1 0 0 3
5 1 1 1 3
6 1 1 1 3
7 2 1 0 2
8 2 1 1 2
9 2 1 1 2
10 2 0 0 2
After grouping by 'ID', multiply 'A' by 'B' (the product is 1 only when both are 1, and 0 otherwise) and then take the sum
library(dplyr)
data %>%
group_by(ID) %>%
mutate(result = sum(A*B)) %>%
ungroup
-output
# A tibble: 10 × 4
ID A B result
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 3
2 1 1 1 3
3 1 0 1 3
4 1 0 0 3
5 1 1 1 3
6 1 1 1 3
7 2 1 0 2
8 2 1 1 2
9 2 1 1 2
10 2 0 0 2
data
data <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2), A = c(1,
1, 0, 0, 1, 1, 1, 1, 1, 0), B = c(0, 1, 1, 0, 1, 1, 0, 1, 1,
0)), class = "data.frame", row.names = c(NA, -10L))
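The product trick relies on A and B being 0/1 indicators. A base R sketch that spells out the condition instead, and so would also work if the columns held other values, might look like this:

```r
data <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  A  = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 0),
  B  = c(0, 1, 1, 0, 1, 1, 0, 1, 1, 0)
)

# TRUE where both A and B equal 1; summing the 0/1 flags counts such rows per ID
both_one <- as.integer(data$A == 1 & data$B == 1)
data$result <- ave(both_one, data$ID, FUN = sum)
print(data)
```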
Dear all I have a data frame that looks like this
df <- data.frame(time=c(1,2,3,4,1,2,3,4,5), type=c("A","A","A","A","B","B","B","B","B"), count=c(10,0,0,1,8,0,1,0,1))
df
time type count
1 1 A 10
2 2 A 0
3 3 A 0
4 4 A 1
5 1 B 8
6 2 B 0
7 3 B 1
8 4 B 0
9 5 B 1
I want to examine each group of types, and if I see that a count is 0, replace every later count (forward in time) in that group with 0. I do not want the count to be resurrected from zero.
I want my data to look like this
time type count
1 1 A 10
2 2 A 0
3 3 A 0
4 4 A 0
5 1 B 8
6 2 B 0
7 3 B 0
8 4 B 0
9 5 B 0
If I understood correctly
library(tidyverse)
df <-
data.frame(
time = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
type = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
count = c(10, 0, 0, 1, 8, 0, 1, 0, 1)
)
df %>%
group_by(type) %>%
mutate(count = if_else(lag(count, default = first(count)) == 0, 0, count))
#> # A tibble: 9 x 3
#> # Groups: type [2]
#> time type count
#> <dbl> <chr> <dbl>
#> 1 1 A 10
#> 2 2 A 0
#> 3 3 A 0
#> 4 4 A 0
#> 5 1 B 8
#> 6 2 B 0
#> 7 3 B 0
#> 8 4 B 0
#> 9 5 B 0
Created on 2021-09-10 by the reprex package (v2.0.1)
You may use cummin function.
library(dplyr)
df %>% group_by(type) %>% mutate(count = cummin(count))
# time type count
# <dbl> <chr> <dbl>
#1 1 A 10
#2 2 A 0
#3 3 A 0
#4 4 A 0
#5 1 B 8
#6 2 B 0
#7 3 B 0
#8 4 B 0
#9 5 B 0
Since cummin is a base R function you may also implement it in base R -
transform(df, count = ave(count, type, FUN = cummin))
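Both answers reproduce the desired output for the example data, but they behave differently on other inputs: the lag() version only looks one row back, and cummin() also lowers counts that dip without reaching zero. If the rule is strictly "once a zero appears, everything after it is zero", a sketch with cumsum() may be closer. It assumes rows are already sorted by time within each type and that count has no NA; df2 and zero_after_first_zero are names made up for this example.

```r
# Illustrative data: type A dips (10, 5, 7) before hitting zero,
# type B has two non-zero counts in a row after its zero
df2 <- data.frame(
  time  = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
  type  = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  count = c(10, 5, 7, 0, 3, 8, 0, 1, 1)
)

# cumsum(x == 0) becomes positive at the first zero and stays positive
zero_after_first_zero <- function(x) ifelse(cumsum(x == 0) > 0, 0, x)

df2$count_fixed <- ave(df2$count, df2$type, FUN = zero_after_first_zero)
print(df2)
```

Here the 7 in type A is kept (cummin() would turn it into 5), and the last 1 in type B is zeroed (the lag() version would keep it).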
Here is a reproducible test dataset
mydata <- structure(list(subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3), time = c(0, 1, 2, 0, 1, 2, 0, 1, 2), measure = c(10, 12, 8, 7, 0, 0, 5, 3, NA)), .Names = c("subject", "time", "measure"), row.names = 1:9, class = "data.frame")
mydata
subject time measure
1 0 10
1 1 12
1 2 8
2 0 7
2 1 0
2 2 0
3 0 5
3 1 3
3 2 NA
I would like to remove all the rows where measure is NA and all the corresponding rows for the same subject. So in the example above that would yield:
subject time measure
1 0 10
1 1 12
1 2 8
2 0 7
2 1 0
2 2 0
Is there an easy way to do this without reshaping to wide format first ?
I don't think this needs reshaping or even ave. It is just a subsetting issue, if I understand your question right.
mydata[!with(mydata, subject %in% subject[is.na(measure)]), ]
# subject time measure
# 1 1 0 10
# 2 1 1 12
# 3 1 2 8
# 4 2 0 7
# 5 2 1 0
# 6 2 2 0
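The one-liner above can be unpacked into steps, which may make the logic easier to follow. The intermediate names bad_subjects, keep and clean are only illustrative.

```r
mydata <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  time    = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
  measure = c(10, 12, 8, 7, 0, 0, 5, 3, NA)
)

# 1. Subjects that have at least one missing measure
bad_subjects <- unique(mydata$subject[is.na(mydata$measure)])

# 2. Keep only rows whose subject is not in that set
keep  <- !(mydata$subject %in% bad_subjects)
clean <- mydata[keep, ]
print(clean)
```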
You could use:
mydata[with(mydata, as.logical(ave(measure, subject, FUN=function(x) ifelse(any(is.na(x)), 0, 1)))),]
# subject time measure
# 1 1 0 10
# 2 1 1 12
# 3 1 2 8
# 4 2 0 7
# 5 2 1 0
# 6 2 2 0