An example dataframe with 2 columns:
groupID <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
index_ad <- c( 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
df <- data.frame(groupID, index_ad)
I want to add another column with a sequence for each group starting at the row where index_ad = 1 and then adding sequential positive/negative numbers depending on whether the row comes before or after the row where index_ad = 1.
ep_id <- c(0, 1, 2, 3, -2, -1, 0, 1, 2, -1, 0, 1, 2)
df1 <- data.frame(groupID, index_ad, ep_id)
I've tried using row_number, but that always starts from the first row in each group.
df <- df %>% group_by(groupID) %>% mutate(ep_num = row_number()) %>% ungroup()
The real dataset has >10,000 rows and multiple other variables including date/times. The groups are arranged/sorted by date/time and the 'index_ad' variable refers to whether the case/row should be considered the index case for that group. All cases/rows before the index case have date/times that occurred before it and all cases/rows after it have date/times that occurred after it.
Please help me figure out how to add the 'ep_id' numeric sequence using R! Thank you!
You can try
library(dplyr)
df |> group_by(groupID) |> mutate(ep_id = 1:n() - which(index_ad == 1))
output
# A tibble: 13 × 3
# Groups: groupID [3]
groupID index_ad ep_id
<dbl> <dbl> <int>
1 1 1 0
2 1 0 1
3 1 0 2
4 1 0 3
5 2 0 -2
6 2 0 -1
7 2 1 0
8 2 0 1
9 2 0 2
10 3 0 -1
11 3 1 0
12 3 0 1
13 3 0 2
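The same subtraction can also be done without dplyr. This is a minimal base R sketch, assuming (as in the example data) that every group has exactly one row with index_ad equal to 1; ave() applies the function within each groupID and returns the results in the original row order:

```r
groupID  <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
index_ad <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
df <- data.frame(groupID, index_ad)

# Within each group: position of every row minus the position of the index row.
# Assumes exactly one index_ad == 1 per group.
df$ep_id <- ave(df$index_ad, df$groupID,
                FUN = function(x) seq_along(x) - which(x == 1))
df
```

Because ave() fills its result into a copy of index_ad, the new column is numeric (double) rather than integer.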
df %>%
group_by(groupID) %>%
mutate(row = row_number(),
ep_num = row - row[index_ad == 1]) %>%
ungroup()
# A tibble: 13 × 4
groupID index_ad row ep_num
<dbl> <dbl> <int> <int>
1 1 1 1 0
2 1 0 2 1
3 1 0 3 2
4 1 0 4 3
5 2 0 1 -2
6 2 0 2 -1
7 2 1 3 0
8 2 0 4 1
9 2 0 5 2
10 3 0 1 -1
11 3 1 2 0
12 3 0 3 1
13 3 0 4 2
Here is a way. Subtract the position of the row where index_ad equals 1 from the row number to get the result.
groupID <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
index_ad <- c( 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
df <- data.frame(groupID, index_ad)
suppressPackageStartupMessages(library(dplyr))
df %>%
group_by(groupID) %>%
mutate(ep_num = row_number(),
ep_num = ep_num - which(index_ad == 1)) %>%
ungroup()
#> # A tibble: 13 × 3
#> groupID index_ad ep_num
#> <dbl> <dbl> <int>
#> 1 1 1 0
#> 2 1 0 1
#> 3 1 0 2
#> 4 1 0 3
#> 5 2 0 -2
#> 6 2 0 -1
#> 7 2 1 0
#> 8 2 0 1
#> 9 2 0 2
#> 10 3 0 -1
#> 11 3 1 0
#> 12 3 0 1
#> 13 3 0 2
Created on 2022-08-12 by the reprex package (v2.0.1)
I have coded the mutate above in two lines to make it clearer but it can be simplified to
df %>%
group_by(groupID) %>%
mutate(ep_num = row_number() - which(index_ad == 1)) %>%
ungroup()
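All of the versions above rely on each group having exactly one row with index_ad equal to 1. If a group has no index row, which() returns a zero-length vector and mutate() would fail. A possible variation, sketched here on hypothetical data where group 2 has no index row, swaps which() for match(), which returns NA when nothing matches (and the first position when there are several matches):

```r
library(dplyr)

# Hypothetical data: group 2 deliberately has no index row.
df2 <- data.frame(
  groupID  = c(1, 1, 1, 2, 2, 2),
  index_ad = c(0, 1, 0, 0, 0, 0)
)

res <- df2 %>%
  group_by(groupID) %>%
  # match(1, index_ad) is the position of the first 1, or NA if there is none
  mutate(ep_num = row_number() - match(1, index_ad)) %>%
  ungroup()
res
```

Group 1 gets -1, 0, 1 and every row of group 2 gets NA, so groups without an index case are easy to spot afterwards.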
I want to create a new column containing means of specific columns. The selected columns should depend on the group.
group  first  second  third
    0      3       2      4
    0      0      NA      5
    0      2       7      1
    1      3       1      6
    1      4       0     NA
    1      2       3      3
    0      5       5      0
    0      6       2      2
    1     NA       1      3
As an example: I want a mean column with the following conditions:
if a row contains a "0" in group, the mean should be calculated from "first" and "second"
if a row contains a "1" in group, the mean should be calculated from "first" and "third"
if a cell contains NA it should be ignored
So the final dataframe should look something like this:
group  first  second  third  mean
    0      3       2      4   2.5
    0      0      NA      5     0
    0      2       7      1   4.5
    1      3       1      6   4.5
    1      4       0     NA     4
    1      2       3      3   2.5
    0      5       5      0     5
    0      6       2      2     4
    1     NA       1      3     3
Since my dataframe contains over 50 variables (and a few thousand rows), not just the ones I want the mean from, I can't select specific columns by their column or row number (like c(2,5),). I was thinking about adding a condition that tells R to calculate the mean from "first" and "second" only for rows that have a "0" in group, and then apply the same principle for group = 1. I have no idea how to combine these conditions or how to do this in several steps.
library(tidyverse)
tribble(
~group, ~first, ~second, ~third,
0, 3, 2, 4,
0, 0, NA, 5,
0, 2, 7, 1,
1, 3, 1, 6,
1, 4, 0, NA,
1, 2, 3, 3,
0, 5, 5, 0,
0, 6, 2, 2,
1, NA, 1, 3
) |>
rowwise() |>
mutate(mean = if_else(group == 0, mean(c_across(c(first, second)), na.rm = TRUE),
mean(c_across(c(first, third)), na.rm = TRUE)))
#> # A tibble: 9 × 5
#> # Rowwise:
#> group first second third mean
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 3 2 4 2.5
#> 2 0 0 NA 5 0
#> 3 0 2 7 1 4.5
#> 4 1 3 1 6 4.5
#> 5 1 4 0 NA 4
#> 6 1 2 3 3 2.5
#> 7 0 5 5 0 5
#> 8 0 6 2 2 4
#> 9 1 NA 1 3 3
Created on 2022-06-08 by the reprex package (v2.0.1)
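rowwise() evaluates the expression once per row, which may be slow on larger data. A vectorised sketch of the same idea, assuming only the two groups from the question, computes both candidate means with rowMeans() and lets if_else() pick one per row:

```r
library(dplyr)

dat <- data.frame(
  group  = c(0, 0, 0, 1, 1, 1, 0, 0, 1),
  first  = c(3, 0, 2, 3, 4, 2, 5, 6, NA),
  second = c(2, NA, 7, 1, 0, 3, 5, 2, 1),
  third  = c(4, 5, 1, 6, NA, 3, 0, 2, 3)
)

res <- dat %>%
  mutate(mean = if_else(
    group == 0,
    rowMeans(cbind(first, second), na.rm = TRUE),  # used for group 0
    rowMeans(cbind(first, third),  na.rm = TRUE)   # used for group 1
  ))
res
```

Note that a row where both selected columns are NA would get NaN rather than NA.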
One way to do this would be to pivot the data to long format and use case_when() to add a weight variable of 0 (for values you want ignored) and 1 (for values you want included) according to your conditions. Use weighted.mean() to calculate your mean and pivot back to wide.
library(tibble)  # provides rowid_to_column()
library(tidyr)
library(dplyr)
# df is the data frame from the question
df %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, group)) %>%
  mutate(weight = case_when(group == 0 & name == "third" ~ 0,
                            group == 1 & name == "second" ~ 0,
                            TRUE ~ 1)) %>%
  group_by(rowid) %>%
  mutate(mean = weighted.mean(value, weight, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-weight) %>%  # drop the helper column before widening
  pivot_wider() %>%
  relocate(mean, .after = last_col())
# A tibble: 9 × 6
rowid group first second third mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 3 2 4 2.5
2 2 0 0 NA 5 0
3 3 0 2 7 1 4.5
4 4 1 3 1 6 4.5
5 5 1 4 0 NA 4
6 6 1 2 3 3 2.5
7 7 0 5 5 0 5
8 8 0 6 2 2 4
9 9 1 NA 1 3 3
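To see why a weight of 0 works as "ignore this value", here is a small sketch using two rows from the example data (the variable names are only for illustration):

```r
# Row 1 (group 0): first = 3, second = 2, third = 4; "third" gets weight 0
m1 <- weighted.mean(c(3, 2, 4), c(1, 1, 0))

# Row 2 (group 0): second is NA; na.rm = TRUE drops it together with its weight
m2 <- weighted.mean(c(0, NA, 5), c(1, 1, 0), na.rm = TRUE)

m1  # 2.5
m2  # 0
```

A value with weight 0 contributes nothing to either the numerator or the denominator, so the result is the plain mean of the values with weight 1.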
If you have many groups and many columns, then I would recommend a more programmatic approach. You can define a code list code_ls where you define which columns should be used for which group numbers. Then we can subset this with dplyr::cur_group()$group and use it in an across statement to select those columns and wrap that into rowMeans(). Note that we use all_of() inside across() to select columns based on a character vector. Since your groups are numeric and we want to subset code_ls by name we wrap cur_group()$group into as.character.
library(dplyr)
code_ls <- list(`0` = c("first", "second"),
`1` = c("first", "third"))
dat %>%
group_by(group) %>%
mutate(mean = rowMeans(across(
all_of(code_ls[[as.character(cur_group()$group)]])
), na.rm = TRUE))
#> # A tibble: 9 x 5
#> # Groups: group [2]
#> group first second third mean
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 3 2 4 2.5
#> 2 0 0 NA 5 0
#> 3 0 2 7 1 4.5
#> 4 1 3 1 6 4.5
#> 5 1 4 0 NA 4
#> 6 1 2 3 3 2.5
#> 7 0 5 5 0 5
#> 8 0 6 2 2 4
#> 9 1 NA 1 3 3
# the data
dat <- tribble(
~group, ~first, ~second, ~third,
0, 3, 2, 4,
0, 0, NA, 5,
0, 2, 7, 1,
1, 3, 1, 6,
1, 4, 0, NA,
1, 2, 3, 3,
0, 5, 5, 0,
0, 6, 2, 2,
1, NA, 1, 3
)
Created on 2022-06-08 by the reprex package (v2.0.1)
Say I have a df.
df = data.frame(status = c(1, 0, 0, 0, 1, 0, 0, 0),
stratum = c(1,1,1,1, 2,2,2,2),
death = 1:8)
> df
status stratum death
1 1 1 1
2 0 1 2
3 0 1 3
4 0 1 4
5 1 2 5
6 0 2 6
7 0 2 7
8 0 2 8
I want to add a new variable named weights with mutate. It should meet the following conditions:
weights should be computed within each stratum group.
the weights value should be the death value of the row where status is 1.
The expected result looks like this:
df_wanted = data.frame(status = c(1, 0, 0, 0, 1, 0, 0, 0),
stratum = c(1,1,1,1, 2,2,2,2),
death = 1:8,
weights = c(1,1,1,1, 5,5,5,5))
> df_wanted
status stratum death weights
1 1 1 1 1
2 0 1 2 1
3 0 1 3 1
4 0 1 4 1
5 1 2 5 5
6 0 2 6 5
7 0 2 7 5
8 0 2 8 5
I do not know how to write the code.
Any help will be highly appreciated!
You may get the death value where status = 1.
library(dplyr)
df %>%
group_by(stratum) %>%
mutate(weights = death[status == 1]) %>%
ungroup
The above works because there is exactly one row in each group where status = 1. If a group may contain zero or more than one such row, then a better option is to use match, which returns NA when there is no match and the first death value when there is more than one.
df %>%
group_by(stratum) %>%
mutate(weights = death[match(1, status)]) %>%
ungroup
# status stratum death weights
# <dbl> <dbl> <int> <int>
#1 1 1 1 1
#2 0 1 2 1
#3 0 1 3 1
#4 0 1 4 1
#5 1 2 5 5
#6 0 2 6 5
#7 0 2 7 5
#8 0 2 8 5
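The difference between the two subsetting styles can be seen on plain vectors. This sketch uses made-up status vectors for a single group with no match and with two matches:

```r
death       <- 1:4
status_none <- c(0, 0, 0, 0)  # hypothetical group with no status == 1
status_two  <- c(0, 1, 1, 0)  # hypothetical group with two status == 1

none_logical <- death[status_none == 1]         # zero-length result
none_match   <- death[match(1, status_none)]    # NA
two_logical  <- death[status_two == 1]          # two values: 2 and 3
two_match    <- death[match(1, status_two)]     # first match only: 2
```

Inside mutate(), the zero-length and length-two results cannot be recycled to the group size, whereas the length-one results from match() always can.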
I have a dataset in R like this one:
and I want to keep the same dataset while adding a column that gives, for each ID, the number of rows where A = B = 1.
This is the required dataset:
I tried the following R code but it doesn't give the result I want:
library(dplyr)
data1<-data%>% group_by(ID) %>%
mutate(result=case_when(A==1 & B==1 ~ sum(A),TRUE ~ 0)) %>% ungroup()
Not as neat and clean, but still:
data %>%
mutate(row_sum = apply(across(A:B), 1, sum)) %>%
group_by(ID) %>%
mutate(result = sum(row_sum == 2)) %>%
ungroup() %>%
select(-row_sum)
which gives:
# A tibble: 10 x 4
ID A B result
<dbl> <dbl> <dbl> <int>
1 1 1 0 3
2 1 1 1 3
3 1 0 1 3
4 1 0 0 3
5 1 1 1 3
6 1 1 1 3
7 2 1 0 2
8 2 1 1 2
9 2 1 1 2
10 2 0 0 2
After grouping by 'ID', multiply 'A' by 'B' (the product is 0 wherever either value is 0) and then take the sum
library(dplyr)
data %>%
group_by(ID) %>%
mutate(result = sum(A*B)) %>%
ungroup
-output
# A tibble: 10 × 4
ID A B result
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 3
2 1 1 1 3
3 1 0 1 3
4 1 0 0 3
5 1 1 1 3
6 1 1 1 3
7 2 1 0 2
8 2 1 1 2
9 2 1 1 2
10 2 0 0 2
data
data <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2), A = c(1,
1, 0, 0, 1, 1, 1, 1, 1, 0), B = c(0, 1, 1, 0, 1, 1, 0, 1, 1,
0)), class = "data.frame", row.names = c(NA, -10L))
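sum(A * B) relies on A and B being coded strictly as 0/1. A variation that states the condition explicitly, sketched here under the assumption that "A = B = 1" is the only condition of interest, sums a logical vector instead:

```r
library(dplyr)

data <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  A  = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 0),
  B  = c(0, 1, 1, 0, 1, 1, 0, 1, 1, 0)
)

res <- data %>%
  group_by(ID) %>%
  # TRUE counts as 1, so the sum is the number of rows meeting the condition
  mutate(result = sum(A == 1 & B == 1)) %>%
  ungroup()
res
```

This gives 3 for every row of ID 1 and 2 for every row of ID 2, and it would keep working if A or B took values other than 0 and 1.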
I have a dataframe that looks like this:
a b c d e
1 0 0 1 1
.5 1 1 0 1
1 1. .5 .5. 0
0 0 1 NA 1
0 1 0 1 .5
I am looking for an output like:
col val count
a 1 2
.5 1
0 2
b 1 3
0 2
c 1 2
.5 1
0 2
d 1 2
.5 1
0 1
NA 1
e 1 3
.5 1
0 1
I have tried using
data %>%
summarize_at(colnames(data)), n(), na.rm = TRUE)
but this doesn't give me what I want. Any suggestions greatly appreciated, thank you!
I've assumed column d row 3 is a typo and .5. really is 0.5, in which case you could do the following:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(everything()) %>%
group_by(name, value) %>%
summarise(count = n()) %>%
arrange(name, desc(value))
# or more succinctly, as pointed out by @LMc
df %>%
pivot_longer(everything()) %>%
count(name, value, name = "count") %>%
arrange(name, desc(value))
#> # A tibble: 15 x 3
#> name value count
#> <chr> <dbl> <int>
#> 1 a 1 2
#> 2 a 0.5 1
#> 3 a 0 2
#> 4 b 1 3
#> 5 b 0 2
#> 6 c 1 2
#> 7 c 0.5 1
#> 8 c 0 2
#> 9 d 1 2
#> 10 d 0.5 1
#> 11 d 0 1
#> 12 d NA 1
#> 13 e 1 3
#> 14 e 0.5 1
#> 15 e 0 1
data
df <- structure(list(a = c(1, 0.5, 1, 0, 0), b = c(0, 1, 1, 0, 1),
c = c(0, 1, 0.5, 1, 0), d = c(1, 0, 0.5, NA, 1),
e = c(1, 1, 0, 1, 0.5)), class = "data.frame", row.names = c(NA,
-5L))
Created on 2021-04-13 by the reprex package (v2.0.0)
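For comparison, roughly the same counts can be produced in base R. This is only a sketch: it stacks the columns into long form with stack(), cross-tabulates column name against value with table(), and drops the combinations that never occur. The column names col, val and Freq are choices made here, not part of the question.

```r
df <- data.frame(
  a = c(1, 0.5, 1, 0, 0),
  b = c(0, 1, 1, 0, 1),
  c = c(0, 1, 0.5, 1, 0),
  d = c(1, 0, 0.5, NA, 1),
  e = c(1, 1, 0, 1, 0.5)
)

long   <- stack(df)  # two columns: values and ind (the source column name)
counts <- as.data.frame(
  table(col = long$ind, val = long$values, useNA = "ifany")
)
counts <- counts[counts$Freq > 0, ]  # keep only combinations that occur
counts[order(counts$col), ]
```

Unlike the tidyverse version, val comes back as a factor, so it would need converting before any numeric sorting.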
Given a table with defined groups, where each group has exactly one reference row (query), I'd like to change all values of a column based on the value of the reference.
These values are just 1 or -1.
The idea is:
- if the reference is equal to 1, keep all values as they are
- but if the reference is -1, all values should be multiplied by -1, so that the reference becomes 1 and the items with value 1 become -1
- also, modified groups should have their row order reversed
I'm trying to do this way:
library(tidyverse)
item <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l")
grou <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
quer <- c(0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0)
dir <- c(1, 1, 1, -1, 1, 1, 1, -1, 1, -1, -1, -1)
ds <- tibble(item = item,
group = grou,
query = quer,
direction = dir)
ds %>%
group_by(group) %>%
mutate(
direction = ifelse(
direction[query == 1] == 1, direction, (-1 * direction)
)
)
So this
# A tibble: 12 x 4
# Groups: group [4]
item group query direction
<chr> <dbl> <dbl> <dbl>
1 a 1 0 1
2 b 1 1 1
3 c 1 0 1
4 d 2 0 -1
5 e 2 1 1
6 f 2 0 1
7 g 3 0 1
8 h 3 1 -1
9 i 3 0 1
10 j 4 0 -1
11 k 4 1 -1
12 l 4 0 -1
Should become this
# A tibble: 12 x 4
# Groups: group [4]
item group query direction
<chr> <dbl> <dbl> <dbl>
1 a 1 0 1
2 b 1 1 1
3 c 1 0 1
4 d 2 0 -1
5 e 2 1 1
6 f 2 0 1
7 i 3 0 -1
8 h 3 1 1
9 g 3 0 -1
10 l 4 0 1
11 k 4 1 1
12 j 4 0 1
But it is not working.
Thanks in advance
Here is a way to do it:
ds %>%
rowid_to_column("id") %>%
group_by(group) %>%
mutate(tmp = max(query * direction) - 0.5,
direction = tmp * 2 * direction) %>%
arrange(id * tmp, .by_group = TRUE) %>%
select(-c(id, tmp))
The result:
# A tibble: 12 x 4
# Groups: group [4]
item group query direction
<chr> <dbl> <dbl> <dbl>
1 a 1 0 1
2 b 1 1 1
3 c 1 0 1
4 d 2 0 -1
5 e 2 1 1
6 f 2 0 1
7 i 3 0 -1
8 h 3 1 1
9 g 3 0 -1
10 l 4 0 1
11 k 4 1 1
12 j 4 0 1
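The arithmetic with 0.5 and 2 is compact but a little cryptic. A more explicit sketch of the same idea, assuming exactly one query row per group, stores the reference value (1 or -1) in a helper column and uses it both as the multiplier and as the sign of the sort key. The helper names ref and ord are made up for this example. As a side note, the ifelse() call in the question has a length-one test, so it returns a single value that is then recycled over the whole group.

```r
library(dplyr)
library(tibble)

ds <- tibble(
  item      = letters[1:12],
  group     = rep(1:4, each = 3),
  query     = rep(c(0, 1, 0), times = 4),
  direction = c(1, 1, 1, -1, 1, 1, 1, -1, 1, -1, -1, -1)
)

res <- ds %>%
  group_by(group) %>%
  mutate(ref       = direction[query == 1],  # reference value: 1 or -1
         direction = direction * ref,        # flips signs only when ref is -1
         ord       = row_number() * ref) %>% # negative keys reverse the order
  arrange(ord, .by_group = TRUE) %>%
  ungroup() %>%
  select(-ref, -ord)
res
```

Groups 1 and 2 are left untouched, while groups 3 and 4 have their signs flipped and their rows reversed, matching the expected output in the question.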