Replicate a row (specific columns) based on a vector - r

Is there a straightforward way in dplyr to expand a dataframe by replicating a row based on a specific vector?
For example, I have the following dataframe:
df <- tibble(Year=c(2019),
cat1=c("A","B"),
cat2=c("X","Y"),
value1=c(1,2),
value2=c(10,20))
selected_years <- c(2019:2021)
where I would like to replicate the row where cat1=="A" for the years 2019-2021. The values of some columns (value1, cat) should be taken from the original year 2019, some other columns (value2) filled with NAs.
The final output should look like:
Year cat value1 value2
2019 A 1 10
2020 A 1 NA
2021 A 1 NA
2019 B 2 20
I tried with bind_rows(). However, the result is not quite what I wanted (I only get the "A" part, not the "B" part), and I am not sure whether this is really the most intuitive/dplyr way to go, or whether another approach (or even a specific function) would be more reasonable:
df%>%
filter(cat1=="A",Year==2019)%>%
bind_rows(
data.frame(
Year=setdiff(selected_years,.$Year),
cat1=.$cat1,
value1=.$value1
)
)
Edit:
I also tried using expand and right_join, but then my desired column values are not repeated:
df %>%
dplyr::right_join(df %>%
filter(cat1=="A",Year==2019)%>%
expand(Year=c(2019:2021)))
Maybe an approach involving case_when?

The part where you want to keep only specific values and not others makes this tricky. It is easy to expand the vector on all values using rowwise and unnest together with the condition in if_else. In the last step we just reset to NA the values we don't want to replicate. If you have more than one column that you want to set to NA, we can use across.
library(tidyverse)
df <- tibble(Year=c(2019),
cat1=c("A","B"),
cat2=c("X","Y"),
value1=c(1,2),
value2=c(10,20))
selected_years <- c(2019:2021)
df %>%
rowwise %>%
mutate(Year = if_else(cat1 == "A", list(selected_years), list(Year))) %>%
unnest(Year) %>%
mutate(value2 = if_else(Year != 2019, NA_real_, value2))
#> # A tibble: 4 x 5
#> Year cat1 cat2 value1 value2
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2019 A X 1 10
#> 2 2020 A X 1 NA
#> 3 2021 A X 1 NA
#> 4 2019 B Y 2 20
Created on 2021-12-08 by the reprex package (v2.0.1)
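The across() variant mentioned above is not shown, so here is a minimal sketch of how it might look. The extra column value3 is hypothetical (it is not in the original data) and is only there so that there are two columns to blank out.

```r
library(dplyr)
library(tidyr)

# Same data as above, plus a hypothetical extra column value3
df <- tibble(Year = 2019,
             cat1 = c("A", "B"),
             cat2 = c("X", "Y"),
             value1 = c(1, 2),
             value2 = c(10, 20),
             value3 = c(100, 200))
selected_years <- 2019:2021

res <- df %>%
  rowwise() %>%
  mutate(Year = if_else(cat1 == "A", list(selected_years), list(Year))) %>%
  ungroup() %>%
  unnest(Year) %>%
  # blank out every listed column in the replicated (non-2019) rows
  mutate(across(c(value2, value3), ~ if_else(Year != 2019, NA_real_, .x)))

res
```

Any tidyselect expression should work in place of c(value2, value3), for example starts_with("value") minus the columns you want to keep.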
Or we could create a df2 and full_join it with df:
library(dplyr)
df2 <- tibble(Year = selected_years,
cat1 = "A",
cat2 = "X",
value1 = 1)
df %>%
full_join(df2, by = c("Year", "cat1", "cat2", "value1"))
#> # A tibble: 4 x 5
#> Year cat1 cat2 value1 value2
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2019 A X 1 10
#> 2 2019 B Y 2 20
#> 3 2020 A X 1 NA
#> 4 2021 A X 1 NA
Created on 2021-12-08 by the reprex package (v2.0.1)
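In the full_join version, df2 is typed out by hand. If you would rather derive it from df, something along these lines might work. This is only a sketch: it assumes that the columns to carry over are everything except Year and value2, and it uses tidyr::crossing to pair the "A" row with each selected year.

```r
library(dplyr)
library(tidyr)

df <- tibble(Year = 2019,
             cat1 = c("A", "B"),
             cat2 = c("X", "Y"),
             value1 = c(1, 2),
             value2 = c(10, 20))
selected_years <- 2019:2021

# Build the replicated rows from df itself instead of hard-coding them
df2 <- df %>%
  filter(cat1 == "A") %>%
  select(-Year, -value2) %>%
  crossing(Year = selected_years)

res <- full_join(df, df2, by = c("Year", "cat1", "cat2", "value1"))
res
```

Columns that are missing from df2 (here value2) are filled with NA by the join, which is what produces the NAs for 2020 and 2021.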

library(tidyverse)
tibble(selected_years) %>%
mutate(cat1 = "A") %>%
full_join(df, by = "cat1") %>%
mutate(selected_years = ifelse(is.na(selected_years), Year, selected_years)) %>%
group_by(cat1) %>%
mutate(value2 = ifelse(row_number() != 1, NA, value2)) %>%
ungroup() %>%
select(Year = selected_years, cat = cat1, value1, value2)
Year cat value1 value2
<dbl> <chr> <dbl> <dbl>
1 2019 A 1 10
2 2020 A 1 NA
3 2021 A 1 NA
4 2019 B 2 20

A solution based on dplyr::bind_rows:
library(tidyverse)
df <- tibble(Year=c(2019),
cat1=c("A","B"),
cat2=c("X","Y"),
value1=c(1,2),
value2=c(10,20))
selected_years <- c(2020:2021)
df %>%
bind_rows(data.frame(
Year=selected_years, filter(., cat1 == "A") %>% select(-Year, -value2))) %>%
arrange(cat1)
#> # A tibble: 4 × 5
#> Year cat1 cat2 value1 value2
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2019 A X 1 10
#> 2 2020 A X 1 NA
#> 3 2021 A X 1 NA
#> 4 2019 B Y 2 20

Related

R dplyr::c_across() strange behaviour in rowSums

I'm trying to see how to apply rowSums() to specific columns only.
here is a reprex:
df <- tibble(
"ride" = c("bicycle", "motorcycle", "car", "other"),
"A" = c(1, NA, 1, NA),
"B" = c(NA, 2, NA, 2)
)
I can get the desired result, by index[2:3]
df %>%
mutate(total = rowSums(.[2:3], na.rm = TRUE))
# A tibble: 4 × 4
ride A B total
<chr> <dbl> <dbl> <dbl>
1 bicycle 1 NA 1
2 motorcycle NA 2 2
3 car 1 NA 1
4 other NA 2 2
however, if I try specifying columns by name, strange results occur
df %>%
mutate(total = sum(c_across(c("A":"B")), na.rm = TRUE))
# A tibble: 4 × 4
ride A B total
<chr> <dbl> <dbl> <dbl>
1 bicycle 1 NA 6
2 motorcycle NA 2 6
3 car 1 NA 6
4 other NA 2 6
What am I doing wrong?
I can achieve what I want, by something like this:
df %>%
mutate_all(~replace(., is.na(.), 0)) %>%
mutate(total = A + B)
but I'd like to specify column names by passing a vector, so I can change to different combination of column names in future.
Something like this is what I'd like to achieve:
cols_to_sum <- c("A","B")
df %>%
mutate(total = sum(across(cols_to_sum), na.rm = TRUE))
You may use select to specify the columns you want to sum.
library(dplyr)
cols_to_sum <- c("A","B")
df %>%
mutate(total = rowSums(select(., all_of(cols_to_sum)), na.rm = TRUE))
# ride A B total
# <chr> <dbl> <dbl> <dbl>
#1 bicycle 1 NA 1
#2 motorcycle NA 2 2
#3 car 1 NA 1
#4 other NA 2 2
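c_across is designed to work with rowwise. Without it, sum() runs over the whole A and B columns at once (1 + 1 + 2 + 2 = 6), which is why every row showed 6 -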
df %>%
rowwise() %>%
mutate(total = sum(c_across(all_of(cols_to_sum)), na.rm = TRUE)) %>%
ungroup
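As a further option, rowSums can be combined with across inside mutate, which avoids both the dot placeholder and rowwise. This is a sketch using the same data; the first line only illustrates where the 6 in the question comes from.

```r
library(dplyr)

df <- tibble(
  ride = c("bicycle", "motorcycle", "car", "other"),
  A = c(1, NA, 1, NA),
  B = c(NA, 2, NA, 2)
)
cols_to_sum <- c("A", "B")

# Summing whole columns gives one grand total, not one value per row
grand_total <- sum(df$A, df$B, na.rm = TRUE)  # 6

# across() returns a data frame of the selected columns, which rowSums() accepts
res <- df %>%
  mutate(total = rowSums(across(all_of(cols_to_sum)), na.rm = TRUE))
res
```

On large data this is usually faster than rowwise, because rowSums is vectorised over all rows.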

Use a single, common group-specific baseline for calculations (cumsum) within sub-groups

I'm looking for a tidy solution preferably using tidyverse
This question is in line with this answer, it does however have an added twist. My data has an overall grouping variable 'grp'. Within each such group, I want to perform calculations based on cumulative sum (cumsum) within sub-groups defined by 'trial', here X and Y.
However, for the calculations within both sub-groups, trial "X" and trial "Y", I need to use a single, common group-specific baseline, i.e. where trial is B.
My desired outcome is Value3 in the data set desired_outcome below:
# library(tidyverse)
# library(dplyr)
desired_outcome # see below for how I got `desired_outcome`
# A tibble: 10 x 6
# Groups: grp [2]
grp trial yr value1 value2 Value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 7
5 A Y 2023 6 4 16
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 5
10 B Y 2023 4 6 14
My minimal working example. Data first,
tabl <- tribble(~grp, ~trial, ~yr, ~value1, ~value2,
'A', "B", 2021, 2, 0,
'A', "X", 2022, 3, 1,
'A', "X", 2023, 4, 2,
'A', "Y", 2022, 5, 3,
'A', "Y", 2023, 6, 4,
'B', "B", 2021, 0, 2,
'B', "X", 2022, 1, 3,
'B', "X", 2023, 2, 4,
'B', "Y", 2022, 3, 5,
'B', "Y", 2023, 4, 6) %>%
mutate(trial = factor(trial, levels = c("B", "X", "Y"))) %>%
arrange(grp, trial, yr)
Now, I need to use group_by(), but I can't group on trial because I need to use the baseline, "B", in the calculations for both "X" and "Y".
undesired_outcome_tidier_code <- tabl %>%
group_by(grp) %>% # this does not work!
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1) %>%
select(-Value1.1, -Value2.1)
In undesired_outcome_tidier_code, rows 4-5 and 9-10 are, for obvious reasons, not using rows 1 and 6, respectively, as the baseline. As shown here,
undesired_outcome_tidier_code
# A tibble: 10 x 6
# Groups: grp [2]
grp trial yr value1 value2 Value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 17
5 A Y 2023 6 4 26
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 15
10 B Y 2023 4 6 24
I am looking for a solution that gets me desired_outcome (see below) in a tidy way.
I can, in this smaller example, work my way around it, to get to my desired_outcome, but it's a cumbersome two step solution. There must be a better/tidier way.
step1 <- tabl %>% arrange(grp, trial, yr) %>% filter(trial != 'Y') %>%
group_by(grp) %>%
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1)
step2 <- tabl %>% arrange(grp, trial, yr) %>% filter(trial != 'X') %>%
group_by(grp) %>%
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1)
desired_outcome <- rbind(step1,
step2 %>% filter(trial != 'B')
) %>% select(-Value1.1, -Value2.1) %>% arrange(grp, trial, yr)
With the addition of purrr, you could do:
map(.x = c("X", "Y"),
~ tabl %>%
arrange(grp, trial, yr) %>%
filter(trial != .x) %>%
group_by(grp) %>%
mutate(value3 = cumsum(value1) + lag(cumsum(value2), default = 0))) %>%
reduce(full_join) %>%
arrange(grp, trial, yr)
grp trial yr value1 value2 value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 7
5 A Y 2023 6 4 16
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 5
10 B Y 2023 4 6 14
You can try with this.
calculate_value3 is a function that calculates value3 as you described. It does so for every letter of trial and always includes the baseline observation. It doesn't matter if the letters are different from X and Y. Note that the baseline can be any letter you want; I set it to "B" for now.
Inside the pipe, you go for a map-reduce solution: map runs calculate_value3 for each unique trial, and reduce combines the results with coalesce, which fills each NA with the first non-missing value found across the results. This is why v3 is initialized as a vector of NAs in calculate_value3.
calculate_value3 <- function(ut, # trial under examination
tr, # trial vector
v1, # value1 vector
v2, # value2 vector
baseline = "B"){ # baseline id
v3 <- rep_len(NA, length(tr))
ind <- ut == tr | baseline == tr
cumv1 <- cumsum(v1[ind])
cumlv2 <- cumsum(lag(v2[ind], default = 0))
v3[ind] <- cumv1 + cumlv2
v3
}
library(dplyr) # group_by(), mutate(), lag(), coalesce()
library(purrr) # map(), reduce()
tabl %>%
group_by(grp) %>%
mutate(value3 = reduce(
map(unique(trial), calculate_value3,
tr = trial, v1 = value1, v2 = value2),
coalesce)) %>%
ungroup()
#> # A tibble: 10 x 6
#> grp trial yr value1 value2 value3
#> <chr> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A B 2021 2 0 2
#> 2 A X 2022 3 1 5
#> 3 A X 2023 4 2 10
#> 4 A Y 2022 5 3 7
#> 5 A Y 2023 6 4 16
#> 6 B B 2021 0 2 0
#> 7 B X 2022 1 3 3
#> 8 B X 2023 2 4 8
#> 9 B Y 2022 3 5 5
#> 10 B Y 2023 4 6 14
The solution is flexible to the identifiers of the trials and seems reasonably easy to debug and to edit if need be [at least to me].
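If there is exactly one baseline row per grp, the same numbers can also be obtained without looping over trials. For a non-baseline row, the cumulative sums over "baseline + own trial" split into the baseline's contribution plus the cumulative sums within the trial. The sketch below relies on that assumption (one "B" row per grp, rows already ordered by yr within each trial); base1 and base2 are helper columns introduced only for this illustration.

```r
library(dplyr)
library(tibble)

tabl <- tribble(~grp, ~trial, ~yr, ~value1, ~value2,
                "A", "B", 2021, 2, 0,
                "A", "X", 2022, 3, 1,
                "A", "X", 2023, 4, 2,
                "A", "Y", 2022, 5, 3,
                "A", "Y", 2023, 6, 4,
                "B", "B", 2021, 0, 2,
                "B", "X", 2022, 1, 3,
                "B", "X", 2023, 2, 4,
                "B", "Y", 2022, 3, 5,
                "B", "Y", 2023, 4, 6)

res <- tabl %>%
  group_by(grp) %>%
  # baseline values, recycled to every row of the group
  mutate(base1 = value1[trial == "B"],
         base2 = value2[trial == "B"]) %>%
  group_by(grp, trial) %>%
  mutate(value3 = if_else(
    trial == "B",
    value1,  # baseline row: cumsum(value1) + 0
    base1 + base2 + cumsum(value1) + lag(cumsum(value2), default = 0)
  )) %>%
  ungroup() %>%
  select(-base1, -base2)
res
```

For example, for grp A and trial Y this gives 2 + 0 + 5 + 0 = 7 and 2 + 0 + 11 + 3 = 16, matching Value3 in the desired output.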
Because tidyverse didn't seem like a strict requirement, I take the opportunity to suggest a data.table alternative:
Starting with the 'desired_outcome' data, just to make it easier to compare results:
library(data.table)
setDT(desired_outcome)
desired_outcome[ , v3 := {
c(value1[1], sapply(c("X", "Y"), function(g){
.SD[trial %in% c("B", g), (cumsum(value1) + cumsum(shift(value2, fill = 0)))[-1]]
}))}, by = grp]
# grp trial yr value1 value2 Value3 v3
# 1: A B 2021 2 0 2 2
# 2: A X 2022 3 1 5 5
# 3: A X 2023 4 2 10 10
# 4: A Y 2022 5 3 7 7
# 5: A Y 2023 6 4 16 16
# 6: B B 2021 0 2 0 0
# 7: B X 2022 1 3 3 3
# 8: B X 2023 2 4 8 8
# 9: B Y 2022 3 5 5 5
# 10: B Y 2023 4 6 14 14
For each 'grp' (by = grp), loop over 'trial' "X" and "Y" (sapply(c("X", "Y"), ...)). Within each sub-dataset defined by by (.SD), select rows where 'trial' is equal to "B" or the current value of the loop (trial %in% c("B", g)).
Do the calculation (cumsum(value1) + cumsum(shift(value2, fill = 0))) and remove the first value ([-1]). Append the first row within each 'grp', i.e. the row that corresponds to trial "B" (c(value1[1], ...)). Assign the result to a new variable by reference (v3 := ...).
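Note that the answers in this thread write the lagged part in two different orders: lag(cumsum(value2)) in the question and the first answer, and cumsum(lag(value2)) (or cumsum(shift(value2)) in data.table) in the others. With a fill value of 0 the two are the same thing, since both give the sum of all earlier values. A quick check with dplyr's lag, using the value2 entries of grp A for trials "B" and "Y":

```r
library(dplyr)

value2 <- c(0, 3, 4)  # baseline row, then the two "Y" rows of grp A

lag_then_sum <- cumsum(lag(value2, default = 0))
sum_then_lag <- lag(cumsum(value2), default = 0)

lag_then_sum                           # 0 0 3
identical(lag_then_sum, sum_then_lag)  # TRUE
```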

How to find the first observation of a column that matches a condition

I have a data frame:
df = tibble(a=c(7,6,10,12,12), b=c(3,5,8,8,7), c=c(4,4,12,15,20), week=c(1,2,3,4,5))
# A tibble: 5 x 4
a b c week
<dbl> <dbl> <dbl> <dbl>
1 7 3 4 1
2 6 5 4 2
3 10 8 12 3
4 12 8 15 4
5 12 7 20 5
and I want, for each of the columns a, b and c, the first week in which the observation equals or exceeds 10.
I.e. for column a it would be week 3, for column b it would be week NA, for column c it would be week 3 as well.
A desired outcome could look like this:
tibble(abc=c("a", "b", "c"), value=c(10, NA, 12), week=c(3, NA, 3))
# A tibble: 3 x 3
abc value week
<chr> <dbl> <dbl>
1 a 10 3
2 b NA NA
3 c 12 3
One way would be to get the data in long format and, for each column name, select the first value that is greater than or equal to 10. We fill in the missing combinations with complete.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -week, names_to = 'abc') %>%
group_by(abc) %>%
slice(which(value >= 10)[1]) %>%
ungroup %>%
complete(abc = names(df)[-4])
# A tibble: 3 x 3
# abc week value
# <chr> <dbl> <dbl>
#1 a 3 10
#2 b NA NA
#3 c 3 12
Another way is to first calculate what we want and then transform the dataset into long format.
df %>%
summarise(across(a:c, list(week = ~week[which(. >= 10)[1]],
value = ~.[. >= 10][1]))) %>%
pivot_longer(cols = everything(),
names_to = c('abc', '.value'),
names_sep = "_")
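For comparison, the same lookup can be sketched in base R, which makes the core idea (which(x >= 10)[1] returns the first matching row, or NA when there is none) easy to see in isolation. The names cols, first_row and result are just illustrative.

```r
df <- data.frame(a = c(7, 6, 10, 12, 12),
                 b = c(3, 5, 8, 8, 7),
                 c = c(4, 4, 12, 15, 20),
                 week = c(1, 2, 3, 4, 5))
cols <- c("a", "b", "c")

# Row index of the first value >= 10 in each column; NA if there is none
first_row <- sapply(df[cols], function(x) which(x >= 10)[1])

result <- data.frame(
  abc   = cols,
  value = unname(mapply(function(col, i) df[[col]][i], cols, first_row)),
  week  = df$week[first_row]
)
result
```

Indexing with NA returns NA, so column b ends up with NA for both value and week without any special handling.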

Rank function in R after group by

How to use R to create a rank column? Below is an example
This is what I have:
Date group
12/5/2020 A
12/5/2020 A
11/7/2020 A
11/7/2020 A
11/9/2020 B
11/9/2020 B
10/8/2020 B
This is what I want:
Date group rank
12/5/2020 A 2
12/5/2020 A 2
11/7/2020 A 1
11/7/2020 A 1
11/9/2020 B 2
11/9/2020 B 2
10/8/2020 B 1
tidyverse
(I'm using dplyr here since I think it is easy to see the steps being done.)
A first approach might be to capitalize on R's factor function, which assigns an integer to each distinct value, so that operations on this factor are faster (when compared with strings). That is, it takes a (possibly looooong) vector of strings and converts it into a just-as-long vector of integers (much smaller and faster) and a very short vector of strings, where the integers are indices into the small vector of strings. This small vector is called the factor's "levels".
library(dplyr)
group_by(dat, group) %>%
mutate(rank = as.integer(factor(Date))) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <chr> <chr> <int>
# 1 12/5/2020 A 2
# 2 12/5/2020 A 2
# 3 11/7/2020 A 1
# 4 11/7/2020 A 1
# 5 11/9/2020 B 2
# 6 11/9/2020 B 2
# 7 10/8/2020 B 1
This "sorta" works, but there are two problems:
This relies on the lexicographic sorting of the Date column, which happens to work for this data sample but will fail in general. A better way is to convert to something more appropriately sortable, such as a Date object.
Failing sorts:
sort(c("12/9/2020", "11/9/2020", "2/9/2020"))
# [1] "11/9/2020" "12/9/2020" "2/9/2020"
dat %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(group) %>%
mutate(rank = as.integer(factor(Date))) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
and
There really are better functions for ranking, such as dplyr::dense_rank (which @akrun put in an answer first ... I was building to it, honestly):
dat %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(group) %>%
mutate(rank = dense_rank(Date)) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
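To make both points concrete, here is a small sketch: the first part shows that the failing example sorts correctly once the strings are converted to Date, and the second part shows why dense_rank is the ranking function wanted here rather than min_rank. The last line is a base R equivalent of dense_rank, added only for comparison.

```r
library(dplyr)

# Converted to Date, the values sort chronologically
x <- c("12/9/2020", "11/9/2020", "2/9/2020")
sorted_dates <- sort(as.Date(x, format = "%m/%d/%Y"))
format(sorted_dates)
# "2020-02-09" "2020-11-09" "2020-12-09"

# Ties: min_rank leaves gaps, dense_rank does not
d <- as.Date(c("12/5/2020", "12/5/2020", "11/7/2020", "11/7/2020"),
             format = "%m/%d/%Y")
min_rank(d)                # 3 3 1 1
dense_rank(d)              # 2 2 1 1
match(d, sort(unique(d)))  # 2 2 1 1
```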
We can use dense_rank after converting the 'Date' to Date class
library(dplyr)
library(lubridate)
df1 %>%
group_by(group) %>%
mutate(rank = dense_rank(mdy(Date)))
# A tibble: 7 x 3
# Groups: group [2]
# Date group rank
# <chr> <chr> <int>
#1 12/5/2020 A 2
#2 12/5/2020 A 2
#3 11/7/2020 A 1
#4 11/7/2020 A 1
#5 11/9/2020 B 2
#6 11/9/2020 B 2
#7 10/8/2020 B 1
data
df1 <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-7L))
Convert the Date column to the actual date object, arrange the data by Date and use match with unique to get rank column.
library(dplyr)
df %>%
mutate(Date = lubridate::mdy(Date)) %>%
arrange(group, Date) %>%
group_by(group) %>%
mutate(rank = match(Date, unique(Date)))
# Date group rank
# <date> <chr> <int>
#1 2020-11-07 A 1
#2 2020-11-07 A 1
#3 2020-12-05 A 2
#4 2020-12-05 A 2
#5 2020-10-08 B 1
#6 2020-11-09 B 2
#7 2020-11-09 B 2
data
df <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA, -7L))

Moving rows to columns in R using identifier

I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual), but contain different data. I want to use the identifier to move the rows that share an identifier into columns. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R, and have tried the reshape (cast) package, but I can't really follow it, and am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
1 3 1 5 1 6 1 7 1
2 1 2 12 2 5 2
3 3 1
UPDATE 2:
So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast (x, ID ~ time, value.var = "value")
and here is the output using that code:
ID A B C D E F G H (for clarity I have labelled these)
1 3 5 6 7 1 1 1 1
2 1 12 5 2 2 2
3 3 1
So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example column D has the value '7' for person 1 (age4), but '2' for person 2 (Sex). I can see that my code is not instructing where the numerical values should be cast to, but I do not know how to code that part. Any ideas?
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Age2 Age3 Age4 Sex1 Sex2 Sex3 Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 5 6 7 1 1 1 1
#> 2 2 1 12 5 2 2 2
#> 3 3 3 1
If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I recently came across a similar issue in my data and wanted to provide an update using the tidyr 1.0 functions, as gather and spread have been retired. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr. I hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA
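Since pivot_wider accepts several columns in values_from, the pivot_longer and unite steps can probably be skipped altogether. A sketch, assuming tidyr 1.0 or later; names_sep = "" is used so that the new columns are named Age1, Sex1 and so on rather than Age_1, Sex_1.

```r
library(dplyr)
library(tidyr)
library(tibble)

x <- tribble(
  ~ID, ~Age, ~Sex,
  1,  3, 1,
  1,  5, 1,
  1,  6, 1,
  1,  7, 1,
  2,  1, 2,
  2, 12, 2,
  2,  5, 2,
  3,  3, 1
)

wide <- x %>%
  group_by(ID) %>%
  mutate(grp = row_number()) %>%  # running counter within each ID
  ungroup() %>%
  pivot_wider(names_from = grp,
              values_from = c(Age, Sex),
              names_sep = "")
wide
```

The columns stay numeric, and IDs with fewer rows get NA in the unused columns.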

Resources