I am working with the R programming language.
I have the following data on medical patients:
my_data <- data.frame(
  id = c(1, 2, 3),
  status_2017 = c("alive", "alive", "alive"),
  status_2018 = c("alive", "dead", "alive"),
  status_2019 = c("alive", "dead", "dead"),
  height_2017 = rnorm(3, 3, 3), height_2018 = rnorm(3, 3, 3), height_2019 = rnorm(3, 3, 3),
  weight_2017 = rnorm(3, 3, 3), weight_2018 = rnorm(3, 3, 3), weight_2019 = rnorm(3, 3, 3))
cols <- colnames(my_data)
ix <- my_data[, startsWith(cols, "status")] == "dead"
my_data[, startsWith(cols, "height")][ ix ] <- NA
my_data[, startsWith(cols, "weight")][ ix ] <- NA
This looks something like this:
id status_2017 status_2018 status_2019 height_2017 height_2018 height_2019 weight_2017 weight_2018 weight_2019
1 1 alive alive alive 3.7276706 4.524869 -1.648458 -1.702781 7.755581 3.369895
2 2 alive dead dead 0.7539518 NA NA 1.060408 NA NA
3 3 alive alive dead 6.6213771 2.122374 NA 5.114120 1.851467 NA
My Question: I want to restructure this data such that:
Each patient has its own row for each year
There is a "year" column
Status_2017, Status_2018, Status_2019 are all combined into a single column (i.e. "status")
Height_2017, Height_2018, Height_2019 are all combined into a single column (i.e. "height")
Weight_2017, Weight_2018, Weight_2019 are all combined into a single column (i.e. "weight")
A new variable ("new_var") is created such that if a patient id has a row for 2019, then new_var is always 0; for all other patient ids, new_var is 0 until the patient's max year, where it becomes 1
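To make that last rule concrete, here is a small base-R sketch of new_var for a single patient's vector of years (new_var_for is just an illustrative helper, not part of my attempt below):

```r
# For one patient's observed years: new_var is all 0 if 2019 is present;
# otherwise it is 0 everywhere except the patient's last (max) year.
new_var_for <- function(years) {
  if (2019 %in% years) rep(0L, length(years))
  else as.integer(years == max(years))
}

new_var_for(c(2017, 2018, 2019))  # 0 0 0
new_var_for(c(2017, 2018))        # 0 1
```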
I tried to do this with the following code:
library(dplyr)
library(tidyr)
my_data_long <- na.omit(my_data %>%
  pivot_longer(cols = -c(id, status_2017),
               names_to = c(".value", "year"),
               names_pattern = "(height|weight)_(\\d{4})") %>%
  arrange(id, year))
final <- my_data_long %>%
  group_by(id) %>%
  mutate(
    new_var = ifelse(any(year == "2019"), 0, 1),
    max_year = max(year)
  ) %>%
  ungroup() %>%
  mutate(
    new_var = ifelse(year == max_year & new_var == 1, 1, 0),
    max_year = NULL
  )
The final result looks something like this:
> final
# A tibble: 6 x 6
id status_2017 year height weight new_var
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 alive 2017 2.39 2.27 0
2 1 alive 2018 -0.541 1.63 0
3 1 alive 2019 -1.93 10.1 0
4 2 alive 2017 4.18 -3.35 1
5 3 alive 2017 -1.35 7.12 0
6 3 alive 2018 1.42 1.70 1
My end goal is to restructure this dataset so that I can fit a time-varying survival model (e.g. Cox PH) to this data (e.g. https://atm.amegroups.com/article/view/18820/html, https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf)
Can someone please tell me if I have done this correctly?
Thanks!
Note: I also tried to add start/stop intervals for each ID.
This looks something like this:
library(stringr)
final %>%
  group_by(id) %>%
  mutate(start = 0:(n() - 1),
         end = 1:n()) %>%
  ungroup()
# A tibble: 6 x 8
id status_2017 year height weight new_var start end
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <int> <int>
1 1 alive 2017 2.39 2.27 0 0 1
2 1 alive 2018 -0.541 1.63 0 1 2
3 1 alive 2019 -1.93 10.1 0 2 3
4 2 alive 2017 4.18 -3.35 1 0 1
5 3 alive 2017 -1.35 7.12 0 0 1
6 3 alive 2018 1.42 1.70 1 1 2
If we need the status column, then we have to include it in the pivot to long format as well, i.e. cols = -c(id, status_2017) excludes 'status_2017' from reshaping. In addition, names_pattern needs to include status in addition to height and weight. Note also that the question wants new_var to be 0 everywhere for ids that reach 2019, and 1 only at the max year of the other ids:
library(dplyr) # version >= 1.1.0
library(tidyr)
my_data %>%
  pivot_longer(cols = -id, names_to = c(".value", "year"),
               names_pattern = "(height|weight|status)_(\\d{4})") %>%
  drop_na() %>%
  mutate(new_var = +(!2019 %in% year), max_year = max(year), .by = "id") %>%
  mutate(new_var = +(year == max_year & new_var), max_year = NULL)
Output:
# A tibble: 6 × 6
     id year  status height weight new_var
  <dbl> <chr> <chr>   <dbl>  <dbl>   <int>
1     1 2017  alive   9.54   7.47        0
2     1 2018  alive   6.49   5.23        0
3     1 2019  alive   3.75   1.93        0
4     2 2017  alive   4.21   0.619       1
5     3 2017  alive   1.97   5.32        0
6     3 2018  alive  -0.406  8.00        1
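For completeness, the same wide-to-long step can also be done in base R with reshape(), which splits the column names on the underscore. A sketch on a deterministic copy of the example data (seed and values are arbitrary):

```r
set.seed(1)
my_data <- data.frame(
  id = 1:3,
  status_2017 = c("alive", "alive", "alive"),
  status_2018 = c("alive", "dead", "alive"),
  status_2019 = c("alive", "dead", "dead"),
  height_2017 = rnorm(3), height_2018 = rnorm(3), height_2019 = rnorm(3),
  weight_2017 = rnorm(3), weight_2018 = rnorm(3), weight_2019 = rnorm(3)
)

# direction = "long" with sep = "_" splits "status_2017" into a 'status'
# value column and a 'year' time column, and likewise for height/weight.
long <- reshape(my_data, direction = "long",
                varying = setdiff(names(my_data), "id"),
                sep = "_", timevar = "year", idvar = "id")
long <- long[order(long$id, long$year), ]
```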
Related
df_input is the data frame which needs to be transformed into df_output.
For instance, 2001-2003 is assembly=1, and we had a winner in 2001. It means we have a winner if the assembly doesn't change. Similarly, we have a string variable called "party", which doesn't change as long as the assembly is the same.
df_input <- data.frame(
  winner = c(1,0,0,0,2,0,0,0,1,0,0,0,0),
  party = c("A",0,0,0,"B",0,0,0,"C",0,0,0,0),
  assembly = c(1,1,1,2,2,2,3,3,3,3,4,4,4),
  year = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013))

df_output <- data.frame(
  winner = c(1,1,1,0,2,2,0,0,1,1,0,0,0),
  party = c("A","A","A",0,"B","B",0,0,"C","C",0,0,0),
  assembly = c(1,1,1,2,2,2,3,3,3,3,4,4,4),
  year = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013))
The code is working fine with the numeric variable (winner). How to do it if there is an additional string variable, "party"?
I get the following error after implementing this code:
df_output <- df_input %>%
  mutate(winner = if_else(winner > 0, winner, NA_real_)) %>%
  group_by(assembly) %>%
  fill(winner) %>%
  ungroup() %>%
  replace_na(list(winner = 0)) # working fine

df_output <- df_input %>%
  mutate(party = ifelse(party > 0, party, NA)) %>%
  group_by(assembly) %>%
  fill(party) %>%
  ungroup() %>%
  replace_na(list(party = 0))
Error:
Error in `vec_assign()`:
! Can't convert `replace$party` <double> to match type of `data$party` <character>.
You have to pay attention to the data types. As party is a character column, use "0" in replace_na. Also, there is NA_character_:
library(dplyr)
library(tidyr)
df_input %>%
  mutate(winner = if_else(winner > 0, winner, NA_real_),
         party = if_else(party != "0", party, NA_character_)) %>%
  group_by(assembly) %>%
  fill(winner, party) %>%
  ungroup() %>%
  replace_na(list(winner = 0, party = "0"))
#> # A tibble: 13 × 4
#> winner party assembly year
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 A 1 2001
#> 2 1 A 1 2002
#> 3 1 A 1 2003
#> 4 0 0 2 2004
#> 5 2 B 2 2005
#> 6 2 B 2 2006
#> 7 0 0 3 2007
#> 8 0 0 3 2008
#> 9 1 C 3 2009
#> 10 1 C 3 2010
#> 11 0 0 4 2011
#> 12 0 0 4 2012
#> 13 0 0 4 2013
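If you'd rather avoid tidyr, the same group-wise fill can be sketched in base R with ave() and a small last-observation-carried-forward helper (fill_down is a hypothetical helper name, not a base function):

```r
# Carry the last non-missing value forward within a vector.
fill_down <- function(v) {
  idx <- seq_along(v)
  idx[is.na(v)] <- 0      # missing positions contribute nothing
  idx <- cummax(idx)      # index of the last non-missing value so far
  ifelse(idx == 0, v, v[pmax(idx, 1)])
}

party    <- c("A","0","0","0","B","0","0","0","C","0","0","0","0")
assembly <- c(1,1,1,2,2,2,3,3,3,3,4,4,4)

party[party == "0"] <- NA
filled <- ave(party, assembly, FUN = fill_down)  # fill within each assembly
filled[is.na(filled)] <- "0"
filled
# "A" "A" "A" "0" "B" "B" "0" "0" "C" "C" "0" "0" "0"
```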
I'm trying to calculate the cumulative time among several grades.
Here's what my original df looks like:
df = data.frame(id = c(1,1,1,1,2,2,2,2),
group = c(0,0,0,0,1,1,1,1),
grade = c(0,1,2,3,0,1,3,4),
time = c(10,7,4,1,20,17,14,11))
Here's what I'm expecting as the result df1:
df1 <- df %>%
  pivot_wider(
    names_from = "grade",
    names_prefix = "grade_",
    values_from = "time") %>%
  replace(is.na(.), 0) %>%
  mutate(grade_1 = grade_1 + grade_2 + grade_3 + grade_4,
         grade_2 = grade_2 + grade_3 + grade_4,
         grade_3 = grade_3 + grade_4) %>%
  pivot_longer(
    cols = 3:7,
    names_to = "grade",
    names_prefix = "grade_",
    values_to = "time")
My method works, but I want it to be more flexible: when there are more grades in the df, I don't want to have to manually write out grade_x = grade_x + grade_y + ... for each one.
Thank you!
One option would be to arrange the grade column in descending order, then take the cumsum so that it accumulates in reverse, excluding the last row of each group (where grade == 0). Then we can re-arrange back into the desired order and ungroup.
library(tidyverse)
results <- df %>%
  group_by(id) %>%
  arrange(id, desc(grade)) %>%
  mutate(time = ifelse(row_number() != n(), cumsum(time), time)) %>%
  arrange(id, grade) %>%
  ungroup()
Output
id group grade time
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 10
2 1 0 1 12
3 1 0 2 5
4 1 0 3 1
5 2 1 0 20
6 2 1 1 42
7 2 1 3 25
8 2 1 4 11
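The same reverse cumulative sum can be sketched in base R with ave(), assuming the rows are already sorted by id and ascending grade as in the question (rev_cum is a hypothetical helper name):

```r
df <- data.frame(id    = c(1,1,1,1,2,2,2,2),
                 group = c(0,0,0,0,1,1,1,1),
                 grade = c(0,1,2,3,0,1,3,4),
                 time  = c(10,7,4,1,20,17,14,11))

# Reverse cumulative sum within each id; the lowest grade keeps its
# original time, matching the "exclude the last row" step above.
rev_cum <- function(v) {
  out <- rev(cumsum(rev(v)))
  out[1] <- v[1]
  out
}
df$cum_time <- ave(df$time, df$id, FUN = rev_cum)
df$cum_time
# 10 12  5  1 20 42 25 11
```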
If you need each group to have the same number of rows as in your desired output, then you can use complete:
df %>%
  tidyr::complete(id, grade) %>%
  group_by(id) %>%
  fill(group, .direction = "downup") %>%
  replace(is.na(.), 0) %>%
  arrange(id, desc(grade)) %>%
  mutate(time = ifelse(row_number() != n(), cumsum(time), time)) %>%
  arrange(id, grade) %>%
  ungroup()
Output
id grade group time
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 10
2 1 1 0 12
3 1 2 0 5
4 1 3 0 1
5 1 4 0 0
6 2 0 1 20
7 2 1 1 42
8 2 2 1 25
9 2 3 1 25
10 2 4 1 11
Or if you want to pivot back and forth then you could do something like this:
output <- df %>%
  pivot_wider(
    names_from = "grade",
    names_prefix = "grade_",
    values_from = "time") %>%
  replace(is.na(.), 0) %>%
  select(id, group, grade_0, last_col():grade_1)

results2 <- output %>%
  select(-c(id, group, grade_0)) %>%
  rowwise() %>%
  do(data.frame(t(cumsum(unlist(.))))) %>%
  bind_cols(select(output, id, group, grade_0), .) %>%
  pivot_longer(
    cols = 3:7,
    names_to = "grade",
    names_prefix = "grade_",
    values_to = "time")
First try: for cumulative sums across a variable, we can group_by and use cumsum(). No need to specify the grades manually, and you can do more aggregations if needed.
df %>%
  group_by(grade) %>%
  mutate(Cum_Time = cumsum(time)) %>%
  arrange(grade)
id group grade time Cum_Time
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 10
2 2 1 0 20 30
3 1 0 1 7 7
4 2 1 1 17 24
5 1 0 2 4 4
6 1 0 3 1 1
7 2 1 3 14 15
8 2 1 4 11 11
Using the following data:
df <- data.frame(id = c("A", "B", "C", "A", "B", "A"),
value = c(1, 2, 3, 4, 5, 6))
I want to pivot_wider this data so that the reshaping creates two different sets of columns:
One set where I create a bunch of binary columns that take the column names from the value columns (e.g. bin_1, bin_2 and so on) and that are coded as 0/1.
An additional set where I create as many necessary columns to store the values in a "categorical" way. Here, id "A" has three values, so I want to create three columns cat_1, cat_2, cat_3 and for IDs B and C I want to fill them up with NAs if there's no value.
Now, I know how to create these two things separately from each other and merge them afterwards via a left_join.
However, my question is: can it be done in one pipeline with two subsequent pivot_wider calls? I tried, but it doesn't work (presumably because my approach of copying the value column and then using one copy for the binary reshape and one for the categorical reshape is wrong).
Any ideas?
Code so far that works:
df1 <- df %>%
  group_by(id) %>%
  mutate(group_id = 1:n()) %>%
  ungroup() %>%
  pivot_wider(names_from = group_id,
              names_prefix = "cat_",
              values_from = value)
df2 <- df %>%
  mutate(dummy = 1) %>%
  arrange(value) %>%
  pivot_wider(names_from = value,
              names_prefix = "bin_",
              values_from = dummy,
              values_fill = list(dummy = 0),
              values_fn = list(dummy = length))

df <- df1 %>%
  left_join(., df2, by = "id")
Expected output:
# A tibble: 3 x 10
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0
With the addition of purrr, you could do:
map(.x = reduce(range(df$value), `:`),
    ~ df %>%
      group_by(id) %>%
      mutate(!!paste0("bin_", .x) := as.numeric(.x %in% value))) %>%
  reduce(full_join) %>%
  mutate(cats = paste0("cat_", row_number())) %>%
  pivot_wider(names_from = "cats",
              values_from = "value")
id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 1 0 1 1 4 6
2 B 0 1 0 0 1 0 2 5 NA
3 C 0 0 1 0 0 0 3 NA NA
In base you can try:
tt <- unstack(df[2:1])
x <- cbind(t(sapply(tt, "[", seq_len(max(lengths(tt))))),
t(+sapply(names(tt), "%in%", x=df$id)))
colnames(x) <- c(paste0("cat_", seq_len(max(lengths(tt)))),
paste0("bin_", seq_len(nrow(df))))
x
# cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
#A 1 4 6 1 0 0 1 0 1
#B 2 5 NA 0 1 0 0 1 0
#C 3 NA NA 0 0 1 0 0 0
Slightly modifying your approach by shortening the df2 code and putting it all in one pipe, taking advantage of the list() and . trick, which lets you work on two versions of df in the same call.
It's not much of an improvement on what you have done, but it is now all in one call. I can't think of a way to do it without a merge/join.
library(tidyverse)
df %>%
  list(
    pivot_wider(., id_cols = id,
                names_from = value,
                names_prefix = "bin_") %>%
      mutate_if(is.numeric, ~ +(!is.na(.))), # convert to binary
    group_by(., id) %>%
      mutate(group_id = 1:n()) %>%
      ungroup() %>%
      pivot_wider(names_from = group_id,
                  names_prefix = "cat_",
                  values_from = value)
  ) %>%
  .[c(2:3)] %>%
  reduce(left_join)
# id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
# <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 1 0 1 1 4 6
# 2 B 0 1 0 0 1 0 2 5 NA
# 3 C 0 0 1 0 0 0 3 NA NA
You can even combine both of your pipelines into one without creating any intermediate object:
df %>%
  group_by(id) %>%
  mutate(group_id = row_number()) %>%
  pivot_wider(names_from = group_id,
              names_prefix = "cat_",
              values_from = value) %>%
  left_join(df %>%
              mutate(dummy = 1) %>%
              arrange(value) %>%
              pivot_wider(names_from = value,
                          names_prefix = "bin_",
                          values_from = dummy,
                          values_fill = list(dummy = 0),
                          values_fn = list(dummy = length)),
            by = "id")
# A tibble: 3 x 10
# Groups: id [3]
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0
I have a data frame that looks like this:
d <- c(01, 02, 03, 04)
h <- c("19:00", "19:00", "07:00", "07:00")
p1 <- c(123, 321, 123, 123)
p2 <- c(321, 345, 567, 567)
df <- data.frame(date = d, hours = h, person1 = p1, person2 = p2)
I used this code to associate all the characteristics of each person1 in different columns:
EDITED: rn = rowid(person1, date, hours) is the actual code. Not rn = rowid(person1)
library(dplyr)
library(data.table)
library(tidyr)
df1 <- df %>%
  mutate(rn = rowid(person1, date, hours)) %>%
  pivot_wider(names_from = rn, values_from = c(date, hours, person2),
              names_sep = "")
But this code gives me this output:
# person1 date1 hours1 person21
# 123 c(1,3,4) c("19:00", "07:00", "07:00") c(321,567,567)
# 321 2 19:00 345
I don't want it to repeat values like 07:00 or 567. I want each distinct value in its own column, ignoring repeated values. And if possible, organized like this:
# person1 date1 date2 date3 date4... hours1 hours2 ... person21 person22 person23 person24
# 123 01 NA 03 04 07:00 19:00 NA 321 NA 567
# 321 NA 02 NA NA NA 19:00 NA NA 345 NA
person21, 22, 23 and 24 being the first, second, third, fourth, and so on person of my df1$person1.
But the ideal output for me would be something like this:
# person1 d01 d02 d03 d04 ... h07:00 h19:00 ... p123 p321 p345 p567
# 123 1 0 1 1 ... 1 0 ... 1 0 0 1
# 321 0 1 0 0 ... 0 0 ... 1 0 0 1
How can I do this?
If we want to return a binary output, specify the values_fn and values_fill arguments in pivot_wider:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
  mutate(rn = rowid(person1)) %>%
  pivot_wider(names_from = rn, values_from = c(date, hours, person2),
              names_sep = "", values_fn = length,
              values_fill = list(date = 0, hours = 0, person2 = 0))
# A tibble: 2 x 10
# person1 date1 date2 date3 hours1 hours2 hours3 person21 person22 person23
# <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 123 1 1 1 1 1 1 1 1 1
#2 321 1 0 0 1 0 0 1 0 0
If we want the values themselves to also be the column names, an option is to reshape into 'long' format first and then do the pivot_wider after the transformation:
df %>%
  mutate(date = sprintf("%02d", date)) %>%
  mutate(across(where(is.numeric), as.character)) %>%
  pivot_longer(cols = -person1) %>%
  mutate(name = substr(name, 1, 1)) %>%
  unite(name, name, value, sep = "") %>%
  distinct(person1, name) %>%
  mutate(n = 1) %>%
  pivot_wider(names_from = name, values_from = n, values_fill = list(n = 0))
# A tibble: 2 x 10
# person1 d01 `h19:00` p321 d02 p345 d03 `h07:00` p567 d04
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 123 1 1 1 0 0 1 1 1 1
#2 321 0 1 0 1 1 0 0 0 0
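A base R alternative for the 'ideal' binary output: stack the three attributes into one labelled column and cross-tabulate against person1 (the d/h/p prefixes follow the naming sketched in the question):

```r
df <- data.frame(date = c(1, 2, 3, 4),
                 hours = c("19:00", "19:00", "07:00", "07:00"),
                 person1 = c(123, 321, 123, 123),
                 person2 = c(321, 345, 567, 567))

# Stack date/hours/person2 into one labelled long column, then
# cross-tabulate; "> 0" collapses repeated values to a 0/1 indicator.
long <- data.frame(
  person1 = rep(df$person1, 3),
  label   = c(paste0("d", sprintf("%02d", as.integer(df$date))),
              paste0("h", df$hours),
              paste0("p", df$person2))
)
out <- +(table(long$person1, long$label) > 0)
out
```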
Basically, I have a data frame that contains IDs, Dates, VolumeX, and VolumeY.
I want to split the VolumeX data frame into before and after the max date of VolumeY specific to an ID.
Ex.
df looks like (with many different IDs) :
ID Date VolX VolY
1 2018 - 02- 01 5 -
1 2018 - 03- 01 6 -
1 2018 - 08- 01 3 -
1 2018 - 10- 01 1 -
1 2017 - 02- 01 - 1
1 2014 - 10- 01 - 0
1 2014 - 11- 01 - 5
1 2018 - 02- 01 - 0
So for the max date of VolY for every ID, I'd like to split the data frame into two, before and after that date, so as to sum VolX before and after the VolY max date.
It seems like this needs to be some kind of nested for loop. I am able to extract max dates and total volume... I'm just having a hard time doing it per ID.
Is this what you're after?
library(dplyr)
df %>%
  replace(., . == "-", NA) %>%
  mutate(Date = as.Date(gsub("\\s", "", Date))) %>%
  mutate_at(vars(VolX, VolY), as.numeric) %>%
  group_by(ID, Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))) %>%
  mutate(
    sum_Volx = sum(VolX[Date != max(Date)], na.rm = T),
    sum_VolY = sum(VolY[Date != max(Date)], na.rm = T)
  ) %>%
  ungroup() %>%
  select(-Before_After)
Output:
# A tibble: 8 x 6
ID Date VolX VolY sum_Volx sum_VolY
<int> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0
2 1 2018-03-01 6 NA 14 0
3 1 2018-08-01 3 NA 14 0
4 1 2018-10-01 1 NA 14 0
5 1 2017-02-01 NA 1 0 6
6 1 2014-10-01 NA 0 0 6
7 1 2014-11-01 NA 5 0 6
8 1 2018-02-01 NA 0 0 6
You could also make separate columns for before/after, like this:
df %>%
  replace(., . == "-", NA) %>%
  mutate_at(vars(VolX, VolY), as.numeric) %>%
  group_by(ID) %>%
  mutate(
    Date = as.Date(gsub("\\s", "", Date)),
    Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1])),
    sum_Volx_Before = sum(VolX[Date != max(Date) & Before_After == 0], na.rm = T),
    sum_VolY_Before = sum(VolY[Date != max(Date) & Before_After == 0], na.rm = T),
    sum_Volx_After = sum(VolX[Date != max(Date) & Before_After == 1], na.rm = T),
    sum_VolY_After = sum(VolY[Date != max(Date) & Before_After == 1], na.rm = T)
  ) %>%
  ungroup() %>%
  select(-Before_After)
Output:
# A tibble: 8 x 8
ID Date VolX VolY sum_Volx_Before sum_VolY_Before sum_Volx_After sum_VolY_After
<int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0 0 6
2 1 2018-03-01 6 NA 14 0 0 6
3 1 2018-08-01 3 NA 14 0 0 6
4 1 2018-10-01 1 NA 14 0 0 6
5 1 2017-02-01 NA 1 14 0 0 6
6 1 2014-10-01 NA 0 14 0 0 6
7 1 2014-11-01 NA 5 14 0 0 6
8 1 2018-02-01 NA 0 14 0 0 6
On the other hand, you could just create 2 separate new data frames in your environment, named Before and After, that literally exclude the maximum date and summarise the information, like below:
df_list <- df %>%
  replace(., . == "-", NA) %>%
  mutate_at(vars(VolX, VolY), as.numeric) %>%
  group_by(ID) %>%
  mutate(
    Date = as.Date(gsub("\\s", "", Date)),
    Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))
  ) %>%
  filter(!Date == max(Date)) %>%
  group_by(ID, Before_After) %>%
  summarise(
    sum_VolX = sum(VolX, na.rm = T),
    sum_VolY = sum(VolY, na.rm = T)
  ) %>%
  split(., .$Before_After)

names(df_list) <- c("Before", "After")
list2env(df_list, envir = .GlobalEnv)
Let's go through it one step at a time:
first we replace the - signs with NA (not strictly needed, just to avoid errors later on);
afterwards we transform VolX and VolY into numeric;
then we group by ID so that everything is applied to each group separately;
afterwards we transform Date into a proper Date format;
then comes the crucial part: we calculate the Before_After flag column: first we flag with 1 any row whose previous row held the maximum date, then we take the cumulative sum of that column, so that everything before this event is 0 and everything after it is 1;
then we filter out the maximum Date;
we group again by ID and the Before_After indicator;
we shrink the data frame with summarise so that it only contains the sums of the respective columns;
we split the data frame into two on the Before_After column;
as the obtained result is a list of 2 data frames, we need to get them into the global environment, so we first assign the names and then turn them into 'proper' data frames with list2env.
Output:
Before
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 0 14 0
After
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 1 0 6
Note that 0 corresponds to Before and 1 to After.