I have a dataset with an ID column Patient_ID and multiple columns relating to each baby of a birth event. There is more than one set of these columns, as there have been multiple births (twins, triplets, etc.), and the database stores them in a wide format.
So, I have the columns:
Patient_ID (for the mother)
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks
etc, etc.
pofid_1 refers to a unique identifier for each baby and is the only variable that doesn't follow the format pofnvarname (pof = pregnancy outcome form). There are ~50 columns for each baby; I have only listed three here for demonstration. Is there a way I can pivot the whole dataset based on the number after pof so I have the following column names, with one row for each baby born:
Patient_ID
babynumber
pofid (baby ID)
pofcompleteddate
pofpregendweeks
So, I am starting off with:
data.frame(
  Patient_ID = c(1, 2, 3, 4),
  pofid_1 = c(1, 2, 3, 4),
  pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
  pof1pregendweeks = c(40, 39, 41, 40),
  pofid_2 = c(NA, NA, 5, 6),
  pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
  pof2pregendweeks = c(NA, NA, 41, 40)
)
Patient_ID pofid_1 pof1completeddate pof1pregendweeks pofid_2 pof2completeddate pof2pregendweeks
1 1 1 2022-11-12 40 NA <NA> NA
2 2 2 2022-12-11 39 NA <NA> NA
3 3 3 2022-10-10 41 5 2022-10-10 41
4 4 4 2022-01-01 40 6 2022-01-01 40
And I want:
Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1 1 1 1 2022-11-12 40
2 2 2 1 2022-12-11 39
3 3 3 1 2022-10-10 41
4 3 5 2 2022-10-10 41
5 4 4 1 2022-01-01 40
6 4 6 2 2022-01-01 40
It's best to ensure you have consistent naming across your columns by changing pofid_1 and pofid_2 to pof1id and pof2id. You can do this in one go with rename_with(). Then it's just a case of pivoting to long format and filtering to retain complete cases:
library(tidyverse)
df %>%
  rename_with(~ gsub('pofid_(\\d+)', 'pof\\1id', .x)) %>%
  pivot_longer(-Patient_ID, names_sep = '(?<=pof\\d)',
               names_to = c('babynumber', '.value')) %>%
  filter(complete.cases(.)) %>%
  mutate(babynumber = as.numeric(gsub('\\D', '', babynumber))) %>%
  rename(pofid = id)
#> # A tibble: 6 x 5
#> Patient_ID babynumber pofid completeddate pregendweeks
#> <int> <dbl> <int> <chr> <int>
#> 1 1 1 1 2022-11-12 40
#> 2 2 1 2 2022-12-11 39
#> 3 3 1 3 2022-10-10 41
#> 4 3 2 5 2022-10-10 41
#> 5 4 1 4 2022-01-01 40
#> 6 4 2 6 2022-01-01 40
Created on 2023-02-13 with reprex v2.0.2
Data in reproducible format
df <- structure(list(Patient_ID = 1:4, pofid_1 = 1:4,
pof1completeddate = c("2022-11-12",
"2022-12-11", "2022-10-10", "2022-01-01"), pof1pregendweeks = c(40L,
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA,
NA, "2022-10-10", "2022-01-01"), pof2pregendweeks = c(NA, NA,
41L, 40L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))
Related
Hi everyone,
I have a dataframe where each ID has multiple visits from 1 to 5. I am trying to calculate the difference in a score between each visit and visit 1, e.g. Score(Visit 5) - Score(Visit 1), and so on. How do I achieve that in R? Below is a sample dataset and the result dataset.
structure(list(ID = c("A", "A", "A", "A", "A", "B", "B", "B"),
Visit = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), Score = c(16,
15, 13, 12, 12, 20, 19, 18)), class = "data.frame", row.names = c(NA,
-8L))
#> ID Visit Score
#> 1 A 1 16
#> 2 A 2 15
#> 3 A 3 13
#> 4 A 4 12
#> 5 A 5 12
#> 6 B 1 20
#> 7 B 2 19
#> 8 B 3 18
Created on 2021-05-20 by the reprex package (v2.0.0)
Here is the expected output
Here's a solution using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Difference = ifelse(Visit == 1, NA, Score[Visit == 1] - Score))
# A tibble: 8 x 4
# Groups: ID [2]
ID Visit Score Difference
<chr> <int> <dbl> <dbl>
1 A 1 16 NA
2 A 2 15 1
3 A 3 13 3
4 A 4 12 4
5 A 5 12 4
6 B 1 20 NA
7 B 2 19 1
8 B 3 18 2
Sample data
df <- data.frame(
ID = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'),
Visit = c(1:5, 1:3),
Score = c(16,15,13,12,12,20,19,18)
)
Side note: next time, I suggest you post sample data created with the dput() function on your dataframe rather than images.
Solution with dplyr using first
data <- data.frame(
ID = c(rep("A", 5), rep("B", 3)),
Visit = c(1:5, 1:3),
Score = c(16, 15, 13, 12, 12, 20, 19, 18))
library(dplyr)
data %>%
  group_by(ID) %>%
  arrange(Visit) %>%
  mutate(Difference = first(Score) - Score)
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID Visit Score Difference
#> <chr> <int> <dbl> <dbl>
#> 1 A 1 16 0
#> 2 A 2 15 1
#> 3 A 3 13 3
#> 4 A 4 12 4
#> 5 A 5 12 4
#> 6 B 1 20 0
#> 7 B 2 19 1
#> 8 B 3 18 2
Created on 2021-05-20 by the reprex package (v2.0.0)
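For completeness, a base R sketch of the same idea (my addition, not from either answer above): ave() applies a function within each ID group and returns a vector in the original row order, assuming the rows are already sorted by Visit within each ID. It reproduces the Difference column above (0 for visit 1).
# difference of each score from the score at the first visit, computed per ID
data$Difference <- ave(data$Score, data$ID, FUN = function(x) x[1] - x)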
I have a column like this with some duplicated values:
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1), date = c(NA, NA,
NA, "2011/01/01", "2011/02/01", "2012/01/01", "2012/01/01", "2012/05/01"
)), class = "data.frame", row.names = c(NA, -8L))
I want to keep only one of the duplicated values, like this
structure(list(id2 = c(1, 1, 1, 1, 1, 1, 1),
date2 = c(NA, NA, NA, "2011/01/01", "2011/02/01", "2012/01/01", "2012/05/01")),
class = "data.frame", row.names = c(NA, -7L))
Depending on exactly what you want, there are multiple alternatives:
dat %>%
  filter(!duplicated(date))
gives
id date
1 1 <NA>
2 1 2011/01/01
3 1 2011/02/01
4 1 2012/01/01
As someone else also suggested, this gives the same result as
dat %>% distinct(date, .keep_all = TRUE)
In contrast to that answer, I passed a column to distinct(), as I assumed you only want to remove duplicated dates, not rows that are duplicated across every column (and .keep_all is then necessary to keep those other columns).
However, it is unclear to me whether you want to keep all the NAs or not, because then you would need to add back the rows containing just the NAs.
If you want to keep all the NAs, you could for example do:
dat %>%
  filter(!is.na(date) & !duplicated(date)) %>%
  bind_rows(dat %>% filter(is.na(date)))
which gives
id date
1 1 2011/01/01
2 1 2011/02/01
3 1 2012/01/01
4 1 <NA>
5 1 <NA>
6 1 <NA>
Although there probably is a nicer way to do this.
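For what it's worth, one shorter filter (my sketch, not part of the answer above) keeps every NA but only the first occurrence of each non-missing date; unlike the bind_rows() version it also leaves the NAs in their original positions:
dat %>%
  filter(is.na(date) | !duplicated(date))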
Edit:
If you want to keep the entries but only want to make the duplicated values NA, you can use the duplicated() function this way:
dat %>%
  mutate(
    date1 = case_when(
      duplicated(date) ~ NA_character_,
      TRUE ~ date
    )
  )
I generally prefer case_when over if_else due to its readability. But in this case it would be the same.
It results in
id date date1
1 1 <NA> <NA>
2 1 <NA> <NA>
3 1 <NA> <NA>
4 1 2011/01/01 2011/01/01
5 1 2011/02/01 2011/02/01
6 1 2012/01/01 2012/01/01
7 1 2012/01/01 <NA>
8 1 2012/05/01 2012/05/01
I created an extra column for this example. But you could simply overwrite the date column in your actual analysis.
You can use dplyr::distinct:
library(tidyverse)
df <- structure(list(id = c(1, 1, 1, 1, 1, 1), date = c(NA, NA, NA,
"2011/01/01", "2011/02/01", "2012/01/01")), row.names = c(NA, 6L), class = "data.frame")
df
#> id date
#> 1 1 <NA>
#> 2 1 <NA>
#> 3 1 <NA>
#> 4 1 2011/01/01
#> 5 1 2011/02/01
#> 6 1 2012/01/01
df %>%
  distinct()
#> id date
#> 1 1 <NA>
#> 2 1 2011/01/01
#> 3 1 2011/02/01
#> 4 1 2012/01/01
How can I replace the NA values in any column with the values from the next column, without explicitly naming the columns (i.e. without hardcoding)? The column that contained only NA values should then be removed.
library(tidyverse)
df1 <- structure(list(GID = c("1", "2", "3", "4", "5", "NG1", "MG2", "MG3", "NG4"),
ColA = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
ColB = c("2", "4", "4", "5", "5", "", "1", "1", "")),
row.names = c(NA, -9L),
class = "data.frame")
df1 %>%
  mutate(across(everything(), ~ str_replace(., "^$", "N")),
         GID = GID %>% str_remove("N"))
#> GID ColA ColB
#> 1 1 NA 2
#> 2 2 NA 4
#> 3 3 NA 4
#> 4 4 NA 5
#> 5 5 NA 5
#> 6 G1 NA N
#> 7 MG2 NA 1
#> 8 MG3 NA 1
#> 9 G4 NA N
Expected output:
#> GID ColA
#> 1 1 2
#> 2 2 4
#> 3 3 4
#> 4 4 5
#> 5 5 5
#> 6 G1 N
#> 7 MG2 1
#> 8 MG3 1
#> 9 G4 N
I guess you already have an answer to the first part of your question; here is an alternative way using replace(). To drop columns that are all NA, you can use select() with where().
library(dplyr)
df1 %>%
  mutate(across(.fns = ~ replace(., . == '', 'N')),
         GID = sub('N', '', GID)) %>%
  select(-where(~ all(is.na(.)))) %>%
  rename_with(~ names(df1)[seq_along(.)])
# GID ColA
#1 1 2
#2 2 4
#3 3 4
#4 4 5
#5 5 5
#6 G1 N
#7 MG2 1
#8 MG3 1
#9 G4 N
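As a small base R aside (my addition, not part of the answer above), the drop-the-all-NA-columns step on its own can also be done without dplyr:
# keep only columns that contain at least one non-NA value
df1[, colSums(!is.na(df1)) > 0, drop = FALSE]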
I have 158 columns in a dataset. I want to create 3 new columns (1-day, 3-day and 7-day lag) for each column.
data <- data.frame(
  DATE = c("1/1/2016", "1/2/2016", "1/3/2016", "1/4/2016", "1/5/2016",
           "1/6/2016", "1/7/2016", "1/8/2016", "1/9/2016"),
  Attr1 = c(5, 8, 7, 6, 2, 1, 4, 1, 2),
  Attr2 = c(10, 23, 32, 12, 3, 2, 5, 3, 21),
  Attr3 = c(12, 23, 43, 3, 2, 4, 1, 23, 33)
)
The result wanted is as follows :
Attr1_3D = Average of last 3 days of Attr1
Attr1_7D = Average of last 7 days of Attr1
Attr2_3D = Average of last 3 days of Attr2
Attr2_7D = Average of last 7 days of Attr2
Attr3_3D = Average of last 3 days of Attr3
Attr3_7D = Average of last 7 days of Attr3
One approach using tidyverse and zoo is below. You can use rollapply() from the zoo package to get rolling means (over 1, 3, or 7 days).
Edit: I also added a one-day offset (so each rolling mean appears on the day after its X-day window) and joined back to the original data frame to include the original Attr columns.
library(tidyverse)
library(zoo)
data %>%
  pivot_longer(starts_with("Attr"), names_to = "Attr", values_to = "Value") %>%
  group_by(Attr) %>%
  mutate(Attr_1D = rollapply(Value, 1, mean, align = 'right', fill = NA),
         Attr_3D = rollapply(Value, 3, mean, align = 'right', fill = NA),
         Attr_7D = rollapply(Value, 7, mean, align = 'right', fill = NA),
         DATE = lead(DATE)) %>%
  pivot_wider(id_cols = DATE, names_from = "Attr", values_from = c("Attr_1D", "Attr_3D", "Attr_7D")) %>%
  right_join(data)
Output
# A tibble: 9 x 13
DATE Attr_1D_Attr1 Attr_1D_Attr2 Attr_1D_Attr3 Attr_3D_Attr1 Attr_3D_Attr2 Attr_3D_Attr3 Attr_7D_Attr1 Attr_7D_Attr2 Attr_7D_Attr3 Attr1 Attr2 Attr3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1/1/2016 NA NA NA NA NA NA NA NA NA 5 10 12
2 1/2/2016 5 10 12 NA NA NA NA NA NA 8 23 23
3 1/3/2016 8 23 23 NA NA NA NA NA NA 7 32 43
4 1/4/2016 7 32 43 6.67 21.7 26 NA NA NA 6 12 3
5 1/5/2016 6 12 3 7 22.3 23 NA NA NA 2 3 2
6 1/6/2016 2 3 2 5 15.7 16 NA NA NA 1 2 4
7 1/7/2016 1 2 4 3 5.67 3 NA NA NA 4 5 1
8 1/8/2016 4 5 1 2.33 3.33 2.33 4.71 12.4 12.6 1 3 23
9 1/9/2016 1 3 23 2 3.33 9.33 4.14 11.4 14.1 2 21 33
Data
data <- structure(list(DATE = structure(1:9, .Label = c("1/1/2016", "1/2/2016",
"1/3/2016", "1/4/2016", "1/5/2016", "1/6/2016", "1/7/2016", "1/8/2016",
"1/9/2016"), class = "factor"), Attr1 = c(5, 8, 7, 6, 2, 1, 4,
1, 2), Attr2 = c(10, 23, 32, 12, 3, 2, 5, 3, 21), Attr3 = c(12,
23, 43, 3, 2, 4, 1, 23, 33)), class = "data.frame", row.names = c(NA,
-9L))
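If you prefer to stay in wide format, here is a sketch (my addition, not the answer above) that adds the lagged rolling means directly with across(); dplyr::lag() supplies the one-day offset and zoo::rollapplyr() is the right-aligned version of rollapply():
library(dplyr)
library(zoo)

data %>%
  mutate(across(starts_with("Attr"),
                list(`1D` = ~ dplyr::lag(rollapplyr(.x, 1, mean, fill = NA)),
                     `3D` = ~ dplyr::lag(rollapplyr(.x, 3, mean, fill = NA)),
                     `7D` = ~ dplyr::lag(rollapplyr(.x, 7, mean, fill = NA))),
                .names = "{.col}_{.fn}"))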
How can I extract only the three top observations with respect to some variable, e.g. a count (the n variable in the example data below)? I would like to avoid arranging rows, so I thought I could use dplyr::min_rank.
ex <- structure(list(code = c("17.1", "6.2", "151.5", "78.1", "88.1",
"95.1", "45.2", "252.2"), id = c(1, 2, 3, 4, 5, 6, 7, 8), n = c(6L,
5L, 8L, 10L, 6L, 3L, 4L, 6L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))
library(dplyr)

ex %>%
  filter(min_rank(desc(n)) <= 3)
But if there are ties, it can give more than 3 observations. For example, the command above returns five rows:
# A tibble: 5 x 3
code id n
<chr> <dbl> <int>
1 17.1 1 6
2 151.5 3 8
3 78.1 4 10
4 88.1 5 6
5 252.2 8 6
How can I then extract exactly 3 observations? (no matter which observation is returned in case of ties)
We can use row_number(), which can take a column as an argument:
ex %>%
  filter(row_number(desc(n)) <= 3)
# A tibble: 3 x 3
# code id n
# <chr> <dbl> <int>
#1 17.1 1 6
#2 151.5 3 8
#3 78.1 4 10
In base R, we can use
ex[tail(order(ex$n),3), ]
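If you are on dplyr >= 1.0.0, slice_max() is another option (a sketch, not part of the answer above); with_ties = FALSE guarantees exactly three rows are returned:
ex %>%
  slice_max(order_by = n, n = 3, with_ties = FALSE)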