Dividing data into quintiles using group_by in R

I am looking for a way to change my code so that it sorts the data into quintiles instead of just the top 5 and bottom 5. My current code looks like this:
CombData <- CombData %>%
  group_by(Date) %>%
  mutate(
    R = min_rank(Value),
    E_P = case_when(
      R < 6 ~ "5w",
      R > max(R, na.rm = TRUE) - 5 ~ "5b",
      TRUE ~ NA_character_)
  ) %>%
  ungroup() %>%
  arrange(Date, E_P)
My dataset is quite large, so I will just provide sample data. The data I use is more complex, and the code should therefore allow for a varying number of rows per Date and also for multiple missing values (NAs):
df <- data.frame(
  Date = c(rep("2010-01-31", 16), rep("2010-02-28", 14)),
  Value = rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA, NA, NA, NA, NA, 15), 2)
)
Afterwards, I would also like to check the minimum size of the quintiles, i.e. the smallest number of data points that ends up in any quintile across the entire dataset.
The expected output would look like this:
structure(list(Date = structure(c(14640, 14640, 14640, 14640,
14640, 14640, 14640, 14640, 14640, 14640, 14640, 14640, 14640,
14640, 14640, 14640, 14668, 14668, 14668, 14668, 14668, 14668,
14668, 14668, 14668, 14668, 14668, 14668, 14668, 14668), class = "Date"),
Value = c(1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 15, NA, NA, NA, NA,
NA, 2, 3, 4, 5, 6, 7, 8, 9, 15, NA, NA, NA, NA, NA), R = c(1L,
1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, NA, NA, NA, NA,
NA, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, NA, NA, NA, NA, NA
), S_P = c("Worst", "Worst", "Worst", NA, NA, NA, NA, "Best",
"Best", "Best", NA, NA, NA, NA, NA, NA, "Worst", "Worst", NA, NA,
NA, NA, NA, "Best", "Best", NA, NA, NA, NA, NA)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))

You could probably use something like this with quantile:
library(dplyr)
out <- CombData %>%
  group_by(Date) %>%
  mutate(S_P = case_when(Value <= quantile(Value, 0.2, na.rm = TRUE) ~ 'Worst',
                         Value >= quantile(Value, 0.8, na.rm = TRUE) ~ 'Best'))
You could change the value of quantile according to your preference.
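If you want all five quintile groups rather than just the two tails, here is a minimal sketch with dplyr's ntile() (out5 and Quintile are made-up names; ntile() returns NA for missing Values, so NAs stay unassigned):
out5 <- CombData %>%
  group_by(Date) %>%
  mutate(Quintile = ntile(Value, 5)) %>%   # 1 = lowest fifth, 5 = highest fifth per Date
  ungroup()
# smallest group size per quintile across the entire dataset
out5 %>%
  filter(!is.na(Quintile)) %>%
  count(Date, Quintile) %>%
  group_by(Quintile) %>%
  summarise(min_size = min(n))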
To get the minimum number of "Best" and "Worst" per group, we can do:
out %>%
  count(Date, S_P) %>%
  na.omit() %>%
  ungroup() %>%
  select(-Date) %>%
  group_by(S_P) %>%
  top_n(-1, n)
# S_P n
# <chr> <int>
#1 Best 2
#2 Worst 2
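top_n() is superseded in current dplyr; the same minimum can be had with slice_min() (a sketch of the equivalent call):
out %>%
  count(Date, S_P) %>%
  na.omit() %>%
  ungroup() %>%
  select(-Date) %>%
  group_by(S_P) %>%
  slice_min(n, n = 1)   # keeps the smallest count per group, like top_n(-1, n)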

If I understand you correctly, you want to rank your column 'Value' and mark values whose rank falls below the 20% quantile as "worst" and those above the 80% quantile as "best". After that you want a table.
You could use ave for both the ranking and the quantile identification. The two quantile cutoffs yield three groups, which you can identify with findInterval, encode as a factor variable, and label at will. I'm not sure, though, which ranks should be included in the quantiles, so I put the E_P coding in two separate columns for comparison purposes.
dat2 <- within(dat, {
  R <- ave(Value, Date, FUN=function(x) rank(x, na.last="keep"))
  E_P <- ave(R, Date, FUN=function(x) {
    findInterval(x, quantile(R, c(.2, .8), na.rm=TRUE))
  })
  E_P.fac <- factor(E_P, labels=c("worst", NA, "best"))
})
dat2 <- dat2[order(dat2$Date, dat2$E_P), ] ## order by date and E_P
Yields:
dat2
# Date Value E_P.fac E_P R
# 1 2010-01-31 1 worst 0 1.5
# 16 2010-01-31 1 worst 0 1.5
# 2 2010-01-31 2 <NA> 1 3.0
# 3 2010-01-31 3 <NA> 1 4.0
# 4 2010-01-31 4 <NA> 1 5.0
# 5 2010-01-31 5 <NA> 1 6.0
# 6 2010-01-31 6 <NA> 1 7.0
# 7 2010-01-31 7 <NA> 1 8.0
# 8 2010-01-31 8 best 2 9.0
# 9 2010-01-31 9 best 2 10.0
# 15 2010-01-31 15 best 2 11.0
# 10 2010-01-31 NA <NA> NA NA
# 11 2010-01-31 NA <NA> NA NA
# 12 2010-01-31 NA <NA> NA NA
# 13 2010-01-31 NA <NA> NA NA
# 14 2010-01-31 NA <NA> NA NA
# 17 2010-02-28 2 worst 0 1.0
# 18 2010-02-28 3 worst 0 2.0
# 19 2010-02-28 4 <NA> 1 3.0
# 20 2010-02-28 5 <NA> 1 4.0
# 21 2010-02-28 6 <NA> 1 5.0
# 22 2010-02-28 7 <NA> 1 6.0
# 23 2010-02-28 8 <NA> 1 7.0
# 24 2010-02-28 9 <NA> 1 8.0
# 30 2010-02-28 15 best 2 9.0
# 25 2010-02-28 NA <NA> NA NA
# 26 2010-02-28 NA <NA> NA NA
# 27 2010-02-28 NA <NA> NA NA
# 28 2010-02-28 NA <NA> NA NA
# 29 2010-02-28 NA <NA> NA NA
When I check the 20% and 80% quantiles of the rank column, it appears to be right.
quantile(dat2$R, c(.2, .8), na.rm=TRUE)
# 20% 80%
# 2.8 8.2
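Note that these cutoffs are computed from the full R column across both dates. If you want to compare them with per-date cutoffs, a quick sketch:
# 20%/80% quantiles of the ranks within each Date
with(dat2, tapply(R, Date, quantile, probs = c(.2, .8), na.rm = TRUE))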
After that you could just make a table to get the numbers of each category.
with(dat2, table(Date, E_P.fac))
# E_P.fac
# Date worst <NA> best
# 2010-01-31 2 6 3
# 2010-02-28 2 6 1
Data
dat <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("2010-01-31", "2010-02-28"
), class = "factor"), Value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA,
NA, NA, NA, NA, 15, 1, 2, 3, 4, 5, 6, 7, 8, 9, NA, NA, NA, NA,
NA, 15)), row.names = c(NA, -30L), class = "data.frame")

Related

Sort rows grouped by grep alphabetically

I have a dataframe with a column full of adverse events, but also relationships of these adverse events to the procedure, like this:
df <- data.frame(
adverse_event = c(
"Haemorrhage", "related", "likely related",
"Other", "related", "likely related", "Pain", "related", "likely related",
"Subcapsular hematoma", "related", "likely related", "Ascites",
"related", "likely related", "Hyperbilirubinemia", "related",
"likely related", "Liver abscess", "related", "likely related",
"Pleural effusion with drainage", "related", "likely related",
"Pneumothorax", "related", "likely related", "Biliary leakage / occlusion / fistula",
"related", "likely related", "Portal vein thrombosis", "related",
"likely related", "Sepsis", "related", "likely related"
),
grade_1 = c(
4L, 4L, 0L, 3L, 6L, 1L, 8L, 4L, 5L, 3L, 1L, 3L, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA
),
grade_2 = c(
2L, 3L, 0L, 11L, 3L, 7L, 2L, 4L, 2L, 1L, 2L, 0L, 1L, 1L, 0L,
1L, 0L, 2L, 1L, 1L, 0L, 1L, 2L, 1L, 1L, 1L, 0L, NA, NA, NA, NA,
NA, NA, NA, NA, NA
),
grade_3 = c(
1L, 4L, 1L, 5L, 3L, 2L, 2L, 5L, 1L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 4L, 5L, 1L, NA, NA, NA, 1L, 1L, 0L, 1L, 2L, 0L, 1L,
1L, 0L, 1L, 1L, 0L
),
grade_4 = c(
2L, 4L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA
)
)
Now I'd like to sort the adverse events alphabetically but of course keep the "related" and "likely related" rows together with their individual adverse event rows, so I'd like to somehow group them first.
In this example it's always 3 rows, but let's assume it could sometimes be 2, 4 or 5 rows too (all rows except the adverse event rows contain "related" somewhere in the string/name, e.g. 'unlikely related').
I know I can get the indices of the adverse event rows with
grep('related', df$adverse_event, invert = TRUE), but I'm unsure how to use this to group the rows together before sorting them.
Edit: Beginning of the left column of the desired output:
expected_output_left_column <- data.frame(adverse_event = c(
"Ascites", "related", "likely related",
"Biliary leakage / occlusion / fistula", "related", "likely related" ) )
Thank you!
Another solution using base R and the lead function from dplyr:
library(dplyr) # for lead()
# indices where each group starts
id <- grep('related', df$adverse_event, invert = TRUE)
# size of each group
size <- lead(id) - id
size_of_last_group <- nrow(df) - id[length(id)] + 1
size[length(size)] <- size_of_last_group
# add a sort key: each row prefixed with the adverse event heading its group
df$id <- paste0(rep(df$adverse_event[id], times = size),
                df$adverse_event)
# order by the key
df <- df[order(df$id), ]
# remove the key
df$id <- NULL
You can do the following:
library(dplyr)
left_join(
df,
df %>%
filter(!grepl('related',adverse_event)) %>%
select(adverse_event) %>%
arrange(adverse_event) %>%
mutate(o = row_number())
) %>%
mutate(o = data.table::nafill(o, "locf")) %>%
arrange(o) %>%
select(-o)
Output:
adverse_event grade_1 grade_2 grade_3 grade_4
1 Ascites NA 1 NA NA
2 related NA 1 NA NA
3 likely related NA 0 NA NA
4 Biliary leakage / occlusion / fistula NA NA 1 NA
5 related NA NA 2 NA
6 likely related NA NA 0 NA
7 Haemorrhage 4 2 1 2
8 related 4 3 4 4
9 likely related 0 0 1 1
10 Hyperbilirubinemia NA 1 NA NA
11 related NA 0 NA NA
12 likely related NA 2 NA NA
13 Liver abscess NA 1 4 NA
14 related NA 1 5 NA
15 likely related NA 0 1 NA
16 Other 3 11 5 NA
17 related 6 3 3 NA
18 likely related 1 7 2 NA
19 Pain 8 2 2 NA
20 related 4 4 5 NA
21 likely related 5 2 1 NA
22 Pleural effusion with drainage NA 1 NA NA
23 related NA 2 NA NA
24 likely related NA 1 NA NA
25 Pneumothorax NA 1 1 NA
26 related NA 1 1 NA
27 likely related NA 0 0 NA
28 Portal vein thrombosis NA NA 1 NA
29 related NA NA 1 NA
30 likely related NA NA 0 NA
31 Sepsis NA NA 1 NA
32 related NA NA 1 NA
33 likely related NA NA 0 NA
34 Subcapsular hematoma 3 1 NA NA
35 related 1 2 NA NA
36 likely related 3 0 NA NA
Note that this uses data.table::nafill(). A full data.table solution is as follows:
library(data.table)
setDT(df)
data.table(adverse_event = sort(df[!grepl('related',adverse_event), adverse_event]))[, o:=.I][
df, on="adverse_event"][, o:=nafill(o, "locf")][order(o), !c("o")]
Add a "group" variable and sort
tmp=!grepl("related",df$adverse_event)
df$grp=cumsum(tmp)
df[order(match(df$grp,order(df$adverse_event[tmp]))),]
adverse_event grade_1 grade_2 grade_3 grade_4 grp
13 Ascites NA 1 NA NA 5
14 related NA 1 NA NA 5
15 likely related NA 0 NA NA 5
28 Biliary leakage / occlusion / fistula NA NA 1 NA 10
29 related NA NA 2 NA 10
30 likely related NA NA 0 NA 10
1 Haemorrhage 4 2 1 2 1
2 related 4 3 4 4 1
3 likely related 0 0 1 1 1
16 Hyperbilirubinemia NA 1 NA NA 6
17 related NA 0 NA NA 6
18 likely related NA 2 NA NA 6
19 Liver abscess NA 1 4 NA 7
20 related NA 1 5 NA 7
21 likely related NA 0 1 NA 7
4 Other 3 11 5 NA 2
5 related 6 3 3 NA 2
6 likely related 1 7 2 NA 2
7 Pain 8 2 2 NA 3
8 related 4 4 5 NA 3
9 likely related 5 2 1 NA 3
22 Pleural effusion with drainage NA 1 NA NA 8
23 related NA 2 NA NA 8
24 likely related NA 1 NA NA 8
25 Pneumothorax NA 1 1 NA 9
26 related NA 1 1 NA 9
27 likely related NA 0 0 NA 9
31 Portal vein thrombosis NA NA 1 NA 11
32 related NA NA 1 NA 11
33 likely related NA NA 0 NA 11
34 Sepsis NA NA 1 NA 12
35 related NA NA 1 NA 12
36 likely related NA NA 0 NA 12
10 Subcapsular hematoma 3 1 NA NA 4
11 related 1 2 NA NA 4
12 likely related 3 0 NA NA 4
Just to throw in another tidyverse solution:
library(tidyr)
library(dplyr)
df %>%
mutate(grp = if_else(grepl("related", adverse_event),
NA_character_,
adverse_event)) %>%
fill(grp) %>%
nest(data = -grp) %>%
arrange(grp) %>%
unnest(cols = data) %>%
select(-grp)
# # A tibble: 36 × 5
# adverse_event grade_1 grade_2 grade_3 grade_4
# <chr> <int> <int> <int> <int>
# 1 Ascites NA 1 NA NA
# 2 related NA 1 NA NA
# 3 likely related NA 0 NA NA
# 4 Biliary leakage / occlusion / fistula NA NA 1 NA
# 5 related NA NA 2 NA
# 6 likely related NA NA 0 NA
# 7 Haemorrhage 4 2 1 2
# 8 related 4 3 4 4
# 9 likely related 0 0 1 1
# 10 Hyperbilirubinemia NA 1 NA NA
# 11 related NA 0 NA NA
# 12 likely related NA 2 NA NA
# 13 Liver abscess NA 1 4 NA
# 14 related NA 1 5 NA
# 15 likely related NA 0 1 NA
# 16 Other 3 11 5 NA
# 17 related 6 3 3 NA
# 18 likely related 1 7 2 NA
# 19 Pain 8 2 2 NA
# 20 related 4 4 5 NA
# 21 likely related 5 2 1 NA
# 22 Pleural effusion with drainage NA 1 NA NA
# 23 related NA 2 NA NA
# 24 likely related NA 1 NA NA
# 25 Pneumothorax NA 1 1 NA
# 26 related NA 1 1 NA
# 27 likely related NA 0 0 NA
# 28 Portal vein thrombosis NA NA 1 NA
# 29 related NA NA 1 NA
# 30 likely related NA NA 0 NA
# 31 Sepsis NA NA 1 NA
# 32 related NA NA 1 NA
# 33 likely related NA NA 0 NA
# 34 Subcapsular hematoma 3 1 NA NA
# 35 related 1 2 NA NA
# 36 likely related 3 0 NA NA
Explanation
mutate + fill: Label each adverse_event with the stem, i.e. re-label all related records with the corresponding event above.
Nest all columns, but keep the newly created grp column, which bears the name of the stem adverse event.
Sort the adverse event stems.
Unnest the rows again.
Remove the grp column.
An approach using rank, demonstrated on an extended data set with 4 entries for "Ascites".
library(dplyr)
df %>%
mutate(ord = !grepl("related", adverse_event),
grp = cumsum(ord),
Rank = rank(adverse_event[ord])[grp]) %>%
arrange(Rank) %>%
select(-c(ord, grp, Rank))
adverse_event grade_1 grade_2 grade_3 grade_4
1 Ascites NA 1 NA NA
2 related NA 1 NA NA
3 related NA 1 NA NA
4 likely related NA 0 NA NA
5 Biliary leakage / occlusion / fistula NA NA 1 NA
6 related NA NA 2 NA
7 likely related NA NA 0 NA
8 Haemorrhage 4 2 1 2
9 related 4 3 4 4
10 likely related 0 0 1 1
11 Hyperbilirubinemia NA 1 NA NA
12 related NA 0 NA NA
13 likely related NA 2 NA NA
14 Liver abscess NA 1 4 NA
15 related NA 1 5 NA
16 likely related NA 0 1 NA
17 Other 3 11 5 NA
18 related 6 3 3 NA
19 likely related 1 7 2 NA
20 Pain 8 2 2 NA
21 related 4 4 5 NA
22 likely related 5 2 1 NA
23 Pleural effusion with drainage NA 1 NA NA
24 related NA 2 NA NA
25 likely related NA 1 NA NA
26 Pneumothorax NA 1 1 NA
27 related NA 1 1 NA
28 likely related NA 0 0 NA
29 Portal vein thrombosis NA NA 1 NA
30 related NA NA 1 NA
31 likely related NA NA 0 NA
32 Sepsis NA NA 1 NA
33 related NA NA 1 NA
34 likely related NA NA 0 NA
35 Subcapsular hematoma 3 1 NA NA
36 related 1 2 NA NA
37 likely related 3 0 NA NA
Extended data
df <- structure(list(adverse_event = c("Haemorrhage", "related", "likely related",
"Other", "related", "likely related", "Pain", "related", "likely related",
"Subcapsular hematoma", "related", "likely related", "Ascites",
"related", "related", "likely related", "Hyperbilirubinemia",
"related", "likely related", "Liver abscess", "related", "likely related",
"Pleural effusion with drainage", "related", "likely related",
"Pneumothorax", "related", "likely related", "Biliary leakage / occlusion / fistula",
"related", "likely related", "Portal vein thrombosis", "related",
"likely related", "Sepsis", "related", "likely related"), grade_1 = c(4L,
4L, 0L, 3L, 6L, 1L, 8L, 4L, 5L, 3L, 1L, 3L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), grade_2 = c(2L, 3L, 0L, 11L, 3L, 7L, 2L, 4L,
2L, 1L, 2L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 2L, 1L, 1L, 0L, 1L, 2L,
1L, 1L, 1L, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA), grade_3 = c(1L,
4L, 1L, 5L, 3L, 2L, 2L, 5L, 1L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 4L, 5L, 1L, NA, NA, NA, 1L, 1L, 0L, 1L, 2L, 0L, 1L, 1L,
0L, 1L, 1L, 0L), grade_4 = c(2L, 4L, 1L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
37L), class = "data.frame")
Here is a benchmark of the different suggestions, if needed:
library(bench)
library(dplyr)
library(data.table)
library(tidyr)
df <- data.frame(
adverse_event = c(
"Haemorrhage", "related", "likely related",
"Other", "related", "likely related", "Pain", "related", "likely related",
"Subcapsular hematoma", "related", "likely related", "Ascites",
"related", "likely related", "Hyperbilirubinemia", "related",
"likely related", "Liver abscess", "related", "likely related",
"Pleural effusion with drainage", "related", "likely related",
"Pneumothorax", "related", "likely related", "Biliary leakage / occlusion / fistula",
"related", "likely related", "Portal vein thrombosis", "related",
"likely related", "Sepsis", "related", "likely related"
),
grade_1 = c(
4L, 4L, 0L, 3L, 6L, 1L, 8L, 4L, 5L, 3L, 1L, 3L, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA
),
grade_2 = c(
2L, 3L, 0L, 11L, 3L, 7L, 2L, 4L, 2L, 1L, 2L, 0L, 1L, 1L, 0L,
1L, 0L, 2L, 1L, 1L, 0L, 1L, 2L, 1L, 1L, 1L, 0L, NA, NA, NA, NA,
NA, NA, NA, NA, NA
),
grade_3 = c(
1L, 4L, 1L, 5L, 3L, 2L, 2L, 5L, 1L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 4L, 5L, 1L, NA, NA, NA, 1L, 1L, 0L, 1L, 2L, 0L, 1L,
1L, 0L, 1L, 1L, 0L
),
grade_4 = c(
2L, 4L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA
)
)
paul_carteron <- function(df){
# where start each group
id <- grep('related', df$adverse_event, invert = T)
# size of each group
size <- lead(id) - id
size_of_last_group <- nrow(df) - id[length(id)] + 1
size[length(size)] <- size_of_last_group
# add col with id
df$id <- paste0(rep(df$adverse_event[id], times = size),
df$adverse_event)
# order
df <- df[order(df$id), ]
# remove id and return the reordered data frame
df$id <- NULL
df
}
lang_tang_dplyr <- function(df){
left_join(
df,
df %>%
filter(!grepl('related', adverse_event)) %>%
select(adverse_event) %>%
arrange(adverse_event) %>%
mutate(o = row_number())
) %>%
mutate(o = data.table::nafill(o, "locf")) %>%
arrange(o) %>%
select(-o)
}
lang_tang_databable <- function(df) {
setDT(df)
data.table(adverse_event = sort(df[!grepl('related',adverse_event), adverse_event]))[, o:=.I][
df, on="adverse_event"][, o:=nafill(o, "locf")][order(o), !c("o")]
}
andre_wilberg <- function(df){
df %>%
mutate(ord = !grepl("related", adverse_event),
grp = cumsum(ord),
Rank = rank(adverse_event[ord])[grp]) %>%
arrange(Rank) %>%
select(-c(ord, grp, Rank))
}
thotal <- function(df){
df %>%
mutate(grp = if_else(grepl("related", adverse_event),
NA_character_,
adverse_event)) %>%
fill(grp) %>%
nest(data = -grp) %>%
arrange(grp) %>%
unnest(cols = data) %>%
select(-grp)
}
results = bench::mark(
iterations = 1000, check = FALSE, time_unit = "s", filter_gc = FALSE,
paul_carteron = paul_carteron(df),
lang_tang_dplyr = lang_tang_dplyr(df),
lang_tang_databable = lang_tang_databable(df),
andre_wilberg = andre_wilberg(df),
thotal = thotal(df)
)
plot(results)
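If you prefer numbers to the plot, the object returned by bench::mark() is a tibble, so you can, for example, print a few of its columns directly (a sketch):
# median run time and memory allocation per suggestion
results[, c("expression", "median", "mem_alloc")]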

Impute missing values with a value from previous month (if exists)

I have a dataframe with more than 100 000 rows and 30 000 unique ids.
My aim is to fill all the NAs among the different columns if there is a value from the previous month for the same id. However, most of the time the previously recorded value is from more than a month ago. Those NAs I would like to leave untouched.
The id column and the date column do not have NAs.
Here is an example of the data I have:
df3
id oxygen gluco dias bp date
1 0,25897842 0,20201604 0,17955655 0,14100962 31.7.2019
2 NA NA 0,38582622 0,12918231 31.12.2014
2 0,35817147 0,32943499 NA 0,43667462 30.11.2018
2 0,68557053 0,42898807 0,93897514 NA 31.10.2018
2 NA NA 0,99899076 0,44168223 31.7.2018
2 0,43848054 0,38604586 NA NA 30.4.2013
2 0,15823254 0,06216771 0,07829624 0,69755251 31.1.2016
2 NA NA 0,61645303 NA 29.2.2016
2 0,94671363 0,50682091 0,96770222 0,97403356 31.5.2018
3 NA 0,77352235 0,660479 0,11554399 30.4.2019
3 0,15567703 NA 0,4553325 NA 31.3.2017
3 NA NA 0,22181609 0,08527658 30.9.2017
3 0,93660763 NA NA NA 31.3.2018
3 0,73416759 NA NA 0,78501791 30.11.2018
3 NA NA NA NA 28.2.2019
3 0,84525106 0,54360374 NA 0,40595426 31.8.2014
3 0,76221263 0,62983336 0,84592719 0,10640734 31.8.2013
4 NA 0,29108942 0,3863479 NA 31.1.2018
4 0,74075742 NA 0,38117415 0,58849266 30.11.2018
4 0,09400641 0,68860814 NA 0,88895224 31.8.2014
4 0,72202944 0,49901387 0,19967415 NA 31.8.2018
4 0,98205262 0,85213969 0,34450998 0,98962306 30.11.2013
This is the last code implementation that I have tried:
df3 %>%
  group_by(id) %>%
  mutate_all(funs(na.locf(., na.rm = FALSE, maxgap = 30)))
But apparently "mutate_all() ignored the following grouping variables: Column id".
You can use the tidyverse for that. Here's an approach:
Convert the date column to class Date, then order by date.
Truncate the dates to year-month in Ym, dropping the day.
Get the time difference in months in mo.
Flag the rows that are at most one month apart.
Build groups by taking the cumsum of the negated flag.
Fill the rows within the same groups.
library(dplyr)
library(tidyr)
library(lubridate)
df$date <- as.Date(df$date, format="%d.%m.%Y")
df %>%
  arrange(date) %>%
  mutate(
    Ym = ym(strftime(date, "%Y-%m")),
    mo = interval(Ym, lag(Ym, default=as.Date("1970-01-01"))) / months(1),
    flag = cumsum(!(mo > -2 & mo < 1))) %>%
  group_by(id, flag) %>%
  fill(everything(), .direction="down") %>%
  ungroup() %>%
  select(-c("Ym","mo","flag")) %>%
  print(n=nrow(.))
Output
# A tibble: 22 × 6
id oxygen gluco dias bp date
<int> <chr> <chr> <chr> <chr> <date>
1 2 0,43848054 0,38604586 NA NA 2013-04-30
2 3 0,76221263 0,62983336 0,84592719 0,10640734 2013-08-31
3 4 0,98205262 0,85213969 0,34450998 0,98962306 2013-11-30
4 3 0,84525106 0,54360374 NA 0,40595426 2014-08-31
5 4 0,09400641 0,68860814 NA 0,88895224 2014-08-31
6 2 NA NA 0,38582622 0,12918231 2014-12-31
7 2 0,15823254 0,06216771 0,07829624 0,69755251 2016-01-31
8 2 0,15823254 0,06216771 0,61645303 0,69755251 2016-02-29
9 3 0,15567703 NA 0,4553325 NA 2017-03-31
10 3 NA NA 0,22181609 0,08527658 2017-09-30
11 4 NA 0,29108942 0,3863479 NA 2018-01-31
12 3 0,93660763 NA NA NA 2018-03-31
13 2 0,94671363 0,50682091 0,96770222 0,97403356 2018-05-31
14 2 NA NA 0,99899076 0,44168223 2018-07-31
15 4 0,72202944 0,49901387 0,19967415 NA 2018-08-31
16 2 0,68557053 0,42898807 0,93897514 NA 2018-10-31
17 2 0,35817147 0,32943499 0,93897514 0,43667462 2018-11-30
18 3 0,73416759 NA NA 0,78501791 2018-11-30
19 4 0,74075742 NA 0,38117415 0,58849266 2018-11-30
20 3 NA NA NA NA 2019-02-28
21 3 NA 0,77352235 0,660479 0,11554399 2019-04-30
22 1 0,25897842 0,20201604 0,17955655 0,14100962 2019-07-31
Data
df <- structure(list(id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), oxygen = c("0,25897842",
NA, "0,35817147", "0,68557053", NA, "0,43848054", "0,15823254",
NA, "0,94671363", NA, "0,15567703", NA, "0,93660763", "0,73416759",
NA, "0,84525106", "0,76221263", NA, "0,74075742", "0,09400641",
"0,72202944", "0,98205262"), gluco = c("0,20201604", NA, "0,32943499",
"0,42898807", NA, "0,38604586", "0,06216771", NA, "0,50682091",
"0,77352235", NA, NA, NA, NA, NA, "0,54360374", "0,62983336",
"0,29108942", NA, "0,68860814", "0,49901387", "0,85213969"),
dias = c("0,17955655", "0,38582622", NA, "0,93897514", "0,99899076",
NA, "0,07829624", "0,61645303", "0,96770222", "0,660479",
"0,4553325", "0,22181609", NA, NA, NA, NA, "0,84592719",
"0,3863479", "0,38117415", NA, "0,19967415", "0,34450998"
), bp = c("0,14100962", "0,12918231", "0,43667462", NA, "0,44168223",
NA, "0,69755251", NA, "0,97403356", "0,11554399", NA, "0,08527658",
NA, "0,78501791", NA, "0,40595426", "0,10640734", NA, "0,58849266",
"0,88895224", NA, "0,98962306"), date = structure(c(18108,
16435, 17865, 17835, 17743, 15825, 16831, 16860, 17682, 18016,
17256, 17439, 17621, 17865, 17955, 16313, 15948, 17562, 17865,
16313, 17774, 16039), class = "Date")), row.names = c(NA,
-22L), class = "data.frame")

Analyse multiple rows depending on the value of the first?

I have this table :
record_id result date_start date_end
1 1 pos
2 1 26/06/2019 28/06/2019
3 1 27/06/2019 29/06/2019
4 1 28/06/2019 30/06/2019
5 1 29/06/2019 01/07/2019
6 2 neg
7 2 01/07/2019 03/07/2019
8 2 02/07/2019 04/07/2019
9 2 03/07/2019 05/07/2019
10 2 04/07/2019 06/07/2019
11 2 05/07/2019 07/07/2019
12 3 pos
13 3 07/07/2019 09/07/2019
14 3 08/07/2019 10/07/2019
I want to calculate the date difference for each row; no problem with that. What I want after that is to analyse the "pos" and "neg" groups separately. But the rows that have the dates do not have the value of result in my data. This is data imported from REDCap, with repeating instruments.
I use the tidyverse, and I think dplyr could help. Is it a pivot_wider I must do? I've tried, but no luck...
Thanks if anyone can help!
Like this, e.g. to calculate the mean date difference per group?
library(tidyverse)
library(lubridate)
df %>%
  fill(result, .direction = "down") %>%
  filter(!is.na(date_start)) %>%
  mutate(date_start = dmy(date_start),
         date_end = dmy(date_end)) %>%
  group_by(result) %>%
  summarise(mean_date_dif = mean(date_end - date_start))
#`summarise()` ungrouping output (override with `.groups` argument)
## A tibble: 2 x 2
# result mean_date_dif
# <chr> <drtn>
#1 neg 2 days
#2 pos 2 days
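If you then want to analyse the "pos" and "neg" groups separately rather than summarise them, one option is group_split(), which returns a list with one tibble per result group (a sketch building on the same pipeline):
df %>%
  fill(result, .direction = "down") %>%
  filter(!is.na(date_start)) %>%
  mutate(date_start = dmy(date_start),
         date_end = dmy(date_end)) %>%
  group_split(result)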
Data
df <- tibble::tribble(
~record_id, ~result, ~date_start, ~date_end,
1L, "pos", NA, NA,
1L, NA, "26/06/2019", "28/06/2019",
1L, NA, "27/06/2019", "29/06/2019",
1L, NA, "28/06/2019", "30/06/2019",
1L, NA, "29/06/2019", "01/07/2019",
2L, "neg", NA, NA,
2L, NA, "01/07/2019", "03/07/2019",
2L, NA, "02/07/2019", "04/07/2019",
2L, NA, "03/07/2019", "05/07/2019",
2L, NA, "04/07/2019", "06/07/2019",
2L, NA, "05/07/2019", "07/07/2019",
3L, "pos", NA, NA,
3L, NA, "07/07/2019", "09/07/2019",
3L, NA, "08/07/2019", "10/07/2019"
)

Vectorizing loop operation in R

I have a long-format balanced data frame (df1) that has 7 columns:
df1 <- structure(list(Product_ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3), Product_Category = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1950, 1950, 1950, 1950, 1960,
1960, 1960, 1960, 1960, 1940, 1940, 1940, 1940, 1940), Control_Date = c(1961L,
1962L, 1963L, 1964L, 1965L, 1961L, 1962L, 1963L, 1964L, 1965L,
1961L, 1962L, 1963L, 1964L, 1965L), Country_Code = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Var1 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Var2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
15L), class = "data.frame")
Each Product_ID in this data set is linked with a unique Product_Category and Country_Code and Manufacture_Date, and is followed over time (Control_Date). Product_Category has two possible values (A or B); Country_Code and Manufacture_Date have 190 and 90 unique values, respectively. There are 400,000 unique Product_ID's, that are followed over a period of 50 years (Control_Date from 1961 to 2010). This means that df1 has 20,000,000 rows. The last two columns of this data frame are NA at the beginning and have to be filled using the data available in another data frame (df2):
df2 <- structure(list(Product_ID = 1:6, Product_Category = structure(c(1L,
2L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
Country_Code = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Year_1961 = c(5, NA, 10,
NA, 6, NA), Year_1962 = c(NA, NA, 4, 5, 3, NA), Year_1963 = c(8,
6, NA, 5, 6, NA), Year_1964 = c(NA, NA, 9, NA, 10, NA), Year_1965 = c(6,
NA, 7, 4, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
This second data frame contains another type of information on the exact same 400,000 products, in wide-format. Each row represents a unique product (Product_ID) accompanied by its Product_Category, Manufacture_Date and Country_Code. There are 50 other columns (for each year from 1961 to 2010) that contain a measured value (or NA) for each product in each of those years.
Now what I would like to do is to fill in the Var1 & Var2 columns in the first data frame, by doing some calculation on the data available in the second data frame. More precisely, for each row in the first data frame (i.e. a product at Control_Date "t"), the last two columns are defined as follows:
Var1: total number of products in df2 with the same Product_Category, Manufacture_Date and Country_Code that have non-NA value in Year_t;
Var2: total number of products in df2 with different Product_Category but the same Manufacture_Date and Country_Code that have non-NA value in Year_t.
My initial solution with nested for-loops is as follows:
for (i in unique(df1$Product_ID)){
  Category <- unique(df1[which(df1$Product_ID==i),"Product_Category"])
  Opposite_Category <- ifelse(Category=="A","B","A")
  Manufacture <- unique(df1[which(df1$Product_ID==i),"Manufacture_Date"])
  Country <- unique(df1[which(df1$Product_ID==i),"Country_Code"])
  ID_Similar_Product <- df2[which(df2$Product_Category==Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
  ID_Quasi_Similar_Product <- df2[which(df2$Product_Category==Opposite_Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
  for (j in unique(df1$Control_Date)){
    df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var1"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Similar_Product),paste0("Year_",j)])))
    df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var2"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Quasi_Similar_Product),paste0("Year_",j)])))
  }
}
The problem with this approach is that it takes a long time to run. So I would like to know if anybody could suggest a vectorized version that would do the job in less time.
See if this does what you want. I'm using the data.table package since you have a rather large (20M) dataset.
library(data.table)
setDT(df1)
setDT(df2)
# Set keys on the "triplet" to speed up everything
setkey(df1, Product_Category, Manufacture_Date, Country_Code)
setkey(df2, Product_Category, Manufacture_Date, Country_Code)
# Omit the Var1 and Var2 from df1
df1[, c("Var1", "Var2") := NULL]
# Reshape df2 to long form
df2.long <- melt(df2, measure=patterns("^Year_"))
# Split "variable" at the "_" to extract 4-digit year into "Control_Date" and delete leftovers.
df2.long[, c("variable","Control_Date") := tstrsplit(variable, "_", fixed=TRUE)][
, variable := NULL]
# Group by triplet, Var1=count non-NA in value, join with...
# (Group by doublet, N=count non-NA), update Var2=N-Var1.
df2_N <- df2.long[, .(Var1 = sum(!is.na(value))),
by=.(Product_Category, Manufacture_Date, Country_Code)][
df2.long[, .(N = sum(!is.na(value))),
by=.(Manufacture_Date, Country_Code)],
Var2 := N - Var1, on=c("Manufacture_Date", "Country_Code")]
# Update join: df1 with df2_N
df1[df2_N, c("Var1","Var2") := .(i.Var1, i.Var2),
on = .(Product_Category, Manufacture_Date, Country_Code)]
df1
Product_ID Product_Category Manufacture_Date Control_Date Country_Code Var1 Var2
1: 3 A 1940 1961 GHI 4 0
2: 3 A 1940 1962 GHI 4 0
3: 3 A 1940 1963 GHI 4 0
4: 3 A 1940 1964 GHI 4 0
5: 3 A 1940 1965 GHI 4 0
6: 1 A 1950 1961 ABC 6 0
7: 1 A 1950 1962 ABC 6 0
8: 1 A 1950 1963 ABC 6 0
9: 1 A 1950 1964 ABC 6 0
10: 1 A 1950 1965 ABC 6 0
11: 2 B 1960 1961 DEF NA NA
12: 2 B 1960 1962 DEF NA NA
13: 2 B 1960 1963 DEF NA NA
14: 2 B 1960 1964 DEF NA NA
15: 2 B 1960 1965 DEF NA NA
df2
Product_ID Product_Category Manufacture_Date Country_Code Year_1961 Year_1962 Year_1963 Year_1964 Year_1965
1: 5 A 1940 DEF 6 3 6 10 NA
2: 3 A 1940 GHI 10 4 NA 9 7
3: 1 A 1950 ABC 5 NA 8 NA 6
4: 4 A 1950 ABC NA 5 5 NA 4
5: 2 B 1940 DEF NA NA 6 NA NA
6: 6 B 2000 GHI NA NA NA NA NA

Dplyr - Mean for multiple columns

I want to calculate the mean for several columns and thus create a new column for the mean using dplyr and without melting + merging.
> head(growth2)
CODE_COUNTRY CODE_PLOT IV12_ha_yr IV23_ha_yr IV34_ha_yr IV14_ha_yr IV24_ha_yr IV13_ha_yr
1 1 6 4.10 6.97 NA NA NA 4.58
2 1 17 9.88 8.75 NA NA NA 8.25
3 1 30 NA NA NA NA NA NA
4 1 37 15.43 15.07 11.89 10.00 12.09 14.33
5 1 41 20.21 15.01 14.72 11.31 13.27 17.09
6 1 46 12.64 14.36 13.65 9.07 12.47 12.36
>
I need a new column within the dataset with the mean of all the IV columns.
I tried this:
growth2 %>%
  group_by(CODE_COUNTRY, CODE_PLOT) %>%
  summarise(IVmean = mean(IV12_ha_yr:IV13_ha_yr, na.rm = TRUE))
It returned several errors depending on the example used, such as:
Error in NA_real_:NA_real_ : NA/NaN argument
or
Error in if (trim > 0 && n) { : missing value where TRUE/FALSE needed
You don't need to group, just select() and then mutate()
library(dplyr)
mutate(df, IVMean = rowMeans(select(df, starts_with("IV")), na.rm = TRUE))
Use . in dplyr.
library(dplyr)
mutate(df, IVMean = rowMeans(select(., starts_with("IV")), na.rm = TRUE))
Here is a dplyr solution using c_across which is designed for row-wise aggregations. This makes it easy to refer to columns by name, type or position and to apply any function to the selected columns.
library("tidyverse")
df <-
tibble::tribble(
~CODE_COUNTRY, ~CODE_PLOT, ~IV12_ha_yr, ~IV23_ha_yr, ~IV34_ha_yr, ~IV14_ha_yr, ~IV24_ha_yr, ~IV13_ha_yr,
1L, 6L, 4.1, 6.97, NA, NA, NA, 4.58,
1L, 17L, 9.88, 8.75, NA, NA, NA, 8.25,
1L, 30L, NA, NA, NA, NA, NA, NA,
1L, 37L, 15.43, 15.07, 11.89, 10, 12.09, 14.33,
1L, 41L, 20.21, 15.01, 14.72, 11.31, 13.27, 17.09,
1L, 46L, 12.64, 14.36, 13.65, 9.07, 12.47, 12.36
)
df %>%
rowwise() %>%
mutate(
IV_mean = mean(c_across(starts_with("IV")), na.rm = TRUE),
IV_sd = sd(c_across(starts_with("IV")), na.rm = TRUE)
)
#> # A tibble: 6 × 10
#> # Rowwise:
#> CODE_COUNTRY CODE_PLOT IV12_ha_yr IV23_ha_yr IV34_ha_yr IV14_ha_yr IV24_ha_yr
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 4.1 6.97 NA NA NA
#> 2 1 17 9.88 8.75 NA NA NA
#> 3 1 30 NA NA NA NA NA
#> 4 1 37 15.4 15.1 11.9 10 12.1
#> 5 1 41 20.2 15.0 14.7 11.3 13.3
#> 6 1 46 12.6 14.4 13.6 9.07 12.5
#> # … with 3 more variables: IV13_ha_yr <dbl>, IV_mean <dbl>, IV_sd <dbl>
Created on 2022-06-25 by the reprex package (v2.0.1)
I tried to comment on Rick Scriven's answer but don't have the experience points for it. Anyway, wanted to contribute. His answer said to do this:
library(dplyr)
mutate(df, IVMean = rowMeans(select(df, starts_with("IV")), na.rm = TRUE))
That works, but what if not all columns start with "IV", as in my case? It turned out that select (in the dplyr version of the time) did not accept a logical vector, so you could not use AND or OR; for example, you could not say "starts_with('X') | starts_with('Y')". You had to combine the selections with c() instead. Here is how it is done:
mutate(df, IVMean = rowMeans(select(df, c(starts_with("IV"), starts_with("IX"))), na.rm = TRUE))
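For what it's worth, tidyselect 1.0 and later do accept boolean operators inside select(), so on a current dplyr the following sketch should work as well (assuming, as above, that your data has columns starting with both "IV" and "IX"):
mutate(df, IVMean = rowMeans(select(df, starts_with("IV") | starts_with("IX")), na.rm = TRUE))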
You can also do it in base R as follows. Your data:
data<- structure(list(CODE_COUNTRY = c(1L, 1L, 1L, 1L, 1L, 1L), CODE_PLOT = c(6L,
17L, 30L, 37L, 41L, 46L), IV12_ha_yr = c(4.1, 9.88, NA, 15.43,
20.21, 12.64), IV23_ha_yr = c(6.97, 8.75, NA, 15.07, 15.01, 14.36
), IV34_ha_yr = c(NA, NA, NA, 11.89, 14.72, 13.65), IV14_ha_yr = c(NA,
NA, NA, 10, 11.31, 9.07), IV24_ha_yr = c(NA, NA, NA, 12.09, 13.27,
12.47), IV13_ha_yr = c(4.58, 8.25, NA, 14.33, 17.09, 12.36)), .Names = c("CODE_COUNTRY",
"CODE_PLOT", "IV12_ha_yr", "IV23_ha_yr", "IV34_ha_yr", "IV14_ha_yr",
"IV24_ha_yr", "IV13_ha_yr"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
mydata <- cbind(data,IVMean=apply(data[,3:8],1,mean, na.rm=TRUE))
You can also do this:
mydata <- cbind(data,IVMean=rowMeans(data[3:8], na.rm=TRUE))
