Loop through datatable & alter values meeting a specific condition - r

I'm attempting to create a function that takes a data.table and a variable as arguments. The data.table will always have 4 columns (the 1st is a date column, the other 3 are numeric), but the number of rows will differ. The variable is an integer that acts as a cutoff. The goal of the function is to return the data.table with every value in the numeric columns that is greater than the first number larger than the cutoff set to NA. Here is a snippet of the data.table being tested:
#datatable dt
> dput(dt[30:40])
structure(list(a = structure(c(18517, 18524, 18531, 18538, 18545,
18552, 18559, 18566, 18573, 18580, 18587), class = "Date"), b = c(14L,
16L, 18L, 21L, 23L, 26L, 29L, 32L, 35L, 39L, 42L), c = c(9L,
10L, 12L, 14L, 16L, 18L, 21L, 23L, 26L, 29L, 32L), d = c(4L,
5L, 6L, 8L, 9L, 11L, 13L, 16L, 18L, 20L, 23L)), row.names = c(NA,
-11L), class = c("data.table", "data.frame"))
> dt[30:40]
a b c d
1: 2020-09-12 14 9 4
2: 2020-09-19 16 10 5
3: 2020-09-26 18 12 6
4: 2020-10-03 21 14 8
5: 2020-10-10 23 16 9
6: 2020-10-17 26 18 11
7: 2020-10-24 29 21 13
8: 2020-10-31 32 23 16
9: 2020-11-07 35 26 18
10: 2020-11-14 39 29 20
11: 2020-11-21 42 32 23
Here is the function I've come up with:
cutoff <- 21 #some integer
checkDT <- function(dt, cutoff){
columns <- c('b','c','d')
for (i in columns){
for (j in dt[,..columns]){
if(is.infinite(min(j[which(j > cutoff)]))){
dt <- dt
}else{
dt[i > min(j[which(j > cutoff)]), `:=` (i = NA)]
}
}
return(dt)
}
}
This outputs a data.table with a fifth column i that is all NA. If I use the statement below for a specific column, then the output is as expected, but I'd like the function to do this for all three columns to get rid of some lines of code.
if(is.infinite(min(dt$b[which(dt$b > cutoff)]))){
dt <- dt
} else{
dt[b > min(dt$b[which(dt$b > cutoff)]), `:=`(b = NA)]
}
> dt[30:40]
a b c d
1: 2020-09-12 14 9 4
2: 2020-09-19 16 10 5
3: 2020-09-26 18 12 6
4: 2020-10-03 21 14 8
5: 2020-10-10 23 16 9
6: 2020-10-17 NA 18 11
7: 2020-10-24 NA 21 13
8: 2020-10-31 NA 23 16
9: 2020-11-07 NA 26 18
10: 2020-11-14 NA 29 20
11: 2020-11-21 NA 32 23
This is the expected output with a cutoff value of 21:
a b c d
1: 2020-09-12 14 9 4
2: 2020-09-19 16 10 5
3: 2020-09-26 18 12 6
4: 2020-10-03 21 14 8
5: 2020-10-10 23 16 9
6: 2020-10-17 NA 18 11
7: 2020-10-24 NA 21 13
8: 2020-10-31 NA 23 16
9: 2020-11-07 NA NA 18
10: 2020-11-14 NA NA 20
11: 2020-11-21 NA NA 23

Here's another way using lapply and .SDcols.
checkDT <- function(dt1, cutoff) {
columns <- c('b','c','d')
dt1[, (columns) := lapply(.SD, function(x)
replace(x, x > x[x > cutoff][1], NA)), .SDcols = columns][]
}
checkDT(dt, 21)
# a b c d
# 1: 2020-09-12 14 9 4
# 2: 2020-09-19 16 10 5
# 3: 2020-09-26 18 12 6
# 4: 2020-10-03 21 14 8
# 5: 2020-10-10 23 16 9
# 6: 2020-10-17 NA 18 11
# 7: 2020-10-24 NA 21 13
# 8: 2020-10-31 NA 23 16
# 9: 2020-11-07 NA NA 18
#10: 2020-11-14 NA NA 20
#11: 2020-11-21 NA NA 23

I simplified a lot of your notation here:
In data.table you don't have to repeat dt$ inside the brackets; column names are looked up in the table itself.
which() isn't necessary, because a logical vector can be used directly to select rows.
The key is get(), which turns the column name stored as a string in i into the column itself.
I wrapped the call in suppressWarnings to silence the warning that min() gives when no value exceeds the cutoff. In that case min() returns Inf, nothing is replaced, and that's what you want.
checkDT <- function(dt, cutoff) {
columns <- c('b', 'c', 'd')
for (i in columns) {
suppressWarnings(dt[get(i) > min(dt[get(i) > cutoff, get(i)]), (i) := NA])
}
dt[]
}
checkDT(dt, cutoff) gives the desired result
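As a minimal sketch of the per-column rule both answers apply, here is the same logic on a plain vector in base R. The helper name mask_after_cutoff is hypothetical (it does not appear in either answer); it handles the "nothing exceeds the cutoff" case explicitly instead of relying on suppressWarnings.

```r
# Hypothetical helper: set to NA every value greater than the smallest
# value that exceeds `cutoff`. Assumes x is a numeric or integer vector.
mask_after_cutoff <- function(x, cutoff) {
  over <- x[!is.na(x) & x > cutoff]
  if (length(over) == 0L) return(x)   # nothing exceeds the cutoff: leave x as is
  x[!is.na(x) & x > min(over)] <- NA
  x
}

b <- c(14L, 16L, 18L, 21L, 23L, 26L, 29L)
mask_after_cutoff(b, 21L)    # 14 16 18 21 23 NA NA
mask_after_cutoff(b, 100L)   # unchanged
```

Inside a data.table, a helper like this could be passed to lapply(.SD, ...) with .SDcols, in the same way as the anonymous function in the first answer.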

Related

Change the values for multiple columns of a data frame to NA that are located within the range of two vectors

Using R: I want to change the values for each column of my df which are located within the range of the corresponding columns from my other dataframes to NA.
It works when I have only one column each:
df
days X1
1 20
2 30
3 50
4 10
5 10
6 20
7 10
8 70
9 90
10 20
start_vec<-c(4)
end_vec<-c(8)
by using:
df[df$days %in% start_vec:end_vec, ]<-NA
where I get:
desired_df
days X1
1 20
2 30
3 50
NA NA
NA NA
NA NA
NA NA
NA NA
9 90
10 20
But actually I have a dataframe with more than three columns and more than three different starting and ending vectors, as you can see here:
df
days X1 X2 X3 Xn
1 20 10 20 ...
2 30 50 40 ...
3 50 40 40 ...
4 10 70 20 ...
5 10 10 30 ...
6 20 80 50 ...
7 10 30 70 ...
8 70 10 10 ...
9 90 10 70 ...
10 20 50 10 ...
start_vec<-c(4,5,2,n)
end_vec<-c(8,9,7,n)
my desired dataframe is this:
desired_df
days X1 X2 X3 Xn
1 20 10 20 ...
2 30 50 NA ...
3 50 40 NA ...
4 NA 70 NA ...
5 NA NA NA ...
6 NA NA NA ...
7 NA NA NA ...
8 NA NA 10 ...
9 90 NA 70 ...
10 20 50 10 ...
Where NAs for X1 of df are defined by the range between 1st start_vec and 1st end_vec.
X2 of df is defined by the range between 2nd start_vec and 2nd end_vec and so on...
I would need a function that works for n columns, since my original data frame has more than 100 columns. For each column I have the corresponding vectors.
I tried with apply in order to apply condition to every column:
desired_df<-apply(df, 2, function(x) x[df$days %in% start_vec:end_vec]<-NA)
What I get is a df with NAs only.
Do you have any idea how I can change the values for each column of my df to NA where days or the index is within the range of the corresponding vectors?
Thanks for any help!
df <- data.frame(
days = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L),
X1 = c(20L, 30L, 50L, 10L, 10L, 20L, 10L, 70L, 90L, 20L),
X2 = c(10L, 50L, 40L, 70L, 10L, 80L, 30L, 10L, 10L, 50L),
X3 = c(20L, 40L, 40L, 20L, 30L, 50L, 70L, 10L, 70L, 10L)
)
start_df <- c(4, 5, 2)
end_df <- c(8, 9, 7)
mat <-
data.frame(row = c(4:8, 5:9, 2:7), col = c(rep(1, 5), rep(2, 5), rep(3, 6)))
df[-1][as.matrix(mat[, c("row", "col")])] <- NA
df
#> days X1 X2 X3
#> 1 1 20 10 20
#> 2 2 30 50 NA
#> 3 3 50 40 NA
#> 4 4 NA 70 NA
#> 5 5 NA NA NA
#> 6 6 NA NA NA
#> 7 7 NA NA NA
#> 8 8 NA NA 10
#> 9 9 90 NA 70
#> 10 10 20 50 10
Created on 2022-08-19 with reprex v2.0.2
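The row/column index above is typed out by hand, which does not scale to 100+ columns. A possible generalisation, sketched in base R under the assumption that start_vec and end_vec hold one entry per value column in column order: Map() walks the columns and their ranges together and replace() blanks the matching days. Note that the apply() attempt in the question returns only NAs because the anonymous function returns the value of its last expression, which is the assigned NA, not the modified column.

```r
df <- data.frame(
  days = 1:10,
  X1 = c(20L, 30L, 50L, 10L, 10L, 20L, 10L, 70L, 90L, 20L),
  X2 = c(10L, 50L, 40L, 70L, 10L, 80L, 30L, 10L, 10L, 50L),
  X3 = c(20L, 40L, 40L, 20L, 30L, 50L, 70L, 10L, 70L, 10L)
)
start_vec <- c(4, 5, 2)
end_vec   <- c(8, 9, 7)

# Assumption: every column except `days` is a value column.
value_cols <- setdiff(names(df), "days")

# One column at a time: set to NA the rows whose day lies in that column's range.
df[value_cols] <- Map(
  function(x, s, e) replace(x, df$days >= s & df$days <= e, NA),
  df[value_cols], start_vec, end_vec
)
df
```

This compares against the values in days rather than row positions, so it should also work when days does not start at 1.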

How to fill missing values grouped on id and based on time period from index date

I want to fill in missing values for a data.frame based on a period of time within groups of ID.
For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.
Sample version of my data:
ID registration_dat value1 value2
1 2020-03-04 NA NA
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 NA NA
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 NA NA
8 2019-10-31 NA 43
8 2019-08-12 32 46
Desired output:
ID registration_dat value1 value2
1 2020-03-04 33 25
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 32 83
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 32 43
8 2019-10-31 NA 43
8 2019-08-12 32 46
I am later filtering the data so that I get one row per unique ID, based on the latest registration date, and I want this row to have as little missing data as possible; hence I want to do this for all columns in the data frame. However, I do not want NA values to be filled in by values from previous dates if they are more than 1 year apart from the latest registration date. My data frame has 14 columns and 3 million+ rows, so I need this to work on a much bigger data.frame than the one shown as an example.
I'd appreciate any ideas!
You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.
library(dplyr)
library(lubridate)
df %>%
group_by(ID) %>%
arrange(desc(registration_dat), .by_group = TRUE) %>%
mutate(across(starts_with("value"),
~ if_else(row_number() == 1 & is.na(.x) & registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
.x[which.max(!is.na(.x))], .x))) %>%
ungroup()
# # A tibble: 12 x 4
# ID registration_dat value1 value2
# <int> <date> <int> <int>
# 1 1 2020-03-04 33 25
# 2 1 2019-05-06 33 25
# 3 1 2019-01-02 32 21
# 4 3 2021-10-31 NA NA
# 5 3 2018-10-12 33 NA
# 6 3 2018-10-10 25 35
# 7 4 2020-01-02 32 83
# 8 4 2019-10-31 32 83
# 9 4 2019-09-20 33 56
# 10 8 2019-12-12 32 43
# 11 8 2019-10-31 NA 43
# 12 8 2019-08-12 32 46
Data
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))
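To illustrate why the answer above compares against a calendar year rather than 365 days, here is a small base-R sketch. It uses seq() with by = "-1 year" as a stand-in for lubridate's years(1), and the earlier date is a made-up value chosen to sit exactly one calendar year before the latest date of ID 1.

```r
latest  <- as.Date("2020-03-04")   # latest registration_dat of ID 1 in the sample
earlier <- as.Date("2019-03-04")   # hypothetical record exactly one calendar year before

as.numeric(latest - earlier)       # 366, because 2020-02-29 falls in between

# Fixed 365-day window: the earlier record is just outside.
latest - 365 <= earlier            # FALSE

# Calendar-year window: the earlier record is included.
one_year_back <- seq(latest, by = "-1 year", length.out = 2)[2]
one_year_back <= earlier           # TRUE
```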
You could make a small function (f, below) to handle each value column.
Make a grouped ID, and generate a rowid (this is only to retain your original order)
dat <- dat %>%
mutate(rowid = row_number()) %>%
arrange(registration_dat) %>%
group_by(ID)
Make a function that takes a df and a val column and returns an updated df with val fixed
f <- function(df, val) {
bind_rows(
df %>% filter(is.na({{val}}) & row_number()!=n()),
df %>% filter(!is.na({{val}}) | row_number()==n()) %>%
mutate({{val}} := if_else(is.na({{val}}) & registration_dat-lag(registration_dat)<365, lag({{val}}),{{val}}))
)
}
Apply the function to the columns of interest
dat = f(dat,value1)
dat = f(dat,value2)
If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)
Output:
ID registration_dat value1 value2
<int> <date> <int> <int>
1 1 2020-03-04 33 25
2 1 2019-05-06 33 25
3 1 2019-01-02 32 21
4 3 2021-10-31 NA NA
5 3 2018-10-12 33 NA
6 3 2018-10-10 25 35
7 4 2020-01-02 32 83
8 4 2019-10-31 32 83
9 4 2019-09-20 33 56
10 8 2019-12-12 32 46
11 8 2019-10-31 NA 43
12 8 2019-08-12 32 46
Update:
The OP wants the final row (i.e the last registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:
library(data.table)
f <- function(df) {
df = df[df[1,registration_dat]-registration_dat<=365]
df[1,value:=df[2:.N][!is.na(value)][1,value]][1]
}
dcast(
melt(setDT(dat), id=c("ID", "registration_dat"))[order(-registration_dat),f(.SD), by=.(ID,variable)],
ID+registration_dat~variable, value.var="value"
)
Output:
ID registration_dat value1 value2
<int> <Date> <int> <int>
1: 1 2020-03-04 33 25
2: 3 2021-10-31 NA NA
3: 4 2020-01-02 32 83
4: 8 2019-12-12 32 43

Getting rows in data frame based on multiple ranges in R

Let's say I have this table
a b
1 5 12
2 6 17
3 7 28
4 8 12
5 9 17
6 10 28
7 15 12
8 25 14
9 13 29
Also another table with index ranges:
start end
1 2 3
2 5 7
I want to get the rows in the first table based on the index ranges in the second table with a group name to differentiate, something like this:
a b group
2 6 17 1
3 7 28 1
5 9 17 2
6 10 28 2
7 15 12 2
How do I achieve this in R?
We can subset df1 taking the row index value from df2 using Map.
do.call(rbind, Map(function(x, y, z) transform(df1[x:y, ], group = z),
df2$start, df2$end, seq_len(nrow(df2))))
Or in purrr :
purrr::map2_dfr(df2$start, df2$end, ~df1[.x:.y, ], .id = "group")
# group a b
#1 1 6 17
#2 1 7 28
#3 2 9 17
#4 2 10 28
#5 2 15 12
data
df1 <- structure(list(a = c(5L, 6L, 7L, 8L, 9L, 10L, 15L, 25L, 13L),
b = c(12L, 17L, 28L, 12L, 17L, 28L, 12L, 14L, 29L)),
class = "data.frame", row.names = c(NA, -9L))
df2 <- structure(list(start = c(2L, 5L), end = c(3L, 7L)),
class = "data.frame", row.names = c(NA, -2L))
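The same idea can also be sketched in base R without do.call(rbind, ...): expand every start/end pair into its row numbers, then repeat the range number once per selected row. This is an alternative under the same assumption as above (start and end are row positions in df1), not the answerer's code.

```r
df1 <- data.frame(a = c(5L, 6L, 7L, 8L, 9L, 10L, 15L, 25L, 13L),
                  b = c(12L, 17L, 28L, 12L, 17L, 28L, 12L, 14L, 29L))
df2 <- data.frame(start = c(2L, 5L), end = c(3L, 7L))

# One integer vector of row numbers per range: 2:3 and 5:7.
idx <- Map(seq, df2$start, df2$end)

# Stack the selected rows and label each with the number of its range.
out <- cbind(df1[unlist(idx), ],
             group = rep(seq_along(idx), lengths(idx)))
out
#    a  b group
# 2  6 17     1
# 3  7 28     1
# 5  9 17     2
# 6 10 28     2
# 7 15 12     2
```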
An option using data.table:
DT1[, rn := .I]
DT2[, g := .I]
DT1[, g := DT2[DT1, on=.(start<=rn, end>=rn), x.g]][
!is.na(g)]
output:
a b rn g
1: 6 17 2 1
2: 7 28 3 1
3: 9 17 5 2
4: 10 28 6 2
5: 15 12 7 2
data:
library(data.table)
DT1 <- fread("a b
5 12
6 17
7 28
8 12
9 17
10 28
15 12
25 14
13 29")
DT2 <- fread("start end
2 3
5 7")

Coding the number of visits based on dates and assigning value in new column R

I am relatively new to R and am trying to create a new column for the number of visits (i.e. num_visits) based on the admission dates (i.e. admit_date).
The sample data frame is below; num_visits has to be created from the admit_date column. The admit_dates do not necessarily run in sequence.
subject_id admit_date num_visits
22 2010-10-20 1
23 2010-10-20 1
24 2010-10-21 1
25 2010-10-21 1
22 2010-12-30 3
22 2010-12-22 2
23 2010-12-25 2
30 2011-01-14 1
31 2011-01-14 1
33 2011-02-05 2
33 2011-01-26 1
I know I need to group by subject_id and perhaps get the counts based on the sequence of the dates.
I am stuck after the following code and would appreciate any help, thank you!
df %>%
group_by(subject_id) %>%
We can use mutate after grouping by 'subject_id'
library(dplyr)
df %>%
arrange(subject_id, as.Date(admit_date)) %>%
group_by(subject_id) %>%
mutate(num_visits = row_number())
or with data.table
library(data.table)
setDT(df)[order(as.IDate(admit_date)), num_visits := rowid(subject_id)][]
# subject_id admit_date num_visits
# 1: 22 2010-10-20 1
# 2: 23 2010-10-20 1
# 3: 24 2010-10-21 1
# 4: 25 2010-10-21 1
# 5: 22 2010-12-30 3
# 6: 22 2010-12-22 2
# 7: 23 2010-12-25 2
# 8: 30 2011-01-14 1
# 9: 31 2011-01-14 1
#10: 33 2011-02-05 2
#11: 33 2011-01-26 1
data
df <- structure(list(subject_id = c(22L, 23L, 24L, 25L, 22L, 22L, 23L,
30L, 31L, 33L, 33L), admit_date = c("2010-10-20", "2010-10-20",
"2010-10-21", "2010-10-21", "2010-12-30", "2010-12-22", "2010-12-25",
"2011-01-14", "2011-01-14", "2011-02-05", "2011-01-26")), row.names = c(NA,
-11L), class = "data.frame")
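If neither package is available, the same visit number can be sketched in base R with ave(), ranking the dates within each subject. This keeps the original row order; ties.method = "first" is an assumption about how two admissions on the same date for one subject should be numbered.

```r
df <- data.frame(
  subject_id = c(22L, 23L, 24L, 25L, 22L, 22L, 23L, 30L, 31L, 33L, 33L),
  admit_date = c("2010-10-20", "2010-10-20", "2010-10-21", "2010-10-21",
                 "2010-12-30", "2010-12-22", "2010-12-25", "2011-01-14",
                 "2011-01-14", "2011-02-05", "2011-01-26")
)

# Rank each admission date among the dates of the same subject.
df$num_visits <- ave(
  as.numeric(as.Date(df$admit_date)),
  df$subject_id,
  FUN = function(d) rank(d, ties.method = "first")
)
df$num_visits
# 1 1 1 1 3 2 2 1 1 2 1
```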

Moving average and moving slope in R

I am looking to separately calculate a 7-day moving average and 7-day moving slope of 'oldvar'.
My sincere apologies that I didn't add the details below in my original post. These are repeated observations for each id which can go from a minimum of 3 observations per id to 100 observations per id. The start day can be different for different IDs, and to make things complicated, the days are not equally spaced, so some IDs have missing days.
Here is the data structure. Please note that 'average' is the variable that I am trying to create as moving 7-day average for each ID:
id day outcome average
1 1 15 100 NA
2 1 16 110 NA
3 1 17 190 NA
4 1 18 130 NA
5 1 19 140 NA
6 1 20 150 NA
7 1 21 160 140
8 1 22 100 140
9 1 23 180 150
10 1 24 120 140
12 2 16 90 NA
13 2 17 110 NA
14 2 18 120 NA
12 2 20 130 NA
15 3 16 110 NA
16 3 18 200 NA
17 3 19 180 NA
18 3 21 170 NA
19 3 22 180 168
20 3 24 210 188
21 3 25 160 180
22 3 27 200 184
Also, would appreciate advice on how to calculate a moving 7-day slope using the same.
Thank you and again many apologies for being unclear the first time around.
The real challenge is to create a data.frame after completing the missing rows. One solution is to use the zoo package: its rollapply function can fill the initial rows, where the window is not yet complete, with NA.
Using data from OP as is, the solution could be:
library(zoo)
library(dplyr)
# Data from OP
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
day = c(15L,16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 16L, 17L, 18L, 20L,
16L, 18L, 19L, 21L, 22L, 24L, 25L, 27L),
outcome = c(100L, 110L,190L, 130L, 140L, 150L, 160L, 100L, 180L, 120L, 90L, 110L, 120L,
130L, 110L, 200L, 180L, 170L, 180L, 210L, 160L, 200L)),
.Names = c("id", "day", "outcome"), row.names = c(NA, -22L), class = "data.frame")
# Make a list without missing day for each id
df_complete <- merge(
expand.grid(id=unique(df$id), day=min(df$day):max(df$day)),
df, all=TRUE)
# Valid range of day for each ID group
df_id_wise_range <- df %>% group_by(id) %>%
summarise(min_day = min(day), max_day = max(day)) %>% as.data.frame()
# id min_day max_day
# 1 1 15 24
# 2 2 16 20
# 3 3 16 27
# Join original df and df_complete and then use df_id_wise_range to
# filter it for valid range of day for each group
df_final <- df_complete %>%
left_join(df, by=c("id","day")) %>%
select(-outcome.y) %>%
inner_join(df_id_wise_range, by="id") %>%
filter(day >= min_day & day <= max_day) %>%
mutate(outcome = outcome.x) %>%
select( id, day, outcome) %>%
as.data.frame()
# Now apply mean to get average
df_average <- df_final %>% group_by(id) %>%
mutate(average= rollapply(outcome, 7, mean, na.rm = TRUE, by = 1,
fill = NA, align = "right", partial = 7)) %>% as.data.frame()
df_average
# The result
# id day outcome average
#1 1 15 100 NA
#2 1 16 110 NA
#3 1 17 190 NA
#4 1 18 130 NA
#5 1 19 140 NA
#6 1 20 150 NA
#7 1 21 160 140.0
#8 1 22 100 140.0
#9 1 23 180 150.0
#10 1 24 120 140.0
#11 2 16 90 NA
#12 2 17 110 NA
#13 2 18 120 NA
#....
#....
#19 3 19 180 NA
#20 3 20 NA NA
#21 3 21 170 NA
#22 3 22 180 168.0
#23 3 23 NA 182.5
#24 3 24 210 188.0
#25 3 25 160 180.0
#26 3 26 NA 180.0
#27 3 27 200 184.0
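For comparison, a calendar-based window can also be sketched in base R without first completing the missing days. The helper roll_mean_days is hypothetical: for each row it averages the outcomes whose day lies in the 7 days ending at that row, and returns NA while fewer than 7 calendar days have passed since the first observation. It is shown for id 3 only; for all ids it would need to be applied per group.

```r
# Hypothetical helper: mean of `value` over the `width` calendar days ending at each row.
roll_mean_days <- function(day, value, width = 7) {
  vapply(seq_along(day), function(i) {
    if (day[i] - min(day) < width - 1) return(NA_real_)   # window not yet full
    in_window <- day > day[i] - width & day <= day[i]
    mean(value[in_window], na.rm = TRUE)
  }, numeric(1))
}

# Rows of id 3 from the question's data.
id3 <- data.frame(
  day     = c(16L, 18L, 19L, 21L, 22L, 24L, 25L, 27L),
  outcome = c(110L, 200L, 180L, 170L, 180L, 210L, 160L, 200L)
)
id3$average <- roll_mean_days(id3$day, id3$outcome)
id3$average
# NA NA NA NA 168 188 180 184
```

These values match the averages listed for id 3 in the question, and no rows are added for the missing days.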
The steps to calculate the moving slope are:
First, create a function that returns the slope.
Then use that function as part of rollapplyr.
#Function to calculate slope
slop_e <- function(z) coef(lm(b ~ a, as.data.frame(z)))[[2]]
#Apply function
z2$slope <- rollapplyr(zoo(z2), 7, slop_e , by.column = FALSE, fill = NA, align = "right")
z2
a b mean_a slope
1 1 21 NA NA
2 2 22 NA NA
3 3 23 NA NA
4 4 24 NA NA
5 5 25 NA NA
6 6 26 NA NA
7 7 27 4 1
8 8 28 5 1
9 9 29 6 1
10 10 30 7 1
11 11 31 8 1
12 12 32 9 1
13 13 33 10 1
14 14 34 11 1
15 15 35 12 1
16 16 36 13 1
17 17 37 14 1
18 18 38 15 1
19 19 39 16 1
20 20 40 17 1
