How to merge two columns in R?

library(dplyr)

newdf <- data.frame(id = c(1, 3, 2),
                    admission = c("2020-05-18", "2020-04-30", "2020-05-08"),
                    vent = c("mechanical_vent", "self_vent", "mechanical_vent"))
newdf$admission <- as.Date(newdf$admission)

newdf1 <- data.frame(id = c(1, 3, 1, 2, 1, 3, 2, 2),
                     date = c("2020-05-19", "2020-05-02", "2020-05-20", "2020-05-09",
                              "2020-05-21", "2020-05-04", "2020-05-10", "2020-05-11"),
                     vent = c("self_vent", "mechanical_vent", "mechanical_vent", "mechanical_vent",
                              "self_vent", "mechanical_vent", "mechanical_vent", "self_vent"))
newdf1$date <- as.Date(newdf1$date)

newdf <- newdf %>% group_by(id) %>% bind_rows(newdf, newdf1)
newdf$dates <- paste(newdf$admission, newdf$date)
I want to merge the admission and date columns into a single dates column. I used the paste() function, but it produces output containing NA values wherever one of the two columns is missing. Could you please suggest a method to solve this?

If you want to transfer the dates from admission to date, where date is NA, this will work:
newdf %>%
  mutate(across(c(admission, date), ~ as.character(.))) %>%
  mutate(date = ifelse(is.na(date), admission, date))
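The as.character() step is there because base ifelse() strips the Date class from its result, which is why the conversion happens first. A quick illustration with arbitrary dates:

```r
# Base ifelse() drops the Date class from its result
d <- as.Date(c("2020-05-18", NA))

r <- ifelse(is.na(d), as.Date("2020-01-01"), d)
r
#> [1] 18400 18262   # plain day counts -- the Date class is gone
```

Converting to character first, wrapping the result in as.Date(..., origin = "1970-01-01"), or using dplyr::if_else() (which is type-stable and preserves the class) all avoid this.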

We could use pmax(), which with na.rm = TRUE returns the non-missing date in each row (and the later of the two, if both were present):
newdf$dates <- pmax(newdf$admission, newdf$date, na.rm = TRUE)
Output:
id admission vent date dates
<dbl> <date> <chr> <date> <date>
1 1 2020-05-18 mechanical_vent NA 2020-05-18
2 3 2020-04-30 self_vent NA 2020-04-30
3 2 2020-05-08 mechanical_vent NA 2020-05-08
4 1 2020-05-18 mechanical_vent NA 2020-05-18
5 3 2020-04-30 self_vent NA 2020-04-30
6 2 2020-05-08 mechanical_vent NA 2020-05-08
7 1 NA self_vent 2020-05-19 2020-05-19
8 3 NA mechanical_vent 2020-05-02 2020-05-02
9 1 NA mechanical_vent 2020-05-20 2020-05-20
10 2 NA mechanical_vent 2020-05-09 2020-05-09
11 1 NA self_vent 2020-05-21 2020-05-21
12 3 NA mechanical_vent 2020-05-04 2020-05-04
13 2 NA mechanical_vent 2020-05-10 2020-05-10
14 2 NA self_vent 2020-05-11 2020-05-11
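A caveat: pmax() and coalesce() agree here only because each row has exactly one non-NA date. When both dates are present, pmax() returns the later one, while coalesce() returns its first argument. A quick base-R check with made-up dates:

```r
# Row 1 has both dates; row 2 only has the second
admission <- as.Date(c("2020-05-18", NA))
date      <- as.Date(c("2020-05-20", "2020-05-09"))

# pmax() takes the later date when both are present
dates <- pmax(admission, date, na.rm = TRUE)
dates
#> [1] "2020-05-20" "2020-05-09"
```

coalesce(admission, date) would instead give "2020-05-18" for row 1, so pick pmax() only if "latest date" is actually the rule you want.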

You can use coalesce(), which picks the first non-missing value across its arguments:
library(dplyr)
newdf %>% ungroup() %>% mutate(dates = coalesce(admission, date))
# id admission vent date dates
# <dbl> <date> <chr> <date> <date>
# 1 1 2020-05-18 mechanical_vent NA 2020-05-18
# 2 3 2020-04-30 self_vent NA 2020-04-30
# 3 2 2020-05-08 mechanical_vent NA 2020-05-08
# 4 1 2020-05-18 mechanical_vent NA 2020-05-18
# 5 3 2020-04-30 self_vent NA 2020-04-30
# 6 2 2020-05-08 mechanical_vent NA 2020-05-08
# 7 1 NA self_vent 2020-05-19 2020-05-19
# 8 3 NA mechanical_vent 2020-05-02 2020-05-02
# 9 1 NA mechanical_vent 2020-05-20 2020-05-20
#10 2 NA mechanical_vent 2020-05-09 2020-05-09
#11 1 NA self_vent 2020-05-21 2020-05-21
#12 3 NA mechanical_vent 2020-05-04 2020-05-04
#13 2 NA mechanical_vent 2020-05-10 2020-05-10
#14 2 NA self_vent 2020-05-11 2020-05-11


Extract values from netCDF to location by date

I am new to R and am now looking for a solution to my problem.
I have a directory of files in .nc (netCDF) format that contain daily data on sea surface temperature. Each day in the period from December 1, 2019 to August 1, 2021 corresponds to one file containing the water temperature data for that day. The date a file belongs to is contained in the middle of the file name in yyyymmdd format (for example, 'TERRA_MODIS.20191201.L3m.DAY.SST.x_sst.nc').
A link to example MODIS files was provided. In this work, only files whose names end in "L3m.DAY.SST.x_sst.nc" should be used.
An example of the file structure was also attached.
A simple visualization of sea surface temperature data looks like this:
library(ncdf4)
library(lattice)
library(RColorBrewer)

nc  <- nc_open('TERRA_MODIS.20191201.L3m.DAY.SST.x_sst.nc')
lon <- ncvar_get(nc, 'lon')
lat <- ncvar_get(nc, 'lat')
SST <- ncvar_get(nc, 'sst')
grid <- expand.grid(lon = lon, lat = lat)
cutpts <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45)
levelplot(SST ~ lon * lat, data = grid, at = cutpts, cuts = 11, pretty = TRUE,
          col.regions = rev(brewer.pal(10, 'RdBu')))
I also have a data frame containing the buoy number, the date, and the geographic coordinates of the buoy's location on that day.
A link to example.csv was provided.
example <- read.csv('example.csv', sep=';',dec='.')
example
Buoy Date Longitude Latitude
1 1 2019-12-01 50.29614 43.92681
2 1 2019-12-02 50.23525 43.89244
3 1 2019-12-03 50.19717 43.88295
4 1 2019-12-04 50.20559 43.88417
5 1 2019-12-05 50.26016 43.86125
6 2 2019-12-01 51.73309 46.53087
7 2 2019-12-02 51.79530 46.56380
8 2 2019-12-03 51.79190 46.53550
9 2 2019-12-04 51.79958 46.56178
10 2 2019-12-05 51.85411 46.33031
11 3 2019-12-01 51.54246 41.28999
12 3 2019-12-02 50.76324 41.60532
13 3 2019-12-03 51.39782 41.79459
14 3 2019-12-04 49.52380 42.33821
15 3 2019-12-05 49.48472 42.62323
I need to extract the sea surface temperature value into an 'SST' column for each location, depending on the date for which the location was obtained. The result should look something like this:
> example
Buoy Date Longitude Latitude SST
1 1 2019-12-01 50.29614 43.92681 13
2 1 2019-12-02 50.23525 43.89244 16
3 1 2019-12-03 50.19717 43.88295 2
4 1 2019-12-04 50.20559 43.88417 10
5 1 2019-12-05 50.26016 43.86125 8
6 2 2019-12-01 51.73309 46.53087 18
7 2 2019-12-02 51.79530 46.56380 4
8 2 2019-12-03 51.79190 46.53550 17
9 2 2019-12-04 51.79958 46.56178 20
10 2 2019-12-05 51.85411 46.33031 13
11 3 2019-12-01 51.54246 41.28999 14
12 3 2019-12-02 50.76324 41.60532 18
13 3 2019-12-03 51.39782 41.79459 8
14 3 2019-12-04 49.52380 42.33821 7
15 3 2019-12-05 49.48472 42.62323 2
Could you tell me how this can be implemented in R?
With sst.zip unzipped into the working directory, and using terra's vect(), rast(), and extract():
library(terra)
files <- unzip('~/Downloads/sst.zip')
> files
[1] "./TERRA_MODIS.20191205.L3b.DAY.SST.NRT.x.nc"
[2] "./TERRA_MODIS.20191205.L3b.DAY.SST.x.nc"
[3] "./TERRA_MODIS.20191205.L3m.DAY.SST.NRT.x_sst.nc"
[4] "./TERRA_MODIS.20191205.L3m.DAY.SST.x_sst.nc"
[5] "./TERRA_MODIS.20191201.L3b.DAY.SST.NRT.x.nc"
[6] "./TERRA_MODIS.20191201.L3b.DAY.SST.x.nc"
[7] "./TERRA_MODIS.20191201.L3m.DAY.SST.NRT.x_sst.nc"
[8] "./TERRA_MODIS.20191201.L3m.DAY.SST.x_sst.nc"
[9] "./TERRA_MODIS.20191202.L3b.DAY.SST.NRT.x.nc"
[10] "./TERRA_MODIS.20191202.L3b.DAY.SST.x.nc"
[11] "./TERRA_MODIS.20191202.L3m.DAY.SST.NRT.x_sst.nc"
[12] "./TERRA_MODIS.20191202.L3m.DAY.SST.x_sst.nc"
[13] "./TERRA_MODIS.20191203.L3b.DAY.SST.NRT.x.nc"
[14] "./TERRA_MODIS.20191203.L3b.DAY.SST.x.nc"
[15] "./TERRA_MODIS.20191203.L3m.DAY.SST.NRT.x_sst.nc"
[16] "./TERRA_MODIS.20191203.L3m.DAY.SST.x_sst.nc"
[17] "./TERRA_MODIS.20191204.L3b.DAY.SST.NRT.x.nc"
[18] "./TERRA_MODIS.20191204.L3b.DAY.SST.x.nc"
[19] "./TERRA_MODIS.20191204.L3m.DAY.SST.NRT.x_sst.nc"
[20] "./TERRA_MODIS.20191204.L3m.DAY.SST.x_sst.nc"
# just showing order of x_sst.nc
files[endsWith(files, suffix = 'L3m.DAY.SST.x_sst.nc')]
[1] "./TERRA_MODIS.20191205.L3m.DAY.SST.x_sst.nc"
[2] "./TERRA_MODIS.20191201.L3m.DAY.SST.x_sst.nc"
[3] "./TERRA_MODIS.20191202.L3m.DAY.SST.x_sst.nc"
[4] "./TERRA_MODIS.20191203.L3m.DAY.SST.x_sst.nc"
[5] "./TERRA_MODIS.20191204.L3m.DAY.SST.x_sst.nc"
bouys <- read.csv('~/Downloads/example.csv', header = TRUE, colClasses=c('integer', 'Date', 'numeric', 'numeric'))
bouys
Buoy Date Longitude Latitude
1 1 2019-12-01 50.29614 43.92681
2 1 2019-12-02 50.23525 43.89244
3 1 2019-12-03 50.19717 43.88295
4 1 2019-12-04 50.20559 43.88417
5 1 2019-12-05 50.26016 43.86125
6 2 2019-12-01 51.73309 46.53087
7 2 2019-12-02 51.79530 46.56380
8 2 2019-12-03 51.79190 46.53550
9 2 2019-12-04 51.79958 46.56178
10 2 2019-12-05 51.85411 46.33031
11 3 2019-12-01 51.54246 41.28999
12 3 2019-12-02 50.76324 41.60532
13 3 2019-12-03 51.39782 41.79459
14 3 2019-12-04 49.52380 42.33821
15 3 2019-12-05 49.48472 42.62323
bouys_v <- vect(bouys, geom = c('Longitude', 'Latitude'))
SSTs <- rast(files[endsWith(files, suffix = 'L3m.DAY.SST.x_sst.nc')], 'sst')
sst_extr <- extract(SSTs, bouys_v)
sst_extr
ID sst sst sst sst sst
1 1 NA NA NA NA NA
2 2 NA NA NA NA NA
3 3 NA NA NA NA NA
4 4 NA NA NA NA NA
5 5 NA NA NA NA NA
6 6 NA NA NA -0.320 NA
7 7 NA NA NA -0.380 NA
8 8 NA NA NA -0.455 NA
9 9 NA NA NA -0.380 NA
10 10 NA NA NA -0.130 NA
11 11 NA NA NA NA NA
12 12 NA NA NA NA NA
13 13 NA NA NA NA NA
14 14 NA NA NA 11.790 NA
15 15 NA 11.835 NA NA NA
sst_idx <- which(!is.na(sst_extr), arr.ind = TRUE)
sst_extr[sst_idx]
[1] 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000 9.000 10.000
[11] 11.000 12.000 13.000 14.000 15.000 11.835 -0.320 -0.380 -0.455 -0.380
[21] -0.130 11.790
bouys$sst <- NA
bouys$sst[sst_idx[16:22,][,1]] <- sst_extr[sst_idx][16:22]
bouys
Buoy Date Longitude Latitude sst
1 1 2019-12-01 50.29614 43.92681 NA
2 1 2019-12-02 50.23525 43.89244 NA
3 1 2019-12-03 50.19717 43.88295 NA
4 1 2019-12-04 50.20559 43.88417 NA
5 1 2019-12-05 50.26016 43.86125 NA
6 2 2019-12-01 51.73309 46.53087 -0.320
7 2 2019-12-02 51.79530 46.56380 -0.380
8 2 2019-12-03 51.79190 46.53550 -0.455
9 2 2019-12-04 51.79958 46.56178 -0.380
10 2 2019-12-05 51.85411 46.33031 -0.130
11 3 2019-12-01 51.54246 41.28999 NA
12 3 2019-12-02 50.76324 41.60532 NA
13 3 2019-12-03 51.39782 41.79459 NA
14 3 2019-12-04 49.52380 42.33821 11.790
15 3 2019-12-05 49.48472 42.62323 11.835
Though there must be a better way to get to the final result: which() with arr.ind = TRUE gives (in this case) row/column pairs, sst_idx[16:22, ][, 1] gives the rows we want to update, and sst_extr[sst_idx][16:22] the values to update them with.
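One way to avoid the manual row/column bookkeeping is to match each observation's date to its layer and use a two-column matrix index, which pulls exactly one cell per row. A small base-R sketch of the idea, using a made-up stand-in for the extract() result (with terra you would first order the layers by the date parsed from the file names):

```r
# Stand-in for the extract() result: one column per daily layer,
# one row per buoy observation (values here are invented)
layer_dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03"))
sst_mat <- matrix(c(13, NA, NA,
                    NA, 16, NA,
                    NA, NA,  2),
                  nrow = 3, byrow = TRUE)
obs_dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03"))

# A two-column (row, col) matrix index extracts one cell per observation
sst <- sst_mat[cbind(seq_along(obs_dates), match(obs_dates, layer_dates))]
sst
#> [1] 13 16  2
```

This replaces the which(..., arr.ind = TRUE) gymnastics with a single vectorised lookup.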

Custom function: update old variables and create new variables / non-standard evaluation

I could use some help with a tidyverse solution for a function that I've written. I have a dataset with a binary phenotype and an associated diagnosis date, as well as two other dates I hope to use to update and create new variables.
I want to:
update the value of the supplied variables to NA if vardt < other_dt
generate new variables, {var}_incid & {var}_incid_dt, if the vardt variable is before baseline_dt
Here's my attempt at a function; I know it likely requires some non-standard evaluation techniques, so I've tried assign() and eval(substitute()) around the names, to no avail. Any tips? Thanks in advance for the help.
# load libraries
library(tidyverse)
library(lubridate)

rdate <- function(x,
                  min = paste0(format(Sys.Date(), '%Y'), '-01-01'),
                  max = paste0(format(Sys.Date(), '%Y'), '-12-31'),
                  sort = TRUE) {
  dates <- sample(seq(as.Date(min), as.Date(max), by = "day"), x, replace = TRUE)
  if (sort) {
    sort(dates)
  } else {
    dates
  }
}
# set seed for reproducibility
set.seed(42)
# Beginning dataset
das <- data.frame(id = rep(letters[1:3], each = 5),
                  pheno = rbinom(n = 15, size = 1, prob = 0.30),
                  pheno_dt = rdate(15),
                  baseline_dt = rdate(15),
                  other_dt = rdate(15))
update_pheno <- function(var, vardt){
outds <- das %>%
mutate(eval(substitute(var)) = ifelse(var == 1 & pheno_dt < other_dt, NA, var),
# update vardt to NA if var value is NA
vardt = ifelse(is.na(var), NA, vardt))
# create incidence variable based on nomenclature of variable
paste0(var, "_incid") = ifelse(var == 1 & vardt < baseline_dt, NA, var),
# create associated dt variable
paste0(var, "_incid_dt" = ifelse(is.na(paste0(var, "_incid")), NA, vardt)))
return(outds)
}
test <- update_pheno(var = pheno, vardt = phenodt)
Limitations, Assumptions, and Simplifications
# Since we're talking *tidyverse*, let's make this a tibble:
das <- as_tibble( das )
das
# A tibble: 15 × 5
id pheno pheno_dt baseline_dt other_dt
<chr> <int> <date> <date> <date>
1 a 1 2022-01-05 2022-01-04 2022-01-03
2 a 1 2022-01-20 2022-04-19 2022-01-05
3 a 0 2022-01-24 2022-05-16 2022-02-02
4 a 1 2022-03-30 2022-05-26 2022-02-09
5 a 0 2022-04-19 2022-06-07 2022-04-13
6 b 0 2022-04-20 2022-07-16 2022-04-19
7 b 1 2022-06-14 2022-08-03 2022-04-24
8 b 0 2022-07-31 2022-08-14 2022-05-10
9 b 0 2022-09-16 2022-09-02 2022-05-18
10 b 1 2022-10-10 2022-10-19 2022-07-05
11 c 0 2022-10-24 2022-10-26 2022-08-16
12 c 1 2022-10-25 2022-11-10 2022-09-15
13 c 1 2022-11-10 2022-11-20 2022-09-19
14 c 0 2022-12-14 2022-12-14 2022-11-25
15 c 0 2022-12-26 2022-12-21 2022-12-24
Update the value of the supplied variables to NA if vardt < other_dt
# Do this directly:
das[ das$pheno_dt < das$other_dt , "pheno" ] <- NA
das
# A tibble: 15 × 5
id pheno pheno_dt baseline_dt other_dt
<chr> <int> <date> <date> <date>
1 a 1 2022-01-05 2022-01-04 2022-01-03
2 a 1 2022-01-20 2022-04-19 2022-01-05
3 a NA 2022-01-24 2022-05-16 2022-02-02
4 a 1 2022-03-30 2022-05-26 2022-02-09
5 a 0 2022-04-19 2022-06-07 2022-04-13
6 b 0 2022-04-20 2022-07-16 2022-04-19
7 b 1 2022-06-14 2022-08-03 2022-04-24
8 b 0 2022-07-31 2022-08-14 2022-05-10
9 b 0 2022-09-16 2022-09-02 2022-05-18
10 b 1 2022-10-10 2022-10-19 2022-07-05
11 c 0 2022-10-24 2022-10-26 2022-08-16
12 c 1 2022-10-25 2022-11-10 2022-09-15
13 c 1 2022-11-10 2022-11-20 2022-09-19
14 c 0 2022-12-14 2022-12-14 2022-11-25
15 c 0 2022-12-26 2022-12-21 2022-12-24
Generate new variables, {var}_incid & {var}_incid_dt, if the vardt variable is before baseline_dt
# What are the names of these new variables?
potential_new_names <- paste0( das$id, '_incid' )
potential_new_names
[1] "a_incid" "a_incid" "a_incid" "a_incid" "a_incid" "b_incid" "b_incid" "b_incid" "b_incid" "b_incid" "c_incid"
[12] "c_incid" "c_incid" "c_incid" "c_incid"
# To which rows does this apply?
these_rows <- which( das$pheno_dt < das$baseline_dt )
these_rows
[1] 2 3 4 5 6 7 8 10 11 12 13
# Remove duplicates
new_value_variables <- unique( potential_new_names[ these_rows ] )
# Create corresponding date variables
new_date_variables <- paste0( new_value_variables, "_dt" )
# Combine value variables and date variables
new_column_names <- c( new_value_variables, new_date_variables )
new_column_names
[1] "a_incid" "b_incid" "c_incid" "a_incid_dt" "b_incid_dt" "c_incid_dt"
code_to_make_new_columns <- sprintf(
'das %%>%% mutate( %s )'
, paste0( new_column_names, "=NA", collapse="," )
)
code_to_make_new_columns
[1] "das %>% mutate( a_incid=NA,b_incid=NA,c_incid=NA,a_incid_dt=NA,b_incid_dt=NA,c_incid_dt=NA )"
new_das <- eval( parse( text = code_to_make_new_columns ))
new_das
# A tibble: 15 × 11
id pheno pheno_dt baseline_dt other_dt a_incid b_incid c_incid a_incid_dt b_incid_dt c_incid_dt
<chr> <int> <date> <date> <date> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 a 1 2022-01-05 2022-01-04 2022-01-03 NA NA NA NA NA NA
2 a 1 2022-01-20 2022-04-19 2022-01-05 NA NA NA NA NA NA
3 a NA 2022-01-24 2022-05-16 2022-02-02 NA NA NA NA NA NA
4 a 1 2022-03-30 2022-05-26 2022-02-09 NA NA NA NA NA NA
5 a 0 2022-04-19 2022-06-07 2022-04-13 NA NA NA NA NA NA
6 b 0 2022-04-20 2022-07-16 2022-04-19 NA NA NA NA NA NA
7 b 1 2022-06-14 2022-08-03 2022-04-24 NA NA NA NA NA NA
8 b 0 2022-07-31 2022-08-14 2022-05-10 NA NA NA NA NA NA
9 b 0 2022-09-16 2022-09-02 2022-05-18 NA NA NA NA NA NA
10 b 1 2022-10-10 2022-10-19 2022-07-05 NA NA NA NA NA NA
11 c 0 2022-10-24 2022-10-26 2022-08-16 NA NA NA NA NA NA
12 c 1 2022-10-25 2022-11-10 2022-09-15 NA NA NA NA NA NA
13 c 1 2022-11-10 2022-11-20 2022-09-19 NA NA NA NA NA NA
14 c 0 2022-12-14 2022-12-14 2022-11-25 NA NA NA NA NA NA
15 c 0 2022-12-26 2022-12-21 2022-12-24 NA NA NA NA NA NA
Now update the values for the new variables
incident_value_columns <- grep( pattern = "incid$" , names( new_das ))
incident_date_columns <- grep( pattern = "incid_dt$", names( new_das ))
rows_to_update <- das$pheno_dt >= das$baseline_dt
new_das[ rows_to_update, incident_value_columns ] <- new_das[ rows_to_update, 'pheno' ]
new_das[ rows_to_update, incident_date_columns ] <- new_das[ rows_to_update, 'pheno_dt' ]
new_das
# A tibble: 15 × 11
id pheno pheno_dt baseline_dt other_dt a_incid b_incid c_incid a_incid_dt b_incid_dt c_incid_dt
<chr> <int> <date> <date> <date> <int> <int> <int> <date> <date> <date>
1 a 1 2022-01-05 2022-01-04 2022-01-03 1 1 1 2022-01-05 2022-01-05 2022-01-05
2 a 1 2022-01-20 2022-04-19 2022-01-05 NA NA NA NA NA NA
3 a NA 2022-01-24 2022-05-16 2022-02-02 NA NA NA NA NA NA
4 a 1 2022-03-30 2022-05-26 2022-02-09 NA NA NA NA NA NA
5 a 0 2022-04-19 2022-06-07 2022-04-13 NA NA NA NA NA NA
6 b 0 2022-04-20 2022-07-16 2022-04-19 NA NA NA NA NA NA
7 b 1 2022-06-14 2022-08-03 2022-04-24 NA NA NA NA NA NA
8 b 0 2022-07-31 2022-08-14 2022-05-10 NA NA NA NA NA NA
9 b 0 2022-09-16 2022-09-02 2022-05-18 0 0 0 2022-09-16 2022-09-16 2022-09-16
10 b 1 2022-10-10 2022-10-19 2022-07-05 NA NA NA NA NA NA
11 c 0 2022-10-24 2022-10-26 2022-08-16 NA NA NA NA NA NA
12 c 1 2022-10-25 2022-11-10 2022-09-15 NA NA NA NA NA NA
13 c 1 2022-11-10 2022-11-20 2022-09-19 NA NA NA NA NA NA
14 c 0 2022-12-14 2022-12-14 2022-11-25 0 0 0 2022-12-14 2022-12-14 2022-12-14
15 c 0 2022-12-26 2022-12-21 2022-12-24 0 0 0 2022-12-26 2022-12-26 2022-12-26
The non-standard-evaluation part
When you need to access something referenced by a combination of names and values of the parameters passed to a function, eval and sym can be used as follows:
example_within_a_function <- function(
the_data
, var_column_name
, var_datestamp_column_name
, baseline_column_name
, other_column_name
){
# Skip the first argument, which is the function, itself,
# and get all the rest of the arguments,
# which are passed parameters
arguments <- match.call()[ -1 ] %>% as.list
# Extract the value passed to each argument
values <- seq( arguments ) %>% map_chr( ~rlang::as_string( arguments[[.]] ))
# Return the names of the arguments, their values,
# the data table (using non-standard evaluation), and
# the data table (using a straight-forward reference).
list(
labels_within_function = names( arguments )
, labels_in_parent_env = values
, data = eval( sym( values[[ 1 ]] ))
, also_data = the_data
)
}
example_within_a_function(
the_data = das
, var_column_name = pheno
, var_datestamp_column_name = pheno_dt
, baseline_column_name = baseline_dt
, other_column_name = other_dt
)
$labels_within_function
[1] "the_data" "var_column_name" "var_datestamp_column_name" "baseline_column_name"
[5] "other_column_name"
$labels_in_parent_env
[1] "das" "pheno" "pheno_dt" "baseline_dt" "other_dt"
$data
# A tibble: 15 × 5
id pheno pheno_dt baseline_dt other_dt
<chr> <int> <date> <date> <date>
1 a 1 2022-01-05 2022-01-04 2022-01-03
2 a 1 2022-01-20 2022-04-19 2022-01-05
3 a NA 2022-01-24 2022-05-16 2022-02-02
4 a 1 2022-03-30 2022-05-26 2022-02-09
5 a 0 2022-04-19 2022-06-07 2022-04-13
6 b 0 2022-04-20 2022-07-16 2022-04-19
7 b 1 2022-06-14 2022-08-03 2022-04-24
8 b 0 2022-07-31 2022-08-14 2022-05-10
9 b 0 2022-09-16 2022-09-02 2022-05-18
10 b 1 2022-10-10 2022-10-19 2022-07-05
11 c 0 2022-10-24 2022-10-26 2022-08-16
12 c 1 2022-10-25 2022-11-10 2022-09-15
13 c 1 2022-11-10 2022-11-20 2022-09-19
14 c 0 2022-12-14 2022-12-14 2022-11-25
15 c 0 2022-12-26 2022-12-21 2022-12-24
$also_data
# A tibble: 15 × 5
id pheno pheno_dt baseline_dt other_dt
<chr> <int> <date> <date> <date>
1 a 1 2022-01-05 2022-01-04 2022-01-03
2 a 1 2022-01-20 2022-04-19 2022-01-05
3 a NA 2022-01-24 2022-05-16 2022-02-02
4 a 1 2022-03-30 2022-05-26 2022-02-09
5 a 0 2022-04-19 2022-06-07 2022-04-13
6 b 0 2022-04-20 2022-07-16 2022-04-19
7 b 1 2022-06-14 2022-08-03 2022-04-24
8 b 0 2022-07-31 2022-08-14 2022-05-10
9 b 0 2022-09-16 2022-09-02 2022-05-18
10 b 1 2022-10-10 2022-10-19 2022-07-05
11 c 0 2022-10-24 2022-10-26 2022-08-16
12 c 1 2022-10-25 2022-11-10 2022-09-15
13 c 1 2022-11-10 2022-11-20 2022-09-19
14 c 0 2022-12-14 2022-12-14 2022-11-25
15 c 0 2022-12-26 2022-12-21 2022-12-24
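For reference, here is a minimal base-R sketch of the two requested steps, taking the column names as strings so no non-standard evaluation is needed at all. The function name update_pheno_chr and the string-based interface are my own choices, not from the question, and the rules follow the interpretation used above (NA out the value when its date precedes other_dt; flag it as incident when its date is on or after baseline_dt):

```r
# A minimal sketch, assuming `data` has the named columns; names are strings
update_pheno_chr <- function(data, var, vardt,
                             baseline = "baseline_dt", other = "other_dt") {
  # Step 1: NA out the value (and its date) when the date precedes other_dt
  drop <- !is.na(data[[vardt]]) & data[[vardt]] < data[[other]]
  data[[var]][drop] <- NA
  data[[vardt]][is.na(data[[var]])] <- NA
  # Step 2: build {var}_incid / {var}_incid_dt for dates on/after baseline_dt
  keep <- !is.na(data[[vardt]]) & data[[vardt]] >= data[[baseline]]
  data[[paste0(var, "_incid")]] <- ifelse(keep, data[[var]], NA)
  # ifelse() strips the Date class, so restore it explicitly
  data[[paste0(var, "_incid_dt")]] <- as.Date(ifelse(keep, data[[vardt]], NA),
                                              origin = "1970-01-01")
  data
}

# Tiny demo on the first three rows of the example data
d <- data.frame(pheno = c(1, 1, 0),
                pheno_dt    = as.Date(c("2022-01-05", "2022-01-20", "2022-01-24")),
                baseline_dt = as.Date(c("2022-01-04", "2022-04-19", "2022-05-16")),
                other_dt    = as.Date(c("2022-01-03", "2022-01-05", "2022-02-02")))
res <- update_pheno_chr(d, "pheno", "pheno_dt")
```

Row 3 is zeroed out (its date precedes other_dt), and only row 1 gets a non-NA pheno_incid, since only its date is on or after baseline_dt.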

How to search upwards a column for a value based on whether another column is NA or not?

I need to find the previous date for which value is not NA, and then also use the value on that row. I have tried shift, but it works for row 9 and not when there are consecutive non-NA rows in type, such as rows 5 and 6.
library(data.table)

dtihave <- data.table(date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03",
                                       "2020-04-02", "2020-05-09", "2020-06-10",
                                       "2020-07-18", "2020-08-23", "2020-09-09")),
                      type = c(1, 1, NA, NA, 1, 1, NA, NA, 1),
                      value = c(7, NA, 6, 8, NA, NA, 5, 9, NA))
> dtihave
date type value
1: 2020-01-01 1 7
2: 2020-02-01 1 NA
3: 2020-03-03 NA 6
4: 2020-04-02 NA 8
5: 2020-05-09 1 NA
6: 2020-06-10 1 NA
7: 2020-07-18 NA 5
8: 2020-08-23 NA 9
9: 2020-09-09 1 NA
dtiwant <- data.table(date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03",
                                       "2020-04-02", "2020-05-09", "2020-06-10",
                                       "2020-07-18", "2020-08-23", "2020-09-09")),
                      type = c(1, 1, NA, NA, 1, 1, NA, NA, 1),
                      value = c(7, NA, 6, 8, NA, NA, 5, 9, NA),
                      iwantdate = as.Date(c(NA, "2020-01-01", NA, NA, "2020-04-02",
                                            "2020-04-02", NA, NA, "2020-08-23")),
                      iwantvalue = c(NA, 7, NA, NA, 8, 8, NA, NA, 9))
> dtiwant
date type value iwantdate iwantvalue
1: 2020-01-01 1 7 <NA> NA
2: 2020-02-01 1 NA 2020-01-01 7
3: 2020-03-03 NA 6 <NA> NA
4: 2020-04-02 NA 8 <NA> NA
5: 2020-05-09 1 NA 2020-04-02 8
6: 2020-06-10 1 NA 2020-04-02 8
7: 2020-07-18 NA 5 <NA> NA
8: 2020-08-23 NA 9 <NA> NA
9: 2020-09-09 1 NA 2020-08-23 9
My current progress using shift is below, but I need row 6's iwantdate to be "2020-04-02". The number of shifts needed is unknown, so I cannot just use n = 2 in shift.
dtprogress = copy(dtihave)
dtprogress[, iwantdate := ifelse(!is.na(type) & is.na(value), shift(date), NA)]
dtprogress[, iwantdate := ifelse(!is.na(type) & !is.na(value), date, iwantdate)]
dtprogress[, iwantdate := as.Date(iwantdate, origin = "1970-01-01")]
> dtprogress
date type value iwantdate
1: 2020-01-01 1 7 2020-01-01
2: 2020-02-01 1 NA 2020-01-01
3: 2020-03-03 NA 6 <NA>
4: 2020-04-02 NA 8 <NA>
5: 2020-05-09 1 NA 2020-04-02
6: 2020-06-10 1 NA 2020-05-09
7: 2020-07-18 NA 5 <NA>
8: 2020-08-23 NA 9 <NA>
9: 2020-09-09 1 NA 2020-08-23
You could do:
dtihave[, idx := cummax((!is.na(value)) * .I) * NA^!is.na(value)][,
c('want_date', 'want_value') := lapply(.SD, '[', idx),
.SDcols = c('date', 'value')][, idx:=NULL]
dtihave
date type value want_date want_value
1: 2020-01-01 1 7 <NA> NA
2: 2020-02-01 1 NA 2020-01-01 7
3: 2020-03-03 NA 6 <NA> NA
4: 2020-04-02 NA 8 <NA> NA
5: 2020-05-09 1 NA 2020-04-02 8
6: 2020-06-10 1 NA 2020-04-02 8
7: 2020-07-18 NA 5 <NA> NA
8: 2020-08-23 NA 9 <NA> NA
9: 2020-09-09 1 NA 2020-08-23 9
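The heart of that one-liner is building a last-observation-carried-forward index with cummax() and then masking it out on rows where value is already present. A base-R sketch of the same trick:

```r
value <- c(7, NA, 6, 8, NA, NA, 5, 9, NA)

# Index of the most recent non-NA value at or before each position
idx <- cummax((!is.na(value)) * seq_along(value))
# Blank the index out on rows where value itself is non-NA
# (NA^0 == 1 and NA^1 == NA, exactly as in the data.table one-liner)
idx <- idx * NA^(!is.na(value))
idx
#> [1] NA  1 NA NA  4  4 NA NA  8

value[idx]
#> [1] NA  7 NA NA  8  8 NA NA  9
```

Because the carried-forward index repeats (rows 5 and 6 both point at row 4), consecutive NA runs get the right answer where a fixed-n shift() cannot.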
With the tidyverse (plus data.table's nafill). Hopefully this also solves the grouping: just add %>% group_by(...) before mutate and you are good to go.
dtihave %>%
mutate(val_na = !is.na(value),
idx = nafill(na_if(row_number() * val_na, 0), "locf"),
idx = idx * NA ^ val_na,
date1 = date[idx], value1 = value[idx],
val_na = NULL, idx = NULL)
You can use lag to get previous values, e.g.
library(dplyr)
dtihave %>%
mutate(iwantdate = ifelse(is.na(value), lag(date), NA) %>% as.Date(., origin = "1970-01-01"),
iwantvalue = ifelse(is.na(value), lag(value), NA))
date type value iwantdate iwantvalue
1: 2020-01-01 1 7 <NA> NA
2: 2020-02-01 1 NA 2020-01-01 7
3: 2020-03-03 NA 6 <NA> NA
4: 2020-04-02 NA 8 <NA> NA
5: 2020-05-09 1 NA 2020-04-02 8
6: 2020-06-10 1 NA 2020-05-09 NA
7: 2020-07-18 NA 5 <NA> NA
8: 2020-08-23 NA 9 <NA> NA
9: 2020-09-09 1 NA 2020-08-23 9

Turn a loop based code into a vectorised one in R?

I've got this dataset and want to perform some calculations based on certain conditions:
library(tidyverse)
library(lubridate)

filas <- structure(list(Año = c(rep(2020, 4), rep(2021, 4), 2022),
                        Mes = c(2:5, 3:4, 9, 11, 1),
                        Id = c(rep(1, 7), 2, 2)),
                   row.names = c(NA, -9L),
                   class = c("tbl_df", "tbl", "data.frame")) %>%
  mutate(fecha = make_date(Año, Mes, 1),
         meses_imp = make_date(2999, 1, 1))
  Año Mes Id fecha      meses_imp
1 2020   2  1 2020-02-01 2999-01-01
2 2020   3  1 2020-03-01 2999-01-01
3 2020   4  1 2020-04-01 2999-01-01
4 2020   5  1 2020-05-01 2999-01-01
5 2021   3  1 2021-03-01 2999-01-01
6 2021   4  1 2021-04-01 2999-01-01
7 2021   9  1 2021-09-01 2999-01-01
8 2021  11  2 2021-11-01 2999-01-01
9 2022   1  2 2022-01-01 2999-01-01
I need to add rows for each Id when there are "holes" between two consecutive ones, and count those added rows later. I've achieved this using a while loop:
i <- 2
while (!is.na(filas[i, ]$Id)) {
  if (as.double(difftime(filas[i, ]$fecha, filas[i - 1, ]$fecha)) > 31 &
      filas[i, ]$Id == filas[i - 1, ]$Id) {
    filas <- add_row(filas,
                     Id = filas[i, ]$Id,
                     fecha = filas[i - 1, ]$fecha + months(1),
                     meses_imp = pmin(filas[i - 1, ]$fecha,
                                      filas[i - 1, ]$meses_imp),
                     .after = i - 1)
  }
  i <- i + 1
}

filas2 <- filas %>%
  group_by(Id, meses_imp) %>%
  summarise(cant_meses_imp = n()) %>%
  ungroup() %>%
  filter(meses_imp != "2999-01-01")

filas <- left_join(filas, filas2, by = c("Id", "meses_imp"))
   Año Mes Id fecha      meses_imp  cant_meses_imp
1  2020   2  1 2020-02-01 2999-01-01 NA
2  2020   3  1 2020-03-01 2999-01-01 NA
3  2020   4  1 2020-04-01 2999-01-01 NA
4  2020   5  1 2020-05-01 2999-01-01 NA
5    NA  NA  1 2020-06-01 2020-05-01  9
6    NA  NA  1 2020-07-01 2020-05-01  9
7    NA  NA  1 2020-08-01 2020-05-01  9
8    NA  NA  1 2020-09-01 2020-05-01  9
9    NA  NA  1 2020-10-01 2020-05-01  9
10   NA  NA  1 2020-11-01 2020-05-01  9
11   NA  NA  1 2020-12-01 2020-05-01  9
12   NA  NA  1 2021-01-01 2020-05-01  9
13   NA  NA  1 2021-02-01 2020-05-01  9
14 2021   3  1 2021-03-01 2999-01-01 NA
15 2021   4  1 2021-04-01 2999-01-01 NA
16   NA  NA  1 2021-05-01 2021-04-01  4
17   NA  NA  1 2021-06-01 2021-04-01  4
18   NA  NA  1 2021-07-01 2021-04-01  4
19   NA  NA  1 2021-08-01 2021-04-01  4
20 2021   9  1 2021-09-01 2999-01-01 NA
21 2021  11  2 2021-11-01 2999-01-01 NA
22   NA  NA  2 2021-12-01 2021-11-01  1
23 2022   1  2 2022-01-01 2999-01-01 NA
Since I'd like to apply this to a much larger dataset (~300k rows), how could I rewrite it in a vectorised way so that it's more efficient (and perhaps more elegant)?
Thanks!
You can apply the following code using the padr and zoo packages. The idea is to:
1. Add the missing dates with the padr::pad() function.
2. Remove unwanted rows (non-integer Id values).
3. Create na and grp columns to identify the rows added in step 1.
4. Group by grp and create a column cant_meses_imp counting the number of consecutive NAs in each group.
5. Select only the desired columns.
library(dplyr)
library(padr)
library(zoo)
filas %>%
  pad(by = "fecha") %>%            # add the missing dates
  mutate(Id = na.approx(Id)) %>%   # interpolate NA values in the Id column
  subset(Id %% 1 == 0) %>%         # keep only integer Id values
  # this part generates the cant_meses_imp column
  mutate(na = ifelse(is.na(Mes), 1, 0),
         grp = rle(na)$lengths %>% {rep(seq(length(.)), .)}) %>%
  group_by(grp) %>%
  mutate(cant_meses_imp = ifelse(na == 0, NA, n())) %>%
  ungroup() %>%
  select(-c(na, grp))
The code does not reproduce the fecha column exactly, as there are no guidelines for its values.
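The rle() step deserves a note: repeating seq_along(lengths) by lengths turns runs of identical values into group ids, which is what lets n() count consecutive padded rows. A minimal base-R illustration with a made-up na flag:

```r
na <- c(0, 0, 1, 1, 1, 0, 1, 1)   # 1 marks a row added by pad()

r <- rle(na)                      # run lengths: 2, 3, 1, 2
grp <- rep(seq_along(r$lengths), r$lengths)
grp
#> [1] 1 1 2 2 2 3 4 4

# Size of each run of added rows, NA on original rows
cant <- ifelse(na == 0, NA, ave(na, grp, FUN = length))
cant
#> [1] NA NA  3  3  3 NA  2  2
```

Here ave() plays the role of group_by(grp) %>% mutate(n()): every element of a run gets that run's length.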

R's padr package claiming the "datetime variable does not vary" when it does vary

library(tidyverse)
library(lubridate)
library(padr)
df
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
I have an R data frame named df shown above. I want to concentrate on rows 5 and 6. I can usually use the padr package to pad the months between rows 5 and 6. Its pad() function basically adds rows at the interval the user specifies, best shown as the added rows marked "X" below.
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> X 262-179 NA NA NA 2019-09-01
#> X 262-179 NA NA NA 2019-10-01
#> X 262-179 NA NA NA 2019-11-01
#> X 262-179 NA NA NA 2019-12-01
#> X 262-179 NA NA NA 2020-01-01
#> X 262-179 NA NA NA 2020-02-01
#> X 262-179 NA NA NA 2020-03-01
#> X 262-179 NA NA NA 2020-04-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
To get there I usually issue a command such as the one shown below, and it works fine in padr. But it doesn't work in this specific example; instead it yields the warning shown below.
df %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
#> Warning message:
#> datetime variable does not vary for 537 of the groups, no padding applied on this / these group(s)
Why does it claim that "the datetime variable does not vary" for rows 5 and 6, when the datetime does indeed vary. The datetime for row 5 variable YrMo is "2019-08-01" and the datetime for row 6 variable YrMo is "2020-05-01". Let me state the obvious that "2019-08-01" varies from "2020-05-01".
Any ideas what went wrong? I tried to create a reproducible example and could not. The basic examples I created all work as expected (as I describe). Hopefully these clues can help somebody determine what is going on.
