Unusual datetime behavior in dplyr

I'm trying to extract the day from timestamps recorded in the UTC time zone, using as.Date(). This sometimes produces inexplicable NAs in a grouped tbl_df object, though not if I wrap that same object in data.frame() or ungroup(), or filter it first. My example is below. The grouped tbl_df object is checkit and the errant observation is #3, for wcid = 148. There is nothing unusual about its timestamp, yet as.Date() returns an NA for it unless I transform checkit as described above:
> checkit
Source: local data frame [6 x 3]
Groups: wcid, ab_split_test [6]
wcid ab_split_test mailing_timestamp
(dbl) (chr) (time)
1 1 N <NA>
2 78 Y 2016-04-04 12:28:58
3 148 Y 2016-03-17 09:11:31
4 204 Y 2016-03-04 09:01:15
5 255 Y 2016-03-03 09:18:43
6 267 Y 2016-03-23 09:16:50
> class(checkit)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
> checkit %>% mutate(treatment_day_actual = as.Date(mailing_timestamp))
Source: local data frame [6 x 4]
Groups: wcid, ab_split_test [6]
wcid ab_split_test mailing_timestamp treatment_day_actual
(dbl) (chr) (time) (date)
1 1 N <NA> <NA>
2 78 Y 2016-04-04 12:28:58 2016-04-04
3 148 Y 2016-03-17 09:11:31 <NA>
4 204 Y 2016-03-04 09:01:15 2016-03-04
5 255 Y 2016-03-03 09:18:43 2016-03-03
6 267 Y 2016-03-23 09:16:50 2016-03-23
> ungroup(checkit) %>% mutate(treatment_day_actual = as.Date(mailing_timestamp))
Source: local data frame [6 x 4]
wcid ab_split_test mailing_timestamp treatment_day_actual
(dbl) (chr) (time) (date)
1 1 N <NA> <NA>
2 78 Y 2016-04-04 12:28:58 2016-04-04
3 148 Y 2016-03-17 09:11:31 2016-03-17
4 204 Y 2016-03-04 09:01:15 2016-03-04
5 255 Y 2016-03-03 09:18:43 2016-03-03
6 267 Y 2016-03-23 09:16:50 2016-03-23
> data.frame(checkit) %>% mutate(treatment_day_actual = as.Date(mailing_timestamp))
wcid ab_split_test mailing_timestamp treatment_day_actual
1 1 N <NA> <NA>
2 78 Y 2016-04-04 12:28:58 2016-04-04
3 148 Y 2016-03-17 09:11:31 2016-03-17
4 204 Y 2016-03-04 09:01:15 2016-03-04
5 255 Y 2016-03-03 09:18:43 2016-03-03
6 267 Y 2016-03-23 09:16:50 2016-03-23
> filter(checkit, wcid == 148) %>% mutate(treatment_day_actual = as.Date(mailing_timestamp))
Source: local data frame [1 x 4]
Groups: wcid, ab_split_test [1]
wcid ab_split_test mailing_timestamp treatment_day_actual
(dbl) (chr) (time) (date)
1 148 Y 2016-03-17 09:11:31 2016-03-17
And here's dput:
> dput(checkit)
structure(list(wcid = c(1, 78, 148, 204, 255, 267), ab_split_test = c("N",
"Y", "Y", "Y", "Y", "Y"), mailing_timestamp = structure(c(NA,
1459787338.92449, 1458220291.82732, 1457100075.70328, 1457014723.60799,
1458739010.74587), class = c("POSIXct", "POSIXt"), tzone = "")), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), vars = list(
wcid, ab_split_test), drop = TRUE, indices = list(0L, 1L,
2L, 3L, 4L, 5L), group_sizes = c(1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
wcid = c(1, 78, 148, 204, 255, 267), ab_split_test = c("N",
"Y", "Y", "Y", "Y", "Y")), class = "data.frame", row.names = c(NA,
-6L), vars = list(wcid, ab_split_test), drop = TRUE, .Names = c("wcid",
"ab_split_test")), .Names = c("wcid", "ab_split_test", "mailing_timestamp"
))
I just noticed from dput() that the time zone is missing. When I query it, it shows up as my locale:
> attr(as.POSIXlt(checkit$mailing_timestamp),'tzone')
[1] "" "EST" "EDT"
This is not as it should be either, because the sql argument in my dplyr::tbl() call specifically requested UTC, as in select mailing_timestamp at time zone 'UTC' as mailing_timestamp. I am connecting to a PostgreSQL database.
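A possible workaround, sketched here on the assumption that the stray NA comes from the empty tzone attribute interacting with the grouped mutate (not a confirmed fix for this dplyr version), is to drop the grouping and pin the time zone explicitly before converting:
library(dplyr)
library(lubridate)  # for with_tz(); assumed to be available

checkit %>%
  ungroup() %>%                                                        # grouping is not needed for this step
  mutate(mailing_timestamp = with_tz(mailing_timestamp, "UTC"),        # make the UTC time zone explicit
         treatment_day_actual = as.Date(mailing_timestamp, tz = "UTC"))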

Related

How to summarize factor data in R?

I have a data set train consisting of a column of dates obs_m, a column of factors default_flag (levels 0 and 1), and a column of numeric values pred_default. When I try to summarize by date using the following code:
library(dplyr)
train %>% group_by(obs_m) %>% summarise_at(vars(default_flag, pred_default), mean)
It gives me this:
## # A tibble: 60 × 3
## obs_m default_flag pred_default
## <date> <dbl> <dbl>
## 1 2014-04-01 NA 0.0169
## 2 2014-05-01 NA 0.0205
## 3 2014-06-01 NA 0.0239
## 4 2014-07-01 NA 0.0246
## 5 2014-08-01 NA 0.0275
## 6 2014-09-01 NA 0.0301
## 7 2014-10-01 NA 0.0291
## 8 2014-11-01 NA 0.0254
## 9 2014-12-01 NA 0.0233
## 10 2015-01-01 NA 0.0199
## # … with 50 more rows
What can I do to prevent default_flag from returning NAs?
Edit
train = structure(list(obs_m = structure(c(16161, 16161, 16161, 16161,
16161, 16161), class = "Date"), default_flag = structure(c(1L,
1L, 1L, 1L, 1L, 1L), levels = c("0", "1"), class = "factor"),
pred_default = c(0.0181536124206322, 0.0138495688337231,
0.00682555751527574, 0.0107712925780696, 0.0159171589457986,
0.0168013691030077)), row.names = c(NA, 6L), class = "data.frame")
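One likely explanation: mean() returns NA (with a warning) when it is handed a factor, and default_flag is a factor. A minimal sketch, assuming the levels "0"/"1" should be treated as the numbers 0 and 1, is to convert the factor before summarising:
library(dplyr)

train %>%
  mutate(default_flag = as.numeric(as.character(default_flag))) %>%  # factor "0"/"1" -> 0/1
  group_by(obs_m) %>%
  summarise_at(vars(default_flag, pred_default), mean)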

Merge two dataframes: specifically merge a selection of columns based on two conditions?

I have two datasets on the same 2 patients. With the second dataset I want to add new information to the first, but I can't seem to get the code right.
My first (incomplete) dataset has a patient ID, measurement time (either T0 or FU1), year of birth, date of the CT scan, and two outcomes (legs_mass and total_mass):
library(tidyverse)
library(dplyr)
library(magrittr)
library(lubridate)
df1 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, NA, NA, NA), total_mass = c(14.5, NA,
NA, NA)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# Which gives the following dataframe
df1
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 NA NA
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 NA NA
The second dataset adds to the legs_mass and total_mass columns:
df2 <- structure(list(ID = c(115, 370), date_ct = structure(c(17842,
18535), class = "Date"), ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
"PXE370_CT_20200930_xxxxx-403.tif"), legs_mass = c(956.1, 21.3
), total_mass = c(1015.9, 21.3)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
# Which gives the following dataframe:
df2
# A tibble: 2 x 5
ID date_ct ctscan_label legs_mass total_mass
<dbl> <date> <chr> <dbl> <dbl>
1 115 2018-11-07 PXE115_CT_20181107_xxxxx-3.tif 956. 1016.
2 370 2020-09-30 PXE370_CT_20200930_xxxxx-403.tif 21.3 21.3
What I am trying to do is:
1. Add the legs_mass and total_mass column values from df2 to df1, based on ID number and date_ct.
2. Add the new column of df2 that is not in df1 (ctscan_label) to df1, also based on the date of the CT scan and patient ID.
So that the final dataset df3 looks as follows:
df3 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, 956.1, NA, 21.3), total_mass = c(14.5,
1015.9, NA, 21.3)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
# Corresponding to the following tibble:
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 956. 1016.
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 21.3 21.3
I have tried the merge function and rbind from base R, and bind_rows from dplyr, but can't seem to get it right.
Any help?
You can join the two datasets and use coalesce() to keep the non-NA value from either one.
library(dplyr)
left_join(df1, df2, by = c("ID", "date_ct")) %>%
  mutate(legs_mass = coalesce(legs_mass.x, legs_mass.y),
         total_mass = coalesce(total_mass.x, total_mass.y)) %>%
  select(-matches('\\.x|\\.y'), -ctscan_label)
#     ID time  year_of_birth date_ct    legs_mass total_mass
#  <dbl> <fct>         <dbl> <date>         <dbl>      <dbl>
#1   115 T0             1970 2015-08-04       9.1       14.5
#2   115 FU1            1970 2018-11-07     956.      1016.
#3   370 T0             1961 2015-08-04      NA         NA
#4   370 FU1            1961 2020-09-30      21.3       21.3
We can use data.table methods
library(data.table)
setDT(df1)[setDT(df2), c("legs_mass", "total_mass") :=
             .(fcoalesce(legs_mass, i.legs_mass),
               fcoalesce(total_mass, i.total_mass)),
           on = .(ID, date_ct)]
-output
df1
ID time year_of_birth date_ct legs_mass total_mass
1: 115 T0 1970 2015-08-04 9.1 14.5
2: 115 FU1 1970 2018-11-07 956.1 1015.9
3: 370 T0 1961 2015-08-04 NA NA
4: 370 FU1 1961 2020-09-30 21.3 21.3
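If dplyr 1.0.0 or later is available, rows_patch() is another option (a sketch alongside the answers above, not the original approach); it fills only the NA cells of df1 from the matching rows of df2. ctscan_label has to be dropped first, because rows_patch() cannot add new columns:
library(dplyr)  # rows_patch() requires dplyr >= 1.0.0

df3 <- rows_patch(df1,
                  df2 %>% select(ID, date_ct, legs_mass, total_mass),
                  by = c("ID", "date_ct"))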

Code is running fine line by line but fails when run as a whole chunk in R Markdown

When I run just this line of the code, the results are as expected. When I run the whole chunk, the mutations stop on the third line. How can I fix this? I feel like this is something new that I did not face before with the same code.
Sample data:
> dput(head(out))
structure(list(SectionCut = c("S-1", "S-1", "S-1", "S-1", "S-2",
"S-2"), OutputCase = c("LL-1", "LL-2", "LL-3", "LL-4", "LL-1",
"LL-2"), V2 = c(81.782, 119.251, 119.924, 96.282, 72.503, 109.595
), M3 = c("-29.292000000000002", "-32.661999999999999", "-30.904",
"-23.632999999999999", "29.619", "32.994"), id = c("./100-12-S01.xlsx",
"./100-12-S01.xlsx", "./100-12-S01.xlsx", "./100-12-S01.xlsx",
"./100-12-S01.xlsx", "./100-12-S01.xlsx")), row.names = c(NA,
-6L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
SectionCut = c("S-1", "S-1", "S-1", "S-1", "S-2", "S-2"),
OutputCase = c("LL-1", "LL-2", "LL-3", "LL-4", "LL-1", "LL-2"
), id = c("./100-12-S01.xlsx", "./100-12-S01.xlsx", "./100-12-S01.xlsx",
"./100-12-S01.xlsx", "./100-12-S01.xlsx", "./100-12-S01.xlsx"
), .rows = list(1L, 2L, 3L, 4L, 5L, 6L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
> dput(head(Beamline_Shear))
structure(list(VLL = c(159.512186, 154.3336, 149.4451613, 167.0207595,
161.2269091, 156.4116505)), row.names = c("84-9", "84-12", "84-15",
"92-9", "92-12", "92-15"), class = "data.frame")
Code that I am trying to run:
Shear <- out[, -4] %>%
  mutate(N_l = str_extract(OutputCase, "\\d+"),
         UG = str_extract(id, "\\d+"),
         a = str_extract(id, "-\\d+"),
         S = str_extract(a, "\\d+"),
         Sections = paste0(UG, "-", S),
         Sample = str_remove_all(id, "./\\d+-\\d+-|.xlsx")) %>%
  left_join(Beamline_Shear %>% rownames_to_column("Sections"), by = "Sections") %>%
  select(-OutputCase, -id, -Sections, -a)
The data carries grouping attributes. These should normally be harmless, but they can cause issues when the code runs in a different environment (for example as a whole knitr chunk). Also, the mutate step and the join step don't really need any grouping attributes, as they are straightforward rowwise operations that are vectorized.
library(dplyr)
library(stringr)  # str_extract(), str_remove_all()
library(tibble)   # rownames_to_column()

out %>%
  select(-4) %>%
  ungroup %>%  # removes group attributes
  mutate(N_l = str_extract(OutputCase, "\\d+"),
         UG = str_extract(id, "\\d+"),
         a = str_extract(id, "-\\d+"),
         S = str_extract(a, "\\d+"),
         Sections = paste0(UG, "-", S),
         Sample = str_remove_all(id, "./\\d+-\\d+-|.xlsx")) %>%
  left_join(Beamline_Shear %>% rownames_to_column("Sections"), by = "Sections")
# A tibble: 6 x 11
# SectionCut OutputCase V2 id N_l UG a S Sections Sample VLL
# <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#1 S-1 LL-1 81.8 ./100-12-S01.xlsx 1 100 -12 12 100-12 S01 NA
#2 S-1 LL-2 119. ./100-12-S01.xlsx 2 100 -12 12 100-12 S01 NA
#3 S-1 LL-3 120. ./100-12-S01.xlsx 3 100 -12 12 100-12 S01 NA
#4 S-1 LL-4 96.3 ./100-12-S01.xlsx 4 100 -12 12 100-12 S01 NA
#5 S-2 LL-1 72.5 ./100-12-S01.xlsx 1 100 -12 12 100-12 S01 NA
#6 S-2 LL-2 110. ./100-12-S01.xlsx 2 100 -12 12 100-12 S01 NA

Divide column from dataframe into another

I've got two data frames that I'm trying to divide by each other, but it's not working for me. Both data frames are 8 x 3; column one is the same in both, and the column names are also the same in both data frames:
bal_tier[,c(1, 3:4)]
# A tibble: 8 x 3
# Groups: hierachy_level2 [8]
hierachy_level2 `201804` `201904`
<chr> <dbl> <dbl>
1 CS 239 250
2 FNZ 87 97
3 OPS 1057 1136.
4 P&T 256 279
5 R&A 520 546
6 SPE 130 136.
7 SPP 67 66
8 TUR 46 69
dput(bal_tier[,c(1, 3:4)])
structure(list(hierachy_level2 = c("CS", "FNZ", "OPS", "P&T",
"R&A", "SPE", "SPP", "TUR"), `201804` = c(239, 87, 1057, 256,
520, 130, 67, 46), `201904` = c(250, 97, 1136.5, 279, 546, 136.5,
66, 69)), row.names = c(NA, -8L), groups = structure(list(hierachy_level2 = c("CS",
"FNZ", "OPS", "P&T", "R&A", "SPE", "SPP", "TUR"), .rows = list(
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"), .drop = FALSE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
tier_leavers[,c(1, 3:4)]
# A tibble: 8 x 3
# Groups: hierachy_level2 [8]
hierachy_level2 `201804` `201904`
<chr> <dbl> <dbl>
1 CS 32 47
2 FNZ 1 11
3 OPS 73 76
4 P&T 48 33
5 R&A 41 33
6 SPE 28 30
7 SPP 10 12
8 TUR 2 3
dput(tier_leavers[,c(1, 3:4)])
structure(list(hierachy_level2 = c("CS", "FNZ", "OPS", "P&T",
"R&A", "SPE", "SPP", "TUR"), `201804` = c(32, 1, 73, 48, 41,
28, 10, 2), `201904` = c(47, 11, 76, 33, 33, 30, 12, 3)), row.names = c(NA,
-8L), groups = structure(list(hierachy_level2 = c("CS", "FNZ",
"OPS", "P&T", "R&A", "SPE", "SPP", "TUR"), .rows = list(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"), .drop = FALSE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Doing this gives me what I want:
bal_tier[,1]
# A tibble: 8 x 1
# Groups: hierachy_level2 [8]
hierachy_level2
<chr>
1 CS
2 FNZ
3 OPS
4 P&T
5 R&A
6 SPE
7 SPP
8 TUR
(tier_leavers[,c(3:4)] / bal_tier[,c(3:4)])
201804 201904
1 0.13389121 0.18800000
2 0.01149425 0.11340206
3 0.06906339 0.06687198
4 0.18750000 0.11827957
5 0.07884615 0.06043956
6 0.21538462 0.21978022
7 0.14925373 0.18181818
8 0.04347826 0.04347826
but when I combine them with cbind I end up with this:
cbind(bal_tier[,1], tier_leavers[,c(3:4)] / bal_tier[,c(3:4)])
[,1] [,2]
201804 Character,8 Numeric,8
201904 Character,8 Numeric,8
What am I understanding wrong here?
Here's a solution using the tidyverse:
library(dplyr)

nme <- c("A", "B", "C", "D", "E")
yr_1 <- round(10 * runif(n = 5, min = 0, max = 10), 0)
yr_2 <- round(10 * runif(n = 5, min = 0, max = 10), 0)
data_1 <- data.frame(nme, yr_1, yr_2)

yr_1 <- round(10 * runif(n = 5, min = 0, max = 10), 0)
yr_2 <- round(10 * runif(n = 5, min = 0, max = 10), 0)
data_2 <- data.frame(nme, yr_1, yr_2)

data_divide <- data_1 %>%
  left_join(data_2, by = "nme") %>%
  mutate(result_1 = yr_1.x / yr_1.y,
         result_2 = yr_2.x / yr_2.y)
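Adapting that pattern to the question's own data (a sketch against the dput-ed subsets above, not tested in the original session), the join-then-divide approach would look like this:
library(dplyr)

bal <- ungroup(bal_tier[, c(1, 3:4)])
lvr <- ungroup(tier_leavers[, c(1, 3:4)])

tier_to <- bal %>%
  left_join(lvr, by = "hierachy_level2", suffix = c("_bal", "_lvr")) %>%
  mutate(`201804` = `201804_lvr` / `201804_bal`,
         `201904` = `201904_lvr` / `201904_bal`) %>%
  select(hierachy_level2, `201804`, `201904`)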
What I ended up doing feels like cheating but I got a clue from Zeus's answer:
a <- bal_tier[, 1]
b <- tier_leavers[,c(3:4)] / bal_tier[,c(3:4)]
tier_to <- data.frame(a, b)
tier_to
> tier_to
hierachy_level2 X201804 X201904
1 CS 0.13389121 0.18800000
2 FNZ 0.01149425 0.11340206
3 OPS 0.06906339 0.06687198
4 P&T 0.18750000 0.11827957
5 R&A 0.07884615 0.06043956
6 SPE 0.21538462 0.21978022
7 SPP 0.14925373 0.18181818
8 TUR 0.04347826 0.04347826
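Another option (a sketch, assuming dplyr is loaded) is to avoid cbind() altogether: ungroup the tibble and use dplyr's bind_cols(), which keeps tibble semantics and the original column names instead of mangling them to X201804/X201904:
library(dplyr)

tier_to <- bind_cols(ungroup(bal_tier[, 1]),
                     tier_leavers[, c(3:4)] / bal_tier[, c(3:4)])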

Aggregating time-based data of multiple patients to daily averages per patient in R

I have a dataframe that looks like this:
id time value
01 2014-02-26 13:00:00 6
02 2014-02-26 15:00:00 6
01 2014-02-26 18:00:00 6
04 2014-02-26 21:00:00 7
02 2014-02-27 09:00:00 6
03 2014-02-27 12:00:00 6
The dataframe consists of mood scores recorded at different time stamps throughout the day for multiple patients.
I want the dataframe to become like this:
id 2014-02-26 2014-02-27
01 6.25 4.32
02 5.39 8.12
03 9.23 3.18
04 5.76 3.95
That is, each row is a patient and each column is a day, containing the daily mean for every day in the dataframe. If a patient has no mood score on a specific date, I want the value to be NA.
What is the easiest way to do so using functions like ddply, or from other packages?
df <- structure(list(id = c(1L, 2L, 1L, 4L, 2L, 3L), time = structure(c(1393437600,
1393444800, 1393455600, 1393466400, 1393509600, 1393520400), class = c("POSIXct",
"POSIXt"), tzone = ""), value = c(6L, 6L, 6L, 7L, 6L, 6L)), .Names = c("id",
"time", "value"), row.names = c(NA, -6L), class = "data.frame")
Based on your description, this seems to be what you need,
library(tidyverse)
df %>%
  group_by(id, time1 = format(time, '%Y-%m-%d')) %>%
  summarise(new = mean(value)) %>%
  spread(time1, new)
#Source: local data frame [4 x 3]
#Groups: id [4]
# id `2014-02-26` `2014-02-27`
#* <int> <dbl> <dbl>
#1 1 6 NA
#2 2 6 6
#3 3 NA 6
#4 4 7 NA
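A side note on package versions (an assumption, not part of the original answer): in current tidyr, spread() has been superseded by pivot_wider(), so an equivalent sketch would be:
library(dplyr)
library(tidyr)

df %>%
  group_by(id, day = format(time, '%Y-%m-%d')) %>%
  summarise(value = mean(value), .groups = "drop") %>%  # .groups needs dplyr >= 1.0.0
  pivot_wider(names_from = day, values_from = value)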
In base R, you could combine aggregate with reshape like this:
# get means by id-date
temp <- setNames(aggregate(value ~ id + format(time, "%y-%m-%d"), data = df, FUN = mean),
                 c("id", "time", "value"))
# reshape to get dates as columns
reshape(temp, direction="wide", idvar="id", timevar="time")
id value.14-02-26 value.14-02-27
1 1 6 NA
2 2 6 6
3 4 7 NA
5 3 NA 6
I'd recommend using the data.table package; the approach is then very similar to Sotos' tidyverse solution.
library(data.table)
df <- data.table(df)
df[, time1 := format(time, '%Y-%m-%d')]
aggregated <- df[, list(meanvalue = mean(value)), by=c("id", "time1")]
aggregated <- dcast.data.table(aggregated, id~time1, value.var="meanvalue")
aggregated
# id 2014-02-26 2014-02-27
# 1: 1 6 NA
# 2: 2 6 6
# 3: 3 NA 6
# 4: 4 NA 7
(I think my result differs because my system runs in a different time zone; I imported the datetime objects as UTC.)
