Separating characters into new column [duplicate] - r

I have the following dataset and need the time component (not the date) placed in a separate column next to the date. The id column can be used to join each time back to its date.
dataset:
# A tibble: 10 x 2
origintime id
<dttm> <int>
1 2021-03-04 18:44:25 1
2 2021-03-04 18:28:32 2
3 2021-03-04 18:25:55 3
4 2021-03-04 18:23:00 4
5 2021-03-04 18:20:00 5
6 2021-03-04 18:15:58 6
7 2021-03-04 18:11:41 7
8 2021-03-04 18:10:57 8
9 2021-03-04 18:10:33 9
10 2021-03-04 18:07:01 10
outcome:
# A tibble: 10 x 3
origintime time id
<dttm> <int>
1 2021-03-04 18:44:25 1
2 2021-03-04 18:28:32 2
3 2021-03-04 18:25:55 3
4 2021-03-04 18:23:00 4
5 2021-03-04 18:20:00 5
6 2021-03-04 18:15:58 6
7 2021-03-04 18:11:41 7
8 2021-03-04 18:10:57 8
9 2021-03-04 18:10:33 9
10 2021-03-04 18:07:01 10
reproducible code:
structure(list(origintime = structure(c(1614883465.299, 1614882512.721,
1614882355.215, 1614882180.074, 1614882000.671, 1614881758.214,
1614881501.122, 1614881457.527, 1614881433.217, 1614881221.603
), tzone = "UTC", class = c("POSIXct", "POSIXt")), id = 1:10), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))

Use format() with %T to extract the time component of the 'origintime' column as a character string, and convert 'origintime' itself to Date class in the same step:
library(dplyr)
df1 <- df1 %>%
mutate(time = format(origintime, '%T'), origintime = as.Date(origintime))
Or use separate(), which returns both pieces as character columns:
library(tidyr)
df1 %>%
separate(origintime, into = c('origintime', 'time'), sep=" ")
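For comparison, here is a minimal base R sketch of the first approach. It rebuilds only the first three rows from the question's dput() output, and the name df1 is simply carried over from the answer above. The time is extracted before the conversion to Date, because a Date no longer carries the clock time.

```r
# First three rows of the question's data (values copied from the dput() output)
df1 <- data.frame(
  origintime = as.POSIXct(c(1614883465.299, 1614882512.721, 1614882355.215),
                          origin = "1970-01-01", tz = "UTC"),
  id = 1:3
)

# Extract the clock time while origintime is still a date-time;
# format() returns a character vector
df1$time <- format(df1$origintime, "%H:%M:%S", tz = "UTC")

# Only then reduce origintime to a Date
df1$origintime <- as.Date(df1$origintime, tz = "UTC")

df1
```

Note that the time column is plain text here, so it sorts and prints correctly but cannot be used for arithmetic without converting it again.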

library(data.table)
setDT(df)
df[ , c('origintime', 'time') := tstrsplit(origintime, " ")]
df
# origintime id time
# 1: 2021-03-04 1 18:44:25
# 2: 2021-03-04 2 18:28:32
# 3: 2021-03-04 3 18:25:55
# 4: 2021-03-04 4 18:23:00
# 5: 2021-03-04 5 18:20:00
# 6: 2021-03-04 6 18:15:58
# 7: 2021-03-04 7 18:11:41
# 8: 2021-03-04 8 18:10:57
# 9: 2021-03-04 9 18:10:33
#10: 2021-03-04 10 18:07:01

Related

How to create a new column that counts the number of occurrences of a value in another column and orders them by date

I have a 2 column data frame with "date" and "ID" headings. Some IDs are listed more than once. I want to create a new column "Attempt" that denotes the number of attempts that each ID has taken, ordered by the date of occurrence.
Here is my sample data:
ID <- c(1,2,5,8,4,9,1,11,15,32,54,1,4,2,14)
Date <- c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
"2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
"2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07")
Data <- data.frame(ID, Date)
Data$Date <- as.Date(Data$Date, format="%Y-%m-%d")
I tried various iterations of duplicated(). I can remove all duplicates or make every instance of a duplicated value "2" or "3" for example, but I want each occurrence to be ordered based on the date of the attempt taken.
Here is my expected result column to be added onto the original data frame:
Attempt <- c(1,1,1,1,1,1,2,1,1,1,1,3,2,2,1)
library(dplyr)
Data %>%
group_by(ID) %>%
mutate(Attempt = row_number())
ID Date Attempt
1 1 2021-04-12 1
2 2 2021-04-12 1
3 5 2021-04-13 1
4 8 2021-04-14 1
5 4 2021-04-19 1
6 9 2021-04-19 1
7 1 2021-04-20 2
8 11 2021-04-21 1
9 15 2021-04-22 1
10 32 2021-04-28 1
11 54 2021-04-28 1
12 1 2021-04-29 3
13 4 2021-04-29 2
14 2 2021-05-06 2
15 14 2021-05-07 1
If you have dplyr 1.1.0 or later, you can use the .by argument instead of group_by():
Data %>%
mutate(Attempt = row_number(), .by = ID)
Using data.table
library(data.table)
setDT(Data)[, Attempt := rowid(ID)]
-output
> Data
ID Date Attempt
1: 1 2021-04-12 1
2: 2 2021-04-12 1
3: 5 2021-04-13 1
4: 8 2021-04-14 1
5: 4 2021-04-19 1
6: 9 2021-04-19 1
7: 1 2021-04-20 2
8: 11 2021-04-21 1
9: 15 2021-04-22 1
10: 32 2021-04-28 1
11: 54 2021-04-28 1
12: 1 2021-04-29 3
13: 4 2021-04-29 2
14: 2 2021-05-06 2
15: 14 2021-05-07 1
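The answers above number the rows in the order they appear, which works because the sample data is already sorted by date. If that is not guaranteed, one possible variation is to rank the dates within each ID. This is only a sketch: add_attempt is a hypothetical helper name, and it assumes dplyr is installed. row_number(Date) ranks the dates within each group and breaks ties by order of appearance.

```r
library(dplyr)

ID <- c(1, 2, 5, 8, 4, 9, 1, 11, 15, 32, 54, 1, 4, 2, 14)
Date <- as.Date(c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
                  "2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
                  "2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07"))
Data <- data.frame(ID, Date)

# Hypothetical helper: rank each ID's dates chronologically,
# so the row order of the input does not matter
add_attempt <- function(d) {
  d %>%
    group_by(ID) %>%
    mutate(Attempt = row_number(Date)) %>%
    ungroup()
}

out <- add_attempt(Data)

# Same data in reverse row order, standing in for unsorted input
out_rev <- add_attempt(Data[rev(seq_len(nrow(Data))), ])
```

Both calls keep the rows in their input order and only add the Attempt column.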

Rolling Window based on a fixed time interval

I'm trying to calculate a rolling window over a fixed time interval. Suppose that the interval is 48 hours. I would like to get every data point that falls between the date of the current observation and 48 hours before that observation. For example, if the datetime of the current observation is 05-07-2022 14:15:28, for that position I would like a count of every occurrence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.
library(tidyverse)
library(lubridate)
df = tibble(id = 1:7,
date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'))
# A tibble: 7 × 2
id date_time
<int> <dttm>
1 1 2022-05-07 15:00:00
2 2 2022-05-09 13:45:00
3 3 2022-05-09 13:51:00
4 4 2022-05-09 17:00:00
5 5 2022-05-10 15:25:00
6 6 2022-05-10 17:18:00
7 7 2022-05-11 14:00:00
With the example window of 48 hours, that would yield:
# A tibble: 7 × 4
id date_time lag_48hours count
<int> <dttm> <dttm> <dbl>
1 1 2022-05-07 15:00:00 2022-05-05 15:00:00 1
2 2 2022-05-09 13:45:00 2022-05-07 13:45:00 2
3 3 2022-05-09 13:51:00 2022-05-07 13:51:00 3
4 4 2022-05-09 17:00:00 2022-05-07 17:00:00 3
5 5 2022-05-10 15:25:00 2022-05-08 15:25:00 4
6 6 2022-05-10 17:18:00 2022-05-08 17:18:00 5
7 7 2022-05-11 14:00:00 2022-05-09 14:00:00 4
I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).
I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.
library(tidyverse)
library(lubridate)
library(slider)
df = tibble(
id = 1:7,
date_time = ymd_hm(
'2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
)
)
df %>%
mutate(
count = slide_index_int(
.x = id,
.i = date_time,
.f = length,
.before = dhours(48)
)
)
#> # A tibble: 7 × 3
#> id date_time count
#> <int> <dttm> <int>
#> 1 1 2022-05-07 15:00:00 1
#> 2 2 2022-05-09 13:45:00 2
#> 3 3 2022-05-09 13:51:00 3
#> 4 4 2022-05-09 17:00:00 3
#> 5 5 2022-05-10 15:25:00 4
#> 6 6 2022-05-10 17:18:00 5
#> 7 7 2022-05-11 14:00:00 4
How about this...
df %>%
mutate(count48 = map_int(date_time,
~sum(date_time <= . & date_time > . - 48 * 60 * 60)))
# A tibble: 7 × 3
id date_time count48
<int> <dttm> <int>
1 1 2022-05-07 15:00:00 1
2 2 2022-05-09 13:45:00 2
3 3 2022-05-09 13:51:00 3
4 4 2022-05-09 17:00:00 3
5 5 2022-05-10 15:25:00 4
6 6 2022-05-10 17:18:00 5
7 7 2022-05-11 14:00:00 4
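Since the window needs to be adjustable, the same comparison can be wrapped in a small function. This is a base R sketch, not taken from either answer: count_within is a hypothetical name, and it follows the second answer in treating the lower edge of the window as exclusive.

```r
# Sample timestamps from the question, parsed as UTC
date_time <- as.POSIXct(
  c("2022-05-07 15:00", "2022-05-09 13:45", "2022-05-09 13:51", "2022-05-09 17:00",
    "2022-05-10 15:25", "2022-05-10 17:18", "2022-05-11 14:00"),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)

# Hypothetical helper: for each timestamp, count the observations in the
# window (t - hours, t]; POSIXct arithmetic is done in seconds
count_within <- function(times, hours) {
  window <- hours * 3600
  vapply(seq_along(times), function(i) {
    sum(times <= times[i] & times > times[i] - window)
  }, integer(1))
}

count_48 <- count_within(date_time, 48)
count_24 <- count_within(date_time, 24)
```

This compares every timestamp with every other one, so it is quadratic in the number of rows. For large data the slider approach is probably the better choice.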

Converting variable with 5 digit numbers and dates into date values

I have the following data, which contains some date values stored as 5-digit character values. When I try to convert the column to dates, the values that were already correctly formatted dates become NA.
dt <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2),
Registrationdate=c('2019-01-09','2019-01-09','2019-01-09','2019-01-09','2019-01-09',
'2019-01-09',"44105","44105","44105","44105","44105"))
Expected value
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
I tried using
library(openxlsx)
dt$Registrationdate <- convertToDate(dt$Registrationdate, origin = "1900-01-01")
But I got
1 1 <NA>
2 1 <NA>
3 1 <NA>
4 1 <NA>
5 1 <NA>
6 1 <NA>
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
Here's one approach using a mix of dplyr and base R:
library(dplyr, warn = FALSE)
dt |>
mutate(Registrationdate = if_else(grepl("-", Registrationdate),
as.Date(Registrationdate),
openxlsx::convertToDate(Registrationdate, origin = "1900-01-01")))
#> Warning in openxlsx::convertToDate(Registrationdate, origin = "1900-01-01"): NAs
#> introduced by coercion
#> id Registrationdate
#> 1 1 2019-01-09
#> 2 1 2019-01-09
#> 3 1 2019-01-09
#> 4 1 2019-01-09
#> 5 1 2019-01-09
#> 6 1 2019-01-09
#> 7 2 2020-10-01
#> 8 2 2020-10-01
#> 9 2 2020-10-01
#> 10 2 2020-10-01
#> 11 2 2020-10-01
Created on 2022-10-15 with reprex v2.0.2
library(janitor)
dt$Registrationdate <- convert_to_date(dt$Registrationdate)
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
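If you would rather avoid extra packages, the same idea can be sketched in base R: detect which entries are purely numeric and convert those as Excel serial numbers. The origin "1899-12-30" is the usual offset for Excel serials on the 1900 date system and is an assumption about how the file was created.

```r
x <- c("2019-01-09", "2019-01-09", "44105", "44105")

# Entries consisting only of digits are treated as Excel serial numbers
is_serial <- grepl("^[0-9]+$", x)

# Start from an all-NA Date vector and fill each kind separately
out <- rep(as.Date(NA), length(x))
out[is_serial] <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")
out[!is_serial] <- as.Date(x[!is_serial], format = "%Y-%m-%d")

out
```

Anything that is neither all digits nor in yyyy-mm-dd form ends up as NA, which makes unexpected formats easy to spot.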
Another option is to import columns in the expected format. An example with openxlsx2 is shown below. The top half creates a file that causes the behavior you see with openxlsx. This is because some of the rows in the Registrationdate column are formatted as dates and some as strings, a fairly common error caused by the person who generated the xlsx input.
With openxlsx2 you can define the type of column you want to import. The option was inspired by readxl (iirc).
library(openxlsx2)
## prepare data
date_as_string <- data.frame(
id = rep(1, 6),
Registrationdate = rep('2019-01-09', 6)
)
date_as_date <- data.frame(
id = rep(2, 5),
Registrationdate = rep(as.Date('2019-01-10'), 5)
)
options(openxlsx2.dateFormat = "yyyy-mm-dd")
wb <- wb_workbook()$
add_worksheet()$
add_data(x = date_as_string)$
add_data(x = date_as_date, colNames = FALSE, startRow = 7)
#wb$open()
## read data as date
dt <- wb_to_df(wb, types = c(id = 1, Registrationdate = 2))
## check that Registrationdate is actually a Date column
str(dt$Registrationdate)
#> Date[1:10], format: "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" ...

Dataframe with start & end date to daily data

I am trying to expand the data below to a daily basis, based on the range given by the start_date and end_date columns,
to this output (sum):
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data
library(dplyr)
library(tidyr)
# Reassign df so that the next step can use df$date
df <- df %>%
mutate(across(2:3, ~ as.Date(.x,
format = "%d/%m/%Y"))) %>%
pivot_longer(cols = c(start, end), values_to = "date") %>%
arrange(date) %>%
select(date, inventory)
df
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join
left_join(tibble(date = seq(first(df$date),
last(df$date),
by = "day")), df)
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows
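The steps above keep only the start and end dates of each range. If the goal is one row per day with the inventory summed over all ranges that cover that day, a base R sketch could look like the following. The desired output is not shown in the question, so this reading of "sum" is an assumption.

```r
df <- data.frame(
  id = 1:4,
  start = c("01/05/2022", "10/05/2022", "11/05/2022", "14/05/2022"),
  end = c("02/05/2022", "15/05/2022", "21/05/2022", "17/05/2022"),
  inventory = c(100, 50, 80, 10)
)
df$start <- as.Date(df$start, format = "%d/%m/%Y")
df$end <- as.Date(df$end, format = "%d/%m/%Y")

# Expand every row into one row per day of its start-to-end range
per_range <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(
    date = seq(df$start[i], df$end[i], by = "day"),
    inventory = df$inventory[i]
  )
})
expanded <- do.call(rbind, per_range)

# Sum the inventory of all ranges that cover the same day
daily <- aggregate(inventory ~ date, data = expanded, FUN = sum)
daily
```

Days that no range covers (3 to 9 May here) do not appear in the result. They could be added with a join against a full date sequence, as in the answer above.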

R function for creating uneven groups based on uneven dates

I am trying to find an R function that can index groups iteratively, given a set of unevenly spaced dates, uneven group sizes, and by grouped cases. Here are example data:
> h
# A tibble: 20 x 2
ID date
<int> <date>
1 1 2021-01-07
2 1 2021-01-11
3 1 2021-01-15
4 1 2021-01-16
5 1 2021-01-21
6 1 2021-01-26
7 1 2021-02-04
8 1 2021-02-08
9 1 2021-02-13
10 1 2021-02-20
11 1 2021-02-23
12 1 2021-02-27
13 2 2021-01-05
14 2 2021-01-11
15 2 2021-02-02
16 2 2021-02-08
17 2 2021-02-08
18 2 2021-02-14
19 2 2021-02-17
20 2 2021-02-21
For each unique ID, I want to find the first date (chronologically) and create a group (i.e., group==1) for that case and any other rows within 7 days. For the next date after 7 days, create a second group (i.e., group==2) for that case and any others within the next 7 days. Note: the next date is not necessarily exactly 7 days after the initial date. Repeat this process for the remaining cases to get the desired output:
# A tibble: 20 x 3
ID date group
<int> <date> <dbl>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
Using a rolling window function of 7 days will not work, as far as I can tell, as it will group the cases incorrectly. But I am wondering if a sort of custom rolling window function could be used? I would prefer a solution using dplyr, but other options would also work. Any help here is appreciated.
> dput(h)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(18634,
18638, 18642, 18643, 18648, 18653, 18662, 18666, 18671, 18678,
18681, 18685, 18632, 18638, 18660, 18666, 18666, 18672, 18675,
18679), class = "Date")), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Define a function date1 that takes the start date of the previous row's group (prev) and the current row's date (x), and returns the start date of the current row's group. The result is always one of the two arguments: x if it is more than 7 days after prev, otherwise prev. Then, grouping by ID, use Reduce to apply date1 cumulatively to the dates in each ID, and convert the result to a factor and then to an integer.
library(dplyr)
date1 <- function(prev, x) if (x > prev + 7) x else prev
h %>%
group_by(ID) %>%
mutate(group = as.integer(factor(Reduce(date1, date, accumulate = TRUE)))) %>%
ungroup()
giving:
# A tibble: 20 x 3
ID date group
<int> <date> <int>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
For each ID group, create group as a vector of NAs. While some group elements are still NA, take the first date value where group is NA and add 0 and 7 days to it to make a range of dates. For any rows where date is in the calculated date range, set elements of group to 1 more than the current max value of group (or 0 if group is still all NA).
library(data.table)
df <- as.data.table(h) # h is the data from the question's dput()
df[order(ID, date), {
group <- rep(NA_real_, .N)
while(any(is.na(group))){
group_range <- first(date[is.na(group)]) + c(0, 7)
group[date %between% group_range] <- 1 + max(fcoalesce(group, 0))
}
list(date, group)
}, by = ID]
# ID date group
# 1: 1 2021-01-07 1
# 2: 1 2021-01-11 1
# 3: 1 2021-01-15 2
# 4: 1 2021-01-16 2
# 5: 1 2021-01-21 2
# 6: 1 2021-01-26 3
# 7: 1 2021-02-04 4
# 8: 1 2021-02-08 4
# 9: 1 2021-02-13 5
# 10: 1 2021-02-20 5
# 11: 1 2021-02-23 6
# 12: 1 2021-02-27 6
# 13: 2 2021-01-05 1
# 14: 2 2021-01-11 1
# 15: 2 2021-02-02 2
# 16: 2 2021-02-08 2
# 17: 2 2021-02-08 2
# 18: 2 2021-02-14 3
# 19: 2 2021-02-17 3
# 20: 2 2021-02-21 3
Here's another version where I try to limit the computations. No idea if it's actually faster:
df[order(ID, date), {
group <- rep(NA_integer_, .N)
i <- 1L
g <- 1L
while(i <= .N){
group_range <- date[i] + c(0, 7)
chg <- date %between% group_range
group[chg] <- g
g <- g + 1L
i <- i + sum(chg)
}
list(date, group)
}, by = ID]
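The same rule can also be written as a plain loop, which may be easier to follow. This is a base R sketch rather than a replacement for the answers above: assign_groups is a hypothetical name, and it assumes the dates are already sorted within each ID. A date exactly 7 days after the group's start stays in that group, which matches the desired output.

```r
# Rebuild the question's data from its dput() output
h <- data.frame(
  ID = rep(1:2, c(12, 8)),
  date = as.Date(c(18634, 18638, 18642, 18643, 18648, 18653, 18662, 18666, 18671,
                   18678, 18681, 18685, 18632, 18638, 18660, 18666, 18666, 18672,
                   18675, 18679), origin = "1970-01-01")
)
h <- h[order(h$ID, h$date), ]

# Hypothetical helper: start a new group whenever a date is more than
# `width` days after the start of the current group
assign_groups <- function(dates, width = 7) {
  group <- integer(length(dates))
  start <- dates[1]
  g <- 1L
  for (i in seq_along(dates)) {
    if (dates[i] > start + width) {
      start <- dates[i]
      g <- g + 1L
    }
    group[i] <- g
  }
  group
}

# ave() applies the helper separately within each ID;
# the dates are passed as day counts, which compare the same way
h$group <- ave(as.numeric(h$date), h$ID, FUN = assign_groups)
h
```

Changing the width argument adjusts the window, for example FUN = function(d) assign_groups(d, width = 14).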
