How to merge characters of two columns? - r

I have a df1:
ID Date
23 4/5/2011
12 4/7/2012
14 6/17/2020
90 12/20/1994
Currently these columns are both character classes. I would like to create a third column that merges these two columns with an underscore. The output would look like:
ID Date Identifier
23 4/5/2011 23_4/5/2011
12 4/7/2012 12_4/7/2012
14 6/17/2020 14_6/17/2020
90 12/20/1994 90_12/20/1994

Use paste():
df1 <- data.frame(ID = c("23","13","14","90"), Date = c("4/5/2011", "4/7/2012", "6/17/2020", "12/20/1994"))
df1 |>
dplyr::mutate(Identifier = paste(ID, Date, sep = "_"))
#> ID Date Identifier
#> 1 23 4/5/2011 23_4/5/2011
#> 2 13 4/7/2012 13_4/7/2012
#> 3 14 6/17/2020 14_6/17/2020
#> 4 90 12/20/1994 90_12/20/1994
Created on 2022-03-06 by the reprex package (v2.0.1)

Using base R, you could do:
Reprex
your data
df <- read.table(text = "ID Date
23 4/5/2011
12 4/7/2012
14 6/17/2020
90 12/20/1994", header = TRUE)
Code
df$Identifier <- paste0(df$ID, "_", df$Date)
Output
df
#> ID Date Identifier
#> 1 23 4/5/2011 23_4/5/2011
#> 2 12 4/7/2012 12_4/7/2012
#> 3 14 6/17/2020 14_6/17/2020
#> 4 90 12/20/1994 90_12/20/1994
Created on 2022-03-06 by the reprex package (v2.0.1)

Related

Mean function R with missing values, loop on 5 rows

I would like to calculate mean every 5 rows in my df. Here is my df :
Time
value
03/06/2021 06:15:00
NA
03/06/2021 06:16:00
NA
03/06/2021 06:17:00
20
03/06/2021 06:18:00
22
03/06/2021 06:19:00
25
03/06/2021 06:20:00
NA
03/06/2021 06:21:00
31
03/06/2021 06:22:00
23
03/06/2021 06:23:00
19
03/06/2021 06:24:00
25
03/06/2021 06:25:00
34
03/06/2021 06:26:00
42
03/06/2021 06:27:00
NA
03/06/2021 06:28:00
19
03/06/2021 06:29:00
17
03/06/2021 06:30:00
25
I already have a loop which goes well to calculate means for each 5 rows package. My problem is in my "mean function".
The problem is :
-if I put na.rm = FALSE, mean = NA as soon as there is a NA in a package of 5 values.
- if I put na.rm = TRUE in mean function, the result gives me averages that are shifted to take 5 values. I would like the NA not to interfere with the average and that when there is a NA in a package of 5 values, the average is only done on 4 values.
How can I do this? Thanks for your help !
You can solve your problem by introducing a dummy variable that groups your observarions in sets of five and then calculating the mean within group. Here's MWE, based in the tidyverse, that assumes your data is in a data.frame named df.
library(tidyverse)
df %>%
mutate(Group= 1 + floor((row_number()-1) / 5)) %>%
group_by(Group) %>%
summarise(Mean=mean(value, na.rm=TRUE), .groups="drop")
# A tibble: 4 × 2
Group Mean
<dbl> <dbl>
1 1 22.3
2 2 24.5
3 3 28
4 4 25
A solution based on purrr::map_dfr:
library(purrr)
df <- data.frame(
stringsAsFactors = FALSE,
time = c("03/06/2021 06:15:00","03/06/2021 06:16:00",
"03/06/2021 06:17:00",
"03/06/2021 06:18:00","03/06/2021 06:19:00",
"03/06/2021 06:20:00","03/06/2021 06:21:00",
"03/06/2021 06:22:00","03/06/2021 06:23:00",
"03/06/2021 06:24:00","03/06/2021 06:25:00",
"03/06/2021 06:26:00",
"03/06/2021 06:27:00","03/06/2021 06:28:00",
"03/06/2021 06:29:00","03/06/2021 06:30:00"),
value = c(NA,NA,20L,22L,
25L,NA,31L,23L,19L,25L,34L,42L,NA,19L,17L,
25L)
)
map_dfr(1:(nrow(df)-5),
~ data.frame(Group =.x, Mean = mean(df$value[.x:(.x+5)],na.rm=T)))
#> Group Mean
#> 1 1 22.33333
#> 2 2 24.50000
#> 3 3 24.20000
#> 4 4 24.00000
#> 5 5 24.60000
#> 6 6 26.40000
#> 7 7 29.00000
#> 8 8 28.60000
#> 9 9 27.80000
#> 10 10 27.40000
#> 11 11 27.40000
If you want to take average of every 5 minutes you may use lubridate's function floor_date/ceiling_date to round the time.
library(dplyr)
library(lubridate)
df %>%
mutate(time = mdy_hms(time),
time = floor_date(time, '5 mins')) %>%
group_by(time) %>%
summarise(value = mean(value, na.rm = TRUE))
# time value
# <dttm> <dbl>
#1 2021-03-06 06:15:00 22.3
#2 2021-03-06 06:20:00 24.5
#3 2021-03-06 06:25:00 28
#4 2021-03-06 06:30:00 25

Is there a quick way of extracting data over several years corresponding to today's date

I have a dataframe that has a date column that spans from 2014-01-01 to today 2021-04-29, and other columns of associated data.
What I would like to do is filter for data for the current day and month across all years. So data for 04-29 is brought back for all years 2014 to 2021 on that particular day.
What would be the most efficient / tidiest way of doing this?
Do something like this
set.seed(2)
df <- data.frame(d = as.Date('2015-01-01') + sample(1:3000, 1000),
o = runif(1000))
head(df)
#> d o
#> 1 2017-09-02 0.1481156
#> 2 2016-12-11 0.3957168
#> 3 2022-09-23 0.2654405
#> 4 2016-02-21 0.3482240
#> 5 2016-01-28 0.6943241
#> 6 2015-10-01 0.5069469
library(lubridate)
#>
#> date, intersect, setdiff, union
df[month(df$d) == month(Sys.Date()) & day(df$d) == day(Sys.Date()),]
#> d o
#> 24 2016-04-29 0.2431883
#> 131 2017-04-29 0.9359659
#> 383 2022-04-29 0.2703415
Created on 2021-04-29 by the reprex package (v2.0.0)
dplyr method is similar
df %>% filter(month(d) == month(Sys.Date()) & day(d) == day(Sys.Date()))

How to calculate a time period until a condition is matched

I need to calculate a time of consecutive dates, until the difference of time between two consecutive dates is greater than 13 seconds.
For example, in the data frame create with the code shown below, the column test has the time difference between the dates. What I need is events of time between lines with test > 13 seconds.
# Create a vector of dates with a random time difference in seconds between records
dates <- seq(as.POSIXct("2020-01-01 00:00:02"), as.POSIXct("2020-01-02 00:00:02"), by = "2 sec")
dates <- dates + sample(15, length(dates), replace = T)
# Create a data.frame
data <- data.frame(id = 1:length(dates), dates = dates)
# Create a test field with the time difference between each date and the next
data$test <- c(diff(data$dates, lag = 1), 0)
# Delete the zero and negative time
data <- data[data$test > 0, ]
head(data)
What I want is something like this:
To get to your desired result we need to define 'blocks' of observation. Each block is splitted where test is greater than 13.
We start identifying the split_point, and then using the rle function we can assign an ID to each block.
Then we can filter out the split_point, and summarize the remaining blocks. Once with the sum of seconds, then with the min of the event dates.
split_point <- data$test <=13
# Find continuous blocks
block_str <- rle(split_point)
# Create block IDs
data$block <- rep(seq_along(block_str$lengths), block_str$lengths)
data <- data[split_point, ] # Remove split points
# Summarize
final_df <- aggregate(test ~ block, data = data, FUN = sum)
dtevent <- aggregate(dates ~ block, data= data, FUN=min)
# Join the two summaries
final_df$DatetimeEvent <- dtevent$dates
head(final_df)
#> block test DatetimeEvent
#> 1 1 101 2020-01-01 00:00:09
#> 2 3 105 2020-01-01 00:01:11
#> 3 5 277 2020-01-01 00:02:26
#> 4 7 46 2020-01-01 00:04:58
#> 5 9 27 2020-01-01 00:05:30
#> 6 11 194 2020-01-01 00:05:44
Created on 2020-04-02 by the reprex package (v0.3.0)
Using dplyrfor convenience sake:
library(dplyr)
final_df <- data %>%
mutate(split_point = test <= 13,
block = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
group_by(block) %>%
filter(split_point) %>%
summarise(DateTimeEvent = min(dates), TotalTime = sum(test))
final_df
#> # A tibble: 1,110 x 3
#> block DateTimeEvent TotalTime
#> <int> <dttm> <drtn>
#> 1 1 2020-01-01 00:00:06 260 secs
#> 2 3 2020-01-01 00:02:28 170 secs
#> 3 5 2020-01-01 00:04:11 528 secs
#> 4 7 2020-01-01 00:09:07 89 secs
#> 5 9 2020-01-01 00:10:07 37 secs
#> 6 11 2020-01-01 00:10:39 135 secs
#> 7 13 2020-01-01 00:11:56 50 secs
#> 8 15 2020-01-01 00:12:32 124 secs
#> 9 17 2020-01-01 00:13:52 98 secs
#> 10 19 2020-01-01 00:14:47 83 secs
#> # … with 1,100 more rows
Created on 2020-04-02 by the reprex package (v0.3.0)
(results are different because reprex recreates the data each time)

Set up data in order to use Prophet() in R

I want to use the Prophet() function in R, but I cannot transform my column "YearWeek" to a as.Date() column.
I have a column "YearWeek" that stores values from 201401 up to 201937 i.e. starting in 2014 week 1 up to 2019 week 37.
I don't know how to declare this column as a date in the form yyyy-ww needed to use the Prophet() function.
Does anyone know how to do this?
Thank you in advance.
One solution could be to append a 01 to the end of your yyyy-ww formatted dates.
Data:
library(tidyverse)
df <- cross2(2014:2019, str_pad(1:52, width = 2, pad = 0)) %>%
map_df(set_names, c("year", "week")) %>%
transmute(date = paste(year, week, sep = "")) %>%
arrange(date)
head(df)
#> # A tibble: 6 x 1
#> date
#> <chr>
#> 1 201401
#> 2 201402
#> 3 201403
#> 4 201404
#> 5 201405
#> 6 201406
Now let's append the 01 and convert to date:
df %>%
mutate(date = paste(date, "01", sep = ""),
new_date = as.Date(date, "%Y%U%w"))
#> # A tibble: 312 x 2
#> date new_date
#> <chr> <date>
#> 1 20140101 2014-01-05
#> 2 20140201 2014-01-12
#> 3 20140301 2014-01-19
#> 4 20140401 2014-01-26
#> 5 20140501 2014-02-02
#> 6 20140601 2014-02-09
#> 7 20140701 2014-02-16
#> 8 20140801 2014-02-23
#> 9 20140901 2014-03-02
#> 10 20141001 2014-03-09
#> # ... with 302 more rows
Created on 2019-10-10 by the reprex package (v0.3.0)
More info about a numeric week of the year can be found here.

How to assign day of year values starting from an arbitary date and take care of missing values?

I have an R dataframe df_demand with a date column (depdate) and a dependent variable column bookings. The duration is 365 days starting from 2017-11-02 and ending at 2018-11-01, sorted in ascending order.
We have booking data for only 279 days in the year.
dplyr::arrange(df_demand, depdate)
depdate bookings
1 2017-11-02 43
2 2017-11-03 27
3 2017-11-05 27
4 2017-11-06 22
5 2017-11-07 39
6 2017-11-08 48
.
.
279 2018-11-01 60
I want to introduce another column day_of_year in the following way:
depdate day_of_year bookings
1 2017-11-02 1 43
2 2017-11-03 2 27
3 2017-11-04 3 NA
4 2017-11-05 4 27
.
.
.
365 2018-11-01 365 60
I am trying to find the best possible way to do this.
In Python, I could use something like :
df_demand['day_of_year'] = df_demand['depdate'].sub(df_demand['depdate'].iat[0]).dt.days + 1
I wanted to know about an R equivalent of the same.
When I run
typeof(df_demand_2$depdate)
the output is
"double"
Am I missing something?
You can create a row for every date using the complete function from the tidyr package.
First, I'm creating a data frame with some sample data:
df <- data.frame(
depdate = as.Date(c('2017-11-02', '2017-11-03', '2017-11-05')),
bookings = c(43, 27, 27)
)
Next, I'm performing two operations. First, using tidyr::complete, I'm specifying all the dates I want in my analysis. I can do that using seq.Date, creating a sequence from the first to the last day.
Once that is done, the day_of_year column is simply equal to the row number.
df_complete <- tidyr::complete(df,
depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)
)
df_complete$day_of_year <- 1:nrow(df_complete)
> df_complete
#> # A tibble: 4 x 3
#> depdate bookings day_of_year
#> <date> <dbl> <int>
#> 1 2017-11-02 43 1
#> 2 2017-11-03 27 2
#> 3 2017-11-04 NA 3
#> 4 2017-11-05 27 4
An equivalent solution with the pipe operator from dplyr:
df %>%
complete(depdate = seq.Date(from = min(df$depdate), to = max(df$depdate), by = 1)) %>%
mutate(days_of_year = row_number())

Resources