I have the following example dataset:
df <- data.frame("id" = c(1,2,3,3,4,4),
"start" = c("01-01-2018","01-06-2018","01-05-2018","01-05-2018","01-05-2018","01-10-2018"),
"end" = c("01-03-2018","01-07-2018","01-09-2018","01-06-2018","01-06-2018","01-11-2018"))
df$start <- as.Date(df$start, "%d-%m-%Y")
df$end <- as.Date(df$end, "%d-%m-%Y")
What I want is, for each group, the union of all its date intervals, i.e.
01-01-2018 - 01-03-2018 for group 1
01-06-2018 - 01-07-2018 for group 2
01-05-2018 - 01-09-2018 for group 3
01-05-2018 - 01-06-2018 and 01-10-2018 - 01-11-2018 for group 4
The purpose of this is to have an interval as an output, because I need it to determine whether certain observation dates for the group fall in the intervals or not.
Convert 'start' and 'end' to Date class, group by 'id', and in summarise create an interval column from the min of 'start' and the max of 'end'. Note that this gives a single interval per id, so it also spans any gap between disjoint intervals within a group.
library(dplyr)
library(stringr)
library(lubridate)
df %>%
mutate(across(c(start, end), dmy)) %>%  # day-month-year, matching "%d-%m-%Y" in the question
group_by(id) %>%
summarise(interval = interval(min(start), max(end)), .groups = 'drop')
data
df <- structure(list(id = c(1, 2, 3, 3, 4), start = c("01-01-2018",
"01-06-2018", "01-05-2018", "01-05-2018", "01-10-2018"), end = c("01-03-2018",
"01-07-2018", "01-09-2018", "01-06-2018", "01-11-2018")),
class = "data.frame", row.names = c(NA,
-5L))
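Since min/max gives one interval per id, disjoint intervals within a group (such as group 4 in the question) get bridged into one. Here is a minimal sketch of a per-group union instead, assuming the six-row data implied by the question's desired output (id 4 appearing twice) and day-month-year dates. The names `prev_max_end`, `block` and `merged` are only illustrative.

```r
library(dplyr)

# Assumed six-row version of the question's data (id 4 appears twice)
df <- data.frame(
  id    = c(1, 2, 3, 3, 4, 4),
  start = as.Date(c("01-01-2018", "01-06-2018", "01-05-2018",
                    "01-05-2018", "01-05-2018", "01-10-2018"), "%d-%m-%Y"),
  end   = as.Date(c("01-03-2018", "01-07-2018", "01-09-2018",
                    "01-06-2018", "01-06-2018", "01-11-2018"), "%d-%m-%Y")
)

merged <- df %>%
  arrange(id, start) %>%
  group_by(id) %>%
  mutate(
    # latest end date seen so far among the earlier rows of this id
    prev_max_end = lag(cummax(as.numeric(end)), default = -Inf),
    # a new block starts whenever an interval begins after everything before it ended
    block = cumsum(as.numeric(start) > prev_max_end)
  ) %>%
  group_by(id, block) %>%
  summarise(start = min(start), end = max(end), .groups = "drop")

merged
```

An observation date `d` for group `g` then falls in the union if any row of `merged` with `id == g` satisfies `start <= d & d <= end`.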
Related
I have data in 6-month intervals (ID, 6-month-start-date, outcome value), but for some IDs, there are half years where the outcome is missing. Simplified example:
id = c("aa", "aa", "ab", "ab", "ab")
date = as.Date(c("2021-07-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"))
col3 = c(1,2,1,2,1)
df <- data.frame(id, date, col3)
For similar datasets where the date is monthly, I used complete(date = seq.Date(start_date, end_date, by = "month")) to fill in the missing months and set the outcome in the third column to 0.
I could do the following and expand the data to monthly, then create a new 6-month-start-date column, group by it and ID, and sum col3.
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="month")) %>%
mutate (col3 = replace_na(col3, 0))
df_complete_6mth <- df_complete %>% mutate(
halfyear = ifelse(as.integer(format(date, '%m')) <= 6,
paste0(format(date, '%Y'), '-01-01'),
paste0(format(date, '%Y'), '-07-01'))) %>%
group_by(id, halfyear) %>%
summarise(col3_halfyear = sum(col3))
However, is there a solution where the "by =" argument specifies 6 months? I tried
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="months(6)")) %>%
mutate (col3 = replace_na(col3, 0))
but it didn't work.
From the help for seq.Date:
by can be specified in several ways.
A number, taken to be in days.
A object of class difftime
A character string, containing one of "day", "week", "month",
"quarter" or "year". This can optionally be preceded by a (positive or
negative) integer and a space, or followed by "s".
So I expect you want:
library(dplyr); library(tidyr)
df %>%
group_by(id) %>%
complete(date = seq.Date(min(date), max(date), by="6 month"),
fill = list(col3 = 0))
Could you do something like this? Make a sequence of dates by month and then take every sixth one, starting from the first.
library(lubridate)
dates <- seq(mdy("01-01-2020"), mdy("01-01-2023"), by="month")
dates[seq(1, length(dates), by=6)]
#> [1] "2020-01-01" "2020-07-01" "2021-01-01" "2021-07-01" "2022-01-01"
#> [6] "2022-07-01" "2023-01-01"
Created on 2023-02-08 by the reprex package (v2.0.1)
I have the following data frame:
library(janitor)
library(lubridate)
library(tidyverse)
data <- data.frame(date = c("1/28/2022", "1/25/2022", "1/27/2022", "1/23/2022"),
y = c(100, 25, 35, 45))
I need to write a function that adds a new column that sorts the date column and assigns sequential day stage (i.e., Day 1, Day 2, etc.). So far I have tried the following with no luck.
day.assign <- function(df){
df2 <- clean_names(df)
len <- length(unique(df2$date))
levels.start <- as.character(sort(mdy(unique(df2$date))))
day.label <- paste("Day", seq(1, len, by = 1))
df <-
df %>%
mutate(Date = as.character(mdy(Date)),
Day = as.factor(Date,
levels = levels.start,
labels = day.label))
}
Future files will have a varying number of dates, which must be accounted for when assigning the day column (i.e., one file may have 4 dates while the next may have 6).
You could do:
library(lubridate)
library(dplyr)
data <- data.frame(date = c("1/28/2022", "1/25/2022", "1/27/2022", "1/23/2022"),
y = c(100, 25, 35, 45))
day.assign <- function(df) {
df %>%
mutate(Date = mdy(date)) %>%
arrange(mdy(date)) %>%
mutate(Day = paste0("Day ", row_number()))
}
day.assign(data)
#> date y Date Day
#> 1 1/23/2022 45 2022-01-23 Day 1
#> 2 1/25/2022 25 2022-01-25 Day 2
#> 3 1/27/2022 35 2022-01-27 Day 3
#> 4 1/28/2022 100 2022-01-28 Day 4
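The question's attempt uses unique(), which suggests a date may appear on more than one row. With row_number() each row gets its own label, so repeated dates would end up on different days. A small sketch of a variant using dense_rank(), so that equal dates share a label; the function name `day_assign_dense` and the `dup` data are made up for illustration:

```r
library(dplyr)
library(lubridate)

# Hypothetical variant: rows with the same date get the same "Day n" label
day_assign_dense <- function(df) {
  df %>%
    mutate(Date = mdy(date)) %>%
    arrange(Date) %>%
    mutate(Day = paste("Day", dense_rank(Date)))
}

# Made-up input in which 1/25/2022 occurs twice
dup <- data.frame(date = c("1/28/2022", "1/25/2022", "1/25/2022", "1/23/2022"),
                  y = c(100, 25, 35, 45))
res <- day_assign_dense(dup)
res
```

If every date is unique, this should give the same labels as the row_number() version.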
Hi, I have time series data with daily dates (variable 1), and for each date a time variable that runs from 1 to 60. Each day has some number X of events. Is there a way to create a new dataset in which my value is summed over 2-day aggregates, so that I have 30 rows (time values) instead of 60?
Update: Here is a reproducible example of what I want
set.seed(101)
df <- data.frame(
dte = c(as.Date("2021-01-01"),
as.Date("2021-01-02") ,
as.Date("2021-01-03"),
as.Date("2021-01-04") ,
as.Date("2021-02-01") ,
as.Date("2021-02-02") ,
as.Date("2021-02-03") ,
as.Date("2021-02-04")
),
tme = rep(c(1, 2, 3, 4)),
val1 = sample(1:8),
work_type = c("Construction Worker", "Construction Worker", "Construction Worker",
"Construction Worker", "Sales", "Sales", "Sales", "Sales"),
Work_Site = "A"
)
print(df)
df_2day <- data.frame(
tme = rep(c(1, 2)),
val1 = c(9,13,5,9),
work_type = c("Construction Worker", "Construction Worker",
"Sales", "Sales"),
Work_Site = "A"
)
print(df_2day)
I also have facilities B, C, and D.
You can create groups of 2 days and sum the val1 values.
library(lubridate)
library(dplyr)
df %>%
group_by(Work_Site, work_type, grp = ceiling_date(dte, '2 days')) %>%
summarise(val1 = sum(val1))
# Work_Site work_type grp val1
# <chr> <chr> <date> <int>
#1 A Construction Worker 2021-01-03 9
#2 A Construction Worker 2021-01-05 15
#3 A Sales 2021-02-03 5
#4 A Sales 2021-02-05 7
You can identify the groupings by dividing the row number within each day by two and rounding up to the nearest whole number. So the 3rd reading would be 3/2 = 1.5, rounded up to group 2. The 10th would be 10/2 = group 5.
Below is an implementation using dplyr, but you could use something else...
library(dplyr)
df <- data.frame(
dte = c(as.Date("2021-01-01"),
as.Date("2021-01-01") ,
as.Date("2021-01-01"),
as.Date("2021-01-01") ,
as.Date("2021-02-01") ,
as.Date("2021-02-01") ,
as.Date("2021-02-01") ,
as.Date("2021-02-01")
),
tme = rep(c(1, 2, 3, 4)),
val1 = sample(1:8),
val2 = sample(1:8)
)
print(df)
result <- df %>%
group_by(dte) %>%
mutate(dategroup=ceiling(rank(tme) / 2)) %>%
group_by(dte, dategroup) %>%
summarise_all(sum)
print(result)
I want to replace Jan 01 to Jun 25 of all the years in FakeData with data from Ob2020 for the two variables (Level & Flow) of my data.frame. Here is what I have so far; I am looking for suggestions on how to achieve my goal.
library(tidyverse)
library(lubridate)
set.seed(1500)
FakeData <- data.frame(Date = seq(as.Date("2010-01-01"), to = as.Date("2018-12-31"), by = "days"),
Level = runif(3287, 0, 30), Flow = runif(3287, 1,10))
Ob2020 <- data.frame(Date = seq(as.Date("2020-01-01"), to = as.Date("2020-06-25"), by = "days"),
Level = runif(177, 0, 30), Flow = runif(177, 1,10))
Here's a way using dplyr and lubridate :
library(dplyr)
library(lubridate)
FakeData %>%
mutate(day = day(Date), month = month(Date)) %>%
left_join(Ob2020 %>%
mutate(day = day(Date), month = month(Date)),
by = c('day', 'month')) %>%
mutate(Level = coalesce(Level.y, Level.x),
Flow = coalesce(Flow.y, Flow.x)) %>%
select(Date = Date.x, Level, Flow)
If you don't mind a data.table solution, here is an update join:
library(data.table)
#extract day of month and month of the date
setDT(FakeData)[, c("day", "mth") := .(mday(Date), month(Date))]
setDT(Ob2020)[, c("day", "mth") := .(mday(Date), month(Date))]
#print to console to show old values
head(FakeData)
head(Ob2020)
cols <- c("Level", "Flow")
#Ob2020 only covers Jan 01 to Jun 25, so no filter is needed (mth<=6L & day<=25 would drop days 26-31 of Jan-May)
FakeData[Ob2020, on=.(day, mth),
(cols) := mget(paste0("i.", cols))]
#print to console to show new values
head(FakeData)
I have a dataframe of 96074 obs. of 31 variables.
The first two variables are the id and the date, then I have 9 columns of measurements (three different KPIs, each with three different time properties), followed by various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform from wide to long, in order to do graphs with ggplot using facets for example.
If I had a df with just one variable with its three-time scans I would have no problem in using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric (I should add that I don't have negative measures in these variables).
What I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
gathering all of the variables in a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, I first used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each couple id/date, then do some manipulation with separate/unite to have the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
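library(dplyr)
library(tidyr)
library(stringr)
df %>%
gather(key,value,-id,-time) %>%
# type: KPI prefix such as "sum_d" (note that "per_i_d" is extracted as "per_i")
# age: time suffix such as "1day_old"
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
select(-key) %>%
spread(type,value)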
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
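For what it's worth, the same reshape can probably be done in one step with pivot_longer() and the special ".value" name, assuming tidyr 1.0.0 or later. This is only a sketch on the question's simulated df: the regex assumes every measurement column ends in a suffix like `_1day_old` or `_15days_old`, and the injected NA is just there to show that missing values already in the data are kept.

```r
library(tidyr)

# Simulated data with the same layout as in the question
set.seed(1)
df <- data.frame(
  id = rep(1:3, 3),
  time = rep(as.Date("2009-01-01") + 0:2, each = 3),
  sum_d_1day_old = rnorm(9, 2, 1),
  sum_i_1day_old = rnorm(9, 2, 1),
  per_i_d_1day_old = rnorm(9, 0, 1),
  sum_d_5days_old = rnorm(9, 0, 1),
  sum_i_5days_old = rnorm(9, 0, 1),
  per_i_d_5days_old = rnorm(9, 0, 1),
  sum_d_15days_old = rnorm(9, 0, 1),
  sum_i_15days_old = rnorm(9, 0, 1),
  per_i_d_15days_old = rnorm(9, 0, 1)
)
df$per_i_d_1day_old[2] <- NA  # stand-in for an NA already present in the real data

long <- pivot_longer(
  df,
  cols = -c(id, time),
  # first capture group becomes the value column name (sum_d, sum_i, per_i_d),
  # second capture group goes into a new "age" column
  names_to = c(".value", "age"),
  names_pattern = "^(.*)_(\\d+days?_old)$"
)

long
```

This gives one row per id/time/age with sum_d, sum_i and per_i_d side by side, which should facet directly in ggplot on age. Because nothing is spread and filled afterwards, the NAs from the original data stay where they were.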