Subtracting values from a value equal to specific date - r

I have a dataset that looks similar to:
Date        Total_Nonfarm  Leisure_and_Hospitality
2020-01-01  300            50
2020-02-01  200            40
2020-03-01  100            15
2020-04-01  75             15
2020-05-01  150            10
I need to calculate the monthly losses of both Nonfarm and Leisure and Hospitality jobs (two new columns) based on the total amount lost since 2020-02-01. So this would not affect rows before 2020-02-01, only after.
I've tried to find a way to extract the job values from the row where the date equals 2020-02-01, but could not figure it out. I thought that if I could store this value (say, Feb2020 <- *the correct function to find that value*), then I could use dplyr to add a new column with mutate(Total_jobs_lost = Feb2020 - Total_Nonfarm). But I think this would then affect rows prior to 2020-02-01.
I would like the output to look as such:
Date        Total_Nonfarm  Leisure_and_Hospitality  Total_Nonfarm_Losses  LH_Losses
2020-01-01  300            50                       NA                    NA
2020-02-01  200            40                       0                     0
2020-03-01  100            15                       -100                  -25
2020-04-01  75             15                       -125                  -25
2020-05-01  150            10                       -50                   -30
Any help would be appreciated. Thanks.

library(tidyverse)
library(lubridate)

df <- read_csv2(file = "data.csv")
# dmy() assumes day-month-year strings in the file; use ymd() for ISO dates
df$Date <- dmy(df$Date)
df$Total_Nonfarm_Losses <- NA_real_
df$Leisure_and_Hospitality_Losses <- NA_real_

timepoint <- ymd("2020-02-01")
before <- df %>% filter(Date < timepoint)
after <- df %>% filter(Date >= timepoint)

# Baseline values on the reference date, extracted as plain numbers
feb_nonfarm <- df$Total_Nonfarm[df$Date == timepoint]
feb_leisure <- df$Leisure_and_Hospitality[df$Date == timepoint]

# Current value minus the February baseline, so losses come out negative
after <- after %>%
  mutate(
    Total_Nonfarm_Losses = Total_Nonfarm - feb_nonfarm,
    Leisure_and_Hospitality_Losses = Leisure_and_Hospitality - feb_leisure
  )

# Rows before the reference date keep NA in the new columns
bind_rows(before, after)
And the output:
# A tibble: 5 x 5
  Date       Total_Nonfarm Leisure_and_Hospi~ Total_Nonfarm_Losses~ Leisure_and_Hospitality_L~
  <date>             <dbl>              <dbl>                 <dbl>                      <dbl>
1 2020-01-01           300                 50                    NA                         NA
2 2020-02-01           200                 40                     0                          0
3 2020-03-01           100                 15                  -100                        -25
4 2020-04-01            75                 15                  -125                        -25
5 2020-05-01           150                 10                   -50                        -30

library(tidyverse)

data <- read.delim("clipboard")  # reads the table copied to the clipboard
# Use real NA values (not the string "NA") so the new columns stay numeric
data$Total_Nonfarm_Losses <- NA_real_
data$LH_Losses <- NA_real_

Feb2020 <- filter(data, Date == "2020-02-01")$Total_Nonfarm
Feb2020b <- filter(data, Date == "2020-02-01")$Leisure_and_Hospitality

# Update only the rows from 2020-02-01 onward (rows 2 to 5 here)
data[2:5, ] <- data[2:5, ] %>%
  mutate(Total_Nonfarm_Losses = Total_Nonfarm - Feb2020,
         LH_Losses = Leisure_and_Hospitality - Feb2020b)
data
Date Total_Nonfarm Leisure_and_Hospitality Total_Nonfarm_Losses LH_Losses
1 2020-01-01 300 50 NA NA
2 2020-02-01 200 40 0 0
3 2020-03-01 100 15 -100 -25
4 2020-04-01 75 15 -125 -25
5 2020-05-01 150 10 -50 -30
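If you would rather not hard-code row numbers or split the data frame, the same idea can be sketched in base R with ifelse(). This is only a sketch on the question's sample data; the names jobs, baseline_date, is_baseline and on_or_after are made up for illustration.

```r
# Sample data copied from the question
jobs <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01",
                   "2020-04-01", "2020-05-01")),
  Total_Nonfarm = c(300, 200, 100, 75, 150),
  Leisure_and_Hospitality = c(50, 40, 15, 15, 10)
)

baseline_date <- as.Date("2020-02-01")   # reference month
is_baseline <- jobs$Date == baseline_date
on_or_after <- jobs$Date >= baseline_date

# Difference from the baseline row; NA for rows before the baseline date
jobs$Total_Nonfarm_Losses <- ifelse(
  on_or_after,
  jobs$Total_Nonfarm - jobs$Total_Nonfarm[is_baseline],
  NA
)
jobs$LH_Losses <- ifelse(
  on_or_after,
  jobs$Leisure_and_Hospitality - jobs$Leisure_and_Hospitality[is_baseline],
  NA
)
jobs
```

Because the condition is evaluated per row, rows dated before the baseline are left as NA regardless of where they sit in the table.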

Related

Is there a way to have a rolling average calculation within a for loop in R?

I have a question regarding for loops and rolling average where I would want the current day plus the 6 previous days estimates. Currently, I have a for loop that calculates the daily number of new people. But what I want is to have a rolling average as I discussed previously. Any help would be appreciated. thanks!
The data looks like this:
dataframe = d
date place total
2020-01-10 A 10
2020-01-11 A 6
2020-01-12 A 8
2020-01-13 A 5
2020-01-14 A 7
2020-01-15 A 6
2020-01-16 A 9
2020-01-17 A 10
2020-01-10 B 11
2020-01-20 B 61
2020-01-21 B 82
2020-01-22 B 53
2020-01-23 B 74
2020-01-24 B 65
2020-01-25 B 96
2020-01-27 B 100
The for loop I wrote to calculate the number of new people per day is:
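for (x in unique(d$place)) {
  region <- d[d$place == x, ]
  n <- nrow(region)
  region$newpeople <- NA
  # seq_len(n - 1) avoids the 1:n-1 pitfall, which evaluates to 0:(n-1)
  for (i in seq_len(n - 1)) {
    region$newpeople[i] <- region$total[i] - region$total[i + 1]
  }
  region$newpeople[n] <- region$total[n]
}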
I then append the estimates to the associated daily date. I would want something similar to the rolling average from the past 7 days.
date_range <- seq(region$date[1], region$date[n], by = "days")
y <- paste(region$date, collapse = "|")
missing_dates <- date_range[!grepl(y, date_range)]
if (length(missing_dates) != 0) {
  date <- missing_dates
  place <- paste0(region$place[1])
  total <- NA
  newpeople <- NA
  df <- data.frame(date, place, total, newpeople)
  region <- rbind(region, df) %>%
    arrange(date)
}
Any help would be appreciated!
I'm not sure if you're totally set on using for loops.
Data
d <- read.table(text = "
date place total
2020-01-10 A 10
2020-01-11 A 6
2020-01-12 A 8
2020-01-13 A 5
2020-01-14 A 7
2020-01-15 A 6
2020-01-16 A 9
2020-01-17 A 10
2020-01-10 B 11
2020-01-20 B 61
2020-01-21 B 82
2020-01-22 B 53
2020-01-23 B 74
2020-01-24 B 65
2020-01-25 B 96
2020-01-27 B 100
",
header = TRUE)
Attempts
This post and website are pretty helpful. So using the mean_run() function from the runner package, we get
# install.packages("runner")
library(dplyr)

d %>%
  group_by(place) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # Difference between days
    diff = total - lag(total),
    # Rolling average of past seven days
    rolling_7 = runner::mean_run(
      x = total,
      k = 7,
      idx = as.Date(date)
    )
  )
I'm not sure if this is what you're looking for though. For example, when looking at the rolling average for 2020-01-27, the code recognizes that you didn't have data for 2020-01-26, so it skips it. Therefore, the rolling average for 2020-01-27 is 78.3 = (82 + 53 + 74 + 65 + 96 + 100) / 6.
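To make the calendar-window behaviour concrete, here is a base R sketch of the same idea. It is not the runner implementation, just an illustration of it. It uses the place B rows from 2020-01-20 onward and averages every total whose date falls within the 7 calendar days ending on the current date; the names d_b and in_window are made up for this example.

```r
# Place B rows from the question (2020-01-26 is missing on purpose)
d_b <- data.frame(
  date = as.Date(c("2020-01-20", "2020-01-21", "2020-01-22", "2020-01-23",
                   "2020-01-24", "2020-01-25", "2020-01-27")),
  total = c(61, 82, 53, 74, 65, 96, 100)
)

# Mean of all totals dated within the 7 calendar days ending on each date
rolling_7 <- sapply(seq_along(d_b$date), function(i) {
  day <- d_b$date[i]
  in_window <- d_b$date > day - 7 & d_b$date <= day
  mean(d_b$total[in_window])
})

round(rolling_7[7], 1)
# 78.3, the mean of the six observations from 2020-01-21 to 2020-01-27
```

If you want a missing day to count as zero instead of being skipped, you would need to fill in the missing dates before averaging.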

How to calculate a time period until a condition is matched

I need to calculate the total time spanned by consecutive dates, up to the point where the difference between two consecutive dates is greater than 13 seconds.
For example, in the data frame created with the code shown below, the column test holds the time difference between each date and the next. What I need are the events, that is, the stretches of time between rows with test > 13 seconds.
# Create a vector of dates with a random time difference in seconds between records
dates <- seq(as.POSIXct("2020-01-01 00:00:02"), as.POSIXct("2020-01-02 00:00:02"), by = "2 sec")
dates <- dates + sample(15, length(dates), replace = T)
# Create a data.frame
data <- data.frame(id = 1:length(dates), dates = dates)
# Create a test field with the time difference between each date and the next
data$test <- c(diff(data$dates, lag = 1), 0)
# Delete the zero and negative time
data <- data[data$test > 0, ]
head(data)
What I want is something like this:
To get to your desired result we need to define 'blocks' of observations. The blocks are split wherever test is greater than 13.
We start by identifying the split_point, and then use the rle function to assign an ID to each block.
Then we can filter out the split points and summarize the remaining blocks: once with the sum of seconds, then with the min of the event dates.
split_point <- data$test <=13
# Find continuous blocks
block_str <- rle(split_point)
# Create block IDs
data$block <- rep(seq_along(block_str$lengths), block_str$lengths)
data <- data[split_point, ] # Remove split points
# Summarize
final_df <- aggregate(test ~ block, data = data, FUN = sum)
dtevent <- aggregate(dates ~ block, data= data, FUN=min)
# Join the two summaries
final_df$DatetimeEvent <- dtevent$dates
head(final_df)
#> block test DatetimeEvent
#> 1 1 101 2020-01-01 00:00:09
#> 2 3 105 2020-01-01 00:01:11
#> 3 5 277 2020-01-01 00:02:26
#> 4 7 46 2020-01-01 00:04:58
#> 5 9 27 2020-01-01 00:05:30
#> 6 11 194 2020-01-01 00:05:44
Created on 2020-04-02 by the reprex package (v0.3.0)
Using dplyr for convenience's sake:
library(dplyr)
final_df <- data %>%
  mutate(split_point = test <= 13,
         block = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
  group_by(block) %>%
  filter(split_point) %>%
  summarise(DateTimeEvent = min(dates), TotalTime = sum(test))
final_df
#> # A tibble: 1,110 x 3
#> block DateTimeEvent TotalTime
#> <int> <dttm> <drtn>
#> 1 1 2020-01-01 00:00:06 260 secs
#> 2 3 2020-01-01 00:02:28 170 secs
#> 3 5 2020-01-01 00:04:11 528 secs
#> 4 7 2020-01-01 00:09:07 89 secs
#> 5 9 2020-01-01 00:10:07 37 secs
#> 6 11 2020-01-01 00:10:39 135 secs
#> 7 13 2020-01-01 00:11:56 50 secs
#> 8 15 2020-01-01 00:12:32 124 secs
#> 9 17 2020-01-01 00:13:52 98 secs
#> 10 19 2020-01-01 00:14:47 83 secs
#> # … with 1,100 more rows
Created on 2020-04-02 by the reprex package (v0.3.0)
(results are different because reprex recreates the data each time)
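Since the data in the question is random, here is a small deterministic sketch of the rle() block-ID trick that is easier to follow by hand. The gaps vector is made up for illustration.

```r
# Hypothetical gaps (in seconds) between consecutive records
gaps <- c(2, 5, 20, 3, 4, 15, 6)

keep <- gaps <= 13                      # TRUE for rows that stay in a block
runs <- rle(keep)                       # runs of TRUE/FALSE values
block <- rep(seq_along(runs$lengths), runs$lengths)
block
# 1 1 2 3 3 4 5

# Drop the split rows (gap > 13) and total the remaining gaps per block
totals <- tapply(gaps[keep], block[keep], sum)
totals
#  1  3  5
#  7  7  6
```

Each run of FALSE also consumes a block ID, which is why only odd block numbers survive in the outputs above once the split rows are removed.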

sliding window with tidyr nest

I am using tidyr::nest to deliver a grouped_by table to function boot and boot.ci from boot package in order to calculate mean and confidence interval for a non-parametric statistic. This works fine for non-overlapping groups like below:
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
library(broom)
library(boot)
# toy example
set.seed(1)
Sys.setenv(TZ="America/Chicago")
df <- data.frame(date = mdy("01-01-2018")+ddays(sample(0:364,100,replace = T)),
                 score = sample(0:10,100,replace = T,prob=c(0.15,0.15,rep(0.15/7,7),0.25,0.3)))
# the statistic of interest
net_promoter_score <- function(data,col_name='score') {
  return(
    (sum(data[[col_name]]>=9,na.rm=TRUE)-
       sum(data[[col_name]]<=6,na.rm=TRUE))/sum(!is.na(data[[col_name]]))*100
  )
}
# boot needs to resample the statistic by index
nps_boot <- function(d,i) net_promoter_score(d[i,])
# do NPS confidence intervals by month - this works fine!
by_month = df %>%
  mutate(month = lubridate::month(date,label=T,abbr=T)) %>%
  nest(-month) %>%
  mutate(boots = map(data, ~boot::boot(.x,nps_boot,R=4999)),
         CI = map(boots, ~boot::boot.ci(.x,conf=0.9)$bca),
         tidied_NPS = map(boots,broom::tidy),
         tidied_CI = map(CI,broom::tidy)
  ) %>%
  unnest(tidied_NPS,tidied_CI,.drop=T) %>%
  select(month,mean=statistic,CI10=V4,CI90=V5)
by_month %>% head
A tibble: 6 x 4
month mean CI10 CI90
<ord> <dbl> <dbl> <dbl>
1 Apr 0 -100 33.3
2 May 6.67 -46.7 33.3
3 Jul 60 -100 60
4 Nov -20 -80 20
5 Mar -11.1 -66.7 33.3
6 Dec 0 -100 50
But I would like to do this for a sliding window - kind of like a moving average except I would like to use a different statistic to slide over. I can do this with lapply but I would like to use tidyverse.
# do 50-sample sliding window. I would like to solve this with tidyverse
window_size = 50
results = lapply(1:(nrow(df)-window_size), function(x) {
  boot_df = df %>% arrange(date) %>% slice(x:(x+window_size-1))
  boot = boot::boot(boot_df,nps_boot,R=999)
  CI = boot.ci(boot,conf=0.9)$bca[4:5]
  return(c(x,mean(boot$t),CI))
})
by_slide = as.data.frame(do.call(rbind, results)) %>%
  select(date=V1,mean=V2,CI10=V3,CI90=V4) %>%
  mutate(date = mdy("01-01-2018")+ddays((window_size %/% 2)+date))
by_slide %>% head
date mean CI10 CI90
1 2018-01-27 15.40541 -8.00000 38
2 2018-01-28 15.94194 -8.00000 36
3 2018-01-29 15.83383 -8.00000 36
4 2018-01-30 15.24525 -8.00000 38
5 2018-01-31 15.79780 -10.00000 36
6 2018-02-01 15.82583 -10.92218 36
You can use purrr::map_dfr():
results <- purrr::map_dfr(1:(nrow(df)-window_size), function(x) {
  boot_df = df %>% arrange(date) %>% slice(x:(x+window_size-1))
  boot = boot::boot(boot_df,nps_boot,R=999)
  CI = boot.ci(boot,conf=0.9)$bca[4:5]
  list(date = boot_df$date[1],
       mean = mean(boot$t),
       ci_lo = CI[1],
       ci_hi = CI[2])
})
results
# A tibble: 50 x 4
date mean ci_lo ci_hi
<date> <dbl> <dbl> <dbl>
1 2018-01-05 15.6 -8 38
2 2018-01-09 16.3 -8 36
3 2018-01-22 16.2 -10 36
4 2018-01-23 15.6 -10 36
5 2018-01-26 15.2 -10 36
6 2018-01-31 16.5 -10 36
7 2018-02-06 19.7 -4.75 40
8 2018-02-09 19.5 -8 40
9 2018-02-14 16.3 -10 36
10 2018-02-15 16.1 -10 36
# … with 40 more rows
Then you can use results directly in computing by_slide:
by_slide = results %>%
  mutate(date = mdy("01-01-2018") + ddays(window_size %/% 2))
Admittedly, I don't understand how adding date inside the ddays() duration is meant to work, and the line above does not reproduce your provided output. I'm assuming that's a syntax issue, separate from your question about how to replace lapply.
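For reference, the sliding-window part on its own (without the bootstrap) can be sketched in base R as follows. The scores vector and the window size of 4 are made up for illustration, and net_promoter_score here takes a plain vector rather than a data frame. Note that this sketch includes the final window (length - window_size + 1 windows), whereas the code in the question stops one window earlier.

```r
# Net promoter score of a plain score vector:
# percentage of promoters (>= 9) minus percentage of detractors (<= 6)
net_promoter_score <- function(scores) {
  (sum(scores >= 9, na.rm = TRUE) - sum(scores <= 6, na.rm = TRUE)) /
    sum(!is.na(scores)) * 100
}

scores <- c(10, 9, 3, 8, 10, 2, 9, 7)   # hypothetical scores, already sorted by date
window_size <- 4

# One window per possible start position, including the last full window
starts <- seq_len(length(scores) - window_size + 1)
nps_by_window <- vapply(starts, function(s) {
  net_promoter_score(scores[s:(s + window_size - 1)])
}, numeric(1))

nps_by_window
# 25 25 -25 25 25
```

Swapping vapply() for purrr::map_dbl() would give the tidyverse flavour of the same loop.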

Aggregating time on hourly basis and counting it

I have following dataframe in R.
Date Car_NO
2016-12-24 19:35:00 ABC
2016-12-24 19:55:00 DEF
2016-12-24 20:15:00 RTY
2016-12-24 20:35:00 WER
2016-12-24 21:34:00 DER
2016-12-24 00:23:00 ABC
2016-12-24 00:22:00 ERT
2016-12-24 11:45:00 RTY
2016-12-24 13:09:00 RTY
The class of the Date column is "POSIXct" "POSIXt".
I want to count the hourly movement of car traffic, in slots like 12-1, 1-2, 2-3, 3-4 and so on.
Currently my approach is the following:
df$time <- ymd_hms(df$Date)
df$hours <- hour(df$time)
df$minutes <- minute(df$time)
df$time <- as.numeric(paste(df$hours,df$minutes,sep="."))
After this I would apply a chain of ifelse calls to divide it into hourly time slots, but I think that would be a long and tedious way to do it. Is there an easier approach in R?
My desired dataframe would be
Time_Slots Car_Traffic_count
00-01 2
01-02 0
02-03 0
.
.
.
19-20 2
20-21 2
21-22 1
.
.
.
Simplest would be to just use the starting hour to indicate a time interval:
# sample data
df = data.frame(time = Sys.time()+seq(1,10)*10000, runif(10) )
# summarize
library(dplyr)
library(tidyr)  # for complete()
# levels 0 to 23 so that every hour of the day gets exactly one row
df$hour = factor(as.numeric(format(df$time,"%H")), levels = seq(0,23))
df = df %>%
  group_by(hour) %>%
  summarize(count=n()) %>%
  complete(hour, fill = list(count = 0))
Output:
# A tibble: 24 x 2
hour count
<fctr> <dbl>
1 0 0
2 1 1
3 2 0
4 3 0
5 4 1
6 5 0
7 6 1
8 7 0
9 8 0
10 9 1
# ... with 14 more rows
You can optionally add:
df$formatted = paste0(as.character(df$hour),"-",as.numeric(as.character(df$hour))+1)
at the end to get your desired format. Hope this helps!
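As a cross-check on the data from the question, the same counting can be sketched in base R with table() on a factor that has all 24 hours as levels. The timestamps are copied from the question; the tz = "UTC" setting and the names times, hour_of_day and traffic are assumptions made for this example.

```r
# Timestamps copied from the question (time zone assumed to be UTC)
times <- as.POSIXct(c(
  "2016-12-24 19:35:00", "2016-12-24 19:55:00", "2016-12-24 20:15:00",
  "2016-12-24 20:35:00", "2016-12-24 21:34:00", "2016-12-24 00:23:00",
  "2016-12-24 00:22:00", "2016-12-24 11:45:00", "2016-12-24 13:09:00"
), tz = "UTC")

# Hour of day as an integer, then count with all 24 hours as factor levels
hour_of_day <- as.integer(format(times, "%H"))
counts <- table(factor(hour_of_day, levels = 0:23))

traffic <- data.frame(
  Time_Slots = sprintf("%02d-%02d", 0:23, 1:24),   # "00-01", ..., "23-24"
  Car_Traffic_count = as.vector(counts)
)
traffic[c(1, 20, 21, 22), ]
```

Hours with no cars get a count of 0 because the factor levels are fixed in advance.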

Difftime for workdays according to holidayNYSE in R

I'm trying to compute difftime for working days only, according to the holidayNYSE calendar. When I use the difftime function, weekends and holidays are included in the answers. My dataset contains only data from working days, but when using difftime I have to subtract the non-working days somehow.
A is a vector of 0 and 1, and I want to find the duration in days of each run of 0s or 1s. The duration for run one is supposed to be 35, but I get 49 (working days from January 1990).
df <- data.frame(Date=(dates), A)
setDT(df)
df <- data.frame(Date=(dates), A)
DF1 <- df[, list(A = unique(A), duration = difftime(max(Date),min(Date), holidayNYSE
(year=setRmetricsOptions(start="1990-01-01", end="2015-31-12")))), by = run]
DF1
run A duration
1: 1 1 49 days
2: 2 0 22 days
3: 3 1 35 days
4: 4 0 27 days
5: 5 1 14 days
---
291: 291 1 6 days
292: 292 0 34 days
293: 293 1 10 days
294: 294 0 15 days
295: 295 1 29 days
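An answer to my question without the use of difftime:
library(data.table)

df <- data.frame(Date=(dates), Value1=bull01)
setDT(df)
# A new run starts whenever Value1 changes
df[, run := cumsum(c(1, diff(Value1) !=0))]
# The data holds working days only, so the row count per run is its duration
duration <- numeric(295)
for (i in 1:295){
  ind <- which(df$run==i)
  a <- df$Date[ind]
  duration[i] <- length(a)
}
# Runs alternate between 1 and 0, starting with 1
type <- rep(c(1,0), length.out = 295)
df2 <- data.frame(duration, type)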
> df2
run duration type
1 35 1
2 17 0
3 25 1
4 20 0
5 10 1
---
291 5 1
292 25 0
293 9 1
294 11 0
295 21 1
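If you do need the number of working days between two arbitrary dates, rather than a row count, a base R sketch could look like this. The function name count_workdays is made up, and the holiday vector below contains only New Year's Day 1990 as an illustration; in practice you would pass in the full holiday calendar (for example the dates returned by holidayNYSE).

```r
# Count weekdays between start and end (inclusive), excluding given holidays
count_workdays <- function(start, end, holidays = as.Date(character())) {
  days <- seq(start, end, by = "day")
  wday <- as.POSIXlt(days)$wday          # 0 = Sunday, ..., 6 = Saturday
  sum(!(wday %in% c(0, 6)) & !(days %in% holidays))
}

# Illustrative holiday list: New Year's Day 1990 only
holidays_1990 <- as.Date("1990-01-01")

count_workdays(as.Date("1990-01-01"), as.Date("1990-01-31"), holidays_1990)
# 22 (23 weekdays in January 1990 minus the holiday)
```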

Resources