I am looking for an if statement in R that will help me determine whether Monday's close is higher than Friday's close; the point is to add the result as an extra column.
I have tried the following which gives a plausible result:
GSPC$GSPC.DIFF <- for(i in 1:length(GSPC$GSPC.Weekdaynr)) {if(GSPC$GSPC.Weekdaynr[i] > 4){GSPC.DIFF <- append(GSPC.DIFF,5)}else{GSPC.DIFF <- append(GSPC.DIFF, "")}}
But changing the first statement to append(GSPC.DIFF, GSPC$GSPC.Close):
GSPC$GSPC.DIFF <- for(i in 1:length(GSPC$GSPC.Weekdaynr)) {if(GSPC$GSPC.Weekdaynr[i] > 4){GSPC.DIFF <- append(GSPC.DIFF,GSPC$GSPC.Close)}else{GSPC.DIFF <- append(GSPC.DIFF, "")}}
gives a different result (screenshot not reproduced).
Does anyone know why it does not give a value on every fifth line, as in the first case?
Here is where the value is collected from (screenshot not reproduced).
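On the question of why the second version stops producing one value per row: append() adds every element of its second argument, so passing the whole GSPC$GSPC.Close column grows the result by the full column length on each Friday. The sketch below reproduces this on made-up stand-in vectors (weekday_nr and close are hypothetical, not the real GSPC data) and shows an indexed and a vectorised alternative.

```r
# Hypothetical stand-ins for GSPC$GSPC.Weekdaynr and GSPC$GSPC.Close
weekday_nr <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
close <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)

# Passing the whole vector: each Friday appends all 10 values at once
whole <- c()
for (i in seq_along(weekday_nr)) {
  if (weekday_nr[i] > 4) {
    whole <- append(whole, close)
  } else {
    whole <- append(whole, NA)
  }
}

# Indexing with [i]: each row appends exactly one value
one_each <- c()
for (i in seq_along(weekday_nr)) {
  if (weekday_nr[i] > 4) {
    one_each <- append(one_each, close[i])
  } else {
    one_each <- append(one_each, NA)
  }
}

# Vectorised equivalent without a loop
diff_col <- ifelse(weekday_nr > 4, close, NA)

length(whole)     # longer than the input
length(one_each)  # same length as the input
```

Note also that a for loop itself evaluates to NULL, so assigning the loop to GSPC$GSPC.DIFF does not store the accumulated vector; assigning the finished vector (or the ifelse() result) to the column afterwards is the safer pattern.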
Here's a potentially useful approach (not the most elegant way I guess):
library(dplyr)
library(tibble)
library(lubridate)
set.seed(123)
data <- tibble(
Date = seq.Date(from = as.Date("2020/04/01"), to = as.Date("2020/05/12"), by = "day"),
Open = runif(42, min = 11500, max = 12600),
Close = runif(42, min = 11500, max = 12600)
)
data <- data %>%
mutate(day = weekdays(Date),
week = week(Date)) %>%
filter(day == "Monday" | day == "Friday") %>%
group_by(week) %>%
mutate(Delta = ifelse(Close - lead(Open, 1) > 0, "Higher", "Lower")) %>%
select(Date, week, Delta) %>%
filter(!is.na(Delta))
With this output:
> data
# A tibble: 6 x 3
# Groups: week [6]
Date week Delta
<date> <dbl> <chr>
1 2020-04-03 14 Higher
2 2020-04-10 15 Lower
3 2020-04-17 16 Lower
4 2020-04-24 17 Lower
5 2020-05-01 18 Lower
6 2020-05-08 19 Lower
From this input:
> data
# A tibble: 42 x 3
Date Open Close
<date> <dbl> <dbl>
1 2020-04-01 11816. 11955.
2 2020-04-02 12367. 11906.
3 2020-04-03 11950. 11668.
4 2020-04-04 12471. 11653.
5 2020-04-05 12535. 11756.
6 2020-04-06 11550. 12013.
7 2020-04-07 12081. 11793.
8 2020-04-08 12482. 12444.
9 2020-04-09 12107. 11550.
10 2020-04-10 12002. 11986.
We can convert the xts series to dataframe using fortify.zoo(), create a weekday column and calculate the percent difference between each day with the previous day.
DJI %>%
fortify.zoo() %>%
mutate(day = weekdays(Index),
diff_per = (DJI.Close - lag(DJI.Close))/lag(DJI.Close) * 100)
# Index DJI.Open DJI.High DJI.Low DJI.Close DJI.Volume DJI.Adjusted day diff_per
#1 2007-01-03 12474.5 12474.5 12474.5 12474.5 0 12474.5 Wednesday NA
#2 2007-01-04 12480.7 12480.7 12480.7 12480.7 0 12480.7 Thursday 0.049702954
#3 2007-01-05 12398.0 12398.0 12398.0 12398.0 0 12398.0 Friday -0.662624642
#4 2007-01-08 12423.5 12423.5 12423.5 12423.5 0 12423.5 Monday 0.205678335
#5 2007-01-09 12416.6 12416.6 12416.6 12416.6 0 12416.6 Tuesday -0.055543051
#6 2007-01-10 12442.2 12442.2 12442.2 12442.2 0 12442.2 Wednesday 0.206180330
#....
If you are interested only in Monday and Friday values, you can add filter in the above command.
%>% filter(day %in% c('Monday', 'Friday'))
data
library(quantmod)
library(zoo)
library(dplyr)
getSymbols('DJI')
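For the original goal (flagging whether a Monday close is higher than the preceding Friday close), a minimal base R sketch could look like the following. It reuses the six DJI closes printed above; the data frame name prices and the column MonHigherThanFri are hypothetical, and it assumes the rows are sorted by date with one row per trading day.

```r
# Closing values copied from the DJI output shown above
prices <- data.frame(
  Date = as.Date(c("2007-01-03", "2007-01-04", "2007-01-05",
                   "2007-01-08", "2007-01-09", "2007-01-10")),
  Close = c(12474.5, 12480.7, 12398.0, 12423.5, 12416.6, 12442.2)
)

# ISO weekday number (1 = Monday, ..., 7 = Sunday); independent of locale
wday_nr <- as.integer(format(prices$Date, "%u"))

# Close and weekday of the previous row (previous trading day)
prev_close <- c(NA, head(prices$Close, -1))
prev_wday <- c(NA, head(wday_nr, -1))

# Only Mondays that directly follow a Friday row are compared
is_mon_after_fri <- wday_nr == 1 & !is.na(prev_wday) & prev_wday == 5
prices$MonHigherThanFri <- ifelse(is_mon_after_fri, prices$Close > prev_close, NA)

prices
```

Weeks where the Friday or the Monday is missing (for example because of a holiday) are left as NA here; whether to compare against the last available trading day instead is a design choice that depends on the analysis.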
This seems like it should be straightforward but I cannot find a way to do this.
I have a sales cycle that begins around August 1 each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. Thus far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the start date from 01/01/YYYY to something else.
The solution needs to let me specify the start date and iterate week numbers as 7 days from the start date. The actual start date doesn't always fall on a Sunday or Monday.
Example data frame:
eg_data <- data.frame(
cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
dates = as.POSIXct(c("2019-08-01" , "2019-08-10" ,"2018-07-31" , "2018-08-16", "2017-08-03" , "2017-08-14" , "2016-08-05", "2016-08-29")),
week_n = c("1", "2","1","3","1","2","1","4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
%>% group_by(cycle)
%>% mutate(nw=1+as.numeric(round((dates-min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
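The mismatch comes from rounding: 11 days is about 1.57 weeks, which round() takes up to 2 (week 3), while the expected numbering counts completed weeks. A small base R sketch of the difference, assuming the dates are stored as Date rather than POSIXct (the data frame name eg and the columns week_round and week_floor are hypothetical):

```r
# Same cycles and dates as eg_data, stored as Date to keep the arithmetic in days
eg <- data.frame(
  cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018",
            "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
  dates = as.Date(c("2019-08-01", "2019-08-10", "2018-07-31", "2018-08-16",
                    "2017-08-03", "2017-08-14", "2016-08-05", "2016-08-29"))
)

# Earliest date per cycle, repeated for every row of that cycle
cycle_start <- ave(as.numeric(eg$dates), eg$cycle, FUN = min)
days_since_start <- as.numeric(eg$dates) - cycle_start

eg$week_round <- 1 + round(days_since_start / 7)  # rounds 11 days up to week 3
eg$week_floor <- 1 + floor(days_since_start / 7)  # counts completed weeks only

eg
```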
If someone has a better answer please post it, but this works:
Take the dataframe in the example, eg_data -
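eg_data %>%
  group_by(cycle) %>%
  mutate(
    cycle_start = as.Date(min(dates)),
    days_diff = as.Date(dates) - cycle_start,
    week_n = days_diff / 7,
    # floor() + 1 so that the cycle start itself (days_diff = 0) lands in week 1;
    # ceiling() alone would put it in week 0
    week_n_whole = floor(as.numeric(days_diff) / 7) + 1 ) -> eg_data_check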
(First time I've answered my own question)
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(lubridate)
library(stringr)  # for str_sub()
eg_data %>%
mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
week_n2 = ceiling((dates - aug1)/ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
left_join(cycle_starts) %>%
mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate
library(lubridate)
eg_data %>%
group_by(cycle) %>%
mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4
I have data in the zoo format in the following structure
date val
2020-11-01 3244
2020-11-02 3273
2020-11-03 2974
2020-11-04 3283
2020-11-05 3922
2020-11-06 3669
2020-11-07 4246
2020-11-08 4594
2020-11-09 4086
2020-11-10 4302
2020-11-11 4559
2020-11-12 4981
2020-11-13 4741
2020-11-14 5267
that I am trying to get into this form
date val
Mon 2020-11-01 3244
Tue 2020-11-02 3273
Wed 2020-11-03 2974
Thu 2020-11-04 3283
Fri 2020-11-05 3922
Sat 2020-11-06 3669
Sun 2020-11-07 4246
Mon 2020-11-08 4594
Tue 2020-11-09 4086
Wed 2020-11-10 4302
Thu 2020-11-11 4559
Fri 2020-11-12 4981
Sat 2020-11-13 4741
Sun 2020-11-14 5267
In order to count the number of times I observe the smallest of the values per week.
Mon = 1
Tue = 1
Wed = 0
Thu = 0
Fri = 0
Sat = 0
Sun = 0
I tried leaving the data in the flat format before adding the dates with zoo, and added the weekdays, but failed to count with it. Does anyone know an easier way to do it? I am open to visual solutions.
If you store the data in a dataframe you can create a new column with weekdays and week number, for each week keep the row with minimum value and count number of weekdays that have the minimum value.
library(dplyr)
df %>%
mutate(date = as.Date(date),
weekday = factor(weekdays(date)),
week_year = format(date, '%Y-%W')) %>%
group_by(week_year) %>%
slice(which.min(val)) %>%
ungroup %>%
count(weekday, .drop = FALSE)
The following should do the trick:
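library(dplyr)
library(lubridate)
df$day <- weekdays(as.Date(df$date))
# Note: there is more than one way to define a week, so pick one deliberately.
# week() counts complete seven-day periods since January 1:
df$week <- week(df$date)
# isoweek() follows ISO 8601, where weeks start on Monday:
df$isoweek <- isoweek(df$date)
# Flag the rows holding the minimum value of each ISO week
df <- df %>% group_by(isoweek) %>% mutate(min_here = val == min(val))
# Count, per weekday, how often it held the weekly minimum
df %>% group_by(day) %>% summarise(sum(min_here))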
# A tibble: 7 x 2
day `sum(min_here)`
<chr> <int>
1 Friday 0
2 Monday 1
3 Saturday 0
4 Sunday 1
5 Thursday 0
6 Tuesday 1
7 Wednesday 0
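The Sunday count of 1 above depends on the week definition. Under ISO 8601, Sunday 2020-11-01 is the last day of its week and is the only day of that week in the sample, so it is trivially that week's minimum. A short base R sketch comparing three week definitions on that Sunday and the following Monday (the variable names are hypothetical):

```r
d <- as.Date(c("2020-11-01", "2020-11-02"))  # a Sunday and the following Monday

weekday_iso <- format(d, "%u")  # ISO weekday number: 1 = Monday, ..., 7 = Sunday
week_sun <- format(d, "%U")     # week of year, weeks starting on Sunday
week_mon <- format(d, "%W")     # week of year, weeks starting on Monday
week_iso <- format(d, "%V")     # ISO 8601 week number

data.frame(date = d, weekday_iso, week_sun, week_mon, week_iso)
```

With "%U" the two dates fall in the same week, while "%W" and "%V" place the Sunday in the earlier week, which is why answers that group by different week formats can report different counts for the same data.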
Base R
... though it seems a little clumsier in comparison to dplyr's mechanics in RonakShah's answer, or data.table below:
ismin <- ave(dat$val, list(format(dat$date, format = "%U")),
FUN = function(z) seq_along(z) == which.min(z))
aggregate(ismin, list(weekday = weekdays(dat$date)), FUN = sum)
# weekday x
# 1 Friday 0
# 2 Monday 1
# 3 Saturday 0
# 4 Sunday 0
# 5 Thursday 0
# 6 Tuesday 1
# 7 Wednesday 0
(The order is not emphasized here.)
data.table
library(data.table)
DT <- as.data.table(dat)
DT[, ismin := seq_len(.N) == which.min(val), by = format(date, format = "%U")
][, weekday := weekdays(date)][, .(n = sum(ismin)), by = .(weekday) ]
# weekday n
# <char> <int>
# 1: Sunday 0
# 2: Monday 1
# 3: Tuesday 1
# 4: Wednesday 0
# 5: Thursday 0
# 6: Friday 0
# 7: Saturday 0
Data
dat <- structure(list(date = structure(c(18567, 18568, 18569, 18570, 18571, 18572, 18573, 18574, 18575, 18576, 18577, 18578, 18579, 18580), class = "Date"), val = c(3244L, 3273L, 2974L, 3283L, 3922L, 3669L, 4246L, 4594L, 4086L, 4302L, 4559L, 4981L, 4741L, 5267L)), class = "data.frame", row.names = c(NA, -14L))
I have a cohort of data with multiple visits per person and want to group visits under a common ID based on the person number and the time of the visit. The condition is that if a visit starts within 24 hours of the previous discharge, those visits should share the same ID.
Sample of what data looks like:
dat <- data.frame(
  Person_ID = c(1,1,1,2,3,3,3,4,4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"), format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"), format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried to start:
dat1 <-
dat %>%
arrange(Person_ID, Admit_Date_Time) %>%
group_by(Person_ID) %>%
mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time,
1),Admit_Date_Time, units = "hours")) %>%
mutate(start = c(1,Previous_Visit_Interval[-1] < hours(-24)), run =
cumsum(start))
dat1$ID = as.numeric(as.factor(paste0(dat1$Person_ID,dat1$run)))
This is almost right, except that it does not give the correct ID for visit 7 (person #3). That person has three visits: the second lies entirely within the first, and the third starts within 24 hours of the end of the first but not of the second.
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
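library(dplyr)      # mutate(), arrange(), lag(), if_else()
library(tidyr)      # gather(), spread()
library(lubridate)  # ddays()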
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 0) | # (a) default = 0 (assumes no Person_ID of 0) so the first row starts Hosp_ID 1
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00
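For comparison, the same grouping rule can be sketched in base R with a running maximum of earlier discharge times: a visit opens a new group only if it is admitted more than 24 hours after the latest discharge seen so far for that person. This is an illustrative alternative, not either answerer's method; the names visits, starts_new_stay and the shortened column names are hypothetical, and the timestamps are parsed as UTC for simplicity.

```r
visits <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                       "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                       "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                       "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                       "2017-12-01 01:31:00"), tz = "UTC"),
  Discharge = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                           "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                           "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                           "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                           "2017-12-04 16:41:00"), tz = "UTC"),
  Visit_ID = 1:9
)

# Sort so that earlier admissions of the same person come first
visits <- visits[order(visits$Person_ID, visits$Admit), ]

# For one person's visits: TRUE where a visit starts a new group
starts_new_stay <- function(p) {
  admit <- as.numeric(p$Admit)
  discharge <- as.numeric(p$Discharge)
  # Latest discharge among all earlier visits (-Inf for the first visit)
  latest_prev_discharge <- c(-Inf, head(cummax(discharge), -1))
  admit > latest_prev_discharge + 24 * 60 * 60
}

new_stay <- unlist(lapply(split(visits, visits$Person_ID), starts_new_stay),
                   use.names = FALSE)
visits$Hosp_ID <- cumsum(new_stay)

visits[, c("Person_ID", "Visit_ID", "Hosp_ID")]
```

Using the running maximum rather than only the previous row's discharge is what handles visit 7: it is compared against the discharge of visit 5, even though visit 6 sits between them.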
Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR: I have a year-long daily rain data.frame and want to sum the accumulated rain for the dates above.
Please help.
Use of round_date in the way you have shown it will not give you 14-day periods as you might expect. I have taken a different approach in this solution and generated a sequence of dates between your first and last dates and grouped these into 14-day periods then joined the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)
dateseq =
data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
group_by(group) %>%
mutate(starts = min(Dates))
df.rain <- df.meteoro %>%
right_join(dateseq) %>%
group_by(starts) %>%
summarise(sum_rain = sum(rain))
head(df.rain)
> head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
Using a right-join to the date sequence is to ensure that if there are missing observation days that spanned a complete time period you'd still get that period listed in the result (though in your case you have a complete year of dates anyway).
round_date rounds to the nearest multiple of unit (here, 14 days) since some epoch (probably the Unix epoch of 1970-01-01 00:00:00), which doesn't line up with your purpose.
To get what you want, you can do the following:
df.rain = df.meteoro %>%
mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
TwoWeeks = as.Date("2017/1/18") + 14*ceiling(days_since_start/14)) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
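To see which labels this ceiling-based rounding produces, here is a small base R sketch on the date sequence alone (an illustration, not the answerer's full pipeline; the variable names are hypothetical). Each day is labelled with the last day of its 14-day bin, counted from the origin 2017-01-18.

```r
dates <- seq(as.Date("2017-01-19"), as.Date("2018-01-18"), by = "day")
origin <- as.Date("2017-01-18")

# 1 for the first observation, 2 for the second, and so on
days_since_start <- as.numeric(dates - origin)

# Round each day up to the next multiple of 14 days after the origin
two_weeks <- origin + 14 * ceiling(days_since_start / 14)

two_weeks[1]                             # label of the first bin
range(dates[two_weeks == two_weeks[1]])  # first and last day in that bin
table(table(two_weeks))                  # how many bins have how many days
```

With this origin the first label is 2017-02-01, the fourteenth day counted from 2017-01-19. Shifting the origin shifts every label by the same amount, so it is worth checking the labels against the ones you expect before summing.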
Assuming you want to round to the closest date from the ones you have specified I guess the following will work
library(lubridate)  # ymd(), interval()
library(plyr)       # ddply(), .()
targetDates<-seq(ymd("2017-02-02"),ymd("2018-01-18"),by='14 days')
df.meteoro$Dates=targetDates[sapply(df.meteoro$Dates,function(x) which.min(abs(interval(targetDates,x))))]
sum_rain=ddply(df.meteoro,.(Dates),summarize,sum_rain=sum(rain,na.rm=T))
As you can see, not all dates have the same number of observations. Date "2017-02-02", for instance, has all the records between "2017-01-19" and "2017-02-09", which are 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", etc.
This may be a cheat, but assuming each row/observation is a separate day, then why not just group by every 14 rows and sum.
# Assign interval groups, each 14 rows
df.meteoro$my_group <-rep(1:100, each=14, length.out=nrow(df.meteoro))
# Grab Interval Names
my_interval_names <- df.meteoro %>%
select(-rain) %>%
group_by(my_group) %>%
slice(1)
# Summarise
df.meteoro %>%
group_by(my_group) %>%
summarise(rain = sum(rain)) %>%
left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).
I have the following dataframe in R.
Date Car_NO
2016-12-24 19:35:00 ABC
2016-12-24 19:55:00 DEF
2016-12-24 20:15:00 RTY
2016-12-24 20:35:00 WER
2016-12-24 21:34:00 DER
2016-12-24 00:23:00 ABC
2016-12-24 00:22:00 ERT
2016-12-24 11:45:00 RTY
2016-12-24 13:09:00 RTY
Date format is "POSIXct" "POSIXt"
I want to count the hourly movement of car traffic, in slots like 12-1, 1-2, 2-3, 3-4 and so on.
Currently my approach is following
df$time <- ymd_hms(df$Date)
df$hours <- hour(df$time)
df$minutes <- minute(df$time)
df$time <- as.numeric(paste(df$hours,df$minutes,sep="."))
After this I would apply nested ifelse calls to divide it into hourly time slots, but I think that would be a long and tedious way to do it. Is there an easier approach in R?
My desired dataframe would be
Time_Slots Car_Traffic_count
00-01 2
01-02 0
02-03 0
.
.
.
19-20 2
20-21 2
21-22 1
.
.
.
Simplest would be to just use the starting hour to indicate a time interval:
# sample data
df = data.frame(time = Sys.time()+seq(1,10)*10000, runif(10) )
# summarize
library(dplyr)
library(tidyr)  # complete() comes from tidyr
df$hour = factor(as.numeric(format(df$time,"%H")), levels = seq(0,23))  # %H runs from 00 to 23
df = df %>%
group_by(hour) %>%
summarize(count=n()) %>%
complete(hour, fill = list(count = 0))
Output:
# A tibble: 24 x 2
hour count
<fctr> <dbl>
1 0 0
2 1 1
3 2 0
4 3 0
5 4 1
6 5 0
7 6 1
8 7 0
9 8 0
10 9 1
# ... with 14 more rows
You can optionally add:
df$formatted = paste0(as.character(df$hour),"-",as.numeric(as.character(df$hour))+1)
at the end to get your desired format. Hope this helps!
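Applied to the data from the question, the same idea can be sketched in base R with a factor whose levels cover all 24 hours, so that empty slots are counted as zero. The data frame name traffic and the variable names are hypothetical, and the timestamps are parsed as UTC for simplicity.

```r
traffic <- data.frame(
  Date = as.POSIXct(c("2016-12-24 19:35:00", "2016-12-24 19:55:00",
                      "2016-12-24 20:15:00", "2016-12-24 20:35:00",
                      "2016-12-24 21:34:00", "2016-12-24 00:23:00",
                      "2016-12-24 00:22:00", "2016-12-24 11:45:00",
                      "2016-12-24 13:09:00"), tz = "UTC"),
  Car_NO = c("ABC", "DEF", "RTY", "WER", "DER", "ABC", "ERT", "RTY", "RTY")
)

# Hour of day (0-23) for each record
hour_of_day <- as.integer(format(traffic$Date, "%H"))

# Labels "00-01", "01-02", ..., "23-24"; one per possible hour
slot_labels <- sprintf("%02d-%02d", 0:23, 1:24)
slots <- factor(hour_of_day, levels = 0:23, labels = slot_labels)

# table() keeps unused factor levels, so empty slots appear with a count of 0
counts <- as.data.frame(table(Time_Slots = slots),
                        responseName = "Car_Traffic_count")
counts
```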