Substring from character string using regex in R

I'm scraping PDF reports for their data.
I'm trying to extract the location each report is based on. I have a character string with the location, followed by a rolling 13-month header, seen here:
header_line <- "Corp Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
I'd like to extract all characters from the beginning of the string to the start of whatever month appears after the location. Because it's a rolling 13-month report, any of the month abbreviations could appear next to the location.
I have this working for the above example, but I'm not sure how to create an "Or pattern" with regex. I know I could brute force it with a loop or apply function, but I was hoping there was a less dirty way.
stringr::str_extract(header_line, "[^Dec]+")
[1] "Corp "

It is difficult to anticipate every form the location could take, but the solution below should cover most cases. It matches everything before the first run of 3 alphabetical characters that is followed by a space, an apostrophe, and 2 digits.
str_extract(header_line, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
Test cases:
header_line <- "Corp Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
header_line2 <- "Corp multiple words Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
header_line3 <- "Corp multiple words 1 Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
header_line4 <- "Corp multiple 444 Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
str_extract(header_line, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp "
str_extract(header_line2, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp multiple words "
str_extract(header_line3, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp multiple words 1 "
str_extract(header_line4, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp multiple 444 "
Each match keeps the space that precedes the month; wrap the call in str_trim() to drop it.
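For the explicit "or pattern" the question asks about, the twelve abbreviations can be joined with | into a single alternation. The following is a minimal base R sketch, assuming English month abbreviations as in the built-in month.abb; the helper name extract_location is made up for illustration.

```r
# Build "Jan|Feb|...|Dec" from R's built-in month.abb constant.
month_alt <- paste(month.abb, collapse = "|")

# Match from the whitespace before the first "<Mon> 'YY" to the end of the string.
header_pattern <- paste0("\\s+(?:", month_alt, ")\\s'\\d{2}.*$")

# Hypothetical helper: drop the month header, keep the location.
extract_location <- function(x) sub(header_pattern, "", x, perl = TRUE)

extract_location("Corp Dec '20 Jan '21 Feb '21")
# [1] "Corp"
extract_location("Corp multiple 444 Mar '21 Apr '21")
# [1] "Corp multiple 444"
```

Because the pattern also consumes the whitespace before the month, no trimming is needed afterwards. A location that merely contains a month abbreviation (say, "Marble") is left alone unless it is followed by a space, an apostrophe, and two digits.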

Related

How can I change date string to POSIXct format?

I am trying to convert the dates in the created_at column to a column of seconds since the epoch, created_at_dt, using POSIXct.
created_at
<chr>
Fri May 26 17:30:01 +0000 2017
Fri May 26 17:30:05 +0000 2017
Fri May 26 17:30:05 +0000 2017
Fri May 26 17:30:04 +0000 2017
Fri May 26 17:30:12 +0000 2017
Example of what I want to achieve:
created_at_dt
<dbl>
1495819801
1495819805
1495819805
1495819804
1495819812
I tried the following line but got only NA values introduced.
tweets <- tweets %>%
mutate(created_at_dt = asPOSIXct(as.numeric('created_at')))
Any help would be much appreciated. Thank you!
You just need to specify the correct format string for as.POSIXct.
Also, created_at should not be in quotes for mutate().
library(dplyr)
tweets <- tweets %>%
  mutate(created_at_dt = as.POSIXct(created_at,
                                    format = "%a %b %d %H:%M:%S %z %Y") %>%
           as.numeric())
Result:
created_at created_at_dt
1 Fri May 26 17:30:01 +0000 2017 1495819801
2 Fri May 26 17:30:05 +0000 2017 1495819805
3 Fri May 26 17:30:05 +0000 2017 1495819805
4 Fri May 26 17:30:04 +0000 2017 1495819804
5 Fri May 26 17:30:12 +0000 2017 1495819812
The data:
tweets <- structure(list(created_at = c("Fri May 26 17:30:01 +0000 2017",
"Fri May 26 17:30:05 +0000 2017", "Fri May 26 17:30:05 +0000 2017",
"Fri May 26 17:30:04 +0000 2017", "Fri May 26 17:30:12 +0000 2017"
)), class = "data.frame", row.names = c(NA, -5L))
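To see why the original attempt produced only NA, and to check the conversion without dplyr, here is a small base R sketch. It assumes an English locale (so that %a and %b recognise "Fri" and "May") and sets tz = "UTC" explicitly; the variable names are illustrative.

```r
x <- c("Fri May 26 17:30:01 +0000 2017", "Fri May 26 17:30:12 +0000 2017")

# Quoting the column name passes the literal text "created_at",
# which cannot be coerced to a number, hence NA (with a warning).
quoted <- suppressWarnings(as.numeric("created_at"))
quoted
# [1] NA

# Parse with an explicit format, then convert to seconds since 1970-01-01 UTC.
parsed <- as.POSIXct(x, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
secs <- as.numeric(parsed)
secs
# [1] 1495819801 1495819812
```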

Issues with ARIMAX forecasting (auto.arima)

I'm trying to forecast an accumulated monthly time series (see data below) with the auto.arima function with exogenous regressors. I have two issues.
1) My first issue is that when I fit the model and use the forecast function to predict the second half of 2019 the forecast starts from zero as can be seen in this forecast plot.
This only happens when I include a matrix of exogenous regressors and not when I use a single time series as regressor as can be seen in this plot.
Why is that? My code is:
regnskab <- ts(data$Regnskab, frequency = 12, start = c(2014,1), end = c(2019,6))
budget <- ts(data$Budget, frequency = 12, start = c(2014,1), end = c(2019,6))
dagtilbud <- ts(data$Dagtilbud, frequency = 12, start = c(2014,1), end = c(2019,6))
skole <- ts(data$Skole, frequency = 12, start = c(2014,1), end = c(2019,6))
sundhed <- ts(data$Sundhed, frequency = 12, start = c(2014,1), end = c(2019,6))
miljø <- ts(data$Miljø, frequency = 12, start = c(2014,1), end = c(2019,6))
tsmatrix <- cbind(budget, dagtilbud, miljø, skole, sundhed)
fit <- auto.arima(regnskab, xreg = tsmatrix)
fcast <- forecast(fit, h = 6, xreg = tsmatrix)
autoplot(fcast)
summary(fcast)
2) My second issue is that I want a forecast for 6 months forward, but the h=6 option does not apply when including exogenous regressors. Can this be solved in any way? Again, it is not a problem without exogenous regressors.
I hope you can help and sorry for the data spamming!
A summary of my model:
> summary(fcast)
Forecast method: Regression with ARIMA(1,0,0)(1,0,0)[12] errors
Model Information:
Series: regnskab
Regression with ARIMA(1,0,0)(1,0,0)[12] errors
Coefficients:
ar1 sar1 budget dagtilbud miljø skole sundhed
0.7466 0.6693 0.0101 2.0861 0.1037 2.5240 7.7623
s.e. 0.0935 0.1042 0.0077 0.6967 1.7672 0.7535 2.6611
sigma^2 estimated as 1.884: log likelihood=-114.84
AIC=245.68 AICc=248.21 BIC=263.2
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01739231 1.297694 0.9002519 -0.1065542 0.9060671 0.3687968 -0.03222251
> regnskab
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 19.11281 36.68003 54.66383 74.93864 94.10328 113.36373 134.96638 152.75095 170.79800 189.55430 207.00803 227.82096
2015 18.90205 37.20079 55.73305 75.44689 94.74538 115.03997 136.79829 155.41164 173.69889 191.96484 210.42391 231.52982
2016 20.12939 38.51516 56.32522 78.04822 97.46681 116.58424 139.43255 157.83048 175.26727 195.06259 213.73833 234.45281
2017 20.43082 38.55219 57.50119 78.07558 97.50132 119.13735 141.71973 161.49281 180.32002 199.27769 216.92571 239.40683
2018 19.35194 37.40571 55.36897 76.33412 95.90922 117.41442 140.03545 159.10527 177.88068 194.43207 215.28905 245.85670
2019 20.85722 40.01691 59.97383 81.92719 103.15225 123.81454
> tsmatrix
budget dagtilbud miljø skole sundhed
Jan 2014 230.0605 2.616639 0.597125 3.193017 0.456470
Feb 2014 230.0605 5.025708 1.047983 6.402845 1.012468
Mar 2014 230.0605 7.548424 1.458105 9.816814 1.602384
Apr 2014 230.0605 10.350321 1.957022 13.446215 2.263646
May 2014 230.0605 12.913356 2.439587 17.100957 2.873934
Jun 2014 230.0605 15.380146 2.915020 20.791343 3.498350
Jul 2014 230.0605 17.931069 3.434464 23.701276 3.987042
Aug 2014 230.0605 20.441732 3.837721 27.319389 4.597127
Sep 2014 230.0605 22.839922 4.295486 30.859254 5.185271
Oct 2014 230.0605 25.234620 4.761740 34.350629 5.819948
Nov 2014 230.0605 27.554525 5.163576 37.688182 6.416112
Dec 2014 230.0605 30.109529 5.742699 42.095747 7.313195
Jan 2015 234.5089 2.404843 0.643976 3.185265 0.477921
Feb 2015 234.5089 5.090533 1.094641 6.654691 1.040235
Mar 2015 234.5089 7.319261 1.462134 10.168618 1.659232
Apr 2015 234.5089 10.040823 1.943120 14.082780 2.356247
May 2015 234.5089 12.470742 2.431818 17.827494 2.963360
Jun 2015 234.5089 14.846720 3.019969 21.612527 3.615607
Jul 2015 234.5089 17.543682 3.540084 24.702634 4.126374
Aug 2015 234.5089 19.786612 3.984587 28.330977 4.741392
Sep 2015 234.5089 22.037785 4.362497 31.942762 5.367815
Oct 2015 234.5089 24.347196 4.805391 35.423452 6.019133
Nov 2015 234.5089 26.751255 5.250481 38.964450 6.642436
Dec 2015 234.5089 29.276667 5.789919 43.428855 7.555361
Jan 2016 237.2361 2.538133 0.721184 3.352676 0.508847
Feb 2016 237.2361 4.906975 1.377086 6.804320 1.100914
Mar 2016 237.2361 7.184724 1.719629 10.290800 1.744743
Apr 2016 237.2361 9.895237 2.333842 14.223635 2.480869
May 2016 237.2361 12.316509 2.850905 17.957433 3.115473
Jun 2016 237.2361 14.578536 3.404785 21.759111 3.858713
Jul 2016 237.2361 17.215216 3.867858 24.949928 4.359129
Aug 2016 237.2361 19.399769 4.406750 28.503968 5.030926
Sep 2016 237.2361 21.702215 4.792190 32.112449 5.674259
Oct 2016 237.2361 24.112579 5.238401 35.625806 6.328084
Nov 2016 237.2361 26.453919 5.677270 39.158270 6.977991
Dec 2016 237.2361 28.969565 6.098136 43.558768 7.974787
Jan 2017 241.9089 2.538901 0.917354 3.488151 0.535639
Feb 2017 241.9089 4.847981 1.450172 6.857674 1.138782
Mar 2017 241.9089 7.281994 1.899543 10.394615 1.808938
Apr 2017 241.9089 10.031959 2.388542 14.335895 2.554613
May 2017 241.9089 12.411935 2.893036 18.042788 3.206503
Jun 2017 241.9089 14.982942 3.282057 22.137085 3.959622
Jul 2017 241.9089 17.567382 3.770244 25.392706 4.540047
Aug 2017 241.9089 19.738993 4.484434 29.108498 5.196528
Sep 2017 241.9089 22.273634 5.051894 32.693173 5.870257
Oct 2017 241.9089 24.636583 5.456458 36.203329 6.544383
Nov 2017 241.9089 27.259158 5.793056 39.867875 7.249982
Dec 2017 241.9089 29.831986 6.079033 44.273697 8.269454
Jan 2018 246.0944 2.467981 0.985846 3.377469 0.544258
Feb 2018 246.0944 4.877189 1.383190 6.815726 1.167431
Mar 2018 246.0944 7.367918 1.738033 10.486250 1.848972
Apr 2018 246.0944 10.148353 2.249466 14.439246 2.614913
May 2018 246.0944 12.687311 2.844656 18.194669 3.328234
Jun 2018 246.0944 15.482606 3.616200 22.433048 4.108966
Jul 2018 246.0944 17.715938 3.982451 25.305411 4.689087
Aug 2018 246.0944 20.077201 4.696088 29.018017 5.396796
Sep 2018 246.0944 22.659831 5.158706 32.860215 6.087975
Oct 2018 246.0944 24.719623 5.586616 36.143198 6.713136
Nov 2018 246.0944 27.750904 6.069519 40.237747 7.501346
Dec 2018 246.0944 30.326036 6.308786 44.733470 8.564162
Jan 2019 251.9230 2.653607 0.932776 3.501389 0.595458
Feb 2019 251.9230 5.070721 1.445741 6.991538 1.243721
Mar 2019 251.9230 7.542256 1.825956 10.737607 1.941444
Apr 2019 251.9230 10.301781 2.330015 14.647082 2.733956
May 2019 251.9230 13.193286 2.999816 18.671285 3.455616
Jun 2019 251.9230 15.423716 3.516735 22.612031 4.145206
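The xreg matrix passed to the forecast function should contain the regressor values for the future time periods, not the training data. When xreg is supplied, h is ignored and one forecast is produced per row of xreg, so passing tsmatrix again yields 66 forecasts driven by the regressor values from January 2014 onward, which is likely why the forecast appears to start from zero. If you want h=6, then give a matrix of 6 rows corresponding to those 6 periods.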
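As a rough illustration of supplying future regressors, here is a sketch using base R's arima() and predict() so that it runs without extra packages; with the forecast package, the analogous step would be passing the 6-row matrix as xreg to forecast(). The data are synthetic and the future regressor values are made up, so only the shapes matter here.

```r
set.seed(1)
n <- 66  # same number of months as the series in the question

# Synthetic training regressors (assumption: stand-ins for budget, skole, ...).
xreg_train <- cbind(a = rnorm(n), b = rnorm(n))
y <- ts(10 + 2 * xreg_train[, "a"] - xreg_train[, "b"] + rnorm(n),
        frequency = 12, start = c(2014, 1))

fit <- arima(y, order = c(1, 0, 0), xreg = xreg_train)

# Future regressor values for the next 6 months: 6 rows, same columns as training.
# In practice these would come from budgets, plans, or separate forecasts.
xreg_future <- cbind(a = rnorm(6), b = rnorm(6))

pred <- predict(fit, n.ahead = 6, newxreg = xreg_future)
length(pred$pred)  # one forecast per future row
```

The key point is that the future matrix has exactly as many rows as periods to forecast and the same columns, in the same order, as the matrix used for fitting.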

rrule to get the 2nd Monday, Wednesday and Friday of the month, for every month

I am trying to create an rrule for my fullcalendar event that occurs on the 2nd Monday, Wednesday and Friday of the month, for every month.
Here is the rrule I have tried
RRULE:FREQ=MONTHLY;COUNT=10;INTERVAL=1;WKST=SU;BYDAY=MO,WE,FR;BYSETPOS=2
events: [{
    title: 'rrule event',
    rrule: {
      freq: RRule.MONTHLY,
      count: 10,
      interval: 1,
      wkst: RRule.SU,
      byweekday: [RRule.MO, RRule.WE, RRule.FR],
      bysetpos: [2]
    },
    duration: '02:00',
    rendering: 'inverse-background'
  }
],
This is what I get
1 Fri, 03 May 2019 12:33:53 GMT
2 Wed, 05 Jun 2019 12:33:53 GMT
3 Wed, 03 Jul 2019 12:33:53 GMT
4 Mon, 05 Aug 2019 12:33:53 GMT
5 Wed, 04 Sep 2019 12:33:53 GMT
6 Fri, 04 Oct 2019 12:33:53 GMT
7 Mon, 04 Nov 2019 12:33:53 GMT
8 Wed, 04 Dec 2019 12:33:53 GMT
9 Fri, 03 Jan 2020 12:33:53 GMT
10 Wed, 05 Feb 2020 12:33:53 GMT
What is expected is
1 Mon, 08 Apr 2019
2 Wed, 10 Apr 2019
3 Fri, 12 Apr 2019
4 Mon, 13 May 2019
5 Wed, 08 May 2019
6 Fri, 10 May 2019.........
RFC 5545, section 3.3.10. states:
Each BYDAY value can also be preceded by a positive (+n) or
negative (-n) integer. If present, this indicates the nth
occurrence of a specific day within the MONTHLY or YEARLY "RRULE".
So the rule you're looking for literally specifies the 2nd Monday (2MO), Wednesday (2WE) and Friday (2FR) of each month.
FREQ=MONTHLY;COUNT=10;BYDAY=2MO,2WE,2FR
Note that INTERVAL=1 is the default and WKST=SU is meaningless in this case, so you can just as well omit them.
Btw, your rule basically says, of all Mondays, Wednesdays and Fridays of a month, take the second instance in that month.
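The difference between the two readings can be checked by hand. The sketch below is plain base R date arithmetic, not the rrule library, and only mimics the two rules for April 2019; the helper month_days is made up for illustration.

```r
# All dates in a given month (year and month are example inputs).
month_days <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  next_first <- seq(first, by = "month", length.out = 2)[2]
  seq(first, next_first - 1, by = "day")
}

days <- month_days(2019, 4)
wday <- as.POSIXlt(days)$wday  # 0 = Sunday, 1 = Monday, ..., 6 = Saturday

# BYDAY=MO,WE,FR;BYSETPOS=2: pool all Mon/Wed/Fri of the month, take the 2nd.
pooled <- days[wday %in% c(1, 3, 5)]
pooled[2]
# [1] "2019-04-03"

# BYDAY=2MO,2WE,2FR: take the 2nd occurrence of each weekday separately.
second_each <- sort(do.call(c, lapply(c(1, 3, 5), function(d) days[wday == d][2])))
second_each
# [1] "2019-04-08" "2019-04-10" "2019-04-12"
```

The pooled reading yields a single date per month, which matches the one-date-per-month output in the question, while the per-weekday reading yields the three expected dates.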

Is there a better way to order levels and factors together? I have successfully ordered them but there must be a more elegant way.

Create a month vector.
> mths<-month.abb
> mths
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
But this is a character vector, so I convert it to a factor as follows:
> mths1<-factor(mths)
> mths1
[1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Levels: Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
But now the levels are in alphabetical order, which is not what we want. Trying the following, we get the levels in the correct order, but the data gets jumbled.
> levels(mths1)<-mths
> mths1
[1] May Apr Aug Jan Sep Jul Jun Feb Dec Nov Oct Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
And then I tried this:
> mths1[]<-mths
> mths1
[1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
And now the values as well as the levels are in the correct order.
I want to know what is happening behind the scenes in each of the above cases, as the assignments to levels and data got me a bit confused.
Finally, what is a more elegant way to achieve the same thing?
Combining comments above into an answer
mths<-month.abb
mths
# [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
mths1 <-factor(mths, levels=mths, ordered=TRUE)
mths1
# [1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
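As for what happens behind the scenes: a factor stores integer codes that index into its levels vector. The following base R sketch (variable names f and g are illustrative) inspects those codes at each step from the question.

```r
f <- factor(month.abb)  # levels are sorted alphabetically: Apr, Aug, Dec, ...
as.integer(f)           # codes index into those sorted levels
# [1]  5  4  8  1  9  7  6  2 12 11 10  3

# `levels<-` only relabels: code 5 used to mean "Jan", now it means month.abb[5] = "May".
g <- f
levels(g) <- month.abb
as.character(g)[1:3]
# [1] "May" "Apr" "Aug"

# `[<-` assigns values by label, so the codes are recomputed against the new levels.
g[] <- month.abb
as.integer(g)
# [1]  1  2  3  4  5  6  7  8  9 10 11 12
```

Passing levels = to factor() up front sets the level order before any codes are computed, which is why it avoids the jumbling.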

How to transform the date using R

Currently, I have a lot of data. Associated with the data, I also have dates. Unfortunately, the dates are in the following format (day (Monday-Sunday), month (January-December) date (1-31) Hour:Minute:Second timezone Year). I would like to convert this into just Month/Day(1-31)/Year. Following is the sample data.
created_data
Sat Jun 20 23:45:03 +0000 2015
Sat Jun 20 23:45:06 +0000 2015
Sat Jun 20 23:45:06 +0000 2015
Sat Jun 20 23:45:08 +0000 2015
Sat Jun 20 23:45:11 +0000 2015
Sat Jun 20 23:45:13 +0000 2015
Sat Jun 20 23:45:14 +0000 2015
Sat Jun 20 23:45:15 +0000 2015
This is currently in the form of a dataframe. The format in which I am trying to see the dataframe is the following:
Results
Jun 20 2015
Jun 20 2015
Jun 20 2015
Jun 20 2015
Jun 20 2015
Jun 20 2015
Jun 20 2015
Jun 20 2015
Following is the code that I have tried but the result was just NA
strptime(x = created_data, format = "%m/%d/%Y")
Result = NA
First you have to convert your character string to something that R knows how to deal with such as a POSIXct object.
Given your format you can do as.POSIXct(created_data, format="%a %b %d %X %z %Y")
Once it is in that format you can convert it back to a character string of the format you want using format such as...
format(as.POSIXct(created_data, format="%a %b %d %X %z %Y"), format = "%Y/%m/%d")
The following should work, assuming the datetimes are stored in a character vector.
library("stringr")
library("dplyr")
dates <- c("Sat Jun 20 23:45:03 +0000 2015",
"Sat Jun 20 23:45:06 +0000 2015",
"Sat Jun 20 23:45:06 +0000 2015",
"Sat Jun 20 23:45:08 +0000 2015",
"Sat Jun 20 23:45:11 +0000 2015",
"Sat Jun 20 23:45:13 +0000 2015",
"Sat Jun 20 23:45:14 +0000 2015",
"Sat Jun 20 23:45:15 +0000 2015")
str_split_fixed(dates, pattern = " ", n = 6) %>%
  as.data.frame() %>%
  mutate(new.date = as.Date(paste(V2, V3, V6), format = "%b %d %Y"))
The basic idea being to split the string into its individual pieces using str_split_fixed(), then recombine the pieces in as.Date()
Just a base R solution without other packages.
x <- "Sat Jun 20 23:45:03 +0000 2015"
x1 <- format(strptime(x, "%a %b %d %H:%M:%S %z %Y", tz = "GMT"), "%b %d %Y")
x1
[1] "Jun 20 2015"
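Since the question asks for Month/Day/Year and the data lives in a data frame, the same base R approach can be applied to a whole column. This is a minimal sketch, assuming the column is named created_data as in the sample and an English locale for %a and %b; the new column name date_mdy is made up.

```r
df <- data.frame(created_data = c("Sat Jun 20 23:45:03 +0000 2015",
                                  "Sat Jun 20 23:45:15 +0000 2015"),
                 stringsAsFactors = FALSE)

# Parse the full timestamp first, then format only the parts that are wanted.
parsed <- strptime(df$created_data, "%a %b %d %H:%M:%S %z %Y", tz = "GMT")
df$date_mdy <- format(parsed, "%m/%d/%Y")
df$date_mdy
# [1] "06/20/2015" "06/20/2015"
```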
