Convert string into date in PySpark dataframe

I was trying to convert a string column in my dataframe into a date type. The string looks like this:
Fri Oct 12 18:14:29 +0000 2018
And I have tried this code:
df_en.withColumn('date_timestamp',unix_timestamp('created_at','ddd MMM dd HH:mm:ss K yyyy')).show()
But I got this result:
+--------------------+--------------------+--------------------+--------------+
| created_at| text| sentiment|date_timestamp|
+--------------------+--------------------+--------------------+--------------+
|Mon Oct 15 20:53:...|What a shock hey,...|-0.07755102040816327| null|
|Fri Oct 12 18:14:...|No Bucky, people ...| 0.0| null|
|Wed Oct 10 07:51:...|If Sarah Hanson Y...| 0.05| null|
|Mon Oct 15 02:30:...| 365 days| 0.0| null|
|Sun Oct 14 06:17:...|#HimToo: how an a...| -0.5| null|
|Tue Oct 09 07:30:...|hopefully the #Hi...| 0.0| null|
|Tue Oct 09 23:30:...|If Labor win Gove...| 0.8| null|
|Thu Oct 11 01:09:...|Hello #Perth - th...| 0.75| null|
|Sat Oct 13 21:47:...|#MeToo changed th...| 0.0| null|
|Tue Oct 09 00:41:...|Rich for Queensla...| 0.375| null|
|Mon Oct 15 12:59:...|Wonder what else ...| 0.0| null|
|Mon Oct 15 05:12:...|#dani_ries #metoo...| 0.0| null|
|Wed Oct 10 00:30:...|Hey #JackieTrad a...| 0.25| null|
|Tue Oct 16 04:00:...|“There's this ide...| 0.03611111111111113| null|
|Sun Oct 14 08:14:...|Is this the attit...|-0.01499999999999999| null|
|Sat Oct 13 11:26:...|#metoo official s...| 0.1| null|
|Tue Oct 09 00:23:...|On the limited an...|-0.01904761904761...| null|
|Tue Oct 16 14:41:...|Domestic Violence...| 0.0| null|
|Wed Oct 10 23:34:...|#australian Note ...| 0.0| null|
|Sat Oct 06 20:07:...|Wtaf, America. I ...| 0.0| null|
+--------------------+--------------------+--------------------+--------------+
Also, I have tried:
df_en.select(col("created_at"),to_date(col("created_at")).alias("to_date") ).show()
The result is exactly the same. I don't know why; could anybody help me?

Try the pattern EEE MMM dd HH:mm:ss Z yyyy together with the Spark config .config('spark.sql.legacy.timeParserPolicy', 'LEGACY'); on Spark 3+ the day-of-week letters (EEE) are only accepted for parsing by the legacy parser.
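For example, a minimal sketch of that suggestion (the inline one-row dataframe is only for illustration; in the question df_en already exists):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, unix_timestamp

spark = (SparkSession.builder
         .appName("parse-created-at")
         # the legacy parser accepts this SimpleDateFormat-style pattern
         .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
         .getOrCreate())

df_en = spark.createDataFrame([("Fri Oct 12 18:14:29 +0000 2018",)], ["created_at"])

(df_en
 .withColumn("date_timestamp", unix_timestamp(col("created_at"), "EEE MMM dd HH:mm:ss Z yyyy"))
 .withColumn("created_ts", to_timestamp(col("created_at"), "EEE MMM dd HH:mm:ss Z yyyy"))
 .show(truncate=False))
On an already running session the policy can also be set with spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY').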

Related

I want to find the number of weeks in a given month, along with the start date and end date of each week, using moment.js.

let currentDate = moment();
let weekStart = currentDate.clone().startOf('week');
let weekEnd = currentDate.clone().endOf('week');
I want to know the start date and end date of every week for a given month.
Expected output for August, as an array of objects:
1. 1 Aug 2021 - 7 Aug 2021
2. 8 Aug 2021 - 14 Aug 2021
3. 15 Aug 2021 - 21 Aug 2021
4. 22 Aug 2021 - 28 Aug 2021
5. 29 Aug 2021 - 31 Aug 2021
moment().startOf('week');
moment().endOf('week');
Reference:
https://www.itsolutionstuff.com/post/moment-js-get-current-week-start-and-end-date-exampleexample.html
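The same week-splitting logic, sketched in plain Python for illustration only (the question is about moment.js, so treat this as a description of the algorithm rather than a moment.js answer; the Sunday-based week start mirrors moment's default locale):
from datetime import date, timedelta
import calendar

def weeks_of_month(year, month, week_start=6):
    # week_start uses datetime.weekday() numbering: 0 = Monday ... 6 = Sunday
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    weeks, cur = [], date(year, month, 1)
    while cur <= last_day:
        # days from `cur` to the end of its week, clamped to the end of the month
        to_week_end = (week_start - 1 - cur.weekday()) % 7
        week_end = min(cur + timedelta(days=to_week_end), last_day)
        weeks.append((cur, week_end))
        cur = week_end + timedelta(days=1)
    return weeks

for start, end in weeks_of_month(2021, 8):
    print(f"{start.day} {start:%b %Y} - {end.day} {end:%b %Y}")
For August 2021 this prints the five ranges listed in the expected output above.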

File manipulation in Unix

cat sample_file.txt (extracted job info from Control-M):
upctm,pmdw_bip,pmdw_bip_mnt_35-FOLDistAutoRpt,Oct 7 2019 4:45 AM,Oct 7 2019 4:45 AM,1,1,Oct 6 2019 12:00 AM,Ended OK,3ppnc
upctm,pmdw_ddm,pmdw_ddm_dum_01-StartProjDCSDemand,Oct 17 2019 4:02 AM,Oct 17 2019 4:02 AM,3,1,Oct 16 2019 12:00 AM,Ended OK,3pqgq
I need to process this file into a DB table (Oracle).
But I need to make sure that the day is two digits (for example, 7 becomes 07), like this:
Oct 07 2019 6:32 AM
I used this command to get all the dates in every line:
cat sample_file.txt | grep "," | while read line
do
l_start_date=`echo $line|cut -d ',' -f4`
l_end_date=`echo $line|cut -d ',' -f5`
l_order_date=`echo $line|cut -d ',' -f8`
echo $l_start_date
echo $l_end_date
echo $l_order_date
done
Output:
Oct 7 2019 4:45 AM
Oct 7 2019 4:45 AM
Oct 6 2019 12:00 AM
Oct 17 2019 4:02 AM
Oct 17 2019 4:02 AM
Oct 16 2019 12:00 AM
Expected output:
From: Oct 7 2019 6:32 AM
To: Oct 07 2019 6:32 AM
I used this sed command, but it also adds a zero to two-digit days (such as 17):
sed 's|,Oct |,Oct 0|g' sample_file.txt
Oct 17 was changed to Oct 017:
upctm,pmdw_bip,pmdw_bip_mnt_35-FOLDistAutoRpt,Oct 07 2019 4:45 AM,Oct 07 2019 4:45 AM,1,1,Oct 06 2019 12:00 AM,Ended OK,3ppnc
upctm,pmdw_ddm,pmdw_ddm_dum_01-StartProjDCSDemand,Oct 017 2019 4:02 AM,Oct 017 2019 4:02 AM,3,1,Oct 016 2019 12:00 AM,Ended OK,3pqgq
I wish it were easier, but I only managed the following script (save it as f.awk):
# Reformat a date like "Oct 7 2019 4:45 AM" so the day always has two digits.
function fmt(s) {
    split(s, a, " "); a[2] = substr(a[2] + 100, 2)
    return a[1] " " a[2] " " a[3] " " a[4] " " a[5]
}
BEGIN { FS = ","; OFS = "," }
{
    gsub(/ +/, " ")                              # squeeze superfluous blanks
    $4 = fmt($4); $5 = fmt($5); $8 = fmt($8)     # the three date columns
    print
}
This is a little awk script that first removes superfluous blanks, then picks out the date columns (4, 5 and 8) and reformats the day part of each date string into a two-digit number.
You run the script like this:
awk -f f.awk sample_file.txt
Output:
upctm,pmdw_aud,pmdw_aud_ext_06-GAPAnalysYTD,Oct 07 2019 6:32 AM,Oct 07 2019 6:32 AM,17,17,Oct 06 2019 12:00 AM,Ended OK,3pu9v
upctm,pmdw_ddm,pmdw_ddm_dum_01-StartProjDCSDemand,Oct 07 2019 4:02 AM,Oct 07 2019 4:02 AM,3,1,Oct 06 2019 12:00 AM,Ended OK,3pqgq
upctm,pmdw_bip,pmdw_bip_mnt_35-FOLDistAutoRpt,Oct 07 2019 4:45 AM,Oct 07 2019 4:45 AM,1,1,Oct 06 2019 12:00 AM,Ended OK,3ppnc
With a fixed locale, you can make a fixed replacement like
sed -r 's/(Jan|Feb|Oct|Whatever) ([1-9]) /\1 0\2 /g' sample_file.txt
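If the sed route gets unwieldy, a rough Python sketch of the same idea (zero-pad single-digit days in the three date columns) could look like this; the pad_days.py name and the 0-based column indexes are assumptions read off the sample layout, not part of the original answers:
# pad_days.py -- zero-pad single-digit days in columns 4, 5 and 8 of the CSV
import csv
import re
import sys

def pad_day(field):
    # "Oct 7 2019 4:45 AM" -> "Oct 07 2019 4:45 AM"; "Oct 17 ..." is left alone
    return re.sub(r'^([A-Za-z]{3}) (\d) ', r'\1 0\2 ', field.strip())

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout, lineterminator='\n')
for row in reader:
    for i in (3, 4, 7):              # start, end and order date columns
        row[i] = pad_day(row[i])
    writer.writerow(row)
Run it as: python pad_days.py < sample_file.txt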

Issues with ARIMAX forecasting (auto.arima)

I'm trying to forecast an accumulated monthly time series (see data below) with the auto.arima function and exogenous regressors. I have two issues.
1) My first issue is that when I fit the model and use the forecast function to predict the second half of 2019, the forecast starts from zero, as can be seen in this forecast plot.
This only happens when I include a matrix of exogenous regressors, not when I use a single time series as regressor, as can be seen in this plot.
Why is that? My code is:
regnskab <- ts(data$Regnskab, frequency = 12, start = c(2014,1), end = c(2019,6))
budget <- ts(data$Budget, frequency = 12, start = c(2014,1), end = c(2019,6))
dagtilbud <- ts(data$Dagtilbud, frequency = 12, start = c(2014,1), end = c(2019,6))
skole <- ts(data$Skole, frequency = 12, start = c(2014,1), end = c(2019,6))
sundhed <- ts(data$Sundhed, frequency = 12, start = c(2014,1), end = c(2019,6))
miljø <- ts(data$Miljø, frequency = 12, start = c(2014,1), end = c(2019,6))
tsmatrix <- cbind(budget, dagtilbud, miljø, skole, sundhed)
fit <- auto.arima(regnskab, xreg = tsmatrix)
fcast <- forecast(fit, h = 6, xreg = tsmatrix)
autoplot(fcast)
summary(fcast)
2) My second issue is that I want a forecast for 6 months forward, but the h=6 option does not apply when including exogenous regressors. Can this be solved in any way? Again, it is not a problem without exogenous regressors.
I hope you can help and sorry for the data spamming!
A summary of my model:
> summary(fcast)
Forecast method: Regression with ARIMA(1,0,0)(1,0,0)[12] errors
Model Information:
Series: regnskab
Regression with ARIMA(1,0,0)(1,0,0)[12] errors
Coefficients:
ar1 sar1 budget dagtilbud miljø skole sundhed
0.7466 0.6693 0.0101 2.0861 0.1037 2.5240 7.7623
s.e. 0.0935 0.1042 0.0077 0.6967 1.7672 0.7535 2.6611
sigma^2 estimated as 1.884: log likelihood=-114.84
AIC=245.68 AICc=248.21 BIC=263.2
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01739231 1.297694 0.9002519 -0.1065542 0.9060671 0.3687968 -0.03222251
> regnskab
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 19.11281 36.68003 54.66383 74.93864 94.10328 113.36373 134.96638 152.75095 170.79800 189.55430 207.00803 227.82096
2015 18.90205 37.20079 55.73305 75.44689 94.74538 115.03997 136.79829 155.41164 173.69889 191.96484 210.42391 231.52982
2016 20.12939 38.51516 56.32522 78.04822 97.46681 116.58424 139.43255 157.83048 175.26727 195.06259 213.73833 234.45281
2017 20.43082 38.55219 57.50119 78.07558 97.50132 119.13735 141.71973 161.49281 180.32002 199.27769 216.92571 239.40683
2018 19.35194 37.40571 55.36897 76.33412 95.90922 117.41442 140.03545 159.10527 177.88068 194.43207 215.28905 245.85670
2019 20.85722 40.01691 59.97383 81.92719 103.15225 123.81454
> tsmatrix
budget dagtilbud miljø skole sundhed
Jan 2014 230.0605 2.616639 0.597125 3.193017 0.456470
Feb 2014 230.0605 5.025708 1.047983 6.402845 1.012468
Mar 2014 230.0605 7.548424 1.458105 9.816814 1.602384
Apr 2014 230.0605 10.350321 1.957022 13.446215 2.263646
May 2014 230.0605 12.913356 2.439587 17.100957 2.873934
Jun 2014 230.0605 15.380146 2.915020 20.791343 3.498350
Jul 2014 230.0605 17.931069 3.434464 23.701276 3.987042
Aug 2014 230.0605 20.441732 3.837721 27.319389 4.597127
Sep 2014 230.0605 22.839922 4.295486 30.859254 5.185271
Oct 2014 230.0605 25.234620 4.761740 34.350629 5.819948
Nov 2014 230.0605 27.554525 5.163576 37.688182 6.416112
Dec 2014 230.0605 30.109529 5.742699 42.095747 7.313195
Jan 2015 234.5089 2.404843 0.643976 3.185265 0.477921
Feb 2015 234.5089 5.090533 1.094641 6.654691 1.040235
Mar 2015 234.5089 7.319261 1.462134 10.168618 1.659232
Apr 2015 234.5089 10.040823 1.943120 14.082780 2.356247
May 2015 234.5089 12.470742 2.431818 17.827494 2.963360
Jun 2015 234.5089 14.846720 3.019969 21.612527 3.615607
Jul 2015 234.5089 17.543682 3.540084 24.702634 4.126374
Aug 2015 234.5089 19.786612 3.984587 28.330977 4.741392
Sep 2015 234.5089 22.037785 4.362497 31.942762 5.367815
Oct 2015 234.5089 24.347196 4.805391 35.423452 6.019133
Nov 2015 234.5089 26.751255 5.250481 38.964450 6.642436
Dec 2015 234.5089 29.276667 5.789919 43.428855 7.555361
Jan 2016 237.2361 2.538133 0.721184 3.352676 0.508847
Feb 2016 237.2361 4.906975 1.377086 6.804320 1.100914
Mar 2016 237.2361 7.184724 1.719629 10.290800 1.744743
Apr 2016 237.2361 9.895237 2.333842 14.223635 2.480869
May 2016 237.2361 12.316509 2.850905 17.957433 3.115473
Jun 2016 237.2361 14.578536 3.404785 21.759111 3.858713
Jul 2016 237.2361 17.215216 3.867858 24.949928 4.359129
Aug 2016 237.2361 19.399769 4.406750 28.503968 5.030926
Sep 2016 237.2361 21.702215 4.792190 32.112449 5.674259
Oct 2016 237.2361 24.112579 5.238401 35.625806 6.328084
Nov 2016 237.2361 26.453919 5.677270 39.158270 6.977991
Dec 2016 237.2361 28.969565 6.098136 43.558768 7.974787
Jan 2017 241.9089 2.538901 0.917354 3.488151 0.535639
Feb 2017 241.9089 4.847981 1.450172 6.857674 1.138782
Mar 2017 241.9089 7.281994 1.899543 10.394615 1.808938
Apr 2017 241.9089 10.031959 2.388542 14.335895 2.554613
May 2017 241.9089 12.411935 2.893036 18.042788 3.206503
Jun 2017 241.9089 14.982942 3.282057 22.137085 3.959622
Jul 2017 241.9089 17.567382 3.770244 25.392706 4.540047
Aug 2017 241.9089 19.738993 4.484434 29.108498 5.196528
Sep 2017 241.9089 22.273634 5.051894 32.693173 5.870257
Oct 2017 241.9089 24.636583 5.456458 36.203329 6.544383
Nov 2017 241.9089 27.259158 5.793056 39.867875 7.249982
Dec 2017 241.9089 29.831986 6.079033 44.273697 8.269454
Jan 2018 246.0944 2.467981 0.985846 3.377469 0.544258
Feb 2018 246.0944 4.877189 1.383190 6.815726 1.167431
Mar 2018 246.0944 7.367918 1.738033 10.486250 1.848972
Apr 2018 246.0944 10.148353 2.249466 14.439246 2.614913
May 2018 246.0944 12.687311 2.844656 18.194669 3.328234
Jun 2018 246.0944 15.482606 3.616200 22.433048 4.108966
Jul 2018 246.0944 17.715938 3.982451 25.305411 4.689087
Aug 2018 246.0944 20.077201 4.696088 29.018017 5.396796
Sep 2018 246.0944 22.659831 5.158706 32.860215 6.087975
Oct 2018 246.0944 24.719623 5.586616 36.143198 6.713136
Nov 2018 246.0944 27.750904 6.069519 40.237747 7.501346
Dec 2018 246.0944 30.326036 6.308786 44.733470 8.564162
Jan 2019 251.9230 2.653607 0.932776 3.501389 0.595458
Feb 2019 251.9230 5.070721 1.445741 6.991538 1.243721
Mar 2019 251.9230 7.542256 1.825956 10.737607 1.941444
Apr 2019 251.9230 10.301781 2.330015 14.647082 2.733956
May 2019 251.9230 13.193286 2.999816 18.671285 3.455616
Jun 2019 251.9230 15.423716 3.516735 22.612031 4.145206
The xreg matrix passed to the forecast function should contain the future time periods, not the historical ones. If you want h = 6, supply a matrix of 6 rows corresponding to those 6 future periods; the forecast horizon is then taken from the number of rows in xreg.
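As a hedged illustration of the same point outside R (statsmodels' SARIMAX in Python, with synthetic stand-in data rather than the question's series): the forecast horizon is driven by the number of future regressor rows you pass in.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
idx = pd.date_range("2014-01", periods=66, freq="MS")   # Jan 2014 - Jun 2019
X = pd.DataFrame(rng.normal(size=(66, 2)), index=idx, columns=["budget", "skole"])
y = pd.Series(2.0 * X["budget"] + X["skole"] + rng.normal(size=66), index=idx)

# Fit with exogenous regressors (orders chosen purely for illustration).
res = SARIMAX(y, exog=X, order=(1, 0, 0), seasonal_order=(1, 0, 0, 12)).fit(disp=False)

# Forecasting 6 steps ahead requires 6 *future* rows of the regressors,
# e.g. the budgeted values for Jul-Dec 2019 in the original question.
future_idx = pd.date_range(idx[-1] + pd.offsets.MonthBegin(), periods=6, freq="MS")
X_future = pd.DataFrame(rng.normal(size=(6, 2)), index=future_idx, columns=X.columns)
print(res.get_forecast(steps=6, exog=X_future).predicted_mean)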

rrule to get the 2nd Monday, Wednesday and Friday of the month, for every month

I am trying to create an rrule for my fullcalendar event that occurs on the 2nd Monday, Wednesday and Friday of every month.
Here is the rrule I have tried:
RRULE:FREQ=MONTHLY;COUNT=10;INTERVAL=1;WKST=SU;BYDAY=MO,WE,FR;BYSETPOS=2
events: [{
title: 'rrule event',
rrule: {
freq: RRule.MONTHLY,
count: 10,
interval: 1,
wkst: RRule.SU,
byweekday: [RRule.MO, RRule.WE, RRule.FR],
bysetpos: [2]
},
duration: '02:00',
rendering: 'inverse-background'
}
],
This is what I get
1 Fri, 03 May 2019 12:33:53 GMT
2 Wed, 05 Jun 2019 12:33:53 GMT
3 Wed, 03 Jul 2019 12:33:53 GMT
4 Mon, 05 Aug 2019 12:33:53 GMT
5 Wed, 04 Sep 2019 12:33:53 GMT
6 Fri, 04 Oct 2019 12:33:53 GMT
7 Mon, 04 Nov 2019 12:33:53 GMT
8 Wed, 04 Dec 2019 12:33:53 GMT
9 Fri, 03 Jan 2020 12:33:53 GMT
10 Wed, 05 Feb 2020 12:33:53 GMT
What is expected is:
1 Mon, 08 Apr 2019
2 Wed, 10 Apr 2019
3 Fri, 12 Apr 2019
4 Mon, 13 May 2019
5 Wed, 08 May 2019
6 Fri, 10 May 2019.........
RFC 5545, section 3.3.10. states:
Each BYDAY value can also be preceded by a positive (+n) or
negative (-n) integer. If present, this indicates the nth
occurrence of a specific day within the MONTHLY or YEARLY "RRULE".
So the rule you're looking for literally specifies the 2nd Monday (2MO), Wednesday (2WE) and Friday (2FR) of each month.
FREQ=MONTHLY;COUNT=10;BYDAY=2MO,2WE,2FR
Note that INTERVAL=1 is the default and WKST=SU is meaningless in this case, so you can just as well omit them.
By the way, your original rule says: of all the Mondays, Wednesdays and Fridays in a month, take the second instance in that month, which is why you get only one date per month.
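A quick way to sanity-check the corrected rule is Python's dateutil, which implements the same RFC 5545 nth-weekday semantics (this is only an illustration, not the fullcalendar/rrule.js configuration itself, and the April 2019 dtstart is an assumption chosen to match the expected output):
from datetime import datetime
from dateutil.rrule import rrule, MONTHLY, MO, WE, FR

# BYDAY=2MO,2WE,2FR expressed via nth-weekday objects
rule = rrule(MONTHLY, count=10, dtstart=datetime(2019, 4, 1),
             byweekday=(MO(2), WE(2), FR(2)))
for i, d in enumerate(rule, start=1):
    print(i, d.strftime("%a, %d %b %Y"))
The first three results are Mon, 08 Apr 2019, Wed, 10 Apr 2019 and Fri, 12 Apr 2019, matching the expected output above.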

Converting UTC Time to Local Time with Days of Week and Date Included

I have the following 2 columns as part of a larger data frame. The Timezone_Offset is the difference in hours for the local time (US West Coast in the data I'm looking at). In other words, UTC + Offset = Local Time.
I'm looking to convert the UTC time to the local time, while also correctly changing the day of the week and date, if necessary. For instance, here are the first 5 rows of the two columns.
UTC Timezone_Offset
Sun Apr 08 02:42:03 +0000 2012 -7
Sun Jul 01 03:27:20 +0000 2012 -7
Wed Jul 11 04:40:18 +0000 2012 -7
Sat Nov 17 01:31:36 +0000 2012 -8
Sun Apr 08 20:50:30 +0000 2012 -7
Things get tricky when the day of the week and date also have to be changed. For instance, looking at the first row, the local time should be Sat Apr 07 19:42:03 +0000 2012. In the second row, the month also has to be changed.
Sorry, I'm fairly new to R. Could someone possibly explain how to do this? Thank you so much in advance.
Parse as UTC, then apply the offset in seconds, i.e. multiplied by 60*60:
data <- read.csv(text="UTC, Timezone_Offset
Sun Apr 08 02:42:03 +0000 2012, -7
Sun Jul 01 03:27:20 +0000 2012, -7
Wed Jul 11 04:40:18 +0000 2012, -7
Sat Nov 17 01:31:36 +0000 2012, -8
Sun Apr 08 20:50:30 +0000 2012, -7", stringsAsFactors=FALSE)
data$pt <- as.POSIXct(strptime(data$UTC, "%a %b %d %H:%M:%S %z %Y", tz="UTC"))
data$local <- data$pt + data$Timezone_Offset*60*60
Result:
> data[,3:4]
pt local
1 2012-04-08 02:42:03 2012-04-07 19:42:03
2 2012-07-01 03:27:20 2012-06-30 20:27:20
3 2012-07-11 04:40:18 2012-07-10 21:40:18
4 2012-11-17 01:31:36 2012-11-16 17:31:36
5 2012-04-08 20:50:30 2012-04-08 13:50:30
>
