Convert string to Date SparkR - datetime

I'm reading a large dataframe using SparkR. I want to summarize using the month of a column which contains the date in character format:
head(select(df, df$booking_date))
booking_date
1 29-JUL-16
2 29-JUL-16
3 06-JUL-16
4 21-JUL-16
5 28-JUL-16
6 28-JUL-16
However, if I try to print the month:
head(select(df, month(df$booking_date)))
month(booking_date)
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
It does not return the correct value. It seems it does not understand the format but executing the following command makes the month() method very unhappy:
head(select(df, month(as.Date(df$booking_date, format = "%d/%b/%y"))))
Thus, how could I get the month from the booking_date column in order to group_by() the data?
Thanks!
Carlos

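Try the following code to get the month name. Note that collect() pulls the whole column to the driver as a local R object, which can be costly for a large DataFrame; once the data is local, base lapply() is enough:
a <- as.list(collect(select(df, df$booking_date)))
b <- lapply(a$booking_date, function(x) months(as.Date(x, format = "%d-%b-%y")))
print(b)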
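Independently of SparkR, note that the format string in the question uses slashes while the data uses hyphens. A small local illustration in base R follows; it is only a sketch on a plain character vector (the sample values are copied from the question, this is not a SparkR column), and parsing month abbreviations with %b assumes an English locale:

```r
# Sample values copied from the question; a plain local character vector,
# not a SparkR column.
booking_date <- c("29-JUL-16", "06-JUL-16", "21-JUL-16")

# The slash pattern does not match the hyphenated strings, so parsing yields NA.
wrong <- as.Date(booking_date, format = "%d/%b/%y")

# The hyphen pattern matches (assumes an English locale for %b).
right <- as.Date(booking_date, format = "%d-%b-%y")

# Month as an integer, usable as a grouping key.
month_num <- as.integer(format(right, "%m"))
print(month_num)
```

A numeric month is usually easier to group by than a month name, since it sorts in calendar order.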

Related

How to use Sys.Date() To Extract Current Year? [duplicate]

This question already has answers here:
How can I get the extract the previous year (2020) using Sys.Date()?
(2 answers)
Closed 1 year ago.
I have manually separated my dataset (discrete_8) into 2 separate datasets (data & data2). 'data' contains the data from the current year (2021), whereas 'data2' contains data from previous years. This split is hard-coded to 2021, and I want to automate the line of code so that when 2022 comes, I will not have to edit the script to change 2021 to 2022. Should I use Sys.Date() to get the current year? How would I go about incorporating Sys.Date() to partition the dataset?
Here is my code so far, where I partition the dataset:
data <- discrete_8 %>% filter(PS_DATE >= as.POSIXct("2021-01-01"))#current year
data2 <- discrete_8 %>% filter(PS_DATE < as.POSIXct("2021-01-01"))#past years
Here is what discrete_8 looks like:
X PS_DATE PS_NAME Control.Parameters.Cell.Return.Flow.Rate Control.Parameters.Harvest.Flow.Rate Control.Parameters.Microsparger.Total.Gas.Flow.Rate
1 0 2014-02-06 123 NA NA 1
2 1 2014-02-07 124 NA NA 1
3 2 2014-02-08 125 NA NA 1
4 3 2014-02-09 126 1.5 NA 1
5 4 2014-02-10 127 1.5 NA 1
6 5 2014-02-11 128 1.5 NA 1
There is a somewhat tedious bug still present in that trunc(Sys.Date(), "year") does not give you Jan 01 of the current year (it does in R-devel).
But you can build yourself a helper such as this:
> firstDay <- function() { d <- Sys.Date(); d - as.POSIXlt(d)$yday }
> firstDay()
[1] "2021-01-01"
and you can use that to compare. (Also, in the code you posted, as.Date() is simpler as you ignore hours/minutes/seconds here.)
Another option is the lubridate::floor_date() function:
lubridate::floor_date(Sys.Date(), unit = "years")
[1] "2021-01-01"
I use substr(Sys.Date(),1,4) to get the current year. In your code you can replace as.POSIXct("2021-01-01") with
as.POSIXct(paste0(substr(Sys.Date(),1,4),"-01-01"))
This gives January 1 of the current year as a POSIXct value, matching the type used in your filter.
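Putting this together with the partitioning step from the question, a minimal base R sketch could look like the following. The small discrete_8 data frame is a made-up stand-in with PS_DATE stored as Date, and the split uses plain subsetting instead of dplyr::filter():

```r
# Hypothetical stand-in for discrete_8: one old row and one row dated today.
discrete_8 <- data.frame(
  X = 0:1,
  PS_DATE = c(as.Date("2014-02-06"), Sys.Date())
)

# January 1 of whatever the current year is when the script runs.
first_day <- as.Date(format(Sys.Date(), "%Y-01-01"))

data  <- discrete_8[discrete_8$PS_DATE >= first_day, ]  # current year
data2 <- discrete_8[discrete_8$PS_DATE <  first_day, ]  # past years
```

Because first_day is recomputed on every run, the script needs no edit when the year changes.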

Format function output as data frame

I use the following to sum several measures of TABR per Julian date in 5 separate years:
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
Which produces output like:
2015 2016 2017 2018 2019
33 NA NA NA 2 NA
....
80 NA 1 NA 21 NA
81 NA 47 NA 25 NA
82 NA 12 1 9 NA
But I want to convert these results into a dataframe with 6 columns Julian + 2015-2019.
I tried:
TABR_Day<-as.data.frame(TABR_YearDay)
But that seems to not produce a fully realized df: there is no column for Julian and if I want to call an individual variable, I have to use quotes around it like:
hist(TABR_Day$"2017")
Can you help me transition the function output to a dataframe with 6 viable columns?
The Julian values are stored in the row names of the result. Syntactic column names in R cannot start with a digit, which is why you currently need the quotes. We can prefix each name with 'Year_' (giving 'Year_2015' and so on) and then construct the final data frame with Julian as its own column.
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
colnames(TABR_YearDay) <- paste0('Year_', colnames(TABR_YearDay))
TABR_Day <- data.frame(Julian = rownames(TABR_YearDay), TABR_YearDay)
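To see the effect on a small case, here is a sketch with a made-up wsmr data frame (the values are illustrative only). It also converts the row names back to integers and drops them as row names, which is an optional extra step not in the code above:

```r
# Made-up sample standing in for wsmr.
wsmr <- data.frame(
  Julian = c(80, 80, 81, 82),
  Year   = c(2016, 2018, 2016, 2017),
  TABR   = c(1, 21, 47, 1)
)

# Sum TABR per Julian day (rows) and year (columns); empty cells are NA.
TABR_YearDay <- with(wsmr, tapply(TABR, list(Julian, Year), sum))
colnames(TABR_YearDay) <- paste0("Year_", colnames(TABR_YearDay))

# Julian becomes a real integer column; row names are reset to 1..n.
TABR_Day <- data.frame(
  Julian = as.integer(rownames(TABR_YearDay)),
  TABR_YearDay,
  row.names = NULL
)

print(TABR_Day)
hist(TABR_Day$Year_2016)  # no quotes needed now
```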

Paste date in new column if condition is true in another R [duplicate]

This question already has an answer here:
Replace value using index [R]
(1 answer)
Closed 2 years ago.
I want to extract the date from a variable if the condition in another variable is true.
Example: if comorbidity1==10, extract the date from smr_01, otherwise NA
I also need to do this for the case where comorbidity1==11 OR comorbidity1==12: extract the date from smr_01, otherwise NA
This is what I want my data to look like
comorbidity1 smr_01 NewDate
1 20120607 NA
10 20120607 20120607
10 20120613 20120613
3 20121103 NA
6 20150607 NA
12 20140509 NA
11 20120405 NA
I have tried this
fulldata$NewDate<-ifelse(fulldata$comorbidity1==10, fulldata$smr_01, NA)
but it is not pasting the date in the correct format.
What I am getting looks like this:
comorbidity1 smr_01 NewDate
1 20120607 NA
10 20120607 4675
10 20120613 17856
3 20121103 NA
6 20150607 NA
12 20140509 NA
11 20120405 NA
smr_01 is classed as a date
Thank you
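ifelse() drops the Date class, so you get the underlying day counts instead of dates. Create an NA Date column first and assign by index instead: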
df$NewDate <- as.Date(NA)
inds <- df$comorbidity1 == 10
#For more than 1 value use %in%
#inds <- df$comorbidity1 %in% 10:12
df$NewDate[inds] <- df$smr_01[inds]
df
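A self-contained sketch of the difference, using a made-up three-row data frame (the column names follow the question, the values are illustrative):

```r
# Made-up sample; smr_01 is a Date column as in the question.
df <- data.frame(
  comorbidity1 = c(1, 10, 12),
  smr_01 = as.Date(c("2012-06-07", "2012-06-13", "2014-05-09"))
)

# ifelse() returns a plain numeric vector: the Date class is lost.
stripped <- ifelse(df$comorbidity1 == 10, df$smr_01, NA)
print(class(stripped))

# Index assignment into a Date column keeps the class.
df$NewDate <- as.Date(NA)
inds <- df$comorbidity1 %in% 10:12
df$NewDate[inds] <- df$smr_01[inds]
print(df)
```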

Type 'double' with column and row information in console does not appear correctly when using View()

Something strange (to me) is going on.
I have time series data collected by running the commands
data.ts = ts(1:10, frequency = 4, start = c(1959, 2))
D = decompose(data.ts)
df = D$trend
I have what I thought was a data frame (but is actually of type double), df, that, when printed in the console, looks like
> df
     Qtr1 Qtr2 Qtr3 Qtr4
1959        NA   NA    3
1960    4    5    6    7
1961    8   NA   NA
However, when using View(df), the data looks like the following below (and does not have the years or quarter information with it):
>View(df)
z
1 NA
2 NA
3 3
4 4
5 5
6 6
7 7
8 8
9 NA
10 NA
I have been trying to convert this type double (it is not a ts object) to a data frame that looks like the result I'm currently getting in the console, but as.data.frame(df) converts df to a data frame that looks like the View() output above.
What is going on exactly?
Bonus: How do I create a data frame out of df while keeping the months and years intact?
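One possible explanation, offered as a sketch rather than a definitive answer: typeof() only reports how the values are stored, while class() shows that D$trend is still a ts object, which is a numeric vector carrying time attributes. The console print uses those attributes; a plain conversion to a data frame does not. Assuming a long layout (one row per quarter) is acceptable, the year and quarter can be recovered with time() and cycle(); the column names year, quarter and trend below are my own choice:

```r
data.ts <- ts(1:10, frequency = 4, start = c(1959, 2))
trend <- decompose(data.ts)$trend

# Storage mode versus class.
print(typeof(trend))
print(class(trend))

# Long-format data frame that keeps the year and quarter of each observation.
# The tiny offset guards against floating-point error at year boundaries.
trend_df <- data.frame(
  year    = as.integer(floor(time(trend) + 1e-6)),
  quarter = as.integer(cycle(trend)),
  trend   = as.numeric(trend)
)
print(trend_df)
```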

Use dplyr to compute lagging difference

My data frame consists of three columns: state name, year, and the tax receipt for each year and each state. Below is an example for just one state.
year RealTaxRevs
1 1971 8335046
2 1972 9624026
3 1973 10498935
4 1974 10052305
5 1975 8708381
6 1976 8911262
7 1977 10759032
I'd like to compute the change in tax receipt from one year to the next, for each state. I used the following code:
data %>% group_by(state) %>% summarise(diff(RealTaxRevs, lag = 1, differences = 1))
but it gives me "Error: expecting a single value".
Could anyone explain this error message, and help me do this correctly using dplyr? Thank you.
If you want a diff-like function, consider using the zoo package as well. Your code could then look like the following:
library(zoo)
diff(as.zoo(1:4), na.pad=T)
In a data frame setting it would be like:
dat <- data.frame(a=c(8335046, 9624026, 10498935, 10052305, 8708381, 8911262, 10759032))
dat %>% mutate(b=diff(as.zoo(a), na.pad=T))
# a b
# 1 8335046 NA
# 2 9624026 1288980
# 3 10498935 874909
# 4 10052305 -446630
# 5 8708381 -1343924
# 6 8911262 202881
# 7 10759032 1847770
This way you can easily increase the number of lags without manually padding with NA:
dat %>% mutate(b2=diff(as.zoo(a), lag=2, na.pad=T))
# a b2
# 1 8335046 NA
# 2 9624026 NA
# 3 10498935 2163889
# 4 10052305 428279
# 5 8708381 -1790554
# 6 8911262 -1141043
# 7 10759032 2050651
We can use data.table
library(data.table)
setDT(data)[, Diffs := RealTaxRevs - shift(RealTaxRevs), by = state]
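As for the error itself: summarise() presumably complains because it expects one value per group, whereas diff() returns one value fewer than the group has rows. A dplyr-only sketch that keeps one row per input row uses mutate() with lag(). The two-state data frame below is made up for illustration, and Diffs is just a chosen column name:

```r
library(dplyr)

# Made-up two-state example; column names follow the question.
data <- data.frame(
  state = rep(c("A", "B"), each = 3),
  year = rep(1971:1973, times = 2),
  RealTaxRevs = c(8335046, 9624026, 10498935, 100, 150, 120)
)

result <- data %>%
  group_by(state) %>%
  arrange(year, .by_group = TRUE) %>%               # make sure years are in order
  mutate(Diffs = RealTaxRevs - lag(RealTaxRevs)) %>% # first year per state is NA
  ungroup()

print(result)
```

Sorting by year inside each group matters, because lag() simply looks at the previous row.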
