pyspark: create a date column from tweet timestamps - datetime

I'm working on a tweet dataframe and I want to use the timestamp column to differentiate the tweets by date; however, datetime conversion from the timestamp does not work on a column. Is there any way to do that conversion?
Thanks in advance.

datediff(Column end, Column start)
Returns the number of days from start to end.
from pyspark.sql import functions as F
# withColumn needs a name for the new column; F.col() references existing columns
df = df.withColumn("days_between", F.datediff(F.col("end_col"), F.col("start_col")))
In case you are trying to get the date instead, use one of the options below.
>>> from pyspark.sql.functions import col, date_format, to_date, from_unixtime, unix_timestamp
Using date_format:
>>> df.select(date_format(col('ts'), "yyyy-MM-dd").cast("date").alias('ts')).show(10, False)
Or using to_date (which already returns a date, so no extra cast is needed):
>>> df.select(to_date(col('ts')).alias('ts')).show(10, False)
Or using the from_unixtime and unix_timestamp functions:
>>> df.select(from_unixtime(unix_timestamp(col('ts'), "yyyy-MM-dd'T'HH:mm:ss.SSS"), "yyyy-MM-dd").cast("date").alias("ts")).show(10, False)
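Putting it together for the original question: a minimal sketch, assuming the tweet timestamps live in a string column named ts in ISO format (the sample data and column names here are hypothetical), that derives a date column and then groups the tweets by it.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# hypothetical stand-in for the tweet dataframe
df = spark.createDataFrame(
    [("2019-08-14T08:57:00.000",), ("2019-08-14T21:02:11.000",), ("2019-08-15T09:15:43.000",)],
    ["ts"],
)
# derive a proper date column from the timestamp string
df = df.withColumn("tweet_date", F.to_date("ts", "yyyy-MM-dd'T'HH:mm:ss.SSS"))
# the tweets can now be differentiated (e.g. counted) by date
df.groupBy("tweet_date").count().show()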

Related

Create a Datetime Object in R from a date column (format date) and an hour column (format integer)

I need simple code to combine a date object with a numeric hour object into a datetime, preferably using lubridate.
wstemp2001_2002$Datetime <- as.POSIXct(paste(wstemp2001_2002$Date,
                                             as.character(wstemp2001_2002$Hour2)),
                                       format = "%Y-%m-%d %H", tz = "UTC")
I found the answer myself. This is a base R version: I convert the hour object to character and use the paste() function to combine the two.
Also, when working with datetimes, always use UTC!
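Since the question asked for lubridate specifically, here is an equivalent sketch under the same assumptions (a Date column named Date and an integer hour column named Hour2, as in the snippet above):

library(lubridate)
# paste date and hour together, then let ymd_h() parse the combined string
wstemp2001_2002$Datetime <- ymd_h(paste(wstemp2001_2002$Date,
                                        wstemp2001_2002$Hour2),
                                  tz = "UTC")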

When writing a DateTime value using openxlsx, a "x" is written followed by DateTime value in the next row instead of just the DateTime value

I would like to write DateTime values to an Excel sheet using openxlsx. When I try to do this, instead of just the DateTime value, I get a lowercase "x" on one row followed by the DateTime in the subsequent row. This occurs whether I use write.xlsx or writeData. I also tried converting the DateTime using as.POSIXlt or as.POSIXct, and converting the date with and without a timezone specified, and got the same result.
The UTC DateTime values are coming from a PerkinElmer microplate reader file.
Below is a code snippet that reproduces this result. Any advice or help is appreciated, thanks!
library(openxlsx)
library(lubridate)
date <- as_datetime("2022-04-07T22:15:08+0000", tz = "America/Los_Angeles")
options(openxlsx.datetimeFormat = "yyyy-mm-dd hh:mm:ss")
write.xlsx(date,"test.xlsx",overwrite = TRUE)
The documentation of write.xlsx says in section Arguments that x is (my emphasis)
A data.frame or a (named) list of objects that can be handled by writeData() or writeDataTable() to write to file.
So apparently an atomic vector is first coerced to a data.frame, and since the data argument is named x, that becomes its column header.
This also happens when writing a named list date_list <- list(date = date). A workbook with a sheet named date is created and the data in it has a column header x.
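A sketch of a workaround following from that explanation (the column name DateTime is arbitrary): wrap the value in a data.frame with an explicit column name so the header is no longer x, or suppress the header row entirely with colNames = FALSE.

library(openxlsx)
library(lubridate)

date <- as_datetime("2022-04-07T22:15:08+0000", tz = "America/Los_Angeles")
options(openxlsx.datetimeFormat = "yyyy-mm-dd hh:mm:ss")

# name the column explicitly instead of letting write.xlsx coerce 'date' to column 'x'
write.xlsx(data.frame(DateTime = date), "test.xlsx", overwrite = TRUE)
# or omit the header row altogether
write.xlsx(data.frame(DateTime = date), "test2.xlsx", overwrite = TRUE, colNames = FALSE)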

converting character from mongolite to timestamp in R

I have a question. I am downloading some data from MongoDB and then I want to do some calculations on this data. Unfortunately I get the timestamp as a string and I don't know how to convert it back to a timestamp.
MaxDate <- con_string$find(query = '{}', sort = '{"timestamp":-1}', limit = 1)$timestamp
The code above returns the maximum date from the timestamp column, but its format is totally unusable for me:
"Aug 14 2019 8:57AM"
Any ideas how to convert it to a timestamp that R can interpret?
Here is a good link on how to modify strings to dates:
https://stats.idre.ucla.edu/r/faq/how-can-i-format-a-string-containing-a-date-into-r-date-object/
It has multiple formats you might want to compare with. For your specific example, I think this should work:
MaxDate <- as.Date(MaxDate, "%b %d %Y")
if you only want to keep the date part. If you also want the time, you can use strptime() instead (note %I, the 12-hour clock, which is required when %p matches AM/PM):
strptime(MaxDate, format = "%b %d %Y %I:%M%p")
More information about as.Date() and its formats can be found here: as.Date() help
More information about strptime (date + time) can be found here: strptime help
UPDATE: I found an R package that might help you avoid multiple conversions: timestamp conversions
Once converted this way, the string becomes a proper timestamp you can do calculations with.
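For the concrete string from the question, a minimal base R sketch of the full round trip, parsing and then a calculation (assuming an English locale so the month abbreviation "Aug" parses):

MaxDate <- "Aug 14 2019 8:57AM"
# parse to POSIXct; %I (12-hour clock) is needed because %p (AM/PM) is present
max_ts <- as.POSIXct(MaxDate, format = "%b %d %Y %I:%M%p", tz = "UTC")
# ordinary date arithmetic now works, e.g. days elapsed since that timestamp
difftime(Sys.time(), max_ts, units = "days")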

psycopg2 - acceptable date/datetime values

I'm using psycopg2 and SQLAlchemy to insert data into a Postgres DB from xls files. I've previously had issues inserting the 'date' columns, which are formatted as numbers in Excel. We have defined these columns as date type in Postgres.
I have two issues here:
1. Some of the values in the date columns are empty. Pandas converts those values to NaT or NaN, but sqlalchemy and psycopg2 are not able to parse them.
df = pd.read_excel(full_path, encoding='utf-8')
dict_items = df.to_dict(orient='records')
table = sql.Table(table_name, engine, schema='users')
connection.execute(table.insert().values(dict_items))
<class 'sqlalchemy.exc.DataError'>, DataError('(psycopg2.DataError) invalid input syntax for type timestamp: "NaT"
I have converted the numbers into Python dates via the code below, but also had to make sure the dates are not greater than the pandas Timestamp maximum, because I previously got an 'Out of Bounds' error for the timestamp:
max_date = pd.Timestamp.max
for index, row in df.iterrows():
    for col in date_cols:
        date_value = row[col]
        if not np.isnan(date_value):
            year, month, day, hour, minute, sec = xlrd.xldate_as_tuple(date_value, 0)
            py_date = "%02d.%02d.%04d" % (month, day, year)
            if py_date > str(max_date):
                df.loc[index, col] = pd.to_datetime(max_date)
            else:
                df.loc[index, col] = py_date
        if np.isnan(date_value):
            df.loc[index, col] = pd.to_datetime('01.12.2016')
2. Now I get the following error:
<class 'sqlalchemy.exc.DataError'>, DataError('(psycopg2.DataError) integer out of range\n',)<traceback object at>
Could this be related to the last line of code, where I push in '01.12.2016'? Is there some way of tracing where the problem lies?
Thanks in advance.
To fix the issues with the NaNs and NaTs, just change them to None in the dataframe; then they should be inserted without complaint. This solved the issue for me.
df = df.where(pd.notnull(df), None)
I got this solution from the Postgres message board, where they show a small example of the NaNs being changed to None.
Another alternative approach that worked for me:
import numpy as np
df = df.replace({np.nan: None})
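A minimal sketch tying the answers together, assuming SQLAlchemy 1.4+ (the connection string, file name, and table/schema names here are hypothetical): swap the NaN/NaT placeholders for None before building the insert, so psycopg2 sends SQL NULLs.

import pandas as pd
import sqlalchemy as sql

engine = sql.create_engine("postgresql+psycopg2://user:password@localhost/mydb")
df = pd.read_excel("data.xlsx")

# NaN and NaT both become None, which psycopg2 maps to SQL NULL
df = df.where(pd.notnull(df), None)

# reflect the target table and insert; engine.begin() commits on success
table = sql.Table("my_table", sql.MetaData(), autoload_with=engine, schema="users")
with engine.begin() as connection:
    connection.execute(table.insert().values(df.to_dict(orient="records")))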

Converting date and time from Excel to R with input '2016-09-25 17:13:46.030'

I have been trying to use as.POSIXct() to import a combined date & time variable from Excel into R. The format that I want to import looks like this: '2016-09-25 17:13:46.030'. I want it to look like this in R: '2016-09-25 17:13:46'. When I use the code below, I get back only NA values.
fd$AnswerValue <- as.POSIXct(as.character(fd$AnswerValue),
                             format = '%y%m%d%H%M', origin = '2011-07-15 13:00:00')
I expect this has something to do with the three additional decimals of the seconds in the original file. Anyone with advice?
A lubridate solution would be:
test <- "2016-09-25 17:13:46.030"
library(lubridate)
ymd_hms(test)
Or the base function, but longer:
as.POSIXct(as.character(test),
           format = '%Y-%m-%d %H:%M:%S', origin = '2011-07-15 13:00:00')
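If the fractional seconds should be kept rather than dropped, base R can parse them with %OS; a small sketch (digits.secs only affects how many fractional digits print):

test <- "2016-09-25 17:13:46.030"
op <- options(digits.secs = 3)   # print up to 3 fractional digits
as.POSIXct(test, format = '%Y-%m-%d %H:%M:%OS')
options(op)                      # restore the previous printing option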
