psycopg2 - acceptable date/datetime values - datetime

I'm using psycopg2 and sqlalchemy to insert data into a postgres db from xls files. I've previously had issues inserting the 'date' columns, which are formatted as numbers in Excel. We have defined these columns as date type in postgres.
I have two issues here:
1. Some of the values in the date columns are empty. Pandas converts those values to NaT or NaN, but sqlalchemy and psycopg2 are not able to parse them:
df = pd.read_excel(full_path, encoding='utf-8')
dict_items = df.to_dict(orient='records')
table = sql.Table(table_name, engine, schema='users')
connection.execute(table.insert().values(dict_items))
<class 'sqlalchemy.exc.DataError'>, DataError('(psycopg2.DataError) invalid input syntax for type timestamp: "NaT"
2. I have converted the numbers into Python dates via the code below, but also had to make sure the dates are not greater than the pandas Timestamp max, because I previously got a 'Range Out of Bounds' error for timestamp:
max_date = pd.Timestamp.max
for index, row in df.iterrows():
    for col in date_cols:
        date_value = row[col]
        if not np.isnan(date_value):
            year, month, day, hour, minute, sec = xlrd.xldate_as_tuple(date_value, 0)
            py_date = "%02d.%02d.%04d" % (month, day, year)
            if py_date > str(max_date):
                df.loc[index, col] = pd.to_datetime(max_date)
            else:
                df.loc[index, col] = py_date
        if np.isnan(date_value):
            df.loc[index, col] = pd.to_datetime('01.12.2016')
Now I get the following error:
<class 'sqlalchemy.exc.DataError'>, DataError('(psycopg2.DataError) integer out of range\n',)<traceback object at>
Could this be related to the last line of code, where I push in the 01.12.2016? Is there some way of tracing where the problem lies?
Thanks in advance.

To fix the issues with the NaNs and NaTs, just change them to None in the dataframe; they should then get inserted without complaint. This solved the issue for me.
df = df.where(pd.notnull(df), None)
I got this solution from the Postgres message board, where they show a small example of the NaNs getting changed to None.

Another alternative approach that worked for me:
import numpy as np
df = df.replace({np.nan: None})
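For completeness, here is a minimal end-to-end sketch of the NaN/NaT-to-None approach combined with the insert from the question. The connection string, file path, and table/schema names are hypothetical, and it assumes SQLAlchemy 1.4 or newer for the autoload_with reflection:
import numpy as np
import pandas as pd
import sqlalchemy as sql

# Hypothetical connection details and names, for illustration only.
engine = sql.create_engine("postgresql+psycopg2://user:password@localhost/mydb")

df = pd.read_excel("data.xls")                      # date columns may contain NaT/NaN
df = df.astype(object).where(pd.notnull(df), None)  # NaT/NaN -> None, which psycopg2 sends as SQL NULL

table = sql.Table("my_table", sql.MetaData(), autoload_with=engine, schema="users")
with engine.begin() as connection:
    connection.execute(table.insert(), df.to_dict(orient="records"))
The .astype(object) step guards against pandas coercing None back to NaN in numeric columns; with older pandas versions the plain df.where(pd.notnull(df), None) shown above is usually enough.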

Related

pyspark create column of date from tweets timestamp

I'm working on a tweet dataframe and I want to use the timestamp column to differentiate the tweets by date; however, the datetime conversion from timestamp does not work on a column. Is there any way to do that conversion?
Thanks in advance.
datediff(Column end, Column start)
Returns the number of days from start to end.
from pyspark.sql import functions as F
df = df.withColumn("days_between", F.datediff(F.col(end_col), F.col(start_col)))  # "days_between" is just a name for the new column
In case you are trying to get a date, use one of the approaches below (they assume from pyspark.sql.functions import col, date_format, to_date, from_unixtime, unix_timestamp).
Using date_format:
>>> df.select(date_format(col('ts'),"yyyy-MM-dd").alias('ts').cast("date")).show(10,False)
or using to_date
>>> df.select(to_date(col('ts')).alias('ts').cast("date")).show(10,False)
or using from_unixtime and unix_timestamp:
>>> df.select(from_unixtime(unix_timestamp(col('ts'),"yyyy-MM-dd'T'HH:mm:ss.SSS"),"yyyy-MM-dd").alias("ts").cast("date")).show(10,False)
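Putting it together for the original question (differentiating tweets by date), here is a small self-contained sketch; the column names and sample rows are made up, and it simply takes the yyyy-MM-dd prefix of the ISO timestamp so the exact format string does not matter:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tweet data with ISO-8601 string timestamps.
tweets = spark.createDataFrame(
    [("hello", "2015-08-24T00:02:03.000Z"),
     ("world", "2015-08-25T10:15:00.000Z")],
    ["text", "ts"])

# Take the first ten characters, cast them to a date, then count tweets per day.
tweets = tweets.withColumn("tweet_date", F.to_date(F.substring("ts", 1, 10)))
tweets.groupBy("tweet_date").count().show()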

Finding maximum or minimum date value for each individual

I have a dataframe in a wide format in R, denoting different visit dates for each individual (visitdate1, visitdate2, visitdate3, etc.). I'm trying to find the latest date for each individual and save it as a new column, but this doesn't seem to be working.
I checked the class of the dataframe and each visitdate is already recognized as a Date, so I don't know why the code is not working.
This is the code I tried:
df1$latestdate <- pmax(as_date(df1$visitdate1), as_date(df1$visitdate2),
as_date(df1$visitdate3))
The error I'm getting is the following:
Error in as.Date.default(x, ...) :
do not know how to convert 'x' to class “Date”
The odd thing is that I'm only asking R to find the maximum date value per row, not to convert anything (the columns are already dates).
However, even when I leave as_date out of the code, I get the error:
replacement has 0 rows, data has 120.
Any insight that might help? Thanks in advance! Btw, I'm new to R. :)
Below I provide an example, kind of guessing what your data looks like. pmax may not be the best thing for this.
DATES = seq(as.Date('2011-01-01'), as.Date('2017-01-01'), "months")
df = data.frame(id = 1:10,
                visitdate1 = sample(DATES, 10),
                visitdate2 = sample(DATES, 10),
                visitdate3 = sample(DATES, 10))
# set the columns to take the row-wise max over
COLUMNS = c("visitdate1", "visitdate2", "visitdate3")
df$latestdate = apply(df[, COLUMNS], 1, max)

How to convert a character date time to be useable using dplyr and RPostgreSQL?

I have a timestamp column, Timelocal, in my data that's formatted as follows:
2015-08-24T00:02:03.000Z
Normally, I use the following to convert this format to a date format I can use:
timestamp2 = "2015-08-24T00:02:03.000Z"
timestamp2_formatted = strptime(timestamp2,"%Y-%m-%dT%H:%M:%S",tz="UTC")
# also works for dataframes (my main use of it)
df$TimeNew = strptime(df$TimeLocal,"%Y-%m-%dT%H:%M:%S",tz="UTC")
This works fine on my machine. The problem is, I'm now working with a much bigger dataframe. It's on a Redshift cluster and I am accessing it using the RPostgreSQL package. I'm using dplyr to manipulate data as the documentation online indicates that it plays nicely with RPostgreSQL.
It does seem to, except for converting the date format. I'd like to convert the character format to a time format. Timelocal was read into Redshift as "varchar", so R is interpreting it as a character field.
I've tried the following:
library(dplyr)
library(RPostgreSQL)
library(lubridate)
try 1 - using easy dplyr syntax
mutate(elevate, timelocalnew = fast_strptime(timelocal, "%Y-%m-%dT%H:%M:%S",tz="UTC"))
try 2 - using dplyr syntax from another online reference code
elevate %>%
mutate(timelocalnew = timelocal %>% fast_strptime("%Y-%m-%dT%H:%M:%S",tz="UTC") %>% as.character()) %>%
filter(!is.na(timelocalnew))
try 3 - using strptime instead of fast_strptime
elevate %>%
mutate(timelocalnew = timelocal %>% strptime("%Y-%m-%dT%H:%M:%S",tz="UTC") %>% as.character()) %>%
filter(!is.na(timelocalnew))
I am trying to adapt code from here: http://www.markhneedham.com/blog/2014/12/08/r-dplyr-mutate-with-strptime-incompatible-sizewrong-result-size/
My attempts fail with:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "AS"
LINE 1: ...CAST(STRPTIME("timelocal", '%YSuccess2048568264T%H%M�����', 'UTC' AS "tz") A...
^
)
In addition: Warning messages:
1: In postgresqlQuickSQL(conn, statement, ...) :
Could not create executeSELECT count(*) FROM (SELECT "timelocal", "timeutc", "zipcode", "otherdata", "country", CAST(STRPTIME("timelocal", '%Y%m%dT%H%M%S', 'UTC' AS "tz") AS TEXT) AS "timelocalnew"
FROM "data") AS "master"
2: Named arguments ignored for SQL STRPTIME
It would seem that strptime is incompatible with RPostgreSQL. Is this the right interpretation? If so, does this mean there is no means of handling date formats within R if the data is on Redshift? I checked the RPostgreSQL package documentation and did not see anything related to specifying time formats.
Would appreciate any advice on getting date time columns formatted correctly with dplyr and RpostgreSQL.
Traditional R functions will not work here.
You should go with SQL translation, which has been evolving in the latest versions of dplyr and dbplyr.
The following worked for me:
library(dbplyr)
mutate(date = to_date(timestamp2, 'YYYY-MM-DD'))
Note, I am using AWS Redshift.
Does the following work?
as.Date(strptime(timelocal,format = "%YYYY/%MM/%DD %H:%M:%OS"),tz="UTC")

Error in panda from_csv datetime parsing

I have a csv file, where the first 3 columns of each row represent a date, like:
2013,1,1,... (first row)
I want to automatically convert the first three columns of each row in the csv into a python datetime object using the following code:
parseDate = lambda y, m, d: datetime.datetime(y, m, d)
df = pandas.DataFrame.from_csv(csvPath, index_col=False, header=None,
                               parse_dates=[0, 1, 2], date_parser=parseDate)
But I get an error in the date_parser part.
However, just doing
dtime = parseDate(2003,1,1)
works as expected, so my lambda expression actually seems to be correct.
Can anyone help?
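Not a definitive answer, but a sketch of what usually works for this kind of file: with pd.read_csv (pandas.DataFrame.from_csv has since been removed), parse_dates=[[0, 1, 2]] tells pandas to combine the first three columns into a single datetime column, so a custom date_parser is often not needed at all. The file path and column layout are assumptions based on the question:
import pandas as pd

# Combine columns 0, 1, 2 (year, month, day) into one datetime column.
df = pd.read_csv("data.csv", header=None, parse_dates=[[0, 1, 2]])
print(df.dtypes)  # the combined column "0_1_2" should be datetime64[ns]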

RODBC sqlQuery as.is returning bad results

I'm trying to import an Excel worksheet into R. I want to retrieve a (character) ID column and a couple of date columns from the worksheet. The following code works fine but brings one column in as a date and not the other. I think it has something to do with more of the leading cells being empty in the second date column.
dateFile <- odbcConnectExcel2007(xcelFile)
query <- "SELECT ANIMALID, ST_DATE_TIME, END_DATE_TIME FROM [KNWR_CL$]"
idsAndDates <- sqlQuery(dateFile,query)
So my plan now is to bring in the date columns as character fields and convert them myself using as.POSIXct. However, the following code produces only a single row in idsAndDates.
dateFile <- odbcConnectExcel2007(xcelFile)
query <- "SELECT ANIMALID, ST_DATE_TIME, END_DATE_TIME FROM [KNWR_CL$]"
idsAndDates <- sqlQuery(dateFile,query,as.is=TRUE,TRUE,TRUE)
What am I doing wrong?
I had to move on and ended up using the gdata library (which worked). I'd still be interested in an answer for this though.
