I am trying to load my PostgreSQL tables into Spark.
I have successfully read a table from PostgreSQL into Spark using JDBC.
I have code written in R that I want to run on that table, but I cannot access the data in R.
I am using the following code to connect:
val pgDF_table = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
  .option("dbtable", "survey_results")
  .option("user", "prashant")
  .option("password", "pandey")
  .load()

pgDF_table.show
Is there an equivalent write option, something like spark.write?
In SparkR, you can read data from a JDBC source with read.jdbc:
read.jdbc(url, tableName, partitionColumn = NULL, lowerBound = NULL,
          upperBound = NULL, numPartitions = 0L, predicates = list(), ...)
Arguments
`url`: JDBC database URL of the form 'jdbc:subprotocol:subname'
`tableName`: the name of the table in the external database
`partitionColumn`: the name of a column of integral type that will be used for partitioning
`lowerBound`: the minimum value of `partitionColumn` used to decide partition stride
`upperBound`: the maximum value of `partitionColumn` used to decide partition stride
`numPartitions`: the number of partitions. This, along with `lowerBound` (inclusive) and `upperBound` (exclusive), forms partition strides for the generated WHERE clause expressions used to split the column `partitionColumn` evenly. Defaults to SparkContext.defaultParallelism when unset.
`predicates`: a list of conditions in the WHERE clause; each one defines one partition
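For example, reading the survey_results table from the PostgreSQL instance in the question could look like the following sketch (the connection details are taken from the question; user, password and driver are passed through ... as JDBC connection properties):
library(SparkR)
sparkR.session()

# Read the PostgreSQL table into a SparkDataFrame over JDBC.
# user, password and driver are forwarded as JDBC connection properties.
pg_df <- read.jdbc("jdbc:postgresql://10.128.0.4:5432/sparkDB",
                   "survey_results",
                   user = "prashant",
                   password = "pandey",
                   driver = "org.postgresql.Driver")
head(pg_df)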
Data can be written back over JDBC with write.jdbc:
write.jdbc(x, url, tableName, mode = "error", ...)
Arguments
`x`: a SparkDataFrame.
`url`: JDBC database url of the form jdbc:subprotocol:subname.
`tableName`: the name of the table in the external database.
`mode`: one of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default).
`...`: additional JDBC database connection properties.
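Writing a SparkDataFrame back to the same PostgreSQL database might look like this (a sketch only; the target table name new_survey_results is illustrative, not from the question):
# Write the SparkDataFrame back to PostgreSQL over JDBC.
# "new_survey_results" is an illustrative target table name.
write.jdbc(pg_df,
           "jdbc:postgresql://10.128.0.4:5432/sparkDB",
           "new_survey_results",
           mode = "append",
           user = "prashant",
           password = "pandey",
           driver = "org.postgresql.Driver")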
The JDBC driver must be on the Spark classpath.
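One way to do that when starting Spark from R is to hand the driver jar to the SparkR session (the jar path below is a placeholder for wherever you downloaded the PostgreSQL driver):
library(SparkR)

# Put the PostgreSQL JDBC driver on the Spark classpath when creating the session.
# Replace the path with the actual location of the driver jar.
sparkR.session(sparkJars = "/path/to/postgresql-42.2.5.jar")
Alternatively, pass the jar with --jars when launching spark-submit or the sparkR shell.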
I have a simple folder tree in Azure Data Lake Gen 2 that is partitioned by date with the following standard folder structure: {yyyy}/{MM}/{dd}. e.g. /Container/folder1/sub_folder/2020/11/01
In each leaf folder, I have some CSV files with a few columns but no timestamp column (the date is already embedded in the folder path).
I am trying to create an ADX external table that includes a virtual column for the date, and then query the data in ADX by date (this is a well-known pattern in Hive and Big Data in general).
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = ("/date=" datetime_pattern("year={yyyy}/month={MM}/day={dd}", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder/;{key}'
)
with (includeHeaders = 'All')
Unfortunately, querying the table fails, and showing the external table artifacts returns an empty list.
external_table("Table Name")
| take 10
.show external table Walmart_2141_OEE artifacts
with the following exception:
Query execution has resulted in error (0x80070057): Partial query failure: The parameter is incorrect. (message: 'path2
Parameter name: Argument 'path2' failed to satisfy condition 'Can't append a full path': at Concat in C:\source\Src\Common\Kusto.Cloud.Platform\Utils\UriPath.cs: line 25:
I tried many combinations of pathformat and datetime_pattern as described in the documentation, but nothing worked.
Any ideas?
According to your description, the following definition should work:
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = (datetime_pattern("yyyy/MM/dd", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder;{key}'
)
with (includeHeaders = 'All')
I have applied an AES encryption function to data in a Redshift table. Now I am trying to revert the data using the aes_decrypt function, but I cannot update the data or insert it into another table.
It returns an error like "error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 4: ordinal not in range(128). Please look at svl_udf_log for more information". How can this be resolved?
I am currently investigating the use of the Always Encrypted feature of Microsoft SQL Server. I'm trying to simply store a blob object in a column-encrypted table ('randomised') using pyodbc. While the code works perfectly fine for inserting arbitrary binary objects into non-encrypted columns, it fails when running the same code against a column that is encrypted. Even stranger, it works for non-image files, but whenever I try to upload a PDF, JPEG, PNG or similar, it fails.
The code looks like this:
import pyodbc
server = 'tcp:XXXXX-XXXXXX-XXXXX-XXXXX-XXXXX.windows.net,1433'
database = 'db-encryption'
username = 'XXXXXX@dbs-always-encrypted'
password = 'XXXXXXXXX'
connection_string = [
'DRIVER={ODBC Driver 17 for SQL Server}',
'Server={}'.format(server),
'Database={}'.format(database),
'UID={}'.format(username),
'PWD={}'.format(password),
'Encrypt=yes',
'TrustedConnection=yes',
'ColumnEncryption=Enabled',
'KeyStoreAuthentication=KeyVaultClientSecret',
'KeyStorePrincipalId=XXXXX-XXXXXX-XXXXX-XXXXX-XXXXX',
'KeyStoreSecret=XXXXX-XXXXXX-XXXXX-XXXXX-XXXXX'
]
cnxn = pyodbc.connect( ';'.join(connection_string) )
cursor = cnxn.cursor()
insert = 'insert into Blob (Data) values (?)'
files = ['Text.txt', 'SimplePDF.pdf']
for file in files:
    # without hex encode
    bindata = None
    with open(file, 'rb') as f:
        bindata = pyodbc.Binary(f.read())

    # insert binary
    cursor.execute(insert, bindata)
    cnxn.commit()
The error message I receive when running the code against the encrypted 'Data' column (VARBINARY(MAX)) is the following:
pyodbc.DataError: ('22018', "[22018] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Operand type clash: image is incompatible with varbinary(max) encrypted with (encryption_type = 'RANDOMIZED', encryption_algorithm_name = 'AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'CEK_Auto1', column_encryption_key_database_name = 'db-encryption') (206) (SQLExecDirectW)")
It seems like the driver reads the bytes, decides it is a 'known type', and treats the data as 'image'.
Is there any way I can prevent this from happening? I simply want to store arbitrary byte objects in that column.
It might be late, but the issue is with your driver. You must install the ODBC Driver 17, or use {ODBC Driver 13 for SQL Server}; you can also try {SQL Server}.
Download the driver from here
I am using RODBC with R to connect to Teradata.
I am trying to copy a large table EXAMPLE (25 GB) from the READ_ONLY database to the WORK database. The two databases are under the same DB system, so I only need one connection.
I have tried the sqlQuery, sqlCopy and sqlCopyTable functions, but without success.
sqlQuery
EDIT: syntax error corrected as suggested by @dnoeth.
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH DATA;
OR
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH NO DATA;
INSERT INTO WORK.EXAMPLE SELECT * FROM READ_ONLY.EXAMPLE;
I let the latter method run for 15h but it did not complete the copy.
sqlCopy
sqlCopy(ch,
        query = 'SELECT * FROM READ_ONLY.EXAMPLE',
        destination = 'WORK.EXAMPLE')
Error: cannot allocate vector of size 155.0 Mb
Does sqlCopy try to first copy the data into R's memory before creating the new table? If so, how can I bypass this step and work exclusively on the Teradata server? Also, the error persists even if I use the option fast=F.
In case R's memory was the issue, I tried creating a smaller table of 1000 rows:
sqlCopy(ch,
        query = 'SELECT * FROM READ_ONLY.EXAMPLE SAMPLE 1000',
        destination = 'WORK.EXAMPLE')
Error in sqlSave(destchannel, dataset, destination, verbose = verbose, :
[RODBC] Failed exec in Update
22018 0 [Teradata][ODBC Teradata Driver] Data is not a numeric-literal.
In addition: Warning message:
In odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
character data '2017-03-20 12:08:25' truncated to 15 bytes in column 'ExtractionTS'
With this command a table was actually created but it only includes the column names without any rows.
sqlCopyTable
sqlCopyTable(ch,
             srctable = 'READ_ONLY.EXAMPLE',
             desttable = 'WORK.EXAMPLE')
Error in if (as.character(keys[[4L]]) == colnames[i]) create <- paste(create, :
argument is of length zero
The syntax in your sqlQuery is not correct; the WITH DATA option is missing:
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH DATA;
Caution: this will lose all NOT NULL and CHECK constraints and all indexes, resulting in the first column becoming a Non-Unique Primary Index.
Either add a PI manually or switch to
CREATE TABLE WORK.EXAMPLE AS READ_ONLY.EXAMPLE WITH DATA;
if READ_ONLY.EXAMPLE is a table and you actually want an exact copy.
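To run this copy entirely on the Teradata server from R, so that no rows pass through R's memory, a minimal sketch using the existing RODBC channel ch would be:
library(RODBC)

# Send the CREATE TABLE ... AS ... statement to Teradata.
# The copy happens server-side; nothing is fetched into R.
sqlQuery(ch, "CREATE TABLE WORK.EXAMPLE AS READ_ONLY.EXAMPLE WITH DATA")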
Background: I am developing an R script that pulls data from a MySQL database, performs a logistic regression, and then inserts the predictions back into the database. I want the entire system to be self-contained in the script in case of database failure. This includes all MySQL stored procedures that the script depends on to aggregate the data on the backend, since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
However, I get the following error, which seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?
Just wanted to follow up on this string of comments. Thank you for your thoughts on this issue. I have a couple of Python scripts that need the same functionality, so I began researching the same topic for Python. I found this question, which indicates the answer. The question states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
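If you also need the DROP PROCEDURE IF EXISTS step, a simple option (a sketch, not part of the quoted answer) is to send it as its own statement before the CREATE, since RMySQL does not execute multiple semicolon-separated statements in a single dbSendQuery by default:
# Send each statement separately instead of one multi-statement string.
dbSendQuery(connect, "DROP PROCEDURE IF EXISTS Test.Tester")
dbSendQuery(connect, "CREATE PROCEDURE Test.Tester() BEGIN /***DO DATA AGGREGATION***/ END")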