RODBC Teradata Copy Table

I am using RODBC with R to connect to Teradata.
I am trying to copy a large table EXAMPLE (25 GB) from the READ_ONLY database to the WORK database. The two databases are on the same DB system, so I only need one connection.
I have tried the sqlQuery, sqlCopy and sqlCopyTable functions, but without success.
sqlQuery
EDIT: syntax error corrected as suggested by @dnoeth.
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH DATA;
OR
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH NO DATA;
INSERT INTO WORK.EXAMPLE SELECT * FROM READ_ONLY.EXAMPLE;
I let the latter method run for 15 hours, but it did not complete the copy.
sqlCopy
sqlCopy(ch,
        query = 'SELECT * FROM READ_ONLY.EXAMPLE',
        destination = 'WORK.EXAMPLE')
Error: cannot allocate vector of size 155.0 Mb
Does sqlCopy try to first copy the data into R's memory before creating the new table? If so, how can I bypass this step and work exclusively on the Teradata server? Also, the error persists even if I use the option fast=F.
In case R's memory was the issue, I tried creating a smaller table of 1000 rows:
sqlCopy(ch,
        query = 'SELECT * FROM READ_ONLY.EXAMPLE SAMPLE 1000',
        destination = 'WORK.EXAMPLE')
Error in sqlSave(destchannel, dataset, destination, verbose = verbose, :
[RODBC] Failed exec in Update
22018 0 [Teradata][ODBC Teradata Driver] Data is not a numeric-literal.
In addition: Warning message:
In odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
character data '2017-03-20 12:08:25' truncated to 15 bytes in column 'ExtractionTS'
With this command a table was actually created, but it contains only the column names and no rows.
sqlCopyTable
sqlCopyTable(ch,
             srctable = 'READ_ONLY.EXAMPLE',
             desttable = 'WORK.EXAMPLE')
Error in if (as.character(keys[[4L]]) == colnames[i]) create <- paste(create, :
argument is of length zero

The syntax in your sqlQuery is not correct; the WITH DATA option is missing:
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH DATA;
Caution: this will lose all NOT NULL and CHECK constraints and all indexes, resulting in the first column becoming a Non-Unique Primary Index.
Either add a PI manually or switch to
CREATE TABLE WORK.EXAMPLE AS READ_ONLY.EXAMPLE WITH DATA;
if READ_ONLY.EXAMPLE is a table and you actually want an exact copy.
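To run this from R without pulling any rows into the client, the statement can be sent through the existing RODBC channel. Here is a minimal sketch, assuming ch is the open connection from the question:
library(RODBC)

# ch is assumed to be an already open RODBC channel to the Teradata system.
# sqlQuery() only ships the SQL text to the server, so the 25 GB copy runs
# entirely on Teradata and nothing is loaded into R's memory.
sqlQuery(ch, "CREATE TABLE WORK.EXAMPLE AS READ_ONLY.EXAMPLE WITH DATA")
On success this returns a zero-length result; if Teradata rejects the statement, the error text comes back as a character vector.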

Related

Fail to access files in ADLS Gen 2 with ADX External table "Virtual columns"

I have a simple folder tree in Azure Data Lake Gen 2 that is partitioned by date with the following standard folder structure: {yyyy}/{MM}/{dd}. e.g. /Container/folder1/sub_folder/2020/11/01
In each leaf folder, I have some CSV files with a few columns but without a timestamp (as the date is already embedded in the folder name).
I am trying to create an ADX external table that will include a virtual column of the date, and then query the data in ADX by date (this is a well-known pattern in Hive and Big data in general).
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = ("/date=" datetime_pattern("year={yyyy}/month={MM}/day={dd}", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder/;{key}'
)
with (includeHeaders = 'All')
Unfortunately, querying the table fails, and .show external table artifacts returns an empty list.
external_table("Table Name")
| take 10
.show external table Walmart_2141_OEE artifacts
with the following exception:
Query execution has resulted in error (0x80070057): Partial query failure: The parameter is incorrect. (message: 'path2
Parameter name: Argument 'path2' failed to satisfy condition 'Can't append a full path': at Concat in C:\source\Src\Common\Kusto.Cloud.Platform\Utils\UriPath.cs: line 25:
I tried many variants of pathformat and datetime_pattern as described in the documentation, but nothing worked.
Any ideas?
According to your description the following definition should work:
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = (datetime_pattern("yyyy/MM/dd", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder;{key}'
)
with (includeHeaders = 'All')

R delete rows from data table using sqldf

I am wondering if R does not support using sqldf to delete rows from a data table. My data is in a data.table, and I am trying to delete rows from it using a DELETE statement; there is no underlying database, just the data.table. But when I enter the following SQL statement:
loans_good <- sqldf("Delete from LoansDT1 where status not in ('Current','Default')")
I get the following error message:
'SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().'
Since I get the same message for UPDATE, I am wondering if this is a limitation.
This question is a FAQ. See FAQ 8 on the sqldf GitHub home page.
The operation did work. The message is a warning message, not an error message; it is misleading and you can ignore it. Note that the question did not show the complete message -- the complete message does state that it is a warning.
The warning comes from RSQLite, not from sqldf itself. It is caused by a non-backwards-compatible change that was introduced into RSQLite at some point; however, as stated, the actual operation works anyway.
Also, delete and update act on tables in the database. They do not return values, so even if they work you won't see any result. If you want a result, you have to run a select statement after the delete or update to retrieve the modified table.
Here is an example using the built-in 6 row BOD data.frame. It deletes the last row as that row has a Time greater than 5.
library(sqldf)
sqldf(c("delete from BOD where Time > 5", "select * from BOD"))
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## Warning message:
## In result_fetch(res@ptr, n = n) :
## SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
Note that this is listed in the sqldf issues where a workaround for the message is provided: https://github.com/ggrothendieck/sqldf/issues/40
You need to use dbExecute() to perform delete, update or insert queries.
conn <- dbConnect("Put your connection to your database here")
dbExecute(
  conn,
  "Delete from LoansDT1 where status not in ('Current','Default')"
)
dbReadTable(conn, "LoansDT1")  # Check
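For a self-contained illustration of the dbExecute() approach, here is a minimal sketch using an in-memory SQLite database and a toy stand-in for LoansDT1 (the real connection and data will of course differ):
library(DBI)
library(RSQLite)

# Toy stand-in for LoansDT1; the real table and connection will differ.
LoansDT1 <- data.frame(id     = 1:4,
                       status = c("Current", "Late", "Default", "Charged Off"))

conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "LoansDT1", LoansDT1)

# dbExecute() is intended for statements that return no result set,
# so no warning is raised.
dbExecute(conn, "DELETE FROM LoansDT1 WHERE status NOT IN ('Current','Default')")

dbReadTable(conn, "LoansDT1")  # only the 'Current' and 'Default' rows remain
dbDisconnect(conn)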

Can we use JDBC to write data from postgresql to Spark?

I am trying to load my PostgreSQL tables into Spark.
I have successfully read a table from PostgreSQL into Spark using JDBC.
I have code written in R that I want to use on the table, but I cannot access the data in R.
I am using the following code to connect:
val pgDF_table = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://10.128.0.4:5432/sparkDB")
  .option("dbtable", "survey_results")
  .option("user", "prashant")
  .option("password", "pandey")
  .load()
pgDF_table.show
Is there any option like spark.write?
In SparkR, you can read data from JDBC using the following code:
read.jdbc(url, tableName, partitionColumn = NULL, lowerBound = NULL,
upperBound = NULL, numPartitions = 0L, predicates = list(), ...)
Arguments
`url`: JDBC database URL of the form 'jdbc:subprotocol:subname'
`tableName`: the name of the table in the external database
`partitionColumn`: the name of a column of integral type that will be used for partitioning
`lowerBound`: the minimum value of 'partitionColumn' used to decide partition stride
`upperBound`: the maximum value of 'partitionColumn' used to decide partition stride
`numPartitions`: the number of partitions. This, along with 'lowerBound' (inclusive) and 'upperBound' (exclusive), forms partition strides for the generated WHERE clause expressions used to split the column 'partitionColumn' evenly. This defaults to SparkContext.defaultParallelism when unset.
`predicates`: a list of conditions in the WHERE clause; each one defines one partition
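For example, the table from the question could be read in SparkR roughly like this; the URL, credentials and driver class are taken from the question, and the classpath setup is an assumption:
library(SparkR)

# Assumes the PostgreSQL JDBC driver jar is already on the Spark classpath
# (e.g. passed via spark-submit --jars or the spark.jars configuration).
sparkR.session()

url <- "jdbc:postgresql://10.128.0.4:5432/sparkDB"   # URL from the question

survey_df <- read.jdbc(url, "survey_results",
                       user = "prashant", password = "pandey",
                       driver = "org.postgresql.Driver")
head(survey_df)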
Data can be written to JDBC using the following code:
write.jdbc(x, url, tableName, mode = "error", ...)
Arguments
`x`: a SparkDataFrame.
`url`: JDBC database url of the form jdbc:subprotocol:subname.
`tableName`: the name of the table in the external database.
`mode`: one of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default).
`...`: additional JDBC database connection properties.
The JDBC driver must be on the Spark classpath.
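Continuing the read sketch above, writing the SparkDataFrame back over JDBC could look like this; "survey_results_copy" is a hypothetical target table name:
# Write the SparkDataFrame back to PostgreSQL over JDBC.
# "survey_results_copy" is a hypothetical target table; mode = "overwrite"
# replaces it if it already exists.
write.jdbc(survey_df, url, "survey_results_copy",
           mode = "overwrite",
           user = "prashant", password = "pandey",
           driver = "org.postgresql.Driver")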

Sqoop trying to --split-by ROWID (Oracle) fails

(Be kind, this is my first question, and I did extensive research here and on the net beforehand. The question Oracle ROWID for Sqoop Split-By Column did not really solve this issue, as the original person asking resorted to using another column.)
I am using sqoop to copy data from an Oracle 11 DB.
Unfortunately, some tables have no index and no primary key, only partitions (date). These tables are very large, hundreds of millions if not billions of rows.
So far, I have decided to access data in the source by explicitly addressing the partitions. That works well and speeds up the process nicely.
I need to do the splits by data that resides in each and every table, in order to avoid too many if-branches in my bash script (we're talking some 200+ tables here).
I notice that a split into 8 tasks results in a very uneven spread of workload among the tasks. I considered using the Oracle ROWID to define the split.
To do this, I must define a boundary query. In a standard query 'select * from xyz' the ROWID is not part of the result set; therefore, it is not an option to let Sqoop derive the boundary query from --query.
Now, when I run this, I am getting the error
ERROR tool.ImportTool: Encountered IOException running import job:
java.io.IOException: Sqoop does not have the splitter for the given SQL
data type. Please use either different split column (argument --split-by)
or lower the number of mappers to 1. Unknown SQL data type: -8
Samples of ROWID:
AAJXFWAKPAAOqqKAAA
AAJXFWAKPAAOqqKAA+
AAJXFWAKPAAOqqKAA/
It is static and unique once it is created for any row.
I cast this funny datatype into something else in my boundary query:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:oracle:thin:@127.0.0.1:port:mydb --username $USER --P --m 8 \
  --split-by ROWID \
  --boundary-query "select cast(min(ROWID) as varchar(18)), cast(max(ROWID) as varchar(18)) from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')" \
  --query "select * from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD') and \$CONDITIONS" \
  --null-string '\\N' \
  --null-non-string '\\N'
But then I get ugly ROWIDs that are rejected by Oracle:
select * from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')
and ( ROWID >= 'AAJX6oAG聕聁AE聉N:' ) AND ( ROWID < 'AAJX6oAH⁖⁁AD䁔䀷' ) ,
Error Msg = ORA-01410: invalid ROWID
How can I resolve this properly?
I am a Linux embryo and have painfully chewed myself through the topics of bash shell scripting and Sqooping so far, but I would like to make better use of an evenly spread mapper-task workload - it would cut the sqoop time in half, I guess, saving some 5 to 8 hours.
TIA!
wahlium
You can try ROWNUM, but I think sqoop import does not work with pseudocolumns.

Create a stored procedure using RMySQL

Background: I am developing an R script that pulls data from a MySQL database, performs a logistic regression and then inserts the predictions back into the database. I want the entire system to be self-contained in the script in case of database failure. This includes all MySQL stored procedures that the script depends on to aggregate the data on the backend, since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
However, I get the following error, which seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?
Just wanted to follow up on this string of comments. Thank you for your thoughts on this issue. I have a couple of Python scripts that need this functionality, and I began researching the same topic for Python. I found this question, which indicates the answer. The question states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
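If the DROP PROCEDURE step from the original query is still needed, the statements can simply be sent one at a time, since each is complete on its own and no DELIMITER change is required. A sketch, assuming the same `connection` group variable as in the question:
library(DBI)
library(RMySQL)

mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)

# Send each statement separately; no DELIMITER is needed because the API
# submits one complete statement per call.
statements <- c(
  "DROP PROCEDURE IF EXISTS Test.Tester",
  "CREATE PROCEDURE Test.Tester()
   BEGIN
   /***DO DATA AGGREGATION***/
   END"
)

for (s in statements) {
  res <- dbSendQuery(connect, s)
  dbClearResult(res)
}

dbDisconnect(connect)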
