(be Kind, this is my first question and I did extensive Research here and on the net beforehand. Question Oracle ROWID for Sqoop Split-By Column did not really solve this issue, as the original Person asking resorted to using another column)
I am using sqoop to copy data from an Oracle 11 DB.
Unfortunately, some tables have no index, no Primary key, only partitions (date). These tables are very large, hundreds of millions if not billions of rows.
so far, I have decided to Access data in the source by explicitly adressing the partitions. That works well and Speeds up the process nicely.
I need to do the splits by data that resides in each and every table in order to avoid too many if- branches in my bash script. (we're talking some 200+ tables here)
I notice that a split by 8 Tasks results in very uneven spread of workload among the Tasks. I considered using Oracle ROWID to define the split.
To do this, I must define a boundary-query. In a Standard query 'select * from xyz' the rowid is not part of the result set. therefore, it is not an option to let Sqoop define the boundary-query from --query.
Now, when I run this, I am getting the error
ERROR tool.ImportTool: Encountered IOException running import job:
java.io.IOException: Sqoop does not have the splitter for the given SQL
data type. Please use either different split column (argument --split-by)
or lower the number of mappers to 1. Unknown SQL data type: -8
samples of ROWID :
AAJXFWAKPAAOqqKAAA
AAJXFWAKPAAOqqKAA+
AAJXFWAKPAAOqqKAA/
it is static and unique once it is created for any row.
I cast this funny datatype into something else in my boundary-query
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect
jdbc:oracle:thin:#127.0.0.1:port:mydb --username $USER --P --m 8
--split-by ROWID --boundary-query "select cast(min(ROWID) as varchar(18)), cast
( max(ROWID)as varchar(18)) from table where laufbzdt >
TO_DATE('2019-02-27', 'YYYY-MM-DD')" --query "select * from table
where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD') and \$CONDITIONS "
--null-string '\\N'
--null-non-string '\\N'
But then I get ugly ROWIDs that are rejected by Oracle:
select * from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')
and ( ROWID >= 'AAJX6oAG聕聁AE聉N:' ) AND ( ROWID < 'AAJX6oAH⁖⁁AD䁔䀷' ) ,
Error Msg = ORA-01410: invalid ROWID
how can I resolve this properly?
I am a LINUX-Embryo and have painfully chewed myself through the Topics of bash-shell-scripting and Sqooping so far, but I would like to make better use of evenly spread mapper-task workload - it would cut sqoop-time in half, I guess, saving some 5 to 8 hours.
TIA!
wahlium
You can try ROWNUM, but I think sqoop import does not work with pseudocolumn.
Related
I am using the following query to obtain the current component serial number (tr_sim_sn) installed on the host device (tr_host_sn) from the most recent record in a transaction history table (PUB.tr_hist)
SELECT tr_sim_sn FROM PUB.tr_hist
WHERE tr_trnsactn_nbr = (SELECT max(tr_trnsactn_nbr)
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_lot = '99524136'
AND tr_part = '6684112-001')
The actual table has ~190 million records. The excerpt below contains only a few sample records, and only fields relevant to the search to illustrate the query above:
tr_sim_sn |tr_host_sn* |tr_host_pn |tr_domain |tr_trnsactn_nbr |tr_qty_loc
_______________|____________|_______________|___________|________________|___________
... |
356136072015140|99524135 |6684112-000 |vattal_us |178415271 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178424418 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178628048 |1.0000000000
356136072015050|99524136 |6684112-001 |vattal_us |178628051 |-1.0000000000
356136072015836|99524137 |6684112-005 |vattal_us |178645337 |-1.0000000000
...
* = key field
The excerpt illustrates multiple occurrences of tr_trnsactn_nbr for a single value of tr_host_sn. The largest value for tr_trnsactn_nbr corresponds to the current tr_sim_sn installed within tr_host_sn.
This query works, but it is very slow, ~8minutes.
I would appreciate suggestions to improve or refactor this query to improve its speed.
Check with your admins to determine when they last updated the SQL statistics. If the answer is "we don't know" or "never" then you might want to ask them to run the following 4gl program which will create a SQL script to accomplish that:
/* genUpdateSQL.p
*
* mpro dbName -p util/genUpdateSQL.p -param "tmp/updSQLstats.sql"
*
* sqlexp -user userName -password passWord -db dnName -S servicePort -infile tmp/updSQLstats.sql -outfile tmp/updSQLtats.log
*
*/
output to value( ( if session:parameter <> "" then session:parameter else "updSQLstats.sql" )).
for each _file no-lock where _hidden = no:
put unformatted
"UPDATE TABLE STATISTICS AND INDEX STATISTICS AND ALL COLUMN STATISTICS FOR PUB."
'"' _file._file-name '"' ";"
skip
.
put unformatted "commit work;" skip.
end.
output close.
return.
This will generate a script that updates statistics for all table and all indexes. You could edit the output to only update the tables and indexes that are part of this query if you want.
Also, if the admins are nervous they could, of course, try this on a test db or a restored backup before implementing in a production environment.
I am posting this as a response to my request for an improved query.
As it turns out, the following syntax features two distinct features that greatly improved the speed of the query. One is to include tr_domain search criteria in both main and nested portions of the query. Second is to narrow the search by increasing the number of search criteria, which in the following are all included in the nested section of the syntax:
SELECT tr_sim_sn,
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_trnsactn_nbr IN (
SELECT MAX(tr_trnsactn_nbr)
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_part = '6684112-001'
AND tr_lot = '99524136'
AND tr_type = 'ISS-WO'
AND tr_qty_loc < 0)
This syntax results in ~0.5s response time. (credit to my colleague, Daniel V.)
To be fair, this query uses criteria outside the originally stated parameters that were included in the original post, making it difficult to impossible for others to attempt a reasonable answer. This omission was not on purpose of course, rather due to being fairly new to fundamentals of good query design. This query in part is a result of learning that when too-few or non-indexed fields are used as search criteria in a large table, it is sometimes helpful to narrow the search by increasing the number of search criteria items. The original had 3, this one has 5.
I have a query that works perfectly in SSMS. But when running the query in R using the DBI package, I receive several multipart identifier errors: The multi-part identifier: "rt.secondary_id" could not be bound, "rt.third_id" could not be bound, and "t2.important" could not be bound.
select t1.[main_id]
,rt.secondary_id
,rt.third_id
,t1.[date_col]
,t2.important
from t1
inner join rt on t1.main_id = rt.main_id
inner join t2 on rt.main_id = t2.main_id
inner join (select t1.main_id, max(t1.date_col) as upload_time from t1 group by t1.main_id) AS ag ON t1.main_id = ag.main_id AND t1.date_col = ag.upload_time
The unique identifier in t1 is the combination of main_id and date_col, and this query finds the most recent entry in t1 for a given main_id.
Not exactly sure if my query is structured in a poor way or this is an R issue. I've tried adding SET NOCOUNT ON to the query based on what I thought might be related issues elsewhere on stackoverflow, but no dice.
I found out what my issue was- silly (but time consuming) mistake on my part... but essentially, I was bringing my SQL query into R via paste(scan(...), collapse = " "). I had a comment in my SQL query, --, which could not be read correctly by R. Deleting the comment OR switching the comment to /* ... */ syntax fixes the problem.
I am using RODBC with R to connect to Teradata.
I am trying to copy a large table EXAMPLE (25GB) from the READ_ONLY database to the WORKdatabase. The two databases are under the same DB system so I only need one connection.
I have tried sqlQuery, sqlCopy and sqlCopyTablefunctions but do not succeed.
sqlQuery
EDIT: syntax error corrected as suggested by #dnoeth.
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH DATA;
OR
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH NO DATA;
INSERT INTO WORK.EXAMPLE SELECT * FROM READ_ONLY.EXAMPLE;
I let the latter method run for 15h but it did not complete the copy.
sqlCopy
sqlCopy(ch,
query='SELECT * FROM READ_ONLY.EXAMPLE',
destination = 'WORK.EXAMPLE')
Error: cannot allocate vector of size 155.0 Mb
Does sqlCopy try to first copy the data to R's memory before creating the new table? If so, how can I bypass this step and work exclusively on the Teradata server? Also, the error persists even if use the option fast=F.
In case R's memory was the issue, I tried creating a smaller table of 1000 rows:
sqlCopy(ch,
query='SELECT * FROM READ_ONLY.EXAMPLE SAMPLE 1000',
destination = 'WORK.EXAMPLE')
Error in sqlSave(destchannel, dataset, destination, verbose = verbose, :
[RODBC] Failed exec in Update
22018 0 [Teradata][ODBC Teradata Driver] Data is not a numeric-literal.
In addition: Warning message:
In odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
character data '2017-03-20 12:08:25' truncated to 15 bytes in column 'ExtractionTS'
With this command a table was actually created but it only includes the column names without any rows.
sqlCopyTable
sqlCopyTable(ch,
srctable = 'READ_ONLY.EXAMPLE',
desttable = 'WORK.EXAMPLE')
Error in if (as.character(keys[[4L]]) == colnames[i]) create <- paste(create, :
argument is of length zero
The syntax in your sqlQuery is not correc, the WITH DATAoption is missing:
CREATE TABLE WORK.EXAMPLE AS (SELECT * FROM READ_ONLY.EXAMPLE) WITH DATA;
Caution, this will loose all NOT NULL & CHECK constraints and all indexes, resulting in the 1st column as Non-Unique Primary Index.
Either add a PI manually or switch to
CREATE TABLE WORK.EXAMPLE AS READ_ONLY.EXAMPLE WITH DATA;
if READ_ONLY.EXAMPLE is a table and you actually want an exact copy.
Background: I am developing a rscript that pulls data from a mysql database, performs a logistic regression and then inserts the predictions back into the database. I want the entire system to be self contained in the script in case of database failure. This includes all mysql stored procedures that the script depends on to aggregate the data on the backend since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
I however get the following error that seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?
Just wanted to follow up on this string of comments. Thank you for your thoughts on this issue. I have a couple Python scripts that need to have this functionality and I began researching the same topic for Python. I found this question that indicates the answer. The question states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
I have a dev environment running sqlite 3.7.16.2 and a production environment running sqlite 3.7.9 and I am running into some unexpected backwards incompatability.
I have a table that looks like this:
sqlite> select * from calls;
ID|calldate|calltype
1|2013-10-01|monthly
1|2013-11-01|3 month
1|2013-12-01|monthly
2|2013-07-11|monthly
2|2013-08-11|monthly
2|2013-09-11|3 month
2|2013-10-11|monthly
2|2013-11-11|monthly
3|2013-04-22|monthly
3|2013-05-22|monthly
3|2013-06-22|3 month
3|2013-07-22|monthly
4|2013-10-04|monthly
4|2013-11-04|3 month
4|2013-12-04|monthly
5|2013-10-28|monthly
5|2013-11-28|monthly
With the newer version of sqlite (3.7.16.2) I can use this:
SELECT ID, MIN(calldate), calltype FROM calls WHERE calldate > date('NOW') GROUP BY ID;
which gives me:
ID|MIN(calldate)|calltype
1|2013-11-01|3 month
2|2013-11-11|monthly
4|2013-11-04|3 month
5|2013-10-28|monthly
However when I run this same code on the older version of sqlite (3.7.9) I get this:
ID|MIN(calldate)|calltype
1|2013-11-01|monthly
2|2013-11-11|monthly
4|2013-11-04|monthly
5|2013-10-28|monthly
I looked through the changes here, but could not figure out why this is still happening. Any suggestions on how to work around this or how to rewrite my query?
You are using an extension that was added in SQLite 3.7.11.
In standard SQL, it is not allowed to use columns that appear neither in the GROUP BY clause nor are wrapped in an aggregate function.
(SQLite accepts this silently for compatibility with MySQL, but returns the data from some random record in the group.)
To get other columns from a record with the minimum value, you have to search the minimum values for each group first, and then to join these with the original table:
SELECT calls.ID,
calls.calldate,
calls.calltype
FROM calls
JOIN (SELECT ID,
MIN(calldate) AS calldate
FROM calls
WHERE calldate > date('now')
GROUP BY ID
) AS earliest
ON calls.ID = earliest.ID AND
calls.calldate = earliest.calldate