Fetch 4-5 million records from Postgres and process them in Java (spring-jdbc)

I am currently using JdbcTemplate to fetch the data and then iterating the ResultSet to store the rows in an ArrayList. However, as a result of this I get the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded. Is there any way I could fetch and process one batch of records, then fetch the next batch, and so on?
P.S. I am already calling jdbcTemplate.setFetchSize(fetchSize), but it doesn't help. Also, I cannot use pagination. Lastly, I have set -Xmx to 2 GB, but it still fails.
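For context, a minimal sketch of what a row-at-a-time version of this setup could look like with plain spring-jdbc: rows are handed to a RowCallbackHandler instead of being collected into an ArrayList, and the query runs inside a transaction because the PostgreSQL driver only honours the fetch size when autocommit is off. The table, columns and processRow method below are hypothetical.

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;
import org.springframework.transaction.support.TransactionTemplate;

public class BigTableStreamer {

    private final JdbcTemplate jdbcTemplate;
    private final TransactionTemplate tx;

    public BigTableStreamer(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        this.jdbcTemplate.setFetchSize(5_000); // rows per round trip, not all 4-5 million at once
        this.tx = new TransactionTemplate(new DataSourceTransactionManager(dataSource));
    }

    public void streamAll() {
        // Each row is processed and discarded; nothing is accumulated in memory.
        RowCallbackHandler handler = rs -> processRow(rs.getLong("id"), rs.getString("payload"));
        // Run inside a transaction so the PostgreSQL driver streams with a cursor
        // instead of loading the whole result set.
        tx.execute(status -> {
            jdbcTemplate.query("SELECT id, payload FROM big_table", handler);
            return null;
        });
    }

    private void processRow(long id, String payload) {
        // hypothetical per-row processing
    }
}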

Related

Cosmos DB Emulator hangs when pumping continuation token, segmented query

I have just added a new feature to an app I'm building. It uses the same working Cosmos/Table storage code that other features use to query and pump result segments from the Cosmos DB Emulator via the Tables API.
The emulator is running with:
/EnableTableEndpoint /PartitionCount=50
This is because I read that the emulator defaults to 5 unlimited containers and/or 25 limited containers, and since this is a Tables API app, the table containers are created as unlimited.
The table being queried is the 6th to be created and contains just 1 document.
It either takes around 30 seconds to run a simple query, "tripping" my Too Many Requests error handling/retry in the process, or it hangs seemingly forever with no results returned, and the emulator has to be shut down.
My understanding is that with 50 partitions I can create 10 unlimited tables/collections, since each is "worth" 5. See the documentation.
I have tried with rate limiting on and off, and jacked the RU/s to 10,000 on the table. It always fails to query this one table. The data, including the files on disk, has been cleared many times.
It seems like a bug in the emulator. Note that the "Sorry..." error that I would expect to see upon creation of the 6th unlimited table, as per the docs, is never encountered.
After switching to a real Cosmos DB instance on Azure, this is looking like a problem with my dodgy code.
Confirmed: my dodgy code.
Stand down everyone. As you were.

WSO2 ESB 6.1.0 Batch Processing

I have a requirement to process 10 million records in MS SQL database using WSO2 ESB.
Input file can be XML or Flat file.
I have created a dataservice provided in WSO2 ESB.
Now, when I start the process to read from the XML and insert into the MS SQL database, I want to commit every 5000 records so that if record 5001 fails, I can restart processing from record 5001 instead of from 0.
First problem: the commit happens for all records at once. I want to configure it so that it processes 5000 records, commits to the database, and then proceeds with the next set of records. Additionally, if the batch job fails after processing 10000 records, I want it to resume from record 10001 rather than from 0.
Please suggest ideas.
Thanks,
Abhishek
This is a more or less common pattern. Create an agent/process that continuously reads from an IPC buffer (memory or file).
The ESB endpoint simply writes into the buffer.
The agent is responsible for retrying and/or notifying asynchronously if it ultimately cannot commit.
What you can do is write the start and end record numbers to a file kept on the ESB. When the schedule starts, it picks up the last committed record number from the file (in your case 5000) and processes that batch through DSS. If the DSS response is successful, you increment the record number and update the file, in this case to 10000. If the next batch is not successful, the file still says 10000; once you find the root cause of the failure, fix it and run the schedule again, it picks up from 10000 and, on success, writes 15000 to the file. This continues until the end condition is met. A rough sketch of this checkpoint-file idea follows.
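To make the checkpoint-file idea above concrete, here is a minimal, hypothetical Java sketch. It is not WSO2-specific, and the file path, batch size, total record count and processBatch call are illustrative assumptions only.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointedBatchLoader {

    private static final int BATCH_SIZE = 5_000;                      // commit every 5000 records
    private static final Path CHECKPOINT = Path.of("checkpoint.txt"); // hypothetical location

    public static void main(String[] args) throws IOException {
        long totalRecords = 10_000_000L;          // end condition from the question
        long start = readCheckpoint();            // resume from the last committed record

        while (start < totalRecords) {
            long end = Math.min(start + BATCH_SIZE, totalRecords);
            boolean ok = processBatch(start, end);                    // e.g. call the DSS data service
            if (!ok) {
                // Leave the checkpoint untouched; after the root cause is fixed,
                // the next run resumes from the same record.
                throw new IllegalStateException("Batch " + start + "-" + end + " failed");
            }
            writeCheckpoint(end);                 // only advance after a successful commit
            start = end;
        }
    }

    private static long readCheckpoint() throws IOException {
        return Files.exists(CHECKPOINT)
                ? Long.parseLong(Files.readString(CHECKPOINT, StandardCharsets.UTF_8).trim())
                : 0L;
    }

    private static void writeCheckpoint(long record) throws IOException {
        Files.writeString(CHECKPOINT, Long.toString(record), StandardCharsets.UTF_8);
    }

    private static boolean processBatch(long start, long end) {
        // Hypothetical: insert the records for this range, return true on success.
        return true;
    }
}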

How to find out the process description from a process id on Redshift?

I'm trying to debug a deadlock on Redshift:
SQL Execution failed ... deadlock detected
DETAIL: Process 7679 waits for AccessExclusiveLock on relation 307602 of database 108260; blocked by process 7706.
Process 7706 waits for AccessShareLock on relation 307569 of database 108260; blocked by process 7679.
Is there a sql query to get a description for process ids 7679 and 7706?
select * from stl_query where pid=XXX
This will give you the query text, which will help you identify your query.
You can also query stv_locks to check if there are any current updates in the database, and stl_tr_conflict will display all the lock conflicts on the table.

Spring JDBC: Oracle transaction errors out after 120 seconds

For a particular requirement, I will have to iterate through a list of 50000 records and insert them into the database. The requirement is that if any one of the 50000 records fails, all the other records should be rolled back, and hence we did not commit anywhere during the processing. But this resulted in the following error:
[2/1/16 14:01:47:939 CST] 000000be SystemOut O ERROR
org.springframework.jdbc.UncategorizedSQLException:
PreparedStatementCallback; uncategorized SQLException for SQL [INSERT
INTO ...) VALUES (...)]; SQL state [null]; error code [0]; Current
thread has not commited in more than [120] seconds and may incur
unwanted blocking locks. Please refactor code to commit more
frequently.
Now that we have implemented batching (we use the PreparedStatement.executeBatch() method to insert data in batches of 100), the above error doesn't arise. autoCommit is true by default for the batching, so a commit happens after every batch execution.
Could anyone suggest how we can handle the rollback mechanism in the above case? If the 50th batch execution fails, we want all of the previous 49 batch executions to be reverted. We are using Spring Data/JDBC, an Oracle 11g database and WebSphere application server. I have read somewhere that the above 120-second commit timeout can also be configured in the JTA settings of WebSphere. Is that so? Please suggest any alternatives or other possible solutions.
Thank you in advance.
You must set autocommit to false and only commit at the end if all your batches executed successfully.
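A minimal sketch of that approach with plain JDBC, keeping the batches of 100 from the question but committing only once at the end. The SQL, column bindings and row list are hypothetical.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class AllOrNothingBatchInsert {

    private static final int BATCH_SIZE = 100;
    // Hypothetical statement; replace with the real INSERT used by the job.
    private static final String SQL = "INSERT INTO target_table (id, name) VALUES (?, ?)";

    public void insertAll(DataSource dataSource, List<Object[]> rows) throws SQLException {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);                      // one transaction for all 50000 rows
            try (PreparedStatement ps = con.prepareStatement(SQL)) {
                int buffered = 0;
                for (Object[] row : rows) {
                    ps.setObject(1, row[0]);
                    ps.setObject(2, row[1]);
                    ps.addBatch();
                    if (++buffered % BATCH_SIZE == 0) {
                        ps.executeBatch();                 // send 100 rows, but do NOT commit yet
                    }
                }
                ps.executeBatch();                         // flush the final partial batch
                con.commit();                              // single commit at the very end
            } catch (SQLException e) {
                con.rollback();                            // any failure undoes all earlier batches
                throw e;
            }
        }
    }
}

Note that this still keeps a single transaction open for the whole run, so the WebSphere warning about not committing for more than 120 seconds may still fire unless that threshold is raised in the server's transaction settings.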

Sqoop from Oracle: "Snapshot too Old"

I am setting up an automated process to sqoop from an oracle table to an hdfs directory with this command:
sqoop-import --connect jdbc:oracle:thin:#redacted.company.com:1234/db --username redacted --password secret123 --num-mappers 1 --table table --target-dir /data/destination/directory/ --as-avrodatafile --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec
Unfortunately, I'm getting the following error message:
Error:java.io.IOException: SQLException in nextKeyValue
...
Caused by: java.sql.SQLException: ORA-01555: snapshot too old: rollback segment number 336 with name "_SYSSMU336_879580159$" too small
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)
The business requirement I am attempting to fulfill is that the entire table is imported into our HDFS. Since we do not own or administer this database, I have no control over the UNDO tablespace and related parameters. The job is scheduled to run at 1 AM, which is not a peak time, but since automated processes touch the table, I cannot coax people to stop using it during the job.
How should I modify my sqoop-import statement to avoid this error?
It is not a Sqoop issue. You would get the same error executing the same statement directly on Oracle. It is an undo tablespace issue: you either have to make your query faster or increase the size of the Oracle undo tablespace.
List of possible fixes:
Schedule your task when there is less database activity (maybe even ask people to stop working for a while).
Optimize the query that is failing with this error so that it reads less data and takes less time.
Increase the size of the UNDO tablespace.
Increase the UNDO_RETENTION parameter.
Set the UNDO tablespace in GUARANTEE mode.
If you are exporting a table, consider exporting with the CONSISTENT=no parameter.
Do not commit inside a cursor loop.
Regards
Giova
Using --num-mappers 10 (i.e. increased parallelism) was sufficient to overcome the problem in this instance without impacting the source too much.
Additionally, adding the --direct parameter makes Sqoop use an Oracle-specific connector, which speeds things up further; it will be added to my solution as soon as I convince the DBA on that database to open up the necessary privileges. Direct mode also supports the option -Doraoop.import.consistent.read={true|false} (note: it defaults to false), which seems to mirror the Oracle export utility's CONSISTENT parameter, in the sense that the undo tablespace is not used to preserve a consistent read, eliminating the race to finish the import before the undo tablespace fills up.
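For illustration, the original command with those changes applied might look as follows. The --direct flag and the oraoop property (shown here with its default value of false) assume the DBA has granted the privileges the direct connector needs and that the connector supports the chosen output format in your Sqoop version; with more than one mapper, Sqoop also needs a primary key on the table or an explicit --split-by column.

sqoop-import -Doraoop.import.consistent.read=false --connect jdbc:oracle:thin:#redacted.company.com:1234/db --username redacted --password secret123 --direct --num-mappers 10 --table table --target-dir /data/destination/directory/ --as-avrodatafile --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec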
