SOLR Delta import takes longer than next scheduled delta import cron job - solr5

We are using Solr 5.0.0. Delta import configuration is very simple, just like the apache-wiki
We have setup cron job to do delta-imports every 30 mins, simple setup as well:
0,30 * * * * /usr/bin/wget http://<solr_host>:8983/solr/<core_name>/dataimport?command=delta-import
Now, what happens if sometimes currently running delta-import takes longer than the next scheduled chron job?
Does SOLR Launches next delta-import in a parallel thread? Or ignores job until previous one is done?
Extending time in cron scheduler isn't an option as similar problem could happen as user and document number increases over the time...

I had the similar problem at my end.
Here is how I had a work around for it.
Note : I have implemented solr with core.
I have one table where in I have kept the info about solr like core name, last re-index date and re-indexing-required, current_status.
I have written a scheduler where it check which all cores needs re-indexing(delta-import) from the above table and starts the re-index.
Re-indexing request are sent/invoked after every 20 minutes(In your its 30 min).
When I start the re-indexing also update table and mark the status for the specific core as "inprogress".
After ten minutes I fire a request checking if the re-indexing is completed.
For checking the re-indexing I have used the request as :
final URL url = new URL(SOLR_INDEX_SERVER_PROTOCOL, SOLR_INDEX_SERVER_IP, Integer.valueOf(SOLR_INDEX_SERVER_PORT),
"/solr/"+ core_name +"/select?qt=/dataimport&command=status");
check the status for Committed or idle and the consider it as re-indexing is completed and mark the status of it as Idle in the table.
So re-indexing scheduler wont pick core which are in inprogress status.
Also it considers only those cores for re-indexing where in there some updates (which can be identified by flag "re-indexing-required").
Re-indexing is invoked only if re-indexing-required is true and current status is idle.
If there are some updates(identified by "re-indexing-required") but the current_status is inprogress the scheduler wont pick it for re-indexing.
I hope this may help you.
Note : I have used DIH for indexing and re-indexing.

Solr will simple ignore next import request until the end of the first one and it will not cache the second request. I can observe the behaviour and I've been read it somewhere but couldn't find it now.
Infact I'm dealing with same problem. I try to optimize the queries:
deltaImportQuery="select * from Assests where ID='${dih.delta.ID}'"
deltaQuery="select [ID] from Assests where date_created > '${dih.last_index_time}' "
I only retrieved ID field in first hand and than try to retrive the intended doc.
You may also specify your fields instead of '*' sign. since I use view it doesn't apply in my case
I will update if I had another solution.
Edit After Solution
Beyond the suggested request above I change one more think that speed up my indexing process 10 times. I had two big Entities nested. I used Entity inside another one like
<entity name="TableA" query="select * from TableA">
<entity name="TableB" query="select * from TableB where TableB.TableA_ID='${TableA.ID}'" >
Which yields to multi valued tableB fields. But For every row one request maded to db for TableB.
I changed my view using a with clause combined with a comma separeted field value. And parse the value from solr field mapping. and indexed it in to multivalued field.
My whole indexing process speed up from hours to minutes. Below is my view and solr mappping config.
WITH tableb_with as (SELECT * from TableB)
SELECT *,STUFF( (SELECT ',' + REPLACE( fieldb1, ',', ';') from tableb_with where tableb_with.tableA.ID = tableA.ID
for xml path(''), type).value('.', 'varchar(max)') , 1, 1, '') AS field2WithComma,
STUFF( (SELECT ',' + REPLACE( fieldb1, ',', ';') from tableb_with where tableb_with.tableA.ID = tableA.ID
for xml path(''), type).value('.', 'varchar(max)') , 1, 1, '') AS field2WithComma,
Al fancy Joins and unions goes into with clouse in tableB and also alot of joins in tableA. Actually this view held 200 hundred field in total.
solr mappping is goes like this :
<field column="field1WithComma" name="field1" splitBy=","/>
Hope It may help someone.

Related

Mariadb SELECT not failing on lock

I’m trying to cause a ‘SELECT’ query to fail if the record it is trying to read is locked.
To simulate it I have added a trigger on UPDATE that sleeps for 20 seconds and then in one thread (Java application) I’m updating a record (oid=53) and in another thread I’m performing the following query:
“SET STATEMENT max_statement_time=1 FOR SELECT * FROM Jobs j WHERE j.oid =53”.
(Note: Since my mariadb server version is 10.2 I cannot use the “SELECT … NOWAIT” option and must use “SET STATEMENT max_statement_time=1 FOR ….” instead).
I would expect that the SELECT will fail since the record is in a middle of UPDATE and should be read/write locked, but the SELECT succeeds.
Only if I add ‘for update’ to the SELECT query the query fails. (But this is not a good option for me).
I checked the INNODB_LOCKS table during the this time and it was empty.
In the INNODB_TRX table I saw the transaction with isolation level – REPEATABLE READ, but I don’t know if it is relevant here.
Any thoughts, how can I make the SELECT fail without making it 'for update'?
Normally consistent (and dirty) reads are non-locking, they just read some sort of snapshot, depending on what your transaction isolation level is. If you want to make the read wait for concurrent transaction to finish, you need to set isolation level to SERIALIZABLE and turn off autocommit in the connection that performs the read. Something like
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET autocommit = 0;
SET STATEMENT max_statement_time=1 FOR ...
should do it.
Relevant page in MariaDB KB
Side note: my personal preference would be to use innodb_lock_wait_timeout=1 instead of max_statement_time=1. Both will make the statement fail, but innodb_lock_wait_timeout will cause an error code more suitable for the situation.

cache intersystems command to get the last updated timestamp of a table

I want to know the last update time of a Cache Intersystems DB table. Please let me know the relevant command. I ran through their command documentation:
http://docs.intersystems.com/latest/csp/docboo/DocBook.UI.Page.cls?KEY=GTSQ_commands
But I don't see any such command there. I also tried searching through this :
http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_currenttimestamp
Is this not the complete documentation of commands ?
Cache' does not maintain "last updated" information by default as it might introduce unnecessary performance penalty on DML operations.
You can add this field manually to every table of interest:
Property LastUpdated As %TimeStamp [ SqlComputeCode = { Set {LastUpdated}= $ZDT($H, 3) }, SqlComputed, SqlComputeOnChange = (%%INSERT, %%UPDATE) ];
This way it would keep the time of last Update/Insert for every row, but still it would not help you with Delete.
Alternatively - you can setup triggers for every DML operation that would maintain timestamp in a separate table.
Without additional coding the only way to gather this information is to scan Journal files, which is not really intended use for these and would be slow at best.

BizTalk 2013 file receive location trigger on non-event

I have a file receive location which is schedule to run at specific time of day. I need to trigger a alert or mail if receive location is unable to find any file at that location.
I know I can create custom components or I can use BizTalk 360 to do so. But I am looking for some out of box BizTalk feature.
BizTalk is not very good at triggering on non-events. Non-events are things that did not happen, but still represent a certain scenario.
What you could do is:
Insert the filename of any file triggering the receive location in a custom SQL table.
Once per day (scheduled task adapter or polling via stored procedure) you would trigger a query on the SQL table, which would only create a message in case no records were made that day.
Also think about cleanup: that approach will require you to delete any existing records.
Another options could be a scheduled task with a custom c# program which would create a file only if there were no input files, etc...
The sequential convoy solution should work, but I'd be concerned about a few things:
It might consume the good message when the other subscriber is down, which might cause you to miss what you'd normally consider a subscription failure
Long running orchestrations can be difficult to manage and maintain. It sounds like this one would be running all day/night.
I like Pieter's suggestion, but I'd expand on it a bit:
Create a table, something like this:
CREATE TABLE tFileEventNotify
(
ReceiveLocationName VARCHAR(255) NOT NULL primary key,
LastPickupDate DATETIME NOT NULL,
NextExpectedDate DATETIME NOT NULL,
NotificationSent bit null,
CONSTRAINT CK_FileEventNotify_Dates CHECK(NextExpectedDate > LastPickupDate)
);
You could also create a procedure for this, which should be called every time you receive a file on that location (from a custom pipeline or an orchestration), something like
CREATE PROCEDURE usp_Mrg_FileEventNotify
(
#rlocName varchar(255),
#LastPickupDate DATETIME,
#NextPickupDate DATETIME
)
AS
BEGIN
IF EXISTS(SELECT 1 FROM tFileEventNotify WHERE ReceiveLocationName = #rlocName)
BEGIN
UPDATE tFileEventNotify SET LastPickupDate = #LastPickupDate, NextPickupDate = #NextPickupDate WHERE ReceiveLocationName = #rlocName;
END
ELSE
BEGIN
INSERT tFileEventNotify (ReceiveLocationName, LastPickupDate, NextPickupDate) VALUES (#rlocName, #LastPickupDate, #NextPickupDate);
END
END
And then you could create a polling port that had the following Polling Data Available statement:
SELECT 1 FROM tFileEventNotify WHERE NextPickupDate < GETDATE() AND NotificationSent <> 1
And write up a procedure to produce a message from that table that you could then map to an email sent via SMTP port (or whatever other notification mechanism you want to use). You could even add columns to tFileEventNotify like EmailAddress or SubjectLine etc. You may want to add a field to the table to indicate whether a notification has already been sent or not, depending on how large you make the polling interval. If you want it sent every time you can ignore that part.
One option is to set up a BAM Alert to trigger if no file is received during the day.
Here's one mostly out of the box solution:
BizTalk Server: Detecting a Missing Message
Basically, it's an Orchestration that listens for any message from that Receive Port and resets a timer. If the timer expires, it can do something.

Column 'AuctionStatus' cannot be used in an IF UPDATE clause because it is a computed column

I am developing an Auction site in asp.net3.5 and sql server 2008R2, My Database has an Auction Table that has a calculated column "AuctionStatus" -
(case when [EndDateTime] < getdate() then '0' else '1' end)
that gives auction status Active or inactive based on End Date.
Now I want to call a stored procedure that sends email notifications to buyers and sellers as soon as AuctionStatus becomes '0'. For that i tried to create a after update trigger that could call the email notification sp, but i am not able to do so.
I am getting the following error message :-
Msg 2114, Level 16, State 1, Procedure trgAuctionEmailNotification,
Line 6 Column 'AuctionStatus' cannot be used in an IF UPDATE clause
because it is a computed column.
The trigger is:
CREATE TRIGGER trgAuctionEmailNotification ON SE_Auctions
AFTER UPDATE
AS
BEGIN
IF (UPDATE (AuctionStatus))
BEGIN
IF EXISTS (SELECT * FROM inserted WHERE currentbidderid > 0
AND AuctionStatus='0' )
BEGIN
DECLARE #ID int
SELECT #ID = AuctionID from inserted
EXEC spSelectSE_AuctionsByAuctionID #ID
END
END
END
You could just replace AuctionStatus with the corresponding expression :
IF EXISTS (SELECT * FROM inserted WHERE currentbidderid > 0 AND [EndDateTime] < getdate() )
But, the point is I don't see how your trigger will be "triggered" as [AuctionStatus] is never "updated". Its Value is just calculated whenever you need it.
You could go for a sql job that runs every x minutes and send a notification for each auction which ended during the last x minutes.
You need to add a real column containing a flag to indicate whether the notifications have been sent, and then implement a polling technique to scan the table for rows where the status is inactive and notifications haven't been sent.
The computed column doesn't really transition from one state to another, so it's not like an UPDATE has occurred. Even if SQL Server did implement this, it would be hideously expensive, since it would have to query the entire table for transitioning rows every 3ms. (or even more frequently if you're using datetime2 with a higher precision)
Whereas you can pick a suitable polling interval yourself. This could be an SQL agent job, or in some service code somewhere, whatever best fits the rest of your architecture.

How to find out which package/procedure is updating a table?

I would like to find out if it is possible to find out which package or procedure in a package is updating a table?
Due to a certain project being handed over (the person who handed over the project has since left) without proper documentation, data that we know we have updated always go back to some strange source point.
We are guessing that this could be a database job or scheduler that is running the update command without our knowledge. I am hoping that there is a way to find out where the source code is calling from that is updating the table and inserting the source as a trigger on that table that we are monitoring.
Any ideas?
Thanks.
UPDATE: I poked around and found out
how to trace a statement back to its
owning PL/SQL object.
In combination with what Tony mentioned, you can create a logging table and a trigger that looks like this:
CREATE TABLE statement_tracker
( SID NUMBER
, serial# NUMBER
, date_run DATE
, program VARCHAR2(48) null
, module VARCHAR2(48) null
, machine VARCHAR2(64) null
, osuser VARCHAR2(30) null
, sql_text CLOB null
, program_id number
);
CREATE OR REPLACE TRIGGER smb_t_t
AFTER UPDATE
ON smb_test
BEGIN
INSERT
INTO statement_tracker
SELECT ss.SID
, ss.serial#
, sysdate
, ss.program
, ss.module
, ss.machine
, ss.osuser
, sq.sql_fulltext
, sq.program_id
FROM v$session ss
, v$sql sq
WHERE ss.sql_address = sq.address
AND ss.SID = USERENV('sid');
END;
/
In order for the trigger above to compile, you'll need to grant the owner of the trigger these permissions, when logged in as the SYS user:
grant select on V_$SESSION to <user>;
grant select on V_$SQL to <user>;
You will likely want to protect the insert statement in the trigger with some condition that only makes it log when the the change you're interested in is occurring - on my test server this statement runs rather slowly (1 second), so I wouldn't want to be logging all these updates. Of course, in that case, you'd need to change the trigger to be a row-level one so that you could inspect the :new or :old values. If you are really concerned about the overhead of the select, you can change it to not join against v$sql, and instead just save the SQL_ADDRESS column, then schedule a job with DBMS_JOB to go off and update the sql_text column with a second update statement, thereby offloading the update into another session and not blocking your original update.
Unfortunately, this will only tell you half the story. The statement you're going to see logged is going to be the most proximal statement - in this case, an update - even if the original statement executed by the process that initiated it is a stored procedure. This is where the program_id column comes in. If the update statement is part of a procedure or trigger, program_id will point to the object_id of the code in question - you can resolve it thusly:
SELECT * FROM all_objects where object_id = <program_id>;
In the case when the update statement was executed directly from the client, I don't know what program_id represents, but you wouldn't need it - you'd have the name of the executable in the "program" column of statement_tracker. If the update was executed from an anonymous PL/SQL block, I'm not how to track it back - you'll need to experiment further.
It may be, though, that the osuser/machine/program/module information may be enough to get you pointed in the right direction.
If it is a scheduled database job then you can find out what scheduled database jobs exist and look into what they do. Other things you can do are:
look at the dependencies views e.g. ALL_DEPENDENCIES to see what packages/triggers etc. use that table. Depending on the size of your system that may return a lot of objects to trawl through.
Search all the database source code for references to the table like this:
select distinct type, name
from all_source
where lower(text) like lower('%mytable%');
Again that may return a lot of objects, and of course there will be some "false positives" where the search string appears but isn't actually a reference to that table. You could even try something more specific like:
select distinct type, name
from all_source
where lower(text) like lower('%insert into mytable%');
but of course that would miss cases where the command was formatted differently.
Additionally, could there be SQL scripts being run through "cron" jobs on the server?
Just write an "after update" trigger and, in this trigger, log the results of "DBMS_UTILITY.FORMAT_CALL_STACK" in a dedicated table.
The purpose of this function is exactly to give you the complete call stack of al the stored procedures and triggers that have been fired to reach your code.
I am writing from the mobile app, so i can't give you more detailed examples, but if you google for it you'll find many of them.
A quick and dirty option if you're working locally, and are only interested in the first thing that's altering the data, is to throw an error in the trigger instead of logging. That way, you get the usual stack trace and it's a lot less typing and you don't need to create a new table:
AFTER UPDATE ON table_of_interest
BEGIN
RAISE_APPLICATION_ERROR(-20001, 'something changed it');
END;
/

Resources