So I'm pretty new to using the ColdFusion Solr search (I just moved from a CF8 Mac OS X server to a Linux CF9 server), and I'm wondering what the best way is to handle automatically updating the collections. I know scheduled tasks are meant for this, but I haven't been able to find any examples online.
I currently have a scheduled task set up to update all of the collections weekly by getting the list of collections and using the cfindex tag in a loop to run the refresh command. This is pretty processing-intensive, though, and takes about ten minutes to update the four collections I have set up so far. It works when I run it in the browser, but I get the error "The request has exceeded the allowable time limit Tag: CFLOOP" when I run the task from the scheduled task administration page.
Is there a better way to handle updating the collections? Would it be better if I made a task to update each collection individually?
Here's my update code.
<cfsetting requesttimeout="1800">
<cfcollection action="list" name="collections" engine="solr">
<cfloop query="collections">
<cfindex collection="#name#" action="refresh" extensions=".pdf, .html, .htm, .cfml, .cfm" type="path" key="/home/#name#/public_html/" recurse="yes">
</cfloop>
In earlier versions of ColdFusion there was a URL parameter that could be passed on any HTTP request to change the server's timeout for the requested page. You might have guessed from the scheduled task configuration that there's an HTTP request running your task, so it functions just like any other page. In those earlier versions you would have just added &requesttimeout=900 to the URL and that gave the server 15 minutes to process that task.
In later versions they realized that this URL parameter was a security risk but they needed a way to allow developers to declare that an individual HTTP request should still be allowed to take longer than the default page timeout set in the ColdFusion Administrator. So they moved it from the URL parameter to the <cfsetting> tag.
<cfsetting requesttimeout="900" />
You need to put the cfsetting tag at the top of the page, rather than putting it inside your loop, because it's resetting the total allowable time from the beginning of the request, not just since the last cfsetting tag. Ben Nadel wrote a blog article about that here: http://www.bennadel.com/blog/626-CFSetting-RequestTimeout-Updates-Timeouts-It-Does-Not-Set-Them.htm
I'm not sure if there's an upper limit to the request timeout. I do know that in the past, when I've had a really long-running task like that, the server has gradually slowed down, in some cases until it crashed. I'm not sure I would expect reindexing Solr collections to degrade performance that badly; I think my tasks were doing some other things that were probably hogging memory at the time. Anyway, if you run into that issue, you may need to divide the work into a separate task for each collection and make sure there's enough time between the tasks to allow each one to complete before the next one starts.
EDIT: Oops! I don't know how I missed the cfsetting tag in the original question. D'oh! In any event, when you execute a scheduled task via the CF Administrator, it performs a cfhttp request to execute the task. This is the way scheduled tasks are normally executed, and I suspect it's so the task can execute inside your own application scope, but the effect is that there are two separate requests executing. I don't think there's a cfsetting tag in the CFIDE page, but I suspect you could add one if you wanted to allow that page more time to wait for the task to complete.
EDIT: Okay, if you wanted to add the cfsetting in the CFIDE, you would first have to decrypt the template and then add your one line of code... which might void your warranty on the server, but is probably not dangerous. ;) For decrypting the template see: Can I get the source of a hacked Coldfusion template? - and the template to edit is /CFIDE/administrator/scheduler/scheduletasks.cfm.
Related
The ASP.NET runtime is meant for short workloads that can be run in parallel. I need to be able to schedule periodic events and background tasks that may or may not run for much longer periods.
Given the above I have the following problems to deal with:
The AppDomain can shutdown due to changes (Web.config, bin, App_Code, etc.)
IIS recycles the AppPool on a regular basis (daily)
IIS itself might restart, or for that matter the server might crash
I'm not convinced that running this code inside ASP.NET is the wrong thing to do, because it would allow for a simpler programming model. But doing so would require that an external service periodically makes requests to the app so that the application is kept running, and that all background tasks are programmed with the utmost care. They will have to be able to pause and resume their work in the event of an unexpected error.
My current line of thinking goes something like this:
If all jobs are registered in the database, it should be possible to use the database as a bookkeeping mechanism. In the case of an error, the database would contain all the state necessary to resume the operation at the next opportunity.
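To make that concrete, here's a minimal sketch of the bookkeeping I have in mind, assuming a hypothetical Jobs table (Id, Step, Status) in SQL Server; all names here are illustrative, not a finished design:

using System.Data.SqlClient;

public static class JobBook
{
    // Record that a job has reached a given step, so a crash can't lose the position.
    public static void MarkStep(string connStr, int jobId, int step)
    {
        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            "UPDATE Jobs SET Step = @step, Status = 'InProgress' WHERE Id = @id", conn))
        {
            cmd.Parameters.AddWithValue("@step", step);
            cmd.Parameters.AddWithValue("@id", jobId);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    // On startup (e.g. Application_Start, or after an AppPool recycle), resume
    // every job the previous process never finished.
    public static void ResumeUnfinished(string connStr)
    {
        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            "SELECT Id, Step FROM Jobs WHERE Status = 'InProgress'", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Resume(reader.GetInt32(0), reader.GetInt32(1));
        }
    }

    private static void Resume(int jobId, int fromStep)
    {
        // Hypothetical: re-run the job starting from the recorded step.
    }
}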
I'd really appreciate some feedback/advice on this matter. I've been considering running a Windows service and using some RPC solution as well, but it doesn't have the same appeal to me. Instead I'd have a lot of deployment issues and would be synchronizing tasks and code across several applications. Due to my business needs, this is less than optimal.
This is a shot in the dark since I don't know what database you use, but I'd recommend you consider dialog timers and activation. Assuming that most of the jobs have to do some data manipulation, and that likely all of them do only data manipulation, leveraging activation and timers gives an extremely reliable job-scheduling solution, entirely embedded in the database (no need for an external process/service, no dependencies outside the database bounds like msdb), and it's a solution that ensures scheduled jobs survive restarts, failover events, and even disaster-recovery restores. Simply put, once a job is scheduled it will run even if the database is restored one week later on a different machine.
Have a look at Asynchronous procedure execution for a related example.
And if this is too radical, at least have a look at Using Tables as Queues since storing the scheduled items in the database often falls under the 'pending queue' case.
I recommend that you have a look at Quartz.Net. It is open source and it will give you some ideas.
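For a flavor of what that looks like, here is a minimal sketch using the Quartz.NET 2.x fluent API; the job class and the 15-minute schedule are just placeholders:

using Quartz;
using Quartz.Impl;

// A job that Quartz.NET instantiates and runs on its own thread pool.
public class PeriodicJob : IJob
{
    public void Execute(IJobExecutionContext context)
    {
        // Hypothetical: do the periodic background work here, keeping its
        // state in the database so it survives restarts.
    }
}

public static class SchedulerBootstrap
{
    public static void Start()
    {
        ISchedulerFactory factory = new StdSchedulerFactory();
        IScheduler scheduler = factory.GetScheduler();
        scheduler.Start();

        IJobDetail job = JobBuilder.Create<PeriodicJob>()
            .WithIdentity("periodic-job", "maintenance")
            .Build();

        ITrigger trigger = TriggerBuilder.Create()
            .StartNow()
            .WithSimpleSchedule(s => s.WithIntervalInMinutes(15).RepeatForever())
            .Build();

        scheduler.ScheduleJob(job, trigger);
    }
}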
Using the database as a state-keeping mechanism is a completely valid idea. How complex it will be depends on how far you want to take it. In many cases you will end up pairing your database logic with a Windows service to achieve the desired result.
FWIW, it is typically not good practice to manually use the thread pool inside an ASP.NET application, though (contrary to what you may read) it actually works quite nicely, other than the huge caveat that you can't guarantee the work will complete (an app pool recycle can kill it at any point).
So if you needed a background thread that examined the state of some object every 30 seconds and you didn't care if it fired every 30 seconds or 29 seconds or 2 minutes (such as in a long app pool recycle), an ASP.Net-spawned thread is a quick and very dirty solution.
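As an illustration, such a quick-and-dirty ASP.NET-spawned poller might look like this sketch (names are hypothetical; an app pool recycle simply drops the timer, which is exactly the trade-off described above):

using System;
using System.Threading;

public static class StatePoller
{
    // Keep a static reference so the timer isn't garbage collected.
    private static Timer _timer;

    // Call this from Application_Start in Global.asax.
    public static void Start()
    {
        _timer = new Timer(_ => CheckState(), null,
                           TimeSpan.Zero, TimeSpan.FromSeconds(30));
    }

    private static void CheckState()
    {
        // Hypothetical: examine the state of some object every ~30 seconds.
    }
}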
Asynchronously fired callbacks (such as on the ASP.Net Cache object) can also perform a sort of "behind the scenes" role.
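The classic form of that trick is to insert a dummy cache item with an absolute expiration and re-arm it from the removed callback; a sketch, with the caveat that this is a hack and the runtime may evict or delay it:

using System;
using System.Web;
using System.Web.Caching;

public static class CacheTimer
{
    // Call once (e.g. from Application_Start); the callback re-arms itself.
    public static void Register()
    {
        HttpRuntime.Cache.Insert("background-tick", DateTime.UtcNow, null,
            DateTime.UtcNow.AddSeconds(30), Cache.NoSlidingExpiration,
            CacheItemPriority.NotRemovable, OnRemoved);
    }

    private static void OnRemoved(string key, object value, CacheItemRemovedReason reason)
    {
        DoWork();    // hypothetical periodic work
        Register();  // re-add the item so it fires again in ~30 seconds
    }

    private static void DoWork() { /* hypothetical */ }
}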
I have faced similar challenges and ultimately opted for a Windows service that uses a combination of building blocks for maximum flexibility. Namely, I use:
1) WCF with implementation-specific types OR
2) Types that are meant to transport and manage objects that wrap a job OR
3) Completely generic, serializable objects contained in a custom wrapper. Since they are just a binary payload, this allows any object to be passed to the service. Once in the service, the wrapper defines what should happen to the object (e.g. invoke a method, gather a result, and optionally make that result available for return).
Ultimately, the web site is responsible for querying the service about its state. This querying can be as simple as polling or can use asynchronous callbacks with WCF (though I believe this also uses some sort of polling behind the scenes).
I'll tell you what I have done.
I created a class called Atzenta that has a timer (1-2 second trigger).
I also created a table in my temporary database that keeps the jobs. The table holds the job ID, other parameters, the priority, the job status, and messages.
I can add or delete a job through this class. When there is no work to be done, the timer is stopped; when I add a job, the timer starts again. (The timer runs on a thread of its own, so it can work in parallel.) I use System.Timers, not the other timer classes, for this.
The jobs can have different priorities.
Now let's say that I place a job in this table using the Atzenta class. The next time the timer fires, it checks the table with a query, finds the first available job, and just runs it. No other job runs until this one has ended.
All synchronization and flags are handled through the table. In the table I have a flag for every job that shows whether it is waiting to run, requested to run, running, paused, finished, or killed.
All jobs are already-known functions or classes (e.g. the creation of statistics).
For stopping and starting, I use Global.asax and Application_Start/Application_End to start and pause the object that keeps the tasks. For example, when a job is running and I get Application_End, either I wait for it to finish and then stop the app, or I stop the action, notify the table, and start it again on Application_Start.
So I call Atzenta.RunTheJob(Jobs.StatisticUpdate, ProductID); this adds the job to the table and opens the timer, and on the next trigger the job runs and updates the statistics for the given product ID.
I use a table in a database to synchronize the many pools that run the same web app, and it does in fact work that way. With a common table, synchronizing the jobs is easy, and you avoid two pools running the same job at the same time.
In my back office I have a simple table view to see the status of all jobs.
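A rough sketch of the pattern described above (the database helpers here are stand-ins for my real table access code):

using System.Timers;

public class Atzenta
{
    private readonly Timer _timer = new Timer(1500); // 1-2 second trigger
    private volatile bool _running;

    public Atzenta()
    {
        _timer.Elapsed += OnTick;
    }

    public void RunTheJob(string jobName, int parameter)
    {
        InsertJobRow(jobName, parameter); // INSERT a row with status "wait to run"
        _timer.Start();
    }

    private void OnTick(object sender, ElapsedEventArgs e)
    {
        if (_running) return; // one job at a time, as described above
        _running = true;
        try
        {
            int? jobId = FetchNextWaitingJobId();          // first job by priority
            if (jobId == null) { _timer.Stop(); return; }  // nothing to do: stop
            RunAndFlagFinished(jobId.Value);               // run it, then UPDATE its flag
        }
        finally { _running = false; }
    }

    private void InsertJobRow(string jobName, int parameter) { /* table INSERT */ }
    private int? FetchNextWaitingJobId() { return null; /* table SELECT */ }
    private void RunAndFlagFinished(int jobId) { /* execute job + UPDATE status */ }
}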
I would like to know the best way to deal with long-running processes started on demand from an ASP.NET web page.
The process may consist of various steps (like uploading files to the server, running SSIS packages on them, executing some stored procedures, etc.), and sometimes the process can take up to a couple of hours to finish.
If I go for asynchronous execution using a WCF service, what happens if the user closes the browser while the process is running? How should the process's success or failure be displayed to the user? To solve this, I chose one-way WCF service calls, but the problem with this is that I need to create a process table and store the results in it (including error messages if a step fails, and which steps completed successfully). That is additional overhead, because there are many such multi-step processes the user can invoke from the web page, the user needs to be made aware of the progress (in the simplest case a status like "process xyz running"), and once a process is done the output needs to be displayed to the user (for example, by running a stored procedure).
What is the best way to design the solution for this?
As I see it, you have three options:
Have a long running page where the user waits for the response. If this is several hours, you're going to have many usability problems, so I wouldn't even consider it.
Create a process table to store the results of operations. Run service functions asynchronously and delegate logging the results to the service. There can be a page that the user refreshes which shows the latest results from this table.
If you really don't want to create a table, then store all the current process details in the user's session state, and have a current-processes page as above. You have the possible issue that the session might time out, or the web app might restart, and you'll lose all of this.
I can't see that number 2 is such a great hardship. You could make the table fairly generic to encompass all types of processes: process details could just be encoded as binary or XML and interpreted by the web application. You then have the most robust solution.
I can't say what the best way would be, but using Windows Workflow Foundation for such long-running processes is definitely one way to go about it.
You can track the process to see what stage it is at, and even persist it if you have steps where it is awaiting user input, etc.
WF provides a lot of features out of the box (especially if your storage medium is SQL Server) and may be a good option to consider.
http://www.codeproject.com/KB/WF/WF4Extensions.aspx might help give you some more insight into this.
I think you are on the right track. You should run the process asynchronously, store the execution somewhere (a table), and keep the status of the running process there.
Your user should see a "pending" label while the process is executing, and a "finished" label with the result when the process has finished. If the user closes the browser, she will see the result of her running process the next time she logs in.
I know that similar questions have been asked all over the place, but I'm having trouble finding one that relates directly to what I'm after.
I have a website where a user uploads a data file, and then that file is transformed and imported into SQL. The file can be up to 50 MB in size, and sometimes this process can take 30 minutes or even longer.
I realise I need to palm off the actual work to another process and poll that process from the web page. I'm wondering what the best approach would be, though. Being a web developer by trade, I'm finding all this new Windows service stuff a bit confusing, and I just want somewhere to start.
So:
Can I / should I be doing this with a Windows service? If so, how?
Should I use WCF? If this runs under IIS, will I have problems with aspnet_wp.exe recycling and timing out my process?
clarifications
The data is imported into sql, there's no file distribution taking place.
If there is a failure, it absolutely MUST be reported to the user. The web page will poll every, let's say, 5 seconds, from the time the async task begins, to get the 'status' of the import. Once it's finished, another response will tell the page to stop polling for status updates.
queries on final decision
OK, so as I thought, it seems that a Windows service is the best idea. As to HOW to get it to work: it seems the 'put the file there and wait for the service to pick it up' idea is the generally accepted way. Is there a way I can start a process run by the service without it having to constantly check a database table or folder? As I said earlier, I don't have any experience with Windows services; I wondered, if I put a public method in the service, can I call it somehow?
well ...
var thread = new Thread(() => {
// your action
});
thread.Start();
but you will have problems with that:
what if the import to SQL fails? Should there be any response to the client?
if it fails, how do you ensure the file gets handled on a later request?
what if the application shuts down? This newly created and started thread will be killed too.
...
It's not always a good idea to store everything in SQL (especially files...). If you want to make the file available to several servers, why not distribute it via FTP?
I believe that your whole concept is a bit messed up (sorry for assuming this), and it might be helpful if you elaborate and give us more information about your intentions!
edit:
Can I / should I be doing this with a Windows service? If so, how?
You can. :) I advise you to create a simple console program and convert it into a service with srvany and sc. You can get a rough overview of how to here (note: insert a blank after each = in sc's arguments, e.g. binPath= "C:\path\to\service.exe"; that's a silly pitfall).
The term should is relative, because you did not answer the most important question:
what if a record is persisted to the database, telling a consumer that the file test.img has been stored, but your service hasn't captured it or hasn't transformed it yet?
so ... next on
Should I use WCF? If this runs under IIS, will I have problems with aspnet_wp.exe recycling and timing out my process?
You probably could create a WCF service which receives some binary data and then stores it in a database. This request could be async, yes. But what for?
Once again:
Please give us more insight into your workflow: what exactly are you trying to achieve? Which "environmental conditions" do you have (e.g. app A polls the DB and expects file records which are referenced in table x to be persisted)?
edit:
So you want to import a .csv file. Well, that changes everything. :)
But I won't advise you to use a WCF service (there could be a use for one, e.g. a WCF service with a method to insert a single row, where your iteration through the file would be implemented in another app... not that good, though).
I would suggest the following:
At first, do everything in your web app (as you've already done), but use some sort of bulk insert and do your transformation/logic in the database.
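For illustration, the bulk-insert part could look like this sketch using SqlBulkCopy; the staging-table name and the idea of doing the transform afterwards in a stored procedure are my assumptions:

using System.Data;
using System.Data.SqlClient;

public static class CsvLoader
{
    public static void BulkLoad(string connStr, DataTable parsedCsv)
    {
        using (var bulk = new SqlBulkCopy(connStr))
        {
            bulk.DestinationTableName = "dbo.ImportStaging"; // hypothetical staging table
            bulk.BulkCopyTimeout = 0;                        // no timeout for big files
            bulk.WriteToServer(parsedCsv);
        }
        // Then run the transformation/logic in the database, e.g. a stored procedure.
    }
}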
If you then have some sort of bottleneck, I would suggest something like a minor job service, e.g.:
The web app uploads the file and inserts a row into a job table. The job service continuously polls the table, or gets informed via WCF by the web app (hey, hey, finally some use for WCF in your scenario... :) ), and then does the import job, writing a finish note to a table or setting the state of the job to finished...
but this is a bit overkill :)
Please see if my comments below help you to resolve your issue:
• Can I / should I be doing this with a Windows service? If so, how?
Yes, you can do this with a Windows service, and I think that is the way you should be doing it. You can implement your own service to process your requests, or you can use the open-source Job Processor code.
Basically the idea is (see the sketch below):
1) You submit a request for processing the CSV file in a database table, with a status of "Not Started".
2) Your Windows service picks up the not-started requests from the database table and updates them to an "In Progress" status.
3) Once the processing completes successfully/unsuccessfully, your service updates the database table with a status of "Completed"/"Failed".
4) Your ASP.NET page can poll the database table for the current status every 5 seconds or so.
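A minimal sketch of that service loop, assuming a hypothetical Requests table (Id, Status) and an unspecified DoImport step; the UPDATE ... OUTPUT claim keeps two service instances from grabbing the same row:

using System;
using System.Data.SqlClient;
using System.Threading;

public class ImportWorker
{
    private readonly string _connStr;
    public ImportWorker(string connStr) { _connStr = connStr; }

    // Run this on a dedicated thread started from the service's OnStart.
    public void Run()
    {
        while (true)
        {
            int? id = ClaimNextRequest();
            if (id == null) { Thread.Sleep(5000); continue; } // nothing pending
            try
            {
                DoImport(id.Value);               // transform + import into SQL
                SetStatus(id.Value, "Completed");
            }
            catch (Exception)
            {
                SetStatus(id.Value, "Failed");    // the page's poll reports this
            }
        }
    }

    // Atomically flip one 'Not Started' row to 'In Progress' and return its Id.
    private int? ClaimNextRequest()
    {
        const string sql = @"UPDATE TOP (1) Requests SET Status = 'In Progress'
                             OUTPUT inserted.Id WHERE Status = 'Not Started'";
        using (var conn = new SqlConnection(_connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            object result = cmd.ExecuteScalar();
            return result == null ? (int?)null : (int)result;
        }
    }

    private void SetStatus(int id, string status)
    {
        using (var conn = new SqlConnection(_connStr))
        using (var cmd = new SqlCommand(
            "UPDATE Requests SET Status = @s WHERE Id = @id", conn))
        {
            cmd.Parameters.AddWithValue("@s", status);
            cmd.Parameters.AddWithValue("@id", id);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    private void DoImport(int id) { /* hypothetical: the long transform/import */ }
}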
• Should I use WCF? If this runs under IIS, will I have problems with aspnet_wp.exe recycling and timing out my process?
You should not be using WCF for this purpose.
I know we need a better solution, but we need to get this done this way for right now. We have a long import process that's fired when you click the start-import button on an ASPX web page. It takes a long time, sometimes several hours. I changed the timeout and that's fine, but I keep getting a connection-reset error from the server after about an hour. I'm thinking it's the ASP.NET lifecycle, and I'd like to know if there are settings in IIS I can change to make this lifecycle last longer.
You should almost certainly do the long-running work in a separate process (not just a separate thread).
Write a standalone program to do the import. Have it set a flag somewhere (a column in a database, for example) when it's done, and put lines into a logfile or database table to show progress.
That way your page just gets the job started. Afterwards, it can self-refresh once every few minutes, until the 'completed' flag is set. You can display the log table if you want to be sure it's still running and hasn't died.
This is pretty straightforward stuff, but if you need code examples they can be supplied.
One other point to consider which might explain the behaviour is that aspnet_wp.exe recycles if too much memory is being consumed (do not confuse this with the page life cycle).
If your long process is taking up too much memory, ASP.NET will launch a new process and reassign all existing requests. I would suggest checking for this. You can do so by looking at aspnet_wp in Task Manager and checking the memory size being used; if the size suddenly goes back down, it has recycled.
You can change the memory limit in machine.config:
<system.web>
    <processModel autoConfig="true"/>
</system.web>
Use memoryLimit to specify the maximum allowed memory size, as a percentage of total system memory, that the worker process can consume before ASP.NET launches a new process and reassigns existing requests. (The default is 60.)
<system.web>
    <processModel autoConfig="true" memoryLimit="10"/>
</system.web>
If this is what is causing the problem for you, the only solution might be to have a separate process for your long operation. You will need to set up IIS accordingly to give your other EXE the relevant permissions.
You can try running the process in a new thread. This means that the page will start the task and then finish the page's processing, but the separate thread will still run in the background. You won't be able to have any visual feedback, though, so you may want to log progress to a database and display that in a separate page instead.
You can also try running this as an AJAX call instead of a postback, which has different limitations...
Since you recognize this is not the way to do it, I won't list alternatives. I'm sure you know what they are :)
Extending the timeout is definitely not the way to do it. Response times should be kept to an absolute minimum. If at all possible, I would try to shift this long-running task out of the ASP.NET application entirely and have it run as a separate process.
After that it's up to you how you want to proceed. You might want the process to dump its results into a file that the ASP application can poll for, either via AJAX or having the user hit F5.
If it's taking hours, I would suggest a separate thread for this, and perhaps emailing a notification when the result is ready to download from the server (i.e. sending a link to the finished result).
Or, if it is important to have a UI in the client's browser (if they are going to be hanging around for n hours), you could have a WebMethod which is called from the client (JavaScript) using setInterval to periodically check whether it's done.
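A sketch of such a WebMethod, assuming ASP.NET AJAX page methods are enabled (a ScriptManager with EnablePageMethods="true") so client-side JavaScript can call PageMethods.CheckProgress from a setInterval timer; the status store here is a hypothetical in-process dictionary:

using System.Collections.Concurrent;
using System.Web.Services;

public partial class ImportPage : System.Web.UI.Page
{
    // Hypothetical in-process status store, written to by the worker thread.
    private static readonly ConcurrentDictionary<string, string> Status =
        new ConcurrentDictionary<string, string>();

    // Page methods must be public, static, and marked [WebMethod].
    [WebMethod]
    public static string CheckProgress(string jobId)
    {
        string s;
        return Status.TryGetValue(jobId, out s) ? s : "unknown";
    }
}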
I am thinking of the following approach, but I'm not sure if it's the best way:
Step 1 (server side): A TaskManager class creates a new thread and starts a task.
Step 2 (server side): Store the TaskManager object reference in the cache for future reference.
Step 3 (client side): Use periodic AJAX calls to check the status of the task.
Basically the intention is to have a framework to run a background task (5mins approx) and provide regular feedback on the web UI for the percentage of task completed.
Is there a neat way to handle this, or any existing ASP.NET API that would be helpful?
Edit #1: I want to run the task in-proc with the app.
Edit #2: It looks like the badge implementation on Stack Overflow also uses the cache to track background tasks. https://blog.stackoverflow.com/2008/07/easy-background-tasks-in-aspnet/
I think the problem with storing the result in the cache is that ASP.NET might scavenge that cache entry for other purposes (i.e. if it's short on memory, if it's grumpy, etc.). Something served from the cache should be something you can recreate on demand if it's not found there; the ASP.NET runtime is free to dump cache entries whenever it feels like it.
The usage of the cache in the badge discussion seems fundamentally different: in that case the task was short-lived, and the cache was just being used as a hacky timer to fire off the task periodically.
Can you confirm this is a task that is going to take 5 minutes and require its own thread for that whole time? That is a performance concern in itself: you will only be able to support a limited number of such requests if each one ties up a thread for so long. Only if that's acceptable would I let the task camp on a thread.
If it's OK for these tasks to camp on a thread, then I'd just go ahead and store the result in a dictionary global to the process. The key of the dictionary would correlate to the client request / AJAX callback series; the key should incorporate the user ID as well if security is at all important.
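A sketch of that process-global dictionary, with a hypothetical user-plus-request key scheme:

using System.Collections.Concurrent;

public static class TaskResults
{
    private static readonly ConcurrentDictionary<string, int> Progress =
        new ConcurrentDictionary<string, int>();

    // Called by the background task as it advances.
    public static void Report(string userId, string requestId, int percent)
    {
        Progress[userId + ":" + requestId] = percent;
    }

    // Called by the AJAX endpoint; null means unknown (e.g. after an app restart).
    public static int? Get(string userId, string requestId)
    {
        int p;
        return Progress.TryGetValue(userId + ":" + requestId, out p) ? p : (int?)null;
    }
}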
If you need to scale up to many users, then I think you need to break the task down into asynchronous steps, and in that case I'd probably use a DB table to store the results (again keyed per request / user).
Microsoft Message Queuing was built for scenarios like the one you are trying to solve:
http://www.microsoft.com/windowsserver2003/technologies/msmq/default.mspx
Windows Communication Foundation also has message queuing support.
Hope this helps.
Thomas
One approach for doing this is to use application state. When you spawn a worker thread, pass it a request ID that you generate, and return this ID to the client. The client will then pass that request ID back to the server in its AJAX calls, and the server will fetch the status from application state using the request ID. (The worker thread updates application state as its status changes.)
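A minimal sketch of that approach (the status-key naming is mine; note the HttpApplicationState object is captured before the thread starts, since HttpContext.Current is unavailable on the worker thread):

using System;
using System.Threading;
using System.Web;

public static class StatusTracker
{
    // Called from the page: start the work and hand the request ID back to the client.
    public static string StartWork(HttpApplicationState app)
    {
        string requestId = Guid.NewGuid().ToString();
        app["status:" + requestId] = "queued";

        ThreadPool.QueueUserWorkItem(_ =>
        {
            app["status:" + requestId] = "running";
            // ... do the long-running work, updating the status as steps complete ...
            app["status:" + requestId] = "done";
        });

        return requestId;
    }

    // Called from the AJAX endpoint with the request ID the client passed back.
    public static string GetStatus(HttpApplicationState app, string requestId)
    {
        return (string)app["status:" + requestId] ?? "unknown";
    }
}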
I saw an approach to a similar problem somewhere. The solution was something like:
Start the background task on the server. Return immediately with a URL to the result.
Until the result is posted, this URL will return 404.
The client checks the URL periodically.
The client reads the results when they are finally posted.
The url will be something like http://mysite/myresults/cffc6c30-d1c2-11dd-ad8b-0800200c9a66.
The best document format is probably JSON.
If feedback on progress is important, modify the document to also contain a status (in progress/finished) and the progress (42%).
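As an illustration, the 404-until-ready endpoint could be a simple IHttpHandler like this sketch; the results folder and query-string naming are assumptions:

using System.IO;
using System.Web;

public class ResultHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // e.g. the GUID from the URL; validate it in real code to avoid path tricks.
        string id = context.Request.QueryString["id"];
        string path = context.Server.MapPath("~/App_Data/results/" + id + ".json");

        if (!File.Exists(path))
        {
            context.Response.StatusCode = 404; // not posted yet; the client retries
            return;
        }

        // e.g. { "status": "finished", "progress": 100, ... }
        context.Response.ContentType = "application/json";
        context.Response.WriteFile(path);
    }
}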