Programmatic Remote Access to the Datastore - google-cloud-datastore

I have a requirement to implement a batch processing system that will run outside of Google App Engine (GAE) to batch process data from an RDBMS and insert it into GAE.
appcfg.py does this from various input files, but I would like to do it "by hand" using some API so I can fully control the lifecycle of the process. Is there a public API that is used internally by appcfg.py?
I would write a daemon in Python that runs on my internal server and monitors certain MySQL tables. Under the correct conditions, it would grab data from MySQL, process it, and post it to the GAE application using the GAE Remote API.

It sounds like you already know what to do. In your own words: "grab data from MySQL, process it, and post it using the GAE RemoteAPI." The Remote API docs even have examples that write to the datastore.
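For reference, a minimal sketch of what the Remote API side of such a daemon might look like with the legacy Python SDK. The app ID, the Record model, and its properties are placeholders, not anything from the question.

```python
# Rough sketch only: the app ID, Record model, and its properties are
# hypothetical; adapt them to the data actually pulled from MySQL.
import getpass

from google.appengine.ext import db
from google.appengine.ext.remote_api import remote_api_stub


class Record(db.Model):
    source_id = db.IntegerProperty()
    payload = db.TextProperty()


def auth_func():
    # A real daemon would read credentials from configuration instead.
    return raw_input('Email: '), getpass.getpass('Password: ')


def push_rows(rows):
    """rows: iterable of (source_id, payload) tuples pulled from MySQL."""
    remote_api_stub.ConfigureRemoteApi(
        None, '/_ah/remote_api', auth_func, 'your-app-id.appspot.com')
    db.put([Record(source_id=r[0], payload=r[1]) for r in rows])
```

Once ConfigureRemoteApi has run, ordinary datastore calls such as db.put() are routed over HTTP to the application's /_ah/remote_api endpoint.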

What you could probably do (if I understand your problem correctly) is use the Task Queue. With that you can define a task that does what you need.
Let's say you want to insert something into the GAE datastore: prepare the insert file on some server, then go to your application and hit a "Start Insert Task" link. Clicking it starts a background task that reads the file and inserts its contents into the datastore.
Furthermore, if that task has to run daily, you could trigger the task creation from a cron job.
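On the Python runtime, such a setup might look roughly like the sketch below. The URLs, the Record model, and the idea of fetching the prepared file over HTTP are assumptions made for illustration, not part of the original question.

```python
# Hypothetical handlers illustrating the Task Queue idea.
import webapp2

from google.appengine.api import taskqueue, urlfetch
from google.appengine.ext import db


class Record(db.Model):
    payload = db.TextProperty()


class StartInsertHandler(webapp2.RequestHandler):
    """Visited manually or by cron.yaml to kick off the import."""
    def get(self):
        taskqueue.add(url='/tasks/insert',
                      params={'file_url': self.request.get('file_url')})
        self.response.write('Insert task queued.')


class InsertTaskHandler(webapp2.RequestHandler):
    """Runs in the background: fetches the prepared file, stores its lines."""
    def post(self):
        data = urlfetch.fetch(self.request.get('file_url')).content
        db.put([Record(payload=line) for line in data.splitlines() if line])


app = webapp2.WSGIApplication([
    ('/start-insert', StartInsertHandler),
    ('/tasks/insert', InsertTaskHandler),
])
```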
However, if you could say more about the work you have to perform, it would be easier to help :-P

Related

Hosting shinyApp on EC2 with background running capability

I want to host a Shiny app on Amazon EC2 that takes an Excel sheet via fileInput(). Then I need to make some API calls for each row in the sheet, which I expect to take 1-2 hours on average for my purposes. So I figured this is what I should do:
Host a Shiny app where one can upload an Excel sheet.
On receiving an Excel sheet from a user, store it on the Amazon server, notify the user that an email will be sent once the processing is complete, and trigger another R script (I'm not sure how to do that) which keeps running in the background even if the user closes the browser window, and collects all the information by making the slow API calls.
Once I have all the data, store it in another Excel sheet and email it back to the user.
If it is possible and reasonable to do it this way, or if you have some other ideas for doing this task, please help me with how to do it.
Edit: I've found that I could do this instead:
Get the Excel sheet data and store it in a file.
Call a bash script from the Shiny app like this: ./<my-script> & disown
The bash script will call a Python script which makes all the API calls, decodes the relevant data from the JSON output, and stores it in another file on the server.
It finally sends an email to the user with the processed data attached.
I wanted to know if this is an appropriate way to do the job. Thanks a lot.
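For concreteness, here is a rough sketch of the detached Python worker described in the edit above. The API endpoint, the 'query' column name, and the SMTP settings are placeholder assumptions.

```python
# Rough sketch of the detached worker. The API endpoint, the 'query'
# column, and the SMTP settings are placeholder assumptions.
import sys
import smtplib
from email.message import EmailMessage

import pandas as pd   # reading/writing Excel assumes openpyxl is installed
import requests


def process(in_path, out_path, user_email):
    df = pd.read_excel(in_path)
    results = []
    for _, row in df.iterrows():
        # One slow API call per row; this is why the job runs detached.
        resp = requests.get('https://api.example.com/lookup',
                            params={'q': row['query']})
        results.append(resp.json().get('value'))
    df['result'] = results
    df.to_excel(out_path, index=False)

    # Notify the user and attach the processed sheet.
    msg = EmailMessage()
    msg['Subject'] = 'Your processing is complete'
    msg['From'] = 'noreply@example.com'
    msg['To'] = user_email
    msg.set_content('The processed sheet is attached.')
    with open(out_path, 'rb') as f:
        msg.add_attachment(f.read(), maintype='application',
                           subtype='octet-stream', filename='results.xlsx')
    with smtplib.SMTP('localhost') as server:
        server.send_message(msg)


if __name__ == '__main__':
    process(*sys.argv[1:4])
```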
Since you are using Python, try a simple web framework such as Django. Flask may also come in handy for creating simple routes. Please comment if you find any issues.

Asynchronous logging of server-side activities

I have a web server that produces a log entry on every request, and these entries should be persistently saved in a DB.
Since the request rate is too high, I can't perform a DB operation on every request.
I was thinking of doing the following:
Any request to the web server produces a log entry.
The log entry is placed somewhere it can be stored quickly (Redis?).
Another service (a cron job?) periodically flushes the data from that place, removes duplicates (yes, there can be duplicates that don't need to be stored in the DB), and makes a single MySQL query to save the data permanently.
What would be the most efficient way to achieve this?
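A minimal sketch of the producer side of that scheme, assuming Redis on localhost and one JSON-encoded entry per request; the key name and the log fields are placeholders.

```python
# Producer side: each request appends a JSON-encoded entry to a Redis list,
# which is an O(1), in-memory operation. Host, key name, and the fields of
# the log entry are placeholder assumptions.
import json
import time

import redis

r = redis.Redis(host='localhost', port=6379)


def log_request(path, status, user_id):
    entry = json.dumps({
        'ts': int(time.time()),
        'path': path,
        'status': status,
        'user_id': user_id,
    })
    r.rpush('request_logs', entry)  # no DB round-trip in the request path
```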
Normally you would use a common logging library (e.g. log4j) which safely manages writing your log statements to a file. However, you should note that log verbosity can still impact application performance.
After the file is on disk, you can do whatever you like with it - it would be completely normal to ingest that file into Splunk for further processing and ad-hoc searches/alerting.
If you want better performance for this operation, you should send your logs to some kind of queue and have a service that reads the queue and sends all the logs to the database.
I found some information about queues in MySQL:
MySQL Insert Statement Queue
Regards.
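For illustration, a rough sketch of such a consumer, using Redis as the queue (matching the scheme in the question) and PyMySQL for the batched insert; the table, columns, and connection settings are placeholder assumptions.

```python
# Consumer side, run from cron (or as a small long-running service):
# atomically drain the Redis list, drop duplicates, and do one batched
# MySQL insert. Table, columns, and credentials are placeholder assumptions.
import json

import pymysql
import redis

r = redis.Redis(host='localhost', port=6379)


def flush_logs():
    # MULTI/EXEC pipeline: read the whole list and clear it atomically.
    pipe = r.pipeline()
    pipe.lrange('request_logs', 0, -1)
    pipe.delete('request_logs')
    raw_entries, _ = pipe.execute()

    # Deduplicate while preserving order.
    seen, rows = set(), []
    for raw in raw_entries:
        if raw in seen:
            continue
        seen.add(raw)
        e = json.loads(raw)
        rows.append((e['ts'], e['path'], e['status'], e['user_id']))
    if not rows:
        return

    conn = pymysql.connect(host='localhost', user='logger',
                           password='secret', database='logs')
    try:
        with conn.cursor() as cur:
            cur.executemany(
                'INSERT INTO request_log (ts, path, status, user_id) '
                'VALUES (%s, %s, %s, %s)', rows)
        conn.commit()
    finally:
        conn.close()


if __name__ == '__main__':
    flush_logs()
```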

How to implement synchronized Memcached with database

AFAIK, Memcached does not support synchronization with a database (at least not SQL Server or Oracle). We are planning to use Memcached (it is free) with our OLTP database.
In some business processes we do some heavy validations which require a lot of data from the database. We cannot keep a static copy of this data because we don't know whether it has been modified, so we fetch it every time, which slows the process down.
One possible solution could be:
Write database triggers that create/update prefixed-postfixed files (table-PK1-PK2-PK3-column) whenever records change
Monitor these file changes using FileSystemWatcher and expire the corresponding key (table-PK1-PK2-PK3-column) so that updated data is fetched
Problem: There would be around 100,000 users using any combination of data for 10 hours. So we will end up having a lot of files e.g. categ1-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-78-data250, categ2-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-33-data100, etc.
I am expecting at least 5 million files. Now it looks like a pathetic solution :(
Other possibilities are:
Call a web service asynchronously from the trigger, passing the key to be expired
Call an exe from the trigger without waiting for it to finish; this exe would then expire the key. (I have had some success with this approach on SQL Server using xp_cmdshell to call an exe; calling an exe from an Oracle trigger looks a bit difficult.)
Still sounds pathetic, doesn't it?
Any intelligent suggestions, please?
It's not clear (to me) whether the use of Memcached is mandatory or not. I would personally avoid it and instead use SqlDependency and OracleDependency. Both allow you to pass a DB command and get notified when the data that the command would return changes.
If Memcached is mandatory, you can still use these two classes to trigger the invalidation.
MS SQL Server has a "Change Tracking" feature that may be of use to you. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records on every update, insert, and delete on a table and lets you query for changes to records that have been made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than making your own tracking tables. This has been a feature since SQL Server 2008.
How to: Use SQL Server Change Tracking
Change Tracking only captures the primary keys of the tables and lets you query which fields might have been modified. Then you can join the tables on those keys to get the current data. If you want it to capture the data as well, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise edition.
Change Data Capture
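As a rough illustration of how the polling/invalidation side of this could look, here is a sketch in Python with pyodbc and pymemcache (purely to keep the example short; the same T-SQL works from .NET). The table name, primary-key column names, key format, and connection string are placeholder assumptions.

```python
# Rough sketch: poll SQL Server Change Tracking and delete the matching
# Memcached keys. Table name, PK column names, key format, and the
# connection string are placeholder assumptions.
import time

import pyodbc
from pymemcache.client.base import Client

mc = Client(('localhost', 11211))
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=dbhost;DATABASE=oltp;Trusted_Connection=yes')


def poll_changes(last_version):
    cur = conn.cursor()
    # Capture the current version first so nothing is missed between polls.
    cur.execute('SELECT CHANGE_TRACKING_CURRENT_VERSION()')
    current_version = cur.fetchone()[0]
    cur.execute(
        'SELECT CT.PK1, CT.PK2, CT.PK3 '
        'FROM CHANGETABLE(CHANGES dbo.MyTable, ?) AS CT', last_version)
    for pk1, pk2, pk3 in cur.fetchall():
        # Same key format the question proposes: table-PK1-PK2-PK3
        mc.delete('MyTable-%s-%s-%s' % (pk1, pk2, pk3))
    return current_version


if __name__ == '__main__':
    version = 0
    while True:
        version = poll_changes(version)
        time.sleep(5)  # polling interval
```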
I have no experience with Oracle, but I believe it has similar tracking functionality as well. This article might get you started:
20 Using Oracle Streams to Record Table Changes

How can I execute a function at an exact hour every day in ASP.NET?

I have a function that needs to be executed every day at an exact hour. How can I do that in ASP.NET? Do I need to use a web service, do I need to install something on the server, or something else? How can this be done?
You can use a Scheduled Task to execute a C# console app.
http://support.microsoft.com/kb/308569
There are a variety of options, you could use:
Windows Workflow Foundation that executes at a certain time (probably overkill for you)
SQL Server Agent (probably best suited for SQL-esque jobs)
Combine Web and Windows Services to Run Your ASP.NET Code at Scheduled Intervals
Scheduled Tasks in ASP.NET Web Applications using Timers
ASPNET is a request-driven model. It gets a request for a document, a resource, a page... and then it runs some logic to generate that document, resource, or page and transmits it to the requester. It's not set up to "run" by itself.
So you have a couple options:
The way to get something to run at a particular time on Windows is via Task Scheduler. It's available on any recent Windows. You supply the EXE. In your case it might make sense to write a console EXE.
Use schtasks.exe (a command-line tool) or the Control Panel applet to set up the task, the login, the time, and the repeats.
The EXE could be an ASPNET client that makes a request to an ASPNET-managed resource. Or it could just directly do the thing you want - maybe it's reading a database and creating a report.
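As a rough sketch of the "scheduled EXE that calls an ASPNET-managed resource" idea, here it is in Python purely for illustration (a small C# console app would do exactly the same); the URL is a placeholder.

```python
# Minimal daily client: the Task Scheduler job runs this script, and the
# script just requests the ASP.NET page/handler that does the real work.
# The URL is a placeholder.
import urllib.request


def main():
    url = 'https://example.com/jobs/run-nightly'
    with urllib.request.urlopen(url, timeout=300) as resp:
        print('Job endpoint returned HTTP %s' % resp.status)


if __name__ == '__main__':
    main()
```

It could be registered with, for example, schtasks /Create /SC DAILY /ST 03:00 /TN NightlyJob /TR "python C:\jobs\nightly_client.py" (task name and path hypothetical).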

Listing Currently Running Workflows in .Net 4.0

I've got a .Net 4.0 Workflow application hosted in WCF that takes a request to process some information. This information is passed to a secondary system via a web service, which returns a bool indicating that it's going to process that information.
My workflow then loops, sleeping for 5 minutes and then querying the secondary system to see if processing of the information is complete.
When it's complete, the workflow finishes.
I have this persisting in SQL, and it works perfectly.
My question is how do I retrieve a list of the persisted workflows in such a way that I can tie them back to the original request? I'd like my UI to be able to list the running workflows in a grid with the elapsed time that they've been run.
I've thought about storing the workflow GUID in my primary DB and generating the list that way, but what I'd really like is to be able to reconcile what I think is running against what the persistence store thinks is running.
I'd also like to be able to select a running workflow and kill it off or completely restart it if the user determines that it's gone screwy.
You can promote data from the workflow using the SqlWorkflowInstanceStore. The result is that they are stored alongside the workflow data in the InstancesTable, using the InstancePromotedPropertiesTable. Using the InstancePromotedProperties view is the easiest way of querying your data.
This blog post will show you the code you need.
Another option: use the WorkflowRuntime's GetAllServices().
Then you can loop through each one to pull out the data you need. I would cache the results, given that this may be an expensive operation. If you have only 100 or fewer running workflows, and only a few users on your page, don't bother caching.
This way you don't have to create a DAL or repository layer, especially if you are using SQL for persistence.
http://msdn.microsoft.com/en-us/library/ms594874(v=vs.100).aspx
