Running an R script to update a database

I have written a script in R that calculates a specific value for each of the S&P500 stocks. I would like to run this script every five minutes during trading hours and have the script upload the values to an online database.
I don't know anything about IT. I was thinking of running the script on AWS and having it write the values every five minutes to a SQL database, or an AWS-hosted version of SQL Server.
Do you guys have any ideas about how I should approach this problem, or any other methodologies I could use?
Thank you.

If you want to go the AWS route with a database then there are any number of ways to achieve this, but here is an outline of a fairly straightforward approach.
Launch a database. See e.g. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html.
Launch an EC2 instance. See e.g. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html.
Set up a cron job to launch your R script every 5 minutes during business hours from the EC2 instance. See e.g. https://www.cyberciti.biz/faq/how-do-i-add-jobs-to-cron-under-linux-or-unix-oses/.
Use an R package such as dplyr/dbplyr/DBI/odbc, as suggested by @rpolicastro, to connect to and write data to the database (a minimal sketch follows below).
I'm glossing over a lot of the complication involved in setting up the system in AWS, but hopefully this can get you started. Also, if you really care about never missing any data timepoints, then you may either need to set up some kind of redundancy, or code in the ability to look back in time and fill in missing timepoints.
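As a rough illustration of steps 3 and 4: for the cron job, an entry along the lines of */5 9-16 * * 1-5 Rscript /home/ec2-user/update_values.R (with the hours adjusted to your exchange's time zone) would run the script every five minutes during weekday trading hours. For the database write, a minimal R sketch might look like the following; the driver name, connection details, table name, and the compute_values() function are all placeholders for your own setup.

library(DBI)
library(odbc)

# Connect to the RDS instance (driver, endpoint and credentials are placeholders)
con <- dbConnect(odbc::odbc(),
                 Driver   = "MySQL",
                 Server   = "your-rds-endpoint.amazonaws.com",
                 Database = "stocks",
                 UID      = "dbuser",
                 PWD      = Sys.getenv("DB_PASSWORD"),
                 Port     = 3306)

# compute_values() stands in for your own calculation and should return a data frame
values <- compute_values()
values$timestamp <- Sys.time()

# Create the table on the first run, then append each five-minute snapshot
if (!dbExistsTable(con, "sp500_values")) {
  dbCreateTable(con, "sp500_values", values)
}
dbAppendTable(con, "sp500_values", values)

dbDisconnect(con)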

Related

How can I run an R Markdown script automatically at set intervals in the cloud?

I have an R script that calls and runs several R Markdown scripts, which need to be run every day, preferably every hour. The associated data files are continuously retrieved by MS Power Automate. All that is left for this to be fully automated is for the script to run by itself every day, or better, every hour. Preferably this would not require a physical PC to be on either. The .Rmd files and the associated data files are currently in a SharePoint folder. Preferably the solution is cheap or free as well. Any suggestions?
Thanks.

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if there were some way to have this script run automatically at a pre-defined time, and a buddy of mine said AWS is capable of doing this.
Is this true? If so, what is the specific feature / aspect of AWS that is able to do this, this way I can look more into it?
Thanks!
Two options come to mind here:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: Use this AMI.
To execute the script, R offers a command-line interface, Rscript. See e.g. here on how to set this up.
Go serverless: AWS Lambda is a serverless compute service. Currently R is not natively supported, but on the official AWS blog here they offer a step-by-step guide on how to run R. Basically you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (essentially a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your function outputs CSV files. To save them properly you will need to put them in a file store like AWS S3. You can do this in R via the aws.s3 package (see the sketch below). Another option would be to use the AWS SDK for Python, which is preinstalled in the Lambda function: you could e.g. write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function.
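As a minimal sketch of the aws.s3 route (the bucket name and file paths are placeholders, and scraped_data stands in for the data frame your script produces):

library(aws.s3)

# Write the data to Lambda's writable /tmp directory
write.csv(scraped_data, "/tmp/scraped_data.csv", row.names = FALSE)

# Upload the file to S3; aws.s3 typically picks up credentials from the
# usual AWS environment variables or the execution role
put_object(file   = "/tmp/scraped_data.csv",
           object = "scraped_data.csv",
           bucket = "my-example-bucket")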
IMHO the first option is easier to set up, but the second one is more robust.
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a fixed rate for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running Windows, one of the easier solutions is to write a .BAT script that runs your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and to your R script is correct.
Dockerize your script (write a Dockerfile, build an image; a minimal sketch follows this list)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
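As a rough sketch of the first step, a Dockerfile for an R script could look something like this; the base image tag, package list, and script name are illustrative, not prescriptive.

# Versioned R base image from the Rocker project
FROM rocker/r-ver:4.2.2

# Install the R packages the script needs (some packages also require
# system libraries to be installed via apt-get before this step)
RUN R -e "install.packages('data.table')"

# Copy the script into the image and run it when the container starts
COPY my_script.R /app/my_script.R
CMD ["Rscript", "/app/my_script.R"]

After building and tagging the image locally, docker push sends it to the ECR repository used in the second step.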
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

Using MySQL as cache for quantmod getSymbols

I've been using quantmod's getSymbols function a lot and would like to reduce the load on the external data providers, as well as the time it takes to execute longer code loops because of network latency.
Ideal would be a function that takes a list of symbols (like getSymbols), downloads them from the provider configured in 'setSymbolLookup' and saves them in a MySQL database for easy retrieval later using getSymbols.MySQL.
A major bonus would be if another function (or the same one) allowed only downloading the difference since the last update.
Alternatively a type of proxy where the symbol is downloaded if it doesn't already exist in a local MySQL database/cache would also be good.
Has anyone developed something like this, or come across any documentation on how to do it? I've searched around, but the closest I can find are some questions about how to use MySQL as an input source.
Thanks in advance!
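A rough outline of the "download if not cached" proxy described above could look like this in R; the function name get_symbols_cached, the connection details, and the stored table layout (plain data frames rather than whatever schema getSymbols.MySQL expects) are all illustrative, not an existing package.

library(quantmod)
library(DBI)
library(RMySQL)

# Illustrative caching proxy: read from MySQL if cached, otherwise download and store
get_symbols_cached <- function(symbols, con, src = "yahoo") {
  out <- lapply(symbols, function(sym) {
    tbl <- gsub("[^A-Za-z0-9]", "_", sym)                    # sanitise the table name
    if (dbExistsTable(con, tbl)) {
      df <- dbReadTable(con, tbl)                            # cache hit
      xts::xts(as.matrix(df[, -1]), order.by = as.Date(df$date))
    } else {
      x <- getSymbols(sym, src = src, auto.assign = FALSE)   # cache miss: download
      df <- data.frame(date = index(x), coredata(x))
      names(df) <- gsub("\\.", "_", names(df))               # avoid dots in column names
      dbWriteTable(con, tbl, df, row.names = FALSE)
      x
    }
  })
  setNames(out, symbols)
}

# Usage (credentials are placeholders):
# con <- dbConnect(RMySQL::MySQL(), dbname = "quotes", user = "...", password = "...")
# prices <- get_symbols_cached(c("AAPL", "MSFT"), con)

Incremental updates (only fetching the difference since the last stored date) would need an extra branch that checks the latest cached date and appends only the newer rows.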

Integration Testing best practices

Our team has hundreds of integration tests that hit a database and verify results. I've got two base classes for all the integration tests, one for retrieve-only tests and one for create/update/delete tests. The retrieve-only base class regenerates the database during the TestFixtureSetUp so it only executes once per test class. The CUD base class regenerates the database before each test. Each repository class has its own corresponding test class.
As you can imagine, this whole thing takes quite some time (approaching 7-8 minutes to run and growing quickly). Having this run as part of our CI (CruiseControl.Net) is not a problem, but running locally takes a long time and really prohibits running them before committing code.
My question is are there any best practices to help speed up the execution of these types of integration tests?
I'm unable to execute them in memory (à la SQLite) because we use some database-specific functionality (computed columns, etc.) that isn't supported in SQLite.
Also, the whole team has to be able to execute them, so running them on a local instance of SQL Server Express or something could be error prone unless the connection strings are all the same for those instances.
How are you accomplishing this in your shop and what works well?
Thanks!
Keep your fast (unit) and slow (integration) tests separate, so that you can run them separately. Use whatever method for grouping/categorizing the tests is provided by your testing framework. If the testing framework does not support grouping the tests, move the integration tests into a separate module that has only integration tests.
The fast tests should take only a few seconds to run in total and should have high code coverage. These kinds of tests allow the developers to refactor ruthlessly, because they can make a small change, run all the tests, and be very confident that the change did not break anything.
The slow tests can take many minutes to run and they will make sure that the individual components work together right. When the developers do changes that might possibly break something which is tested by the integration tests but not the unit tests, they should run those integration tests before committing. Otherwise, the slow tests are run by the CI server.
In NUnit you can decorate your test classes (or methods) with an attribute, e.g.:
[Category("Integration")]
public class SomeTestFixture{
...
}
[Category("Unit")]
public class SomeOtherTestFixture{
...
}
You can then stipulate in the build process on the server that all categories get run and just require that your developers run a subset of the available test categories. What categories they are required to run would depend on things you will understand better than I will. But the gist is that they are able to test at the unit level and the server handles the integration tests.
I'm a Java developer but have dealt with a similar problem. I found that running a local database instance works well because of the speed (no data to send over the network) and because this way you don't have contention on your integration test database.
The general approach we use to solving this problem is to set up the build scripts to read the database connection strings from a configuration file, and then set up one file per environment. For example, one file for WORKSTATION, another for CI. Then you set up the build scripts to read the config file based on the specified environment. So builds running on a developer workstation run using the WORKSTATION configuration, and builds running in the CI environment use the CI settings.
It also helps tremendously if the entire database schema can be created from a single script, so each developer can quickly set up a local database for testing. You can even extend this concept to the next level and add the database setup script to the build process, so the entire database setup can be scripted to keep up with changes in the database schema.
We have an SQL Server Express instance with the same DB definition running for every dev machine as part of the dev environment. With Windows authentication the connection strings are stable - no username/password in the string.
What we would really like to do, but haven't yet, is see if we can get our system to run on SQL Server Compact Edition, which is like SQLite with SQL Server's engine. Then we could run them in-memory, and possibly in parallel as well (with multiple processes).
Have you done any measurements (using timers or similar) to determine where the tests spend most of their time?
If you already know that the database recreation is why they're time consuming a different approach would be to regenerate the database once and use transactions to preserve the state between tests. Each CUD-type test starts a transaction in setup and performs a rollback in teardown. This can significantly reduce the time spent on database setup for each test since a transaction rollback is cheaper than a full database recreation.

Munin: Is it possible to recover lost graphs?

I'm not extremely familiar with how munin works, so I apologize if this is obvious.
I've been using munin for a couple of my projects now, and I've run into this twice: I lose all my munin graphs that were generated from past events. This just happened to me this afternoon. Now I only have a record of system events since this afternoon.
Are these graphs recoverable? Is there data stored somewhere that's used to generate these graphs?
If it's unrecoverable, I would like to know what could possibly have caused this to occur. Granted, each time this has happened, I was messing with my munin config settings. In this case, I was adding new servers to be logged by munin... I don't see how doing that would cause munin to lose all data on my other servers.
Thanks.
The graphs are generated from .rrd databases, which are updated during 'munin-update' (usually called from 'munin-cron'). You can find those files in the directory specified by 'dbdir' in munin.conf (/var/db/munin on my site). If your graphs only show events since this afternoon, my guess is that the .rrd files got deleted, recreated, or corrupted. You should be able to restore them from backup and then call 'munin-graph' and 'munin-html'. See the respective man pages for those commands.
