I have a jupyter notebook containing sensitive data that I would like to not be cached inside the notebook. This would avoid jupyter's tendency to mix data and code.
In a notebook I can reset all variables using
%reset
Is there any way to run this automatically on exit, or on shutdown of the notebook or server?
Or is there a command-line script that could be run over a .ipynb, e.g. in a nightly cron job, to purge the file of stored variables (or - even better - only certain variables)?
Thanks!
nbclean allows some fairly complex customization on what gets cleaned and altered in the resulting notebook. You could do cron job with a script running that on your schedule. Or use Github actions to trigger upon actions such as push.
Related
Is it possible to modify the contents of a notebook in the notebook startup code? I want to run some init code and add "header" cells to every notebook on a machine based on the code, for instance grab the hash of the current head from a local git repo, or pull a file from S3 to the local file system.
I can put a bunch of scripts, either .py or .ipy in the ~/.ipython/profile_default/startup/ directory and I'd like to modify the notebook that is currently being opened using those scripts (or some other scripts if that's possible).
According to the docs the shell has already been setup when those scripts run, so I'm thinking there should be some way of accessing, at a minimum, the local path of the notebook that was opened. I could then use nbformat (github) to modify the contents.
Alternatively I could use NotebookApp or ContentsManager to possibly modify the running notebook, but I'm not exactly sure how to do that and the notebook docs are pretty light on the actual API for those classes. This might not be possible as the init code is executed in the kernel, which does not know what the front end is, it could be the case that the kernel is connected to a console not to a notebook or to both a notebook and a console.
So
can I access the filename of the current notebook in a startup script?
should I rather be looking to modify the notebook cells through NotebookApp, FileContentsManager or some other internal class?
related
There is an open issue for template files https://github.com/jupyter/notebook/issues/332 -- this is not what I'm looking for, the template files are static, I need to modify the notebook based on the result of a computation
I have an R script that I run every day that scrapes data from a couple of different websites, and then writes the data scraped to a couple of different CSV files. Each day, at a specific time (that changes daily) I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this everyday (takes ~10-15 minutes a day). I would love it if someway I could have this script run automatically at a pre-defined specific time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature / aspect of AWS that is able to do this, this way I can look more into it?
Thanks!
Two options come to mind thinking about this:
Host a EC2 Instance with R on it and configure a CRON-Job to execute your R-Script regularly.
One easy way to get started: Use this AMI.
To execute the script R offers a CLI rscript. See e.g. here on how to set this up
Go Serverless: AWS Lambda is a hosted microservice. Currently R is not natively supported but on the official AWS Blog here they offer a step by step guid on how to run R. Basically you execute R from Python using the rpy2-Package.
Once you have this setup schedule the function via CloudWatch Events (~hosted cron-job). Here you can find a step by step guide on how to do that.
One more thing: You say that your function outputs CSV files: To save them properly you will need to put them to a file-storage like AWS-S3. You can do this i R via the aws.s3-package. Another option would be to use the AWS SDK for python which is preinstalled in the lambda-function. You could e.g. write a csv file to the /tmp/-dir and after the R script is done move the file to S3 via boto3's S3 upload_file function.
IMHO the first option is easier to setup but the second-one is more robust.
It's a bit counterintuitive but you'd use Cloudwatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that a Lambda doesn't support R so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution as there are, potentially, quite a few moving parts.
#stdunbar is right about using CloudWatch Events to trigger a lambda function. You can set a frequency of the trigger or use a Cron. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running windows, one of the easier solution is to write a .BAT script to run your R-script and then use Window's task scheduler to run as desired.
To call your R-script from your batch file use the following syntax:
C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify the path to the "RScript" application and your R code is correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun-up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud
I have a shiny app that continuously runs on a server, but this app uses SQL data tables and needs to check for updates once a day. Right now, with no batch file in place, I have to manually stop the app, run an R script that checks for these updates, then re-run the app. The objects that I want to update are currently stored in RStudio's global environment. I've been looking around at modifying .RData files because I'm running out of options. Any ideas?
EDIT: I'm aware that I probably have to shut down the app for a few minutes to refresh the tables, but is there a way I can do something like this using a batch file?
I have been trying to run a script automatically using the steps that I found online.
I am trying to run the following R script called AUTO.R
Here is what the script contains:
library(quantmod)
obs <- last(Ad(getSymbols("SPY", auto.assign=FALSE)))
saveRDS(obs, "SAMPLE.rds")
When I build the application it prints Workflow completed
I believe all is well until the time comes to run the script. The alarm pop-up in my desktop is displayed from Calendar but nothing runs. After a few minutes the folder where the .rds file should be saved does not contain anything.
Two suggested changes:
Your Automator task should be more like just /usr/local/bin/Rscript --vanilla /Users/rimeallthetime/Desktop/AUTO.R
You should explicitly set the path in saveRDS; i.e. saveRDS(obs, "/Users/rimeallthetime/Desktop/SAMPLE.rds")
Honestly, though, you should at least make a ~/bin dir (i.e. a directory called bin under your home directory, so in your case /Users/rimeallthetime/bin and put both the workflow and R script in there, and I'd also suggest creating another directory for output files vs the desktop.
UPDATE
I just let the calendar event run and this is really a crude way to automate what you want to do. You'd be better off in the long run using launchd, that way it's fully automated and requires no human intervnention at all (but you may need to adjust your script to send you a notification or "append" to the rds file).
Is there a way to pass commands (from a shell) to an already running R-runtime/R-GUI, without copy and past.
So far I only know how to call R via shell with the -f or -e options, but in both cases a new R-Runtime will process the R-Script or R-Command I passed to it.
I rather would like to have an open R-Runtime waiting for commands passed to it via whatever connection is possible.
What you ask for cannot be done. R is single threaded and has a single REPL aka Read-eval-print loop which is, say, attached to a single input as e.g. the console in the GUI, or stdin if you pipe into R. But never two.
Unless you use something else as e.g. the most excellent Rserve which (when hosted on an OS other than Windoze) can handle multiple concurrent requests over tcp/ip. You may however have to write your custom connection. Examples for Java, C++ and R exist in the Rserve documentation.
You can use Rterm (under C:\Program Files\R\R-2.10.1\bin in Windows and R version 2.10.1). Or you can start R from the shell typing "R" (if the shell does not recognize the command you need to modify your path).
You could try simply saving the workspace from one session and manually loading it into the other one (or any kind of variation on this theme, like saving only the objects you share between the 2 sessions with saveRDS or similar). That would require some extra load and save commands but you could automatise this further by adding some lines in your .RProfile file that is executed at the beginning of every R session. Here is some more detailed information about R on startup. But I guess it all highly depends on what are you doing inside the R sessions. hth