Security considerations for read.csv() from web url in R? - r

A lot of R's read* functions permit urls to external data sources.
Are there any built-in protections in R itself, or in R's read* functions, against malicious files being read in? Or is it a case of 'user beware'?
What I know so far
We can run some tests. For example, pointing read.csv() at something that is not a CSV, such as read.csv(url("http://www.hello.com")), returns the raw HTML for that web page. This suggests that read.csv() doesn't validate that the file is a CSV, but tries to parse it regardless.
I am not sure what would happen in the hypothetical read.csv(url("www.fake-microsoft.com/nasty-viris.exe")) (where nasty-viris.exe is a real virus). Would read.csv() try to parse the asset (in this case an exe file) as a CSV, or would it download it successfully first?
Short of setting up a virtual machine and actually testing, I am not sure how to research how much vetting (if any) is done by read* functions on external data sources.
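One way to probe this without a virtual machine is to download to a temporary file and inspect it before parsing. The sketch below is only an illustration under stated assumptions: looks_like_csv() is a hypothetical helper (not part of R), it rejects files that start with the "MZ" bytes used by Windows executables, and it requires every line to have the same number of comma-separated fields (more than one, so single-column files are rejected too). It is a cheap sanity check, not a security guarantee.

```r
# Hypothetical helper: a cheap sanity check before handing a file to read.csv().
looks_like_csv <- function(path, sep = ",") {
  # Windows executables start with the two bytes "MZ" (0x4D 0x5A).
  magic <- readBin(path, what = "raw", n = 2L)
  if (length(magic) == 2L && identical(magic, as.raw(c(0x4d, 0x5a)))) {
    return(FALSE)
  }
  # Count fields per line; a well-formed CSV has the same count on every line.
  n_fields <- tryCatch(
    suppressWarnings(
      count.fields(path, sep = sep, quote = "\"", comment.char = "")
    ),
    error = function(e) NA_integer_
  )
  length(n_fields) > 0L && !anyNA(n_fields) &&
    length(unique(n_fields)) == 1L && n_fields[1] > 1L
}

# Local demonstration with temporary files (no network access needed).
good <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2", "3,4"), good)

fake_exe <- tempfile(fileext = ".exe")
writeBin(as.raw(c(0x4d, 0x5a, 0x90, 0x00)), fake_exe)

looks_like_csv(good)
looks_like_csv(fake_exe)
```

For a remote file, the same check could be applied to the result of download.file(some_url, tmp, mode = "wb"), where some_url and tmp are placeholders.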

Related

Deploy R function with directory as argument as executable or web application

I've written an R function (code available on demand) that improves some analysis workflows in my research group (~10 people), and now I would like to deploy it so it's easily accessible to the rest of the group. I am the only person in the group with any knowledge of R.
In a nutshell, the function does the following:
Takes one argument, the directory in which to search for microscopy images in proprietary formats
Asks user (with readline()) which channels should be analysed and what they are named
Generates several histograms and scatter plots of intensity levels per image channel after various normalisation steps; these are deposited in a .pdf file for each image stack
Performs linear regression and generates a .txt file per image stack
The .pdf and .txt files get output to the directory the user specifies as the argument when running the function. I want to turn it into something somewhat more user-friendly, essentially removing the need to install R + function dependencies. For the sake of universality I would like to deploy it as a web application that takes a .zip file of the images as input, extracts them and then runs the function with that newly created directory as the argument. When it's done, it should output a .zip file of the created .pdfs and .txts. I'm not sure how feasible this is. I have looked into using Shiny for this but I'm having a hard time figuring out how to apply it as I do not have experience with it. I have experience in unix server administration and have a remote server that I can play around with.
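Independently of the web framework, the zip-in, zip-out step described above can be sketched in base R. This is a rough sketch under assumptions: process_zip() and its analyse argument are hypothetical names (analyse stands in for the existing function that takes a directory), utils::zip() needs an external zip program on the server, and the flags shown are Info-ZIP flags.

```r
# Hypothetical wrapper: unzip the upload, run the analysis, zip the results.
process_zip <- function(zip_in, zip_out, analyse) {
  # Extract the uploaded images into a fresh temporary directory.
  work_dir <- tempfile("images_")
  dir.create(work_dir)
  on.exit(unlink(work_dir, recursive = TRUE), add = TRUE)
  utils::unzip(zip_in, exdir = work_dir)

  # Run the existing analysis function on that directory.
  analyse(work_dir)

  # Collect the generated .pdf and .txt files and bundle them.
  outputs <- list.files(work_dir, pattern = "\\.(pdf|txt)$",
                        full.names = TRUE, recursive = TRUE)
  # -j drops directory names, -9 is maximum compression, -q is quiet.
  utils::zip(zip_out, files = outputs, flags = "-j9Xq")
  invisible(zip_out)
}
```

A web front end would then only need to pass the uploaded file's path as zip_in and offer zip_out for download.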
The second option would be somewhat less universal, but it would be to deploy it as a Windows executable (I am the only person in my group not to use Windows as a daily OS, but I do have access to a Windows environment). Here, ideally the executable should ask the user to navigate to a directory, then use that directory as the argument to the function and output the generated files in said directory.
As my experience with R is limited, I cannot tell which option would be more feasible and worth it in the long run. My gut feeling says the web application would be the most flexible and user-friendly, but more challenging to deploy. Is either of my ideas implementable, and if so, what would be a good way to go about it?
Thanks in advance!

How can I share (GitHub) my code (R) with sensitive information (passwords)?

Imagine you are using a package that uses an access token. Maybe one from rOpenSci.
My current approach is to source, at the beginning of the script, a file that is listed in .gitignore. It therefore gets ignored and I can share the repository without worries.
source("never-commit-password.R")
However, there is still a danger that the password might be uploaded via .RData if I leave it in the workspace.
What is the leading practice that trades off convenience against safety?
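One option that avoids both the committed file and the workspace is to read the secret from an environment variable at the point of use. The following is a minimal sketch, not an established convention of any particular package: get_token() and the variable name MY_API_TOKEN are hypothetical. R reads a user-level .Renviron file at startup, so the variable can be defined there, outside the repository.

```r
# Hypothetical helper: fetch a secret from an environment variable.
get_token <- function(var = "MY_API_TOKEN") {
  token <- Sys.getenv(var, unset = NA_character_)
  # Fail early with a clear message instead of sending an empty token.
  if (is.na(token) || !nzchar(token)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  token
}
```

Calling get_token() directly inside the function call that needs it, rather than assigning the result to a variable, keeps the value out of the global workspace and therefore out of .RData.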

How do I connect to the data in my folder without writing the specific path, like "C:\\Users\\Dima\\Desktop\\NewData\\..."?

I am writing a script that requires data stored in a folder on my computer.
Eventually, though, this script will be used on another computer, by another person.
I can't ask him to change all the paths to the data in the script.
How do I connect to the data in my folder without writing the specific path,
like "C:\Users\Dima\Desktop\NewData\..."?
The best way of making your code shareable depends upon your use case.
As Carl Witthoft pointed out, most code should be encapsulated in functions. These functions can then be bundled into packages and easily redistributed to other people's machines. Writing packages is easier than you think.
For one-off analyses, scripts are appropriate. How you make them user-independent depends on who your users are. If you are sharing the script with colleagues, try to keep your data on a network drive; then the link to the data will be the same for everyone. If you are sharing your script with the world, then keep your data on the internet, and the link to the data will be a hyperlink, again the same for everyone.
If you are sharing your script with a few people who don't have access to a common drive, and you can't put your data on the internet, then some directory manipulation is acceptable.
Change your working directory to the root of where your project files are.
setwd("c:/Users/Dima/My Project")
Then you can reference the location of the data using relative paths.
data_file <- "Data/My data file.csv"
my_data <- read.csv(data_file)
Assuming that you keep the directory structure within your project the same, then you only need to change the call to setwd on each machine.
Also note that the special location "~" refers to your user home directory. Try
normalizePath("~")
That way, if you keep your project in that location, you can avoid reference to "Dima" entirely.
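Putting the two ideas together, a small helper can build paths from the home directory so that no user name appears in the script. This is a minimal sketch: project_path() is a hypothetical helper, and the "My Project" and "Data" folder names are taken from the example above.

```r
# Hypothetical helper: build a path to a file under the project root.
# By default the root is "My Project" inside the user's home directory.
project_path <- function(..., root = file.path(path.expand("~"), "My Project")) {
  file.path(root, ...)
}

data_file <- project_path("Data", "My data file.csv")
# my_data <- read.csv(data_file)  # uncomment once the file exists
```

Because file.path() joins components with "/", which R accepts on Windows as well, the same script should run unchanged on each machine as long as the project sits in the same place relative to the home directory.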

I use R to save images and .csv files; can I use R to upload them to a website (I currently use FileZilla to do it manually)?

First I should say that a lot of this is over my head, so I apologize in advance for using incorrect terminology and potentially asking an unclear question. I'm doing my best.
Also, I saw this post; is RCurl the tool I want to use for this task?
Every day for 4 months I'll be analyzing new data and generating .csv and .png files that need to be uploaded to a web site so that other team members can check them. I've (nearly) automated all of the data collection, data downloading, analysis, and file saving. The analysis is carried out in R, and R saves the files. Currently I use FileZilla to manually upload the new files to the website. Is there a way to use R to upload the files to the web site, so that I don't have to open FileZilla and drag and drop files?
It'd be nice to run my R code and walk away, knowing that once it finishes running, the newly saved files will automatically be put on the website.
Thanks for any help!
You didn't specify which protocol you use to upload your files with FileZilla. I assume it is FTP. If so, you can use the ftpUpload function from RCurl:
library(RCurl)
ftpUpload("yourfile", "ftp://ftp.yourserver.foo/yourfile",
          userpwd = "username:passwd")
RCurl also has methods for scp, and ftpUpload should also support sftp.
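To upload every newly saved file in one go, the call above can be wrapped in a loop. This is a rough sketch under assumptions: ftp_targets() and upload_outputs() are hypothetical helpers, the server URL is a placeholder, and the upload itself requires the RCurl package and a reachable FTP server.

```r
# Hypothetical helper: map local files to their destination URLs on the server.
ftp_targets <- function(files, base_url) {
  # Drop any trailing slash from the base URL, then append each file name.
  paste0(sub("/+$", "", base_url), "/", basename(files))
}

# Hypothetical helper: upload every .csv and .png found in out_dir.
upload_outputs <- function(out_dir, base_url, userpwd) {
  files <- list.files(out_dir, pattern = "\\.(csv|png)$", full.names = TRUE)
  targets <- ftp_targets(files, base_url)
  for (i in seq_along(files)) {
    RCurl::ftpUpload(files[i], targets[i], userpwd = userpwd)
  }
  invisible(targets)
}

# Example call (placeholders, not run here):
# upload_outputs("output", "ftp://ftp.yourserver.foo/results", "username:passwd")
```

Calling upload_outputs() as the last line of the analysis script would let it run unattended.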

Apache log file format analysis by R

I am trying to analyse web log files in R. I am comfortable dealing with the dates and bytes, wherever numeric data is present, but I struggle with the strings.
From the log file (which is in CSV format), I want to identify each particular user (with the help of the IP address and user agent) and that user's total spending on the web page.
There are numerous libraries to do this kind of analysis, although I could find none in R. A Google search for "parse apache logfile" yielded a library in Perl, and "python parse apache logfile" yields the Scratchy library. Both rely on regular expressions to parse the contents of the file.
From here there are two ways to deal with the apache logfile:
Call Perl or Python from R, either using a direct link or using a system call (this is simpler).
Take the idea from the Perl or Python library and use it to implement R versions of the functions. This will take a lot of time.
You refer to a csv file, but I think the libraries above work with the original text file with the Apache log, so I'd use those, and not your csv file.
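For a sense of what the second option involves, here is a minimal regular-expression sketch in R. It is only an illustration, not a port of either library: it assumes the original text log uses Apache's combined log format, parse_apache_log() and the column names are hypothetical, and summing bytes per IP and agent is just one possible per-user total.

```r
# Hypothetical parser for lines in Apache's combined log format.
parse_apache_log <- function(lines) {
  fields <- c("ip", "ident", "user", "time", "request",
              "status", "bytes", "referer", "agent")
  pattern <- paste0(
    "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" ",
    "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$"
  )
  m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  # Keep only lines where the full pattern and all nine groups matched.
  ok <- lengths(m) == length(fields) + 1L
  if (!any(ok)) stop("no lines matched the combined log format")
  parsed <- do.call(rbind, lapply(m[ok], function(x) x[-1]))
  out <- as.data.frame(parsed, stringsAsFactors = FALSE)
  names(out) <- fields
  out$status <- as.integer(out$status)
  out$bytes <- suppressWarnings(as.integer(out$bytes))  # "-" becomes NA
  out
}

# Two made-up sample lines from the same IP and agent.
sample_lines <- c(
  '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"',
  '127.0.0.1 - frank [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 100 "-" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
logs <- parse_apache_log(sample_lines)

# Total bytes per (IP, agent) pair, used here as a stand-in for "user".
totals <- aggregate(bytes ~ ip + agent, data = logs, FUN = sum)
```

Lines that do not match the pattern are silently dropped, so a real parser would need to report them and handle other log formats.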
In addition, this SO post mentions an answer by @doug, where he states that he has created some functions to create visualizations of Apache logfile data, parsed by Python. Maybe you could send him a message or mail and see if he is willing to share the code.
Logfile analysis in R is an interesting topic we have had before; you can find our discussion right here. Maybe this discussion will also help you adjust to the SO etiquette in order to get better feedback (not to take anything away from yours, Paul).

Resources