Running R notebook from command line

Ok, I'm trying to run this script through a batch file in Windows (Server 2016), but it just starts pushing newlines and dots to the console:
"c:\Program Files\R\R-3.5.1\bin\rscript.exe" C:\projects\r\HentTsmReport.R
The script works like a charm in RStudio: it reads an html file (a TSM backup report), transforms the content into a data frame, and then saves one of the html tables as a csv file.
Why do I just get a gunk of nothing on the screen, and no csv output, when running through Rscript.exe?
My goal is to run this script through a scheduled task each day, keeping a history of the backup status in a table so I can track failed backups through Tivoli.
This is the script in the R-file:
library(XML)
library(RCurl)
#library(tidyverse)
library(rlist)
theurl <- getURL("file://\\\\chill\\um\\backupreport20181029.htm",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
head(tables)
test <- tables[[5]] # select table number 5 ([[ ]] returns the data frame itself; [ ] would return a one-element list)
write.csv(test, file = "c:\\temp\\backupreport.csv")
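One likely culprit (an assumption, since the script runs fine interactively): under Rscript.exe every top-level expression is auto-printed to stdout, so the bare `head(tables)` line dumps the whole parsed table list to the console, which can look like a stream of dots and newlines. A minimal sketch of the difference, using a stand-in list instead of the real parsed tables:

```r
# Sketch: a bare head(...) at top level is auto-printed under Rscript, while
# wrapping it in invisible() (or simply deleting the line) keeps stdout clean.
# fake_tables stands in for the readHTMLTable() result.
fake_tables <- list(a = data.frame(x = 1:3), b = data.frame(y = 4:6))

printed <- capture.output(head(fake_tables))            # what Rscript echoes
quiet   <- capture.output(invisible(head(fake_tables))) # nothing echoed

cat("auto-printed lines:", length(printed), "| suppressed:", length(quiet), "\n")
```

If the csv still does not appear, check that `c:\temp` exists and that the account running the scheduled task can write to it.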

Related

is there any method to defer the execution of code in r?

I have created the following function to read a csv file from a given URL:
function() {
  s <- 1
  # first get the bhav copy
  today <- c(); ty <- c(); tm <- c(); tmu <- c(); td <- c()
  # get the URL first
  today <- Sys.Date()
  ty  <- format(today, format = "%Y")
  tm  <- format(today, format = "%b")
  tmu <- toupper(tm)
  td  <- format(today, format = "%d")
  dynamic.URL <- paste("https://www.nseindia.com/content/historical/EQUITIES/",
                       ty, "/", tmu, "/cm", td, tmu, ty, "bhav.csv.zip", sep = "")
  file.string <- paste("C:/Users/user/AppData/Local/Temp/cm", td, tmu, ty, "bhav.csv")
  download.file(dynamic.URL, "C:/Users/user/Desktop/bhav.csv.zip")
  bhav.copy <- read.csv(file.string)
  return(bhav.copy)
}
If I run the function, it immediately fails with "file.string not found". But when I run it again after some time (a few seconds), it executes normally. I think that when download.file executes, it transfers control to read.csv, which tries to load a file that has not yet been properly saved. When I run it again later, the download tries to overwrite the existing file, which it cannot, and read.csv properly loads the saved file.
I want the function to work the first time I run it. Is there any way, or a function, to defer the action of read.csv until the file is properly saved? Something like this:
download.file(dynamic.URL, "C:/Users/user/Desktop/bhav.csv.zip")
wait......
bhav.copy <- read.csv(file.string)
Ignore the fact that the destfile in download.file is different from file.string; that is due to how my system (Windows 7) handles the download.
Very many thanks for your time and effort...
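One way to get the "wait......" step is to poll for the file. A hedged sketch; `wait_for_file` is my own name, not a base R or package function. Note that `download.file()` normally blocks until the transfer finishes, so the real gap is often the extraction or antivirus-scan step; polling is a defensive workaround either way.

```r
# Block until `path` exists, checking every `interval` seconds, giving up
# after `timeout` seconds.
wait_for_file <- function(path, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  while (!file.exists(path)) {
    if (Sys.time() > deadline) stop("timed out waiting for: ", path)
    Sys.sleep(interval)
  }
  invisible(TRUE)
}

# Usage inside the question's function:
#   download.file(dynamic.URL, "C:/Users/user/Desktop/bhav.csv.zip")
#   wait_for_file(file.string)          # blocks until the csv shows up
#   bhav.copy <- read.csv(file.string)
```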

Create an automated R script using taskscheduleR

I am trying to create some automated R scripts using the taskscheduleR package. I have created the following script:
library(lubridate)
setwd("C:/Users/Marc/Desktop/")
create_df <- function() {
  list <- c(1, 2, 3)
  df <- data.frame(list)
  x <- format(Sys.time(), "%S")
  name <- paste0("name_", x, ".csv")
  write.csv(df, name)
}
create_df()
That can be fired up with the following:
myscript <- "C:/Users/Marc/Dropbox/PROJECTEN/Lopend/taskschedulR_test/test.R"
taskscheduler_create(taskname = "myfancyscript", rscript = myscript,
                     schedule = "ONCE", starttime = format(Sys.time() + 62, "%H:%M"))
However when I execute it nothing happens. Any thoughts on how I can get this running?
It worked for me; I've now got a .csv called "name_03". I have the script inside the folder that the output goes into, unlike yours, which is in your Dropbox. You can check the event log by looking at the History tab in the Task Scheduler; to open it, type this into R:
system("control schedtasks")
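One common failure mode worth ruling out (an assumption about your setup): under the Task Scheduler the script does not start in your RStudio working directory, so relative paths and `setwd()` surprises can make the csv land somewhere unexpected, or nowhere. A sketch of the same script with the output directory passed in explicitly:

```r
# Same data frame as the question, but written to an absolute path built with
# file.path(), so it does not depend on the scheduler's working directory.
create_df <- function(out_dir) {
  df <- data.frame(values = c(1, 2, 3))
  stamp <- format(Sys.time(), "%S")
  out_file <- file.path(out_dir, paste0("name_", stamp, ".csv"))
  write.csv(df, out_file, row.names = FALSE)
  out_file                               # return the path for logging
}

# create_df("C:/Users/Marc/Desktop")     # absolute path instead of setwd()
```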

Reading MS Access (.mdb, .accdb) into R; Mac to PC conversion

I am working on a program that pulls data out of .mdb and .accdb files and creates the appropriate tables in R.
My working program on my Mac looks like this:
library(Hmisc)
p <- '/Users/Josh/Desktop/Directory/'
mdbfilename <- 'x.mdb'
mdbconcat <- paste(p, mdbfilename, sep = "")
mdb <- mdb.get(mdbconcat)
mdbnames <- data.frame(mdb.get(mdbconcat, tables = TRUE))
list2env(mdb, .GlobalEnv)
accdbfilename <- 'y.accdb'
accdbconcat <- paste(p, accdbfilename, sep = '')
accdb <- mdb.get(accdbconcat)
accdbnames <- data.frame(mdb.get(accdbconcat, tables = TRUE))
list2env(accdb, .GlobalEnv)
This works fine on my Mac, but on the PC I'm developing this for, I get this error message:
Error in system(paste("mdb-tables -1", file), intern = TRUE) :
'mdb-tables' not found
I've thought a lot about using RODBC, but this program lets me arrange the tables in a way where subsequent querying and dplyr functions work. Is there any way to get these functions to work on a Windows machine?
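The error comes from Hmisc's `mdb.get()` shelling out to the mdbtools command-line programs (hence "mdb-tables not found"); mdbtools is a Unix tool and is not normally available on Windows. On Windows, RODBC with the Access ODBC driver is the usual route, and the result can still end up as named data frames in the global environment, like the Mac version. A sketch, assuming the 32/64-bit Access ODBC driver matching your R build is installed; `access_conn_string` is my own helper name:

```r
# Build a DSN-less connection string for the Microsoft Access ODBC driver.
access_conn_string <- function(path) {
  paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=", path)
}

# Hypothetical usage on the Windows machine (needs the RODBC package and the
# Access driver, so it is left commented here):
# library(RODBC)
# ch   <- odbcDriverConnect(access_conn_string("C:/Users/Josh/Desktop/Directory/x.mdb"))
# tabs <- sqlTables(ch, tableType = "TABLE")$TABLE_NAME
# mdb  <- setNames(lapply(tabs, function(t) sqlFetch(ch, t)), tabs)
# list2env(mdb, .GlobalEnv)   # same end state as the Mac version
# close(ch)
```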

Is it possible to install pandoc on windows using an R command?

I would like to download and install pandoc on a windows 7 machine, by running a command in R. Is that possible?
(I know I can do this manually, but when I show this to students, the more steps I can organize within an R code chunk, the better.)
What about simply downloading the most recent version of the installer and starting that from R:
a) Identify the most recent version of Pandoc and grab the URL with the help of the XML package:
library(XML)
page <- readLines('http://code.google.com/p/pandoc/downloads/list', warn = FALSE)
pagetree <- htmlTreeParse(page, error=function(...){}, useInternalNodes = TRUE, encoding='UTF-8')
url <- xpathSApply(pagetree, '//tr[2]//td[1]//a ', xmlAttrs)[1]
url <- paste('http', url, sep = ':')
b) Or apply some regexp magic thanks to #G.Grothendieck instead (no need for the XML package this way):
page <- readLines('http://code.google.com/p/pandoc/downloads/list', warn = FALSE)
pat <- "//pandoc.googlecode.com/files/pandoc-[0-9.]+-setup.exe"
line <- grep(pat, page, value = TRUE); m <- regexpr(pat, line)
url <- paste('http', regmatches(line, m), sep = ':')
c) Or simply check the most recent version manually if you'd feel like that:
url <- 'http://pandoc.googlecode.com/files/pandoc-1.10.1-setup.exe'
Download the file as binary:
t <- tempfile(fileext = '.exe')
download.file(url, t, mode = 'wb')
And simply run it from R:
system(t)
Remove the needless file after installation:
unlink(t)
PS: sorry, only tested on Windows XP
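For completeness: these days the download-and-run steps above are wrapped by the installr package on CRAN, which locates and launches the current Pandoc installer for you; a simpler one-liner to show students, assuming Windows and internet access:

```r
# Hedged alternative using the installr package (kept commented because it
# downloads and launches an installer):
# install.packages("installr")
# installr::install.pandoc()
```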

Troubleshooting R mapper script on Amazon Elastic MapReduce - Results not as expected

I am trying to use Amazon Elastic MapReduce to run a series of simulations of several million cases. This is an Rscript streaming job with no reducer; I am using the Identity Reducer in my EMR call, --reducer org.apache.hadoop.mapred.lib.IdentityReducer.
The script works fine when tested locally from the command line on a Linux box, passing one line of input manually: echo "1,2443,2442,1,5" | ./mapper.R returns the one line of results I expect. However, when I tested my simulation on about 10,000 cases (lines) from the input file on EMR, I only got output for a dozen or so of the 10k input lines. I've tried several times and I cannot figure out why. The Hadoop job runs fine without any errors. It seems like input lines are being skipped, or perhaps something is happening with the Identity Reducer. The results are correct for the lines that do produce output.
My input file is a csv with the following data format, a series of five integers separated by commas:
1,2443,2442,1,5
2,2743,4712,99,8
3,2443,861,282,3177
etc...
Here is my R script for mapper.R
#! /usr/bin/env Rscript
# Define Functions
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
# function to read in the relevant data from needed data files
get.data <- function(casename) {
  list <- lapply(casename, function(x) {
    read.csv(file = paste("./inputdata/", x, ".csv", sep = ""),
             header = TRUE,
             stringsAsFactors = FALSE)})
  return(data.frame(list))
}
con <- file("stdin")
line <- readLines(con, n = 1, warn = FALSE)
line <- trimWhiteSpace(line)
values <- unlist(strsplit(line, ","))
lv <- length(values)
cases <- as.numeric(values[2:lv])
simid <- paste("sim", values[1], ":", sep = "")
l <- length(cases) # for indexing
## create a vector for the case names
names.vector <- paste("case", cases, sep = ".")
## read in metadata and necessary data columns using get.data function
metadata <- read.csv(file = "./inputdata/metadata.csv", header = TRUE,
                     stringsAsFactors = FALSE)
d <- cbind(metadata[,1:3], get.data(names.vector))
## Calculations that use df d and produce a string called 'output'
## in the form of "id: value1 value2 value3 ..." to be used at a
## later time for aggregation.
cat(output, "\n")
close(con)
The (generalized) EMR call for this simulation is:
ruby elastic-mapreduce --create --stream --input s3n://bucket/project/input.txt --output s3n://bucket/project/output --mapper s3n://bucket/project/mapper.R --reducer org.apache.hadoop.mapred.lib.IdentityReducer --cache-archive s3n://bucket/project/inputdata.tar.gz#inputdata --name Simulation --num-instances 2
If anyone has any insights as to why I might be experiencing these issues, I am open to suggestions, as well as any changes/optimization to the R script.
My other option is to turn the script into a function and run a parallelized apply using R multicore packages, but I haven't tried it yet. I'd like to get this working on EMR. I used JD Long's and Pete Skomoroch's R/EMR examples as a basis for creating the script.
Nothing obvious jumps out. However, can you run the job using a simple input file of only 10 lines? Make sure these 10 lines are scenarios which did not run in your big test case. Try this to eliminate the possibility that your inputs are causing the R script to not produce an answer.
Debugging EMR jobs is a skill of its own.
EDIT:
This is a total fishing expedition, but fire up an EMR interactive Pig session using the AWS GUI. "Interactive Pig" sessions stay up and running, so you can ssh into them. You could also do this from the command-line tools, but it's a little easier from the GUI since, hopefully, you only need to do this once. Then ssh into the cluster, transfer over your test-case infile, your cache files, and your mapper, and then run this:
cat infile.txt | yourMapper.R > outfile.txt
This is just to test if your mapper can parse the infile in the EMR environment with no Hadoop bits in the way.
EDIT 2:
I'm leaving the above text there for posterity, but the real issue is that your script never goes back to stdin to pick up more data. Thus you get one run per mapper, and then it ends. If you run the above one-liner you will get only one result, not a result for each line in infile.txt. If you had run the cat test even on your local machine, the error would have popped out!
So let's look at Pete's word count in R example:
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
## **** could do with a single readLines or in blocks
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
line <- trimWhiteSpace(line)
words <- splitIntoWords(line)
## **** can be done as cat(paste(words, "\t1\n", sep=""), sep="")
for (w in words)
cat(w, "\t1\n", sep="")
}
close(con)
The piece your script is missing is this bit:
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
#do your dance
#do your dance quick
#come on everybody tell me what's the word
#word up
}
You should, naturally, replace the lyrics of Cameo's "Word Up!" with your actual logic.
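Putting the pieces together, the question's mapper restructured around that loop might look like the sketch below. The per-line work is reduced here to the parsing step (the metadata and get.data() calculations from the original go where the comment indicates), and `parse_case_line` is my own helper name:

```r
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

# Split one csv input line into the sim id and the numeric case ids.
parse_case_line <- function(line) {
  values <- unlist(strsplit(trimWhiteSpace(line), ","))
  list(simid = paste0("sim", values[1], ":"),
       cases = as.numeric(values[-1]))
}

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  p <- parse_case_line(line)
  ## ... read metadata, call get.data() on p$cases, build 'output' ...
  cat(p$simid, length(p$cases), "\n")   # placeholder for the real output line
}
close(con)
```

Piping the sample input through this version emits one line per input line, which is the behavior the Identity Reducer was masking.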
Keep in mind that proper debugging music makes the process less painful:
http://www.youtube.com/watch?v=MZjAantupsA