get executing program directory - r

I want to get the executing program's directory:
> pt=funr::get_script_path()
> pt
NULL
> source("yourfile.R", chdir = T)
This code is not giving me the executing directory. It gives only:
> source("yourfile.R", chdir = T)
How do I get the directory name?
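For reference, a hedged sketch of one common workaround (none of this is part of funr): walk the call frames for the ofile variable that source() sets, and fall back to the --file= argument that Rscript passes. current_script_dir() is a made-up helper name.
current_script_dir <- function() {
  # When the file is being source()'d, source() keeps the path in a local `ofile`.
  for (f in rev(sys.frames())) {
    if (!is.null(f$ofile)) return(dirname(normalizePath(f$ofile)))
  }
  # When run via Rscript, the path arrives as a --file= argument.
  arg <- grep("^--file=", commandArgs(trailingOnly = FALSE), value = TRUE)
  if (length(arg) > 0) return(dirname(normalizePath(sub("^--file=", "", arg[1]))))
  NULL  # plain interactive session: no script file to report
}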

Related

openxlsx::saveWorkbook -- write error during file append

I have a script that has been running smoothly for months. The last line of code basically goes as follows:
saveWorkbook(Wb, 'address/filename.xlsx', overwrite = TRUE)
I run this script weekly (on Mondays, though that's unimportant). When I ran it this week, I got the following error while saving the created workbook:
Warning message:
In file.append(to[okay], from[okay]) : write error during file append
The file's path is on a shared drive at work, so one of my first thoughts was that the shared drive might have new permissions, since saving to local drives still works. But I can still save csv files to the shared drive (using data.table::fwrite).
I'm a bit at a loss here. I've updated R, RTools, and RStudio and all my packages.
Has anyone come across this or a similar issue before? I may just need more information about the "write error during file append" message: I'm actually creating a whole new file when I run this, not appending to an existing one, but I haven't been able to find anything explaining what can cause this error.
I have the same output, but only inside a blob container on Azure: the overwrite instruction fails.
> packageVersion("openxlsx")
[1] ‘4.2.5’
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx")
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx")
Warning message:
In file.append(to[okay], from[okay]) : write error during file append
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx",overwrite = T)
Warning message:
In file.append(to[okay], from[okay]) : write error during file append
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx",overwrite = F)
Error in saveWorkbook(wb, file = file, overwrite = overwrite) :
File already exists!
The same code on my desktop produces different output:
> packageVersion("openxlsx")
[1] ‘4.2.5’
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx")
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx")
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx",overwrite = T)
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx",overwrite = F)
Error in saveWorkbook(wb, file = file, overwrite = overwrite) :
File already exists!
As we can't fix Azure, I recommend deleting the file first and then writing:
> filexls= "mtcars.xlsx"
> if (file.exists(filexls)) {
+ file.remove(filexls)
+ }
[1] TRUE
> openxlsx::write.xlsx(x = mtcars,file ="mtcars.xlsx",overwrite = T)
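Wrapping that pattern in a small helper keeps weekly scripts tidy. A minimal sketch; safe_write_xlsx() is a hypothetical name, not anything openxlsx provides:
# Delete-then-write so file.append() never runs against a stale copy.
safe_write_xlsx <- function(x, file) {
  if (file.exists(file)) {
    file.remove(file)
  }
  openxlsx::write.xlsx(x = x, file = file)
}

safe_write_xlsx(mtcars, "mtcars.xlsx")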

Download files from FTP folder using Loop

I am trying to download all the files inside an FTP folder:
library(RCurl)  # getURL() comes from RCurl

temp <- tempfile()
destination <- "D:/test"
url <- "ftp://XX.XX.net/"
userpwd <- "USER:Password"
filenames <- getURL(url, userpwd = userpwd, ftp.use.epsv = FALSE, dirlistonly = TRUE)
filenames <- strsplit(filenames, "\r*\n")[[1]]
When I print filenames I get all the file names inside the FTP folder, so the output is correct up to this point:
[1] "2018-08-28-00.gz" "2018-08-28-01.gz"
[3] "2018-08-28-02.gz" "2018-08-28-03.gz"
[5] "2018-08-28-04.gz" "2018-08-28-05.gz"
[7] "2018-08-28-08.gz" "2018-08-28-09.gz"
[9] "2018-08-28-10.gz" "2018-08-28-11.gz"
[11] "2018-08-28-12.gz" "2018-08-28-13.gz"
[13] "2018-08-28-14.gz" "2018-08-28-15.gz"
[15] "2018-08-28-16.gz" "2018-08-28-17.gz"
[17] "2018-08-28-18.gz" "2018-08-28-23.gz"
for (i in filenames) {
  download.file(paste0(url, i), paste0(destination, i), mode = "w")
}
I got this error:
trying URL 'ftp://XXX.net/2018-08-28-00.gz'
Error in download.file(paste0(url, i), paste0(destination, i), mode = "w") :
cannot open URL 'ftp://XXX.net/2018-08-28-00.gz'
In addition: Warning message:
In download.file(paste0(url, i), paste0(destination, i), mode = "w") :
InternetOpenUrl failed: 'The login request was denied'
I modified the code to:
for (i in filenames) {
  # download.file(paste0(url, i), paste0(destination, i), mode = "w")
  download.file(getURL(paste(url, filenames[i], sep = ""), userpwd = "USER:PASSWORD"),
                paste0(destination, i), mode = "w")
}
After that, I got this error:
Error in function (type, msg, asError = TRUE) : RETR response: 550
Without a minimal, complete, and verifiable example it is a challenge to directly replicate your problem. Assuming the file names don't include the URL, you'll need to combine them to access the files.
download.file() requires a source file to read, an output file, and a flag indicating whether you want a binary download.
For example, I have data from Alberto Barradas' Pokémon Stats data set from kaggle.com stored on my GitHub site. To download some of the files to the test subdirectory of my R working directory, I can use the following code:
filenames <- c("gen01.csv", "gen02.csv", "gen03.csv")
fileLocation <- "https://raw.githubusercontent.com/lgreski/pokemonData/master/"
# use ./ for a subdirectory of the current directory; end with / to work with paste0()
destination <- "./test/"
# note that these are character files, so use mode="w"
for (i in filenames) {
  download.file(paste0(fileLocation, i),
                paste0(destination, i),
                mode = "w")
}
The paste0() function concatenates text without spaces, which allows the code to generate a fully qualified path name for the url of each source file, as well as the subdirectory where the destination file will be stored.
To illustrate what's happening with paste0() in the for() loop, we can use message() to print to the R console.
> # illustrate what paste0() does
> for (i in filenames){
+ message(paste("Source is: ",paste0(fileLocation,i)))
+ message(paste("Destination is:",paste0(destination,i)))
+ }
Source is: https://raw.githubusercontent.com/lgreski/pokemonData/master/gen01.csv
Destination is: ./test/gen01.csv
Source is: https://raw.githubusercontent.com/lgreski/pokemonData/master/gen02.csv
Destination is: ./test/gen02.csv
Source is: https://raw.githubusercontent.com/lgreski/pokemonData/master/gen03.csv
Destination is: ./test/gen03.csv
>
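Coming back to the original FTP case: a hedged sketch, reusing the placeholder host and credentials from the question. Since the .gz files are binary, one option is to fetch raw bytes with RCurl's getBinaryURL() (which accepts the same userpwd option) and write them out with writeBin(), bypassing download.file() and its login problem:
library(RCurl)

url <- "ftp://XX.XX.net/"    # placeholder host from the question
userpwd <- "USER:Password"   # placeholder credentials
destination <- "D:/test/"    # trailing slash so paste0() builds a valid path

filenames <- strsplit(getURL(url, userpwd = userpwd, ftp.use.epsv = FALSE,
                             dirlistonly = TRUE), "\r*\n")[[1]]

for (i in filenames) {
  # .gz files are binary: fetch the raw bytes, then write them unchanged
  bin <- getBinaryURL(paste0(url, i), userpwd = userpwd, ftp.use.epsv = FALSE)
  writeBin(bin, paste0(destination, i))
}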

Error in download file in R

I have an Excel file containing the names of companies and the downloadable links for their .pdf files. My aim is to create a directory per company name from the Excel column and download each pdf file into the newly created directory.
Here is my code:
## Set the working directory
txtsrc <- "C:\\FirstAid"
setwd(txtsrc)

## Make a vector of file names and links
library(openxlsx)  # assumed; read.xlsx() could equally come from the xlsx package
pdflist <- read.xlsx("Final results_6thjuly.xlsx", 1)
colnames(pdflist)

## Check if the docs folder exists
if (!dir.exists("FirstAid_docs")) {
  dir.create("FirstAid_docs")
}

## Change the working directory
newfolder <- c("FirstAid_docs")
newpath <- file.path(txtsrc, newfolder)
setwd(newpath)

## Check the present working directory
getwd()

## Create directories and download files
for (i in 1:length(pdflist[, c("ci_CompanyName")])) {
  ## First delete the existing directories
  if (dir.exists(pdflist[, c("ci_CompanyName")][i])) {
    unlink(pdflist[, c("ci_CompanyName")][i], recursive = TRUE)
  }
  ## Create a new directory
  directoryname <- pdflist[, c("ci_CompanyName")][i]
  dir.create(directoryname, recursive = FALSE, mode = "0777")
  ## Get the downloadable links
  ## Links look like www.xyz.com, so https:// must be prepended
  link <- pdflist[, c("DocLink")][i]
  vallink <- c("https://")
  ## Need to remove quotes from link
  newlink <- paste0(vallink, link)
  newlink <- noquote(newlink)
  ## Set the path for the downloadable file
  destfile <- file.path(txtsrc, newfolder, directoryname)
  ## Download the file
  download.file(newlink, destfile, method = "auto")
  ## Next record: the for loop advances i itself, so no manual increment is needed
}
This is the error/result I get:
> colnames(pdflist)
[1] "ci_CompanyID" "ci_CompanyName" "ProgramScore" "ID_DI" "DocLink"
> download.file(newlink,destfile,method="auto")
Error in download.file(newlink, destfile, method = "auto") :
cannot open destfile 'C:\Users\skrishnan\Desktop\HR needed\text analysis proj\pdf\FirstAid/FirstAid_docs/Buckeye Partners, LP', reason 'Permission denied'
Why do I get the error despite setting the mode in dir.create()?
I am using CRAN RGui (64-bit) and R version 3.5.0 on a 64-bit Windows machine.
Any help is greatly appreciated.
destfile in download.file needs to be a specific file, not just a directory. For example,
'C:\Users\skrishnan\Desktop\HR needed\text analysis proj\pdf\FirstAid\FirstAid_docs\Buckeye Partners, LP\myFile.pdf'
The final working code:
## Set the working directory
txtsrc <- "C:\\FirstAid"
setwd(txtsrc)

## Make a vector of file names and links
pdflist <- read.xlsx("Final results_6thjuly.xlsx", 1)
colnames(pdflist)

## Check if the docs folder exists
if (!dir.exists("FirstAid_docs")) {
  dir.create("FirstAid_docs")
}

## Change the working directory
newfolder <- c("FirstAid_docs")
newpath <- file.path(txtsrc, newfolder)
setwd(newpath)

## Check the present working directory
getwd()

## Create directories and download files
for (i in 1:length(pdflist[, c("ci_CompanyName")])) {
  ## First delete the existing directories
  if (dir.exists(pdflist[, c("ci_CompanyName")][i])) {
    unlink(pdflist[, c("ci_CompanyName")][i], recursive = TRUE)
  }
  ## Create a new directory
  directoryname <- pdflist[, c("ci_CompanyName")][i]
  dir.create(directoryname, recursive = FALSE, mode = "0777")
  ## Get the downloadable links
  ## Links look like www.xyz.com, so https:// must be prepended
  link <- pdflist[, c("DocLink")][i]
  vallink <- c("https://")
  ## Need to remove quotes from link
  newlink <- paste0(vallink, link)
  newlink <- noquote(newlink)
  ## Set the path for the downloadable file -- a specific file, not a directory
  neway <- file.path(newpath, directoryname)
  destfile <- paste(neway, "my.pdf", sep = "/")
  ## Download the file
  download.file(newlink, destfile, method = "auto")
}
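One refinement worth considering: rather than hardcoding "my.pdf", the file name can be derived from the link itself. A hedged sketch using base R's basename(), with a made-up example URL, reusing the loop's newpath and directoryname variables:
# basename() keeps everything after the final slash of a path or URL
pdfname <- basename("https://www.xyz.com/reports/firstaid2018.pdf")  # "firstaid2018.pdf"
destfile <- file.path(newpath, directoryname, pdfname)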

Running a .Rexec on a .Rmd from command prompt

So I've followed this blog to create an executable file called r2jekyll.
Unfortunately, I'm on Windows, so I've had to create the .Rexec called r2jekyll differently.
The code for r2jekyll is here:
#!/usr/bin/env Rscript
library(knitr)
# Get the filename given as an argument in the shell.
args <- commandArgs(TRUE)
filename <- args[1]
# Check that it's a .Rmd file (escape the dot and anchor the match).
if (!grepl("\\.Rmd$", filename)) {
  stop("You must specify a .Rmd file.")
}
# Knit and place in _posts.
dir <- paste0("../_posts/", Sys.Date(), "-")
output <- paste0(dir, sub("\\.Rmd$", ".md", filename))
knit(filename, output)
# Copy .png files to the images directory.
fromdir <- "{{ site.url }}/images"
todir <- "../images"
pics <- list.files(fromdir, ".png")
pics <- sapply(pics, function(x) paste(fromdir, x, sep = "/"))
file.copy(pics, todir)
unlink("{{ site.url }}", recursive = TRUE)
That all seems fine: I can run my r2jekyll .Rexec (thanks to this blog), but when it runs nothing happens.
I'm at the last step: running r2jekyll on a file I've called first_test.Rmd.
I run the following in the command prompt:
cd (go to directory where my r2jekyll and my first_test.Rmd are sitting)
then
r2jekyll.rexec first_test.Rmd (the blog author used this command; he is on a Mac)
and I get the following error:
Error: You must specify a .Rmd file.
Execution halted
So my question is: how do I get my r2jekyll.Rexec to process first_test.Rmd?
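A hedged first debugging step: that error means the grepl() check failed, which happens when args[1] is NA, i.e. when the Windows file association never forwards the argument to the script. Temporarily printing what actually arrives will confirm that, and invoking the script through Rscript explicitly is one possible workaround:
# Temporary lines for the top of r2jekyll (assumption: the file association,
# not the script itself, is dropping the argument).
args <- commandArgs(TRUE)
message("Arguments received: ", paste(args, collapse = ", "))
if (length(args) == 0 || is.na(args[1])) {
  stop("No argument received -- try: Rscript r2jekyll.rexec first_test.Rmd")
}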

Can't create dplyr src backed by SparkSQL in dplyr.spark.hive package

Recently I found out about the great dplyr.spark.hive package, which enables dplyr frontend operations with a Spark or Hive backend.
There is information on how to install this package in the package's README:
options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")
and there are also many examples of how to work with dplyr.spark.hive when one is already connected to hiveServer2 - check this.
But I am not able to connect to hiveServer2, so I cannot benefit from the great power of this package...
I've tried the commands below, but they did not work out. Does anyone have a solution or a comment on what I am doing wrong?
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
>
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
>
> my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
EDIT 2
I have set more system variables, as shown below, but now I receive a warning telling me that some Java logging configuration is not specified, even though I think it is:
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
3: package ‘SparkR’ was built under R version 3.2.1
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
>
>
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
My log properties are not empty.
-bash-4.2$ wc /etc/hadoop/log4j.properties
179 432 6581 /etc/hadoop/log4j.properties
EDIT 3
My exact call to src_SparkSQL() is:
> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
And then the process never finishes.
Meanwhile, those settings work for beeline with these parameters:
beeline -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output
but I am not able to pass the user name and dbname to src_SparkSQL(), so I tried to manually run the code from inside that function. I hit the same problem: the code below also never finishes.
host <- "tools-1.hadoop.srv"
port <- 10000
driverclass <- "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr <- JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url <- paste0("jdbc:hive2://", host, ":", port)
class <- "Hive"
con.class <- paste0(class, "Connection")  # class = "Hive"
# dbConnect_retry =
#   function(dr, url, retry) {
#     if (retry > 0)
#       tryCatch(
#         dbConnect(drv = dr, url = url),
#         error = function(e) {
#           Sys.sleep(0.1)
#           dbConnect_retry(dr = dr, url = url, retry - 1)
#         })
#     else dbConnect(drv = dr, url = url)
#   }
# con = new(con.class, dbConnect_retry(dr, url, retry = 100))
con <- new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))
Maybe the url should also contain /loghost - the dbname?
I now see that you tried multiple things with multiple errors. Let me comment error by error.
my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
The RJDBC object could not be created. Unless we solve this, nothing else will work, workarounds or not. Have you set HADOOP_JAR with, for instance,
Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar")? Sorry, I seem to have skipped this in the instructions; will fix.
my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Same problem. Please note that the host and port arguments do not accept URL syntax, just a host name and a port number; the URL is formed internally.
my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Stop the thriftserver first or connect to the existing one, but you still have to fix the class-not-found problem.
my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Same as above.
Plan:
1. Set HADOOP_JAR. Find the host and port of the running thriftserver, if not default. Try src_SparkSQL() with start.server = FALSE. If happy, quit; else go to step 2.
2. Stop the existing thriftserver. Try src_SparkSQL() again with start.server = TRUE.
Let me know how things go.
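In R, that plan looks roughly like the hedged sketch below; the jar path and host are the ones from the question, used here as placeholders:
# Step 1: point HADOOP_JAR at the Spark assembly jar before anything else.
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")

# Step 2: try connecting to the thriftserver that is already running.
my_db <- src_SparkSQL(host = "tools-1.hadoop.srv", port = 10000)

# Step 3: only if step 2 fails, stop the running thriftserver and let the
# package start a fresh one.
# my_db <- src_SparkSQL(start.server = TRUE)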
The problem was that I didn't specify the proper classPath needed inside the JDBC() function that creates the driver. The classPath in the dplyr.spark.hive package is passed via the HADOOP_JAR environment variable.
To use JDBC as a driver to hiveServer2 (through the Thrift protocol), one needs to add at least these 3 .jars with Java classes to create a proper driver:
hive-jdbc-1.0.0-standalone.jar
hadoop/common/lib/commons-configuration-1.6.jar
hadoop/common/hadoop-common-2.4.1.jar
The versions are arbitrary and should be compatible with the locally installed versions of Hive, Hadoop, and hiveServer2.
They need to be joined with .Platform$path.sep (as described here):
classPath <- c("system_path1_to_hive/hive/lib/hive-jdbc-1.0.0-standalone.jar",
               "system_path1_to_hadoop/hadoop/common/lib/commons-configuration-1.6.jar",
               "system_path1_to_hadoop/hadoop/common/hadoop-common-2.4.1.jar")
Sys.setenv(HADOOP_JAR = paste0(classPath, collapse = .Platform$path.sep))
Then, when HADOOP_JAR is set, one has to be careful with the hiveServer2 url. In my case it had to be:
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
and finally the proper connection to hiveServer2 using the RJDBC package is:
Sys.setenv(HADOOP_HOME = "/usr/share/hadoop/share/hadoop/common/")
Sys.setenv(HIVE_HOME = "/opt/hive/lib/")
host <- "tools-1.hadoop.srv"
port <- 10000
url <- paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
driverclass <- "org.apache.hive.jdbc.HiveDriver"
library(RJDBC)
.jinit()
dr2 <- JDBC(driverclass,
            classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
                          # "/opt/hive/lib/commons-configuration-1.6.jar",
                          "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
                          "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"),
            identifier.quote = "`")
cont <- dbConnect(dr2, url, username = "mkosinski")
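Once cont exists, the usual DBI verbs should work against hiveServer2. A hedged usage sketch:
library(DBI)
dbGetQuery(cont, "show tables")  # list tables visible in the loghost database
dbDisconnect(cont)               # close the connection when done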
