Does anyone know how to read a text file in SparkR version 1.4.0?
Are there any Spark packages available for that?

Spark 1.6+
You can use the text input format to read a text file as a DataFrame:
read.df(sqlContext=sqlContext, source="text", path="")
Spark <= 1.5
The short answer is that you don't. SparkR 1.4 has been almost completely stripped of its low-level API, leaving only a limited subset of DataFrame operations.
As you can read on an old SparkR webpage:
As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) Initial support for Spark in R be focussed on high level operations instead of low level ETL.
Probably the closest thing is to load text files using spark-csv:
> df <- read.df(sqlContext, "", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
| C0|
| # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
Since typical RDD operations like map, flatMap, reduce or filter are gone as well, it is probably what you want anyway.
Now, the low-level API is still underneath, so you can always do something like the code below, but I doubt it is a good idea. SparkR developers most likely had a good reason to make it private. To quote the ::: man page:
It is typically a design mistake to use ‘:::’ in your code since
the corresponding object has probably been kept internal for a
good reason. Consider contacting the package maintainer if you
feel the need to access the object for anything but mere
inspection.
Even if you're willing to ignore good coding practices, it is most likely not worth the time. The pre-1.4 low-level API is embarrassingly slow and clumsy, and without all the goodness of the Catalyst optimizer the same is most likely true of the internal 1.4 API.
> rdd <- SparkR:::textFile(sc, '')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)
[1] 14
[1] 0
[1] 78
Note that spark-csv, unlike textFile, ignores empty lines.

We can simply use:
textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/")
While checking the SparkR code, I found that Context.R has a textFile method, so ideally a SparkContext should expose a textFile API to create the RDD, but that is missing from the docs.
# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
# value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
# sc <- sparkR.init()
# lines <- textFile(sc, "myfile.txt")
textFile <- function(sc, path, minPartitions = NULL) {
  # Allow the user to have a more flexible definition of the text file path
  path <- suppressWarnings(normalizePath(path))
  # Convert a string vector of paths to a string containing comma separated paths
  path <- paste(path, collapse = ",")
  jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
  # jrdd is of type JavaRDD[String]
  RDD(jrdd, "string")
}
In fact, you could use the databricks/spark-csv package to handle tsv files too.
For example,
data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")
Extracting Body of Text from Research Articles; Several Attempted Methods

I need to extract the body of texts from my corpus for text mining as my code now includes references, which bias my results. All coding is performed in R using RStudio. I have tried many techniques.
I have text mining code (of which only the first bit is included below), but recently found out that simply text mining a corpus of research articles is insufficient as the reference section will bias results; reference sections alone may provide another analysis, which would be a bonus.
EDIT: perhaps there is an R package that I am not aware of
My initial response was to clean the text formats after converting from pdf to text using Regex commands within quanteda. As a reference I was intending to follow: . Their method confuses me not just in coding a parallel regex code, but in how to implement recognizing the last reference section to avoid cutting off portions of the text when "reference" appears prior to that section; I have been in contact with their team, but am waiting to learn more about their code since it appears they use a streamlined program now.
PubChunks and LAPDF-text were my next two options, the latter of which is referenced in the paper above. To use the PubChunks package I need to convert all of my pdf (now converted to text) files into XML. This should be straightforward, but the packages I found (fileToPDF, pdf2xml, trickypdf) did not appear to work; this seems to be a within-R concern. (Code relating to trickypdf is included below.)
For LAPDF-text, ...[see edit]... the code did not seem to run properly. There are also very limited resources out there for this package in terms of guides etc., and they have shifted their focus to a larger package, written in a different language, that does happen to include LAPDF-text.
EDIT: I installed java 1.6 (SE 6) and Maven 2.0 then ran the LAPDF-text installer, which seemed to work. That being said, I am still having issues with this process and mvn commands recognizing folders though am continuing to work through it.
I am guessing there is someone else out there, as there are related research papers with similarly vague processes, who has done this before and has also got their hands dirty. Any recommendations are greatly appreciated.
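library(tm) # Corpus, URISource, readPDF
library(quanteda) # corpus, corpus_reshape, corpus_segment, corpus_trim
library(stringi) # regex pattern
library(stringr) # simpler than stringi ; uses stringi on backend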
setwd('C:\\Users\\Hunter S. Baggen\\Desktop\\ZS_TestSet_04_05')
files <- list.files(pattern = 'pdf$')
# Length 63
corpus_tm <- Corpus(URISource(files),
readerControl = list(reader = readPDF()))
# documents 63
# convert tm::Corpus to quanteda::corpus
corpus_q <- corpus(corpus_tm)
summary(corpus_q, n = 2)
# Add Doc-level Variables here *by folder and meta-variable year
# extract segments ~ later to remove segments
# corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)
corpus_q_refA <- corpus_reshape(corpus_q, to = "paragraphs", showmeta = TRUE)
# Based upon Westergaard et al (15 Million texts; removing references)
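corpus_q_refB <- corpus_trim(corpus_q, what = c('sentences'), exclude_pattern = '^\\[\\d+\\]\\s[A-Za-z]')
corpus_q_refB # ERROR with the original regex: it used single backslashes ('\[', '\d', '\s'), which R rejects as unrecognized escapes inside a string, so they are doubled here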
sum(str_detect(corpus_q, '^Referen'))
corpus_qB <- corpus_q
RemoveRef_B <- corpus_segment(corpus_q, pattern = 'Reference', valuetype = 'regex')
cbind(texts(RemoveRef_B), docvars(corpus_qB))
# -------------------------
# Idea taken from guide (must reference guide)
setGeneric('removeCitations', function(object, ...) standardGeneric('removeCitations'))
setMethod('removeCitations', signature(object = 'PlainTextDocument'),
function(object, ...) {
c <- Content(object)
# remove citations starting with '>'
# EG for > : citations <- grep('^[[:blank:]]*>.*', c) if (length(citations) > 0) c <- c[-citations]
# EG for -- : signatureStart <- grep('^-- $', c) if (length(signatureStart) > 0) c <- c[-(signatureStart:length(c))]
# using 15 mil removal guideline
citations <- grep('^\\[\\d+\\]\\s[A-Za-z]', c) # backslashes doubled; the text vector c was missing
if (length(citations) > 0) c <- c[-citations]
Content(object) <- c
return(object)
})
# TRICKY PDF download from github
githubinstall('trickypdf') # input Y then 1 if want all related packages
# library(trickypdf)
# This time suggested I install via 'PolMine/trickypdf'
# Second attempt issue with RPoppler
library(trickypdf) # Not working
# Failed to install package 'Rpoppler' is not available for R 3.6.0
Short of the RPoppler issue above the initial description should be sufficient.
UPDATE: Having reached out to several research groups the TALN-UPF researchers got back to me and provided me with a pdfx java program that has allowed me to convert my pdfs easily into xml. Of course, now I learn that PubChunks is created with its sister program that extracts xmls from search engines and therefore is of little use to me. That being said, the TALN-UPF group will hopefully advise whether I can extract the body of each text via their other programs (Dr Inventor and Grobid). If this is possible then everything will be accomplished. Of course if not I will be back at RegEx.
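One way to approach the "last reference section" problem in base R is sketched below. It is only a sketch under stated assumptions: each article is already a character vector with one element per line, the reference section starts with a heading that sits alone on its line, and the heading words in the pattern (and the function name strip_references) are illustrative guesses rather than anything taken from the paper mentioned above. Cutting at the last matching heading keeps earlier mentions of "reference" in the body, and the reference lines are returned separately so they can be analysed on their own.

```r
# Split a document (one element per line) at the LAST reference-style heading.
# The heading words are an assumption; extend the pattern for your corpus.
strip_references <- function(text_lines,
                             heading = "^\\s*(references|bibliography|literature cited)\\s*$") {
  hits <- grep(heading, text_lines, ignore.case = TRUE)
  if (length(hits) == 0L) {
    # No heading found: keep the whole text as body.
    return(list(body = text_lines, references = character(0)))
  }
  cut <- hits[length(hits)]  # last heading, so earlier mentions stay in the body
  n_refs <- length(text_lines) - cut
  list(
    body = text_lines[seq_len(cut - 1L)],
    references = text_lines[seq(cut + 1L, length.out = n_refs)]
  )
}

# Tiny made-up document for illustration.
doc <- c(
  "Introduction",
  "We compare against the reference genome [1].",
  "References",
  "[1] Smith J. A paper.",
  "[2] Doe A. Another paper."
)
parts <- strip_references(doc)

# Numbered-reference pattern from the question, with backslashes doubled for R.
is_numbered_ref <- grepl("^\\[\\d+\\]\\s[A-Za-z]", parts$references)
print(parts$body)
print(is_numbered_ref)
```

Real PDF-to-text output is messier than this (headings merged with page numbers, two-column layouts), so the pattern would likely need tuning per journal.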

R for-loop for variables in ftp address

I want to get multiple zip files from an FTP server.
I can get the zip files individually with the help of previous posts.
But that would be a lot of work for all the data I need, so I wanted to find an automated way.
The ftp looks like this:
I want to change the "01048" to the ID of the nearest weather station, which I already have in a data frame (data).
I thought I could just loop over all the needed stations with a for loop:
for(y in Data$StationsID)) {
urls <- "{y}"))
but I only get "{y}"))
The zip files hold a .txt file that is ideal for csv analysis.
Later I want to use the files to get solar data from different points in Germany.
But first I need a list like this, and I don't know how to get it:
[1] url_1
[2] url_2
You do not even need a loop. Try
urls <- paste0("", Data$StationsID, "")
This will give you a vector of all URLs.
Afterwards you can fetch all files using e.g. lapply.
results <- lapply(urls, FUN = function(u) {
  # ...
})
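As a minimal sketch of why no loop is needed: paste0() is vectorised, so the fixed parts of the address are recycled across every station ID. The base URL, the file-name pattern and the station IDs below are placeholders, not the real server layout. The sprintf() step is only relevant if the IDs are stored as numbers and have lost their leading zeros (as in "01048").

```r
# Hypothetical station IDs; in the question these come from Data$StationsID.
station_ids <- c(1048L, 183L, 5906L)

# Pad to five digits so that 1048 becomes "01048".
padded_ids <- sprintf("%05d", station_ids)

# Placeholder base URL and file-name pattern (assumptions, not the real server).
base_url <- "ftp://ftp.example.org/solar/"
urls <- paste0(base_url, "station_", padded_ids, ".zip")

print(urls)
```

The resulting character vector can be passed straight to lapply() as shown above.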
An alternate approach is below.
get all the possible data files
filter it to just the SOLAR zip data files (you could filter more if needed — i.e. restrict it to what you have in your existing data frame)
make a save location
download the files
Note that it is really bad form to hammer a server with consecutive requests without a pause, so the code introduces one; that's an oft-overlooked courtesy these days.
library(curl)
library(httr)
base_dir <- ""
# Get all available files in that directory
res <- curl_fetch_memory(base_dir, handle = new_handle(dirlistonly = TRUE))
# curl_fetch_memory() returns a raw vector since it has no idea what type of content might be there,
# so we convert it to text and split the listing into one entry per line
grep(
  "\\.zip", # assumed pattern (the original filter was lost): keeps the zip entries; narrow it to the SOLAR files as needed
  strsplit(rawToChar(res$content), "\n")[[1]],
  value = TRUE
) -> all_zips
all_zips
## [1] ""
## [2] ""
## [3] ""
## [4] ""
## [5] ""
## [6] ""
save_dir <- "~/Data/solar-output"
dir.create(save_dir, recursive = TRUE, showWarnings = FALSE) # make the save location
for (zip in all_zips) {
  try(httr::GET(
    url = sprintf("%s%s", base_dir, zip),
    httr::write_disk(file.path(save_dir, zip)), # enables caching (it won't overwrite by default and avoids issues with download.file() on windows)
    httr::progress() # progress bars for free!
  ))
  Sys.sleep(5) # be kind to their server CPU and network bandwidth
}
We wrap the GET() in try() since we've asked write_disk() to not overwrite existing files. It tosses an exception when this happens so try() catches it and lets the loop keep going (but still displays the helpful message about the file already existing).

Loading data with RSQLite which has quoted values

I am trying to load a large-ish csv file into a SQLite database using the RSQLite package (I have also tried the sqldf package). The file contains all UK postcodes and a variety of lookup values for them.
I wanted to avoid loading it into R and just directly load it into the database. Whilst this is not strictly necessary for this task, I want to do so in order to have the technique ready for larger files which won't fit in memory should I have to handle them in the future.
Unfortunately the csv is provided with the values in double quotes and the dbWriteTable function doesn't seem able to strip them or ignore them in any form. Here is the download location of the file:
Here is my code:
# Load library
library(RSQLite)
# Create a temporary directory
tmpdir <- tempdir()
# Set the file name
file <- "data\\"
# Unzip the ONS Postcode Data file
unzip(file, exdir = tmpdir )
# Create a path pointing at the unzipped csv file
ONSPD_path <- paste0(tmpdir,"\\ONSPD_MAY_2017_UK.csv")
# Create a SQLite database connection
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
# Now load the data into our SQLite database
dbWriteTable(conn = db_connection,
             name = "ONS_PD",
             value = ONSPD_path,
             row.names = FALSE,
             header = TRUE,
             overwrite = TRUE
)
# Check the data upload
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD LIMIT 20")
Having hit this issue, I found a reference tutorial which recommended using the sqldf package, but unfortunately when I try to use the relevant function in sqldf (read.csv.sql) I get the same issue with double quotes.
This feels like a fairly common issue when importing csv files into a sql system, most import tools are able to handle double quotes so I'm surprised to be hitting an issue with this (unless I've missed an obvious help file on the issue somewhere along the way).
Here is some example data from my csv file in the form of a dput output of the SQL table:
structure(list(pcd = c("\"AB1 0AA\"", "\"AB1 0AB\"", "\"AB1 0AD\"",
"\"AB1 0AE\"", "\"AB1 0AF\""), pcd2 = c("\"AB1 0AA\"", "\"AB1 0AB\"",
"\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\""), pcds = c("\"AB1 0AA\"",
"\"AB1 0AB\"", "\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\"")), .Names = c("pcd",
"pcd2", "pcds"), class = "data.frame", row.names = c(NA, -5L))
Here is my attempt using the filter argument in sqldf's read.csv.sql function (note that Windows users will need rtools installed for this). Unfortunately this still doesn't seem to remove the quotes from my data, although it does mysteriously remove all the spaces.
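library(sqldf)
sqldf("attach 'ons_lkp_db' as new")
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
read.csv.sql(ONSPD_path,
             sql = "CREATE TABLE ONS_PD AS SELECT * FROM file", # assumed statement: the original sql argument was lost
             dbname = "ons_lkp_db",
             filter = 'tr.exe -d ^"'
)
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD LIMIT 5")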
The CSV importer in the RSQLite package is derived from the sqlite3 shell, which itself doesn't seem to offer support for quoted values when importing CSV files (How to import load a .sql or .csv file into SQLite?, doc). You could use readr::read_delim_chunked():
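# con is an open DBI connection (db_connection in the question)
callback <- function(data, pos) { # readr calls this with each chunk and its starting row
  name <- "ONS_PD"
  exists <- dbExistsTable(con, name)
  dbWriteTable(con, name, data, append = exists)
}
readr::read_delim_chunked(ONSPD_path, callback, ...)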
Substitute ... with any extra arguments you need for your CSV file.
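For readers who want to see the chunk-and-callback idea without any packages, here is a base-R sketch of the same pattern. It is not how readr implements it; read_csv_in_chunks is a hypothetical helper name, and the tiny quoted file is made up for illustration. read.csv() strips the double quotes itself, so each chunk arrives clean; in the real task the callback would be the one that appends to the database.

```r
# Read a quoted CSV in fixed-size chunks and hand each chunk to a callback.
# Hypothetical helper for illustration only.
read_csv_in_chunks <- function(path, callback, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Parse the header line ourselves; scan() removes the surrounding quotes.
  header <- scan(text = readLines(con, n = 1L), what = "", sep = ",",
                 quote = "\"", quiet = TRUE)
  repeat {
    # Reading from an open connection continues where the last chunk ended.
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size,
               col.names = header, stringsAsFactors = FALSE),
      error = function(e) NULL  # treat a read failure at end of file as "done"
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    callback(chunk)
    if (nrow(chunk) < chunk_size) break
  }
  invisible(NULL)
}

# Made-up input with quoted values, like the postcode file.
tmp <- tempfile(fileext = ".csv")
cat('pcd,n\n"AB1 0AA","1"\n"AB1 0AB","2"\n"AB1 0AD","3"\n', file = tmp)

# Stand-in callback: collect the chunks in a list instead of writing to a database.
chunks <- list()
read_csv_in_chunks(tmp, function(chunk) chunks[[length(chunks) + 1L]] <<- chunk,
                   chunk_size = 2L)
result <- do.call(rbind, chunks)
print(result)
```

One caveat of this sketch: column types are inferred per chunk, so for a real file it would be safer to pass colClasses explicitly.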
Use read.csv.sql from the sqldf package with the filter argument and provide any utility which strips out double quotes or which translates them to spaces.
The question does not provide a fully reproducible minimal example but I have provided one below. If you are using read.csv.sql in order to pick out a subset of rows or columns then just add the appropriate sql argument to do so.
First set up the test input data and then try any of the one-line solutions shown below. Assuming Windows, ensure that the tr utility (found in R's Rtools distribution) or the third party csvfix utility (found here and for Linux also see this) or the trquote2space.vbs vbscript utility (see Note at end) is on your path:
cat('a,b\n"1","2"\n', file = "tmp.csv")
# 1 - corrected from FAQ
read.csv.sql("tmp.csv", filter = "tr.exe -d '^\"'")
# 2 - similar but does not require Windows cmd quoting
read.csv.sql("tmp.csv", filter = "tr -d \\42")
# 3 - using csvfix utility (which must be installed first)
read.csv.sql("tmp.csv", filter = "csvfix echo -smq")
# 4 - using trquote2space.vbs utility as per Note at end
read.csv.sql("tmp.csv", filter = "cscript /nologo trquote2space.vbs")
any of which give:
a b
1 1 2
You could also use any other language or utility that is appropriate. For example, your Powershell suggestion could be used although I suspect that dedicated utilities such as tr and csvfix would run faster.
The first solution above is corrected from the FAQ. (It did work at the time the FAQ was written many years back but testing it now in Windows 10 it seems to require the indicated change or possibly the markdown did not survive intact from the move from Google Code, where it was originally located, to github which uses a slightly different markdown flavor.)
For Linux, tr is available natively although quoting differs from Windows and can even depend on the shell. csvfix is available on Linux too but would have to be installed. The csvfix example shown above would work identically on Windows and Linux. vbscript is obviously specific to Windows.
Note: sqldf comes with a mini-tr utility written in vbscript. If you change the relevant lines to:
Dim sSearch : sSearch = chr(34)
Dim sReplace : sReplace = " "
and change the name to trquote2space.vbs then you will have a Windows specific utility to change double quotes to spaces.
Honestly I could not find anything to solve this problem.
The sqldf documentation says:
"so, one limitation with .csv files is that quotes
are not regarded as special within files so a comma within a data field such as
"Smith, James"
would be regarded as a field delimiter and the quotes would be entered as part of the data which
probably is not what is intended"
So it looks like there is no solution, as far as I know.
One possible suboptimal approach (other than the obvious find and replace in a text editor)
is to use SQL commands like this
dbSendQuery(db_connection,"UPDATE ONS_PD SET pcd = REPLACE(pcd, '\"', '')")

How to use R to read XML data from S3 more quickly?

This is my first time working with XML data, and I'd appreciate any help/advice that you can offer!
I'm working on pulling some data that is stored on AWS in a collection of XML files. I have an index file that contains a list of the ~200,000 URLs where the XML files are hosted. I'm currently using the XML package in R to loop through each URL and pull the data from the node that I'm interested in. This is working fine, but with so many URLs, this loop takes around 12 hours to finish.
Here's a simplified version of my code. The index file contains the list of URLs. The parsed XML files aren't very large (stored as dat in this example...R tells me they're 432 bytes). I've put NodeOfInterest in as a placeholder for the spot where I'd normally list the XML tag that I'd like to pull data from.
library(XML)
compiled_data <- NULL ## master table, grown one file at a time
for (i in 1:200000) {
  url <- paste('',index[i,9],'_public.xml', sep="") ## create URL based off of index file
  dat <- xmlTreeParse(url, useInternal = TRUE) ## load entire XML file
  nodes <- getNodeSet(dat, "//x:NodeOfInterest", "x") ## find node for the tag I'm interested in
  if (length(nodes) > 0 && exists("dat")) {
    dat2 <- xmlToDataFrame(nodes) ## create data table from node
    compiled_data <- rbind(compiled_data, dat2) ## create master file
  }
}
It seems like there must be a more efficient way to pull this data. I think the longest step (by far) is loading the XML into memory, but I haven't found anything out there that suggests another option. Any advice???
Thanks in advance!
If parsing the XML into a tree is your pinchpoint (in xmlTreeParse) maybe use a streaming interface like SAX which will allow you to only process those elements that are useful for your application. I haven't used it, but the package xml2 is built on top of libxml2 which provides a SAX ability.

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message - character string - Any message to be written into the information
  # file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  # path). Should be the same as the output file it describes (RData,
  # csv, pdf).
  # NOTE: lines marked (assumed) were truncated in the post and are completed
  # here as a best guess based on the description above.
  if (is.null(filename))
    stop('Provide an output filename - parameter filename.')
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE) # (assumed)
  if (inherits(t.sf, 'try-error'))
    source.file <- NULL
  func <- deparse(sys.call(-1)) # (assumed) the call that invoked MetaInfo
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE))) # (assumed)
    func <- NULL
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Function=func,         # (assumed)
               System=Sys.info(),     # (assumed)
               Session=sessionInfo(), # (assumed)
               Git=git.h)             # (assumed)
  # (assumed) write the metadata to the text file
  sink(file=filename)
  print(meta)
  sink()
}
which can then be called in another function, stored in another file, e.g.:
RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  dev.off() # close the pdf device so the file is written
  MetaInfo(message=NULL, filename=fn)
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There is also a number of helpful documents like this one Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions @noah and @alex!
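The "wipe the output and rerun" idea can be sketched in a few lines of base R. Everything named here is hypothetical: the rebuild_outputs helper, the output/ folder convention and the script name are assumptions made for illustration, not part of any package mentioned in this thread. The point is only that stale files cannot survive a rebuild, so whatever is in the folder was produced by the current code.

```r
# Delete the output folder, recreate it, and rerun every analysis script.
# Hypothetical helper; assumes scripts write into "output/" relative to the project.
rebuild_outputs <- function(project_dir, scripts) {
  out_dir <- file.path(project_dir, "output")
  unlink(out_dir, recursive = TRUE)   # remove stale results
  dir.create(out_dir, recursive = TRUE)
  for (s in scripts) {
    # chdir = TRUE runs each script from its own directory;
    # a fresh environment keeps scripts from leaking variables into each other.
    source(file.path(project_dir, s), chdir = TRUE, local = new.env())
  }
  list.files(out_dir)
}

# Demo in a temporary project with one made-up script and one stale file.
project <- file.path(tempdir(), "demo_project")
dir.create(file.path(project, "output"), recursive = TRUE, showWarnings = FALSE)
writeLines("old", file.path(project, "output", "stale.txt"))
writeLines(
  'write.csv(data.frame(x = 1:3), file.path("output", "result.csv"), row.names = FALSE)',
  file.path(project, "01_analysis.R")
)

outputs <- rebuild_outputs(project, "01_analysis.R")
print(outputs)
```

Tools like Make or drake add dependency tracking on top of this, so only the outdated pieces are rebuilt.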
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
But you can still work with files as single-quoted targets. (See 'report.Rmd' and '' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.
