Reading PDFs with the tm package in R

I am trying to read PDF files with the tm package. I have several folders with hundreds of documents each, and I have managed to read all of them except one. The problem is that the PDFs in that specific folder have a sequence of images at the bottom of the first page that prevents me from reading them. I get the following error:
Error in strptime(d, fmt) : input string is too long
If I remove the first page, I manage to read them. I could do that without much loss of relevant information, but it would be too much work.
I tried with xpdf and ghostscript, but both give me the same error.
My code is as follows:
library(rvest)
library(tm)
url<-paste0("http://www.tjrj.jus.br/search?q=acidente+de+transito+crianca+atropelamento&btnG=Pesquisar&processType=cnj&site=juris&client=juris&output=xml_no_dtd&proxystylesheet=juris&entqrm=0&oe=UTF-8&ie=UTF-8&ud=1&filter=0&getfields=*&partialfields=(ctd:1)&exclude_apps=1&ulang=en&lr=lang_pt&sort=date:D:S:d1&as_q=+&access=p&entqr=3&start=",seq(0,462,10))
css<-sprintf(".margin-top-10:nth-child(%.d) .outros .featured",1:10)
for (j in 1:1){ # there are 47 pages, but only one is used here
page <- read_html(url[j]) # read the j-th results page
for (i in 1:10){ # there are 10 files per page
a <- html_node(page, css = css[i]) %>%
html_attr("href")
download.file(a, paste0("doc", j, i, ".pdf"))
}
}
files <- list.files(pattern = "pdf$")
Rpdf <- readPDF(control = list(text = "-layout"))
docs <- Corpus(URISource(files,encoding="UTF-8"),readerControl = list(reader = Rpdf,language="portuguese"))
Does someone have a suggestion? I use a Mac.

Late answer:
I recently discovered that with the current version of tm (0.7-4), readPDF uses pdftools as the default engine for reading PDFs.
library(tm)
directory <- getwd() # change this to directory where pdf-files are located
# read the pdfs with readPDF, default engine used is pdftools see ?readPDF for more info
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
readerControl = list(reader = readPDF))
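If some files still fail, a quick check is to read one of them directly with pdftools, which sidesteps the xpdf metadata parsing where strptime is called. A minimal sketch ("doc11.pdf" is a placeholder for one of the problem files):
library(pdftools)
txt <- pdf_text("doc11.pdf") # one character string per page
cat(substr(txt[1], 1, 500)) # inspect the start of the first page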

Related

Downloading multiple files using purrr

I am trying to download all of the Excel files at: https://www.grants.gov.au/reports/gaweeklyexport
Using the code below I get errors similar to the following for each link (77 in total):
[[1]]$error
<simpleError in download.file(.x, .y, mode = "wb"): scheme not supported in URL '/Reports/GaWeeklyExportDownload?GaWeeklyExportUuid=0db183a2-11c6-42f8-bf52-379aafe0d21b'>
I get this error when trying to iterate over the full list, but when I call download.file on an individual list item it works fine.
I would be grateful if someone could tell me what I have done wrong or suggest a better way of doing it.
The code that produces the error:
library(tidyverse)
library(rvest)
# Reading links to the Excel files to be downloaded
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- html_attr(html_nodes(webpage, '.u'), "href")
# Creating names for the files to supply to the download.file function
wb_names = str_c(1:77, ".xlsx")
# Defining a function with purrr's safely() so the loop doesn't fail if there is a dead link
safe_download <- safely(~ download.file(.x , .y, mode = "wb"))
# Combining the links and file names with the function returns an error
map2(links, wb_names, safe_download)
You need to prepend 'https://www.grants.gov.au/' to each link to get the absolute URL of the file, which can then be used to download it.
library(rvest)
library(purrr)
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- paste0('https://www.grants.gov.au/', html_attr(html_nodes(webpage, '.u'), "href"))
safe_download <- safely(~ download.file(.x , .y, mode = "wb"))
# Creating names for the files to supply to the download.file function
wb_names = paste0(1:77, ".xlsx")
map2(links, wb_names, safe_download)
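An alternative sketch that avoids hard-coding the prefix: xml2::url_absolute() resolves the relative hrefs against the base URL (same assumptions about the page structure as above):
library(rvest)
library(purrr)
library(xml2)
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# resolve the relative hrefs against the base URL instead of pasting a prefix
links <- url_absolute(html_attr(html_nodes(webpage, '.u'), "href"), url)
wb_names <- paste0(seq_along(links), ".xlsx")
safe_download <- safely(~ download.file(.x, .y, mode = "wb"))
map2(links, wb_names, safe_download)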

R, edit movie files or avi files?

I have a bunch of .avi files I would like to load into R, break down each frame as an individual image, arrange the images, and save them as separate images. Despite a sincere effort to find a package that loads .avi files, I can't find anything.
1) is it possible to load and work with avi files in r?
2) how is this done?
3) is there a specific library for this?
I've seen several examples using Linux, such as this post, but I'm hoping for an R solution.
Converting AVI Frames to JPGs on Linux
I figured this out. As indicated in the comments section, ffmpeg is the tool called by various R packages to load videos into the R environment. The imager package has a function called load.video.internal that works well and uses ffmpeg. I had to download the package from GitHub because this function was not available in the version I installed with install.packages. I ended up copy/pasting the function from the package source and commenting out the reference to the has.ffmpeg function because it kept hanging at that step. Using paste, I was able to specify the file path and load an avi file successfully.
Modified load.video.internal function:
load.video.internal <- function(fname, maxSize = 1, skip.to = 0, frames = NULL,
                                fps = NULL, extra.args = "", verbose = FALSE)
{
  # if (!has.ffmpeg()) stop("Can't find ffmpeg. Please install.")
  dd <- paste0(tempdir(), "/vid")
  if (!is.null(frames)) extra.args <- sprintf("%s -vframes %i ", extra.args, frames)
  if (!is.null(fps)) extra.args <- sprintf("%s -vf fps=%.4d ", extra.args, fps)
  arg <- sprintf("-i %s %s -ss %s ", fname, extra.args, as.character(skip.to)) %>%
    paste0(dd, "/image-%d.bmp")
  tryCatch({
    dir.create(dd)
    system2("ffmpeg", arg, stdout = verbose, stderr = verbose)
    fls <- dir(dd, full.names = TRUE)
    if (length(fls) == 0) stop("No output was generated")
    ordr <- stringr::str_extract(fls, "(\\d)+\\.bmp") %>%
      stringr::str_split(stringr::fixed(".")) %>%
      purrr::map_int(~ as.integer(.[[1]])) %>%
      order
    fls <- fls[ordr]
    # Check total size
    imsz <- load.image(fls[[1]]) %>% dim %>% prod
    tsz <- ((imsz * 8) * length(fls)) / (1024^3)
    if (tsz > maxSize)
    {
      msg <- sprintf("Video exceeds maximum allowed size in memory (%.2d Gb out of %.2d Gb)", tsz, maxSize)
      unlink(dd, recursive = TRUE)
      stop(msg)
    }
    else
    {
      out <- map_il(fls, load.image) %>% imappend("z")
      out
    }
  },
  finally = unlink(dd, recursive = TRUE))
}
Example of its use:
vid_in <- load.video.internal(paste("/home/phil/Documents/avi_files/Untitled.avi"))
UPDATED SIMPLER VERSION:
The magick package will read avi files using its image_read function.
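A minimal sketch, assuming ImageMagick was built with video (ffmpeg) support; the file name is a placeholder:
library(magick)
frames <- image_read("Untitled.avi") # one magick frame per video frame
length(frames) # number of frames read
image_write(frames[1], "frame1.png") # save an individual frame as an image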

How to change row names of a DTM when writing to .csv in R

I have a large set of documents stored in a folder. I used these documents for text mining, using the tm package. I got all the way to topic modeling and want to write some results to a csv file. However, when doing this, the names of the documents are represented like this: character(0).
I want to have the names of the documents as they are stored in my folder. This is the code I use (relevant steps shown only):
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
readerControl = list(reader = readPDF, language = "dutch"))
dtm <- DocumentTermMatrix(my_corpus)
library(topicmodels)
ldaOut <- LDA(dtm, k, method = "Gibbs")
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "CorpusToTopics.csv"))
I can't seem to find the answer anywhere. I assume it's basic R code that I don't know.
It's strange how you lose the document names; I can't reproduce this error, and I have a lot of different folders with PDFs and plenty of different naming conventions.
Check the result of dtm$dimnames$Docs right after you create the dtm. If this results in character(0), you can do the following to get the document names into the document term matrix.
pdf_names <- list.files(directory, pattern = ".pdf")
dtm$dimnames$Docs <- pdf_names
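Applied to the pipeline from the question, the fix goes in right after the DTM is created, so the csv rows carry the file names; a sketch reusing directory, dtm and k from above:
pdf_names <- list.files(directory, pattern = ".pdf")
dtm$dimnames$Docs <- pdf_names # attach the file names to the DTM
ldaOut <- LDA(dtm, k, method = "Gibbs")
ldaOut.topics <- as.matrix(topics(ldaOut)) # row names are now the file names
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "CorpusToTopics.csv"))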

In R , Reading a PDF document [duplicate]

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
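pdftotext can also be called from R via a system call; a minimal sketch, assuming pdftotext is on the PATH and "foo.pdf" is a placeholder:
system2("pdftotext", "foo.pdf") # writes foo.txt next to foo.pdf
txt <- readLines("foo.txt")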
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the PDF's lines in an array.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line Java application, the tabula-extractor JAR package.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
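A minimal sketch of that workflow, assuming the tabulizer package is installed; "report.pdf" is a placeholder for a PDF containing a data table:
library(tabulizer)
tables <- extract_tables("report.pdf", pages = 1) # list with one matrix per detected table
df <- as.data.frame(tables[[1]], stringsAsFactors = FALSE)
head(df)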
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert each PDF to text:
library(stringr) # for str_sub
# pdfFracList holds the PDF file names; reportDir is the folder containing them
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for(i in 1:length(pdfFracList)){
fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
print(paste0("File number ", i, ", Processing file ", pdfSource))
system(paste(exeFile, "-table", pdfSource, txtDestination, sep = " "), wait = TRUE)
}
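The resulting .txt files can then be read back into R; a short sketch reusing reportDir from above:
txtFiles <- list.files(reportDir, pattern = "\\.txt$", full.names = TRUE)
texts <- lapply(txtFiles, readLines) # one character vector of lines per report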

Using R to download zipped data file, extract, and import data

#EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?
Zip archives are actually more of a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above, you need to:
Create a temporary file name (e.g. with tempfile())
Use download.file() to fetch the file into that temporary file
Use unz() to extract the target file from the temporary file
Remove the temporary file via unlink()
which in code (thanks for the basic example, but this is simpler) looks like:
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z), gzipped (.gz), or bzip2ed (.bz2) files are just the single file, and those you can read directly from a connection. So get the data provider to use that instead :)
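For example, a gzipped flat file can be read straight from a connection without a temporary file (hypothetical URL shown):
con <- gzcon(url("http://www.newcl.org/data/a1.dat.gz")) # hypothetical gzipped file
txt <- readLines(con)
close(con)
data <- read.table(textConnection(txt))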
Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html. Much easier.
download(url, dest="dataset.zip", mode="wb")
unzip("dataset.zip", exdir = "./")
For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the bash command funzip, in conjunction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
library(readxl) # for read_xls
url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile()
temp2 <- tempfile()
download.file(url, temp)
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
unlink(c(temp, temp2))
To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
rm(temp)
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.
Using library(archive) one can also read a particular csv file within the archive without having to unzip it first:
read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
I find this more convenient, and it is faster.
It also supports all major archive formats & is quite a bit faster than the base R untar or unz - it supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz & uuencoded files.
To unzip everything one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms & given the superior performance for me would be the preferred option.
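A short sketch of both calls, assuming the archive and readr packages are installed:
library(archive)
library(readr)
zip_url <- "http://www.newcl.org/data/zipfiles/a1.zip"
# read one member straight out of the archive
dat <- read_csv(archive_read(zip_url, file = 1), col_types = cols())
# or extract everything into a directory (tempdir() used as a placeholder)
archive_extract(zip_url, dir = tempdir())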
Try this code. It works for me:
unzip(zipfile="<directory and filename>",
exdir="<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")
The rio package would be very suitable for this: it uses the file extension of a file name to determine what kind of file it is, so it will work with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
# delete the files and directories
unlink(td)
I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)
