In R , Reading a PDF document [duplicate]

In R , Reading a PDF document [duplicate] - r

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?

Linux systems have pdftotext which I had reasonable success with. By default, it creates foo.txt from a give foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.

This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.

A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.

A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
then you'll have pdf lines in an array.

install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])

The tabula PDF table extractor app is based around a command line application based on a Java JAR package, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.

I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information
Set path to pdftotxt.exe and convert pdf to text
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for(i in 1:length(pdfFracList)){
fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
pdfSource <- paste0(reportDir,"/", fileNumber, ".pdf")
txtDestination <- paste0(reportDir,"/", fileNumber, ".txt")
print(paste0("File number ", i, ", Processing file ", pdfSource))
system(paste(exeFile, "-table" , pdfSource, txtDestination, sep = " "), wait = TRUE)
}

Related

Saving pptx as pdf in R

I have created powerpoint files using officer package and I would also like to save them as pdf from R (dont want to manualy open and save as pdf each file). Is this possible?

you can save the powerpoint object edited using the code which is posted here: create pdf in addition to word docx using officer.
You will need to first install pdftools and libreoffice
library(pdftools)
office_shot <- function( file, wd = getwd() ){
cmd_ <- sprintf(
"/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to pdf --outdir %s %s",
wd, file )
system(cmd_)
pdf_file <- gsub("\\.(docx|pptx)$", ".pdf", basename(file))
pdf_file
}
office_shot(file = "your_presentation.pptx")
Note that the author of the officer package is the one who referred someone to this response.

Note that the answer from Corey Pembleton has the LibreOffice iOS path. (Which I personally didn't initially notice). The Windows path would be something like "C:/Program Files/LibreOffice/program/soffice.exe".
Since the initial answer provided by Corey, an example using docxtractr::convert_to_pdf can now be found here.
The package and function are the ones John M commented in Corey initial answer.

An easy solution to this question is to use convert_to_pdf function from docxtractr package. Note: this solution requires to download LibreOffice from here. I used the following order.
First, I need to set the path to LibreOffice and soffice.exe
library(docxtractr)
set_libreoffice_path("C:/Program Files/LibreOffice/program/soffice.exe")
Second, I set the path of the PowerPoint document I want to convert to pdf.
pptx_path <- "G:/My Drive/Courses/Aysem/Certifications/September17_Part2.pptx"
Third, convert it using convert_to_pdf function.
pdf <- convert_to_pdf(pptx_path, pdf_file = tempfile(fileext = ".pdf"))
Be careful here. The converted pdf file is saved in a local temporary folder. Here is mine to give you an idea. Just go and copy it from the temporary folder.
"C:\\Users\\MEHMET~1\\AppData\\Local\\Temp\\RtmpqAaudc\\file3eec51d77d18.pdf"
EDIT: A quick solution to find where the converted pdf is saved. Just replace the third step with the following line of code. You can set the path where you want to save. You don't need to look for the weird local temp folder.
pdf <- convert_to_pdf(pptx_path, pdf_file = sub("[.]pptx", ".pdf", pptx_path))

How to make R read thousands .txt files (Financial Reports) at once?

I am new to R and would like to have it read all the 10-K financial reports which are in .txt format and I want it to read all these files at once?
I have tried
dir<- "/Users/Documents/Edgar filings"
text = readtext(paste0(dir, "/ALL_2016/*.txt")
However, nothing happens, I mean no output is produced at the console section, and also there is no errors.
I tried to have it read only one 10-K report through
dir <- "/Users/Documents/Edgar filings"
text = readtext(paste0(dir, "/ALL_2016/10254_10-K_2016-03-11*.txt")
but again neither any output nor any error messages.
I tried it through quanteda
mycorpus <- corpus(textfile("~/ALL_10-K_2016/*.txt"))
but this time it says "textfile function is not available for 3.4.3" even though my R version is 3.4.4 which I updated a couple of weeks ago.
I also tried using list.files() but I do not know how to create a loop coding, so I just got stuck!
I highly appreciate if someone please help.

Here is a solution using readChar with vapply:
store the folder path (note the \ before the space)
path <- "/Users/Documents/Edgar\ filings"
store the fully specified file paths
files <- file.path(path, list.files(path))
read in all of the files
documents <- vapply(X=1:length(files), FUN=function(i){
readChar(files[i], file.info(files[i])$size)
}, FUN.VALUE = character(1))

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.

Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I reversed args of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file, I couldn't open it with the yet excellent readxl package.
Looks like a tab separated flat file, but no success with read.table() either. readr::read_delim() made it.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <-
read_delim(
unzip("./data/rte_data", exdir = "./data/"),
delim = "\t",
locale = locale(encoding = "ISO-8859-1"),
col_names = TRUE
)
There still are parsing problems, but not sure they should be dealt with in this SO question.

Saving text from webpage for word cloud in R

I'm trying to practice making word clouds in R and I've seen the process nicely explained in sites like this (http://www.r-bloggers.com/building-wordclouds-in-r/) and in some videos on YouTube. So I thought I'd pick some random long document to practice myself.
I chose the script for Good Will Hunting. It is available here (https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html). What I did is copy that into Notepad++ and start removing blank lines, names, etc. to try to clean up the data before saving. Saving as a .csv file doesn't seem to be an option so I saved it as a .txt file and R doesn't seem to want to read it in.
Both of the following lines return errors in R.
goodwillhunting <- read.csv("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
goodwillhunting <- read.table("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
My question is based on an html document what is the best way to save it to be read in to be used for something like this? I know with the rvest package you can read in webpages. The tutorials for word clouds have used .csv files so I'm not sure if that's what my end goal needs to be.
This might be a way to read in the data going that route?
test = read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text = html_text(test)
Any help is appreciated!

Here's one way:
library(rvest)
library(wordcloud)
test <- read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/
award_winning/good_will_hunting.html")
text <- html_text(test)
content <- stringi::stri_extract_all_words(text, simplify = TRUE)
wordcloud(content, min.freq = 10, colors = RColorBrewer::brewer.pal(5,"Spectral"))
Which gives:

Here is a simple example:
library(wordcloud)
text = scan("fulltext.txt", character(0), strip.white = TRUE)
frequency_table = as.data.frame(table(text))
wordcloud(frequency_table$text, frequency_table$Freq)

Read SPSS file into R

I am trying to learn R and want to bring in an SPSS file, which I can open in SPSS.
I have tried using read.spss from foreign and spss.get from Hmisc. Both error messages are the same.
Here is my code:
## install.packages("Hmisc")
library(foreign)
## change the working directory
getwd()
setwd('C:/Documents and Settings/BTIBERT/Desktop/')
## load in the file
## ?read.spss
asq <- read.spss('ASQ2010.sav', to.data.frame=T)
And the resulting error:
Error in read.spss("ASQ2010.sav", to.data.frame = T) : error
reading system-file header In addition: Warning message: In
read.spss("ASQ2010.sav", to.data.frame = T) : ASQ2010.sav: position
0: character `\000' (
Also, I tried saving out the SPSS file as a SPSS 7 .sav file (was previously using SPSS 18).
Warning messages: 1: In read.spss("ASQ2010_test.sav", to.data.frame =
T) : ASQ2010_test.sav: Unrecognized record type 7, subtype 14
encountered in system file 2: In read.spss("ASQ2010_test.sav",
to.data.frame = T) : ASQ2010_test.sav: Unrecognized record type 7,
subtype 18 encountered in system file

I had a similar issue and solved it following a hint in read.spss help.
Using package memisc instead, you can import a portable SPSS file like this:
data <- as.data.set(spss.portable.file("filename.por"))
Similarly, for .sav files:
data <- as.data.set(spss.system.file('filename.sav'))
although in this case I seem to miss some string values, while the portable import works seamlessly. The help page for spss.portable.file claims:
The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. They are also adapted to load efficiently large data sets. Most importantly, importer objects support the labels, missing.values, and descriptions, provided by this package.

The read.spss seems to be outdated a little bit, so I used package called memisc.
To get this to work do this:
install.packages("memisc")
data <- as.data.set(spss.system.file('yourfile.sav'))

You may also try this:
setwd("C:/Users/rest of your path")
library(haven)
data <- read_sav("data.sav")
and if you want to read all files from one folder:
temp <- list.files(pattern = "*.sav")
read.all <- sapply(temp, read_sav)

I know this post is old, but I also had problems loading a Qualtrics SPSS file into R. R's read.spss code came from PSPP a long time ago, and hasn't been updated in a while. (And Hmisc's code uses read.spss(), too, so no luck there.)
The good news is that PSPP 0.6.1 should read the files fine, as long as you specify a "String Width" of "Short - 255 (SPSS 12.0 and earlier)" on the "Download Data" page in Qualtrics. Read it into PSPP, save a new copy, and you should be in business. Awkward, but free.
,

You can read SPSS file from R using above solutions or the one you are currently using. Just make sure that the command is fed with the file, that it can read properly. I had same error and the problem was, SPSS could not access that file. You should make sure the file path is correct, file is accessible and it is in correct format.
library(foreign)
asq <- read.spss('ASQ2010.sav', to.data.frame=TRUE)
As far as warning message is concerned, It does not affect the data. The record type 7 is used to store features in newer SPSS software to make older SPSS software able to read new data. But does not affect data. I have used this numerous times and data is not lost.
You can also read about this at http://r.789695.n4.nabble.com/read-spss-warning-message-Unrecognized-record-type-7-subtype-18-encountered-in-system-file-td3000775.html#a3007945

It looks like the R read.spss implementation is incomplete or broken. R2.10.1 does better than R2.8.1, however. It appears that R gets upset about custom attributes in a sav file even with 2.10.1 (The latest I have). R also may not understand the character encoding field in the file, and in particular it probably does not work with SPSS Unicode files.
You might try opening the file in SPSS, deleting any custom attributes, and resaving the file.
You can see whether there are custom attributes with the SPSS command
display attributes.
If so, delete them (see VARIABLE ATTRIBUTE and DATAFILE ATTRIBUTE commands), and try again.
HTH,
Jon Peck

If you have access to SPSS, save file as .csv, hence import it with read.csv or read.table. I can't recall any problem with .sav file importing. So far it was working like a charm both with read.spss and spss.get. I reckon that spss.get will not give different results, since it depends on foreign::read.spss
Can you provide some info on SPSS/R/Hmisc/foreign version?

Another solution not mentioned here is to read SPSS data in R via ODBC. You need:
IBM SPSS Statistics Data File Driver. Standalone driver is enough.
Import SPSS data using RODBC package in R.
See the example here. However I have to admit that, there could be problems with very big data files.

For me it works well using memisc!
install.packages("memisc")
load('memisc')
Daten.Februar <-as.data.set(spss.system.file("NPS_Februar_15_Daten.sav"))
names(Daten.Februar)

I agree with #SDahm that the haven package would be the way to go. I myself have struggled a bit with string values when starting to use it, so I thought I'd share my approach on that here, too.
The "semantics" vignette has some useful information on this topic.
library(tidyverse)
library(haven)
# Some interesting information in here
vignette('semantics')
# Get data from spss file
df <- read_sav(path_to_file)
# get value labels
df <- map_df(.x = df, .f = function(x) {
if (class(x) == 'labelled') as_factor(x)
else x})
# get column names
colnames(df) <- map(.x = spss_file, .f = function(x) {attr(x, 'label')})

There is no such problem with packages you are using. The only requirement for read a spss file is to put the file into a PORTABLE format file. I mean, spss file have *.sav extension. You need to transform your spss file in a portable document that uses *.por extension.
There is more info in http://www.statmethods.net/input/importingdata.html

In my case this warning was combined with a appearance of a new variable before first column of my data with values -100, 2, 2, 2, ..., a shift in the correspondence between labels and values and the deletion of the last variable. A solution that worked was (using SPSS) to create a new dump variable in the last column of the file, fill it with random values and execute the following code:
(filename is the path to the sav file and in my case the original SPSS file had 62 columns, thus 63 with the additional dumb variable)
library(memisc)
data <- as.data.set(spss.system.file(filename))
copyofdata = data
for(i in 2:63){
names(data)[i] <- names(copyofdata)[i-1]
}
data[[1]] <- NULL
newcopyofdata = data
for(i in 2:62){
labels(data[[i]]) <- labels(newcopyofdata[[i-1]])
}
labels(data[[1]]) <- NULL
Hope the above code will help someone else.

Turn your UNICODE in SPSS off
Open SPSS without any data open and run the code below in your syntax editor
SET UNICODE OFF.
Open the data set and resave it to remove the Unicode
read.spss('yourdata.sav', to.data.frame=T) works correctly then

I just came came across an SPSS file that I couldn't get open using haven, foreign, or memisc, but readspss::read.por did the trick for me:
download.file("http://www.tcd.ie/Political_Science/elections/IMSgeneral92.zip",
"IMSgeneral92.zip")
unzip("IMSgeneral92.zip", exdir = "IMSgeneral92")
# rio, haven, foreign, memisc pkgs don't work on this file! But readspss does:
if(!require(readspss)) remotes::install_git("https://github.com/JanMarvin/readspss.git")
ims92 <- readspss::read.por("IMSgeneral92/IMS_Nov7 92.por", convert.factors = FALSE)
Nice! Thanks, #JanMarvin!

1)
I've found the program, stat-transfer, useful for importing spss and stata files into R.
It resolves the issue you mention by converting spss to R dataset. Also very useful for subsetting super large datasets into smaller portions consumable by R. Not free, but a very useful tool for working with datasets from different programs -- especially if you don't have access to them.
2)
Memisc package also has an spss function worth trying.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

In R , Reading a PDF document [duplicate] - r

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R? In Python there is PDFMiner, but I would like to keep this analysis all in R if possible. Any suggestions?

Linux systems have pdftotext which I had reasonable success with. By default, it creates foo.txt from a give foo.pdf. That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.

This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.

A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.

A purely R solution could be: library('tm') file <- 'namefile.pdf' Rpdf <- readPDF(control = list(text = "-layout")) corpus <- VCorpus(URISource(file), readerControl = list(reader = Rpdf)) corpus.array <- content(content(corpus)[[1]]) then you'll have pdf lines in an array.

install.packages("pdftools") library(pdftools) download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf", "56901.DEN.Gamebook", mode = "wb") txt <- pdf_text("56901.DEN.Gamebook") cat(txt[1])

Related

Saving pptx as pdf in R

How to make R read thousands .txt files (Financial Reports) at once?

read an Excel file embedded in a website

Saving text from webpage for word cloud in R

Read SPSS file into R

Categories

Resources