readPDF (tm library) format issue in R

Hello, I am trying to read a PDF but am having issues with the format.
The document and code are below.
The issue is that the output doesn't respect the original PDF's line breaks: the last item from line 4 appears on line 5. Is that something I can correct?
I am asking because I need to read thousands of files like this one, and most of them have this issue.
When using a PDF-to-Excel converter on the web I don't have this issue.
Thanks
library(tm)  # provides readPDF()

URL = "http://www.arb.ca.gov/cc/capandtrade/offsets/issuance/cals5047-a-b.pdf"
destfile = "filetoconvert.pdf"
download.file(URL, destfile)
doc = readPDF(control = list(text = "-layout"))(elem = list(uri = destfile),
                                                language = "en",
                                                id = "id1")
issuance2 = NULL
issuance2delim = NULL
doc = c(as.character(doc))
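One possible alternative worth trying (my suggestion, not part of the original post): the pdftools package's pdf_text() returns one character string per page with the page layout preserved, so splitting on newlines tends to keep the original line structure.

```r
# Sketch using pdftools instead of tm (assumption: pdftools is installed
# and filetoconvert.pdf has already been downloaded as above)
library(pdftools)

pages <- pdf_text("filetoconvert.pdf")   # one string per page, layout preserved
lines <- unlist(strsplit(pages, "\n"))   # split each page into its lines
head(lines)                              # inspect whether line breaks survived
```

Whether the line breaks survive still depends on how the PDF encodes its text, so this is worth testing on a few of the problem files first.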

Related

How to write a CSV in R that is correctly read by Google Spreadsheets

Problem
I am trying to write a dataframe to a CSV file that will be read in correctly by Google Spreadsheets; however, I am running into an error with one particular column format.
There is one column called 'details' whose values look like this: {\"campaign_id\":1,\"line_item_id\":1234}. This column is read correctly by R from the original dataframe supplied by Google Spreadsheets, but when it is written to a CSV, the column is split at the , into two columns, overwriting the values of the following column (which is empty by default).
Data
The dataframe in R looks like this:
data <- structure(
  list(
    `Line Item Id` = c(1234, 4567),
    Details = c(
      "{\"campaign_id\":1,\"line_item_id\":1234}",
      "{\"campaign_id\":1,\"line_item_id\":4567}"
    ),
    `Bid Strategy Type` = c("", ""),
    `TrueView Video Ad Formats` = c("In-stream / Video Discovery",
                                    "In-stream / Video Discovery"),
    `TrueView Bid Strategy Type` = c("Manual CPV",
                                     "Manual CPV")
  ),
  row.names = 1:2,
  class = "data.frame"
)
Current approach
I have tried wrapping the relevant column in quotes:
library(tidyverse)

data %>%
  mutate(Details = dQuote(Details)) %>%
  write.csv("test.csv", fileEncoding = "UTF-8", na = "", row.names = FALSE, quote = FALSE)
But this does not seem to work, and neither does omitting the dQuote.
My output CSV is this:
(screenshot: test.csv generated by the above code)
More Details
The data being wrangled here is an SDF generated by DV360, a Google platform for managing YouTube ad campaigns. In my process, I download an SDF from DV360, change some values in R, and upload it back. However, re-uploading does not work at the moment because of the problem described. I have tested to confirm that the column problem described above is causing the issue; if it is corrected manually, uploading works.
Expected output
I have added the expected output and the output I am getting.
What I have at the moment:
Line Item Id,Details,TrueView Video Ad Formats,TrueView Bid Strategy Type
14596716402,“{"campaign_id":283,"line_item_id":99588}”,In-stream / Video Discovery,
14596725552,“{"campaign_id":283,"line_item_id":99585}”,In-stream / Video Discovery,
What I need:
Line Item Id,Details,TrueView Video Ad Formats,TrueView Bid Strategy Type
1234,"{""campaign_id"":1,""line_item_id"":1234}",,In-stream / Video Discovery
4567,"{""campaign_id"":1,""line_item_id"":4567}",,In-stream / Video Discovery
And, quite interestingly, what I get when I fix the problem by hand in Google Sheets and then download the file:
Line Item Id,Details,TrueView Video Ad Formats,TrueView Bid Strategy Type
1234,"""{""""campaign_id"""":1,""""line_item_id"""":1234}""",,In-stream / Video Discovery
4567,"""{""""campaign_id"""":1,""""line_item_id"""":4567}""",,In-stream / Video Discovery
After getting valuable input from @Greg and @MrFlick, I was finally able to solve it.
For Google ecosystem (Spreadsheets and Dv360) to correctly read the column it needs to have this format:
"{""campaign_id"":1,""line_item_id"":1234}"
Using dQuote() will put the necessary quotes around the column, but due to my system settings, the wrong quote type was supplied, so we need to turn off useFancyQuotes.
Additionally, the quotes already occurring around campaign_id and line_item_id need to be doubled.
Maybe there is a faster way, but the following code works:
library(dplyr) # only needed for the pipe, not part of the solution

options(useFancyQuotes = FALSE)

data %>%
  mutate(Details = dQuote(gsub('"', '""', Details))) %>%
  write.csv("test3.csv", fileEncoding = "UTF-8", na = "", row.names = FALSE, quote = FALSE)
So we first convert all quotes to doubled quotes, which I did with gsub(), and then use dQuote() to put the final quotes around the column, making sure not to use fancy, directional quotes.
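To see the transformation on a single value (a minimal sketch I am adding; the example string is mine, not taken from a real SDF):

```r
options(useFancyQuotes = FALSE)  # straight quotes, not directional ones

x <- '{"campaign_id":1,"line_item_id":1234}'
doubled <- gsub('"', '""', x)    # double every embedded quote
quoted  <- dQuote(doubled)       # wrap the whole value in straight double quotes
cat(quoted, "\n")
# "{""campaign_id"":1,""line_item_id"":1234}"
```

This is exactly the per-field escaping that the CSV format expects, which is why Google's tools then parse the column as a single cell.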

magick image library writing incorrect header to PDF files?

I am using the very useful magick library to read and annotate PDF files and overlay an image on the result. I can generate a PDF file that looks as I would expect it to look. However, when I open the file, the header, which I would expect to read something like %PDF-1.7, reads ‰PNG instead.
It looks to me as if magick is looking at the most recent operation, which is image_composite for a PNG file, and using this for the header. If so, is this a bug? The PDF file that is output appears otherwise well-formed, so it doesn't seem to be causing problems, but I am curious. The following code should reproduce the issue.
require(magick)
require(pdftools)

pdf_file <- "https://web.archive.org/web/20140624182842/http://www.gnupdf.org/images/d/db/Hello.pdf"
image_file <- "https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/PDF_file_icon.svg/200px-PDF_file_icon.svg.png"

my_image <- image_read(image_file, density = 300)
pdfimage <- image_read_pdf(pdf_file, density = 300)
pdfimage2 <- image_annotate(pdfimage, "test",
                            location = "+400+700", style = "normal",
                            weight = 400, size = 42)
pdfimage3 <- image_composite(pdfimage2, my_image, operator = "atop",
                             offset = "+100+100")
image_write(pdfimage3, path = "C:/temp/test.pdf", density = 300, flatten = TRUE)
I held off from answering this because the solution is embarrassingly obvious. In retrospect, I simply assumed that because I used image_read_pdf, image_write would save in PDF format. What I needed to do was specify the format explicitly: adding a format = "pdf" argument to the image_write call achieved that.
image_write(pdfimage3, path = "C:/temp/test.pdf", density = 300, format = "pdf", flatten = TRUE)
This results in a well-formed PDF. Problem solved. Lesson learned.
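A quick way to confirm the header from R itself (a sketch I am adding; the path is the one used in the example above):

```r
# Read the first five bytes of the written file and check the magic number:
# a well-formed PDF starts with "%PDF-", a PNG starts with the bytes \x89PNG
header <- rawToChar(readBin("C:/temp/test.pdf", what = "raw", n = 5))
header
```

With the format = "pdf" argument in place, this should return "%PDF-".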

Why figures are not being pulled into the PDF from R Markdown

I'm using RStudio and knitr to create reproducible PDF reports of work. However, figures are not pulled into the document; instead there is "figure/unnamed-chunk-" where each image should be.
The images are produced and saved to 'home/figure/'.
The code I use to create the PDF is:
library(knitr)     # for spin()
library(rmarkdown) # for render()

Rfile = "/Users/user/Documents/folder/file.R"
setwd(dirname(Rfile))
spin(Rfile, format = 'Rmd', report = FALSE)
render(paste(substring(Rfile, 0, nchar(Rfile) - 1), "md", sep = ""),
       pdf_document(toc = TRUE, toc_depth = 6, number_sections = TRUE),
       output_file = paste(substring(Rfile, 0, nchar(Rfile) - 2), ".pdf", sep = ""))
In the md file, there is a line for each figure that is
figure/unnamed-chunk-X-X.pdf
I've tried adding the lines below after reading the answers at https://groups.google.com/forum/#!topic/knitr/_sw4sAtLkoQ, but they don't make a difference.
opts_knit$set(base.dir = dirname(Rfile))
opts_knit$set(fig.path = '/figure/')
I'm sure there is a simple fix to this but I can't see what it might be.
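One thing worth checking (my suggestion, not from the original post): in knitr, fig.path is a chunk option, so it belongs in opts_chunk$set() rather than opts_knit$set(), and a relative path without a leading slash is usually safer.

```r
library(knitr)

# base.dir is a package option; fig.path is a chunk option
opts_knit$set(base.dir = dirname(Rfile))  # Rfile as defined above
opts_chunk$set(fig.path = "figure/")      # relative path, no leading "/"
```

With both set this way, the figure references written into the .md file should resolve relative to base.dir when pandoc builds the PDF.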

Saving text from webpage for word cloud in R

I'm trying to practice making word clouds in R. I've seen the process nicely explained on sites like this (http://www.r-bloggers.com/building-wordclouds-in-r/) and in some videos on YouTube, so I thought I'd pick a random long document to practice on myself.
I chose the script for Good Will Hunting, available here (https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html). What I did was copy it into Notepad++ and remove blank lines, names, etc., to clean up the data before saving. Saving as a .csv file doesn't seem to be an option, so I saved it as a .txt file, and R doesn't seem to want to read it in.
Both of the following lines return errors in R.
goodwillhunting <- read.csv("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
goodwillhunting <- read.table("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
My question is: starting from an HTML document, what is the best way to save it so it can be read in and used for something like this? I know that with the rvest package you can read in webpages. The tutorials for word clouds have used .csv files, so I'm not sure if that's what my end goal needs to be.
This might be a way to read in the data going that route?
library(rvest)

test = read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text = html_text(test)
Any help is appreciated!
Here's one way:
library(rvest)
library(wordcloud)

test <- read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text <- html_text(test)
content <- stringi::stri_extract_all_words(text, simplify = TRUE)
wordcloud(content, min.freq = 10, colors = RColorBrewer::brewer.pal(5, "Spectral"))
Which gives a word cloud of the script (image omitted).
Here is a simple example:
library(wordcloud)
text = scan("fulltext.txt", character(0), strip.white = TRUE)
frequency_table = as.data.frame(table(text))
wordcloud(frequency_table$text, frequency_table$Freq)
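Before counting, it can help to normalize the text so that capitalization and punctuation don't split one word into several table entries. A base-R sketch (the sample words are my own, not from the script):

```r
# Normalize a small sample of words before building a frequency table
words <- c("The", "the", "boy's", "genius", "Genius.")
words <- tolower(words)                  # fold case so "The" and "the" match
words <- gsub("[[:punct:]]", "", words)  # strip punctuation such as ' and .
frequency_table <- as.data.frame(table(words))
frequency_table
# boys 1, genius 2, the 2
```

The same two lines can be applied to the scanned script text before calling wordcloud(), which reduces near-duplicate entries in the cloud.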

Set encoding for output in R globally

I'd like R to render its whole output (both to the console and to files) in UTF-8. Is there a way to define the encoding for R output for a whole document?
I think what you're getting at comes from options. For example, here's part of the help page for file:
file(description = "", open = "", blocking = TRUE,
encoding = getOption("encoding"), raw = FALSE)
So if you investigate setting options(encoding = your_choice_here), you may be all set.
Edit: If you haven't already, be sure to set your locale to the language desired.
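A minimal sketch of what that might look like (the file name and sample string are my own examples):

```r
options(encoding = "UTF-8")  # default encoding for connections opened afterwards

con <- file("out.txt", open = "w")  # picks up encoding from getOption("encoding")
writeLines("café, naïve", con)      # non-ASCII text written as UTF-8
close(con)
readLines("out.txt")
```

Note that this sets the default for newly opened connections; connections that specify their own encoding= argument are unaffected.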
