Extracting/Parsing from a PDF to CSV using R?

I am trying to extract data from a poorly formatted PDF into a .csv file for geocoding. The data I am concerned with are the locations of Farmers' Markets in Colorado for 2018 (https://www.colorado.gov/pacific/sites/default/files/Colorado%20Farmers%27%20Markets.pdf). The fields I am looking to end up with are Business_Name, Address, City, State, Zip, Hours, Season, Email, and Website. The trouble is that the data are all in one column, and not all of the entries are complete. That is to say, one entry may have five attributes under it (name, address, hours, zip, website) while another may only have two lines of attributes (name, address).
I found an embedded map of locations here (http://www.coloradofarmers.org/find-markets/) that references the PDF file above. I was able to save this map to MyMaps and copy/paste the table to a CSV, but there are missing entries.
Is there a way to cleanly parse this data from PDF to CSV? I imagine what I need to do is create a dictionary of Colorado towns that have markets (e.g. 'Denver', 'Canon City', 'Telluride') and then have R walk down the column, placing every line that falls between two look-up cities onto the previous city's row, either as separate field columns or as one comma-delimited field to parse out afterwards based on what each field looks like (roughly as sketched after the code below).
Here's what I have so far:
#Set the working directory
setwd("C:/Users/bwhite/Desktop")

#download the PDF of data (mode = "wb" so the binary PDF is not corrupted on Windows)
?download.file
download.file("https://www.colorado.gov/pacific/sites/default/files/Colorado%20Farmers%27%20Markets.pdf",
              destfile = "./ColoradoMarkets2018.pdf", method = "auto",
              quiet = FALSE, mode = "wb", cacheOK = TRUE)

#import the pdftables library from CRAN
install.packages("pdftables")
library(pdftables)

#convert the downloaded PDF to CSV (uses the PDFTables web API)
?convert_pdf
convert_pdf("./ColoradoMarkets2018.pdf", output_file = "FarmersMarkets.csv",
            format = "csv", message = TRUE, api_key = "n7qgsnz2nkun")

# read in CSV
Markets18 <- read.csv("./FarmersMarkets.csv")

#create a look-up table list of Colorado cities
install.packages("htmltab")
library(htmltab)
CityList <- htmltab("https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Colorado", 1)
names(CityList)
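The grouping step I have in mind would be roughly this (an untested sketch; it assumes the city column in CityList is actually called City, which names(CityList) above would confirm):

library(pdftools)
#read the PDF and break it into individual trimmed lines
lines <- unlist(strsplit(pdf_text("./ColoradoMarkets2018.pdf"), "\n"))
lines <- trimws(lines)
lines <- lines[lines != ""]
#flag lines that are exactly a look-up city, then group every following line
#with the most recent city seen
is_city <- lines %in% CityList$City
group <- cumsum(is_city)
blocks <- split(lines[group > 0], group[group > 0])
#one row per city block: the city plus one delimited field to parse out later
Markets <- data.frame(
  City  = sapply(blocks, `[`, 1),
  Entry = sapply(blocks, function(x) paste(x[-1], collapse = " | ")),
  stringsAsFactors = FALSE
)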
Any help is appreciated.

You can only attempt to extract information that is consistent. I'm not an expert, but I tried to build some logic for part of it. Pages 2-20 are somewhat free of dirty data. Also, if you notice, each group can be split at "p.m." (for the most part). Since the number of columns differs between entries, it was difficult to build one logic for everything. Even the extracted data frame would require some transformation.
library(pdftools)
library(plyr)

text <- pdf_text("Colorado Farmers' Markets.pdf")

# one row per PDF page
text4 <- data.frame(Reduce(rbind, text), row.names = c(), stringsAsFactors = FALSE)

new <- data.frame()
for (i in 2:20) {                       # pages 2-20 are reasonably clean
  page <- text4[i, 1]
  # split each page into market blocks at "p.m." (fixed = TRUE so "." is literal)
  blocks <- strsplit(page, "p.m.", fixed = TRUE)
  final <- data.frame(Reduce(rbind, blocks), row.names = c(), stringsAsFactors = FALSE)
  for (j in 1:nrow(final)) {
    # split one block into its lines and append them as a (ragged) row
    block_lines <- strsplit(final[j, ], "\n")
    new <- rbind.fill(new, data.frame(t(data.frame(block_lines, row.names = c()))))
  }
}
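As noted, the extracted data frame still needs some transformation. A possible first pass (an untested sketch) is to trim whitespace and drop any columns that came out entirely empty:

# Untested cleanup sketch: trim whitespace, then drop all-empty columns
new[] <- lapply(new, function(col) trimws(as.character(col)))
keep <- colSums(!is.na(new) & new != "") > 0
new <- new[, keep, drop = FALSE]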

Related

Is there a way to automate the extraction of the Microsoft Office 365 Quarantine Data CSV file into Multiple CSV Files?

When I ran a security report through the Office 365 Admin Email Explorer to obtain detailed information about emails and their respective types of attacks, I downloaded the .csv file and manually used Microsoft Excel to filter out rows with the exact same email subject and save them to their own .csv files. This took a long time, since there were quite a lot of emails with the same or differing subject titles.
1. Downloaded the .csv file from the Office 365 Admin portal with a date range of the past 7 days.
2. Imported it into R using the command below:
Office_365_Report_CSV = "C:/Users/absnd/Documents/2022-11-18office365latestquarantine.csv"
3. Loaded the data.table library:
require(data.table)
4. Created a new variable and read the data into a data frame:
quarantine_data = fread(Office_365_Report_CSV, sep = ",", header = TRUE, check.names = FALSE)
5. Pulled the columns needed for filtering from the data frame:
Quarantine_Columns = quarantine_data[,c("Email date (UTC)","Recipients","Subject","Sender","Sender IP","Sender domain","Delivery action","Latest delivery location","Original delivery location","Internet message ID","Network message ID","Mail language","Original recipients","Additional actions","Threats","File threats","File hash","Detection technologies","Alert ID","Final system override","Tenant system override(s)","User system override(s)","Directionality","URLs","Sender tags","Recipient tags","Exchange transport rule","Connector","Context" )]
Steps that still need to be done (I am not sure where to go from here):
6. I would like R to write the rows that share the same "Subject" value to individual .csv files, each containing all of the columns from step 5.
Sub-step ex. 1: if the row's "Threats" column value is "Phish", generate a file named "YYYY-MM-DD Phishing <number increment +1>.csv".
Sub-step ex. 2: if the row's "Threats" column value is "Phish, Spam", generate a file named "YYYY-MM-DD Phishing and Spam <number increment +1>.csv".
Step 6 would thus filter rows with the same "Subject" value and save them together into a single file, named according to the conditions in the sub-steps above.
First of all, you are looking to do this in R; RStudio is just an IDE that makes using R easier.
If you save your data frames in a list, and set a vector of the file names you want to give each of those files, you can then use purrr::walk2() to iterate through the saving. Some reproducible code as an example:
library(purrr)
library(glue)
library(readr)
mydfs_l <- list(mtcars, iris)
file_names <- c("mtcars_file", "iris_file")
walk2(mydfs_l, file_names, function(x, y) { write_excel_csv(x, glue("mypath/to/files/{y}.csv")) })
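Applied to the quarantine data in the question, a rough, untested sketch (it assumes the Quarantine_Columns data frame from step 5, and the date prefix and Threats-based label only approximate the naming scheme described in step 6) might look like this:

library(purrr)
library(glue)
library(readr)

# Untested sketch: one data frame per distinct Subject value
dfs_by_subject <- split(Quarantine_Columns, Quarantine_Columns$Subject)

# Build one file name per group, e.g. "2022-11-18 Phish and Spam 3"
file_names <- imap_chr(dfs_by_subject, function(df, subject) {
  threat_label <- gsub(",\\s*", " and ", df$Threats[1])
  as.character(glue("{Sys.Date()} {threat_label} {match(subject, names(dfs_by_subject))}"))
})

# Write each group to its own CSV
walk2(dfs_by_subject, file_names, function(x, y) {
  write_excel_csv(x, glue("mypath/to/files/{y}.csv"))
})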

Converting missing values to NA within sapply function

I want to import json data from some kind of fitness tracker in order to run some analyses on it. The single json files are quite large, while I am only interested in specific numbers per training session (each json file is one training session).
I managed to read in the names of the files and to grab the interesting content out of them. Unfortunately my code obviously does not work correctly if one or more pieces of information are missing from some of the json files (e.g. distance is not available because it was an indoor training session).
I stored all json files with training sessions in a folder (= path in the code) and asked R to get a list of the files in that folder:
json_files<- list.files(path,pattern = ".json",full.names = TRUE) #this is the list of files
jlist<-as.list(json_files)
Then I wrote this function to get the data I'm interested in from each single file (reading in the full content of every file exceeded my available RAM capacity...):
library(jsonlite)

importPFData <- function(x) {
  testimport <- fromJSON(x)
  sport            <- testimport$exercises$sport
  starttimesession <- testimport$exercises$startTime
  endtimesession   <- testimport$exercises$stopTime
  distance         <- testimport$exercises$distance
  durationsport    <- testimport$exercises$duration
  maxHRsession     <- testimport$exercises$heartRate$max
  minHRsession     <- testimport$exercises$heartRate$min
  avgHRsession     <- testimport$exercises$heartRate$avg
  calories         <- testimport$exercises$kiloCalories
  VO2max_overall   <- testimport$physicalInformationSnapshot$vo2Max
  return(c(starttimesession, endtimesession, sport, distance, durationsport,
           maxHRsession, minHRsession, avgHRsession, calories, VO2max_overall))
}
Next I applied this function to all elements of my list of files:
dataTest<-sapply(jlist, importPFData)
I receive a list with one entry per file, as expected. Unfortunately not all of the data is available in every file, which results in some entries having 7 values while others have 8, 9, or 10.
I struggle to get this into a proper data frame because the missing information is not shown as NA or 0; it is simply left out.
Is there an easy way to make the function above return NA when a piece of information is not found in an individual json file (e.g. distance not available --> "NA" for distance for that single entry)?
Example (csv) of the content of a file with 10 entries:
"","c..2013.01.06T08.52.38.000....2013.01.06T09.52.46.600....RUNNING..."
"1","2013-01-06T08:52:38.000"
"2","2013-01-06T09:52:46.600"
"3","RUNNING"
"4","6890"
"5","PT3608.600S"
"6","234"
"7","94"
"8","139"
"9","700"
"10","48"
Example (csv) of a file with only 7 entries (the columns won't match those of Example 1):
"","c..2014.01.22T18.38.30.000....2014.01.22T18.38.32.000....RUNNING..."
"1","2014-01-22T18:38:30.000"
"2","2014-01-22T18:38:32.000"
"3","RUNNING"
"4","0"
"5","PT2S"
"6","0"
"7","46"

Mass create documents in R markdown

I'm wondering if anyone can help me with a dilemma I'm having.
I have a dataset of around 300 individuals that contains some basic information about them (e.g. age, gender). I want to knit a separate R markdown report for each individual that details this basic information, in Word format. And then I want to save each report with their unique name.
The actual code which sits behind the report doesn't change, only the details of each individual. For example:
Report 1: "Sally is a female who is 34 years old". (Which I would want to save as Sally.doc)
Report 2: "Mike is a male who is 21 years old." (Saved as Mike.doc)
Etc, etc.
Is there a way I can do this without manually filtering the data, re-knitting the document, and then saving it manually with unique names?
Thanks a lot!
Use the render function and pass a list of names to the function:
renderMyDocument <- function(name) {
  rmarkdown::render("./printing_procedures/dagr_parent.Rmd",
                    params = list(names = name),
                    output_file = paste("~/document_", name, '.doc', sep = ''))
}

lapply(names, renderMyDocument)
Then just make sure your RMD file can take the params argument via the YAML:
---
params:
  names:
---
RStudio documentation on parameterized reports here: https://rmarkdown.rstudio.com/developer_parameterized_reports.html
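Inside the Rmd itself, the parameter can then drive both the filtering and the inline text. A minimal sketch of dagr_parent.Rmd (the people data frame and its name/gender/age columns are assumptions, not from the question) might be:

---
output: word_document
params:
  names: "Sally"
---

```{r, include=FALSE}
# 'people' is a placeholder for however you load your dataset of ~300 individuals
person <- subset(people, name == params$names)
```

`r person$name` is a `r person$gender` who is `r person$age` years old.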

Loop for extracting variable from each document and placing in appropriate column

My company documents summaries of policies/services for each client in a PDF file. These files are combined into a large dataset each year: one row per client, with columns for the variables in the client's document. There are a couple thousand of these files and each one has approximately 20-30 variables. I want to automate this process by creating a data.frame with each row representing a client and then pulling the variables for each client from their PDF document. I'm able to create a list or data.frame of all the clients from the PDF filenames in a directory, but I don't know how to create a loop that pulls each variable I need from each document. I currently have two different methods which I can't decide between, and I also need help with a loop that grabs the variables I need from each client document. My code and links to two mock files are provided below. Any help would be appreciated!
Files: Client 1 and Client 2
Method 1: pdftools
The benefit of the first method is that it extracts the entire PDF into a vector, with each page as a separate element. This makes it easier for me to pull strings/variables. However, I don't know how to loop it to pull the information from each client and place it appropriately in a column for that client.
library(pdftools)
library(stringr)
library(dplyr)

Files <- list.files(path="...", pattern=".pdf")
Files <- Files %>% mutate(FR =
  str_match(text, "\\$\\d+\\s+Financial Reporting")) # Extract the first variable
Method 2:
The benefit of this approach is that it automatically creates a database of the client documents, with the file name as a row identifier and each PDF's full text in a single variable. The downside is that a whole PDF in one variable makes it harder to match and extract strings than having each page in its own element. I don't know how to write a loop that will extract variables for each client and place them in their respective columns.
library(readtext)
library(dplyr)
library(stringr)

DF <- readtext("directory pathway/*.pdf")
DF <- DF %>% mutate(FR =
  str_match(text, "\\$\\d+\\s+Financial Reporting"))
Here's a basic framework that I think solves your problem using your proposed Method 1.
library(pdftools)
library(stringr)

Files <- list.files(path = "pdfs/", pattern = ".pdf")
lf <- length(Files)
client_df <- data.frame(client = rep(NA, lf), fr = rep(NA, lf))

for (i in 1:lf) {
  # extract the text from the pdf
  f <- pdf_text(paste0("pdfs/", Files[i]))
  # remove commas from numbers
  f <- gsub(',', '', f)
  # extract variables
  client_name <- str_match(f[1], "Client\\s+\\d+")[[1]]
  fr <- as.numeric(str_match(f[1], "\\$(\\d+)\\s+Financial Reporting")[[2]])
  # add variables to your dataframe
  client_df$client[i] <- client_name
  client_df$fr[i] <- fr
}
I removed the commas from the text on the assumption that you will want to use any numeric variables you extract as numbers in some analysis. This removes all commas though, so if those are important elsewhere you'll have to rethink that step.
Also note that I put the sample PDFs into a directory called 'pdfs'.
I would imagine that with a little creative regex you can extract anything else that would be useful. Using this method makes it easy to scrape the data if the elements of interest will always be on the same pages across all documents. (Note the index on f in the str_match lines.) Hope this helps!

Quotation issues reading data into R

I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada, available here.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " on the end of the Central-Nova but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files. ie.
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into 2 variables. The data can't be read in since R wants to add a column halfway through constructing the data frame.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable in R to get this data in? I have >300 files that I need to load (each with ~1000 rows), so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|.csv", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
  T0 <- readLines(temp[x])
  T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
  final <- read.csv(text = T0, header = TRUE)
  final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
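If you later want everything in one table, one way (an untested sketch, assuming all 308 files share the same columns) is to bind the list together while carrying the district code along:

# Untested sketch: stack all the data frames into one, keeping the district code
allResults <- do.call(rbind, Map(function(df, code) {
  df$District <- code
  df
}, pollResults, names(pollResults)))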
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
  NewFilename <- paste("Corrected", temp[x], sep = "_")
  write.csv(pollResults[[x]], file = NewFilename,
            quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly for @AnandaMahto (see comments to the original question).
First, it helps to set some options globally because of the french accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma plus a quotation mark. This works because the first field is always 5 characters long. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
Penultimately, write over the original file.
fileConn<-file("pollresults_resultatsbureau13001.csv")
writeLines(temp,fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.
