R: extract dates and numbers from PDF

I'm struggling to extract the right information (some dates and numbers, to be specific) from several thousand NTSB PDF files. These PDFs don't need OCR, and each report is almost identical in length and layout.
I need to extract the date and time of the accident (first page) and some other information, such as the pilot's age or flight experience. What I tried works for several files but not for every file, since the code I am using is poorly written.
library(pdftools) # provides pdf_text()
# an example with a single file
# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(),"/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
rows <-scan(textConnection(pdf),
what="character", sep = "\n")
# Extract the date of the accident based on the 'Date & Time' occurrence.
date <-rows[grep(pattern = 'Date & Time', x = rows, ignore.case = T, value = F)]
date <- strsplit(date, " ")
date[[1]][9] #this method is not desirable since the date will not be always in that position
# Pilot age
age <- rows[grep(pattern = 'Age', x = rows, ignore.case = F, value = F)]
age <- strsplit(age, split = ' ')
age <- age[[1]][length(age[[1]])] # again, I'm using the exact position in that list
age <- readr::parse_number(age) #
The main issue is extracting the date and time of the accident. Is it possible to extract that exact piece of information without relying on a fixed position in a list, as I did here?

I think the best approach to achieve what you want is to use a regex.
In this case I use the stringr library. The main idea with a regex is to find
the desired string pattern, which in this case is the date 'July 29, 2014, 11:15'.
Keep in mind that you'll have to check the date format for each PDF file.
library(pdftools) # pdf_text()
library(stringr)  # str_match_all()
# Download the file and read it page by page
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(), "/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
## New code
# Regex pattern for date 'July 29, 2014, 11:15'
regex_pattern <- "[Tt]ime\\:(.*\\d{2}\\:\\d{2})" # [Tt] matches 'T' or 't'; '|' inside [] would be a literal pipe
# Getting date from page 1
grouped_matched <- str_match_all(pdf[1], regex_pattern)
# This returns a list of matrices: column 1 is the full match, column 2 the captured group
raw_date <- grouped_matched[[1]][1, 2] # First match (row 1), captured group (column 2)
# Clean date
date <- trimws(raw_date)
# The same with the pipe (needs library(dplyr) or library(magrittr))
date <- pdf[1] %>%
str_match_all(regex_pattern) %>%
.[[1]] %>% # First list element
.[2] %>% # Second group
trimws() # Remove extra white spaces
You can wrap this in a function and change the regex pattern for files with a different layout.
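A minimal sketch of such a function, using base R only so it runs without extra packages. The function name `extract_datetime`, the default pattern, and the sample line are assumptions made for illustration; the real first page of a report may be laid out differently, so the pattern would need checking against actual files.

```r
# Hypothetical helper: return the text captured after "Time:" or NA if absent.
extract_datetime <- function(page_text,
                             pattern = "[Tt]ime:\\s*([A-Za-z]+ \\d{1,2}, \\d{4}, \\d{1,2}:\\d{2})") {
  m <- regmatches(page_text, regexec(pattern, page_text, perl = TRUE))[[1]]
  # m[1] is the full match, m[2] the captured date-time string
  if (length(m) < 2) return(NA_character_)
  m[2]
}

# Made-up line that mimics the 'Date & Time' field on page 1
sample_page <- "Location: Somewhere, XX   Date & Time: July 29, 2014, 11:15 Local"
dt_string <- extract_datetime(sample_page)

# Converting to a date-time relies on English month names (locale dependent)
parsed <- as.POSIXct(dt_string, format = "%B %d, %Y, %H:%M", tz = "UTC")
```

Returning NA when nothing matches makes it easy to spot, across thousands of files, which reports need a different pattern.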


Gather multiple data sets from an URL/FTP site and merge them into a single dataframe for wrangling

Okay R community. I have a myriad of code pieces going on here from data.table, dplyr, base, etc. My goal is to download a group of files from NOAA into a single data frame for wrangling. Currently, my code is ugly, to say the least, and of course not working. I should have all of the 1950 data set, then right below it the 1951 data, etc.
#hard code website addressess
noaa.url <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
noaa.ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
#set fixed name of files for download
details.str <- "StormEvents_details-ftp_*"
fatalities.str <- "StormEvents_fatalities-ftp_"
locations.str <- "StormEvents_locations-ftp_"
#test function to download file using manual operation
index.storm <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz"
storm.1950 <- fread(index.storm )
storm.1951 <- fread("https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1951_c20210803.csv.gz")
#test append
storm.append <- rbind(storm.1950, storm.1951)
#create a list of colnames
detail.colnames <- colnames(storm.1950)
#-------------------------------Begin Website Scrap-------------------------------------
#create a directory from the NOAA website. Must use the FTP directory. Will get 404 error if using the http site
dir_list <-
sep = "",
strip.white = TRUE)
#subset the data we want
dir_list <- dir_list %>%
#fix column names
colnames(dir_list) <- c("FileName", "FileSize")
#create new table for loop through list with complete website directory. This will get just the storm details we want
details.dir <- dir_list %>%
select(1) %>%
filter(str_detect(FileName,"details")) %>%
mutate(FileName = paste0(noaa.url,FileName))
#how many rows to get. could use this in counter for loop if needed
total.count <- count(details.dir)
#subset just first 5 rows
details.dirsub <- head(details.dir,5)
#very basic loop and apply a list. Note: files get larger as years go on.
for (x in details.dirsub) {
something = details.dirsub$FileName
storm.append = lapply(something, fread) #lapply is creating a join not an append
#storm.append = rbindlist(fread(something)) #does not work
#expand the list into a dataframe for wrangling
storm.full <- as.data.frame(do.call(cbind, storm.append))
# try to set colnames if use sapply instead of lapply
#setnames(storm.full, detail.colnames)
#filter by GEORGIA -- can not filter because lapply is creating joins instead of append. tried rbindlist() but errors.
storm.georgia <- storm.full %>%
filter(STATE == "GEORGIA")
If I understand correctly, the OP wants
to read all data files
whose file names include the string "details"
from a certain FTP directory,
combine them into one large data.frame
but keep only rows which are related to GEORGIA.
This is what I would do using my favourite tools:
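library(data.table) # fread(), %like%, rbindlist()
library(magrittr)   # %>%
library(RCurl)      # getURL()
noaa_ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
state_to_select <- "GEORGIA"
storm.georgia <-
getURL(noaa_ftp, dirlistonly = TRUE) %>%
fread(header = FALSE) %>%
.[V1 %like% "details.*csv.gz", V1] %>%
lapply(function(x) {
url <- file.path(noaa_ftp, x) # use the variable defined above
cat(url, "\n")
fread(url)[STATE == state_to_select]
}) %>%
rbindlist() # stack the per-file results into one data.table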
This filters each file directly after reading in order to reduce memory allocation. The result consists of one data.table with nearly 50k rows and 51 columns:
Total: 28MB
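The filter-while-reading idea can be tried offline with a small sketch. It uses base R (`read.csv`, `rbind`) instead of `fread` and `rbindlist` so that it runs without extra packages, and the two tiny files, their columns, and the helper name `read_state` are made up for illustration.

```r
# Write two small made-up yearly files to a temporary directory
tmp_dir <- file.path(tempdir(), "storm_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(YEAR = 1950, STATE = c("GEORGIA", "UTAH"), EVENT_ID = 1:2),
          file.path(tmp_dir, "details_d1950.csv"), row.names = FALSE)
write.csv(data.frame(YEAR = 1951, STATE = c("GEORGIA", "GEORGIA", "OHIO"), EVENT_ID = 3:5),
          file.path(tmp_dir, "details_d1951.csv"), row.names = FALSE)

# Hypothetical helper: read one file and keep only the rows for one state
read_state <- function(path, state) {
  one_year <- read.csv(path, stringsAsFactors = FALSE)
  one_year[one_year$STATE == state, , drop = FALSE]
}

# Filter each file right after reading, then append the pieces row-wise
paths <- list.files(tmp_dir, pattern = "details.*\\.csv$", full.names = TRUE)
georgia <- do.call(rbind, lapply(paths, read_state, state = "GEORGIA"))
```

Because each file is reduced before the pieces are combined, only the rows of interest are held in memory at the end, and the rows are appended rather than joined side by side.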
Please note that there are inconsistencies in the data files as can be seen from the last lines of output
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz'
Content type 'unknown' length 385707 bytes (376 KB)
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz'
Content type 'unknown' length 255646 bytes (249 KB)
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz'
Content type 'unknown' length 298130 bytes (291 KB)
Warning messages:
1: In fread(url) :
Stopped early on line 33957. Expected 51 fields but found 65. Consider fill=TRUE and comment.char=. First discarded non-empty line:
Storm","Z",1,"NW CACHE","SLC","07-DEC-97 04:00:00","MST","08-DEC-97
18:00:00","20","0","0","0","200K",,,,,,,,,,,,,,,,,,,,,,,,"A strong
trough moved through northern Utah on the 7th. The cold moist airmass
remained unstable into the 8th in a strong northwest flow.
Lake-enhanced snowbands developed along the Wasatch Front on the 8th
as well. Criteria snow fell across much of the state and locally
strong winds occurred >>
2: In fread(url) :
Found and resolved improper quoting out-of-sample. First healed line 23934:
15:43:00","0","0","0","0",,,,,,,,,,,,,,,,,"S OF "7 MILE" BRIDGE",,,"S
SPOTTER...LOST IN RAIN SHAFT.",,"PDC">>. If the fields are not quoted
(e.g. field separator does not appear within any field), try quote=""
to avoid this warning.

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a rather awkward structure. The files are generated by a software program whose output looks like it was intended to be printed and viewed rather than exported and used. Each file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may differ between files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\nCompound 2: Two\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\nCompound 2: Two\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = ".txt",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
result[[n]] <- map(data.list, n) # collect this compound's table from every file
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency as far as time (I can import hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
library(tidyverse) # dplyr, tidyr, readr and stringr are all used below
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
Above steps result in a dataframe with a column 'raw' containing the cleaned-up data as string suited for conversion into tabular data (tab-delimited, linefeeds).
From there on, we can either keep the individual tables inside the parent table as a so-called list column (Variant A) or split column 'raw' and map over the pieces (Variant B, credits to @Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
into = unlist(
str_split(x$raw[1], pattern = "\t"))
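For comparison, the same idea (label every line with the most recent "Compound" header, keep the tab-delimited lines, and read each group as a table) can be sketched in base R. This is not either variant above but a simplified illustration; the report text, its column names, and the function name `split_compounds` are made up.

```r
# Made-up report lines; the columns and values are illustrative only
report <- c(
  "Quantify Compound Summary Report",
  "",
  "Compound 1: One",
  "",
  "Index\tName\tArea",
  "1\tsampleA\t10.5",
  "2\tsampleB\t11.2",
  "",
  "Compound 2: Two",
  "",
  "Index\tName\tArea",
  "1\tsampleA\t3.1",
  "2\tsampleB\t2.9"
)

split_compounds <- function(lines) {
  is_header <- grepl("^Compound \\d+:", lines)
  # Running count of headers: 0 before the first one, then 1, 2, ...
  block_id <- cumsum(is_header)
  # Compound names with the "Compound X:" prefix removed
  labels <- trimws(sub("^Compound \\d+:", "", lines[is_header]))
  # Keep only tab-delimited lines that belong to some compound
  keep <- block_id > 0 & grepl("\t", lines)
  blocks <- split(lines[keep], labels[block_id[keep]])
  lapply(blocks, function(b) {
    read.table(text = b, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  })
}

tables <- split_compounds(report)
```

Applying `split_compounds` to each file with `lapply` and then binding the tables that share a name would give one data frame per compound across files.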

Changing Dates in R from webscraper but not able to convert

I am trying to complete a problem that pulls from two data sets that need to be combined into one data set. To get to this point, I need to rbind both data sets by the year-month information. Unfortunately, the first data set needs to be tallied by year-month info, and I can't seem to figure out how to change the date so I can have month-year info rather than month-day-year info.
This is data on avalanches, and I need to write code to total the number of avalanches each month for the snow season, defined as Dec-Mar. How do I do that?
I keep trying to convert the format of the date to month-year but after I change it with
as.Date(avalancheslc$Date, format="%y-%m")
all the values for Date turn to NA's....help!
# write the webscraper
for(page in all.pages){
this.url<-paste(avalanche.url, page, sep=" ")
thispage.avalanche<-readHTMLTable(this.webpage, which=1, header=T)
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
# How can I tally the number of avalanches?
The final output of my dataset should be something like:
date avalanches
2000-1 18
2000-2 4
2000-3 10
2000-12 12
2001-1 52
This should work (I tried it on only 1 page, not all 203). Note the use of the option stringsAsFactors = F in the readHTMLTable function, and the need to add names because 1 column does not automatically get one.
library(XML)       # htmlParse(), readHTMLTable()
library(RCurl)     # getURL()
library(dplyr)     # mutate()
library(lubridate) # year(), month()
avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for(page in all.pages){
this.url <- paste0(avalanche.url, page) # no separator, otherwise the URL contains a space
this.webpage <- htmlParse(getURL(this.url))
thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = T,
stringsAsFactors = F)
names(thispage.avalanche) <- c('Date','Region','Location','Observer')
avalanche <- rbind(avalanche,thispage.avalanche)
}
avalancheslc <- subset(avalanche, Region == "Salt Lake")
avalancheslc <- mutate(avalancheslc, Date = as.Date(Date, format = "%m/%d/%Y"),
monthyear = paste(year(Date), month(Date), sep = "-"))
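The tally itself could then look roughly like the sketch below, written in base R with made-up observations. A likely reason for the NAs in the question is that `as.Date()` cannot build a date from a year and month alone, and its `format` argument must describe the incoming text (here month/day/year). The year-month label is therefore created with `format()` after parsing the full date. Note that it yields zero-padded months ("2000-01"), which also sort correctly.

```r
# Made-up observations in the m/d/Y text format
obs <- data.frame(
  Date = c("1/5/2000", "1/20/2000", "2/3/2000", "7/4/2000", "12/25/2000"),
  Region = "Salt Lake",
  stringsAsFactors = FALSE
)

# Parse the full date first, then derive month and year-month from it
obs$Date <- as.Date(obs$Date, format = "%m/%d/%Y")
obs$month <- as.integer(format(obs$Date, "%m"))

# Keep the snow season (Dec-Mar) and count observations per year-month
season <- obs[obs$month %in% c(12, 1, 2, 3), ]
season$yearmonth <- format(season$Date, "%Y-%m")
tally <- as.data.frame(table(yearmonth = season$yearmonth), stringsAsFactors = FALSE)
names(tally) <- c("date", "avalanches")
```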

Convert code into a function in R

I have a series of steps that I would like to convert into a function, so I can apply it to data frames simply by calling it. Below is the code with some comments:
# Create Data frame
Off_let_data <- data.frame(page_id = c(3,3,3,3,3), element_id = c(19, 22, 26, 31, 31),
text = c("The Protected Percentage of your property value thats has been chosen is 0%",
"The Arrangement Fee payable at complettion: £50.00",
"The Fixed Interest Rate that is applied for the life of the period is: 5.40%",
"The Benchmark rate that will be used to calculate any early repayment 2.08%",
"The property value used in this scenario is 275,000.00"))
# read in the first element of a list of pdf file from a folder
files <- list.files(pattern = "pdf$")[1]
# extract the account number from the first pdf file
acc_num <- str_extract(files, "^\\d+")
# The RegEx's used to extract the relevant information
protec_per_reg <- "Protected\\sP\\w+\\sof"
Arr_Fee_reg <- "^The\\sArrangement\\sF\\w+"
Fix_inter_reg <- "Fixed\\sI\\w+\\sR\\w+"
Bench_rate_reg <- "Benchmark\\sR\\w+\\sthat"
# create a df that only includes the rows which match the above RegEx
Off_let <- Off_let_data %>% filter(page_id == 3, str_detect(Off_let_data$text, protec_per_reg)|
str_detect(Off_let_data$text, Arr_Fee_reg) | str_detect(Off_let_data$text, Fix_inter_reg) |
str_detect(Off_let_data$text, Bench_rate_reg))
# Now only extract the numbers from the above DF
off_let_num <- str_extract(Off_let$text, "\\d+\\.?\\d+")
# The first element is always a NA value - based on the structure of these PDF files
# replace the first element of this character vector with the below
off_let_num[is.na(off_let_num)] <- str_extract(Off_let$text, "\\d+%")[[1]]
Can someone please help me turn this into a function? Thanks.
Something like this?
What should the inputs and outputs of the function be? For now, the function takes a data.frame as its only argument, but you could extend it so that you can pass different regular expressions or define the page_id, for example.
# Create Data frame
Off_let_data <- data.frame(page_id = c(3,3,3,3,3), element_id = c(19, 22, 26, 31, 31),
text = c("The Protected Percentage of your property value thats has been chosen is 0%",
"The Arrangement Fee payable at complettion: £50.00",
"The Fixed Interest Rate that is applied for the life of the period is: 5.40%",
"The Benchmark rate that will be used to calculate any early repayment 2.08%",
"The property value used in this scenario is 275,000.00"))
dummyFunc <- function(df) {
# read in the first element of a list of pdf file from a folder
files <- list.files(pattern = "pdf$")[1]
# extract the account number from the first pdf file
acc_num <- str_extract(files, "^\\d+")
# The RegEx's used to extract the relevant information
protec_per_reg <- "Protected\\sP\\w+\\sof"
Arr_Fee_reg <- "^The\\sArrangement\\sF\\w+"
Fix_inter_reg <- "Fixed\\sI\\w+\\sR\\w+"
Bench_rate_reg <- "Benchmark\\sR\\w+\\sthat"
# create a df that only includes the rows which match the above RegEx
Off_let <- df %>% filter(page_id == 3, str_detect(df$text, protec_per_reg)|
str_detect(df$text, Arr_Fee_reg) | str_detect(df$text, Fix_inter_reg) |
str_detect(df$text, Bench_rate_reg))
# Now only extract the numbers from the above DF
off_let_num <- str_extract(Off_let$text, "\\d+\\.?\\d+")
# The first element is always a NA value - based on the structure of these PDF files
# replace the first element of this character vector with the below
off_let_num[is.na(off_let_num)] <- str_extract(Off_let$text, "\\d+%")[[1]]
# return the extracted values
return(off_let_num)
}
And for a more extended version of the function:
# The RegEx's used to extract the relevant information
protec_per_reg <- "Protected\\sP\\w+\\sof"
Arr_Fee_reg <- "^The\\sArrangement\\sF\\w+"
Fix_inter_reg <- "Fixed\\sI\\w+\\sR\\w+"
Bench_rate_reg <- "Benchmark\\sR\\w+\\sthat"
regexprlist <- list(protec_per_reg, Arr_Fee_reg,
Fix_inter_reg, Bench_rate_reg)
dummyFuncExt <- function(df, regexp, page_id) {
# read in the first element of a list of pdf file from a folder
files <- list.files(pattern = "pdf$")[1]
# extract the account number from the first pdf file
acc_num <- str_extract(files, "^\\d+")
# create a df that only includes the rows which match any RegEx passed in 'regexp'
# (!!page_id refers to the function argument; a bare page_id would be the column itself)
Off_let <- df %>% filter(page_id == !!page_id,
str_detect(text, paste(unlist(regexp), collapse = "|")))
# Now only extract the numbers from the above DF
off_let_num <- str_extract(Off_let$text, "\\d+\\.?\\d+")
# The first element is always a NA value - based on the structure of these PDF files
# replace the first element of this character vector with the below
off_let_num[is.na(off_let_num)] <- str_extract(Off_let$text, "\\d+%")[[1]]
# return the extracted values
return(off_let_num)
}
dummyFuncExt(df = Off_let_data, regexp = regexprlist, page_id = 3)
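As a point of comparison, here is a small base R sketch of the same idea: a function that takes the data frame, the patterns, and the page as arguments and returns the numbers. The function name `extract_numbers`, the simplified patterns, and the sample text are assumptions for illustration, and the number pattern is a different one from the answer above (it also accepts single digits and thousands separators).

```r
# Made-up rows similar in shape to the offer-letter data
letter <- data.frame(
  page_id = c(3, 3, 3, 3, 3),
  text = c("The Protected Percentage that has been chosen is 0%",
           "The Arrangement Fee payable at completion: 50.00",
           "The Fixed Interest Rate that is applied is: 5.40%",
           "The Benchmark rate that will be used is 2.08%",
           "The property value used in this scenario is 275,000.00"),
  stringsAsFactors = FALSE
)

# Hypothetical function: filter rows by page and pattern, return first number per row
extract_numbers <- function(df, patterns, page = 3) {
  combined <- paste(patterns, collapse = "|")
  hits <- df[df$page_id == page & grepl(combined, df$text, perl = TRUE), ]
  # First number in each line: digits, optional thousands commas, optional decimals
  m <- regmatches(hits$text, regexpr("\\d[\\d,]*(\\.\\d+)?", hits$text, perl = TRUE))
  as.numeric(gsub(",", "", m))
}

patterns <- c("Protected Percentage", "Arrangement Fee",
              "Fixed Interest Rate", "Benchmark rate")
values <- extract_numbers(letter, patterns, page = 3)
```

Passing the patterns and the page as arguments keeps the function free of global variables, so it can be reused on other data frames.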

Regex to filter, then determine latest date

Say I have a directory with four files:
This is how the names are formatted. I cannot change them, and the format will remain the same except that the dates will change. All of the names begin with someText. Next, there is either a four-letter code (abcd) or a three-letter code (xyz). If the file name has a four-letter code, it will always have a three-letter code after it. Finally, there is a date value.
I have two tasks. First, I need to filter out the files that have the "abcd" component. This will always be a four-character code that appears after the someText. in the name. Is there a way to write a regular expression to remove these values?
That leaves two files:
I need only the file with the later date. Is there a second regex I could do to extract the dates, find the latest, and then keep only that date? I'm doing this to get the file set down to four:
myDir <- "\\\\myDir\\folder\\"
files <- list.files(path = myDir, pattern = "\\.csv$")
Here's a vector with the file names if someone wants to try it out:
files <- c("someText.abcd.xyz.10Sep16.csv", "someText.xyz.10Sep16.csv", "someText.abcd.xyz.23Oct16.csv", "someText.xyz.23Oct16.csv")
Here's my attempt at a simple base R answer
# regex subset
files <- files[!grepl("^.*?\\.[[:alpha:]]{4}\\.", files)]
# get date
dates <- unlist(lapply(strsplit(files, "\\."), "[[", 3))
files[which.max(as.Date(dates, format = "%d%b%y"))]
# [1] "someText.xyz.23Oct16.csv"
I think this should be robust enough to work reliably. I used dplyr to pass the results through and manipulate them, and lubridate for a convenient date extraction (dmy). Almost forgot: you need to load magrittr to get the %$% pipe.
I split the file names by the "."s, then slide over the results if they are missing the four-letter code section. Bind them into a data.frame for easy filtering etc. Here, filter for those missing the four-letter section, then select the one with the latest date.
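strsplit(files, "\\.") %>%
setNames(files) %>%
lapply(function(x){
if(length(x) == 4){
# no four-letter code: shift the parts right and insert a placeholder
x[3:5] <- x[2:4]
x[2] <- "noCode"
}
rbind(x) %>%
as.data.frame(stringsAsFactors = FALSE)
}) %>%
bind_rows(.id = "fileName") %>%
mutate(date = dmy(V4)) %>%
filter(V2 == "noCode") %$%
fileName[which.max(date)]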
returns: "someText.xyz.23Oct16.csv"
I am sure that this can be made more compact, but here is a base R answer:
# file names
file_names =c(
# the pattern to be tested
reg_file_names = regexec(
pattern = "^someText\\.[a-z]{4}\\.[a-z]{3}\\.(.*).csv$",
# parse out the matched dates, and look for the maximum
x = file_names, m = reg_file_names
function(match) {
length(match) == 0,
format = "%d%b%y"
The regular expression that you need is fairly straightforward, and the rest of the code is just to handle the cases where there is no match, and to format the dates so that they can be compared.
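A compact base R sketch of those two parts (handling non-matching names and making the dates comparable) is shown below. It is an illustration rather than the code above: the function name `latest_without_code` is made up, the pattern matches only names without the four-letter code, and the two-digit year is assumed to mean 20xx. Months are looked up in the built-in `month.abb` so the result does not depend on the locale.

```r
files <- c("someText.abcd.xyz.10Sep16.csv", "someText.xyz.10Sep16.csv",
           "someText.abcd.xyz.23Oct16.csv", "someText.xyz.23Oct16.csv")

latest_without_code <- function(file_names) {
  # Only a single three-letter code is allowed; capture day, month, year
  pattern <- "^someText\\.[a-z]{3}\\.([0-9]{2})([A-Za-z]{3})([0-9]{2})\\.csv$"
  parts <- regmatches(file_names, regexec(pattern, file_names))
  # Build an ISO date string per file, or NA when the name did not match
  iso <- vapply(parts, function(m) {
    if (length(m) != 4) return(NA_character_)
    sprintf("20%s-%02d-%s", m[4], match(m[3], month.abb), m[2])
  }, character(1))
  dates <- as.Date(iso, format = "%Y-%m-%d")
  if (all(is.na(dates))) return(character(0))
  # which.max() skips the NA entries of non-matching names
  file_names[which.max(dates)]
}

latest <- latest_without_code(files)
```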
