R: Read text files with blanks and unequal number of columns

I am trying to read many text files into R using read.table. Most of the time we have clean text files which have defined columns.
The data that I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
You can see that the blanks and the lengths of the text files vary by report:
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt
My objective is to read many of these text files and combine them into a dataset.
If I can read one of them, then compiling them should not be an issue. However, I am running into several issues because of the format of the text files:
1) The number of FIRMS varies from report to report. For example, sometimes there will be 3 rows of data to import (i.e. 3 firms that did business on that date) and sometimes there may be 10.
2) Blank fields are not being handled as missing values. For example, under the FIRM section there should be a column for Deliveries (DEL) and a column for Receipts (REC). When read in, the data in this section should look like:
df <- data.frame("FIRM_#" = c(407, 685, 800, 905),
                 "FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
                 "DEL" = c(1, 1, 15, 1),
                 "REC" = c(NA, 18, NA, NA))
However, when I read this in, the formatting is all messed up and it does not put NA for the blank values.
3) The above issues also apply to the "YARDS" and "FUTURE DELIVERIES SCHEDULED" sections of the text file.
I have tried to read in sections of the text file and then format them accordingly, but since the number of firms changes from day to day, the code does not generalize.
Any help would greatly be appreciated.

Here is an answer which starts from scratch, using rvest for downloading the data, and includes lots of formatting. The general idea is to identify fixed widths that may be used to separate columns - I used a little help from SO for this purpose (link).
You could then use read.fwf() in combination with cat() and tempfile(). In my first attempt this did not work, due to some formatting issues, so I added some additional lines to get the final table format.
Maybe there are some more elegant options and shortcuts I have overlooked, but at least my answer should get you started. Of course, you will have to adapt the selection of lines and the identification of widths for splitting tables, depending on what parts of the data you need. Once this is settled, you may loop through all the reports to gather the data. I hope this helps...
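For reference, the read.fwf()/tempfile() route mentioned above would look roughly like this. This is only a sketch: raw_lines and the widths vector are assumptions you would derive from the actual report lines and header positions.
# raw_lines is assumed to be a character vector holding the report lines
tmp <- tempfile()
cat(raw_lines, sep = "\n", file = tmp)
# widths are assumed column widths read off the fixed-width header
read.fwf(tmp, widths = c(7, 25, 8, 8), stringsAsFactors = FALSE)
Since this did not survive the formatting quirks of these reports, the full approach below builds the table manually: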
library(rvest)
library(dplyr)
page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")
table <- page %>%
  html_text("pre") %>%
  #reformat by splitting on line breaks
  { unlist(strsplit(., "\n")) } %>%
  #select range based on strings in specific lines
  "["(., (grep("FIRM #", .):(grep(" DELIVERIES SCHEDULED", .) - 1))) %>%
  #exclude empty rows
  "["(., !grepl("^\\s+$", .)) %>%
  #fix width of table to the right
  { substring(., 1, nchar(gsub("\\s+$", "", .[1]))) } %>%
  #strip white space on the left
  { gsub("^\\s+", "", .) }
headline <- unlist(strsplit(table[1], "\\s{2,}"))
get_split_position <- function(substring, string) {
  nchar(string) - nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string, perl = TRUE))
}
#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {
  get_split_position(x, table[1])
})
#exclude headline from split
table <- lapply(table[-1], function(x) {
  substring(x, c(1, split_positions + 1), c(split_positions, nchar(x)))
})
table <- do.call(rbind, table)
colnames(table) <- headline
#strip whitespace
table <- gsub("\\s+", "", table)
table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2, as.numeric)
table
# FIRM # FIRM NAME DEL REC
# 1 407 STRAITSFINLLC 1 NA
# 2 685 R.J.O'BRIENASSOC 1 18
# 3 800 ROSENTHALCOLLINSLL 15 NA
# 4 905 ADMINVESTORSERVICE 1 NA
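Once this works for a single report, gathering many reports is just a loop over the URLs. A sketch (read_report is a hypothetical wrapper around the steps above, and the date stamps are assumptions based on the file-name pattern):
# read_report is hypothetical: the pipeline above wrapped in a function
# that takes a report URL and returns the finished data frame
report_dates <- c("102317", "100917")   # assumed MMDDYY stamps from the file names
urls <- sprintf("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/%s_livecattle.txt",
                report_dates)
all_reports <- Map(function(u, d) {
  df <- read_report(u)
  df$report_date <- d   # tag each table with its report date
  df
}, urls, report_dates)
combined <- do.call(rbind, all_reports)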


String match error "invalid regular expression, reason 'Out of memory'"

I have a table that is shaped like this called df (the actual table is 16,263 rows):
title            date         brand
big farm house   2022-01-01   A
ranch modern     2022-01-01   A
town house       2022-01-01   C
Then I have a table like this called match_list (the actual list is 94,000 rows):
words_for_match
farm
town
clown
beach
city
pink
And I'm trying to filter the first table to just be rows where the title contains a word in the words_for_match list. So I do this:
match_list <- match_list$words_for_match
match_list <- paste(match_list, collapse = "|")
match_list <- sprintf("\\b(%s)\\b", match_list)
df %>%
filter(grepl(match_list, title))
But then I get the following error:
Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'
If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!
You could filter the titles sequentially: say 10 titles match 'farm', then you do not need to evaluate those titles against any of the other words.
Here is a simple implementation:
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
titles.to.keep <- c()
for (w in words_for_match) {
  w <- sprintf("\\b(%s)\\b", w)
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep
If you have a prior on the frequency of the words in match_list, it is better to start with the most frequent ones.
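For instance, a one-line sketch, assuming a hypothetical numeric vector word_freq holding those prior frequencies, aligned with words_for_match:
# word_freq is hypothetical: prior frequencies, one per word in words_for_match
words_for_match <- words_for_match[order(word_freq, decreasing = TRUE)]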
UPDATE
You can also mix in your previous strategy to make it faster:
gr.size <- 20
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.words <- sapply(gr.words, function(words) {
  words <- paste(words, collapse = "|")
  sprintf("\\b(%s)\\b", words)
})
and then iterate on gr.words and not on words_for_match in the first code chunk.
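Put together, the grouped version of the loop could look like this (a sketch under the same assumptions as above):
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
gr.size <- 20
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.words <- sapply(gr.words, function(words) {
  sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
})
titles.to.keep <- c()
for (w in gr.words) {           # iterate over group patterns, not single words
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]   # shrink the haystack after each group
}
titles.to.keep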

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a kind of awkward structure. The files are generated by a software program that looks like it was intended to be printed and viewed rather than exported and used. The file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may change between files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
                         pattern = ".txt",
                         full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
# Scan each file and find the different compounds within it
# (this can be applied to any Waters output)
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x))
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab-delimited parts of the export file, then generate
## a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x, y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i) {
  x <- z[[i]]
  colnames(x) <- x[2, ]
  x[c(-1, -2), ]
})
data.list <- Map(function(x, y) {setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
  result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency in terms of time (I can be importing hundreds or thousands of data files, each with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
  data.frame(file_name = c('test1.txt', 'test2.txt')) %>%
  rowwise %>%
  ## read file content into a raw string:
  mutate(raw = read_file(file_name)) %>%
  ## separate raw file contents into rows
  ## using newline and carriage return as row delimiters:
  separate_rows(raw, sep = '[\\n\\r]') %>%
  ## provide a compound column for later grouping
  ## by extracting the 'Compound' string from column raw
  ## or setting the compound column to NA otherwise:
  mutate(compound = ifelse(grepl('^Compound', raw),
                           gsub('.*(Compound .*):.*', '\\1', raw),
                           NA)) %>%
  ## remove rows with empty raw text:
  filter(raw != '') %>%
  ## fill missing compound values (NAs) with the last non-NA compound string:
  fill(compound, .direction = 'down') %>%
  ## keep only rows with a tab-separated raw string,
  ## indicating tabular data:
  filter(grepl('\\t', raw)) %>%
  ## insert a column header 'Index' because the
  ## original format has four data columns but only three header cols:
  mutate(raw = gsub(' *\\tName', 'Index\tName', raw))
The steps above result in a dataframe with a column 'raw' containing the cleaned-up data as strings suited for conversion into tabular data (tab-delimited, one line per row).
From there on, we can either keep the future single tables inside the parent table as a so-called list column (Variant A), or split column 'raw' and map over the pieces (Variant B, credits to #Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
  group_by(compound) %>%
  ## the nifty piece: you can store dataframes inside a dataframe:
  mutate(
    tables = list(read.table(text = raw, header = TRUE, sep = '\t'))
  )
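If you later want the compound tables back as ordinary rows, one option (a sketch, assuming the tidyverse is loaded as above) is to collapse to one stored table per compound with summarise() and then expand the list column with tidyr::unnest():
intermediate_result %>%
  group_by(compound) %>%
  summarise(tables = list(read.table(text = raw, header = TRUE, sep = '\t')),
            .groups = 'drop') %>%
  ## expand each stored dataframe back into regular rows:
  tidyr::unnest(tables)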
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
  split(f = as.factor(.$compound)) %>%
  lapply(function(x) x %>%
           separate(raw,
                    into = unlist(str_split(x$raw[1], pattern = "\t"))
           )
  )
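One caveat with Variant B: separate() splits the header line like any other row, so the first row of each resulting frame just repeats the column names. Assuming the result were assigned to a hypothetical tables_b, a quick cleanup could be:
# tables_b is hypothetical: the named list of frames produced by Variant B
tables_b <- lapply(tables_b, function(x) x[-1, ])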

Extracting strings from a PDF with R

I have this PDF file from European parliament, that you can download here.
I have downloaded it and put it in R.
It contains lists of names of Members of European Parliament (MEP) after a session of vote.
I want to extract just bits of these lists. Specifically, I want to extract and put in a table the names situated between "AVGIVNA RÖSTER" and "0" (the text highlighted in this screenshot).
Similar series of names repeat in the PDF. It refers to specific votes. I want them all in a table. MEP's names change but the structure remains, they are always situated between the bits "AVGIVNA RÖSTER" and "0".
I thought of using a startsWith() function and a for loop, but I struggle with the writing.
Here is what I did so far:
library(pdftools)
library(tidyverse)
votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()
You could try something like this
votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()
a <- which(grepl("AVGIVNA RÖSTER", votetext)) #beginning of string
b <- which(grepl("^\\s*0\\s*$", votetext)) #end of string
sapply(a, function(x){paste(votetext[x:(min(b[b > x]))], collapse = ". ")})
Note that in the definition of b I use \\s* to find white space in a string.
In general you could first remove trailing and leading white space, see this question.
In your case you could do:
votetext2 <- pdftools::pdf_text("data.pdf") %>%
  readr::read_lines() %>%
  str_remove("^\\s*") %>%          #remove white space at the beginning
  str_remove("\\s*$") %>%          #remove white space at the end
  str_replace_all("\\s+", " ")     #replace multiple white-spaces with a single white-space
a2 <- which(votetext2 == "AVGIVNA RÖSTER")
b2 <- which(votetext2 == "0")
result <- sapply(a2, function(x){paste(votetext2[x:(min(b2[b2 > x]))], collapse = ". ")})
result then looks like this:
`"AVGIVNA RÖSTER. Martin Hojsík, Naomi Long, Margarida Marques, Pedro Marques, Manu Pineda, Ramona Strugariu, Marie Toussaint,. + Dragoş Tudorache, Marie-Pierre Vedrenne. -. Agnès Evren. 0"

1 column contains headers and data, how to make it into multiple

I'm trying to take a Word doc that has data not in a table, and make it into a table. There are hundreds of identical word docs and I would like to write a script that could take the data and make it into a table.
My first idea is to convert it all into one column, and then I can somehow pull the column headers out and organize the data underneath it.
Word file: https://github.com/cstaulbee/Operation-WordDoc/blob/master/Sanitized_sampe.docx
library(docxtractr)
filenames <- list.files(".", pattern = "\\.docx$", full.names = TRUE)
docx.files <- lapply(filenames, read_docx)
docx.tables <- lapply(seq_along(filenames), function(idx) {
  file <- docx.files[[idx]]
  filename <- filenames[idx]
  # start from a clean extraction directory
  if (dir.exists("Contents")) unlink("Contents", recursive = TRUE, force = TRUE)
  dir.create("Contents")
  # grab any actual table in the document
  tbl <- docx_extract_tbl(file, 1)
  # a .docx file is a zip archive: copy, unpack, and read the raw XML
  file.copy(filename, "Contents\\word.zip", overwrite = TRUE)
  unzip("Contents\\word.zip", exdir = "Contents")
  x <- xml2::read_xml("Contents\\word\\document.xml")
  # text runs that live outside of tables
  nodes <- xml2::xml_find_all(x, "w:body/w:p/w:r/w:t")
  data.date <- paste(xml2::xml_text(nodes, trim = TRUE), collapse = "::")
  list(date = data.date, table = tbl)
})
# collapse repeated separators and split the non-table text into fields
word_df <- lapply(docx.tables, function(x) strsplit(gsub("[:]{1,}", ":", x$date), ":")[[1]])
This converts the word doc to a zip file, then reads it as an XML. It pulls out the info that isn't in tables, and then puts it all into a list that can then be manipulated.
I wanted to know if anyone knows of a way to take this column and make it into a few columns based on the data. For example, Date, Time in, Pilot, and Assistants will appear 3 or so times in the column, but I want each of those to be their own column, with the data between them and the next column header to be the data that makes up the rows.
So basically it looks like this:
df_col
Date
2/22/16
Pilot
John, Mark
Assistants
Alfred, James
But I want it to look like this
Date_col   Pilot_col    Assistants_col
2/22/16    John, Mark   Alfred, James
Unless someone has an idea of a better way of doing this.
You can use officer to scrape your docx document:
library(officer)
doc <- read_docx(path = "Sanitized_sampe.docx")
docx_summary(doc)
The last step would be to apply regular expressions to the text column where content_type == "paragraph".
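As an illustration of that last step, here is a rough sketch. The header set c("Date", "Pilot", "Assistants") is an assumption based on the fields described above, and the layout of the real documents may differ:
content <- docx_summary(doc)
# keep only the non-empty, non-table paragraphs
paras <- trimws(content$text[content$content_type == "paragraph"])
paras <- paras[paras != ""]
# hypothetical header set; adjust to the labels actually used in the docs
hdrs <- c("Date", "Pilot", "Assistants")
pos <- which(paras %in% hdrs)
# everything between one header and the next is that header's value
# (assumes at least one value line between consecutive headers)
vals <- mapply(function(from, to) paste(paras[(from + 1):(to - 1)], collapse = " "),
               pos, c(pos[-1], length(paras) + 1))
as.data.frame(as.list(setNames(vals, paras[pos])))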

Transfer text file to table in R with some conditions on it

I have one text file like this
DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.
The text goes on for approximately 300 lines, and sometimes the Address data spills over onto a second line. I want to convert this text data to CSV format, with data like this:
DOB, Name, Address
13-03-2003,ABC,xyz.
or at least get it into one data frame. I have tried many things; when I use read.table("file.txt", sep = "\n") it puts everything into one column. I also tried first making the headers by using
header <- read.table("file.txt", sep = "\n")
and then data <- read.table("file.txt", skip = 3, sep = "\n") and combining both, but it is not working out as required: my header vector has 3 rows while my data vector has roughly 300. Any help will be really appreciated :)
You could try
entries <- unlist(strsplit(text, "\\n")) #separate entries by line breaks
entries <- entries[nchar(entries) > 0] #remove empty lines
as.data.frame(matrix(entries, ncol=3, byrow=TRUE)) #assemble dataframe
# V1 V2 V3
#1 DOB Name Address
#2 13-03-2003 ABC xyz.
#3 12-08-2004 dfs 1 infinite loop.
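Note that this assumes exactly three lines per record. As a small follow-up, you could promote the first row to column headers (a sketch):
df <- as.data.frame(matrix(entries, ncol = 3, byrow = TRUE), stringsAsFactors = FALSE)
names(df) <- unlist(df[1, ])   # first record holds the header labels
df <- df[-1, ]
# write.csv(df, "file.csv", row.names = FALSE)   # "file.csv" is a placeholder name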
data
text <-'DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.'
Two assumptions were made. First, there will not be any blank names or dates of birth; by "blank" I do not mean "NA", "", or any other marker that the value is missing. Second, names and DOBs will only occupy one line each.
s1 <- gsub("^\n|\n$", "", strsplit(x, "\n\n+")[[1]])
stars <- gsub("\n", ", ", sub("\n", "*", sub("\n", "*", s1)))
mat <- t(as.data.frame(strsplit(stars, "\\*")))
dimnames(mat) <- c(NULL, NULL)
write.csv(mat,"filename.csv")
We start by splitting the text by the blank lines and eliminating any leading or trailing newline tokens. Then we replace the first and second "\n" symbols with stars. Next we split on those new star markers that we created to always have 3 elements for each row. We create a matrix with the values and transpose it for display. Then write the data to csv.
When opened with Notepad on a test file, I get:
"","V1","V2","V3"
"1","DOB","Name","Address"
"2","13-03-2003","ABC","xyz."
"3","12-08-2004","dfs","1 infinite loop"
"4","01-01-2000","Bob Smith","1234 Main St, Suite 400"
Row names can be turned off with row.names = FALSE in write.csv (see ?write.csv); to also drop the column names, use write.table with col.names = FALSE.
Data
x <- "DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop
01-01-2000
Bob Smith
1234 Main St
Suite 400
"
