I'm trying to take a Word doc that has data not in a table, and make it into a table. There are hundreds of identical word docs and I would like to write a script that could take the data and make it into a table.
My first idea is to convert it all into one column, and then I can somehow pull the column headers out and organize the data underneath it.
Word file: https://github.com/cstaulbee/Operation-WordDoc/blob/master/Sanitized_sampe.docx
library(docxtractr)
filenames <- list.files(".", pattern="*.docx", full.names=TRUE)
docx.files <- lapply(filenames, function(file) read_docx(file))
idx <- 1
docx.tables <- lapply(docx.files, function(file) {
ifelse(dir.exists("Contents"), {
unlink("Contents", recursive=T, force=T)
dir.create("Contents")
}, {
dir.create("Contents")
})
filename <- filenames[idx]
idx <- idx + 1
tbl <- docx_extract_tbl(file, 1)
file.copy(filename, "Contents\\word.zip", overwrite=T)
unzip("Contents\\word.zip", exdir='Contents')
x <- xml2::read_xml("Contents\\word\\document.xml")
nodes <- xml2::xml_find_all(x, "w:body/w:p/w:r/w:t")
data.date <- paste(xml2::xml_text(nodes, trim=T), collapse="::")
word_df <- strsplit(gsub("[:]{1,}", ":", txt), ":")
return(
list(
date=data.date
)
)
})
word_df <- strsplit(gsub("[:]{1,}", ":", docx.tables), ":")
This converts the word doc to a zip file, then reads it as an XML. It pulls out the info that isn't in tables, and then puts it all into a list that can then be manipulated.
I wanted to know if anyone knows of a way to take this column and make it into a few columns based on the data. For example, Date, Time in, Pilot, and Assistants will appear 3 or so times in the column, but I want each of those to be their own column, with the data between them and the next column header to be the data that makes up the rows.
So basically it looks like this:
df_col
Date
2/
2/16
Pilot
John, Mark
Assistants
Alfred, James
But I want it to look like this
Date_col Pilot_col Assistants_col
2/22/16 John, Mark Alfred, James
Unless someone has an idea of a better way of doing this.
You can use officer to scrap your docx document:
library(officer)
doc <- read_docx(path = "Sanitized_sampe.docx")
docx_summary(doc)
The last step would be to regexp column text when content_type=="paragraph".
Related
I am trying to read many text files into R using read.table. Most of the time we have clean text files which have defined columns.
The data that I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
You can see that the blanks and length of text files varies by report.
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt
ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt
My objective is to read many of these text files and combine them into a dataset.
If I can read one of the them then compiling should not be an issue. However, I am running into several issues because of the format of the text file:
1) the number of FIRMS vary from report to report. For example, sometimes there will be 3 rows (i.e. 3 firms that did business on that data) of data to import and sometimes there may be 10.
2) Blanks are being recognized. For example, under the FIRM section there should be a column for Deliveries (DEL) and Receipts (REC). The data when it is read in THIS section should look like:
df <- data.frame("FIRM_#" = c(407, 685, 800, 905),
"FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
"DEL" = c(1,1,15,1), "REC"= c(NA,18,NA,NA))
however when I read this in the fomatting is all messed up and does not put NA for the blank values
3) The above issues apply for "YARDS" and "FUTURE DELIVERIES SCHEDULED" section of the text file.
I have tried to read in sections of the text file and then format it accordingly but since the the number of firms change day to day the code does not generalize.
Any help would greatly be appreciated.
Here an answer which starts from the scratch via rvest for downloading data and includes lots of formatting. The general idea is to identify fixed widths that may be used to separate columns - I used a little help from SO for this purpose link.
You could then use read.fwf() in combination with cat()and tempfile(). In my first attempt this did not work, due to some formatting issues, so I added some additional lines to get the final table format.
Maybe there are some more elegant options and shortcuts I have overseen, but at least, my answer should get you started. Of course, you will have to adapt the selection of lines, identification of widths for spliting tables depending on what parts of the data you need. Once this is settled, you may loop through all the websites to gather data. I hope this helps...
library(rvest)
library(dplyr)
page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")
table <- page %>%
html_text("pre") %>%
#reformat by splitting on line breakes
{ unlist(strsplit(., "\n")) } %>%
#select range based on strings in specific lines
"["(.,(grep("FIRM #", .):(grep(" DELIVERIES SCHEDULED", .)-1))) %>%
#exclude empty rows
"["(., !grepl("^\\s+$", .)) %>%
#fix width of table to the right
{ substring(., 1, nchar(gsub("\\s+$", "" , .[1]))) } %>%
#strip white space on the left
{ gsub("^\\s+", "", .) }
headline <- unlist(strsplit(table[1], "\\s{2,}"))
get_split_position <- function(substring, string) {
nchar(string)-nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string , perl=T))
}
#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {
get_split_position(x, table[1])
})
#exclude headline from split
table <- lapply(table[-1], function(x) {
substring(x, c(1, split_positions + 1), c(split_positions, nchar(x)))
})
table <- do.call(rbind, table)
colnames(table) <- headline
#strip whitespace
table <- gsub("\\s+", "", table)
table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2, as.numeric)
table
# FIRM # FIRM NAME DEL REC
# 1 407 STRAITSFINLLC 1 NA
# 2 685 R.J.O'BRIENASSOC 1 18
# 3 800 ROSENTHALCOLLINSLL 15 NA
# 4 905 ADMINVESTORSERVICE 1 NA
Trying to figure why when I run this code all the information from the columns is being written to the first file only. What I want is only the data from the columns unique to a MO number to be written out. I believe the problem is in the third line, but am not sure how to divide the data by each unique number.
Thanks for the help,
for (i in 1:nrow(MOs_InterestDF1)) {
MO = MOs_InterestDF1[i,1]
df = MOs_Interest[MOs_Interest$MO_NUMBER == MO, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS","Difference", "Sum")]
submit.df <- data.frame(df)
filename = paste("Variance", "Report",MO, ".csv", sep="")
write.csv(submit.df, file = filename, row.names = FALSE)}
If you are trying to write out a separate csv for each unique MO number, then something like this may work to accomplish that.
unique.mos <- unique(MOs_Interest$MO_NUMBER)
for (mo in unique.mos){
submit.df <- MOs_Interest[MOs_Interest$MO_NUMBER == mo, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS","Difference", "Sum")]
filename <- paste("Variance", "Report", mo, ".csv", sep="")
write.csv(submit.df, file = filename, row.names = FALSE)
}
It's hard to answer fully without example data (what are the columns of MOs_InterestDF1?) but I think your issue is in the df line. Are you trying to subset the dataframe to only the data matching the MO? If so, try which as in df = MOs_Interest[which(MOs_Interest$MO_NUMBER == MO),].
I wasn't sure if you actually had two separate dfs (MOs_Interest and MOs_InterestDF1); if not, make sure the df line points to the correct data frame.
I tried to create some simplified sample data:
MOs_InterestDF1 <- data.frame("MO_NUMBER" = c(1,2,3), "Item_No" = c(142,423,214), "Desc" = c("Plate","Book","Table"))
for (i in 1:nrow(MOs_InterestDF1)) {
MO = MOs_InterestDF1[i,1]
mydf = data.frame(MOs_InterestDF1[which(MOs_InterestDF1$MO_NUMBER == MO),])
filename = paste("This is number ",MO,".csv", sep="")
write.csv(mydf, file = filename, row.names=FALSE)
}
This output three different csv files, each with exactly one row of data. For example, "This is number 1.csv" had the following data:
MOs Item_No Desc
1 142 Plate
I am pulling 10-Ks off the SEC website using the EDGAR package in R. Fortunately, the text files come with a consistent file naming convention: CIK number (this is a unique filing ID)_File type_Date.
Ultimately I want to analyze these by SIC/industry group, so I think the best way to do this would be to add the SIC industry code to this filename rule.
I am including an image of what I would like to do below. It is kind of like a database join except my file names would be taking the new field. Not sure how to do that, I am pretty new to R and file scripting.
I am assuming that you have a data.frame with a column filenames. (Or a vector containing all the filenames) See the code below:
# A data.frame with a character column 'filenames'
df$CIK <- sapply(df$filenames, FUN = function(x) {unlist(strsplit(x, split = "_"))[1]})
df$CIK <- as.character(df$CIK)
Now, let us assume that you have another data.frame with two columns: CIK and SIC.
# A data.frame with two character columns: 'CIK' and 'SIC'
# df2.
#
# We add another column to the first data.frame: 'new_filenames'
df$new_filename <- sapply(1:nrow(df), FUN = function(idx, CIK, filenames, df2) {
SIC <- df2$SIC[which(df2$CIK == CIK[idx])]
new_filename <- as.character(paste(SIC, "_", filenames[idx], sep = ""))
new_filenames
}, CIK = df$CIK, filenames = df$filenames, df2 = df2)
# Now the new filenames are available in df$new_filenames
View(df)
I have the following .csv file:
https://drive.google.com/open?id=0Bydt25g6hdY-RDJ4WG41VFpyX1k
And I would like to be able to take the date and agent name(pasting its constituent parts) and append them as columns to the right of the table, up until it finds a different name and date, doing the same for the remaining name and date items, to get the following result:
The only thing I have been able to do with the dplyr package is the following:
library(dplyr)
library(stringr)
report <- read.csv(file ="test15.csv", head=TRUE, sep=",")
date_pattern <- "(\\d+/\\d+/\\d+)"
date <- str_extract(report[,2], date_pattern)
report <- mutate(report, date = date)
Which gives me the following result:
The difficulty I am finding is probably using conditionals in order make the script get the appropriate string and append it as a column at the end of the table.
This might be crude, but I think it illustrates several things: a) setting stringsAsFactors=F; b) "pre-allocating" the columns in the data frame; and c) using the column name instead of column number to set the value.
report<-read.csv('test15.csv', header=T, stringsAsFactors=F)
# first, allocate the two additional columns (with NAs)
report$date <- rep(NA, nrow(report))
report$agent <- rep(NA, nrow(report))
# step through the rows
for (i in 1:nrow(report)) {
# grab current name and date if "Agent:"
if (report[i,1] == 'Agent:') {
currDate <- report[i+1,2]
currName=paste(report[i,2:5], collapse=' ')
# otherwise append the name/date
} else {
report[i,'date'] <- currDate
report[i,'agent'] <- currName
}
}
write.csv(report, 'test15a.csv')
In the webpage, there is a kind of table in the webpage which have more than one element in one cell. I can crawl the content in the table by following code, but I could not bind these elements as their webpage architecture. Do we have some methods to combine these element perfectly, or we should use other idea to get each element?
library(XML)
dataissued <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
ec_parsed <- htmlTreeParse(dataissued, encoding = "UTF-8", useInternalNodes = TRUE)
# gether content in table and build the dataframe
# title and introduction link of IR resource
item_title <- xpathSApply(ec_parsed, '//td[#headers="t1"]//a', xmlValue)
item_hrefs <- xpathSApply(ec_parsed, '//td[#headers="t1"]//a/#href')
# author and introduction link of IR resource
auth_name <- xpathSApply(ec_parsed, '//td[#headers="t2"]//a', xmlValue)
auth_hrefs <- xpathSApply(ec_parsed, '//td[#headers="t2"]//#href')
# publish date of IR resource
pub_date <- xpathSApply(ec_parsed, '//td[#headers="t3"]', xmlValue)
# whole content link of IR resource
con_link <- xpathSApply(ec_parsed, '//td[#headers="t3"]//a[#href]', xmlValue)
item_table <- cbind(item_title, item_hrefs, auth_name, auth_hrefs, pub_date, con_link)
colnames(item_table) <- c("t1", "href1", "t2", "href2", "t3", "t4", "href4")
I have tried many times but still cannot organise them as it should be, just like one paper may have several authors, and all the authors and their links should save in one "row", but now one author is in one row, and the title of paper totally reused. That makes the result messed up.
This is one way to make a long data frame from that table:
library(rvest)
library(purrr)
library(tibble)
pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued")
# extract the columns
col1 <- html_nodes(pg, "td[headers='t1']")
col2 <- html_nodes(pg, "td[headers='t2']")
col3 <- html_nodes(pg, "td[headers='t3']")
# this is the way to get the full text column
col4 <- html_nodes(pg, "td[headers='t3'] + td")
# now, iterate over the rows; map_df() will bind all our data.frame's together
map_df(1:legnth(col1), function(i) {
# extract the links
a1 <- xml_nodes(col1[i], "a")
a2 <- xml_nodes(col2[i], "a")
a4 <- xml_nodes(col4[i], "a")
# put the row into a long data.frame for the row
data_frame( title = html_text(a1, trim=TRUE),
title_link = html_attr(a1, "href"),
author = html_text(a2, trim=TRUE),
author_link = html_attr(a2, "href"),
issue_date = html_text(col3[i], trim=TRUE),
full_text = html_attr(a4, "href"))
})
The biggest problem during using "rvest" package is mess code. Even the parameter "encoding" has been used in the program, the result still have mess code. But the web page encoding is UTF-8. Such as:
library(rvest)
pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued", encoding = "UTF-8")
For my test, the best performance should be "XML", when I use getNodeset function, the result is right, no mess code at all. However, I only get the whole node, and could not gether each row of table with their structure.
library(XML)
pg <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
pg_tables <- getNodeSet(htmlParse(pg), "//table[#summary='This table browse all dspace content']")
# gether the node of whole table
papernode <- getNodeSet(pg_tables[[1]], "//td[#headers='t1']")
paper_hrefs <- xpathSApply(papernode[[1]], '//a/#href')
paper_name <- xpathSApply(papernode[[1]], '//a', xmlValue)
# gether authors in table
authnode <- getNodeSet(pg_tables[[1]], "//td[#headers='t2']")
# gether date in table
datenode <- getNodeSet(pg_tables[[1]], "//td[#headers='t3']")
With this program, I could get these "nodes" separatly. However, crawling the headers and their links seems getting harder. Because the result class of "getNodeSet" is not same as "html_nodes". How can we read the dataframe generated by "getNodeSet" automatically and extract the header and their links from these nodes in an exact way?