I am new to this platform and I hope someone can help me.
I have imported some PDF files into RStudio using the pdftools package. Now I want to turn this text into structured columns, but I can't seem to get the structure right.
This is an example of one of the files I imported. I want to turn the yellow-shaded lines into a data table.
This is the outcome I would ultimately like to have.
I have entered the code below, but I can't get the result into a data table.
library(pdftools)
library(stringr)
library(dplyr)
# list the PDF files in the working directory
files <- list.files(pattern = "pdf$", full.names = TRUE)
# extract the text of each PDF into a list (one element per file)
filestext <- lapply(files, pdf_text)
# split the text into separate lines on "\n"
filestext <- str_split(filestext, pattern = "\n")
This is the result I get:
Does anyone know the easiest way to solve this?
I would also give https://sensible.so a shot. We have some great documentation and a free plan just for projects like this. Plus, when you sign up there are some tutorials to help you understand how to extract different types of data. I bet you can have this extracted into a clean JSON object in no time.
I am trying to scrape from a 276-page PDF available here: https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf
Not only is the document very long, but it also has tables in different formats. I tried using the extract_tables() function from the tabulizer package. This successfully scrapes the data tables beginning on page 143 of the document but does not work for the tables on pages 18-75. Are these pages unscrapable? If so, why?
I get error messages that say "more columns than column names" and "duplicate 'row.names' are not allowed"
child_support_scrape <- extract_tables(
  file = "C:/Users/Jenny/Downloads/OCSE_2018_annual_report.pdf",
  method = "decide",
  output = "data.frame")
Text in PDF files is not stored in a plain-text format, so it is generally hard to extract text from a PDF file. The following provides an alternative way to extract the table from the PDF. It requires the pdftools and plyr packages.
# Read the text of the PDF into R (pdf_text returns one string per page)
pdf_text <- pdftools::pdf_text("https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf")
# Focus on the table on page 22 and split it into lines
pdf_text22 <- strsplit(pdf_text[[22]], "\n")[[1]]
# Split each line into columns on runs of 2+ spaces (regular expression)
pdf_text22 <- strsplit(pdf_text22, " {2,100}")
# Convert the rows into a data frame, padding rows of unequal length
pdf_text22 <- plyr::rbind.fill(lapply(pdf_text22, function(x) as.data.frame(t(matrix(x)))))
Additional formatting may be required to beautify the data frame.
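For example, a minimal sketch of such clean-up, assuming the first extracted row holds the column headers (tbl is just an illustrative name):
# Hypothetical clean-up: promote the first extracted row to column names
# and drop it from the body of the table
tbl <- pdf_text22
names(tbl) <- sapply(tbl[1, ], as.character)
tbl <- tbl[-1, ]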
I'm trying to produce daily reports with R Markdown on COVID-19 data. I want to tweet the top 10 values from tables, but the options I've tried so far leave no spaces: the tabs are erased when the tweet button is pushed. I have tried {kableExtra} with HTML output and {flextable} with Word output, but when the tables are copied and pasted, the column separations are 'disappearing' tabs.
Does anyone have any recommendations on how to obtain a table with spaces or commas between columns?
Example Rmarkdown script is here, if interested, but the question is meant to be general and not require looking at the script.
How about creating a picture of the table (which then looks quite good)?
You could do it like this:
library("knitr")
library(kableExtra")
knitr::kable(mtcars, "latex") %>%
kableExtra::kable_styling(latex_options = "striped") %>%
kableExtra::save_kable("test.png")
Or does this have any downsides you don't want?
Addition:
Alright, I didn't look at your file at first; it seems you want to add 4 tables, not copy 4 images.
Short question here: isn't this quite hard with Twitter's 280-character limit...?
But what you could do is the following:
```{r, echo = F}
aa <- knitr::kable(head(mtcars[, 1:4]), "pipe")
for (i in 1:length(aa)) {
  aa[i] <- gsub(" ", ",", aa[i])
  aa[i] <- paste(aa[i], "\n")
}
aa
```
In your code chunk, save the table to a variable. It will then just be a table in Markdown format, so you can loop through it and replace or alter characters however you need.
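A small variant of the same idea (my sketch, not from the original chunk) collapses each run of spaces into a single comma, so columns end up separated by exactly one comma:
aa <- knitr::kable(head(mtcars[, 1:4]), "pipe")
# gsub() is vectorized over the lines, so no explicit loop is needed;
# " +" matches one or more spaces and replaces each run with one comma
aa <- gsub(" +", ",", aa)
cat(paste(aa, collapse = "\n"))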
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
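If you want to drive it from R, a minimal sketch (assuming pdftotext is installed and on the PATH; foo.pdf and foo.txt are placeholder names):
# Shell out to the pdftotext CLI from R, then read the result;
# -layout tries to preserve the original page layout
system2("pdftotext", args = c("-layout", "foo.pdf", "foo.txt"))
txt <- readLines("foo.txt")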
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the PDF's lines in an array.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line Java JAR package, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
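For instance, a minimal sketch of specifying a target area (the file name and coordinates below are placeholders):
library(tabulizer)
# extract only a region of page 1; the area is c(top, left, bottom, right)
# in points, and guess = FALSE tells Tabula to use exactly that region
tables <- extract_tables("report.pdf",
                         pages = 1,
                         area = list(c(100, 50, 400, 550)),
                         guess = FALSE,
                         output = "data.frame")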
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert the PDFs to text:
library(stringr)  # provides str_sub()

exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

# pdfFracList holds the PDF file names; reportDir is their directory
for (i in 1:length(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", Processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination, sep = " "), wait = TRUE)
}
I am writing code to export a database from R into Excel. I have been trying other code, including:
write.table(ALBERTA1, "D:/ALBERTA1.txt", sep="\t")
write.csv(ALBERTA1,":\ALBERTA1.csv")
your_filename_in_R = read.csv("ALBERTA1.csv")
write.csv(df, file = "ALBERTA1.csv")
write.csv(ALBERTA1, "ALBERTA1.csv")
write.table(ALBERTA1, 'clipboard', sep='\t')
write.table(ALBERTA1,"ALBERTA1.txt")
write.table(as.matrix(ALBERTA2),"ALBERTA2.txt")
write.table(as.matrix(vecm.pred$fcst$Alberta_Females[,1]), "vecm.pred$fcst$Alberta_Females[,1].txt")
write.table(as.matrix(foo),"foo.txt")
write.xlsx(ALBERTA2, "/ALBERTA2.xlsx")
write.table(ALBERTA1, "D:/ALBERTA1.txt", sep="\t").
Other users of this forum advised me to try this:
write.csv2(ALBERTA1, "ALBERTA1.csv")
write.table(kt, "D:/kt.txt", sep="\t", row.names=FALSE)
You can see in the pictures the outcome I got from the code above. But these numbers can't be used for any further operations, such as addition with other matrices.
Has someone experienced this kind of problems?
Another option is the openxlsx package. It doesn't depend on Java and can read, edit, and write Excel files. From the package description:
openxlsx simplifies the process of writing and styling Excel xlsx files from R and removes the dependency on Java
Example usage:
library(openxlsx)
# read data from an Excel file or Workbook object into a data.frame
df <- read.xlsx('name-of-your-excel-file.xlsx')
# for writing a data.frame or list of data.frames to an xlsx file
write.xlsx(df, 'name-of-your-excel-file.xlsx')
Besides these two basic functions, the openxlsx-package has a host of other functions for manipulating Excel-files.
For example, with the writeDataTable-function you can create formatted tables in an Excel-file.
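For example, a small sketch of writeDataTable (the sheet and file names are just illustrative):
library(openxlsx)
# write iris as a formatted Excel table on its own worksheet
wb <- createWorkbook()
addWorksheet(wb, "Iris")
writeDataTable(wb, sheet = "Iris", x = iris, tableStyle = "TableStyleMedium2")
saveWorkbook(wb, "iris-table.xlsx", overwrite = TRUE)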
I recently used the xlsx package; it works well.
library(xlsx)
write.xlsx(x, file, sheetName="Sheet1")
where x is a data.frame
writexl, with no Java requirement:
# install.packages("writexl")
library(writexl)
tempfile <- write_xlsx(iris)
The WriteXLS function from the WriteXLS package can write data to Excel.
Alternatively, write.xlsx from the xlsx package will also work.
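A minimal sketch (WriteXLS needs a working Perl installation; the file name is illustrative):
library(WriteXLS)
# write the iris data frame to an Excel file; WriteXLS also accepts
# the name of the data frame as a string
WriteXLS(iris, ExcelFileName = "iris.xlsx")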
One could also use the readODS package. Granted it doesn't produce an .xlsx, but Excel can read Open Document Spreadsheet (ODS) / LibreOffice files too.
require(readODS)
tmp = file.path(tempdir(), 'iris.ods')
write_ods(iris, tmp)
If I might offer an alternative: you could also save your data frame in a regular CSV file and then use the "Get Data" function within Excel to import it. This worked like a charm for me, and you need not bother with any Excel packages in R.
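In code that is just a plain write.csv (your_df is a placeholder for your data frame):
# write the data frame to CSV, then import it in Excel via
# Data > Get Data > From Text/CSV
write.csv(your_df, "your_data.csv", row.names = FALSE)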
Here is a way to write data from a data frame into an Excel file, split into different files by one ID and into different tabs (sheets) by another ID associated with the first-level ID. Imagine you have a data frame with email_address as one column for a number of different users, where each email has a number of 'sub-IDs' that hold all the data.
library(dplyr)    # tibble(), filter(), arrange(), %>%
library(stringr)  # str_c(), str_extract()
library(xlsx)     # write.xlsx() with sheetName/append support

data <- tibble(id = c(1,2,3,4,5,6,7,8,9),
               email_address = c(rep('aaa@aaa.com', 3), rep('bbb@bbb.com', 3), rep('ccc@ccc.com', 3)))
So ids 1, 2, and 3 would be associated with aaa@aaa.com. The following code splits the data by email and then puts ids 1, 2, and 3 into different tabs. The important thing is to set append = TRUE when writing the .xlsx file.
temp_dir <- tempdir()

for (i in unique(data$email_address)) {
  # subset and order this user's rows
  data %>%
    filter(email_address == i) %>%
    arrange(id) -> subset_data
  # write each id to its own sheet (tab) of that user's workbook
  for (j in unique(subset_data$id)) {
    write.xlsx(subset_data %>% filter(id == j),
               file = str_c(temp_dir, "/your_filename_",
                            str_extract(i, pattern = "\\b[A-Za-z0-9._%+-]+"),
                            '_', Sys.Date(), '.xlsx'),
               sheetName = as.character(j),
               append = TRUE)
  }
}
The regex gets the name from the email address and puts it into the file-name.
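For example (a quick check of the pattern, not part of the original post):
# the part of the address before the '@' is what ends up in the file name
stringr::str_extract("aaa@aaa.com", "\\b[A-Za-z0-9._%+-]+")
#> [1] "aaa"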
Hope somebody finds this useful. I'm sure there are more elegant ways of doing this, but it works.
By the way, here is a way to then send these individual files to the various email addresses in the data frame. The code goes inside the second loop (over j) and uses send.mail() from the mailR package:
send.mail(from = "sender@sender.com",
          to = i,
          subject = paste("Your report for", str_extract(i, pattern = "\\b[A-Za-z0-9._%+-]+"), 'on', Sys.Date()),
          body = "Your email body",
          authenticate = TRUE,
          smtp = list(host.name = "XXX", port = XXX,
                      user.name = Sys.getenv("XXX"), passwd = Sys.getenv("XXX")),
          attach.files = str_c(temp_dir, "/your_filename_", str_extract(i, pattern = "\\b[A-Za-z0-9._%+-]+"), '_', Sys.Date(), '.xlsx'))
I have been trying out different packages, including this function:
install.packages("prettyR")
library(prettyR)
delimit.table(Corrvar, "Name the csv.csv")  ## Corrvar is the name of an object from an output I had (on scaled variables) to run a regression
However, I tried this same code on the output from another analysis (an occupancy-model model-selection output) and it did not work. After many attempts and much exploration, I:
copied the output from R (Ctrl+C)
pasted it into an Excel sheet (Ctrl+V)
selected the first column, where the data is
in the "Data" tab, clicked on "Text to Columns"
selected the Delimited option and clicked Next
ticked the Space box under "Delimiters" and clicked Next
clicked Finish
Your output should now be in a form you can manipulate easily in Excel. So perhaps not the fanciest option, but it does the trick if you just want to explore your data in another way.
PS: If the labels in Excel are not exactly right, it is because I'm translating the labels from my Spanish version of Excel.