I am trying to scrape from a 276-page PDF available here: https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf
Not only is the document very long, but it also has tables in different formats. I tried using the extract_tables() function from the tabulizer package. This successfully scrapes the data tables beginning on page 143 of the document but does not work for the tables on pages 18-75. Are these pages unscrapable? If so, why?
I get error messages that say "more columns than column names" and "duplicate 'row.names' are not allowed"
library(tabulizer)

child_support_scrape <- extract_tables(
  file = "C:/Users/Jenny/Downloads/OCSE_2018_annual_report.pdf",
  method = "decide",
  output = "data.frame")
Text in PDF files is not stored in a plain-text format, so it is generally hard to extract tables from a PDF. The following is an alternative way to extract the table; it requires the pdftools and plyr packages.
# Download the PDF and extract its text, one element per page
pdf_text <- pdftools::pdf_text("https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf")
# Focus on the table on page 22: split that page into lines
pdf_text22 <- strsplit(pdf_text[[22]], "\n")[[1]]
# Split each line into cells wherever there is a run of 2 or more spaces (a regular expression)
pdf_text22 <- strsplit(pdf_text22, " {2,100}")
# Bind the rows into a data frame, padding short rows with NA
pdf_text22 <- plyr::rbind.fill(lapply(pdf_text22, function(x) as.data.frame(t(matrix(x)))))
Additional formatting may be required to beautify the data frame.
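To illustrate that clean-up, here is a minimal sketch of the kind of post-processing I would try; the assumptions that the headers sit in the first retained row and that the remaining columns are numeric depend on the actual page layout:
# Drop rows that produced only one cell (page titles, footnotes, etc.)
pdf_text22 <- pdf_text22[rowSums(!is.na(pdf_text22)) > 1, ]

# Promote the first remaining row to column names
# (assumption: that row holds the table headers)
names(pdf_text22) <- as.character(unlist(pdf_text22[1, ]))
pdf_text22 <- pdf_text22[-1, ]

# Strip $ and thousands separators, then convert the value columns to numeric
# (assumption: every column after the first holds numbers)
pdf_text22[-1] <- lapply(pdf_text22[-1], function(col) as.numeric(gsub("[$,]", "", col)))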
Package "table1" allows for some amazing tables to be made. However, i can't figure how to export one to .doc format.
Here is an example code:
table1( ~ x | Y, data = df)
I wonder if it is possible to somehow save the resulting table to .doc format so it still looks like this:
Turns out it's very simple:
Just cmd+a the table in the RStudio viewer -> cmd+c -> Paste Special in MS Word -> choose the .html format!
I'm sorry to dig this up. However, I have learned to love the two packages table1 and flextable, and another route would be to quickly convert the table1 object to a flextable:
library(table1)
library(flextable)
library(magrittr)
# Create Table1 Object
tbl1 <- table1(~ Sepal.Length + Sepal.Width | Species, data = iris)
# Convert to flextable
t1flex(tbl1) %>%
  save_as_docx(path = "Iris_table1.docx")
Note that this gives you full flexibility in working with a flextable and an easy and safe way to export to Word. A nice guide on additional formatting with flextable can be found here.
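As a small taste of that, here is a sketch of a couple of common tweaks applied before exporting (autofit and a bold header row; the file name is just an example):
library(table1)
library(flextable)
library(magrittr)

tbl1 <- table1(~ Sepal.Length + Sepal.Width | Species, data = iris)

t1flex(tbl1) %>%
  autofit() %>%              # size the columns to their contents
  bold(part = "header") %>%  # bold the header row
  save_as_docx(path = "Iris_table1_formatted.docx")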
Another possible solution:
The above strategy did not work for me when I had a similar issue, but it was resolved once I knitted the table1 object, opened the HTML in a browser, copied the HTML table, and pasted it into Word. Doing it within the RStudio viewer would not work for me for some reason.
You can also save it as a .csv file for further manipulation (and from there to Word):
mytable <- table1( ~ x | Y, data = df)
write.table(mytable, "my_table_1_file.csv", col.names = TRUE, row.names = FALSE, append = TRUE, sep = ",")
Basically, on baseball-reference.com there is a way to switch the tables to CSV format, but there is no actual .csv link. I am trying to see if the CSV-formatted text on the webpage can be converted to a .csv file in order to make it a usable table.
I tried to use the normal 'rvest' package with the following code:
library(rvest)

# Los Angeles Dodgers
dodgerBatting <- read_html('https://www.baseball-reference.com/teams/LAD/2019.shtml')
dodgerCSV <- dodgerBatting %>%
  html_nodes('#csv_team_batting') %>%
  html_text()
print(head(dodgerCSV))
The result is basically an empty character vector:
character(0)
You can get the tables present on the webpage using the html_table() function in rvest.
library(rvest)
url <- "https://www.baseball-reference.com/teams/LAD/2019.shtml"
out_table <- url %>% read_html %>% html_table()
This returns a list of data frames; we can access the individual data frames using out_table[[1]], out_table[[2]], and so on. You might need to do some cleaning before using them.
If you need them in CSV format, you can use the write.csv command to write them out:
write.csv(out_table[[1]], "/path/of/the/file.csv")
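One caveat: html_table() only sees tables present in the static HTML. baseball-reference wraps several of its tables in HTML comments and builds the CSV view with JavaScript, which is most likely why '#csv_team_batting' came back empty. If the table you want is one of the commented-out ones, a sketch like the following (a generic comment-parsing trick, not a site-specific API) can recover them:
library(rvest)
library(xml2)

url <- "https://www.baseball-reference.com/teams/LAD/2019.shtml"
page <- read_html(url)

# Pull the text of every HTML comment, re-parse it as HTML,
# and read whatever tables it contains
comment_tables <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_table()

length(comment_tables)  # number of tables that were hidden inside comments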
I have two columns of data in a table loaded from Excel: ID and BeforeConv,
containing numbers and text, respectively.
After loading this data from Excel into RStudio, I did some tm_map-based text cleaning on a copy of the BeforeConv column and put the result in a new column, AfterConv.
This is my code:
library(readxl)
library(tm)

x <- read_excel(file.choose())
corpus <- Corpus(VectorSource(x$BeforeConv))
corpus <- tm_map(corpus, tolower)
x$AfterConv <- data.frame(text = sapply(corpus, as.character))  # <- here, in RStudio, the data still displays correctly
write.csv(x = x, file = "R2test.csv")  # <- now the exported data is garbled
My issue is that while in RStudio I see the conversion done correctly (see image below)...
...when exporting to either CSV or XLSX, that last column (AfterConv) displays totally messy data (see below). What did I do wrong, and how can I fix it?
Ideally, I would like to see the proper code to convert the last column so that it exports correctly, along with the two existing columns, to Excel or CSV.
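One likely culprit is that data.frame(text = sapply(corpus, as.character)) stores an entire data frame inside the AfterConv column, which write.csv then serializes strangely. Here is a minimal sketch of what I would try instead: assign a plain character vector and force UTF-8 on export (the encoding part is an assumption about what the messy output is):
library(readxl)
library(tm)

x <- read_excel(file.choose())
corpus <- Corpus(VectorSource(x$BeforeConv))
corpus <- tm_map(corpus, content_transformer(tolower))

# Assign a plain character vector, not a nested data frame
x$AfterConv <- sapply(corpus, as.character)

# Force UTF-8 so the cleaned text is not garbled on export
write.csv(x, file = "R2test.csv", row.names = FALSE, fileEncoding = "UTF-8")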
I have a data frame (panel form) in R with 194498 rows and 7 columns. I want to write it to an Excel file (.xlsx) using res <- write.xlsx(df, output), but R hangs (keeps showing the stop sign at the top left of the console) without making any change to the target file (output). Finally it shows the following:
Error in .jcheck(silent = FALSE) :
Java Exception <no description because toString() failed>.jcall(row[[ir]], "Lorg/apache/poi/ss/usermodel/Cell;", "createCell", as.integer(colIndex[ic] - 1))<S4 object of class "jobjRef">
I have loaded the readxl and xlsx packages. Please suggest how to fix it. Thanks.
Install and load the package named 'WriteXLS' and try writing out your R object using the WriteXLS() function. Make sure the name of your R object is given in quotes, like "data" below.
# Store your data with 194498 rows and 7 columns in a data frame named 'data'
# Install package named WriteXLS
install.packages("WriteXLS")
# Loading package
library(WriteXLS)
# Writing out R object 'data' in an Excel file created namely data.xlsx
WriteXLS("data",ExcelFileName="data.xlsx",row.names=F,col.names=T)
Hope this helped.
This does not answer your question, but might be a solution to your problem.
You could save the file as a CSV instead, like so:
write.csv(df , "df.csv")
Then open the CSV and save it as an Excel file.
I gave up on trying to import/export Excel files with R because of hassles like this.
In addition to Pete's answer, I wouldn't recommend write.csv because it can take minutes to write large files. I used fwrite() (from the data.table library) and it did the same thing in about 1-2 seconds.
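A minimal sketch of that alternative (same data frame, just a different writer):
library(data.table)

# fwrite is a drop-in, much faster replacement for write.csv
fwrite(df, "df.csv")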
The post author asked about large files. I dealt with a table about 2.3 million rows long, and write.csv (and fwrite) weren't able to write more than about 1 million rows; the rest of the data was simply cut off. So instead use write.table(Data, file = "Data.txt"). You can open it in Excel and split the single column by your delimiter (set it with the sep argument), and voila!
I'm developing an application in R Shiny. I'm reading SAS XPT files using the read.xport function and creating data tables. I use the labels of the data table columns to display choices in a checkboxGroupInput. It works well for all SAS datasets except a few (I get the error: "string 1 is invalid in this locale"). When I researched this, I found that some of the column labels have special characters; for example, here is a column (label): CNTRYLNM (Country, Long Name�).
When I display the label with unname(label(datatable)), it comes out as "Country, Long Name\xa0"; other column labels have \xa1 at the end.
How do I get rid of these, either while reading from the XPT itself or afterwards?
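One way to deal with this after reading the file is to repair or strip the offending bytes in the labels with iconv() before handing them to checkboxGroupInput(). A minimal sketch, assuming the stray bytes (\xa0, \xa1) are latin1-encoded and that label() is the Hmisc accessor:
library(Hmisc)  # assumption: label() in the question is Hmisc's accessor

labs <- unname(label(datatable))

# Declare the labels latin1 and convert to UTF-8 (turns \xa0 into a real
# non-breaking space, which avoids the "invalid in this locale" error)...
labs_utf8 <- iconv(labs, from = "latin1", to = "UTF-8")

# ...or drop anything without an ASCII equivalent altogether
labs_ascii <- iconv(labs, from = "latin1", to = "ASCII", sub = "")
Either cleaned vector can then be used as the choices (or their names) in checkboxGroupInput().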