Save website content into txt files - r

I am trying to write R code that takes a URL as input and outputs (saves to the hard drive) a .txt file. I created a large list of URLs using the "edgarWebR" package. An example would be "https://www.sec.gov/Archives/edgar/data/1131013/000119312518074650/d442610dncsr.htm". Basically:
open the link
copy everything (CTRL+A, CTRL+C)
open an empty text file and paste the content (CTRL+V)
save the .txt file under a specified name
(in a looped fashion, of course). I am inclined to "hard code it" (as in, open the website in a browser using browseURL(...) and "send keys" commands), but I am afraid that will not run very smoothly. Other commands (such as readLines()) seem to copy the HTML structure and therefore return more than just the text.
In the end I am interested in a short paragraph of each of those shareholder letters (text only; tables/graphs are of no concern in my particular setup).
Is anyone aware of an R function that would help?
Thanks in advance!

Let me know if the code below works for you. xpathSApply can be applied to other HTML components as well; in your case, only paragraphs are required.
library(RCurl)
library(XML)

# Create character vector of urls
urls <- c("url1", "url2", "url3")

for (url in urls) {
  # download html
  html <- getURL(url, followlocation = TRUE)
  # parse html
  doc <- htmlParse(html, asText = TRUE)
  # extract the text of all <p> nodes
  plain.text <- xpathSApply(doc, "//p", xmlValue)
  # write the text to a .txt file
  # (depends whether you need separate files for each url or the same one;
  # basename() strips the path so the URL can be used as a file name)
  fileConn <- file(paste(basename(url), "txt", sep = "."))
  writeLines(paste(plain.text, collapse = "\n"), fileConn)
  close(fileConn)
}
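For other HTML components, only the XPath changes; a small illustrative sketch inside the same loop (the selectors below are generic examples, not specific to the EDGAR filings):
# Illustrative only: swap the XPath to target other node types.
headings <- xpathSApply(doc, "//h1 | //h2", xmlValue)    # section headings
cells    <- xpathSApply(doc, "//td", xmlValue)           # table cells
links    <- xpathSApply(doc, "//a", xmlGetAttr, "href")  # link targets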

Thanks everyone for your input. It turns out that any HTML conversion took too much time given the amount of websites I need to parse. The (working) solution probably violates some best-practice guidelines, but it does the job.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import clipboard  # provides clipboard.paste()

driver = webdriver.Firefox(executable_path=path + '/codes_ml/geckodriver/geckodriver.exe')  # initialize driver
# it is fine to open the driver just once
# loop over the report urls and pull the text via the clipboard
driver.get(report_url)
element = driver.find_element_by_css_selector("body")
element.send_keys(Keys.CONTROL + 'a')
element.send_keys(Keys.CONTROL + 'c')
text = clipboard.paste()

Related

Change Value in "broken" .xml File in R

I got a "broken" .xml file with missing header and root element
myBrokenXML.xml
<attribute1>false</attribute1>
<subjects>
<subject>
<population>Adult</population>
<name>adult1</name>
</subject>
</subjects>
This .xml file is the input for a program that I have to use, and its structure cannot be changed.
I would like to change the value of "name" to adult5.
I tried the xml2 package, but read_xml() requires a proper XML file and returns the error message "Extra content at the end of the document".
I tried reading the file line by line with readLines() and then writing a new line with writeLines(), but this again resulted in the error message "cannot write to this connection".
Any suggestions are greatly appreciated. I am new to R and XML and have been at this for hours (and cursed the developers a few times in the process).
Thanks in advance!
Code using xml2:
XMLFile <- read_xml("myBrokenXML.xml")
Code using readLines/writeLines (this would still require deleting the original line):
conn <- file("myBrokenXML.xml", open = "r")
lines <- readLines(conn)
for (i in 1:length(lines)){
print(lines[i])
if (lines[i] == "\t\t<name>adult1</name>"){
writeLines("\t\t<name>adult5</name>", conn)
}
}
GOAL
I need to change the value of "name" from adult1 to adult5, and the file must keep the same structure (no header, no root element) at the end.
The easiest way to do this is to use read_html instead of read_xml, since read_html will attempt to parse even broken documents, whereas read_xml requires strict formatting. It is possible to use this fact to create a repaired xml document by creating a new xml_document and writing the nodes obtained from read_html into it. This function will allow fragments of xml to be repaired into a proper xml document:
fix_xml <- function(xml_path, root_name = "root")
{
  my_xml   <- xml2::xml_new_root("root")
  root     <- xml2::xml_find_all(my_xml, "//root")
  my_html  <- xml2::read_html(xml_path)
  fragment <- xml2::xml_find_first(my_html, xpath = "//body")
  new_root <- xml2::xml_set_name(fragment, root_name)
  new_root <- xml2::xml_replace(root, fragment)
  return(my_xml)
}
So we can do:
fix_xml("myBrokenXML.xml")
#> {xml_document}
#> <root>
#> [1] <attribute1>false</attribute1>
#> [2] <subjects>\n <subject>\n <population>Adult</population>\n <name>adult1...
The answer from Allan Cameron (#1) works fine, as long as your file does not include case-sensitive element names.
If someone ever runs into the same problem, here is what worked for me.
fix_xml <- function(xmlPath){
  con <- file(xmlPath)
  lines <- readLines(con)
  firstLine <- c("<root>")
  lastLine <- c("</root>")
  lines <- append(lines, firstLine, after = 0)
  lines <- append(lines, lastLine, after = length(lines))
  write(lines, xmlPath)
  close(con)
}
This function inserts a root element into a "broken" xml file.
The fixed xml file can then be read using read_xml() and edited as desired.
The only difference from Allan's answer is that read_html ignores upper-case letters and reads the whole file as lower case.
My solution is not as versatile, but it keeps upper case letters.
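To complete the stated goal end to end, one could combine this repair step with xml2 and then strip the temporary wrapper again when writing back; a minimal sketch, untested, assuming xml2 is available and that dropping the added declaration and root lines afterwards is acceptable to the downstream program:
# Sketch: repair, edit <name>, write back, then drop the temporary root again.
library(xml2)
fix_xml("myBrokenXML.xml")                     # wraps the file in <root> ... </root>
doc  <- read_xml("myBrokenXML.xml")
node <- xml_find_first(doc, "//name")
xml_text(node) <- "adult5"
write_xml(doc, "myBrokenXML.xml")              # note: write_xml re-indents the file
lines <- readLines("myBrokenXML.xml")
lines <- lines[!grepl("^<\\?xml|^</?root>$", lines)]  # remove declaration and wrapper
writeLines(lines, "myBrokenXML.xml")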

R - using Magick to mass read pdfs. issue with looping

I'm trying to write a script to read a series of pdfs, OCR them using the tesseract package, and then do things with the text I can extract.
So far, I'm at the following:
ReportDensity <- list()
AllReports <- list.files(path = "path",pattern = "*.PDF",full.names=TRUE)
and then I need the page count for each pdf so that I can read the image data:
for (i in seq(AllReports))
ReportDensity[[i]] <- pdf_info(AllReports[[i]])
ReportDensity <- lapply(ReportDensity, `[[`, 2)
Now, what I want to do is save each page of each pdf as a separate image file so that I can OCR it.
for (i in seq(AllReports))
for (j in 1:ReportDensity[[i]])
(assign(paste0("Report_",i,"_Page_",j),image_read_pdf(AllReports[[i]],pages = ReportDensity[j])))
The error message I receive is:
"Error in poppler_render_page(loadfile(pdf), page, dpi, opw, upw, antialiasing, :
Invalid page."
which I believe to be because I wrote the loop incorrectly. I have tested the code by manually putting in image/page numbers, and it loads correctly.
I'm hoping that the end result would be a series of image files of the form "Report_ReportNumber_PageNumber" that I could then process.
PDFs are mostly text (most often);
I usually extract text from PDFs using Python's pdf2txt, run page by page on the shell through a call to:
i=pagenumber
system(paste("pdf2txt -p", i, "-o text.txt pdffile.pdf"))
then you can grep text from each page; the -o flag can output an html or xml file, which you can scrape with library(rvest)
pdfimages extracts the images contained in PDFs; you can OCR those:
system(paste("pdfimages -f", i, "-l", i, "-p -png pdffile.pdf imagefile"))
that may output a lot of pngs from a single page; they come out numbered:
system(paste0("tesseract imagefile-",i,"-006.png out6"))
tesseract has several parameters you must tune before getting a decent result.
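For an all-R alternative closer to the loop in the question, something like the sketch below might work (untested; it assumes the pdftools, magick and tesseract packages, and it passes j as the page number rather than the page count that pages = ReportDensity[j] supplied):
# Sketch: read each page of each PDF as an image and OCR it.
library(pdftools)
library(magick)

AllReports <- list.files(path = "path", pattern = "*.PDF", full.names = TRUE)
ReportText <- list()

for (i in seq_along(AllReports)) {
  n_pages <- pdf_info(AllReports[[i]])$pages
  ReportText[[i]] <- character(n_pages)
  for (j in seq_len(n_pages)) {
    page_img <- image_read_pdf(AllReports[[i]], pages = j)  # page j only
    ReportText[[i]][j] <- image_ocr(page_img)               # OCR via tesseract
  }
}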

Scraping multiple urls from a list of urls. Obtaining the data (text) behind each url, and writing out a text file

I am learning Python (using 3.5). I realize I will probably take a bit of heat for posting my question. Here goes: I have literally reviewed several hundred posts, help docs, etc., all in an attempt to construct the code I need, with no luck thus far. I hope someone can help me. I have a set of URLs, say 18 or more; only 2 are illustrated here:
[1] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html"
[2] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"
I need to scrape all the data (text) behind each URL and write it out to individual text files (one per URL) for future topic-model analysis. Right now, I pull in the URLs through R using rvest. I then take each URL (one at a time, by code) into Python and do the following:
from bs4 import BeautifulSoup
from urllib.request import urlopen

soup = BeautifulSoup(urlopen('http://www.senate.mo.gov/media/14info/chappelle-nadal/Columns/012314-Condensed.html').read())
txt = soup.find('div', {'class' : 'body'})
print(soup.get_text())
#print(soup.prettify()) not much help
#store the info in an object, then write out the object
test = soup.get_text()
#below does write a file
#how to take my BS object and get it in
open_file = open('23Jan2014cplNadal1.txt', 'w')
open_file.write(test)
open_file.close()
The above gets me partially to my target. It leaves me with just a little cleanup of the text, but that's okay. The problem is that it is labor-intensive.
Is there a way to:
1. Write a clean text file (without invisibles, etc.) out from R with all the listed URLs?
2. For Python 3.5: take all the URLs, once they are in a clean single file (one URL per line), and have some iterative process retrieve the text behind each URL and write out a text file for each URL's data (text) to a location on my hard drive?
I have to do this process for approximately 1000 state-level senators. Any help or direction is greatly appreciated.
Edit to original: Thank you so much all. To N. Velasquez: I tried the following:
urls<-c("http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/120114.html",
"http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/110614.htm"
)
for (url in urls) {
  download.file(url, destfile = basename(url), method = "curl", mode = "w", extra = "-k")
}
HTML files are then written out to my working directory. However, is there a way to write out text files instead of HTML files? I've read the download.file documentation and can't seem to figure out a way to produce individual text files. Regarding the suggestion about the for loop: is what I illustrate above what you mean for me to attempt? Thank you!
The answer for 1 is: Sure!
The following code will loop through the list of HTML pages and export individual TXT files, as per your request.
Note that with rvest and html_node() you could get a much more structured dataset, with recurring parts of the HTML stored separately (header, office info, main body, URL, etc.).
library(rvest)
urls <- c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html",
          "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm")
for (i in 1:length(urls))
{
  # note the XPath: id selectors need @id, not #id
  ht <- html_text(html_node(read_html(urls[i]), xpath = '//*[@id="mainContent"]'), trim = TRUE)
  ht <- gsub("[\r\n]", "", ht)
  writeLines(ht, paste("DOC_", i, ".txt", sep = ""))
}
Look for the DOC_1.txt and DOC_2.txt in your working directory.
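When scaling this to the roughly 1000 URLs mentioned in the question, it may help to skip failing pages rather than let one bad URL stop the run; a minimal sketch using the same assumed XPath as above:
# Sketch: wrap each iteration so a single bad URL doesn't abort the loop.
for (i in seq_along(urls)) {
  txt <- tryCatch({
    node <- html_node(read_html(urls[i]), xpath = '//*[@id="mainContent"]')
    gsub("[\r\n]", "", html_text(node, trim = TRUE))
  }, error = function(e) NA_character_)
  if (!is.na(txt)) writeLines(txt, paste0("DOC_", i, ".txt"))
}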

RMarkdown Inline Code Format

I am reading ISL at the moment, which is related to machine learning in R.
I really like how the book is laid out, specifically how the authors reference inline code or libraries, for example library(MASS).
Does anyone know if the same effect can be achieved using R Markdown, i.e. making the MASS keyword above brown when I reference it in a paper? I want to color-code columns in data frames when I talk about them in the R Markdown document. When you knit to an HTML document it provides pretty good formatting, but when I knit to MS Word it seems to just change the font type.
Thanks
I've come up with a solution that I think might address your issue. Essentially, because inline source code gets the same style label as code chunks, any change you make to the SourceCode style will be applied to both, which I don't think is what you want. Instead, there needs to be a way to target just the inline code, which doesn't seem to be possible from within rmarkdown. What I've opted to do instead is take the .docx file that is produced, convert it to a .zip file, and then modify the .xml file inside that holds all the data. It applies a new style to the inline source code text, which can then be modified in your MS Word template. Here is the code:
format_inline_code = function(fpath) {
  if (!tools::file_ext(fpath) == "docx") stop("File must be a .docx file...")
  cur_dir = getwd()
  .dir = dirname(fpath)
  setwd(.dir)
  out = gsub("docx$", "zip", fpath)
  # Convert to zip file
  file.rename(fpath, out)
  # Extract files
  unzip(out, exdir=".")
  # Read in document.xml
  xml = readr::read_lines("word/document.xml")
  # Replace styling
  # VerbatimChar didn't appear to be the style that was applied in Word, nor was
  # it present to be styled. VerbatimStringTok was, though.
  xml = sapply(xml, function(line) gsub("VerbatimChar", "VerbatimStringTok", line))
  # Save document.xml
  readr::write_lines(xml, "word/document.xml")
  # Zip files
  .files = c("_rels", "docProps", "word", "[Content_Types].xml")
  zip(zipfile=out, files=.files)
  # Convert back to docx
  file.rename(out, fpath)
  # Remove the folders extracted from the zip
  sapply(.files, unlink, recursive=TRUE)
  setwd(cur_dir)
}
The style that you'll want to modify in your MS Word template is VerbatimStringTok. Hope that helps!
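Usage would then be a single call on the knitted file (the file name here is just a placeholder):
# Hypothetical file name; run this after knitting the Rmd to Word.
format_inline_code("report.docx")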

Download URL links using R

I am new to R and would like to seek some advice.
I am trying to download multiple URL links (PDF format, not HTML) and save them as PDF files using R.
The links I have are character strings (taken from the HTML code of the website).
I tried the download.file() function, but it requires a specific URL link (written in the R script) and therefore downloads only one file per call. However, I have many URL links and would like help doing this.
Thank you.
I believe what you are trying to do is download a list of URLs; you could try something like this approach:
Store all the links in a vector using c(), e.g.:
urls <- c("http://link1", "http://link2", "http://link3")
Iterate through the vector and download each file:
for (url in urls) {
  download.file(url, destfile = basename(url))
}
If you're using Linux/Mac and https you may need to specify method and extra attributes for download.file:
download.file(url, destfile = basename(url), method="curl", extra="-k")
If you want, you can test my proof of concept here: https://gist.github.com/erickthered/7664ec514b0e820a64c8
Hope it helps!
URLs:
url = c('https://cran.r-project.org/doc/manuals/r-release/R-data.pdf',
        'https://cran.r-project.org/doc/manuals/r-release/R-exts.pdf',
        'http://kenbenoit.net/pdfs/text_analysis_in_R.pdf')
Designated names:
names = c('manual1',
          'manual2',
          'manual3')
Iterate through the vectors and download each file with its corresponding name:
for (i in 1:length(url)){
  download.file(url[i], destfile = names[i], mode = 'wb')
}
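If you would rather keep the original file names from the URLs, basename() can supply the destination name instead; a small variation on the same loop:
# Alternative: name each download after the file name in its URL.
for (i in 1:length(url)){
  download.file(url[i], destfile = basename(url[i]), mode = 'wb')
}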
