I am trying to scrape a webpage whose data comes in irregular blocks that are easy to pick out by eye. Take Wikipedia as an example. If I scrape the paragraph text from the article linked below, I end up with 33 entries. If I instead grab just the headers, I end up with only 7 (see code below). This is not surprising, since some sections have multiple paragraphs while others have only one or none.
My question is: how do I associate my headers with my texts? If every header had the same number of paragraphs, or some fixed multiple, this would be trivial.
library(rvest)
wiki <- read_html("https://en.wikipedia.org/wiki/Web_scraping")
wikitext <- wiki %>%
html_nodes('p+ ul li , p') %>%
html_text(trim=TRUE)
wikiheading <- wiki %>%
html_nodes('.mw-headline') %>%
html_text(trim=TRUE)
This will give you a list called content whose elements are named according to the headings and contain the corresponding text.
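library(rvest) # Originally written for rvest 0.2.0.9 (not on CRAN at the time)
library(xml2)  # xml_children(), xml_name() and xml_text() come from xml2
wiki <- read_html("https://en.wikipedia.org/wiki/Web_scraping") # read_html() supersedes the deprecated html()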
# This node set contains the headings and text
wikicontent <- wiki %>%
html_nodes("div[id='mw-content-text']") %>%
xml_children()
# Locates the positions of the headings
headings <- sapply(wikicontent,xml_name)
headings <- c(grep("h2",headings),length(headings)-1)
# Loop through the headings keeping the stuff in-between them as content
content <- list()
for (i in 1:(length(headings)-1)) {
foo <- wikicontent[headings[i]:(headings[i+1]-1)]
foo.title <- xml_text(foo[[1]])
foo.content <- xml_text(foo[-c(1)])
content[[i]] <- foo.content
names(content)[i] <- foo.title
}
The key was spotting the mw-content-text node which has all the things you want as children.
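The same grouping idea can be sketched in Python for comparison. This is only an illustration under assumptions: the children of the content node are taken to be already flattened into a list of (tag, text) pairs in document order, and group_by_heading is a hypothetical helper, not part of rvest or any other library.

```python
def group_by_heading(nodes, heading_tag="h2"):
    """Group sibling nodes under the most recent heading.

    `nodes` is a list of (tag, text) pairs in document order, an assumed
    stand-in for the children of the content node.
    """
    sections = {}
    current = None
    for tag, text in nodes:
        if tag == heading_tag:
            # A heading starts a new, initially empty section.
            current = text
            sections[current] = []
        elif current is not None:
            # Everything after a heading belongs to that heading.
            sections[current].append(text)
        # Nodes that appear before the first heading are skipped.
    return sections


# Made-up sample data for illustration only.
nodes = [
    ("p", "intro"),
    ("h2", "History"),
    ("p", "a"),
    ("p", "b"),
    ("h2", "Techniques"),
    ("h2", "Legal"),
    ("ul", "c"),
]
result = group_by_heading(nodes)
```

Sections with no text simply get an empty list, which is what makes the unequal paragraph counts harmless. Note that two headings with the same text would overwrite each other in this sketch.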
I want to import multiple PDF files into R, but each page has 4 columns, a header/footer line, and a table of contents.
For text mining I want to remove these from my file or character vector.
Right now I am using two functions to read in the files. The first is pdf_text, which keeps the page structure but can't deal with the 4 columns. The second is extract_text, which on its own doesn't keep the pages but can handle the column structure (and copes reasonably well with tables).
But neither of them is able to remove the table of contents (as far as I have tried).
My example is not exactly minimal, because otherwise I ran into problems with the data structures. Here is working code:
################ relevant code ##############
library(pdftools)
library(tidyverse)
library(tabulizer)
files_name <- "Nachhaltigkeit 2021.pdf"
file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf", "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
reports_list <- lapply(file_url, pdf_text)
createTibble <- function(){
tibble_together <- NULL
#for all files
for(i in 1:length(files_name)){
page_nr <- length(reports_list[[i]])
tib <- tibble(report = rep(files_name[i], page_nr), page = 1:page_nr, text = gsub("\r\n", " ",
extract_text(files_name[[i]], pages = 1:page_nr)))
tibble_together <- rbind(tibble_together, tib)
}
return(tibble_together)
}
reports_df <- createTibble()
############ code for problem visualization ###############
library(tidytext) # provides unnest_tokens()
reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
#e.g this part contains the table of contents which is not intended
(reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]
Thanks in advance for your help.
PS: This is my first question, so if you need anything else, let me know.
I know that the function createTibble probably isn't optimal, but that's not my primary concern.
I am trying to scrape some data from Yahoo Finance. Usually I have no problem doing this. Today, however, I have run into a problem trying to pull a certain container. What might be the reason this is giving me such a difficult time?
I have tried many combinations of XPaths. SelectorGadget for some reason cannot pick up the XPath. I have posted some attempts and the URL below.
The green area is what I am trying to bring into my console.
library(tidyverse)
library(rvest)
library(httr)
read_html("https://ca.finance.yahoo.com/quote/SPY/holdings?p=SPY") %>% html_nodes(xpath = '//*[@id="Col1-0-Holdings-Proxy"]/section/div[1]/div[1]')
{xml_nodeset (0)}
#When I search for all tables using the following function.
read_html("https://finance.yahoo.com/quote/xlk/holdings?p=xlk") %>% html_nodes("table") %>% .[1] %>% html_table(fill = T)
I get the table at the bottom of the page. Trying different numbers in the [] leads to errors.
What am I doing wrong? This seems like such an easy scrape. Thanks a bunch for your help.
Your data doesn't reside within an actual html table.
You could currently use the following CSS selectors, though a lot of the page looks dynamic and I suspect attributes and classes will change in future. I tried to keep them fairly generic to compensate, but you should definitely try to make them even more generic if possible.
I use CSS selectors throughout for the flexibility and specificity gained. The [] denote attribute selectors, the . denotes a class selector, and *= is the contains operator, specifying that the left-hand attribute's value contains the right-hand string. For example, [class*=screenerBorderGray] means the class attribute contains the string screenerBorderGray.
The " ", ">" and "+" between selectors are called combinators and are used to specify relationships between nodes matched by consecutive parts of the selector sequence.
I generate a left column list of nodes and a right column list of nodes (ignoring the chart col in between). I then join these into a final dataframe.
R
library(rvest)
library(magrittr)
pg <- read_html('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')
lhs <- pg %>%
html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] > span:nth-child(1)') %>%
html_text()
rhs <- pg %>%
html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] span + span:last-child') %>%
html_text()
df <- data.frame(lhs,rhs) %>% set_names(., c('Title','value'))
df <- df[-c(3),]
rownames(df) <- NULL
print(df)
Py
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
r = requests.get('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')
soup = bs(r.content, 'lxml')
# Raw strings keep the CSS backslash escapes intact and avoid invalid-escape warnings
lhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) > span:nth-child(1)')]
rhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) span + span:last-child')]
df = pd.DataFrame(zip(lhs, rhs), columns = ['Title','Value'])
df = df.drop([2]).reset_index(drop = True)
print(df)
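The backslashes in the selectors above are there because class names such as Fl(start) contain characters that are not valid in a bare CSS identifier. As a minimal sketch, a small helper can build such selectors from the raw class name. css_class_selector is a hypothetical name introduced here, and the escaping rule is deliberately simplified (it does not handle class names that start with a digit, for instance).

```python
import re


def css_class_selector(class_name):
    # Prefix every character that is not a letter, digit, hyphen or
    # underscore with a backslash, then prepend the class-selector dot.
    return "." + re.sub(r"([^A-Za-z0-9_-])", r"\\\1", class_name)


print(css_class_selector("Fl(start)"))
print(css_class_selector("Bdbc($screenerBorderGray)"))
```

The two printed selectors match the escaped class selectors used in the Python code above.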
References:
Row re-numbering @thelatemail
I'm trying to scrape a website that has several different pieces of information I want, all stored in paragraphs. I got the scraping itself to work perfectly. However, I don't understand how to break the text up and create a dataframe.
Website :Website I want Scraped
Code:
library(rvest)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
#replace multiple whitespaces with single space
p_nodes<- gsub('\\s+',' ',p_nodes)
#trim spaces from ends of elements
p_nodes <- trimws(p_nodes)
#drop blank elements
p_nodes <- p_nodes[p_nodes != '']
How I want the dataframe to look:
I'm not sure if this is even possible. I tried to extract each piece of information separately and then make the dataframe like that but it doesn't work since most of the info is stored in the p tag. I would appreciate any guidance. Thanks!
Proof-of-concept (based on what I wrote in the comment):
Code
lapply(c('data.table', 'httr', 'rvest'), library, character.only = TRUE)
tags <- 'tr:nth-child(6) td , tr~ tr+ tr p , td+ p'
burl <- 'https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml'
url_text <- read_html(burl)
chunks <- url_text %>% html_nodes(tags) %>% html_text()
coordFunc <- function(chunk){
pattern_lon <- 'Longitude:.*(-[[:digit:]]{1,2}\\.[[:digit:]]{0,15})' # escaped dot matches a literal decimal point
ret <- regmatches(x = chunk, m = regexec(pattern = pattern_lon, text = chunk))
return(ret[[1]][2])
}
longitudes <- as.numeric(unlist(lapply(chunks, coordFunc)))
Output
# using 'cat' to make the output easier to read
> cat(chunks[14])
Mt. Laurel DOT
Rt. 38, East
1/4 mile East of Rt. 295
Mt. Laurel Open 24 Hrs
Unleaded / Diesel
856-235-3096Latitude: 39.96744662Longitude: -74.88930386
> longitudes[14]
[1] -74.8893
If you do not coerce longitudes to be numeric, you get:
longitudes <- (unlist(lapply(chunks, coordFunc)))
> longitudes[14]
[1] "-74.88930386"
I chose the longitude as a proof of concept, but you can modify the function to extract all relevant bits in a single call. To find the right tag you can use the SelectorGadget extension (works well in Chrome for me). Alternatively, most browsers let you 'inspect element' to get the HTML tag. The function could return the extracted values in a data.table, and the per-chunk tables can then be combined into one using rbindlist.
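To illustrate extracting more than one value in a single call, here is a minimal Python sketch that pulls both coordinates with one regular expression. It assumes the coordinates always appear as "Latitude: ..." followed by "Longitude: ...", as in the sample chunk shown above. extract_coords is a hypothetical helper name.

```python
import re

# One pattern with two capture groups: latitude first, then longitude.
COORD_RE = re.compile(
    r"Latitude:\s*(-?\d{1,3}\.\d+)\s*Longitude:\s*(-?\d{1,3}\.\d+)"
)


def extract_coords(chunk):
    """Return both coordinates as floats, or None if the chunk has none."""
    match = COORD_RE.search(chunk)
    if match is None:
        return None
    return {
        "latitude": float(match.group(1)),
        "longitude": float(match.group(2)),
    }


# Last line of the sample chunk shown in the output above.
sample = "856-235-3096Latitude: 39.96744662Longitude: -74.88930386"
coords = extract_coords(sample)
```

Returning None for chunks without coordinates makes it easy to drop them before combining the per-chunk results.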
You could even advance pages programmatically to scrape the entire website. Be sure to check the usage policy first (scraping websites is often frowned upon or restricted).
Edit
The text is not structured the same way throughout the webpage, so you'll need to spend more time examining what exceptions can occur.
Here's a new function that splits each chunk into separate lines; you can then use additional regular expressions to get what you want.
newfunc <- function(chunk){
# Each chunk is a couple of lines. First, we split at '\r\n' using strsplit
# the output is a list so we use 'unlist' to get a vector
# then use 'trimws' to remove whitespace around it - try out each of these functions
# separately to understand what is going on. The final output here is a vector.
txt <- trimws(unlist(strsplit(chunk, '\r\n')))
return(txt)
}
This returns the 'text' contained in each chunk as a vector of separate lines. Taking a look at the number of lines in the first 20 chunks, you can see it is not the same:
> unlist(lapply(chunks[1:20], function(z) length(newfunc(z))))
[1] 5 6 5 7 5 5 5 5 5 4 1 6 6 6 5 1 1 1 5 6
A good way to resolve this would be to put in a conditional statement based on the number of lines of text in each chunk, e.g. in newfunc you could add:
if(length(txt) == 1){
return(NULL)
}
This handles the entries that don't have any text in them. Since this is a proof of concept I haven't checked all entries, but there's some simple logic:
The first line is typically the name
The coordinates are in the last line
The fuel can be either unleaded or diesel. You can grep on these two strings to see what each depot offers, e.g. grepl('diesel', newfunc(chunks[12]), ignore.case = TRUE)
Another approach would be to use a different set of HTML tags, e.g. all coordinates and opening hours are in boldface and have the tag strong. You can extract those separately and then use regular expressions to get what you want.
You could search for 24(Hrs|Hours) to first extract all sites that are open 24 hours and then use selective regex on the remainder to get their operating times.
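The rules above can be combined into one small parser. The following Python sketch is an illustration only: parse_depot is a hypothetical helper, the field names are made up, and the 24-hour pattern allows optional whitespace between "24" and "Hrs"/"Hours" because the sample chunk reads "24 Hrs".

```python
import re


def parse_depot(chunk):
    """Apply the simple per-chunk rules: name, fuels and 24-hour flag."""
    # Split on line breaks and drop empty lines.
    lines = [ln.strip() for ln in re.split(r"\r?\n", chunk) if ln.strip()]
    if len(lines) <= 1:
        # Single-line (or empty) chunks carry no depot information.
        return None
    text = " ".join(lines)
    return {
        "name": lines[0],  # the first line is typically the name
        "unleaded": re.search(r"unleaded", text, re.IGNORECASE) is not None,
        "diesel": re.search(r"diesel", text, re.IGNORECASE) is not None,
        "open_24h": re.search(r"24\s*(Hrs|Hours)", text, re.IGNORECASE) is not None,
    }


# The sample chunk from the output above, with '\r\n' line breaks.
sample = (
    "Mt. Laurel DOT\r\nRt. 38, East\r\n1/4 mile East of Rt. 295\r\n"
    "Mt. Laurel Open 24 Hrs\r\nUnleaded / Diesel\r\n"
    "856-235-3096Latitude: 39.96744662Longitude: -74.88930386"
)
depot = parse_depot(sample)
```

Entries that are not open 24 hours would still need a second, more selective pattern for their operating times, as described above.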
There is no simple answer with most web scraping: you have to find patterns and then apply some logic based on them. Only on the most structured websites will you find something that works for the entire page/range.
You can use tidyverse package (stringr, tibble, purrr)
library(rvest)
library(tidyverse)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
# Split on new line
l = p_nodes %>% stringr::str_split(pattern = "\r\n")
var1 = sapply(l, `[`, 1) # replace var by the name you want
var2 = sapply(l, `[`, 2)
var3 = sapply(l, `[`, 3)
var4 = sapply(l, `[`, 4)
var5 = sapply(l, `[`, 5)
t = tibble(var1,var2,var3,var4,var5) # make tibble
t = t %>% filter(!is.na(var2)) # delete useless lines
purrr::map_dfr(t,trimws) # delete blanks
I am trying to parse a number of documents using the excellent xml2 R library. As an example, consider the following XML file:
pg <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")
It contains a number of <speech> tags which are separated by, though not nested within, a number of <minor-heading> and <major-heading> tags. I would like to process this document into a data.frame with the following structure:
major_heading_id speech_text
heading_id_1 text1
heading_id_1 text2
heading_id_2 text3
heading_id_2 text4
Unfortunately, because the tags are not nested, I cannot figure out how to do this! I have code that successfully recovers the relevant information (see below), but matching the speech tags to their respective major-headings is beyond me.
My intuition is that it would probably be best to split the XML document at the heading tags, and then process each as an individual document, but I couldn't find a function in the xml2 package that would let me do this!
Any help would be great.
Where I have got to so far:
speech_recs <- xml_find_all(pg, "//speech")
speech_text <- trimws(xml_text(speech_recs))
heading_recs <- xml_find_all(pg, "//major-heading")
major_heading_id <- xml_attr(heading_recs, "id")
You can do this as follows:
require(xml2)
require(tidyverse)
doc <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/westminhall/westminster2001-01-24a.xml")
# Get the headings
heading_recs <- xml_find_all(doc, "//major-heading")
# Build one XPath per heading: the n-th path selects the speech nodes
# that have exactly n major-heading siblings before them.
path <- sprintf("//speech[count(preceding-sibling::major-heading)=%d]",
seq_along(heading_recs))
# Get the text of the speech nodes
map(path, ~xml_text(xml_find_all(doc, .x))) %>%
# Combine it with the id of the headings
map2_df(xml_attr(heading_recs, "id"),
~tibble(major_heading_id = .y, speech_text = .x))
This results in a tibble with one row per speech, with the columns major_heading_id and speech_text.
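For comparison, the same association can be sketched in Python with a single pass over the siblings instead of one XPath per heading (the standard library's ElementTree does not support the preceding-sibling axis, so this is a different technique with the same outcome). It assumes the headings and speeches are direct children of the root element, and the sample XML below is made up for illustration.

```python
import xml.etree.ElementTree as ET


def speeches_by_heading(root):
    """Pair each speech with the id of the last major-heading before it."""
    rows = []
    current_id = None
    for child in root:  # children are visited in document order
        if child.tag == "major-heading":
            current_id = child.get("id")
        elif child.tag == "speech" and current_id is not None:
            text = "".join(child.itertext()).strip()
            rows.append((current_id, text))
        # minor-heading and other tags are ignored
    return rows


# Made-up miniature document with the same flat (non-nested) layout.
sample_xml = """<root>
  <major-heading id="h1">First</major-heading>
  <speech id="s1"><p>text1</p></speech>
  <minor-heading id="m1">Minor</minor-heading>
  <speech id="s2"><p>text2</p></speech>
  <major-heading id="h2">Second</major-heading>
  <speech id="s3"><p>text3</p></speech>
</root>"""
rows = speeches_by_heading(ET.fromstring(sample_xml))
```

As with the XPath version, speeches that appear before the first major-heading are left out.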
I have extracted the reviews of a movie on IMDB, but the separate reviews have a lot of blank lines between them. The result is unstructured and very difficult to read.
I have to apply certain functions to each review separately and then store them together as one text for further text mining.
How can I clean them up and access them one at a time, and how can I combine them and store the result?
Here is my code for scraping the reviews
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
html_nodes("p") %>%
html_text()
I would suggest that you are more specific when you navigate the DOM. For instance, this code will only deliver reviews and none of the other information that you are presumably not looking to scrape:
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/tt", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>% html_nodes("#pagecontent") %>%
html_nodes("div+ p") %>%
html_text()
And here is a way to remove line breaks, applying a function to each review, and merging all reviews into one paragraph (also see this post on concatenating vector elements and this post on replacing line breaks):
ex_review <- gsub("[\r\n]", " ", ex_review) # replace line breaks
sapply(ex_review, function(x){}) # apply function to each review
ex_review <- paste(ex_review, collapse = "") # concatenate reviews into one paragraph
write(ex_review, "test.txt")
I think you were also missing a "tt" in the URL.
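The three cleaning steps (normalise line breaks, apply a function to each review, merge everything) can also be sketched in Python. This is a variation rather than a translation of the R code above: it collapses every run of whitespace into a single space and joins the reviews with a space. The sample reviews and the helper names clean_review and word_count are made up for illustration.

```python
import re


def clean_review(text):
    # Collapse runs of whitespace (including blank lines) into one space.
    return re.sub(r"\s+", " ", text).strip()


def word_count(text):
    # Stand-in for whatever per-review function is needed.
    return len(text.split())


# Made-up sample reviews with stray line breaks.
reviews = ["Great film.\n\n\nLoved it.  ", "\r\nToo long,\r\nbut fun.\n"]

cleaned = [clean_review(r) for r in reviews]  # one cleaned string per review
counts = [word_count(r) for r in cleaned]     # apply a function to each review
combined = " ".join(cleaned)                  # merge all reviews into one text
```

Keeping the cleaned list around alongside the combined string means each review can still be accessed individually by index.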