I am scraping a "sloppy" node that includes multiple elements of the same data. The code below scrapes city-dates from a page that lists record albums. I only want the first city-date instance for each album, but I'm not sure how to write the code so that only the first city-date instance is returned.
library(rvest);library(stringi);library(stringr)
citydate <- read_html("https://www.jazzdisco.org/atlantic-records/catalog-1200-series/") %>%
html_nodes(".date") %>% html_text()
There is not much hierarchy in the page you are scraping, which means you will need to rely on the surrounding structure. It seems that each date is preceded by an h3 header. Knowing this, we can build an index to grab the values you are after.
First grab all of the h3 and .date nodes:
nodes <- read_html("https://www.jazzdisco.org/atlantic-records/catalog-1200-series/") %>%
html_nodes("h3,.date")
Now for the index. We want to find where an h3 is immediately followed by a .date. I have used html_name() and paste() to test for that structure, but you can build this however you wish.
index <- c()
for (n in 1:(length(nodes) - 1)) {
if (paste(html_name(nodes[n]), html_name(nodes[n+1]), sep = "_") == 'h3_div') {
index <- c(index, n)
}
}
Now, using the index, we can get the .date nodes. The test matched on the h3 node, so we add 1 to get the .date:
citydate <- nodes[index + 1] %>% html_text()
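As an aside, the same index can also be built without the explicit loop by working on the vector of tag names; this is just a compact sketch of the same logic:
# vectorized version of the index: positions where an h3 is
# immediately followed by a .date node (which is a div)
node_names <- html_name(nodes)
index <- which(node_names[-length(node_names)] == "h3" & node_names[-1] == "div")
citydate <- nodes[index + 1] %>% html_text()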
I have a vector with two character elements that I would like to loop over to complete a URL, web-scrape a specific node from each resulting page, and obtain two lists, each with three sub-lists.
For a single URL I do the following, which gives the desired output structure:
library(rvest)
H1 <- "https://www.tripadvisor.com/Hotel_Review-g1063979-d1756170-Reviews-La_Trobada_Hotel_Boutique-Ripoll_Province_of_Girona_Catalonia.html"
page0_url <- read_html(H1)
groupNodes <- html_nodes(page0_url, "._1nAmDotd") # get all nodes under the same heading
outputlist <- lapply(groupNodes, function(node) {
  node %>% html_nodes("div") %>% html_text()
})
Hotel_Amenities <- outputlist[[1]][-9]
Room_Features <- outputlist[[2]][-9]
Room_Types <- outputlist[[3]]
Hotel_Amenities <- as.matrix(Hotel_Amenities)
Room_Features <- as.matrix(Room_Features)
print (Hotel_Amenities)
print (Room_Features)
print (Room_Types)
Attempting to run the same function for two URLs sequentially
library(rvest)
country <- data.frame("d1447619-Reviews-Solana_del_Ter-Ripoll",
                      "d2219428-Reviews-La_Sequia_Molinar-Campdevanol")
for (i in country)
  fun <- function(node) {
    html_nodes(read_html(paste("https://www.tripadvisor.co.uk/Hotel_Review-g1063979-",
                               i, "_Province_of_Girona_Catalonia.html", sep = "")), "._1nAmDotd") %>%
      html_nodes("div") %>% html_text()
  }
lapply(country, fun)
# Produces two lists however the lists are the same list twice
I have tried more combinations. All either yield an error or, at most, return the result of only one of the two URLs, and only as a plain text string. Any help will be welcomed.
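For what it's worth, one way to get a separate result per URL is to keep the two fragments in a plain character vector and build the full address inside the function that lapply() calls, rather than redefining the function inside a for loop (where i always ends up holding the last loop value). A minimal sketch along the lines of the single-URL code above; scrape_hotel is just an illustrative name, and the class ._1nAmDotd may change whenever TripAdvisor updates its markup:
library(rvest)

country <- c("d1447619-Reviews-Solana_del_Ter-Ripoll",
             "d2219428-Reviews-La_Sequia_Molinar-Campdevanol")

scrape_hotel <- function(fragment) {
  # build the full URL from the fragment, then pull the same nodes as before
  url <- paste0("https://www.tripadvisor.co.uk/Hotel_Review-g1063979-",
                fragment, "_Province_of_Girona_Catalonia.html")
  groupNodes <- html_nodes(read_html(url), "._1nAmDotd")
  lapply(groupNodes, function(node) node %>% html_nodes("div") %>% html_text())
}

results <- lapply(country, scrape_hotel)  # one list (with its sub-lists) per URL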
I want to extract all available links and dates of the available documents ("Referentenentwurf", "Kabinett", "Bundesrat" and "Inkrafttreten") for each legislative process (each of the gray boxes) from the page. My data set should have the following structure:
Each legislative process is represented by one row, and the information about the related documents is in the columns.
Here is one example of the HTML structure of the elements containing the legislative processes (shown for the seventh one):
Extracting the dates of each document per legislative process is not a problem (it is simply done by checking whether the "text()" element includes, e.g., "Kabinett").
But extracting the right URL is much more difficult, because the "text()" elements (indicating the document type) are not directly linked with the "<a>" elements (containing the URL).
I'm trying to find a solution for the seventh legislative process ("Zwanzigste Verordnung zur Änderung von Anlagen des Betäubungsmittelgesetzes") in order to apply this solution to every legislative process.
This is my current work status:
if(!require("rvest")) install.packages("rvest")
library(rvest) #for html_attr & read_html
if(!require("dplyr")) install.packages("dplyr")
library(dplyr) # for %>%
if(!require("stringr")) install.packages("stringr")
library(stringr) # for str_detect()
if(!require("magrittr")) install.packages("magrittr")
library(magrittr) # for extract() [within pipes]
page <- read_html("https://www.bundesgesundheitsministerium.de/service/gesetze-und-verordnungen.html")
#Gesetz.Link -> here "Inkrafttreten"
#Gesetz.Link <- lapply(1:72, function(x){
x <- 7 # for demonstration reasons
node.with.data <- html_nodes(page, css = paste0("#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div:nth-child(",x*2,") > div > div > div.panel-body > p")) %>%
extract(
str_detect(html_text(html_nodes(page, css = paste0("#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div:nth-child(",x*2,") > div > div > div.panel-body > p"))),
"Inkrafttreten")
)
link <- node.with.data %>%
html_children() %>%
extract(
str_detect(html_text(html_nodes(node.with.data, xpath = paste0("text()"))),
"Inkrafttreten")
) %>%
html_attr("href")
ifelse(length(node.with.data) == 0, NA, link) # set link to NA if there is no link for "Inkrafttreten"
#}) %>%
# unlist()
(I have commented out the application to the entire page so that the solution can be demonstrated on the seventh element.)
The problem is that there can be several URLs linked to each document (here both "Download" and "Stellungnahmen" are linked for "Referentenentwurf"). This leads to an error in my syntax.
Is there any way to extract the nth element that comes after another element? Then one could check whether the "text()" element is "Referentenentwurf" and extract the first element behind it
-> "<a href="/fileadmin/Dateien/3_Downloads/Gesetze_und_Verordnungen/GuV/B/2020-03-04_RefE_20-BtMAEndV.pdf" ...>".
I would be very grateful for tips on how to solve this problem!
I took the liberty of changing a few things in your code and tried to get you where you want:
My stab at this is to go through the list of Verordnungen/Gesetze/etc., find the div.panel-body > p as you do, and within that take the first link that refers to a downloadable document by searching, via XPath, for an href containing "/fileadmin/Dateien".
Looks like this:
library(purrr)
library(xml2)
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
map(~{
.x %>%
xml_find_first('./div/div/div[contains(@class,"panel-body")]/p//a[contains(@href,"/fileadmin/Dateien")]') %>%
xml_attr('href')
})
//update:
If the above assumption doesn't work for you and you really just want to check for "first a-tag after 'Referentenentwurf' in the p-element", the following does get you that. However, I couldn't make it as "elegant" and just used a regex :)
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
map(~{
.x %>%
xml_find_first('./div/div/div[contains(@class,"panel-body")]/p') %>%
as.character() %>%
str_extract_all('(?<=Referentenentwurf.{0,10000})(?<=<a href=")[^"]*(?=")') %>%
unlist() %>%
first()
})
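For completeness, the "first a-tag after 'Referentenentwurf'" check can also be expressed purely in XPath via the following-sibling axis, assuming the <a> tags sit next to the text nodes inside the same <p> (which is what the structure described above suggests); a sketch:
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
  map(~{
    .x %>%
      # first <a> following a text node that contains "Referentenentwurf"
      xml_find_first('./div/div/div[contains(@class,"panel-body")]/p/text()[contains(.,"Referentenentwurf")]/following-sibling::a[1]') %>%
      xml_attr('href')
  })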
I have seen other posts which show how to extract data from multiple webpages.
But the problem is that on my website, when I scroll down to see how many pages the data is divided into, the page automatically loads the next batch of data, which makes it impossible to identify the number of pages. I don't know HTML and JavaScript well enough to easily identify the attribute that triggers that loading, so I have found a way to work out the number of pages instead.
When loaded in a browser, the website shows the number of records present. I can take that number and divide it by 30 (the number of records shown per page); e.g. if there are 90 records, then 90/30 = 3 pages.
Here is the code to get the number of records found on that page:
library(stringr)  # for word()
active_name_data1 <- html_nodes(webpage, '.active')  # webpage is read_html(url), as in the second approach below
active1 <- html_text(active_name_data1)
as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))
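From that record count, the page count is just the division described above; a small sketch (total_records and n_pages are illustrative names, and the 30-records-per-page figure is taken from the question):
total_records <- as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))
n_pages <- ceiling(total_records / 30)  # e.g. 90 records / 30 per page = 3 pages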
Another approach is to get the attribute for the number of pages, i.e.
url='http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage,'a.act')
active <- html_text(active_data_html)
Here, active gives me the page numbers, i.e. "1" " 2" " 3" " 4".
So I'm unable to work out how to get beyond the active page's data and iterate over the other pages so as to get the entire data set.
Here is what I have tried (uuu_df2 is the data frame with the multiple links for which I want to crawl data):
library(rvest)
library(xml2)   # xml_find_first / xml_find_all
library(plyr)   # llply, ldply
library(dplyr)  # bind_rows

uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'))
urlList <- llply(uuu_df2[,1], function(url){
this_pg <- read_html(url)
results_count <- this_pg %>%
xml_find_first(".//span[#id='resultCount']") %>%
xml_text() %>%
as.integer()
if(!is.na(results_count) & (results_count > 0)){
cards <- this_pg %>%
xml_find_all('//div[@class="SRCard"]')
df <- ldply(cards, .fun=function(x){
y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
return(y)
})
} else {
df <- NULL
}
return(df)
}, .progress = 'text')
names(urlList) <- uuu_df2[,1]
a <- bind_rows(urlList)
But this code just gives me the data from the active page and does not iterate through the other pages of the given link.
P.S.: If a link doesn't have any records, the code skips that link and moves on to the next link in the list.
Any suggestions on what changes should be made to the code would be helpful. Thanks in advance.
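Since the extra results are loaded as you scroll, plain rvest only ever sees the first batch; the reliable fix is to find, in the browser's network tab, the request the page fires while scrolling and call that URL directly. If the further pages do turn out to be reachable through a URL parameter, the loop could look roughly like the sketch below. Note that the "&page=" parameter, scrape_all_pages() and per_page are assumptions/illustrative names, not verified parts of the magicbricks URL scheme:
library(rvest)
library(xml2)
library(plyr)

scrape_all_pages <- function(url, per_page = 30) {
  # read the first page and work out how many pages there should be
  first_pg <- read_html(url)
  results_count <- first_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()
  if (is.na(results_count) || results_count == 0) return(NULL)
  n_pages <- ceiling(results_count / per_page)
  ldply(seq_len(n_pages), function(p) {
    # "&page=" is an assumed parameter name -- replace it with whatever
    # parameter the site actually uses when it loads more results on scroll
    pg <- read_html(paste0(url, "&page=", p))
    cards <- xml_find_all(pg, '//div[@class="SRCard"]')
    ldply(cards, function(x) {
      data.frame(agent    = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                 postedOn = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                 locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                 society  = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
    })
  })
}

all_data <- ldply(as.character(uuu_df2[, 1]), scrape_all_pages)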
I've been trying to scrape a list of companies off of the site's "Company list" pages (e.g. 401.html). I can scrape the single table off of one such page with this code:
library(rvest)
fileurl = read_html("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/1")
content = fileurl %>%
  html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
  html_table()
contentframe = data.frame(content)
View(contentframe)
However, I need all of the data going back from 2005 to 1955, as well as the full list of companies 1 through 500, whereas this list only shows 100 companies and a single year at a time. I've recognized that the only parts of the URL that change are the YEAR after ".../fortune500_archive/full/" and the trailing 1, 101, 201, 301, or 401 (one per range of 100 companies shown).
I also understand that I have to create a loop that will collect this data for me automatically, as opposed to manually replacing the URL after saving each table. I've tried a few variations of sapply() from reading other posts and watching videos, but none of them work for me and I'm lost.
A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.
getData <- function(year, start) {
url <- sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html",
year, start)
fileurl <- read_html(url)
content <- fileurl %>%
html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
html_table()
contentframe <- data.frame(content)
}
We can then loop through the years and pages using lapply (as well as do.call(rbind, ...) to rbind all 5 dataframes from each year together). E.g.:
D <- lapply(2000:2005, function(year) {
do.call(rbind, lapply(seq(1, 500, 100), function(start) {
cat(paste("Retrieving", year, ":", start, "\n"))
getData(year, start)
}))
})
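If you then want a single data frame with the year attached to each row, a small follow-up along these lines should work, assuming the table layout is the same across years (fortune_all is just an illustrative name):
# label each year's block and stack everything into one data frame
names(D) <- 2000:2005
fortune_all <- do.call(rbind,
                       Map(function(df, yr) cbind(year = as.integer(yr), df),
                           D, names(D)))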
Here's the code I'm running
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
html(l)
})
Up until this point it seems to work fine, but when I try to extract the text:
html_text(messages)
I get:
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
Trying to extract a specific element:
html_text(messages[1])
Can't do that either...
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
So I try a different way:
html_text(messages[[1]])
This seems to at least get at the data, but is still not successful:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"
How can I extract the text material from each of the elements of my list?
There are two problems with your code. Have a look at the rvest documentation for examples of how to use the package.
1. You cannot just use every function with everything.
html() is for downloading the content of a page
html_node() is for selecting node(s) from the downloaded content of a page
html_text() is for extracting text from a previously selected node
Therefore, to download one of your pages and extract the text of the html-node, use this:
library(rvest)
old-school style:
url <- "https://github.com/rails/rails/pull/100"
url_content <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text
... or this ...
hard to read old-school style:
url_mainnode_text <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text
... or this ...
magrittr-piping style
url_mainnode_text <-
html("https://github.com/rails/rails/pull/100") %>%
html_node("*") %>%
html_text()
url_mainnode_text
2. When using lists you have to apply functions to the list with e.g. lapply()
If you want to kind of batch-process several URLs you can try something like this:
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
get_html_text <- function(url, css_or_xpath = "*"){
  html_text(
    html_node(
      html(url), css_or_xpath
    )
  )
}
lapply(url_list, get_html_text, css_or_xpath="a[class=message]")
You need to use html_nodes() and identify which CSS selectors relate to the data you're interested in. For example, if we want to extract the usernames of the people discussing pull request 200:
rootUri <- "https://github.com/rails/rails/pull/200"
page<-html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()
[1] "jaw6" "jaw6" "josevalim"