Web scraping Wunderground - r

In the last couple of months I have been using some code to get historical data from Wunderground. The code worked, but as of today they have somehow changed the website. Unfortunately, I am not very familiar with HTML. This is the code that worked:
library(rvest)

webpage <- read_html("https://www.wunderground.com/history/monthly/LOWW/date/2018-06?dayend=18&monthend=6&yearend=2018&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=")
tbls <- html_nodes(webpage, "table")      # all <table> nodes on the page
weather.data <- html_table(tbls)[[2]]     # the weather data was in the second table
This code selected the second table on the website. Does anyone have an idea why it no longer works? Have they somehow changed the node structure?
Cheers
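
A quick diagnostic sketch (not from the original post): if the site now builds its tables with JavaScript, the HTML that read_html() receives will contain few or no table nodes, which is easy to check before debugging selectors.

library(rvest)

# Sketch: count the <table> nodes in the HTML actually delivered to R.
# If this is 0, the tables are rendered client-side and read_html() alone
# will never see them; a headless browser (e.g. RSelenium) would be needed.
webpage <- read_html("https://www.wunderground.com/history/monthly/LOWW/date/2018-06?dayend=18&monthend=6&yearend=2018&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=")
length(html_nodes(webpage, "table"))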

Related

R scraping from multiple websites links

I am pretty new to web scraping and I need to scrape the content of newspaper articles from a list of URLs pointing to articles on different websites. I would like to obtain the actual textual content of each document; however, I cannot find a way to automate the scraping procedure across links that belong to different websites.
In my case, the data are stored in "dublin", a data frame whose columns include the source website ("page") and the article URL.
So far, I have managed to scrape articles from the same website together, relying on the same CSS paths (found with SelectorGadget) to retrieve the texts. Here is the code I am using to scrape content from documents on the same webpage, in this case those posted by The Irish Times:
library(xml2)
library(rvest)
library(dplyr)

# keep only the articles published by The Irish Times
dublin <- dublin %>%
  filter(page == "The Irish Times")

# the second column of "dublin" holds the article URLs
link <- pull(dublin, 2)

articles <- list()
for (i in link) {
  page <- read_html(i)
  text <- page %>%
    html_elements(".body-paragraph") %>%   # CSS path found with SelectorGadget
    html_text()
  articles[[i]] <- text
}
articles
It actually works. However, since the webpages vary case by case, I was wondering whether there is any way to automate this procedure across all the elements of the "url" variable.
Here is an example of the links I scraped:
https://www.thesun.ie/news/10035498/dublin-docklands-history-augmented-reality-app/
https://lovindublin.com/lifestyle/dublins-history-comes-to-life-with-new-ar-app-that-lets-you-experience-it-first-hand
https://www.irishtimes.com/ireland/dublin/2023/01/11/phone-app-offering-augmented-reality-walking-tour-of-dublins-docklands-launched/
https://www.dublinlive.ie/whats-on/family-kids-news/new-augmented-reality-app-bring-25949045
https://lovindublin.com/news/campaigners-say-we-need-to-be-ambitious-about-potential-lido-for-georges-dock
Thank you in advance! Hope the material I provided is enough.
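
One possible way to generalise this (a sketch, not from the original post): keep a small lookup table that maps each website in the "page" column to the CSS selector that works for it, and fall back to plain paragraph tags for sites you have not inspected yet. Every selector other than .body-paragraph below is a hypothetical placeholder.

library(rvest)
library(dplyr)

# Hypothetical selector lookup: only .body-paragraph comes from the original
# post; any further entries would need to be found with SelectorGadget.
selectors <- c(
  "The Irish Times" = ".body-paragraph"
)

scrape_article <- function(url, site) {
  sel <- selectors[site]
  if (is.na(sel)) sel <- "p"            # generic fallback for unknown sites
  read_html(url) %>%
    html_elements(sel) %>%
    html_text()
}

# one list element per row of "dublin", regardless of the source website
articles <- mapply(scrape_article, pull(dublin, 2), dublin$page,
                   SIMPLIFY = FALSE)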

Web scraping price with the use of xml

I am trying to scrape the following: 13.486 Kč from: https://www.aofis.cz/informace-pro-klienty/elba-opf/
For some reason, the following code does not find the number. I am rather a newbie at this, so perhaps the string in xml_find_all is wrong. Can anyone please take a look at why?
library(rvest)
library(xml2)

page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- read_html(page)

Price <- read_page %>%
  rvest::html_nodes('page-content') %>%
  xml2::xml_find_all("//strong[contains(@class, 'sg_selected')]") %>%
  rvest::html_text()
Price
Thank you!!
Michael
The HTML code you see in your browser's developer panel (or SelectorGadget) is not the same as the content that is being delivered to your R session. It is actually a JavaScript file which then builds the web page. This is why your rvest call isn't finding the correct HTML node: there are no HTML nodes in the string you are processing!
There are a few different ways to get the information you want, but perhaps the best is to extract the monetary values from the JavaScript code using a regular expression:
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
# fetch the raw response as text; it is the JavaScript, not the rendered HTML
read_page <- httr::content(httr::GET(page), "text")
# extract the first value that looks like "<digits>.<digits> K..."
stringr::str_extract_all(read_page, "\\d+\\.\\d+ K")[[1]][1]
#> [1] "13.486 K"
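
A small check to see this for yourself (a sketch, not part of the original answer): count how many matching nodes are in the document rvest actually receives. If the explanation above holds, the count will be zero.

library(rvest)

raw <- read_html("https://www.aofis.cz/informace-pro-klienty/elba-opf/")
length(html_elements(raw, "strong"))   # expected to be 0 if the page is built client-side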

Using rvest or RSelenium to create automated webscrape of table inside frame

I know there are a great many resources/questions that deal with this subject, but I have been trying for days and can't seem to figure it out. I have web scraped websites before, but this one is causing me problems.
The website: njaqinow.net
What I want scraped: I would like to scrape the table under the "Current Status" -> "Pollutants" tab. I would like this scraped every time the table is updated so I can use the information inside a Shiny app I am creating.
What I have tried: I have tried numerous different approaches but for simplicity I will show my most recent approach:
library("rvest")
url<-"http://www.njaqinow.net"
webpage <- read_html(url)
test<-webpage%>%
html_node("table")%>%
html_table()
My guess is that this is way more complicated than I originally thought, because it seems to me that the table is inside a frame. I am not a JavaScript/HTML pro, so I am not entirely sure. Any help/guidance would be greatly appreciated!
I can contribute a solution with RSelenium. I will show you how to navigate to that table and get its content. For formatting the table content I provide a link to another question, but that won't be in the scope of this answer.
I think you have two challenges: switching into a frame and switching between frames.
Switching into a frame is done with remDr$switchToFrame().
Switching between frames is discussed here: https://github.com/ropensci/RSelenium/issues/155.
In your case:
remDr$switchToFrame("contents")
...
remDr$switchToFrame(NA)
remDr$switchToFrame("contentsi")
Full code would read:
remDr$navigate("http://www.njaqinow.net")
frame1 <- remDr$findElement("xpath", "//frame[#id = 'contents']")
remDr$switchToFrame(frame1)
remDr$findElement("xpath", "//*[text() = 'Current Status']")$clickElement()
remDr$findElement("xpath", "//*[text() = 'POLLUTANTS']")$clickElement()
remDr$switchToFrame(NA)
remDr$switchToFrame("contentsi")
table <- remDr$findElement("xpath", "//table[#id = 'C1WebGrid1']")
table$getElementText()
For formatting a table you could look here:
scraping table with R using RSelenium
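
A minimal sketch of that formatting step (not part of the original answer, and assuming the session is still switched into the "contentsi" frame): hand the frame's page source to rvest and let html_table() build the data frame.

library(rvest)

# Sketch: parse the current frame's HTML and convert the grid to a data frame.
src <- remDr$getPageSource()[[1]]
pollutants <- read_html(src) %>%
  html_node("#C1WebGrid1") %>%
  html_table()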

Scraping a HTML table in R, after changing a Javascript dropdown option

I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is that the website defaults to PS4, while I want the data for Xbox (this is changed in the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that will define the platform in this way, but I haven't been able to find anything about that.
Looking around, it seems that PhantomJS would be the best way to go, but I have no experience using JavaScript and I'm not sure how you would implement performing an action on the page and then scraping the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)

url1 <- "https://www.futbin.com/19/players?page="
pge <- 1

tbl <- paste0(url1, pge) %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()
Thanks in advance for any help.
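
Since the platform switch happens in the page's JavaScript, here is one hedged sketch of the "perform an action, then scrape" idea, using RSelenium rather than PhantomJS. The dropdown selector and the option text used below are hypothetical placeholders and would have to be checked against the live page.

library(RSelenium)
library(rvest)

# Sketch only: remDr is assumed to be a running RSelenium session, and
# "#platform-selector" / "XBOX" are hypothetical placeholders for the real
# dropdown selector and option text on futbin.com.
remDr$navigate("https://www.futbin.com/19/players?page=1")
remDr$findElement("css selector", "#platform-selector")$clickElement()
remDr$findElement("link text", "XBOX")$clickElement()
Sys.sleep(2)                                  # give the table time to reload

tbl <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()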

readHTMLTable function not able to extract the html table

I would like to extract the table (table 4) from the URL "http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02". The catch is that I will have to use RSelenium
Now here is the code I am using:
library(RSelenium)
library(XML)
URL <- "http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02"
remDr$navigate(URL)
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc)
The above code is not able to extract table 4. However, when I do not use RSelenium, as below, I am able to extract the table easily:
download.file(URL, 'quote.html')
doc <- htmlParse('quote.html')
x <- readHTMLTable(doc, which = 5)
Please let me know the solution, as I have been stuck on this part for a month now. I appreciate your suggestions.
I think it works fine. The table you were able to get using download.file can also be obtained with the following code using RSelenium:
readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE, which = 6)
Hope that helps!
I found the solution. In my case, I had to first navigate to the inner frame (boxBg1) before I could extract the outer HTML and then use the readHTMLTable function. It works fine now. I will post again in case I run into a similar issue in the future.
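A minimal sketch of what that might look like (assuming remDr is the running RSelenium session; the frame name boxBg1 comes from the description above, and which = 4 matches the table number mentioned in the question, though the exact index may differ):

# Sketch: switch into the inner frame first, then parse its source.
remDr$switchToFrame("boxBg1")
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc, which = 4)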
I'm struggling with more or less the same issue: I'm trying to come up with a solution that doesn't use htmlParse. For example (after navigating to the page):
table <- remDr$findElements(using = "tag name", value = "table")
You might have to use css or xpath on yours; that is the next step I'm still working on.
I finally got a table downloaded into a nice little data frame. It seems easy when you get it figured out. Using the help page from the XML package:
library(RSelenium)
library(XML)

u <- 'http://www.w3schools.com/html/html_tables.asp'
doc <- htmlParse(u)                         # parse the example page
tableNodes <- getNodeSet(doc, "//table")    # collect all <table> nodes
tb <- readHTMLTable(tableNodes[[1]])        # first table as a data frame
