I am attempting to scrape (dynamic?) content from a webpage using the rvest package. I understand that dynamic content should require the use of tools such as Selenium or PhantomJS.
However, my experimentation leads me to believe I should still be able to find the content I want using only standard web-scraping R packages (rvest, httr, xml2).
For this example I will be using a google maps webpage.
Here is the example url...
https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/
If you follow the hyperlink above, it will take you to an example webpage. The content I want in this example is the two addresses, "920 NC-16, Crumpler, NC 28617" and "2114 NC-16, Newton, NC 28658", shown in the top left corner of the webpage.
Standard techniques using CSS selectors or XPath did not work, which initially made sense, as I thought this content was dynamic.
url<-"https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/"
page<-read_html(url)
# The commands below all return {xml nodeset 0}
html_nodes(page,css=".tactile-searchbox-input")
html_nodes(page,css="#sb_ifc50 > input")
html_nodes(page,xpath='//*[contains(concat( " ", #class, " " ), concat( " ", "tactile-searchbox-input", " " ))]')
The commands above all return {xml_nodeset (0)}, which I thought was a result of this content being generated dynamically. But here is where my confusion lies: if I convert the whole page to text using html_text(), I can find the addresses in the value returned.
x <- html_text(read_html(url))
substring <- substr(x, 33561 - 100, 33561 + 300)
Executing the commands above results in a substring with the following value,
"null,null,null,null,[null,null,null,null,null,null,null,[[[\"920 NC-16, Crumpler, NC 28617\",null,null,null,null,null,null,null,null,null,null,\"Nzm5FTtId895YoaYC4wZqUnMsBJ2rlGI\"]\n,[\"2114 NC-16, Newton, NC 28658\",null,null,null,null,null,null,null,null,null,null,\"RIU-FSdWnM8f-IiOQhDwLoMoaMWYNVGI\"]\n]\n,null,null,0,null,[[null,null,null,null,null,null,null,3]\n,[null,null,null,null,[null,null,null,null,nu"
The substring is very messy but contains the content I need. I've heard that parsing webpages with regex is frowned upon, but I cannot think of any other way of obtaining this content that also avoids the use of dynamic scraping tools.
If anyone has any suggestions for parsing the html returned or can explain why I am unable to find the content using xpath or css selectors, but can find it by simply parsing the raw html text, it would be greatly appreciated.
Thanks for your time.
The reason you can't find the text with XPath or CSS selectors is that the string you have found sits inside a JavaScript array object embedded in the page source. You were right to assume that the text elements you can see on screen are loaded dynamically; they are not where you are reading the strings from.
I don't think there's anything wrong with parsing specific HTML with regex. I would make sure to get the full HTML rather than just the html_text() output, in this case by using the httr package. You can grab the address from the page like this (the pipeline below uses extract() and extract2() from the magrittr package):
library(httr)
library(magrittr)  # provides %>%, extract() and extract2()

GetAddressFromGoogleMaps <- function(url) {
  GET(url) %>%
    content("text") %>%
    strsplit("spotlight") %>%       # split at the "spotlight" marker
    extract2(1) %>%
    extract(-1) %>%                 # drop everything before the first marker
    strsplit("[[]{3}(\")*") %>%     # split at the [[[" that opens the address block
    extract2(1) %>%
    extract(2) %>%
    strsplit("\"") %>%              # the address runs up to the next quote
    extract2(1) %>%
    extract(1)
}
Now:
GetAddressFromGoogleMaps(url)
#[1] "920 NC-16, Crumpler, NC 28617, USA"
This question is not so much about how to scrape a Google search in R (discussed many times before) as about why it does not always work.
I found this code in another question posted here, which I recall working perfectly; it would produce all the links in the search.
But now it does not work. For some reason the node is not there anymore when I pull the data into R, yet when I inspect the HTML in Chrome it is there while I am browsing. The inspector shows the h3 node, but it is not present in what gets downloaded.
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath='//h3/a') %>% html_attr('href')
gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
I get the following return:
character(0)
The google page display of links depends on your location/preferences. So maybe this is what is causing the issue?
It appears that the format switched very recently, maybe today, and that //h3 is no longer used. The code below produces what is intended, with one final extraneous result (a way to drop it is sketched after the code):
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath='//a') %>% html_attr('href')
gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
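If you want to drop that extraneous entry, one option is to keep only the results that do not point back at Google itself. The grepl() pattern below is an assumption about what the stray entry looks like, since that depends on Google's current markup:
library(rvest)

ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath = '//a') %>% html_attr('href')
cleaned <- gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
# Drop anything that still points at a google.com page (assumed to be the
# extraneous result) and keep the rest.
cleaned[!grepl('google\\.com', cleaned)]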
I am trying to webscrape some recipes for my own personal collection. It works great on some sites because the website structure sometimes easily allows for scraping, but some are harder. This one I have no idea how to deal with:
https://www.koket.se/halloumigryta-med-tomat-linser-och-chili
For the moment, let's just assume I want the ingredients on the left. If I inspect the website it looks like what I want are the two article class="ingredients" chunks. But I can't seem to get there.
I start with the following:
library(rvest)
library(tidyverse)
read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
html_nodes(".recipe-column-wrapper") %>%
html_nodes(xpath = '//*[#id="react-recipe-page"]')
However, running the above code shows that all of the ingredients are stored in data-item like so:
<div id="react-recipe-page" data-item="{
"chefNames":"<a href='/kockar/siri-barje'>Siri Barje</a>",
"groupedIngredients":[{
"header":"Kokosris",
"ingredients":[{
"name":"basmatiris","unit":"dl","amount":"3","amount_info":{"from":3},"main":false,"ingredient":true
}
<<<and so on>>>
So I am a little bit puzzled: from inspecting the website, everything seems to be neatly placed in things I can extract, but now it is not. Instead, I would need some serious regular expressions to get everything the way I want it.
(I tried SelectorGadget, but it just gave me No valid path found).
You can extract attributes by using html_attr("data-item") from the rvest package.
Furthermore, the data-item attribute looks like it is JSON, which you can convert to a list using fromJSON() from the jsonlite package:
library(rvest)
library(jsonlite)

html <- read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
  html_nodes(".recipe-column-wrapper") %>%
  html_nodes(xpath = '//*[@id="react-recipe-page"]')

recipe <- html %>%
  html_attr("data-item") %>%
  fromJSON()
Lastly, the recipe list contains lots of values that are not relevant here, but the ingredients and measurements are also in there, in the element recipe$ingredients.
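For example, a small usage sketch: the field names (groupedIngredients, header, name, amount, unit) are taken from the data-item snippet in the question, and it assumes fromJSON() simplified them into data frames, which may not hold exactly:
# Continuing from the recipe object created above.
names(recipe)               # everything the data-item attribute contains

# The grouped ingredients shown in the snippet: one entry per section header
# (e.g. "Kokosris"), each with its own table of ingredients.
groups <- recipe$groupedIngredients
groups$header
head(groups$ingredients[[1]][, c("name", "amount", "unit")])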
I am trying to extract articles for a period of 200 days from the Time.mk archive, e.g. http://www.time.mk/week/2016/22. Each day has ten top headings, each of which links to all articles related to it (at the bottom of each heading, e.g. "320 поврзани вести", i.e. "320 related news items"). Following that link leads to a list of all related articles.
This is what I've managed so far:
library(rvest)

url <- "http://www.time.mk/week/2016/22"

frontpage <- read_html(url) %>%
  html_nodes(".other_articles") %>%
  html_attr("href") %>%
  paste0()

mark <- "http://www.time.mk/"
frontpagelinks <- paste0(mark, frontpage)
This gives me the primary links that lead to the related news pages.
The following extracts all of the links to related news for the first heading, from which I then filter the data down to only those links that I need.
final <- read_html(frontpagelinks[1]) %>%
  html_nodes("h1 a") %>%
  html_attr("href") %>%
  paste0()
My question is how I could instruct R, whether via a loop or some other option, to extract the links from all 10 headings in frontpagelinks at once. I tried a variety of options but nothing really worked.
Thanks!
EDIT
Parfait's response worked like a charm! Thank you so much.
However, I've run into an inexplicable issue after using that solution.
Whereas before, when I was going link by link, I could easily sort out the data for only those portals that I need via:
a1onJune = str_extract_all(dataframe, ".a1on.")
This provided me with a clean output: [130] "a1on dot mk/wordpress/archives/618719"
with only the links I needed. Now, if I try to run the same code on the larger df of all links, I inexplicably get many variants of this:
"\"alsat dot mk/News/255645\", \"a1on dot mk/wordpress/archives/620944\", , \"http://www dot libertas dot mk/sdsm-poradi-kriminalot-na-vmro-dpmne-makedonija-stana-slepo-tsrevo-na-balkanot/\",
As you can see, it returns my desired link, but also many others (I've edited most of them out for clarity's sake) that occur both before and after it.
Given that I'm using the same expression, I don't see why this would be happening.
Any thoughts?
Simply run lapply to return a list of links from each element of frontpagelinks:
linksList <- lapply(frontpagelinks, function(i) {
  read_html(i) %>%
    html_nodes("h1 a") %>%
    html_attr("href")
})
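On the issue described in the EDIT: running str_extract_all() directly on the list returned by lapply() most likely collapses (deparses) each character vector into one long string, which is why the desired link comes back surrounded by all of the others. A sketch that flattens the list first and then filters it (the "a1on" pattern comes from the EDIT):
# Flatten the per-heading link lists into a single character vector, then
# keep only the links whose URL contains "a1on".
allLinks <- unlist(linksList)
a1onJune <- allLinks[grepl("a1on", allLinks)]
head(a1onJune)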
I am trying to scrape a table from baseball-reference.com using rvest. My code is:
url="http://www.baseball-reference.com/leagues/NL/2016-standard-batting.shtml"
css=""#players_standard_batting.sortable.stats_table"
read_html(url) %>% html_node(css) %>% html_table()->nlbatting.raw
At this point the table is a bit garbled: there is an 'Â' wherever there should be a space. I have tried
library(dplyr)
nlbatting.raw <- nlbatting.raw %>% mutate(Name = repair_encoding(Name))
which makes everything look ok, but then I get really odd behavior. For instance:
nlbatting.raw$Name[86] == "Yoenis Cespedes"
#[1] FALSE
and:
gsub(" ","_",nlbatting.raw$Name[86])
"Yoenis Cespedes"
I have tried different encoding parameters in read_html() but nothing changes. I tried leaving the encoding alone and just gsubbing out the 'Â' but have the same problem. Any help would be great, thanks in advance!
PS: long-time lurker, first-time poster; sorry if I've missed something obvious.
Edited to fix html nodes (from ".class" to ".stats_table"). It worked fine for me. Try this again:
library(rvest)
url <- "http://www.baseball-reference.com/leagues/NL/2016-standard-batting.shtml"
data <- read_html(url) %>% html_nodes(".stats_table") %>% html_table()
head(data[[1]])
head(data[[2]])
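On the 'Â' characters from the question: they are usually a UTF-8 non-breaking space (bytes C2 A0) shown under the wrong encoding, which would also explain why gsub(" ", "_", ...) and the == comparison fail, since the character only looks like a space. A sketch of cleaning it up, assuming that is indeed what is happening here:
library(dplyr)

# Using nlbatting.raw from the question's code: replace non-breaking spaces
# (U+00A0), which display as "Â " when mis-decoded, with ordinary spaces.
nlbatting.clean <- nlbatting.raw %>%
  mutate(Name = gsub("\u00A0", " ", Name))

nlbatting.clean$Name[86] == "Yoenis Cespedes"
# should now be TRUE, if the stray character really was a non-breaking space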