This question is not so much about how to run a Google search in R (discussed many times before) as about why it does not always work.
I found this code in another question posted here, which I recall working perfectly: it would produce all the links in the search.
But now it does not work. For some reason the node is not there anymore when I pull the data into R, yet when I inspect the HTML in Chrome it is right there as I browse the code. The inspector shows the h3 node, but it is not present in what gets downloaded.
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath='//h3/a') %>% html_attr('href')
gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
I get the following return:
character(0)
The Google page's display of links depends on your location/preferences, so maybe that is what is causing the issue?
It appears that the format switched very recently, maybe even today, and that //h3 is no longer used. The following produces what is intended, with one final extraneous result:
library(rvest)
ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
links <- ht %>% html_nodes(xpath='//a') %>% html_attr('href')
gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
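If that extraneous entry is reliably the last element, one way to drop it is sketched below; treating it as the last element is an assumption, so inspect the output before relying on it.
links_clean <- gsub('/url\\?q=', '',
                    sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
# Assumption: the unwanted entry is the final one; check links_clean before dropping it blindly
head(links_clean, -1)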
I am looking to extract all the links for each episode on this webpage; however, I appear to be having difficulty using html_nodes() where I haven't experienced such difficulty before. I am trying to iterate the code using "." so that all the attributes for the page matching that CSS are obtained. This code is meant to give an output of all the attributes, but instead I get {xml_nodeset (0)}. I know what to do once I have all the attributes in order to pull the links out of them, but this step is proving a stumbling block for this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"
episode_list_page_1 %>%
  read_html() %>%
  html_node("body") %>%
  html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
  html_attrs()
This rvest approach does not work here because the page uses JavaScript to insert another webpage into an iframe on this page in order to display the information.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which will redirect you to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for as embedded JSON, generated via JavaScript.
The links to the shows are extractable, and there is also a link to a Google Doc that is the full index. Searching this page turns up that link:
library(rvest)
library(dplyr)
library(stringr)
page2 <- read_html("https://datawrapper.dwcdn.net/eoqPA/67/")
#find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')
#isolate the Google docs
print(str_extract_all(html_text(page2), 'https://docs.*?\\"') )
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
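Since the second link is already a CSV export URL, the full index can presumably be read straight into a data frame. A minimal sketch, assuming the sheet stays publicly shared:
# Pull the episode index directly from the CSV export link found above
episodes <- read.csv("https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8")
head(episodes)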
This is my first post and I am a beginner with R, so patience is requested if I should have found an answer to my question elsewhere.
I'm trying to cobble together a table with data pulled from multiple sites from CME (https://www.cmegroup.com/trading/energy/crude-oil/western-canadian-select-wcs-crude-oil-futures.html is one).
I've tried using rvest but get a blank table.
I think this is because of the JavaScript that is being used to populate the table in real time. I've fumbled my way around this site looking for similar problems and haven't quite figured out how best to pull this data. Any help is much appreciated.
library(rvest)
library(dplyr)
WCS_page <- "https://www.cmegroup.com/trading/energy/crude-oil/canadian-heavy-crude-oil-net-energy-index-futures_quotes_globex.html"
WCS_diff <- read_html(WCS_page)
month <- WCS_diff %>%
  rvest::html_nodes('th') %>%
  xml2::xml_find_all("//scope[contains(@col, 'Month')]") %>%
  rvest::html_text()

price <- WCS_diff %>%
  rvest::html_nodes('tr') %>%
  xml2::xml_find_all("//td[contains(@class, 'quotesFuturesProductTable1_CLK0_last')]") %>%
  rvest::html_text()

WTI_df <- data.frame(month, price)

knitr::kable(WTI_df %>% head(10))
Yes, the page is using JS to load the data.
The easy way to check is to view source and then search for some of the text you saw in the table. For example the word "May" never shows up in the raw HTML, so it must have been loaded later.
The next step is to use something like the Chrome DevTools to inspect the network requests that were made. In this case there is a clear winner, and your structured data is coming down from here:
https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G
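From there you can request that JSON endpoint directly, e.g. with jsonlite. This is only a sketch: the endpoint may insist on browser-like request headers, and the field names in the commented-out assembly (quotes, expirationMonth, last) are assumptions to verify against the actual response.
library(jsonlite)

# Fetch the same payload the page's JavaScript loads
quotes_json <- fromJSON("https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G")
str(quotes_json, max.level = 2)   # inspect the structure first

# Hypothetical assembly, if the payload exposes a 'quotes' table with these fields:
# WTI_df <- data.frame(month = quotes_json$quotes$expirationMonth,
#                      price = quotes_json$quotes$last)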
I am trying to scrape off the number amounts listed in a set of donation websites. So in this example, I would like to get
$3, $10, $25, $100, $250, $1500, $2800
The xpath indicates that one of them should be
/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/form/div/div[1]/div/div/ul/li[2]/label
and the css selector
li.btn--wrapper:nth-child(2) > label:nth-child(1)
Up to the following, I see something in the xml_nodeset:
library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>%
  html_nodes(xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div')
Then when I add the second part of the xpath, it comes up blank. Same with
X %>% html_nodes("li")
which gives a bunch of nodes, but all of the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT elements come up blank.
I have worked with rvest for a fair bit now, but this one's baffling. And I am not quite sure how RSelenium would help here, although I know how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the Wayback Machine; there's only the background and nothing else.
I have even tried taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is white and in a rather nonstandard font. Yes, I've tried image_negate and image_resize to see if they helped, but they only showed that relying on OCR is rather a bad idea, as it depends on the screenshot size.
Any advice on how to best extract what I want in this situation? Thanks.
You can use a regex to extract the numbers from a script tag. You get them back as a comma-separated character string:
library(rvest)
library(stringr)
con <- url('https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website', "rb")
page = read_html(con)
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'preloadedState')]") %>%
  html_text() %>%
  as.character %>%
  str_match_all(., '(?<="amounts":\\[)(\\d+,?)+')

print(res[[1]][, 1])
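To turn that comma-separated string into numbers, a small follow-up step (assuming res[[1]][, 1] holds a single string such as "3,10,25,..."):
amounts <- as.numeric(strsplit(res[[1]][, 1], ",")[[1]])
amounts   # should correspond to the $3, $10, $25, ... values quoted in the question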
I will change the website to make this question better. I'm still facing similar issues: I can't do it with the rvest package alone, and maybe the answer will be easier to obtain with RSelenium. Website: http://ravimaailma.fi/cg/tulokset/20/. I want to obtain the links from the main article listing that would direct me to individual race results. The links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/
I'm trying to use plain rvest, as I thought that would be all that's needed here. SelectorGadget gives the links' CSS selector as .article-title a, so my code is simply
url <- "http://ravimaailma.fi/cg/tulokset/20/"

url %>%
  read_html() %>%
  html_nodes(".article-title a") %>%
  html_text()
This returns nothing. The website loads more results when you scroll down, but I thought I would at least get the first results out. The code below gives out some links, and links 28:32 look promising, but I think they are links from the sidebar, not from the article listing.
url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")
What am I doing wrong here, and can RSelenium help me?
Here is my partial answer; it still doesn't get everything, but maybe it helps someone. The code returns one link, for the first result, and I'm not sure why it isn't giving them all. I'm using:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")
elem <- remDr$findElement(using="css selector", value=".article-title a")
elemtxt <- elem$getElementAttribute("href")
#Click button to load more results
#button <- remDr$findElement(using="id", value="loadmore")
#button$clickElement()
remDr$close()
I haven't used the button click yet, but it seemed to be working as well. The only problem is that I can't get all the results from the site.
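One likely reason the snippet returns a single link is that findElement() stops at the first match. A sketch using findElements() (note the plural), run before remDr$close(), should gather every .article-title a element that has been loaded so far:
elems <- remDr$findElements(using = "css selector", value = ".article-title a")
links <- unlist(lapply(elems, function(e) e$getElementAttribute("href")))
head(links)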
[I'm not (yet) allowed to write comments, so I chose to make this post an answer]
RSelenium is not always necessary; you can also interact with a website directly using PhantomJS (see e.g. this example).
If you provide an example from the website instead of a local link to a .pdf, I can try to find out how to retrieve the data.
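For what it's worth, here is a rough sketch of that PhantomJS route driven from R, using the URL from the question; it assumes the phantomjs binary is installed and on your PATH:
# Write a tiny PhantomJS script that renders the page and saves the resulting HTML
js <- "
var page = require('webpage').create();
var fs = require('fs');
page.open('http://ravimaailma.fi/cg/tulokset/20/', function (status) {
  fs.write('rendered.html', page.content, 'w');
  phantom.exit();
});
"
writeLines(js, "scrape.js")
system("phantomjs scrape.js")   # renders the JS-driven content to rendered.html

library(rvest)
read_html("rendered.html") %>%
  html_nodes(".article-title a") %>%
  html_attr("href")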
I am trying to extract articles for a period of 200 days from the Time.mk archive, e.g. http://www.time.mk/week/2016/22. Each day has the top 10 headings, each of which links to all articles related to it (at the bottom of each heading, e.g. "320 поврзани вести", i.e. "320 related news"). Following this link leads to a list of all related articles.
This is what I've managed so far:
library(rvest)

url <- "http://www.time.mk/week/2016/22"

frontpage <- read_html(url) %>%
  html_nodes(".other_articles") %>%
  html_attr("href") %>%
  paste0()

mark <- "http://www.time.mk/"
frontpagelinks <- paste0(mark, frontpage)
By now I can access the primary links going to the related news.
The following extracts all of the links to related news for the first heading, after which I clean the data to keep only those links that I need.
final <- list()

final <- read_html(frontpagelinks[1]) %>%
  html_nodes("h1 a") %>%
  html_attr("href") %>%
  paste0()
My question is how I could instruct R, whether via a loop or some other option, to extract the links from all 10 headings in frontpagelinks at once. I tried a variety of options, but nothing really worked.
Thanks!
EDIT
Parfait's response worked like a charm! Thank you so much.
I've run into an inexplicable issue however after using that solution.
Whereas before, when I was going link by link, I could easily sort out the data for only those portals that I need via:
a1onJune = str_extract_all(dataframe, ".a1on.")
which provided me with a clean output containing only the links I needed, e.g. [130] "a1on.mk/wordpress/archives/618719". Now if I try to run the same code on the larger data frame of all links, I inexplicably get many variants of this:
"\"alsat.mk/News/255645\", \"a1on.mk/wordpress/archives/620944\", , \"http://www.libertas.mk/sdsm-poradi-kriminalot-na-vmro-dpmne-makedonija-stana-slepo-tsrevo-na-balkanot/\",
As you can see, it returns my desired link, but also many others (I've edited most of them out for clarity's sake) that occur both before and after it.
Given that I'm using the same expression, I don't see why this would be happening.
Any thoughts?
Simply run lapply to return a list of links from each element of frontpagelinks
linksList <- lapply(frontpagelinks, function(i) {
  read_html(i) %>%
    html_nodes("h1 a") %>%
    html_attr("href")
})
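If you then want a single flat vector and, per the edit, only the a1on links, one simple follow-up is to unlist and filter; using "a1on" as the pattern is an assumption based on the links quoted above:
allLinks <- unlist(linksList)
a1onJune <- allLinks[grepl("a1on", allLinks)]
head(a1onJune)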