I am using rvest to get the hyperlinks in a Google search. User @AllanCameron helped me in the past to sketch this code, but now I do not know how to change the xpath, or what else I need to do, to get the links. Here is my code:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
links <- html_nodes(first_page, xpath = "//div/div/a/h3") %>%
html_attr('href')
This entirely returns NA.
I would like to get the link for each of the items that appear in the search results (sorry for the quality of the images):
Is it possible to get that stored in a dataframe? Many thanks!
Look at the parent a of each h3 node and take its href attribute. This ensures you have the same number of links as main titles, which allows easy arrangement in a dataframe.
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")
titles %>%
html_elements(xpath = "./parent::a") %>%
html_attr("href") %>%
str_extract("https.*?(?=&)")
[1] "https://www.linkedin.com/in/mario-torres-b5796315b"
[2] "https://mariolopeztorres.com/"
[3] "https://www.instagram.com/mario_torres25/%3Fhl%3Den"
[4] "https://www.1stdibs.com/buy/mario-torres-lopez/"
[5] "https://m.facebook.com/2064681987175832"
[6] "https://www.facebook.com/mariotorresmx"
[7] "https://www.transfermarkt.us/mario-torres/profil/spieler/28167"
[8] "https://en.wikipedia.org/wiki/Mario_Garc%25C3%25ADa_Torres"
[9] "https://circawho.com/press-and-magazines/mario-lopez-torress-legacy-is-still-being-woven-in-michoacan-mexico/"
I'm trying to scrape some data for a potential stats project, but I can't seem to get all of the nodes per page. Instead, it only grabs the first one before moving to the next page.
library(rvest)
pages <- "https://merrimackathletics.com/sports/" %>%
  paste0(c("baseball", "mens-basketball", "mens-cross-country") %>%
           paste0("/roster"))
Major <- lapply(pages,
                function(url){
                  url %>% read_html() %>%
                    html_node(".sidearm-roster-player-major") %>%
                    html_text()
                })
Subsequently, the above only returns:
> Major
[[1]]
[1] "Business Adminstration"
[[2]]
[1] "Communications"
[[3]]
[1] "Global Management"
How should I go about indexing the node such that I get more than just the first "major" per page? Thanks!
The function html_node only extracts the first element. html_nodes will do what you want.
From the documentation:
html_node is like [[ it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, the length of html_nodes might be longer or shorter.
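Applied to the code from the question, the fix is a one-word change (a sketch; everything else stays the same):
Major <- lapply(pages,
                function(url){
                  url %>% read_html() %>%
                    html_nodes(".sidearm-roster-player-major") %>%  # plural: keep every match
                    html_text()
                })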
I am scraping stock market prices using the rvest package in R. I would like to exclude nodes when using html_nodes().
The following classes appear on the website with stock prices:
[4] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_DifferenceBlock_lblRelativeDifferenceDown" class="ValueDown">-0,51%</span>
[5] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_ctl02_lblDifference" class="ValueDown Difference">-51%</span>
Now I would like to include only the text of elements with class="ValueDown", and exclude the text of elements with class="ValueDown Difference".
For this I use the following code:
urlIEX <- "https://www.iex.nl/Koersen/Europa_Lokale_Beurzen/Amsterdam/AMX.aspx"
webpageIEX <- read_html(urlIEX)
percentage_change <- webpageIEX %>%
html_nodes(".ValueDown") %>%
html_text()
However, this gives me both the values -0,51% and -51%. Is there a way to include everything with class="ValueDown" and exclude everything with class="ValueDown Difference"?
I'm not an expert, but I think you should use the attribute selector:
percentage_change <- webpageIEX %>%
html_nodes("[class='ValueDown']") %>%
html_text()
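An alternative that reads a little more directly, assuming the CSS engine behind rvest accepts :not() for this page (untested here), is to keep .ValueDown and drop anything that also carries the Difference class:
percentage_change <- webpageIEX %>%
  html_nodes(".ValueDown:not(.Difference)") %>%
  html_text()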
I am trying to scrape a web page in R. In the table of contents here:
https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc
I am interested in the
Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52
Depending on the document the page number can vary where these statements are.
I am trying to locate these statements using html_nodes(), but I cannot seem to get it working. When I inspect the page I find the table at <div align="CENTER">, but I cannot find a table ID to key on.
url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"
dat <- url %>%
read_html() %>%
html_table(fill = TRUE)
Any push in the right direction would be great!
EDIT: I know of the finreportr and finstr packages, but they work from the XML documents, and not all HTML filings come with XML documents. I also want to do this using the rvest package.
EDIT: Something like the following works:
url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
html_table()
x <- population[[1]]
It's very messy, but it does get the cash flows table. The XPath changes depending on the webpage.
For example this one is different:
url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
html_table()
x <- population[[1]]
Is there a way to "search" for the "Cash Flow" table and somehow extract the XPath?
Some more links to try.
[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"
[2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"
[3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"
[5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"
[6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"
[8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"
[9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"
I am using simple code to extract the links to my articles (one by one):
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
final = list()
final = read_html(frontpagelinks[1]) %>%
html_nodes("h1 a") %>%
html_attr("href")%>%
paste0()
I used
a1onJune = str_extract_all(frontpage, ".*a1on.*")
to extract articles from the website a1on.mk, which worked like a charm, finding only the articles I needed.
After getting some help here as to how to make my code more efficient, i.e. extract numerous links at once, via:
linksList <- lapply(frontpagelinks, function(i) {
  read_html(frontpagelinks[i]) %>%
    html_nodes("h1 a") %>%
    html_attr("href") %>%
    paste0()
})
which extracts all of the links I need. But the same stringr code now oddly returns something like this:
"\"standard dot mk/germancite-ermenskiot-genocid/\", \"//plusinfo dot mk/vest/72702/turcija-ne-go-prifakja-zborot-genocid\", \"/a1on dot mk/wordpress/archives/618719\", \"sitel dot mk/na-povidok-nov-sudir-megju-turcija-i-germanija\",
Where as shown in bold I also extract the links to the website I need, but also a bunch of other noise that I definitely don't want there. I tried a variety of regex expressions, however I've not managed to define only those lines of code that contain a1on posts.
Given that the list I am attempting to clean up outputs the links separately, I am a bit baffled that stringr (as far as I can tell) randomly divides them into strings containing multiple links:
[93] "http://telegraf dot mk /aktuelno/svet/ns-newsarticle-vo-znak-na-protest-turcija-go-povlece-svojot-ambasador-od-germanija.nspx"
[94] "http://tocka dot mk /1/197933/odnosite-pomegju-berlin-i-ankara-pred-totalen-kolaps-germanija-go-prizna-turskiot-genocid-nad-ermencite"
[95] "lokalno dot mk /merkel-vladata-na-germanija-e-podgotvena-da-pomogne-vo-dijalogot-megju-turcija-i-ermenija/"
Any thoughts as to how I can go about this? Perhaps something that is more general, given that I need to do the same type of cleaning for five different portals.
Thank you.
Here is the full code, with read_html() called on each link directly and the result flattened with unlist():
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
# lapply returns a list of lists, so use unlist to flatten
linksList <- unlist( lapply(frontpagelinks, function(i) {
read_html(i) %>%
html_nodes("h1 a") %>%
html_attr("href") %>%
paste0()}))
# grab the lists of interest
a1onLinks <- linksList[grepl(".*a1on.*", linksList)]
# [1] "http://a1on.mk/wordpress/archives/621196" "http://a1on.mk/wordpress/archives/621038"
# [3] "http://a1on.mk/wordpress/archives/620576" "http://a1on.mk/wordpress/archives/620686"
# [5] "http://a1on.mk/wordpress/archives/620364" "http://a1on.mk/wordpress/archives/620399"
I'm trying to extract the table of historical data from the Yahoo Finance website.
First, by inspecting the source code I've found that it's actually a table, so I suspect that html_table() from rvest should be able to work with it. However, I can't find a way to reach it from R. I've tried providing the function with the full page, but that did not fetch the right table:
url <- "https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo"
read_html(url) %>% html_table(fill = TRUE)
# Returns only:
# [[1]]
# X1 X2
# 1 Show all results for Tip: Use comma to separate multiple quotes Search
Second, I've found an xpath selector for the particular table, but I am still unsuccessful in fetching the data:
xpath1 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]/table'
read_html(url) %>% html_node(xpath = xpath1)
# Returns an empty nodeset:
# {xml_nodeset (0)}
By removing the last term from the selector I get a non-empty nodeset, however, still no table:
xpath2 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]'
read_html(url) %>% html_node(xpath = xpath2) %>% html_table(fill = TRUE)
# Error: html_name(x) == "table" is not TRUE
What am I doing wrong? Any help would be appreciated!
EDIT: I've found that html_text() with the last xpath returns
read_html(url) %>% html_node(xpath = xpath2) %>% html_text()
[1] "Loading..."
which suggests that the table had not yet loaded when R read the page. This would explain why it failed to see the table. Question: is there any way of bypassing that loading text?
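One commonly suggested workaround, sketched here without being tested against the current Yahoo page and assuming a working Selenium setup, is to let a real browser render the JavaScript first (via RSelenium) and only then hand the finished HTML to rvest:
library(RSelenium)
# assumes Firefox and a matching geckodriver are installed
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
remDr$navigate(url)
Sys.sleep(5)  # give the JavaScript time to fill in the table
page <- read_html(remDr$getPageSource()[[1]])
tables <- html_table(page, fill = TRUE)
remDr$close()
driver$server$stop()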