R: rvest library to extract nested node content

This is a link to a journal page:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9
I'm trying to get the following: Author Affiliations (all authors), Corresponding Author, and Corresponding Author's Email. Note: it is assumed the corresponding author is the last author listed in the authors section at the top of the article. I've used SelectorGadget to identify tags for other elements like Abstract and Publication Date, but I just can't seem to figure out how to get these three. The following is my code to get the authors as a character vector:
library(rvest)
library(stringr)

# url is the url for the list of articles on a particular page
s <- html_session(url)
page <- s %>% follow_link(art) %>% read_html()
str_replace_all(str_squish(page %>% html_nodes(".AuthorName") %>% html_text()), "[0-9]|Email author", "")
And this returns a vector of all authors involved, in this case of length 8, one element per author. But now I need to follow the links on their names to get their affiliations and emails. I'm sure all the code I need is in front of me, but I'm a little lost as I'm new to R and web scraping (I had to learn this quickly for my current project).
Update
The answer below is perfect.

I am not sure the email address always matches the author in the last position, because when I open the Chrome view-source I find the email address sits under a separate list.
library(rvest)
#> Loading required package: xml2
library(data.table)
library(tidyverse)
xml <- read_html('https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9')
xml %>%
  html_nodes('.EmailAuthor') %>%
  html_attr('href')
#> [1] "mailto:liuj#cs.uky.edu"
# get email address
xml %>%
  html_nodes('.AuthorName') %>%
  html_text()
#> [1] "Ye Yu"         "Jinpeng Liu"   "Xinan Liu"     "Yi Zhang"
#> [5] "Eamonn Magner" "Erik Lehnert"  "Chen Qian"     "Jinze Liu"
# get name
data.table(
  name = xml %>%
    html_nodes('meta') %>%
    html_attr('name'),
  content = xml %>%
    html_nodes('meta') %>%
    html_attr('content')
) %>%
  # pull both the meta name and content so the rows stay aligned, then keep the affiliations
  filter(name %in% c('citation_author_institution')) %>%
  select(content)
#> content
#> 1 Department of Computer Science, University of Kentucky, Lexington, USA
#> 2 Department of Computer Science, University of Kentucky, Lexington, USA
#> 3 Department of Computer Science, University of Kentucky, Lexington, USA
#> 4 Department of Computer Science, University of Kentucky, Lexington, USA
#> 5 Department of Computer Science, University of Kentucky, Lexington, USA
#> 6 Seven Bridges Genomics Inc, Cambridge, USA
#> 7 Department of Computer Engineering, University of California Santa Cruz, Santa Cruz, USA
#> 8 Department of Computer Science, University of Kentucky, Lexington, USA
Created on 2018-11-02 by the reprex package (v0.2.1)
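To put the three pieces together, here is a minimal sketch. Two assumptions on my part: each citation_author meta tag on this page is paired with exactly one citation_author_institution tag in the same order (which the eight rows above suggest), and the last listed author is the corresponding author, as in the question.

library(rvest)
library(dplyr)

xml <- read_html('https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9')

meta_name    <- xml %>% html_nodes('meta') %>% html_attr('name')
meta_content <- xml %>% html_nodes('meta') %>% html_attr('content')

# email of the corresponding author, taken from the .EmailAuthor link
corr_email <- xml %>%
  html_nodes('.EmailAuthor') %>%
  html_attr('href') %>%
  sub('mailto:', '', .)

tibble(
  name        = meta_content[meta_name == 'citation_author'],
  affiliation = meta_content[meta_name == 'citation_author_institution']
) %>%
  # assumption: the last listed author is the corresponding author
  mutate(corresponding = row_number() == n(),
         email         = ifelse(corresponding, corr_email, NA))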

Scraping with select/option dropdown

I am new to web scraping, and after a couple of Wikipedia pages I found this page where I wanted to extract the tables for all the portfolio managers. I am not able to use the things I found on the internet. I thought it would be easy since it's just a table, but I am not able to extract even a single table after filling out the form. Can someone please tell me how I could get this done in R?
https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes
library(tidyverse)
library(rvest)
library(httr)
library(RCurl)
url <- "https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes"
result <- postForm(url,
                   pmrId = "RIGHT HORIZONS PORTFOLIO MANAGEMENT PRIVATE LIMITED",
                   year = "2022",
                   month = "August")
attr(result,"Content-Type")
result
If you change the values you pass to the corresponding value attributes of the option elements (e.g. "8" instead of "August" in the case of <option value="8">August</option>), you should be all set.
And you can also check the actual payload of the POST request:
A lazy approach is to use Copy as cURL in DevTools and head to https://curlconverter.com/r/ to convert it to an httr request.
library(rvest)
resp <- httr::POST("https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes",
                   body = list(
                     pmrId = "INP000004417##INP000004417##AEQUITAS INVESTMENT CONSULTANCY PRIVATE LIMITED",
                     year = "2022",
                     month = "8"))

tables <- resp %>%
  read_html() %>%
  html_elements("table") %>%
  html_table()
# first table:
tables[[1]]
#> # A tibble: 11 × 2
#> X1 X2
#> <chr> <chr>
#> 1 Name of the Portfolio Manager "Aeq…
#> 2 Registration Number "INP…
#> 3 Date of Registration "201…
#> 4 Registered Address of the Portfolio Manager ",,,…
#> 5 Name of Principal Officer ""
#> 6 Email ID of the Principal Officer ""
#> 7 Contact Number (Direct) of the Principal Officer ""
#> 8 Name of Compliance Officer ""
#> 9 Email ID of the Compliance Officer ""
#> 10 No. of clients as on last day of the month "124…
#> 11 Total Assets under Management (AUM) as on last day of the month (Amoun… "143…
Created on 2022-10-11 with reprex v2.0.2
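If you want to loop over all portfolio managers rather than a single one, you could first scrape the available pmrId values from the form itself and then repeat the POST above for each of them. A sketch of that idea (an assumption on my part: the manager dropdown on that page is a select element named pmrId):

library(rvest)

form_page <- read_html("https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes")

# assumption: the portfolio-manager dropdown is <select name="pmrId">
pmr_ids <- form_page %>%
  html_elements("select[name='pmrId'] option") %>%
  html_attr("value")

# drop any empty placeholder option, then POST once per manager as above
pmr_ids <- pmr_ids[pmr_ids != ""]
head(pmr_ids)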

How to retrieve text below titles from google search using rvest

This is a follow up question for this one:
How to retrieve titles from google search using rvest
This time I am trying to get the text shown below each title in a Google search (the snippet under each result).
Due to my lack of knowledge in web design, I do not know how to formulate the XPath to extract the text below the titles.
The answer by @AllanCameron is very useful but I do not know how to modify it:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "//div/div/div/a/h3") %>%
  html_text()
Many thanks for your help!
This can all be done without Selenium, using rvest. Unfortunately, Google works differently in different locales, so for example in my locale there is a consent page that has to be navigated before I can even send a request to Google.
It seems this is not required in the OP's locale, but for those of you in the UK, you might need to run the following code first for the rest to work:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
google_handle <- httr::handle('https://www.google.com')
httr::GET('https://www.google.com', handle = google_handle)
httr::POST(paste0('https://consent.google.com/save?continue=',
                  'https://www.google.com/',
                  '&gl=GB&m=0&pc=shp&x=5&src=2',
                  '&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
                  'set_eom=false&set_aps=true&set_sc=true'),
           handle = google_handle)
url <- httr::GET(url, handle = google_handle)
For the OP and those without a Google consent page, the set up is simply:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
Next we define the xpaths we are going to use to extract the title (as in the previous Q&A) and the text below the title (pertinent to this question):
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
Now we can just apply these xpaths to get the correct nodes and extract the text from them:
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text  = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#> title text
#> <chr> <chr>
#> 1 "Mario García Torres - Wikipedia" "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (#mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~
The subtext you refer to appears to be rendered in JavaScript, which makes it difficult to access with conventional read_html() methods.
Here is a script using RSelenium which gets the results needed. You can also click the next-page element if you want to get more results (see the sketch after the output below).
library(wdman)
library(RSelenium)
library(tidyverse)
selServ <- selenium(
  port = 4446L,
  version = 'latest',
  chromever = '103.0.5060.134' # set to an available version
)
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4446L,
  browserName = 'chrome'
)
remDr$open()
remDr$navigate("insert URL here")
text_elements <- remDr$findElements("xpath", "//div/div/div/div[2]/div/span")

sapply(text_elements, function(x) x$getElementText()) %>%
  unlist() %>%
  as_tibble_col("results") %>%
  filter(str_length(results) > 15)
# # A tibble: 6 × 1
# results
# <chr>
# 1 "The meaning of HI is —used especially as a greeting. How to use hi in a sentence."
# 2 "Hi definition, (used as an exclamation of greeting); hello! See more."
# 3 "A friendly, informal, casual greeting said upon someone's arrival. quotations ▽synonyms △. Synonyms: hello, greetings, howdy.…
# 4 "Hi is defined as a standard greeting and is short for \"hello.\" An example of \"hi\" is what you say when you see someone. i…
# 5 "hi synonyms include: hola, hello, howdy, greetings, cheerio, whats crack-a-lackin, yo, how do you do, good morrow, guten tag,…
# 6 "Meaning of hi in English ... used as an informal greeting, usually to people who you know: Hi, there! Hi, how are you doing? …

Extract all text & tags between two heading tags (<h3>) with rvest

This page shows six sections listing people between <h3> tags.
How can I use XPath to select these six sections separately (using rvest), perhaps into a nested list? My goal is to later lapply through these six sections to fetch the people's names and affiliations (separated by section).
The HTML isn't well structured, i.e. not all of the text is located within specific tags. An example:
<h3>Editor-in-Chief</h3>
Claudio Ronco – <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark – <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi – <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
I access the site with the following code:
journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- rvest::html_session(journal_url,
httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))
webpage <- rvest::html_nodes(webpage, css = '#editorialboard')
I tried various XPaths to extract the six sections with html_nodes into a nested list of six lists, but none of them work properly:
# this gives me a list of 190 (instead of 6) elements, leaving out the text between <i> and </i>
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 190 (instead of 6) elements, leaving out text that are not between tags
webpage <- rvest::html_nodes(webpage, xpath = '//*[preceding-sibling::h3 and following-sibling::h3]')
# error "VECTOR_ELT() can only be applied to a 'list', not a 'logical'"
webpage <- rvest::html_nodes(webpage, xpath = '//* and text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 274 (instead of 6) elements
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3]')
Are you ok with an ugly solution that does not use XPath? I don't think you can get a nested list from the structure of this website, but I am not very experienced in XPath.
I first got the headings, divided the raw text using the heading names and then, within each group, divided the members using '\n' as a separator.
library(rvest)

journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- read_html(journal_url) %>% html_node(css = '#editorialboard')
# get h3 headings
headings <- webpage %>% html_nodes('h3') %>% html_text()
# get raw text
raw.text <- webpage %>% html_text()
# split raw text on h3 headings and put in a list
list.members <- list()
raw.text.2 <- raw.text
for (h in headings) {
  # split on headings
  b <- strsplit(raw.text.2, h, fixed = TRUE)
  # split members using \n as separator
  c <- strsplit(b[[1]][1], '\n', fixed = TRUE)
  # clean empty elements from vector
  c <- list(c[[1]][c[[1]] != ""])
  # add vector of members to list
  list.members <- c(list.members, c)
  # update text
  raw.text.2 <- b[[1]][2]
}
# remove first element of main list
list.members <- list.members[2:length(list.members)]
# add final segment of raw.text to list
c <- strsplit(raw.text.2, '\n', fixed=TRUE)
c <- list(c[[1]][c[[1]] != ""])
list.members <- c(list.members, c)
# add names to list
names(list.members) <- headings
Then you get a list of the groups, where each element is a character vector with one string per member (keeping all the info):
> list.members$`Editor-in-Chief`
[1] "Claudio Ronco – St. Bartolo Hospital, Vicenza, Italy"
> list.members$`Clinical Engineering`
[1] "William R. Clark – Purdue University, West Lafayette, IN, USA"
[2] "Hideyuki Kawanashi – Tsuchiya General Hospital, Hiroshima, Japan"
[3] "Tadayuki Kawasaki – Mobara Clinic, Mobara City, Japan"
[4] "Jeongchul Kim – Wake Forest School of Medicine, Winston-Salem, NC, USA"
[5] "Anna Lorenzin – International Renal Research Institute of Vicenza, Vicenza, Italy"
[6] "Ikuto Masakane – Honcho Yabuki Clinic, Yamagata City, Japan"
[7] "Michio Mineshima – Tokyo Women's Medical University, Tokyo, Japan"
[8] "Tomotaka Naramura – Kurashiki University of Science and the Arts, Kurashiki, Japan"
[9] "Mauro Neri – International Renal Research Institute of Vicenza, Vicenza, Italy"
[10] "Masanori Shibata – Koujukai Rehabilitation Hospital, Nagoya City, Japan"
[11] "Yoshihiro Tange – Kyushu University of Health and Welfare, Nobeoka-City, Japan"
[12] "Yoshiaki Takemoto – Osaka City University, Osaka City, Japan"

Web scraping with R and selector gadget

I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and same code structure as the Lego example, I get NAs (Warning message: NAs introduced by coercion, followed by [1] NA). My code is below:
library(rvest)

data2_html <- read_html("http://blog.feedspot.com/video_game_news/")

data2_html %>%
  html_node(".stats") %>%
  html_text() %>%
  as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches, however only 1 number is returned:
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first {xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")

stats <- d1 %>%
  html_nodes(".stats") %>%
  html_text()

blogname <- d1 %>%
  html_nodes(".tlink") %>%
  html_text()
Note that it is html_nodes (plural).
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. The stats info, on the other hand, comes out mixed, because the same .stats class is used for the Facebook, Twitter and Alexa figures, so they are indistinguishable from one another. The output vector therefore repeats the information every three numbers, i.e. stats = c(fb, tw, alx, fb, tw, alx, ...), and you have to separate each series from it.
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
  html_node('table') %>%                                    # select enclosing table node
  html_table() %>%                                          # turn table into data.frame
  set_names(make.names) %>%                                 # make names syntactic
  mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>%   # extract title from name info
  mutate_at(3:5, parse_number) %>%                          # make numbers actually numbers
  tbl_df()                                                  # for printing

game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
This uses html_nodes (plural) and str_replace_all to remove the commas in the numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()

How to fetch headlines from google news using rvest R?

I want to fetch headlines from google news using rvest in R. I have done this so far
library(rvest)
url <- read_html("https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president")
selector_name <- "r"
fnames <- html_nodes(x = url, css = selector_name) %>%
  html_text()
but the result is
> fnames
character(0)
This is the inspect-element output for a headline:
<h3 class="r">Obama Addresses Racial Tensions at Celebration of African ...</h3>
How can I fetch the headlines from google news?
I think you are just missing a dot for the class name:
> headlines = read_html("https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president") %>%
html_nodes(".r") %>%
html_text()
> headlines
[1] "Iranian President: No American President Can Renegotiate the Now ..."
[2] "US: President Barack Obama vetoes 9/11 bill"
[3] "President Obama Wants Donald Trump to Visit New African ..."
[4] "President Obama: Discrimination Should Concern 'All Americans ..."
[5] "Conrad Black: The Middle East watches, and waits, for the next ..."
[6] "Putin's close friend: Donald Trump will be next US president"
[7] "US election 2016 polls and odds: Latest Donald Trump and Hillary ..."
[8] "US election: Ted Cruz endorses Donald Trump for president"
[9] "Obama – I'm proud of my 'African record' as US president"
[10] "Almost 6000 Americans Have Already Voted for President"
Well, you could do it by:
library(rvest)

link <- "https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president"

reviews <- link %>%
  read_html() %>%
  html_nodes(".g") %>%
  html_text()
Check via inspect element where the text (the headline) is present; in this case it is the class .g. Then read the text within each node.
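If you only want the headline inside each result block rather than the whole text of the block, a small sketch reusing the link defined above (assuming the headline still sits in an h3 inside each .g node, as in the inspect-element snippet in the question):

headlines <- link %>%
  read_html() %>%
  html_nodes(".g h3") %>%   # assumption: headline is an <h3> inside each .g result
  html_text()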
