I'm trying to find a way to copy-paste the title and the abstract from a PubMed page.
I started using
browseURL("https://pubmed.ncbi.nlm.nih.gov/19592249") ## final numbers are the PMID
Now I can't find a way to obtain the title and the abstract as plain text. I have to do it for multiple PMIDs, so I need to automate it. It would also be useful to just copy everything on that page, and afterwards I can take only what I need.
Is it possible to do that? thanks!
I suppose what you're trying to do is scrape PubMed for articles of interest?
Here's one way to do this using the rvest package:
#Required libraries.
library(magrittr)
library(rvest)
#Function.
getpubmed <- function(url){
dat <- rvest::read_html(url)
pid <- dat %>% html_elements(xpath = '//*[@title="PubMed ID"]') %>% html_text2() %>% unique()
ptitle <- dat %>% html_elements(xpath = '//*[@class="heading-title"]') %>% html_text2() %>% unique()
pabs <- dat %>% html_elements(xpath = '//*[@id="enc-abstract"]') %>% html_text2()
return(data.frame(pubmed_id = pid, title = ptitle, abs = pabs, stringsAsFactors = FALSE))
}
#Test run.
urls <- c("https://pubmed.ncbi.nlm.nih.gov/19592249", "https://pubmed.ncbi.nlm.nih.gov/22281223/")
df <- do.call("rbind", lapply(urls, getpubmed))
The code should be fairly self-explanatory. (I've not added the contents of df here for brevity.) The function getpubmed does no error-handling or anything of that sort, but it is a start. By supplying a vector of URLs to the do.call("rbind", lapply(urls, getpubmed)) construct, you can get back a data.frame consisting of the PubMed ID, title, and abstract as columns.
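Since getpubmed does no error handling, a single failed request would stop the whole batch. Here is a minimal sketch of one way around that with base R's tryCatch. The names safe_scrape and fake_scraper are hypothetical and introduced only for this illustration, and recording a failure as a row of NAs is an assumption about what you would want:

```r
# Hypothetical wrapper: run a single-URL scraper and return an NA row on failure,
# so that one bad page does not abort the whole lapply() call.
safe_scrape <- function(url, scraper) {
  tryCatch(
    scraper(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      data.frame(pubmed_id = NA_character_, title = NA_character_,
                 abs = NA_character_, stringsAsFactors = FALSE)
    }
  )
}

# Stand-in scraper for illustration only (no network access): fails on one URL.
fake_scraper <- function(url) {
  if (grepl("bad", url)) stop("page could not be read")
  data.frame(pubmed_id = basename(url), title = "Some title",
             abs = "Some abstract", stringsAsFactors = FALSE)
}

demo_urls <- c("https://example.org/111", "https://example.org/bad")
demo_df <- do.call("rbind", lapply(demo_urls, safe_scrape, scraper = fake_scraper))
```

With the real function, the call would be lapply(urls, safe_scrape, scraper = getpubmed).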
Another option would be to explore the easyPubMed package.
I would also use a function and rvest. However, I would pass the pid in as the function's argument, use html_node since only a single node needs to be matched, and use CSS selectors, which are faster. String cleaning is done with the stringr package:
library(rvest)
library(stringr)
library(dplyr)
get_abstract <- function(pid){
page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))
df <-tibble(
title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
)
return(df)
}
get_abstract('19592249')
Related
The site I use to scrape data has changed and I'm having issues pulling the data into table format. I used two different types of codes below trying to get the tables, but it is returning blanks instead of tables.
I'm a novice in regards to scraping and would appreciate the expertise of the group. Should I look for other solutions in rvest, or try to learn a program like rSelenium?
https://www.pgatour.com/stats/detail/02675
Scrape for Multiple Links
library("dplyr")
library("purrr")
library("rvest")
library("tidyr") # needed for unnest()
df23 <- expand.grid(
stat_id = c("02568","02674", "02567", "02564", "101")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/detail/',
stat_id
)
) %>%
as_tibble()
#replaced tournament_id with stat_id
get_info <- function(link, stat_id){
data <- link %>%
read_html() %>%
html_table() %>%
.[[2]]
}
test_main_stats <- df23 %>%
mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_main_stats <- test_main_stats %>%
unnest(everything())
Alternative Code
url <- read_html("https://www.pgatour.com/stats/detail/02568")
test1 <- url %>%
html_nodes(".css-8atqhb") %>%
html_table
This page uses JavaScript to create the table, so rvest will not work directly. But if one examines the page's source code, all of the data is stored in JSON format in a "<script>" node.
This code finds that node and converts it from JSON to a list. The variable answer is the main table, but there is a wealth of other information contained in the JSON data structure.
#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
#find the script with the correct id tag, strip the html code
datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()
#convert from JSON
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)
#get the main table
answer <-output$props$pageProps$statDetails$rows
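To see the approach without depending on the live site, here is a small self-contained sketch. The HTML string and its JSON layout are made up for illustration and are far smaller than what the real page embeds; only the node id and the props/pageProps/statDetails/rows path are taken from the code above.

```r
library(rvest)

# Toy page for illustration only; the field names inside rows are invented.
toy_html <- '<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"statDetails":{"rows":[
  {"playerName":"Player A","rank":1},
  {"playerName":"Player B","rank":2}
]}}}}
</script>
</body></html>'

toy_page <- read_html(toy_html)

# Pull the text of the script node, then parse it as JSON.
toy_json <- toy_page %>%
  html_element(xpath = ".//script[@id='__NEXT_DATA__']") %>%
  html_text()
toy_output <- jsonlite::fromJSON(toy_json)

# An array of flat JSON objects is simplified to a data frame by default.
toy_rows <- toy_output$props$pageProps$statDetails$rows
```

On the real page the rows may contain nested fields, so the result may need further flattening before it looks like the table shown in the browser.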
I'm currently web scraping the IMDB website to extract movie data.
I would like to know how you would solve this problem.
library(tidyverse)
library(data.table)
library(rvest)
library(janitor)
#top rated movies website
url <- 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
# extract the title of the movies using rvest
titles <- url %>%
read_html() %>%
html_nodes(' .titleColumn a') %>%
html_text() %>%
as.data.table() %>%
setnames(. ,old = colnames(.), new='title')
# extract links to each of the titles, this will be the reference
links <- url %>%
read_html() %>%
html_nodes('.titleColumn a') %>%
html_attr('href') %>%
as.data.table() %>%
setnames(. ,old = colnames(.), new='links')
# creating a DT with the data
movies <- cbind(titles,links)
I will have a movies DT with title and links as columns.
Now, I would like to extract additional data for each movie using the links.
I will continue using the first row as an example.
#the first link in movies
link <- 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=NJ52X0MM1V9FKSPBT46G&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'
# get budget data
budget <- link %>%
read_html() %>%
html_nodes(select) %>%
html_text() %>%
gsub('\\n','',.) %>%
str_split(.,'\\:')%>%
as.data.table() %>%
janitor::row_to_names(row_number = 1) %>%
setnames(.,old=colnames(.),new= tolower(gsub(' ','_' , str_trim(colnames(.)))))
budget[,(colnames(budget))] <- lapply(budget,function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
Now I have a 1x4 table with budget information
I would like to pull data for each link of movies and merge it into DT to have a final DT with 6 columns: 'title', 'link' + four budget variables. I was trying to create a function that includes the code to get the budget data, using each row's link as a parameter and then using 'lapply', but I don't think this is the correct approach.
I would like to see if you have a solution to this in an efficient way.
Thanks so much for your help.
I think this would solve your problem:
# selector for budget data (this will not change)
select <- '.txt-block:nth-child(15) , .txt-block:nth-child(14) , #titleDetails .txt-block:nth-child(13) , #titleDetails .txt-block:nth-child(12)'
# get budget data
## As function
get_budget = function(link,select){
budget <- link %>%
read_html() %>%
html_nodes(select) %>%
html_text() %>%
gsub('\\n','',.) %>%
str_split(.,'\\:')%>%
as.data.table() %>%
janitor::row_to_names(row_number = 1) %>%
setnames(.,old=colnames(.),new= tolower(gsub(' ','_' , str_trim(colnames(.)))))
budget[,(colnames(budget))] <- lapply(budget,function(x) str_extract_all(x, "(\\$) *([0-9,]+)"))
return(budget)
}
#As your code is slow I'll subset movies to have 10 rows:
movies = movies[1:10,]
tmp =
lapply(movies[, links], function(x)
get_budget(link = paste0("https://www.imdb.com/",x),select=select )) %>%
rbindlist(., fill = T)
movies = cbind(movies, tmp)
Your result would look like this: [image: movies_result]
Finally, I think this little bit of advice would make your code look cooler:
setnames doesn't need the . placeholder from magrittr; the pipe already passes the data.table as the first argument.
When possible, avoid setnames(. ,old = colnames(.), new='links'). In your case setnames('links') is enough, since you are renaming all of your columns.
setnames(dt,old = oldnames, new=newnames) is only necessary when oldnames is not equal to names(dt).
Since DT is another popular R package, completely unrelated to data.table, I think it is better to call a data.table a data.table.
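The two setnames forms mentioned above can be sketched as follows. The table contents and the names titles_demo and movies_demo are made up for illustration:

```r
library(data.table)

# A one-column data.table, similar to what as.data.table() gives for a character vector.
titles_demo <- as.data.table(c("Movie A", "Movie B"))

# One argument is enough when every column is renamed: it is taken as the new names.
setnames(titles_demo, "title")

# old/new pairs are only needed to rename a subset of the columns.
movies_demo <- data.table(title = c("Movie A", "Movie B"),
                          V2 = c("/title/a/", "/title/b/"))
setnames(movies_demo, old = "V2", new = "links")
```

Both calls modify the table by reference, so no assignment is needed.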
I have the following question.
I am trying to harvest data from the Booking website (for me only, in order to learn the functionality of the rvest package). Everything's good and fine, the package seems to collect what I want and to put everything in the table (dataframe).
Here's my code:
library(rvest)
library(lubridate)
library(tidyverse)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
So in this chunk I collect the data from the first 60 pages, after first manually feeding the Booking search engine with the country of my choice (Spain), the dates I am interested in (just some arbitrary interval) and the number of people (I used the defaults here).
Then, I add this code to select the properties I want:
read_hotel <- function(url){ # collecting hotel names
ho <- read_html(url)
headline <- ho %>%
html_nodes("span.sr-hotel__name") %>% # the node I want to read
html_text() %>%
as_tibble()
}
hotels <- map_dfr(page_booking, read_hotel)
read_pr <- function(url){ # collecting price tags
pr <- read_html(url)
full_pr <- pr %>%
html_nodes("div.bui-price-display__value") %>% #the node I want to read
html_text() %>%
as_tibble()
}
fullprice <- map_dfr(page_booking, read_pr)
... and eventually save the whole data in the dataframe:
dfr <- tibble(hotels = hotels,
price_fact = fullprice)
I collect more parameters, but this doesn't matter. The final dataframe of 1500 rows and two columns is then created. But the problem is that the data in the second column does not correspond to the data in the first one, which is really strange and renders my dataframe useless.
I don't really understand how the package works in the background and why it behaves that way. I also noticed that the first rows in the first column of the dataframe (hotel name) do not correspond to the first hotels I see on the website. So the rvest package seems to use different search/sort/filter criteria.
Could you please explain to me the processes that take place during the rvest node hopping?
I would really appreciate at least some explanation, just to better understand the tool we work with.
You shouldn't scrape the hotels' names and prices separately like that. What you should do is get all the item nodes (hotels), then scrape the name and price relative to each hotel. With this method, you can't mess up the order.
library(rvest)
library(purrr)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
hotels <-
map_dfr(
page_booking,
function(url) {
pg <- read_html(url)
items <- pg %>%
html_nodes(".sr_item")
map_dfr(
items,
function(item) {
data.frame(
hotel = item %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
price = item %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
)
}
)
}
)
(The dot at the start of each XPath expression represents the current node, which is the hotel item.)
Update:
Here is an updated version of the code that I think is faster but still does the job:
hotels <-
map_dfr(
page_booking,
function(url) {
pg <- read_html(url)
items <- pg %>%
html_nodes(".sr_item")
data.frame(
hotel = items %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
price = items %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
)
}
)
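Why scraping per item keeps the rows aligned can be shown on a toy page. This is a sketch under assumptions: the HTML below is made up (it only reuses the class names from the code above), and a missing price is just one plausible reason for two separately scraped columns to drift apart.

```r
library(rvest)

# Toy search results for illustration only; the second item has no price.
toy_results <- read_html('<html><body>
<div class="sr_item"><span class="sr-hotel__name">Hotel A</span><div class="bui-price-display__value">US$100</div></div>
<div class="sr_item"><span class="sr-hotel__name">Hotel B</span></div>
<div class="sr_item"><span class="sr-hotel__name">Hotel C</span><div class="bui-price-display__value">US$300</div></div>
</body></html>')

toy_items <- toy_results %>% html_nodes(".sr_item")

# html_node() returns exactly one result per item (NA when nothing matches),
# so names and prices stay aligned row by row.
toy_hotels <- data.frame(
  hotel = toy_items %>% html_node(".sr-hotel__name") %>% html_text(trim = TRUE),
  price = toy_items %>% html_node(".bui-price-display__value") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

# Selecting prices from the whole page instead silently drops the missing one.
page_prices <- toy_results %>%
  html_nodes(".bui-price-display__value") %>%
  html_text(trim = TRUE)
```

Here toy_hotels has three rows with an NA price for the second hotel, while page_prices has only two entries and can no longer be matched to the three names by position.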
I was wondering how to store and retrieve the data from a for loop when aiming to scrape multiple websites in R.
library(rvest)
library(dplyr)
library(tidyverse)
library(glue)
cont<-rep(NA,101)
countries <- c("au","at","de","se","gb","us")
for (i in countries) {
sides<-glue("https://www.beeradvocate.com/beer/top-rated/",i,.sep = "")
html <- read_html(sides)
cont[i] <- html %>%
html_nodes("table") %>% html_table()
}
table_au <- cont[2] [[1]]
The idea is to get a list for each website respectively. When I run my code, table_au just shows me NA, presumably because the loop results are not stored.
It would be awesome, if someone could help me.
BR,
Marco
We can extract all the tables in a list.
library(rvest)
url <- "https://www.beeradvocate.com/beer/top-rated/"
temp <- purrr::map(paste0(url, countries), ~{
.x %>%
read_html() %>%
html_nodes("table") %>%
html_table(header = TRUE) %>% .[[1]]
})
If you want data as different dataframes like tab_au, tab_at, we can name the list and use list2env to get data separately.
names(temp) <- paste0('tab_', countries)
list2env(temp, .GlobalEnv)
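If you would rather keep the for loop, the likely reason the original returns NA is that i is a country code rather than a position, so cont[i] adds named elements after the 101 pre-filled NAs and cont[2] is never touched. A minimal sketch of the loop with a named list and [[ ]] assignment follows. The function fetch_tables is a hypothetical stand-in for the read_html/html_table step so that the example runs without network access:

```r
countries_demo <- c("au", "at", "de")

# Stand-in for read_html() %>% html_nodes("table") %>% html_table(); illustration only.
fetch_tables <- function(code) {
  list(data.frame(country = code, rank = 1:2, stringsAsFactors = FALSE))
}

# Pre-allocate a named list with one slot per country.
cont_demo <- vector("list", length(countries_demo))
names(cont_demo) <- countries_demo

for (code in countries_demo) {
  # [[ ]] stores the whole result under the country's name.
  cont_demo[[code]] <- fetch_tables(code)
}

# First table scraped for "au".
table_au <- cont_demo[["au"]][[1]]
```

Indexing by name avoids having to remember which position belongs to which country.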
I am trying to get a list of Companies and jobs in a table from indeed.com's job board.
I am using the rvest package using a URL Base of http://www.indeed.com/jobs?q=proprietary+trader&
install.packages("gtools")
install.packages("rvest")
library(rvest)
library(gtools)
mydata = read.csv("setup.csv", header=TRUE)
url_base <- "http://www.indeed.com/jobs?q=proprietary+trader&"
names <- mydata$Page
results<-data.frame()
for (name in names){
url <-paste0(url_base,name)
title.results <- url %>%
html() %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- url %>%
html() %>%
html_nodes(".company") %>%
html_text()
results <- smartbind(company.results, title.results)
results3<-data.frame(company=company.results, title=title.results)
}
new <- results(Company=company, Title=title)
and then looping a concatenation. For some reason it is not grabbing all of the jobs, and it is mixing up the companies and jobs.
It might be because you make two separate requests to the page. You should change the middle part of your code to:
page <- url %>%
read_html() # html() is deprecated in newer rvest versions; read_html() replaces it
title.results <- page %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- page %>%
html_nodes(".company") %>%
html_text()
When I do that, it seems to give me 10 jobs and companies which match. Can you give an example otherwise of a query URL that doesn't work?