I am trying to scrap titles and contents from a list of urls using r. I am able to extract title and content for each article individually. However, I need to loop through these list of urls to get the title from each page and its content.
These are the urls, and they are stored in a csv file:
http://well.blogs.nytimes.com/2016/08/29/edible-sunscreens-all-the-rage-but-no-proof-they-work/?smid=fb-nytwell&smtyp=cur
http://www.nytimes.com/2016/08/30/well/live/how-12-epipens-saved-my-life.html?smid=fb-nytwell&smtyp=cur
http://www.nytimes.com/2016/08/29/opinion/why-we-never-die.html?smid=fb-nytwell&smtyp=cur
http://www.nytimes.com/2016/08/31/health/how-to-ride-downhill-on-a-bicycle.html?smid=fb-nytwell&smtyp=cur
http://www.cbssports.com/college-football/news/one-sweet-gesture-by-fsus-travis-rudolph-makes-mom-of-an-autistic-boy-cry/
http://www.nytimes.com/2016/08/31/well/family/what-kids-wish-their-teachers-knew.html?smid=fb-nytwell&smtyp=cur
This is the code I used to extract each article individually (note that each paragraph of the content is considered to be a node and when I extract these nodes each one appears in a new raw while I need them to be only in the first raw).
install.packages('xml2')
library(xml2)
library(rvest)
url <- "http://well.blogs.nytimes.com/2016/08/29/edible-sunscreens-all-the-rage-but-no-proof-they-work/?smid=fb-nytwell&smtyp=cur"
article <- read_html(url)
title <- article %>% html_node(".entry-title") %>% html_text()
content <- article %>% html_nodes(".story-body-text") %>% html_text()
article_table <- data.frame(title, content)
article_table
You need to collapse the output to get into single line for the article
content <-
article %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = "")
For multiple urls, its been already answered here
Adapted it to your case. please note that, .entry-title tag does not work for all urls. you need to use title
library(rvest)
library(purrr)
article <- listofurls %>% map(read_html)
title <-
article %>% map_chr(. %>% html_node("title") %>% html_text())
content <-
article %>% map_chr(. %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = ""))
article_table <- data.frame("Title" = title, "Content" = content)
dim(article_table)
Related
Testing out different keywords on Google News to web-scrape headlines and urls, but somehow some keywords do not have matching number of headlines and urls.
library(rvest)
library(stringr)
library(magrittr)
link = "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Results:
Error in data.frame(headline, url) :
arguments imply differing number of rows: 82, 85
But with other keywords, this seems to work fine.
link = "https://news.google.com/search?q=international%20petroleum&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Anyone knows the issue for this, and how to fix it? Thanks
With those selectors you are extracting headlines from different nodes than hrefs and there doesn't seem to be fixed 1:1 relation between those two. At the time of writing your first search results with some nested headlines and that's probably the reason why your headline and url count does not match.
Get the url and text from the same node and you should be covered:
url <- "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline_links <- read_html(url) %>% html_nodes('a.DY5T1d')
data.frame(
headline = headline_links %>% html_text(),
url = headline_links %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
)
I am trying to collect a number of links from a website.
For example I have the following and my idea was to collect the link where it says leer más which is where I get the xpath from.
url = "https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181"
x <- GET(url, add_headers('user-agent' = desktop_agents[sample(1:10, 1)]))
x %>%
read_html() %>%
html_nodes(xpath = '//*[#id="App"]/div[2]/div[1]/main/div/div[3]/section/article[1]/div/a/p/span[2]')
This gives me the following but not the link:
{xml_nodeset (1)}
[1] <span class="re-CardDescription-link">Leer más</span>
Additionally, I thought about collecting all links:
x %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
This gives me a lot of links but not the links to the individual webpages I want.
I would like to have a list of links such as:
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-terraza-trastero-ascensor-amueblado-internet/162262978/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-trastero-ascensor-amueblado/159750574/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-jardin-zona-comunitaria-ascensor-patio-amueblado-parking-television-internet-piscina/162259162/d
Those links are stored inside a JavaScript object within a script tag. You can regex out the string defining that object, do some unescapes to enable jsonlite to parse, then apply a custom function to extract just the urls of interest to the json object
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
link <- 'https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181'
p <- read_html(url) %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)".*?;')[,2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
links <- purrr::map(data$initialSearch$result$realEstates, ~ .x$detail$`es-ES` %>% url_absolute(link))
I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code i have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But, i get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the datawrapper page to extract Table 1.
I will be really grateful if someone can point out the mistake i am making, or suggest a solution to scrape this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the csv download url from the iframe page then request that csv
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]
csv_url <- paste(iframe,id, 'dataset.csv', sep = '/' )
data <- vroom(csv_url, show_col_types = FALSE)
I try to scape the prices, area and addresses from all flats of this homepage (https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden)
Getting the data for one list element with Rvest and xpath works fine (see code), but I don´t know how to get the ID of each list element to loop through all elements.
Here is a part of the html-code with the data-go-to-expose-id I need for the loop. How can I get all IDs?
<span class="slick-bg-layer"></span><img alt="Immobilienbild" class="gallery__image block height-full" src="https://pictures.immobilienscout24.de/listings/541dfd45-c75a-4da7-a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80">a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80"></a>
And here is my current R-code to fetch the data from one list element:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
address <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[#id="result-103049161"]/div[2]/div[2]/div[1]/div[2]/div[2]/a') %>% html_text()
price <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[#id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[1]/dd') %>% html_text()
area <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[#id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[2]/dd') %>% html_text()
Does this get what you are after
library("tidyverse")
library("httr")
library("rvest")
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
x <- read_html(url)
x %>%
html_nodes("#listings") %>%
html_nodes(".result-list__listing") %>%
html_attr("data-id")
I used this script to extract the text from a webpage
url <- "http://www.dlink.com/it/it"
doc <- getURL(url)
#get the text from the body
html <- htmlTreeParse(doc, useInternal = TRUE)
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
txt<-toString(txt)
but the problem is that it takes just the words in the first page, how can I extend it to the whole website?
I'd go with rvest to scrape the links and purrr to iterate:
library(rvest)
library(purrr)
url <- "http://www.dlink.com/it/it"
r <- read_html(url) %>%
html_nodes('a') %>%
html_attr('href') %>%
Filter(function(f) !is.na(f) & !grepl(x = f, pattern = '#|facebook|linkedin|twitter|youtube'), .) %>%
map(~{
print(.x)
html_session(url) %>%
jump_to(.x) %>%
read_html() %>%
html_nodes('body') %>%
html_text() %>%
toString()
})
I filtered out social nets and dead links from the list of links, some more tuning might be in order.
Be advised that you will be scraping a lot of garbage. Some more targeting on what to scrape inside each page might be needed (ie: something more specfic than the whole body tag)