The webpage https://polititweet.org/ stores the complete tweet history of certain politicians, CEOs, and so on. Importantly, it also provides the deleted tweets I am interested in. Now I would like to write a web scraper in R to retrieve the texts of the deleted tweets from Elon Musk, but I fail because all I manage to extract from the HTML is an href.
This is my attempt (after an edit prompted by @Bensstats):
library(rvest)
url_page1<- read_html("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")
tweets_deleted <- html_nodes(url_page1, ".tweet-card") |> html_attr("href")
tweets_deleted
With this, I get the IDs of the deleted tweets on page 1. However, what I want is the deleted text itself.
Moreover, there are 9 pages of deleted tweets for Musk. Since this number of pages is likely to increase in the future, I would like to extract the number of pages automatically and then automate the process for each page (via a loop or something like that).
I would really appreciate it if any of you has an idea of how to fix these problems!
Thanks a lot!
Get all of Elon's deleted tweets, pages 1 to 9:
library(tidyverse)
library(rvest)
get_tweets <- function(page) {
  tweets <-
    str_c(
      "https://polititweet.org/tweets?page=",
      page,
      "&deleted=True&account=44196397&search="
    ) %>%
    read_html() %>%
    html_elements(".tweet-card")
  tibble(
    tweeter = tweets %>%
      html_element("strong+ .has-text-grey") %>%
      html_text2(),
    tweet = tweets %>%
      html_element(".small-top-margin") %>%
      html_text2(),
    time = tweets %>%
      html_element(".is-paddingless") %>%
      html_text2() %>%
      str_remove_all("Posted ")
  )
}
map_dfr(1:9, get_tweets)
# A tibble: 244 × 3
tweeter tweet time
<chr> <chr> <chr>
1 #elonmusk "#BBCScienceNews Tesla Megapacks are extremely e… Nov.…
2 #elonmusk "\u2014 PolitiTweet.org" Nov.…
3 #elonmusk "#BuzzPatterson They could help it \U0001f923 \u… Nov.…
4 #elonmusk "\U0001f37f \u2014 PolitiTweet.org" Nov.…
5 #elonmusk "Let\u2019s call the fact-checkers \u2026 \u2014… Nov.…
6 #elonmusk "#SharkJumping \u2014 Po… Nov.…
7 #elonmusk "Can you believe this app only costs $8!? https:… Nov.…
8 #elonmusk "#langdon #EricFrohnhoefer #pokemoniku He\u2019s… Nov.…
9 #elonmusk "#EricFrohnhoefer #MEAInd I\u2019ve been at Twit… Nov.…
10 #elonmusk "#ashleevance #mtaibbi #joerogan Twitter drives … Nov.…
# … with 234 more rows
# ℹ Use `print(n = ...)` to see more rows
Since you wanted it to automatically detect the pages and scrape them, here's a possible solution where you just supply a link to the function:
get_tweets <- function(link) {
  page <- link %>%
    read_html()
  pages <- page %>%
    html_elements(".pagination-link") %>%
    last() %>%
    html_text2() %>%
    as.numeric()
  twitter <- function(link, page) {
    tweets <-
      link %>%
      str_replace(pattern = "page=1", str_c("page=", page)) %>%
      read_html() %>%
      html_elements(".tweet-card")
    tibble(
      tweeter = tweets %>%
        html_element("strong+ .has-text-grey") %>%
        html_text2(),
      tweet = tweets %>%
        html_element(".small-top-margin") %>%
        html_text2(),
      time = tweets %>%
        html_element(".is-paddingless") %>%
        html_text2() %>%
        str_remove_all("Posted ")
    )
  }
  map2_dfr(link, 1:pages, twitter)
}
get_tweets("https://polititweet.org/tweets?page=1&deleted=True&account=44196397&search=")
I highly recommend this tool to help you select CSS elements.
Edit
You are going to want to change the CSS selector
library(rvest)
url <- read_html("https://polititweet.org/tweets?account=44196397&deleted=True")
tweets_deleted <- html_nodes(url, "p.small-top-margin") |> html_text()
tweets_deleted %>%
gsub('\\n','',.)
Related
I'm attempting to scrape a table of read books from the Goodreads website using rvest. The data is formatted as a table; however, I am getting errors when attempting to extract this info.
First we load some packages and set the url to scrape
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
Running this code:
dat <- read_html(url) %>%
html_nodes('//*[@id="booksBody"]') %>%
html_table()
Produces: Error in tokenize(css) : Unexpected character '/' found at position 1
Trying it again, but without the first /:
dat <- read_html(url) %>%
html_nodes('/*[@id="booksBody"]') %>%
html_table()
Produces: Error in parse_simple_selector(stream) : Expected selector, got <EOF at 20>
And finally, just trying to get the table directly, without the intermediate call to html_nodes:
dat <- read_html(url) %>%
html_table('/*[@id="booksBody"]')
Produces: Error in if (header) { : argument is not interpretable as logical
Would appreciate any guidance on how to scrape this table
Scraping the first 5 pages
library(tidyverse)
library(rvest)
library(httr2)
get_books <- function(page) {
  cat("Scraping page:", page, "\n")
  books <-
    str_c("https://www.goodreads.com/review/list/4622890-emily-may?page=", page,
          "&shelf=%23ALL%23") %>%
    read_html() %>%
    html_elements(".bookalike.review")
  tibble(
    title = books %>%
      html_elements(".title a") %>%
      html_text2(),
    author = books %>%
      html_elements(".author a") %>%
      html_text2(),
    rating = books %>%
      html_elements(".avg_rating .value") %>%
      html_text2() %>%
      as.numeric(),
    date = books %>%
      html_elements(".date_added .value") %>%
      html_text2() %>%
      lubridate::mdy()
  )
}
df <- map_dfr(0:5, get_books)
# A tibble: 180 x 4
title author rating date
<chr> <chr> <dbl> <date>
1 Sunset "Cave~ 4.19 2023-01-14
2 Green for Danger (Inspector Cockrill~ "Bran~ 3.84 2023-01-12
3 Stone Cold Fox "Crof~ 4.22 2023-01-12
4 What If I'm Not a Cat? "Wint~ 4.52 2023-01-10
5 The Prisoner's Throne (The Stolen He~ "Blac~ 4.85 2023-01-07
6 The Kind Worth Saving (Henry Kimball~ "Swan~ 4.13 2023-01-06
7 Girl at War "Novi~ 4 2022-12-29
8 If We Were Villains "Rio,~ 4.23 2022-12-29
9 The Gone World "Swet~ 3.94 2022-12-28
10 Batman: The Dark Knight Returns "Mill~ 4.26 2022-12-28
# ... with 170 more rows
# i Use `print(n = ...)` to see more rows
I can get the first 30 books using this -
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
book_table <- read_html(url) %>%
html_elements('table#books') %>%
html_table() %>%
.[[1]]
book_table
There is some cleaning you might need to do on the captured data. Moreover, to get the complete list I am afraid rvest will not be enough; you might need to use something like RSelenium to scroll through the list (see the sketch below).
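If you do go the RSelenium route, here is a rough sketch of what that could look like (my own illustration, not tested against the live site: the scroll count, sleep times, browser choice, and driver setup are all assumptions you would need to adjust):

library(RSelenium)
library(rvest)

# Start a Selenium server and a browser (driver/port setup varies by machine)
rd <- rsDriver(browser = "firefox")
remDr <- rd$client
remDr$navigate("https://www.goodreads.com/review/list/4622890?shelf=read")

# Scroll repeatedly so the lazily loaded rows render (20 scrolls is a guess)
for (i in 1:20) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)
}

# Hand the fully rendered page to rvest
book_table <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_element("table#books") %>%
  html_table()

remDr$close()
rd$server$stop()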
I am trying to scrape details from a website in order to gather details for pictures with a script in R.
What I need is:
Image name (1.jpg)
Image caption ("A recruit demonstrates the proper use of a CO2 portable extinguisher to put out a small outside fire.")
Photo credit ("Photo courtesy of: James Fortner")
There are over 16,000 files, and thankfully the web URL goes "...asp?photo=1, 2, 3, 4", so there is a base URL which doesn't change, just the last section with the image number. I would like the script to loop either for a set number (I tell it where to start) or to just break when it gets to a page which doesn't exist.
Using the code below, I can get the caption of the photo, but only one line. I would also like to get the photo credit, which is on a separate line; there are three line breaks between the main caption and the photo credit. I'd be fine if the generated table had two or three blank columns to account for the lines, as I can delete them later.
library(rvest)
library(dplyr)
link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)
caption = page %>% html_nodes(".text7 i") %>% html_text()
info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")
Scraping with rvest and tidyverse
library(tidyverse)
library(rvest)
get_picture <- function(page) {
  cat("Scraping page", page, "\n")
  page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page) %>%
    read_html()
  tibble(
    image_name = page %>%
      html_element(".text7 img") %>%
      html_attr("src"),
    caption = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(1),
    credit = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(3)
  )
}
# Get the first 1:50
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))
# A tibble: 42 × 3
image_name caption credit
<chr> <chr> <chr>
1 /photos/1.jpg Recruit Clay Hamric demonstrates the use… James…
2 /photos/2.jpg A recruit demonstrates the proper use of… James…
3 /photos/3.jpg Recruit Paul Melnick demonstrates the pr… James…
4 /photos/4.jpg Rescue 104 James…
5 /photos/5.jpg Rescue 104 James…
6 /photos/6.jpg Rescue 104 James…
7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
For the images, you can use the command line tool curl. For example, to download images 1.jpg through 100.jpg
curl -O "http://fallschurchvfd.org/photos/[0-100].jpg"
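If you would rather stay inside R for the image downloads, here is a minimal sketch with base R's download.file() (the /photos/ URL pattern is taken from the curl example above; the destination file names and the try() error handling are my assumptions):

for (i in 1:100) {
  img_url <- sprintf("http://fallschurchvfd.org/photos/%d.jpg", i)
  # skip over numbers that do not exist instead of stopping
  try(download.file(img_url, destfile = basename(img_url), mode = "wb"), silent = TRUE)
}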
For the R code, if you grab the whole .text7 section, then you can split into caption and photo credit subsequently:
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
caption <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]
credit <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
As a loop
library(rvest)
library(tidyverse)

df <- data.frame(id = 1:20,
                 image = NA,
                 caption = NA,
                 credit = NA)

for (i in 1:20) {
  cat(i, " ")  # to monitor progress and debug
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  tryCatch({  # this avoids stopping on an error for missing pages
    page <- read_html(link)
    df$image[i] <- page %>% html_nodes(".text7 img") %>% html_attr("src")
    extractedtext <- page %>% html_nodes(".text7") %>% html_text()
    df$caption[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]  # an awkward way of saying "list 1, element 1"
    df$credit[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
  },
  error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
}
I get inconsistent results with this current code; for example, page 15 has more line breaks than page 1.
TODO: enhance string extraction; switch to an 'append' method of adding data to a data.frame (vs pre-allocate and insert).
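One possible way to tackle both TODO items at once (a sketch only, not the original code: it assumes the credit is the second non-empty line after the caption, which may not hold for every page): split on runs of line breaks and tabs instead of the exact "\r\n\t\t\t\t" sequence, and collect rows in a list that is bound once at the end rather than pre-allocating.

library(rvest)
library(tidyverse)

results <- list()
for (i in 1:20) {
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  page <- tryCatch(read_html(link), error = function(e) NULL)
  if (is.null(page)) next  # skip missing pages
  extractedtext <- page %>% html_node(".text7") %>% html_text()
  # split on any run of line breaks/tabs, squish whitespace, drop empty pieces
  parts <- str_split(extractedtext, "[\r\n\t]+")[[1]] %>%
    str_squish() %>%
    discard(~ .x == "")
  results[[length(results) + 1]] <- tibble(
    id      = i,
    image   = page %>% html_node(".text7 img") %>% html_attr("src"),
    caption = parts[1],  # first non-empty line assumed to be the caption
    credit  = parts[2]   # second non-empty line assumed to be the credit
  )
}
df <- bind_rows(results)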
I'm trying to scrape the information about the nurse jobs on that link: https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR
I managed to do it on the first page of results. But when I try to do it on the other few hundreds pages, read_html() doesn't work anymore.
The first page works perfectly fine:
install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)
link = "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR"
page = read_html(link)
But then for the following code I get the error message: Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, : Failed to parse text
link = "https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=2"
page = read_html(link)
Could you please tell me where I'm wrong when I scrape the second page of results? Thanks
[EDIT] Thanks for the answers. For anybody interested, this is what I ended up doing using @Dave2e's answer (I am too much of a beginner to use RSelenium), and it works fine (with scraping_onepage being the function I created to scrape one page):
# extract the number of pages of results
link = "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR"
page = read_html(link)
extract = page %>% html_nodes(".total") %>% html_text()
number_pages = as.numeric(substring(extract, 24, 26))  # as.numeric() so it can be used as a loop bound

# initialization of nurse_jobs for the loop
nurse_jobs <- scraping_onepage(page)

# loop
s <- session("https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR")
for (page_result in seq(from = 2, to = number_pages, by = 1)) {
  link = paste0("https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=", page_result)
  s1 <- session_jump_to(s, link)  # method: https://stackoverflow.com/questions/73044507/read-html-returning-error-in-read-xml-raw-failed-to-parse-text-while
  page = read_html(s1)
  nurse_jobs1 <- scraping_onepage(page)
  nurse_jobs = rbind(nurse_jobs, nurse_jobs1)
}
Here I scraped pages 2 to 100 without any error. It should work for all 362 pages available. The code is inspired by @Dave2e's answer.
library(tidyverse)
library(rvest)
library(httr2)

ses <-
  "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR" %>%
  session()

# detect the total number of pages from the pagination links
# (not used below, but it could replace the hard-coded 2:100)
n_pages <- ses %>%
  read_html() %>%
  html_element("li:nth-child(10) a") %>%
  html_text2() %>%
  as.numeric()

get_info <- function(index_page) {
  cat("Scraping page", index_page, "...", "\n")
  page <- session_jump_to(ses,
                          paste0("https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=",
                                 index_page)) %>%
    read_html()
  tibble(
    from_page = index_page,
    position = page %>%
      html_elements("h2 a") %>%
      html_text2(),
    practice = page %>%
      html_elements(".vacancy h3") %>%
      html_text2(),
    salary = page %>%
      html_elements(".salary") %>%
      html_text2(),
    type = page %>%
      html_elements(".left dl~ dl+ dl dd") %>%
      html_text2()
  )
}
df <- map_dfr(2:100, get_info)
# A tibble: 1,980 × 5
from_page position practice salary type
<int> <chr> <chr> <chr> <chr>
1 2 Practice Nurse or Nurse Practitioner General … Depen… Perm…
2 2 Practice Nurse General … Depen… Perm…
3 2 Practice Nurse General … Depen… Perm…
4 2 Practice Nurse General … Depen… Perm…
5 2 Practice Nurse General … Depen… Perm…
6 2 Practice Nurse General … Depen… Perm…
7 2 Practice Nurse General … Depen… Perm…
8 2 Practice Nurse General … Depen… Perm…
9 2 Practice Nurse General … Depen… Perm…
10 2 Staff Nurse Neurology £2565… Perm…
# … with 1,970 more rows
You may be able to create a session and then jump from page to page:
library(rvest)
s <- session("https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR")

link = "https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=2"

# jump to the next page
s <- session_jump_to(s, link)

page = read_html(s)
page %>% html_elements("div.vacancy")

session_history(s)  # display history
This should work, but I have not fully tested it to verify.
If you want to scrape a few hundred pages with an easy pagination structure (a next-page button), you might be better off using something like RSelenium to automate the clicking and scraping process; a sketch of that approach follows below. A clever trick for XPaths is Google Chrome -> Inspect -> right-click the element's code -> Copy XPath, and you can do that for the next-page button. Previous iterations of this issue involved encoding errors, but the encoding for this site is UTF-8 and it still fails even when that is specified, which suggests the page is rendered with JavaScript and again points to Selenium as the best approach. Alternatively, if the coding is too difficult, you can use Octoparse, a free web-scraping tool that makes pagination loops easy.
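A minimal sketch of that RSelenium click-through approach (untested: the next-page XPath is a placeholder to be replaced with the one copied from Chrome's inspector, and the page count and sleep time are arbitrary):

library(RSelenium)
library(rvest)

rd <- rsDriver(browser = "firefox")
remDr <- rd$client
remDr$navigate("https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse%20Sister%20Matron&logic=OR")

pages <- list()
for (i in 1:5) {  # small page count for demonstration
  # parse the currently displayed page with rvest
  pages[[i]] <- remDr$getPageSource()[[1]] %>% read_html()
  # placeholder XPath for the next-page button; replace with the real one
  next_btn <- remDr$findElement(using = "xpath", value = '//a[contains(., "Next")]')
  next_btn$clickElement()
  Sys.sleep(2)  # give the next page time to load
}

remDr$close()
rd$server$stop()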
I am using Rvest to scrape google news.
However, I encounter missing values in the "Time" element from time to time for different keywords. Since those values are missing, I end up with a "different number of rows" error when building the data frame of the scraping result.
Is there any way to fill in NA for these missing values?
Below is the example of the code I am using.
html_dat <- read_html(paste0("https://news.google.com/search?q=",Search,"&hl=en-US&gl=US&ceid=US%3Aen"))
dat <- data.frame(Link = html_dat %>%
html_nodes('.VDXfz') %>%
html_attr('href')) %>%
mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
news_dat <- data.frame(
Title = html_dat %>%
html_nodes('.DY5T1d') %>%
html_text(),
Link = dat$Link,
Description = html_dat %>%
html_nodes('.Rai5ob') %>%
html_text(),
Time = html_dat %>%
html_nodes('.WW6dff') %>%
html_text()
)
Without knowing the exact page you were looking at, I tried the first Google News page.
In the rvest package, html_node (without the s) always returns a value, even if it is NA. Therefore, to keep the vectors the same length, one needs to find the common parent node for all of the desired data nodes and then parse the desired information from each of those nodes.
Assuming the Title node is the most complete, I went up one level with xml_parent() and attempted to retrieve the same number of description nodes; this didn't work. I then tried two levels up using xml_parent() %>% xml_parent(), and this seems to work.
library(rvest)
url <-"https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)
Title = html_dat %>% html_nodes('.DY5T1d') %>% html_text()
# Link = dat$Link
Link = html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/",Link)
#Find the common parent node
#(this was trial and error) Tried the parent then the grandparent
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description = Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time = Titlenodes %>% html_node('.WW6dff') %>% html_text()
answer <- data.frame(Title, Time, Description, Link)
I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal xpath approach (Python) or Rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links){
keep = substr(x, 10, 36)
url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
webpage2 = read_html(url2)
home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
#code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the urls for all boxscores in 2016, and the for loop goes to each boxscore page with the hopes of extracting the tables led by "Insert Team Here" Starters.
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally it should contain the table or elements of the table I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. This uses my example, but I'm sure you can apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
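Applied to one of the boxscore pages from the question, the same comment trick might look like this (a sketch: the table id "home_starters" is an assumption based on the all_/div_ prefixes in the question's XPaths, so inspect the page to confirm it):

library(rvest)

"https://www.pro-football-reference.com/boxscores/201609110rav.htm" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # the starters tables sit inside HTML comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#home_starters') %>%    # assumed id; adjust after inspecting the page
  html_table()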