I am trying to scrape an embedded tweet on a website. I believe the tweet is loaded via JSON. Ideally, I would be able to simply scrape the embedded tweet's ID. As far as I can tell, this data should be available with the CSS selector '#twitter-widget-0', but nothing is returned when I scrape using rvest.
My code is below:
page <- "https://deutsch.rt.com/amerika/86714-rund-woche-nach-russland-auch-china-schickt-militaer-nach-venezuela/"
read_html(page) %>%
html_nodes('#twitter-widget-0') %>%
html_text()
Something like this might help:
library(dplyr)
library(rvest)

page %>%
  read_html() %>%
  html_nodes("div.rtcode") %>%
  html_text()
#[1] "#Venezuela#China#Russia#Caracas#Chinese army soldiers arrived in
#Venezuela #Chinese People’s Liberation Army soldiers, as part of a
#cooperation program, #arrived, after delivering humanitarian supplies, to one
#of Venezuelan military #facilities. pic.twitter.com/HwZ9Ee67d0— Sukhoi Su-57
#frazor\U0001f1f7\U0001f1fa\U0001f1ee\U0001f1f3 (#I30mki) 1. April 2019"
Or if you want the unique Twitter URL:
page %>%
  read_html() %>%
  html_nodes("div.rtcode a") %>%
  html_attr("href") %>%
  grep("status", ., value = TRUE)
#[1] "https://twitter.com/I30mki/status/1112578904835981312?ref_src=twsrc%5Etfw"
Related
I am trying to scrape data from this webpage: https://www.premierleague.com/stats/top/players/saves. However, there are two pages of data I want to scrape. I have been able to scrape the first page with the code below:
library(RSelenium)
library(rvest)

remDr$navigate("https://www.premierleague.com/stats/top/players/saves")
epl <- read_html(remDr$getPageSource()[[1]])

rank   <- epl %>% html_nodes(".statsTableContainer .rank") %>% html_text()
player <- epl %>% html_nodes(".playerName") %>% html_text()
club   <- epl %>% html_nodes(".statNameSecondary") %>% html_text()
stat   <- epl %>% html_nodes(".statsTableContainer .text-centre") %>% html_text()

str(rank)
str(player)
str(club)
str(stat)

Saves <- data.frame(rank, player, club, stat)
I have been using the RSelenium package for the scraping. For the second page there isn't a different URL; you have to click the arrow at the side of the table. How do I scrape the second page when there's only an arrow to click?
I haven't been able to try anything, as I'm not sure where to even start; I haven't come across this problem before.
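One possible approach (a sketch only; the arrow's selector below is a guess and needs to be checked against the page's markup in the browser dev tools) is to have RSelenium click the arrow, wait for the table to refresh, and then read the page source a second time:
# ".paginationNextContainer" is an assumed selector -- inspect the page to confirm
next_btn <- remDr$findElement(using = "css selector", ".paginationNextContainer")
next_btn$clickElement()
Sys.sleep(3)  # give the table time to re-render

epl2 <- read_html(remDr$getPageSource()[[1]])
rank2 <- epl2 %>% html_nodes(".statsTableContainer .rank") %>% html_text()
# ...repeat the remaining selectors, then rbind() the page-1 and page-2 data frames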
I'm trying to use rvest to extract the actual and projected medals from this page:
https://projects.fivethirtyeight.com/olympics-medal-count/
but for some reason I can't get any text from it.
I've been trying multiple variations of:
page <- read_html("https://projects.fivethirtyeight.com/olympics-medal-count/")

page %>%
  html_nodes('.countries') %>%
  html_nodes('.actual-rank') %>%
  html_text()

page %>%
  html_nodes('.actual-rank') %>%
  html_text()

page %>%
  html_nodes('div.rank-value') %>%
  html_text()
But I never get anything other than some variation of blank/missing data.
Any help or direction would be really appreciated.
Thanks.
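One thing worth checking first is whether those class names are present in the static HTML at all; the page may build the table with JavaScript, in which case rvest alone will never see it. A quick test (the class names checked are the ones from the attempts above):
library(rvest)

src <- read_html("https://projects.fivethirtyeight.com/olympics-medal-count/") %>%
  as.character()

# FALSE here would mean the table is rendered client-side, and you'd need a
# headless browser (e.g. RSelenium) or the page's underlying JSON feed instead
grepl("actual-rank", src, fixed = TRUE)
grepl("rank-value", src, fixed = TRUE)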
I have a list of URLs (mesa$fullerurl) for documents, and I am trying to scrape a specific section of text from each one (the paragraphs under Risk Factors). The problem is that there is no unique HTML tag for this section that I can see. The best approach I can think of is to tell R to grab the text from the Risk Factors heading up to the next heading and put that in a new data frame, k10, but I am not sure how to specify this in R. Thanks!
Here is an example of the document that I am trying to scrape from:
https://www.sec.gov/Archives/edgar/data/72903/000007290319000010/xcel1231201810-k.htm
library(rvest)
library(purrr)

sec <- read_html("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000072903&type=10-k&dateb=&owner=exclude&count=40")
xcel <- sec %>%
  html_nodes("#documentsbutton") %>%
  html_attr("href")
xcel <- data.frame(xcel)

# the hrefs are relative, so prepend the domain
xcel$fullurl <- paste0("https://www.sec.gov", xcel$xcel)
xcel$fullurl <- as.character(xcel$fullurl)
mesa <- map_dfr(xcel$fullurl, ~ .x %>% read_html() %>% html_table() %>% .[[1]])
mesa <- subset(mesa, mesa$Type == "10-K" | mesa$Type == "10-K/A" | mesa$Type == "10-K405")
mesa
# strip the index page from each URL, keeping the filing's directory
s <- gsub("(.*)/.*", "\\1", xcel$fullurl)
table(xcel$fullurl)
xcel$fullurl <- s
xcel$fullurl <- paste0(xcel$fullurl, "/")

# append each document's filename to its directory
mesa$fullerurl <- paste0(xcel$fullurl, mesa$Document)
mesa$fullerurl <- as.character(mesa$fullerurl)
mesa$Document[mesa$Document == ""] <- NA
mesa$fullerurl
# Below is the problematic part: I don't know which selector to put in html_nodes()
k10 <- map_dfr(mesa$fullerurl, ~ .x %>% read_html() %>% html_nodes("") %>% html_text())
I am web scraping a page at
http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
From this url, I have built up a dataframe through the following code:
library(rvest)
library(purrr)

dflist <- map(.x = 1:417, .f = function(x) {
  Sys.sleep(5)
  url <- "http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk="
  read_html(url) %>%
    html_nodes(".title a") %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)
I have repeated the same code to get all the other data I was interested in, and it seems to work perfectly, although it is of course a little slow because of the Sys.sleep() calls.
My issue arose once I tried to scrape the individual project descriptions that should be included in the data frame.
For instance, the first project description is at
http://catalog.ihsn.org/index.php/catalog/7118/study-description
the second project description is at
http://catalog.ihsn.org/index.php/catalog/6606/study-description
and so forth.
My problem is that I can't find a dynamic way to scrape all the projects' pages and insert them into the data frame, since the number in the URLs is neither sequential nor at the end of the link.
To make things clearer, this is the structure of the website I am scraping:
1.http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
1.1. http://catalog.ihsn.org/index.php/catalog/7118
1.1.a http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary
I have successfully scraped level 1, but not level 1.1.b (study-description), the one I am interested in, since the dynamic element of the URL (in this case: 7118) is not consistent across the more than 6,000 pages at that level.
You have to extract the deeper URLs from the .title a nodes and then scrape those as well. Here's a small example of how to do that using rvest and the tidyverse:
library(tidyverse)
library(rvest)
scraper <- function(x) {
  Sys.sleep(5)
  url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
  html <- read_html(url)
  tibble(title       = html_nodes(html, ".title a") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".title a") %>% html_attr("href"))
}

result <- map_df(1:2, scraper) %>%
  mutate(study_description = map(project_url,
                                 ~ read_html(sprintf("%s/study-description", .x)) %>%
                                     html_node(".xsl-block") %>%
                                     html_text()))
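Note that map() gives you a list column there; if you want a plain character column, and a pause between the per-project requests as well, a variation like this should work:
result <- map_df(1:2, scraper) %>%
  mutate(study_description = map_chr(project_url, function(x) {
    Sys.sleep(5)  # stay polite on the per-project pages too
    read_html(sprintf("%s/study-description", x)) %>%
      html_node(".xsl-block") %>%
      html_text(trim = TRUE)
  }))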
This isn't complete as to all the things you want to do, but should show you an approach.
I am practicing my web scraping in R, and I cannot get past one step no matter what website I try.
For example,
https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music
My goal is to extract all 77 school names (Oxford to London Metropolitan).
So I tried...
library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()
From F12 (the browser dev tools) I could see that all the school names sit under the class '.league-table-institution-name', which is why I put that in html_nodes.
What have I done wrong?
You appear to be running html_nodes() twice: first on college, an xml_document (which is correct), and then again on info, the node set you just extracted. Searching inside those nodes for the same class finds nothing nested, so html_text() returns nothing.
Try this instead:
url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text()
and then you'll need an additional step to clean up the school names; this one was suggested (str_replace_all() is from the stringr package):
%>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
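Putting the whole thing together as one runnable snippet:
library(rvest)
library(stringr)

url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"

url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text() %>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")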