Unable to extract image links using rvest in R

I am unable to extract the links of images from a website.
I am new to data scraping. I have used SelectorGadget as well as the browser's inspect-element tool to get the class of the image, but to no avail.
library(rvest)

main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
urls <- main.page %>%
  html_nodes(".match-detail--item:nth-child(9) .lazyloaded") %>%
  html_attr("src")
sotu <- data.frame(urls = urls)
I am getting the following output:
<0 rows> (or 0-length row.names)

Certain classes and attributes don't show up in the scraped data because they are added by JavaScript after the page loads: the lazy-loading script only applies the .lazyloaded class and fills in src at runtime, while the raw HTML carries the URL in data-src. Just target img instead of .lazyloaded and data-src instead of src:
library(rvest)

main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
main.page %>%
  html_nodes(".match-detail--item:nth-child(9) img") %>%
  html_attr("data-src")
#### OUTPUT ####
[1] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/1.png&h=25&w=25"
[2] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[3] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[4] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[5] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[6] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[7] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[8] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[9] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[10] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[11] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[12] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"

Because the DOM is modified by JavaScript (the page uses React), rvest does not see the same layout your browser renders. A less optimal alternative is to regex the info out of the JavaScript object the links live in, then extract the links with a JSON parser:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)

url <- "https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974"
r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r, 'debuts":(.*?\\])')
json <- jsonlite::fromJSON(x[[1]][, 2])
print(json$imgicon)

Related

How to download data from the Reptile Database using R

I am using R to try to download images from the Reptile Database by filling in their form to search for specific images. For that, I am following previous suggestions for filling in an online form from R, such as:
library(httr)
library(tidyverse)

POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
That page contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures; simply map or apply a function over them to download each one with download.file() (a sketch follows the output below).
library(tidyverse)
library(rvest)

genus <- "Chamaeleo"
species <- "dilepis"

pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")
pics
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"

Inserting string into the middle of a URL in R

I am using rvest to scrape an IMDB list and want to access the full cast and crew for each title. Unfortunately, the link I scrape points to IMDB's summary page for the title, so it takes me to the wrong page.
This is the webpage I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt
This is the webpage I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl
Notice the addition of /fullcredits in the URL.
How can I insert /fullcredits into the middle of a URL I have built?
#install.packages("rvest")
#install.packages("dplyr")
library(rvest) #webscraping package
library(dplyr) #piping
link = "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
credits = "fullcredits/"
page = read_html(link)
name <- page %>% rvest::html_nodes(".lister-item-header a") %>% rvest::html_text()
movie_link = page %>% rvest::html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste("https://www.imdb.com", ., sep="")
Here is an option: take the dirname and basename of each link, replace the ref substring of the basename ("ttls_li_tt") with the new one ("tt_ql_cl"), and join the pieces back together with file.path, inserting "fullcredits" in between.
library(stringr)

movie_link2 <- file.path(dirname(movie_link), "fullcredits",
                         str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))
Output:
> head(movie_link2)
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
> tail(movie_link2)
[1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
Another way: strip the query string with gsub, then append the fullcredits path and the new ref parameter:
df1 <- gsub("\\?.*", "", movie_link)
df <- paste0(df1, 'fullcredits/?ref_=tt_ql_cl')
df
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
[7] "https://www.imdb.com/title/tt0137523/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0108052/fullcredits/?ref_=tt_ql_cl"
[9] "https://www.imdb.com/title/tt0118749/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0105236/fullcredits/?ref_=tt_ql_cl"
[11] "https://www.imdb.com/title/tt0111161/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0073195/fullcredits/?ref_=tt_ql_cl"
[13] "https://www.imdb.com/title/tt0075314/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0119488/fullcredits/?ref_=tt_ql_cl"

Information lost by html_table

I'm looking to scrape the third table from this website and store it as a data frame. Below is a reproducible example.
The third table is the one with "Isiah YOUNG" in the first row, third column.
library(rvest)
library(dplyr)

target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"

table <- target_url %>%
  read_html(options = c("DTDLOAD")) %>%
  html_nodes("[id^=splitevents]") # this is the correct node
So far so good. Printing table[[1]] shows the contents I want.
table[[1]]
{html_node}
<table id="splitevents" class="sortable" align="center">
[1] <tr>\n<th class="sorttable_nosort" width="20">Pl</th>\n<th class="sorttable_nosort" width="20">Ln</th>\n<th ...
[2] <td>1</td>\n
[3] <td>6</td>\n
[4] <td></td>\n
[5] <td>Isiah YOUNG</td>\n
[6] <td></td>\n
[7] <td>NIKE</td>\n
[8] <td>20.28 Q</td>\n
[9] <td><b><font color="grey">0.184</font></b></td>
[10] <td>2</td>\n
[11] <td>7</td>\n
[12] <td></td>\n
[13] <td>Elijah HALL-THOMPSON</td>\n
[14] <td></td>\n
[15] <td>Houston</td>\n
[16] <td>20.50 Q</td>\n
[17] <td><b><font color="grey">0.200</font></b></td>
[18] <td>3</td>\n
[19] <td>9</td>\n
[20] <td></td>\n
...
However, passing this to html_table results in an empty data frame.
table[[1]] %>%
  html_table(fill = TRUE)
[1] Pl Ln Athlete Affiliation Time
<0 rows> (or 0-length row.names)
How can I get the contents of table[[1]] (which clearly do exist) as a data frame?
The HTML is full of errors that trip up the parser, and I haven't seen an easy way to fix them.
An alternative, in this particular scenario, is to use the header count as the column count, derive the row count by dividing the total td count by the number of columns, and then use those dimensions to reshape the cell text into a matrix and on into a data frame:
library(rvest)
library(dplyr)

target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"

table <- read_html(target_url) %>%
  html_node("#splitevents")

tds <- table %>% html_nodes('td') %>% html_text() # all cell text
ths <- table %>% html_nodes("th") %>% html_text() # header text

num_col <- length(ths)
num_row <- length(tds) / num_col

df <- tds %>%
  matrix(nrow = num_row, ncol = num_col, byrow = TRUE) %>% # cells come row-major
  data.frame() %>%
  setNames(ths)
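A quick sanity check on the reshape's core assumption, namely that every row contributes exactly one td per header (a sketch):

length(tds) %% num_col == 0 # TRUE if the cells divide evenly into rows
dim(df)                     # should be num_row x num_col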

Download multiple files from a URL, using R

I have this URL: https://www.cnpm.embrapa.br/projetos/relevobr/download/index.htm with geographic information about Brazilian states. If you click on any state, you get a page of grids; clicking on any grid lets you download the geographic information for that specific grid.
What I need: to download all the grids at once. Is that possible?
You can scrape the page to get the URLs for the zip files, then iterate across the URLs to download everything:
library(rvest)

# get page source
h <- read_html('https://www.cnpm.embrapa.br/projetos/relevobr/download/mg/mg.htm')

urls <- h %>%
  html_nodes('area') %>% # get all `area` nodes
  html_attr('href') %>% # get the link attribute of each node
  sub('.htm$', '.zip', .) %>% # change file suffix
  paste0('https://www.cnpm.embrapa.br/projetos/relevobr/download/mg/', .) # append to base URL

# create a directory for it all
dir <- file.path(tempdir(), 'mg')
dir.create(dir)

# iterate and download; mode = 'wb' keeps the zips intact on Windows
lapply(urls, function(url) download.file(url, file.path(dir, basename(url)), mode = 'wb'))

# check it's there
list.files(dir)
#> [1] "sd-23-y-a.zip" "sd-23-y-b.zip" "sd-23-y-c.zip" "sd-23-y-d.zip" "sd-23-z-a.zip" "sd-23-z-b.zip"
#> [7] "sd-23-z-c.zip" "sd-23-z-d.zip" "sd-24-y-c.zip" "sd-24-y-d.zip" "se-22-y-d.zip" "se-22-z-a.zip"
#> [13] "se-22-z-b.zip" "se-22-z-c.zip" "se-22-z-d.zip" "se-23-v-a.zip" "se-23-v-b.zip" "se-23-v-c.zip"
#> [19] "se-23-v-d.zip" "se-23-x-a.zip" "se-23-x-b.zip" "se-23-x-c.zip" "se-23-x-d.zip" "se-23-y-a.zip"
#> [25] "se-23-y-b.zip" "se-23-y-c.zip" "se-23-y-d.zip" "se-23-z-a.zip" "se-23-z-b.zip" "se-23-z-c.zip"
#> [31] "se-23-z-d.zip" "se-24-v-a.zip" "se-24-v-b.zip" "se-24-v-c.zip" "se-24-v-d.zip" "se-24-y-a.zip"
#> [37] "se-24-y-c.zip" "sf-22-v-b.zip" "sf-22-x-a.zip" "sf-22-x-b.zip" "sf-23-v-a.zip" "sf-23-v-b.zip"
#> [43] "sf-23-v-c.zip" "sf-23-v-d.zip" "sf-23-x-a.zip" "sf-23-x-b.zip" "sf-23-x-c.zip" "sf-23-x-d.zip"
#> [49] "sf-23-y-a.zip" "sf-23-y-b.zip" "sf-23-z-a.zip" "sf-23-z-b.zip" "sf-24-v-a.zip"
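If you also want the contents extracted, unzipping is one more lapply over the downloaded files; a sketch, extracting into the same directory:

lapply(list.files(dir, full.names = TRUE), unzip, exdir = dir)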

html_nodes returning two results for a link

I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note: the use of download.file is to get around my company's firewall, per this answer.
library(dplyr)
library(rvest)

myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")

links <- content %>%
  html_nodes("a") %>% # "a" is the HTML anchor element, i.e. every link on the page
  html_attr("href") %>%
  data.frame()

# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the dataframe
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical() check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath instead of CSS selectors and picks only the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .) %>%
  head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
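If you'd rather stay with CSS selectors, the same filtering should work with an attribute substring selector; a sketch equivalent to the contains() expression above:

html_nodes(content, "a[href*='downfile']") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .) %>%
  head()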
