How to download data from the Reptile Database using R

I am using R to try to download images from the Reptile-database by filling in their advanced-search form to look for specific images. For that, I am following previous suggestions on how to fill in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )
) -> res

out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This page contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.

Here I only extract the links to the pictures. Simply map or apply a function over them to download each one with download.file() (see the sketch after the output below).
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"

pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"

Related

Can't Scrape a table from naturereport.miljoeportal.dk using rvest

I am trying to scrape a table from the following site (https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1).
I am using rvest and the SelectorGadget to try to make it work, but so far I have only been able to get it in text form.
What I need to extract:
I am mostly interested in extracting the number of species in two sections, the Stjernearter and the 2-stjernearter. In the Firefox developer tools these correspond to a table, but trying to get the table with SelectorGadget I have not had any success.
What I have tried:
These are some ideas I have tried with limited success.
I have been able to get the text, but not the table, with these two snippets:
library(rvest)
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(":nth-child(9) .table-col") %>%
  html_text()
This gets me the following:
[1] "\r\n\t\t\t\t\t\t\tStjernearter (arter med artsscorer = 4 eller 5):\r\n\t\t\t\t\t\t"
[2] "Strandarve | Honckenya peploides"
[3] "Bidende stenurt | Sedum acre"
[4] "\r\n\t\t\t\t\t\t\t2-stjernearter (artsscore = 6 eller 7):\r\n\t\t\t\t\t\t"
[5] "Ingen arter registreret"
[6] "\r\n\t\t\t\t\t\t\t N-følsomme arter:\r\n\t\t\t\t\t\t "
[7] "Bidende stenurt | Sedum acre"
[8] "\r\n\t\t\t\t\t\t\tProblemarter:\r\n\t\t\t\t\t\t"
[9] "Ingen arter registreret"
[10] "\r\n\t\t\t\t\t\t\tInvasive arter:\r\n\t\t\t\t\t\t"
[11] "Ingen arter registreret"
[12] "\r\n\t\t\t\t\t\t\tHabitatdirektivets bilagsarter:\r\n\t\t\t\t\t\t"
[13] "Ingen arter registreret"
[14] "\r\n\t\t\t\t\t\t\tRødlistede arter:\r\n\t\t\t\t\t\t"
[15] "Ingen arter registreret"
[16] "\r\n\t\t\t\t\t\t\tFredede arter:\r\n\t\t\t\t\t\t"
[17] "Ingen arter registreret"
[18] "\r\n\t\t\t\t\t\t\tAntal arter:\r\n\t\t\t\t\t\t"
[19] "Mosser: 1 fund"
[20] "Planter: 7 fund"
And I get a similar result with
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(":nth-child(9) .table-col") %>%
  html_text2()
I have also tried the following:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(":nth-child(9) .table-col") %>%
  html_table()
and
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(".report-body") %>%
  html_table()
This will be done for several sites that I will loop over, so I need the result in table format.
Edit
It seems that this code is bringing me closer to the answer:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(".report-section-body")
The eighth element has the table, but I have not been able to extract it:
Test <- rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(".report-section-body")
Test[8]
{xml_nodeset (1)}
[1] <div class="report-section-body"><div class="table">\n<div class="
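Since this "table" is built from div elements rather than a real table tag, html_table() has nothing to parse. A minimal sketch under that assumption; the section index 8 and the headers-end-with-a-colon rule are both inferred from the output above, not guaranteed by the site:

library(rvest)
library(tibble)

page <- read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1")
sections <- html_elements(page, ".report-section-body")

# Pull the cell text out of the eighth section
cells <- sections[[8]] %>%
  html_elements(".table-col") %>%
  html_text2() %>%
  trimws()

# Headers end with a colon (an assumption based on the text output
# above); attach each value to its most recent header
is_header <- grepl(":$", cells)
tbl <- tibble(
  header = cells[is_header][cumsum(is_header)],
  value  = cells
)[!is_header, ]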

Looking for recommendations on scraping specific data from this unstructured, paginated HTML website

As the title describes, I am trying to extract data from a website. Specifically, I'm trying to extract host susceptibility and host insusceptibility data from each of the species pages found here.
These data can be found on individual species-specific pages, for example for Abelia latent tymovirus at its respective URL.
I am struggling to extract these data as the HTML seems to be very unstructured. For example, host susceptibility/insusceptibility always appears in an h4 node, but alongside other varying headers and list items.
This is my first go at web scraping and I have been trying with the Chrome plugin Web Scraper, which seems very intuitive and flexible. I have been able to get the scraper to visit the multiple pages, but I can't seem to direct it to specifically collect the susceptibility/insusceptibility data. I attempted using SelectorGadget to identify exactly what my selector should be, but the lack of structure in the HTML made this ineffective.
Any advice on how I can change my plan of attack for this?
I am also open to trying to extract the data using R's rvest package. I have so far been able to read the html from a specific page, extract the h4 and li elements, and clean up the line breaks. Reproducible code:
library(rvest)
library(stringr)

pvo <- read_html("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")

pvo %>%
  html_elements("h4, li") %>%
  html_text() %>%
  str_replace_all("[\n]", "")
Which seems to provide me with what I want plus extraneous data:
...
[20] "Susceptible host species "
[21] "Chenopodium amaranticolor"
[22] "Chenopodium quinoa"
[23] "Cucumis sativus"
[24] "Cucurbita pepo"
[25] "Cynara scolymus"
[26] "Gomphrena globosa"
[27] "Nicotiana benthamiana"
[28] "Nicotiana clevelandii"
[29] "Nicotiana glutinosa"
[30] "Ocimum basilicum"
[31] "Vigna unguiculata"
[32] "Insusceptible host species"
[33] "Nicotiana rustica"
[34] "Nicotiana tabacum"
[35] "Phaseolus vulgaris "
...
From here, I am unfamiliar with how to specifically select/filter the desired information from the string. I have tried some stringr, gsub, and rm_between filter functions, but all attempts have been unsuccessful. I wouldn't know where to start to make this code visit the many species pages on the online database, or how to instruct it to save the aggregate data. What a road I have ahead of me!
Here is one trick.
You can get the indices of
'Susceptible host species'
'Insusceptible host species'
'Families containing susceptible hosts'
Everything between the first and second index is susceptible_species, and everything between the second and third is insusceptible_species.
library(rvest)
library(stringr)

pvo <- read_html("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")
all_values <- pvo %>% html_elements("h4, li") %>% html_text()

# Anchor the patterns at the start of the string so that
# 'Susceptible host species' does not also match
# 'Insusceptible host species'
sus_index <- grep('^Susceptible host species', all_values)
insus_index <- grep('^Insusceptible host species', all_values)
family_sus_index <- grep('^Families containing susceptible hosts', all_values)

susceptible_species <- all_values[(sus_index + 1):(insus_index - 1)]
susceptible_species
# [1] "Chenopodium amaranticolor" "Chenopodium quinoa" "Cucumis sativus"
# [4] "Cucurbita pepo" "Cynara scolymus" "Gomphrena globosa"
# [7] "Nicotiana benthamiana" "Nicotiana clevelandii" "Nicotiana glutinosa"
#[10] "Ocimum basilicum" "Vigna unguiculata"
insusceptible_species <- all_values[(insus_index+1):(family_sus_index-1)]
insusceptible_species
#[1] "Nicotiana rustica" "Nicotiana tabacum" "Phaseolus vulgaris "

How to extract data from different hyperlinks of a web page

I want to extract data from the different hyperlinks of this web page.
I was using the following code to extract the table of hyperlinks:
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)

df <- webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
With this code I was able to extract all the hyperlinks in a table, but I don't have any idea how to extract data from each of those hyperlinks.
For example, for each company link I want to extract the data shown on its page.
Let's start by loading some libraries we will need:
library(rvest)
library(tidyverse)
library(stringr)
Then, we can open the desired page and extract all links:
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)
urls <- webpage %>% html_nodes("a") %>% html_attr("href")
Let's take a look at what we uncovered...
> head(urls,100)
[1] "/" "/areas/"
[3] "/countries/" "/ports/"
[5] "/ports/topports.php" "/addcompany.php"
[7] "/aboutus.php" "/activity.php?aid=28"
[9] "/activity.php?aid=9" "/activity.php?aid=16"
[11] "/activity.php?aid=24" "/activity.php?aid=27"
[13] "/activity.php?aid=29" "/activity.php?aid=25"
[15] "/activity.php?aid=5" "/activity.php?aid=11"
[17] "/activity.php?aid=19" "/activity.php?aid=17"
[19] "/activity.php?aid=2" "/activity.php?aid=31"
[21] "/activity.php?aid=1" "/activity.php?aid=13"
[23] "/activity.php?aid=23" "/activity.php?aid=18"
[25] "/activity.php?aid=22" "/activity.php?aid=12"
[27] "/activity.php?aid=4" "/activity.php?aid=26"
[29] "/activity.php?aid=10" "/activity.php?aid=14"
[31] "/activity.php?aid=7" "/activity.php?aid=30"
[33] "/activity.php?aid=21" "/activity.php?aid=20"
[35] "/activity.php?aid=8" "/activity.php?aid=6"
[37] "/activity.php?aid=15" "/activity.php?aid=3"
[39] "/africa/" "/centralamerica/"
[41] "/northamerica/" "/southamerica/"
[43] "/asia/" "/caribbean/"
[45] "/europe/" "/middleeast/"
[47] "/oceania/" "company-contact.php?cid=66304"
[49] "http://www.quadrantplastics.com" "/company.php?cid=313402"
[51] "/company.php?cid=262400" "/company.php?cid=262912"
[53] "/company.php?cid=263168" "/company.php?cid=263424"
[55] "/company.php?cid=67072" "/company.php?cid=263680"
[57] "/company.php?cid=67328" "/company.php?cid=264192"
[59] "/company.php?cid=67840" "/company.php?cid=264448"
[61] "/company.php?cid=264704" "/company.php?cid=68352"
[63] "/company.php?cid=264960" "/company.php?cid=68608"
[65] "/company.php?cid=265216" "/company.php?cid=68864"
[67] "/company.php?cid=265472" "/company.php?cid=200192"
[69] "/company.php?cid=265728" "/company.php?cid=69376"
[71] "/company.php?cid=200448" "/company.php?cid=265984"
[73] "/company.php?cid=200704" "/company.php?cid=266240"
After some inspection, we find that we are only interested in URLs that start with /company.php.
Let's then figure out how many of them there are, and create a placeholder list for our results:
numcompanies <- length(which(!is.na(str_extract(urls, '/company.php'))))
mylist <- vector("list", numcompanies)
We find that there are 40034 company URLs we need to scrape. This will take a while...
> numcompanies
40034
Now it's just a matter of looping through each matching URL one by one and saving the text.
i <- 0
for (u in urls) {
  if (!is.na(str_match(u, '/company.php'))) {
    Sys.sleep(1)
    i <- i + 1
    companypage <- read_html(paste0('https://www.maritime-database.com', u))
    cat(paste('page nr', i, '; saved text from: ', u, '\n'))
    text <- companypage %>%
      html_nodes('.txt') %>%
      html_text()
    names(mylist)[i] <- u
    mylist[[i]] <- text
  }
}
In the loop above, we have taken advantage of the observation that the info we want always has class="txt".
Assuming that opening a page takes around 1 second, scraping all pages will take approximately 11 hours.
Also, keep in mind the ethics of web scraping.
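After the loop finishes, one way to tidy the results into the table format you asked about is a sketch like this, with one row per company URL (the " | " separator for collapsing the text chunks is an arbitrary choice):

library(tibble)

# One row per company; the scraped text chunks from each page are
# collapsed into a single string
results <- tibble(
  url = names(mylist),
  text = vapply(mylist, paste, character(1), collapse = " | ")
)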

Unable to extract image links using Rvest

I am unable to extract the links of images from a website.
I am new to data scraping. I have used SelectorGadget as well as the browser's inspect-element tool to get the class of the image, but to no avail.
main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
urls <- main.page %>%
  html_nodes(".match-detail--item:nth-child(9) .lazyloaded") %>%
  html_attr("src")
sotu <- data.frame(urls = urls)
I am getting the following output:
<0 rows> (or 0-length row.names)
Certain classes and attributes don't show up in the scraped data because they are added by JavaScript after the page loads. Just target img instead of .lazyloaded, and data-src instead of src:
library(rvest)
main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")

main.page %>%
  html_nodes(".match-detail--item:nth-child(9) img") %>%
  html_attr("data-src")
#### OUTPUT ####
[1] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/1.png&h=25&w=25"
[2] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[3] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[4] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[5] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[6] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[7] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[8] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[9] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[10] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[11] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[12] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
As the DOM is modified via JavaScript (React) in the browser, rvest does not see the same layout. You could, less optimally, regex the info out of the JavaScript object the links are housed in, then use a JSON parser to extract the links:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)
url <- "https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974"

r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()

x <- str_match_all(r, 'debuts":(.*?\\])')
json <- jsonlite::fromJSON(x[[1]][, 2])
print(json$imgicon)

How to turn rvest output into table

Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table function.
Note there are five tables on the page, so you will need to specify which table you would like to parse.
library(rvest)

statepop <- read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")

# find all of the tables on the page
tables <- html_nodes(statepop, "table")

# convert the first table into a data frame; note the [[ ]] to pull
# out the node itself (with [ ], html_table() would return a list
# containing one data frame rather than the data frame itself)
table1 <- html_table(tables[[1]])
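If you are unsure which of the five tables is the one you want, a quick sketch to check is to parse them all and peek at the first rows of each:

# Parse every table on the page and inspect the first two rows of
# each to find the one holding the population data
all_tables <- html_table(tables)
lapply(all_tables, head, n = 2)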
