I want to extract data from the different hyperlinks on this web page.
I was using the following code to extract the table containing the hyperlinks:
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)
df <- webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
With this code, I was able to extract all the hyperlinks in the table, but I have no idea how to extract the data behind each hyperlink.
For example, for this link I want to extract the data shown in the figure. [figure: data from the link provided in the example]
Let's start by loading some libraries we will need:
library(rvest)
library(tidyverse)
library(stringr)
Then, we can open the desired page and extract all links:
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)
urls <- webpage %>% html_nodes("a") %>% html_attr("href")
Let's take a look at what we uncovered:
> head(urls,100)
[1] "/" "/areas/"
[3] "/countries/" "/ports/"
[5] "/ports/topports.php" "/addcompany.php"
[7] "/aboutus.php" "/activity.php?aid=28"
[9] "/activity.php?aid=9" "/activity.php?aid=16"
[11] "/activity.php?aid=24" "/activity.php?aid=27"
[13] "/activity.php?aid=29" "/activity.php?aid=25"
[15] "/activity.php?aid=5" "/activity.php?aid=11"
[17] "/activity.php?aid=19" "/activity.php?aid=17"
[19] "/activity.php?aid=2" "/activity.php?aid=31"
[21] "/activity.php?aid=1" "/activity.php?aid=13"
[23] "/activity.php?aid=23" "/activity.php?aid=18"
[25] "/activity.php?aid=22" "/activity.php?aid=12"
[27] "/activity.php?aid=4" "/activity.php?aid=26"
[29] "/activity.php?aid=10" "/activity.php?aid=14"
[31] "/activity.php?aid=7" "/activity.php?aid=30"
[33] "/activity.php?aid=21" "/activity.php?aid=20"
[35] "/activity.php?aid=8" "/activity.php?aid=6"
[37] "/activity.php?aid=15" "/activity.php?aid=3"
[39] "/africa/" "/centralamerica/"
[41] "/northamerica/" "/southamerica/"
[43] "/asia/" "/caribbean/"
[45] "/europe/" "/middleeast/"
[47] "/oceania/" "company-contact.php?cid=66304"
[49] "http://www.quadrantplastics.com" "/company.php?cid=313402"
[51] "/company.php?cid=262400" "/company.php?cid=262912"
[53] "/company.php?cid=263168" "/company.php?cid=263424"
[55] "/company.php?cid=67072" "/company.php?cid=263680"
[57] "/company.php?cid=67328" "/company.php?cid=264192"
[59] "/company.php?cid=67840" "/company.php?cid=264448"
[61] "/company.php?cid=264704" "/company.php?cid=68352"
[63] "/company.php?cid=264960" "/company.php?cid=68608"
[65] "/company.php?cid=265216" "/company.php?cid=68864"
[67] "/company.php?cid=265472" "/company.php?cid=200192"
[69] "/company.php?cid=265728" "/company.php?cid=69376"
[71] "/company.php?cid=200448" "/company.php?cid=265984"
[73] "/company.php?cid=200704" "/company.php?cid=266240"
After some inspection, we find that we are only interested in URLs that start with /company.php.
Let's figure out how many of them there are, and create a placeholder list for our results:
numcompanies <- sum(str_detect(urls, fixed('/company.php')))
mylist <- vector("list", numcompanies)
We find that there are 40034 company urls we need to scrape. This will take a while...
> numcompanies
40034
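As a side note, the matching URLs can also be pulled out up front with stringr's str_subset, instead of testing inside the loop. A minimal sketch with a toy vector (the real urls vector would be used in practice):

```r
library(stringr)

# toy vector mimicking a few entries from the scraped urls
urls <- c("/", "/areas/", "/company.php?cid=313402", "company-contact.php?cid=66304")

# keep only the company pages; fixed() treats the pattern literally
company_urls <- str_subset(urls, fixed("/company.php"))
company_urls
# [1] "/company.php?cid=313402"
```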
Now, it's just a matter of looping through each matching URL one by one, and saving the text.
i <- 0
for (u in urls) {
  if (str_detect(u, fixed('/company.php'))) {
    Sys.sleep(1)  # be polite: pause between requests
    i <- i + 1
    companypage <- read_html(paste0('https://www.maritime-database.com', u))
    cat(paste('page nr', i, '; saved text from:', u, '\n'))
    text <- companypage %>%
      html_nodes('.txt') %>%
      html_text()
    names(mylist)[i] <- u
    mylist[[i]] <- text
  }
}
In the loop above, we have taken advantage of the observation that the info we want always has class="txt".
Assuming that opening a page takes around 1 second, scraping all pages will take approximately 11 hours.
Also, keep in mind the ethics of web scraping.
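Once the loop finishes, mylist is a named list (URL -> character vector of page text). A minimal sketch of flattening it into a data frame, shown with made-up toy values since the full scrape takes hours:

```r
library(tibble)

# toy stand-in for the scraped result; the real list has 40034 entries
mylist <- list(
  "/company.php?cid=313402" = c("Example Co", "Some address"),
  "/company.php?cid=262400" = c("Another Co")
)

# one row per company; collapse each text vector into a single string
df <- tibble(
  url  = names(mylist),
  text = vapply(mylist, paste, character(1), collapse = " | ")
)
```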
Related
I am using R to try and download images from the Reptile-database by filling in their form to search for specific images. For that, I am following previous suggestions on how to fill in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )
) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This page contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species's table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. Simply map or apply a function over them to download each one with download.file().
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
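A minimal sketch of the download step itself, assuming the pics vector extracted above (shown here with a single URL, writing into a temp directory; the try() keeps a network hiccup from aborting the loop):

```r
# single URL copied from the output above; in practice use the full `pics`
pics  <- c("https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg")

# derive a local file name from each URL
dests <- file.path(tempdir(), basename(pics))

for (i in seq_along(pics)) {
  try(download.file(pics[i], dests[i], mode = "wb", quiet = TRUE))
}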
I am trying to scrape a table from the following site (https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1)
I am using rvest and the Selector Gadget to try to make it work, but so far I have only been able to get it in text form.
What I need to extract:
I am mostly interested in extracting the number of species in two areas, the Stjernearter and the 2-stjernearter, as seen in the image below:
As seen in the developer tools of firefox that corresponds to a table:
But as I have tried to get the table with Selector Gadget, I have not had any success.
What I have tried:
These are some ideas I have tried, with limited success.
I have been able to get the text, but not the table, with these two snippets:
library(rvest)
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(":nth-child(9) .table-col") %>%
  html_text()
This gets me the following:
[1] "\r\n\t\t\t\t\t\t\tStjernearter (arter med artsscorer = 4 eller 5):\r\n\t\t\t\t\t\t"
[2] "Strandarve | Honckenya peploides"
[3] "Bidende stenurt | Sedum acre"
[4] "\r\n\t\t\t\t\t\t\t2-stjernearter (artsscore = 6 eller 7):\r\n\t\t\t\t\t\t"
[5] "Ingen arter registreret"
[6] "\r\n\t\t\t\t\t\t\t N-følsomme arter:\r\n\t\t\t\t\t\t "
[7] "Bidende stenurt | Sedum acre"
[8] "\r\n\t\t\t\t\t\t\tProblemarter:\r\n\t\t\t\t\t\t"
[9] "Ingen arter registreret"
[10] "\r\n\t\t\t\t\t\t\tInvasive arter:\r\n\t\t\t\t\t\t"
[11] "Ingen arter registreret"
[12] "\r\n\t\t\t\t\t\t\tHabitatdirektivets bilagsarter:\r\n\t\t\t\t\t\t"
[13] "Ingen arter registreret"
[14] "\r\n\t\t\t\t\t\t\tRødlistede arter:\r\n\t\t\t\t\t\t"
[15] "Ingen arter registreret"
[16] "\r\n\t\t\t\t\t\t\tFredede arter:\r\n\t\t\t\t\t\t"
[17] "Ingen arter registreret"
[18] "\r\n\t\t\t\t\t\t\tAntal arter:\r\n\t\t\t\t\t\t"
[19] "Mosser: 1 fund"
[20] "Planter: 7 fund"
And I get a similar result with
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(":nth-child(9) .table-col") %>%
  html_text2()
I have also tried the following snippets:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(":nth-child(9) .table-col") %>%
  html_table()
and
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(".report-body") %>%
  html_table()
This will be done for several sites that I will loop over, so I need the result in table format.
Edit
It seems that this code is bringing me closer to the answer:
rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(".report-section-body")
The eighth element has the table, but I have not been able to extract it:
Test <- rvest::read_html("https://naturereport.miljoeportal.dk/HtmlViewer?id=827472&bA=1&bI=1&bN=1") %>%
  html_elements(".report-section-body")
Test[8]
{xml_nodeset (1)}
[1] <div class="report-section-body"><div class="table">\n<div class="
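Until the div-based table can be parsed directly, one workaround is to tidy the html_text() output shown earlier into a two-column data frame. A sketch using a few of the literal strings from that output (whitespace shortened; the full vector would be processed the same way):

```r
library(stringr)
library(tibble)

# strings copied from the html_text() output above
txt <- c(
  "\r\n\tStjernearter (arter med artsscorer = 4 eller 5):\r\n\t",
  "Strandarve | Honckenya peploides",
  "Bidende stenurt | Sedum acre",
  "\r\n\t2-stjernearter (artsscore = 6 eller 7):\r\n\t",
  "Ingen arter registreret"
)

clean     <- str_squish(txt)          # collapse the \r\n\t padding
is_header <- str_ends(clean, ":")     # section headers end with a colon
headers   <- str_remove(clean[is_header], ":$")
section   <- cumsum(is_header)        # which header each entry falls under

df <- tibble(
  section = headers[section[!is_header]],
  entry   = clean[!is_header]
)
```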
How can I get a list of the exact names of the objects in the datasets package?
I found many of them here:
data_package = data(package="datasets")
datasets <- as.data.frame(data_package[[3]])$Item
datasets
# [1] "AirPassengers" "BJsales" "BJsales.lead (BJsales)" "BOD" "CO2" "ChickWeight"
# [7] "DNase" "EuStockMarkets" "Formaldehyde" "HairEyeColor" "Harman23.cor" "Harman74.cor"
# [13] "Indometh" "InsectSprays" "JohnsonJohnson" "LakeHuron" "LifeCycleSavings" "Loblolly"
# [19] "Nile" "Orange" "OrchardSprays" "PlantGrowth" "Puromycin" "Seatbelts"
# [25] "Theoph" "Titanic" "ToothGrowth" "UCBAdmissions" "UKDriverDeaths" "UKgas"
# [31] "USAccDeaths" "USArrests" "USJudgeRatings" "USPersonalExpenditure" "UScitiesD" "VADeaths"
# [37] "WWWusage" "WorldPhones" "ability.cov" "airmiles" "airquality" "anscombe"
# [43] "attenu" "attitude" "austres" "beaver1 (beavers)" "beaver2 (beavers)" "cars"
# [49] "chickwts" "co2" "crimtab" "discoveries" "esoph" "euro"
# [55] "euro.cross (euro)" "eurodist" "faithful" "fdeaths (UKLungDeaths)" "freeny" "freeny.x (freeny)"
# [61] "freeny.y (freeny)" "infert" "iris" "iris3" "islands" "ldeaths (UKLungDeaths)"
# [67] "lh" "longley" "lynx" "mdeaths (UKLungDeaths)" "morley" "mtcars"
# [73] "nhtemp" "nottem" "npk" "occupationalStatus" "precip" "presidents"
# [79] "pressure" "quakes" "randu" "rivers" "rock" "sleep"
# [85] "stack.loss (stackloss)" "stack.x (stackloss)" "stackloss" "state.abb (state)" "state.area (state)" "state.center (state)"
# [91] "state.division (state)" "state.name (state)" "state.region (state)" "state.x77 (state)" "sunspot.month" "sunspot.year"
# [97] "sunspots" "swiss" "treering" "trees" "uspop" "volcano"
# [103] "warpbreaks" "women"
So something like this would iterate through each one
for(i in 1:length(datasets)) {
  print(get(datasets[i]))
  cat("\n\n")
}
It works for the first two datasets (AirPassengers and BJsales), but it fails on BJsales.lead (BJsales) since it should be referred to as datasets::BJsales.lead.
I guess I could use string splitting or similar to discard everything from the space onwards, but I wonder: is there a neater way of obtaining a list of all the objects in the datasets package?
Notes
In addition to the above, I also tried listing everything in the datasets namespace but it gave a weird result:
ls(getNamespace("datasets"), all.names=TRUE)
# [1] ".__NAMESPACE__." ".__S3MethodsTable__." ".packageName"
There is a note on the ?data help page that states
Where the datasets have a different name from the argument that should be used to retrieve them the index will have an entry like beaver1 (beavers) which tells us that dataset beaver1 can be retrieved by the call data(beavers).
So the actual object name is the part before the parenthesised suffix. Since that value is returned as just a string, the suffix is something you'll need to remove yourself, unfortunately. But you can do that with gsub:
datanames <- data(package="datasets")$results[,"Item"]
objnames <- gsub("\\s+\\(.*\\)","", datanames)
for(ds in objnames) {
  print(get(ds))
  cat("\n\n")
}
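An alternative sketch that skips the get() loop entirely: after cleaning the names, fetch every object in one call with mget() from the attached package environment (datasets is on the search path by default):

```r
datanames <- data(package = "datasets")$results[, "Item"]
objnames  <- gsub("\\s+\\(.*\\)", "", datanames)

# named list of every dataset object, keyed by its real object name
all_data  <- mget(objnames, envir = as.environment("package:datasets"))
```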
Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table function.
Note there are five tables on the page, so you will need to specify which table you would like to parse.
library(rvest)
statepop <- read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")

# find all of the tables on the page
tables <- html_nodes(statepop, "table")

# convert the first table into a data frame
# (tables[[1]] extracts the node itself, so html_table returns a data frame
# rather than a one-element list)
table1 <- html_table(tables[[1]])
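The raw output in the question also shows Wikipedia's hidden sort keys, e.g. "7000100000000000000♠1". A sketch for stripping them (and thousands separators) from a character column before converting to numbers:

```r
# sample values copied from the question's output
x <- c("7000100000000000000♠1", "7001530000000000000♠53", "39,250,017")

# drop everything up to and including the ♠ marker; plain values pass through
clean <- sub(".*♠", "", x)

# remove thousands separators and convert to numeric
nums <- as.numeric(gsub(",", "", clean))
```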
I am trying to write a list to file as one row and without quotes in R.
Content of the list is:
[1] "X4775495036_J" "X4775495036_F" "X5147722015_F" "X5067554009_F"
[5] "X5067554063_B" "X4954590047_A" "X5067554063_G" "X5067554009_L"
[9] "X5147722015_D" "X5511045011_D" "X5067554063_A" "X4805447025_F"
[13] "X5455362015_K" "X4805447025_L" "X5147722015_B" "X5067554009_G"
[17] "X5147722014_K" "X5067554063_H" "X5147722009_G" "X5067554008_H"
[21] "X5067554054_H" "X4805447016_K" "X5147722014_E" "X4954590051_K"
[25] "X5067554008_E" "X5147722015_H" "X5147722009_H" "X5067554063_D"
[29] "X5147722015_A" "X5511045022_E" "X5067554054_I" "X5067554063_J"
[33] "X5067554007_F" "X4775495036_E" "X4775495036_H" "X4805447025_H"
[37] "X5067554009_I" "X4805447025_K" "X4954590051_C" "X4805447025_E"
[41] "X5067554063_E" "X5147722009_J" "X5067554054_C" "X5067554054_G"
[45] "X4805447016_I" "X5455362015_B" "X5067554009_H" "X5147722014_A"
[49] "X4775495036_I" "X5067554063_L" "X5455362015_J" "X4954590047_J"
[53] "X5067554009_A" "X4954590051_D" "X5455362015_I" "X5511045011_E"
[57] "X5147722014_F"
I want something like this (all elements in one row):
X4775495036_J X4775495036_F X5147722015_F X5067554009_F ...
I have tried write.table and write, but without the desired result.
Note that you don't have a list, you have a character vector.
cat(your_vector, "\n", file="your_file.txt")
The "\n" is an optional newline at the end.
You could use the ncolumns argument of write:
n <- LETTERS[1:10] # create example values
write(n, "letters.txt", ncolumns=length(n))
Or you could concatenate your names before:
nc <- paste0(n, collapse=" ")
write(nc, "letters.txt")
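To confirm the output really is a single row, write to a temp file and read it back:

```r
n <- LETTERS[1:10]
f <- tempfile(fileext = ".txt")
write(n, f, ncolumns = length(n))

readLines(f)
# [1] "A B C D E F G H I J"
```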