Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table function.
Note there are five tables on the page, so you will need to specify which table you would like to parse.
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
#find all of the tables on the page
tables <- html_nodes(statepop, "table")
#convert the first table into a data frame
#note: tables[[1]] extracts the node itself; tables[1] would return a one-element list
table1 <- html_table(tables[[1]])
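If you are not sure which of the five tables you need, here is a minimal sketch that parses every table at once so you can inspect them (all_tables is my own name; in current rvest, html_table() on a node set returns a list of data frames):
# Parse all tables on the page into a list of data frames
all_tables <- statepop %>%
  html_nodes("table") %>%
  html_table()
length(all_tables)    # how many tables were found
head(all_tables[[1]]) # inspect the first one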
I am using R to try and download images from the Reptile-database by filling out their form to search for specific images. For that, I am following previous suggestions on how to fill an online form from R, such as:
library(httr)
library(tidyverse)
POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )
) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This page contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. Simply map or apply a function over them to download each one with download.file() (a sketch follows the output below).
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species
) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")

pics
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
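As promised above, a minimal sketch for downloading the extracted links with download.file(); the "pics" folder name is an arbitrary choice, and walk() comes from purrr (loaded via tidyverse):
# Create a folder for the images
dir.create("pics", showWarnings = FALSE)
# Download each image, naming the file after the last segment of its URL
walk(pics, function(u) {
  download.file(u, file.path("pics", basename(u)), mode = "wb")
})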
I'm using the rtweet package, and the search_tweets() function is not returning a data frame with all the columns. The data frame has only 35 columns, with no "screen_name" or "mentions_screen_name" columns. How can I get the rest of the columns? Below is an example of the columns returned.
tweets.df <- search_tweets("science")
names(tweets.df)
[1] "created_at" "id"
[3] "id_str" "full_text"
[5] "truncated" "display_text_range"
[7] "entities" "metadata"
[9] "source" "in_reply_to_status_id"
[11] "in_reply_to_status_id_str" "in_reply_to_user_id"
[13] "in_reply_to_user_id_str" "in_reply_to_screen_name"
[15] "geo" "coordinates"
[17] "place" "contributors"
[19] "is_quote_status" "retweet_count"
[21] "favorite_count" "favorited"
[23] "retweeted" "possibly_sensitive"
[25] "lang" "retweeted_status"
[27] "quoted_status_id" "quoted_status_id_str"
[29] "quoted_status" "text"
[31] "favorited_by" "display_text_width"
[33] "quoted_status_permalink" "query"
[35] "possibly_sensitive_appealable"
You seem to have installed the development version of rtweet (newer than the 0.7.0 release, i.e. the upcoming 1.0.0), which is not yet released on CRAN. Could you post the packageVersion("rtweet") output?
The devel version of rtweet returns only the columns returned by the API, but the user information is retrieved via users_data(tweets.df). There you will find the id and screen name of the user who posted each tweet.
The previous mentions_screen_name is the in_reply_to_screen_name column.
Please make sure that you read the documentation of the version you are using.
Get the users data of the tweets with the users_data() function:
tweets <- search_tweets("science", n = 100)
users <- users_data(tweets)
# get screen names of users
users["screen_name"]
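And, per the column list in the question, the old mentions_screen_name information can be read straight off the tweets data frame:
# in_reply_to_screen_name is the column that replaces mentions_screen_name
head(tweets.df$in_reply_to_screen_name)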
As the title describes, I am trying to extract data from a website. Specifically, I'm trying to extract host susceptibility and host insusceptibility data from each of the species pages found here.
These data can be found on individual species specific pages, for example for Abelia latent tymovirus at its respective URL.
I am struggling to extract these data as the HTML seems to be very unstructured. For example, host susceptibility/insusceptibility always appears in an h4 node, but alongside other varying headers and list items.
This is my first go at web scraping and I have been trying the Chrome plugin Web Scraper, which seems very intuitive and flexible. I have been able to get the scraper to visit the multiple pages, but I can't seem to direct it to specifically collect the susceptibility/insusceptibility data. I attempted using SelectorGadget to identify exactly what my selector should be, but the lack of structure in the HTML made this ineffective.
Any advice on how I can change my plan of attack for this?
I am also open to trying to extract the data using R's rvest package. I have so far been able to read the html from a specific page, extract the h4 and li elements, and clean up the line breaks. Reproducible code:
library(rvest)
library(stringr)
pvo <- read_html("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")
pvo %>%
  html_elements("h4, li") %>%
  html_text() %>%
  str_replace_all("\n", "")
Which seems to provide me with what I want plus extraneous data:
...
[20] "Susceptible host species "
[21] "Chenopodium amaranticolor"
[22] "Chenopodium quinoa"
[23] "Cucumis sativus"
[24] "Cucurbita pepo"
[25] "Cynara scolymus"
[26] "Gomphrena globosa"
[27] "Nicotiana benthamiana"
[28] "Nicotiana clevelandii"
[29] "Nicotiana glutinosa"
[30] "Ocimum basilicum"
[31] "Vigna unguiculata"
[32] "Insusceptible host species"
[33] "Nicotiana rustica"
[34] "Nicotiana tabacum"
[35] "Phaseolus vulgaris "
...
From here, I am unfamiliar with how to specifically select/filter the desired information from the string. I have tried some stringr, gsub, and rm_between filter functions, but all attempts have been unsuccessful. I wouldn't know where to start to make this code visit the many species pages on the online database, or how to instruct it to save the aggregate data. What a road I have ahead of me!
Here is one trick.
You can get the indices of:
'Susceptible host species'
'Insusceptible host species'
'Families containing susceptible hosts'
Everything between the first and the second is susceptible_species, and everything between the second and the third is insusceptible_species.
library(rvest)
library(stringr)
pvo <- read_html("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")
all_values <- pvo %>% html_elements("h4, li") %>% html_text()
# Use grep with fixed = TRUE rather than ==, since some entries carry trailing whitespace
sus_index <- grep('Susceptible host species', all_values, fixed = TRUE)
insus_index <- grep('Insusceptible host species', all_values, fixed = TRUE)
family_sus_index <- grep('Families containing susceptible hosts', all_values, fixed = TRUE)
susceptible_species <- all_values[(sus_index+1):(insus_index-1)]
susceptible_species
# [1] "Chenopodium amaranticolor" "Chenopodium quinoa" "Cucumis sativus"
# [4] "Cucurbita pepo" "Cynara scolymus" "Gomphrena globosa"
# [7] "Nicotiana benthamiana" "Nicotiana clevelandii" "Nicotiana glutinosa"
#[10] "Ocimum basilicum" "Vigna unguiculata"
insusceptible_species <- all_values[(insus_index+1):(family_sus_index-1)]
insusceptible_species
#[1] "Nicotiana rustica" "Nicotiana tabacum" "Phaseolus vulgaris "
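The question also asks how to visit many species pages and save the aggregate data. A minimal sketch, assuming every page follows the same h4/li layout (scrape_hosts and species_urls are my own names; fill species_urls with the pages you want to visit):
library(rvest)
library(purrr)

# Extract the two host lists from one species page, using the index trick above
scrape_hosts <- function(url) {
  all_values <- read_html(url) %>%
    html_elements("h4, li") %>%
    html_text()
  sus <- grep('Susceptible host species', all_values, fixed = TRUE)
  insus <- grep('Insusceptible host species', all_values, fixed = TRUE)
  family <- grep('Families containing susceptible hosts', all_values, fixed = TRUE)
  list(
    url = url,
    susceptible = trimws(all_values[(sus + 1):(insus - 1)]),
    insusceptible = trimws(all_values[(insus + 1):(family - 1)])
  )
}

# Map over the species URLs, pausing between requests to be polite
species_urls <- c("http://bio-mirror.im.ac.cn/mirrors/pvo/vide/descr042.htm")
results <- map(species_urls, ~ { Sys.sleep(1); scrape_hosts(.x) })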
I want to extract data from the different hyperlinks on this web page.
I was using the following code to extract the table containing the hyperlinks.
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)
df <- webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
From this code, I was able to extract all the hyperlinks in a table, but I don't have any idea how to extract data from those hyperlinks.
For example, for this link I want to extract the data as shown in the figure. [Figure: data from the link provided in the example]
Let's start by loading some libraries we will need:
library(rvest)
library(tidyverse)
library(stringr)
Then, we can open the desired page and extract all links:
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage<-read_html(url)
urls <- webpage %>% html_nodes("a") %>% html_attr("href")
Let's take a look at what we uncovered...
> head(urls,100)
[1] "/" "/areas/"
[3] "/countries/" "/ports/"
[5] "/ports/topports.php" "/addcompany.php"
[7] "/aboutus.php" "/activity.php?aid=28"
[9] "/activity.php?aid=9" "/activity.php?aid=16"
[11] "/activity.php?aid=24" "/activity.php?aid=27"
[13] "/activity.php?aid=29" "/activity.php?aid=25"
[15] "/activity.php?aid=5" "/activity.php?aid=11"
[17] "/activity.php?aid=19" "/activity.php?aid=17"
[19] "/activity.php?aid=2" "/activity.php?aid=31"
[21] "/activity.php?aid=1" "/activity.php?aid=13"
[23] "/activity.php?aid=23" "/activity.php?aid=18"
[25] "/activity.php?aid=22" "/activity.php?aid=12"
[27] "/activity.php?aid=4" "/activity.php?aid=26"
[29] "/activity.php?aid=10" "/activity.php?aid=14"
[31] "/activity.php?aid=7" "/activity.php?aid=30"
[33] "/activity.php?aid=21" "/activity.php?aid=20"
[35] "/activity.php?aid=8" "/activity.php?aid=6"
[37] "/activity.php?aid=15" "/activity.php?aid=3"
[39] "/africa/" "/centralamerica/"
[41] "/northamerica/" "/southamerica/"
[43] "/asia/" "/caribbean/"
[45] "/europe/" "/middleeast/"
[47] "/oceania/" "company-contact.php?cid=66304"
[49] "http://www.quadrantplastics.com" "/company.php?cid=313402"
[51] "/company.php?cid=262400" "/company.php?cid=262912"
[53] "/company.php?cid=263168" "/company.php?cid=263424"
[55] "/company.php?cid=67072" "/company.php?cid=263680"
[57] "/company.php?cid=67328" "/company.php?cid=264192"
[59] "/company.php?cid=67840" "/company.php?cid=264448"
[61] "/company.php?cid=264704" "/company.php?cid=68352"
[63] "/company.php?cid=264960" "/company.php?cid=68608"
[65] "/company.php?cid=265216" "/company.php?cid=68864"
[67] "/company.php?cid=265472" "/company.php?cid=200192"
[69] "/company.php?cid=265728" "/company.php?cid=69376"
[71] "/company.php?cid=200448" "/company.php?cid=265984"
[73] "/company.php?cid=200704" "/company.php?cid=266240"
After some inspection, we find that we are only interested in URLs that start with /company.php.
Let's then figure out how many of them there are, and create a placeholder list for our results:
numcompanies <- length(which(!is.na(str_extract(urls, '/company.php'))))
mylist <- vector("list", numcompanies)
We find that there are 40034 company URLs we need to scrape. This will take a while...
> numcompanies
40034
Now, it's just a matter of looping through each matching url one by one, and saving the text.
i <- 0
for (u in urls) {
  if (!is.na(str_match(u, '/company.php'))) {
    Sys.sleep(1)
    i <- i + 1
    companypage <- read_html(paste0('https://www.maritime-database.com', u))
    cat(paste('page nr', i, '; saved text from: ', u, '\n'))
    text <- companypage %>%
      html_nodes('.txt') %>%
      html_text()
    names(mylist)[i] <- u
    mylist[[i]] <- text
  }
}
In the loop above, we have taken advantage of the observation that the info we want always has class="txt".
Assuming that opening a page takes around 1 second, scraping all pages will take approximately 11 hours.
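Once the loop finishes, one convenient way to collapse the named list into a two-column data frame is tibble::enframe (a sketch; companies_df is my own name):
library(tibble)
# One row per company URL; text becomes a list-column of the scraped strings
companies_df <- enframe(mylist, name = "url", value = "text")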
Also, keep in mind the ethics of web scraping.
I am attempting to extract all words that start with a particular phrase from a website. The website I am using is:
http://docs.ggplot2.org/current/
I want to extract all the words that start with "stat_". I should get 21 names like "stat_identity" in return. I have the following code:
stats <- readLines("http://docs.ggplot2.org/current/")
head(stats)
grep("stat_{1[a-z]", stats, value=TRUE)
I am returned every line containing the phrase "stat_". I just want to extract the "stat_" words. So I tried something else:
gsub("\b^stat_[a-z]+ ", "", stats)
I think the output I got was an empty string, " ", where a "stat_" phrase would be? So now I'm trying to think of ways to extract all the text and set everything that is not a "stat_" phrase to empty strings. Does anyone have any ideas on how to get my desired output?
rvest & stringr to the rescue:
library(xml2)
library(rvest)
library(stringr)
pg <- read_html("http://docs.ggplot2.org/current/")
unique(str_match_all(html_text(html_nodes(pg, "body")),
"(stat_[[:alnum:]_]+)")[[1]][,2])
## [1] "stat_bin" "stat_bin2dCount"
## [3] "stat_bindot" "stat_binhexBin"
## [5] "stat_boxplot" "stat_contour"
## [7] "stat_density" "stat_density2d"
## [9] "stat_ecdf" "stat_functionSuperimpose"
## [11] "stat_identity" "stat_qqCalculation"
## [13] "stat_quantile" "stat_smooth"
## [15] "stat_spokeConvert" "stat_sum"
## [17] "stat_summarySummarise" "stat_summary_hexApply"
## [19] "stat_summary2dApply" "stat_uniqueRemove"
## [21] "stat_ydensity" "stat_defaults"
Unless you need the links (then you can use other rvest functions), this removes all the markup for you and just gives you the text of the website.
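If the fused names above (e.g. "stat_bin2dCount", where a function name runs into the following heading) are a problem, the link-based route hinted at here is worth a try. A sketch, assuming each stat_ function appears as the text of its own link on that page:
library(rvest)
library(stringr)

pg <- read_html("http://docs.ggplot2.org/current/")

# Take the text of every link and keep only names starting with "stat_"
link_text <- pg %>% html_nodes("a") %>% html_text()
unique(str_subset(link_text, "^stat_"))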