I am attempting to extract all words that start with a particular phrase from a website. The website I am using is:
http://docs.ggplot2.org/current/
I want to extract all the words that start with "stat_". I should get 21 names like "stat_identity" in return. I have the following code:
stats <- readLines("http://docs.ggplot2.org/current/")
head(stats)
grep("stat_{1[a-z]", stats, value=TRUE)
This returns every line containing the phrase "stat_", but I just want to extract the "stat_" words themselves. So I tried something else:
gsub("\b^stat_[a-z]+ ", "", stats)
I think the output I got was an empty string, " ", where a "stat_" phrase would be. So now I'm trying to think of ways to extract all the text and set everything that is not a "stat_" phrase to an empty string. Does anyone have any ideas on how to get my desired output?
rvest & stringr to the rescue:
library(xml2)
library(rvest)
library(stringr)
pg <- read_html("http://docs.ggplot2.org/current/")
unique(str_match_all(html_text(html_nodes(pg, "body")),
                     "(stat_[[:alnum:]_]+)")[[1]][,2])
## [1] "stat_bin" "stat_bin2dCount"
## [3] "stat_bindot" "stat_binhexBin"
## [5] "stat_boxplot" "stat_contour"
## [7] "stat_density" "stat_density2d"
## [9] "stat_ecdf" "stat_functionSuperimpose"
## [11] "stat_identity" "stat_qqCalculation"
## [13] "stat_quantile" "stat_smooth"
## [15] "stat_spokeConvert" "stat_sum"
## [17] "stat_summarySummarise" "stat_summary_hexApply"
## [19] "stat_summary2dApply" "stat_uniqueRemove"
## [21] "stat_ydensity" "stat_defaults"
Unless you need the links (then you can use other rvest functions), this removes all the markup for you and just gives you the text of the website.
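If you do end up needing the links as well, a minimal sketch of that variant (it assumes each stat_ page is linked from an anchor on the same index page):
library(xml2)
library(rvest)
pg <- read_html("http://docs.ggplot2.org/current/")
# pull every anchor's href and keep only the stat_* targets
hrefs <- html_attr(html_nodes(pg, "a"), "href")
unique(grep("^stat_", hrefs, value = TRUE))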
Related
I am using R to try to download images from the Reptile-database by filling in their form to search for specific images. For that, I am following previous suggestions for filling in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. Simply map or apply a function over them to download each one with download.file(); a minimal sketch follows the output below.
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
I'm reading in a .dat file and trying to convert it to a data frame so I can save it in a readable format. However, I have no clue how to convert the .dat data. I'm a bit of a beginner in R. Any help will be highly appreciated.
Code so Far:
data <- readLines("Day8.dat")
print(data)
Output So Far:
[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\"
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\"
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country>
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange>
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType>
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator>
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation>
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>
....
Thanks
It all depends on what you want to do with the data, i.e., how you want to process it.
For example, let's assume your interest is in parsing all XML tags as separate strings; then you can extract the tags using regular expressions and the function str_extract_all:
library(stringr)
str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")
This regex works even if the XML element names are variable:
str_extract_all(dat, "<([^>]*)>.*</\\1>|<[^>]*>")
The result is a list:
[[1]]
[1] "<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\" \nmodelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\" \nxmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">"
[2] "<d2lm:exchange>"
[3] "<d2lm:supplierIdentification>"
[4] "<d2lm:country>gb</d2lm:country>"
[5] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[6] "<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\" \nxmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
[7] "<d2lm:feedType>Event Data</d2lm:feedType>"
[8] "<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime>"
[9] "<d2lm:publicationCreator>"
[10] "<d2lm:country>gb</d2lm:country>"
[11] "<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>"
[12] "<d2lm:situation version=\"\" id=\"2922904\">"
[13] "<d2lm:headerInformation>"
[14] "<d2lm:areaOfInterest>national</d2lm:areaOfInterest>"
To turn the list into a dataframe:
datDF <- data.frame(tags = unlist(str_extract_all(dat, "<(d2lm:[^>]*)>.*</\\1>|<d2lm:[^>]*>")))
EDIT:
If you want to have a dataframe with the text values between XML start tag and XML end tag, you can extract these tags and values along these lines:
datDF <- data.frame(
  tags = unlist(str_extract_all(dat, "<([^>]*)>(?=[^>]*</\\1>)")),
  values = unlist(str_extract_all(dat, "(?<=<([^>]{1,100})>).*(?=</\\1>)"))
)
datDF
                       tags                        values
1            <d2lm:country>                            gb
2 <d2lm:nationalIdentifier>                          NTIS
3           <d2lm:feedType>                    Event Data
4    <d2lm:publicationTime> 2020-05-10T00:00:44.778+01:00
5            <d2lm:country>                            gb
6 <d2lm:nationalIdentifier>                          NTIS
7     <d2lm:areaOfInterest>                      national
Is this--roughly--what you had in mind?
DATA:
dat <- '<d2lm:d2LogicalModel extensionVersion=\"2.0\" extensionName=\"NTIS Published Services\"
modelBaseVersion=\"2\" xmlns:ns4=\"http://www.thalesgroup.com/NTIS/Datex2Extensions/1.0Beta1\"
xmlns:ns3=\"http://datex2.eu/schema/2/2_0/inrix\" xmlns:d2lm=\"http://datex2.eu/schema/2/2_0\">
<d2lm:exchange><d2lm:supplierIdentification><d2lm:country>gb</d2lm:country>
<d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier></d2lm:supplierIdentification></d2lm:exchange>
<d2lm:payloadPublication xsi:type=\"d2lm:SituationPublication\" lang=\"en\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><d2lm:feedType>Event Data</d2lm:feedType>
<d2lm:publicationTime>2020-05-10T00:00:44.778+01:00</d2lm:publicationTime><d2lm:publicationCreator>
<d2lm:country>gb</d2lm:country><d2lm:nationalIdentifier>NTIS</d2lm:nationalIdentifier>
</d2lm:publicationCreator><d2lm:situation version=\"\" id=\"2922904\"><d2lm:headerInformation>
<d2lm:areaOfInterest>national</d2lm:areaOfInterest>'
I have an issue where I'm trying to use sink() to capture my console output to a text file. However, my console keeps on restricting my print statements, despite having set max.print to the maximum integer in R.
I have consulted various other stackoverflow links but to no avail. Has anyone solved this issue?
This is a sample output, despite having changed max.print.
options(max.print = .Machine$integer.max)
> print(outputFile[1])
[[1]]
+ 1681/519133 vertices, named, from 71aeda5:
[1] p_8945206-t_25 p_24353782-t_0 p_5096967-t_0
[4] p_12728438-t_2 p_1914103-t_8 p_7949965-t_59
[7] p_5171435-t_4 p_6628106-t_7 p_2535537-t_0
[10] p_45026190-t_2 p_25504870-t_8 p_796238-t_1
[13] p_135998-t_13 p_20853906-t_1 p_17154085-t_0
[16] p_29505258-t_4 p_27269129-t_13 p_6793896-t_92
[19] p_5331193-t_1 p_11521441-t_2 p_34271996-t_2
[22] p_95594-t_0 p_16395989-t_0 p_582576-t_3
[25] p_9368888-t_1 p_697462-t_28 p_80124-t_72
[28] p_7595644-t_0 p_14372110-t_4 p_2083314-t_2
+ ... omitted several vertices
Additionally, I have tried indexing but it hasn't worked.
igraph-specific options like auto.print.lines should still affect the printing of your graph objects, even if they're contained in a list. Using a combination of auto.print.lines and max.print, I'm able to get graphs to print out in full:
library(purrr)
library(igraph)
# Using purrr to create a list of multiple large graphs
gs = map(1:5, ~ random.graph.game(200, 0.1))
options(max.print = .Machine$integer.max)
igraph_options(auto.print.lines = Inf)
print(gs)
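With those options set, the sink() workflow from the question should then capture the full print-out:
sink("graphs.txt")  # redirect console output to a file
print(gs)
sink()              # restore output to the console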
I have an HTML table I'm trying to extract data from. From line 21 I need to produce an 11-element character vector (and then do the same for all lines of data). I'm trying to write a function to do this, where dt is my data table and this is what line 21 looks like:
[1] "<tr><td>1</td><td>11 Com</td><td>b</td><td>Radial Velocity</td>
<td>1</td><td>326.03</td><td>1.29</td><td></td><td>19.4</td><td></td>
<td>2.7</td></tr>"
I need to get rid of all of the "<tr><td>" etc., as well as insert either a 0 or an NA where they exist back to back ("</td><td></td><td>").
Here is what I have so far. First, I keep getting the error:
Error in strsplit(a, "</td><td>") : non-character argument
f <- function(row.data){
  a <- strsplit(row.data, "<tr><td>")
  b <- unlist(strsplit(a, "</td><td>"))
}
f(dt[21])
And this has yet to address inserting 0s or NAs. I'm quite new to R, so I am super appreciative of any help.
This can be done with gsub. As commented, you can escape the / as \\/ (although R's regex engine doesn't strictly require it here):
dat <- "<tr><td>1</td><td>11 Com</td><td>b</td><td>Radial Velocity</td><td>1</td><td>326.03</td><td>1.29</td><td></td><td>19.4</td><td></td><td>2.7</td></tr>"
a<-gsub("<tr>",0,dat)
a<-gsub("<td>",0,a)
a<-gsub("<\\/td>",0,a)
a<-gsub("<\\/tr>",0,a)
a
[1] "0010011 Com00b00Radial Velocity0 \n0100326.03001.29000019.4000 \n02.700"
Like I mentioned above, your task is really parsing HTML, so a more appropriate method would be to use a package like rvest that's made for parsing HTML. I'm guessing this is part of a larger table, in which case you could probably use rvest::html_table to scrape data from an entire table at once.
If, instead, what you have is really just strings of the HTML tags for each row, you can convert that text to its XML representation (the backbone of HTML) with read_html. Then from that XML, you can pull out the <tr> tags, then from those, pull out the <td> tags. I did table rows before table cells in case there's more logic that you need for keeping rows together.
library(dplyr)
library(rvest)
tags <- "<tr><td>1</td><td>11 Com</td><td>b</td><td>Radial Velocity</td>
<td>1</td><td>326.03</td><td>1.29</td><td></td><td>19.4</td><td></td>
<td>2.7</td></tr>"
read_html(tags) %>%
  html_nodes("tr") %>%
  html_nodes("td")
#> {xml_nodeset (11)}
#> [1] <td>1</td>\n
#> [2] <td>11 Com</td>\n
#> [3] <td>b</td>\n
#> [4] <td>Radial Velocity</td>
#> [5] <td>1</td>\n
#> [6] <td>326.03</td>\n
#> [7] <td>1.29</td>\n
#> [8] <td></td>\n
#> [9] <td>19.4</td>\n
#> [10] <td></td>
#> [11] <td>2.7</td>
Then html_text pulls the inner text out from each tag.
read_html(tags) %>%
  html_nodes("tr") %>%
  html_nodes("td") %>%
  html_text()
#> [1] "1" "11 Com" "b"
#> [4] "Radial Velocity" "1" "326.03"
#> [7] "1.29" "" "19.4"
#> [10] "" "2.7"
Created on 2018-10-31 by the reprex package (v0.2.1)
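And since this is likely one row of a larger table, the html_table route mentioned at the top is even shorter. As a rough sketch (wrapping the row in a <table> is an assumption here, since the original page isn't shown; html_table should then return a one-row data frame):
full <- paste0("<table>", tags, "</table>")
read_html(full) %>%
  html_node("table") %>%
  html_table()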
Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table function.
Note there are five tables on the page, so you will need to specify which table you would like to parse.
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
# find all of the tables on the page
tables <- html_nodes(statepop, "table")
# convert the first table into a data frame
table1 <- html_table(tables[[1]])
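The cells may still carry Wikipedia's hidden sort keys (the digits before each ♠ in the raw output above). A rough clean-up sketch (treating every column as text is an assumption; check the actual column types first):
# strip everything up to and including the ♠ sort-key separator
table1[] <- lapply(table1, function(x) sub("^.*♠", "", as.character(x)))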