Extracting image source from string in R

I am trying to scrape image sources from different websites using rvest. The problem I've encountered is that I have a vector of strings containing the sources, but I need to extract the actual source URLs from them.
Here are the first few entries:
> string
{xml_nodeset (100)}
[1] <td class="no-wrap currency-name" data-sort="Bitcoin">\n<img src="https://s2.coinmarketc ...
[2] <td class="no-wrap currency-name" data-sort="Ethereum">\n<img src="https://s2.coinmarket ...
[3] <td class="no-wrap currency-name" data-sort="Ripple">\n <img src="https://s2.coinmarketc ...
What I need is basically the part coming after src=", so for the first entry "https://s2.coinmarketcap.com/static/img/coins/16x16/1.png" (the console doesn't show the full strings, but this is what appears after the dots ..., and there is more content after it as well).
Any help is appreciated as I am a bit stuck here.

You can do:
library(rvest)
read_html("https://coinmarketcap.com/coins/") %>%
  html_nodes("td img") %>%
  html_attr("src")
[1] "https://s2.coinmarketcap.com/static/img/coins/16x16/1.png"
[2] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1.png"
[3] "https://s2.coinmarketcap.com/static/img/coins/16x16/1027.png"
[4] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1027.png"
[5] "https://s2.coinmarketcap.com/static/img/coins/16x16/52.png"
[6] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/52.png"
[7] "https://s2.coinmarketcap.com/static/img/coins/16x16/1831.png"
[8] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1831.png"
...

As pointed out in the comments, a regular expression should do it:
myhtml <- gsub('^.*https://\\s*|\\s*\\.png.*$', "", string)
myhtml <- paste0("https://", myhtml, ".png")
The first line extracts the part of each string contained between https:// and .png, and the second pastes those pieces back into full strings so that you end up with valid sources, i.e. with https:// at the start and .png at the end.
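For completeness, here is a sketch of applying this to the nodeset from the question (assuming string is the xml_nodeset shown above; as.character() gives the HTML text of each node):
txt <- as.character(string)                                # HTML text of each node
myhtml <- gsub('^.*https://\\s*|\\s*\\.png.*$', "", txt)   # keep the part between https:// and .png
srcs <- paste0("https://", myhtml, ".png")                 # rebuild the full source URLs
head(srcs)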

Related

rvest not able to grab html table using html_nodes("table"), despite table being on page

We are struggling to grab the main table at this fangraphs link. Using rvest:
url = 'https://www.fangraphs.com/leaders/splits-leaderboards?splitArr=1&splitArrPitch=&position=B&autoPt=false&splitTeams=false&statType=team&statgroup=2&startDate=2021-07-07&endDate=2021-07-21&players=&filter=&groupBy=season&sort=9,1'
table_nodes = url %>% read_html() %>% html_nodes('table')
table_nodes
{xml_nodeset (7)}
[1] <table class="menu-standings-table"><tbody><tr>\n<td>\r\n <div class="menu-sub-header">AL East</div>\r\n ...
[2] <table class="menu-team-table">\n<tr>\n<td>\r\n <div class="menu-sub-header">AL East</div>\r\n ...
[3] <table class="menu-team-table">\n<tr>\n<td>\r\n <div class="menu-sub-header">AL East</div>\r\n ...
[4] <table>\n<tr>\n<td>BAL</td>\n<td><a href="http://www.fangraphs.com/blogs/top-34-prospects ...
[5] <table>\n<tr>\n<td>ATL</td>\n<td><a href="http://www.fangraphs.com/blogs/top-49-prospects-ch ...
[6] <table>\n<tr>\n<td>BAL</td>\n<td><a href="http://www.fangraphs.com/blogs/top-38-prospects ...
[7] <table>\n<tr>\n<td>ATL</td>\n<td><a href="http://www.fangraphs.com/blogs/top-41-prospects-ch ...
None of these 7 tables are the main table at the URL with all of the different team stats. url %>% read_html() %>% html_nodes('div.table-scroll') returns an empty nodeset, and div.table-scroll is the wrapper div that the main table is located in.
Edit: I believe this is the relevant network request, but I'm still not sure how to work out the API call from it. How can I see the full API call?
Data is retrieved dynamically from an API call. Switch to httr, as you need to make a POST request that includes the start/end dates. Also, return as much data as possible per request, so that you need as few calls as possible.
You will want to convert the code below into some form of custom function which accepts date arguments (see the sketch after the code).
library(httr)
library(purrr)

headers = c(
  'user-agent' = 'Mozilla/5.0',
  'content-type' = 'application/json;charset=UTF-8'
)

# JSON body for the POST request; note the start/end dates it carries
data = '{"strPlayerId":"all","strSplitArr":[1],"strGroup":"season","strPosition":"B","strType":"2","strStartDate":"2021-07-07","strEndDate":"2021-07-21","strSplitTeams":false,"dctFilters":[],"strStatType":"team","strAutoPt":"false","arrPlayerId":[],"strSplitArrPitch":[]}'

r <- httr::POST(
  url = 'https://www.fangraphs.com/api/leaders/splits/splits-leaders',
  httr::add_headers(.headers = headers),
  body = data
) %>% content()
df <- map_df(r$data, data.frame)
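A minimal sketch of that custom function, reusing the headers defined above (the name get_splits and the use of sprintf are illustrative, not from the original answer):
# Sketch only: wrap the request in a function that accepts date arguments
get_splits <- function(start_date, end_date) {
  body <- sprintf(
    '{"strPlayerId":"all","strSplitArr":[1],"strGroup":"season","strPosition":"B","strType":"2","strStartDate":"%s","strEndDate":"%s","strSplitTeams":false,"dctFilters":[],"strStatType":"team","strAutoPt":"false","arrPlayerId":[],"strSplitArrPitch":[]}',
    start_date, end_date
  )
  resp <- httr::POST(
    url = 'https://www.fangraphs.com/api/leaders/splits/splits-leaders',
    httr::add_headers(.headers = headers),
    body = body
  )
  purrr::map_df(httr::content(resp)$data, data.frame)
}

df <- get_splits("2021-07-07", "2021-07-21")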

Extract youtube video ID from url with R stringr regex

I'm looking to extract only the video id string from a column of youtube links.
The stringr function I'm currently using is this:
str_extract(data$link, "\\b[^=]+$")
This works for most standard YouTube links where the id appears at the end of the URL after an = sign, e.g.
youtube.com/watch?v=kFF0v0FQzEI
However, not all links follow this pattern. Examples:
youtube.com/v/kFF0v0FQzEI
youtube.com/vi/kFF0v0FQzEI
youtu.be/kFF0v0FQzEI
www.youtube.com/v/kFF0v0FQzEI?feature=autoshare&version=3&autohide=1&autoplay=1
www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1
So could anyone help me out with an R regex pattern to extract the id (kFF0v0FQzEI in this case) in all the examples above?
I've seen regex patterns used in other languages to do this, but I'm unsure how to adapt them to R.
Thanks!
You could use something like the following, but note that it's pretty heavily hard-coded to the examples you provided.
links = c("youtube.com/v/kFF0v0FQzEI",
          "youtube.com/vi/kFF0v0FQzEI",
          "youtu.be/kFF0v0FQzEI",
          "www.youtube.com/v/kFF0v0FQzEI?feature=autoshare&version=3&autohide=1&autoplay=1",
          "www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1",
          "youtube.com/watch?v=kFF0v0FQzEI",
          "http://www.youtube.com/watch?argv=xyz&v=kFF0v0FQzEI")

get_id = function(link) {
  if (stringr::str_detect(link, '/watch\\?')) {
    # /watch URLs: the id follows "?v=" or "&v="
    rgx = '(?<=\\?v=|&v=)[\\w]+'
  } else {
    # other URLs: the id is the last path component, optionally followed by "/" or "?"
    rgx = '(?<=/)[\\w]+/?(?:$|\\?)'
  }
  stringr::str_extract(link, rgx)
}

ids = unname(sapply(links, get_id))
# [1] "kFF0v0FQzEI"  "kFF0v0FQzEI"  "kFF0v0FQzEI"  "kFF0v0FQzEI?"
# [5] "kFF0v0FQzEI"  "kFF0v0FQzEI"  "kFF0v0FQzEI"

Removing punctuations from text using R

I need to remove punctuation from the text. I am using the tm package, but the catch is:
e.g. the text is something like this:
data <- "I am a, new comer","to r,"please help","me:out","here"
now when I run
library(tm)
data<-removePunctuation(data)
in my code, the result is:
I am a new comerto rplease helpmeouthere
but what I expect is:
I am a new comer to r please help me out here
Here's how I read your question, and an answer that is very close to @David Arenburg's in the comment above.
data <- '"I am a, new comer","to r,"please help","me:out","here"'
gsub('[[:punct:] ]+',' ',data)
[1] " I am a new comer to r please help me out here "
The extra space after [:punct:] inside the character class means that spaces are matched along with the punctuation, and the + matches one or more of these characters in sequence. This has the side effect, desirable in some cases, of shortening any sequence of spaces to a single space.
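An alternative sketch (not part of the original answer) keeps the two steps separate: turn punctuation into spaces first, then collapse repeated whitespace:
cleaned <- gsub("[[:punct:]]", " ", data)   # punctuation -> space
cleaned <- gsub("\\s+", " ", cleaned)       # collapse runs of whitespace
trimws(cleaned)
# [1] "I am a new comer to r please help me out here"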
If you had something like
string <- "hello,you"
> string
[1] "hello,you"
You could do this:
> gsub(",", "", string)
[1] "helloyou"
It replaces the "," with "" in the variable called string.

removing data with tags from a vector

I have a string vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
I want to remove these tags and get the following vector, e.g.
abc <- "welcome Have fun"
Try
> gsub("(<[^>]*>)","",abc)
what this says is 'substitute every instance of < followed by anything that isnt a > up to a > with nothing"
You cant just do gsub("<.*>","",abc) because regexps are greedy, and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
This solution might fail if you've got > in your tags - but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
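For illustration (a small sketch, not part of the original answer), here is how the greedy and non-greedy patterns behave on the example string:
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
gsub("<[^>]*>", "", abc)             # "welcome abc Have fun!"
gsub("<.*>", "", abc)                # "welcome  Have fun!"  (greedy: 'abc' is lost)
gsub("<.*?>", "", abc, perl = TRUE)  # "welcome abc Have fun!" (non-greedy alternative)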
You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse. You can then convert it to text, i.e., strip all the tags, with xmlValue.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
library(XML)
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
xmlValue( xmlRoot(doc) )
If you also want to remove the contents of the tags themselves (the <span> in this example), you can use xmlDOMApply to transform the XML tree.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)

R help page as object

Is there a nice way to get the R help page from an installed package in the form of an R object (e.g. a list)? I would like to expose help pages in the form of standardized JSON or XML schemas. However, getting the R help info from the DB is harder than I thought.
I hacked something together a while ago to get the HTML of an R help manual page. However, I would rather have a general R object that contains this information, which I can render to JSON/XML/HTML, etc. I looked into the helpr package from Hadley, but it seems to be a bit of overkill for my purpose.
Edited with Hadley's suggestion.
You can do this a bit easier by:
getHTMLhelp <- function(...){
  thefile <- help(...)
  capture.output(
    tools:::Rd2HTML(utils:::.getHelpFile(thefile))
  )
}
Using tools:::Rd2txt instead of tools:::Rd2HTML will give you plain text. Just getting the file (without any parsing) gives you the original Rd format, so you can write a custom parsing function to parse it into an object (see the solution from @Jeroen, which does a good job of extracting all the info into a list).
This function takes exactly the same arguments as help() and returns a vector with every element being a line in the file, e.g.:
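(The call shown here is assumed; it is what would produce the HelpAnova object used below.)
HelpAnova <- getHTMLhelp(anova)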
> head(HelpAnova)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
[2] "<html><head><title>R: Anova Tables</title>"
[3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
[4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
[5] "</head><body>"
[6] ""
Or :
> HelpGam <- getHTMLhelp(gamm,package=mgcv)
> head(HelpGam)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
[2] "<html><head><title>R: Generalized Additive Mixed Models</title>"
[3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
[4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
[5] "</head><body>"
[6] ""
So below is what I hacked together. However, I have yet to test it on many help files to see whether it works in general.
Rd2list <- function(Rd){
  names(Rd) <- substring(sapply(Rd, attr, "Rd_tag"), 2)
  temp_args <- Rd$arguments
  Rd$arguments <- NULL
  myrd <- lapply(Rd, unlist)
  myrd <- lapply(myrd, paste, collapse = "")
  temp_args <- temp_args[sapply(temp_args, attr, "Rd_tag") == "\\item"]
  temp_args <- lapply(temp_args, lapply, paste, collapse = "")
  temp_args <- lapply(temp_args, "names<-", c("arg", "description"))
  myrd$arguments <- temp_args
  return(myrd)
}

getHelpList <- function(...){
  thefile <- help(...)
  myrd <- utils:::.getHelpFile(thefile)
  Rd2list(myrd)
}
And then you would do something like:
myhelp <- getHelpList("qplot", package = "ggplot2")
cat(jsonlite::toJSON(myhelp))
