How can I GET a YouTube video title with an httr request?

I'm not a programmer, so I might be stuck on something really silly.
I'm trying to get the titles of multiple YouTube videos for my research. I recently found the httr package, and I think the GET function retrieves this info well; the problem is that I don't know how to access the response.
I tried
x <- GET("https://www.youtube.com/watch?v=2lAe1cqCOXo")
content(x)
and it gave me this response
{html_document}
<html style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" lang="pt-BR" system-icons="" typography="" typography-spacing="">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http ...
[2] <body>\n<div id="watch7-content" class="watch-main-col" itemscope itemid="" itemtype="h ...
I know that every video title is in the [1] <head> part, either as:
<title>TITLE OF THE VIDEO - YouTube</title>
or as
<meta name="title" content="TITLE OF THE VIDEO">
Is there a way to browse the content response to extract this information?

Here is another approach that can be considered:
library(rvest)
library(stringr)
vector_URL_Title_To_Extract <- c("https://www.youtube.com/watch?v=DyX5RFSxWOY",
"https://www.youtube.com/watch?v=2lAe1cqCOXo",
"https://www.youtube.com/watch?v=ndTktsXlN7w")
nb_URL <- length(vector_URL_Title_To_Extract)
list_URL_Names <- list()
for(i in 1 : nb_URL)
{
print(i)
webpage <- read_html(vector_URL_Title_To_Extract[i])
webpage_Text <- html_text(webpage)
title <- stringr::str_extract_all(webpage_Text, "(;[^;]*#YouTubeRewind - YouTube)|(;[^;]*- YouTube\\{)")[[1]]
list_URL_Names[[i]] <- title
}
list_URL_Names
[[1]]
[1] ";CAQ celebrates victory in Quebec City, as opposition absorbs loss in Montreal - YouTube{"
[[2]]
[1] ";YouTube Rewind 2019: For the Record | #YouTubeRewind - YouTube"
[[3]]
[1] ";Crazy Capoeira Master Setting the UFC on Fire - Michel Pereira - YouTube{"

The best way to do this would probably be to use YouTube's API, which is designed for this purpose.
But if you know the format of the YouTube page's HTML, you could represent the whole document as a string and use a string function to find the index of the title tag and isolate it that way. Or you could use an HTML parser to parse the HTML and find the data.
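For the parser route, a minimal sketch with rvest (assuming YouTube still serves the <title> and <meta name="title"> nodes the question describes):
library(httr)
library(rvest)
x <- GET("https://www.youtube.com/watch?v=2lAe1cqCOXo")
doc <- content(x)  # content() already returns a parsed xml_document that rvest can query
# via the <title> element, dropping the trailing " - YouTube"
title <- html_text(html_element(doc, "title"))
sub(" - YouTube$", "", title)
# or via the <meta name="title"> element (assumes the page still exposes this node)
html_attr(html_element(doc, "meta[name='title']"), "content")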

Here is an approach that can be considered:
library(RSelenium)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
vector_URL_Title_To_Extract <- c("https://www.youtube.com/watch?v=DyX5RFSxWOY",
"https://www.youtube.com/watch?v=2lAe1cqCOXo",
"https://www.youtube.com/watch?v=ndTktsXlN7w")
nb_URL <- length(vector_URL_Title_To_Extract)
list_URL_Names <- list()
for(i in 1 : nb_URL)
{
print(i)
remDr$navigate(vector_URL_Title_To_Extract[i])
Sys.sleep(5)
web_Obj <- remDr$findElement("css selector", '#title > h1 > yt-formatted-string')
list_URL_Names[[i]] <- web_Obj$getElementText()
}
list_URL_Names
[[1]]
[[1]][[1]]
[1] "CAQ celebrates victory in Quebec City, as opposition absorbs loss in Montreal"
[[2]]
[[2]][[1]]
[1] "YouTube Rewind 2019: For the Record | #YouTubeRewind"
[[3]]
[[3]][[1]]
[1] "Crazy Capoeira Master Setting the UFC on Fire - Michel Pereira"

Related

Click button using R + httr

I'm trying to scrape randomly generated names from a website.
library(httr)
library(rvest)
url <- "https://letsmakeagame.net//tools/PlanetNameGenerator/"
mywebsite <- read_html(url) %>%
html_nodes(xpath="//div[contains(@id,'title')]")
However, that does not work. I'm assuming I have to «click» the «generate» button before extracting the content. Is there a simple way (without RSelenium) to achieve that?
Something similar to:
POST(url,
body = list("EntryPoint.generate()" = T),
encode = "form") -> res
res_t <- content(res, as="text")
Thanks!
rvest isn't much help here, as planet names are not requested from a remote service; they are generated locally with JavaScript, which is what the EntryPoint.generate() call does. A relatively simple way is to use chromote, though its session/process closing seems kind of messy at the moment:
library(chromote)
b <- ChromoteSession$new()
{
b$Page$navigate("https://letsmakeagame.net/tools/PlanetNameGenerator")
b$Page$loadEventFired()
}
# call EntryPoint.generate(), read result from <p id="title"></p> element,
# replicate 10x
replicate(10, b$Runtime$evaluate('EntryPoint.generate();document.getElementById("title").innerText')$result$value)
#> [1] "Torade" "Ukiri" "Giconerth" "Dunia" "Brihoria"
#> [6] "Tiulaliv" "Giahiri" "Zuthewei 4A" "Elov" "Brachomia"
b$close()
#> [1] TRUE
b$parent$close()
#> Error in self$send_command(msg, callback = callback_, error = error_, : Chromote object is closed.
b$parent$get_browser()$close()
#> [1] TRUE
Created on 2023-01-25 with reprex v2.0.2

RSelenium - how to go to the next page by clicking on the "next" button?

My question is about scraping with RSelenium.
I am trying to scrape data from the following website:
"https://www.nhtsa.gov/ratings" using RSelenium.
My present difficulty lies in managing to skip between pages for a given carmaker.
This is my code so far:
library(RSelenium)
#opens a connection
rD <- rsDriver()
remDr <- rD$client
#goes to the page we want
url <- "https://www.nhtsa.gov/ratings"
remDr$navigate(url)
#clicking to open the manufacturer selection "page"
webElem <- remDr$findElement(using = 'css selector', "#vehicle a")
webElem$clickElement()
#opening the options menu
option.menu <- remDr$findElement(using='css selector', 'select')
option.menu$clickElement()
#selecting one maker, loop over this later
maker.select <- remDr$findElement(using = 'xpath', "//*/option[@value = 'AUDI']")
maker.select$clickElement()
#search our selection
maker.click<-remDr$findElement(using='css selector', '.manufacturer-search-submit')
maker.click$clickElement()
#now we have to go through each car (10 per page), loop later
cars<-remDr$findElement(using='css selector', 'tbody:nth-child(6) a')
individual.link<-cars$getElementAttribute("href")
#going to the next page
next_page<-remDr$findElement(using='css selector', 'button.btn.link-arrow::after')
next_page$clickElement()
But I get the error:
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
Further Details: run errorDetails method
As you can probably see, I am new to RSelenium. Any help that you can give me would be appreciated. Thanks in advance.
Here is another approach that might be of help.
You can access the data simply by sending a GET request to the underlying API. From the website (on the first page), we can see the following request URL:
'https://api.nhtsa.gov/vehicles/byManufacturer?offset=0&max=10&sort=overallRating&order=desc&data=crashtestratings,recommendedfeatures&productDetail=all&dateStart=2011-01-01&manufacturerName=AUDI&dateEnd=3000-01-01&name='
This is where we can get the data. The second page will have offset=10, then 20, 30, etc.
If api_url is defined to be the URL above, we can get the data using httr:
# request the data
request <- httr::GET(api_url)
# retrieve the content
request_content <- httr::content(request)
request_result <- request_content$results
# request results contains the data of interest
# A few glimpses into the data
# The first model
request_result[[1]]$vehicleModel
# [1] "A3"
request_result[[1]]$modelYear
# [1] 2018
request_result[[1]]$manufacturer
# [1] "AUDI OF AMERICA, INC"
Now, by playing around with offset, it is straightforward to build a loop and gather all pages:
out <- list()
k <- 0L
i <- 1L
while (k < 1e+3) {
req_url <- paste0('https://api.nhtsa.gov/vehicles/byManufacturer?offset=',
k,
'&max=10&sort=overallRating&order=desc&data=crashtestratings,recommendedfeatures&productDetail=all&dateStart=2011-01-01&manufacturerName=AUDI&dateEnd=3000-01-01&name=')
req <- httr::content(httr::GET(req_url))$results
if (length(req) == 0) break
out[[i]] <- req
cat(paste0('\nAdded content for offset \t', k))
i <- i + 1L
k <- k + 10L
}
lengths(out)
# [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Note that you can also play around with manufacturerName in the URL, and with many more arguments, to get clean and tailored data.
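If you do vary those arguments, a slightly tidier sketch is to let httr build the query string instead of pasting the URL together (parameter names are taken from the URL above; manufacturerName = "BMW" is just an example):
base_url <- "https://api.nhtsa.gov/vehicles/byManufacturer"
req <- httr::GET(base_url, query = list(
  offset = 0,
  max = 10,
  sort = "overallRating",
  order = "desc",
  data = "crashtestratings,recommendedfeatures",
  productDetail = "all",
  dateStart = "2011-01-01",
  dateEnd = "3000-01-01",
  manufacturerName = "BMW"   # example maker; swap in any manufacturer
))
res <- httr::content(req)$results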

Webscraping with R a continuous page with "view more"

I'm new to R and need to scrape the titles and the dates on the posts on this website https://www.healthnewsreview.org/news-release-reviews/
Using rvest I was able to write the basic code to get the info:
url <- 'https://www.healthnewsreview.org/?post_type=news-release-review&s='
webpage <- read_html(url)
date_data_html <- html_nodes(webpage,'span.date')
date_data <- html_text(date_data_html)
head(date_data)
webpage <- read_html(url)
title_data_html <- html_nodes(webpage,'h2')
title_data <- html_text(title_data_html)
head(title_data)
But since the website only displays 10 items at first, and you then have to click "view more", I don't know how to scrape the whole site. Thank you!
Introducing third-party dependencies should be done as a last resort. RSelenium (which r2evans originally posited as the only solution) is not necessary the vast majority of the time, including now. (It is necessary for gosh-awful sites that use horrible tech like SharePoint, since maintaining state without a browser context for those is more pain than it's worth.)
If we start with the main page:
library(rvest)
pg <- read_html("https://www.healthnewsreview.org/news-release-reviews/")
We can get the first set of links (10 of them):
pg %>%
html_nodes("div.item-content") %>%
html_attr("onclick") %>%
gsub("^window.location.href='|'$", "", .)
## [1] "https://www.healthnewsreview.org/news-release-review/more-unwarranted-hype-over-the-unique-benefits-of-proton-therapy-this-time-in-combo-with-thermal-therapy/"
## [2] "https://www.healthnewsreview.org/news-release-review/caveats-and-outside-expert-balance-speculative-claim-that-anti-inflammatory-diet-might-benefit-bipolar-disorder-patients/"
## [3] "https://www.healthnewsreview.org/news-release-review/plug-for-study-of-midwifery-for-low-income-women-is-fuzzy-on-benefits-costs/"
## [4] "https://www.healthnewsreview.org/news-release-review/tiny-safety-trial-prematurely-touts-clinical-benefit-of-cancer-vaccine-for-her2-positive-cancers/"
## [5] "https://www.healthnewsreview.org/news-release-review/claim-that-milk-protein-alleviates-chemotherapy-side-effects-based-on-study-of-just-12-people/"
## [6] "https://www.healthnewsreview.org/news-release-review/observational-study-cant-prove-surgery-better-than-more-conservative-prostate-cancer-treatment/"
## [7] "https://www.healthnewsreview.org/news-release-review/recap-of-mental-imagery-for-weight-loss-study-requires-that-readers-fill-in-the-blanks/"
## [8] "https://www.healthnewsreview.org/news-release-review/bmjs-attempt-to-hook-readers-on-benefits-of-golf-slices-way-out-of-bounds/"
## [9] "https://www.healthnewsreview.org/news-release-review/time-to-test-all-infants-gut-microbiomes-or-is-this-a-product-in-search-of-a-condition/"
## [10] "https://www.healthnewsreview.org/news-release-review/zika-vaccine-for-brain-cancer-pr-release-headline-omits-crucial-words-in-mice/"
I guess you want to scrape the content of those ^^ so have at it.
But, there's that pesky "View more" button.
When you click on it, it issues a POST request to https://www.healthnewsreview.org/wp-admin/admin-ajax.php (you can see it in your browser's developer tools).
With curlconverter we can convert that request into a callable httr function, and we can wrap that function call in another function with a pagination parameter:
view_more <- function(current_offset=10) {
httr::POST(
url = "https://www.healthnewsreview.org/wp-admin/admin-ajax.php",
httr::add_headers(
`X-Requested-With` = "XMLHttpRequest"
),
body = list(
action = "viewMore",
current_offset = as.character(as.integer(current_offset)),
page_id = "22332",
btn = "btn btn-gray",
active_filter = "latest"
),
encode = "form"
) -> res
list(
links = httr::content(res) %>%
html_nodes("div.item-content") %>%
html_attr("onclick") %>%
gsub("^window.location.href='|'$", "", .),
next_offset = current_offset + 4
)
}
Now, we can run it (since it defaults to the 10 issued in the first View More click):
x <- view_more()
str(x)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/university-pr-misleads-with-claim-that-preliminary-blood-t"| __truncated__ "https://www.healthnewsreview.org/news-release-review/observational-study-on-testosterone-replacement-therapy-fo"| __truncated__ "https://www.healthnewsreview.org/news-release-review/recap-of-lung-cancer-screening-test-relies-on-hyperbole-co"| __truncated__ "https://www.healthnewsreview.org/news-release-review/ties-to-drugmaker-left-out-of-postpartum-depression-drug-study-recap/"
## $ next_offset: num 14
We can pass that new offset to another call:
y <- view_more(x$next_offset)
str(y)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/sweeping-claims-based-on-a-single-case-study-of-advanced-c"| __truncated__ "https://www.healthnewsreview.org/news-release-review/false-claims-of-benefit-weaken-news-release-on-experimenta"| __truncated__ "https://www.healthnewsreview.org/news-release-review/contrary-to-claims-heart-scans-dont-save-lives-but-subsequ"| __truncated__ "https://www.healthnewsreview.org/news-release-review/breastfeeding-for-stroke-prevention-kudos-to-heart-associa"| __truncated__
## $ next_offset: num 18
You can do the hard part of scraping the initial article count (it's on the main page) and doing the math to put that in a loop and stop efficiently.
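A rough sketch of that loop, building on the view_more() helper above (it assumes the site still responds, and simply stops when a batch comes back empty instead of pre-computing the article count):
all_links <- character(0)
offset <- 10
repeat {
  batch <- view_more(offset)
  if (length(batch$links) == 0) break   # no more articles
  all_links <- c(all_links, batch$links)
  offset <- batch$next_offset
  Sys.sleep(1)                          # be polite between requests
}
length(all_links)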
NOTE: If you are doing this scraping to archive the complete site (whether for them or independently), since it's dying at the end of the year, you should comment to that effect, as I have better suggestions for that use case than manual coding in any programming language. There are free, industrial "site preservation" frameworks designed to preserve these types of dying resources. If you just need the article content, then an iterator and a custom scraper are likely a 👍🏼 choice.
NOTE also that the pagination increment of 4 is what the site does when you literally press the button, so this just mimics that functionality.

Is there any method of reading a non-standard table that uses organizing labels, without a loop?

As far as I know, parsing libraries like XML and xml2 can read a standard table on a web page perfectly. But there are some sorts of tables that have no table grid, only organizing labels such as "<span>" and "<div>".
Now I am dealing with a table like this.
The table structure is marked up with "<span>", and every 4 "<span>" labels make up one record. I've used a loop to solve this problem and succeeded, but I want to process it without a loop. I heard that the purrr library may help with this, but I don't know how to use it in this situation.
I did my analysis with both "XML" and "xml2":
Analysis with “XML” package
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(XML)
tableNodes <- getNodeSet(htmlParse(pg), "//table[@class='miscTable2']")
itemlines <- xpathApply(tableNodes[[1]], "//tr[@class='itemLine']/td[@width='750']")
ispan <- xmlElementsByTagName(itemlines[[2]], "span")
title <- xmlValue(ispan$span)
isuedate <- xmlValue(ispan$span[1,2])
author <- xmlValue(ispan$span[3])
In this case, "XML" got a list of "span" nodes; the list looked like what I expected, but it behaves strangely:
> attributes(ispan)
$names
[1] "span" "span" "span" "span"
It seems to have one row only, but four columns. However, it doesn't. The 2nd-4th "span" elements couldn't be selected by column. The first "span" occupied 2 columns, and the other "span" elements could not be retrieved.
> val <- xmlValue(ispan$span[[1]])
> val
[1] "超高周疲劳裂纹萌生与初始扩展的特征尺度"
> isuedate <- xmlValue(ispan$span[[2]])
> isuedate
[1] " \r\n [科普文章]"
> isuedate <- xmlValue(ispan$span[[3]])
> isuedate
[1] NA
> author <- xmlValue(ispan$span[[4]])
> author
[1] NA
None of the selection methods used on lists works:
> title <- xmlValue(ispan$span[1,1])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
title <- xmlValue(ispan$span[1,])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
author <- xmlValue(ispan[1,3])
Error in ispan[1, 3] : incorrect number of dimensions
Analysis with “xml2”
With "xml2", the "span" labels present the same problem.
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(xml2)
itemspantab <- xml_find_all(read_html(pg, encoding = "UTF-8"), "//table[@class='miscTable2']")
itemspan <- xml_child(itemspantab, "span")
It could not gather any of these "span" labels:
> itemspan
{xml_nodeset (1)}
[1] <NA>
If we go a step further to locate the "span" labels, it still gets nothing:
> itemspanl <- xml_find_all(itemspantab, '//tr[@class="itemLine"]/td/span')
> itemspan <- xml_child(itemspanl, "span")
> itemspan
{xml_nodeset (40)}
[1] <NA>
[2] <NA>
[3] <NA>
...
A suggestion was to use library(purrr) for this, but "purrr" seems to process data frames only, and the "list" prepared by "xml2" could not be analyzed.
I want to avoid a loop and still get the same result. Can we do it? I hope those who have experience with "XML" and "xml2" can give me some advice on how to cope with this non-standard table. Thanks a lot.
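One loop-free pattern that might work here, as a sketch only (the XPath mirrors the one used above, the field names are placeholders, and it has not been tested against the live page):
library(xml2)
library(purrr)
pg <- read_html("http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920",
                encoding = "UTF-8")
# every <span> inside the item cells, flattened to plain text
vals <- xml_text(xml_find_all(pg, "//tr[@class='itemLine']/td[@width='750']//span"))
# group each run of 4 consecutive values into one named record
# (names below are placeholders, not the real column labels)
records <- map(split(vals, ceiling(seq_along(vals) / 4)),
               ~ setNames(.x, c("field1", "field2", "field3", "field4")))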

List and description of all packages in CRAN from within R

I can get a list of all the available packages with the function:
ap <- available.packages()
But how can I also get a description of these packages from within R, so I can have a data.frame with two columns: package and description?
Edit of an almost ten-year-old accepted answer: what you likely want is not to scrape (unless you want to practice scraping) but to use an existing interface, tools::CRAN_package_db(). Example:
> db <- tools::CRAN_package_db()[, c("Package", "Description")]
> dim(db)
[1] 18978 2
>
The function (currently) brings back 66 columns, of which the two of interest here are a subset.
I actually think you want "Package" and "Title", as the "Description" can run to several lines. So here is the former; just put "Description" in the final subset if you really want "Description":
R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+ contrib.url(getOption("repos")["CRAN"], "source")
+ description <- sprintf("%s/web/packages/packages.rds",
+ getOption("repos")["CRAN"])
+ con <- if(substring(description, 1L, 7L) == "file://") {
+ file(description, "rb")
+ } else {
+ url(description, "rb")
+ }
+ on.exit(close(con))
+ db <- readRDS(gzcon(con))
+ rownames(db) <- NULL
+
+ db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle()) # I shortened one Title here...
Package Title
[1,] "abc" "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA" "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd" "The Analysis of Biological Data"
[4,] "abind" "Combine multi-dimensional arrays"
[5,] "abn" "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>
Dirk has provided a terrific answer. After finishing my solution and then seeing his, I debated for some time whether to post mine for fear of looking silly. But I decided to post it anyway, for two reasons:
it is informative to beginning scrapers like myself
it took me a while to do and so why not :)
I approached this thinking I'd need to do some web scraping, and chose crantastic as the site to scrape from. First I'll provide the code and then two scraping resources that have been very helpful to me as I learn:
library(RCurl)
library(XML)
URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = T,
strip.white = T, as.is = FALSE, sep = ",", na.strings = c("999",
"NA", " "))[, 1])
Trim <- function(x) {
gsub("^\\s+|\\s+$", "", x)
}
packs <- unique(Trim(packs))
u1 <- "http://crantastic.org/packages/"
len.samps <- 10 #for demo purpose; use:
#len.samps <- length(packs) # for all of them
URL2 <- paste0(u1, packs[seq_len(len.samps)])
scraper <- function(urls){ #function to grab description
doc <- htmlTreeParse(urls, useInternalNodes=TRUE)
nodes <- getNodeSet(doc, "//p")[[3]]
return(nodes)
}
info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))
info2 <- sapply(info, function(x) { #replace errors with NA
if(class(x)[1] != "XMLInternalElementNode"){
NA
} else {
Trim(gsub("\\s+", " ", xmlValue(x)))
}
}
)
pack_n_desc <- data.frame(package=packs[seq_len(len.samps)],
description=info2) #make a dataframe of it all
Resources:
talkstats.com thread on web scraping (great beginner examples)
w3schools.com site on HTML (very helpful)
I wanted to try to do this using an HTML scraper (rvest) as an exercise, since available.packages() in the OP doesn't contain the package Descriptions.
library('rvest')
url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
webpage <- read_html(url)
data_html <- html_nodes(webpage,'tr td')
length(data_html)
P1 <- html_nodes(webpage,'td:nth-child(1)') %>% html_text(trim=TRUE) # XML: The Package Name
P2 <- html_nodes(webpage,'td:nth-child(2)') %>% html_text(trim=TRUE) # XML: The Description
P1 <- P1[!is.na(P1) & nzchar(P1)] # Remove missing and empty ("") items
length(P1); length(P2);
mdf <- data.frame(P1, P2, row.names=NULL)
colnames(mdf) <- c("PackageName", "Description")
# This is the problem! It lists large sets column-by-column,
# instead of row-by-row. Try with the full list to see what happens.
print(mdf, right=FALSE, row.names=FALSE)
# PackageName Description
# A3 Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
# abbyyR Access to Abbyy Optical Character Recognition (OCR) API
# abc Tools for Approximate Bayesian Computation (ABC)
# abc.data Data Only: Tools for Approximate Bayesian Computation (ABC)
# ABC.RAP Array Based CpG Region Analysis Pipeline
# ABCanalysis Computed ABC Analysis
# For small sets we can use either:
# mdf[1:6,] #or# head(mdf, 6)
However, although this works quite well for a small array/data frame (subset), I ran into a display problem with the full list, where the data would be shown either column-by-column or unaligned. It would have been great to have this paged and properly formatted in a new window somehow. I tried using page(), but I couldn't get it to work very well.
EDIT:
The recommended method is not the above, but rather Dirk's suggestion (from the comments):
db <- tools::CRAN_package_db()
colnames(db)
mdf <- data.frame(db[,1], db[,52])
colnames(mdf) <- c("Package", "Description")
print(mdf, right=FALSE, row.names=FALSE)
However, this still suffers from the display problem mentioned...
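One simple workaround for that display problem (a suggestion, not from the answers above) is to skip print() and send the table to the data viewer or to a file instead:
View(mdf)                                               # interactive viewer in RStudio/RGui
write.csv(mdf, "cran_packages.csv", row.names = FALSE)  # or inspect it outside R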
