I'm writing a wrapper for the YouTube Analytics API, and have created a function as follows:
yt_request <- function(dimensions = NULL, metrics = NULL, sort = NULL,
                       maxResults = NULL, filtr = NULL, startDate = Sys.Date() - 30,
                       endDate = Sys.Date(), token) {
  url <- paste0("https://youtubeanalytics.googleapis.com/v2/reports?",
                "&ids=channel%3D%3DMINE",
                "&startDate=", startDate,
                "&endDate=", endDate)
  if (!is.null(dimensions)) url <- paste0(url, "&dimensions=", dimensions)
  if (!is.null(metrics)) url <- paste0(url, "&metrics=", metrics)
  if (!is.null(sort)) url <- paste0(url, "&sort=", sort)
  if (!is.null(maxResults)) url <- paste0(url, "&maxResults=", maxResults)
  if (!is.null(filtr)) url <- paste0(url, "&filters=", filtr)
  r <- GET(url, token)
  return(r)
}
This is meant to be flexible rather than friendly, because I want to build much more user-friendly wrapper functions around yt_request(). For example:
top_videos <- function(...) {
  dim <- "video"
  met <- "views,averageViewDuration"
  maxRes <- 10
  temp <- yt_request(dimensions = dim, metrics = met, maxResults = maxRes, token = myToken)
  return(temp)
}
So far this works fine and dandy, but I also want potential users to have a little flexibility with the results. For example, if they want maxResults = 20 instead of 10, or different metrics than the ones I specify, I want them to be able to pass their own arguments in the ... of top_videos(...).
How can I check whether someone has passed an argument in the ellipsis? If they pass a metric, I want it to override the default I specify; otherwise, go with the default.
EDIT
To help clarify: I'm hoping that when the user calls the function, they could just write something like top_videos(maxResults = 20), and the function would ignore the line maxRes <- 10 and pass maxResults = 20 to yt_request() instead of 10.
We can capture the ... in a list and treat its elements as key/value pairs, then extract elements by name. If a particular named element was not passed, [[ returns NULL. We exploit this behavior of NULL: concatenating it with the default value (10 for maxRes) and selecting the first element ([1]) yields the default when nothing was passed, and the user's value otherwise. Do the same for every argument the OP wants to be overridable:
top_videos <- function(...) {
  lst1 <- list(...)
  dim <- c(lst1[["dimensions"]], "video")[1]
  met <- c(lst1[["metrics"]], "views,averageViewDuration")[1]
  maxRes <- c(lst1[["maxResults"]], 10)[1]
  # temp <- yt_request(dimensions = dim, metrics = met,
  #                    maxResults = maxRes, token = myToken)
  # temp
  maxRes
}
Testing:
top_videos(maxResults = 20)
#[1] 20
top_videos(hello = 5)
#[1] 10
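Another option for the same pattern is modifyList(), which merges the user's ... into a list of defaults so that any matching named argument overrides its default. This is a sketch, untested against the real API (which is why the yt_request() call stays commented out):

```r
top_videos <- function(...) {
  defaults <- list(dimensions = "video",
                   metrics = "views,averageViewDuration",
                   maxResults = 10)
  # Named arguments in ... override the matching defaults;
  # unrecognized names are carried along but simply never used
  args <- modifyList(defaults, list(...))
  # temp <- yt_request(dimensions = args$dimensions, metrics = args$metrics,
  #                    maxResults = args$maxResults, token = myToken)
  args$maxResults
}

top_videos(maxResults = 20)
# [1] 20
top_videos()
# [1] 10
```

This avoids repeating the c(x, default)[1] idiom once the number of overridable arguments grows.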
I have a data set where I want to calculate the 6 month return of stocks with tq_get (see example below)
Dataset called top
ticker 6month
AKO.A
BIG
BGFV
Function
library(tidyverse)
library(dplyr)
library(tidyquant)
library(riingo)
calculate <- function(x) {
  (tq_get(x, get = "tiingo", from = yesterday, to = yesterday)$adjusted /
     tq_get(x, get = "tiingo", from = before, to = before)$adjusted) - 1
}
top[2] <- lapply(top[1], function(x) calculate(x))
Unfortunately, for some of the tickers no value exists, which results in an error message when simply using lapply or mutate, since the resulting vector has fewer rows than the existing dataset. Resolving this with tryCatch did not work.
I now wanted to apply a workaround by checking with is_supported_ticker(), provided by the riingo package, whether the ticker is available:
calculate <- function(x) {
  if (is_supported_ticker(x, type = "tiingo")) {
    (tq_get(x, get = "tiingo", from = yesterday, to = yesterday)$adjusted /
       tq_get(x, get = "tiingo", from = before, to = before)$adjusted) - 1
  } else {
    NA
  }
}
top[2] <- lapply(top[1], function(x) calculate(x))
But now I receive the error message x ticker must be length 1, but is actually length 3.
I assume this is based on the fact that the whole first column of my dataset is used as input for is_supported_ticker() instead of row by row. How can I resolve this issue?
Glancing at the documentation, it looks like tq_get supports multiple symbols, but is_supported_ticker goes one at a time. So you should probably check all the tickers to see if they are supported, and then use tq_get once on all the supported ones. Something like this (untested, as I don't have any of these packages):
calculate <- function (x) {
supported = sapply(x, is_supported_ticker, type = "tiingo")
result = rep(NA, length(x))
result[supported] =
(
tq_get(x[supported], get = "tiingo", from = yesterday, to = yesterday)$adjusted /
tq_get(x[supported], get = "tiingo", from = before, to = before)$adjusted
) - 1
return(result)
}
It worries me that before and yesterday aren't function arguments - they're just assumed to be there in the global environment. I'd suggest passing them in as arguments to calculate(), like this:
calculate <- function (x, before, yesterday) {
supported = sapply(x, is_supported_ticker, type = "tiingo")
result = rep(NA, length(x))
result[supported] =
(
tq_get(x[supported], get = "tiingo", from = yesterday, to = yesterday)$adjusted /
tq_get(x[supported], get = "tiingo", from = before, to = before)$adjusted
) - 1
return(result)
}
# then calling it
calculate(top$ticker, before = <...>, yesterday = <...>)
This way you can pass values in for before and yesterday on the fly. If they are objects in your global environment, you can simply use calculate(top$ticker, before, yesterday), but it gives you freedom to vary those arguments without redefining those names in your global environment.
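For example, the dates could be supplied like this. This is only a sketch: the offsets are assumptions (not from the original post), and Tiingo will only return data for trading days, so weekends and holidays would still need handling:

```r
# Hypothetical concrete dates, purely for illustration
yesterday <- Sys.Date() - 1   # most recent day
before <- yesterday - 182     # roughly six months earlier

top$`6month` <- calculate(top$ticker, before = before, yesterday = yesterday)
```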
I'm sorry this example won't be reproducible by those who aren't Bloomberg users.
For the others, I'm using Rblpapi and its subscribe function. I would like to create something like a data frame, a matrix or an array and fill it with values that are streamed by the subscription.
Assuming your BBComm component is up and running, here is my example:
require(Rblpapi)
con <- blpConnect()
securities <- c('SX5E 07/20/18 C3400 Index',
                'SX5E 07/20/18 C3450 Index',
                'SX5E 07/20/18 C3500 Index')
I would like to fill a 3 x 2 matrix with these fields:
fields <- c('BID', 'ASK')
I guess I can create a matrix like this with almost no performance overhead:
mat <- matrix(data = NA,
              nrow = 3,
              ncol = 2)
Now I use subscribe and its argument fun for filling purposes, so something like this (albeit ugly to see and likely inefficient):
i <- 1
subscribe(securities = securities,
          fields = fields,
          fun = function(x) {
            if (i > length(securities))
              i <<- 1
            tryCatch(
              expr = {
                mat[i, 1] <<- x$data$BID
                mat[i, 2] <<- x$data$ASK
                i <<- i + 1
              },
              error = function(e) {
                message(e)
              },
              finally = {}
            )
          })
Result:
Error in subscribe_Impl(con, securities, fields, fun, options, identity) :
Evaluation error: number of items to replace is not a multiple of replacement length.
Of course, this doesn't work because I don't really know how to index into the streamed data. The $ operator seems fine for retrieving data points by name, as I did with BID and ASK, but I cannot find a way to figure out which security a given value refers to, say securities[1] or securities[2]. I seem to get a stream of numeric values that are indistinguishable from one another, because I cannot tell which of the securities a value belongs to.
Using an index on x$data$BID[1] throws the same error.
OK, your code looks fine; the only thing that does not work is x$data$BID. Change it to x$data["BID"] and then you can store it. I worked with your code and this is my result.
fields <- c("TIME", "LAST_PRICE", "BID", "ASK")
blpConnect()
i <- 1
subscribe(securities = securities,
          fields = fields, "interval=60",
          fun = function(x) {
            if (i > length(securities))
              i <<- 1
            tryCatch(
              expr = {
                tim <- x$data["TIME"]
                last <<- x$data["LAST_PRICE"]
                ask <<- x$data["ASK"]
                bid <<- x$data["BID"]
                i <<- i + 1
              },
              error = function(e) {
                message(e)
              },
              finally = {}
            )
            print(cbind(tim$TIME, last$LAST_PRICE, ask$ASK, bid$BID))
          })
Result:
A good way to take a look at the result object from the subscribe function is:
subscribe(securities = c("AAPL US Equity"),
          fields = c("LAST_PRICE"),
          fun = function(x) print(str(x)))
From there you can work your way into the data:
subscribe(securities = c("AAPL US Equity", "INTC US Equity"),
          fields = c("LAST_PRICE", "BID", "ASK"),
          fun = function(x) {
            if (!is.null(x$data$MKTDATA_EVENT_TYPE) &&
                x$data$MKTDATA_EVENT_TYPE == "TRADE" &&
                exists("LAST_PRICE", where = x$data)) {
              print(data.frame(Ticker = x$topic,
                               DateTime = x$data$TRADE_UPDATE_STAMP_RT,
                               Trade = x$data$LAST_PRICE))
            }
          })
I only printed the data.frame here. The data can be processed or stored directly using the FUN argument of subscribe.
I am running an R script with the following (highlighted) filter:
query.list <- Init(start.date = as.character(startdate),
                   end.date = as.character(enddate),
                   dimensions = "ga:date,ga:campaign,ga:adwordsCampaignID,ga:adGroup,ga:adDestinationUrl",
                   metrics = "ga:sessions,ga:bounces",
                   max.results = 10000,
                   **filters = c("ga:adwordsCampaignID!=%28not%20set%29;ga:sessions>0"),**
                   table.id = example)
ga.query <- QueryBuilder(query.list)
x <- as.integer(difftime(max(as.Date(query.list$end.date, '%Y-%m-%d')),
                         min(as.Date(query.list$start.date, '%Y-%m-%d')), units = "days"))
# Daywise split parameter defaults to FALSE when start and end date are the same day
daywisesplit <- if (x == 0) {
  FALSE
} else {
  TRUE
}
# Extract the data and store it in a data-frame
example_camp <- GetReportData(ga.query, token, split_daywise = daywisesplit)
example_camp$date <- as.Date(example_camp$date, '%Y%m%d')
example_camp$brand <- 'Example'
example_camp <- subset(example_camp, sessions > 0)
example_camp <- subset(example_camp, adwordsCampaign > 0)
The script runs with no errors, but when looking at the write.csv output I still see (not set) within the adwordsCampaignID column. Moreover, when I look at the sessions column, all rows with zero sessions are excluded, so that filter is working properly.
How can I make the exclude-(not set) filter work properly when pulling data into the CSV? Perhaps I need to update the data frame?
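If the API-side filter cannot be made to work, a blunt workaround is to drop those rows in R before writing the file, in the same style as the subset() calls above. This is a sketch only; it assumes the column literally contains the string "(not set)" and that the output file name is up to you:

```r
# Post-hoc filter: remove rows the API filter failed to exclude
example_camp <- subset(example_camp, adwordsCampaignID != "(not set)")
write.csv(example_camp, "example_camp.csv", row.names = FALSE)
```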
If I define a list as follows: list(param = "value1"), then the result will be:
> list(param = "value1")
#$param
#[1] "value1"
I'm trying to define a new operator %!=% such that I can define a list as follows: list(param %!=% "value1") and I want the result to be:
> list(param %!=% "value1")
#$param.ne
#[1] "value1"
As background, I'm writing an R wrapper for a RESTful API that is pulling data from a database. When you make a request, you can tell the API to return results that match a parameter value param1 = value1 -- this will be included directly in the query string of the GET request.
However, the API also allows you to pull results that do NOT match the parameter value: param != value1. Putting != into the query string of a request presents a problem. The API is designed to use param.ne = value1 in place of param != value1.
Currently I'm trying to do this using some elements from non-standard evaluation:
`%!=%` = function(query, value) {
  query = deparse(substitute(query))
  query = paste0("`", query, ".ne", "`", "=", "\"", value, "\"")
  parse(text = query)
}
This function will figure out the type of query (param, param2, param3, etc) using non-standard evaluation, convert the query to a string, add a .ne to the end and send it back out as an expression.
However, if I try to do list(eval(address %!=% "value1")) then the result will be:
> list(eval(address %!=% "value1"))
#[[1]]
#[1] "value1"
Is there a simple way to achieve the above?
EDIT:
In light of posts below, an additional question:
Is there anyway to adapt the above to work with a function that has a default parameter? For example:
newlist <- function(param1 = "default1", param2 = "default2", ...)
and I wanted to do something like
newlist(param2 = "non-default", param3 %!=% "stuff")
with the result being a list with param1 = "default1", param2 = "non-default" and param3.ne = "stuff"?
You're better off tackling this at the list() call level rather than trying to interfere with parameter-name replacement. For example, consider this function:
newlist <- function(...) {
  dots <- substitute(...())
  op <- sapply(sapply(dots, '[[', 1), deparse)
  car <- sapply(sapply(dots, '[[', 2), deparse)
  cdr <- sapply(sapply(dots, "[[", 3), eval.parent)
  stopifnot(all(op %in% c("==", "!=")))
  setNames(as.list(cdr), ifelse(op == "!=", paste0(car, ".ne"), car))
}
newlist(a == "b", c != "d")
# $a
# [1] "b"
# $c.ne
# [1] "d"
Here we look at the expressions you are passing to the function and make sure they are == or != and can build the appropriate list.
But if you really wanted to use your function, you could do
`%!=%` = function(query, value) {
  field = paste0(deparse(substitute(query)), ".ne")
  setNames(list(value), field)
}
a %!=% 5
# $a.ne
# [1] 5
Note that this will return a list directly. You can combine them with c
c(a %!=% 5, b %!=% "fred")
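As for the EDIT's default-parameter question, here is one untested sketch. Since the list-returning %!=% above already produces a one-element named list, you can collect everything through ..., flatten, and merge into the defaults with modifyList(). Note the is.list() test is a crude way to spot %!=% results, so this breaks if a regular argument value is itself a list:

```r
`%!=%` = function(query, value) {
  field = paste0(deparse(substitute(query)), ".ne")
  setNames(list(value), field)
}

newlist <- function(...) {
  defaults <- list(param1 = "default1", param2 = "default2")
  dots <- list(...)
  if (length(dots) == 0) return(defaults)
  # Flatten: %!=% results are already one-element named lists,
  # plain named arguments get wrapped into one-element lists
  args <- do.call(c, lapply(seq_along(dots), function(i) {
    if (is.list(dots[[i]])) dots[[i]] else setNames(dots[i], names(dots)[i])
  }))
  modifyList(defaults, args)
}

newlist(param2 = "non-default", param3 %!=% "stuff")
# $param1
# [1] "default1"
# $param2
# [1] "non-default"
# $param3.ne
# [1] "stuff"
```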
The task:
I wanted to scrape all the YouTube comments from a given video.
I successfully adapted the R code from a previous question (Scraping Youtube comments in R).
Here is the code:
library(RCurl)
library(XML)
x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
html = getURL(x)
doc = htmlParse(html, asText=TRUE)
txt = xpathSApply(doc,
                  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
                  xmlValue)
To use it, simply replace the video ID (i.e. "4H9pTgQY_mo") with the ID you require.
The problem:
The problem is that it doesn't return all the comments. In fact, it always returns a vector with 283 elements, regardless of how many comments are in the video.
Can anyone please shed light on what is going wrong here? It is incredibly frustrating. Thank you.
I was (for the most part) able to accomplish this by using the latest version of the Youtube Data API and the R package httr. The basic approach I took was to send multiple GET requests to the appropriate URL and grab the data in batches of 100 (the maximum the API allows) - i.e.
base_url <- "https://www.googleapis.com/youtube/v3/commentThreads/"
api_opts <- list(
  part = "snippet",
  maxResults = 100,
  textFormat = "plainText",
  videoId = "4H9pTgQY_mo",
  key = "my_google_developer_api_key",
  fields = "items,nextPageToken",
  orderBy = "published")
where key is your actual Google Developer key, of course.
The initial batch is retrieved like this:
init_results <- httr::content(httr::GET(base_url, query = api_opts))
##
R> names(init_results)
#[1] "nextPageToken" "items"
R> init_results$nextPageToken
#[1] "Cg0Q-YjT3bmSxQIgACgBEhQIABDI3ZWQkbzEAhjVneqH75u4AhgCIGQ="
R> class(init_results)
#[1] "list"
The second element - items - is the actual result set from the first batch: it's a list of length 100, since we specified maxResults = 100 in the GET request. The first element - nextPageToken - is what we use to make sure each request returns the appropriate sequence of results. For example, we can get the next 100 results like this:
api_opts$pageToken <- gsub("\\=","",init_results$nextPageToken)
next_results <- httr::content(
httr::GET(base_url, query = api_opts))
##
R> next_results$nextPageToken
#[1] "ChYQ-YjT3bmSxQIYyN2VkJG8xAIgACgCEhQIABDI3ZWQkbzEAhiSsMv-ivu0AhgCIMgB"
where the current request's pageToken is returned as the previous request's nextPageToken, and we are given a new nextPageToken for obtaining our next batch of results.
This is pretty straightforward, but it would obviously be very tedious to have to keep changing the value of nextPageToken by hand after each request we send. Instead, I thought this would be a good use case for a simple reference class (via setRefClass):
yt_scraper <- setRefClass(
  "yt_scraper",
  fields = list(
    base_url = "character",
    api_opts = "list",
    nextPageToken = "character",
    data = "list",
    unique_count = "numeric",
    done = "logical",
    core_df = "data.frame"),
  methods = list(
    scrape = function() {
      opts <- api_opts
      if (nextPageToken != "") {
        opts$pageToken <- nextPageToken
      }
      res <- httr::content(
        httr::GET(base_url, query = opts))
      nextPageToken <<- gsub("\\=", "", res$nextPageToken)
      data <<- c(data, res$items)
      unique_count <<- length(unique(data))
    },
    scrape_all = function() {
      while (TRUE) {
        old_count <- unique_count
        scrape()
        if (unique_count == old_count) {
          done <<- TRUE
          nextPageToken <<- ""
          data <<- unique(data)
          break
        }
      }
    },
    initialize = function() {
      base_url <<- "https://www.googleapis.com/youtube/v3/commentThreads/"
      api_opts <<- list(
        part = "snippet",
        maxResults = 100,
        textFormat = "plainText",
        videoId = "4H9pTgQY_mo",
        key = "my_google_developer_api_key",
        fields = "items,nextPageToken",
        orderBy = "published")
      nextPageToken <<- ""
      data <<- list()
      unique_count <<- 0
      done <<- FALSE
      core_df <<- data.frame()
    },
    reset = function() {
      data <<- list()
      nextPageToken <<- ""
      unique_count <<- 0
      done <<- FALSE
      core_df <<- data.frame()
    },
    cache_core_data = function() {
      if (nrow(core_df) < unique_count) {
        sub_data <- lapply(data, function(x) {
          data.frame(
            Comment = x$snippet$topLevelComment$snippet$textDisplay,
            User = x$snippet$topLevelComment$snippet$authorDisplayName,
            ReplyCount = x$snippet$totalReplyCount,
            LikeCount = x$snippet$topLevelComment$snippet$likeCount,
            PublishTime = x$snippet$topLevelComment$snippet$publishedAt,
            CommentId = x$snippet$topLevelComment$id,
            stringsAsFactors = FALSE)
        })
        core_df <<- do.call("rbind", sub_data)
      } else {
        message("\n`core_df` is already up to date.\n")
      }
    }
  )
)
which can be used like this:
rObj <- yt_scraper()
##
R> rObj$data
#list()
R> rObj$unique_count
#[1] 0
##
rObj$scrape_all()
##
R> rObj$unique_count
#[1] 1673
R> length(rObj$data)
#[1] 1673
R> ##
R> head(rObj$core_df)
Comment User ReplyCount LikeCount PublishTime
1 That Andorra player was really Ruud..<U+feff> Cistrolat 0 6 2015-03-22T14:07:31.213Z
2 This just in; Karma is a bitch.<U+feff> Swagdalf The Obey 0 1 2015-03-21T20:00:26.044Z
3 Legend! Haha B)<U+feff> martyn baltussen 0 1 2015-01-26T15:33:00.311Z
4 When did Van der sar ran up? He must have run real fast!<U+feff> Witsakorn Poomjan 0 0 2015-01-04T03:33:36.157Z
5 <U+003c>b<U+003e>LOL<U+003c>/b<U+003e> F Hanif 5 19 2014-12-30T13:46:44.028Z
6 Fucking Legend.<U+feff> Heisenberg 0 12 2014-12-27T11:59:39.845Z
CommentId
1 z123ybioxyqojdgka231tn5zbl20tdcvn
2 z13hilaiftvus1cc1233trvrwzfjg1enm
3 z13fidjhbsvih5hok04cfrkrnla2htjpxfk
4 z12js3zpvm2hipgtf23oytbxqkyhcro12
5 z12egtfq5ojifdapz04ceffqfrregdnrrbk
6 z12fth0gemnwdtlnj22zg3vymlrogthwd04
As I alluded to earlier, this gets you almost everything - 1673 out of about 1790 total comments. For some reason, it does not seem to catch users' nested replies, and I'm not quite sure how to specify this within the API framework.
I had previously set up a Google Developer account a while back for using the Google Analytics API, but if you haven't done that yet, it should be pretty straightforward. Here's an overview - you shouldn't need to set up OAuth or anything like that, just make a project and create a new Public API access key.
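Regarding the missing nested replies: in the v3 API they are served by the separate comments endpoint rather than commentThreads, fetched per parent comment. A rough, untested sketch (the CommentId values in core_df would serve as the parentId):

```r
# Sketch: fetch the replies to one top-level comment via the v3 `comments`
# endpoint; `parent_id` is a top-level comment id from a commentThreads item
get_replies <- function(parent_id, api_key) {
  httr::content(httr::GET(
    "https://www.googleapis.com/youtube/v3/comments/",
    query = list(part = "snippet",
                 parentId = parent_id,
                 textFormat = "plainText",
                 maxResults = 100,
                 key = api_key)))
}
```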
An alternative to the XML package is the rvest package. Using the URL that you've provided, scraping comments would look like this:
library(rvest)
x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
x %>%
html %>%
html_nodes("content") %>%
html_text
Which returns a character vector of the comments:
[1] "That Andorra player was really Ruud.."
[2] "This just in; Karma is a bitch."
[3] "Legend! Haha B)"
[4] "When did Van der sar ran up? He must have run real fast!"
[5] "What a beast Ruud was!"
...
More information on rvest can be found here.
Your issue lies with the maximum number of results per request.
Solution Algorithm
First you need to call the URL https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo?v=2. This URL contains the video's comment count; extract that number and use it to iterate.
<gd:comments><gd:feedLink ..... countHint='1797'/></gd:comments>
After that, iterate through the comments URL with these two parameters: https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?max-results=50&start-index=1
While iterating, you need to change start-index to 1, 51, 101, 151, and so on. I tested max-results; it is limited to 50.
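The steps above could be sketched like this. Untested: the v2 feed this relies on has long since been retired, and the countHint extraction and the element names in the XPath are assumptions about the old feed's structure:

```r
library(RCurl)
library(XML)

video_id <- "4H9pTgQY_mo"

# 1. Read the comment count from the video feed's <gd:feedLink countHint=...>
info <- xmlParse(getURL(sprintf(
  "https://gdata.youtube.com/feeds/api/videos/%s?v=2", video_id)))
count <- as.integer(xpathSApply(
  info, "//*[local-name()='feedLink']/@countHint")[1])

# 2. Page through the comments feed, 50 entries at a time
comments <- character(0)
for (start in seq(1, count, by = 50)) {
  url <- sprintf(
    "https://gdata.youtube.com/feeds/api/videos/%s/comments?max-results=50&start-index=%d",
    video_id, start)
  doc <- xmlParse(getURL(url))
  comments <- c(comments, xpathSApply(
    doc, "//*[local-name()='entry']/*[local-name()='content']", xmlValue))
}
```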
I tried different videos with the "tuber" package in R, and here are my results.
If an author has only replies (no comment of their own about the video), the behavior depends on the number of replies: if the author has no more than 5 replies, none of them are scraped, but with more than 5 replies, some comments are scraped.
And if an author has both their own comments and replies, more comments are scraped than in the replies-only case above.