YouTube comment scraper returns limited results

The task:
I wanted to scrape all the YouTube comments from a given video.
I successfully adapted the R code from a previous question (Scraping Youtube comments in R).
Here is the code:
library(RCurl)
library(XML)
x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
html <- getURL(x)
doc <- htmlParse(html, asText = TRUE)
txt <- xpathSApply(doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
  xmlValue)
To use it, simply replace the video ID (i.e. "4H9pTgQY_mo") with the ID you require.
The problem:
The problem is that it doesn't return all the comments. In fact, it always returns a vector with 283 elements, regardless of how many comments are in the video.
Can anyone please shed light on what is going wrong here? It is incredibly frustrating. Thank you.

I was (for the most part) able to accomplish this by using the current version (v3) of the YouTube Data API and the R package httr. The basic approach I took was to send multiple GET requests to the appropriate URL and grab the data in batches of 100 (the maximum the API allows), i.e.
base_url <- "https://www.googleapis.com/youtube/v3/commentThreads/"
api_opts <- list(
  part = "snippet",
  maxResults = 100,
  textFormat = "plainText",
  videoId = "4H9pTgQY_mo",
  key = "my_google_developer_api_key",
  fields = "items,nextPageToken",
  orderBy = "published")
where key is your actual Google Developer key, of course.
The initial batch is retrieved like this:
init_results <- httr::content(httr::GET(base_url, query = api_opts))
##
R> names(init_results)
#[1] "nextPageToken" "items"
R> init_results$nextPageToken
#[1] "Cg0Q-YjT3bmSxQIgACgBEhQIABDI3ZWQkbzEAhjVneqH75u4AhgCIGQ="
R> class(init_results)
#[1] "list"
The second element - items - is the actual result set from the first batch: it's a list of length 100, since we specified maxResults = 100 in the GET request. The first element - nextPageToken - is what we use to make sure each request returns the appropriate sequence of results. For example, we can get the next 100 results like this:
api_opts$pageToken <- gsub("\\=","",init_results$nextPageToken)
next_results <- httr::content(
httr::GET(base_url, query = api_opts))
##
R> next_results$nextPageToken
#[1] "ChYQ-YjT3bmSxQIYyN2VkJG8xAIgACgCEhQIABDI3ZWQkbzEAhiSsMv-ivu0AhgCIMgB"
where the current request's pageToken is the nextPageToken returned by the previous request, and the response supplies a new nextPageToken for obtaining our next batch of results.
This is pretty straightforward, but it would obviously be very tedious to keep changing the value of nextPageToken by hand after each request we send. Instead, I thought this would be a good use case for a simple Reference Class (via setRefClass()):
yt_scraper <- setRefClass(
  "yt_scraper",
  fields = list(
    base_url = "character",
    api_opts = "list",
    nextPageToken = "character",
    data = "list",
    unique_count = "numeric",
    done = "logical",
    core_df = "data.frame"),
  methods = list(
    scrape = function() {
      opts <- api_opts
      if (nextPageToken != "") {
        opts$pageToken <- nextPageToken
      }
      res <- httr::content(
        httr::GET(base_url, query = opts))
      nextPageToken <<- gsub("\\=", "", res$nextPageToken)
      data <<- c(data, res$items)
      unique_count <<- length(unique(data))
    },
    scrape_all = function() {
      while (TRUE) {
        old_count <- unique_count
        scrape()
        if (unique_count == old_count) {
          done <<- TRUE
          nextPageToken <<- ""
          data <<- unique(data)
          cache_core_data()  # populate core_df once scraping has finished
          break
        }
      }
    },
    initialize = function() {
      base_url <<- "https://www.googleapis.com/youtube/v3/commentThreads/"
      api_opts <<- list(
        part = "snippet",
        maxResults = 100,
        textFormat = "plainText",
        videoId = "4H9pTgQY_mo",
        key = "my_google_developer_api_key",
        fields = "items,nextPageToken",
        orderBy = "published")
      nextPageToken <<- ""
      data <<- list()
      unique_count <<- 0
      done <<- FALSE
      core_df <<- data.frame()
    },
    reset = function() {
      data <<- list()
      nextPageToken <<- ""
      unique_count <<- 0
      done <<- FALSE
      core_df <<- data.frame()
    },
    cache_core_data = function() {
      if (nrow(core_df) < unique_count) {
        sub_data <- lapply(data, function(x) {
          data.frame(
            Comment = x$snippet$topLevelComment$snippet$textDisplay,
            User = x$snippet$topLevelComment$snippet$authorDisplayName,
            ReplyCount = x$snippet$totalReplyCount,
            LikeCount = x$snippet$topLevelComment$snippet$likeCount,
            PublishTime = x$snippet$topLevelComment$snippet$publishedAt,
            CommentId = x$snippet$topLevelComment$id,
            stringsAsFactors = FALSE)
        })
        core_df <<- do.call("rbind", sub_data)
      } else {
        message("\n`core_df` is already up to date.\n")
      }
    }
  )
)
which can be used like this:
rObj <- yt_scraper()
##
R> rObj$data
#list()
R> rObj$unique_count
#[1] 0
##
rObj$scrape_all()
##
R> rObj$unique_count
#[1] 1673
R> length(rObj$data)
#[1] 1673
R> ##
R> head(rObj$core_df)
Comment User ReplyCount LikeCount PublishTime
1 That Andorra player was really Ruud..<U+feff> Cistrolat 0 6 2015-03-22T14:07:31.213Z
2 This just in; Karma is a bitch.<U+feff> Swagdalf The Obey 0 1 2015-03-21T20:00:26.044Z
3 Legend! Haha B)<U+feff> martyn baltussen 0 1 2015-01-26T15:33:00.311Z
4 When did Van der sar ran up? He must have run real fast!<U+feff> Witsakorn Poomjan 0 0 2015-01-04T03:33:36.157Z
5 <U+003c>b<U+003e>LOL<U+003c>/b<U+003e> F Hanif 5 19 2014-12-30T13:46:44.028Z
6 Fucking Legend.<U+feff> Heisenberg 0 12 2014-12-27T11:59:39.845Z
CommentId
1 z123ybioxyqojdgka231tn5zbl20tdcvn
2 z13hilaiftvus1cc1233trvrwzfjg1enm
3 z13fidjhbsvih5hok04cfrkrnla2htjpxfk
4 z12js3zpvm2hipgtf23oytbxqkyhcro12
5 z12egtfq5ojifdapz04ceffqfrregdnrrbk
6 z12fth0gemnwdtlnj22zg3vymlrogthwd04
As I alluded to earlier, this gets you almost everything - 1673 out of about 1790 total comments. For some reason, it does not seem to catch users' nested replies, and I'm not quite sure how to specify this within the API framework.
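If the replies matter, the v3 API also exposes a separate comments endpoint that takes a parentId; a minimal, untested sketch (reusing the same key and the CommentId column collected above, with parameter names taken from the API reference rather than verified here) might look like this:
# Sketch only: fetch the replies to one top-level comment via the
# youtube/v3 "comments" endpoint (parentId = the top-level comment's ID)
replies_url <- "https://www.googleapis.com/youtube/v3/comments/"
reply_opts <- list(
  part = "snippet",
  maxResults = 100,
  textFormat = "plainText",
  parentId = rObj$core_df$CommentId[1],
  key = "my_google_developer_api_key")
reply_res <- httr::content(httr::GET(replies_url, query = reply_opts))
sapply(reply_res$items, function(x) x$snippet$textDisplay)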
I had set up a Google Developer account a while back for the Google Analytics API, but if you haven't done that yet, it should be pretty straightforward. Here's an overview: you shouldn't need to set up OAuth or anything like that, just make a project and create a new Public API access key.

An alternative to the XML package is the rvest package. Using the URL that you've provided, scraping comments would look like this:
library(rvest)
x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
x %>%
  read_html() %>%   # html() in older rvest versions
  html_nodes("content") %>%
  html_text()
Which returns a character vector of the comments:
[1] "That Andorra player was really Ruud.."
[2] "This just in; Karma is a bitch."
[3] "Legend! Haha B)"
[4] "When did Van der sar ran up? He must have run real fast!"
[5] "What a beast Ruud was!"
...
More information on rvest can be found here.

Your issue lies with the cap on results per request.
Solution Algorithm
First you need to call the URL https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo?v=2 . This URL contains the video's comment count; extract that number and use it to decide how many pages to request:
<gd:comments><gd:feedLink ..... countHint='1797'/></gd:comments>
After that, iterate through the comments feed with these 2 parameters: https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?max-results=50&start-index=1
While iterating, change start-index to 1, 51, 101, 151, and so on. I did test max-results: it is limited to 50.
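A rough sketch of that loop with RCurl/XML follows; it is untested, and the gdata v2 feeds it relies on have long since been retired, so treat it as illustrative only:
library(RCurl)
library(XML)

# Read the total comment count (the countHint attribute) from the video feed
video_feed <- xmlParse(getURL("https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo?v=2"), asText = TRUE)
n_comments <- as.numeric(
  xpathSApply(video_feed, "//*[local-name()='feedLink'][@countHint]",
              xmlGetAttr, "countHint")[1])

# Page through the comments feed 50 at a time, shifting start-index each pass
base <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?max-results=50&start-index="
comments <- character(0)
for (start in seq(1, n_comments, by = 50)) {
  page <- htmlParse(getURL(paste0(base, start)), asText = TRUE)
  comments <- c(comments, xpathSApply(page, "//content", xmlValue))
}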

I tried different videos with the "tuber" package in R; here are my results.
If an author has only replies (no top-level comment on the video), the behaviour depends on the number of replies: with five or fewer replies none of their comments are scraped, but with more than five replies some of them are scraped.
And if an author has both top-level comments and replies, more of their comments are scraped than in the replies-only case described above.
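For reference, a minimal tuber call might look like the following. It assumes you have registered a Google API app and run yt_oauth() with your own credentials, and the exact column names in the result can vary between tuber versions:
library(tuber)

# One-time OAuth setup with your own Google API app credentials
yt_oauth("your_app_id", "your_app_secret")

# Fetch all comment threads (plus the replies the API returns) for a video
comments <- get_all_comments(video_id = "4H9pTgQY_mo")
head(comments$textOriginal)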

Related

How to see if ... ellipses in R contains a certain argument?

I'm writing a wrapper for the YouTube Analytics API, and have created a function as follows:
yt_request <- function(dimensions = NULL, metrics = NULL, sort = NULL,
                       maxResults = NULL, filtr = NULL, startDate = Sys.Date() - 30,
                       endDate = Sys.Date(), token) {
  url <- paste0("https://youtubeanalytics.googleapis.com/v2/reports?",
                "&ids=channel%3D%3DMINE",
                "&startDate=", startDate,
                "&endDate=", endDate)
  if(!is.null(dimensions)) url <- paste0(url, "&dimensions=", dimensions)
  if(!is.null(metrics)) url <- paste0(url, "&metrics=", metrics)
  if(!is.null(sort)) url <- paste0(url, "&sort=", sort)
  if(!is.null(maxResults)) url <- paste0(url, "&maxResults=", maxResults)
  if(!is.null(filtr)) url <- paste0(url, "&filters=", filtr)
  r <- GET(url, token)
  return(r)
}
This is meant to be flexible rather than especially user friendly, because I want to build friendlier wrapper functions around yt_request(). For example:
top_videos <- function(...) {
  dim <- "video"
  met <- "views,averageViewDuration"
  maxRes <- 10
  temp <- yt_request(dimensions = dim, metrics = met, maxResults = maxRes, token = myToken)
  return(temp)
}
So far this works fine and dandy, but I also want potential users to have a little flexibility with the results. For example, if they want maxResults = 20 instead of 10, or different metrics than the ones I specify, I want them to be able to pass their own arguments in the ... of top_videos(...).
How can I do a check if someone passes an argument in the ellipsis? If they pass a metric, I want it to override the default I specify, otherwise, go with the default.
EDIT
To help clarify: I'm hoping that when the user calls the function, they could just write something like top_videos(maxResults = 20), the function would ignore the line maxRes <- 10, and the call to yt_request() would receive maxResults = 20 instead of 10.
We can capture the ... in a list and treat its elements as key/value pairs, then extract elements by name. If a particular named element was not passed, subsetting returns NULL. We take advantage of this behaviour by concatenating with the default value (e.g. 10 for maxRes) and selecting the first element ([1]): if the user's value is NULL, the default is selected, otherwise the user's value wins. Likewise, do this for every option the OP wants to be able to override.
top_videos <- function(...) {
  lst1 <- list(...)
  dim <- c(lst1[["dimensions"]], "video")[1]
  met <- c(lst1[["metrics"]], "views,averageViewDuration")[1]
  maxRes <- c(lst1[['maxResults']], 10)[1]
  # temp <- yt_request(dimensions = dim,
  #                    metrics = met, maxResults = maxRes, token = myToken)
  # temp
  maxRes
}
Testing:
top_videos(maxResults = 20)
#[1] 20
top_videos(hello = 5)
#[1] 10
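An alternative sketch of the same idea (mine, not from the answer above) keeps the defaults in a named list and lets modifyList() overwrite whichever ones appear in ...:
top_videos <- function(...) {
  defaults <- list(dimensions = "video",
                   metrics = "views,averageViewDuration",
                   maxResults = 10)
  opts <- modifyList(defaults, list(...))  # user-supplied values win
  # temp <- yt_request(dimensions = opts$dimensions, metrics = opts$metrics,
  #                    maxResults = opts$maxResults, token = myToken)
  # temp
  opts$maxResults
}
top_videos(maxResults = 20)
#[1] 20
top_videos(hello = 5)
#[1] 10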

Is it possible to add external arguments to form partial field names?

I have two fields:
FirstVisit
SecondVisit
I am building a function to pull data from either field depending on user input (heavily reduced yet relevant version of function):
pullData <- function(visit){
  # Do something
}
What I am looking to do is for the function to take the user's input and use it to form part of the call to the data frame field.
For example, when the user runs:
pullData(First)
The function will run like this:
print(df$FirstVisit)
Conversely, when the user runs:
pullData(Second)
The function will run:
print(df$SecondVisit)
My function is considerably more complex than this, but this basic example relates to just the specific aspect of it that I am trying to work out.
So far I have tried something like:
print(paste0(df["df$", visit, "Visit", ])
# The intention is to result in df$FirstVisit or df$SecondVisit depending on the input
And this:
print(paste0(df[df$", visit, "Visit, ])
# Again, intended result should be df$FirstVisit or df$SecondVisit, depending on the input
among other alternatives (some with paste()), yet nothing has worked so far.
I suspect that it is possible and feel that I am close.
How can I achieve this?
If you really want to run the function like pullData(First), you need to use metaprogramming (to get the name of the argument instead of the argument's value), like this:
pullData <- function(...) {
  arg <- rlang::ensyms(...)
  if(length(arg) != 1) stop("invalid argument in pullData")
  dataName <- paste0(as.character(arg[[1]]), "Visit")
  print(df[[dataName]])
}
If you can manage to call the function with a character-argument like pullData("First"), you can simply do:
pullData <- function(choice = "First") {
  dataName <- paste0(choice, "Visit")
  print(df[[dataName]])
}
I am not quite sure if this is what you're going for, but here's a possible solution:
pullData <- function(visit){
  visit <- rlang::quo_text(rlang::enquo(visit))
  visit <- tolower(visit)
  if (visit %in% c("first", "firstvisit")){
    data <- df$FirstVisit
  }
  if (visit %in% c("second", "secondvisit")){
    data <- df$SecondVisit
  }
  data
}
Using this sample data:
df <- data.frame(FirstVisit = c("first value"),
SecondVisit = c("second value"))
Gets us:
> pullData(first)
[1] "first value"
> pullData(second)
[1] "second value"
For the sake of completeness, R allows for partial matching when subsetting with character indices; see help("$").
df <- data.frame(FirstVisit = 11:12, SecondVisit = 21:22)
For interactive use:
df$F
[1] 11 12
df$S
[1] 21 22
For programming on computed indices, the [[ operator has to be used, e.g.,
df[["F", exact = FALSE]]
[1] 11 12
This can be wrapped in a function call:
pullData <- function(x) df[[x, exact = FALSE]]
Thus,
pullData("F")
pullData("Fi")
pullData("First")
pullData("FirstVisit")
return all
[1] 11 12
while
pullData("S")
pullData("Second")
return
[1] 21 22
But watch out when dealing with user-supplied input, as typos might lead to unexpected results:
pullData("f")
pullData("first")
pullData("Frist")
NULL
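If you want to guard against such typos, one option (a sketch of mine, not part of the answers above) is to validate the input with match.arg(), which still allows partial matching against the allowed choices but fails loudly on anything else:
pullData <- function(visit = c("First", "Second")) {
  visit <- match.arg(visit)          # errors on "f", "Frist", ...
  df[[paste0(visit, "Visit")]]
}
pullData("First")
#[1] 11 12
pullData("Fi")    # partial matching against the choices still works
#[1] 11 12
pullData("Frist")
#Error: 'arg' should be one of "First", "Second"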

How to properly close connections so I won't get "Error in file(con, "r") : all connections are in use" when using "readLines" and "tryCatch"

I have a list of URLs (more than 4000) from a specific domain (pixilink.com), and what I want to do is figure out whether each page provides a picture or a video. To do this, I used the solutions provided here: How to write trycatch in R and Check whether a website provides photo or video based on a pattern in its URL, and wrote the code shown below:
#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  mycontent <- readLines(x)
  mypos <- grep("initial_mode = ", mycontent)
  if(grepl("0", mycontent[mypos])){
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    return("video")
  } else{
    return(NA)
  }
}
Also, in order to prevent errors for URLs that don't exist, I used the code below:
readUrl <- function(url) {
  out <- tryCatch(
    {
      readLines(con = url, warn = FALSE)
      return(1)
    },
    error = function(cond) {
      return(NA)
    },
    warning = function(cond) {
      return(NA)
    },
    finally = {
      message(url)
    }
  )
  return(out)
}
Finally, I split the list of URLs and passed it into the functions described above (here, for instance, I used the first 1000 URLs):
a <- subset(new_df, new_df$host == "www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000]              # only chose first 1000 rows

tt <- numeric(length(vec))      # checking validity of url
for (i in 1:length(vec)){
  tt[i] <- readUrl(vec[i])
  print(i)
}

g <- data.frame(vec, tt)
g2 <- g[which(!is.na(g$tt)), ]  # only valid url

dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
  dd[j] <- urlmode(g2[j, 1])
}
Final <- cbind(g2, dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))
I ran this code on a sample list of 100 URLs and it worked; however, after I ran it on the whole list of URLs, it returned an error. Here is the error: Error in textConnection("rval", "w", local = TRUE) : all connections are in use
After this, even for the sample of 100 URLs that I had tested before, running the code gave this error message: Error in file(con, "r") : all connections are in use
I also tried closeAllConnections() after each function call in the loop, but it didn't work.
Can anyone explain what this error is about? Is it related to the number of requests the website allows? What's the solution?
My guess as to why this is happening is that you're not closing the connections you open via tryCatch() and via urlmode() through the use of readLines(). I was unsure of how urlmode() was going to be used in your previous post, so I had made it as simple as I could (and in hindsight, that was badly done, my apologies). So I took the liberty of rewriting urlmode() to try and make it a little more robust for what appears to be a more expansive task at hand.
I think the comments in the code should help, so take a look below:
#Updated URL mode function with better
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
  # Check if URL is good to go
  if(!httr::http_error(x)){
    # Test cases
    # x <- "www.pixilink.com/3"
    # x <- "https://www.pixilink.com/93320"
    # x <- "https://www.pixilink.com/93313"

    # Then, since there are redirect shenanigans, get the actual URL the
    # input points to. It should just be the input URL if there is no
    # redirection. This is important as it also takes care of checking
    # whether http or https need to be prefixed in case the input URL is
    # supplied without those (this can cause problems for url() below).
    myx <- httr::HEAD(x)$url
    # Then check for what the default mode is
    mycon <- url(myx)
    open(mycon, "r")
    mycontent <- readLines(mycon)
    mypos <- grep("initial_mode = ", mycontent)
    # Close the connection since it's no longer necessary
    close(mycon)
    # Some URLs with weird formats can return empty on this one since they
    # don't follow the expected format.
    # See for example: "https://www.pixilink.com/clients/899/#3"
    # which is actually redirected from "https://www.pixilink.com/3"

    # After that, evaluate what's at mypos, and always return the actual
    # URL along with the result
    if(!purrr::is_empty(mypos)){
      # mystr <- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
      mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
      return(c(myx, mystr))
      # return(mystr)
      # So once all that is done, check if the line at mypos
      # contains a 0 (picture) or tour (video):
      # if(grepl("0", mycontent[mypos])){
      #   return(c(myx, "picture"))
      #   # return("picture")
      # } else if(grepl("tour", mycontent[mypos])){
      #   return(c(myx, "video"))
      #   # return("video")
      # }
    } else{
      # Valid URL but not interpretable
      return(c(myx, "uninterpretable"))
      # return("uninterpretable")
    }
  } else{
    # Straight up invalid URL
    # No myx variable to return here, just x
    return(c(x, "invalid"))
    # return("invalid")
  }
}
#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)
#All future + progressr related stuff
#learned courtesy
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)
#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))
#Website's base URL
baseurl <- "https://www.pixilink.com"
#Using future_lapply() to recursively apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar
#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
  myprog <- progressr::progressor(along = range)
  sitetype <- do.call(rbind, future_lapply(range, function(b, x){
    myprog() ## Progress bar signaller
    myurl <- paste0(b, "/", x)
    cat("\n", myurl, " ")
    myret <- urlmode(myurl)
    cat(myret, "\n")
    return(c(myurl, myret))
  }, b = baseurl, future.chunk.size = 10))
})
#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")
#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")
head(sitetype)
# given_url actual_url mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310 invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311 invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313 picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315 tour
unique(sitetype$mode)
# [1] "invalid" "floorplan2d" "picture" "tour"
#--------
Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity and URL redirection, and also "intelligently" extracts the value assigned to initial_mode. With the help of future_lapply() (from the future.apply package) and the progress bar from the progressr package, this can now be applied quite conveniently in parallel to as many pixilink.com/<integer> URLs as desired. With a bit of wrangling thereafter, the results can be presented very tidily as a data.frame, as shown.
As an example, I've demonstrated this for a small range in the code above. Note the commented-out 1:10000 range in this context: I let this code run for a couple of hours over this (hopefully sufficiently) large range of URLs to check for errors and problems. I can attest that I encountered no errors (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I don't fancy uploading that to pastebin or elsewhere unnecessarily). Due to an oversight on my part, I forgot to benchmark that run, but I suppose I could do that later if performance metrics are desired or would be considered interesting.
For your purposes, I believe you can simply take the entire code snippet above and run it verbatim (or with modifications) by just changing the range assignment right before the with_progress(do.call(...)) step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch() messes to deal with).
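One small addition of my own: when you are finished, it is worth shutting the workers down explicitly so their connections are released as well:
# Release the parallel workers (and their connections) once everything is done
future::plan(future::sequential)
parallel::stopCluster(clust)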

How to create a data frame with Rblpapi subscribe function

I'm sorry this example won't be reproducible by those who aren't Bloomberg users.
For the others, I'm using Rblpapi and its subscribe function. I would like to create something like a data frame, a matrix or an array and fill it with values that are streamed by the subscription.
Assuming your BBComm component is up and running, here is my example:
require(Rblpapi)
con <- blpConnect()
securities <- c('SX5E 07/20/18 C3400 Index',
'SX5E 07/20/18 C3450 Index',
'SX5E 07/20/18 C3500 Index')
I would like to fill a 3 x 2 matrix with these fields:
fields <- c('BID', 'ASK')
I guess I can create a matrix like this with almost no performance overhead:
mat <- matrix(data = NA,
nrow = 3,
ncol = 2)
Now I use subscribe and its fun argument to do the filling, with something like this (albeit ugly and likely inefficient):
i <- 1
subscribe(securities = securities,
          fields = fields,
          fun = function(x){
            if (i > length(securities))
              i <<- 1
            tryCatch(
              expr = {
                mat[i, 1] <<- x$data$BID
                mat[i, 2] <<- x$data$ASK
                i <<- i + 1
              },
              error = function(e){
                message(e)
              },
              finally = {}
            )
          })
Result:
Error in subscribe_Impl(con, securities, fields, fun, options, identity) :
Evaluation error: number of items to replace is not a multiple of replacement length.
Of course, this doesn't work because I don't really know how to index streamed data. The $ operator seems fine for retrieving data points by name, as I did with BID and ASK, but I cannot find a way to figure out which security a given value refers to, say securities[1] or securities[2]. It seems that I get a stream of numeric values that are indistinguishable from one another because I cannot tell which of the securities a value belongs to.
Using an index on x$data$BID[1] throws the same error.
OK, your code looks fine; the only thing that does not work is x$data$BID. Change it to x$data["BID"] and then you can store it. I worked with your code and this is my result.
fields <- c("TIME", "LAST_PRICE", "BID", "ASK")
blpConnect()
i <- 1
subscribe(securities = securities,
          fields = fields, "interval=60",
          fun = function(x){
            if (i > length(securities))
              i <<- 1
            tryCatch(
              expr = {
                tim <- x$data["TIME"]
                last <<- x$data["LAST_PRICE"]
                ask <<- x$data["ASK"]
                bid <<- x$data["BID"]
                i <<- i + 1
              },
              error = function(e){
                message(e)
              },
              finally = {}
            )
            print(cbind(tim$TIME, last$LAST_PRICE, ask$ASK, bid$BID))
          })
Result: (screenshot of the streamed output omitted)
A good way to take a look at the result object from the subscribe function is:
subscribe(securities=c("AAPL US Equity"),
fields=c("LAST_PRICE"),
fun=function(x) print(str(x)))
From there you can work your way into the data:
subscribe(securities = c("AAPL US Equity", "INTC US Equity"),
          fields = c("LAST_PRICE", "BID", "ASK"),
          fun = function(x) {
            if (!is.null(x$data$MKTDATA_EVENT_TYPE) &&
                x$data$MKTDATA_EVENT_TYPE == "TRADE" &&
                exists("LAST_PRICE", where = x$data)) {
              print(data.frame(Ticker = x$topic,
                               DateTime = x$data$TRADE_UPDATE_STAMP_RT,
                               Trade = x$data$LAST_PRICE))
            }
          })
I only printed the data.frame here. The data can be processed or stored directly using the FUN argument of subscribe.
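To address the original problem of knowing which security a tick belongs to: x$topic carries the security string (as used above), so one possible sketch, untested without a live Bloomberg session, is to index the matrix by topic instead of a rolling counter:
# Sketch: key the matrix on x$topic, which is assumed to match the
# strings in `securities`; guards avoid the replacement-length error
mat <- matrix(NA_real_, nrow = length(securities), ncol = 2,
              dimnames = list(securities, c("BID", "ASK")))
subscribe(securities = securities,
          fields = c("BID", "ASK"),
          fun = function(x) {
            sec <- x$topic
            bid <- x$data[["BID"]]
            ask <- x$data[["ASK"]]
            if (length(bid) == 1) mat[sec, "BID"] <<- bid
            if (length(ask) == 1) mat[sec, "ASK"] <<- ask
            print(mat)
          })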

R - How to count the occurrence of a specific string in large text files

I am trying to find the occurrences of ~10,000 different locations in a list of emails.
What I need is one vector with the most frequently mentioned location per email, one with the second most frequent, and one with the third.
Since my dataset is huge, I have performance problems. I tried stringi together with the parallel package, but it still runs very slowly (about 15 min for 20,000 emails and 10,000 locations).
The input data (emails and cities) looks like this:
SearchVector = c('Berlin, 'Amsterdam', San Francisco', 'Los Angeles') ...
g$Message = c('This is the first mail from paris. Berlin is a nice place', 'This is the 2nd mail from San francisco. Beirut is a nice place to stay', 'This is the 3rd mail. Los Angeles is a great place') ...
Here is my code using stringi:
# libraries
library(doParallel)
library(stringi)
detectCores()
registerDoParallel(cores=7)
getDoParWorkers()
# function
getCount <- function(data, keyword) {
  keyword2 <- paste0("^(", keyword, ")|(", keyword, ")$|[ ](", keyword, ")[ ]")
  wcount <- stri_count(data, regex = keyword2)
  return(data.frame(wcount))
}
SearchVector = as.vector(countryList2)
Text = g$Message
cityName1 = character()
cityName2 = character()
result <- foreach(i = Text, .combine = rbind, .inorder = FALSE,
                  .packages = c('stringi'), .errorhandling = c('remove')) %dopar%
{
  cities <- as.data.frame(t(getCount(i, SearchVector)))
  colnames(cities) <- SearchVector
  if (length(cities[which(cities > 0)]) == 1) {
    cityName1 <- names(sort(cities, decreasing = TRUE))[1]
    cityName2 <- NA
  } else if (length(cities[which(cities > 0)]) > 1) {
    cityName1 <- names(sort(cities, decreasing = TRUE))[1]
    cityName2 <- names(sort(cities, decreasing = TRUE))[2]
  } else {
    cityName1 <- NA
    cityName2 <- NA
  }
  return(data.frame(cityName1, cityName2))
}
g$cityName1 = result[, 1]
g$cityName2 = result[, 2]
Any ideas how I can speed this up by, for instance, using an index or something similar?
I really look forward to getting help on this issue.
Many thanks
Clemens
This is a bit too much for a comment, but give this a shot:
library(data.table)
library(stringr)
dt = data.table(Text = g$Message, cleantext = tolower(g$Message))
dt[, place := str_extract_all(cleantext, paste0("(", paste(tolower(SearchVector), collapse = ")|("), ")"))]
Also your SearchVector in the question has some missing quotes.
data.table is usually lightning quick for things like this, but try it on a subset and see if it's acceptably fast.
The place column will look like a bunch of place names separated by commas, but internally it's a list so it's easy to do all sorts of aggregation with that like count places in each text, count how many time each place is mentioned etc.
dt[, n := lapply(place, length)]; dt
nplace = data.table(place = dt[, unlist(place)])[, .N, place]
I also changed all the text to lower case when doing the searching, for good measure (this probably isn't the fastest way to be case insensitive, but it looks the most explicit to me).
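To get the question's most frequent / second / third location per email from that list column, one possible follow-up (my sketch, not part of the answer) is:
# Per-email top three places, taken from the `place` list column
top3 <- lapply(dt$place, function(p) {
  nm <- names(sort(table(p), decreasing = TRUE))
  c(nm, rep(NA_character_, 3))[1:3]
})
dt[, c("cityName1", "cityName2", "cityName3") := data.table::transpose(top3)]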
