gtrendsR recently became slow?

We have a production system calling gtrendsR::gtrends(), which typically returns in less than a second. Now we are getting response times of around one minute. Example:
library(gtrendsR) ## Version 1.4.0
system.time(ret_gt <- gtrendsR::gtrends("skinny jeans", geo = "US",
                                        time = "today+5-y", category = 997))
#    user  system elapsed
#   0.404   0.016  56.897
Am I missing something?
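One way to narrow this down (a diagnostic sketch, not part of the original post) is to time progressively simpler variants of the same call; if even a bare query is slow, the slowdown is most likely on the Google Trends side or in the package's request handling rather than in any particular argument:
library(gtrendsR)

## Diagnostic sketch: time variants of the query to see whether a specific
## argument (category, time window) is responsible for the slowdown.
system.time(gtrends("skinny jeans", geo = "US", time = "today+5-y", category = 997))
system.time(gtrends("skinny jeans", geo = "US", time = "today+5-y"))  # drop the category
system.time(gtrends("skinny jeans", geo = "US", time = "today 3-m"))  # shorter time window
system.time(gtrends("skinny jeans", geo = "US"))                      # bare query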

Related

sparklyr connecting to kafka streams/topics

I'm having difficulty connecting to and retrieving data from a Kafka instance. Using Python's kafka-python module, I can connect (using the same connection parameters), see the topic, and retrieve data, so the network is viable, there is no authentication problem, the topic exists, and data exists in the topic.
On R 4.0.5 using sparklyr 1.7.2, connecting to Kafka 2.8:
library(sparklyr)
spark_installed_versions()
#   spark hadoop                                      dir
# 1 2.4.7    2.7 /home/r2/spark/spark-2.4.7-bin-hadoop2.7
# 2 3.1.1    3.2 /home/r2/spark/spark-3.1.1-bin-hadoop3.2
sc <- spark_connect(master = "local", version = "2.4",
  config = list(
    sparklyr.shell.packages = "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0"
  ))
system.time({
  Z <- stream_read_kafka(
    sc,
    options = list(
      kafka.bootstrap.servers = "11.22.33.44:5555",
      subscribe = "mytopic"))
})
#    user  system elapsed
#   0.080   0.000  10.349
system.time(collect(Z))
#    user  system elapsed
#   1.336   0.136   8.537
Z
# # Source: spark<?> [inf x 7]
# # … with 7 variables: key <lgl>, value <lgl>, topic <chr>, partition <int>, offset <dbl>, timestamp <dbl>, timestampType <int>
My first concern is that I'm not seeing data from the topic: I appear to be getting a frame of (meta)data about topics in general, with nothing found. The topic contains 800 strings (JSON) of modest-to-small size. My second concern is that it takes almost 20 seconds to realize this (though I suspect that's a symptom of the larger connection problem).
For confirmation, this works (kafka-python via reticulate):
library(reticulate)  # import() comes from reticulate
cons <- import("kafka")$KafkaConsumer(bootstrap_servers = "11.22.33.44:5555",
                                      auto_offset_reset = "earliest",
                                      max_partition_fetch_bytes = 10240000L)
cons$subscribe("mytopic")
msg <- cons$poll(timeout_ms = 30000L, max_records = 99999L)
length(msg)
# [1] 1
length(msg[[1]])
# [1] 801
as.character(msg[[1]][[1]]$value)
# [1] "{\"TrackId\":\"c839dcb5-...\",...}"
(Those commands complete almost instantly, nothing like the 8-10 second lag above.)
The Kafka instance I'm connecting to uses ksqlDB, though I don't think that changes the need for the "org.apache.spark:spark-sql-kafka-.." Java package.
(Ultimately I'll be using stateless/stateful procedures on streaming data, including joins and window ops, so I'd like to not have to re-implement that from scratch on the simple kafka connection.)
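Not part of the original post, but two things stand out here: Spark's Kafka source reads only records that arrive after the stream starts unless startingOffsets = "earliest" is set, and it delivers key/value as binary, so the payload has to be cast to a string before it is readable. A hedged sketch of both adjustments, assuming the documented Kafka source options and a plain text payload:
library(sparklyr)
library(dplyr)

## Sketch only: ask the Kafka source to start from the beginning of the topic
## and cast the binary value column to a string so the JSON becomes visible.
Z <- stream_read_kafka(
  sc,
  options = list(
    kafka.bootstrap.servers = "11.22.33.44:5555",
    subscribe = "mytopic",
    startingOffsets = "earliest"))

Z_str <- Z %>% transmute(value = as.character(value))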

How to let a countdown run in R [duplicate]

How do you pause an R script for a specified number of seconds or milliseconds? In many languages there is a sleep function, but ?sleep references a data set, and ?pause and ?wait don't exist.
The intended purpose is for self-timed animations. The desired solution works without asking for user input.
See help(Sys.sleep).
For example, from ?Sys.sleep
testit <- function(x)
{
  p1 <- proc.time()
  Sys.sleep(x)
  proc.time() - p1  # The CPU usage should be negligible
}
testit(3.7)
Yielding
> testit(3.7)
   user  system elapsed
  0.000   0.000   3.704
Sys.sleep() may not work as expected when CPU usage is very high, i.e. when other critical, high-priority processes are running in parallel.
This code worked for me. Here I am printing 1 to 1000 at a 2.5 second interval.
for (i in 1:1000)
{
  print(i)
  date_time <- Sys.time()
  while ((as.numeric(Sys.time()) - as.numeric(date_time)) < 2.5) {}  # dummy busy-wait loop
}
TL;DR: sys_sleep, a new stable and precise sleep function
We already know that Sys.sleep may not work as expected, e.g. when CPU usage is very high.
That is why I decided to prepare a high-quality function powered by microbenchmark::get_nanotime() and a repeat loop.
#' Alternative to Sys.sleep function
#' Expected to be more stable
#' @param val `numeric(1)` value to sleep.
#' @param unit `character(1)` the available units are nanoseconds ("ns"), microseconds ("us"), milliseconds ("ms"), seconds ("s").
#' @note dependency on `microbenchmark` package to reuse `microbenchmark::get_nanotime()`.
#' @examples
#' # sleep 1 second in different units
#' sys_sleep(1, "s")
#' sys_sleep(1000, "ms")
#' sys_sleep(10**6, "us")
#' sys_sleep(10**9, "ns")
#'
#' sys_sleep(4.5)
#'
sys_sleep <- function(val, unit = c("s", "ms", "us", "ns")) {
  start_time <- microbenchmark::get_nanotime()
  stopifnot(is.numeric(val))
  unit <- match.arg(unit, c("s", "ms", "us", "ns"))
  val_ns <- switch(unit,
    "s" = val * 10**9,
    "ms" = val * 10**6,
    "us" = val * 10**3,
    "ns" = val
  )
  repeat {
    current_time <- microbenchmark::get_nanotime()
    diff_time <- current_time - start_time
    if (diff_time > val_ns) break
  }
}
system.time(sys_sleep(1, "s"))
#>    user  system elapsed
#>   1.015   0.014   1.030
system.time(sys_sleep(1000, "ms"))
#>    user  system elapsed
#>   0.995   0.002   1.000
system.time(sys_sleep(10**6, "us"))
#>    user  system elapsed
#>   0.994   0.004   1.000
system.time(sys_sleep(10**9, "ns"))
#>    user  system elapsed
#>   0.992   0.006   1.000
system.time(sys_sleep(4.5))
#>    user  system elapsed
#>   4.490   0.008   4.500
Created on 2022-11-21 with reprex v2.0.2
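To connect this back to the question's self-timed animations, here is a small usage sketch (the plotting call is purely illustrative, not from the original answer):
# Illustrative only: advance an animation frame roughly every 250 ms
for (frame in 1:20) {
  plot(sin(seq(0, 2 * pi, length.out = 100) + frame / 5), type = "l")
  sys_sleep(250, "ms")
}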

Spark R 2.0 dapply very slow

I just started testing SparkR 2.0 and find the execution of dapply very slow.
For example, the following code
set.seed(2)
random_DF <- data.frame(matrix(rnorm(1000000), 100000, 10))
system.time(dummy_res <- random_DF[random_DF[, 1] > 1, ])
   user  system elapsed
  0.005   0.000   0.006
is executed in 6 ms.
Now, if I create a Spark DataFrame with 4 partitions and run on 4 cores, I get:
sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF), 4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark <- collect(subset_DF_Spark))
   user  system elapsed
  2.003   0.119  62.919
That is about 1 minute, which is abnormally slow. Am I missing something?
I also get a warning (TaskSetManager: Stage 64 contains a task of very large size (16411 KB). The maximum recommended task size is 100 KB.). Why is this 100 KB limit so low?
I am using R 3.3.0 on Mac OS 10.10.5.
Any insight welcome!
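Not from the original post, but for comparison: dapply has to serialize every partition to an R worker process, run the R function there, and serialize the result back, which is likely what dominates the elapsed time here. The same subset can be expressed with SparkR's own filter, which stays inside the JVM. A sketch, assuming the columns keep their default X1..X10 names from data.frame():
library(SparkR)

## Sketch only: express the row filter in Spark SQL terms instead of dapply,
## so nothing is shipped to R worker processes until collect().
subset_DF_Spark2 <- filter(random_DF_Spark, random_DF_Spark$X1 > 1)
system.time(dummy_res_Spark2 <- collect(subset_DF_Spark2))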

fast url query with R

Hi, I have to query a website 10000 times and I am looking for a really fast way to do it with R.
As a template url:
url <- "http://mutationassessor.org/?cm=var&var=7,55178574,G,A"
My code is:
library(XML)     # readHTMLTable()
library(gtools)  # smartbind()

url <- mydata$mutationassessorurl[1]
rawurl <- readHTMLTable(url)
Mutator <- data.frame(rawurl[[10]])

for (i in 2:27566) {
  url <- mydata$mutationassessorurl[i]
  rawurl <- readHTMLTable(url)
  Mutator <- smartbind(Mutator, data.frame(rawurl[[10]]))
  print(i)
}
Using microbenchmark, I measured about 680 milliseconds per query. I was wondering if there is a faster way to do it!
Thanks
One way to speed up http connections is to leave the connection open between requests. The following example shows the difference it makes for httr. The first option is most similar to the default behaviour in RCurl.
library(httr)
test_server <- "http://had.co.nz"
# Return times in ms for easier comparison
timed_GET <- function(...) {
  req <- GET(...)
  round(req$times * 1000)
}

# Create a new handle for every request - no connection sharing
rowMeans(replicate(20,
  timed_GET(handle = handle(test_server), path = "index.html")
))
##  redirect namelookup    connect pretransfer starttransfer
##      0.00      20.65      75.30       75.40        133.20
##     total
##    135.05

test_handle <- handle(test_server)
# Re-use the same handle for multiple requests
rowMeans(replicate(20,
  timed_GET(handle = test_handle, path = "index.html")
))
##  redirect namelookup    connect pretransfer starttransfer
##      0.00       0.00       2.55        2.55         59.35
##     total
##     60.80

# With httr, handles are automatically pooled
rowMeans(replicate(20,
  timed_GET(test_server, path = "index.html")
))
##  redirect namelookup    connect pretransfer starttransfer
##      0.00       0.00       2.55        2.55         57.75
##     total
##     59.40
Note the difference in the namelookup and connect - if you're sharing a handle you need to do each of these operations only once, which saves quite a bit of time. There's quite a lot of intra-request variation - on average the last two methods should be very similar.
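Applied to the question's loop, a rough sketch might look like the following (not from the original answer; it relies on httr's automatic per-host handle pooling, keeps the table index [[10]] from the question, and binds all tables once at the end instead of growing the data frame row by row):
library(httr)
library(XML)
library(gtools)

fetch_table <- function(url) {
  req <- GET(url)  # httr pools handles per host, so the connection is reused
  doc <- htmlParse(content(req, as = "text"), asText = TRUE)
  readHTMLTable(doc)[[10]]  # table index taken from the question
}

tables <- lapply(mydata$mutationassessorurl, fetch_table)
Mutator <- do.call(smartbind, tables)  # bind once at the end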

R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code, and it appears to be slowing down as the iterations increase.
### The beginning iterations
   user  system elapsed
   0.03    0.00    0.03
   user  system elapsed
   0.03    0.00    0.04
   user  system elapsed
   0.03    0.00    0.03
   user  system elapsed
   0.04    0.00    0.05
### The ending iterations
   user  system elapsed
   3.06    0.08    3.14
   user  system elapsed
   3.10    0.05    3.15
   user  system elapsed
   3.08    0.06    3.15
   user  system elapsed
   3.30    0.06    3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
library(fields)        # image.plot()
library(RColorBrewer)  # brewer.pal()

StreamFlux <- function(data, NoR, NTS){
  ###Read in data to display points###
  WLX = c(8, 19, 29, 20, 13, 20, 21)
  WLY = c(25, 28, 25, 21, 17, 14, 12)
  WLY = 34 - WLY
  WLX = WLX / 44
  WLY = WLY / 33
  timedata = NULL
  mf <- function(i){
    b = (NoR + 8) * (i - 1) + 8
    ###I read in data one section at a time to avoid headers
    mydata = read.table(data, skip = b, nrows = NoR, header = FALSE)
    rows = 34 - mydata[, 2]
    cols = 45 - mydata[, 3]
    flows = mydata[, 7]
    rows = as.numeric(rows)
    cols = as.numeric(cols)
    rm(mydata)
    ###Create Flux matrix
    flow_mat <- matrix(0, 44, 33)
    ###Populate matrix###
    flow_mat[(rows - 1) * 44 + (45 - cols)] <- flows + flow_mat[(rows - 1) * 44 + (45 - cols)]
    flow_mat[flow_mat == 0] <- NA
    rm(flows)
    rm(rows)
    rm(cols)
    timestep = i
    ###Specifying jpeg info###
    jpeg(paste("Steamflow", timestep, ".jpg", sep = ''),
         width = 640, height = 441, quality = 75, bg = "grey")
    image.plot(flow_mat, zlim = c(-1, 1),
               col = brewer.pal(11, "RdBu"), yaxt = "n",
               xaxt = "n", main = paste("Stress Period ",
               timestep, sep = ""))
    points(WLX, WLY)
    dev.off()
    rm(flow_mat)
  }
  ST <- function(x){
    functiontime = system.time(mf(x))
    print(functiontime)
  }
  lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then:
> system.time(StreamFlux("stream_out.txt",687,100))
   user  system elapsed
  28.22    1.06   32.67
> system.time(StreamFlux("stream_out.txt",687,200))
   user  system elapsed
 102.61    2.98  106.20
What I'm looking for is any way to speed up running this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems a stupid solution. I've read about dlply from the plyr package; it seems to have worked here, but would that help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
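A minimal sketch of that suggestion (not from the original answer; it assumes stream_out.txt fits comfortably in memory and keeps the question's skip/nrows arithmetic):
# Read the whole file into memory once, outside mf()
all_lines <- readLines(data)

mf <- function(i){
  b = (NoR + 8) * (i - 1) + 8
  # The same rows that skip = b, nrows = NoR selected, but taken from the
  # in-memory copy instead of re-scanning the file from the top on every call
  mydata = read.table(text = paste(all_lines[(b + 1):(b + NoR)], collapse = "\n"),
                      header = FALSE)
  ## ... rest of mf() unchanged ...
}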
