Fast URL query with R

Hi, I have to query a website 10,000 times and I am looking for a really fast way to do it with R.
As a template URL:
url <- "http://mutationassessor.org/?cm=var&var=7,55178574,G,A"
My code is:
library(XML)     # readHTMLTable()
library(gtools)  # smartbind()
url <- mydata$mutationassessorurl[1]
rawurl <- readHTMLTable(url)
Mutator <- data.frame(rawurl[[10]])
for (i in 2:27566) {
  url <- mydata$mutationassessorurl[i]
  rawurl <- readHTMLTable(url)
  Mutator <- smartbind(Mutator, data.frame(rawurl[[10]]))
  print(i)
}
Using microbenchmark I measured about 680 milliseconds per query. I was wondering if there is a faster way to do it!
Thanks

One way to speed up http connections is to leave the connection open
between requests. The following example shows the difference it makes
for httr. The first option is most similar to the default behaviour in
RCurl.
library(httr)
test_server <- "http://had.co.nz"
# Return times in ms for easier comparison
timed_GET <- function(...) {
  req <- GET(...)
  round(req$times * 1000)
}
# Create a new handle for every request - no connection sharing
rowMeans(replicate(20,
  timed_GET(handle = handle(test_server), path = "index.html")
))
## redirect namelookup connect pretransfer starttransfer
## 0.00 20.65 75.30 75.40 133.20
## total
## 135.05
test_handle <- handle(test_server)
# Re-use the same handle for multiple requests
rowMeans(replicate(20,
  timed_GET(handle = test_handle, path = "index.html")
))
## redirect namelookup connect pretransfer starttransfer
## 0.00 0.00 2.55 2.55 59.35
## total
## 60.80
# With httr, handles are automatically pooled
rowMeans(replicate(20,
  timed_GET(test_server, path = "index.html")
))
## redirect namelookup connect pretransfer starttransfer
## 0.00 0.00 2.55 2.55 57.75
## total
## 59.40
Note the difference in the namelookup and connect times - if you're sharing a
handle you only need to do each of these operations once, which saves
quite a bit of time.
There's quite a lot of intra-request variation - on average the last two
methods should be very similar.

Related

sparklyr connecting to kafka streams/topics

I'm having difficulty connecting to and retrieving data from a kafka instance. Using python's kafka-python module, I can connect (using the same connection parameters), see the topic, and retrieve data, so the network is viable, there is no authentication problem, the topic exists, and data exists in the topic.
On R-4.0.5 using sparklyr-1.7.2, connecting to kafka-2.8
library(sparklyr)
spark_installed_versions()
#   spark hadoop                                      dir
# 1 2.4.7    2.7 /home/r2/spark/spark-2.4.7-bin-hadoop2.7
# 2 3.1.1    3.2 /home/r2/spark/spark-3.1.1-bin-hadoop3.2
sc <- spark_connect(master = "local", version = "2.4",
  config = list(
    sparklyr.shell.packages = "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0"
  ))
system.time({
  Z <- stream_read_kafka(
    sc,
    options = list(
      kafka.bootstrap.servers = "11.22.33.44:5555",
      subscribe = "mytopic"))
})
#    user  system elapsed
#   0.080   0.000  10.349
system.time(collect(Z))
#    user  system elapsed
#   1.336   0.136   8.537
Z
# # Source: spark<?> [inf x 7]
# # … with 7 variables: key <lgl>, value <lgl>, topic <chr>, partition <int>, offset <dbl>, timestamp <dbl>, timestampType <int>
My first concern is that I'm not seeing data from the topic: I appear to be getting a frame of (meta)data about topics in general, and nothing is found. With this topic, there are 800 strings (json), modest-to-small in size. My second concern is that it takes almost 20 seconds to realize this problem (though I suspect that's a symptom of the larger connection problem).
For confirmation, this works:
library(reticulate)  # provides import()
cons = import("kafka")$KafkaConsumer(bootstrap_servers="11.22.33.44:5555", auto_offset_reset="earliest", max_partition_fetch_bytes=10240000L)
cons$subscribe("mytopic")
msg <- cons$poll(timeout_ms=30000L, max_records=99999L)
length(msg)
# [1] 1
length(msg[[1]])
# [1] 801
as.character( msg[[1]][[1]]$value )
# [1] "{\"TrackId\":\"c839dcb5-...\",...}"
(and those commands complete almost instantly, nothing like the 8-10sec lag above).
The kafka instance to which I'm connecting is using ksqlDB, though I don't think that's a requirement in order to need to use the "org.apache.spark:spark-sql-kafka-.." java package.
(Ultimately I'll be using stateless/stateful procedures on streaming data, including joins and window ops, so I'd like to not have to re-implement that from scratch on the simple kafka connection.)

Fastest way to upload data via R to PostgreSQL 12

I am using the following code to connect to a PostgreSQL 12 database:
con <- DBI::dbConnect(odbc::odbc(), driver, server, database, uid, pwd, port)
This connects me to a PostgreSQL 12 database on Google Cloud SQL. The following code is then used to upload data:
DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)
where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.
When uploaded by the above method, it took approximately 9 hours at a rate of 40 write operations/second. Is there a faster way to upload this data into my PostgreSQL database, preferably through R?
I've always found bulk-copy to be the best approach, external to R. The insert itself can be significantly faster; the trade-off is (1) having to write the data to a file first against (2) a much shorter overall run-time.
Setup for this test:
win10 (2004)
docker
postgres:11 container, using port 35432 on localhost, simple authentication
a psql binary in the host OS (where R is running); should be easy with linux, with windows I grabbed the "zip" (not installer) file from https://www.postgresql.org/download/windows/ and extracted what I needed
I'm using data.table::fwrite to save the file because it's fast; write.table and write.csv would still be much faster than DBI::dbWriteTable here, but with your volume of data you will probably prefer something quick
# 'con2' is an existing DBI connection to the dockerized postgres:11 instance described above
DBI::dbCreateTable(con2, "mt", mtcars)
DBI::dbGetQuery(con2, "select count(*) as n from mt")
#   n
# 1 0
z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
nrow(z1000)
# [1] 32000
system.time({
  DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
})
#    user  system elapsed
#    1.56    1.09   30.90
system.time({
  data.table::fwrite(z1000, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z1000), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})
# COPY 32000
#    user  system elapsed
#    0.05    0.00    0.19
DBI::dbGetQuery(con2, "select count(*) as n from mt")
# n
# 1 64000
While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
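If you need to do this repeatedly, the fwrite-plus-psql dance wraps up into a small helper. This is only a sketch, not part of the original benchmark: the function name, credentials and psql path are placeholders mirroring the docker setup above, and it assumes the target table already exists (e.g. via DBI::dbCreateTable).
psql_copy <- function(df, table, user = "postgres", password = "mysecretpassword",
                      host = "127.0.0.1", port = "35432", psql = "psql") {
  csv <- tempfile(fileext = ".csv")
  on.exit(unlink(csv), add = TRUE)
  data.table::fwrite(df, csv)                    # fast CSV writer
  csv <- normalizePath(csv, winslash = "/")      # forward slashes keep psql happy on windows
  uri <- sprintf("postgresql://%s:%s@%s:%s", user, password, host, port)
  system(sprintf("%s -U %s -c \"\\copy %s (%s) from '%s' (FORMAT CSV, HEADER)\" %s",
                 psql, user, table, paste(colnames(df), collapse = ","), csv, uri))
}
# usage: psql_copy(z1000, "mt")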
Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:
z100 <- data.table::rbindlist(replicate(100, mtcars, simplify=F))
system.time({
  data.table::fwrite(z100, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z100), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})
# COPY 3200
#    user  system elapsed
#     0.0     0.0     0.1
system.time({
  DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
#    user  system elapsed
#    0.17    0.04    2.95
system.time({
  DBI::dbAppendTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
#    user  system elapsed
#    0.74    0.33   23.59
(I don't want to time dbAppendTable with z1000 above ...)
(For kicks, I ran it with replicate(10000, ...) and ran the psql and dbWriteTable tests again, and they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...)
I suspect that dbAppendTable results in an INSERT statement per row, which can take a long time for high numbers of rows.
However, you can generate a single INSERT statement for the entire data frame using the sqlAppendTable function and run it by using dbSendQuery explicitly:
res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, df, row.names=FALSE))
DBI::dbClearResult(res)
For me, this was much faster: a 30 second ingest reduced to a 0.5 second ingest.
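If building one INSERT for the whole data frame turns out to be too large to construct or send at once, the same calls work in batches. A hedged sketch reusing only the DBI functions above; the 50,000-row chunk size is an arbitrary assumption:
chunk_size <- 50000
groups <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / chunk_size))
for (idx in groups) {
  res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, df[idx, , drop = FALSE], row.names = FALSE))
  DBI::dbClearResult(res)
}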

Base R (or CRAN) function to run internet speed test from R?

Is there a function in base R (or in a package on CRAN) that runs a speed test (i.e. measures a user's download speed)?
Note: I do not want something that relies on libraries beyond CRAN, external scripts or any software that is outside of base R/CRAN and not already on standard operating systems (i.e. linux, mac and windows).
Methods that come close
Non-CRAN package
There is a package on GitHub (not on CRAN) that returns the user's download speed:
install_github("https://github.com/hrbrmstr/speedtest")
library(speedtest)
speedtest::spd_download_test(speedtest::spd_best_servers())$mean
# [1] 12.9
Python script
It is possible to get the download speed via a system call to curl, retrieval of a python script from github, and then executing that script. E.g. system("curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -").
This is nice because it can be done in R with one line of code. However, it's problematic because (a) it relies on having Python installed, and (b) it retrieves code from GitHub and executes it (dangerous).
As we discussed in the comments, you can download the first n bytes of the file, time it, and do the math from there.
Though not base R, this uses only the httr package, which is common enough, I think. You might be able to adapt this to download.file, though I had difficulty getting headers= to do what I needed here.
This is an over-engineered helper-script:
dl_size <- function(url) {
  tryCatch(
    as.integer(httr::HEAD(url)$headers$`content-length`),
    error = function(e) NA_integer_)
}
dl_speedtest <- function(url, size = 10000, tries = 1) {
  urlsize <- dl_size(url)
  stopifnot(isTRUE(!is.na(urlsize)))
  starts <- size * seq_len(tries)
  tries <- min(tries, floor(urlsize / size))
  counts <- sapply(
    paste(c(0, starts[-tries]), starts - 1, sep = "-"),
    function(byt) {
      system.time(ign <- httr::GET(url, httr::add_headers(Range = paste0("bytes=", byt))))
    })
  if (tries < 3) {
    elapsed <- counts["elapsed", ]
    speeds <- sort(size / counts["elapsed", ])
    expected <- urlsize / (size / counts["elapsed", ])
  } else {
    elapsed <- summary(counts["elapsed", ])
    speeds <- summary(size / counts["elapsed", ])
    expected <- summary(urlsize / (size / counts["elapsed", ]))
  }
  list(elapsed = elapsed, speeds = speeds, expected = expected)
}
For testing, I set up a 50MiB "random" file on a personal website. Since I'd rather not inundate that site with random traffic trying to prove this, I'll just refer to it as URL here.
In action:
dl_speedtest(URL, size=100000, tries=3)
# $elapsed
# [1] 0.20 0.11 0.09
# $speeds
# [1] 500000.0 909090.9 1111111.1
# $expected
# [1] 102.40 56.32 46.08
dl_speedtest(URL, size=100000, tries=5)
# $elapsed
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.080 0.090 0.090 0.094 0.100 0.110
# $speeds
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 909091 1000000 1111111 1076263 1111111 1250000
# $expected
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 40.96 46.08 46.08 48.13 51.20 56.32
Like I said, over-engineered, but I was playing with it. You can/should reduce the code quite a bit.
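Stripped to its core, the idea is just: request a fixed number of bytes with an HTTP Range header, time it, and divide. A minimal sketch under the same assumptions (URL points to your own test file and the server honours Range requests):
speed_bps <- function(url, bytes = 1e6) {
  t <- system.time(
    httr::GET(url, httr::add_headers(Range = paste0("bytes=0-", bytes - 1)))
  )["elapsed"]
  bytes / t                # rough download speed in bytes per second
}
# speed_bps(URL) / 1e6     # ~ MB/s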

A faster way to generate a vector of UUIDs in R

The code below takes about 15 seconds to generate a vector of 10k UUIDs. I will need to generate 1M or more and I calculate that this will take 15 * 10 * 10 / 60 minutes, or about 25 minutes. Is there a faster way to achieve this?
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), UUIDgenerate )
end_time <- Sys.time()
end_time - start_time
# Time difference of 15.072 secs
Essentially, I'm searching for a method for R that manages to achieve the performance boost described here for Java: Performance of Random UUID generation with Java 7 or Java 6
They should be RFC 4122 compliant but the other requirements are flexible.
Bottom line up front: no, there is currently no way to speed up generation of a lot of UUIDs with uuid::UUIDgenerate without compromising the core premise of uniqueness.
In fact, your suggestion to use use.time=FALSE has significant downsides (on Windows). See below.
It is possible to get faster performance at scale, just not with uuid. See below.
uuid on Windows
Performance of uuid::UUIDgenerate should take into account the OS, and more specifically the source of randomness. It's important to look at performance, yes:
library(microbenchmark)
microbenchmark(
  rf = replicate(1000, uuid::UUIDgenerate(FALSE)),
  rt = replicate(1000, uuid::UUIDgenerate(TRUE)),
  sf = sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
  st = sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
#  expr       min        lq     mean   median       uq      max neval
#    rf  8.675561  9.330877 11.73299 10.14592 11.75467  66.2435   100
#    rt 89.446158 90.003196 91.53226 90.94095 91.13806 136.9411   100
#    sf  8.570900  9.270524 11.28199 10.22779 12.06993  24.3583   100
#    st 89.359366 90.189178 91.73793 90.95426 91.89822 137.4713   100
... so using use.time=FALSE is always faster. (I included the sapply examples for comparison with your answer's code, to show that replicate is never slower. Use replicate here unless you feel you need the numeric argument for some reason.)
However, there is a problem:
R.version[1:3]
#          _
# platform x86_64-w64-mingw32
# arch     x86_64
# os       mingw32
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 20
Given that a UUID is intended to be unique every time one is generated, this is disturbing, and is a symptom of insufficient randomness on Windows. (Does WSL provide a way out for this? Another research opportunity ...)
uuid on Linux
For comparison, the same results on a non-windows platform:
microbenchmark(
  rf = replicate(1000, uuid::UUIDgenerate(FALSE)),
  rt = replicate(1000, uuid::UUIDgenerate(TRUE)),
  sf = sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
  st = sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
#  expr       min       lq     mean   median       uq       max neval
#    rf 20.852227 21.48981 24.90932 22.30334 25.11449  74.20972   100
#    rt  9.782106 11.03714 14.15256 12.04848 15.41695 100.83724   100
#    sf 20.250873 21.39140 24.67585 22.44717 27.51227  44.43504   100
#    st  9.852275 11.15936 13.34731 12.11374 15.03694  27.79595   100
R.version[1:3]
#          _
# platform x86_64-pc-linux-gnu
# arch     x86_64
# os       linux-gnu
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 1000
(I'm slightly intrigued by the fact that use.time=FALSE on linux takes twice as long as on windows ...)
UUID generation with a SQL server
If you have access to a SQL server (you almost certainly do ... see SQLite ...), then you can deal with this scale problem by employing the server's implementation of UUID generation, recognizing that there are some slight differences.
(Side note: there are "V4" (completely random), "V1" (time-based), and "V1mc" (time-based and includes the system's mac address) UUIDs. uuid gives V4 if use.time=FALSE and V1 otherwise, encoding the system's mac address.)
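(Not from the original answer, but a quick sanity check: the UUID version is the first hex digit of the third group, i.e. character 15 of the string.)
substr(uuid::UUIDgenerate(use.time = FALSE), 15, 15)  # "4" = random V4
substr(uuid::UUIDgenerate(use.time = TRUE),  15, 15)  # "1" = time-based V1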
Some performance comparisons on windows (all times in seconds):
#         n  uuid postgres sqlite sqlserver
# 1     100     0     1.23   1.13      0.84
# 2    1000  0.05     1.13   1.21      1.08
# 3   10000  0.47     1.35   1.45      1.17
# 4  100000  5.39     3.10   3.50      2.68
# 5 1000000 63.48    16.61  17.47     16.31
The use of SQL has some overhead that does not take long to overcome when done at scale.
PostgreSQL needs the uuid-ossp extension, installable with
CREATE EXTENSION "uuid-ossp"
Once installed/available, you can generate n UUIDs with:
n <- 3
pgcon <- DBI::dbConnect(...)
DBI::dbGetQuery(pgcon, sprintf("select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n))
# uuid
# 1 53cd17c6-3c21-11e8-b2bf-7bab2a3c8486
# 2 53cd187a-3c21-11e8-b2bf-dfe12d92673e
# 3 53cd18f2-3c21-11e8-b2bf-d3c64c6ad73f
Other UUID functions exist: https://www.postgresql.org/docs/9.6/static/uuid-ossp.html
SQLite includes limited ability to do it, but this hack works well enough for a V4-style UUID (length n):
sqlitecon <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # or your own
DBI::dbGetQuery(sqlitecon, sprintf("
  WITH RECURSIVE cnt(x) as (
    select 1 union all select x+1 from cnt limit %d
  )
  select (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) as uuid
  from cnt", n))
# uuid
# 1 EE6B08DA-2991-BF82-55DD-78FEA48ABF43
# 2 C195AAA4-67FC-A1C0-6675-E4C5C74E99E2
# 3 EAC159D6-7986-F42C-C5F5-35764544C105
(This takes a little pain to format it the same, a nicety at best. You might find small performance improvements by not clinging to this format.)
SQL Server requires temporarily creating a table (with newsequentialid()), generating a sequence into it, pulling the automatically-generated IDs, and discarding the table. A bit over-the-top, especially considering the ease of using SQLite for it, but YMMV. (No code offered, it doesn't add much.)
Other considerations
In addition to execution time and sufficient-randomness, there are various discussions around (uncited for now) with regards to database tables that indicate performance impacts by using non-consecutive UUIDs. This has to do with index pages and such, outside the scope of this answer.
However, assuming this is true ... with the assumption that rows inserted at around the same time (temporally correlated) are often grouped together (directly or sub-grouped), then it is a good thing to keep same-day data with UUID keys in the same db index-page, so V4 (completely random) UUIDs may decrease DB performance with large groups (and large tables). For this reason, I personally prefer V1 over V4.
Other (still uncited) discussions consider including a directly-traceable MAC address in the UUID to be a slight breach of internal information. For this reason, I personally lean towards V1mc over V1.
(But I don't yet have a way to do this well with RSQLite, so I'm reliant on having postgresql nearby. Fortunately, I use postgresql enough for other things that I keep an instance around with docker on windows.)
Providing the use.time option will significantly speed up the process. It can be set to either TRUE or FALSE to determine whether the UUIDs are time-based or not. In both cases, generation is significantly faster than when the option is not specified.
For 10k UUIDs,
library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01399994 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01100016 secs
Even scaling up to 100M still gives a faster run-time than the original 15 seconds.
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 1.154 secs
start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(TRUE) )
end_time <- Sys.time()
end_time - start_time
# 100M: 3.7586 secs
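One more option, not covered above: recent versions of the uuid package accept an n argument to UUIDgenerate(), which returns the whole vector in a single call rather than looping with sapply. If your installed version supports it (check ?UUIDgenerate), this is likely the simplest pure-R route:
library(uuid)
ids <- UUIDgenerate(use.time = FALSE, n = 1e6)  # one vectorized call, V4 UUIDs
length(unique(ids))                             # should equal 1e6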

R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors and have streamed lined just about everything. I've timed each iteration of my code and it appears to be slowing as iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes I know it's big, make sure you download the zip file.
library(RColorBrewer)  # brewer.pal()
library(fields)        # image.plot()

StreamFlux <- function(data, NoR, NTS){
  ### Read in data to display points ###
  WLX = c(8,19,29,20,13,20,21)
  WLY = c(25,28,25,21,17,14,12)
  WLY = 34 - WLY
  WLX = WLX / 44
  WLY = WLY / 33
  timedata = NULL
  mf <- function(i){
    b = (NoR+8) * (i-1) + 8
    ### I read in data one section at a time to avoid headers
    mydata = read.table(data, skip=b, nrows=NoR, header=FALSE)
    rows = 34 - mydata[,2]
    cols = 45 - mydata[,3]
    flows = mydata[,7]
    rows = as.numeric(rows)
    cols = as.numeric(cols)
    rm(mydata)
    ### Create flux matrix ###
    flow_mat <- matrix(0, 44, 33)
    ### Populate matrix ###
    flow_mat[(rows - 1) * 44 + (45-cols)] <- flows + flow_mat[(rows - 1) * 44 + (45-cols)]
    flow_mat[flow_mat == 0] <- NA
    rm(flows)
    rm(rows)
    rm(cols)
    timestep = i
    ### Specifying jpeg info ###
    jpeg(paste("Steamflow", timestep, ".jpg", sep = ''),
         width = 640, height = 441, quality = 75, bg = "grey")
    image.plot(flow_mat, zlim = c(-1,1),
               col = brewer.pal(11, "RdBu"), yaxt = "n",
               xaxt = "n", main = paste("Stress Period ", timestep, sep = ""))
    points(WLX, WLY)
    dev.off()
    rm(flow_mat)
  }
  ST <- function(x){
    functiontime = system.time(mf(x))
    print(functiontime)
  }
  lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up running this code and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a stupid solution. I've read about dlply from the plyr package. It seems to have worked here, but would that help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
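A rough sketch of that suggestion (untested; it assumes the fixed layout implied by the question's skip arithmetic, i.e. 8 header lines before each block of NoR data rows):
all_lines <- readLines("stream_out.txt")        # read the file once
get_block <- function(i, NoR = 687) {
  b <- (NoR + 8) * (i - 1) + 8                  # same offset arithmetic as in mf()
  read.table(text = all_lines[(b + 1):(b + NoR)], header = FALSE)
}
# mf() could then call get_block(i) instead of read.table(data, skip = b, nrows = NoR)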
