I am extracting information from various databases, and to accomplish that I am keeping track of how to convert between the different IDs for each database.
library("RCurl")
library("XML")
transformDrugId <- function(x) {
  URLtoan <- getURL(x)            # fetch the page
  PARSED  <- htmlParse(URLtoan)   # parse the returned HTML
  dsource <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/b[1]/text()", xmlValue)
  id      <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/a[1]/span/text()", xmlValue)
  return(c(dsource, id))
}
Just as an example, the time it takes on my PC (Linux, RStudio) is:
system.time(DBidstest<-sapply(urls[c(10001:10003)],transformDrugId))
user system elapsed
0.132 0.000 3.675
system.time(DBids7<-sapply(urls[c(601:700)],transformDrugId))
user system elapsed
3.980 0.124 549.233
where urls contains the list of URL addresses of the TDR Targets database pages where I check for IDs.
The computation time becomes prohibitively long when I have to do this for all 300,000 drug IDs.
As an example, here are the first six URLs:
head(urls)
[1] "http://tdrtargets.org/drugs/view?mol_id=608858"
[2] "http://tdrtargets.org/drugs/view?mol_id=608730"
[3] "http://tdrtargets.org/drugs/view?mol_id=549548"
[4] "http://tdrtargets.org/drugs/view?mol_id=581648"
[5] "http://tdrtargets.org/drugs/view?mol_id=5857"
[6] "http://tdrtargets.org/drugs/view?mol_id=550626"
Any help in reducing the time needed to fetch and parse the HTML will be appreciated. I am open to suggestions that do not involve R.
I have since realized that using getURLAsynchronous for 10 or fewer URLs is sometimes faster, but calling it a second time becomes much slower:
system.time(test<-getURLAsynchronous(urls[c(1:10)]))
user system elapsed
0.128 0.016 1.414
system.time(test<-getURLAsynchronous(urls[c(1:10)]))
user system elapsed
0.152 0.088 300.103
Downloading directly from the shell turned out to be about ten times faster:
echo $URLTEST| xargs -n 1 -P 7 wget -q
where URLTEST is a list of URLs to download. For xargs, -n sets the number of URLs passed to each wget call and -P the number of parallel processes; both were fine-tuned so that for 100 pages I got
real 0m13.498s
user 0m0.196s
sys 0m0.652s
There must be some problem in how R interfaces with libcurl that makes it really slow in comparison, both for getURL() and download.file().
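For reference, here is a rough sketch (not benchmarked; the file and directory names are only placeholders) of driving the faster shell download from R and then parsing the local copies with the same XPath expressions:
library(XML)
# Write the URL list to disk and let xargs/wget fetch 7 pages in parallel
# into a local "pages" directory (wget's -P sets the download directory).
dir.create("pages", showWarnings = FALSE)
writeLines(urls, "url_list.txt")
system("xargs -n 1 -P 7 wget -q -P pages < url_list.txt")
# Parse a downloaded page with the same XPath used in transformDrugId.
parseLocal <- function(f) {
  parsed  <- htmlParse(f)
  dsource <- xpathSApply(parsed, "//*[@id='advancedform']/div[7]/fieldset/p/b[1]/text()", xmlValue)
  id      <- xpathSApply(parsed, "//*[@id='advancedform']/div[7]/fieldset/p/a[1]/span/text()", xmlValue)
  c(dsource, id)
}
files <- list.files("pages", full.names = TRUE)
DBids <- sapply(files, parseLocal)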
Related
Okay, so I have approximately 2 GB worth of files (images and whatnot) stored on a server (I'm using Cygwin right now since I'm on Windows), and I was wondering whether I can get all of this data into R and eventually publish it on a website where people can view/download those images.
I currently have installed the ssh package and have logged into my server using:
ssh::ssh_connect("name_and_server_ip_here")
I've been able to connect successfully; however, I am not particularly sure how to locate the files on the server through R. I assume I would use something like scp_download to download them, but since I don't know how to locate the files in the first place, I can't download them anyway (yet)!
Any sort of feedback and help would be appreciated! Thanks :)
You can use ssh::ssh_exec_internal and a shell find command to locate the files on the server.
sess <- ssh::ssh_connect("r2#myth", passwd="...")
out <- ssh::ssh_exec_internal(sess, command = "find /home/r2/* -maxdepth 3 -type f -iname '*.log'")
str(out)
# List of 3
# $ status: int 0
# $ stdout: raw [1:70] 2f 68 6f 6d ...
# $ stderr: raw(0)
The stdout/stderr components are raw vectors (it's possible that the remote command did not produce ASCII data), so we can use rawToChar to convert. (This may not be console-safe if you have non-ASCII data, but it is plain ASCII here, so I'll go with it.)
rawToChar(out$stdout)
# [1] "/home/r2/logs/dns.log\n/home/r2/logs/ping.log\n/home/r2/logs/status.log\n"
remote_files <- strsplit(rawToChar(out$stdout), "\n")[[1]]
remote_files
# [1] "/home/r2/logs/dns.log" "/home/r2/logs/ping.log" "/home/r2/logs/status.log"
For downloading, scp_download is not vectorized, so we can only download one file at a time.
for (rf in remote_files) ssh::scp_download(sess, files = rf, to = ".")
# 4339331 C:\Users\r2\.../dns.log
# 36741490 C:\Users\r2\.../ping.log
# 17619010 C:\Users\r2\.../status.log
For uploading, scp_upload is vectorized, so we can send all in one shot. I'll create a new directory (just for this example, and to not completely clutter my remote server :-), and then upload them.
ssh::ssh_exec_wait(sess, "mkdir '/home/r2/newlogs'")
# [1] 0
ssh::scp_upload(sess, files = basename(remote_files), to = "/home/r2/newlogs/")
# [100%] C:\Users\r2\...\dns.log
# [100%] C:\Users\r2\...\ping.log
# [100%] C:\Users\r2\...\status.log
# [1] "/home/r2/newlogs/"
(I find it odd that scp_upload is vectorized while scp_download is not. If this were on a shell/terminal, each call to scp would need to connect, authenticate, copy, and then disconnect, which is a bit inefficient; since we're using a saved session, I believe (unverified) that little efficiency is lost by the R function not being vectorized ... though it is still really easy to vectorize it.)
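If you want the download side to feel vectorized as well, a tiny wrapper over the existing loop is enough. This is a hedged sketch; scp_download_many is a made-up helper name, not part of the ssh package:
scp_download_many <- function(sess, files, to = ".") {
  # One ssh::scp_download call per remote file, reusing the open session.
  invisible(lapply(files, function(f) ssh::scp_download(sess, files = f, to = to)))
}
scp_download_many(sess, remote_files, to = ".")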
I'm trying to troubleshoot why a process that runs fine serially, and which runs fine in parallel on a local machine (with 3 cores), might slow way down when run on many cores on a cluster. I can't really make a minimal non-working example, because the code works just fine locally.
I've modified the script to save each worker's output to a file, and I'll aggregate the gobs of files later -- so I don't think that it is communication and aggregation, unless I am missing something.
Here is my output. I'm showing a wrapper function for a rather complicated process, which I then run once on its own before running it through plyr:
> analyze_ = function(r){#wrapper function with error handling
+ sti = proc.time()
+ out<-tryCatch(analyze(r)
+ , error=function(e) e, finally=1+1)
+ if(inherits(out, "error")) {
+ out='error!'
+ print(paste("an error happened at",r[1],r[2]))
+ }
+ print(proc.time()-sti)
+ }
>
> st = proc.time()
> x = analyze_(xygrid[1,])
[1] "Successfully did -0.25 100.25"
user system elapsed
8.282 0.008 8.286
Then I run the code in parallel and everything slows way down:
> nodes <- detectCores()
> nodes
[1] 24
> registerDoMC(nodes)
> output<-a_ply(xygrid,1,analyze_, .parallel=T)
[1] "Successfully did 0.25 102.25"
user system elapsed
9.292 0.042 221.954
[1] "Successfully did 0.25 10.25"
user system elapsed
9.298 0.039 221.994
[1] "Successfully did -0.25 102.75"
user system elapsed
9.313 0.054 222.808
[1] "Successfully did -0.25 102.25"
user system elapsed
9.328 0.043 222.832
[1] "Successfully did -0.25 104.25"
user system elapsed
9.250 0.032 223.761
[1] "Successfully did -0.25 103.75"
user system elapsed
9.258 0.038 223.786
What are some of the things that might cause this sort of behavior? As I've said, the function discards the output, maintains benchmark speed when run serially, and works fine on three cores in parallel on my local machine.
Here's a thing: why is the "elapsed" time almost an exact 24-fold multiple of the "user" time? (and what is the "user" time anyway?)
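(Aside, as a hedged illustration rather than one of the timings above: "user" is the CPU time charged to the R process itself, while "elapsed" is wall-clock time, so a call that mostly waits inflates elapsed while barely touching user.)
# A call that sleeps uses essentially no CPU but two seconds of wall clock:
system.time(Sys.sleep(2))
# roughly: user ~0.00, system ~0.00, elapsed ~2.00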
Edit In response to a question in the comments, here is a head-to-head comparison of performance in parallel vs. in serial. Parallel is in fact slower.
> system.time(analyze_(xygrid[1,]))
user system elapsed
8.223 0.005 8.250
>
> nodes <- detectCores()
> nodes
[1] 24
> registerDoMC(nodes)
>
> system.time(a_ply(xygrid[1:24,],1,analyze_, .parallel=F))
user system elapsed
197.666 0.072 197.762
>
> system.time(a_ply(xygrid[1:24,],1,analyze_, .parallel=T))
user system elapsed
119.871 0.401 206.257
Edit2
The code works fine on the login node -- each job takes 6 seconds serially, and 24 jobs in parallel take 12 seconds. So how do compute nodes differ in general from login nodes?
Edit3
Solved
It turns out that I had been using ibrun in my shell script, which launches the job on multiple tasks. R, however, handles its parallelism internally: the parallel backend is registered from within R, not from the shell script that submits the R job. If you also request multiple tasks from the batching command, the cluster runs the whole parallel job on each core, hence the slowdown by roughly a multiple of the number of cores. So, moral of the story: on a shared-resource system, ask the batching system for only one task, and let R handle how many cores actually get used. This issue will probably only be relevant on shared supercomputers with batching systems.
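A minimal sketch of what that means on the R side (assuming a SLURM-style batch system; the environment variable name differs between schedulers): register only the cores the scheduler allocated to the single task, instead of whatever detectCores() reports for the whole node.
library(doMC)
# SLURM_CPUS_PER_TASK is an assumption; substitute your scheduler's variable.
alloc <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
if (is.na(alloc) || alloc < 1L) alloc <- 1L
registerDoMC(alloc)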
I am writing a simple command-line Rscript that reads some binary data and outputs it as a stream of numeric characters. The data is in a specific format, and R has a very fast library for dealing with the binary files in question. The file (some 7 million values) is read quickly, in less than a second:
library(affyio)
system.time(CEL <- read.celfile("testCEL.CEL"))
user system elapsed
0.462 0.035 0.498
I want to write part of the read data to stdout:
str(CEL$INTENSITY$MEAN)
num [1:6553600] 6955 225 7173 182 148 ...
As you can see it's numeric data with ~6.5 million integers.
And the writing is terribly slow:
system.time(write(CEL$INTENSITY$MEAN, file="TEST.out"))
user system elapsed
8.953 10.739 19.694
(Here the writing is done to a file, but writing to standard output from Rscript takes the same amount of time.)
cat(vector) does not improve the speed at all. One improvement I found is this:
system.time(writeLines(as.character(CEL$INTENSITY$MEAN), "TEST.out"))
user system elapsed
6.282 0.016 6.298
It is still a far cry from the read speed (and the read pulled in five times more data than this particular vector). Moreover, I have the overhead of transforming the entire vector to character before I can proceed. Plus, when sinking to stdout, I cannot terminate the stream with CTRL+C if I accidentally fail to redirect it to a file.
So my question is - is there a faster way to simply output numeric vector from R to stdout?
Also, why is reading the data in so much faster than writing it out? And this is not only for binary files, but in general:
system.time(tmp <- scan("TEST.out"))
Read 6553600 items
user system elapsed
1.216 0.028 1.245
Binary reads are fast. Printing to stdout is slow for two reasons:
formatting
actual printing
You can benchmark / profile either. But if you really want to be "fast", stay away from formatting for printing lots of data.
Compiled code can help make the conversion faster. But again, the fastest solution will be to:
remain with binary (a short sketch follows below)
not write to stdout or to a file (but use e.g. something like Redis).
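As a hedged illustration of the "remain with binary" option using only base R (the file name and the read-back step are just for demonstration, not a benchmark):
# Write the doubles directly, skipping the number-to-string formatting step.
con <- file("TEST.bin", "wb")
writeBin(CEL$INTENSITY$MEAN, con)
close(con)
# Read them back; the type and count must match what was written.
con <- file("TEST.bin", "rb")
tmp <- readBin(con, what = "double", n = length(CEL$INTENSITY$MEAN))
close(con)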
I'm performing several tests using different approaches for cleaning a big csv file and then importing it into R.
This time I'm playing with Powershell in Windows.
While things work well, and more accurately than when using cut() with pipe(), the process is horribly slow.
This is my command:
shell(shell = "powershell",
"Import-Csv In.csv |
select-object col1, col2, etc |
Export-csv new.csv")
And these are the system.time() results:
user system elapsed
0.61 0.42 1568.51
I've seen some other posts that use C# via streaming and take a couple of dozen seconds, but I don't know C#.
My question is: how can I improve the PowerShell command to make it faster?
Thanks,
Diego
There's a fair amount of overhead in reading in the CSV, converting the rows to PowerShell objects, and then converting back to CSV. Doing it through the pipeline that way also makes it process one record at a time. You should be able to speed that up considerably if you switch to using Get-Content with a -ReadCount parameter and extract your data using a regular expression with the -replace operator, e.g.:
shell(shell = "powershell",
"Get-Content In.csv -ReadCount 1000 |
foreach { $_ -replace '^(.+?,.+?),','$1' |
Add-Content new.csv")
This will reduce the number of disk reads, and -replace will function as an array operator, handling 1000 records at a time.
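From the R side, this pipeline can be passed to shell() as a single string, much like the Import-Csv version above (a hedged sketch; ps_pipeline is just an illustrative variable name):
# Build the PowerShell pipeline once, then hand it to shell() for timing.
ps_pipeline <- paste(
  "Get-Content In.csv -ReadCount 1000 |",
  "foreach { $_ -replace '^(.+?,.+?),','$1' } |",
  "Add-Content new.csv"
)
system.time(shell(shell = "powershell", ps_pipeline))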
First and foremost, my first test was wrong: due to some earlier errors, several other PowerShell sessions remained open and delayed the whole process.
These are the real numbers:
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.09 0.05 824.53
Now, as I said, I couldn't find a good pattern for splitting the columns of my CSV file.
I should add that it is a messy file, with fields containing commas, multi-line fields, summary lines, and so on.
I tried other approaches, like a well-known Stack Overflow answer that uses embedded C# code in PowerShell to split CSV files.
While it works faster than the more common approach I showed previously, the results are not accurate for these kinds of messy files.
> system.time(shell(shell = "powershell", psh.comm))
user system elapsed
0.01 0.00 212.96
Both approaches showed similar RAM consumption (~40 MB) and CPU usage (~50%) most of the time.
So while the former approach took four times as long as the latter, the accuracy of the results, the low resource cost, and the shorter development time make me consider it the most effective for big, messy CSV files.
I am using 32-bit Perl on my OpenVMS system (so Perl can access up to 2 GB of virtual address space).
I am hitting "Out of memory!" in a large Perl script. I zeroed in on the variable causing this. However, after my tests with Devel::Size it turns out the array is using only 13 MB of memory and the hash is using much less than that.
My question is about memory-profiling this Perl script on VMS.
Is there a good way of doing memory profiling on VMS?
I used Devel::Size to get the size of the array and the hash (the array has local scope and the hash has global scope).
DV Z01 A4:[INTRO_DIR]$ perl scanner_SCANDIR.PL
Directory is Z3:[new_dir]
13399796 is total on array
3475702 is total on hash
Directory is Z3:[new_dir.subdir]
2506647 is total on array
4055817 is total on hash
Directory is Z3:[new_dir.subdir.OBJECT]
5704387 is total on array
6040449 is total on hash
Directory is Z3:[new_dir.subdir.XFET]
1585226 is total on array
6390119 is total on hash
Directory is Z3:[new_dir.subdir.1]
3527966 is total on array
7426150 is total on hash
Directory is Z3:[new_dir.subdir.2]
1698678 is total on array
7777489 is total on hash
(edited: mis-spelled PGFLQUOTA)
Where is that output coming from? To OpenVMS folks it suggests files in directories, which the code might be reading in. There would typically be considerable malloc/alignment overhead per element stored.
Anyway, the available addressable memory when strictly using 32-bit pointers on OpenVMS is 1 GB, not 2 GB: 0x0 .. 0x3fffffff for program and (malloc) data in 'P0' space. There is also room in P1 (0x40000000 .. 0x7fffffff) for thread-local stack storage, but Perl does not use (much of) that.
From a second session you can look at that with DCL:
$ pid = "xxxxxxxx"
$ write sys$output f$getjpi(pid,"FREP0VA"), " ", f$getjpi(pid,"FREP1VA")
$ write sys$output f$getjpi(pid,"PPGCNT"), " ", f$getjpi(pid,"GPGCNT")
$ write sys$output f$getjpi(pid,"PGFLQUOTA")
However... those are just address ranges, NOT how much memory the process is allowed to use. That's governed by the process page-file quota. Check with $ SHOW PROC/QUOTA before running Perl. Actual usage can be reported from the outside as above, adding the private pages (PPGCNT) and global/shared pages (GPGCNT).
Another nice way to look at memory (and other quotas) is SHOW PROC/CONT ... and then hit "q".
So how many elements are stored in each large active array? How large is an average element, rounded up to 16 bytes? How many hash elements? How large are the key + value on average (rounded up generously)?
What is the exact message?
Does the program 'blow up' right away, or after a while (so that you can use SHOW PROC/CONT)?
Is there a source file data set (size) that does work?
Cheers,
Hein.