I have a large data.frame (15 columns and 100,000 rows) in an existing R session that I want to send to a Q/KDB instance. From KDB's cookbook, the possible solutions are:
RServer for Q: use KDB to create a new R instance which shares memory space. This doesn't work because my data is in an existing instance of R.
RServe: run an R server and use TCP/IP to communicate with a Q/KDB client. This does not work because, as per RServe's documentation, "every connection has a separate workspace and working directory", and so I presume it does not see my existing data.
R Math Library: access R's functionality via a math library without needing an instance of R. This does not work because my data is already in an instance of R.
So any other ideas on how to send data from R to Q/KDB?
Open a port in Q. I start Q with a batch file:
@echo off
c:\q\w32\q -p 5001
Then load qserver.dll in R:
tryCatch({
  dyn.load("c:/q/qserver.dll")
}, error = function(f) {
  print("can't load qserver.dll")
})
Then use these functions to open a connection and execute queries:
open_connection <- function(host = "localhost", port = 5001, user = NULL) {
  parameters <- list(host, as.integer(port), user)
  h <- .Call("kx_r_open_connection", parameters)
  assign(".k.h", h, envir = .GlobalEnv)
  return(h)
}

close_connection <- function(connection) {
  .Call("kx_r_close_connection", as.integer(connection))
}

execute <- function(connection, query) {
  .Call("kx_r_execute", as.integer(connection), query)
}
d<<-open_connection(host="localhost",port=thePort)
ex2 <- function(...) {
  query <- list(...)
  theResult <- NULL
  for (i in query) theResult <- paste0(theResult, i)
  return(execute(d, paste0(theResult)))
}
Then ex2 can take multiple arguments, so you can build queries from R variables and strings.
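For example (hypothetical q table and variable names, just to show how the arguments are pasted together):
n <- 5L
ex2("select from trade where size > ", n)
# builds and runs the q query: select from trade where size > 5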
Edit: that's for getting data into R from Q; here's how to send data from R to Q.
2nd Edit: improved algorithm:
library(stringr)
RToQTable <- function(Rtable, Qname, withColNames = TRUE, withRowNames = TRUE, colSuffix = NULL)
{
  # use the existing column names, or generate col1, col2, ... (plus optional suffix)
  theColnames <- if(!withColNames || length(colnames(Rtable))==0) paste0("col",as.character(1:length(Rtable[1,])),colSuffix) else colnames(Rtable)
  if(!withRowNames || length(rownames(Rtable))==0) withRowNames <- FALSE
  # append a sentinel row so the flattened values can be split back into columns
  Rtable <- rbind(Rtable,"linesep")
  # build one long string of q column definitions, colname:("v1";"v2";...); and
  # record its length minus the trailing sentinel text
  charnum <- as.integer(nchar(thestr <- paste(paste0(theColnames,':("',str_split(paste(Rtable,collapse='";"'),';\"linesep\";\"')[[1]],');'),collapse="")) - 11)
  # send the table definition to q, optionally prepending the row names as a symbol column
  if(withRowNames)
    ex2(Qname,":([]",Qname,str_replace_all(paste0("`",paste(rownames(Rtable),collapse="`"))," ","_"),";",.Internal(substr(thestr,1L,charnum)),"))") else
    ex2(Qname,":([]",.Internal(substr(thestr,1L,charnum)),"))")
}
> bigMat <- matrix(runif(1500000),nrow=100000,ncol=15)
> microbenchmark(RToQTable(bigMat,"Qmat"),times=3)
Unit: seconds
expr min lq mean median uq max neval
RToQTable(bigMat, "Qmat") 10.29171 10.315 10.32766 10.33829 10.34563 10.35298 3
This will work for a matrix, so for a data frame just save a vector containing the type of each column, convert the data frame to a matrix, import the matrix into Q, and then cast each column back to its type.
Note that this algorithm is approximately O(rows * cols^1.1), so if you have any more than about 20 columns you'll need to chop them up into multiple matrices to stay close to O(rows * cols),
but for your example of 100,000 rows and 15 columns it takes 10 seconds, so further optimization may not be necessary.
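A minimal sketch of that data-frame route, assuming the ex2/RToQTable helpers above and a data frame called df (the type map is illustrative and only covers float/integer columns):
qCastChar <- function(col) switch(class(col)[1],
                                  numeric = "F",   # parse strings as q floats
                                  integer = "I",   # parse strings as q ints
                                  NA_character_)   # leave anything else as strings
casts <- vapply(df, qCastChar, character(1))
RToQTable(as.matrix(df), "Qtab", withRowNames = FALSE)  # every column arrives in q as strings
for (nm in names(casts)[!is.na(casts)])                 # cast each numeric column back in q
  ex2("update ", nm, ":\"", casts[nm], "\"$", nm, " from `Qtab")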
A few days ago I asked a question about calling a custom-made function within a loop, which was well resolved by a combination of
eval(parse(text = Function text))
Here is the link: Automatic creation and use of custom made function in R.
This allowed me to work with a for loop and automatically call the function I need from a data frame storing the body of the function to create.
Now I would like to bring the question to the next level. My problem is computation time. I need to evaluate something like 52 indices from a hyperspectral image. This means that in R my hyperspectral image is loaded as a 3D array of 512x512x204 bands.
What I would like to do is run the evaluation of the indices in parallel to reduce the computation time.
Here is a dummy example of what I would like to emulate, but not yet in parallel:
library(fields)  # needed for image.plot
# create a fake array representing my hyperspectral image
HYPR_IMG=array(NA,dim=c(5,3,4))
HYPR_IMG[,,1]=1
HYPR_IMG[,,2]=2
HYPR_IMG[,,3]=3
HYPR_IMG[,,4]=4
image.plot(HYPR_IMG[,,1], zlim=c(0,20))
image.plot(HYPR_IMG[,,2], zlim=c(0,20))
image.plot(HYPR_IMG[,,3], zlim=c(0,20))
image.plot(HYPR_IMG[,,4], zlim=c(0,20))
#create a fake DF for simulating my indices stored in the dataframe
IDXname=c("IDX1","IDX2","IDX3","IDX4")
IDXFunc=c("HYPR_IMG[,,1] + 3*HYPR_IMG[,,2]",
"HYPR_IMG[,,3] + HYPR_IMG[,,2]",
"HYPR_IMG[,,4] + HYPR_IMG[,,2] - HYPR_IMG[,,3]",
"HYPR_IMG[,,1] + HYPR_IMG[,,4] + 4*HYPR_IMG[,,2] + HYPR_IMG[,,3]")
IDX_DF=as.data.frame(cbind(IDXname,IDXFunc))
# that was what I did before
Store_DF=data.frame(NA)
for (i in 1: length(IDX_DF$IDXname)) {
IDX_ID=IDX_DF$IDXname[i]
IDX_Fun_tmp=IDX_DF$IDXFunc[which(IDX_DF$IDXname==IDX_ID)] #use which() for extra care to select the right function
IDXFunc_call=paste("IDXfun_tmp=function(HYPR_IMG){",IDX_Fun_tmp,"}",sep="")
eval(parse(text = IDXFunc_call))
IDX_VAL=IDXfun_tmp (HYPR_IMG)
image.plot(IDX_VAL,zlim=c(0,20)); title(main=IDX_ID)
temp_DF=as.vector(IDX_VAL)
Store_DF=cbind(Store_DF,temp_DF)
names(Store_DF)[i+1] <- as.vector(IDX_ID)
}
My final goal is to have the very same Store_DF, storing all the index values. Here I have a for loop, but using a foreach loop things should speed up. If it matters, I am working on Windows 8 or later as the OS.
Is it really possible?
Will I be able, in the end, to reduce the overall computation time while getting the same Store_DF data frame, or something similar like a matrix with column names?
Thanks a lot!!!
For this specific example, using either the built-in parallelization of a package like data.table or a parallel apply might be more beneficial.
Below is a minimal example of how to achieve the result using parApply from the parallel package. Note that the output is a matrix, which actually yields slightly better performance in base R (not necessarily the case in the tidyverse or data.table). In case the data.frame structure is vital, you will have to convert it with as.data.frame.
cl <- parallel::makeCluster( parallel::detectCores() )
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
IDX_ID <- x[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
names(IDX_VAL) <- IDX_ID
IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
parallel::stopCluster(cl)
Please note the stopCluster(cl), which is important for shutting down any leftover worker R sessions.
Benchmark results (4 tiny cores):
Unit: milliseconds
expr min lq mean median uq max neval
Loop 8.420432 9.027583 10.426565 9.272444 9.943783 26.58623 100
Parallel 1.382324 1.491634 2.038024 1.554690 1.907728 18.23942 100
For replication of the benchmarks, the code is provided below:
cl <- parallel::makeCluster( parallel::detectCores() )
microbenchmark::microbenchmark(
Loop = {
Store_DF=data.frame(NA)
for (i in 1: length(IDX_DF$IDXname)) {
IDX_ID = IDX_DF$IDXname[i]
IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] #use for extra care to select the right function
eval(parse(text = paste0("IDXfun_tmp = function(HYPR_IMG){", IDX_Fun_tmp, "}")))
IDX_VAL = IDXfun_tmp(HYPR_IMG)
#Plotting in parallel is not a good idea. It will most often not work, and might make the R session crash or slow down significantly (at best the latter, at worst the former)
#image.plot(IDX_VAL, zlim = c(0,20)); title(main = IDX_ID)
temp_DF = as.vector(IDX_VAL)
Store_DF = cbind(Store_DF,temp_DF)
names(Store_DF)[i+1] <- as.vector(IDX_ID)
}
rm(Store_DF)
},
Parallel = {
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
IDX_ID <- x[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
names(IDX_VAL) <- IDX_ID
IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
rm(result)
}
)
parallel::stopCluster(cl)
Edit: Using the foreach package
After a few comments about performance issues (maybe due to memory), I decided to illustrate how one could obtain the same result using the foreach package. A few notes:
The foreach package uses iterators. By default it can be used like a for loop, where it will iterate over each column in a data.frame.
Like other parallel implementations in R, if you are on Windows you will often have to export the data used for the calculations. This can sometimes be avoided with some fiddling, and foreach will sometimes let you get away without exporting data; when exactly this is possible is unclear from the documentation.
The output of foreach will be combined either as a list or as defined by the .combine argument, which can be rbind, cbind or any other function.
There are a lot of comments, making the code seem a lot longer than it actually is. Removing comments and blank lines, it is 9 lines longer.
Below is the code which will yield the same output as above. Note I have used the data.table package; for more information about this package I suggest their wiki on GitHub.
cl <- parallel::makeCluster( parallel::detectCores() )
#foreach uses doParallel for the parallelization
doParallel::registerDoParallel(cl)
#To iterate over the rows, we need to use iterators
# if foreach is given a matrix, it will be converted to a column iterator
rowIterator <- iterators::iter(IDX_DF, by = "row")
library(foreach)
result <-
foreach(
#Supply the iterator
row = rowIterator,
#Specify whether the calculations need to be in order. If not, we can get better performance by not requiring it
.inorder = FALSE,
#In most foreach loops you will have to export the data you need for the calculations
# it worked without doing so for me, in which case it is faster if the exported stuff is large
#.export = c("HYPR_IMG"),
#We need to say how the output should be merged. If nothing is given it will be output as a list
#data.table rbindlist is faster than rbind (returns a data.table)
.combine = function(...)data.table::rbindlist(list(...)) ,
#otherwise we could've used:
#.combine = rbind
#if we don't use rbind or cbind (I used data.table::rbindlist for speed)
# we have to say whether it can take more than one argument
.multicombine = TRUE
) %dopar% #%dopar% runs the loop in parallel, %do% sequentially, %:% nests foreach loops
{
IDX_ID <- row[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", row[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(HYPR_IMG))
data.frame(ID = IDX_ID, IDX_VAL)
}
#output is saved in result
result
result_reformatted <- data.table::dcast(result[, indx := 1:.N, by = ID],
                                        indx ~ ID,
                                        value.var = "IDX_VAL")
#if we don't want to use data.table we could use unstack instead
unstack(result, IDX_VAL ~ ID)
I am trying to optimize my R code with parSapply.
I have xmlfile and X as global variables.
When I didn't use clusterExport(cl,"X") and clusterExport(cl,"xmlfile"), I got "xmlfile object was not found".
When I used these two clusterExport calls, I got the error "object of type 'externalptr' is not subsettable".
With a regular sapply it works fine.
Can someone see the problem?
I have this R code:
require("XML")
library(parallel)
setwd("C:/PcapParser")
# A helper function that enables the dynamic addition of new rows and unseen variables to a data.frame
# field is an R XML leaf-node (capturing a field of a protocol)
# X is the current data.frame to which the feature in field should be added
# rowNum is the row (packet) to which the feature should be added. [must be that rowNum <= dim(X)[1]+1]
addFeature <- function(field, X, rowNum)
{
# extract xml name and value
featureName = xmlAttrs(field)['name']
if (featureName == "")
featureName = xmlAttrs(field)['show']
value = xmlAttrs(field)['value']
if (is.na(value) | value=="")
value = xmlAttrs(field)['show']
# attempt to add feature (add rows/cols if necessary)
if (!(featureName %in% colnames(X))) #we are adding a new feature
{
#Special cases
#Bad column names: anything that has the prefix...
badCols = list("<","Content-encoded entity body"," ","\\?")
for(prefix in badCols)
if(grepl(paste("^",prefix,sep=""),featureName))
return(X) #don't include this new feature
X[[featureName]]=array(dim=dim(X)[1]) #add this new feature column with NAs
}
if (rowNum > dim(X)[1]) #we are trying to add a new row
{X = rbind(X,array(dim=dim(X)[2]))} #add row of NA
X[[featureName]][rowNum] = value
return(X)
}
firstLoop<-function(x)
{
packet = xmlfile[[x]]
# Iterate over all protocols in this packet
for (prot in 1:xmlSize(packet))
{
protocol = packet[[prot]]
numFields = xmlSize(protocol)
# Iterate over all fields in this protocol (recursion is not used since the passed dataset is large)
if(numFields>0)
for (f in 1:numFields)
{
field = protocol[[f]]
if (xmlSize(field) == 0) # leaf
X<<-addFeature(field,X,x)
else #not leaf xml element (assumption: there are at most three more steps down)
{
# Iterate over all sub-fields in this field
for (ff in 1:xmlSize(field))
{ #extract sub-field data for this packet
subField = field[[ff]]
if (xmlSize(subField) == 0) # leaf
X<<-addFeature(subField,X,x)
else #not leaf xml element (assumption: there are at most two more steps down)
{
# Iterate over all subsub-fields in this field
for (fff in 1:xmlSize(subField))
{ #extract sub-field data for this packet
subsubField = subField[[fff]]
if (xmlSize(subsubField) == 0) # leaf
X<<-addFeature(subsubField,X,x)
else #not leaf xml element (assumption: there is at most one more step down)
{
# Iterate over all subsubsub-fields in this field
for (ffff in 1:xmlSize(subsubField))
{ #extract sub-field data for this packet
subsubsubField = subsubField[[ffff]]
X<<-addFeature(subsubsubField,X,x) #must be leaf
}
}
}
}
}
}
}
}
}
# Given the path to a pcap file, this function returns a dataframe 'X'
# with m rows that contain data fields extractable from each of the m packets in XMLcap.
# Wireshark must be installed for this to work
raw_feature_extractor <- function(pcapPath){
## Step 1: convert pcap into PDML XML file with wireshark
#to run this line, wireshark must be installed in the location referenced in the pdmlconv.bat file
print("Converting pcap file with Wireshark.")
system(paste("pdmlconv",pcapPath,"tmp.xml"))
## Step 2: load XML file into R
print("Parsing XML.")
xmlfile<<-xmlRoot(xmlParse("tmp.xml"))
## Step 3: Extract all feature into data.frame
print("Extracting raw features.")
X <<- data.frame(num=NA) #first feature is packet number
# Iterate over all packets
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(3)
parSapply (cl,seq(from=1,to=xmlSize(xmlfile),by=1),firstLoop)
print("Done.")
return(X)
}
What am I doing wrong with parSapply? (Maybe something related to the global variables?)
Thank you
So I see a couple of obvious problems with this code. Global variables and functions are not accessible in a parallelized environment unless you explicitly force them in or call them. You need to define your addFeature and raw_feature_extractor functions inside firstLoop. When calling functions from a pre-existing package, you should either load the package as part of firstLoop (bad coding!) or call them explicitly using the package::function notation (good coding!). I suggest looking at the R documentation here on StackOverflow to help you work through creating an appropriately parallelized function.
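To make that concrete, here is a minimal sketch of one way around both errors (not the OP's full parser; parse_packet is a hypothetical stand-in for the field extraction). The 'externalptr' error typically comes from shipping xmlfile, an XML document handle backed by a C-level pointer, to the workers, where it cannot be used. Instead, serialize each packet back to text on the master, let each worker re-parse its own chunk, and have the workers return their results rather than assigning to a global X:
library(parallel)
library(XML)

# hypothetical per-packet worker: re-parse the packet text and extract features
parse_packet <- function(packet_xml) {
  packet <- XML::xmlRoot(XML::xmlParse(packet_xml, asText = TRUE))
  XML::xmlSize(packet)  # placeholder for the real field extraction
}

xmlfile <- XML::xmlRoot(XML::xmlParse("tmp.xml"))
packet_strings <- vapply(XML::xmlChildren(xmlfile), XML::saveXML, character(1))

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(XML))  # make the XML package available on every worker
res <- parSapply(cl, packet_strings, parse_packet)
stopCluster(cl)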
Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
"HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final","HRA_PC_3HRA_Final",
"SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store the quantile value
for (column_name in quant_cols) {
datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty dataframe with the new column names appended
newDatFm <- datFm[0,]
# get the unique values for the hra classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector and create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
for (column_name in quant_cols) {
currDatFm <- within(currDatFm,
{
CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
}
)
}
newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth),]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video that I have read or watched recommends using the 'apply' family of functions over the classic 'for' loop, stating that apply is much faster.
So the first question is: how would you write it using apply (or tapply or some other apply)?
Second, is it really true that apply is much faster than for? The csv file 'yellow_point_02.csv' has approx. 2,500 rows. This script runs almost instantly on my MacBook Pro, which has 16 GB of memory.
Third, see the 'quant_cols' vector? I created it so that I could write a generic loop (for (column_name in quant_cols)), but I could not make it work. So I hard-coded the column names suffixed with '_quantile' and called perc.rank many times. Is there a way this could be made dynamic? I tried the 'paste' approach that is in my script, but that did not work; the kind of dynamic assignment I was aiming for is sketched below.
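A sketch of that dynamic assignment (not part of the working script above):
for (column_name in quant_cols) {
  currDatFm[[paste(column_name, "quantile", sep = "_")]] <- perc.rank(currDatFm[[column_name]])
}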
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.
Problem Description:
I have a big matrix c loaded in RAM. My goal is to have read-only access to it from parallel processes. However, when I create the connections, whether I use doSNOW, doMPI, big.matrix, etc., the amount of RAM used increases dramatically.
Is there a way to properly create shared memory that all the processes may read from, without creating a local copy of all the data?
Example:
libs<-function(libraries){# Installs missing packages and then loads them
for (lib in libraries){
if( !is.element(lib, .packages(all.available = TRUE)) ) {
install.packages(lib)
}
library(lib,character.only = TRUE)
}
}
libra<-list("foreach","parallel","doSNOW","bigmemory")
libs(libra)
#create a matrix of approximately 1GB
c<-matrix(runif(10000^2),10000,10000)
#convert it to bigmatrix
x<-as.big.matrix(c)
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections
cl <- makeCluster(detectCores ())
registerDoSNOW(cl)
out<-foreach(linID = 1:10, .combine=c) %dopar% {
#load bigmemory
require(bigmemory)
# attach the matrix via shared memory??
m <- attach.big.matrix(mdesc)
#dummy expression to test data aquisition
c<-m[1,1]
}
closeAllConnections()
RAM usage: [memory-monitor screenshot omitted] You can see that memory use increases a lot until foreach ends, and only then is it freed.
I think the solution to the problem can be seen from the post of Steve Weston, the author of the foreach package, here. There he states:
The doParallel package will auto-export variables to the workers that are referenced in the foreach loop.
So I think the problem is that in your code your big matrix c is referenced in the assignment c<-m[1,1]. Just try xyz <- m[1,1] instead and see what happens.
Here is an example with a file-backed big.matrix:
#create a matrix of approximately 1GB
n <- 10000
m <- 10000
c <- matrix(runif(n*m),n,m)
#convert it to bigmatrix
x <- as.big.matrix(x = c, type = "double",
separated = FALSE,
backingfile = "example.bin",
descriptorfile = "example.desc")
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections
cl <- makeCluster(detectCores ())
registerDoSNOW(cl)
## 1) No referencing
out <- foreach(linID = 1:4, .combine=c) %dopar% {
t <- attach.big.matrix("example.desc")
for (i in seq_len(30L)) {
for (j in seq_len(m)) {
y <- t[i,j]
}
}
return(0L)
}
## 2) Referencing
out <- foreach(linID = 1:4, .combine=c) %dopar% {
invisible(c) ## c is referenced and thus exported to workers
t <- attach.big.matrix("example.desc")
for (i in seq_len(30L)) {
for (j in seq_len(m)) {
y <- t[i,j]
}
}
return(0L)
}
closeAllConnections()
Alternatively, if you are on Linux/Mac and you want copy-on-write (CoW) shared memory, use forks. First load all your data into the main process, then launch worker processes (forks) with the mcparallel function from the parallel package.
You can collect their results with mccollect, or use truly writable shared memory (see the Edit below on why bigmemory is used here rather than the Rdsm library), like this:
library(parallel)
library(bigmemory) #for shared variables
size <- 1  # length of the shared vector (pick whatever size you need)
shared<-bigmemory::big.matrix(nrow = size, ncol = 1, type = 'double')
shared[1]<-1 #Init shared memory with some number
job<-mcparallel({shared[1]<-23}) #...change it in another forked thread
shared[1,1] #...and confirm that it gets changed
# [1] 23
You can confirm that the value really gets updated in the background if you delay the write:
fn<-function()
{
Sys.sleep(1) #One second delay
shared[1]<-11
}
job<-mcparallel(fn())
shared[1] #Execute immediately after last command
# [1] 23
shared[1] #Execute after one second
# [1] 11
mccollect() #To destroy all forked processes (and possibly collect their output)
To control concurrency and avoid race conditions, use locks:
library(synchronicity) #for locks
m<-boost.mutex() #Lets create a mutex "m"
bad.incr<-function() #This function doesn't protect the shared resource with locks:
{
a<-shared[1]
Sys.sleep(1)
shared[1]<-a+1
}
good.incr<-function()
{
lock(m)
a<-shared[1]
Sys.sleep(1)
shared[1]<-a+1
unlock(m)
}
shared[1]<-1
for (i in 1:5) job<-mcparallel(bad.incr())
shared[1] #You can verify that the value didn't get increased 5 times, due to race conditions
mccollect() #To clear all threads, not to get the values
shared[1]<-1
for (i in 1:5) job<-mcparallel(good.incr())
shared[1] #As expected, eventually after 5 seconds of waiting you get 6
#[1] 6
mccollect()
Edit:
I simplified the dependencies a bit by swapping Rdsm::mgrmakevar for bigmemory::big.matrix. mgrmakevar internally calls big.matrix anyway, and we don't need anything more.
Seemingly simple question, but I don't know the loop syntax and variable assignment in R very well. I have a 6900-line table that I want split into 10 equally sized text files. My code is below, but how would I design a loop around it and iterate through the filenames?
write.table(clipboard[1:619,1],
"mydata1.txt",sep="\t")
write.table(clipboard[619:1238,1],
"mydata2.txt",sep="\t")
write.table(clipboard[1238:1857,1],
"mydata3.txt",sep="\t")
write.table(clipboard[1857:2476,1],
"mydata4.txt",sep="\t")
write.table(clipboard[2476:3095,1],
"mydata5.txt",sep="\t")
write.table(clipboard[3095:3714,1],
"mydata6.txt",sep="\t")
write.table(clipboard[3714:4333,1],
"mydata7.txt",sep="\t")
write.table(clipboard[4333:4952,1],
"mydata8.txt",sep="\t")
write.table(clipboard[4952:5571,1],
"mydata9.txt",sep="\t")
write.table(clipboard[5571:6190,1],
"mydata10.txt",sep="\t")
The manual way
I guess it is not such an issue to use a loop for IO:
for (i in 1:10) {
start <- 1 + (i-1) * nrow(clipboard) / 10
end <- i * nrow(clipboard) / 10
fname <- paste("mydata", i ,".txt", sep="")
write.table(x=clipboard[start:end, 1], file=fname, sep="\t")
}
Note that this assumes that it can actually be separated into 10 equally sized files!
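If the row count does not divide evenly, a quick sketch using split() and cut() (just an alternative to the fuller write.split below) would be:
idx <- split(seq_len(nrow(clipboard)), cut(seq_len(nrow(clipboard)), 10, labels = FALSE))
for (i in seq_along(idx))
  write.table(clipboard[idx[[i]], 1], paste0("mydata", i, ".txt"), sep = "\t")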
Done properly, write.split:
This method will actually (when the data are not perfectly divisible) create an extra file for the remainder.
I used this splitter to create a list of chunks that is then used in parallel for some statistical computations in my package correlate. Here, it means we would be able to write the files in parallel. Note that this is pointless for small files, and maybe even slower.
# Helper to split the data in chunks
splitter <- function(x, splitsize) {
nr <- nrow(x)
if (splitsize > nr) {
splitsize <- nr
}
splits <- floor(nr / splitsize)
splitted.list <- lapply(split(x[seq_len(splits*splitsize), ],
seq_len(splits)), function(x) matrix(x, splitsize))
if (nr %% splitsize != 0) {
splitted.list$last <- x[(splits * splitsize + 1):nr, ]
}
return(splitted.list)
}
write.split <- function(x, chunks, file.prefix, file.extension, cores = 1, ...) {
splitsize <- nrow(x) / chunks
splitted.list <- splitter(x, splitsize)
if (cores == 1) {
sapply(names(splitted.list), function(z)
write.table(splitted.list[z],
file = paste(file.prefix, z, file.extension, sep=""),
...))
} else {
# currently just the simple linux version; this won't work on Windows.
# Upon request I'll add it
stopifnot(require(parallel))
mclapply(names(splitted.list), function(z)
write.table(splitted.list[z],
file = paste(file.prefix, z, file.extension, sep=""),
...))
}
}
Usage:
write.split(z, chunks = 10,
file.prefix = "mydata", file.extension = ".txt", sep="\t")
You can also give it the row.names and col.names arguments, basically anything that can be passed to write.table.
Benchmark:
Using `matrix(1:1000000, 1000)` as data.
Unit: seconds
expr min lq median uq max neval
1-core 1.780022 1.990751 2.079907 2.166891 2.744904 100
4-cores 1.305048 1.438777 1.492114 1.559110 2.070911 100
Extensibility:
It could also easily be extended to accept a number of lines to write per file rather than the number of chunks.
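For instance, a small wrapper along those lines (hypothetical name, reusing the splitter above) could look like:
write.split.lines <- function(x, lines.per.file, file.prefix, file.extension, ...) {
  splitted.list <- splitter(x, lines.per.file)
  sapply(names(splitted.list), function(z)
    write.table(splitted.list[z],
                file = paste(file.prefix, z, file.extension, sep = ""),
                ...))
}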