Find unused character(s) in string - r

For a library call I have to provide a separator, which must not occur in the text, because otherwise the library call gets confused.
Now I was wondering how I can adapt my code to assure that the separator I use is guaranteed not to occur in the input text.
I am solving this issue with a while loop: I make a (hardcoded) assumption about the most unlikely string in the input, check whether it is present, and if so, just enlarge the string. This works but feels very hackish, so I was wondering whether there is a more elegant version (e.g. an existing base R function, or a loop-free solution) which does the same for me? Ideally the found separator is also minimal in length.
I could simply hardcode a large enough set of potential separators and look for the first one not occurring in the text, but this may also break at some point if all of these separators happen to occur in my input.
The reasoning is that even if it will never happen (well, never say never), I am afraid that in some distant future there will be this one input string which requires thousands of while-loop iterations before finding an unused string.
input_string <- c("a/b", "a#b", "a//b", "a-b", "a,b", "a.b")
orig_sep <- sep <- "/" ## first guess as a separator
while(any(grepl(sep, input_string, fixed = TRUE))) {
sep <- paste0(sep, orig_sep)
}
print(sep)
# "///"

In case a single ASCII character will do as separator, you can use table.
tt <- table(factor(strsplit(paste(input_string, collapse = ""), "")[[1]]
, rawToChar(as.raw(32:126), TRUE)))
names(tt)[tt==0]
rawToChar(as.raw(32:126), TRUE) gives you all printable ASCII characters, which are used as factor levels, and table counts the occurrences of each. Every character with a count of 0 does not occur in the input and can be used.
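For the example input this yields, e.g.:
head(names(tt)[tt == 0])
# [1] " "  "!"  "\"" "$"  "%"  "&"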
In case you need 2 ASCII characters, you can try the following, which returns all possible delimiters:
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
Or for n ASCII characters:
orig_sep <- x <- rawToChar(as.raw(32:126), TRUE)
sep <- x[0]
repeat {
sep <- x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
if(length(sep) > 0) break;
x <- c(outer(x, orig_sep, paste0))
}
sep
Search for 1-2 ASCII characters with only one sapply loop, taking the separator with minimal length:
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(x, outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})][1]
#[1] " "
In case you want to know how many times a character needs to be repeated to work as a separator, as you do in the question, you can use gregexpr.
strrep("/", max(sapply(gregexpr("/*", input_string)
, function(x) max(attributes(x)$match.length)))+1)
#[1] "///"
strrep("/", max(c(0, sapply(gregexpr("/+", input_string)
, function(x) max(attributes(x)$match.length))))+1)
#[1] "///"

I made some benchmarks, and the sad news is that the regex solution only pays off if the separator occurs in the input string a lot. I don't expect long repetitions of the separator, so from that perspective the while solution should be preferable, but it would be the first time in my R life that I actually had to rely on a while construct.
Code
library(microbenchmark)
sep <- "/"
make_input <- function(max_occ, vec_len = 1000) {
paste0("A", strrep(sep, sample(0:max_occ, vec_len, TRUE)))
}
set.seed(1)
no_occ <- make_input(0)
typ_occ <- make_input(1)
mid_occ <- make_input(10)
high_occ <- make_input(100)
while_fun <- function(in_str) {
my_sep <- sep
while(any(grepl(my_sep, in_str, fixed = TRUE))) {
my_sep <- paste0(my_sep, sep)
}
my_sep
}
greg_fun <- function(in_str) {
strrep(sep,
max(sapply(gregexpr(paste0(sep, "+"), in_str),
purrr::attr_getter("match.length")), 0) + 1)
}
microbenchmark(no_occ_w = while_fun(no_occ),
no_occ_r = greg_fun(no_occ),
typ_occ_w = while_fun(typ_occ),
typ_occ_r = greg_fun(typ_occ),
mid_occ_w = while_fun(mid_occ),
mid_occ_r = greg_fun(mid_occ),
high_occ_w = while_fun(high_occ),
high_occ_r = greg_fun(high_occ))
Results
Unit: microseconds
       expr    min      lq     mean  median      uq    max neval cld
   no_occ_w   12.3   13.30   15.947   14.60   16.55   51.1   100  a
   no_occ_r 1074.8 1184.90 1981.637 1253.45 1546.20 7037.9   100   b
  typ_occ_w   33.8   36.00   42.842   38.55   41.45  229.2   100  a
  typ_occ_r 1090.4 1192.15 2090.526 1283.80 1547.10 8490.7   100   b
  mid_occ_w  277.9  283.35  336.466  288.30  309.45 3452.2   100  a
  mid_occ_r 1161.6 1269.50 2204.213 1368.45 1789.20 7664.7   100   b
 high_occ_w 3736.4 3852.95 4082.844 3962.30 4097.60 6658.3   100    d
 high_occ_r 1685.5 1776.15 2819.703 1868.10 4065.00 7960.9   100   c

Related

Converting unit abbreviations to numbers

I have a dataset that abbreviates numerical values in a column. For example, 12M means 12 million and 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?
I've thought about using gsub to convert M to 000,000 etc., but that does not take into account the decimals (1.5M would then be 1.5000000).
So you want to translate SI unit abbreviations ('K','M',...) into exponents, and thus numerical powers-of-ten.
Given that all units are single-letter, and the exponents are uniformly-spaced powers of 10**3, here's working code that handles 'Kilo'...'Yotta', and any future exponents:
> 10 ** (3*as.integer(regexpr('T', 'KMGTPEY')))
[1] 1e+12
Then just multiply that power-of-ten by the decimal value you have.
Also, you probably want to detect and handle the 'no-match' case for unknown letter prefixes; otherwise regexpr returns -1 and you'd get a nonsensical 10**(-1*3):
unit_to_power <- function(u) {
  pos <- as.integer(regexpr(u, 'KMGTPEY'))  # -1 when the prefix is unknown
  return (if (pos > 0) 10**(pos * 3) else 1)
}
Now if you want to case-insensitively match both 'k' and 'K' to Kilo (as computer people often write, even though it's technically an abuse of SI), you'll need to special-case, e.g. with an if-else ladder/expression. (SI prefixes are case-sensitive in general: 'M' means 'Mega' but 'm' strictly means 'milli', even if disk-drive users say otherwise; upper-case is conventionally for positive exponents.) So for a few prefixes, @DanielV's case-specific code is better.
If you want negative SI prefixes too, use as.integer(regexpr(u, 'zafpnum#KMGTPEY') - 8), where # is just some throwaway character to keep uniform spacing; it shouldn't actually get matched. Again, handling non-power-of-10**3 units like 'deci' or 'centi' will require special-casing, or the general dict-based approach WeNYoBen uses.
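A minimal sketch of that signed variant (si_power is a hypothetical helper name, not from the original answer):
si_power <- function(u) {
  pos <- as.integer(regexpr(u, "zafpnum#KMGTPEY", fixed = TRUE))
  # pos - 8 is negative for the small prefixes and positive for the large ones;
  # no match (-1) and the '#' placeholder (8) both fall back to 1
  if (pos > 0 && pos != 8) 10^(3 * (pos - 8)) else 1
}
si_power("m")  # 0.001
si_power("G")  # 1e+09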
base::regexpr is not vectorized, and its performance is poor on big inputs, so if you want vectorization and higher performance, use stringr::str_locate.
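For instance, a sketch of the vectorized lookup with stringr (assuming single-letter prefixes):
library(stringr)
units <- c("K", "T", "M")
pos <- str_locate("KMGTPEY", fixed(units))[, "start"]  # vectorized over 'units'
10^(3 * pos)
# [1] 1e+03 1e+12 1e+06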
Give this a shot:
Text_Num <- function(x){
if (grepl("M", x, ignore.case = TRUE)) {
as.numeric(gsub("M", "", x, ignore.case = TRUE)) * 1e6
} else if (grepl("k", x, ignore.case = TRUE)) {
as.numeric(gsub("k", "", x, ignore.case = TRUE)) * 1e3
} else {
as.numeric(x)
}
}
In your case you can use gsubfn:
a=c('12M','1.2k')
dict<-list("k" = "e3", "M" = "e6")
as.numeric(gsubfn::gsubfn(paste(names(dict),collapse="|"),dict,a))
[1] 1.2e+07 1.2e+03
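Sorting, as asked for in the question, then follows directly from the numeric values:
vals <- as.numeric(gsubfn::gsubfn(paste(names(dict), collapse = "|"), dict, a))
a[order(vals)]
# [1] "1.2k" "12M"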
I am glad to meet you.
I wrote another answer
Define function
res = function (x) {
  result = suppressWarnings(as.numeric(x))  # NA (without the coercion warning) if x has a unit suffix
  if(is.na(result)){
    text = gsub("k", "*1e3", x, ignore.case = T)
    text = gsub("m", "*1e6", text, ignore.case = T)
    result = eval(parse(text = text))
  }
  return(result)
}
Result
> res("5M")
[1] 5e+06
> res("4K")
[1] 4000
> res("100")
[1] 100
> res("4k")
[1] 4000
> res("1e3")
[1] 1000
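Since res() works on a single value, apply it element-wise when sorting a whole column, e.g.:
x <- c("12M", "1.2k", "100", "5M")
x[order(sapply(x, res))]
# [1] "100"  "1.2k" "5M"  "12M"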

Find sequences of elements in vectors

I need some pointers on this. Actually, I don't necessarily need a fully-fledged solution here - some pointers to functions and/or packages would be great.
The problem: I want to find specific sequences in a character vector. The sequences can be somewhat "underspecified". That means that some of the elements should be fixed, but for some elements it does not matter how long they are or what they are exactly.
An example: Suppose I want to find the following pattern in a character vector:
The sequence should begin with "Out of" or "out of"
The sequence should end with "reasons"
In between, there may be other elements. It does not matter how many elements there are (zero is also OK) or what exactly they are.
In between 1. and 2., there shouldn't be a ".", "!" or "?".
There should be a parameter that controls the maximum length the sequence in 3. can have and still produce a result.
Return value of the function should be the intervening elements and/or their indices in the vector.
So, the function should "behave" like this:
c("Out", "of", "specific", "reasons", ".") Return "specific"
c("Out", "of", "very", "specific", "reasons", ".") Return c("very", "specific")
c("out", "of", "curiosity", ".", "He", "had", "his", "reasons") Return "" or NA or NULL, which one doesn't matter - just a signal that there is no result.
As I said: I don't need a full solution. Any pointers to packages that already implement such functionality are appreciated!
Ideally, I don't want to rely on a solution that first pastes the text and then uses regex for matching.
Thanks a lot!
I would be really curious to learn of a package that serves your needs. My inclination would be to collapse the strings and use regular expressions, or find a programmer, or use Perl. But here's one extensible solution in R with a few more cases to experiment on. Not very elegant, but see if this has some utility.
# Recreate data as a list with a few more edge cases
txt1 <- c(
"Out of specific reasons.",
"Out of very specific reasons.",
"Out of curiosity. He had his reasons.",
"Out of reasons.",
"Out of one's mind.",
"For no particular reason.",
"Reasons are out of the ordinary.",
"Out of time and money and for many good reasons, it seems.",
"Out of a box, a car, and for random reasons.",
"Floop foo bar.")
txt2 <- strsplit(txt1, "[[:space:]]+") # split on whitespace
txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE) # split off punctuation
txt <- lapply(txt3, unlist) # create list of tokens from each line
# Define characters to exclude: [. ! and ?] but not [,]
exclude <- "[.!?]"
# Assign acceptable limit to separation
lim <- 5 # try 7 and 12 to experiment
# Create indices identifying each of the enumerated conditions
fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
index1 <- lapply(txt, fun1, "out")
index2 <- lapply(txt, fun1, "of")
index3 <- lapply(txt, fun1, "reasons")
index4 <- lapply(txt, fun1, exclude)
# Create logical vectors from indices satisfying the conditions
fun2 <- function(set, val) val[1] %in% set
cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)
cond2 <- sapply(index3, "[", 1) < lim + 2 + 2 # position of 'of' + 2
cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf)
# Combine logical vectors to a single logical vector
valid <- cond1 & cond2 & cond3
valid <- ifelse(is.na(valid), FALSE, valid)
# Examine selected original lines
print(txt1[valid])
# Helper function to extract the starting and the ending element
fun3 <- function(index2, index3, valid) {
  found <- rep(list(NULL), length(index2))
  found[valid] <- Map(seq, index2[valid], index3[valid])
  found <- lapply(found, tail, -1) # drop the 'of' position
  found <- lapply(found, head, -1) # drop the 'reasons' position
  found
}
# Extract starting and ending element from valid list members
idx <- fun3(index2, index3, valid)
# Return the results or "" for no intervening text or NULL for no match
ans <- Map(function(x, i) {
if (is.null(i)) NULL # no match found
else if (length(i) == 0) "" # no intervening elements
else x[i]}, # all intervening elements <= lim
txt, idx)
# Show found (non-NULL) values
ans[!sapply(ans, is.null)]
So let's assume your example
x <- c("Out", "of", "very", "specific", "reasons", ".")
We first need to get the index of the beginning marker
i_Beginning <- as.numeric(grep("Out|out", x))
and the ending
i_end <- as.numeric(grep("reasons", x))
We also need to check that "Out" is followed by "of"
Is_Of <- grepl("Of|of", x[i_Beginning +1])
And if this is true, we extract the elements in between:
if(Is_Of){
  extraction <- x[(i_Beginning + 2):(i_end - 1)]
}
print(extraction)
# [1] "very"     "specific"

The difference between for loops in R and Python

My function has to turn all uppercases in a given string to lowercases and vice versa. I used to solve such problems with loops. So, my code is:
mirror_case <- function(x){
  for(i in x){
    ifelse(i == toupper(i),
           x <- str_replace_all(x, i, tolower(i)),
           ifelse(i == tolower(i),
                  x <- str_replace_all(x, i, toupper(i)),
                  x <- gsub(i, i, x)))
  }
  return(x)
}
I checked this on several strings. Sometimes it works and sometimes it doesn't.
> d
[1] "LKJLjlei 33"
> mirror_case(d)
[1] "LKJLjlei 33"
> e
[1] "asddf"
> mirror_case(e)
[1] "ASDDF"
> f
[1] "ASDDF"
> mirror_case(f)
[1] "asddf"
So, what's wrong with this function?
I'd like not only to get the answer, but also some explanations to understand the problem and not come back here with the similar question.
A string in R is not a sequence like it is in Python, and cannot be traversed in a for loop like this. You should break the string into individual characters first. Try this:
mirror_case <- function(s) {
# break to characters
chars <- strsplit(s, '')
# apply your ifelse statement to all characters
mirror_chars <- sapply(chars, function(i)
ifelse(toupper(i) == i, tolower(i), toupper(i)))
# join back to a string
mirror_s <- paste(mirror_chars, collapse = "")
return(mirror_s)
}
mirror_case("LKJLjlei 33")
# [1] "lkjlJLEI 33"
@YosiHammer's solution does not need an sapply call (which is a loop) to run on the one-item list from strsplit. As @李哲源 shows in the comments, like gsub and paste, even ifelse, toupper(), and tolower() are vectorized functions and can receive multiple items in one call.
mirror_case <- function(s) {
chars <- strsplit(s, '')[[1]] # RETRIEVE THE CHARACTER VECTOR
mirror_chars <- ifelse(toupper(chars) == chars, tolower(chars), toupper(chars))
mirror_s = paste(mirror_chars, collapse = "")
return(mirror_s)
}
mirror_case("LKJLjlei 33")
# [1] "lkjlJLEI 33"
mirror_case("AbCdEfGhIj")
# [1] "aBcDeFgHiJ"
A simple solution to this problem is to use the chartr function:
chartr("[A-Za-z]", "[a-zA-Z]", "bbBB 122")
The function is vectorized:
chartr("[A-Za-z]", "[a-zA-Z]", c("bbBB 122", "QwER 12 bB"))
Another option is to pass a function to str_replace_all, but this is sub-optimal, as can be seen from the benchmarks.
library(stringr)
str_replace_all(c("bbBB 122", "QwER 12 bB"),
"[A-Za-z]",
function(x)
ifelse(toupper(x) == x, tolower(x), toupper(x)))
benchmark:
data will be 100,000 ten-character strings:
dat <- as.vector(
replicate(1e5,
paste0(sample(c(LETTERS,
letters,
" ",
as.character(1:9)),
10,
replace = TRUE),
collapse = "")
))
head(dat)
#output
"aPJAGOiirN" "FSYN DLYQS" "K7Vzh8qALH" "vQzU96JOVF" "WMmqO1D3Q8" "XdBiTG72zV"
functions proposed in other posts (not vectorized):
mirror_case <- function(s) {
chars <- strsplit(s, '')[[1]] # RETRIEVE THE CHARACTER VECTOR
mirror_chars <- ifelse(toupper(chars) == chars, tolower(chars), toupper(chars))
mirror_s = paste(mirror_chars, collapse = "")
return(mirror_s)
}
mirror.case <- function(s) {
# break to characters
chars <- strsplit(s, '')
# apply your ifelse statement to all characters
mirror_chars <- sapply(chars, function(i)
ifelse(toupper(i) == i, tolower(i), toupper(i)))
# join back to a string
mirror_s <- paste(mirror_chars, collapse = "")
return(mirror_s)
}
library(microbenchmark)
microbenchmark(missuse = chartr("[A-Za-z]", "[a-zA-Z]", dat),
               missuse2 = str_replace_all(dat,
                                          "[A-Za-z]",
                                          function(x)
                                            ifelse(toupper(x) == x, tolower(x), toupper(x))),
               Parfait = lapply(dat, mirror_case),
               YosiHammer = lapply(dat, mirror.case),
               times = 10)
results
Unit: milliseconds
       expr          min          lq        mean      median          uq         max neval
    missuse     9.607483    11.05621    18.48764    16.50272    19.06369    39.65646    10
   missuse2 11226.900565 11473.40730 11612.95776 11582.65838 11636.32779 12218.78642    10
    Parfait  1461.056405  1572.58683  1700.75182  1594.43438  1746.08949  2149.49213    10
 YosiHammer  1526.730674  1576.35174  1649.55893  1607.62199  1670.76008  1843.11601    10
As you can see, the chartr method is around 100x faster than the mirror-case solutions and over 600x faster than the str_replace_all approach.
Check equality of results:
all.equal(chartr("[A-Za-z]", "[a-zA-Z]", dat),
unlist(lapply(dat, mirror_case)))
all.equal(chartr("[A-Za-z]", "[a-zA-Z]", dat),
unlist(lapply(dat, mirror.case)))
all.equal(chartr("[A-Za-z]", "[a-zA-Z]", dat),
str_replace_all(dat,
"[A-Za-z]",
function(x)
ifelse(toupper(x) == x, tolower(x), toupper(x))))

Regex to extract string from a file path R?

I have the following file path with a uuid encapsulated within its path:
"~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
I want to extract it using regular expression.
I know that I could just unlist(strsplit(string, "/")) and take the 7th element, but that seems too slow and inefficient to me.
Here is what I have tried so far:
\w{16}
I keep trying to play with this; please advise.
I want to extract the uuid: 0b27ea5fad61c99d
Here's a slightly hacky but compact and regex-free solution:
basename(dirname(x))
#[1] "0b27ea5fad61c99d"
Where
x <- "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
EDIT: As per @Onyambu's comment, adding the following solution too now.
sub(".*/(.*)/[^/]+$","\\1",val)
Could you please try the following gsub function of base R and let me know if this helps you.
gsub("([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/][0-9a-z]+)/(.*)","\\6",val)
Explanation: Here is a brief explanation of above snippet.
([^/]*): Matching everything from the start up to a / and keeping it in the first place holder of memory.
/: Mentioning / then.
Repeating the above 2 steps 5 times in total selects the 6th field, which is matched by ([^/][0-9a-z]+); then /(.*) takes all the rest of the match into the 7th memory place.
"\\6": Now substituting whole value of variable val with only 6th memory place which is actually required by OP to get the desired results.
Have you conducted some benchmarks regarding timing? I think your own solution already performs quite well, especially with the minor improvement of introducing fixed = T. See the timings below. Why start a complicated regex search when you already know the exact symbols where to split your string...
Update with respect to comments: The vectorized versions show that f2 does not have the best, but still acceptable, performance. But as indicated in the comments, vectorized regex approaches will usually perform better with increasing length of the vector - and of course, they are more flexible if you have less knowledge of the structure of your directory names.
Update 2: If anyone is still interested, I have updated the function f2 by using a better way for accessing sub-elements of the lists. This now makes it the fastest approach for the specific example, at least, for a benchmark of 500 items.
library(microbenchmark)
library(stringi)
string = "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
string = rep(string, 500)
f1 = function(x) sapply(strsplit(x, "/"), `[[`, 7)
f2 = function(x) sapply(strsplit(x, "/", fixed = T), `[[`, 7)
f2b = function(x) sapply(stri_split_fixed(x, "/"), `[[`, 7)
f3 = function(x) stri_extract_first_regex(x, "(?=[a-f0-9]+[a-f])(?=[a-f0-9]+[0-9])([a-f0-9]{16})")
f4 = function(x) sapply(x, function(y) tail(unlist(strsplit(dirname(y), "/")), 1), USE.NAMES = FALSE)
f5 = function(x) basename(dirname(x))
f6 = function(x) gsub("([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/][0-9a-z]+)/(.*)","\\6",x)
f7 = function(x) sub("^.*/(.*)$", "\\1", dirname(x))
f8 = function(x) sub(".*/(.*)/[^/]+$","\\1",x)
bm = microbenchmark(
(a = f1(string))
,(b = f2(string))
, (b2 = f2b(string))
,(c = f3(string))
, (d = f4(string))
, (e = f5(string))
, (f = f6(string))
, (g = f7(string))
, (h = f8(string))
, times = 25)
bm
# Unit: microseconds
#                expr         min          lq         mean      median          uq         max neval
#    (a = f1(string))    1894.017    1947.307    2083.6390    2072.444    2142.709    2896.684    25
#    (b = f2(string))     532.520     575.153     605.7698     592.917     630.813     823.451    25
#  (b2 = f2b(string))     545.152     569.232     617.1387     606.733     637.129     778.450    25
#    (c = f3(string))     855.426     894.112     953.5931     946.614     999.511    1286.890    25
#    (d = f4(string)) 2497889.661 2538700.607 2604673.5850 2602081.839 2654385.172 2820226.019    25
#    (e = f5(string))    4686.881    4935.573    5087.7735    5155.450    5201.240    5544.674    25
#    (f = f6(string))    5991.532    6357.861    6750.8284    6584.054    6886.039    9232.438    25
#    (g = f7(string))    4313.840    4462.661    4770.6780    4696.749    4900.046    6442.733    25
#    (h = f8(string))    2328.637    2422.193    2620.5163    2606.542    2660.229    3697.239    25
all(all.equal(a, b)
,all.equal(a, c)
,all.equal(a, d)
,all.equal(a, e)
,all.equal(a, f)
,all.equal(a, g)
,all.equal(a, b2)
,all.equal(a, h)
)
# TRUE
You might match a forward slash, then use 2 positive lookaheads (?=...) to assert that what follows contains at least one [a-f] and at least one [0-9]. Then capture ([a-f0-9]{16}) in a group:
/(?=[a-f0-9]+[a-f])(?=[a-f0-9]+[0-9])([a-f0-9]{16})
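For instance with stringi (the same pattern, minus the leading slash, is used as f3 in the benchmarks above):
library(stringi)
x <- "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
stri_extract_first_regex(x, "(?=[a-f0-9]+[a-f])(?=[a-f0-9]+[0-9])([a-f0-9]{16})")
# [1] "0b27ea5fad61c99d"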
You may use regular expression:
(?:[^\/]+\/){5}(\w+)
(?: Start of non capturing group.
[^\/]+\/ Anything except a forward slash /, followed by forward slash /.
) Close non capturing group.
{5} Match exactly five occurrences of the preceding pattern.
(\w+) Capturing group. Capture alphanumeric characters greedily.
The substring of interest is contained in the capture group.
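For example, a minimal sketch applying the pattern with base R's sub (anchored, and with perl = TRUE for the PCRE groups):
x <- "~/My_Files/F0/F1/F2/0b27ea5fad61c99d/0b27ea5fad61c99d/2015-04-1-04-25-12-925"
sub("^(?:[^/]+/){5}(\\w+).*$", "\\1", x, perl = TRUE)
# [1] "0b27ea5fad61c99d"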

Writing a table to multiple files in R

Seemingly simple question, but I don't know how the loop syntax and variable assignments work in R very well. I have a 6900 line table that I want parsed into 10 equal sized text files. My code is below, but how would I design a loop around it and iterate through the filenames?
write.table(clipboard[1:619,1],
"mydata1.txt",sep="\t")
write.table(clipboard[619:1238,1],
"mydata2.txt",sep="\t")
write.table(clipboard[1238:1857,1],
"mydata3.txt",sep="\t")
write.table(clipboard[1857:2476,1],
"mydata4.txt",sep="\t")
write.table(clipboard[2476:3095,1],
"mydata5.txt",sep="\t")
write.table(clipboard[3095:3714,1],
"mydata6.txt",sep="\t")
write.table(clipboard[3714:4333,1],
"mydata7.txt",sep="\t")
write.table(clipboard[4333:4952,1],
"mydata8.txt",sep="\t")
write.table(clipboard[4952:5571,1],
"mydata9.txt",sep="\t")
write.table(clipboard[5571:6190,1],
"mydata10.txt",sep="\t")
The manual way
I guess it's not such an issue to use a loop for IO:
for (i in 1:10) {
start <- 1 + (i-1) * nrow(clipboard) / 10
end <- i * nrow(clipboard) / 10
fname <- paste("mydata", i ,".txt", sep="")
write.table(x=clipboard[start:end, 1], file=fname, sep="\t")
}
Note that this assumes that it can actually be separated into 10 equally sized files!
Done properly, write.split:
This method will actually (when not perfectly divisible) create an extra file for the remainder.
I used this splitter to create a list of data that will then be used in parallel for some statistical computations in my package correlate. Here, it actually means we would be able to write the files in parallel. Note that this is pointless for small files; maybe even slower.
# Helper to split the data in chunks
splitter <- function(x, splitsize) {
  nr <- nrow(x)
  if (splitsize > nr) {
    splitsize <- nr
  }
  splits <- floor(nr / splitsize)
  # note: split() recycles the grouping vector over the (column-major) matrix
  # elements, so rows are distributed round-robin and reassembled as matrices
  splitted.list <- lapply(split(x[seq_len(splits * splitsize), ],
                                seq_len(splits)), function(x) matrix(x, splitsize))
  if (nr %% splitsize != 0) {
    splitted.list$last <- x[(splits * splitsize + 1):nr, ]
  }
  return(splitted.list)
}
write.split <- function(x, chunks, file.prefix, file.extension, cores = 1, ...) {
  splitsize <- nrow(x) / chunks
  splitted.list <- splitter(x, splitsize)
  if (cores == 1) {
    sapply(names(splitted.list), function(z)
      write.table(splitted.list[[z]],
                  file = paste(file.prefix, z, file.extension, sep = ""),
                  ...))
  } else {
    # currently just the simple Linux version; this won't work on Windows.
    # Upon request I'll add it
    stopifnot(require(parallel))
    mclapply(names(splitted.list), function(z)
      write.table(splitted.list[[z]],
                  file = paste(file.prefix, z, file.extension, sep = ""),
                  ...))
  }
}
Usage:
write.split(z, chunks = 10,
file.prefix = "mydata", file.extension = ".txt", sep="\t")
You can also give it the row.names and col.names arguments, basically anything that can be passed to write.table.
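For example, a quick usage sketch on toy data (the splitter reshapes with matrix(), so matrix input is assumed):
m <- matrix(1:200, nrow = 100)
write.split(m, chunks = 10,
            file.prefix = "mydata", file.extension = ".txt",
            sep = "\t", row.names = FALSE)
# writes mydata1.txt ... mydata10.txt, 10 rows each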
Benchmark:
Using `matrix(1:1000000, 1000)` as data.
Unit: seconds
    expr      min       lq   median       uq      max neval
  1-core 1.780022 1.990751 2.079907 2.166891 2.744904   100
 4-cores 1.305048 1.438777 1.492114 1.559110 2.070911   100
Extensibility:
It could also easily be extended by allowing the number of lines per file to be given rather than the number of chunks, as sketched below.
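A sketch of that extension, reusing splitter with the line count as the split size (write.split.lines is a hypothetical name):
write.split.lines <- function(x, lines_per_file, file.prefix, file.extension, ...) {
  splitted.list <- splitter(x, lines_per_file)
  for (z in names(splitted.list)) {
    write.table(splitted.list[[z]],
                file = paste(file.prefix, z, file.extension, sep = ""),
                ...)
  }
}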
