Best way to count unique elements in a string in R

I'm still a beginner in R and I have a question!
I have a data frame of 222,000 observations and I'm interested in a specific column named id. The problem is that a single string can contain several ids separated by a ',' and I want to count the unique elements in each string (I mean in each string of the first data frame).
For example:
id results
0000001,0000003 2
0000002,0000002 1
0010001,0001006,0010001 2
I used the function 'str_split_fixed' to separate all the ids in the same string and put the result in a new data frame (so now each cell holds either a single id or nothing). The problem is that there can be as many as 68 ',' so the new data frame is huge, with 68 columns and 220,000 observations, and it takes a long time (maybe 15 seconds). After that I used an apply function to find the unique values.
Does someone know a more efficient way or have an idea?
Finally, I used the following code:
sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
But I get an error message:
Error in textConnection(text, encoding = "UTF-8") :
invalid 'text' argument
6 textConnection(text, encoding = "UTF-8")
5 scan(text = x, what = "", sep = ",", quiet = TRUE)
4 unique(scan(text = x, what = "", sep = ",", quiet = TRUE))
3 FUN(X[[i]], ...)
2 lapply(X = X, FUN = FUN, ...)
1 sapply(id, function(x) length(unique(scan(text = x,
what = "", sep = ",", quiet = TRUE))))
My R version is:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 plyr_1.8.3
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.2 Rcpp_0.12.2 stringi_1.0-1
>
I've tried this: Encoding(id) <- "UTF-8"
But the result is:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8")
and the output of dput(id) ends like this:
[9987,] "2320212,2320230"
[9988,] "4530090,4530917"
[9989,] "8532412"
[9990,] "4560292"
[9991,] "4540375"
[9992,] "3311324"
[9993,] "4540030"
[9994,] "9010000"
[9995,] "2811810"
[9996,] "3311000"
[9997,] "4540030"
[9998,] "4540215"
[9999,] "1541201"
[10000,] "2423810"
[ reached getOption("max.print") -- omitted 90000 rows ]
The output is huge, so I post just the end and the first line:
[9002,] "9460000"
and for dput( head(data$id) ):
"9460000,9433000", "9460000,9436000", "9460000,9437000",
"9510000", "9510010", "9510030", "9510090", "9910000", "9910020",
"9910040", "9910090", "D", "FIELD_NOT_FOUND", "I"), class = "factor")
Thanks in advance, Jef

sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
# --- result: first typed line is 'names' of the items, not the results.
1 2,3,4 1,1
1 3 1
The argument text=x should allow scan to accept a length-1 character element and break it into components at each occurrence of the separator. The elements are passed one by one from the id vector to the anonymous function (or row by row if they were coming from a data frame).
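One thing worth checking, given that the dput output in the question ends with class = "factor": scan(text = x) needs a character string, so passing factor elements may well be what triggers the textConnection error. Below is a minimal sketch, assuming the column is a factor like the one shown. The sample ids come from the question's example table, and the names parts and n_unique are only illustrative.

```r
# Sample ids from the question's table, stored as a factor as in the dput output
id <- factor(c("0000001,0000003", "0000002,0000002", "0010001,0001006,0010001"))

# Convert to character first, then split every string on the literal comma
parts <- strsplit(as.character(id), ",", fixed = TRUE)

# Count the distinct ids found in each string
n_unique <- vapply(parts, function(p) length(unique(p)), integer(1))
n_unique
```

Since strsplit is vectorised over the whole column, this avoids building the 68-column intermediate data frame.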

Related

strsplit doesn't always split on '?' [duplicate]

This question already has answers here:
R strsplit with multiple unordered split arguments?
(4 answers)
Closed 4 years ago.
I want (for LSAfun::genericSummary) to split some strings by c(".", "!", "?"). I use the option fixed = TRUE but it still returns the wrong result.
I want to understand why it doesn't work because I can't modify the call.
Actually, strsplit is not called directly but via LSAfun::genericSummary, and the result is not the expected one because of the unexpected strsplit result.
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = c(".", "!", "?"), fixed = TRUE)[[1]]
returns :
[1] "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"
expected :
[1] "Faut-il reconnaitre le vote blanc " " Faut-il rendre le vote obligatoire " ""
I'm lost... can anyone explain?
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.0 yaml_2.1.18
the function :
function (text, k, split = c(".", "!", "?"), min = 5, breakdown = FALSE,
...)
{
sentences <- unlist(strsplit(text, split = split, fixed = T))
if (breakdown == TRUE) {
sentences <- breakdown(sentences)
}
sentences <- sentences[nchar(sentences) > min]
td = tempfile()
dir.create(td)
for (i in 1:length(sentences)) {
docname <- paste("sentence", i, ".txt", sep = "")
write(sentences[i], file = paste(td, docname, sep = "/"))
}
A <- textmatrix(td, ...)
rownames <- rownames(A)
colnames <- colnames(A)
A <- matrix(A, nrow = nrow(A), ncol = ncol(A))
rownames(A) <- rownames
colnames(A) <- colnames
unlink(td, T, T)
Vt <- lsa(A, dims = length(sentences))$dk
snum <- vector(length = k)
for (i in 1:k) {
snum[i] <- names(Vt[, i][abs(Vt[, i]) == max(abs(Vt[,
i]))])
}
snum <- gsub(snum, pattern = "[[:alpha:]]", replacement = "")
snum <- gsub(snum, pattern = "[[:punct:]]", replacement = "")
snum <- as.integer(snum)
summary.sentences <- sentences[snum]
return(summary.sentences)
}
<environment: namespace:LSAfun>
To split on several characters, place them inside [] and remove fixed = TRUE, or paste the patterns together with | to split on any one of them:
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = "[.!?]")[[1]]
According to ?strsplit
split - If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.
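To make the recycling rule quoted above concrete, here is a small sketch in base R. The three-element vector c("a.b", "a!b", "a?b") is a made-up input used only to show how each string gets paired with its own separator.

```r
x <- "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"

# With a single string, only the first element of split (".") is used,
# so the "?" is never treated as a separator and x comes back unchanged.
one <- strsplit(x, split = c(".", "!", "?"), fixed = TRUE)[[1]]

# With three strings, split is recycled along x: each string is paired
# with its own separator.
three <- strsplit(c("a.b", "a!b", "a?b"), split = c(".", "!", "?"), fixed = TRUE)

# A regex character class splits on any of the three characters.
any_of <- strsplit(x, split = "[.!?]")[[1]]
```

Note that strsplit does not return a trailing empty string when the input ends with a separator, so any_of has two elements rather than the three listed as "expected" in the question.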
You can also omit the fixed = TRUE part and escape the characters, i.e.
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?", c("\\.|!|\\?"))
Of course it will not be as efficient since we are going through the regex engine.

Error in rep(" ", len) : invalid 'times' argument

library(OneR)
library(RWeka)
loan_train <- read.csv("loan_train.csv")
loan_test <- read.csv("loan_test.csv")
loan_train <- optbin(loan_train, method = "logreg", na.omit = TRUE)
loan_test <- optbin(loan_test, method = "logreg", na.omit = TRUE)
#Task 1
loan_1R <- OneR(bad_loans ~ ., data = loan_train)
loan_1R
loan_JRip <- JRip(bad_loans ~ ., data = loan_train)
loan_JRip
I need some help with my code. Everything runs, but for some reason every time I print loan_1R it gives me an error. I tried using traceback() but have no idea what it means. My csv file can be found at the link below.
https://drive.google.com/file/d/1139FUSXUc_fdzgtKAleo5bGAtjcVGoRC/view?usp=sharing
Error in rep(" ", len) : invalid 'times' argument
In addition: Warning message:
In max(nchar(names(model$rules))) :
no non-missing arguments to max; returning -Inf
> traceback()
3: cat("If ", model$feature, " = ", names(model$rules[iter]), rep(" ",
len), " then ", model$target, " = ", model$rules[[iter]],
"\n", sep = "")
2: print.OneR(x)
1: function (x, ...)
UseMethod("print")(x)
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C
[5] LC_TIME=English_Singapore.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-37 OneR_2.2
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1 grid_3.4.1 rJava_0.9-9 RWekajars_3.9.2-1
After hours of testing I found the problem, but I have no idea why it happens. I think it has something to do with the RWeka package: placing library(RWeka) after the OneR code seemed to make it run. But this means I encounter the error as soon as I run library(RWeka). Is there any workaround for this?
library(OneR)
loan_train <- read.csv("loan_train.csv")
loan_test <- read.csv("loan_test.csv")
loan_train <- optbin(loan_train, method = "logreg", na.omit = TRUE)
loan_test <- optbin(loan_test, method = "logreg", na.omit = TRUE)
#Task 1
loan_1R <- OneR(bad_loans ~ ., data = loan_train)
loan_1R
library(RWeka)
loan_JRip <- JRip(bad_loans ~ ., data = loan_train)
loan_JRip

Getting example codes of R functions into knitr using helpExtract function

I want to get the example code of R functions to use in knitr. There might be an easy way, but I tried the following code using the helpExtract function, which can be obtained from here (written by @AnandaMahto). With my approach I have to check whether a function has Examples or not and include only those functions which have Examples.
This is a very inefficient and naive approach. Now I'm trying to include only those functions which have Examples. I tried the following code but it is not working as desired. How can I extract the Examples code from an R package?
\documentclass{book}
\usepackage[T1]{fontenc}
\begin{document}
<< label=packages, echo=FALSE>>=
library(ggplot2)
library(devtools)
source_gist("https://gist.github.com/mrdwab/7586769")
library(noamtools) # install_github("noamtools", "noamross")
@
\chapter{Linear Model}
<< label = NewTest1, results="asis">>=
tryCatch(
{helpExtract(lm, section="Examples", type = "s_text");
cat(
"\\Sexpr{
knit_child(
textConnection(helpExtract(lm, section=\"Examples\", type = \"s_text\"))
, options = list(tidy = FALSE, eval = TRUE)
)
}", "\n"
)
}
, error=function(e) FALSE
)
@
\chapter{Modify properties of an element in a theme object}
<< label = NewTest2, results="asis">>=
tryCatch(
{helpExtract(add_theme , section="Examples", type = "s_text");
cat(
"\\Sexpr{
knit_child(
textConnection(helpExtract(add_theme , section=\"Examples\", type = \"s_text\"))
, options = list(tidy = FALSE, eval = TRUE)
)
}", "\n"
)
}
, error=function(e) FALSE
)
@
\end{document}
I've done some quick work modifying the function (which I've included at this Gist). The Gist also includes a sample Rnw file (I haven't had a chance to check an Rmd file yet).
The function now looks like this:
helpExtract <- function(Function, section = "Usage", type = "m_code", sectionHead = NULL) {
A <- deparse(substitute(Function))
x <- capture.output(tools:::Rd2txt(utils:::.getHelpFile(utils::help(A)),
options = list(sectionIndent = 0)))
B <- grep("^_", x) ## section start lines
x <- gsub("_\b", "", x, fixed = TRUE) ## remove "_\b"
X <- rep(FALSE, length(x)) ## Create a FALSE vector
X[B] <- 1 ## Initialize
out <- split(x, cumsum(X)) ## Create a list of sections
sectionID <- vapply(out, function(x) ## Identify where the section starts
grepl(section, x[1], fixed = TRUE), logical(1L))
if (!any(sectionID)) { ## If the section is missing...
"" ## ... just return an empty character
} else { ## Else, get that list item
out <- out[[which(sectionID)]][-c(1, 2)]
while(TRUE) { ## Remove the extra empty lines
out <- out[-length(out)] ## from the end of the file
if (out[length(out)] != "") { break }
}
switch( ## Determine the output type
type,
m_code = {
before <- "```r"
after <- "```"
c(sectionHead, before, out, after)
},
s_code = {
before <- "<<eval = FALSE>>="
after <- "@"
c(sectionHead, before, out, after)
},
m_text = {
c(sectionHead, paste(" ", out, collapse = "\n"))
},
s_text = {
before <- "\\begin{verbatim}"
after <- "\\end{verbatim}"
c(sectionHead, before, out, after)
},
stop("`type` must be either `m_code`, `s_code`, `m_text`, or `s_text`")
)
}
}
What has changed?
A new argument sectionHead has been added. This is used to be able to specify the section title in the call to the helpExtract function.
The function checks to see whether the relevant section is available in the parsed document. If it is not, it simply returns a "" (which doesn't get printed).
Example use would be:
<<echo = FALSE>>=
mySectionHeading <- "\\section{Some cool section title}"
@
\Sexpr{knit_child(textConnection(
helpExtract(cor, section = "Examples", type = "s_code",
sectionHead = mySectionHeading)),
options = list(tidy = FALSE, eval = FALSE))}
Note: Since Sexpr doesn't allow curly brackets to be used ({), we need to specify the title outside of the Sexpr step, which I have done in a hidden code chunk.
This is not a complete answer so I'm marking it as community wiki. Here are two simple lines to get the examples out of the Rd file for a named function (in this case lm). The code is much simpler than Ananda's gist in my opinion:
x <- utils:::.getHelpFile(utils::help(lm))
sapply(x[sapply(x, function(z) attr(z, "Rd_tag") == "\\examples")][[1]], `[[`, 1)
The result is a simple vector of all of the text in the Rd "examples" section, which should be easy to parse, evaluate, or include in a knitr doc.
[1] "\n"
[2] "require(graphics)\n"
[3] "\n"
[4] "## Annette Dobson (1990) \"An Introduction to Generalized Linear Models\".\n"
[5] "## Page 9: Plant Weight Data.\n"
[6] "ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)\n"
[7] "trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)\n"
[8] "group <- gl(2, 10, 20, labels = c(\"Ctl\",\"Trt\"))\n"
[9] "weight <- c(ctl, trt)\n"
[10] "lm.D9 <- lm(weight ~ group)\n"
[11] "lm.D90 <- lm(weight ~ group - 1) # omitting intercept\n"
[12] "\n"
[13] "\n"
[14] "opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))\n"
[15] "plot(lm.D9, las = 1) # Residuals, Fitted, ...\n"
[16] "par(opar)\n"
[17] "\n"
[18] "\n"
[19] "### less simple examples in \"See Also\" above\n"
Perhaps the following might be useful.
get.examples <- function(pkg=NULL) {
suppressWarnings(f <- unique(utils:::index.search(TRUE, find.package(pkg))))
out <- setNames(sapply(f, function(x) {
tf <- tempfile("Rex")
tools::Rd2ex(utils:::.getHelpFile(x), tf)
if (!file.exists(tf)) return(invisible())
readLines(tf)
}), basename(f))
out[!sapply(out, is.null)]
}
ex.base <- get.examples('base')
This returns the examples for all functions (that have documentation containing examples) within the specified vector of packages. If pkg=NULL, it returns the examples for all functions within loaded packages.
For example:
ex.base['scan']
# $scan
# [1] "### Name: scan"
# [2] "### Title: Read Data Values"
# [3] "### Aliases: scan"
# [4] "### Keywords: file connection"
# [5] ""
# [6] "### ** Examples"
# [7] ""
# [8] "cat(\"TITLE extra line\", \"2 3 5 7\", \"11 13 17\", file = \"ex.data\", sep = \"\\n\")"
# [9] "pp <- scan(\"ex.data\", skip = 1, quiet = TRUE)"
# [10] "scan(\"ex.data\", skip = 1)"
# [11] "scan(\"ex.data\", skip = 1, nlines = 1) # only 1 line after the skipped one"
# [12] "scan(\"ex.data\", what = list(\"\",\"\",\"\")) # flush is F -> read \"7\""
# [13] "scan(\"ex.data\", what = list(\"\",\"\",\"\"), flush = TRUE)"
# [14] "unlink(\"ex.data\") # tidy up"
# [15] ""
# [16] "## \"inline\" usage"
# [17] "scan(text = \"1 2 3\")"
# [18] ""
# [19] ""
# [20] ""
# [21] ""
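The check for "does this help page have Examples at all?" can be seen in isolation with a throwaway Rd file. The sketch below is only an illustration: the Rd text, the topic names and the helper extract_examples are made up, and it relies on tools::Rd2ex not creating the output file when there is no \examples section (the same behaviour the file.exists() test above depends on).

```r
# Hypothetical helper: returns the example lines of a parsed Rd object,
# or NULL when the help page has no \examples section.
extract_examples <- function(rd) {
  out_file <- tempfile(fileext = ".R")
  tools::Rd2ex(rd, out = out_file)
  if (!file.exists(out_file)) return(NULL)   # no examples, no file
  readLines(out_file)
}

# A made-up help page with an examples section
rd_with <- tempfile(fileext = ".Rd")
writeLines(c(
  "\\name{demo}",
  "\\alias{demo}",
  "\\title{Demo topic}",
  "\\description{A made-up help page.}",
  "\\examples{",
  "x <- 1:3",
  "sum(x)",
  "}"
), rd_with)

# The same page without examples
rd_without <- tempfile(fileext = ".Rd")
writeLines(c(
  "\\name{demo2}",
  "\\alias{demo2}",
  "\\title{Another demo topic}",
  "\\description{A made-up help page without examples.}"
), rd_without)

ex_with    <- extract_examples(tools::parse_Rd(rd_with))
ex_without <- extract_examples(tools::parse_Rd(rd_without))
```

In a knitr document, a NULL result could then be used to skip the chunk for that function instead of wrapping everything in tryCatch.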

Loading ffdf data take a lot of memory

I am facing a strange problem:
I save ffdf data using
save.ffdf()
from the ffbase package, and when I load them in a new R session with
load.ffdf("data.f")
the loaded data takes up approximately 90% of the RAM that the same data would use as a data.frame object in R.
Given this, it does not make much sense to use ffdf, does it?
I can't use ffsave because I am working on a server that does not have the zip application.
packageVersion("ff") # 2.2.10
packageVersion("ffbase") # 0.6.3
Any ideas?
[edit] some code example to help to clarify:
data <- read.csv.ffdf(file = fn, header = T, colClasses = classes)
# file fn is a csv database with 5 columns and 2.6 million rows,
# with some factor cols and some integer cols.
data.1 <- data
save.ffdf(data.1 , dir = my.dir) # my.dir is a string pointing to the file. "C:/data/R/test.f" for example.
closing the R session... opening again:
load.ffdf(file.name) # file.name is a string pointing to the file.
#that gives me object data, with class(data) = ffdf.
Then I have a data object ffdf[5], and its memory size is almost as big as:
data.R <- data[,] # which is a data.frame.
[end of edit]
*[ SECOND EDIT :: FULL REPRODUCIBLE CODE ::: ]
As my question has not been answered yet and I still have the problem, here is a reproducible example:
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.limit(size=4000)
N = 1e7;
df <- data.frame(
x = c(1:N),
y = sample(letters, N, replace =T),
z = sample( as.Date(sample(c(1:2000), N, replace=T), origin="1970-01-01")),
w = factor( sample(c(1:N/10) , N, replace=T)) )
df[1:10,]
dff <- as.ffdf(df)
head(dff)
#str(dff)
save.ffdf(dff, dir = "dframeffdf")
dim(dff)
# on disk, the directory "dframeffdf" is : 205 MB (215.706.264 bytes)
### resetting R :: fresh RStudio Session
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.size() # 15.63
load.ffdf(dir = "dframeffdf")
memory.size() # 384.42
gc()
memory.size() # 287
So we have 384 MB in memory, and after gc() there are 287 MB, which is around the size of the data on disk (also checked in the "Process Explorer" application for Windows).
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] ffbase_0.7-1 ff_2.2-10 bit_1.1-9
[END SECOND EDIT ]
In ff, when you have factor columns, the factor levels are always in RAM. ff character columns currently don't exist, and character columns are converted to factors in an ffdf.
Regarding your example: your 'w' column in 'dff' contains more than 6 million levels. These levels are all in RAM. If you didn't have columns with a lot of levels, you wouldn't see the RAM increase, as shown below using your example.
N = 1e7;
df <- data.frame(
x = c(1:N),
y = sample(letters, N, replace =T),
z = sample( as.Date(sample(c(1:2000), N, replace=T), origin="1970-01-01")),
w = sample(c(1:N/10) , N, replace=T))
dff <- as.ffdf(df)
save.ffdf(dff, dir = "dframeffdf")
### resetting R :: fresh RStudio Session
library(ff)
library(ffbase)
memory.size() # 14.67
load.ffdf(dir = "dframeffdf")
memory.size() # 14.78
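To get a feel for how much a high-cardinality factor carries in its levels, here is a rough base R sketch. It does not use ff at all and it runs on a much smaller N than the question (1e5 instead of 1e7, an arbitrary choice to keep it quick), so the numbers only illustrate the principle that the levels are a character vector of their own.

```r
set.seed(42)
N <- 1e5   # deliberately smaller than the 1e7 used in the question

# Same construction as the 'w' column: many distinct values, then factor()
w_numeric <- sample(seq_len(N) / 10, N, replace = TRUE)
w_factor  <- factor(w_numeric)

# Number of distinct levels, and the size of the levels vector on its own
n_levels    <- nlevels(w_factor)
levels_size <- object.size(levels(w_factor))

n_levels
format(levels_size, units = "MB")
```

With ff, it is this levels vector that stays in RAM, while the integer codes live on disk.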
The ffdf package(s) have mechanisms for segregating objects into 'physical' and 'virtual' storage. I suspect you are implicitly constructing items in physical memory, but since you offer no code showing how this workspace was created, there's only so much guessing that is possible.

Read a text file in R line by line

I would like to read a text file in R, line by line, using a for loop and with the length of the file. The problem is that it only prints character(0). This is the code:
fileName="up_down.txt"
con=file(fileName,open="r")
line=readLines(con)
long=length(line)
for (i in 1:long){
linn=readLines(con,1)
print(linn)
}
close(con)
You should take care with readLines(...) and big files. Reading all lines into memory can be risky. Below is an example of how to read a file and process just one line at a time:
processFile = function(filepath) {
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
print(line)
}
close(con)
}
Be aware of the risk of reading even a single line into memory: big files without line breaks can fill your memory too.
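If one line at a time turns out to be too slow, a middle ground is to read a fixed number of lines per call. This is only a sketch of that variation: the function name process_in_chunks, the chunk_size default and the handler argument are all made up for illustration.

```r
# Read a file in chunks of chunk_size lines and hand each chunk to handler().
# Returns (invisibly) the total number of lines read.
process_in_chunks <- function(filepath, chunk_size = 1000L, handler = print) {
  con <- file(filepath, open = "r")
  on.exit(close(con))                  # close the connection even if handler() fails
  n_read <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break     # end of file reached
    handler(chunk)
    n_read <- n_read + length(chunk)
  }
  invisible(n_read)
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:5), tmp)
total <- process_in_chunks(tmp, chunk_size = 2L)
```

Only chunk_size lines are held in memory at once, so the same caveat about very long lines still applies.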
Just use readLines on your file:
R> res <- readLines(system.file("DESCRIPTION", package="MASS"))
R> length(res)
[1] 27
R> res
[1] "Package: MASS"
[2] "Priority: recommended"
[3] "Version: 7.3-18"
[4] "Date: 2012-05-28"
[5] "Revision: $Rev: 3167 $"
[6] "Depends: R (>= 2.14.0), grDevices, graphics, stats, utils"
[7] "Suggests: lattice, nlme, nnet, survival"
[8] "Authors@R: c(person(\"Brian\", \"Ripley\", role = c(\"aut\", \"cre\", \"cph\"),"
[9] " email = \"ripley@stats.ox.ac.uk\"), person(\"Kurt\", \"Hornik\", role"
[10] " = \"trl\", comment = \"partial port ca 1998\"), person(\"Albrecht\","
[11] " \"Gebhardt\", role = \"trl\", comment = \"partial port ca 1998\"),"
[12] " person(\"David\", \"Firth\", role = \"ctb\"))"
[13] "Description: Functions and datasets to support Venables and Ripley,"
[14] " 'Modern Applied Statistics with S' (4th edition, 2002)."
[15] "Title: Support Functions and Datasets for Venables and Ripley's MASS"
[16] "License: GPL-2 | GPL-3"
[17] "URL: http://www.stats.ox.ac.uk/pub/MASS4/"
[18] "LazyData: yes"
[19] "Packaged: 2012-05-28 08:47:38 UTC; ripley"
[20] "Author: Brian Ripley [aut, cre, cph], Kurt Hornik [trl] (partial port"
[21] " ca 1998), Albrecht Gebhardt [trl] (partial port ca 1998), David"
[22] " Firth [ctb]"
[23] "Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>"
[24] "Repository: CRAN"
[25] "Date/Publication: 2012-05-28 08:53:03"
[26] "Built: R 2.15.1; x86_64-pc-mingw32; 2012-06-22 14:16:09 UTC; windows"
[27] "Archs: i386, x64"
R>
There is an entire manual devoted to this.
Here is the solution with a for loop. Importantly, it takes the one call to readLines out of the for loop so that it is not improperly called again and again. Here it is:
fileName <- "up_down.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in 1:length(linn)){
print(linn[i])
}
close(conn)
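The reason the original loop printed character(0) can be reproduced on a throwaway file: once readLines(con) has read everything, the connection is at the end of the file and later reads have nothing left to return. A small sketch, with a temporary file standing in for up_down.txt (its contents are made up):

```r
# Temporary stand-in for up_down.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("up", "down", "up"), tmp)

con <- file(tmp, open = "r")
all_lines <- readLines(con)         # reads every line and moves to the end of the file
leftover  <- readLines(con, n = 1)  # nothing left to read: character(0)
close(con)

# Loop over the lines that were already read instead of reading again
for (i in seq_along(all_lines)) {
  print(all_lines[i])
}
```

Using seq_along() instead of 1:length() also avoids a stray iteration when the file is empty.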
I wrote some code to read a file line by line to meet my need, in which different lines have different data types, following the articles read-line-by-line-of-a-file-in-r and determining-number-of-linesrecords. I think it should be a better solution for big files. My R version is 3.3.2.
con = file("pathtotargetfile", "r")
readsizeof <- 2 # number of lines to read per step while counting the lines in the file
nooflines <- 0 # running line count
while ((linesread <- length(readLines(con, readsizeof))) > 0) # count the lines in small chunks, so the whole file is never held in memory
nooflines <- nooflines + linesread
close(con) # the counting pass left the cursor at the end of the file
con = file("pathtotargetfile", "r") # reopen the file so that reading starts from the first line again
typelist = list(0, 'c', 0, 'c', 0, 0, 'c', 0) # data type of each line: 0 means the line is numeric, 'c' means it is character. This meets my need.
for(i in 1:nooflines) {
tmp <- scan(file=con, nlines=1, what=typelist[[i]], quiet=TRUE)
print(is.vector(tmp))
print(tmp)
}
close(con)
I suggest you check out chunked and disk.frame. They both have functions for reading in CSVs chunk-by-chunk.
In particular, disk.frame::csv_to_disk.frame may be the function you are after?
fileName = "up_down.txt"
### code to get the line count of the file
length_connection = pipe(paste("cat ", fileName, " | wc -l", sep = "")) # "cat fileName | wc -l" because that returns just the line count, and NOT the name of the file with it
long = as.numeric(trimws(readLines(con = length_connection, n = 1)))
close(length_connection) # make sure to close the connection
###
for (i in 1:long){
### code to extract a single line at row i from the file
linn_connection_cmd = paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ") # extracts one line from fileName at the desired line number (i)
linn_connection = pipe(linn_connection_cmd)
linn = readLines(con = linn_connection, n = 1)
close(linn_connection) # make sure to close the connection
###
# the line is now loaded into R and anything can be done with it
print(linn)
}
By using R's pipe() command, and using shell commands to extract what we want, the full file is never loaded into R, and is read in line by line.
paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ")
It is this command that does all the work; it extracts one line from the desired file.
Edit: by default R renders i as a plain number when it is less than 100,000, but switches to scientific notation when it is greater than or equal to 100,000 (1e+05). Thus, format(x = i, scientific = FALSE, big.mark = "") is used in our pipe command to make sure that head always receives a number in plain form, which is all it can understand. If it is given a number like 1e+05, it fails with the following error:
head: 1e+05: invalid number of lines
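The difference is easy to see without running any shell command. A short sketch, assuming R's default options (in particular options(scipen = 0)); the file name is the one from the question:

```r
i <- 1e5

# Default conversion used by paste(): scientific notation under default options
default_text <- paste(i)

# Explicit formatting: always plain digits, no thousands separator
plain_text <- format(x = i, scientific = FALSE, big.mark = "")

# The command string that would be handed to pipe()
cmd <- paste("head -n", plain_text, "up_down.txt", "| tail -n 1", sep = " ")

default_text  # "1e+05"
plain_text    # "100000"
```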

Resources