strsplit doesn't always split on '?' [duplicate] - r

This question already has answers here:
R strsplit with multiple unordered split arguments?
(4 answers)
Closed 4 years ago.
I want (for LSAfun::genericSummary) to split some strings by c(".", "!", "?"). I use the option fixed = TRUE but it still return the worng result.
I want to understand why it doesn't work because I can't modify the call.
Actually, it's not called directly but via LSAfun::genericSummary. And the result is not the expected one because of the strsplit unexpected result.
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = c(".", "!", "?"), fixed = TRUE)[[1]]
returns :
[1] "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"
expected :
[1] "Faut-il reconnaitre le vote blanc " " Faut-il rendre le vote obligatoire " ""
I'm lost... anyone for an explanation ?
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.0 yaml_2.1.18
the function :
function (text, k, split = c(".", "!", "?"), min = 5, breakdown = FALSE,
...)
{
sentences <- unlist(strsplit(text, split = split, fixed = T))
if (breakdown == TRUE) {
sentences <- breakdown(sentences)
}
sentences <- sentences[nchar(sentences) > min]
td = tempfile()
dir.create(td)
for (i in 1:length(sentences)) {
docname <- paste("sentence", i, ".txt", sep = "")
write(sentences[i], file = paste(td, docname, sep = "/"))
}
A <- textmatrix(td, ...)
rownames <- rownames(A)
colnames <- colnames(A)
A <- matrix(A, nrow = nrow(A), ncol = ncol(A))
rownames(A) <- rownames
colnames(A) <- colnames
unlink(td, T, T)
Vt <- lsa(A, dims = length(sentences))$dk
snum <- vector(length = k)
for (i in 1:k) {
snum[i] <- names(Vt[, i][abs(Vt[, i]) == max(abs(Vt[,
i]))])
}
snum <- gsub(snum, pattern = "[[:alpha:]]", replacement = "")
snum <- gsub(snum, pattern = "[[:punct:]]", replacement = "")
snum <- as.integer(snum)
summary.sentences <- sentences[snum]
return(summary.sentences)
}
<environment: namespace:LSAfun>

For multiple split elements, place it inside a [] and remove the fixed = TRUE or paste the patterns with a | to split either by one of them
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = "[.!?]")[[1]]
According to ?strsplit
split - If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.

You can also omit the fixed = TRUE part and escape the characters, i.e.
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?", c("\\.|!|\\?"))
Of course it will not be as efficient since we are going through the regex engine.

Related

Error in rep(" ", len) : invalid 'times' argument

library(OneR)
library(RWeka)
loan_train <- read.csv("loan_train.csv")
loan_test <- read.csv("loan_test.csv")
loan_train <- optbin(loan_train, method = "logreg", na.omit = TRUE)
loan_test <- optbin(loan_test, method = "logreg", na.omit = TRUE)
#Task 1
loan_1R <- OneR(bad_loans ~ ., data = loan_train)
loan_1R
loan_JRip <- JRip(bad_loans ~ ., data = loan_train)
loan_JRip
Need some help with my code. I am able to run everything but for some reason, every time I print loan_1R, it gives me an error. Tried using traceback() but have no idea what it means. My csv file can be in the link below.
https://drive.google.com/file/d/1139FUSXUc_fdzgtKAleo5bGAtjcVGoRC/view?usp=sharing
Error in rep(" ", len) : invalid 'times' argument
In addition: Warning message:
In max(nchar(names(model$rules))) :
no non-missing arguments to max; returning -Inf
> traceback()
3: cat("If ", model$feature, " = ", names(model$rules[iter]), rep(" ",
len), " then ", model$target, " = ", model$rules[[iter]],
"\n", sep = "")
2: print.OneR(x)
1: function (x, ...)
UseMethod("print")(x)
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C
[5] LC_TIME=English_Singapore.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-37 OneR_2.2
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1 grid_3.4.1 rJava_0.9-9 RWekajars_3.9.2-1
After hours of testing i found out the problem but I have no idea why it is so. Think that it has something to do with the library(RWeka) package.... Placing library(RWeka) after the OneR code seemed to make it run. But this means i encounter the error only once i run the library(RWeka). Any workaround this?
library(OneR)
loan_train <- read.csv("loan_train.csv")
loan_test <- read.csv("loan_test.csv")
loan_train <- optbin(loan_train, method = "logreg", na.omit = TRUE)
loan_test <- optbin(loan_test, method = "logreg", na.omit = TRUE)
#Task 1
loan_1R <- OneR(bad_loans ~ ., data = loan_train)
loan_1R
library(RWeka)
loan_JRip <- JRip(bad_loans ~ ., data = loan_train)
loan_JRip

Best way to count unique element in a string in r

I'm a still a beginner in R and I have a question!
I have data frame of 222.000 observations and I'm interesting by a specific column which name is id. The problem is it can be further ids separate by a ',' in the same string and I want to count unique element in a each string (I mean in each string of the first data frame).
For example:
id results
0000001,0000003 2
0000002,0000002 1
0010001,0001006,0010001 2
I have used the function 'str_split_fixed' to separate all id in the same string and I put the result in a new data frame(so know I have only 1 id by string or nothing in a string). The problem is that can be as many as 68 ',' so the new data frame is huge with 68columns and 220.000 observations and it take much time(15 secondes maybe). After a used a apply function to know all unique.
Does someone know a more efficient way or have an idea?
Finally, I used the following code:
sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
But there is a message error:
Error in textConnection(text, encoding = "UTF-8") :
argument 'text' incorrect
6 textConnection(text, encoding = "UTF-8")
5 scan(text = x, what = "", sep = ",", quiet = TRUE)
4 unique(scan(text = x, what = "", sep = ",", quiet = TRUE))
3 FUN(X[[i]], ...)
2 lapply(X = X, FUN = FUN, ...)
1 sapply(id, function(x) length(unique(scan(text = x,
what = "", sep = ",", quiet = TRUE))))
My R version is:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 plyr_1.8.3
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.2 Rcpp_0.12.2 stringi_1.0-1
>
I've tried this: Encoding(id) <- "UTF-8"
But the result is:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8")
and the output of dput(id) is from this:
[9987,] "2320212,2320230"
[9988,] "4530090,4530917"
[9989,] "8532412"
[9990,] "4560292"
[9991,] "4540375"
[9992,] "3311324"
[9993,] "4540030"
[9994,] "9010000"
[9995,] "2811810"
[9996,] "3311000"
[9997,] "4540030"
[9998,] "4540215"
[9999,] "1541201"
[10000,] "2423810"
[ getOption("max.print") est atteint -- 90000 lignes omises ]
the ouput is huge so I post just the end and the first line:
[9002,] "9460000"
and for dput( head(data$id) ):
"9460000,9433000", "9460000,9436000", "9460000,9437000",
"9510000", "9510010", "9510030", "9510090", "9910000", "9910020",
"9910040", "9910090", "D", "FIELD_NOT_FOUND", "I"), class = "factor")
Thanks in advance, Jef
sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
# --- result: first typed line is 'names' of the items, not the results.
1 2,3,4 1,1
1 3 1
The argument text=x should allow scan to accept a character element of length-1 and break it into components at divisions of the separator argument value. These will get passed element-by-element to the anonymous function from the id vector(or row by row if it were coming from a dataframe).

Missing object error when using step() within a user-defined function

5 days and still no answer
As can be seen by Simon's comment, this is a reproducible and very strange issue. It seems that the issue only arises when a stepwise regression with very high predictive power is wrapped in a function.
I have been struggling with this for a while and any help would be much appreciated. I am trying to write a function that runs several stepwise regressions and outputs all of them to a list. However, R is having trouble reading the dataset that I specify in my function arguments. I found several similar errors on various boards (here, here, and here), however none of them seemed to ever get resolved. It all comes down to some weird issues with calling step() in a user-defined function. I am using the following script to test my code. Run the whole thing several times until an error arises (trust me, it will):
test.df <- data.frame(a = sample(0:1, 100, rep = T),
b = as.factor(sample(0:5, 100, rep = T)),
c = runif(100, 0, 100),
d = rnorm(100, 50, 50))
test.df$b[10:100] <- test.df$a[10:100] #making sure that at least one of the variables has some predictive power
stepModel <- function(modeling.formula, dataset, outfile = NULL) {
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
sink()
return(output)
}
blah <- stepModel(a~., dataset = test.df)
This returns the following error message (if the error does not show up right away, keep re-running the test.df script as well as the call for stepModel(), it will show up eventually):
Error in is.data.frame(data) : object 'dataset' not found
I have determined that everything runs fine up until model.stepwise2 starts to get built. Somehow, the temporary object 'dataset' works just fine for the first stepwise regression, but fails to be recognized by the second. I found this by commenting out part of the function as can be seen below. This code will run fine, proving that the object 'dataset' was originally being recognized:
stepModel1 <- function(modeling.formula, dataset, outfile = NULL) {
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
# model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
# sink()
# output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
return(model.stepwise1)
}
blah1 <- stepModel1(a~., dataset = test.df)
EDIT - before anyone asks, all the summary() functions were there because the full function (i edited it so that you could focus in on the error) has another piece that defines a file to which you can output stepwise trace. I just got rid of them
EDIT 2 - session info
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-6.4 RSQLite.extfuns_0.0.1 RSQLite_0.11.3 chron_2.3-43
[5] gsubfn_0.6-5 proto_0.3-10 DBI_0.2-6 ggplot2_0.9.3.1
[9] caret_5.15-61 reshape2_1.2.2 lattice_0.20-6 foreach_1.4.0
[13] cluster_1.14.2 plyr_1.8
loaded via a namespace (and not attached):
[1] codetools_0.2-8 colorspace_1.2-1 dichromat_2.0-0 digest_0.6.2 grid_2.15.1
[6] gtable_0.1.2 iterators_1.0.6 labeling_0.1 MASS_7.3-18 munsell_0.4
[11] RColorBrewer_1.0-5 scales_0.2.3 stringr_0.6.2 tools_2.15
EDIT 3 - this performs all the same operations as the function, just without using a function. This will run fine every time, even when the algorithm doesn't converge:
modeling.formula <- a~.
dataset <- test.df
outfile <- NULL
if (is.null(outfile) == FALSE){
sink(file = outfile,
append = TRUE, type = "output")
print("")
print("Models run at:")
print(Sys.time())
}
model.initial <- glm(modeling.formula,
family = binomial,
data = dataset)
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
output <- list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
Using do.call to refer to the data set in the calling environment works for me. See https://stackoverflow.com/a/7668846/210673 for the original suggestion. Here's a version that works (with sink code removed).
stepModel2 <- function(modeling.formula, dataset) {
model.initial <- do.call("glm", list(modeling.formula,
family = "binomial",
data = as.name(dataset)))
model.stepwise1 <- step(model.initial, direction = "backward")
model.stepwise2 <- step(model.stepwise1, scope = ~.^2)
list(modInitial = model.initial, modStep1 = model.stepwise1, modStep2 = model.stepwise2)
}
blah <- stepModel2(a~., dataset = "test.df")
It fails for me consistently with set.seed(6) with the original code. The reason it fails is that the dataset variable is not present within the step function, and although it's not needed in making model.stepwise1, it is needed for model.stepwise2 when model.stepwise1 keeps a linear term. So that's the case when your version fails. Calling the dataset from the global environment as I do here fixes this issue.

addOBV throwing error

I am trying to plot a graph with price and a few technical indicators such as ADX, RSI, and OBV. I cannot figure out why addOBV is giving an error and why addADX not showing at all in the graph lines in the chart?
Here my code:
tmp <- read.csv(paste("ProcessedQuotes/",Nifty[x,],".csv", sep=""),
as.is=TRUE, header=TRUE, row.names=NULL)
tmp$Date<-as.Date(tmp$Date)
ydat = xts(tmp[,-1],tmp$Date)
lineChart(ydat, TA=NULL, name=paste(Nifty[x,]," Technical Graph"))
plot(addSMA(10))
plot(addEMA(10))
plot(addRSI())
plot(addADX())
plot(addOBV())
Error for addOBV is:
Error in try.xts(c(2038282, 1181844, -1114409, 1387404, 3522045, 4951254, :
Error in as.xts.double(x, ..., .RECLASS = TRUE) :
order.by must be either 'names()' or otherwise specified
Below you can see DIn is not shown fully in the graphs.
> class(ydat)
[1] "xts" "zoo"
> head(ydat)
Open High Low Close Volume Trades Sma20 Sma50 DIp DIn DX ADX aroonUp aroonDn oscillator macd signal RSI14
I don't know why that patch doesn't work for you, but you can just create a new function (or you could mask the one from quantmod). Let's just make a new, patched version called addOBV2 which is the code for addOBV except for the one patched line. (x <- as.matrix(lchob#xdata) is replaced with x <- try.xts(lchob#xdata, error=FALSE)).
addOBV2 <- function (..., on = NA, legend = "auto")
{
stopifnot("package:TTR" %in% search() || require("TTR", quietly = TRUE))
lchob <- quantmod:::get.current.chob()
x <- try.xts(lchob#xdata, error=FALSE)
#x <- as.matrix(lchob#xdata)
x <- OBV(price = Cl(x), volume = Vo(x))
yrange <- NULL
chobTA <- new("chobTA")
if (NCOL(x) == 1) {
chobTA#TA.values <- x[lchob#xsubset]
}
else chobTA#TA.values <- x[lchob#xsubset, ]
chobTA#name <- "chartTA"
if (any(is.na(on))) {
chobTA#new <- TRUE
}
else {
chobTA#new <- FALSE
chobTA#on <- on
}
chobTA#call <- match.call()
legend.name <- gsub("^.*[(]", " On Balance Volume (", deparse(match.call()))#,
#extended = TRUE)
gpars <- c(list(...), list(col=4))[unique(names(c(list(col=4), list(...))))]
chobTA#params <- list(xrange = lchob#xrange, yrange = yrange,
colors = lchob#colors, color.vol = lchob#color.vol, multi.col = lchob#multi.col,
spacing = lchob#spacing, width = lchob#width, bp = lchob#bp,
x.labels = lchob#x.labels, time.scale = lchob#time.scale,
isLogical = is.logical(x), legend = legend, legend.name = legend.name,
pars = list(gpars))
if (is.null(sys.call(-1))) {
TA <- lchob#passed.args$TA
lchob#passed.args$TA <- c(TA, chobTA)
lchob#windows <- lchob#windows + ifelse(chobTA#new, 1,
0)
chartSeries.chob <- quantmod:::chartSeries.chob
do.call("chartSeries.chob", list(lchob))
invisible(chobTA)
}
else {
return(chobTA)
}
}
Now it works.
# reproduce your data
ydat <- getSymbols("ZEEL.NS", src="yahoo", from="2012-09-11",
to="2013-01-18", auto.assign=FALSE)
lineChart(ydat, TA=NULL, name=paste("ZEEL Technical Graph"))
plot(addSMA(10))
plot(addEMA(10))
plot(addRSI())
plot(addADX())
plot(addOBV2())
This code reproduces the error:
library(quantmod)
getSymbols("AAPL")
lineChart(AAPL, 'last 6 months')
addOBV()
Session Info:
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quantmod_0.3-17 TTR_0.21-1 xts_0.9-1 zoo_1.7-9 Defaults_1.1-1 rgeos_0.2-11
[7] sp_1.0-5 sos_1.3-5 brew_1.0-6
loaded via a namespace (and not attached):
[1] grid_2.15.0 lattice_0.20-6 tools_2.15.0
Googling around, the error seems to be related to the fact that addOBV converts the data into a matrix, which causes problems with TTR::OBV. A patch has been posted on RForge.

Read a text file in R line by line

I would like to read a text file in R, line by line, using a for loop and with the length of the file. The problem is that it only prints character(0). This is the code:
fileName="up_down.txt"
con=file(fileName,open="r")
line=readLines(con)
long=length(line)
for (i in 1:long){
linn=readLines(con,1)
print(linn)
}
close(con)
You should take care with readLines(...) and big files. Reading all lines at memory can be risky. Below is a example of how to read file and process just one line at time:
processFile = function(filepath) {
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
print(line)
}
close(con)
}
Understand the risk of reading a line at memory too. Big files without line breaks can fill your memory too.
Just use readLines on your file:
R> res <- readLines(system.file("DESCRIPTION", package="MASS"))
R> length(res)
[1] 27
R> res
[1] "Package: MASS"
[2] "Priority: recommended"
[3] "Version: 7.3-18"
[4] "Date: 2012-05-28"
[5] "Revision: $Rev: 3167 $"
[6] "Depends: R (>= 2.14.0), grDevices, graphics, stats, utils"
[7] "Suggests: lattice, nlme, nnet, survival"
[8] "Authors#R: c(person(\"Brian\", \"Ripley\", role = c(\"aut\", \"cre\", \"cph\"),"
[9] " email = \"ripley#stats.ox.ac.uk\"), person(\"Kurt\", \"Hornik\", role"
[10] " = \"trl\", comment = \"partial port ca 1998\"), person(\"Albrecht\","
[11] " \"Gebhardt\", role = \"trl\", comment = \"partial port ca 1998\"),"
[12] " person(\"David\", \"Firth\", role = \"ctb\"))"
[13] "Description: Functions and datasets to support Venables and Ripley,"
[14] " 'Modern Applied Statistics with S' (4th edition, 2002)."
[15] "Title: Support Functions and Datasets for Venables and Ripley's MASS"
[16] "License: GPL-2 | GPL-3"
[17] "URL: http://www.stats.ox.ac.uk/pub/MASS4/"
[18] "LazyData: yes"
[19] "Packaged: 2012-05-28 08:47:38 UTC; ripley"
[20] "Author: Brian Ripley [aut, cre, cph], Kurt Hornik [trl] (partial port"
[21] " ca 1998), Albrecht Gebhardt [trl] (partial port ca 1998), David"
[22] " Firth [ctb]"
[23] "Maintainer: Brian Ripley <ripley#stats.ox.ac.uk>"
[24] "Repository: CRAN"
[25] "Date/Publication: 2012-05-28 08:53:03"
[26] "Built: R 2.15.1; x86_64-pc-mingw32; 2012-06-22 14:16:09 UTC; windows"
[27] "Archs: i386, x64"
R>
There is an entire manual devoted to this.
Here is the solution with a for loop. Importantly, it takes the one call to readLines out of the for loop so that it is not improperly called again and again. Here it is:
fileName <- "up_down.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in 1:length(linn)){
print(linn[i])
}
close(conn)
I write a code to read file line by line to meet my demand which different line have different data type follow articles: read-line-by-line-of-a-file-in-r and determining-number-of-linesrecords. And it should be a better solution for big file, I think. My R version (3.3.2).
con = file("pathtotargetfile", "r")
readsizeof<-2 # read size for one step to caculate number of lines in file
nooflines<-0 # number of lines
while((linesread<-length(readLines(con,readsizeof)))>0) # calculate number of lines. Also a better solution for big file
nooflines<-nooflines+linesread
con = file("pathtotargetfile", "r") # open file again to variable con, since the cursor have went to the end of the file after caculating number of lines
typelist = list(0,'c',0,'c',0,0,'c',0) # a list to specific the lines data type, which means the first line has same type with 0 (e.g. numeric)and second line has same type with 'c' (e.g. character). This meet my demand.
for(i in 1:nooflines) {
tmp <- scan(file=con, nlines=1, what=typelist[[i]], quiet=TRUE)
print(is.vector(tmp))
print(tmp)
}
close(con)
I suggest you check out chunked and disk.frame. They both have functions for reading in CSVs chunk-by-chunk.
In particular, disk.frame::csv_to_disk.frame may be the function you are after?
fileName = "up_down.txt"
### code to get the line count of the file
length_connection = pipe(paste("cat ", fileName, " | wc -l", sep = "")) # "cat fileName | wc -l" because that returns just the line count, and NOT the name of the file with it
long = as.numeric(trimws(readLines(con = length_connection, n = 1)))
close(length_connection) # make sure to close the connection
###
for (i in 1:long){
### code to extract a single line at row i from the file
linn_connection_cmd = paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ") # extracts one line from fileName at the desired line number (i)
linn_connection = pipe(linn_connection_cmd)
linn = readLines(con = linn_connection, n = 1)
close(linn_connection) # make sure to close the conection
###
# the line is now loaded into R and anything can be done with it
print(linn)
}
close(con)
By using R's pipe() command, and using shell commands to extract what we want, the full file is never loaded into R, and is read in line by line.
paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ")
It is this command that does all the work; it extracts one line from the desired file.
Edit: R's default behavior is for i to return as normal number when less than 100,000, but begins returning i in scientific notation when it is greater than or equal to 100,000 (1e+05). Thus, format(x = i, scientific = FALSE, big.mark = "") is used in our pipe command to make sure that the pipe() command always receives a number in normal form, which is all that the command can understand. If the pipe() command is given any number like 1e+05, it will not be able to comprehend it and will result in the following error:
head: 1e+05: invalid number of lines

Resources