How to efficiently clean linebreak in entire dataset R data.table - r

Line break is always the first target to prevent new line splitted in new row when perform "import text file" in Excel.Or export to other application with csv file importing.
(The solution might be able to apply in clean another special mark in dataset.
)
Goal clean all line break into space of entire dataset
dt[,lapply(.SD,gsub("\\n","",.SD))]
Problems
R freezed after applying the script with +50 cols & +3 million rows
What's wrong with the lapply approach above?And what is the preferred approach to clean certain things on entire table ?

chinsoon12 is basically it -- use set for low-overhead by-reference column overwrite; just add fixed=TRUE to make the regex faster too:
for (jj in seq_len(ncol(dt))) set(dt, , jj, gsub('\n', '', dt[[jj]], fixed = TRUE))
BTW, \\n is different from \n. \n is the literal newline character, \\n is the string "\n", i.e., a backslash followed by n. You can see the difference thus:
cat('hey\nyou')
# hey
# you
cat('hey\\nyou')
# hey\nyou

Related

In R, Is there a way to break long string literals? [duplicate]

I want to split a line in an R script over multiple lines (because it is too long). How do I do that?
Specifically, I have a line such as
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/then/some/more')
Is it possible to split the long path over multiple lines? I tried
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/
then/some/more')
with return key at the end of the first line; but that does not work.
Thanks.
Bah, comments are too small. Anyway, #Dirk is very right.
R doesn't need to be told the code starts at the next line. It is smarter than Python ;-) and will just continue to read the next line whenever it considers the statement as "not finished". Actually, in your case it also went to the next line, but R takes the return as a character when it is placed between "".
Mind you, you'll have to make sure your code isn't finished. Compare
a <- 1 + 2
+ 3
with
a <- 1 + 2 +
3
So, when spreading code over multiple lines, you have to make sure that R knows something is coming, either by :
leaving a bracket open, or
ending the line with an operator
When we're talking strings, this still works but you need to be a bit careful. You can open the quotation marks and R will read on until you close it. But every character, including the newline, will be seen as part of the string :
x <- "This is a very
long string over two lines."
x
## [1] "This is a very\nlong string over two lines."
cat(x)
## This is a very
## long string over two lines.
That's the reason why in this case, your code didn't work: a path can't contain a newline character (\n). So that's also why you better use the solution with paste() or paste0() Dirk proposed.
You are not breaking code over multiple lines, but rather a single identifier. There is a difference.
For your issue, try
R> setwd(paste("~/a/very/long/path/here",
"/and/then/some/more",
"/and/then/some/more",
"/and/then/some/more", sep=""))
which also illustrates that it is perfectly fine to break code across multiple lines.
Dirk's method above will absolutely work, but if you're looking for a way to bring in a long string where whitespace/structure is important to preserve (example: a SQL query using RODBC) there is a two step solution.
1) Bring the text string in across multiple lines
long_string <- "this
is
a
long
string
with
whitespace"
2) R will introduce a bunch of \n characters. Strip those out with strwrap(), which destroys whitespace, per the documentation:
strwrap(long_string, width=10000, simplify=TRUE)
By telling strwrap to wrap your text to a very, very long row, you get a single character vector with no whitespace/newline characters.
For that particular case there is file.path :
File <- file.path("~",
"a",
"very",
"long",
"path",
"here",
"that",
"goes",
"beyond",
"80",
"characters",
"and",
"then",
"some",
"more")
setwd(File)
The glue::glue function can help. You can write a string on multiple lines in a script but remove the line breaks from the string object by ending each line with \\:
glue("some\\
thing")
something
I know this post is old, but I had a Situation like this and just want to share my solution. All the answers above work fine. But if you have a Code such as those in data.table chaining Syntax it becomes abit challenging. e.g. I had a Problem like this.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][, Rain:=tstrsplit(files$file, "/")[1:4][[2]]][, Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][, Geom:=tstrsplit(files$file, "/")[1:4][[4]]][time_[s]<=12000]
I tried most of the suggestions above and they didn´t work. but I figured out that they can be split after the comma within []. Splitting at ][ doesn´t work.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][,
Rain:=tstrsplit(files$file, "/")[1:4][[2]]][,
Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][,
Geom:=tstrsplit(files$file, "/")[1:4][[4]]][`time_[s]`<=12000]
There is no coinvent way to do this because there is no operator in R to do string concatenation.
However, you can define a R Infix operator to do string concatenation:
`%+%` = function(x,y) return(paste0(x,y))
Then you can use it to concat strings, even break the code to multiple lines:
s = "a" %+%
"b" %+%
"c"
This will give you "abc".
This will keep the \n character, but you can also just wrap the quote in parentheses. Especially useful in RMarkdown.
t <- ("
this is a long
string
")
I do this all of the time. Use the paste0() function.
Rootdir = "/myhome/thisproject/part1/"
Subdir = "subdirectory1/subsubdir2/"
fullpath = paste0( Rootdir, Subdir )
fullpath
> fullpath
[1] "/myhome/thisproject/part1/subdirectory1/subsubdir2/"

skip lines in reading files using reg ex

i have files with similar contents
!software version: $Revision$
!date: 07/06/2016 $
!
! from Mouse Genome Database (MGD) & Gene Expression Database (GXD)
!
MGI
I am using read.csv to read the files. But I need to skip the lines with "!" in the beginning. How can I do that?
The read.csv function and read.table that it is based on have an argument called comment.char which can be used to specify a character that if seen will ignore the rest of that line. Setting that to "!" may be enough to do what you want.
If you really need a regular expression, then the best approach is to read the file using readLines (or similar function), then apply the regular expression to the resulting vector of character strings to drop to unwanted elements (rows), then pass the result to the text argument to read.table (or use a text connection).
To calculate the first line that doesn't start with a !,
to_skip <- min(grep('^[^!]', trimws(readLines('file.csv'))))
df <- read.csv('file.csv', skip = to_skip)

Reading a csv file with embedded quotes into R

I have to work with a .csv file that comes like this:
"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""
"56144,""Net Present Value PLUS (NPV+)"",1"
"56144,""Net Present Value PLUS (NPV+)"",1"
If I use read.csv, I obtain a data frame with one variable. What I need is a data frame with three columns, where columns are separated by commas. How can I handle the quotes at the beginning of the line and the end of the line?
I don't think there's going to be an easy way to do this without stripping the initial and terminal quotation marks first. If you have sed on your system (Unix [Linux/MacOS] or Windows+Cygwin?) then
read.csv(pipe("sed -e 's/^\"//' -e 's/\"$//' qtest.csv"))
should work. Otherwise
read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))
is a little less efficient for big files (you have to read in the whole thing before processing it), but should work anywhere.
(There may be a way to do the regular expression for sed in the same, more-compact form using parentheses that the second example uses, but I got tired of trying to sort out where all the backslashes belonged.)
I suggest both removing the initial/terminal quotes and turning the back-to-back double quotes into single double quotes. The latter is crucial in case some of the strings contain commas themselves, as in
"1,""A mostly harmless string"",11"
"2,""Another mostly harmless string"",12"
"3,""These, commas, cause, trouble"",13"
Removing only the initial/terminal quotes while keeping the back-to-back quote leads the read.csv() function to produce 6 variables, as it interprets all commas in the last row as value separators. So the complete code might look like this:
data.text <- readLines("fullofquotes.csv") # Reads data from file into a character vector.
data.text <- gsub("^\"|\"$", "", data.text) # Removes initial/terminal quotes.
data.text <- gsub("\"\"", "\"", data.text) # Replaces "" by ".
data <- read.csv(text=data.text, header=FALSE)
Or, of course, all in a single line
data <- read.csv(text=gsub("\"\"", "\"", gsub("^\"|\"$", "", readLines("fullofquotes.csv", header=FALSE))))

readline is considering every record in the spreadsheet as a new line [R]

I am trying to create a function that will calculate the frequency count of keywords using TM package. The function works fine if the text pasted from readline is on free form text without a new line. The problem is, when I paste a bunch of text copied from a spreadsheet, readline considers it as a new line.
keyword <- function() {
x <- readline(as.character('Input text here: '))
x <- Corpus(VectorSource(x))
...
tdm <- TermDocumentMatrix(x)
...
tdm
}
Here's the full code: https://github.com/CSCDataAnalytics/PM-Analysis/blob/master/Keyword.R
How can I prevent this from happening or at least consider a bunch of text of every row from the spreadsheet as one vector only?
If I'm understanding you correctly, the problem is when the user pastes the text from another application: the newline is causing R to stop accepting the subsequent lines.
One technique (fragile as it may be) is to look for a specific line, such as an empty line "" or a period ".". It's a little fragile because now you need (1) assurance that the data will "never" include that as a whole line, and (2) it is easily appended by the user.
Try:
endofinput <- ""
totalstr <- ""
while(! endofinput == (x <- readline('prompt (empty string when done): ')))
totalstr <- paste(totalstr, x)
In this case, the empty string is the catch, and when the while loop is done, totalstr contains all input separated by a space (this can be changed in the paste function).
NB: one problem with this technique is that it is "growing" the vector totalstr, which will eventually cause performance penalties (depending on the size of the input data): every loop iteration, more memory is allocated and the entire string is copied plus the new line of text. There are more verbose ways to side-step this problem (e.g., pre-allocate a vector larger than your anticipated input data), but if you aren't anticipated 1000s of lines then you may be able to accept this naive programming for simplicity.
Another option would be to have the user save the data to a text file and use file.choose() and readLines() to get your data.
Try collapsing the data into a single string after using readline
x <- paste(readline(as.character('Input text here: ')), collapse=' ')

Split code over multiple lines in an R script

I want to split a line in an R script over multiple lines (because it is too long). How do I do that?
Specifically, I have a line such as
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/then/some/more')
Is it possible to split the long path over multiple lines? I tried
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/
then/some/more')
with return key at the end of the first line; but that does not work.
Thanks.
Bah, comments are too small. Anyway, #Dirk is very right.
R doesn't need to be told the code starts at the next line. It is smarter than Python ;-) and will just continue to read the next line whenever it considers the statement as "not finished". Actually, in your case it also went to the next line, but R takes the return as a character when it is placed between "".
Mind you, you'll have to make sure your code isn't finished. Compare
a <- 1 + 2
+ 3
with
a <- 1 + 2 +
3
So, when spreading code over multiple lines, you have to make sure that R knows something is coming, either by :
leaving a bracket open, or
ending the line with an operator
When we're talking strings, this still works but you need to be a bit careful. You can open the quotation marks and R will read on until you close it. But every character, including the newline, will be seen as part of the string :
x <- "This is a very
long string over two lines."
x
## [1] "This is a very\nlong string over two lines."
cat(x)
## This is a very
## long string over two lines.
That's the reason why in this case, your code didn't work: a path can't contain a newline character (\n). So that's also why you better use the solution with paste() or paste0() Dirk proposed.
You are not breaking code over multiple lines, but rather a single identifier. There is a difference.
For your issue, try
R> setwd(paste("~/a/very/long/path/here",
"/and/then/some/more",
"/and/then/some/more",
"/and/then/some/more", sep=""))
which also illustrates that it is perfectly fine to break code across multiple lines.
Dirk's method above will absolutely work, but if you're looking for a way to bring in a long string where whitespace/structure is important to preserve (example: a SQL query using RODBC) there is a two step solution.
1) Bring the text string in across multiple lines
long_string <- "this
is
a
long
string
with
whitespace"
2) R will introduce a bunch of \n characters. Strip those out with strwrap(), which destroys whitespace, per the documentation:
strwrap(long_string, width=10000, simplify=TRUE)
By telling strwrap to wrap your text to a very, very long row, you get a single character vector with no whitespace/newline characters.
For that particular case there is file.path :
File <- file.path("~",
"a",
"very",
"long",
"path",
"here",
"that",
"goes",
"beyond",
"80",
"characters",
"and",
"then",
"some",
"more")
setwd(File)
The glue::glue function can help. You can write a string on multiple lines in a script but remove the line breaks from the string object by ending each line with \\:
glue("some\\
thing")
something
I know this post is old, but I had a Situation like this and just want to share my solution. All the answers above work fine. But if you have a Code such as those in data.table chaining Syntax it becomes abit challenging. e.g. I had a Problem like this.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][, Rain:=tstrsplit(files$file, "/")[1:4][[2]]][, Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][, Geom:=tstrsplit(files$file, "/")[1:4][[4]]][time_[s]<=12000]
I tried most of the suggestions above and they didn´t work. but I figured out that they can be split after the comma within []. Splitting at ][ doesn´t work.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][,
Rain:=tstrsplit(files$file, "/")[1:4][[2]]][,
Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][,
Geom:=tstrsplit(files$file, "/")[1:4][[4]]][`time_[s]`<=12000]
There is no coinvent way to do this because there is no operator in R to do string concatenation.
However, you can define a R Infix operator to do string concatenation:
`%+%` = function(x,y) return(paste0(x,y))
Then you can use it to concat strings, even break the code to multiple lines:
s = "a" %+%
"b" %+%
"c"
This will give you "abc".
This will keep the \n character, but you can also just wrap the quote in parentheses. Especially useful in RMarkdown.
t <- ("
this is a long
string
")
I do this all of the time. Use the paste0() function.
Rootdir = "/myhome/thisproject/part1/"
Subdir = "subdirectory1/subsubdir2/"
fullpath = paste0( Rootdir, Subdir )
fullpath
> fullpath
[1] "/myhome/thisproject/part1/subdirectory1/subsubdir2/"

Resources