I have a CSV fwith several columns: Tweet, date, etc. The spaces in some Tweets is causing blank lines and undesired truncated lines.
What works:
1. Using Notepad++'s function "Line Operations>Remove Empty Lines (Containing Blank Characters)"
2. Search and replace: \r with nothing.
However, I need to do this for a large number of files, and I can't manage to find a Regular Expression with gsub() in R that will do what the Notepadd++ function does.
Note that replacing ^[ \t]*$\r?\n with nothing and then \r with nothing does work in Notepad++, but not in R, as suggested here, but it does not work with g(sub) in R.
I have tried the following code:
tx <- readLines("tweets.csv")
subbed <-gsub(pattern = "^[ \\t]*$\\r?\\n", replace = "", x = tx)
subbed <-gsub(pattern = "\r", replace = "", x = subbed)
writeLines(subbed, "output.csv")
This is the input:
This is the desired output:
You may use
library(readtext)
tx <- readtext("weets.csv")
subbed <- gsub("(?m)^\\h*\\R?", "", tx$text, perl=TRUE)
subbed <- gsub("\r", "", subbed, fixed=TRUE)
writeLines(trimws(subbed), "output.csv")
The readtext llibrary reads the file into a single variable and thus all line break chars are kept.
I'd like to remove all elements of a string after a certain keyword.
Example :
this.is.an.example.string.that.I.have
Desired Output :
This.is.an.example
I've tried using gsub('string', '', list) but that only removes the word string. I've also tried using the gsub('^string', '', list) but that also doesn't seem to work.
Thank you.
Following simple sub may help you here.
sub("\\.string.*","",variable)
Explanation: Method of using sub
sub(regex_to_replace_text_in_variable,new_value,variable)
Difference between sub and gsub:
sub: is being used for performing substitution on variables.
gsub: gsub is being used for same substitution tasks only but only thing it will be perform substitution on ALL matches found though sub performs it only for first match found one.
From help page of R:
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
You can try this positive lookbehind regex
S <- 'this.is.an.example.string.that.I.have'
gsub('(?<=example).*', '', S, perl=TRUE)
# 'this.is.an.example'
You can use strsplit. Here you split your string after a key word, and retain the first part of the string.
x <- "this.is.an.example.string.that.I.have"
strsplit(x, '(?<=example)', perl=T)[[1]][1]
[1] "this.is.an.example"
I have the following string:
test <- "C:\\Users\\stefanj\\Documents\\Automation_Desk\\script.R"
I am separating the string on the backslash characters with the following code:
pdf_path_long <- unlist(strsplit(test, "\\\\",
fixed = FALSE, perl = FALSE, useBytes = FALSE))
What I want to do is:
pdf_path_short <- file.path(pdf_path_long[1], pdf_path_long[2], ...)
Problem is:
I know how to count the elements in the pdf_path_short - length(pdf_path_long), but I don't know how to set them in the file.path as the number of elements will very based on the length of the path.
You can directly (no need for a strsplit call) use gsub on test to change the separators (with fixed=TRUE so you don't need to escape the double backslash), you will get same output as with file.path:
pdf_path_short <- gsub("\\", "/", test, fixed=TRUE)
pdf_path_short
# "C:/Users/stefanj/Documents/Automation_Desk/script.R"
Of course, you can change the replacement part with whatever separator you need.
Note: you can also check normalizePath function:
normalizePath(test, "/", mustWork=FALSE)
#[1] "C:/Users/stefanj/Documents/Automation_Desk/script.R"
I'm trying to use the following code to replace two dots for only one:
test<-"test..1"
gsub("\\..", ".", test, fixed=TRUE)
and getting:
[1] "test..1"
I tried several combinations of escape strings, including brackets [] with no success.
What am I doing wrong?
If you are going to use fixed = TRUE, use the (non-interpreted) character .:
> gsub("..", ".", test, fixed = TRUE)
Otherwise, within regular expressions (fixed = FALSE), . has a special meaning (any character) so you'll want to prefix it with a backslash to mean "the dot character":
> gsub("\\.\\.", ".", test)
> gsub("\\.{2}", ".", test)
I am having trouble to read a file containing lines like the one below in R.
"_:b5507F4C7x59005","Fabiana D\"atri"
Any idea? How can I make read.table understand that \" is the escape of quote?
Cheers,
Alexandre
It seems to me that read.table/read.csv cannot handle escaped quotes.
...But I think I have an (ugly) work-around inspired by #nullglob;
First read the file WITHOUT a quote character.
(This won't handle embedded , as #Ben Bolker noted)
Then go though the string columns and remove the quotes:
The test file looks like this (I added a non-string column for good measure):
13,"foo","Fab D\"atri","bar"
21,"foo2","Fab D\"atri2","bar2"
And here is the code:
# Generate test file
writeLines(c("13,\"foo\",\"Fab D\\\"atri\",\"bar\"",
"21,\"foo2\",\"Fab D\\\"atri2\",\"bar2\"" ), "foo.txt")
# Read ignoring quotes
tbl <- read.table("foo.txt", as.is=TRUE, quote='', sep=',', header=FALSE, row.names=NULL)
# Go through and cleanup
for (i in seq_len(NCOL(tbl))) {
if (is.character(tbl[[i]])) {
x <- tbl[[i]]
x <- substr(x, 2, nchar(x)-1) # Remove surrounding quotes
tbl[[i]] <- gsub('\\\\"', '"', x) # Unescape quotes
}
}
The output is then correct:
> tbl
V1 V2 V3 V4
1 13 foo Fab D"atri bar
2 21 foo2 Fab D"atri2 bar2
On Linux/Unix (or on Windows with cygwin or GnuWin32), you can use sed to convert the escaped double quotes \" to doubled double quotes "" which can be handled well by read.csv:
p <- pipe(paste0('sed \'s/\\\\"/""/g\' "', FILENAME, '"'))
d <- read.csv(p, ...)
rm(p)
Effectively, the following sed command is used to preprocess the CSV input:
sed 's/\\"/""/g' file.csv
I don't call this beautiful, but at least you don't have to leave the R environment...
My apologies ahead of time that this isn't more detailed -- I'm right in the middle of a code crunch.
You might consider using the scan() function. I created a simple sample file "sample.csv," which consists of:
V1,V2
"_:b5507F4C7x59005","Fabiana D\"atri"
Two quick possibilities are (with output commented so you can copy-paste to the command line):
test <- scan("sample.csv", sep=",", what='character',allowEscapes=TRUE)
## Read 4 items
test
##[1] "V1" "V2" "_:b5507F4C7x59005"
##[4] "Fabiana D\\atri\n"
or
test <- scan("sample.csv", sep=",", what='character',comment.char="\\")
## Read 4 items
test
## [1] "V1" "V2" "_:b5507F4C7x59005"
## [4] "Fabiana D\\atri\n"
You'll probably need to play around with it a little more to get what you want. And I see that you've already mentioned writeLines, so you may have already tried this. Either way, good luck!
I was able to get your eample to work by setting the quote argument:
> read.csv('test.csv',quote="'",head=FALSE)
V1 V2
1 "_:b5507F4C7x59005" "Fabiana D\\"atri"
2 "_:b5507F4C7x59005" "Fabiana D\\"atri"
read_delim from package readr can handle escaped and doubled double quotes, using the arguments escape_double and escape_backslash.
For example, if our file escapes quotes by doubling them:
"quote""","hello"
1,2
then we use
read_delim(file, delim=',') # default escape_backslash=FALSE, escape_double=TRUE
If our file escapes quotes with a backslash:
"quote\"","hello"
1,2
we use
read_delim(file, delim=',', escape_double=FALSE, escape_backslash=TRUE)
As of newer R versions, readr::read_delim() is the correct answer.
data = read_delim(filename, delim = "\t", quote = "\"",
escape_backslash=T, escape_double=F,
# The columns depend on your data
col_names = c("timeStart", "posEnd", "added", "removed"),
col_types = "nncc"
)
This should be fine with read.csv(). Take a look at the help for ?read.csv - the option for specifying the quote is quote = "....". In this case, though, there may be a problem: it seems that read.csv() prefers to see matching quotes.
I tried the same with read.table("sample.txt", header = FALSE, as.is = TRUE), with your text in sample.txt, and it seems to work. When all else fails with read.csv(), I tend to back up to read.table() and specify the parameters carefully.