strsplit not consistently working, character between letters isn't a space? - r

The problem is very simple, but I'm having no luck fixing it. strsplit() is a fairly simple function, and I am surprised I am struggling as much as I am:
# temp is the problem string. temp is copy / pasted from my R code.
# i am hoping the third character, the space, which i think is the error, remains the error
temp = "GS PG"
# temp2 is created in stackoverflow, using an actual space
temp2 = "GS PG"
unlist(strsplit(temp, split = " "))
[1] "GS PG"
unlist(strsplit(temp2, split = " "))
[1] "GS" "PG"
.
even if it doesn't work here with me trying to reproduce the example, this is the issue i am running into. with temp, the code isn't splitting the variable on the space for some odd reason. any thoughts would be appreciated!
Best,
EDIT - my example failed to recreate the issue. For reference, temp is being created in my code by scraping code from online with rvest, and for some reason it must be scraping a different character other than a normal space, i think? I need to split these strings by space though.

Try the following:
unlist(strsplit(temp, "\\s+"))
The "\\s+" is a regex search for any type of whitespace instead of just a standard space.

As in the comment,
It is likely that the "space" is not actually a space but some other whitespace character.
Try any of the following to narrow it down:
whitespace <- c(" ", "\t" , "\n", "\r", "\v", "\f")
grep(paste(whitespace,collapse="|"), temp)
Related question here:
How to remove all whitespace from a string?

Related

In R, Is there a way to break long string literals? [duplicate]

I want to split a line in an R script over multiple lines (because it is too long). How do I do that?
Specifically, I have a line such as
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/then/some/more')
Is it possible to split the long path over multiple lines? I tried
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/
then/some/more')
with return key at the end of the first line; but that does not work.
Thanks.
Bah, comments are too small. Anyway, #Dirk is very right.
R doesn't need to be told the code starts at the next line. It is smarter than Python ;-) and will just continue to read the next line whenever it considers the statement as "not finished". Actually, in your case it also went to the next line, but R takes the return as a character when it is placed between "".
Mind you, you'll have to make sure your code isn't finished. Compare
a <- 1 + 2
+ 3
with
a <- 1 + 2 +
3
So, when spreading code over multiple lines, you have to make sure that R knows something is coming, either by :
leaving a bracket open, or
ending the line with an operator
When we're talking strings, this still works but you need to be a bit careful. You can open the quotation marks and R will read on until you close it. But every character, including the newline, will be seen as part of the string :
x <- "This is a very
long string over two lines."
x
## [1] "This is a very\nlong string over two lines."
cat(x)
## This is a very
## long string over two lines.
That's the reason why in this case, your code didn't work: a path can't contain a newline character (\n). So that's also why you better use the solution with paste() or paste0() Dirk proposed.
You are not breaking code over multiple lines, but rather a single identifier. There is a difference.
For your issue, try
R> setwd(paste("~/a/very/long/path/here",
"/and/then/some/more",
"/and/then/some/more",
"/and/then/some/more", sep=""))
which also illustrates that it is perfectly fine to break code across multiple lines.
Dirk's method above will absolutely work, but if you're looking for a way to bring in a long string where whitespace/structure is important to preserve (example: a SQL query using RODBC) there is a two step solution.
1) Bring the text string in across multiple lines
long_string <- "this
is
a
long
string
with
whitespace"
2) R will introduce a bunch of \n characters. Strip those out with strwrap(), which destroys whitespace, per the documentation:
strwrap(long_string, width=10000, simplify=TRUE)
By telling strwrap to wrap your text to a very, very long row, you get a single character vector with no whitespace/newline characters.
For that particular case there is file.path :
File <- file.path("~",
"a",
"very",
"long",
"path",
"here",
"that",
"goes",
"beyond",
"80",
"characters",
"and",
"then",
"some",
"more")
setwd(File)
The glue::glue function can help. You can write a string on multiple lines in a script but remove the line breaks from the string object by ending each line with \\:
glue("some\\
thing")
something
I know this post is old, but I had a Situation like this and just want to share my solution. All the answers above work fine. But if you have a Code such as those in data.table chaining Syntax it becomes abit challenging. e.g. I had a Problem like this.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][, Rain:=tstrsplit(files$file, "/")[1:4][[2]]][, Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][, Geom:=tstrsplit(files$file, "/")[1:4][[4]]][time_[s]<=12000]
I tried most of the suggestions above and they didn´t work. but I figured out that they can be split after the comma within []. Splitting at ][ doesn´t work.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][,
Rain:=tstrsplit(files$file, "/")[1:4][[2]]][,
Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][,
Geom:=tstrsplit(files$file, "/")[1:4][[4]]][`time_[s]`<=12000]
There is no coinvent way to do this because there is no operator in R to do string concatenation.
However, you can define a R Infix operator to do string concatenation:
`%+%` = function(x,y) return(paste0(x,y))
Then you can use it to concat strings, even break the code to multiple lines:
s = "a" %+%
"b" %+%
"c"
This will give you "abc".
This will keep the \n character, but you can also just wrap the quote in parentheses. Especially useful in RMarkdown.
t <- ("
this is a long
string
")
I do this all of the time. Use the paste0() function.
Rootdir = "/myhome/thisproject/part1/"
Subdir = "subdirectory1/subsubdir2/"
fullpath = paste0( Rootdir, Subdir )
fullpath
> fullpath
[1] "/myhome/thisproject/part1/subdirectory1/subsubdir2/"

gsub replace text after set of characters

I have a lot of error messages that I am trying to clean up.
some of the errors end with the text "(sec): 0.xxx"
i'm trying to use gsub to remove everything after (sec)
data$Message <- gsub("(sec).*", "", data$Message, perl = TRUE)
this returns everything after (
I know it would be easy to just use ":" or ")" but then it effects other errors that I do not want to change.
Is there a way to use gsub to look at several characters -like "(sec)"- instead of just one?
on a related note is their a symbol that represents any number (excludes text) similiar to "."?
You can use regex look behind ?<= to avoid sec being removed and at the same time assert the removed pattern follows sec, so (?<=sec\\)).* will remove everything after sec) but not sec) itself:
gsub("(?<=sec\\)).*", "", "(sec): 0.xxx", perl = TRUE)
# [1] "(sec)"
You can select the first part of the expression (between brackets) and omit the rest:
gsub('(^.*\\(sec\\)).*', '\\1', '(sec): 0.xxx')
## [1] "(sec)"

Removing white space from data frame in R

I have scraped some data and stored it in a data frame. Some rows contain unwanted information within square brackets. Example "[N] Team Name".
I want to keep just the part containing the team name, so first I use the code below to remove the brackets and any text contained within them
gsub( " *\\(.*?\\) *", "", x)
This leaves me with " Team Name" (notice the space before the T).
Now I am trying to remove the white space before the T using trimws or the method shown here, but it is not working
could someone please help me with removing the extra white space.
Note: if I write the string containing the space manually and apply trimws on it, it works. However when obtaining the string directly from the data frame it doesnt. Also when running the code snippet below (where df[1,1] is the same string retreived from the data frame), I get FALSE. This gives me reason to believe that the string in the data frame is not the same as the manually typed string.
" team name" == df[1,1]
You could try
gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")
We can use
sub(".*\\]\\s+", "", x)
#[1] "Team Name"
Or just
sub("\\S+\\s+", "", x)
#[1] "Team Name"
data
x <- '[N] Team Name';
You should be able to remove the bracketed piece as well as any following whitespace with a single regex substitution. Your regex is correct as-is, and should successfully accomplish this. (Note: I've ignored the unexplained discrepancy between your use of parentheses vs. square brackets in your question. I've assumed square brackets for my answer.)
Strangely, this seems to be a case where the default regex engine is failing, but adding perl=T gets it working:
x <- '[N] Team Name';
gsub(' *\\[.*?\\] *','',x);
## [1] " Team Name"
gsub(perl=T,' *\\[.*?\\] *','',x);
## [1] "Team Name"
In the past I have run across cases where the default regex engine flakes out, but I have never encountered this with perl=T, so I suggest you use that. I really think there is something broken in the default regex implementation.

How to remove website urls from dataset corpus in R? [duplicate]

I am new to R and text mining. I had made a word cloud out of twitter feed related to some term. The problem that I'm facing is that in the wordcloud it shows http:... or htt...
How do I deal about this issue
I tried using metacharacter * but I still doubt if I'm applying it right
tw.text = removeWords(tw.text,c(stopwords("en"),"rt","http\\*"))
somebody into text-minning please help me with this.
If you are looking to remove URLs from your string, you may use:
gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
Where x would be:
x <- c("some text http://idontwantthis.com",
"same problem again http://pleaseremoveme.com")
It would be easier to provide you with a specific answer if you could post sample of your data but the following example would give you a clean text with no URLs:
> clean_x <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
> clean_x
[1] "some text " "same problem again "
As a side point, I would suggest that it may be worth searching for the existing methods to clean text before mining. For example the clean function discussed here would enable you to do this automatically. On similar lines, there are function to clean your text from tweets (#,#), punctuation and other undesirable entries.
Apply the below code to corpus to replace a string pattern with space.
String pattern can be urls or terms you want to remove from the wordcloud.
For example to remove terms starting with https:
replace with space
toSpace = content_transformer( function(x, pattern) gsub(pattern," ",x) )
tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "https*")
Or pass a pattern as below to remove urls
tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")

Split code over multiple lines in an R script

I want to split a line in an R script over multiple lines (because it is too long). How do I do that?
Specifically, I have a line such as
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/then/some/more')
Is it possible to split the long path over multiple lines? I tried
setwd('~/a/very/long/path/here/that/goes/beyond/80/characters/and/
then/some/more')
with return key at the end of the first line; but that does not work.
Thanks.
Bah, comments are too small. Anyway, #Dirk is very right.
R doesn't need to be told the code starts at the next line. It is smarter than Python ;-) and will just continue to read the next line whenever it considers the statement as "not finished". Actually, in your case it also went to the next line, but R takes the return as a character when it is placed between "".
Mind you, you'll have to make sure your code isn't finished. Compare
a <- 1 + 2
+ 3
with
a <- 1 + 2 +
3
So, when spreading code over multiple lines, you have to make sure that R knows something is coming, either by :
leaving a bracket open, or
ending the line with an operator
When we're talking strings, this still works but you need to be a bit careful. You can open the quotation marks and R will read on until you close it. But every character, including the newline, will be seen as part of the string :
x <- "This is a very
long string over two lines."
x
## [1] "This is a very\nlong string over two lines."
cat(x)
## This is a very
## long string over two lines.
That's the reason why in this case, your code didn't work: a path can't contain a newline character (\n). So that's also why you better use the solution with paste() or paste0() Dirk proposed.
You are not breaking code over multiple lines, but rather a single identifier. There is a difference.
For your issue, try
R> setwd(paste("~/a/very/long/path/here",
"/and/then/some/more",
"/and/then/some/more",
"/and/then/some/more", sep=""))
which also illustrates that it is perfectly fine to break code across multiple lines.
Dirk's method above will absolutely work, but if you're looking for a way to bring in a long string where whitespace/structure is important to preserve (example: a SQL query using RODBC) there is a two step solution.
1) Bring the text string in across multiple lines
long_string <- "this
is
a
long
string
with
whitespace"
2) R will introduce a bunch of \n characters. Strip those out with strwrap(), which destroys whitespace, per the documentation:
strwrap(long_string, width=10000, simplify=TRUE)
By telling strwrap to wrap your text to a very, very long row, you get a single character vector with no whitespace/newline characters.
For that particular case there is file.path :
File <- file.path("~",
"a",
"very",
"long",
"path",
"here",
"that",
"goes",
"beyond",
"80",
"characters",
"and",
"then",
"some",
"more")
setwd(File)
The glue::glue function can help. You can write a string on multiple lines in a script but remove the line breaks from the string object by ending each line with \\:
glue("some\\
thing")
something
I know this post is old, but I had a Situation like this and just want to share my solution. All the answers above work fine. But if you have a Code such as those in data.table chaining Syntax it becomes abit challenging. e.g. I had a Problem like this.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][, Rain:=tstrsplit(files$file, "/")[1:4][[2]]][, Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][, Geom:=tstrsplit(files$file, "/")[1:4][[4]]][time_[s]<=12000]
I tried most of the suggestions above and they didn´t work. but I figured out that they can be split after the comma within []. Splitting at ][ doesn´t work.
mass <- files[, Veg:=tstrsplit(files$file, "/")[1:4][[1]]][,
Rain:=tstrsplit(files$file, "/")[1:4][[2]]][,
Roughness:=tstrsplit(files$file, "/")[1:4][[3]]][,
Geom:=tstrsplit(files$file, "/")[1:4][[4]]][`time_[s]`<=12000]
There is no coinvent way to do this because there is no operator in R to do string concatenation.
However, you can define a R Infix operator to do string concatenation:
`%+%` = function(x,y) return(paste0(x,y))
Then you can use it to concat strings, even break the code to multiple lines:
s = "a" %+%
"b" %+%
"c"
This will give you "abc".
This will keep the \n character, but you can also just wrap the quote in parentheses. Especially useful in RMarkdown.
t <- ("
this is a long
string
")
I do this all of the time. Use the paste0() function.
Rootdir = "/myhome/thisproject/part1/"
Subdir = "subdirectory1/subsubdir2/"
fullpath = paste0( Rootdir, Subdir )
fullpath
> fullpath
[1] "/myhome/thisproject/part1/subdirectory1/subsubdir2/"

Resources