Removing white space from data frame in R

I have scraped some data and stored it in a data frame. Some rows contain unwanted information within square brackets. Example "[N] Team Name".
I want to keep just the part containing the team name, so first I use the code below to remove the brackets and any text contained within them
gsub( " *\\(.*?\\) *", "", x)
This leaves me with " Team Name" (notice the space before the T).
Now I am trying to remove the white space before the T using trimws or the method shown here, but it is not working.
Could someone please help me remove the extra white space?
Note: if I type the string containing the space manually and apply trimws to it, it works. However, when the string comes directly from the data frame, it doesn't. Also, when running the code snippet below (where df[1,1] is the same string retrieved from the data frame), I get FALSE. This gives me reason to believe that the string in the data frame is not the same as the manually typed string.
" team name" == df[1,1]

You could try
gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")
## [1] "Team Name"

We can use
sub(".*\\]\\s+", "", x)
#[1] "Team Name"
Or just
sub("\\S+\\s+", "", x)
#[1] "Team Name"
data
x <- '[N] Team Name';

You should be able to remove the bracketed piece as well as any following whitespace with a single regex substitution. (Note: I've ignored the unexplained discrepancy between your use of parentheses vs. square brackets in your question. I've assumed square brackets for my answer.)
This looks less like a failure of the regex engine and more like the leading character not being an ordinary space at all. Scraped HTML very often contains a non-breaking space (U+00A0), which prints like a space but is a different character, so neither a literal space in your pattern nor trimws() (which by default only strips [ \t\r\n]) will touch it. The FALSE you get from the == comparison points the same way. Switch the pattern to \s and make it Unicode-aware with perl=TRUE and the (*UCP) verb:
x <- '[N] Team Name';
gsub('(*UCP)\\s*\\[.*?\\]\\s*','',x,perl=TRUE);
## [1] "Team Name"
To confirm the diagnosis, inspect the code points of the offending cell with utf8ToInt(df[1,1]); a non-breaking space shows up as 160 instead of 32.
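A reproduction of the symptom, under the assumption (not confirmed in the question) that the scraped page delivered a non-breaking space, U+00A0, where a space appears on screen:

```r
# simulate what scraping may hand back: U+00A0 instead of an ASCII space
x <- "[N]\u00a0Team Name"

gsub(" *\\[.*?\\] *", "", x)    # the leading U+00A0 survives: literal " " never matches it
trimws(gsub(" *\\[.*?\\] *", "", x))   # still there: trimws strips only [ \t\r\n] by default

# Unicode-aware whitespace matching removes it in one pass
gsub("(*UCP)\\s*\\[.*?\\]\\s*", "", x, perl = TRUE)
## [1] "Team Name"

utf8ToInt("\u00a0")             # 160, not 32: why "" == comparisons return FALSE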


R: remove every word that ends with ".htm"

I have a df = desc with a variable "value" that holds long text, and I would like to remove every word in that variable that ends with ".htm". I have looked around here and at regex references for a long time and cannot find a solution.
Can anyone help? Thank you so much!
I tried things like:
library(stringr)
desc <- str_replace_all(desc$value, "\*.htm*$", "")
But I get:
Error: '\*' is an unrecognized escape in character string starting ""\*"
This regex:
Will catch everything that ends with .htm
Will not catch instances with .html
Is not dependent on being at the beginning / end of a string.
strings <- c("random text shouldbematched.htm notremoved.html matched.htm random stuff")
gsub("\\w+\\.htm\\b", "", strings)
Output:
[1] "random text  notremoved.html  random stuff"
I am not sure what exactly you would like to accomplish, but I guess one of those is what you are looking for:
words <- c("apple", "test.htm", "friend.html", "remove.htm")
# just remove the ".htm" substring from every string (note the escaped dot)
str_replace_all(words, "\\.htm", "")
# exclude all words that contain .htm anywhere
words[!grepl(pattern = "\\.htm", words)]
# exclude all words that END with .htm
words[substr(words, nchar(words)-3, nchar(words)) != ".htm"]
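For comparison, a base-R sketch of what each of the three approaches returns on the sample vector (gsub/grepl with fixed = TRUE stand in for the stringr calls; fixed = TRUE treats ".htm" literally, sidestepping the regex-dot subtlety):

```r
words <- c("apple", "test.htm", "friend.html", "remove.htm")

# substring replacement also mangles friend.html -> "friendl"
gsub(".htm", "", words, fixed = TRUE)
## [1] "apple"   "test"    "friendl" "remove"

# drop any word containing .htm anywhere (also drops .html)
words[!grepl(".htm", words, fixed = TRUE)]
## [1] "apple"

# drop only words that END with .htm
words[substr(words, nchar(words) - 3, nchar(words)) != ".htm"]
## [1] "apple"       "friend.html"
```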
A bare * does not act as a shell-style wildcard inside a regular expression, so I would remove it first. Also, in your code you are assigning the result back to the entire df instead of to the "value" column.
So I would suggest the following:
desc$value <- str_replace_all(desc$value, "\\.htm", "")
By doing so, you are telling R to remove every .htm in the desc$value variable alone. I hope it works!
Let's assume you have, as you say, a variable "value" that holds long text and you want to remove every word that ends in .html. Based on these assumptions you can use str_remove_all:
The main point here is to wrap the pattern into word boundary markers \\b:
library(stringr)
str_remove_all(value, "\\b\\w+\\.html\\b")
[1] "apple  and test2.html01"           "the word  must etc. and  as well"  "we want to remove .htm"
Data:
value <- c("apple test.html and test2.html01",
"the word friend.html must etc. and x.html as well",
"we want to remove .htm")
To achieve what you want just do:
desc$value <- str_replace(desc$value, ".*\\.htm$", "")
You are trying to escape the star, which is pointless, and you get an error because \* is not a recognized escape sequence in R strings. Only \n, \t, etc. exist.
\. does not exist in R strings either. But \\ does, and it produces a single \ in the resulting string handed to the regular expression engine. Therefore, when you escape something in an R regexp you have to escape it twice:
In my regexp, .* means any characters and \\. means a literal dot. The backslash has to be doubled because it must first be escaped within the R string.
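The double-escaping rule can be seen directly at the console; a minimal sketch:

```r
pattern <- "\\."       # an R string literal of two characters
nchar(pattern)         # 2: a backslash and a dot reach the regex engine
cat(pattern, "\n")     # prints: \.
grepl(pattern, "a.b")  # TRUE  - \. matches the literal dot
grepl(pattern, "ab")   # FALSE - no dot, no match
```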

Replace multiple spaces in string, but leave singles spaces be

I am reading a PDF file using R. I would like to transform the given text in such a way, that whenever multiple spaces are detected, I want to replace them by some value (for example "_"). I've come across questions where all spaces of 1 or more can be replaced using "\\s+" (Merge Multiple spaces to single space; remove trailing/leading spaces) but this will not work for me. I have a string that looks something like this;
"[1]This is the first address    This is the second one
[2]This is the third one    
[3]This is the fourth one    This is the fifth"
When I apply the answers I found; replacing all spaces of 1 or more with a single space, I will not be able to recognise separate addresses anymore, because it would look like this;
gsub("\\s+", " ", str_trim(PDF))
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
So what I am looking for is something like this
"[1]This is the first address_This is the second one
[2]This is the third one_
[3]This is the fourth one_This is the fifth"
However if I rewrite the code used in the example, I get the following
gsub("\\s+", "_", str_trim(PDF))
"[1]This_is_the_first_address_This_is_the_second_one
[2]This_is_the_third_one_
[3]This_is_the_fourth_one_This_is_the_fifth"
Would anyone know a workaround for this? Any help will be greatly appreciated.
Whenever I come across string and reggex problems I like to refer to the stringr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
On the second page you can see a section titled "Quantifiers", which tells us how to solve this:
library(tidyverse)
s <- "This is the first address    This is the second one"
str_replace_all(s, "\\s{2,}", "_")
## [1] "This is the first address_This is the second one"
(I am loading the complete tidyverse instead of just stringr here due to force of habit; str_replace_all replaces every run, not just the first.)
Any run of 2 or more whitespace characters will now be replaced with _.
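The same quantifier works in base R; a minimal sketch, assuming multiple literal spaces separate the addresses:

```r
pdf_text <- "[1]This is the first address    This is the second one"
# {2,} requires at least two whitespace characters,
# so single spaces between words are left alone
gsub("\\s{2,}", "_", pdf_text)
## [1] "[1]This is the first address_This is the second one"
```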

Remove whitespace before bracket " (" in R

I have nearly 100,000 rows of scraped data that I have converted to data frames. One column is a string of text characters but is operating strangely. In the example below, there is text, that has bracketed information that I want to remove, and I also want to remove " (c)". However the space in front is not technically a space (is it considered whitespace?).
I am not sure how to reproduce the example here because when I copy/paste a record, it is treated like normal and works, but in the scraped data, it does not. Gut check was to count spaces and it gave me 4, which means the space in front of ( is not a true space. I do not know how to remove this!
My code that I usually would run is as follows. Again, works this way, but does not work in my scraped data.
test<-c("Barry Windham (c) & Mike Rotundo (c)")
test<-gsub("[ ][(]c[)]","",test)
You can consider using:
test<-c("Barry Windham (c) & Mike Rotundo (c)")
gsub("(*UCP)\\s+\\(c\\)", "", test, perl=TRUE)
# => [1] "Barry Windham & Mike Rotundo"
See an online R demo
Details
(*UCP) - makes all shorthand character classes in the PCRE regex (it is PCRE due to perl=TRUE) Unicode aware
\\s+ - any one or more Unicode whitespaces
\\(c\\) - a literal (c) substring.
If you need to keep (c), capture it and use a backreference in the replacement:
gsub("(*UCP)\\s+(\\(c\\))", "\\1", test, perl=TRUE)
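A self-contained check, constructing the string with an explicit U+00A0 (an assumption about what the scraper produced) so the behaviour is reproducible:

```r
# the "space" before each (c) is a non-breaking space, not an ASCII space
test <- "Barry Windham\u00a0(c) & Mike Rotundo\u00a0(c)"

gsub("[ ][(]c[)]", "", test)   # unchanged: the literal space never matches U+00A0

# Unicode-aware \s matches the NBSP once (*UCP) is set
gsub("(*UCP)\\s+\\(c\\)", "", test, perl = TRUE)
## [1] "Barry Windham & Mike Rotundo"
```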

How to remove urls without http in a text document using r

I am trying to remove URLs that may or may not start with http/https from a large text file, which I saved in urldoc in R. A URL may start like tinyurl.com/ydyzzlkk or aclj.us/2y6dQKw or pic.twitter.com/ZH08wej40K. Basically I want to remove the text before a '/' back to the preceding space, and the text after the '/' up to the following space. I tried many patterns and searched many places but couldn't complete the task. It would help me a lot if you could give some input.
This is the last statement I tried and got stuck for the above problem.
urldoc = gsub("?[a-z]+\..\/.[\s]$","", urldoc)
Input would be: A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk
Output I am expecting is: A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:
Thanks.
According to your specs, you may use the following regex:
\s*[^ /]+/[^ /]+
See the regex demo.
Details
\s* - 0 or more whitespace chars
[^ /]+ (or [^[:space:]/]+) - any 1 or more chars other than a space (or any whitespace) and /
/ - a slash
[^ /]+ (or [^[:space:]/]+) - any 1 or more chars other than a space (or any whitespace) and /.
R demo:
urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)
If you want to account for any whitespace, replace the literal space with [:space:],
urldoc = gsub("\\s*[^[:space:]/]+/[^[:space:]/]+","", urldoc)
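Running the pattern on the exact input from the question reproduces the expected output:

```r
urldoc <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"

# \s* consumes the space before each URL, so no double spaces are left behind
gsub("\\s*[^ /]+/[^ /]+", "", urldoc)
## [1] "A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:"
```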
This is already answered, but here is an alternative if you've not come across stringi before.
# most complete package for string manipulation
library(stringi)
# text and regex
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"
# see what is captured
stringi::stri_extract_all_regex(text, pattern)
# remove (replace with "")
stringi::stri_replace_all_regex(text, pattern, "")
This might work:
text <- " http:/thisisanurl.wde , thisaint , nope , uihfs/yay"
words <- strsplit(text, " ")[[1]]
isurl <- sapply(words, function(x) grepl("/",x))
result <- paste0(words[!isurl], collapse = " ")
result
[1] " , thisaint , nope ,"

strsplit not consistently working, character between letters isn't a space?

The problem is very simple, but I'm having no luck fixing it. strsplit() is a fairly simple function, and I am surprised I am struggling as much as I am:
# temp is the problem string. temp is copy / pasted from my R code.
# i am hoping the third character, the space, which i think is the error, remains the error
temp = "GS PG"
# temp2 is created in stackoverflow, using an actual space
temp2 = "GS PG"
unlist(strsplit(temp, split = " "))
[1] "GS PG"
unlist(strsplit(temp2, split = " "))
[1] "GS" "PG"
Even if it doesn't work here with me trying to reproduce the example, this is the issue I am running into. With temp, the code isn't splitting the variable on the space for some odd reason. Any thoughts would be appreciated!
Best,
EDIT - my example failed to recreate the issue. For reference, temp is being created in my code by scraping with rvest, and for some reason it must be picking up a character other than a normal space, I think. I need to split these strings by space though.
Try the following:
unlist(strsplit(temp, "\\s+"))
The "\\s+" is a regex search for any type of whitespace instead of just a standard space.
As in the comment,
It is likely that the "space" is not actually a space but some other whitespace character.
Try any of the following to narrow it down:
whitespace <- c(" ", "\t" , "\n", "\r", "\v", "\f")
grep(paste(whitespace,collapse="|"), temp)
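If none of those six characters match, the remaining usual suspect in rvest-scraped text is the non-breaking space (U+00A0); a sketch assuming that is what the scraper returned:

```r
temp <- "GS\u00a0PG"                     # looks like "GS PG" on screen

strsplit(temp, " ")[[1]]                 # no split: U+00A0 is not an ASCII space
utf8ToInt(substr(temp, 3, 3))            # 160 confirms the culprit

# Unicode-aware split via PCRE
strsplit(temp, "(*UCP)\\s+", perl = TRUE)[[1]]
## [1] "GS" "PG"
```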
Related question here:
How to remove all whitespace from a string?
