I have some cells with long strings.
I want to truncate the cells within a column so that only the word(s) before a semicolon are kept. For example, if I have a cell with the string blue house; with green garden, I want to keep only the words before the semicolon, so it would become blue house
Thank you!
If you just want to delete the semicolon and everything after it, then this is a simple pattern that replaces such instances with the empty string: "".
blue <- c('blue house; with green garden')
gsub(";.*$", "", blue)
[1] "blue house"
The previously accepted answer uses a conditional/look-behind and a capture group, both useful concepts, but a bit complex for this simple task.
blue <- c('blue house; with green garden')
gsub('(.*?);.*', '\\1', blue)
[1] "blue house"
This appears to work. When you get a pattern working on the very nice regex101.com, remember to double each backslash before using it in your terminal or RStudio, since R string literals need \\ to represent a single backslash.
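For example, a pattern written as \s*;.* on regex101 (just an illustration; this variant also trims any spaces right before the semicolon) has to be typed with doubled backslashes in R:

blue <- c('blue house; with green garden')
gsub("\\s*;.*", "", blue)
[1] "blue house"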
Related
I have a question similar to this one but instead of having two specific characters to look between, I want to get the text between a space and a specific character. In my example, I have this string:
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
but if I were to do something like this: str_remove_all(myString, " .*jpg") I end up with
[1] "This"
I know that what's happening is that R is finding the first instance of a space and removing everything between that space and ".jpg", but I want it to use the space immediately before ".jpg". The final result I hope for looks like this:
[1] "This is my string I scraped from the web. I want to remove all instances of a picture. the text continues here.
NOTE: I know that a solution may arise which does what I want, but ends up putting two periods next to each other. I do not mind a solution like that because later in my analysis I am removing punctuation.
You can use
str_remove_all(myString, "\\S*\\.jpg")
Or, if you also want to remove optional whitespace before the "word":
str_remove_all(myString, "\\s*\\S*\\.jpg")
Details:
\s* - zero or more whitespaces
\S* - zero or more non-whitespaces
\.jpg - .jpg substring.
To make it case insensitive, add (?i) at the start of the pattern: "(?i)\\s*\\S*\\.jpg".
If you need to make sure there is no word char after jpg, add a word boundary: "(?i)\\s*\\S*\\.jpg\\b"
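For example, with the string from the question (the double period left behind is the one the question's note already anticipated):

library(stringr)
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
str_remove_all(myString, "\\s*\\S*\\.jpg")
[1] "This is my string I scraped from the web. I want to remove all instances of a picture.. The text continues here."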
I am reading a PDF file using R. I would like to transform the text so that whenever multiple consecutive spaces are detected, they are replaced by some value (for example "_"). I've come across questions where all runs of 1 or more spaces are replaced using "\\s+" (Merge Multiple spaces to single space; remove trailing/leading spaces), but this will not work for me. I have a string that looks something like this:
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
When I apply the answers I found, replacing all runs of 1 or more spaces with a single space, I am no longer able to recognise separate addresses, because it would look like this:
gsub("\\s+", " ", str_trim(PDF))
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
So what I am looking for is something like this:
"[1]This is the first address_This is the second one
[2]This is the third one_
[3]This is the fourth one_This is the fifth"
However, if I rewrite the code used in the example, I get the following:
gsub("\\s+", "_", str_trim(PDF))
"[1]This_is_the_first_address_This_is_the_second_one
[2]This_is_the_third_one_
[3]This_is_the_fourth_one_This_is_the_fifth"
Would anyone know a workaround for this? Any help will be greatly appreciated.
Whenever I come across string and regex problems I like to refer to the stringr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
On the second page you can see a section titled "Quantifiers", which tells us how to solve this:
library(tidyverse)
s <- "This is the first address This is the second one"
str_replace(s, "\\s{2,}", "_")
(I am loading the complete tidyverse instead of just stringr here due to force of habit).
Any 2 or more whitespace characters will now be replaced with _.
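Applied to the whole example from the question, a minimal sketch (assuming PDF is a character vector with one element per line, and that separate addresses are divided by runs of two or more spaces) would use str_replace_all() so that every run in every line is replaced, not just the first one:

PDF <- c("[1]This is the first address    This is the second one",
         "[2]This is the third one    ",
         "[3]This is the fourth one    This is the fifth")
writeLines(str_replace_all(PDF, "\\s{2,}", "_"))
[1]This is the first address_This is the second one
[2]This is the third one_
[3]This is the fourth one_This is the fifth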
Alright so I have minimal experience with RStudio, I've been googling this for hours now and I'm fed up-- I don't care about the pride of figuring it out on my own anymore, I just want it done. I want to do some stuff with Canterbury Tales-- the Middle English version on Gutenberg.
Downloaded the plaintext, trimmed out the meta data, etc but it's chock-full of "helpful" footnotes and I can't figure out how to cut them out. EX:
"And shortly, whan the sonne was to reste,
So hadde I spoken with hem everichon,
That I was of hir felawshipe anon,
And made forward erly for to ryse,
To take our wey, ther as I yow devyse.
19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.
compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.
34. E. oure
But natheles, whyl I have tyme and space,..."
I at least have the vague notion that this is a grep/regex puzzle. Looking at the text in TextEdit, each bundle of footnotes is indented by 4 spaces, and the next verse starts with a capitalized word indented by 4 spaces as well (edited).
So I tried downloading the package qdap and using the rm_between function to specify removal of text between four spaces and a number, and two spaces and a capital letter ("    [0-9]", "  [A-Z]"), to no avail.
I mean, this isn't nearly as simple as "make the text lowercase and remove all the numbers dur-hur" which all the tutorials are so helpfully offering. But I'm assuming this is a rather common thing that people have to do when dealing with big texts. Can anyone help me? Or do I have to go into textedit and just manually delete all the footnotes?
EDIT: I restarted the workspace today and all I have is a scan of the file, each line stored in a character vector, with the Gutenburg metadata trimmed out:
text <- scan("thefilepath.txt", what = "character", sep = "\n")
start <- which(text == "GROUP A. THE PROLOGUE.")
end <- which(text == "God bringe us to the Ioye . that ever schal be!")
cant.lines.v <- text[start:end]
And that's it so far. Eventually I will
cant.v <- paste(cant.lines.v, collapse = " ")
And then strsplit and unlist into a vector of individual words -- but I'm assuming that, to get rid of the footnotes, I need to gsub and replace them with blank space, and that will be easier with each separate line? I just don't know how to encode the pattern I need to cut. I believe it is 4 spaces followed by a number, continuing on until you get to 4 spaces followed by a capitalized word and a second word without numbers, special characters, or punctuation.
I hope that I'm providing enough information, I'm not well-versed in this but I am looking to become so...thanks in advance.
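A rough sketch of the pattern described above, assuming the four-space indentation survives the import, that every footnote bundle begins with a line of four spaces followed by a number, and that the bundle ends just before the next line of four spaces followed by a capitalized word (the patterns here are guesses, not tested against the full Gutenberg file):

# flag lines that fall inside a footnote bundle
in_note <- FALSE
keep <- logical(length(cant.lines.v))
for (i in seq_along(cant.lines.v)) {
  if (grepl("^ {4}[0-9]", cant.lines.v[i])) {
    in_note <- TRUE    # four spaces + a number: a footnote bundle starts
  } else if (grepl("^ {4}[A-Z]", cant.lines.v[i])) {
    in_note <- FALSE   # four spaces + a capitalized word: verse resumes
  }
  keep[i] <- !in_note
}
cant.lines.v <- cant.lines.v[keep]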
I'm using a text file in R and using the readLines function and regexes to extract words from it. The file uses special characters around words (such as # signs before and after a word to show it is bolded, or # before and after a word to show it should be italicized) to indicate special meaning, and these markers are messing up my regexes.
So far this is my r code which removed all empty lines and then combined my text file into a single vector :
book<-readLines("/Users/Desktop/SAMPLE .txt",encoding="UTF-8")
#remove all empty lines
empty_lines = grepl('^\\s*$', book)
book = book[! empty_lines]
#combine book into one variable
xBook = paste(book, collapse = '')
#remove extra white spaces for a single text of the entire book
updated<-trimws(gsub("\\s+"," ",xBook))
When I run updated, I see the entire file stored in the variable updated, but with the special characters:
updated
[1] "It is a truth universally acknowledged, that a #single# man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a #man# may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, #that# he is considered the rightful property of some one or other of #their# daughters.
How can I remove all of the leading or trailing # characters from the words in my updated variable?
my desired output is just the plain text, with no indication of words that should be bolded or italicized:
updated
[1] "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.
gsub("[##]([a-zA-Z]+)[##]", "\\1", x)
I am detecting substrings within reports and then adding suffix words to the end of the reports, depending on whether the substring is present or absent. Shorter words are dangerous as they are usually parts of longer words, for example ear and overbearing. The spacebar tends to be a reasonable solution: instead of searching for the substring 'ear' I will use ' ear'. Note the white space in front of the substring, and no white space at the end of the substring, as I don't want to miss the plural ears.
The problem is when the 1st word in the entire report is Ear. There is no leading white space.
I tried to solve the problem with the stringr library by adding a space to the beginning of each report, but the text is returned unchanged.
library(stringr)
Data$Fail <- str_pad(Data$text, width = 1, side = "left")
Data$Fail <- str_pad(Data$text, width = 1, side = "left") didn't work because str_pad() pads a string to a fixed length, which you specified as width = 1, so it would only have inserted a space if the text were initially empty.
But if you just want to insert a space at the start of a string, you don't need a special library - text = paste("", text) would do.
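For example (using a made-up report text):

report <- "Ear infection suspected"
paste("", report)
[1] " Ear infection suspected"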
Armali already answered your question (use paste('', text)) to add a space in front of ear. Since you also want to match the Ear at the start of a sentence, you can better use a regex as pointed out by HO LI Pin.
pattern <- '(?<![A-z])[Ee]ar'
This will only match E/ear if it is not preceded by any character in the [A-z] range (it can thus still be preceded by things like spaces, commas, (, etc., but it is not clear from your question whether this is allowed or not). Then you can either use base R or, more simply, the stringr library to find all matches of this regex pattern:
library(stringr)
pattern <- '(?<![A-z])[Ee]ar'
text = 'Ear this is some nice text as you can hear with your ear about overbearing'
unlist(str_extract_all(text, pattern, simplify = FALSE))
Which will give you:
[1] "Ear" "ear"