How to remove these footnotes from text - r

Alright so I have minimal experience with RStudio, I've been googling this for hours now and I'm fed up-- I don't care about the pride of figuring it out on my own anymore, I just want it done. I want to do some stuff with Canterbury Tales-- the Middle English version on Gutenberg.
Downloaded the plaintext, trimmed out the meta data, etc but it's chock-full of "helpful" footnotes and I can't figure out how to cut them out. EX:
"And shortly, whan the sonne was to reste,
So hadde I spoken with hem everichon,
That I was of hir felawshipe anon,
And made forward erly for to ryse,
To take our wey, ther as I yow devyse.
19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.
compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.
34. E. oure
But natheles, whyl I have tyme and space,..."
I at least have the vague notion that this is a grep/regex puzzle. Looking at the text in TextEdit, each bundle of footnotes is indented by 4 spaces, and the next verse starts with a capitalized word indented by (edit: 4 spaces as well).
So I tried downloading the package qdap and using the rm_between function to specify removal of text between four spaces and a number; and two spaces and a capital letter (" [0-9]"," "[A-Z]") to no avail.
I mean, this isn't nearly as simple as "make the text lowercase and remove all the numbers dur-hur" which all the tutorials are so helpfully offering. But I'm assuming this is a rather common thing that people have to do when dealing with big texts. Can anyone help me? Or do I have to go into textedit and just manually delete all the footnotes?
EDIT: I restarted the workspace today and all I have is a scan of the file, each line stored in a character vector, with the Gutenberg metadata trimmed out:
text <- scan("thefilepath.txt", what = "character", sep = "\n")
start <- which(text == "GROUP A. THE PROLOGUE.")
end <- which(text == "God bringe us to the Ioye . that ever schal be!")
cant.lines.v <- text[start:end]
And that's it so far. Eventually I will
cant.v<- paste(cant.lines.v, collapse=" ")
And then strsplit and unlist into a vector of individual words-- but I'm assuming, to get rid of the footnotes, I need to gsub and replace with blank space, and that will be easier with each separate line? I just don't know how to encode the pattern I need to cut. I believe it is 4 spaces followed by a number, then continuing on until you get to 4 spaces followed by a capitalized word and a second word w/o numbers and special characters and punctuation.
I hope that I'm providing enough information, I'm not well-versed in this but I am looking to become so...thanks in advance.
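One possible way to approach this is to work line by line and drop every line that looks like a footnote, before collapsing the text. The sketch below is only a heuristic built from the excerpt above: it assumes a footnote line either opens with a line number ("19.", "26,") or contains a manuscript abbreviation such as "E.", "Hn." or "Hl.". The list of abbreviations and the names starts_with_number, has_siglum and verse.lines.v are assumptions for illustration, and the rule may misfire on verse lines that happen to contain those abbreviations.

```r
# Sample lines copied from the excerpt in the question
# (the 4-space indentation is taken from the description, not verified).
cant.lines.v <- c(
  "    To take our wey, ther as I yow devyse.",
  "    19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.",
  "    compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.",
  "    34. E. oure",
  "    But natheles, whyl I have tyme and space,"
)

# Heuristic 1: the line opens with a line number such as "19." or "26,".
starts_with_number <- grepl("^[[:space:]]*[0-9]+[.,]", cant.lines.v)

# Heuristic 2: the line contains a manuscript abbreviation.
# This list is an assumption; extend it with the sigla the edition uses.
has_siglum <- grepl("\\b(E|Hn|Hl)\\.", cant.lines.v, perl = TRUE)

# Keep only the lines that match neither heuristic.
is_footnote <- starts_with_number | has_siglum
verse.lines.v <- cant.lines.v[!is_footnote]
verse.lines.v
```

Filtering whole lines first and pasting afterwards avoids having to describe a multi-line footnote block in a single regular expression.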

Related

How to grep a string ending in a specific punctuation mark

I'm trying to grep strings that end in a dash in R, but having trouble. I've worked out how to grep strings ending in any punctuation mark, maybe not the best way but this worked:
grep("\\#[[:print:]]+[[:punct:]]$",c)
Can't for the life of me work out how to grep strings that end specifically in a dash
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to sub all the stuff between the dashes (and including the dashes) with nothing "" and leave the text to the right of the second dash, which end in full stops. So, I would like the output to be (for example, based on the example above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
1. - # (Piano) - no, and neither can you.
2. - # (Piano) - uh-huh.
3. - # Many dreams ago - Try it again.
(numbers 1-3 are for the purposes of making things clearer, they are not part of the strings)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1. and 2. in one go. But this didn't help with 3 - I thought by including the question mark after the open and close bracket (which I thought meant 'optional') this would help me get all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro
If you want to stay in between the square brackets you can start the match at #, then use a negated character class [^][]* matching optional chars other than an opening or closing square bracket, and match the last -
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[- not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending on a full stop, and capture what you want to keep.
In the replacement use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."
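For the three strings from the update (without square brackets), a single pattern may be enough. The attempt in the question misses the third string because [A-Z][a-z]* only matches one capitalized word with no spaces, so "Many dreams ago" never matches. The sketch below is one possible alternative, assuming the text between the two dashes never contains a dash itself; the names x and cleaned are made up for the example.

```r
# The three strings from the update, without the list numbers.
x <- c(
  "- # (Piano) - no, and neither can you.",
  "- # (Piano) - uh-huh.",
  "- # Many dreams ago - Try it again."
)

# ^-            the leading dash
# [[:blank:]]*  optional spaces or tabs
# #[^-]*        the hash and everything up to (not including) the next dash
# -[[:blank:]]* the second dash and any blanks after it
cleaned <- sub("^-[[:blank:]]*#[^-]*-[[:blank:]]*", "", x)
cleaned
```

Using sub() rather than gsub() and anchoring at the start of the string keeps later dashes, such as the one in "uh-huh", untouched.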

Remove whitespace before bracket " (" in R

I have nearly 100,000 rows of scraped data that I have converted to data frames. One column is a string of text characters but is operating strangely. In the example below, there is text, that has bracketed information that I want to remove, and I also want to remove " (c)". However the space in front is not technically a space (is it considered whitespace?).
I am not sure how to reproduce the example here because when I copy/paste a record, it is treated like normal and works, but in the scraped data, it does not. Gut check was to count spaces and it gave me 4, which means the space in front of ( is not a true space. I do not know how to remove this!
My code that I usually would run is as follows. Again, works this way, but does not work in my scraped data.
test<-c("Barry Windham (c) & Mike Rotundo (c)")
test<-gsub("[ ][(]c[)]","",test)
You can consider using:
test<-c("Barry Windham (c) & Mike Rotundo (c)")
gsub("(*UCP)\\s+\\(c\\)", "", test, perl=TRUE)
# => [1] "Barry Windham & Mike Rotundo"
Details
(*UCP) - makes all shorthand character classes in the PCRE regex (it is PCRE due to perl=TRUE) Unicode aware
\\s+ - any one or more Unicode whitespaces
\\(c\\) - (c) substring.
If you need to keep (c), capture it and use a backreference in the replacement:
gsub("(*UCP)\\s+(\\(c\\))", "\\1", test, perl=TRUE)
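If the odd character turns out to be a non-breaking space (U+00A0, which is common in scraped HTML), another option is to normalise it to an ordinary space first. That the data contains U+00A0 is an assumption here, not something the question confirms; the names scraped and normalised are made up for the example.

```r
# Hypothetical scraped value: "\u00a0" is a non-breaking space, not a plain space.
scraped <- "Barry Windham\u00a0(c) & Mike Rotundo\u00a0(c)"

# The pattern from the question finds nothing to remove.
unchanged <- gsub("[ ][(]c[)]", "", scraped)
identical(unchanged, scraped)

# Inspect the code points: 160 is U+00A0, 32 is an ordinary space.
160 %in% utf8ToInt(scraped)

# Replace every non-breaking space with a plain space, then reuse the pattern.
normalised <- gsub("\u00a0", " ", scraped, fixed = TRUE)
result <- gsub("[ ][(]c[)]", "", normalised)
result
```

Checking the code points with utf8ToInt() first is a quick way to find out which character is actually in the data before choosing a pattern.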

Removing Special Characters in a Text File in R

I'm using a text file in R and using the readLines function and regexes to extract words from it. The file uses special characters around words (such as # signs before and after a word to show it is bolded or # before and after a word to show it should be italicized) to indicate special meanings, which are messing up my regexes.
So far this is my r code which removed all empty lines and then combined my text file into a single vector :
book<-readLines("/Users/Desktop/SAMPLE .txt",encoding="UTF-8")
#remove all empty lines
empty_lines = grepl('^\\s*$', book)
book = book[! empty_lines]
#combine book into one variable
xBook = paste(book, collapse = '')
#remove extra white spaces for a single text of the entire book
updated<-trimws(gsub("\\s+"," ",xBook))
when i run updated, i see the entire file stored in the variable updated but with the special characters:
updated
[1] "It is a truth universally acknowledged, that a #single# man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a #man# may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, #that# he is considered the rightful property of some one or other of #their# daughters.
How can I remove all the leading or trailing # or # from the words in my updated variable?
my desired output is just the plain text, with no indication of words that should be bolded or italicized:
updated
[1] "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.
gsub("[##]([a-zA-Z]+)[##]", "\\1", x)
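As a rough illustration of that idea on a shortened piece of the sample text, the sketch below strips a pair of # marks only when they wrap a single word. The shortened string and the name plain are assumptions for the example; only # is handled, since that is the only marker visible in the sample.

```r
# Shortened excerpt of the sample text from the question.
updated <- "that a #single# man in possession of a good fortune; such a #man# may be"

# Capture the letters between two # marks and keep only the captured word.
plain <- gsub("#([[:alpha:]]+)#", "\\1", updated)
plain
```

If the file never uses # for anything else, gsub("#", "", updated, fixed = TRUE) would be a simpler option.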

Getting an error in removing word just before the colon in R

I have the below dataframe
head(df)
index song year artist genre lyrics
2 Till i am gone 2010 Eminem Rap Chorus:It's too much, it's too tough
i have done other data cleanups such as converting everything into lower case using gsub and removing words between brackets, however, not finding the syntax to just remove the word and the colon that is after it, for example in my row, i want to remove "chorus:"
After the syntax it should be
lyrics
It's too much, it's too tough
The following code will delete everything before the colon which i don't want as this colon can be anywhere in the cell
gsub(".*:","",foo)
You can specify to only remove the word immediately before the colon.
I expanded your test set to show that it works.
foo = c("Chorus:It's too much, it's too tough ",
"ABC Chorus:It's too much, it's too tough ")
gsub("\\w+:", "", foo)
[1] "It's too much, it's too tough " "ABC It's too much, it's too tough "
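Since the question works with a data frame, the same pattern can be applied to the column directly. The sketch below uses a small made-up data frame (the second row is hypothetical). One caveat: \\w also matches digits, so this pattern would strip the "12:" from a time such as "12:30" as well.

```r
# Small stand-in for the data frame in the question; the second row is invented.
df <- data.frame(
  song = c("Till i am gone", "Example song"),
  lyrics = c("Chorus:It's too much, it's too tough",
             "ABC Chorus:It's too much, it's too tough"),
  stringsAsFactors = FALSE
)

# Remove the word immediately before each colon, together with the colon.
df$lyrics <- gsub("\\w+:", "", df$lyrics)
df$lyrics
```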

Which function should I use to read unstructured text file into R? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : )
I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.
Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?
Thank you in advance.
PN
P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd go around this problem.
read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.
To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system specific newline character(s) by pressing Return. In effect, a line of text is not defined by the width of your software window, but can run over many visual rows. In other words, a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs:
> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy, I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""
Note that you can scroll long text to the left here in Stackoverflow. That seventh line is longer than this column is wide.
As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, R shows a backslash in front of each quotation mark. Since R displays the individual lines in quotation marks, it needs to distinguish these from those that are part of the original text. Therefore, it "escapes" the original quotation marks when printing; the backslashes are not stored in the text itself. Read about escaping on Wikipedia.
readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning will do nothing but suppress the warning message.
If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():
mytext <- readLines("textfile.txt")
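To try this without a real book, the sketch below writes two lines to a temporary file and reads them back. The temporary file and its contents are stand-ins for your own text file. It also shows that the backslashes seen in the console are only part of how R prints strings.

```r
# Write a tiny two-line "book" to a temporary file (stand-in for a real file).
path <- tempfile(fileext = ".txt")
writeLines(c("\"TOM!\"", "No answer."), path)

# Read it back: one element per line of the file.
mytext <- readLines(path)
length(mytext)

# print() shows the escaped form, cat() shows the text as stored.
print(mytext[1])
cat(mytext[1], "\n")

# The first line has 6 characters: two quotation marks plus "TOM!".
nchar(mytext[1])
```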
Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
I would strongly advise you to write your text in a .txt-file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or printed, and those will be read by R. You can try and see what you get, but for good results you should either save your file as a .txt-file from Word or compose it in a text editor.
You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:
myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."
Note how entering Return does not cause R to execute the command before I closed the string with "). R just replies with +, telling me that I can continue to edit. I did not type in those plusses. Try it. Note also that now the newlines are part of your string of text. (I'm on a Mac, so my newline is \n.)
If you input your text manually, I would load the whole text as one string into a vector:
x <- c("The text of your book.")
You could load different chapters into different elements of this vector:
y <- c("Chapter 1", "Chapter 2")
For better reference, you can name the elements:
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
Now you can split the elements of any of these vectors:
sentences <- strsplit(z, "[.!?] *")
Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by an optional space (if you don't define a space here, the resulting "sentences" will be preceded by a space).
sentences now contains:
> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"
$ch2
[1] "This is the text of the second chapter" "It is even shorter"
You can access the individual sentences by indexing:
> sentences$ch1[2]
[1] "It is not long"
R will be unable to know that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
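For the curious, one possible way to write such an exception is with Perl-style lookbehinds. This is only a sketch covering two hard-coded abbreviations ("Mr." and "Mrs."), not a general sentence splitter; the sample string and the name parts are made up for the example.

```r
x <- "Mr. Smith arrived. He sat down! Why?"

# Split on whitespace that
#   (?<=[.!?])   comes right after a sentence-ending mark,
#   (?<!Mr\\.)   but not right after "Mr.",
#   (?<!Mrs\\.)  and not right after "Mrs.".
# Lookbehinds need perl = TRUE. Unlike the pattern above, the punctuation is kept.
parts <- strsplit(x, "(?<=[.!?])(?<!Mr\\.)(?<!Mrs\\.)\\s+", perl = TRUE)
parts[[1]]
```

Every further abbreviation ("Dr.", "St.", and so on) needs its own lookbehind, which is why dedicated sentence tokenizers are usually preferred for real text.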
How you would tell R how to recognize subjects or objects, I have no idea.

Resources