R - Suppressing separator at a single text passage in paste()

When using paste() with a given separator, as in the example below, is it possible to suppress the separator at individual positions?
The example is part of a ggplot() + labs(subtitle = paste(...)) call with a whitespace separator:
paste("Minimum/maximum number of observations per point:", min(currdataset_S03$nobs), "* /", max(currdataset_S03$nobs), sep = " ")
This results in:
# [1] "Minimum/maximum number of observations per point: 7 * / 14"
Now I'd like to suppress/skip only the whitespace between the 7 (= min(currdataset_S03$nobs)) and the asterisk, but I don't know how.
I'm quite sure I have seen a fairly simple solution some time ago, but I can't remember it -- or my memory may not serve me right.
However, I could not find any helpful post so far. Does anybody have an idea, please?

You could solve this using sprintf.
sprintf("Minimum/maximum number of observations per point: %s * / %s",
min(currdataset_S03$nobs), max(currdataset_S03$nobs))
# [1] "Minimum/maximum number of observations per point: 7 * / 14"
Data:
currdataset_S03 <- data.frame(nobs=7:14)
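If you'd rather stay with paste, a minimal alternative sketch is paste0() (which uses no separator at all) with the spaces written out by hand:
paste0("Minimum/maximum number of observations per point: ",
       min(currdataset_S03$nobs), "* / ", max(currdataset_S03$nobs))
# [1] "Minimum/maximum number of observations per point: 7* / 14"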

Related

How to grep a string ending in a specific punctuation mark

I'm trying to grep strings that end in a dash in R, but I'm having trouble. I've worked out how to grep strings ending in any punctuation mark -- maybe not the best way, but this worked:
grep("\\#[[:print:]]+[[:punct:]]$", c)
I can't for the life of me work out how to grep strings that end specifically in a dash,
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to sub all the stuff between the dashes (and including the dashes) with nothing, "", and leave the text to the right of the second dash, which ends in a full stop. So I would like the output to be (for example, based on the example above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
1. - # (Piano) - no, and neither can you.
2. - # (Piano) - uh-huh.
3. - # Many dreams ago - Try it again.
(The numbers 1-3 are just to make things clearer; they are not part of the strings.)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub(("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1 and 2 in one go. But this didn't help with 3 -- I thought that by including the question mark after the open and close bracket (which I thought meant 'optional'), this would catch all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string, i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro
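(A quick aside on the literal question in the title -- finding strings that end in a dash -- anchoring the dash at the end of the pattern is all that's needed; a minimal sketch with made-up input:
grep("-\\s*$", c("ends in a dash -", "does not"))
# [1] 1
The removal problem itself is addressed in the answer below.)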
If you want to stay in between the square brackets, you can start the match at #, then use a negated character class [^][]* to match optional chars other than an opening or closing square bracket, and match the last - along with any trailing whitespace.
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[- not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending on a full stop, and capture what you want to keep.
In the replacement use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."

How to remove these footnotes from text

Alright, so I have minimal experience with RStudio. I've been googling this for hours now and I'm fed up -- I don't care about the pride of figuring it out on my own anymore, I just want it done. I want to do some stuff with the Canterbury Tales -- the Middle English version on Gutenberg.
I downloaded the plaintext, trimmed out the metadata, etc., but it's chock-full of "helpful" footnotes and I can't figure out how to cut them out. EX:
"And shortly, whan the sonne was to reste,
So hadde I spoken with hem everichon,
That I was of hir felawshipe anon,
And made forward erly for to ryse,
To take our wey, ther as I yow devyse.
19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.
compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.
34. E. oure
But natheles, whyl I have tyme and space,..."
I at least have the vague notion that this is a grep/regex puzzle. Looking at the text in TextEdit, each bundle of footnotes is indented by 4 spaces, and the next verse starts with a capitalized word indented by (edit: 4 spaces as well).
So I tried downloading the qdap package and using the rm_between function to specify removal of text between four spaces and a number, and two spaces and a capital letter (" [0-9]", " [A-Z]"), to no avail.
I mean, this isn't nearly as simple as "make the text lowercase and remove all the numbers dur-hur", which all the tutorials are so helpfully offering. But I'm assuming this is a rather common thing that people have to do when dealing with big texts. Can anyone help me? Or do I have to go into TextEdit and just manually delete all the footnotes?
EDIT: I restarted the workspace today and all I have is a scan of the file, each line stored in a character vector, with the Gutenberg metadata trimmed out:
text <- scan("thefilepath.txt", what = "character", sep = "\n")
start <- which(text == "GROUP A. THE PROLOGUE.")
end <- which(text == "God bringe us to the Ioye . that ever schal be!")
cant.lines.v <- text[start:end]
And that's it so far. Eventually I will
cant.v <- paste(cant.lines.v, collapse = " ")
And then strsplit and unlist into a vector of individual words -- but I'm assuming that, to get rid of the footnotes, I need to gsub and replace them with blank space, and that will be easier with each line separate? I just don't know how to encode the pattern I need to cut. I believe it is 4 spaces followed by a number, continuing on until you get to 4 spaces followed by a capitalized word and a second word without numbers, special characters, or punctuation.
I hope that I'm providing enough information, I'm not well-versed in this but I am looking to become so...thanks in advance.
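As a rough sketch (not a tested solution), following the description above -- a footnote block starts at a line whose first token is a number like "19." and ends when a line starts with a capitalized word again -- you could filter the line vector before pasting:
# Heuristic, assumptions as stated in the surrounding text:
keep <- rep(TRUE, length(cant.lines.v))
in_note <- FALSE
for (i in seq_along(cant.lines.v)) {
  ln <- cant.lines.v[i]
  if (grepl("^\\s*[0-9]+\\.", ln)) {                       # e.g. "19. Hn. Bifel; ..."
    in_note <- TRUE
  } else if (in_note && grepl("^\\s*[A-Z][a-z]+", ln)) {   # verse resumes
    in_note <- FALSE
  }
  if (in_note) keep[i] <- FALSE
}
cant.lines.v <- cant.lines.v[keep]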

strsplit not consistently working, character between letters isn't a space?

The problem is very simple, but I'm having no luck fixing it. strsplit() is a fairly simple function, and I am surprised I am struggling as much as I am:
# temp is the problem string. temp is copy/pasted from my R code.
# I am hoping the third character, the space, which I think is the error, remains the error
temp = "GS PG"
# temp2 is created in stackoverflow, using an actual space
temp2 = "GS PG"
unlist(strsplit(temp, split = " "))
[1] "GS PG"
unlist(strsplit(temp2, split = " "))
[1] "GS" "PG"
Even if it doesn't work here with me trying to reproduce the example, this is the issue I am running into. With temp, the code isn't splitting the variable on the space for some odd reason. Any thoughts would be appreciated!
Best,
EDIT - My example failed to recreate the issue. For reference, temp is being created in my code by scraping a page with rvest, and for some reason it must be scraping a different character other than a normal space, I think? I need to split these strings by space though.
Try the following:
unlist(strsplit(temp, "\\s+"))
The "\\s+" is a regex search for any type of whitespace instead of just a standard space.
As noted in the comment, it is likely that the "space" is not actually a space but some other whitespace character.
Try any of the following to narrow it down:
whitespace <- c(" ", "\t" , "\n", "\r", "\v", "\f")
grep(paste(whitespace,collapse="|"), temp)
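If none of those match, the culprit in scraped HTML is often a non-breaking space (U+00A0), which a plain "\\s" may not catch. A sketch of how to diagnose and handle that case, assuming that's the character involved:
utf8ToInt(temp)  # inspect the code points; 160 would be a no-break space
unlist(strsplit(temp, "[\\s\\p{Zs}]+", perl = TRUE))  # Unicode-aware split
gsub("\u00a0", " ", temp, fixed = TRUE)  # or normalise it to a plain space first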
Related question here:
How to remove all whitespace from a string?

un-quote an R string?

TL;DR
I have a snippet of text
str <- '"foo\\dar embedded \\\"quote\\\""'
# cat(str, '\n') # gives
# "foo\dar embedded \"quote\""
# i.e. as if the above had been written to a CSV with quoting turned on.
I want to end up with the string:
str <- 'foo\\dar embedded "quote"'
# cat(str, '\n') # gives
# foo\dar embedded "quote"
essentially removing one "layer" of quoting. How may I do this?
(Initial attempt -- eval(parse(text=str)), which works unless you have something like \\dar, where you get the error "\d is an unrecognized escape in character string ...").
Gory details (optional)
The reason my strings are quoted once-too-many times is I kludged some data processing -- I wrote str (well, a dataframe in my case) to a table with quoting enabled, but forgot that many of the columns in my dataframe had embedded newlines with embedded quotes (i.e. forgot to escape/remove them).
It turns out that when I read.table a file with multiple columns in the same row that have embedded newlines and embedded quotes (or something like that), the function fails (fair enough).
I had since closed my R session so my only access to my data was through my munged CSV. So I wrote some spaghetti code to simply readLines my CSV and split everything up to reconstruct my dataframe again. However, since all my character columns were quoted in the CSV, I have a few columns in my restored dataframe that are still quoted that I want to unquote.
Messy, I know. I'll remember to save an original version of the data next time (save, saveRDS).
For those interested, the header row and three rows of my CSV are shown below (all the characters are ASCII)
"quote";"id";"date";"author";"context"
"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"February 28, 2013";"nhqdb";"nhqdb"
"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"February 28, 2013";"nhqdb";"nhqdb"
"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"February 28, 2013";"nhqdb";"nhqdb"
The first two columns of each row are the same, being the quote (the first row has no embedded newlines in the quote; the second and third do). Separator is ';'.
> read.table('test.csv', sep=';', header=T)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
# same error with allowEscapes=TRUE
Use regular expressions:
str <- gsub('^"|"$', '', gsub('\\\"', '"', str, fixed = TRUE))
[EDIT 3: the OP has posted three separate versions of this - two of them irreproducible, interspersed with complaining. Due to this timewasting behavior and several people downvoting, I'm leaving the original answer to version 2 of the question.]
EDIT 1: My solution to the second version of the OP's question was this:
txt <- read.csv('escaped.csv', header=T, allowEscapes=T, sep=';')
EDIT 2: We now get a third version. Finally some reproducible code after 36 minutes asking and waiting. Due to the behavior of the OP and other posters I'm not inclined to waste more time on this. I'm going to complain about both of your behavior on MSO. Downvote yourselves silly.
ORIGINAL:
gsub is the ugly way.
Use read.csv(..., allowEscapes=TRUE, quote=..., encoding=...) arguments. See the manpage, section on Encoding
If you want actual code, you need to give us a full line or two of your CSV file.
See also SO: "How to detect the right encoding for read.csv?"
Quoting the relevant part of your question:
The reason my strings are quoted once-too-many times is I kludged some
data processing -- I wrote str (well, a dataframe in my case) to a
table with quoting enabled, but forgot that many of the columns in my
dataframe had embedded newlines within quotes (i.e. forgot to
escape/remove them).
It turns out that when I read.table a file with multiple columns in
the same row that have embedded newlines within quotes, the function
fails (fair enough).

Finding number of occurrences of a word in a file using R functions

I am using the following code to find the number of occurrences of the word memory in a file, and I am getting the wrong result. Can you please help me see what I am missing?
NOTE1: The question is looking for exact occurrences of the word "memory"!
NOTE2: What I have realized is that they are looking for exactly "memory" -- even something like "memory," is not accepted! That was the part that caused the confusion, I guess. I tried it for the word "action" and the correct answer is 7! You can try it as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
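A tiny illustration of the effect with made-up text (not the Hamlet file):
tf <- tempfile()
writeLines("'memory and memory' plus memory", tf)
length(grep("memory", scan(tf, what = character())))
# [1] 2  -- the apostrophe-quoted span collapses into one element
length(grep("memory", scan(tf, what = character(), quote = NULL)))
# [1] 3  -- every whitespace-separated word is its own element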
As pointed out by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems OK:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
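If, per NOTE2 above, only the bare token "memory" should count (no trailing punctuation), anchoring the pattern restricts matches to whole elements -- a sketch:
length(grep("^memory$", names, ignore.case = TRUE))
# or, case-sensitively: sum(names == "memory")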
