Sometimes we copy/paste a string into RStudio, in which case we need to manually surround the text with quotes.
Is there a native way to paste with automatic quoting?
Example
If the clipboard contained here is my text, such a shortcut would result in "here is my text" being pasted in the R console/script pane.
This can be done in R:
x <- readClipboard()
x
## [1] "Here is my text"
This also works:
x <- readLines(stdin())
...paste clipboard into R & press ctrl-z (windows) or ctrl-d (unix)...
x
## [1] "Here is my text"
If you want a way to make the text contents of your Clipboard reusable in a script you can do
dput(readClipboard())
This has the benefit of automatically making multi-line text into a concatenated character vector. For example if I copy:
Alas, poor Yorick!
I knew him, Horatio;
a fellow of infinite jest,
of most excellent fancy;
Then I can do
dput(readClipboard())
# c("Alas, poor Yorick! ", "I knew him, Horatio; ", "a fellow of infinite jest, ",
# "of most excellent fancy; ")
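Since readClipboard() is only available on Windows, here is a minimal sketch of the same idea that uses a hypothetical vector `clip` as a stand-in for the clipboard contents. It only illustrates that the code printed by dput() can be pasted into a script and recreates the original vector.

```r
# Hypothetical stand-in for readClipboard(), which exists only on Windows
clip <- c("Alas, poor Yorick!", "I knew him, Horatio;",
          "a fellow of infinite jest,", "of most excellent fancy;")

# dput() prints R code that recreates the object
dput(clip)

# Capture that code as text and evaluate it again to check the round trip
code <- capture.output(dput(clip))
restored <- eval(parse(text = code))
identical(restored, clip)
```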
Related
I am looking for a way to read text into a vector such that each line would be a different element, all happening within an R script.
One way that I found was something like:
bla <- scan(text = "line1
line2
line3",
what = character())
Which correctly gives me:
> bla
[1] "line1" "line2" "line3"
However, there are several problems. First, the text ends up indented. I don't have to indent it myself, but any auto-indentation feature (which I commonly use) will just pop the lines back into alignment. Second, this requires escape codes if I would like to use, for example, the double quote symbol.
Is there a way to do something similar to the Here-Document method (<< EOF), in R scripts?
I am using RStudio as my IDE, running on Windows. Preferably there would be a platform independent way of doing this.
EDIT
Do you need to have the text inside the R script?
Yes.
An example of what I want to do:
R script here
⋮
bla <- <SOMETHING - BEGIN>
line1
line2
line3
<SOMETHING - END>
⋮
more R script here
Where the requirement, again, is that I can type freely without worrying about auto indentation moving the lines forward, and no need to worry about escape codes when typing things like ".
Both problems can be solved with the scan function and two little tricks, I think:
scan(text = '
line1
"line2" uses quotation mark
line3
', what = character(), sep = "\n")
Read 3 items
[1] "line1" "\"line2\" uses quotation mark"
[3] "line3"
When you put the quotation marks in a line of their own, you don't have a problem with auto indentation (tested using RStudio). If you only have double quotation marks in the text, you can use single quotation marks to start and end your character object. If you have single quotation marks in the text, use double quotation marks for character. If you have both, you should probably use search and replace to make them uniform.
I also added sep = "\n", so every line is one element of the resulting character vector.
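As a possible alternative to scan(), the same text can be read line by line through a text connection. This is only a sketch: the variable names `con` and `bla` are chosen for illustration, and the text starts directly after the opening quote because readLines() would otherwise return an empty first element.

```r
# Wrap the literal text in a connection so readLines() can treat it like a file
con <- textConnection('line1
"line2" uses quotation mark
line3')

bla <- readLines(con)  # one element per line, quotation marks kept as-is
close(con)

bla
```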
Since R version 4.0, we have raw strings (See ?Quotes)
bla <- r"(line1
line2
"line3"
'line4'
 Here is indentation
Here is a backslash \
)"
#> [1] "line1\nline2\n\"line3\"\n'line4'\n Here is indentation\nHere is a backslash \\\n"
Note though it gives one single string, not separate elements. We can split it back with strsplit:
bla <- strsplit(bla, "\n")[[1]]
#> [1] "line1"
#> [2] "line2"
#> [3] "\"line3\""
#> [4] "'line4'"
#> [5] " Here is indentation"
#> [6] "Here is a backslash \\"
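The two steps above (raw string, then strsplit) could be wrapped in a small helper. This is a sketch rather than an established idiom: `here_lines` is a hypothetical name, and it assumes R 4.0 or later for the raw string syntax.

```r
# Hypothetical helper: split a block of text into one element per line
here_lines <- function(txt) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  # Drop the empty first element that appears when the text
  # starts on the line after the opening delimiter
  if (length(lines) > 0 && lines[1] == "") lines <- lines[-1]
  lines
}

bla <- here_lines(r"(
line1
"line2"
'line3'
)")
bla
```

Starting the text on its own line keeps it flush left, so auto-indentation has nothing to realign.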
If authoring an Rmarkdown document instead of an R script is an option, we could use the knitr cat engine
---
title: "Untitled"
output: html_document
---
```{cat engine.opts=list(file='foo')}
line1
line2
"line3"
'line4'
```
```{r}
bla <- readLines("foo")
bla
```
I'm working with some SurveyMonkey response data, which I'm importing from a .xlsx file.
Something along these lines was happening:
> unique(responseColumn)
[1] "This string"
[2] "Something else"
>sum(responseColumn == unique(responseColumn)[1])
[1] 25
>sum(responseColumn == "This string")
[1] 0
>unique(responseColumn)[1]
[1] "This string"
>unique(responseColumn)[1] == "This string"
[1] FALSE
Obviously that was confusing. I played around for a while and found that I could use
writeClipboard(unique(responseColumn)[1])
to catch the offending string and paste it into my code.
In the console, it looked exactly the same: "This string".
In my script editing window, however, one of the spaces appeared as a red dot (screenshot not reproduced here).
I copied the red dot to the clipboard and did some testing:
>readClipboard()
[1] " "
>readClipboard() == " "
[1] FALSE
>utf8ToInt(" ")
[1] 32
>utf8ToInt(readClipboard())
[1] NA
What is this mysterious character? I wrote the Survey Monkey questions and distinctly remember hitting 'space' on my keyboard when specifying this option. Other spaces in the response have remained as they were (In fact the response in question actually has multiple spaces IRL and only one of them has been converted into this mysteryChar). What's going on?
My guess here is that the "red" dot is just some non-ASCII character, possibly encoded as UTF-8. That you can't see it in the R console does not mean that it isn't still there in the Windows clipboard. It could just mean that the R console is not displaying UTF-8 characters correctly.
If your R tool is not displaying the character correctly, then consider configuring it to support UTF-8.
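One way to investigate such a character is to look at its code points instead of its printed form. The sketch below assumes, purely for illustration, that the culprit is a no-break space (U+00A0), which is a common cause of this symptom; the thread does not confirm which character it actually was. The variable names are hypothetical.

```r
# Hypothetical example: a string whose "space" is really a no-break space
s <- "This\u00a0string"

s == "This string"   # FALSE, although both may look the same when printed

# Inspect the code points: an ordinary space is 32, a no-break space is 160
codes <- utf8ToInt(s)
codes

# Replace the no-break space with an ordinary space before comparing
cleaned <- gsub("\u00a0", " ", s, fixed = TRUE)
cleaned == "This string"
```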
I scraped tweets from the Twitter API with the package rtweet, but I don't know how to work with text containing emojis, because they come in the form '\U0001f600' and all the regex code that I have tried so far has failed. I can't get anything out of it.
For example
text = 'text text. \U0001f600'
grepl('U',text)
Gives me FALSE
grepl('000',text)
Also gives me FALSE.
Another problem is that they are often stuck to the preceding word (for example i am here\U0001f600 )
So how can I make R recognize emojis of that format? What can I put in the grepl that will return me TRUE for any emojis of that format?
In R there tends to be a package for most things. In this case it is textclean, which comes with the lexicon package and its many dictionaries. textclean offers two functions you can use: replace_emoji and replace_emoji_identifier.
text = c("text text. \U0001f600", "i am here\U0001f600")
# replace emoji with identifier:
textclean::replace_emoji_identifier(text)
[1] "text text. lexiconvygwtlyrpywfarytvfis " "i am here lexiconvygwtlyrpywfarytvfis "
# replace emoji with text representation
textclean::replace_emoji(text)
[1] "text text. grinning face " "i am here grinning face "
Next you could use sentimentr for sentiment scoring on the emojis, or quanteda for text analysis. If you just want to check for their presence, as in your expected output:
grepl("lexicon[[:alpha:]]{20}", textclean::replace_emoji_identifier(text))
[1] TRUE TRUE
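If installing a package is not an option, a rough base R check is also possible. The reason grepl('U', text) returns FALSE is that '\U0001f600' is parsed as a single character, so there is no literal "U" in the string. The sketch below tests code points against the range 0x1F300 to 0x1FAFF; that range is an assumption that covers many common emoji but not all of them.

```r
text <- c("text text. \U0001f600", "i am here\U0001f600", "no emoji here")

# The emoji is one character, not the ten characters typed in the source
nchar("\U0001f600")

# Flag strings containing a code point in an assumed emoji range
has_emoji <- vapply(text, function(s) {
  cp <- utf8ToInt(s)
  any(cp >= 0x1F300 & cp <= 0x1FAFF)
}, logical(1), USE.NAMES = FALSE)

has_emoji
```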
Your problem is that you use a single character \ in your code:
text = 'text text. \U0001f600'
It really should be \\:
text = 'text text. \\U0001f600'
I had a similar experience using the rtweet library.
In my case the tweets bring some Unicode code points, not just emoji, and with the following format: "some text<U+code-point>". What I did in this case was "convert" that code point to its graphic representation:
library(stringi)
#I use gsub() to replace "<U+code-point>" with "\\ucode-point", the appropriate format
# And stri_unescape_unicode() to un-escape all Unicode sequences
stri_unescape_unicode(gsub("<U\\+(\\S+)>",
"\\\\u\\1", #replace by \\ucode-point
"some text with #COVID<U+30FC>19"))
#[1] "some text with #COVIDー19"
If the Unicode code point is not delimited as in my case (<>), you should change the regular expression from "<U\\+(\\S+)>" to "U(\\S+)". You should be careful here, because this will only work correctly if a space character appears after the code point. If words are attached to the code point both before and after, the pattern must be more specific and state the number of characters that compose it, for example "U(....)".
You can try refining this regular expression using Character Classes, or specifying only hexadecimal digits "U([A-Fa-f0-9]+)".
Note that emoji will not be displayed in the RStudio console: you can apply this function, but to see the emoji you must use an R library made for that purpose. Other characters are displayed, however: "#COVID<U+30FC>19" appears in the RStudio console as "#COVIDー19".
Edit: Actually "\\S+" didn't work for me when there were consecutive Unicode code points like "<U+0001F926><U+200D><U+2642>". In this case it only replaced the first occurrence, most likely because the greedy "\\S+" also matches ">" and "<" and so runs across the adjacent code points. I didn't delve into that; I just changed my regular expression to "<U\\+([A-Fa-f0-9]+)>".
"[A-Fa-f0-9]" represents hexadecimal digits.
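The difference between the two patterns can be made visible with base R alone. In this sketch each captured code point is wrapped in square brackets instead of being unescaped, purely so the match boundaries are easy to see; the bracket replacement is an illustration, not part of the original approach.

```r
x <- "<U+0001F926><U+200D><U+2642>"

# Greedy \S+ also consumes ">" and "<", so all three code points
# are swallowed by a single match
greedy <- gsub("<U\\+(\\S+)>", "[\\1]", x)
greedy

# Restricting the capture to hexadecimal digits yields one match per code point
strict <- gsub("<U\\+([A-Fa-f0-9]+)>", "[\\1]", x)
strict
```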
Is there an R equivalent for Environment.NewLine in .NET?
I'm looking for a character object that would represent a new line based on the environment, e.g. CR LF ("\r\n") on Windows and LF ("\n") on Unix. I couldn't find any such thing in the R documentation, or the default R options.
There’s no equivalent, but most of the time you won’t need it: as long as you’re writing to a text connection, the operating system will do the correct thing and treat '\n' according to the platform’s specification; for example, the documentation of writeLines says:
Normally writeLines is used with a text-mode connection, and the default separator is converted to the normal separator for that platform (LF on Unix/Linux, CRLF on Windows).
The \n should still work:
> s = "line 1\nline 2"
> cat(s)
line 1
line 2
Here's a separate question which explains that print(s) doesn't quite work when trying to output strings with escape characters, and that we should use cat or writeLines instead: Printing newlines with print() in R
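A small sketch of both points: print() versus cat(), and what to do in the less common case where a specific line ending is required regardless of platform. The second part is an assumption added for illustration; it relies on a binary-mode connection, which writes the separator exactly as given.

```r
s <- "line 1\nline 2"

# print() shows the escape sequence; cat() interprets it
printed <- capture.output(print(s))
shown   <- capture.output(cat(s, "\n", sep = ""))
printed
shown

# Binary mode ("wb") disables newline translation, so sep is written verbatim
tmp <- tempfile()
con <- file(tmp, open = "wb")
writeLines(c("line 1", "line 2"), con, sep = "\r\n")
close(con)

bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
rawToChar(bytes)
```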
This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : )
I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.
Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?
Thank you in advance.
PN
P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd get around this problem.
read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.
To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as there are lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. A line of text is therefore not defined by the width of your software window, but can run over many visual rows; in effect, a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs:
> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy, I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""
As you can see, readLines() read that long seventh paragraph as one line. As you can also see, a backslash appears in front of each quotation mark that belongs to the text. R adds these when it prints the vector: since it displays each element in quotation marks, it needs to distinguish those from the ones that are part of the original text, so it "escapes" the original quotation marks. The backslashes are not stored in the strings themselves; cat() or writeLines() shows the text without them. Read about escaping on Wikipedia.
readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning will do nothing but suppress the warning message.
If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():
mytext <- readLines("textfile.txt")
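To try this without a file of your own, the following sketch writes two lines to a temporary file and reads them back. The file deliberately has no trailing newline, so warn = FALSE is used to silence the "incomplete final line" warning. The file name and contents are made up for illustration.

```r
# Create a small temporary text file without a final newline
tmp <- tempfile(fileext = ".txt")
cat("\"TOM!\"\nNo answer.", file = tmp)

mytext <- readLines(tmp, warn = FALSE)
mytext            # printed with escaped quotation marks

# The backslashes are only part of the printed form:
nchar(mytext[1])  # counts the quotation marks and TOM!, but no backslashes
writeLines(mytext)
```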
Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
I would strongly advise you to write your text in a .txt-file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or printed, and those will be read by R. You can try and see what you get, but for good results you should either save your file as a .txt-file from Word or compose it in a text editor.
You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:
> myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."
Note how entering Return does not cause R to execute the command before I closed the string with "). R just replies with +, telling me that I can continue to edit. I did not type in those plusses. Try it. Note also that now the newlines are part of your string of text. (I'm on a Mac, so my newline is \n.)
If you input your text manually, I would load the whole text as one string into a vector:
x <- c("The text of your book.")
You could load different chapters into different elements of this vector:
y <- c("Chapter 1", "Chapter 2")
For better reference, you can name the elements:
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
Now you can split the elements of any of these vectors:
sentences <- strsplit(z, "[.!?] *")
Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by an optional space (if you don't define a space here, the resulting "sentences" will be preceded by a space).
sentences now contains:
> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"
$ch2
[1] "This is the text of the second chapter" "It is even shorter"
You can access the individual sentences by indexing:
> sentences$ch1[2]
[1] "It is not long"
R will be unable to know that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
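For the curious, here is one rough way such exceptions could be written, using Perl-style lookbehind assertions. It is only a sketch: the list of abbreviations (Mr., Mrs., Dr.) is an assumption and far from complete, and unlike the pattern above it splits on the whitespace after the punctuation, so the punctuation marks are kept.

```r
x <- "Mr. Smith went home. He was tired! Was Dr. Jones there?"

# Split on whitespace that follows ., ! or ?,
# unless that punctuation belongs to one of the listed abbreviations
pattern <- "(?<=[.!?])(?<!Mr\\.)(?<!Mrs\\.)(?<!Dr\\.)\\s+"

sentences <- strsplit(x, pattern, perl = TRUE)[[1]]
sentences
```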
How you would tell R how to recognize subjects or objects, I have no idea.