I am looking for a way to read text into a vector such that each line would be a different element, all happening within an R script.
One way that I found was something like:
bla <- scan(text = "line1
line2
line3",
what = character())
Which correctly gives me:
> bla
[1] "line1" "line2" "line3"
However, there are several problems. First, the continuation lines are indented. I don't have to indent them, but any auto-indentation feature (which I use all the time) will just pop them back into alignment. Second, this requires escape codes if I want to use the double-quote character, for example.
Is there a way to do something similar to the Here-Document method (<< EOF), in R scripts?
I am using RStudio as my IDE, running on Windows. Preferably there would be a platform independent way of doing this.
EDIT
Do you need to have the text inside the R script?
Yes.
An example of what I want to do:
R script here
⋮
bla <- <SOMETHING - BEGIN>
line1
line2
line3
<SOMETHING - END>
⋮
more R script here
Where the requirement, again, is that I can type freely without worrying about auto indentation moving the lines forward, and no need to worry about escape codes when typing things like ".
Both problems can be solved with the scan function and two little tricks, I think:
scan(text = '
line1
"line2" uses quotation mark
line3
', what = character(), sep = "\n")
Read 3 items
[1] "line1" "\"line2\" uses quotation mark"
[3] "line3"
When you put the quotation marks in a line of their own, you don't have a problem with auto indentation (tested using RStudio). If you only have double quotation marks in the text, you can use single quotation marks to start and end your character object. If you have single quotation marks in the text, use double quotation marks for character. If you have both, you should probably use search and replace to make them uniform.
I also added sep = "\n", so every line is one element of the resulting character vector.
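A quick sketch of the quote-swapping trick: if the text itself contains single quotes (apostrophes), wrap it in double quotes instead. The sample text is made up; quiet = TRUE just suppresses the "Read 3 items" message.

```r
# blank.lines.skip = TRUE (the scan default) drops the empty first and last lines
bla <- scan(text = "
line1
it's got an apostrophe
line3
", what = character(), sep = "\n", quiet = TRUE)
bla
#> [1] "line1"                  "it's got an apostrophe" "line3"
```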
Since R version 4.0, we have raw strings (see ?Quotes):
bla <- r"(line1
line2
"line3"
'line4'
Here is indentation
Here is a backslash \
)"
#> [1] "line1\nline2\n\"line3\"\n'line4'\n Here is indentation\nHere is a backslash \\\n"
Note, though, that it gives one single string, not separate elements. We can split it back with strsplit:
bla <- strsplit(bla, "\n")[[1]]
#> [1] "line1"
#> [2] "line2"
#> [3] "\"line3\""
#> [4] "'line4'"
#> [5] " Here is indentation"
#> [6] "Here is a backslash \\"
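If this pattern comes up a lot, the raw string and the strsplit call can be wrapped into a small helper; heredoc is a made-up name here, not a base R function:

```r
# Split a raw string into one element per line, like a shell here-document
heredoc <- function(s) strsplit(s, "\n", fixed = TRUE)[[1]]

bla <- heredoc(r"(line1
"line2"
line3)")
bla
#> [1] "line1"     "\"line2\"" "line3"
```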
If authoring an R Markdown document instead of an R script is an option, we could use the knitr cat engine:
---
title: "Untitled"
output: html_document
---
```{cat engine.opts=list(file='foo')}
line1
line2
"line3"
'line4'
```
```{r}
bla <- readLines("foo")
bla
```
Related
I tried to write a regex on regex101.com to identify any kind of email address.
The general email address formats are like this:
rohan.singh@example.com
rakesh@example.com
hamed.jelveh@example.dd.rr
This regex works on www.regex101.com when I want to select just the emails among the text. The regex101.com link is below:
https://regex101.com/r/UA6CTA/1
(\w){1,25}(.|\w){1,25}@(\w){1,25}.(\w){1,25}(.|\w|$)((\w){1,25}|$)
But when I write this in R, even when I use \\ instead of \ with the grep command, it gives me "character(0)".
The script is below:
emails <- c("javad.rasooli@bpmn.edu",
"education@world.gov",
"babak.pirooz@peace.org",
"invalid.edu",
"sadeghi@apbarez.edu",
"hassaneskandari@codeman.ir")
emails[grep(pattern = r"(\w){1,25}(.|\w){1,25}@(\w){1,25}.(\w){1,25}(.|\w|$)((\w){1,25}|$)",
x=emails)]
The output in the terminal is below:
emails[grep(pattern = r"((\w){1,25}(.|\w){1,25}@(\w){1,25}.
+ (\w){1,25}(.|\w|$)((\w){1,25}|$))",
+ x=emails)]
character(0)
Can anyone tell me what to do?
I assume the regex used on regex101 was without the doubled backslashes, like this:
(\w){1,25}(.|\w){1,25}@(\w){1,25}.(\w){1,25}(.|\w|$)((\w){1,25}|$)
though this does not match the one in the R example, with or without extra escaping. In addition, the regex in the R example is marked as a raw string (r"...") but in R one must also use an opening & closing delimiter pair (i.e. r"(...)"; more details in the R help, ?Quotes).
emails <- c("javad.rasooli@bpmn.edu",
"education@world.gov",
"babak.pirooz@peace.org",
"invalid.edu",
"sadeghi@apbarez.edu",
"hassaneskandari@codeman.ir")
emails[grep(pattern=r"((\w){1,25}(.|\w){1,25}@(\w){1,25}.(\w){1,25}(.|\w|$)((\w){1,25}|$))", x=emails)]
#> [1] "javad.rasooli@bpmn.edu" "education@world.gov"
#> [3] "babak.pirooz@peace.org" "sadeghi@apbarez.edu"
#> [5] "hassaneskandari@codeman.ir"
Or without raw string:
emails[grep(pattern="(\\w){1,25}(.|\\w){1,25}@(\\w){1,25}.(\\w){1,25}(.|\\w|$)((\\w){1,25}|$)", x=emails)]
#> [1] "javad.rasooli@bpmn.edu" "education@world.gov"
#> [3] "babak.pirooz@peace.org" "sadeghi@apbarez.edu"
#> [5] "hassaneskandari@codeman.ir"
Created on 2023-01-28 with reprex v2.0.2
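To convince yourself the two spellings are the same pattern, you can compare a raw-string fragment with its escaped equivalent directly:

```r
# Raw strings only change how the pattern is typed, not the pattern itself
identical(r"((\w){1,25})", "(\\w){1,25}")
#> [1] TRUE
```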
That is incredible. But the key point is that when you use a regex with grep as a string, if you continue the string onto the next line because of the R margin after pattern = "bla bla bla...", it changes the string's form. Below I describe the problem.
For instance, I want to save the string "Hello to programming lovers" into a string variable.
st<- "Hello to programming lovers"
st
The output:
[1] "Hello to programming lovers"
Now, for some reason, I repeat the above code on two lines instead of one.
st<- "Hello to
programming lovers"
st
The output:
[1] "Hello to \n programming lovers"
So it is natural that when I write this code on two lines it gives me "character(0)":
emails[grep(pattern = r"((\w){1,25}(\.|\w){0,25}
(\w){1,25}@(\w){1,25}\.(\w){1,25}(\.|\w|$)((\w){1,25}|$))", x=emails)]
The output:
character(0)
Meanwhile, when you use it on just one line, or build it with the paste command with sep = "", it gives you the desired result.
This is simple but tricky!
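One way around the margin problem described above is to build the long pattern with paste0, so each fragment sits on its own line without any newline sneaking into the string. A sketch with a simplified, hypothetical pattern:

```r
# paste0 concatenates the fragments with nothing (no "\n") between them
pattern <- paste0(
  "(\\w){1,25}(\\.|\\w){0,25}",
  "@(\\w){1,25}\\.(\\w){1,25}"
)
grepl(pattern, "javad.rasooli@bpmn.edu")
#> [1] TRUE
```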
I have some data in an object called all_lines that is of class character in R (the result of reading a PDF file into R). My objective: to delete everything before a certain string and to delete everything after another string.
The data looks like this and it is stored in the object all_lines
class(all_lines)
"character"
[1] LABORATORY Research Cover Sheet
[2] Number 201111EZ Title Maximizing throughput"
[3] " in Computers
[4] Start Date 01/15/2000
....
[49] Introduction
[50] Some text here and there
[51] Look more text
....
[912] Citations
[913] Author_1 Paper or journal
[914] Author_2 Book chapter
I want to delete everything before the string 'Introduction' and everything after 'Citations'. However, nothing I find seems to do the trick. I have tried commands from posts such as "How to delete everything after a matching string in R" and from multiple online R tutorials on how to do just this. Here are some commands that I have tried; all I get is the string 'Introduction' deleted from all_lines, with everything else returned.
str_remove(all_lines, "^.*(?=(Introduction))")
sub(".*Introduction", "", all_lines)
gsub(".*Introduction", "", all_lines)
I have also tried to delete everything after the string 'Citations' using the same commands, such as:
sub("Citations.*", "", all_lines)
Am I missing anything? Any help would really be appreciated!
It looks like your variable is a vector of character strings, one element per line in the document.
We can use the grep() function here to locate the lines containing the desired text. I am assuming only one line contains "Introduction" and only one line contains "Citations".
#line numbers containing the start and end
Intro <- grep("Introduction", all_lines)
Citation <- grep("Citations", all_lines)
#extract out the desired portion.
abridged <- all_lines[Intro:Citation]
You may need to add 1 or subtract 1 if you would like to actually remove the "Introduction" and "Citations" lines themselves.
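For example, on a made-up vector standing in for all_lines, dropping the marker lines themselves would look like:

```r
all_lines <- c("Cover sheet", "Introduction", "Some text here", "Citations", "Author_1")
Intro    <- grep("Introduction", all_lines)
Citation <- grep("Citations", all_lines)
all_lines[(Intro + 1):(Citation - 1)]  # excludes both marker lines
#> [1] "Some text here"
```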
Assuming you can accept a single string as output, you could collapse the input into a single string and then use gsub(); note that the lookarounds require perl = TRUE:
all_lines <- paste(all_lines, collapse = " ")
output <- gsub("^.*?(?=\\bIntroduction\\b)|(?<=\\bCitations\\b).*$", "", all_lines, perl = TRUE)
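A self-contained sketch of the same idea on made-up data:

```r
all_lines <- c("Cover sheet", "Introduction", "Some text here", "Citations", "Author_1")
collapsed <- paste(all_lines, collapse = " ")
# perl = TRUE is needed because the default TRE engine has no lookarounds
gsub("^.*?(?=\\bIntroduction\\b)|(?<=\\bCitations\\b).*$", "", collapsed, perl = TRUE)
#> [1] "Introduction Some text here Citations"
```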
Sometimes we copy/paste a string into RStudio, in which case we need to manually surround the text with quotes.
Is there a native way to paste with automatic quoting?
Example
If the clipboard contained here is my text, such a shortcut would result in "here is my text" being pasted in the R console/script pane.
This can be done in R:
x <- readClipboard()
x
## [1] "Here is my text"
This also works:
x <- readLines(stdin())
...paste clipboard into R & press ctrl-z (windows) or ctrl-d (unix)...
x
## [1] "Here is my text"
If you want a way to make the text contents of your Clipboard reusable in a script you can do
dput(readClipboard())
This has the benefit of automatically making multi-line text into a concatenated character vector. For example if I copy:
Alas, poor Yorick!
I knew him, Horatio;
a fellow of infinite jest,
of most excellent fancy;
Then I can do
dput(readClipboard())
# c("Alas, poor Yorick! ", "I knew him, Horatio; ", "a fellow of infinite jest, ",
# "of most excellent fancy; ")
I am trying to create a multi-line string in R out of several other strings, adding line breaks with \n. When I use paste I do not get a new line; I simply get "\n" concatenated in with my strings.
x <- "hello"
y <- "world"
paste(x,"\n",y)
# [1] "hello \n world"
I would like to get something along the lines of:
hello
world
I know that cat would output the correct answer to the console, but I would like to store this multi-line string in a variable.
Any help would be greatly appreciated. Thank you.
I'm working on reading transcripts of dialogue into R. However, I run into a bump with special characters like curly quotes, en and em dashes, etc. Typically I replace these special characters in a Microsoft product first with find-and-replace. Usually I swap in plain-text versions, but on some occasions I want to replace them with other characters (i.e. I replace “ ” with { }). This is tedious and not always thorough. If I could read the transcripts into R as is, and then use Encoding to switch their encoding to a recognizable Unicode format, I could gsub them out and replace them with plain-text versions. However, the file is read in in some way I don't understand.
Here's an xlsx of what my data may look like:
http://dl.dropbox.com/u/61803503/test.xlsx
This is what is in the .xlsx file
text num
“ ” curly quotes 1
en dash (–) and the em dash (—) 2
‘ ’ curly apostrophe-ugg 3
… ellipsis are uck in R 4
This can be read into R with:
URL <- "http://dl.dropbox.com/u/61803503/test.xlsx"
library(gdata)
z <- read.xls(URL, stringsAsFactors = FALSE)
The result is:
text num
1 “ †curly quotes 1
2 en dash (–) and the em dash (—) 2
3 ‘ ’ curly apostrophe-ugg 3
4 … ellipsis are uck in R 4
So I tried to use Encoding to convert to Unicode:
iconv(z[, 1], "latin1", "UTF-8")
This gives:
[1] "â\u0080\u009c â\u0080\u009d curly quotes" "en dash (â\u0080\u0093) and the em dash (â\u0080\u0094)"
[3] "â\u0080\u0098 â\u0080\u0099 curly apostrophe-ugg" "â\u0080¦ ellipsis are uck in R"
Which makes gsubing less useful.
What can I do to convert these special characters to distinguishable unicode so I can gsub them out appropriately? To be more explicit I was hoping to have z[1, 1] read:
\u201C \u201D curly quotes
To make my desired outcome even clearer: I will web-scrape the tables from a page like Wikipedia's http://en.wikipedia.org/wiki/Quotation_mark_glyphs and use the Unicode reference chart to replace characters appropriately. So I need the characters to be in Unicode or some other standard format that I can systematically go through, replacing the characters. Maybe it already is and I'm missing it.
PS: I don't save the files as .csv or plain text because the special characters get replaced with ?, hence the use of read.xls. I'm not attached to any particular method of reading in the file (i.e. read.xls) if you've got a better alternative.
Maybe this will help (I'll have access to a Windows machine tomorrow and can probably play with it more at that point if SO doesn't get you the answer first).
On my Linux system, when I do the following:
iconv(z$text, "", "cp1252")
I get:
[1] "\x93 \x94 curly quotes" "en dash (\x96) and the em dash (\x97)"
[3] "\x91 \x92 curly apostrophe-ugg" "\x85 ellipsis are uck in R"
This is not UTF, but (I believe) ISO hex entities. Still, if you are able to get to this point also, then you should be able to use gsub the way you intend to.
See this page (reserved section in particular) for conversions.
Update
You can also try converting to an encoding that doesn't have those characters, like ASCII and set sub to "byte". On my machine, that gives me:
iconv(z$text, "", "ASCII", "byte")
# [1] "<e2><80><9c> <e2><80><9d> curly quotes"
# [2] "en dash (<e2><80><93>) and the em dash (<e2><80><94>)"
# [3] "<e2><80><98> <e2><80><99> curly apostrophe-ugg"
# [4] "<e2><80><a6> ellipsis are uck in R"
It's ugly, but UTF-8 (e2, 80, 9c) is a left curly quote (each character, I believe, is a set of three values in angled brackets). You can find conversions at this site, where you can search by punctuation mark name.
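A self-contained sketch of that gsub step; the byte sequences below assume the input is UTF-8:

```r
x <- "\u201Cquoted\u201D"                      # left and right curly quotes
y <- iconv(x, "UTF-8", "ASCII", sub = "byte")  # unconvertible bytes become <xx>
y
#> [1] "<e2><80><9c>quoted<e2><80><9d>"
gsub("<e2><80><9c>|<e2><80><9d>", "\"", y)     # swap in plain double quotes
#> [1] "\"quoted\""
```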
Try
> iconv(z, "UTF-8", "UTF-8")
[1] "c(\"“—” curly quotes\", \"en dash (–) and the em dash (—)\", \"‘—’ curly apostrophe-ugg\", \"… ellipsis are uck in R\")"
[2] "c(1, 2, 3, 4)"
Windows is very problematic with encodings. Maybe you can look at http://www.vmware.com/products/player/ and run Linux.
This works on my windows box. Initial input was as you had. You may have a different experience.