Longest line in text dataset - r

I am looking for a way to find the length of the longest line in a text file.
E.g. consider a simple dataset from the tm package.
install.packages("tm")
library(tm)
txt <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"), readerControl =
list(language = "lat"))
length(ovid)
[1] 5
ovid is composed of five documents, each one a character vector of n elements (from 16 to 18), among which I would like to identify the longest.
I found documentation for Python, C# and the bash shell but, surprisingly, nothing for R. Because of that, my attempts were quite naive:
max(nchar(ovid))
[1] 5410
max(length(ovid))
[1] 5

Actually it's the fourth text that is the longest, once we remove the whitespace padding. Here's how. Note that much of the work comes from the difficulty of getting texts out of a tm (V)Corpus object, which has been asked about (several times) before, for instance here.
Note that I am interpreting your question about "lines" as referring to the five documents, each of which is not a single line but consists of multiple lines (character vectors of between 16 and 18 elements). I hope I have interpreted this correctly.
texts <- sapply(ovid$content, "[[", "content")
str(texts)
## List of 5
## $ : chr [1:16] " Si quis in hoc artem populo non novit amandi," " hoc legat et lecto carmine doctus amet." " arte citae veloque rates remoque moventur," " arte leves currus: arte regendus amor." ...
## $ : chr [1:17] " quas Hector sensurus erat, poscente magistro" " verberibus iussas praebuit ille manus." " Aeacidae Chiron, ego sum praeceptor Amoris:" " saevus uterque puer, natus uterque dea." ...
## $ : chr [1:17] " vera canam: coeptis, mater Amoris, ades!" " este procul, vittae tenues, insigne pudoris," " quaeque tegis medios, instita longa, pedes." " nos venerem tutam concessaque furta canemus," ...
## $ : chr [1:17] " scit bene venator, cervis ubi retia tendat," " scit bene, qua frendens valle moretur aper;" " aucupibus noti frutices; qui sustinet hamos," " novit quae multo pisce natentur aquae:" ...
## $ : chr [1:18] " mater in Aeneae constitit urbe sui." " seu caperis primis et adhuc crescentibus annis," " ante oculos veniet vera puella tuos:" " sive cupis iuvenem, iuvenes tibi mille placebunt." ...
So here we have extracted the texts, but each "document" is still a character vector whose elements are the individual lines, and because they are verses, there is variable whitespace padding at the beginning and end of some of these elements. Let's trim that and keep just the text, using stringi's stri_trim_both function.
# need to trim leading and trailing whitespace
texts <- lapply(texts, stringi::stri_trim_both)
## texts[1]
## [[1]]
## [1] "Si quis in hoc artem populo non novit amandi," "hoc legat et lecto carmine doctus amet."
## [3] "arte citae veloque rates remoque moventur," "arte leves currus: arte regendus amor."
## [5] "" "curribus Automedon lentisque erat aptus habenis,"
## [7] "Tiphys in Haemonia puppe magister erat:" "me Venus artificem tenero praefecit Amori;"
## [9] "Tiphys et Automedon dicar Amoris ego." "ille quidem ferus est et qui mihi saepe repugnet:"
## [11] "" "sed puer est, aetas mollis et apta regi."
## [13] "Phillyrides puerum cithara perfecit Achillem," "atque animos placida contudit arte feros."
## [15] "qui totiens socios, totiens exterruit hostes," "creditur annosum pertimuisse senem."
# now paste them together to make a single character vector of the five documents
texts <- sapply(texts, paste, collapse = "\n")
str(texts)
## chr [1:5] "Si quis in hoc artem populo non novit amandi,\nhoc legat et lecto carmine doctus amet.\narte citae veloque rates remoque movent"| __truncated__ ...
cat(texts[1])
## Si quis in hoc artem populo non novit amandi,
## hoc legat et lecto carmine doctus amet.
## arte citae veloque rates remoque moventur,
## arte leves currus: arte regendus amor.
##
## curribus Automedon lentisque erat aptus habenis,
## Tiphys in Haemonia puppe magister erat:
## me Venus artificem tenero praefecit Amori;
## Tiphys et Automedon dicar Amoris ego.
## ille quidem ferus est et qui mihi saepe repugnet:
##
## sed puer est, aetas mollis et apta regi.
## Phillyrides puerum cithara perfecit Achillem,
## atque animos placida contudit arte feros.
## qui totiens socios, totiens exterruit hostes,
## creditur annosum pertimuisse senem.
That's looking more like it. Now we can figure out which was longest.
nchar(texts)
## [1] 600 621 644 668 622
which.max(nchar(texts))
## [1] 4
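If by "lines" you instead meant the longest individual line (verse) across all five documents, a minimal sketch along the same idea is to split the collapsed texts back into lines first (the variable names here are just illustrative):
# split the collapsed documents back into their individual lines
all_lines <- unlist(strsplit(texts, "\n", fixed = TRUE))
# position and content of the single longest line
which.max(nchar(all_lines))
all_lines[which.max(nchar(all_lines))]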

Related

Search for matches to argument pattern within every item of a character vector and a window function

I have the following strings:
library(stringi)
s=stri_rand_lipsum(10)
Function grepl searches for matches to argument pattern within every item of a character vector. As far as I know, it performs the search of just one word at once. For example if I would like to search "conubia" and "viverra" I have to perform two searches:
x=s[grepl("conubia",s)]
x=x[grepl("viverra",x)]
Anyway, I would like to search for two or more terms that appear in the same entry of s within a window of, e.g., 140 characters.
You can use the *apply family. If your source text is a character vector, I recommend vapply, but you have to specify the type and length of the returned values. Because you use grepl, the returned values are logical vectors.
txt = "My name is Abdur Rohman"
patt = c("na", "Ab","man", "om")
vapply(patt, function(x) grepl(x,txt),
FUN.VALUE = logical(length(txt)))
# na Ab man om
# TRUE TRUE TRUE FALSE
So, in your example you can use:
s = stri_rand_lipsum(10)
vapply(c("conubia","viverra"), function(x) grepl(x,s),
FUN.VALUE = logical(length(s)))
# conubia viverra
# [1,] TRUE TRUE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# [6,] FALSE TRUE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
#[10,] FALSE FALSE
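Building on that logical matrix, you can then keep only the entries of s that contain all of the terms, for instance (a small sketch):
# rows where every pattern matched
m <- vapply(c("conubia", "viverra"), function(x) grepl(x, s),
FUN.VALUE = logical(length(s)))
s[rowSums(m) == ncol(m)]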
Edit to include a 140-character window
As for the requirement of a limiting window of 140 characters, as explained in your comment, one way of meeting it is to extract all characters between the two targeted strings and then count the extracted characters. The requirement is met only if the count is less than or equal to 140.
Extracting all characters between two strings can be done with regular expressions in gsub. However, in case the strings are repeated, you need to specify the window. Let me give examples:
txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"
This text contains two occurrences of conubia and two of viverra, so you have four options for the window of characters between conubia and viverra.
Option 1: between the last conubia and the first viverra
gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 2: between the first conubia and the last viverra
gsub(".*?conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
# [1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
Option 3: between the first conubia and the first viverra
gsub(".*?conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
Option 4: between the last conubia and the last viverra
gsub(".*conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
To calculate the number of the extracted characters, nchar can be used.
# Option 1
nchar(gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE))
#[1] 68
Applying this approach:
set.seed(8)
s1 <- stri_rand_lipsum(10)
Nch <- nchar(gsub(".*conubia(.*?)viverra.*", "\\1", s1, perl = TRUE))
Nch
# [1] 637 42 512 528 595 640 522 407 388 512
We find that only the second element of s1 meets the requirement.
To print the element we can use s1[which(Nch <= 140)].
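If you need to run this check repeatedly, one way is to wrap the extraction and the length test in a small helper function; the name within_window and the choice of Option 1 (last conubia to first viverra) are just illustrative assumptions:
# hypothetical helper: TRUE where both terms occur and the gap between them
# (Option 1 extraction) is at most `window` characters
within_window <- function(text, a, b, window = 140) {
  gap <- gsub(paste0(".*", a, "(.*?)", b, ".*"), "\\1", text, perl = TRUE)
  grepl(a, text) & grepl(b, text) & nchar(gap) <= window
}
s1[within_window(s1, "conubia", "viverra")]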
Some great references I've been learning from:
https://www.buymeacoffee.com/wstribizew/extracting-text-two-strings-regular-expressions
https://regex101.com/
Extracting a string between other two strings in R

Subsetting a vector using a list of sequences in R [closed]

I have a character vector that contains textual data which I can subset by selecting individual lines. The eventual goal is to store different sequences of the vector as independent variables or elements of a list. I am able to do this using a simple loop, but I don't succeed in subsetting the character vector by a list of sequences.
See the following example:
Text<-scan("~/Desktop/Lorem Ipsum.txt", what="character", sep="\n")
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
[5] "Ut enim ad minim veniam, "
[6] "quis nostrud exercitation "
[7] "ullamco laboris nisi ut aliquip ex ea commodo consequat."
[8] "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur."
[9] "Excepteur sint occaecat cupidatat non proident,"
[10] "sunt in culpa qui officia deserunt mollit anim id est laborum."
The normal way of subsetting the vector would be Text[1:4], returning
[1] "Lorem ipsum dolor sit amet, "
[2] "consectetur adipiscing elit,"
[3] "sed do eiusmod tempor incididunt "
[4] "ut labore et dolore magna aliqua."
In a list I have stored sequences of numbers that represent different sets of lines in the vector.
Sentence.numbers <- c(1:4, 5:7, 8, 9:10)
Now I would like to subset all the numbers that make up the different sentences at once and store them in a list for further analysis.
I used Text[Sentence.numbers], but the error message is "invalid index type 'list'".
Is there a way to use a list of values to subset?
You need to set up Sentence.numbers as a list and then use lapply -
Sentence.numbers <- list(1:4, 5:7, 8, 9:10)
lapply(Sentence.numbers, function(x) Text[x])
Here's an example -
lapply(Sentence.numbers, function(x) letters[x])
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e" "f" "g"
[[3]]
[1] "h"
[[4]]
[1] "i" "j"

how to extract plain text from .docx file using R

Does anyone know of anything they can recommend in order to extract just the plain text from an article in .docx format (preferably with R)?
Speed isn't crucial, and we could even use a website that has some API to upload and extract the files, but I've been unable to find one. I need to extract the introduction, the method, the results and the conclusion; I want to delete the abstract, the references, and especially the graphics and the tables.
Thanks.
You can try the readtext library:
library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file
The x$text field contains just the text without any formatting, so if you need to extract specific information you need to perform string searches. For example, for the document you mentioned in your comment, one approach could be as follows:
library(readtext)
doc.text <- readtext("test.docx")$text
# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]
# First line in the document- the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"
# Similarly we can extract some other parts from a header
issn <- doc.parts[2]
issue <- doc.parts[3]
# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]
# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]
# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")
# Same way we can get Keywords text:
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"
# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")
# Assuming that Results is Part 3
Results.loc <- Methods.loc + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")
# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants.
# This study was supported by a grant from the Tunisian Ministry of Health and
# Ministry of Higher Education ...
The exact approach depends on the common structure of all the documents you need to search through. For example if the first section is always named "Background" you can use this word for your search. However if this could sometimes be "Background" and sometimes "Introduction" then you might want to search for "1." pattern.
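For instance, here is a hedged sketch of a more tolerant search that accepts either a numbered heading or a couple of common section names (the exact pattern is an assumption about your documents):
# match a line starting with "1." or with a Background/Introduction heading
first.section.loc <- grep("^(1\\.|Background|Introduction)", doc.parts)[1]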
You should find that one of these packages will do the trick for you.
https://davidgohel.github.io/officer/
https://cran.r-project.org/web/packages/docxtractr/index.html
At the end of the day, the modern Office file formats (OpenXML) are simply *.zip files containing structured XML content, so if you have well-structured content you may just want to open them that way. I would start here (http://officeopenxml.com/anatomyofOOXML.php) and you should be able to unpick the OpenXML SDK for guidance as well (https://msdn.microsoft.com/en-us/library/office/bb448854.aspx).
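If you want to go that route from R, a minimal sketch (the file name myfile.docx is hypothetical) is to unzip word/document.xml and pull the text nodes with xml2:
library(xml2)
# a .docx is a zip archive; the body text lives in word/document.xml
doc_xml <- unzip("myfile.docx", files = "word/document.xml", exdir = tempdir())
doc <- read_xml(doc_xml)
# <w:t> nodes hold the runs of text
text_runs <- xml_text(xml_find_all(doc, "//w:t"))
head(text_runs)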
Pandoc is a fantastic solution for tasks like this. With a document named a.docx you would run at the command line
pandoc -f docx -t markdown -o a.md a.docx
You could then use regex tools in R to extract what you needed from the newly-created a.md, which is text. By default, images are not converted.
Pandoc is part of RStudio, by the way, so you may already have it.
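If you prefer to stay inside R, you can drive the same conversion through the pandoc bundled with rmarkdown, for example (file names are illustrative):
library(rmarkdown)
# convert a.docx to markdown; the output file is written next to the input
pandoc_convert("a.docx", to = "markdown", output = "a.md")
md <- readLines("a.md")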
You can do it with package officer:
library(officer)
example_docx <- system.file(package = "officer", "doc_examples/example.docx")
doc <- read_docx(example_docx)
summary_paragraphs <- docx_summary(doc)
summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", "text"]
#> [1] "Title 1"
#> [2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
#> [3] "Title 2"
#> [4] "Quisque tristique "
#> [5] "Augue nisi, et convallis "
#> [6] "Sapien mollis nec. "
#> [7] "Sub title 1"
#> [8] "Quisque tristique "
#> [9] "Augue nisi, et convallis "
#> [10] "Sapien mollis nec. "
#> [11] ""
#> [12] "Phasellus nec nunc vitae nulla interdum volutpat eu ac massa. "
#> [13] "Sub title 2"
#> [14] "Morbi rhoncus sapien sit amet leo eleifend, vel fermentum nisi mattis. "
#> [15] ""
#> [16] ""
#> [17] ""
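From there it is easy to collapse the paragraph text into one plain-text string, e.g. (a small sketch):
plain_text <- paste(summary_paragraphs$text[summary_paragraphs$content_type %in% "paragraph"],
collapse = "\n")
cat(plain_text)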

TermDocumentMatrix in R - only 1-grams created

I just started with the tm package in R and cannot seem to overcome an issue.
Even though my tokenizer functions seem to work right:
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer))
when I try to pull 2-grams from biTDM, only 1-grams come up...
findFreqTerms(biTDM, 50)
[1] "after" "and" "most" "the" "were" "years" "love"
[8] "you" "all" "also" "been" "did" "from" "get"
At the same time, the 2-gram function appears to be intact:
x <- biTokenizer(corpus)
head(x)
[1] "c in" "in the" "the years"
[4] "years thereafter" "thereafter most" "most of"
I can only assume what the problem is here: NGramTokenizer needs a VCorpus object rather than a Corpus object.
library(tm)
library(RWeka)
# some dummy text
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr",
"sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat",
"sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet")
# create a VCorpus
corpus <- VCorpus(VectorSource(text))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
print(biTDM$dimnames$Terms)
[1] "accusam et" "aliquyam erat" "amet consetetur" "at vero" "clita kasd" "consetetur sadipscing" "diam nonumy" "diam voluptua" "dolor sit" "dolore magna"
[11] "dolores et" "duo dolores" "ea rebum" "eirmod tempor" "eos et" "est lorem" "et accusam" "et dolore" "et ea" "et justo"
[21] "gubergren no" "invidunt ut" "ipsum dolor" "justo duo" "kasd gubergren" "labore et" "lorem ipsum" "magna aliquyam" "no sea" "nonumy eirmod"
[31] "sadipscing elitr" "sanctus est" "sea takimata" "sed diam" "sit amet" "stet clita" "takimata sanctus" "tempor invidunt" "ut labore" "vero eos"
[41] "voluptua at"

Is there in R something like the "here document" in bash?

My script contains the line
lines <- readLines("~/data")
I would like to keep the content of the file data (verbatim) in the script itself. Is there in R a "read_the_following_lines" function? Something like the "here document" in the bash shell?
Multi-line strings are going to be as close as you get. It's definitely not the same (since you have to care about the quotes) but it does work pretty well for what you're trying to achieve (and you can do it with more than read.table):
here_lines <- 'line 1
line 2
line 3
'
readLines(textConnection(here_lines))
## [1] "line 1" "line 2" "line 3" ""
here_csv <- 'thing,val
one,1
two,2
'
read.table(text=here_csv, sep=",", header=TRUE, stringsAsFactors=FALSE)
## thing val
## 1 one 1
## 2 two 2
here_json <- '{
"a" : [ 1, 2, 3 ],
"b" : [ 4, 5, 6 ],
"c" : { "d" : { "e" : [7, 8, 9]}}
}
'
jsonlite::fromJSON(here_json)
## $a
## [1] 1 2 3
##
## $b
## [1] 4 5 6
##
## $c
## $c$d
## $c$d$e
## [1] 7 8 9
here_xml <- '<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>a
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
</CATALOG>
'
str(xml <- XML::xmlParse(here_xml))
## Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
print(xml)
## <?xml version="1.0"?>
## <CATALOG>
## <PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL><ZONE>4</ZONE>a
## <LIGHT>Mostly Shady</LIGHT><PRICE>$2.44</PRICE><AVAILABILITY>031599</AVAILABILITY></PLANT>
## <PLANT>
## <COMMON>Columbine</COMMON>
## <BOTANICAL>Aquilegia canadensis</BOTANICAL>
## <ZONE>3</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$9.37</PRICE>
## <AVAILABILITY>030699</AVAILABILITY>
## </PLANT>
## </CATALOG>
Pages 90f. of An Introduction to R state that it is possible to write R scripts like this (the example is quoted, slightly modified, from there):
chem <- scan()
2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20
5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70
print(chem)
Write these lines into a file, and give it the name, say, heredoc.R. If you then execute that script non-interactively by typing in your terminal
Rscript heredoc.R
you will get the following output
Read 24 items
[1] 2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20
[13] 5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70
So you see that the data provided in the file are saved in the variable chem. The function scan(.) reads from the connection stdin() by default. stdin() refers to user input from the console in interactive mode (a call to R without a specified script), but when an input script is being read, the following lines of that script are read instead *). The empty line after the data is important because it marks the end of the data.
This also works with tabular data:
tab <- read.table(file=stdin(), header=T)
A B C
1 1 0
2 1 0
3 2 9
summary(tab)
When using readLines(.), you must specify the number of lines read; the approach with the empty line does not work here:
txt <- readLines(con=stdin(), n=5)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ultricies diam
sed felis mattis, id commodo enim hendrerit. Suspendisse iaculis bibendum eros,
ut mattis eros interdum sit amet. Pellentesque condimentum eleifend blandit. Ut
commodo ligula quis varius faucibus. Aliquam accumsan tortor velit, et varius
sapien tristique ut. Sed accumsan, tellus non iaculis luctus, neque nunc
print(txt)
You can overcome this limitation by reading one line at a time until a line is empty or matches some other predefined string. Note, however, that you may run out of memory if you read a large (>100 MB) file this way, because each time you append a string to your read-in data, all the data is copied to another place in memory. See the chapter "Growing Objects" in The R Inferno:
txt <- c()
repeat{
  x <- readLines(con=stdin(), n=1)
  if(x == "") break # you can use any EOF string you want here
  txt <- c(txt, x)
}
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ultricies diam
sed felis mattis, id commodo enim hendrerit. Suspendisse iaculis bibendum eros,
ut mattis eros interdum sit amet. Pellentesque condimentum eleifend blandit. Ut
commodo ligula quis varius faucibus. Aliquam accumsan tortor velit, et varius
sapien tristique ut. Sed accumsan, tellus non iaculis luctus, neque nunc
print(txt)
*) If you want to read from standard input in an R script, for example because you want to create a reusable script that can be called with any input data (Rscript reusablescript.R < input.txt or
some-data-generating-command | Rscript reusablescript.R), do not use stdin() but file("stdin").
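A minimal sketch of such a reusable script (reusablescript.R is just the example name from above):
# reusablescript.R: read whatever arrives on standard input and report it
lines <- readLines(file("stdin"))
cat("Read", length(lines), "lines\n")
You would then run it as Rscript reusablescript.R < input.txt.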
Since R 4.0.0 there is a new syntax for raw strings, as stated in the changelogs, which largely allows heredoc-style documents to be created.
Additionally, from help(Quotes):
The delimiter pairs [] and {} can also be used, and R can be used in place of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote.
As an example, one can use (on a system with a bash shell):
file_raw_string <-
r"(#!/bin/bash
echo $#
for word in "$@";
do
echo "This is the word: '${word}'."
done
exit 0
)"
writeLines(file_raw_string, "print_words.sh")
system("bash print_words.sh Word/1 w#rd2 LongWord composite-word")
or even another R script:
file_raw_string <- r"(
x <- lapply(mtcars[,1:4], mean)
cat(
paste(
"Mean for column", names(x), "is", format(x,digits = 2),
collapse = "\n"
)
)
cat("\n")
cat(r"{ - This is a raw string where \n, "", '', /, \ are allowed.}")
)"
writeLines(file_raw_string, "print_means.R")
source("print_means.R")
#> Mean for column mpg is 20
#> Mean for column cyl is 6.2
#> Mean for column disp is 231
#> Mean for column hp is 147
#> - This is a raw string where \n, "", '', /, \ are allowed.
Created on 2021-08-01 by the reprex package (v2.0.0)
A way to do multi-line strings without worrying about quotes (only backticks) is to use:
as.character(quote(`
all of the crazy " ' ) characters, except
backtick and bare backslashes that aren't
printable, e.g. \n works but a \ and c with no space between them would fail`))
What about some more recent tidyverse syntax?
SQL <- c("
SELECT * FROM patient
LEFT OUTER JOIN projectpatient ON patient.patient_id = projectpatient.patient_id
WHERE projectpatient.project_id = 16;
") %>% stringr::str_replace_all("[\r\n]"," ")
