What is the best way to do the same action across multiple lines of code in the RStudio source editor?
Example 1
Let's say that I copy a list from a text file and paste it into R (like the list below). Then, I want to add quotation marks around each word and add a comma to each line, so that I can make a vector.
Krista Hicks
Miriam Cummings
Ralph Lamb
Jaylene Gilbert
Jordon Sparks
Kenna Melton
Expected Output
"Krista Hicks",
"Miriam Cummings",
"Ralph Lamb",
"Jaylene Gilbert",
"Jordon Sparks",
"Kenna Melton"
Example 2
How can I add missing parentheses on multiple lines? For example, if I have an if statement, how can I add the missing opening parenthesis after names on lines 1 and 4?
if (!is.null(names pattern))) {
vec <- FALSE
replacement <- unname(pattern)
pattern[] <- names pattern)
}
Expected Output
if (!is.null(names(pattern))) {
vec <- FALSE
replacement <- unname(pattern)
pattern[] <- names(pattern)
}
*Note: These names are just from a random name generator.
RStudio has support for multiple cursors, which allows you to write and edit multiple lines at the same time.
Example 1
You can simply hold Alt on Windows/Linux (or Option on Mac) and drag your mouse to make your selection, or you can use Alt+Shift+Click to create a rectangular selection from the current location of the cursor to the clicked position.
Example 2
Another multiple-cursor option is selecting all matching instances of a term. So, you can select names and press Ctrl+Alt+Shift+M. Then, you can use the arrow keys to move the cursors to delete the space and add in the parentheses.
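If you would rather script Example 1 than cursor through it, a minimal sketch (the raw vector below stands in for your pasted names) is:
raw <- c("Krista Hicks", "Miriam Cummings", "Ralph Lamb",
         "Jaylene Gilbert", "Jordon Sparks", "Kenna Melton")
# wrap each name in quotes and join with a comma and a newline
cat(paste0('"', raw, '"', collapse = ",\n"))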
Related
I'm stumped. My issue is that I want to grab specific names from a given column. However, when I try to filter them I get most of the names except for a few, even though I can clearly see their names in the original Excel file. I think it has to do with some sort of special characters or spacing in the name column. I am confused about how I can fix this.
I have tried applying Excel's CLEAN() function to the given column, and I have tried building an Alteryx flow to clean the data. None of these steps has helped. I am starting to wonder if this is an R issue.
surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")
surveyData %>% filter(`Completed By` == "Spencer, Amy")
In R, the first line has this red box with a white dot between the comma and the first name. I got the red box with the white dot by copying the name from the data frame into Notepad and then pasting it into R. That version actually works and returns what I want. The second case is a standard space, which doesn't return what I want. So how can I fix this issue without having to copy a name from the data frame into Notepad and then copy the result, red box and all, back into R?
The expected result is that I get the rows attached to whatever name I filter by.
I was able to find the answer. It turns out the space is actually a non-breaking space, with Unicode U+00A0, compared to the normal space, U+0020. The non-breaking space is not part of the American Standard Code for Information Interchange (ASCII), so filter() couldn't grab some names because they contained non-breaking spaces. I fixed this by substituting the non-breaking space with a normal space and applying that to my given column. Example below:
space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE) # replace the non-breaking space with a normal space in the column of interest
surveyData$`Completed By Clean` = space_fix
Once I applied this, I could easily filter any name!
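As a quick check before (or after) the substitution, a grepl sketch like this, assuming the same Completed By column, flags the rows that still contain the non-breaking space:
which(grepl("\u00A0", surveyData$`Completed By`, fixed = TRUE)) # row indices containing U+00A0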
Thanks everyone!
I have a question.
My text file contains lines such as:
1.1 Description.
This is the description.
1.1.1 Quality Assurance
Random sentence.
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1 Description
1.1.1 Quality Assurance
1.6.1 Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2 <- grep("^[0-9.]+", txt1, value = TRUE)
write(txt2, "text3.txt")
which results in:
1.1 Description.
1.1.1 Quality Assurance
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means that if your regular expression matches anything in the line, the whole line is returned. You built your regular expression to match numbers at the beginning of the line, so all the lines which begin with numbers get selected.
It seems that your goal is not to select the whole line, but only the part up to a line break or a period.
So, you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It selects numbers with dots, followed by a space, followed by anything, and it stops matching when a . appears or the line ends. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting from the given lines only the matching portion, not the whole line where the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches:
txt1 <- readLines("text.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
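Run against the sample file above, that should return something like:
[1] "1.1 Description."        "1.1.1 Quality Assurance" "1.6.1 Quality Control."
Note the regex keeps a trailing period when one is present; if you want the expected output exactly, a small follow-up sketch is:
matches <- regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
sub("\\.$", "", matches) # drop any trailing period, e.g. "1.1 Description"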
attach.files = c(paste("/users/joesmith/nosection_", currentDate,".csv",sep=""),
paste("/users/joesmith/withsection_", currentDate,".csv",sep=""))
Basically, if I did it like
c("nosection_051418.csv", "withsection_051418.csv")
and did that manually, it would work fine, but since I'm automating this to run every day I can't do that.
I'm trying to attach files in an automated email, but when I structure it like the code above, it doesn't work. How can I restructure this so that the character vector accepts it?
I thought your example implied the need for "parallel" inputs to the path stem, the first portion of the file name, and the date portions of those full paths. Consider this illustration of using a two-item vector and a one-item vector (produced by Sys.Date, replacing your currentDate) to populate the %s positions in that sprintf string (suggested by @Gregor):
sprintf("/users/joesmith/%s_%s.csv", c("nosection", "withsection"), Sys.Date() )
[1] "/users/joesmith/nosection_2018-05-14.csv" "/users/joesmith/withsection_2018-05-14.csv"
Any advice will be appreciated; this is time sensitive. I have PDF reports that are mostly blocks of text. They are long reports (~50-100 pages). I'm trying to write an R script that is capable of extracting specific sections of these PDF reports using start/stop positional strings. NOTE: Reports vary in length. Short example:
DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
02. SECTION 2
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
11. SECTION 11
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
12. SECTION 12
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
So the goal in this example is to extract the paragraph below Section 2 and store it as a field/data point. I also want to store Section 11 as a field/data point. Note the document is in PDF format.
I have tried using pdftools, tm, and stringr; I've literally spent 20+ hours searching for solutions and tutorials on how to do this. I know it is possible, as I have done it using SAS before...
Please see the code below; I added comments with questions. I believe RegEx will be part of the solution, but I'm so lost.
# Init Step
libs <- c("tm","class","stringr","testthat",
"pdftools")
lapply(libs, require, character.only= TRUE)
# File name & location
filename = "~/pdf_test/test.pdf"
# converting PDF to text
textFile <- pdf_text(filename)
cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF
# I'm at a loss as to how to parse the values I want. I have seen things like:
# sectionxyz <- str_extract_all(textFile, ???)
# rm_between()
# 1) How do I loop through each page of PDF file?
# 2) How do I identify start/stop positions for section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
# and then add the result to a data field?
# 4) Sections in PDF will be long sections of text (i.e. 100+ words into a field)
NEW------
So I have been able to:
-Prep doc correctly
-Identify the correct start/stop patterns:
length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))
pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"
The length(grep()) statements verify that only 1 instance is being found. From here I am kind of lost as to how to use gsub or something similar to extract the portion of data I want. I tried:
pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n",
"Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")
So far, little progress, but progress anyway.
Provided you can come up with a systematic way to identify the sections you need, you could, as you indicated, use Regex to extract the text you want.
In your above example, something like gsub(".*SECTION 11(.*)\n12\\..*","\\1",string) ought to work.
Now you could define patterns dynamically using paste and iterate through all files, saving each result in your data.frame, list, etc.
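A minimal sketch of that loop (the file names here are placeholders, not from your setup):
library(pdftools)
files <- c("report1.pdf", "report2.pdf") # hypothetical report paths
results <- lapply(files, function(f) {
  string <- paste(pdf_text(f), collapse = "\n") # one string per document
  gsub(".*SECTION 11(.*)\n12\\..*", "\\1", string)
})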
Here is a more detailed explanation of the pattern:
Firstly, .* is a way of matching "anything". If you want to match digits you can use \\d or, equivalently, [0-9]. Here is a short intro to Regex in R (which I found quite useful) where you can find several character classes.
.* at the edges of the pattern means that there can be text before/after
(.*) denotes the content we want (so here matching any content as .* is used). Basically it means extract "anything" between SECTION 11 and 12.
\\. means the dot and \n is the "newline" metacharacter (as before "12.", a new line is started)
In Regex you can create groupings within your pattern using the brackets, i.e. gsub(".*(\\d{2}\\:\\d{2})", "\\1","18.05.2018, 21:37") will return 21:37, or gsub("([A-z]) \\d+","\\1","hello 123") will give hello.
Now the second argument in gsub can be, and often is, used to provide a substitute, i.e. something to replace the matched pattern with. Here, however, we do not want any substitute; we want to extract something. \\1 means extract the first grouping, i.e. what is inside the first brackets (you could have multiple groupings).
Finally, string is the string from which we want to extract, i.e. the text of the PDF file.
Now if you want to perform something similar in a loop you could do the following:
# we are in the loop
# first is your starting point in the extraction, i.e. "SECTION 11"
# last is your end point, i.e. "12."
first <- "SECTION 11" # first and last can be dynamically assigned
last <- "12\\." # "\\" is added before the dot as "." is a Regex metachar
# If last doesn't systematically contain a dot
# you could use gsub to add "\\" before the dot when needed:
# gsub("\\.","\\\\.",".") returns "\\."
# so gsub("\\.","\\\\.","12.") returns "12\\."
pat <- paste0(".*",first,"(.*)","\n",last,".*") #"\n" is added to stop before the newline, but it could be omitted (then "\n" might appear in the extraction)
gsub(pat,"\\1",string) # returns the same as above
I am trying to create a function that will calculate the frequency count of keywords using the tm package. The function works fine if the text passed to readline is free-form text without a newline. The problem is that when I paste a bunch of text copied from a spreadsheet, readline stops at the first newline.
keyword <- function() {
x <- readline(as.character('Input text here: '))
x <- Corpus(VectorSource(x))
...
tdm <- TermDocumentMatrix(x)
...
tdm
}
Here's the full code: https://github.com/CSCDataAnalytics/PM-Analysis/blob/master/Keyword.R
How can I prevent this from happening, or at least treat the text from every row of the spreadsheet as a single vector?
If I'm understanding you correctly, the problem is when the user pastes the text from another application: the newline is causing R to stop accepting the subsequent lines.
One technique (fragile as it may be) is to look for a specific sentinel line, such as an empty line "" or a period ".". It's a little fragile because you need (1) assurance that the data will "never" include that as a whole line, and (2) a sentinel the user can easily append.
Try:
endofinput <- ""
totalstr <- ""
while(! endofinput == (x <- readline('prompt (empty string when done): ')))
totalstr <- paste(totalstr, x)
In this case, the empty string is the catch, and when the while loop is done, totalstr contains all input separated by a space (this can be changed in the paste function).
NB: one problem with this technique is that it "grows" the string totalstr, which will eventually cause performance penalties (depending on the size of the input data): every loop iteration, more memory is allocated and the entire string is copied, plus the new line of text. There are more verbose ways to side-step this problem (e.g., pre-allocate a vector larger than your anticipated input data), but if you aren't anticipating thousands of lines then you may be able to accept this naive approach for simplicity.
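One more verbose sketch of that idea: collect the lines in a list and combine them once at the end, so the full string is only built a single time.
lines <- list()
repeat {
  x <- readline('prompt (empty string when done): ')
  if (x == "") break # same empty-string sentinel as above
  lines[[length(lines) + 1]] <- x
}
totalstr <- paste(unlist(lines), collapse = " ")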
Another option would be to have the user save the data to a text file and use file.choose() and readLines() to get your data.
Try collapsing the data into a single string after using readline:
x <- paste(readline(as.character('Input text here: ')), collapse=' ')
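If the pasted block spans several lines, one alternative sketch (a different technique, not a fix to readline itself) is scan, which keeps reading console input until it hits a blank line:
x <- paste(scan(what = character(), sep = "\n"), collapse = " ")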