I am reading a couple of Excel files and merging them into one data frame. Some of the address fields contain line breaks. I came up with this to remove them, but it does not work, and RStudio says that there are invalid tokens in the line.
df$Primary.Street <- gsub("\r\n", " ", df$Primary.Street)
Any help would be much appreciated.
Sample of input row of how it looks in Excel:
"123 Main St
"Sam Jones" Apt A" "New York" "NY" "12345"
Desired output to csv:
"Sam Jones","123 Main St Apt A","New York","NY","12345"
Put your carriage return characters in square brackets to create a character class, which will match any character in the class:
> samp <- "120 Main st\nApt A"
> gsub("[\r\n]+", " ", samp)
[1] "120 Main st Apt A"
Your example without the brackets would only match a \r and \n in sequence. My example here will match any sequence of one or more of either (via the + quantifier).
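To make the difference concrete, here is a small sketch comparing the two patterns. The two sample strings are made up for illustration: one uses a Windows-style \r\n break and the other a bare \n, which is what Excel cells often contain.

```r
# Hypothetical sample values: one CRLF line break, one bare LF line break
crlf    <- "123 Main St\r\nApt A"
lf_only <- "123 Main St\nApt A"

# The literal sequence "\r\n" only matches a CR immediately followed by an LF
literal_pair <- gsub("\r\n", " ", c(crlf, lf_only))

# The character class matches any run of CR and/or LF characters
char_class <- gsub("[\r\n]+", " ", c(crlf, lf_only))

print(literal_pair)  # the LF-only string is left unchanged
print(char_class)    # both strings are cleaned
```

If the class-based pattern still leaves breaks behind, the cells may contain some other separator character, which would need to be added to the class.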
I want to extract a group of strings between two punctuations using RStudio.
I tried to use the str_extract command, but whenever I tried to use anchors (^ for the start of the string and $ for the end), it failed.
Here is the sample problem:
> text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"
Here is the sample code I used:
> str_extract(text,"(Name : )(.+)?( ;)")
> str_match(str_extract(text,"(Name : )(.+)?( ;)"),"(Name : )(.+)?( ;)")[3]
But it seemed too verbose, and not flexible.
I only want to extract "Dr. CHARLES DOWNING MAP".
Can anyone help with my problem?
Can I tell the regex to start with any non-whitespace character after "Name : " and end before " ; POB"?
This seems to work, although the captured text keeps the leading space that follows "Name :".
> gsub(".*Name :(.*) ;.*", "\\1", text)
[1] " Dr. CHARLES DOWNING MAP"
With str_match
stringr::str_match(text, "^Name : (.*) ;")[, 2]
#[1] "Dr. CHARLES DOWNING MAP"
[, 2] is to get the contents from the capture group.
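The same capture-group idea can be sketched in base R with regexec and regmatches, in case loading stringr is not an option. The pattern below is my own variation, not the one from the answer above: it uses [^;]* so the capture cannot run past the first semicolon.

```r
text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"

# regexec() records the full match plus each capture group;
# regmatches() then returns them as a character vector per input string
m <- regmatches(text, regexec("Name : ([^;]*) ;", text))

# Element 1 is the full match, element 2 is the first capture group
name_field <- m[[1]][2]
print(name_field)
```

Because the literal "Name : " in the pattern already consumes the space after the colon, the captured value has no leading space.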
There is also qdapRegex::ex_between to extract string between left and right markers
qdapRegex::ex_between(text, "Name : ", ";")[[1]]
#[1] "Dr. CHARLES DOWNING MAP"
This is my current dataset:
c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
I want to insert a space between the words in each airline name.
For this I tried this code:
airlines$airline <- gsub("([[:lower:]]) ([[:upper:]])", "\\1 \\2", airlines$airline)
But I got the text in the same format as before.
My desired output is as below:
txt <- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
You need two different sorts of rules: one for the spaces before the case changes and the other for recurring words ("designated", "services") or symbols ("-"). You could start with a pattern that identifies a lowercase character followed by an uppercase character (identified with a character class like "[A-Z]") and then insert a space between those two characters using two capture groups (created with flanking parentheses around a section of a pattern). See the ?regex Details section for a quick description of character classes and capture groups:
gsub("([a-z])([A-Z])", "\\1 \\2", txt)
You then use that result as an argument that adds a space before any of the recurring words in your text that you want also separated:
gsub("(-|all|designated|services)", " \\1", # second pattern and sub for "specials"
gsub("([a-z])([A-Z])", "\\1 \\2", txt)) #first pattern and sub for case changes
[1] "Jetstar"
[2] "Qantas"
[3] "Qantas Link"
[4] "Regional Express"
[5] "Tigerair Australia"
[6] "Virgin Australia"
[7] "Virgin Australia Regional Airlines"
[8] "All Airlines"
[9] "Qantas - all QF designated services"
[10] "Virgin Australia - all VA designated services"
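For anyone following along, here is the nested call unrolled into two steps on a shortened vector, next to the pattern from the question. The variable names are only for illustration. The question's pattern has a literal space between the two groups, so it only matches where a space already exists, which is why nothing changed.

```r
txt <- c("QantasLink", "Qantas-allQFdesignatedservices")

# Pattern from the question: the space between the groups must be present in the input
asker_attempt <- gsub("([[:lower:]]) ([[:upper:]])", "\\1 \\2", txt)

# Step 1: split at each lowercase-to-uppercase change
step1 <- gsub("([a-z])([A-Z])", "\\1 \\2", txt)

# Step 2: put a space in front of the recurring words and the hyphen
step2 <- gsub("(-|all|designated|services)", " \\1", step1)

print(asker_attempt)
print(step1)
print(step2)
```

Keeping the steps separate makes it easier to inspect the intermediate result when a new "special" word has to be added to the second pattern.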
I see that someone upvoted my earlier answer to Splitting CamelCase in R which was similar, but this one had a few more wrinkles to iron out.
This could (almost) do the trick
gsub("([A-Z])", " \\1", airlines)
Borrowed from: splitting-camelcase-in-r
Of course, names like Qantas-allQFd… will still pose a problem because of the two consecutive uppercase letters ("QF") in the second part of the string.
I have tried to figure it out and I have come up with something:
library(stringr)
data_vec<- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
str_trim(gsub("(?<=[A-Z]{2})([a-z]{1})", " \\1",gsub("([A-Z]{1,2})", " \\1", data_vec)))
I hope this helps.
I'm doing a text mining task in R.
Tasks:
1) count sentences
2) identify and save quotes in a vector
Problems :
False full stops like "..." and periods in titles like "Mr." have to be dealt with.
There's definitely quotes in the text body data, and there'll be "..." in them. I was thinking to extract those quotes from the main body and save them in a vector. (there's some manipulation to be done with them too.)
IMPORTANT TO NOTE: My text data is in a Word document. I use readtext("path to .docx file") to load it into R. When I view the text, quotes appear as plain " rather than \", unlike in the reproducible text below.
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text
reproducible text
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is it's splitting by false full-stops
Solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it's not replacing [Miss. with [Miss
To identify quotes :
stri_extract_all_regex(text, '"\\S+"')
but that's not working either. (It does work on a string containing \", as in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
The exact expected vector is :
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I wanted the sentences separated (so I can count how many sentences in each paragraph).
And quotes also separated.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
You may match all your current vec values using
gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
That is, \w+ matches 1 or more word chars and \. matches a dot.
Next, if you just want to extract quotes, use
regmatches(text, gregexpr('"[^"]*"', text))
The " matches a " and [^"]* matches 0 or more chars other than ".
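As a quick illustration of why the \S+ version from the question misses most quotes, here is a sketch on a made-up string with one multi-word quote and one single-word quote. I use perl = TRUE for the \S shorthand; that is my choice for the sketch, not something from the question.

```r
# Hypothetical sample with a multi-word quote and a single-word quote
text <- "She asked \"Mom how are you o.k. with this? I'll never get it...\" and he said \"fine\"."

# \S+ cannot cross whitespace, so only quotes without spaces are found
single_word_only <- regmatches(text, gregexpr('"\\S+"', text, perl = TRUE))[[1]]

# [^"]* accepts anything except a double quote, including spaces and dots
all_quotes <- regmatches(text, gregexpr('"[^"]*"', text))[[1]]

print(single_word_only)
print(all_quotes)
```

Note that this only covers straight double quotes. If the Word document contains typographic (curly) quotes, the pattern would need those characters as well; I have not tested that case.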
If you plan to match your sentences together with quotes, you might consider
regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
Details
\\s* - 0+ whitespaces
"[^"]*" - a ", 0+ chars other than " and a "
| - or
[^"?!.]+ - 1 or more chars other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . chars
[^"[:alnum:]]* - 0+ chars other than alphanumeric chars and "
R sample code:
> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "
[2] "Keyboard Jr and Miss Keyboard. ... \n"
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
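Since the original goal was to count sentences and keep the quotes in their own vector, the pieces above can be combined roughly as follows. This is only a sketch: I swapped gsubfn for a plain base R gsub with an alternation of the titles so that it runs without extra packages, and the object names (clean, pieces, n_sentences, quotes) are mine.

```r
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...\nHowever, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]\n\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "

# Drop the dot after the known titles (base R stand-in for the gsubfn call)
clean <- trimws(gsub("\\b(Mr|Mrs|Ms|Miss|Dr|Jr)\\.", "\\1", text, perl = TRUE))

# Same sentence/quote pattern as in the answer above
pat <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'
pieces <- regmatches(clean, gregexpr(pat, clean))[[1]]

# Number of pieces, counting a whole quotation as one piece
n_sentences <- length(pieces)

# Quotes are the pieces that start with a double quote
quotes <- pieces[grepl('^\\s*"', pieces)]

print(n_sentences)
print(quotes)
```

A quotation counts as a single piece here even if it contains several sentences, so the count would need adjusting if sentences inside quotes should be counted separately.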
I am trying to substitute multiple patterns within a character vector with their corresponding replacement strings. After doing some research I found the package gsubfn which I think is able to do what I want it to, however when I run the code below I don't get my expected output (see end of question for results versus what I expected to see).
library(gsubfn)
# Our test data that we want to search through (while ignoring case)
test.data<- c("1700 Happy Pl","155 Sad BLVD","82 Lolly ln", "4132 Avent aVe")
# A list data frame which contains the patterns we want to search for
# (again ignoring case) and the associated replacement strings we want to
# exchange any matches we come across with.
frame<- data.frame(pattern= c(" Pl"," blvd"," LN"," ave"), replace= c(" Place", " Boulevard", " Lane", " Avenue"),stringsAsFactors = F)
# NOTE: I added spaces in front of each of our replacement terms to make
# sure we only grab matches that are their own word (for instance if an
# address was 45 Splash Way we would not want to replace "pl" inside of
# "Splash" with "Place")
# The following set of paste lines are supposed to eliminate the substitute function from
# grabbing instances like first instance of " Ave" found directly after "4132"
# inside "4132 Avent Ave" which we don't want converted to " Avenue".
pat <- paste(paste(frame$pattern,collapse = "($|[^a-zA-Z])|"),"($|[^a-zA-Z])", sep = "")
# Here is the gsubfn function I am calling
gsubfn(x = test.data, pattern = pat, replacement = setNames(as.list(frame$replace),frame$pattern), ignore.case = T)
Output being received:
[1] "1700 Happy" "155 Sad" "82 Lolly" "4132 Avent"
Output expected:
[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
My working theory on why this isn't working is that the matches don't match the names associated with the list I am passing into the gsubfn's replacement argument because of some case discrepancies (eg: the match being found on "155 Sad BLVD" doesn't == " blvd" even though it was able to be seen as a match due to the ignore.case argument). Can someone confirm that this is the issue/point me to what else might be going wrong, and perhaps a way of fixing this that doesn't require me expanding my pattern vector to include all case permutations if possible?
Seems like stringr has a simple solution for you:
library(stringr)
str_replace_all(test.data,
regex(paste0('\\b',frame$pattern,'$'),ignore_case = T),
frame$replace)
#[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
Note that I had to alter the regex to look for only words at the end of the string because of the tricky 'Avent aVe'. But of course there's other ways to handle that too.
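If you would rather stay in base R, a rough equivalent is to loop over the rows of the lookup table and call gsub once per pattern with ignore.case = TRUE. This is only a sketch under the same assumption as above, namely that the street type is always the last word of the address, so each pattern is anchored with $.

```r
test.data <- c("1700 Happy Pl", "155 Sad BLVD", "82 Lolly ln", "4132 Avent aVe")
frame <- data.frame(pattern = c(" Pl", " blvd", " LN", " ave"),
                    replace = c(" Place", " Boulevard", " Lane", " Avenue"),
                    stringsAsFactors = FALSE)

out <- test.data
for (i in seq_len(nrow(frame))) {
  # Anchor each pattern at the end of the string and ignore case,
  # so " aVe" matches " ave" but the "Ave" inside "Avent" is left alone
  out <- gsub(paste0(frame$pattern[i], "$"), frame$replace[i], out,
              ignore.case = TRUE)
}
print(out)
```

Because each replacement string is passed to gsub directly, nothing has to be looked up by the matched text, so the case of the match no longer matters.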
I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,
It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
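A quick way to try this pattern in base R is sketched below, using regexpr and regmatches on the vector from the question. Note that this pattern only looks at the start of each string, so it also returns entries that begin with a capital letter but are not followed by ", " and a digit.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# regexpr() finds the first match per element;
# regmatches() drops the elements that have no match at all
starts_upper <- regmatches(text, regexpr("^[[:upper:]][^,]+", text))
print(starts_upper)
```

The element starting with lowercase "rule" is dropped, while "THIS IS A DOCUMENT" is kept, so an extra condition on what follows the comma is still needed if that entry should be excluded.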
I have solved the problem thanks to @Dmitry Egorov's demo posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!
You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"