Cut string at complete word closest to certain character number - r

I'm trying to split a vector of strings into two pieces (I only want to keep the first bit) based on the following criteria:
it should split after a full word (i.e. where a space occurs)
it should cut at the space closest to the 12th character
Example:
textvec <- c("this is an example", "I hope someone can help me", "Thank you in advance")
Expected result is a vector like this:
"this is an" , "I hope someone", "Thank you in"
What I tried so far:
I'm able to get the full words that occur before or at the 12th character like this:
t13 <- substr(textvec , 1, 13) #gives me first 13 characters of each string
lastspace <- lapply(gregexpr(" ", t13), FUN=function(x) x[length(x)]) #gives me last space before/at 13th character
result <- substr(t13, start=1, stop=lastspace)
But what I want is to include the word closest to the 12th character (e.g. "someone" in the example above), not necessarily one that ends before or at the 12th character. In case of a tie, I would like to include the word after the 12th character. I hope I'm explaining myself clearly :)

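Using cumsum on the word lengths (note that this counts only the characters in the words, not the spaces between them),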
sapply(strsplit(textvec, ' '), function(i) paste(i[cumsum(nchar(i)) <= 12], collapse = ' '))
#[1] "this is an" "I hope someone" "Thank you in"

We can use gregexpr to find the space closest to position 12 and then cut the string there with substr
substr(textvec, 1, sapply(gregexpr("\\s+", textvec),
function(x) x[which.min(abs(12 - x))])-1)
#[1] "this is an" "I hope someone" "Thank you in"
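The question asks that a tie go to the later word, whereas which.min returns the first of several equally close spaces. The following is a minimal sketch of one way to handle that, not part of the answer above; the helper name cut_near and its argument n are made up for illustration, and it assumes the words are separated by single spaces.

```r
# Hypothetical helper: cut each string at the space nearest to position n,
# preferring the later space when two spaces are equally close.
cut_near <- function(x, n = 12) {
  spaces <- gregexpr(" ", x, fixed = TRUE)
  vapply(seq_along(x), function(i) {
    pos <- spaces[[i]]
    if (pos[1] == -1) return(x[i])   # no space at all: keep the whole string
    d <- abs(pos - n)                # distance of every space from position n
    at <- max(pos[d == min(d)])      # the later space wins a tie
    substr(x[i], 1, at - 1)
  }, character(1))
}

textvec <- c("this is an example", "I hope someone can help me", "Thank you in advance")
res <- cut_near(textvec)
tie <- cut_near("abcdefghi jkl mno")  # spaces at 10 and 14, both 2 away from 12
```

For the example vector this should give the same three strings as the answers above; in the tie case it keeps "abcdefghi jkl", where the which.min version would stop at "abcdefghi".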

Related

How do I collapse on a specific pattern in a text?

I have some strings of text (example below). As you can see each string was split at a period or question mark.
[1]"I am a Mr."
[2]"asking for help."
[3]"Can you help?"
[4]"Thank you ms."
[5]"or mr."
I want to collapse where the string ends with an abbreviation like mr., mrs. so the end result would be the desired output below.
[1]"I am a Mr. asking for help."
[2]"Can you help?"
[3]"Thank you ms. or mr."
I already created a vector (called abbr) containing all my abbreviations in the following format:
> abbr
[1] "Mr|Mrs|Ms|Dr|Ave|Blvd|Rd|Mt|Capt|Maj"
but I can't figure out how to use it in the paste function to collapse. I have also tried using gsub (which didn't work) to replace the \n that follows an abbreviation and its period with a space, like this:
lines<-gsub('(?<=abbr\\.\\n)(?=[A-Z])', ' ', lines, perl=FALSE)
We can use tapply to collapse the strings and grepl to create the groups to collapse.
x <- c("I am a Mr.", "asking for help.","Can you help?","Thank you ms.", "or mr.")
#Include all the abbreviations with proper cases
#Note that "." has a special meaning in regex so you need to escape it.
abbr <- 'Mr\\.|Mrs\\.|Ms\\.|Dr\\.|mr\\.|ms\\.'
unname(tapply(x, c(0, head(cumsum(!grepl(abbr, x)), -1)), paste, collapse = " "))
#[1] "I am a Mr. asking for help." "Can you help?" "Thank you ms. or mr."
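The pattern above matches an abbreviation anywhere in a line, so a line such as "Thank you for the items." would also count because it contains "ms.". As a hedged variation, assuming the abbr string from the question and that only an abbreviation at the very end of a line should trigger a merge, the pattern can be anchored and made case-insensitive:

```r
x <- c("I am a Mr.", "asking for help.", "Can you help?", "Thank you ms.", "or mr.")
abbr <- "Mr|Mrs|Ms|Dr|Ave|Blvd|Rd|Mt|Capt|Maj"  # as given in the question

# Match an abbreviation only as a whole word followed by "." at the end of the line.
pattern <- paste0("\\b(", abbr, ")\\.$")
ends_abbr <- grepl(pattern, x, ignore.case = TRUE, perl = TRUE)

# A new group starts after every line that does NOT end in an abbreviation.
grp <- c(0, head(cumsum(!ends_abbr), -1))
collapsed <- unname(tapply(x, grp, paste, collapse = " "))
```

With ignore.case = TRUE there is no need to list "mr" and "Mr" separately, and the \\b word boundary keeps "items." from being read as "ms.".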

How to extract first 2 words from a string in R?

I need to extract the first 2 words from a string. If the string contains more than 2 words, it should return the first 2 words; if it contains fewer than 2 words, it should return the string as it is.
I've tried using 'word' function from stringr package but it's not giving the desired output for cases where len(string) < 2.
word(dt$var_containing_strings, 1,2, sep=" ")
Example:
Input String: Auto Loan (Personal)
Output: Auto Loan
Input String: Others
Output: Others
If you want to use stringr::word(), you can do:
ifelse(is.na(word(x, 1, 2)), x, word(x, 1, 2))
[1] "Auto Loan" "Others"
Sample data:
x <- c("Auto Loan (Personal)", "Others")
Something like this?
a <- "this is a character string"
unlist(strsplit(a, " "))[1:2]
[1] "this" "is"
EDIT:
To add the part where the original string is returned if the number of words is less than 2, a simple if-else can be used:
a <- "this is a character string"
words <- unlist(strsplit(a, " "))
if (length(words) > 2) {
paste(words[1:2], collapse = " ")
} else {
a
}
You could use regex in base R using sub
sub("(\\w+\\s+\\w+).*", "\\1", "Auto Loan (Personal)")
#[1] "Auto Loan"
which will also work if you have only one word in the text, because sub returns its input unchanged when the pattern does not match
sub("(\\w+\\s+\\w+).*", "\\1", "Auto")
#[1] "Auto"
Explanation :
Here we extract the pattern shown inside round brackets which is (\\w+\\s+\\w+) which means :
\\w+ One word followed by \\s+ whitespace followed by \\w+ another word, so in total we extract two words. Extraction is done using backreference \\1 in sub.
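For a whole column of strings, the split-and-paste idea can be wrapped in a small vectorized function. This is only a sketch: first_n_words is a made-up name, words are assumed to be separated by whitespace, and NA values are not handled specially.

```r
# Hypothetical helper: keep at most the first n whitespace-separated words.
first_n_words <- function(x, n = 2) {
  parts <- strsplit(trimws(x), "\\s+")  # one character vector of words per string
  vapply(parts, function(w) paste(head(w, n), collapse = " "), character(1))
}

x <- c("Auto Loan (Personal)", "Others")
out <- first_n_words(x)
```

Because head() returns whatever is available when there are fewer than n words, a single-word string such as "Others" comes back unchanged without an explicit if-else.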

How to develop a function that accepts a vectors of character which corresponds to the column component of a dataframe?

This is my current dataset called details.
> details$names<- c("James Johnson","Michael Jones","Robert Miller","Christopher Smith","Richard Nolan","Constantine Wilson","Mountabatteen Keizman")
I want to extract the part of names considering these 2 aspects:
1) Starting from the left, extract all characters until a space or a hyphen (or minus sign) is reached.
2) Extract no more than ten characters.
I tried to do this by using this code:
> abrevStrings<- function(details$names)
{
gsub("([a-z])([A-Z])","([a-z])([A-Z])<= 10",details$names)
}
But I didn't get the output I wanted.
My desired output can be seen below:
James
Michael
Robert
Christophe
Richard
Constantin
Mountabatt
One way would be to use sub and substr: remove everything after the first whitespace or hyphen, then keep only the first 10 characters.
abrevStrings <- function(x) {
substr(sub("\\s+.*|-.*", "", x), 1, 10)
}
abrevStrings(details$names)
#[1] "James" "Michael" "Robert" "Christophe" "Richard"
# "Constantin" "Mountabatt"
Or another option is to split the strings on whitespace or hyphen and take the substring of the first part of the string.
sapply(strsplit(details$names, "\\s+|-"), function(x) substr(x[1], 1, 10))
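Both rules can also be expressed in a single pattern: take up to ten leading characters that are neither whitespace nor a hyphen. The sketch below uses regexpr with regmatches and assumes every string starts with at least one such character; regmatches silently drops elements that do not match, so the result could otherwise be shorter than the input.

```r
nms <- c("James Johnson", "Michael Jones", "Robert Miller", "Christopher Smith",
         "Richard Nolan", "Constantine Wilson", "Mountabatteen Keizman")

# Up to 10 characters from the start that are not whitespace or "-".
abbrev <- regmatches(nms, regexpr("^[^[:space:]-]{1,10}", nms))
```

The vector nms is just a stand-in for details$names so that the snippet runs on its own.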
data
details <- data.frame(names = c("James Johnson","Michael Jones","Robert Miller",
"Christopher Smith","Richard Nolan","Constantine Wilson",
"Mountabatteen Keizman"), stringsAsFactors = FALSE)

center a string by padding spaces up to a specified length

I have a vector of names, like this:
x <- c("Marco", "John", "Jonathan")
I need to format it so that the names get centered in 10-character strings, by adding leading and trailing spaces:
> output
# [1] "  Marco   " "   John   " " Jonathan "
I was hoping a solution less complicated than to go with paste, rep, and counting nchar? (maybe with sprintf but I don't know how).
Here's a sprintf() solution that uses a simple helper vector f to determine the low side widths. We can then insert the widths into our format using the * character, taking the ceiling() on the right side to account for an odd number of characters in a name. Since our max character width is at 10, each name that exceeds 10 characters will remain unchanged because we adjust those widths with pmax().
f <- pmax((10 - nchar(x)) / 2, 0)
sprintf("%-*s%s%*s", f, "", x, ceiling(f), "")
# [1] "  Marco   " "   John   " " Jonathan " "Christopher"
Data:
x <- c("Marco", "John", "Jonathan", "Christopher")
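For comparison, the arithmetic that the sprintf() call performs implicitly can be spelled out with strrep(). This is a sketch under assumptions rather than a replacement for the answer: center_pad is a made-up name, strrep() requires R 3.3.0 or later, and the extra space for an odd amount of padding is placed on the right to match the output above.

```r
# Hypothetical helper: center each string in a field of `width` characters,
# leaving strings that are already longer than `width` unchanged.
center_pad <- function(x, width = 10) {
  pad   <- pmax(width - nchar(x), 0)  # total number of spaces needed
  left  <- pad %/% 2                  # smaller half goes on the left
  right <- pad - left                 # remainder goes on the right
  paste0(strrep(" ", left), x, strrep(" ", right))
}

x <- c("Marco", "John", "Jonathan", "Christopher")
centered <- center_pad(x)
```

Every result that started shorter than 10 characters should end up exactly 10 characters wide, which is easy to confirm with nchar(centered).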
Finally, although it's a different language, it's worth noting that Python (unlike R) has a built-in string method for exactly this, str.center():
example = "John"
example.center(10)
# '   John   '
With a width of 10, a string that needs an odd number of padding characters gets the extra one on the right, and you can supply a fill character of your choice. It is not vectorized, though.

gsub and returning the correct number in a string

I have a text string in a data frame like the following
2 Sector. District 1, Area 1
My goal is to extract the number before Sector or else return blank.
I thought the following regex would work:
gsub("^(?:([0-9]+).*Sector.*|.*)$","\\1",TEXTSTRINGCOLUMN)
This correctly returns nothing when the word Sector is not present, but returns 1 rather than 2. Greatly appreciate help on where I am going wrong. Thanks!
We can use a regex lookahead for "Sector", capture the numbers as a group and in the replacement specify the capture group (\\1).
sub('.*?(\\d+)\\s*(?=Sector).*', '\\1', v1, perl=TRUE)
#[1] "2"
EDIT: Modified based on @Avinash Raj's comment.
Without using the lookarounds (credit to @Avinash Raj),
sub('.*?(\\d+)\\s*Sector.*', '\\1', v1)
data
v1 <- "2 Sector. District 1, Area 1"
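Note that sub() returns its input unchanged when the pattern does not match, so on their own the calls above do not produce a blank for strings without "Sector". A minimal sketch of one way to cover that case, assuming an empty string is the desired "blank" (extract_sector_number is a made-up name):

```r
# Hypothetical helper: the number directly before "Sector", or "" if there is none.
extract_sector_number <- function(x) {
  pattern <- "^.*?(\\d+)\\s*Sector.*$"
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         "")
}

v <- c("2 Sector. District 1, Area 1", "District 1, Area 1", "Block 12 Sector 4")
nums <- extract_sector_number(v)
```

The result is a character vector; wrap it in as.integer() if numbers are needed, in which case the blanks become NA.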
Try,
x <- "2 Sector. District 1, Area 1"
substring(x, 0, as.integer(grepl("Sector", x)))
#[1] "2"
