conditional concatenation in R - r

I have a vector like this:
> myarray
[1] "AA\tThis is "
[2] "\tthe "
[3] "\tbeginning."
[4] "BB\tA string of "
[5] "\tcharacters."
[6] "CC\tA short line."
[7] "DD\tThe "
[8] "\tend."
I am trying to write a function that processes the above to generate this:
> myoutput
[1] "AA\tThis is the beginning."
[2] "BB\tA string of characters."
[3] "CC\tA short line."
[4] "DD\tThe end."
This is doable by looping through the rows and using an if statement to concatenate the current row with the last one if it starts with a \t. I was wondering if there is a more efficient way of achieving the same result.

# Create your example data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ", "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
# Find where each "sentence" starts by detecting
# that the first character isn't \t
starts <- grepl("^[^\t]", myarray)
# Create a grouping variable
id <- cumsum(starts)
# Remove the leading \t, as that seems to be what your example output wants
tmp <- sub("^\t", "", myarray)
# Split into groups and paste each group together
sapply(split(tmp, id), paste, collapse = "")
And running it we get
> sapply(split(tmp, id), paste, collapse = "")
                           1                            2
"AA\tThis is the beginning." "BB\tA string of characters."
                           3                            4
         "CC\tA short line."               "DD\tThe end."
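The same steps can be wrapped in a small reusable function. This is only a sketch: the name join_continuations is made up for illustration, and it assumes that every continuation line starts with a literal tab. It uses vapply instead of sapply so the result is always a plain character vector, and unname to drop the group numbers.

```r
# Hypothetical helper (the name is illustrative, not from the question):
# merge tab-prefixed continuation lines into the line that precedes them.
join_continuations <- function(x) {
  # A new record starts wherever the line does not begin with a tab
  starts <- !startsWith(x, "\t")
  # The running count of record starts gives each line its group number
  id <- cumsum(starts)
  # Drop the leading tab from continuation lines
  pieces <- sub("^\t", "", x)
  # Glue each group into one string; vapply guarantees a character result
  unname(vapply(split(pieces, id), paste, character(1), collapse = ""))
}

myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
             "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
result <- join_continuations(myarray)
print(result)
```

Note that, unlike the grepl version, startsWith treats an empty string as the start of a new record; adjust the test if your data can contain empty lines.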

Another option is to paste everything into one string, prefix each two-letter tag (AA, BB, etc.) with a marker such as ##, and then split on that marker with strsplit:
#Data
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
"\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")
strsplit(gsub("([A-Z]{2})","##\\1",
paste(sub("^\t","", myarray), collapse = "")),"##")[[1]][-1]
# [1] "AA\tThis is the beginning."
# [2] "BB\tA string of characters."
# [3] "CC\tA short line."
# [4] "DD\tThe end."
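One caveat with the marker approach is that "[A-Z]{2}" also matches any two consecutive capitals inside the text itself. A possible variant, sketched here under the assumptions that the marker ## never occurs in the data and that the first line starts a record, adds the marker only to lines that do not begin with a tab, before pasting:

```r
myarray <- c("AA\tThis is ", "\tthe ", "\tbeginning.", "BB\tA string of ",
             "\tcharacters.", "CC\tA short line.", "DD\tThe ", "\tend.")

marker <- "##"  # assumption: this marker never appears in the data

# Continuation lines begin with a tab; every other line starts a new record
is_continuation <- startsWith(myarray, "\t")

# Strip the tab from continuation lines, prefix record starts with the marker
marked <- ifelse(is_continuation,
                 sub("^\t", "", myarray),
                 paste0(marker, myarray))

# Paste into one string, split on the literal marker, and drop the empty
# first piece that precedes the first marker
joined <- strsplit(paste(marked, collapse = ""), marker, fixed = TRUE)[[1]][-1]
print(joined)
```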

Related

How to extract several substrings with a for loop in R

I have the following 100 strings:
[3] "Department_Complementary_Demand_Converted_Sum"
[4] "Department_Home_Demand_Converted_Sum"
[5] "Department_Store A_Demand_Converted_Sum"
[6] "Department_Store B_Demand_Converted_Sum"
...
[100] "Department_Unisex_Demand_Converted_Sum"
Obviously I could call substr() on every string with different start and end indices. But as you can see, all the strings start with Department_ and end with _Demand_Converted_Sum, and I only want to extract what's in between. If there were a way to always skip the first 11 characters from the left and the last 21 characters from the right, I could just run a for loop over all 100 strings.
Example
Given input: Department_Unisex_Demand_Converted_Sum
Expected output: Unisex
Looks like a classic case for lookarounds:
library(stringr)
str_extract(str, "(?<=Department_)[^_]+(?=_)")
[1] "Complementary" "Home" "Store A"
Data:
str <- c("Department_Complementary_Demand_Converted_Sum",
"Department_Home_Demand_Converted_Sum",
"Department_Store A_Demand_Converted_Sum")
Using strsplit() on the same str vector,
sapply(strsplit(str, '_'), '[', 2)
# [1] "Complementary" "Home" "Store A"
or stringi::stri_sub_all().
unlist(stringi::stri_sub_all(str, 12, -22))
# [1] "Complementary" "Home" "Store A"
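Since the prefix and suffix are fixed, base R alone is enough and no loop is needed, because both sub() and substr() are vectorised. The sketch below assumes every string really does start with Department_ (11 characters) and end with _Demand_Converted_Sum (21 characters); the variable names are illustrative.

```r
str <- c("Department_Complementary_Demand_Converted_Sum",
         "Department_Home_Demand_Converted_Sum",
         "Department_Store A_Demand_Converted_Sum")

# Option 1: capture whatever sits between the fixed prefix and suffix
middle_regex <- sub("^Department_(.*)_Demand_Converted_Sum$", "\\1", str)

# Option 2: pure position arithmetic, skipping 11 characters on the left
# and 21 characters on the right
middle_substr <- substr(str, 12, nchar(str) - 21)

print(middle_regex)
print(middle_substr)
```

The regex version leaves a string unchanged if it does not match the pattern, which makes malformed inputs easy to spot; the substr version silently trims whatever is there.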

Locate position of first number in string [R]

How can I create a function in R that locates the word position of the first number in a string?
For example:
string1 <- "Hello I'd like to extract where the first 1010 is in this string"
#desired_output for string1
9
string2 <- "80111 is in this string"
#desired_output for string2
1
string3 <- "extract where the first 97865 is in this string"
#desired_output for string3
5
I would just use grep and strsplit here for a base R option:
sapply(input, function(x) grep("\\d+", strsplit(x, " ")[[1]]))
Hello I'd like to extract where the first 1010 is in this string
9
80111 is in this string
1
extract where the first 97865 is in this string
5
Data:
input <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
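Note that grep returns every matching position, so a string containing several numbers yields more than one value, and a string with no number yields none. A possible variant, given as a sketch with the made-up name first_number_word, keeps only the first match and returns NA when there is none:

```r
# Hypothetical helper: word position of the first word containing a digit.
first_number_word <- function(x) {
  # Split each string into words on single literal spaces
  words <- strsplit(x, " ", fixed = TRUE)
  # grep() gives all matching positions; [1] keeps the first one and
  # evaluates to NA when there is no match at all
  vapply(words, function(w) grep("[0-9]", w)[1], integer(1))
}

input <- c("Hello I'd like to extract where the first 1010 is in this string",
           "80111 is in this string",
           "extract where the first 97865 is in this string")
positions <- first_number_word(input)
print(positions)

# Edge cases: two numbers in one string, and no number at all
edge <- first_number_word(c("a 1 b 2", "no digits here"))
print(edge)
```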
Here is a way to return your desired output:
library(stringr)
min(which(!is.na(suppressWarnings(as.numeric(str_split(string, " ", simplify = TRUE))))))
This is how it works:
str_split(string, " ", simplify = TRUE) # converts your string to a vector/matrix, splitting at space
as.numeric(...) # tries to convert each element to a number, returning NA when it fails
suppressWarnings(...) # suppresses the warnings generated by as.numeric
!is.na(...) # returns true for the values that are not NA (i.e. the numbers)
which(...) # returns the position of each TRUE value
min(...) # returns the first position
The output:
min(which(!is.na(suppressWarnings(as.numeric(str_split(string1, " ", simplify = TRUE))))))
[1] 9
min(which(!is.na(suppressWarnings(as.numeric(str_split(string2, " ", simplify = TRUE))))))
[1] 1
min(which(!is.na(suppressWarnings(as.numeric(str_split(string3, " ", simplify = TRUE))))))
[1] 5
Here I'll leave a fully tidyverse approach:
library(purrr)
library(stringr)
map_dbl(str_split(strings, " "), str_which, "\\d+")
#> [1] 9 1 5
map_dbl(str_split(strings[1], " "), str_which, "\\d+")
#> [1] 9
Note that it works both with one and multiple strings.
Where strings is:
strings <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
Here is another approach. We can trim off the remaining characters after the first digit of the first number. Then, just find the position of the last word. \\b matches word boundaries while \\S+ matches one or more non-whitespace characters.
first_numeric_word <- function(x) {
x <- substr(x, 1L, regexpr("\\b\\d+\\b", x))
lengths(gregexpr("\\b\\S+\\b", x))
}
Output
> first_numeric_word(x)
[1] 9 1 5
Data
x <- c(
"Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string"
)
Here is a base solution using rapply() with grep() to recurse through the results of strsplit(). It works with a vector of strings.
Note: swap " " and fixed = TRUE with "\\s+" and fixed = FALSE (the default) if you want to split the strings on any whitespace instead of a literal space.
rapply(strsplit(strings, " ", fixed = TRUE), function(x) grep("[0-9]+", x))
[1] 9 1 5
Data:
strings = c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string", "extract where the first 97865 is in this string")
Try the following:
library(stringr)
position_first_number <- function(string) {
min(which(str_detect(str_split(string, "\\s+", simplify = TRUE), "[0-9]+")))
}
With your example strings:
> string1 <- "Hello I'd like to extract where the first 1010 is in this string"
> position_first_number(string1)
[1] 9
> string2 <- "80111 is in this string"
> position_first_number(string2)
[1] 1
> string3 <- "extract where the first 97865 is in this string"
> position_first_number(string3)
[1] 5

To capture message from a character after specific texts

I have the following character vector in R. Is there a way to extract only the text that comes after [SQ]?
Input
df # df is a character
[1] "[Mi][OD][SQ]Nice message1."
[2] "[Mi][OD][SQ]Nice message2."
[3] "[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM "
Expected output
df
[1] Nice message1. Nice message2
In case there are more [SQ] like below
df # df is a character
[1] "[Mi][OD][SQ]Nice message1."
[2] "[Mi][OD][SQ]Nice message2."
[3] "[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM "
[4] "[Mi][OD][SQ]Nice message3."
Expected output
df
[1] Nice message1. Nice message2. Nice message3
One option is to use str_extract to pull out the substring, then wrap it in na.omit to drop the NA elements that result when a string has no match. The regex uses a lookbehind so that only the characters following [SQ] are extracted:
library(stringr)
as.vector(na.omit( str_extract(df, "(?<=\\[SQ\\]).*")))
#[1] "Nice message1." "Nice message2." "Nice message3."
If it needs to be a single string, use str_c to collapse the strings. Each message already ends with a period, so a single space is enough as the separator:
str_c(na.omit( str_extract(df, "(?<=\\[SQ\\]).*")), collapse = ' ')
#[1] "Nice message1. Nice message2. Nice message3."
data
df <- c("[Mi][OD][SQ]Nice message1.", "[Mi][OD][SQ]Nice message2.",
"[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM ", "[Mi][OD][SQ]Nice message3."
)
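If you would rather avoid the stringr dependency, roughly the same result can be obtained in base R. This is a sketch, not a drop-in replacement: it first keeps only the elements that contain the literal tag, then removes everything up to and including the first [SQ]. The variable names are illustrative.

```r
df <- c("[Mi][OD][SQ]Nice message1.", "[Mi][OD][SQ]Nice message2.",
        "[RO] ERROR: Could not SQLExecDirect 'SELECT * FROM ",
        "[Mi][OD][SQ]Nice message3.")

# Keep only the elements that contain the literal text "[SQ]"
has_tag <- grepl("[SQ]", df, fixed = TRUE)

# Remove everything up to and including the first "[SQ]"
# (".*?" is a lazy match, so perl = TRUE is used here)
messages <- sub("^.*?\\[SQ\\]", "", df[has_tag], perl = TRUE)
print(messages)

# Collapse into a single string if needed
one_string <- paste(messages, collapse = " ")
print(one_string)
```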

strsplit does not split for all elements of character vector provided to parameter "split"

The R documentation for the strsplit function states for parameter split that "If split has length greater than 1, it is re-cycled along x."
I take it to mean that if I use the following code
strsplit(x = "Whatever will be will be", split = c("ever", "be"))
..., I will get x split into "What" and "will" and "will be". This does not happen. The output is "What" and "will be will be".
Am I misinterpreting the documentation? Also, how can I get the result I desire?
The elements of split are recycled only if x also has multiple elements:
strsplit(x = c("Whatever will be will be","Whatever will be will be"),
split = c("ever", "be"))
[[1]]
[1] "What" " will be will be"
[[2]]
[1] "Whatever will " " will "
The behaviour I suspect you expect is achieved with a |:
strsplit(x = "Whatever will be will be", split = c("ever|be"))
[[1]]
[1] "What" " will " " will "
The split is recycled across elements of x, so that the first element of split is applied to the first element of x, the second to the second, etc. So, for example:
strsplit(x = c("Whatever will be will be", "Whatever will be will be"), split = c("ever", "be"))
[[1]]
[1] "What" " will be will be"
[[2]]
[1] "Whatever will " " will "
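To make the difference concrete, here is a small sketch that compares the three behaviours on a single string: passing the vector directly, applying each pattern separately, and combining the patterns into one alternation. Building the alternation with paste assumes the patterns contain no regex metacharacters; the variable names are illustrative.

```r
x <- "Whatever will be will be"
patterns <- c("ever", "be")

# split is matched element-wise against x, so with a single string
# only the first pattern ("ever") is ever used
only_first <- strsplit(x, patterns)[[1]]

# Apply each pattern on its own to the same string
each <- lapply(patterns, function(p) strsplit(x, p, fixed = TRUE)[[1]])

# Combine the patterns into one regex alternation: "ever|be"
combined <- strsplit(x, paste(patterns, collapse = "|"))[[1]]

print(only_first)
print(each)
print(combined)

# Optionally strip the leftover spaces around each piece
print(trimws(combined))
```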

Using R how to separate a string based on characters

I have a set of strings and I need to search for words that have a period in the middle. Some of the strings are concatenated, so I need to break them apart into words so that I can then filter for words with dots.
Below is a sample of what I have and what I get so far
punctToRemove <- c("[^[:alnum:][:space:]._]")
s <- c("get_degree('TITLE',PERS.ID)",
"CLIENT_NEED.TYPE_CODe=21",
"2.1.1Report Field Level Definition",
"The user defined field. The user will validate")
This is what I currently get
gsub(punctToRemove, " ", s)
[1] "get_degree TITLE PERS.ID "
[2] "CLIENT_NEED.TYPE_CODe 21"
[3] "2.1.1Report Field Level Definition"
[4] "The user defined field. The user will validate"
Sample of what I want is below
[1] "get_degree ( ' TITLE ' , PERS.ID ) " # spaces before and after the "(", "'", ",",and ")"
[2] "CLIENT_NEED.TYPE_CODe = 21" # spaces before and after the "=" sign. Dot and underscore remain untouched.
[3] "2.1.1Report Field Level Definition" # no changes
[4] "The user defined field. The user will validate" # no changes
We can use regex lookarounds
s1 <- gsub("(?<=['=(),])|(?=['(),=])", " ", s, perl = TRUE)
s1
#[1] "get_degree ( ' TITLE ' , PERS.ID ) "
#[2] "CLIENT_NEED.TYPE_CODe = 21"
#[3] "2.1.1Report Field Level Definition"
#[4] "The user defined field. The user will validate"
nchar(s1)
#[1] 35 26 34 46
which is equal to the number of characters shown in the OP's expected output.
For this example:
library(stringr)
s <- str_replace_all(s, "\\)", " \\) ")
s <- str_replace_all(s, "\\(", " \\( ")
s <- str_replace_all(s, "=", " = ")
s <- str_replace_all(s, "'", " ' ")
s <- str_replace_all(s, ",", " , ")
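The five replacements can also be collapsed into one base R call with a character class and a backreference, after which the stated goal (finding words with a period in the middle) is a split and a filter. This is a sketch under one assumption: "a period in the middle" is read as a period with a non-period character on each side. Adjacent symbols end up separated by two spaces, which the whitespace split absorbs. The variable names are illustrative.

```r
s <- c("get_degree('TITLE',PERS.ID)",
       "CLIENT_NEED.TYPE_CODe=21",
       "2.1.1Report Field Level Definition",
       "The user defined field. The user will validate")

# Surround each of the characters ' = ( ) , with spaces in a single pass
spaced <- gsub("(['=(),])", " \\1 ", s)

# Break everything into words on runs of whitespace
words <- unlist(strsplit(spaced, "[[:space:]]+"))

# Keep words with a period that has a non-period character on both sides
dotted <- grep("[^.]\\.[^.]", words, value = TRUE)
print(dotted)
```

A trailing period, as in "field.", is not counted as being in the middle, so that word is filtered out.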
