How to take out specific letters from a character variable - r

xcv(123)
wert(232)
t(145)
tyui ier(133)
ytie(435)
...
The length of each string varies. The numbers between the brackets are the target characters: they need to be extracted and stored in a new column of the same data set.
The following key words might help:
substr() strsplit()
I'm actively looking for an answer. Your help would be deeply appreciated.

Do you mean you want to extract, for example, the 123rd letter from the string called xcv?
set.seed(123)
xcv <- paste( sample( letters, 200, replace = TRUE ), collapse = "" )
n <- 123
You can extract the nth letter like so:
substr( xcv, n, n )
# [1] "i"

dat = c('xcv(123)' ,'wert(232)', 't(145)', 'tyui ier(133)', 'ytie(435)')
# Delete everything up to '(' and everything from ')' onward, which leaves the text between the brackets.
# The brackets are written as \\( and \\) because they are special characters in a regex.
target = gsub(".*\\(|\\).*", "", dat)
cbind(dat, target)
dat target
[1,] "xcv(123)" "123"
[2,] "wert(232)" "232"
[3,] "t(145)" "145"
[4,] "tyui ier(133)" "133"
[5,] "ytie(435)" "435"
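Since the question asks for the result to be stored in a new column of the same data set, here is a minimal sketch of that step. It assumes the strings live in a data frame column; the names df, dat and target are illustrative, and the pattern assumes exactly one bracketed number per string.

```r
# Hypothetical data frame holding the strings from the question
df <- data.frame(
  dat = c("xcv(123)", "wert(232)", "t(145)", "tyui ier(133)", "ytie(435)"),
  stringsAsFactors = FALSE
)

# Keep only the digits between '(' and ')' and store them as integers
df$target <- as.integer(sub(".*\\(([0-9]+)\\).*", "\\1", df$dat))

df$target
# [1] 123 232 145 133 435
```

Converting with as.integer() is optional; drop it if the bracketed values should stay as character strings.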

Related

Regex with 2 capture groups, "key=value" or "value_only"

I am trying to build a regex that matches either key=value or value_only, where in the key=value case the value may itself contain = signs. The key should go into capture group 1 and the value into capture group 2. The examples below use R/stringr, which runs on the ICU regex engine. I have not found any combination of greedy, possessive, and lazy quantifiers that works. Am I missing something?
library(stringr)
data <- c(
"key1=value1",
"value_only_no_key",
"key2=value2=containing=equal=signs"
)
# Desired outcome:
result <- matrix(c(
"key1", "value1",
"", "value_only_no_key",
"key2", "value2=containing=equal=signs"
), ncol=2, byrow= TRUE)
# The non-optionality of = results in no match for #2
str_match(
data,
"(.*?)=(.*)"
)[,-1]
# Same here
str_match(
data,
"([^=]*?)=(.*)"
)[,-1]
# The optionality of =? lets the greedy capture 2 eat everything
str_match(
data,
"(.*?)=?(.*)"
)[,-1]
# This is better than nothing, but value_only_no_key ends up in the first capture group
str_match(
data,
"([^=]*+)=?+(.*)"
)[,-1]
If you know that the key comes before the first occurrence of the equals sign, you can use a negated character class to match all characters except =.
If you don't want to match empty strings, so that the value must contain at least one character:
^(?:([^\s=]+)=)?(.+)
If the key can also contain spaces, you can exclude newlines instead of all whitespace characters:
^(?:([^\r\n=]+)=)?(.+)
Example
library(stringr)
data <- c(
"key1=value1",
"value_only_no_key",
"key2=value2=containing=equal=signs"
)
str_match(data,
"^(?:([^\\s=]+)=)?(.+)"
)[,-1]
Output
[,1] [,2]
[1,] "key1" "value1"
[2,] NA "value_only_no_key"
[3,] "key2" "value2=containing=equal=signs"
How about using an optional (?) non-capturing group (?:...) anchored to the start of the string (^)?
str_match(data,
"^(?:(.*?)=)?(.*)"
)[,-1]
[,1] [,2]
[1,] "key1" "value1"
[2,] NA "value_only_no_key"
[3,] "key2" "value2=containing=equal=signs"
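Both patterns above return NA for a missing key, whereas the desired outcome in the question has an empty string there. If stringr is not a requirement, a base R sketch without capture groups can build that matrix directly. The names key, value and kv are illustrative, and the approach assumes the key itself never contains =.

```r
data <- c(
  "key1=value1",
  "value_only_no_key",
  "key2=value2=containing=equal=signs"
)

# Key: everything before the first '=', or "" when there is no '=' at all
key <- ifelse(grepl("=", data, fixed = TRUE), sub("=.*", "", data), "")

# Value: drop the leading "key=" if present; strings without '=' stay unchanged
value <- sub("^[^=]*=", "", data)

kv <- cbind(key = key, value = value)
kv
```

With a str_match() result stored in a matrix m, replacing the missing keys via m[is.na(m)] <- "" should give the same layout.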

pattern matching a formula in R

I want to extract the variable names from a formula by pattern matching. Ideally, given
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456'
the output should be variable_1, variable_2, variable_3, variable_4, variable_5.
Note: a variable name can contain only letters, underscores (_), and digits, and the operations are limited to +, -, *, /. The formula may also contain constants (such as 456 here). The output should contain only the variable names and ignore any numeric constants.
I have tried the code below (strapplyc() is from the gsubfn package). It only handles variable names made up of letters, and it also breaks on the minus operation (-).
formula <- "variableX +variableY*VariableZ"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T)
gives the output below:
[,1]
[1,] "variableX"
[2,] ""
[3,] "variableY"
[4,] ""
[5,] "VariableZ"
which is correct, but when I include the minus operation (-), strapplyc gives the wrong result:
formula <- "variableX -variableY"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T)
gives the output below:
[,1]
[1,] "variableX"
[2,] "-"
[3,] "variableY"
I would appreciate it if anyone could help me find a proper solution.
You can use regular expressions for this:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5"
gsub("[\\+\\*\\-\\/]", ", ", formula)
Explanation of the regex:
[ and ] open and close a character class, i.e. the set of characters you want to match
\\+ escapes the + sign, which you want to replace with ", "
\\* escapes the * sign, which you want to replace with ", "
\\- escapes the - sign, which you want to replace with ", "
\\/ escapes the / sign, which you want to replace with ", "
Edit to reflect OP's updated request
Another way would be to extract just your variables. The code below works if your variable names keep the format lowercaseletters_number:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5+34+brigadeiro_5"
paste(regmatches(formula, gregexpr("variable_[0-9]", formula))[[1]],
collapse = ", ")
You can also use the stringr package if you want the code to look a little cleaner:
library(stringr)
str_extract_all(formula, "[a-z]*_[0-9]*")
You could use strsplit() with some extras.
res <- trimws(el(strsplit(formula, "\\+|\\-|\\*|\\/")))
Then we keep only those elements that yield NA when we try to coerce them with as.numeric(), i.e. the ones that are not numeric constants.
res[is.na(suppressWarnings(as.numeric(res)))]
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
Data
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456'
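For the rules stated in the question (names made of letters, digits and underscores; numeric constants ignored), two base R sketches are shown here. Both rest on assumptions that are not in the original post: the regex version assumes a variable name never starts with a digit, and the all.vars() version assumes the string is a syntactically valid R expression. The names vars_regex and vars_parsed are illustrative.

```r
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5 + 456"

# Option 1: regex. A name must start with a letter or underscore,
# so a bare constant such as "456" never matches.
vars_regex <- regmatches(formula, gregexpr("[A-Za-z_][A-Za-z0-9_]*", formula))[[1]]

# Option 2: let R parse the string as an expression and list its variable names.
vars_parsed <- all.vars(parse(text = formula))

vars_regex
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
```

One difference to keep in mind: all.vars() returns each name once by default, while the regex version returns a name every time it appears.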

Extract numbers after a pattern in vector of characters

I'm trying to extract values from a vector of strings. Each string in the vector (there are about 2300) follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(tidyverse)
Data <- c("PASTE YOUR WHOLE STRING")
str_locate(Data, "Sent. ")
Reference <- str_locate_all(Data, "Sent. ") %>% as.data.frame()
Reference %>% names() #Returns [1] "start" "end"
Reference <- Reference %>% mutate(end = end +1)
YourNumbers <- substr(Data,start = Reference$end[1], stop = Reference$end[1])
for (i in 2:dim(Reference)[1]){
Temp <- substr(Data,start = Reference$end[i], stop = Reference$end[i])
YourNumbers <- paste(YourNumbers, Temp, sep = "")
}
YourNumbers #Returns "1234567": the sentence number right after each "Sent. ", not the rating between the underscores
We can use str_match_all from stringr to get all the numbers that follow "Sent".
str_match_all(text, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
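The question asks for one combined string per element (such as "1311531") across roughly 2300 strings, so here is a hedged base R sketch that wraps the split-and-substitute idea in a vectorised helper. The function name extract_ratings and the shortened sample ss_short are made up for illustration; the pattern assumes each "Sent." field ends with a number surrounded by underscores.

```r
# Shortened, hypothetical sample in the same layout as the real strings
ss_short <- "733|Overall (-2 to 2): ____2____|Sent. 1 (ANALYSIS BY...): ___1___|Sent. 2 (Bail is...): __3____|Sent. 3 (2) A...): ___1__|NA|"

extract_ratings <- function(x) {
  vapply(strsplit(x, "|", fixed = TRUE), function(fields) {
    # Keep only the fields that begin with "Sent. "
    sent <- fields[startsWith(fields, "Sent. ")]
    # Pull out the digits sitting between the runs of underscores
    ratings <- sub(".*: _+([0-9]+)_+$", "\\1", sent)
    # Glue the ratings of one string together
    paste(ratings, collapse = "")
  }, character(1))
}

extract_ratings(ss_short)
# [1] "131"
```

Because the helper returns one string per input element, it can be applied to the whole vector in a single call.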

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row, I'd like to identify any letters that occur only in the word in the second column, e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
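As a runnable illustration of the setdiff approach, the sketch below applies it row by row and stores the result in a new column. The data frame words and its column names first, second and new_letters are assumptions made for the example.

```r
# Split a single word into its characters
chars_for <- function(str) strsplit(str, "")[[1]]

# Hypothetical two-column data frame using the words from the question
words <- data.frame(
  first  = c("carpet", "bag", "dog"),
  second = c("carpelt", "flag", "dig"),
  stringsAsFactors = FALSE
)

# For each row: letters of the second word that never occur in the first
words$new_letters <- mapply(
  function(w1, w2) paste(setdiff(chars_for(w2), chars_for(w1)), collapse = ""),
  words$first, words$second,
  USE.NAMES = FALSE
)

words$new_letters
# [1] "l"  "fl" "i"
```

Collapsing with paste() keeps the column a plain character vector; leave that step out if a list of letter vectors is more convenient.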
Use regex :) Wrap the first word in brackets [] to build a character class, then use a regex replace function on the second word. The regex matches any letter listed in the brackets and replaces it with an empty string (in other words, it "removes" those letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
sx <- strsplit(x, "")
sy <- strsplit(y, "")
lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to keep those characters that are in y but not in x.
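The %in% approach and setdiff differ when a new letter is repeated: setdiff returns each letter once, while %in% indexing keeps every occurrence. A small sketch with a made-up word pair ("bag" and "balloon", not from the question) shows the difference.

```r
first  <- strsplit("bag", "")[[1]]
second <- strsplit("balloon", "")[[1]]

# setdiff() treats its inputs as sets, so repeated letters collapse
unique_new <- setdiff(second, first)
unique_new
# [1] "l" "o" "n"

# Logical indexing with %in% keeps every occurrence, in order
all_new <- second[!second %in% first]
all_new
# [1] "l" "l" "o" "o" "n"
```

Which one fits depends on whether repeated letters should be counted once or every time they appear.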

Fix the order of strings that have both letter and number components

I have string data like below.
a <- c("53H", "H26","14M","M47")
##"53H" "H26" "14M" "M47"
I want to put the numbers and letters in a consistent order, so that
the numbers go first and the letters second, or the other way around.
How can I do it?
##"53H" "26H" "14M" "47M"
or
##"H53" "H26" "M14" "M47"
You can extract the numbers and letters separately with gsub, then use paste0
to put them in any order you like.
a <- c("53H", "H26","14M","M47")
( nums <- gsub("[^0-9]", "", a) ) ## extract numbers
# [1] "53" "26" "14" "47"
( lets <- gsub("[^A-Z]", "", a) ) ## extract letters
# [1] "H" "H" "M" "M"
Numbers first answer:
paste0(nums, lets)
# [1] "53H" "26H" "14M" "47M"
Letters first answer:
paste0(lets, nums)
# [1] "H53" "H26" "M14" "M47"
You can capture the relevant parts in groups using () and then backreference them using gsub:
a <- c("53H", "H26","14M","M47")
gsub("^([0-9]+)([A-Z]+)$", "\\2\\1", a)
# [1] "H53" "H26" "M14" "M47"
This is like saying: "Find a group of numbers at the start of my string and capture them in a group (^([0-9]+)). Then find the group of letters that runs to the end of my string and capture them in a second group (([A-Z]+)$). That's my search pattern. Next, replace it such that the second group (referred to by \\2) is returned first and the first group (referred to by \\1) is returned second."
Building on Ananda Mahto's answer, you can put the numbers first and the letters second using the following code:
gsub("^([A-Z]+)([0-9]+)$", "\\2\\1", a)
because you want to capture the strings that start with letters (^([A-Z]+)), then capture the group of numbers that follows (([0-9]+)$).
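The two backreference patterns can be folded into one small helper that swaps only the strings that are in the "wrong" order. This is a sketch; the function name reorder_parts and its numbers_first argument are made up for illustration, and it assumes each string is exactly one run of capital letters plus one run of digits.

```r
reorder_parts <- function(x, numbers_first = TRUE) {
  # Match only strings that are currently in the opposite order,
  # then swap the two captured groups; other strings pass through unchanged.
  pattern <- if (numbers_first) "^([A-Z]+)([0-9]+)$" else "^([0-9]+)([A-Z]+)$"
  sub(pattern, "\\2\\1", x)
}

a <- c("53H", "H26", "14M", "M47")

reorder_parts(a)
# [1] "53H" "26H" "14M" "47M"

reorder_parts(a, numbers_first = FALSE)
# [1] "H53" "H26" "M14" "M47"
```

Since each pattern is anchored at both ends, sub() is enough here; gsub() would give the same result.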
