Flag when character appears more than once in a string - r

I have seen something similar answered for Python but not for R. Say I have the sample data below, and I want to create the "want" column, which flags when the character "|" appears more than once in the string in the "var1" column. How would I do this in R? I know I can use grepl to flag whenever "|" appears, but this would also capture when it only appears once.
Sample data:
var1<-c("BLUE|RED","RED|BLUE","WHITE|BLACK|ORANGE","BLACK|WHITE|ORANGE")
want<-c(0,0,1,1)
have<-data.frame(var1,want)
var1 want
BLUE|RED 0
RED|BLUE 0
WHITE|BLACK|ORANGE 1
BLACK|WHITE|ORANGE 1

str_count can be used to count the number of | characters. Because | is a regex metacharacter, either escape it (\\|) or wrap it in fixed(). Comparing the count with 1 (> 1) gives a logical vector, which can then be converted to binary with as.integer or unary +.
library(stringr)
have$want <- +(str_count(have$var1, fixed("|") ) > 1)
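For readers without stringr, here is a base R sketch of the same count-and-compare idea. It assumes the sample var1 vector from the question; the names matches and n_pipes are introduced here for illustration only.

```r
# Sample data from the question
var1 <- c("BLUE|RED", "RED|BLUE", "WHITE|BLACK|ORANGE", "BLACK|WHITE|ORANGE")

# fixed = TRUE treats "|" as a literal character rather than regex alternation
matches <- regmatches(var1, gregexpr("|", var1, fixed = TRUE))

# lengths() gives the number of matches per string (0 when there are none)
n_pipes <- lengths(matches)

# Flag strings with more than one "|"; unary + turns TRUE/FALSE into 1/0
want <- +(n_pipes > 1)
want
```

The same comparison works for any threshold, e.g. n_pipes >= 3.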

Regular expression weird result [duplicate]

This question already has answers here:
Multiple overlapping regex matches instead of one
(2 answers)
Biostrings gregexpr2 gives errors while gregexpr works fine
(1 answer)
Closed 3 years ago.
Code
gsub('101', '111', '110101101')
#[1] "111101111"
Would anyone know why the second 0 in the input isn't being substituted into a 1 in the output?
I'm looking for the pattern 101 in string and replace it with string 111. Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
You could use a lookahead, (?=...).
The way this works is that q(?=u) matches a q that is followed by a u, without making the u part of the match.
Example:
gsub('10(?=1)', '11', '110101101', perl=TRUE)
#[1] "111111111"
Edit: you need to call gsub with perl=TRUE to use lookaheads.
It's because gsub doesn't work in a recursive way.
gsub('101', '111', '110101101') consumes the input string (the third argument) as it finds matches. It finds the first 101 and is left with 01101, which contains only one more 101. It doesn't rescan text it has already matched or replaced. Think about it: if it did replace "recursively", something like gsub('11', '111', '11') would return an infinite string of '1' and break.
This is because once R has matched the first 101 in 110101101, those characters are consumed, so the next 0 is treated as the start of 01101 rather than as the middle of another 101.
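The non-overlapping behavior described in these answers can be checked directly. This is a minimal sketch using the input string from the question; the variable names are illustrative, and wrapping the pattern in a lookahead is one common way to list overlapping start positions, not something the answers above propose.

```r
s <- "110101101"

# Default matching is non-overlapping: once positions 2-4 are consumed,
# the scan resumes at position 5, so the 101 starting at position 4 is missed
non_overlapping <- as.integer(gregexpr("101", s)[[1]])

# A zero-width lookahead consumes nothing, so every start position is reported
overlapping <- as.integer(gregexpr("(?=101)", s, perl = TRUE)[[1]])

non_overlapping
overlapping
```

The second 0 in the input belongs only to the overlapping occurrence, which is why a plain gsub('101', '111', ...) leaves it untouched.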
It seems that you only want to replace '0' by '1'. Then you can just use gsub('0', '1', '110101101')
Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
Hopefully, R provides a means to generate the replacement string based on the matched substring. (This is a common feature.)
If so, search for 10+, and have the replacement string generator create a string consisting of a number of 1 characters equal to the length of the match. (e.g. If 100 is matched, replace with 111. If 1000 is matched, replace with 1111. etc.)
I don't know R in the least. Here's how it's done in some other languages in case that helps:
Perl:
$s =~ s{10+}{ "1" x length($&) }ger
Python:
re.sub(r'10+', lambda match: '1' * len(match.group()), s)
JavaScript:
s.replace(/10+/g, function(match) { return '1'.repeat(match.length) })
JavaScript (ES6):
s.replace(/10+/g, match => '1'.repeat(match.length))
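A rough base R counterpart to the snippets above, as a sketch: base R's sub and gsub do not accept a replacement function, but regmatches<- can assign computed replacements to the matches found by gregexpr. The sample strings and variable names are assumptions made for illustration.

```r
s <- c("110001101", "100")
out <- s

# Locate every run of a 1 followed by one or more 0s
m <- gregexpr("10+", out)

# Replace each match with a string of 1s of the same length
regmatches(out, m) <- lapply(regmatches(out, m),
                             function(x) strrep("1", nchar(x)))
out
```

Note that the pattern 10+ also converts trailing zeros (as in "100"), since it does not require a 1 after the run of zeros.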
According to the OP
Later on I wish to turn longer sub-sequences into sequences of 1's,
such as 10001 to 11111.
If I understand correctly, the final goal is to replace any sub-sequence of consecutive 0 into the same number of 1 if they are surrounded by a 1 on both sides.
In R, this can be achieved by the str_replace_all() function from the stringr package. For demonstration and testing, the input vector contains some edge cases where substrings of 0 are not surrounded by 1.
input <- c("110101101",
"11010110001",
"110-01101",
"11010110000",
"00010110001")
library(stringr)
str_replace_all(input, "(?<=1)0+(?=1)", function(x) str_dup("1", str_length(x)))
[1] "111111111" "11111111111" "110-01111" "11111110000" "00011111111"
The regex "(?<=1)0+(?=1)" uses look behind (?<=1) as well as look ahead (?=1) to ensure that the subsequence 0+ to replace is surrounded by 1. Thus, leading and trailing subsequences of 0 are not replaced.
The replacement is computed by a function which returns a subsequence of 1 of the same length as the subsequence of 0 to replace.

gsub to add leading zero to selected numbers in column names text to aid in sorting [duplicate]

This question already has answers here:
How to sort a character vector where elements contain letters and numbers?
(6 answers)
Closed 2 years ago.
I've got a dataframe pulled from an external source with column labels based on year and week counts. Unfortunately, the columns pull in a weird non-sequential order (feature of the external dataset), and so when I report on them I'm going to want to pull the columns using "select" to get them in date-sequential order.
I want to insert a zero before the single-digit column labels below -- that is, "W1_2019" becomes "W01_2019" (and so forth for 2, 3, and up to 9), but not before the double-digit ones -- that is, "W10_2019" will remain as-is. The resulting column should allow me to order names(df) in ascending order, with W01 followed by W02 and W03. Without the zeros, of course, the order is W1 followed by W10 and then W2, which is exactly what I don't want.
See code below.
df<-setNames(
data.frame(
t(data.frame(c("1","2","1","3","2","3", "1")))
,row.names = NULL,stringsAsFactors = FALSE
),
c("W10_2018", "W50_2018", "W51_2018", "W52_2018", "W1_2019", "W2_2019", "W3_2019")
)
names(df) = gsub(pattern="W#_.*", replacement = "W0#_", x=names(df))
sort(names(df))
The gsub line doesn't return an error, but it also doesn't change the names. The result is that the output of the "sort" line is:
[1] "W1_2019" "W10_2018" "W2_2019" "W3_2019" "W50_2018" "W51_2018" "W52_2018"
What it should look like if successful is:
[1] "W01_2019" "W02_2019" "W03_2019" "W10_2018" "W50_2018" "W51_2018" "W52_2018"
We can use mixedsort from gtools
library(gtools)
mixedsort(names(df))
#[1] "W1_2019" "W2_2019" "W3_2019" "W10_2018" "W50_2018" "W51_2018" "W52_2018"
If we need consistency, i.e. 2 digits after 'W', we can modify the names with sub. Capture the single digit that sits between the 'W' and the '_' as a group (two-digit weeks will not match), then build the replacement from "W", a literal 0, the backreference to the captured group (\\1), and the _.
mixedsort(sub("W(\\d{1})_", "W0\\1_", names(df)))
#[1] "W01_2019" "W02_2019" "W03_2019" "W10_2018" "W50_2018" "W51_2018" "W52_2018"
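One caveat: an alphabetical sort of the padded names still places the 2019 weeks before the 2018 weeks. If the goal is date-sequential order, as the question states, a possible sketch is to extract the week and year as integers and order by year first. The scrambled nms vector and the variable names below are assumptions for illustration.

```r
# Column names in a scrambled order (same labels as in the question)
nms <- c("W2_2019", "W10_2018", "W1_2019", "W52_2018",
         "W3_2019", "W50_2018", "W51_2018")

# Pull the week and year out of each label as integers
week <- as.integer(sub("^W(\\d+)_(\\d+)$", "\\1", nms))
year <- as.integer(sub("^W(\\d+)_(\\d+)$", "\\2", nms))

# Zero-pad the week to two digits
padded <- sprintf("W%02d_%d", week, year)

# Order by year first, then week, to get date-sequential labels
chrono <- nms[order(year, week)]
padded
chrono
```

With a data frame, df[, chrono] (or select with the same vector) would then return the columns in chronological order without renaming them.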

Insert delimiter between characters

I have a data frame with a character column in which I want to insert a delimiter after every 2 characters. The strings in the column vary in length. This is what it looks like:
id character
1 aaabdg
2 bjdbjhdj
3 bjbkjekkechj
4 jkfb
the output data frame I want is as below
id character
1 aa_ab_dg
2 bj_db_jh_dj
3 bj_bk_je_kk_ec_hj
4 jk_fb
I have been trying to create regex to use in the below code but have not found any luck yet.
cat(paste0('[a-z]{2}', paste(str1, collapse="", ""), '[a-z]{2}'))
and
gsub("([a-z])", "\\,", str1)
Any help/suggestions would be much appreciated
Here is one option using gsub:
gsub("(..)(?!$)", "\\1_", "bjbkjekkechj", perl=TRUE)
[1] "bj_bk_je_kk_ec_hj"
This approach matches and captures each successive pair of characters, provided that at least one character follows the pair. Then, we replace with those two captured characters, followed by an underscore. The negative lookahead (?!$) ensures that we do not add an underscore after a pair that ends the string.
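Since gsub is vectorized, the same call can be applied to the whole column. A minimal sketch, assuming the data frame from the question is named df:

```r
# Data frame from the question (the name df is an assumption)
df <- data.frame(
  id = 1:4,
  character = c("aaabdg", "bjdbjhdj", "bjbkjekkechj", "jkfb"),
  stringsAsFactors = FALSE
)

# gsub is vectorized, so the whole column is processed in one call
df$character <- gsub("(..)(?!$)", "\\1_", df$character, perl = TRUE)
df

# An odd-length string keeps its last character after the final underscore
odd <- gsub("(..)(?!$)", "\\1_", "abcde", perl = TRUE)
odd
```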

R: Count number of rows in data frame, with matching character in specified position of string

I have a data frame with a column with characters:
strings
1 a;b;c;d
2 g;h;i;j
3 k;m
4 o
I would like to count the strings (rows) that have one of a specified set of characters at a given position within the string.
Eg.
Get count of number of strings with 3rd character as one of the
characters in this set: {a,b,m}.
The output should be 2 in this case, since only the 1st and 3rd row
have any characters in {a,b,m} as their 3rd character within the
string.
I could only use this code to find any strings that contains 'b':
sum(grepl("b",df))
However, this is not good enough for the above task.
Please advise.
You can try grepl:
x = c('a;b;c;d','g;h;i;j','k;m','o')
sum(grepl('^.{2}[abm]', x))
#[1] 2
Try this:
sum(substr(df$strings,3,3) %in% c("a","b","m"))
Alternatively, if you want to treat ; as the delimiter and test the second field, you can do:
sum(sapply(strsplit(df$strings,";"),function(x) x[2] %in% c("a","b","m")))
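The two approaches above can be wrapped in small helpers so that the position, field, and character set become parameters. This is a sketch only; the function names count_at_position and count_at_field are hypothetical and not part of any package.

```r
x <- c("a;b;c;d", "g;h;i;j", "k;m", "o")

# Count strings whose character at position `pos` is in `set`
count_at_position <- function(x, pos, set) {
  sum(substr(x, pos, pos) %in% set)
}

# Count strings whose `field`-th delimited field is in `set`
count_at_field <- function(x, field, set, sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)
  # A missing field yields NA, and NA %in% set is FALSE
  sum(vapply(parts, function(p) p[field] %in% set, logical(1)))
}

count_at_position(x, 3, c("a", "b", "m"))
count_at_field(x, 2, c("a", "b", "m"))
```

The position-based version counts the ; separators as characters, while the field-based version ignores them, so the two only agree when every field is a single character.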

Replace specific value in column in R

I have a variable in a data frame whose values consist of either 5 or 6 characters. The 5-character values are all digits, e.g. 27701. The 6-character values all have the character 'C' preceding the digits, e.g. C22701.
How can I replace the 'C' characters with 999 for example?
I have tried:
replace(data$varname,'C',999)
Any ideas folks?
data$varname <- as.numeric(gsub('C', '999', data$varname)) should do the trick, I think. Assuming you want a numeric vector in the end. If you want a character vector, then you can leave as.numeric off.
You can use substring to drop the first character, and paste0 to prepend 999 to the rest. Note that this drops the first character whatever it is.
x <- c("C000", "P1745")
paste0("999", substring(x,2))
# [1] "999000" "9991745"
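If only a leading 'C' should be replaced and the all-digit values left alone, one option is to anchor the pattern to the start of the string. A minimal sketch, using the two example values from the question; the variable names are illustrative.

```r
x <- c("27701", "C22701")

# "^C" matches a C only at the start of the string
replaced <- sub("^C", "999", x)
replaced

# Convert to numeric if a numeric vector is wanted
as_num <- as.numeric(replaced)
as_num
```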
