Gsub in R for hyphens and digits [duplicate]

This question already has answers here:
Trim a string to a specific number of characters in R
(3 answers)
Using gsub in R to remove values in Zip Code field
(1 answer)
Closed 2 years ago.
I'm trying to use gsub on the df$Zipcode in the following data frame:
# Sample data
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7),
                 Zipcode = c("10001-2838", "95011", "95011", "100028018", "84321", "84321", "94011"))
df
I want to take everything after the "-" (hyphen) out and replace it with nothing. Something like:
df$Zipcode <- gsub("\-", "", df$Zipcode)
But I don't think that is quite right. I also want to keep only the first 5 digits of any Zipcode that is longer than 5 digits, like observation 4, which should just be 10002. Maybe this is correct:
df$Zipcode <- gsub("[:6:]", "", df$Zipcode)

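We can capture the first 5 characters that are not a hyphen (-) as a group and replace the whole string with the backreference (\\1) to the captured group. Strings that do not start with 5 non-hyphen characters do not match and are returned unchanged.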
df$Zipcode <- sub("^([^-]{5}).*", "\\1", df$Zipcode)
df$Zipcode
#[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
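For comparison, the same result can be reached in the two steps the question describes: drop everything from the hyphen on, then keep at most five characters. This is only a sketch using base R's sub() and substr(); the names zip, no_suffix and zip5 are made up for illustration.

```r
# Zip codes copied from the question into a plain vector (illustrative name)
zip <- c("10001-2838", "95011", "95011", "100028018", "84321", "84321", "94011")

# Step 1: remove the hyphen and everything after it
no_suffix <- sub("-.*", "", zip)

# Step 2: keep the first five characters (shorter strings are left as they are)
zip5 <- substr(no_suffix, 1, 5)
zip5
```

The single-regex answer above does both steps at once; the two-step form may be easier to read if the rules change later.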

I think what you're looking for is this:
sub("(\\d{5}).*", "\\1", df$Zipcode)
[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
This matches the first 5 digits, puts them into a capturing group, and 'remembers' them (but not the rest) via backreference \\1 in the replacement argument to sub.
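One detail that may matter on messier data: this pattern is not anchored, so the match can start anywhere in the string and any text before the five digits is kept. The sketch below uses two hypothetical inputs (not from the question) to show the difference an added ^ makes.

```r
# Hypothetical inputs: one with a non-digit prefix, one shorter than five digits
odd <- c("AB10001-2838", "1234")

# Unanchored: the match starts at the first run of five digits, so "AB" survives
unanchored <- sub("(\\d{5}).*", "\\1", odd)

# Anchored: the five digits must start the string, otherwise nothing is replaced
anchored <- sub("^(\\d{5}).*", "\\1", odd)

unanchored
anchored
```

In both cases a string with no run of five digits is returned unchanged, because sub only replaces what the pattern matches.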

Related

Remove first character of string with condition in R [duplicate]

This question already has answers here:
remove leading 0s with stringr in R
(3 answers)
Closed 2 years ago.
I'm trying to remove the 0 that appears at the beginning of some observations for Zipcode in the following table:
I think the sub function is probably my best choice but I only want to do the replacement for observations that begin with 0, not all observations like the following does:
data_individual$Zipcode <-sub(".", "", data_individual$Zipcode)
Is there a way to condition this so it only removes the first character if the Zipcode starts with 0? Maybe grepl for those that begin with 0 and generate a dummy variable to use?
We can use ^0+ as the pattern, i.e. one or more 0s at the start (^) of the string, instead of . (which in regex matches any character):
data_individual$Zipcode <- sub("^0+", "", data_individual$Zipcode)
Or with tidyverse
library(stringr)
data_individual$Zipcode <- str_remove(data_individual$Zipcode, "^0+")
Another option without regex would be to convert to numeric, since numeric values don't keep leading 0s (assuming all zipcodes contain only digits). Note that the column is then numeric rather than character:
data_individual$Zipcode <- as.numeric(data_individual$Zipcode)
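A small sketch of how these options differ, run on a hypothetical vector rather than the data_individual column from the question. It also includes "^0" without the +, which removes only a single leading 0, as the question literally asks.

```r
# Hypothetical zip codes with zero, one and two leading 0s
zip <- c("01234", "00501", "90210")

# "^0+" strips every leading 0; strings without one are untouched
all_zeros <- sub("^0+", "", zip)

# "^0" strips at most one leading 0
one_zero <- sub("^0", "", zip)

# as.numeric also drops leading 0s, but the result is no longer character
as_num <- as.numeric(zip)

all_zeros
one_zero
as_num
```

No grepl check or dummy variable is needed, because sub leaves non-matching elements as they are.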

Removing end of colnames with variable lengths and patterns [duplicate]

This question already has answers here:
How to delete everything after nth delimiter in R?
(2 answers)
Closed 2 years ago.
I currently have a dataframe whose colnames I'm trying to truncate after the 2nd period.
Example below:
GTEX.W5WGY.1726.SM.4LMI5 GTEX.WEY5.1226.SM.4LMIQ
23 20
0 32
Ideal output:
GTEX.W5WGY GTEX.WEY5
23 20
0 32
I'm trying to get it to this output instead and have tried sub but it isn't working.
colnames(x) <- sub("..*.SM..*", "", colnames(x))
Any help would be appreciated!
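We can change the pattern to capture, from the start (^) of the string, one or more characters that are not a . ([^.]+), followed by a literal . (\\.) and a second run of non-dot characters. The rest of the pattern (\\..*) matches the remainder of the string, and the whole match is replaced with the backreference of the captured group: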
colnames(x) <- sub("^([^.]+\\.[^.]+)\\..*", "\\1", colnames(x))
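Here is a runnable sketch of that replacement on a small data frame rebuilt from the example in the question (the placeholder column names a and b are arbitrary). It also shows a non-regex alternative with strsplit, which is an assumption about what might be acceptable, not part of the original answer.

```r
# Rebuild the example; a and b are throwaway names replaced on the next line
x <- data.frame(a = c(23, 0), b = c(20, 32))
colnames(x) <- c("GTEX.W5WGY.1726.SM.4LMI5", "GTEX.WEY5.1226.SM.4LMIQ")

# Regex version: keep the first two dot-separated fields
regex_names <- sub("^([^.]+\\.[^.]+)\\..*", "\\1", colnames(x))

# Non-regex version: split on a literal ".", keep two pieces, glue them back
parts <- strsplit(colnames(x), ".", fixed = TRUE)
split_names <- vapply(parts, function(p) paste(head(p, 2), collapse = "."), character(1))

colnames(x) <- regex_names
x
```

The original attempt failed mainly because an unescaped . matches any character, so "..*.SM..*" matched the whole name and replaced it with an empty string.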

Remove characters which repeat more than twice in a string [duplicate]

This question already has answers here:
remove repeated character between words
(4 answers)
Closed 3 years ago.
I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeated characters. I tried this code:
https://stackoverflow.com/a/11165145/10718214
and it works, but I need to collapse a character only if it repeats more than twice; if it repeats exactly twice, it should be kept.
so the output that I expect is
"happy birthday"
any help?
Try using gsub, with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match the same character two or more additional times (three or more in total)
We replace with just the single matching character. The quantity \\1 represents the first capture group in gsub.
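To check the boundary between "twice" and "more than twice", the sketch below runs the same pattern on the original text plus one hypothetical string. The second call, which keeps two copies instead of one, is a variant added for illustration and not what the question asks for.

```r
# First string is from the question; the second is a hypothetical test case
txt <- c("hhhappy birthhhhhhdayyy", "aa bbb cccc")

# Runs of three or more collapse to one character; doubles are kept
collapsed <- gsub("(.)\\1{2,}", "\\1", txt)

# Variant: collapse runs of three or more down to two characters
doubled <- gsub("(.)\\1{2,}", "\\1\\1", txt)

collapsed
doubled
```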

Keep part of string after last sign. [duplicate]

This question already has answers here:
Extract last word in string in R
(5 answers)
Closed 4 years ago.
I would like to keep only the string after the last | sign in my rownames which looks like this:
in:
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
out:
g__Woeseia
I have this code which keeps everything from the start until a given sign:
gsub("^.*\\.",".",x)
We could do this by capturing a group. Using sub, match characters greedily (.*) up to the last |, capture zero or more characters that are not a | (([^|]*)) up to the end ($) of the string, and replace the match with the backreference (\\1) of the captured group
sub(".*\\|([^|]*)$", "\\1", str1)
#[1] "g__Woeseia"
Or match everything up to the last | (.* is greedy, so it runs to the final one) and replace it with blank ("")
sub(".*\\|", "", str1)
#[1] "g__Woeseia"
data
str1 <- "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
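If a regex is not required, splitting on the literal | and taking the last piece gives the same answer. This is a sketch with base R's strsplit and tail; pieces and last_piece are illustrative names.

```r
str1 <- "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"

# fixed = TRUE treats "|" as a plain character, not as regex alternation
pieces <- strsplit(str1, "|", fixed = TRUE)[[1]]

# The last element is the text after the final "|"
last_piece <- tail(pieces, 1)
last_piece
```

For a whole vector of rownames, the regex version is already vectorised, which is one reason it may be the more convenient choice.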

Remove characters in string before specific symbol(including it) [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Use gsub remove all string before first white space in R
(4 answers)
Closed 5 years ago.
To begin with: yes, similar questions already exist here, but the solutions don't work as they should, at least for me.
I'd like to remove everything (letters, digits, or any combination of them) before the first semicolon, and remove the semicolon too.
So we have some strings:
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"
Let's use gsub() with . and *, which together mean any character occurring 0 or more times:
gsub(".*;", "", x)
gsub(".*;", "", y)
And as a result I get:
[1] "GEF2"
[1] "3DR"
But I'd like to have:
[1] "ABC;GEF2"
[1] "EER;3DR"
Why did it 'catch' the second occurrence of the semicolon instead of the first?
You could use
gsub("[^;]*;(.*)", "\\1", x)
# [1] "ABC;GEF2"
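The reason is that .* is greedy: it takes as many characters as it can while still letting the rest of the pattern match, so it runs up to the last semicolon. The sketch below contrasts the greedy pattern with two ways of stopping at the first semicolon; the variable names are illustrative, and perl = TRUE on the lazy version is a cautious choice rather than a requirement stated in the answer.

```r
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"

# Greedy: .* runs to the last ";" and removes too much
greedy <- sub(".*;", "", c(x, y))

# Negated class: [^;]* cannot cross a ";" so it stops at the first one
negated <- sub("^[^;]*;", "", c(x, y))

# Lazy quantifier: .*? takes as few characters as possible
lazy <- sub("^.*?;", "", c(x, y), perl = TRUE)

greedy
negated
lazy
```

sub is used instead of gsub here because only the first match should be removed.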
