Remove first character of string with condition in R [duplicate] - r

I'm trying to remove the 0 that appears at the beginning of some observations for Zipcode in the following table:
I think the sub function is probably my best choice, but I only want to do the replacement for observations that begin with 0, not for all observations as the following does:
data_individual$Zipcode <- sub(".", "", data_individual$Zipcode)
Is there a way to condition this so it only removes the first character if the Zipcode starts with 0? Maybe grepl for those that begin with 0 and generate a dummy variable to use?

We can specify ^0+ as the pattern, i.e. one or more 0s at the start (^) of the string, instead of . (which matches any character in regex):
data_individual$Zipcode <- sub("^0+", "", data_individual$Zipcode)
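To make the difference between the patterns concrete, here is a minimal sketch. The `zip` vector is made up for illustration, since the asker's table is not shown.

```r
# Hypothetical sample values; the real data_individual$Zipcode is not shown.
zip <- c("02134", "90210", "00501", "10001")

# "." matches any character, so the first character is always dropped.
drop_any <- sub(".", "", zip)

# "^0" removes exactly one zero, and only when it is the first character.
drop_one <- sub("^0", "", zip)

# "^0+" removes the whole run of zeros anchored at the start of the string.
drop_zeros <- sub("^0+", "", zip)

print(drop_any)    # "2134" "0210" "0501" "0001"
print(drop_one)    # "2134" "90210" "0501" "10001"
print(drop_zeros)  # "2134" "90210" "501" "10001"
```

If only a single leading 0 should go, as the question literally asks, "^0" is the closer fit; "^0+" strips every leading 0.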
Or with tidyverse
library(stringr)
data_individual$Zipcode <- str_remove(data_individual$Zipcode, "^0+")
Another option without regex is to convert to numeric, since numeric values do not keep leading 0s (assuming all zipcodes contain only digits):
data_individual$Zipcode <- as.numeric(data_individual$Zipcode)
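A short sketch of what the numeric conversion does, again on made-up values. The non-digit entry is included only to show why the "digits only" assumption matters.

```r
# Hypothetical values, including one that is not purely digits.
zip <- c("02134", "90210", "1000A")

# as.numeric() drops leading zeros but returns doubles, and yields NA
# (with a warning, suppressed here) for anything that is not a valid number.
zip_num <- suppressWarnings(as.numeric(zip))
print(zip_num)  # 2134 90210 NA

# Convert back to character if the column should stay text.
zip_chr <- as.character(zip_num)
print(zip_chr)  # "2134" "90210" NA
```

Note that the column type changes from character to numeric, which may or may not be wanted for an identifier such as a zipcode.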

Related

Gsub in R for hyphens and digits [duplicate]

I'm trying to use gsub on the df$Zipcode in the following data frame:
#Sample
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7),
                 Zipcode = c("10001-2838", "95011", "95011", "100028018", "84321", "84321", "94011"))
df
I want to take everything after the "-" (hyphen) out and replace it with nothing. Something like:
df$Zipcode <- gsub("\-", "", df$Zipcode)
But I don't think that is quite right. I also want to keep only the first 5 digits of any Zipcode that is longer than 5 digits, like observation 4, which should just be 10002. Maybe this is correct:
df$Zipcode <- gsub("[:6:]", "", df$Zipcode)
We can capture the first 5 characters that are not a hyphen as a group and replace the whole match with the backreference (\\1) to that captured group:
df$Zipcode <- sub("^([^-]{5}).*", "\\1", df$Zipcode)
df$Zipcode
#[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
I think what you're looking for is this:
sub("(\\d{5}).*", "\\1", df$Zipcode)
[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
This matches the first 5 digits, puts them into a capturing group, and 'remembers' them (but not the rest) via backreference \\1 in the replacement argument to sub.
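The two answers can be compared side by side on the sample data from the question. This is a sketch only; the `substr` line is an extra alternative that assumes the zipcode always occupies the first five characters.

```r
df <- data.frame(
  ID = 1:7,
  Zipcode = c("10001-2838", "95011", "95011", "100028018",
              "84321", "84321", "94011")
)

# Keep the first five non-hyphen characters from the start of the string.
by_chars <- sub("^([^-]{5}).*", "\\1", df$Zipcode)

# Keep the first run of five digits and drop everything after it.
by_digits <- sub("(\\d{5}).*", "\\1", df$Zipcode)

# Assumption: the zipcode is always the first five characters.
by_position <- substr(df$Zipcode, 1, 5)

print(by_chars)
print(identical(by_chars, by_digits))    # TRUE for this sample
print(identical(by_chars, by_position))  # TRUE for this sample
```

The three approaches agree here, but they can differ on other inputs: the digit pattern is not anchored, so it would skip over leading non-digit characters, while `substr` keeps whatever the first five characters are.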

Restructuring column names in a df [duplicate]

I've got a data frame with column names that look like this:
X121.10.21 X131.90.23
I want to remove the X at the beginning of each string, remove the third number after the . and then reorder the first and second number. Like this:
10.121 90.131
How can I do this? I would especially appreciate a way to do this with dplyr, if possible.
We can use sub, capture as a group and replace with the backreference of the captured group
names(df1) <- sub("X(\\d+)\\.(\\d+)\\..*", "\\2.\\1", names(df1))
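A runnable sketch of the renaming, using a made-up `df1` that only reproduces the column names from the question. The `^` anchor is an addition here so that the X is matched only at the start of the name.

```r
# Hypothetical data frame with the column names from the question.
df1 <- data.frame(X121.10.21 = 1:2, X131.90.23 = 3:4)

# Group 1 = first number, group 2 = second number; the trailing ".*"
# consumes the third number so it is dropped from the result.
new_names <- sub("^X(\\d+)\\.(\\d+)\\..*", "\\2.\\1", names(df1))
names(df1) <- new_names

print(names(df1))  # "10.121" "90.131"
```

For a dplyr pipeline, the same `sub` call could presumably be passed as the function argument of `rename_with()`; that variant is not tested here.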

Remove characters which repeat more than twice in a string [duplicate]

I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeated characters. I tried this code:
https://stackoverflow.com/a/11165145/10718214
It works, but I only need to remove characters that repeat more than twice; if a character appears twice in a row, it should be kept.
So the output that I expect is
"happy birthday"
Any help?
Try using gsub, with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match the same character two or more times
We replace each match with just the single captured character. In the replacement, \\1 refers to the first capture group of the gsub pattern.
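A small check that doubled characters really are preserved. The extra inputs besides the asker's string are made up for illustration.

```r
# Hypothetical inputs: runs of three or more should collapse, doubles should stay.
x <- c("hhhappy birthhhhhhdayyy", "sooo good", "book")

# (.) captures one character; \\1{2,} requires at least two more copies,
# so only runs of three or more are matched and replaced by one copy.
collapsed <- gsub("(.)\\1{2,}", "\\1", x)

print(collapsed)  # "happy birthday" "so good" "book"
```

The "pp" in "happy" and the "oo" in "good" and "book" are left alone because the pattern needs the captured character plus at least two repeats.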

Edit character length of row names in R [duplicate]

I have recently started working in bioinformatics, and I have to edit the row.names of my variable. Here is my situation:
I have clinical data and gene expression values downloaded from The Cancer Genome Atlas. I have to match row names, but in the clinical data the row names look like "TCGA-6D-AA2E", while in the gene expression data they look like "TCGA-6D-AA2E-01A-11R-A38B-07".
Normally I use the match function to match row names, but the character lengths are not the same. So my question is: is there an easy way to edit the character length of row names?
You could use the grep function instead:
gene.names <- c("TCGA-6D-AA2E-01A-11R-A38B-07", "TCGC-6D-AA2E-01A-11R-A38B-07", "TAGA-6D-AA2E-01A-11R-07", "TCGA-6D-AA2E-A38B-07")
pick <- "TCGA-6D-AA2E"
grep(pick, gene.names)
# [1] 1 4
Edit based on the comment: use substr to pick the first 12 characters (R strings are indexed from 1):
substr(gene.names, 1, 12)
#[1] "TCGA-6D-AA2E" "TCGC-6D-AA2E" "TAGA-6D-AA2E" "TCGA-6D-AA2E"
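Once the long IDs are truncated, the asker's usual `match` workflow applies again. The sketch below uses made-up ID vectors; `clinical.names` and `short.names` are hypothetical names, not from the original post.

```r
# Hypothetical IDs in the two formats described in the question.
gene.names <- c("TCGA-6D-AA2E-01A-11R-A38B-07",
                "TCGC-6D-AA2E-01A-11R-A38B-07",
                "TAGA-6D-AA2E-01A-11R-07")
clinical.names <- c("TAGA-6D-AA2E", "TCGA-6D-AA2E", "TCGX-00-0000")

# Truncate the long IDs to their first 12 characters.
short.names <- substr(gene.names, 1, 12)

# match() returns, for each clinical ID, the position of the first
# matching expression ID, or NA when there is none.
idx <- match(clinical.names, short.names)
print(idx)  # 3 1 NA
```

Because `match` returns only the first hit, a patient with several expression samples would need extra handling, for example with `which(short.names == id)`.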

Remove underscore from a string in R [duplicate]

In my data.frame, I have a column of type character, where all the values look like this: 123_456 (three digits, an underscore, three digits).
I need to convert these values to numeric, but as.numeric(my_dataframe$my_column) gives me NA. Therefore I need to remove the underscore first so that as.numeric works.
How would I do that?
Thanks
We can use sub
as.numeric(sub("_", "", my_dataframe$my_column))
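A minimal sketch on made-up values in the format described. The `fixed = TRUE` argument is an optional addition that treats the underscore as a literal string rather than a regular expression.

```r
# Hypothetical column values in the "123_456" format from the question.
x <- c("123_456", "001_002")

# sub() replaces only the first match, which is enough for one underscore.
nums <- as.numeric(sub("_", "", x, fixed = TRUE))
print(nums)  # 123456 1002

# gsub() would be needed if a value could contain several underscores.
multi <- as.numeric(gsub("_", "", "1_2_3", fixed = TRUE))
print(multi)  # 123
```

As with the zipcode example, any leading zeros are lost in the conversion to numeric.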
