Padding lost zeros not universally in a column [duplicate] - r

This question already has answers here:
How to add leading zeros?
(8 answers)
Closed 1 year ago.
I have a list of five-digit US postal zip codes, but some lost their leading zeros. How do I add those zeros back in, while keeping the others intact? I tried formatC, sprintf, and str_pad, and none of them worked, because I am not adding 0s to all values.

We can use sprintf
sprintf('%05d', as.integer(zipcodes))
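As a quick check that this leaves complete zip codes alone, here is a minimal sketch. The `zipcodes` values are made up for illustration and are assumed to be stored as strings.

```r
# Hypothetical input: some codes lost their leading zeros, some are intact
zipcodes <- c("2134", "90210", "501", "00501")

# %05d pads with zeros up to width 5; values already 5 digits wide are unchanged
padded <- sprintf("%05d", as.integer(zipcodes))
padded
# [1] "02134" "90210" "00501" "00501"
```

Because the values go through as.integer first, an already padded "00501" and a stripped "501" end up identical.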

In which way did str_pad not work?
https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_pad
df<-data.frame(zip=c(1,22,333,4444,55555))
df$zip <- stringr::str_pad(df$zip, width=5, pad = "0")
[1] "00001" "00022" "00333" "04444" "55555"

Update:
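Following the valuable comment from r2evans:
My solution is not very efficient, and to get a leading 0 the paste0 part has to be modified slightly. Note that it prepends only one zero, so it assumes no value is shorter than four digits. Here it is with a data frame example: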
sapply(df$zip, function(x){if(nchar(x)<5){paste0(0,x)}else{x}})
data:
df <- tibble::tribble(
~zip,
7889,
2345,
45567,
4394,
34566,
4392,
4599)
df
Output:
[1] "07889" "02345" "45567" "04394" "34566" "04392" "04599"
First answer:
This will add a trailing zero to each integer with fewer than 5 digits (not the leading zero the question asks for; see the update above)
Where zip is a vector:
sapply(zip, function(x){if(nchar(x)<5){paste0(x,0)}else{x}})
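The sapply version in the update prepends exactly one zero, so a three-digit code would still come out one character short. A vectorised sketch with formatC, assuming the codes are whole numbers (the `zip` values below are invented for illustration):

```r
# Hypothetical zip codes stored as numbers, including a three-digit one
zip <- c(7889, 501, 45567)

# width = 5 with flag = "0" pads on the left with zeros;
# format = "d" formats the values as integers
padded <- formatC(zip, width = 5, format = "d", flag = "0")
padded
# [1] "07889" "00501" "45567"
```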

If they start as strings and you don't want to (or cannot) convert to integers first, then an alternative to sprintf is
vec <- c('1','11','11111')
paste0(strrep('0', pmax(0, 5 - nchar(vec))), vec)
# [1] "00001" "00011" "11111"
This will handle strings of any length, and is a no-op for strings of 5 or greater characters.
In a frame, that would be
dat$colname <- paste0(strrep('0', pmax(0, 5 - nchar(dat$colname))), dat$colname)
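A small worked example of the frame version, under the assumption that the column is already character. The frame `dat` and its `zip` column are hypothetical:

```r
# Hypothetical frame with short, complete, and over-long values
dat <- data.frame(zip = c("501", "02134", "90210", "123456"),
                  stringsAsFactors = FALSE)

# Number of zeros still missing per element (never negative)
missing <- pmax(0, 5 - nchar(dat$zip))

# strrep is vectorised over its `times` argument
dat$zip <- paste0(strrep("0", missing), dat$zip)
dat$zip
# [1] "00501"  "02134"  "90210"  "123456"
```

The six-character value passes through untouched because pmax clamps the repeat count at zero.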

Related

change numbers in string vector [duplicate]

This question already has answers here:
R: gsub, pattern = vector and replacement = vector
(6 answers)
Closed 3 years ago.
I have a string vector including numbers like this:
x <- c("abc122", "73dj", "lo7833ll")
x
[1] "abc122" "73dj" "lo7833ll"
I want to change the numbers in the x vector and replace them with numbers I have stored in another vector:
right_numbers <- c(500, 700, 23)
> right_numbers
[1] 500 700 23
How can I do this even if the numbers are in different positions in the string (some are at the beginning, some at the end)?
This is how the x vector should look after the changes:
> x
[1] "abc500" "700dj" "lo23ll"
A vectorized solution with stringr -
str_replace(x, "[0-9]+", as.character(right_numbers))
[1] "abc500" "700dj" "lo23ll"
Possibly a more efficient version with the stringi package, thanks to @sindri_baldur -
stri_replace_first_regex(x, '[0-9]+', right_numbers)
[1] "abc500" "700dj" "lo23ll"
Here is an idea,
mapply(function(i, y)gsub('[0-9]+', y, i), x, right_numbers)
# abc122 73dj lo7833ll
#"abc500" "700dj" "lo23ll"
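For a base R route without mapply, one possible sketch uses regexpr together with the replacement form of regmatches. It assumes every element of x contains at least one run of digits, so that the number of matches equals the number of replacements.

```r
x <- c("abc122", "73dj", "lo7833ll")
right_numbers <- c(500, 700, 23)

# Locate the first run of digits in each element
m <- regexpr("[0-9]+", x)

# Overwrite each matched run with the corresponding replacement
regmatches(x, m) <- as.character(right_numbers)
x
# [1] "abc500" "700dj"  "lo23ll"
```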

How to delete penultimate 0 in numeric column?

In numeric column old_code of dataframe df1 my integers have this structure:
head(df1$old_code)
[1] 12101 18201 13202 11301 13302 10401
In the column new_code I would like the same integers, minus the second-to-last 0 (i.e. 12101 I would like to see as 1211, 18201 as 1821). I am positive this can be solved using regex but I can't crack the code.
Thank you for your help!
You can try gsub (the result is a character vector):
gsub("(.*)0(.)$", "\\1\\2", df1$old_code)
# [1] "1211" "1821" "1322" "1131" "1332" "1041"
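You could try converting it into a string and dropping the second-to-last character:
code <- as.character(df1$old_code)
n <- nchar(code)
a <- substr(code, 1, n - 2)  # everything before the penultimate character
b <- substr(code, n, n)      # the last character
# assumes the penultimate character is always 0
df1$new_code <- as.integer(paste0(a, b))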
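Since the column in the question is numeric, the regex result can be converted back with as.integer. A minimal sketch, using the six values from the question plus one invented value (12111) to show that codes without a penultimate 0 are left alone:

```r
old_code <- c(12101L, 18201L, 13202L, 11301L, 13302L, 10401L, 12111L)

# "0(.)$" matches a 0 followed by the final character;
# the replacement keeps only that final character
new_code <- as.integer(sub("0(.)$", "\\1", old_code))
new_code
# [1]  1211  1821  1322  1131  1332  1041 12111
```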

How to extract the trailing digits from a string in R? [duplicate]

This question already has answers here:
Extract a substring according to a pattern
(9 answers)
Closed 4 years ago.
I have a column of data that looks like this:
**varX**
Q1#_1
Q1#_5
Q1#_10
I would like to edit the data to look like this:
**varX**
1
5
10
Is there a command I could use to simply keep all information after the underscore?
If you want a tidyverse solution, you can use str_extract from the stringr package:
library(dplyr)
library(stringr)
data %>%
  mutate(varX = str_extract(varX, "[0-9]+$")) %>%
  mutate(varX = as.numeric(varX)) # include this last line if you want a number and not character
In case you always have the Q1#_ string, you can do:
gsub("Q1#_", "", df$varX)
I think you're looking for sub, which substitutes a certain part of a string with something else. You can give it a regular expression if you want to go fancy, or just give it a literal:
VarX <- sub('Q1#_', '', VarX, fixed=TRUE)
The fancy way ("remove everything before and including the underscore") would be
VarX <- sub('^.*_', '', VarX)
And you may want to convert it to a numeric or an integer:
VarX <- as.integer(sub('Q1#_', '', VarX, fixed=TRUE)) # or as.numeric
You could use regular expressions:
df[["varX"]] <- sub(".+_", "", df[["varX"]])
df
  varX
1    1
2    5
3   10
Or regular expressions-free: with strsplit():
df[["varX"]] <- sapply(df[["varX"]], function(x) strsplit(x, "_")[[c(1,2)]])
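A base R counterpart to the str_extract idea is to pull out the trailing digits directly instead of deleting the prefix. This is only a sketch: regmatches silently drops elements that have no match, so it assumes every value ends in digits.

```r
varX <- c("Q1#_1", "Q1#_5", "Q1#_10")

# Find the run of digits anchored at the end of each string
m <- regexpr("[0-9]+$", varX)

# Extract the matched text and convert it to integer
trailing <- as.integer(regmatches(varX, m))
trailing
# [1]  1  5 10
```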

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add a leading zero if the number has fewer than 5 digits. However, some numbers have "HQ" at the end and some don't. (I did check other posts; they don't have a similar problem with the "HQ" part.)
So the final desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
Any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four digits in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course, if there are 5 digits, the regex doesn't match, so the string is left unchanged.
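One thing worth checking is how the pattern behaves on inputs shorter than four digits, which the question's data does not contain. The test strings below are invented for that purpose:

```r
# Hypothetical inputs: 3 digits, 2 digits + HQ, 5 digits, 4 digits + HQ
test_strings <- c("453", "45HQ", "12345", "7833HQ")

result <- gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2", test_strings)
result
# [1] "0453"    "045HQ"   "12345"   "07833HQ"
```

The replacement always adds exactly one zero, so it fits data where numbers have four or five digits but would not pad "453" to a full five.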
I was going to use the sprintf approach, but found that the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using @thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two digits. I would like the entire list of characters to have at least 3 digits like this (the issue is only in the BB tags if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
I've looked into the sprintf and formatC functions but am still having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub calls can be enough: add one leading zero for two-digit numbers, two leading zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
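One possible shape for a more general version is sketched below. The helper name `pad_tag` and its `width` argument are hypothetical, and the rule it encodes (pad only a digit run that ends the string and directly follows a letter) is an assumption chosen so that EP121_1 stays untouched.

```r
# Hypothetical helper: zero-pad a trailing digit run that directly follows a letter
pad_tag <- function(x, width = 3) {
  # The lookbehind requires a letter immediately before the digits (needs perl = TRUE)
  m <- regexpr("(?<=[A-Za-z])[0-9]+$", x, perl = TRUE)
  digits <- regmatches(x, m)           # matched digit runs only
  fmt <- paste0("%0", width, "d")      # e.g. "%03d"
  regmatches(x, m) <- sprintf(fmt, as.integer(digits))
  x
}

tags <- c("Y188", "EP121_1", "BB2", "BB20", "BB044")
padded_tags <- pad_tag(tags)
padded_tags
# [1] "Y188"    "EP121_1" "BB002"   "BB020"   "BB044"
```

Unlike the two sub calls, this is not limited to the BB prefix, which may or may not be what you want for other tags in the data.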
This works, but will have edge cases:
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
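It can be done using the sprintf and gsub functions. This step extracts the numeric values and changes their format.
num <- sprintf("%03d", as.numeric(gsub("[^[:digit:]]", "", x)))
The next step is to paste the reformatted numbers back onto the letters. Note that every element is rebuilt from its letters and digits only, so a tag such as EP121_1 loses its underscore.
x <- paste(gsub("[^[:alpha:]]", "", x), num, sep = "")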
