formatting the date in R - r

I have a date value as follows
"'2015-10-24'"
class Character
I am trying to format this value such that it looks like this '10/24/2015'
I know how to use noquote function and strip the quotes and gsub function to replace the - with / but I am not sure how to switch the year, date and month such that it looks like this '10/24/2015'
Any help is much appreciated.

We can convert to Date class after removing the ' with gsub, and then use format to get the expected output
format(as.Date(gsub("'", '', v1)), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
Or without using the gsub to remove ', we can specify the ' also in the format within as.Date
format(as.Date(v1, "'%Y-%m-%d'"), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
This can be made more compact if we are using library(lubridate)
library(lubridate)
format(ymd(v1), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
If we don't need the ' in the output, we don't have to specify that in the format,
format(ymd(v1), "%m/%d/%Y")
#[1] "10/24/2015" "10/25/2015"
Or we can do this using only gsub by capturing the characters as a group. In the below code, we capture the first 4 characters (.{4}) as a group by wrapping with parentheses followed by matching the -, then capturing the next two characters, followed by -, and capturing the last two characters. In the replacement, we can shuffle the capture groups as per the requirement. In this case, the second capture group should come first (\\2) followed by /, then the third (\\3) and so on...
gsub('(.{4})-(.{2})-(.{2})', '\\2/\\3/\\1', v1)
#[1] "'10/24/2015'" "'10/25/2015'"
To avoid the quotes,
gsub('.(.{4})-(.{2})-(.{2}).', '\\2/\\3/\\1', v1)
#[1] "10/24/2015" "10/25/2015"
In addition, there are other ways such as splitting the string
vapply(strsplit(v1, "['-]"), function(x) paste(x[c(3,4,2)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
or extracting the numeric part with str_extract_all and pasteing as before.
library(stringr)
vapply(str_extract_all(v1, '\\d+'), function(x)
paste(x[c(2,3,1)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
data
v1 <- c("'2015-10-24'", "'2015-10-25'")

You can also use the function strftime to get the result
d <- "'2015-10-24'"
strftime(as.Date(gsub("'", "", d)), "%m/%d/%Y")
# [1] "10/24/2015"

Related

Regex: Match first two digits of a four digit number

I have:
'30Jun2021'
I want to skip/remove the first two digits of the four digit number (or any other way of doing this):
'30Jun21'
I have tried:
^.{0,5}
https://regex101.com/r/hAJcdE/1
I have the first 5 characters but I have not figured out how to skip/remove the '20'
Manipulating datetimes is better using the dedicated date/time functions.
You can convert the variable to date and use format to get the output in any format.
x <- '30Jun2021'
format(as.Date(x, '%d%b%Y'), '%d%b%y')
#[1] "30Jun21"
You can also use lubridate::dmy(x) to convert x to date.
You don't even need regex for this. Just use substring operations:
x <- '30Jun2021'
paste0(substr(x, 1, 5), substr(x, 8, 9))
[1] "30Jun21"
Use sub
sub('\\d{2}(\\d{2})$', "\\1", x)
[1] "30Jun21"
or with str_remove
library(stringr)
str_remove(x, "\\d{2}(?=\\d{2}$)")
[1] "30Jun21"
data
x <- '30Jun2021'
You could also match the format of the string with 2 capture groups, where you would match the part that you want to omit and capture what you want to keep.
\b(\d+[A-Z][a-z]+)\d\d(\d\d)\b
Regex demo
sub("\\b(\\d+[A-Z][a-z]+)\\d\\d(\\d\\d)\\b", "\\1\\2", "30Jun2021")
Output
[1] "30Jun21"

Remove some special characters from a string and convert to decimal a column of a data frame [duplicate]

I have a column in a dataframe as follows:
COL1
$54,345
$65,231
$76,234
How do I convert it into this:
COL1
54345
65231
76234
The way I tried it at first was:
df$COL1<-as.numeric(as.character(df$COL1))
That didn't work because it said NA's were introduced.
Then I tried it like this:
df$COL1<-as.numeric(gsub("\\$","",as.character(df$COL1)))
And the same this happened.
Any ideas?
We could use parse_number from readr package which removes any non-numeric characters.
library(readr)
parse_number(df$COL1)
#[1] 54345 65231 76234
The reason why the gsub didn't work was there was , in the column, which is still non-numeric. So when convert to 'numeric' with as.numeric, all the non-numeric elements are converted to NA. So, we need to remove both , and $ to make it work.
df1$COL1 <- as.numeric(gsub('[$,]', '', df1$COL1))
We match the $ and , inside the square brackets ([$,]) so that it will be considered as that character ($ left alone has special meaning i.e. it signifies the end of the string.) and replace it with ''.
Or we can escape (\\) the character ($) to match it and replace by ''.
df1$COL1 <- as.numeric(gsub('\\$|,', '', df1$COL1))
Another option using stringr library to remove '$' and ',' then convert as follows:
df %>% mutate(COL1 = COL1 %>% str_remove_all("\\$,") %>% as.numeric())
Nested gsub to handle negatives and transform to make it functional and to take advantage of NSE
transform(df, COL1 = as.numeric(gsub("[$),]", "", gsub("^\\(", "-", COL1))))

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add leading zero if the number is less than 5 digits. However, some numbers have "HQ" in the end, some don't.(I did check other posts, they dont have similar problem in the "HQ" part)
so the finally desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four numbers in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course if there is 5 numbers, then the regex isn't matched, so it doesn't change.
I was going to use the sprintf approach, but found the the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using #thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) indicates one or more numeric elements, and (\\D+) is one or more non-numeric elements. We are wrapping it within the brackets to capture as a group and replace it with the backreference (\\1 - as it is the first backreference for the captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Doing this via base R, we can search for second character being V (as in AVOCADO) and substring 2 characters if that's true or 1 character if not. This will capture both AVOCADO and ALPHA. We then match those substrings with the letters extracted from ID (also convert toupper to capture Av with AV). Finally paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

Convert currency with commas into numeric

I have a column in a dataframe as follows:
COL1
$54,345
$65,231
$76,234
How do I convert it into this:
COL1
54345
65231
76234
The way I tried it at first was:
df$COL1<-as.numeric(as.character(df$COL1))
That didn't work because it said NA's were introduced.
Then I tried it like this:
df$COL1<-as.numeric(gsub("\\$","",as.character(df$COL1)))
And the same this happened.
Any ideas?
We could use parse_number from readr package which removes any non-numeric characters.
library(readr)
parse_number(df$COL1)
#[1] 54345 65231 76234
The reason why the gsub didn't work was there was , in the column, which is still non-numeric. So when convert to 'numeric' with as.numeric, all the non-numeric elements are converted to NA. So, we need to remove both , and $ to make it work.
df1$COL1 <- as.numeric(gsub('[$,]', '', df1$COL1))
We match the $ and , inside the square brackets ([$,]) so that it will be considered as that character ($ left alone has special meaning i.e. it signifies the end of the string.) and replace it with ''.
Or we can escape (\\) the character ($) to match it and replace by ''.
df1$COL1 <- as.numeric(gsub('\\$|,', '', df1$COL1))
Another option using stringr library to remove '$' and ',' then convert as follows:
df %>% mutate(COL1 = COL1 %>% str_remove_all("\\$,") %>% as.numeric())
Nested gsub to handle negatives and transform to make it functional and to take advantage of NSE
transform(df, COL1 = as.numeric(gsub("[$),]", "", gsub("^\\(", "-", COL1))))

Resources