How to change year format - r

I have a year column in my dataframe, which is formatted as financial year (e.g. 2015-16, 2016-17, etc). I want to change them to just 4-digit year in such a way that 2015-16 becomes 2016; 2016-17 becomes 2017, etc. How can I do it?

You can use parse_number from readr:
x <- c('2015-16', '2016-17')
readr::parse_number(x) + 1
#[1] 2016 2017
parse_number drops any non-numeric characters before or after the first number. So in this example, everything after the first number is dropped and the remainder is converted to numeric. We then add 1 to it to get the next year.

One possible solution is to strip everything from the - onwards and add 1:
as.numeric(sub('-.*', '', '2015-16')) + 1
#[1] 2016

We can use sub to capture the first two digits, match (and so drop) the next two digits and the -, and in the replacement specify the backreference (\\1) of the captured group. The last two digits are not matched, so they are kept:
as.numeric(sub("^(\\d{2})\\d{2}-", "\\1", v1))
#[1] 2016 2017
Or, more compactly, match the two digits followed by the - and replace them with an empty string (''):
sub("\\d{2}-", "", v1)
#[1] "2016" "2017"
Or using substr
paste0(substr(v1,1, 2), substr(v1, 6, 7))
#[1] "2016" "2017"
NOTE: None of these solutions requires an external package. They also do not implicitly assume that the range always spans exactly one year. Any range within the same century works, as below:
v2 <- c("2015-18", "2014-15", "2012-19")
sub("\\d{2}-", "", v2)
#[1] "2018" "2015" "2019"
data
v1 <- c("2015-16", "2016-17")
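The string-based answers above keep the first two digits of the start year, so a range that crosses a century (for example 1999-00) would need extra handling. A minimal sketch of one way to cover that case, assuming the input is always in the form YYYY-YY; the helper name fy_end_year is hypothetical and not part of any package:

```r
# Hypothetical helper: convert "YYYY-YY" financial years to the 4-digit end year.
# Assumes every element has a 4-digit start year, a hyphen, and a 2-digit end year.
fy_end_year <- function(x) {
  start <- as.integer(substr(x, 1, 4))        # e.g. 2015
  end2  <- as.integer(substr(x, 6, 7))        # e.g. 16
  end   <- (start %/% 100L) * 100L + end2     # same century as the start year
  # If the result is before the start year, the range crossed a century
  ifelse(end < start, end + 100L, end)
}

fy_end_year(c("2015-16", "2015-18", "1999-00"))
#[1] 2016 2018 2000
```

For a data frame column this could be applied as df$year <- fy_end_year(df$year), where df and year stand in for your own object and column names.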

Related

Extract first X digits of N digit numbers

How to select first 2 digits of a number? I just need the name of the function
Example: 12455 turns into 12, 13655 into 13
Basically it's the equivalent of substring for integers.
substr works here because it converts its input to character. If you need a numeric result again at the end, you can use
as.numeric(substr(x, 1, 2))
This solution uses gsub with the anchor ^ signifying the start of the string and \\d{2} for any two digits appearing at that position. The pattern is wrapped in (...) to mark it as a capturing group, and the backreference \\1 in the replacement argument 'recalls' that group:
x <- c(12455,13655)
gsub("(^\\d{2}).*", "\\1", x)
#[1] "12" "13"
Alternatively, use str_extract:
library(stringr)
str_extract(x, "^\\d{2}")
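The answers above go through character strings. Since the question asks for an equivalent of substring for integers, here is a minimal arithmetic sketch that stays numeric throughout. The helper name first_digits is hypothetical, and it assumes positive whole numbers. It also sidesteps the fact that large round doubles may be rendered in scientific notation (such as 1e+05) when converted to character.

```r
# Hypothetical helper: first k digits of positive whole numbers, using arithmetic only.
first_digits <- function(x, k = 2) {
  n_digits <- floor(log10(x)) + 1          # number of digits in each value
  # Integer-divide away the trailing digits; pmax() keeps short numbers unchanged
  x %/% 10^pmax(n_digits - k, 0)
}

x <- c(12455, 13655)
first_digits(x)
#[1] 12 13
```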

Extracting Date from text using R

My dataframe looks like
df <- setNames(data.frame(c("2 June 2004, 5 words, ()(","profit, Insight, 2 May 2004, 188 words, reports, by ()("), stringsAsFactors = F), "split")
What I want is to split the column into date and words. So far I found
"Extract date text from string"
lapply(df2, function(x) gsub(".*(\\d{2} \\w{3} \\d{4}).*", "\\1", x))
But it's not working with my example. Thanks for the help, as always.
As there is only a single column, we can apply gsub/sub directly after extracting the column. In the pattern, the day can have one or more digits, and the month names here have 3 ('May') or 4 ('June') characters, so the quantifiers need to change accordingly:
sub(".*\\b(\\d{1,} \\w{3,4} \\d{4}).*", "\\1", df$split)
#[1] "2 June 2004" "2 May 2004"
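Since the question asks for both the date and the word count as separate columns, the same idea can be applied twice. This is a minimal sketch, assuming every row contains a date of the form "day month year" and a count followed by the literal text " words". The column names date and words are made up for illustration, and \\w+ is used for the month so that longer month names also match:

```r
df <- data.frame(
  split = c("2 June 2004, 5 words, ()(",
            "profit, Insight, 2 May 2004, 188 words, reports, by ()("),
  stringsAsFactors = FALSE
)

# Date: 1-2 digit day, a month name of any length, 4-digit year
df$date  <- sub(".*\\b(\\d{1,2} \\w+ \\d{4}).*", "\\1", df$split)
# Word count: the number immediately before " words"
df$words <- as.integer(sub(".*\\b(\\d+) words.*", "\\1", df$split))

df$date
#[1] "2 June 2004" "2 May 2004"
df$words
#[1]   5 188
```

Note that sub returns the input unchanged when the pattern does not match, so rows without a date or word count would need a separate check.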

Regex pattern questions in r

I need to match author and time from string in R.
test = "Postedby BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."
I am currently using gsub() to find the desired output.
Expected output would be:
#author
"BeauHDon"
#Month
"November"
#Date
24
#Time
22:30
I got to gsub("Postedby (.*).*", "\\1", test) but the output is
"BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."
I also understand that the time requires more coding after extracting 10:30.
Is it possible to add 12 if the next two characters are PM?
Thank you.
We can extract the pieces with capture groups (assuming the strings follow the pattern shown in the example). From the start (^) of the string, the pattern matches one or more non-whitespace characters (\\S+) followed by whitespace (\\s+), captures the author as a word ((\\w+)), skips the next word (the weekday) and the whitespace around it, captures the month ((\\w+)) and the day ((\\d+)), skips everything up to the #, and captures the time that follows it ((\\S+)):
v1 <- scan(text=sub("^\\S+\\s+(\\w+)\\s+\\w+\\s+(\\w+)\\s+(\\d+)[^#]+#(\\S+).*",
"\\1,\\2,\\3,\\4", test), what = "", sep=",", quiet = TRUE)
As the last entry is the time, we can parse it with strptime, reformat it as 24-hour time, and assign the result back to the last element:
v1[4] <- format(strptime(v1[4], "%I:%M %p"), "%H:%M")
If needed, name the elements (author, Month, etc.):
names(v1) <- c("#author", "#Month", "#Date", "#Time")
v1
# #author #Month #Date #Time
#"BeauHDon" "November" "24" "22:30"
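The question also asks whether 12 can be added when the time ends in PM. A minimal sketch of that manual approach follows; the helper name to_24h is hypothetical, and it assumes times of the form H:MM or HH:MM followed by AM or PM. Unlike %p in strptime, it does not depend on the locale's AM/PM strings.

```r
# Hypothetical helper: convert "10:30PM"-style times to 24-hour "HH:MM".
to_24h <- function(x) {
  parts <- regmatches(x, regexec("^(\\d{1,2}):(\\d{2})\\s*([AP]M)$", x))
  vapply(parts, function(p) {
    if (length(p) == 0) return(NA_character_)   # input did not match the pattern
    hour <- as.integer(p[2]) %% 12L             # 12AM -> 0, 12PM -> 0 (fixed below)
    if (p[4] == "PM") hour <- hour + 12L        # add 12 for PM times
    sprintf("%02d:%s", hour, p[3])
  }, character(1), USE.NAMES = FALSE)
}

to_24h(c("10:30PM", "12:05AM", "12:00PM"))
#[1] "22:30" "00:05" "12:00"
```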

R - gsub() to remove dates from data set

I am using the gsub() function to remove the unwanted text from the data. I just want to have the age in the brackets, not the dates of birth. However, this is in a large data set with differing birth days.
Example of the data:
Test1$Age
Sep 10, 1990(27)
Mar 26, 1987(30
Feb 24, 1997(20)
You can do this using str_extract() from the stringr package:
s <- "Sep 10, 1990(27)"
# get the age in parentheses
stringr::str_extract(s, "\\([0-9]+\\)")
# just the age, with parentheses removed
stringr::str_extract(s, "(?<=\\()[0-9]+")
And the output is:
> s <- "Sep 10, 1990(27)"
>
> # get the age in parentheses
> stringr::str_extract(s, "\\([0-9]+\\)")
[1] "(27)"
>
> # just the age, with parentheses removed
> stringr::str_extract(s, "(?<=\\()[0-9]+")
[1] "27"
The first regular expression matches paired parentheses containing one or more digits. The second regular expression uses positive lookbehind to match one or more digits following an opening parenthesis.
If your data is in a data.frame df with the column named age, then you could do the following:
df$age <- stringr::str_extract(df$age, "\\([0-9]+\\)")
Or, in tidyverse notation:
df <- df %>% mutate(age = stringr::str_extract(age, "\\([0-9]+\\)"))
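Note that the first pattern requires a closing parenthesis, so an entry such as "Mar 26, 1987(30" from the example data would not match it. The lookbehind pattern only needs the opening parenthesis. A minimal sketch of the same lookbehind in base R, assuming every entry contains an opening parenthesis followed by digits (regmatches silently drops entries that do not match):

```r
ages_raw <- c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)")

# perl = TRUE is needed for the lookbehind (?<=...)
ages <- as.integer(regmatches(ages_raw,
                              regexpr("(?<=\\()[0-9]+", ages_raw, perl = TRUE)))
ages
#[1] 27 30 20
```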
There seem to be two problems:
the date prior to the left parenthesis is not wanted
the right parenthesis is sometimes missing and it needs to be inserted
1) sub These can be addressed with sub. Match
any number of characters .* followed by
a literal left parenthesis [(] followed by
digits in a capture group (\\d+) followed by
an optional right parenthesis [)]?
and then replace that with a left parenthesis, the match to the capture group \\1 and a right parenthesis.
No packages are used.
pat <- ".*[(](\\d+)[)]?"
transform(test, Age = sub(pat, "(\\1)", Age))
If, instead, you wanted the age as a numeric field then:
transform(test, Age = as.numeric(sub(pat, "\\1", Age)))
2) substring/sub Another possibility is to take the 13th character onwards which gives everything from the left parenthesis to the end of the string and insert a ) if missing. )?$ matches a right parenthesis at the end of the string or just the end of the string if none. That is replaced with a right parenthesis. Again, no packages are used.
transform(test, Age = sub(")?$", ")", substring(Age, 13)))
A variation of this if we wanted a numeric Age instead would be to take everything from the 14th character and remove the final ) if present.
transform(test, Age = as.numeric(sub(")", "", substring(Age, 14))))
3) read.table Use read.table to read the Age field with sep = "(" and comment.char = ")" and pick off the second column read. This will give the numeric age and we can use sprintf to surround that with parentheses. If Age were character (as opposed to factor) then as.character(Age) could optionally be written as just Age.
Again, no packages are used. This one does not use regular expressions.
transform(test, Age =
sprintf("(%s)", read.table(text = as.character(Age), sep = "(", comment.char = ")")$V2))
Note: The input in reproducible form is:
test <- data.frame(Age = c("Sep 10, 1990(27)", "Mar 26, 1987(30", "Feb 24, 1997(20)"))

gsub and returning the correct number in a string

I have a text string in a data frame like the following
2 Sector. District 1, Area 1
My goal is to extract the number before Sector or else return blank.
I thought the following regex would work:
gsub("^(?:([0-9]+).*Sector.*|.*)$","\\1",TEXTSTRINGCOLUMN)
This correctly returns nothing when the word Sector is not present, but returns 1 rather than 2. Greatly appreciate help on where I am going wrong. Thanks!
We can use a regex lookahead for "Sector", capture the numbers as a group and in the replacement specify the capture group (\\1).
sub('.*?(\\d+)\\s*(?=Sector).*', '\\1', v1, perl=TRUE)
#[1] "2"
EDIT: Modified based on #Avinash Raj's comment.
Without using the lookarounds, (credit to #Avinash Raj)
sub('.*?(\\d+)\\s*Sector.*', '\\1', v1)
data
v1 <- "2 Sector. District 1, Area 1"
Try,
x <- "2 Sector. District 1, Area 1"
substring(x, 0, as.integer(grepl("Sector", x)))
#[1] "2"
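The substring one-liner above returns only the first character of the string when Sector is present, and the sub calls return the input unchanged when it is not. If multi-digit numbers and a blank result for non-matching rows are both needed, the two ideas can be combined. A minimal sketch, where the helper name extract_sector_number is hypothetical:

```r
# Hypothetical helper: number immediately before "Sector", or "" if there is none.
extract_sector_number <- function(x) {
  pat <- "^.*?(\\d+)\\s*Sector.*$"
  # sub() leaves non-matching strings untouched, so test for a match first
  ifelse(grepl(pat, x, perl = TRUE), sub(pat, "\\1", x, perl = TRUE), "")
}

extract_sector_number(c("2 Sector. District 1, Area 1",
                        "District 1, Area 1",
                        "12 Sector. District 3"))
#[1] "2"  ""   "12"
```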
