A column in my dataset contains dates given as a month name and a year. I want to change the month name to its number.
My dataset looks like this (but is not limited to only 3 rows):
I want to change the ldr_start column to this:
ldr_start
3/92
7/93
8/93
Thank you.
It's not really a "date" in either case. The zoo package does define a 'yearmon' class whose use in this case has been illustrated by one of that package's authors, @G.Grothendieck.
Here we can just use strsplit to split off the month abbreviation, match it against month.abb (see ?Constants), and then rejoin:
dat <- scan(text="Mar-92,Feb-93,Jul-94,Sep-95", what = "", sep=",")
#Read 4 items
datspl <- strsplit(dat, split="-")
sapply( datspl, function(mnyr){ paste( match(mnyr[1], month.abb), mnyr[2], sep="/")})
#[1] "3/92" "2/93" "7/94" "9/95"
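The split-and-rejoin step can also be written without sapply. This is a minimal sketch of the same idea, assuming every element has the form Mon-yy with an English abbreviation found in month.abb; the names mon, yr and res are only for illustration.

```r
dat <- c("Mar-92", "Feb-93", "Jul-94", "Sep-95")
mon <- match(sub("-.*$", "", dat), month.abb)  # text before the dash -> month number 1..12
yr  <- sub("^.*-", "", dat)                    # two-digit year after the dash
res <- paste(mon, yr, sep = "/")
res
#[1] "3/92" "2/93" "7/94" "9/95"
```

A month that is not in month.abb comes back as NA from match, so bad input shows up as "NA/yy" instead of raising an error.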
Using the input defined reproducibly in the Note at the end we have some alternatives.
1) yearmon Convert the input column to yearmon class and then format it in the desired fashion. See ?strptime for information on the percent codes.
library(zoo)
transform(DF, start2 = sub("^0", "", format(as.yearmon(start, "%b-%y"), "%m/%y")))
## start start2
## 1 Mar-92 3/92
1a) If it is ok to have a leading zero on one digit months then we can omit the sub and just write:
transform(DF, start2 = format(as.yearmon(start, "%b-%y"), "%m/%y"))
## start start2
## 1 Mar-92 03/92
2) Base R Using only base R we can append 1 to start to make it a valid date (valid dates require year, month and day and not just month and year) and then proceed in a similar way as in (1) or (1a).
transform(DF,
start2 = sub("^0", "", format(as.Date(paste(start, 1), "%b-%y %d"), "%m/%y")))
## start start2
## 1 Mar-92 3/92
Note
DF <- data.frame(start = "Mar-92") # test data frame
We could also use stringr's str_replace_all:
library(stringr)
data <- c("Mar-92", "Jul-93", "Aug-93")
str_replace_all(data, setNames(as.character(1:12), month.abb))
Output:
[1] "3-92" "7-93" "8-93"
Update 14/aug: Why this works
As pointed out by @IRTFM this functionality might be unexpected at first glance. However, it can be found in the documentation for the replacement argument:
To perform multiple replacements in each element of string, pass a named vector (c(pattern1 = replacement1)) to str_replace_all. Alternatively, pass a function to replacement: it will be called once for each match and its return value will be used to replace the match.
The functionality is also evident in the code. If we pass a named vector, the names get assigned to the 'pattern' argument and the values get assigned to the 'replacement' argument, exactly as expected:
if (!is.null(names(pattern))) {
vec <- FALSE
replacement <- unname(pattern)
pattern[] <- names(pattern)
}
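To make the names-as-patterns idea concrete, here is a rough base R equivalent. It is a sketch of the behaviour only, not stringr's actual implementation, and the names lookup and out are made up for illustration.

```r
data <- c("Mar-92", "Jul-93", "Aug-93")
lookup <- setNames(as.character(1:12), month.abb)  # names = patterns, values = replacements

out <- data
for (i in seq_along(lookup)) {
  # apply each pattern/replacement pair in turn, as literal text
  out <- gsub(names(lookup)[i], lookup[[i]], out, fixed = TRUE)
}
out
#[1] "3-92" "7-93" "8-93"
```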
Related
I have dates listed as "X5.13.1996", representing May 13th, 1996. The class for the date column is currently a character.
When using mdy from lubridate, it keeps returning NA. Is there code I can use to get rid of the "X" so that mdy works? Is there anything else I can do?
You can use substring(date_variable, 2) to drop the first character from the string.
substring("X5.13.1996", 2)
[1] "5.13.1996"
To convert a variable (i.e., column) in your data frame:
library(dplyr)
library(lubridate)
dates <- data.frame(
dt = c("X5.13.1996", "X11.15.2021")
)
dates %>%
mutate(converted = mdy(substring(dt, 2)))
or, without dplyr:
dates$converted <- mdy(substring(dates$dt, 2))
Output:
dt converted
1 X5.13.1996 1996-05-13
2 X11.15.2021 2021-11-15
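If you would rather not load lubridate, a base R sketch of the same conversion is shown here. It assumes every value starts with a literal "X" followed by month.day.year with a four-digit year; the variable names are illustrative.

```r
dt <- c("X5.13.1996", "X11.15.2021")
# strip the leading "X", then parse month.day.year
converted <- as.Date(sub("^X", "", dt), format = "%m.%d.%Y")
converted
#[1] "1996-05-13" "2021-11-15"
```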
I'm relatively new to R and need help with applying a regex-based substitution.
I have a data frame in one column of which I have a sequence of digits (my values of interest) followed by a string of all sorts of characters.
Example:
4623(randomcharacters)
I need to remove everything after the initial digits to continue working with the values. My idea was to use gsub to remove the non-digit characters by positive lookbehind.
The code I have is:
sub_function <- function() {
gsub("?<=[[:digit:]].", " ", fixed = T)
}
data_frame$`x` <- data_known$`x` %>%
sapply(sub_function)
But I then get the error:
Error in FUN(X[[i]], ...) : unused argument (X[[i]])
Any help would be greatly appreciated!
Here is a base R function.
It uses sub, not gsub, since there will be only one substitution. And there's no need for lookbehind: the metacharacter ^ marks the beginning of the string, followed by an optional minus sign (strictly, -* allows any number of them; -? would allow at most one), followed by at least one digit. Everything else is discarded.
sub_function <- function(x){
sub("(^-*[[:digit:]]+).*", "\\1", x)
}
data <- data.frame(x = c("4623(randomcharacters)", "-4623(randomcharacters)"))
sub_function(data$x)
#[1] "4623" "-4623"
Edit
With this simple modification the function returns a numeric vector.
sub_function <- function(x){
y <- sub("(^-*[[:digit:]]+).*", "\\1", x)
as.numeric(y)
}
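It may help to see what happens when a value has no leading digits. The sketch below is a variant of the function above, not the original: it assumes at most one minus sign (-? instead of -*) and silences the coercion warning, so such values simply become NA.

```r
sub_function <- function(x) {
  y <- sub("^(-?[[:digit:]]+).*", "\\1", x)  # keep only the leading (signed) digits
  suppressWarnings(as.numeric(y))            # strings that did not match become NA
}
res <- sub_function(c("4623(randomcharacters)", "-12xyz", "abc"))
res
#[1] 4623  -12   NA
```

When the pattern does not match, sub returns the input unchanged, which is why "abc" ends up as NA after as.numeric.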
There are a few ways to accomplish this, but I like using functions from {tidyverse}:
library(tidyverse)
# Create some dummy data
df <- tibble(targetcol = c("4658(randomcharacters)", "5847(randomcharacters)", "4958(randomcharacters)"))
df <- mutate(df, just_digits = str_extract(targetcol, pattern = "^[[:digit:]]+"))
Output (contents of df):
targetcol just_digits
<chr> <chr>
1 4658(randomcharacters) 4658
2 5847(randomcharacters) 5847
3 4958(randomcharacters) 4958
If you always want to extract numbers from the data, you can use parse_number from readr. It will also return data in numeric form by default.
Using @Rory S's data.
sub_function <- function(x) {
readr::parse_number(x)
}
sub_function(df$targetcol)
#[1] 4658 5847 4958
I would like to add a number indicating the x^th occurrence of a word in a vector. (So this question is different from Make a column with duplicated values unique in a dataframe, because I have a simple vector and want to avoid the overhead of casting it to a data.frame).
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by transposing the vector to a dataframe and next using the grouping function. That feels like using a sledgehammer to crack nuts:
# add consecutive number for equal string
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove '1' for first occurrence
wordsVec <- paste0(df$words, df$seqN)
gsub("1", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a cleaner solution, e.g. using the stringr package?
You can still utilize row_number() from dplyr but you don't need to convert to data frame, i.e.
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"
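For completeness, the same numbering can be done with base R alone. This is a minimal sketch using ave with seq_along as the per-group counter; n and out are illustrative names.

```r
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
n <- ave(seq_along(words), words, FUN = seq_along)  # running count within each word
out <- ifelse(n == 1, words, paste0(words, n))      # leave the first occurrence bare
out
#[1] "book"     "ship"     "umbrella" "book2"    "ship2"    "ship3"
```

Because the suffix is only added when the count is above 1, no later step is needed to strip a trailing "1".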
I am using RStudio for my R sessions and I have the following R code:
d1 <- read.csv("mydata.csv", stringsAsFactors = FALSE, header = TRUE)
d2 <- d1 %>%
mutate(PickUpDate = ymd(PickUpDate))
str(d2$PickUpDate)
The output of the last line of code above is as follows:
Date[1:14258], format: "2016-10-21" "2016-07-15" "2016-07-01" "2016-07-01" "2016-07-01" "2016-07-01" ...
I need an additional column (let's call it MthDD) in the dataframe d2, which will be the month and day of the "PickUpDate" column. So, column MthDD needs to be in the format mm-dd but, most importantly, it should still be of the date type.
How can I achieve this?
UPDATE:
I have tried the following but it outputs the new column as a character type. I need the column to be of the date type so that I can use it as the x-axis component of a plot.
d2$MthDD <- format(as.Date(d2$PickUpDate), "%m-%d")
Date objects do not display as mm-dd. You can create a character string with that representation but it will no longer be of Date class -- it will be of character class.
If you want an object that displays as mm-dd and still acts like a Date object what you can do is create a new S3 subclass of Date that displays in the way you want and use that. Here we create a subclass of Date called mmdd with an as.mmdd generic, an as.mmdd.Date method, an as.Date.mmdd method and a format.mmdd method. The last one will be used when displaying it. mmdd will inherit methods from Date class but you may still need to define additional methods depending on what else you want to do -- you may need to experiment a bit.
as.mmdd <- function(x, ...) UseMethod("as.mmdd")
as.mmdd.Date <- function(x, ...) structure(x, class = c("mmdd", "Date"))
as.Date.mmdd <- function(x, ...) structure(x, class = "Date")
format.mmdd <- function(x, format = "%m-%d", ...) format(as.Date(x), format = format, ...)
DF <- data.frame(x = as.Date("2018-03-26") + 0:2) # test data
DF2 <- transform(DF, y = as.mmdd(x))
giving:
> DF2
x y
1 2018-03-26 03-26
2 2018-03-27 03-27
3 2018-03-28 03-28
> class(DF2$y)
[1] "mmdd" "Date"
> as.Date(DF2$y)
[1] "2018-03-26" "2018-03-27" "2018-03-28"
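If the only goal is an mm-dd x-axis on a plot, one alternative is to keep the column as a plain Date and format the tick labels instead. This is a minimal base-graphics sketch with made-up data; the data frame d and its column n are hypothetical.

```r
d <- data.frame(PickUpDate = as.Date("2016-07-01") + 0:9, n = 1:10)  # made-up data

# draw the plot without the default x axis, then add one labelled as mm-dd
plot(d$PickUpDate, d$n, xaxt = "n", xlab = "PickUpDate (mm-dd)", ylab = "n")
axis.Date(1, at = d$PickUpDate, format = "%m-%d")

labs <- format(d$PickUpDate, "%m-%d")  # the label text the axis shows
```

The plotted values remain real dates, so spacing and ordering on the axis are still correct; only the labels change.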
Try using this:
PickUpDate2 <- format(PickUpDate,"%m-%d")
PickUpDate2 <- as.Date(PickUpDate2, "%m-%d")
This yields a Date column that you can bind_cols afterwards, or add to the data frame right away as in the code you provided. Note that as.Date fills in the current year when the format has none, so the result still prints as a full yyyy-mm-dd date. Applied to d2, the code becomes:
d2$PickUpDate2 <- format(d2$PickUpDate,"%m-%d")
d2$PickUpDate2 <- as.Date(d2$PickUpDate2, "%m-%d")
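A small check of what this round trip produces, as a sketch with a single made-up date. It relies on the behaviour documented in ?strptime that an unspecified year is taken to be the current one.

```r
x <- as.Date("2016-10-21")                  # made-up example date
y <- as.Date(format(x, "%m-%d"), "%m-%d")   # drop the year, then parse again

md <- format(y, "%m-%d")                    # month and day survive: "10-21"
same_year <- format(y, "%Y") == format(Sys.Date(), "%Y")  # year is now the current year
```

So the column is of class Date, but the original year is lost and the values still print as full dates. A date such as 02-29 would become NA whenever the current year is not a leap year.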
I have a data frame with a few columns, the last of which is called Filename. This is what it looks like.
Product Company Filename
… … mg-tvd_bmmh_20170930.csv
… … mg-tvd_bmmh_2016_06_13.csv
… … …
I am trying to write a short script in R which extracts the date from each filename and stores it in a new column called Date. So the new data frame would look like this:
Product Company Date Filename
… … 09/30/2017 mg-tvd_bmmh_20170930.csv
… … 06/13/2016 mg-tvd_bmmh_2016_06_13.csv
… … … …
This is a relevant piece of my script.
df <- mutate(df, Date <- grep(pattern = "(\d{4})_?(\d{2})_?
(\d{1,2})", df$Filename, value = TRUE))
ddf$Date <- as.Date(Date,format = "%m/%d/%y")
Any advice why I can't get it working?
I am getting these errors:
Error: '\d' is an unrecognized escape in character string starting ""(\d"
Error in as.Date(Date, format = "%m/%d/%y") :
object 'Date' not found
You can use this command:
transform(df, Date = as.Date(sub(".*\\D(\\d{4})_?(\\d{2})_?(\\d{1,2}).*",
"\\1\\2\\3", Filename), "%Y%m%d"))
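To see what the regular expression extracts before as.Date is applied, here is a short sketch on the two filenames from the question. The data frame is rebuilt by hand, and digits, dates and shown are illustrative names.

```r
df <- data.frame(Filename = c("mg-tvd_bmmh_20170930.csv",
                              "mg-tvd_bmmh_2016_06_13.csv"))

# keep only year, month and day, dropping any underscores between them
digits <- sub(".*\\D(\\d{4})_?(\\d{2})_?(\\d{1,2}).*", "\\1\\2\\3", df$Filename)
digits
#[1] "20170930" "20160613"

dates <- as.Date(digits, "%Y%m%d")
shown <- format(dates, "%m/%d/%Y")  # character version in the mm/dd/yyyy layout
shown
#[1] "09/30/2017" "06/13/2016"
```

The Date column itself always prints as yyyy-mm-dd; format is only needed if the mm/dd/yyyy text is wanted for display.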
You are getting the error because instead of:
ddf$Date <- as.Date(Date,format = "%m/%d/%y")
you should have:
df$Date <- as.Date(df$Date,format = "%Y/%m/%d")
or:
df %>%
mutate(Date = as.Date(df$Date,format = "%Y/%m/%d"))
The incorrect format specification format = "%m/%d/%y" would give you NA values in Date, while as.Date(Date, ... refers to an object Date that does not exist, which is what throws the error.
You can also use str_extract from stringr to extract the dates and ymd from lubridate to parse them into Date objects:
library(dplyr)
library(stringr)
library(lubridate)
df %>%
mutate(Date = ymd(str_extract(Filename, "\\d{4}_?\\d{2}_?\\d{2}(?=\\.csv)")))
Result:
Product Company Filename Date
1 1 3 mg-tvd_bmmh_20170930.csv 2017-09-30
2 2 4 mg-tvd_bmmh_2016_06_13.csv 2016-06-13
The advantage with ymd is that it "...recognize arbitrary non-digit separators as well as no separator..." So there is no need to standardize the Date character vector before parsing. For instance,
> df$Filename %>% str_extract("\\d{4}_?\\d{2}_?\\d{2}(?=\\.csv)")
[1] "20170930" "2016_06_13"
The error you show arises because a backslash inside an R string literal must itself be escaped, so regex escapes need a double backslash (e.g. \d should be written \\d). I would suggest using sub for the regex portion so you can control the output, and adding the * quantifier after the underscores so the pattern matches whether or not an underscore is present (as in your example).
The format in as.Date needs a capital Y (%Y) for a four-digit year.
The updated code would be:
df <- mutate(df, Date = sub(pattern = ".*_(\\d{4})_*(\\d{2})_*(\\d{1,2}).*", "\\2/\\3/\\1", df$Filename))
df$Date <- as.Date(df$Date,format = "%m/%d/%Y")
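The escaping rule can be checked directly in the console. A minimal sketch, with pattern as an illustrative name:

```r
pattern <- "\\d{4}"   # typed with two backslashes in the source code

n_chars <- nchar(pattern)   # 6: the string itself holds a single backslash
cat(pattern, "\n")          # prints \d{4}, which is what the regex engine sees

found <- grepl(pattern, "mg-tvd_bmmh_2016_06_13.csv")  # TRUE: four digits in a row exist
```

Writing "\d{4}" with a single backslash never reaches the regex engine at all; R rejects it while parsing the string, which is the first error in the question.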