Changing character (time) to numeric in R

My data set contains a character column showing time as "12:00:00 PM". Could you please let me know if there is any way to show this as 1200 (numeric) value?
I have tried the following code, but it gives me a large integer:
time = as.numeric(as.POSIXct(time, format = "%H:%M:%S %p"))

You can also just use string functions to get the first four digits:
library(stringr)
library(magrittr)
"12:00:00 PM" %>%
str_split_fixed(":", 3) %>%
extract(1:2) %>%
str_c(collapse = "") %>%
as.numeric()
#> [1] 1200
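Note that the string approach ignores the AM/PM marker, so "01:30:00 PM" would come out as 130 rather than 1330. The large number in the question is the count of seconds since 1970-01-01, which is what as.numeric() returns for a POSIXct value. A minimal base R sketch that parses the 12-hour clock with %I and %p and then formats the result as HHMM is shown here; the helper name to_hhmm is hypothetical, and matching of %p depends on the locale (an English or C locale is assumed):

```r
# Hypothetical helper: convert "hh:mm:ss AM/PM" strings to numeric HHMM (24-hour clock).
# Assumes an English/C locale so that %p recognises "AM" and "PM".
to_hhmm <- function(x) {
  parsed <- strptime(x, format = "%I:%M:%S %p", tz = "UTC")  # %I = hour on a 12-hour clock
  as.numeric(format(parsed, "%H%M"))                         # e.g. "1330" becomes 1330
}

to_hhmm(c("12:00:00 PM", "01:30:00 PM", "12:05:00 AM"))
```

Leading zeros are lost in the numeric form, so 12:05 AM becomes 5. If they matter, keep the result of format() as a character string instead.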

Related

How to fix date typos in R?

I have a data set with a good hundred thousand lines in it.
Somehow, the data provider sent it to me with all the dates formatted like 1/1/20202021 08:07:43 AM (mdy_hms). The correct year is the last four digits of the year field in every row.
lubridate::mdy_hms() obviously can't recognize this, so I am trying to figure out how I could use grep or similar to pull out the correct date-time. Any ideas?
Thanks everyone (:
You can handle this with functions in the stringr package. First, get the correct year by extracting it from the date variable. For example,
library(stringr)
date_value <- "1/1/20202021 08:07:43 AM"
correct_year <- str_sub(
str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
)
This returns "2021 ". You can now use str_replace() to replace the 8-digit bad year with correct_year:
str_replace(date_value, pattern = "\\d{8}\\s", replacement = correct_year)
[1] "1/1/2021 08:07:43 AM"
To perform this operation across the whole data frame you can do something like this:
library(tidyverse)
df %>%
  mutate(
    date_value = str_replace(
      date_value,
      pattern = "\\d{8}\\s",
      replacement = str_sub(
        str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
      )
    )
  )
You can extract only the 2nd 4-digit year with sub.
x <- "1/1/20202021 08:07:43 AM"
lubridate::mdy_hms(sub('(\\d{4})(\\d{4})', '\\2', x))
#[1] "2021-01-01 08:07:43 UTC"
To apply this to an entire column, replace x with df$column_name.
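As a rough sketch of the column-wide version in base R, assuming some rows may already have a correct four-digit year: the pattern below is anchored on the preceding "/" and the following space so that only an eight-digit year field is rewritten. The data frame and the column names date_fixed and date_time are made up for illustration, and %p again assumes an English or C locale.

```r
# Example data: one malformed year and one that is already correct (made-up values).
df <- data.frame(
  date_value = c("1/1/20202021 08:07:43 AM", "12/31/2019 11:59:59 PM"),
  stringsAsFactors = FALSE
)

# Keep only the last four digits when the year field has eight digits.
# The "/" and the trailing space anchor the match to the year field.
df$date_fixed <- sub("/\\d{4}(\\d{4}) ", "/\\1 ", df$date_value)

# Parse the cleaned strings; %I and %p handle the 12-hour clock.
df$date_time <- as.POSIXct(df$date_fixed,
                           format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
```

Rows that do not match the pattern are left unchanged by sub(), so already-clean dates pass through untouched.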

How to explain the different behavior of `as.character()` on a `POSIXct` field: `select(foodate)` versus `$foodate`?

What is the explanation for how these two different indexing methods result in different output from as.character() ?
> df <- data.frame(date=c( as.POSIXct("2021-01-15"), as.POSIXct("2021-01-16")))
> df
date
1 2021-01-15
2 2021-01-16
> df$date %>% as.character()
[1] "2021-01-15" "2021-01-16"
> df %>% select(date) %>% as.character()
[1] "c(1610697600, 1610784000)"
I'd like to be able to use the dplyr syntax to convert a heterogeneous collection of fields to strings, so using format() to convert the dates would require some conditional logic. Is there a way to get the field with select() and still have as.character() return the formatted date strings rather than the seconds-since-epoch?
We need pull() instead of select() here: select() returns a data.frame/tibble with one column, whereas pull() returns a vector, and as.character() expects a vector as input.
library(dplyr)
df %>%
pull(date) %>%
as.character()
#[1] "2021-01-15" "2021-01-16"
With the tidyverse, the conversion can be done inside mutate(), which transforms an existing column or creates a new one:
df %>%
select(date) %>%
mutate(date = as.character(date))
The observed behavior is not caused by select(); it comes from what as.character() does with its input. For example, calling it on the whole data frame gives:
as.character(df)
#[1] "c(1610686800, 1610773200)"
whereas, extracting a vector
as.character(df[,1])
#[1] "2021-01-15" "2021-01-16"
It is worth checking the structure with str() before attempting the conversion:
df %>%
  dplyr::select(date) %>%
  str
#'data.frame': 2 obs. of 1 variable:
# $ date: POSIXct, format: "2021-01-15" "2021-01-16"
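For the stated goal of converting a heterogeneous set of columns to strings, one possible sketch applies as.character() column by column with across() (available in dplyr 1.0.0 and later). The data frame below is made up for illustration; because each column is passed to as.character() as a vector, the date column is formatted rather than shown as seconds since the epoch.

```r
library(dplyr)

# Made-up data frame mixing a POSIXct column with numeric and logical columns.
df <- data.frame(
  date = as.POSIXct(c("2021-01-15", "2021-01-16"), tz = "UTC"),
  n    = c(1.5, 2),
  flag = c(TRUE, FALSE)
)

# across() applies as.character() to each column vector in turn,
# so the POSIXct method is dispatched for the date column.
df_chr <- df %>%
  mutate(across(everything(), as.character))

str(df_chr)
```

No conditional logic per column type is needed, since method dispatch picks the right as.character() method for each column.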

Extracting String from Column

I am working with the following dataset called results and am trying to add in a column that only contains the date (ideally just the year) of the row.
I am trying to extract just the date (for example: 2012-02-10) from the column_label column.
This is the code that I use:
pattern <- "- (.*?) .RData"
subsetpk <- results %>%
filter(team=="Pakistan") %>%
mutate(year = str_extract(column_label, pattern))
This, however, only gives me NA values.
You can use a regular expression. Here '\\d{4}' just matches the first 4 consecutive digits that are found in the string. This works if your data always looks the same as your example. If not, you may need something more sophisticated. If this doesn't work, post some more example data.
library(tidyverse)
library(stringr)
df <- data.frame(column_label = c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs"))
df %>%
mutate(my_year = str_extract(column_label, '\\d{4}'))
#                                  column_label my_year
#1 Afghanistan-Pakistan-2012-02-10.RDATA.overs    2012
#2 Afghanistan-Pakistan-2019-02-10.RDATA.overs    2019
The ymd() function from the lubridate package "transforms dates stored in character and numeric vectors to Date or POSIXct objects", so we can pass the complete string conveniently without having to deal with regular expressions:
x <- c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs")
lubridate::ymd(x)
[1] "2012-02-10" "2019-02-10"
The year can be derived from the extracted dates by
library(lubridate)
year(ymd(x))
[1] 2012 2019
Use str_extract from the package stringr:
DATA:
results <- data.frame(
column_label = "Afghanistan-Pakistan-2012-02-10.RData.overs")
SOLUTION:
results$date <- str_extract(results$column_label, "\\d{4}-\\d{2}-\\d{2}")
RESULT:
results
column_label date
1 Afghanistan-Pakistan-2012-02-10.RData.overs 2012-02-10
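The pattern in the question returns NA because it requires a hyphen followed by a space and a space before ".RData", and neither occurs in the labels. A small sketch, assuming the date always appears as yyyy-mm-dd, that shows the failing pattern next to a working one and then derives the year as an integer (the second label is a made-up example without a date):

```r
library(stringr)

column_label <- c("Afghanistan-Pakistan-2012-02-10.RData.overs",
                  "no-date-here.RData.overs")  # made-up label with no date

# The original pattern contains literal spaces that the labels do not have,
# so nothing matches and every element is NA.
str_extract(column_label, "- (.*?) .RData")

# Match the date itself; labels without a date give NA instead of an error.
date_str <- str_extract(column_label, "\\d{4}-\\d{2}-\\d{2}")

# Convert to Date and pull out the year as an integer.
year <- as.integer(format(as.Date(date_str), "%Y"))
```

The same two expressions can be used inside mutate() to add date and year columns to results.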

Adding a new column with month extracted from a separate already existing "date" (mdy) column

Trying to add a new column in my data table denoting the month (either as a numeric value or character) using an already available column of "SetDate", which is in the format mdy.
I'm new to R and having trouble. Thank you
base solution:
f = "%m/%d/%y" # note the lowercase y; it's because the year is 92, not 1992
dataset$SetDateMonth <- format(as.POSIXct(dataset$SetDate, format = f), "%m")
Basically, what it does is it converts the column from character (presumed class) to POSIXct, which allows for an easy extraction of month information.
Quick test:
format(as.POSIXct('1/1/92', format = "%m/%d/%y"), "%m")
[1] "01"
Try this (created a small example):
library(lubridate)
library(magrittr) # provides the %>% pipe used below
date_example <- "1/1/92"
lubridate::mdy(date_example)
[1] "1992-01-01"
lubridate::mdy(date_example) %>% lubridate::month()
[1] 1
If you want the full month name (returned as an ordered factor), use:
lubridate::mdy(date_example) %>% lubridate::month(label = TRUE, abbr = FALSE)
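A base R sketch that adds both a numeric month and a month name to the data frame, assuming SetDate is stored as character in m/d/yy form (the example rows are made up). One thing to watch with %y: two-digit years 00 to 68 are read as 20xx and 69 to 99 as 19xx, so check that this matches your data.

```r
# Made-up example data with two-digit years.
dataset <- data.frame(SetDate = c("1/1/92", "12/25/05"),
                      stringsAsFactors = FALSE)

# Parse once, then derive the month columns from the parsed dates.
parsed <- as.Date(dataset$SetDate, format = "%m/%d/%y")

dataset$SetDateMonth     <- as.integer(format(parsed, "%m"))  # month as a number
dataset$SetDateMonthName <- month.name[dataset$SetDateMonth]  # English month name

dataset
```

month.name is a built-in constant, so the names are English regardless of locale.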

Find and extract year within sentence for each cell in R

I have a large dataframe of 22641 obs. and 12 variables.
The first column "year" includes extracted values from satellite images in the format below.
1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc
From this format, I only want to keep the date which in this case is 19870517 and format it as date (so two different things). Usually, I use the regex to extract the words that I want, but here the date is different for each cell and I have no idea how to replace the above text with only the date. Maybe the way to do this is to search by position within the sentence but I do not know how.
Any ideas?
Thanks.
It's not clear what "the date is different for each cell" means. If it means that the value of the date differs but it is always the 7th field, then either (1) or (2) will work. If it means that the date consists of 8 consecutive digits anywhere in the text, or 8 consecutive digits surrounded by _ anywhere in the text, then see (3).
1) Assuming the input DF shown in reproducible form in the Note at the end use read.table to read year, pick out the 7th field and then convert it to Date class. No packages are used.
transform(read.table(text = DF$year, sep = "_")[7],
year = as.Date(as.character(V7), "%Y%m%d"), V7 = NULL)
## year
## 1 1987-05-17
2) Another alternative is separate in tidyr; tidyr 0.8.2 or later is needed.
library(dplyr)
library(tidyr)
DF %>%
separate(year, c(rep(NA, 6), "year"), extra = "drop") %>%
mutate(year = as.Date(as.character(year), "%Y%m%d"))
## year
## 1 1987-05-17
3) This assumes that the date is the only sequence of 8 digits in the year field. If we know it is surrounded by _ delimiters, the regular expression "_(\\d{8})_" can be used instead.
library(gsubfn)
transform(DF,
year = do.call("c", strapply(DF$year, "\\d{8}", ~ as.Date(x, "%Y%m%d"))))
## year
## 1 1987-05-17
Note
DF <- data.frame(year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
stringsAsFactors = FALSE)
Not sure if this will generalize to your whole data but maybe:
gsub(
'(^(?:.*?[^0-9])?)(\\d{8})((?:[^0-9].*)?$)',
'\\2',
'1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc',
perl = TRUE
)
## [1] "19870517"
This uses group capturing and throws away anything but bounded 8 digit strings.
You can use sub to extract the date string and as.Date to convert it into R's date format:
as.Date(sub(".+?([0-9]+)_[^_]+$", "\\1", txt), "%Y%m%d")
# [1] "1987-05-17"
where txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"
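Since the question mentions searching by position, here is one more base R sketch that splits on "_" and takes the 7th piece. It assumes the date is always the 7th field; the column name date is chosen for illustration.

```r
DF <- data.frame(
  year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
  stringsAsFactors = FALSE
)

# Split each string on "_" and take the 7th piece (assumed to hold the date).
fields   <- strsplit(DF$year, "_", fixed = TRUE)
date_str <- vapply(fields, function(p) p[7], character(1))

# Convert the yyyymmdd text to Date class.
DF$date <- as.Date(date_str, format = "%Y%m%d")
```

Rows with fewer than seven fields end up as NA rather than raising an error, which makes them easy to find afterwards with is.na(DF$date).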
