data frame with mixed date format

I would like to convert all the mixed date formats into a single format, for example d-m-y.
Here is the data frame:
x <- data.frame("Name" = c("A","B","C","D","E"), "Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
I have tried using the code below, but it gives NAs:
newdateformat <- as.Date(x$Birthdate,
format = "%m%d%y", origin = "2020-6-25")
newdateformat
Then I tried parse_date_time() from lubridate, but it also gives NAs, which means it failed to parse:
require(lubridate)
parse_date_time(my_data$Birthdate, orders = c("ymd", "mdy"))
[1] NA NA "2001-09-12 UTC" NA
[5] "2005-02-18 UTC"
I also couldn't work out the format of the first date in the data frame, which is "36085.0".
I did find this code, but I still don't understand what the number means or what "origin" means:
dates <- c(30829, 38540)
betterDates <- as.Date(dates,
origin = "1899-12-30")
P.S. I'm quite new to R, so I would appreciate a simple explanation. Thank you!

You should parse each format separately. For each format, select the relevant rows with a regular expression and transform only those rows, then move on to the next format. I'll give the answer with data.table instead of data.frame because I've forgotten how to use data.frame.
library(lubridate)
library(data.table)
x = data.table("Name" = c("A","B","C","D","E"),
"Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
# or use setDT(x) to convert an existing data.frame to a data.table
# handle dates like "2001-sep-12" and "2020-6-25"
# this regex matches strings beginning with four numbers and then a dash
x[grepl('^[0-9]{4}-',Birthdate),Birthdate1:=ymd(Birthdate)]
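# handle dates like "36085.0": Excel serial day counts
# the origin depends on the workbook's date system: '1904-01-01' for the 1904 system
# (used here), or '1899-12-30' for the default 1900 system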
# see https://learn.microsoft.com/en-us/office/troubleshoot/excel/1900-and-1904-date-system
# this regex matches strings that only have numeric characters and .
x[grepl('^[0-9\\.]+$',Birthdate),Birthdate1:=as.Date(as.numeric(Birthdate),origin='1904-01-01')]
# assume the rest are like "Feb-18-2005" and "05/27/84" and handle those
x[is.na(Birthdate1),Birthdate1:=mdy(Birthdate)]
# result
> x
Name Birthdate Birthdate1
1: A 36085.0 2002-10-18
2: B 2001-sep-12 2001-09-12
3: C Feb-18-2005 2005-02-18
4: D 05/27/84 1984-05-27
5: E 2020-6-25 2020-06-25
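To address the "what does the number mean" part of the question: a value like 36085 is a count of days since a fixed starting date, and `origin` tells `as.Date()` what that starting date is. Below is a minimal base R sketch, assuming the number came from Excel. The variable names (`is_serial`, `serial`) are made up for illustration, and which origin is right depends on the spreadsheet the data came from.

```r
# Small data.frame with one serial-number date and one ordinary date
x <- data.frame(Name = c("A", "E"),
                Birthdate = c("36085.0", "2020-6-25"),
                stringsAsFactors = FALSE)

# Rows that contain only digits and dots are treated as serial day counts
is_serial <- grepl("^[0-9.]+$", x$Birthdate)
serial <- as.numeric(x$Birthdate[is_serial])

# Excel's default 1900 date system: count days from 1899-12-30
date_1900 <- as.Date(serial, origin = "1899-12-30")

# Excel's 1904 date system: count days from 1904-01-01
date_1904 <- as.Date(serial, origin = "1904-01-01")

print(date_1900)  # "1998-10-17"
print(date_1904)  # "2002-10-18"
```

The two results differ by about four years, so it is worth checking one known birthdate against the original spreadsheet before picking an origin.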

Related

Extracting String from Column

I am working with the following dataset called results and am trying to add in a column that only contains the date (ideally just the year) of the row.
I am trying to extract just the date (for example: 2012-02-10) from the column_label column.
This is the code that I use:
pattern <- "- (.*?) .RData"
subsetpk <- results %>%
filter(team=="Pakistan") %>%
mutate(year = str_extract(column_label, pattern))
This, however, only gives me NA values.

You can use a regular expression. Here '\\d{4}' just matches the first 4 consecutive digits that are found in the string. This works if your data always looks the same as your example. If not, you may need something more sophisticated. If this doesn't work, post some more example data.
library(tidyverse)
library(stringr)
df <- data.frame(column_label = c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs"))
df %>%
mutate(my_year = str_extract(column_label, '\\d{4}'))
column_label my_year
#1 Afghanistan-Pakistan-2012-02-10.RDATA.overs 2012
#2 Afghanistan-Pakistan-2019-02-10.RDATA.overs 2019

The ymd() function from the lubridate package "transforms dates stored in character and numeric vectors to Date or POSIXct objects", so we can pass the complete string conveniently without having to deal with regular expressions:
x <- c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs")
lubridate::ymd(x)
[1] "2012-02-10" "2019-02-10"
The year can be derived from the extracted dates by
library(lubridate)
year(ymd(x))
[1] 2012 2019

Use str_extract from the package stringr:
DATA:
library(stringr)
results <- data.frame(
column_label = "Afghanistan-Pakistan-2012-02-10.RData.overs")
SOLUTION:
results$date <- str_extract(results$column_label, "\\d{4}-\\d{2}-\\d{2}")
RESULT:
results
column_label date
1 Afghanistan-Pakistan-2012-02-10.RData.overs 2012-02-10
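For comparison, here is a rough base R sketch of the same idea without stringr. It pulls out the yyyy-mm-dd part with `regexpr()` and `regmatches()`, then derives the year. The second label is a made-up example added to show what happens when there is no date in the string.

```r
labels <- c("Afghanistan-Pakistan-2012-02-10.RData.overs",
            "no-date-here.RData.overs")  # hypothetical label with no date

pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"

# regexpr() returns -1 where the pattern is not found
m <- regexpr(pattern, labels)

# Start with NA everywhere and fill in only the rows that matched
date_str <- rep(NA_character_, length(labels))
date_str[m > 0] <- regmatches(labels, m)

# Convert to Date and take the year as an integer
dates <- as.Date(date_str)
years <- as.integer(format(dates, "%Y"))

print(date_str)
print(years)
```

Filling a vector of NAs first keeps the result the same length as the input, which matters when the result is assigned back as a data frame column.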

Adding a new column with month extracted from a separate already existing "date" (mdy) column

Trying to add a new column in my data table denoting the month (either as a numeric value or character) using an already available column of "SetDate", which is in the format mdy.
I'm new to R and having trouble. Thank you

base solution:
f = "%m/%d/%y" # note the lowercase y; it's because the year is 92, not 1992
dataset$SetDateMonth <- format(as.POSIXct(dataset$SetDate, format = f), "%m")
Basically, what it does is it converts the column from character (presumed class) to POSIXct, which allows for an easy extraction of month information.
Quick test:
format(as.POSIXct('1/1/92', format = "%m/%d/%y"), "%m")
[1] "01"
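If a numeric month or a full month name is wanted from the base approach, something like the following sketch may help. It assumes the column holds strings such as "1/1/92"; the vector `set_date` is a stand-in for `dataset$SetDate`, and the second value is invented for illustration.

```r
set_date <- c("1/1/92", "12/25/05")  # stand-in for dataset$SetDate

# %y reads a two-digit year: 69-99 become 19xx, 00-68 become 20xx
parsed <- as.Date(set_date, format = "%m/%d/%y")

# Month as an integer rather than the character "01"
month_num <- as.integer(format(parsed, "%m"))

# month.name is a built-in vector of English month names
month_full <- month.name[month_num]

print(parsed)
print(month_num)
print(month_full)
```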

Try this (created a small example):
library(lubridate)
library(magrittr) # provides the %>% pipe used below
date_example <- "1/1/92"
lubridate::mdy(date_example)
[1] "1992-01-01"
lubridate::mdy(date_example) %>% lubridate::month()
[1] 1
If you want full month as character string, use:
lubridate::mdy(date_example) %>% lubridate::month(label = TRUE, abbr = FALSE)

Find and extract year within sentence for each cell in R

I have a large dataframe of 22641 obs. and 12 variables.
The first column "year" includes extracted values from satellite images in the format below.
1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc
From this format, I only want to keep the date which in this case is 19870517 and format it as date (so two different things). Usually, I use the regex to extract the words that I want, but here the date is different for each cell and I have no idea how to replace the above text with only the date. Maybe the way to do this is to search by position within the sentence but I do not know how.
Any ideas?
Thanks.

It's not clear what the "date is different in each cell" means but if it means that the value of the date is different and it is always the 7th field then either of (1) or (2) will work. If it either means that it consists of 8 consecutive digits anywhere in the text or 8 consecutive digits surrounded by _ anywhere in the text then see (3).
1) Assuming the input DF shown in reproducible form in the Note at the end use read.table to read year, pick out the 7th field and then convert it to Date class. No packages are used.
transform(read.table(text = DF$year, sep = "_")[7],
year = as.Date(as.character(V7), "%Y%m%d"), V7 = NULL)
## year
## 1 1987-05-17
2) Another alternative is separate in tidyr. 0.8.2 or later is needed.
library(dplyr)
library(tidyr)
DF %>%
separate(year, c(rep(NA, 6), "year"), extra = "drop") %>%
mutate(year = as.Date(as.character(year), "%Y%m%d"))
## year
## 1 1987-05-17
3) This assumes that the date is the only sequence of 8 digits in the year field. If we know it is surrounded by _ delimiters, the regular expression "_(\\d{8})_" can be used instead.
library(gsubfn)
transform(DF,
year = do.call("c", strapply(DF$year, "\\d{8}", ~ as.Date(x, "%Y%m%d"))))
## year
## 1 1987-05-17
Note
DF <- data.frame(year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
stringsAsFactors = FALSE)
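The "search by position" idea from the question can also be sketched with `strsplit()` in base R. This assumes, as in (1) and (2), that the date is always the 7th underscore-separated field; the helper names are made up for illustration.

```r
txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"

# Split every string on the literal "_" character
fields <- strsplit(txt, "_", fixed = TRUE)

# Take the 7th piece of each string (NA if a string has fewer pieces)
date_str <- vapply(fields, function(f) f[7], character(1))

# Convert yyyymmdd text to Date class
dates <- as.Date(date_str, format = "%Y%m%d")

print(date_str)
print(dates)
```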

Not sure if this will generalize to your whole data but maybe:
gsub(
'(^(?:.*?[^0-9])?)(\\d{8})((?:[^0-9].*)?$)',
'\\2',
'1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc',
perl = TRUE
)
## [1] "19870517"
This uses group capturing and throws away anything but bounded 8 digit strings.

You can use sub to extract the date string and as.Date to convert it into R's date format:
as.Date(sub(".+?([0-9]+)_[^_]+$", "\\1", txt), "%Y%m%d")
# [1] "1987-05-17"
where txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"

How can you insert a colon every two characters?

I have a column of time values, except that they are in character format and do not have the colons to separate H, M, S. The column looks similar to the following:
Time
024201
054722
213024
205022
205024
125440
I want to convert all the values in the column to look like actual time values in the format H:M:S. The values are already in HMS format, so it is simply a matter of inserting colons, but that is proving more difficult than I thought. I found a package that adds commas every three digits from the right to make Strings look like currency values, but nothing for time (without also adding a date value, which I do not want to do). Any help would be appreciated.

Since the data is time related, you should consider storing it in a POSIX format:
> df <- data.frame(Time=c("024201", "054722", "213024", "205022", "205024", "125440"))
> df$Time <- as.POSIXct(df$Time, format="%H%M%S")
> df
Time
1 2014-01-05 02:42:01
2 2014-01-05 05:47:22
3 2014-01-05 21:30:24
4 2014-01-05 20:50:22
5 2014-01-05 20:50:24
6 2014-01-05 12:54:40
To output just the times:
> format(df, "%H:%M:%S")
Time
1 02:42:01
2 05:47:22
3 21:30:24
4 20:50:22
5 20:50:24
6 12:54:40

A regular expression with lookaround works for this:
gsub('(..)(?=.)', '\\1:', x$Time, perl=TRUE)
The (?=.) means a character (matched by .) must follow, but is not considered part of the match (and is not captured).
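A small self-contained run of the lookahead approach, using two of the sample values in a plain vector instead of `x$Time`:

```r
time_raw <- c("024201", "125440")

# (..) captures two characters; (?=.) requires one more character to follow,
# so no colon is added after the final pair
with_colons <- gsub("(..)(?=.)", "\\1:", time_raw, perl = TRUE)

print(with_colons)  # "02:42:01" "12:54:40"
```

Without the lookahead, the last pair would also match and the result would end in a trailing colon.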

Here is a regex solution:
x <- readLines(n=6)
024201
054722
213024
205022
205024
125440
gsub("(\\d\\d)(\\d\\d)(\\d\\d)", "\\1:\\2:\\3", x)
## [1] "02:42:01" "05:47:22" "21:30:24"
## [4] "20:50:22" "20:50:24" "12:54:40"
Here each (\\d\\d) says we're looking for 2 digits. The parentheses break the string into 3 parts. Then \\1: says take chunk 1 and place a colon after it, and likewise for \\2: and \\3.

Or via date/times classes:
time <- c("024201", "054722", "213024", "205022", "205024", "125440")
time <- as.POSIXct(paste("1970-01-01", time), format="%Y-%m-%d %H%M%S")
(time <- format(time, "%H:%M:%S"))
# [1] "02:42:01" "05:47:22" "21:30:24" "20:50:22" "20:50:24" "12:54:40"

This gives a chron "times" class vector:
> library(chron)
> times(gsub("(..)(..)(..)", "\\1:\\2:\\3", DF$Time))
[1] 02:42:01 05:47:22 21:30:24 20:50:22 20:50:24 12:54:40
The "times" class can display times without having to display the date and supports various methods on the times.
On the other hand, if only a character string is wanted then only the gsub part is needed.

Convert data from csv file into "xts" object

I have got CSV files which has the Date in the following format:
25-Aug-2004
I want to read it as an "xts" object so as to use the function "periodReturn" in quantmod package.
Can I use the following file for the function?
Symbol Series Date Prev.Close Open.Price High.Price Low.Price
1 XXX EQ 25-Aug-2004 850.00 1198.70 1198.70 979.00
2 XXX EQ 26-Aug-2004 987.95 992.00 997.00 975.30
Guide me with the same.

Unfortunately I can't speak for the xts part, but this is how you can convert your dates to a proper format that can be read by other functions as dates (or times).
You can import your data into a data.frame as usual. Then you can convert your Date column to the POSIXlt (POSIXt) class using the strptime function.
nibha <- "25-Aug-2004" # this should be your imported column
lct <- Sys.getlocale("LC_TIME"); Sys.setlocale("LC_TIME", "C") # temporarily change locale to C if you happen to get NAs
strptime(nibha, format = "%d-%b-%Y")
Sys.setlocale("LC_TIME", lct) #revert back to your locale

Try this. We get rid of the nuisance columns and specify the format of the time index, then convert to xts and apply the dailyReturn function:
Lines <- "Symbol Series Date Prev.Close Open.Price High.Price Low.Price
1 XXX EQ 25-Aug-2004 850.00 1198.70 1198.70 979.00
2 XXX EQ 26-Aug-2004 987.95 992.00 997.00 975.30"
library(quantmod) # this also pulls in xts & zoo
z <- read.zoo(textConnection(Lines), format = "%d-%b-%Y",
colClasses = rep(c(NA, "NULL", NA), c(1, 2, 5)))
x <- as.xts(z)
dailyReturn(x)
Of course, textConnection(Lines) is just to keep the example self contained and in reality would be replaced with something like "myfile.dat".
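As an alternative sketch, the same data can be read with plain `read.table()`, the date column converted with `as.Date()`, and the numeric columns handed to `xts::xts()`. The object names (`prices`, `num_cols`, `price_xts`) are made up for illustration, and the xts step only runs if the xts package is installed.

```r
Lines <- "Symbol Series Date Prev.Close Open.Price High.Price Low.Price
1 XXX EQ 25-Aug-2004 850.00 1198.70 1198.70 979.00
2 XXX EQ 26-Aug-2004 987.95 992.00 997.00 975.30"

# The header has one fewer field than the rows, so the first field
# of each row is used as the row name
prices <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)

# %b depends on the locale, so switch to C while parsing "Aug"
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
prices$Date <- as.Date(prices$Date, format = "%d-%b-%Y")
Sys.setlocale("LC_TIME", old_locale)

print(prices$Date)

# Build the xts object from the numeric columns, indexed by Date
num_cols <- c("Prev.Close", "Open.Price", "High.Price", "Low.Price")
if (requireNamespace("xts", quietly = TRUE)) {
  price_xts <- xts::xts(as.matrix(prices[, num_cols]), order.by = prices$Date)
  print(price_xts)
}
```

Keeping only numeric columns in the matrix avoids the whole object being coerced to character by the Symbol and Series columns.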
