While analyzing a dataset from Kaggle, I ran into some conversion issues. I want to get an ISO date like "2022-04-31" from "4/31/2022 8:26".
My first idea was a classical programming approach with a loop and if-logic, which is way too much effort. The problem here is the missing leading zeroes.
My second approach was to split the column's string values with str_split and then put them back together:
################################################################################
# START OF SCRIPT
################################################################################
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
################################################################################
# ETL
################################################################################
#---->> https://www.kaggle.com/carrie1/ecommerce-data
raw_data <- read.csv("data 2.csv", sep = ",")
clean_data <- raw_data %>% drop_na()
clean_data <- clean_data[!duplicated(clean_data[,1:8]),]
#
## date conversion
#
split <- str_split(clean_data$InvoiceDate, "/") %>% plyr::ldply(,data.frame)
colnames(split) <- c("month", "day", "year")
split$year <- substr(split$year, 1,4)
######
filled_day = as.Date(split$day, format = "%d")
str_day <- substr(filled_day, 9,10)
This seems to work for the day column, but I am failing to reconvert the month with base R and lubridate. Maybe my approach is either too complex or too simple. Please share your ideas with me.
You can use as.Date with the format %m/%d/%Y.
as.Date("4/30/2022 8:26", "%m/%d/%Y")
#[1] "2022-04-30"
But this will work only for valid dates.
as.Date("4/31/2022 8:26", "%m/%d/%Y")
#[1] NA
as there is no 31 April.
Another way is to use sub and gsub, which only reformat the string and do not test whether the date is valid:
gsub("\\b(\\d)\\b", "0\\1"
, sub("(\\d+)/(\\d+)/(\\d+).*", "\\3-\\1-\\2", "4/31/2022 8:26"))
#[1] "2022-04-31"
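To apply either idea to a whole column instead of a single string, something like the following sketch may help. The vector invoice_date is a made-up stand-in for clean_data$InvoiceDate. Both as.Date and sub/gsub are vectorized, so no loop is needed.

```r
# Hypothetical sample standing in for clean_data$InvoiceDate
invoice_date <- c("4/30/2022 8:26", "12/1/2010 8:26", "4/31/2022 8:26")

# Vectorized parse: the trailing time is ignored, impossible dates become NA
iso_date <- as.Date(invoice_date, format = "%m/%d/%Y")
iso_date
#[1] "2022-04-30" "2010-12-01" NA

# Rows whose calendar date does not exist (here: 31 April)
invalid <- is.na(iso_date)

# String-only variant: reorders and zero-pads without validating the date
iso_string <- gsub("\\b(\\d)\\b", "0\\1",
                   sub("(\\d+)/(\\d+)/(\\d+).*", "\\3-\\1-\\2", invoice_date))
iso_string
#[1] "2022-04-30" "2010-12-01" "2022-04-31"
```

The first variant gives a real Date column and lets you inspect the invalid rows. The second keeps every row but leaves you with character strings.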
I was working on an assignment,
library(tidyverse)
library(quantmod)
library(lubridate)
macro <- c("GDPC1", "CPIAUCSL","DTB3", "DGS10", "DAAA", "DBAA", "UNRATE", "INDPRO", "DCOILWTICO")
rm(macro_factors)
for (i in 1:length(macro)){
getSymbols(macro[i], src = "FRED")
data <- as.data.frame(get(macro[i]))
data$date <- as.POSIXlt.character(rownames(data))
rownames(data) <- NULL
colnames(data)[1] <- "macro_value"
data$quarter <- as.yearqtr(data$date)
data$macro_ticker <- rep(macro[i], dim(data)[1])
data <- data%>%
mutate(date = ymd(date))%>%
group_by(quarter)%>%
top_n(1,date) %>%
filter(date >= "1980-01-01", date <= "2019-12-31") %>%
if(i == 1){macro_factors <- data} else {macro_factors <- rbind(macro_factors, data)}
}
but this came out
Error in as.POSIXlt.character(rownames(data)) :
character string is not in a standard unambiguous format
I tried following an online tutorial that uses as.POSIXct() by first converting the data from character to numeric, but it did not work in my case. I checked the class of the data: the values look like "year-month-day" and are of class character, so shouldn't as.POSIXlt() work?
There are several problems:
POSIXlt class should not be used in data frames. Also do not use POSIXct for dates since you can get into needless time zone problems.
To convert an xts object, such as the one produced by getSymbols, to a data frame, use fortify.zoo.
Depending on what you want to do, you might not need to convert from xts to a data frame in the first place. I suggest reading about xts and zoo in the documentation of those packages.
This gives a list of data frames L and then a long data frame DF containing them all.
library(dplyr, exclude = c("filter", "lag"))
library(quantmod) # also brings in xts and zoo
macro <- c("GDPC1", "CPIAUCSL")
getData <- function(symb) symb %>%
getSymbols(src = "FRED", auto.assign = FALSE) %>%
aggregate(as.yearqtr, tail, 1) %>%
window(start = "1980q1", end = "2019q4") %>%
fortify.zoo
L <- Map(getData, macro)
DF <- bind_rows(L, .id = "id")
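Since getSymbols needs a network connection, here is an offline sketch of the same aggregate / window / fortify.zoo steps on a small made-up monthly series. The values and the names z, q and q_win are assumptions for illustration only, and the window start is passed as a yearqtr object rather than a string.

```r
library(zoo)

# Made-up monthly series indexed by Date (stand-in for a FRED download)
dates <- as.Date(c("1979-12-01", "1980-01-01", "1980-02-01",
                   "1980-03-01", "1980-04-01"))
z <- zoo(c(1, 2, 3, 4, 5), dates)

# Keep the last observation of each quarter; the index becomes yearqtr
q <- aggregate(z, as.yearqtr, tail, 1)

# Restrict to quarters from 1980 Q1 onwards
q_win <- window(q, start = as.yearqtr("1980 Q1"))

# Convert to a data frame; the time index ends up in a column named "Index"
df <- fortify.zoo(q_win)
df
```

The index stays a yearqtr (or Date) object throughout, so no POSIXlt or POSIXct conversion is involved.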
I am trying to write a function which downloads multiple .csv files from a GitHub repository and first stores them in one (long-format) tibble, like so:
# write different endings of urls "by hand" with 'ctrl-c' & 'ctrl-v' to get a list.
hobo_id <- c("10088310_Th.csv", "10234637_Th.csv", "10347313_Th.csv", "10347320_Th.csv", "10347321_th.csv", "10347327_Th.csv", "10347328_Th.csv", "10347356_Th.csv", "10347362_Th.csv", "10347366_Th.csv", "10347384_Th.csv", "10347394_Th.csv", "10350002_Th.csv ", "10350005_Th.csv", "10350049_Th.csv", "10610854_Th.csv", "10760709_Th.csv", "10760710_Th.csv", "10760811_Th.csv", "10760820_Th.csv", "10760822_Th.csv", "10801139_th.csv", "10801141_Th.csv")
# import function:
import_csv <- function(hobo_id){
#create urls
HOBO_urls <- paste0('https://raw.githubusercontent.com/data-hydenv/data/master/hobo/2022/hourly/',hobo_id)
# HOBO_urls represents a list of each link, that read_csv will download in the next step
# read in file
hobo_coll <- read_csv(as.character(HOBO_urls))
return(hobo_coll)
}
hobo_coll <- import_csv(hobo_id)
This works so far. But I want to add a column called 'ID'.
One of my approaches looks like this:
import_csv <- function(hobo_id){
#create urls
HOBO_urls <- paste0('https://raw.githubusercontent.com/data-hydenv/data/master/hobo/2022/hourly/',hobo_id)
# read in file
hobo_coll <- read_csv(as.character(HOBO_urls))
# Add column ID
hobo_coll1 <- hobo_coll %>%
mutate(dttm = parse_date_time(dttm, "%Y-%m-%d %H:%M:%S")) %>%
mutate(ID = ifelse(dttm >= "2021-12-13 00:00:00" & dttm <= "2022-01-09 23:00:00", hobo_id, NA))
return(hobo_coll1)
}
This works so far, but the ID from 'hobo_id' should stay the same for 4032 rows (each period from "2021-12-13 00:00:00" to "2022-01-09 23:00:00"), then change to the next ID (hobo_id[2]), then after the next period of 4032 rows to the one after that (hobo_id[3]), and so on.
I thought there might be a way to do it with the tidyr::extract() function, but I can't figure out how.
I also considered a for loop, but kind of want to stick to the import_csv() function solution.
Thank you for your help in advance, gladly appreciate it!
Use the function argument directly, without any indexing, and change the line
mutate(ID = ifelse(dttm >= "2021-12-13 00:00:00" & dttm <= "2022-01-09 23:00:00", .[[hobo_id]], NA))
to
mutate(ID = ifelse(dttm >= "2021-12-13 00:00:00" & dttm <= "2022-01-09 23:00:00", hobo_id, NA))
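If the goal is one ID per source file, another option may be the id argument of read_csv (available in readr 2.0 and later), which adds a column holding the path each row was read from. The sketch below uses two small temporary files with made-up contents instead of the GitHub URLs, so the file contents and the temp column are assumptions.

```r
library(readr)
library(dplyr)

# Two small stand-in files written to a temporary directory
tmp_dir <- tempfile("hobo_")
dir.create(tmp_dir)
files <- file.path(tmp_dir, c("10088310_Th.csv", "10234637_Th.csv"))
write_csv(data.frame(dttm = c("2021-12-13 00:00:00", "2021-12-13 01:00:00"),
                     temp = c(1.5, 1.7)), files[1])
write_csv(data.frame(dttm = c("2021-12-13 00:00:00", "2021-12-13 01:00:00"),
                     temp = c(2.1, 2.0)), files[2])

# id = "ID" records the source path of every row; basename() keeps the file name
hobo_coll <- read_csv(files, id = "ID", show_col_types = FALSE) %>%
  mutate(ID = basename(ID))

unique(hobo_coll$ID)
#[1] "10088310_Th.csv" "10234637_Th.csv"
```

With URLs instead of local paths the ID column would hold the full URL, and basename() should reduce it to the file name in the same way.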
I am trying to convert date objects into a date class using lubridate. These are the dates in the "wrong" format:
wrong_format_date1 <- "01-25-1999"
wrong_format_date2 <- 25012005
wrong_format_date3 <- "2005-05-31"
But I would like them in this format:
"1999-01-25"
"2005-01-25"
"2005-05-31"
Can someone please assist me with this?
Try this. Use parse_date_time() from lubridate, where you can define a vector of possible formats so that your strings get parsed as dates. Here is the code:
library(lubridate)
#Data
wrong_format_date1 <- "01-25-1999"
wrong_format_date2 <- 25012005
wrong_format_date3 <- "2005-05-31"
#Dataframe
df <- data.frame(v1=c(wrong_format_date1,wrong_format_date2,wrong_format_date3),stringsAsFactors = F)
#Code
df$Date <- as.Date(parse_date_time(df$v1, c("mdY", "dmY","Ymd")))
Output:
v1 Date
1 01-25-1999 1999-01-25
2 25012005 2005-01-25
3 2005-05-31 2005-05-31
My problem is that I am importing a CSV file and trying to get R to recognize the date column as dates and format them as such.
So far I have managed to replace the "#yyyy-mm-dd#" format seen below with the integer date value in R.
But when I check the class before and after the transformation, it still says "character".
I need the column to be recognized as a date class so that I can use it for forecasting.
DemandCSV <- read_csv("C:/Users/pth/Desktop/Care/Demand.csv")
nrow <- nrow(DemandCSV)
for(i in 1:nrow){
DemandCSV[i,1] <-as.Date(ymd(substr(DemandCSV[i,1], 2, 11)))
}
DemandCSV[,1] <- format(DemandCSV[,1], "%Y-%m-%d")
Figured out an inelegant solution (turns out it was not a solution)
DemandCSV <- read_csv("C:/Users/pth/Desktop/Care/Demand.csv")
nrow <- nrow(DemandCSV)
for(i in 1:nrow){
DemandCSV[i,1] <-as.Date(ymd(substr(DemandCSV[i,1], 2, 11)))
DemandCSV[i,1] <- format(as.Date(as.numeric(DemandCSV[i,1],origin = "01-01-1970")), "%Y-%m-%d")}
DemandCSV %>% pad %>% fill_by_value(0)
Does including the "#" in the format string solve your problem?
data <- c("#2019-09-23#", "#2019-09-24#", "#2019-09-25#")
a <- as.Date(data,format="#%Y-%m-%d#")
or
DemandCSV <- data.frame(date=
c("#2019-09-23#", "#2019-09-24#", "#2019-09-25#"))
mutate_at(DemandCSV,"date",as.Date,format="#%Y-%m-%d#")
It may be simpler to:
1. Substitute out the #
2. Rely on anydate from the anytime package
Demo:
R> data <- c("#2019-09-23#", "#2019-09-24#", "#2019-09-25#")
R> anytime::anydate(gsub("#", "", data))
[1] "2019-09-23" "2019-09-24" "2019-09-25"
R>
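As for why the class still says "character": format() always returns strings, so calling it after as.Date undoes the conversion. A minimal base R sketch, assuming a small made-up data frame named demand in place of DemandCSV:

```r
# Made-up stand-in for the imported CSV
demand <- data.frame(date = c("#2019-09-23#", "#2019-09-24#"),
                     qty = c(3, 5),
                     stringsAsFactors = FALSE)

# Convert the whole column at once; no loop needed
demand$date <- as.Date(demand$date, format = "#%Y-%m-%d#")
class(demand$date)
#[1] "Date"

# format() turns Dates back into character strings
formatted <- format(demand$date, "%Y-%m-%d")
class(formatted)
#[1] "character"
```

So keep the column as Date for forecasting and only call format() when printing or exporting.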
Please help: I have a CSV file from a large database with a date column containing dates in various formats, such as 20080408, 2008/04/08 or 08/04/2008. How do I change these to a single format, dd/mm/yyyy, in R?
You can do it with failure tests via lubridate's mdy and ymd conversions as well (hence the suppressWarnings() calls). I don't think you're going to be able to ensure proper handling of things like "08/04/2008" if 08 is supposed to be the "day" component, though, given that the functions can't read minds.
library(lubridate)
dat <- c("20080408", "2008/04/08", "08/04/2008")
dat.1 <- unlist(lapply(dat, function(x) {
suppressWarnings(res <- mdy(x))
if (is.na(res)) { suppressWarnings(res <- ymd(x)) }
return(as.character(res))
}))
dat.1
## [1] "2008-04-08" "2008-04-08" "2008-08-04"
The following should work for your data.frame. You may need to convert your date column to class character (with as.character) so that the string split function strsplit works correctly. After that, the loop simply checks how many characters are in the string before the first "/" character and adjusts the format accordingly.
Example:
df <- data.frame(DATE=as.character(c("20080408", "2008/04/08", "08/04/2008")), DATE2=as.Date(NA))
df$DATE=as.character(df$DATE)
for(i in seq(df$DATE)){
sp <- unlist(strsplit(df$DATE[i], "/"))
if(nchar(sp[1]) == 8){
df$DATE2[i] <- as.Date(df$DATE[i], format="%Y%m%d")
}
if(nchar(sp[1]) == 4){
df$DATE2[i] <- as.Date(df$DATE[i], format="%Y/%m/%d")
}
if(nchar(sp[1]) == 2){
df$DATE2[i] <- as.Date(df$DATE[i], format="%d/%m/%Y")
}
}
Result:
df
# DATE DATE2
#1 20080408 2008-04-08
#2 2008/04/08 2008-04-08
#3 08/04/2008 2008-04-08
You can read them as character values and convert them using as.Date.
x1 <- '20080408' ## class character (string)
x2 <- '2008/04/08'
x1.dt <- as.Date(x1, format='%Y%m%d')
x2.dt <- as.Date(x2, format='%Y/%m/%d') ## different format
format(c(x1.dt, x2.dt), format='%d/%m/%Y') ## you can return Date objects in any format you want
Check out ?strftime for all the formatting options.
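For a whole column with mixed formats, one possible vectorized sketch is to choose the format from the shape of each string and then print everything as dd/mm/yyyy. The helper name parse_mixed is made up here, and it assumes that slash-separated dates not starting with a four-digit year are day-first, which the data cannot confirm by itself.

```r
# Hypothetical helper: pick the format from the shape of each string
parse_mixed <- function(x) {
  out <- as.Date(rep(NA, length(x)))
  compact    <- grepl("^[0-9]{8}$", x)              # e.g. 20080408
  year_first <- grepl("^[0-9]{4}/", x)              # e.g. 2008/04/08
  day_first  <- grepl("^[0-9]{1,2}/[0-9]{1,2}/", x) # e.g. 08/04/2008 (assumed d/m/Y)
  out[compact]    <- as.Date(x[compact],    format = "%Y%m%d")
  out[year_first] <- as.Date(x[year_first], format = "%Y/%m/%d")
  out[day_first]  <- as.Date(x[day_first],  format = "%d/%m/%Y")
  out
}

dat <- c("20080408", "2008/04/08", "08/04/2008")
dates <- parse_mixed(dat)

# Date objects print as ISO by default; format() gives dd/mm/yyyy strings
formatted <- format(dates, "%d/%m/%Y")
formatted
#[1] "08/04/2008" "08/04/2008" "08/04/2008"
```

Matching on the string shape avoids trying formats in sequence, where a format such as %Y/%m/%d could wrongly accept "08/04/2008" with "08" read as the year. Strings that match none of the patterns stay NA.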