How to change syntax of column in R? - r

I have df1:
ID Time
1 16:00:00
2 14:30:00
3 9:23:00
4 10:00:00
5 23:59:00
and would like to change the current 'character' column 'Time' into an 'integer' as below:
ID Time
1 1600
2 1430
3 923
4 1000
5 2359

We could replace the :'s, make numeric, divide by 100, and convert to integer like this:
df1$Time = as.integer(as.numeric(gsub(':', '', df1$Time))/100)

You want to use as.POSIXct().
Functions to manipulate objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times.
R Documents as.POSIXct()
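So in the case of row 1: as.POSIXct("16:00:00", format = "%H:%M:%S")
Then reformat it with format(x, "%H%M") and use as.numeric if you need it to truly be a number. Calling as.numeric directly on the POSIXct value would return seconds since the epoch instead.
Converts a character matrix to a numeric matrix.
R Docs as.numeric()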
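As a rough sketch of the POSIXct route on the full example data (the tz = "UTC" argument and the intermediate name parsed are my own choices, not part of the question):

```r
# Rebuild the example data from the question
df1 <- data.frame(
  ID = 1:5,
  Time = c("16:00:00", "14:30:00", "9:23:00", "10:00:00", "23:59:00"),
  stringsAsFactors = FALSE
)

# Parse the strings as times; the date part defaults to today and is ignored below
parsed <- as.POSIXct(df1$Time, format = "%H:%M:%S", tz = "UTC")

# Keep only hours and minutes, then drop the leading zero by converting to integer
df1$Time <- as.integer(format(parsed, "%H%M"))
print(df1)
```

This is heavier than the gsub() approach, but it has the side effect of turning malformed times into NA rather than silently producing a number.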

df1 <- data.frame(Time = "16:00:00")
df1[, "Time"] <- as.numeric(paste0(substr(df1[, "Time"], 1, 2), substr(df1[, "Time"], 4, 5)))
print(df1)
# Time
# 1 1600
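Note that the fixed substr() positions above assume a two-digit hour. For a value like "9:23:00" they pick up "9:" and "3:", which as.numeric() turns into NA. A possible variant that tolerates one- or two-digit hours, assuming every value has the form H:MM:SS or HH:MM:SS:

```r
df1 <- data.frame(Time = c("16:00:00", "9:23:00"), stringsAsFactors = FALSE)

# Capture the hour (1 or 2 digits) and the minutes, drop the seconds
df1$Time <- as.numeric(sub("^(\\d{1,2}):(\\d{2}):\\d{2}$", "\\1\\2", df1$Time))
print(df1)
```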

There are many ways to process this, but here's one example:
library(dplyr)
df1 <- mutate(df1, Time = gsub(":", "", Time)) # remove the colons
df1 <- mutate(df1, Time = as.numeric(Time)/100) # coerce to numeric type, divide by 100

Related

Extract Month from a YYYYMM column in R

I have tried to extract it but the methods seem to only work for YYYY-MM. I have data in terms of a date (YYYYMM) and am trying to get in terms of just the month, such as: Month
Ultimately, I would like it to look like this:
ID Date Month
1 200402 2
2 200603 3
3 200707 7
I am doing this in hopes of plotting monthly mean values.
You can simply do it using:
library(stringr)
str_sub(df$Date,-2,-1)
Or, if you are working in Python with pandas rather than R:
df['Date'].str[-2:]
Hope this helps!
Assuming your Date column is numeric, you could just use the modulus:
df$Month <- df$Date %% 100
df
ID Date Month
1 1 200402 2
2 2 200603 3
3 3 200707 7
Data:
df <- data.frame(ID=c(1,2,3), Date=c(200402, 200603, 200707))
To make the above work when Date is character, just cast it to numeric first.
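A minimal sketch of that cast, assuming the character values are all six-digit YYYYMM strings:

```r
df <- data.frame(
  ID = c(1, 2, 3),
  Date = c("200402", "200603", "200707"),
  stringsAsFactors = FALSE
)

# Cast to numeric first, then take the last two digits with the modulus
df$Month <- as.numeric(df$Date) %% 100
df
```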
You can extract last two characters of Date Column.
sub('.*(..)$', '\\1', df$Date)
#Or without capture groups suggested by @Tim Biegeleisen
#sub("^.*(?=..$)", "", df$Date, perl = TRUE)
#[1] "02" "03" "07"
However, ideally you should avoid parsing information from date-time using regex. Convert it to date and then extract the month.
format(as.Date(paste(df$Date, '01'), "%Y%m%d"), '%m')
#Or with zoo::yearmon
#format(zoo::as.yearmon(as.character(df$Date), "%Y%m"), '%m')
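Since the goal was monthly mean values, here is one way the pieces might fit together. The Value column and its numbers are made up for illustration; only ID and Date come from the question.

```r
df <- data.frame(
  ID = 1:4,
  Date = c(200402, 200603, 200707, 200502),
  Value = c(10, 20, 30, 14)  # hypothetical measurements, not from the question
)

# Append a day so the string is a complete date, then pull out the month as an integer
df$Month <- as.integer(format(as.Date(paste0(df$Date, "01"), "%Y%m%d"), "%m"))

# Mean of Value per calendar month (pooled across years)
monthly <- aggregate(Value ~ Month, data = df, FUN = mean)
monthly
```

The resulting monthly data frame has one row per month, which is a convenient shape for plotting.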

How to convert list to dataframe without type conversion on date

I'm trying to write a script which can take a file, look up some metadata relating to the file, and convert certain columns based on that metadata. For example, suppose my data looks like the output of the following:
test_data <- data.frame(date1 = c("03/02/2018","04/25/2018"),date2 = c("9/14/17","9/27/17"))
and suppose that, based on a metadata lookup I found that the columns date1 and date2 of the input file have, respectively, the formats
date_formats <- c("%m/%d/%Y","%m/%d/%y")
So my script would then proceed to define index as a boolean vector which contains the value TRUE where I have a date column and FALSE otherwise, and then attempt to convert all such columns to a standardized R date format:
test_data[,index] <- as.data.frame(
lapply(test_data[,index],as.Date,
format = date_formats[index],
origin ="1970-01-01"))
But this produces some bizarre output:
date1 date2
1 2018-03-02 0017-09-14
2 2020-04-25 2017-09-27
Notice that the years for the (1,2) and (2,1) entries are off. I don't understand why the other values were properly converted. That is mystery #1.
The other mystery is that, if I try to convert only one column, say
as.data.frame(lapply(test_data[,1],as.Date,format = c("%m/%d/%Y")))
then I get undesirable output:
structure.17592..class....Date.. structure.17646..class....Date..
1 2018-03-02 2018-04-25
and if I first wrap this with cbind a la
as.data.frame( cbind(lapply(test_data[,1],as.Date,format = c("%m/%d/%Y"))))
then what I get are the raw, unformatted date values because of the behaviour of cbind:
V1
1 17592
2 17646
So how can I write this generic method which can handle an arbitrary number of columns, with different formats, and convert them all to the same formatted date type in a dataframe?
Try this:
test_data <- data.frame(date1 = c("03/02/2018","04/25/2018"),date2 = c("9/14/17","9/27/17"))
date_formats <- c("%m/%d/%Y","%m/%d/%y")
index <- c(TRUE,TRUE)
test_data[,index] <-
as.data.frame(
lapply(which(index),function(i)
as.Date(test_data[[i]],
format = date_formats[i],
origin ="1970-01-01")))
# date1 date2
# 1 2018-03-02 2017-09-14
# 2 2018-04-25 2017-09-27
The date_formats[index] you were passing in your lapply call was always the full length-2 vector; lapply didn't loop over it, so as.Date recycled both formats down the rows of each column. We need to convert your boolean index to numeric positions, and then loop over those.
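To see the recycling in isolation, here is a small sketch using only the date1 values from the question (the variable names are mine):

```r
dates <- c("03/02/2018", "04/25/2018")
formats <- c("%m/%d/%Y", "%m/%d/%y")

# A length-2 format is recycled element by element:
# the first date is parsed with %Y, the second with %y
recycled <- as.Date(dates, format = formats)

# A single format is applied to every element
single <- as.Date(dates, format = formats[1])

print(recycled)
print(single)
```

With %y only the first two digits of "2018" are read as the year, which is where the 2020 in the question's output comes from.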
Here is cleaner code to achieve what you want:
test_data[,index] <-
Map(as.Date,test_data[index],date_formats[index],origin ="1970-01-01")
# date1 date2
# 1 2018-03-02 2017-09-14
# 2 2018-04-25 2017-09-27
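If this needs to run for arbitrary files, the Map() call could be wrapped in a small helper. This is only a sketch: convert_date_cols is a hypothetical name, and I'm assuming the metadata lookup yields one format per column, with NA for non-date columns.

```r
# Hypothetical helper: convert the columns flagged in `index` using the matching format
convert_date_cols <- function(data, index, formats) {
  stopifnot(length(index) == ncol(data), length(formats) == ncol(data))
  data[index] <- Map(
    function(col, fmt) as.Date(as.character(col), format = fmt),
    data[index], formats[index]
  )
  data
}

test_data <- data.frame(
  id = 1:2,
  date1 = c("03/02/2018", "04/25/2018"),
  date2 = c("9/14/17", "9/27/17"),
  stringsAsFactors = FALSE
)
date_formats <- c(NA, "%m/%d/%Y", "%m/%d/%y")  # NA marks a non-date column (assumption)
index <- !is.na(date_formats)

converted <- convert_date_cols(test_data, index, date_formats)
str(converted)
```

Because Map() pairs each column with its own format, this also works when only one column is flagged.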
It seems OP's intention is to read the data from a file, certain columns of which contain dates in different formats. @Moody_Mudskipper has already provided a nice solution to convert the data once it has been read from the file.
Another option is to use the colClasses argument of the read functions (i.e. read.table, read.csv etc.) itself and get the date columns converted while reading.
# Test data to be read from file. I have added one more column ID in data from OP
textData <- "
ID date1 date2
1 03/02/2018 9/14/17
2 04/25/2018 9/27/17"
setClass("dateformat1")
setClass("dateformat2")
setAs("character", "dateformat1", function(from)as.Date(from, format = "%m/%d/%Y"))
setAs("character", "dateformat2", function(from)as.Date(from, format = "%m/%d/%y"))
read.table(text = textData, header = TRUE, stringsAsFactors = FALSE,
colClasses = c("numeric", "dateformat1","dateformat2"))
# ID date1 date2
# 1 1 2018-03-02 2017-09-14
# 2 2 2018-04-25 2017-09-27

R - How to format the date of several columns in a datatable/dataframe

I want to format several columns in datatable/dataframe using lubridate and column indexing.
Suppose that there is a very large data set which has several unformatted date columns. The question is how can I identify those columns (most likely through indexing) and then format them at the same time in one script using lubridate.
library(data.table)
library (lubridate)
> dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
> dt
date1 var1 date2
1 14.01.2009 2.919293 09.01.2009
2 9/2/2005 2.390123 23/8/2005
3 24/1/2010 0.878209 17.01.2000
4 28.01.2014 2.224461 04.01.2005
dt <- setDT(dt)
I tried these :
> dmy(dt$date1,dt$date2) # this does not generate two columns
[1] "2009-01-14" "2005-02-09" "2010-01-24" "2014-01-28" "2009-01-09" "2005-08-23"
[7] "2000-01-17" "2005-01-04"
> as.data.frame(dmy(dt$date1,dt$date2))
dmy(dt$date1, dt$date2) # this does not generate two columns either
1 2009-01-14
2 2005-02-09
3 2010-01-24
4 2014-01-28
5 2009-01-09
6 2005-08-23
7 2000-01-17
8 2005-01-04
dmy(dt[,.SD, .SD =c(1,3)])
[1] NA NA
> sapply(dmy(dt$date1,dt$date2),dmy)
[1] NA NA NA NA NA NA NA NA
Warning messages:
1: All formats failed to parse. No formats found.
Any help is highly appreciated.
How about:
dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
for(i in c(1,3)){
dt[,i] <- dmy(dt[,i])
}
Here's a data.table way. Suppose you have k columns named dateX:
k = 2
date_cols = paste0('date', 1:k)
for (col in date_cols) {
set(dt, j=col, value=dmy(dt[[col]]))
}
You can avoid the loop, but apparently the loop may be faster; see this answer
dt[,(date_cols) := lapply(.SD, dmy), .SDcols=date_cols]
EDIT
If you have arbitrary column names, assuming data looks as in OP
date_cols = names(dt)[grep("^\\d{4}(\\.|/)", names(dt))]
date_cols = c(date_cols, names(dt)[grep("(\\.|/)\\d{4}", names(dt))])
You can add regular expressions if there are more delimiters than . or /, and you can combine this into a single grep but this is clearer to me.
Far from perfect, this is a solution which should be more general:
The only assumption here is that the date columns contain digits separated by either . , / or -. If there are other separators, they may be added. But if you have another variable which looks similar but is not a date, this won't work well.
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) dt[,j] <- dmy(dt[,j])
This loops through the columns and checks if a date could be present using regular expressions. If so, it will convert it to a date and overwrite the column.
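For anyone without lubridate installed, roughly the same idea can be sketched in base R. Note this swaps dmy() for as.Date() after normalising the separators, and it assumes every date is day-month-year with a four-digit year; the fixed var1 values are mine so the example is reproducible.

```r
dt <- data.frame(
  date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),
  var1  = c(2.9, 2.4, 0.9, 2.2),
  date2 = c("09.01.2009", "23/8/2005", "17.01.2000", "04.01.2005"),
  stringsAsFactors = FALSE
)

# Three groups of digits separated by '.', '/' or '-'
date_pattern <- "^\\d+[./-]\\d+[./-]\\d+$"

for (j in seq_along(dt)) {
  values <- as.character(dt[[j]])
  if (all(grepl(date_pattern, values))) {
    # Normalise every separator to '.' so a single format string fits all rows
    dt[[j]] <- as.Date(gsub("[/-]", ".", values), format = "%d.%m.%Y")
  }
}
str(dt)
```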
Using data.table:
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[[j]]))) set(dt,j = j, value = dmy(dt[[j]]))
You could also replace all with any with the idea that if you have any match in the column, you could assume all of the values in that column are dates which can be read by dmy.

rowr::cbind.fill() changes characters value to numeric

I have two variables date and referencenumber. Both are extracted from a text string, with the use of a regular expression. They both have the class character.
When I use the cbind.fill function to combine these variables in an already existing data frame, the values are transformed to numeric values, 1 and 1, instead of "06-07-2016" and "123ABC". I use the cbind.fill function because sometimes only one variable is found, and then this variable still must be placed in the data frame.
When I run the same code on a computer at school, it doesn't transform the values to numeric. So maybe it has something to do with my settings?
Why is this happening?
library(rowr)
dataframevariablen <- as.data.frame(matrix(nrow = 0, ncol = 2))
colnames(dataframevariablen) <- c("date", "refnr")
rulebased(dfgg$Text[i]) #returns the date and refnr as global variable
dataframevariablen[i,] <- cbind.fill(date,refnr, fill = NULL)
Does this work for you?
x <- c("6jul2016", "2jan1960", "31mar1960", "30jul1960")
date <- as.Date(x, "%d%b%Y")
refnr="123ABC" #returns the date and refnr as global variable
for (i in 1:length(date))
dataframevariablen[i,] <- data.frame(date[i],refnr,stringsAsFactors = F)
dataframevariablen$date=as.Date(dataframevariablen$date,origin="1970-01-01")
dataframevariablen
date refnr
1 2016-07-06 123ABC
2 1960-01-02 123ABC
3 1960-03-31 123ABC
4 1960-07-30 123ABC

Conditional subsetting of data frame based on HH:MM:SS formatted column

So I have a large df with a column called "session" that is in the format
HH:MM:SS (e.g. 0:35:24 for 35 mins and 24 secs).
I want to create a subset of the df based on a condition like > 2 mins or < 90 mins from the "session" column.
I tried to first convert the column format into Date:
df$session <- as.Date(df$session, "%h/%m/%s")
I was going to then use the subset() to create my conditional subset but the above code generates a column of NAs.
subset.morethan2min <-subset(df, CONDITION)
where CONDITION is df$session >2 mins?
How should I manipulate the "session" column in order to be able to subset on a condition as described?
Sorry very new to R so welcome any suggestions.
Thanks!
UPDATE:
I converted the session column to POSIXct, then used the function minute() from the lubridate package to get numerical values for the hour and minute components. Not a neat solution but seems to work for my needs right now. Still would welcome a neater solution though.
df$sessionPOSIX <- as.POSIXct(strptime(df$session, "%H:%M:%S"))
df$minute <- minute(df$sessionPOSIX)
subset.morethan2min <- subset(df, minute > 2)
A date is not the same as a period. The easiest way to handle periods is to use the lubridate package:
library(lubridate)
df$session <- hms(df$session)
df.morethan2min <- subset(df, df$session > period(2, 'minute'))
hms() converts your duration stamps into period objects, and period() creates a period object of the specified length for comparison.
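If you'd rather stay in base R, something along these lines might also do, using as.difftime() instead of lubridate periods. The sample data is invented, and I'm assuming both bounds are meant to apply together.

```r
df <- data.frame(
  id = 1:4,
  session = c("0:35:24", "0:01:30", "1:45:00", "0:02:01"),
  stringsAsFactors = FALSE
)

# Parse H:M:S strings as time differences and store them as plain minutes
df$minutes <- as.numeric(as.difftime(df$session, format = "%H:%M:%S", units = "mins"))

# Keep sessions longer than 2 minutes and shorter than 90 minutes
df.between <- subset(df, minutes > 2 & minutes < 90)
df.between
```

Storing minutes as an ordinary numeric column keeps the subset() condition simple.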
As an aside, there are numerous other ways to subset data frames, including the [ operator and functions like filter() in the dplyr package, but that's beyond what you need for your current purposes.
Probably simpler ways to do this, but here's one solution:
set.seed(1234)
tDF <- data.frame(
Val = rnorm(100),
Session = paste0(
sample(0:23,100,replace=TRUE),
":",
sample(0:59,100,replace=TRUE),
":",
sample(0:59,100,replace=TRUE),
sep="",collapse=NULL),
stringsAsFactors=FALSE
)
##
toSec <- function(hms){
Long <- as.POSIXct(
paste0(
"2013-01-01 ",
hms),
format="%Y-%m-%d %H:%M:%S",
tz="America/New_York")
3600*as.numeric(substr(Long,12,13))+
60*as.numeric(substr(Long,15,16))+
as.numeric(substr(Long,18,19))
}
##
tDF <- cbind(
tDF,
Seconds = toSec(tDF$Session),
Minutes = toSec(tDF$Session)/60
)
##
> head(tDF)
Val Session Seconds Minutes
1 -1.2070657 15:21:41 55301 921.6833
2 0.2774292 12:58:24 46704 778.4000
3 1.0844412 7:32:45 27165 452.7500
4 -2.3456977 18:26:46 66406 1106.7667
5 0.4291247 12:56:34 46594 776.5667
6 0.5060559 17:27:11 62831 1047.1833
Then you can just subset your data easily by doing subset(Data, Minutes > some_number).
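A shorter way to get the same seconds, sketched with plain string splitting rather than POSIXct (to_seconds is a hypothetical name, and it assumes every value has exactly three colon-separated parts):

```r
# Hypothetical helper: "H:M:S" strings to total seconds
to_seconds <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  # Weight hours, minutes and seconds, then add them up per element
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

secs <- to_seconds(c("15:21:41", "7:32:45"))
secs       # 55301 27165
secs / 60  # minutes
```

Unlike the POSIXct version, this also handles sessions longer than 24 hours.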
