Formatting a date in R without zeros in the year - r

I'm having trouble using the as.POSIXct function.
When I apply it to my dataset, the year appears with two leading zeros, like this:
datu1$timestamp <- as.POSIXct(datu1$date.sec, origin = "1970-01-01", tz="GMT")
datu1$timestamp <- as.POSIXct(datu1$timestamp,
format = "%Y-%m-%d %H:%M:%S", tz = 'GMT')
head(datu1)
ID date.sec lon lat lon.025 lat.025 lon.5 lat.5 lon.975 lat.975
1 102211.10 -61827840000 -38.6616 -13.59272 -40.5025 -15.25025 -38.7 -13.76 -36.9000 -10.88950
2 102211.10 -61827818400 -38.6647 -13.60312 -40.4000 -15.17025 -38.7 -13.77 -37.0975 -11.03975
3 102211.10 -61827796800 -38.6723 -13.64505 -40.3000 -15.10000 -38.7 -13.79 -37.0000 -11.29950
4 102211.10 -61827775200 -38.6837 -13.68972 -40.2000 -14.98025 -38.7 -13.83 -37.2000 -11.45975
5 102211.10 -61827753600 -38.7030 -13.73054 -40.2000 -14.98100 -38.7 -13.84 -37.3000 -11.62925
6 102211.10 -61827732000 -38.7221 -13.77846 -40.0000 -15.04050 -38.7 -13.88 -37.5000 -11.69950
bmode bmode.5 timestamp
1 1.556 2 0010-10-03 00:00:00
2 1.565 2 0010-10-03 06:00:00
3 1.571 2 0010-10-03 12:00:00
4 1.571 2 0010-10-03 18:00:00
5 1.589 2 0010-10-04 00:00:00
6 1.599 2 0010-10-04 06:00:00
How can I fix this to get the full year (like: 2010) instead of two zeros?

Perhaps your data was encoded with an unusual origin (e.g. Excel uses "1899-12-30"). Adjust origin= until the date matches what you require.
as.POSIXct(-61827840000, origin="1970-01-01", tz="GMT")
# [1] "0010-10-03 GMT"
as.POSIXct(-61827840000, origin="3970-01-01", tz="GMT")
# [1] "2010-10-03 GMT"
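If the true date of one record is known, the required shift can be computed rather than found by trial and error. A minimal sketch, assuming (hypothetically) that the first value should be 2010-10-03 00:00:00 GMT; the names secs, known and offset are illustrative only:

```r
# Two of the raw values from the question
secs <- c(-61827840000, -61827818400)

# Assumption: the first record is known to be 2010-10-03 00:00:00 GMT
known <- as.POSIXct("2010-10-03 00:00:00", tz = "GMT")

# Seconds to add so that the first raw value lands on the known date
offset <- as.numeric(known) - secs[1]

# Apply the same shift to every value, keeping the usual 1970 origin
fixed <- as.POSIXct(secs + offset, origin = "1970-01-01", tz = "GMT")
format(fixed, "%Y-%m-%d %H:%M:%S")
```

Here offset is the same 2000-year shift that origin="3970-01-01" applies in the lines above.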

Related

Why does dplyr convert POSIXct objects

I have a date-time object of class POSIXct. I need to adjust the values by adding several hours. I understand that I can do this using basic addition. For example, I can add 5 hours to a POSIXct object like so:
x <- as.POSIXct("2009-08-02 18:00:00", format="%Y-%m-%d %H:%M:%S")
x
[1] "2009-08-02 18:00:00 PDT"
x + (5*60*60)
[1] "2009-08-02 23:00:00 PDT"
Now I have a data frame in which some times are ok and some are bad.
> df
set_time duration up_time
1 2009-05-31 14:10:00 3 2009-05-31 11:10:00
2 2009-08-02 18:00:00 4 2009-08-02 23:00:00
3 2009-08-03 01:20:00 5 2009-08-03 06:20:00
4 2009-08-03 06:30:00 2 2009-08-03 11:30:00
Note that the first data frame entry has an 'up_time' less than the 'set_time'. So in this context a 'good' time is one where the set_time < up_time. And a 'bad' time is one in which set_time > up_time. I want to leave the good entries alone and fix the bad entries. The bad entries should be fixed by creating an 'up_time' that is equal to the 'set_time' + duration. I do this with the following dplyr pipe:
df1 <- tbl_df(df) %>% mutate(up_time = ifelse(set_time > up_time, set_time +
(duration*60*60), up_time))
df1
# A tibble: 4 x 3
set_time duration up_time
<dttm> <dbl> <dbl>
1 2009-05-31 14:10:00 3. 1243815000.
2 2009-08-02 18:00:00 4. 1249279200.
3 2009-08-03 01:20:00 5. 1249305600.
4 2009-08-03 06:30:00 2. 1249324200.
Up time has been coerced to numeric:
> str(df1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ set_time: POSIXct, format: "2009-05-31 14:10:00" "2009-08-02 18:00:00"
"2009-08-03 01:20:00" "2009-08-03 06:30:00"
$ duration: num 3 4 5 2
$ up_time : num 1.24e+09 1.25e+09 1.25e+09 1.25e+09
I can convert it back to the desired POSIXct format using:
> as.POSIXct(df1$up_time,origin="1970-01-01")
[1] "2009-05-31 17:10:00 PDT" "2009-08-02 23:00:00 PDT" "2009-08-03 06:20:00
PDT" "2009-08-03 11:30:00 PDT"
But I feel like this last step shouldn't be necessary. Can I avoid having dplyr change the class of my variable, and if so, how?
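The numeric column most likely comes from base ifelse() rather than from dplyr itself: ifelse() builds its result from the test vector and does not carry over the class of the yes/no values, so POSIXct is lost. One way to sidestep this, sketched in base R on data rebuilt from the question (tz = "UTC" is an assumption made for reproducibility), is to assign only to the rows that need fixing:

```r
# Data rebuilt from the question; the time zone is an assumption
df <- data.frame(
  set_time = as.POSIXct(c("2009-05-31 14:10:00", "2009-08-02 18:00:00",
                          "2009-08-03 01:20:00", "2009-08-03 06:30:00"),
                        tz = "UTC"),
  duration = c(3, 4, 5, 2),
  up_time  = as.POSIXct(c("2009-05-31 11:10:00", "2009-08-02 23:00:00",
                          "2009-08-03 06:20:00", "2009-08-03 11:30:00"),
                        tz = "UTC")
)

# Rows where up_time precedes set_time are the "bad" ones
bad <- df$set_time > df$up_time

# Overwrite only those rows; subset assignment keeps the POSIXct class
df$up_time[bad] <- df$set_time[bad] + df$duration[bad] * 60 * 60

class(df$up_time)
```

Within a dplyr pipe, if_else() is another option, since it keeps the type of its inputs.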

R: Find missing timestamps in csv

As I failed to solve my problem with PHP/MySQL or Excel due to the data size, I'm now taking my very first steps with R and struggling a bit. The problem is this: I have a second-by-second CSV file with half a year of data that looks like this:
metering,timestamp
123,2016-01-01 00:00:00
345,2016-01-01 00:00:01
243,2016-01-01 00:00:02
101,2016-01-01 00:00:04
134,2016-01-01 00:00:06
As you can see, some seconds are missing every once in a while (don't ask me why the values are written before the timestamp, that's how I received the data…). Now I'm trying to calculate the number of values (= seconds) that are missing.
So my idea was to:
1. create a vector that is correct (includes all sec-by-sec timestamps),
2. match the given CSV file with that new vector, and
3. sum up all the timestamps with no value.
I managed to make step 1 happen with the following code:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
write.csv(RegularTimeSeries, file = "RegularTimeSeries.csv")
To check what I did, I also exported the vector to a CSV, which looks like this:
"1",2016-01-01 00:00:00
"2",2016-01-01 00:00:01
"3",2016-01-01 00:00:02
"4",2016-01-01 00:00:03
"5",2016-01-01 00:00:04
"6",2016-01-01 00:00:05
"7",2016-01-01 00:00:06
Unfortunately I have no idea how to go on with step 2 and 3. I found some very similar examples (http://www.r-bloggers.com/fix-missing-dates-with-r/, R: Insert rows for missing dates/times), but as a total R noob I struggled to translate these examples to my given sec-by-sec data.
Some hints for the greenhorn would be very very helpful – thank you very much in advance :)
In the tidyverse,
library(dplyr)
library(tidyr)
# parse datetimes
df %>% mutate(timestamp = as.POSIXct(timestamp)) %>%
# complete sequence to full sequence from min to max by second
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))
## # A tibble: 7 x 2
## timestamp metering
## <time> <int>
## 1 2016-01-01 00:00:00 123
## 2 2016-01-01 00:00:01 345
## 3 2016-01-01 00:00:02 243
## 4 2016-01-01 00:00:03 NA
## 5 2016-01-01 00:00:04 101
## 6 2016-01-01 00:00:05 NA
## 7 2016-01-01 00:00:06 134
If you want the number of NAs (i.e. the number of seconds with no data), add on
%>% tally(is.na(metering))
## # A tibble: 1 x 1
## n
## <int>
## 1 2
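The same completion can be sketched in base R with merge(), in case the tidyverse packages are not available. This rebuilds the sample rows from the question; the tz = "UTC" setting and the names full and merged are assumptions for illustration:

```r
# Sample rows from the question
df <- data.frame(
  metering  = c(123L, 345L, 243L, 101L, 134L),
  timestamp = as.POSIXct(c("2016-01-01 00:00:00", "2016-01-01 00:00:01",
                           "2016-01-01 00:00:02", "2016-01-01 00:00:04",
                           "2016-01-01 00:00:06"), tz = "UTC")
)

# Every second between the first and last observed timestamp
full <- data.frame(timestamp = seq(min(df$timestamp), max(df$timestamp), by = "sec"))

# Left join: seconds without a reading get NA in the metering column
merged <- merge(full, df, by = "timestamp", all.x = TRUE)

# Number of seconds with no data
sum(is.na(merged$metering))
```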
You can check which values of your RegularTimeSeries are in your broken time series using which and %in%. First create BrokenTimeSeries from your example:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
BrokenTimeSeries <- RegularTimeSeries[-c(3,6,9)] # remove some seconds
This will give you the indices of values within RegularTimeSeries that are not in BrokenTimeSeries:
> which(!(RegularTimeSeries %in% BrokenTimeSeries))
[1] 3 6 9
This will return the actual values:
> RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))]
[1] "2016-01-01 00:00:02 UTC" "2016-01-01 00:00:05 UTC" "2016-01-01 00:00:08 UTC"
Maybe I'm misunderstanding your problem, but you can count the number of missing seconds simply by subtracting the length of your broken time series from the length of RegularTimeSeries, or by taking the length of either of the two resulting vectors above.
> length(RegularTimeSeries) - length(BrokenTimeSeries)
[1] 3
> length(which(!(RegularTimeSeries %in% BrokenTimeSeries)))
[1] 3
> length(RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))])
[1] 3
If you want to merge the two series to see the missing values side by side, you can do something like this:
# data frame holding the regular time series
df <- data.frame(RegularTimeSeries)
# copy the regular series, then blank out the seconds absent from BrokenTimeSeries
df$BrokenTimeSeries <- df$RegularTimeSeries
df$BrokenTimeSeries[!(df$RegularTimeSeries %in% BrokenTimeSeries)] <- NA
resulting in:
> df[1:12,]
RegularTimeSeries BrokenTimeSeries
1 2016-01-01 00:00:00 2016-01-01 00:00:00
2 2016-01-01 00:00:01 2016-01-01 00:00:01
3 2016-01-01 00:00:02 <NA>
4 2016-01-01 00:00:03 2016-01-01 00:00:03
5 2016-01-01 00:00:04 2016-01-01 00:00:04
6 2016-01-01 00:00:05 <NA>
7 2016-01-01 00:00:06 2016-01-01 00:00:06
8 2016-01-01 00:00:07 2016-01-01 00:00:07
9 2016-01-01 00:00:08 <NA>
10 2016-01-01 00:00:09 2016-01-01 00:00:09
11 2016-01-01 00:00:10 2016-01-01 00:00:10
12 2016-01-01 00:00:11 2016-01-01 00:00:11
If all you want is the number of missing seconds, it can be done much more simply. First find the number of seconds in your time range, then subtract the number of rows in your dataset. This could be done in R along these lines:
# seconds between the two end points (add 1 if both end points should themselves appear in the data)
n.seconds <- as.numeric(difftime("2016-06-01 00:00:00", "2016-01-01 00:00:00", units = "secs"))
n.rows <- nrow(my.data.frame)
n.missing.values <- n.seconds - n.rows
Adjust the time range and the name of the data frame to match your data.
Hope it helps
d <- (c("2016-01-01 00:00:01",
"2016-01-01 00:00:02",
"2016-01-01 00:00:03",
"2016-01-01 00:00:04",
"2016-01-01 00:00:05",
"2016-01-01 00:00:06",
"2016-01-01 00:00:10",
"2016-01-01 00:00:12",
"2016-01-01 00:00:14",
"2016-01-01 00:00:16",
"2016-01-01 00:00:18",
"2016-01-01 00:00:20",
"2016-01-01 00:00:22"))
d <- as.POSIXct(d)
# pre-allocate a POSIXct vector of NAs to hold the flagged timestamps
gaps <- as.POSIXct(rep(NA, length(d)))
for (i in 2:length(d)) {
  # flag d[i] when more than one second has passed since d[i-1]
  if (difftime(d[i], d[i - 1], units = "secs") > 1) {
    gaps[i] <- d[i]
  }
}
gaps
[1] NA NA NA NA NA
[6] NA "2016-01-01 00:00:10 EST" "2016-01-01 00:00:12 EST" "2016-01-01 00:00:14 EST" "2016-01-01 00:00:16 EST"
[11] "2016-01-01 00:00:18 EST" "2016-01-01 00:00:20 EST" "2016-01-01 00:00:22 EST"
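The loop above can also be written in vectorised form with diff(), which may matter for half a year of second-by-second data. A small sketch on a shortened, made-up sample (the names step, after_gap and n_missing are illustrative, and tz = "UTC" is an assumption):

```r
# Shortened sample with gaps after :02, :06 and :10
d <- as.POSIXct(c("2016-01-01 00:00:01", "2016-01-01 00:00:02",
                  "2016-01-01 00:00:06", "2016-01-01 00:00:10",
                  "2016-01-01 00:00:12"), tz = "UTC")

# Seconds elapsed between consecutive timestamps
step <- diff(as.numeric(d))

# Timestamps that directly follow a gap (the first element can never be one)
after_gap <- d[c(FALSE, step > 1)]

# Each step of k seconds hides k - 1 missing seconds
n_missing <- sum(step - 1)
```

This assumes the timestamps are sorted and contain no duplicates; otherwise sort and de-duplicate them first.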

splitting date and time in data frame [duplicate]

I have a column of dates in a data frame in the format 201001011200, i.e. %Y%m%d%H%M. I want to split them into %Y%m%d and %H%M, as Date and Time.
I tried as.Date(data$Date, origin = "1970-01-01") but got an error message:
Error in charToDate(x) : character string is not in a standard
unambiguous format
The class of the column is numeric, so I tried converting it to character and applying the as.Date call above, but that did not help.
Any ideas? Thank you in advance.
EDIT
Here is a sample of my data:
Index Date rank amount
81211 201004090000 11 4.9
81212 201004090100 11 4.6
81213 201004090200 11 3.3
81214 201004090300 11 2.7
81215 201004090400 11 3.1
81216 201004090500 11 3.7
81217 201004090600 11 4.0
81218 201004090700 11 4.2
81219 201004090800 11 4.2
81220 201004090900 11 4.0
Updated Answer: Beginning with your example data, you can do
data$Date <- as.POSIXct(as.character(data$Date), format = "%Y%m%d%H%M")
to change the column to a POSIX datetime value. Then, to extract the date and time into two separate columns, you can do
data$date <- as.character(as.Date(data$Date))
data$time <- format(data$Date, "%T")
This gives the following updated data frame data
Index Date rank amount date time
1 81211 2010-04-09 00:00:00 11 4.9 2010-04-09 00:00:00
2 81212 2010-04-09 01:00:00 11 4.6 2010-04-09 01:00:00
3 81213 2010-04-09 02:00:00 11 3.3 2010-04-09 02:00:00
4 81214 2010-04-09 03:00:00 11 2.7 2010-04-09 03:00:00
5 81215 2010-04-09 04:00:00 11 3.1 2010-04-09 04:00:00
6 81216 2010-04-09 05:00:00 11 3.7 2010-04-09 05:00:00
7 81217 2010-04-09 06:00:00 11 4.0 2010-04-09 06:00:00
8 81218 2010-04-09 07:00:00 11 4.2 2010-04-09 07:00:00
9 81219 2010-04-09 08:00:00 11 4.2 2010-04-09 08:00:00
10 81220 2010-04-09 09:00:00 11 4.0 2010-04-09 09:00:00
Original Answer: If you are starting with a numeric value, wrap it in as.character() then run it through as.POSIXct() to get a POSIX date-time value.
data$Date <- as.POSIXct(as.character(data$Date), format = "%Y%m%d%H%M")
As an example I will use 201001011200 as you gave.
(x <- as.POSIXct(as.character(201001011200), format = "%Y%m%d%H%M"))
# [1] "2010-01-01 12:00:00 PST"
Then to separate out the date and time you can do the following.
list(as.Date(x), format(x, "%T"))
# [[1]]
# [1] "2010-01-01"
#
# [[2]]
# [1] "12:00:00"
That gives Date and character classed items, respectively. For a plain old character vector, just use format() twice.
c(format(x, "%m-%d-%Y"), format(x, "%T"))
# [1] "01-01-2010" "12:00:00"
or
c(as.character(as.Date(x)), format(x, "%T"))
# [1] "2010-01-01" "12:00:00"
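The examples above print in the session's local time zone (PST). A variant that pins the time zone, so the date part cannot shift when as.Date() converts via UTC, might look like the sketch below. The tz = "UTC" choice, the sample values and the use of sprintf() in place of as.character() are assumptions, not part of the original answer:

```r
# Two made-up values in the %Y%m%d%H%M layout, stored as numbers
raw <- c(201004090000, 201004092300)

# sprintf("%.0f") writes the number out in full, never in scientific notation
dt <- as.POSIXct(sprintf("%.0f", raw), format = "%Y%m%d%H%M", tz = "UTC")

# format() uses the time zone stored on dt, so date and time stay consistent
date_part <- format(dt, "%Y-%m-%d")
time_part <- format(dt, "%H:%M")

data.frame(raw, date_part, time_part)
```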

Associate numbers to datetime/timestamp

I have a dataframe df with a certain number of columns. One of them, ts, is timestamps:
1462147403122 1462147412990 1462147388224 1462147415651 1462147397069 1462147392497
...
1463529545634 1463529558639 1463529556798 1463529558788 1463529564627 1463529557370.
I have also at my disposal the corresponding datetime in the datetime column:
"2016-05-02 02:03:23 CEST" "2016-05-02 02:03:32 CEST" "2016-05-02 02:03:08 CEST" "2016-05-02 02:03:35 CEST" "2016-05-02 02:03:17 CEST" "2016-05-02 02:03:12 CEST"
...
"2016-05-18 01:59:05 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:16 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:24 CEST" "2016-05-18 01:59:17 CEST"
As you can see, my dataframe contains data across several days. Let's say there are 3. I would like to add a column containing the number 1, 2 or 3: 1 if the line belongs to the first day, 2 for the second day, etc.
Thank you very much in advance,
Clement
One way to do this is to keep track of total days elapsed each time the date changes, as demonstrated below.
library(lubridate)  # provides tz() and its replacement form used below
# Fake data
dat = data.frame(datetime = c(seq(as.POSIXct("2016-05-02 01:03:11"),
as.POSIXct("2016-05-05 01:03:11"), length.out=6),
seq(as.POSIXct("2016-05-09 01:09:11"),
as.POSIXct("2016-05-16 02:03:11"), length.out=4)))
tz(dat$datetime) = "UTC"
Note, if your datetime column is not already in a datetime format, convert it to one using as.POSIXct.
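For the ts column in the question, the values look like milliseconds since 1970-01-01, so a conversion might be sketched as follows. The "Europe/Paris" zone is an assumption chosen to match the CEST labels in the question, and ts_ms is an illustrative name:

```r
# First two ts values from the question (milliseconds since the epoch)
ts_ms <- c(1462147403122, 1462147412990)

# as.POSIXct expects seconds, so divide by 1000 first
datetime <- as.POSIXct(ts_ms / 1000, origin = "1970-01-01", tz = "Europe/Paris")

format(datetime, "%Y-%m-%d %H:%M:%S %Z")
```

The results should line up with the first two entries of the datetime column shown in the question.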
Now, create a new column with the day number, counting the first day in the sequence as day 1.
dat$day = c(1, cumsum(as.numeric(diff(as.Date(dat$datetime, tz="UTC")))) + 1)
dat
datetime day
1 2016-05-02 01:03:11 1
2 2016-05-02 15:27:11 1
3 2016-05-03 05:51:11 2
4 2016-05-03 20:15:11 2
5 2016-05-04 10:39:11 3
6 2016-05-05 01:03:11 4
7 2016-05-09 01:09:11 8
8 2016-05-11 09:27:11 10
9 2016-05-13 17:45:11 12
10 2016-05-16 02:03:11 15
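The day column above counts elapsed calendar days, so it skips numbers when days are missing. If consecutive labels (1, 2, 3, ...) for the days actually present are wanted instead, one possible sketch in base R is shown below; the sample timestamps and the names elapsed_day and dense_day are made up for illustration:

```r
# Made-up timestamps covering three distinct days with a gap between them
datetime <- as.POSIXct(c("2016-05-02 01:03:11", "2016-05-02 15:27:11",
                         "2016-05-03 05:51:11", "2016-05-09 01:09:11"),
                       tz = "UTC")

# Calendar date of each row, taken in the same time zone as the data
day_date <- as.Date(datetime, tz = "UTC")

# Elapsed-day numbering: the first day is 1, gaps are preserved
elapsed_day <- as.integer(day_date - min(day_date)) + 1L

# Consecutive numbering: position of each date among the sorted distinct dates
dense_day <- match(day_date, sort(unique(day_date)))

data.frame(datetime, elapsed_day, dense_day)
```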
I specified the timezone in the code above to avoid getting tripped up by potential silent shifts between my local timezone and UTC. For example, note the silent shift from my default local time zone ("America/Los_Angeles") to UTC when converting a POSIXct datetime to a date:
# Fake data
datetime = seq(as.POSIXct("2016-05-02 01:03:11"), as.POSIXct("2016-05-05 01:03:11"), length.out=6)
tz(datetime)
[1] ""
date = as.Date(datetime)
tz(date)
[1] "UTC"
data.frame(datetime, date)
datetime date
1 2016-05-02 01:03:11 2016-05-02
2 2016-05-02 15:27:11 2016-05-02
3 2016-05-03 05:51:11 2016-05-03
4 2016-05-03 20:15:11 2016-05-04 # Note day is different due to timezone shift
5 2016-05-04 10:39:11 2016-05-04
6 2016-05-05 01:03:11 2016-05-05

In R how can I split a dataframe by date

I have a dataframe where one column is a date time (chron). I would like to split this dataframe into a list of dataframes, split by the date part only, so that each dataframe holds all the data for one day. I looked at the split function but I'm not sure how to split on only part of a column value.
Say you have this data.frame:
df <- data.frame(date=rep(seq.POSIXt(as.POSIXct("2010-01-01 15:26"), by="day", length.out=3), each=3), var=rnorm(9))
> df
date var
1 2010-01-01 15:26:00 -0.02814237
2 2010-01-01 15:26:00 -0.26924825
3 2010-01-01 15:26:00 -0.57968310
4 2010-01-02 15:26:00 0.88089757
5 2010-01-02 15:26:00 -0.79954092
6 2010-01-02 15:26:00 1.87145778
7 2010-01-03 15:26:00 0.93234835
8 2010-01-03 15:26:00 1.29130038
9 2010-01-03 15:26:00 -1.09841234
to split by day you just need:
> split(df, as.Date(df$date))
$`2010-01-01`
date var
1 2010-01-01 15:26:00 -0.02814237
2 2010-01-01 15:26:00 -0.26924825
3 2010-01-01 15:26:00 -0.57968310
$`2010-01-02`
date var
4 2010-01-02 15:26:00 0.8808976
5 2010-01-02 15:26:00 -0.7995409
6 2010-01-02 15:26:00 1.8714578
$`2010-01-03`
date var
7 2010-01-03 15:26:00 0.9323484
8 2010-01-03 15:26:00 1.2913004
9 2010-01-03 15:26:00 -1.0984123
EDIT:
The above method also works with chron datetime objects:
library(chron)
x <- chron(dates = "02/27/92", times = "22:29:56")
> x
[1] (02/27/92 22:29:56)
> as.Date(x)
[1] "1992-02-27"
EDIT 2
Making sure that as.Date doesn't change your data is crucial; see here:
# I'm using "DSTday" to make a sequence of one entire _apparent_ day
x <- seq.POSIXt(as.POSIXct("2010-03-27 00:31"), by="DSTday", length.out=3)
> x
[1] "2010-03-27 00:31:00 GMT" "2010-03-28 00:31:00 GMT" "2010-03-29 00:31:00 BST"
> as.Date(x)
[1] "2010-03-27" "2010-03-28" "2010-03-28"
The third item falls in summer time (BST). as.Date works in UTC here, so 00:31 BST becomes 23:31 on the previous day and the date comes out one day early. To avoid this:
> as.Date(cut(x, "DSTday"))
[1] "2010-03-27" "2010-03-28" "2010-03-29"
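Another way to keep the apparent day is to pass the data's own time zone to as.Date() explicitly, so the result does not depend on the default. A sketch, assuming the timestamps are UK local time (the "Europe/London" zone and the names day_key and by_day are assumptions for illustration):

```r
# Assumption: the timestamps are in UK local time
x <- seq(as.POSIXct("2010-03-27 00:31", tz = "Europe/London"),
         by = "DSTday", length.out = 3)
df <- data.frame(date = x, var = c(1, 2, 3))

# Take the calendar date in the same zone the timestamps are expressed in
day_key <- as.Date(df$date, tz = "Europe/London")

# One data frame per apparent day
by_day <- split(df, day_key)
names(by_day)
```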
The trick is to create a vector that tells R how to split the data. So in your example we have a data frame:
dd = data.frame(x = runif(100),data= paste0(1:4, "/05/13"))
##This step will depend on your data structure
dd$date = strptime(dd$data, "%d/%m/%y")
Note that I've made the date column have class POSIXlt (which inherits from POSIXt). This allows easy manipulation of dates.
Next I'll create the variable I'm going to split on - split_date. Basically, I subtract the minimum date from all other dates and divide by the number of seconds in a day:
split_date = as.numeric(difftime(dd$date, min(dd$date), units = "secs")) / 86400
Since this will result in fractions, I'll round down to the nearest day:
split_date = floor(split_date)
Now I use the split function in the standard way:
split_by_day = split(dd, split_date)
