Background info (see question at bottom):
I received a dataset of hourly averaged observations collected by instruments at hundreds of sites in different time zones every hour for the past 10 years. The instruments are never adjusted for daylight saving time, so all times in the dataset are in local standard time. The hourly reported values are averages of measurements made every minute for the previous hour. Year, month, day, and hour are reported in separate columns. The hours run from 1 to 24 instead of 0 to 23. I want to create a new column containing the UTC datetime.
The data table below is a sample dataset with my most recent solution, as far as it goes. For many frustrating hours over two weeks, I have experimented with strptime, chron, POSIXct, and POSIXlt, and scoured Stack Overflow and other sources to try to understand what the solution would be. I'm never sure of what's going on in my attempts at conversion (except when I'm sure it's wrong, which is most of the time!).
I'm not sure that the datetime column I've created is the correct intermediate step I should be using, either, or how to get from that to UTC time that R will handle correctly. I inserted the character "T" between date and time in my datetime column to force the column to remain as character, otherwise unexpected things happen. For example, my computer operating system timezone is America/Toronto, and
as.POSIXct(mydata$datetime, format="%Y-%m-%dT%H:%M %z")
converts 2013-01-01T01:00-0800 to 2013-01-01 04:00:00. The above command seems to be converting to my machine's timezone, not UTC. So, if I change the R environment time zone, without changing the computer operating system time zone, before running the command
Sys.setenv(TZ = "GMT")
mydata$dateUTC <- as.POSIXct(mydata$datetime, format="%Y-%m-%dT%H:%M %z")
Sys.unsetenv("TZ")
then the above command converts 2013-01-01T01:00-0800 to 2013-01-01 09:00:00 which appears to be the UTC time I'm looking for.
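For what it's worth, the same result appears to be obtainable without touching the environment, by passing tz = "UTC" to as.POSIXct. As far as I understand it, a POSIXct value stores a single instant, and the time zone only affects how that instant is printed. A minimal sketch (the names x and utc are just for illustration):

```r
# One timestamp in local standard time with an explicit UTC offset
x <- "2013-01-01T01:00-0800"

# %z reads the offset; tz = "UTC" only controls how the instant is displayed
utc <- as.POSIXct(x, format = "%Y-%m-%dT%H:%M%z", tz = "UTC")

# Ask for the UTC representation explicitly when printing
utc_text <- format(utc, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(utc_text)
# "2013-01-01 09:00:00"
```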
I'm not too worried about hour 24, because it seems that whatever method is used, the date is automatically increased to the next day and the hour changed to 00:00 (e.g., 2013-01-01 24:00 becomes 2013-01-02 00:00).
When converting from UTC to local time, I'm not too worried about the fact that the date on which clocks change from standard time to daylight saving time can change, and has changed, over the years. Given the correct UTC time and Olson time zone name, the IANA time zone database should take care of this automatically (I think).
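A quick way to check that assumption might look like the sketch below. It assumes the system has the IANA database installed and uses America/Toronto with a -5 hour standard offset purely as an example; the variable names are illustrative.

```r
# Two UTC instants, one in winter and one in summer
utc <- as.POSIXct(c("2013-01-01 12:00", "2013-07-01 12:00"), tz = "UTC")

# Civil (wall-clock) time: the tz database applies DST where it was in force
civil <- format(utc, "%H:%M", tz = "America/Toronto")

# Local standard time all year: shift by the fixed offset and ignore DST
std_offset_sec <- -5 * 3600
standard <- format(utc + std_offset_sec, "%H:%M", tz = "UTC")

print(civil)     # "07:00" "08:00"
print(standard)  # "07:00" "07:00"
```

The second conversion deliberately labels the shifted values as UTC so that no DST rule is ever applied to them.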
Question 1:
Using R, how should I convert year-month-day-hour reported in local standard time all year to UTC time?
Question 2:
Using R, how should I convert from UTC time to local standard time (without converting to DST in localities that use DST for civil time)?
Question 3:
Using R, how should I convert from UTC time to local time, taking into account daylight saving time?
Question 4:
In order to convert from UTC to local time, I will need the timezone names from the IANA database. Is there some way I can pull this in from somewhere on the web, given the latitude and longitude for each site?
filename = mydata
site year month day hourend UTCoffset datetime obs
2001 2015 1 1 22:00 -0200 2013-01-01T22:00-0200 1356
2001 2015 1 1 23:00 -0200 2013-01-01T23:00-0200 1593
2001 2015 1 1 24:00 -0200 2013-01-01T24:00-0200 946
2001 2015 1 2 01:00 -0200 2013-01-02T01:00-0200 271
2001 2015 1 2 02:00 -0200 2013-01-02T02:00-0200 665
3001 2015 1 1 22:00 -0350 2013-01-01T22:00-0350 548
3001 2015 1 1 23:00 -0350 2013-01-01T23:00-0350 936
3001 2015 1 1 24:00 -0350 2013-01-01T24:00-0350 1938
3001 2015 1 2 01:00 -0350 2013-01-02T01:00-0350 952
3001 2015 1 2 02:00 -0350 2013-01-02T02:00-0350 1584
4001 2015 1 1 22:00 -0400 2013-01-01T22:00-0400 1837
4001 2015 1 1 23:00 -0400 2013-01-01T23:00-0400 1275
4001 2015 1 1 24:00 -0400 2013-01-01T24:00-0400 382
4001 2015 1 2 01:00 -0400 2013-01-02T01:00-0400 837
4001 2015 1 2 02:00 -0400 2013-01-02T02:00-0400 592
5001 2015 1 1 22:00 -0500 2013-01-01T22:00-0500 392
5001 2015 1 1 23:00 -0500 2013-01-01T23:00-0500 15
5001 2015 1 1 24:00 -0500 2013-01-01T24:00-0500 403
5001 2015 1 2 01:00 -0500 2013-01-02T01:00-0500 993
5001 2015 1 2 02:00 -0500 2013-01-02T02:00-0500 1287
6001 2015 1 1 22:00 -0600 2013-01-01T22:00-0600 738
6001 2015 1 1 23:00 -0600 2013-01-01T23:00-0600 992
6001 2015 1 1 24:00 -0600 2013-01-01T24:00-0600 1392
6001 2015 1 2 01:00 -0600 2013-01-02T01:00-0600 189
6001 2015 1 2 02:00 -0600 2013-01-02T02:00-0600 1282
7001 2015 1 1 22:00 -0700 2013-01-01T22:00-0700 839
7001 2015 1 1 23:00 -0700 2013-01-01T23:00-0700 742
7001 2015 1 1 24:00 -0700 2013-01-01T24:00-0700 942
7001 2015 1 2 01:00 -0700 2013-01-02T01:00-0700 882
7001 2015 1 2 02:00 -0700 2013-01-02T02:00-0700 993
8001 2015 1 1 22:00 -0800 2013-01-01T22:00-0800 1140
8001 2015 1 1 23:00 -0800 2013-01-01T23:00-0800 1532
8001 2015 1 1 24:00 -0800 2013-01-01T24:00-0800 1834
8001 2015 1 2 01:00 -0800 2013-01-02T01:00-0800 1732
8001 2015 1 2 02:00 -0800 2013-01-02T02:00-0800 954
You can check out the "lubridate" package in R. The strptime function from base R would also be useful for your case.
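As a rough sketch of the first question using only base R: build midnight of the reported date in UTC, add the hour (so hour 24 rolls over to the next day by itself), then subtract the offset. The helper name to_utc and its arguments are made up for illustration, and the offset is assumed to be stored as a character string such as "-0800".

```r
# Hypothetical helper: local standard time (separate columns) -> UTC
to_utc <- function(year, month, day, hour, offset) {
  # Parse "+hhmm" / "-hhmm" into seconds east of UTC
  sgn <- ifelse(substr(offset, 1, 1) == "-", -1, 1)
  hh <- as.integer(substr(offset, 2, 3))
  mm <- as.integer(substr(offset, 4, 5))
  offset_sec <- sgn * (hh * 3600 + mm * 60)

  # Midnight of the reported date, treated as if it were UTC
  midnight <- ISOdatetime(year, month, day, 0, 0, 0, tz = "UTC")

  # Adding the hour as seconds handles hour 24; subtracting the offset gives UTC
  midnight + hour * 3600 - offset_sec
}

utc <- to_utc(2013, 1, 1, c(1, 24), "-0800")
utc_text <- format(utc, "%Y-%m-%d %H:%M", tz = "UTC")
print(utc_text)
# "2013-01-01 09:00" "2013-01-02 08:00"
```

Note that this helper reads an offset of -0350 as 3 hours 50 minutes. If that column is meant to be 3.5 hours (Newfoundland standard time), the value would need to be -0330 in this notation.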
When I try using as.POSIXlt or strptime I keep getting a single value of 'NA' as a result.
What I need to do is transform 3 and 4 digit numbers e.g. 2300 or 115 to 23:00 or 01:15 respectively, but I simply cannot get any code to work.
Basically, this data frame of outputs:
Time
1 2345
2 2300
3 2130
4 2400
5 115
6 2330
7 100
8 2300
9 1530
10 130
11 100
12 215
13 2245
14 145
15 2330
16 2400
17 2300
18 2230
19 2130
20 30
should look like this:
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
7 01:00
8 23:00
9 15:30
10 01:30
11 01:00
12 02:15
13 22:45
14 01:45
15 23:30
16 24:00
17 23:00
18 22:30
19 21:30
20 00:30
I think you can use the following solution. Note, however, that it produces a character vector rather than a time class:
gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)) |>
as.data.frame() |>
setNames("Time") |>
head()
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
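An arithmetic alternative, in case the regular expression feels opaque: split each number into hours and minutes with integer division and the remainder, then pad both with sprintf. This is only a sketch, assuming the values are whole numbers; time_num and time_chr are illustrative names. The result is still a character vector.

```r
# A few of the values from the question, stored as integers
time_num <- c(2345L, 2300L, 2130L, 2400L, 115L, 30L)

# Integer division gives the hours, the remainder gives the minutes
hours <- time_num %/% 100L
minutes <- time_num %% 100L

# Zero-pad both parts to two digits
time_chr <- sprintf("%02d:%02d", hours, minutes)
print(time_chr)
# "23:45" "23:00" "21:30" "24:00" "01:15" "00:30"
```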
head(a)
ID yr mth
1 1 2021 M04
2 2 2021 M04
3 3 2021 M04
4 4 2021 M04
5 5 2021 M04
6 6 2021 M04
head(b)
yr mth period_start period_end
1 2015 M03 2015-03-01 2015-03-31
2 2015 M03 2015-03-01 2015-03-31
3 2015 M04 2015-04-01 2015-04-30
4 2015 M04 2015-04-01 2015-04-30
5 2015 M05 2015-05-01 2015-05-30
6 2015 M05 2015-05-01 2015-05-30
I want a result like the one below: a joined with b, keeping every row of a, where the two date fields are missing (or set to "1900-01-01") for records that have no match in b.
ID yr mth period_start period_end
1 1 2019 M04 2015-03-01 2015-03-31
2 2 2018 M01 2015-03-01 2015-03-31
3 3 2021 M01 2015-04-01 2015-04-30
4 4 2015 M04 ???? ????
5 5 2021 M03 2015-04-01 2015-04-30
6 6 2021 M04 2015-04-01 2015-04-30
Please advise how to do this. Thanks.
I don't have enough reputation to comment and ask for clarification.
However, I would use dplyr in R, where we have a "traditional" left join. This will preserve all the rows of the first data frame (a) and generate missing values in the two new columns from (b) where there is no match.
Only the classes of the columns you are merging on need to be the same; the rest can be transformed later on. If the code below generates an error, you can share it and we can go from there.
library(dplyr)
df <- left_join(a, b, by = c("yr", "mth"))
If you want to work on the dates, I recommend the lubridate package in R, which makes converting to and from dates very easy.
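If dplyr is not available, base R's merge with all.x = TRUE should behave like a left join. The sketch below uses small made-up data frames shaped like a and b. Because the sample of b shows repeated rows per yr/mth pair, it is de-duplicated first so that rows of a are not multiplied; that step is an assumption about what is wanted.

```r
# Small stand-ins for the question's data frames (values are illustrative)
a <- data.frame(ID = 1:3,
                yr = c(2015, 2015, 2021),
                mth = c("M03", "M04", "M04"))
b <- data.frame(yr = c(2015, 2015, 2015, 2015),
                mth = c("M03", "M03", "M04", "M04"),
                period_start = as.Date(c("2015-03-01", "2015-03-01",
                                         "2015-04-01", "2015-04-01")),
                period_end = as.Date(c("2015-03-31", "2015-03-31",
                                       "2015-04-30", "2015-04-30")))

# Left join: keep every row of a, fill NA where b has no matching yr/mth
res <- merge(a, unique(b), by = c("yr", "mth"), all.x = TRUE)

# merge() sorts by the key columns, so restore the original order of a
res <- res[order(res$ID), ]
print(res)
```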
I'm starting to study time series analysis. I have some data frames structured as follows: each row is one customer service record, with the start date (mse_in) and end date (mse_fim) of the service and the neighborhood (Bairro) where the activity occurred.
> mse_df
# A tibble: 484 × 3
mse_in mse_fim Bairro
<date> <date> <fctr>
1 2015-11-03 2016-08-11 Pachecos
2 2013-03-18 2014-10-02 Bela Vista
3 2012-08-08 2015-09-24 Brejaru
4 2014-02-24 2014-12-17 Madri
5 2015-03-30 2015-04-29 Jardim Eldorado
6 2012-07-30 2013-09-19 Brejaru
7 2016-05-24 2017-05-19 Frei Damiao
8 2012-08-13 2015-02-09 Ponte do Imaruim
9 2012-08-08 2014-07-23 Ponte do Imaruim
10 2012-07-30 2012-10-10 Caminho Novo
# ... with 474 more rows
I would like to run one time series analysis counting all the clients and another for each neighborhood, between January 2012 and April 2017, but I do not know how to produce the time series needed to start the analyses. Thanks.
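One possible starting point, assuming a service should count toward every month that its interval overlaps (that definition is an assumption, as are the helper name count_active and the three sample rows copied from the print-out above), is to build a monthly grid and count the overlapping rows:

```r
# Three rows copied from the sample, as a plain data frame
mse_df <- data.frame(
  mse_in  = as.Date(c("2015-11-03", "2013-03-18", "2012-08-08")),
  mse_fim = as.Date(c("2016-08-11", "2014-10-02", "2015-09-24")),
  Bairro  = c("Pachecos", "Bela Vista", "Brejaru")
)

# First and last day of every month from January 2012 to April 2017
month_start <- seq(as.Date("2012-01-01"), as.Date("2017-04-01"), by = "month")
month_end <- seq(as.Date("2012-02-01"), as.Date("2017-05-01"), by = "month") - 1

# Hypothetical helper: services whose interval overlaps each month
count_active <- function(d) {
  sapply(seq_along(month_start), function(i) {
    sum(d$mse_in <= month_end[i] & d$mse_fim >= month_start[i])
  })
}

# One series for all clients, and one per neighborhood
all_ts <- ts(count_active(mse_df), start = c(2012, 1), frequency = 12)
by_bairro <- lapply(split(mse_df, mse_df$Bairro), function(d) {
  ts(count_active(d), start = c(2012, 1), frequency = 12)
})
```

If the analysis should instead count services started (or finished) in each month, only the condition inside count_active would need to change.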
I have the following data set and I am trying to convert the consumption to a time series. Some of the data are nonexistent (e.g. there is no data for 10/2014).
year month consumption
2014 7 10617
2014 8 8318
2014 9 3199
2014 12 2066
2015 1 10825
2015 2 3096
2015 3 1665
2015 4 3651
2015 5 5807
2015 7 2951
2015 8 5885
2015 9 3653
2015 10 4266
2015 11 9706
When I use ts() in R, wrong values are filled in for the nonexistent months:
ts(mkt$consumption, start = c(2014,7), end = c(2015,11), frequency = 12)
       Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
2014                                     10617  8318  3199  2066 10825  3096
2015  1665  3651  5807  2951  5885  3653  4266  9706 10617  8318  3199
My question is how to simply replace the nonexistent values with zero or blank?
"ts" class requires that the data be regularly spaced, i.e. every month should be present or NA but that is not the case here. The zoo package can handle irregularly spaced series. Read the input into zoo using the "yearmon" class for the year/month and then simply use it as a "zoo" series or else convert it to "ts". If the input is in a file but otherwise is exactly the same as in Lines then replace text = Lines with something like "myfile.dat" .
Lines <- "year month consumption
2014 7 10617
2014 8 8318
2014 9 3199
2014 12 2066
2015 1 10825
2015 2 3096
2015 3 1665
2015 4 3651
2015 5 5807
2015 7 2951
2015 8 5885
2015 9 3653
2015 10 4266
2015 11 9706"
library(zoo)
toYearmon <- function(y, m) as.yearmon(paste(y, m), "%Y %m")
z <- read.zoo(text = Lines, header = TRUE, index = 1:2, FUN = toYearmon)
as.ts(z)
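If installing zoo is not an option, a base R sketch of the same idea is to compute each row's position in a complete monthly grid, fill that grid with NA, and drop the observed values into place. The question also asked for zeros, which can be substituted afterwards. The names idx, vals, x and x0 are illustrative.

```r
# The data from the question
mkt <- data.frame(
  year = c(rep(2014, 4), rep(2015, 10)),
  month = c(7, 8, 9, 12, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11),
  consumption = c(10617, 8318, 3199, 2066, 10825, 3096, 1665,
                  3651, 5807, 2951, 5885, 3653, 4266, 9706)
)

# Position of each row in a complete monthly sequence starting at the first row
idx <- (mkt$year - mkt$year[1]) * 12 + (mkt$month - mkt$month[1]) + 1

# Full-length vector of NA, with the observed values placed at their positions
vals <- rep(NA_real_, max(idx))
vals[idx] <- mkt$consumption

x <- ts(vals, start = c(mkt$year[1], mkt$month[1]), frequency = 12)

# Optional: replace the missing months with zero
x0 <- x
x0[is.na(x0)] <- 0
```

Whether zero is a sensible stand-in for "no data" depends on the analysis; keeping NA is usually the safer default.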
I need to calculate the max value contained between the beginning of the day and the moment when the min value happened. This is a toy example of my dataset for one day and one dendro:
TIMESTAMP year DOY ring dendro diameter
1 2013-05-02 00:00:00 2013 122 1 1 3405
2 2013-05-02 00:15:00 2013 122 1 1 3317
3 2013-05-02 00:30:00 2013 122 1 1 3217
4 2013-05-02 00:45:00 2013 122 1 1 3026
5 2013-05-02 01:00:00 2013 122 1 1 4438
6 2013-05-03 00:00:00 2013 123 1 1 3444
7 2013-05-03 00:15:00 2013 123 1 1 3410
8 2013-05-03 00:30:00 2013 123 1 1 3168
9 2013-05-03 00:45:00 2013 123 1 1 3373
10 2013-05-02 00:00:00 2013 122 2 4 5590
11 2013-05-02 00:15:00 2013 122 2 4 5602
12 2013-05-02 00:30:00 2013 122 2 4 5515
13 2013-05-02 00:45:00 2013 122 2 4 4509
14 2013-05-02 01:00:00 2013 122 2 4 5566
15 2013-05-02 01:15:00 2013 122 2 4 6529
First, I calculated the MIN diameter for each day (DOY = day of the year) in each dendro (contained in one ring), also getting the time at which that min value happened:
library(plyr)
dailymin <- ddply(datamelt, .(year, DOY, ring, dendro),function(x)x[which.min(x$diameter), ])
Now, my problem is that I want to calculate the MAX diameter for each day. However, sometimes the max value occurs after the min value. I am only interested in the max value that occurs BEFORE the min value, not in the overall max if it happened after the min. Therefore, I need the max value (for each DAY) WITHIN THE TIME INTERVAL FROM THE BEGINNING OF THE DAY (00:00:00) TO THE MIN DIAMETER. As with the min, I also need to know at what time that max value happened. This is what I want from the previous df:
year DOY ring dendro timeMin min timeMax max
1 2013 122 1 1 2013-05-02 00:45:00 3026 2013-05-02 00:00:00 3405
2 2013 123 1 1 2013-05-03 00:30:00 3168 2013-05-03 00:00:00 3444
3 2013 122 2 4 2013-05-02 00:45:00 4509 2013-05-02 00:15:00 5602
As you can see, the min value is the actual min value. However, the max value I want is not the max value of the day, it is the max value that happened between the beginning of the day and the min value.
My first attempt, which was unsuccessful, returns the max value of the day even if it falls outside the desired time interval:
dailymax <- ddply(datamelt, .(year, DOY, ring, dendro),
function(x)x[which.max(x$diameter[1:which.min(datamelt$diameter)]), ])
Any ideas?
In a data.table, you could write:
DT[,{
istar <- which.min(diameter)
list(
dmin=diameter[istar],
prevmax=max(diameter[1:istar])
)},by='year,DOY,ring,dendro']
# year DOY ring dendro dmin prevmax
# 1: 2013 242 6 8 470 477.2
I assume that a similar function can be written with your **ply functions.
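For comparison, here is a sketch of the same idea with base R's split and lapply, using rows from the question's toy data. It assumes the rows are already sorted by time within each group; the helper name max_before_min is made up for illustration.

```r
# Rows for two dendros on day 122, copied from the question
datamelt <- data.frame(
  TIMESTAMP = c("2013-05-02 00:00:00", "2013-05-02 00:15:00",
                "2013-05-02 00:30:00", "2013-05-02 00:45:00",
                "2013-05-02 01:00:00",
                "2013-05-02 00:00:00", "2013-05-02 00:15:00",
                "2013-05-02 00:30:00", "2013-05-02 00:45:00",
                "2013-05-02 01:00:00", "2013-05-02 01:15:00"),
  year = 2013, DOY = 122,
  ring = c(rep(1, 5), rep(2, 6)),
  dendro = c(rep(1, 5), rep(4, 6)),
  diameter = c(3405, 3317, 3217, 3026, 4438,
               5590, 5602, 5515, 4509, 5566, 6529)
)

# Hypothetical helper: min of the group, and max among rows up to that min
max_before_min <- function(x) {
  imin <- which.min(x$diameter)
  imax <- which.max(x$diameter[seq_len(imin)])
  data.frame(x[1, c("year", "DOY", "ring", "dendro")],
             timeMin = x$TIMESTAMP[imin], min = x$diameter[imin],
             timeMax = x$TIMESTAMP[imax], max = x$diameter[imax])
}

# One group per year/DOY/ring/dendro combination that actually occurs
groups <- split(datamelt, datamelt[c("year", "DOY", "ring", "dendro")], drop = TRUE)
result <- do.call(rbind, lapply(groups, max_before_min))
print(result)
```

The key difference from the attempt in the question is that which.min is applied to the group's own diameter column (x$diameter) rather than to the whole data frame.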
EDIT1: where DT comes from...
require(data.table)
# read.table() parses the text; the unnamed first field becomes the row names
DT <- data.table(read.table(header=TRUE, text='
date TIMESTAMP year DOY ring dendro diameter
1928419 2013-08-30 00:00:00 2013 242 6 8 471.5
1928420 2013-08-30 01:30:00 2013 242 6 8 477.2
1928421 2013-08-30 03:00:00 2013 242 6 8 474.7
1928422 2013-08-30 04:30:00 2013 242 6 8 470.0
1928423 2013-08-30 06:00:00 2013 242 6 8 475.6
1928424 2013-08-30 08:30:00 2013 242 6 8 478.7'))
Your "TIMESTAMP" has a space in it, so I'm reading it as two columns, with the first called "date". Paste them together if you like. Next time, you can look into making a "reproducible example", as described here: How to make a great R reproducible example?
EDIT2: For the time of the max and min:
DT[,{
istar <- which.min(diameter)
istar2 <- which.max(diameter[1:istar])
list(
dmin = diameter[istar],
tmin = TIMESTAMP[istar],
dmax = diameter[istar2],
tmax = TIMESTAMP[istar2]
)},by='year,DOY,ring,dendro']
# year DOY ring dendro dmin tmin dmax tmax
# 1: 2013 242 6 8 470 04:30:00 477.2 01:30:00
As mentioned in EDIT1, I don't have both pieces of your TIMESTAMP variable in a single column because you did not provide them that way. To add more columns, just add new expressions in the list() above. The idea behind the code is that the {} expression is a code block where you can work with the variables in the chunk of data associated with each year,DOY,ring,dendro combination and return a list of new columns.