I am having problems merging/joining data across the coming daylight saving time shift. My time vector d is supposed to be the controlling time vector, so when I join it with data that has holes I just get NA values. This normally works brilliantly. However, for the coming '2015-10-25 02:00:00' it goes horribly wrong.
Data example:
d <- seq.POSIXt(from = as.POSIXct("2015-10-25 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = ""),
                to   = as.POSIXct("2015-10-25 23:00:00", format = "%Y-%m-%d %H:%M:%S", tz = ""),
                by   = "hour")
df1 <- data.frame(Date = d, value1 = 1:25)
df2 <- data.frame(Date = as.POSIXct(format(d, "%Y-%m-%d %H:%M:%S"), tz = ""), value2 = 26:50)
require(dplyr)
df <- left_join(df1, df2, by = "Date")
df <- merge(df1, df2, by = "Date", all.x = TRUE)
Both left_join and merge give wrong results, and I am not sure what goes wrong. Well, I can see that R has no idea how to handle the two repeated hours - and that is completely understandable. Both time series are POSIXct, so clearly there is some information I am missing. How do you handle this? I would prefer a base R solution.
It gets exponentially worse if you need to do even more joins from different data sets. I need to join 7, and it just gets worse and worse.
The correct result is:
result <- data.frame(Date = d, var1 = df1[, 2], var2 = df2[, 2])
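For what it's worth, the information loss can be demonstrated without any joining at all: once the POSIXct vector is run through format(), the two 02:00 instants collapse into the same string, and re-parsing cannot tell them apart. A minimal sketch (using the explicit DST-observing zone "Europe/Berlin" as a stand-in for my tz = "" locale - an assumption):

```r
# On the DST-end day a CET/CEST hourly sequence contains 25 instants,
# two of which print identically as "02:00:00"
d <- seq.POSIXt(from = as.POSIXct("2015-10-25 00:00:00", tz = "Europe/Berlin"),
                to   = as.POSIXct("2015-10-25 23:00:00", tz = "Europe/Berlin"),
                by   = "hour")
length(d)                                      # 25: 02:00 occurs twice (CEST, then CET)
anyDuplicated(format(d, "%Y-%m-%d %H:%M:%S"))  # non-zero: the text form is ambiguous
```

So any base-R fix has to avoid round-tripping through formatted text: keep the original POSIXct values (or their as.numeric() epoch seconds) as the join key.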
Related
I am trying to convert this vector (v) into Unix timestamps. The year, month, day, hour, minute and second are unimportant, but they need to be introduced.
v = c("10ms", "20ms", "30ms", ..., "800ms")
A) I thought of starting with:
x <- c("2000-01-01 0:00:00.000", "2000-01-01 0:00:00.010", "2000-01-01 0:00:00.020")
y <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS", origin = "1970-01-01", tz = "America/Chicago")
y = format(y, "%Y-%m-%d %H:%M:%OS4")
B) and convert that into a Unix time stamp:
z = as.numeric(as.POSIXct(y, format = "%Y-%m-%d %H:%M:%OS4", origin = "1970-01-01", tz = "America/Chicago"))
This yields: NA NA NA
z = as.numeric(as.POSIXct(y, format = "%Y-%m-%d %H:%M:%OS", origin = "1970-01-01", tz = "America/Chicago"))
This yields: 946706400 946706400 946706400
Any help would be appreciated.
This is surprisingly simple to fix. In your example, z actually does contain the decimal part, but it isn't printed to the console because the number is so large. If you convert it to a character, you should get the result you want:
as.character(round(z, 3))
#> [1] "946706400" "946706400.01" "946706400.02"
Or depending on your chosen format:
y <- as.POSIXct("2000-01-01") + lubridate::milliseconds(1:10)
format(round(as.numeric(y), 3), digits = 13)
#> [1] "946684800.001" "946684800.002" "946684800.003" "946684800.004" "946684800.005"
#> [6] "946684800.006" "946684800.007" "946684800.008" "946684800.009" "946684800.010"
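A direct way to see that nothing was lost in z: the fractional seconds survive as.numeric(), and only the default 7-significant-digit printing hides them. A minimal sketch (UTC chosen here just to make the number reproducible):

```r
# The fractional part is stored in the numeric value; sprintf() reveals it
# without any printing surprises
z <- as.numeric(as.POSIXct("2000-01-01 00:00:00.010", tz = "UTC"))
sprintf("%.3f", z)  # "946684800.010"
```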
I have the above table. I would like to fill in the missing values under Transaction ID. The algorithm for filling this would be as follows:
User ID "kenn1" has two missing Transaction IDs, and these can be filled using the other two Transaction IDs, t1 and t4.
To choose which one to use between t1 and t4, I look at the Event Time. The first missing value happens at 9:30, and it is 30 minutes away from t1 and 20 minutes away from t4. Since t4 is closer to this missing value it would be filled as t4. Similarly for the missing value in row 4, it is 45 minutes away from t1 and 5 minutes away from t4. It would therefore be replaced with t4.
A similar approach applies to the missing values for User ID "kenn2".
How do I do this in R?
Probably there is a better solution, but I wrote this one with data.table:
library(data.table)
library(zoo)  # for na.locf()

# Create a data.table; in practice this could come from read.csv, read.xlsx, etc.
raw <- data.table(
  Event = paste0("e", 1:10),
  TransactionID = c("t1", NA, NA, "t4", NA, "t5", "t6", NA, NA, "t8"),
  UserId = c(rep("kenn1", 4), rep("kenn2", 6)),
  EventTime = as.POSIXct(
    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50",
      "2017-05-20 10:01", "2017-05-20 10:02", "2017-05-20 10:03", "2017-05-20 10:04",
      "2017-05-20 10:05", "2017-05-20 10:06"),
    format = "%Y-%m-%d %H:%M")
)
transactionTimes <- raw[!is.na(TransactionID), .(TransactionID, EventTime)]
# Carry the nearest non-missing TransactionID forward and backward within each user
raw[, Above := na.locf(TransactionID, na.rm = FALSE), UserId]
raw[, Below := na.locf(TransactionID, na.rm = FALSE, fromLast = TRUE), UserId]
# Attach the event times of those candidate transactions
raw <- merge(raw, transactionTimes[, .(Above = TransactionID, AboveTime = EventTime)], by = "Above", all.x = TRUE)
raw <- merge(raw, transactionTimes[, .(Below = TransactionID, BelowTime = EventTime)], by = "Below", all.x = TRUE)
raw[, AboveDiff := EventTime - AboveTime]
raw[, BelowDiff := BelowTime - EventTime]
# At the edges only one candidate exists, so fall back to it; otherwise take the closer one
raw[is.na(TransactionID) & is.na(AboveDiff), TransactionID := Below]
raw[is.na(TransactionID) & is.na(BelowDiff), TransactionID := Above]
raw[is.na(TransactionID), TransactionID := ifelse(AboveDiff <= BelowDiff, Above, Below)]
raw <- raw[, .(Event, TransactionID, UserId, EventTime)]
rm(transactionTimes)
Another solution with data.table.
library(data.table)

# Create a data.table; in practice this could come from read.csv, read.xlsx, etc.
raw <- data.table(
  Event = paste0("e", 1:10),
  TransactionID = c("t1", NA, NA, "t4", NA, "t5", "t6", NA, NA, "t8"),
  UserId = c(rep("kenn1", 4), rep("kenn2", 6)),
  EventTime = as.POSIXct(
    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50",
      "2017-05-20 10:01", "2017-05-20 10:02", "2017-05-20 10:03", "2017-05-20 10:04",
      "2017-05-20 10:05", "2017-05-20 10:06"),
    format = "%Y-%m-%d %H:%M")
)
# Subset the rows with a non-missing TransactionID
raw_notNA <- raw[!is.na(TransactionID)]
# Merge the subset with the original (this duplicates each original row once per candidate row)
merged <- merge(raw, raw_notNA, all.x = TRUE, by = "UserId", allow.cartesian = TRUE)
# Calculate the time difference between the original and candidate rows
merged[, DiffTime := abs(EventTime.x - EventTime.y)]
# Create the new Transaction ID from the closest event
merged[, NewTransactionID := TransactionID.y[DiffTime == min(DiffTime)], by = Event.x]
# Remove the duplicated rows and delete the unnecessary columns
output <- merged[, .SD[1], by = Event.x][, list(Event.x, NewTransactionID, UserId, EventTime.x)]
names(output) <- names(raw)
print(output)
Inspired by the answers to this question (yours is not a duplicate, just similar):
R - merge dataframes on matching A, B and *closest* C?
I have a data frame with daily data. I need to bind it to hourly data, but first I need to convert it to a suitable POSIXct format. It looks like this:
set.seed(42)
df <- data.frame(
  Date = seq.Date(from = as.Date("2015-01-01", "%Y-%m-%d"),
                  to   = as.Date("2015-01-29", "%Y-%m-%d"),
                  by   = "day"),
  var1 = runif(29, min = 5, max = 10)
)
result <- data.frame(
  Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = ""),
                    to   = as.POSIXct("2015-01-29 23:00:00", format = "%Y-%m-%d %H:%M:%S", tz = ""),
                    by   = "hour"),
  var1 = rep(df$var1, each = 24)
)
However, my data is not as easy to work with as the above. I have lots of missing dates, so I need to be able to take the specific df$Date vector and convert it to a POSIXct frame with the matching daily values.
I've looked high and low but have been unable to find anything on this.
The way I went about this was to find the min and max of the dataset and treat them as hour 0 and hour 23.
hourly <- data.frame(
  Hourly = seq(min(as.POSIXct(paste(df$Date, "00:00:00"), tz = "")),
               max(as.POSIXct(paste(df$Date, "23:00:00"), tz = "")),
               by = "hour")
)
hourly[, "Var1"] <- df[match(x = as.Date(hourly$Hourly), df$Date), "var1"]
This turns the daily values into hourly ones, with each day's var1 assigned to every hour of that day. Missing daily values are not a problem: where there is no match, NAs are filled in. (Note paste() rather than paste0(): without the separating space, as.POSIXct cannot parse the string.)
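To see the NA behaviour concretely, here is a toy check with a deliberately missing day (the two-row df is made up for illustration, not the asker's data; UTC is used so the result is locale-independent):

```r
# Jan 2 is missing from the daily data
df <- data.frame(Date = as.Date(c("2015-01-01", "2015-01-03")), var1 = c(5, 7))
hourly <- data.frame(
  Hourly = seq(as.POSIXct(paste(min(df$Date), "00:00:00"), tz = "UTC"),
               as.POSIXct(paste(max(df$Date), "23:00:00"), tz = "UTC"),
               by = "hour")
)
hourly$Var1 <- df$var1[match(as.Date(hourly$Hourly), df$Date)]
# 72 hourly rows: hours 1-24 carry 5, hours 25-48 (the missing Jan 2) are NA,
# hours 49-72 carry 7
```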
Let's say I have a data frame consisting of 3 columns with dates:
index <- c("31.10.2012", "16.06.2012")
begin <- c("22.10.2012", "29.05.2012")
end <- c("24.10.2012", "17.06.2012")
index.new <- as.Date(index, format = "%d.%m.%Y")
begin.new <- as.Date(begin, format = "%d.%m.%Y")
end.new <- as.Date(end, format = "%d.%m.%Y")
data.frame(index.new, begin.new, end.new)
My problem: I want to select (subset) the rows where the interval between the begin and end dates is within 4 days before the index day. This is obviously only the case in row 2.
Can you help me out here?
Your way of expressing the problem is messy: in the first case dates.new[1] > dates.new[2], and in the second case dates.new[3] < dates.new[4]. Making things consistent:
interval1 <- c(dates.new[2], dates.new[1])
interval2 <- c(dates.new[3], dates.new[4])
If you want to check whether interval2 CONTAINS interval1:
all.equal(findInterval(interval1, interval2), c(1, 1))
Please let me know if this works and whether it is what you want.
library("timeDate")
index <- c("31.10.2012", "16.06.2012")
begin <- c("22.10.2012", "29.05.2012")
end <- c("24.10.2012", "17.06.2012")
index.new <- as.Date(index, format = "%d.%m.%Y")
begin.new <- as.Date(begin, format = "%d.%m.%Y")
end.new <- as.Date(end, format = "%d.%m.%Y")
data <- data.frame(index.new, begin.new, end.new)
apply(data, 1, function(x) {
  paste(x[1]) %in% paste(timeSequence(x[2], x[3], by = "day"))
})
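A vectorised base-R alternative, under one reading of the (ambiguous) criterion - namely that the [begin, end] interval must overlap the 4 days leading up to the index day; treat the condition itself as an assumption:

```r
index <- as.Date(c("31.10.2012", "16.06.2012"), format = "%d.%m.%Y")
begin <- as.Date(c("22.10.2012", "29.05.2012"), format = "%d.%m.%Y")
end   <- as.Date(c("24.10.2012", "17.06.2012"), format = "%d.%m.%Y")
data  <- data.frame(index.new = index, begin.new = begin, end.new = end)

# keep rows where [begin, end] overlaps [index - 4, index]
data[data$begin.new <= data$index.new & data$end.new >= data$index.new - 4, ]
```

On the example data this keeps only row 2, matching the expected result.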
I am subtracting dates in xts, i.e.:
library(xts)
# make data
x <- data.frame(x = 1:4,
                BDate = c("1/1/2000 12:00", "2/1/2000 12:00", "3/1/2000 12:00", "4/1/2000 12:00"),
                CDate = c("2/1/2000 12:00", "3/1/2000 12:00", "4/1/2000 12:00", "9/1/2000 12:00"),
                ADate = c("3/1/2000", "4/1/2000", "5/1/2000", "10/1/2000"),
                stringsAsFactors = FALSE)
x$ADate <- as.POSIXct(x$ADate, format = "%d/%m/%Y")
# object we will use
xxts <- xts(x[, 1:3], order.by = x[, 4])
#### The subtractions
# answer in days
transform(xxts, lag = as.POSIXct(BDate, format = "%d/%m/%Y %H:%M") - index(xxts))
# answer in hours
transform(xxts, lag = as.POSIXct(CDate, format = "%d/%m/%Y %H:%M") - index(xxts))
Question: how can I standardise the result so that I always get the answer in hours? Not by multiplying the days by 24, as I will not know beforehand whether the subtraction will round to days or hours...
Unless I can somehow check whether the format is in days, perhaps using grep and regex, and then multiply within an if clause.
I have tried to work through this and went for the grep/regex approach, but it doesn't even keep the negative sign:
p <- transform(xxts, lag = as.POSIXct(BDate, format = "%d/%m/%Y %H:%M") - index(xxts))
library(stringr)
ind <- grep("days", p$lag)
p$lag[ind] <- as.numeric(str_extract_all(p$lag[ind], "\\(?[0-9,.]+\\)?")) * 24
p$lag
#2000-01-03 2000-01-04 2000-01-05 2000-01-10
# 36 36 36 132
I am convinced there is a more elegant solution...
OK, difftime works:
transform(xxts, lag = difftime(as.POSIXct(BDate, format = "%d/%m/%Y %H:%M"), index(xxts), units = "hours"))
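In other words, difftime() with an explicit units argument fixes the unit regardless of the gap's size, which is exactly what the grep/regex workaround was trying to fake. A standalone illustration (toy timestamps, not the xts data above):

```r
a <- as.POSIXct("2000-01-01 12:00", tz = "UTC")
b <- as.POSIXct("2000-01-03 00:00", tz = "UTC")
b - a                                        # a "-" subtraction picks its own unit (here days)
as.numeric(difftime(b, a, units = "hours"))  # 36: always hours
as.numeric(difftime(a, b, units = "hours"))  # -36: the sign is preserved too
```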