I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.
Assumptions:
There will never be missing data in the first or last row; I'm generating the missing dates based on missing days between the min and max of the data set
There can be multiple gaps in the data set
Current data:
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260,
1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end",
"SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
This was my attempt at achieving this goal, but it only works for the first missing value:
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
The outstanding answer from @mathematical.coffee below:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
ungroup() %>%
select(-tmp)
EDIT: new version works for any number of NA runs.
This one doesn't need zoo, either.
First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs such that each group with the same tmp value consists of one non-NA value followed by a run of NA values.
Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))
end SNAP_ID tmp
1 2015-06-26 12:59:00 365.0 1
2 2015-06-26 13:59:00 366.0 2
3 2015-06-27 00:01:00 366.1 2
4 2015-06-27 23:00:00 366.2 2
5 2015-06-28 00:01:00 366.3 2
6 2015-06-28 23:00:00 366.4 2
7 2015-06-29 09:00:00 367.0 3
8 2015-06-29 09:59:00 368.0 4
Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
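To see why the grouping works, inspect the cumulative sum on its own (using the df defined above):

cumsum(!is.na(df$SNAP_ID))
# [1] 1 2 2 2 2 2 3 4

Each non-NA value starts a new group, and the NAs that follow it inherit the same group number, so SNAP_ID[1] within each group is always the last observed value.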
EDIT: this is the old version which doesn't work for subsequent runs of NAs.
If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.
library(zoo)
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
SNAP_ID))
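To see why it breaks, consider a second gap (a quick sketch): with SNAP_ID = c(365, 366, NA, NA, 367, NA, 368), cumsum(is.na(SNAP_ID)) has already reached 3 by the NA after 367, so that value gets filled as 367 + 0.3 = 367.3 instead of the desired 367.1.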
Related
I have a data.frame looking like this:
date1 date2
2015-09-17 03:07:00 2015-09-17 11:53:00
2015-09-17 08:00:00 2015-09-18 11:48:59
2015-09-18 15:58:00 2015-09-22 12:14:00
2015-09-22 12:14:00 2015-09-24 13:58:21
I'd like to combine these two into one column, something like:
dates
2015-09-17 03:07:00
2015-09-17 11:53:00
2015-09-17 08:00:00
2015-09-18 11:48:59
2015-09-18 15:58:00
2015-09-22 12:14:00
2015-09-22 12:14:00
2015-09-24 13:58:21
Please note that dates (like the second-to-last and third-to-last) can be equal. Now I'd like to add a column 'value'. For every date that has its origin in date1, the value should be 1; if its origin is in date2, it should be 2.
Adding a new column is obvious. Merging works fine. I've used:
df <- data.frame(dates = c(df$date1, df$date2))
That works perfectly fine for merging the columns, but how do I get the correct value for df$value?
The result should be:
dates value
2015-09-17 03:07:00 1
2015-09-17 11:53:00 2
2015-09-17 08:00:00 1
2015-09-18 11:48:59 2
2015-09-18 15:58:00 1
2015-09-22 12:14:00 1
2015-09-22 12:14:00 2
2015-09-24 13:58:21 2
I tried to mock up your problem.
If you are not concerned about time complexity, this is the simplest solution that I can suggest.
a <- c(1, 3, 5)
b <- c(2, 4, 6)
df <- data.frame(a, b)
d1 <- c()
d2 <- c()
for (counter in seq_along(df$a))
{
  # interleave one value from a with one from b
  d1 <- c(d1, df$a[counter], df$b[counter])
  # tag the origin column: 1 for a, 2 for b
  d2 <- c(d2, 1, 2)
}
df <- data.frame(d1, d2)
print(df)
Input:
a b
1 2
3 4
5 6
Output:
d1 d2
1 1
2 2
3 1
4 2
5 1
6 2
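For larger inputs, a vectorized sketch of the same interleaving avoids growing vectors inside the loop (inp and out are hypothetical names standing in for the input and result above):

inp <- data.frame(a = c(1, 3, 5), b = c(2, 4, 6))
n <- nrow(inp)
out <- data.frame(
  d1 = as.vector(t(as.matrix(inp))), # row-wise read interleaves: a1, b1, a2, b2, ...
  d2 = rep(c(1, 2), times = n)       # 1 marks column a, 2 marks column b
)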
Can't you just do something like this?
dates1 <- data.frame(dates = c("2015-09-17 03:07:00",
"2015-09-17 08:00:00",
"2015-09-18 15:58:00",
"2015-09-22 12:14:00"), value = 1)
dates2 <- data.frame(dates = c("2015-09-17 11:53:00",
"2015-09-18 11:48:59",
"2015-09-22 12:14:00",
"2015-09-24 13:58:21"), value = 2)
# row-bind the two data.frames
df <- rbind(dates1, dates2)
# if "dates" is in a string format, convert to timestamp
df$dates <- strptime(df$dates, format = "%Y-%m-%d %H:%M:%S")
# order by "dates"
df[order(df$dates),]
# result:
dates value
1 2015-09-17 03:07:00 1
2 2015-09-17 08:00:00 1
5 2015-09-17 11:53:00 2
6 2015-09-18 11:48:59 2
3 2015-09-18 15:58:00 1
4 2015-09-22 12:14:00 1
7 2015-09-22 12:14:00 2
8 2015-09-24 13:58:21 2
There might be a more clever solution, but I'd just separate each column into its own data frame, add a value column, and then rbind() into a single dates data frame.
df1 <- data.frame(dates = df$date1)
df1$value <- 1
df2 <- data.frame(dates = df$date2)
df2$value <- 2
dates <- rbind(df1, df2)
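Note that rbind() stacks all the date1 rows before the date2 rows. To match the interleaved ordering in the expected output, one option (a sketch, assuming both frames have the same number of rows) is to reorder with an interleaving index:

n <- nrow(df1)
dates <- dates[as.vector(rbind(1:n, n + 1:n)), ] # row order becomes 1, n+1, 2, n+2, ...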
I have some data that looks like
CustomerID InvoiceDate
<fctr> <dttm>
1 13313 2011-01-04 10:00:00
2 18097 2011-01-04 10:22:00
3 16656 2011-01-04 10:23:00
4 16875 2011-01-04 10:37:00
5 13094 2011-01-04 10:37:00
6 17315 2011-01-04 10:38:00
7 16255 2011-01-04 11:30:00
8 14606 2011-01-04 11:34:00
9 13319 2011-01-04 11:40:00
10 16282 2011-01-04 11:42:00
It tells me when a person makes a transaction. I would like to know the time between transactions for each customer, preferably in days. I do this in the following way:
d <- data %>%
arrange(CustomerID,InvoiceDate) %>%
group_by(CustomerID) %>%
mutate(delta.t = InvoiceDate - lag(InvoiceDate), #calculating the difference
delta.day = as.numeric(delta.t, unit = 'days')) %>%
na.omit() %>%
arrange(CustomerID) %>%
inner_join(Ntrans) %>% #Existing data.frame telling me the number of transactions per customer
filter(N>=10) %>% #only want people with more than 10 transactions
select(-N)
However, the result doesn't make sense (seen below)
CustomerID InvoiceDate delta.t delta.day
<fctr> <dttm> <time> <dbl>
1 12415 2011-01-10 09:58:00 5686 days 5686
2 12415 2011-02-15 09:52:00 51834 days 51834
3 12415 2011-03-03 10:59:00 23107 days 23107
4 12415 2011-04-01 14:28:00 41969 days 41969
5 12415 2011-05-17 15:42:00 66314 days 66314
6 12415 2011-05-20 14:13:00 4231 days 4231
7 12415 2011-06-15 13:37:00 37404 days 37404
8 12415 2011-07-13 15:30:00 40433 days 40433
9 12415 2011-07-13 15:31:00 1 days 1
10 12415 2011-07-19 10:51:00 8360 days 8360
The differences measured in days are way off. What I want is something close to SQL's rolling window function partitioned over CustomerID. How can I implement this?
If you just want to change the difference to days, you can use the lubridate package.
> library('lubridate')
> library('dplyr')
>
> InvoiceDate <- c('2011-01-10 09:58:00', '2011-02-15 09:52:00', '2011-03-03 10:59:00')
> CustomerID <- c(111, 111, 111)
>
> dat <- data.frame('Invo' = InvoiceDate, 'ID' = CustomerID)
>
> dat %>% mutate('Delta' = as_date(Invo) - as_date(lag(Invo)))
Invo ID Delta
1 2011-01-10 09:58:00 111 NA days
2 2011-02-15 09:52:00 111 36 days
3 2011-03-03 10:59:00 111 16 days
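If you want to keep the full timestamps rather than truncating to dates (as_date() drops the time of day), a hedged alternative is to call difftime() with explicit units, so the units cannot silently change inside mutate():

dat %>%
  mutate(Invo = as.POSIXct(as.character(Invo), tz = "UTC"),
         # explicit units = "days" gives fractional days between transactions
         Delta = as.numeric(difftime(Invo, lag(Invo), units = "days")))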
I couldn't find an answer to this potential issue, as I don't really know how to describe what I would like to achieve in a few words. Basically, I have two columns (sunrise and sunset) with a certain number of rows. I wish to combine them into one column so that the first value of the combined column is row 1 of sunrise, the second value is row 1 of sunset, the third value is row 2 of sunrise, and so on. With data, we start with this:
df <- structure(list(sunrise = structure(c(1439635810.57809, 1439722237.7463,
1439808664.71935, 1439895091.49609, 1439981518.07612, 1440067944.45978
), class = c("POSIXct", "POSIXt")), sunset = structure(c(1439682771.28069,
1439769119.75559, 1439855467.39929, 1439941814.23447, 1440028160.28404,
1440114505.57116), class = c("POSIXct", "POSIXt"))), .Names = c("sunrise",
"sunset"), row.names = c(NA, 6L), class = "data.frame")
sunrise sunset
1 2015-08-15 06:50:10 2015-08-15 19:52:51
2 2015-08-16 06:50:37 2015-08-16 19:51:59
3 2015-08-17 06:51:04 2015-08-17 19:51:07
4 2015-08-18 06:51:31 2015-08-18 19:50:14
5 2015-08-19 06:51:58 2015-08-19 19:49:20
6 2015-08-20 06:52:24 2015-08-20 19:48:25
The desired outcome should look like:
data.frame(c("2015-08-15 06:50:10", "2015-08-15 19:52:51", "2015-08-16 06:50:37",
"2015-08-16 19:51:59", "2015-08-17 06:51:04", "2015-08-17 19:51:07",
"2015-08-18 06:51:31", "2015-08-18 19:50:14", "2015-08-19 06:51:58",
"2015-08-19 19:49:20", "2015-08-20 06:52:24", "2015-08-20 19:48:25"
))
output
1 2015-08-15 06:50:10
2 2015-08-15 19:52:51
3 2015-08-16 06:50:37
4 2015-08-16 19:51:59
5 2015-08-17 06:51:04
6 2015-08-17 19:51:07
7 2015-08-18 06:51:31
8 2015-08-18 19:50:14
9 2015-08-19 06:51:58
10 2015-08-19 19:49:20
11 2015-08-20 06:52:24
12 2015-08-20 19:48:25
I can then assign day/night to each row, and use these bins to categorize my data in day and night using the findInterval function.
Any help is greatly appreciated.
EDIT: Thanks for the answers, they work like a charm
Extract the rows iteratively and then convert into a vector
data.frame(output = as.POSIXct(Reduce(c, (apply(df, 1, c)))))
# output
#1 2015-08-15 05:50:10
#2 2015-08-15 18:52:51
#3 2015-08-16 05:50:37
#4 2015-08-16 18:51:59
#5 2015-08-17 05:51:04
#6 2015-08-17 18:51:07
#7 2015-08-18 05:51:31
#8 2015-08-18 18:50:14
#9 2015-08-19 05:51:58
#10 2015-08-19 18:49:20
#11 2015-08-20 05:52:24
#12 2015-08-20 18:48:25
#NOTE: the values are different because of timezone
OR index the values from the data.frame directly
as.POSIXct(df[cbind(sort(rep(1:NROW(df), NCOL(df))), rep(1:NCOL(df), NROW(df)))])
## create a matrix of indices then order it
o <- order(matrix(1:prod(dim(df)), nrow(df), byrow = TRUE))
## create the new data frame from the concatenated dates and the order vector
data.frame(output = do.call("c", c(df, use.names = FALSE))[o])
# output
# 1 2015-08-15 03:50:10
# 2 2015-08-15 16:52:51
# 3 2015-08-16 03:50:37
# 4 2015-08-16 16:51:59
# 5 2015-08-17 03:51:04
# 6 2015-08-17 16:51:07
# 7 2015-08-18 03:51:31
# 8 2015-08-18 16:50:14
# 9 2015-08-19 03:51:58
# 10 2015-08-19 16:49:20
# 11 2015-08-20 03:52:24
# 12 2015-08-20 16:48:25
Consider this
library(lubridate)
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by="hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3,length(time), replace=TRUE)
df2 <- data.frame(time,group,value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value over the last 5 days (not including the current observation), considering only observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess having no ideas how to do that.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() includes the current observation, but you can use the shift() function to exclude it:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
rollmean(value, 5, na.pad = TRUE, align = 'right'),
n = 1,
type = 'lag'),
time), .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
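If you prefer dplyr, a sketch of the same idea (this assumes zoo for the right-aligned rolling mean and lubridate for hour(); like the data.table version, it relies on there being exactly one observation per group per hour, so five prior same-hour values span the last five days):

library(dplyr); library(zoo); library(lubridate)

df2 %>%
  group_by(group, hr = hour(time)) %>%
  arrange(time, .by_group = TRUE) %>%
  # right-aligned 5-value mean, lagged so the current observation is excluded
  mutate(rolling_mean_same_hour = lag(rollmeanr(value, k = 5, fill = NA))) %>%
  ungroup()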
I have the following data as a list of POSIXct times that span one month. Each of them represents a bike delivery. My aim is to find the average number of bike deliveries per ten-minute interval over a 24-hour period (producing a total of 144 rows). First, all of the trips need to be summed and binned into an interval, then divided by the number of days. So far, I've managed to write code that sums trips per 10-minute interval, but it produces incorrect values. I am not sure where it went wrong.
The data looks like this:
head(start_times)
[1] "2014-10-21 16:58:13 EST" "2014-10-07 10:14:22 EST" "2014-10-20 01:45:11 EST"
[4] "2014-10-17 08:16:17 EST" "2014-10-07 17:46:36 EST" "2014-10-28 17:32:34 EST"
length(start_times)
[1] 1747
The code looks like this:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(1747) * 1000)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
group_by(time_bucket) %>%
summarize(trip_count = n())
summarized <- as.data.frame(summarized)
out_buckets <- data.frame(out_buckets = seq(as.POSIXlt("2014-10-01 00:00:00"), as.POSIXct("2014-10-31 23:0:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
head(out)
out_buckets trip_count
1 2014-10-01 00:00:00 0
2 2014-10-01 00:10:00 0
3 2014-10-01 00:20:00 0
4 2014-10-01 00:30:00 0
5 2014-10-01 00:40:00 0
6 2014-10-01 00:50:00 0
dim(out)
[1] 4459 2
test <- format(out$out_buckets,"%H:%M:%S")
test2 <- out$trip_count
test <- cbind(test, test2)
colnames(test)[1] <- "interval"
colnames(test)[2] <- "count"
test <- as.data.frame(test)
test$count <- as.numeric(test$count)
test <- aggregate(count~interval, test, sum)
head(test, n = 20)
interval count
1 00:00:00 32
2 00:10:00 33
3 00:20:00 32
4 00:30:00 31
5 00:40:00 34
6 00:50:00 34
7 01:00:00 31
8 01:10:00 33
9 01:20:00 39
10 01:30:00 41
11 01:40:00 36
12 01:50:00 31
13 02:00:00 33
14 02:10:00 34
15 02:20:00 32
16 02:30:00 32
17 02:40:00 36
18 02:50:00 32
19 03:00:00 34
20 03:10:00 39
but this is impossible because when I sum the counts
sum(test$count)
[1] 7494
I get 7494 whereas the number should be 1747
I'm not sure where I went wrong and how to simplify this code to get the same result.
I've done what I can, but I can't reproduce your issue without your data.
library(dplyr)
I created the full sequence of 10 minute blocks:
blocks.of.10mins <- data.frame(out_buckets=seq(as.POSIXct("2014/10/01 00:00"), by="10 mins", length.out=30*24*6))
Then split the start_times into the same bins. Note: I created a baseline time of midnight to force the blocks to align to 10 minute intervals. Removing this later is an exercise for the reader. I also changed one of your data points so that there was at least one example of multiple records in the same bin.
start_times <- as.POSIXct(c("2014-10-01 00:00:00", ## added
"2014-10-21 16:58:13",
"2014-10-07 10:14:22",
"2014-10-20 01:45:11",
"2014-10-17 08:16:17",
"2014-10-07 10:16:36", ## modified
"2014-10-28 17:32:34"))
trip_times <- data.frame(start_times) %>%
mutate(out_buckets = as.POSIXct(cut(start_times, breaks="10 mins")))
The start_times and all the 10 minute intervals can then be merged
trips_merged <- merge(trip_times, blocks.of.10mins, by="out_buckets", all=TRUE)
These can then be grouped by 10 minute block and counted
trips_merged %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(time) (int)
1 2014-10-01 00:00:00 1
2 2014-10-07 10:10:00 2
3 2014-10-17 08:10:00 1
4 2014-10-20 01:40:00 1
5 2014-10-21 16:50:00 1
6 2014-10-28 17:30:00 1
Instead, if we only consider the time, not the date:
trips_merged2 <- trips_merged
trips_merged2$out_buckets <- format(trips_merged2$out_buckets, "%H:%M:%S")
trips_merged2 %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(chr) (int)
1 00:00:00 1
2 01:40:00 1
3 08:10:00 1
4 10:10:00 2
5 16:50:00 1
6 17:30:00 1
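From here, the OP's final goal (the average number of deliveries per 10-minute slot over a 24-hour day) can be sketched by joining the counts back onto the full grid, filling empty slots with zero, and dividing each slot's total by the number of days the grid covers (30 here, since blocks.of.10mins spans 30 days):

library(dplyr)

blocks.of.10mins %>%
  left_join(trips_merged %>%
              filter(!is.na(start_times)) %>%
              group_by(out_buckets) %>%
              summarise(trip_count = n()),
            by = "out_buckets") %>%
  # slots with no trips come through the join as NA; treat them as zero
  mutate(trip_count = ifelse(is.na(trip_count), 0, trip_count),
         slot = format(out_buckets, "%H:%M")) %>%
  group_by(slot) %>%                        # 144 ten-minute slots in a day
  summarise(avg_trips = sum(trip_count) / 30) # 30 days covered by the grid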