Problem:
I have a DataFrame with 4 rows (one per hour) of values, which are datetime indexed. Now I want 16 (4*4) rows, with each value copied 3 times and forward filled.
My question: how can I tell pandas/Python to also write the last three 15-minute rows (beyond the final hourly timestamp)?
That's what I want.
My attempt:
import pandas as pd

# create DataFrame
df = pd.DataFrame(
    {'A': [4, 5, 6, 7], 'B': [10, 20, 30, 40], 'C': [100, 50, -30, -50]})

# create hourly date range
date_60min = pd.date_range(
    '1/1/2013', periods=4, freq='60min', tz='Europe/Berlin')

# add date column
df['Date'] = date_60min

# set date as index
df_date = df.set_index('Date')

# show df_date
df_date

# Variation 1 with resample
df_resample15min = df_date.resample(
    '15Min', fill_method='ffill', label='left', closed='right')
df_resample15min

# Variation 2 with asfreq
df_asfreq15min = df_date.asfreq('15Min', method='pad')
df_asfreq15min
I have a dataset of daily remotely sensed data. In short, it's reflectance (values between 0 and 1) for the last 20 years. Because it's remotely sensed data, some dates do not have a value because of clouds or some other obstruction.
I want to use rollapply() in R's zoo package to detect in the time series when the values remain at 1.0 for a certain amount of time (let's say 2 weeks) or at 0 for that same amount of time.
I have code to do this, but the width argument in the rollapply() function (the 2-week threshold mentioned in the previous paragraph) looks at data points rather than time. So it looks at 14 data values rather than 14 days, which may span over a month due to the missing data values from cloud cover etc.
Here's an example:
library(dplyr)
library(lubridate)  # for ymd()
library(zoo)        # for rollapply()

test_data <- data.frame(
  date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
  value = c(0, 1, 1, 1, 0))
test_data$date <- ymd(test_data$date)

select_first_1_value <- test_data %>%
  mutate(value = rollapply(value, width = 3, min, align = "left", fill = NA, na.rm = TRUE)) %>%
  filter(value == 1) %>%
  filter(row_number() == 1) %>%
  ungroup()
With the argument width = 3, it works: it finds that 2000-01-02 is the first date where value = 1 occurs for at least 3 values. However, if I change this to 14, it no longer works, because it only sees 5 values in this instance. Even if I wrote out an additional 10 values equal to 1 (for a total of 15), it would still be incorrect, because the value is 0 at 2000-01-18 and the function is counting data points, not dates.
But when we look at the dates, there are missing dates between 2000-01-03 and 2000-01-17. If both are a value = 1, then I want to extract 2000-01-02 as the first instance where the time series remains at 1 for at least 14 consecutive days. Here, I'm assuming that the values are 1 for the missing days.
Any help is greatly appreciated. Thank you.
There really are two problems here:
(1) How to roll by date rather than by number of points.
(2) How to find the first stretch of 14 days of 1's, assuming that missing dates are 1.
Note that (2) is not readily solved by (1) because the start of the first series of ones may not be any of the listed dates! For example, suppose we change the first date to Dec 1, 1999 giving test_data2 below. Then the start of the first period of 14 ones is Dec 2, 1999. It is not any of the dates in the test_data2 series.
test_data2 <- data.frame(
date = c("1999-12-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
value = c(0, 1, 1, 1, 0))
1) What we need to do is not roll by date but rather expand the series to fill in the missing dates, giving zz, and then use rollapply. Below we do that by creating a zoo series (which also converts the dates to Date class) and then converting it to ts class. Because ts class can only represent regularly spaced series, that conversion fills in the missing dates with a value of NA. We then fill those in with 1 and convert back to a zoo series with a Date class index.
library(zoo)
z <- read.zoo(test_data2)
zz <- z |> as.ts() |> na.fill(1) |> as.zoo() |> aggregate(as.Date)
r <- rollapply(zz, 14, min, na.rm = TRUE, partial = TRUE, align = "left")
time(r)[which(r == 1)[1]]
## [1] "1999-12-02"
2) Another way to solve this not involving rollapply at all would be to use rle. Using zz from above
ok <- with(rle(coredata(zz)), rep(lengths >= 14 & values == 1, lengths))
time(zz)[which(ok)[1]]
## [1] "1999-12-02"
3) Another way without using rollapply is to extract the rows whose value is 0 and keep only those for which the gap to the next 0-value row exceeds 14 days. Finally, take the first such row and use the date one day after it. This assumes that there is at least one 0 row before the first run of 14+ ones. Below we have returned to using test_data from the question, although this would also have worked with test_data2.
library(dplyr)
test_data %>%
mutate(date = as.Date(date)) %>%
filter(value == 0) %>%
mutate(diff = as.numeric(lead(date) - date)) %>%
filter(diff > 14) %>%
head(1) %>%
mutate(date = date + 1)
## date value diff
## 1 2000-01-02 0 17
rollapply over dates rather than points
4) The question also discussed using rollapply over dates rather than points, which we address here. As noted above, this does not actually solve the question of finding the first stretch of 14+ ones, so instead we show how to find the first date in the series which starts a stretch of at least 14 ones. In general, we do this by first calculating a width vector using findInterval and then using rollapply in the usual way but with those widths rather than a scalar width. This only involves one extra line of code to calculate the widths, w.
# using test_data from question
tt <- as.Date(test_data$date)
w <- findInterval(tt + 13, tt, rightmost.closed = TRUE) - seq_along(tt) + 1
r <- rollapply(test_data$value, w, min, fill = NA, na.rm = TRUE, align = "left")
tt[which(r == 1)[1]]
## [1] "2000-01-02"
There are further examples in ?rollapply showing how to roll by time rather than number of points.
sqldf
5) A completely different way of approaching the problem of finding the first 14+ ones starting at a date in the series is to use an SQL self join. It joins the first instance of test, aliased a, to a second instance b, associating with each row of a all rows of b within the indicated date range, and takes the minimum value over those rows, creating a new column min14 holding those minimums. The having clause then keeps only those rows for which min14 is 1, and of those the limit clause keeps the first. We then extract the date at the end.
library(sqldf)
test <- transform(test_data, date = as.Date(date))
sqldf("select a.*, min(b.value) min14
from test a
left join test b on b.date between a.date and a.date + 13
group by a.rowid
having min14 = 1
limit 1")$date
## [1] "2000-01-02"
You may look into the runner package, where you can pass k as days/weeks etc. See this example, which sums value over the last 3 days.
library(dplyr)
library(runner)
test_data %>%
  mutate(date = as.Date(date),
         sum_val = runner(value, k = "3 days", idx = date, f = sum))
# date value sum_val
#1 2000-01-01 0 0
#2 2000-01-02 1 1
#3 2000-01-03 1 2
#4 2000-01-17 1 1
#5 2000-01-18 0 1
Notice that row 4 has sum_val 1 (and not 3) because only 1 value occurred in the last 3 days.
I am trying to convert a column in my dataset that holds times in HMS format into seconds.
This is what my dataset looks like:
Participant Event ID Event_start Event_time
Joe 1 3 1:49:52
Arya 1 2 1:37:39
Cynthia 1 1 1:40:17
I used this:
dataset %>%
  mutate(Timeinsec = period_to_seconds(hms("Event_time")))
It gives me a warning.
The warning is because Event_time is quoted. Try it without quotes:
library(dplyr)
library(lubridate)

dataset %>%
  mutate(Timeinsec = hms(Event_time))
If you want the result as a number of seconds, use period_to_seconds:
dataset %>%
  mutate(Sec = period_to_seconds(hms(Event_time)))
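For example, for the first row the conversion works out to 1*3600 + 49*60 + 52 (a quick check on a single value):
period_to_seconds(hms("1:49:52"))
## [1] 6592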
I'm using R and RStudio to analyse GTFS public transport feeds and to create timetable range plots using ggplot2. The code currently works fine but is quite slow, which is problematic when working with very big CSVs as is often the case here.
The slowest part of the code is as follows (with some context): a for loop that iterates through the data frame and subsets each unique trip into a temporary data frame from which the extreme arrival and departure values (first & last rows) are extracted:
# Creates an empty df to contain trip_id, trip start and trip end times
Trip_Times <- data.frame(Trip_ID = character(), Departure = character(), Arrival = character(), stringsAsFactors = FALSE)
# Creates a vector containing all trips of the analysed day
unique_trips = unique(stop_times$trip_id)
# Iterates through stop_times for each unique trip_id and populates previously created data frame
for (i in seq(from = 1, to = length(unique_trips), by = 1)) {
  temp_df <- subset(stop_times, trip_id == unique_trips[i])
  Trip_Times[nrow(Trip_Times) + 1, ] <- c(temp_df$trip_id[[1]], temp_df$departure_time[[1]], temp_df$arrival_time[[nrow(temp_df)]])
}
The stop_times df looks as follows with some feeds containing over 2.5 million lines giving around 200k unique trips, hence 200k loop iterations...
head(stop_times)
trip_id arrival_time departure_time stop_sequence
1 011_0840101_A14 7:15:00 7:15:00 1
2 011_0840101_A14 7:16:00 7:16:00 2
3 011_0840101_A14 7:17:00 7:17:00 3
4 011_0840101_A14 7:18:00 7:18:00 4
5 011_0840101_A14 7:19:00 7:19:00 5
6 011_0840101_A14 7:20:00 7:20:00 6
Would anyone be able to advise me how to optimise this code in order to obtain faster results? I don't believe apply can be used here, but I may well be wrong.
This should be straightforward with dplyr...
library(dplyr)
Trip_Times <- stop_times %>%
  group_by(trip_id) %>%
  summarise(departure_time = first(departure_time),
            arrival_time = last(arrival_time))
We can use data.table
library(data.table)
setDT(stop_times)[, .(departure_time = departure_time[1L],
arrival_time = arrival_time[.N]) , by = trip_id]
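Both of these assume that stop_times is already ordered by stop_sequence within each trip_id, as in the sample above. If the feed does not guarantee that, a minimal sketch of sorting first (using the stop_sequence column from the sample):
library(dplyr)

Trip_Times <- stop_times %>%
  arrange(trip_id, stop_sequence) %>%   # make sure stops are in sequence order within each trip
  group_by(trip_id) %>%
  summarise(departure_time = first(departure_time),
            arrival_time = last(arrival_time))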
I have the following data:
Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,124),
Date=c("12/27/2016 15:16","12/27/2016 15:20","12/27/2016 15:24","12/27/2016 15:28","12/27/2016 15:28","12/27/2016 15:42","12/28/2016 7:22","12/28/2016 7:26","12/28/2016 7:35","12/28/2016 11:02","12/28/2016 11:02","12/28/2016 11:28"),
OldValue=c("","Open","In Progress","Open","System Declined","In Progress","System Declined","Open","In Progress","Open","Complete","In Progress"),
NewValue=c("Open","In Progress","System Declined","In Progress","Open","System Declined","Open","In Progress","Complete","In Progress","Open","Complete"))
The data is already ordered by Project, then Date.
However, if there are two rows with the same Date (such as rows 4,5 and 10,11) I want to designate the order based on OldValue. So I'd like row 5 ahead of row 4, and row 11 ahead of row 10.
How can I go about doing this?
# Assign the desired order to each OldValue; change "y" if necessary
OldValue_order = data.frame(OldValue = c("", "Open", "In Progress", "System Declined", "Complete"), y = c(0, 4, 2, 1, 3))

# We'll need the lookup command to copy the desired order into "Data"
library(qdapTools)
Data$OV_order = lookup(Data$OldValue, OldValue_order)  # Adds new column to "Data"

# Arrange the data.frame in the desired order
Data = Data[with(Data, order(Project, as.POSIXct(Date, format = "%m/%d/%Y %H:%M"), OV_order)), ]

# Remove the added column
Data = Data[1:4]
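If you prefer to avoid the qdapTools dependency, the same lookup can be done in base R with match() (a sketch reusing the OldValue_order table above):
# map each OldValue onto its rank via match() instead of qdapTools::lookup()
Data$OV_order <- OldValue_order$y[match(Data$OldValue, OldValue_order$OldValue)]
Data <- Data[with(Data, order(Project, as.POSIXct(Date, format = "%m/%d/%Y %H:%M"), OV_order)), ]
Data$OV_order <- NULL  # drop the helper column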
Sample Data:
product_id <- c("1000","1000","1000","1000","1000","1000", "1002","1002","1002","1002","1002","1002")
qty_ordered <- c(1,2,1,1,1,1,1,2,1,2,1,1)
price <- c(2.49,2.49,2.49,1.743,2.49,2.49, 2.093,2.093,2.11,2.11,2.11, 2.97)
date <- c("2/23/15","2/23/15", '3/16/15','3/16/15','5/16/15', "6/18/15", "2/19/15","3/19/15","3/19/15","3/19/15","3/19/15","4/19/15")
sampleData <- data.frame(product_id, qty_ordered, price, date)
I would like to identify every time a change in price occurred. Also, I would like to sum() the total qty_ordered between those price change dates. For example:
For product_id == "1000", a price change occurred on 3/16/15, from $2.49 to $1.743. The total qty_ordered up to then is 1+2+1 = 4;
the difference between those two earliest dates of price change, 2/23/15 to 3/16/15, is 21 days.
So the New Data Frame should be:
product_id  sum_qty_ordered  price  date_diff
      1000                4  2.490         21
      1000                1  1.743         61
      1000                2  2.490         33
Here is what I have tried:
NOTE: for this case, a simple dplyr::group_by will not work, since it ignores the date effect.
1) I found this code from Determine when columns of a data.frame change value and return indices of the change:
This identifies every time the price changed, i.e. the first date on which the price changed for each product.
# with a single column, drop = FALSE keeps a data frame so sapply() can work column-wise
IndexedChanged <- c(1, which(rowSums(sapply(sampleData[, 3, drop = FALSE], diff)) != 0) + 1)
sampleData[IndexedChanged, ]
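For this sample that picks out rows 1, 4, 5, 7, 9 and 12, i.e. the first row of each price run (a quick check):
IndexedChanged
## [1]  1  4  5  7  9 12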
However, I am not sure how to calculate the sum(qty_ordered) and the date difference for each of those entries if I use that code.
2) I tried to write a WHILE loop to temporarily store each batch of product_id, price, and range of dates (e.g. a subset of the data frame with one product_id, one price, and all entries from the earliest date of that price until the last date before it changed),
and then summarise that subset to get sum(qty_ordered) and the date difference.
However, I always confuse WHILE and FOR, so my code has some problems in it. Here is my code:
# create an empty data frame for later data storage
NewData_Ready <- data.frame(
  product_id = character(),
  price = double(),
  early_date = as.Date(character()),
  last_date = as.Date(character()),
  total_qty_demanded = double(),
  stringsAsFactors = FALSE)
# create a temp table to store the batch price order entries
temp_dataset <- data.frame(
  product_id = character(),
  qty_ordered = double(),
  price = double(),
  date = as.Date(character()),
  stringsAsFactors = FALSE)
The loop (this is messy and probably doesn't make sense, so I really need help with it):
for (i in unique(sampleData$product_id)) {
  # for each unique product_id in the dataset, we are gonna loop through it based on product_id
  # for first product_id which is "1000"
  temp_table <- sampleData[sampleData$product_id == "i", ]  # subset dataset by ONE single product_id
  # this dataset only has product of "1000" entries
  # starting a new for loop to loop through the entire entries for this product
  for (p in 1:length(temp_table$product_id)) {
    current_price <- temp_table$price[p]  # assign current_price to the first price value
    # assign $2.49 to current price
    min_date <- temp_table$date[p]  # assign the first date when the first price change
    # assign 2015-2-23 to min_date which is the earliest date when price is $2.49
    while (current_price == temp_table$price[p + 1]) {
      # while the next price is the same as the first price
      # that is, if the second price of $2.49 is the same as the first price of $2.49, which is TRUE,
      # then execute the following statement
      temp_dataset <- rbind(temp_dataset, temp_table[p, ])
      # if the WHILE loop is TRUE, meaning every 2 entries have the same price,
      # then combine each entry with the same price in temp_table with the temp_dataset;
      # if the WHILE loop is FALSE, meaning one entry's price is different from the next one,
      # then stop the statement above, but do the following
      current_price <- temp_table$price[p + 1]
      # this will reassign the current_price to the next price, and restart the WHILE loop
      by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
      NewRow <- dplyr::summarise(
        early_date = min(date),
        last_date = max(date),
        total_qty_demanded = sum(qty_ordered))
      NewData_Ready <- rbind(NewData_Ready, NewRow)
    }
  }
}
I have searched a lot of related questions but have not found anything related to this problem yet. If you have any suggestions on how to solve it, please let me know. I would greatly appreciate your time and help!
Here is my R version:
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 3.1
year 2016
month 06
day 21
svn rev 70800
language R
version.string R version 3.3.1 (2016-06-21)
nickname Bug in Your Hair
Using data.table:
library(data.table)
setDT(sampleData)
Some Preprocessing:
sampleData[, firstdate := as.Date(date, "%m/%d/%y")]
Based on how you calculate date diff, we are better off creating a range of dates for each row:
sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
Then create a new ID for every change in price:
sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
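For product 1000, for example, price_id comes out as 0, 0, 0, 1, 2, 2: the initial run of 2.49 rows, the single 1.743 row, and then the return to 2.49 counted as a new run (a quick check):
sampleData[product_id == "1000", .(price, price_id)]
##    price price_id
## 1: 2.490        0
## 2: 2.490        0
## 3: 2.490        0
## 4: 1.743        1
## 5: 2.490        2
## 6: 2.490        2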
Then calculate your groupwise functions, by product and price run:
sampleData[,
  .(
    price = unique(price),
    sum_qty = sum(qty_ordered),
    date_diff = max(lastdate) - min(firstdate)
  ),
  by = .(
    product_id,
    price_id
  )
]
   product_id price_id price sum_qty date_diff
1:       1000        0 2.490       4   21 days
2:       1000        1 1.743       1   61 days
3:       1000        2 2.490       2   33 days
4:       1002        0 2.093       3   28 days
5:       1002        1 2.110       4   31 days
6:       1002        2 2.970       1    0 days
I think the last price change for 1000 is only 33 days, and the preceding one is 61 (not 60). If you include the first day it is 22, 62 and 34, and the line should read date_diff = max(lastdate) - min(firstdate) + 1
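Following that comment, a sketch of the same aggregation with the inclusive day count (the only change is the + 1):
sampleData[,
  .(
    price = unique(price),
    sum_qty = sum(qty_ordered),
    date_diff = max(lastdate) - min(firstdate) + 1  # count the first day as well
  ),
  by = .(product_id, price_id)
]
## for product 1000 this gives date_diff of 22, 62 and 34 days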