I have a dataset with a timestamp column. I can't feed raw timestamps into a regression model, so I want to truncate each timestamp to its date and group together the rows that fall on the same date. How do I go about doing that?
Example data set
print(processed_df.head())
date day isWeekend distance time
15 2016-07-06 14:43:53.923 Tuesday False 0.000 239.254
17 2016-07-07 09:24:53.928 Wednesday False 0.000 219.191
18 2016-07-07 09:33:02.291 Wednesday False 0.000 218.987
37 2016-07-14 22:03:23.355 Wednesday False 0.636 205.000
46 2016-07-14 23:51:49.696 Wednesday False 0.103 843.000
Now I would like the date to be the index, with all rows that fall on the same date (such as the two Wednesday rows for 2016-07-07) combined into a single row by adding up the distance and time.
My attempt:
print(new_df.groupby('date').mean().head())
distance time
date
2016-07-06 14:43:53.923 0.0 239.254
2016-07-07 09:24:53.928 0.0 219.191
2016-07-07 09:33:02.291 0.0 218.987
2016-07-07 11:28:26.920 0.0 519.016
2016-07-08 11:59:02.044 0.0 398.971
This has failed: groupby('date') still groups on the full timestamps, so nothing gets combined.
Desired output
distance time
date
2016-07-06 0.0 239.254
2016-07-07 0.0 957.194
2016-07-08 0.0 398.971
I think you need to group by dt.date:
# convert if the dtype is not already datetime
df.date = pd.to_datetime(df.date)
print (df.groupby(df.date.dt.date)[['distance', 'time']].mean())
distance time
date
2016-07-06 0.0000 239.254
2016-07-07 0.0000 219.089
2016-07-14 0.3695 524.000
Another solution uses resample, but then you need to remove the all-NaN rows with dropna:
print (df.set_index('date').resample('D')[['distance', 'time']].mean())
distance time
date
2016-07-06 0.0000 239.254
2016-07-07 0.0000 219.089
2016-07-08 NaN NaN
2016-07-09 NaN NaN
2016-07-10 NaN NaN
2016-07-11 NaN NaN
2016-07-12 NaN NaN
2016-07-13 NaN NaN
2016-07-14 0.3695 524.000
print (df.set_index('date').resample('D')[['distance', 'time']].mean().dropna())
distance time
date
2016-07-06 0.0000 239.254
2016-07-07 0.0000 219.089
2016-07-14 0.3695 524.000
Related
How can I make a temporal variogram from a set of 20 rainfall values recorded as datetimes at 15-minute intervals? How should I convert the datetime values to temporal distances?
datetime R
0 2011-08-05 14:45:00 0.000
1 2011-08-05 15:00:00 0.000
2 2011-08-05 15:15:00 0.000
3 2011-08-05 15:30:00 0.000
4 2011-08-05 15:45:00 4.318
5 2011-08-05 16:00:00 3.302
6 2011-08-05 16:15:00 6.604
7 2011-08-05 16:30:00 0.000
...
19 2011-08-05 19:30:00 0.000
I have already produced some spatial and directional variograms in RStudio, but the temporal variogram is proving hard. Does anyone have suggestions?
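Not a full variogram recipe, but a minimal sketch (column names assumed from the sample above) of the datetime-to-temporal-distance conversion being asked about: convert the timestamps to elapsed minutes since the first observation, so every pair of points has a numeric temporal distance.
# rain is assumed to hold the sample data, with columns datetime and R
rain$datetime <- as.POSIXct(rain$datetime, format = "%Y-%m-%d %H:%M:%S")
# elapsed minutes since the first observation: 0, 15, 30, ..., 285 for the 20 values
rain$t_min <- as.numeric(difftime(rain$datetime, min(rain$datetime), units = "mins"))
# pairwise temporal distances, usable as the distance axis of a variogram
d_t <- dist(rain$t_min)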
My data frame looks like this:
Date Time Consumption kVARh kW weekday
2 2016-12-13 0:15:00 90.144 0.000 360.576 Tue
3 2016-12-13 0:30:00 90.144 0.000 360.576 Tue
4 2016-12-13 0:45:00 91.584 0.000 366.336 Tue
5 2016-12-13 1:00:00 93.888 0.000 375.552 Tue
6 2016-12-13 1:15:00 88.416 0.000 353.664 Tue
7 2016-12-13 1:30:00 88.704 0.000 354.816 Tue
8 2016-12-13 1:45:00 91.296 0.000 365.184 Tue
I read the data from a CSV where Date came in as a factor; I converted it with as.character and then as.Date. Then I added a column giving the day of the week using
sigEx1DF$weekday <- format(as.Date(sigEx1DF$Date), "%a")
which I then converted to an ordered factor from Sunday through Saturday.
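For reference, a sketch of that whole conversion as described (the data frame and column names come from the question; the ordered-factor step was described but not shown):
sigEx1DF$Date    <- as.Date(as.character(sigEx1DF$Date))   # factor -> character -> Date
sigEx1DF$weekday <- format(sigEx1DF$Date, "%a")            # abbreviated day of week
sigEx1DF$weekday <- factor(sigEx1DF$weekday, ordered = TRUE,
                           levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))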
This is granular data from a smart meter that records usage (Consumption) at 15-minute intervals; kW is Consumption*4. I need to average each weekday and then take the max of those averages, but after subsetting the data frame looks like this:
Date Time Consumption kVARh kW weekday
3 2016-12-13 0:30:00 90.144 0.000 360.576 Tue
8 2016-12-13 1:45:00 91.296 0.000 365.184 Tue
13 2016-12-13 3:00:00 93.600 0.000 374.400 Tue
18 2016-12-13 4:15:00 93.312 0.000 373.248 Tue
23 2016-12-13 5:30:00 107.424 0.000 429.696 Tue
28 2016-12-13 6:45:00 103.968 0.000 415.872 Tue
33 2016-12-13 8:00:00 108.576 0.000 434.304 Tue
Several of the 15-minute intervals are now missing (rows 4-7, for instance). I don't see anything different about rows 4-7, yet they are gone after the subset.
This is the code I used to subset:
bldg1_Wkdy <- subset(sort.df, weekday == c("Mon","Tue","Wed","Thu","Fri"),
select = c("Date","Time","Consumption","kVARh","kW","weekday"))
Here's the data frame structure before the subset:
'data.frame': 72888 obs. of 6 variables:
$ Date : Date, format: "2016-12-13" "2016-12-13" "2016-12-13" ...
$ Time : Factor w/ 108 levels "0:00:00","0:15:00",..: 2 3 4 5 6 7 8 49 50 51 ...
$ Consumption: num 90.1 90.1 91.6 93.9 88.4 ...
$ kVARh : num 0 0 0 0 0 0 0 0 0 0 ...
$ kW : num 361 361 366 376 354 ...
$ weekday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...
I go from 72,888 observations to only 10,427 for the weekdays and 10,368 for the weekends, with many rows that seem to be randomly missing, as noted above. Some of the intervals have zero consumption (the electricity may have been out due to a storm or other reasons), but those rows do show up in the subset data, so the zeroes don't seem to be causing the problem. Thanks for your help!
Instead of weekday == c("Mon","Tue","Wed","Thu","Fri") you should use weekday %in% c("Mon","Tue","Wed","Thu","Fri"); see below for a minimal test showing that %in% works as expected:
> subset(x, weekday == c("Mon","Tue","Wed","Thu","Fri"))
weekday
NA <NA>
> subset(x, weekday %in% c("Mon","Tue","Wed","Thu","Fri"))
weekday
1 Tue
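The x above wasn't shown; one minimal definition that reproduces both results is:
x <- data.frame(weekday = "Tue")                # assumed one-row example
x$weekday == c("Mon","Tue","Wed","Thu","Fri")   # recycled element-wise comparison
x$weekday %in% c("Mon","Tue","Wed","Thu","Fri") # membership test: TRUE
With == the right-hand vector is recycled and each row is compared against only one of the five values, so on your full data roughly four out of five rows of each weekday are dropped, which is exactly the "randomly missing" pattern you saw; %in% checks each row against the whole set.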
I have two text files:
1-
> head(val)
V1 V2 V3
1 2015/03/31 00:00 0.134
2 2015/03/31 01:00 0.130
3 2015/03/31 02:00 0.133
4 2015/03/31 03:00 0.132
2-
> head(tes)
A B date
1 0.04 0.02 2015-03-31 02:18:56
What I need is to combine V1 (date) and V2 (hour) in val, search val for the date and time closest to each date in tes, and then extract the corresponding V3 and put it in tes.
The desired output would be:
tes
A B date V3
1 0.04 0.02 2015-03-31 02:18:56 0.133
Updated answer based on OP's comments.
val$date <- with(val,as.POSIXct(paste(V1,V2), format="%Y/%m/%d %H:%M"))
val
# V1 V2 V3 date
# 1 2015/03/31 00:00 0.134 2015-03-31 00:00:00
# 2 2015/03/31 01:00 0.130 2015-03-31 01:00:00
# 3 2015/03/31 02:00 0.133 2015-03-31 02:00:00
# 4 2015/03/31 03:00 0.132 2015-03-31 03:00:00
# 5 2015/04/07 13:00 0.080 2015-04-07 13:00:00
# 6 2015/04/07 14:00 0.082 2015-04-07 14:00:00
tes$date <- as.POSIXct(tes$date)
tes
# A B date
# 1 0.04 0.02 2015-03-31 02:18:56
# 2 0.05 0.03 2015-03-31 03:30:56
# 3 0.06 0.04 2015-03-31 05:30:56
# 4 0.07 0.05 2015-04-07 13:42:56
f <- function(d) { # for a given tes$date, find the row index of the closest val$date
diff <- abs(difftime(val$date,d,units="min"))
if (min(diff) > 45) Inf else which.min(diff)
}
tes <- cbind(tes,val[sapply(tes$date,f),c("date","V3")])
tes
# A B date date V3
# 1 0.04 0.02 2015-03-31 02:18:56 2015-03-31 02:00:00 0.133
# 2 0.05 0.03 2015-03-31 03:30:56 2015-03-31 03:00:00 0.132
# 3 0.06 0.04 2015-03-31 05:30:56 <NA> NA
# 4 0.07 0.05 2015-04-07 13:42:56 2015-04-07 14:00:00 0.082
The function f(...) calculates the index into val (the row number) for which val$date is closest in time to the given tes$date, unless that closest time is more than 45 minutes away, in which case Inf is returned. Using this function with sapply(...), as in:
sapply(tes$date, f)
returns a vector of row numbers in val matching your condition for each tes$date.
The reason we use Inf instead of NA for missing values is that indexing a data.frame using Inf always returns a single "row" containing NA, whereas indexing using NA returns nrow(...) rows all containing NA.
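A quick illustration of that difference, using a small made-up data.frame:
d <- data.frame(a = 1:3, b = c(10, 20, 30))
d[Inf, ]  # one row, all NA (out-of-range numeric index)
d[NA, ]   # three rows, all NA (logical NA is recycled to nrow(d))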
I added the extra rows into val and tes per your comment.
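As an aside (not part of the answer above), data.table's rolling join can do the nearest-time matching as well; a sketch, assuming val$date and tes$date are already POSIXct as built above:
library(data.table)
valDT <- as.data.table(val)[, val_date := date]         # keep val's own timestamp
tesDT <- as.data.table(tes)
nearest <- valDT[tesDT, on = "date", roll = "nearest"]  # nearest val row per tes$date
# keep only matches within 45 minutes; unlike f(), this drops unmatched rows
nearest <- nearest[abs(difftime(val_date, date, units = "mins")) <= 45]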
I have a very large time series data set in the following format.
"Tag.1","1/22/2015 11:59:54 PM","570.29895",
"Tag.1","1/22/2015 11:59:56 PM","570.29895",
"Tag.1","1/22/2015 11:59:58 PM","570.29895",
"Tag.1","1/23/2015 12:00:00 AM","649.67133",
"Tag.2","1/22/2015 12:00:02 AM","1.21",
"Tag.2","1/22/2015 12:00:04 AM","1.21",
"Tag.2","1/22/2015 12:00:06 AM","1.21",
"Tag.2","1/22/2015 12:00:08 AM","1.21",
"Tag.2","1/22/2015 12:00:10 AM","1.21",
"Tag.2","1/22/2015 12:00:12 AM","1.21",
I would like to separate this out into a data frame with a common column for the time stamp and one column each for the tags.
Date.Time, Tag.1, Tag.2, Tag.3...
1/22/2015 11:59:54 PM,570.29895,
Any suggestions would be appreciated!
Maybe something like this:
cast(df,V2~V1,mean,value='V3')
V2 Tag.1 Tag.2
1 1/22/2015 11:59:54 PM 570.2989 NaN
2 1/22/2015 11:59:56 PM 570.2989 NaN
3 1/22/2015 11:59:58 PM 570.2989 NaN
4 1/22/2015 12:00:02 AM NaN 1.21
5 1/22/2015 12:00:04 AM NaN 1.21
6 1/22/2015 12:00:06 AM NaN 1.21
7 1/22/2015 12:00:08 AM NaN 1.21
8 1/22/2015 12:00:10 AM NaN 1.21
9 1/22/2015 12:00:12 AM NaN 1.21
10 1/23/2015 12:00:00 AM 649.6713 NaN
cast is part of the reshape package.
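The reshape package is fairly old; if you prefer its successor reshape2 (just an alternative, not what is used above), the equivalent would be roughly:
library(reshape2)
# one row per timestamp (V2), one column per tag (V1), mean() to aggregate,
# assuming V3 has already been converted to numeric
dcast(df, V2 ~ V1, value.var = "V3", fun.aggregate = mean)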
Best,
ZP
EDIT: Now made reproducible by using a publicly available dataset.
I would like to take a transaction log e_log, with the following schema:
require(BTYD)
e_log <- dc.ReadLines(system.file("data/cdnowElog.csv", package="BTYD"),2,3,5)
e_log[,"date"] <- as.Date(e_log[,"date"], "%Y%m%d")
e_log$cust<-as.integer(e_log$cust)
head(e_log)
cust date sales
1 1 1997-01-01 29.33
2 1 1997-01-18 29.73
3 1 1997-08-02 14.96
4 1 1997-12-12 26.48
5 2 1997-01-01 63.34
6 2 1997-01-13 11.77
Where each row is a transaction, cust is the customer id, date is the transaction date, and sales is the sales amount. I want to transform it into the following schema (note that the columns are not ordered):
cust trans_date sales birth date_diff
1 1 1997-01-01 29.33 1997-01-01 0 days
2 1 1997-01-16 0.00 1997-01-01 15 days
3 1 1997-08-27 0.00 1997-01-01 238 days
4 1 1997-09-11 0.00 1997-01-01 253 days
5 1 1998-04-23 0.00 1997-01-01 477 days
6 1 1997-08-17 0.00 1997-01-01 228 days
Where cust is the customer id, trans_date is the transaction date in year/month/day, sales is the sum of sales for a given trans_date and cust, birth is the date of acquisition, and date_diff is the number of days elapsed from when the customer was acquired to the trans_date. In this schema, the primary key is cust and date_diff. There should be a row for every customer, for every day elapsed since the customer was acquired up until the maximum date in the dataset (i.e. the final observation time), regardless of whether there was a sale on a given day. The goal is to see sales as a function of days elapsed from acquisition.
I have created a function that converts a transaction log to the above schema, but it's slow, crude, and inefficient (not to be too hard on myself):
require(BTYD)
cohort_spend.df<-function(trans_log){
###create a customer by time spend matrix
spend<-dc.CreateSpendCBT(trans_log)
###coerce to data.frame
sdf<-data.frame(spend)
###order elog by date, create birth index
e_ord<-trans_log[,1:2][with(trans_log[,1:2],order(date)),]
birth<-by(e_ord,e_ord$cust,head,n=1)
birthd<-do.call("rbind",as.list(birth))
colnames(birthd)<-c("cust","birth")
###merge birth dates to customer spend data frame
sdfm<-merge(sdf,birthd,by="cust")
###difference transaction date and birth date to get days elapsed
###from acquisition
sdfm$date<-as.Date(sdfm$date)
sdfm$diff<-sdfm$date-sdfm$birth
sdfm2<-sdfm[sdfm$diff>=0,]
colnames(sdfm2)<-c("cust","trans_date","sales","birth","date_diff")
return(sdfm2)}
desired_schema<-cohort_spend.df(trans_log=e_log)
head(desired_schema)
cust trans_date sales birth date_diff
1 1 1997-01-01 29.33 1997-01-01 0 days
2 1 1997-01-16 0.00 1997-01-01 15 days
3 1 1997-08-27 0.00 1997-01-01 238 days
4 1 1997-09-11 0.00 1997-01-01 253 days
5 1 1998-04-23 0.00 1997-01-01 477 days
6 1 1997-08-17 0.00 1997-01-01 228 days
system.time(cohort_spend.df(trans_log=e_log))
user system elapsed
46.777 0.967 47.768
I've included the function so that you can reproduce my results. Again, the output is correct; I'm simply looking to refactor. If you can think of a cleaner way to get the desired result, please share.
NOTE: the desired schema should be derived entirely from the transaction log, with no need for external data.
EDITED TO INCLUDE ZERO VALUES
require(data.table)
DT<-data.table(e_log,key=c("date","cust")) # turn the frame into a table
births<-DT[,list(birth=min(date)),by="cust"] # list births
grid<-CJ(date=as.Date(min(DT[,date]):max(DT[,date]),origin="1970-01-01"),cust=unique(DT[,cust])) # make the grid of all combinations
grid<-merge(DT,grid,all.y=T) # merge in the data from the log
grid<-merge(grid,births,all.x=T,by="cust") # merge in the birth
grid[is.na(sales),sales:=0] # set NA sales value to 0
grid[,date_diff:=paste(date-birth,"Days")] # add the date_diff field
setkey(grid,cust,date) # set the key
grid[,list(sales=sum(sales),birth,date_diff),by=c("cust","date")] # output
cust date sales birth date_diff
1: 1 1997-01-01 29.33 1997-01-01 0 Days
2: 1 1997-01-02 0.00 1997-01-01 1 Days
3: 1 1997-01-03 0.00 1997-01-01 2 Days
4: 1 1997-01-04 0.00 1997-01-01 3 Days
5: 1 1997-01-05 0.00 1997-01-01 4 Days
---
1287141: 2357 1998-06-26 0.00 1997-03-25 458 Days
1287142: 2357 1998-06-27 0.00 1997-03-25 459 Days
1287143: 2357 1998-06-28 0.00 1997-03-25 460 Days
1287144: 2357 1998-06-29 0.00 1997-03-25 461 Days
1287145: 2357 1998-06-30 0.00 1997-03-25 462 Days
Actually, to filter out the dates prior to each entry's birth:
grid[,list(sales=sum(sales),birth,date_diff),by=c("cust","date")][date>=birth]
cust date sales birth date_diff
1: 1 1997-01-01 29.33 1997-01-01 0 Days
2: 1 1997-01-02 0.00 1997-01-01 1 Days
3: 1 1997-01-03 0.00 1997-01-01 2 Days
4: 1 1997-01-04 0.00 1997-01-01 3 Days
5: 1 1997-01-05 0.00 1997-01-01 4 Days
---
1185816: 2357 1998-06-26 0.00 1997-03-25 458 Days
1185817: 2357 1998-06-27 0.00 1997-03-25 459 Days
1185818: 2357 1998-06-28 0.00 1997-03-25 460 Days
1185819: 2357 1998-06-29 0.00 1997-03-25 461 Days
1185820: 2357 1998-06-30 0.00 1997-03-25 462 Days
Here's my solution using data.table, building on Troy's solution:
require(BTYD)
require(data.table)
cohort_spend.dft<-function(e_log){
e_log<-dc.MergeTransactionsOnSameDate(e_log) ##merge same day transactions
DT<-data.table(e_log,key=c("date","cust")) # turn the frame into a table
births<-DT[,list(birth=min(date)),by="cust"] # list births
grid<-CJ(date=as.Date(min(DT[,date]):max(DT[,date]),origin="1970-01-01"),cust=unique(DT[,cust])) # make the grid of all combinations
grid<-merge(DT,grid,all.y=T) # merge in the data from the log
grid[is.na(sales),sales:=0] # set NA sales value to 0
grid<-merge(grid,births,by="cust") # merge in the birth
grid[,date_diff:=date-birth][date_diff>=0]
}
system.time(cohort_spend.dft(e_log=e_log))
user system elapsed
1.783 0.532 2.413