I have a very large time series data set in the following format.
"Tag.1","1/22/2015 11:59:54 PM","570.29895",
"Tag.1","1/22/2015 11:59:56 PM","570.29895",
"Tag.1","1/22/2015 11:59:58 PM","570.29895",
"Tag.1","1/23/2015 12:00:00 AM","649.67133",
"Tag.2","1/22/2015 12:00:02 AM","1.21",
"Tag.2","1/22/2015 12:00:04 AM","1.21",
"Tag.2","1/22/2015 12:00:06 AM","1.21",
"Tag.2","1/22/2015 12:00:08 AM","1.21",
"Tag.2","1/22/2015 12:00:10 AM","1.21",
"Tag.2","1/22/2015 12:00:12 AM","1.21",
I would like to separate this out into a data frame with a common column for the time stamp and one column each for the tags.
Date.Time, Tag.1, Tag.2, Tag.3...
1/22/2015 11:59:54 PM,570.29895,
Any suggestions would be appreciated!
Maybe something like this:
cast(df, V2 ~ V1, mean, value = 'V3')
V2 Tag.1 Tag.2
1 1/22/2015 11:59:54 PM 570.2989 NaN
2 1/22/2015 11:59:56 PM 570.2989 NaN
3 1/22/2015 11:59:58 PM 570.2989 NaN
4 1/22/2015 12:00:02 AM NaN 1.21
5 1/22/2015 12:00:04 AM NaN 1.21
6 1/22/2015 12:00:06 AM NaN 1.21
7 1/22/2015 12:00:08 AM NaN 1.21
8 1/22/2015 12:00:10 AM NaN 1.21
9 1/22/2015 12:00:12 AM NaN 1.21
10 1/23/2015 12:00:00 AM 649.6713 NaN
cast is part of the reshape package.
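For completeness, a minimal end-to-end sketch (the file name data.csv is a placeholder; the file is assumed to have no header row, so read.csv assigns the default names V1 = tag, V2 = timestamp, V3 = value, and the trailing comma on each line yields an empty fourth column):
library(reshape)
df <- read.csv("data.csv", header = FALSE, stringsAsFactors = FALSE)
df <- df[, 1:3]   # drop the empty column left by the trailing commas
wide <- cast(df, V2 ~ V1, mean, value = 'V3')
# the newer reshape2 equivalent:
# library(reshape2)
# wide <- dcast(df, V2 ~ V1, fun.aggregate = mean, value.var = "V3")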
My data frame looks like this:
Date Time Consumption kVARh kW weekday
2 2016-12-13 0:15:00 90.144 0.000 360.576 Tue
3 2016-12-13 0:30:00 90.144 0.000 360.576 Tue
4 2016-12-13 0:45:00 91.584 0.000 366.336 Tue
5 2016-12-13 1:00:00 93.888 0.000 375.552 Tue
6 2016-12-13 1:15:00 88.416 0.000 353.664 Tue
7 2016-12-13 1:30:00 88.704 0.000 354.816 Tue
8 2016-12-13 1:45:00 91.296 0.000 365.184 Tue
I read the data from a CSV, where Date came in as a factor; I converted it with as.character and then as.Date. Then I added a column giving the day of the week using
sigEx1DF$weekday <- format(as.Date(sigEx1DF$Date), "%a")
which I then converted to an ordered factor from Sunday through Saturday.
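That conversion is presumably something like the following sketch (sigEx1DF is the data frame from the question; the level order is the Sunday-through-Saturday ordering described above):
sigEx1DF$weekday <- factor(sigEx1DF$weekday,
                           levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"),
                           ordered = TRUE)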
This is granular data from a smart meter, which measures usage (Consumption) at 15-minute intervals; kW is Consumption*4. I need to average each weekday and then take the max of the averages, but when I subset, the data frame looks like this:
Date Time Consumption kVARh kW weekday
3 2016-12-13 0:30:00 90.144 0.000 360.576 Tue
8 2016-12-13 1:45:00 91.296 0.000 365.184 Tue
13 2016-12-13 3:00:00 93.600 0.000 374.400 Tue
18 2016-12-13 4:15:00 93.312 0.000 373.248 Tue
23 2016-12-13 5:30:00 107.424 0.000 429.696 Tue
28 2016-12-13 6:45:00 103.968 0.000 415.872 Tue
33 2016-12-13 8:00:00 108.576 0.000 434.304 Tue
Several of the 15-minute intervals are now missing (rows 4-7, for instance). I don't see anything different about rows 4-7, yet they are gone after the subset.
This is the code I used to subset:
bldg1_Wkdy <- subset(sort.df, weekday == c("Mon","Tue","Wed","Thu","Fri"),
select = c("Date","Time","Consumption","kVARh","kW","weekday"))
Here's the data frame structure before the subset:
'data.frame': 72888 obs. of 6 variables:
$ Date : Date, format: "2016-12-13" "2016-12-13" "2016-12-13" ...
$ Time : Factor w/ 108 levels "0:00:00","0:15:00",..: 2 3 4 5 6 7 8 49 50 51 ...
$ Consumption: num 90.1 90.1 91.6 93.9 88.4 ...
$ kVARh : num 0 0 0 0 0 0 0 0 0 0 ...
$ kW : num 361 361 366 376 354 ...
$ weekday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...
I go from 72,888 observations to only 10,427 for the weekdays and 10,368 for the weekends, with many rows seemingly missing at random, as noted above. Some of the intervals have zero consumption (the electricity may have been out due to a storm or other reasons), but those rows do show up in the subset data, so the zeroes don't seem to be the problem. Thanks for your help!
Instead of weekday == c("Mon","Tue","Wed","Thu","Fri") you should use weekday %in% c("Mon","Tue","Wed","Thu","Fri"). With ==, R recycles the five-day vector along the column, so each row is compared against only one of the five days and most weekday rows fail the test; %in% checks every row for membership in the whole set. Below is a minimal test showing how %in% works as expected:
> x <- data.frame(weekday = "Tue")   # minimal one-row test frame
> subset(x, weekday == c("Mon","Tue","Wed","Thu","Fri"))
   weekday
NA    <NA>
> subset(x, weekday %in% c("Mon","Tue","Wed","Thu","Fri"))
  weekday
1     Tue
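Applied to the code from the question, with a hedged sketch of the averaging step you describe (bldg1_Wkdy, sort.df, and the column names all come from the question; aggregate is just one way to compute the per-weekday means):
bldg1_Wkdy <- subset(sort.df, weekday %in% c("Mon","Tue","Wed","Thu","Fri"),
                     select = c("Date","Time","Consumption","kVARh","kW","weekday"))
# average kW per weekday, then take the weekday with the highest average
wkdy_means <- aggregate(kW ~ weekday, data = bldg1_Wkdy, FUN = mean)
wkdy_means[which.max(wkdy_means$kW), ]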
I have a dataset with a timestamp column. I cannot feed the raw timestamps into a regression model, so I want to truncate each timestamp to its date and group the rows that fall on the same date. How do I go about doing that?
Example data set
print(processed_df.head())
date day isWeekend distance time
15 2016-07-06 14:43:53.923 Tuesday False 0.000 239.254
17 2016-07-07 09:24:53.928 Wednesday False 0.000 219.191
18 2016-07-07 09:33:02.291 Wednesday False 0.000 218.987
37 2016-07-14 22:03:23.355 Wednesday False 0.636 205.000
46 2016-07-14 23:51:49.696 Wednesday False 0.103 843.000
Now I would like the date to be the index, with rows that fall on the same date combined into a single row by adding the distance and time.
My attempt at the same:
print(new_df.groupby('date').mean().head())
distance time
date
2016-07-06 14:43:53.923 0.0 239.254
2016-07-07 09:24:53.928 0.0 219.191
2016-07-07 09:33:02.291 0.0 218.987
2016-07-07 11:28:26.920 0.0 519.016
2016-07-08 11:59:02.044 0.0 398.971
This failed; because the timestamps include the time of day, every row ends up in its own group.
Desired output
distance time
date
2016-07-06 0.0 239.254
2016-07-07 0.0 957.194
2016-07-08 0.0 398.971
I think you need to group by dt.date:
# cast first if the dtype is not already datetime
df.date = pd.to_datetime(df.date)
print(df.groupby(df.date.dt.date)[['distance', 'time']].mean())
distance time
date
2016-07-06 0.0000 239.254
2016-07-07 0.0000 219.089
2016-07-14 0.3695 524.000
Another solution uses resample, but then you need to remove the all-NaN rows with dropna:
print(df.set_index('date').resample('D')[['distance', 'time']].mean())
distance time
date
2016-07-06 0.0000 239.254
2016-07-07 0.0000 219.089
2016-07-08 NaN NaN
2016-07-09 NaN NaN
2016-07-10 NaN NaN
2016-07-11 NaN NaN
2016-07-12 NaN NaN
2016-07-13 NaN NaN
2016-07-14 0.3695 524.000
print(df.set_index('date').resample('D')[['distance', 'time']].mean().dropna())
distance time
date
2016-07-06 0.0000 239.254
2016-07-07 0.0000 219.089
2016-07-14 0.3695 524.000
I have the following dataset:
head(filter_selection)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 TEAM1 1.57 0.61 25-7-2014 18:30:00 0.96
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 TEAM2 1.29 1.18 26-7-2014 16:00:00 0.11
3 1696879 Pro League Lierse KV Oostende 2 0 TEAM1 1.03 1.04 26-7-2014 18:00:00 -0.01
4 1696881 Pro League Westerlo Lokeren 1 0 TEAM1 1.76 1.24 26-7-2014 18:00:00 0.52
5 1696877 Pro League Mechelen Genk 3 1 TEAM1 1.60 1.23 27-7-2014 12:30:00 0.37
6 1696871 Pro League Anderlecht Mouscron-Péruwelz 3 1 TEAM1 1.27 0.62 27-7-2014 16:00:00 0.65
I want to use the VERSCHIL value to predict the RESULT. Therefore I do the following to create a test/training set:
library(rcaret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)
The thing is, however, that when I do this my RESULT column changes:
training <- df_final_test[inTrain, ]
testing <- df_final_test[-inTrain, ]
head(training, 20)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL CLAS type TYPE TYPE2
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 3 1.57 0.61 25-7-2014 18:30:00 0.96 0.96 TBD (-0.0767,1.54] HIGH
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 4 1.29 1.18 26-7-2014 16:00:00 0.11 0.11 TBD (-0.0767,1.54] MEDIUM
It's now 3 and 4 instead of TEAM1 and TEAM2. Could anybody tell me why the TEAM1 value changed to 3?
It's strange because when I do the same with the spam dataset it works fine:
data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain, ]
head(training)
And that is despite the classes being the same:
class(spam$type)
[1] "factor"
class(filter_selection$RESULT)
[1] "factor"
First of all, there is no package rcaret; the package is called caret.
Secondly, you create a data partition on filter_selection, but then you build the training and test sets from a different data frame, df_final_test.
Do check the structure of df_final_test$RESULT and see how many levels the factor has; maybe something went wrong there. If there are levels in there you do not want, use droplevels(df_final_test$RESULT).
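A quick way to inspect and clean that factor (df_final_test is the data frame from the question):
str(df_final_test$RESULT)        # how is RESULT stored, and with which levels?
levels(df_final_test$RESULT)
df_final_test$RESULT <- droplevels(df_final_test$RESULT)   # drop unused levels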
If I try the code on filter_selection and create a training set from it, I get correct training and test sets.
library(caret)
inTrain <- createDataPartition(y=filter_selection$RESULT, p=0.75, list=FALSE)
training <- filter_selection[inTrain, ]
testing <- filter_selection[-inTrain, ]
head(training)
MATCHID COMPETITION TEAM1 TEAM2 GOALS1 GOALS2 RESULT EXPG1 EXPG2 DATUM TIJD VERSCHIL
1 1696873 Pro League Standard Liège Sporting Charleroi 3 0 TEAM1 1.57 0.61 25-7-2014 18:30:00 0.96
2 1696883 Pro League Waasland-Beveren Club Brugge 0 2 TEAM2 1.29 1.18 26-7-2014 16:00:00 0.11
4 1696881 Pro League Westerlo Lokeren 1 0 TEAM1 1.76 1.24 26-7-2014 18:00:00 0.52
5 1696877 Pro League Mechelen Genk 3 1 TEAM1 1.60 1.23 27-7-2014 12:30:00 0.37
6 1696871 Pro League Anderlecht Mouscron-Péruwelz 3 1 TEAM1 1.27 0.62 27-7-2014 16:00:00 0.65
I have two text files:
1-
> head(val)
V1 V2 V3
1 2015/03/31 00:00 0.134
2 2015/03/31 01:00 0.130
3 2015/03/31 02:00 0.133
4 2015/03/31 03:00 0.132
2-
> head(tes)
A B date
1 0.04 0.02 2015-03-31 02:18:56
What I need is to combine V1 (date) and V2 (hour) in val, find the row in val whose date and time is closest to each date in tes, and then extract the corresponding V3 and put it in tes.
The desired output would be:
tes
A B date V3
1 0.04 0.02 2015-03-31 02:18:56 0.133
Updated answer based on OP's comments.
val$date <- with(val,as.POSIXct(paste(V1,V2), format="%Y/%m/%d %H:%M"))
val
# V1 V2 V3 date
# 1 2015/03/31 00:00 0.134 2015-03-31 00:00:00
# 2 2015/03/31 01:00 0.130 2015-03-31 01:00:00
# 3 2015/03/31 02:00 0.133 2015-03-31 02:00:00
# 4 2015/03/31 03:00 0.132 2015-03-31 03:00:00
# 5 2015/04/07 13:00 0.080 2015-04-07 13:00:00
# 6 2015/04/07 14:00 0.082 2015-04-07 14:00:00
tes$date <- as.POSIXct(tes$date)
tes
# A B date
# 1 0.04 0.02 2015-03-31 02:18:56
# 2 0.05 0.03 2015-03-31 03:30:56
# 3 0.06 0.04 2015-03-31 05:30:56
# 4 0.07 0.05 2015-04-07 13:42:56
f <- function(d) { # for a given tes$date, find the index of the matching row in val
  diffs <- abs(difftime(val$date, d, units = "mins"))
  # if even the closest reading is more than 45 minutes away, return Inf
  if (min(diffs) > 45) Inf else which.min(diffs)
}
tes <- cbind(tes,val[sapply(tes$date,f),c("date","V3")])
tes
# A B date date V3
# 1 0.04 0.02 2015-03-31 02:18:56 2015-03-31 02:00:00 0.133
# 2 0.05 0.03 2015-03-31 03:30:56 2015-03-31 03:00:00 0.132
# 3 0.06 0.04 2015-03-31 05:30:56 <NA> NA
# 4 0.07 0.05 2015-04-07 13:42:56 2015-04-07 14:00:00 0.082
The function f(...) returns the row number in val whose val$date is closest in time to a given tes$date, unless even the closest reading is more than 45 minutes away, in which case it returns Inf. Using this function with sapply(...), as in:
sapply(tes$date, f)
returns a vector of row numbers in val matching your condition, one for each tes$date.
The reason we use Inf rather than NA for the missing matches is that indexing a data.frame with Inf always returns a single "row" containing NAs, whereas indexing with (logical) NA returns nrow(...) rows, all containing NA.
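A quick illustration of that indexing difference:
df <- data.frame(a = 1:3)
df[Inf, ]   # a single row of NAs (Inf is an out-of-range numeric index)
df[NA, ]    # logical NA is recycled along the rows, giving nrow(df) rows of NAs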
I added the extra rows into val and tes per your comment.
I have time series data, and I want a function that returns suitably lagged ratios of successive values.
Data:
ID Temperature value
1 -1.1923333
2 -0.2123333
3 -0.593
4 -0.7393333
5 -0.731
6 -0.4976667
7 -0.773
8 -0.6843333
9 -0.371
10 0.754
11 1.798
12 3.023
13 3.8233333
14 4.2456667
15 4.599
16 5.078
17 4.9133333
18 3.5393333
19 2.0886667
20 1.8236667
21 1.2633333
22 0.6843333
23 0.7953333
24 0.6883333
The function should work like this:
new value at ID 23 = value(24)/value(23), at ID 22 = value(23)/value(22), at ID 21 = value(22)/value(21), and so forth.
Expected Results:
ID New Temperature value
1 0.17
2 2.79
3 1.24
4 0.98
5 0.68
6 1.55
7 0.885
8 0.54
9 -2.03
10 2.38
11 1.68
12 1.264
13 1.11
14 1.083
15 1.104
16 0.967
17 0.72
18 0.59
19 0.873
20 0.69
21 0.541
22 1.16
23 0.86
24 NaN
To divide each element of a vector x by its predecessor, i.e. compute value(i+1)/value(i) as in your expected results, use:
x[-1] / x[-length(x)]
This will return a vector with a length of length(x) - 1. If you really need the NaN value at the end, add it by hand via c(x[-1] / x[-length(x)], NaN).
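Applied to the data above (df and the column name Temperature are assumptions based on your table):
temp <- df$Temperature
df$NewTemp <- c(temp[-1] / temp[-length(temp)], NaN)   # value(i+1) / value(i)
# spot check: the first new value is value(2)/value(1)
temp[2] / temp[1]   # -0.2123333 / -1.1923333 = 0.178, matching the expected 0.17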