data handling outlier with conditional in R

I have two data frames: one with data by hour and one with data by day.
I want to flag each hourly point based on a condition: if the hourly PH falls within that day's range (standard1 to standard2) it is OK, otherwise it is an outlier.
Example:
PH at 11-09-13 10:00 (hourly) = 49.14068
Compare with the range for 11-09-13: 20 to 40
49.14068 > 40 => Outlier
I want to run this comparison automatically in R.
I searched for this question but found no results.
Please help!
Data by Hour
DateTime PH
11-09-13 10:00 49.14068
11-09-13 11:00 52.53494167
11-09-13 12:00 24.8525
11-09-13 13:00 8.56055
11-09-13 14:00 23.77944167
11-09-13 15:00 25.13243333
11-09-13 16:00 35.2913
11-09-13 17:00 20.58211667
11-09-13 18:00 18.605975
11-09-13 19:00 59.16179167
11-09-13 20:00 72.06908333
11-09-13 21:00 43.47536667
11-09-13 22:00 44.73696667
11-09-13 23:00 38.7266
12-09-13 0:00 41.12040833
12-09-13 1:00 33.67845833
12-09-13 2:00 38.49083333
12-09-13 3:00 46.20168333
12-09-13 4:00 40.03630833
12-09-13 5:00 41.10841667
12-09-13 6:00 43.753475
12-09-13 7:00 45.077675
12-09-13 8:00 57.53141667
12-09-13 9:00 45.17694167
12-09-13 10:00 41.106525
12-09-13 11:00 30.08048333
12-09-13 12:00 24.70255833
12-09-13 13:00 15.60813333
12-09-13 14:00 14.09729167
........ n day(24h/day)
Data by Day (aggregated from Data by Hour)
DateTime standard1 standard2
11-09-13 20 40
12-09-13 12 50
13-09-13 16 30
....... n day
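
One possible approach, as a minimal sketch rather than a definitive answer: look up each hourly row's day in the daily table with match(), then compare PH against that day's standard1 and standard2. The data frame names hourly and daily, the Date key column, and the Status column are assumptions made for illustration; the bounds are treated as inclusive, which the question does not specify.

```r
# Small subset of the question's data (object and column names are illustrative)
hourly <- data.frame(
  DateTime = c("11-09-13 10:00", "11-09-13 12:00", "11-09-13 13:00",
               "12-09-13 0:00", "12-09-13 8:00"),
  PH = c(49.14068, 24.8525, 8.56055, 41.12040833, 57.53141667),
  stringsAsFactors = FALSE
)
daily <- data.frame(
  Date = c("11-09-13", "12-09-13"),
  standard1 = c(20, 12),
  standard2 = c(40, 50),
  stringsAsFactors = FALSE
)

# The day key is the text before the first space in DateTime
hourly$Date <- sub(" .*$", "", hourly$DateTime)

# Find the matching daily row for every hourly row
idx <- match(hourly$Date, daily$Date)
lower <- daily$standard1[idx]
upper <- daily$standard2[idx]

# Inclusive bounds are an assumption; days missing from `daily` yield NA
hourly$Status <- ifelse(hourly$PH >= lower & hourly$PH <= upper, "OK", "Outlier")
```

With the full data, the same lookup works unchanged as long as the day strings in both tables use the same format.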

Related

Merge two datasets based on time interval in R

I have two datasets.
Dataset X looks as follows. It contains 30-minute intervals of the trading day of some stock index, which opens at 9:30 and closes at 15:00 for DJ but at 16:00 for DX. So the closing time may vary by Ticker.
Date Ticker end_time start_time
1997-10-06 DJ 10:00 09:30
1997-10-06 DJ 10:30 10:00
1997-10-06 DJ 11:00 10:30
1997-10-06 DJ 11:30 11:00
1997-10-08 DJ 09:30 15:00
1997-10-08 DJ 10:00 09:30
1997-10-06 DX 10:00 09:30
1997-10-06 DX 10:30 10:00
1997-10-06 DX 11:00 10:30
1997-10-06 DX 11:30 11:00
1997-10-07 DX 14:30 14:00
1997-10-07 DX 15:00 14:30
1997-10-07 DX 15:30 15:00
1997-10-07 DX 16:00 15:30
1997-10-08 DX 09:30 16:00
1997-10-08 DX 10:00 09:30
Dataset Y looks as follows:
Date Time Event
1997-10-06 09:30 Event1
1997-10-06 10:30 Event2
1997-10-07 22:00 Event3
1997-10-08 09:00 Event4
1997-10-08 09:30 Event5
1997-10-08 09:30 Event6
My aim is to link events in Y to X based on whether the event date-time falls within the start/end time interval. My expected output is something like this (dataset Z):
Date Ticker end_time start_time Event
1997-10-06 DJ 10:00 09:30 Event1
1997-10-06 DJ 10:30 10:00 NA
1997-10-06 DJ 11:00 10:30 Event2
1997-10-06 DJ 11:30 11:00 NA
1997-10-08 DJ 09:30 15:00 Event3,Event4
1997-10-08 DJ 10:00 09:30 Event5,Event6
1997-10-06 DX 10:00 09:30 Event1
1997-10-06 DX 10:30 10:00 NA
1997-10-06 DX 11:00 10:30 Event2
1997-10-06 DX 11:30 11:00 NA
1997-10-07 DX 14:30 14:00 NA
1997-10-07 DX 15:00 14:30 NA
1997-10-07 DX 15:30 15:00 NA
1997-10-07 DX 16:00 15:30 NA
1997-10-08 DX 09:30 16:00 Event3, Event4
1997-10-08 DX 10:00 09:30 Event5,Event6
It is thus possible for multiple events to happen within one interval. Is it possible to store all of them in the column "Event"? It is also possible that an event occurs after the market closes, in which case it should be stored in the first interval that occurs after the event. How can I obtain this expected output? I have been thinking about it for a while now, but I have no clue where to start.
Edit: X contains 400k 30-min intervals. Y contains 40k events.

There are lots of ways of approaching the problem; here's just one suggestion.
I'm using these datasets:
x <- read.table(text = "Date,Ticker,end_time,start_time
06/10/1997,DJ,10:00,09:30
06/10/1997,DJ,10:30,10:00
06/10/1997,DJ,11:00,10:30
06/10/1997,DJ,11:30,11:00
08/10/1997,DJ,09:30,15:00
08/10/1997,DJ,10:00,09:30
06/10/1997,DX,10:00,09:30
06/10/1997,DX,10:30,10:00
06/10/1997,DX,11:00,10:30
06/10/1997,DX,11:30,11:00
07/10/1997,DX,14:30,14:00
07/10/1997,DX,15:00,14:30
07/10/1997,DX,15:30,15:00
07/10/1997,DX,16:00,15:30
08/10/1997,DX,09:30,16:00
08/10/1997,DX,10:00,09:30
08/10/1997,DX,10:00,09:30", sep =",", header = TRUE, stringsAsFactors =
FALSE)
y <- read.table(text = "Date,Time,Event
06/10/1997,09:30,Event1
06/10/1997,10:30,Event2
07/10/1997,22:00,Event3
08/10/1997,09:00,Event4
08/10/1997,09:30,Event5
08/10/1997,09:30,Event6
", sep =",", header = TRUE, stringsAsFactors = FALSE)
I would start by concatenating and formatting the dates and times so they can be used in functions to check whether an event occurred in that window. Assuming you have two data frames called x and y in the structure described above:
y$date_time <- strptime(paste(y$Time,y$Date),format="%H:%M %d/%m/%Y")
x$start_time_date <- strptime(paste(x$start_time,x$Date),format="%H:%M %d/%m/%Y")
x$end_time_date <- strptime(paste(x$end_time,x$Date),format="%H:%M %d/%m/%Y")
If you have control over how the dataset is compiled, it might be easier to record the start and end date-times this way from the outset. For the periods that cross over a date, the conversion above produces start date-times that are after the end date-times. We can fix those by using the date from the previous entry in the data frame instead, assuming the rows are always in chronological order and there are no missing periods. This is a bit of a hack!
#check which entries cross over a date
overnight_idx <- which(x$end_time_date < x$start_time_date)
#replace start date with that of preceding entry in the data frame
x[overnight_idx, 'start_time_date'] <-
as.POSIXct(paste(x[overnight_idx, 'start_time'],
x[overnight_idx - 1, 'Date']),format="%H:%M %d/%m/%Y",
origin = "1970-01-01")
Now we can write a function that, for a given row of the data frame x, extracts any events listed in y that occurred in that row's interval, and then does a little formatting to get the result into the format you described.
checkEvent <- function(x_row){
y2 <- y[y$date_time>=x_row['start_time_date'] &
y$date_time<x_row['end_time_date'], 'Event']
if(length(y2)==0){
y2 <- NA
} else if(length(y2)>1){
y2 <- paste(y2,collapse = ' ')
}
return(y2)
}
Then we can just apply that to x
x$Event <- apply(x,1,checkEvent)
which will produce the following (ignoring the columns we created above so it fits on the screen):
> x[,c('Date','Ticker','end_time','start_time','Event')]
Date Ticker end_time start_time Event
1 06/10/1997 DJ 10:00 09:30 Event1
2 06/10/1997 DJ 10:30 10:00 <NA>
3 06/10/1997 DJ 11:00 10:30 Event2
4 06/10/1997 DJ 11:30 11:00 <NA>
5 08/10/1997 DJ 09:30 15:00 Event3 Event4
6 08/10/1997 DJ 10:00 09:30 Event5 Event6
7 06/10/1997 DX 10:00 09:30 Event1
8 06/10/1997 DX 10:30 10:00 <NA>
9 06/10/1997 DX 11:00 10:30 Event2
10 06/10/1997 DX 11:30 11:00 <NA>
11 07/10/1997 DX 14:30 14:00 <NA>
12 07/10/1997 DX 15:00 14:30 <NA>
13 07/10/1997 DX 15:30 15:00 <NA>
14 07/10/1997 DX 16:00 15:30 <NA>
15 08/10/1997 DX 09:30 16:00 Event3 Event4
16 08/10/1997 DX 10:00 09:30 Event5 Event6
17 08/10/1997 DX 10:00 09:30 Event5 Event6
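
As a variation on the same idea, here is a minimal sketch that loops over row indices with vapply() instead of apply(), so the date-times stay as POSIXct rather than being coerced to character. The toy data frames and their column names (start, end, time, event) are assumptions for illustration, and the comma separator follows the expected output in the question.

```r
# Toy intervals and events (hypothetical, far smaller than the real data)
x <- data.frame(
  start = as.POSIXct(c("1997-10-06 09:30", "1997-10-06 10:00", "1997-10-07 15:00"),
                     tz = "UTC"),
  end   = as.POSIXct(c("1997-10-06 10:00", "1997-10-06 10:30", "1997-10-08 09:30"),
                     tz = "UTC")
)
y <- data.frame(
  time  = as.POSIXct(c("1997-10-06 09:30", "1997-10-07 22:00", "1997-10-08 09:00"),
                     tz = "UTC"),
  event = c("Event1", "Event3", "Event4"),
  stringsAsFactors = FALSE
)

# For each interval, collect the events with start <= time < end
x$Event <- vapply(seq_len(nrow(x)), function(i) {
  hit <- y$event[y$time >= x$start[i] & y$time < x$end[i]]
  if (length(hit) == 0) NA_character_ else paste(hit, collapse = ",")
}, character(1))
```

This still scans all of y once per interval, so for 400k intervals it may be slow; it is meant to show the comparison logic, not a tuned solution.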

How to grepl search for the max and min timings in a string?

I have a dataset with a column containing the opening and closing times of various stores.
The timings are in string format Opening time - Closing time,
eg: 17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30
I want to extract the minimum opening time within the above string, i.e. 11:30, and the maximum closing time, i.e. 21:00. How do I do that using R?
DPUT:
structure(list(head.timings_remapping.Opening.And.Closing.Time..40. = c("15:30 - 21:30",
"12:00 - 00:00", "11:00 - 15:00 | 16:30 - 20:45", "12:00 - 22:30",
"17:00 - 21:30", "17:00 - 21:30", "16:30 - 00:00", "16:00 - 21:15",
"16:30 - 20:30", "17:00 - 20:00", "16:00 - 23:30", "16:30 - 21:30",
"17:00 - 22:00", "17:00 - 22:00", "17:00 - 21:30", "17:00 - 21:30",
"16:00 - 00:00", "16:30 - 23:59", "11:30 - 22:30", "11:30 - 23:59",
"17:00 - 20:30", "07:30 - 12:50", "16:15 - 23:00", "09:00 - 21:00",
"10:00 - 21:00", "11:00 - 22:00", "07:00 - 12:00 | 07:00 - 13:30 | 12:00 - 13:30",
"07:00 - 13:00 | 10:00 - 15:00", "10:00 - 02:00", "00:00 - 23:59",
"00:00 - 23:59", "11:00 - 20:00", "11:00 - 20:00", NA, "12:00 - 03:30 | 11:00 - 00:00",
"05:30 - 15:00", "07:00 - 16:00", "08:30 - 13:30", "17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30",
"12:00 - 01:00")), class = "data.frame", row.names = c(NA, -40L
))
The final output will have two columns "Opening time" and "Closing time"

Does this work:
library(dplyr)
library(tidyr)
df %>%
separate(col = head.timings_remapping.Opening.And.Closing.Time..40., into = c('Open_Close','A'), sep = '\\|') %>%
separate(col = Open_Close, into = c('Opening Time','Closing Time'), sep = ' - ') %>%
mutate(`Opening Time` = trimws(`Opening Time`), `Closing Time` = trimws(`Closing Time`)) %>% select(-A)
Opening Time Closing Time
1 15:30 21:30
2 12:00 00:00
3 11:00 15:00
4 12:00 22:30
5 17:00 21:30
6 17:00 21:30
7 16:30 00:00
8 16:00 21:15
9 16:30 20:30
10 17:00 20:00
11 16:00 23:30
12 16:30 21:30
13 17:00 22:00
14 17:00 22:00
15 17:00 21:30
16 17:00 21:30
17 16:00 00:00
18 16:30 23:59
19 11:30 22:30
20 11:30 23:59
21 17:00 20:30
22 07:30 12:50
23 16:15 23:00
24 09:00 21:00
25 10:00 21:00
26 11:00 22:00
27 07:00 12:00
28 07:00 13:00
29 10:00 02:00
30 00:00 23:59
31 00:00 23:59
32 11:00 20:00
33 11:00 20:00
34 <NA> <NA>
35 12:00 03:30
36 05:30 15:00
37 07:00 16:00
38 08:30 13:30
39 17:00 21:00
40 12:00 01:00

Using the dplyr and tidyr libraries you can do:
library(dplyr)
library(tidyr)
#Rename the long column name to something smaller
names(df)[1] <- 'Time'
df %>%
#Create a row index
mutate(row = row_number()) %>%
#Split the data in different rows on '|'
separate_rows(Time, sep = '\\s*\\|\\s*') %>%
#split the data on '-'
separate(Time, c("Opening_Time", "Closing_time"), sep = '\\s*-\\s*') %>%
#Change the time to POSIXct format
mutate(across(c(Opening_Time, Closing_time), as.POSIXct, format = '%H:%M')) %>%
#For each row
group_by(row) %>%
#Get minimum opening time and maximum closing time
#and change into required format
summarise(Opening_Time = format(min(Opening_Time), "%H:%M"),
Closing_time = format(max(Closing_time), "%H:%M")) %>%
#Drop row column
select(-row)
This returns
# Opening_Time Closing_time
# <chr> <chr>
# 1 15:30 21:30
# 2 12:00 00:00
# 3 11:00 20:45
# 4 12:00 22:30
# 5 17:00 21:30
# 6 17:00 21:30
# 7 16:30 00:00
# 8 16:00 21:15
# 9 16:30 20:30
#10 17:00 20:00
# … with 30 more rows
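
For comparison, a base R sketch of the same split-then-summarise logic, under the assumption that every time is zero-padded HH:MM so that plain string comparison orders times correctly. The helper name open_close and the vector timings are made up for this example. Like the pipeline above, it treats a closing time after midnight (such as 00:00 or 02:00) as early in the day.

```r
timings <- c("17:00 - 21:00 | 11:30 - 14:30 | 11:30 - 14:30",
             "11:00 - 15:00 | 16:30 - 20:45",
             NA)

# Hypothetical helper: earliest opening and latest closing time of one string
open_close <- function(s) {
  if (is.na(s)) {
    return(c(Opening_Time = NA_character_, Closing_Time = NA_character_))
  }
  ranges <- strsplit(s, "\\s*\\|\\s*")[[1]]   # split on '|'
  parts  <- strsplit(ranges, "\\s*-\\s*")     # split each range on '-'
  opens  <- vapply(parts, `[`, character(1), 1)
  closes <- vapply(parts, `[`, character(1), 2)
  # Zero-padded HH:MM strings sort in time order
  c(Opening_Time = min(opens), Closing_Time = max(closes))
}

res <- as.data.frame(
  t(vapply(timings, open_close, character(2), USE.NAMES = FALSE)),
  stringsAsFactors = FALSE
)
```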

How to parse year from a date in r [duplicate]

I have a data set of 53,000 dates and I want to extract only the year from the date variable.
Do you know how I can do this?
My data are as follows:
OPN_DT_TM
18/07/2003 10:55
12/06/2004 6:00
9/06/2007 12:20
29/06/2001 16:00
6/06/2000 7:55
27/11/2006 10:15
17/11/2001 17:00
12/05/2004 22:00
16/04/2005 22:00
18/03/2005 8:40
13/06/2006 11:10
30/07/2006 12:00
16/07/2002 6:10
16/07/2002 7:15
3/09/2004 6:00
9/11/2004 15:20
25/08/2005 14:15
24/11/2001 19:10
15/04/2002 6:30
20/06/2002 6:30
17/03/2003 7:00
15/01/2005 13:00
23/03/2007 1:00
21/01/2001 10:30
...

This can be achieved by converting the entries into Date format and extracting the year, for instance like this:
> format(as.Date("15/01/2005 13:00", format="%d/%m/%Y %H:%M"),"%Y")
[1] "2005"
To get in-depth knowledge about dates and times in R, please see this.
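
The same conversion applies to a whole column at once. A minimal sketch using three values from the question; the vector name opn is an assumption. Only the date part is parsed here, since as.Date() ignores trailing characters beyond the given format.

```r
opn <- c("18/07/2003 10:55", "12/06/2004 6:00", "9/06/2007 12:20")

# Parse day/month/year; the trailing time portion is ignored
dates <- as.Date(opn, format = "%d/%m/%Y")

# Extract the year and convert it to an integer
years <- as.integer(format(dates, "%Y"))
```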

standard deviation of specific row numbers and put the value in another row & column in R

I have the following data:
Date Value Std.Dev
11/30/2015 10:00 0
11/30/2015 10:30 -0.002400962
11/30/2015 11:00 -0.004819286
11/30/2015 11:30 -0.000805477
11/30/2015 12:00 -0.001612904
11/30/2015 12:30 -0.003233633
11/30/2015 13:00 0.000809389
11/30/2015 13:30 0.005647453
11/30/2015 14:00 -0.002416433
11/30/2015 14:30 -0.006472515
11/30/2015 15:00 -0.002438035
11/30/2015 15:30 0
11/30/2015 16:30 -0.000814001
12/1/2015 9:00 0.006493529 0.002931114
12/1/2015 9:30 -0.001619434 0.003657839
12/1/2015 10:00 -0.003246756 0.00363798
12/1/2015 10:30 -0.002442004 0.003519869
12/1/2015 11:00 0.000814664 0.003551266
12/1/2015 11:30 -0.001629992 0.00357286
12/1/2015 12:00 0.000815328 0.003504601
12/1/2015 12:30 -1.11022E-16 0.003504796
12/1/2015 13:00 -0.000815328 0.002981979
The Std.Dev calculation should start at row 14, because the first standard deviation is computed on the previous day's values. The standard deviation for row 14 is calculated on Value rows 1 to 13, and it continues in the same way: Std.Dev at row 15 = STDEV(Value2:Value14),
Std.Dev at row 16 = STDEV(Value3:Value15), and so on.
Can you please suggest a function for this kind of calculation in R? In Excel it is very easy, so something similar in R would be very helpful.
Thanks.
Pardon my English. Please let me know in the comments if you want more details or an example.

Definitely not the most efficient way, but maybe sufficient for you (with x denoting your data frame):
for(counter in 14:nrow(x)){
x[counter,3] <- sd(x[(counter-13):(counter-1),2])
}
But again, that's definitely not the most efficient way.

For a data.frame, df, you can get this as follows with sapply:
df$st.dev <- c(rep(NA, 13), sapply(13:(nrow(df)-1), function(i) sd(df$Value[(i-12):i])))
sapply will run through the selected rows and the function that follows will repeatedly calculate the standard deviations for the selected rows. I prepend NAs to this output so that it can be added to the data.frame.
data
I cheated a little in reading in the data, but it doesn't affect the result.
df <- read.table(header=T, text="Date Time Value
11/30/2015 10:00 0
11/30/2015 10:30 -0.002400962
11/30/2015 11:00 -0.004819286
11/30/2015 11:30 -0.000805477
11/30/2015 12:00 -0.001612904
11/30/2015 12:30 -0.003233633
11/30/2015 13:00 0.000809389
11/30/2015 13:30 0.005647453
11/30/2015 14:00 -0.002416433
11/30/2015 14:30 -0.006472515
11/30/2015 15:00 -0.002438035
11/30/2015 15:30 0
11/30/2015 16:30 -0.000814001
12/1/2015 9:00 0.006493529
12/1/2015 9:30 -0.001619434
12/1/2015 10:00 -0.003246756
12/1/2015 10:30 -0.002442004
12/1/2015 11:00 0.000814664
12/1/2015 11:30 -0.001629992
12/1/2015 12:00 0.000815328
12/1/2015 12:30 -1.11022E-16
12/1/2015 13:00 -0.000815328", as.is=TRUE)
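
Another way to build the 13-row trailing windows, sketched here with base R's embed(); this is an alternative to the loop and sapply versions above, not a claim that it is faster. The names value, k and roll are assumptions for this example, and the numbers are the Value column from the question.

```r
value <- c(0, -0.002400962, -0.004819286, -0.000805477, -0.001612904,
           -0.003233633, 0.000809389, 0.005647453, -0.002416433,
           -0.006472515, -0.002438035, 0, -0.000814001, 0.006493529,
           -0.001619434, -0.003246756, -0.002442004, 0.000814664,
           -0.001629992, 0.000815328, -1.11022e-16, -0.000815328)

k <- 13                      # window length (rows of the previous day)
n <- length(value)

# Row j of `windows` holds value[j:(j + k - 1)], in reversed order
windows <- embed(value, k)

# The window starting at row j feeds the result at row j + k
roll <- rep(NA_real_, n)
roll[(k + 1):n] <- apply(windows[seq_len(n - k), , drop = FALSE], 1, sd)
```

The first 13 entries of roll stay NA, and entry 14 is the standard deviation of rows 1 to 13, which is about 0.00293 for this data.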

R Programming: How to arrange the ticks values in datetime plot in ggplot [duplicate]

I have these data,
Arrival Date
7:50 Apr-19
7:45 Apr-20
7:30 Apr-23
7:30 Apr-24
7:55 Apr-25
7:20 Apr-26
7:30 Apr-27
7:50 Apr-28
8:00 Apr-30
7:45 May-2
8:30 May-3
8:06 May-4
8:25 May-7
7:35 May-8
7:45 May-9
8:02 May-10
7:53 May-11
8:39 May-14
8:14 May-15
8:08 May-16
8:27 May-17
8:20 May-18
12:00 Apr-19
12:00 Apr-20
12:00 Apr-23
12:00 Apr-24
12:00 Apr-25
12:00 Apr-26
12:00 Apr-27
12:00 Apr-28
11:50 Apr-30
12:00 May-2
11:45 May-3
11:50 May-4
12:00 May-7
11:50 May-8
11:55 May-9
12:10 May-10
11:53 May-11
11:54 May-14
11:40 May-15
11:54 May-16
11:45 May-17
12:00 May-18
And I want to plot it using ggplot,
This is what I did,
OJT <- read.csv(file = "Data.csv", header = TRUE)
qplot(Date,Arrival, data = OJT, xlab = expression(bold("Date")), ylab = expression(bold("Time"))) + theme_bw() + opts(axis.text.x=theme_text(angle=90)) +geom_point(size = 2, colour = "black", fill = "red", pch = 21)
And here is the output
As you can see, the times and dates are not in order. I want the time to run from 7:00 am to 12:20 pm, and the date from April 19 to May 18. I tried using
as.Date(strptime(OJT$Date,"%m-%dT"))
But I still don't get the right plot.
And I can't find similar problems on the internet.
Any ideas to help me solve this?
Thanks

I will try a different approach with some wrangling in lubridate. Target plot:
The code, including your data:
library("ggplot2")
library("lubridate")
df <- read.table(text = "Arrival Date
7:50 Apr-19
7:45 Apr-20
7:30 Apr-23
7:30 Apr-24
7:55 Apr-25
7:20 Apr-26
7:30 Apr-27
7:50 Apr-28
8:00 Apr-30
7:45 May-2
8:30 May-3
8:06 May-4
8:25 May-7
7:35 May-8
7:45 May-9
8:02 May-10
7:53 May-11
8:39 May-14
8:14 May-15
8:08 May-16
8:27 May-17
8:20 May-18
12:00 Apr-19
12:00 Apr-20
12:00 Apr-23
12:00 Apr-24
12:00 Apr-25
12:00 Apr-26
12:00 Apr-27
12:00 Apr-28
11:50 Apr-30
12:00 May-2
11:45 May-3
11:50 May-4
12:00 May-7
11:50 May-8
11:55 May-9
12:10 May-10
11:53 May-11
11:54 May-14
11:40 May-15
11:54 May-16
11:45 May-17
12:00 May-18", header=TRUE)
df$Date <- paste('2012-',df$Date, sep='')
df$Full <- paste(df$Date, df$Arrival, sep=' ')
df$Full <- ymd_hm(df$Full)
df$decimal.hour <- hour(df$Full) + minute(df$Full)/60
p <- ggplot(df, aes(x=Full, y=decimal.hour)) +
geom_point()
p

library(timeSeries) #dummySeries() comes from the timeSeries package
library(ggplot2)
#make some data in your kind of format:
tS <- dummySeries()
a<-rownames(tS)
x<-c(a,a)
y<-1:24
dat<-as.data.frame(cbind(x,y))
#get it in the format for the plot
v<-paste(dat$x,dat$y, sep=" ")
v2<-as.POSIXct(strptime(v, "%Y-%m-%d %H",tz="GMT"))
v3<-sort(v2)
hrs<-strftime(v2,"%H")
days<-strftime(v2,"%Y-%m-%d")
final<-data.frame(cbind(days,hrs))
qplot(days,hrs,data=final) + geom_point()
#ooooff... I bet this can be done much cleaner...i know little about
#time series data.
