Creating Subsets of data with multiple where/between statements - r

I have a dataset that covers 2 days in 2 different months, with the same time periods on each day. It shows how many occupants were in a house at each time. I want to subset the data by date, time period AND houseid.
So I want all the records where the date is 01/02/2010, the time is between 14:00:00 and 19:00:00, and houseid is N60421A. At the moment every column is stored as character except occupants, which is numeric.
http://www.sharecsv.com/s/aa6d4dc34acfbaf73ada1d2c8764b888/modecsv.csv
So far I have tried this, but I get no results:
data2 = subset(data, dayMonthYear == "01/02/2010" && Houses == "N60421A")
In SQL I would do something like:
SELECT *
FROM data
WHERE dayMonthYear = '01/02/2010'
AND houses = 'N60421A'
AND time > '14:00:00'
AND time < '19:00:00'

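This should work for you. Your attempt fails because && is not vectorized: it works on a single TRUE/FALSE instead of one value per row (and recent versions of R raise an error when it is given longer vectors), so use & inside subset().
# Combine date and time into a new POSIXct variable "Time1"
data$Time1 <- as.POSIXct(paste(data$dayMonthYear, data$Time), format = "%d/%m/%Y %H:%M:%S")
# Subset; note that matching on the hour "19" also keeps rows from 19:00:01 to 19:59:59
data2 <- subset(data, dayMonthYear == "01/02/2010" & Houses == "N60421A" & strftime(Time1, "%H") %in% c("14", "15", "16", "17", "18", "19"))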
You could also use the "chron" package and standard R subsetting...
#Approach 2
#Load Library
library(chron)
#Convert Time from factor while creating new variable "Time2"
data$Time2 <- chron(times = as.character(data$Time))
#Subset
data2 <- data[(data$dayMonthYear == "01/02/2010" & data$Houses == "N60421A" & data$Time2 >= "14:00:00" & data$Time2 <= "19:00:00" ),]
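As a dependency-free alternative, the same filter can be sketched in base R by converting the time strings to seconds since midnight. The sample rows and the helper name to_seconds below are hypothetical, made up for illustration; only the column names come from the question.

```r
# Hypothetical sample rows using the question's column names
data <- data.frame(
  dayMonthYear = c("01/02/2010", "01/02/2010", "01/02/2010", "01/03/2010"),
  Time         = c("13:30:00", "14:00:00", "19:30:00", "15:00:00"),
  Houses       = c("N60421A", "N60421A", "N60421A", "N60421A"),
  occupants    = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "HH:MM:SS" -> seconds since midnight
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

secs <- to_seconds(data$Time)

# & (not &&) compares row by row; both bounds are inclusive here
data2 <- data[data$dayMonthYear == "01/02/2010" &
              data$Houses == "N60421A" &
              secs >= to_seconds("14:00:00") &
              secs <= to_seconds("19:00:00"), ]
```

Comparing numbers rather than strings avoids surprises with unpadded times such as "9:30:00", which would sort after "14:00:00" as text.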

Related

How to split a data frame in R based on date when multiple rows have identical date stamp [duplicate]

I am working with daily returns from a Brazilian index (IBOV) since 1993, and I am trying to figure out the best way to subset for periods between two dates.
The data frame (IBOV_RET) is as follows:
head(IBOV_RET)
DATE 1D_RETURN
1 1993-04-28 -0.008163265
2 1993-04-29 -0.024691358
3 1993-04-30 0.016877637
4 1993-05-03 0.000000000
5 1993-05-04 0.033195021
6 1993-05-05 -0.012048193
...
I set 2 variables DATE1 and DATE2 as dates
DATE1 <- as.Date("2014-04-01")
DATE2 <- as.Date("2014-05-05")
I was able to create a new subset using this code:
TEST <- IBOV_RET[IBOV_RET$DATE >= DATE1 & IBOV_RET$DATE <= DATE2,]
It worked, but I was wondering if there is a better way to subset the data between two dates, maybe using subset.
As already pointed out by @MrFlick, you can't get around the basic logic of subsetting. One way to make it easier to subset your specific data.frame would be to define a function that takes two inputs, like DATE1 and DATE2 in your example, and returns the subset of IBOV_RET that falls between them.
myfunc <- function(x,y){IBOV_RET[IBOV_RET$DATE >= x & IBOV_RET$DATE <= y,]}
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")
Test <- myfunc(DATE1,DATE2)
#> Test
# DATE X1D_RETURN
#2 1993-04-29 -0.02469136
#3 1993-04-30 0.01687764
#4 1993-05-03 0.00000000
#5 1993-05-04 0.03319502
You can also enter the specific dates directly into myfunc:
myfunc(as.Date("1993-04-29"),as.Date("1993-05-04")) #will produce the same result
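Since myfunc is tied to the global IBOV_RET, a slightly more general variant might take the data frame and column name as arguments. This is only a sketch: the name subset_dates and the shortened column name RET are assumptions for illustration, and the rows are the six shown in the question.

```r
# Hypothetical helper: keep rows of `df` whose `col` lies in [from, to]
subset_dates <- function(df, from, to, col = "DATE") {
  df[df[[col]] >= from & df[[col]] <= to, , drop = FALSE]
}

# The six rows from the question (return column renamed RET for brevity)
IBOV_RET <- data.frame(
  DATE = as.Date(c("1993-04-28", "1993-04-29", "1993-04-30",
                   "1993-05-03", "1993-05-04", "1993-05-05")),
  RET  = c(-0.008163265, -0.024691358, 0.016877637,
           0.000000000, 0.033195021, -0.012048193)
)

res <- subset_dates(IBOV_RET, as.Date("1993-04-29"), as.Date("1993-05-04"))
```

Passing the data frame explicitly means the same function can be reused on other tables without editing its body.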
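You can use the subset() function with the & operator (replace each XXXX-XX-XX with your lower and upper bound):
subset(IBOV_RET, DATE > as.Date("XXXX-XX-XX") & DATE < as.Date("XXXX-XX-XX"))
Updating for a more "tidyverse-oriented" approach:
library(dplyr)
IBOV_RET %>%
filter(DATE > as.Date("XXXX-XX-XX"), DATE < as.Date("XXXX-XX-XX")) # comma is the same as &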
There is no real other way to extract date ranges. The logic is the same as extracting a range of numeric values as well, you just need to do the explicit Date conversion as you've done. You can make your subsetting shorter as you would with any other subsetting task with subset or with. You can break ranges into intervals with cut (there is a specific cut.Date overload). But base R does not have any way to specify Date literals so you cannot avoid the conversion. I can't imagine what other sort of syntax you may have had in mind.
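To make the with and cut suggestions concrete, here is a small sketch using the six rows from the question. The shortened column name RET and the variable names in_range, month_bin and by_month are assumptions for illustration.

```r
# The six rows from the question (return column renamed RET for brevity)
IBOV_RET <- data.frame(
  DATE = as.Date(c("1993-04-28", "1993-04-29", "1993-04-30",
                   "1993-05-03", "1993-05-04", "1993-05-05")),
  RET  = c(-0.008163265, -0.024691358, 0.016877637,
           0.000000000, 0.033195021, -0.012048193)
)
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")

# with() avoids repeating the data frame name inside the condition
in_range <- with(IBOV_RET, DATE >= DATE1 & DATE <= DATE2)
TEST <- IBOV_RET[in_range, ]

# cut() on Dates labels each row with the first day of its month
month_bin <- cut(IBOV_RET$DATE, breaks = "month")
by_month <- split(IBOV_RET, month_bin)
```

Here by_month is a list with one data frame per calendar month, which can be handy when the ranges of interest line up with month boundaries.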
What about:
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")
# creating a data range with the start and end date:
dates <- seq(DATE1, DATE2, by="days")
IBOV_RET <- subset(IBOV_RET, DATE %in% dates)
I believe lubridate could help here:
library(lubridate)
daterange <- interval(DATE1, DATE2)
TEST <- IBOV_RET[which(IBOV_RET$DATE %within% daterange), ]
I am rather fond of the dplyr package. Load it:
library("dplyr")
and then, as you did:
Date1 <- as.Date("2014-04-01")
Date2 <- as.Date("2014-05-05")
Finally:
test <- filter(IBOV_RET, DATE > Date1 & DATE < Date2)
You can use dplyr's between() function after simply converting the strings to dates:
df %>%
filter(between(date_column, as.Date("string-date-lower-bound"), as.Date("string-date-upper-bound")))
Test <- IBOV_RET[IBOV_RET$DATE >= as.Date("2014-04-01") | IBOV_RET$DATE <= as.Date("1993-05-04"), ]
Here I am using the "or" operator |, which keeps rows whose date is on or after the first date or on or before the second date.

Filtering out time data from R data frame

I have a dataset in R:
IncidentID Time Vehicle
19002 4:48 Car
19003 12:30 Motorcycle
19004 14:00 Car
19005 9:30 Bicycle
I'm trying to filter the data, since it's quite a large dataset. The above is just a few example rows.
I want to filter by time: say I want the rows where Time is between 12pm and 6pm (18:00 in 24-hour format), so I would have:
IncidentID Time Vehicle
19003 12:30 Motorcycle
19004 14:00 Car
I did:
incident <- read.csv("incident.csv")
afternoon_incident <- incident[which(incident$Time >= 12 && incident$Time <= 18),]
But I'm getting the error saying:
1: In Ops.factor(web$Time, 6:0) : ‘>=’ not meaningful for factors
2: In Ops.factor(web$Time, 12:0) : ‘<=’ not meaningful for factors
You can use lubridate to convert the Time field into a time object and then extract the hour for filtering:
library(lubridate)
incident$Time <- hm(as.character(incident$Time))
incident[which(hour(incident$Time) >= 12 & hour(incident$Time) <= 18), ]
You first need to convert Time into an actual date-time object using as.POSIXct and then compare.
As you want to subset based on the hour, we can extract only the hour part using format and keep the rows whose hour is between 12 and 18. Using base R, we can do
df$hour <- as.numeric(format(as.POSIXct(df$Time, format = "%H:%M"), "%H"))
subset(df, hour >= 12 & hour <= 18)
# IncidentID Time Vehicle hour
#2 19003 12:30 Motorcycle 12
#3 19004 14:00 Car 14
You can remove the hour column later if not needed.
For a general solution, we can create a date-time column and then compare
df$datetime <- as.POSIXct(df$Time, format = "%H:%M")
subset(df, datetime >= as.POSIXct("12:30:00", format = "%T") &
datetime <= as.POSIXct("18:30:00", format = "%T"))
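One caveat with the hour-based filters above is that hour <= 18 also keeps times such as 18:30. If the window should end at exactly 18:00, a possible sketch is to compare minutes since midnight instead. The helper name to_minutes is hypothetical; the rows are the ones from the question.

```r
# Rows from the question, with Time read as character rather than factor
df <- data.frame(
  IncidentID = c(19002, 19003, 19004, 19005),
  Time       = c("4:48", "12:30", "14:00", "9:30"),
  Vehicle    = c("Car", "Motorcycle", "Car", "Bicycle"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "H:MM" -> minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

mins <- to_minutes(df$Time)

# Keep 12:00 through 18:00 inclusive
afternoon <- df[mins >= 12 * 60 & mins <= 18 * 60, ]
```

The as.character() call inside the helper means it also works when read.csv has turned Time into a factor, which is what triggered the error in the question.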

Identify only the first matching record

I have a large amount of time series data stored in a dataframe called "Tag.data" where one record is taken every 30 seconds over the course of several months. For example:
2013-09-30 23:59:00
2013-09-30 23:59:30
2013-10-01 00:00:00
2013-10-01 00:00:30
2013-10-01 00:01:00
2013-10-01 00:01:30
2013-10-01 00:02:00
...
2013-10-15 05:00:00
2013-10-15 05:00:30
2013-10-15 05:01:00
2013-10-15 05:01:30
2013-10-15 05:02:00
...
This data is stored in Tag.data$dt.
Within my data I would like to identify the 1st and 15th day of each month so that these can be used on a later plot.
I was successfully able to identify the first day of each month with this code:
locs <- tapply (X=Tag.data$dt, FUN=min, INDEX=format(Tag.data$dt, '%Y%m'))
at <- Tag.data$dt %in% locs
at <- at & format(Tag.data$dt, '%m') %in% c('01', '02', '03','04', '05', '06','07', '08', '09','10', '11', '12') & format(Tag.data$dt, '%d') == '01'
Unfortunately I was less successful when I attempted to also identify the 15th day of each month with this code:
locs <- tapply (X=Tag.data$dt, FUN=min, INDEX=format(Tag.data$dt, '%Y%m'))
at <- Tag.data$dt %in% locs
at <- at & format(Tag.data$dt, '%m') %in% c('01', '02', '03','04', '05', '06','07', '08', '09','10', '11', '12') & format(Tag.data$dt, '%d') == '01'|
format(Tag.data$dt, '%m') %in% c('01', '02', '03','04', '05', '06','07', '08', '09','10', '11', '12') & format(Tag.data$dt, '%d') == '15'
While this did identify both the 1st and the 15th days of each month, for some reason it identifies only one record for the 1st day of the month but every record for the 15th day of the month (of which there are a great many). I would like to identify only the first record for both the 1st and 15th days of each month. Any help would be much appreciated.
Judging from your code:
locs <- tapply (X=Tag.data$dt, FUN=min, INDEX=format(Tag.data$dt, '%Y%m'))
I assume Tag.data$dt is stored as one of the POSIX classes.
You wrote: "I would like to identify only the first record for both the 1st and 15th days of each month."
This is probably slow, but it does the job.
ymd <- format(Tag.data$dt,"%Y%m%d")
index.01.15 <- !duplicated(ymd) & grepl("01$|15$", ymd)
You can use the logical vector to select the rows Tag.data[index.01.15, ]
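As a quick check of the approach above, here it is applied to a handful of timestamps. The first five are taken from the question; the 2013-11-01 row is made up for illustration, and the UTC time zone is an assumption.

```r
# Small sample; the last timestamp is hypothetical
Tag.data <- data.frame(dt = as.POSIXct(c(
  "2013-09-30 23:59:30", "2013-10-01 00:00:00", "2013-10-01 00:00:30",
  "2013-10-15 05:00:00", "2013-10-15 05:00:30", "2013-11-01 00:00:00"
), tz = "UTC"))

# One key per calendar day, e.g. "20131015"
ymd <- format(Tag.data$dt, "%Y%m%d")

# TRUE only for the first row of a day whose key ends in 01 or 15
index.01.15 <- !duplicated(ymd) & grepl("(01|15)$", ymd)

first_rows <- Tag.data[index.01.15, , drop = FALSE]
```

Like the original, this relies on the rows being in chronological order, because duplicated() keeps whichever row of each day appears first.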
Try this. It makes use of lubridate. You can select all rows where the day is either 1 or 15.
library(lubridate)
options(stringsAsFactors=FALSE)
Tag.data = structure(list(dt = c("30/09/2013 23:59", "1/10/2013 0:00", "1/10/2013 0:00",
"1/10/2013 0:01", "1/10/2013 0:01", "1/10/2013 0:02", "2/10/2013 0:04",
"15/10/2013 5:00", "15/10/2013 5:00", "15/10/2013 5:01", "15/10/2013 5:01",
"15/10/2013 5:02")), .Names = "dt", class = "data.frame", row.names = c(NA,
-12L))
Tag.data$dt = parse_date_time(Tag.data$dt, '%d/%m/%Y %H%M')
at = Tag.data[day(Tag.data$dt) %in% c(1,15), ]
This is more flexible, as you can specify any day you wish to subset on. For example, replace the values in c(1,15) with any other days, or use month(Tag.data$dt) %in% c(<INSERT MONTH NUMBER>) to subset on month.
It looks like your data are already stored as dates of some sort (e.g., POSIXct). Something like this, but with even more rows?
Tag.data <- data.frame(dt=seq(ISOdate(2013,10,1), by = "30 min", length.out = 10000))
Then if you want just the first record from each 1st or 15th day, this might work:
daychars <- format(Tag.data$dt, '%d')
day1or15 <- daychars %in% c("01","15")
newday <- c(TRUE, (daychars[1:(length(daychars)-1)] != daychars[2:length(daychars)]))
format(Tag.data[day1or15 & newday,"dt"],"%m/%d/%Y %H:%M:%S")
The newday line helpfully does not require that the day begins at any particular time, but it does assume that your time series is ordered.
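A side note on why the attempt in the question matched every record on the 15th: in R, & binds more tightly than |, so a condition of the form a & b | c is read as (a & b) | c, and the first-record restriction never reaches the 15th. The sketch below illustrates this with hypothetical stand-in vectors on four of the question's timestamps (UTC is an assumption).

```r
dt <- as.POSIXct(c("2013-10-01 00:00:00", "2013-10-01 00:00:30",
                   "2013-10-15 05:00:00", "2013-10-15 05:00:30"), tz = "UTC")
day <- format(dt, "%d")

# First record of each calendar day (grouped by day, not by month)
first_of_day <- !duplicated(format(dt, "%Y%m%d"))

# Parsed as (first_of_day & day == "01") | (day == "15")
unparenthesised <- first_of_day & day == "01" | day == "15"

# Parentheses apply the first-record restriction to both days
parenthesised <- first_of_day & (day == "01" | day == "15")
```

Note that parentheses alone would not rescue the original code, since its locs vector holds the first record of each month rather than of each day; grouping by day, as first_of_day does here, is also needed.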
I suggest you use the excellent xts package for time series data in R.
You didn't provide reproducible data, so I made some of my own.
require(xts)
Tag.data <- xts(rnorm(1e5), order.by = Sys.time() + seq(30, 3e6, 30))
Sub-setting by day of the month is a simple one-liner.
days_1n15 <- Tag.data[.indexmday(Tag.data) %in% c(1, 15)]
This returns all records on the 1st and 15th day of any month.
Now we just need to pull out the first observations on each matching day.
firstOf <- do.call(rbind, lapply(split(days_1n15, 'days'), first))
Which contains the data you want:
R> firstOf
[,1]
2014-02-01 21:29:01 1.284222
2014-02-15 00:00:01 -1.262235
2014-03-01 00:00:01 -0.465001
