I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have four columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below:
timestamp tr tt sr st
1 9/1/01 0:00 1.018269e+02 -312.8622 -1959.393 4959.828
2 9/1/01 0:01 1.023567e+02 -313.0002 -1957.755 4958.935
3 9/1/01 0:02 1.018857e+02 -313.9406 -1956.799 4959.938
4 9/1/01 0:03 1.025463e+02 -310.9261 -1957.347 4961.095
5 9/1/01 0:04 1.010228e+02 -311.5469 -1957.786 4959.078
The problem I have is that some timestamp values are missing - e.g. there may be a gap between 9/1/01 0:13 and 9/1/01 0:27 - and such gaps occur irregularly throughout the data set. I need to put several of these series into the same database, and because the missing values differ for each series, the dates do not currently align on each row.
I would like to generate rows for these missing timestamps and fill the Y columns with blank values (no data, not zero), so that I have a continuous time series.
I'm honestly not quite sure where to start (not really used R before so learning as I go along!) but any help would be much appreciated. I have thus far installed chron and zoo, since it seems they might be useful.
Thanks!
This is an old question, but I just wanted to post a dplyr way of handling this, as I came across this post while searching for an answer to a similar problem. I find it more intuitive and easier on the eyes than the zoo approach.
library(dplyr)
# note: the format argument must be named, otherwise the second positional
# argument of as.POSIXct() is taken as the time zone
ts <- seq.POSIXt(as.POSIXct("2001-09-01 0:00", format = "%Y-%m-%d %H:%M"),
                 as.POSIXct("2001-09-01 0:07", format = "%Y-%m-%d %H:%M"), by = "min")
# equivalently, let R parse the ISO-like strings directly
ts <- seq.POSIXt(as.POSIXlt("2001-09-01 0:00"), as.POSIXlt("2001-09-01 0:07"), by = "min")
ts <- format.POSIXct(ts, '%m/%d/%y %H:%M')  # back to the timestamp format of the original data
df <- data.frame(timestamp = ts)
data_with_missing_times <- full_join(df, original_data)  # joins on timestamp, missing rows become NA
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 NA NA NA NA
7 09/01/01 00:06 NA NA NA NA
8 09/01/01 00:07 NA NA NA NA
Staying in dplyr also makes it easy to change all those missing values to something else, which came in handy for me when plotting with ggplot.
data_with_missing_times %>% group_by(timestamp) %>% mutate_each(funs(ifelse(is.na(.),0,.)))
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 0 0 0 0
7 09/01/01 00:06 0 0 0 0
8 09/01/01 00:07 0 0 0 0
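mutate_each() and funs() have since been deprecated in dplyr. A minimal sketch of the same NA-to-zero replacement with the current across() API, assuming the data_with_missing_times frame from above:
data_with_missing_times %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))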
I think the easiest thing is to set the date first as already described, convert to zoo, and then just do a merge:
library(zoo)
df$timestamp <- as.POSIXct(df$timestamp, format = "%m/%d/%y %H:%M")
df1.zoo <- zoo(df[,-1], df[,1])  # set date to index
df2 <- merge(df1.zoo, zoo(, seq(start(df1.zoo), end(df1.zoo), by = "min")), all = TRUE)
start() and end() come from df1.zoo (your original data), and you set by - e.g. "min" - as your data requires. all=TRUE fills the Y columns with NA at every missing timestamp.
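If you then need a plain data frame again rather than a zoo object, a minimal sketch (using the df2 object created above):
df_filled <- fortify.zoo(df2)  # the zoo index becomes an 'Index' column
names(df_filled)[1] <- "timestamp"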
Date padding is implemented in the padr package. If your data frame has its date-time variable stored as POSIXct or POSIXlt, all you need to do is:
library(padr)
pad(df_name)
See vignette("padr") or this blog post for its working.
I think this can be accomplished with complete() from the tidyr package. Note that we complete on timestamp alone; listing the value columns inside complete() would expand the data to every combination of their values.
library(tidyverse)
df <- df %>%
  complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "min"))
You can also supply an explicit start date and end date instead of using min(timestamp) and max(timestamp).
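A minimal sketch with explicit bounds (the dates here are illustrative, covering the question's first day):
df <- df %>%
  complete(timestamp = seq.POSIXt(as.POSIXct("2001-09-01 00:00"),
                                  as.POSIXct("2001-09-01 23:59"), by = "min"))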
# some made-up data
originaldf <- data.frame(timestamp=c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:04"),
tr = rnorm(4,0,1),
tt = rnorm(4,0,1))
originaldf$minAsPOSIX <- as.POSIXct(originaldf$timestamp, format="%m/%d/%y %H:%M", tz="GMT")
# Generate vector of all minutes
ndays <- 1 # number of days to generate
minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60)  # 243 days after the 2001-01-01 origin below = 2001-09-01
# convert those minutes to POSIX
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="GMT")
# new df
newdf <- merge(data.frame(minAsPOSIX),originaldf,all.x=TRUE, by="minAsPOSIX")
In case you want to substitute the NA values acquired by any method mentioned above with zeroes, you can do this:
df[is.na(df)] <- 0
(I originally wanted to comment this on Ibollar's answer but I lack the necessary reputation, so I posted it as an answer.)
library(zoo)
df1.zoo <- zoo(df1[,-1], as.POSIXlt(df1[,1], format = "%Y-%m-%d %H:%M:%S"))  # set date to index; column 1 is the timestamp and is named "TS"
full.frame.zoo <- zoo(NA, seq(start(df1.zoo), end(df1.zoo), by = "min"))  # zoo object spanning every minute
full.frame.df <- data.frame(TS = as.POSIXlt(index(full.frame.zoo), format = "%Y-%m-%d %H:%M:%S"))  # convert zoo object to data frame
full.vancouver <- merge(full.frame.df, df1, all = TRUE)  # merge on the TS column
I was looking for something similar where instead of filling out missing timestamps my data was in months and days. So I wanted to generate a sequence of months that would cater for leap years et cetera. I used lubridate:
library(lubridate)
date <- df$timestamp[1]
date_list <- c(date)
while (date < df$timestamp[nrow(df)]){
  date <- date %m+% months(1)  # %m+% handles month arithmetic safely (month ends, leap years)
  date_list <- c(date_list, date)
}
date_list <- format(as.Date(date_list),"%Y-%m-%d")
df_1 <- data.frame(months=date_list, stringsAsFactors = F)
This gives me a sequence of dates in monthly increments. Then I join (full_join() is from dplyr):
library(dplyr)
df_with_missing_months <- full_join(df_1, df)
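A simpler variation, assuming df$timestamp already holds Date values: seq() with by = "month" builds the same monthly sequence directly and also accounts for leap years.
date_list <- seq(df$timestamp[1], df$timestamp[nrow(df)], by = "month")
One difference worth knowing: for start dates near the end of a month, seq() rolls overflowing dates forward (e.g. Jan 31 plus one month becomes Mar 2 or Mar 3), whereas %m+% rolls them back to the last day of February.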
There have been some advances in handling time series data in R; e.g. the tsibble package adds such time series manipulations in a tidy way:
library(tsibble)
library(lubridate)
ts <- lubridate::mdy_hm(c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:27"))  # timestamps are month/day/year
originaldf <- tsibble(timestamp = ts,
tr = rnorm(4,0,1),
tt = rnorm(4,0,1),
index = timestamp)
originaldf %>%
fill_gaps()
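fill_gaps() can also fill the new rows with a value instead of NA; a minimal sketch (the zeroes here are illustrative):
originaldf %>%
  fill_gaps(tr = 0, tt = 0)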
Related
I have a csv file that is written like this
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50
I'd like R to produce something like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980
1/7/1980 30
Then I would like R to bring the last observation forward like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980 25
1/7/1980 30
I'd like two separate data.tables created one with just the actual data, then another with the last observation brought forward.
Thanks for all the help!
Edit: I will also need any NAs that are populated to be changed to 0.
You could also use tidyverse:
library(tidyverse)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data) %>%
replace(., is.na(.), 0)
First 10 rows:
# A tibble: 104 x 2
Date Data
<date> <dbl>
1 1980-01-01 0
2 1980-01-02 0
3 1980-01-03 0
4 1980-01-04 0
5 1980-01-05 25
6 1980-01-06 25
7 1980-01-07 30
8 1980-01-08 30
9 1980-01-09 30
10 1980-01-10 30
As the starting point I've used the first day of the month of the minimum date, and as the end point the maximum date; this can of course be adjusted as needed.
EDIT: #Sotos has an even better suggestion for a more concise approach (better use of the format argument):
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data)
The solution is:
create a data.frame of successive dates
merge it with your original data.frame
use the na.locf function from zoo to carry your data forward
Here is the code. I use lubridate to work with dates, and zoo for as.yearmon and na.locf.
library(lubridate)
library(zoo)
df$Date <- mdy(df$Date)
successive <- data.frame(Date = seq(as.Date(as.yearmon(df$Date[1])), df$Date[length(df$Date)], by = "days"))
successive is the vector of successive dates. Now the merging:
result <- merge(df, successive, all.y = TRUE, by = "Date")
And the forward propagation with na.locf() (zoo is already loaded above):
result$Data <- na.locf(result$Data, na.rm = FALSE)
Date Data
1 1980-01-05 25
2 1980-01-06 25
3 1980-01-07 30
4 1980-01-08 30
5 1980-01-09 30
6 1980-01-10 30
7 1980-01-11 30
8 1980-01-12 30
9 1980-01-13 30
10 1980-01-14 30
11 1980-01-15 30
12 1980-01-16 30
13 1980-01-17 30
14 1980-01-18 30
15 1980-01-19 30
16 1980-01-20 30
17 1980-01-21 30
18 1980-01-22 30
19 1980-01-23 30
20 1980-01-24 30
21 1980-01-25 30
The data:
df <- read.table(text = "Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50", header = T)
Assuming that the result should start at the first of the month of the first date and end at the last date, and that the input data frame is the DF shown reproducibly in the Note at the end: convert DF to a zoo object z, create a grid of dates g, and merge them to give zoo objects z0 (with zero filling) and zz (with na.locf filling). Optionally convert back to data frames, or just leave them as zoo objects so you can use zoo for further processing.
library(zoo)
z <- read.zoo(DF, header = TRUE, format = "%m/%d/%Y")
g <- seq(as.Date(as.yearmon(start(z))), end(z), "day")
z0 <- merge(z, zoo(, g), fill = 0) # zero filled
zz <- na.locf0(merge(z, zoo(, g))) # na.locf filled
# optional
DF0 <- fortify.zoo(z0) # zero filled
DF2 <- fortify.zoo(zz) # na.locf filled
data.table
The question mentions data tables and if that refers to the data.table package then add:
library(data.table)
DT0 <- data.table(DF0) # zero filled
DT2 <- data.table(DF2) # na.locf filled
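As a minor variation, data.table::setDT() converts the same data frames in place, without making a copy:
library(data.table)
setDT(DF0)  # DF0 is now a data.table (modified by reference)
setDT(DF2)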
Variations
It wasn't clear to me whether the question was asking for both a zero filled answer and an na.locf filled answer, or just an na.locf filled answer whose remaining NA values are then zero filled; I assumed the former. Some variations:
1. If you want to fill the NAs that are left in the na.locf filled answer, then add:
zz[is.na(zz)] <- 0
2. If you want to end at the end of the last month rather than at the last date, replace end(z) with as.Date(as.yearmon(end(z)), frac = 1).
3. If you want to start at the first date rather than the first of the month of the first date, replace as.Date(as.yearmon(start(z))) with start(z).
As an alternative to (3), to start at the first date and end at the last date we could simply convert to ts and back. Note that we need to restore Date class on the second line below since ts class cannot handle Date class directly.
z2.na <- as.zoo(as.ts(z))
time(z2.na) <- as.Date(time(z2.na))
zz20 <- replace(z2.na, is.na(z2.na), 0) # zero filled
zz2 <- na.locf0(z2.na) # na.locf filled
Note
Lines <- "
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50"
DF <- read.table(text = Lines, header = TRUE)
I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series observing the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one for each of the 48 months) represent the time to maturity for each loan as of that month.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07.03.2003 to 31.12.2016.
I need to split/filter the data into multiple date ranges, as per below.
Dates required in the new data frame:
07.03.2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012.
Let's take the following data.frame with dates:
library(lubridate)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd and dplyr::between:
library(dplyr)
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
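To keep both of the question's ranges in a single data frame while dropping everything in between, a minimal sketch combining two between() conditions with | (the dates are taken from the question):
df_keep <- filter(df, between(date, ymd("2003-03-07"), ymd("2005-03-06")) |
                      between(date, ymd("2013-01-01"), ymd("2016-12-31")))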
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Create a data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N),
                    numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the data before and after it is applied:
plot(my_dt)  # before
my_dt <- my_dt[!(date_sample %within% forbidden_dates)]  # apply the temporal cut
plot(my_dt)  # after
I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out what days have less than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This ends up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
This is great, but I cannot figure out how to use it to remove the corresponding entries from the original 'dat', because datecount is a summarized version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
And I will utilize the column titled "extra". I would like to create a new column that sums the values of this "extra" column by the simplified strptime dates, with the summed value applied to every entry from that date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df){  # keep only days (groups) with at least 3 rows
if (nrow(df)>=3) {
return(df)
} else {
return(NULL)
}
})
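If you specifically want the extra_sum column described in the question, a minimal base-R sketch using ave(), assuming dat2 built as above with extra equal to 1 on every row:
dat2$extra_sum <- ave(dat2$extra, as.character(dat2$date), FUN = sum)  # per-day total, repeated on every row of that day
dat3 <- dat2[dat2$extra_sum >= 3, ]  # keep days with 3 or more entries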
I recommend using the data.table package
library(data.table)
dat <- data.table(dat)
dat$Date <- as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
dat_sum <- dat[, .N, by = Date]        # number of entries per day
dat_3plus <- dat_sum[N >= 3]           # days with at least 3 entries
dat <- dat[Date %in% dat_3plus$Date]   # keep only those days
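The last three steps can also be written as a single grouped data.table expression; a minimal sketch, assuming the Date column has been added as above:
dat_filtered <- dat[, if (.N >= 3) .SD, by = Date]  # keep groups (days) with at least 3 rows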