Subsetting a data frame according to a factor date - R

I have a data frame (df) where one of the columns holds dates. However, that column's type is factor:
> head(df$date)
[1] 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01 2011-01-01
1519 Levels: 2010-11-27 2010-11-28 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 2010-12-04 ... 2015-02-07
I want to subset this data frame by date. For example, I want to create a second data frame (df2) that is the subset of df whose dates are earlier than 2014-03-30.
How can I do that in R? I would be very glad for any help. Thanks a lot.

You could begin exploring the lubridate library. It makes working with dates very simple.
df <- data.frame(date = c("2013-01-01", "2014-04-01", "2014-01-01",
                          "2011-06-01", "2012-03-01", "2014-08-01"))
df
date
1 2013-01-01
2 2014-04-01
3 2014-01-01
4 2011-06-01
5 2012-03-01
6 2014-08-01
library(lubridate)
# ymd - year-month-day
df$date <- ymd(df$date)
with(df, df[date < ymd("2014-03-30"),])
[1] "2013-01-01 UTC" "2014-01-01 UTC" "2011-06-01 UTC" "2012-03-01 UTC"

Related

R: Find missing timestamps in csv

As I failed to solve my problem with PHP/MySQL or Excel due to the data size, I'm now taking my very first steps with R and struggling a bit. The problem is this: I have a second-by-second CSV file with half a year of data that looks like this:
metering,timestamp
123,2016-01-01 00:00:00
345,2016-01-01 00:00:01
243,2016-01-01 00:00:02
101,2016-01-01 00:00:04
134,2016-01-01 00:00:06
As you see, there are some seconds missing every once in a while (don't ask me why the values are written before the timestamps, but that's how I received the data…). Now I want to calculate the number of values (= seconds) that are missing.
So my idea was
to create a vector that is correct (includes all sec-by-sec timestamps),
match the given CSV file with that new vector, and
sum up all the timestamps with no value.
I managed to make step 1 happen with the following code:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
write.csv(RegularTimeSeries, file = "RegularTimeSeries.csv")
To get an idea of what I did, I also exported the vector to a CSV that looks like this:
"1",2016-01-01 00:00:00
"2",2016-01-01 00:00:01
"3",2016-01-01 00:00:02
"4",2016-01-01 00:00:03
"5",2016-01-01 00:00:04
"6",2016-01-01 00:00:05
"7",2016-01-01 00:00:06
Unfortunately I have no idea how to go on with steps 2 and 3. I found some very similar examples (http://www.r-bloggers.com/fix-missing-dates-with-r/, R: Insert rows for missing dates/times), but as a total R noob I struggled to translate these examples to my sec-by-sec data.
Some hints for a greenhorn would be very, very helpful – thank you very much in advance :)
In the tidyverse,
library(dplyr)
library(tidyr)
# parse the timestamps into POSIXct
df %>% mutate(timestamp = as.POSIXct(timestamp)) %>%
  # complete to the full second-by-second sequence from min to max
  complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))
## # A tibble: 7 x 2
## timestamp metering
## <time> <int>
## 1 2016-01-01 00:00:00 123
## 2 2016-01-01 00:00:01 345
## 3 2016-01-01 00:00:02 243
## 4 2016-01-01 00:00:03 NA
## 5 2016-01-01 00:00:04 101
## 6 2016-01-01 00:00:05 NA
## 7 2016-01-01 00:00:06 134
If you want the number of NAs (i.e. the number of seconds with no data), add on
%>% tally(is.na(metering))
## # A tibble: 1 x 1
## n
## <int>
## 1 2
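Putting the pieces together as a minimal end-to-end sketch, assuming the data sit in a file called metering.csv (a hypothetical name) with the two columns shown in the question:
library(dplyr)
library(tidyr)

df <- read.csv("metering.csv", stringsAsFactors = FALSE)  # placeholder file name

df %>%
  mutate(timestamp = as.POSIXct(timestamp, tz = "UTC")) %>%
  complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "sec")) %>%
  tally(is.na(metering))   # number of seconds with no reading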
You can check which values of your RegularTimeSeries are in your broken time series using which and %in%. First create BrokenTimeSeries from your example:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
BrokenTimeSeries <- RegularTimeSeries[-c(3,6,9)] # remove some seconds
This will give you the indices of values within RegularTimeSeries that are not in BrokenTimeSeries:
> which(!(RegularTimeSeries %in% BrokenTimeSeries))
[1] 3 6 9
This will return the actual values:
> RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))]
[1] "2016-01-01 00:00:02 UTC" "2016-01-01 00:00:05 UTC" "2016-01-01 00:00:08 UTC"
Maybe I'm misunderstanding your problem, but you can count the number of missing seconds simply by subtracting the length of your broken time series from that of RegularTimeSeries, or by taking the length of either of the two vectors above.
> length(RegularTimeSeries) - length(BrokenTimeSeries)
[1] 3
> length(which(!(RegularTimeSeries %in% BrokenTimeSeries)))
[1] 3
> length(RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))])
[1] 3
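The same idea applied to the asker's own data, as a sketch (again assuming a hypothetical metering.csv holding the columns shown in the question):
dat <- read.csv("metering.csv", stringsAsFactors = FALSE)
obs <- as.POSIXct(dat$timestamp, tz = "UTC")

full <- seq(min(obs), max(obs), by = "1 sec")   # the complete second-by-second sequence
missing <- full[!(full %in% obs)]               # timestamps with no reading
length(missing)                                 # number of missing seconds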
If you want to merge the files together to see the missing values you can do something like this:
# data frame with the regular time series and a "stepped" copy of the broken one
df <- data.frame(RegularTimeSeries)
df$BrokenTimeSeries[RegularTimeSeries %in% BrokenTimeSeries] <- df$RegularTimeSeries
# the assignment drops the POSIXct class, so convert back (the values are seconds since 1970-01-01)
df$BrokenTimeSeries <- as.POSIXct(df$BrokenTimeSeries, origin = "1970-01-01", tz = "UTC")
resulting in:
> df[1:12,]
RegularTimeSeries BrokenTimeSeries
1 2016-01-01 00:00:00 2016-01-01 00:00:00
2 2016-01-01 00:00:01 2016-01-01 00:00:01
3 2016-01-01 00:00:02 <NA>
4 2016-01-01 00:00:03 2016-01-01 00:00:02
5 2016-01-01 00:00:04 2016-01-01 00:00:03
6 2016-01-01 00:00:05 <NA>
7 2016-01-01 00:00:06 2016-01-01 00:00:04
8 2016-01-01 00:00:07 2016-01-01 00:00:05
9 2016-01-01 00:00:08 <NA>
10 2016-01-01 00:00:09 2016-01-01 00:00:06
11 2016-01-01 00:00:10 2016-01-01 00:00:07
12 2016-01-01 00:00:11 2016-01-01 00:00:08
If all you want is the number of missing seconds, it can be done much more simply: find the number of seconds in your time range, then subtract the number of rows in your dataset. In R this could be done along these lines:
n.seconds <- difftime("2016-06-01 00:00:00", "2016-01-01 00:00:00", units="secs")
n.rows <- nrow(my.data.frame)
n.missing.values <- n.seconds - n.rows
Adjust the time range and the column/data frame names to match your data.
Hope it helps.
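If you would rather derive the bounds from the data instead of hard-coding them, a small variation on the same idea (assuming the parsed timestamps live in a column called timestamp):
ts <- as.POSIXct(my.data.frame$timestamp, tz = "UTC")
# seconds spanned from the first to the last observation, inclusive of both ends
n.seconds <- as.numeric(difftime(max(ts), min(ts), units = "secs")) + 1
n.missing.values <- n.seconds - nrow(my.data.frame)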
Another option is to loop over consecutive timestamps and record every timestamp that follows a gap of more than one second:
d <- c("2016-01-01 00:00:01",
"2016-01-01 00:00:02",
"2016-01-01 00:00:03",
"2016-01-01 00:00:04",
"2016-01-01 00:00:05",
"2016-01-01 00:00:06",
"2016-01-01 00:00:10",
"2016-01-01 00:00:12",
"2016-01-01 00:00:14",
"2016-01-01 00:00:16",
"2016-01-01 00:00:18",
"2016-01-01 00:00:20",
"2016-01-01 00:00:22"))
d <- as.POSIXct(d)
# pre-allocate a POSIXct vector of the same length, filled with NA
gaps <- d
gaps[] <- NA
# keep every timestamp that is more than one second after its predecessor
for (i in 2:length(d)) {
  if (difftime(d[i - 1], d[i], units = "secs") < -1) {
    gaps[i] <- d[i]
  }
}
gaps
 [1] NA                        NA                        NA
 [4] NA                        NA                        NA
 [7] "2016-01-01 00:00:10 EST" "2016-01-01 00:00:12 EST" "2016-01-01 00:00:14 EST"
[10] "2016-01-01 00:00:16 EST" "2016-01-01 00:00:18 EST" "2016-01-01 00:00:20 EST"
[13] "2016-01-01 00:00:22 EST"
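A vectorized sketch of the same idea, without the loop (not part of the original answer):
gap_sec <- diff(as.numeric(d))   # seconds between consecutive timestamps
d[which(gap_sec > 1) + 1]        # timestamps that directly follow a gap
sum(gap_sec - 1)                 # total number of seconds missing inside the gaps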

Associate numbers with datetimes/timestamps

I have a data frame df with a certain number of columns. One of them, ts, contains timestamps:
1462147403122 1462147412990 1462147388224 1462147415651 1462147397069 1462147392497
...
1463529545634 1463529558639 1463529556798 1463529558788 1463529564627 1463529557370.
I have also at my disposal the corresponding datetime in the datetime column:
"2016-05-02 02:03:23 CEST" "2016-05-02 02:03:32 CEST" "2016-05-02 02:03:08 CEST" "2016-05-02 02:03:35 CEST" "2016-05-02 02:03:17 CEST" "2016-05-02 02:03:12 CEST"
...
"2016-05-18 01:59:05 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:16 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:24 CEST" "2016-05-18 01:59:17 CEST"
As you can see, my data frame contains data across several days. Let's say there are 3. I would like to add a column containing the number 1, 2, or 3: 1 if the row belongs to the first day, 2 for the second day, and so on.
Thank you very much in advance,
Clement
One way to do this is to keep track of total days elapsed each time the date changes, as demonstrated below.
# Fake data
library(lubridate)  # for tz() and tz<-()
dat = data.frame(datetime = c(seq(as.POSIXct("2016-05-02 01:03:11"),
                                  as.POSIXct("2016-05-05 01:03:11"), length.out=6),
                              seq(as.POSIXct("2016-05-09 01:09:11"),
                                  as.POSIXct("2016-05-16 02:03:11"), length.out=4)))
tz(dat$datetime) = "UTC"
Note, if your datetime column is not already in a datetime format, convert it to one using as.POSIXct.
Now, create a new column with the day number, counting the first day in the sequence as day 1.
dat$day = c(1, cumsum(as.numeric(diff(as.Date(dat$datetime, tz="UTC")))) + 1)
dat
datetime day
1 2016-05-02 01:03:11 1
2 2016-05-02 15:27:11 1
3 2016-05-03 05:51:11 2
4 2016-05-03 20:15:11 2
5 2016-05-04 10:39:11 3
6 2016-05-05 01:03:11 4
7 2016-05-09 01:09:11 8
8 2016-05-11 09:27:11 10
9 2016-05-13 17:45:11 12
10 2016-05-16 02:03:11 15
I specified the timezone in the code above to avoid getting tripped up by potential silent shifts between my local timezone and UTC. For example, note the silent shift from my default local time zone ("America/Los_Angeles") to UTC when converting a POSIXct datetime to a date:
# Fake data
datetime = seq(as.POSIXct("2016-05-02 01:03:11"), as.POSIXct("2016-05-05 01:03:11"), length.out=6)
tz(datetime)
[1] ""
date = as.Date(datetime)
tz(date)
[1] "UTC"
data.frame(datetime, date)
datetime date
1 2016-05-02 01:03:11 2016-05-02
2 2016-05-02 15:27:11 2016-05-02
3 2016-05-03 05:51:11 2016-05-03
4 2016-05-03 20:15:11 2016-05-04 # Note day is different due to timezone shift
5 2016-05-04 10:39:11 2016-05-04
6 2016-05-05 01:03:11 2016-05-05
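If you instead want consecutive labels 1, 2, 3, … for each distinct day present in the data (ignoring calendar gaps), a small sketch, not taken from the answer above, is to match each date against the sorted unique dates:
dates <- as.Date(dat$datetime, tz = "UTC")
# 1 for the earliest date present, 2 for the next distinct date, and so on
dat$day <- match(dates, sort(unique(dates)))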

Create a dataframe with columns of given Date and Time

I would like to create a data frame whose first column is Date and second column is Time. The time should increase in 30-minute intervals, with the date advancing accordingly. Later I will add other columns manually.
> df
Date Time
2012-01-01 00:00:00
2012-01-01 00:30:00
2012-01-01 01:00:00
2012-01-01 01:30:00
.......... ........
.......... ........
and so on...
EDIT
This can also be done another way:
create a single column with the given date and time, then split it later using tidyr or any other package.
> df
DateTime
2012-01-01 00:00:00
2012-01-01 00:30:00
2012-01-01 01:00:00
2012-01-01 01:30:00
..........
..........
and so on...
Any help will be appreciated. Thank you in advance.
You can generate a sequence using seq, specifying the start and end dates and the time interval:
df <- data.frame(DateTime = seq(as.POSIXct("2012-01-01"),
                                as.POSIXct("2012-02-01"),
                                by = 30 * 60))
head(df)
DateTime
1 2012-01-01 00:00:00
2 2012-01-01 00:30:00
3 2012-01-01 01:00:00
4 2012-01-01 01:30:00
5 2012-01-01 02:00:00
6 2012-01-01 02:30:00
And to get them into two separate columns we can use strftime (see ?strftime):
date_seq <- seq(as.POSIXct("2012-01-01"),
                as.POSIXct("2012-02-01"),
                by = 30 * 60)
df <- data.frame(Date = strftime(date_seq, format = "%Y-%m-%d"),
                 Time = strftime(date_seq, format = "%H:%M:%S"))
        Date     Time
1 2012-01-01 00:00:00
2 2012-01-01 00:30:00
3 2012-01-01 01:00:00
4 2012-01-01 01:30:00
5 2012-01-01 02:00:00
6 2012-01-01 02:30:00
Update
You can include the time part of the POSIXct datetime too. This will give you finer control over your upper & lower bounds:
date_seq <- seq(as.POSIXct("2012-01-01 00:00:00"),
                as.POSIXct("2012-02-02 23:30:00"),
                by = 30 * 60)
df <- data.frame(Date = strftime(date_seq, format = "%Y-%m-%d"),
                 Time = strftime(date_seq, format = "%H:%M:%S"))
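Following the edit in the question, a sketch of the alternative route: keep a single DateTime column and split it afterwards with tidyr::separate().
library(tidyr)

df <- data.frame(DateTime = seq(as.POSIXct("2012-01-01 00:00:00"),
                                as.POSIXct("2012-02-02 23:30:00"),
                                by = 30 * 60))

# separate() splits character columns, so format the datetimes first
df$DateTime <- format(df$DateTime, "%Y-%m-%d %H:%M:%S")
df <- separate(df, DateTime, into = c("Date", "Time"), sep = " ")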

R: add specific (different) amounts of time to an entire column

I have a table in R like:
start duration
02/01/2012 20:00:00 5
05/01/2012 07:00:00 6
etc... etc...
I got to this by importing a table from Microsoft Excel that looked like this:
date time duration
2012/02/01 20:00:00 5
etc...
I then merged the date and time columns by running the following code:
d.f <- within(d.f, { start=format(as.POSIXct(paste(date, time)), "%m/%d/%Y %H:%M:%S") })
I want to create a third column called 'end', calculated as the start time plus the given number of hours. I am pretty sure that start is a POSIXct vector. I have seen how to manipulate a single datetime object, but how can I do that for the entire column?
The expected result should look like:
start duration end
02/01/2012 20:00:00 5 02/02/2012 01:00:00
05/01/2012 07:00:00 6 05/01/2012 13:00:00
etc... etc... etc...
Using lubridate
> library(lubridate)
> df$start <- mdy_hms(df$start)
> df$end <- df$start + hours(df$duration)
> df
# start duration end
#1 2012-02-01 20:00:00 5 2012-02-02 01:00:00
#2 2012-05-01 07:00:00 6 2012-05-01 13:00:00
data
df <- structure(list(start = c("02/01/2012 20:00:00", "05/01/2012 07:00:00"
), duration = 5:6), .Names = c("start", "duration"), class = "data.frame", row.names = c(NA,
-2L))
You can simply add duration*3600 (hours times 3600 seconds) to the start column of the data frame, since POSIXct arithmetic works in seconds. E.g. with one date:
start = as.POSIXct("02/01/2012 20:00:00",format="%m/%d/%Y %H:%M:%S")
start
[1] "2012-02-01 20:00:00 CST"
start + 5*3600
[1] "2012-02-02 01:00:00 CST"

Parsing dates in multiple formats in R using lubridate

I have data with dates in MM/DD/YY HH:MM format and others in plain MM/DD/YY format. I want to parse all of them into the same format, e.g. "2010-12-01 12:12 EST". How should I go about doing that? I tried the following ifelse statement; it gave me a bunch of long integers and told me that a large number of my data points failed to parse:
df_prime$date <- ifelse(!is.na(mdy_hm(df$date)), mdy_hm(df$date), mdy(df$date))
df_prime is a duplicate of the data frame df that I initially loaded in
IEN date admission_number KEY_PTF_45 admission_from discharge_to
1 12 3/3/07 18:05 1 252186 OTHER DIRECT
2 12 3/9/07 12:10 1 252186 RETURN TO COMMUNITY- INDEPENDENT
3 12 3/10/07 15:08 2 252382 OUTPATIENT TREATMENT
4 12 3/14/07 10:26 2 252382 RETURN TO COMMUNITY-INDEPENDENT
5 12 4/24/07 19:45 3 254343 OTHER DIRECT
6 12 4/28/07 11:45 3 254343 RETURN TO COMMUNITY-INDEPENDENT
...
1046334 23613488506 2/25/14 NA NA
1046335 23613488506 2/25/14 11:27 NA NA
1046336 23613488506 2/28/14 NA NA
1046337 23613488506 3/4/14 NA NA
1046338 23613488506 3/10/14 11:30 NA NA
1046339 23613488506 3/10/14 12:32 NA NA
Sorry if some of the formatting isn't right, but the date column is the most important one.
EDIT: Below is some code for a portion of my data frame via a dput command:
structure(list(IEN = c(23613488506, 23613488506, 23613488506, 23613488506, 23613488506, 23613488506), date = c("2/25/14", "2/25/14 11:27", "2/28/14", "3/4/14", "3/10/14 11:30", "3/10/14 12:32")), .Names = c("IEN", "date"), row.names = 1046334:1046339, class = "data.frame")
Have you tried the function guess_formats() in the lubridate package?
A reproducible example to build a dataframe like yours could be helpful!
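For instance, a quick sketch on the two date styles from the question:
library(lubridate)
# guess_formats() reports which of the candidate orders each string matches
guess_formats(c("2/25/14", "2/25/14 11:27"), orders = c("mdy", "mdy HM"))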
The lubridate package's mdy_hm has a truncated parameter that lets you supply dates that might not have all the bits. For your example:
> mdy_hm(d$date,truncated=2)
[1] "2014-02-25 00:00:00 UTC" "2014-02-25 11:27:00 UTC"
[3] "2014-02-28 00:00:00 UTC" "2014-03-04 00:00:00 UTC"
[5] "2014-03-10 11:30:00 UTC" "2014-03-10 12:32:00 UTC"
