Combining date and time into a Date column for plotting - r

I want to create a line plot. I have 3 columns in my data frame:
date time numbers
01-02-2010 14:57 5
01-02-2010 23:23 7
02-02-2010 05:05 3
02-02-2010 10:23 11
How can I combine the first two columns and make a plot based on date and time ?
Date is Date class, time is just a char variable.

The lubridate package is another option. It handles most of the fussy formatting details, so it can be easier to use than base R date functions. For example, in your case, mdy_hm (month-day-year_hour-minute) will convert your date and time variables into a single POSIXct date-time column. (If you meant it to be day-month-year, rather than month-day-year, then just use dmy_hm.) See code below.
library(lubridate)
dat$date_time = mdy_hm(paste(dat$date, dat$time))
dat
date time numbers date_time
1 01-02-2010 14:57 5 2010-01-02 14:57:00
2 01-02-2010 23:23 7 2010-01-02 23:23:00
3 02-02-2010 05:05 3 2010-02-02 05:05:00
4 02-02-2010 10:23 11 2010-02-02 10:23:00
library(ggplot2)
library(scales)  # provides date_breaks()
ggplot(dat, aes(date_time, numbers)) +
  geom_point() + geom_line() +
  scale_x_datetime(breaks = date_breaks("1 week"),
                   minor_breaks = date_breaks("1 day"))

Reconstruct your data:
dat <- read.table(text="
date time numbers
01-02-2010 14:57 5
01-02-2010 23:23 7
02-02-2010 05:05 3
02-02-2010 10:23 11", header=TRUE)
Now use as.POSIXct() and paste() to combine your date and time into a single POSIXct date-time. You need to specify the format, using the symbols defined in ?strptime. Also see ?DateTimeClasses for more information.
dat$newdate <- with(dat, as.POSIXct(paste(date, time), format="%m-%d-%Y %H:%M"))
plot(numbers ~ newdate, data=dat, type="b", col="blue")
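One detail worth checking: the question says the date column is already of class Date. In that case paste() converts it to ISO text (yyyy-mm-dd), so the format string has to match that layout rather than the 01-02-2010 layout shown in the question. A minimal sketch, assuming the sample dates are day-month-year and using UTC so the result does not depend on the local time zone:

```r
# Sample data rebuilt with a real Date column (day-month-year is an assumption)
dat <- data.frame(
  date    = as.Date(c("01-02-2010", "01-02-2010", "02-02-2010", "02-02-2010"),
                    format = "%d-%m-%Y"),
  time    = c("14:57", "23:23", "05:05", "10:23"),
  numbers = c(5, 7, 3, 11),
  stringsAsFactors = FALSE
)

# paste() turns a Date into text such as "2010-02-01", so parse with the ISO layout
dat$date_time <- as.POSIXct(paste(dat$date, dat$time),
                            format = "%Y-%m-%d %H:%M", tz = "UTC")

format(dat$date_time, "%Y-%m-%d %H:%M")
```

If the format string does not match the pasted text, as.POSIXct() returns NA rather than raising an error, so a quick anyNA(dat$date_time) check is a cheap safeguard.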

Related

adding two column of a data where col1 contains date and col2 contains days

I have a data frame with two columns, date and days. I want to add the days to the date and show the result in another column.
data frame-1
The date column is in mm/dd/yyyy format.
date days
3/2/2019 8
3/5/2019 4
3/6/2019 4
3/21/2019 3
3/25/2019 7
I want my output to look like this:
date days new-date
3/2/2019 8 3/10/2019
3/5/2019 4 3/9/2019
3/6/2019 4 3/10/2019
3/21/2019 3 3/24/2019
3/25/2019 7 4/1/2019
I tried this:
as.Date("3/10/2019") +8
but I think it only works for a single value.
Convert to actual Date values and then add Days. You need to specify the actual format of date (read ?strptime) while converting it to Date.
as.Date(df$date, "%m/%d/%Y") + df$days
#[1] "2019-03-10" "2019-03-09" "2019-03-10" "2019-03-24" "2019-04-01"
If you want the output back in same format, we can use format
df$new_date <- format(as.Date(df$date, "%m/%d/%Y") + df$days, "%m/%d/%Y")
df
# date days new_date
#1 3/2/2019 8 03/10/2019
#2 3/5/2019 4 03/09/2019
#3 3/6/2019 4 03/10/2019
#4 3/21/2019 3 03/24/2019
#5 3/25/2019 7 04/01/2019
If the different date formats are confusing, lubridate can do the parsing for you:
library(lubridate)
with(df, mdy(date) + days)
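Note that format(..., "%m/%d/%Y") pads with zeros (03/10/2019), while the desired output in the question has none (3/10/2019). A minimal base R sketch of one way to drop the padding; the column name new_date_chr is made up for illustration:

```r
# Data from the question
df <- data.frame(
  date = c("3/2/2019", "3/5/2019", "3/6/2019", "3/21/2019", "3/25/2019"),
  days = c(8, 4, 4, 3, 7),
  stringsAsFactors = FALSE
)

# Parse to Date, then add the day counts element-wise
df$new_date <- as.Date(df$date, format = "%m/%d/%Y") + df$days

# Rebuild m/d/yyyy text without leading zeros by converting the
# month and day parts to integers before pasting
df$new_date_chr <- paste(
  as.integer(format(df$new_date, "%m")),
  as.integer(format(df$new_date, "%d")),
  format(df$new_date, "%Y"),
  sep = "/"
)

df
```

Keeping new_date as a real Date column alongside the text version means later arithmetic or sorting still works on it.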

Convert sub-hourly data to hourly and round up time in R

I have a very big dataframe in R, containing weather data with the following format.
valid temp
1 17/08/2014 00:20 14
2 17/08/2014 00:50 14
3 17/08/2014 01:20 13.5
4 17/08/2014 01:50 13
5 17/08/2014 02:20 12
6 17/08/2014 02:50 10
I would like to convert these sub-hourly data to hourly, like the following.
valid tmpc
1 2014-08-17 00:00:00 14
2 2014-08-17 01:00:00 13.75
3 2014-08-17 02:00:00 12.5
The class of df$valid is 'factor'. I have tried first converting them to Date through POSIXct, but it gives only NA values. I have also tried changing the system locale and still I get NAs.
We can do this in base R by converting to POSIXlt, setting the minutes to 0, converting back to POSIXct, and aggregating to get the mean of 'temp':
df1$valid <- strptime(df1$valid, "%d/%m/%Y %H:%M")
df1$valid$min <- 0
df1$valid <- as.POSIXct(df1$valid)
aggregate(temp~valid, df1, FUN = mean)
Option 1: The lubridate solution, using ceiling_date or round_date. It's not clear from your data frame and expected results whether you want to round or to take the ceiling: the first row looks rounded, while the third looks like a ceiling. Either way, here is an example:
library(lubridate)
df <- data.frame(i = 1, valid= "17/08/2014 01:28", temp = 14)
df$valid <- dmy_hm(df$valid)
df$valid_round <- ceiling_date(df$valid , unit="hours")
Results:
i valid temp valid_round
1 1 2014-08-17 01:28:00 14 2014-08-17 02:00:00
Option 2: using the base functions. Use:
df$valid <- as.POSIXct(strptime(df$valid, "%d/%m/%Y %H:%M", tz ="UTC"))
and then round it.
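For completeness, here is a minimal base R sketch of the whole pipeline on the sample rows. It floors each timestamp to the start of its hour with trunc(), which is an assumption based on the timestamps in the expected output; round() accepts the same units argument if rounding to the nearest hour is wanted instead. The column name hour is made up for illustration.

```r
# Sample rows from the question
df <- data.frame(
  valid = c("17/08/2014 00:20", "17/08/2014 00:50", "17/08/2014 01:20",
            "17/08/2014 01:50", "17/08/2014 02:20", "17/08/2014 02:50"),
  temp  = c(14, 14, 13.5, 13, 12, 10),
  stringsAsFactors = FALSE
)

# Parse day/month/year text; UTC avoids daylight-saving surprises
df$valid <- as.POSIXct(df$valid, format = "%d/%m/%Y %H:%M", tz = "UTC")

# trunc() returns POSIXlt, so convert back to POSIXct before grouping
df$hour <- as.POSIXct(trunc(df$valid, units = "hours"))

# Mean temperature per hour
hourly <- aggregate(temp ~ hour, data = df, FUN = mean)
hourly
```

If the real column is a factor, wrapping it in as.character() before parsing is a safe habit.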

Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"), as.POSIXct("2016-12-31"), length.out=10000),
                 what = sample(LETTERS, 10000, replace=TRUE))
# To summarise by month instead: group_by(month = month(time, label=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>%
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var="Month"),
       aes(Month, value, colour=variable, group=variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
  labs(colour="", y="")
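The same counting idea also works without dplyr. Here is a minimal base R sketch on the three sample tweets from the question, assuming the date is built from the day, month and year columns; the column name hit is made up for illustration.

```r
# Three sample tweets from the question (only the columns needed here)
tweets <- data.frame(
  day   = c(20, 20, 1),
  month = c(2, 2, 3),
  year  = 2016,
  what  = c("Great goal by #lfc", "Can't wait for the weekend",
            "Generic boring tweet"),
  stringsAsFactors = FALSE
)

# Build a real Date from the numeric parts
tweets$date <- as.Date(ISOdate(tweets$year, tweets$month, tweets$day))

# TRUE where the tweet contains the hashtag (fixed = TRUE: literal match)
tweets$hit <- grepl("#lfc", tweets$what, fixed = TRUE)

# The mean of a logical vector is the share of TRUE values, so this gives
# the fraction of matching tweets per day
daily <- aggregate(hit ~ date, data = tweets, FUN = mean)
daily
```

Because daily$date is a Date, plot(hit ~ date, data = daily, type = "b") places the points in calendar order on the x-axis.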
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
                                               sprintf("%02d",month),"-",
                                               sprintf("%02d", day)," ",
                                               time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
                                      sprintf("%02d",month),"-",
                                      sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
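A short sketch of why this fixes the "2.13 versus 2.7" problem from the question: real dates compare in calendar order, and their underlying numbers are simple counts from the 1970 epoch. The variable names are made up for illustration.

```r
# Dates compare in calendar order, unlike the decimal "month.day" encoding
feb07 <- as.Date("2016-02-07")
feb13 <- as.Date("2016-02-13")
feb13 > feb07   # 13 February is later than 7 February
2.13 > 2.7      # the decimal encoding puts them the other way round

# Underlying numbers: days since 1970-01-01 for Date,
# seconds since 1970-01-01 00:00:00 UTC for POSIXct
days_since_epoch    <- as.numeric(as.Date("1970-01-10"))
seconds_since_epoch <- as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))
days_since_epoch
seconds_since_epoch
```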

Removing multiple data entries based on a total number of entries per day

I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out which days have fewer than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This ends up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
Which is great, but I cannot figure out how to remove the entries for these days (those with fewer than three entries) from the original 'dat', because this is a compressed version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
I will also use the column titled "extra". I would like to create a new column that sums the values in "extra" by the simplified strptime dates, and then applies that sum to every entry from that date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df) {
  if (nrow(df) >= 3) {
    return(df)
  } else {
    return(NULL)
  }
})
I recommend using the data.table package
library(data.table)
dat<-data.table(dat)
dat$Date<-as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
dat_sum<-dat[, .N, by = Date ]
dat_3plus<-dat_sum[N>=3]
dat<-dat[Date%in%dat_3plus$Date]
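The extra_sum column the question asks about can also be built directly in base R with ave(), which computes a per-group value and repeats it on every row of that group. A minimal sketch on the sample rows, assuming only the datetime and extra columns matter here; the name kept is made up for illustration.

```r
# Sample rows from the question (other columns omitted)
dat <- data.frame(
  datetime = c("8/9/2014 13:00", "8/9/2014 17:00", "8/9/2014 23:00",
               "8/10/2014 5:00", "8/10/2014 13:00", "8/10/2014 17:00",
               "8/10/2014 23:00", "8/11/2014 5:00", "8/11/2014 13:00"),
  extra = 1,
  stringsAsFactors = FALSE
)

# as.Date() ignores the trailing time once the format is satisfied
dat$date <- as.Date(dat$datetime, format = "%m/%d/%Y")

# Sum 'extra' within each day and attach the total to every row of that day
dat$extra_sum <- ave(dat$extra, dat$date, FUN = sum)

# Keep only days with at least 3 entries
kept <- dat[dat$extra_sum >= 3, ]
kept
```

This relies on extra being 1 on every row, as in the question; if that is not guaranteed, ave(seq_along(dat$date), dat$date, FUN = length) counts rows per day instead.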

Choose specific date with strptime in r

I have a text file dataset with headers
YEAR MONTH DAY value
which runs hourly from 1/6/2010 to 14/7/2012. I open and plot the data with the following commands:
data=read.table('example.txt',header=T)
time = strptime(paste(data$DAY,data$MONTH,data$YEAR,sep="-"), format="%d-%m-%Y")
plot(time,data$value)
However, when the data are plotted, the x-axis only shows 2011 and 2012. How can I keep the 2011 and 2012 labels but also add specific months, for example March, June and September?
I have made the data available on this link
https://dl.dropbox.com/u/107215263/example.txt
You need to use the function axis.POSIXct to format and position your date labels as you wish:
plot(time,data$value,xaxt="n") #Skip the x-axis here
axis.POSIXct(1, at=pretty(time), format="%B %Y")
To see all possible formats, see ?strptime.
You can of course play with parameter at to place your ticks wherever you want, for instance:
axis.POSIXct(1, at=seq(time[1],time[length(time)],"3 months"),
             format="%B %Y")
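To land ticks on specific calendar months (March, June and September, as asked), one option is to generate the first day of every month in the data range and keep only the wanted months. A minimal sketch with a hypothetical helper month_ticks(), assuming the times are POSIXct in UTC and using a synthetic daily sequence in place of the real file:

```r
# Hypothetical helper: tick positions on the 1st of the selected months
month_ticks <- function(time, months = c(3, 6, 9)) {
  # First day of each month from the start of the data to its end
  first_days <- seq(as.Date(format(min(time), "%Y-%m-01")),
                    as.Date(max(time)), by = "month")
  # Keep only the requested calendar months
  keep <- as.integer(format(first_days, "%m")) %in% months
  as.POSIXct(format(first_days[keep]), tz = "UTC")
}

# Synthetic stand-in for the question's date range
time  <- seq(as.POSIXct("2010-06-01", tz = "UTC"),
             as.POSIXct("2012-07-14", tz = "UTC"), by = "day")
ticks <- month_ticks(time)
format(ticks, "%Y-%m-%d")
```

The resulting vector can be passed straight to the axis call shown above, for example plot(time, data$value, xaxt = "n") followed by axis.POSIXct(1, at = ticks, format = "%b %Y").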
While this doesn't answer the question directly, I would suggest using the xts package for any time series analysis. It makes time series work very convenient.
require(xts)
DF <- read.table("https://dl.dropbox.com/u/107215263/example.txt", header = TRUE)
head(DF)
## YEAR MONTH DAY value
## 1 2010 6 1 95.3244
## 2 2010 6 2 95.3817
## 3 2010 6 3 100.1968
## 4 2010 6 4 103.8667
## 5 2010 6 5 104.5969
## 6 2010 6 6 107.2666
#Get Index for xts object which we will create in next step
DFINDEX <- ISOdate(DF$YEAR, DF$MONTH, DF$DAY)
#Create xts timeseries
DF.XTS <- xts(x = DF$value, order.by = DFINDEX, tzone = "GMT")
head(DF.XTS)
## [,1]
## 2010-06-01 12:00:00 95.3244
## 2010-06-02 12:00:00 95.3817
## 2010-06-03 12:00:00 100.1968
## 2010-06-04 12:00:00 103.8667
## 2010-06-05 12:00:00 104.5969
## 2010-06-06 12:00:00 107.2666
#plot xts
plot(DF.XTS)

Resources