I would like to know if the order of dates matters when plotting a time series in R.
For example, the data frame below has its dates starting from the year 2010 and increasing as you go down, for example until 2011:
Date Number of visits
2010-05-17 13
2010-05-18 11
2010-05-19 4
2010-05-20 2
2010-05-21 23
2010-05-22 26
2011-05-13 14
and below is one where the years are jumbled up.
Date Number of visits
2011-06-19 10
2009-04-25 5
2012-03-09 20
2011-01-04 45
Would I be able to plot a time series in R for the second example above? Is it required that the dates be sorted in order to plot a time series?
Assuming the data shown reproducibly in the Note at the end, create an ordering vector o and then plot the ordered data:
o <- order(dat$Date)
plot(dat[o, ], type = "o")
or convert the data to a zoo series, which will automatically order it, and then plot.
library(zoo)
z <- read.zoo(dat)
plot(z, type = "o")
Note
The data in reproducible form:
Lines <- "Date Number of visits
2010-05-17 13
2010-05-18 11
2010-05-19 4
2010-05-20 2
2010-05-21 23
2010-05-22 26
2011-05-13 14"
dat <- read.csv(text = gsub(" +", ",", readLines(textConnection(Lines))),
check.names = FALSE)
dat$Date <- as.Date(dat$Date)
as.Date solves your problem:
data$Date <- as.Date(data$Date)
ggplot(data, aes(Date, Number_of_visits)) + geom_line()
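One caveat: in the data as printed, the column name contains spaces ("Number of visits"), so if it was read with check.names = FALSE the aes() call would need backticks (or the column renamed first). A sketch under that assumption:
library(ggplot2)
# assumes the column is literally named "Number of visits"
ggplot(data, aes(Date, `Number of visits`)) + geom_line()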
Related
I have the following data set called "my_data" (a data frame); the dates are factor type and represent "year-month" (note: this data frame was created from an original data frame using dplyr's group_by/summarise commands and then tidyverse's pivot_longer command to put the data in long format):
head(my_data)
col_A.dates col_A.count col_B.count col_C.count col_D.count col_E.count
1 2010-01 189 130 57 58 53
2 2010-02 63 62 25 18 30
3 2010-03 46 24 12 12 11
4 2010-04 45 17 8 16 15
5 2010-05 42 26 13 12 16
I am trying to make a time series plot of this data using the "dygraph" library (https://rstudio.github.io/dygraphs/).
To do this, it seems like you have to first convert your data frame into an "xts" type:
library(xts)
xts_data <- xts(my_data[,-1], order.by=my_data[,1])
But this returns the following error:
Error in xts(my_data[, -1], order.by = my_data[, 1]) :
order.by requires an appropriate time-based object
This is preventing me from creating the final graph:
library(dygraphs)
dygraph(xts_data) %>% dyRangeSelector()
Can someone please show me how to fix this problem?
References:
Converting a data frame to xts
It may be easiest to append a dummy day and convert to Date class.
Below is an example:
library(tidyverse); library(xts)
dummy_date <- "-15"
my_data2 <- my_data %>%
mutate(col_A.dates = as.Date(paste0(as.character(col_A.dates), dummy_date)))
xts_data <- xts(my_data2[,-1], order.by=my_data2[,1]) # if my_data2 is tibble, order_by = my_data2[[1]]
dygraph(xts_data) %>% dyRangeSelector()
Response to a comment:
Yes, if I were you, I would convert "05-OCT-21" to Date class directly and use it.
## example
# depending on your locale, you may need to change the locale (if so, delete the #)
# lct <- Sys.getlocale("LC_TIME") # keep origin locale
# Sys.setlocale("LC_TIME", "C") # change locale
as.Date(as.character("05-OCT-21"), format = "%d-%b-%y")
# Sys.setlocale("LC_TIME", lct) # return origin locale
### expected_code
my_data2 <- my_data %>%
  mutate(col_A.dates = as.Date(as.character(original_date), format = "%d-%b-%y"))
?dygraph indicates that any class convertible to xts can be used, so, assuming we have the data frame shown reproducibly in the Note at the end, use read.zoo to convert it to a zoo object with a yearmon class index and then call dygraph.
library(zoo)
z <- read.zoo(my_data, FUN = as.yearmon)
library(dygraphs)
dygraph(z)
or to use ggplot2 (see ?autoplot.zoo for more info):
library(ggplot2)
autoplot(z, facets = NULL)
We don't really need anything more than the above, but in case you want a Date class index, an xts object, or a ts object, once we have z it is easy to convert it to many other forms.
zd <- aggregate(z, as.Date)
library(xts)
x <- as.xts(z)
as.ts(z)
Note
Lines <- " col_A.dates col_A.count col_B.count col_C.count col_D.count col_E.count
1 2010-01 189 130 57 58 53
2 2010-02 63 62 25 18 30
3 2010-03 46 24 12 12 11
4 2010-04 45 17 8 16 15
5 2010-05 42 26 13 12 16"
my_data <- read.table(text = Lines, check.names = FALSE)
my_data[[1]] <- factor(my_data[[1]])
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM), but only if there is at least one data point for each hour (11 AM, 12, 1, 2, 3, 4 PM) within a given day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data whose times fall between 11 AM and 4 PM, even though for a given day I often only have data for, e.g., 12 and 1 PM.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to #Henry Navarro's answer, for solving an additional problem mentioned in the question.
If I understand correctly, another concern of the question is to find the dates for which there is at least one data point for each hour of the given interval within the day. A possible approach, following the style of #Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(
  days = unique(your_data$days),
  all_hours_present = sapply(your_data_by_days_list, function(Z)
    sum(unique(Z$hour_only) %in% hours_intervals) >= length(hours_intervals)),
  row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now a column "all_hours_present" indicating whether the data for the corresponding day contain at least one value for each hour in the given hours_intervals. And you may use this column to subset your data:
subset(your_data, all_hours_present)
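With the all_hours_present flag merged in, the final step the question asks for (summing mat.3 over the 11 AM-4 PM window per qualifying day) could look like the sketch below; it assumes the column names shown in the question's data and that hours_intervals has been set back to 11:16 rather than the narrowed demonstration interval:
# keep only rows from days that cover every hour of the interval, then sum mat.3 per day
complete_days <- subset(your_data, all_hours_present & hour_only >= 11 & hour_only <= 16)
daily_sums <- aggregate(mat.3 ~ days, data = complete_days, FUN = sum)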
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable try to do the next:
# auxiliary variable flagging your interval of time
your_data$aux_var <- ifelse(your_data$hour > "11:00:00" & your_data$hour < "16:00:00", 1, 0)
So, the next step is to filter your data where aux_var == 1:
your_data[which(your_data$aux_var ==1),]
I have a large number of files (~1200), each of which contains a long time series with data about the height of the groundwater. The starting date and length of the series differ for each file. There can be large gaps between dates, for example (a small part of such a file):
Date Height (cm)
14-1-1980 7659
28-1-1980 7632
14-2-1980 7661
14-3-1980 7638
28-3-1980 7642
14-4-1980 7652
25-4-1980 7646
14-5-1980 7635
29-5-1980 7622
13-6-1980 7606
27-6-1980 7598
14-7-1980 7654
28-7-1980 7654
14-8-1980 7627
28-8-1980 7600
12-9-1980 7617
14-10-1980 7596
28-10-1980 7601
14-11-1980 7592
28-11-1980 7614
11-12-1980 7650
29-12-1980 7670
14-1-1981 7698
28-1-1981 7700
13-2-1981 7694
17-3-1981 7740
30-3-1981 7683
14-4-1981 7692
14-5-1981 7682
15-6-1981 7696
17-7-1981 7706
28-7-1981 7699
28-8-1981 7686
30-9-1981 7678
17-11-1981 7723
11-12-1981 7803
18-2-1982 7757
16-3-1982 7773
13-5-1982 7753
11-6-1982 7740
14-7-1982 7731
15-8-1982 7739
14-9-1982 7722
14-10-1982 7794
15-11-1982 7764
14-12-1982 7790
14-1-1983 7810
28-3-1983 7836
28-4-1983 7815
31-5-1983 7857
29-6-1983 7801
28-7-1983 7774
24-8-1983 7758
28-9-1983 7748
26-10-1983 7727
29-11-1983 7782
27-1-1984 7801
28-3-1984 7764
27-4-1984 7752
28-5-1984 7795
27-7-1984 7748
27-8-1984 7729
28-9-1984 7752
26-10-1984 7789
28-11-1984 7797
18-12-1984 7781
28-1-1985 7833
21-2-1985 7778
22-4-1985 7794
28-5-1985 7768
28-6-1985 7836
26-8-1985 7765
19-9-1985 7760
31-10-1985 7756
26-11-1985 7760
20-12-1985 7781
17-1-1986 7813
28-1-1986 7852
26-2-1986 7797
25-3-1986 7838
22-4-1986 7807
27-5-1986 7785
24-6-1986 7787
26-8-1986 7744
23-9-1986 7742
22-10-1986 7752
1-12-1986 7749
17-12-1986 7758
I want to calculate the average height over 5 years: in the case of the example, 14-1-1980 + 5 years, 14-1-1985 + 5 years, and so on. The number of data points is different for each calculation of the average, and it is very likely that the date exactly 5 years later will not itself be a data point in the dataset. Hence, I think I need to tell R somehow to take an average over a certain timespan.
I searched on the internet but didn't find anything that fitted my needs. A lot of useful packages like uts, zoo, and lubridate, and the function aggregate, came up, but instead of getting closer to the solution I got more and more confused about which approach is best for my problem.
Thanks a lot in advance!
As #vagabond points out, you'll want to combine your 1200 files into a single data frame (the plyr package would allow you to do something simple like: data.all <- adply(dir([DATA FOLDER]), 1, read.csv)).
Once you have the data, the first step would be to transform the Date column into proper Date-class data. Right now the dates appear to be strings, and we want them to have an underlying numerical representation (which the Date class provides):
library(lubridate)
df$date.new <- as.Date(dmy(df$Date))
Date Height date.new
1 14-1-1980 7659 1980-01-14
2 28-1-1980 7632 1980-01-28
3 14-2-1980 7661 1980-02-14
4 14-3-1980 7638 1980-03-14
5 28-3-1980 7642 1980-03-28
6 14-4-1980 7652 1980-04-14
Note that the date.new column looks like a string, but is in fact Date data, and can be handled with numerical operations (addition, comparison, etc.).
Next, we might construct a set of date periods over which we want to compute averages. Your example mentions 5 years, but with the data you provided that's not a very illustrative example, so here I'm creating 1-year periods starting at every day between Jan 14 1980 and Jan 14 1985:
date.start <- as.Date(as.Date('1980-01-14') : as.Date('1985-01-14'), origin = '1970-01-01')
date.end <- date.start + years(1)
dates <- data.frame(start = date.start, end = date.end)
start end
1 1980-01-14 1981-01-14
2 1980-01-15 1981-01-15
3 1980-01-16 1981-01-16
4 1980-01-17 1981-01-17
5 1980-01-18 1981-01-18
6 1980-01-19 1981-01-19
Then we can use the dplyr package to move through each row of this data frame and compute a summary average of Height:
library(dplyr)
df.mean <- dates %>%
group_by(start, end) %>%
summarize(height.mean = mean(df$Height[df$date.new >= start & df$date.new < end]))
start end height.mean
<date> <date> <dbl>
1 1980-01-14 1981-01-14 7630.273
2 1980-01-15 1981-01-15 7632.045
3 1980-01-16 1981-01-16 7632.045
4 1980-01-17 1981-01-17 7632.045
5 1980-01-18 1981-01-18 7632.045
6 1980-01-19 1981-01-19 7632.045
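To get the 5-year windows the question actually asks for on the full combined data, only the period construction should need to change; a hedged sketch, assuming the combined series in df spans more than 5 years:
# 5-year windows starting at every observed day that still leaves 5 years of data
date.start <- seq(min(df$date.new), max(df$date.new) - years(5), by = "1 day")
date.end   <- date.start + years(5)
dates      <- data.frame(start = date.start, end = date.end)
# then re-run the dplyr summarise above with these windows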
The foverlaps function is IMHO the perfect candidate for such a situation:
library(data.table)
library(lubridate)
# convert to a data.table with setDT()
# convert the 'Date'-column to date-format
# create a begin & end date for the required period
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
][, `:=` (begindate = Date, enddate = Date + years(1))]
# set the keys (necessary for the foverlaps function)
setkey(dat, begindate, enddate)
res <- foverlaps(dat, dat, by.x = c(1,3))[, .(moving.average = mean(i.Height)), Date]
the result:
> head(res,15)
Date moving.average
1: 1980-01-14 7633.217
2: 1980-01-28 7635.000
3: 1980-02-14 7637.696
4: 1980-03-14 7636.636
5: 1980-03-28 7641.273
6: 1980-04-14 7645.261
7: 1980-04-25 7644.955
8: 1980-05-14 7646.591
9: 1980-05-29 7647.143
10: 1980-06-13 7648.400
11: 1980-06-27 7652.900
12: 1980-07-14 7655.789
13: 1980-07-28 7660.550
14: 1980-08-14 7660.895
15: 1980-08-28 7664.000
Now you have, for each date, an average of all the values that lie between that date and one year ahead of it.
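For the 5-year averages the question asks about, only the window length should need to change; a sketch along the same lines (untested):
# same approach with a 5-year look-ahead window
dat[, `:=` (begindate = Date, enddate = Date + years(5))]
setkey(dat, begindate, enddate)
res5 <- foverlaps(dat, dat, by.x = c(1, 3))[, .(moving.average = mean(i.Height)), Date]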
Hey, I just tried this after seeing your question! I ran it on a sample data frame; try it on yours after understanding the code and then let me know!
By the way, instead of an interval of 5 years, I used just 2 months (2*30 days, approximately 2 months) as the interval!
df = data.frame(Date = c("14-1-1980", "28-1-1980", "14-2-1980", "14-3-1980", "28-3-1980",
"14-4-1980", "25-4-1980", "14-5-1980", "29-5-1980", "13-6-1980:",
"27-6-1980", "14-7-1980", "28-7-1980", "14-8-1980"), height = 1:14)
# as.Date(df$Date, "%d-%m-%Y")
df1 = data.frame(orig = NULL, dest = NULL, avg_ht = NULL)
orig = as.Date(df$Date, "%d-%m-%Y")[1]
dest = as.Date(df$Date, "%d-%m-%Y")[1] + 2*30 #approx 2 months
dest_final = as.Date(df$Date, "%d-%m-%Y")[14]
while (dest < dest_final){
  m = mean(df$height[which(as.Date(df$Date, "%d-%m-%Y") >= orig &
                           as.Date(df$Date, "%d-%m-%Y") < dest)])
  df1 = rbind(df1, data.frame(orig = orig, dest = dest, avg_ht = m))
  orig = dest
  dest = dest + 2*30
  print(paste("orig:", orig, " + ", "dest:", dest))
}
> df1
orig dest avg_ht
1 1980-01-14 1980-03-14 2.0
2 1980-03-14 1980-05-13 5.5
3 1980-05-13 1980-07-12 9.5
I hope this works for you as well
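One design note: 2*30 days only approximates two months. If exact calendar steps matter, lubridate's month arithmetic could be swapped in for the dest updates, e.g.:
library(lubridate)
# hedged alternative: step by true calendar months instead of 60-day blocks
dest <- orig %m+% months(2)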
This is my best try, but please keep in mind that I am working with the years instead of the full dates, i.e. based on the example you provided I am averaging over the beginning of 1980 to the end of 1984.
dat<-read.csv("paixnidi.csv")
install.packages("stringr")
library(stringr)
dates<-dat[,1]
#extract the year of each measurement
years<-as.integer(str_sub(dat[,1], start= -4))
spread_y<-years[length(years)]-years[1]
ind<-list()
#find how many 5-year intervals there are
groups<-ceiling(spread_y/4)
meangroups<-matrix(0,ncol=2,nrow=groups)
k<-0
for (i in 1:groups){
  # extract the indices of the dates vector within the 5-year period
  ind[[i]] <- which(years >= (years[1]+k) & years <= (years[1]+k+4), arr.ind = TRUE)
  meangroups[i,2] <- mean(dat[ind[[i]],2])
  meangroups[i,1] <- (years[1]+k)
  k <- k+5
}
colnames(meangroups)<-c("Year:Year+4","Mean Height (cm)")
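If you prefer a single aggregate() call over the explicit loop, a roughly equivalent sketch (same year-based 5-year blocks, column positions assumed from the read.csv above, blocks anchored at the earliest year):
# group each year into its 5-year block and average the heights per block
year   <- as.integer(str_sub(as.character(dat[, 1]), start = -4))
period <- year - (year - min(year)) %% 5   # start year of each 5-year block
aggregate(dat[, 2], by = list(`Year:Year+4` = period), FUN = mean)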
I am trying to make a histogram (or other plot) of the number of occurrences of each event from a set of data from multiple years, but grouped by month and day. Basically I want a year-long x-axis starting from 1 March, showing how many times each date occurs and shading those counts based on a categorical value. Below are the top 20 entries in the data set:
goose
Index DateLost DateLost1 Nested
1 2/5/1988 1988-02-05 N
2 5/20/1988 1988-05-20 N
3 1/31/1985 1985-01-31 N
4 9/6/1997 1997-09-06 Y
5 9/24/1996 1996-09-24 N
6 9/27/1996 1996-09-27 N
7 9/15/1997 1997-09-15 Y
8 1/18/1989 1989-01-18 Y
9 1/12/1985 1985-01-12 Y
10 2/12/1988 1988-02-12 N
11 1/12/1985 1985-01-12 Y
12 10/26/1986 1986-10-26 N
13 9/15/1988 1988-09-15 Y
14 12/30/1986 1986-12-30 N
15 1/19/1991 1991-01-19 N
16 1/7/1992 1992-01-07 N
17 10/9/1999 1999-10-09 N
18 10/20/1990 1990-10-20 N
19 10/25/2001 2001-10-25 N
20 9/23/1996 1996-09-23 Y
I have tried grouping using strftime, zoo, and lubridate, but then the plots don't recognize the time sequence or allow me to adjust the starting value. I have tried numerous methods using plot() and ggplot2, but I either can't get the grouped data to plot correctly or can't get the data grouped. My best plot so far is from this code:
ggplot(goose, aes(x=DateLost1,fill=Nested))+
stat_bin(binwidth=100 ,position="identity") +
scale_x_date("Date")
This gets me a nice plot but over all years, rather than one year. I have also played with the code from a previous answer here:
Understanding dates and plotting a histogram with ggplot2 in R
But I am having trouble choosing a start date. Any help would be greatly appreciated. Let me know if I can provide the example data in an easier-to-use format.
Let's read in your data:
goose <- read.table(header = TRUE, text = "Index DateLost DateLost1 Nested
1 2/5/1988 1988-02-05 N
2 5/20/1988 1988-05-20 N
3 1/31/1985 1985-01-31 N
4 9/6/1997 1997-09-06 Y
5 9/24/1996 1996-09-24 N
6 9/27/1996 1996-09-27 N
7 9/15/1997 1997-09-15 Y
8 1/18/1989 1989-01-18 Y
9 1/12/1985 1985-01-12 Y
10 2/12/1988 1988-02-12 N
11 1/12/1985 1985-01-12 Y
12 10/26/1986 1986-10-26 N
13 9/15/1988 1988-09-15 Y
14 12/30/1986 1986-12-30 N
15 1/19/1991 1991-01-19 N
16 1/7/1992 1992-01-07 N
17 10/9/1999 1999-10-09 N
18 10/20/1990 1990-10-20 N
19 10/25/2001 2001-10-25 N
20 9/23/1996 1996-09-23 Y")
now we can convert this to POSIXct format:
goose$DateLost1 <- as.POSIXct(goose$DateLost,
format = "%m/%d/%Y",
tz = "GMT")
Then we need to figure out which year it was lost in, relative to 1 March. Don't try to do this in ggplot(). This requires some mucking about to figure out which year we are in, and then calculating the number of days after 1 March.
goose$DOTYMarch1 = as.numeric(format(as.POSIXct(paste0("3/1/", format(goose$DateLost1, "%Y")),
                                                format = "%m/%d/%Y", tz = "GMT"), "%j"))
goose$DOTYLost = as.numeric(format(goose$DateLost1, "%j"))
goose$YLost = as.numeric(format(goose$DateLost1, "%Y")) + (as.numeric(goose$DOTYLost > goose$DOTYMarch1) - 1)
goose$DOTYAfterMarch31Lost = as.numeric(goose$DateLost1 - as.POSIXct(paste0("3/1/", goose$YLost),
                                                                     format = "%m/%d/%Y", tz = "GMT"))
Then we can plot it. Your code was pretty much perfect already.
require(ggplot2)
p <- ggplot(goose, aes(x = DOTYAfterMarch31Lost, fill = Nested)) +
  stat_bin(binwidth = 1, position = "identity")
print(p)
And we get the plot we were after.
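If you also want the x-axis labelled with months starting at March rather than raw day offsets, a small sketch built on the plot above (break positions assume a non-leap 365-day layout):
month_starts <- cumsum(c(0, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 31))  # day offsets of Mar, Apr, ..., Feb
p + scale_x_continuous("Date lost", breaks = month_starts,
                       labels = month.abb[c(3:12, 1:2)])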
I'm using R's ff package with an ffdf object named MyData (dim = c(10819740, 16)). I'm trying to split the variable Date into Day, Month and Year and add these 3 variables to the existing ffdf data MyData.
For instance: my date column is named SalesReportDate, with VirtualVmode and PhysicalVmode = double after I've changed SalesReportDate with as.Date(, format = "%m/%d/%Y").
Example of SalesReportDate are as follow:
> B
SalesReportDate
1 2013-02-01
2 2013-05-02
3 2013-05-04
4 2013-10-06
5 2013-15-10
6 2013-11-01
7 2013-11-03
8 2013-30-02
9 2013-12-12
10 2014-01-01
I've referred to Split date into different columns for year, month and day and tried to apply it, but I keep getting error warnings.
So, is there any way for me to do this? Thanks in advance.
Credit to #jwijffels for this great solution:
require(ffbase)
MyData$SalesReportDateYear <- with(MyData["SalesReportDate"], format(SalesReportDate, "%Y"), by = 250000)
MyData$SalesReportDateMonth <- with(MyData["SalesReportDate"], format(SalesReportDate, "%m"), by = 250000)
MyData$SalesReportDateDay <- with(MyData["SalesReportDate"], format(SalesReportDate, "%d"), by = 250000)
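For readers without ff installed, the same split on an ordinary data frame (a plain-R sketch with a hypothetical data frame df, not the ffdf code above) is just:
# plain data.frame equivalent of the ffdf column splits above
df$SalesReportDate      <- as.Date(df$SalesReportDate, format = "%m/%d/%Y")
df$SalesReportDateYear  <- format(df$SalesReportDate, "%Y")
df$SalesReportDateMonth <- format(df$SalesReportDate, "%m")
df$SalesReportDateDay   <- format(df$SalesReportDate, "%d")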