I have this data that I want to convert to dates, but I doubt it is possible for years below 0. The snippets are below:
library(datasets)
library(quantmod)
data(treering)
tree_df = data.frame(ds=index(treering), y=as.numeric(treering))
> head(tree_df)
ds y
1 -6000 1.345
2 -5999 1.077
3 -5998 1.545
4 -5997 1.319
5 -5996 1.413
6 -5995 1.069
> tail(tree_df)
ds y
7975 1974 1.031
7976 1975 1.027
7977 1976 1.173
7978 1977 1.471
7979 1978 1.444
7980 1979 1.160
?treering
Yearly Treering Data, -6000–1979
Description
Contains normalized tree-ring widths in dimensionless units.
Usage
treering
Format
A univariate time series with 7981 observations. The object is of class "ts".
Each tree ring corresponds to one year.
Is there a way to convert the data into dates that keep the negative year, for example "-6000-01-01"?
Apparently converting a negative integer to a Date does the trick. Counting -2910983 days from the origin 1970-01-01 gives 1 January of the year -6000, so a yearly sequence from that date to the last year of the series produces the dates:
sequences = seq(as.Date(-2910983,origin="1970-01-01"),as.Date(paste0(max(index(treering)),"-01-01")),by="1 years")
tail(sequences)
[1] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01" "1979-01-01"
head(sequences)
[1] "-6000-01-01" "-5999-01-01" "-5998-01-01" "-5997-01-01" "-5996-01-01" "-5995-01-01"
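The offset -2910983 does not have to be taken on trust. Here is a minimal sketch that recomputes it, assuming the Date class counts days in the proleptic Gregorian calendar with a year 0 (which is what the output above suggests). days_before_epoch is a hypothetical helper name, not a base R function, and it only handles years up to 1969.

```r
# Hypothetical helper: number of days from 1 January of `year` up to 1970-01-01,
# using the Gregorian leap-year rule extended backwards (year 0 included).
# Only meant for year <= 1969.
days_before_epoch <- function(year) {
  yrs  <- year:1969
  leap <- (yrs %% 4 == 0 & yrs %% 100 != 0) | (yrs %% 400 == 0)
  sum(365 + leap)
}

offset <- days_before_epoch(-6000)
offset  # 2910983
```

Negating the result gives the number passed to as.Date() together with origin="1970-01-01".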
Related
I have a dataframe as follows:
Date Price1 Price2 Price3 Price4 .... Price 24
2017-10-15 60.43 49.40 48.72 48.32
2017-10-16 38.09 30.00 24.47 24.88
2017-10-17 48.80 46.76 46.73 45.82
The goal is to turn the data frame into a time series and forecast the next date, 2017-10-18, with all 24 corresponding prices.
I do get the ts object, but fitting the model fails with the following error: Error in ets(stock_prize) : y should be a univariate time series
Any advice?
I think your data structure is not correct. I suggest treating the dates as a factor and keeping all the values in a single column. For example, suppose you have something like this:
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
mydates2 <-as.Date(c("2008-06-22", "2005-02-13"))
mydates3 <-as.Date(c("2009-06-22", "2006-02-13"))
hours <- c(8,9)
values <- c(1,2)
a=data.frame(mydates,mydates2,mydates3,hours,values)
a
This is how your data looks:
mydates mydates2 mydates3 hours values
1 2007-06-22 2008-06-22 2009-06-22 8 1
2 2004-02-13 2005-02-13 2006-02-13 9 2
But you should transform them to look something like this:
dates=c(mydates,mydates2,mydates3)
hours_factor=rep(hours,3)
ordered_values=rep(values,3)
b=data.frame(dates,hours_factor,ordered_values)
b
This is how your data should look:
dates hours_factor ordered_values
1 2007-06-22 8 1
2 2004-02-13 9 2
3 2008-06-22 8 1
4 2005-02-13 9 2
5 2009-06-22 8 1
6 2006-02-13 9 2
After that you can convert the variables to class ts with the ts function. If you want to predict the value for the next date, you can fit an autoregressive model. This is well documented online, but note that your data must meet some requirements first.
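As a rough illustration of the single-value-column idea, the sketch below flattens a wide table like the one in the question into one vector in time order. It assumes the 24 price columns are consecutive slots within each day (for example hourly prices), which the question does not state. Only four columns are shown, and the name prices is made up.

```r
# Hypothetical wide data resembling the question (4 of the 24 price columns)
prices <- data.frame(
  Date   = as.Date(c("2017-10-15", "2017-10-16", "2017-10-17")),
  Price1 = c(60.43, 38.09, 48.80),
  Price2 = c(49.40, 30.00, 46.76),
  Price3 = c(48.72, 24.47, 46.73),
  Price4 = c(48.32, 24.88, 45.82)
)

price_cols <- setdiff(names(prices), "Date")
prices <- prices[order(prices$Date), ]

# Flatten row by row: day 1 slots 1..4, then day 2 slots 1..4, and so on
values <- as.vector(t(as.matrix(prices[price_cols])))

# One vector gives a univariate series; frequency = slots per day (assumption)
y <- ts(values, frequency = length(price_cols))
```

An object built this way has no columns, so a function that insists on a univariate series should accept it.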
I have a data frame in my R environment that I would like to subset based on a specific criterion, a sort of conditional filter. My data frame is a panel dataset of daily values for each day between 2004 and 2014. Each day in the data frame is a separate observation, and each year has 366 days in the data. I would like to subset the data such that only the leap years retain the 366th day. There are three leap years in that time range: 2004, 2008, and 2012. I have separate columns for the year and the day of the year. In other words, I need a script that returns the dataset without the 366th day for every year other than 2004, 2008, and 2012.
I've managed to accomplish this the following way: I pasted my day and year columns together (e.g. "2006-366") and simply used dplyr's filter command to subset each year (2005-366, 2006-366, 2007-366, 2009-366, 2010-366, 2011-366, 2013-366, 2014-366). This however is an awfully crude method. I was hoping someone could point me in the right direction here. Here's some reproducible data along with the workflow I used.
#Create DF
year<-rep(c(2004:2014), each=366)
day<-rep(c(1:366))
df<-data.frame(day, year)
#My crude method (needs dplyr for %>% and filter)
library(dplyr)
df$reduc<-paste(df$year, df$day, sep="-")
df <-df %>%
filter(reduc!="2005-366") %>%
filter(reduc!="2006-366") %>%
filter(reduc!="2007-366") %>%
filter(reduc!="2009-366") %>%
filter(reduc!="2010-366") %>%
filter(reduc!="2011-366") %>%
filter(reduc!="2013-366") %>%
filter(reduc!="2014-366")
Set up data:
df <- expand.grid(year=2004:2014,day=1:366)
nrow(df) ## 4026
Now exclude cases where (year is not divisible by 4) AND (day equals 366). Identifying non-leap years would be trickier if your data set included century years such as 1900 or 2100, which are divisible by 4 but are not leap years.
library(dplyr)
df2 <- df %>% filter(!(year %% 4 > 0 & day==366))
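For a date range that does include century years, the divisible-by-4 test can be swapped for the full Gregorian rule. A minimal base R sketch follows; is_leap is a hypothetical helper, not part of dplyr or base R.

```r
# Hypothetical helper implementing the Gregorian leap-year rule
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

df <- expand.grid(year = 2004:2014, day = 1:366)

# Drop day 366 only where the year is not a leap year
df2 <- df[!(df$day == 366 & !is_leap(df$year)), ]
nrow(df2)  # 4018: the 4026 original rows minus 8 non-leap years
```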
You should derive the correct Date values for your dates. This can be done by building the January 1st string representation for each row's year, coercing to Date type, and then adding the day (minus 1) to the Date value.
df$date <- as.Date(paste0(df$year, '-01-01')) + (df$day - 1L)
We will then be able to pull out the year from the Date value and check it against the input year. If they fail to match, then we know the year/day combination was invalid, and we can excise it from the data. This works because invalid leap days will translate into January 1st of the following year under the above derivation method.
df[df$year == as.integer(strftime(df$date, '%Y')), ]
## day year date
## 1 1 2004 2004-01-01
## ...
## 366 366 2004 2004-12-31
## 367 1 2005 2005-01-01
## ...
## 731 365 2005 2005-12-31
## 733 1 2006 2006-01-01
## ...
## 1097 365 2006 2006-12-31
## 1099 1 2007 2007-01-01
## ...
## 1463 365 2007 2007-12-31
## 1465 1 2008 2008-01-01
## ...
## 1830 366 2008 2008-12-31
## 1831 1 2009 2009-01-01
## ...
## 2195 365 2009 2009-12-31
## 2197 1 2010 2010-01-01
## ...
## 2561 365 2010 2010-12-31
## 2563 1 2011 2011-01-01
## ...
## 2927 365 2011 2011-12-31
## 2929 1 2012 2012-01-01
## ...
## 3294 366 2012 2012-12-31
## 3295 1 2013 2013-01-01
## ...
## 3659 365 2013 2013-12-31
## 3661 1 2014 2014-01-01
## ...
## 4025 365 2014 2014-12-31
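The rollover that this check relies on is easy to see on two single dates. A small sketch, with variable names chosen only for illustration:

```r
# Day 366 counted from 1 January: valid in a leap year, spills over otherwise
d_leap     <- as.Date("2004-01-01") + (366L - 1L)
d_non_leap <- as.Date("2005-01-01") + (366L - 1L)

format(d_leap, "%Y-%m-%d")      # "2004-12-31", year still matches
format(d_non_leap, "%Y-%m-%d")  # "2006-01-01", year no longer matches 2005
```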
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
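The same per-day share can be computed without dplyr. Here is a small sketch on a made-up five-row corpus: the column names when and what follow the question, the rows are invented, and when is assumed to have been converted to class Date already.

```r
# Invented mini corpus
tweets <- data.frame(
  when = as.Date(c("2016-02-20", "2016-02-20",
                   "2016-03-01", "2016-03-01", "2016-03-01")),
  what = c("Great goal by #lfc", "Can't wait for the weekend",
           "Generic boring tweet", "#lfc again", "come on #LFC"),
  stringsAsFactors = FALSE
)

# TRUE where the tweet mentions the hashtag, ignoring case
tweets$hit <- grepl("#lfc", tweets$what, ignore.case = TRUE)

# Mean of a logical vector = share of matching tweets on that day
by_day <- aggregate(hit ~ when, data = tweets, FUN = mean)
by_day$hit  # 0.5 and 2/3
```

Because by_day$when is a Date, plot(by_day$when, by_day$hit, type = "b") puts the days on a proper time axis.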
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
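This is also why the '2.13' versus '2.7' ordering problem from the question disappears once real dates are used. A short sketch, where the variable d and the explicit UTC time zone are choices made for the example:

```r
d <- as.Date(c("2016-02-13", "2016-02-07"))

as.numeric(d)  # 16844 16838: days since 1970-01-01
d[2] < d[1]    # TRUE: 7 February sorts before 13 February

# POSIXct counts seconds; one day after the epoch is 86400 seconds
as.numeric(as.POSIXct("1970-01-02", tz = "UTC"))  # 86400
```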
This is probably a very simple question that has been asked already but..
I have a data frame that I have constructed from a CSV file generated in Excel. The observations are not homogeneously sampled, i.e. they are for "On Peak" times of electricity usage, which means they exclude different days each year. I have 20 years of data (1993-2012) and am running both non-robust and robust LOESS to extract seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <-
data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by="month"),
MaxLoad = floor(runif(25,2000,3000)), TMAX=runif(25,100,200))
## One option
df[months(df$Date) %in% month.name[6:9],]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more _generally_ useful than months()
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9,]
I am finding this to be quite tricky. I have an R time series data frame, consisting of a value for each day for about 50 years of data. I would like to compute the mean of only the last 5 values for each month. This would be simple if each month ended in the same 31st day, in which case I could just subset. However, as we all know some months end in 31, some in 30, and then we have leap years. So, is there a simple way to do this in R without having to write a complex indexing function to take account of all the possibilities including leap years? Perhaps a function that works on zoo type objects? The data frame is as follows:
Date val
1 2014-01-06 1.49
2 2014-01-03 1.38
3 2014-01-02 1.34
4 2013-12-31 1.26
5 2013-12-30 2.11
6 2013-12-26 3.20
7 2013-12-25 3.00
8 2013-12-24 2.89
9 2013-12-23 2.90
10 2013-12-22 4.5
tapply
Try this, where dd is your data frame and the Date column is assumed to be of class "Date". The rows are first ordered by Date so that tail() picks the last five days of each month rather than the five largest values. (If dd is already sorted in descending order of Date, as it appears to be in the question, the ordering step can be skipped by replacing the anonymous function with function(x) mean(head(x, 5)).)
> dd <- dd[order(dd$Date), ]
> tapply(dd$val, format(dd$Date, "%Y-%m"), function(x) mean(tail(x, 5)))
2013-12 2014-01
2.492000 1.403333
aggregate.zoo
With zoo we can do the following, which returns another zoo object whose index is of class "yearmon". (With zoo it does not matter whether dd is sorted, since zoo sorts by the index automatically.)
> library(zoo)
> z <- read.zoo(dd)
> aggregate(z, as.yearmon, function(x) mean(tail(x, 5)))
Dec 2013 Jan 2014
2.492000 1.403333
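To check the numbers against the sample rows from the question, here is a self-contained base R sketch. dd is rebuilt by hand from the ten rows shown, and the rows are put in ascending date order before the last five per month are taken.

```r
# The ten sample rows from the question, typed in by hand
dd <- data.frame(
  Date = as.Date(c("2014-01-06", "2014-01-03", "2014-01-02", "2013-12-31",
                   "2013-12-30", "2013-12-26", "2013-12-25", "2013-12-24",
                   "2013-12-23", "2013-12-22")),
  val  = c(1.49, 1.38, 1.34, 1.26, 2.11, 3.20, 3.00, 2.89, 2.90, 4.5)
)

# Ascending order, so tail() returns the latest days of each month
dd <- dd[order(dd$Date), ]

# Months with fewer than five rows (January here) are averaged over what exists
last5 <- tapply(dd$val, format(dd$Date, "%Y-%m"),
                function(x) mean(tail(x, 5)))
last5
```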