I'm trying to difference a time series (plot: time series to difference). Unfortunately, diff(spread) fails to give a usable result (see the error below). I also tried diff(spread, 1). I nearly copy-pasted the code from a working example, and I can't find any obvious mistake. I installed the packages two hours ago, so I have the latest version of every package used.
# working directory path
setwd("C:/Users/Simon/Desktop/Projet serie temp")
#### Q1 ####
require(zoo)
require(tseries)
require(fUnitRoots)
data <- read.csv("base_form.csv",sep=",") #import .csv
View(data) #visualisation
indice = data$Index
dates = data$Dates
spread <- zoo(indice, order.by=dates)
View(spread)
plot.window(ylim = c(-20,20))
plot(spread) # plot the series
dspread <- diff(spread) # first difference
plot(cbind(spread,dspread))
Here is the error I get :
> plot(dspread)
Error in plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs) :
need finite 'ylim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Here is the output of dput(head(spread))
structure(c(83.87, 86.15, 94.07, 90.02, 92.22, 93.18), index = structure(1:6, .Label = c("1990-01",
"1990-02", "1990-03", "1990-04", "1990-05", "1990-06", "1990-07",
"1990-08", "1990-09", "1990-10", "1990-11", "1990-12", "1991-01",
"1991-02", "1991-03", "1991-04", "1991-05", "1991-06", "1991-07",
"1991-08", "1991-09", "1991-10", "1991-11", "1991-12", "1992-01",
"1992-02", "1992-03", "1992-04", "1992-05", "1992-06", "1992-07",
"1992-08", "1992-09", "1992-10", "1992-11", "1992-12", "1993-01",
"1993-02", "1993-03", "1993-04", "1993-05", "1993-06", "1993-07",
"1993-08", "1993-09", "1993-10", "1993-11", "1993-12", "1994-01",
"1994-02", "1994-03", "1994-04", "1994-05", "1994-06", "1994-07",
"1994-08", "1994-09", "1994-10", "1994-11", "1994-12", "1995-01",
"1995-02", "1995-03", "1995-04", "1995-05", "1995-06", "1995-07",
"1995-08", "1995-09", "1995-10", "1995-11", "1995-12", "1996-01",
"1996-02", "1996-03", "1996-04", "1996-05", "1996-06", "1996-07",
"1996-08", "1996-09", "1996-10", "1996-11", "1996-12", "1997-01",
"1997-02", "1997-03", "1997-04", "1997-05", "1997-06", "1997-07",
"1997-08", "1997-09", "1997-10", "1997-11", "1997-12", "1998-01",
"1998-02", "1998-03", "1998-04", "1998-05", "1998-06", "1998-07",
"1998-08", "1998-09", "1998-10", "1998-11", "1998-12", "1999-01",
"1999-02", "1999-03", "1999-04", "1999-05", "1999-06", "1999-07",
"1999-08", "1999-09", "1999-10", "1999-11", "1999-12", "2000-01",
"2000-02", "2000-03", "2000-04", "2000-05", "2000-06", "2000-07",
"2000-08", "2000-09", "2000-10", "2000-11", "2000-12", "2001-01",
"2001-02", "2001-03", "2001-04", "2001-05", "2001-06", "2001-07",
"2001-08", "2001-09", "2001-10", "2001-11", "2001-12", "2002-01",
"2002-02", "2002-03", "2002-04", "2002-05", "2002-06", "2002-07",
"2002-08", "2002-09", "2002-10", "2002-11", "2002-12", "2003-01",
"2003-02", "2003-03", "2003-04", "2003-05", "2003-06", "2003-07",
"2003-08", "2003-09", "2003-10", "2003-11", "2003-12", "2004-01",
"2004-02", "2004-03", "2004-04", "2004-05", "2004-06", "2004-07",
"2004-08", "2004-09", "2004-10", "2004-11", "2004-12", "2005-01",
"2005-02", "2005-03", "2005-04", "2005-05", "2005-06", "2005-07",
"2005-08", "2005-09", "2005-10", "2005-11", "2005-12", "2006-01",
"2006-02", "2006-03", "2006-04", "2006-05", "2006-06", "2006-07",
"2006-08", "2006-09", "2006-10", "2006-11", "2006-12", "2007-01",
"2007-02", "2007-03", "2007-04", "2007-05", "2007-06", "2007-07",
"2007-08", "2007-09", "2007-10", "2007-11", "2007-12", "2008-01",
"2008-02", "2008-03", "2008-04", "2008-05", "2008-06", "2008-07",
"2008-08", "2008-09", "2008-10", "2008-11", "2008-12", "2009-01",
"2009-02", "2009-03", "2009-04", "2009-05", "2009-06", "2009-07",
"2009-08", "2009-09", "2009-10", "2009-11", "2009-12", "2010-01",
"2010-02", "2010-03", "2010-04", "2010-05", "2010-06", "2010-07",
"2010-08", "2010-09", "2010-10", "2010-11", "2010-12", "2011-01",
"2011-02", "2011-03", "2011-04", "2011-05", "2011-06", "2011-07",
"2011-08", "2011-09", "2011-10", "2011-11", "2011-12", "2012-01",
"2012-02", "2012-03", "2012-04", "2012-05", "2012-06", "2012-07",
"2012-08", "2012-09", "2012-10", "2012-11", "2012-12", "2013-01",
"2013-02", "2013-03", "2013-04", "2013-05", "2013-06", "2013-07",
"2013-08", "2013-09", "2013-10", "2013-11", "2013-12", "2014-01",
"2014-02", "2014-03", "2014-04", "2014-05", "2014-06", "2014-07",
"2014-08", "2014-09", "2014-10", "2014-11", "2014-12", "2015-01",
"2015-02", "2015-03", "2015-04", "2015-05", "2015-06", "2015-07",
"2015-08", "2015-09", "2015-10", "2015-11", "2015-12", "2016-01",
"2016-02", "2016-03", "2016-04", "2016-05", "2016-06", "2016-07",
"2016-08", "2016-09", "2016-10", "2016-11", "2016-12", "2017-01",
"2017-02", "2017-03", "2017-04", "2017-05", "2017-06", "2017-07",
"2017-08", "2017-09", "2017-10", "2017-11", "2017-12", "2018-01",
"2018-02"), class = "factor"), class = "zoo")
I cannot reproduce the problem perfectly, but I have some thoughts.
TL;DR: Edit: don't use factors, use either character or Date objects before zoo-ifying things.
I hunted this down by looking at the source for zoo:::diff.zoo. Namely, it was failing at
x - lag(x, k=-1)
# Data:
# numeric(0)
# Index:
# factor(0)
# 338 Levels: 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06 1990-07 1990-08 1990-09 1990-10 1990-11 1990-12 1991-01 1991-02 1991-03 1991-04 1991-05 1991-06 1991-07 1991-08 1991-09 1991-10 1991-11 1991-12 1992-01 1992-02 1992-03 1992-04 ... 2018-02
I believe that typically zoo objects are indexed based on some form of time-progression. This might be simple integers, as in
str(zoo(2:5))
# 'zoo' series from 1 to 4
# Data: int [1:4] 2 3 4 5
# Index: int [1:4] 1 2 3 4
or something more explicit/intentional, such as a Date or POSIXct timestamp. In your case, it's a factor. I don't know if zoo is trying to treat it like an integer (probably not, otherwise it should have come up with something), or like some categorical character, most likely not what you want in a time-series. (Correction: as 42- pointed out, this is actually quite fine.)
So even if zoo intelligently deals with factors, there is also the problem that the date you have listed is not perfectly unambiguous (is not a time-based object). For instance, by "1990-01" do you mean "1990-01-01"? Though it might seem intuitive and obvious to make that assumption, R typically does not follow you on that leap.
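As a small illustration of that last point, here is a minimal base-R sketch (the variable names `ym`, `failed` and `d` are only for illustration). It shows that as.Date() refuses a year-month string on its own, and accepts it once a day is supplied explicitly:

```r
ym <- "1990-01"

# as.Date() will not guess a day for a year-month string: the call raises
# an error, which try() captures so the script can continue
failed <- inherits(try(as.Date(ym), silent = TRUE), "try-error")
print(failed)

# Appending an explicit day makes the string parseable
d <- as.Date(paste0(ym, "-01"), format = "%Y-%m-%d")
print(d)
```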
Try this (here x is the zoo object from your dput output):
(ind <- index(x))
# [1] 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06
# 338 Levels: 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06 1990-07 1990-08 1990-09 1990-10 1990-11 1990-12 ... 2018-02
(ind <- as.Date(paste0(ind, "-01"), format="%Y-%m-%d"))
# [1] "1990-01-01" "1990-02-01" "1990-03-01" "1990-04-01" "1990-05-01" "1990-06-01"
index(x) <- ind
(The surrounding parentheses are merely a shortcut to dump the output post-assignment. They can be safely removed for production.) That now allows
x - lag(x, k=-1)
# 1990-01-01 1990-02-01 1990-03-01 1990-04-01 1990-05-01 1990-06-01
# NA 2.28 7.92 -4.05 2.20 0.96
which means diff on your spread should now work:
diff(x)
# 1990-02-01 1990-03-01 1990-04-01 1990-05-01 1990-06-01
# 2.28 7.92 -4.05 2.20 0.96
If my guess is right, your data import should instead look like:
data <- read.csv("base_form.csv",sep=",") #import .csv
indice = data$Index
dates = as.Date(paste0(data$Dates, "-01"), format="%Y-%m-%d")
spread <- zoo(indice, order.by=dates)
or more simply
data <- read.csv("base_form.csv",sep=",")
dates = as.character(data$Dates)
or even more simply
data <- read.csv("base_form.csv",sep=",", stringsAsFactors=FALSE)
The problem appears to be that the dates are encoded as factors. Note the difference if we construct spread manually:
> indice <- c(83.87, 86.15, 94.07, 90.02, 92.22, 93.18)
> dates <- as.factor(c("1990-01", "1990-02", "1990-03", "1990-04", "1990-05", "1990-06"))
> spread <- zoo(indice, order.by = dates)
> diff(spread)
Data:
numeric(0)
Index:
factor(0)
Levels: 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06
> dates <- c("1990-01", "1990-02", "1990-03", "1990-04", "1990-05", "1990-06")
> spread <- zoo(indice, order.by = dates)
> diff(spread)
1990-02 1990-03 1990-04 1990-05 1990-06
2.28 7.92 -4.05 2.20 0.96
To fix it, you can try adding stringsAsFactors = FALSE to your read.csv.
data <- read.csv("base_form.csv", stringsAsFactors = FALSE)
(Note that sep = "," is the default for read.csv, so you don't really need to specify it.)
EDIT: I should add that there are more zoo-like ways of reading dates in correctly; see https://cran.r-project.org/web/packages/zoo/vignettes/zoo-read.pdf
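For a self-contained check of this fix, here is a minimal sketch. The inline CSV text is a hypothetical stand-in for base_form.csv, built from the six values in the question; the column order of the real file is an assumption.

```r
library(zoo)

# Hypothetical stand-in for base_form.csv (values taken from the question's dput)
csv_text <- "Index,Dates
83.87,1990-01
86.15,1990-02
94.07,1990-03
90.02,1990-04
92.22,1990-05
93.18,1990-06"

# stringsAsFactors = FALSE keeps the Dates column as character, not factor
data <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(data$Dates)   # "character"

# A character index can be ordered, so diff() has something to work with
spread  <- zoo(data$Index, order.by = data$Dates)
dspread <- diff(spread)
print(dspread)
```

With the real file, only the read.csv call would change; the rest of the pipeline stays the same.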
I'm posting to correct what I think are some inaccuracies in r2evans' analysis of the problem. It is true that the problem stems from using a factor as an index. The factor class in R does not support ordering operations (unless the factor is explicitly ordered), and at least one of the "o"s in the name "zoo" stands for "ordered". It could have been solved quickly by:
index(spread) <- as.character(index(spread))
Then the diff-operation would have succeeded, and the cbind operation would also have succeeded, because there is a cbind.zoo function that aligns the series on their index and automagically fills the entries missing from the shorter series (here, the first one) with NA's.
> cbind( diff(spread), spread )
diff(spread) spread
1990-01 NA 83.87
1990-02 2.28 86.15
1990-03 7.92 94.07
1990-04 -4.05 90.02
1990-05 2.20 92.22
1990-06 0.96 93.18
> cbind( diff(diff(spread)), spread )
diff(diff(spread)) spread
1990-01 NA 83.87
1990-02 NA 86.15
1990-03 5.64 94.07
1990-04 -11.97 90.02
1990-05 6.25 92.22
1990-06 -1.24 93.18
Character vectors are perfectly acceptable index classes for zoo. They will be ordered as lexical values. It's perfectly acceptable to make a "<" or ">" operation on two character values, so there is no ambiguity in this case. The zoo-package also has a yearmon class that this index could become if desired.
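To make these points concrete, here is a minimal sketch using the six values from the question's dput (the variable names are only for illustration). It shows that "<" is not meaningful for an unordered factor but is well defined for character values, and that converting the index to zoo's yearmon class also lets diff() succeed:

```r
library(zoo)

indice <- c(83.87, 86.15, 94.07, 90.02, 92.22, 93.18)
dates  <- factor(c("1990-01", "1990-02", "1990-03",
                   "1990-04", "1990-05", "1990-06"))

# Comparing values of an unordered factor yields NA (R also emits a warning)
fac_cmp <- suppressWarnings(dates[1] < dates[2])

# The same values compared as character strings give a proper TRUE/FALSE
chr_cmp <- as.character(dates[1]) < as.character(dates[2])

# Convert factor -> character -> yearmon, then build and difference the series
ym      <- as.yearmon(as.character(dates), format = "%Y-%m")
spread  <- zoo(indice, order.by = ym)
dspread <- diff(spread)
print(dspread)
```

A yearmon index has the side benefit that zoo knows the observations are monthly, which a plain character index cannot express.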
I have a dataframe that looks roughly like below (meaning that it is an approximation made for illustration, and not an exact replica of the dataframe you can download through the link below, or get from the dput() I pasted below):
March_created_at  March_email      March_type  April_created_at  April_email  April_type
3/11/12 7:28      jeremy#asynk.ch  PushEvent   4/1/12 4:03                    PushEvent
3/11/12 7:28      jeremy#asynk.ch  PushEvent   4/1/12 4:03                    PushEvent
3/11/12 7:28      jeremy#asynk.ch  PushEvent   4/1/12 4:03                    PushEvent
3/11/12 7:28      jeremy#asynk.ch  PushEvent   4/1/12 7:03       high         IssuesEvent
3/11/12 11:06     medium           PushEvent   4/1/12 13:57      medium       PushEvent
3/11/12 11:06     medium           PushEvent   4/1/12 13:57      medium       PushEvent
3/11/12 11:06     medium           PushEvent   4/1/12 13:57      medium       PushEvent
3/11/12 12:46                      PushEvent
3/11/12 12:46                      PushEvent
3/11/12 12:46                      PushEvent
The full dataset can be found here as a CSV file
I'm looking to write a function that takes the following inputs:
1. A dataframe
2. Certain columns of that dataframe
3. A list of strings (e.g. a set of email addresses)
4. A replacement string (e.g. "low")
Now, I want the function to go through only the specified columns of that dataframe and replace all of the strings (as well as empty cells) that do not match the list of strings specified in point 3 above with the replacement string in point 4. However, this should only be done if the following condition holds:
The cell under consideration needs to have a timestamp for the same month.
For example, let's say we are about to replace the empty cell on row 8 in column "March_email". I can see that on row 8 in the column "March_created_at" there is a timestamp, so I can go ahead and replace this empty cell with the specified string (e.g. "low"). However, look at row 8 in the column "April_email". This cell is also empty, and so is the cell on row 8 in column "April_created_at". In this case, nothing should be done (i.e. no string inserted).
The reason I want to do this is that certain cells are just empty because there is no data, so nothing should be inserted. Other cells are empty because the data is missing, so I need to impute the data based on the function I specified above.
How can I accomplish this in R?
Appendix: Here is a dput() of the head of the dataset:
structure(list(March_created_at = c("2012-03-11 07:28:04", "2012-03-11 07:28:04",
"2012-03-11 07:28:04", "2012-03-11 07:28:19", "2012-03-11 07:28:19",
"2012-03-11 07:28:19"), March_actor_attributes_email = c("jeremy#asynk.ch",
"jeremy#asynk.ch", "jeremy#asynk.ch", "jeremy#asynk.ch", "jeremy#asynk.ch",
"jeremy#asynk.ch"), March_type = c("PushEvent", "PushEvent",
"PushEvent", "PushEvent", "PushEvent", "PushEvent"), April_created_at = c("2012-04-01 04:03:13",
"2012-04-01 04:03:13", "2012-04-01 04:03:13", "2012-04-01 07:03:11",
"2012-04-01 07:03:11", "2012-04-01 07:03:11"), April_actor_attributes_email = c("",
"", "", "high", "high", "high"), April_type = c("PushEvent",
"PushEvent", "PushEvent", "IssuesEvent", "IssuesEvent", "IssuesEvent"
), May_created_at = c("2012-05-01 00:16:05", "2012-05-01 00:16:05",
"2012-05-01 00:16:05", "2012-05-01 01:03:19", "2012-05-01 01:03:19",
"2012-05-01 01:03:19"), May_actor_attributes_email = c("john.firebaugh#gmail.com",
"john.firebaugh#gmail.com", "john.firebaugh#gmail.com", "mitch.tishmack#gmail.com",
"mitch.tishmack#gmail.com", "mitch.tishmack#gmail.com"), May_type = c("PushEvent",
"PushEvent", "PushEvent", "IssueCommentEvent", "IssueCommentEvent",
"IssueCommentEvent"), June_created_at = c("2012-06-01 00:25:05",
"2012-06-01 00:25:05", "2012-06-01 00:25:05", "2012-06-01 00:42:29",
"2012-06-01 00:42:29", "2012-06-01 00:42:29"), June_actor_attributes_email = c("michaelklishin#me.com",
"michaelklishin#me.com", "michaelklishin#me.com", "", "", ""),
June_type = c("IssueCommentEvent", "IssueCommentEvent", "IssueCommentEvent",
"PushEvent", "PushEvent", "PushEvent"), July_created_at = c("2012-07-01 13:46:20",
"2012-07-01 13:46:20", "2012-07-02 11:53:37", "2012-07-02 11:53:37",
"2012-07-02 12:27:30", "2012-07-02 12:27:30"), July_actor_attributes_email = c("medium",
"medium", "ryoqun#gmail.com", "ryoqun#gmail.com", "ryoqun#gmail.com",
"ryoqun#gmail.com"), July_type = c("PushEvent", "PushEvent",
"CreateEvent", "CreateEvent", "PushEvent", "PushEvent"),
August_created_at = c("2012-08-01 00:04:09", "2012-08-01 00:04:09",
"2012-08-01 00:04:42", "2012-08-01 00:04:42", "2012-08-01 00:05:04",
"2012-08-01 00:05:04"), August_actor_attributes_email = c("jeremy#asynk.ch",
"jeremy#asynk.ch", "jeremy#asynk.ch", "jeremy#asynk.ch",
"jeremy#asynk.ch", "jeremy#asynk.ch"), August_type = c("IssueCommentEvent",
"IssueCommentEvent", "IssuesEvent", "IssuesEvent", "IssueCommentEvent",
"IssueCommentEvent"), September_created_at = c("2012-09-01 18:12:24",
"2012-09-01 18:12:24", "2012-09-01 23:51:18", "2012-09-01 23:51:18",
"2012-09-02 00:34:54", "2012-09-02 00:34:54"), September_actor_attributes_email = c("ryoqun#gmail.com",
"ryoqun#gmail.com", "ryoqun#gmail.com", "ryoqun#gmail.com",
"ryoqun#gmail.com", "ryoqun#gmail.com"), September_type = c("CommitCommentEvent",
"CommitCommentEvent", "CreateEvent", "CreateEvent", "PushEvent",
"PushEvent"), October_created_at = c("2012-10-01 07:48:38",
"2012-10-01 10:01:40", "2012-10-01 10:01:43", "2012-10-01 10:17:00",
"2012-10-01 16:08:29", "2012-10-01 18:06:46"), October_actor_attributes_email = c("medium",
"medium", "medium", "medium", "", "core"), October_type = c("PushEvent",
"IssuesEvent", "PushEvent", "PushEvent", "ForkEvent", "PullRequestEvent"
)), .Names = c("March_created_at", "March_actor_attributes_email",
"March_type", "April_created_at", "April_actor_attributes_email",
"April_type", "May_created_at", "May_actor_attributes_email",
"May_type", "June_created_at", "June_actor_attributes_email",
"June_type", "July_created_at", "July_actor_attributes_email",
"July_type", "August_created_at", "August_actor_attributes_email",
"August_type", "September_created_at", "September_actor_attributes_email",
"September_type", "October_created_at", "October_actor_attributes_email",
"October_type"), row.names = c(NA, 6L), class = "data.frame")
How about something like this:
myfun <- function(month, DF, matches, replacement) {
  email.col <- paste0(month, '_actor_attributes_email')
  date.col <- paste0(month, '_created_at')
  # Replace only where a timestamp exists and the email is not in `matches`;
  # otherwise keep the cell as it is.
  DF[[email.col]] <- ifelse(DF[[date.col]] != '' & !DF[[email.col]] %in% matches,
                            replacement,
                            DF[[email.col]])
  return (DF[, c(date.col, email.col)])
}
myfun('April', dat, 'high', 'foo')
# April_created_at April_actor_attributes_email
# 1 2012-04-01 04:03:13 foo
# 2 2012-04-01 04:03:13 foo
# 3 2012-04-01 04:03:13 foo
# 4 2012-04-01 07:03:11 high
# 5 2012-04-01 07:03:11 high
# 6 2012-04-01 07:03:11 high
Then, you can just feed it a bunch of months...
out <- lapply(list('March', 'April', 'May'),
              myfun, DF=dat, matches='high', replacement='foo')
And you can get that back into a data.frame right quick (plyr is not needed for this):
as.data.frame(unlist(out, recursive=FALSE))
There are plenty of other ways and options but this should give you a big start.
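One of those other ways, as a minimal sketch: a single function that takes the whole data frame and a vector of months, and returns the full data frame with the email columns filled in. The name `impute_emails`, the small `dat` example and the address "someone#example.org" are hypothetical; the column naming scheme (`<Month>_created_at`, `<Month>_actor_attributes_email`) is taken from the dput in the question, and empty cells are assumed to be "" or NA.

```r
# Hypothetical helper, not part of the original answer
impute_emails <- function(DF, months, matches, replacement) {
  for (month in months) {
    email.col <- paste0(month, "_actor_attributes_email")
    date.col  <- paste0(month, "_created_at")
    # A row qualifies only if it has a timestamp for that month ...
    has.date <- !is.na(DF[[date.col]]) & DF[[date.col]] != ""
    # ... and its email is not one of the strings to keep
    no.match <- !(DF[[email.col]] %in% matches)
    DF[[email.col]][has.date & no.match] <- replacement
  }
  DF
}

# Made-up three-row example covering the cases described in the question
dat <- data.frame(
  March_created_at = c("2012-03-11 07:28:04", "2012-03-11 12:46:00", ""),
  March_actor_attributes_email = c("medium", "", ""),
  April_created_at = c("2012-04-01 04:03:13", "", ""),
  April_actor_attributes_email = c("someone#example.org", "", ""),
  stringsAsFactors = FALSE
)

res <- impute_emails(dat, c("March", "April"),
                     matches = c("high", "medium"), replacement = "low")
print(res)
```

Rows without a timestamp are left untouched, so cells that are empty because there is no data stay empty, while cells with a timestamp but no matching string get the replacement.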