Basically I want to know why as.Date(200322,format="%Y%W") gives me NA. While we are at it, I would appreciate any advice on a data structure for repeated cross-section (aka pseudo-panel) in R.
I did get aggregate() to (sort of) work, but it is not flexible enough - for example, it drops data in other columns when I omit the missing values.
Specifically, I have a survey that is repeated weekly for a couple of years, with a bunch of similar questions whose answers I would like to combine, average, condition and plot in both dimensions. Getting the date conversion right should presumably help me towards my goal with the zoo package or something similar.
Any input is appreciated.
Update: thanks for the string suggestion, but as you can see in your own example, the %W part doesn't work - it only identifies the year and fills in the current day, whereas I need to set a specific week (and leave the day blank).
Use a string as the first argument in as.Date() and select a specific weekday (format %w, value 0-6). There are seven possible dates in each week, so strptime needs more information to select a unique date. Otherwise the current day and month are returned.
> as.Date(paste("200947", "0", sep="-"), format="%Y%W-%w")
[1] "2009-11-22"
Related
This is my first time ever asking a question on Stack Overflow and I'm a programming novice so any advice as to how to improve my question asking abilities would be appreciated.
Onto my question: I have two CSV files. The first contains three columns: date time in dd/mm/yyyy hh:(00 or 30) format, production of a certain product, and demand for said product. The second contains several columns: the date time decomposed into year, month, day, hour, and half-hour (1 for :00, 2 for :30), alongside several independent variables which may affect production/demand of said product.
I've only played around with the first CSV file, converting the string into a datetime object, but the ts() function won't recognise the datetime objects as my times. I've tried adjusting the frequency parameter but ultimately failed, and I have no idea how to create a time series from half-hourly data. Would appreciate any help.
Thanks in advance!
My suggestion is to apply difftime() over all your time data. For instance, as in the following code, use your initial time (the time of the first record) as time_start for all comparisons and each of the other times as time_finish. It then returns the time intervals as a number of seconds, and you can use the other column values as the values at those time stamps.
interval <- as.integer(difftime(strptime(time_finish, "%H:%M"), strptime(time_start, "%H:%M"), units = "secs"))
Second 0 10 15 ....
Just looking for help working with some dates in R. Code for a simple data frame is below, with one column of start dates and one column of end dates. I would like to create a new column with the difference in days between each set of dates - start date and end date. Also, the dates are in different formats, so is there an easy way to convert all dates to a similar format? I've been reading about the lubridate package but haven't found anything yet on this particular situation that is easy for me to quickly learn as an R newbie. It would be great to link the answer to the dplyr pipeline as well, if possible, to calculate average number of days, etc.
Start.date<-c("05-May-15", "10-June-15", "July-12-2015")
End.date<-c("12-July-15", "2015-Aug-15", "Sept-12-2015")
Dates.df<-data.frame(Start.date,End.date)
I have time series data that I'm trying to analyse in R. It was provided as a CSV from Excel, which I subsequently read into a data.frame called all. Let's say it has two columns: all$date and all$people, representing the count of people on a particular date. The frequency is hence daily.
Being from Excel, the dates are integers representing the number of days since 1900-01-01.
I could read the data as people = ts(all$people, start=c(all$date[1], 1), frequency=365); but that gives a silly start value of almost 40000 because the data starts in 2006. The start parameter doesn't take a date object, according to ?ts, so I can't just use as.Date():
ts - ...
start: the time of the first observation. Either a single number
or a vector of two integers, which specify a natural time unit and
a (1-based) number of samples into the time unit. See the examples
for the use of the second form.
I could of course set start=1, but it's a bit painful to figure out what season we're in when the plot tells me interesting things are happening around day 2100. (To be clear, setting frequency=365 does tell me what year we're in, but isn't useful for more precise dates.) Is there a useful way of expressing the date in ts in a human-readable form so that I don't have to keep calling as.Date() to understand when the interesting features are happening?
I am working on genealogical software that stores its data in SQLite3 format. Everything works fine, except for one minor detail. Not in all cases is the accuracy of the birth or death dates (etc) available to the exact day. So I have the following accuracies:
exact (YYYY-MM-DD)
month (YYYY-MM)
year (YYYY)
year (YYYY+/-5)
year (YYYY+/-10)
year (YYYY+/-50)
decade
century
Now, assuming I store everything in a single column, I end up with a problem. Since SQLite3 has the Julian Day function I was thinking to encode the accuracy in the fractional part of the REAL Julian Day (I don't need the hours anyway). That is fine, but it complicates the way SELECTs work, in fact it means that stuff I could otherwise offload to SQLite3 has to be implemented in application code.
What would be a reasonable method to store the inaccurate dates and be able to query them quickly?
Note: if it matters to anyone answering, the language used is Python, but I am asking in general.
When doing queries on those date values, the most common operation probably is to check whether a date might match another date.
For this, you always need the start and the end of the interval, so it would make sense to store these two values in the DB.
(Call them Start/End or Min/Max or Earliest/Latest or whatever makes sense.)
For example, to find people who might have been born one century ago:
... WHERE '1913-04-16' BETWEEN BirthDateMin AND BirthDateMax
Inequality comparisons can be done with one of the interval boundaries.
For example, to find people who might have been born more than one century ago:
... WHERE BirthDateMin < '1913-04-16'
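Since the question mentions Python, here is a minimal sketch of the two queries above using the standard sqlite3 module. The table name person, the sample rows, and the helper names might_match and possibly_before are assumptions made for illustration; only the BirthDateMin/BirthDateMax columns come from the answer. It relies on ISO-8601 date strings (YYYY-MM-DD) comparing correctly as text.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (name TEXT, BirthDateMin TEXT, BirthDateMax TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("exact day", "1913-04-16", "1913-04-16"),     # known to the day
        ("month only", "1913-04-01", "1913-04-30"),    # April 1913
        ("year +/- 5", "1905-01-01", "1915-12-31"),    # 1910 +/- 5 years
        ("decade 1920s", "1920-01-01", "1929-12-31"),  # some time in the 1920s
    ],
)


def might_match(conn, day):
    """Names of people whose birth interval contains the given day."""
    rows = conn.execute(
        "SELECT name FROM person "
        "WHERE ? BETWEEN BirthDateMin AND BirthDateMax ORDER BY name",
        (day,),
    )
    return [r[0] for r in rows]


def possibly_before(conn, day):
    """Names of people who might have been born before the given day."""
    rows = conn.execute(
        "SELECT name FROM person WHERE BirthDateMin < ? ORDER BY name",
        (day,),
    )
    return [r[0] for r in rows]


print(might_match(conn, "1913-04-16"))
print(possibly_before(conn, "1913-04-16"))
```

Because both queries compare a stored column directly against a constant, an ordinary index on the boundary columns should be usable, which is the main attraction of storing the interval explicitly.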
Just because you're storing date information doesn't mean that the built-in date type is the right one for you. Your data requirements (date inaccuracy) mean that it's probably more accurate and better long-term to do some custom date-handling work, and avoid using the built-in date data types.
Use two columns. One column is the approximate date, as accurate as possible, in SQLite format. The second column is the accuracy of the date in days. If the date is absolutely accurate, the second column is zero. If only the month is known, the date would be mid month and the second column 15 days. Etc. Date comparisons can be done by comparing against the date +/- the accuracy column.
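A rough sketch of this two-column layout, again with Python's sqlite3 module. The table name person, the column names BirthDate and AccuracyDays, the sample rows, and the helper might_match are all hypothetical; the comparison uses SQLite's julianday() function to turn the date difference into days.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (name TEXT, BirthDate TEXT, AccuracyDays INTEGER)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("exact day", "1913-04-16", 0),     # fully accurate
        ("month only", "1913-04-15", 15),   # mid-month, +/- 15 days
        ("year only", "1913-07-02", 183),   # mid-year, +/- 183 days
        ("other year", "1920-07-02", 183),
    ],
)


def might_match(conn, day):
    """Names whose stored date lies within its accuracy window of `day`."""
    rows = conn.execute(
        "SELECT name FROM person "
        "WHERE abs(julianday(BirthDate) - julianday(?)) <= AccuracyDays "
        "ORDER BY name",
        (day,),
    )
    return [r[0] for r in rows]


print(might_match(conn, "1913-04-16"))
print(might_match(conn, "1913-05-01"))
```

One trade-off to be aware of: the WHERE clause wraps the column in an expression, so a plain index on BirthDate is unlikely to help here, and a symmetric +/- window only approximates months and years of unequal length.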
I have an xts time series in R and am using the very handy feature of subsetting the time series based on a string, for example
time_series["17/06/2006 12:00:00"]
This will return the nearest observation to that date/time - which is very handy in many situations. However, in this particular situation I only want to return the elements of the time series which are at that exact time. Is there a way to do this in xts using a nice date/time string like this?
In a more general case (I don't have this problem immediately now, but suspect I may run into it soon) - is it possible to extract the closest observation within a certain period of time? For example, the closest observation to the given date/time, assuming it is within 10 minutes of the given date/time - otherwise just discard that observation.
I suspect this more general case may require me to write a function to do this - which I am happy to do - I just wanted to check whether the more specific case (or the general case) was already catered for in xts.
AFAIK, the only way to do this is to use a subset that begins at the time you're interested in, then get the first observation of that.
e.g.
first(time_series["2006-06-17 12:00:00/2006-06-17 12:01"])
or, more generally, to get the 12:00 price every day, you can subset down to 1 minute of each day, then split by days and extract the first observation of each.
do.call(rbind, lapply(split(time_series["T12:00:00/T12:01"],'days'), first))
Here's a thread where Jeff (the xts author) contemplates adding the functionality you want
http://r.789695.n4.nabble.com/Find-first-trade-of-day-in-xts-object-td3598441.html#a3599887