Descriptive statistics of time variables - r

I want to compute simple descriptive statistics (mean, etc.) of the times when people go to bed. I ran into two problems. The original data comes from an Excel file in which just the time people went to bed was typed in, in 24-hour format. My problem is that R so far doesn't recognize when people went to bed at 1.00 am the next day, meaning that a person who went to bed at 10 pm should be 3 hrs apart from the one at 1.00 am (and not 21 hrs).
In my data frame the variable in_bed is in POSIXct format, so I thought to apply an if-function saying that if the time is before 12:00, then I want to add 24 hrs.
My function is:
Patr$in_bed <- if(Patr$in_bed < "1899-12-30 12:00") {
  Patr$in_bed + 24*60*60
}
My data frame looks like this
in_bed
1 1899-12-30 22:13:00
2 1899-12-30 23:44:00
3 1899-12-30 00:08:00
If I run my function my variable gets deleted and the following error message gets printed:
Warning message:
In if (Patr$in_bed < "1899-12-30 12:00") { :
the condition has length > 1 and only the first element will be used
What am I doing wrong, or does anyone have a better idea? And can I run commands such as mean on variables in POSIXct format, and if not, how do I do it?

When you compare Patr$in_bed (a vector) with "1899-12-30 12:00" (a single value), you get a logical vector. But the if statement requires a single logical, so it generates a warning and considers only the first element of the vector.
You can try:
Patr$in_bed <- Patr$in_bed + 24*60*60 * (Patr$in_bed < as.POSIXct("1899-12-30 12:00"))
Explanation: the comparison in the parentheses returns a logical vector, which gets converted to integer (0 for FALSE and 1 for TRUE). The dates for which the condition is TRUE therefore get +24*60*60 added, and the other dates get +0.
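An equivalent, perhaps more readable, version uses ifelse() (a sketch assuming the same Patr data frame):
Patr$in_bed <- Patr$in_bed +
  ifelse(Patr$in_bed < as.POSIXct("1899-12-30 12:00"), 24*60*60, 0)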
But since the POSIXct format includes the date, I don't see the purpose of adding 24 hrs. For instance,
as.POSIXct("1899-12-31 01:00:00") - as.POSIXct("1899-12-30 22:00:00")
returns a time difference of 3 hours, not 21.
To answer your last question, yes, you can compute the mean of a POSIXct vector, simply with:
mean(Patr$in_bed)
Hope it helps,
Jérémy

Related

Trying to extract a date from a 5 or 6-digit number

I am trying to extract a date from a number. The date is stored as the first 6 digits of an 11-digit personal ID number (day-month-year). Unfortunately the cloud-based database (REDCap) outputs this as a number, so the leading zero is dropped for those born on the first nine days of the month, and they end up with a 10-digit ID number instead of an 11-digit one. I managed to extract the 6- or 5-digit number corresponding to the date, i.e. 311230 for 31st December 1930, or 11230 for 1st December 1930. I end up with two problems that I have not been able to solve.
Let's say we use the following numbers:
dato <- c(311230, 311245, 311267, 311268, 310169, 201104, 51230, 51269, 51204)
I convert these into string, and then apply the as.Date() function:
datostr <- as.character(dato)
datofinal <- as.Date(datostr, "%d%m%y")
datofinal
The problems I have are:
Five-digit numbers (e.g. 11230) get reported as NA.
Six-digit numbers are recognized, but those born before 1.1.1969 get reported with 100 years added, i.e. 010160 gets converted to 2060-01-01.
I am sure this must be easy for those who are more knowledgeable about R, but, I struggle a bit solving this. Any help is greatly appreciated.
Greetings
Bjorn
If your 5-digit numbers really just need to be zero-padded, then
dato_s <- sprintf("%06d", dato)
dato_s
# [1] "311230" "311245" "311267" "311268" "310169" "201104" "051230" "051269" "051204"
From there, for your question about "dates before 1969", take a look at ?strptime for the '%y' pattern:
'%y' Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2018 POSIX standard, but it does also say
'it is expected that in a future version the default century
inferred from a 2-digit year will change'.
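A quick illustration of that rule with the current defaults:
as.Date("311267", format = "%d%m%y")  # "2067-12-31": 67 is in 00-68, so century 20
as.Date("311269", format = "%d%m%y")  # "1969-12-31": 69 is in 69-99, so century 19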
So if you have specific alternate years for those, you need to add the century before sending to as.Date (which uses strptime-patterns).
dato_d <- as.Date(gsub("([0-4][0-9])$", "20\\1",
                       gsub("([5-9][0-9])$", "19\\1", dato_s)),
                  format = "%d%m%Y")
dato_d
# [1] "2030-12-31" "2045-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "2030-12-05" "1969-12-05" "2004-12-05"
In this case, I'm assuming 50-99 map to the 1900s and everything else to the 2000s. If you need the 40s or 30s, feel free to adjust the patterns: add digits to the second pattern (e.g., [3-9]) and remove them from the first (e.g., [0-2]), ensuring that every decade is matched by exactly one pattern, not "neither" and not "both".
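For example, a hypothetical variant that maps 40-99 to 19xx and 00-39 to 20xx (same dato_s as above; this cutoff is just an illustration, not from the original data) would be:
dato_d <- as.Date(gsub("([0-3][0-9])$", "20\\1",
                       gsub("([4-9][0-9])$", "19\\1", dato_s)),
                  format = "%d%m%Y")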
Borrowing from Allan's answer, I like that assumption of now() (since you did mention "born on"). Without lubridate, try this:
dato_s <- sprintf("%06d", dato)
dato_d <- as.Date(dato_s, format = "%d%m%y")
dato_d[ dato_d > Sys.Date() ] <-
as.Date(sub("([0-9]{2})$", "19\\1", dato_s[ dato_d > Sys.Date() ]), format = "%d%m%Y")
dato_d
# [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "1930-12-05" "1969-12-05" "2004-12-05"
You can make this a bit easier using lubridate, and noting that no-one can have a date of birth that is in the future of the current time:
library(lubridate)
dato <- dmy(sprintf("%06d", dato))
dato[dato > now()] <- dato[dato > now()] - years(100)
dato
#> [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31"
#> [6] "2004-11-20" "1930-12-05" "1969-12-05" "2004-12-05"
Of course, without further information, this method will not (nor will any other method) be able to pick out the edge cases of people who are aged over 100. This might be easy to determine from the context.
Converting five digit "numbers" to six digits is straightforward: x <- stringr::str_pad(x, 6, pad="0") or similar will do the trick.
Your problem with years is the Millennium bug revisited. You'll have to consult with whoever compiled your data to see what assumptions they used.
I suspect all dates on or before 31Dec1970 are affected, not just those before 01Jan1960. That's because as.Date uses a default origin of 01Jan1970 when deciding how to handle two-digit years. So your solution is to pick an appropriate origin in your conversion to fix this dataset. Something like d <- as.Date(x, origin="1900-01-01"). And then start using four-digit years in the future! ;)

R: data.table. How to save dates properly with fwrite?

I have a dataset. I can choose to load it into R from a Stata file or from an SPSS file.
In both cases it's loaded properly with the haven package.
The dates are recognized properly.
But when I save it to disk with data.table's fwrite function,
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)
I have a problem: the dates disappear and are converted to different numbers. For example, the date 1967-08-06 is saved in the csv file as -879.
I've also tried playing with fwrite options, such as quote=FALSE, with no success.
I've uploaded a small sample of the files: the SPSS file, the Stata file and the saved csv.
And this is the code, to make things easier for you.
library(haven)
library(data.table)
ppp <- read_sav("pspss.sav") # choose one of these two.
ppp <- read_dta("pstata.dta") # choose one of these two.
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)
The real table has more than one thousand variables and one million individuals. That's why I would like a fast way of doing things.
http://www73.zippyshare.com/v/OwzwbyQq/file.html
This is for #ArtificialBreeze:
> head(my)
# A tibble: 6 x 9
ID_2006_2011 TIS FECHA_NAC_2006 año2006 Edad_31_12_2006 SEXO_2006
<dbl> <chr> <date> <date> <dbl> <chr>
1 1.60701e+11 BBNR670806504015 1967-08-06 2006-12-31 39 M
2 1.60701e+11 BCBD580954916014 1958-09-14 2006-12-31 48 F
3 1.60701e+11 BCBL451245916015 1945-12-05 2006-12-31 61 F
4 1.60701e+11 BCGR610904916012 1961-09-04 2006-12-31 45 M
5 1.60701e+11 BCMR580148916015 1958-01-08 2006-12-31 48 F
6 1.60701e+11 BCMX530356917018 1953-03-16 2006-12-31 53 F
# ... with 3 more variables: PAIS_NAC_2006 <dbl>, FECHA_ALTA_TIS_2006 <date>,
# FECHA_ALTA_TIS_2006n <date>
Since this question was asked 6 months ago, fwrite has improved and been released to CRAN. I believe it should work as you wanted now; i.e. fast, direct and convenient date formatting. It now has the dateTimeAs argument as follows, copied from fwrite's manual page for v1.10.0 as on CRAN now. As time progresses, please check the latest version of the manual page.
====
dateTimeAs : How Date/IDate, ITime and POSIXct items are written.
"ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.
"squash" - 20160912, 181216 and 20160912181216999. This option allows fast and simple extraction of yyyy, mm, dd and (most commonly to group by) yyyymm parts using integer div and mod operations. In R for example, one line helper functions could use %/%10000, %/%100%%100, %%100 and %/%100 respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if 000) which may be read comfortably as integer64 (automatically by fread()).
"epoch" - 17056, 65536 and 1473703936.999999. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see ?Date). 0, 3 or 6 digits of fractional seconds are printed if and when present.
"write.csv" - this currently affects POSIXct only. It is written as write.csv does by using the as.character method which heeds digits.secs and converts from R's internal UTC representation back to local time (or the "tzone" attribute) as of that historical date. Accordingly this can be slow. All other column types (including Date, IDate and ITime which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with write.csv.
The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for Date and IDate is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range. This option applies to vectors of date/time in list column cells, too. A fully flexible format string (such as "%m/%d/%Y") is not supported. This is to encourage use of ISO standards and because that flexibility is not known how to make fast at C level. We may be able to support one or two more specific options if required.
====
I had the same problem, and I just changed the date column to as.character before writing, and then changed it back to as.Date after reading. I don't know how it influences read and write times, but it was a good enough solution for me.
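A minimal sketch of that workaround with data.table (the column detection and names here are my own illustration, not from the original post):
library(data.table)
setDT(ppp)                                              # ppp as read by haven
date_cols <- names(ppp)[sapply(ppp, inherits, "Date")]  # find the Date columns
ppp[, (date_cols) := lapply(.SD, as.character), .SDcols = date_cols]
fwrite(ppp, "ppp.csv", sep = ",", col.names = TRUE)
# after reading back with fread(), convert them again:
# ppp[, (date_cols) := lapply(.SD, as.Date), .SDcols = date_cols]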
These numbers make sense :) It seems that fwrite writes the dates as their underlying numeric values, counted from the origin "1970-01-01".
If you read your data back, you can simply change the numbers into dates using this code:
my$FECHA_NAC_2006<-as.Date(as.numeric(my$FECHA_NAC_2006),origin="1970-01-01")
For example
as.Date(-879,origin="1970-01-01")
[1] "1967-08-06"
Since it seems there is no simple solution, I'm storing the column classes so I can change them back again.
I take the original dataset ppp,
areDates <- (sapply(ppp, class) == "Date")
I save it to a file so I can read it back next time.
ppp <- fread("ppp.csv", encoding="UTF-8")
And now I change the classes of the newly read dataset back to the original one.
ppp[, names(ppp)[areDates] := lapply(.SD, as.Date),
    .SDcols = areDates]
Maybe someone can write it better with a for loop and the command set.
ppp[, lapply(.SD, setattr, "class", "Date"),
    .SDcols = areDates]
It can also be written with positions instead of a vector of TRUE and FALSE values.
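For instance, a rough sketch with a for loop and set(), using column positions (my own sketch, not tested on the real data; assumes areDates as defined above):
for (j in which(areDates)) {
  # set() updates ppp by reference, one column at a time
  set(ppp, j = j, value = as.Date(ppp[[j]]))
}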
You need to add the argument: dateTimeAs = "ISO".
By adding the dateTimeAs = argument and specifying the appropriate option, you will get dates written in your csv file with the desired format AND with their respective time zone.
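For example, a minimal sketch of the call (same ppp as above):
fwrite(ppp, "ppp.csv", sep = ",", col.names = TRUE, dateTimeAs = "ISO")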
This is particularly important when dealing with POSIXct variables, which are time zone dependent. Leaving out this argument might affect the timestamps written in the csv file by shifting dates and times according to the difference in hours between time zones. So for a date/time variable of class POSIXct, you will need to add dateTimeAs = "write.csv"; unfortunately this option can be slow (https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/fwrite?). Good luck!!!

Conditional Label in R without Loops

I'm trying to find the best way (best as in performance) to take a data frame of the form below and add a new column called "Season" holding the four seasons of the year:
MON DAY YEAR
1 1 1 2010
2 1 1 2010
3 1 1 2010
4 1 1 2010
5 1 1 2010
6 1 1 2010
One straightforward way to do this is to create a loop conditioned on the MON and DAY columns and assign the value row by row, but I think there is a better way. I've seen suggestions in other posts for ifelse or := or apply, but most of the problems stated are just binary, or the value can be assigned with a single function f of the parameters.
In my situation I believe a vector containing the four season labels, plus the conditions somehow, would suffice, but I don't see how to put everything together. My situation resembles a switch-case more.
Using modulo arithmetic, and the fact that arithmetic operators coerce logical values to 0/1, will be far more efficient if the number of rows is large:
d$SEASON <- with(d, c("Winter", "Spring", "Summer", "Autumn")[
                       1 + (((DAY >= 21) + MON - 1) %/% 3) %% 4])
The first added "1" shifts the range of the %%4 operationon all the results inside the parentheses from 0:3 to 1:4. The second subtracted "1" shifts the (inner) 1:12 range back to 0:11 and the (DAY >= 21) advances the boundary months forward one.
I'll start by giving a simple answer, then I'll delve into the details.
A quick way to do this would be to check the values of MON and DAY and output the correct season. This is trivial:
f = function(m, d) {
  # returns 0 = Spring, 1 = Summer, 2 = Autumn, 3 = Winter
  if (m == 12 && d >= 21) i = 3
  else if (m > 9 || (m == 9 && d >= 21)) i = 2
  else if (m > 6 || (m == 6 && d >= 21)) i = 1
  else if (m > 3 || (m == 3 && d >= 21)) i = 0
  else i = 3
  i
}
This f function, given a month and a day, returns an integer corresponding to the season (it doesn't matter much whether it's an integer or a string; an integer just saves a bit of memory, but that's a technicality).
Now you want to apply it to your data.frame. No need to use a loop for this; we'll use mapply. d will be our simulated data.frame. We'll factor the output to get nice season names.
d = data.frame(MON = rep(1:12, each = 30), DAY = rep(1:30, 12), YEAR = 2012)
d$SEA = factor(
  mapply(f, d$MON, d$DAY),
  levels = 0:3,
  labels = c("Spring", "Summer", "Autumn", "Winter")
)
There you have it !
I realize seasons don't always change on the 21st. If you need fine tuning, you could define a 3-dimensional array as a global variable to store the exact days. Given a season and a year, you could look up the corresponding day and replace the "21"s in the f function with the right lookups (you would obviously add a third argument for the year).
About the things you mentioned in your question:
ifelse is the "functional" way to make a conditional test. On atomic variables it's only slightly better than conditional statements, but it is vectorized, meaning that if the argument is a vector, it loops over the elements itself. I'm not that familiar with it, but it's the way to go for an optimized solution.
mapply is derived from sapply in the "apply" family and lets you call a function with several vector arguments (see ?mapply).
I don't think := is a standard operator in R, which brings me to my next point:
data.table! It's a package that provides a new structure that extends data.frame for fast computation and concise syntax (among other things). := is an operator in that package that lets you define new columns. In our case you could write d[, SEA := mapply(f, MON, DAY)] if d is a data.table.
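A minimal data.table sketch of that (assuming f and d as defined above):
library(data.table)
setDT(d)  # convert the data.frame to a data.table by reference
d[, SEA := factor(mapply(f, MON, DAY), levels = 0:3,
                  labels = c("Spring", "Summer", "Autumn", "Winter"))]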
If you really care about performance, I can't insist enough on using data.table, as it is a major improvement when you have a lot of data. I don't know whether it would really change the computing time with the solution I proposed, though.

Create warnings in R

I want to write a script which makes R usable for "everybody" for this specific kind of analysis. Is there a possibility to create warnings?
time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3
For example, if the value is 0 at least 3 times in a row (or better, within a set period of time, e.g. 3 days), give a warning and name the date(s). Maybe create something like a report if I am combining conditions.
In general: measurement data are read via read.csv and the date is then set with as.POSIXct - xts/zoo. I want the "user" to get a clear message if the values are changing etc., or if they are 0 for a long time etc.
The second step would be sending emails - maybe running on a server later.
Additional Questions:
I now have a df in xts - is it possible to check whether the value is greater than a threshold value? It's not working because it's not an atomic vector.
Thanks
Try this.
x <- read.table(text = "time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3", header = TRUE, sep = ",")
r <- rle(x$value)
if (any(r$values == 0 & r$lengths >= 3)) warning("I noticed some dates have value 0 at least three times.")
Warning message:
I noticed some dates have value 0 at least three times.
I'll leave it to you as a training exercise to paste a warning message that would also give you the date(s).
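If you want a starting point for that exercise, here is one rough sketch (same x as above; the variable names are mine) that also reports where each run of zeros starts:
r <- rle(x$value == 0)
run_end   <- cumsum(r$lengths)
run_start <- run_end - r$lengths + 1
bad <- which(r$values & r$lengths >= 3)
if (length(bad) > 0) {
  warning("Value is 0 at least three times in a row, starting on: ",
          paste(x$time[run_start[bad]], collapse = ", "))
}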

Forcing full weeks with apply.weekly()

I'm trying to figure out what xts (or zoo) uses as the timestamp after doing a period apply. Consider the following:
> myTs = xts(1:10, as.Date(1:10, origin = '2012-12-1'))
> apply.weekly(myTs, colSums)
[,1]
2012-12-02 1
2012-12-09 35
2012-12-11 19
I think the '2012-12-02' means "for the week ending 2012-12-02, the sum is 1". So basically the time is the end of the week.
But the problem is with that "2012-12-11" - I think what it's doing is saying that the 11th is the last day of the week that was given, so it's giving that as the time.
Is there any way to force it to give the Sunday on which the week ends, even if that day was not included in the data set?
Try this:
nextsun <- function(x) 7 * ceiling(as.numeric(x-0+4) / 7) + as.Date(0-4)
aggregate(myTs, nextsun, sum)
where nextsun was derived from nextfri code given in the zoo quick reference by replacing 5 (for Friday) with 0 (for Sunday).
Those are full weeks. It's only showing you the date of the very last observation. See ?endpoints (apply.weekly is essentially a thin wrapper around endpoints).
apply.weekly
function (x, FUN, ...)
{
ep <- endpoints(x, "weeks")
period.apply(x, ep, FUN, ...)
}
<environment: namespace:xts>
From ?endpoints
endpoints returns a numeric vector corresponding to the last
observation in each period specified by on, with a zero added to the
beginning of the vector, and the index of the last observation in x at
the end.
Valid values for the argument on include: “us” (microseconds),
“microseconds”, “ms” (milliseconds), “milliseconds”, “secs” (seconds),
“seconds”, “mins” (minutes), “minutes”, “hours”, “days”, “weeks”,
“months”, “quarters”, and “years”.
The answer to your second question is no, there is no option to do so. But you could always edit the last date manually; if you're going to present all the data wrapped up anyway, I don't see any harm in it.
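For illustration, a minimal sketch of patching that last date by hand (assuming myTs from the question and that 2012-12-16 is the Sunday you want):
wk <- apply.weekly(myTs, colSums)
idx <- index(wk)
idx[length(idx)] <- as.Date("2012-12-16")  # the Sunday ending the partial last week
index(wk) <- idx
wk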
No, you can't force it to give you the Sunday.
Because the index of the result of period.apply is given by
ep <- endpoints(myTs,'weeks')
myTs[ep]
[,1]
2012-12-02 2
2012-12-09 9
2012-12-10 10
So you need to shift the last date. Unfortunately xts doesn't offer an option for this; you have to overwrite that last index value yourself (maybe a design choice, to keep the index tied to actual observations).
E.g. you can do the following:
ts.weeks <- apply.weekly(myTs, colSums)
# replace the last index entry with the Sunday that ends the partial week
index(ts.weeks)[length(ts.weeks)] <- last(index(myTs)) + 7 - last(floor(diff(ep)))
