Assistance finding the floor and ceiling of a vector - r

I have created vectors for months and month numbers, then for months 6-9, and from there, the mean...
mean(Summer)
Months[mean(Summer)]
July
Now I have to use the floor and ceiling functions to return the lower and upper month limits for the average summer month.
I thought I was on the right track, but I'm getting errors! Thoughts?
months[x]
months[floor(x)] # July
months[ceiling(x)] # August

It's not exactly clear what you are doing, but here's my best effort at a working example:
# This is built into R and it gives month names
month.name
# Let's make seasons:
Spring = 3:5
Summer = 6:8
Autumn = 9:11
Winter = c(12,1,2)
# If each season has 3 months, then we can pull out a particular month like so:
month.name[Summer[1]]
month.name[Summer[2]]
month.name[Summer[3]]
# We can also use the min, mean, and max technique for summer:
month.name[min(Summer)]
month.name[mean(Summer)]
month.name[max(Summer)]
# But it doesn't really work for Winter:
month.name[min(Winter)]
month.name[mean(Winter)]
month.name[max(Winter)]
My guess is that the main confusion here is around the functionality of floor. floor() takes a vector and applies the floor operation (round down to the nearest integer) to each element, so floor(c(1.3, 2.7, 3)) returns c(1, 2, 3) and floor(1.3) returns 1. ceiling() works the same way, except it rounds up.
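For instance, with your four-month summer (months 6-9) the mean is fractional, and floor/ceiling bracket it on either side (a minimal sketch, assuming x holds that mean):
x <- mean(6:9)          # 7.5, the "average summer month"
month.name[floor(x)]    # "July"
month.name[ceiling(x)]  # "August"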
Hopefully this helps with whatever you are trying to do. In general, it's helpful to provide a reproducible example so that others can run the exact same code you are running and help debug your problem.

Trying to extract a date from a 5 or 6-digit number

I am trying to extract a date from a number. The date is stored as the first 6 digits of an 11-digit personal ID number (day-month-year). Unfortunately the cloud-based database (REDCap) outputs this as a number, so for those born in the first nine days of the month the leading zero is dropped and the ID ends up with 10 digits instead of 11. I managed to extract the 6- or 5-digit number corresponding to the date, e.g. 311230 for 31st December 1930, or 11230 for 1st December 1930. I end up with two problems that I have not been able to solve.
Let's say we use the following numbers:
dato <- c(311230, 311245, 311267, 311268, 310169, 201104, 51230, 51269, 51204)
I convert these into strings, and then apply the as.Date() function:
datostr <- as.character(dato)
datofinal <- as.Date(datostr, "%d%m%y")
datofinal
The problems I have are:
Five-digit numbers (e.g. 11230) get reported as NA.
Six-digit numbers are recognized, but those born before 1.1.1969 get reported with 100 years added, e.g. 010160 gets converted to 2060-01-01.
I am sure this must be easy for those who are more knowledgeable about R, but I struggle a bit solving this. Any help is greatly appreciated.
Greetings
Bjorn
If your 5-digit numbers really just need to be zero-padded, then
dato_s <- sprintf("%06d", dato)
dato_s
# [1] "311230" "311245" "311267" "311268" "310169" "201104" "051230" "051269" "051204"
As for your question about dates before 1969, take a look at ?strptime for the '%y' pattern:
'%y' Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2018 POSIX standard, but it does also say
'it is expected that in a future version the default century
inferred from a 2-digit year will change'.
So if you have specific alternate years for those, you need to add the century before sending to as.Date (which uses strptime-patterns).
dato_d <- as.Date(gsub("([0-4][0-9])$", "20\\1",
gsub("([5-9][0-9])$", "19\\1", dato_s)),
format = "%d%m%Y")
dato_d
# [1] "2030-12-31" "2045-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "2030-12-05" "1969-12-05" "2004-12-05"
In this case, I'm assuming years 50-99 map to 19xx and everything else to 20xx. If you need the 40s or 30s, feel free to adjust the patterns: add digits to the second pattern (e.g., [3-9]) and remove them from the first pattern (e.g., [0-2]), ensuring that all decades are included in exactly one pattern, not "neither" and not "both".
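For instance, a sketch of the adjusted call that treats 30-99 as 19xx and only 00-29 as 20xx:
dato_d2 <- as.Date(gsub("([0-2][0-9])$", "20\\1",
                        gsub("([3-9][0-9])$", "19\\1", dato_s)),
                   format = "%d%m%Y")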
Borrowing from Allan's answer, I like that assumption of now() (since you did mention "born on"). Without lubridate, try this:
dato_s <- sprintf("%06d", dato)
dato_d <- as.Date(dato_s, format = "%d%m%y")
dato_d[ dato_d > Sys.Date() ] <-
as.Date(sub("([0-9]{2})$", "19\\1", dato_s[ dato_d > Sys.Date() ]), format = "%d%m%Y")
dato_d
# [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "1930-12-05" "1969-12-05" "2004-12-05"
You can make this a bit easier using lubridate, noting that no-one can have a date of birth that lies in the future:
library(lubridate)
dato <- dmy(sprintf("%06d", dato))
dato[dato > now()] <- dato[dato > now()] - years(100)
dato
#> [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31"
#> [6] "2004-11-20" "1930-12-05" "1969-12-05" "2004-12-05"
Of course, without further information, this method will not (nor will any other method) be able to pick out the edge cases of people who are aged over 100. This might be easy to determine from the context.
Created on 2020-06-29 by the reprex package (v0.3.0)
Converting five digit "numbers" to six digits is straightforward: x <- stringr::str_pad(x, 6, pad="0") or similar will do the trick.
Your problem with years is the Millennium bug revisited. You'll have to consult with whoever compiled your data to see what assumptions they used.
All two-digit years from 00 through 68 are affected, not just a narrow range: as the ?strptime excerpt above says, %y maps 00-68 to 20xx and 69-99 to 19xx. Note that as.Date's origin argument only applies to numeric input, so it won't help with these strings; you have to supply the century explicitly, as in the other answers. And then start using four-digit years in the future! ;)

monthly means with apply for multidimensional arrays

I want to compute the mean over the third dimension of a multidimensional array. As this dimension represents time, I want to compute monthly means. For that, I tried to use apply, but I am not sure where the problem is. Let's say my data is the following:
#Creating a sample
m <- array(1:12, dim=c(20,4,36))
#number of months
months <- seq(1:12)
#Compute the mean over each month (dimension of the result should be [20,4,12])
monmean <- apply(m,1:2,function(x) for(i in 1:12) mean(x[,,months==i],na.rm=TRUE))
Any idea??
Thanks in advance
I think I understand what you're after. This is actually slightly more complex than it may seem, because months are not regular periods of time; they vary in number of days, and February varies between years due to leap years. Thus a simple regular logical or numeric index vector will not be sufficient to calculate this result precisely. You need to take into account the exact dates that are covered by the z-dimension of your array.
Solution 1
What you can do is separately compute a date vector that identifies the dates that correspond to each z-index of your array. Within the apply() call for each z-line, you can then call strftime() to extract the months for each such date, and group by that month value using tapply() to take monthly mean()s. Here's how it could be done:
set.seed(1);
R <- 48;   ## rows
C <- 39;   ## columns
Z <- 3653; ## days along the z-dimension
N <- R*C*Z;
a1 <- array(rnorm(N,10,2),c(R,C,Z));
## one date per z-index, covering the array's time span
dates <- seq(as.Date('2000-01-01'),as.Date('2009-12-31'),1);
## group each z-line by month, take means, then restore the dimension order
a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
Here's a demo showing a few specific proofs of correctness:
for (r in sample(1:nrow(a2),2))
    for (c in sample(1:ncol(a2),2))
        for (m in sample(1:dim(a2)[3],2))
            cat(sprintf('[%02d,%02d,%3s] %f %f\n',r,c,month.abb[m],mean(a1[r,c,strftime(dates,'%m')==sprintf('%02d',m)]),a2[r,c,m]));
## [14,05,Aug] 10.030313 10.030313
## [14,05,Apr] 10.200982 10.200982
## [14,25,Jan] 9.957879 9.957879
## [14,25,Apr] 10.185447 10.185447
## [26,34,Oct] 10.056931 10.056931
## [26,34,Nov] 9.876327 9.876327
## [26,17,Apr] 10.005423 10.005423
## [26,17,Sep] 10.009785 10.009785
Notes
I randomly chose a date range of 2000-01-01 to 2009-12-31 because it covers a 10 year period during which (due to leap years) there were exactly 3653 days, but obviously you should be sure to use whatever dates are actually covered by your real data.
As you can see, you were on the right track by calling apply() with 1:2 as the margins, because that allows you to operate independently on each z-line, such that you can group that z-line by month and compute the mean for each month along that z-line.
Unfortunately, apply() has an annoying habit of returning the result in a different transposition than people generally expect. For two-dimensional usages, this is normally solved with a simple call to t(), but since we're working in three dimensions here, we need to call aperm() to fix the dimension order.
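To make the transposition concrete, here's what the dimensions look like before and after the aperm() call (using the objects defined above):
dim(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)));
## [1] 12 48 39
dim(a2);
## [1] 48 39 12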
Since the dates I chose begin with January and advance through the months in calendar order, the means in the result will end up being ordered by calendar month. IOW, z-indexes 1:12 in a2 correspond to months Jan-Dec. If your dates do not begin with January, then this solution should still work, but you'll have to be careful about the correspondence between z-indexes and months in the result. For example, my "proof of correctness" code assumed that indexes 1:12 corresponded to months Jan-Dec, but that wouldn't be correct if the months occurred in a different order in the input array.
Solution 2
While writing this answer I actually thought of a slightly different, and arguably slightly better, solution. You can call tapply() just once, grouping by row, then column, and finally month. Unfortunately, tapply() does not cycle its group vectors to cover the input vector, so we have to cycle them ourselves with carefully crafted calls to rep() (using the each and times arguments), but other than that it's fairly straightforward:
a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
Here's a proof that the result is identical to my first method (dimnames() have to be fixed first to get the identical() call to work, but that's trivial):
dimnames(a3) <- dimnames(a2);
identical(a3,a2);
## [1] TRUE
Performance
Here's some basic performance testing using system.time() to give an idea of the superiority of the second solution:
first <- function() a2 <- aperm(apply(a1,1:2,function(x) tapply(x,strftime(dates,'%m'),mean)),c(2,3,1));
second <- function() a3 <- tapply(a1,list(rep(1:R,C*Z),rep(1:C,each=R,times=Z),rep(strftime(dates,'%m'),each=R*C)),mean);
system.time({ first() });
## user system elapsed
## 3.672 0.015 3.719
system.time({ first() });
## user system elapsed
## 3.672 0.016 3.720
system.time({ second() });
## user system elapsed
## 1.797 0.344 2.135
system.time({ second() });
## user system elapsed
## 1.719 0.391 2.124
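As an aside, a large share of first()'s cost is that strftime(dates,'%m') is recomputed for every one of the R*C z-lines; hoisting it out should narrow the gap considerably (a sketch I haven't timed here):
mo <- strftime(dates,'%m');
first2 <- function() a2 <- aperm(apply(a1,1:2,function(x) tapply(x,mo,mean)),c(2,3,1));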

Conditional Label in R without Loops

I'm trying to find the best way (best as in performance) to take a data frame of the form below and add a new column called "Season" containing one of the four seasons of the year:
MON DAY YEAR
1 1 1 2010
2 1 1 2010
3 1 1 2010
4 1 1 2010
5 1 1 2010
6 1 1 2010
One straightforward way to do this is to create a loop conditioned on the MON and DAY columns and assign the values one by one, but I think there is a better way. I've seen on other posts suggestions for ifelse or := or apply, but most of the problems stated are just binary, or the value can be assigned based on a single given function f of the parameters.
In my situation I believe a vector containing the four season labels plus the right conditions would suffice, but I don't see how to put everything together. My situation resembles more of a switch-case.
Using modulo arithmetic and the fact that arithmetic operators coerce logical values to 0/1 will be far more efficient if the number of rows is large:
d$SEASON <- with(d, c( "Winter","Spring", "Summer", "Autumn")[
1+(( (DAY>=21) + MON-1) %/% 3)%%4 ] )
The leading "1+" shifts the range of the %%4 result from 0:3 to 1:4 so it can index the season vector. The "MON-1" shifts the month range from 1:12 to 0:11, and "(DAY>=21)" advances the boundary months forward by one.
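A quick sanity check on a few boundary dates (dd is a hypothetical test frame, not from the question):
dd <- data.frame(MON = c(1, 3, 3, 6, 9, 12), DAY = c(15, 20, 21, 21, 21, 21))
dd$SEASON <- with(dd, c( "Winter","Spring", "Summer", "Autumn")[
1+(( (DAY>=21) + MON-1) %/% 3)%%4 ] )
dd
#   MON DAY SEASON
# 1   1  15 Winter
# 2   3  20 Winter
# 3   3  21 Spring
# 4   6  21 Summer
# 5   9  21 Autumn
# 6  12  21 Winter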
I'll start by giving a simple answer, then I'll delve into the details.
A quick way to do this would be to check the values of MON and DAY and output the correct season. This is trivial:
f <- function(m, d){
  if(m==12 && d>=21) i=3                 # winter starts Dec 21
  else if(m>9 || (m==9 && d>=21)) i=2    # autumn: Sep 21 onwards
  else if(m>6 || (m==6 && d>=21)) i=1    # summer: Jun 21 onwards
  else if(m>3 || (m==3 && d>=21)) i=0    # spring: Mar 21 onwards
  else i=3                               # winter: Jan 1 - Mar 20
  i                                      # return the season code
}
This f function, given a month and a day, will return an integer corresponding to the season (it doesn't matter much whether it's an integer or a string; an integer just saves a bit of memory, but that's a technicality).
Now you want to apply it to your data.frame. No need to use a loop for this; we'll use mapply. d will be our simulated data.frame. We'll factor the output to have nice season names.
d=data.frame(MON=rep(1:12,each=30),DAY=rep(1:30,12),YEAR=2012)
d$SEA=factor(
mapply(f,d$MON,d$DAY),
levels=0:3,
labels=c("Spring","Summer","Autumn","Winter")
)
There you have it!
I realize seasons don't always change on the 21st. If you need fine-tuning, you could define a three-dimensional array as a global variable to store the exact days. Given a season and a year, you could then look up the corresponding day and replace the "21"s in the f function with the right calls (you would obviously add a third argument for the year).
About the things you mentioned in your question:
ifelse is the "functional" way to make a conditional test. On atomic variables it's only slightly better than conditional statements, but it is vectorized, meaning that if the argument is a vector, it will loop itself over its elements. I'm not familiar with it, but it's the way to go for an optimized solution (see the sketch at the end of this answer).
mapply is derived from sapply of the "apply family" and allows you to call a function with several arguments on vectors (see ?mapply).
I don't think := is a standard operator in R, which brings me to my next point:
data.table! It's a package that provides a new structure that extends data.frame for fast computation and typing (among other things). := is an operator in that package and allows you to define new columns. In our case you could write d[,SEA:=mapply(f,MON,DAY)] if d is a data.table.
If you really care about performance, I can't insist enough on using data.table as it is a major improvement if you have a lot of data. I don't know if it would really impact time computing with the solution I proposed though.
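For reference, here is a sketch of that vectorized ifelse() variant (same season coding as f above; illustrative, not benchmarked):
idx <- with(d, ifelse(MON==12 & DAY>=21, 3,
            ifelse(MON>9 | (MON==9 & DAY>=21), 2,
            ifelse(MON>6 | (MON==6 & DAY>=21), 1,
            ifelse(MON>3 | (MON==3 & DAY>=21), 0, 3)))))
d$SEA <- factor(idx, levels=0:3, labels=c("Spring","Summer","Autumn","Winter"))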

Find the daily stock return percentage in R [duplicate]

This question already has answers here:
How to calculate returns from a vector of prices?
(7 answers)
Closed 8 years ago.
I am taking a column of historical stock prices and trying to find the percent return on the stock. This would be accomplished through calculations such as today's stock price minus yesterday's stock price, divided by yesterday's stock price. You could also divide the most recent day's price by the previous day's and subtract one.
I can find the difference between each day, but that is not my problem. I believe my teacher told me it is
x <- diff(log(theReturns))
Can you guys help me find the percent change in daily stock prices in R?
Let's say your vector is v <- c(10, 20, 23, 15, 22, 30) (this would be what you call theReturns, but I am using v for short here).
The difference between each day, which you already know how to get as you say, is
v[2:6] - v[1:5]
# 10 3 -8 7 8
In R there is another way to write this, using the function diff (see ?diff for more details):
diff(v) == v[2:6] - v[1:5]
# TRUE TRUE TRUE TRUE TRUE
Since you want to calculate the difference as a percentage of the previous day (i.e. the relative change), you simply need to divide this by v[1:5]
diff(v) / v[1:5]
# 1.0000000 0.1500000 -0.3478261 0.4666667 0.3636364
My guess is that you know how to do all that, but your confusion comes from your teacher introducing the log function in there. I don't think you necessarily have to use log, but it may simplify things because of one of its properties, which is that log(x/y) = log(x) - log(y), for positive x, y. Using this (after a little bit of algebra), you can see that another way to calculate the relative change is
exp(diff(log(v))) - 1
since that evaluates to exp(log(v[2:6]) - log(v[1:5])) - 1 which equals (v[2:6] / v[1:5]) - 1 which in turn equals (v[2:6] - v[1:5]) / v[1:5].
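For completeness, a quick numeric check that the two formulations agree:
all.equal(exp(diff(log(v))) - 1, diff(v) / v[1:5])
# [1] TRUE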

Forcing full weeks with apply.weekly()

I'm trying to figure out what xts (or zoo) uses as the time stamp after doing a period.apply. Consider the following:
> myTs = xts(1:10, as.Date(1:10, origin = '2012-12-1'))
> apply.weekly(myTs, colSums)
[,1]
2012-12-02 1
2012-12-09 35
2012-12-11 19
I think the '2012-12-02' means "for the week ending 2012-12-02, the sum is 1". So basically the time is the end of the week.
But the problem is with that "2012-12-11" - I think what it's doing is saying that the 11th is the last day of the week that was given, so it's giving that as the time.
Is there any way to force it to give the Sunday on which the week ends, even if that day was not included in the data set?
Try this:
nextsun <- function(x) 7 * ceiling(as.numeric(x-0+4) / 7) + as.Date(0-4)
aggregate(myTs, nextsun, sum)
where nextsun was derived from the nextfri code given in the zoo quick reference by replacing 5 (for Friday) with 0 (for Sunday).
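Applied to the sample series, every group stamp now falls on a Sunday; sketching the expected grouping:
aggregate(myTs, nextsun, sum)
# stamps: 2012-12-02, 2012-12-09, 2012-12-16 with sums 1, 35, 19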
Those are full weeks. It's only showing you the date of the very last observation. See ?endpoints (apply.weekly is essentially a thin wrapper around endpoints).
apply.weekly
function (x, FUN, ...)
{
ep <- endpoints(x, "weeks")
period.apply(x, ep, FUN, ...)
}
<environment: namespace:xts>
From ?endpoints
endpoints returns a numeric vector corresponding to the last
observation in each period specified by on, with a zero added to the
beginning of the vector, and the index of the last observation in x at
the end.
Valid values for the argument on include: “us” (microseconds),
“microseconds”, “ms” (milliseconds), “milliseconds”, “secs” (seconds),
“seconds”, “mins” (minutes), “minutes”, “hours”, “days”, “weeks”,
“months”, “quarters”, and “years”.
The answer to your second question is no, there is no option to do so. But you could always edit the last date manually; if you're going to present all the data wrapped up anyway, I don't see any harm in it.
No, you can't force it to give you the Sunday.
Because the index of the result of period.apply is given by
ep <- endpoints(myTs,'weeks')
myTs[ep]
[,1]
2012-12-02  1
2012-12-09  8
2012-12-11 10
So you need to shift the last date. Unfortunately xts doesn't offer this option; you can't shift a single value of the index. I don't know why (maybe a design choice to keep a unique index).
E.g. you can do the following:
ts.weeks <- apply.weekly(myTs, colSums)
index(ts.weeks)[length(ts.weeks)] <- last(index(myTs)) + 7 - last(floor(diff(ep)))
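With that, the last stamp becomes the Sunday ending the partial week; the expected result (a sketch):
ts.weeks
#            [,1]
# 2012-12-02    1
# 2012-12-09   35
# 2012-12-16   19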
