I'm trying to do something similar to what I was asking about here, and unfortunately I cannot work it out.
This is my data frame (data), a time series of prices:
Date Price Vol
1998-01-01 200 0.3
1998-01-02 400 0.4
1998-01-03 600 -0.2
1998-01-04 100 0.1
...
1998-01-20 100 0.1
1998-01-21 200 -0.4
1998-01-21 500 0.06
....
1998-02-01 100 0.2
1998-02-02 200 0.4
1998-02-03 500 0.3
1998-02-04 100 0.1
etc.
I would like to tell R to
take the 1st value of "Vol" and divide it by the 20th value of "Price", then
take the 2nd value of "Vol" and divide it by the 21st value of "Price", then
take the 3rd value of "Vol" and divide it by the 22nd value of "Price", then
etc.
In my other post, I was able to use this function to calculate a return over a holding period of 20 days:
> data.xts <- xts(data[, -1], data[, 1])
> hold <- 20
> f <- function(x) log(tail(x, 1)) - log(head(x, 1))
> data.xts$returns.xts <- rollapply(data.xts$Price, FUN = f,
                                    width = hold + 1, align = "left", na.pad = TRUE)
Is there a way to do something very similar for the problem stated above? So something like
f1 <- function(x,y) head(x, 1) / tail(y,1)
where x is "Vol" and y is "Price" and then apply "rollapply"?
Thank you very much
UPDATE (reply to Dr G):
Thanks for your suggestions. With a slight change, it did what I wanted!
data.xts <- xts(data[, -1], data[, 1])
hold <- 20
data.xts$quo <- lag(data.xts[,2], hold) / data.xts[,1]
Now my problem is, that the resulting data frame looks like this:
Date Price Vol quo
1 1998-01-01 200 0.3 NA
2 1998-01-02 400 0.4 NA
3 1998-01-03 600 -0.2 NA
4 1998-01-04 100 0.1 NA
...
21 1998-01-20 180 0.2 0.003
I know that there must be NA's as an outcome, but only for the last 20 observations, not the first 20. The formula stated above calculates the correct values, but it puts them starting at the 21st row instead of the first row. Do you know how I could change that?
Use by.column = FALSE in rollapply. In order to use the posted data we will divide the volume in the first row by the price in the 3rd row and so on for purposes of reproducible illustration:
library(zoo)
Lines <- "Date Price Vol
1998-01-01 200 0.3
1998-01-02 400 0.4
1998-01-03 600 -0.2
1998-01-04 100 0.1
1998-01-20 100 0.1
1998-01-21 200 -0.4
1998-01-21 500 0.06
1998-02-01 100 0.2
1998-02-02 200 0.4
1998-02-03 500 0.3
1998-02-04 100 0.1"
# read in and use aggregate to remove all but last point in each day.
# In reality we would replace textConnection(Lines) with something
# like "myfile.dat"
z <- read.zoo(textConnection(Lines), header = TRUE,
aggregate = function(x) tail(x, 1))
# divide Volume by the Price of the point 2 rows ahead using by.column = FALSE
# Note use of align = "left" to align with the volume.
# If we used align = "right" it would align with the price.
rollapply(z, 3, function(x) x[1, "Vol"] / x[3, "Price"], by.column = FALSE,
align = "left")
# and this is the same as rollapply with align = "left" as above
z$Vol / lag(z$Price, 2)
# this is the same as using rollapply with align = "right"
lag(z$Vol, -2) / z$Price
By the way, note that zoo uses the same convention for the sign of lag as does R, but xts uses the opposite convention, so if you convert the above to xts you will have to negate the lags.
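To make the sign difference concrete, here is a small sketch (the toy series s.zoo / s.xts is made up purely for illustration; apart from how the trailing NA's are padded, the two calls align the values the same way):
library(zoo)
library(xts)
# toy series: the values 1 to 5 on five consecutive dates
s.zoo <- zoo(1:5, as.Date("1998-01-01") + 0:4)
s.xts <- as.xts(s.zoo)
lag(s.zoo, 2)   # zoo/R convention: positive k looks 2 observations ahead
lag(s.xts, -2)  # xts convention: the same shift needs the sign negated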
It's actually easier than that. Just do this:
data.xts <- xts(data[, -1], data[, 1])
hold <- 20
returns.xts = data.xts[,2] / lag(data.xts[,1], hold)
Actually for this using zoo instead of xts would work as well:
data.zoo<- zoo(data[, -1], data[, 1])
hold <- 20
returns.zoo = data.zoo[,2] / lag(data.zoo[,1], -hold)
The only thing that changes is the sign of the lag (the zoo convention is different from the xts one).
You just need to use
data.xts$quo <- data.xts[,2] / lag(data.xts[,1], -hold)
Related
My imported data comes with varying row/column sizes. I need to convert text percentages (e.g. 32%) into decimals (0.32). Some columns contain these text percentages; others are normal numeric columns and need to stay unchanged.
I can convert the strings to decimals across a column and apply this across the data frame, but I have no elegant way of applying the conversion only to the relevant columns. I have solved my problem in a clunky manner by creating a vector that detects columns containing % strings and then running a loop across the data frame, checking that vector to decide which columns to convert. I'm looking for a cleaner solution.
# Example structure of data on a small scale
df <- data.frame(desc = c('a','b','c'),val = c(10, 3, 100), perc = c('23.01%', '11.0%','2.33%'))
# desc val perc
# 1 a 10 23.01%
# 2 b 3 11.0%
# 3 c 100 2.33%
# the below converts everything which is not desired
sapply(df, function(x) as.numeric(sub("%","",x))/100)
# desc val perc
# [1,] NA 0.10 0.2301
# [2,] NA 0.03 0.1100
# [3,] NA 1.00 0.0233
# my (clunky) solution
aa <- rep(0,ncol(df))
for(i in 1:ncol(df)){aa[i] <- length(grep("%",df[,i]))}
# [1] 0 0 3
for(i in 1:ncol(df)){if (aa[i]>0) {df[,i] <- as.numeric(sub("%", "",df[,i],fixed=TRUE))/100 } }
# desc val perc
# 1 a 10 0.2301
# 2 b 3 0.1100
# 3 c 100 0.0233
A tidyverse solution would be the following:
library(dplyr)
library(stringr)
df %>%
  mutate_if(~sum(str_detect(., "%")) > 0,
            ~as.numeric(str_remove(., "%")) / 100)
What I would do is find the columns that contain a %, convert them to character (just so you don't have to work with factors, which are a pain in this case), then remove the % signs and divide the numbers by 100.
xy <- data.frame(desc = c('a','b','c'),val = c(10, 3, 100), perc = c('23.01%', '11.0%','2.33%'))
# find which colums have a % - this assumes % is used only to denote percentages
perc.index <- sapply(xy, grepl, pattern = "%")
# convert columns that have at least one % to character
# this step can be also done manually or on import (stringsAsFactors = FALSE)
xy[, colSums(perc.index) > 0] <- sapply(xy[, colSums(perc.index) > 0, drop = FALSE], as.character)
xy[perc.index] <- as.numeric(gsub("%", "", xy[perc.index])) / 100
xy
desc val perc
1 a 10 0.2301
2 b 3 0.11
3 c 100 0.0233
# find the last character of each perc entry and check whether it is "%"
tmp <- nchar(as.character(df$perc))
tmp2 <- which(substr(df$perc, tmp, tmp) == "%")   # rows ending in "%"
tmp3 <- which(substr(df$perc, tmp, tmp) != "%")   # rows without a "%"
df$perc2 <- NA
# strip the "%" and divide by 100 where present, otherwise just convert to numeric
df$perc2[tmp2] <- as.numeric(gsub("%", "", df$perc[tmp2])) / 100
df$perc2[tmp3] <- as.numeric(as.character(df$perc[tmp3]))
I hope I articulate this properly. I have a data set with two columns I am trying to compare in a memory experiment. Recall.CRESP is a column specifying the correct answers on a memory test, selected through grid coordinates. Recall.RESP shows the participants' responses.
The columns look something like this:
|Recall.CRESP | Recall.RESP |
|---------------------------------|---------------------------------|
|grid35grid51grid12grid43grid54 | grid35grid51grid12grid43grid54 |
|grid22grid53grid35grid21grid44 | grid23grid53grid35grid21grid43 |
|grid12grid14grid15grid41grid23 | grid12grid24grid31grid41grid25 |
|grid15grid41grid33grid24grid55 | grid15grid41grid33grid14grid55 |
I have the following line of code to tell me the percentage of times per row that the columns are identical to each other:
paste0(100 * with(Data, mean(Recall.CRESP == Recall.RESP, na.rm = TRUE)), "%")
So for example, in my dataset 20% of the time column Recall.CRESP matches Recall.RESP exactly, signifying that a subject scored 5 out of 5 in their memory test 20% of the time.
However I want to be able to expand on this in two ways. The first is that rather than giving me a percentage of when the rows are identical, I would like a percentage for when there is a partial match in the sequence. For instance grid11grid42grid22grid51grid32 and grid11grid15grid55grid42grid32 share a match of 2/5, with both the first and the last grid coordinate being identical. I am not sure how to specify the request in R for a partial sequence match of 2/5 (or any other outcome out of 5). Also keep in mind that in this example grid42 shows up in both sequences, but is not correctly recalled, considering it is remembered out of position in Recall.RESP. The order is important in these sequences.
The other point is that so far I have described the experiment in terms of checking accuracy for forwards recall of memory items. Yet I also have separate data where participants were recalling in backwards order. So for example, grid11grid22grid33grid44grid55 from Recall.CRESP and grid51grid44grid33grid22grid11 from Recall.RESP correctly match 4/5 times. How can I turn the code around to check for reverse sequences and calculate percentages out of 5?
Any thoughts would be greatly appreciated.
I would separate the strings into columns of matrices, which will make them easy to compare and manipulate:
# borrowing Oriol's nicely shared data
Recall.CRESP <- c('grid35grid51grid12grid43grid54',
'grid22grid53grid35grid21grid44',
'grid12grid14grid15grid41grid23',
'grid15grid41grid33grid24grid55')
Recall.RESP <- c('grid35grid51grid12grid43grid54',
'grid23grid53grid35grid21grid43',
'grid12grid24grid31grid41grid25',
'grid15grid41grid33grid14grid55')
# function to create matrices
matrixify = function(dat) {
dat = do.call(rbind, strsplit(dat, split = "grid"))
dat = dat[, -1]
mode(dat) = "numeric"
return(dat)
}
cresp_mat = matrixify(Recall.CRESP)
resp_mat = matrixify(Recall.RESP)
## an example of what we made: just the numbers in the right order
cresp_mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 35 51 12 43 54
# [2,] 22 53 35 21 44
# [3,] 12 14 15 41 23
# [4,] 15 41 33 24 55
## Calculating results is now easy:
(forwards = rowMeans(cresp_mat == resp_mat))
# [1] 1.0 0.6 0.4 0.8
(reverse = rowMeans(cresp_mat == resp_mat[, 5:1]))
# [1] 0.2 0.2 0.0 0.2
You could, of course, assign the results to be new columns of your original data.
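For example, assuming the original data frame is called Data (the name used in the question's code) and its rows are in the same order as the vectors above:
# attach the per-row scores to the original data frame
Data$forwards <- forwards
Data$reverse  <- reverse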
Here is my solution:
Recall.CRESP <- c('grid35grid51grid12grid43grid54',
'grid22grid53grid35grid21grid44',
'grid12grid14grid15grid41grid23',
'grid15grid41grid33grid24grid55')
Recall.RESP <- c('grid35grid51grid12grid43grid54',
'grid23grid53grid35grid21grid43',
'grid12grid24grid31grid41grid25',
'grid15grid41grid33grid14grid55')
df <- data.frame(Recall.CRESP, Recall.RESP, stringsAsFactors = F)
df$correctNormal <- NA
df$correctReverse <- NA
for (row in 1:nrow(df)) {
crespVector <- unlist(strsplit(as.character(df[row, 1]), 'grid'))[-1]
respVector <- unlist(strsplit(as.character(df[row, 2]), 'grid'))[-1]
correctNormal <- 0
correctReverse <- 0
for (i in 1:length(crespVector)) {
if (crespVector[i] == respVector[i]) correctNormal <- correctNormal + 1
if (crespVector[i] == respVector[length(respVector) + 1 - i]) correctReverse <- correctReverse + 1
}
df$correctNormal[row] = correctNormal / 5
df$correctReverse[row] = correctReverse / 5
}
df
## Recall.CRESP Recall.RESP correctNormal correctReverse
## 1 grid35grid51grid12grid43grid54 grid35grid51grid12grid43grid54 1.0 0.2
## 2 grid22grid53grid35grid21grid44 grid23grid53grid35grid21grid43 0.6 0.2
## 3 grid12grid14grid15grid41grid23 grid12grid24grid31grid41grid25 0.4 0.0
## 4 grid15grid41grid33grid24grid55 grid15grid41grid33grid14grid55 0.8 0.2
I am trying to correct a time course measurement where the triggering of a positive event has failed, and thus the interval of var2 is an approximate multiple of the median value of the data set. Here is a subset of the data.
var1 freq
1 0.9 1
2 2.7 3
3 0.9 1
4 1.8 2
5 0.9 1
6 0.9 1
7 1.8 2
8 2.7 3
To correct the record I would need to insert "freq"-1 additional rows and fill them with the value calculated from "var1"/"freq".
I have found the following example on this site (Exploding date range as row in R).
In it, a function is defined that loops through the rows and uses the "freq" variable to explode the data frame. However, it works on a date variable, whilst my variables are numbers.
extendDF <- function(x) {
foo <- function(i, z) {
times <- z[i, "freq"]
out <- data.frame(seq(z[i, 1], by = "days", length = times),
rep(z[i, 2], times),
rep(z[i, 3], times))
names(out) <- names(z)
out
}
out <- lapply(seq_len(nrow(x)), FUN = foo, z = x)
do.call("rbind", out)
}
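The same pattern can be adapted to numeric data. Below is a minimal sketch under the assumption that the data frame has exactly the columns var1 and freq shown above (explodeDF is just a made-up name for illustration):
explodeDF <- function(x) {
  out <- lapply(seq_len(nrow(x)), function(i) {
    times <- x[i, "freq"]
    # repeat the row "times" times, replacing var1 with var1/freq
    data.frame(var1 = rep(x[i, "var1"] / times, times),
               freq = rep(1, times))
  })
  do.call("rbind", out)
}
# e.g. the row var1 = 2.7, freq = 3 becomes three rows with var1 = 0.9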
Say I have a series of dates, and I want to break them into groups (let's call the groups "epochs"). My first idea of how to do this would be to create a variable that indicates which epoch a date belongs in. The following code shows what I want.
library(dplyr)
library(mosaic)
library(magrittr)
# Generate 1,000,000 random dates
set.seed(919)
df <- data.frame(dates = runif(1e6, -100, 100) + as.Date("2015-12-18"))
# Set two arbitrary dates as cutoffs
e1 <- as.Date("2015-10-01")
e2 <- as.Date("2015-12-20")
# Add a variable that indicates what the lowest cutoff date was
system.time(df %<>% mutate(epoch = derivedFactor(epoch.1 = dates < e1,
epoch.2 = dates < e2,
.method = "first",
.default = "epoch.3")))
# user system elapsed
# 341.86 0.16 344.70
But this is very slow -- about 5 minutes on my laptop. I imagine there is a faster way to do this. For example, my naive guess would be that you could sort the data by date, find the last row where dates < e1, and then mark all the preceding rows as a 1, etc. But maybe someone on here knows a better or more elegant way to do this?
I think you're overthinking this. I did it in base R, but presumably you could do this in dplyr too. Just order the data, and then assign the epoch values, working from the most recent cutoff backwards.
Conceptually, you just set everything to the most recent epoch, 3. Then, you go through and find all the rows that are less than the epoch 2 cutoff, and then change those to 2. Then, repeat the same process with 1.
# Generate 1,000,000 random dates
set.seed(919)
test.data <- data.frame(row_id = 1:1000000,dates = runif(1e6, -100, 100) + as.Date("2015-12-18"))
# Set two arbitrary dates as cutoffs
e1 <- as.Date("2015-10-01")
e2 <- as.Date("2015-12-20")
test.data <- test.data[order(test.data$dates),]
test.data$epoch <- 3
test.data[test.data$dates < e2,"epoch"] <- 2
test.data[test.data$dates < e1,"epoch"] <- 1
table(test.data$epoch)
As Ben Bolker pointed out, you can use findInterval to do this:
df %<>% mutate(epoch = findInterval(dates, c(e1, e2)))
head(df, 10)
## dates epoch
## 1 2016-03-15 2
## 2 2016-01-02 2
## 3 2016-01-30 2
## 4 2015-10-03 1
## 5 2015-09-17 0
## 6 2016-02-11 2
## 7 2015-12-05 1
## 8 2015-12-15 1
## 9 2016-03-11 2
## 10 2015-10-21 1
On my machine, this takes much less than 0.1 second.
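If labelled epochs are wanted rather than the integer codes, one option (a sketch reusing the labels from the question) is to map the codes onto a factor:
# findInterval returns 0 (before e1), 1 (before e2) or 2 (otherwise)
df$epoch <- factor(df$epoch, levels = 0:2,
                   labels = c("epoch.1", "epoch.2", "epoch.3"))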
I have an xts object that includes multiple parameters over a 24 hour period (measurements each minute). Based on the time, I have added a column grouping into 4 'time of day' (tod) options: 'morning', 'afternoon', 'evening' and 'night'.
I would like to extract the mean values and standard deviations of the columns (parameters) for the entire period and also by time of day ('tod').
I have tried to first convert the xts object to a data frame, but have problems with the columns being of class factor instead of numeric. I have also tried 'aggregate' but am getting really strange outputs (or errors) when I use aggregate. Here is an example:
Example code to create a much smaller version of my data:
# time vector:
Time <- ISOdatetime(2015, 01, 01, 6, 12, 0) + seq(0:(0.5*60 - 1)) * 1 * 60
# sample parameter columns
a <- 1:30
b <- 31:60
c <- seq(1, 90, 3)
# a sample xts object 'tester'
tester <- xts(cbind(a,b,c),Time)
# assign 'time of day':
tester$tod <- NA
tester$tod["T06:00/T06:20"]<-"night"
tester$tod["T06:21/T11:30"]<-"morning"
tester$tod["T06:31/T06:50"]<-"afternoon"
Example of how I have tried to get the mean values for a, b and c, both for all data and by 'tod', using 'aggregate' (note that there are NA's in my data, but this is not the issue):
tester$group = 1 #create a group column just to get the means for all data
mean_all <- aggregate(.~group, data=tester, FUN=mean, na.rm = TRUE, na.action=NULL)
meann_tod <- aggregate(.~tod, data=tester, FUN=mean, na.rm = TRUE, na.action=NULL)
Unfortunately this does not work: although there are no errors, the values are completely wrong.
Any advice would be much appreciated, I imagined this would be a very simple task!
When you attempted to create the character vector tod, you would have needed to coerce the coredata matrix to character rather than numeric. The package did issue a warning when it basically refused to let you mess up your other data, but you ignored it (and I didn't understand it until I did some extra work). You could instead construct a numeric vector to do the grouping:
> tester$tod <- NA
> tester$tod["T06:00/T06:20"]<-1
> tester$tod["T06:21/T11:30"]<-2
> tester$tod["T06:31/T06:50"]<-3
>
> tester$group = 1
> (mean_all <- aggregate(.~group, data=tester, FUN=mean, na.rm = TRUE, na.action=NULL))
group a b c tod
1 1 15.5 45.5 44.5 2.133333
> (meann_tod <- aggregate(.~tod, data=tester, FUN=mean, na.rm = TRUE, na.action=NULL))
tod a b c group
1 1 4.5 34.5 11.5 1
2 2 13.5 43.5 38.5 1
3 3 24.5 54.5 71.5 1
I probably would have omitted the "group" variable from the formula:
> (meann_tod <- aggregate(cbind(a,b,c)~tod, data=tester, FUN=mean, na.rm = TRUE, na.action=NULL))
tod a b c
1 1 4.5 34.5 11.5
2 2 13.5 43.5 38.5
3 3 24.5 54.5 71.5
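The question also asked for standard deviations; as a sketch, the same formula interface should work with FUN = sd under the same numeric-tod setup:
# standard deviations of a, b and c by time of day
(sd_tod <- aggregate(cbind(a, b, c) ~ tod, data = tester,
                     FUN = sd, na.rm = TRUE, na.action = NULL))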