use rollapply for certain rows - r

I wonder if it is possible to use rollapply() only for certain rows of a dataframe. I know the "by" argument can specify the every by-th time point at which I calculate FUN, but now I have a very specific vector of row indices to which I wish to apply the rollapply(). For example, I have the below dataframe:
df <- data.frame(x = (1:10), y = (11:20))
I know how to calculate the rolling mean for y column when the rolling width is 3.
library(zoo)
m <- rollapply(df$y, width = 3, FUN = mean, fill = NA, align = "right")
But what if I want the width-3-mean only for the 4th and 9th row? Is there something in "by" argument that I can manipulate? Or some other better methods (using apply to do rolling calculation maybe)?

Hopefully I am understanding your question correctly. I think you are asking how to perform a function on every 4th and 9th element in a sliding window? If yes, just restrict your function to the 4th and 9th element using x[4] and x[9]. Like this:
output <- rollapply(df, 9, function(x) (x[4] + x[9])/2), fill = NA, align = "right")
I also interpret your question as asking how to get the mean when the window contains the 4th or 9th row? This can be done by sub setting. The question you need to think about is where you want the 4th and 9th row to be located within your window. Do you want the 4th row to be at position x[1], x[2], or x[3] within your window? Depending on what is at the other positions will obviously effect your output. Say you dont know, and all three seem reasonable, you will need to write a function a that creates a list of dataframes containing the range of data you are interested in, and then use an apply function, or a for loop, to rollapply the mean function over each dataframe in the list. You can then all of these outputs into a dataframe to work with further. Like this:
# the rlist library has a function that allows us to add items to a list
# which will be handy later on
library(rlist)
library(zoo)
# your example data
df <- data.frame(x = (1:10), y = (11:20))
# a vector of your desired rows
desired_rows <- c(4,9)
# A for loop that generates a list of dataframes
# with your desired rows in the middle of each
for (i in desired_rows){
lower_bound <- i-2
upper_bound <- i+2
df_subset <- df[c(lower_bound:upper_bound), ]
if(exists("list_df_range")){
list_df_range <- list.append(list_df_range, df_subset)
}else{
list_df_range <- list(df_subset)
}
}
# a second for loop that applies your rollapply function to each
# data frame in the list and then
# returns a dataframe of the final results
# with each column named after the originating row
for (n in list_df_range){
m <- rollapply(n$y, width = 3, FUN = mean, fill = NA, align = "right")
if(exists("final_out")){
final_out <- cbind(final_out, m)
}else{
final_out <- data.frame(m)
}
}
names(final_out) <- desired_rows

Based on the comment below the question by the poster it seems that what is wanted is to take the mean of each rolling window of width 3 excluding the middle element in each window and only keeping the 4th and 9th elements so
cc <- c(4, 9)
rollapply(df$y, list(c(-2, 0)), mean, fill = NA)[cc]
## [1] 13 18
or
rollapplyr(df$y, 3, function(x) mean(x[-2]), fill = NA)[cc]
## [1] 13 18
or
sapply(cc, function(ix) mean(df$y[seq(to = ix, by = 2, length = 2)]))
## [1] 13 18
or
(df$y[cc - 2] + df$y[cc]) / 2
## [1] 13 18

Related

Submit every similarly named elements of a list of vectors to a function in R

Below, I'm wondering how to use BASE R function quantile() separately across elements in L that are named EFL and ESL?
Note: this is a toy example, L could contain any number of similarly named elements.
foo <- function(X) {
X <- as.matrix(X)
tab <- table(row(X), factor(X, levels = sort(unique(as.vector(X)))))
w <- diag(ncol(tab))
rosum <- rowSums(tab)
obs_oc <- tab * (t(w %*% t(tab)) - 1)
obs_c <- colSums(obs_oc)
max_oc <- tab * (rosum - 1)
max_c <- colSums(max_oc)
SA <- obs_c / max_c
h <- names(SA)
h[is.na(h)] <- "NA"
setNames(SA, h)
}
DAT <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/X.csv", row.names = 1)
L <- replicate(50, foo(DAT[sample(1:nrow(DAT), replace = TRUE),]), simplify = FALSE)
# How to use `quantile()` separately across all similarly named elements (e.g., EFL, ESL) in `L[[i]]` i = 1,... 5
# quantile(all EFL elements across `L`)
# quantile(all ESL elements across `L`)
The previous solution I used do.call to rbind each list into a matrix and array and then calculate the quantile over each data.frame row.
sapply(as.data.frame(do.call(rbind, L)), quantile)
However, when there is a missing row, it does not take that into account. To accurately get the rows you need to fill the missing rows. I used data.table's rbindlist (you could also use plyr::rbind.fill) with fill=TRUE to fill the missing values. It requires each to be a data.frame/table/list, so I converted each to a data.frame, but before doing so you need to transpose (t()) the data so that the rows line up to each element. It could be written in a single line, but it's easier read what is happening in multiple lines.
L2 = lapply(L, function(x){as.data.frame(t(x))})
df = data.table::rbindlist(L2, fill=TRUE) # or plyr::rbind.fill(L2)
sapply(df, quantile, na.rm = TRUE)
You can also use purrr::transpose:
Lt <- purrr::tranpose(L)
quantile(unlist(Lt$EFL),.8)
quantile(unlist(Lt$ESL),.8)

Subset in the data frame rows in R

I have a data frame with 30 rows and 4 columns (namely, x, y, z, u). It is given below.
mydata = data.frame(x = rnorm(30,4), y = rnorm(30,2,1), z = rnorm(30,3,1), u = rnorm(30,5))
Further, I have a sequence values, which represent row number in my data frame.
myseq = c(seq(1, 30, by = 5))
myseq
[1] 1 6 11 16 21 26
Now, I wanted to compute the prob values for each segment of 99 rows.
filt= subset(mydata[1:6,], mydata[1:6,]$x < mydata[1:6,]$y & mydata[1:6,]$z < mydata[1:6,]$u
filt
prob = length(filt$x)/30
prob
Then I need to compute the above prob for 1:6,.., 27:30 and so on . Here, I have only 6 prob values. So, I can do one by one. If I have 100 values it would be tedious. Are there any way to compute the prob values?.
Thank you in advance.
BTW: in subset(DF[1:99,], ...), use DF[1:99,] in the first argument, not again, ala
subset(DF[1:99,], cumsuml < inchivaluel & cumsumr < inchivaluer)
Think about how to do this in a list.
The first step is to break your data into the va starting points. I'll start with a list of the indices to break it into:
inds <- mapply(seq, va, c(va[-1], nrow(DF)), SIMPLIFY=FALSE)
this now is a list of sequences, starting with 1:99, then 100:198, etc. See str(inds) to verify.
Now we can subset a portion of the data based on each element's vector of indices:
filts <- lapply(inds, function(ind) subset(DF[ind,], cumsuml < inchivaluel & cumsumr < inchivaluer))
We now have a list of vectors, let's summarize it:
results <- sapply(filts, function(filt) length(filt$cumsuml)/length(alpha))
Bottom line, it helps to think about how to break this problem into lists, examples at http://stackoverflow.com/a/24376207/3358272.
BTW: instead of initially making a list of indices, we could just break up the data in that first step, ala
DF2 <- mapply(function(a,b) DF[a:b,], va, c(va[-1], nrow(DF)), SIMPLIFY=FALSE)
filts <- lapply(DF2, function(x) subset(x, cumsuml < inchivaluel & cumsumr < inchivaluer))
results <- sapply(filts, function(filt) length(filt$cumsuml)/length(alpha))

rollapply for moving average with non-business day

I'd like to get MovingAverage in data which have "NA" in the middle of data like below.
date <- seq.Date(as.Date("2018-07-02"),as.Date("2018-07-14"),by = "days")
A <- c(100,110,120,130,140,NA,NA,150,160,170,180,190,200)
B <- c(200,220,240,260,280,NA,NA,300,320,340,360,380,400)
C <- c(150,160,170,180,190,200,210,NA,NA,220,230,240,250)
dataset <- data.frame(A,B,C)
dataset <- as.xts(dataset, order.by = date)
If I use rollapply like below to get 3-day MovingAverage...
y <- rollapply(dataset, width = 3, function(x) mean(x, na.rm = TRUE ))
This is not what I want.
For example, In MovingAverage of A at "2018-07-09", the result is (NA+NA+150)/1 = 150. But I want to get (130+140+150)/3 = 140.
How can I do that?
I assume you want NAs to stay as NA and otherwise to take the mean of the last 3 non-NAs.
1) Take 5 elements at a time and if the last element is NA then return NA; otherwise, remove the NAs and take the mean of the last 3. Note that this does imply that the first 4 rows will be NA.
mean_bus <- function(x) if (is.na(tail(x, 1))) NA else mean(tail(na.omit(x), 3))
y1 <- rollapplyr(dataset, width = 5, mean_bus)
2) An alternate would be to take the last 3 non-NAs and then overwrite that with NAs in all positions where the input is NA.
mean_omit <- function(x) mean(tail(na.omit(x), 3))
y <- rollapplyr(dataset, 5, mean_omit)
y2 <- replace(y, is.na(dataset), NA)
all.equal(y1, y2)
## [1] TRUE
3) If you prefer to fill in the first 4 rows with partial values then convert to zoo and use the partial= argument of rollapplyr.zoo. mean_bus is from (1).
y3 <- as.xts(rollapplyr(as.zoo(dataset), 5, mean_bus, partial = TRUE))
You could either remove the NAs in each series before you compute the moving average (MA).
Or you use a larger window and keep only the last three values for the MA.
y <- rollapply(dataset, width = 5,
function(x) {mean(tail(x[ !is.na(x) ], 3))})

averaging every 16 columns in r [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
apply a function over groups of columns
I have a data.frame with 30 rows and many columns (1000+), but I need to average every 16 columns together. For example, the data frame will look like this (I truncate it to make it easier..):
Col1 Col2 Col3 Col4........
4.176 4.505 4.048 4.489
6.167 6.184 6.359 6.444
5.829 5.739 5.961 5.764
.
.
.
Therefore, I cannot aggregate (I do not have a list) and I tried:
a <- data.frame(rowMeans(my.df[,1:length(my.df)]) )
which gives me the average of the all 1000+ coumns, But is there any way to say I want to do that every 16 columns until the end? (they are multiple of 16 the total number of columns).
A secondary, less important point but would be useful to solve this as well.
The col names are in the following structure:
XXYY4ZZZ.txt
Once averaged the columns all I need is a new col name with only XXYY as the rest will be averaged out. I know I could use gsub but is there an optimal way to do the averaging and this operation in one go?
I am still relatively new to R and therefore I am not sure where and how to find the answer.
Here is an example adapted from #ben's question and #TylerRinker's answer from apply a function over groups of columns . It should be able to apply any function over a matrix or data frame by intervals of columns.
# Create sample data for reproducible example
n <- 1000
set.seed(1234)
x <- matrix(runif(30 * n), ncol = n)
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
# Run function
y <- byapply(x, 16, rowMeans)
# Test to make sure it returns expected result
y.test <- rowMeans(x[, 17:32])
all.equal(y[, 2], y.test)
# TRUE
You can do other odd things with it. For example, if you needed to know the total sum of every 10 columns, being sure to remove NAs if present:
y.sums <- byapply(x, 10, sum, na.rm = T)
y.sums[1]
# 146.7756
sum(x[, 1:10], na.rm = T)
# 146.7756
Or find the standard deviations:
byapply(x, 10, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(x, rep(1:10, each = 10), rowMeans)
This works for me on a much smaller data frame:
rowMeans(my.df[,seq(1,length(my.df),by=16)])

Constrained randomization of column order in a data.frame

I am trying to duplicate each column from data frame and move it to a randomly located point within 1-3 columns and do it for each column in the data frame. I want columns to move AT LEAST one space to the left or right. Of course sample(data) reorders columns randomly, but my attempts to put it in a loop are embarrassingly bad (I admit I skipped majority of linear algebra classes, damn...). Below is an example data:
dat <- read.table(textConnection(
"-515.5718 94.33423 939.6324 -502.9918 -75.14629 946.6926
-515.2283 96.10239 939.5687 -503.1425 -73.39015 946.6360
-515.0044 97.68119 939.4177 -503.4021 -71.79252 946.6909
-514.7430 99.59141 939.3976 -503.6645 -70.08514 946.6887
-514.4449 101.08511 939.2342 -503.9207 -68.48133 946.7183
-514.2769 102.29453 939.0013 -504.2665 -67.04509 946.7809
-513.9294 104.02753 938.9436 -504.4703 -65.34361 946.7899
-513.5900 105.49624 938.7684 -504.7405 -63.75965 946.7991"
),header=F,as.is=T)
sample(dat)#random columns position
How about this brute-force but plenty-fast solution?
It tries out different permutations of the columns until it finds one in which each column is moved at least 1, and not more than 3 columns to left or right. When it finds such a permutation, the test in the final line of the while() call evaluates to FALSE, terminating the loop and leaving the variable x containing the acceptable permutation.
n <- ncol(dat)
while({x <- sample(n) # Proposed new column positions
y <- seq_len(n) # Original column positions
max(abs(x - y)) > 3 | min(abs(x - y)) == 0
}) NULL
dat[x]
I should probably wait to post this until I have time to comment it up, and discuss some of the ambiguities in the problem as currently specified in the comments above. But since I won't be able to do that, possibly for a while, I thought I'd give you code for a solution that you can examine yourself.
# Create a function that generates acceptable permutations of the data
getPermutation <- function(blockSize, # number of columns/block
nBlock, # number of blocks of data
fromBlocks) { # indices of blocks to be moved
X <- unique(as.vector(outer(fromBlocks, c(-2,-1,1,2), "+")))
# To remove nonsensical indices like 0 or -1
X <- X[X %in% seq.int(nBlock)]
while({toBlocks <- sample(X, size = length(fromBlocks))
max(abs(toBlocks - fromBlocks)) > 2 | min(abs(toBlocks - fromBlocks)) < 1
}) NULL
A <- seq.int(nBlock)
A[toBlocks] <- fromBlocks
A[fromBlocks] <- toBlocks
blockColIndices <-
lapply(seq.int(nBlock) - 1,
function(X) {
seq(from = X * blockSize + 1,
by = 1,
length.out = blockSize)
})
unlist(blockColIndices[A])
}
# Create an example dataset, a 90 column data.frame
dat <- as.data.frame(matrix(seq.int(90*4), ncol=90))
# Call the function for a data frame with 30 3-column blocks
# within which you want to move blocks 2, 14, and 14.
index <- getPermutation(3, 30, c(2, 14, 15))
newdat <- dat[index]

Resources