I have a data.frame with 30 rows and many columns (1000+), and I need to average every 16 columns together. For example, the data frame looks like this (truncated to make it easier to read):
Col1 Col2 Col3 Col4........
4.176 4.505 4.048 4.489
6.167 6.184 6.359 6.444
5.829 5.739 5.961 5.764
.
.
.
Therefore, I cannot aggregate (I do not have a list) and I tried:
a <- data.frame(rowMeans(my.df[,1:length(my.df)]) )
which gives me the average of all 1000+ columns. But is there any way to do that for every 16 columns until the end? (The total number of columns is a multiple of 16.)
A secondary, less important point that would also be useful to solve:
The col names are in the following structure:
XXYY4ZZZ.txt
Once the columns are averaged, all I need is a new column name with only XXYY, as the rest will be averaged out. I know I could use gsub, but is there an optimal way to do the averaging and this renaming in one go?
I am still relatively new to R and therefore I am not sure where and how to find the answer.
Here is an example adapted from @ben's question and @TylerRinker's answer in "apply a function over groups of columns". It should be able to apply any function over a matrix or data frame by intervals of columns.
# Create sample data for reproducible example
n <- 1000
set.seed(1234)
x <- matrix(runif(30 * n), ncol = n)
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
# Run function
y <- byapply(x, 16, rowMeans)
# Test to make sure it returns expected result
y.test <- rowMeans(x[, 17:32])
all.equal(y[, 2], y.test)
# TRUE
You can do other odd things with it. For example, if you needed to know the total sum of every 10 columns, being sure to remove NAs if present:
y.sums <- byapply(x, 10, sum, na.rm = T)
y.sums[1]
# 146.7756
sum(x[, 1:10], na.rm = T)
# 146.7756
Or find the standard deviations:
byapply(x, 10, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(x, rep(1:10, each = 10), rowMeans)
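For the secondary point about column names, one possible sketch is to derive the groups from the names themselves, so that averaging and renaming happen together. It assumes the XXYY part is always the first four characters and that columns sharing that prefix belong to the same block; the sample data and the names my.df, prefix, groups and avg are made up for illustration.

```r
# Hypothetical data: 32 columns, two blocks of 16 sharing a 4-character prefix
set.seed(1)
my.df <- as.data.frame(matrix(runif(30 * 32), nrow = 30))
names(my.df) <- paste0(rep(c("AABB", "CCDD"), each = 16), "4",
                       sprintf("%03d", 1:16), ".txt")

# Assumption: the XXYY part is the first four characters of each name
prefix <- substr(names(my.df), 1, 4)

# Column indices per prefix, keeping the original column order
groups <- split(seq_along(my.df), factor(prefix, levels = unique(prefix)))

# One averaged column per prefix, already named after the prefix
avg <- sapply(groups, function(i) rowMeans(my.df[, i, drop = FALSE]))
colnames(avg)
```

If the prefix is not a fixed width, sub() or gsub() with a suitable pattern could replace substr() here.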
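This runs for me on a much smaller data frame, but note that it averages only every 16th column (1, 17, 33, ...) rather than each block of 16 columns: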
rowMeans(my.df[,seq(1,length(my.df),by=16)])
Below, I'm wondering how to use BASE R function quantile() separately across elements in L that are named EFL and ESL?
Note: this is a toy example, L could contain any number of similarly named elements.
foo <- function(X) {
X <- as.matrix(X)
tab <- table(row(X), factor(X, levels = sort(unique(as.vector(X)))))
w <- diag(ncol(tab))
rosum <- rowSums(tab)
obs_oc <- tab * (t(w %*% t(tab)) - 1)
obs_c <- colSums(obs_oc)
max_oc <- tab * (rosum - 1)
max_c <- colSums(max_oc)
SA <- obs_c / max_c
h <- names(SA)
h[is.na(h)] <- "NA"
setNames(SA, h)
}
DAT <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/X.csv", row.names = 1)
L <- replicate(50, foo(DAT[sample(1:nrow(DAT), replace = TRUE),]), simplify = FALSE)
# How to use `quantile()` separately across all similarly named elements (e.g., EFL, ESL) in `L[[i]]`, i = 1, ..., 50
# quantile(all EFL elements across `L`)
# quantile(all ESL elements across `L`)
My previous solution used do.call to rbind the list elements into a matrix, converted it to a data.frame, and then calculated the quantiles of each column.
sapply(as.data.frame(do.call(rbind, L)), quantile)
However, when an element is missing from some of the vectors, that approach does not take it into account. To line the values up correctly you need to fill in the missing entries. I used data.table's rbindlist (you could also use plyr::rbind.fill) with fill = TRUE to fill the missing values. It requires each item to be a data.frame, data.table, or list, so I converted each to a data.frame, but before doing so you need to transpose the data with t() so that each named element becomes a column. It could be written in a single line, but it is easier to read what is happening in multiple lines.
L2 = lapply(L, function(x){as.data.frame(t(x))})
df = data.table::rbindlist(L2, fill=TRUE) # or plyr::rbind.fill(L2)
sapply(df, quantile, na.rm = TRUE)
You can also use purrr::transpose:
Lt <- purrr::transpose(L)
quantile(unlist(Lt$EFL),.8)
quantile(unlist(Lt$ESL),.8)
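Since the question asks for base R, here is a minimal sketch that avoids extra packages: flatten the list, split the values by name, and apply quantile() to each group. The small list L below is invented toy data (it is not the bootstrap output from the question) and deliberately leaves ESL out of one element to show that missing names are handled.

```r
# Toy stand-in for L: named numeric vectors, one of them without ESL
L <- list(c(EFL = 0.2, ESL = 0.5),
          c(EFL = 0.4),
          c(EFL = 0.6, ESL = 0.7))

# unlist() keeps the element names, so values can be grouped by name
vals <- unlist(L)
by_name <- split(unname(vals), names(vals))

# One set of quantiles per name; groups may differ in length
q <- lapply(by_name, quantile)
q
```

Because the values are grouped by name rather than by position, no NA padding is needed when an element is absent from some vectors.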
I have a function that ranks a variable based on # of occurrences.
rankTab <- function (x)
{
tab1 <- data.frame(table(x))
tab1 <- tab1[order(-tab1$Freq), ]
tab1
}
I'd like to run this across a data.frame with multiple columns and figure out a rough measure of cardinality by saying for each column, what % of values are covered by the 5 most frequently occurring values. Something like this:
df$top_5_val_pct <- round(sapply(x, function(x) sum(rankTab(x)[1:max(5,nrow(x)),'Freq']) / length(x)), 4)
My problem is when there are < 5 values, I'm getting an NA as there aren't 5 rows to sum. I've tried using min and max but can't figure out how to get 5 or fewer rows. Any suggestions?
I'm having a hard time parsing the code you're using to accomplish this, but going simply off of "what % of values are covered by the 5 most frequently occurring values" I'd do something like this:
sortTab <- function(x,n){
t <- sort(table(x))
sum(tail(t,n)) / sum(t)
}
sapply(mtcars,sortTab,n = 2)
where in this example, I'm finding the proportion covered by the two most common values.
How about adding na.rm = TRUE to the sum() call?
sum(rankTab(x)[1:5, "Freq"], na.rm = TRUE)
giving
df <- data.frame(A = sample(letters[1:4], 20, replace = TRUE),
B = sample(letters[1:4], 20, replace = TRUE))
round(sapply(df, function(x) sum(rankTab(x)[1:5, "Freq"], na.rm = TRUE) / length(x)), 4)
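Another way to avoid the NA, sketched here with a hypothetical helper called top_n_share, is to take the rows with head(), which returns at most n entries and simply returns fewer when fewer exist. Note that table() drops NA values by default while length(x) counts them, so columns with missing data would need a decision on which denominator to use.

```r
# Hypothetical helper: share of values covered by the n most frequent levels
top_n_share <- function(x, n = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  # head() never indexes past the end, so no NA rows are produced
  sum(head(counts, n)) / length(x)
}

# Made-up example data with fewer than five distinct values per column
df <- data.frame(A = c("a", "a", "b", "c"),
                 B = c("x", "x", "x", "x"))
shares <- round(sapply(df, top_n_share, n = 5), 4)
shares
```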
I wonder if it is possible to use rollapply() only for certain rows of a dataframe. I know the "by" argument can specify the every by-th time point at which I calculate FUN, but now I have a very specific vector of row indices to which I wish to apply the rollapply(). For example, I have the below dataframe:
df <- data.frame(x = (1:10), y = (11:20))
I know how to calculate the rolling mean for y column when the rolling width is 3.
library(zoo)
m <- rollapply(df$y, width = 3, FUN = mean, fill = NA, align = "right")
But what if I want the width-3-mean only for the 4th and 9th row? Is there something in "by" argument that I can manipulate? Or some other better methods (using apply to do rolling calculation maybe)?
Hopefully I am understanding your question correctly. I think you are asking how to perform a function on every 4th and 9th element in a sliding window? If yes, just restrict your function to the 4th and 9th element using x[4] and x[9]. Like this:
output <- rollapply(df, 9, function(x) (x[4] + x[9]) / 2, fill = NA, align = "right")
Your question could also be read as asking how to get the mean whenever the window contains the 4th or 9th row. This can be done by subsetting. The question you need to think about is where you want the 4th and 9th rows to sit within your window. Do you want the 4th row to be at position x[1], x[2], or x[3] within your window? What sits at the other positions will obviously affect your output. If you don't know, and all three seem reasonable, you will need to write code that creates a list of data frames containing the range of data you are interested in, and then use an apply function, or a for loop, to rollapply the mean function over each data frame in the list. You can then combine all of these outputs into a data frame to work with further. Like this:
# the rlist library has a function that allows us to add items to a list
# which will be handy later on
library(rlist)
library(zoo)
# your example data
df <- data.frame(x = (1:10), y = (11:20))
# a vector of your desired rows
desired_rows <- c(4,9)
# A for loop that generates a list of dataframes
# with your desired rows in the middle of each
for (i in desired_rows){
lower_bound <- i-2
upper_bound <- i+2
df_subset <- df[c(lower_bound:upper_bound), ]
if(exists("list_df_range")){
list_df_range <- list.append(list_df_range, df_subset)
}else{
list_df_range <- list(df_subset)
}
}
# a second for loop that applies your rollapply function to each
# data frame in the list and then
# returns a dataframe of the final results
# with each column named after the originating row
for (n in list_df_range){
m <- rollapply(n$y, width = 3, FUN = mean, fill = NA, align = "right")
if(exists("final_out")){
final_out <- cbind(final_out, m)
}else{
final_out <- data.frame(m)
}
}
names(final_out) <- desired_rows
Based on the poster's comment below the question, it seems that what is wanted is the mean of each rolling window of width 3, excluding the middle element of each window, keeping only the results for the 4th and 9th elements:
cc <- c(4, 9)
rollapply(df$y, list(c(-2, 0)), mean, fill = NA)[cc]
## [1] 13 18
or
rollapplyr(df$y, 3, function(x) mean(x[-2]), fill = NA)[cc]
## [1] 13 18
or
sapply(cc, function(ix) mean(df$y[seq(to = ix, by = 2, length = 2)]))
## [1] 13 18
or
(df$y[cc - 2] + df$y[cc]) / 2
## [1] 13 18
I am trying to set up a for loop that calculates different rolling means of a return series, with windows ranging from the last 2 observations to the last 16 observations, i.e. k ∈ [2, 16]. I've been trying to use a function like the one below, where rollmean is a function from zoo. This produces the warning "Warning message:
In roll[i] <- rollmean(x, i) :
number of items to replace is not a multiple of replacement length"
Can someone please help me?
rollk <- function(x, kfrom= 2, kto=16){
roll <- as.list(kto-kfrom+1)
for (i in kfrom:kto){
roll[i]<- rollmean(x, i)
return(roll)
}}
I suppose you want
# library(zoo)
rollk <- function(x, kfrom = 2, kto = 16){
roll <- list()
ft <- kfrom:kto
for (i in seq_along(ft)){
roll[[i]]<- rollmean(x, ft[i])
}
return(roll)
}
There are several problems in your function:
You need [[ to access a single list element, not [.
You want a list of length length(kfrom:kto). Now, i starts at 1, not at kfrom.
Now, roll is returned after the for loop. Hence, the function returns a single list containing all values.
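The first point is what triggers the warning in the question. A small sketch with made-up values (the names good and bad are only for illustration) shows the difference between the two assignment forms:

```r
good <- list()
good[[1]] <- c(1.5, 2.5, 3.5)   # [[ stores the whole vector as one element

bad <- list()
# [ tries to fill one slot with three values: R warns that the number of
# items to replace is not a multiple of the replacement length
suppressWarnings(bad[1] <- c(1.5, 2.5, 3.5))

length(good[[1]])  # 3
length(bad[[1]])   # 1, only the first value was kept
```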
A shorter equivalent of the function above:
rollk2 <- function(x, kfrom = 2, kto = 16)
# stats::filter is named explicitly in case another package (e.g. dplyr) masks filter()
lapply(seq(kfrom, kto), function(i) na.omit(stats::filter(x, rep(1 / i, i))))
It does not require loading additional packages.
Try this:
library(zoo)
lapply(2:16, rollmean, x = x)
I am trying to duplicate each column from a data frame and move it to a randomly chosen position within 1-3 columns of its original place, and do this for every column in the data frame. I want columns to move AT LEAST one space to the left or right. Of course sample(data) reorders columns randomly, but my attempts to put it in a loop are embarrassingly bad (I admit I skipped the majority of linear algebra classes, damn...). Below is some example data:
dat <- read.table(textConnection(
"-515.5718 94.33423 939.6324 -502.9918 -75.14629 946.6926
-515.2283 96.10239 939.5687 -503.1425 -73.39015 946.6360
-515.0044 97.68119 939.4177 -503.4021 -71.79252 946.6909
-514.7430 99.59141 939.3976 -503.6645 -70.08514 946.6887
-514.4449 101.08511 939.2342 -503.9207 -68.48133 946.7183
-514.2769 102.29453 939.0013 -504.2665 -67.04509 946.7809
-513.9294 104.02753 938.9436 -504.4703 -65.34361 946.7899
-513.5900 105.49624 938.7684 -504.7405 -63.75965 946.7991"
),header=F,as.is=T)
sample(dat)#random columns position
How about this brute-force but plenty-fast solution?
It tries out different permutations of the columns until it finds one in which each column is moved at least 1, and not more than 3 columns to left or right. When it finds such a permutation, the test in the final line of the while() call evaluates to FALSE, terminating the loop and leaving the variable x containing the acceptable permutation.
n <- ncol(dat)
while({x <- sample(n) # Proposed new column positions
y <- seq_len(n) # Original column positions
max(abs(x - y)) > 3 | min(abs(x - y)) == 0
}) NULL
dat[x]
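The same rejection-sampling idea can be written with repeat and an explicit break, which some readers find easier to follow than a while() with an empty body. This is only a sketch of an equivalent form: the six-column data frame, the seed, and the names shift and shuffled are made up for illustration.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility
dat <- as.data.frame(matrix(1:12, ncol = 6))   # made-up 6-column data
n <- ncol(dat)

repeat {
  x <- sample(n)                  # proposed new column positions
  shift <- abs(x - seq_len(n))    # distance each column moves
  # accept only if every column moves at least 1 and at most 3 places
  if (min(shift) >= 1 && max(shift) <= 3) break
}

shuffled <- dat[x]
```

Rejection sampling like this is fast for a handful of columns, but the share of acceptable permutations shrinks as the number of columns grows, so the loop may take noticeably longer on wide data frames.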
I should probably wait to post this until I have time to comment it up, and discuss some of the ambiguities in the problem as currently specified in the comments above. But since I won't be able to do that, possibly for a while, I thought I'd give you code for a solution that you can examine yourself.
# Create a function that generates acceptable permutations of the data
getPermutation <- function(blockSize, # number of columns/block
nBlock, # number of blocks of data
fromBlocks) { # indices of blocks to be moved
X <- unique(as.vector(outer(fromBlocks, c(-2,-1,1,2), "+")))
# To remove nonsensical indices like 0 or -1
X <- X[X %in% seq.int(nBlock)]
while({toBlocks <- sample(X, size = length(fromBlocks))
max(abs(toBlocks - fromBlocks)) > 2 | min(abs(toBlocks - fromBlocks)) < 1
}) NULL
A <- seq.int(nBlock)
A[toBlocks] <- fromBlocks
A[fromBlocks] <- toBlocks
blockColIndices <-
lapply(seq.int(nBlock) - 1,
function(X) {
seq(from = X * blockSize + 1,
by = 1,
length.out = blockSize)
})
unlist(blockColIndices[A])
}
# Create an example dataset, a 90 column data.frame
dat <- as.data.frame(matrix(seq.int(90*4), ncol=90))
# Call the function for a data frame with 30 3-column blocks
# within which you want to move blocks 2, 14, and 15.
index <- getPermutation(3, 30, c(2, 14, 15))
newdat <- dat[index]