Compute mean across intervals of rows - r

I have a long matrix, of which I need to compute the mean for a certain interval of rows. At the moment I am doing this manually like this:
values <- cbind(1:50,1)
meanqual10 <- mean(values[1:10,1])
meanqual10
[1] 5.5
meanqual15 <- mean(values[11:20,1])
meanqual15
[1] 15.5
meanqual20 <- mean(values[21:30,1])
meanqual20
[1] 25.5
meanqual25 <- mean(values[31:40,1])
meanqual25
[1] 35.5
meanqual30 <- mean(values[41:50,1])
meanqual30
[1] 45.5
There must be a nicer way of doing this. Can anybody help please?

Señor O's answer is nice if you have regular intervals. Another approach, if you want to select arbitrary rows, could be something like:
l <- list(1:10,11:20,21:30,31:40,41:50) # vectors of any length or ordering
sapply(l, function(x) mean(values[x,1]))
Which gives:
[1] 5.5 15.5 25.5 35.5 45.5
Of course, by() can do this for arbitrary rows, too; this is just a slightly different approach.

by(values[,1], ceiling(1:50 / 10), mean)
ceiling(1:50/10) creates a vector of length 50 with a new integer every 10 numbers.
by will then take the mean for each group of the same integer.
Result:
ceiling(1:50/10): 1
[1] 5.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 2
[1] 15.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 3
[1] 25.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 4
[1] 35.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 5
[1] 45.5
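For regular intervals, two other base R idioms give the same five means. This is only a minimal sketch, assuming the number of rows is an exact multiple of the interval length (10 here); the names grp, means_tapply and means_matrix are illustrative and not from the answers above.

```r
values <- cbind(1:50, 1)

# Group index: 1 for rows 1-10, 2 for rows 11-20, and so on
grp <- ceiling(seq_len(nrow(values)) / 10)

# tapply returns a plain named vector instead of a "by" object
means_tapply <- tapply(values[, 1], grp, mean)

# Reshape into a 10-row matrix so that each column holds one interval,
# then take column means (only valid when nrow(values) is a multiple of 10)
means_matrix <- colMeans(matrix(values[, 1], nrow = 10))

print(means_tapply)
print(means_matrix)
```

The tapply version also works with uneven groups, as long as grp labels each row; the matrix version is likely the faster of the two on long inputs but needs equal-sized intervals.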


Fill NAs by smoothing the peak values

df <- data.frame(date = seq(from=as.POSIXct(as.Date("2020-10-01")),
to= as.POSIXct(as.Date("2020-10-02")) , by = 'hour'),
val = c(15,20,18,22,17,NA,NA,NA,80,14,23,16,19,21,NA,NA,60,18,15,20,22,19,NA,35,18))
There are uneven sequences of NAs followed by peak values, e.g. val = 80, 60 and 35.
I would like to fill the NAs by smoothing the peak values. For example, in the first NA sequence, three NAs are followed by 80, which makes four data points; therefore 80 is divided by 4, giving 20 for each of them.
Note: the peak values are NOT outliers, so the total sum of the data points should not change.
If possible, I would like to fill the NAs with the above conditions while preserving the signal behavior (trend and seasonality).
Many thanks.
The following function fills each sequence of NA values, together with the non-NA value that follows it, with that value divided by the number of filled positions, so the total sum is preserved.
fill_na <- function(x){
  na <- is.na(x)
  r <- rle(na)
  # values to fill per run: the NAs plus the peak that follows them
  div <- r$lengths[r$values] + 1L
  # index of the last NA in each run
  cs <- cumsum(r$lengths)[r$values]
  for(i in seq_along(div)){
    # a trailing run of NAs has no following value, so it is skipped
    if(cs[i] < length(x)){
      x[ (cs[i] - div[i] + 2L):(cs[i] + 1L) ] <- x[ cs[i] + 1L ]/div[i]
    }
  }
  x
}
fill_na(df$val)
# [1] 15.0 20.0 18.0 22.0 17.0 20.0 20.0 20.0 20.0 14.0 23.0
#[12] 16.0 19.0 21.0 20.0 20.0 20.0 18.0 15.0 20.0 22.0 19.0
#[23] 17.5 17.5 18.0
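The same idea can also be written without an explicit loop. The sketch below is an alternative under my own assumptions, not the answerer's code: it groups every run of NAs with the peak that follows it and spreads the peak evenly over the group using ave(). The names grp, spread_peak and filled are hypothetical.

```r
val <- c(15, 20, 18, 22, 17, NA, NA, NA, 80, 14, 23, 16, 19, 21,
         NA, NA, 60, 18, 15, 20, 22, 19, NA, 35, 18)

# A new group starts right after every non-NA value, so each run of NAs
# lands in the same group as the peak that follows it
grp <- cumsum(c(TRUE, head(!is.na(val), -1)))

spread_peak <- function(v) {
  if (all(is.na(v))) return(v)  # trailing NAs without a peak stay NA
  rep(sum(v, na.rm = TRUE) / length(v), length(v))
}

filled <- ave(val, grp, FUN = spread_peak)

# Each peak is only redistributed, so the total does not change
stopifnot(isTRUE(all.equal(sum(filled), sum(val, na.rm = TRUE))))
print(filled)
```

Values that are not preceded by NAs form groups of length one and are returned unchanged. Note that this only redistributes each peak evenly; it makes no attempt to model trend or seasonality.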

how to pass variable name in for loop or lapply functions in R? [duplicate]

This question already has answers here:
Dynamically select data frame columns using $ and a character value
(10 answers)
Closed 2 years ago.
I am not able to pass variable names correctly in a for loop or with lapply.
When I try this command without a loop or lapply, it works and I get values:
> boxplot.stats(df$price)$out
[1] 38.7 43.8 41.3 50.0 50.0 50.0 50.0 37.2 39.8 37.9 50.0
[12] 50.0 42.3 48.5 50.0 44.8 50.0 37.6 46.7 41.7 48.3 42.8
[23] 44.0 50.0 43.1 48.8 50.0 43.5 45.4 46.0 50.0 37.3 50.0
[34] 50.0 50.0 50.0 50.0
But when I put this inside lapply or a for loop, I get NULL. Why?
df_numeric_names <- names(select_if(df, is.numeric))
df_numeric_names
[1] "price" "resid_area" "air_qual" "room_num" "age" "dist1" "dist2" "dist3"
[9] "dist4" "teachers" "poor_prop" "n_hos_beds" "n_hot_rooms" "rainfall" "parks" "Sold"
loop
for (feature in df_numeric_names){
outlier_values <- boxplot.stats(df$feature)$out
print(outlier_values)
}
- Output:
NULL
NULL
NULL
lapply
lapply(df_numeric_names, function(x) {
boxplot.stats(df$x)$out
})
- output
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
This is a fairly simple thing, but I am not sure what I am doing wrong or how to fix it.
This slight change in the loop could solve your issue:
for (feature in df_numeric_names){
outlier_values <- boxplot.stats(df[,feature])$out
print(outlier_values)
}
And a little example:
library(dplyr)
#Data
data("iris")
df <- iris
#Numeric names
df_numeric_names <- names(select_if(df, is.numeric))
#Loop
for (feature in df_numeric_names){
outlier_values <- boxplot.stats(df[,feature])$out
print(outlier_values)
}
The output:
numeric(0)
[1] 4.4 4.1 4.2 2.0
numeric(0)
numeric(0)
With lapply(), you should use code similar to this:
lapply(df_numeric_names, function(x) {
boxplot.stats(df[,x])$out
})
Output:
[[1]]
numeric(0)
[[2]]
[1] 4.4 4.1 4.2 2.0
[[3]]
numeric(0)
[[4]]
numeric(0)
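The reason the original loop printed NULL is that $ does not evaluate its right-hand side: df$feature looks for a column literally called "feature". The sketch below illustrates this on iris and uses [[ ]], which also works on tibbles, where df[, feature] would return a one-column tibble rather than a vector. The names feature and outliers are illustrative.

```r
df <- iris
feature <- "Sepal.Width"

# `$` takes the name literally, so this looks for a column named "feature"
print(is.null(df$feature))

# `[[` evaluates the variable and returns the column as a plain vector
print(identical(df[[feature]], df$Sepal.Width))

# Named list of outliers for every numeric column
numeric_names <- names(df)[sapply(df, is.numeric)]
outliers <- lapply(setNames(numeric_names, numeric_names),
                   function(nm) boxplot.stats(df[[nm]])$out)
print(outliers)
```

Naming the list elements via setNames() makes the result easier to read than the [[1]], [[2]] labels in the output above.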

R: What can I do about a slow double sapply?

I have a computation that does a version of this:
n <- 5
l <- 3
m <- seq(0,1,length.out = n)
r <- seq(3,4,length.out = n)
y <- 1:n
pp <- sapply(0:l, function(h) cumsum(y[(h+1):n]*y[1:(n-h)]))
rec.acf <- sapply(0:l, function(h) pp[[h+1]] + sapply((h+1):n, function(j) m[j] + r[j-h]) )
to obtain
> rec.acf
[[1]]
[1] 4.0 8.5 18.0 34.5 60.0
[[2]]
[1] 5.25 11.75 24.25 44.75
[[3]]
[1] 6.5 15.0 30.5
[[4]]
[1] 7.75 18.25
In practice, of course, n and l are much larger (and the actual functions, computing autocovariances over samples of increasing size, are more complicated).
When l is relatively small, as I had hoped for, the computations work much faster than other implementations I had worked out that do not take into account that I can recycle many identical computations through pp.
However, the picture reverses when l is large relative to n, likely because the outer sapply then sends off many inner loops. Is there anything obviously inefficient about my approach?
I tinkered with mapply, without much success.
The inner loop is unnecessary:
mm = lapply(0:l, function(h) tail(m, length(m) - h) + head(r, length(r) - h))
mapply("+", pp, mm)
#[[1]]
#[1] 4.0 8.5 18.0 34.5 60.0
#
#[[2]]
#[1] 5.25 11.75 24.25 44.75
#
#[[3]]
#[1] 6.5 15.0 30.5
#
#[[4]]
#[1] 7.75 18.25
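To check that dropping the inner loop does not change the result, the nested and vectorised versions can be compared on the toy data. This is a sketch assuming l < n; it uses lapply instead of sapply so that the result is always a list, even when all elements happen to have the same length. The names nested and vectorised are illustrative.

```r
n <- 5
l <- 3
m <- seq(0, 1, length.out = n)
r <- seq(3, 4, length.out = n)
y <- 1:n

pp <- lapply(0:l, function(h) cumsum(y[(h + 1):n] * y[1:(n - h)]))

# Original formulation: inner sapply over j for every lag h
nested <- lapply(0:l, function(h) {
  pp[[h + 1]] + sapply((h + 1):n, function(j) m[j] + r[j - h])
})

# Vectorised formulation: index m and r with whole ranges at once
vectorised <- lapply(0:l, function(h) {
  pp[[h + 1]] + m[(h + 1):n] + r[1:(n - h)]
})

stopifnot(isTRUE(all.equal(nested, vectorised)))
print(vectorised)
```

Indexing with (h + 1):n and 1:(n - h) selects the same elements as tail(m, length(m) - h) and head(r, length(r) - h) in the answer above.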

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the preceding values based on the "rate". For instance, in row 1, the first NA value would be replaced by "27-(0.5*3)", the second one by "27-(0.5*2)" and the third one by "27-(0.5*1)". This is the loop I came up with. I know the first part (the outside loop) works but the inside one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
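One caveat with approach (1): seq(n, 1) evaluates to c(0, 1) when a row has no NAs, so such a row would come back with the wrong length. The following sketch is a guarded variant under the same assumption of leading NAs only; the function name fill_leading and the extra row l6 are hypothetical additions for illustration.

```r
l1 <- c(NA, NA, NA, 27, 31, 0.5)
l2 <- c(NA, 8, 12, 28, 39, 0.5)
l3 <- c(NA, NA, NA, NA, 39, 0.3)
l4 <- c(NA, NA, 11, 15, 31, 0.2)
l5 <- c(NA, NA, NA, NA, 51, 0.9)
l6 <- c(5, 6, 7, 8, 9, 0.1)  # hypothetical complete row, not in the question
data <- as.data.frame(rbind(l1, l2, l3, l4, l5, l6))
colnames(data) <- c("dbh1", "dbh2", "dbh3", "dbh4", "dbh5", "rate")

fill_leading <- function(x) {
  rate <- x[length(x)]
  n <- sum(is.na(x))          # number of (leading) NAs
  if (n == 0) return(x)       # nothing to fill in a complete row
  # step back from the first observed value: n steps, n - 1 steps, ..., 1 step
  x[seq_len(n)] <- x[n + 1] - rate * rev(seq_len(n))
  x
}

filled <- t(apply(data, 1, fill_leading))
print(filled)
```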
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need nested for loops for this. Here is some simplified code that does what you want. Working explicitly with your data, we first need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data,1, function(x) sum(is.na(x)) )
# make it a matrix: NA counts in the first column, zeros in the remaining ones
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data)-2))),
nrow = nrow(data), ncol = ncol(data)-1)
# fill in the rest of the matrix
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] -1
}
Once we have that, we can use this code to back-fill the NA values in the way you are interested in.
for(i in (ncol(data)-1):1){
  if(sum(is.na(data[,i]))>0){
    # rows that still have an NA in column i
    to_fill <- which(is.na(data[,i]))
    data[to_fill,i] <- for_estimate[to_fill] - data$rate[to_fill]*n_na[to_fill,i]
  }
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
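The same back-fill can also be expressed with matrix indexing and no loop at all. This is a sketch, assuming (as in the example) that the NAs are all leading; the names dbh, k, first and est are illustrative.

```r
l1 <- c(NA, NA, NA, 27, 31, 0.5)
l2 <- c(NA, 8, 12, 28, 39, 0.5)
l3 <- c(NA, NA, NA, NA, 39, 0.3)
l4 <- c(NA, NA, 11, 15, 31, 0.2)
l5 <- c(NA, NA, NA, NA, 51, 0.9)
data <- as.data.frame(rbind(l1, l2, l3, l4, l5))
colnames(data) <- c("dbh1", "dbh2", "dbh3", "dbh4", "dbh5", "rate")

dbh <- as.matrix(data[, paste0("dbh", 1:5)])

# Column index of the first non-NA value in each row
k <- max.col((!is.na(dbh)) * 1, ties.method = "first")

# The first observed value in each row, picked by (row, column) pairs
first <- dbh[cbind(seq_len(nrow(dbh)), k)]

# Estimate for every cell: first value minus rate times the number of
# columns between that cell and the first observed value
est <- first - data$rate * (k - col(dbh))

# Only the missing cells are overwritten
dbh[is.na(dbh)] <- est[is.na(dbh)]
data[, colnames(dbh)] <- dbh
print(data)
```

Because first, k and data$rate all have one element per row, R recycles them down the columns of the matrix, which lines each value up with its own row.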

Apply() function to a data frame

I have a data frame object which has 24 columns and each one has a different length. I would like to multiply every column by a vector of 24 values. I am thinking about using the apply function since I do not have a matrix. My data and my guess are as follows:
trans_temp:
Ta.f Ta.f Ta.f Ta.f
1995-10-13 04:00:00 13.6 13.6 13.6 13.6
1995-10-13 05:00:00 13.6 13.6 13.6 13.6
1995-10-13 06:00:00 13.6 13.6 13.6 13.6
1995-10-13 07:00:00 13.5 13.5 13.5 13.5
1995-10-13 08:00:00 13.5 13.5 13.5 13.5
and my vector is
x <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
So I want the first column multiplied by 1, the second by 2, the third by 3 and so on. I cannot multiply directly because it is a data.frame object.
apply(trans_temp,x,MARGIN=2,fun)
Any help?
Since x here is simply the column index, you can create a matrix of column indices directly and just multiply the data with it:
as.matrix(trans_temp) * col(trans_temp)
Benchmarking against eddi's approach:
m <- as.data.frame(matrix(runif(1e7), ncol=1000))
x <- seq_len(1000)
system.time(tt1 <- as.matrix(m) * col(m)) # 0.335 seconds
system.time(tt2 <- t(x*t(m))) # 0.505 seconds
identical(tt1, tt2) # TRUE
You are on the right track, but I don't understand how your columns have different lengths, unless you mean some contain, e.g. NA in them. Use MARGIN = 1 to apply across rows.
x <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
t( apply(trans_temp , MARGIN = 1 , function(y) x * y ) )
You could even shorten the call like so:
t( apply(trans_temp , 1 , `*` , x ) )
Here's another approach without using apply, which relies on R's recycling behavior:
t(x*t(trans_temp))
This will probably be much faster than the other two approaches.
^^^ Not anymore after Arun's edits :) What this has going for it now is that you can have an arbitrary x (and if you want an arbitrary operation in addition to arbitrary x, then you'd go with Simon's answer).
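For an arbitrary x, base R's sweep() is another option that states the intent directly. The sketch below uses a small made-up data frame and multiplier vector (trans_temp here has three columns and x is hypothetical, not the 24 columns from the question) and checks the result against the transpose trick.

```r
# Small stand-in for the question's data (values are made up)
trans_temp <- data.frame(a = c(13.6, 13.6, 13.5),
                         b = c(13.6, 13.6, 13.5),
                         c = c(13.6, 13.6, 13.5))
x <- c(2, 0.5, 10)  # one multiplier per column

# MARGIN = 2 applies x across columns: column j is multiplied by x[j]
scaled <- sweep(as.matrix(trans_temp), 2, x, `*`)

# Same result via recycling on the transposed matrix
via_transpose <- t(x * t(as.matrix(trans_temp)))

stopifnot(isTRUE(all.equal(unname(scaled), unname(via_transpose))))
print(scaled)
```

sweep() also checks that the length of x matches the number of columns and warns if it does not, whereas plain recycling may silently produce the wrong alignment.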
