For loop with a function for a moving/rolling average? - r

Essentially (in R), I want to apply a moving average function over a period of time (e.g. using date and time variables) to see how a particular metric changes over time. However, the metric itself is a function. The scores can be 1 (pro), 0 (neutral), or -1 (neg). The function for the metric is:
metric <- function(pro, neg, total) {  # named here for reference
  x <- (pro / total) * 100
  y <- (neg / total) * 100
  x - y
}
So the metric is the percentage of 1's minus the percentage of -1's.
Given timestamps for each recorded score, I want to evaluate the metric as a moving average across all rows. I assumed a for loop would be the best way to apply this, but I am stuck on how to do it.
Does anyone have any thoughts / advice?
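For reference, a trailing-window for-loop version of such a metric might look like the following minimal sketch (moving_metric is a hypothetical name; the zoo-based answer below uses a centered window instead):
moving_metric <- function(scores, w) {
  # scores: vector of -1/0/1 values ordered by time; w: window width
  out <- rep(NA_real_, length(scores))
  for (i in w:length(scores)) {
    win <- scores[(i - w + 1):i]
    out[i] <- ((sum(win == 1) - sum(win == -1)) / w) * 100
  }
  out
}
moving_metric(c(1, 1, 0, -1, 1, 0, -1, 1), w = 4)
# NA NA NA 25 25 0 -25 25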

As mentioned in the comments, rollapply() from the zoo package is a good option. I took the liberty of generating some example data; apologies if it doesn't resemble yours.
library(zoo)
f <- function(x, l) {
  p <- sum(x == 1) / l
  n <- sum(x == -1) / l
  (p - n) * 100
}
# Or more efficiently: with scores coded -1/0/1, sum(x) equals
# (#pro - #neg), so the two percentages collapse into one expression
f <- function(x, l = length(x)) {
  (sum(x) / l) * 100
}
set.seed(1)
N <- 25
dtf <- data.frame(time = as.Date(15000 + (1:N), origin = "1970-01-01"),
                  score = sample(-1:1, N, replace = TRUE))
score <- read.zoo(dtf)
l <- 8
# width l, with l also passed on as f's second argument; fill=NA pads the ends
zts <- cbind(score, rolling = rollapply(score, l, f, l, fill = NA))
zts
# score rolling
# 2011-01-27 -1 NA
# 2011-01-28 0 NA
# 2011-01-29 0 NA
# 2011-01-30 1 12.5
# 2011-01-31 -1 25.0
# 2011-02-01 1 12.5
# 2011-02-02 1 0.0
# 2011-02-03 0 -25.0
# 2011-02-04 0 0.0
# 2011-02-05 -1 -12.5
# 2011-02-06 -1 -12.5
# 2011-02-07 -1 -12.5
# 2011-02-08 1 0.0
# 2011-02-09 0 25.0
# 2011-02-10 1 37.5
# 2011-02-11 0 62.5
# 2011-02-12 1 62.5
# 2011-02-13 1 50.0
# 2011-02-14 0 37.5
# 2011-02-15 1 25.0
# 2011-02-16 1 0.0
# 2011-02-17 -1 NA
# 2011-02-18 0 NA
# 2011-02-19 -1 NA
# 2011-02-20 -1 NA
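As a sanity check, the first non-NA value can be reproduced by hand: with width 8 and rollapply()'s default align = "center", the value printed at 2011-01-30 covers rows 1 through 8 of the score column above.
w <- c(-1, 0, 0, 1, -1, 1, 1, 0)  # the first 8 scores from the output
f(w)                              # 12.5, matching the table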

Related

Dataframe calculation, anchor cell value to formula

I would like to do some calculations with the following dataframe. There are values in specific cells of one column, and I would like to replicate them based on a second column's value and store the result in a new, third column:
x <- c ("1", "2","3", "4")
z <- (rep(x,5))
batch <- sort(z)
NDF <- rnorm(20, 10, 1); NDF <- signif (NDF, digits =3)
Fibre_analysis <- data.frame(batch, NDF)
Fibre_analysis$NDF[[1]] <- 10
Fibre_analysis$NDF[[6]] <- 100
Fibre_analysis$NDF[[11]] <- 1000
Fibre_analysis$NDF[[16]] <- 10000
This is the table that I would like:
batch NDF NEW_column
1 1 10.00 10
2 1 10.80 10
3 1 9.44 10
4 1 10.30 10
5 1 11.60 10
6 2 100.00 100
7 2 8.26 100
8 2 9.15 100
9 2 9.40 100
10 2 8.53 100
11 3 1000.00 1000
12 3 9.41 1000
13 3 9.20 1000
14 3 10.30 1000
15 3 9.32 1000
16 4 10000.00 10000
17 4 11.20 10000
18 4 7.33 10000
19 4 9.34 10000
20 4 11.00 10000
I would like this to create a new column in the dataframe, holding an absolute cell value from $NDF that changes with each value of $batch.
Because I need to use this process more than once I created the following function:
batch_Function <- function(x, y) {
  ifelse(x == "1", y[[1]],
  ifelse(x == "2", y[[6]],
  ifelse(x == "3", y[[11]],
                   y[[16]])))
  print(y)
}
When I call the function:
Fibre_analysis$NEW_column <- batch_Function ( Fibre_analysis$batch , Fibre_analysis$NDF )
I expect $NEW_column to look like this:
x <- c(10,100,1000,10000)
NEW_column <- rep(x, each=5)
whereas instead it is an exact copy of $NDF.
The only necessary change is to drop print(y): an R function returns the value of its last evaluated expression, and print(y) returns y, so the function was returning a copy of $NDF instead of the ifelse() result:
batch_Function <- function(x, y) {
  ifelse(x == "1", y[[1]],
  ifelse(x == "2", y[[6]],
  ifelse(x == "3", y[[11]],
                   y[[16]])))
}
batch_Function (Fibre_analysis$batch , Fibre_analysis$NDF )
# [1] 10 10 10 10 10 100 100 100 100 100 1000 1000 1000 1000
# [15] 1000 10000 10000 10000 10000 10000
In case you still want print(y), you may put it at the beginning of batch_Function.
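A minimal sketch of the underlying pitfall, using a hypothetical function: an R function returns its last evaluated expression, and print(y) evaluates to y.
f_last_wins <- function(y) {
  y * 2     # computed but discarded
  print(y)  # evaluates to y, so this becomes the return value
}
f_last_wins(3)  # prints 3 and returns 3, not 6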

R Compute Statistics on Lagged Partitions

I have a data.frame with one column containing categorical data, one column containing dates, and one column containing numeric values. For simplicity, see the sample below:
A B C
1 L 2015-12-01 5.7
2 M 2015-11-30 2.1
3 K 2015-11-01 3.2
4 L 2015-10-05 5.7
5 M 2015-12-05 1.2
6 L 2015-11-15 2.3
7 L 2015-12-03 4.4
I would like to, for each category in A, compute a lagging average (e.g. average of the previous 30 days' values in column C).
I cannot for the life of me figure this one out. I have tried using sapply with a custom function that subsets the data.frame on category and date (or a deep copy of it) and returns the statistic (think mean or sd). That works fine for single values, but it returns all NAs from inside sapply.
Any help you can give is appreciated.
This could be done more compactly, but here I have drawn it out to make it easiest to understand. The core is the split, the lapply/apply, and then putting the pieces back together. It uses a date window rather than a sort-based solution, so it is very general. I also put the object back into its original order to enable direct comparison.
# set up the data
set.seed(100)
# create a data.frame with about a two-month period for each category of A
df <- data.frame(A = rep(c("K", "L", "M"), each = 60),
B = rep(seq(as.Date("2015-01-01"), as.Date("2015-03-01"), by="days"), 3),
C = round(runif(180)*6, 1))
head(df)
## A B C
## 1 K 2015-01-01 1.8
## 2 K 2015-01-02 1.5
## 3 K 2015-01-03 3.3
## 4 K 2015-01-04 0.3
## 5 K 2015-01-05 2.8
## 6 K 2015-01-06 2.9
tail(df)
## A B C
## 175 M 2015-02-24 4.8
## 176 M 2015-02-25 2.0
## 177 M 2015-02-26 5.7
## 178 M 2015-02-27 3.9
## 179 M 2015-02-28 2.8
## 180 M 2015-03-01 3.6
# preserve original order
df$originalOrder <- 1:nrow(df)
# randomly shuffle the order
randomizedOrder <- order(runif(nrow(df)))
df <- df[randomizedOrder, ]
# split on A - your own data might need coercion of A to a factor
df.split <- split(df, df$A)
# set the window size
window <- 30
# compute the moving average
listD <- lapply(df.split, function(tmp) {
  apply(tmp, 1, function(x)
    mean(tmp$C[tmp$B <= as.Date(x["B"]) & tmp$B > (as.Date(x["B"]) - window)]))
})
# combine the result with the original data
result <- cbind(do.call(rbind, df.split), rollingMean = unlist(listD))
# and tidy up:
# return to original order
result <- result[order(result$originalOrder), ]
result$originalOrder <- NULL
# remove the row names
row.names(result) <- NULL
result[c(1:5, 59:65), ]
## A B C rollingMean
## 1 K 2015-01-01 1.8 1.800000
## 2 K 2015-01-02 1.5 1.650000
## 3 K 2015-01-03 3.3 2.200000
## 4 K 2015-01-04 0.3 1.725000
## 5 K 2015-01-05 2.8 1.940000
## 59 K 2015-02-28 3.6 3.080000
## 60 K 2015-03-01 1.3 3.066667
## 61 L 2015-01-01 2.8 2.800000
## 62 L 2015-01-02 3.9 3.350000
## 63 L 2015-01-03 5.8 4.166667
## 64 L 2015-01-04 4.1 4.150000
## 65 L 2015-01-05 2.7 3.860000
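If you'd rather skip the split/recombine bookkeeping, the same date-window idea can be written as a single per-row pass. A sketch, assuming the df and window objects built above; it works for any row order, at the cost of scanning the whole frame once per row:
df$rollingMean2 <- vapply(seq_len(nrow(df)), function(i) {
  sameCat  <- df$A == df$A[i]
  inWindow <- df$B <= df$B[i] & df$B > df$B[i] - window
  mean(df$C[sameCat & inWindow])
}, numeric(1))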

Calculate mean of a proportion of the data.frame

I'm working with data that looks similar to this:
cat value n
1 100 18
2 0 19
3 -100 15
4 100 13
5 0 17
6 -100 18
In the real data, there are many cats and value can be any number between -100 and 100 (no NA).
What I want to do is calculate the mean of value based on terciles defined by n.
So, for example, since sum(n) = 100, I want groups of n as close as possible to 33 and the mean of value within each. For the first tercile, 18 isn't quite 33, so I take 15 more values from cat = 2; the mean for the first tercile should therefore be (100*18 + 0*15)/(18 + 15). The second tercile takes the remaining 4 n's from cat = 2, then as many as are needed to reach 33: (0*4 + -100*15 + 100*13 + 0*1)/(4 + 15 + 13 + 1). Similarly for the last tercile.
I got started writing this, but ended up with lots of nasty for loops and if statements. I'm hoping that you see an easier way to deal with this than I do. Thanks in advance!
A solution with data.table:
setDT(df)[rep(1:.N, n)
          ][, indx := c(rep("a", 33), rep("b", 33), rep("c", 34))
          ][, .(mean_val_indx = mean(value)), by = indx]
this gives:
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
Which are the means of value for the three parts of the data.
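These agree with the hand arithmetic sketched in the question:
(100*18 + 0*15) / 33                 # 54.54545, first tercile
(0*4 - 100*15 + 100*13 + 0*1) / 33   # -6.060606, second tercile
(0*16 - 100*18) / 34                 # -52.94118, third tercile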
Broken down in the intermediate steps:
1: replicate the rows according to n
setDT(df)[rep(1:.N,n)]
this gives (shortened):
cat value n
1: 1 100 18
2: 1 100 18
....
17: 1 100 18
18: 1 100 18
19: 2 0 19
20: 2 0 19
....
36: 2 0 19
37: 2 0 19
38: 3 -100 15
....
99: 6 -100 18
100: 6 -100 18
2: create an index with [,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
setDT(df)[rep(1:.N, n)
          ][, indx := c(rep("a", 33), rep("b", 33), rep("c", 34))]
this gives:
> dt
cat value n indx
1: 1 100 18 a
2: 1 100 18 a
....
17: 1 100 18 a
18: 1 100 18 a
19: 2 0 19 a
20: 2 0 19 a
....
32: 2 0 19 a
33: 2 0 19 a
34: 2 0 19 b
35: 2 0 19 b
....
99: 6 -100 18 c
100: 6 -100 18 c
3: summarise value by indx with [,.(mean_val_indx=mean(value)),by=indx]
You could try something like this, data being your example dataframe:
longData <- unlist(apply(data[, c("value", "n")], 1, function(x) {
  rep(x["value"], x["n"])
}))
aggregate(longData, list(cut(seq_along(longData), breaks = 3, right = FALSE)), mean)
longData will be a vector of length 100 with, using your example, 18 repetitions of 100, 19 repetitions of 0, and so on.
The cut in the aggregate will divide longData into three groups, and the mean of each group will be calculated.
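The grouping that cut() produces can be inspected directly; with 100 values the three intervals hold 33, 33, and 34 elements:
table(cut(seq_along(longData), breaks = 3, right = FALSE))
# [0.901,34)   [34,67) [67,100)
#         33        33       34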
If the data is already very long, replicating the rows by n may be unwanted. The following solution avoids it. Moreover, one third of the sum of the n values is not rounded to the nearest integer.
i is the vector of row numbers where terciles end. Since it is possible that several terciles end at the same row, those row numbers are replicated; the result is the vector k.
For each index j, the cumulative sum of data$value * data$n up to k[j] covers ms[k[j]] terciles, so the excess ms[k[j]] - j terciles' worth has to be subtracted to get the cumulative sum up to the jth tercile.
m  <- 3                          # number of groups (terciles)
sn <- sum(data$n)
ms <- m * cumsum(data$n) / sn    # terciles covered after each row
d  <- diff(c(0, floor(ms)))      # how many terciles end at each row
i  <- which(d > 0)               # rows where at least one tercile ends
k  <- rep(i, d[i])               # one entry per tercile, rows repeated if needed
vn <- data$value * data$n
# cumulative sums up to each tercile boundary, correcting for the part of
# the boundary row that belongs to later terciles
sums  <- cumsum(vn)[k] - (ms[k] - (1:m)) * data$value[k] * sn / m
means <- m * diff(c(0, sums)) / sn
The means of the terciles are:
> means
[1] 54 -6 -54
In this example "i" is equal to "k". But if terciles are replaced by deciles,
i.e. "m" is not 3 but 10, they are distinct:
> m
[1] 10
> i
[1] 1 2 3 4 5 6
> k
[1] 1 2 2 3 3 4 5 5 6 6
> means
[1] 100 80 0 -30 -100 60 50 0 -80 -100
I compared the speed of the four answers, using our small example from above:
> ##### "longData"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A1 <- f1(data) } )
User System verstrichen
3.48 0.00 3.49
> ##### "sapply"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A2 <- f2(data) } )
User System verstrichen
1.00 0.00 0.99
> ##### "data.table"Answer #####
>
> system.time( for ( i in 1:1000 ) { A3 <- f3(data) } )
User System verstrichen
4.73 0.00 4.79
> ##### this Answer #####
>
> system.time( for ( i in 1:1000 ) { A4 <- f4(data) } )
User System verstrichen
0.43 0.00 0.44
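(f1 through f4 are not shown in the post; presumably each answer was wrapped in a function of the data frame. Hypothetically, for the "longData" answer:
f1 <- function(data) {
  longData <- unlist(apply(data[, c("value", "n")], 1, function(x) {
    rep(x["value"], x["n"])
  }))
  aggregate(longData,
            list(cut(seq_along(longData), breaks = 3, right = FALSE)), mean)
}
and analogously for the others.)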
The "sapply"-Answer is even false:
> A1
Group.1 x
1 [0.901,34) 54.545455
2 [34,67) -6.060606
3 [67,100) -52.941176
> A2
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939
> A3
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
> A4
[1] 54 -6 -54
>
This is basically the same as NicE's answer, although perhaps useful as a different way of assembling the rep and cutting operations. Note, though, that the sort() reorders the values before cutting, which is why its output differs from the other answers':
sapply(split(sort(unlist(mapply(rep, res$value, res$n))),
             cut(seq(sum(res$n)), breaks = c(0, 33, 67, 100))),
       mean)
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939

Selecting data chunks depending on condition

I have a question about selecting data chunks based on a condition I provide. It's a multi-step process which I think should be done in a function and could then be applied to other data sets with lapply.
I have a data.frame with 19 columns (the example data here has only two). First I want to check the time column: a chunk's rows should lie in the range 90 to 54000, and chunks that fall outside it are skipped. Then, among the remaining chunks, count how many of the mag columns stay fully positive versus going negative/positive. If a chunk contains a negative number, count it as a switched state, and report the switching rate as (number of chunks showing a switched state) / (number of chunks whose time lies in 90:54000).
For the data chunks that satisfy the range 90:54000, also check mag for the first observation < 0, together with the corresponding time.
numbers <- c(seq(1, -1, length.out = 601), seq(1, 0.98, length.out = 601))
time <- c(seq(90, 54144, length.out = 601), seq(90, 49850, length.out = 601))
data <- data.frame(time = rep(time, times = 12), mag = rep(numbers, times = 6))
n <- 90:54000
dfchunk <- split(data, factor(sort(rank(row.names(data)) %% n)))
ext_fsw <- lapply(dfchunk, function(x) x[which(x$mag < 0)[1], ])
x.n <- data.frame(matrix(unlist(ext_fsw), nrow = n, byrow = TRUE))
Here is what the real dataset looks like:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 90 0 0 0 0.0023 -0.0064 0.9987 0.0810 0.0375 0.9814 0.0829 0.0379 0.9803 0.0715 0.0270 0.9823
2 180 0 0 0 0.0023 -0.0064 0.9987 0.0887 -0.0281 0.9818 0.0956 -0.0288 0.9778 0.0796 -0.0469 0.9772
3 270 0 0 0 0.0023 -0.0064 0.9987 -0.0132 -0.0265 0.9776 0.0087 -0.0369 0.9797 0.0311 -0.0004 0.9827
4 360 0 0 0 0.0023 -0.0064 0.9987 0.0843 0.0369 0.9752 0.0765 0.0362 0.9749 0.0632 0.0486 0.9735
5 450 0 0 0 0.0023 -0.0064 0.9987 0.1075 -0.0660 0.9737 0.0914 -0.0748 0.9698 0.0586 -0.0361 0.9794
6 540 0 0 0 0.0023 -0.0064 0.9987 0.0006 0.0072 0.9808 -0.0162 -0.0152 0.9797 0.0369 0.0118 0.9763
Here are the expected outputs (just an example).
For part 1:
ss (switched state)   total countable chunks   switching probability
5                     10                       5/10
For part 2:
time mag
27207 -0.03
26520 -0.98
32034 -0.67
...
etc.
Okay, I think I have this figured out. I put the logic into two functions. For each function, you give it a data frame and a column name, and it returns the requested data.
library(dplyr)
thabescity <- function(data, col){
  filter_vec <- data[col] < 0
  new_df <- data %>%
    filter(filter_vec) %>%
    filter(90 <= time & time <= 54000) %>%
    group_by(time) %>%
    summarise()
  ss <- nrow(new_df)
  total <- length(unique(data$time))
  switching_probability <- ss / total
  output <- as.data.frame(cbind(ss, total, switching_probability))
  return(output)
}
print(thabescity(data, "mag"))
ss total switching_probability
1 298 1201 0.2481266
You can run it in a loop over all the columns and collect the results in a list:
data_names <- names(data)[2:length(names(data))]
first_problem <- list()
for(name in data_names){
first_problem[[name]] <- thabescity(data, name)
}
first_problem[["mag"]]
ss total switching_probability
1 298 1201 0.2481266
The second problem is a bit easier:
thabescity2 <- function(data, col){
  data <- data[, c("time", col)]
  filter_vec <- data[col] < 0
  new_df <- data %>%
    filter(filter_vec) %>%
    filter(90 <= time & time <= 54000) %>%
    group_by(time) %>%
    filter(row_number() == 1)
  return(new_df)
}
print(thabescity2(data, "mag"))
Source: local data frame [298 x 2]
Groups: time
time mag
1 27207.09 -0.003333333
2 27297.18 -0.006666667
3 27387.27 -0.010000000
4 27477.36 -0.013333333
5 27567.45 -0.016666667
6 27657.54 -0.020000000
7 27747.63 -0.023333333
8 27837.72 -0.026666667
9 27927.81 -0.030000000
10 28017.90 -0.033333333
.. ... ...
You can do the same thing as above to go through the whole dataframe:
data_names <- names(data)[2:length(names(data))]
second_problem <- list()
for(name in data_names){
second_problem[[name]] <- thabescity2(data, name)
}
second_problem[["mag"]]
Source: local data frame [298 x 2]
Groups: time
time mag
1 27207.09 -0.003333333
2 27297.18 -0.006666667
3 27387.27 -0.010000000
4 27477.36 -0.013333333
5 27567.45 -0.016666667
6 27657.54 -0.020000000
7 27747.63 -0.023333333
8 27837.72 -0.026666667
9 27927.81 -0.030000000
10 28017.90 -0.033333333
.. ... ...
Double check my results, but I think this does what you want.

Automatically creating and filling data frames in R

Here is the code that I am working with.
rnumbers <- data.frame(replicate(5,runif(20000, 0, 1)))
dt <- .001
A <- dt * 1    # if B <= rnum < A, switch to state 1
B <- dt * .5   # if rnum < B, switch to state 0
rstate <- rnumbers # copy the structure
rstate[] <- NA # preserve structure with NA's
# Init:
rstate[1, ] <- rnumbers[1, ] < .02 & rnumbers[1, ] > 0.01
step_generator <- function(col, rnum){
  for (i in 2:length(col)) {
    if (rnum[i] < B) { col[i] <- 0 }        # jump to state 0
    else if (rnum[i] < A) { col[i] <- 1 }   # jump to state 1
    else { col[i] <- col[i - 1] }           # otherwise keep the previous state
  }
  return(col)
}
# Run for each column index:
for (cl in 1:5) {
  rstate[, cl] <- step_generator(rstate[, cl], rnumbers[, cl])
}
rstate1 <- transform(rstate, time = rep(dt))
rstate2 <- transform(rstate1, cumtime = cumsum(time))
This gives me a data frame with 5 columns that contain state switches over time. Time interval is in the 6th column (seconds) and cumulative time is in the 7th column (seconds). Now I want to see how long each state lasts in seconds. This is what I am doing -
1) lengths <- rle(rstate2[,1])
> lengths
Run Length Encoding
  lengths: int [1:15] 366 3278 1817 451 3033 1655 1901 748 742 1780 ...
  values : num [1:15] 0 1 0 1 0 1 0 1 0 1 ...
2) lengths1 <- data.frame(state = lengths$values, duration = lengths$lengths)
> lengths1
state duration
1 0 366
2 1 3278
3 0 1817
4 1 451
5 0 3033
6 1 1655
7 0 1901
8 1 748
9 0 742
10 1 1780
11 0 26
12 1 458
13 0 305
14 1 1039
15 0 2401
3) library("plyr")
lengths2 <- transform(lengths1, time = duration*dt)
lengths3 <- arrange(lengths2, desc(state))
> lengths3
state duration time
1 1 3278 3.278
2 1 451 0.451
3 1 1655 1.655
4 1 748 0.748
5 1 1780 1.780
6 1 458 0.458
7 1 1039 1.039
8 0 366 0.366
9 0 1817 1.817
10 0 3033 3.033
11 0 1901 1.901
12 0 742 0.742
13 0 26 0.026
14 0 305 0.305
15 0 2401 2.401
4) col1 <- ddply(lengths3, .(state), function(df) 1/mean(df$time))
> col1
state V1
1 0 0.7553583
2 1 0.7439685
So, col1 shows 1/mean(time in each state) for column 1 of rstate2. What I would like to do is iterate steps 1-4 over every column in rstate2 and generate a data frame that looks like this:
> rates
state col1 col2 col3 col4 col5
1 0 0.1 0.2 0.3 0.4 0.5
2 1 0.3 0.4 0.5 0.6 0.7
Where the numbers for each column are equal to the 1/mean(df$time) for each of the column from rstate2.
Thank you for any and all help.
I'd do this using the development version of data.table (v 1.8.11) in this manner:
require(data.table) # 1.8.11
require(reshape2)
DT <- data.table(rstate2)
DT.m <- melt(DT, id=6, measure=1:5)
ans <- DT.m[, { dl = data.table:::duplist(list(value))
                list(state = value[dl],
                     time = c(diff(dl), .N - dl[length(dl)] + 1) * dt)
              }, by = list(variable)]
ans <- ans[, 1/mean(time), by=list(variable, state)]
dcast.data.table(ans, state ~ variable)
state X1 X2 X3 X4 X5
1: 0 0.9875568 1.0777521 0.3227194 2.2371365 0.7237054
2: 1 1.0127608 0.4442799 0.2802691 0.2887169 1.0576415
Unfortunately, it's still only available as a development build on R-Forge. So you can probably install 1.8.10 from CRAN, use reshape2's melt and dcast (which output a data.frame), convert the result back to a data.table, and do the grouping as follows:
require(data.table) # 1.8.10
require(reshape2)
DT.m <- data.table(melt(rstate2, id=6, measure=1:5))
ans <- DT.m[, { dl = data.table:::duplist(list(value))
                list(state = value[dl],
                     time = c(diff(dl), .N - dl[length(dl)] + 1) * dt)
              }, by = list(variable)]
ans <- ans[, 1/mean(time), by=list(variable, state)]
dcast(ans, state ~ variable)
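For comparison, the asker's steps 1-4 can also be wrapped directly in base R, without the development data.table. A minimal sketch, assuming the rstate2 and dt objects from above (the columns come out named X1-X5 rather than col1-col5):
rate_per_state <- function(col) {
  runs <- rle(col)                 # step 1: run lengths per state
  durations <- runs$lengths * dt   # steps 2-3: run durations in seconds
  tapply(durations, runs$values, function(tm) 1 / mean(tm))  # step 4
}
rates <- data.frame(state = 0:1, sapply(rstate2[1:5], rate_per_state))
rates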
