zoo::rollmax: how to shorten the width to prevent errors - R

I have 10 days of values, and for each day I want to know the max of the previous 4 days. If there aren't 4 days' worth of values, then I want the max of the last 3 days, etc. Code example:
library(dplyr)
library(zoo)

set.seed(131)
Index <- 1:10
Val <- c(sample(10, 10, replace = T))
df = data.frame(Index, Val)
dfoo = df %>%
  mutate(Lag1 = lag(Val, 1, default = 0),                                     # get the previous day's value
         Last4Max = rollmax(Lag1, 4, partial = T, fill = 0, align = "right")) # get max of the last 4 days
This works for all days except days 2 and 3, since there aren't yet 4 values in Lag1 (day 1 should be 0/NA because there is no "previous" day).
Index Val Lag1 Last4Max
1 1 3 0 0
2 2 2 3 0
3 3 3 2 0
4 4 4 3 3
5 5 9 4 4
6 6 6 9 9
7 7 6 6 9
8 8 3 6 9
9 9 4 3 9
10 10 10 4 6
So Last4Max should be 3 for indices 2 and 3, and 0/NA for index 1. Is there a way to change the width to account for the width being larger than the number of rows available? My alternative is to create 4 lag variables, one per lag (with default = 0), and then take the max of all 4. I know this would work, but it seems clunky, and it would limit me if I wanted to quickly take the max of the last 10 days on a bigger dataset.
Thanks

1) Note that:
- as per ?rollmax it does not have a partial argument; however, we can use rollapply or rollapplyr with a partial argument and specify FUN = max.
- rollapplyr (and also rollmaxr), with an r on the end, defaults to align = "right", allowing one to avoid writing that argument out.
- the width argument can be a one-component list of offsets, so to specify that the prior 4 elements are to be used we can write width = list(-seq(4)), eliminating the need for a separate lag column.
Putting all these together we get:
library(zoo)
rollapplyr(Val, list(-seq(4)), max, partial = TRUE, fill = 0)
## [1] 0 3 3 3 4 9 9 9 9 6
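A minimal sketch of using this inside the question's dplyr pipeline (not part of the original answer; the column name Last4Max is taken from the question):
library(dplyr)
library(zoo)

df %>%
  mutate(Last4Max = rollapplyr(Val, list(-seq(4)), max, partial = TRUE, fill = 0))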
2) Another way to do this is to use a width of 5 but not use the last element when taking the maximum. In this case we don't need fill = 0 since, with partial = TRUE, every element of Val can be processed, leaving nothing to fill.
Max <- function(x) if (length(x) > 1) max(head(x, -1)) else 0
rollapplyr(Val, 5, Max, partial = TRUE)
2a) If we knew that all elements of Val were non-negative then we could alternately use this shorter definition for Max:
Max <- function(x) max(head(x, -1), 0)
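As a quick check (my addition, not in the original answer): with the Val shown in the question's table (3 2 3 4 9 6 6 3 4 10), either version of Max should reproduce the result from (1):
rollapplyr(Val, 5, Max, partial = TRUE)
## [1] 0 3 3 3 4 9 9 9 9 6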

Related

rollmean fill NAs with original value

I followed this example to do a rolling minimum (rollmin) in R, similar to the zoo package's rollmax.
But the first few values are filled with NAs. How can I fill the NAs with the original values so that I don't lose data points?
We may use coalesce with the original vector to replace each NA with the corresponding element of the original vector:
library(dplyr)
library(zoo)
coalesce(rollmeanr(x, 3, fill = NA), x)
If it is a data.frame:
ctd %>%
  group_by(station) %>%
  mutate(roll_mean_beam = coalesce(rollmeanr(beam_coef, k = 5, fill = NA),
                                   beam_coef))
data
x <- 1:10
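As a quick check (my addition), with this x the one-liner above should give the following; the first two entries fall back to x itself:
coalesce(rollmeanr(x, 3, fill = NA), x)
## [1] 1 2 2 3 4 5 6 7 8 9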
1) Using the original values seems a bit bizarre. Taking the rolling minimum of 1:10 using a width of 3 would give
1 2 1 2 3 4 5 6 7 8
I think what you really want is to apply min to however many points are available so that in this example we get
1 1 1 2 3 4 5 6 7 8
Now rollapplyr with partial=TRUE will use whatever number of points are available if fewer than width=3 exist at that point. At the first point only one point is available so it returns min(x[1]). At the second only two points are available so it returns min(x[1:2]). For all the rest it can use three points. Only zoo is used.
library(zoo)
x <- 1:10
rollapplyr(x, 3, min, partial = TRUE)
## [1] 1 1 1 2 3 4 5 6 7 8
2) The above seems more logical than filling the first two points with the first two input values, but if you really wanted to do that anyway then simply prefix the series with the original values using c, or use one of the other alternatives shown below. Only zoo is used.
c(x[1:2], rollapplyr(x, 3, min))
## [1] 1 2 1 2 3 4 5 6 7 8
pmin(rollapplyr(x, 3, min, fill = max(x)), x)
## [1] 1 2 1 2 3 4 5 6 7 8
replace(rollapplyr(x, 3, min, fill = NA), 1:2, x[1:2])
## [1] 1 2 1 2 3 4 5 6 7 8
Min <- function(x) if (length(x) < 3) tail(x, 1) else min(x)
rollapplyr(x, 3, Min, partial = TRUE)
## [1] 1 2 1 2 3 4 5 6 7 8

Is there a way to manipulate R dataframe rows based on values in the rows adjacent to them?

I'm looking at a metric for modelled populations which goes up and down somewhat randomly. If it exceeds a set limit for a year or two it's not a problem, but if it exceeds the limit for several years it's a sign that something is going wrong with the population.
I have dataframes where values can be inside (not_out) or outside (out) a set limit, along a time series (t). I want to extract the first instance of t where the value is 'out' for more than a set number of rows, for example 3. E.g.
library(dplyr)

set.seed(36)
example.data <- data.frame("t" = 1:15,
                           "value" = sample(-3:3, 15, replace = TRUE),
                           "limit" = 2)
example.data <- mutate(example.data,
                       "out_" = ifelse(value >= limit | value <= 0 - limit, "out", "not_out"))
example.data
t value limit out_
1 1 1 2 not_out
2 2 2 2 out
3 3 -1 2 not_out
4 4 0 2 not_out
5 5 -3 2 out
6 6 3 2 out
7 7 -2 2 out
8 8 0 2 not_out
9 9 3 2 out
10 10 1 2 not_out
11 11 1 2 not_out
12 12 3 2 out
13 13 3 2 out
14 14 0 2 not_out
15 15 0 2 not_out
So t == 5 would be the first instance where value goes 'out' and stays 'out' for more than 3 rows.
I tried to solve this with a for loop and an if statement along the lines of...
for(t in min(example.data$t) : max(example.data$t)) {
  if(example.data$out_ == "out"){
    a <- t
    return(a)
  }
}
But I'm struggling to get it to work for a single instance of out_ == "out", and I don't know how to tell R I want it to look at t & t+1...t+n when making the evaluation. Any help would be greatly appreciated.
You can also use data.table as follows:
library(data.table)
d <- as.data.table(example.data)
d[, n.prev.out := (shift(out_, type = 'lag', n = 1) == "out" &
                   shift(out_, type = 'lag', n = 2) == "out" &
                   out_ == "out")]
A dplyr version of the same idea, using run-length ids:
example.data %>%
  group_by(grp = data.table::rleid(out_)) %>%
  filter(row_number() == 1 & out_ == "out" & n() >= 3) %>%   # first row of each "out" run of length 3+
  ungroup() %>%
  select(-grp)
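For the example data only the t = 5-7 run has length three or more, so the filtered result should be the single row with t = 5. To pull out t directly, one could end the pipe with pull(t) instead of select(-grp) (a small variation of my own, not from the original answer):
example.data %>%
  group_by(grp = data.table::rleid(out_)) %>%
  filter(row_number() == 1 & out_ == "out" & n() >= 3) %>%
  ungroup() %>%
  pull(t)
## [1] 5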

R - Shift specified columns using minimum value into positive values

I'm looking for an easy way to shift selected columns of my dataframe into non-negative values by adding each column's minimum (as an absolute value) back to that column.
This feels like a common thing, but I haven't been able to find any good answers yet...maybe I'm missing something obvious.
Let's say I've got two columns (in reality I have close to 100) with positive and negative numbers.
w <- c(9, 9, 9, 9)
x <- c(-2, 0, 1, 3)
y <- c(-1, 1, 3, 4)
z <- as.data.frame(cbind(w, x, y))
w x y
1 9 -2 -1
2 9 0 1
3 9 1 3
4 9 3 4
I want z to look like this after a transformation applied only to the x and y columns ([, 2:3]):
w x y
1 9 0 0
2 9 2 2
3 9 3 4
4 9 5 5
Does that make sense?
library(dplyr)
dplyr::mutate(z, across(c(x, y), ~ . + abs(min(.))))
w x y
1 9 0 0
2 9 2 2
3 9 3 4
4 9 5 5
You can also select by column position rather than by column name, changing c(x, y) to 2:3, or to c(2:3, 5) for non-sequential column positions.
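For example (my addition), using positions for the z in the question; this should give the same result as the name-based call above:
dplyr::mutate(z, across(2:3, ~ . + abs(min(.))))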
It depends exactly what you mean and what you want to happen if there aren't any negative values. No matter the values, this will anchor the minimum of each transformed column at 0, but you should be able to adapt it if you want something slightly different.
z[2:3] <- lapply(z[2:3], function(col) col - min(col))
z
#   w x y
# 1 9 0 0
# 2 9 2 2
# 3 9 3 4
# 4 9 5 5
As a side note, as.data.frame(cbind(x, y)) is bad - if you have a mix of numeric and character values, cbind() will convert everything to character. It's shorter and better to simplify to data.frame(x, y).
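A quick illustration (my addition) of the coercion issue mentioned above, using a made-up two-column example:
str(as.data.frame(cbind(a = 1:2, b = c("u", "v"))))   # both columns end up character
str(data.frame(a = 1:2, b = c("u", "v")))             # a stays integer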
Do you want
z[2:3] <- lapply(z[2:3], function(columnValues) columnValues + abs(min(columnValues)))

R - Cut non-zero values

I have time series data in a data.table format (let's say it has columns "date" and "y"), and I would like to cut the non-zero values of y into quartiles by date, so that each quartile gets a label 1-4 and the zero values get a label of 0. I know that if I just wanted to do this for all values of y, I would run:
library(dplyr)

dt <- dt %>%
  group_by(date) %>%
  mutate(quartile = cut(y, breaks = 4, labels = 1:4))
But I can't figure out how to do it to get labels 0-4, with 0 allocated to 0-values of y, and 1-4 being the quartiles in the non-zero values.
Edit: To clarify, what I want to do is the following: for each date, I would like to divide the values of y on that date into 5 groups: 1) y = 0, 2) the bottom 25% of y (on that date), 3) the second 25% of y, 4) the third 25% of y, 5) the top 25% of y.
Edit 2:
So I have found 2 more solutions for this:
dt[, quartile := cut(y, quantile(dt[y > 0]$y, probs = 0:4/4),
                     labels = 1:4), by = date]
and
dt %>%
  group_by(date) %>%
  mutate(quartile = findInterval(y, quantile(dt[y > 0]$y, probs = 0:4/4)))
But what both of these seem to do is first calculate the break points for the entire dataset and then cut the data by date. I want the break points to be calculated by date, since the distribution of observations can differ between dates.
You can pass the output of quantile to the breaks argument of cut. By default, quantile will produce quartile breaks.
x <- rpois(100,4)
table(x)
x
0 1 2 3 4 5 6 7 8 9 10 12
1 7 17 19 17 18 12 5 1 1 1 1
cut(x,breaks=quantile(x),labels=1:4)
[1] 2 2 2 1 2 1 1 2 3 3 1 4 1 4 1
[16] 2 4 2 4 2 3 1 4 1 2 2 1 1 2 2
[31] 1 2 2 3 4 1 4 2 2 1 2 4 4 3 1
[46] 3 1 1 3 3 2 4 2 2 1 2 2 4 1 1
[61] 1 2 2 4 4 3 3 2 1 1 3 2 3 2 3
[76] 2 4 2 <NA> 2 3 2 4 2 1 4 4 3 4 1
[91] 2 4 3 2 2 3 4 4 3 2
Levels: 1 2 3 4
Note that the minimum value is excluded by default. If you compute the ranges including zero, the zeros will come out as NA; you can use this to your advantage and handle them separately afterwards with is.na.
However, if you want to exclude the zeros before computing the breaks, you will need to reduce the minimum break slightly to ensure all non-zero values are given a label. You can do this with, for example, quantile(x[x > 0]) - c(1e-10, rep(0, 4)). The zeros will again appear as NA in this case.
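Putting this together per date with dplyr might look roughly like the sketch below (my own addition, not from the original answer; it presumes every date has at least one non-zero y and that the quantile breaks are unique within each date):
library(dplyr)

dt %>%
  group_by(date) %>%
  mutate(quartile = cut(y,
                        breaks = quantile(y[y > 0], probs = 0:4 / 4) - c(1e-10, rep(0, 4)),
                        labels = 1:4),
         quartile = ifelse(is.na(quartile), 0L, as.integer(as.character(quartile)))) %>%
  ungroup()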
I'm admittedly not sure what you mean by "cutting the non-zero values of y into quartiles by date", and I'm afraid I don't have enough reputation to ask.
If 'date' is an actual date column, and you mean, "the new variable 'quartile' should indicate what part of the year y occurred in, assuming y isn't 0, in which case it should be 0", I'd do it like this:
library(dplyr)
library(lubridate)
# create example
dt <- data.frame(y = c(0, 1, 3, 4),
                 date = c("01-02-18", "01-06-18", "01-12-16", "01-04-17"))
dt <- dt %>%
  ## change 'date' to an actual date
  mutate(date = as_date(date)) %>%
  ## extract the quarter
  mutate(quartile = quarter(date)) %>%
  ## replace all quarters with 0 where y was 0
  mutate(quartile = if_else(y == 0, 0, as.double(quartile)))
EDIT: I think I understand the problem now. This is probably a little verbose, but I think it does what you want:
library(dplyr)
dt <- tibble(y = c(20, 30, 40, 20, 30, 40, 0),
             date = c("01-02-16", "01-02-16", "01-02-16",
                      "01-08-18", "01-08-18", "01-08-18", "01-08-18"))
new_dt <- dt %>%
  # keep only the cases where y is greater than 0
  filter(y > 0) %>%
  # group by date
  group_by(date) %>%
  # cut the y values per date
  mutate(quartile = cut(y, breaks = 4, labels = c(1:4)))
dt <- dt %>%
  # take the original dt, add in the newly calculated quartiles
  full_join(new_dt, by = c("y", "date")) %>%
  # replace the NAs by 0
  mutate(quartile = ifelse(is.na(quartile), 0, quartile))

dplyr::mutate comparing each value to vector, collapsing with any/all

I have a dataset of true values (location) that I'm attempting to compare to a vector of estimated values using dplyr. My code below results in a warning message. How do I compare each value of data$location to every value of est.locations and collapse the resulting vector to TRUE if all comparisons are greater than 20?
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
  mutate(false.neg = all(abs(location - est.locations) > 20))
num location false.neg
1 1 453.4281 FALSE
2 2 454.4260 FALSE
3 3 718.0420 FALSE
4 4 801.2217 FALSE
5 5 802.7981 FALSE
6 6 854.2148 FALSE
7 7 873.6085 FALSE
8 8 901.0217 FALSE
9 9 1032.8321 FALSE
10 10 1240.3547 FALSE
Warning message:
In c(...) :
longer object length is not a multiple of shorter object length
The context of the question is dplyr, but I'm open to other suggestions that may be faster. This is a piece of a larger calculation I'm doing on birth-death MCMC chains for 3000 iterations * 200 datasets (i.e. repeated many times, and the number of locations will differ among datasets and across iterations).
UPDATE (10/13/15):
I'm going to mark akrun's solution as the answer. A linear algebra approach is a natural fit for this problem and with a little tweaking this will work for calculating both FNR and FPR (FNR should need an (l)apply by iteration, FPR should be one large vector/matrix operation).
JohannesNE's solution points out the issue with my initial approach -- the use of all() collapses the comparison to a single value, when instead I intended the operation to be applied row-wise. Which also leads me to think there is likely a dplyr solution using rowwise() and do().
I attempted to limit the scope of the question in my initial post. But for added context, the full problem is on a Bayesian mixture model with an unknown number of components, where the components are defined by a 1D point process. Estimation results in a 'random effects' chain similar in structure to the version of est.locations below. The length mismatch is a result of having to estimate the number of components.
## Clarification of problem
options("max.print" = 100)
set.seed(1)
# True values (number of items and their location)
true.locations <-
data.frame("num" = 1:10,
"location" = runif(10, 0, 1500) %>% sort)
# Mcmc chain of item-specific values ('random effects')
iteration <<- 0
est.locations <-
lapply(sample(10:14, 3000, replace=T), function(x) {
iteration <<- iteration + 1
total.items <- rep(x, x)
num <- 1:x
location <- runif(x, 0, 1500) %>% sort
data.frame(iteration, total.items, num, location)
}) %>% do.call(rbind, .)
print(est.locations)
iteration total.items num location
1 1 11 1 53.92243818
2 1 11 2 122.43662006
3 1 11 3 203.87297671
4 1 11 4 641.70211495
5 1 11 5 688.19477968
6 1 11 6 1055.40283048
7 1 11 7 1096.11595818
8 1 11 8 1210.26744065
9 1 11 9 1220.61185888
10 1 11 10 1362.16553219
11 1 11 11 1399.02227302
12 2 10 1 160.55916378
13 2 10 2 169.66834129
14 2 10 3 212.44257723
15 2 10 4 228.42561489
16 2 10 5 429.22830291
17 2 10 6 540.42659572
18 2 10 7 594.58339156
19 2 10 8 610.53964624
20 2 10 9 741.62600969
21 2 10 10 871.51458277
22 3 13 1 10.88957267
23 3 13 2 42.66629869
24 3 13 3 421.77297967
25 3 13 4 429.95036650
[ reached getOption("max.print") -- omitted 35847 rows ]
You can use sapply (here inside mutate, but not really taking advantage of its functions).
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
  mutate(false.neg = sapply(location, function(x) {
    all(abs(x - est.locations) > 20)
  }))
num location false.neg
1 1 92.67941 TRUE
2 2 302.52290 FALSE
3 3 398.26299 TRUE
4 4 558.18585 FALSE
5 5 859.28005 TRUE
6 6 943.67107 TRUE
7 7 991.19669 TRUE
8 8 1347.58453 TRUE
9 9 1362.31168 TRUE
10 10 1417.01290 FALSE
We can use outer for this kind of comparison. We get all combinations of differences between 'location' and 'est.locations', take the abs, compare with 20, negate (!), take rowSums, and negate again, so that the result is TRUE exactly when all the elements in a row are greater than 20.
data$false.neg <- !rowSums(!(abs(outer(data$location, est.locations, FUN = '-')) > 20))
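A quick way to sanity-check (my addition) that the outer() one-liner agrees with the sapply() approach, using the objects defined above:
sapply_version <- sapply(data$location, function(x) all(abs(x - est.locations) > 20))
outer_version  <- !rowSums(!(abs(outer(data$location, est.locations, "-")) > 20))
all(sapply_version == outer_version)
## should be TRUE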
