I have a vector that I want to modify so that it contains only elements that are equal to or larger than the previous element. The vector represents a phenomenon that should only increase or stay the same (e.g. cumulative deaths by day), but reporting errors result in elements that are less than the previous element. I want to correct this by replacing elements with previous ones until the vector meets this criterion.
raw data : 1 3 3 6 8 10 7 9 15 12
desired modified data: 1 3 3 6 6 6 7 9 9 12
library(zoo)
library(magrittr) # provides %>%
raw <- c(1, 3, 3, 6, 8, 10, 7, 9, 15, 12)
replace.errors <- function(x){
x %>%
replace(diff(x) < 0, NA) %>%
na.locf(na.rm=FALSE)
}
replace.errors(raw)
# [1] 1 3 3 6 8 8 7 9 9 12
My function does not work when several consecutive elements need to be replaced (8 and 10), as it just carries forward an element that is still greater than the next one.
A data.table option using nafill along with cummin
nafill(replace(raw, rev(cummin(rev(raw))) != raw, NA), type = "locf")
gives
> nafill(replace(raw, rev(cummin(rev(raw))) != raw, NA), type = "locf")
[1] 1 3 3 6 6 6 7 9 9 12
Following the same idea as the approach above, your function replace.errors can be defined as
replace.errors <- function(x){
x %>%
replace(rev(cummin(rev(.))) != (.), NA) %>%
na.locf()
}
such that
> replace.errors(raw)
[1] 1 3 3 6 6 6 7 9 9 12
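Both versions above rely on rev(cummin(rev(x))), which may look cryptic at first. The following base R sketch spells out the intermediate steps; the names suffix_min, is_error and last_ok are only illustrative, and it assumes the first element is never an error (there would be nothing to carry forward).

```r
raw <- c(1, 3, 3, 6, 8, 10, 7, 9, 15, 12)

# Smallest value from each position to the end of the vector
suffix_min <- rev(cummin(rev(raw)))

# An element is an error if some later element is smaller than it
is_error <- raw > suffix_min

# Index of the most recent non-error element at or before each position
# (assumption: the first element is valid, so the index is never 0)
last_ok <- cummax(ifelse(is_error, 0L, seq_along(raw)))

# Carry the last valid value forward
fixed <- raw[last_ok]
fixed
```

This avoids both zoo and data.table, at the cost of being a little more verbose.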
Another option is to define a user function like below
f <- function(v) {
for (k in which(c(FALSE, diff(v) < 0))) {
p <- max(v[v < v[k]])
v <- replace(v, tail(which(v == p), 1):(k - 1), p)
}
v
}
which gives
> f(raw)
[1] 1 3 3 6 6 6 7 9 9 12
Base R using @ThomasIsCoding's brilliant replace logic:
# Replace values breaching condition with NA: scrubbed => numeric vector with NAs
scrubbed <- replace(raw, rev(cummin(rev(raw))) != raw, NA_integer_)
# i) Interpolate constants:
res <- na.omit(scrubbed)[cumsum(!is.na(scrubbed))]
# OR
# ii) Interpolate constants using approx()
res <- approx(scrubbed, method = "constant", n = length(scrubbed))$y
Or in one expression:
approx(
replace(raw, rev(cummin(rev(raw))) != raw, NA_integer_),
method = "constant",
n = length(raw)
)$y
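This smells a bit inefficient, but it may still be the best option. Note that it drops the offending elements instead of replacing them, so the result is shorter than the input: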
replace_errors <- function(raw) {
while (is.unsorted(raw)) {
raw <- raw[c(TRUE, diff(raw) >= 0)]
}
raw
}
I need to take an existing vector and create a new vector that contains the values:
$(x_1 + 2x_2 - x_3,\ x_2 + 2x_3 - x_4,\ \ldots,\ x_{n-2} + 2x_{n-1} - x_n)$
I've tried using xVec[n-2] + 2* xVec[n-1] - xVec[n] but this doesn't work!
Without zoo:
n <- 10
xVec <- seq(n)
idx <- seq(1, n-2)
xVec[idx] + 2* xVec[idx+1] - xVec[idx+2]
[1] 2 4 6 8 10 12 14 16
You need a rolling calculation, something that the zoo package provides:
vec <- 1:10
zoo::rollapply(vec, width = 3, FUN = function(z) z[1]+2*z[2]-z[3])
# [1] 2 4 6 8 10 12 14 16
Validation, using first three and last three:
1 + 2*2 - 3
# [1] 2
8 + 2*9 - 10
# [1] 16
Explanation: each time the function (passed to FUN=) is called, it is given a vector with width= elements in it. The first call is effectively z=1:3, the second call z=2:4, third z=3:5, etc.
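To make the sliding windows visible without zoo, here is a rough base R sketch using embed() from the stats package. It is only an illustration of the windowing idea, not how rollapply is implemented; the names windows and res are arbitrary.

```r
vec <- 1:10
width <- 3

# embed() returns one window per row, with the columns in reverse order,
# so flip the columns to read each window left to right
windows <- embed(vec, width)[, width:1]
head(windows, 3)  # rows are 1:3, 2:4 and 3:5

# Apply the same function to each window (row)
res <- apply(windows, 1, function(z) z[1] + 2 * z[2] - z[3])
res
```

Each row of windows corresponds to one call of the function passed to FUN=.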
You should know that by default it will return length(vec) - width + 1 elements in its return value. You can control this with fill= and align= arguments:
zoo::rollapply(1:10, width = 3, FUN = function(z) z[1]+2*z[2]-z[3], fill = NA)
# [1] NA 2 4 6 8 10 12 14 16 NA
zoo::rollapply(1:10, width = 3, FUN = function(z) z[1]+2*z[2]-z[3], fill = NA, align = "right")
# [1] NA NA 2 4 6 8 10 12 14 16
In a comment, B. Go has suggested to "reshape" the vector and wonders if this can be done in R as well.
In R, two packages provide functions to shift the elements of a vector: data.table and dplyr. (The lag() function from base R deals with time series objects.)
data.table
x <- 1:10
library(data.table)
shift(x, 2L) + 2 * shift(x) - x
[1] NA NA 2 4 6 8 10 12 14 16
dplyr
x <- 1:10
library(dplyr)
lag(x, 2L) + 2 * lag(x) - x
[1] NA NA 2 4 6 8 10 12 14 16
By default, both functions fill the positions vacated by the shift with NA. This explains why the first two elements of the result vector are NA.
To get rid of the leading NAs, the tail() function can be used, e.g.,
tail(shift(x, 2L) + 2 * shift(x) - x, -2L)
[1] 2 4 6 8 10 12 14 16
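If neither package is available, the same shifting can be sketched in base R. shift_right below is a hypothetical helper written for this example, not a function from any package, and it assumes 0 <= k <= length(x).

```r
# Hypothetical helper: move x to the right by k positions, padding with NA
shift_right <- function(x, k = 1L) {
  c(rep(NA, k), x[seq_len(length(x) - k)])
}

x <- 1:10
res <- shift_right(x, 2L) + 2 * shift_right(x) - x
res

# Drop the two leading NAs, as with tail() above
tail(res, -2L)
```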
If you are up for a bit of matrix math:
xVec <- 1:10
linear_combo <- c(1, 2, -1)
m <- matrix(0, length(xVec), length(xVec))
for (index in seq_along(linear_combo)) {
m[row(m) == col(m) - index + 1] <- linear_combo[index]
}
m %*% xVec
Note in this case the last two elements are incomplete and should probably be dropped or replaced by NA.
head(m %*% xVec, -(length(linear_combo) - 1))
I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
tempstate <- state[1]
energyvec <- energy[1]
for(i in 2:length(state)){
if(state[i] != tempstate){
energyvec <- c(energyvec, energy[i])
tempstate <- state[i]
}
}
return(energyvec)
}
and then try to pass it to aggregate somehow.
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[,myfun(state, energy), by= particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[,as.list(myfun(state, energy)), by= particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
group_by(particle) %>%
mutate(
changed_state = coalesce(state != lag(state, 1), TRUE)
) %>%
filter(changed_state) %>%
summarise(
string = toString(energy)
)
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually. Basically, create a changed_state variable by checking whether the current state differs from the previous state, lag(state, 1); coalesce() turns the NA in the first row of each group into TRUE, so the first observation is always kept. Since we only care about rows where the state changed, we filter where this is TRUE (a more verbose line would be filter(changed_state == TRUE)). The toString function collapses the rows of energy as desired, and we are already "grouped" by particle.
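For comparison, the same "keep the first row and every row where the state changed" logic can be sketched in base R, which also yields the list column the question asks for. energy_at_changes is a hypothetical helper name, and the energy values are the hard-coded ones used above.

```r
df <- data.frame(
  particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
  state    = c("A", "A", "B", "C", "A", "A", "B", "B", "B", "C", "A", "A"),
  energy   = c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
)

# Hypothetical helper: energy at the first row and at every state change
energy_at_changes <- function(state, energy) {
  changed <- c(TRUE, state[-1] != state[-length(state)])
  energy[changed]
}

# One data frame per particle, then one energy vector per particle
pieces <- split(df, df$particle)
res <- lapply(pieces, function(p) energy_at_changes(p$state, p$energy))

# Store the vectors of varying length in a list column
out <- data.frame(particle = names(res))
out$energy <- unname(res)
out
```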
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2
I am having trouble finding a vectorized representation for a specific loop in R. My objective is to improve the performance of the loop, because it has to be run thousands of times in the algorithm.
I want to find the position of the lowest value in a particular array section defined by a vector 'Level' for each row.
Example:
Level = c(2,3)
Let first row of array X be: c(2, -1, 3, 0.5, 4).
Searching for the position of the lowest value in the range 1:Level[1] of the row (that is (2, -1)), I get a 2, because -1 < 2 and -1 stands on second position of the row. Then, searching the position of the lowest value in the second range (Level[1]+1):(Level[1]+Level[2]) (that is (3, 0.5, 4)), I get a 4, because 0.5 < 3 < 4 and 0.5 stands on fourth position of the row.
I have to perform this over each row in the array.
My solution to the problem works as follows:
Level = c(2,3,3) #elements per section, here: 3 sections with 2,3 and 3 levels
rows = 10 #number of rows in array X
X = matrix(runif(rows*sum(Level),-5,5),rows,sum(Level)) #array with 10 rows and sum(Level) columns, here: 8
Position_min = matrix(0,rows,length(Level)) #array in which the position of minimum values for each section and row are stored
for(i in 1:rows){
for(j in 1:length(Level)){ #length(Level) is number of intervals, here: 3
if(j == 1){coeff=0}else{coeff=1}
Position_min[i,j] = coeff*sum(Level[1:(j-1)]) + which(X[i,(coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])] == min(X[i,(coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])]))
}
}
It works fine but I would prefer a solution with better performance. Any ideas?
This will remove the outer level of the loop:
Level1=c(0,cumsum(Level))
for(j in 1:(length(Level1)-1)){
Position_min[,j]=max.col(-X[,(Level1[j]+1):Level1[j+1]])+(Level1[j])
}
Here is a "fully vectorized" solution with no explicit loops:
findmins <- function(x, level) {
series <- rep(seq_along(level), level)
x <- split(x, factor(series))
minsSplit <- as.numeric(sapply(x, which.min))
minsSplit + c(0, cumsum(level[-length(level)]))
}
Position_min_vectorized <- t(apply(X, 1, findmins, Level))
identical(Position_min, Position_min_vectorized)
## [1] TRUE
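The per-row logic can be checked by hand against the worked example from the question (row c(2, -1, 3, 0.5, 4) with Level = c(2, 3)). This is only a sketch for a single row; the names x_row, section and offset are made up for the illustration.

```r
Level <- c(2, 3)
x_row <- c(2, -1, 3, 0.5, 4)

# Section label for every column of the row: 1 1 2 2 2
section <- rep(seq_along(Level), Level)

# Number of columns that precede each section: 0 2
offset <- c(0, cumsum(Level)[-length(Level)])

# Position of the minimum inside each section, shifted to a position in the row
pos <- as.vector(tapply(x_row, section, which.min)) + offset
pos
```

The result should be the positions 2 and 4 described in the question.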
You may get better performance on large inputs by making your matrix into a list and then using parallel's mclapply() (it relies on forking, so it does not run in parallel on Windows):
X_list <- split(X, factor(1:nrow(X)))
do.call(rbind, parallel::mclapply(X_list, findmins, Level))
## [,1] [,2] [,3]
## 1 1 5 6
## 2 2 3 6
## 3 1 4 7
## 4 1 5 6
## 5 2 5 7
## 6 2 4 6
## 7 1 5 8
## 8 1 5 8
## 9 1 3 8
## 10 1 3 8
I am trying to use the lag function from the dplyr package. However, when I give a lag > 0, I want the missing values to be replaced by the first value in x. How can we achieve this?
library(dplyr)
x<-c(1,2,3,4)
z<-lag(x,2)
z
## [1] NA NA 1 2
Since you are using the lag function from dplyr, there is an argument default. So you can specify that you want x[1] to be the default.
lag(x, 2, default=x[1])
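To make the expected result concrete without depending on dplyr, here is a minimal base R sketch of the same padding behaviour. It assumes k is no larger than length(x), and the variable names are illustrative only.

```r
x <- c(1, 2, 3, 4)
k <- 2

# Pad the front with k copies of the first value, then drop the last k values
z <- c(rep(x[1], k), x[seq_len(length(x) - k)])
z
```

The first k positions hold x[1] instead of NA, and the remaining values are shifted as before.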
Here's a modified function mylag:
mylag <- function(x, k = 1, ...)
replace(lag(x, k, ...), seq(k), x[1])
x <- 1:4
mylag(x, k = 2)
# [1] 1 1 1 2
May I suggest adapting the function so that it works both ways: for lag and lead (positive AND negative lags).
shift = function(x, lag, fill=FALSE) {
require(dplyr)
switch(sign(lag)/2+1.5,
lead( x, n=abs(lag), default=switch(fill+1, NA, tail(x, 1)) ),
lag( x, n=abs(lag), default=switch(fill+1, NA, head(x, 1)) )
)
}
It has a "fill" argument that automatically fills with the first or last value, depending on the sign of the lag.
> shift(1:10, -1)
#### [1] 2 3 4 5 6 7 8 9 10 NA
> shift(1:10, +1, fill=TRUE)
#### [1] 1 1 2 3 4 5 6 7 8 9
I'm trying to apply a function to a dataframe using ddply from the plyr package, but I'm getting some results that I don't understand. I have 3 questions about the results.
Given:
mydf<- data.frame(c(12,34,9,3,22,55),c(1,2,1,1,2,2)
, c(0,1,2,1,1,2))
colnames(mydf)[1] <- 'n'
colnames(mydf)[2] <- 'x'
colnames(mydf)[3] <- 'x1'
mydf looks like this:
n x x1
1 12 1 0
2 34 2 1
3 9 1 2
4 3 1 1
5 22 2 1
6 55 2 2
Question #1
If I do:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
mydf <- ddply(mydf, c("x") , .fun = k, .inform = TRUE)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "z", value = structure(c(12, 34, 9, :
replacement has 3 rows, data has 6
Error: with piece 1:
n x x1
1 12 1 0
2 9 1 2
3 3 1 1
I get this error regardless of whether I specify the variable to split by as c("x"), "x", or .(x). I don't understand why I'm getting this error message.
Question #2
But, what I really want to do is set up an if/else function because my dataset has variables x1, x2, x3, and x4 and I want to take those variables into account as well. But when I try something simple such as:
j <- function(x) {
if(x == 1){
mydf$z <- 0
} else {
mydf$z <- mydf$n
}
return(mydf)
}
mydf <- ddply(mydf, x, .fun = j, .inform = TRUE)
I get:
Warning messages:
1: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
Question #3
I'm confused about when to use function() and when to use function(x). Using function() for either j() or k() gives me a different error:
Error in .fun(piece, ...) : unused argument (piece)
Error: with piece 1:
n x x1 z
1 12 1 0 12
2 9 1 2 9
3 3 1 1 3
4 12 1 0 12
5 9 1 2 9
6 3 1 1 3
7 12 1 0 12
8 9 1 2 9
9 3 1 1 3
10 12 1 0 12
11 9 1 2 9
12 3 1 1 3
where column z is not correct. Yet I see a lot of functions written as function().
I sincerely appreciate any comments that can help me out with this
There's a lot that needs explaining here. Let's start with the simplest case. In your first example, all you need is:
mydf$z <- with(mydf,ifelse(x == 1,0,n))
An equivalent ddply solution might look like this:
ddply(mydf,.(x),transform,z = ifelse(x == 1,0,n))
Probably your biggest source of confusion is that you seem to not understand what is being passed as arguments to functions within ddply.
Consider your first attempt:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
The way ddply works is that it splits mydf up into several smaller data frames, based on the values in the column x. That means that each time ddply calls k, the argument passed to k is a data frame: specifically, a subset of your primary data frame.
So within k, x is a subset of mydf, with all the columns. You should not be trying to modify mydf from within k. Modify x, and then return the modified version. (If you must, but the options I displayed above are better.) So we might re-write your k like this:
k <- function(x) {
x$z <- ifelse(x$x == 1, 0, x$n)
return (x)
}
Note that you've created some confusing stuff by using x as both an argument to k and as the name of one of your columns.
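To see what the function actually receives, the split-apply-combine steps can be imitated in base R. This is only an approximation of what ddply does, not its implementation; add_z and piece are hypothetical names chosen so that the argument no longer clashes with the column x.

```r
mydf <- data.frame(
  n  = c(12, 34, 9, 3, 22, 55),
  x  = c(1, 2, 1, 1, 2, 2),
  x1 = c(0, 1, 2, 1, 1, 2)
)

# The function receives one piece (a smaller data frame), not a column
add_z <- function(piece) {
  piece$z <- ifelse(piece$x == 1, 0, piece$n)
  piece
}

# Split: one data frame per value of x
pieces <- split(mydf, mydf$x)

# Apply the function to each piece, then combine the results
res <- do.call(rbind, lapply(pieces, add_z))
rownames(res) <- NULL
res
```

Printing pieces shows the two three-row data frames that would be handed to the function one at a time.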