I have been writing a function for a centered moving average in R (without using any packages), and have run into the problem below.
As you know, a centered moving average has to handle the 'incomplete portions' (i.e. the windows at the beginning and the end of the data). For example, consider the vector p below:
p <- c(10,20,30,40,50,60,70,80,90)
In this case, the centered moving average that I am interested in looks like this:
x <- ((10+20)/2, (10+20+30)/3, (20+30+40)/3 ..... (70+80+90)/3, (80+90)/2)
To achieve the above, I tried a function using if statements as below (wd means window size):
mov_avg <- function(p, wd) {
x <- c(0, cumsum(p))
if ((p > p[1])&(p < p[length(p)])) {
neut <- 1:(length(p)-(wd-1))
upper <- neut+(wd-1)
x <- (x[upper]-x[neut])/(upper-neut)
} else if (p==p[1]) {
neut <- 0
upper <- neut+3
x <- (x[upper]-x[neut])/(upper-1-neut)
} else if (p==p[length(p)]) {
upper <-(length(p)+1)
neut <- (length(p)-(wd-2))
x <- (x[upper]-x[neut])/(upper-neut)
}
return(x)
}
Then I ran the line below:
mov_avg(p, 3)
I got the following output and warnings:
numeric(0)
Warning messages:
1: In if ((p > p[1]) & (p < p[length(p)])) { :
the condition has length > 1 and only the first element will be used
2: In if (p == p[1]) { :
the condition has length > 1 and only the first element will be used
Could someone help me out in making this a working function?
Thank you!
How about something like this in base R:
window <- 3
p <- c(10,20,30,40,50,60,70,80,90)
x <- c(NA, p, NA)
sapply(seq_along(x[-(1:(window - 1))]), function(i)
mean(x[seq(i, i + window - 1)], na.rm = T))
#[1] 15 20 30 40 50 60 70 80 85
The trick is to add flanking NAs and then use mean with na.rm = T.
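The flanking-NA trick generalizes to any odd window by padding with (window - 1) / 2 NAs on each side. The following is only a sketch under that assumption; the name centered_ma is made up for illustration and is not part of the answer above:

```r
# Centered moving average for an odd window, using NA padding (base R only).
# centered_ma is a hypothetical helper name.
centered_ma <- function(p, window = 3) {
  stopifnot(window %% 2 == 1, window >= 1)
  half <- (window - 1) / 2
  # Pad both ends so every position has a full-width window
  padded <- c(rep(NA_real_, half), p, rep(NA_real_, half))
  # na.rm = TRUE makes the edge windows average only the values that exist
  vapply(seq_along(p),
         function(i) mean(padded[i:(i + window - 1)], na.rm = TRUE),
         numeric(1))
}

p <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
centered_ma(p, 3)
#[1] 15 20 30 40 50 60 70 80 85
```

With window = 3 this reproduces the output shown above; a wider window simply shortens more of the edge windows.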
I know you said "without using packages", but the same is even shorter using zoo::rollapply
library(zoo)
rollapply(c(NA, p, NA), 3, mean, na.rm = T)
#[1] 15 20 30 40 50 60 70 80 85
We could also use rowMeans
rowMeans(embed(c(NA, p, NA), 3)[, 3:1], na.rm = TRUE)
#[1] 15 20 30 40 50 60 70 80 85
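The cumsum idea from the question can also be made to work without any if. An if statement needs a single TRUE or FALSE, which is why comparing the whole vector p produced the warnings. Clamping the window bounds with pmax() and pmin() avoids the branching entirely. A minimal sketch, assuming an odd window; mov_avg_cumsum is a name chosen here for illustration:

```r
# Vectorised centered moving average via differences of cumulative sums.
mov_avg_cumsum <- function(p, wd = 3) {
  half <- (wd - 1) %/% 2
  n <- length(p)
  cs <- c(0, cumsum(p))             # cs[k + 1] is the sum of p[1:k]
  lo <- pmax(seq_len(n) - half, 1)  # first index of each window, clamped to 1
  hi <- pmin(seq_len(n) + half, n)  # last index of each window, clamped to n
  (cs[hi + 1] - cs[lo]) / (hi - lo + 1)
}

p <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
mov_avg_cumsum(p, 3)
#[1] 15 20 30 40 50 60 70 80 85
```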
Another method is to write a function that takes the window size as an argument. Note that, as the output below shows, it returns one value fewer than the input:
mov_avg <- function(p, window) {
mean_number = numeric()
index = 1
while(index < length(p)) {
if (index == 1 | index == length(p) - 1)
mean_number = c(mean_number, mean(p[index:(index + window - 2)]))
else
mean_number = c(mean_number, mean(p[index:(index + window - 1)]))
index = index + 1
}
mean_number
}
mov_avg(p, 3)
#[1] 15 30 40 50 60 70 80 85
mov_avg(p, 2)
#[1] 10 25 35 45 55 65 75 80
Take the mean by rows of a three-column matrix: the first column is x (the data vector, p in the question), the second is x shifted down with the mean of the first two elements prepended, and the third is x shifted up with the mean of the last two elements appended.
apply(matrix(c(x,
               c((x[1] + x[2]) / 2, head(x, -1)),
               c(tail(x, -1), sum(tail(x, 2)) / 2)),
             ncol = 3),
      1, mean)
I have a large data matrix with many numeric values (counts) in it. I would like to remove 10% of all counts. So, for example, a matrix which looks like this:
30 10
0 20
The sum of all counts here is 60. 10% of 60 is 6. So I want to randomly remove 6. A correct output could be:
29 6
0 19
(As you can see it removed 1 from 30, 4 from 10 and 1 from 20). There cannot be negative values.
How could I program this in R?
Here is a way. It subtracts 1 from randomly chosen positive matrix elements until the requested total has been removed.
subtract_int <- function(X, n){
inx <- which(X != 0, arr.ind = TRUE)
N <- nrow(inx)
while(n > 0){
i <- sample(N, 1)
if(X[ inx[i, , drop = FALSE] ] > 0){
X[ inx[i, , drop = FALSE] ] <- X[ inx[i, , drop = FALSE] ] - 1
n <- n - 1
}
if(any(X[inx] == 0)){
inx <- which(X != 0, arr.ind = TRUE)
N <- nrow(inx)
}
}
X
}
set.seed(2021)
to_remove <- round(sum(A)*0.10)
subtract_int(A, to_remove)
# [,1] [,2]
#[1,] 30 6
#[2,] 0 18
Data
A <- structure(c(30, 0, 10, 20), .Dim = c(2L, 2L))
Maybe this helps you at least to get on the right track. It's nothing more than a draft though:
randomlyRemove <- function(matrix) {
sum_mat <- sum(matrix)
while (sum_mat > 0) {
sum_mat <- sum_mat - runif(1, min = 0, max = sum_mat)
x <- round(runif(1, 1, dim(matrix)[1]), digits = 0)
y <- round(runif(1, 1, dim(matrix)[2]), digits = 0)
matrix[x,y] <- matrix[x,y] - sum_mat
}
return(matrix)
}
You might want to play with the random number generation to get more evenly distributed subtractions.
edit: added round(digits = 0) to get only integer (dimension) values and modified the random (dimension) value generation to start from 1 (not zero).
I think we can make it work using sample. This solution is a lot more compact.
The data
A <- structure(c(30, 0, 11, 20), .Dim = c(2L, 2L))
sum(A)
#> [1] 61
The logic
UseThese <- (1:length(A))[A > 0] # Choose indices to be modified because > 0
Sample <- sample(UseThese, floor(sum(A)*0.1), replace = TRUE) # Draw a sample of indices
A[UseThese] <- A[UseThese] - as.vector(table(factor(Sample, levels = UseThese))) # Subtract per-index counts; factor() keeps a 0 count for indices that were never drawn
Check the result
A
#> [,1] [,2]
#> [1,] 28 8
#> [2,] 0 19
sum(A) # should be the value above minus 6
#> [1] 55
One disadvantage of this solution is that it could lead to negative values, so check with:
any(A < 0)
#> [1] FALSE
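The risk of negative values can be avoided by sampling individual count units without replacement, so no cell can lose more than it holds. The sketch below is one possible approach, not the method of any answer above; remove_counts is a hypothetical name, and note that it removes counts in proportion to cell size rather than uniformly over cells. It also builds a vector with one entry per count, which may be too large for very big totals.

```r
# Remove n counts from an integer-valued matrix without going negative.
remove_counts <- function(X, n) {
  stopifnot(n <= sum(X))
  # One entry per count unit, labelled with the cell (linear index) it belongs to
  units <- rep(seq_along(X), times = as.vector(X))
  # Draw n distinct units; a cell can never lose more units than it has
  removed <- units[sample.int(length(units), n)]
  # tabulate() counts removals per cell, including zeros for untouched cells
  X - tabulate(removed, nbins = length(X))
}

A <- matrix(c(30, 0, 10, 20), nrow = 2)
set.seed(1)
B <- remove_counts(A, round(sum(A) * 0.10))
sum(A) - sum(B)
#[1] 6
any(B < 0)
#[1] FALSE
```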
I have a data frame with a label "S" wherever my numeric column is < 35.
I'd like to take each S position and label the 3 rows before it "S-3", "S-2", "S-1", and the 2 rows after it "S+1", "S+2".
Like this:
N S
45
56
67 S-3
47 S-2
52 S-1
28 S
89 S+1
66 S+2
55
76
I was using this to start me off, just as an example.
n <- sample(50:100, 10, replace=T)
data <- data.frame(N=n)
data <- rbind(data, 30)
data <- rbind(data,data,data,data,data,data)
data$S <- ifelse(data$N<35, "S", "")
Any ideas..?
Here is an option using base R. We get the indices of the rows where 'N' is less than 35 ('i1') and create the 'S' column filled with blank ("") elements. Then we loop through 'i1': for each index we build the sequence from 3 rows before to 2 rows after ('ind') together with the matching labels ('val'). Finally, we keep only the indices that fall inside the data frame and assign the labels to the 'S' column.
i1 <- which(data$N < 35)
data$S <- ""
out <- do.call(rbind, lapply(i1, function(i) data.frame(ind =(i-3): (i+2),
val = c(paste0("S-", 3:1), "S", paste0("S+", 1:2)), stringsAsFactors = FALSE)))
i2 <- out$ind %in% seq_len(nrow(data))
data$S[out$ind[i2]] <- out$val[i2]
data
set.seed(24)
n <- sample(50:100, 10, replace=T)
data <- data.frame(N=n)
data <- rbind(data, 30)
data <- rbind(data,data,data,data,data,data)
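The same labelling can also be sketched as a loop over offsets rather than over hits, which keeps the bounds check in one place. This is an illustrative alternative under the assumption that before and after are both at least 1; label_around and its arguments are names invented here:

```r
# Label rows around each value below a threshold ("S-3" ... "S" ... "S+2").
label_around <- function(N, threshold = 35, before = 3, after = 2) {
  S <- rep("", length(N))
  hits <- which(N < threshold)
  for (k in c(-before:-1, 1:after)) {
    idx <- hits + k
    idx <- idx[idx >= 1 & idx <= length(N)]  # drop offsets outside the data
    S[idx] <- sprintf("S%+d", k)             # "%+d" always prints the sign
  }
  S[hits] <- "S"  # written last so an "S" row is never overwritten
  S
}

N <- c(45, 56, 67, 47, 52, 28, 89, 66, 55, 76)
data.frame(N = N, S = label_around(N))
```

When two windows overlap, later offsets overwrite earlier ones, so this does not resolve overlaps in any principled way.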
Without dealing with possible overlap, here is a bunch of ifelse() statements to get the job done. Maybe you can think of a more appropriate way to generalize it.
You can use lag() and lead() with the dplyr package.
library(dplyr)
data %>% mutate(S = ifelse(S == "S", S,
ifelse(lag(S == "S"), "S+1",
ifelse(lag(S == "S", 2), "S+2",
ifelse(lead(S == "S"), "S-1",
ifelse(lead(S == "S", 2), "S-2", ""))))),
S = ifelse(is.na(S), "", S))
(You would get NA values in the first two rows if the first value is not <35, so replace these with "".)
N S
1 52
2 86
3 86
4 57
5 54
6 57
7 51
8 98
9 100 S-2
10 73 S-1
11 30 S
12 52 S+1
13 86 S+2
14 86
This is a long-ish answer since I break it down into pieces I would normally implement using a pipeline and lambda expressions, but it should be easy enough to follow.
I will work on row indices and compute two vectors, one containing the index closest to i on the left that has label "S", and one containing the index closest to i on the right.
indices <- 1:length(data$S)
closest_left <- rep(NA, length = length(indices))
closest_right <- rep(NA, length = length(indices))
I compute these using purrr's reduce functions but you could easily do it in a loop as well.
this_or_left <- function(left_val, i) {
  # Index of the nearest "S" at or to the left of i
  res <- if (data$S[[i]] == "S") i else left_val
  closest_left[[i]] <<- res
  res
}
this_or_right <- function(right_val, i) {
  # Index of the nearest "S" at or to the right of i
  res <- if (data$S[[i]] == "S") i else right_val
  closest_right[[i]] <<- res
  res
}
purrr::reduce(indices, this_or_left, .init = this_or_left(NA, 1))
purrr::reduce_right(indices, this_or_right, .init = this_or_right(NA, length(indices)))
Whether you could do it with vectorised expressions I don't know. Possibly. I didn't try.
Now, I simply have to compute the distance to the closest S and make labels from that, using empty labels if the distance is greater than 3 and label "S" if the distance is zero.
get_dist <- Vectorize(function(i) {
down <- i - closest_left[i]
up <- closest_right[i] - i
if (is.na(down) || down > up) up
else if (is.na(up) || down <= up) -down
else NA
})
make_label <- Vectorize(function(dist) {
if (abs(dist) > 3) ""
else if (dist == 0) "S"
else if (dist < 0) paste0("S", dist)
else if (dist > 0) paste0("S+", dist)
})
make_label(get_dist(indices))
Here, I used Vectorized expressions to change it up a little.
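For what it is worth, the two closest-index vectors can probably be computed with vectorised cumulative functions instead of reduce. A minimal sketch on a toy logical vector, where is_s stands in for data$S == "S" and all names are illustrative:

```r
# Nearest "S" index to the left and to the right of every position.
is_s <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
idx  <- seq_along(is_s)
n    <- length(is_s)

# Running maximum of the "S" positions seen so far (0 means none yet)
left <- cummax(ifelse(is_s, idx, 0L))
left[left == 0L] <- NA

# Same idea from the right: running minimum over the reversed vector
right <- rev(cummin(rev(ifelse(is_s, idx, n + 1L))))
right[right > n] <- NA

left
#[1] NA NA  3  3  3  6  6
right
#[1]  3  3  3  6  6  6 NA
```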
I have a question about replacing values within a vector.
The algorithm should find the value to replace when a certain condition is met: in this case, the value whose difference from the previous value is -20. So I would prefer to use the diff function.
Here is what I mean:
x <- c(20,20,0,20,0,5)
> diff(x)
[1] 0 -20 20 -20 5
So in this case each 0 produces a difference of -20, and I want to change those 0s to 20.
I know the easiest solution is to assign directly, x[3] <- 20 or x[5] <- 20.
However, the location of the 0s is always different, so I need an automated process that can do that. Thanks!
EDIT: What if we need to do this in a grouped data.frame?
> df
x gr
1 20 1
2 20 1
3 0 1
4 20 1
5 0 1
6 5 1
7 33 2
8 0 2
9 20 2
10 0 2
11 20 2
12 0 2
How can we implement this?
modify <- function(x){
value_search = c(0, 33)
value_replacement = c(20, 44)
for (k in 1:length(value_search)) {
index_position = which(x %in% value_search[k])
replacement = value_replacement[k]
for (i in index_position) {
x[i] = replacement
}
}
}
df%>%
group_by(gr)%>%
mutate(modif_x=modify(x))
Error in mutate_impl(.data, dots) :
Evaluation error: 'match' requires vector arguments.
You can do it using which to get the position, i.e.
x[which(diff(x) == -20)+1] <- 20
x
#[1] 20 20 20 20 20 5
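For the grouped case in the question's EDIT, one base R possibility is to wrap the diff-based replacement in a small function and apply it per group with ave(). This is a sketch; fix_drops and its arguments are names invented here, and the data frame is retyped from the question:

```r
# Replace values that follow a drop of exactly `drop`, separately per group.
df <- data.frame(x  = c(20, 20, 0, 20, 0, 5, 33, 0, 20, 0, 20, 0),
                 gr = rep(1:2, each = 6))

fix_drops <- function(x, drop = -20, value = 20) {
  # diff(x)[i] compares x[i + 1] with x[i], hence the + 1
  x[which(diff(x) == drop) + 1] <- value
  x
}

# ave() calls fix_drops on each group's x and returns the results in row order
df$modif_x <- ave(df$x, df$gr, FUN = fix_drops)
df$modif_x
#[1] 20 20 20 20 20  5 33  0 20 20 20 20
```

The 0 that follows 33 in group 2 is left alone because its difference is -33, not -20.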
If you want a generic way to replace values of a vector based on particular values, I would approach it this way:
x = c(20,20,0,20,0,5)
value_search = 0
value_replacement = 20
index_position = which(x %in% value_search)
for (i in index_position) {
x[i] = value_replacement
}
This works for a single value. If you want to look for multiple values, you can use a nested loop as below:
x = c(20,20,0,20,0,5,33)
value_search = c(0, 33)
value_replacement = c(20, 44)
for (k in 1:length(value_search)) {
index_position = which(x %in% value_search[k])
replacement = value_replacement[k]
for (i in index_position) {
x[i] = replacement
}
}
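The nested loop can likely be replaced by a single lookup with match(), which returns, for each element of x, its position in value_search (or NA when there is none). A minimal sketch using the same example values:

```r
# Vectorised multi-value replacement with match().
x <- c(20, 20, 0, 20, 0, 5, 33)
value_search <- c(0, 33)
value_replacement <- c(20, 44)

hit <- match(x, value_search)  # NA where x is not one of the searched values
found <- !is.na(hit)
x[found] <- value_replacement[hit[found]]
x
#[1] 20 20 20 20 20  5 44
```

Unlike the loop, this makes one pass over the original values, so a replacement value that happens to equal a later search value is not replaced a second time.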
In response to OP's edits, there are any number of ways to do this:
x = c(20,20,0,20,0,5,33)
gr = c(1,1,1,1,2,2,2)
df = data.frame(x, gr)
func_replace <- function(source, value_search, value_replacement) {
  # Loop over the search values (not over the elements of source)
  for (k in seq_along(value_search)) {
    # Search in the function argument, not in the global x
    index_position = which(source %in% value_search[k])
    replacement = value_replacement[k]
    for (i in index_position) {
      source[i] = replacement
    } # for i loop
  } # for k loop
  return(source)
} # func_replace
value_search = c(0, 33)
value_replacement = c(20, 44)
gr_value = 1
df$replacement = with(df, ifelse(gr == gr_value, func_replace(x, value_search, value_replacement), NA))
This is my function:
g <- function(x,y){
x <- (x-y):x
y <- 1:30 # ------> (y is always fixed 1:30)
z<- outer(x,y,fv) # ---->(fv is a previous function)
s <- colSums(z)
which(s==max(s),arr.ind=T)
}
It tells me the position of the max value in s. My problem is choosing y: for a small y, max(s) appears more than once in s. For example:
#given x=53
> g(53,1)
[1] 13 16 20 22 25 26 27
> g(53,2)
[1] 20 25 26
> g(53,3)
[1] 20 25 26
> g(53,4)
[1] 20 25 26
> g(53,5)
[1] 20 25
> g(53,6)
[1] 25 -----> This is the only result i would like from my function (right y=6)
Another example:
# given x=71
> g(71,1)
[1] 7 9 14
> g(71,2)
[1] 7 14
> g(71,3)
[1] 14 -----> my desired result (right y=3)
Therefore, I would like a function that returns the first unique solution, using the smallest possible y (e.g. g(53) = 25, g(71) = 14, ...). Any help? Thanks
Here is a simplified example. I hope it makes the question clearer:
#The idea is the same:
n <- 1:9
e <- rep(n,500)
p<- sample(e) # ---> shuffle the values so that ties for the max are likely later (mixed matrix)
mat <- matrix(p,90)
g <- function(x,y){
x <- (x-y):x
k <- rowSums(mat[,x])
which(k==max(k), arr.ind=T)
}
#In my sample matrix :
k <- rowSums(mat[,44:45])
which(k==max(k), arr.ind=T)
[1] 44 71 90
#In fact
g(45,1)
[1] 44 71 90 # ---> more than one solution
g(45,2)
[1] 90 # ----> I would like to pick up this value, which is the first unique solution given x=45
Therefore, I would like a function that returns the first unique solution for the smallest possible y, given x (in this new example: g(45) = 90, ...).
I got it. It is a bit long, but I think it is right.
Using the second, simplified example:
g <- function(x,y){
x <- (x-y):x
k <- rowSums(mat[,x])
q <- which(k==max(k), arr.ind=T)
length(q)
}
gv <- Vectorize(g)
l <- function(x){
y<- 1:30 # <- (until 30 to be sure)
z<- outer(x,y,gv)
y <- which.min(z) # <- (min is surely length=1 and which.min takes the first)
x <- (x-y):x
k <- rowSums(mat[,x])
q <- which(k==max(k), arr.ind=T)
q
}
l(45)
[1] 90
It seems like you could just do this with a recursive function. Consider the following:
set.seed(42)
n = 1:9
e = rep(n, 500)
p = sample(e)
mat = matrix(p, 90)
g <- function(x, y=1) {
xv <- (x-y):x
k <- rowSums(mat[, xv])
i <- which(k == max(k), arr.ind=T)
n <- length(i)
if (n == 1) {
return(y) # want to know the min y that solves the problem, right?
} else {
y <- y + 1 # increase y by 1
g(x,y) # run our function again with a new value of y
}
}
You should now be able to run g(45) and get 1 as the result, since that is the value of y that solves the problem, and g(33) to get 2.
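The recursion above never stops if no y gives a unique maximum before x - y drops below 1. An iterative variant with an explicit upper bound is one way to guard against that. This is only a sketch: first_unique_y and max_y are hypothetical names, and it returns both the y found and the winning row, which may or may not be what is wanted.

```r
# Smallest y for which the row sums over columns (x - y):x have a unique maximum.
first_unique_y <- function(mat, x, max_y = x - 1) {
  for (y in seq_len(max_y)) {
    k <- rowSums(mat[, (x - y):x, drop = FALSE])
    winners <- which(k == max(k))
    if (length(winners) == 1) {
      return(list(y = y, row = winners))
    }
  }
  NULL  # no unique maximum found within the allowed range
}

# Tiny deterministic example: columns 2:3 tie between rows 1 and 2,
# adding column 1 breaks the tie in favour of row 2.
m <- matrix(c(1, 2, 0,
              1, 1, 1,
              5, 5, 0), nrow = 3)
first_unique_y(m, 3)
```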
How would I efficiently go about taking a 1-by-1 ascending random sample of the values 1:n, making sure that each of the randomly sampled values is always higher than the previous value?
e.g.:
For the values 1:100, pick a random number, say 61 (current list = 61).
Then pick another number between 62 and 100, say 90 (current list = 61, 90).
Then pick another number between 91 and 100, say 100.
Stop the process, as the max value has been hit (final list = 61, 90, 100).
I have been stuck in loop land, thinking in this clunky manner:
a1 <- sample(1:100,1)
if(a1 < 100) {
a2 <- sample((a1+1):100,1)
}
etc etc...
I want to report a final vector being the concatenation of a1,a2,a(n):
result <- c(a1,a2)
Even though this sounds like a homework question, it is not. I thankfully left the days of homework many years ago.
Coming late to the party, but I think this is gonna rock your world:
unique(cummax(sample.int(100)))
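To see why this one-liner works: in a random permutation, cummax() keeps the running maximum, so it only changes when a new record value appears, and unique() keeps exactly those records. The last record is always n. A small sketch with a fixed permutation (perm and ascending_sample are illustrative names):

```r
# Step through the idea on a hand-picked ordering
perm <- c(61, 12, 90, 75, 100, 3)
cummax(perm)
#[1]  61  61  90  90 100 100
unique(cummax(perm))
#[1]  61  90 100

# Wrapped as a function for any n
ascending_sample <- function(n) unique(cummax(sample.int(n)))

set.seed(1)
r <- ascending_sample(100)
all(diff(r) > 0)  # strictly increasing
#[1] TRUE
```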
This uses a while loop and is wrapped in a function
# from ?sample
resample <- function(x, ...) x[sample.int(length(x), ...)]
sample_z <- function(n){
z <- numeric(n)
new <- 0
count <- 1
while(new < n){
from <- seq(new+1,n,by=1)
new <- resample(from, size= 1)
z[count] <- new
if(new < n) count <- count+1
}
z[1:count]
}
set.seed(1234)
sample_z(100)
## [1] 12 67 88 96 100
Edit
Note the change to handle the case where the new sample is 100, and the way sample treats a single integer x (it samples from 1:x) as opposed to a vector.
Edit 2
Actually, reading the help for sample gave the useful resample function, which avoids the pitfalls when length(x) == 1.
Not particularly efficient but:
X <- 0
samps <- c()
while (X < 100) {
if(is.null(samps)) {z <- 1 } else {z <- 1 + samps[length(samps)]}
if (z == 100) {
samps <- c(samps, z)
} else {
samps <- c(samps, sample(z:100, 1))
}
X <- samps[length(samps)]
}
samps
EDIT: Trimming a little fat from it:
samps <- c()
while (is.null(samps[length(samps)]) || samps[length(samps)] < 100 ) {
if(is.null(samps)) {z <- 1 } else {z <- 1 + samps[length(samps)]}
if (z == 100) {
samps <- c(samps, z)
} else {
samps <- c(samps, sample(z:100, 1))
}
}
samps
even later to the party, but just for kicks:
X <- Y <- sample(100L)
while(length(X <- Y) != length(Y <- X[c(TRUE, diff(X)>0)])) {}
> print(X)
[1] 28 44 60 98 100
Sorting Random Vectors
Create a vector of random integers and sort it afterwards.
sort(sample(1:1000, size = 10, replace = FALSE),decreasing = FALSE)
This gives 10 random integers between 1 and 1000, in ascending order.
> sort(sample(1:1000, size = 10, replace = FALSE),decreasing = FALSE)
[1] 44 88 164 314 617 814 845 917 944 995
This of course also works with random non-integer values.