I'm trying to write a generic R function to produce figure (b) in the image, where x is the x-axis and g is the y-axis. The function should have the prototype f(x, start_x, dx, init_g, end_g), where x is a vector of integers representing timesteps (e.g. 1:100), start_x is the step at which the ramp starts, dx is the distance between start_x and the end of the ramp, init_g is the starting value of the function on the vertical axis, and end_g is the value at the end of the ramp. The function will return the ramp, which I can plot to get that figure.
If dx = 0, we get a step function like in figure (a). The ramp can also slope down, depending on whether end_g is greater or less than init_g.
I don't care about figures (c) or (d) in the image. I just can't figure out how to compute the part between start_x and start_x + dx. Thanks for the help.
It only takes four x and four g values to define a plot like this.
plot_fun <- function(x, start_x, dx, init_g, end_g) {
  x <- c(x[1], start_x, start_x + dx, tail(x, 1))
  g <- c(init_g, init_g, end_g, end_g)
  plot(x, g, type = "l")
}
plot_fun(x = 1:100, start_x = 20, dx = 30, init_g = 2, end_g = 5)
plot_fun(x = 1:100, start_x = 20, dx = 0, init_g = 2, end_g = 5)
If you just want to generate a vector g of the same length as x that is needed to produce the plot, here is how that can work:
make_g <- function(x, start_x, dx, init_g, end_g) {
  require(dplyr)
  require(tidyr)
  x_g <- data.frame(x = c(x[1], start_x + dx),
                    g = c(init_g, end_g))
  x_g <- data.frame(x) %>%
    left_join(x_g) %>%
    fill(g, .direction = "down")
  return(x_g$g)
}
make_g(x = 1:100, start_x = 20, dx = 30, init_g = 2, end_g = 5)
Joining, by = "x"
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5
[55] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
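Note that make_g() fills forward, so the returned vector jumps from init_g to end_g at start_x + dx rather than ramping between start_x and start_x + dx (the output above shows a single jump at x = 50). If you also want the intermediate ramp values, base R's approx() can interpolate them; a minimal sketch (make_g_ramp is just an illustrative name), assuming dx > 0 and that start_x and start_x + dx fall strictly inside the range of x:
make_g_ramp <- function(x, start_x, dx, init_g, end_g) {
  # linear interpolation through the four corner points of the ramp
  approx(x = c(min(x), start_x, start_x + dx, max(x)),
         y = c(init_g, init_g, end_g, end_g),
         xout = x)$y
}
make_g_ramp(x = 1:100, start_x = 20, dx = 30, init_g = 2, end_g = 5)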
I figured out something that seems to work, though I'm not sure exactly why. There's probably a more elegant way to do it.
f <- function(x, x0, dx, g0, g1) {
  s <- (x0 - dx) * (g1 - g0) / dx
  ifelse(x < x0, g0,
         ifelse(x < (x0 + dx), (x - dx) * (g1 - g0) / dx + (g0 - s),
                g1))
}
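For what it's worth, this works because the s term cancels out: on the ramp the expression simplifies to g0 + (x - x0) * (g1 - g0) / dx, which is plain linear interpolation from (x0, g0) to (x0 + dx, g1). A slightly more direct sketch of the same idea (f2 is just an illustrative name), using pmin()/pmax() to clamp x into the ramp window; note that dx = 0 divides by zero in both versions:
f2 <- function(x, x0, dx, g0, g1) {
  # clamp x into [x0, x0 + dx], then interpolate linearly
  g0 + (pmin(pmax(x, x0), x0 + dx) - x0) * (g1 - g0) / dx
}
f2(1:100, x0 = 20, dx = 30, g0 = 2, g1 = 5)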
Here's a function that returns data that you can use to plot.
f <- function(x, start_x, dx, init_g, end_g, type = c("ramp", "step")) {
  type <- match.arg(type)
  y <- numeric(length(x))
  if (!length(x)) return(y)
  end_x <- start_x + dx
  y[x < start_x] <- init_g
  y[x > end_x] <- end_g
  mid <- (x >= start_x & x <= end_x)
  y[mid] <-
    if (type == "ramp") {
      approx(c(start_x, end_x), c(init_g, end_g), xout = x[mid])$y
    } else c(rep(init_g, sum(mid) - 1), end_g)
  return(data.frame(x = x, y = y))
}
Demonstration:
plot(y ~ x, data = f(1:100, 50, 10, 10, 20, "ramp"), type = "b")
plot(y ~ x, data = f(1:100, 50, 10, 10, 20, "step"), type = "b")
I have a numeric vector like this, x <- c(1, 23, 7, 10, 9, 2, 4), and I want to group the elements from left to right with the constraint that each group sum must not exceed 25. Thus, here the first group is c(1, 23), the second is c(7, 10), and the last is c(9, 2, 4). The expected output is a data frame with a second column containing the groups:
data.frame(x= c(1, 23, 7, 10, 9, 2, 4), group= c(1, 1, 2, 2, 3, 3, 3))
I have tried different things with cumsum but have not been able to dynamically restart the cumulative sum for a new group once the limit of 25 is reached for the previous group.
I think an Rcpp function is the fastest way:
library(Rcpp)
cppFunction(
  "IntegerVector GroupBySum(const NumericVector& x, const double& max_sum = 25)
  {
    double sum = 0;
    int cnt = 0;
    int period = 1;
    IntegerVector res(x.size());
    for (int i = 0; i < x.size(); ++i)
    {
      ++cnt;
      sum += x[i];
      if (sum > max_sum)
      {
        sum = x[i];
        if (cnt > 1)
          ++period;
        cnt = 1;
      }
      res[i] = period;
    }
    return res;
  }"
)
GroupBySum(c(1, 23, 7, 10, 9, 2, 4), 25)
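which returns the expected grouping:
# [1] 1 1 2 2 3 3 3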
We can try this as a programming practice if you like :)
f1 <- function(x) {
  group <- c()
  while (length(x)) {
    idx <- cumsum(x) <= 25
    x <- x[!idx]
    group <- c(group, rep(max(group, 0) + 1, sum(idx)))
  }
  group
}
or
f2 <- function(x) {
  group <- c()
  g <- 0
  while (length(x)) {
    cnt <- s <- 0
    for (i in seq_along(x)) {
      s <- s + x[i]
      if (s <= 25) {
        cnt <- cnt + 1
      } else {
        break
      }
    }
    g <- g + 1
    group <- c(group, rep(g, cnt))
    x <- x[-(1:cnt)]
  }
  group
}
or
f3 <- function(x) {
  s <- cumsum(x)
  r <- c()
  grp <- 1
  while (length(s)) {
    idx <- (s <= 25)
    r <- c(r, rep(grp, sum(idx)))
    grp <- grp + 1
    s <- s[!idx] - tail(s[idx], 1)
  }
  r
}
which gives
[1] 1 1 2 2 3 3 3
and benchmarking among them looks like this:
library(microbenchmark)
library(ggplot2)
set.seed(1)
x <- runif(1e3, 0, 25)
bm <- microbenchmark(
  f1(x),
  f2(x),
  f3(x),
  check = "equivalent"
)
autoplot(bm)
Recursion version
Another option is using recursion (based on f1())
f <- function(x, res = c()) {
  if (!length(x)) {
    return(res)
  }
  idx <- cumsum(x) <= 25
  Recall(x[!idx], res = c(res, list(x[idx])))
}
and you will see
> f(c(1, 23, 7, 10, 9, 2, 4))
[[1]]
[1] 1 23
[[2]]
[1] 7 10
[[3]]
[1] 9 2 4
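If you need the group index vector rather than the list of groups, the list is easy to flatten:
r <- f(c(1, 23, 7, 10, 9, 2, 4))
rep(seq_along(r), lengths(r))
# [1] 1 1 2 2 3 3 3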
You can use the cumsumbinning built-in function from the MESS package:
# install.packages("MESS")
MESS::cumsumbinning(x, 25, cutwhenpassed = FALSE)
# [1] 1 1 2 2 3 3 3
Or it can be done with purrr::accumulate:
library(purrr)
cumsum(x == accumulate(x, ~ifelse(.x + .y <= 25, .x + .y, .y)))
# [1] 1 1 2 2 3 3 3
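To see why this works, look at the intermediate accumulate() result: it keeps a running sum that restarts at the current element whenever adding it would pass 25, so the positions where the result equals x itself mark the start of each group, and cumsum() of that logical vector numbers the groups:
accumulate(x, ~ifelse(.x + .y <= 25, .x + .y, .y))
# [1]  1 24  7 17  9 11 15
x == accumulate(x, ~ifelse(.x + .y <= 25, .x + .y, .y))
# [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE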
Output:
group <- MESS::cumsumbinning(x, 25, cutwhenpassed = FALSE)
data.frame(x = c(1, 23, 7, 10, 9, 2, 4),
           group = group)
x group
1 1 1
2 23 1
3 7 2
4 10 2
5 9 3
6 2 3
7 4 3
Quick benchmark:
x <- c(1, 23, 7, 10, 9, 2, 4)
library(microbenchmark)
bm <- microbenchmark(
  fThomas(x),
  fThomasRec(x),
  fJKupzig(x),
  fCumsumbinning(x),
  fAccumulate(x),
  fReduce(x),
  fRcpp(x),
  times = 100L,
  setup = gc(FALSE)
)
autoplot(bm)
Егор Шишунов's Rcpp is the fastest, closely followed by MESS::cumsumbinning and ThomasIsCoding's both functions.
With n = 100, the gap gets bigger, but Rcpp and cumsumbinning are still the top choices, and the while-loop options are no longer efficient (I had to remove ThomasIsCoding's functions because the execution time was too long):
x <- runif(100, 1, 50)
In base R you could also use Reduce:
do.call(rbind, Reduce(\(x, y) if ((z <- x[1] + y) > 25) c(y, x[2] + 1)
                      else c(z, x[2]), x[-1], init = c(x[1], 1), accumulate = TRUE))
[,1] [,2]
[1,] 1 1
[2,] 24 1
[3,] 7 2
[4,] 17 2
[5,] 9 3
[6,] 11 3
[7,] 15 3
Breaking it down:
f <- function(x, y) {
  z <- x[1] + y
  if (z > 25) c(y, x[2] + 1)
  else c(z, x[2])
}
do.call(rbind, Reduce(f, x[-1], init = c(x[1], 1), accumulate = TRUE))
Or, using purrr::accumulate:
library(tidyverse)
accumulate(x[-1], f, .init = c(x[1], 1)) %>%
  invoke(rbind, .)
[,1] [,2]
[1,] 1 1
[2,] 24 1
[3,] 7 2
[4,] 17 2
[5,] 9 3
[6,] 11 3
[7,] 15 3
Here is a solution using base R and cumsum (and lapply for iteration):
id <- c(seq(1, length(x), 1)[!duplicated(cumsum(x) %/% 25)], length(x) + 1)
id2 <- 1:length(id)
group <- unlist(lapply(1:(length(id) - 1), function(x) rep(id2[x], diff(id)[x])))
data.frame(x = x, group = group)
x group
1 1 1
2 23 1
3 7 2
4 10 2
5 9 3
6 2 3
7 4 3
Edit: New approach using a recursive function
Here is a new, more efficient approach that also covers the special case which #ЕгорШишунов considered; it is written as a recursive function.
recursiveFunction <- function(x, maxN = 25, sumX = 0, period = 1, period2return = c()) {
  sumX <- sumX + x[1]
  # strictly greater than maxN: a group summing to exactly maxN is still allowed
  if (sumX > maxN) { sumX <- x[1]; period <- period + 1 }
  period2return <- c(period2return, period)
  if (length(x) == 1) { return(period2return) }
  # pass maxN through instead of hardcoding 25, so a custom limit survives the recursion
  return(recursiveFunction(x[-1], maxN, sumX, period, period2return))
}
recursiveFunction(x, maxN=25)
Note that you should not change the defaults of the last three function parameters (sumX = 0, period = 1, period2return = c()); they are only used internally during the recursive calls.
I have the piece of code below. What I want is to alter the code so that when two columns of mat have the same entries, they get the same result in their respective positions in summation, without performing the fit operation again.
So instead of getting
1 3 5 1 7 1
2 4 6 2 8 2
3 9 12 5 16 4
I want
1 3 5 1 7 1
2 4 6 2 8 2
3 9 12 3 16 3
set.seed(123)
fit = function(A) {
  x = A[1]
  y = A[2]
  z = sum(sample((x + y), 2))
  return(z)
}
mat = matrix(c(1,2,3,4,5,6,1,2,7,8,1,2), nrow = 2, ncol = 6)
summation = apply(mat, 2, FUN = 'fit')
newmat = rbind(mat, summation)
newmat
You can find the columns that are duplicates and replace their summation values with the value computed for the first occurrence of that column, so duplicated columns get the same result.
fit = function(A) {
  x = A[1]
  y = A[2]
  z = sum(sample((x + y), 2))
  return(z)
}
mat = matrix(c(1,2,3,4,5,6,1,2,7,8,1,2), nrow = 2, ncol = 6)
summation = apply(mat, 2, FUN = 'fit')
vals <- apply(mat, 2, paste0, collapse = '-')
summation <- ave(summation, match(vals, unique(vals)), FUN = function(x) x[1])
newmat = rbind(mat, summation)
newmat
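Here vals collapses each column into a signature string, so duplicated columns map to the same group when matched against unique(vals); for the example mat it looks like:
vals
# [1] "1-2" "3-4" "5-6" "1-2" "7-8" "1-2"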
To pass only the unique columns to the fit function, we can do:
fit = function(A) {
  x = A[1]
  y = A[2]
  z = sum(sample((x + y), 2))
  return(z)
}
mat = matrix(c(1,2,3,4,5,6,1,2,7,8,1,2), nrow = 2, ncol = 6)
vals <- apply(mat, 2, paste0, collapse = '-')
summation <- apply(mat[, !duplicated(vals)], 2, fit)
summation <- summation[match(vals, unique(vals))]
newmat = rbind(mat, summation)
newmat
How can I write a function f(v, n) (or use a base R function) that turns a numeric vector v into another based on n, a recycling factor (for lack of a better word)?
For instance:
f(v = c(1,2,3,4,5), n = 1) #would yield c(1,2,3,4,5)
f(v = c(1,2,3,4,5), n = 2) #would yield c(1,2,1,2,1)
f(v = c(1,2,3,4,5), n = 3) #would yield c(1,2,3,1,2)
f(v = c(5,4,3,2,1), n = 3) #would yield c(2,1,3,2,1)
f(v = c(3,6), n = 3) #would yield c(3,3)
The closest I got was to use %%:
1:5 %% 3 # giving me: [1] 1 2 0 1 2 - not quite what I want, but at least some recycling.
We can create the function with modular arithmetic: shift the values to be zero-based, wrap them with %% n, then shift back:
f <- function(x, n) if (n == 1) x else (x - 1) %% n + 1
f(1:5, 1)
#[1] 1 2 3 4 5
f(1:5, 2)
#[1] 1 2 1 2 1
f(1:5, 3)
#[1] 1 2 3 1 2
f(5:1, 3)
#[1] 2 1 3 2 1
f(c(3, 6), 3)
#[1] 3 3
I am trying to interpolate splines for the following example data:
trt depth root carbon
A 2 1 14
A 4 2 18
A 6 3 18
A 8 3 17
A 10 1 12
B 2 3 16
B 4 4 18
B 6 4 17
B 8 2 15
B 10 1 12
in the following way:
new_df <- df %>%
  group_by(trt) %>%
  summarise_each(funs(splinefun(., x = depth, method = "natural")))
I get an Error: not a vector, but I don't see why not. Am I not expressing the function in the right way?
Do you want a dataset that contains the interpolated values? If so, I've expanded the dataset to contain the desired x locations before the splines are calculated.
The resolution of those points is determined in the second line of the expand.grid() call. Just make sure the original depth points are a subset of the expanded depth points (e.g., don't use something uneven like by=.732).
library(magrittr)
ds <- readr::read_csv("trt,depth,root,carbon\nA,2,1,14\nA,4,2,18\nA,6,3,18\nA,8,3,17\nA,10,1,12\nB,2,3,16\nB,4,4,18\nB,6,4,17\nB,8,2,15\nB,10,1,12")
ds_depths_possible <- expand.grid(
  depth = seq(from = min(ds$depth), max(ds$depth), by = .5), # decide resolution here
  trt = c("A", "B"),
  stringsAsFactors = FALSE
)
ds_intpolated <- ds %>%
  dplyr::right_join(ds_depths_possible, by = c("trt", "depth")) %>% # incorporate locations to interpolate
  dplyr::group_by(trt) %>%
  dplyr::mutate(
    root_interpolated   = spline(x = depth, y = root,   xout = depth)$y,
    carbon_interpolated = spline(x = depth, y = carbon, xout = depth)$y
  ) %>%
  dplyr::ungroup()
ds_intpolated
Output:
Source: local data frame [34 x 6]
trt depth root carbon root_interpolated carbon_interpolated
(chr) (dbl) (int) (int) (dbl) (dbl)
1 A 2.0 1 14 1.000000 14.00000
2 A 2.5 NA NA 1.195312 15.57031
3 A 3.0 NA NA 1.437500 16.72917
4 A 3.5 NA NA 1.710938 17.52344
5 A 4.0 2 18 2.000000 18.00000
6 A 4.5 NA NA 2.289062 18.21094
7 A 5.0 NA NA 2.562500 18.22917
8 A 5.5 NA NA 2.804688 18.13281
9 A 6.0 3 18 3.000000 18.00000
10 A 6.5 NA NA 3.132812 17.88281
.. ... ... ... ... ... ...
In the graphs produced by the code below, the little points and lines are interpolated; the big fat points are observed.
library(ggplot2)
ggplot(ds_intpolated, aes(x = depth, y = root_interpolated, color = trt)) +
  geom_line() +
  geom_point(shape = 1) +
  geom_point(aes(y = root), size = 5, alpha = .3, na.rm = TRUE) +
  theme_bw()
ggplot(ds_intpolated, aes(x = depth, y = carbon_interpolated, color = trt)) +
  geom_line() +
  geom_point(shape = 1) +
  geom_point(aes(y = carbon), size = 5, alpha = .3, na.rm = TRUE) +
  theme_bw()
If you want an additional example, here's some recent code and slides. We needed a rolling median for some missing points, and linear stats::approx() for some others. Another option is stats::loess(), but its arguments aren't as similar to those of approx() and spline().
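For instance, a minimal stats::approx() sketch for linearly filling interior missing points (v is just a made-up example vector):
v <- c(1, NA, NA, 4, 5)
approx(x = seq_along(v)[!is.na(v)], y = v[!is.na(v)],
       xout = seq_along(v))$y
# [1] 1 2 3 4 5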
I gave up trying to get dplyr::summarise_each to work (and also tried dplyr::summarise, since your choice of functions didn't seem to match your desire for multiple-column input returning only two functions). I'm not sure it's possible in dplyr. Here's what might be called the canonical method of approaching this:
lapply(split(df, df$trt), function(d) splinefun(x = d$depth, y = d$carbon))
#-------------
$A
function (x, deriv = 0L)
{
deriv <- as.integer(deriv)
if (deriv < 0L || deriv > 3L)
stop("'deriv' must be between 0 and 3")
if (deriv > 0L) {
z0 <- double(z$n)
z[c("y", "b", "c")] <- switch(deriv, list(y = z$b, b = 2 *
z$c, c = 3 * z$d), list(y = 2 * z$c, b = 6 * z$d,
c = z0), list(y = 6 * z$d, b = z0, c = z0))
z[["d"]] <- z0
}
res <- .splinefun(x, z)
if (deriv > 0 && z$method == 2 && any(ind <- x <= z$x[1L]))
res[ind] <- ifelse(deriv == 1, z$y[1L], 0)
res
}
<bytecode: 0x7fe56e4853f8>
<environment: 0x7fe56efd3d80>
$B
function (x, deriv = 0L)
{
deriv <- as.integer(deriv)
if (deriv < 0L || deriv > 3L)
stop("'deriv' must be between 0 and 3")
if (deriv > 0L) {
z0 <- double(z$n)
z[c("y", "b", "c")] <- switch(deriv, list(y = z$b, b = 2 *
z$c, c = 3 * z$d), list(y = 2 * z$c, b = 6 * z$d,
c = z0), list(y = 6 * z$d, b = z0, c = z0))
z[["d"]] <- z0
}
res <- .splinefun(x, z)
if (deriv > 0 && z$method == 2 && any(ind <- x <= z$x[1L]))
res[ind] <- ifelse(deriv == 1, z$y[1L], 0)
res
}
<bytecode: 0x7fe56e4853f8>
<environment: 0x7fe56efc4db8>
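The returned closures can then be evaluated at any new depth values, e.g. (assuming df holds the example data):
fits <- lapply(split(df, df$trt), function(d) splinefun(x = d$depth, y = d$carbon))
fits$A(seq(2, 10, by = 0.5))  # interpolated carbon values for trt A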
I would like to do the following: combine into a data frame two vectors that
- have different lengths
- contain sequences found also in the other vector
- contain sequences not found in the other vector (such sequences are never longer than 3 elements)
- always have the same first element
The data frame should show the equal sequences in the two vectors aligned, with NA in a column where that vector lacks a sequence present in the other vector.
For example:
vector 1   vector 2              vector 1   vector 2
1          1                     a          a
2          2                     g          g
3          3                     b          b
4          1              or     h          a
1          2                     a          g
2          3                     g          b
5          4                     c          h
           5                                c
should be combined into the data frame
1    1               a    a
2    2               g    g
3    3               b    b
4    NA              h    NA
1    1        or     a    a
2    2               g    g
NA   3               NA   b
NA   4               NA   h
5    5               c    c
What I did was search for merge, combine, cbind, and plyr examples, but I was not able to find a solution. I am afraid I will need to write a function with nested for loops to solve this problem.
Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
both vectors contain the same number of sequences
a new sequence starts where value[i+1] <= value[i]
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) {  # aligns two sequences
  s12 <- sort(unique(c(s1, s2)))
  cbind(ifelse(s12 %in% s1, s12, NA),
        ifelse(s12 %in% s2, s12, NA))
}
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
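For reference, the intermediate split() produces the individual sequences that get aligned pairwise:
v1.sequences
# $`1`
# [1] 1 2 3 4
# $`2`
# [1] 1 2 5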
I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
  m <- length(X)
  n <- length(Y)
  C <- matrix(0, 1 + m, 1 + n)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      if (X[i] == Y[j]) {
        C[i + 1, j + 1] = C[i, j] + 1
      } else {
        C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
      }
    }
  }
  backtrack <- function(C, X, Y, i, j) {
    if (i == 1 | j == 1) {
      return(data.frame(I = c(), J = c(), LCS = c()))
    } else if (X[i - 1] == Y[j - 1]) {
      return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                   data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
    } else if (C[i, j - 1] > C[i - 1, j]) {
      return(backtrack(C, X, Y, i, j - 1))
    } else {
      return(backtrack(C, X, Y, i - 1, j))
    }
  }
  return(backtrack(C, X, Y, m + 1, n + 1))
}
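For example, with the letter vectors from the question, the longest common subsequence comes out as:
LongestCommonSubsequence(X = c("a","g","b","h","a","g","c"),
                         Y = c("a","g","b","a","g","b","h","c"))$LCS
# [1] "a" "g" "b" "a" "g" "c"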
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
  LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
  X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
  Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
  ALL <- merge(LCS, X.df, by = "I", all = TRUE)
  ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
  ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                        ifelse(is.na(ALL$J), 0, ALL$J))), ]
  ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
  ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                            Y = c("a","g","b","a","g","b","h","c"))
# J I X Y SCS
# 1 1 1 a a a
# 2 2 2 g g g
# 3 3 3 b b b
# 9 NA 4 h <NA> h
# 4 4 5 a a a
# 5 5 6 g g g
# 6 6 NA <NA> b b
# 7 7 NA <NA> h h
# 8 8 7 c c c
(where the two aligned vectors are in columns X and Y.)