R – Populate a vector via rep with an increasing exponential value

I have one vector like this:
years <- c(2021:2091)
And I want to create another vector to bind to it, based on an initial value that increases compound-like for every row by an arbitrary decimal (such as 10%, 15%, 20%):
number = x
rep(x*(1 + .10)^n, length(years))
How do I replicate the length of years for the second vector while increasing the exponent every time? Say there are 71 rows in years; I need n to start at 1 and run through 71.
I have tried:
rep(x*(1 + .10)^(1:71), length(years))
But this does it 71*71 times. I just need one value for each exponent!
Hopefully this makes sense, thanks in advance!

Here is how you could do it with a function:
future_value = function(years, x = 1, interest = 0.1) {
  x * (1 + interest) ^ seq_along(years)
}
Example outputs:
> future_value(2021:2025)
[1] 1.10000 1.21000 1.33100 1.46410 1.61051
> future_value(2021:2025, x = 2, interest = 0.15)
[1] 2.300000 2.645000 3.041750 3.498012 4.022714
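For readers who want to sanity-check the compounding arithmetic outside R, here is a rough Python translation of the same idea (the function name and defaults mirror the R version above; the Python code itself is my own sketch, not part of the answer):

```python
def future_value(n_years, x=1.0, interest=0.10):
    """Return a list of compounded values: x * (1 + interest)^n for n = 1..n_years."""
    return [x * (1 + interest) ** n for n in range(1, n_years + 1)]

print([round(v, 5) for v in future_value(5)])
# [1.1, 1.21, 1.331, 1.4641, 1.61051]
```

The key point in both languages is that exponentiation is vectorized over the exponent sequence, so no rep (or loop) is needed: one exponent per year gives one value per year.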


Create Random Tree in R

Suppose that I want to create in R a binary tree on the interval (0,1) with maximum depth 3 in the following way:
First we have a pool of potential cut-offs for the tree, t = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7). A cut-off means that if we randomly choose the value 0.4, we split the interval (0,1) into (0,0.4) and (0.4,1).
The steps that I want to do are:
1) Start with the whole interval (0,1)
2) Randomly choose a cut-off from t, denoted t_1
3) Split the interval (0,1) at the chosen cut-off, i.e. into subintervals (0,t_1) and (t_1,1)
4) Then randomly choose one of the intervals (0,t_1) and (t_1,1)
5) For the chosen interval, randomly sample from the cut-offs a point t_2 that makes sense, i.e. a point that is not outside the interval
6) Continue the procedure until we reach the maximum depth.
I'm totally clueless about where to start. Is this the right forum to post such a question?
Creating a tree structure like this requires a recursive function (i.e. a function that calls itself). The following function creates a list of nodes, where each branch node contains a split value and two daughter nodes called left and right. The leaf nodes contain the final range encompassed by the leaf.
make_node <- function(min = 0, max = 1, desired_depth = 3, depth = 0) {
  if (depth < desired_depth) {
    split <- runif(1, min, max)
    list(split = split,
         left  = make_node(min, split, desired_depth, depth + 1),
         right = make_node(split, max, desired_depth, depth + 1))
  } else {
    list(range = c(min, max))
  }
}
It works like this. Let's create a reproducible tree:
set.seed(1)
tree <- make_node()
To get the initial splitting value, we do:
tree$split
#> [1] 0.2655087
So the right branch deals with all values between 0.2655087 and 1. To see where it splits this range, we do
tree$right$split
#> [1] 0.4136423
So this branch splits into values between [0.2655087, 0.4136423] on the left and [0.4136423, 1] on the right. Let's examine the left node:
tree$right$left$split
#> [1] 0.3985904
This has now split the [0.2655087, 0.4136423] branch into a left [0.2655087, 0.3985904] branch and a right [0.3985904, 0.4136423] branch.
If we take this right branch, we have now reached depth 3, so we get the final range of this leaf and confirm its range:
tree$right$left$right
#> $range
#> [1] 0.3985904 0.4136423
Of course, to make all this easier you probably want some kind of function to walk the tree to classify a particular number.
walk_tree <- function(value, tree) {
  result <- paste("Value:", value, "\n")
  while (is.null(tree$range)) {
    if (value >= tree$split) {
      result <- paste(result, "\nGreater than split of", tree$split)
      tree <- tree$right
    } else {
      result <- paste(result, "\nLess than split of", tree$split)
      tree <- tree$left
    }
  }
  result <- paste0(result, "\nValue falls into leaf node with range [",
                   tree$range[1], ",", tree$range[2], "]\n")
  cat(result)
}
So, for example, we get
walk_tree(value = 0.4, tree)
#> Value: 0.4
#>
#> Greater than split of 0.2655086631421
#> Less than split of 0.413642294289884
#> Greater than split of 0.398590389362078
#> Value falls into leaf node with range [0.398590389362078,0.413642294289884]
You may prefer that this function returns a vector of 0s and 1s, or you may be looking for it to draw the tree, which is trickier to do, but possible.
Created on 2022-03-09 by the reprex package (v2.0.1)
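The same recursive build-then-walk structure translates almost line for line into other languages. Here is a Python sketch of my own (dicts standing in for R's named lists; not part of the original answer) that may help readers see the recursion independently of R syntax:

```python
import random

def make_node(lo=0.0, hi=1.0, desired_depth=3, depth=0, rng=random):
    """Branch nodes hold a split plus left/right children; leaves hold a range."""
    if depth < desired_depth:
        split = rng.uniform(lo, hi)
        return {"split": split,
                "left": make_node(lo, split, desired_depth, depth + 1, rng),
                "right": make_node(split, hi, desired_depth, depth + 1, rng)}
    return {"range": (lo, hi)}

def walk_tree(value, node):
    """Follow splits down to a leaf and return that leaf's (lo, hi) range."""
    while "range" not in node:
        node = node["right"] if value >= node["split"] else node["left"]
    return node["range"]

tree = make_node(rng=random.Random(1))
lo, hi = walk_tree(0.4, tree)
assert lo <= 0.4 <= hi  # the leaf's range always brackets the query value
```

As in the R version, the walker maintains the invariant that the queried value stays inside the current node's interval, so the leaf it reaches always contains the value.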
Perhaps we can use Reduce to generate intervals in the binary-tree manner
v <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) # the pool of cut-offs from the question
Reduce(
  function(interval, k) {
    lb <- min(interval)
    ub <- max(interval)
    x <- v[v > lb & v < ub] # cut-offs strictly inside the current interval
    if (!length(x)) {
      return(c(NA, NA))
    }
    p <- sample(x, 1)
    list(c(lb, p), c(p, ub))[[sample(1:2, 1)]] # keep one half at random
  },
  1:3,
  init = c(0, 1),
  accumulate = TRUE
)
and you will see a result like
[[1]]
[1] 0 1
[[2]]
[1] 0.0 0.6
[[3]]
[1] 0.0 0.2
[[4]]
[1] 0.0 0.1
which indicates the selected interval in each iteration from top to bottom.
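The Reduce idiom (fold a narrowing function over the steps, accumulating each intermediate interval) can be mirrored with an explicit accumulation loop. The following Python sketch is my own translation, with the narrowing step factored out as a function; names are mine:

```python
import random

pool = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # the pool of cut-offs from the question

def narrow(interval, pool, rng):
    """Pick a cut-off strictly inside `interval`, then keep one half at random."""
    lo, hi = interval
    candidates = [v for v in pool if lo < v < hi]
    if not candidates:                      # no admissible cut-off left
        return (float("nan"), float("nan"))
    p = rng.choice(candidates)
    return rng.choice([(lo, p), (p, hi)])

rng = random.Random(1)
intervals = [(0.0, 1.0)]                    # like init = c(0, 1)
for _ in range(3):                          # like folding over 1:3 with accumulate = TRUE
    intervals.append(narrow(intervals[-1], pool, rng))
print(intervals)
```

Each accumulated interval is nested inside the previous one (or becomes NaN when the pool is exhausted), exactly as in the Reduce output above.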

Select values from a data frame indexed to another data frame in a many-to-one relationship in R

I am building a program for simulating sequences of wind vectors (in base R).
I have a data set of parameters for six wind-generation mechanisms ('pars'); I'll call them ellipses. There are 5 parameters for each ellipse, thus 30 columns of parameters, plus further columns giving the proportion of time (frequency, indicated by f.0, f.1, ...) each ellipse is in operation. There are 24 rows in 'pars', each identified by an 'hour' variable. The following code generates a simulated 'pars' data frame:
pars <- as.data.frame(matrix(rnorm(24*42), 24, 42, dimnames=list(NULL, c(
'f.0', 'f.1', 'f.2', 'f.3', 'f.4', 'f.5', 'f.6',
'W.0', 'W.1', 'W.2', 'W.3', 'W.4', 'W.5', 'W.6',
'S.0', 'S.1', 'S.2', 'S.3', 'S.4', 'S.5', 'S.6',
'w.0', 'w.1', 'w.2', 'w.3', 'w.4', 'w.5', 'w.6',
's.0', 's.1', 's.2', 's.3', 's.4', 's.5', 's.6',
'r.0', 'r.1', 'r.2', 'r.3', 'r.4', 'r.5', 'r.6')
)))
jobFun <- function(n) {
m <- matrix(runif(7*n), ncol=7)
m <- sweep(m, 1, rowSums(m), FUN="/")
m
}
pars[1:24,c('f.0', 'f.1', 'f.2', 'f.3', 'f.4', 'f.5', 'f.6')] <- jobFun(24) # generate ellipse frequencies, summing to 1
pars$hour <- 0:23 # Add an 'hour' variable
pars$p0 <- with(pars, f.0) # change to make it zero if < zero!
pars$p1 <- with(pars, f.1 + p0)
pars$p2 <- with(pars, f.2 + p1)
pars$p3 <- with(pars, f.3 + p2)
pars$p4 <- with(pars, f.4 + p3)
pars$p5 <- with(pars, f.5 + p4)
pars$p6 <- with(pars, f.6 + p5)
I start by generating a sequence of POSIXct date-times for a single day, e.g., at 5-minute intervals ('sim'). For each date-time in 'sim', I need to select an ellipse and assign its parameters to the 'sim' data set. I have made additional columns in 'pars' with the cumulative probability of each ellipse, e.g., p0 = f.0, p1 = p0 + f.1, p2 = p1 + f.2, etc. I am going to select a different ellipse for each 5-minute time increment (then select the parameters corresponding to that ellipse). My difficulty lies in being unable to specify the appropriate value for p.
START <- ISOdate(2022, MONTH, 1, hour=0, min=0)
END <- START + (24*3600) - 1
tseq <- seq(from=START,to=END,by=300)
sim = data.frame(tseq)
sim$Ep <- runif(nrow(sim)) # Generate random vector Ep for ellipse picking
sim$Enum <- with(sim, ifelse( # number identifying ellipse to be used
  Ep < pars$p0[which(pars$hour == hour(tseq))], 0, ifelse(
    Ep < pars$p1[which(pars$hour == hour(tseq))], 1, ifelse(
      Ep < pars$p2[which(pars$hour == hour(tseq))], 2, ifelse(
        Ep < pars$p3[which(pars$hour == hour(tseq))], 3, ifelse(
          Ep < pars$p4[which(pars$hour == hour(tseq))], 4, ifelse(
            Ep < pars$p5[which(pars$hour == hour(tseq))], 5, 6)))))))
...
The result should be a vector (Enum) of integers between 0 and 6 identifying the ellipse to be used at each 5-minute time increment. My program only gives a correct answer at the 0th minute of each hour; there is something wrong with the statement
pars$p[which(pars$hour == hour(tseq))]
which ends up generating NAs for all the other 5-minute time increments in the hour. (That is, there are 12 increments of 5 minutes in an hour, and the statement
which(pars$hour == hour(tseq))
brings up all 12 at once, instead of one at a time, which is what I need here.) Maybe I need a 'for' loop? Any suggestions for fixing this, and for making the above code more compact, will be appreciated.
The problem is that the logical subscripting is too complicated. All that is necessary is to change, e.g.,
pars$p0[which(pars$hour == hour(tseq))]
to
pars$p0[hour(tseq)+1]
and the value for p0 that is specific to the hour being simulated will be selected.
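The selection step itself, mapping a uniform draw Ep onto an ellipse number via cumulative probabilities, is a standard "first cumulative value exceeding Ep" lookup. A small Python sketch of that logic (my own illustration with made-up probabilities, replacing the nested ifelse chain with a binary search):

```python
from bisect import bisect_right

def pick_ellipse(ep, cum_probs):
    """Return the index of the first cumulative probability exceeding ep."""
    return bisect_right(cum_probs, ep)

cum = [0.1, 0.25, 0.45, 0.7, 0.85, 0.95, 1.0]  # hypothetical p0..p6 for one hour
assert pick_ellipse(0.05, cum) == 0   # ep < p0       -> ellipse 0
assert pick_ellipse(0.50, cum) == 3   # p2 <= ep < p3 -> ellipse 3
assert pick_ellipse(0.99, cum) == 6   # ep >= p5      -> ellipse 6
```

The R equivalent of this idea is findInterval, which likewise avoids writing one ifelse per ellipse.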
Spector (2008) "Data Manipulation with R" is helpful as usual.
Note that for the question above, the 'lubridate' package is needed for the hour() function, and MONTH must be specified (e.g., MONTH = 4) to run the code.

R programming: How to set while loop condition based if all required values in vector have been copied from sample?

I'm new to R and I'm trying to see how many iterations are needed to fill a vector with the numbers 1 to 55 (no duplicates) from a random sample generated with runif.
At the moment the vector has lots of duplicates in it, and the number of iterations being returned is the size of the vector, so I'm not sure if my logic is correct.
The aim of the if statement is to check whether the value from the sample already exists in the vector and, if it does, choose the next one. But I'm not sure if that's correct, since the next number could already exist in the vector. Any help would be much appreciated.
numbers=as.integer(runif(800, min=1, max=55)) ## my sample from runif
i=sample(numbers, 1)
## setting up my vector to store 55 unique values (1 to 55)
p=rep(0,55)
## my counters
j=0
n=1
## my while loop
while (p[n] %in% 0) {
  ## if the sample value already exists in the vector, choose the next value from the sample
  if (numbers[n] %in% p) {
    p[n] = numbers[n+1]
  } else {
    p[n] = numbers[n]
  }
  n = n + 1
  j = j + 1
}
I believe that the following is what you want. Instead of a while loop over p, the while loop should search for a new value in numbers.
set.seed(2021) # make the results reproducible
numbers <- sample(55, 800, TRUE)
## setting up my vector to store 55 unique values (1 to 55)
p <- integer(55)
# assign the elements of p one by one
for (j in seq_along(p)) {
  ## if the sample value already exists in the vector,
  ## choose the next value from the sample
  n <- 1
  while (numbers[n] %in% p) {
    n <- n + 1
  }
  if (n <= length(numbers)) {
    p[j] <- numbers[n]
  }
}
j
#[1] 55
length(unique(p)) == length(p)
#[1] TRUE
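The underlying question, how many random draws it takes before every value from 1 to 55 has appeared, is the classic coupon-collector problem, which in expectation needs roughly n * log(n) draws. A Python simulation of my own (not part of the answer) that counts the draws directly:

```python
import random

def draws_until_complete(n=55, rng=random):
    """Draw uniformly from 1..n until every value has appeared; count the draws."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randint(1, n))
        draws += 1
    return draws

print(draws_until_complete(rng=random.Random(2021)))  # typically a few hundred for n = 55
```

This is the count ("number of iterations") the questioner was after, as opposed to the fixed length-800 pre-sample used in the original code.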

Reducing row reference by 1 for each for loop iteration in R

I'm working on a function in R that iterates over a data frame in reverse. Right now, the function takes a set number of columns and finds the mean of each column up to a set row number. What I'd like is for the row number to decrease by 1 on each iteration of the for loop. The goal is to create a "triangular" reference that uses one less value for the column means per iteration.
Here's some code you can use to create sample data that works in the formula.
test = data.frame(p1 = c(1,2,0,1,0,2,0,1,0,0), p2 = c(0,0,1,2,0,1,2,1,0,1))
Here's the function I'm working with. My best guess is that I'll need to add some sort of reference to i in the mean(data[1:row, i]) section, but I can't seem to work out the logic/math on my own.
averagePickup = function(data, day, periods) {
  # data will be your Pickup Data
  # day is the day you're forecasting for (think row number)
  # periods is the period or range of periods that you need to average (a column or range of columns).
  pStart = ncol(data)
  pEnd = ncol(data) - (periods - 1)
  row = (day - 1)
  new_frame <- as.data.frame(matrix(nrow = 1, ncol = periods))
  for (i in pStart:pEnd) {
    new_frame[1, 1 + abs(ncol(data) - i)] <- mean(data[1:row, i])
  }
  return(sum(new_frame[1, 1:ncol(new_frame)]))
}
Right now, inputting averagePickup(test, 5, 2) will yield a result of 1.75. This is the sum of the means of the first 4 values of the two columns. What I'd like the result to be is 1.33333: the sum of the mean of the first 4 values in column p1 and the mean of the first 3 values in column p2.
Please let me know if you need any further clarification, I'm still a total scrub at R!!!
Like this?
test = data.frame(p1 = c(1,2,0,1,0,2,0,1,0,0), p2 = c(0,0,1,2,0,1,2,1,0,1))
averagePickup = function(data, first, second) {
  return(mean(data[1:first, 1]) + mean(data[1:second, 2]))
}
averagePickup(test,4,3)
This gives you 1.333333
Welp, I ended up figuring it out with a few more head bashes against the wall. Here's what worked for me:
averagePickup = function(data, day, periods) {
  # data will be your Pickup Data
  # day is the day you're forecasting for (think row number)
  # periods is the period or range of periods that you need to average (a column or range of columns).
  pStart = ncol(data)
  pEnd = ncol(data) - (periods - 1)
  row = (day - 1)
  new_frame <- as.data.frame(matrix(nrow = 1, ncol = periods))
  q <- 0 # Instantiated a q value. Run 0 will be the first one.
  for (i in pStart:pEnd) {
    new_frame[1, 1 + abs(ncol(data) - i)] <- mean(data[1:(day - periods + q), i]) # Offset the row count by q.
    q <- q + 1 # Incrementing q, so each earlier column uses one more row.
  }
  return(sum(new_frame[1, 1:ncol(new_frame)]))
}
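To check the triangular-mean arithmetic, here is a compact Python sketch of my own (columns stored as plain lists, names mine) that reproduces the 1.33333 result from the question:

```python
def average_pickup(columns, day, periods):
    """Sum of column means, averaging day - periods rows for the last column
    and one more row for each column to its left (the 'triangular' reference)."""
    total = 0.0
    for q in range(periods):
        col = columns[-(1 + q)]       # walk columns from last to first
        rows = day - periods + q      # one fewer row for later columns
        total += sum(col[:rows]) / rows
    return total

p1 = [1, 2, 0, 1, 0, 2, 0, 1, 0, 0]
p2 = [0, 0, 1, 2, 0, 1, 2, 1, 0, 1]
print(round(average_pickup([p1, p2], day=5, periods=2), 5))  # 1.33333
```

That is mean(p1[1:4]) + mean(p2[1:3]) = 1 + 1/3, matching the accepted fix above.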

split data matrix

I have a data matrix with 100,000 rows of values corresponding to methylation values across several cell types. I would like to visually display the changes in methylation in a clustered heatmap. To get the data into a more manageable size I was thinking of creating a new data matrix every 10th or so row. Is there any simple way to do this?
Use seq and combinations of arguments. E.g.:
m1 <- matrix(runif(100000*10), ncol = 10)
m2 <- m1[seq(from = 1, to = nrow(m1), by = 10), ]
> dim(m2)
[1] 10000 10
How does this work? Look at what this does:
> sq <- seq(from = 1, to = nrow(m1), by = 10)
> head(sq)
[1] 1 11 21 31 41 51
> tail(sq)
[1] 99941 99951 99961 99971 99981 99991
> nrow(m1)
[1] 100000
We specify to go from the first row to the last, incrementing by 10 each step. This gives us rows 1, 11, 21, etc. When we get to the end of the sequence, even though we specified nrow(m1) (which is 100000) as the to argument, the last element in our sequence is 99991. This is because 99991 + 10 would take us beyond the to argument limit (beyond 100000) and hence is not included in the sequence.
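The same every-10th-row thinning looks like this in Python, for comparison (my own sketch, using a stand-in list of rows; Python indexing is 0-based where R's is 1-based):

```python
# stand-in for the 100,000-row matrix: each row is [i, 2i]
rows = [[i, i * 2] for i in range(100000)]

# take every 10th row, starting from the first
thinned = rows[::10]
print(len(thinned))  # 10000
```

Slicing with a step of 10 plays the role of seq(from = 1, to = nrow(m1), by = 10) used for row subscripting in the R answer.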
Try the following, which takes your large matrix m and generates a list of smaller matrices. It generates a sequence of indices that breaks at every chunk.length values and then collects the chunks.
chunk.length <- 1000 # choose a chunk size
list.of.matrices <- lapply(X = seq.int(1, nrow(m), by = chunk.length),
                           FUN = function(k) {
                             m[k + seq_len(chunk.length) - 1, ]
                           })
However, if you have 100,000 rows, it will be wasteful for your RAM to save all these chunks separately. Perhaps, you can just do the required computation on the subsets and save only the results. Just a suggestion.
