Summation of "n" math operation results without VBA - math

I need a purely formula-based approach (no VBA) to find the value of "S", as explained below.
Value "A"
Value "B"
Value "C"
Value "D"
j = increment - increases by 1 each time the equation is executed;
n = number of iterations = number of times the main equation must be executed
S = sum = sum of the results of the equation over all values of "j" from 1 to n
The equation for each line is: = [A*(B+8)]* [C+[D*(j-0.5)]]
The operation is based on accumulating the results of the equation for each value of "j" that increases at a rate of 1. See the following example:
For: A = 2, B = 3, C = 4, D = 5 and n = 7, we have:
for j=1: =[2*(3+8)]* [4+[5*(1-0.5)]] = 143
for j=2: =[2*(3+8)]* [4+[5*(2-0.5)]] = 253
for j=3: =[2*(3+8)]* [4+[5*(3-0.5)]] = 363
for j=4: =[2*(3+8)]* [4+[5*(4-0.5)]] = 473
for j=5: =[2*(3+8)]* [4+[5*(5-0.5)]] = 583
for j=6: =[2*(3+8)]* [4+[5*(6-0.5)]] = 693
for j=7: =[2*(3+8)]* [4+[5*(7-0.5)]] = 803
So "S" will be: S = 143 + 253 + 363 + 473 + 583 + 693 + 803 = 3311
I know it would be relatively easy to find the "S" value with VBA, but VBA is unwanted in this case. Maybe the mathematical route to my goal is the calculation of integrals, but I couldn't find a native Excel function for that path either.
I tried to associate functions like OFFSET, SERIESSUM, FACT and even integrals and I couldn't get there.

As I understand it, a straightforward setup is the one shown in the image:
Copy and paste into column I, for as many rows as you want.
Copy and paste into cell H2.
You can change the value of n in cell G1.
The S value is then a straightforward sum.
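Because the equation is linear in j, the sum also has a closed form, so a helper column is not strictly required. A short derivation using the symbols from the question (a sketch, assuming j runs over the integers 1 to n):

```latex
\begin{align*}
S &= \sum_{j=1}^{n} A(B+8)\,\bigl[C + D\,(j - 0.5)\bigr] \\
  &= A(B+8)\Bigl[\,nC + D\sum_{j=1}^{n}(j - 0.5)\Bigr] \\
  &= A(B+8)\Bigl[\,nC + \tfrac{D\,n^{2}}{2}\Bigr]
  \qquad\text{since } \sum_{j=1}^{n}(j-0.5) = \tfrac{n(n+1)}{2} - \tfrac{n}{2} = \tfrac{n^{2}}{2}
\end{align*}
```

For A = 2, B = 3, C = 4, D = 5 and n = 7 this gives 22 * (28 + 122.5) = 3311, which matches the example. In a worksheet this would be a single formula of the form =A*(B+8)*(n*C + D*n^2/2), with the letters replaced by whichever cells hold the inputs.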


Hoping for help to translate a thought experiment into R code, using randomization

I'm more experienced with R than many of my peers, yet it sometimes takes hours to move a novel-to-me concept into the code line, and usually a few more to get a successful output. I don't know how to describe this in R language, so I hope you can help me- either with sample code, or pointing me in the right direction.
I have c(X1,X2,X3,...Xn) for starting variable, a non-random numeric value.
I have c(Y1,Y2,Y3,...Yn) for change variable, a non-random numeric value denoting by how much to change X, give or take, and a value between 0-10.
I have c(Z1,Z2,Z3,...Zn) which is the min and max range of X.
What I want to observe is the random sampling of all numbers X, which have all randomly had corresponding Y variable subtracted or added to them. What I'm trying to ask in this problem, is how many times will I draw X values which are exactly the X values which I initially input as well as give or take only a low Y value.
For instance,
First iteration: X=c(135,562,579,222), second iteration: X=c(130,471,585,230)<- as you can see, X of second iteration has changed by (-5*Y1), (+3*Y2), (+2*Y3), and (+11*Y4)
What I want to output is a list of randomized X values which have changed by only a multiple of their corresponding Y value, and which always fall within the range given by the Z values. Further, I want to examine how many times at least one (and only one) X value will be significantly different from the corresponding starting input X.
I feel like I'm not wording the question succinctly, but I also feel that this is why I've posted. I'm not trying to ask for hand-holding, but rather seeking advice.
I am not sure that I understood the question. Do you want to repeat the process many times? Is it for the purpose of simulation? Here is the start of a solution.
library(dplyr)
x <- c(135,462,579,222)
y <- c(1,3,3,2)
z.lower <- c(115, 450, 510, 200)
z.upper <- c(155, 474, 648, 244)
temp.df <- data.frame(x, y, z.lower, z.upper)
temp.df %>%
  mutate(samp = sample(seq(-10, 10, 1), nrow(temp.df))) %>% ### Sample integers between -10 and 10
  mutate(new.val = x + samp * y) %>% ### Create new X
  mutate(is.bound = new.val < z.upper & new.val > z.lower) ### Check that it falls within bounds
    x y z.lower z.upper samp new.val is.bound
1 135 1     115     155  -10     125     TRUE
2 462 3     450     474   10     492    FALSE
3 579 3     510     648    8     603     TRUE
4 222 2     200     244    6     234     TRUE
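If the goal is a simulation, the single draw above could be repeated many times and the outcomes counted. The following is only a sketch in base R: the number of repetitions (n.iter), the seed, and the choice to sample the multipliers with replacement are assumptions, not part of the question.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
x <- c(135, 462, 579, 222)
y <- c(1, 3, 3, 2)
z.lower <- c(115, 450, 510, 200)
z.upper <- c(155, 474, 648, 244)
n.iter <- 10000  # hypothetical number of repetitions

# One iteration: shift each x by a random integer multiple (-10..10) of its y
one.draw <- function() {
  samp <- sample(-10:10, length(x), replace = TRUE)
  x + samp * y
}

# n.iter x length(x) matrix, one row per iteration
draws <- t(replicate(n.iter, one.draw()))

# TRUE where the new value lies strictly inside its range
in.bounds <- sweep(draws, 2, z.lower, ">") & sweep(draws, 2, z.upper, "<")

# Share of iterations in which every value stays in range
p.all.in <- mean(rowSums(in.bounds) == length(x))
# Share of iterations in which exactly one value leaves its range
p.one.out <- mean(rowSums(!in.bounds) == 1)
c(all.in = p.all.in, exactly.one.out = p.one.out)
```

Keeping all draws in one matrix makes it easy to ask other questions afterwards, for example how often a particular X is the one that leaves its range.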
For this dataset, this is a possibility:
n = 10000
x_range_l <- split(Zees, rep(seq_len(length(Zees) / 2), each = 2))
mapply(function(y, x_range) sample(seq(from = x_range[1], to = x_range[2], by = y), size = n, replace = T),
Whys, x_range_l)
Note that this option depends more on the Zees than the Exes. A more complete way to do it would be:
Why_Range <- c(20, 4, 13, 11)
x_range_l <- Map(function(x, y, rng) c(x - y * rng, x + y * rng), Exes, Whys, Why_Range)
n = 10000
mapply(function(y, x_range) sample(seq(from = x_range[1], to = x_range[2], by = y), size = n, replace = T),
Whys, x_range_l)

Calling a function with a variable object

I have a data frame with many rows and only one column; the column holds strings of variable length, ranging from 30,000 to 200,000 characters (DNA sequences). [Below is a sample of 150 characters]
Here is the full dataset:
I have a code in R, which divides each row into 20 bins depending on its length, and counts the occurrence of G's and C's for each bin, and gives me back a matrix of 20 columns. Here is the code:
library(data.table) # fread
library(stringr)    # str_count
data <- fread("string.fa", header = FALSE)
loopchar <- function(data){
  bins <- sapply(seq(1, nchar(data), nchar(data)/20), function(x) substr(data, x, x + nchar(data)/20 - 1))
  output <- (str_count(bins, c("G"))/nchar(bins) + str_count(bins, c("C"))/nchar(bins))*100
}
result <- data.frame(t(apply(data,1,loopchar)))
However, now I want to do something different. Instead of nchar(data)/20, I want the substring segments (20) to vary from a list I have. So now for my data frame, the first row should be divided into 22 bins/segments, and the code would be nchar(data)/22.
The second row should be divided into 21 bins, and the code would be nchar(data)/21, and so on. I want the function to keep changing the number of bins for the data. Both my data dataframe with strings and vector list of numbers with bins are of the same length.
What is the best way to do this?
It's more natural to use Bioconductor libraries for such tasks. In my case I use Biostrings, but you may find another way.
Your file is too big, so I have created a text file (in memory), which contains random DNA for each line:
# set seed to create reproducible example
set.seed(123) # arbitrary value: the original seed was lost, so the outputs shown below will differ
# create an example text file in memory
temp <- tempfile()
writeLines(
  sapply(1:100, function(i){
    paste(sample(c("A", "T", "C", "G"), sample(100:6000, 1),
                 replace = TRUE), collapse = "")
  }),
  con = temp
)
# read lines from tmp file
dna <- readLines(temp)
# unlink file
unlink(temp)
Data preprocessing
Creating Biostrings::DNAStringSet object
Using Biostrings::DNAStringSet() function we can read character vector to create DNAStringSet object. Note that I assume that all the records are in standard DNA alphabet i.e. each string contains only A, T, C, G symbols. If it does not hold in your case, refer to Biostrings documentation.
library(Biostrings)
dna <- DNAStringSet(dna, use.names = FALSE)
# inspect the output
dna
A DNAStringSet instance of length 100
width seq
... ... ...
Create the vector of random N numbers of bins
k <- sample(100, 100, replace = TRUE)
# inspect the output
head(k)
[1] 37 32 63 76 19 41
Create a Views object where each DNA sequence is represented by N = k[i] chunks
It is much easier to solve your problem using the IRanges::Views container. This thing is furiously fast and beautiful.
First of all, we divide each DNA sequence into k[i] ranges:
seqviews <- lapply(seq_along(dna), function(i){
  seq = dna[[i]]
  seq_length = length(seq)
  starts = seq(1, seq_length - seq_length %% k[i], seq_length %/% k[i])
  Views(seq, start = starts, end = c(starts[-1] - 1, seq_length))
})
# inspect the output for k[2] and seqviews[2]
k[2]
seqviews[2]
Views on a 1507-letter DNAString subject
subject: ATGCGGTCTATCTACTTG...GTCAGAAGTAACAGTTTAG
start end width
... ... ... ... ...
After that, we check that every sequence has been divided into the desired number of chunks:
all(sapply(seq_along(k), function(i) k[i] == length(seqviews[[i]])))
[1] TRUE
Important observation
Before we proceed, there is one important observation about your function.
Your function produces N chunks of variable length, because the indices it produces are floats rather than integers, and substr() coerces the indices it is given to integers (dropping the fractional part).
As an example, extracting 1st record from the dna set, and splitting this sequence into 37 bins using your code will produce following results:
dna_1 <- as.character(dna[[1]])
sprintf("DNA#1: %d bp long, 37 chunks", nchar(dna_1))
[1] "DNA#1: 2235 bp long, 37 chunks"
bins <- sapply(seq(1, nchar(dna_1), nchar(dna_1)/37), function(x)
  substr(dna_1, x, x + nchar(dna_1)/37 - 1))
bins_length <- sapply(bins, nchar)
# the opening of the plotting call was lost; barplot(table(...)) is an assumption
barplot(table(bins_length),
        xlab = "Bin's length",
        ylab = "Count",
        main = "Bin's length variability")
The approach I use in my code, when length(dna[[i]]) %% k[i] != 0 (a nonzero remainder), produces k[i] - 1 bins of equal length, and only the last bin has a length equal to length(dna[[i]]) %/% k[i] + length(dna[[i]]) %% k[i]:
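# bin widths for DNA #1; sapply(seqviews, length) would return the number of bins per sequence instead
bins_length <- width(seqviews[[1]])
# the opening of the plotting call was lost; barplot(table(...)) is an assumption
barplot(table(bins_length),
        xlab = "Bin's length",
        ylab = "Count",
        main = "Bin's length variability")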
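To make the integer split concrete, here is a small base R check using the numbers from the example above (2235 bp, 37 bins). It is only an illustration of the start/end arithmetic and needs no Bioconductor packages.

```r
seq_length <- 2235  # length of DNA #1 in the example
n_bins <- 37        # number of bins requested for it

# Same start/end arithmetic as in the Views construction
starts <- seq(1, seq_length - seq_length %% n_bins, seq_length %/% n_bins)
ends <- c(starts[-1] - 1, seq_length)
widths <- ends - starts + 1

table(widths)  # 36 bins of width 60 and one final bin of width 75
```

The last bin absorbs the remainder (2235 %% 37 = 15), so its width is 60 + 15 = 75, while every other bin has the same integer width.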
GC content calculation
As mentioned above, Biostrings::letterFrequency() applied to IRanges::Views lets you calculate GC content easily:
Find the GC frequency for each bin in every DNA sequence
GC <- lapply(seqviews, letterFrequency, letters = "GC", as.prob = TRUE)
Convert to percents
GC <- lapply(GC, "*", 100)
Inspect the output
[1,] 53.33333
[2,] 46.66667
[3,] 50.00000
[4,] 55.00000
[5,] 60.00000
[6,] 45.00000
Plot GC content for DNAs 1:9
par(mfrow = c(3, 3))
invisible(lapply(1:9, function(i){
  # the opening of the plot() call was lost; plotting GC[[i]] is an assumption
  plot(GC[[i]],
       type = "l",
       main = sprintf("DNA #%d, %d bp, %d bins", i, length(dna[[i]]), k[i]),
       xlab = "N bins",
       ylab = "GC content, %",
       ylim = c(0, 100))
  abline(h = 50, lty = 2, col = "red")
}))
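If installing Bioconductor is not an option, the same idea (a different number of bins per row) can be sketched in base R. This is not the Biostrings approach above: gc_by_bin, the toy sequences and the bin counts are all hypothetical, and the sketch assumes every sequence is at least as long as its number of bins.

```r
# GC percentage per bin for one sequence split into n.bins integer-width bins
gc_by_bin <- function(s, n.bins) {
  len <- nchar(s)
  size <- len %/% n.bins                       # base bin width
  starts <- seq(1, by = size, length.out = n.bins)
  ends <- c(starts[-1] - 1, len)               # last bin absorbs the remainder
  bins <- substring(s, starts, ends)
  gc <- nchar(gsub("[^GC]", "", bins))         # number of G and C characters per bin
  100 * gc / nchar(bins)
}

seqs <- c("GGCCAATTGGCC", "ATATGCGCATAT")  # hypothetical toy sequences
n.bins <- c(3, 2)                          # hypothetical bin counts, one per sequence

# Map() pairs each sequence with its own number of bins
result <- Map(gc_by_bin, seqs, n.bins)
result
```

Because the rows have different numbers of bins, the result is a list of numeric vectors rather than a rectangular matrix.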

generate normal distribution with exactly N elements in Y bins

I'll probably want to hit myself over the head for not getting this:
How do I generate a vector with the expected height of a normal distribution over Y bins (nbins in the below), of exactly N elements.
Like so, in the below picture:
Y or nbins = 15
N or nstat = 77
... should return something like: c(1,1,2,4, ...)
I know I could draw rnorm(77), but that will never be exactly normal, and looping over 10,000 iterations or so seems overkill.
So I tried using qnorm for that purpose, but I have a hunch that:
something is wrong with the code below
there has to be an easier, more elegant way
Here is what I got:
nbins <- 15
nstat <- 77
item.pos <- qnorm( # to the left of which value lies...
  1:(nstat) / (nstat+1) # ... the n-statement?
  # using nstat + 1 because we want midpoints, not cutoffs for later
)
bins <- cut(
  x = item.pos,
  breaks = nbins,
  ordered_result = TRUE
)
height <- summary(bins)
height <- as.numeric(bins)
If your data range from -2 to 2 with 15 intervals and the sample size is 77, I would suggest the following to get the expected heights of the 15 intervals:
rn <- dnorm(seq(-2,2, length = 15))/sum(dnorm(seq(-2,2, length = 15)))*77
rn
 [1] 1.226486 2.084993 3.266586 4.716619 6.276462 7.697443 8.700123 9.062576 8.700123 7.697443
[11] 6.276462 4.716619 3.266586 2.084993 1.226486
The barplot of this looks like:
barplot(height = rn, names.arg = round(seq(-2, 2, length = 15), 2))
So, in your sample of 77 you would expect 1.226486 cases in the first interval, 2.084993 cases in the second, and so on. It's difficult to generate a vector like the one you described at the beginning, because the sequence above does not consist of integers.
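One possible way to turn these expected heights into integers that sum to exactly 77 is largest-remainder rounding: round everything down, then hand the leftover items to the bins with the largest fractional parts. This technique is not part of the answer above; it is a sketch of one reasonable choice among several.

```r
nbins <- 15
nstat <- 77
mids <- seq(-2, 2, length = nbins)
expected <- dnorm(mids) / sum(dnorm(mids)) * nstat  # same values as rn above

counts <- floor(expected)       # round every bin down first
short <- nstat - sum(counts)    # items still to be distributed

# give the leftover items to the bins with the largest fractional parts
if (short > 0) {
  top <- order(expected - counts, decreasing = TRUE)[seq_len(short)]
  counts[top] <- counts[top] + 1
}
counts
sum(counts)  # equals nstat
```

The result stays symmetric here because the fractional parts come in mirrored pairs, but that is a property of this particular grid, not a guarantee of the method.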

Selecting rows maintaining distribution percentage?

I have an existing data frame with a variable "grade" indicating the type of row/observation. My goal is to select from another dataframe more of these types of rows while not exceeding a maximum percentage for each grade type in my existing data frame. I have defined a named vector with the grade allocations:
gradeAllocation <- c("A" = 0, "B" = 0, "C" = .25, "D" = .40, "E" = .20, "F" = .10, "G" = .05)
This represents the maximum percentage of each grade type in my data frame. Now, let's say I want to select a mixture of grades from another data frame, but I don't want to select so many that, after the selection, any grade type would exceed its maximum percentage. I would basically be doing this in a loop for each new data set that becomes available, while keeping the maximum distribution given by the gradeAllocation vector.
Is there a package/function that can help here? Any thoughts for custom code?
Thanks, John
So, as @Mr.Flick points out, there is no guarantee that this will be possible. In your gradeAllocation the sampling distribution sums to 1. If your test dataset has no "D", for example, it will not be possible to create a sample with at most 25% C, 20% E, 10% F, 5% G, and no A or B.
Also, because the sampling distribution sums to 1, if the sample size you want is N, then the number of samples of each grade must be given by N * gradeAllocation. Here is a method that takes advantage of that fact, starting with a dataset that has 700 samples and is uniformly distributed (same number in each grade), and we extract a random sample of 100 with the distribution given by gradeAllocation.
# sample dataset: 700 observations, grade distribution is uniform
set.seed(1) # for reproducible example
data <- data.frame(grade=rep(LETTERS[1:7],each=100),x=rnorm(700))
# desired distribution in the sample
gradeAllocation <- c(A=0, B=0, C=.25, D=.40, E=.20, F=.10, G=.05)
# you start here...
N <- 100 # sample size
get.sample <- function(g) data[sample(which(data$grade==g),N*gradeAllocation[g]),]
result <- do.call(rbind,lapply(LETTERS[1:7],get.sample))
# confirm distribution of grades in the sample
table(factor(result$grade, levels=LETTERS[1:7]))
# A B C D E F G
# 0 0 25 40 20 10 5
Here's one approach
Generate some data
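nOriginal <- 1000
df1 <- data.frame(grade=sample(c('A','B','C','D','E','F','G'),nOriginal,replace=TRUE),
                  x=rnorm(nOriginal)) # placeholder column: the original second column was lost
Get the rows that correspond to each grade
idx_a <- which(df1$grade=="A")
idx_b <- which(df1$grade=="B")
idx_c <- which(df1$grade=="C")
idx_d <- which(df1$grade=="D")
idx_e <- which(df1$grade=="E")
idx_f <- which(df1$grade=="F")
idx_g <- which(df1$grade=="G")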
Sample the rows based on the prescribed distribution which should sum to one.
location <- c("A" = 0, "B" = 0, "C" = .25, "D" = .40, "E" = .20, "F" = .10, "G" = .05)
nSamples = 200
samp_idx_a <- sample(idx_a,nSamples*location["A"])
samp_idx_b <- sample(idx_b,nSamples*location["B"])
samp_idx_c <- sample(idx_c,nSamples*location["C"])
samp_idx_d <- sample(idx_d,nSamples*location["D"])
samp_idx_e <- sample(idx_e,nSamples*location["E"])
samp_idx_f <- sample(idx_f,nSamples*location["F"])
samp_idx_g <- sample(idx_g,nSamples*location["G"])
df_2 <- df1[c(samp_idx_a, samp_idx_b, samp_idx_c, samp_idx_d,
samp_idx_e, samp_idx_f, samp_idx_g),]
Check the results
(percent_A = sum(df_2$grade=="A")/nrow(df_2)*100)
(percent_B = sum(df_2$grade=="B")/nrow(df_2)*100)
(percent_C = sum(df_2$grade=="C")/nrow(df_2)*100)
(percent_D = sum(df_2$grade=="D")/nrow(df_2)*100)
(percent_E = sum(df_2$grade=="E")/nrow(df_2)*100)
(percent_F = sum(df_2$grade=="F")/nrow(df_2)*100)
(percent_G = sum(df_2$grade=="G")/nrow(df_2)*100)
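The per-grade steps above can also be written once for all grades, and capped so that a grade never contributes more rows than the new data set actually contains. This is a sketch only: the names pool, allocation and n are hypothetical, and round() is used for the quotas as an assumption about how fractional quotas should be handled.

```r
set.seed(1)  # arbitrary seed for a reproducible illustration
allocation <- c(A = 0, B = 0, C = .25, D = .40, E = .20, F = .10, G = .05)
pool <- data.frame(grade = sample(names(allocation), 1000, replace = TRUE),
                   x = rnorm(1000))
n <- 200  # hypothetical target sample size

# row indices of the pool grouped by grade (empty groups are kept)
rows_by_grade <- split(seq_len(nrow(pool)),
                       factor(pool$grade, levels = names(allocation)))
quota <- round(n * allocation)  # maximum number of rows allowed per grade

# take at most `quota` rows per grade, or fewer when the pool runs short
picked <- unlist(Map(function(idx, q) idx[sample.int(length(idx), min(q, length(idx)))],
                     rows_by_grade, quota), use.names = FALSE)
sampled <- pool[picked, ]
counts <- table(factor(sampled$grade, levels = names(allocation)))
counts
```

When a grade runs short, the sample simply ends up smaller than n, and the other grades then exceed their target share of the smaller total; whether to shrink the other quotas in that case is a design decision the question leaves open.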

Probability heatmap in ggplot

I asked this question a year ago and got code for this "probability heatmap":
library(ggplot2)
library(plyr)
numbet <- 32
numtri <- 1e5
prob <- 5/6 # assumption: value taken from the discussion below; the original assignment was lost
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
  idvar="trial", v.names="outcome", direction="long", timevar="bet")
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
  ymin=c(0, head(seq_along(V1)/length(V1), -1)),
  ymax=seq_along(V1)/length(V1), # reconstructed by analogy with ymin; the original line was lost
  fill=V1/sum(V1))               # reconstructed (assumption): share of trials per outcome
# note: formatter= belongs to older ggplot2 versions
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", formatter="percent", low="red", high="blue") +
  scale_y_continuous(formatter="percent")
(May need to change this code slightly because of this)
This is almost exactly what I want, except that each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), whereas shaft 7 should have 8 (N+1).
If I'm right, the reason the code does this is because it is the observed data and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequency of heads for n=1:32 trials and plotted the graph. It should be what you expect. I have read some of your earlier posts here on SO and on math.stackexchange. Still, I don't understand why you'd want to simulate the experiment rather than generate the values from a binomial R.V. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.
library(ggplot2)
library(plyr)
numbet <- 32
numtri <- 1e5
prob <- 5/6 # assumption: value taken from the explanation below; the original assignment was lost
out <- ldply(1:numbet, function(idx) {
  outcome <- dbinom(idx:0, size=idx, prob=prob)
  bet <- rep(idx, length(outcome))
  N <- round(outcome * numtri)
  ymin <- c(0, head(seq_along(N)/length(N), -1))
  ymax <- seq_along(N)/length(N)
  data.frame(bet, fill=outcome, ymin, ymax)
})
# any layers that followed the last "+" in the original were lost
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", low="red", high="blue")
The plot:
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather, one way to look at it) is to use the idea that if you have two independent binomial random variables, X ~ B(n, p) and Y ~ B(m, p), where n, m = size and p = probability of success, then their sum is X + Y ~ B(n + m, p) (1). So, the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along the way, the code for xcum will become obvious, and it can be constructed in no time (without any need for a for-loop or for computing a cumsum every time).
If you have followed me so far, then our idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability 5/6 and 1/6 respectively. That is, if you have numtri = 1000, then you'll have about 833 0's and 167 1's for each of the numbet columns (= 32 here). Let's construct this and test it first.
numtri <- 1e3
numbet <- 32
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
163 164 176 162 160 177 157 163 166 146 170
# So, the counts of 1's are "approximately" what we expect (around 167).
Now, each of these columns is a sample of size numtri from a binomial distribution with n = 1. If we were to add the first two columns and replace the second column with this sum, then, from (1), since the probabilities are equal, we'd end up with a binomial distribution with n = 2. Similarly, if you had instead added the first three columns and replaced the 3rd column with this sum, you would have obtained a binomial distribution with n = 3, and so on.
The concept is that if you cumulatively add each column, then you end up with numbet number of binomial distributions (1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) over each row, in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
this will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
Now on to why you have difficulties obtaining the graph I had attached (with n+1 bins):
For a binomial distribution with n=1:32 trials, 5/6 as probability of tails (failures) and 1/6 as the probability of heads (successes), the probability of k heads is given by:
nCk * (5/6)^(n-k) * (1/6)^k # where nCk is n choose k
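The binomial probability of k heads in n tosses, C(n, k) * p^k * (1 - p)^(n - k), can be checked against R's built-in dbinom. A minimal sketch for n = 8, with p = 1/6 as the probability of a head as in this discussion:

```r
n <- 8
k <- 0:n
p_heads <- 1/6  # probability of a head (success)

# probability of exactly k heads, written out by hand
manual <- choose(n, k) * (1 - p_heads)^(n - k) * p_heads^k
# the same probabilities from the built-in binomial density
builtin <- dbinom(k, size = n, prob = p_heads)

all.equal(manual, builtin)  # TRUE
sum(manual)                 # the n + 1 probabilities sum to 1
```

Note that dbinom(k, n, 1/6) counts heads directly, whereas the dbinom(n:0, n, prob=5/6) calls used in this answer count tails in reverse order; both give the same vector.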
For the test data we've generated, for n=7 and n=8 (trials), the observed proportions of k=0:7 and k=0:8 heads are given by:
# n=7
   0    1    2    3    4    5
.278 .394 .233 .077 .016 .002
# n=8
   0    1    2    3    4    5
.229 .375 .254 .111 .025 .006
Why do they both have 6 bins and not 8 and 9 bins? Of course, this has to do with the value of numtri=1000. Let's see what the probabilities of each of these 8 and 9 bins are by generating them directly from the binomial distribution using dbinom, to understand why this happens.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities for k=6,7 (n=7) and k=6,7,8 (n=8) are approximately 0. They are very small. The minimum value here is actually about 5.95 * 1e-7 (n=8, k=8), i.e. (1/6)^8. This means that you would expect to see such an outcome only about once in every 1/(5.95 * 1e-7), or roughly 1.7 million, simulations. If you check the same for n=32 and k=32, the value is 1.256493 * 1e-25. So you would have to simulate on the order of 1e25 trials to get at least 1 result where all 32 outcomes are heads for n=32.
This is why your results were not having values for certain bins because the probability of having it is very low for the given numtri. And for the same reason, generating the probabilities directly from the binomial distribution overcomes this problem/limitation.
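The effect of numtri on the number of visible bins can be estimated without simulating anything, by multiplying the binomial probabilities by the number of trials. The sketch below treats a bin as "expected to appear" when its expected count rounds to at least 1; that threshold and the helper name expected_bins are illustrative assumptions.

```r
p_heads <- 1/6  # probability of a head, as in this discussion

# number of outcome bins whose expected count rounds to at least one trial
expected_bins <- function(n, numtri) {
  expected <- numtri * dbinom(0:n, size = n, prob = p_heads)
  sum(expected >= 0.5)
}

# n = 8 has 9 possible outcomes; how many are expected to show up?
bins_n8 <- sapply(c(1e3, 1e6, 1e9), function(numtri) expected_bins(8, numtri))
bins_n8  # 6 9 9

# rough number of trials needed to expect a single all-heads run for n = 32
1 / dbinom(32, size = 32, prob = p_heads)
```

With numtri = 1000 only 6 of the 9 bins for n = 8 are expected, which agrees with the 6 bins seen in the simulated data, while the all-heads bin for n = 32 would need on the order of 1e25 trials.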
I hope I've managed to write with enough clarity for you to follow. Let me know if you've trouble going through.
Edit 2:
When I simulated the code I've just edited above with numtri=1e6, I get this for n=7 and n=8 and count the number of heads for k=0:7 and k=0:8:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that, there are k=6 and k=7 now for n=7 and n=8. Also, for n=8, you have a value of 1 for k=8. With increasing numtri you'll obtain more of the other missing bins. But it'll require a huge amount of time/memory (if at all).
