Force value attribution in a table (R script)

I'm calculating particle diameter evolution over time, and I'm trying to enforce the condition that when a particle diameter is less than or equal to a minimal diameter, the diameter is set to that fixed minimal value.
I tried with an if condition, but it is not working (code shown below). What I would like is that, from the first time the minimum diameter is reached, whatever the following values are, the minimum diameter value is assigned to them.
# p is my data frame and dp is the diameter values
a <- p$diameter <- p$dp * ((Te - p$t) / Te)^0.5
p$vol <- pi * (p$dp * 1e-6)^3 / 6
# diam_min_ma is the minimum diameter calculation
b <- diam_min_ma <- (0.03 * p$vol * 6 / pi)^(1/3) * 1e6
c <- if (a >= b) {
  p$diameter <- a
} else {
  p$diameter <- b
}
p$diameter <- c
This is an example of the expected table (DpT1, ..., DpT7 are the diameter values over time and Dp min is the minimum diameter that can be reached):
DpT1 DpT2 DpT3 DpT4 DpT5 DpT6 DpT7   Dp min
 150  100   75   50   36   36   36   36 µm
 100   60   45   30   28   28   28   28 µm
  60   40   20   20   20   20   20   28 µm

Finally I found the answer, which was to use ifelse instead of what I did. This applies the condition to all the table rows instead of only the first one:
p$diameter <<- ifelse(a >= b, a, b)
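For illustration, here is a minimal sketch of why the vectorized ifelse (or equivalently pmax) works where if/else only looks at the first row. The data frame below is made up for the example and is not the asker's particle table:
# Hypothetical example data, not the asker's p
toy <- data.frame(dp = c(150, 100, 60, 36, 30), dp_min = 36)
# if() would only test the first element of the condition;
# ifelse() evaluates the condition row by row
toy$diameter <- ifelse(toy$dp >= toy$dp_min, toy$dp, toy$dp_min)
# pmax() does the same clamping in one call
toy$diameter_alt <- pmax(toy$dp, toy$dp_min)
toy$diameter
# [1] 150 100  60  36  36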


Percentile rank (inclusive) in R

Percentile rank is frequently defined by the following formula:
Percentile rank = (L / N) * 100
where L = the number of values in the dataset lower than or equal to the value of interest
and N = the number of data points.
In R, it is common to calculate the percentile rank of values in a vector by
Percentile_Rank = rank(vec) / length(vec) * 100
However, I would like to use a slightly modified definition of percentile rank, which is defined by the same formula as above but
L = Number of values in dataset strictly lower than the value of interest
This is similar to the PERCENTRANK.EXC function in Excel.
Is there a function built into R to calculate this? Otherwise, how can I do it?
Is this what you're looking for?
y = 1:10
# traditional percentile
rank(y)/length(y) * 100
# [1] 10 20 30 40 50 60 70 80 90 100
# percentile considering those values preceding current value
vapply(y, function(x) {
  sum(y < x) / length(y) * 100
}, FUN.VALUE = numeric(1L))
# [1] 0 10 20 30 40 50 60 70 80 90
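If you prefer to avoid the explicit loop, the same strictly-lower count can also be obtained from rank() with ties.method = "min" (a sketch, not part of the original answer):
# rank(..., ties.method = "min") - 1 equals the number of values strictly
# lower than each element, which reproduces the exclusive percentile rank
(rank(y, ties.method = "min") - 1) / length(y) * 100
# [1]  0 10 20 30 40 50 60 70 80 90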

Compare the annual rates between groups

I am struggling to compare the rates of mortality between two percentages over a time interval. My goal is to get the annual rates per group.
My values are already in percentages (start and end values), representing how much forest has been lost (disturbed, burned, cut, etc.) over several years from the total forest cover. E.g. in the first year it was 1%; the last year's 20% is the cumulative value of total forest lost.
I followed the calculation of the compound annual growth rate (CAGR), taking into account the values in the 1st year, the last year, and the total number of years.
Here are my dummy data for two groups, e.g. mortality depending on tree species:
df <- data.frame(group = c('pine', 'beech'),
                 start = c(1, 2),
                 end = c(19, 30),
                 yrs = 18)
To calculate the CAGR, I have used this function:
CAGR_formula <- function(end, start, yrs) {
  values <- ((end / start)^(1 / yrs) - 1)
  return(values)
}
giving:
library(dplyr)

df %>%
  mutate(CARG = CAGR_formula(end, start, yrs) * 100)

  group start end yrs CARG
1  pine     1  19  18 17.8
2 beech     2  30  18 16.2
However, CAGR rates of 16-17% seem awfully high! I was expecting about 1-3% per year. Please, what is wrong in my formula? Is it because the original values (start, end) are already in percentages? Or is it because end is a cumulative value of the start?
Thank you for your ideas!
If I understand correctly, maybe this is what is desired:
df %>%
  mutate(CARG = CAGR_formula(1 - end/100, 1, yrs) * 100)
#>   group start end yrs      CARG
#> 1  pine     1  19  18 -1.163847
#> 2 beech     2  30  18 -1.962024
where the start parameter to CAGR_formula() is always 1 (the value for year 1 can be ignored in this calculation), meaning the forest starts at 100%, and the end parameter is 1 - end/100, e.g. in the first row 81% of the forest remains after 18 years.
The resulting yearly mortality rates are about 1.16% and 1.96%.
We can verify that 1 * (1 - 0.0116)^18 is roughly 81%, and 1 * (1 - 0.0196)^18 is roughly 70%.
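For reference, the same check can be written out directly in R (a sketch using the numbers above, not part of the original answer):
# annual loss rate implied by 19% / 30% cumulative loss over 18 years
rate_pine  <- 1 - (1 - 19/100)^(1/18)   # ~0.0116
rate_beech <- 1 - (1 - 30/100)^(1/18)   # ~0.0196
# compounding the annual rate back over 18 years recovers the remaining share
(1 - rate_pine)^18    # 0.81, i.e. 81% of the pine forest remains
(1 - rate_beech)^18   # 0.70, i.e. 70% of the beech forest remains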
Why does it seem high? From 1% to 19% is a big jump. Also:
1 * 1.178^18 = 19.086
Seems right to me

Calculate 'Ranking' based on 'weights' - what's the formula, given different ranges of values

Given a list of cars and their top speeds, MPG and cost, I want to rank them, with speed given a 'weight' of 50%, MPG a weight of 30% and cost a weight of 20%.
The faster the car, the better; the higher the MPG, the better; the lower the cost, the better.
What math formula can I use to rank the cars in order, based on these criteria?
So given this list, how can I rank them, given that the range of each is different?
CAR SPEED MPG COST
A 135 20 50,000
B 150 15 60,000
C 170 18 80,000
D 120 30 40,000
A more general term for your problem would be 'Multi-Criteria Decision Analysis', which is a well-studied subject, and you will be able to find different models for different use cases.
Let's take a simple model for your case, where we create a score based on the weights and calculate it for each car:
import pandas as pd
data = pd.DataFrame({
    'CAR': ['A', 'B', 'C', 'D'],
    'SPEED': [135, 150, 170, 120],
    'MPG': [20, 15, 18, 30],
    'COST': [50000, 60000, 80000, 40000]
})

def Score(df):
    return 0.5*df['SPEED'] + 0.3*df['MPG'] + 0.2*df['COST']

data['SCORE'] = data.apply(lambda x: Score(x), axis=1)
data = data.sort_values(by=['SCORE'], ascending=False)
print(data)
This would give us:
CAR SPEED MPG COST SCORE
2 C 170 18 80000 16090.4
1 B 150 15 60000 12079.5
0 A 135 20 50000 10073.5
3 D 120 30 40000 8069.0
As you can see, the function Score simply multiplies each value by its weight and sums them up to get a new value, based on which we order the items.
The important consideration here is whether you are happy with the formula used in Score. You can change it however you want and for whatever purpose you are building your model.
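Since the question also asks how to handle criteria whose ranges differ (and a cost where lower is better), here is a sketch, in R since that is the language used elsewhere on this page, of one common variant: min-max normalising each criterion to 0-1 and inverting the cost before applying the weights. This is only an illustration of the general idea, not the approach taken in the answer above:
cars <- data.frame(CAR = c("A", "B", "C", "D"),
                   SPEED = c(135, 150, 170, 120),
                   MPG = c(20, 15, 18, 30),
                   COST = c(50000, 60000, 80000, 40000))
# scale each criterion to the 0-1 range so no single unit dominates
rescale <- function(x) (x - min(x)) / (max(x) - min(x))
cars$score <- 0.5 * rescale(cars$SPEED) +
              0.3 * rescale(cars$MPG) +
              0.2 * (1 - rescale(cars$COST))   # lower cost is better
cars[order(-cars$score), ]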

Repeat simulation of test scores 1000 times

I want to simulate the problem below in R and calculate the average probability based on 1000 simulations -
Scores on a test are normally distributed with mean 70 and std dev 10.
Estimate the probability that among 75 randomly selected students at least 22 score greater than 78
This is what I have done so far
set.seed(1)
scores = rnorm(1000,70,10)
head(scores)
hist(scores)
sm75=sample(scores,75)
length(sm75[sm75>78])/75
#[1] 0.1866667
However, this only gives me one iteration; I want 1000 iterations and then to take the average of those 1000 probabilities. I believe some kind of control structure using a for loop could be implemented. Also, is there an easier way through the "apply" family of functions?
At the end of the day you are testing whether at least 22 students score higher than 78, which can be compactly computed with:
sum(rnorm(75, 70, 10) > 78) >= 22
Breaking this down a bit, rnorm(75, 70, 10) returns the 75 scores, which are normally distributed with mean 70 and standard deviation 10. rnorm(75, 70, 10) > 78 is a vector of length 75 that indicates whether or not each of these scores is above 78. sum(rnorm(75, 70, 10) > 78) converts each true to a 1 and each false to a 0 and sums these values up, meaning it counts the number of the 75 scores that exceed 78. Lastly we test whether the sum is 22 or higher with the full expression above.
replicate can be used to replicate this any number of times. So to see the breakdown of 1000 simulations, you can use the following 1-liner (after setting your random seed, of course):
set.seed(144)
table(replicate(1000, sum(rnorm(75, 70, 10) > 78) >= 22))
# FALSE TRUE
# 936 64
In 64 of the replicates, at least 22 students scored above a 78, so we estimate the probability to be 6.4%.
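Equivalently, the estimated probability can be taken directly as the mean of the logical replicates (same setup and seed as above):
set.seed(144)
mean(replicate(1000, sum(rnorm(75, 70, 10) > 78) >= 22))
# [1] 0.064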
Probability is calculated as the number of favourable outcomes divided by the total number of outcomes. So..
> scores <- sample(rnorm(1000,70,10),75)
> probability <- length(subset(scores,scores>78))/length(scores)
> probability
[1] 0.28
However, you want to do this 1000 times, and then take an average.
> mean(replicate(1000, {scores<-sample(rnorm(1000,70,10),75);length(subset(scores,scores>78))/length(scores)}))
[1] 0.2133333

Difference between runif and sample in R?

In terms of the probability distribution they use? I know that runif gives fractional numbers and sample gives whole numbers, but what I am interested in is whether sample also uses the 'uniform probability distribution'?
Consider the following code and output:
> set.seed(1)
> round(runif(10,1,100))
[1] 27 38 58 91 21 90 95 66 63 7
> set.seed(1)
> sample(1:100, 10, replace=TRUE)
[1] 27 38 58 91 21 90 95 67 63 7
This strongly suggests that when asked to do the same thing, the two functions give pretty much the same output (though interestingly it is round that gives the same output rather than floor or ceiling). The main differences are in the defaults; if you don't change those defaults, then both give something called a uniform distribution (though sample would be considered a discrete uniform, and by default it samples without replacement).
Edit
The more correct comparison is:
> set.seed(1)
> ceiling(runif(10,0,100))
[1] 27 38 58 91 21 90 95 67 63 7
instead of using round.
We can even step that up a notch:
> set.seed(1)
> tmp1 <- sample(1:100, 1000, replace=TRUE)
> set.seed(1)
> tmp2 <- ceiling(runif(1000,0,100))
> all.equal(tmp1,tmp2)
[1] TRUE
Of course if the probs argument to sample is used (with not all values equal), then it will no longer be uniform.
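For example (a quick sketch, not from the original answer), unequal prob weights make the draws visibly non-uniform:
set.seed(1)
table(sample(1:3, 1000, replace = TRUE, prob = c(0.6, 0.3, 0.1)))
# value 1 comes up roughly 60% of the time, 2 about 30%, and 3 about 10%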
sample draws from a fixed set of inputs, and if a single number n is passed as the first argument, it returns integers sampled from 1:n.
On the other hand, runif returns a sample drawn from a real-valued range.
> sample(c(1,2,3), 1)
[1] 2
> runif(1, 1, 3)
[1] 1.448551
sample() runs faster than ceiling(runif()).
This is useful to know if you are doing many simulations or bootstrapping.
Here is a crude time trial that times four equivalent approaches:
n <- 100    # sample size
m <- 10000  # simulations
system.time(sample(n, size = n*m, replace = TRUE))  # faster than ceiling/runif
system.time(ceiling(runif(n*m, 0, n)))
system.time(ceiling(n * runif(n*m)))
system.time(floor(runif(n*m, 1, n+1)))
The proportional time advantage increases with n and m, but watch that you don't fill memory!
BTW, don't use round() to convert uniformly distributed continuous values to uniformly distributed integers, since the terminal values get selected only half as often as they should.
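A quick sketch illustrating that last point (any seed will do): with round(), the two endpoint values are drawn from intervals only half as wide as the interior values, so they show up about half as often.
set.seed(1)
x <- round(runif(1e5, 0, 10))
table(x)
# the counts for 0 and 10 are roughly half the counts for 1 through 9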
