Adding extreme value distributed noise (with µ=0,σ=10) to a vector of numbers in R - r

I have the following matrix
Measurement Treatment
38 A
14 A
54 A
69 A
20 B
36 B
35 B
10 B
11 C
98 C
88 C
14 C
I want to add extreme value distributed noise (with mean=0 and sd=10) to the Measurement values. How can I achieve that in R?
I found revd in the extRemes package, but it does not work as expected. Does devd from the same package do what I want? (It does not allow the mean and sd to be specified, though.)

If you want to use your measure as the mean for the noise, then you can do this:
measure = round(runif(10,0,30),0)
data = data.frame(measure)
for(i in 1:nrow(data)){
data$measure1[i] = rnorm(1,data$measure[i],10)
}
data
measure measure1
1 6 6.281557
2 12 -5.780177
3 18 13.529773
4 26 33.665584
5 14 12.666614
6 24 41.146132
7 5 -1.850390
8 14 16.728703
9 13 26.082601
10 13 14.066475
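EDIT: You can avoid the for loop by drawing one noise value per row in a single vectorized call (rnorm(1,0,10) would add the same offset to every row):
data$measure1 = data$measure + rnorm(nrow(data),0,10)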
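Note that rnorm() draws normally distributed noise. If the noise should follow an extreme value distribution, one option is the Gumbel (type I extreme value) distribution, which can be sampled in base R by inverting its CDF. The sketch below assumes that "extreme value" means Gumbel and that mean=0 and sd=10 refer to the moments of the noise; the helper name rgumbel_moments is made up for illustration. For a Gumbel distribution with location loc and scale beta, the mean is loc + beta*gamma (gamma being the Euler-Mascheroni constant) and the sd is beta*pi/sqrt(6), so both parameters can be solved from the requested moments.

```r
# Draw n Gumbel (type I extreme value) variates with a given mean and sd.
# rgumbel_moments is a hypothetical helper, not part of any package.
rgumbel_moments <- function(n, mean = 0, sd = 10) {
  euler_gamma <- -digamma(1)          # Euler-Mascheroni constant, about 0.5772
  beta <- sd * sqrt(6) / pi           # scale that yields the requested sd
  loc  <- mean - beta * euler_gamma   # location that yields the requested mean
  loc - beta * log(-log(runif(n)))    # inverse-CDF sampling
}

set.seed(1)
measurement <- c(38, 14, 54, 69, 20, 36, 35, 10, 11, 98, 88, 14)
# one independent noise value per measurement
noisy <- measurement + rgumbel_moments(length(measurement))
```

With only 12 values the sample mean and sd of the noise will differ noticeably from 0 and 10; the moments hold for the distribution, not for each small sample.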

Related

Interpolation in R when there are 3 columns

I need to find the interpolated fuel consumption from speed and weather.
I have tried the approx function, but it only handles two variables and won't accept three or more.
Speed weather fuel
10 2 30
12 3 35
14 8 38
15 9 65
I need to find fuel for speed_new = 13 and weather = 7.
approx(x=Speed,y=Fuel,z=Weather,xout= speed_new,rule = 2)$y #need to also mention the weather

Get the lag vector from variogram in gstat

I want to compute the variogram from a set of data in R. I am using the function "variogram" from the gstat package.
Now, I want to get the lag vector from the variogram. The problem is that myvariogram$dist returns the averages of the distances between all point pairs.
How can I get the lag vector instead?
My data are two-dimensional: coordinates x and y with z values.
x y z
1 -0.9000000 1.102146e-16 0.160000000
2 -0.8724602 2.209369e-01 0.284010236
3 -0.7915264 4.283527e-01 0.408020473
4 -0.5527914 -7.102265e-01 -0.294704200
5 -0.7102265 -5.527914e-01 -0.170693964
6 -0.8241960 -3.615259e-01 -0.046683727
7 -0.8877252 -1.481351e-01 0.077326509
8 -0.6464646 -3.877551e-01 -0.205706068
9 -0.4444444 -1.428571e-01 -0.154399515
10 -0.5959596 -3.469388e-01 -0.227651744
11 -0.5454545 -5.510204e-01 -0.319427844
12 -0.6464646 -2.040816e-02 0.005767136
13 -0.8484848 -1.836735e-01 0.028933625
14 -0.6969697 -4.285714e-01 -0.174407224
15 -0.4949495 2.040816e-02 0.020626174
16 -0.7474747 2.040816e-02 0.075029711
17 -0.4444444 -3.061224e-01 -0.300002910
18 -0.6464646 1.428571e-01 0.135007208
19 -0.5959596 6.122449e-02 0.061006799
20 -0.5454545 -3.061224e-01 -0.239963488
21 -0.5959596 1.836735e-01 0.164762622
22 -0.3434343 -2.653061e-01 -0.324516690
23 -0.3939394 -3.469388e-01 -0.360400339
24 -0.5454545 6.122449e-02 0.058277761
25 -0.6464646 -3.061224e-01 -0.174779340
myvariog = variogram(z ~ 1, data = mydata)

Normalize/scale data set

I have the following data set:
dat<-as.data.frame(rbind(10,8,2,7,10,10,1,10,14,9,2,6,10,8,10,8,10,10,7,11,10))
colnames(dat)<-"Score"
print(dat)
Score
10
8
2
7
10
10
1
10
14
9
2
6
10
8
10
8
10
10
7
11
10
These are the test scores the students obtained. A student could get a maximum of 15 or a minimum of 0 in this test; nobody got the max or the min, and the lowest score actually obtained was 1 and the highest was 14.
Now, I want to normalize/scale this data to the scale of 0 to 20.
How can I achieve this in Excel or in R?
My final goal is to normalize the scores in this test to the above scale and to compare them with another set of data for which the max and min are 5 and 0 respectively.
How can I compare these two differently scaled data sets correctly against each other?
What I tried:
I went through a lot of material on the internet and came up with the feature scaling formula from Wikipedia:
x' = (x - min(x)) / (max(x) - min(x))
Is this method reliable?
In your case I would use the feature scaling formula you posted in your question. (x - min(x)) / (max(x) - min(x)) converts your test marks to the range 0-1.
Since the edges of your scale are 0 and 15, not the observed 1 and 14, use min(x)=0 and max(x)=15. Once you have your marks between 0 and 1 using the above, you just multiply by 20.
i.e.
tests <- read.table(header=T, file='clipboard')
tests2 <- (tests - 0) / (15 - 0) #or equally tests / 15
And multiply by 20 to get marks between 0-20:
> tests2 * 20
Score
1 13.333333
2 10.666667
3 2.666667
4 9.333333
5 13.333333
6 13.333333
7 1.333333
8 13.333333
9 18.666667
10 12.000000
11 2.666667
12 8.000000
13 13.333333
14 10.666667
15 13.333333
16 10.666667
17 13.333333
18 13.333333
19 9.333333
20 14.666667
21 13.333333
The results are intuitive and the function is reliable. For example the person who scored 14/15 should get the highest mark (and very close to 20) which is the case here (after the transformation they scored 18.6666).
In Excel, if you want the normalized data to have a min of 0 and a max of 20, then we need to solve:
y = A * x + B
for two points.
Put the max of the raw data in C1:
=MAX(A:A)
Put the min of the raw data in C2:
=MIN(A:A)
Put the desired max in D1 and the desired min in D2. Put the formula for the A-coefficient in C3:
=($D$1-$D$2)/($C$1-$C$2)
and the formula for the B-coefficient in C4:
=$D$1-$C$3*$C$1
Finally put the scaling formula in B1:
=A1*$C$3+$C$4
and copy it down the column.
Naturally, if you want the scaling to be independent of the raw max or min, you would use 15 in C1 and 0 in C2.
You can scale between 0 to 20 with this command in R:
newvalue <- 20/(max(score)-min(score))*(score-min(score))
The math is fairly straightforward if the floor for all scales is 0:
new_value = new_ceiling * old_value / old_ceiling
The next formula accounts for different floors on each scale:
new_value = new_floor + (new_ceiling - new_floor) * ((old_value - old_floor) / (old_ceiling - old_floor))
which is actually the formula you posted from Wikipedia. ;)
Hope this helps!
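As a minimal sketch of the floor/ceiling mapping, the function below rescales from a known source range to a target range instead of using the observed min and max. The name rescale_range and its arguments are made up for illustration, and test_b (scores on a 0 to 5 scale) is hypothetical sample data, since the second data set was not posted.

```r
# Map x linearly from [old_floor, old_ceiling] to [new_floor, new_ceiling].
# rescale_range is a hypothetical helper name.
rescale_range <- function(x, old_floor, old_ceiling,
                          new_floor = 0, new_ceiling = 20) {
  new_floor + (new_ceiling - new_floor) *
    (x - old_floor) / (old_ceiling - old_floor)
}

test_a <- c(10, 8, 2, 7, 14)   # a few scores from the 0-15 test in the question
test_b <- c(5, 3, 0, 4.5)      # hypothetical scores on the 0-5 scale

scaled_a <- rescale_range(test_a, old_floor = 0, old_ceiling = 15)
scaled_b <- rescale_range(test_b, old_floor = 0, old_ceiling = 5)
```

Both vectors then sit on the same 0 to 20 scale and can be compared directly: 14 out of 15 maps to about 18.67, and 5 out of 5 maps to 20.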
That is very simple. Because both grading scales are linear and start at 0, a simple ratio does the job. In other words, each grade in your set needs to be multiplied by 20/15.
Here's a little R function that can help if you need to repeat the operation, with some flexibility in what you rescale to. Be careful with NA values: min() and max() do not drop them by default and will then return NA. The function therefore has an option for handling NA values (it drops them by default).
# function rescales data from 0 to 1 and optionally multiplies by new max
rescale <- function(x, new_max = 1, na.rm = T) {
as.vector(new_max * scale(x,
center = min(x, na.rm = na.rm),
scale = (max(x, na.rm = na.rm) - min(x, na.rm = na.rm))))
}
# old scores
scores <- c(10,8,2,7,10,10,1,10,14,9,2,6,10,8,10,8,10,10,7,11,10)
# new scores
data.frame(old = scores,
new = rescale(scores, new_max = 20))
#> old new
#> 1 10 13.846154
#> 2 8 10.769231
#> 3 2 1.538462
#> 4 7 9.230769
#> 5 10 13.846154
#> 6 10 13.846154
#> 7 1 0.000000
#> 8 10 13.846154
#> 9 14 20.000000
#> 10 9 12.307692
#> 11 2 1.538462
#> 12 6 7.692308
#> 13 10 13.846154
#> 14 8 10.769231
#> 15 10 13.846154
#> 16 8 10.769231
#> 17 10 13.846154
#> 18 10 13.846154
#> 19 7 9.230769
#> 20 11 15.384615
#> 21 10 13.846154
Created on 2022-03-10 by the reprex package (v2.0.1)

ggplot2 is plotting a line strangely

I am trying to plot the time series x_t = A + (-1)^t B.
To do this I am using the following code. The problem is that the ggplot output is wrong.
require (ggplot2)
set.seed(42)
N<-2
A<-sample(1:20,N)
B<-rnorm(N)
X<-c(A+B,A-B)
dat<-sapply(1:N,function(n) X[rep(c(n,N+n),20)],simplify=FALSE)
dat<-data.frame(t=rep(1:20,N),w=rep(A,each=20),val=do.call(c,dat))
ggplot(data=dat,aes(x=t, y=val, color=factor(w)))+
geom_line()+facet_grid(w~.,scale = "free")
Looking at the head of dat, everything looks right:
> head(dat)
t w val
1 1 12 10.5533
2 2 12 13.4467
3 3 12 10.5533
4 4 12 13.4467
5 5 12 10.5533
6 6 12 13.4467
So the lower (blue) line should only have values 10.5533 and 13.4467. But it also takes different values. What is wrong in my code?
Thanks in advance for any help
You really should be more careful before asserting that something is "wrong". The way you are creating dat the rows are not ordered by dat$t, so head(...) is not displaying the extra values:
head(dat[order(dat$w,dat$t),],10)
# t w val
# 21 1 18 18.43530
# 61 1 18 18.36313
# 22 2 18 19.56470
# 62 2 18 17.63687
# 23 3 18 18.43530
# 63 3 18 18.36313
# 24 4 18 19.56470
# 64 4 18 17.63687
# 25 5 18 18.43530
# 65 5 18 18.36313
Note the row numbers.
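A possible cause, assuming the code exactly as posted with N <- 2: rep(c(n, N+n), 20) produces 40 values per series, whereas t and w hold only 20 entries per series, so data.frame() recycles t and w and some rows end up with the wrong w label. The sketch below repeats each pair 10 times instead, which is an assumption about the intended 20 time points per series. It uses base R only, so the plotting call is left out.

```r
set.seed(42)
N <- 2
A <- sample(1:20, N)
B <- rnorm(N)
X <- c(A + B, A - B)

# 10 repetitions of the pair (A+B, A-B) give exactly 20 values per series,
# matching the lengths of t and w so that nothing is recycled
val <- unlist(lapply(1:N, function(n) X[rep(c(n, N + n), 10)]))
dat <- data.frame(t = rep(1:20, N), w = rep(A, each = 20), val = val)

# each series should now contain exactly two distinct values
distinct_per_series <- tapply(dat$val, dat$w, function(v) length(unique(v)))
```

Passing this dat to the original ggplot() call should then draw each facet as a line alternating between two levels.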

Maximum Intermediate Volatility

I have two vectors, a and b. See attached.
a is the signal and is a probability.
b is the absolute percentage change the next period.
Signalt <- seq(0, 1, 0.05)
I would like to find the maximum absolute return occurring within each intermediate 5%-tile (Signalt) of the a vector. So if it is
0.01, 0.02, 0.03, 0.06 0.07
then it should calculate the maximum return between
0.01 and 0.02,
0.01 and 0.03,
0.02 and 0.03.
Then move on to
0.06 and 0.07 do it over etc.
Output would then be combined in a matrix or table when the entire sequence has run.
It should follow the index from vector a and b.
i is an index that is updated by one every time that a crosses into a new percentile. t(i) is the bucket associated with the ith cross.
a is the probability vector which has length tao. This vector should be analyzed in its 5% tiles, with the maximum intermediate absolute return being the output. The price change of next period is the vector b. This would be represented by P in the equation below.
l and m are indexes.
Every time Signal moves from one 5% tile to another, we compute the
largest absolute return that occurs between any two intermediate
buckets, until Signal moves to another 5% tile. For example, suppose
that Signal moves into the 85th percentile and 4 volume buckets later
moves into the 90th percentile. We would then calculate absolute
returns between buckets 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3
and 4. We are interested in the maximum absolute return. We would then
calculate the max return in the following percentile bucket, move on
to the next, which could be an 85th percentile and so on. So we let i
be an index that is updated by 1 every time that Signal moves from one
percentile into another, and τ(i) the bucket associated with the ith
cross.
This is the equation I am using. The notation might vary slightly.
Now my question is how to go about this. Perhaps someone has an intuitive solution to this.
I hope my question is clear.
"a","b"
0,0.013013698630137
0,0.0013522650439487
0,0.00135409614082593
0,0.00203389830508471
0.27804813511593,0.00135317997293627
0.300237801284318,0
0.495965075167796,0.00405405405405412
0.523741892051237,0.000672947510094168
0.558753750296458,0.00202020202020203
0.665762829019002,0.000672043010752743
0.493106479913899,0.000671591672263272
0.344592579573497,0.000672043010752854
0.336263897823707,0.00201748486886366
0.35884763774257,0.00536912751677865
0.23662807979007,0.00133511348464632
0.212636893966841,0.00267379679144386
0.362212830513403,0.000666666666666593
0.319216408413927,0.00333555703802535
0.277670854167344,0
0.310143323100971,0
0.374104373036218,0.00267737617135211
0.190943075221511,0.00268456375838921
0.165770070508112,0.00200803212851386
0.240310208616952,0.00133600534402145
0.212418038918236,0.00200133422281523
0.204282022136019,0.00200534759358306
0.363725074298064,0.000667111407605114
0.451807761954326,0.000666666666666593
0.369296011692801,0.000666222518321047
0.37503495989363,0.0026666666666666
0.323386355686901,0.00132978723404265
0.189216171830472,0.00266311584553924
0.185252052821193,0.00199203187250996
0.174882909380997,0.000662690523525522
0.149291525540782,0.00132625994694946
0.196824215268048,0.00264900662251666
0.164611993131396,0.000660501981505912
0.125470998266484,0.00132187706543285
0.179999532586703,0.00264026402640272
0.368749638521621,0.000658327847267826
0.427799340926225,0
My interpretation of the question
I hope I understand your question correctly. Here is what I understood:
For each row you compute which 5% percentile it belongs to
Whenever that percentile changes, you start a new bucket
All rows from the same bucket result in a single resulting value
If there is only a single row in a bucket, the b value from that row is the resulting value
Otherwise, you compute all abs(b[l]/b[m]-1) where m<l and both belong to the same bucket
Basic answer
Code
This code here does what I describe above:
# read the data (shortened, full data in OP)
d <- read.table(textConnection("a,b
0,0.013013698630137
[…]
0.427799340926225,0
"), sep=",", header=TRUE)
# compute percentile number for each line
d$percentile <- floor(d$a/0.05)*5 + 5
# start a new bucket whenever the percentile changes
d$bucket <- cumsum(c(1, diff(d$percentile) != 0))
# compute a single number for all rows of the same bucket
aggregate(b ~ percentile + bucket, d, function(b) {
if(length(b) == 1) return(b); # special case of only a single row
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1)) # compare all pairs
return(max(m[upper.tri(m)])) # only return pairs with m < l
})
Output
The result will look like this:
percentile bucket b
1 5 1 0.8960891071
2 30 2 0.0013531800
3 35 3 0.0000000000
4 50 4 0.0040540541
5 55 5 0.0006729475
6 60 6 0.0020202020
7 70 7 0.0006720430
8 50 8 0.0006715917
9 35 9 2.0020174849
10 40 10 0.0053691275
11 25 11 1.0026737968
12 40 12 0.0006666667
13 35 13 0.0033355570
14 30 14 0.0000000000
15 35 15 0.0000000000
16 40 16 0.0026773762
17 20 17 0.2520080321
18 25 18 0.5010026738
19 40 19 0.0006671114
20 50 20 0.0006666667
21 40 21 3.0026666667
22 35 22 0.0013297872
23 20 23 0.7511597084
24 15 24 0.0013262599
25 20 25 0.7506605020
26 15 26 0.0013218771
27 20 27 0.0026402640
28 40 28 0.0006583278
29 45 29 0.0000000000
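To check the logic on a small input, here is a sketch that applies the same percentile, bucket and pairwise steps to the five signal values from the question's example. The b values are made up for illustration, and max_pair_return is a hypothetical name for the function passed to aggregate above.

```r
a <- c(0.01, 0.02, 0.03, 0.06, 0.07)        # signal values from the question
b <- c(0.010, 0.012, 0.009, 0.020, 0.030)   # hypothetical next-period changes

# upper edge of the 5% band each signal value falls into
percentile <- floor(a / 0.05) * 5 + 5
# start a new bucket whenever the band changes
bucket <- cumsum(c(1, diff(percentile) != 0))

max_pair_return <- function(b) {
  if (length(b) == 1) return(b)             # single row: return its own value
  m <- outer(b, b, function(pm, pl) abs(pl / pm - 1))
  max(m[upper.tri(m)])                      # only pairs with m < l
}

result <- tapply(b, bucket, max_pair_return)
```

The first three values share the 5% band and the last two share the 10% band, so bucket is 1 1 1 2 2. In the first bucket the largest of the three pairwise returns is |0.009/0.012 - 1| = 0.25, and in the second the only pair gives |0.030/0.020 - 1| = 0.5.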
Additional columns
Code
If you also want to know the number of items in each group, then I suggest you use the plyr library:
library(plyr)
aggB <- function(b) {
if(length(b) == 1) return(b)
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))
return(max(m[upper.tri(m)]))
}
ddply(d, .(bucket), summarise,
percentile = percentile[1], n = length(b), maxr = aggB(b))
Output
This will give you the following result:
bucket percentile n maxr
1 1 5 4 0.8960891071
2 2 30 1 0.0013531800
3 3 35 1 0.0000000000
4 4 50 1 0.0040540541
5 5 55 1 0.0006729475
6 6 60 1 0.0020202020
7 7 70 1 0.0006720430
8 8 50 1 0.0006715917
9 9 35 2 2.0020174849
10 10 40 1 0.0053691275
11 11 25 2 1.0026737968
12 12 40 1 0.0006666667
13 13 35 1 0.0033355570
14 14 30 1 0.0000000000
15 15 35 1 0.0000000000
16 16 40 1 0.0026773762
17 17 20 2 0.2520080321
18 18 25 3 0.5010026738
19 19 40 1 0.0006671114
20 20 50 1 0.0006666667
21 21 40 2 3.0026666667
22 22 35 1 0.0013297872
23 23 20 3 0.7511597084
24 24 15 1 0.0013262599
25 25 20 2 0.7506605020
26 26 15 1 0.0013218771
27 27 20 1 0.0026402640
28 28 40 1 0.0006583278
29 29 45 1 0.0000000000
I am not sure I understand, but here is an attempt. My idea is to group the data by percentile and then do the calculation on each group using by.
To group the data I create a new variable, split:
##dat$split <- cut(dat$a,seq(0, 1, 0.05),include.lowest=T)
dat$split <- c(0,cumsum(diff(dat$a) > 0.05))
Using by, I can apply my function to each group. The special cases of NULL prob values or a single value return 0.
by(dat,dat$split,FUN =function(x){
P <- x$b
if( is.null(P)||length(P) ==1) return(0)
nn <- length(P)
ind <- expand.grid(1:nn,1:nn) ## I generate indexes here
ret <- abs(P[ind[,1]]/P[ind[,2]]-1) ## perfom P_l/P_m-1 (vectorized)
list(P=P,
ret.max = max(ret),
ret.ind = ind[which.max(ret),])
})
Here is the resulting list. For each interval I show:
P (the prob values),
the maximum return,
the indexes from which this maximum is computed.
For example:
dat$split: 0
$P
[1] 0.0130 0.0014 0.0014 0.0020
$ret.max
[1] 8.6236
$ret.ind
Var1 Var2
5 1 2
---------------------------------------------------------------------------------------------------------------
dat$split: 1
$P
[1] 0.0014 0.0000
$ret.max
[1] 1
$ret.ind
Var1 Var2
2 2 1
