approx() without duplicates in R?

I am using approx() to interpolate values.
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)
df <- cbind.data.frame(x,y)
> df
    x y
1   1 3
2   2 8
3   3 2
4   4 6
5   5 8
6   6 2
7   7 4
8   8 7
9   9 9
10 10 9
11 11 1
12 12 3
13 13 1
14 14 9
15 15 6
16 16 2
17 17 8
18 18 7
19 19 6
20 20 2
interpolated <- approx(x=df$x, y=df$y, method="linear", n=5)
which gets me this:
interpolated
$x
[1] 1.00 5.75 10.50 15.25 20.00
$y
[1] 3.0 3.5 5.0 5.0 2.0
Now, the first and last values are duplicates of my real data. Is there any way to prevent this, or is it something I don't understand properly about approx()?

You may want to specify xout to avoid this. For instance, if you want to always exclude the first and the last points, here's how you can do that:
specify_xout <- function(x, n) {
  # n points evenly spaced over the range of x, excluding both endpoints
  seq(from = min(x), to = max(x), length.out = n + 2)[-c(1, n + 2)]
}
plot(df$x, df$y)
points(approx(df$x, df$y, xout=specify_xout(df$x, 5)), pch = "*", col = "red")
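For reference, here is what that call returns (my addition; each value is a plain linear interpolation between the two neighbouring observations):
approx(df$x, df$y, xout = specify_xout(df$x, 5))
$x
[1]  4.166667  7.333333 10.500000 13.666667 16.833333
$y
[1] 6.333333 5.000000 5.000000 6.333333 7.000000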
Note that this does not prevent the interpolation from hitting an existing data point somewhere in the middle (exactly what happened in the plot this code produced).

approx() will fit through all of your original data points if you give it the chance (change n=5 to xout=df$x to see this). Interpolation generates y values for unobserved values of x, but it should agree with your data wherever x has been observed.
With method="linear", approx() 'draws' straight segments joining your original coordinates exactly, so it returns the very y values you fed in at integer x. You only see 'new' y values because n=5 places most of the output points at non-integer x (and therefore not at one of your input values), so they get interpolated.
If you want observed values not to be reproduced exactly, then maybe add some noise via rnorm()?
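A quick check of that first point (my addition, not part of the original answer): asking approx() for the observed x values reproduces the observed y values exactly.
fit <- approx(df$x, df$y, xout = df$x)
all.equal(fit$y, df$y)
[1] TRUE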

Related

Vectorized function usage and joining individual terms into a single tibble

The title is vague, but let me explain:
I have a non-vectorized function that outputs a 15-row table of volume estimates for a tree. Each row is a different measurement unit or portion of the input tree. A Tables argument helps the user decide which units and measurement protocol they want, but in 99% of use cases the output for a single tree's volume estimate is a tibble with more than one row.
I've removed ~20 other arguments from the function for demonstration's sake. DBH is a tree's diameter at breast height. The Vol column is arbitrary.
Est1 <- TreeVol(Tables = "All", DBH = 7)
Est1
# A tibble: 15 x 3
   Tables                               DBH   Vol
   <chr>                              <dbl> <dbl>
 1 1. Total_Above_Ground_Cubic_Volume     7     2
 2 2. Gross_Inter_1/4inch_Vol             7     4
 3 3. Net_Scribner_Vol                    7     6
 4 4. Gross_Merchantable_Vol              7     8
 5 5. Net_Merchantable_Vol                7    10
 6 6. Merchantable_Vol                    7    12
 7 7. Gross_SecondaryProduct_Vol          7    14
 8 8. Net_SecondaryProduct_Vol            7    16
 9 9. SecondaryProduct                    7    18
10 10. Gross_Inter_1/4inch_Vol            7    20
11 11. Net_Inter_1/4inch_Vol              7    22
12 12. Gross_Scribner_SecondaryProduct    7    24
13 13. Net_Scribner_SecondaryProduct      7    26
14 14. Stump_Volume                       7    28
15 15. Tip_Volume                         7    30
The user can use the Tables argument like so:
Est2 <- TreeVol(Tables = "Scribner_BF", DBH = 7)
# A tibble: 3 x 3
  Tables                               DBH   Vol
  <chr>                              <dbl> <dbl>
1 3. Net_Scribner_Vol                    7     6
2 12. Gross_Scribner_SecondaryProduct    7    24
3 13. Net_Scribner_SecondaryProduct      7    26
The problem is that I'd like to write a vectorized version of this function that can calculate volumes for an entire .csv of tree inventory data. Ideally, the multi-row outputs for each tree would be stacked into one long tibble, with each 15-row default output filtered by whatever the user passes to the Tables argument, like so:
Est3 <- VectorizedTreeVol(Tables = "Scribner_BF", DBH = c(7, 21, 26))
# A tibble: 9 x 3
  Tables                               DBH   Vol
  <chr>                              <dbl> <dbl>
1 3. Net_Scribner_Vol                    7     6
2 12. Gross_Scribner_SecondaryProduct    7    24
3 13. Net_Scribner_SecondaryProduct      7    26
4 3. Net_Scribner_Vol                   21    18
5 12. Gross_Scribner_SecondaryProduct   21    72
6 13. Net_Scribner_SecondaryProduct     21    76
7 3. Net_Scribner_Vol                   26     8
8 12. Gross_Scribner_SecondaryProduct   26    78
9 13. Net_Scribner_SecondaryProduct     26    84
To achieve this, I wrote a for() loop that acts as the heart of the vectorized function. I've heard from multiple people that it's very inefficient (and I agree), but in principle it does what I want. Nothing I've found on this topic suggests a better approach for a vectorized function like mine.
The general setup for the loop looks like this:
for (i in 1:length(DBH)) {
  Output <- VectorizedTreeVol(Tables = Tables[[i]], DBH = DBH[[i]]) %>%
    purrr::reduce(dplyr::full_join, by = NULL) %>%
    suppressWarnings()
}
In functions where the non-vectorized output is always a single row, the heart of the respective vectorized function doesn't need a for() loop and looks like this:
Output <- OtherVectorizedFunction(Tables = Tables, DBH = DBH) %>%
  purrr::reduce(dplyr::full_join, by = ColumnNames) %>% # ColumnNames is a vector of the output's column names
  suppressWarnings()
This specific call to reduce() has worked pretty well when I've used it to vectorize the other functions in the project, but I'm open to suggestions on how to join the output tables. I've been stuck on this for a few months, and any help with what this for() loop is striving for would be awesome. Is a vectorized function that outputs a tibble like Est3 even possible? Any feedback/comments are much appreciated.
Given this function:
TreeVol <- function(DBH) {
  data.frame(Tables = c("Tree_Vol", "Intercapillary_transfusion", "Woodiness"),
             Vol = c(DBH^2, sqrt(DBH) + 3, sin(DBH)),
             DBH)
}
We can map TreeVol() over the DBH values with purrr::map() and then bind the rows to get one data.frame.
VecTreeVol <- function(DBH) {
  DBH %>%
    purrr::map(TreeVol) %>%   # one data.frame per DBH value
    dplyr::bind_rows()        # stack them into one long data.frame
}
Result
> VecTreeVol(DBH = 1:3)
                      Tables       Vol DBH
1                   Tree_Vol 1.0000000   1
2 Intercapillary_transfusion 4.0000000   1
3                  Woodiness 0.8414710   1
4                   Tree_Vol 4.0000000   2
5 Intercapillary_transfusion 4.4142136   2
6                  Woodiness 0.9092974   2
7                   Tree_Vol 9.0000000   3
8 Intercapillary_transfusion 4.7320508   3
9                  Woodiness 0.1411200   3
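To also honor the Tables argument from the question, one possible sketch (my addition; it assumes the real TreeVol() keeps its Tables parameter and returns an already-filtered tibble like Est2 above):
VectorizedTreeVol <- function(Tables = "All", DBH) {
  DBH %>%
    purrr::map(~ TreeVol(Tables = Tables, DBH = .x)) %>%  # one filtered tibble per tree
    dplyr::bind_rows()                                    # stack into one long tibble
}
Called as VectorizedTreeVol(Tables = "Scribner_BF", DBH = c(7, 21, 26)), this would produce the long, stacked output shown as Est3, without an explicit for() loop.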

Is there a way to create a permutation of a vector without using the sample() function in R?

I hope you are having a nice day. Is there a way to create a permutation (rearrangement) of the values in a vector in R?
My professor gave us an assignment in which we are supposed to create two functions for a randomization test: one that uses sample() to create a permutation and one that does not. So far all of my efforts have been fruitless, as every answer I can find resorts to sample(). I have tried several other methods, such as indexing with runif() and writing my own functions, but to no avail. Alas, I have accepted defeat and come here for salvation.
While using the sample() function, the code looks like:
#create the groups
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
#create a permutation of the combined vector without replacement using the sample function()
permsample <-sample(c(a,b),replace=FALSE)
permsample
[1] 2 5 6 1 7 7 3 8 6 3 5 9 2 7 4 8 1 5
And, for reference, the entire code of my function looks like:
PermutationTtest <- function(a, b, P) {
  sample.t.value <- t.test(a, b)$statistic
  perm.t.values <- matrix(rep(0, P), P, 1)
  N <- length(a)
  M <- length(b)
  for (i in 1:P) {
    permsample <- sample(c(a, b), replace = FALSE)
    pgroup1 <- permsample[1:N]
    pgroup2 <- permsample[(N + 1):(N + M)]
    perm.t.values[i] <- t.test(pgroup1, pgroup2)$statistic
  }
  return(mean(perm.t.values))
}
How would I achieve the same thing, but without using the sample() function and within the confines of base R? The only hint my professor gave was "use indices." Thank you very much for your help and have a nice day.
You can use runif() to generate a value between 1 and the number of elements still available, and floor() to take its integer part. At each iteration, I shrink the range the random index rn is drawn from, append the element at the rn-th position of the remaining vector to the new one, and remove it.
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
pool <- c(a, b)   # avoid calling the variable c, which shadows the function c()
index <- length(pool)
perm <- c()
for (i in 1:length(pool)) {
  # max = index + 1 so that floor() can return every position from 1 to index
  rn <- floor(runif(1, min = 1, max = index + 1))
  perm <- append(perm, pool[rn])
  pool <- pool[-rn]
  index <- index - 1
}
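A quick sanity check (my addition): the result should contain exactly the same values as the input, just reordered.
all(sort(perm) == sort(c(a, b)))
[1] TRUE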
It is easier to see what is going on if we use consecutive numbers:
a <- 1:8
b <- 9:17
ab <- c(a, b)
ab
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Now draw 17 (length(ab)) random numbers and use them to order ab:
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 5 13 11 12 6 1 17 3 10 2 8 16 7 4 9 15 14
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 14 11 5 15 10 7 13 9 17 8 2 6 1 4 16 12 3
For each permutation just draw another 17 random numbers.
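Dropped into the question's function, a sample()-free version could look like this sketch (my addition; only the permutation line differs from the original PermutationTtest()):
PermutationTtestNoSample <- function(a, b, P) {
  perm.t.values <- numeric(P)
  N <- length(a)
  M <- length(b)
  ab <- c(a, b)
  for (i in 1:P) {
    permsample <- ab[order(runif(N + M))]   # permutation via random ordering
    perm.t.values[i] <- t.test(permsample[1:N], permsample[(N + 1):(N + M)])$statistic
  }
  mean(perm.t.values)
}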

Normalize/scale data set

I have the following data set:
dat<-as.data.frame(rbind(10,8,2,7,10,10,1,10,14,9,2,6,10,8,10,8,10,10,7,11,10))
colnames(dat)<-"Score"
print(dat)
   Score
1     10
2      8
3      2
4      7
5     10
6     10
7      1
8     10
9     14
10     9
11     2
12     6
13    10
14     8
15    10
16     8
17    10
18    10
19     7
20    11
21    10
These are test scores obtained by students. A student could score a maximum of 15 or a minimum of 0 on this test (nobody actually got the max or the min); the lowest score obtained was 1 and the highest was 14.
Now, I want to normalize/scale this data to a scale of 0 to 20.
How can I achieve this in Excel or in R?
My final goal is to normalize the scores from this test to the above scale and compare them with another data set whose max and min are 5 and 0 respectively.
How do I correctly compare these two differently scaled data sets against each other?
What I tried:
I went through a lot of material on the internet and came up with the feature scaling formula x' = (x - min(x)) / (max(x) - min(x)), which I got from Wikipedia.
Is this method reliable?
In your case I would use the feature scaling formula you posted in your question. (x - min(x)) / (max(x) - min(x)) will map your test marks to the range 0-1.
Since the true edges of the scale are 0 and 15, not 1 and 14, use min(x) = 0 and max(x) = 15. Once your marks are between 0 and 1, just multiply by 20.
i.e.
tests <- read.table(header = TRUE, file = 'clipboard')
tests2 <- (tests - 0) / (15 - 0)   # or equivalently tests / 15
And multiply by 20 to get marks between 0-20:
> tests2 * 20
Score
1 13.333333
2 10.666667
3 2.666667
4 9.333333
5 13.333333
6 13.333333
7 1.333333
8 13.333333
9 18.666667
10 12.000000
11 2.666667
12 8.000000
13 13.333333
14 10.666667
15 13.333333
16 10.666667
17 13.333333
18 13.333333
19 9.333333
20 14.666667
21 13.333333
The results are intuitive and the method is reliable. For example, the person who scored 14/15 should get the highest mark, very close to 20, which is the case here: after the transformation they score 18.67.
In Excel, if you want the normalized data to have a min of 0 and a max of 20, then we need to solve
y = A * x + b
for two points.
Put the max of the raw data in C1:
=MAX(A:A)
Put the min of the raw data in C2:
=MIN(A:A)
Put the desired max in D1 and the desired min in D2. Put the formula for the A-coefficient in C3:
=($D$1-$D$2)/($C$1-$C$2)
and the formula for the B-coefficient in C4:
=$D$1-$C$3*$C$1
Finally put the scaling formula in B1:
=A1*$C$3+$C$4
and copy down.
Naturally, if you want the scaling to be independent of the raw max or min, you would use 15 in C1 and 0 in C2.
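The same two-point linear map is easy to reproduce in R (my addition, a sketch of the Excel logic above):
# linear map y = A * x + b sending (old_min, old_max) to (new_min, new_max)
linmap <- function(x, old_min, old_max, new_min, new_max) {
  A <- (new_max - new_min) / (old_max - old_min)
  b <- new_max - A * old_max
  A * x + b
}
linmap(dat$Score, 0, 15, 0, 20)   # reproduces the tests2 * 20 results above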
You can scale between 0 and 20 with this command in R:
newvalue <- 20 / (max(score) - min(score)) * (score - min(score))
Note that this uses the observed min and max (1 and 14), so it stretches the observed range to 0-20 rather than rescaling the theoretical 0-15 scale.
The math is fairly straightforward if the floor of both scales is 0:
new_value = new_ceiling * old_value / old_ceiling
The next formula accounts for different floors on each scale:
new_value = new_floor + (new_ceiling - new_floor) * ((old_value - old_floor) / (old_ceiling - old_floor))
which is actually the formula you posted from Wikipedia. ;)
Hope this helps!
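A quick numeric check of that formula (my addition, using the question's 0-15 input scale and 0-20 target scale):
old_floor <- 0; old_ceiling <- 15
new_floor <- 0; new_ceiling <- 20
old_value <- c(1, 10, 14)
new_floor + (new_ceiling - new_floor) * (old_value - old_floor) / (old_ceiling - old_floor)
[1]  1.333333 13.333333 18.666667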
This is very simple: because both scales are linear, a simple ratio does the work. In other words, each grade in your set just needs to be multiplied by 20/15.
Here's a little R function which can help if you need to repeat the operation, and which gives you some flexibility over what you rescale to. One must also be careful with NA values, because min() and max() do not drop them by default and would then return NA; the function therefore takes an option for handling NAs (they are dropped by default).
# rescales data from 0 to 1 and optionally multiplies by a new max
rescale <- function(x, new_max = 1, na.rm = TRUE) {
  as.vector(new_max * scale(x,
                            center = min(x, na.rm = na.rm),
                            scale  = max(x, na.rm = na.rm) - min(x, na.rm = na.rm)))
}
# old scores
scores <- c(10, 8, 2, 7, 10, 10, 1, 10, 14, 9, 2, 6, 10, 8, 10, 8, 10, 10, 7, 11, 10)
# new scores
data.frame(old = scores,
           new = rescale(scores, new_max = 20))
#> old new
#> 1 10 13.846154
#> 2 8 10.769231
#> 3 2 1.538462
#> 4 7 9.230769
#> 5 10 13.846154
#> 6 10 13.846154
#> 7 1 0.000000
#> 8 10 13.846154
#> 9 14 20.000000
#> 10 9 12.307692
#> 11 2 1.538462
#> 12 6 7.692308
#> 13 10 13.846154
#> 14 8 10.769231
#> 15 10 13.846154
#> 16 8 10.769231
#> 17 10 13.846154
#> 18 10 13.846154
#> 19 7 9.230769
#> 20 11 15.384615
#> 21 10 13.846154
Created on 2022-03-10 by the reprex package (v2.0.1)

Searching the closest value in other column

Suppose we have a data frame with two columns:
 X  Y
10 14
12 16
14 17
15 19
21 19
The first element of Y is 14; the nearest (or equal) value in X is 14, which is the 3rd element of X. Similarly, the next element of Y (16) is closest to 15, the 4th element of X.
So, the output I would like should be
3
4
4
5
5
As my data is large, can you give me some advice on a proper, systematic way of doing this?
You can try this piece of code:
apply(abs(outer(d$X,d$Y,FUN = '-')),2,which.min)
# [1] 3 4 4 5 5
Here, abs(outer(d$X, d$Y, FUN = '-')) returns a matrix of absolute differences between d$X and d$Y, and apply(..., 2, which.min) returns, for each column (i.e. each element of d$Y), the row position of the minimum, which is the index into d$X.
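Since outer() allocates a length(X)-by-length(Y) matrix, it can become prohibitive for large data. Here is an alternative sketch (my addition, not from the original answer) that assumes d$X is sorted in ascending order and uses findInterval() instead:
# for each y, return the index of the nearest value in the sorted vector x
nearest_index <- function(x, y) {
  i  <- findInterval(y, x)          # last position with x[i] <= y (0 if none)
  lo <- pmax(i, 1L)                 # candidate on the left
  hi <- pmin(i + 1L, length(x))     # candidate on the right
  ifelse(abs(x[lo] - y) <= abs(x[hi] - y), lo, hi)
}
nearest_index(d$X, d$Y)
# [1] 3 4 4 5 5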

Merge values of a factor column

Column data$form contains 170 unique values (numbers from 1 to ~800).
I would like to merge some values (e.g. with a radius/step of 10).
I need to do this in order to use:
colors = rainbow(length(unique(data$form)))
in a plot and get a better visual result.
Thank you in advance for your help.
You can use %/% to group them and mean to combine them, then rescale. Note that normalize() is not base R, so a small helper is defined here (my addition) that rescales to [0, 1]:
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
# if you want specifically 20 groups (form is the question's data$form):
groups <- sort(form) %/% (800 / 20)
x <- c(by(sort(form), groups, mean))
x <- normalize(x) * 19 + 1
        0         1         2         3         4 
 1.000000  1.971781  2.957476  4.103704  4.948560 
        5         6         7         8         9 
 5.950617  7.175309  7.996914  8.953086  9.952263 
       10        11        12        13        14 
10.800705 11.901235 12.888889 13.772291 14.888889 
       15        16        17        18        19 
15.927984 16.864198 17.918519 18.860082 20.000000 
You could also use cut. If you use the argument labels=FALSE, you get an integer value:
form <- runif(170, min = 1, max = 800)
> cut(form, breaks=20)
 [1] (518,558]  (280,320]  (240,280]  (121,160]  (757,797] 
 [6] (160,200]  (320,359]  (598,638]  (80.8,121] (359,399] 
[11] (121,160]  (200,240]  ...
20 Levels: (1.18,41] (41,80.8] (80.8,121] (121,160] (160,200] (200,240] (240,280] (280,320] (320,359] (359,399] (399,439] ... (757,797]
> cut(form, breaks=20, labels=FALSE)
[1] 14 8 7 4 20 5 9 16 3 10 4 6 5 18 18 6 2 12
[19] 2 19 13 11 13 11 14 12 17 5 ...
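Tying this back to the plotting goal in the question, the integer bin can be used to index the palette directly (my addition, a sketch):
bins <- cut(data$form, breaks = 20, labels = FALSE)
colors <- rainbow(20)[bins]   # one colour per bin instead of 170 colours
plot(data$form, col = colors, pch = 19)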
On a side note, I want you to reconsider plotting with rainbow colours, as it distorts reading of the data; cf. Rainbow Color Map (Still) Considered Harmful.
