R: Round to varying thresholds - r

I have a vector of numbers that I need to round according to rules that vary with the magnitude of the number. Consider the following examples:
0.5 -> 0.5 (no rounding)
1.2 -> 1.0
3.7 -> 4.0
18.9 -> 20.0
28.1 -> 30.0
110 -> 120
I could in theory write a series of conditional statements to achieve this, but that would be tedious and inefficient. Is there a way to achieve the desired outcome in an efficient manner?
Thank you

You could use the floor of the base-10 logarithm to calculate powers of 10. Then divide the vector by those powers, round it, and multiply by the powers of 10 again.
tens <- 10^floor(log10(abs(x)))
round(x/tens)*tens
# [1] NaN 0.5 1.0 4.0 -4.0 20.0 30.0 100.0
Note that this won't work for zero, so you should handle that case separately.
(However, 110 -> 120 is not obvious to me.)
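A minimal sketch of such case-handling, assuming zeros should simply stay zero (an assumption; the question does not cover zero):
tens <- 10^floor(log10(abs(x)))
ifelse(x == 0, 0, round(x / tens) * tens)  # zeros pass through unchanged
# [1]   0.0   0.5   1.0   4.0  -4.0  20.0  30.0 100.0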
Data:
x <- c(0, .5, 1.2, 3.7, -3.7, 18.9, 28.1, 110)

This solution uses findInterval to determine which rounding function should be applied to each element of the vector.
roundSpecial <- function(x){
  round_funs <- list(
    no_round    <- function(x) x,
    round_by_1  <- function(x) round(x),
    round_to_20 <- function(x) 20,
    round_by_10 <- function(x) 10 * round(x / 10),
    round_by_15 <- function(x) 15 * round(x / 15),
    round_by_30 <- function(x) 30 * round(x / 30)
  )
  lims <- c(0, 1, 17, 20, 30, 90, Inf)
  which_fun <- findInterval(x, lims)
  sapply(seq_along(which_fun), function(i) {
    round_funs[[ which_fun[i] ]](x[i])
  })
}
roundSpecial(x)
#[1] 0.5 1.0 4.0 20.0 30.0 120.0
Data:
x <- c(0.5, 1.2, 3.7, 18.9, 28.1, 110)

How to calculate slope and distance of two vectors in r?

I want to calculate slope and distance of two vectors. I am using the following code
df = structure(list(x = c(92.2, 88.1, 95.8, 83.8, 76.7, 83.3, 101.1,
111.8, 84.3, 81.5, 76.2, 87.1), y = c(84.8, 78.5, 103.1, 90.4,
85.1, 78.2, 98.3, 109.2, 85.6, 86.9, 85.6, 94)), class = "data.frame", row.names = c(NA,
-12L))
x <- df$x
y <- df$y
#Slope
diff(y)/diff(x)
#Distance
dist(df, method = "euclidean")
You can see that the slope output has only 11 values. I also want the slope of the 12-1 pair. How can I get that? And from the distance output I only want the values for the 1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, 9-10, 10-11, 11-12 and 12-1 combinations. How can I achieve that?
The expected output is
Length 7.5 25.8 17.5 8.9 9.5 26.8 15.3 36.2 3.1 5.5 13.8 10.5
Slope 1.54 3.19 1.06 0.75 -1.05 1.13 1.02 0.86 -0.46 0.25 0.77 1.08
I think the diff approach by @Gregor Thomas is concise enough. Here is another option in case you are interested in using dist for computing the distances.
> d <- rbind(df, df[1, ])
> with(d, diff(y) / diff(x))
[1] 1.5365854 3.1948052 1.0583333 0.7464789 -1.0454545 1.1292135
[7] 1.0186916 0.8581818 -0.4642857 0.2452830 0.7706422 -1.8039216
> (m <- as.matrix(dist(d)))[col(m) - row(m) == 1]
[1] 7.516648 25.776928 17.472550 8.860023 9.548298 26.848650 15.274161
[8] 36.238239 3.087070 5.457105 13.761177 10.519030
There's no convenient diff variant that wraps around to the difference of the last and first elements. You can compute it directly with (y[12] - y[1]) / (x[12] - x[1]), or, to be more general, use tail(x, 1) for the last element and head(x, 1) for the first, then append the result to your slope vector.
For the euclidean distance of successive points, it is most direct to compute it as distance = sqrt(diff(x)^2 + diff(y)^2).
(slope = c(diff(y)/diff(x), (head(y, 1) - tail(y, 1)) / (head(x, 1) - tail(x, 1))))
# [1] 1.5365854 3.1948052 1.0583333 0.7464789 -1.0454545 1.1292135 1.0186916
# [8]  0.8581818 -0.4642857  0.2452830  0.7706422 -1.8039216
(distance = sqrt(diff(x)^2 + diff(y)^2))
# [1] 7.516648 25.776928 17.472550 8.860023 9.548298 26.848650 15.274161 36.238239 3.087070 5.457105 13.761177
I'll leave it as an exercise for the reader to add the last distance between the first and last points.
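For reference, one possible way to close that loop, reusing the wrap-around idea from the previous answer (a sketch, not part of the original answer):
# append the first point after the last one, then take successive differences
xw <- c(x, x[1])
yw <- c(y, y[1])
(distance_closed <- sqrt(diff(xw)^2 + diff(yw)^2))  # 12 distances, including 12-1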

Using the akima bilinear function for interpolation

I am using the akima package and bilinear function to interpolate z values (temperatures) from a coarse coordinate grid (2.5° x 2.5°) to a finer grid (0.5° x 0.5°). The bilinear function works as follows:
Usage
bilinear(x, y, z, x0, y0)
Arguments
x a vector containing the x coordinates of the rectangular data grid.
y a vector containing the y coordinates of the rectangular data grid.
z a matrix containing the z[i,j] data values for the grid points (x[i],y[j]).
x0 vector of x coordinates used to interpolate at.
y0 vector of y coordinates used to interpolate at.
Value
This function produces a list of interpolated points:
x vector of x coordinates.
y vector of y coordinates.
z vector of interpolated data z.
Given the following data:
# coarse grid longitudes:  x <- c(0, 2.5, 5, 7.5, 10)
# coarse grid latitudes:   y <- c(50, 55, 60, 65, 70)
# temperatures:            z <- c(10.5, 11.1, 12.4, 9.8, 10.6)
# fine grid longitudes:    x0 <- c(0, 0.5, 1, 1.5, 2)
# fine grid latitudes:     y0 <- c(50, 50.5, 51, 51.5, 52)
I tried the function:
bilinear(x = x, y = y, z = z, x0 = x0, y0 = y0)
But I get the following:
Error in if (dim(z)[1] != nx) stop("dim(z)[1] and length of x differs!") :
argument is of length zero
I clearly don't fully understand how this function works and would really appreciate any suggestions if somebody knows what I'm doing wrong? I'm open to an alternative solution using a different package also.
Read the description of the function carefully: z needs to be a matrix whose dimensions match the lengths of x and y:
library(akima)
x <- c(0, 2.5, 5, 7.5, 10)
y <- c(50, 55, 60, 65, 70)
z <- matrix(rnorm(25), 5, 5)
x0 <- seq(0, 10, .5)
y0 <- seq(50, 70, length = length(x0))
> bilinear(x, y, z, x0, y0)
$x
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
[16] 7.5 8.0 8.5 9.0 9.5 10.0
$y
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
$z
[1] 1.14880762 1.08789150 0.88252672 0.53271328 0.03845118 -0.60025959
[7] -0.13758256 0.17947029 0.35089894 0.37670342 0.25688371 -0.06736752
[13] -0.42197570 -0.80694083 -1.22226291 -1.66794194 -1.38940279 -1.08889523
[19] -0.76641923 -0.42197481 -0.05556197
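Applied to the question's own grid, z would therefore need to be a 5 x 5 matrix (25 temperature values, one per (x[i], y[j]) pair) rather than a length-5 vector. A minimal sketch, where z_values is a hypothetical placeholder for the 25 real temperatures:
library(akima)
x  <- c(0, 2.5, 5, 7.5, 10)      # coarse grid longitudes
y  <- c(50, 55, 60, 65, 70)      # coarse grid latitudes
z_values <- runif(25, 9, 13)     # placeholder: replace with the 25 actual temperatures
z  <- matrix(z_values, nrow = length(x), ncol = length(y))  # z[i, j] goes with (x[i], y[j])
x0 <- c(0, 0.5, 1, 1.5, 2)       # fine grid longitudes
y0 <- c(50, 50.5, 51, 51.5, 52)  # fine grid latitudes
bilinear(x, y, z, x0, y0)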

Applying same function but different calculations to different lists that has binary columns in loop with lapply

The title may be confusing, but I guess it has a simple solution. I have my own function and I want to apply the same function to multiple lists that each consist of two columns, but I need to do a different calculation on each column.
As an example mydata is:
x1 x2 y1 y2 z1 z2
1 0.0 0.0 0.0 7.8 0.0 8.6
2 8.6 0.0 0.0 7.6 1.6 1.4
3 11.2 7.8 3.4 1.2 7.6 0.0
4 8.4 7.6 21.4 10.2 23.6 0.0
5 0.0 1.2 1.8 7.0 3.2 0.0
6 0.0 10.2 1.4 0.0 0.0 0.0
mydata<-structure(list(x1 = c(0, 8.6, 11.2, 8.4, 0, 0), x2 = c(0, 0,
7.8, 7.6, 1.2, 10.2), y1 = c(0, 0, 3.4, 21.4, 1.8, 1.4), y2 = c(7.8,
7.6, 1.2, 10.2, 7, 0), z1 = c(0, 1.6, 7.6, 23.6, 3.2, 0), z2 = c(8.6,
1.4, 0, 0, 0, 0)), .Names = c("x1", "x2", "y1", "y2", "z1", "z2"
), class = "data.frame", row.names = c(NA, -6L))
And myfun function is:
myfun <- function(x) {
  means <- sapply(list(x), function(ss) mean(ss, na.rm = TRUE))
  #my point: vars <- sapply(list(y), function(ss) var(ss, na.rm = TRUE))
  mean <- means[[1]]
  #var <- vars[[1]]
  #lists <- list(mean, var)
  #names(lists) <- c("mean", "var")
  #return(lists)
  lists <- list(mean)
  names(lists) <- c("mean")
  return(lists)
}
I used # to comment out the parts that will be added to myfun later.
When I try
results <- lapply(mydata, myfun)
it applies the same function and the same calculation to every column.
As you can see, there are 2 columns (x1-x2, y1-y2, z1-z2) for each dataset (x, y, z).
What I want is:
1) Obtaining means of first columns (x1, y1, z1)
2) Obtaining variances of second columns (x2, y2, z2)
3) As output, I want to see the results mean1 and var1 for each dataset under the x, y and z lists, like:
x-> mean1 (mean of x1)
var1 (var of x2)
y-> mean1 (mean of y1)
var1 (var of y2)
4) Do all these in a loop with lapply or sapply or with any useful function.
Notes:
1) I did not group x1 and x2 under x, or y1 and y2 under y, because a solution that works on mydata in this form would be more useful for me. But if it is necessary, I can group them separately.
2) The myfun function currently finds the means of all 6 columns. I have marked with # the additional parts that will be used to calculate the variances of the second columns.
Consider assigning your groups first, then iterating over them with lapply. In fact, use sapply with simplify = FALSE to get a named list.
grps <- unique(gsub("[0-9]", "", colnames(mydata)))
# [1] "x" "y" "z"
myfun <- function(grp)
  list(mean = mean(mydata[, paste0(grp, 1)]),
       variance = var(mydata[, paste0(grp, 2)]))
mean_var_list <- sapply(grps, myfun, simplify = FALSE)
mean_var_list
# $x
# $x$mean
# [1] 4.7
#
# $x$variance
# [1] 20.87467
#
# $y
# $y$mean
# [1] 4.666667
#
# $y$variance
# [1] 16.53467
#
# $z
# $z$mean
# [1] 6
#
# $z$variance
# [1] 11.85067
Or use the default, simplify = TRUE, to return a matrix.
mean_var_mat <- sapply(grps, myfun)
mean_var_mat
# x y z
# mean 4.7 4.666667 6
# variance 20.87467 16.53467 11.85067
I would start by splitting the data frame to create a list of data frames with 2 columns each.
At that point you can use lapply or map_dfr to apply the function mean_var to each element of the list.
The advantage of map_dfr is that it returns a data frame, binding the rows of the function output.
library(purrr)
my_data_l <- split.default(mydata, rep(1:3, each = 2))
mean_var <- function(x) {
  list(mean = mean(x[, 1]), var = var(x[, 2]))
}
map_dfr(my_data_l, mean_var)
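As an optional tweak (an assumption about preference, borrowing the gsub idea from the previous answer rather than part of this one), you can split by the letter prefix so the list elements keep the names x, y and z instead of 1, 2 and 3:
# split by the letter prefix so the list elements are named x, y, z
my_data_l <- split.default(mydata, gsub("[0-9]", "", colnames(mydata)))
map_dfr(my_data_l, mean_var, .id = "group")  # .id records the group name in a column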

how to subset a vector in the way that represent the general shape of original vector in R

I have vectors of different sizes, and I want to sample all of them equally (for example, 10 samples from each vector), in such a way that the samples represent each vector.
suppose that one of my vectors is
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
What are the 10 representative points of this vector?
In case you are referring to retaining the shape of the curve, you can try preserving the local minima and maxima:
library(dplyr)
df <- as.data.frame(y)
y2 <- df %>%
  mutate(loc_minima = if_else(lag(y) > y & lead(y) > y, TRUE, FALSE)) %>%
  mutate(loc_maxima = if_else(lag(y) < y & lead(y) < y, TRUE, FALSE)) %>%
  filter(loc_minima == TRUE | loc_maxima == TRUE) %>%
  select(y)
Though this does not guarantee you'll have exactly 10 points.
Thanks to @minem, I got my answer. Perfect!
library(kmlShape)
Px=(1:length(y))
Py=y
par(mfrow=c(1,2))
plot(Px,Py,type="l",main="original points")
plot(DouglasPeuckerNbPoints(Px,Py,10),type="b",col=2,main="reduced points")
The result (plotted with the code above, using the Ramer–Douglas–Peucker algorithm) is a set of 10 points that closely follows the original curve.
The best answer has already been given, but since I was working on it, I will post my naive heuristic solution.
Disclaimer:
this is certainly less efficient and more naive than the Ramer–Douglas–Peucker algorithm, but in this case it gives a similar result...
# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x, y, n = 10){
  idxReduced <- 1:length(x)
  while(length(idxReduced) > n){
    minDist <- NULL
    idxFinal <- NULL
    for(idxToRemove in 1:length(idxReduced)){
      newIdxs <- idxReduced[-idxToRemove]
      spf <- splinefun(x[newIdxs], y[newIdxs])
      full <- spf(x)
      dist <- sum((full - y)^2)
      if(is.null(minDist) || dist < minDist){
        minDist <- dist
        idxFinal <- newIdxs
      }
    }
    idxReduced <- idxFinal
  }
  return(list(x = x[idxReduced], y = y[idxReduced]))
}
Usage:
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- 1:length(y)
reduced <- heuristicDownSample(x,y,10)
par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')
You could use cut to generate a factor that indicates in which quintile (or whatever quantile you want) your values belong, and then sample from there:
df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))
cutpoints <- seq(min(df$values), max(df$values), length.out = 5)
> cutpoints
[1] -2.00 4.25 10.50 16.75 23.00
df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)
> df
values quintiles
1 2.5 [-2,4.25]
2 1.0 [-2,4.25]
3 0.0 [-2,4.25]
4 1.2 [-2,4.25]
5 2.0 [-2,4.25]
6 3.0 [-2,4.25]
7 2.0 [-2,4.25]
8 1.0 [-2,4.25]
9 0.0 [-2,4.25]
10 -2.0 [-2,4.25]
11 -1.0 [-2,4.25]
12 0.5 [-2,4.25]
13 2.0 [-2,4.25]
14 3.0 [-2,4.25]
15 6.0 (4.25,10.5]
16 5.0 (4.25,10.5]
17 7.0 (4.25,10.5]
18 9.0 (4.25,10.5]
19 11.0 (10.5,16.8]
20 15.0 (10.5,16.8]
21 23.0 (16.8,23]
Now you could split the data by quintiles, calculate the propensities and sample from the groups.
groups <- split(df, df$quintiles)
probs <- prop.table(table(df$quintiles))
nsample <- as.vector(ceiling(probs*10))
> nsample
[1] 7 2 1 1
resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))
> z
[1] 2.0 1.0 0.0 1.0 3.0 0.5 3.0 5.0 9.0 11.0 23.0
Due to ceiling(), this may lead to 11 cases being sampled instead of 10.
Apparently you are interested in systematic sampling. If so, maybe the following can help.
set.seed(1234)
n <- 10
step <- floor(length(y) / n)              # sampling interval
first <- sample(step, 1)                  # random starting position within the first interval
z <- y[step * (seq_len(n) - 1) + first]   # take every step-th element starting at 'first'

how to choose a row from a matrix which satisfies some criteria

M1 M2 M3
M1_1 M1_2 M1_diff M2_1 M2_2 M2_diff M3_1 M3_2 M3_diff
A 55.2 60.8 5.6 66.7 69.8 3.1 58.5 60.3 1.8
B 56.8 55.4 1.4 62.8 63.9 1.1 65.7 69.8 4.1
C 52.3 54.3 2.0 53.8 55.9 1.1 56.7 57.9 1.2
I have to find which of M1, M2, M3 is best for each of A, B, C. The criteria are that Mi_1 and Mi_2 should be as high as possible and Mi_diff as low as possible (i = 1, 2, 3). For example, for id B it may be the second model: B has the lowest diff for M2, so I chose M2 for B. M3 could have been chosen too, given its larger accuracy, but its diff is big. I cannot come up with any general algorithm to do this. We could put a cutoff on the diff values and then choose the Ms; for example, if 1.5 is the lower bound for diff, then M3 is best for id B.
The data is quite big, with almost 1000 unique ids, and this cannot be done manually. I was thinking there may be some easy solution that I am not seeing. Can anyone please help? I am using R for my computations.
You just need to come up with some equation that satisfies your criteria.
For instance, as you want M1 and M2 to be as high as possible, but their difference to be as low as possible, you may want to maximize:
M1*M2/(M1-M2)
You can add coefficients to this equation to increase the importance of any of the terms.
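For illustration only (alpha and beta are hypothetical tuning exponents, not from the answer), one way to weight the two parts of the criterion:
# hypothetical weighted score: larger beta penalises the difference more strongly,
# larger alpha gives more weight to the two accuracy values
score <- function(m1, m2, d, alpha = 1, beta = 1) (m1 * m2)^alpha / d^beta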
In R:
# Set RNG seed for reproducibility
set.seed(12345)
# Generate some data
num.rows <- 1000
df <- data.frame(M1_1 = runif(num.rows, 0, 100),
M1_2 = runif(num.rows, 0, 100),
M2_1 = runif(num.rows, 0, 100),
M2_2 = runif(num.rows, 0, 100),
M3_1 = runif(num.rows, 0, 100),
M3_2 = runif(num.rows, 0, 100))
df$M1_diff <- abs(df$M1_1 - df$M1_2)
df$M2_diff <- abs(df$M2_1 - df$M2_2)
df$M3_diff <- abs(df$M3_1 - df$M3_2)
# We call apply with 1 as the second parameter,
# meaning the function will be applied to each row
res <- apply(df, 1, function(row) {
  # Our criterion, modify at will
  M1_prod <- row["M1_1"] * row["M1_2"] / row["M1_diff"]
  M2_prod <- row["M2_1"] * row["M2_2"] / row["M2_diff"]
  M3_prod <- row["M3_1"] * row["M3_2"] / row["M3_diff"]
  # Which is the maximum? Returns 1, 2 or 3
  which.max(c(M1_prod, M2_prod, M3_prod))
})
And the output
> head(df)
M1_1 M1_2 M2_1 M2_2 M3_1 M3_2 M1_diff M2_diff M3_diff
1 72.09039 7.7756704 95.32788 43.06881 27.16464 18.089266 64.314719 52.25907 9.075377
2 87.57732 84.3713648 62.17875 86.29595 62.93161 18.878981 3.205954 24.11720 44.052625
3 76.09823 0.6813684 53.16722 25.12324 85.90863 72.700354 75.416864 28.04398 13.208273
4 88.61246 35.1184204 89.20926 76.34523 36.97298 3.062528 53.494036 12.86403 33.910451
5 45.64810 68.6061032 19.58807 69.40719 28.21637 58.466682 22.958007 49.81913 30.250311
6 16.63718 25.4086494 88.43795 73.68140 81.37349 75.001685 8.771471 14.75656 6.371807
> head(res)
[1] 2 1 3 2 1 3
