Sorting pairs by first coordinate - r

I have the following vectors:
X<-c(140,140,130,109,124,114,65,162,150,0)
Y<-c(30.65,6.45,17.74,11.29,3.23,3.23,3.23,8.06,14.52,1.61)
What I would like to do is assign each entry in X to the corresponding entry in Y, and then order them by X. For example, if I had
J<-c(10,40,20)
K<-c(9,9,2)
I would like it to give me
Jo = (10,20,40)
Ko = (9,2,9)
How do I do this in R? Thanks for the help.

Use the order() function:
X <- c(140,140,130,109,124,114,65,162,150,0)
Y <- c(30.65,6.45,17.74,11.29,3.23,3.23,3.23,8.06,14.52,1.61)
ord <- order(X)
(X2 <- X[ord])
## [1] 0 65 109 114 124 130 140 140 150 162
(Y2 <- Y[ord])
## [1] 1.61 3.23 11.29 3.23 3.23 17.74 30.65 6.45 14.52 8.06
(Don't really need to save ord if you re-order Y first; could use Y2 <- Y[order(X)]; X2 <- sort(X) instead.)
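Applied to the small J/K example from the question, the same idea can be sketched as follows. The data frame variant at the end is an optional variation rather than part of the answer above, and the name `xy` is only illustrative. When X contains ties, as the two 140s do here, `order(X, Y)` could be used to break them by Y.

```r
J <- c(10, 40, 20)
K <- c(9, 9, 2)

# order() returns the permutation that sorts J: here 1, 3, 2
ord <- order(J)
Jo <- J[ord]   # J in increasing order
Ko <- K[ord]   # K rearranged with the same permutation

# Variation: keep the pairs together and reorder the rows in one step
xy <- data.frame(J = J, K = K)   # 'xy' is an illustrative name
xy_sorted <- xy[order(xy$J), ]
```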


How to multiply specific columns in R by scalar

This is probably a very simple problem, but I can't seem to figure it out.
I am trying to multiply every column (except the first one) on my dataframe by the same scalar. Here is my reproducible example:
df <- data.frame(replicate(200, sample(0:100, 1000, rep=TRUE)))
a <- 0.75
First I tried this:
df2 <- df[,2:200]*a
However, this creates dataframe df2 that's missing the first column.
I also tried using tidyverse with the mutate_at and specifying a multiplication function, but that didn't run at all:
scalar <- function(x) (x*0.75)
df2 <- df %>% mutate_at(across(c(2:200)), scalar)
My apologies in advance if this is very simple.
df <-
data.frame(
replicate(
5,
sample(0:100, 1000, rep=TRUE)
)
)
a <- 0.75
df2 <-
df |>
dplyr::mutate(
dplyr::across(
# Every column but the first column
.cols = -c(1),
.fns = function(x){
x * a
}
)
)
head(df2)
#> X1 X2 X3 X4 X5
#> 1 58 33.00 54.75 64.50 35.25
#> 2 39 63.00 2.25 8.25 30.00
#> 3 63 18.00 30.00 9.00 0.00
#> 4 22 39.75 46.50 11.25 29.25
#> 5 42 18.75 60.75 31.50 3.00
#> 6 34 46.50 15.00 74.25 55.50
Created on 2022-11-08 with reprex v2.0.2

Writing custom function in R which takes input vector with units

I am trying to write a function my_heights() that takes a vector-like heights and returns a numeric vector with the heights in a consistent unit.
The unit should be specified as a function argument, with the default being cm. For the sake of simplicity, assume the unit can only be m or cm.
I have the following vector:
heights <- c("188 cm", "1.73 m", "192 cm", "175 cm", "165 cm", "191 cm", "1.85 m")
Desired output (sanity check):
my_heights(heights)
[1] 188 173 192 175 165 191 185
my_heights(heights, "m")
[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
I understand that when the default function is called it would return the heights with cm as it is but the heights with m are multiplied by 100. And for the latter heights with m are returned as it is but the heights with cm are divided by 100. Not sure how should I separate the heights by cm and m to make it work correctly.
Create a function that takes the vector of values as 'x' and a 'unit' argument that defaults to 'cm'. Extract the numeric part with either parse_number from readr or sub from base R, then build a logical vector (i1) with grepl that flags the strings ending ($) in ' m'. Depending on 'unit', multiply or divide the relevant subset of values, and return the converted numeric vector.
clean_heights <- function(x, unit = 'cm') {
x1 <- readr::parse_number(x)
# // base R with sub
#x1 <- as.numeric(sub("\\s+\\D+", "", x))
i1 <- grepl('\\sm$', x)
if(unit == 'cm') {
x1[i1] <- x1[i1] * 100
} else {
x1[!i1] <- x1[!i1]/100
}
x1
}
Testing:
clean_heights(heights)
#[1] 188 173 192 175 165 191 185
clean_heights(heights, "m")
#[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
This would be done as follows:
my_heights <- function(x, unit = "cm"){
units <- c(mm = 0.1, cm = 1, dm=10, m = 100, km = 100000)
unname(with(read.table(text = x, header = FALSE), V1 * units[V2])) / units[unit]
}
my_heights(heights)
[1] 188 173 192 175 165 191 185
my_heights(heights, "m")
[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
my_heights(heights, "mm")
[1] 1880 1730 1920 1750 1650 1910 1850
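The one-liner in this answer is compact, so here is a step-by-step sketch of the same lookup idea on a shortened input. The intermediate names (`parts`, `to_cm`, `in_cm`, `in_m`) are illustrative and not part of the answer; `stringsAsFactors = FALSE` is spelled out on the assumption that the unit column has to be character for the name lookup to work on older R versions.

```r
heights <- c("188 cm", "1.73 m", "192 cm")

# Split each string into a numeric column V1 and a unit column V2
parts <- read.table(text = heights, header = FALSE, stringsAsFactors = FALSE)

# Factor that converts each unit to centimetres
to_cm <- c(cm = 1, m = 100)

# Look up the factor for every element by its unit name
in_cm <- unname(parts$V1 * to_cm[parts$V2])

# Convert from centimetres to the requested unit, here metres
in_m <- in_cm / to_cm[["m"]]
```

Supporting another unit then only requires adding an entry to the named vector of factors.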
You can try the following user function my_heights
my_heights <- function(h, unit = "cm") {
ifelse(gsub(".*\\s+", "", h) == unit,
1,
100^ifelse(unit == "cm", 1, -1)
) * as.numeric(gsub("\\s.*$", "", h))
}
such that
> my_heights(heights)
[1] 188 173 192 175 165 191 185
> my_heights(heights, "m")
[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
> my_heights(heights, "cm")
[1] 188 173 192 175 165 191 185

Interpolation (akima) omits part of data when x/y contains duplicate elements

I am making a function which receives three vectors, interpolates them using akima and plots them using plot_ly(). Although the general code works, I am encountering issues with scaling of the z-matrix that interp() outputs.
Let me give you an example:
x is a non-NA numeric containing some duplicate values.
y is a non-NA numeric containing some duplicate values.
z is a non-NA continuous vector
Some summary statistics:
> unique(x)
[1] 60 48 36 32 18 24 30 15 12 28 21 19 54 20 16 27 10 39 14 17 9 6 50 8 13
> range(x)
[1] 6 60
> unique(y)
[1] 10.00 10.50 13.50 12.50 14.00 12.00 11.00 9.00 11.50 9.25 13.00 10.25 6.50 6.75 8.25 9.50
[17] 8.00 8.85 9.75 7.90 7.00 8.60 8.75 7.50 8.90 8.50 7.49 7.40 5.50 7.60 7.25 8.35
[33] 6.00 5.00 7.75 7.35 6.30 4.50 5.75 8.40 5.60 5.90 7.74 9.90 6.20 5.80
> range(y)
[1] 4.5 14.0
> head(z)
[1] 2.877272 3.267328 3.175478 3.843326 4.809792 2.827825
> range(z)
[1] 2.316529 28.147808
I implement the baseline function below:
labs = list(x = 'x', y = 'y', z = 'z')
mat = interp(x, y, z, duplicate = 'mean', extrap = T, xo = sort(unique(x)))
plot_ly(x = mat$x, y = mat$y, z = mat$z, type = 'surface') %>%
layout(title = title,
scene = list(xaxis = list(title = labs$x),
yaxis = list(title = labs$y),
zaxis = list(title = labs$z)))
When I run this, the output is the following:
The issue is that a portion of the data is not covered in this picture. For instance, there is a sizeable data portion around x > 50, y < 11 that is omitted by the interpolation (and hence not plotted).
length(x[x > 50])
[1] 304
> length(y[x > 50 & y < 11])
[1] 290
> length(z[x > 50 & y < 11])
[1] 290
I suspected that this has to do with the duplicate x values. Hence, I configured the xo argument in interp() such that:
mat = interp(x, y, z, duplicate = 'mean', xo = sort(unique(x)), decreasing = T)
In which case the previously omitted region is partially plotted. It looks like the following:
Nonetheless, the x and y axes still do not correspond to their respective data ranges (despite data availability). Bottom line: How do I tweak the function such that the surface always extends the full range of x and y?
Best
It turns out that the error arose from plot_ly(). Apparently, the z-matrix cannot be passed straight from interp() to plot_ly(), because its rows and columns end up mapped to the wrong axes in the graph. Hence, the interpolated z-matrix needs to be transformed first.
If you use these two functions together, make sure to transform z as shown below:
mat = interp(x,y,z, duplicate = 'mean')
x = mat$x
y = mat$y
z = matrix(mat$z, nrow = length(mat$y), byrow = TRUE)
plot_ly(x = x, y = y, z = z, type = 'surface')
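To see what the reshaping in this answer does, here is a minimal sketch on a toy grid. It assumes, as the answer implies, that interp() returns z with rows indexed by x and columns by y, while a plotly surface expects rows indexed by y. The toy values and the names `z_answer` and `z_t` are illustrative only.

```r
# Toy grid: 3 x-values and 2 y-values, with z[i, j] belonging to (x[i], y[j])
x <- c(1, 2, 3)
y <- c(10, 20)
z <- outer(x, y, function(xi, yj) xi + yj)   # 3 x 2 matrix

# The reshaping used in the answer: refill row by row into length(y) rows
z_answer <- matrix(z, nrow = length(y), byrow = TRUE)

# A plain transpose produces the same 2 x 3 matrix
z_t <- t(z)
```

Under that assumption, `t(mat$z)` would be an equivalent and shorter way to write the transformation.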

Assign value to dataframe inside parent environment

I have a parent function (fun1) that takes data and a helper function (fun2) that operates on columns from that data. I want to assign the value from that helper function to its matching column in data in fun1. In reality there are lots of little helper functions operating on columns and I want the value changed by fun2 to be what the other helper functions deal with.
How can I take the output from fun2 and assign it to the matching column of data in fun1? I do not want to make that alteration to data in fun1 but want to change data from within fun2.
fun1 <- function(data){
Map(fun2, data, colnames(data))
data[[1]]
}
fun2 <- function(x, colname_x){
x <- x * 2
# my attempt to assign to `data[["mpg"]]`
assign(sprintf("data[[\"%s\"]]", colname_x), value=x, pos = 1)
}
fun1(mtcars[1:3])
Desired output of fun1:
## [1] 42.0 42.0 45.6 42.8 37.4 36.2 28.6 48.8 45.6 38.4 35.6 32.8 34.6 30.4 20.8
## [16] 20.8 29.4 64.8 60.8 67.8 43.0 31.0 30.4 26.6 38.4 54.6 52.0 60.8 31.6 39.4
## [31] 30.0 42.8
EDIT: Also tried:
fun2 <- function(x, colname_x = "mpg"){
x <- x * 2
assign(sprintf("data[[\"%s\"]]", colname_x), value=x, env = parent.frame(3))
}
Edit 2 This works but seems not so clean:
fun2 <- function(x, colname_x = "mpg"){
x <- x * 2
data <- get('data', env=parent.frame(3))
data[[colname_x]] <- x
assign('data', value=data, env = parent.frame(3))
}
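The first two attempts fail because assign() treats its first argument as a literal variable name, so they create a variable called `data[["mpg"]]` instead of modifying a column of data. One possible alternative to counting frames with parent.frame(3) is to pass fun1's environment to the helper explicitly. This is only a sketch: the `env` argument is a hypothetical addition, not something from the question.

```r
fun2 <- function(x, colname_x, env) {
  # Replace the column inside the 'data' object that lives in 'env'
  env$data[[colname_x]] <- x * 2
  invisible(NULL)
}

fun1 <- function(data) {
  env <- environment()   # the frame in which 'data' lives
  # Map() iterates over the original columns; fun2 writes the results back
  Map(fun2, data, colnames(data), MoreArgs = list(env = env))
  data[[1]]
}

res <- fun1(mtcars[1:3])
```

Passing the environment makes the side effect visible in the call and does not depend on how many frames Map() puts between fun1 and fun2.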

K-fold cross-validation using cv.lm()

I am new to R and am trying to do K-fold cross-validation using cv.lm().
Refer: http://www.statmethods.net/stats/regression.html
I am getting an error indicating that the lengths of my variables differ. However, when I checked with length(), the lengths were in fact the same.
Below is a minimal dataset that reproduces the problem:
X Y
277 5.20
285 5.17
297 4.96
308 5.26
308 5.11
263 5.27
278 5.20
283 5.16
268 5.17
250 5.20
275 5.18
274 5.09
312 5.03
294 5.21
279 5.29
300 5.14
293 5.09
298 5.16
290 4.99
273 5.23
289 5.32
279 5.21
326 5.14
293 5.22
256 5.15
291 5.09
283 5.09
284 5.07
298 5.27
269 5.19
Used the below code to do the cross-validation
# K-fold cross-validation, with K=10
sampledata <- read.table("H:/sample.txt", header=TRUE)
y.1 <- sampledata$Y
x.1 <- sampledata$X
fit=lm(y.1 ~ x.1)
library(DAAG)
cv.lm(df=sampledata, fit, m=10)
The error on the terminal,
Error in model.frame.default(formula = form, data = df[rows.in, ], drop.unused.levels = TRUE) :
variable lengths differ (found for 'x.1')
Verification,
> length(x.1)
[1] 30
> length(y.1)
[1] 30
The above confirms that the lengths are the same.
> str(x.1)
int [1:30] 277 285 297 308 308 263 278 283 268 250 ...
> str(y.1)
num [1:30] 5.2 5.17 4.96 5.26 5.11 5.27 5.2 5.16 5.17 5.2 ...
> is(y.1)
[1] "numeric" "vector"
> is(x.1)
[1] "integer" "numeric" "vector" "data.frameRowLabels"
A further check on the data, shown above, indicates that one variable is integer and the other is numeric. But even when I convert numeric to integer or integer to numeric, the same error appears, still complaining about variable lengths.
Can you guide me on what I should do to correct the error?
I have been unable to resolve this for two days and did not find any good leads by searching the internet.
Additional related query:
I see the fit works if we use the column headers of the data set in the formula,
fit=lm(Y ~ X, data=sampledata)
a) What is the difference between the above syntax and
fit1=lm(sampledata$Y ~ sampledata$X)
I thought they were the same. In the code below,
#fit 1 works
fit1=lm(Y ~ X, data=sampledata)
cv.lm(df=sampledata, fit1, m=10)
#fit 2 does not work
fit2=lm(sampledata$Y ~ sampledata$X)
cv.lm(df=sampledata, fit2, m=10)
The problem seems to be at df=sampledata, as the header "sampledata$Y" does not exist; only Y exists. I tried changing the cv.lm call to the following, but it does not work either:
cv.lm(fit2, m=10)
b) If we want to transform the variables, how can we use them in cv.lm()? For example:
y.1 <- (sampledata$Y/sampledata$X)
x.1 <- (1/sampledata$X)
#fit 4 problem
fit4=lm(y.1 ~ x.1)
cv.lm(df=sampledata, fit4, m=10)
Is there a way I could reference y.1 and x.1 instead of the header Y ~ X in the function?
Thanks.
I'm not sure exactly why this happens, but I noticed that you do not specify the data argument for lm(), so that was my first guess.
fit=lm(Y ~ X, data=sampledata)
Since the error is gone, this may be a sufficient answer.
UPD: The reason for the error is that y.1 and x.1 do not exist in sampledata, which is passed as the df argument to cv.lm, so the formula y.1 ~ x.1 cannot be resolved against that data inside cv.lm.
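For part b) of the question, one way to follow this reasoning is to store the transformed variables as columns of sampledata, so that the formula can be resolved against the data frame. The sketch below uses only the first six rows of the question's data and stops at the lm() fit. The cv.lm() call is left as a comment because it was not run here, and the name of its data argument (df or data) is assumed to depend on the DAAG version.

```r
# First rows of the question's data
sampledata <- data.frame(
  X = c(277, 285, 297, 308, 308, 263),
  Y = c(5.20, 5.17, 4.96, 5.26, 5.11, 5.27)
)

# Store the transformed variables as columns instead of free-standing vectors
sampledata$y.1 <- sampledata$Y / sampledata$X
sampledata$x.1 <- 1 / sampledata$X

# The formula now refers to columns of the data frame
fit4 <- lm(y.1 ~ x.1, data = sampledata)

# Every variable in the formula can be found in the data frame
vars_found <- all(all.vars(formula(fit4)) %in% names(sampledata))

# Assumed call on the full 30-row data, not run here:
# library(DAAG); cv.lm(sampledata, fit4, m = 10)
```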
