Cox proportional hazards model in R

I am trying to run a Cox proportional hazards model on data with 4 groups. I am using this code:
time_Allo_NHL<- c(28,32,49,84,357,933,1078,1183,1560,2114,2144)
censor_Allo_NHL<- c(rep(1,5), rep(0,6))
time_Auto_NHL<- c(42,53,57,63,81,140,176,210,252,476,524,1037)
censor_Auto_NHL<- c(rep(1,7), rep(0,1), rep(1,1), rep(0,1), rep(1,1), rep(0,1))
time_Allo_HOD<- c(2,4,72,77,79)
censor_Allo_HOD<- c(rep(1,5))
time_Auto_HOD<- c(30,36,41,52,62,108,132,180,307,406,446,484,748,1290,1345)
censor_Auto_HOD<- c(rep(1,7), rep(0,8))
myData <- data.frame(time=c(time_Allo_NHL, time_Auto_NHL, time_Allo_HOD, time_Auto_HOD),
censor=c(censor_Allo_NHL, censor_Auto_NHL, censor_Allo_HOD, censor_Auto_HOD),
group= rep(1:4,), each= )
str(myData)
The problem is that each group has a different number of observations. What should I modify in the code:
myData <- data.frame(time=c(time_Allo_NHL, time_Auto_NHL, time_Allo_HOD, time_Auto_HOD),
censor=c(censor_Allo_NHL, censor_Auto_NHL, censor_Allo_HOD,
censor_Auto_HOD), group= rep(1:4,), each= )
instead of writing each=#, so that I can run the code properly and go on to fit the Cox proportional hazards model?
I then attempted to fit a Cox proportional hazards model using the following code:
library(survival)
for (i in 1:43) {
  if (myData$group[i] == 2) myData$Z1[i] <- 1
  else myData$Z1[i] <- 0
}
for (i in 1:43) {
  if (myData$group[i] == 3) myData$Z2[i] <- 1
  else myData$Z2[i] <- 0
}
for (i in 1:43) {
  if (myData$group[i] == 4) myData$Z3[i] <- 1
  else myData$Z3[i] <- 0
}
myData
Coxfit<-coxph(Surv(time,censor)~Z1+Z2+Z3, data = myData)
summary(Coxfit)
This is all I got. There are no values!!
Next, I want to test for an interaction between type of transplant and disease type using main effects and interaction terms.
The code I'm going to use:
n<-length(myData$time)
n
for (i in 1:n) {
  if (myData$(here?)[i] == 2) myData$W1[i] <- 1
  else myData$W1[i] <- 0
}
for (i in 1:n) {
  if (myData$(here?)[i] == 2) myData$W2[i] <- 1
  else myData$W2[i] <- 0
}
myData
Coxfit.W<-coxph(Surv(time,censor)~W1+W2+W1*W2, data = myData)
summary(Coxfit.W)
I'm not sure what should be written in place of myData$(here?) in the code above.

This looks like the bone marrow transplant study at Ohio State University.
As you mentioned, each group has a different number of observations. I would create a separate data frame for each group, with a column indicating which group the observations belong to, and then bind the rows together at the end. For example, df_Allo_NHL would have "Allo NHL" as the group for all of its observations:
df_Allo_NHL <- data.frame(group = "Allo NHL",
                          time = c(28,32,49,84,357,933,1078,1183,1560,2114,2144),
                          censor = c(rep(1,5), rep(0,6)))
Or, reusing the two vectors you already have:
df_Allo_NHL <- data.frame(group = "Allo NHL", time = time_Allo_NHL, censor = censor_Allo_NHL)
Then once you have your 4 data frames, you can combine them. One way to do this is to put all the data frames in a list and use Reduce with rbind. The final result is in long form, ready for Cox proportional hazards analysis, and you will have group available to include. (Edit: Z1 and Z2 added from the table for the model.)
time_Allo_NHL <- c(28,32,49,84,357,933,1078,1183,1560,2114,2144)
censor_Allo_NHL <- c(rep(1,5), rep(0,6))
df_Allo_NHL <- data.frame(group = "Allo NHL",
                          time = time_Allo_NHL,
                          censor = censor_Allo_NHL,
                          Z1 = c(90,30,40,60,70,90,100,90,80,80,90),
                          Z2 = c(24,7,8,10,42,9,16,16,20,27,5))
time_Auto_NHL <- c(42,53,57,63,81,140,176,210,252,476,524,1037)
censor_Auto_NHL <- c(rep(1,7), rep(0,1), rep(1,1), rep(0,1), rep(1,1), rep(0,1))
df_Auto_NHL <- data.frame(group = "Auto NHL",
                          time = time_Auto_NHL,
                          censor = censor_Auto_NHL,
                          Z1 = c(80,90,30,60,50,100,80,90,90,90,90,90),
                          Z2 = c(19,17,9,13,12,11,38,16,21,24,39,84))
time_Allo_HOD <- c(2,4,72,77,79)
censor_Allo_HOD <- c(rep(1,5))
df_Allo_HOD <- data.frame(group = "Allo HOD",
                          time = time_Allo_HOD,
                          censor = censor_Allo_HOD,
                          Z1 = c(20,50,80,60,70),
                          Z2 = c(34,28,59,102,71))
time_Auto_HOD <- c(30,36,41,52,62,108,132,180,307,406,446,484,748,1290,1345)
censor_Auto_HOD <- c(rep(1,7), rep(0,8))
df_Auto_HOD <- data.frame(group = "Auto HOD",
                          time = time_Auto_HOD,
                          censor = censor_Auto_HOD,
                          Z1 = c(90,80,70,60,90,70,60,100,100,100,100,90,90,90,80),
                          Z2 = c(73,61,34,18,40,65,17,61,24,48,52,84,171,20,98))
myData <- Reduce(rbind, list(df_Allo_NHL, df_Auto_NHL, df_Allo_HOD, df_Auto_HOD))
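As an aside, the rep() call from the question can also be fixed directly: rep() takes a times argument that accepts one count per group, which handles the unequal group sizes. A minimal sketch using the vectors above (the counts here are 11, 12, 5, and 15):
group <- rep(1:4, times = c(length(time_Allo_NHL), length(time_Auto_NHL),
                            length(time_Allo_HOD), length(time_Auto_HOD)))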
Edit
If you go ahead and also add Z1 (Karnofsky score) and Z2 (waiting time from diagnosis to transplant), you can fit the Cox proportional hazards survival model as below. group is already a factor (with data.frame's stringsAsFactors = TRUE, the pre-R-4.0 default; in newer R, use factor(group, levels = unique(group)) to keep this level order), and the first level, Allo NHL, is the reference category by default.
library(survival)
Coxfit<-coxph(Surv(time,censor)~group+Z1+Z2, data = myData)
summary(Coxfit)
Output
Call:
coxph(formula = Surv(time, censor) ~ group + Z1 + Z2, data = myData)
n= 43, number of events= 26
coef exp(coef) se(coef) z Pr(>|z|)
groupAuto NHL 0.77357 2.16748 0.58631 1.319 0.18704
groupAllo HOD 2.73673 15.43639 0.94081 2.909 0.00363 **
groupAuto HOD 1.06293 2.89485 0.63494 1.674 0.09412 .
Z1 -0.05052 0.95074 0.01222 -4.135 3.55e-05 ***
Z2 -0.01660 0.98354 0.01002 -1.656 0.09769 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
groupAuto NHL 2.1675 0.46136 0.6869 6.8395
groupAllo HOD 15.4364 0.06478 2.4419 97.5818
groupAuto HOD 2.8948 0.34544 0.8340 10.0481
Z1 0.9507 1.05181 0.9282 0.9738
Z2 0.9835 1.01674 0.9644 1.0030
Concordance= 0.783 (se = 0.059 )
Likelihood ratio test= 32.48 on 5 df, p=5e-06
Wald test = 28.48 on 5 df, p=3e-05
Score (logrank) test = 39.45 on 5 df, p=2e-07
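For the interaction question at the end of the post, you likewise don't need hand-built W1/W2 dummies: derive a transplant-type factor and a disease-type factor from group and let the formula build the main effects and the interaction. A minimal sketch, assuming myData as constructed above (transplant and disease are column names I am introducing here):
# Derive the two factors from the group labels
myData$transplant <- factor(ifelse(grepl("Allo", myData$group), "Allo", "Auto"))
myData$disease <- factor(ifelse(grepl("NHL", myData$group), "NHL", "HOD"))
# transplant * disease expands to both main effects plus their interaction
Coxfit.W <- coxph(Surv(time, censor) ~ transplant * disease, data = myData)
summary(Coxfit.W)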
Data
group time censor Z1 Z2
1 Allo NHL 28 1 90 24
2 Allo NHL 32 1 30 7
3 Allo NHL 49 1 40 8
4 Allo NHL 84 1 60 10
5 Allo NHL 357 1 70 42
6 Allo NHL 933 0 90 9
7 Allo NHL 1078 0 100 16
8 Allo NHL 1183 0 90 16
9 Allo NHL 1560 0 80 20
10 Allo NHL 2114 0 80 27
11 Allo NHL 2144 0 90 5
12 Auto NHL 42 1 80 19
13 Auto NHL 53 1 90 17
14 Auto NHL 57 1 30 9
15 Auto NHL 63 1 60 13
16 Auto NHL 81 1 50 12
17 Auto NHL 140 1 100 11
18 Auto NHL 176 1 80 38
19 Auto NHL 210 0 90 16
20 Auto NHL 252 1 90 21
21 Auto NHL 476 0 90 24
22 Auto NHL 524 1 90 39
23 Auto NHL 1037 0 90 84
24 Allo HOD 2 1 20 34
25 Allo HOD 4 1 50 28
26 Allo HOD 72 1 80 59
27 Allo HOD 77 1 60 102
28 Allo HOD 79 1 70 71
29 Auto HOD 30 1 90 73
30 Auto HOD 36 1 80 61
31 Auto HOD 41 1 70 34
32 Auto HOD 52 1 60 18
33 Auto HOD 62 1 90 40
34 Auto HOD 108 1 70 65
35 Auto HOD 132 1 60 17
36 Auto HOD 180 0 100 61
37 Auto HOD 307 0 100 24
38 Auto HOD 406 0 100 48
39 Auto HOD 446 0 100 52
40 Auto HOD 484 0 90 84
41 Auto HOD 748 0 90 171
42 Auto HOD 1290 0 90 20
43 Auto HOD 1345 0 80 98

Related

Using Linear Regression and Model Selection Techniques to Predict Y based on Xs in R

I attempted to utilize the principles of linear regression and feature selection to predict a target variable (Y) based on a set of predictor variables (X1, X2, X3, X4, X5, X6, X7, and X8). I started by implementing a full model, which included all predictor variables, and then used stepwise regression to select the most relevant variables for my model through the use of backward, forward, and both selection methods. I then compared the performance of my model using AIC, BIC, and root mean squared error (RMSE) to determine the best model for my data. Finally, I used this best model to predict the value of Y for a specific set of predictor variable values and compared it to the actual value to assess the accuracy of my model. However, I encountered a problem in my data where the value of Y in the 39th semester was missing, so I couldn't evaluate the prediction results.
# Dataset: Classeur2.xlsx
setwd("D:/third year/alaoui")
# Load the data
library(readxl)
data2 <- read_excel("D:/third year/alaoui/tpnote/Classeur2.xlsx")
data <- data2[-c(39),]
View(data)
# Descriptive analysis
summary(data)
str(data)
# Correlation analysis
#install.packages("psych")
library(psych)
# Check that all values are numeric; if not, convert them
num_cols <- sapply(data, is.numeric)
data[, !num_cols] <- lapply(data[, !num_cols], as.numeric)
#
matrice_correlation <- cor(data[,c("Y","X1","X2","X4","X5...5","X5...6","X6","X7","X8")])
KMO(matrice_correlation)
cortest.bartlett(matrice_correlation, n = nrow(data))
# Principal component analysis
library("FactoMineR")
library("factoextra")
library("corrplot")
p=PCA(data,graph=FALSE)
p
pca=PCA(data,ncp=2)
print(pca)
eig.val <- get_eigenvalue(pca)
eig.val
fviz_eig(pca)
fviz_pca_var(pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
# Linear regression
model <- lm(Y ~ X1 + X2 + X4 + X5...5 + X5...6 + X6 + X7 + X8, data = data)
summary(model)
# Checking the assumptions of the linear regression
# 1. Linearity
par(mfrow = c(2, 2))
plot(model)
# 2. Homoscedasticity
library(car)
ncvTest(model)
# 3. Normality of residuals
library(lmtest)
library(tseries)
residuals <- resid(model)
qqnorm(residuals)
qqline(residuals)
shapiro.test(residuals)
# 4. Independence of residuals
plot(residuals ~ fitted(model))
durbinWatsonTest(model)
# Variable selection
# Fit the full model
full_model <- lm(Y ~ X1 + X2 + X4 + X5...5 + X5...6 + X6 + X7 + X8, data = data)
# Fit the null model (constant only)
null_model <- lm(Y ~ 1, data = data)
# Perform backward stepwise selection
backward_model <- step(full_model, direction = "backward")
# Perform forward stepwise selection
forward_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
# Perform both stepwise selection
both_model <- step(null_model, scope = list(upper = full_model), direction = "both")
# Compare AIC, BIC and RMSE for each model
AIC_full <- AIC(full_model)
AIC_backward <- AIC(backward_model)
AIC_forward <- AIC(forward_model)
AIC_both <- AIC(both_model)
BIC_full <- BIC(full_model)
BIC_backward <- BIC(backward_model)
BIC_forward <- BIC(forward_model)
BIC_both <- BIC(both_model)
RMSE_full <- sqrt(mean((resid(full_model))^2))
RMSE_backward <- sqrt(mean((resid(backward_model))^2))
RMSE_forward <- sqrt(mean((resid(forward_model))^2))
RMSE_both <- sqrt(mean((resid(both_model))^2))
#Print the model selection criteria for each model
cat("Full model:")
cat("\tAIC:", AIC_full, "\tBIC:", BIC_full, "\tRMSE:", RMSE_full, "\n")
cat("Backward model:")
cat("\tAIC:", AIC_backward, "\tBIC:", BIC_backward, "\tRMSE:", RMSE_backward, "\n")
cat("Forward model:")
cat("\tAIC:", AIC_forward, "\tBIC:", BIC_forward, "\tRMSE:", RMSE_forward, "\n")
cat("Both model:")
cat("\tAIC:", AIC_both, "\tBIC:", BIC_both, "\tRMSE:", RMSE_both, "\n")
#Select the model with the lowest AIC, BIC, and RMSE
model_names <- c("Full Model", "Backward Model", "Forward Model", "Both Model")
best_model <- model_names[which.min(c(AIC_full, AIC_backward, AIC_forward, AIC_both))]
print(best_model)
# predict the value of Y in the 39th semester
predicted_Y <- predict(backward_model, newdata = data.frame(X1 = 500, X2 = 100, X4 = 83, X5...5 = 30, X5...6= 50, X6 = 90, X7 = 300, X8 = 200))
print(predicted_Y)
# To make sure that it's correct
#Calculate mean squared error
MSE <- mean((predicted_Y - data$Y[39])^2)
#Calculate root mean squared error
RMSE <- sqrt(MSE)
#Calculate R-squared value
R_squared <- summary(backward_model)$r.squared
#Print the results
print(paste("Predicted value of Y:", predicted_Y))
print(paste("Mean Squared Error:", MSE))
print(paste("Root Mean Squared Error:", RMSE))
print(paste("R-Squared value:", R_squared))
#Compare the predicted value with the actual value
print(paste("Actual value of Y:", data$Y[39]))
print(paste("Difference:", abs(predicted_Y - data$Y[39])))
#Plot the model
par(xpd = TRUE)
plot(backward_model,which=1)
abline(backward_model, col = "blue") # note: abline() only uses the first two coefficients of a multi-predictor fit
#Plot the residuals
plot(backward_model, which=2)
#Normality test on residuals
shapiro.test(residuals(backward_model))
#Homoscedasticity test on residuals
ncvTest(backward_model)
# Durbin-Watson test for autocorrelation of residuals
dwtest(backward_model)
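One note on the final comparison: row 39 was removed from data, so data$Y[39] is NA and the MSE/RMSE lines can never evaluate. The predictors for the 39th semester are still in data2, so a prediction can at least be produced from there; a minimal sketch, assuming data2 contains the predictor columns used by backward_model:
# Row 39 was dropped from `data`; its predictors remain in data2, but its
# actual Y is missing ('.'), so no prediction error can be computed for it
predicted_Y39 <- predict(backward_model, newdata = data2[39, ])
print(predicted_Y39)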
This is my file.csv:
X4 X5 X5 X6 X7 X8 Y
56 12 50 77 229 98 5540
59 9 17 89 177 225 5439
57 29 89 51 166 263 4290
58 13 107 40 258 321 5502
59 13 143 52 209 407 4872
60 11 61 21 180 247 4708
60 25 -30 40 213 328 4627
60 21 -45 32 201 298 4110
63 8 -28 12 176 218 4123
62 11 76 68 175 410 4842
65 22 144 52 253 93 5741
65 24 113 77 208 307 5094
64 14 128 96 195 107 5383
66 15 10 48 154 305 4888
67 22 -25 27 181 60 4033
67 23 117 73 220 239 4942
66 13 120 62 235 141 5313
68 8 122 25 258 291 5140
69 27 71 74 196 414 5397
71 18 4 63 279 206 5149
69 8 47 29 207 80 5151
70 10 8 91 213 429 4989
73 27 128 74 296 273 5927
73 16 -50 16 245 309 4704
73 32 100 43 276 280 5366
75 20 -40 41 211 315 4630
73 15 68 93 283 212 5712
74 11 88 83 218 118 5095
74 27 27 75 307 345 6124
77 20 59 88 211 141 4787
79 35 142 74 270 83 5036
77 23 126 21 328 398 5288
78 36 30 26 258 124 4647
78 22 18 95 233 118 5316
81 20 42 93 324 161 6180
80 16 -22 50 267 405 4801
81 35 148 83 257 111 5512
82 27 -18 91 267 170 5272
83 30 50 90 300 200 .

fit a normal distribution to grouped data, giving expected frequencies

I have a frequency distribution of observations, grouped into counts within class intervals.
I want to fit a normal (or other continuous) distribution, and find the expected frequencies in each interval according to that distribution.
For example, suppose the following, where I want to calculate another column, expected giving the
expected number of soldiers with chest circumferences in the interval given by chest, where these
are assumed to be centered on the nominal value. E.g., 35 = 34.5 <= y < 35.5. One analysis I've seen gives the expected frequency in this cell as 72.5 vs. the observed 81.
> data(ChestSizes, package="HistData")
>
> ChestSizes
chest count
1 33 3
2 34 18
3 35 81
4 36 185
5 37 420
6 38 749
7 39 1073
8 40 1079
9 41 934
10 42 658
11 43 370
12 44 92
13 45 50
14 46 21
15 47 4
16 48 1
>
> # ungroup to a vector of values
> chests <- vcdExtra::expand.dft(ChestSizes, freq="count")
There are quite a number of variations of this question, most of which relate to plotting the normal density on top of a histogram, scaled to represent counts not density. But none explicitly show the calculation of the expected frequencies. One close question is R: add normal fits to grouped histograms in ggplot2
I can perfectly well do the standard plot (below), but for other things, like a Chi-square test or a vcd::rootogram plot, I need the expected frequencies in the same class intervals.
library(ggplot2)
bw <- 1
n_obs <- nrow(chests)
xbar <- mean(chests$chest)
std <- sd(chests$chest)
plt <-
  ggplot(chests, aes(chest)) +
  geom_histogram(color = "black", fill = "lightblue", binwidth = bw) +
  stat_function(fun = function(x)
    dnorm(x, mean = xbar, sd = std) * bw * n_obs,
    color = "darkred", size = 1)
plt
Here is how you could calculate the expected frequencies for each group, assuming normality:
xbar <- with(ChestSizes, weighted.mean(chest, count))
sdx <- with(ChestSizes, sd(rep(chest, count)))
transform(ChestSizes, Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))
chest count Expected
1 33 3 4.7600583
2 34 18 20.8822328
3 35 81 72.5129162
4 36 185 199.3338028
5 37 420 433.8292832
6 38 749 747.5926687
7 39 1073 1020.1058521
8 40 1079 1102.2356155
9 41 934 943.0970605
10 42 658 638.9745241
11 43 370 342.7971793
12 44 92 145.6089948
13 45 50 48.9662992
14 46 21 13.0351612
15 47 4 2.7465640
16 48 1 0.4579888
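With the Expected column in hand, the chi-square goodness-of-fit test mentioned in the question follows directly. A minimal sketch, computing the Pearson statistic by hand and deducting the two estimated parameters from the degrees of freedom:
obs <- ChestSizes$count
expected <- diff(pnorm(c(32, ChestSizes$chest) + .5, xbar, sdx)) * sum(obs)
X2 <- sum((obs - expected)^2 / expected)  # Pearson chi-square statistic
pchisq(X2, df = length(obs) - 3, lower.tail = FALSE)  # df = k - 1 - 2 estimated parameters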

How to resample and remodel n times by vectorization?

Here's my for-loop version of resampling and refitting the model:
B <- 999
n <- nrow(butterfly)
estMat <- matrix(NA, B+1, 2)
estMat[B+1,] <- model$coef
for (i in 1:B) {
resample <- butterfly[sample(1:n, n, replace = TRUE),]
re.model <- lm(Hk ~ inv.alt, resample)
estMat[i,] <- re.model$coef
}
I tried to avoid the for loop:
B <- 999
n <- nrow(butterfly)
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
re.model.coef <- sapply(re.model,coef)
estMat <- cbind(re.model.coef, model$coef)
It works, but it doesn't improve efficiency. Is there any way I can vectorize this?
Sorry, I'm not quite familiar with StackOverflow. Here's the dataset butterfly:
colony alt precip max.temp min.temp Hk
pd+ss 0.5 58 97 16 98
sb 0.8 20 92 32 36
wsb 0.57 28 98 26 72
jrc+jrh 0.55 28 98 26 67
sj 0.38 15 99 28 82
cr 0.93 21 99 28 72
mi 0.48 24 101 27 65
uo+lo 0.63 10 101 27 1
dp 1.5 19 99 23 40
pz 1.75 22 101 27 39
mc 2 58 100 18 9
hh 4.2 36 95 13 19
if 2.5 34 102 16 42
af 2 21 105 20 37
sl 6.5 40 83 0 16
gh 7.85 42 84 5 4
ep 8.95 57 79 -7 1
gl 10.5 50 81 -12 4
(Assuming butterfly$inv.alt <- 1/butterfly$alt)
You get the error because resample is not a list of resampled data.frames; you can obtain one by passing simplify = FALSE:
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
Then the following should work:
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
To extract coefficients from a list of models, re.model$coef does not work. The correct paths to the coefficients are re.model[[1]]$coef, re.model[[2]]$coef, .... You can get all of them with the following code:
re.model.coef <- sapply(re.model, coef)
Then you can combine them with the observed coefficients:
estMat <- cbind(re.model.coef, model$coef)
In fact, you can put all of them into replicate:
re.model.coef <- replicate(B, {
bf.rs <- butterfly[sample(1:n, replace = TRUE),]
coef(lm(formula = Hk ~ inv.alt, data = bf.rs))
})
estMat <- cbind(re.model.coef, model$coef)
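If the goal is speed rather than style, note that replicate() is itself a loop; the dominant cost is lm()'s formula handling, which can be bypassed with base R's .lm.fit(). A sketch, under the same assumption that butterfly$inv.alt exists:
X <- cbind(1, butterfly$inv.alt)  # design matrix: intercept + predictor
re.model.coef <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  .lm.fit(X[idx, , drop = FALSE], butterfly$Hk[idx])$coefficients
})
estMat <- cbind(re.model.coef, model$coef)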

Function with a for loop to create a column with values 1:n conditioned by intervals matched by another column

I have a data frame like the following:
my_df=data.frame(x=runif(100, min = 0,max = 60),
y=runif(100, min = 0,max = 60)) #x and y in cm
With this I need a new column with values from 1 to 36 that match x and y in 10 cm steps. For example, if 0<=x<=10 & 0<=y<=10, put 1; if 10<=x<=20 & 0<=y<=10, put 2; and so on up to 6. Then for 0<=x<=10 & 10<=y<=20 start with 7, and so on up to 12, etc. I tried to make a function with an if, repeating the interval for x 6 times and increasing the interval for y by 10 on every iteration. Here is the function:
#my miscarried function 'zones'
zones = function(x, y) {
i=vector(length = 6)
n=vector(length = 6)
z=vector(length = 36)
i[1]=0
z[1]=0
n[1]=1
for (t in 1:6) {
if (0<=x & x<10 & i[t]<=y & y<i[t]+10) { z[t] = n[t]} else
if (10<=x & x<20 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+1} else
if (20<=x & x<30 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+2} else
if (30<=x & x<40 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+3} else
if (40<=x & x<50 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+4}else
if (50<=x & x<=60 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+5}
else {i[t+1]=i[t]+10
n[t+1]=n[t]+6}
}
return(z)
}
>xy$z=zones(x=xy$x,y=xy$y)
and I got
There were 31 warnings (use warnings() to see them)
>xy$z
[1] 0 0 0 0 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Please, help me before I die alone!
I think this does the trick.
a <- cut(my_df$x, (0:6) * 10)
b <- cut(my_df$y, (0:6) * 10)
z <- interaction(a, b)
levels(z)
[1] "(0,10].(0,10]" "(10,20].(0,10]" "(20,30].(0,10]" "(30,40].(0,10]"
[5] "(40,50].(0,10]" "(50,60].(0,10]" "(0,10].(10,20]" "(10,20].(10,20]"
[9] "(20,30].(10,20]" "(30,40].(10,20]" "(40,50].(10,20]" "(50,60].(10,20]"
[13] "(0,10].(20,30]" "(10,20].(20,30]" "(20,30].(20,30]" "(30,40].(20,30]"
[17] "(40,50].(20,30]" "(50,60].(20,30]" "(0,10].(30,40]" "(10,20].(30,40]"
[21] "(20,30].(30,40]" "(30,40].(30,40]" "(40,50].(30,40]" "(50,60].(30,40]"
[25] "(0,10].(40,50]" "(10,20].(40,50]" "(20,30].(40,50]" "(30,40].(40,50]"
[29] "(40,50].(40,50]" "(50,60].(40,50]" "(0,10].(50,60]" "(10,20].(50,60]"
[33] "(20,30].(50,60]" "(30,40].(50,60]" "(40,50].(50,60]" "(50,60].(50,60]"
If these levels aren't to your taste, relabel them as below:
levels(z) <- 1:36
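Since interaction() orders its levels with the first factor varying fastest, which is exactly the question's numbering (x bins 1 to 6 within each y band), the integer codes can also be taken directly:
my_df$z <- as.integer(z) # 1..36, x bins running fastest within each y band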
Is this what you're after? The resulting numbers are in column res:
# Get bin index for x values and y values
my_df$bin1 <- as.numeric(cut(my_df$x, breaks = seq(0, max(my_df$x) + 10, by = 10)))
my_df$bin2 <- as.numeric(cut(my_df$y, breaks = seq(0, max(my_df$y) + 10, by = 10)))
# Combine bin indices into a single zone number 1:36 (x bins run fastest);
# multiplying the indices would not work, since products collide (e.g. 2*3 == 6*1)
my_df$res <- my_df$bin1 + (my_df$bin2 - 1) * 6
> head(my_df)
          x         y bin1 bin2 res
1 49.887499 47.302849    5    5  29
2 43.169773 50.931357    5    6  35
3 10.626466 43.673533    2    5  26
4 43.401454  3.397009    5    1   5
5  7.080386 22.870539    1    3  13
6 39.094724 24.672907    4    3  16
I've broken down the steps for illustration purposes; you probably don't want to keep the intermediate columns bin1 and bin2.
We probably need a table showing the relationship between x, y, and z. After that, we can define a function to do the join.
The solution is related to and inspired by this post (R dplyr join by range or virtual column). You may also find the other solutions there useful.
# Set seed for reproducibility
set.seed(1)
# Create example data frame
my_df <- data.frame(x=runif(100, min = 0,max = 60),
y=runif(100, min = 0,max = 60))
# Load the dplyr package
library(dplyr)
# Create a table to show the relationship between x, y, and z
r <- expand.grid(x_from = seq(0, 50, 10), y_from = seq(0, 50, 10)) %>%
mutate(x_to = x_from + 10, y_to = y_from + 10, z = 1:n())
# Define a function for dynamic join
dynamic_join <- function(d, r){
if (!("z" %in% colnames(d))){
d[["z"]] <- NA_integer_
}
d <- d %>%
mutate(z = ifelse(x >= r$x_from & x < r$x_to & y >= r$y_from & y < r$y_to,
r$z, z))
return(d)
}
re_dynamic_join <- function(d, r){
r_list <- split(r, r$z)
for (i in 1:length(r_list)){
d <- dynamic_join(d, r_list[[i]])
}
return(d)
}
# Apply the function
re_dynamic_join(my_df, r)
x y z
1 15.930520 39.2834357 20
2 22.327434 21.1918363 15
3 34.371202 16.2156088 10
4 54.492467 59.5610437 36
5 12.100916 38.0095959 20
6 53.903381 12.7924881 12
7 56.680516 7.7623409 6
8 39.647868 28.6870821 16
9 37.746843 55.4444682 34
10 3.707176 35.9256580 19
11 12.358474 58.5702417 32
12 10.593405 43.9075507 26
13 41.221371 21.4036147 17
14 23.046223 25.8884214 15
15 46.190485 8.8926936 5
16 29.861955 0.7846545 3
17 43.057110 42.9339640 29
18 59.514366 6.1910541 6
19 22.802111 26.7770609 15
20 46.646713 38.4060627 23
21 56.082314 59.5103172 36
22 12.728551 29.7356147 14
23 39.100426 29.0609715 16
24 7.533306 10.4065401 7
25 16.033240 45.2892567 26
26 23.166846 27.2337294 15
27 0.803420 30.6701870 19
28 22.943277 12.4527068 9
29 52.181451 13.7194886 12
30 20.420940 35.7427198 21
31 28.924807 34.4923319 21
32 35.973950 4.6238628 4
33 29.612478 2.1324348 3
34 11.173056 38.5677295 20
35 49.642399 55.7169120 35
36 40.108004 35.8855453 23
37 47.654392 33.6540449 23
38 6.476618 31.5616634 19
39 43.422657 59.1057134 35
40 24.676466 30.4585093 21
41 49.256778 40.9672847 29
42 38.823612 36.0924731 22
43 46.975966 14.3321207 11
44 33.182179 15.4899556 10
45 31.783175 43.7585774 28
46 47.361374 27.1542499 17
47 1.399872 10.5076061 7
48 28.633804 44.8018962 27
49 43.938824 6.2992584 5
50 41.563893 51.8726969 35
51 28.657177 36.8786983 21
52 51.672569 33.4295723 24
53 26.285826 19.7266391 9
54 14.687837 27.1878867 14
55 4.240743 30.0264584 19
56 5.967970 10.8519817 7
57 18.976302 31.7778362 20
58 31.118056 4.5165447 4
59 39.720305 16.6653560 10
60 24.409811 12.7619712 9
61 54.772555 17.0874289 12
62 17.616202 53.7056462 32
63 27.543944 26.7741194 15
64 19.943680 46.7990934 26
65 39.052228 52.8371421 34
66 15.481007 24.7874526 14
67 28.712715 3.8285088 3
68 45.978640 20.1292495 17
69 5.054815 43.4235568 25
70 52.519280 20.2569200 18
71 20.344376 37.8248473 21
72 50.366421 50.4368732 36
73 20.801009 51.3678999 33
74 20.026496 23.4815569 15
75 28.581075 22.8296331 15
76 53.531900 53.7267256 36
77 51.860368 38.6589458 24
78 23.399373 44.4647189 27
79 46.639242 36.3182068 23
80 57.637080 54.1848967 36
81 26.079569 17.6238093 9
82 42.750881 11.4756066 11
83 23.999662 53.1870566 33
84 19.521129 30.2003691 20
85 45.425229 52.6234526 35
86 12.161535 11.3516173 8
87 42.667273 45.4861831 29
88 7.301515 43.4699336 25
89 14.729311 56.6234891 32
90 8.598263 32.8587952 19
91 14.377765 42.7046321 26
92 3.536063 23.3343060 13
93 38.537296 6.0523876 4
94 52.576153 55.6381253 36
95 46.734881 16.9939500 11
96 47.838530 35.4343895 23
97 27.316467 6.6216363 3
98 24.605045 50.4304219 33
99 48.652215 19.0778211 11
100 36.295997 46.9710802 28
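As a design note, the loop over split(r, r$z) exists because older dplyr joins were equality-only. dplyr 1.1.0 added non-equi joins via join_by(), which collapses the whole procedure into one call; a sketch, assuming my_df and r as defined above:
# Requires dplyr >= 1.1.0: range conditions go straight into the join
my_df_z <- left_join(my_df, r,
                     by = join_by(x >= x_from, x < x_to,
                                  y >= y_from, y < y_to)) %>%
  select(x, y, z)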

Given data points and y value, give x value

Given a set of (x, y) coordinates, how can I solve for x given y? If you were to plot the coordinates, they would be non-linear, but pretty close to exponential. I tried approx(), but it is way off. Here is example data. In this scenario, how could I solve for the x at which y == 50?
V1 V3
1 5.35 11.7906
2 10.70 15.0451
3 16.05 19.4243
4 21.40 20.7885
5 26.75 22.0584
6 32.10 25.4367
7 37.45 28.6701
8 42.80 30.7500
9 48.15 34.5084
10 53.50 37.0096
11 58.85 39.3423
12 64.20 41.5023
13 69.55 43.4599
14 74.90 44.7299
15 80.25 46.5738
16 85.60 47.7548
17 90.95 49.9749
18 96.30 51.0331
19 101.65 52.0207
20 107.00 52.9781
21 112.35 53.8730
22 117.70 54.2907
23 123.05 56.3025
24 128.40 56.6949
25 133.75 57.0830
26 139.10 58.5051
27 144.45 59.1440
28 149.80 60.0687
29 155.15 60.6627
30 160.50 61.2313
31 165.85 61.7748
32 171.20 62.5587
33 176.55 63.2684
34 181.90 63.7085
35 187.25 64.0788
36 192.60 64.5807
37 197.95 65.2233
38 203.30 65.5331
39 208.65 66.1200
40 214.00 66.6208
41 219.35 67.1952
42 224.70 67.5270
43 230.05 68.0175
44 235.40 68.3869
45 240.75 68.7485
46 246.10 69.1878
47 251.45 69.3980
48 256.80 69.5899
49 262.15 69.7382
50 267.50 69.7693
51 272.85 69.7693
52 278.20 69.7693
53 283.55 69.7693
54 288.90 69.7693
I suppose the problem you have is that approx solves for y given x, while you are talking about solving for x given y. So you need to switch your variables x and y when using approx:
df <- read.table(textConnection("
V1 V3
85.60 47.7548
90.95 49.9749
96.30 51.0331
101.65 52.0207
"), header = TRUE)
approx(x = df$V3, y = df$V1, xout = 50)
# $x
# [1] 50
#
# $y
# [1] 91.0769
Also, since here it is x that looks exponential with respect to y (equivalently, y grows roughly logarithmically in x), there is an approximately linear relationship between y and log(x). So it makes more sense to use a linear interpolator between y and log(x), then take the exponential to get back to x:
exp(approx(x = df$V3, y = log(df$V1), xout = 50)$y)
# [1] 91.07339
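One caveat when running this on the full table: the trailing V3 values are tied at 69.7693, so approx() collapses the duplicated x values (via its ties argument, with a warning unless ties is given explicitly). The answer for xout = 50 is unaffected, since 50 lies in the strictly increasing region. A sketch, with full_df standing in for the complete 54-row table:
with(full_df, approx(x = V3, y = V1, xout = 50, ties = mean)$y)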
