Observation matching between groups - r

I am dealing with an original dataset that has more than 20,000 rows. A condensed version of it looks like this:
Row x y z Group Survive
1 0.0680 0.8701 0.0619 1 78.43507
2 0.9984 0.0016 0.0000 1 89.55533
3 0.4146 0.5787 0.0068 1 85.35468
4 0.3910 0.6016 0.0074 2 67.49987
5 0.3902 0.6023 0.0075 2 81.87669
6 0.0621 0.8701 0.0678 2 27.26777
7 0.6532 0.3442 0.0026 3 53.03938
8 0.6508 0.3466 0.0026 3 62.32931
9 0.9977 0.0023 0.0000 3 97.00324
My goal is to create a column called Match1 as shown below
Row x y z Group Survive Match1
1 0.0680 0.8701 0.0619 1 78.43507 g1r1-g2r3
2 0.9984 0.0016 0.0000 1 89.55533 g1r2-g2r1
3 0.4146 0.5787 0.0068 1 85.35468 g1r3-g2r2
1 0.3910 0.6016 0.0074 2 67.49987 g1r2-g2r1
2 0.3902 0.6023 0.0075 2 81.87669 g1r3-g2r2
3 0.0621 0.8701 0.0678 2 27.26777 g1r1-g2r3
1 0.6532 0.3442 0.0026 3 53.03938 NA
2 0.6508 0.3466 0.0026 3 62.32931 NA
3 0.9977 0.0023 0.0000 3 97.00324 NA
The logic behind the values g1r1-g2r3, g1r2-g2r1, and g1r3-g2r2 is as follows.
1st step, a distance matrix is generated between the rows in Group 1 and Group 2, based on the Mahalanobis distance or the simple Euclidean distance, sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2):
0.4235 = sqrt{ (0.3910-0.0680)^2 + (0.6016-0.8701)^2 + (0.0074-0.0619)^2}
0.4225 = sqrt{ (0.3902-0.0680)^2 + (0.6023-0.8701)^2 + (0.0075-0.0619)^2}
0.0083 = sqrt{ (0.0621-0.0680)^2 + (0.8701-0.8701)^2 + (0.0678-0.0619)^2}
0.8538 = sqrt{ (0.3910-0.9984)^2 + (0.6016-0.0016)^2 + (0.0074-0.0000)^2}
0.8549 = sqrt{ (0.3902-0.9984)^2 + (0.6023-0.0016)^2 + (0.0075-0.0000)^2}
1.2789 = sqrt{ (0.0621-0.9984)^2 + (0.8701-0.0016)^2 + (0.0678-0.0000)^2}
0.0329 = sqrt{ (0.3910-0.4146)^2 + (0.6016-0.5787)^2 + (0.0074-0.0068)^2}
Group1 vs Group2
g2r1 g2r2 g2r3
g1r1 0.4235 0.4225 0.0083
g1r2 0.8538 0.8549 1.2789
g1r3 0.0329 0.0340 0.4614
2nd step, find the smallest distance in each row (the chosen values are marked with **).
g2r1 g2r2 g2r3
g1r1 0.4235 0.4225 **0.0083**
g1r2 **0.8538** 0.8549 1.2789
g1r3 0.0329 **0.0340** 0.4614
The column Match1 takes the value g1r1-g2r3 because Row1-Group1 and Row3-Group2 give the smallest distance, 0.0083. Similarly, g1r2-g2r1 because Row2-Group1 and Row1-Group2 give the smallest value in that row, 0.8538. Although 0.0329 is the smallest value in the last row of the distance matrix, we skip it: choosing 0.0329 would pair Row3-Group1 with Row1-Group2, and Row1-Group2 is already paired with Row2-Group1. So we choose the next smallest value, 0.0340, which results in g1r3-g2r2.
3rd step, calculate the average survival difference based on the matched observations from the 2nd step.
[(78.43507 - 27.26777) + (89.55533 - 67.49987) + (85.35468 - 81.87669)] / 3 = 25.56692
I am not sure how to string these steps together programmatically. I would appreciate any suggestions or help putting all these pieces together efficiently.
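A minimal sketch of one way to wire these three steps together in base R, assuming the data frame is called df with the columns shown above, using plain Euclidean distance and matching Group 1 to Group 2 greedily row by row:
g1 <- df[df$Group == 1, ]
g2 <- df[df$Group == 2, ]
# Step 1: distance matrix (rows = Group 1 observations, columns = Group 2)
d <- as.matrix(dist(rbind(g1[, c("x", "y", "z")], g2[, c("x", "y", "z")])))
d <- d[seq_len(nrow(g1)), nrow(g1) + seq_len(nrow(g2)), drop = FALSE]
# Step 2: each Group 1 row takes the closest Group 2 row not used yet
match_g2 <- integer(nrow(g1))
used <- logical(nrow(g2))
for (i in seq_len(nrow(g1))) {
  di <- d[i, ]
  di[used] <- Inf                      # skip Group 2 rows already paired
  match_g2[i] <- which.min(di)
  used[match_g2[i]] <- TRUE
}
Match1 <- paste0("g1r", seq_len(nrow(g1)), "-g2r", match_g2)
Match1   # "g1r1-g2r3" "g1r2-g2r1" "g1r3-g2r2" with the condensed data above
# Step 3: average Survive difference over the matched pairs
mean(g1$Survive - g2$Survive[match_g2])   # 25.56692 with the condensed data
For 20,000+ rows a full distance matrix gets large, and the greedy pairing depends on row order; an optimal one-to-one assignment (for example clue::solve_LSAP on the distance matrix) may be a better fit at that scale.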

Related

3-month rolling correlation keeping the date column in R

This is my data: daily return data for different sectors.
I would like to compute the 3-month rolling correlation between sectors but keep the date field so everything lines up.
> head(data)
Date Communication Services Consumer Discretionary Consumer Staples Energy Financials - AREITs Financials - ex AREITs Health Care
1 2003-01-02 -0.0004 0.0016 0.0033 0.0007 0.0073 0.0006 0.0370
2 2003-01-03 -0.0126 -0.0008 0.0057 -0.0019 0.0016 0.0062 0.0166
3 2003-01-06 0.0076 0.0058 -0.0051 0.0044 0.0063 0.0037 -0.0082
4 2003-01-07 -0.0152 0.0052 -0.0024 -0.0042 -0.0037 -0.0014 0.0027
5 2003-01-08 0.0107 0.0017 0.0047 -0.0057 0.0013 -0.0008 -0.0003
6 2003-01-09 -0.0157 0.0019 -0.0020 0.0009 -0.0016 -0.0012 0.0055
The structure of my data is this:
$ Date : Date[1:5241], format: "2003-01-02" "2003-01-03" "2003-01-06" "2003-01-07" ...
$ Communication Services : num [1:5241] -0.0004 -0.0126 0.0076 -0.0152 0.0107 -0.0157 0.0057 -0.0131 0.0044 0.0103 ...
$ Consumer Discretionary : num [1:5241] 0.0016 -0.0008 0.0058 0.0052 0.0017 0.0019 -0.0022 0.0057 -0.0028 0.0039 ...
$ Consumer Staples : num [1:5241] 0.0033 0.0057 -0.0051 -0.0024 0.0047 -0.002 0.0043 -0.0005 0.0163 0.004 ...
$ Energy : num [1:5241] 0.0007 -0.0019 0.0044 -0.0042 -0.0057 0.0009 0.0058 0.0167 -0.0026 -0.0043 ...
$ Financials - AREITs : num [1:5241] 0.0073 0.0016 0.0063 -0.0037 0.0013 -0.0016 0 0.0025 -0.0051 0.0026 ...
Currently what I am doing is this:
rollingcor <- rollapply(data, width=60, function(x) cor(x[,2],x[,3]),by=60, by.column=FALSE)
This works fine and works out the rolling 60-day correlation, shifting the window by 60 days. However, it doesn't keep the date column, and I find it hard to match the dates.
The end goal here is to produce a df in which the date is every 3 months and the other columns are the correlations between all the sectors in my data.
Please read the information at the top of the r tag and, in particular, provide the input in an easily reproducible manner using dput. In the absence of that, we will use the data shown below, based on the 6x2 BOD data frame that comes with R, and a width of 4. The names on the correlation columns are the row:column numbers in the correlation matrix. For example, compare the 4th row of the output below with cor(data[1:4, -1]).
fill = NA causes it to output the same number of rows as the input by filling with NAs.
library(zoo)
# test data
data <- cbind(Date = as.Date("2023-02-01") + 0:5, BOD, X = 1:6)
# given data frame x return lower triangular part of cor matrix
# Last 2 lines add row:column names.
Cor <- function(x) {
  k <- cor(x)
  lo <- lower.tri(k)
  k.lo <- k[lo]
  m <- which(lo, arr.ind = TRUE)  # rows & cols of lower tri
  setNames(k.lo, paste(m[, 1], m[, 2], sep = ":"))
}
cbind(data, rollapplyr(data[-1], 4, Cor, by.column = FALSE, fill = NA))
giving:
Date Time demand X 2:1 3:1 3:2
1 2023-02-01 1 8.3 1 NA NA NA
2 2023-02-02 2 10.3 2 NA NA NA
3 2023-02-03 3 19.0 3 NA NA NA
4 2023-02-04 4 16.0 4 0.8280576 1.0000000 0.8280576
5 2023-02-05 5 15.6 5 0.4604354 1.0000000 0.4604354
6 2023-02-06 7 19.8 6 0.2959666 0.9827076 0.1223522
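Applying the same Cor helper to the original sector data is then a matter of widening the window; an untested sketch with a 60-row window that advances 60 rows at a time (so each block of correlations sits next to the date that closes that block) might look like this:
# Assumes `data` is the sector data frame shown in the question, Date in column 1
w <- 60
roll <- rollapplyr(data[-1], width = w, FUN = Cor, by = w, by.column = FALSE)
# rollapplyr is right-aligned, so with by = w the i-th window ends at row i * w
end_dates <- data$Date[seq(w, by = w, length.out = nrow(roll))]
cor_by_block <- data.frame(Date = end_dates, coredata(roll), check.names = FALSE)
head(cor_by_block)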

R hmftest multinomial logit model: "system is computationally singular"

I have a multinomial logit model with two individual-specific variables (first and age).
I would like to conduct the hmftest to check if the IIA holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I adjusted it for the mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However, when I want to conduct the hmftest, the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here could be that the hmftest checks if the probability ratio of two alternatives depends only on the characteristics of these alternatives.
Since there are only individual-level variables here, the test won't work in this case.
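For what it's worth, hmftest computes the Hausman-McFadden statistic from the coefficients the two models share, and the error comes from inverting the difference of their covariance matrices (the solve(diff.var) in the traceback). A rough sketch of that computation, assuming the shared coefficient names line up between full and part, makes the failure point visible:
# Rough sketch of the Hausman-McFadden statistic for the models above
common    <- intersect(names(coef(part)), names(coef(full)))
diff.coef <- coef(part)[common] - coef(full)[common]
diff.var  <- vcov(part)[common, common] - vcov(full)[common, common]
# The reciprocal condition number reported in the error message
rcond(diff.var)
# A generalized inverse (MASS::ginv) still returns a number, but a
# (near-)singular diff.var is itself a sign that the test is not
# informative with only individual-specific covariates
stat <- drop(t(diff.coef) %*% MASS::ginv(diff.var) %*% diff.coef)
stat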

Conditional sorting / reordering of column values in R

I have a data set similar to the following with 1 column and 60 rows:
value
1 0.0423
2 0.0388
3 0.0386
4 0.0342
5 0.0296
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
40 0.1424
.
60 -0.0312
I want to reorder the rows so that certain conditions are met. For example, one condition could be: sum(df$value[4:7]) > 0.1000 & sum(df$value[4:7]) < 0.1100
The reordered data set could then look like this, for example:
value
1 0.0423
2 0.0388
3 0.0386
4 0.1312
5 -0.0312
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
.
.
60 0.0342
What I tried was using repeat and sample as in the following:
repeat {
  df1 <- as_tibble(sample(df$value, replace = TRUE))
  if (sum(df1$value[4:7]) > 0.1000 & sum(df1$value[4:7]) < 0.1100) break
}
Unfortunately, this method takes quite some time, and I was wondering if there is a faster way to reorder rows based on mathematical conditions such as sum or prod.
Here's a quick implementation of the hill-climbing method I outlined in my comment. I've had to slightly reframe the desired condition as "distance of sum(x[4:7]) from 0.105" to make it continuous, although you can still use the exact condition when doing the check that all requirements are satisfied. The benefit is that you can add extra conditions to the distance function easily.
# Using same example data as Jon Spring
set.seed(42)
vs = rnorm(60, 0.05, 0.08)
get_distance = function(x) {
  distance = abs(sum(x[4:7]) - 0.105)
  # Add to the distance with further conditions if needed
  distance
}
max_attempts = 10000
best_distance = Inf
swaps_made = 0
for (step in 1:max_attempts) {
  # Copy the vector and swap two random values
  new_vs = vs
  swap_inds = sample.int(length(vs), 2, replace = FALSE)
  new_vs[swap_inds] = rev(new_vs[swap_inds])
  # Keep the new vector if the distance has improved
  new_distance = get_distance(new_vs)
  if (new_distance < best_distance) {
    vs = new_vs
    best_distance = new_distance
    swaps_made = swaps_made + 1
  }
  complete = (sum(vs[4:7]) < 0.11) & (sum(vs[4:7]) > 0.1)
  if (complete) {
    print(paste0("Solution found in ", step, " steps"))
    break
  }
}
sum(vs[4:7])
There's no real guarantee that this method will reach a solution, but I often try this kind of basic hill-climbing when I'm not sure if there's a "smart" way to approach a problem.
Here's an approach using combn from base R, and then filtering using dplyr. (I'm sure there's a way w/o it but my base-fu isn't there yet.)
With only 4 numbers from a pool of 60, there are "only" 488k different combinations (ignoring order; =60*59*58*57/4/3/2), so it's quick to brute force in about a second.
# Make a vector of 60 numbers like your example
set.seed(42)
my_nums <- rnorm(60, 0.05, 0.08)
all_combos <- combn(my_nums, 4)  # Get all unique combos of 4 numbers
library(tidyverse)
combos_table <- all_combos %>%
  t() %>%
  as_tibble() %>%
  mutate(sum = V1 + V2 + V3 + V4) %>%
  filter(sum > 0.1, sum < 0.11)
> combos_table
# A tibble: 8,989 x 5
V1 V2 V3 V4 sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.160 0.00482 0.0791 -0.143 0.100
2 0.160 0.00482 0.101 -0.163 0.103
3 0.160 0.00482 0.0823 -0.145 0.102
4 0.160 0.00482 0.0823 -0.143 0.104
5 0.160 0.00482 -0.0611 -0.00120 0.102
6 0.160 0.00482 -0.0611 0.00129 0.105
7 0.160 0.00482 0.0277 -0.0911 0.101
8 0.160 0.00482 0.0277 -0.0874 0.105
9 0.160 0.00482 0.101 -0.163 0.103
10 0.160 0.00482 0.0273 -0.0911 0.101
# … with 8,979 more rows
This says that in this example, there are about 9000 different sets of 4 numbers from my sequence which meet the criteria. We could pick any of these and put them in positions 4-7 to meet your requirement.
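To turn one of those rows back into a reordered vector, a quick follow-on sketch (not part of the original answer) could place the chosen four values into positions 4 to 7:
# Put the first qualifying combination into positions 4:7 and keep the
# remaining numbers, in their original order, in the other positions
chosen <- unlist(combos_table[1, c("V1", "V2", "V3", "V4")], use.names = FALSE)
reordered <- numeric(length(my_nums))
reordered[4:7] <- chosen
reordered[-(4:7)] <- my_nums[-match(chosen, my_nums)]
sum(reordered[4:7])  # should land between 0.1 and 0.11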

Logistic regression detection probability

I'm attempting to assess the key covariates in detection probability.
I'm currently using this code:
model1 <- glm(P ~ Width + MBL + DFT + SGP + SGC + Depth,
              family = binomial("logit"),
              data = dframe2, na.action = na.exclude)
summary.lm(model1)
My data is structured like this:
Site Transect Q ID P Width DFT Depth Substrate SGP SGC MBL
1 Vr1 Q1 1 0 NA NA 0.5 Sand 0 0 0.00000
2 Vr1 Q2 2 0 NA NA 1.4 Sand&Searass 1 30 19.14286
3 Vr1 Q3 3 0 NA NA 1.7 Sand&Searass 1 15 16.00000
4 Vr1 Q4 4 1 17 0 2.0 Sand&Searass 1 95 35.00000
5 Vr1 Q5 5 0 NA NA 2.4 Sand 0 0 0.00000
6 Vr1 Q6 6 0 NA NA 2.9 Sand&Searass 1 50 24.85714
My sample size is really small (n=12) and I only have ~70 rows of data.
When I run the code it returns:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.457e+01 4.519e+00 5.437 0.00555 **
Width 1.810e-08 1.641e-01 0.000 1.00000
MBL -2.827e-08 9.906e-02 0.000 1.00000
DFT 2.905e-07 1.268e+00 0.000 1.00000
SGP 1.064e-06 2.691e+00 0.000 1.00000
SGC -2.703e-09 3.289e-02 0.000 1.00000
Depth 1.480e-07 9.619e-01 0.000 1.00000
SubstrateSand&Searass -8.516e-08 1.626e+00 0.000 1.00000
Does this mean my data set is just too small to assess detection probability, or am I doing something wrong?
According to Hair (author of the book Multivariate Data Analysis), you need at least 15 examples for each feature (column) of your data. If you have 12, you can only select one feature.
So, run a t-test comparing the means of each feature across the two classes of the target (dependent) variable, 0 and 1, and choose the feature (independent variable) whose mean difference between classes is the biggest. That is the variable best able to create a boundary that splits the two classes.
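A small sketch of that suggestion in R (hypothetical code, not from the answer above; Width and DFT are left out because, in the head() shown, they appear to be recorded only when P = 1):
# Compare each fully observed candidate predictor across the two classes
# of P with a t-test; the smallest p-value points to the variable whose
# class means differ most relative to their spread
predictors <- c("MBL", "SGP", "SGC", "Depth")
pvals <- sapply(predictors, function(v) {
  t.test(dframe2[[v]] ~ dframe2$P)$p.value
})
sort(pvals)
# Refit the logistic regression with just that single predictor
best <- names(which.min(pvals))
model_small <- glm(reformulate(best, response = "P"),
                   family = binomial("logit"), data = dframe2)
summary(model_small)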

How to convert only SOME positive numbers to negative numbers (conditional recoding)?

I am looking for a convenient way to convert positive values (proportions) into negative values of the same variable, depending on the value of another variable.
This is what the data structure looks like:
id Item Var1 Freq
1 P1 0 0.043
2 P2 1 0.078
3 P3 2 0.454
4 P4 3 0.543
5 T1 0 0.001
6 T2 1 0
7 T3 2 0.045
8 T4 3 0.321
9 A1 0 0.671
...
More precisely, I would like to make the Freq values negative if Var1 <= 1 (e.g. -0.043).
This is what I tried:
for(i in 1: 180) {
if (mydata$Var1 <= "1") (mydata$Freq*(-1))}
OR
mydata$Freq[mydata$Var1 <= "1"] = -abs(mydata$Freq)
In both cases the negative sign is set, but the values themselves are altered as well.
Any help is highly appreciated. THANKS!
new.Freq <- with(mydata, ifelse(Var1 <= 1, -Freq, Freq))
Try:
index <- mydata$Var1 <= 1
mydata$Freq[index] = -abs(mydata$Freq[index])
There are two errors in your attempted code:
You did a character comparison by writing x <= "1" - this should be a numeric comparison, i.e. x <= 1
Although you are replacing a subset of your vector, you don't refer to the same subset as the replacement
ifelse can also be used to combine two variables when one of them holds negative values and you want to retain those negative values; similarly, you can convert to a negative value by putting - in front of the variable (as mentioned above), e.g. -Freq.
mydata$new_Freq <- with(mydata, ifelse(Low_Freq < 0, Low_Freq, Freq))
id Item Var1 Freq Low_Freq
1 P1 0 1.043 -0.063
2 P2 1 1.078 -0.077
3 P3 2 2.401 -0.068
4 P4 3 3.543 -0.323
5 T1 0 1.001 1.333
6 T2 1 1.778 1.887
7 T3 2 2.045 1.011
8 T4 3 3.321 1.000
9 A1 0 4.671 2.303
# Output would be:
id Item Var1 Freq Low_Freq new_Freq
1 P1 0 1.043 -0.063 -0.063
2 P2 1 1.078 -0.077 -0.077
3 P3 2 2.401 -0.068 -0.068
4 P4 3 3.543 -0.323 -0.323
5 T1 0 1.001 1.333 1.001
6 T2 1 1.778 1.887 1.778
7 T3 2 2.045 1.011 2.045
8 T4 3 3.321 1.000 3.321
9 A1 0 4.671 2.303 4.671
