I would like to estimate the parameters of the Gompertz-Makeham distribution, but I haven't been able to get a result.
I would like a method in R, like this Weibull parameter estimation code:
weibull_loglik <- function(parm){
gamma <- parm[1]
lambda <- parm[2]
loglik <- sum(dweibull(vec, shape=gamma, scale=lambda, log=TRUE))
return(-loglik)
}
weibull <- nlm(weibull_loglik,parm<-c(1,1), hessian = TRUE, iterlim=100)
weibull$estimate
c=weibull$estimate[1];b=weibull$estimate[2]
My data:
[1] 872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
[16] 20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
[31] 50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
[46] 374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
[61] 894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
[76] 2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
[91] 2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
[106] 12 6 3 2 1 1
summary(vec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 32.0 314.0 900.9 1355.0 4054.0
It would be nice to have a reproducible example, but something like:
library(bbmle)
library(eha)
set.seed(101)
vec <- rmakeham(1000, shape = c(2,3), scale = 2)
dmwrap <- function(x, shape1, shape2, scale, log) {
res <- try(dmakeham(x, c(shape1, shape2), scale, log = log), silent = TRUE)
if (inherits(res, "try-error")) return(NA)
res
}
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
start = list(shape1=1,shape2=1, scale=1),
data = data.frame(y = vec),
method = "Nelder-Mead"
)
Define a wrapper that (1) takes the shape parameters as separate values and (2) returns NA rather than throwing an error when, e.g., the parameters are negative.
Use Nelder-Mead rather than the default BFGS for robustness.
The fitdistrplus package might help too.
If you're going to do a lot of this, it may help to fit the parameters on the log scale (i.e. use parameters logshape1, etc., and use exp(logshape1), etc., in the fitting formula).
I had to work a little harder to fit your data; I scaled the variable by 1000 (and found that I could only compute the log-likelihood; the likelihood gave an error that I didn't bother trying to track down). Unfortunately, it doesn't look like a great fit (too many small values).
x <- scan(text = "872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
12 6 3 2 1 1")
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
start = list(shape1=1,shape2=1, scale=10000),
data = data.frame(y = x/1000),
method = "Nelder-Mead"
)
cc <- as.list(coef(m1))
png("gm.png")
hist(x,breaks = 25, freq=FALSE)
with(cc,
curve(exp(dmwrap(x/1000, shape1, shape2, scale, log = TRUE))/1000, add = TRUE)
)
dev.off()
I am working with the R programming language.
In the "datasets" library in R, there is a data set called "eurodist" that contains the distance between each pair of cities:
library(datasets)
This data set can be then converted into a "matrix":
eurodist = as.matrix(eurodist)
Athens Barcelona Brussels Calais Cherbourg Cologne Copenhagen Geneva Gibraltar Hamburg Hook of Holland Lisbon Lyons Madrid Marseilles Milan Munich Paris Rome Stockholm Vienna
Athens 0 3313 2963 3175 3339 2762 3276 2610 4485 2977 3030 4532 2753 3949 2865 2282 2179 3000 817 3927 1991
Barcelona 3313 0 1318 1326 1294 1498 2218 803 1172 2018 1490 1305 645 636 521 1014 1365 1033 1460 2868 1802
Brussels 2963 1318 0 204 583 206 966 677 2256 597 172 2084 690 1558 1011 925 747 285 1511 1616 1175
Calais 3175 1326 204 0 460 409 1136 747 2224 714 330 2052 739 1550 1059 1077 977 280 1662 1786 1381
Cherbourg 3339 1294 583 460 0 785 1545 853 2047 1115 731 1827 789 1347 1101 1209 1160 340 1794 2196 1588
Cologne 2762 1498 206 409 785 0 760 1662 2436 460 269 2290 714 1764 1035 911 583 465 1497 1403 937
Copenhagen 3276 2218 966 1136 1545 760 0 1418 3196 460 269 2971 1458 2498 1778 1537 1104 1176 2050 650 1455
Geneva 2610 803 677 747 853 1662 1418 0 1975 1118 895 1936 158 1439 425 328 591 513 995 2068 1019
Gibraltar 4485 1172 2256 2224 2047 2436 3196 1975 0 2897 2428 676 1817 698 1693 2185 2565 1971 2631 3886 2974
Hamburg 2977 2018 597 714 1115 460 460 1118 2897 0 550 2671 1159 2198 1479 1238 805 877 1751 949 1155
Hook of Holland 3030 1490 172 330 731 269 269 895 2428 550 0 2280 863 1730 1183 1098 851 457 1683 1500 1205
Lisbon 4532 1305 2084 2052 1827 2290 2971 1936 676 2671 2280 0 1178 668 1762 2250 2507 1799 2700 3231 2937
Lyons 2753 645 690 739 789 714 1458 158 1817 1159 863 1178 0 1281 320 328 724 471 1048 2108 1157
Madrid 3949 636 1558 1550 1347 1764 2498 1439 698 2198 1730 668 1281 0 1157 1724 2010 1273 2097 3188 2409
Marseilles 2865 521 1011 1059 1101 1035 1778 425 1693 1479 1183 1762 320 1157 0 618 1109 792 1011 2428 1363
Milan 2282 1014 925 1077 1209 911 1537 328 2185 1238 1098 2250 328 1724 618 0 331 856 586 2187 898
Munich 2179 1365 747 977 1160 583 1104 591 2565 805 851 2507 724 2010 1109 331 0 821 946 1754 428
Paris 3000 1033 285 280 340 465 1176 513 1971 877 457 1799 471 1273 792 856 821 0 1476 1827 1249
Rome 817 1460 1511 1662 1794 1497 2050 995 2631 1751 1683 2700 1048 2097 1011 586 946 1476 0 2707 1209
Stockholm 3927 2868 1616 1786 2196 1403 650 2068 3886 949 1500 3231 2108 3188 2428 2187 1754 1827 2707 0 2105
Vienna 1991 1802 1175 1381 1588 937 1455 1019 2974 1155 1205 2937 1157 2409 1363 898 428 1249 1209 2105 0
My Question: Suppose I have 6 cities and the Longitude/Latitude for each of these cities :
data_1 = data.frame(id = c(1,2,3), long = rnorm(3, -74, 1 ), lat = rnorm(3, 40, 1 ))
data_2 = data.frame(id = c(4,5,6), long = rnorm(3, -78, 1 ), lat = rnorm(3, 42, 1 ))
final_data = rbind(data_1, data_2)
final_data$names <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
id long lat names
1 1 -75.28447 40.21079 city_1
2 2 -73.29385 40.09104 city_2
3 3 -75.12737 38.88355 city_3
4 4 -79.42325 42.61917 city_4
5 5 -77.82508 41.11707 city_5
6 6 -77.62831 39.94935 city_6
I can also make a similar matrix for these cities that contains the distance between each pair of cities:
library(geosphere)
N <- nrow(final_data)
dists <- outer(seq_len(N), seq_len(N), function(a,b) {
geosphere::distHaversine(final_data[a,2:3], final_data[b,2:3]) # Notes 1, 2
})
D <- as.matrix(dists)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
[2,] 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
[3,] 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
[4,] 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
[5,] 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
[6,] 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
How can I make my matrix look the same way as the "eurodist" matrix?
I had thought of the following way to do this:
colnames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
rownames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
city_1 city_2 city_3 city_4 city_5 city_6
city_1 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
city_2 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
city_3 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
city_4 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
city_5 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
city_6 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
In the end, I would like to use the above matrix as input for a customized Travelling Salesman Problem (R: Customizing the Travelling Salesman Problem) - e.g. Try to find the optimal path when you are forced to start at "city 4" and the third city should be "city 5":
D <- dists
transformMatrix <- function(fixed_points, D){
if(length(fixed_points) == 0) return(D)
p <- integer(nrow(D))
pos <- match(names(fixed_points), colnames(D))
p[fixed_points] <- pos
p[-fixed_points] <- sample(setdiff(seq_len(nrow(D)), pos))
D[p, p]
}
fixed_points <- c(
"city_4" = 1, "city_5" = 3
)
D_perm <- transformMatrix(fixed_points, D)
feasiblePopulation <- function(n, size, fixed_points){
positions <- setdiff(seq_len(n), fixed_points)
m <- matrix(0, size, n)
if(length(fixed_points) > 0){
m[, fixed_points] <- rep(fixed_points, each = size)
for(i in seq_len(size))
m[i, -fixed_points] <- sample(positions)
} else {
for(i in seq_len(size))
m[i,] <- sample(positions)
}
m
}
mutation <- function(n, fixed_points){
positions <- setdiff(seq_len(n), fixed_points)
function(obj, parent){
vec <- obj@population[parent,]
if(length(positions) < 2) return(vec)
indices <- sample(positions, 2)
replace(vec, indices, vec[rev(indices)])
}
}
fitness <- function(tour, distMatrix) {
tour <- c(tour, tour[1])
route <- embed(tour, 2)[,2:1]
1/sum(distMatrix[route])
}
library(GA)
popSize = 500
res <- ga(
type = "permutation",
fitness = fitness,
distMatrix = D_perm,
lower = 1,
upper = nrow(D_perm),
mutation = mutation(nrow(D_perm), fixed_points),
crossover = gaperm_pmxCrossover,
suggestions = feasiblePopulation(nrow(D_perm), popSize, fixed_points),
popSize = popSize,
maxiter = 5000,
run = 500,
pmutation = 0.2
)
colnames(D_perm)[res@solution[1,]]
This results in the following error:
Error in if (object@run >= run) break :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(fitness) : no non-missing arguments to max; returning -Inf
2: In max(Fitness, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
3: In max(fitness) : no non-missing arguments to max; returning -Inf
4: In max(x, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
Is the above error because I have not made "distance matrix" (i.e. "D") properly? Is there a different way to name the columns and rows of a matrix in R?
Thanks!
Note: If anyone knows another way to solve this constrained Travelling Salesman Problem with custom cities using the Genetic Algorithm in R (e.g. a different objective function, a different way to specify constraints, etc.), please let me know. I am open to different ways of solving this problem!
That’s not the problem. The error says that it encountered the code:
if (object@run >= run) break
… and either object@run or run was a missing value (NA), so the comparison evaluated to NA, which the if function cannot handle gracefully. It may be an error in the ga function itself or in the arguments to it.
To address the direct question about how to make the distance matrix look like the example in eurodist: There is a dimnames attribute for matrices. You need to assign a list with a rownames and a colnames value in it and assign that list to the dimnames attribute.
dimnames(D) <- list(rownames=final_data$names,
colnames=final_data$names)
Then when you run your code you get an error from the ga(...) call:
Error in gaperm_pmxCrossover_Rcpp(object, parents) : index error
Looking at the problem setup, your population size appears much larger than needed. If you drop it down a bit to say 100 or 200, then the results begin to be computed.
popSize=200;
# now calculate a res
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSize=100
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSize=20
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
It doesn't seem "proper" that a population size larger than needed should cause an obscure error, so you might contact the package maintainer with your example (now that it has been "dressed up" properly.)
A large number can be comma formatted to read more easily into groups of three. E.g. 1050 = 1,050 and 10200 = 10,200.
The sum of each of these groups of three would be:
1050=1,050 gives: 1+50=51
10200=10,200 gives: 10+200=210
I need to search for matches in the sum of the groups of threes.
Namely, if I am searching for 1234, then I am looking for numbers whose sum of threes = 1234.
The smallest match is 235,999 since 235+999=1234. No other integer less than 235,999 gives a sum of threes equal to 1234.
The next smallest match is 236,998 since 236+998=1234.
One can add 999 each time, but this fails after reaching 999 since an extra digit of 1 is added to the number due to overflow in the 999.
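To make the definition concrete, here is a minimal Python sketch of the "sum of threes". The function name group_sum is only an illustrative choice, not part of any library. The last call shows why adding 999 stops working once the upper group reaches 999:

```python
def group_sum(n, grouping=3):
    """Sum the groups of `grouping` decimal digits of a non-negative integer."""
    base = 10 ** grouping
    total = 0
    while n > 0:
        n, group = divmod(n, base)  # peel off the lowest group
        total += group
    return total


print(group_sum(1050))           # 1 + 50 = 51
print(group_sum(235_999))        # 235 + 999 = 1234
print(group_sum(235_999 + 999))  # 236,998 -> 1234 again
print(group_sum(999_235 + 999))  # 1,000,234 -> 1 + 0 + 234 = 235
```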
More generally, I am asking for the solutions (smallest to highest) to:
a+b+c+d… = x
where a, b, c, d, … are an arbitrary number of integers between 0 and 999 and x is a fixed integer.
Note that there are infinitely many solutions to this for any positive integer x.
How would one get the solutions to this beginning with the smallest number solutions (for y number of solutions where y can be an arbitrarily large number)?
Is there a way to do this without brute force looping one by one? I'm dealing with potentially very large numbers, which could take years to loop through in a straight loop. Ideally, one should do this without failed attempts.
The problem is easier to think about if instead of groups of 3 digits, you just consider 1 digit at a time.
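For example, with one digit per group and x = 16, the smallest solution is 79: use as few groups as possible, fill the low-order groups to their maximum, and put the remainder in the top group. A minimal sketch of that greedy construction follows; the name smallest_match is hypothetical, and the code assumes x is a non-negative integer:

```python
def smallest_match(x, grouping=3):
    """Smallest non-negative integer whose digit groups sum to x (greedy sketch)."""
    max_in_group = 10 ** grouping - 1
    full, rest = divmod(x, max_in_group)
    # `full` low-order groups hold the maximum; the remainder is the top group.
    groups = [max_in_group] * full
    if rest:
        groups.append(rest)
    # groups[0] is the lowest group, so weight group i by 10**(grouping*i).
    return sum(g * 10 ** (grouping * i) for i, g in enumerate(groups))


print(smallest_match(16, grouping=1))  # 79
print(smallest_match(1234))            # 235999
```

This only gives the starting point; enumerating the later solutions in order needs the carry logic described next.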
An algorithm:
Start by filling the 0 digit group with x.
Create a loop that each time prints the next solution:
  "Normalize" the groups by moving all that is too large from the right to the left, leaving only the maximum value at the right.
  Output the solution.
  Repeat:
    Add 1 to the penultimate group.
    This can carry to the left if a group gets too large (e.g. 999+1 is too large).
    Check whether the result didn't get too large (a[0] should be able to absorb what was added).
    If the result got too large, set the group to zero and continue incrementing the earlier groups.
  Calculate the last group to absorb the surplus (can be positive or negative).
Some Python code for illustration:
x = 1234
grouping = 3
max_iterations = 200
max_in_group = 10**grouping - 1
a = [x]
while max_iterations > 0:
    #step 1: while a[0] is too large: redistribute to the left
    i = 0
    while a[i] > max_in_group:
        if i == len(a) - 1:
            a.append(0)
        a[i + 1] += a[i] - max_in_group
        a[i] = max_in_group
        i += 1
    num = sum(10**(grouping*i) * a[i] for i, n in enumerate(a))
    print(f"{num} {num:,}")
    # print("".join([str(t) for t in a[::-1]]), ",".join([str(t) for t in a[::-1]]))
    # step 2: add one to the penultimate group, while group already full: set to 0 and increment the
    # group left of it;
    # while the surplus is too large (because a[0] is too small) repeat the incrementing
    i0 = 1
    surplus = 0
    while True:  # needs to be executed at least once, and repeated if the surplus became too large
        i = i0
        while True:  # increment a[i] by 1, which can carry to the left
            if i == len(a):
                a.append(1)
                surplus += 1
                break
            else:
                if a[i] == max_in_group:
                    a[i] = 0
                    surplus -= max_in_group
                    i += 1
                else:
                    a[i] += 1
                    surplus += 1
                    break
        if a[0] >= surplus:
            break
        else:
            surplus -= a[i0]
            a[i0] = 0
            i0 += 1
    #step 3: a[0] should absorb the surplus created in step 1, although a[0] can get out of bounds
    a[0] -= surplus
    surplus = 0
    max_iterations -= 1
Abbreviated output:
235,999 236,998 ... 998,236 999,235 ... 1,234,999 1,235,998 ... 1,998,235 1,999,234 2,233,999 2,234,998 ...
Output for grouping=3 and x=3456:
459,999,999,999 460,998,999,999 460,999,998,999 460,999,999,998 461,997,999,999
461,998,998,999 461,998,999,998 461,999,997,999 461,999,998,998 461,999,999,997
462,996,999,999 ...
Output for grouping=1 and x=16:
79 88 97 169 178 187 196 259 268 277 286 295 349 358 367 376 385 394 439 448 457 466
475 484 493 529 538 547 556 565 574 583 592 619 628 637 646 655 664 673 682 691 709
718 727 736 745 754 763 772 781 790 808 817 826 835 844 853 862 871 880 907 916 925
934 943 952 961 970 1069 1078 1087 1096 1159 1168 1177 1186 1195 1249 1258 1267 1276
1285 1294 1339 1348 1357 1366 1375 1384 1393 1429 1438 1447 1456 1465 1474 1483 1492
1519 1528 1537 1546 1555 1564 1573 1582 1591 1609 1618 1627 1636 1645 1654 1663 1672
1681 1690 1708 1717 1726 1735 1744 1753 1762 1771 1780 1807 1816 1825 1834 1843 1852
1861 1870 1906 1915 1924 1933 1942 1951 1960 2059 2068 2077 2086 2095 2149 2158 2167
2176 2185 2194 2239 2248 2257 2266 2275 2284 2293 2329 2338 2347 2356 2365 2374 2383
2392 2419 2428 2437 2446 2455 2464 2473 2482 2491 2509 2518 2527 2536 2545 2554 2563
2572 2581 2590 2608 2617 2626 2635 2644 2653 2662 2671 2680 2707 2716 2725 2734 ...
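For small inputs, the listed output can be cross-checked against a plain scan over the integers. The sketch below is only a verification aid, not a replacement for the algorithm, since it is exactly the brute-force loop the question wants to avoid. The name brute_force is hypothetical, and the code assumes x > 0:

```python
from itertools import count, islice


def brute_force(x, grouping, how_many):
    """First `how_many` positive integers whose digit-group sum is x (x > 0)."""
    base = 10 ** grouping

    def group_sum(n):
        total = 0
        while n:
            n, g = divmod(n, base)
            total += g
        return total

    # Scan 1, 2, 3, ... and keep the first `how_many` matches.
    matches = (n for n in count(1) if group_sum(n) == x)
    return list(islice(matches, how_many))


print(brute_force(16, 1, 10))
# [79, 88, 97, 169, 178, 187, 196, 259, 268, 277]
```

The first ten values agree with the start of the grouping=1, x=16 output above.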
I have a data.frame df with matching columns that are also paired. The matching columns are defined in the list patient. I would like to divide the matching columns by each other. Any suggestions on how to do this?
I tried this, but this does not take the pairing from patient into account.
m1 <- m1[sort(colnames(df))]
m1_g <- m1[,grep("^n",colnames(df))]
m1_r <- m1[,grep("^t",colnames(df))]
m1_new <- m1_g/m1_r
m1_new
head(df)
na-008 ta-008 nc012 tb012 na020 na-018 ta-018 na020 tc020 tc093 nc093
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGT 56 311 137 242 23 96 113 106 41 114
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGG 208 656 350 713 49 476 183 246 157 306
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGT 631 1978 1531 2470 216 1906 732 850 665 909
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTT 2760 8159 6067 9367 622 4228 2931 3031 2895 2974
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTTT 1698 4105 3737 3729 219 1510 1697 1643 1527 1536
> head(patient)
$`008`
[1] "na-008" "ta-008"
$`012`
[1] "nc012" "tb012"
$`018`
[1] "na-018" "ta-018"
$`020`
[1] "na020" "tc020"
$`045`
[1] "nb045" "tc045"
$`080`
[1] "nb-080" "ta-080"
I am writing R code where there's a vector 'x' which contains values 1 to 100 and I want to create another vector 'y' which subsets a range of values at every nth range. I'm sure I can use the rep() and seq() but I can't figure out the code to get what I need. Here's what the output should look like
x <- 1:100
y <- 1 2 3 11 12 13 21 22 23 31 32 33 41 42 43 51 52 53 61 62 63 71 72 73 81 82 83 91 92 93
So if I were to have a vector x <- 1001:1100, x[y] should return:
1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043...etc
Any ideas?
You could use grepl for that:
x <- 1001:1100
y <- grepl("[1-3]$", x)
x[y]
# [1] 1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043 1051 1052
#[18] 1053 1061 1062 1063 1071 1072 1073 1081 1082 1083 1091 1092 1093
It simply checks for each element of x whether the last digit is in the range of 1, 2 or 3 and if so, it returns TRUE, otherwise FALSE. This logical index is then used to subset x.
In case your objective is not to subset elements ending in 1,2 or 3 but instead, to always subset 3 elements, then leave out 7, and then subset 3 again etc... you could do:
x <- 1001:1100
y <- rep(c(TRUE, FALSE), c(3, 7))
x[y]
# [1] 1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043 1051 1052
#[18] 1053 1061 1062 1063 1071 1072 1073 1081 1082 1083 1091 1092 1093
In this case, vector y which is again logical, is recycled - note that length(x) should be divisible by length(y) for this to work properly.
For fun, With outer:
x <- 1001:1100
y <- as.vector(outer(1:3, seq(0, length(x)-10, 10), "+"))
x[y]
# [1] 1001 1002 1003 1011 1012 1013 1021 1022 1023 1031 1032 1033 1041 1042 1043
# [16] 1051 1052 1053 1061 1062 1063 1071 1072 1073 1081 1082 1083 1091 1092 1093
Probably this may help you:
x <- 1:100
y <- as.integer()
for(i in seq(1, length(x), 10)) {
y <- append(y, c(x[i], x[i+1], x[i+2]))
}
Hm. This started out as fun, but now I happen to like it since it is constructed in basically the same way the author of the question put it:
> do.call("c",lapply(0:5,function(X) 1:3+10*X))
[1] 1 2 3 11 12 13 21 22 23 31 32 33 41 42 43 51 52 53