hts method for creating hierarchical time series - r

I am trying to convert a time series into a hierarchical one using the hts package and my structure is as follows:
My data is already stored as a ts object in R with a weekly frequency and all columns are labeled:
Then I try to make the hts object but I get this error message, probably because I do not fully understand how to split the columns in the argument "characters".
x <- hts(abc, c(1,1,2,2))
Since argument characters are not specified, the default labelling system is used.
Error: Argument nodes must be a list.
Thanks in advance

First, you need to pass the argument characters by name. It is not the second argument of hts(); that position belongs to nodes, which is why the error complains about nodes. Second, you need to specify characters according to the names of your columns. In this case, the first character identifies the first level of disaggregation (A, B, C, D, E), while the next two characters identify the next level of disaggregation. So your argument should be characters = c(1,2).
Here is an example with synthetic data, but with the same structure as yours.
library(hts)
abc <- matrix(sample(1:100, 32*140, replace=TRUE), ncol=32)
colnames(abc) <- c(
paste0("A0",1:5),
paste0("B0",1:9),"B10",
paste0("C0",1:8),
paste0("D0",1:5),
paste0("E0",1:4)
)
abc <- ts(abc, start=2019, frequency=365.25/7)
x <- hts(abc, characters = c(1,2))
x
#> Hierarchical Time Series
#> 3 Levels
#> Number of nodes at each level: 1 5 32
#> Total number of series: 38
#> Number of observations per series: 140
#> Top level series:
#> Time Series:
#> Start = 2019
#> End = 2021.66392881588
#> Frequency = 52.1785714285714
#> [1] 1735 1645 1472 1638 1594 1722 1525 1761 1500 1746 1331 1567 1853 1652 1587
#> [16] 1540 1453 1989 1629 1587 1596 1474 1320 1599 1762 1419 1931 1447 2102 1608
#> [31] 1439 1909 1331 1742 1428 1677 1534 1657 1741 1612 1574 1954 1542 2067 1512
#> [46] 1850 1650 1666 1321 1332 1924 1786 1496 1695 1363 1437 1740 1448 1260 1371
#> [61] 1661 1726 1786 1641 1463 1616 1641 1895 1503 1430 1972 1705 1722 1447 1515
#> [76] 1636 1544 1727 1960 1647 1682 1569 1616 1628 1706 1837 1738 1659 1574 1716
#> [91] 1409 1428 1411 1708 1606 1501 1413 1707 1552 1567 1693 1748 2034 1557 1402
#> [106] 1649 1637 1653 1857 1401 1519 1600 1844 1585 1796 1612 1456 1626 1390 1368
#> [121] 1492 1765 1644 1773 1302 2027 1810 1652 1819 1628 1574 1655 1650 1817 1605
#> [136] 1422 1793 1999 1489 1667
Created on 2021-10-18 by the reprex package (v2.0.1)
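To see what characters = c(1,2) means for these column names, here is a minimal base-R sketch (it does not call hts). The variable names series, level1, level2 and nodes are illustrative only, and the remark that the counts have the list shape expected by the nodes argument is an assumption based on the error message quoted in the question.

```r
# Column names with the same pattern as in the synthetic example above
series <- c(paste0("A0", 1:5), paste0("B0", 1:9), "B10",
            paste0("C0", 1:8), paste0("D0", 1:5), paste0("E0", 1:4))

# characters = c(1, 2): the first 1 character names the level-1 group,
# the following 2 characters identify the series within that group
level1 <- substr(series, 1, 1)
level2 <- substr(series, 2, 3)

# Number of bottom-level series under each level-1 group
group_sizes <- table(level1)
print(group_sizes)

# The same hierarchy expressed as counts per level (assumed to be the
# list shape that the `nodes` argument expects)
nodes <- list(length(group_sizes), as.vector(group_sizes))

# Total number of series: 1 top + 5 groups + 32 bottom-level series
total_series <- 1 + length(group_sizes) + length(series)
```

The totals agree with the printed summary above: 1, 5 and 32 nodes per level, 38 series in all.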


R: Different Ways to Name Matrices

I am working with the R programming language.
In the "datasets" library in R, there is a data set called "eurodist" that contains the distance between each combination of cities :
library(datasets)
This data set can be then converted into a "matrix":
eurodist = as.matrix(eurodist)
Athens Barcelona Brussels Calais Cherbourg Cologne Copenhagen Geneva Gibraltar Hamburg Hook of Holland Lisbon Lyons Madrid Marseilles Milan Munich Paris Rome Stockholm Vienna
Athens 0 3313 2963 3175 3339 2762 3276 2610 4485 2977 3030 4532 2753 3949 2865 2282 2179 3000 817 3927 1991
Barcelona 3313 0 1318 1326 1294 1498 2218 803 1172 2018 1490 1305 645 636 521 1014 1365 1033 1460 2868 1802
Brussels 2963 1318 0 204 583 206 966 677 2256 597 172 2084 690 1558 1011 925 747 285 1511 1616 1175
Calais 3175 1326 204 0 460 409 1136 747 2224 714 330 2052 739 1550 1059 1077 977 280 1662 1786 1381
Cherbourg 3339 1294 583 460 0 785 1545 853 2047 1115 731 1827 789 1347 1101 1209 1160 340 1794 2196 1588
Cologne 2762 1498 206 409 785 0 760 1662 2436 460 269 2290 714 1764 1035 911 583 465 1497 1403 937
Copenhagen 3276 2218 966 1136 1545 760 0 1418 3196 460 269 2971 1458 2498 1778 1537 1104 1176 2050 650 1455
Geneva 2610 803 677 747 853 1662 1418 0 1975 1118 895 1936 158 1439 425 328 591 513 995 2068 1019
Gibraltar 4485 1172 2256 2224 2047 2436 3196 1975 0 2897 2428 676 1817 698 1693 2185 2565 1971 2631 3886 2974
Hamburg 2977 2018 597 714 1115 460 460 1118 2897 0 550 2671 1159 2198 1479 1238 805 877 1751 949 1155
Hook of Holland 3030 1490 172 330 731 269 269 895 2428 550 0 2280 863 1730 1183 1098 851 457 1683 1500 1205
Lisbon 4532 1305 2084 2052 1827 2290 2971 1936 676 2671 2280 0 1178 668 1762 2250 2507 1799 2700 3231 2937
Lyons 2753 645 690 739 789 714 1458 158 1817 1159 863 1178 0 1281 320 328 724 471 1048 2108 1157
Madrid 3949 636 1558 1550 1347 1764 2498 1439 698 2198 1730 668 1281 0 1157 1724 2010 1273 2097 3188 2409
Marseilles 2865 521 1011 1059 1101 1035 1778 425 1693 1479 1183 1762 320 1157 0 618 1109 792 1011 2428 1363
Milan 2282 1014 925 1077 1209 911 1537 328 2185 1238 1098 2250 328 1724 618 0 331 856 586 2187 898
Munich 2179 1365 747 977 1160 583 1104 591 2565 805 851 2507 724 2010 1109 331 0 821 946 1754 428
Paris 3000 1033 285 280 340 465 1176 513 1971 877 457 1799 471 1273 792 856 821 0 1476 1827 1249
Rome 817 1460 1511 1662 1794 1497 2050 995 2631 1751 1683 2700 1048 2097 1011 586 946 1476 0 2707 1209
Stockholm 3927 2868 1616 1786 2196 1403 650 2068 3886 949 1500 3231 2108 3188 2428 2187 1754 1827 2707 0 2105
Vienna 1991 1802 1175 1381 1588 937 1455 1019 2974 1155 1205 2937 1157 2409 1363 898 428 1249 1209 2105 0
My Question: Suppose I have 6 cities and the Longitude/Latitude for each of these cities :
data_1 = data.frame(id = c(1,2,3), long = rnorm(3, -74, 1 ), lat = rnorm(3, 40, 1 ))
data_2 = data.frame(id = c(4,5,6), long = rnorm(3, -78, 1 ), lat = rnorm(3, 42, 1 ))
final_data = rbind(data_1, data_2)
final_data$names <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
id long lat names
1 1 -75.28447 40.21079 city_1
2 2 -73.29385 40.09104 city_2
3 3 -75.12737 38.88355 city_3
4 4 -79.42325 42.61917 city_4
5 5 -77.82508 41.11707 city_5
6 6 -77.62831 39.94935 city_6
I can also make a similar matrix for these cities that contains the distance between each pair of cities:
library(geosphere)
N <- nrow(final_data)
dists <- outer(seq_len(N), seq_len(N), function(a,b) {
geosphere::distHaversine(final_data[a,2:3], final_data[b,2:3]) # Notes 1, 2
})
D <- as.matrix(dists)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
[2,] 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
[3,] 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
[4,] 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
[5,] 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
[6,] 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
How can I make my matrix look the same way as the "eurodist" matrix?
I had thought of the following way to do this:
colnames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
rownames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
city_1 city_2 city_3 city_4 city_5 city_6
city_1 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
city_2 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
city_3 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
city_4 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
city_5 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
city_6 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
In the end, I would like to use the above matrix as input for a customized Travelling Salesman Problem (R: Customizing the Travelling Salesman Problem) - e.g. Try to find the optimal path when you are forced to start at "city 4" and the third city should be "city 5":
D <- dists
transformMatrix <- function(fixed_points, D){
if(length(fixed_points) == 0) return(D)
p <- integer(nrow(D))
pos <- match(names(fixed_points), colnames(D))
p[fixed_points] <- pos
p[-fixed_points] <- sample(setdiff(seq_len(nrow(D)), pos))
D[p, p]
}
fixed_points <- c(
"city_4" = 1, "city_5" = 3
)
D_perm <- transformMatrix(fixed_points, D)
feasiblePopulation <- function(n, size, fixed_points){
positions <- setdiff(seq_len(n), fixed_points)
m <- matrix(0, size, n)
if(length(fixed_points) > 0){
m[, fixed_points] <- rep(fixed_points, each = size)
for(i in seq_len(size))
m[i, -fixed_points] <- sample(positions)
} else {
for(i in seq_len(size))
m[i,] <- sample(positions)
}
m
}
mutation <- function(n, fixed_points){
positions <- setdiff(seq_len(n), fixed_points)
function(obj, parent){
vec <- obj@population[parent,]
if(length(positions) < 2) return(vec)
indices <- sample(positions, 2)
replace(vec, indices, vec[rev(indices)])
}
}
fitness <- function(tour, distMatrix) {
tour <- c(tour, tour[1])
route <- embed(tour, 2)[,2:1]
1/sum(distMatrix[route])
}
library(GA)
popSize = 500
res <- ga(
type = "permutation",
fitness = fitness,
distMatrix = D_perm,
lower = 1,
upper = nrow(D_perm),
mutation = mutation(nrow(D_perm), fixed_points),
crossover = gaperm_pmxCrossover,
suggestions = feasiblePopulation(nrow(D_perm), popSize, fixed_points),
popSize = popSize,
maxiter = 5000,
run = 500,
pmutation = 0.2
)
colnames(D_perm)[res@solution[1,]]
This results in the following error:
Error in if (object@run >= run) break :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(fitness) : no non-missing arguments to max; returning -Inf
2: In max(Fitness, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
3: In max(fitness) : no non-missing arguments to max; returning -Inf
4: In max(x, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
Is the above error because I have not made "distance matrix" (i.e. "D") properly? Is there a different way to name the columns and rows of a matrix in R?
Thanks!
Note : If anyone knows another way to solve this constraint Travelling Salesman Problem with custom cities using the Genetic Algorithm in R (e.g. different objective function, different way to specify constraints, etc.), please let me know. I am open to different ways to solving this problem!
That’s not the problem. The error says that it encountered this code:
if (object@run >= run) break
… and the comparison evaluated to NA (that is what "missing value where TRUE/FALSE needed" means), so either object@run or run was a missing value, which the if function cannot handle gracefully. It may be an error in the ga function itself or in the arguments to it.
To address the direct question about how to make the distance matrix look like the example in eurodist: There is a dimnames attribute for matrices. You need to assign a list with a rownames and a colnames value in it and assign that list to the dimnames attribute.
dimnames(D) <- list(rownames=final_data$names,
colnames=final_data$names)
Then when you run your code you get an error from the ga(...) call:
Error in gaperm_pmxCrossover_Rcpp(object, parents) : index error
Looking at the problem setup, your population size appears much larger than needed. If you drop it down a bit to say 100 or 200, then the results begin to be computed.
popSize=200;
# now calculate a res
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSize=100
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSize=20
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
It doesn't seem "proper" that a population size larger than needed should cause an obscure error, so you might contact the package maintainer with your example (now that it has been "dressed up" properly.)
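The dimnames assignment described above can be checked on a small matrix. This is a sketch with made-up distances (not the values from the question); it also shows that the name lookup used in transformMatrix, match(names(fixed_points), colnames(D)), only works once the column names are set.

```r
# Small stand-in for the distance matrix (distances are made up)
D <- matrix(c( 0, 10, 20,
              10,  0, 15,
              20, 15,  0), nrow = 3, byrow = TRUE)
city_names <- c("city_1", "city_2", "city_3")

# Before naming, colnames(D) is NULL, so a lookup by name finds nothing
before <- match("city_3", colnames(D))

# Assign row and column names in one step via the dimnames attribute
dimnames(D) <- list(city_names, city_names)

# After naming, the same lookup returns the column position,
# and cells can be indexed by city name
after <- match("city_3", colnames(D))
d13 <- D["city_1", "city_3"]
print(D)
```

Naming the elements of the dimnames list (rownames=, colnames=) as in the answer is optional; an unnamed list of two character vectors sets the same row and column names.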

Adding a second y-axis as a formula of the first x and y-axis

I've got the following data:
ClusterID AvgGenes nCoreGenes Ratio
20001 1941 1572 0.809892
20005 1599 1374 0.859287
20008 2017 1712 0.848785
20009 1808 1590 0.879425
20013 1823 1469 0.805815
20015 2056 1677 0.815661
20019 2135 1783 0.835129
20020 3152 2625 0.832805
20026 2028 1586 0.782051
20028 1835 1420 0.773842
20030 2885 2189 0.758752
20031 1772 1485 0.838036
20032 1722 1473 0.855401
20034 1801 1459 0.810105
20035 1677 1339 0.798450
20042 2193 1651 0.752850
20047 1747 1345 0.769891
20049 1306 1008 0.771822
20051 1738 1358 0.781358
20052 1552 1188 0.765464
20062 2179 1509 0.692520
20065 2047 1894 0.925256
20074 1948 1568 0.804928
20088 2588 2192 0.846986
20103 1916 1341 0.699896
20109 2511 2190 0.872162
20117 1668 1278 0.766187
20162 1936 1601 0.826963
20167 2068 1856 0.897485
20168 4375 3992 0.912457
20170 3961 3252 0.821005
20190 2327 2013 0.865062
20196 3350 2522 0.752836
20198 3028 2302 0.760238
20207 1522 1241 0.815375
20208 1791 1546 0.863205
20215 3013 1853 0.615002
20219 2803 2043 0.728862
20225 4604 2931 0.636620
20247 1927 1567 0.813181
20248 2510 1732 0.690040
20251 2252 1674 0.743339
20279 2843 1775 0.624340
20293 1611 1245 0.772812
20313 2277 1914 0.840580
20314 2320 1915 0.825431
20318 2201 1762 0.800545
20320 2287 1943 0.849585
20321 2060 1645 0.798544
20323 2242 1524 0.679750
20327 2132 1845 0.865385
20328 1685 1402 0.832047
20329 2393 1727 0.721688
20341 2190 1729 0.789498
20368 3906 2991 0.765745
20370 3245 2325 0.716487
20373 2608 1935 0.741948
20374 3632 2380 0.655286
20388 1787 1435 0.803022
20408 1506 1262 0.837981
20423 1979 1428 0.721577
20433 2452 1646 0.671289
20459 2118 1649 0.778565
20462 1778 1496 0.841395
20478 1653 1447 0.875378
20492 2709 1895 0.699520
20494 2686 1773 0.660089
20498 2676 1909 0.713378
20508 1425 1092 0.766316
20517 2461 1983 0.805770
20548 2752 2059 0.748183
20565 2239 1764 0.787852
20566 2368 1882 0.794764
20569 2285 1877 0.821444
20572 2179 1703 0.781551
20573 1609 1355 0.842138
20577 1753 1379 0.786651
20579 1786 1426 0.798432
20589 1811 1239 0.684152
20600 2293 1822 0.794592
20650 1693 1422 0.839929
20677 1904 1485 0.779937
20729 1680 1362 0.810714
20742 2210 1855 0.839367
20744 1583 1372 0.866709
20746 2087 1743 0.835170
20750 1859 1418 0.762776
20753 1701 1496 0.879483
20758 1480 1169 0.789865
20759 1839 1406 0.764546
20772 2068 1786 0.863636
20773 2321 2024 0.872038
20775 2528 2012 0.795886
20784 1869 1592 0.851792
20788 1843 1516 0.822572
20809 1541 1352 0.877352
20811 1569 1346 0.857871
20824 1594 1323 0.829987
20836 2287 1688 0.738085
20857 2252 1704 0.756661
20890 1884 1340 0.711253
20903 1681 1404 0.835217
20966 1826 1455 0.796824
20967 1877 1605 0.855088
20990 2125 1605 0.755294
21002 1743 1345 0.771658
21027 1866 1504 0.806002
21047 2866 2191 0.764480
21049 2163 1596 0.737864
21059 2298 1847 0.803742
21085 1640 1490 0.908537
21258 3002 1950 0.649567
21325 2945 2117 0.718846
21326 2343 1996 0.851899
21348 2362 1809 0.765876
21370 2313 1553 0.671422
21384 1932 1383 0.715839
21405 1948 1398 0.717659
21477 1852 1538 0.830454
21584 2514 1838 0.731106
21586 1247 910 0.729751
21734 1619 1452 0.896850
21818 1593 1363 0.855618
21826 2688 2009 0.747396
21845 2595 1854 0.714451
21889 1678 1285 0.765793
22085 1718 1314 0.764843
22153 1290 1139 0.882946
22347 2356 1629 0.691426
22359 2170 1552 0.715207
22396 1648 1337 0.811286
I would like to use AvgGenes as my x-axis and nCoreGenes as my primary y-axis. In addition, I would like to add a second y-axis for the ratio which is nCoreGenes/AvgGenes*100 (pCoreGenes). However, I couldn't find the right formula: y-axis/x-axis*100 to use for scale_y_continuous(sec.axis()) in ggplot2.
cluster2core$pCoreGenes <- cluster2core$Ratio*100
g6 <- ggplot(cluster2core, aes(AvgGenes, nCoreGenes))
g6 <- g6 + geom_point(aes(y = nCoreGenes)) + geom_smooth(method = lm)
g6 <- g6 + geom_line(aes(y = pCoreGenes))
g6 <- g6 + labs(y = "Number of core genes", x = "Average number of genes")
#g6 <- g6 + scale_y_continuous(sec.axis = sec_axis())
The mean value of the ratio % is 78.7, so I expect to get a horizontal line which indicates that on average genomes have 78% core genes.
Using secondary axes in ggplot requires a cheat. You need to pretend that your secondary y axis data are in the same range as the primary y axis data, so scale it accordingly. Multiplying by 100 does not suffice, as you want to have the data in the range around 1000 or so. Multiplying by 4000 should get you there.
Then, you need to reverse the process for the axis, specifying an argument to sec_axis. Normally, you would divide by 4000, but since you want percentage, divide by 40:
ggplot(df, aes(x=AvgGenes, y=nCoreGenes)) + geom_point() +
geom_smooth(method=lm) +
geom_line(aes(y=Ratio*4000)) +
scale_y_continuous(sec.axis=sec_axis( ~ . / 40))
Also, there is no need to specify the aesthetics in geom_point since they are inherited from the aesthetics in the ggplot() call.
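The arithmetic behind the two factors can be checked directly. The sketch below uses the first five rows of the data shown in the question; scale_factor, on_primary and back_to_percent are illustrative names, and the dashed mean line is an assumption about what the asker wants rather than part of the answer. The plot is only built if ggplot2 is installed.

```r
# First five rows of the data from the question
df <- data.frame(
  AvgGenes   = c(1941, 1599, 2017, 1808, 1823),
  nCoreGenes = c(1572, 1374, 1712, 1590, 1469)
)
df$Ratio <- df$nCoreGenes / df$AvgGenes

scale_factor <- 4000                       # lifts ratios (~0.8) to ~3200
on_primary <- df$Ratio * scale_factor      # values drawn on the primary axis
back_to_percent <- on_primary / 40         # what the secondary axis displays

# Dividing by 40 instead of 4000 leaves a factor of 100, i.e. a percentage
print(round(back_to_percent, 1))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df, ggplot2::aes(x = AvgGenes, y = nCoreGenes)) +
    ggplot2::geom_point() +
    # horizontal line at the mean ratio, rescaled to the primary axis
    ggplot2::geom_hline(yintercept = mean(df$Ratio) * scale_factor,
                        linetype = "dashed") +
    ggplot2::scale_y_continuous(
      sec.axis = ggplot2::sec_axis(~ . / 40, name = "Core genes (%)")
    )
}
```

Any scale factor works as long as the formula given to sec_axis undoes it; 4000 is simply chosen so that the rescaled ratios fall within the range of nCoreGenes.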

Converting SAS Proc Shewhart into R programming

I have a data set on which SAS PROC SHEWHART has been run, and I want to implement the same analysis in R. Below are the data and the SAS code:
valueid date dis_id sales_amount yymm (year month)
868 5-Mar-18 2 956 1803
868 6-Apr-17 2 473 1704
868 22-Dec-16 2 524 1612
914 17-Dec-15 2 1768 1512
914 18-Aug-16 2 477 1608
914 12-Jan-17 2 804 1701
870 1-May-17 2 1373 1705
870 8-Sep-17 2 323 1709
870 29-Feb-16 2 1718 1602
870 26-Jan-16 2 1242 1601
870 1-Apr-16 2 995 1604
800 22-Apr-16 2 356 1604
925 10-May-16 2 1487 1605
928 30-May-16 2 1210 1605
928 29-Jun-16 2 1935 1606
928 28-Nov-16 2 1149 1611
928 13-Dec-16 2 835 1612
987 10-Jul-17 2 1023 1707
987 27-Jul-17 2 389 1707
987 22-Sep-17 2 1191 1709
Below is the program used to implement XSCHART:
proc shewhart data=sales_revenue;
by valueid;
xschart sales_amount*yymm/ nochart outtable= newoutput;
id dis_id;
run;
I need to convert this PROC SHEWHART call with XSCHART into R. Kindly help me.

How to merge bins in R

So, I am trying to merge bins of a histogram whenever the number of observations in a bin is less than 6.
library(fitdistrplus)
mydata <-read.csv("Book2.csv",stringsAsFactors=FALSE)
QF3<-as.numeric(mydata[,1])
histrv<-hist(QF3,breaks="FD")
binvec<-data.frame(diff(histrv$breaks))
binbreak=histrv$breaks
freq<-histrv$count
datmean=as.numeric(mean(QF3))
datsigma=as.numeric(sd(QF3))
templist<-as.numeric()#empty list
for (i in 1:nrow(binvec)){
templist[i]=pnorm(binbreak[i+1],datmean,datsigma)-pnorm(binbreak[i],datmean,datsigma)
}
pi<-data.frame(templist)
chisqvec<-(freq-length(QF3)*pi)^2/(length(QF3)*pi)
xstat=sum(chisqvec)
The above code produces a histogram in which five bins contain fewer than 6 observations: 6000-7000, 7000-8000, 8000-9000, 9000-10000, and 10000-11000, with 2, 5, 2, 2, and 1 observations respectively. I would like to merge these bins so that they have more than 5 observations.
In other words, I would like to have the two bins 6000-8000 and 8000-11000 so that they can contain 7 observations and 5 observations.
Does anyone have any clue on how to approach this problem?
QF3 looks like the following:
> QF3
[1] 2016 1425 2000 785 823 2484 1870 770 1220 3454 1056 2745 2830
[14] 950 601 1245 2663 1500 1717 1070 1704 2517 1090 3310 3389 2200
[27] 882 2113 600 1900 4417 745 530 1630 1600 4530 948 2764 2202
[40] 1052 2685 1120 1275 2300 1590 1935 3957 4283 3215 5684 4092 7548
[53] 4547 3510 3063 5549 6460 5204 4626 4965 5023 8111 5525 4804 5994
[66] 8471 4767 7142 3420 4061 5102 9135 3861 5372 7274 5054 7318 3791
[79] 4901 3549 4758 4859 10190 5609 7624 5841 4908 4974 6691 5713 3235
[92] 4464 2656 4399 9581 3993 4061
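One possible way to merge sparse bins, sketched under assumptions: walk through the bins from left to right, accumulate counts until the running total reaches the threshold, and fold any leftover at the right end into the last merged bin. The function name merge_sparse_bins and the counts for the first six bins are hypothetical; only the last five counts (2, 5, 2, 2, 1) come from the question.

```r
# Merge adjacent histogram bins until each merged bin has >= min_count
merge_sparse_bins <- function(breaks, counts, min_count = 6) {
  stopifnot(length(breaks) == length(counts) + 1)
  new_breaks <- breaks[1]
  new_counts <- numeric(0)
  running <- 0
  for (i in seq_along(counts)) {
    running <- running + counts[i]
    if (running >= min_count) {          # enough observations: close the bin
      new_breaks <- c(new_breaks, breaks[i + 1])
      new_counts <- c(new_counts, running)
      running <- 0
    }
  }
  last <- breaks[length(breaks)]
  if (new_breaks[length(new_breaks)] != last) {
    if (length(new_counts) == 0) {       # threshold never reached: one bin
      new_breaks <- c(new_breaks, last)
      new_counts <- running
    } else {                             # fold leftover into the last bin
      new_breaks[length(new_breaks)] <- last
      k <- length(new_counts)
      new_counts[k] <- new_counts[k] + running
    }
  }
  list(breaks = new_breaks, counts = new_counts)
}

# 1000-wide bins; the first six counts are illustrative, not real data
breaks <- seq(0, 11000, by = 1000)
counts <- c(10, 12, 9, 8, 11, 9, 2, 5, 2, 2, 1)
merged <- merge_sparse_bins(breaks, counts, min_count = 6)
print(merged)
```

With a threshold of 6, the trailing bins holding 2, 2 and 1 observations never reach the threshold on their own, so this rule produces a single 6000-11000 bin rather than the 6000-8000 and 8000-11000 split. The merged breaks can then be passed back as hist(QF3, breaks = merged$breaks) to recompute the counts from the raw data.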

which.min not working correctly inside sapply on data-frame?

Can anyone explain this strange behavior found when trying to use sapply and which.min to find the first lines inside a dataframe satisfying a condition?
The dataframe is trApr; it's sorted by customer_id (increasing) and then transaction visit_date (increasing). For each customer_id, we want to find the row-index of the first transaction in trApr. (There is a variable number of transactions overall per customer_id, that should not matter.)
trApr is 'data.frame': 2195716 obs. of 3 variables:
$ customer_id: int 2 2 2 2 2 2 2 2 2 2 ...
$ visit_date : Date, format: "2011-04-02" "2011-04-06" "2011-04-07" "2011-04-08" ...
$ visit_spend: num 37.12 32.51 4.55 31.35 42.49 ...
Other notes on the code:
all_tr_cids is simply the list of sorted, unique customer_ids: unique(trApr$customer_id)
n:m are just indices I used for taking a tiny slice of the dataframe, while debugging. But I want to do sapply on the entire d.f.
Here's the code in question:
**GOOD:** I <- sapply(all_tr_cids[n:m], function(cid){ head(which(trApr$customer_id==cid),n=1) }, USE.NAMES=FALSE)
[1] 1909 1928 1964 1970 1988 2037 2092 2113 2140 2182
**BAD:** I <- sapply(all_tr_cids[n:m], function(cid){ which.min(trApr$customer_id==cid) }, USE.NAMES=FALSE)
[1] 1 1 1 1 1 1 1 1 1 1
The intermediate ragged object returned by sapply is below (it's a list of 10 integer vectors of varying length).
If which.min can't handle that sort of structure, it really should raise a warning, not merrily return a list of 1's.
sapply(all_tr_cids[n:m], function(cid){ which(trApr$customer_id==cid) }, USE.NAMES=FALSE)
[[1]]
[1] 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927
[[2]]
[1] 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957
[31] 1958 1959 1960 1961 1962 1963
[[3]]
[1] 1964 1965 1966 1967 1968 1969
[[4]]
[1] 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
[[5]]
[1] 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
[31] 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036
[[6]]
[1] 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066
[31] 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091
[[7]]
[1] 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112
[[8]]
[1] 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139
[[9]]
[1] 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169
[31] 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181
[[10]]
[1] 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210
I think you are misusing the which.min function. Given a vector of numeric values, it returns the index of the first minimum encountered, but here you are giving it a logical vector trApr$customer_id==cid, which is coerced to numeric as 0/1, so the first FALSE value encountered is a minimum.
See the doc page for further details on which.min : http://stat.ethz.ch/R-manual/R-devel/library/base/html/which.min.html
It is your use of which.min() which is at fault. You are supplying a logical vector (one containing TRUE and FALSE). This is patently not numeric data, so R coerces the logical to numerics with FALSE equal to 0 and TRUE equal to 1. So you are in effect doing:
R> which.min(c(TRUE,FALSE,FALSE,FALSE,FALSE))
[1] 2
R> which.min(c(FALSE,TRUE,FALSE,FALSE,FALSE,FALSE))
[1] 1
As such, which.min() returns the first of the tied minimum values, in this case the first FALSE encountered. Hence all the 1s being returned in your example. For the elements shown, the customer ID that matches was not in the first element of the object compared.
You want something like:
min(which(trApr$customer_id == cid))
where which() returns the row indices at which customer_id matches the current cid, and min() then picks the first of them. This would be easier with with():
with(trApr, min(which(customer_id == cid)))
Both of which assume that cid is available/accessible; i.e. you've created it first.
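The coercion described in the answers can be reproduced on a small vector. The sketch below uses a made-up ids vector standing in for trApr$customer_id and compares a few base-R ways of finding the first matching row; using match() on all unique ids at once is offered as one possible alternative to the sapply loop, not as the original poster's method.

```r
# Made-up, sorted customer ids standing in for trApr$customer_id
ids <- c(2, 2, 2, 5, 5, 7, 7, 7)
cid <- 5

# The logical vector is coerced to 0/1, so which.min() finds the first FALSE
first_false <- which.min(ids == cid)

# which.max() finds the first TRUE, i.e. the first matching row
first_true <- which.max(ids == cid)

# Equivalent: take the smallest of the matching indices
first_idx <- min(which(ids == cid))

# match() returns the position of the first match, and it is vectorised,
# so the first row of every customer can be found in a single call
first_rows <- match(unique(ids), ids)

print(c(first_false, first_true, first_idx))
print(first_rows)
```

Note that which.max() returns 1 when there is no match at all (every element is FALSE), whereas match() returns NA in that case, which is usually easier to detect.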
