predict.randomForest not found - r

I am using R (RStudio) and the randomForest package. I used the following code:
rf = randomForest(y ~ x1 + x2 +...)
This worked fine. Then I tried to use the predict.randomForest function and ran into a problem. R gave me the following message:
Error: could not find function "predict.randomForest"
When I go to the randomForest help page (??randomForest), it shows me that there is such a function as predict.randomForest, and yet I can't call it. What is going on here? I checked to see if there was an update available to the randomForest package and there is none.
Additionally, the plot.randomForest() function is not found either.

predict.randomForest and plot.randomForest are S3 methods that the randomForest package registers but does not export, so you can't call them by name. Just call the generics predict() and plot() and R will dispatch to the right method, as in this example from ?randomForest:
require(randomForest)
set.seed(17)
x <- matrix(runif(5e2), 100)
y <- gl(2, 50)
myrf <- randomForest(x, y)
predict(myrf, x)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
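If you want to confirm that the methods really exist, or access one directly, base R's S3 introspection tools work (a quick check; nothing here is specific to your model):
library(randomForest)
methods(class = "randomForest")          # lists plot.randomForest, predict.randomForest, ...
getS3method("predict", "randomForest")   # retrieves the method even though it is not exported
# randomForest:::predict.randomForest    # triple-colon access also works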
You can also have a look at MDSplot() with this example from the same source:
set.seed(17)
iris.urf <- randomForest(iris[, -5])
MDSplot(iris.urf, iris$Species)
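MDSplot() runs classical multidimensional scaling (cmdscale) on the forest's proximity matrix and plots the resulting coordinates, here colored by iris$Species.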

finding associations in dataset with list of string data in each cell in R

I am looking for a method to find the association between words in a table (or list). Each cell of the table contains several words separated by ";".
Let's say I have a table as below; several words, such as 'af' and 'aa', can belong to one cell.
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
I want to find associations between all words in the entire dataset, across cells (I am not interested in within-cell associations). For example: how many times do aa and dd appear in one row, or which words have the highest association (e.g. aa with bb, aa with dd, ...)?
Expected output (the numbers can be inaccurate, and the associations do not have to be shown with '--'):
2 pairs association (numbers can be counts, probability or normalized association)
association number of associations
aa--dd 3
aa--c 3
bb--dd 2
...
3 pairs association
aa--bb--dd 3
aa--bb--c 3
...
4 pairs association
aa--bb--c--dd 2
aa--bf--c--dd 2
...
Can you help me implement this in R?
Thanks!
I am not sure if you have something like the approach below in mind. It is basically a custom function used in a nested purrr::map call: the outer call loops over the number of pairs (2, 3, 4), and the inner call uses combn to create all possible column combinations as input, applying the custom function to each combination to create the desired output.
library(tidyverse)
count_pairs <- function(x) {
  s <- seq(x)
  df[, x] %>%
    # split the ";"-separated cells of each selected column into rows
    reduce(s, separate_rows, .init = ., sep = ";") %>%
    group_by(across()) %>%
    count() %>%
    rename(set_names(s))
}

map(2:4,
    ~ map_dfr(combn(1:4, .x, simplify = FALSE),
              count_pairs) %>%
      arrange(-n))
#> [[1]]
#> # A tibble: 50 x 3
#> # Groups: 1, 2 [50]
#> `1` `2` n
#> <chr> <chr> <int>
#> 1 aa dd 4
#> 2 aa bf 3
#> 3 aa c 3
#> 4 bf dd 3
#> 5 c dd 3
#> 6 aa bb 2
#> 7 af bf 2
#> 8 az bf 2
#> 9 aa cc 2
#> 10 af cc 2
#> # ... with 40 more rows
#>
#> [[2]]
#> # A tibble: 70 x 4
#> # Groups: 1, 2, 3 [70]
#> `1` `2` `3` n
#> <chr> <chr> <chr> <int>
#> 1 aa bf dd 3
#> 2 aa c dd 3
#> 3 aa bb c 2
#> 4 aa bf c 2
#> 5 aa bf cc 2
#> 6 af bf cc 2
#> 7 az bf c 2
#> 8 aa bb dd 2
#> 9 af bf dd 2
#> 10 az bf dd 2
#> # ... with 60 more rows
#>
#> [[3]]
#> # A tibble: 35 x 5
#> # Groups: 1, 2, 3, 4 [35]
#> `1` `2` `3` `4` n
#> <chr> <chr> <chr> <chr> <int>
#> 1 aa bb c dd 2
#> 2 aa bf c dd 2
#> 3 aa bf cc dd 2
#> 4 af bf cc dd 2
#> 5 az bf c dd 2
#> 6 aa bb c df 1
#> 7 aa bb cc dd 1
#> 8 aa bb cc df 1
#> 9 aa bb cd dd 1
#> 10 aa bc c dc 1
#> # ... with 25 more rows
# the data
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
Created on 2021-08-11 by the reprex package (v2.0.1)
You can try the base R code below, which splits each row into its set of words and uses combn to enumerate all k-word combinations before tabulating them:
lst <- apply(df, 1, function(x) sort(unlist(strsplit(x, ";"))))  # sorted words per row
res <- lapply(
  2:4,
  function(k) {
    setNames(
      data.frame(
        table(
          unlist(
            sapply(
              lst,
              # all k-word combinations within a row, joined as e.g. "aa--dd"
              function(v) combn(v, k, paste0, collapse = "--")
            )
          )
        )
      ),
      c("association", "count")
    )
  }
)
and you will see that
> res
[[1]]
association count
1 aa--af 2
2 aa--ah 1
3 aa--al 1
4 aa--az 2
5 aa--bb 2
6 aa--bc 1
7 aa--bf 3
8 aa--c 3
9 aa--cc 2
10 aa--cd 1
11 aa--dc 1
12 aa--dd 4
13 aa--df 1
14 af--az 1
15 af--bb 1
16 af--bf 2
17 af--c 1
18 af--cc 2
19 af--dd 2
20 af--df 1
21 ah--al 1
22 ah--bb 1
23 ah--c 1
24 ah--cd 1
25 ah--dd 1
26 al--bb 1
27 al--c 1
28 al--cd 1
29 al--dd 1
30 az--bb 1
31 az--bc 1
32 az--bf 2
33 az--c 2
34 az--cc 1
35 az--dc 1
36 az--dd 2
37 az--df 1
38 bb--bf 1
39 bb--c 2
40 bb--cc 1
41 bb--cd 1
42 bb--dd 2
43 bb--df 1
44 bc--bf 1
45 bc--c 1
46 bc--dc 1
47 bc--dd 1
48 bf--c 2
49 bf--cc 2
50 bf--dc 1
51 bf--dd 3
52 bf--df 1
53 c--cc 1
54 c--cd 1
55 c--dc 1
56 c--dd 3
57 c--df 1
58 cc--dd 2
59 cc--df 1
60 cd--dd 1
61 dc--dd 1
62 dd--df 1
[[2]]
association count
1 aa--af--az 1
2 aa--af--bb 1
3 aa--af--bf 2
4 aa--af--c 1
5 aa--af--cc 2
6 aa--af--dd 2
7 aa--af--df 1
8 aa--ah--al 1
9 aa--ah--bb 1
10 aa--ah--c 1
11 aa--ah--cd 1
12 aa--ah--dd 1
13 aa--al--bb 1
14 aa--al--c 1
15 aa--al--cd 1
16 aa--al--dd 1
17 aa--az--bb 1
18 aa--az--bc 1
19 aa--az--bf 2
20 aa--az--c 2
21 aa--az--cc 1
22 aa--az--dc 1
23 aa--az--dd 2
24 aa--az--df 1
25 aa--bb--bf 1
26 aa--bb--c 2
27 aa--bb--cc 1
28 aa--bb--cd 1
29 aa--bb--dd 2
30 aa--bb--df 1
31 aa--bc--bf 1
32 aa--bc--c 1
33 aa--bc--dc 1
34 aa--bc--dd 1
35 aa--bf--c 2
36 aa--bf--cc 2
37 aa--bf--dc 1
38 aa--bf--dd 3
39 aa--bf--df 1
40 aa--c--cc 1
41 aa--c--cd 1
42 aa--c--dc 1
43 aa--c--dd 3
44 aa--c--df 1
45 aa--cc--dd 2
46 aa--cc--df 1
47 aa--cd--dd 1
48 aa--dc--dd 1
49 aa--dd--df 1
50 af--az--bb 1
51 af--az--bf 1
52 af--az--c 1
53 af--az--cc 1
54 af--az--dd 1
55 af--az--df 1
56 af--bb--bf 1
57 af--bb--c 1
58 af--bb--cc 1
59 af--bb--dd 1
60 af--bb--df 1
61 af--bf--c 1
62 af--bf--cc 2
63 af--bf--dd 2
64 af--bf--df 1
65 af--c--cc 1
66 af--c--dd 1
67 af--c--df 1
68 af--cc--dd 2
69 af--cc--df 1
70 af--dd--df 1
71 ah--al--bb 1
72 ah--al--c 1
73 ah--al--cd 1
74 ah--al--dd 1
75 ah--bb--c 1
76 ah--bb--cd 1
77 ah--bb--dd 1
78 ah--c--cd 1
79 ah--c--dd 1
80 ah--cd--dd 1
81 al--bb--c 1
82 al--bb--cd 1
83 al--bb--dd 1
84 al--c--cd 1
85 al--c--dd 1
86 al--cd--dd 1
87 az--bb--bf 1
88 az--bb--c 1
89 az--bb--cc 1
90 az--bb--dd 1
91 az--bb--df 1
92 az--bc--bf 1
93 az--bc--c 1
94 az--bc--dc 1
95 az--bc--dd 1
96 az--bf--c 2
97 az--bf--cc 1
98 az--bf--dc 1
99 az--bf--dd 2
100 az--bf--df 1
101 az--c--cc 1
102 az--c--dc 1
103 az--c--dd 2
104 az--c--df 1
105 az--cc--dd 1
106 az--cc--df 1
107 az--dc--dd 1
108 az--dd--df 1
109 bb--bf--c 1
110 bb--bf--cc 1
111 bb--bf--dd 1
112 bb--bf--df 1
113 bb--c--cc 1
114 bb--c--cd 1
115 bb--c--dd 2
116 bb--c--df 1
117 bb--cc--dd 1
118 bb--cc--df 1
119 bb--cd--dd 1
120 bb--dd--df 1
121 bc--bf--c 1
122 bc--bf--dc 1
123 bc--bf--dd 1
124 bc--c--dc 1
125 bc--c--dd 1
126 bc--dc--dd 1
127 bf--c--cc 1
128 bf--c--dc 1
129 bf--c--dd 2
130 bf--c--df 1
131 bf--cc--dd 2
132 bf--cc--df 1
133 bf--dc--dd 1
134 bf--dd--df 1
135 c--cc--dd 1
136 c--cc--df 1
137 c--cd--dd 1
138 c--dc--dd 1
139 c--dd--df 1
140 cc--dd--df 1
[[3]]
association count
1 aa--af--az--bb 1
2 aa--af--az--bf 1
3 aa--af--az--c 1
4 aa--af--az--cc 1
5 aa--af--az--dd 1
6 aa--af--az--df 1
7 aa--af--bb--bf 1
8 aa--af--bb--c 1
9 aa--af--bb--cc 1
10 aa--af--bb--dd 1
11 aa--af--bb--df 1
12 aa--af--bf--c 1
13 aa--af--bf--cc 2
14 aa--af--bf--dd 2
15 aa--af--bf--df 1
16 aa--af--c--cc 1
17 aa--af--c--dd 1
18 aa--af--c--df 1
19 aa--af--cc--dd 2
20 aa--af--cc--df 1
21 aa--af--dd--df 1
22 aa--ah--al--bb 1
23 aa--ah--al--c 1
24 aa--ah--al--cd 1
25 aa--ah--al--dd 1
26 aa--ah--bb--c 1
27 aa--ah--bb--cd 1
28 aa--ah--bb--dd 1
29 aa--ah--c--cd 1
30 aa--ah--c--dd 1
31 aa--ah--cd--dd 1
32 aa--al--bb--c 1
33 aa--al--bb--cd 1
34 aa--al--bb--dd 1
35 aa--al--c--cd 1
36 aa--al--c--dd 1
37 aa--al--cd--dd 1
38 aa--az--bb--bf 1
39 aa--az--bb--c 1
40 aa--az--bb--cc 1
41 aa--az--bb--dd 1
42 aa--az--bb--df 1
43 aa--az--bc--bf 1
44 aa--az--bc--c 1
45 aa--az--bc--dc 1
46 aa--az--bc--dd 1
47 aa--az--bf--c 2
48 aa--az--bf--cc 1
49 aa--az--bf--dc 1
50 aa--az--bf--dd 2
51 aa--az--bf--df 1
52 aa--az--c--cc 1
53 aa--az--c--dc 1
54 aa--az--c--dd 2
55 aa--az--c--df 1
56 aa--az--cc--dd 1
57 aa--az--cc--df 1
58 aa--az--dc--dd 1
59 aa--az--dd--df 1
60 aa--bb--bf--c 1
61 aa--bb--bf--cc 1
62 aa--bb--bf--dd 1
63 aa--bb--bf--df 1
64 aa--bb--c--cc 1
65 aa--bb--c--cd 1
66 aa--bb--c--dd 2
67 aa--bb--c--df 1
68 aa--bb--cc--dd 1
69 aa--bb--cc--df 1
70 aa--bb--cd--dd 1
71 aa--bb--dd--df 1
72 aa--bc--bf--c 1
73 aa--bc--bf--dc 1
74 aa--bc--bf--dd 1
75 aa--bc--c--dc 1
76 aa--bc--c--dd 1
77 aa--bc--dc--dd 1
78 aa--bf--c--cc 1
79 aa--bf--c--dc 1
80 aa--bf--c--dd 2
81 aa--bf--c--df 1
82 aa--bf--cc--dd 2
83 aa--bf--cc--df 1
84 aa--bf--dc--dd 1
85 aa--bf--dd--df 1
86 aa--c--cc--dd 1
87 aa--c--cc--df 1
88 aa--c--cd--dd 1
89 aa--c--dc--dd 1
90 aa--c--dd--df 1
91 aa--cc--dd--df 1
92 af--az--bb--bf 1
93 af--az--bb--c 1
94 af--az--bb--cc 1
95 af--az--bb--dd 1
96 af--az--bb--df 1
97 af--az--bf--c 1
98 af--az--bf--cc 1
99 af--az--bf--dd 1
100 af--az--bf--df 1
101 af--az--c--cc 1
102 af--az--c--dd 1
103 af--az--c--df 1
104 af--az--cc--dd 1
105 af--az--cc--df 1
106 af--az--dd--df 1
107 af--bb--bf--c 1
108 af--bb--bf--cc 1
109 af--bb--bf--dd 1
110 af--bb--bf--df 1
111 af--bb--c--cc 1
112 af--bb--c--dd 1
113 af--bb--c--df 1
114 af--bb--cc--dd 1
115 af--bb--cc--df 1
116 af--bb--dd--df 1
117 af--bf--c--cc 1
118 af--bf--c--dd 1
119 af--bf--c--df 1
120 af--bf--cc--dd 2
121 af--bf--cc--df 1
122 af--bf--dd--df 1
123 af--c--cc--dd 1
124 af--c--cc--df 1
125 af--c--dd--df 1
126 af--cc--dd--df 1
127 ah--al--bb--c 1
128 ah--al--bb--cd 1
129 ah--al--bb--dd 1
130 ah--al--c--cd 1
131 ah--al--c--dd 1
132 ah--al--cd--dd 1
133 ah--bb--c--cd 1
134 ah--bb--c--dd 1
135 ah--bb--cd--dd 1
136 ah--c--cd--dd 1
137 al--bb--c--cd 1
138 al--bb--c--dd 1
139 al--bb--cd--dd 1
140 al--c--cd--dd 1
141 az--bb--bf--c 1
142 az--bb--bf--cc 1
143 az--bb--bf--dd 1
144 az--bb--bf--df 1
145 az--bb--c--cc 1
146 az--bb--c--dd 1
147 az--bb--c--df 1
148 az--bb--cc--dd 1
149 az--bb--cc--df 1
150 az--bb--dd--df 1
151 az--bc--bf--c 1
152 az--bc--bf--dc 1
153 az--bc--bf--dd 1
154 az--bc--c--dc 1
155 az--bc--c--dd 1
156 az--bc--dc--dd 1
157 az--bf--c--cc 1
158 az--bf--c--dc 1
159 az--bf--c--dd 2
160 az--bf--c--df 1
161 az--bf--cc--dd 1
162 az--bf--cc--df 1
163 az--bf--dc--dd 1
164 az--bf--dd--df 1
165 az--c--cc--dd 1
166 az--c--cc--df 1
167 az--c--dc--dd 1
168 az--c--dd--df 1
169 az--cc--dd--df 1
170 bb--bf--c--cc 1
171 bb--bf--c--dd 1
172 bb--bf--c--df 1
173 bb--bf--cc--dd 1
174 bb--bf--cc--df 1
175 bb--bf--dd--df 1
176 bb--c--cc--dd 1
177 bb--c--cc--df 1
178 bb--c--cd--dd 1
179 bb--c--dd--df 1
180 bb--cc--dd--df 1
181 bc--bf--c--dc 1
182 bc--bf--c--dd 1
183 bc--bf--dc--dd 1
184 bc--c--dc--dd 1
185 bf--c--cc--dd 1
186 bf--c--cc--df 1
187 bf--c--dc--dd 1
188 bf--c--dd--df 1
189 bf--cc--dd--df 1
190 c--cc--dd--df 1
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
library(dplyr)  # needed below for bind_rows()

df <- apply(df, 2, function(x) gsub(";", " ", x))
df2 <- as.data.frame(apply(df, 2, function(x) paste(x[1:4], sep = ",")))
df3 <- splitstackshape::cSplit(df2, c("A", "B", "C", "D"), " ")
df4 <- as.data.frame(t(df3))
A <- expand.grid(df4[, 1], df4[, 1])
B <- expand.grid(df4[, 2], df4[, 2])
C <- expand.grid(df4[, 3], df4[, 3])
D <- expand.grid(df4[, 4], df4[, 4])
DF_list <- bind_rows(A, B, C, D)
DF_list <- DF_list[DF_list$Var1 != DF_list$Var2, ]   # drop self-pairs
DF_list$Var1 <- as.character(DF_list$Var1)
DF_list$Var2 <- as.character(DF_list$Var2)
DF_list <- DF_list[nchar(DF_list$Var1) > 0, ]        # drop empty tokens
DF_list <- DF_list[nchar(DF_list$Var2) > 0, ]
pair_df <- data.frame("pairs" = paste(DF_list[["Var1"]], "--", DF_list[["Var2"]], sep = ""))
pairs <- data.frame("combos" = table(pair_df$pairs))
####output###
combos.Var1 combos.Freq
1 aa--af 2
2 aa--ah 1
3 aa--al 1
4 aa--az 2
5 aa--bb 2
6 aa--bc 1
7 aa--bf 3
8 aa--c 3
9 aa--cc 2
10 aa--cd 1
11 aa--dc 1
12 aa--dd 4
13 aa--df 1
14 af--aa 2
15 af--az 1
16 af--bb 1
17 af--bf 2
18 af--c 1
19 af--cc 2
20 af--dd 2
21 af--df 1
22 ah--aa 1
23 ah--al 1
24 ah--bb 1
25 ah--c 1
26 ah--cd 1
27 ah--dd 1
28 al--aa 1
29 al--ah 1
30 al--bb 1
31 al--c 1
32 al--cd 1
33 al--dd 1
34 az--aa 2
35 az--af 1
36 az--bb 1
37 az--bc 1
38 az--bf 2
39 az--c 2
40 az--cc 1
41 az--dc 1
42 az--dd 2
43 az--df 1
44 bb--aa 2
45 bb--af 1
46 bb--ah 1
47 bb--al 1
48 bb--az 1
49 bb--bf 1
50 bb--c 2
51 bb--cc 1
52 bb--cd 1
53 bb--dd 2
54 bb--df 1
55 bc--aa 1
56 bc--az 1
57 bc--bf 1
58 bc--c 1
59 bc--dc 1
60 bc--dd 1
61 bf--aa 3
62 bf--af 2
63 bf--az 2
64 bf--bb 1
65 bf--bc 1
66 bf--c 2
67 bf--cc 2
68 bf--dc 1
69 bf--dd 3
70 bf--df 1
71 c--aa 3
72 c--af 1
73 c--ah 1
74 c--al 1
75 c--az 2
76 c--bb 2
77 c--bc 1
78 c--bf 2
79 c--cc 1
80 c--cd 1
81 c--dc 1
82 c--dd 3
83 c--df 1
84 cc--aa 2
85 cc--af 2
86 cc--az 1
87 cc--bb 1
88 cc--bf 2
89 cc--c 1
90 cc--dd 2
91 cc--df 1
92 cd--aa 1
93 cd--ah 1
94 cd--al 1
95 cd--bb 1
96 cd--c 1
97 cd--dd 1
98 dc--aa 1
99 dc--az 1
100 dc--bc 1
101 dc--bf 1
102 dc--c 1
103 dc--dd 1
104 dd--aa 4
105 dd--af 2
106 dd--ah 1
107 dd--al 1
108 dd--az 2
109 dd--bb 2
110 dd--bc 1
111 dd--bf 3
112 dd--c 3
113 dd--cc 2
114 dd--cd 1
115 dd--dc 1
116 dd--df 1
117 df--aa 1
118 df--af 1
119 df--az 1
120 df--bb 1
121 df--bf 1
122 df--c 1
123 df--cc 1
124 df--dd 1
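Note that expand.grid generates ordered pairs, so every association is counted in both directions here (e.g. aa--af and af--aa each appear), unlike the combination-based outputs of the answers above.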

Finding the k-largest clusters in dbscan result

I have a dataframe df consisting of 2 columns: x and y coordinates.
Each row refers to a point.
I feed it into the dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see that cluster 2 consists of 8186 seed points (core points), cluster 1 consists of 2634 seed points, and cluster 5 consists of 561 seed points.
I define the largest cluster as the one containing the largest number of seed points. So, in this case, the largest cluster is cluster 2, and the 1st, 2nd, and 3rd largest clusters are 2, 1 and 5.
Is there any direct way to return the rows (points) in the largest cluster, or the k-largest cluster in general?
I can do it in an indirect way:
1. Obtain the assigned cluster number of each point via db$cluster.
2. Create a new dataframe df2 with db$cluster as an additional column besides the original x and y columns.
3. Aggregate df2 by the cluster numbers in the third column and count the points in each cluster.
4. Find the k-largest groups, which are 2, 1 and 5 again.
5. Select the rows in df2 whose third-column value equals 2 to return the points in the largest cluster.
But the above approach re-computes many known results as stated in the summary of print(db).
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
cluster_k <- function(dbs, data, k){
  # label of the cluster with the k-th largest total size (including noise cluster 0)
  kth <- names(rev(sort(table(dbs$cluster)))[k])
  data[dbs$cluster == kth, ]
}
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows
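If you want "largest" measured strictly by seed (core) points, as defined in the question, a small variation is possible because the fpc result carries the isseed flag; cluster_k_seed below is a hypothetical helper, a sketch rather than a tested solution:
cluster_k_seed <- function(dbs, data, k) {
  # count only core points per cluster; noise points are never seeds,
  # so cluster 0 drops out of the table automatically
  seeds <- table(dbs$cluster[dbs$isseed])
  kth <- names(sort(seeds, decreasing = TRUE))[k]
  data[dbs$cluster == kth, ]
}
cluster_k_seed(dbs = dbs, data = df, k = 1)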

'x' must be numeric error when trying to plot a histogram [duplicate]

This question already has answers here:
R cannot use hist() because "content not numeric" due to negative decimal numbers?
(2 answers)
Closed 3 years ago.
I am trying to plot a histogram. However, even though all the values appear to be numeric or NA, when I try to run hist() it still returns an error. Any help would be appreciated.
corruption <- read.csv("Corruption.csv")
corruption[ corruption == "-" ] <- NA
hist(corruption$X2015)
I suspect it has something to do with the presence of the '-' character. When I use table(corruption$X2015), this is the output:
- 11 12 15 16 17 18 19 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 44 45
0 1 1 2 2 3 4 1 3 3 1 1 6 3 6 7 4 2 5 5 4 4 4 7 5 7 4 1 2 3 5 1
46 47 49 50 51 52 53 54 55 56 58 60 61 62 63 65 70 71 74 75 76 77 79 8 81 83 85 86 87 88 89 90
2 2 1 1 4 2 3 1 4 3 1 1 3 2 2 1 4 1 1 3 2 1 2 2 3 1 1 1 2 1 1 1
91
1
Because of the "-" entries, read.csv read the column as a factor (or character) rather than numeric. Convert X2015 to numeric, which will automatically change the non-numeric values to NA:
corruption$X2015 <- as.numeric(as.character(corruption$X2015))
You can then use hist
hist(corruption$X2015)
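Alternatively, assuming "-" is the only non-numeric marker in the file, you can declare it as an NA string when reading, so the column arrives as numeric in the first place (a sketch; adjust na.strings to your data):
corruption <- read.csv("Corruption.csv", na.strings = c("NA", "-"))
hist(corruption$X2015)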

How can I avoid right-truncated subjects being dropped?

I'm doing a survival analysis about the time some individual components remain in the source code of a software project, but some of these components are being dropped by the survfit function.
This is what I'm doing:
library(survival)
data <- read.table(text = "component_id weeks removed
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 2 0
9 2 0
10 2 0
11 2 0
12 2 1
13 2 1
14 2 0
15 2 0
16 2 0
17 2 0
18 2 0
19 2 0
20 2 1
21 2 1
22 2 0
23 2 0
24 3 1
25 3 1
26 3 1
27 3 1
28 7 1
29 7 1
30 14 1
31 14 1
32 14 1
33 14 1
34 14 1
35 14 1
36 14 1
37 14 1
38 14 1
39 14 1
40 14 1
41 14 1
42 14 1
43 14 1
44 14 1
45 14 1
46 14 1
47 14 1
48 40 1
49 40 1
50 40 1
51 40 1
52 48 1
53 48 1
54 48 1
55 48 1
56 48 1
57 48 1
58 48 1
59 48 1
60 56 1
61 56 1
62 56 1
63 56 1
64 56 1
65 56 1
66 56 1
67 56 1
68 56 1
69 56 1", header = TRUE)
fit <- survfit(Surv(data$weeks, data$removed) ~ 1)
summary(fit, censored=TRUE)
And this is the output
Call: survfit(formula = Surv(data$weeks, data$removed) ~ 1)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 69 7 0.899 0.0363 0.830 0.973
2 62 4 0.841 0.0441 0.758 0.932
3 46 4 0.767 0.0533 0.670 0.879
7 42 2 0.731 0.0567 0.628 0.851
14 40 18 0.402 0.0654 0.292 0.553
40 22 4 0.329 0.0629 0.226 0.478
48 18 8 0.183 0.0520 0.105 0.319
56 10 10 0.000 NaN NA NA
I was expecting the number of events to be 69 but I get 12 subjects dropped.
I initially thought I was misusing the package functions, so I tried a type="interval2" approach, following a similar situation, but the drops keep happening, now with oddly non-integer subject and event counts:
as.t2 <- function(i, data) if (data$removed[i] == 1) data$weeks[i] else NA
size <- length(data$weeks)
t1 <- data$weeks
t2 <- sapply(1:size, as.t2, data = data)
interval_fit <- survfit(Surv(t1, t2, type="interval2") ~ 1)
summary(interval_fit, censored=TRUE)
Next, I found what I call a mid-air explanation, which clarifies the situation a bit further. I understand this is caused by non-censored subjects appearing after a "constant censoring time", but again, why?
That led me somehow to dig deeper and read about right-truncation and realized that type of studies mapped very closely to the drops I'm experiencing. Here's Klein & Moeschberger:
Truncation of survival data occurs when only those individuals whose event time lies within a certain observational window $(Y_L,Y_R)$ are observed. An individual whose event time is not in this interval is not observed and no information on this subject is available to the investigator.
Right truncation occurs when $Y_L$ is equal to zero. That is, we observe the survival time $X$ only when $X \leq Y_R$.
From my perspective, these drops carry important information for my study regardless of their time of entry.
How can I stop the drops?
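A sanity check on the data shown above confirms where the 12 subjects go: they are exactly the removed == 0 rows, which Surv() records as right-censored observations rather than events. They remain in the risk set through week 2 and are censored afterwards, which is why n.risk falls from 62 to 46 between times 2 and 3 (62 - 4 events - 12 censored = 46):
table(data$removed)
#  0  1
# 12 57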

Finding Cumulative Median In R Using Conditions

In my dataframe, how would I create a new variable with the median of Adv. (Advertising) amounts for each SIC group?
As an example:
SIC Adv.
1 65
1 96
1 NA
1 23
2 45
2 23
2 12
3 45
3 NA
3 35
3 6
3 888
4 23
5 656
5 547
6 12
6 32
6 1
Should become:
SIC Adv. SIC.Adv.Median
1 65 65
1 96 65
1 NA 65
1 23 65
2 45 23
2 23 23
2 12 23
3 45 40
3 NA 40
3 35 40
3 6 40
3 888 40
4 23 23
5 656 601.5
5 547 601.5
6 12 12
6 32 12
6 1 12
Any help would be greatly appreciated.
Thank you!
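A minimal sketch with base R's ave(), assuming the data frame is named df and the columns are named SIC and Adv. exactly as shown (NAs are ignored when computing each group's median, and every row of a group, including the NA rows, receives that median):
df$SIC.Adv.Median <- ave(df$Adv., df$SIC, FUN = function(x) median(x, na.rm = TRUE))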
