BingoGame: I want to check the validity of a set of bingo cards with the following parameters - multidimensional-array

I have a problem figuring out how to check whether a random bingo card is valid or not. I have a range of numbers from 1-90 and I want to create 6 cards of 15 numbers each. I want to check that a card has 15 numbers with no duplicates, that the numbers in each row are in ascending order, and that each column has 2 or 3 numbers. The columns of all the cards also have to be in ascending order, and each column must stay in a number range: 1-9, 10-19, ..., 80-90. Here is an example of a valid set of bingo cards:
0 11 20 0 41 54 0 72 0
7 0 23 30 0 56 0 77 0
8 15 0 0 42 0 67 0 86
............................................
2 17 0 0 46 0 65 0 85
4 0 0 36 47 0 66 74 0
0 19 22 0 0 57 0 75 90
............................................
0 0 25 33 0 51 0 76 82
0 12 0 35 0 59 63 0 88
1 0 29 0 40 0 69 78 0
............................................
3 0 24 34 0 0 0 71 84
6 0 0 37 45 0 61 73 0
0 13 26 0 49 55 0 0 89
............................................
0 14 0 32 0 52 60 0 81
5 0 27 39 0 0 62 0 87
0 18 28 0 43 53 0 79 0
............................................
0 10 0 31 0 50 64 0 80
0 16 21 0 44 58 0 70 0
9 0 0 38 48 0 68 0 83
............................................
The zeroes represent the blank spaces. I don't know where to start since I'm a beginner. If you have any ideas or tips that would be helpful, please share them with me and I'll be grateful.
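Since the question asks where to start: one approach is to store each card as a 3 x 9 matrix (0 = blank) and write a small validator for the rules above. The sketch below is only an illustration of that idea, not an authoritative solution; the function and variable names (is_valid_card, col_lo, col_hi) are made up for this example, and the per-column count is kept adjustable because the example cards also contain columns with a single number.
# A minimal sketch of one way to start (not a complete solution).
col_lo <- c(1, seq(10, 80, by = 10))       # lower bound of each column: 1, 10, ..., 80
col_hi <- c(9, seq(19, 79, by = 10), 90)   # upper bound of each column: 9, 19, ..., 79, 90
is_valid_card <- function(card, min_per_col = 1, max_per_col = 3) {
  nums <- card[card != 0]
  ok_count <- length(nums) == 15 && !anyDuplicated(nums)      # 15 numbers, no duplicates
  ok_rows  <- all(apply(card, 1, function(r)                  # each row ascending (blanks ignored)
    !is.unsorted(r[r != 0], strictly = TRUE)))
  ok_cols  <- all(vapply(seq_len(ncol(card)), function(j) {
    v <- card[card[, j] != 0, j]
    length(v) >= min_per_col && length(v) <= max_per_col &&   # column count within bounds
      !is.unsorted(v, strictly = TRUE) &&                     # column ascending top to bottom
      all(v >= col_lo[j] & v <= col_hi[j])                    # column stays in its range
  }, logical(1)))
  ok_count && ok_rows && ok_cols
}
# The first card from the example above:
card1 <- matrix(c(0, 11, 20,  0, 41, 54,  0, 72,  0,
                  7,  0, 23, 30,  0, 56,  0, 77,  0,
                  8, 15,  0,  0, 42,  0, 67,  0, 86),
                nrow = 3, byrow = TRUE)
is_valid_card(card1)   # TRUE under these rules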

Related

How to make case-crossover data in R

Create reproducible data
set.seed(20220105)
begin = as.Date('1994-01-01')
end = as.Date('1994-12-31')
date_seq = seq(from = begin, to = end, by = '1 day')
length = length(date_seq)
death = sample(x = 1:100, size = length, replace = T)
temperature = sample(x = -25:25, size = length, replace = T)
df = data.frame(date = date_seq, death = death, temperature = temperature)
> head(df)
date death temperature
1 1994-01-01 66 20
2 1994-01-02 56 7
3 1994-01-03 33 -9
4 1994-01-04 29 -17
5 1994-01-05 6 0
6 1994-01-06 33 -15
Variable definition
Each day in df can be a case day and also can be a control day.
The case day and control days are matched by day of the week in the same month and in the same year.
Thus, each case has 3 or 4 control days (before and/or after the case day in the same month).
For example, when the case day is 1994-01-01, control days are 1994-01-08, 1994-01-15, 1994-01-22 and 1994-01-29.
When the case day is 1994-01-08, control days are 1994-01-01, 1994-01-15, 1994-01-22 and 1994-01-29.
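As a small aside (not part of the original question), the matching rule can be checked directly in R: the control days are the other days in the same month and year that fall on the same weekday as the case day.
# Enumerate the control days for the case day 1994-01-01 from the example.
case_day   <- as.Date("1994-01-01")
month_days <- seq(as.Date("1994-01-01"), as.Date("1994-01-31"), by = "1 day")
same_wday  <- month_days[format(month_days, "%w") == format(case_day, "%w")]
control_days <- same_wday[same_wday != case_day]
control_days
# [1] "1994-01-08" "1994-01-15" "1994-01-22" "1994-01-29"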
What I needed
I want to create a new df_wanted based on the original df.
df_wanted should contain 5 variables, 3 of which come from the original df: date, death and temperature.
The death and temperature should be the case day's death and temperature.
The other 2 variables are new.
One is status, which indicates whether a day is a case day or a control day.
The other is stratum. This variable works like a group: each group has one case day and three or four control days.
Some of the data in df_wanted should look like this:
df_wanted = data.frame(
  date = c('1994-01-01', '1994-01-08', '1994-01-15', '1994-01-22', '1994-01-29',
           '1994-01-08', '1994-01-01', '1994-01-15', '1994-01-22', '1994-01-29'),
  status = c(1, 0, 0, 0, 0,  1, 0, 0, 0, 0),
  stratum = c(1, 1, 1, 1, 1,  2, 2, 2, 2, 2),
  death = c(66, 66, 66, 66, 66,  1, 1, 1, 1, 1),
  temperature = c(20, 20, 20, 20, 20,  13, 13, 13, 13, 13)
)
> df_wanted
date status stratum death temperature
1 1994-01-01 1 1 66 20
2 1994-01-08 0 1 66 20
3 1994-01-15 0 1 66 20
4 1994-01-22 0 1 66 20
5 1994-01-29 0 1 66 20
6 1994-01-08 1 2 1 13
7 1994-01-01 0 2 1 13
8 1994-01-15 0 2 1 13
9 1994-01-22 0 2 1 13
10 1994-01-29 0 2 1 13
What I have tried
I searched some answers, such as Create control dates in bilateral case crossover design and Create a case control pair for time stratified case-crossover design, but none of the answers meet my need.
Any help will be highly appreciated!
We can full_join the dataset with itself by the same day of week, month and year; in the result, the row whose own id equals the stratum id is the case day (status = 1) and the other matched rows are its controls. Here is a dplyr approach.
library(dplyr)
df %>%
  mutate(
    id = seq_len(n()),               # row id; becomes the stratum of that row's case day
    ymw = format(date, "%Y-%m-%w")   # join key: year, month and day of week
  ) %>%
  full_join(., ., by = "ymw") %>%    # self-join: each day paired with all days sharing the key
  transmute(
    stratum = id.x,
    date = date.y,
    status = +(id.x == id.y),        # 1 for the case day itself, 0 for its control days
    death = death.x,                 # case day's death and temperature repeated on all rows
    temperature = temperature.x
  )
Output
stratum date status death temperature
1 1 1994-01-01 1 66 20
2 1 1994-01-08 0 66 20
3 1 1994-01-15 0 66 20
4 1 1994-01-22 0 66 20
5 1 1994-01-29 0 66 20
6 2 1994-01-02 1 56 7
7 2 1994-01-09 0 56 7
8 2 1994-01-16 0 56 7
9 2 1994-01-23 0 56 7
10 2 1994-01-30 0 56 7
11 3 1994-01-03 1 33 -9
12 3 1994-01-10 0 33 -9
13 3 1994-01-17 0 33 -9
14 3 1994-01-24 0 33 -9
15 3 1994-01-31 0 33 -9
16 4 1994-01-04 1 29 -17
17 4 1994-01-11 0 29 -17
18 4 1994-01-18 0 29 -17
19 4 1994-01-25 0 29 -17
20 5 1994-01-05 1 6 0
21 5 1994-01-12 0 6 0
22 5 1994-01-19 0 6 0
23 5 1994-01-26 0 6 0
24 6 1994-01-06 1 33 -15
25 6 1994-01-13 0 33 -15
26 6 1994-01-20 0 33 -15
27 6 1994-01-27 0 33 -15
28 7 1994-01-07 1 31 21
29 7 1994-01-14 0 31 21
30 7 1994-01-21 0 31 21
31 7 1994-01-28 0 31 21
32 8 1994-01-01 0 1 13
33 8 1994-01-08 1 1 13
34 8 1994-01-15 0 1 13
35 8 1994-01-22 0 1 13
36 8 1994-01-29 0 1 13
37 9 1994-01-02 0 83 4
38 9 1994-01-09 1 83 4
39 9 1994-01-16 0 83 4
40 9 1994-01-23 0 83 4
41 9 1994-01-30 0 83 4
42 10 1994-01-03 0 37 7
43 10 1994-01-10 1 37 7
44 10 1994-01-17 0 37 7
45 10 1994-01-24 0 37 7
46 10 1994-01-31 0 37 7
47 11 1994-01-04 0 94 -18
48 11 1994-01-11 1 94 -18
49 11 1994-01-18 0 94 -18
50 11 1994-01-25 0 94 -18
51 12 1994-01-05 0 46 3
52 12 1994-01-12 1 46 3
53 12 1994-01-19 0 46 3
54 12 1994-01-26 0 46 3
55 13 1994-01-06 0 45 -13
56 13 1994-01-13 1 45 -13
57 13 1994-01-20 0 45 -13
58 13 1994-01-27 0 45 -13
59 14 1994-01-07 0 47 -21
60 14 1994-01-14 1 47 -21
61 14 1994-01-21 0 47 -21
62 14 1994-01-28 0 47 -21
63 15 1994-01-01 0 38 3
64 15 1994-01-08 0 38 3
65 15 1994-01-15 1 38 3
66 15 1994-01-22 0 38 3
67 15 1994-01-29 0 38 3
68 16 1994-01-02 0 96 -25
69 16 1994-01-09 0 96 -25
70 16 1994-01-16 1 96 -25
71 16 1994-01-23 0 96 -25
72 16 1994-01-30 0 96 -25
73 17 1994-01-03 0 99 20
74 17 1994-01-10 0 99 20
75 17 1994-01-17 1 99 20
76 17 1994-01-24 0 99 20
77 17 1994-01-31 0 99 20
78 18 1994-01-04 0 33 -22
79 18 1994-01-11 0 33 -22
80 18 1994-01-18 1 33 -22
81 18 1994-01-25 0 33 -22
82 19 1994-01-05 0 46 10
83 19 1994-01-12 0 46 10
84 19 1994-01-19 1 46 10
85 19 1994-01-26 0 46 10
86 20 1994-01-06 0 60 -2
87 20 1994-01-13 0 60 -2
88 20 1994-01-20 1 60 -2
89 20 1994-01-27 0 60 -2
90 21 1994-01-07 0 43 16
91 21 1994-01-14 0 43 16
92 21 1994-01-21 1 43 16
93 21 1994-01-28 0 43 16
94 22 1994-01-01 0 81 -14
95 22 1994-01-08 0 81 -14
96 22 1994-01-15 0 81 -14
97 22 1994-01-22 1 81 -14
98 22 1994-01-29 0 81 -14
99 23 1994-01-02 0 67 25
100 23 1994-01-09 0 67 25
101 23 1994-01-16 0 67 25
102 23 1994-01-23 1 67 25
103 23 1994-01-30 0 67 25
104 24 1994-01-03 0 31 23
105 24 1994-01-10 0 31 23
106 24 1994-01-17 0 31 23
107 24 1994-01-24 1 31 23
108 24 1994-01-31 0 31 23
109 25 1994-01-04 0 25 0
110 25 1994-01-11 0 25 0
111 25 1994-01-18 0 25 0
112 25 1994-01-25 1 25 0
113 26 1994-01-05 0 51 -21
114 26 1994-01-12 0 51 -21
115 26 1994-01-19 0 51 -21
116 26 1994-01-26 1 51 -21
117 27 1994-01-06 0 37 5
118 27 1994-01-13 0 37 5
119 27 1994-01-20 0 37 5
120 27 1994-01-27 1 37 5
121 28 1994-01-07 0 3 13
122 28 1994-01-14 0 3 13
123 28 1994-01-21 0 3 13
124 28 1994-01-28 1 3 13
125 29 1994-01-01 0 69 -22
126 29 1994-01-08 0 69 -22
127 29 1994-01-15 0 69 -22
128 29 1994-01-22 0 69 -22
129 29 1994-01-29 1 69 -22
130 30 1994-01-02 0 51 12
131 30 1994-01-09 0 51 12
132 30 1994-01-16 0 51 12
133 30 1994-01-23 0 51 12
134 30 1994-01-30 1 51 12
135 31 1994-01-03 0 84 17
136 31 1994-01-10 0 84 17
137 31 1994-01-17 0 84 17
138 31 1994-01-24 0 84 17
139 31 1994-01-31 1 84 17
140 32 1994-02-01 1 10 4
141 32 1994-02-08 0 10 4
142 32 1994-02-15 0 10 4
143 32 1994-02-22 0 10 4
144 33 1994-02-02 1 67 10
145 33 1994-02-09 0 67 10
146 33 1994-02-16 0 67 10
147 33 1994-02-23 0 67 10
148 34 1994-02-03 1 61 -21
149 34 1994-02-10 0 61 -21
150 34 1994-02-17 0 61 -21
151 34 1994-02-24 0 61 -21
152 35 1994-02-04 1 11 7
153 35 1994-02-11 0 11 7
154 35 1994-02-18 0 11 7
155 35 1994-02-25 0 11 7
156 36 1994-02-05 1 15 -21
157 36 1994-02-12 0 15 -21
158 36 1994-02-19 0 15 -21
159 36 1994-02-26 0 15 -21
160 37 1994-02-06 1 78 21
161 37 1994-02-13 0 78 21
162 37 1994-02-20 0 78 21
163 37 1994-02-27 0 78 21
164 38 1994-02-07 1 67 11
165 38 1994-02-14 0 67 11
166 38 1994-02-21 0 67 11
167 38 1994-02-28 0 67 11
168 39 1994-02-01 0 89 -10
169 39 1994-02-08 1 89 -10
170 39 1994-02-15 0 89 -10
171 39 1994-02-22 0 89 -10
172 40 1994-02-02 0 70 11
173 40 1994-02-09 1 70 11
174 40 1994-02-16 0 70 11
175 40 1994-02-23 0 70 11
176 41 1994-02-03 0 95 25
177 41 1994-02-10 1 95 25
178 41 1994-02-17 0 95 25
179 41 1994-02-24 0 95 25
180 42 1994-02-04 0 75 22
181 42 1994-02-11 1 75 22
182 42 1994-02-18 0 75 22
183 42 1994-02-25 0 75 22
184 43 1994-02-05 0 99 -20
185 43 1994-02-12 1 99 -20
186 43 1994-02-19 0 99 -20
187 43 1994-02-26 0 99 -20
188 44 1994-02-06 0 99 7
189 44 1994-02-13 1 99 7
190 44 1994-02-20 0 99 7
191 44 1994-02-27 0 99 7
192 45 1994-02-07 0 62 -2
193 45 1994-02-14 1 62 -2
194 45 1994-02-21 0 62 -2
195 45 1994-02-28 0 62 -2
196 46 1994-02-01 0 50 -9
197 46 1994-02-08 0 50 -9
198 46 1994-02-15 1 50 -9
199 46 1994-02-22 0 50 -9
200 47 1994-02-02 0 99 -13
[ reached 'max' / getOption("max.print") -- omitted 1405 rows ]

Error in y - predmat : non-numeric argument to binary operator

Trying to use R's cv.glmnet() for cross-validation on loan data.
I have a loan data set (Kaggle) and have already split it into train and test.
I separated the y response from the predictor variables with select(1) and select(-1).
I created matrices to avoid the earlier "Error in storage.mode(y) <- "double" : 'list' object cannot be coerced to type 'double'" problem.
Now I am trying to run cv.glmnet() for cross-validation, but this error stops me:
"Error in y - predmat : non-numeric argument to binary operator"
The error complains about a non-numeric argument, yet all my data is numeric, except that the response y is a factor.
As a side question, what does the predmat in "y - predmat" refer to?
x_vars <- as.matrix(data.sample.train.split %>% select(-1))
y_resp <- as.matrix(data.sample.train.split %>% select(1))
cv_output <- cv.glmnet(x_vars, y_resp, type.measure = "deviance", nfolds = 5)
cv_output <- cv.glmnet(x_vars, y_resp,
type.measure = "deviance",
lambda = NULL,
nfolds = 5)
I am also considering trying this:
ddd.lasso <- cv.glmnet(x_vars, y_resp, alpha = 1, family = "binomial")
ddd.model <- glmnet(x_vars, y_resp, alpha = 1, family = "binomial", lambda = ddd.lasso$lambda.min)
A data sample is as follows (just some of the columns):
c("loan_amnt", "funded_amnt",
"funded_amnt_inv", "grade", "emp_length", "annual_inc", "dti",
"mths_since_last_delinq", "mths_since_last_record", "open_acc",
"pub_rec", "revol_bal", "revol_util", "total_acc", "out_prncp",
"out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
"total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee",
"last_pymnt_amnt", "collections_12_mths_ex_med", "acc_now_delinq"
)
loan_amnt funded_amnt funded_amnt_inv grade emp_length annual_inc dti
3 10000 10000 10000.000 60 10 49200.00 20.00
10 10000 10000 10000.000 60 4 42000.00 18.60
14 20250 20250 19142.161 60 3 43370.00 26.53
17 15000 15000 15000.000 80 2 92000.00 29.44
18 4000 4000 4000.000 80 10 106000.00 5.63
31 4400 4400 4400.000 40 10 55000.00 20.01
35 10000 10000 10000.000 100 10 60000.00 12.74
37 25600 25600 25350.000 80 9 110000.00 15.71
41 10000 10000 10000.000 80 1 39000.00 18.58
64 9200 9200 9200.000 80 2 60000.00 19.96
72 7000 7000 7000.000 80 4 39120.00 21.01
74 3500 3500 3500.000 100 10 83000.00 2.31
77 9500 9500 9500.000 100 7 50000.00 8.18
89 10000 10000 10000.000 100 1 43000.00 25.26
98 7000 7000 7000.000 80 1 30000.00 15.80
112 21600 21600 20498.266 20 8 60000.00 16.74
117 7200 7200 7200.000 80 5 48000.00 17.43
118 12000 12000 11975.000 60 1 57000.00 10.86
125 10000 10000 10000.000 100 5 70000.00 16.78
126 8000 8000 8000.000 60 3 28000.00 12.60
128 6000 6000 6000.000 60 10 94800.00 24.53
138 35000 35000 35000.000 80 2 168000.00 3.17
144 14000 14000 14000.000 100 10 66000.00 11.15
149 3000 3000 3000.000 60 5 71000.00 21.84
152 12000 12000 11975.000 80 2 60000.00 15.50
153 6000 6000 6000.000 100 3 34000.00 14.51
155 7000 7000 7000.000 80 7 82000.00 12.00
166 24250 18100 18075.000 -1 7 120000.00 12.96
170 2500 2500 2500.000 80 7 29000.00 18.70
172 4225 4225 4225.000 80 5 55200.00 17.61
180 6000 6000 6000.000 60 5 50000.00 15.58
192 5000 5000 5000.000 80 5 38004.00 23.78
193 8000 8000 8000.000 80 3 31000.00 16.22
199 12000 12000 12000.000 80 4 40000.00 22.20
203 3200 3200 3200.000 80 9 61200.00 2.16
209 5000 5000 5000.000 80 2 70000.00 20.06
220 13250 13250 13250.000 40 10 52000.00 23.70
224 12000 12000 12000.000 100 10 68000.00 7.08
mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util
3 35 59 10 0 5598 21.0
10 61 114 14 0 24043 70.2
14 18 107 8 0 17813 85.6
17 54 79 8 0 13707 93.9
18 18 97 12 0 6110 37.7
31 68 119 7 0 25237 99.0
35 37 93 11 0 14019 19.5
37 11 118 11 0 26088 62.0
41 58 17 5 0 12874 72.7
64 39 95 8 0 23299 78.7
72 26 33 8 1 9414 52.4
74 35 59 6 0 3092 13.4
77 46 118 8 0 13422 60.5
89 59 105 8 0 8215 37.2
98 68 101 7 0 15455 47.6
112 23 26 6 0 13354 78.1
117 24 19 7 0 16450 80.2
118 47 87 7 0 9273 81.5
125 32 92 9 0 10770 69.0
126 66 112 8 0 6187 54.3
128 10 101 13 0 71890 95.9
138 22 97 16 0 1099 1.4
144 26 102 7 0 12095 35.4
149 59 103 4 0 15072 88.7
152 46 94 7 0 12168 85.7
153 70 81 9 0 13683 64.8
155 79 83 6 0 25334 71.6
166 66 118 7 0 31992 99.0
170 63 99 5 0 2668 66.7
172 69 104 6 0 4055 73.7
180 49 94 8 0 7361 83.6
192 5 85 12 0 10023 57.3
193 28 77 13 0 2751 34.4
199 78 109 9 0 16273 55.5
203 79 113 5 1 2795 33.3
209 27 62 14 0 13543 54.2
220 70 86 8 0 15002 91.5
224 21 70 7 0 15433 55.6
total_acc out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp
3 37 0 0 12226.302 12226.30 10000.00
10 28 0 0 12519.260 12519.26 10000.00
14 22 0 0 27663.043 25417.68 20250.00
17 31 0 0 15823.480 15823.48 15000.00
18 44 0 0 4484.790 4484.79 4000.00
31 11 0 0 5626.893 5626.89 4400.00
35 18 0 0 10282.670 10282.67 10000.00
37 27 0 0 29695.623 29405.63 25600.00
41 10 0 0 11474.760 11474.76 10000.00
64 19 0 0 10480.840 10480.84 9200.00
72 26 0 0 7932.300 7932.30 7000.00
74 28 0 0 3834.661 3834.66 3500.00
77 13 0 0 10493.710 10493.71 9500.00
89 16 0 0 11264.010 11264.01 10000.00
98 11 0 0 8452.257 8452.26 7000.00
112 21 0 0 27580.750 24853.63 21600.00
117 10 0 0 8677.156 8677.16 7200.00
118 11 0 0 14396.580 14366.62 12000.00
125 18 0 0 10902.910 10902.91 10000.00
126 11 0 0 8636.820 8636.82 8000.00
128 30 0 0 7215.050 7215.05 6000.00
138 22 0 0 38059.760 38059.76 35000.00
144 46 0 0 15450.084 15450.08 14000.00
149 14 0 0 3723.936 3723.94 3000.00
152 21 0 0 13919.414 13890.44 12000.00
153 16 0 0 6857.261 6857.26 6000.00
155 31 0 0 8290.730 8290.73 7000.00
166 20 0 0 22188.250 22157.63 18100.00
170 13 0 0 2894.740 2894.74 2500.00
172 12 0 0 5081.023 5081.02 4225.00
180 14 0 0 7325.299 7325.30 6000.00
192 17 0 0 6534.430 6534.43 5000.00
193 29 0 0 8306.470 8306.47 8000.00
199 23 0 0 14006.680 14006.68 12000.00
203 17 0 0 3709.193 3709.19 3200.00
209 26 0 0 5501.160 5501.16 5000.00
220 18 0 0 15650.390 15650.39 13250.00
224 34 0 0 12554.010 12554.01 12000.00
total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_amnt
3 2209.33 16.97000 0 0 357.48
10 2519.26 0.00000 0 0 370.46
14 7413.04 0.00000 0 0 6024.09
17 823.48 0.00000 0 0 2447.05
18 484.79 0.00000 0 0 2638.77
31 1226.89 0.00000 0 0 162.44
35 282.67 0.00000 0 0 8762.05
37 4095.62 0.00000 0 0 838.27
41 1474.76 0.00000 0 0 5803.94
64 1280.84 0.00000 0 0 365.48
72 932.30 0.00000 0 0 4235.03
74 334.66 0.00000 0 0 107.86
77 993.71 0.00000 0 0 5378.43
89 1264.01 0.00000 0 0 4.84
98 1452.26 0.00000 0 0 238.06
112 5980.75 0.00000 0 0 17416.49
117 1462.16 15.00000 0 0 19.26
118 2396.58 0.00000 0 0 5359.38
125 902.91 0.00000 0 0 4152.52
126 636.82 0.00000 0 0 6983.56
128 1215.05 0.00000 0 0 1960.88
138 3059.76 0.00000 0 0 272.59
144 1450.08 0.00000 0 0 2133.17
149 723.94 0.00000 0 0 107.29
152 1919.41 0.00000 0 0 395.05
153 857.26 0.00000 0 0 198.16
155 1290.73 0.00000 0 0 2454.29
166 4088.25 0.00000 0 0 16499.75
170 394.74 0.00000 0 0 1168.50
172 856.02 0.00000 0 0 146.48
180 1325.30 0.00000 0 0 215.51
192 1534.43 0.00000 0 0 1561.93
193 306.47 0.00000 0 0 7778.22
199 2006.68 0.00000 0 0 5971.51
203 509.19 0.00000 0 0 317.41
209 501.16 0.00000 0 0 3833.62
220 2400.39 0.00000 0 0 9026.78
224 554.01 0.00000 0 0 473.95
collections_12_mths_ex_med acc_now_delinq
3 0 0
10 0 0
14 0 0
17 0 0
18 0 0
31 0 0
35 0 0
37 0 0
41 0 0
64 0 0
72 0 0
74 0 0
77 0 0
89 0 0
98 0 0
112 0 0
117 0 0
118 0 0
125 0 0
126 0 0
128 0 0
138 0 0
144 0 0
149 0 0
152 0 0
153 0 0
155 0 0
166 0 0
170 0 0
172 0 0
180 0 0
192 0 0
193 0 0
199 0 0
203 0 0
209 0 0
220 0 0
224 0 0
Looks like an incorrect glmnet family: I accidentally left cv.glmnet at its default (gaussian) family when in fact my data was binomial. My next task is to figure out "Convergence for 1th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned".
Code that improved the solution:
cv.lasso <- cv.glmnet(x_vars, y_resp, alpha = 1, family = "binomial", nfolds = 5)
cv.model <- glmnet(x_vars, y_resp, alpha = 1, relax=TRUE, family="binomial", lambda=cv.lasso$lambda.min)
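For the remaining convergence warning, one commonly tried adjustment is to give the coordinate-descent solver more passes per lambda; maxit is a documented glmnet argument that cv.glmnet forwards, though the value below is only illustrative, not a recommendation.
# Allow more passes per lambda than the default maxit = 1e5 mentioned in the warning.
cv.lasso <- cv.glmnet(x_vars, y_resp, alpha = 1, family = "binomial",
                      nfolds = 5, maxit = 1e6)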

How to convert "dist" to data.frame in R?

My data "gaa" is huge distance matrix data.
The problem is I lost my origin data...So.. I want to get data.frame from distance matrix.
For example,
b=data.matrix(gaa[1:5,1:5])
>b
12 21 23 34 45
21 0 0 0 0 0
23 243 0 0 0 0
34 126 134 0 0 0
45 141 470 265 0 0
56 93 213 143 214 0
and I want data like
>orgin_gaa[1:5,]
12 xxx
21 xxx
23 xxx
34 xxx
45 xxx
56 xxx
Please help me: how can I get my new data frame?
Or is it impossible?
Thanks for your help!
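A note on what is achievable here (added for context, and hedged): the original variables cannot be recovered exactly from a distance matrix; at best you can compute coordinates whose pairwise distances approximate the originals, for example with classical multidimensional scaling via cmdscale. A minimal sketch, assuming gaa is a dist object or a square distance matrix and that a one-column reconstruction like the orgin_gaa shown above is wanted:
# Classical MDS: recover point coordinates that approximately reproduce the distances.
d <- as.dist(gaa)                      # coerce to a "dist" object if gaa is a square matrix
coords <- cmdscale(d, k = 1)           # k = number of dimensions to reconstruct
origin_gaa <- data.frame(id = rownames(coords), value = coords[, 1])
head(origin_gaa)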

GBM Bernoulli returns no results with NaN

I know this question has been asked multiple times but I've run out of ideas to get the model working. The first 50 rows of the train data:
> train[1:25]
a b c d e f g h i j k l m
1: 0 148.00 27 16 0 A 0 117 92 0 13 271 2
2: 0 207.00 37 8 0 C 0 46 29 0 29 555 5
3: 0 1497.00 44 1 0 A 1 3754 2119 1 1961 5876 6
4: 0 463.00 44 1 0 A 0 287 202 0 105 1037 4
5: 0 19.00 82 1 0 A 0 301 186 0 344 2116 3
6: 0 204.00 41 1 0 A 0 92 76 0 290 1608 10
7: 0 79.00 69 16 0 B 0 48 29 0 1 27 3
8: 0 256.75 71 16 1 A 0 131 112 0 36 1183 0
9: 0 256.75 71 16 1 A 0 131 112 0 36 1183 2
10: 1 49.00 13 13 0 C 0 5 4 0 0 11 1
11: 0 19.00 76 1 0 A 0 897 440 0 575 2674 3
12: 0 49.00 100 100 0 C 0 6 6 0 0 0 1
13: 0 107.00 65 1 0 A 3 334 212 0 421 2773 6
14: 0 79.00 28 16 0 B 0 42 49 0 13 345 2
15: 0 1742.00 61 1 0 A 0 589 340 0 444 3853 8
16: 0 187.00 20 16 0 A 0 123 99 0 70 841 4
17: 0 68.00 73 1 0 A 0 757 507 0 359 773 3
18: 0 157.00 32 16 0 B 0 33 27 0 4 144 2
19: 0 49.00 52 16 0 C 0 10 7 0 2 51 3
20: 0 79.00 53 16 0 B 0 20 9 0 0 40 4
21: 0 68.00 45 1 0 A 0 370 245 0 298 1826 3
22: 0 1074.00 46 1 0 A 0 605 220 0 280 1421 7
23: 0 19.00 84 1 0 A 0 357 214 0 104 1273 3
24: 0 68.00 42 1 0 A 0 107 97 0 224 1526 3
25: 0 226.00 39 1 0 A 0 228 162 0 139 559 3
26: 0 49.00 92 16 0 C 0 4 3 0 0 0 3
27: 0 68.00 46 1 0 A 0 155 104 0 60 1170 3
28: 1 98.00 29 2 0 C 0 15 13 0 1 659 3
29: 0 248.00 44 1 0 A 0 347 204 0 281 1484 4
30: 0 19.00 84 1 0 A 0 302 166 0 170 2800 3
31: 0 444.00 20 16 0 A 0 569 411 1 369 1095 4
32: 0 157.00 20 16 0 B 0 38 30 0 18 265 3
33: 0 208.00 71 16 0 B 0 22 22 0 1 210 3
34: 1 84.00 27 13 0 A 0 37 24 0 1 649 1
35: 1 297.00 17 7 0 A 0 26 21 0 0 0 1
36: 1 49.00 43 16 1 C 0 4 4 0 0 0 2
37: 0 99.00 36 1 0 A 0 614 432 0 851 2839 4
38: 0 354.00 91 2 1 C 0 74 48 0 102 1005 9
39: 0 68.00 62 16 0 A 0 42 32 0 0 0 3
40: 0 49.00 78 16 0 C 0 12 10 0 0 95 3
41: 0 49.00 57 16 0 C 1 9 8 0 1 582 3
42: 0 68.00 49 1 0 A 0 64 47 0 49 112 3
43: 0 583.00 70 2 1 A 0 502 293 0 406 2734 9
44: 0 187.00 29 1 0 A 0 186 129 0 118 2746 5
45: 0 178.00 52 1 0 A 0 900 484 0 180 1701 4
46: 1 98.00 50 44 0 C 0 13 12 0 1 647 4
47: 1 548.00 21 14 0 A 0 19 14 0 0 0 1
48: 0 178.00 28 16 0 C 0 43 33 0 6 921 3
49: 1 49.00 20 20 0 C 0 8 6 0 0 0 1
50: 0 49.00 124 124 1 A 0 14 11 0 0 0 1
a b c d e f g h i j k l m
This data is not normalised, but it doesn't matter at this stage.
I can't get a simple gbm model to work using the gbm package:
> require(gbm)
> gbm_model <- gbm(a ~ .
, data = train
, distribution="bernoulli"
, n.trees= 10
, shrinkage=0.001
, bag.fraction = 1
, train.fraction = 0.5
, n.minobsinnode = 3
, cv.folds = 0 # no cross-validation
, keep.data=TRUE
, verbose=TRUE
)
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0010 nan
2 nan nan 0.0010 nan
3 nan nan 0.0010 nan
4 nan nan 0.0010 nan
5 nan nan 0.0010 nan
6 nan nan 0.0010 nan
7 nan nan 0.0010 nan
8 nan nan 0.0010 nan
9 nan nan 0.0010 nan
10 nan nan 0.0010 nan
Columns 'e' and 'f' are factors. The train data sample size is approximately 6,000. I've tried running gbm with various bag.fraction, train.fraction, n.trees, and shrinkage values but still get the same result of all NaNs. Trees and SVM work without any problem on the same data. I even tried converting column 'f' to character, as was suggested in previous posts, and it didn't work.
Edit: the data has no NAs or invalid values. I tried one-hot encoding the 'f' column and still got the same results.
In my case, this issue was resolved by converting the dependent variable to character.
gbm_model <- gbm(as.character(a) ~ .
, data = train
, distribution="bernoulli"
, n.trees= 10
, shrinkage=0.001
, bag.fraction = 1
, train.fraction = 0.5
, n.minobsinnode = 3
, cv.folds = 0 # no cross-validation
, keep.data=TRUE
, verbose=TRUE
)
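One more thing worth checking if the character conversion does not help (this is an assumption about how the data was read in, not something visible in the printout above): gbm's "bernoulli" distribution is documented for 0/1 outcomes, so the response should be stored as numeric 0/1 rather than as a factor.
# If 'a' was read in as a factor with levels "0"/"1", map it back to numeric 0/1 before fitting.
if (is.factor(train$a)) train$a <- as.integer(as.character(train$a))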

r2dtable contingency tables are too concentrated

I am using R's r2dtable function to generate contingency tables with given marginals. However, when inspecting the resulting tables, the values look somewhat too concentrated around the midpoints. Example:
set.seed(1)
matrices <- r2dtable(1e4, c(100, 100), c(100, 100))
vec.vals <- vapply(matrices, function(x) x[1, 1], numeric(1))
> table(vec.vals)
vec.vals
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
1 1 1 7 25 49 105 182 268 440 596 719 954 1072 1152 1048
52 53 54 55 56 57 58 59 60 61 62
1022 775 573 404 290 156 83 50 19 6 2
So the minimal upper left corner value is 36 and the max is 62 out of 10,000 simulations.
Is there a way to achieve somewhat less concentrated matrices?
You need to consider that it would be extremely unlikely for any given random draw to have an upper-left corner value of 35; 1e4 attempts may not be sufficient to realize such an event. Look at the theoretical predictions (courtesy of P. Dalgaard on the R-help list this morning):
round(dhyper(0:100,100,100,100)*1e4)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[18] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[35] 0 0 0 1 4 9 21 45 88 160 269 417 596 787 959 1081 1124
[52] 1081 959 787 596 417 269 160 88 45 21 9 4 1 0 0 0 0
[69] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[86] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
If you increase the number of draws, the range of values that actually occur "widens" (here the matrices are regenerated with 1e6 draws):
matrices <- r2dtable(1e6, c(100, 100), c(100, 100))
vec.vals <- vapply(matrices, function(x) x[1, 1], numeric(1)); table(vec.vals)
vec.vals
33 34 35 36 37 38 39 40 41 42 43 44 45
1 3 8 47 141 359 864 2148 4515 8946 15928 27013 41736
46 47 48 49 50 51 52 53 54 55 56 57 58
59558 78717 96153 108322 112524 107585 96042 78054 60019 41556 26848 16134 8627
59 60 61 62 63 64 65 66 68
4580 2092 933 351 138 42 11 4 1
... as predicted:
round(dhyper(0:100,100,100,100)*1e6)
[1] 0 0 0 0 0 0 0 0 0 0 0 0
[13] 0 0 0 0 0 0 0 0 0 0 0 0
[25] 0 0 0 0 0 0 0 0 0 1 4 13
[37] 43 129 355 897 2087 4469 8819 16045 26927 41700 59614 78694
[49] 95943 108050 112416 108050 95943 78694 59614 41700 26927 16045 8819 4469
[61] 2087 897 355 129 43 13 4 1 0 0 0 0
[73] 0 0 0 0 0 0 0 0 0 0 0 0
[85] 0 0 0 0 0 0 0 0 0 0 0 0
[97] 0 0 0 0 0
To get less concentrated matrices, you will have to find a balance between the number of columns / rows, totals and number of matrices. Consider the following sets:
m2rep <- r2dtable(1e4, rep(100,2), rep(100,2))
m2seq <- r2dtable(1e4, seq(50,100,50), seq(50,100,50))
which gives differences in the number of unique values:
> length(unique(unlist(m2rep)))
[1] 29
> length(unique(unlist(m2seq)))
[1] 58
plotting this with:
par(mfrow = c(1,2))
plot(table(unlist(m2rep)))
plot(table(unlist(m2seq)))
gives the two corresponding frequency plots (images not reproduced here).
Now consider:
m20rep <- r2dtable(1e4, rep(100,20), rep(100,20))
m20seq <- r2dtable(1e4, seq(50,1000,50), seq(50,1000,50))
which gives:
> length(unique(unlist(m20rep)))
[1] 20
> length(unique(unlist(m20seq)))
[1] 130
plotting this with:
par(mfrow = c(1,2))
plot(table(unlist(m20rep)))
plot(table(unlist(m20seq)))
again gives the corresponding frequency plots (images not reproduced here).
As you can see, playing with the parameters helps.
HTH

Resources