R statistics: glm

I use a .csv exported from Excel. I recently added a 'bmi' variable. Unfortunately it gets analysed as different levels of 'bmi' instead of as a continuous variable (output below). I have tried making sure it is numeric, but with no improvement. Other continuous variables have not been a problem.
Updated code below with a sample 3-variable dataset. Obviously the import filepath would need to change. I then go through 2 stages of subsetting.
The data_su subset causes a 2nd problem: "glm.fit: algorithm did not converge". If I use the nitech1 dataset this doesn't occur. However, there is no obvious cause of non-convergence, and normally I would include more variables, making convergence even less likely.
nitech <- read.csv("~/Documents/Publications/2020 NI_tech/Data/nitech.csv", stringsAsFactors = F)
head(nitech)
attach(nitech)
nitech1<-subset(nitech, 1/1305 & route=="ni" & t_1==1)
attach(nitech1)
data_su<-subset(nitech1, posn_neg_5>-3)
attach(data_su)
mod_su1<-glm(posn_neg_5>-2~a2+bmi, data = data_su, family=binomial(link="logit"))
summary(mod_su1)
posn_neg_5 bmi a2
1 21.875 22
3 NA 23
-1 23.24380165 24
4 NA 25
5 24.04934359 26
4 21.79930796 27
4 NA 28
5 40.03673095 29
4 28.73467206 30
-1 24.05693475 31
3 32.32653061 16
4 19.04700413 21
5 27.08415907 25
2 28.125 15
-2 25.05930703 25
5 29.96878252 34
1 21.44756785 11
5 21.44756785 11
5 27.15271059 21
4 19.35020755 14
2 18.7260073 15
5 25.81663021 18
1 27.1430375 12
4 35.10667027 18
5 27.17063157 5
2 34.81611682 2
2 20.89795918 12
4 24.54346238 3
4 24.93074792 3
5 31.22130395 11
1 27.734375 14
1 23.88844098 8
5 23.8497766 10
-1 27.76620852 32
4 24.38652644 15
5 23.57391519 11
5 24.1516725 8
1 24.07407407 5
-2 25.82644628 23
5 21.46193662 7
5 30.07116843 16
-2 18.99196494 11
4 22.8774057 4
4 16.49395819 NA
4 25.61728395 5
4 35.01992513 17
-1 23.89325888 36
4 22.92097107 8
2 21.2244898 2
5 20.81165453 7
4 21.70512943 NA
-2 NA 7
4 31.7417889 9
5 28.73467206 15
5 19.72318339 22
4 20.82093992 13
4 28.6801111 14
5 32.87197232 14
5 31.60321237 19
4 21.70512943 5
4 18.44618056 5
5 18.68512111 NA
-1 23.45679012 10
1 30.0432623 14
4 27.88761707 NA
5 41.46938776 18
4 24.96800974 NA
1 21.44756785 11
4 23.57391519 11
3 23.89325888 36
4 23.89325888 36
4 20.82093992 13
5 28.6801111 14
5 28.73467206 15
5 17.90885536 13
4 15.00207476 NA
4 23.71184463 19
4 29.3877551 19
4 20.95717116 12
3 35.15625 18
5 24.53512397 15
4 25.86451247 20
3 17.90885536 13
4 23.71184463 19
2 27.91551882 13
4 35.85643085 NA
4 24.69135802 NA
4 35.2507611 35
5 19.13580247 NA
4 25.18078512 16
4 28.3446712 17
1 31.60321237 8
4 NA NA
5 27.85200798 17
4 21.13271344 5
4 20.08827524 NA
5 22.58955144 25
4 17.96875 16
5 29.93759487 9
5 24.69135802 9
5 28.125 8
5 25.96952909 NA
4 19.80534178 5
4 22.09317005 26
4 16.23307275 12
4 22.85714286 7
4 32.24993701 29
4 32.24993701 29
4 27.75510204 17
-1 22.22222222 8
5 30.93043808 NA
5 30.93043808 15
5 26.42356982 6
5 33.65929705 15
2 24.34380949 9
4 24.34380949 7
1 27.17063157 9
5 37.73698829 NA
5 37.73698829 NA
1 23.30668005 11
-1 24.22145329 31
5 34.10798936 11
5 34.10798936 11
5 34.10798936 11
4 24.22145329 31
-1 18.04270106 NA
4 22.265625 25
5 34.10798936 11
2 25.86120485 34
-4 27.40765728 18
3 27.40765728 18
5 20.10916403 7
2 20.60408163 25
NA 24.77209671 24
1 22.49134948 NA
4 22.49134948 10
3 23.62444749 27
1 24.09297052 NA
5 24.1671624 13
-2 24.91349481 7
1 25.53544639 10
1 27.6816609 8
5 22.85714286 10
4 22.85714286 10
5 25.2493372 6
4 21.79930796 20
4 22.85714286 10
4 25.35154137 21
4 26.2345679 7
4 26.86873566 19
4 23.58832922 14
5 60.85439572 11
3 21.79930796 20
5 23.58832922 14
4 25.71166208 29
3 23.45679012 5
5 31.8877551 NA
-1 31.38510306 15
4 33.87406376 16
5 31.38510306 15
4 38.20018365 NA
3 31.38510306 15
3 29.20110193 25
5 31.99217133 NA
5 29.9940488 6
1 26.81359045 28
4 27.54820937 20
5 27.54820937 5
4 23.67125363 36
4 22.22222222 17
4 23.67125363 36
4 27.04164413 18
5 21.60493827 7
4 21.79930796 20
4 44.79082684 NA
5 43.02771702 NA
4 21.79930796 20
4 22.51606979 NA
4 20.76124567 24
4 22.98539751 23
1 26.56683975 24
4 22.98539751 23
4 29.16869227 8
1 24.75546432 8
5 24.19600671 24
5 30.79338244 NA
NA NA NA
2 NA 24
5 24.19600671 24
1 25.95155709 29
4 25.95155709 29
5 20.10916403 7
-1 25.390625 22
4 22.03856749 13
5 21.79930796 NA
5 22.03856749 13
-2 30.07116843 20
5 27.75748722 23
4 20.13476872 NA
4 30.49148654 9
5 27.77427093 12
5 30.0838291 25
3 22.22222222 7
-2 14.34257234 10
4 25.82644628 11
-1 20.10916403 22
4 25.82644628 11
5 23.30109483 27
4 23.30109483 27
5 22.88868802 12
4 26.2345679 9
4 36.22750875 6
5 30.47796622 33
1 22.63467632 4
3 22.03856749 31
5 24.69135802 8
5 25.21735858 21
4 15.1614521 16
5 21.14631991 19
5 30.79338244 NA
-1 27.77777778 12
5 25.66115203 9
5 35.91836735 9
5 26.7818261 19
5 26.7818261 19
-2 22.46003435 22
5 32.32323232 31
2 24.296875 11
4 26.7299275 27
5 24.48979592 9
4 23.03004535 28
5 25.2493372 6
4 20.51508648 8
4 23.87511478 15
-1 27.93277423 21
-1 27.93277423 21
4 20.51508648 8
5 27.77777778 29
4 21.49959688 8
4 28.96473469 6
2 24.69135802 24
5 29.86055123 7
5 21.60493827 17
5 41.86851211 16
5 27.77777778 19
2 28.515625 16
5 24.69135802 24
5 21.50294251 12
5 27.77777778 4
5 25.52059756 15
4 27.38328546 19
4 19.47714681 12
5 25.71166208 26
4 26.12244898 NA
5 21.484375 21
5 32.14024836 23
5 24.25867407 8
5 27.77777778 14
4 21.60493827 11
4 22.69401893 18
4 21.60493827 11
4 21.0498179 15
5 22.67573696 9
4 24.22145329 10
4 29.70801268 10
5 38.62236267 17
4 29.70801268 10
4 29.70801268 10
4 32.39588049 6

I would try two things: (1) Don't attach() your data sets. As long as you specify the data frame in your glm() call (as you have already done), you shouldn't need to attach. (2) Try changing your command to the following:
glm(I(posn_neg_5>-2)~a2+bmi, data = data_su, family=binomial(link="logit"))
The function I(), when put inside a formula, maps a variable (posn_neg_5) to the new variable (posn_neg_5 > -2). Sometimes you can get into trouble by including arithmetic or comparisons in formulas without this wrapper.
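Separately, the "analysed at different levels of bmi" symptom usually means the column arrived as character (a literal "NA" or stray text in a CSV makes read.csv fall back to character, and glm() then dummy-codes each distinct value). A minimal sketch of the check and fix, using a toy data frame rather than the real file:

```r
# Toy stand-in for data_su; the "na" string forces the column to character
df <- data.frame(posn_neg_5 = c(1, 3, -1, 4),
                 bmi = c("21.875", "na", "23.24", "24.05"),
                 a2  = c(22, 23, 24, 25),
                 stringsAsFactors = FALSE)
class(df$bmi)                 # "character": glm() would treat each value as a level
df$bmi <- as.numeric(df$bmi)  # non-numeric strings become NA (with a coercion warning)
class(df$bmi)                 # "numeric"
```

Running `str(data_su)` before modelling is the quickest way to spot which columns were read as character or factor.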

Related

Create edgelist that contains mutual dyads

I have an edgelist where I want to keep dyads that mutually selected each other (e.g., 1 -> 4 and 4 -> 1). However, in the final edgelist I only want to keep one row instead of both rows of the mutual dyads (e.g., only row 1 -> 4 not both rows 1 -> 4 and 4 -> 1). How do I achieve that?
Here is the dataset:
library(igraph)
g <- sample_gnm(10, 50, directed=TRUE)
ff <- as_data_frame(g)
ff
from to
1 1 10
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 1 8
8 2 1
9 2 3
10 2 8
11 2 9
12 3 1
13 3 2
14 3 10
15 3 4
16 3 5
17 3 6
18 3 8
19 3 9
20 4 3
21 4 10
22 5 1
23 5 2
24 5 3
25 5 4
26 6 2
27 6 3
28 6 4
29 6 5
30 7 3
31 7 5
32 7 6
33 7 10
34 7 8
35 8 1
36 8 2
37 8 4
38 8 5
39 8 10
40 9 1
41 9 2
42 9 3
43 9 4
44 9 5
45 9 7
46 10 1
47 10 3
48 10 4
49 10 8
50 10 9
cd <- which_mutual(g) #I know I can use `which_mutual` to identify the mutual dyads
ff[which(cd==1),] #but in the end this keeps both rows of the mutual dyads (e.g., 1 -> 4 and 4 -> 1)
from to
4 1 4
6 1 6
7 1 7
9 2 10
10 2 3
14 3 2
18 3 6
21 4 1
25 5 10
28 6 1
30 6 3
32 6 10
33 6 7
34 7 1
37 7 6
39 7 8
42 8 7
45 9 10
46 10 2
47 10 5
48 10 6
50 10 9
We may use duplicated() to create a logical vector after sorting each pair row-wise with pmin/pmax:
ff1 <- ff[which(cd==1),]
subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
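A self-contained illustration of the pmin/pmax + duplicated idea on a tiny hand-made edgelist (values here are made up, not drawn from the random graph above):

```r
# Two mutual dyads, each present as both directions
ff1 <- data.frame(from = c(1, 4, 2, 3),
                  to   = c(4, 1, 3, 2))
# Sort each pair so (4,1) and (1,4) look identical, then drop duplicates
keep <- !duplicated(cbind(pmin(ff1$from, ff1$to), pmax(ff1$from, ff1$to)))
ff1[keep, ]
#   from to
# 1    1  4
# 3    2  3
```

cbind() builds a two-column matrix of (smaller id, larger id), and duplicated() on a matrix works row-wise, so exactly one row per mutual dyad survives.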

Imputation with categorical variables with mix package in R

I'm trying to impute missing values in a data set that contains categorical variables (7-point Likert scales) using the mix package in R. Here is what I'm doing:
1. Loading the data:
data <- read.csv("test.csv", header=TRUE, row.names="ID")
2. Here's what the data looks like:
The first column is my ID column, the next three columns are categorical variables (7-point Likert scales - these are the ones where I am interested in imputing the missing values). Then I have three auxiliary variables: aux_cat is another categorical variable (unordered ranging from 1 to 9, no missing data), aux_one is an integer (no missing data), aux_two is numerical (contains missing data).
var_one var_two var_three aux_cat aux_one aux_two
1 2 1 2 6 26 0.0
2 3 2 3 7 45 32906.5
3 6 2 3 3 31 1237.5
4 7 NA NA 8 11 277.0
5 4 3 1 5 145 78201.0
6 NA NA NA 6 30 48550.0
7 7 6 3 3 48 11568.0
8 6 6 4 2 15 4482.0
9 7 6 5 5 61 NA
10 5 6 7 3 2 NA
11 5 6 5 3 11 78663.0
12 6 2 2 3 16 1235.0
13 7 2 5 3 13 5781.0
14 6 5 4 6 16 5062.0
15 5 5 3 3 43 400.0
16 7 7 5 2 114 7968.0
17 6 5 4 3 99 247.5
18 7 7 7 6 114 1877.0
19 5 5 4 5 3 5881.5
20 4 4 2 3 65 1786.0
21 4 3 6 5 9 14117.5
22 3 3 2 3 35 2093.0
23 3 4 4 5 62 23071.5
24 5 3 5 3 22 2707.5
25 3 1 2 6 128 942.0
26 5 3 6 4 57 101379.0
27 5 5 4 6 76 1398.0
28 1 3 4 3 17 1024.5
29 4 3 2 1 143 10657.0
30 7 1 4 8 14 167.5
31 7 3 7 3 22 4344.0
32 3 3 3 6 27 1582.0
33 7 1 3 2 29 66.5
34 5 5 4 2 108 513.5
35 7 6 6 7 24 936.5
36 4 5 4 7 40 5950.5
37 NA NA NA 8 15 99.5
38 2 2 2 6 21 123.5
39 6 4 5 2 61 477.5
40 6 5 5 2 16 28921.0
41 6 2 2 2 11 1063.5
42 6 2 5 3 116 97798.5
43 4 4 2 8 11 9159.5
44 6 6 6 6 4 1098.5
45 6 4 5 7 21 236.5
46 4 6 4 5 43 219.5
47 3 2 3 3 28 85.5
48 5 5 5 2 71 13483.5
49 5 5 6 8 98 18400.0
50 5 6 6 3 27 357.0
51 5 7 6 7 14 145.5
52 4 5 5 3 93 427.5
53 3 4 5 2 40 412.0
54 6 6 3 2 8 2418.0
55 5 6 5 5 8 4923.5
56 4 5 2 7 32 4135.0
57 7 7 2 6 83 1408.5
58 7 2 3 2 12 5595.0
59 7 2 1 2 32 2280.5
60 7 4 5 3 11 638.5
61 7 5 3 3 24 225.5
62 4 3 3 9 44 570.0
3. Performing preliminary manipulations
I try to run prelim.mix(x, p) where x is the data matrix containing missing values and p is the number of categorical variables in x. The categorical variables must be in the first p columns of x, and they must be coded with consecutive positive integers starting with 1. For example, a binary variable must be coded as 1,2 rather than 0,1.
In my case p should be 4 since I have three Likert-scale variables where I want imputed values and one other categorical variable among my auxiliary variables.
s <- prelim.mix(data,4)
This step seems to work fine.
4. Finding the maximum likelihood (ML) estimate:
thetahat <- em.mix(s)
This is where I encounter the following error:
Steps of EM:
1...2...3...Error in em.mix(s) : NA/NaN/Inf in foreign function call (arg 6)
I think this must have something to do with my auxiliary variables, but I'm not sure. Any help would be much appreciated.
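A "NA/NaN/Inf in foreign function call" error at the EM step often traces back to an illegal value (Inf or NaN) in a continuous column, or to categorical codes that are not the consecutive integers 1..k that prelim.mix requires. A package-agnostic sketch of a recode helper (recode_consecutive is my own name, not part of mix) that maps any integer codes with gaps onto 1..k:

```r
# Map categorical codes (possibly with gaps, e.g. 3,5,6,7,8,9) onto 1..k
recode_consecutive <- function(x) match(x, sort(unique(na.omit(x))))

aux_cat <- c(6, 7, 3, 8, 5, 9)
recode_consecutive(aux_cat)
# [1] 3 4 1 5 2 6
```

It is also worth running `any(!is.finite(as.matrix(data)), na.rm = TRUE)` on the continuous columns to rule out Inf/NaN before calling em.mix.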

How to find all pairs of two lists, and categorize them without repetitions?

We are preparing for a program where 18 people should discuss topics in pairs, switching partners each round until everyone has talked to everyone. That means 153 discussions: 9 pairs talking in parallel in each round, for 17 rounds. I tried to formulate a matrix showing who should talk to whom in order to avoid chaos, but could not succeed. For the sake of simplicity everyone is given a number, so the bottom line is: I need all pairs of combinations of the numbers from 1 to 18 (done with the combn function), but these pairs should then be arranged into 17 rounds so that each number appears only once per round. Any ideas?
Let's first look at a simpler problem with 6 persons. Imagine a matrix that lists who (rows) is talking to whom (columns) in which round (entry) — the matrix produced by full_ls(6) below.
So for example in round 1 we have the following pairs:
(1-2), (3-5), (4-6)
For round 2 we would have:
(1-3), (2-6), (4-5)
and so on.
Thus, basically we are looking for a symmetric latin square (i.e. in each row and in each column each entry appears only once, cf. Latin Squares on Wikipedia).
The inner (k-1) x (k-1) latin square can be easily generated via an addition table:
inner_ls <- function(k) {
res <- outer(0:(k-1), 0:(k-1), function(i, j) (i + j) %% k)
## replace zeros by k
res[res == 0] <- k
## replace diagonal by NA
diag(res) <- NA
res
}
inner_ls(5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 2 3 4
# [2,] 1 NA 3 4 5
# [3,] 2 3 NA 5 1
# [4,] 3 4 5 NA 2
# [5,] 4 5 1 2 NA
So all that is left is to append the last row (and column) with the missing round number:
full_ls <- function(k) {
i_ls <- inner_ls(k - 1)
last_row <- apply(i_ls, 1, function(row) {
rounds <- 1:(k - 1)
rounds[!rounds %in% row]
})
res <- cbind(rbind(i_ls, last_row), c(last_row, NA))
rownames(res) <- colnames(res) <- 1:k
res
}
full_ls(6)
# 1 2 3 4 5 6
# 1 NA 1 2 3 4 5
# 2 1 NA 3 4 5 2
# 3 2 3 NA 5 1 4
# 4 3 4 5 NA 2 1
# 5 4 5 1 2 NA 3
# 6 5 2 4 1 3 NA
With that you get your assignment matrix as follows:
full_ls(18)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# 1 NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 2 1 NA 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 2
# 3 2 3 NA 5 6 7 8 9 10 11 12 13 14 15 16 17 1 4
# 4 3 4 5 NA 7 8 9 10 11 12 13 14 15 16 17 1 2 6
# 5 4 5 6 7 NA 9 10 11 12 13 14 15 16 17 1 2 3 8
# 6 5 6 7 8 9 NA 11 12 13 14 15 16 17 1 2 3 4 10
# 7 6 7 8 9 10 11 NA 13 14 15 16 17 1 2 3 4 5 12
# 8 7 8 9 10 11 12 13 NA 15 16 17 1 2 3 4 5 6 14
# 9 8 9 10 11 12 13 14 15 NA 17 1 2 3 4 5 6 7 16
# 10 9 10 11 12 13 14 15 16 17 NA 2 3 4 5 6 7 8 1
# 11 10 11 12 13 14 15 16 17 1 2 NA 4 5 6 7 8 9 3
# 12 11 12 13 14 15 16 17 1 2 3 4 NA 6 7 8 9 10 5
# 13 12 13 14 15 16 17 1 2 3 4 5 6 NA 8 9 10 11 7
# 14 13 14 15 16 17 1 2 3 4 5 6 7 8 NA 10 11 12 9
# 15 14 15 16 17 1 2 3 4 5 6 7 8 9 10 NA 12 13 11
# 16 15 16 17 1 2 3 4 5 6 7 8 9 10 11 12 NA 14 13
# 17 16 17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 NA 15
# 18 17 2 4 6 8 10 12 14 16 1 3 5 7 9 11 13 15 NA
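To read the pairings for a given round off the matrix programmatically, one option is the following sketch (pairs_for_round is my own helper; inner_ls and full_ls are repeated from above so the snippet runs on its own):

```r
inner_ls <- function(k) {
  res <- outer(0:(k - 1), 0:(k - 1), function(i, j) (i + j) %% k)
  res[res == 0] <- k      # replace zeros by k
  diag(res) <- NA         # nobody talks to themselves
  res
}
full_ls <- function(k) {
  i_ls <- inner_ls(k - 1)
  rounds <- 1:(k - 1)
  last_row <- apply(i_ls, 1, function(row) rounds[!rounds %in% row])
  res <- cbind(rbind(i_ls, last_row), c(last_row, NA))
  rownames(res) <- colnames(res) <- 1:k
  res
}
# All (person, partner) cells equal to round r, each pair listed once
pairs_for_round <- function(m, r) {
  idx <- which(m == r, arr.ind = TRUE)  # both (i, j) and (j, i) appear
  unique(t(apply(idx, 1, sort)))        # sort within pairs, drop the mirror
}
pairs_for_round(full_ls(6), 1)          # pairs (1,2), (3,5), (4,6) as in the text
```

Applied to full_ls(18), `lapply(1:17, pairs_for_round, m = full_ls(18))` gives the full 17-round schedule as a list of 9-pair matrices.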

Replacing a restarting sequence in a dataframe with the group number of the sequence

I have a sequence in df$V1 that starts at some number and increases. At some point, it drops, indicating that observations for a new group have started. I want to replace V1 (or create a new column) with the group number. What are some ways to do this? I've tried various dplyr tricks to no avail, and searched here and elsewhere and have not found a similar problem. Wondering if there's a slick dplyr way to do this. Thank you for any insights.
The data frame has about 350 rows. Here is a subset:
> df
V1 V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 2 5 10 1 3 9
3 3 5 11 1 4 4
4 4 5 15 1 5 7
5 5 5 18 1 6 14
6 6 5 22 1 7 6
27 1 5 9 1 2 14
28 21 9 10 2 3 4
29 22 9 11 2 4 6
30 23 9 15 2 5 1
31 24 9 18 2 6 7
32 25 9 22 2 7 14
33 26 9 24 2 8 6
34 27 9 25 2 9 7
35 28 9 26 2 10 6
And I want it to look like this (or with group as an added column in the new.df):
> new.df
group V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 1 5 10 1 3 9
3 1 5 11 1 4 4
4 1 5 15 1 5 7
5 1 5 18 1 6 14
6 1 5 22 1 7 6
27 2 5 9 1 2 14
28 2 9 10 2 3 4
29 2 9 11 2 4 6
30 2 9 15 2 5 1
31 2 9 18 2 6 7
32 2 9 22 2 7 14
33 2 9 24 2 8 6
34 2 9 25 2 9 7
35 2 9 26 2 10 6
Here's the initial data frame to load into your R session:
df <- read.table(header=TRUE, text="
V1 V2 V3 V4 V5 V6
1 5 9 1 2 14
2 5 10 1 3 9
3 5 11 1 4 4
4 5 15 1 5 7
5 5 18 1 6 14
6 5 22 1 7 6
1 5 9 1 2 14
21 9 10 2 3 4
22 9 11 2 4 6
23 9 15 2 5 1
24 9 18 2 6 7
25 9 22 2 7 14
26 9 24 2 8 6
27 9 25 2 9 7
28 9 26 2 10 6
")
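One base-R way (also usable inside dplyr::mutate): a new group starts exactly where V1 drops relative to the previous row, so a cumulative sum over that "drop" indicator yields the group number. A sketch on a cut-down version of the data:

```r
# Cut-down version of the example data: V1 restarts at row 4
df <- read.table(header = TRUE, text = "
V1 V2
1 5
2 5
3 5
1 9
21 9
22 9
")
# TRUE at row 1 and wherever V1 decreases; cumsum numbers the runs
df$group <- cumsum(c(TRUE, diff(df$V1) < 0))
df$group
# [1] 1 1 1 2 2 2
```

The dplyr equivalent would be `df %>% mutate(group = cumsum(c(TRUE, diff(V1) < 0)))`.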

Get the X value of multivariable box plot whiskers in R

I am trying to get the values of the whiskers in a boxplot.
My data has these columns:
Company.ID, ACTIVE, Websource, Company.Name, Country, Sector, Ownership, Activity.Status, Update.Date, MIN_MAX_REVENUE, Number.of.Employees, NOE, splittedN, splittedco, splitted, RR, Range, SECTORNUM
I want to find the whiskers when I box-plot Number.of.Employees against Sector:
boxplot(Data$Range ~ Data$Sector, ylab="range", xlab="Sector", las=2)
I got the outliers with:
boxplot(Data$Range ~ Data$Sector, ylab="range", xlab="Sector", las=2)$out
[1] 18 16 12 35 15 65 45 25 50 40 30 32 30 50 45 65 80 35 35 40 90 25 60 30 40 25
[27] 50 25 40 65 25 35 60 27 130 30 100 25 30 40 30 35 25 23 150 60 29 23 30 56 30 25
[53] 22 23 40 80 30 32 22 30 28 7 25 8 10 7 8 11 30 10 10 32 10 10 40 20 8 2
[79] 3 4 2 15 10 3 4 2 2 6 2 4 2 3 3 2 2 2 2 2 13 2 3 5 3 5
[105] 3 2 4 7 2 6 2 2 2 5 3 3 2 2 2 3 4 9 4 15 2 2 2 10 2 2
[131] 4 19 2 9 2 6 2 2 2 4 4 2 15 2 2 4 2 2 2 27 4 2 3 2 2 2
[157] 3 12 7 2 11 2 3 2 2 3 2 2 8 14 5 3 4 170 3 2 4 3 5 3 2 2
[183] 5 2 2 3 2 6 2 2 2 2 2 3 3 2 17 4 2 2 2 3 4 3 4 2 7 2
[209] 4 2 5 2 2 10 3 30 12 23 15 14 30 200 12 45 16 20 16 12 12 19 12 60 18 18
[235] 30 15 12 20 12 30 21 25 40 22 30 70 32 50 40 32 47 50 30 21 16 20 25 18 12 14
[261] 30 10 14 15 30 11 8 10 15 8 18 7 20 13 15 17 25 10 17 8 20 17 45 7 15 7
[287] 17 9 8 8 8 20 10 20 10 19 10 20 10 9 16 7 16 20 15 8 15 10 12 10 9 10
[313] 7 10 10 12 9 22 10 8 10 9 14 8 7 10 10 15 20 8 15 15 14 8 50 20 50 10
[339] 10 10 50 3 18 4 15 5 2 4 11 7 16 15 2 2 2 2 2 2 3 2 2 2 6 7
[365] 2 8 2 3 2 2 2 2 2 7 2 2 2 4 5 2 5 3 2 3 4 2 2 44 2 2
[391] 8 3 2 10 10 7 10 10 11 20 18 11 3 20 5 2 5 2 2 6 30 6 2 2 43 13
[417] 30 10 10 35 16 16 11 10 15 10 9 8 16 7 21 5 50 30 4 4 14 15 2 2 5 8
[443] 5 40 2 2 2 2 2 2 25 2 4 3 2 6 2 10 5 4 5 2 2 3 3 4 2 2
[469] 14 8 5 2 7 2 2 3 42 20 10 10 15 13 11 40 10 15 30 20 2 8 3 8 3 4
[495] 2 4 2 3 2 4 4 2 3 35 5 2 3 8 2 8 2 3 40 35 2 2 2 2 7 2
[521] 3 3 2 30 15 4 60 2 28 4 2 2 5 10 2 2 3 4 18 2 6 2 4 4 2 2
[547] 30 9 2 3 12 5 2 2 5 3 4 2 11 2 2 2 8 2 2 3 6 3 7 2 2 2
[573] 2 40 14 2 2 3 2 3 3 18 14 9 10 25 12 19 35 10 10 15 25 15 17 20 35 10
I need the full info about these outliers (Company.Name, ...).
You first need the interquartile range,
IQR = 75% quartile - 25% quartile,
then the whiskers are:
upper whisker: the largest observation not exceeding 75% quartile + 1.5*IQR
lower whisker: the smallest observation not below 25% quartile - 1.5*IQR
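Rather than computing these by hand, boxplot() itself returns them: its invisible return value has a $stats matrix whose first and fifth rows are the whisker ends, one column per group. A sketch with made-up data (Range and Sector mirror the question's columns; the values are invented):

```r
# Made-up stand-in for the question's data frame
set.seed(1)
Data <- data.frame(Range  = c(rexp(50, 1/10), 200),
                   Sector = rep(c("Construction", "Retail"), length.out = 51))
b <- boxplot(Range ~ Sector, data = Data, plot = FALSE)
b$stats        # per sector: lower whisker, Q1, median, Q3, upper whisker
b$stats[1, ]   # lower whisker of each sector
b$stats[5, ]   # upper whisker of each sector
# Full rows of the original data whose Range value is flagged as an outlier
Data[Data$Range %in% b$out, ]
```

The last line answers the "full info about these outliers" part: matching the $out values back into the data frame keeps every column (Company.Name etc.), though %in% will also pick up any non-outlier rows that happen to share an outlier's value.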
