Double for loops in R for histograms

I have the following table and I am trying to run a double for loop in R to get a histogram of the distribution of responses for every month of the survey (I will then fit a distribution to it). I am currently running the following code, but cannot seem to get anywhere. Any suggestions?
for (i in 2008:2021) {
  for (j in 1:12) {
    dfn <- df(df$Year=i, df$Month=j)
    hist(dfn)
  }
}
Month Year -3  0  2  4 5.5  8 12.5 15
    1 2008  3  2 28 41  17  3    5  1
    2 2008  5  3 26 40  15  4    6  1
    3 2008  6  4 27 39  13  4    6  1
    4 2008  9  4 18 28  28  5    7  1
    5 2008  6  5 15 29  29  6    9  1
    6 2008  8  3 17 28  26  6   10  2
    7 2008  9  5 16 28  28  4    9  1
    8 2008  5  5 19 29  26  5    9  2
    9 2008  7  5 22 39  15  4    7  1
   10 2008  8  6 20 40  15  4    7  0
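For what it's worth, a hedged sketch of what the loop may have intended: df() is R's F-distribution density function, not a subsetting call, and = must be == inside a logical condition. Note also that each row here holds counts per response bin, so hist() on the counts would not show the response distribution; barplot() of the counts against the bin labels comes closer. Column and object names are assumed from the question.
bins <- c(-3, 0, 2, 4, 5.5, 8, 12.5, 15)
for (i in 2008:2021) {
  for (j in 1:12) {
    dfn <- df[df$Year == i & df$Month == j, ]   # logical subsetting, not a call to df()
    if (nrow(dfn) == 1) {
      counts <- unlist(dfn[!(names(dfn) %in% c("Year", "Month"))])
      barplot(counts, names.arg = bins, main = paste("Year", i, "Month", j))
    }
  }
}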


Create edgelist that contains mutual dyads

I have an edgelist where I want to keep dyads that mutually selected each other (e.g., 1 -> 4 and 4 -> 1). However, in the final edgelist I only want to keep one row instead of both rows of the mutual dyads (e.g., only row 1 -> 4 not both rows 1 -> 4 and 4 -> 1). How do I achieve that?
Here is the dataset:
library(igraph)
g <- sample_gnm(10, 50, directed=TRUE) # keep the graph object; which_mutual() below needs it
ff <- as_data_frame(g)
ff
from to
1 1 10
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 1 8
8 2 1
9 2 3
10 2 8
11 2 9
12 3 1
13 3 2
14 3 10
15 3 4
16 3 5
17 3 6
18 3 8
19 3 9
20 4 3
21 4 10
22 5 1
23 5 2
24 5 3
25 5 4
26 6 2
27 6 3
28 6 4
29 6 5
30 7 3
31 7 5
32 7 6
33 7 10
34 7 8
35 8 1
36 8 2
37 8 4
38 8 5
39 8 10
40 9 1
41 9 2
42 9 3
43 9 4
44 9 5
45 9 7
46 10 1
47 10 3
48 10 4
49 10 8
50 10 9
cd <- which_mutual(g) #I know I can use `which_mutual` to identify the mutual dyads
ff[which(cd==1),] #but in the end this keeps both rows of the mutual dyads (e.g., 1 -> 4 and 4 -> 1)
from to
4 1 4
6 1 6
7 1 7
9 2 10
10 2 3
14 3 2
18 3 6
21 4 1
25 5 10
28 6 1
30 6 3
32 6 10
33 6 7
34 7 1
37 7 6
39 7 8
42 8 7
45 9 10
46 10 2
47 10 5
48 10 6
50 10 9
We may use duplicated to create a logical vector after sorting the two endpoints within each row: pmin and pmax give the smaller and larger node of each dyad, so both directions of a mutual pair collapse to the same key and the second occurrence is dropped.
ff1 <- ff[which(cd==1),]
subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
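An equivalent shortcut, assuming no self-loops (sample_gnm() creates none): among the mutual rows, keep only the direction whose source is the smaller node, so each dyad appears exactly once.
subset(ff[cd, ], from < to) # one row per mutual dyad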

R statistics: glm

I use a .csv from Excel. I recently added a 'bmi' variable. Unfortunately it gets analysed at different levels of 'bmi' instead of as a continuous variable (output below). I have tried making sure it is numeric, but no improvement. Other continuous variables have not been a problem.
Updated code below with a sample three-variable dataset. Obviously the import filepath would need to change. I then go through two stages of subsetting.
The data_su subset causes a second problem: "glm.fit: algorithm did not converge". If I use the nitech1 dataset this doesn't occur. However, there is no obvious reason for the convergence failure, and normally I would include more variables, which would make convergence even less likely.
nitech <- read.csv("~/Documents/Publications/2020 NI_tech/Data/nitech.csv", stringsAsFactors = F)
head(nitech)
attach(nitech)
nitech1 <- subset(nitech, 1/1305 & route == "ni" & t_1 == 1)
attach(nitech1)
data_su <- subset(nitech1, posn_neg_5 > -3)
attach(data_su)
mod_su1 <- glm(posn_neg_5 > -2 ~ a2 + bmi, data = data_su, family = binomial(link = "logit"))
summary(mod_su1)
posn_neg_5 bmi a2
1 21.875 22
3 NA 23
-1 23.24380165 24
4 NA 25
5 24.04934359 26
4 21.79930796 27
4 NA 28
5 40.03673095 29
4 28.73467206 30
-1 24.05693475 31
3 32.32653061 16
4 19.04700413 21
5 27.08415907 25
2 28.125 15
-2 25.05930703 25
5 29.96878252 34
1 21.44756785 11
5 21.44756785 11
5 27.15271059 21
4 19.35020755 14
2 18.7260073 15
5 25.81663021 18
1 27.1430375 12
4 35.10667027 18
5 27.17063157 5
2 34.81611682 2
2 20.89795918 12
4 24.54346238 3
4 24.93074792 3
5 31.22130395 11
1 27.734375 14
1 23.88844098 8
5 23.8497766 10
-1 27.76620852 32
4 24.38652644 15
5 23.57391519 11
5 24.1516725 8
1 24.07407407 5
-2 25.82644628 23
5 21.46193662 7
5 30.07116843 16
-2 18.99196494 11
4 22.8774057 4
4 16.49395819 NA
4 25.61728395 5
4 35.01992513 17
-1 23.89325888 36
4 22.92097107 8
2 21.2244898 2
5 20.81165453 7
4 21.70512943 NA
-2 NA 7
4 31.7417889 9
5 28.73467206 15
5 19.72318339 22
4 20.82093992 13
4 28.6801111 14
5 32.87197232 14
5 31.60321237 19
4 21.70512943 5
4 18.44618056 5
5 18.68512111 NA
-1 23.45679012 10
1 30.0432623 14
4 27.88761707 NA
5 41.46938776 18
4 24.96800974 NA
1 21.44756785 11
4 23.57391519 11
3 23.89325888 36
4 23.89325888 36
4 20.82093992 13
5 28.6801111 14
5 28.73467206 15
5 17.90885536 13
4 15.00207476 NA
4 23.71184463 19
4 29.3877551 19
4 20.95717116 12
3 35.15625 18
5 24.53512397 15
4 25.86451247 20
3 17.90885536 13
4 23.71184463 19
2 27.91551882 13
4 35.85643085 NA
4 24.69135802 NA
4 35.2507611 35
5 19.13580247 NA
4 25.18078512 16
4 28.3446712 17
1 31.60321237 8
4 NA NA
5 27.85200798 17
4 21.13271344 5
4 20.08827524 NA
5 22.58955144 25
4 17.96875 16
5 29.93759487 9
5 24.69135802 9
5 28.125 8
5 25.96952909 NA
4 19.80534178 5
4 22.09317005 26
4 16.23307275 12
4 22.85714286 7
4 32.24993701 29
4 32.24993701 29
4 27.75510204 17
-1 22.22222222 8
5 30.93043808 NA
5 30.93043808 15
5 26.42356982 6
5 33.65929705 15
2 24.34380949 9
4 24.34380949 7
1 27.17063157 9
5 37.73698829 NA
5 37.73698829 NA
1 23.30668005 11
-1 24.22145329 31
5 34.10798936 11
5 34.10798936 11
5 34.10798936 11
4 24.22145329 31
-1 18.04270106 NA
4 22.265625 25
5 34.10798936 11
2 25.86120485 34
-4 27.40765728 18
3 27.40765728 18
5 20.10916403 7
2 20.60408163 25
NA 24.77209671 24
1 22.49134948 NA
4 22.49134948 10
3 23.62444749 27
1 24.09297052 NA
5 24.1671624 13
-2 24.91349481 7
1 25.53544639 10
1 27.6816609 8
5 22.85714286 10
4 22.85714286 10
5 25.2493372 6
4 21.79930796 20
4 22.85714286 10
4 25.35154137 21
4 26.2345679 7
4 26.86873566 19
4 23.58832922 14
5 60.85439572 11
3 21.79930796 20
5 23.58832922 14
4 25.71166208 29
3 23.45679012 5
5 31.8877551 NA
-1 31.38510306 15
4 33.87406376 16
5 31.38510306 15
4 38.20018365 NA
3 31.38510306 15
3 29.20110193 25
5 31.99217133 NA
5 29.9940488 6
1 26.81359045 28
4 27.54820937 20
5 27.54820937 5
4 23.67125363 36
4 22.22222222 17
4 23.67125363 36
4 27.04164413 18
5 21.60493827 7
4 21.79930796 20
4 44.79082684 NA
5 43.02771702 NA
4 21.79930796 20
4 22.51606979 NA
4 20.76124567 24
4 22.98539751 23
1 26.56683975 24
4 22.98539751 23
4 29.16869227 8
1 24.75546432 8
5 24.19600671 24
5 30.79338244 NA
NA NA NA
2 NA 24
5 24.19600671 24
1 25.95155709 29
4 25.95155709 29
5 20.10916403 7
-1 25.390625 22
4 22.03856749 13
5 21.79930796 NA
5 22.03856749 13
-2 30.07116843 20
5 27.75748722 23
4 20.13476872 NA
4 30.49148654 9
5 27.77427093 12
5 30.0838291 25
3 22.22222222 7
-2 14.34257234 10
4 25.82644628 11
-1 20.10916403 22
4 25.82644628 11
5 23.30109483 27
4 23.30109483 27
5 22.88868802 12
4 26.2345679 9
4 36.22750875 6
5 30.47796622 33
1 22.63467632 4
3 22.03856749 31
5 24.69135802 8
5 25.21735858 21
4 15.1614521 16
5 21.14631991 19
5 30.79338244 NA
-1 27.77777778 12
5 25.66115203 9
5 35.91836735 9
5 26.7818261 19
5 26.7818261 19
-2 22.46003435 22
5 32.32323232 31
2 24.296875 11
4 26.7299275 27
5 24.48979592 9
4 23.03004535 28
5 25.2493372 6
4 20.51508648 8
4 23.87511478 15
-1 27.93277423 21
-1 27.93277423 21
4 20.51508648 8
5 27.77777778 29
4 21.49959688 8
4 28.96473469 6
2 24.69135802 24
5 29.86055123 7
5 21.60493827 17
5 41.86851211 16
5 27.77777778 19
2 28.515625 16
5 24.69135802 24
5 21.50294251 12
5 27.77777778 4
5 25.52059756 15
4 27.38328546 19
4 19.47714681 12
5 25.71166208 26
4 26.12244898 NA
5 21.484375 21
5 32.14024836 23
5 24.25867407 8
5 27.77777778 14
4 21.60493827 11
4 22.69401893 18
4 21.60493827 11
4 21.0498179 15
5 22.67573696 9
4 24.22145329 10
4 29.70801268 10
5 38.62236267 17
4 29.70801268 10
4 29.70801268 10
4 32.39588049 6
I would try two things: (1) don't "attach" your data sets. As long as you specify the data frame in your glm function call (as you have already done), you shouldn't need to attach. (2) Try changing your command to the following:
glm(I(posn_neg_5>-2)~a2+bmi, data = data_su, family=binomial(link="logit"))
The function I(), when put inside a formula, maps a variable (posn_neg_5) to the new variable (posn_neg_5 > -2). Sometimes you can get into trouble by including arithmetic or comparisons in a formula without this wrapper.
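On the original issue of bmi being analysed at discrete levels: that usually means the column arrived from the CSV as character or factor. A hedged base-R check, with column and object names taken from the question:
class(nitech$bmi) # should be "numeric"; "character" or "factor" would explain the per-level output
nitech$bmi <- as.numeric(as.character(nitech$bmi)) # coerce via character so factor codes aren't turned into 1, 2, 3, ...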

Imputation with categorical variables with mix package in R

I'm trying to impute missing values in a data set that contains categorical variables (7-point Likert scales) using the mix package in R. Here is what I'm doing:
1. Loading the data:
data <- read.csv("test.csv", header=TRUE, row.names="ID")
2. Here's what the data looks like:
The first column is my ID column; the next three columns are categorical variables (7-point Likert scales; these are the ones where I want to impute missing values). Then I have three auxiliary variables: aux_cat is another categorical variable (unordered, ranging from 1 to 9, no missing data), aux_one is an integer (no missing data), and aux_two is numerical (contains missing data).
var_one var_two var_three aux_cat aux_one aux_two
1 2 1 2 6 26 0.0
2 3 2 3 7 45 32906.5
3 6 2 3 3 31 1237.5
4 7 NA NA 8 11 277.0
5 4 3 1 5 145 78201.0
6 NA NA NA 6 30 48550.0
7 7 6 3 3 48 11568.0
8 6 6 4 2 15 4482.0
9 7 6 5 5 61 NA
10 5 6 7 3 2 NA
11 5 6 5 3 11 78663.0
12 6 2 2 3 16 1235.0
13 7 2 5 3 13 5781.0
14 6 5 4 6 16 5062.0
15 5 5 3 3 43 400.0
16 7 7 5 2 114 7968.0
17 6 5 4 3 99 247.5
18 7 7 7 6 114 1877.0
19 5 5 4 5 3 5881.5
20 4 4 2 3 65 1786.0
21 4 3 6 5 9 14117.5
22 3 3 2 3 35 2093.0
23 3 4 4 5 62 23071.5
24 5 3 5 3 22 2707.5
25 3 1 2 6 128 942.0
26 5 3 6 4 57 101379.0
27 5 5 4 6 76 1398.0
28 1 3 4 3 17 1024.5
29 4 3 2 1 143 10657.0
30 7 1 4 8 14 167.5
31 7 3 7 3 22 4344.0
32 3 3 3 6 27 1582.0
33 7 1 3 2 29 66.5
34 5 5 4 2 108 513.5
35 7 6 6 7 24 936.5
36 4 5 4 7 40 5950.5
37 NA NA NA 8 15 99.5
38 2 2 2 6 21 123.5
39 6 4 5 2 61 477.5
40 6 5 5 2 16 28921.0
41 6 2 2 2 11 1063.5
42 6 2 5 3 116 97798.5
43 4 4 2 8 11 9159.5
44 6 6 6 6 4 1098.5
45 6 4 5 7 21 236.5
46 4 6 4 5 43 219.5
47 3 2 3 3 28 85.5
48 5 5 5 2 71 13483.5
49 5 5 6 8 98 18400.0
50 5 6 6 3 27 357.0
51 5 7 6 7 14 145.5
52 4 5 5 3 93 427.5
53 3 4 5 2 40 412.0
54 6 6 3 2 8 2418.0
55 5 6 5 5 8 4923.5
56 4 5 2 7 32 4135.0
57 7 7 2 6 83 1408.5
58 7 2 3 2 12 5595.0
59 7 2 1 2 32 2280.5
60 7 4 5 3 11 638.5
61 7 5 3 3 24 225.5
62 4 3 3 9 44 570.0
3. Performing preliminary manipulations
I try to run prelim.mix(x, p) where x is the data matrix containing missing values and p is the number of categorical variables in x. The categorical variables must be in the first p columns of x, and they must be coded with consecutive positive integers starting with 1. For example, a binary variable must be coded as 1,2 rather than 0,1.
In my case p should be 4 since I have three Likert-scale variables where I want imputed values and one other categorical variable among my auxiliary variables.
s <- prelim.mix(data,4)
This step seems to work fine.
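Given the coding rules quoted above, a quick sanity check before the EM step is to confirm that each of the first p columns really uses consecutive positive integers starting at 1 (a hedged base-R sketch, offered as a diagnostic rather than a fix for the error below):
sapply(data[, 1:4], function(x) sort(unique(na.omit(x)))) # each should be 1, 2, ..., k with no gaps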
4. Finding the maximum likelihood (ML) estimate:
thetahat <- em.mix(s)
This is where I encounter the following error:
Steps of EM:
1...2...3...Error in em.mix(s) : NA/NaN/Inf in foreign function call (arg 6)
I think this must have something to do with my auxiliary variables, but I'm not sure. Any help would be much appreciated.

Calculating Pearson correlation and significance in R [duplicate]

This question already has answers here:
cor shows only NA or 1 for correlations - Why?
(6 answers)
Reducing correlation of datasets with NA
(3 answers)
Closed 5 years ago.
I am trying to examine the association of two variables, i.e., number of publications and publication year. My dataframe looks like this:
Year yr RI1 RI2 RI3 RI4 RI5 RI6 RI7 RI8 RI9 RI10 Total X
1 2005 1 2 2 3 8 2 7 16 42 1 3 86 NA
2 2006 2 1 2 9 8 1 8 25 40 0 3 97 NA
3 2007 3 9 2 13 10 3 6 32 32 2 2 111 NA
4 2008 4 4 0 12 6 3 5 31 60 1 5 127 NA
5 2009 5 13 5 12 9 5 5 28 55 3 3 138 NA
6 2010 6 13 4 11 10 3 7 33 64 5 1 151 NA
7 2011 7 10 13 10 10 4 3 42 61 4 3 160 NA
8 2012 8 11 6 18 15 6 5 43 64 6 2 176 NA
9 2013 9 22 11 17 12 6 7 50 62 9 5 201 1247
10 NA NA 85 45 105 88 33 53 300 480 31 27 1247 NA
I have used the cor() function as follows but do not get results:
> year = researchinstitutions$Year
> RI1 = researchinstitutions$RI1
> cor(year, RI1)
[1] NA
Any suggestions on how to get around this?
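As the linked duplicates explain, cor() returns NA when either input contains a missing value; here the Year column has an NA in the totals row. A hedged sketch of the usual fixes:
cor(year, RI1, use = "complete.obs") # ignore rows where either value is NA
cor.test(year, RI1) # Pearson correlation with a significance test; drops incomplete pairs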

Isolated nodes appearing in the network in R

I am using library(network) and have the following edge list to generate a network.
Commands used:
library(network)
edgelist <- read.table("Filename")
net <- network(edgelist)
plot(net)
What I observe is isolated nodes in the network plot! Can anyone help in deciphering the reason? I used the same edgelist with Cytoscape, and it works perfectly there. Why is it causing a problem in R?
Following is the edgelist:
7 2
2 6
3 2
13 2
1 2
4 2
5 2
9 2
25 29
5 4
13 8
18 17
5 15
13 1
22 8
25 12
21 11
17 28
18 8
13 16
33 20
10 27
12 4
24 23
12 1
19 26
4 3
3 15
8 11
16 62
36 8
18 11
10 62
4 6
4 1
32 62
12 16
4 15
17 30
22 10
34 11
31 10
9 6
4 7
24 20
5 6
1 6
3 6
9 7
21 19
35 23
7 6
10 8
5 7
1 7
3 7
1 3
1 9
5 1
3 9
5 3
5 9
Found the reason! The R network package expects node IDs to be consecutive integers. One of the IDs was 62, while there were no nodes with IDs between 36 and 62, so the package created the missing IDs as isolated vertices. Once the IDs were renumbered consecutively, it worked well.
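For anyone hitting the same thing, a hedged sketch of one way to relabel the IDs consecutively before building the network (assuming read.table's default V1/V2 column names):
ids <- factor(c(edgelist$V1, edgelist$V2)) # all IDs actually used, both columns
relabeled <- matrix(as.numeric(ids), ncol = 2) # same pairs, recoded to 1..n
net <- network(relabeled, matrix.type = "edgelist")
plot(net)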
