Create edgelist that contains mutual dyads in R

I have an edgelist where I want to keep dyads that mutually selected each other (e.g., 1 -> 4 and 4 -> 1). However, in the final edgelist I only want to keep one row instead of both rows of the mutual dyads (e.g., only row 1 -> 4 not both rows 1 -> 4 and 4 -> 1). How do I achieve that?
Here is the dataset:
library(igraph)
g <- sample_gnm(10, 50, directed=TRUE)
ff <- as_data_frame(g)
ff
from to
1 1 10
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 1 8
8 2 1
9 2 3
10 2 8
11 2 9
12 3 1
13 3 2
14 3 10
15 3 4
16 3 5
17 3 6
18 3 8
19 3 9
20 4 3
21 4 10
22 5 1
23 5 2
24 5 3
25 5 4
26 6 2
27 6 3
28 6 4
29 6 5
30 7 3
31 7 5
32 7 6
33 7 10
34 7 8
35 8 1
36 8 2
37 8 4
38 8 5
39 8 10
40 9 1
41 9 2
42 9 3
43 9 4
44 9 5
45 9 7
46 10 1
47 10 3
48 10 4
49 10 8
50 10 9
cd <- which_mutual(g) #I know I can use `which_mutual` to identify the mutual dyads
ff[which(cd==1),] #but in the end this keeps both rows of the mutual dyads (e.g., 1 -> 4 and 4 -> 1)
from to
4 1 4
6 1 6
7 1 7
9 2 10
10 2 3
14 3 2
18 3 6
21 4 1
25 5 10
28 6 1
30 6 3
32 6 10
33 6 7
34 7 1
37 7 6
39 7 8
42 8 7
45 9 10
46 10 2
47 10 5
48 10 6
50 10 9

We can use `duplicated` to create a logical vector after sorting each pair within its row (via `pmin`/`pmax`), so both orderings of a mutual dyad compare equal:
ff1 <- ff[which(cd==1),]
subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
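Alternatively, the filtering can be done entirely with igraph's output; a minimal self-contained sketch (the set.seed call is added here for reproducibility and is not part of the original question):

```r
library(igraph)

set.seed(1)                      # hypothetical seed, for reproducibility
g <- sample_gnm(10, 50, directed = TRUE)
ff <- as_data_frame(g)

# which_mutual() flags both rows of every reciprocated dyad;
# keeping only rows where from < to retains exactly one row per pair
mut <- ff[which_mutual(g), ]
mut[mut$from < mut$to, ]
```

Because every mutual dyad appears exactly twice (once in each direction), the `from < to` condition keeps exactly one of the two rows.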

R statistics: glm

I use a .csv exported from Excel. I recently added a 'bmi' variable. Unfortunately it gets analysed at different levels of 'bmi' instead of as a continuous variable (output below). I have tried making sure it is numeric, but with no improvement. Other continuous variables have not been a problem.
Updated code below, with a sample 3-variable dataset. Obviously the import filepath would need to change. I then go through two stages of subsetting.
The data_su subset causes a second problem: "glm.fit: algorithm did not converge". If I use the nitech1 dataset this doesn't occur. However, there is no obvious reason for the non-convergence, and normally I would include more variables, making convergence even less likely.
nitech <- read.csv("~/Documents/Publications/2020 NI_tech/Data/nitech.csv", stringsAsFactors = F)
head(nitech)
attach(nitech)
nitech1<-subset(nitech, 1/1305 & route=="ni" & t_1==1)
attach(nitech1)
data_su<-subset(nitech1, posn_neg_5>-3)
attach(data_su)
mod_su1<-glm(posn_neg_5>-2~a2+bmi, data = data_su, family=binomial(link="logit"))
summary(mod_su1)
posn_neg_5 bmi a2
1 21.875 22
3 NA 23
-1 23.24380165 24
4 NA 25
5 24.04934359 26
4 21.79930796 27
4 NA 28
5 40.03673095 29
4 28.73467206 30
-1 24.05693475 31
3 32.32653061 16
4 19.04700413 21
5 27.08415907 25
2 28.125 15
-2 25.05930703 25
5 29.96878252 34
1 21.44756785 11
5 21.44756785 11
5 27.15271059 21
4 19.35020755 14
2 18.7260073 15
5 25.81663021 18
1 27.1430375 12
4 35.10667027 18
5 27.17063157 5
2 34.81611682 2
2 20.89795918 12
4 24.54346238 3
4 24.93074792 3
5 31.22130395 11
1 27.734375 14
1 23.88844098 8
5 23.8497766 10
-1 27.76620852 32
4 24.38652644 15
5 23.57391519 11
5 24.1516725 8
1 24.07407407 5
-2 25.82644628 23
5 21.46193662 7
5 30.07116843 16
-2 18.99196494 11
4 22.8774057 4
4 16.49395819 NA
4 25.61728395 5
4 35.01992513 17
-1 23.89325888 36
4 22.92097107 8
2 21.2244898 2
5 20.81165453 7
4 21.70512943 NA
-2 NA 7
4 31.7417889 9
5 28.73467206 15
5 19.72318339 22
4 20.82093992 13
4 28.6801111 14
5 32.87197232 14
5 31.60321237 19
4 21.70512943 5
4 18.44618056 5
5 18.68512111 NA
-1 23.45679012 10
1 30.0432623 14
4 27.88761707 NA
5 41.46938776 18
4 24.96800974 NA
1 21.44756785 11
4 23.57391519 11
3 23.89325888 36
4 23.89325888 36
4 20.82093992 13
5 28.6801111 14
5 28.73467206 15
5 17.90885536 13
4 15.00207476 NA
4 23.71184463 19
4 29.3877551 19
4 20.95717116 12
3 35.15625 18
5 24.53512397 15
4 25.86451247 20
3 17.90885536 13
4 23.71184463 19
2 27.91551882 13
4 35.85643085 NA
4 24.69135802 NA
4 35.2507611 35
5 19.13580247 NA
4 25.18078512 16
4 28.3446712 17
1 31.60321237 8
4 NA NA
5 27.85200798 17
4 21.13271344 5
4 20.08827524 NA
5 22.58955144 25
4 17.96875 16
5 29.93759487 9
5 24.69135802 9
5 28.125 8
5 25.96952909 NA
4 19.80534178 5
4 22.09317005 26
4 16.23307275 12
4 22.85714286 7
4 32.24993701 29
4 32.24993701 29
4 27.75510204 17
-1 22.22222222 8
5 30.93043808 NA
5 30.93043808 15
5 26.42356982 6
5 33.65929705 15
2 24.34380949 9
4 24.34380949 7
1 27.17063157 9
5 37.73698829 NA
5 37.73698829 NA
1 23.30668005 11
-1 24.22145329 31
5 34.10798936 11
5 34.10798936 11
5 34.10798936 11
4 24.22145329 31
-1 18.04270106 NA
4 22.265625 25
5 34.10798936 11
2 25.86120485 34
-4 27.40765728 18
3 27.40765728 18
5 20.10916403 7
2 20.60408163 25
NA 24.77209671 24
1 22.49134948 NA
4 22.49134948 10
3 23.62444749 27
1 24.09297052 NA
5 24.1671624 13
-2 24.91349481 7
1 25.53544639 10
1 27.6816609 8
5 22.85714286 10
4 22.85714286 10
5 25.2493372 6
4 21.79930796 20
4 22.85714286 10
4 25.35154137 21
4 26.2345679 7
4 26.86873566 19
4 23.58832922 14
5 60.85439572 11
3 21.79930796 20
5 23.58832922 14
4 25.71166208 29
3 23.45679012 5
5 31.8877551 NA
-1 31.38510306 15
4 33.87406376 16
5 31.38510306 15
4 38.20018365 NA
3 31.38510306 15
3 29.20110193 25
5 31.99217133 NA
5 29.9940488 6
1 26.81359045 28
4 27.54820937 20
5 27.54820937 5
4 23.67125363 36
4 22.22222222 17
4 23.67125363 36
4 27.04164413 18
5 21.60493827 7
4 21.79930796 20
4 44.79082684 NA
5 43.02771702 NA
4 21.79930796 20
4 22.51606979 NA
4 20.76124567 24
4 22.98539751 23
1 26.56683975 24
4 22.98539751 23
4 29.16869227 8
1 24.75546432 8
5 24.19600671 24
5 30.79338244 NA
NA NA NA
2 NA 24
5 24.19600671 24
1 25.95155709 29
4 25.95155709 29
5 20.10916403 7
-1 25.390625 22
4 22.03856749 13
5 21.79930796 NA
5 22.03856749 13
-2 30.07116843 20
5 27.75748722 23
4 20.13476872 NA
4 30.49148654 9
5 27.77427093 12
5 30.0838291 25
3 22.22222222 7
-2 14.34257234 10
4 25.82644628 11
-1 20.10916403 22
4 25.82644628 11
5 23.30109483 27
4 23.30109483 27
5 22.88868802 12
4 26.2345679 9
4 36.22750875 6
5 30.47796622 33
1 22.63467632 4
3 22.03856749 31
5 24.69135802 8
5 25.21735858 21
4 15.1614521 16
5 21.14631991 19
5 30.79338244 NA
-1 27.77777778 12
5 25.66115203 9
5 35.91836735 9
5 26.7818261 19
5 26.7818261 19
-2 22.46003435 22
5 32.32323232 31
2 24.296875 11
4 26.7299275 27
5 24.48979592 9
4 23.03004535 28
5 25.2493372 6
4 20.51508648 8
4 23.87511478 15
-1 27.93277423 21
-1 27.93277423 21
4 20.51508648 8
5 27.77777778 29
4 21.49959688 8
4 28.96473469 6
2 24.69135802 24
5 29.86055123 7
5 21.60493827 17
5 41.86851211 16
5 27.77777778 19
2 28.515625 16
5 24.69135802 24
5 21.50294251 12
5 27.77777778 4
5 25.52059756 15
4 27.38328546 19
4 19.47714681 12
5 25.71166208 26
4 26.12244898 NA
5 21.484375 21
5 32.14024836 23
5 24.25867407 8
5 27.77777778 14
4 21.60493827 11
4 22.69401893 18
4 21.60493827 11
4 21.0498179 15
5 22.67573696 9
4 24.22145329 10
4 29.70801268 10
5 38.62236267 17
4 29.70801268 10
4 29.70801268 10
4 32.39588049 6
I would try two things: (1) Don't attach() your data sets. As long as you specify the data frame in your glm() call (as you have already done), you shouldn't need attach(). (2) Try changing your command to the following:
glm(I(posn_neg_5>-2)~a2+bmi, data = data_su, family=binomial(link="logit"))
The function I(), when used inside a formula, maps a variable (posn_neg_5) to the transformed variable (posn_neg_5 > -2). Putting arithmetic or comparisons in a formula without this wrapper can cause trouble.
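To see the suggested fix in context, here is a self-contained sketch with made-up data (the variable names mirror the question; the data themselves are hypothetical):

```r
set.seed(42)                               # hypothetical toy data
d <- data.frame(posn_neg_5 = sample(-2:5, 60, replace = TRUE),
                a2  = runif(60, 1, 5),
                bmi = rnorm(60, 25, 4))

# if bmi was read in as character/factor, coerce it before modelling
d$bmi <- as.numeric(as.character(d$bmi))

m <- glm(I(posn_neg_5 > -2) ~ a2 + bmi,
         data = d, family = binomial(link = "logit"))
summary(m)$coefficients
```

With bmi guaranteed numeric, glm treats it as a single continuous predictor rather than a set of factor levels.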

Imputation with categorical variables with mix package in R

I'm trying to impute missing values in a data set that contains categorical variables (7-point Likert scales) using the mix package in R. Here is what I'm doing:
1. Loading the data:
data <- read.csv("test.csv", header=TRUE, row.names="ID")
2. Here's what the data looks like:
The first column is my ID column, the next three columns are categorical variables (7-point Likert scales - these are the ones where I am interested in imputing the missing values). Then I have three auxiliary variables: aux_cat is another categorical variable (unordered ranging from 1 to 9, no missing data), aux_one is an integer (no missing data), aux_two is numerical (contains missing data).
var_one var_two var_three aux_cat aux_one aux_two
1 2 1 2 6 26 0.0
2 3 2 3 7 45 32906.5
3 6 2 3 3 31 1237.5
4 7 NA NA 8 11 277.0
5 4 3 1 5 145 78201.0
6 NA NA NA 6 30 48550.0
7 7 6 3 3 48 11568.0
8 6 6 4 2 15 4482.0
9 7 6 5 5 61 NA
10 5 6 7 3 2 NA
11 5 6 5 3 11 78663.0
12 6 2 2 3 16 1235.0
13 7 2 5 3 13 5781.0
14 6 5 4 6 16 5062.0
15 5 5 3 3 43 400.0
16 7 7 5 2 114 7968.0
17 6 5 4 3 99 247.5
18 7 7 7 6 114 1877.0
19 5 5 4 5 3 5881.5
20 4 4 2 3 65 1786.0
21 4 3 6 5 9 14117.5
22 3 3 2 3 35 2093.0
23 3 4 4 5 62 23071.5
24 5 3 5 3 22 2707.5
25 3 1 2 6 128 942.0
26 5 3 6 4 57 101379.0
27 5 5 4 6 76 1398.0
28 1 3 4 3 17 1024.5
29 4 3 2 1 143 10657.0
30 7 1 4 8 14 167.5
31 7 3 7 3 22 4344.0
32 3 3 3 6 27 1582.0
33 7 1 3 2 29 66.5
34 5 5 4 2 108 513.5
35 7 6 6 7 24 936.5
36 4 5 4 7 40 5950.5
37 NA NA NA 8 15 99.5
38 2 2 2 6 21 123.5
39 6 4 5 2 61 477.5
40 6 5 5 2 16 28921.0
41 6 2 2 2 11 1063.5
42 6 2 5 3 116 97798.5
43 4 4 2 8 11 9159.5
44 6 6 6 6 4 1098.5
45 6 4 5 7 21 236.5
46 4 6 4 5 43 219.5
47 3 2 3 3 28 85.5
48 5 5 5 2 71 13483.5
49 5 5 6 8 98 18400.0
50 5 6 6 3 27 357.0
51 5 7 6 7 14 145.5
52 4 5 5 3 93 427.5
53 3 4 5 2 40 412.0
54 6 6 3 2 8 2418.0
55 5 6 5 5 8 4923.5
56 4 5 2 7 32 4135.0
57 7 7 2 6 83 1408.5
58 7 2 3 2 12 5595.0
59 7 2 1 2 32 2280.5
60 7 4 5 3 11 638.5
61 7 5 3 3 24 225.5
62 4 3 3 9 44 570.0
3. Performing preliminary manipulations
I try to run prelim.mix(x, p) where x is the data matrix containing missing values and p is the number of categorical variables in x. The categorical variables must be in the first p columns of x, and they must be coded with consecutive positive integers starting with 1. For example, a binary variable must be coded as 1,2 rather than 0,1.
In my case p should be 4 since I have three Likert-scale variables where I want imputed values and one other categorical variable among my auxiliary variables.
s <- prelim.mix(data,4)
This step seems to work fine.
4. Finding the maximum likelihood (ML) estimate:
thetahat <- em.mix(s)
This is where I encounter the following error:
Steps of EM:
1...2...3...Error in em.mix(s) : NA/NaN/Inf in foreign function call (arg 6)
I think this must have something to do with my auxiliary variables, but I'm not sure. Any help would be much appreciated.
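One thing worth checking before em.mix() is that every categorical column really is coded with consecutive positive integers starting at 1, as prelim.mix() requires. A hedged diagnostic sketch (the column indices are assumed from the description above):

```r
# the first four columns are assumed to be the categorical ones
cat_cols <- data[, 1:4]
sapply(cat_cols, function(x) {
  v <- sort(unique(na.omit(x)))
  all(v == seq_along(v))   # TRUE iff coded 1, 2, ..., k with no gaps
})
```

A FALSE here (e.g. a Likert item where some level never occurs, leaving a gap) would violate the coding assumption and could plausibly trigger the foreign-function error.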

Replacing a restarting sequence in a dataframe with the group number of the sequence

I have a sequence in df$V1 that starts at some number and increases. At some point, it drops, indicating that observations for a new group have started. I want to replace V1 (or create a new column) with the group number. What are some ways to do this? I've tried various dplyr tricks to no avail, and searched here and elsewhere and have not found a similar problem. Wondering if there's a slick dplyr way to do this. Thank you for any insights.
The data frame has about 350 rows. Here is a subset:
> df
V1 V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 2 5 10 1 3 9
3 3 5 11 1 4 4
4 4 5 15 1 5 7
5 5 5 18 1 6 14
6 6 5 22 1 7 6
27 1 5 9 1 2 14
28 21 9 10 2 3 4
29 22 9 11 2 4 6
30 23 9 15 2 5 1
31 24 9 18 2 6 7
32 25 9 22 2 7 14
33 26 9 24 2 8 6
34 27 9 25 2 9 7
35 28 9 26 2 10 6
And I want it to look like this (or with group as an added column in the new.df):
> new.df
group V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 1 5 10 1 3 9
3 1 5 11 1 4 4
4 1 5 15 1 5 7
5 1 5 18 1 6 14
6 1 5 22 1 7 6
27 2 5 9 1 2 14
28 2 9 10 2 3 4
29 2 9 11 2 4 6
30 2 9 15 2 5 1
31 2 9 18 2 6 7
32 2 9 22 2 7 14
33 2 9 24 2 8 6
34 2 9 25 2 9 7
35 2 9 26 2 10 6
Here's the initial data frame to load into your R session:
df <- read.table(header=TRUE, text="
V1 V2 V3 V4 V5 V6
1 5 9 1 2 14
2 5 10 1 3 9
3 5 11 1 4 4
4 5 15 1 5 7
5 5 18 1 6 14
6 5 22 1 7 6
1 5 9 1 2 14
21 9 10 2 3 4
22 9 11 2 4 6
23 9 15 2 5 1
24 9 18 2 6 7
25 9 22 2 7 14
26 9 24 2 8 6
27 9 25 2 9 7
28 9 26 2 10 6
")

Subset data frame based on column values

I have a data frame consisting of the fluorescence read out of multiple cells tracked over time, for example:
Number=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
Fluorescence=c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df = data.frame(Number, Fluorescence)
Which gets:
Number Fluorescence
1 1 9
2 2 10
3 3 20
4 4 30
5 1 8
6 2 11
7 3 21
8 4 31
9 1 6
10 2 12
11 3 22
12 4 32
13 1 7
14 2 13
15 3 23
16 4 33
Number pertains to the cell number. What I want is to collate the fluorescence readout based on the cell number. The data.frame here has it counting 1-4, whereas really I want something like this:
Number Fluorescence
1 1 9
2 1 8
3 1 6
4 1 7
5 2 10
6 2 11
7 2 12
8 2 13
9 3 20
10 3 21
11 3 22
12 3 23
13 4 30
14 4 31
15 4 32
16 4 33
Or even more ideal would be having columns based on Number, then respective cell fluorescence:
1 2 3 4
1 9 10 20 30
2 8 11 21 31
3 6 12 22 32
4 7 13 23 33
I've used the which function to extract them one at a time:
Cell1=df[which(df[,1]==1),2]
But this would require me to write a line for each cell (of which there are hundreds).
Thank you for any help with this! Apologies that I'm still a bit of an R noob.
How about this:
library(tidyr);library(data.table)
number <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
fl <- c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df <- data.table(number,fl)
df[, index:=1:.N, keyby=number]
df
number fl index
1: 1 9 1
2: 1 8 2
3: 1 6 3
4: 1 7 4
5: 2 10 1
6: 2 11 2
7: 2 12 3
8: 2 13 4
9: 3 20 1
10: 3 21 2
11: 3 22 3
12: 3 23 4
13: 4 30 1
14: 4 31 2
15: 4 32 3
16: 4 33 4
The index column is added as the unique identifier needed by the spread function from tidyr. See this post for more information.
spread(df,number,fl)
index 1 2 3 4
1: 1 9 10 20 30
2: 2 8 11 21 31
3: 3 6 12 22 32
4: 4 7 13 23 33
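For completeness, the same reshapes can be done in base R; a sketch using the question's own data frame (column names as in the question):

```r
Number <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
Fluorescence <- c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df <- data.frame(Number, Fluorescence)

df[order(df$Number), ]               # long format, rows grouped by cell
unstack(df, Fluorescence ~ Number)   # wide format: one column per cell number
```

unstack works here because every cell has the same number of observations; with unequal group sizes it would return a list instead of a data frame.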

Which R argument fits well to obtain non-uniform bins using "plot" to build an informative histogram

I am new to R. I am trying to plot a cumulative frequency histogram (with non-uniform bins) for a huge amount of data: a few million positive numbers, with a minimum value of 1 and a maximum that varies between data sets (for instance 1e6 or 1e5). I used this simple code to generate a histogram from the data.
For example, here is some sample data:
[89601] 10 2 2 4 3 12 3 25 25 2
[89611] 5 5 5 2 23 22 14 8 13 10
[89621] 13 19 157 2 3 2 4 2 3 33
[89631] 22 2 14 9 2 3 3 3 8 2
[89641] 8 3 2 127 8 2 18 2 4 2
[89651] 2 13 3 34 8 2 6 10 3 7
[89661] 3 9 7 3 36 9 5 2 10 15
[89671] 7 2 23 2 2 2 2 7 6 25
[89681] 3 3 2 6 37 49 28 11 3 35
[89691] 2 2 8 3 3 2 2 4 3 12
[89701] 3 5 2 7 3 2 15 6 3 14
[89711] 13 5 3 2 2 8 34 4 4 65
[89721] 5 9 12 2 11 2 2 79 9 13
[89731] 2 66 2 9 10 22 11 2 6 3
[89741] 12 2 11 5 4 4 2 4 3 4
[89751] 2 8 9 3 2 2 84 7 11 10
[89761] 8 30 16 3 63 2 2 24 13 2
[89771] 11 37 2 9 21 21 10 2 2 49
[89781] 3 3 8 5 2 19 9 6 5 4
[89791] 4 2 9 2 10 33 5 4 2 2
[89801] 4 2 2 4 9 3 11 2 5 142
[89811] 17 2 11 4 2 8 26 2 9 8
[89821] 10 2 4 2 5 2 20 7 145 11
[89831] 22 19 8 14 18 39 3 2 3 3
[89841] 2 11 10 3 2 3 3 5 6 12
[89851] 17 5 3 8 2 2 2 2 2 5
[89861] 4 2 13 3 2 2 2 2 3 2
[89871] 4 3 21 2 6 2 8 9 7 14
[89881] 2 582 3 15 11 3 20 16 9 8
[89891] 6 2 6 7 3 20 17 2 9 5
[89901] 5 11 2 12 7 2 46 2 144 9
[89911] 2 3 36 25 3 2 16 2 2 119
[89921] 5 5 10 6 2 2 6 84 13 2
[89931] 2 6 6 2 17 3 7 4 102 48
data <- read.table("sample.txt", header=FALSE)
data <- hist(data$V1, breaks=length(data$V1), xlim=c(0,4000000))
plot(data)
When I did this I got a histogram with all the data (positive numbers) on the x-axis and counts on the y-axis. I then changed the x limit to cover only the area of interest:
plot(data, xlim=c(0,200000))
As before, a histogram is plotted, but with "plot" I couldn't define the number of bins, so the histogram is neither clear (it lacks the bar shape I want) nor informative.
As I am new to this forum, I have no idea how to upload images, so I couldn't include the histogram.
Any suggestions would be very helpful.
To plot a histogram you can use the hist() function directly:
hist(data$V1, xlim=c(0,200000), breaks=100)
The breaks parameter controls how many bars are plotted, but it applies to the full range of the data, not to the xlim you specified: the histogram is first computed with the given number of breaks, and the plot is then cut to the region you asked for.
But there is another way to plot the bars:
data <- read.table("sample.txt", header=FALSE)
data.hist <- hist(data$V1, breaks=length(data$V1), xlim=c(0,4000000))
plot(data.hist$counts, type='h')
The hist function returns an object holding the histogram parameters; I assume you are interested in the "counts" field. You can plot this information in a histogram-like way by setting type='h'.
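Since the question asks for a cumulative frequency plot over non-uniform bins, here is a hedged sketch along the same lines (the data and the log-spaced breaks are made up for illustration):

```r
x <- rexp(1e5, rate = 1/50) + 1          # stand-in for the real data (min ~1)
brks <- c(0, 2^(0:22))                   # non-uniform, log-spaced breaks
h <- hist(x, breaks = brks, plot = FALSE)

# cumulative counts, drawn as a step curve on a log x axis
plot(h$mids, cumsum(h$counts), type = "s", log = "x",
     xlab = "value", ylab = "cumulative count")
```

Log-spaced breaks keep the many small values and the rare huge ones visible on the same axis, which uniform bins cannot do for data spanning several orders of magnitude.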
