Get the X value of multivariable box plot whiskers in R - r

I am trying to get the values of the whiskers in a boxplot.
A sample of my data is:
Company.ID ACTIVE Websource Company.Name Country Sector Ownership Activity.Status Update.Date MIN_MAX_REVENUE 16 Construction Private Number.of.Employees NOE splittedN splittedco splitted RR Range SECTORNUM
I want to find the whiskers when I box-plot Number.of.Employees and Sector:
boxplot(Data$Range ~ Data$Sector, ylab= "range", xlab= "Sector", las=2)
I got the outliers with:
boxplot(Data$Range ~ Data$Sector, ylab= "range", xlab= "Sector", las=2)$out
[1] 18 16 12 35 15 65 45 25 50 40 30 32 30 50 45 65 80 35 35 40 90 25 60 30 40 25
[27] 50 25 40 65 25 35 60 27 130 30 100 25 30 40 30 35 25 23 150 60 29 23 30 56 30 25
[53] 22 23 40 80 30 32 22 30 28 7 25 8 10 7 8 11 30 10 10 32 10 10 40 20 8 2
[79] 3 4 2 15 10 3 4 2 2 6 2 4 2 3 3 2 2 2 2 2 13 2 3 5 3 5
[105] 3 2 4 7 2 6 2 2 2 5 3 3 2 2 2 3 4 9 4 15 2 2 2 10 2 2
[131] 4 19 2 9 2 6 2 2 2 4 4 2 15 2 2 4 2 2 2 27 4 2 3 2 2 2
[157] 3 12 7 2 11 2 3 2 2 3 2 2 8 14 5 3 4 170 3 2 4 3 5 3 2 2
[183] 5 2 2 3 2 6 2 2 2 2 2 3 3 2 17 4 2 2 2 3 4 3 4 2 7 2
[209] 4 2 5 2 2 10 3 30 12 23 15 14 30 200 12 45 16 20 16 12 12 19 12 60 18 18
[235] 30 15 12 20 12 30 21 25 40 22 30 70 32 50 40 32 47 50 30 21 16 20 25 18 12 14
[261] 30 10 14 15 30 11 8 10 15 8 18 7 20 13 15 17 25 10 17 8 20 17 45 7 15 7
[287] 17 9 8 8 8 20 10 20 10 19 10 20 10 9 16 7 16 20 15 8 15 10 12 10 9 10
[313] 7 10 10 12 9 22 10 8 10 9 14 8 7 10 10 15 20 8 15 15 14 8 50 20 50 10
[339] 10 10 50 3 18 4 15 5 2 4 11 7 16 15 2 2 2 2 2 2 3 2 2 2 6 7
[365] 2 8 2 3 2 2 2 2 2 7 2 2 2 4 5 2 5 3 2 3 4 2 2 44 2 2
[391] 8 3 2 10 10 7 10 10 11 20 18 11 3 20 5 2 5 2 2 6 30 6 2 2 43 13
[417] 30 10 10 35 16 16 11 10 15 10 9 8 16 7 21 5 50 30 4 4 14 15 2 2 5 8
[443] 5 40 2 2 2 2 2 2 25 2 4 3 2 6 2 10 5 4 5 2 2 3 3 4 2 2
[469] 14 8 5 2 7 2 2 3 42 20 10 10 15 13 11 40 10 15 30 20 2 8 3 8 3 4
[495] 2 4 2 3 2 4 4 2 3 35 5 2 3 8 2 8 2 3 40 35 2 2 2 2 7 2
[521] 3 3 2 30 15 4 60 2 28 4 2 2 5 10 2 2 3 4 18 2 6 2 4 4 2 2
[547] 30 9 2 3 12 5 2 2 5 3 4 2 11 2 2 2 8 2 2 3 6 3 7 2 2 2
[573] 2 40 14 2 2 3 2 3 3 18 14 9 10 25 12 19 35 10 10 15 25 15 17 20 35 10
I need the full information about these outliers (Company.Name, ...).

You first need the interquartile range:
IQR = 75% quartile - 25% quartile
Then you find the
upper whisker at min(max(x), 75% quartile + 1.5*IQR)
lower whisker at max(min(x), 25% quartile - 1.5*IQR)
boxplot() draws each whisker at the most extreme data point that lies within these limits.
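As a rough sketch of how the whisker ends and the full outlier rows could be pulled out per group: boxplot.stats() returns the five-number summary in $stats, whose first and last elements are the whisker ends. The data frame below is a made-up stand-in that only assumes the columns Company.Name, Sector and Range from the question; the values are illustrative.

```r
# Toy data standing in for the real data set (values are illustrative only)
Data <- data.frame(
  Company.Name = paste0("C", 1:12),
  Sector = rep(c("A", "B"), each = 6),
  Range  = c(1, 2, 3, 4, 5, 50, 10, 11, 12, 13, 14, 100),
  stringsAsFactors = FALSE
)

# Whisker ends per sector: elements 1 and 5 of boxplot.stats()$stats
whiskers <- t(sapply(split(Data$Range, Data$Sector), function(x) {
  s <- boxplot.stats(x)$stats
  c(lower = s[1], upper = s[5])
}))

# Flag rows that fall outside the whiskers of their own sector
sector <- as.character(Data$Sector)
is_out <- Data$Range < whiskers[sector, "lower"] |
          Data$Range > whiskers[sector, "upper"]

# Full rows (company name etc.) for the outliers
outliers <- Data[is_out, ]
print(whiskers)
print(outliers)
```

Because the rows are selected from the original data frame, every column (Company.Name and so on) comes along with the outlier values.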

Related

Create edgelist that contains mutual dyads

I have an edgelist where I want to keep dyads that mutually selected each other (e.g., 1 -> 4 and 4 -> 1). However, in the final edgelist I only want to keep one row instead of both rows of the mutual dyads (e.g., only row 1 -> 4 not both rows 1 -> 4 and 4 -> 1). How do I achieve that?
Here is the dataset:
library(igraph)
g <- sample_gnm(10, 50, directed=TRUE) # keep the graph object; which_mutual() below needs it
ff <- as_data_frame(g)
ff
from to
1 1 10
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 1 8
8 2 1
9 2 3
10 2 8
11 2 9
12 3 1
13 3 2
14 3 10
15 3 4
16 3 5
17 3 6
18 3 8
19 3 9
20 4 3
21 4 10
22 5 1
23 5 2
24 5 3
25 5 4
26 6 2
27 6 3
28 6 4
29 6 5
30 7 3
31 7 5
32 7 6
33 7 10
34 7 8
35 8 1
36 8 2
37 8 4
38 8 5
39 8 10
40 9 1
41 9 2
42 9 3
43 9 4
44 9 5
45 9 7
46 10 1
47 10 3
48 10 4
49 10 8
50 10 9
cd <- which_mutual(g) #I know I can use `which_mutual` to identify the mutual dyads
ff[which(cd==1),] #but in the end this keeps both rows of the mutual dyads (e.g., 1 -> 4 and 4 -> 1)
from to
4 1 4
6 1 6
7 1 7
9 2 10
10 2 3
14 3 2
18 3 6
21 4 1
25 5 10
28 6 1
30 6 3
32 6 10
33 6 7
34 7 1
37 7 6
39 7 8
42 8 7
45 9 10
46 10 2
47 10 5
48 10 6
50 10 9
We may use duplicated() to create a logical vector after sorting the elements of each row:
ff1 <- ff[which(cd==1),]
subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
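To see what the pmin()/pmax() step does, here is a small self-contained sketch that does not need igraph. The edgelist is hand-made for illustration, and the mutual edges are found with a plain reverse-edge lookup rather than which_mutual().

```r
# Small hand-made directed edgelist (hypothetical example data)
ff <- data.frame(from = c(1, 4, 2, 3, 1, 5),
                 to   = c(4, 1, 3, 2, 5, 2))

# An edge is mutual if its reverse also appears in the edgelist
key     <- paste(ff$from, ff$to)
rev_key <- paste(ff$to, ff$from)
ff1 <- ff[rev_key %in% key, ]

# Order each pair as (smaller, larger) so 1 -> 4 and 4 -> 1 look identical,
# then drop the second occurrence of each pair
mutual_once <- subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
print(mutual_once)
```

Only the first row of each mutual pair survives, so the direction kept is whichever one appears first in the edgelist.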

R statistics: glm

I use a .csv from Excel. I recently added a 'bmi' variable. Unfortunately it gets analysed as different levels of 'bmi' instead of as a continuous variable (output below). I have tried making sure it is numeric, but with no improvement. Other continuous variables have not been a problem.
Updated code below with a sample 3 variable dataset. Obviously the import filepath would need to change. I then go through 2 stages of subsetting.
The data_su subset causes a 2nd problem of "glm.fit: algorithm did not converge". If I use the nitech1 dataset this doesn't occur. However, there is no obvious convergence and normally I would include more variables making it even more unlikely.
nitech <- read.csv("~/Documents/Publications/2020 NI_tech/Data/nitech.csv", stringsAsFactors = F)
head(nitech)
attach(nitech)
nitech1<-subset(nitech, 1/1305 & route=="ni" & t_1==1)
attach(nitech1)
data_su<-subset(nitech1, posn_neg_5>-3)
attach(data_su)
mod_su1<-glm(posn_neg_5>-2~a2+bmi, data = data_su, family=binomial(link="logit"))
summary(mod_su1)
posn_neg_5 bmi a2
1 21.875 22
3 NA 23
-1 23.24380165 24
4 NA 25
5 24.04934359 26
4 21.79930796 27
4 NA 28
5 40.03673095 29
4 28.73467206 30
-1 24.05693475 31
3 32.32653061 16
4 19.04700413 21
5 27.08415907 25
2 28.125 15
-2 25.05930703 25
5 29.96878252 34
1 21.44756785 11
5 21.44756785 11
5 27.15271059 21
4 19.35020755 14
2 18.7260073 15
5 25.81663021 18
1 27.1430375 12
4 35.10667027 18
5 27.17063157 5
2 34.81611682 2
2 20.89795918 12
4 24.54346238 3
4 24.93074792 3
5 31.22130395 11
1 27.734375 14
1 23.88844098 8
5 23.8497766 10
-1 27.76620852 32
4 24.38652644 15
5 23.57391519 11
5 24.1516725 8
1 24.07407407 5
-2 25.82644628 23
5 21.46193662 7
5 30.07116843 16
-2 18.99196494 11
4 22.8774057 4
4 16.49395819 NA
4 25.61728395 5
4 35.01992513 17
-1 23.89325888 36
4 22.92097107 8
2 21.2244898 2
5 20.81165453 7
4 21.70512943 NA
-2 NA 7
4 31.7417889 9
5 28.73467206 15
5 19.72318339 22
4 20.82093992 13
4 28.6801111 14
5 32.87197232 14
5 31.60321237 19
4 21.70512943 5
4 18.44618056 5
5 18.68512111 NA
-1 23.45679012 10
1 30.0432623 14
4 27.88761707 NA
5 41.46938776 18
4 24.96800974 NA
1 21.44756785 11
4 23.57391519 11
3 23.89325888 36
4 23.89325888 36
4 20.82093992 13
5 28.6801111 14
5 28.73467206 15
5 17.90885536 13
4 15.00207476 NA
4 23.71184463 19
4 29.3877551 19
4 20.95717116 12
3 35.15625 18
5 24.53512397 15
4 25.86451247 20
3 17.90885536 13
4 23.71184463 19
2 27.91551882 13
4 35.85643085 NA
4 24.69135802 NA
4 35.2507611 35
5 19.13580247 NA
4 25.18078512 16
4 28.3446712 17
1 31.60321237 8
4 NA NA
5 27.85200798 17
4 21.13271344 5
4 20.08827524 NA
5 22.58955144 25
4 17.96875 16
5 29.93759487 9
5 24.69135802 9
5 28.125 8
5 25.96952909 NA
4 19.80534178 5
4 22.09317005 26
4 16.23307275 12
4 22.85714286 7
4 32.24993701 29
4 32.24993701 29
4 27.75510204 17
-1 22.22222222 8
5 30.93043808 NA
5 30.93043808 15
5 26.42356982 6
5 33.65929705 15
2 24.34380949 9
4 24.34380949 7
1 27.17063157 9
5 37.73698829 NA
5 37.73698829 NA
1 23.30668005 11
-1 24.22145329 31
5 34.10798936 11
5 34.10798936 11
5 34.10798936 11
4 24.22145329 31
-1 18.04270106 NA
4 22.265625 25
5 34.10798936 11
2 25.86120485 34
-4 27.40765728 18
3 27.40765728 18
5 20.10916403 7
2 20.60408163 25
NA 24.77209671 24
1 22.49134948 NA
4 22.49134948 10
3 23.62444749 27
1 24.09297052 NA
5 24.1671624 13
-2 24.91349481 7
1 25.53544639 10
1 27.6816609 8
5 22.85714286 10
4 22.85714286 10
5 25.2493372 6
4 21.79930796 20
4 22.85714286 10
4 25.35154137 21
4 26.2345679 7
4 26.86873566 19
4 23.58832922 14
5 60.85439572 11
3 21.79930796 20
5 23.58832922 14
4 25.71166208 29
3 23.45679012 5
5 31.8877551 NA
-1 31.38510306 15
4 33.87406376 16
5 31.38510306 15
4 38.20018365 NA
3 31.38510306 15
3 29.20110193 25
5 31.99217133 NA
5 29.9940488 6
1 26.81359045 28
4 27.54820937 20
5 27.54820937 5
4 23.67125363 36
4 22.22222222 17
4 23.67125363 36
4 27.04164413 18
5 21.60493827 7
4 21.79930796 20
4 44.79082684 NA
5 43.02771702 NA
4 21.79930796 20
4 22.51606979 NA
4 20.76124567 24
4 22.98539751 23
1 26.56683975 24
4 22.98539751 23
4 29.16869227 8
1 24.75546432 8
5 24.19600671 24
5 30.79338244 NA
NA NA NA
2 NA 24
5 24.19600671 24
1 25.95155709 29
4 25.95155709 29
5 20.10916403 7
-1 25.390625 22
4 22.03856749 13
5 21.79930796 NA
5 22.03856749 13
-2 30.07116843 20
5 27.75748722 23
4 20.13476872 NA
4 30.49148654 9
5 27.77427093 12
5 30.0838291 25
3 22.22222222 7
-2 14.34257234 10
4 25.82644628 11
-1 20.10916403 22
4 25.82644628 11
5 23.30109483 27
4 23.30109483 27
5 22.88868802 12
4 26.2345679 9
4 36.22750875 6
5 30.47796622 33
1 22.63467632 4
3 22.03856749 31
5 24.69135802 8
5 25.21735858 21
4 15.1614521 16
5 21.14631991 19
5 30.79338244 NA
-1 27.77777778 12
5 25.66115203 9
5 35.91836735 9
5 26.7818261 19
5 26.7818261 19
-2 22.46003435 22
5 32.32323232 31
2 24.296875 11
4 26.7299275 27
5 24.48979592 9
4 23.03004535 28
5 25.2493372 6
4 20.51508648 8
4 23.87511478 15
-1 27.93277423 21
-1 27.93277423 21
4 20.51508648 8
5 27.77777778 29
4 21.49959688 8
4 28.96473469 6
2 24.69135802 24
5 29.86055123 7
5 21.60493827 17
5 41.86851211 16
5 27.77777778 19
2 28.515625 16
5 24.69135802 24
5 21.50294251 12
5 27.77777778 4
5 25.52059756 15
4 27.38328546 19
4 19.47714681 12
5 25.71166208 26
4 26.12244898 NA
5 21.484375 21
5 32.14024836 23
5 24.25867407 8
5 27.77777778 14
4 21.60493827 11
4 22.69401893 18
4 21.60493827 11
4 21.0498179 15
5 22.67573696 9
4 24.22145329 10
4 29.70801268 10
5 38.62236267 17
4 29.70801268 10
4 29.70801268 10
4 32.39588049 6
I would try two things: (1) Don't attach() your data sets. As long as you specify the data frame in your glm() call (as you have already done), you don't need to attach. (2) Try changing your command to the following:
glm(I(posn_neg_5>-2)~a2+bmi, data = data_su, family=binomial(link="logit"))
The function I(), when put inside a formula, tells R to evaluate the enclosed expression as is, so the variable posn_neg_5 is mapped to the new variable posn_neg_5>-2. Sometimes you can get into trouble by including math in formulas without this mapping.
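A minimal sketch of both suggestions on made-up data follows. It also checks the type of bmi before fitting, on the assumption (not confirmed by the question) that the column was imported from the CSV as text, which would make glm() treat it as a categorical variable. The column names follow the question; the values are simulated.

```r
# Toy data frame; column names follow the question, values are simulated
set.seed(1)
d <- data.frame(
  posn_neg_5 = sample(c(-2, -1, 1, 4, 5), 100, replace = TRUE),
  a2  = sample(1:36, 100, replace = TRUE),
  bmi = as.character(round(runif(100, 15, 40), 2)),  # pretend it was imported as text
  stringsAsFactors = FALSE
)

# Inspect the column type: character or factor means one coefficient per level
str(d$bmi)

# Convert to numeric (as.character() first makes this safe for factors too)
d$bmi <- as.numeric(as.character(d$bmi))

# No attach(): the data frame is passed through the data argument,
# and the binary response is wrapped in I() as suggested above
mod <- glm(I(posn_neg_5 > -2) ~ a2 + bmi, data = d,
           family = binomial(link = "logit"))
print(coef(mod))
```

With bmi numeric, the fit has a single bmi coefficient rather than one per distinct value.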

Is there any method to sort the matrix by both column and row in R?

could you guys help me?
I have a matrix like this. the first column and row are the IDs.
I need to sort it by column and row ID like this.
Thanks!
Two thoughts:
mat <- matrix(1:25, nr=5, dimnames=list(c('4',3,5,2,1), c('4',3,5,2,1)))
mat
# 4 3 5 2 1
# 4 1 6 11 16 21
# 3 2 7 12 17 22
# 5 3 8 13 18 23
# 2 4 9 14 19 24
# 1 5 10 15 20 25
If you want a strictly alphabetic ordering, then this will work:
mat[order(rownames(mat)),order(colnames(mat))]
# 1 2 3 4 5
# 1 25 20 10 5 15
# 2 24 19 9 4 14
# 3 22 17 7 2 12
# 4 21 16 6 1 11
# 5 23 18 8 3 13
This will not work well if the names are intended to be ordered numerically:
mat <- matrix(1:30, nr=3, dimnames=list(c('2',1,3), c('4',3,5,2,1,6,7,8,9,10)))
mat
# 4 3 5 2 1 6 7 8 9 10
# 2 1 4 7 10 13 16 19 22 25 28
# 1 2 5 8 11 14 17 20 23 26 29
# 3 3 6 9 12 15 18 21 24 27 30
mat[order(rownames(mat)),order(colnames(mat))]
# 1 10 2 3 4 5 6 7 8 9
# 1 14 29 11 5 2 8 17 20 23 26
# 2 13 28 10 4 1 7 16 19 22 25
# 3 15 30 12 6 3 9 18 21 24 27
Note the resulting column order (1, 10, 2, ...). To sort numerically, you need a slight modification:
mat[order(as.numeric(rownames(mat))),order(as.numeric(colnames(mat)))]
# 1 2 3 4 5 6 7 8 9 10
# 1 14 11 5 2 8 17 20 23 26 29
# 2 13 10 4 1 7 16 19 22 25 28
# 3 15 12 6 3 9 18 21 24 27 30

Replacing a restarting sequence in a dataframe with the group number of the sequence

I have a sequence in df$V1 that starts at some number and increases. At some point, it drops, indicating that observations for a new group have started. I want to replace V1 (or create a new column) with the group number. What are some ways to do this? I've tried various dplyr tricks to no avail, and searched here and elsewhere and have not found a similar problem. Wondering if there's a slick dplyr way to do this. Thank you for any insights.
The data frame has about 350 rows. Here is a subset:
> df
V1 V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 2 5 10 1 3 9
3 3 5 11 1 4 4
4 4 5 15 1 5 7
5 5 5 18 1 6 14
6 6 5 22 1 7 6
27 1 5 9 1 2 14
28 21 9 10 2 3 4
29 22 9 11 2 4 6
30 23 9 15 2 5 1
31 24 9 18 2 6 7
32 25 9 22 2 7 14
33 26 9 24 2 8 6
34 27 9 25 2 9 7
35 28 9 26 2 10 6
And I want it to look like this (or with group as an added column in the new.df):
> new.df
group V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 1 5 10 1 3 9
3 1 5 11 1 4 4
4 1 5 15 1 5 7
5 1 5 18 1 6 14
6 1 5 22 1 7 6
27 2 5 9 1 2 14
28 2 9 10 2 3 4
29 2 9 11 2 4 6
30 2 9 15 2 5 1
31 2 9 18 2 6 7
32 2 9 22 2 7 14
33 2 9 24 2 8 6
34 2 9 25 2 9 7
35 2 9 26 2 10 6
Here's the initial data frame to load into your R session:
df <- read.table(header=TRUE, text="
V1 V2 V3 V4 V5 V6
1 5 9 1 2 14
2 5 10 1 3 9
3 5 11 1 4 4
4 5 15 1 5 7
5 5 18 1 6 14
6 5 22 1 7 6
1 5 9 1 2 14
21 9 10 2 3 4
22 9 11 2 4 6
23 9 15 2 5 1
24 9 18 2 6 7
25 9 22 2 7 14
26 9 24 2 8 6
27 9 25 2 9 7
28 9 26 2 10 6
")
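One possible approach, offered as a sketch rather than a definitive answer: treat every drop in V1 as the start of a new group and count the drops with cumsum(). The column name group is the one used in the desired output above.

```r
df <- read.table(header = TRUE, text = "
V1 V2 V3 V4 V5 V6
1 5 9 1 2 14
2 5 10 1 3 9
3 5 11 1 4 4
4 5 15 1 5 7
5 5 18 1 6 14
6 5 22 1 7 6
1 5 9 1 2 14
21 9 10 2 3 4
22 9 11 2 4 6
23 9 15 2 5 1
24 9 18 2 6 7
25 9 22 2 7 14
26 9 24 2 8 6
27 9 25 2 9 7
28 9 26 2 10 6
")

# diff() is negative wherever V1 drops; the leading TRUE starts group 1,
# and cumsum() turns the drop flags into a running group number
df$group <- cumsum(c(TRUE, diff(df$V1) < 0))
print(df)
```

This assumes a group boundary always shows up as a decrease in V1. If a new group could start at a value equal to or larger than the previous one, the rule would miss it.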

R which argument fits well to obtain nonuniform bins using "plot" to build an informative histogram

I am new to R. I am trying to plot a cumulative frequency histogram (non-uniform bins) for a large amount of data (a few million positive numbers with a minimum value of 1 and a maximum value that varies between data sets, for instance 1*10^6 or 1*10^5). I used this simple code to generate a histogram from the data.
For example, sample data:
[89601] 10 2 2 4 3 12 3 25 25 2
[89611] 5 5 5 2 23 22 14 8 13 10
[89621] 13 19 157 2 3 2 4 2 3 33
[89631] 22 2 14 9 2 3 3 3 8 2
[89641] 8 3 2 127 8 2 18 2 4 2
[89651] 2 13 3 34 8 2 6 10 3 7
[89661] 3 9 7 3 36 9 5 2 10 15
[89671] 7 2 23 2 2 2 2 7 6 25
[89681] 3 3 2 6 37 49 28 11 3 35
[89691] 2 2 8 3 3 2 2 4 3 12
[89701] 3 5 2 7 3 2 15 6 3 14
[89711] 13 5 3 2 2 8 34 4 4 65
[89721] 5 9 12 2 11 2 2 79 9 13
[89731] 2 66 2 9 10 22 11 2 6 3
[89741] 12 2 11 5 4 4 2 4 3 4
[89751] 2 8 9 3 2 2 84 7 11 10
[89761] 8 30 16 3 63 2 2 24 13 2
[89771] 11 37 2 9 21 21 10 2 2 49
[89781] 3 3 8 5 2 19 9 6 5 4
[89791] 4 2 9 2 10 33 5 4 2 2
[89801] 4 2 2 4 9 3 11 2 5 142
[89811] 17 2 11 4 2 8 26 2 9 8
[89821] 10 2 4 2 5 2 20 7 145 11
[89831] 22 19 8 14 18 39 3 2 3 3
[89841] 2 11 10 3 2 3 3 5 6 12
[89851] 17 5 3 8 2 2 2 2 2 5
[89861] 4 2 13 3 2 2 2 2 3 2
[89871] 4 3 21 2 6 2 8 9 7 14
[89881] 2 582 3 15 11 3 20 16 9 8
[89891] 6 2 6 7 3 20 17 2 9 5
[89901] 5 11 2 12 7 2 46 2 144 9
[89911] 2 3 36 25 3 2 16 2 2 119
[89921] 5 5 10 6 2 2 6 84 13 2
[89931] 2 6 6 2 17 3 7 4 102 48
data <- read.table("sample.txt", header=FALSE)
data <- hist(data$V1, breaks=length(data$V1), xlim=c(0,4000000))
plot(data)
When I did this I got a histogram with all the data (positive numbers) on the x-axis and counts on the y-axis. Then I changed the x limit to cover only the area of interest:
plot(data, xlim=c(0,200000))
As before, a histogram is plotted, but using plot() I couldn't define the number of bins, so the histogram is not clear (it does not show the bars I want) and not informative.
As I am new to this forum, I have no idea how to upload images, so I couldn't provide the histogram.
Any suggestions would be very helpful.
To plot a histogram you can use the hist() function like this:
hist(data$V1, xlim=c(0,200000), breaks=100)
The breaks parameter, given as a single number, suggests how many bins to use. But this number applies to the whole data range, not to the xlim you specified. So hist() first builds a histogram with roughly the given number of breaks and then shows only the part of the plot you asked for.
But there is another way to plot the bars:
data <- read.table("sample.txt", header=FALSE)
data.hist <- hist(data$V1, breaks=length(data$V1), xlim=c(0,4000000))
plot(data.hist$counts, type='h')
The hist() function returns an object which holds the histogram parameters.
I assume you are interested in the "counts" field.
You can plot this information in a histogram-like way by setting type='h'.
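For the non-uniform bins mentioned in the question, breaks can also be a vector of bin edges instead of a single number. The sketch below assumes log-spaced edges are acceptable and uses simulated data in place of sample.txt. The cumulative counts are drawn with barplot() so that every bin gets a bar of equal width.

```r
# Simulated heavy-tailed positive integers standing in for the real data
set.seed(42)
x <- ceiling(rlnorm(10000, meanlog = 1.5, sdlog = 1.2))

# Non-uniform (log-spaced) bin edges covering the full data range
edges <- c(0, 2^(0:ceiling(log2(max(x)))))

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = edges, plot = FALSE)

# Cumulative counts per bin, drawn as bars labelled by upper bin edge
cum_counts <- cumsum(h$counts)
barplot(cum_counts, names.arg = edges[-1], las = 2,
        xlab = "upper bin edge", ylab = "cumulative count")
```

The edges must span the whole range of the data, otherwise hist() stops with an error. Any other increasing vector of edges can be substituted for the powers of two.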
