Averaging duplicate values in an R data frame

I have a data frame named ColorMap in which I want to average all numeric values that correspond to the same feature (explained further below). Here is the data frame.
> ColorMap
KEGGnumber Colors
1 c("C00489" 0.162
2 "C06104" 0.162
3 "C02656") 0.162
4 C00163 -0.173
5 c("C02656" -0.140
6 "C00036" -0.140
7 "C00232" -0.140
8 "C01571" -0.140
9 "C00422") -0.140
10 c("C00402" 0.147
11 "C06664" 0.147
12 "C06687" 0.147
13 "C02059") 0.147
14 c("C00246" 0.069
15 "C00902") 0.069
**16 C00033 0.011
...
25 C00033 -0.073**
26 C00048 0.259
**27 c("C00803" 0.063
...
37 C00803 -0.200
38 C00803 -0.170**
39 c("C00164" -0.020
40 "C01712" -0.020
...
165 c("C00246" 0.076
166 "C00902") 0.076
**167 C00163 -0.063
...
169 C00163 0.046**
170 c("C00058" -0.208
171 "C00036") -0.208
172 C00121 -0.178
173 C00033 -0.193
174 C00163 -0.085
I would like the final result to look something like this:
> ColorMap
KEGGnumber Colors
1 C00489 0.162
2 C06104 0.162
3 C02656 0.162
4 C00163 -0.173
5 C02656 -0.140
6 C00036 -0.140
7 C00232 -0.140
8 C01571 -0.140
9 C00422 -0.140
10 C00402 0.147
11 C06664 0.147
12 C06687 0.147
13 C02059 0.147
14 C00246 0.069
15 C00902 0.069
**16 C00033 0.031**
26 C00048 0.259
**27 C00803 -0.100**
39 C00164 -0.020
40 C01712 -0.020
...
165 C00246 0.076
166 C00902 0.076
**167 C00163 0.0085**
170 C00058 -0.208
171 C00036 -0.208
172 C00121 -0.178
173 C00033 -0.193
174 C00163 -0.085
The duplicates do not need to be next to each other; I simply chose those rows for easy visualization. I would like the mean of all Colors values for each KEGGnumber, so that each KEGGnumber appears only once, with no duplicates.

You can clean that column using
library(stringr)
ColorMap$KEGGnumber <- str_extract(ColorMap$KEGGnumber, "[C][0-9]+")
The pattern argument takes a regular expression. This one is simple: it matches the capital letter C followed by one or more digits, which drops the c(, quote and parenthesis characters around each ID.
After that, group with dplyr and take the mean of each group:
library(dplyr)
ColorMap %>% group_by(KEGGnumber) %>% summarize(Colors = mean(Colors))
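For readers without stringr or dplyr installed, the same two steps (clean the ID, then average per ID) can be sketched in base R. This is only an illustration on a small made-up subset of the data; the `averaged` name is an assumption, and `regmatches()` with `regexpr()` stands in for `str_extract()`.

```r
# Toy subset shaped like the question's data (values picked for illustration)
ColorMap <- data.frame(
  KEGGnumber = c('c("C00489"', '"C06104"', '"C02656")',
                 "C00163", "C00163", "C00033"),
  Colors = c(0.162, 0.162, 0.162, -0.173, -0.063, 0.011),
  stringsAsFactors = FALSE
)

# Keep only the "C followed by digits" part of each ID.
# Note: regmatches() drops non-matching elements, so this assumes every row matches.
ColorMap$KEGGnumber <- regmatches(ColorMap$KEGGnumber,
                                  regexpr("C[0-9]+", ColorMap$KEGGnumber))

# One row per KEGGnumber, holding the mean of its Colors values
averaged <- aggregate(Colors ~ KEGGnumber, data = ColorMap, FUN = mean)
print(averaged)
```

If some rows might not contain an ID at all, `str_extract()` is the safer choice because it returns NA for them instead of silently shortening the vector.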

Related

I get the message "'collapse' argument should be a formula" while using the mixture function (drc package)

While using the drc package for mixture prediction on microalgae growth inhibition data, specifically the mixture function, I receive the error message "'collapse' argument should be a formula", which I don't really understand.
I'm using almost exactly the script from the drc package PDF available online for the "glymet" data, which also concerns growth inhibition, with just a few changes to fit my own data. The drm function estimates the parameters of the dose-response curves for my data without problems.
Not being very proficient with RStudio as a whole, I am stuck on this error message, which blocks me from doing further analysis.
Does anyone have any idea what could cause this error? I didn't find anyone raising the same issue online, and the problem doesn't happen with the glymet data.
My data:
> dbp
dose rgr pct
1 0.00 1.502 100
2 0.00 1.449 100
3 0.00 1.611 100
4 0.00 1.468 100
5 0.00 1.506 100
6 0.00 1.495 100
7 1.81 1.249 100
8 1.81 1.303 100
9 1.81 1.316 100
10 3.19 0.968 100
11 3.19 1.057 100
12 3.19 1.083 100
13 5.43 1.003 100
14 5.43 0.964 100
15 5.43 0.943 100
16 8.25 0.849 100
17 8.25 0.781 100
18 8.25 0.697 100
19 15.67 0.587 100
20 15.67 0.660 100
21 15.67 0.591 100
22 26.65 0.485 100
23 26.65 0.497 100
24 26.65 0.532 100
25 45.50 0.286 100
26 45.50 0.370 100
27 45.50 0.326 100
28 0.00 1.686 75
29 0.00 1.580 75
30 0.00 1.499 75
31 0.00 1.528 75
32 0.00 1.540 75
33 0.00 1.653 75
34 1.32 1.380 75
35 1.32 1.421 75
36 1.32 1.468 75
37 2.65 1.174 75
38 2.65 1.137 75
39 2.65 1.167 75
40 5.30 0.726 75
41 5.30 0.810 75
42 5.30 0.797 75
43 10.59 0.626 75
44 10.59 0.471 75
45 10.59 0.416 75
46 21.18 0.468 75
47 21.18 0.415 75
48 21.18 0.487 75
49 42.36 0.252 75
50 42.36 0.303 75
51 42.36 0.320 75
52 0.00 1.620 50
53 0.00 1.713 50
54 0.00 1.659 50
55 0.00 1.678 50
56 0.00 1.700 50
57 0.00 1.581 50
58 1.58 1.298 50
59 1.58 1.226 50
60 1.58 1.189 50
61 3.16 1.021 50
62 3.16 1.062 50
63 3.16 0.925 50
64 6.33 0.863 50
65 6.33 0.823 50
66 6.33 0.711 50
67 12.65 0.548 50
68 12.65 0.611 50
69 12.65 0.597 50
70 25.30 0.394 50
71 25.30 0.363 50
72 25.30 0.319 50
73 50.60 0.255 50
74 50.60 0.241 50
75 50.60 0.219 50
76 0.00 1.500 25
77 0.00 1.541 25
78 0.00 1.527 25
79 0.00 1.491 25
80 0.00 1.468 25
81 0.00 1.353 25
82 1.80 1.512 25
83 1.80 1.313 25
84 1.80 1.442 25
85 3.60 1.437 25
86 3.60 1.364 25
87 3.60 1.291 25
88 7.20 1.231 25
89 7.20 1.389 25
90 7.20 1.286 25
91 14.40 0.802 25
92 14.40 1.069 25
93 14.40 0.865 25
94 28.80 0.474 25
95 28.80 0.597 25
96 28.80 0.411 25
97 57.61 0.321 25
98 57.61 0.216 25
99 57.61 0.239 25
100 0.00 1.512 0
101 0.00 1.390 0
102 0.00 1.388 0
103 0.00 1.391 0
104 0.00 1.328 0
105 0.00 1.390 0
106 1.56 1.467 0
107 1.56 1.422 0
108 1.56 1.371 0
109 2.92 1.255 0
110 2.92 1.359 0
111 2.92 1.354 0
112 5.25 1.211 0
113 5.25 1.232 0
114 5.25 1.353 0
115 7.66 1.271 0
116 7.66 1.168 0
117 7.66 0.970 0
118 17.03 0.927 0
119 17.03 0.689 0
120 17.03 1.034 0
121 31.22 0.611 0
122 31.22 0.758 0
123 31.22 0.752 0
124 55.19 0.449 0
125 55.19 0.303 0
126 55.19 0.434 0
dose is the exposure concentration, rgr is the growth rate, and pct is the percentage of the first substance in the mixture.
I'm using the following script for this data:
> library(drc)
>
> dbp<-read.table("dbp.txt",header=TRUE, sep="")
>
### data glymet
> ## Fitting the model with freely varying ED50 values
> dbp.free <- drm(rgr~dose, pct, data = dbp,
+ fct = LL.3())
>
> ## Lack-of-fit test
> modelFit(dbp.free) # acceptable
Lack-of-fit test
ModelDf RSS Df F value p value
ANOVA 89 0.46595
DRC model 111 0.68575 22 1.9083 0.0181
> summary(dbp.free)
Model fitted: Log-logistic (ED50 as parameter) with lower limit at 0 (3 parms)
Parameter estimates:
Estimate Std. Error t-value p-value
b:100 0.836148 0.060986 13.710 < 2.2e-16 ***
b:75 0.990562 0.067962 14.575 < 2.2e-16 ***
b:50 0.819426 0.057170 14.333 < 2.2e-16 ***
b:25 1.651083 0.147962 11.159 < 2.2e-16 ***
b:0 1.217393 0.112309 10.840 < 2.2e-16 ***
d:100 1.511861 0.031439 48.089 < 2.2e-16 ***
d:75 1.605332 0.030678 52.329 < 2.2e-16 ***
d:50 1.658598 0.031896 52.001 < 2.2e-16 ***
d:25 1.472612 0.026126 56.365 < 2.2e-16 ***
d:0 1.414731 0.027300 51.822 < 2.2e-16 ***
e:100 10.070587 0.871177 11.560 < 2.2e-16 ***
e:75 6.293939 0.474130 13.275 < 2.2e-16 ***
e:50 5.670343 0.491531 11.536 < 2.2e-16 ***
e:25 20.022984 1.153494 17.359 < 2.2e-16 ***
e:0 27.492564 2.023952 13.584 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error:
0.07859961 (111 degrees of freedom)
> ## Plotting isobole structure
> isobole(dbp.free)
>
> ## Fitting the concentration addition model
> dbp.ca <- mixture(dbp.free, model = "CA")
Error in mixture(dbp.free, model = "CA") :
'collapse' argument should be a formula
Thank you in advance! It's my first post, so don't hesitate to tell me if any information is missing.

Abline command is not showing a regression line?

I'm new to R programming and I'm trying to plot a regression line for this data set, but the line does not show up.
I followed exactly what my professor did. I've also replaced the abline command with abline(lm(batters$EMH~batters$TB)), with similar results.
Here is my code:
batters<-read.table(header=TRUE, text="
X AVG EBH TB OPS K.to.BB.Ratio
1 LeMahieu 0.327 61 312 0.893 1.95
2 Urshela 0.314 55 236 0.889 3.48
3 Torres 0.278 64 292 0.871 2.64
4 Judge 0.272 46 204 0.921 2.21
5 Sanchez 0.232 47 208 0.841 3.13
6 Wong 0.285 40 202 0.784 1.76
7 Molina 0.270 34 167 0.711 2.52
8 Goldschmidt 0.260 60 284 0.821 2.13
9 Ozuna 0.243 53 230 0.804 1.84
10 DeJong 0.233 62 259 0.762 2.39
11 Altuve 0.298 61 275 0.903 1.98
12 Bregman 0.296 80 328 1.015 0.69
13 Springer 0.292 62 283 0.974 1.69
14 Reddick 0.275 36 205 0.728 1.83
15 Chirinos 0.238 40 162 0.791 2.45
16 Bellinger 0.305 84 351 1.035 1.14
17 Turner 0.290 51 244 0.881 1.72
18 Seager 0.272 64 236 0.817 2.23
19 Taylor 0.262 45 169 0.794 3.11
20 Muncy 0.251 58 251 0.889 1.65
21 Meadows 0.291 69 296 0.922 2.43
22 Garcia 0.282 47 227 0.796 4.03
23 Pham 0.273 56 255 0.818 1.52
24 Choi 0.261 41 188 0.822 1.69
25 Adames 0.254 46 222 0.735 3.32
26 Yelich 0.329 76 328 1.101 1.48
27 Braun 0.285 55 232 0.849 3.09
28 Moustakas 0.254 66 270 0.845 1.85
29 Grandal 0.246 56 240 0.848 1.28
30 Arcia 0.223 32 173 0.633 2.53")
plot(batters$EBH,batters$TB,main="Attribute Pairing 5",xlab="EBH",ylab="TB")
lm(formula = batters$EBH~batters$TB)
#Call:
#lm(formula = batters$EBH ~ batters$TB)
#Coefficients:
#(Intercept) batters$TB
# -4.1275 0.2416
lin_model_1<-lm(formula = batters$EBH~batters$TB)
summary(lin_model_1)
abline(-4.12752, 0.24162)
I apologize for the messy coding, this is for a class.
Your formula is backwards in the lm() function call. The dependent variable is on the left side of the "~".
In your plot the y-axis (dependent variable) is TB, but in the linear regression model, it is defined as the independent variable. So for the linear regression model to work, one needs to swap EBH & TB.
plot(batters$EBH,batters$TB,main="Attribute Pairing 5",xlab="EBH",ylab="TB")
model <- lm(formula = batters$TB ~ batters$EBH)
model
#Call:
#lm(formula = batters$TB ~ batters$EBH)
#Coefficients:
#(Intercept)  batters$EBH
#     46.510        3.603
abline(model)
#or
abline(46.51, 3.60)
Also, if you pass the fitted "model" object to abline, you avoid having to type the intercept and slope by hand.
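To see why the direction of the formula matters, here is a minimal sketch on made-up numbers (not the batters data; `d`, `fit` and `fit_swapped` are names chosen for illustration). The variable drawn on the vertical axis goes on the left of `~`, and swapping the two gives different coefficients, so a line built from the swapped model does not belong on the same plot.

```r
# Small made-up data set, roughly y = 2 * x
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# y is on the vertical axis of the plot, so y goes on the left of ~
fit <- lm(y ~ x, data = d)

plot(d$x, d$y, xlab = "x", ylab = "y")
abline(fit)  # takes intercept and slope straight from the fitted model

# The swapped model describes x as a function of y, so its coefficients
# are on a different scale and do not match the plot above
fit_swapped <- lm(x ~ y, data = d)
print(coef(fit))
print(coef(fit_swapped))
```

Using the `data` argument instead of `d$x` and `d$y` inside the formula also gives cleaner coefficient names.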

How to remove the variable names from the diagonal and put them on axes in R function scatterplotMatrix?

I am trying to reproduce a matrix plot from a book. Here is the plot from the book:
Here is my code:
y=read.table("T1-2.dat");
colnames(y) <- c("density","mach-dir","cross-dir");
library(car);
scatterplotMatrix(y,smooth=F, regLine=F, var.labels=colnames(y), diagonal=list(method="boxplot"));
And this is what it looks like right now:
How can I remove the names from the diagonal and put them along the sides of the plot, just like in the book?
Thanks in advance.
Data:
> y
density mach-dir cross-dir
1 0.801 121.41 70.42
2 0.824 127.70 72.47
3 0.841 129.20 78.20
4 0.816 131.80 74.89
5 0.840 135.10 71.21
6 0.842 131.50 78.39
7 0.820 126.70 69.02
8 0.802 115.10 73.10
9 0.828 130.80 79.28
10 0.819 124.60 76.48
11 0.826 118.31 70.25
12 0.802 114.20 72.88
13 0.810 120.30 68.23
14 0.802 115.70 68.12
15 0.832 117.51 71.62
16 0.796 109.81 53.10
17 0.759 109.10 50.85
18 0.770 115.10 51.68
19 0.759 118.31 50.60
20 0.772 112.60 53.51
21 0.806 116.20 56.53
22 0.803 118.00 70.70
23 0.845 131.00 74.35
24 0.822 125.70 68.29
25 0.971 126.10 72.10
26 0.816 125.80 70.64
27 0.836 125.50 76.33
28 0.815 127.80 76.75
29 0.822 130.50 80.33
30 0.822 127.90 75.68
31 0.843 123.90 78.54
32 0.824 124.10 71.91
33 0.788 120.80 68.22
34 0.782 107.40 54.42
35 0.795 120.70 70.41
36 0.805 121.91 73.68
37 0.836 122.31 74.93
38 0.788 110.60 53.52
39 0.772 103.51 48.93
40 0.776 110.71 53.67
41 0.758 113.80 52.42
By the way, can we also display "Max, Med, Min" and the corresponding values on the diagonal? Thanks.

Using apply and t.test at the same time

I am using R. I have two conditions with three replicates each, and I want to run a t-test on each element (row) across the two conditions. My plan is to use apply on the data set (143,554 rows) and keep only the p-value returned by t.test.
Columns 4, 6 and 8 are the replicates for the first condition (the main argument to apply), and columns 10, 12 and 14 are the replicates for the second condition (example data at the end). I thought something like this would do the job:
t.test.10x = apply( MT.10x[,c(4,6,8)], 1, function(x) t.test(x, MT.10x[,c(10,12,14)])$p.value)
However, this is wrong: passing the whole table for the second condition to t.test does not go row by row. Instead, it compares all rows of columns 10, 12, 14 against each row of columns 4, 6, 8.
I would rather not use a for loop, but I will if it is absolutely required.
Thank you!
Dataset example:
Chr Start End wt1_R wt1_T wt2_R wt2_T wt3_R wt3_T ko1_R ko1_T ko2_R ko2_T ko3_R ko3_T
chr1 3060417 3060419 0.0698 43 0.25 28 0.172 29 0.188 32 0.156 45 0.119 42
chr1 3060431 3060433 0.786 28 0.818 22 0.526 19 0.895 19 0.833 36 0.784 37
chr1 3168805 3168807 0.688 16 1 19 0.769 13 0.929 14 0.933 15 0.9 10
chr1 3228992 3228994 0.7 10 1 11 0.786 14 1 14 0.938 16 0.923 13
chr1 3233065 3233067 0.857 14 0.917 12 1 17 0.846 13 0.857 21 0.952 21
chr1 3265234 3265236 0.84 25 0.727 11 0.909 22 0.968 31 0.895 19 0.905 21
chr1 3265322 3265324 0.111 27 0.25 28 0.55 20 0.385 13 0.467 15 0.462 13
chr1 3265345 3265347 0.806 31 0.857 35 0.733 30 0.9 30 0.8 25 1 17
chr1 3265357 3265359 1 30 0.759 29 0.758 33 0.867 30 0.903 31 1 18
chr1 3265486 3265488 1 15 0.545 22 1 13 0.8 10 0.917 12 1 24
chr1 3265512 3265514 0.857 28 0.75 20 0.583 24 0.714 21 0.882 17 0.839 31
chr1 3265540 3265542 0.757 37 0.966 29 0.969 32 0.774 31 0.955 22 0.971 34
chr1 3265771 3265773 0.741 27 0.864 22 0.963 27 1 20 0.864 22 0.962 26
chr1 3265776 3265778 1 20 1 21 1 26 0.722 18 1 24 0.852 27
chr1 3265803 3265805 0.611 18 0.96 25 1 17 1 18 0.895 19 0.828 29
chr1 3760225 3760227 0.278 36 0.0741 27 0.417 24 0.158 19 0.4 40 0.136 22
chr1 3760285 3760287 0.851 47 0.711 38 0.867 15 0.81 21 0.914 35 0.893 28
chr1 3761299 3761301 0.786 14 0.885 26 1 11 0.929 14 0.771 35 0.75 24
chr1 3761414 3761416 0.706 17 1 17 0.545 22 0.857 14 0.818 11 0.8 15
chr1 3838606 3838608 0.806 31 0.692 13 0.611 18 1 11 1 23 1 11
chr1 3838611 3838613 0.767 30 1 13 0.947 19 0.818 11 1 20 1 11
chr1 4182108 4182110 0.231 13 0.5 14 0.143 21 0.0667 15 0.235 17 0.353 17
chr1 4547434 4547436 0.9 10 1 13 1 17 1 14 0.909 11 0.909 11
chr1 4547456 4547458 1 18 1 10 0.895 19 0.833 12 1 12 1 12
chr1 4547496 4547498 0.812 16 0.917 12 0.75 16 0.923 13 0.818 11 0.9 10
chr1 4547509 4547511 1 14 1 12 1 15 0.9 10 0.833 12 1 11
chr1 4547512 4547514 0.923 13 1 12 1 14 0.909 11 0.833 12 0.909 11
chr1 4765732 4765734 0 11 0 12 0 11 0 13 0 13 0.1 10
chr1 5185343 5185345 0.818 22 0.909 22 0.963 27 1 15 0.923 13 1 16
chr1 5185567 5185569 0.885 52 0.781 32 0.984 63 1 37 0.844 45 1 29
I think you are looking for mapply:
mapply(function(x,y)
t.test(x,y)$p.value,
MT.10x[,c(4,6,8)], MT.10x[,c(10,12,14)])
## wt1_R wt2_R wt3_R
## 0.4790554 0.8289961 0.5204527
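Note that `mapply` over two data frames pairs them column by column, which is why the output above has three p-values named after the wt columns. If one p-value per row is what is wanted, a sketch along the following lines may be closer. It uses the first three rows of the example data; the names `MT`, `wt`, `ko` and `row_pvals` are chosen for illustration, and the `tryCatch` fallback to NA is an assumption about how rows with constant values should be handled.

```r
# First three rows of the example data, ratio columns only
MT <- data.frame(
  wt1_R = c(0.0698, 0.786, 0.688),
  wt2_R = c(0.25,   0.818, 1),
  wt3_R = c(0.172,  0.526, 0.769),
  ko1_R = c(0.188,  0.895, 0.929),
  ko2_R = c(0.156,  0.833, 0.933),
  ko3_R = c(0.119,  0.784, 0.9)
)

# Matrices make per-row indexing cheap, which matters for ~140k rows
wt <- as.matrix(MT[, c("wt1_R", "wt2_R", "wt3_R")])
ko <- as.matrix(MT[, c("ko1_R", "ko2_R", "ko3_R")])

# One Welch t-test per row; rows where t.test() fails
# (for example, all values identical) yield NA instead of stopping
row_pvals <- vapply(seq_len(nrow(MT)), function(i) {
  tryCatch(t.test(wt[i, ], ko[i, ])$p.value,
           error = function(e) NA_real_)
}, numeric(1))

print(row_pvals)
```

On the full table the same idea would apply with `MT.10x[, c(4, 6, 8)]` and `MT.10x[, c(10, 12, 14)]` converted to matrices.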

Mapping spatial Distributions in R

My data set covers 17 stations, each with 24 hourly temperature values.
I would like to map every station's value for a given hour, and repeat that for all hours.
What I want to do is something like the image.
The data is in the following format:
N2 N3 N4 N5 N7 N8 N10 N12 N13 N14 N17 N19 N25 N28 N29 N31 N32
1 1.300 -0.170 -0.344 2.138 0.684 0.656 0.882 0.684 1.822 1.214 2.046 2.432 0.208 0.312 0.530 0.358 0.264
2 0.888 -0.534 -0.684 1.442 -0.178 -0.060 0.430 -0.148 1.420 0.286 1.444 2.138 -0.264 -0.042 0.398 -0.196 -0.148
3 0.792 -0.564 -0.622 0.998 -0.320 1.858 -0.036 -0.118 1.476 0.110 0.964 2.048 -0.480 -0.434 0.040 -0.538 -0.322
4 0.324 -1.022 -1.128 1.380 -0.792 1.042 -0.054 -0.158 1.518 -0.102 1.354 2.386 -0.708 -0.510 0.258 -0.696 -0.566
5 0.650 -0.774 -0.982 1.124 -0.540 3.200 -0.052 -0.258 1.452 0.028 1.022 2.110 -0.714 -0.646 0.266 -0.768 -0.532
6 0.670 -0.660 -0.844 1.248 -0.550 2.868 -0.098 -0.240 1.380 -0.012 1.164 2.324 -0.498 -0.474 0.860 -0.588 -0.324
MeteoSwiss
1 -0.6
2 -1.2
3 -1.0
4 -0.8
5 -0.4
6 -0.2
where N2, N3, ..., MeteoSwiss are the stations and each row gives the stations' temperature values for one hour. The station coordinates are:
id Longitude Latitude
2 7.1735 45.86880001
3 7.17254 45.86887001
4 7.171636 45.86923601
5 7.18018 45.87158001
7 7.177229 45.86923001
8 7.17524 45.86808001
10 7.179299 45.87020001
12 7.175189 45.86974001
13 7.179379 45.87081001
14 7.175509 45.86932001
17 7.18099 45.87262001
19 7.18122 45.87355001
25 7.15497 45.87058001
28 7.153399 45.86954001
29 7.152649 45.86992001
31 7.154419 45.87004001
32 7.156099 45.86983001
MeteoSwiss 7.184 45.896
I define a toy example more or less resembling your data:
vals <- matrix(rnorm(24*17), nrow=24)
cds <- data.frame(id=paste0('N', 1:17),
Longitude=rnorm(n=17, mean=7.1),
Latitude=rnorm(n=17, mean=45.8))
vals <- as.data.frame(t(vals))
names(vals) <- paste0('H', 1:24)
The sp package defines several classes and methods to store and
display spatial data. For your example you should use the
SpatialPointsDataFrame class:
library(sp)
mySP <- SpatialPointsDataFrame(coords=cds[,-1], data=data.frame(vals))
and the spplot method to display the information:
spplot(mySP, as.table=TRUE,
col.regions=bpy.colors(10),
alpha=0.8, edge.col='black')
You may also find the spacetime package useful
(paper at JSS).
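With the real data, the hourly table (one column per station) first has to be turned into one row per station and matched to the coordinates before it can go into a SpatialPointsDataFrame. A base R sketch of that reshaping step, using three of the stations from the question with random placeholder temperatures, might look like this; the names `temps`, `coords`, `by_station` and `stations` are assumptions.

```r
set.seed(1)

# 24 hourly rows, one column per station (placeholder values)
temps <- as.data.frame(matrix(rnorm(24 * 3), nrow = 24,
                              dimnames = list(NULL, c("N2", "N3", "N4"))))

# Coordinates for the same three stations, taken from the question
coords <- data.frame(id = c(2, 3, 4),
                     Longitude = c(7.1735, 7.17254, 7.171636),
                     Latitude = c(45.86880001, 45.86887001, 45.86923601))

# Transpose: one row per station, one column per hour (H1 ... H24)
by_station <- as.data.frame(t(temps))
names(by_station) <- paste0("H", 1:24)

# Recover the numeric station id from the column names "N2", "N3", ...
by_station$id <- as.numeric(sub("^N", "", rownames(by_station)))

# Match each station's hourly values to its coordinates
stations <- merge(coords, by_station, by = "id")
print(stations[, c("id", "Longitude", "Latitude", "H1", "H2")])
```

The Longitude and Latitude columns of `stations` can then serve as `coords` and the H1 to H24 columns as `data` in the SpatialPointsDataFrame call shown above. The MeteoSwiss station has no numeric id, so it would need separate handling.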
