How are we supposed to get at matrix diagonals and partial regression plots in R?

Given the data
farm   up     right   left
24.3   34.3   50      45
30.2   35.3   54      45
49     45     540     4353
70     60     334     343
69     80     54      342
# for finding Studentized residuals vs. fitted values
mod1 <- lm(farm ~ up + right + left, data = data)
plot(mod1)
# for finding Cook's distance
plot(cookd(lm(farm ~ up + right + left, data = data)))
could not find function "cookd"
I don't know how to find the partial regression plots or the matrix diagonals either, and I couldn't find much information online.
Please help or correct me if I am wrong.
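A hedged sketch of one standard route in base R plus the car package, assuming the data above is in a data frame called data: cooks.distance() and hatvalues() give Cook's distance and the hat-matrix diagonals (leverages), and car::avPlots() draws partial regression (added-variable) plots.
mod1 <- lm(farm ~ up + right + left, data = data)
plot(rstudent(mod1) ~ fitted(mod1))  # Studentized residuals vs. fitted values
plot(cooks.distance(mod1))           # Cook's distance; cookd() is not a current function
hatvalues(mod1)                      # diagonals of the hat (projection) matrix, i.e. leverages
library(car)                         # install.packages("car") if needed
avPlots(mod1)                        # partial regression (added-variable) plots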

Related

Indexing through 'names' in a list and performing actions on contained values in R

I have a data set of counts from standard solutions passed through an instrument that analyses chemical concentrations (an ICPMS for those familiar). The data is over a range of different standards and for each standard I have four repeat measurements that I want to calculate the mean and variance of.
I'm importing the data from an Excel spreadsheet and then, following some housekeeping such as getting dates and times into the right format, I split the dataset up into a list identified by the name of the standard solution using Count11.sp<-split(Count11.raw, Count11.raw$Type). Count11.raw$Type then becomes the list element name, and I have the four count results for each chemical element in that list element.
So far so good.
I find I can yield an average (mean, median, etc.) easily enough by identifying the list element specifically, i.e. mean(Count11.sp$'Ca40'), or sapply(Count11$'Ca40', median), but what I'm not able to do is automate that in a loop so that I can calculate the means for each standard and drop them into a numerical matrix for further manipulation. I can extract the list element names with names() and I can even use a loop to make a vector of all the names and reference the specific list element using these in a for loop.
For instance, Count11.sp[names(Count11.sp[i])] will extract the full list element no problem:
$`Post Ca45t`
Type Run Date 7Li 9Be 24Mg 43Ca 52Cr 55Mn 59Co 60Ni
77 Post Ca45t 1 2011-02-08 00:13:08 114 26101 4191 453525 2632 520 714 2270
78 Post Ca45t 2 2011-02-08 00:13:24 114 26045 4179 454299 2822 524 704 2444
79 Post Ca45t 3 2011-02-08 00:13:41 96 26372 3961 456293 2898 520 762 2244
80 Post Ca45t 4 2011-02-08 00:13:58 112 26244 3799 454702 2630 510 792 2356
65Cu 66Zn 85Rb 86Sr 111Cd 115In 118Sn 137Ba 140Ce 141Pr 157Gd 185Re 208Pb
77 244 1036 56 3081 44 520625 78 166 724 10 0 388998 613
78 250 982 70 3103 46 526154 76 174 744 16 4 396496 644
79 246 1014 36 3183 56 524195 60 198 744 2 0 396024 612
80 270 932 60 3137 44 523366 70 180 824 2 4 390436 632
238U
77 24
78 20
79 14
80 6
but sapply(Count11.sp[names(Count11.sp[i])], median) produces an error message: Error in median.default(X[[i]], ...) : need numeric data
while sapply(Input$`Post Ca45t`, median) (with 'Post Ca45t' being the name of Count11.sp[i] for i = 4) does exactly what I want and produces the median values (I can clean that vector up later for medians that don't make sense), e.g.
Type Run Date 7Li 9Be 24Mg
NA 2.5 1297109612.5 113.0 26172.5 4070.0
43Ca 52Cr 55Mn 59Co 60Ni 65Cu
454500.5 2727.0 520.0 738.0 2313.0 248.0
66Zn 85Rb 86Sr 111Cd 115In 118Sn
998.0 58.0 3120.0 45.0 523780.5 73.0
137Ba 140Ce 141Pr 157Gd 185Re 208Pb
177.0 744.0 6.0 2.0 393230.0 622.5
238U
17.0
Can anyone give me any insight into how I can automate (i.e. loop through) these names to produce one median vector per list element? I'm sure there's just some simple disconnect in my logic here that may be easily solved.
Update: I've solved the problem. The way to do so is to use tapply on the original dataset without the need to split it. tapply allows a function to be applied to data based on a user-defined grouping criterion. In my case I could group according to Count11.raw$Type and then take the mean of the data subset: tapply(Count11.raw$Type, Count11.raw[,3:ncol(Count11.raw)], mean), job done.
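For reference, tapply's signature is tapply(X, INDEX, FUN), so the call quoted above appears to have its arguments reversed. A hedged sketch of the same grouped-summary idea, following the poster's 3:ncol column indexing (adjust to whichever columns are actually numeric):
# one row per standard (Type), one column per measured variable
med.by.type <- sapply(Count11.raw[, 3:ncol(Count11.raw)],
                      function(col) tapply(col, Count11.raw$Type, median))
# equivalently, loop over the already-split list to get one median vector per list element
med.list <- t(sapply(Count11.sp, function(d) sapply(d[, 3:ncol(d)], median)))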

In R, how can I compute the summary function in parallel?

I have a huge dataset. I computed a multinomial regression with multinom from the nnet package.
mylogit<- multinom(to ~ RealAge, mydata)
It takes 10 minutes. But when I use the summary function to compute the coefficients,
it takes more than a day!
This is the code I used:
output <- summary(mylogit)
Coef<-t(as.matrix(output$coefficients))
I was wondering if anybody knows how I can compute this part of the code with parallel processing in R?
this is a small sample of data:
mydata:
to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03
If you just want the coefficients, use only the coef() method, which does far less computation.
Example:
mydata <- readr::read_table("to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03")[rep(1:20, 3000), ]
mylogit <- nnet::multinom(to ~ RealAge, mydata)
system.time(output <- summary(mylogit)) # 6 sec
all.equal(output$coefficients, coef(mylogit)) # TRUE & super fast
If you profile the summary() function, you'll see that most of the time is taken by the crossprod() function.
So, if you really want the output of the summary() function, you could use an optimized math library, such as the MKL provided by Microsoft R Open.
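For reference, a minimal profiling sketch with base R's Rprof, assuming the mylogit object fitted above:
Rprof("summary_prof.out")
output <- summary(mylogit)
Rprof(NULL)
head(summaryRprof("summary_prof.out")$by.self)  # per the note above, crossprod() should dominate the self-time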

correlation heatmap using heatmaply R

I'm trying to create a heatmap based on Spearman correlation, with dendrograms corresponding to the Spearman correlation values.
My input file is composed as follow:
> data[1:6,1:6]
group EG PN C0 C10 C10.1
1 Patients 24 729 352.66598 43.80707 75.16226
2 Patients 24 729 195.48486 17.15763 33.60365
3 Patients 24 729 106.85937 15.13400 34.47340
4 Patients 27 1060 76.70645 14.98315 22.09885
5 Patients 27 1060 354.07169 50.61995 98.36765
6 Patients 27 1060 331.84956 92.00343 125.46658
> data[150:160,1:6]
group EG PN C0 C10 C10.1
150 Controls 27 1011 99.94756 9.018773 20.207498
151 Controls 30 616 300.20203 25.667548 37.363280
152 Controls 30 616 190.38030 18.811198 46.417332
153 Controls 26 930 79.44666 7.801935 4.569444
154 Controls 24 724 381.74026 39.842241 42.144842
155 Controls 24 724 191.39962 19.008729 31.064398
I'm able to make a simple correlation plot, but I would like to create a single heatmap with both protein and subject dendrograms based on Spearman correlation. Does anyone know how to do this? Thanks in advance.
The following code displays an interactive heatmap using Spearman's rank correlation to cluster both rows and columns (in this case for the mtcars dataset).
library(heatmaply)
heatmaply(mtcars,
          distfun = function(x) as.dist(1 - cor(t(x), method = "spearman")))
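Applied to the poster's data, a hedged sketch, assuming the data frame is called data and the protein measurements start in column 4 (after group, EG and PN); adjust the column indices and row labels as needed:
library(heatmaply)
num <- data[, 4:ncol(data)]                              # numeric protein columns only
rownames(num) <- make.unique(as.character(data$group))   # label subjects by group
heatmaply(num,
          distfun = function(x) as.dist(1 - cor(t(x), method = "spearman")))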

multidimensional data clustering

Problem: I have two groups of multidimensional heterogeneous data. I have concocted a simple illustrative example below. Notice that some columns are discrete (age) while some are binary (gender) and another is even an ordered pair (pant size).
Person Age gender height weight pant_size
Control_1 55 M 167.6 155 32,34
Control_2 68 F 154.1 137 28,28
Control_3 53 F 148.9 128 27,28
Control_4 57 M 167.6 165 38,34
Control_5 62 M 147.4 172 36,32
Control_6 44 M 157.6 159 32,32
Control_7 76 F 172.1 114 30,32
Control_8 49 M 161.8 146 34,34
Control_9 53 M 164.4 181 32,36
Person Age gender height weight pant_size
experiment_1 39 F 139.6 112 26,28
experiment_2 52 M 154.1 159 32,32
experiment_3 43 F 148.9 123 27,28
experiment_4 55 M 167.6 188 36,38
experiment_5 61 M 161.4 171 36,32
experiment_6 48 F 149.1 144 28,28
The question is does the entire experimental group differ significantly from the entire control group?
Or roughly speaking do they form two distinct clusters in the space of [age,gender,height,weight,pant_size]?
The general idea of what I've tried so far is a metric that compares corresponding columns of the experimental group to those of the control; the metric then takes the sum of the column scores (see below). A somewhat arbitrary threshold is picked to decide if the two groups are different. This arbitrariness is compounded by the weighting of the columns, which is also somewhat arbitrary. Remarkably, this approach is performing well for the actual problem I have, but it needs to be formalized. I'm wondering if this approach is similar to any existing approaches, or if there are other, more widely accepted and well-established approaches?
Person        Age  gender  height  weight  pant_size  metric
experiment_1  39   F       139.6   112     26,28
experiment_2  52   M       154.1   159     32,32
experiment_3  43   F       148.9   123     27,28
experiment_4  55   M       167.6   188     36,38
experiment_5  61   M       161.4   171     36,32
experiment_6  48   F       149.1   144     28,28
column score  2    1       5       1       7          16
Treat this as a classification rather than a clustering problem if you assume the results "cluster".
You don't need to find these clusters; they are predefined classes.
The reframed approach is as follows:
Train different classifiers to predict whether a point is from data A or data B. If you can get a much better accuracy than 50% (assuming balanced data), then the groups do differ. If all your classifiers are only as good as random guessing (and you didn't make mistakes), then the two sets are probably just too similar. A sketch of this idea follows below.
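A hedged sketch of that reframing in R, assuming both groups are stacked into one data frame called people with a group label ("control"/"experiment") and pant_size split into two numeric columns, here given the hypothetical names waist and inseam:
# leave-one-out cross-validated logistic regression as a simple classifier
people$group <- factor(people$group)
n <- nrow(people)
pred <- character(n)
for (i in seq_len(n)) {
  fit <- glm(group ~ Age + gender + height + weight + waist + inseam,
             data = people[-i, ], family = binomial)
  p <- predict(fit, newdata = people[i, ], type = "response")
  pred[i] <- levels(people$group)[1 + (p > 0.5)]
}
mean(pred == as.character(people$group))  # well above 0.5 suggests the groups differ; near 0.5, too similar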

Curve fitting this data in R?

For a few days I've been working on this problem and I'm stuck ...
I have performed a number of Monte Carlo simulations in R which gives an output y for each input x and there is clearly some simple relationship between x and y, so I want to identify the formula and its parameters. But I can't seem to get a good overall fit for both the 'Low x' and 'High x' series, e.g. using a logarithm like this:
dat = data.frame(x=x, y=y)
fit = nls(y~a*log10(x)+b, data=dat, start=list(a=-0.8,b=-2), trace=TRUE)
I have also tried to fit (log10(x), 10^y) instead, which gives a good fit but the reverse transformation doesn't fit (x, y) very well.
Can anyone solve this?
Please explain how you found the solution.
Thanks!
EDIT:
Thanks for all the quick feedback!
I am not aware of a theoretical model for what I'm simulating so I have no basis for comparison. I simply don't know the true relationship between x and y. I'm not a statistician, by the way.
The underlying model is sort of a stochastic feedback-growth model. My objective is to determine the long-term growth-rate g given some input x>0, so the output of a system grows exponentially by the rate 1+g in each iteration. The system has a stochastic production in each iteration based on the system's size, a fraction of this production is output and the rest is kept in the system determined by another stochastic variable. From MC simulation I have found the growth-rates of the system output to be log-normal distributed for every x I have tested and the y's in the data-series are the logmeans of the growth-rates g. As x goes towards infinity g goes towards zero. As x goes towards zero g goes towards infinity.
I would like a function that could calculate y from x. I actually only need a function for low x, say, in the range 0 to 10. I was able to fit that quite well by y=1.556 * x^-0.4 -3.58, but it didn't fit well for large x. I'd like a function that is general for all x>0. I have also tried Spacedman's poly fit (thanks!) but it doesn't fit well enough in the crucial range x=1 to 6.
Any ideas?
EDIT 2:
I have experimented some more, also with the detailed suggestions by Grothendieck (thanks!) After some consideration I decided that since I don't have a theoretical basis for choosing one function over another, and I'm most likely only interested in x-values between 1 and 6, I ought to use a simple function that fits well. So I just used y~a*x^b+c and made a note that it doesn't fit for high x. I may seek the community's help again when the first draft of the paper is finished. Perhaps one of you can spot the theoretical relationship between x and y once you see the Monte Carlo model.
Thanks again!
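For reference, a hedged sketch of that power-law fit with nls, using the Low x series below (read into a data frame lo.data with columns x and y) and the coefficients quoted above as starting values:
fit.pow <- nls(y ~ a * x^b + c, data = lo.data,
               start = list(a = 1.556, b = -0.4, c = -3.58))
coef(fit.pow)
plot(lo.data$x, lo.data$y)
lines(lo.data$x, fitted(fit.pow), lwd = 2)   # fitted y = a*x^b + c curve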
Low x series:
x y
1 0.2 -0.7031864
2 0.3 -1.0533648
3 0.4 -1.3019655
4 0.5 -1.4919278
5 0.6 -1.6369545
6 0.7 -1.7477481
7 0.8 -1.8497117
8 0.9 -1.9300209
9 1.0 -2.0036842
10 1.1 -2.0659970
11 1.2 -2.1224324
12 1.3 -2.1693986
13 1.4 -2.2162889
14 1.5 -2.2548485
15 1.6 -2.2953162
16 1.7 -2.3249750
17 1.8 -2.3570141
18 1.9 -2.3872684
19 2.0 -2.4133978
20 2.1 -2.4359624
21 2.2 -2.4597122
22 2.3 -2.4818787
23 2.4 -2.5019371
24 2.5 -2.5173966
25 2.6 -2.5378936
26 2.7 -2.5549524
27 2.8 -2.5677939
28 2.9 -2.5865958
29 3.0 -2.5952558
30 3.1 -2.6120607
31 3.2 -2.6216831
32 3.3 -2.6370452
33 3.4 -2.6474608
34 3.5 -2.6576862
35 3.6 -2.6655606
36 3.7 -2.6763866
37 3.8 -2.6881303
38 3.9 -2.6932310
39 4.0 -2.7073198
40 4.1 -2.7165035
41 4.2 -2.7204063
42 4.3 -2.7278532
43 4.4 -2.7321731
44 4.5 -2.7444773
45 4.6 -2.7490365
46 4.7 -2.7554178
47 4.8 -2.7611471
48 4.9 -2.7719188
49 5.0 -2.7739299
50 5.1 -2.7807113
51 5.2 -2.7870781
52 5.3 -2.7950429
53 5.4 -2.7975677
54 5.5 -2.7990999
55 5.6 -2.8095955
56 5.7 -2.8142453
57 5.8 -2.8162046
58 5.9 -2.8240594
59 6.0 -2.8272394
60 6.1 -2.8338866
61 6.2 -2.8382038
62 6.3 -2.8401935
63 6.4 -2.8444915
64 6.5 -2.8448382
65 6.6 -2.8512086
66 6.7 -2.8550240
67 6.8 -2.8592950
68 6.9 -2.8622220
69 7.0 -2.8660817
70 7.1 -2.8710430
71 7.2 -2.8736998
72 7.3 -2.8764701
73 7.4 -2.8818748
74 7.5 -2.8832696
75 7.6 -2.8833351
76 7.7 -2.8891867
77 7.8 -2.8926849
78 7.9 -2.8944987
79 8.0 -2.8996780
80 8.1 -2.9011012
81 8.2 -2.9053911
82 8.3 -2.9063661
83 8.4 -2.9092228
84 8.5 -2.9135426
85 8.6 -2.9101730
86 8.7 -2.9186316
87 8.8 -2.9199631
88 8.9 -2.9199856
89 9.0 -2.9239220
90 9.1 -2.9240167
91 9.2 -2.9284608
92 9.3 -2.9294951
93 9.4 -2.9310985
94 9.5 -2.9352370
95 9.6 -2.9403694
96 9.7 -2.9395336
97 9.8 -2.9404153
98 9.9 -2.9437564
99 10.0 -2.9452175
High x series:
x y
1 2.000000e-01 -0.701301
2 2.517851e-01 -0.907446
3 3.169786e-01 -1.104863
4 3.990525e-01 -1.304556
5 5.023773e-01 -1.496033
6 6.324555e-01 -1.674629
7 7.962143e-01 -1.842118
8 1.002374e+00 -1.998864
9 1.261915e+00 -2.153993
10 1.588656e+00 -2.287607
11 2.000000e+00 -2.415137
12 2.517851e+00 -2.522978
13 3.169786e+00 -2.621386
14 3.990525e+00 -2.701105
15 5.023773e+00 -2.778751
16 6.324555e+00 -2.841699
17 7.962143e+00 -2.900664
18 1.002374e+01 -2.947035
19 1.261915e+01 -2.993301
20 1.588656e+01 -3.033517
21 2.000000e+01 -3.072003
22 2.517851e+01 -3.102536
23 3.169786e+01 -3.138539
24 3.990525e+01 -3.167577
25 5.023773e+01 -3.200739
26 6.324555e+01 -3.233111
27 7.962143e+01 -3.259738
28 1.002374e+02 -3.291657
29 1.261915e+02 -3.324449
30 1.588656e+02 -3.349988
31 2.000000e+02 -3.380031
32 2.517851e+02 -3.405850
33 3.169786e+02 -3.438225
34 3.990525e+02 -3.467420
35 5.023773e+02 -3.496026
36 6.324555e+02 -3.531125
37 7.962143e+02 -3.558215
38 1.002374e+03 -3.587526
39 1.261915e+03 -3.616800
40 1.588656e+03 -3.648891
41 2.000000e+03 -3.684342
42 2.517851e+03 -3.716174
43 3.169786e+03 -3.752631
44 3.990525e+03 -3.786956
45 5.023773e+03 -3.819529
46 6.324555e+03 -3.857214
47 7.962143e+03 -3.899199
48 1.002374e+04 -3.937206
49 1.261915e+04 -3.968795
50 1.588656e+04 -4.015991
51 2.000000e+04 -4.055811
52 2.517851e+04 -4.098894
53 3.169786e+04 -4.135608
54 3.990525e+04 -4.190248
55 5.023773e+04 -4.237104
56 6.324555e+04 -4.286103
57 7.962143e+04 -4.332090
58 1.002374e+05 -4.392748
59 1.261915e+05 -4.446233
60 1.588656e+05 -4.497845
61 2.000000e+05 -4.568541
62 2.517851e+05 -4.628460
63 3.169786e+05 -4.686546
64 3.990525e+05 -4.759202
65 5.023773e+05 -4.826938
66 6.324555e+05 -4.912130
67 7.962143e+05 -4.985855
68 1.002374e+06 -5.070668
69 1.261915e+06 -5.143341
70 1.588656e+06 -5.261585
71 2.000000e+06 -5.343636
72 2.517851e+06 -5.447189
73 3.169786e+06 -5.559962
74 3.990525e+06 -5.683828
75 5.023773e+06 -5.799319
76 6.324555e+06 -5.929599
77 7.962143e+06 -6.065907
78 1.002374e+07 -6.200967
79 1.261915e+07 -6.361633
80 1.588656e+07 -6.509538
81 2.000000e+07 -6.682960
82 2.517851e+07 -6.887793
83 3.169786e+07 -7.026138
84 3.990525e+07 -7.227990
85 5.023773e+07 -7.413960
86 6.324555e+07 -7.620247
87 7.962143e+07 -7.815754
88 1.002374e+08 -8.020447
89 1.261915e+08 -8.229911
90 1.588656e+08 -8.447927
91 2.000000e+08 -8.665613
Without an idea of the underlying process you may as well just fit a polynomial with as many terms as you like. You don't seem to be testing a hypothesis (e.g. that gravitational strength is inverse-square related to distance), so you can fish all you like for functional forms; the data is unlikely to tell you which one is 'right'.
So if I read your data into a data frame with x and y components I can do:
data$lx <- log(data$x)
plot(data$lx, data$y)                    # needs at least a cubic polynomial
m1 <- lm(y ~ poly(lx, 3), data = data)   # fit a cubic
points(data$lx, fitted(m1), pch = 19)
and the fitted points are pretty close. Change the polynomial degree from 3 to 7 and the points are identical. Does that mean that your Y values are really coming from a 7-degree polynomial of your X values? No. But you've got a curve that goes through the points.
At this scale, you may as well just join adjacent points up with a straight line, your plot is so smooth. But without underlying theory of why Y depends on X (like an inverse square law, or exponential growth, or something) all you are doing is joining the dots, and there are infinite ways of doing that.
Regressing x/y vs. x
Plotting y vs. x for the low data and playing around a bit, it seems that x/y is approximately linear in x, so try regressing x/y against x, which gives us a relationship based on only two parameters:
y = x / (a + b * x)
where a and b are the regression coefficients.
> lm(x / y ~ x, lo.data)
Call:
lm(formula = x/y ~ x, data = lo.data)
Coefficients:
(Intercept) x
-0.1877 -0.3216
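To see the implied fit, a short sketch (assuming lo.data holds the Low x series with columns x and y):
fit <- lm(x / y ~ x, data = lo.data)
a <- coef(fit)[1]
b <- coef(fit)[2]
plot(lo.data$x, lo.data$y)
lines(lo.data$x, lo.data$x / (a + b * lo.data$x), lwd = 2)  # y = x / (a + b * x)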
MM.2
The above can be transformed into the MM.2 model in the drc R package. As seen below, this model has a high R^2. We also calculate the AIC, which we can use to compare it to other models (lower is better):
> library(drc)
> fm.mm2 <- drm(y ~ x, data = lo.data, fct = MM.2())
> cor(fitted(fm.mm2), lo.data$y)^2
[1] 0.9986303
> AIC(fm.mm2)
[1] -535.7969
CRS.6
This suggests trying a few other drc models; of the ones we tried, CRS.6 has a particularly low AIC and seems to fit well visually:
> fm.crs6 <- drm(y ~ x, data = lo.data, fct = CRS.6())
> AIC(fm.crs6)
[1] -942.7866
> plot(fm.crs6) # plots the data with the fitted CRS.6 curve (figure not reproduced here)
This gives us a range of models to choose from: the 2-parameter MM.2 model, which is not as good a fit (according to AIC) as CRS.6 but still fits quite well and has the advantage of only two parameters, or the 6-parameter CRS.6 model with its superior AIC. Note that AIC already penalizes models for having more parameters, so a better AIC is not a consequence of having more parameters.
Other
If it's believed that both the low and high series should have the same model form, then finding a single model form that fits both well might be used as another criterion for picking a model. In addition to the drc models, there are also some yield-density models in (2.1), (2.2), (2.3) and (2.4) of Akbar et al, IRJFE, 2010, which look similar to the MM.2 model and could be tried.
UPDATED: reworked this around the drc package.
