Converting probe ids to entrez ids from a list of lists - r

The conversion of probe ids to entrez ids is quite straightforward:
i1<-c("246653_at", "246897_at", "251347_at", "252988_at", "255528_at", "256535_at", "257203_at", "257582_at", "258807_at", "261509_at", "265050_at", "265672_at")
select(ath1121501.db, i1, "ENTREZID", "PROBEID")
PROBEID ENTREZID
1 246653_at 833474
2 246897_at 832631
3 251347_at 825272
4 252988_at 829998
5 255528_at 827380
6 256535_at 840223
7 257203_at 821955
8 257582_at 841494
9 258807_at 819558
10 261509_at 843504
11 265050_at 841636
12 265672_at 817757
But I am unsure how to do this for a long list of lists resulting from a clustering, so that the result is stored as a list of ENTREZ ids instead of probe ids:
For instance:
[[1]]
247964_at 248684_at 249126_at 249214_at 250223_at 253620_at 254907_at 259897_at 261256_at 267126_s_at
28 40 44 45 54 95 108 152 171 229
[[2]]
248230_at 250869_at 259765_at 265948_at 266221_at
33 64 151 216 221
[[3]]
245385_at 247282_at 248967_at 250180_at 250881_at 251073_at 53874_at 256093_at 257054_at 260007_at
5 22 42 52 65 67 101 117 125 155
261868_s_at 263136_at 267497_at
181 195 232
It should be something like
[[1]]
"835761","834904","834356","834281","831256","829175","826721","843479","837084","816891","816892"
and similarly for the other elements of the list.
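One possible approach, sketched under assumptions: each list element is a named vector whose names are the probe ids, and a PROBEID/ENTREZID lookup table is available. Here the lookup table is hard-coded from four rows of the select() output shown above, and the `clusters` object and the `to_entrez` helper are hypothetical stand-ins. With the real annotation package, the table would presumably come from one select() call on all unique probe ids.

```r
# Lookup table; hard-coded here from the select() output in the question.
# In practice it would come from select(ath1121501.db, <all probe ids>, "ENTREZID", "PROBEID").
map <- data.frame(
  PROBEID  = c("246653_at", "246897_at", "251347_at", "252988_at"),
  ENTREZID = c("833474", "832631", "825272", "829998"),
  stringsAsFactors = FALSE
)

# Hypothetical clustering result: named vectors, names are probe ids
clusters <- list(
  c("246653_at" = 1L, "251347_at" = 3L),
  c("246897_at" = 2L, "252988_at" = 4L)
)

# Hypothetical helper: keep every ENTREZ id whose probe is in the cluster.
# Using %in% (rather than match) keeps probes that map to several ENTREZ ids.
to_entrez <- function(cl, map) {
  ids <- map$ENTREZID[map$PROBEID %in% names(cl)]
  unique(ids[!is.na(ids)])
}

entrez_clusters <- lapply(clusters, to_entrez, map = map)
print(entrez_clusters)
```

The ids in each element follow the row order of the lookup table, not the order of the probes in the cluster.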

Related

Splitting a matrix into multiple matrices

There are two matrices:
Matrix with 2 columns: node name and node degree (k1):
Matrix with 1 column: degrees (ms):
I need to split the first matrix into multiple matrices, each containing the nodes of one degree, and then write each matrix to a CSV file. But my code is not working. How can I do this correctly?
k1<-read.csv2("VandD.csv", header = FALSE)
fnk1<-as.matrix(k1)
ms<-read.csv2("mas.csv", header = FALSE)
massive<-as.matrix(ms)
wlk<-1
varbl<-1
rtt<-list()
for (wlk in 1:384) {
rtt<-NULL
stepen<-massive[wlk]
for (varbl in 1:2154) {
if(fnk1[varbl,2]==stepen){
kapa<-fnk1[varbl,1]
rtt<-append(rtt,kapa)
}
}
namef<-paste("reslt",stepen,".csv",sep = "")
write.csv2(rtt, file=namef)
}
k1
V1 V2
1 UC7Ucs42FZy3uYzjrqzOIHsw 81
2 UCyWDmyZRjrGHeKF-ofFsT5Q 81
3 UCIZP6nCTyU9VV0zIhY7q1Aw 81
4 UCqk3CdGN_j8IR9z4uBbVPSg 81
5 UCjWzQkWu0l1yAhcBoavokng 81
6 UCRXiA3h1no_PFkb1JCP0yMA 81
7 UC2w9SdXpwq2Uq-MV4W4A8kw 81
8 UCdJqTQJZleoxZFReiyNvn8w 81
9 UC2Qw1dzXDBAZPwS7zm37g8g 81
10 UCTOovOHTf4efJOmGvJBxIQQ 81
ms
V1
1 81
2 82
3 83
4 84
5 85
6 86
7 87
8 88
9 89
10 90
It seems you need split (note that the column is named V2, with a capital V):
split(k1, k1$V2)
We can also use group_split:
library(dplyr)
k1 %>%
group_split(V2)
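To also cover the "write to CSV files" part of the question, a minimal sketch follows. It assumes the degree column is named V2, as produced by read.csv2(..., header = FALSE). The small `k1` data frame and the temporary output directory are made up for illustration, and the file name pattern follows the question's `reslt<degree>.csv`.

```r
# Made-up stand-in for the k1 data read from "VandD.csv"
k1 <- data.frame(
  V1 = c("nodeA", "nodeB", "nodeC", "nodeD"),
  V2 = c(81, 81, 82, 83),
  stringsAsFactors = FALSE
)

# One data frame per distinct degree; list names are the degrees
groups <- split(k1, k1$V2)

# Write each group to its own file (a temporary directory is used here)
out_dir <- tempdir()
for (deg in names(groups)) {
  out_file <- file.path(out_dir, paste0("reslt", deg, ".csv"))
  write.csv2(groups[[deg]], file = out_file, row.names = FALSE)
}
```

Because split() only creates groups for degrees that actually occur in k1, the second matrix (ms) is not needed for this step.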

How to perform interpolation for a (column vector) data series using the largest column vector in a data frame in R

I have an Excel sheet containing several data series in the form of column vectors. Each column vector has a different length. The sample data from the Excel sheet are shown below as column vectors.
No 1 2 3 4 5 6
1 7.68565 7.431991 7.620156 7.34955 7.493848 7.244905
2 8.247334 7.895186 8.107751 7.629121 8.01165 7.898938
3 8.861417 8.411331 8.616113 7.960177 8.551065 8.432346
4 9.522981 8.945542 9.117843 8.263698 9.129371 9.118917
5 10.10206 9.465829 9.621576 8.515904 9.680468 9.695693
6 10.74194 10.05058 10.2111 8.824739 10.22375 10.48411
7 11.41614 10.59113 10.70612 9.12775 10.78299 11.1652
8 12.08601 11.12069 11.23061 9.445629 11.32874 11.8499
9 12.8509 11.68692 11.81479 9.762563 11.92125 12.77563
10 13.79793 12.31746 12.3436 10.12344 12.5586 14.05427
11 14.40335 12.85409 12.81579 10.4148 13.2323 14.74745
12 14.96397 13.44764 13.39124 10.76968 13.91571 15.48449
13 15.49457 13.5184 13.94058 11.05081 14.43318 16.12423
14 16.06153 13.99386 14.35261 11.38416 14.95082 16.84513
15 16.61133 14.4879 14.86438 11.71484 15.47574 17.42593
16 17.24876 14.95296 15.30651 12.06838 16.01853 18.05138
17 17.8686 15.48764 15.82241 12.41315 16.546 18.69939
18 18.49424 16.01478 16.33324 12.76782 17.07923 19.29467
19 19.0651 16.5115 16.8808 13.11234 17.62211 20.00391
20 19.73842 17.07482 17.40481 13.46479 18.14528 20.67474
21 20.47123 17.51353 17.88455 13.55012 18.69565 21.35446
22 21.16333 18.00172 18.38069 13.82592 19.23222 22.16516
23 21.83083 18.55357 18.79004 14.10343 19.93576 23.0249
24 22.50095 19.04932 19.25296 14.38997 20.6087 23.75609
25 23.27895 19.66359 19.68497 14.66933 21.19856 24.33014
26 23.86791 20.19746 20.25114 14.96252 21.7933 25.16132
27 24.42128 20.79322 20.8394 15.27082 22.4216 25.64038
28 25.02747 21.34963 21.36803 15.59645 22.95553 26.40612
29 25.64392 21.96625 21.92369 15.90159 23.62858 26.99359
30 26.15457 22.51419 22.49119 16.21841 24.27062 27.48933
31 26.78083 23.14052 23.09582 16.5353 24.75912 28.13525
32 27.39095 23.71215 23.71597 16.84909 25.34079 28.66253
33 28.04546 24.23099 24.22622 17.23782 25.90887 29.27824
34 28.68887 24.69722 24.76757 17.58071 26.51803 30.06892
35 29.45707 25.24266 25.30781 17.91193 27.12488 30.87034
36 30.03946 25.75705 25.86998 18.24291 27.73606 31.71053
37 30.71511 26.29254 26.34333 18.50986 28.30462 32.37958
38 31.42378 26.91853 26.69165 18.81327 28.91142 33.07085
39 32.50335 27.44403 27.12134 19.20657 29.51637 33.8685
40 33.12328 27.98299 27.578 19.55173 30.14371 34.5783
41 33.71293 28.42661 28.16382 19.818 30.7509 35.29098
42 34.22313 29.11766 28.58075 20.20322 31.50584 35.97233
43 34.84822 29.69339 29.14229 20.60828 32.14028 36.53085
44 35.51228 30.30699 29.71523 20.86474 32.72842 36.82623
45 36.11674 30.89355 30.28881 21.24548 33.02594 37.79391
46 36.80722 31.50952 30.94186 21.56593 33.17226 38.42553
47 37.60966 31.98561 31.63391 21.89768 33.34089 39.20039
48 38.25016 32.63639 32.19883 22.23119 33.67384 39.98531
49 38.95744 33.18134 32.72147 22.4859 34.27073 40.76857
50 39.66163 33.67109 33.14864 22.90394 34.86681 41.49251
51 40.37425 34.12463 33.60807 23.26918 35.59697 42.51444
52 41.23707 34.66628 34.09723 23.52158 36.24535 43.14603
53 41.82558 35.1961 34.57659 23.89679 36.90796 44.16233
54 42.55081 35.72951 35.03618 24.49229 37.65297 44.59068
55 43.39907 36.31952 35.46371 24.81181 38.33818 45.22966
56 44.05056 37.05194 35.98615 25.12065 38.85623 46.23367
57 44.78049 37.1323 36.51719 25.4582 39.54339 46.54872
58 45.43282 37.76535 37.09313 25.88998 40.23827 47.07784
59 46.18882 38.27575 37.17476 26.22639 40.92604 47.807
60 46.90982 38.88576 37.90604 26.56257 41.63398 48.4778
61 47.56264 39.64927 38.5283 26.8499 42.29979 49.21885
62 48.10035 40.19561 39.16806 27.1614 42.99679 50.18735
63 49.01068 40.89077 39.80176 27.43677 43.8278 51.9102
64 49.76271 41.6514 40.39578 27.89204 44.4915 52.78747
65 50.53434 42.09778 41.03402 28.18638 45.01828 53.46253
66 51.67479 42.83619 41.44307 28.49254 45.8151 54.44443
67 52.20818 43.35224 42.17046 28.87821 46.38069 55.20507
68 52.84818 43.94838 42.54818 29.18387 47.27983 55.71156
69 53.54274 44.61937 43.04368 29.58712 47.76875 56.11357
70 54.24117 45.2113 43.55424 29.97786 48.52082 56.56269
71 55.10781 45.87016 44.19418 30.30342 49.17041 57.04574
72 55.81844 46.58728 44.70245 30.92939 50.00576 57.61847
73 56.53417 47.17022 45.19135 64.12819 50.76387 58.46774
74 56.99077 47.80587 45.81162 64.46482 51.44632 59.35406
75 57.70125 48.4632 46.53608 64.47179 52.09271 60.34232
76 58.40646 49.11251 47.44626 65.28538 52.77505 60.76057
77 59.20803 49.70755 48.0586 65.42728 53.3777 61.86707
78 59.71753 50.13534 48.76304 65.97044 54.06384 63.14102
79 60.58331 50.72049 49.47997 66.51449 54.7547 64.43312
80 61.03398 51.41927 50.11546 67.02634 55.4798 65.58254
81 61.80681 51.97609 50.69514 67.59518 55.96139 66.72086
82 62.48501 52.59973 51.31683 68.12712 56.93643 67.53484
83 63.36452 53.36562 51.73617 68.64816 57.6551 68.07806
84 64.31261 53.98405 52.21327 69.24711 58.23373 68.63623
85 65.24776 54.51552 52.77048 70.48085 58.97933 69.02074
86 66.17772 55.20282 53.22162 70.64199 59.76285 69.38057
87 67.08787 55.91391 53.7916 71.38781 60.25809 70.01195
88 68.01987 56.61301 54.46721 71.58064 61.31948 70.5335
89 68.92189 57.28238 55.16064 71.99983 62.18978 71.61938
90 69.79762 57.88332 55.85772 72.89091 63.02894 72.77907
91 69.86632 58.52047 56.78106 73.05919 63.78964 74.13258
92 70.60662 59.12164 57.49112 73.58095 64.54343 75.77073
93 71.63203 59.77399 58.20212 74.1192 65.36834 76.57243
94 72.18227 60.47282 58.77127 74.6143 65.83804 77.84715
95 72.97624 60.7739 59.41283 75.4809 66.61507 78.78102
96 73.75372 61.22352 59.84708 75.66663 67.44336 79.33527
97 74.66983 61.87689 60.49374 76.09998 68.30974 79.86294
98 75.85329 62.58495 60.7886 76.67287 69.23421 80.51763
99 76.38837 63.32424 61.5629 77.20351 70.00735 80.91219
100 77.38139 64.07433 62.21648 77.95189 70.7836 81.57964
101 78.25631 64.82328 62.74316 78.21231 71.2177 82.16656
102 79.19827 65.50484 63.64724 78.89301 72.00792 83.12364
103 80.38764 66.23685 64.48991 79.32261 73.00548 84.00261
104 80.87278 66.95412 65.2793 79.95379 73.50331 85.22213
105 81.76581 67.70247 65.82581 80.52102 74.28909 86.6621
106 83.02712 68.55701 66.62666 81.06393 75.11777 88.11059
107 83.48909 69.23235 67.35486 81.7409 75.9652 NA
108 84.82759 70.58522 68.15342 82.25188 76.8884 NA
109 85.28537 71.04559 68.92251 82.98396 77.83717 NA
110 86.70018 71.73407 69.51888 83.51862 78.45438 NA
111 87.35397 72.45837 70.31539 83.69946 79.32315 NA
112 88.69969 73.14394 70.9007 84.25947 80.39831 NA
113 NA 73.92206 71.50578 85.10349 81.20853 NA
114 NA 74.65082 72.20686 85.26869 81.95338 NA
115 NA 75.32388 72.81664 86.07426 82.36201 NA
116 NA 76.37313 73.52561 86.33713 83.16817 NA
117 NA 76.85229 74.32013 86.85325 83.96463 NA
118 NA 77.55033 75.04207 87.32344 84.8136 NA
119 NA 78.19957 75.90256 87.93314 85.7303 NA
120 NA 79.23823 76.41772 88.39268 86.46136 NA
121 NA 79.57755 77.11913 88.96714 87.30937 NA
122 NA 79.70834 78.01459 NA 88.17579 NA
123 NA 80.44374 78.76607 NA 89.00109 NA
124 NA 81.47443 79.56496 NA NA NA
125 NA 81.80569 79.69939 NA NA NA
126 NA 82.57823 80.52383 NA NA NA
127 NA 83.38485 81.27236 NA NA NA
128 NA 84.09743 81.94386 NA NA NA
129 NA 84.78618 83.01913 NA NA NA
130 NA 85.91491 83.52692 NA NA NA
131 NA 86.18631 84.52093 NA NA NA
132 NA 86.87262 85.26204 NA NA NA
133 NA 88.0145 85.93992 NA NA NA
134 NA 88.30018 86.70402 NA NA NA
135 NA 89.08487 87.58891 NA NA NA
136 NA NA 88.27903 NA NA NA
In the data above, the values range from approximately 7.3 to 89.08, from top to bottom. However, in another sheet of the Excel file I have data that range from approximately 7.3 to 89.09 from bottom to top.
Now I would like to take the longest column vector (in the sample data, column 3, of size 136*1) and convert the other column vectors (1, 2, 4, 5 and 6) to the size of column 3, such that the original values (magnitudes) remain the same while their positions (rows) may shift. Between the original values I need to interpolate so that all the column vectors have the same length (136*1).
I have some hundreds of column vectors like these.
The expected output is shown only for column 1, with column 3 as the reference:
No 1 3
1 7.68565 7.620156
2 8.247334 8.107751
3 8.861417 8.616113
4 9.522981 9.117843
5 **9.8125205** 9.621576
6 10.10206 10.2111
7 10.74194 10.70612
8 11.41614 11.23061
9 **11.751075** 11.81479
10 12.08601 12.3436
11 12.8509 12.81579
12 13.79793 13.39124
13 **14.10064** 13.94058
14 14.40335 14.35261
15 14.96397 14.86438
16 15.49457 15.30651
17 **15.77805** 15.82241
18 16.06153 16.33324
19 16.61133 16.8808
20 17.24876 17.40481
21 17.8686 17.88455
22 18.49424 18.38069
23 **18.77967** 18.79004
24 19.0651 19.25296
25 19.73842 19.68497
26 20.47123 20.25114
27 **20.81728** 20.8394
28 21.16333 21.36803
29 21.83083 21.92369
30 22.50095 22.49119
31 23.27895 23.09582
32 23.86791 23.71597
33 24.42128 24.22622
34 **24.724375** 24.76757
35 25.02747 25.30781
36 25.64392 25.86998
37 26.15457 26.34333
38 26.78083 26.69165
39 27.39095 27.12134
40 **27.718205** 27.578
41 28.04546 28.16382
42 28.68887 28.58075
43 29.45707 29.14229
44 **29.748265** 29.71523
45 30.03946 30.28881
46 30.71511 30.94186
47 31.42378 31.63391
48 32.50335 32.19883
49 **32.813315** 32.72147
50 33.12328 33.14864
51 33.71293 33.60807
52 34.22313 34.09723
53 34.84822 34.57659
54 **35.18025** 35.03618
55 35.51228 35.46371
56 **35.81451** 35.98615
57 36.11674 36.51719
58 36.80722 37.09313
59 37.60966 37.17476
60 **37.92991** 37.90604
61 38.25016 38.5283
62 38.95744 39.16806
63 39.66163 39.80176
64 40.37425 40.39578
65 41.23707 41.03402
66 41.82558 41.44307
67 42.55081 42.17046
68 **42.97494** 42.54818
69 43.39907 43.04368
70 **43.724815** 43.55424
71 44.05056 44.19418
72 44.78049 44.70245
73 45.43282 45.19135
74 **45.81082** 45.81162
75 46.18882 46.53608
76 46.90982 47.44626
77 47.56264 48.0586
78 48.10035 48.76304
79 49.01068 49.47997
80 49.76271 50.11546
81 50.53434 50.69514
82 51.67479 51.31683
83 **51.941485** 51.73617
84 52.20818 52.21327
85 52.84818 52.77048
86 53.54274 53.22162
87 **53.891955** 53.7916
88 54.24117 54.46721
89 55.10781 55.16064
90 55.81844 55.85772
91 56.53417 56.78106
92 56.99077 57.49112
93 57.70125 58.20212
94 58.40646 58.77127
95 59.20803 59.41283
96 59.71753 59.84708
97 60.58331 60.49374
98 61.03398 60.7886
99 61.80681 61.5629
100 62.48501 62.21648
101 **62.924765** 62.74316
102 63.36452 63.64724
103 64.31261 64.48991
104 65.24776 65.2793
105 **65.71274** 65.82581
106 66.17772 66.62666
107 67.08787 67.35486
108 68.01987 68.15342
109 68.92189 68.92251
110 69.79762 69.51888
111 69.86632 70.31539
112 70.60662 70.9007
113 71.63203 71.50578
114 72.18227 72.20686
115 72.97624 72.81664
116 73.75372 73.52561
117 74.66983 74.32013
118 75.85329 75.04207
119 76.38837 75.90256
120 **76.88488** 76.41772
121 77.38139 77.11913
122 78.25631 78.01459
123 **78.72729** 78.76607
124 79.19827 79.56496
125 **79.792955** 79.69939
126 80.38764 80.52383
127 80.87278 81.27236
128 81.76581 81.94386
129 83.02712 83.01913
130 83.48909 83.52692
131 84.82759 84.52093
132 85.28537 85.26204
133 **85.992775** 85.93992
134 86.70018 86.70402
135 87.35397 87.58891
136 88.69969 88.27903
The expected interpolated values in column 1 are marked with double asterisks. Here, the value of cell i is interpolated by averaging cells i-1 and i+1 (simple linear interpolation).
The main purpose of doing so is to perform clustering, since column or row vectors of unequal length cannot be used for clustering.
Is there any code to do that?
Or can we calculate distances using DTW (Dynamic Time Warping) or any other method for column vectors of unequal length (as in the example dataset) and then perform clustering?
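A hedged sketch of one way to bring all series to a common length, using base R's approx(). This is not the same as the midpoint insertion shown in the expected output: it resamples each series linearly onto an evenly spaced grid of the target length, so only the first and last original values are guaranteed to be preserved exactly. The helper name `resample_to_length` and the two short series (the first rows of columns 1 and 3) are for illustration only.

```r
# Hypothetical helper: linearly resample a series to n_out points
resample_to_length <- function(x, n_out) {
  x <- x[!is.na(x)]                                   # drop padding NAs
  approx(x = seq_along(x), y = x, n = n_out)$y        # evenly spaced linear interpolation
}

# Short illustrative series of unequal length
series <- list(
  a = c(7.68565, 8.247334, 8.861417, 9.522981, 10.10206),
  b = c(7.620156, 8.107751, 8.616113, 9.117843, 9.621576, 10.2111, 10.70612)
)

n_max   <- max(lengths(series))                       # length of the longest series
aligned <- sapply(series, resample_to_length, n_out = n_max)
print(dim(aligned))                                   # rows = n_max, one column per series
```

The resulting matrix has equal-length columns and can be passed to dist() (after transposing, so that each series is a row) for clustering.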

Calculate mean value for each row with interval

I need to calculate the mean value for each row (the mean of each interval). Here is a basic example (maybe someone has an even better idea of how to do it):
M_31_mb <- (15 : -15) # creating a small vector
M_31 <- cut(M_31_mb, 128) # getting 128 groups from the small vector
#M_1_mb <- (1500 : -1500)#creating a vector value
#M_1 <- cut(M_1_mb,128)# getting 128 groups from the vector
I need to get the mean value for each row/group of the 128 intervals created in M_1 (actually I do not even need those intervals, just their means) and I cannot figure out how to do it.
I had a look at the cut2 function from the Hmisc library, but unfortunately there is no option to set the number of intervals into which the vector is cut (although there is an option to get the mean value of the created intervals: levels.mean).
I would appreciate any help! Thanks!
Additional Info:
The cut2 function works well for bigger vectors (M_1_mb); however, when my vector is small (M_31_mb), I get a warning:
Warning message:
In min(xx[xx > upper]) : no non-missing arguments to min; returning Inf
and only 31 groups are created:
M_31_mb <- (15 : -15) # smaller vector
M_31 <- table(cut2(M_31_mb,g=128,levels.mean = TRUE))
where
g = number of quantile groups
Like this?
aggregate(M_1_mb,by=list(M_1),mean)
EDIT: Result
Group.1 x
1 (-1.5e+03,-1.48e+03] -1488.5
2 (-1.48e+03,-1.45e+03] -1465.0
3 (-1.45e+03,-1.43e+03] -1441.5
4 (-1.43e+03,-1.41e+03] -1418.0
5 (-1.41e+03,-1.38e+03] -1394.5
6 (-1.38e+03,-1.36e+03] -1371.0
7 (-1.36e+03,-1.34e+03] -1347.5
8 (-1.34e+03,-1.31e+03] -1324.0
9 (-1.31e+03,-1.29e+03] -1301.0
10 (-1.29e+03,-1.27e+03] -1277.5
11 (-1.27e+03,-1.24e+03] -1254.0
12 (-1.24e+03,-1.22e+03] -1230.5
13 (-1.22e+03,-1.2e+03] -1207.0
14 (-1.2e+03,-1.17e+03] -1183.5
15 (-1.17e+03,-1.15e+03] -1160.0
16 (-1.15e+03,-1.12e+03] -1136.5
17 (-1.12e+03,-1.1e+03] -1113.0
18 (-1.1e+03,-1.08e+03] -1090.0
19 (-1.08e+03,-1.05e+03] -1066.5
20 (-1.05e+03,-1.03e+03] -1043.0
21 (-1.03e+03,-1.01e+03] -1019.5
22 (-1.01e+03,-984] -996.0
23 (-984,-961] -972.5
24 (-961,-938] -949.0
25 (-938,-914] -926.0
26 (-914,-891] -902.5
27 (-891,-867] -879.0
28 (-867,-844] -855.5
29 (-844,-820] -832.0
30 (-820,-797] -808.5
31 (-797,-773] -785.0
32 (-773,-750] -761.5
33 (-750,-727] -738.0
34 (-727,-703] -715.0
35 (-703,-680] -691.5
36 (-680,-656] -668.0
37 (-656,-633] -644.5
38 (-633,-609] -621.0
39 (-609,-586] -597.5
40 (-586,-562] -574.0
41 (-562,-539] -551.0
42 (-539,-516] -527.5
43 (-516,-492] -504.0
44 (-492,-469] -480.5
45 (-469,-445] -457.0
46 (-445,-422] -433.5
47 (-422,-398] -410.0
48 (-398,-375] -386.5
49 (-375,-352] -363.0
50 (-352,-328] -340.0
51 (-328,-305] -316.5
52 (-305,-281] -293.0
53 (-281,-258] -269.5
54 (-258,-234] -246.0
55 (-234,-211] -222.5
56 (-211,-188] -199.0
57 (-188,-164] -176.0
58 (-164,-141] -152.5
59 (-141,-117] -129.0
60 (-117,-93.8] -105.5
61 (-93.8,-70.3] -82.0
62 (-70.3,-46.9] -58.5
63 (-46.9,-23.4] -35.0
64 (-23.4,0] -11.5
65 (0,23.4] 12.0
66 (23.4,46.9] 35.0
67 (46.9,70.3] 58.5
68 (70.3,93.8] 82.0
69 (93.8,117] 105.5
70 (117,141] 129.0
71 (141,164] 152.5
72 (164,188] 176.0
73 (188,211] 199.0
74 (211,234] 222.5
75 (234,258] 246.0
76 (258,281] 269.5
77 (281,305] 293.0
78 (305,328] 316.5
79 (328,352] 340.0
80 (352,375] 363.5
81 (375,398] 387.0
82 (398,422] 410.0
83 (422,445] 433.5
84 (445,469] 457.0
85 (469,492] 480.5
86 (492,516] 504.0
87 (516,539] 527.5
88 (539,562] 551.0
89 (562,586] 574.0
90 (586,609] 597.5
91 (609,633] 621.0
92 (633,656] 644.5
93 (656,680] 668.0
94 (680,703] 691.5
95 (703,727] 715.0
96 (727,750] 738.5
97 (750,773] 762.0
98 (773,797] 785.0
99 (797,820] 808.5
100 (820,844] 832.0
101 (844,867] 855.5
102 (867,891] 879.0
103 (891,914] 902.5
104 (914,938] 926.0
105 (938,961] 949.0
106 (961,984] 972.5
107 (984,1.01e+03] 996.0
108 (1.01e+03,1.03e+03] 1019.5
109 (1.03e+03,1.05e+03] 1043.0
110 (1.05e+03,1.08e+03] 1066.5
111 (1.08e+03,1.1e+03] 1090.0
112 (1.1e+03,1.12e+03] 1113.5
113 (1.12e+03,1.15e+03] 1137.0
114 (1.15e+03,1.17e+03] 1160.0
115 (1.17e+03,1.2e+03] 1183.5
116 (1.2e+03,1.22e+03] 1207.0
117 (1.22e+03,1.24e+03] 1230.5
118 (1.24e+03,1.27e+03] 1254.0
119 (1.27e+03,1.29e+03] 1277.5
120 (1.29e+03,1.31e+03] 1301.0
121 (1.31e+03,1.34e+03] 1324.0
122 (1.34e+03,1.36e+03] 1347.5
123 (1.36e+03,1.38e+03] 1371.0
124 (1.38e+03,1.41e+03] 1394.5
125 (1.41e+03,1.43e+03] 1418.0
126 (1.43e+03,1.45e+03] 1441.5
127 (1.45e+03,1.48e+03] 1465.0
128 (1.48e+03,1.5e+03] 1488.5
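As an alternative sketch, tapply() gives the same per-interval means as a plain named vector. Unlike aggregate(), it keeps empty intervals (as NA), which matters for the small vector where most of the 128 intervals contain no value. The 12-element vector below is made up to keep the example easy to check by hand.

```r
# Small made-up example: three equal-width intervals over 1..12
x <- 1:12
groups <- cut(x, 3)
group_means <- tapply(x, groups, mean)   # one mean per interval
print(group_means)

# The small vector from the question: 31 values cut into 128 intervals
small <- 15:-15
small_means <- tapply(small, cut(small, 128), mean)
n_filled <- sum(!is.na(small_means))     # intervals that actually contain a value
print(n_filled)
```

If only the non-empty intervals are wanted, the NA entries can be dropped with `small_means[!is.na(small_means)]`.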

R - Data Frame is a list of columns?

Question
Is a data frame in R a list of columns (a list being, in my understanding, a sequence of objects)?
What was the design decision in R behind making a data frame a column-oriented (not row-oriented) structure?
Any reference to related design document or article of data structure design would be appreciated.
I am just used to row-as-a-unit/record and would like to know why it is column oriented. Or if I misunderstood something, kindly suggest.
Background
I had thought a data frame was a sequence of rows, such as (Ozone, Solar.R, Wind, Temp, Month, Day).
> c ## data frame created from read.csv()
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
> typeof(c)
[1] "list"
However, when lapply() is applied to c to show each list element, each element turns out to be a column.
> lapply(c, function(arg){ return(arg) })
$Ozone
[1] 41 36 12 18 23 19
$Solar.R
[1] 190 118 149 313 299 99
$Wind
[1] 7.4 8.0 12.6 11.5 8.6 13.8
$Temp
[1] 67 72 74 62 65 59
$Month
[1] 5 5 5 5 5 5
$Day
[1] 1 2 3 4 7 8
whereas what I had expected was
[1] 41 190 7.4 67 5 1
[1] 36 118 8.0 72 5 2
…
1) Is a data frame in R a list of columns?
Yes.
df <- data.frame(a=c("the", "quick"), b=c("brown", "fox"), c=1:2, stringsAsFactors=FALSE)
is.list(df) # -> TRUE
attr(df, "names") # -> [1] "a" "b" "c"
df[[1]][2] # -> [1] "quick"
2) What is the design decision in R to have made a data frame a column-oriented (not row-oriented) structure?
A data.frame is a list of column vectors.
is.atomic(df[[1]]) # -> TRUE
mode(df[[1]]) # -> [1] "character"
mode(df[[3]]) # -> [1] "numeric"
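Atomic vectors can only store one kind of object. A "row-oriented" data.frame would demand that data frames be composed of lists instead, because a single row mixes types. Now imagine what the performance of an operation like
df[[1]][20000]
or of any computation over a whole column would be in a list-based data frame, keeping in mind that an atomic vector stores its values contiguously with a single type, whereas a list stores every element as a separate R object.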
3) Any reference to related design document or article of data structure design would be appreciated.
http://adv-r.had.co.nz/Data-structures.html#data-frames
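A small sketch that makes the column orientation visible, and shows how to get the row-wise view the question expected. The variable names are illustrative; `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
df <- data.frame(a = c("the", "quick"), b = c("brown", "fox"), c = 1:2,
                 stringsAsFactors = FALSE)

n_elements <- length(df)         # list length = number of columns, not rows
col_types  <- sapply(df, class)  # each column is one homogeneous vector

# A single row mixes types, so as a standalone object it has to be a list
row_1 <- as.list(df[1, ])

# Row-wise view: one list per row
rows <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))

print(col_types)
str(row_1)
```

Iterating with lapply(df, ...) therefore visits columns, while row-wise iteration needs an explicit loop over `seq_len(nrow(df))` as above.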

Loop Linear Regression

As a beginner in R I have a probably simple question.
I have a linear regression with this specification:
X1_t = X1_(t-h) + X2_(t-h)
where h is equal to 1, 2, 3, 4, 5.
For example, when h=1 I run this code:
Modelo11 <- dynlm(X1 ~ L(X1,1) + L(X2, 1)-1, data = GDP)
It's a simple regression.
I want to implement a function that gives me the five linear regressions (h = 1, 2, 3, 4 and 5), with and without HAC (heteroskedasticity and autocorrelation consistent) estimation.
I tried this, and it didn't work:
for(h in 1:5){
Modelo1[h] <- dynlm(GDPTrimestralemT ~ L(SpreademT,h) + L(GDPTrimestralemT, h)-1, data = MatrizDadosUS)
coeftest(Modelo1[h], df = Inf, vcov = parzenHAC)
return(list(summary(Modelo1[h])))
}
One of the error messages is:
number of items to replace is not a multiple of replacement length
This is my data.frame:
GDP <- data.frame(data )
GDP
X1 X2
1 0.542952690 0.226341364
2 0.102328393 0.743360185
3 0.166345969 0.186533485
4 1.406733422 1.392420181
5 -0.469811005 -0.114609464
6 -0.509268267 0.687555461
7 1.470439930 0.298655018
8 1.046456428 -1.056387597
9 -0.492462197 -0.530284962
10 -0.516065519 0.645957530
11 0.624638996 1.044731264
12 0.213616470 -1.652979785
13 0.669747432 1.398602289
14 0.552089131 -0.821013792
15 0.452715216 1.420094663
16 -0.892063248 -1.436600779
17 1.429284965 0.559738610
18 0.853740565 -0.898976767
19 0.741864168 1.352012831
20 0.171494650 1.704764705
21 0.422326351 -0.267064235
22 -1.261643503 -2.090694608
23 -1.321086283 -0.273954212
24 0.365226000 1.965167113
25 -0.080888690 -0.594498893
26 -0.183293801 -0.483053404
27 -1.033792032 0.586491772
28 0.718322432 1.776210145
29 -2.822693790 -0.731509917
30 -1.251740437 -1.918124078
31 1.184256949 -0.016548037
32 2.255202675 0.303438286
33 -0.930446147 0.803126180
34 -1.691383225 -0.157839283
35 -1.081643279 -0.006652717
36 1.034162006 -1.970063305
37 -0.716827488 0.306792930
38 0.098471514 0.338333164
39 0.343536547 0.389775011
40 1.442117465 -0.668885360
41 0.095131066 -0.298356861
42 0.222524607 0.291485267
43 -0.499969717 1.308312472
44 0.588162304 0.026539575
45 0.581215173 0.167710855
46 0.629343124 -0.052835206
47 0.811618963 0.716913172
48 1.463610069 -0.356369304
49 -2.000576321 1.226446201
50 1.278233553 0.313606888
51 -0.700373666 0.770273988
52 -1.206455648 0.344628878
53 0.024602262 1.001621886
54 0.858933385 -0.865771777
55 -1.592291995 -0.384908852
56 -0.833758365 -1.184682199
57 -0.281305858 2.070391729
58 -0.122848757 -0.308397782
59 -0.661013984 1.590741535
60 1.887869805 -1.240283364
61 -0.313677463 -1.393252994
62 1.142864110 -1.150916732
63 -0.633380499 -0.223923970
64 -0.158729527 -1.245647224
65 0.928619010 -1.050636078
66 0.424317087 0.593892028
67 1.108704956 -1.792833100
68 -1.338231248 1.138684394
69 -0.647492569 0.181495183
70 0.295906675 -0.101823172
71 -0.079827607 0.825158278
72 0.050353111 -0.448453121
73 0.129068772 0.205619797
74 -0.221450137 0.051349511
75 -1.300967949 1.639063824
76 -0.861963677 1.273104220
77 -1.691001610 0.746514122
78 0.365888734 -0.055308006
79 1.297349754 1.146102001
80 -0.652382297 -1.095031447
81 0.165682952 -0.012926971
82 0.127996446 0.510673745
83 0.338743162 -3.141650682
84 -0.266916587 -2.483389321
85 0.148135154 -1.239997153
86 1.256591385 0.051984536
87 -0.646281986 0.468210275
88 0.180472423 0.393014848
89 0.231892902 -0.545305005
90 -0.709986273 0.104969765
91 1.231712844 -1.703489840
92 0.435378714 0.876505107
93 -1.880394798 -0.885893722
94 1.083580732 0.117560662
95 -0.499072654 -1.039222894
96 1.850756855 -1.308752222
97 1.653952857 0.440405804
98 -1.057618294 -1.611779530
99 -0.021821282 -0.807071503
100 0.682923562 -2.358596342
101 -1.132293845 -1.488806929
102 0.319237353 0.706203968
103 -2.393105781 -1.562111727
104 0.188653972 -0.637073832
105 0.667003685 0.047694037
106 -0.534018861 1.366826933
107 -2.240330371 -0.071797320
108 -0.220633546 1.612879694
109 -0.022442941 1.172582601
110 -1.542418139 0.635161458
111 -0.684128812 -0.334973482
112 0.688849615 0.056557966
113 0.848602803 0.785297518
114 -0.874157558 -0.434518305
115 -0.404999060 -0.078893114
116 0.735896917 1.637873669
117 -0.174398836 0.542952690
118 0.222418628 0.102328393
119 0.419461884 0.166345969
120 -0.042602368 1.406733422
121 2.135670836 -0.469811005
122 1.197644287 -0.509268267
123 0.395951293 1.470439930
124 0.141327444 1.046456428
125 0.691575897 -0.492462197
126 -0.490708151 -0.516065519
127 -0.358903359 0.624638996
128 -0.227550909 0.213616470
129 -0.766692832 0.669747432
130 -0.001690915 0.552089131
131 -1.786701123 0.452715216
132 -1.251495762 -0.892063248
133 1.123462446 1.429284965
134 0.237862653 0.853740565
Thanks.
Your variable Modelo1 is a vector, which cannot store lm objects. If Modelo1 is a list, it should work.
library(dynlm)
df<-data.frame(rnorm(50),rnorm(50))
names(df)<-c("a","b")
c<-list()
for(h in 1:5){
c[[h]] <- dynlm(a ~ L(a,h) + L(b, h)-1, data = df)
}
To get the summary you have to access the single list elements. For example:
summary(c[[1]])
*Edit in response to Richard Scriven's comment
The most efficient way to get all summaries would be:
lapply(c, summary)
This applies the summary function to each element of the list and returns a list with the results.
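The question also asked for HAC standard errors. A hedged sketch of that part follows, assuming the lmtest and sandwich packages are installed. To keep it self-contained it builds the lags by hand and fits with lm() instead of dynlm(). The data are random stand-ins, `fit_lag` is a hypothetical helper, and kernHAC(kernel = "Parzen") is used as one possible Parzen-kernel HAC estimator, since the `parzenHAC` function in the question is not part of sandwich and is presumably user-defined.

```r
library(lmtest)    # coeftest()
library(sandwich)  # kernHAC()

set.seed(1)
GDP <- data.frame(X1 = rnorm(134), X2 = rnorm(134))  # random stand-in data

# Hypothetical helper: regress X1_t on X1_(t-h) and X2_(t-h), no intercept
fit_lag <- function(h, data) {
  n <- nrow(data)
  lagged <- data.frame(
    y      = data$X1[(h + 1):n],
    X1_lag = data$X1[1:(n - h)],
    X2_lag = data$X2[1:(n - h)]
  )
  lm(y ~ X1_lag + X2_lag - 1, data = lagged)
}

models <- lapply(1:5, fit_lag, data = GDP)   # one model per lag h

# Coefficient tests without and with HAC covariance
plain <- lapply(models, coeftest)
hac   <- lapply(models, function(m) {
  coeftest(m, df = Inf, vcov. = kernHAC(m, kernel = "Parzen"))
})

print(hac[[1]])
```

Both `plain` and `hac` are lists of length five, so `plain[[h]]` and `hac[[h]]` give the two versions of the test table for lag h.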
