Find the mean of several indexed lists at index values in R - r

I have 4 predicted y values presented as an indexed list in R:
> y_a
2 12 15 19 20 22 3 4
26.05434 24.33894 38.57935 37.94003 23.87608 46.20327 18.43043 24.96521
5 8 13 21 1 7 10 11
17.34129 30.41087 28.49836 39.02917 21.96358 30.41087 23.61032 30.41087
16 18
35.31196 35.85652
> y_b
6 9 14 17 23 24 3 4
36.87726 35.30301 40.48044 38.24398 42.67726 41.31053 32.32106 33.81204
5 8 13 21 1 7 10 11
32.07257 35.05451 40.31655 44.74850 38.82558 35.05451 27.80451 35.05451
16 18
36.17274 36.29699
> y_c
6 9 14 17 23 24 2 12
30.24043 35.33617 39.18723 33.63404 42.76170 39.36809 32.25106 24.04894
15 19 20 22 1 7 10 11
39.34681 38.28298 31.01702 43.66596 33.19787 34.71915 27.60213 34.71915
16 18
37.49574 37.80426
> y_d
6 9 14 17 23 24 2 12
26.48159 35.12368 38.41591 31.00840 40.54660 36.01979 31.00840 22.70478
15 19 20 22 3 4 5 8
40.47355 32.72757 29.36229 46.23494 25.24701 30.18534 24.42395 34.30063
13 21
32.72757 33.55063
I would like to create a list that returns the average of the points that share the same index across the lists. In other words, the average of the points at index 2, at index 12, at index 15, and so on.
> y_mean
2 6 9 12....
26.05434 31.8664 ...... ......
Any ideas on how to do that?

We can collect the vectors in a list, stack each one into a two-column data.frame (values and ind), rbind the results, and then compute the mean by group
dat <- do.call(rbind,
               lapply(mget(ls(pattern = "^y_[a-z]$")), stack))
aggregate(values ~ ind, dat, FUN = mean)
Or use tapply
with(dat, tapply(values, ind, FUN = mean))
Or if there are only four vectors, just do
v1 <- c(y_a, y_b, y_c, y_d)
tapply(v1, names(v1), FUN = mean)
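As a small illustration of the last approach, here is a minimal sketch on made-up named vectors (the values below are invented for the example, not the data from the question). Note that tapply orders the result by the names as character strings, so the sketch reorders it numerically afterwards.

```r
# Hypothetical toy vectors; names play the role of the index values.
y_a <- c("2" = 26, "12" = 24, "3" = 18)
y_b <- c("6" = 36, "3" = 32)
y_c <- c("6" = 30, "2" = 32, "12" = 24)

# Concatenate and average all values that share a name.
v1 <- c(y_a, y_b, y_c)
y_mean <- tapply(v1, names(v1), FUN = mean)

# tapply sorts names alphabetically ("12" before "2"), so reorder numerically.
y_mean <- y_mean[order(as.numeric(names(y_mean)))]
y_mean
```

An index that appears in only one vector simply keeps its single value as the mean.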

Related

`rbind` elements with same index in a list

I have a list containing three lists, each of which holds 2 data.frames, as follows:
d <- list(list(data.frame(height = rnorm(10), weight = runif(10)), data.frame(nr = rnorm(10), qr = rchisq(10, 10))),
list(data.frame(height = rnorm(6), weight = runif(6)), data.frame(nr = rnorm(6), qr = rchisq(6, 10))),
list(data.frame(height = rnorm(8), weight = runif(8)), data.frame(nr = rnorm(8), qr = rchisq(8, 10))))
[[1]]
[[1]][[1]]
height weight
1 -0.49424331 0.023996582
2 -0.80320654 0.029460558
3 -0.89797434 0.932508002
4 -0.25267069 0.790625104
5 0.27474082 0.859495769
6 0.14285128 0.009731295
7 -0.86224008 0.343969165
8 0.07358127 0.106006154
9 -1.61474408 0.302890840
10 2.23920173 0.133115944
[[1]][[2]]
nr qr
1 -1.24342871 10.278033
2 1.37520549 13.246929
3 -0.06046197 6.267480
4 0.73643661 14.084240
5 -0.01897590 1.323470
6 1.10877385 11.739945
7 -1.09511298 10.616714
8 -1.03525533 9.992008
9 -0.04301281 12.943073
10 -0.79446848 8.670066
[[2]]
[[2]][[1]]
height weight
1 -2.7323741 0.8825884
2 0.4745896 0.1813869
3 0.9158570 0.9660507
4 0.8927806 0.1156805
5 -0.8443665 0.3079322
6 -0.4703602 0.2345349
[[2]][[2]]
nr qr
1 -1.6915651 9.532319
2 -1.9810859 15.145930
3 -0.4890531 10.013549
4 0.2163449 13.407265
5 1.0770555 6.676846
6 -0.5431102 7.688177
[[3]]
[[3]][[1]]
height weight
1 -1.2671410 0.48468152
2 -0.7792946 0.04499799
3 -0.6976782 0.10917336
4 0.8274744 0.69698260
5 -0.9456592 0.64183451
6 -1.2882436 0.29868696
7 0.6424889 0.86165232
8 -0.8255187 0.16430852
[[3]][[2]]
nr qr
1 -0.4160331 5.341376
2 -1.0321303 8.947948
3 -0.3380597 7.937599
4 -2.1520878 11.740298
5 -0.8979710 2.393419
6 -1.1172138 11.780884
7 -1.0309391 2.673642
8 0.8822399 12.351724
I want to transform it so that all the data.frames with (height, weight) columns are rbind-ed together and all the (nr, qr) data.frames are rbind-ed together. In other words, the first elements of the inner lists should be bound together, and likewise the second elements.
The expected output is another list containing two data.frames, as follows:
[[1]]
height weight
1 -0.49424331 0.023996582
2 -0.80320654 0.029460558
3 -0.89797434 0.932508002
4 -0.25267069 0.790625104
5 0.27474082 0.859495769
6 0.14285128 0.009731295
7 -0.86224008 0.343969165
8 0.07358127 0.106006154
9 -1.61474408 0.302890840
10 2.23920173 0.133115944
11 -2.7323741 0.8825884
12 0.4745896 0.1813869
13 0.9158570 0.9660507
14 0.8927806 0.1156805
15 -0.8443665 0.3079322
16 -0.4703602 0.2345349
17 -1.2671410 0.48468152
18 -0.7792946 0.04499799
19 -0.6976782 0.10917336
20 0.8274744 0.69698260
21 -0.9456592 0.64183451
22 -1.2882436 0.29868696
23 0.6424889 0.86165232
24 -0.8255187 0.16430852
[[2]]
nr qr
1 -1.24342871 10.278033
2 1.37520549 13.246929
3 -0.06046197 6.267480
4 0.73643661 14.084240
5 -0.01897590 1.323470
6 1.10877385 11.739945
7 -1.09511298 10.616714
8 -1.03525533 9.992008
9 -0.04301281 12.943073
10 -0.79446848 8.670066
11 -1.6915651 9.532319
12 -1.9810859 15.145930
13 -0.4890531 10.013549
14 0.2163449 13.407265
15 1.0770555 6.676846
16 -0.5431102 7.688177
17 -0.4160331 5.341376
18 -1.0321303 8.947948
19 -0.3380597 7.937599
20 -2.1520878 11.740298
21 -0.8979710 2.393419
22 -1.1172138 11.780884
23 -1.0309391 2.673642
24 0.8822399 12.351724
This should do it.
dd <- list(do.call(rbind, lapply(d, "[[", 1)), do.call(rbind, lapply(d, "[[", 2)))
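If the inner lists hold more than two data.frames, the same idea can be written once over the positions instead of spelling out each index. This is only a sketch on small made-up data, and it assumes every inner list has the same length and the same column layout at each position.

```r
# Hypothetical small input with the same shape as in the question.
d <- list(
  list(data.frame(height = 1:2, weight = 3:4), data.frame(nr = 5:6, qr = 7:8)),
  list(data.frame(height = 9L, weight = 10L), data.frame(nr = 11L, qr = 12L))
)

# For each position i, pull the i-th element of every inner list and rbind them.
dd <- lapply(seq_along(d[[1]]), function(i) {
  do.call(rbind, lapply(d, `[[`, i))
})
dd
```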

How to swap values between variables in a data frame by column name in R

I would like to swap values in the following data set by column name. For example, swap the values of beta0_C1 and beta0_C2 for rows 10 to 15, leaving the remaining values unchanged. Similarly, for rows 10 to 15, swap the values of beta1_C1 and beta1_C2, of beta2_C1 and beta2_C2, and of beta3_C1 and beta3_C2.
beta0_C1 beta1_C1 beta2_C1 beta3_C1 beta0_C2 beta1_C2 beta2_C2
1 6.010537 0.2826006 0.001931834 -0.0014162495 6.862525 -0.7267671 0.12065368
2 6.182425 0.1633226 0.025748699 -0.0028515529 6.780775 -0.6686269 0.10548767
3 6.222667 0.1109463 0.036438064 -0.0034054813 6.891512 -0.7372192 0.11895311
4 5.980246 0.3095103 -0.002670511 -0.0011975572 6.677035 -0.5774936 0.08990028
5 6.146192 0.1661733 0.024968028 -0.0027346213 6.881571 -0.7439543 0.11835484
6 6.056259 0.2374753 0.010833872 -0.0019526540 7.094971 -0.8504940 0.13648015
7 6.051281 0.2265750 0.017030676 -0.0024138722 6.829044 -0.7180662 0.12121844
8 5.911484 0.3628966 -0.014161483 -0.0005192893 6.784079 -0.6090060 0.09075940
9 5.956709 0.3486160 -0.011525364 -0.0006776760 6.934137 -0.7821656 0.12996924
10 6.010721 0.2821788 0.002475369 -0.0014508507 6.810553 -0.7140603 0.12471406
11 6.021261 0.3180654 -0.004986709 -0.0010968281 6.708342 -0.6259794 0.10697798
12 6.171459 0.2020801 0.015380862 -0.0021379484 6.592252 -0.5040888 0.07813420
13 6.103334 0.2432321 0.010022319 -0.0019386513 6.831204 -0.6854066 0.11129609
14 5.989656 0.3026038 -0.003007319 -0.0011073984 6.782081 -0.6822204 0.10769549
15 6.024628 0.2786942 0.001861784 -0.0014022176 6.864881 -0.7299905 0.12030466
16 6.023082 0.2707312 0.008308583 -0.0019947781 6.850565 -0.7136916 0.11551886
17 5.988829 0.3267394 -0.007576506 -0.0008493887 6.882956 -0.7739330 0.13467615
18 6.072949 0.2744519 0.002846329 -0.0014917373 6.886863 -0.7853582 0.13512568
19 6.030894 0.2693881 0.006378019 -0.0017875603 6.842824 -0.7238131 0.11835479
20 6.197286 0.1311579 0.036005746 -0.0035338268 6.807729 -0.6549960 0.10400631
beta3_C2
1 -0.005112708
2 -0.003982831
3 -0.004824895
4 -0.003356916
5 -0.004724677
6 -0.005657009
7 -0.005200557
8 -0.003065364
9 -0.005408715
10 -0.005551546
11 -0.004516814
12 -0.002726879
13 -0.004493288
14 -0.004053661
15 -0.004913402
16 -0.004609239
17 -0.006101912
18 -0.005945182
19 -0.004801623
20 -0.004151904
Any help is appreciated.
Given this input :
(df1 <- as.data.frame(matrix(1:12, ncol = 3)))
# V1 V2 V3
#1 1 5 9
#2 2 6 10
#3 3 7 11
#4 4 8 12
You can use rev
df1[1:2, c("V2", "V3")] <- rev(df1[1:2, c("V2", "V3")])
Result
df1
# V1 V2 V3
#1 1 9 5
#2 2 10 6
#3 3 7 11
#4 4 8 12
Written as a function of rows and cols
f <- function(data, rows, cols) {
  data[rows, cols] <- rev(data[rows, cols])
  data
}
f(df1, 1:2, c("V2", "V3"))
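With several pairs of columns, as in the question, calling rev on all eight columns at once would pair beta0_C1 with beta3_C2 rather than with beta0_C2. One possible way around that, sketched here on a made-up data frame with only two pairs, is to list the _C1 and _C2 columns in matching order and assign them crosswise. The column values and the row range are assumptions for the example.

```r
# Hypothetical data with two pairs of columns to swap.
df <- data.frame(beta0_C1 = 1:4, beta1_C1 = 11:14,
                 beta0_C2 = 21:24, beta1_C2 = 31:34)
rows <- 2:3
c1 <- c("beta0_C1", "beta1_C1")
c2 <- c("beta0_C2", "beta1_C2")

# Values are assigned by position, so (c1, c2) receives the old (c2, c1).
swapped <- df
swapped[rows, c(c1, c2)] <- df[rows, c(c2, c1)]
swapped
```

For the full data set, c1 and c2 could be built with paste0("beta", 0:3, "_C1") and paste0("beta", 0:3, "_C2"), with rows set to 10:15.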

Indexing multiple text files using R

I have to combine 5 files with the same structure and add a new variable to index the combined data frame, because all 5 files use the same IDs.
I combined them successfully, but I cannot work out how to index them. I tried a few loops, but they did not give me what I wanted.
# Combining files
path <- "D:/..."
filenames <- list.files(path)
t <- do.call("rbind", lapply(filenames, read.table, header = TRUE))
# Trying indexing with loops:
for (i in 1:length(t$ID)) {
t$ID2<-(t$ID+last(t$ID2))
}
I have 5 files, all of them with the same structure, and all of them using the same variable for identification, i.e.
file 1 would have:
ID: 1 1 1 2 2 2 3 3 3
Files 2 to 5 would have exactly the same IDs.
I would like to combine them into a single data frame so I would have this:
ID: 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1....
and then renumber them so that the IDs do not repeat across files. So I would have:
ID: 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7...
How's this? This code finds the largest ID of the first (i) data.frame and adds it to the IDs of the next (i+1) data.frame. It then records the largest ID of the (i+1) data.frame and uses that for the (i+2) data.frame.
For this to work, you will have to skip the first do.call(rbind, ...) in your code and keep the data.frames in a list until the IDs have been shifted.
xy1 <- data.frame(id = rep(1:4, each = 4), matrix(runif(4*4 * 3), ncol = 3))
xy2 <- data.frame(id = rep(1:7, each = 3), matrix(runif(3*7 * 3), ncol = 3))
xy3 <- data.frame(id = rep(1:3, each = 5), matrix(runif(3*5 * 3), ncol = 3))
xy <- list(xy1, xy2, xy3)
# First find largest ID of the first data.frame.
maxid <- max(xy[[1]]$id)
# Add previous max to current ID.
for (i in 2:length(xy)) {
  xy[[i]]$id <- maxid + xy[[i]]$id
  maxid <- max(xy[[i]]$id) # calculates largest id to be used next
}
> do.call(rbind, xy)
id X1 X2 X3
1 1 0.881397055 0.113236016 0.58935016
2 1 0.205762300 0.216630633 0.04096480
3 1 0.307112552 0.005092413 0.97769030
4 1 0.457299727 0.329346925 0.09582600
5 2 0.007010529 0.089751397 0.69746047
6 2 0.014806573 0.432586138 0.44480438
7 2 0.534909561 0.108258153 0.82475185
8 2 0.313796157 0.749077837 0.38798818
9 3 0.643547518 0.237040912 0.18304776
10 3 0.725906336 0.186099719 0.61738806
11 3 0.506767958 0.646870554 0.27792817
12 3 0.303638439 0.082478410 0.52484137
13 4 0.360623223 0.182054933 0.48604454
14 4 0.804174231 0.427352128 0.70075198
15 4 0.211255624 0.673377745 0.77251727
16 4 0.474358562 0.430095921 0.03648586
17 5 0.731251361 0.635859860 0.90235962
18 5 0.689463703 0.931878683 0.12179179
19 5 0.256770523 0.413928661 0.89254294
20 6 0.358319709 0.393714347 0.53143877
21 6 0.241538687 0.811901018 0.91577045
22 6 0.445141806 0.015133252 0.70977512
23 7 0.179662683 0.574578297 0.09957555
24 7 0.279302309 0.351412534 0.40911867
25 7 0.826039704 0.852739191 0.58671811
26 8 0.822024888 0.061122387 0.12308001
27 8 0.676081285 0.005285565 0.32040908
28 8 0.302821623 0.511678250 0.14814015
29 9 0.966690845 0.221078055 0.72651928
30 9 0.070768391 0.726477379 0.70431920
31 9 0.178425952 0.223096153 0.41111805
32 10 0.952963096 0.209673890 0.73485060
33 10 0.905570765 0.290359419 0.69499805
34 10 0.976600565 0.448144677 0.36100322
35 11 0.458720466 0.636912805 0.04170255
36 11 0.953471285 0.533102906 0.63543974
37 11 0.574490192 0.975327747 0.94730912
38 12 0.878968237 0.956726315 0.04761167
39 12 0.379196322 0.720179957 0.98719308
40 12 0.217246809 0.066895905 0.44981063
41 12 0.309354927 0.048701078 0.24654953
42 12 0.011187546 0.833095978 0.94793368
43 13 0.590529610 0.240967648 0.42954908
44 13 0.525187039 0.739698883 0.72047067
45 13 0.223469798 0.338660741 0.21820068
46 13 0.359939747 0.831732199 0.27095365
47 13 0.672778236 0.327900275 0.04854854
48 14 0.202447020 0.911963711 0.18576047
49 14 0.858830035 0.003633945 0.25713498
50 14 0.784197766 0.527018979 0.30911792
51 14 0.942135786 0.256841256 0.76965498
52 14 0.488395595 0.716133306 0.89618736
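The same running offset can also be computed without an explicit loop, since each data.frame is shifted by the sum of the maximum IDs of all data.frames before it. The following is a sketch under the assumption that IDs in every data.frame start at 1 and are numeric; offset_ids is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: shift the id column of each data.frame by the
# cumulative maximum id of the preceding data.frames.
offset_ids <- function(dfs, id_col = "id") {
  maxes <- vapply(dfs, function(d) max(d[[id_col]]), numeric(1))
  offsets <- c(0, cumsum(maxes)[-length(maxes)])
  Map(function(d, off) {
    d[[id_col]] <- d[[id_col]] + off
    d
  }, dfs, offsets)
}

# Made-up example data: three data.frames that all start their ids at 1.
xy <- list(
  data.frame(id = c(1, 1, 2, 2), x = 1:4),
  data.frame(id = c(1, 2, 3), x = 5:7),
  data.frame(id = c(1, 1), x = 8:9)
)
combined <- do.call(rbind, offset_ids(xy))
combined$id
```

When reading from disk, the list of data.frames could come from lapply over list.files(path, full.names = TRUE) before being passed to the helper.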

How to use apply function instead of for loop if you have multiple if conditions to be executed

1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
>t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
Output to be:
From DF "t.d" I have to calculate the time interval for each row between "FinalTime" and "InitTime" (InitTime will always be less than FinalTime).
Another DF "temp" from "s.d" has to be formed having data only within the above time interval, and then the most recent values of "F","D","E" have to be taken and attached to the 'ith' row of "t.d" from which the time interval was calculated.
Also we have to see if the newly formed DF "temp" has the following conditions true:
here 'j' represents value for each row:
if ((temp$F[j] < 35.5) + (temp$D[j] >= 100) >= 1) {
temp$Flag[j] <- 1
} else {
temp$Flag[j] <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem using "for loop" but it obviously takes 2 to 3 days as there are a lot of rows.
(Also, what if I have to add new columns to the resulting DF when multiple conditions are satisfied on a row?)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
use lapply over the row indices
handle your if branches inside the function passed to lapply
return either your data frame or NULL
combine everything with rbind
If you replace lapply with mclapply from the 'parallel' package, your code is executed in parallel.
resultList <- lapply(seq_len(nrow(t.d)), function(i) {
  # do stuff for row i of t.d and build `df` (placeholder)
  if (condition) {
    return(df)
  } else {
    return(NULL)
  }
})
resultDF <- do.call(rbind, resultList)
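For the flag itself, a row-by-row if is probably not needed. The condition from the question, (F < 35.5) + (D >= 100) >= 1, is true exactly when at least one of the two comparisons holds, so it can be written as a vectorised logical OR over whole columns. The sketch below uses a small made-up temp data frame; the values are assumptions for illustration.

```r
# Hypothetical stand-in for the filtered data frame `temp`.
temp <- data.frame(F = c(10, 40, 36), D = c(19, 120, 50))

# Vectorised flag: 1 where F < 35.5 or D >= 100, otherwise 0.
temp$Flag <- as.integer(temp$F < 35.5 | temp$D >= 100)
temp
```

Further columns that depend on other conditions can be added the same way, one vectorised expression per column.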

Selecting top finite number of rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, c. There are multiple rows for each unique value of column a. I want to select the top 5 rows for each unique value of column a. Column c holds some value and the data frame is already sorted by it in descending order, so ordering is not a problem. Can anyone please suggest how I can do this in R?
Borrowing @ptocquin's example, here's how you can use the base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element subsetting first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try :
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
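Another base R option, shown here only as a sketch on made-up data, is to number the rows within each group with ave and keep those whose within-group position is at most n. It assumes, as in the question, that the data frame is already sorted so that the wanted rows come first within each group.

```r
# Hypothetical data, already sorted by c in descending order.
data <- data.frame(a = c(1, 1, 1, 2, 2, 1, 2, 2),
                   c = c(9, 8, 7, 6, 5, 4, 3, 2))
n <- 2  # number of rows to keep per group (5 in the question)

# Position of each row within its group of `a`, in the current row order.
rank_in_group <- ave(seq_along(data$a), data$a, FUN = seq_along)

# Keep the first n rows of every group; the original row order is preserved.
top_n <- data[rank_in_group <= n, ]
top_n
```

Unlike the split and rbind approach, this keeps the original row names and does not regroup the rows by a.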

Resources