Selecting the top N rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, c. There are multiple rows for each unique value of column a, and I want to select the top 5 rows for each unique value of column a. Column c holds a numeric value and the data frame is already sorted by it in descending order, so that is not a problem. Can anyone please suggest how I can do this in R?

Stealing @ptocquin's example, here's how you can use the base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
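If you want the flattened data.frame straight from by, you can wrap it in do.call, as hinted above (a small sketch of the same idea):
do.call("rbind", by(data = data, INDICES = data$a, FUN = head, 5))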
A similar thing can be achieved by splitting your data.frame on a and then using lapply to step through each element, subsetting the first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668

Here is my try:
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
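An alternative not used in the answers above is dplyr; assuming it is installed, a minimal sketch of the same grouped head looks like this:
library(dplyr)
result <- data %>%
  group_by(a) %>%
  slice_head(n = 5) %>%  # first 5 rows per value of a; the data are already sorted by c
  ungroup()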

Related

Find the mean of several indexed lists at index values in R

I have 4 sets of predicted y values, each presented as an indexed list in R:
> y_a
2 12 15 19 20 22 3 4
26.05434 24.33894 38.57935 37.94003 23.87608 46.20327 18.43043 24.96521
5 8 13 21 1 7 10 11
17.34129 30.41087 28.49836 39.02917 21.96358 30.41087 23.61032 30.41087
16 18
35.31196 35.85652
> y_b
6 9 14 17 23 24 3 4
36.87726 35.30301 40.48044 38.24398 42.67726 41.31053 32.32106 33.81204
5 8 13 21 1 7 10 11
32.07257 35.05451 40.31655 44.74850 38.82558 35.05451 27.80451 35.05451
16 18
36.17274 36.29699
> y_c
6 9 14 17 23 24 2 12
30.24043 35.33617 39.18723 33.63404 42.76170 39.36809 32.25106 24.04894
15 19 20 22 1 7 10 11
39.34681 38.28298 31.01702 43.66596 33.19787 34.71915 27.60213 34.71915
16 18
37.49574 37.80426
> y_d
6 9 14 17 23 24 2 12
26.48159 35.12368 38.41591 31.00840 40.54660 36.01979 31.00840 22.70478
15 19 20 22 3 4 5 8
40.47355 32.72757 29.36229 46.23494 25.24701 30.18534 24.42395 34.30063
13 21
32.72757 33.55063
I would like to create a list that returns the average of the points in each list at the same index. In other words, the average of the points at index 2, index 12, index 15, etc.
> y_mean
2 6 9 12....
26.05434 31.8664 ...... ......
Any ideas on how to do that?
We may gather the elements into a list, stack each into a two-column data.frame, rbind them, and do a group-by mean:
dat <- do.call(rbind,
               lapply(mget(ls(pattern = "^y_[a-z]$")), stack))
aggregate(values ~ ind, dat, FUN = mean)
Or use tapply
with(dat, tapply(values, ind, FUN = mean))
Or if there are only four vectors, just do
v1 <- c(y_a, y_b, y_c, y_d)
tapply(v1, names(v1), FUN = mean)
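A quick self-contained check with two made-up vectors (not the questioner's data), just to show what the names-based grouping does:
p <- c(`2` = 26.05, `6` = 30.24)
q <- c(`2` = 32.25, `6` = 36.88)
v <- c(p, q)  # names are kept, so each index appears twice
tapply(v, names(v), FUN = mean)
#     2     6
# 29.15 33.56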

How to swap values between variables in a data frame by column name in R

I would like to swap values in the following data set by column name. For example, swap the values of beta0_C1 and beta0_C2 for rows 10 to 15, leaving the remaining values unchanged. Similarly, for rows 10 to 15, swap the values of beta1_C1 and beta1_C2, of beta2_C1 and beta2_C2, and of beta3_C1 and beta3_C2.
beta0_C1 beta1_C1 beta2_C1 beta3_C1 beta0_C2 beta1_C2 beta2_C2
1 6.010537 0.2826006 0.001931834 -0.0014162495 6.862525 -0.7267671 0.12065368
2 6.182425 0.1633226 0.025748699 -0.0028515529 6.780775 -0.6686269 0.10548767
3 6.222667 0.1109463 0.036438064 -0.0034054813 6.891512 -0.7372192 0.11895311
4 5.980246 0.3095103 -0.002670511 -0.0011975572 6.677035 -0.5774936 0.08990028
5 6.146192 0.1661733 0.024968028 -0.0027346213 6.881571 -0.7439543 0.11835484
6 6.056259 0.2374753 0.010833872 -0.0019526540 7.094971 -0.8504940 0.13648015
7 6.051281 0.2265750 0.017030676 -0.0024138722 6.829044 -0.7180662 0.12121844
8 5.911484 0.3628966 -0.014161483 -0.0005192893 6.784079 -0.6090060 0.09075940
9 5.956709 0.3486160 -0.011525364 -0.0006776760 6.934137 -0.7821656 0.12996924
10 6.010721 0.2821788 0.002475369 -0.0014508507 6.810553 -0.7140603 0.12471406
11 6.021261 0.3180654 -0.004986709 -0.0010968281 6.708342 -0.6259794 0.10697798
12 6.171459 0.2020801 0.015380862 -0.0021379484 6.592252 -0.5040888 0.07813420
13 6.103334 0.2432321 0.010022319 -0.0019386513 6.831204 -0.6854066 0.11129609
14 5.989656 0.3026038 -0.003007319 -0.0011073984 6.782081 -0.6822204 0.10769549
15 6.024628 0.2786942 0.001861784 -0.0014022176 6.864881 -0.7299905 0.12030466
16 6.023082 0.2707312 0.008308583 -0.0019947781 6.850565 -0.7136916 0.11551886
17 5.988829 0.3267394 -0.007576506 -0.0008493887 6.882956 -0.7739330 0.13467615
18 6.072949 0.2744519 0.002846329 -0.0014917373 6.886863 -0.7853582 0.13512568
19 6.030894 0.2693881 0.006378019 -0.0017875603 6.842824 -0.7238131 0.11835479
20 6.197286 0.1311579 0.036005746 -0.0035338268 6.807729 -0.6549960 0.10400631
beta3_C2
1 -0.005112708
2 -0.003982831
3 -0.004824895
4 -0.003356916
5 -0.004724677
6 -0.005657009
7 -0.005200557
8 -0.003065364
9 -0.005408715
10 -0.005551546
11 -0.004516814
12 -0.002726879
13 -0.004493288
14 -0.004053661
15 -0.004913402
16 -0.004609239
17 -0.006101912
18 -0.005945182
19 -0.004801623
20 -0.004151904
Any help is appreciated.
Given this input:
(df1 <- as.data.frame(matrix(1:12, ncol = 3)))
# V1 V2 V3
#1 1 5 9
#2 2 6 10
#3 3 7 11
#4 4 8 12
You can use rev:
df1[1:2, c("V2", "V3")] <- rev(df1[1:2, c("V2", "V3")])
Result
df1
# V1 V2 V3
#1 1 9 5
#2 2 10 6
#3 3 7 11
#4 4 8 12
Written as a function of rows and cols:
f <- function(data, rows, cols) {
  data[rows, cols] <- rev(data[rows, cols])
  data
}
f(df1, 1:2, c("V2", "V3"))
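As a usage sketch against the questioner's data (assuming that data frame is called df, which the question does not name), swapping each beta*_C1 / beta*_C2 pair for rows 10 to 15 could look like:
cols_c1 <- paste0("beta", 0:3, "_C1")  # beta0_C1 ... beta3_C1
cols_c2 <- paste0("beta", 0:3, "_C2")  # beta0_C2 ... beta3_C2
for (k in seq_along(cols_c1)) {
  df <- f(df, 10:15, c(cols_c1[k], cols_c2[k]))
}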

How can this R code be sped up with the apply (lapply, mapply, etc.) functions?

I am not too proficient with the apply functions, or with R in general, but I know I overuse for loops, which makes my code slow. How can the following code be sped up with apply functions, or in any other way?
sum_store = NULL
for (col in 1:ncol(cazy_fams)) { # for each column in cazy_fams (so for each master family, e.g. GH, AA, etc.)
  for (row in 1:nrow(cazy_fams)) { # for each row in cazy_fams (so the specific family number, e.g. GH1, AA7, etc.)
    # Isolate the row pertaining to the current cazy family for every dataframe in the list
    filt_fam = lapply(family_summary, function(sample) {
      sample[as.character(sample$Family) %in% paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = ""), ]
    })
    row_cat = do.call(rbind, filt_fam) # concatenating the lapply list output into a dataframe
    if (nrow(row_cat) > 0) {
      fam_sum = aggregate(proteins ~ Family, data = row_cat, FUN = sum) # collapsing the dataframe into one row and summing the protein counts
      sum_store = rbind(sum_store, fam_sum) # storing the results for that family
    } else if (grepl("NA", paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")) == FALSE) {
      Family = paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")
      proteins = 0
      sum_store = rbind(sum_store, data.frame(Family, proteins))
    } else {
      next
    }
  }
}
family_summary is just a list of 18 two-column data frames that look like this:
Family proteins
CE0 2
CE1 9
CE4 15
CE7 1
CE9 1
CE14 10
GH0 5
GH1 1
GH3 4
GH4 1
GH8 1
GH9 2
GH13 2
GH15 5
GH17 1
with different cazy families.
cazy_fams is just a data frame with each column being a cazy class (e.g. GH, AA, etc.) and each row being a family number, all taken from the linked website:
GH GT PL CE AA CBM
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
The reason behind the else if (grepl("NA", paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")) == FALSE) statement is to deal with the fact that not all classes have the same number of families, so when looping over my data frame I end up with some GHNA and AANA with NA on the end.
The output sum_store is this:
Family proteins
GH1 54
GH2 51
GH3 125
GH4 29
GH5 40
GH6 25
GH7 0
GH8 16
GH9 25
GH10 19
GH11 5
GH12 5
GH13 164
GH14 3
GH15 61
A data frame with all listed cazy families and the total number of appearances across the family_summary list.
Please let me know if you need anything else to help answer my question.
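For reference, a minimal sketch of one way to replace the nested loops, assuming the structures shown above (untested against the real data): build every expected family name once, stack family_summary once, aggregate, and fill in families that never appear with 0.
# every expected family name (e.g. "GH1", "AA7"), dropping the NA padding
all_fams <- unlist(lapply(names(cazy_fams), function(cls)
  paste0(cls, na.omit(cazy_fams[[cls]]))))
# stack the 18 data frames once and sum proteins per family
combined <- do.call(rbind, family_summary)
counts <- aggregate(proteins ~ Family, data = combined, FUN = sum)
# keep every expected family, filling families never observed with 0
sum_store <- merge(data.frame(Family = all_fams), counts, by = "Family", all.x = TRUE)
sum_store$proteins[is.na(sum_store$proteins)] <- 0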

Indexing multiple text files using R

I have to combine 5 files with the same structure and add a new variable to index the new data frame, but all 5 files use the same IDs.
I successfully combined them, but I cannot work out how to index them. I have tried a few loops, but they were not giving me what I wanted.
# Combining files
path <- "D:/..."
filenames <- list.files(path)
t <- do.call("rbind", lapply(filenames, read.table, header = TRUE))
# Trying indexing with loops:
for (i in 1:length(t$ID)) {
  t$ID2 <- t$ID + last(t$ID2)
}
I have 5 files, all of them with the same structure, and all of them using the same variable for identification, i.e.
file 1 would have:
ID: 1 1 1 2 2 2 3 3 3
And files 2 to 5 would have exactly the same IDs:
I would like to combine them into a single data frame so I would have this:
ID: 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1....
and then name them differently. So I would have:
ID: 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7...
How's this? This code finds the largest ID of the first data.frame (i) and adds it to the IDs of the next data.frame (i+1). It then records the new largest ID and uses that for the following data.frame (i+2).
For this to work, you will have to forgo the first do.call(rbind, ...) in your code.
xy1 <- data.frame(id = rep(1:4, each = 4), matrix(runif(4*4 * 3), ncol = 3))
xy2 <- data.frame(id = rep(1:7, each = 3), matrix(runif(3*7 * 3), ncol = 3))
xy3 <- data.frame(id = rep(1:3, each = 5), matrix(runif(3*5 * 3), ncol = 3))
xy <- list(xy1, xy2, xy3)
# First find the largest ID of the first data.frame.
maxid <- max(xy[[1]]$id)
# Add the previous max to the current IDs.
for (i in 2:length(xy)) {
  xy[[i]]$id <- maxid + xy[[i]]$id
  maxid <- max(xy[[i]]$id) # calculates the largest id, to be used for the next data.frame
}
> do.call(rbind, xy)
id X1 X2 X3
1 1 0.881397055 0.113236016 0.58935016
2 1 0.205762300 0.216630633 0.04096480
3 1 0.307112552 0.005092413 0.97769030
4 1 0.457299727 0.329346925 0.09582600
5 2 0.007010529 0.089751397 0.69746047
6 2 0.014806573 0.432586138 0.44480438
7 2 0.534909561 0.108258153 0.82475185
8 2 0.313796157 0.749077837 0.38798818
9 3 0.643547518 0.237040912 0.18304776
10 3 0.725906336 0.186099719 0.61738806
11 3 0.506767958 0.646870554 0.27792817
12 3 0.303638439 0.082478410 0.52484137
13 4 0.360623223 0.182054933 0.48604454
14 4 0.804174231 0.427352128 0.70075198
15 4 0.211255624 0.673377745 0.77251727
16 4 0.474358562 0.430095921 0.03648586
17 5 0.731251361 0.635859860 0.90235962
18 5 0.689463703 0.931878683 0.12179179
19 5 0.256770523 0.413928661 0.89254294
20 6 0.358319709 0.393714347 0.53143877
21 6 0.241538687 0.811901018 0.91577045
22 6 0.445141806 0.015133252 0.70977512
23 7 0.179662683 0.574578297 0.09957555
24 7 0.279302309 0.351412534 0.40911867
25 7 0.826039704 0.852739191 0.58671811
26 8 0.822024888 0.061122387 0.12308001
27 8 0.676081285 0.005285565 0.32040908
28 8 0.302821623 0.511678250 0.14814015
29 9 0.966690845 0.221078055 0.72651928
30 9 0.070768391 0.726477379 0.70431920
31 9 0.178425952 0.223096153 0.41111805
32 10 0.952963096 0.209673890 0.73485060
33 10 0.905570765 0.290359419 0.69499805
34 10 0.976600565 0.448144677 0.36100322
35 11 0.458720466 0.636912805 0.04170255
36 11 0.953471285 0.533102906 0.63543974
37 11 0.574490192 0.975327747 0.94730912
38 12 0.878968237 0.956726315 0.04761167
39 12 0.379196322 0.720179957 0.98719308
40 12 0.217246809 0.066895905 0.44981063
41 12 0.309354927 0.048701078 0.24654953
42 12 0.011187546 0.833095978 0.94793368
43 13 0.590529610 0.240967648 0.42954908
44 13 0.525187039 0.739698883 0.72047067
45 13 0.223469798 0.338660741 0.21820068
46 13 0.359939747 0.831732199 0.27095365
47 13 0.672778236 0.327900275 0.04854854
48 14 0.202447020 0.911963711 0.18576047
49 14 0.858830035 0.003633945 0.25713498
50 14 0.784197766 0.527018979 0.30911792
51 14 0.942135786 0.256841256 0.76965498
52 14 0.488395595 0.716133306 0.89618736
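A sketch tying this back to the file-reading step in the question (assuming the same path and filenames objects, and an ID column that restarts at 1 in every file):
xy <- lapply(file.path(path, filenames), read.table, header = TRUE)
maxid <- 0
for (i in seq_along(xy)) {
  xy[[i]]$ID <- maxid + xy[[i]]$ID  # shift this file's IDs past the previous files'
  maxid <- max(xy[[i]]$ID)
}
t <- do.call(rbind, xy)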

R is not ordering data correctly - skips E values

I am trying to order data by the column weightFisher. However, it is almost as if R does not treat the e-notation values as small numbers, because all of them are skipped when I try to order from smallest to greatest.
Code:
resultTable_bon <- GenTable(GOdata_bon,
                            weightFisher = resultFisher_bon,
                            weightKS = resultKS_bon,
                            topNodes = 15136,
                            ranksOf = 'weightFisher')
head(resultTable_bon)
#create Fisher ordered df
indF <- order(resultTable_bon$weightFisher)
resultTable_bonF <- resultTable_bon[indF, ]
what resultTable_bon looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
1 GO:0019373 epoxygenase P450 pathway 19 13 1.12 1
2 GO:0097267 omega-hydroxylase P450 pathway 9 7 0.53 2
3 GO:0042738 exogenous drug catabolic process 10 7 0.59 3
weightFisher weightKS
1 1.9e-12 0.79744
2 7.9e-08 0.96752
3 2.5e-07 0.96336
what "ordered" resultTable_bonF looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
17 GO:0014075 response to amine 33 7 1.95 17
18 GO:0034372 very-low-density lipoprotein particle re... 11 5 0.65 18
19 GO:0060710 chorio-allantoic fusion 6 4 0.35 19
weightFisher weightKS
17 0.00014 0.96387
18 0.00016 0.83624
19 0.00016 0.92286
As @bhas says, it appears to be working precisely as you want it to. Maybe it's the use of head() that's confusing you?
To put your mind at ease, try it with something simpler:
dtf <- data.frame(a=c(1, 8, 6, 2)^-10, b=c(7, 2, 1, 6))
dtf
# a b
# 1 1.000000e+00 7
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
dtf[order(dtf$a), ]
# a b
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
# 1 1.000000e+00 7
Try the following:
resultTable_bon$weightFisher <- as.numeric(resultTable_bon$weightFisher)
Then:
resultTable_bonF <- resultTable_bon[order(resultTable_bon$weightFisher), ]
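A minimal sketch of the pitfall this addresses, assuming weightFisher came back as character (which is why as.numeric() helps):
x <- c("1.9e-12", "0.00016", "7.9e-08")
sort(x)             # character sort: "0.00016" "1.9e-12" "7.9e-08"
sort(as.numeric(x)) # numeric sort:   1.9e-12  7.9e-08  1.6e-04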
