How to sum rows by group in a big dataframe? - r

My data (crsp.daily) look roughly like this (the numbers are made up and there are more variables):
PERMCO PERMNO date price VOL SHROUT
103 201 19951006 8.8 100 823
103 203 19951006 7.9 200 1002
1004 10 19951006 5 277 398
2 5 19951110 5.3 NA 579
1003 2 19970303 10 67 NA
1003 1 19970303 11 77 1569
1003 20 19970401 6.7 NA NA
I want to sum VOL and SHROUT by groups defined by PERMCO and date, but leaving the original number of rows unchanged, thus my desired output is the following:
PERMCO PERMNO date price VOL SHROUT VOL.sum SHROUT.sum
103 201 19951006 8.8 100 823 300 1825
103 203 19951006 7.9 200 1002 300 1825
1004 10 19951006 5 277 398 277 398
2 5 19951110 5.3 NA 579 NA 579
1003 2 19970303 10 67 NA 144 1569
1003 1 19970303 11 77 1569 144 1569
1003 20 19970401 6.7 NA NA NA NA
My data have more than 45 million observations and 8 columns. I have tried using ave:
crsp.daily$VOL.sum=ave(crsp.daily$VOL,c("PERMCO","date"),FUN=sum)
or sapply:
crsp.daily$VOL.sum=sapply(crsp.daily[,"VOL"],ave,crsp.daily$PERMCO,crsp.daily$date)
The problem is that it takes an extremely long time (more than 30 minutes, and I still did not see the result). Another thing that I tried was to create a variable called "group" by pasting PERMCO and date like this:
crsp.daily$group=paste0(crsp.daily$PERMCO,crsp.daily$date)
and then apply ave using crsp.daily$group as the grouping variable. This also did not work: from a certain observation on, R no longer distinguished the different values of crsp.daily$group and treated them as a single group.
The approach of creating the "group" variable did work on a smaller dataset.
Any advice is greatly appreciated!

With data.table you could use the following code:
require(data.table)
dt <- as.data.table(crsp.daily)
dt[, VOL.sum := sum(VOL), by = list(PERMCO, date)]
With the command := you create a new variable (VOL.sum) containing the sums grouped by PERMCO and date.
Output
PERMCO PERMNO date price VOL SHROUT VOL.sum
1 103 201 19951006 8.8 100 823 300
2 103 203 19951006 7.9 200 1002 300
3 1004 10 19951006 5.0 277 398 277
4 2 5 19951110 5.3 NA 579 NA
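For reference, the base-R ave() attempt fails because the grouping vectors must be passed as separate arguments rather than as a character vector of names, e.g. ave(crsp.daily$VOL, crsp.daily$PERMCO, crsp.daily$date, FUN = sum). Below is a sketch extending the data.table answer to both requested columns; the NA handling is an assumption (with na.rm = TRUE an all-NA group sums to 0, not NA, which differs slightly from the desired output above):
# sketch: sum both columns per group in one pass
dt[, `:=`(VOL.sum = sum(VOL, na.rm = TRUE),
          SHROUT.sum = sum(SHROUT, na.rm = TRUE)),
   by = .(PERMCO, date)]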

Pivot/Reshape data in R

Thank you all for your answers; I thought I was smarter than I am and hoped I would understand more of them. I think I messed up the visualisation of my data as well. I have edited my post to better show my sample data. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about wide-to-long format, but it doesn't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell that already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version. (Note: dw here appears to be the simplified example data frame with columns PID, T1, measurement1, ..., shown further below, which is why this output differs from the question's data.)
library(tidyr)
library(dplyr)
dw %>%
  pivot_longer(cols = -PID, names_to = ".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from the column names based on the names_pattern argument. The names_pattern argument takes a regular expression with capture groups. In this case, here is the breakdown:
(.+)  # match everything - whatever this group captures becomes the ".value" column name
[0-9] # one numeric character - tells the pattern that the trailing number
      # is excluded from ".value". For multi-digit numbers, make the first
      # group lazy and quantify the digits, e.g. "(.+?)[0-9]+"
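For instance, a hedged sketch applied to the question's data frame (the lazy (.+?) plus [0-9]+ keeps multi-digit suffixes such as measurement10 out of the captured name):
library(tidyr)
# a sketch, assuming the question's `data` as defined above
data %>%
  pivot_longer(cols = -pid,
               names_to = ".value",
               names_pattern = "(.+?)[0-9]+$")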
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
      stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
Which keeps measurements and Tdays neatly together but leaves us without pid which we can add using rep to replicate the original pid 4 times:
result <- cbind(pid = rep(data$pid, 4),
                stack(data, c(measurement1, measurement2, measurement3, measurement4)),
                stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not yet the order you expected; you can sort the resulting data.frame if that is a concern:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
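If the filler rows where both Time and Value are NA (like row 37 above) are unwanted, one possible cleanup:
result <- result[!(is.na(result$Time) & is.na(result$Value)), ]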
A tidyverse solution:
library(tidyverse)
dw %>%
  pivot_longer(-PID) %>%
  mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name)) %>%
  separate(name, into = c('A', 'B'), sep = '_', convert = TRUE) %>%
  pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
  tobind = df[, c(1, j, j+1)]
  names(tobind) = names(finalDf)  # rbind() requires matching column names
  finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try base R reshape like below. The gsub() call renames measurement1 to measurement.1 (and so on), the name.time format that reshape can split automatically through its varying argument:
reshape(
  setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
  direction = "long",
  varying = 2:ncol(data)
)

R impute with Kalman on large data

I have a large dataset: 4,666,972 obs. of 5 variables.
I want to impute one column, MPR, with the Kalman method within each group.
> str(dt)
Classes ‘data.table’ and 'data.frame': 4666972 obs. of 5 variables:
$ Year : int 1999 2000 2001 1999 2000 2001 1999 2000 2001 1999 ...
$ State: int 1 1 1 1 1 1 1 1 1 1 ...
$ CC : int 1 1 1 1 1 1 1 1 1 1 ...
$ ID : chr "1" "1" "1" "2" ...
$ MPR : num 54 54 55 52 52 53 60 60 65 70 ...
I tried the code below but it crashed after a while.
> library(imputeTS)
> data.table::setDT(dt)[, MPR_kalman := with(dt, ave(MPR, State, CC, ID, FUN=na_kalman))]
I don't know how to improve the time efficiency and run the imputation without crashing.
Would it be better to split the dataset by ID into a list and impute each element in a for loop?
> length(unique(hpms_S3$Section_ID))
[1] 668184
> split(dt, dt$ID)
However, I doubt this will save much memory or avoid the crash: splitting the dataset into 668,184 list elements means running the imputation many times and then combining everything back into one dataset at the end.
Is there a better way to do this, or how can I optimize my code?
I provide the simple sample here:
# dt
Year State CC ID MPR
2002 15 3 3 NA
2003 15 3 3 NA
2004 15 3 3 193
2005 15 3 3 193
2006 15 3 3 348
2007 15 3 3 388
2008 15 3 3 388
1999 53 33 1 NA
2000 53 33 1 NA
2002 53 33 1 NA
2003 53 33 1 NA
2004 53 33 1 NA
2005 53 33 1 170
2006 53 33 1 170
2007 53 33 1 330
2008 53 33 1 330
EDIT:
As @r2evans mentioned in a comment, I modified the code:
> setDT(dt)[, MPR_kalman := ave(MPR, State, CC, ID, FUN=na_kalman), by = .(State, CC, ID)]
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
I got the error above. I found the post here with discussions of this error. However, even when I use na_kalman(MPR, type = 'level'), I still get the error. I think there might be repeated values within groups, which produces the error.
Splitting should perhaps be done using data.table's by= operator, which is likely more efficient.
Since I don't have imputeTS installed (it has several nested dependencies I don't have), I'll fake the imputation using zoo::na.locf, both forward and backward. I'm not suggesting this be your imputation mechanism; I'm using it to demonstrate a more common pattern with data.table.
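# forward-fill, then backward-fill leading NAs (a stand-in for real imputation)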
myimpute <- function(z) zoo::na.locf(zoo::na.locf(z, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
Now some equivalent calls: one with your with(dt, ...) and my alternatives (which are really a walk-through leading to my ultimate suggestion, number 5):
dt[, MPR_kalman1 := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman2 := with(.SD, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman3 := with(.SD, ave(MPR, FUN = myimpute)), by = .(State, CC, ID)]
dt[, MPR_kalman4 := ave(MPR, FUN = myimpute), by = .(State, CC, ID)]
dt[, MPR_kalman5 := myimpute(MPR), by = .(State, CC, ID)]
# Year State CC ID MPR MPR_kalman1 MPR_kalman2 MPR_kalman3 MPR_kalman4 MPR_kalman5
# 1: 2002 15 3 3 NA 193 193 193 193 193
# 2: 2003 15 3 3 NA 193 193 193 193 193
# 3: 2004 15 3 3 193 193 193 193 193 193
# 4: 2005 15 3 3 193 193 193 193 193 193
# 5: 2006 15 3 3 348 348 348 348 348 348
# 6: 2007 15 3 3 388 388 388 388 388 388
# 7: 2008 15 3 3 388 388 388 388 388 388
# 8: 1999 53 33 1 NA 170 170 170 170 170
# 9: 2000 53 33 1 NA 170 170 170 170 170
# 10: 2002 53 33 1 NA 170 170 170 170 170
# 11: 2003 53 33 1 NA 170 170 170 170 170
# 12: 2004 53 33 1 NA 170 170 170 170 170
# 13: 2005 53 33 1 170 170 170 170 170 170
# 14: 2006 53 33 1 170 170 170 170 170 170
# 15: 2007 53 33 1 330 330 330 330 330 330
# 16: 2008 53 33 1 330 330 330 330 330 330
The two methods produce the same results, but the latter preserves many of the memory-efficiencies that can make data.table preferred.
The use of with(dt, ...) is an anti-pattern in one case and a strong risk in another. For the "risk" part, realize that data.table can do a lot of things behind the scenes so that the calculations/function calls within the j= component (second argument) only see data that is relevant. A clear example is grouping; another (unrelated to this) data.table example is conditional replacement, as in dt[is.na(x), x := -1]. With a reference to the entire table dt inside j=, if there is ever something in the first argument (a condition) or a by= argument, then it fails.
MPR_kalman2 mitigates this by using .SD, which is data.table's way of replacing the data-to-be-used with the "Subset of the Data" (ref). But it's still not taking advantage of data.table's significant efficiencies in dealing in-memory with groups.
MPR_kalman3 works on this by grouping outside, still using with (as in 2) but in a friendlier way.
MPR_kalman4 removes the use of with, since really the MPR visible to ave is only within each group anyway. And then when you think about it, since ave is given no grouping variables, it really just passes all of the MPR data straight-through to myimpute. From this, we have MPR_kalman5, a direct method that is along the normal patterns of data.table.
While I don't know that it will mitigate your crashing, it is intended very much to be memory-efficient (in data.table's ways).
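As a hedged sketch along the lines of pattern 5 with the real imputeTS function: na_kalman() can fail on degenerate groups (e.g., constant series or very few observations), which is consistent with the L-BFGS-B error quoted above, so the hypothetical wrapper safe_kalman() below skips such groups and falls back to the original values if the fit still errors:
library(data.table)
library(imputeTS)
# hypothetical wrapper: skip groups with fewer than 3 observed values or a
# single unique value; fall back to the input if na_kalman() still errors
safe_kalman <- function(x) {
  if (sum(!is.na(x)) < 3 || length(unique(na.omit(x))) <= 1) return(x)
  tryCatch(na_kalman(x), error = function(e) x)
}
setDT(dt)[, MPR_kalman := safe_kalman(MPR), by = .(State, CC, ID)]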

Challenge in data manipulation

I am trying to process a database of BLAST output to generate a data frame containing values for a given gene and given sample. When a gene is identified within a sample I would like the scaffold on which it was identified to be reported. If a given gene is NOT identified within a given sample I would like the cell to be filled with N/A.
sample_name scaffold gene_title match_(%)
P24_ST48 64 aadA12 94.56
401B_ST5223 381 blaTEM-163 99.65
P32_ST218 91 aadA24 90.41
HOS66_ST73 9 blaACT-5 72.31
HOS16_ST38 70 blaTEM-146 99.42
HOS56_ST131 48 aadA21 91.39
Ecoli_2009_1_ST131 41 sul1 99.88
PH152_ST95 37 dfrA33 83.94
Ecoli_2009_32_STNT 16 aac(3)-Ib 100.00
PH231_ST38 59 mph(D) 89.83
P44_STNT 135 blaTEM-105 99.88
Ecoli_2011_89_ST127 29 blaTEM-158 99.65
405C_ST1178 120 aadA1 99.75
P3_STNT 15 blaTEM-68 99.19
5A_ST34 174 blaTEM-127 99.88
P27_ST10 211 aph(3')-Ia 100.00
4D_ST767 393 blaTEM-152 98.95
P10_STNT 23 blaTEM-17 99.07
Ecoli_2014_27_ST131 49 sul2_15 99.88
Ecoli_2013_10_ST73 23 blaTEM-2 99.19
The output table would look something like:
Sample aadA1 aadA12 aadA24 blaTEM-163 ...
P24_ST48 N/A 64 N/A N/A
401B_ST5223 N/A N/A N/A 381
...
In Excel I have concatenated the sample name and gene title and reported the scaffold number on the row where this string is identified using VLOOKUP. I have tried many different ways in R and am going around in circles.
Now, trying to process 700+ genes and 450+ samples, the list of gene-sample combinations is getting laborious for Excel to manage, and I must find another solution as my collection of samples grows increasingly large.
Any help would be greatly appreciated.
Cheers,
Max
Here's how to do that with spread from tidyr:
library(tidyr)
df1 %>%
  spread(key = gene_title, value = scaffold)
sample_name match_... aac(3)-Ib aadA1 aadA12 ...
1 401B_ST5223 99.65 NA NA NA
2 405C_ST1178 99.75 NA 120 NA
3 4D_ST767 98.95 NA NA NA
4 5A_ST34 99.88 NA NA NA
5 Ecoli_2009_1_ST131 99.88 NA NA NA
...
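Note that spread() has been superseded in tidyr; here is a sketch of the equivalent call with its replacement, pivot_wider() (assuming the df1 defined under Data below):
library(tidyr)
# id_cols keeps one row per sample; gene titles become columns
df1 %>%
  pivot_wider(id_cols = sample_name,
              names_from = gene_title,
              values_from = scaffold)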
Data
df1 <- read.table(text="sample_name scaffold gene_title match_(%)
P24_ST48 64 aadA12 94.56
401B_ST5223 381 blaTEM-163 99.65
P32_ST218 91 aadA24 90.41
HOS66_ST73 9 blaACT-5 72.31
HOS16_ST38 70 blaTEM-146 99.42
HOS56_ST131 48 aadA21 91.39
Ecoli_2009_1_ST131 41 sul1 99.88
PH152_ST95 37 dfrA33 83.94
Ecoli_2009_32_STNT 16 aac(3)-Ib 100.00
PH231_ST38 59 mph(D) 89.83
P44_STNT 135 blaTEM-105 99.88
Ecoli_2011_89_ST127 29 blaTEM-158 99.65
405C_ST1178 120 aadA1 99.75
P3_STNT 15 blaTEM-68 99.19
5A_ST34 174 blaTEM-127 99.88
P27_ST10 211 aph(3')-Ia 100.00
4D_ST767 393 blaTEM-152 98.95
P10_STNT 23 blaTEM-17 99.07
Ecoli_2014_27_ST131 49 sul2_15 99.88
Ecoli_2013_10_ST73 23 blaTEM-2 99.19",
header=TRUE,stringsAsFactors=FALSE)
We can use dcast from data.table
library(data.table)
dcast(setDT(df1), sample_name + match_... ~ gene_title, value.var = 'scaffold')
# sample_name match_... aac(3)-Ib aadA1 aadA12 ...
#1: 401B_ST5223 99.65 NA NA NA
#2: 405C_ST1178 99.75 NA 120 NA
#3: 4D_ST767 98.95 NA NA NA
#4: 5A_ST34 99.88 NA NA NA

Calculate mean of respective column values based on condition

I have a data.frame named sampleframe where I have stored all the table values. Inside sampleframe I have columns id, month, sold.
id month SMarch SJanFeb churn
101 1 0.00 0.00 1
101 2 0.00 0.00 1
101 3 0.00 0.00 1
108 2 0.00 6.00 1
103 2 0.00 10.00 1
160 1 0.00 2.00 1
160 2 0.00 3.00 1
160 3 0.50 0.00 0
164 1 0.00 3.00 1
164 2 0.00 6.00 1
I would like to calculate the average sold for the last three months based on ID. If it is month 3, it has to consider the average sold for the last two months for that ID; if it is month 2, the average sold for one month; and so on for all months.
I have used the ifelse and mean functions to attempt this, but some rows are missing when I try to use it for all months.
The code that I have used:
sampleframe$Churn <- ifelse(sampleframe$Month==4|sampleframe$Month==5|sampleframe$Month==6, ifelse(sampleframe$Sold<0.7*mean(sampleframe$Sold[sampleframe$ID[sampleframe$Month==-1&sampleframe$Month==-2&sampleframe$Month==-3]]),1,0),0)
To add to the logic of the code: it should compare with 70% of the previous months' average sold value, and if the current value is higher than that previous average, it should return 1, else 0.
The expected output is not entirely clear. Based on the description about calculating the average 'sold' for each 3 months, grouped by 'id', we can use roll_mean from library(RcppRoll). We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'id', if the number of rows is greater than 1, we take the rolling mean with n = 3, concatenated with the partial averages for the first k-1 observations; otherwise (a single observation) we return the value itself.
library(RcppRoll)
library(data.table)
k <- 3
setDT(df1)[, soldAvg := if(.N > 1) c(cumsum(sold[1:(k-1)])/1:(k-1),
    roll_mean(sold, n = k, align = 'right')) else as.numeric(sold), by = id]
df1
# id month sold soldAvg
#1: 101 1 124 124.0000
#2: 101 2 211 167.5000
#3: 104 3 332 332.0000
#4: 105 4 124 124.0000
#5: 101 5 211 182.0000
#6: 101 6 332 251.3333
#7: 101 7 124 222.3333
#8: 101 8 211 222.3333
#9: 101 9 332 222.3333
#10: 102 10 124 124.0000
#11: 102 12 211 167.5000
#12: 104 3 332 332.0000
#13: 105 4 124 124.0000
#14: 102 5 211 182.0000
#15: 102 6 332 251.3333
#16: 106 7 124 124.0000
#17: 107 8 211 211.0000
#18: 102 9 332 291.6667
#19: 103 11 124 124.0000
#20: 103 2 211 167.5000
#21: 108 3 332 332.0000
#22: 108 4 124 228.0000
#23: 109 5 211 211.0000
#24: 103 6 332 222.3333
#25: 104 7 124 262.6667
#26: 105 8 211 153.0000
#27: 103 10 332 291.6667
The above question can also be solved using library(dplyr); the following code produces the output:
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)

row average of columns that match string

I have the data frame below and I want to find the average row value for all columns whose names end in R and all columns whose names end in G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The function call grep("R$", names(df)) will return the indices of all column names that end with R. When we use it with sapply we can search for the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols), that is, the binding of the first two columns of df to the two new columns of mean values. Wrapping it with setNames(..., c(names(df)[1:2], "avg.rowR", "avg.rowG")) formats the column names to match your desired output.
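As a quick illustration of that grep() step on the df above, it returns the positions of the matching columns:
grep("R$", names(df))
# [1] 4 6 8 10 12   (i.e., 26R, 35R, 46R, 48R, 55R)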
