How can I copy and rename a bunch of variables at once? - r

I have created some variables. I would like to duplicate these so that they exist twice, once with the name you see below, and once with Ireland_ in front of their name, i.e.,
c_PFS_Folfox = 307.81 would become:
Ireland_c_PFS_Folfox = 307.81
I initially define these as follows:
1. Cost of treatment in this country
c_PFS_Folfox <- 307.81
c_PFS_Bevacizumab <- 2580.38
c_OS_Folfiri <- 326.02
administration_cost <- 365.00
2. Cost of treating the AE conditional on it occurring
c_AE1 <- 2835.89
c_AE2 <- 1458.80
c_AE3 <- 409.03
3. Willingness to pay threshold
n_wtp = 45000
Then I put them together to rename all at once:
kk <- data.frame(c_PFS_Folfox, c_PFS_Bevacizumab, c_OS_Folfiri, administration_cost, c_AE1, c_AE2, c_AE3, n_wtp)
colnames(kk) <- paste("Ireland", kk, sep="_")
kk
Ireland_307.81 Ireland_2580.38 Ireland_326.02 Ireland_365 Ireland_2835.89 Ireland_1458.8
1 307.8 2580 326 365 2836 1459
Ireland_409.03 Ireland_45000
1 409 45000
Obviously this isn't the output I intended. These also don't exist as new variables in the environment.
What can I do?

If we want to create objects with Ireland_ as prefix, either use
list2env(setNames(kk, paste0("Ireland_", names(kk))), .GlobalEnv)
Once we created the objects in the global env, we may remove the original objects
> rm(list = names(kk))
> ls()
[1] "Ireland_administration_cost" "Ireland_c_AE1" "Ireland_c_AE2" "Ireland_c_AE3" "Ireland_c_OS_Folfiri"
[6] "Ireland_c_PFS_Bevacizumab" "Ireland_c_PFS_Folfox" "Ireland_n_wtp" "kk"
or with %=% from collapse
library(collapse)
paste("Ireland", colnames(kk), sep="_") %=% kk
-checking
> Ireland_administration_cost
[1] 365
> Ireland_c_PFS_Folfox
[1] 307.81

First put all your variables in a vector, then use sapply to iterate the vector to assign the existing variables to a new variable with the prefix "Ireland_".
your_var <- c("c_PFS_Folfox", "c_PFS_Bevacizumab", "c_OS_Folfiri",
"administration_cost", "c_AE1", "c_AE2", "c_AE3", "n_wtp")
sapply(your_var, \(x) assign(paste0("Ireland_", x), get(x), envir = globalenv()))

Related

"Duplicated rows" error in AnnotationForge function makeOrgPackage

I'm creating an organism package using the AnnotationForge package, specifically the function makeOrgPackage. I've been following this vignette: https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/inst/doc/MakingNewOrganismPackages.html
When I call the function:
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
version="0.1",
maintainer="myname <email#university.edu>",
author="myname <email#university.edu>",
outputDir = ".",
tax_id="208963",
genus="Pseudomonas",
species="aeruginosa",
goTable="go")
I receive this error:
Error in FUN(X[[i]], ...) : data.frames in '...' cannot contain duplicated rows
The "..." refers to the set of dataframes containing the annotation data. I've ensured that these dataframes are in the exact same structure as the example in the vignette. In the "gene_info" and "chromosome" dfs, I deleted all duplicated rows.
The "go" df has repeated values in the "GID" (gene ID) column, but all GO values are unique, and I've checked that no duplicate rows exist. For example:
GID GO EVIDENCE
1 PA14_00010 GO:0005524 ISM
2 PA14_00010 GO:0006270 ISM
3 PA14_00010 GO:0006275 ISM
4 PA14_00010 GO:0043565 ISM
5 PA14_00010 GO:0003677 ISM
6 PA14_00010 GO:0003688 ISM
7 PA14_00020 GO:0003677 ISM
8 PA14_00020 GO:0006260 ISM
The same goes for the sample finch data provided by the vignette; repeated GIDs, but unique GO numbers. Frustratingly, when I run the makeOrgPackage function for the sample data in the vignette, there are no errors. What am I missing here?
Full script:
# Load in GO annotated PA14 file, downloaded from Psuedomonas.com
PA14file <- read.csv("../data/GO_annotations/GO_PA14.csv")
colnames(PA14file)
> colnames(PA14file)
[1] "LocusTag" "GeneName" "ProductDescription"
[4] "StrainName" "Accession" "GOTerm"
[7] "Namespace" "GOEvidenceCode" "EvidenceOntologyECOCode"
[10] "EvidenceOntologyTerm" "SimilarToBindsTo" "PMID"
[13] "chrom"
# PA14 only has 1 chromosome, so create a new column and populate it with 1s.
PA14file$chrom <- '1'
# Create gene_info df, remove duplicate rows
PA14Sym <- PA14file[,c("LocusTag", "GeneName", "ProductDescription")]
PA14Sym <- PA14Sym[PA14Sym[,"GeneName"]!="-",]
PA14Sym <- PA14Sym[PA14Sym[,"ProductDescription"]!="-",]
colnames(PA14Sym) <- c("GID","SYMBOL","GENENAME")
PA14Sym <- PA14Sym[!duplicated(PA14Sym), ]
# Create chromosome df, remove duplicate rows
PA14Chr <- PA14file[,c("LocusTag", "chrom")]
PA14Chr <- PA14Chr[PA14Chr[,"chrom"]!="-",]
colnames(PA14Chr) <- c("GID","CHROMOSOME")
PA14Chr %>% distinct(GID, .keep_all = TRUE)
PA14Chr <- PA14Chr[!duplicated(PA14Chr), ]
# Create go df
PA14Go <- PA14file[,c("LocusTag", "Accession", "GOEvidenceCode")]
PA14Go <- PA14Go[PA14Go[,"GOEvidenceCode"]!="",]
colnames(PA14Go) <- c("GID","GO","EVIDENCE")
# Call the function
makeOrgPackage(gene_info=PA14Sym, chromosome=PA14Chr, go=PA14Go,
version="0.1",
maintainer="myname <email#university.edu>",
author="myname <email#university.edu>",
outputDir = ".",
tax_id="208963",
genus="Pseudomonas",
species="aeruginosa",
goTable="go")
I also met this question today, and just after I changed to use the distinct() in dplyr, this function can work correctly.(My function is the same as yours.)
Just try to add a piece of %>% dplyr::distinct() to the tail of each part of creating or use dplyr::distinct() after all operations to remove the duplications in your variable.
In your case:
library(dplyr)
PA14Sym <- dplyr::distinct(PA14Sym)
PA14Chr <- dplyr::distinct(PA14Chr)
PA14Go <- dplyr::distinct(PA14Go)
Hope these can help you.

In R how do you factorise and add label values to specific data.table columns, using a second file of meta data?

This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss) what this question is part of is attempting to get the benefits of SPSS style labeling when data originates from CSV sources. This is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of file format origin.
Whilst CSV does a reasonable job of storing data it is hopeless for providing meaningful data. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples of this (e.g. in documentation) it is practical to simply hard code the meta data in. But for larger projects it makes more sense to store this meta data in a second csv file.
Example data file
ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
1,1,34,1,,1,,1,1,4,
2,1,21,0,1,,1,3,14,3,2
3,1,54,1,,,1,3,6,4,4
4,2,32,1,1,1,,3,7,4,
5,3,66,0,,,1,3,9,3,3
6,2,43,1,,1,,1,12,2,1
7,2,26,0,,,1,2,11,1,
8,3,,1,1,,,2,15,1,4
9,1,34,1,,1,,1,12,3,4
10,2,46,0,,,,3,13,2,
11,3,39,1,1,1,,3,7,1,2
12,1,28,0,,,1,1,6,5,1
13,2,64,0,,1,,2,11,,3
14,3,34,1,1,,,3,10,1,1
15,1,52,1,,1,1,1,8,6,
Example metadata file
Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten
varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten
varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude
Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored
vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support
vallable,,Two,,Yes,,,,y,Two,B,Neutral
vallable,,Three,,,,,,z,Three,C,Oppose
vallable,,,,,,,,,Four,D,Dont know
vallable,,,,,,,,,Five,E,
vallable,,,,,,,,,Six,F,
vallable,,,,,,,,,Seven,G,
vallable,,,,,,,,,Eight,,
vallable,,,,,,,,,Nine,,
vallable,,,,,,,,,Ten,,
vallable,,,,,,,,,Eleven,,
vallable,,,,,,,,,Twelve,,
vallable,,,,,,,,,Thirteen,,
vallable,,,,,,,,,Fourteen,,
vallable,,,,,,,,,Fifteen,,
SO the common elements are the column names which are the key to both files
The first column of the metadata file describes the role of the row for the data file
so
varlabel provides the variable label for each column
varrole describes the analytic purpose of the variable
missing describes how to treat missing data
varlabel describes the label for a factor level starting at one on up to as many labels as there are.
Right! Here's the code that works:
```#Libraries
library(expss)
library(data.table)
library(magrittr)```
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE,
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"
mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt)) # find common variable names so metadata applies
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])
(Here I copy the rdt datatable simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.
# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{
expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}
OK this is where I fall down.
Failure from here
factorcols <- as.vector(commonnames) # create a vector of column names (for later use)
for (col in factorcols)
{
print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col]
# if not a factor column, remove it from the factorcol list and dont try to factor it
else { # if it is a vector factorise
print(paste("working on",col)) # I have had a lot of problem with unrecognised ..col variables
tlabels <- as.vector(na.omit(mdt[4:18, ..col])) # get list of labels from the data column}
validrange <- seq(1,lengths(tlabels),1) # range of valid values is 1 to the length of labels list
print(as.character(tlabels)) # for testing
print(validrange) # for testing
tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
# expss::val_lab(tdt[, ..col]) <- tlabels
tlabels = c() # flush loop variable
validrange = c() # flush loop variable
}
}
So the problem is revealed here when we check the data table.
tdt
the labels have been applied as whole vectors to each column entry except where there is only one value in the vector ("checked" for varfour and varfive)
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked
And a mystery
this code works just fine on a single columns when I don't use a for loop variable
# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)
It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't make vector as you expect. Contrary to usual data.frame data.table doesn't drop dimensions when you provide single column in the index. And as.vector do nothing with data.frames/data.tables. So tlabels remains data.table. This line need to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]).
Example:
library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table
str(tlabels)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 1 variable:
# $ am: num 1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr>
as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"
tlabels <- na.omit(mdt[[col]][3:6]) # vector
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"

Huge data file and running multiple parameters and memory issue, Fisher's test

I have a R code that I am trying to run in a server. But it is stopping in the middle/get frozen probably because of memory limitation. The data files are huge/massive (one has 20 million lines) and if you look at the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data has specific data of human and rats' genes and human has 36 replicates, while rat has 281. So, the loop is basically 281*36 steps. What I want to do is to process data using the function getGeneType and see how different/independent are the expression of different replicate combinations. Using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this :
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have few questions on how to make this more efficient. I think when I run this code, R takes up lots of memory that ultimately causes problems. I am wondering if there is any way of doing this more efficiently
Another possibility is the usage of double for-loop'. Will sapply help? In that case, how should I apply sapply?
At the end I want to convert result into a csv file. I know this is a bit overwhelming to put code like this. But any optimization/efficient coding/programming will be A LOT! I really need to run the whole thing at least one to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
capital <- toupper(x) ## capitalize
paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG
#function to be used to aggregate
getGeneType <- function(types) {
types <- as.character(types)
if ('rev' %in% types) {
return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
}
else {
return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
}
}
#logical function to see whether x is integer(0) ..It's used the for loop bellow in case any one HmYn is equal to zero
is.integer0 <- function(x) {
is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis",
Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
for(j in 1:length(humanSplit)) {
ratReplicateName <- names(ratSplit[i])
humanReplicateName <- names(humanSplit[j])
#merging above two based on the one-to-one gene mapping as in geneList defined above.
mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
mergedRatData <- mergedRatData[,c(2,1,4,5)] #rearrange column
mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
#now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
#is.integer0 function is used
HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
# contingencyTable:
# HnRn --|--HyRn
# |------|-----|
# HnRy --|-- HyRy
#
fisherTest <- fisher.test(contingencyTable)
#make new line out of the result of fisherTest
newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2],
oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine) #append newline to result
if(j%%10 = 0) print(c(i,j))
}
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.

Filter xts objects by common dates

I am stuck with the following code.
For reference the code it is taken from the following website (http://gekkoquant.com/2013/01/21/statistical-arbitrage-trading-a-cointegrated-pair/), I am also compiling the code through R Studio.
library("quantmod")
startDate = as.Date("2013-01-01")
symbolLst<-c("WPL.AX","BHP.AX")
symbolData <- new.env()
getSymbols(symbolLst, env = symbolData, src = "yahoo", from = startDate)
stockPair <- list(
a =coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[1],"\"",sep="")))))
,b = coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[2],"\"",sep="")))))
,hedgeRatio = 0.70 ,name=title)
spread <- stockPair$a - stockPair$hedgeRatio*stockPair$b
I am getting the following error.
Error in stockPair$a - stockPair$hedgeRatio * stockPair$b :
non-conformable arrays
The reason these particular series don't match is because "WPL.AX" has an extra value (date:19-05-2014 - the matrix lengths are different) compared to "BHP". How can I solve this issue when loading data?
I have also tested other stock pairs such as "ANZ","WBC" with the source = "google" which produces two of the same length arrays.
> length(stockPair$a)
[1] 360
> length(stockPair$b)
[1] 359
Add code such as this prior to the stockPair computation, to trim each xts set to the intersection of dates:
common_dates <- as.Date(Reduce(intersect, eapply(symbolData, index)))
symbolData <- eapply(symbolData, `[`, i=common_dates)
Your code works fine if you don't convert your xts object to matrix via coredata. Then Ops.xts will ensure that only the rows with the same index will be subtracted. And fortune(106) applies.
fortunes::fortune(106)
# If the answer is parse() you should usually rethink the question.
# -- Thomas Lumley
# R-help (February 2005)
stockPair <- list(
a = Cl(symbolData[[symbolLst[1]]])
,b = Cl(symbolData[[symbolLst[2]]])
,hedgeRatio = 0.70
,name = "title")
spread <- stockPair$a - stockPair$hedgeRatio*stockPair$b
Here's an alternative approach:
# merge stocks into a single xts object
stockPair <- do.call(merge, eapply(symbolData, Cl))
# ensure stockPair columns are in the same order as symbolLst, since
# eapply may loop over the environment in an order you don't expect
stockPair <- stockPair[,pmatch(symbolLst, colnames(stockPair))]
colnames(stockPair) <- c("a","b")
# add hedgeRatio and name as xts attributes
xtsAttributes(stockPair) <- list(hedgeRatio=0.7, name="title")
spread <- stockPair$a - attr(stockPair,'hedgeRatio')*stockPair$b

How to extract and sum specific numbers from a vector in R

I have a vector of 66 numbers, called a, each with a name:
Age0_i0 Age1_i0 Age1_i1 Age2_i0 Age2_i1 Age2_i2 Age3_i0 Age3_i1 Age3_i2
1000000.000 680000.000 170000.000 462400.000 115600.000 144500.000 314432.000 78608.000 98260.000
Age3_i3 Age4_i0 Age4_i1 Age4_i2 Age4_i3 Age4_i4 Age5_i0 Age5_i1 Age5_i2
122825.000 213813.760 53453.440 66816.800 83521.000 104401.250 145393.357 36348.339 45435.424
Age5_i3 Age5_i4 Age5_i5 Age6_i0 Age6_i1 Age6_i2 Age6_i3 Age6_i4 Age6_i5
56794.280 70992.850 88741.062 98867.483 24716.871 30896.088 38620.110 48275.138 60343.922
Age6_i6 Age7_i0 Age7_i1 Age7_i2 Age7_i3 Age7_i4 Age7_i5 Age7_i6 Age7_i7
75429.903 67229.888 16807.472 21009.340 26261.675 32827.094 41033.867 51292.334 64115.418
Age8_i0 Age8_i1 Age8_i2 Age8_i3 Age8_i4 Age8_i5 Age8_i6 Age8_i7 Age8_i8
45716.324 11429.081 14286.351 17857.939 22322.424 27903.030 34878.787 43598.484 54498.105
Age9_i0 Age9_i1 Age9_i2 Age9_i3 Age9_i4 Age9_i5 Age9_i6 Age9_i7 Age9_i8
31087.100 7771.775 9714.719 12143.399 15179.248 18974.060 23717.575 29646.969 37058.711
Age9_i9 Age10_i0 Age10_i1 Age10_i2 Age10_i3 Age10_i4 Age10_i5 Age10_i6 Age10_i7
46323.389 21139.228 5284.807 6606.009 8257.511 10321.889 12902.361 16127.951 20159.939
Age10_i8 Age10_i9 Age10_i10
25199.924 31499.905 39374.881
I want to produce a list of the sums of some of those vectors. Specifically, I want to sum all of the Age3s from Age3_i3, Age3_i4...through to Age3_i10. Then all of the Age4s from _i3 to _i10 and Age5s _i3 to _i10 all the way up to Age10 _i3 to _i10. I wanted to do it in a loop like this:
x <- 10
for (i in 3:x){
for (j in 3:i){
s <- sum(a[paste0("Age",i,"_i",j)])
}}
s
but it just gives me a[66], the last value of a. Ideally it would give me a list of the 8 totals.
Any help appreciated!
EDIT##
add a dput of the data:
structure(c(1e+06, 680000, 170000, 462400, 115600, 144500, 314432,
78608, 98260, 122825, 213813.76, 53453.44, 66816.8, 83521, 104401.25,
145393.357, 36348.339, 45435.424, 56794.28, 70992.85, 88741.062,
98867.483, 24716.871, 30896.088, 38620.11, 48275.138, 60343.922,
75429.903, 67229.888, 16807.472, 21009.34, 26261.675, 32827.094,
41033.867, 51292.334, 64115.418, 45716.324, 11429.081, 14286.351,
17857.939, 22322.424, 27903.03, 34878.787, 43598.484, 54498.105,
31087.1, 7771.775, 9714.719, 12143.399, 15179.248, 18974.06,
23717.575, 29646.969, 37058.711, 46323.389, 21139.228, 5284.807,
6606.009, 8257.511, 10321.889, 12902.361, 16127.951, 20159.939,
25199.924, 31499.905, 39374.881), .Names = c("Age0_i0", "Age1_i0",
"Age1_i1", "Age2_i0", "Age2_i1", "Age2_i2", "Age3_i0", "Age3_i1",
"Age3_i2", "Age3_i3", "Age4_i0", "Age4_i1", "Age4_i2", "Age4_i3",
"Age4_i4", "Age5_i0", "Age5_i1", "Age5_i2", "Age5_i3", "Age5_i4",
"Age5_i5", "Age6_i0", "Age6_i1", "Age6_i2", "Age6_i3", "Age6_i4",
"Age6_i5", "Age6_i6", "Age7_i0", "Age7_i1", "Age7_i2", "Age7_i3",
"Age7_i4", "Age7_i5", "Age7_i6", "Age7_i7", "Age8_i0", "Age8_i1",
"Age8_i2", "Age8_i3", "Age8_i4", "Age8_i5", "Age8_i6", "Age8_i7",
"Age8_i8", "Age9_i0", "Age9_i1", "Age9_i2", "Age9_i3", "Age9_i4",
"Age9_i5", "Age9_i6", "Age9_i7", "Age9_i8", "Age9_i9", "Age10_i0",
"Age10_i1", "Age10_i2", "Age10_i3", "Age10_i4", "Age10_i5", "Age10_i6",
"Age10_i7", "Age10_i8", "Age10_i9", "Age10_i10"))
Construct the names that you want and then subset by them:
nm = expand.grid(age = 3:5, id = 3:10)
sum(y[paste0('Age', nm$age, '_i', nm$id)], na.rm = T)
#[1] 527275.4
If you instead want these sums for each age group, I'd do this instead
library(data.table)
nm = CJ(age = 3:5, id = 3:10)
nm[, sum(y[paste0('Age', age, '_i', id)], na.rm = T), by = age]
# age V1
#1: 3 122825.0
#2: 4 187922.2
#3: 5 216528.2
I think you can do it in one loop using a clever grep:
nn <- names(y)
sapply (c(3,4,5) ,function(i)
sum(y[grep(paste0('Age',i,'_i10|Age',i,'_i','[3-9]'),nn)]))
[1] 122825.0 187922.2 216528.2
EDIT
This solution works with any range (min,max). It generates a sequence and use na.rm argument to remove missing values. it is less efficient ( generates more than needed) but always works and don't use regular expression.
sum_filter <-
function(min=3,max=10)
sapply (c(3,4,5) ,function(i)
sum(y[paste0('Age',i,'_i',seq(min,max))],na.rm=T))

Resources