R: Rewriting the behaviour of "[" for matrices?

I am trying to extend the "matrix" class and, in turn, override the default behaviour of "[". Code examples below:
annMatrix <- function(mat=NULL, rowAnn=NULL, colAnn=NULL) {
if(is.null(mat)) mat <- matrix(nrow=0, ncol=0)
mat <- as.matrix(mat)
if(is.null(rowAnn)) rowAnn <- data.frame(row.names=seq_len(nrow(mat)))
if(is.null(colAnn)) colAnn <- data.frame(row.names=seq_len(ncol(mat)))
rowAnn <- data.frame(rowAnn, stringsAsFactors=FALSE)
colAnn <- data.frame(colAnn, stringsAsFactors=FALSE)
stopifnot(nrow(mat)==nrow(rowAnn) & ncol(mat)==nrow(colAnn))
attr(mat, "colAnn") <- colAnn
attr(mat, "rowAnn") <- rowAnn
class(mat) <- append(class(mat), "annMatrix")
mat
}
`[.annMatrix` <- function(annMat, rowExpr=NULL, colExpr=NULL) {
stopifnot(is.valid.annMatrix(annMat))
rowExpr <- eval(substitute(list(rowExpr)), attr(annMat, "rowAnn"), parent.frame())
colExpr <- eval(substitute(list(colExpr)), attr(annMat, "colAnn"), parent.frame())
indsR <- unlist(rowExpr)
indsC <- unlist(colExpr)
if(is.null(indsR)) indsR <- seq_len(nrow(annMat))
if(is.null(indsC)) indsC <- seq_len(ncol(annMat))
attr(annMat, "rowAnn") <- attr(annMat, "rowAnn")[indsR,,drop=FALSE]
attr(annMat, "colAnn") <- attr(annMat, "colAnn")[indsC,,drop=FALSE]
annMat <- unclass(annMat)
annMat <- annMat[indsR,indsC,drop=FALSE]
class(annMat) <- append(class(annMat), "annMatrix")
annMat
}
The basic idea is to make the matrix preserve its specific attributes after subsetting.
However, I am running into a problem:
How can I write the "[" function so that it behaves differently when called with and without a comma:
annMat[i]
annMat[i,]
as the default "[" for matrices seems to do?
I considered giving the second argument a default value, but that value does not change when a comma is added.
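One possible way to tell the two call forms apart, offered as a sketch rather than a drop-in fix for annMatrix: inside the method, nargs() also counts argument positions that were left empty, so annMat[i] reports 2 arguments while annMat[i,] reports 3. Base R's `[.data.frame` relies on the same trick. The class name annDemo below is hypothetical and carries no annotation attributes; it only illustrates the dispatch logic.

```r
# Hypothetical demo class "annDemo" (an assumption, not the asker's annMatrix).
`[.annDemo` <- function(x, i, j, ..., drop = FALSE) {
  # nargs() counts empty positions too: x[i] -> 2, x[i, ] -> 3.
  # Subtract 1 for x itself and 1 more if drop= was supplied.
  n_index <- nargs() - 1L - !missing(drop)
  m <- unclass(x)
  if (n_index < 2L) {
    # Called as x[i]: fall back to plain vector-style indexing.
    return(m[i])
  }
  # Called with a comma: fill in whichever index was left empty.
  if (missing(i)) i <- seq_len(nrow(m))
  if (missing(j)) j <- seq_len(ncol(m))
  structure(m[i, j, drop = FALSE], class = class(x))
}

m <- structure(matrix(1:6, nrow = 2), class = "annDemo")
m[2]         # vector-style: second element
dim(m[2, ])  # matrix-style: one row, three columns
```

In a real annMatrix method, the matrix-style branch is where the rowAnn and colAnn attributes would be subset alongside the data.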

Related

How to use a for loop with multiple results

I have to automate this sequence of functions:
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
WBES_sf_angola_i <- subset(WBES_sf_angola, isic == i)
WBES_angola_i <- as_Spatial(WBES_sf_angola_i)
FDI_angola_i <- FDI_angola[FDI_angola$isic==i,]
dist_ao_i <- distm(WBES_angola_i,FDI_angola_i, fun = distGeo)/1000
rm(WBES_sf_angola_i,WBES_angola_i,FDI_angola_i)
}
As a result, I want a "dist_ao" for each i. The indexed values are to be found in the isic columns of the WBES_sf_angola and the FDI_angola datasets.
How can I embed the index in the various items' names?
EDIT:
I tried the following modification:
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
WBES_sf_angola_i <- subset(WBES_sf_angola, isic == i)
WBES_angola_i <- as_Spatial(WBES_sf_angola_i)
FDI_angola_i <- FDI_angola[FDI_angola$isic==i,]
result_list <- list()
result_list[[paste0("dist_ao_", i)]] <- distm(WBES_angola_i,FDI_angola_i, fun = distGeo)/1000
rm(WBES_sf_angola_i,WBES_angola_i,FDI_angola_i)
}
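and the output is just a list of 1 that contains dist_ao_62. How do I avoid overwriting?
Untested (due to missing MRE) but should work. The list has to be created once, before the loop; creating it inside the loop resets it on every iteration: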
result_list <- list()
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
result_list[[paste0("dist_ao_", i)]] <- distm(as_Spatial(subset(WBES_sf_angola, isic == i)) , FDI_angola[FDI_angola$isic==i,], fun = distGeo)/1000
}
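Since the original data is not available, the same pattern can be checked on a toy data frame. This is only a sketch: the data frame toy and the sum() computation are stand-ins (assumptions) for the spatial data and the distm() call.

```r
# Toy data standing in for the asker's datasets (hypothetical).
toy <- data.frame(isic = c(15, 15, 17, 20), value = c(1, 2, 3, 4))

result_list <- list()  # created once, outside the loop
for (i in c(15, 17, 20)) {
  # Each iteration adds a new named element instead of replacing the list.
  result_list[[paste0("sum_", i)]] <- sum(toy$value[toy$isic == i])
}

names(result_list)
```

Each result stays reachable by name afterwards, for example result_list[["sum_15"]].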
You could approach it this way. All intermediate results are stored in lists, and the last line of the code shows how to combine one of those lists into a single dataframe. NOTE: since the question is not reproducible, I have mostly taken the code inside the loop from your question.
WBES_sf_angola_result <- list() # renamed, as WBES_sf_angola is the name of your dataset
WBES_angola_result <- list()
FDI_angola_result <- list() # renamed for the same reason
dist_ao <- list()
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
key <- paste0("i_", i) # list name for this isic code
WBES_sf_angola_result[[key]] <- subset(WBES_sf_angola, isic == i)
WBES_angola_result[[key]] <- as_Spatial(WBES_sf_angola_result[[key]])
FDI_angola_result[[key]] <- FDI_angola[FDI_angola$isic==i,]
dist_ao[[key]] <- distm(WBES_angola_result[[key]], FDI_angola_result[[key]], fun = distGeo)/1000
}
WBES_sf_angola_df <- do.call(rbind, WBES_sf_angola_result) # to get a dataframe
Your subset data can also be accessed by list name, e.g.
WBES_sf_angola_result[["i_15"]] # for the first item.

R modify a column only if it exists

Suppose I have the following code snippet:
mainResult$Time <- formatTime(mainResult$Time, "DateAndTime")
mainResult$SettleDate <- formatTime(mainResult$SettleDate, "DateAndTime")
mainResult$IssueDate <- formatTime(mainResult$IssueDate, "DateAndTime")
mainResult$Maturity <- formatTime(mainResult$Maturity, "DateAndTime")
mainResult$Bid <- formatNumber(mainResult$Bid, "withDecimals")
mainResult$Ask <- formatNumber(mainResult$Ask, "withDecimals")
mainResult$AvgBid <- formatNumber(mainResult$AvgBid, "withDecimals")
mainResult$AvgAsk <- formatNumber(mainResult$AvgAsk, "withDecimals")
mainResult$BidYield <- formatNumber(mainResult$BidYield, "withDecimals")
mainResult$AskYield <- formatNumber(mainResult$AskYield, "withDecimals")
mainResult$BidSize <- formatNumber(mainResult$BidSize, "noDecimals")
mainResult$AskSize <- formatNumber(mainResult$AskSize, "noDecimals")
mainResult$Coupon <- formatNumber(mainResult$Coupon, "withDecimals")
Each formatTime and formatNumber works fine only if the column exists. Is there a clean way for this to execute without me wrapping every statement in an if block that checks if the column exists?
This is a terrible way to code. Try to use something more like this:
# start with a vector of column names and loop
dt_columns = c("Time", "SettleDate", "IssueDate", "Maturity")
for (col in dt_columns) {
if (col %in% names(mainResult)) mainResult[[col]] = formatTime(mainResult[[col]], "DateAndTime")
}
# you can repeat for your other cases
Or this:
# intersect and lapply
dt_columns = c("Time", "SettleDate", "IssueDate", "Maturity")
dt_columns = intersect(names(mainResult), dt_columns)
mainResult[dt_columns] = lapply(mainResult[dt_columns], formatTime, "DateAndTime")
I'd probably start by separating the information about which transformations you want to apply to which columns from the code that performs the transformation. Something more like this:
numberWithDecimals <- c("Bid","Ask","AvgBid","AvgAsk", "BidYield", "AskYield", "Coupon")
numberNoDecimals <- c("BidSize", "AskSize")
timeDateAndTime <- c("Time", "SettleDate", "IssueDate", "Maturity")
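fmtColumns <- function(data, txlist, fun, fmt) {
# only touch the columns that actually exist in data
cols <- intersect(txlist, names(data))
if(length(cols) > 0) {
data[cols] <- lapply(data[cols], fun, fmt)
}
data # return the modified data frame so the caller can reassign it
}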
mainResult <- fmtColumns(mainResult, numberWithDecimals, formatNumber, "withDecimals")
mainResult <- fmtColumns(mainResult, numberNoDecimals, formatNumber, "noDecimals")
mainResult <- fmtColumns(mainResult, timeDateAndTime , formatTime, "DateAndTime")
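The intersect-then-lapply idea can be tried end to end with stand-in pieces. In this sketch, fmt_number and apply_if_present are hypothetical names, and fmt_number only imitates what the asker's formatNumber might do; the point is that a requested column that is absent ("Ask" here) is skipped silently.

```r
# Hypothetical stand-in for the asker's formatNumber (an assumption).
fmt_number <- function(x, style) {
  digits <- if (style == "withDecimals") 2 else 0
  formatC(x, format = "f", digits = digits)
}

# Apply fun only to the wanted columns that are present in data.
apply_if_present <- function(data, wanted, fun, ...) {
  cols <- intersect(wanted, names(data))
  if (length(cols) > 0) {
    data[cols] <- lapply(data[cols], fun, ...)
  }
  data
}

df <- data.frame(Bid = c(1.234, 5), BidSize = c(10, 20))
df <- apply_if_present(df, c("Bid", "Ask"), fmt_number, "withDecimals")
df$Bid
```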

How to check if two data frames have the same column names?

I have two data frames like this:
quest1 <- c(5,5,5)
quest2 <- c(5,5,5)
quest3<- c("a","b","c")
quest4 <- c(7,7,7)
quest5 <- c(8,8,8)
myquest1 <- data.frame(quest1,quest2,quest3)
myquest2 <- data.frame(quest4,quest5)
How can I check whether they have the same column names, using an ifelse or if statement together with a warning or stop function?
Or is there another way? I would prefer the former.
I think what you need is something like the following using a function.
Using your example:
quest1 <- c(5,5,5)
quest2 <- c(5,5,5)
quest3<- c("a","b","c")
quest4 <- c(7,7,7)
quest5 <- c(8,8,8)
myquest1 <- data.frame(quest1,quest2,quest3)
myquest2 <- data.frame(quest4,quest5)
myquest3 <- data.frame(quest1,quest2,quest3)
my_func <- function(x,y) {
# compare the two sets of column names, ignoring their order
if (!setequal(names(x), names(y))) {
print('Warning: Names are not the same')
}
else {
print('Names are identical')
}
}
> my_func(myquest1,myquest2)
[1] "Warning: Names are not the same"
> my_func(myquest1,myquest3)
[1] "Names are identical"
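Because the question asks for warning or stop rather than print, a variant along those lines might look like the sketch below. The function name check_same_names and its strict argument are hypothetical; setequal ignores column order, so swap in identical(names(x), names(y)) if order matters.

```r
# Hypothetical helper: stop (or warn) when column names differ.
check_same_names <- function(x, y, strict = TRUE) {
  same <- setequal(names(x), names(y))
  if (!same) {
    msg <- "Column names are not the same"
    if (strict) stop(msg) else warning(msg)
  }
  invisible(same)
}

a <- data.frame(quest1 = 1, quest2 = 2)
b <- data.frame(quest2 = 3, quest1 = 4)  # same names, different order
d <- data.frame(quest4 = 5)

ok <- check_same_names(a, b)
res <- tryCatch(check_same_names(a, d), error = function(e) "stopped")
```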

How to avoid writing the same line several times in R?

I'm writing a program in R and I need to select rows based on a particular value of one of the variables. The program is as follows:
a1961 <- base[base[,5]==1961,]
a1962 <- base[base[,5]==1962,]
a1963 <- base[base[,5]==1963,]
a1964 <- base[base[,5]==1964,]
a1965 <- base[base[,5]==1965,]
a1966 <- base[base[,5]==1966,]
a1967 <- base[base[,5]==1967,]
a1968 <- base[base[,5]==1968,]
a1969 <- base[base[,5]==1969,]
a1970 <- base[base[,5]==1970,]
a1971 <- base[base[,5]==1971,]
a1972 <- base[base[,5]==1972,]
a1973 <- base[base[,5]==1973,]
a1974 <- base[base[,5]==1974,]
a1975 <- base[base[,5]==1975,]
a1976 <- base[base[,5]==1976,]
a1977 <- base[base[,5]==1977,]
a1978 <- base[base[,5]==1978,]
a1979 <- base[base[,5]==1979,]
a1980 <- base[base[,5]==1980,]
a1981 <- base[base[,5]==1981,]
a1982 <- base[base[,5]==1982,]
a1983 <- base[base[,5]==1983,]
a1984 <- base[base[,5]==1984,]
a1985 <- base[base[,5]==1985,]
a1986 <- base[base[,5]==1986,]
a1987 <- base[base[,5]==1987,]
a1988 <- base[base[,5]==1988,]
a1989 <- base[base[,5]==1989,]
...
a2012 <- base[base[,5]==2012,]
Is there a way (like modules in SAS) in which I can avoid writing the same thing over and over again?
In general, coding/implementation questions really belong on StackOverflow. That said, my recommendation is instead of naming individual variables for each result, just throw them all into a list:
a = lapply(1961:2012, function(x) base[base[,5]==x,])
You can also use the assign command.
years <- 1961:2012
for(i in 1:length(years)) {
assign(x = paste0("a", years[i]), value = base[base[,5]==years[i],])
}
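Another base R option, shown here as a sketch on made-up data, is split(), which builds the per-year list in one call and names each element after its year. The data frame toy and its year column are assumptions standing in for base and its fifth column.

```r
# Toy data standing in for the asker's `base` (hypothetical).
toy <- data.frame(id = 1:5, year = c(1961, 1961, 1962, 1963, 1963))

# One data frame per distinct year, named by the year value.
by_year <- split(toy, toy$year)

names(by_year)
nrow(by_year[["1961"]])
```

Keeping the pieces in one list, rather than creating fifty separate variables with assign, makes it easy to loop over them later with lapply.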

cross-tabulations on multiple data.frames and columns

I'm calculating summary statistics for numerous data frames across multiple slices against a single response variable. I currently do this by passing a list of data frames to a function, but the function has to specify each column (i.e. each slice) individually. This already speeds up my process dramatically, but I think there has to be a more efficient way to do it with an apply()-family function. I'm hoping someone here can help me out.
Here's my code:
table1 <- function(x) {
dl2 <- list()
for (i in 1:length(x)) {
z <- x[[i]]
t.sliceA <- addmargins(table(list(z$sliceA, z$Growing)))
t.sliceB <- addmargins(table(list(z$sliceB, z$Growing)))
t.sliceC <- addmargins(table(list(z$sliceC, z$Growing)))
t.sliceD <- addmargins(table(list(z$sliceD, z$Growing)))
...
t.sliceAA <- addmargins(table(list(z$sliceAA, z$Growing)))
table.list <- list(t.sliceA, t.sliceB, t.sliceC, ... , t.sliceAA)
names(table.list) <- c("t.sliceA", "t.sliceB", ... , "t.sliceAA")
dl2[[i]] <- table.list
}
assign("dl",dl2, envir=.GlobalEnv)
}
# run the function
dl <- list(DF1, DF2, ..., DF.n)
table1(dl)
I assume there must be a more efficient way to do this via lapply() where I only have to specify the columns needed. Something where I would replace the lines
t.sliceA <- [blah]
...
t.sliceAA <- [blah]
with something like:
apply(z[,c(1:4,10:12,15)],2, function(x) addmargins(table(list(x,z$Growing))))
Any help that you can provide would be very helpful. Thanks!
Update: Reproducible example
@Chase
My apologies if this was done poorly. It's my first time using github.
https://gist.github.com/3719220
and here's the code:
# load the example datasets
a.small <- dget("df1.txt")
l.small <- dget("df2.txt")
# working function that I'd like to simplify
table1 <- function(x) {
dl2 <- list()
for (i in 1:length(x)) {
z <- x[[i]]
t.tenure <- addmargins(table(list(z$Tenure.Group, z$Growing)))
t.optfile <- addmargins(table(list(z$opt.file, z$Growing)))
t.checking <- addmargins(table(list(z$checking, z$Growing)))
t.full <- addmargins(table(list(z$add.full, z$Growing)))
t.optdm <- addmargins(table(list(z$opt.dm, z$Growing)))
t.up <- addmargins(table(list(z$add.up, z$Growing)))
t.off <- addmargins(table(list(z$offmode, z$Growing)))
table.list <- list(t.tenure, t.optfile, t.checking, t.full, t.optdm, t.up, t.off)
names(table.list) <- c("t.tenure", "t.optfile", "t.checking", "t.full", "t.optdm", "t.up", "t.off")
dl2[[i]] <- table.list
}
assign("dl",dl2, envir=.GlobalEnv)
}
# create a DF list to send to the function
dl <- list(a.small, l.small)
table1(dl) # run the function
As far as I can see, this can easily be done with a couple of lapply statements.
If we define our function to create a table with margins as
tabulate_df <- function(DF, .what, .with) {
table.add.margins <- function(...) addmargins(table(...))
lapply(DF[.what], table.add.margins, DF[[.with]])
}
Then
# the columns we want to cross tabulate with `Growing`
table_names <- setdiff(names(df1), 'Growing')
df_list <- setNames(list(df1,df2), c('df1','df2'))
lapply(df_list, tabulate_df, .what = table_names, .with = 'Growing')
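The nested-lapply pattern can be checked on a small made-up data frame. This is a sketch under assumptions: toy, its columns, and the helper name cross_tab are hypothetical and only mirror the shape of the real data.

```r
# Cross-tabulate each column named in `what` against the column `with`.
cross_tab <- function(df, what, with) {
  lapply(df[what], function(col) addmargins(table(col, df[[with]])))
}

# Toy data standing in for the real data frames (hypothetical).
toy <- data.frame(
  sliceA  = c("x", "x", "y", "y"),
  sliceB  = c("p", "q", "p", "q"),
  Growing = c(TRUE, FALSE, TRUE, TRUE)
)

# Outer lapply walks the data frames, inner lapply walks the columns.
tabs <- lapply(list(first = toy, second = toy), cross_tab,
               what = c("sliceA", "sliceB"), with = "Growing")

tabs$first$sliceA
```

The result is a list with one element per data frame, each holding one table per slice, so tabs$second$sliceB is the sliceB table for the second data frame.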
