Rowmeans with matching column names - r

How could I calculate the rowMeans of a data.frame based on matching column names?
Ex)
c1=rnorm(10)
c2=rnorm(10)
c3=rnorm(10)
out=cbind(c1,c2,c3)
out=cbind(out,out)
I realize that the values are the same; this is just for demonstration.
Each row is a specific measurement type (consider it a factor).
Imagine c1 = compound 1, c2 = compound 2, etc.
I want to group together all the c1's and average their rows together, then repeat for all unique(colnames(out)).
My idea was something like:
avg = rowMeans(out, by = unique(colnames(out)))
but obviously this doesn't work...

Try this:
sapply(unique(colnames(out)), function(i)
  rowMeans(out[, colnames(out) == i]))
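One caveat with the above (my addition, not part of the original answer): if any column name happens to occur only once, out[, colnames(out) == i] drops down to a plain vector and rowMeans() errors, so adding drop = FALSE keeps the subset as a one-column matrix:
sapply(unique(colnames(out)), function(i)
  rowMeans(out[, colnames(out) == i, drop = FALSE]))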

As @Laterow points out in the comments, having duplicate column names will lead to trouble at some point; if not here, elsewhere in your code. Best to nip it in the bud now.
If you are starting with duplicate column names, use make.unique on the colnames first; it appends .n, where n increments for each duplicate (starting at .1 for the first duplicate), and leaves the initially unique names as they are:
colnames(out) <- make.unique(colnames(out));
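For instance, with the duplicated names created by out=cbind(out,out) above, make.unique leaves the first occurrence of each name alone and suffixes the repeats:
make.unique(c("c1", "c2", "c3", "c1", "c2", "c3"))
# [1] "c1"   "c2"   "c3"   "c1.1" "c2.1" "c3.1"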
Once that's done (or, as the OP explained in the comments, if it was already being done silently by the column-creating function), you can do your rowMeans operation with dplyr::select and its starts_with() helper to group columns by prefix:
library(dplyr);
avg_c1 <- rowMeans(select(as.data.frame(out), starts_with("c1")));
If you have a large number of columns, instead of specifying them individually, you can use the code below to have it create a data frame of the rowMeans regardless of input size:
case_count <- as.integer(sub('^c\\d+\\.(\\d+)$', '\\1', colnames(out)[ncol(out)])) + 1L;
var_count <- as.integer(ncol(out) %/% case_count);
avg_c <- as.data.frame(matrix(nrow = var_count, ncol = nrow(out)));
for (i in 1:var_count) {
  avg_c[i, 1:nrow(out)] <- rowMeans(select(as.data.frame(out), starts_with(paste0("c", i))));
}
As @Tensibai points out in comments, this solution may not be efficient, and may be overkill depending on your actual data set. You may not need the flexibility it provides and there's probably a more succinct way to do it.
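For example, a more compact sketch (my suggestion, assuming the c1, c1.1, c2, c2.1, ... names produced by make.unique) strips the suffix to recover each column's base name and computes all the row means in a single call:
base_name <- sub("\\.\\d+$", "", colnames(out))  # "c1.1" -> "c1"; names without a suffix are unchanged
avg_c <- sapply(unique(base_name), function(g)
  rowMeans(out[, base_name == g, drop = FALSE]))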
EDIT1: Based on OP comments
EDIT2: Based on comments, handle all rowMeans at once
EDIT3: Fixed code bugs and clarified starting point reasoning based on comments

Related

How to find unsorted fragments of data frame

Let's assume that I've got a data.frame that is supposed to be sorted with respect to selected columns, and I want to make sure that this is indeed the case. I could try something like:
library(dplyr)
mpg2 <- mpg %>%
  arrange(manufacturer, model, year)
identical(mpg, mpg2)
#[1] FALSE
but if identical() returns FALSE, this only tells me that the dataset is in the wrong order.
What if I would like to inspect only those rows that are actually in the wrong order? How can I filter them out of the whole dataset? (I need to avoid looping if possible, as the dataset I work with is pretty large.)
If the remaining variables (not used for ordering) differ for the same value of manufacturer, model, year, how does dplyr::arrange decide which observation comes first? Does it preserve the order from the original dataset (mpg here)?
As for the second question, I believe that dplyr::arrange is stable: it preserves the order of the rows when there are ties in the sorting columns.
This can be seen by comparing with the result from base::order. From the help page, section Details (my emphasis):
In the case of ties in the first vector, values in the second are used to break the ties. If the values are still tied, values in the later arguments are used to break the tie (see the first example). The sort used is stable (except for method = "quick"), so any unresolved ties will be left in their original ordering.
mpg2 <- mpg %>%
  arrange(manufacturer, model, year)
i <- with(mpg, order(manufacturer, model, year))
mpg3 <- mpg[i, ]
identical(as.data.frame(mpg2), as.data.frame(mpg3))
#[1] TRUE
The values are identical once both results are converted to plain data frames (only their classes differ). So dplyr::arrange does preserve the original order in the case of ties.
As for the first question, maybe the code below answers it. It keeps the rows for which the next value in the ordering vector is smaller than the current one, which means those rows have changed relative positions.
j <- which(diff(i) < 0)
mpg[i[j], ]
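If you prefer to stay in dplyr, a rough equivalent of the same idea (my sketch, not from the answer above, using a temporary helper column .orig to remember the original row positions) would be:
library(dplyr)
unsorted_rows <- mpg %>%                            # mpg from ggplot2, as in the question
  mutate(.orig = row_number()) %>%                  # remember original positions
  arrange(manufacturer, model, year) %>%
  filter(.orig > lead(.orig, default = n() + 1L))   # rows followed by a row that was originally earlier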
I don't think this is something I've needed before. It's usually best practice not to rely on table ordering. The only times I would rely on it, the ordering would be contained within a function, i.e. I wouldn't have function B depend on ordering that happens in function A.
I think this does what you ask for, using the data.table package. With this package you set keys, and they are ordered from left to right in terms of primary key, secondary key, etc. I'm not sure if concatenating the keys together is the best way, but it's simple.
# reproducible fake data
library(data.table)
set.seed(1)
dt <- data.table(a=rep(1:5, 2), b=letters[1:10], c=sample(1:3, 10, TRUE))
# scramble
dt <- dt[sample(1:.N)]
# make the ideal structure
keys <- c("a", "b")
dt_ideal <- copy(dt)
dt_ideal <- setkeyv(dt_ideal, keys)
key(dt_ideal)
# function to find keys not the same for each row. Pasting together
findBad <- function(dt, dt_ideal){
  not_ok <- which(dt_ideal[, do.call(paste, c(.SD, sep=">")), .SDcols=keys] !=
                  dt[, do.call(paste, c(.SD, sep=">")), .SDcols=keys])
  not_ok
}
# index of bad rows - all bad in this case
not_ok <- findBad(dt, dt_ideal)
dt[not_ok]
# better eg, swap 7 & 8
dt2 <- copy(dt_ideal)
dt2 <- dt2[c(1:6, 8, 7, 9:10)]
not_ok <- findBad(dt2, dt_ideal)
dt2[not_ok]

if function for rowSums - modify the code

I want to compute a sum over several columns and create a new column based on it. So I use
df$Sum <-rowSums(df[,grep("y", names(df))])
But sometimes df includes just one column, and in that case I get an error. Since this function is part of a longer procedure, I was wondering how I can write an if condition so that if df[,grep("y", names(df))] contains just one column, Sum is simply equal to that column, and otherwise, if it contains at least two columns, Sum is the row-wise summation over them.
suppose:
require(stats); require(graphics)
attach(cars)
cars$y1<-seq(20:69)
#cars$y2<-seq(30:79)
df<-cars
df$Sum <-rowSums(df[,grep("y", names(df))])
You can use drop = FALSE when subsetting:
df$Sum <-rowSums(df[,grep("y", names(df)), drop = FALSE])
This keeps df as a data frame even if you are selecting only one column.
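To see why the error happens in the one-column case, compare the class of the two subsets (a small illustration of the default drop = TRUE behaviour, not from the original answer):
class(df[, grep("y", names(df))])                 # a plain vector: rowSums() errors on it
class(df[, grep("y", names(df)), drop = FALSE])   # "data.frame": rowSums() works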

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id  = c(1,2,3,4,5),
                       key = c(1,2,3,4,5),
                       num = c(1,1,1,1,1),
                       v4  = c(1,5,5,5,7),
                       v5  = c(1,5,5,5,7))
My real dataset is bigger and is a mix of mostly numeric and some character variables, and I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within the duplicates command called check.all.
So far, my thoughts have been to use grepl and determine where "anything" is present
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting data frame, I take rowSums and cbind it to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable that tells me how many columns are filled in each row (CompleteNess); however, I'm unsure of how to combine it with duplicate removal.
Simply put: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
#Order by the degree of completeness
Original<-Original[order(CompleteNess),]
#Starting from the bottom select the not duplicated rows
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
This does rearrange your original data frame so beware if there is additional processing later on.
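If you would rather avoid reordering the data frame, a dplyr-based sketch of the same idea (my suggestion, not part of the answer above; it counts non-NA cells instead of using grepl) would be:
library(dplyr)
Finished <- Original %>%
  mutate(CompleteNess = rowSums(!is.na(across(everything())))) %>%  # filled cells per row
  group_by(id, key, num) %>%
  slice_max(CompleteNess, n = 1, with_ties = FALSE) %>%             # keep the most complete row
  ungroup() %>%
  select(-CompleteNess)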
You can aggregate your data and select the row with max score:
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])

How do I optimize a nested for loop using data.table?

I am interested in optimizing some code using data.table. I feel I should be able to do better than my current solution, and it does not scale well as the number of rows increases.
Consider I have a matrix of values, with ID denoting person and the remaining values being traits (lineage in my case). I want to create a logical matrix which reflects whether two IDs (rows) share any values among their rows (including the ID). I have been using data.table lately, but I cannot figure out how to do this more efficiently. I have tried (and failed) at nesting apply statements, and at somehow using the .SD symbol of data.table to accomplish this.
The working code is below.
library(data.table)
m <- matrix(rep(1:10, 2), nrow = 5, byrow = TRUE)
m[c(1,3), 3:4] <- NA
dt <- data.table(m)
setnames(dt, c("id", "v1", "v2", "v3"))
res <- matrix(data = NA, nrow = 5, ncol = 5)
dimnames(res) <- list(dt[, id], dt[, id])
for (i in 1:nrow(dt)) {
  for (j in i:nrow(dt)) {
    res[j, i] <- res[i, j] <- length(na.omit(intersect(as.numeric(dt[i]), as.numeric(dt[j])))) > 0
  }
}
res
I had a similar problem a while ago and somebody helped me out. Here's that help converted to your problem...
tm<-t(m) #transpose the matrix
dtt<-data.table(tm[2:4,]) #take values of matrix into data.table
setnames(dtt,as.character(tm[1,])) #make data.table column names
comblist<-combn(names(dtt),2,FUN=list) #create list of all possible column combinations
preresults<-dtt[,lapply(comblist, function(x) length(na.omit(intersect(as.numeric(get(x[1])),as.numeric(get(x[2]))))) > 0)] #recreate your double for loop
preresults<-melt(preresults,measure.vars=names(preresults)) #change columns to rows
preresults[,c("LHS","RHS"):=lapply(1:2,function(i)sapply(comblist,"[",i))] #add column labels
preresults[,variable:=NULL] #kill unneeded column
I'm drawing a blank on how to get my preresults to be in the same format as your res but this should give you the performance boost you're looking for.
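One rough way to get preresults back into the same square shape as res (a sketch, not benchmarked) is to fill a symmetric logical matrix from the LHS/RHS pairs, with TRUE on the diagonal since every row trivially shares values with itself:
ids <- names(dtt)
res2 <- matrix(TRUE, nrow = length(ids), ncol = length(ids), dimnames = list(ids, ids))
res2[cbind(preresults$LHS, preresults$RHS)] <- preresults$value
res2[cbind(preresults$RHS, preresults$LHS)] <- preresults$value
res2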

What's the most efficient way to partition and access dataframe rows in R?

I need to iterate through a dataframe, df, where
colnames(df) == c('year','month','a','id','dollars')
I need to iterate through all of the unique pairs ('a','id'), which I've found via
counts <- count(df, c('a', 'id'))
uniquePairs <- counts[counts$freq > 10, c('a', 'id')]
Next I iterate through each of the unique pairs, finding the corresponding rows like so (I have named each column of uniquePairs appropriately):
aVec <- as.vector(uniquePairs$a)
idVec <- as.vector(uniquePairs$id)
for (i in 1:nrow(uniquePairs))
{
  a <- aVec[i]
  id <- idVec[i]
  selectRows <- (df$a == a & df$id == id)
  # ... get those rows and do stuff with them ...
  df <- df[!selectRows, ]  # so lookups are slightly faster next time through
  # ...
}
I know for loops are discouraged in general, but in this case I think one is appropriate. It at least seems to me to be beside the point of this question, though maybe a more efficient approach would get rid of the loop entirely.
There are between 10-100k rows in the dataframe, and it makes sense that the relationship between lookup time and nrow(df) would be worse than linear (though I haven't tested it).
Now unique must have seen where each of these pairs occurred, even if it didn't save it. Is there a way to save that off, so that I have a boolean vector I could use for each of the pairs to more efficiently select them out of the dataframe? Or is there an alternate, better way to do this?
I have a feeling that some use of plyr or reshape could help me out, but I'm still relatively new to the large R ecosystem, so some guidance would be greatly appreciated.
data.table is your best option by far:
dt = data.table(df)
dt[, {do stuff in here, then leave results in list form}, by = list(a, id)]
For the simple case of averaging some variable:
dt[, list(Mean = mean(dollars)), by = list(a, id)]
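To also fold in your freq > 10 filter without the separate count step, one sketch (using the column names from your question) returns NULL for small groups directly in j, which data.table then drops:
library(data.table)
dt <- data.table(df)
result <- dt[, if (.N > 10) list(Mean = mean(dollars), N = .N), by = list(a, id)]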
