I have data in a data.frame and would like to aggregate some columns (using any general function), grouping by some other columns and keeping the remaining ones as they are (or even omitting them), much like GROUP BY in SQL. As an example, let us assume we have
df <- data.frame(a=rnorm(4), b=rnorm(4), c=c("A", "B", "C", "A"))
and I want to sum (say) the values in column a and average (say) the values in column b, grouping by the symbols in column c. I am aware it is possible to achieve this using apply, cbind or similar, specifying the functions you want to use, but I was wondering whether there is a smarter (one-line) way to do so, especially using the aggregate function.
Sorry but I don't follow how dealing with more than one column complicates things.
library(data.table)
dt <- data.table(df)
dt[,.(sum_a = sum(a),mean_b= mean(b)),by = c]
like this?
mapply(Vectorize(function(x, y) aggregate(
df[, x], by=list(df[, 3]), FUN=y), SIMPLIFY = F),
1:2, c('sum', 'mean'))
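Since the question asks about aggregate specifically, here is a minimal base R sketch of one way to apply a different function to each column: run aggregate once per column and merge the results on the grouping column. The fixed values in df and the names sum_a, mean_b and res are assumptions made for illustration; they are not from the original posts.

```r
# Hypothetical fixed data (instead of rnorm) so the result is easy to check
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(10, 20, 30, 40),
                 c = c("A", "B", "C", "A"))

# One aggregate() call per column, each with its own function
sum_a  <- aggregate(a ~ c, data = df, FUN = sum)
mean_b <- aggregate(b ~ c, data = df, FUN = mean)

# Join the two summaries on the grouping column
res <- merge(sum_a, mean_b, by = "c")
res
```

This is two calls rather than one, so it is not the one-liner asked for, but it stays in base R and scales to further columns by merging more summaries.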
How could I calculate the rowMeans of a data.frame based on matching column names?
Ex)
c1=rnorm(10)
c2=rnorm(10)
c3=rnorm(10)
out=cbind(c1,c2,c3)
out=cbind(out,out)
I realize that the values are the same, this is just for demonstration.
Each row is a specific measurement type (consider it a factor).
Imagine c1 = compound 1, c2 = compound 2, etc.
I want to group together all the c1's and average their rows together, then repeat for all unique(colnames(out)).
My idea was something like:
avg = rowMeans(out, by = unique(colnames(out)))
but obviously this doesn't work...
Try this:
sapply(unique(colnames(out)), function(i)
rowMeans(out[,colnames(out) == i]))
As @Laterow points out in the comments, having duplicate column names will lead to trouble at some point; if not here, elsewhere in your code. Best to nip it in the bud now.
If you are starting with duplicate column names, use make.unique on the colnames first to append .n where n increments for each duplicate starting at .1 for the first duplicate, leaving the initial unique names as is:
colnames(out) <- make.unique(colnames(out));
Once that's done (or, as OP explained in the comments, if it was already being done silently by the column-creating function), you can do your rowMeans operation using the starts_with helper inside dplyr::select to group columns by prefix. Since select works on data frames, convert the matrix first:
library(dplyr);
avg_c1 <- rowMeans(select(as.data.frame(out), starts_with("c1")));
If you have a large number of columns, instead of specifying them individually, you can use the code below to have it create a data frame of the rowMeans regardless of input size:
case_count <- as.integer(sub('^c\\d+\\.(\\d+)$', '\\1', colnames(out)[ncol(out)])) + 1L;
var_count <- as.integer(ncol(out) %/% case_count);
avg_c <- as.data.frame(matrix(nrow = var_count , ncol = nrow(out)));
for (i in 1:var_count) {
avg_c[i, 1:nrow(out)] <- rowMeans(select(as.data.frame(out), starts_with(paste0("c", i))));
}
As @Tensibai points out in comments, this solution may not be efficient, and may be overkill depending on your actual data set. You may not need the flexibility it provides and there's probably a more succinct way to do it.
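One possibly more succinct route, sketched here under assumptions, is to strip the suffix that make.unique appends and group columns by the remaining base name. The small matrix below and the names base and avg are hypothetical, chosen only so the result can be checked by hand.

```r
# Hypothetical small matrix with duplicated column names
out <- cbind(c1 = c(1, 2), c2 = c(3, 4), c1 = c(5, 6), c2 = c(7, 8))
colnames(out) <- make.unique(colnames(out))  # "c1" "c2" "c1.1" "c2.1"

# Drop the ".n" suffix to recover the original group name of each column
base <- sub("\\.\\d+$", "", colnames(out))

# Row means within each group; drop = FALSE keeps single-column groups as matrices
avg <- sapply(unique(base), function(nm) {
  rowMeans(out[, base == nm, drop = FALSE])
})
avg
```

Matching on the exact base name, rather than on a prefix, avoids accidentally lumping c1 together with names such as c10.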
EDIT1: Based on OP comments
EDIT2: Based on comments, handle all rowMeans at once
EDIT3: Fixed code bugs and clarified starting point reasoning based on comments
I have a data frame with factor columns. Here is a tiny example:
dat <- data.frame(one = factor(c("a", "b")), two = factor(c("c", "d")))
I can calculate the means of the numeric values that underlie the factor labels for each column:
mean(as.integer(dat$one))
[1] 1.5
But since there are very many columns in my data frame, I would like to avoid having to calculate all the individual means and would rather do something like:
colMeans(dat)
which doesn't work, since the columns are factors, or
colMeans(as.integer(dat))
which doesn't work either.
So how can I easily calculate the means of all factor columns, without a loop or individually calculating them all?
Do I really have to change the class of all columns?
The data.matrix function is pretty much designed for such a task. It leaves numeric and integer columns as they are, if present, which reduces memory usage, though the conversion to a matrix can sometimes be an overhead. So as long as you don't have character columns, this should be pretty straightforward:
colMeans(data.matrix(dat))
# one two
# 1.5 1.5
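To illustrate the point about mixed column types, here is a small sketch with one factor column and one numeric column. The data frame dat2 and its values are hypothetical.

```r
# Hypothetical data frame mixing a factor column and a numeric column
dat2 <- data.frame(f = factor(c("x", "y", "y")),
                   n = c(1.5, 2.5, 3.5))

# Factor columns become their integer codes; numeric columns are kept as-is
m <- data.matrix(dat2)
means <- colMeans(m)
means
```

Here the factor f has codes 1, 2, 2, so its mean is 5/3, while n keeps its original values and averages to 2.5.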
We can use lapply
lapply(dat, function(x) mean(as.integer(x)))
Or with dplyr
library(dplyr)
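# summarise_each() and funs() are deprecated in current dplyr; across() replaces them
dat %>%
  summarise(across(everything(), ~ mean(as.integer(.x))))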
For big datasets, it may be better to calculate the mean of each column separately, as converting to a matrix may also create memory issues.
Write a simple function that uses a for loop to write all of the values into a vector.
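dat <- data.frame(one = c(1:10), two = c(1:10))
# Named col_means so that it does not mask base::colMeans
col_means <- function(tablename) {
  colmean <- numeric(ncol(tablename))
  for (i in seq_len(ncol(tablename))) {
    # as.integer() makes this work for factor columns as well
    colmean[i] <- mean(as.integer(tablename[, i]))
  }
  return(colmean)
}
col_means(dat)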
Hope this works
You can also use the data.table package, which is faster than a plain data.frame. If your data is big, e.g. millions of observations, then data.table can help to optimize run time.
Below is the code:
library(data.table)
dat <- data.table(one = factor(c("a", "b")), two = factor(c("c", "d")))
factorCols <- c("one", "two")
dat[, lapply(.SD, FUN=function(x) mean(as.integer(x))), .SDcols=factorCols]
There are many posts which discuss applying a function over many columns when using data.table. However I need to calculate a function which depends on many columns. As an example:
library(data.table)
# Create a data.table with 26 columns. Variable names are var1, ..., var26
data.mat = matrix(sample(letters, 26*26, replace=TRUE),ncol=26)
colnames(data.mat) = paste("var",1:26,sep="")
data.dt <- data.table(data.mat)
Now, say I would like to count the number of 'a's in columns 5,6,7 and 8.
I cannot see how to do this with SDcols and end up doing:
data.dt[,numberOfAs := (var5=='a')+(var6=='a')+(var7=='a')+(var8=='a')]
Which is very tedious. Is there a more sensible way to do this?
Thanks
I really suggest going through the vignettes linked here. Section 2e from the Introduction to data.table vignette explains .SD and .SDcols.
.SD is just a data.table containing the data for the current group, and .SDcols specifies which columns .SD should contain. A useful way to see its content is to print it.
# .SD contains cols 5:8
data.dt[, print(.SD), .SDcols=5:8]
Since there is no by here, .SD contains all the rows of data.dt, corresponding to the columns specified in .SDcols.
Once you understand this, the task reduces to your knowledge of base R really. You can accomplish this in more than one way.
data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols=5:8]
We return a logical matrix by comparing all the columns in .SD to "a". And then use rowSums to sum them up.
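The same idea on a tiny table makes the result easy to verify by hand. The table dt, its column names and the new column n_a are hypothetical.

```r
library(data.table)

# Hypothetical 3-row table; only v1 and v2 are inspected
dt <- data.table(id = 1:3,
                 v1 = c("a", "b", "a"),
                 v2 = c("a", "a", "c"),
                 v3 = c("b", "a", "a"))

# .SD holds just v1 and v2; comparing it to "a" gives a logical matrix,
# and rowSums counts the TRUE values in each row
dt[, n_a := rowSums(.SD == "a"), .SDcols = c("v1", "v2")]
dt$n_a
```

Row 1 has two matches, while rows 2 and 3 have one each; v3 is ignored because it is not listed in .SDcols.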
Another way using Reduce:
data.dt[, numberOfAs := Reduce(`+`, lapply(.SD, function(x) x == "a")), .SDcols=5:8]
I need to modify certain columns of specific rows of a data.table. I keep getting an error, "unused argument (with=F)". Does anyone know how to quickly deal with this? Below is an example using both data.frames and data.table.
Thanks.
library(data.table)
test.df <- data.frame(a=rnorm(100, 0, 1), b=rnorm(100, 0, 1), c=rnorm(100,0,1))
test.dt <- as.data.table(test.df)
test.df[test.df$a<test.df$b,c(1,2)] <- 10* test.df[test.df$a<test.df$b,c(1,2)]
test.dt[test.dt$a<test.dt$b, c(1,2), with=F] <- 10* test.dt[,c(1,2),with=F][test.dt$a<test.dt$b, c(1,2), with=F]
First of all, you do not need to, and (as a matter of good programming) should not, use the data.table's name inside [.data.table.
Secondly, you should avoid using column numbers whenever you can, since they are a source of future headaches; aim to use column names instead.
Finally, the way to change columns in a data.table is to use the := operator, which modifies in place (see ?':=').
Combining all of the above, this is what you should do:
test.dt[a < b, `:=`(a = 10 * a, b = 10 * b)]
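When many columns need the same transformation, spelling each one out becomes tedious. A sketch of the same update driven by a character vector of column names follows; the table dt, its values and the vector cols are hypothetical.

```r
library(data.table)

# Hypothetical small table; only row 1 satisfies a < b
dt <- data.table(a = c(1, 5, 2), b = c(3, 4, 1), c = c(7, 8, 9))

# Columns to modify, given by name
cols <- c("a", "b")

# Within the rows where a < b, multiply every column in cols by 10, in place
dt[a < b, (cols) := lapply(.SD, function(x) 10 * x), .SDcols = cols]
dt
```

Wrapping cols in parentheses on the left of := tells data.table to use the names stored in the variable rather than create a column called cols.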
I have the following data frame:
df <- data.frame(
Target=rep(LETTERS[1:3],each=8),
Prov=rep(letters[1:4],each=2),
B=rep("5MB"),
S=rep("1MB"),
BUF=rep("8kB"),
M=rep(c('g','p')),
Thr.mean=1:24)
whose column Thr.mean I would like to normalize by the values where Target=='C' (I don't mind attaching a new column).
To clarify, I would like to end up with:
Thr.mean <- c(1/17,2/18,3/19,4/20,5/21,6/22,7/23,8/24,9/17,10/18,11/19,12/20,13/21,14/22,15/23,16/24,1,1,1,1,1,1,1,1)
Now, it may happen that there are rows in this data frame, where Target!='C', and they have values in S or B that are not present in rows where Target=='C', and for these I would also like to calculate the overhead. The most important column for matching is M, then BUF, B, and S.
Any ideas how to do it? I could write several loops and ifs, but I'm looking for a more elegant solution.
For posterity, this is how I solved my problem using data.table:
DT <- data.table(df)
DT[, Thr.Norm.C := .SD[Target=='C', Thr.mean], by = 'B,BUF,Prov']
DT[, over.thr := Thr.Norm.C/Thr.mean]
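A variant of this approach, offered as a sketch rather than as the original solution, also groups by M so that every group contains exactly one Target == 'C' row, which means the assigned reference value always has length 1. Including M is an assumption based on the question naming it the most important matching column. The column name Thr.ratio is hypothetical, and the ratio is taken as Thr.mean divided by the reference value so that it reproduces the target vector given in the question.

```r
library(data.table)

# Data frame from the question
df <- data.frame(
  Target = rep(LETTERS[1:3], each = 8),
  Prov = rep(letters[1:4], each = 2),
  B = rep("5MB"),
  S = rep("1MB"),
  BUF = rep("8kB"),
  M = rep(c("g", "p")),
  Thr.mean = 1:24)

DT <- data.table(df)

# Reference value: the Thr.mean of the single Target == "C" row in each group
DT[, Thr.Norm.C := Thr.mean[Target == "C"], by = .(B, BUF, Prov, M)]

# Normalized throughput relative to Target "C"
DT[, Thr.ratio := Thr.mean / Thr.Norm.C]
DT$Thr.ratio
```

Groups without a Target == 'C' row are not handled by this sketch; they would need a fallback on a coarser set of grouping columns.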