Get one sum from multiple columns - r

I imagine there is a simple function for this but I can't seem to find it. I have five columns within a larger data frame that I want to add to get a single sum. Here's what I did, but I am wondering if there is a much simpler way to get the same result:
count <- subset(NAMEOFDATA, select=c(COL1,COL2,COL3,COL4,COL5))
colcount <- as.data.frame(colSums(count))
colSums(colcount)

The sum function should do that:
sum(count)
Unlike "+" which is vectorized, sum will "collapse" its arguments and it will accept a data.frame argument. If some of the arguments are logical, then TRUE==1 and FALSE==0 for purposes of summation, which makes the construction sum(is.na(x)) possibly useful.
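A minimal sketch of both points, on toy data not taken from the question:

```r
# sum() collapses every column of a data.frame into one grand total
df <- data.frame(a = 1:3, b = 4:6)
sum(df)        # 21

# logicals coerce to 0/1 under summation, so sum() counts TRUEs
x <- c(1, NA, 3, NA)
sum(is.na(x))  # 2
```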

Always easier with a reproducible example, but here's an attempt:
sum( NAMEOFDATA[, paste0("COL", seq(5))] )
Note that apply( NAMEOFDATA[, paste0("COL", seq(5))], 1, sum ) would give one sum per row rather than a single total; if per-row sums are what you want, rowSums() does the same thing faster.

How to get top n companies from a data frame in decreasing order

I am trying to get the top 'n' companies from a data frame. Here is my code below.
data("Forbes2000", package = "HSAUR")
sort(Forbes2000$profits,decreasing=TRUE)
Now I would like to get the top 50 observations from this sorted vector.
head and tail are really useful functions!
head(sort(Forbes2000$profits,decreasing=TRUE), n = 50)
If you want the first 50 rows of the data.frame, then you can use the arrange function from plyr to sort the data.frame and then use head
library(plyr)
head(arrange(Forbes2000,desc(profits)), n = 50)
Notice that I wrapped profits in a call to desc which means it will sort in decreasing order.
To work without plyr
head(Forbes2000[order(Forbes2000$profits, decreasing= T),], n = 50)
Use order to sort the data.frame, then use head to get only the first 50 rows.
data("Forbes2000", package = "HSAUR")
head(Forbes2000[order(Forbes2000$profits, decreasing=TRUE), ], 50)
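Since Forbes2000 needs the HSAUR package, here is the same order()-plus-head() pattern on a made-up data frame:

```r
set.seed(1)
toy <- data.frame(company = paste0("C", 1:10), profits = sample(1:100, 10))
# sort by profits (largest first), then keep the first 3 rows
top3 <- head(toy[order(toy$profits, decreasing = TRUE), ], 3)
```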
You can use rank from dplyr.
library(dplyr)
top_fifty <- Forbes2000 %>%
filter(rank(desc(profits))<=50)
This keeps only the rows whose profit rank is in the top 50. Note that filter() does not itself sort the data; add arrange(desc(profits)) if you also want the result in descending order.
Dplyr is very useful. The commands and chaining syntax are very easy to understand. 10/10 would recommend.
Mnel is right that in general you want to use the head() and tail() functions along with a sorting function. I should mention, though, that for medium-sized data sets Vince's method works faster. If you didn't use head() or tail(), you could use the basic subsetting operator []:
library(plyr)
x = arrange(Forbes2000,desc(profits))
x = x[1:50,]
# Or using order
x = Forbes2000[order(Forbes2000$profits, decreasing= T),]
x = x[1:50,]
However, I really do recommend the head(), tail(), or filter() functions, because the regular [] operator assumes your data is structured in an easily indexed array or matrix format. (Hopefully this answers Teja's question.)
Now, which package you choose is largely subjective. Reading people's comments, though, I will say that the choice between plyr's arrange(), base's order() with utils' head() and tail(), or dplyr largely depends on the memory footprint and row count of your dataset. I could go into more detail about how plyr and sometimes dplyr have problems with large, complex datasets, but I don't want to get off topic.
P.S. This is one of my first times answering so feedback is appreciated.

first row for non-aggregate functions

I use ddply to avoid redundant calculations.
I am often dealing with values that are conserved within the split subsets, and doing non-aggregate analysis. A toy example of the problem:
ddply(baseball,.(id,year),function(x){paste(x$id,x$year,sep="_")})
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
To avoid this, I take the first row of each mini data frame:
ddply(baseball, .(id, year), function(x){paste(x$id[1], x$year[1], sep="_")})
Is there a different approach or a helper I should be using? This syntax seems awkward.
--
Note: paste in my example is just for show - don't take it too literally. Imagine this is the actual function:
ddply(baseball, .(id, year), function(x){the_slowest_function_ever(x$id[1], x$year[1])})
You might find data.table a little easier and faster in this case. The equivalent of .() variables is by= :
DT[, { paste(id,year,sep="_") }, by=list(id,year) ]
or
DT[, { do.call("paste",.BY) }, by=list(id,year) ]
I've shown the {} to illustrate you can put any (multi-line) anonymous body in j (rather than a function), but in these simple examples you don't need the {}.
The grouping variables are length 1 inside the scope of each group (which seems to be what you're asking), for speed and convenience. .BY contains the grouping variables in one list object as well, for generic access when the by criteria are decided programmatically on the fly; i.e., when you don't know the by variables in advance.
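A runnable sketch of this on toy data (assuming the data.table package is installed; DT and its columns are made up for illustration):

```r
library(data.table)
DT <- data.table(id = c("a", "a", "b"), year = c(2001, 2001, 2002), val = 1:3)
# id and year are length-1 inside each group, so paste() runs once per group
DT[, .(label = paste(id, year, sep = "_")), by = .(id, year)]
# the same thing via .BY, without naming the columns in j
DT[, .(label = do.call("paste", c(.BY, sep = "_"))), by = .(id, year)]
```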
You could use:
ddply(baseball, .(id, year), function(x){data.frame(paste(x$id,x$year,sep="_"))})
When you return a vector, putting it back together as a data.frame makes each entry a column. But there are different lengths, so they don't all have the same number of columns. By wrapping it in data.frame(), you make sure that your function returns a data.frame that has the column you want rather than relying on the implicit (and in this case, wrong) transformation. Also, you can name the new column easily within this construct.
UPDATE:
Given you only want to evaluate the function once (which is reasonable), then you can just pull the first row out by itself and operate on that.
ddply(baseball, .(id, year), function(x) {
x <- x[1,]
paste(x$id, x$year, sep="_")
})
This will (by itself) have only a single row for each id/year combo. If you want it to have the same number of rows as the original, then you can combine this with the previous idea.
ddply(baseball, .(id, year), function(x) {
firstrow <- x[1,]
data.frame(label=rep(paste(firstrow$id, firstrow$year, sep="_"), nrow(x)))
})

R, create a new column in a data frame that applies a function of all the columns with similar names

I have a data frame in which the names of the columns are something like a,b,v1,v2,v3...v100.
I want to create a new column that applies a function to only the columns whose names include 'v'.
For example, given this data frame
df<-data.frame(a=rnorm(3),v1=rnorm(3),v2=rnorm(3),v3=rnorm(3))
I want to create a new column in which each element is the sum of the elements of v1, v2 and v3 that are in the same row.
grep on names to get the column positions, then use rowSums:
rowSums(df[,grep("v",names(df))])
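For instance, with a toy data frame in the shape the question describes:

```r
df <- data.frame(a = c(10, 20, 30), v1 = 1:3, v2 = 4:6, v3 = 7:9)
# pick out the v* columns by position, then sum across each row
df$vsum <- rowSums(df[, grep("v", names(df))])
df$vsum  # 12 15 18
```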
To combine both #James's and #Anatoliy's answers,
apply(df[grepl('^v', names(df))], 1, sum)
I went ahead and anchored the v in the regular expression to the beginning of the string. Other examples haven't done that, but it appears that you want all columns that begin with v, not the larger set that may have a v anywhere in their name. If I am wrong, you could just do
apply(df[grepl('v', names(df))], 1, sum)
You should avoid using subset() when programming, as stated in ?subset
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
Also, as I learned yesterday from Richie Cotton, when indexing it is better to use grepl than grep.
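The difference is easy to see on a small vector of names:

```r
nms <- c("a", "v1", "v2", "b")
grep("v", nms)   # integer positions: 2 3
grepl("v", nms)  # logical mask: FALSE TRUE TRUE FALSE
# a logical mask is the same length as the input, so it can be negated
nms[!grepl("v", nms)]  # "a" "b"
```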
That should do:
df$sums<- rowSums(subset(df, select=grepl("v", names(df))))
For a more general approach:
apply(subset(df, select=grepl("v", names(df))), 1, sum)

How do I sub sample data by group using ddply?

I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset.
I wrote a function to subset a data frame as follows:
samp <- function(dataf)
{
dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),]
}
Now I want to apply this function to each species in a larger data frame.
When I try something like
culled_data = ddply (larger_data, .(species), subset, samp)
I get this error:
Error in subset.data.frame(piece, ...) :
'subset' must evaluate to logical
Anyone got ideas on how to do this?
It looks like it should work once you remove ", subset" from your call and pass samp directly as the function.
Dirk's answer is of course correct, but to add some explanation I'll post my own.
Why doesn't your call work?
First of all, your syntax is a shorthand. It's equivalent to
ddply(larger_data, .(species), function(dfrm) subset(dfrm, samp))
so you can clearly see that you are providing a function (see class(samp)) as the second argument of subset. samp(dfrm) won't work either, because samp returns a data.frame while subset needs a logical vector. samp(dfrm) would only work if it returned a logical index.
How to use subset in this case?
Make subset work by feeding it a logical vector:
ddply (larger_data, .(species), subset, sample(seq_along(species) <= 40))
This creates a logical vector whose first 40 elements are TRUE, then shuffles it. (As a bonus, when a species has fewer than 40 cases, every element is TRUE and all of its rows are returned.)
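For comparison, the same per-group sampling in base R (the species column and sizes here are made up), which also avoids errors when a group has fewer rows than the sample size:

```r
set.seed(42)
toy <- data.frame(species = rep(c("A", "B"), each = 100), val = rnorm(200))
# draw up to 40 random rows from each species, then stitch the pieces back together
sampled <- do.call(rbind, lapply(split(toy, toy$species), function(d)
  d[sample(nrow(d), min(40, nrow(d))), ]))
table(sampled$species)  # 40 of each
```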

Row/column counter in 'apply' functions

What if one wants to apply a function to each row of a matrix, but also wants to use the number of that row as an argument to the function? As an example, suppose you wanted to get the n-th root of the numbers in each row of a matrix, where n is the row number. Is there another way (using apply only) than column-binding the row numbers to the initial matrix, like this?
test <- data.frame(x=c(26,21,20),y=c(34,29,28))
t(apply(cbind(as.numeric(rownames(test)),test),1,function(x) x[2:3]^(1/x[1])))
P.S. Actually, if test were really a matrix: test <- matrix(c(26,21,20,34,29,28), nrow=3), then rownames(test) doesn't help :(
Thank you.
What I usually do is to run sapply on the row numbers 1:nrow(test) instead of test, and use test[i,] inside the function:
t(sapply(1:nrow(test), function(i) test[i,]^(1/i)))
I am not sure this is really efficient, though.
If you give the function a name rather than making it anonymous, you can pass arguments more easily. We can use nrow to get the number of rows and pass a vector of the row numbers in as a parameter, along with the frame to be indexed this way.
For clarity I used a different example function; this example multiplies column x by column y for a 2 column matrix:
test <- data.frame(x=c(26,21,20),y=c(34,29,28))
myfun <- function(position, df) {
print(df[position,1] * df[position,2])
}
positions <- 1:nrow(test)
lapply(positions, myfun, test)
cbind()ing the row numbers seems a pretty straightforward approach. For a matrix (or a data frame) the following should work:
apply( cbind(1:(dim(test)[1]), test), 1, function(x) plot(x[-1], main=x[1]) )
or whatever you want to plot.
Actually, in the case of a matrix, you don't even need apply. Just:
test^(1/row(test))
does what you want. The row() function is the thing you were looking for, I think.
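To see it on the question's numbers: row() returns a matrix of row indices the same shape as its argument, so the exponents line up element by element:

```r
m <- matrix(c(26, 21, 20, 34, 29, 28), nrow = 3)
row(m)  # same shape as m; every entry in row n is n
res <- m^(1/row(m))
# matches the sapply-over-row-numbers approach from the other answer
all.equal(res, t(sapply(1:nrow(m), function(i) m[i, ]^(1/i))))
```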
I'm a little confused, so excuse me if I get this wrong, but you want to work out the n-th root of the numbers in each row of a matrix, where n is the row number. If that's the case, it's really simple: create a new array with the same dimensions as the original, in which every entry in row n is n:
test_row_order = array(seq_len(nrow(test)), dim = dim(test))
Then simply apply the function (the n-th root in this case):
n_root = test^(1/test_row_order)
