limit data in aggregate function - r

Is there a way to limit the aggregated data?
An example:
aggregate(cars$speed, FUN=mean, by=list(cars$dist), data=cars)
gives exactly the same output as:
aggregate(cars$speed, FUN=mean, by=list(cars$dist), data=cars[cars$speed >= 15, ])
In this case there are only two variables but in my case I want to limit the data by a third variable. Is this possible within the aggregate function or is a new dataframe necessary?
Thanks a lot.
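A possible explanation, offered as a sketch rather than a definitive answer: in both calls x and by are taken directly from cars (cars$speed, cars$dist), so the data argument appears to have no effect on which rows are used. The formula interface of aggregate does accept a subset argument, which filters rows before aggregating. The mtcars example below is only a stand-in for "limit by a third variable"; it is not from the original question.

```r
# Formula interface: subset= filters rows before aggregating
by_dist <- aggregate(speed ~ dist, data = cars, FUN = mean, subset = speed >= 15)

# Limiting by a third variable (mtcars is used here only as a stand-in dataset):
# mean mpg per cylinder count, restricted to cars with am == 1
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean, subset = am == 1)

# Equivalent with the x/by interface: filter once, then take x and by
# from the filtered copy instead of the full data frame
manual <- mtcars[mtcars$am == 1, ]
mpg_by_cyl2 <- aggregate(manual$mpg, by = list(cyl = manual$cyl), FUN = mean)
```

Either way no permanent new data frame is strictly needed; the filtered copy can be created inline or left to subset.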

Related

Doing operation on multiple numbered tables in R

I'm new to programming in R and I'm working with a huge dataset containing hundreds of variables and thousands of observations. Among these variables is Age, which is my main concern. I want to get the mean of every other variable as a function of Age. I can get smaller tables with this:
for(i in 18:84)
{
n<- sprintf("SortAgeM%d",i)
assign(x=n,subset(SortAgeM,subset=(SortAgeM$AGE>=i & SortAgeM$AGE<i+1)))
}
"SortAgeM85plus"<-subset(SortAgeM,subset=(SortAgeM$AGE>=85 & SortAgeM$AGE<100))
This gives me a sub-dataset for each age I'm concerned with. I would then like to get the mean of each column. Each column holds the measured volume of a specific brain region. I'm interested in how the volume decreases with age, and I would like to be able to tell whether individuals of a given age are close to the mean for their age or not.
Now, I would like to get one more row with the mean for each column. So I tried this:
for(i in 18:85) {
addmargins((SortAgeM%d,i), margin=1, FUN= "mean")
}
But it didn't work... I'm stuck, and I'm not familiar enough with R functions to find a solution on the net...
Thank you for your help.
Victor
Post-answer edit: This is what I finally did:
for(i in 18:84)
{
  n <- sprintf("SortAgeM%d",i)
  # rows for the current one-year age bin
  item <- subset(SortAgeM,subset=(SortAgeM$AGE>=i & SortAgeM$AGE<i+1))
  Adjustment <- rep(NA,7) # the first seven variables aren't numeric
  Line1 <- colMeans(item[,8:217],na.rm=TRUE)
  Line <- c(Adjustment,Line1)
  # store the subset plus its row of column means under the name in n
  assign(x=n, rbind(item,Line))
}
If you simply want an additional row with the mean of each column, you can rbind the colMeans of your df like this (this assumes every column of df is numeric):
df_new <- rbind(df, colMeans(df))
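As an alternative to creating one table per age with assign, the per-age means can be computed in a single call. This is only a sketch under assumptions: the toy data frame and its column names (ID, RegionA, RegionB) are invented for illustration, and pooling everyone aged 85 and over into one group mirrors the question's "85plus" table.

```r
# Toy data standing in for SortAgeM (column names are hypothetical)
set.seed(1)
SortAgeM <- data.frame(
  ID      = sprintf("s%02d", 1:60),
  AGE     = runif(60, min = 18, max = 90),
  RegionA = rnorm(60, mean = 100, sd = 10),
  RegionB = rnorm(60, mean = 50, sd = 5)
)

# Whole-year age bins; ages of 85 and above are pooled into group 85
SortAgeM$AgeGroup <- pmin(floor(SortAgeM$AGE), 85)

# Mean of each numeric column within each age group
num_cols  <- c("RegionA", "RegionB")
age_means <- aggregate(SortAgeM[num_cols],
                       by = list(AgeGroup = SortAgeM$AgeGroup),
                       FUN = mean, na.rm = TRUE)

# Put each individual next to the mean of their own age group
merged <- merge(SortAgeM, age_means, by = "AgeGroup", suffixes = c("", "_mean"))
merged$RegionA_diff <- merged$RegionA - merged$RegionA_mean
```

Keeping everything in one data frame makes it easier to see how far an individual is from the mean for their age.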

how to make groups of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}.
My goal is to compute a specific statistic for each of these groupings and then compare which subset produces the better statistic. My problem is how to produce these combinations.
Thanks in advance.
You need the function combn. It creates all the combinations, of a given size, of the elements of a vector that you provide. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time.
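To cover every subset size rather than only size 3, combn can be called once per size and the results concatenated. The sketch below assumes a toy data frame with random values; the statistic (the overall mean of the selected columns) is just a placeholder for whatever statistic is actually of interest.

```r
# Toy data frame with eight variables V1..V8 (values are random placeholders)
set.seed(1)
df <- as.data.frame(matrix(rnorm(100 * 8), ncol = 8))
names(df) <- paste0("V", 1:8)
varnames <- names(df)

# One list element per non-empty subset of variable names
all_subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

# Placeholder statistic: overall mean of the selected columns
stats <- sapply(all_subsets, function(v) mean(as.matrix(df[v])))

# Label each result with the subset it came from
names(stats) <- sapply(all_subsets, paste, collapse = "+")
```

The list has 2^8 - 1 entries, one for each non-empty subset, so the best subset can be found with which.max or which.min on stats.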
I'll use data.table instead of data.frame, and I'll include an extraneous variable (id) for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=FALSE]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=FALSE])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})

Simulating data from a data frame using ddply

I have some plant data almost identical to the 'iris' data set. I would like to simulate new data using a normal distribution. So for each variable~species in the iris data set I would create 10 new observations from a normal distribution. Basically it would just create a new data frame with the same structure as the old one, but it would contain simulated data. I feel that the following code should get me started (I think the data frame would be in the wrong form), but it will not run.
ddply(iris, c("Species"), function(x) data.frame(rnorm(n=10, mean=mean(x), sd=sd(x))))
rnorm is returning an atomic vector so ddply should be able to handle it.
ddply will subset the rows by Species, but you're doing nothing in the function to iterate over the columns of the subsetted data.frame. You cannot get rnorm() to return a list or data.frame for you; you will need to help with the shaping. How about:
ddply(iris, c("Species"), function(x) {
data.frame(lapply(x[,1:4], function(y) rnorm(10, mean(y), sd(y))))
})
Here we calculate new values for the first 4 columns in each group.
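For comparison, the same idea can be sketched in base R without plyr, using split and lapply. This is an alternative technique, not the answer's method; the names sim_list and sim are made up for the example.

```r
set.seed(123)

# Split the four numeric columns by species, then simulate 10 draws per column
# from a normal distribution with that column's group mean and sd
sim_list <- lapply(split(iris[, 1:4], iris$Species), function(g) {
  as.data.frame(lapply(g, function(y) rnorm(10, mean = mean(y), sd = sd(y))))
})

# Re-attach the species label and stack the groups into one data frame
sim <- do.call(rbind, Map(function(d, sp) cbind(Species = sp, d),
                          sim_list, names(sim_list)))
```

The result has the same columns as iris (in a different order) with 10 simulated rows per species.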

Aggregate Function with Variable By List

I'm trying to create an R Script to summarize measures in a data frame. I'd like it to react dynamically to changes in the structure of the data frame. For example, I have the following block.
library(plyr) #loading plyr just to access baseball data frame
MyData <- baseball[,cbind("id","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,"id"]), FUN=sum)
This block creates a data frame (AggHits) with the total hits (h) for each player (id). Yay.
Suppose I want to bring in the team. How do I change the by argument so that AggHits has the total hits for each combination of "id" and "team"? I tried the following and the second line throws an error: arguments must have same length
MyData <- baseball[,cbind("id","team","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind("id","team")]), FUN=sum)
More generally, I'd like to write the second line so that it automatically aggregates h by all variables except h. I can generate the list of variables to group by pretty easily using setdiff.
# set the list of variables to summarize by as everything except hits
SumOver <- setdiff(colnames(MyData),"h")
# total up all the hits - again this line throws an error
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind(SumOver)]), FUN=sum)
The business purpose I'm using this for involves a csv file which has a single measure ($) and currently has about a half dozen dimensions (product, customer, state code, dates, etc.). I'd like to be able to add dimensions to the csv file without having to edit the script each time.
I should mention that I've been able to accomplish this using ddply, but I know that using ddply to summarize a single measure is wasteful in regards to run time; aggregate is much faster.
Thanks in advance!
ANSWER (specific to example in question)
Block should be
MyData <- baseball[,cbind("id","team","h")]
SumOver <- setdiff(colnames(MyData),"h")
AggHits <- aggregate(x=MyData$h, by=MyData[SumOver], FUN=sum)
The block below aggregates by every non-integer column (ID, Team, League) and, more generally, shows a strategy for aggregating over an arbitrary set of columns (by=MyData[cols.to.group.on]):
MyData <- plyr::baseball
cols <- names(MyData)[sapply(MyData, class) != "integer"]
aggregate(MyData$h, by=MyData[cols], sum)
Here is a solution using the formula interface of aggregate from base R:
data(baseball, package = "plyr")
MyData <- baseball[,c("id","h", "team")]
AggHits <- aggregate(h ~ ., data = MyData, sum)
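Why the original attempt fails can be shown on a built-in dataset. In this sketch ToothGrowth stands in for the baseball data (an assumption made so the example runs without plyr): wrapping a data frame in list() gives a list of length one, whereas by expects one grouping vector per list element, and a data frame already is such a list.

```r
MyData  <- ToothGrowth                        # columns: len, supp, dose
SumOver <- setdiff(colnames(MyData), "len")   # group by everything except the measure

# list(MyData[SumOver]) has a single element (the whole data frame),
# while MyData[SumOver] is itself a list with one element per grouping column
length(list(MyData[SumOver]))
length(MyData[SumOver])

# x/by interface with a dynamic set of grouping columns
agg1 <- aggregate(x = MyData$len, by = MyData[SumOver], FUN = sum)

# Formula interface: "." expands to every column other than len
agg2 <- aggregate(len ~ ., data = MyData, FUN = sum)
```

Both calls adapt automatically when grouping columns are added to the data, as long as the measure column keeps its name.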

computing a subset using a loop

I have a data frame with different variables and I want to build different subsets of it based on some conditions. I want to use a loop because there will be a lot of subsets and this would save a lot of time.
These are the conditions:
Variable A holds an ID for an area and variable B holds different species (1,2,3, etc.), and I want to compute different subsets from these columns. The name of every subset should be the ID of a point, and the content should be all individuals of a certain species at that point.
For a better understanding:
This would be the code for one subset, and I want to use a loop:
A_2_NGF_Abies_alba <- subset(A_2_NGF, subset = Baumart %in% c("Abies alba"))
Is this possible in R?
Thanks
Does this help you?
Baumdaten <- data.frame(pointID=sample(c("A_2_SEF","A_2_LEF","A_3_LEF"), 10, T), Baumart=sample(c("Abies alba", "Betula pendula", "Fagus sylvatica"), 10, T))
split(Baumdaten, Baumdaten[, 1:2])
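To get names close to the A_2_NGF_Abies_alba pattern from the question, the grouping key can be built explicitly before splitting. This is a sketch under assumptions: the underscore separator and the replacement of spaces are choices made here, not something the answer specifies.

```r
set.seed(42)
Baumdaten <- data.frame(
  pointID = sample(c("A_2_SEF", "A_2_LEF", "A_3_LEF"), 10, TRUE),
  Baumart = sample(c("Abies alba", "Betula pendula", "Fagus sylvatica"), 10, TRUE)
)

# One key per point/species combination; drop = TRUE skips combinations with no rows
key <- interaction(Baumdaten$pointID, Baumdaten$Baumart, sep = "_", drop = TRUE)

# Named list of subsets, with spaces in the names replaced by underscores
subsets <- split(Baumdaten, key)
names(subsets) <- gsub(" ", "_", names(subsets))

# The loop runs over the list instead of creating separate objects
for (nm in names(subsets)) {
  cat(nm, ":", nrow(subsets[[nm]]), "rows\n")
}
```

Keeping the subsets in one named list tends to be easier to work with than creating many separate objects; an individual subset is then retrieved by name with subsets[["..."]].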
