I have a longitudinal spreadsheet that contains different growth variables for many individuals. At the moment my R code looks like this:
D5 <- ifelse(growth$agyr == 5, growth$R.2ND.DIG.AVERAGE, NA)
Since it is longitudinal, I have the same measurement for each individual at multiple ages, hence the variable agyr. In this example it takes all kids who have a finger measurement at age 5.
What I would like to do is repeat that for all ages, so that I don't have to define an object every time and can essentially run some summary stats on finger length for any given agyr. Surely this is possible, but I am still a beginner at R.
tapply() is your friend here. For the mean, for example:
with(growth,
     tapply(R.2ND.DIG.AVERAGE, agyr, mean)
)
See also ?tapply and a good introductory book on R. And see ?with, a function that can really make your code a lot more readable.
If you have multiple levels you want to average over, you can give tapply() a list of factors. Say gender is a variable as well (a factor!); then you can do, e.g.:
with(growth,
     tapply(R.2ND.DIG.AVERAGE, list(agyr, gender), mean)
)
tapply() returns an array-like structure (a vector, matrix or multidimensional array, depending on the number of categorizing factors). If you want your results in a data frame, and/or want to summarize multiple variables at once, look at ?aggregate, e.g.:
thevars <- c("R.2ND.DIG.AVERAGE", "VAR2", "MOREVAR")
aggregate(growth[thevars], by = list(agyr = growth$agyr, gender = growth$gender), FUN = mean)
or using the formula interface:
aggregate(cbind(R.2ND.DIG.AVERAGE, VAR2, MOREVAR) ~ agyr + gender,
          data = growth, FUN = mean)
Make sure you check the help files as well. Both tapply() and aggregate() are quite powerful and have plenty of other possibilities.
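For instance, both pass extra arguments on to the summarizing function, and that function may return more than one number. A small sketch, assuming the same growth data as above:
# ignore missing measurements when averaging
with(growth, tapply(R.2ND.DIG.AVERAGE, agyr, mean, na.rm = TRUE))
# a full summary per age instead of just the mean
with(growth, tapply(R.2ND.DIG.AVERAGE, agyr, summary))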
I have a pivot table I created. This table aggregates data by region and can be filtered by 2 categories (Age, Income). How do I create a table such that each category combination (e.g. Toddler & below 50% FPL, and Toddler & all incomes) is represented within each aggregation? So far, I am filtering for all combinations of Age and Income and just copying and pasting into a new spreadsheet. I linked a video where I show what I mean:
https://drive.google.com/file/d/1kUvDNxijXWZyJCCdVy398Gd0uq8vUFvY/view?usp=sharing
I am open to doing this in Excel or R.
Thank you very much for your help,
Rouzbeh
To do this in Excel you would need a third-party add-in, or at least some VBA code.
In R you can find a solution. There is a similar question here, although it hasn't been marked as answered.
R Solution
In base R you can pivot using aggregate(). There are similar functions in other libraries such as reshape2, data.table and dplyr. If you feel comfortable with those libraries, look for their aggregation or group-by functions.
Sample data: a data frame (data) with Age, Income and County columns plus an Eligibility flag column (shown as an image in the original post).
I do not know if you have a flag that determines whether a subject is eligible. If you do, I will use a custom aggregation; if not, you could use any of the traditional aggregation functions.
# Custom function counting "x" flags
counEle <- function(x) {
  length(which(x == "x"))
}
Then:
# Create all possible combinations of Age and Income
combination <- expand.grid(unlist(data["Age"]), unlist(data["Income"]))
# Eliminate duplicated combinations
combination <- unique(combination)
# Loop that filters and aggregates for each combination
for (i in 1:nrow(combination)) {
  new_data <- data[data$Age == combination[i, 1] & data$Income == combination[i, 2], ]
  # Check that the data frame has more than 0 rows so it can be aggregated
  if (nrow(new_data) > 0) {
    # Aggregate the data with the values in place
    print(try(aggregate(Eligibility ~ Age + Income + County, new_data, counEle)))
  }
}
The total count is in the Eligibility column, which is the one we wanted to measure. This should output all possible combinations (mind the error handling by try()). If you want to ignore combinations where the count is 0, you could add an additional conditional step checking for > 0. Then write each result to a CSV, or use a library to write it to an Excel tab.
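For completeness, a small sketch of that last step: collecting the per-combination aggregates in a list, dropping zero counts, and writing a CSV (results.csv is just an illustrative file name):
results <- list()
for (i in 1:nrow(combination)) {
  new_data <- data[data$Age == combination[i, 1] & data$Income == combination[i, 2], ]
  if (nrow(new_data) > 0) {
    agg <- aggregate(Eligibility ~ Age + Income + County, new_data, counEle)
    agg <- agg[agg$Eligibility > 0, ]  # ignore combinations where the count is 0
    if (nrow(agg) > 0) results[[length(results) + 1]] <- agg
  }
}
write.csv(do.call(rbind, results), "results.csv", row.names = FALSE)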
I want to calculate averages for categorical data. My data is in a long format, and I do not understand why I am not succeeding.
Here is an example (imagine it as individual participants, indicated by id, picking different options, in this example m_ex):
id <- c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
m_ex <- c("a", "b", "c", "b", "a", "b", "c", "b", "a", "a", "c")
df <- data.frame(id, m_ex)
print(df)
I want to calculate averages for m_ex, that is, the average number of times each m_ex option is picked. I am trying to achieve this with dplyr, but I do not quite understand how to proceed given that the ids have different lengths. What would I have to divide by then? And is it a problem that the ids do not have equal lengths?
I really appreciate any help you can provide.
I have tried using dplyr, grouping by id and summarizing the results, without much success. I would particularly like to understand what I am missing right now.
I get something like this, but how do I get the averages?
(example picture of the intermediate output: https://i.stack.imgur.com/7nxze.jpg)
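One way to read this is: count how many times each participant picked each option, then average those counts across all participants (so a participant who never picked an option counts as zero). A minimal dplyr sketch under that interpretation (avg_picks is just an illustrative name):
library(dplyr)
df %>%
  count(id, m_ex) %>%   # picks per participant per option
  group_by(m_ex) %>%
  # divide by the number of participants rather than averaging only the
  # rows present, so zero-pick participants still count
  summarise(avg_picks = sum(n) / n_distinct(df$id))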
I'm currently calculating a bunch of accuracy measurements for 80k different items. Each item's measurement needs to be calculated independently, but this is currently taking too long, so I want to find a faster way to do it.
Here's my code in R with its comments:
work_file: contains 4 variables: item_id, Dates, demand and forecast
My code:
library(dplyr)  # for %>% and filter(); accuracy() comes from another package (e.g. forecast)

output <- 0
uniques <- unique(work_file$item_id)
for (i in uniques) {
  # filter every unique item
  temporal <- work_file %>% filter(item_id == i)
  # calculate the accuracy measure for each item
  x <- temporal$demand
  x1 <- temporal$forecast
  item_error <- c(i, accuracy(x1, x))
  output <- rbind(output, item_error)
}
For ~80k unique items this takes hours. Any suggestions?
R is a vectorized language, so one can often avoid explicit loops. Also, binding within a loop is especially slow, since the output data structure is copied and re-created on each iteration.
Provided the accuracy() function can accept vector input, this should work (without sample data to test, there is always some doubt):
answer <- work_file %>%
  group_by(item_id) %>%
  summarize(accuracy(forecast, demand))
Here dplyr's group_by() collects the different item_ids and then passes each group's vectors to the accuracy function inside summarize().
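One assumption worth flagging: if accuracy() here is forecast::accuracy(), it returns a matrix of several measures (ME, RMSE, MAE, ...) rather than a single number, so inside summarize() you may want to pick one out. A sketch assuming that package and that RMSE is the measure of interest:
answer <- work_file %>%
  group_by(item_id) %>%
  summarize(rmse = accuracy(forecast, demand)["Test set", "RMSE"])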
Consider using data.table methods, which would be efficient:
library(data.table)
setDT(work_file)[, .(acc = accuracy(forecast, demand)), item_id]
I am hoping to efficiently combine my regressions using plyr functions. I have data frames with monthly data for multiple years, named yYYYY (so y2014, y2013, etc.).
Right now, I have the code below for one of those dfs, y2014. I am running the regressions by month, as desired, within each year.
library(plyr)  # for ldply()
modelsm2 <- by(y2014, y2014$Date, function(x) lm(y ~ ., data = x))
summarym2 <- lapply(modelsm2, summary)
coefficientsm2 <- lapply(modelsm2, coef)
coefsm2v2 <- ldply(modelsm2, coef)  # to get the coefficients into an exportable df
I have several things I'd like to do and I would really appreciate your help!
A. Extract the r^2 for each model. I know that for one model you can do summary(model)$r.squared, but I have not had luck with my construct.
B. Apply the same methodology in a loop-type structure to run the models for all of my data frames (y2013 and backwards).
C. Get the summaries into an easily exportable (to Excel) format; the ldply function does not work for the summaries.
Thanks again.
A. You need to subset out the r.squared values from your summaries:
lapply(summarym2,"[[","r.squared")
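If you'd rather have a plain named vector, for instance to build a data frame for export, sapply() does the same extraction. A small sketch (the column names are just illustrative):
r2 <- sapply(summarym2, "[[", "r.squared")
data.frame(Date = names(r2), r.squared = r2)  # the names come from the by() groups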
B. Put all your data into a list, and wrap another lapply() around it, e.g.:
lmlist <- lapply(list(y2014, y2013, y2012), function(dat)
  by(dat, dat$Date, function(x) lm(y ~ ., data = x))
)
You will then have a list of lists, so, for instance, to extract the summaries you would use:
lapply(lmlist,lapply,summary)
C. summary() returns a fairly complex data structure that cannot be coerced into a data.frame; the result you see is a consequence of its print method. You can use capture.output() to get a character vector containing each line of the printed output, which you can then write to a file.
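For instance, a minimal sketch of dumping all the printed summaries to a text file (summaries2014.txt is just an illustrative file name):
# capture the printed form of every summary and write it all out
out <- unlist(lapply(summarym2, capture.output))
writeLines(out, "summaries2014.txt")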
I have a set of data that looks like this,
species <- "ABC"
ind <- rep(1:4, each = 24)
hour <- rep(seq(0, 23, by = 1), 4)
depth <- runif(length(ind), 1, 50)
# note: cbind() coerces everything to character, hence the conversion below
df <- data.frame(cbind(species, ind, hour, depth))
df$depth <- as.numeric(df$depth)
In my real data the column "ind" has more levels, and they don't always have the same length (here there are 4 individuals with 24 rows each, but in reality some individuals have thousands of rows of data, while others have only a few).
What I would like to do is to have an outer loop or function that will select all the rows from each individual ("ind") and generate a boxplot using the depth/hour columns.
This is the idea that I have in mind:
for (i in 1:length(unique(df$ind))) {
  data <- df[df$ind == df$ind[i], ]
  individual[i] <- data
  plot.boxplot <- function(data) {
    boxplot(depth ~ hour, dat = data, xlab = "Hour of day", ylab = "Depth (m)")
  }
}
par(mfrow = c(2, 2), mar = c(5, 4, 3, 1))
plot.boxplot(individual)
I realize that this loop is probably inappropriate, but I am still learning. I can do the boxplot for one individual at a time, but I would like a faster, more efficient way of selecting the data for each individual and creating or storing the boxplot results. This will be very useful when I have many more individuals (instead of doing them one at a time...). Thanks a lot in advance.
What about something like this?
par(mfrow = c(2, 2))
invisible(
  by(df, df$ind,
     function(x)
       boxplot(depth ~ hour, data = x, xlab = "Hour of day", ylab = "Depth (m)")
  )
)
To provide some explanation: this draws a boxplot for each group of cases in df defined by df$ind. The invisible() wrapper just keeps the list of boxplot statistics returned by by() from being printed to the console.
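If you also want to store the results rather than only draw them, boxplot() returns its statistics, and plot = FALSE suppresses the drawing. A small sketch (bp_stats is just an illustrative name):
# one set of boxplot statistics per individual, computed without plotting
bp_stats <- by(df, df$ind,
               function(x) boxplot(depth ~ hour, data = x, plot = FALSE))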