Calculate averages by group with uneven categorical data in R

I want to calculate averages for categorical data. My data is in a long format, and I do not understand why I am not succeeding.
Here is an example (imagine it as individual participants, indicated by id, picking different options, in this example m_ex):
id <- c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
m_ex <- c("a", "b", "c", "b", "a", "b", "c", "b", "a", "a", "c")
df <- data.frame(id, m_ex)
print(df)
I want to calculate averages for m_ex. That is, how often, on average, each m_ex option is picked. I am trying to achieve this with dplyr, but I do not quite understand how to proceed when the ids have different numbers of picks. What would I have to divide by then? And is it a problem that the ids do not all have the same number of rows?
I really appreciate any help you can provide.
I have tried using dplyr, grouping by id and summarizing the results, without much success. In particular, I would like to understand what it is that I am missing.
I get something like this, but how do I get the averages?
![example picture](https://i.stack.imgur.com/7nxze.jpg)
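One way to read "averages" here is: for each id, the share of that id's picks that went to each option, averaged across ids. A minimal dplyr sketch of that reading (the tidyr::complete() step is only there so that options an id never picked count as zero):
library(dplyr)
library(tidyr)
df %>%
  count(id, m_ex) %>%                          # picks per option within each id
  complete(id, m_ex, fill = list(n = 0)) %>%   # options an id never picked count as 0
  group_by(id) %>%
  mutate(prop = n / sum(n)) %>%                # divide by that id's total number of picks
  group_by(m_ex) %>%
  summarise(avg_prop = mean(prop), .groups = "drop")
Dividing by sum(n) within each id is what makes the unequal numbers of picks harmless: each id contributes proportions that sum to 1, no matter how many rows it has.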

Related

Redefining Dataframe for Regression-Analysis in R

I have a data frame with timestamps of several transportations from A to B, plus information about the material (volume, weight, etc.).
I recreated the important parts of the raw Excel sheet I use.
My first step is to calculate the time each transport took by simply subtracting the dates, as I only need daily precision. I put all the times in a numeric vector to make further calculations and plots easy.
BUT:
I'd like to perform a regression analysis on it. I know how to create an lm.
My problem is that, due to several NAs, my numeric vector of "transport days" is shorter than the columns in the data frame.
How can I merge the columns from the data frame with my numeric vector so that the transport times match the respective materials again?
Are you looking for something like
library(dplyr)
df %>%
  mutate(diff = as.numeric(t4 - t1))
You then have a time-difference column while the volume column is still in the df. You can tell lm() how to deal with NAs anyway, so you don't need to drop them (I also don't think that you were doing that anyway).
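A minimal sketch of what that could look like end to end; the column names t1, t4 and volume are taken from the question and are assumptions here:
library(dplyr)
df <- df %>%
  mutate(transport_days = as.numeric(t4 - t1))   # daily precision is enough
# lm() handles incomplete rows via na.action, so the NAs can stay in df:
fit <- lm(transport_days ~ volume, data = df, na.action = na.omit)
summary(fit)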

Mean/standard deviation plot of survey items with missing data

I'm an R beginner attempting to do what I figured (erroneously) would be a beginner-type task: produce a simple plot of means/standard deviations for multiple survey questions (vectors), grouped by a second variable (say, group).
So I am reading variables (say, q1-q10) into R from Stata and have even managed to melt the data following this suggestion.
What I would like is essentially the graph presented in the solution:
However, my data contain missing values (NA), and the NUMBER of missing values varies by question. So when I try to use ggplot to plot the 'melted' data, I get an error saying the vector lengths do not match.
Suppose your variables q1-q10 are separate vectors; then you should combine them into a data frame df:
df <- data.frame(q1, q2, ..., q10)
And then you can clean it such that you only have complete cases, i.e. only observations without NA:
df <- df[complete.cases(df),]
Afterwards, you should not have problems with ggplot.
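A minimal sketch of the whole flow under those assumptions; the variables q1-q3 and the grouping variable group are made up for illustration:
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1)
df <- data.frame(group = rep(c("A", "B"), each = 10),
                 q1 = c(rnorm(19), NA),
                 q2 = rnorm(20),
                 q3 = c(NA, rnorm(19)))
df <- df[complete.cases(df), ]            # keep only rows without any NA
stats <- df %>%
  pivot_longer(starts_with("q"), names_to = "question") %>%
  group_by(group, question) %>%
  summarise(m = mean(value), s = sd(value), .groups = "drop")
ggplot(stats, aes(question, m, colour = group)) +
  geom_pointrange(aes(ymin = m - s, ymax = m + s),
                  position = position_dodge(width = 0.3))
Because all questions now share the same complete set of rows, the reshaped vectors have matching lengths and ggplot no longer complains.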

Calculating moving average with different codes and different sizes

I have a data frame that contains data for different observations, where the observations are grouped with a unique code. As a reproducible example, here is how a simulated data looks like:
v <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6)
mat1 <- matrix(runif(200), 40)
mat1 <- cbind(v, mat1)
mat1 <- as.data.frame(mat1)
names(mat1) <- c('code', 'x1', 'x2', 'x3', 'x4', 'x5')
unq <- unique(mat1$code)
What I would like to do is to calculate an average for each observation, based on two previous and two future observations (you can think about this as a time series). So for example
mat1$x1[3] = mean(mat1$x1[1:5])
mat1$x1[4] = mean(mat1$x1[2:6])
and so on. I am able to do the calculation using a particular code (for example when mat1$code==1):
K <- data.frame(code = mat1$code, x1 = rep(0, 40), x2 = rep(0, 40),
                x3 = rep(0, 40), x4 = rep(0, 40), x5 = rep(0, 40))
for (i in 3:(nrow(mat1) - 2)) {
  if (mat1$code[i] == unq[1]) {
    K[i, 2] <- mean(mat1[(i - 2):(i + 2), 2])   # parentheses matter: i-2:i+2 is not (i-2):(i+2)
  }
}
There are, however, two things that I couldn't figure out:
(1) Since the actual dataset is much larger than the simulated one, how can I dynamically go through all the unique codes and do the calculation, noting that the first and last two observations of each unique code should be zero (and I will eventually get rid of them).
(2) The number of observations for each unique code is different, and some of them are less than 4, where in this case there can't be any calculation done for that code!
Any help is highly appreciated.
Thank you
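A minimal sketch of one way to handle both points at once, assuming the rows are already grouped contiguously by code (as in the example). It keeps zeros for the first and last two rows of each code and skips codes with fewer than five observations:
K <- mat1
K[, -1] <- 0                                   # start from all zeros
for (cd in unique(mat1$code)) {
  idx <- which(mat1$code == cd)                # row positions of this code
  if (length(idx) < 5) next                    # too few observations: leave zeros
  for (i in idx[3:(length(idx) - 2)]) {
    K[i, -1] <- colMeans(mat1[(i - 2):(i + 2), -1])   # centred window of 5 rows
  }
}
Packages such as zoo (rollmean/rollapply) offer the same centred moving average without the explicit loops, if speed on the full dataset becomes an issue.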

Sort data set by biggest range of entries

I am dipping my toes into R a bit, and I am now able to sort columns by their means, but I would now like to sort the columns by the biggest range of the data points in each column.
Say I have a table with point ratings for movies. How can I get the top 10 movies where the opinions differ the most? Is there a function that can measure this? One idea of mine was to use the minValue and maxValue, but then just one outlier can mess it all up.
I guess the size of the box of a boxplot could be a pretty good measure.
Any ideas?
Update:
So I tried sorting my table by their respective sd(), but I can't quite get that to work. What I was trying is this (the table has headings, by the way, and is named newdata here):
> newdata.sd <- sapply(1:107, function(j) sd(newdata[,j], na.rm=TRUE))
> newdata.sorted.sd <- newdata[,names(sort(newdata.sd, decreasing=TRUE))]
The second line throws an error because the first doesn't produce a Named num vector.
When I did the same thing with sorting by the columns means it worked. That I did with the following two lines.
> newdata.mean <- colMeans(newdata, na.rm=TRUE)
> newdata.sorted <- newdata[,names(sort(newdata.mean, decreasing=TRUE))]
How can I produce a named vector of sd()s like in the second example?
A different way to sort by sd() would be also appreciated.
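A minimal sketch of both options: sapply() applied to the data frame itself (rather than to column indices) keeps the column names, so the same sorting trick as with colMeans() works; and the IQR is essentially the box of the boxplot mentioned above, so it gives an outlier-resistant alternative to sd():
# named vector of standard deviations, then sort the columns by it
newdata.sd <- sapply(newdata, sd, na.rm = TRUE)
newdata.sorted.sd <- newdata[, names(sort(newdata.sd, decreasing = TRUE))]
# outlier-resistant alternative: interquartile range (the boxplot's box)
newdata.iqr <- sapply(newdata, IQR, na.rm = TRUE)
newdata.sorted.iqr <- newdata[, names(sort(newdata.iqr, decreasing = TRUE))]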

Summary stats a variable for each unique variable within a condition

I have a longitudinal spreadsheet that contains different growth variables for many individuals. At the moment my R code looks like this:
D5 <- ifelse(growth$agyr == 5, growth$R.2ND.DIG.AVERAGE, NA)
Since it is longitudinal, I have the same measurement for each individual at multiples ages, thus the variable agyr. In this example it is taking all kids who have a finger measurement at age 5.
What I would like to do is do that for all ages so that I don't have to define an object every time, so I can essentially run some summary stats on finger length for any given agyr. Surely this is possible, but I am still a beginner at R.
tapply() is your friend here. For the mean for example:
with(growth,
     tapply(R.2ND.DIG.AVERAGE, agyr, mean)
)
See also ?tapply and a good introductory book on R. And also ?with, a function that can really make your code a lot more readable.
If you have multiple levels you want to average over, you can give tapply() a list of factors. Say gender is a variable as well (a factor!); then you can do, e.g.:
with(growth,
     tapply(R.2ND.DIG.AVERAGE, list(agyr, gender), mean)
)
tapply() returns an array-like structure (a vector, matrix or multidimensional array, depending on the number of categorizing factors). If you want your results in a data frame and/or want to summarize multiple variables at once, look at ?aggregate, e.g.:
thevars <- c("R.2ND.DIG.AVERAGE", "VAR2", "MOREVAR")
aggregate(growth[thevars],
          by = list(agyr = growth$agyr, gender = growth$gender),
          FUN = mean)
or using the formula interface:
aggregate(cbind(R.2ND.DIG.AVERAGE,VAR2,MOREVAR) ~ agyr + gender,
data=growth, FUN = "mean")
Make sure you check the help files as well. Both tapply() and aggregate() are quite powerful and have plenty other possibilities.
