Calculating moving average with different codes and different sizes - r

I have a data frame that contains data for different observations, where the observations are grouped by a unique code. As a reproducible example, here is what a simulated dataset looks like:
v <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6)
mat1 <- matrix(runif(200),40)
mat1 <- cbind(v,mat1)
mat1 <- as.data.frame(mat1)
names(mat1) <- c('code','x1','x2','x3','x4','x5')
unq <- unique(mat1$code)
What I would like to do is to calculate an average for each observation, based on the two previous and two following observations (you can think of this as a time series). So, for example:
mat1$x1[3] <- mean(mat1$x1[1:5])
mat1$x1[4] <- mean(mat1$x1[2:6])
and so on. I am able to do the calculation using a particular code (for example when mat1$code==1):
K <- data.frame(code = mat1$code, x1 = rep(0, 40), x2 = rep(0, 40),
                x3 = rep(0, 40), x4 = rep(0, 40), x5 = rep(0, 40))
for (i in 3:(nrow(mat1) - 2)) {
  if (mat1$code[i] == unq[1]) {
    K[i, 2] <- mean(mat1[(i - 2):(i + 2), 2])  # note the parentheses: i-2:i+2 would parse as i - (2:i) + 2
  }
}
However, there are two things that I couldn't figure out:
(1) Since the actual dataset is much larger than the simulated one, how can I dynamically go through all the unique codes and do the calculation, noting that the first and last two observations of each unique code should be zero (I will eventually get rid of them)?
(2) The number of observations differs between unique codes, and some codes have fewer than 4 observations, in which case no calculation can be done for that code at all.
Any help is highly appreciated.
Thank you
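One possible approach (a sketch, assuming the zoo package is acceptable): compute a centered five-point rolling mean within each code, fill the first and last two positions of each group with zero, and leave groups that are too short as all zeros.
library(zoo)
# Centered 5-point rolling mean; groups shorter than 5 observations stay all zero.
roll5 <- function(x) {
  if (length(x) < 5) return(rep(0, length(x)))
  rollapply(x, width = 5, FUN = mean, align = "center", fill = 0)
}
K <- mat1
# ave() applies roll5 within each code and keeps the original row order.
K[-1] <- lapply(mat1[-1], function(col) ave(col, mat1$code, FUN = roll5))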

Related

R calculate averages by group with uneven categorical data

I want to calculate averages for categorical data. My data is in a long format, and I do not understand why I am not succeeding.
Here is an example (imagine it as individual participants, indicated by id, picking different options, in this example m_ex):
id <- c(1,1,1,1,1,2,2,2,3,3,3)
m_ex <- c("a","b","c","b","a","b","c","b","a","a","c")
df <- data.frame(id, m_ex)
print(df)
I want to calculate averages for m_ex, that is, the average number of times each m_ex option is picked. I am trying to achieve this with dplyr, but I do not quite understand how to proceed when the ids have different lengths. What would I have to divide by then? And is it a problem that the ids do not all have the same length?
I really appreciate any help you can provide.
I have tried using dplyr, grouping by id and summarising the results, without much success. In particular, I would like to understand what it is that I am currently missing.
I get something like this, but how do I get the averages?
(example picture omitted: screenshot of the grouped output)
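One possible reading of the question (a sketch, not from the original post): count how often each option is picked per id, then average those counts across ids.
library(dplyr)
# Per-id counts of each option, then the mean count across ids.
# Caveat: ids that never picked an option contribute no zero row; apply
# tidyr::complete() first if those zeros should enter the average.
df %>%
  count(id, m_ex) %>%
  group_by(m_ex) %>%
  summarise(mean_times_picked = mean(n))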

Beginner: how can I repeat this function?

I need RStudio for analysing some data, but I haven't used it for four years.
Now I've got a problem and don't know how to solve it. I want to calculate the variance of some columns together in every row. With some experimentation I've found this:
var(as.numeric(data[1,8:33]))
and I get: 1.046154
As far as I know this should be right. It should at least give me the variance of items 8 to 33 in the row for the first person. It also works for any other row:
var(as.numeric(data[5,8:33])) => 1.046154
Now I could of course use the same thing for every row individually, but I have 111 participants and several surveys. I tried to find a way to repeat the same command with every row but it didn't work.
How can I use the command from above and repeat it to all 111 participants?
Without the data it is difficult to help, but I created some dummy data using rnorm. You can use apply to obtain a vector containing the variance for each row. Since it appears that your data is in character format and not numeric, I created a simple function to automatically transform it and calculate the variance.
set.seed(20)
data <- matrix(as.character(rnorm(3663)),  # 111 rows x 33 columns
               ncol = 33,
               nrow = 111)
## Basic function: coerce a character row to numeric, then take its variance
obtain_variance_from_character <- function(x) {
  var(as.numeric(x))
}
## Calculate variances by row (MARGIN = 1 applies the function to rows)
variances <- apply(data, MARGIN = 1, FUN = obtain_variance_from_character)
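Applied to the original data frame (a hypothetical your_data standing in for the asker's data, with the survey items in columns 8 to 33 as in the question):
# your_data is a placeholder name; columns 8:33 hold the items, as in the question
row_variances <- apply(your_data[, 8:33], MARGIN = 1,
                       FUN = obtain_variance_from_character)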

Doing operations on multiple numbered tables in R

I'm new to programming in R and I'm working with a huge dataset containing hundreds of variables and thousands of observations. Among these variables is Age, which is my main concern. I want to get the mean of every other variable as a function of Age. I can get smaller tables with this:
for (i in 18:84) {
  n <- sprintf("SortAgeM%d", i)
  assign(x = n, subset(SortAgeM, subset = (SortAgeM$AGE >= i & SortAgeM$AGE < i + 1)))
}
SortAgeM85plus <- subset(SortAgeM, subset = (SortAgeM$AGE >= 85 & SortAgeM$AGE < 100))
This gives me sub-datasets for each age I'm concerned with. I would then like to get the mean of each column. Each column is an observation of the volume of a specific brain region. I'm interested in how the volume decreases with age, and I would like to know whether individuals of a given age are close to the mean for their age or not.
Now, I would like to get one more row with the mean for each column. So I tried this:
for (i in 18:85) {
  addmargins(get(sprintf("SortAgeM%d", i)), margin = 1, FUN = mean)
}
But it didn't work... I'm stuck and I'm not familiar enough with R function to find a solution on the net...
Thank you for your help.
Victor
Post-answer edit: this is what I finally did:
for (i in 18:84) {
  n <- sprintf("SortAgeM%d", i)
  assign(x = n, subset(SortAgeM, subset = (SortAgeM$AGE >= i & SortAgeM$AGE < i + 1)))
  item <- get(n)                # fetch the subset we just created
  adjustment <- rep(NA, 7)      # the first seven variables aren't numeric
  line1 <- colMeans(item[, 8:217], na.rm = TRUE)
  assign(x = n, rbind(item, c(adjustment, line1)))
}
If you simply want an additional row with the means of each column, you can rbind the colMeans of your df like this
df_new <- rbind(df, colMeans(df))
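As an aside (a sketch, not part of the original answer): the whole subset-per-age workflow can usually be collapsed into a single grouped aggregation, assuming the numeric brain-region columns are 8 to 217 as in the question:
# Pool ages 85+ into one group, then take the mean of every numeric column per age
age_group <- ifelse(SortAgeM$AGE >= 85, 85, floor(SortAgeM$AGE))
age_means <- aggregate(SortAgeM[, 8:217], by = list(AGE = age_group),
                       FUN = mean, na.rm = TRUE)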

Rank Abundance Distribution on Character Matrix (Or Vector) in R

I've got a 1000x1000 matrix consisting of a random distribution of the letters a-z, and I need to plot the data in a rank abundance distribution plot. However, I'm having a lot of trouble because (a) it is all in character format, (b) it is a matrix rather than a vector (though I have converted it to a vector in one attempt), and (c) I have no idea how to summarise the data to get species abundances, let alone rank them.
My code for the matrix is:
## Create species vector
species.b <- letters[1:26]
## Matrix creation (random); a 1000 x 1000 matrix needs a million draws
## (sampling only 10000 values here would be silently recycled 100 times)
neutral.matrix2 <- matrix(sample(species.b, 1000 * 1000, replace = TRUE),
                          nrow = 1000,
                          ncol = 1000)
## Turn matrix into vector
neutral.b <- as.character(neutral.matrix2)
## Loop: replace one randomly chosen element at a time
lo.op <- 2
neutral.v3 <- neutral.matrix2
neutral.c <- as.character(neutral.v3)
repeat {
  neutral.v3[sample(length(neutral.v3), 1)] <- as.character(sample(neutral.c, 1))
  neutral.c <- as.character(neutral.v3)
  lo.op <- lo.op + 1
  if (lo.op > 10000) {
    break
  }
}
This creates a 1000x1000 matrix, then replaces 10,000 elements at random (I think; I don't know how to check it until I can look at the species abundances/rank distribution).
I've run it a couple of times to get neutral.v2, neutral.v3, and neutral.b, neutral.c, so I should theoretically have two matrices/vectors that I can plot and compare - I just have no idea how to do so on a purely character dataset.
I also created a matrix of the two vectors:
abundance.matrix <- matrix(c(neutral.b, neutral.c),
                           nrow = 1000000,
                           ncol = 2)
as a later requirement is for sites, and each repeat of my code (neutral.v2 to neutral.v11 eventually) could be considered a separate site. However, this didn't change the fact that I have no idea how to treat the character data in the first place.
I think I need to calculate the abundance of each species in the matrix/vectors, then run it through either radfit (vegan) or some form of the rankabundance/rankabun plot (biodiversityR). However the requirements for those functions:
rankabundance(x, y = "", factor = "", level, digits = 1, t = qt(0.975, df = n-1))
x       Community data frame with sites as rows, species as columns and
        species abundance as cell values.
y       Environmental data frame.
factor  Variable of the environment
aren't available in the data I have, as for all intents and purposes I just have a "map" of 1,000,000 species locations, and no idea how to analyse it at all.
Any help would be appreciated. I don't feel like I've explained it very well, so sorry about that!
I'm not sure exactly what you want, but this will summarise the data and turn it into a data frame that rankabundance can use:
# One row (one "site") with species as columns and abundances as cell values
counts <- as.data.frame(as.list(table(neutral.matrix2)))
BiodiversityR::rankabundance(counts)
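If BiodiversityR turns out to be more than you need, a bare-bones rank abundance plot can also be drawn in base R (a sketch, not from the original answer):
# Sort species counts from most to least abundant, then plot rank vs. abundance
abund <- sort(table(neutral.matrix2), decreasing = TRUE)
plot(seq_along(abund), as.numeric(abund),
     type = "b", log = "y",
     xlab = "Species rank", ylab = "Abundance (log scale)")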

Sample a subset of dataframe by group, with sample size equal to another subset of the dataframe

Here's my hypothetical data frame;
location<- as.factor(rep(c("town1","town2","town3","town4","town5"),100))
visited<- as.factor(rbinom(500,1,.4)) #'Yes or No' variable
variable<- rnorm(500,10,2)
id<- 1:500
DF<- data.frame(id,location,visited,variable)
I want to create a new data frame in which the numbers of 0's and 1's are equal for each location. I want to accomplish this by taking a random sample of the 0's for each location (since there are more 0's than 1's).
I found this solution to sample by group:
library(plyr)
ddply(DF[DF$visited=="0",],.(location),function(x) x[sample(nrow(x),size=5),])
I entered '5' for the size argument so the code would run, but I can't figure out how to set the 'size' argument equal to the number of observations where DF$visited==1.
I suspect the answer could be in other questions I've reviewed, but they've been a bit too advanced for me to implement.
Thanks for any help.
The key to using ddply well is to understand that it will:
break the original data frame down by groups into smaller data frames,
then, for each group, it will call the function you give it, whose job is to transform that data frame into a new data frame,
and finally, it will stitch all the little transformed data frames back together.
With that in mind, here's an approach that (I think) solves your problem.
sampleFunction <- function(df) {
  # Determine whether visited==1 or visited==0 is less common for this location,
  # and use that count as our sample size.
  n <- min(nrow(df[df$visited == "1", ]), nrow(df[df$visited == "0", ]))
  # Sample n from each of the two groups (visited==0 and visited==1).
  ddply(df, .(visited), function(x) x[sample(nrow(x), size = n), ])
}
newDF <- ddply(DF, .(location), sampleFunction)
# Just a quick check to make sure we have the equal counts we were looking for.
ddply(newDF, .(location, visited), summarise, N = length(variable))
How it works
The main ddply simply breaks DF down by location and applies sampleFunction, which does the heavy lifting.
sampleFunction takes one of the smaller data frames (in your case, one for each location) and samples from it an equal number of visited==1 and visited==0 rows. How does it do this? With a second call to ddply: this time using visited to break the data down, so we can sample from both the 1's and the 0's.
Notice, too, that we're calculating the sample size for each location based on whichever sub-group (0 or 1) has fewer occurrences, so this solution will work even if there aren't always more 0's than 1's.
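For readers using dplyr rather than plyr, an equivalent might look like this (a sketch, not part of the original answer; it assumes dplyr >= 1.0 for group_modify() and slice_sample()):
library(dplyr)
# For each location, sample min(#visited==1, #visited==0) rows from each
# visited group; group_modify() plays the role of the outer ddply.
newDF <- DF %>%
  group_by(location) %>%
  group_modify(~ {
    n <- min(sum(.x$visited == "1"), sum(.x$visited == "0"))
    .x %>% group_by(visited) %>% slice_sample(n = n) %>% ungroup()
  }) %>%
  ungroup()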
