I've got a data set in R of a variable, repeated 10,000 times and sampled 200 times on each repeat so a 10,000 by 200 matrix, I would like to calculate statistical moments for the variable up to an arbitrary number. So in the end I would like a numeric vector holding the value of moments.
I can get the variance and the mean for the data set using colMean and colVar, but they only go so far.
I am also aware of the moments package in R, however using the all.moments command is returning me moments for each time course, or treating each column or row as an individual variable, not what I want.
Does anyone know an equivalent to colMean and colVar for higher order moments? And if possible also for cross moments?
Many thanks!
I stole this code from an obscure R package e1071:
theskew<- function (x) {
x<-as.vector(x)
sum((x-mean(x))^3)/(length(x)*sd(x)^3)
}
thekurt <- function (x) {
x<-as.vector(x)
sum((x-mean(x))^4)/(length(x)*var(x)^2) - 3
}
You can fold that into your code by feeding them one column at a time
Okay did this yesterday for posterity here is a loop that will do what I asked.
Provided your data is a time course of a variable you are measuring, and you want the moments of that variable:
rm(list=ls())
yourdata<-read.table("whereveryourdatais/and/variableyouwant")
yourdata<-t(yourdata) #only do this at your own discretion
mu<-colMeans(yourdata,1:ncol(yourdata))
NumMoments <- 5
rawmoments <- matrix(NA, nrow=NumMoments, ncol=ncol(yourdata))
for(i in 1:NumMoments) {
rawmoments[i, ] <- colMeans(yourdata^i)
}
plot(rawmoments[1,])
holder<-matrix(NA,nrow=nrow(yourdata),ncol=ncol(yourdata))
middles<-matrix(NA,nrow=1,ncol=ncol(yourdata))
for(j in 1:nrow(yourdata)){
for(o in 1:ncol(rawmoments)){
middles[o]<-yourdata[j,o]-rawmoments[1,o]
}
holder[j,] <- middles
}
centmoments<-matrix(NA,nrow=NumMoments,ncol=ncol(yourdata))
for(i in 1:NumMoments){
centmoments[i,]<-colMeans(holder^i)
}
Then centmoments has the centralmoments and rawmoments has the raw moments, you can specify how many moments to take by changing the value of NumMoments.
Note that the first row in "centmoments" will be approximately 0.
Is this what you're looking for?
X <- matrix(1:12, 3, 4) # your data
NumMoments <- 5
moments <- matrix(NA, nrow=NumMoments, ncol=ncol(X))
for(i in 1:NumMoments) {
moments[i, ] <- colMeans(X^i)
}
EDIT:
okay, apparently you want "central moments"
X <- matrix(1:12, 3, 4)
NumMoments <- 5
moments <- matrix(NA, nrow=NumMoments, ncol=ncol(X))
Y <- X
for(i in 1:ncol(X)) {
Y[, i] <- Y[, i] - moments[1, i]
}
for(i in 2:NumMoments) {
moments[i, ] <- colMeans(Y^i)
}
Related
I have two datasets with abundance data from groups of different species. Columns are species and rows are sites. The sites (rows) are identical between the two datasets and what i am trying to do is to correlate the columns of the first dataset to the columns of the second dataset in order to see if there is a positive or a negative correlation.
library(Hmisc)
rcorr(otu.table.filter$sp1,new6$spA, type="spearman"))$P
rcorr(otu.table.filter$sp1,new6$spA, type="spearman"))$r
the first will give me the p value of the relation between sp1 and spA and the second the r value
I initially created a loop that allowed me to check all species of the first dataframe with a single column of the second dataframe. Needless to say if I was to make this work I would have to repeat the process a few hundred times.
My simple loop for one column of df1(new6) against all columns of df2(otu.table.filter)
pvalues = list()
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$P
}
rvalues = list()
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$r
}
p<-NULL
for(i in 1:length(pvalues)){
tmp <-print(pvalues[[i]][2])
p <- rbind(p, tmp)
}
r<-NULL
for(i in 1:length(rvalues)){
tmp <-print(rvalues[[i]][2])
r <- rbind(r, tmp)
}
fdr<-as.matrix(p.adjust(p, method = "fdr", n = length(p)))
sprman<-cbind(r,p,fdr)
and using the above as a starting point I tried to create a nested loop that each time would examine a column of df1 vs all columns of df2 and then it would proceed to the second column of df1 against all columns of df2 etc etc
but here i am a bit lost and i could not find an answer for a solution in r
I would assume that the pvalues output should be a list of
pvalues[[i]][[j]]
and similarly the rvalues output
rvalues[[i]][[j]]
but I am a bit lost and I dont know how to do that as I tried
pvalues = list()
rvalues = list()
for (j in 1:7){
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$P
}
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$r
}
}
but I cannot make it work cause I am not sure how to direct the output in the lists and then i would also appreciate if someone could help me with the next part which would be to extract for each comparison the p and r value and apply the fdr function (similar to what i did with my simple loop)
here is a subset of my two dataframes
Here a small demo. Let's assume two matrices x and y with a sample size n. Then correlation and approximate p-values can be estimated as:
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)
y <- matrix(rnorm(5 * n), nrow = n)
## correlation matrix
r <- cor(x, y, method = "spearman")
## p-values
pval <- function(r, n) 2 * (1 - pt(abs(r)/sqrt((1 - r^2)/(n - 2)), n - 2))
pval(r, n)
## for comparison
cor.test(x[,1], y[,1], method = "spearman", exact = FALSE)
More details can be found here: https://stats.stackexchange.com/questions/312216/spearman-correlation-significancy-test
Edit
And finally a loop with cor.test:
## for comparison
p <- matrix(NA, nrow = ncol(x), ncol=ncol(y))
for (i in 1:ncol(x)) {
for (j in 1:ncol(y)) {
p[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$p.value
}
}
p
The values differ a somewhat, because the first uses the t-approximation then the second the "exact AS 89 algorithm" of cor.test.
I have a data set with election poll data with numeric values 0-10.
I have 30 columns with these values and I want to compare every column with
all the other columns in order to create a correlation matrix.
]
But I keep getting the following error code:
Error in columnlist[i, j] <- cor(feeling_therm[, i], feeling_therm[, j], :
incorrect number of subscripts on matrix
Any suggestion on how to get this right? I'm still getting used to the syntax of R.
Just use cor(election). It should create the correlation matrix.
As Nathan Werth remarked, cor(election) works just fine. However, if you insist on using a for loop you should initialize your matrix as a matrix (including correct dimensions), not as a list:
election <- replicate(5, rnorm(n = 100))
election <- as.data.frame(election)
cor_matrix <- matrix(nrow = ncol(election), ncol = ncol(election))
for (i in 1:ncol(election)) {
for (j in 1:ncol(election)) {
cor_matrix[i,j] <- cor(election[,i], election[,j], use = "complete")
}
}
If your aim is to store values in a list then following is an option.
n_col <- ncol(election)
electionlist <- as.list(data.frame( matrix(NA, n_col, n_col)))
for (i in 1 : n_col) {
for (j in 1 : n_col) {
electionlist[[c(i, j)]] <- cor(election[,i],
election[,j], use = "complete")
}
}
I am new to R and have written a function that needs to be run multiple times to generate the final dataset.
So the multiple times is determined by the vector of unique years and again based on these years every single time the function gives an output.
Still I am not getting the right output.
Desired output: for eg it takes 10 samples from each year, after 10th run I should have 100 rows of correct output.
create_strsample <- function(n1,n2){
yr <- c(2010,2011,2012,2013)
for(i in 1:length(yr)){
k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
r1 <-sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE)
r2<-sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE)
#final.data <-k1[c(r1,r2),]
sample.data <- lapply(yr, function(x) {f.data<-create_strsample(200,800)})
k1 <- do.call(rbind,k1)
return(k1)
}
final <- do.call(rbind,sample.data)
return(final)
}
stratified.sample.data <- create_strsample(200,800)
A MWE would have been nice, but I'll give you a template for these kind of questions. Note, that this is not optimized for speed (or anything else), but only for the ease of understanding.
As noted in the comments, that call to create_strsample in the loop looks weird and probably isn't what you really want.
data <- data.frame() # we need an empty, but existing variable for the first loop iteration
for (i in 1:10) {
temp <- runif(1,max=i) # do something...
data <- rbind(data,temp) # ... and add this to 'data'
} # repeat 10 times
rm(temp) # don't need this anymore
That return(k1) in the loop also looks wrong.
I tried this later after your suggestion #herbaman for the desired output minus the lapply.
create_strsample <- function(n1,n2){
final.data <- NULL
yr <- c(2010,2011,2012,2013)
for(i in 1:length(yr)){
k1<-subset(data,format(as.Date(data$account_opening_date),"%Y")==yr[i])
r1 <- k1[sample(which(!is.na(k1$account_closing_date)),n1,replace=FALSE), ]
r2 <- k1[sample(which(is.na(k1$account_closing_date)),n2,replace=FALSE), ]
sample.data <- rbind(r1,r2)
final.data <- rbind(final.data, sample.data)
}
return(final.data)
}
stratified.sample.data <- create_strsample(200,800)
I made a matrix based population model, however, I would like to run more than one simultaneously in order to represent different groups of animals, in order that dispersing individuals can move between matrices. I originally just repeated everything to get a second matrix but then I realised that because I run the model using a for loop and break() under certain conditions (when that specific matrix should stop running, ie that group has died out) it is, understandably, stopping the whole model rather than just that singular matrix.
I was wondering if anyone had any suggestions on the best ways to code the model so that instead of breaking, and stopping the whole for loop, it just stops running across that specific matrix. I'm a little stumped. I have include a single run of one matrix below.
Also if anyone has a more efficient way of creating and running 9 matrices than writing everything out 9 times advice much appreciated.
n.steps <- 100
mats <- array(0,c(85,85,n.steps))
ns <- array(0,c(85,n.steps))
ns[1,1]<-0
ns[12,1]<-rpois(1,3)
ns[24,1]<-rpois(1,3)
ns[85,1] <- 1
birth<-4
nextbreed<-12
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
death<-sample(c(replicate(1000,
sample(c(1,0), prob=c(0.985, 1-0.985), size = 1))),1)
if (death == 0) {
break()}
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
group.size <- apply(ns[1:85,],2,sum)
plot(group.size)
View(mat)
View(ns)
As somebody else suggested on Twitter, one solution might be to simply turn the matrix into all 0s whenever death happens. It looks to me like death is the probability that a local population disappears? It which case it seems to make good biological sense to just turn the entire population matrix into 0s.
A few other small changes: I made a list of replicate simulations so I could summarize them easily.
If I understand correctly,
death<-sample(c(replicate(1000,sample(c(1,0), prob=c(0.985, 1-0.985), size =1))),1)
says " a local population dies completely with probability 1.5% ". In which case, I think you could replace it with rbinom(). I did that below and my plots look similar to those I made with your code.
Hope that helps!
lots <- replicate(100, simplify = FALSE, expr = {
for (i in 2:n.steps){
# set up an empty matrix;
mat <- matrix(0,nrow=85,ncol=85)
surv.age.1 <- 0.95
x <- 2:10
diag(mat[x,(x-1)]) <- surv.age.1
surv.age.a <- 0.97
disp <- 1:74
disp <- disp*-0.001
disp1<-0.13
disp<-1-(disp+disp1)
survdisp<-surv.age.a*disp
x <- 11:84
diag(mat[x,(x-1)])<-survdisp
if (i == nextbreed) {
pb <- 1
} else {
pb <- 0
}
if (pb == 1) {
(nextbreed <- nextbreed+12)
}
mat[1,85] <- pb*birth
mat[85,85]<-1
death<-rbinom(1, size = 1, prob = 0.6)
if (death == 0) {
mat <- 0
}
mats[,,i]<- mat
ns[,i] <- mat%*%ns[,i-1]
}
ns
})
lapply(lots, FUN = function(x) apply(x[1:85,],2,sum))
I am running a simulation trying to find the probability of something taking place in a number of binomial trials. I start with specifying the data
iter=5000
data=data.frame(prob=runif(300), value=runif(300))
data<-data[sample(nrow(data), iter, replace=T),]
then I add the trials
cols <- c("one","two","three","four","five","six",
"seven","eight","nine","ten","eleven","twelve")
data[,cols] <- NA
one contains the results of only one binomial trials, two contains the results of two binomial trials and so on. If a binomial event takes place in any of the one, two, three, ..., twelve, the cell is marked 1 else 0.
Then I run the trials for iter=5000 simulations
for (col in 3:14) {
for (i in 1:iter) if (sum(rbinom((col-2),1,data[i,1]))>0) data[i,col]<-1 else data[i,col]<-0
}
Then I evaluate the mean(data$value[data$one==0] till ... mean(data$value[data$twelve==0]
My problem is that the simulation code takes forever for iter>15000.
for (col in 3:14) {
for (i in 1:iter)
data[i,col] <- if (sum(rbinom((col-2),1,data[i,1]))>0) 1 else 0
}
Any ideas?
sim2 <- function(iter) {
dat <- data.frame(prob=runif(300), value=runif(300))
dat <- dat[sample(nrow(dat), iter, replace=TRUE),]
cols <- c("one","two","three","four","five","six",
"seven","eight","nine","ten","eleven","twelve")
dat[,cols] <- 0
for (col in 3:14) {
dat[,col] <- as.numeric(vapply(dat[,1],
function(p) {sum(rbinom((col-2), 1, p))>0},
FUN.VALUE = TRUE))
}
vapply(3:14, function(col) {mean(dat$value[dat[,col]==0])}, FUN.VALUE=1)
}
For iter of 16000, this runs in 2.29s on my machine, compared to an (estimated) 1781s for the ordering in your original algorithm. In general, don't assign individual elements in the data frame when you can assign the whole column at once. There may be more improvements possible, but I'll stop at >750x speedup (and changing the algorithm from running time of O(n^2) to O(n)).