I am investigating the presence of convergent evolution in a dataset containing 20 trait variables. I need to test for convergence on each trait individually, so I am trying to set up a loop to do this. However, I am getting errors on my loops and I am not sure why. I can run the function on each variable independently without any problems.
I am using the package windex and am using the function test.windex. This function uses a data.frame containing information about which species are the focal ones and the traits that one wishes to test convergence on. The function also requires a phylogenetic tree for the species included in the data.frame.
To create a reproducible example I am using the sample data supplied with the package windex as my data is confidential (I have run the following and get the same errors as with my own data).
data(sample.data) #data.frame containing the focal species and variables
data(sample.tree) #phylogenetic tree
head(sample.data)
species focals bm1 bm2 bm3
1 s4 0 13.03895 17.2201554 11.43644
2 s7 0 18.22705 15.2427947 22.75245
3 s8 0 12.38588 10.5858736 13.80216
4 s9 0 24.79114 8.1148456 23.38717
5 s11 1 28.20126 -2.9911114 15.63215
6 s12 1 29.45775 0.9225023 12.09184
ou1 ou2 ou3 bin
1 13.03895 17.220155 11.43644 1
2 18.22705 15.242795 22.75245 2
3 12.38588 10.585874 13.80216 3
4 24.79114 8.114846 23.38717 4
5 21.22215 21.936316 20.30037 5
6 21.24501 20.952824 20.88650 6
# My loop: I need to loop over each of the trait variables in sample.data
sapply(sample.data[3:8], function(x) test.windex(sample.data,sample.tree,traits= x,
focal=sample.data[,2], reps=1000,plot=TRUE,col="light grey"))
This code generates the error: Error in .subset(x, j) : only 0's may be mixed with negative subscripts
Called from: [.data.frame(dat, , traits)
I am assuming that it is some sort of indexing problem, but I just can't seem to find a way around it. I tried this instead, but got the same error:
traitsonly<-sample.data[3:8]
sapply(traitsonly, function(x) test.windex(sample.data,sample.tree,traits= x,
focal=sample.data[,2], reps=1000,plot=TRUE,col="light grey"))
Any help would be greatly appreciated.
From the help page of test.windex, the traits argument accepts column numbers or column names. Your loop instead passes the trait values themselves (sapply hands each column's contents to the function), which is what triggers the indexing error.
traits
Column numbers (or names) of the traits for which you want to calculate a Wheatsheaf index.
So either of these should work.
library(windex)
res <- sapply(3:8, function(x)
  test.windex(sample.data, sample.tree, traits = x,
              focal = sample.data[, 2], reps = 1000,
              plot = TRUE, col = "light grey"))
Or
res <- sapply(names(sample.data)[3:8], function(x)
  test.windex(sample.data, sample.tree, traits = x,
              focal = sample.data[, 2], reps = 1000,
              plot = TRUE, col = "light grey"))
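If you want to keep each trait's full result for later inspection, lapply with named elements is convenient. This is a minimal sketch; the exact structure of each element depends on what test.windex returns, so check str() on one element of your own run:
library(windex)
data(sample.data)
data(sample.tree)
# One result per trait column, named by trait for easy lookup
trait_cols <- names(sample.data)[3:8]
results <- lapply(trait_cols, function(tr)
  test.windex(sample.data, sample.tree, traits = tr,
              focal = sample.data[, 2], reps = 1000,
              plot = TRUE, col = "light grey"))
names(results) <- trait_cols
str(results[[1]])  # inspect the components returned for the first trait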
I'm using the first principal component from a PCA as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time the model updates and produces a new forecast based on the new observation included in the model. Since PCA uses data from all observations included in the model for its calculations, I also need to run the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise the PCA result could reveal information about the future and help the model produce a more accurate answer than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider if I have the following data.frame
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (the prcomp() function from the stats package) on all observations up to, and including, that observation. So I first want to run PCA on the first two observations:
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second, and third observations as input:
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third, and fourth observations as input:
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the data frame. From each of these runs, I wish to extract the last value of the first principal component (PC1) (though for pca2 I use both the first and second values), and merge these into a final data frame in which each monthly observation is the last value of PC1 from the corresponding run.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output to be a data frame that looks like
> final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, the first two values look a bit weird, but please don't pay too much attention to that. My point is that I wish to build a data frame consisting of the last calculated value of the first principal component from each of the PCA runs.
I am thinking that a for loop might be the best solution here, but I have not been successful in finding any threads that might guide me closer to a coding solution. How can I make the loop use an increasing portion of the data frame in its calculations? Does anyone have any suggestions/tips/links? Any help on this is much appreciated!
I had a very similar approach.
# One list element per PCA run; run i uses rows 1:(i+1)
PCA <- vector("list", length = nrow(data) - 1)
for (i in 1:(nrow(data) - 1)) {
  # Keep both scores from the first run, only the last score afterwards
  if (i == 1) j <- 1:2 else j <- i + 1
  PCA[[i]] <- as.data.frame(prcomp(data[1:(1 + i), ], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
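Wrapping the vector up gives the final.output shape asked for:
final.output <- data.frame(PC1 = unlist(PCA))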
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data with a loop. For each iteration of the loop, run prcomp on data[1:i,]. You can directly create your PCA data frame and extract PC1 from it as a vector, then store it in the list at index i - 1:
for(i in 2:nrow(data))
{
all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i,], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply (list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
Finally, you want to stick the first value of the first analysis back onto the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)
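For reference, the same result can be collected in one pass with sapply. This is a sketch that pulls the last PC1 score from each run and then prepends the first score of the initial run:
# Last PC1 score of each expanding-window PCA run
pc1_last <- sapply(2:nrow(data), function(i)
  tail(prcomp(data[1:i, ], scale = TRUE)$x[, "PC1"], 1))
# Prepend the first score of the initial (two-observation) run
final.output <- data.frame(PC1 = c(prcomp(data[1:2, ], scale = TRUE)$x[1, "PC1"],
                                   pc1_last))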
This is my first time posting here, and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging, and I'm very much a beginner on the programming/data manipulation side of R.
I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):
> head(plots[,1:4])
Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1 1 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
8 0 0 0 0
I want to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that association is positive or negative. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each species-by-species comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species-by-species tests showing whether each combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only reports an association when all expected values are greater than 5.
I have made a start by writing the following function:
CHI <- function(sppx, sppy) {
  test <- chisq.test(table(sppx, sppy))
  # The sign of observed minus expected in the presence/presence cell
  # gives the direction of the association
  result <- c(test$statistic, test$p.value,
              sign((table(sppx, sppy) - test$expected)[2, 2]))
  return(result)
}
This returns the following:
> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)
X-squared
1.095869e-27 1.000000e+00 -1.000000e+00
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect
Now I'm trying to figure out a way to apply this function to each species-by-species combination in the data frame. Essentially I want R to take each column, apply the CHI function to that column paired with each subsequent column, and work through all the columns this way, dropping each column once it has been tested so that the same species pair is not tested twice. I have tried various methods using "for" loops and "apply" functions, but have not been able to figure this out.
I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.
You need the combn function to find all the combinations of the columns and then pass each pair to your function, something like this:
apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))
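To keep track of which result belongs to which species pair, you can label the output. This is a sketch assuming CHI returns the (statistic, p-value, sign) vector defined in the question:
# Pairs of column names rather than indices, so labels come for free
pairs <- combn(names(plots), 2)
res <- apply(pairs, 2, function(sp) CHI(plots[[sp[1]]], plots[[sp[2]]]))
rownames(res) <- c("X-squared", "p.value", "sign")
colnames(res) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")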
I think you are looking for something like this. I used the iris dataset.
require(datasets)
ind <- combn(NCOL(iris), 2)
lapply(1:NCOL(ind), function(i) CHI(iris[, ind[1, i]], iris[, ind[2, i]]))
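To address the expected-count caveat from the question, the helper can also report whether all expected values exceed 5. A sketch extending the question's CHI function (the extra "valid" element is an addition, not part of the original):
CHI2 <- function(sppx, sppy) {
  tab  <- table(sppx, sppy)
  test <- chisq.test(tab)
  c(statistic = unname(test$statistic),
    p.value   = test$p.value,
    # Direction of the association from the presence/presence cell
    sign      = sign((tab - test$expected)[2, 2]),
    # 1 if all expected counts exceed 5, i.e. the approximation is trustworthy
    valid     = as.numeric(all(test$expected > 5)))
}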
The R code below runs a chi-squared test for every categorical variable (every factor column) of a data frame against one given variable; the x (or y) parameter of chisq.test is kept fixed, i.e. explicitly defined.
Define your variable: change df$variable1 to your desired factor variable, and df to the data frame containing all the factor variables to be tested against the given df$variable1.
Define your data frame: a new data frame (df2) is created that will contain the chi-squared statistic, degrees of freedom, and p-value of each variable-vs-data-frame comparison.
The code was created/completed/altered from similar Stack Overflow posts, none of which produced my desired outcome on its own.
# Chi-squared statistic / df / p-value for the given variable vs each column.
# The "2" argument makes apply() work column-wise (see the MARGIN option in ?apply).
df2 <- t(round(apply(df, 2, function(x) {
  ch <- chisq.test(df$variable1, x)
  c(unname(ch$statistic), ch$parameter, ch$p.value)
}), 3))
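A hypothetical usage example (the toy data frame and column names below are made up for illustration; expect "approximation may be incorrect" warnings on data this small):
df <- data.frame(variable1 = factor(c("a", "a", "b", "b", "a", "b")),
                 v2 = factor(c("x", "y", "x", "y", "y", "x")),
                 v3 = factor(c("p", "q", "p", "q", "p", "q")))
df2 <- t(round(apply(df, 2, function(x) {
  ch <- chisq.test(df$variable1, x)
  c(unname(ch$statistic), ch$parameter, ch$p.value)
}), 3))
colnames(df2) <- c("statistic", "df", "p.value")
df2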
I am still learning to use data.table (from the data.table package) and even after looking for help on the web and the help files, I am still struggling to do what I want.
I have a large data table with over 60 columns (the first three corresponding to factors and the remaining to response variables, in this case different species) and several rows corresponding to the different levels of the treatments and the species abundances. A very small version looks like this:
> TEST<-data.table(Time=c("0","0","0","7","7","7","12"),
Zone=c("1","1","0","1","0","0","1"),
quadrat=c(1,2,3,1,2,3,1),
Sp1=c(0,4,29,9,1,2,10),
Sp2=c(20,17,11,15,32,15,10),
Sp3=c(1,0,1,1,1,1,0))
> setkey(TEST, Time)
> TEST
Time Zone quadrat Sp1 Sp2 Sp3
1: 0 1 1 0 20 1
2: 0 1 2 4 17 0
3: 0 0 3 29 11 1
4: 12 1 1 10 10 0
5: 7 1 1 9 15 1
6: 7 0 2 1 32 1
7: 7 0 3 2 15 1
I need to calculate the sum of the covariances for each Zone x quadrat group. If I only had the species list for a given Zone x quadrat combination, then I could use the cov() function, but using cov() in the same way that I would use mean() or sum() in
Abundance = TEST[,lapply(.SD,mean),by="Zone,quadrat"]
does not work as I get the following error message:
Error in cov(value) : supply both 'x' and 'y' or a matrix-like 'x'
I understand why but I cannot figure out how to solve this.
What I want, exactly, is to get, for each Zone x quadrat combination, the covariance matrix of all the species across all the sampling Time points. From each matrix, I then need to calculate the sum of the covariances of all pairs of species, so that I end up with one sum of covariances per Zone x quadrat combination.
Any help would be greatly appreciated, Thanks.
From the help provided above by @Frank and some additional searching I did around the use of the upper.tri function, the following code works:
Cov <- TEST[, {cm <- cov(.SD); sum(cm[upper.tri(cm, diag = FALSE)])},
            by = "Zone,quadrat", .SDcols = paste0("Sp", 1:3)]
In the initially proposed version, upper.tri() did not appear inside [ ], so the code only produced a logical matrix from the covariance matrix; indexing with upper.tri(..., diag = FALSE) excludes the diagonal values before summing the upper triangle of the matrix. In my case I didn't care whether it was the upper or lower triangle, and since a covariance matrix is symmetric, lower.tri() would work equally well.
I hope this helps other users who might encounter a similar issue.
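A quick sanity check on a single group, using the toy TEST table above (a sketch):
# Covariance matrix for Zone "1", quadrat 1 computed by hand
sub <- TEST[Zone == "1" & quadrat == 1, paste0("Sp", 1:3), with = FALSE]
cm  <- cov(sub)
sum(cm[upper.tri(cm)])  # should match the Zone 1 / quadrat 1 row of Cov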
This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file that has 2 columns: one of categorical data names, and the second a count column (a count for each of the categories). With a small dataset, I would use the 'reshape' package and its 'untable' function to expand it into a single column and do the analysis that way. The question is, how do I handle this with a large dataset?
In this case, my data is humongous and that just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat Count
A 5
B 7
C 1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
That is, rather than expanding the data myself, I want R to take the table above as input and understand that it is equivalent to the following:
A
A
A
A
A
B
B
B
B
B
B
B
C
In reasonable size data, I can do this on my own, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data, and I am trying to understand the correlation between it and other pieces of data. I need to work on linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1))
# Cat Count
#1 A 5
#2 B 7
#3 C 1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
To follow up on #Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
Cat Count
1 XYHVQNTDRS 5154
2 LSYGMYZDMO 4724
3 XDZYCNKXLV 8691
4 TVKRAVAFXP 2429
5 JLAZLYXQZQ 5704
6 IJKUBTREGN 4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
user system elapsed
4.48 1.95 6.42
It uses about 6GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
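For example, xtabs can treat the count column as pre-tabulated frequencies directly, without expanding the data (a minimal sketch reusing the dat frame from the answer above):
# The left-hand side of the formula is interpreted as the frequencies
tab <- xtabs(Count ~ Cat, data = dat)
tab
# Cat
# A B C
# 5 7 1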
The existing answers all expand the pre-binned dataset into a full distribution and then use R's histogram function, which is memory-inefficient and will not scale for very large datasets like the one the original poster asked about. The HistogramTools CRAN package includes a PreBinnedHistogram function which takes arguments for breaks and counts to create a histogram object in R without massively expanding the dataset.
For example, if the dataset has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far first expand that into a 13-element vector and then create the histogram. PreBinnedHistogram, in contrast, creates the histogram directly from the 3-element input without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
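A hypothetical usage with the three-bucket example above; histogram breaks must be numeric, so the bin edges 0:3 below are an assumption made for illustration:
library(HistogramTools)
# Three unit-width bins (assumed edges) holding the counts 5, 7, and 1
h <- PreBinnedHistogram(breaks = 0:3, counts = c(5, 7, 1))
plot(h)  # the result behaves like a standard R histogram object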