Nested anova with 3 unique levels - r

I am trying to run a nested anova in R.
I have 3 unique factors: Vegetation type, transect number, and distance.
I am trying to determine if humidity differs among vegetation type.
There are three transects within each of the three vegetation types (labelled 1-9). Along each transect are 8 distances (ranging from 50 m to 400 m), and humidity was measured at each distance (e.g., at 50 m measure humidity, at 100 m measure humidity).
This is the code I originally tried:
nest <- aov(Temp_400m ~ Vegetation / factor(Transect), data = Data)
summary(nest)
I am also wondering if I need to convert transect number and distance to categorical values (i.e., instead of transect # 1-9, it would be A-I, and instead of distance 50 - 400, it would be 50m... 400m).
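As a minimal sketch of both points (converting to factors, and transect nested within vegetation type), here is a self-contained example with made-up column names and toy humidity values; the real data frame and its columns may of course differ:

```r
# Toy data mirroring the design: 3 vegetation types x 3 transects x 8 distances
Data <- expand.grid(Vegetation = c("Forest", "Shrub", "Grass"),
                    Transect   = 1:3,
                    Distance   = seq(50, 400, by = 50))
set.seed(1)
Data$Humidity <- rnorm(nrow(Data), mean = 60, sd = 5)

# Treat transect and distance as categorical (factors), not numeric;
# relabelling 1-9 as A-I is not needed, factor() is enough
Data$Transect <- factor(Data$Transect)
Data$Distance <- factor(Data$Distance)

# Transect nested within vegetation type
nest <- aov(Humidity ~ Vegetation / Transect, data = Data)
summary(nest)
```

Because `Transect` is a factor, `Vegetation / Transect` expands to `Vegetation + Vegetation:Transect`, which is the nested structure intended here.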

Related

Calculate and return a single distance value between matrices A and B in R

I have seen similar posts about distances (mostly Euclidean) between matrices A and B. However, they return a matrix of the distances between each pair of matched observations (rows).
Now I have this problem: say I have a list of drug treatments, where each treatment is a matrix of N rows x 9 columns. Each treatment matrix has a different number of rows (experiment subjects) but the same columns (variables).
I want to compare how similar the treatments are based on how the same experiment subjects responded to the treatment according to the measured variables. So it occurred to me to compute the distance between each pair of treatment matrices as a single value, and then store that value in a matrix that contains all the comparisons between treatments. Finally, I can visualize the relationships among treatments in a heatmap via hierarchical clustering.
# take 2 treatments as an example:
library(dplyr)  # for inner_join
set.seed(123)
Treatment1 <- data.frame(x = sample(1:10000, 3),
                         y = sample(1:10000, 3),
                         z = sample(1:10000, 3))
Treatment2 <- data.frame(x = sample(1:100, 3),
                         y = sample(1:100, 3),
                         z = sample(1:1000, 3))
# let's say I have 10 treatments/drugs, i.e. length(Drugs) = 10
Drugs <- list(Treatment1, Treatment2, ..., Treatment10)
# empty matrix to record all distances (0 on the diagonal)
distance_values <- matrix(0, nrow = length(Drugs), ncol = length(Drugs))
# now I want to construct the matrix of all the distance measurements:
for (i in 1:(length(Drugs) - 1)) {
  for (j in (i + 1):length(Drugs)) {
    # Match by ID; let's assume the 1st column is the ID
    total <- inner_join(Drugs[[i]], Drugs[[j]], by = "ID")
    # Calculate the distance
    d <- # some sort of dist function(total[, drugi], total[, drugj])
    # Store in both symmetric positions of the matrix
    distance_values[i, j] <- d
    distance_values[j, i] <- d
  }
}
plot(hclust(as.dist(distance_values)))
So I got stuck at the "some sort of dist function" part: as far as I can tell, functions like distmap and pdist return a matrix of distances between the row observations of two matrices, so I can't store the result in a single position of my empty matrix. I need a single number for any given pair of matrices. Am I making sense? What function could I use to calculate such a distance?
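One way to collapse a between-matrix comparison to a single number, sketched below, is the mean of all row-to-row Euclidean distances between the two matrices. This is one of several reasonable summaries (others include the distance between column means, or the Frobenius norm of the difference after matching rows), not the only correct choice:

```r
# Single-number distance between two matrices: mean of all row-to-row
# Euclidean distances, computed via the identity
# |a - b|^2 = |a|^2 + |b|^2 - 2 a.b (pmax guards against tiny negative
# values from floating-point round-off).
mat_dist <- function(A, B) {
  A <- as.matrix(A); B <- as.matrix(B)
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  mean(sqrt(pmax(sq, 0)))
}

set.seed(123)
Treatment1 <- matrix(runif(30), nrow = 10)  # 10 subjects x 3 variables
Treatment2 <- matrix(runif(30), nrow = 10)
mat_dist(Treatment1, Treatment2)            # one scalar per pair
```

This scalar can then be stored in `distance_values[i, j]` inside the double loop and fed to `hclust(as.dist(distance_values))`.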

r - Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies for i distinct alleles/SNPs in several populations. Additionally, I recorded some factors suspected of having changed the frequencies of these alleles within the populations in the past through their selective effect. It is assumed that the selection impact can be described by a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies would be under identical selective forces (thus, I set selection = 1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i | selection = 1]
with a_i the current frequency of allele i in a population and function[a_i | selection = 1] the estimated allele frequency in the absence of selective forces.
However, there are some constraints for the whole process:
The minimal value of a'_i allowed is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may (but need not) have an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
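A heuristic base-R sketch of one reading of the setup: regress each allele frequency on the selection variables, strip the fitted selection effect (keeping the allele's mean level), then enforce the constraints by truncating at 0 and renormalizing each population to sum to 1. The column names (p, q, r, x, y, z) follow the example data set described above but the values are simulated; for an exact solution under the constraints, constrained least squares (e.g. quadprog::solve.QP with a sum-to-one equality and non-negativity inequalities) would be the rigorous route:

```r
# Simulated stand-in for the ABO example: frequencies p, q, r per
# population and selection variables x, y, z
set.seed(42)
n <- 20
dat <- data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n))
dat$p <- runif(n, 0.2, 0.4)
dat$q <- runif(n, 0.1, 0.3)
dat$r <- 1 - dat$p - dat$q            # frequencies sum to 1 per population

deselect <- function(alleles, preds, data) {
  adj <- sapply(alleles, function(a) {
    fit <- lm(reformulate(preds, response = a), data = data)
    # frequency with the estimated selection effect removed
    residuals(fit) + mean(data[[a]])
  })
  adj <- pmax(adj, 0)                  # constraint: a'_i >= 0
  adj / rowSums(adj)                   # constraint: rows sum to 1
}

res <- deselect(c("p", "q", "r"), c("x", "y", "z"), dat)
head(rowSums(res))                     # each population sums to 1
```

Note that truncate-and-renormalize is a post-hoc fix, not the maximum-likelihood solution under the constraints; it is shown only because it needs no extra packages.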

Shannon Information weighted by probability of different types

Suppose I have n independent types in a system, each occurring with probability t_i, i = 1, ..., n (so the t_i sum to 1). Suppose also that I can calculate the Shannon entropy for each type; call this value S_i.
1) Does it make sense to then calculate a weighted sum such as H= -sum_{i=1}^{n} t_i * S_i?
2) How could I compare H values of two systems with different number of types? (e.g., system 1 has n=2 types and system 2 has n=4 types).
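A small sketch of the two quantities in question; the normalization by log(n) used for the cross-system comparison in 2) is a common convention (relative entropy / evenness), not something stated in the post:

```r
# Shannon entropy of a probability vector (natural log)
shannon <- function(p) -sum(p[p > 0] * log(p[p > 0]))

# System 1: two types; each type has its own internal distribution
t1 <- c(0.5, 0.5)                                    # type probabilities
S1 <- c(shannon(c(0.9, 0.1)), shannon(c(0.5, 0.5)))  # per-type entropies

# Weighted sum as in 1); note no extra minus sign is needed here,
# since the S_i are already (non-negative) entropies
H1 <- sum(t1 * S1)

# For 2): normalizing an entropy by its maximum log(n) maps it to [0, 1],
# which makes systems with different numbers of types comparable
H1_type <- shannon(t1) / log(length(t1))
```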

Creating clusters via weighted randomization

I need to assign weights to a sample for a country. I already have the population for each of the 85 regions, but I cannot perform the cluster sampling. Basically, I need to create 100 clusters, each with 15 units - 1500 respondents overall. I have an Excel file with all the variables for the 85 regions.
Question 1:
How can I use the already generated population probability to do a weighted randomization for 100 clusters (with 15 units each)?
Question 2:
I need to draw from the 85 regions and generate 100 clusters. Logically, the capital and some of the other big cities should get more than one cluster, since their larger populations give them a higher probability of containing a cluster. So how can I draw the clusters (15 units each) and assign a number of clusters to each region? For instance, if the capital's cluster probability is 0.08, that means 8 of the 100 clusters (15 units each) should be assigned to the capital. How do I add that column?
Specifically the problem with my current results is that I cannot generate the column with the number of clusters per region. For instance, region A to have 3 clusters, while region B 1 and so forth.
Here is my code:
data1$clusProb1 <- data1$Population.2018 / sum(data1$Population.2018)
sampInd <- c(1:length(data1$Federal.Subject),
             sample(1:length(data1$Federal.Subject),
                    length(data1$Federal.Subject) * 14,
                    prob = data1$clusProb1, replace = TRUE))
sampFields <- data.frame(id = 1:(length(data1$Federal.Subject) * 15),
                         Gender = sample(c(0, 1),
                                         length(data1$Federal.Subject) * 15,
                                         replace = TRUE))
sampleData <- cbind(data1[sampInd, ], sampFields)
sampleData
summary(sampleData)
The result should look like:
Cluster number   Region
1                A
2                A
3                A
4                C
5                D
6
NOTE: A represents a region with a higher population, which should have more clusters assigned to it.
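One way to get that clusters-per-region column, sketched with made-up region names and populations: a single multinomial draw allocates the 100 clusters across regions in proportion to the population probabilities, so populous regions tend to receive several clusters:

```r
# Allocate 100 clusters to regions in proportion to population
# (region names and population figures below are illustrative only)
set.seed(1)
pop  <- c(Capital = 12e6, A = 5e6, B = 1e6, C = 0.5e6)
prob <- pop / sum(pop)

# One multinomial draw: how many of the 100 clusters land in each region
n_clusters <- setNames(rmultinom(1, size = 100, prob = prob)[, 1], names(pop))
n_clusters          # named vector: clusters per region, sums to 100

# Expand to one row per cluster; the 15 units per cluster would then be
# sampled within the assigned region
clusters <- data.frame(Cluster = 1:100,
                       Region  = rep(names(n_clusters), n_clusters))
```

Merging `n_clusters` back onto the region table by name gives the desired "number of clusters per region" column.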

how to calculate all pairwise distances in two dimensions

Say I have data concerning the position of animals on a 2d plane (as determined by video monitoring from a camera directly overhead). For example a matrix with 15 rows (1 for each animal) and 2 columns (x position and y position)
animal.ids<-letters[1:15]
xpos<-runif(15) # x coordinates
ypos<-runif(15) # y coordinates
raw.data.t1<-data.frame(xpos, ypos)
rownames(raw.data.t1) = animal.ids
I want to calculate all the pairwise distances between animals. That is, get the distance from animal a (row 1) to the animals in row 2, row 3, ..., row 15, and then repeat that for all rows, avoiding redundant distance calculations. The desired output of a function that does this would be the mean of all the pairwise distances. I should clarify that I mean the simple straight-line distance, from the formula d <- sqrt(((x1-x2)^2)+((y1-y2)^2)). Any help would be greatly appreciated.
Furthermore, how could this be extended to a similar matrix with an arbitrarily large even number of columns (every two columns representing x and y positions at a given time point). The goal here would be to calculate mean pairwise distances for every two columns and output a table with each time point and its corresponding mean pairwise distance. Here is an example of the data structure with 3 time points:
xpos1<-runif(15)
ypos1<-runif(15)
xpos2<-runif(15)
ypos2<-runif(15)
xpos3<-runif(15)
ypos3<-runif(15)
pos.data<-cbind(xpos1, ypos1, xpos2, ypos2, xpos3, ypos3)
rownames(pos.data) = letters[1:15]
The aptly named dist() will do this:
x <- matrix(rnorm(100), nrow=5)
dist(x)
1 2 3 4
2 7.734978
3 7.823720 5.376545
4 8.665365 5.429437 5.971924
5 7.105536 5.922752 5.134960 6.677726
See ?dist for more details
Why do you compute d <- sqrt(((x1-x2)^2)+((y1-y2)^2))?
If you only need to compare distances, compare the squared distances d2 <- ((x1-x2)^2)+((y1-y2)^2) instead. Skipping the square root costs much less.
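Putting the pieces together for the second part of the question, a sketch using the same dist() approach: take the mean of dist() (which already avoids redundant pairs) on each (x, y) column pair, using the pos.data layout from the question with simulated positions:

```r
# Positions of 15 animals at 3 time points, two columns per time point
set.seed(7)
pos.data <- cbind(xpos1 = runif(15), ypos1 = runif(15),
                  xpos2 = runif(15), ypos2 = runif(15),
                  xpos3 = runif(15), ypos3 = runif(15))
rownames(pos.data) <- letters[1:15]

# Single time point: dist() returns the 15*14/2 unique pairs, so
mean(dist(pos.data[, 1:2]))   # mean pairwise distance at time 1

# All time points: apply the same idea to every (x, y) column pair
time_starts <- seq(1, ncol(pos.data), by = 2)
mpd <- data.frame(
  time = seq_along(time_starts),
  mean_pairwise_dist = sapply(time_starts,
                              function(i) mean(dist(pos.data[, c(i, i + 1)]))))
mpd                           # one mean pairwise distance per time point
```

This works for any even number of columns, since time_starts simply steps through the columns two at a time.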
