I have a dataset containing genes identified in different reference genomes. The reference genomes are in the rows and the genes are in the columns of the table. The table is coded as binary, where 0 means the gene is absent and 1 means the gene is present. I made gene accumulation curves, which indicate that the number of genes per genome is approaching a plateau. Now I am trying to plot rarefaction curves using the R package vegan. I used the following code:
library(vegan)
b <- read.csv("data.csv", header = TRUE, check.names = FALSE)
S <- specnumber(b) # observed number of genes per genome
(raremax <- min(rowSums(b))) # smallest row total
Srare <- rarefy(b, raremax) # rarefied gene counts
plot(S, Srare, xlab = "Observed No. of genes", ylab = "Rarefied No. of genes")
abline(0, 1)
rarecurve(b, step = 15, sample = raremax, col = "blue", cex = 0.6)
The data set is like the following:
        gene1 gene2 gene3
genome1     0     1     0
genome2     1     0     1
genome3     1     0     1
However, this code does not give me a satisfactory output: I just get a single straight line through the diagonal. I have attached the output below.
Can someone please suggest how I can correct the output?
Thank you.
The rarefy function rarefies individual rows of your data: it takes a subsample of the occurrences ("individuals") within each row. If all of these sampled individuals have the value 1, you will have a subsample of ones, and the sum of the ones is the sample size: that is what you got. There is no meaningful way to rarefy a vector of ones: you need count data with some counts > 1.
You were perhaps looking for the accumulation of genes over your whole data set when subsampling rows of the matrix. This is done in the vegan function specaccum (argument method = "exact"), which has its own plot and other methods.
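For example, a minimal sketch (assuming your binary table b as read in above, with genomes as rows):
library(vegan)
acc <- specaccum(b, method = "exact") # expected number of genes as genomes are added
plot(acc, xlab = "Number of genomes", ylab = "Number of genes")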
I'm building a hierarchical cluster from a symmetric correlation matrix. This matrix has values from 0 to 1.
docs <- dist(as.matrix(data), method = "euclidean")
hclust_dist <- as.dist(docs)
hclust_dist[is.na(hclust_dist)] <- 0
hclust_dist[is.nan(hclust_dist)] <- 0
sum(is.infinite(hclust_dist)) # THIS SHOULD BE 0
h <- hclust(hclust_dist, "ward.D2")
plot(h, cex = 0.6)
When I plot, I get this cluster:
I wish to divide the cluster into different groups with a correlation score threshold of 0.7, meaning that the units in the same group share a correlation score of at least 0.7.
However, my height values go from 0 to 30.
Does anyone know how to interpret this height and convert it into a correlation score from 0 to 1?
Or do I need to use a different clustering method?
I've found a possible solution.
Instead of Euclidean distances, I clustered on the correlation matrix directly, with this code:
data <- read.csv(file = "individuo21.csv", sep = ";", header = TRUE, row.names = 1)
dissimilarity <- 1 - data # convert correlations (0 to 1) into dissimilarities
distance <- as.dist(dissimilarity)
h <- hclust(distance)
plot(h, cex = 0.3)
# with dissimilarity = 1 - correlation, a minimum correlation of 0.7
# corresponds to cutting the tree at height 1 - 0.7 = 0.3
groups <- cutree(h, h = 1 - 0.70)
View(groups)
Now I get a dendrogram whose heights run from 0 to 1, like the correlation score.
Cluster obtained from correlation matrix
I'm extremely new to R and not good with statistics (not a good combination, I know). I have a dataset (genes) with genes found in different species. It looks something like this:
genes   sp1     sp2     sp3     sp4     sp5
genea   100     100     100     100     100
geneb     0       0   8.333       0   11.11
genec   100   11.11    16.6       0    16.6
The numbers correspond to the percentage presence of the gene in each species. I want to know the correlation of the presence/absence of the genes with one another using this data. I tried using cor() and corrplot to visualize, and corr_cross() to show the significant correlations:
library(lares)        # provides corr_cross()
library(corrplot)
library(RColorBrewer) # provides brewer.pal()
genes <- t(genes) # transpose so that the genes become the columns
corr_cross(genes,             # show the top 10 couples of variables (by correlation coefficient)
           max_pvalue = 0.05, # display only significant correlations (at the 5% level)
           top = 10)
genes <- cor(genes)
corrplot(genes,
         type = "lower",
         order = "AOE",
         tl.srt = 45,
         col = brewer.pal(n = 8, name = "RdYlGn"),
         bg = "lightblue",
         title = "Association of virulence factors",
         mar = c(0, 0, 1, 0))
dev.off()
When I do this, I get NA values and a warning message saying that the standard deviation is zero. I know this is because some genes are present in all strains of all species, so their columns are a constant 100; a zero-variance column makes the Pearson correlation undefined.
How can I revise my code to show associations between the genes even when some columns are constant? Or should I change my statistical method?
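A minimal sketch of one common fix (assuming the transposed genes matrix from above, with genes in columns, before it is overwritten by cor()): drop the zero-variance columns before computing the correlation matrix, since correlations are undefined for constant variables.
keep <- apply(genes, 2, sd) > 0 # keep only genes whose presence actually varies
res <- cor(genes[, keep])
corrplot(res, type = "lower")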
To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items I counted the number of times the pair occurred within the same order; the highest count was 489, between two items. I then "calculated" the similarity/correlation as: frequency / "max frequency of all pairs" (489). Now I have the dataset that I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried Jaccard's coefficient/index, but got almost the same results.
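For reference, a minimal sketch of the Jaccard index for one pair of items, assuming a long order table with hypothetical columns order_id and item:
jaccard <- function(orders, a, b) {
  in_a <- unique(orders$order_id[orders$item == a]) # orders containing item a
  in_b <- unique(orders$order_id[orders$item == b]) # orders containing item b
  length(intersect(in_a, in_b)) / length(union(in_a, in_b))
}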
The dataset: it contains material numbers V1 and V2, and N is the correlation between the two material numbers, between 0 and 1.
With help from someone else, I managed to create a distance matrix and use PAM clustering.
Why PAM clustering? A data scientist suggested it: you have more than 95% of pairs without information, which puts all those materials at the same distance and produces a single, very dispersed cluster. This problem can be reduced with a PAM algorithm, but you will still have a very concentrated group. Another solution is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think that for clustering I need the full 696x696 matrix, even though a lot of the entries are zeros. But I'm not sure.
Problem 2: The clustering does not do very well. I get very concentrated clusters: a lot of items end up in the first cluster. Also, by the usual ways of validating PAM clusters, my clustering results are poor. Is this due to the similarity analysis? What else should I use? Is it because 95% of the data are zeros? Should I change the zeros to something else?
The whole code and results:
library(data.table)
# Suppose X is the dataset
df <- data.table(X)
# duplicate the pairs in both directions, then spread into a square similarity matrix
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1]
ss <- as.matrix(ss) # diag<- below requires a matrix
ss <- ss / max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using PAM clustering:
library(cluster)
dd2 <- as.dist(1 - sqrt(ss)) # convert similarities into distances
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
  1   2   3   4
382 100  23  62
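One standard way to quantify "poor" here is the average silhouette width, which pam stores for k >= 2 (a minimal sketch, assuming the pam2 object above; values near 0 mean poorly separated clusters):
library(cluster)
pam2$silinfo$avg.width # average silhouette width over all points
plot(silhouette(pam2)) # per-cluster silhouette plot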
I'm not sure where you get the 696 number from. After the rbind you have a data frame with 567 unique values in V1 and V2; then you perform the dcast and, as expected, end up with a 567 x 567 matrix. Clustering-wise, I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
@Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis, from an academic viewpoint your current justification for why PAM is required does not hold any merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see the logic in applying PAM here (as I mentioned in the comments, PAM works best for mixed variables).
Continuing further, see this post on correlation detection in R:
# Objective: detect highly correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1, 3, 4, 5, 6, 7)]
# print the first 6 rows
head(my_data, 6)
# compute the correlation matrix using cor()
res <- cor(my_data)
round(res, 2) # unfortunately, cor() returns only the correlation coefficients between variables
# visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red.
# Colour intensity and circle size are proportional to the correlation coefficients.
# On the right of the correlogram, the legend colour shows the correlation coefficients and the corresponding colours.
# tl.col (text label colour) and tl.srt (text label string rotation) change the text colours and rotation.
# Apply a correlation filter at 0.80
# install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show the highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor <- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove the highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
dim(my_data)
[1] 32 4
Hope this helps.
We are looking at pattern recognition and creating different variables.
library(dplyr)
unusualsubjects <- rtaverages$subject_id[rtaverages$count < 5] # make a list of subjects without enough data
rtaverages <- filter(rtaverages, !(subject_id %in% unusualsubjects)) # only include data from good subjects (! = not); put the acceptable subjects' data right back in the same data frame
# Another example of filtering subjects: let's say we only wanted to analyze subjects with accuracies over 95%
accurateSubjects <- averages$subject_id[averages$accuracy > .95] # returns all of the subject_ids for subjects meeting the accuracy criterion
length(accurateSubjects) # tells us how many accurate subjects there are
goodSubjectdata <- filter(data, subject_id %in% accurateSubjects) # make a new data frame that contains only the accurate subjects' data
Code to conduct the actual ANOVA of the response-time results:
library(ez)
model <- ezANOVA(data = Data, dv = rt, within = c(set_size, target_presence, task), wid = subject_id) # repeated-measures ANOVA: dv = dependent variable, within = the within-subject variables, wid = the variable that groups data by subject
model # show the results of the ANOVA model
table1 <- tapply(X = Data$rt, INDEX = list(Data$task, Data$set_size), FUN = mean, trim = 0.1) # breakdown by task and set size only; less broken down than the fuller tapply with INDEX = list(rtaverages$target_presence, rtaverages$set_size, rtaverages$task)
table1 # show the means so that one can begin to interpret the data; break rtaverages down in different ways to get the different mean RTs needed for the report
library(sciplot)
par(mar = c(4, 4, 4, 0), mfrow = c(1, 2)) # mfrow = c(1, 2) creates two plots side by side
lineplot.CI(data = filter(rtaverages, task == "conjunctive"), x.factor = set_size, group = target_presence, x.cont = TRUE, response = rt, ylim = c(0, 4000), x.leg = 2, xlab = "Conjunctive Set Size", ylab = "RT") # line graph with confidence intervals
lineplot.CI(data = filter(rtaverages, task == "disjunctive"), x.factor = set_size, group = target_presence, x.cont = TRUE, response = rt, ylim = c(0, 4000), x.leg = 2, xlab = "Disjunctive Set Size", ylab = "RT") # line graph with confidence intervals
I am currently attempting to put 3 lines onto one plot in the following way:
# The next bit of code reproduces Treisman and Gelade's Figure 1, including best-fit lines
rtaverages$set_size_num <- sizes[rtaverages$set_size] # new column: the numeric/continuous version of the categorical set_size factor, useful for predicting RT from set size
bySetSize <- group_by(rtaverages, set_size_num, task, target_presence) # collapse even more, so all subjects' data are combined together
collapsed <- summarize(bySetSize, rt = mean(rt, trim = 0.1)) # make the RT summary
collapsed # show what the collapsed data look like; there are now only 4 (set sizes) x 2 (tasks) x 2 (present/absent) = 16 rows
cp <- filter(collapsed, task == "conjunctive" & target_presence == "present") # plot each of the four lines separately, filtering by the right type each time
cpf <- lm(data = cp, rt ~ set_size_num) # linear model predicting RT from set size; gives the best-fitting slope (set-size estimate) and intercept
summary(cpf) # summary of the linear regression model; cpf stands for: conjunctive, present fit
cp3 <- filter(collapsed, task == "conjunctive" & target_presence == "absent")
caf <- lm(data = cp3, rt ~ set_size_num)
summary(caf)
cp1 <- filter(collapsed, task == "disjunctive" & target_presence == "present")
dpf <- lm(data = cp1, rt ~ set_size_num)
summary(dpf)
cp2 <- filter(collapsed, task == "disjunctive" & target_presence == "absent")
daf <- lm(data = cp2, rt ~ set_size_num)
summary(daf)
plot(cp$set_size_num, cp$rt, ylim = c(0, 4000), xlim = c(0, 30), pch = 19, col = "green", xlab = "Set Size", ylab = "Response Time (msec.)") # use a range big enough to capture all of the data
abline(cpf, col = "green") # add the line with the slope and intercept derived from the linear model
lines(cp3$set_size_num, cp3$rt, col = "green") # x and y both from cp3, the same condition's data frame
abline(caf, col = "green")
lines(cp1$set_size_num, cp1$rt, col = "red")
abline(dpf, col = "red")
lines(cp2$set_size_num, cp2$rt, col = "red")
abline(daf, col = "red")
legend(x = 0, y = 4000, pch = c(19, 1, 19, 1), col = c("green", "green", "red", "red"), cex = 0.7, legend = c("Conjunctive present", "Conjunctive absent", "Disjunctive present", "Disjunctive absent")) # plot the legend only once; pch sets the 4 symbols, col sets the 4 colours; cex < 1 keeps the legend box from being too big
I got them to combine, but now the lines have lost their formatting:
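A sketch of one possible fix (assuming the four data frames and fits above): the legend promises distinct point symbols per condition, so draw the remaining conditions with points() and the pch codes the legend expects, instead of lines(), keeping abline() for the fitted lines:
plot(cp$set_size_num, cp$rt, ylim = c(0, 4000), xlim = c(0, 30), pch = 19, col = "green", xlab = "Set Size", ylab = "Response Time (msec.)")
abline(cpf, col = "green")
points(cp3$set_size_num, cp3$rt, pch = 1, col = "green") # open circles, as in the legend
abline(caf, col = "green")
points(cp1$set_size_num, cp1$rt, pch = 19, col = "red")
abline(dpf, col = "red")
points(cp2$set_size_num, cp2$rt, pch = 1, col = "red")
abline(daf, col = "red")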
I have decided to learn R and am going through the book Introduction to Scientific Programming and Simulation Using R (http://www.ms.unimelb.edu.au/spuRs/).
I am currently stuck on Chapter 7, question 3 of the book. The question is:
Consider the following very simple genetic model. A population consists of
equal numbers of two sexes: male and female. At each generation men and
women are paired at random, and each pair produces exactly two offspring,
one male and one female. We are interested in the distribution of height
from one generation to the next. Suppose that the height of both children
is just the average of the height of their parents, how will the distribution
of height change across generations?
Represent the heights of the current generation as a dataframe with two
variables, m and f, for the two sexes. The command rnorm(100, 160, 20)
will generate a vector of length 100, according to the normal distribution
with mean 160 and standard deviation 20 (see Section 16.5.1). We use it to
randomly generate the population at generation 1:
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
The command sample(x, size = length(x)) will return a random sample
of size size taken from the vector x (without replacement). (It will also
sample with replacement, if the optional argument replace is set to TRUE.)
The following function takes the dataframe pop and randomly permutes the
ordering of the men. Men and women are then paired according to rows,
and heights for the next generation are calculated by taking the mean of
each row. The function returns a dataframe with the same structure, giving
the heights of the next generation.
next.gen <- function(pop) {
  pop$m <- sample(pop$m)
  pop$m <- apply(pop, 1, mean)
  pop$f <- pop$m
  return(pop)
}
Use the function next.gen to generate nine generations, then use the lattice
function histogram to plot the distribution of male heights in each
generation, as in Figure 7.7. The phenomenon you see is called regression
to the mean.
Hint: construct a dataframe with variables height and generation, where
each row represents a single man.
I have constructed a blank data frame:
generations <- data.frame(gen="", height="")
For now I am trying to get just the first generation height information into it, so I run:
next.gen(pop)
generations$height <- pop$m
and I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "height", value = c(165.208323681597, :
replacement has 100 rows, data has 1
I understand that I'm trying to squeeze information from the pop$m column into a single row of generations$height, and that this is causing the problem, but I do not know how to fix it. I thought a blank data frame would be flexible enough to add rows as they are copied over from the pop data frame?
I then tried running this code:
generations <- pop$m
This gives me 100 values, but I think it just turns my generations data frame into a vector; running
generations
just lists the copied values as a vector.
I think I am approaching the first step wrong. Is my data frame definition correct? Why can't I copy row information from one data frame into an empty one and have the empty data frame grow as needed?
Thank you
I'm unsure of the exact output you are looking for. Here is an approach which should be simple enough to follow. Note: there are plenty of workable approaches.
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
next.gen <- function(pop) {
  pop$m <- sample(pop$m)
  pop$m <- apply(pop, 1, mean)
  pop$f <- pop$m
  return(pop)
}
# the code
test <- list()
for (i in 1:9) {
  # note: as written, each iteration re-derives one generation from the same
  # starting pop; to chain nine successive generations, update it each time
  # with pop <- next.gen(pop) and store pop["m"]
  test[[i]] <- next.gen(pop)["m"]
  test[[i]]$generation <- paste0("g", i)
}
library(data.table)
test2 <- rbindlist(test)
# result
             m generation
   1: 174.6558         g1
   2: 143.2617         g1
   3: 185.2829         g1
   4: 168.9719         g1
   5: 151.6948         g1
  ---
 896: 159.6091         g9
 897: 161.4546         g9
 898: 171.8679         g9
 899: 138.4982         g9
 900: 152.7390         g9
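From here, a minimal sketch of the plotting step the exercise asks for (assuming the test2 table above), using the lattice histogram function:
library(lattice)
test2$generation <- factor(test2$generation, levels = paste0("g", 1:9)) # keep the panels in order
histogram(~ m | generation, data = test2, xlab = "Male height")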
Try:
> generations <- data.frame(gen = "", height = "", stringsAsFactors = FALSE)
> for (i in 1:length(pop$m)) generations[i, ] = c("", pop$m[i])
> generations
  gen           height
1      136.70042632318
2     153.985392293761
3     122.077485676327
4     166.582538529591
5     170.751368839498
6       190.8894492681
...
...