A UPGMA cluster in R with NoData values - r

I have a matrix of sites. I want to build a UPGMA agglomerative clustering. I want to use R and the vegan library for that. My matrix has sites in which not all the variables were measured.
The following is a matrix similar to my data:
Variable 1;Variable 2;Variable 3;Variable 4;Variable 5
0.5849774671338231;0.7962161133598957;0.3478909861199184;0.8027122599553912;0.5596553797833573
0.5904142034898171;0.18185393432022612;0.5503250366728479;NA;0.05657408486342197
0.2265148074206368;0.6345513807275411;0.8048128547418062;0.3303602674038131;0.8924461773052935
0.020429460126217602;0.18850489885886157;0.26412619465769416;0.8020472793070729;NA
0.006945970735023677;0.8404983401121199;0.058385134042814646;0.5750066564897788;0.737599672122899
0.9909722313946067;0.22356808747617019;0.7290078902086897;0.5621006367587756;0.3387823531518016
0.5932907022602052;0.899773235815933;0.5441346748937264;0.8045695319247985;0.6183003409599681
0.6520679140573288;0.5419713133237936;NA;0.7890033752744002;0.8561828607592286
0.31285906479192593;0.3396351688936058;0.5733594373520889;0.03867689654415574;0.1975784885854912
0.5045966366726562;0.6553489439611587;0.029929403932252963;0.42777351534900676;0.8787135401098227
I am planning to do it with the following code:
library(vegan)
# env <- read.csv("matrix_of_sites.csv")
env.norm <- decostand(env, method = "normalize") # Normalizing data here
env.ch <- vegdist(env.norm, method = "euclidean")
env.ch.UPGMA <- hclust(env.ch, method="average")
plot(env.ch.UPGMA)
After I run the second line, I get this error:
Error in x^2 : non-numeric argument to binary operator
I am not familiar with R, so I am not sure if this is due to the cells with no data. How can I solve this?

R does not treat all of the data in your matrix as numeric: at least some of them were interpreted as character variables and changed to factors. Inspect your data after reading it into R. If all your data are numbers, then sum(env) gives a numeric result. Use the str() or summary() functions for detailed inspection.
From R's point of view, your data file has mixed formatting. The R function read.csv assumes that items are separated by a comma (,) and that the decimal separator is a period (.), while read.csv2 assumes that items are separated by a semicolon (;) and that the decimal separator is a comma (,). Your file mixes these two conventions. You can read data formatted like that, but you may have to give both the sep and dec arguments.
Once you get your data correctly into R, decostand will stop with an error: it does not accept missing values unless you add na.rm = TRUE. The same goes for the next command, vegdist: it also needs na.rm = TRUE to analyse your data.
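As a minimal sketch of the points above: the values below are a shortened, rounded copy of the first rows in the question, pasted inline through the text argument instead of a file (that inline string is an assumption made only to keep the example self-contained). The vegan steps run only if vegan is installed.

```r
# Semicolon-separated items with "." as the decimal mark, as in the question
txt <- "Variable 1;Variable 2;Variable 3;Variable 4;Variable 5
0.5850;0.7962;0.3479;0.8027;0.5597
0.5904;0.1819;0.5503;NA;0.0566
0.2265;0.6346;0.8048;0.3304;0.8924
0.0204;0.1885;0.2641;0.8020;NA
0.0069;0.8405;0.0584;0.5750;0.7376"

# Give sep and dec explicitly so every column is parsed as numeric
env <- read.csv(text = txt, sep = ";", dec = ".")
str(env)
all_numeric <- all(sapply(env, is.numeric))

if (all_numeric && requireNamespace("vegan", quietly = TRUE)) {
  # na.rm = TRUE lets both steps tolerate the missing cells
  env.norm <- vegan::decostand(env, method = "normalize", na.rm = TRUE)
  env.ch <- vegan::vegdist(env.norm, method = "euclidean", na.rm = TRUE)
  env.ch.UPGMA <- hclust(env.ch, method = "average")
  plot(env.ch.UPGMA)
}
```

If str(env) still shows character or factor columns, the separators in the real file probably differ from what was passed to sep and dec.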

Related

How to transfer multiple columns into numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I try to pass it, I get the error: "NAs introduced by conversion". This wouldn't be the end of the world (I also tried suppressWarnings()). But the bigger problem is that my function does not seem to convert said columns into the numeric type I need for my cor() function. I receive the "y must be numeric" error when trying to run the cor() function. I tried to put several arguments with and without '' or "", without success.
When I run str(cor.condition.cols) I only get character strings, which makes me think that my function somehow messes up the as.numeric conversion. Any suggestions for how else I could iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- sapply(res[cor.condition.cols], as.numeric) # numeric matrix, one column per gene (as.numeric() on a whole matrix would flatten it to a vector)
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
Or, if the columns are already numeric, the short variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
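To make the column-wise conversion concrete, here is a small sketch on simulated data. The data frame below is invented for illustration and only mimics the naming scheme from the question (three gene columns instead of twenty, stored as character strings).

```r
set.seed(1)

# Simulated stand-in for the real data set
res <- data.frame(Condition = rnorm(30))
for (i in 1:3) {
  res[[paste0("Genes", i, "_Acc4")]] <-
    as.character(round(res$Condition * i + rnorm(30), 3))
}

# Column names as plain strings, without the "res$" prefix
cor.condition.cols <- paste0("Genes", 1:3, "_Acc4")

# Convert column by column; sapply() returns a numeric matrix
res2 <- sapply(res[cor.condition.cols], as.numeric)

# One correlation per gene column, returned as a 3 x 1 matrix
cor.condition.Genes <- cor(res2, res$Condition,
                           use = "complete.obs", method = "pearson")
cor.condition.Genes
```

The row names of the result are the gene column names, so as.data.frame(cor.condition.Genes) gives a labelled table if a data frame is preferred.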

I want to be able to manipulate objects in class 'phylo' - ie. round/ turn my bootstrap values from decimals (.998) into percentages (99%)

I am using RStudio with the packages ape and phytools. I've generated a tree with 500 bootstrap replicates stored in an object of class phylo.
Where cw is the name of my tree, I've tried the following:
round(cw, digits = 2)
and I get the following error message:
Error in round(cw, digits = 2) :
non-numeric argument to mathematical function
I feel like it's probably a very simple manipulation but I'm not sure how to get there.
Hard to tell without a reproducible example, but I guess that your bootstrap scores are stored in the $node.label element of your tree.
You can try the following:
## Are the bootstraps in the $node.label object?
if(!is.null(cw$node.label)) {
## Are they as character or numeric?
class(cw$node.label)
}
If they are numeric values:
cw$node.label <- round(cw$node.label, digits = 2)
If they are characters, you can probably coerce them (that can produce some NAs):
cw$node.label <- round(as.numeric(cw$node.label), digits = 2)
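Since the question asks for percentages rather than rounded proportions, a possible variation is sketched below. It uses a plain list as a hypothetical stand-in for the phylo object (only the node.label element is mimicked, and the label values are made up), so it runs without ape.

```r
# Hypothetical stand-in for a "phylo" object: only node.label matters here
cw <- list(node.label = c("", "0.998", "0.87", "1"))

# Coerce to numeric (the empty label becomes NA), then scale to whole percentages
support <- as.numeric(cw$node.label)
cw$node.label <- round(100 * support)
cw$node.label
```

Note that round() turns 0.998 into 100; use floor(100 * support) instead if values should be truncated so that 0.998 becomes 99.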

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the np (nonparametric) package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: Did you first try bw=npregbw(ydat=y, xdat=x) without ordered()? ordered() turns your vector into an ordered factor (see ?ordered and ?factor), that is, integer codes with levels and a class attribute rather than a plain numeric vector.
EDIT2: So the problem was the way of subsetting data. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions generally expect vectors, unless they take a data = (your data) argument, in which case they work with column names.
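The subsetting difference and the NA mismatch can be checked with base R alone. The data frame below is invented for illustration, and the np call itself is left out; this is only a sketch of how to prepare matching vectors before calling npregbw.

```r
# Made-up data with NAs in different positions of x and y
dat <- data.frame(x = c(3, 4, NA, 2, 5), y = c(1, NA, 2, 0, 3))

x_vec <- dat$x      # numeric vector
x_col <- dat[, 1]   # numeric vector
x_df  <- dat["x"]   # data frame with one column
x_df2 <- dat[1]     # data frame with one column

is.numeric(x_vec)     # TRUE
is.data.frame(x_df)   # TRUE

# Keep only the rows where both x and y were observed
keep <- complete.cases(dat$x, dat$y)
x <- dat$x[keep]
y <- dat$y[keep]

# ordered() works on the plain vector
x_ord <- ordered(x)
x_ord
```

Dropping incomplete rows jointly keeps x and y the same length, which is what the "do not match" error is complaining about.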

Spearman's rank correlation

I'm writing a script that reads two .txt files into two vectors. After that I want to compute Spearman's rank correlation and plot the result.
The values in the first vector are 12-13 characters long (e.g. 7.3445555667 or 10.3445555667) and the values in the second vector are one character long (e.g. 1 or 2).
The code:
vector1 <- read.table ("D:...path.../mytext1.txt", header=FALSE)
vector2 <- read.table ("D:...path.../mytext2.txt", header=FALSE)
cor.coeff = cor(vector1 , vector2 , method = "spearman")
cor.test(vector1 , vector2 , method = "spearman")
plot(vector1.var, vector2.var)
The .txt files contain only numeric values.
I'm getting two errors: the first, in line 4, is something like "'x' must be a numeric vector",
and the second, in line 5, is something like "object 'vector1.var' not found".
I also tried
plot(vector1, vector2)
instead of
plot(vector1.var, vector2.var)
But then there's an error like "Error in stripchart.default (x1,...) : invalid plot-method".
The implementation is orientated at http://www.gardenersown.co.uk/Education/Lectures/R/correl.htm#correlation
I doubt vector1 and vector2 are vectors. Reading ?read.table we note in the Value section:
Value:
A data frame (‘data.frame’) containing a representation of the
data in the file.
....
So even if your two text files contain just a single variable, the two objects read in will be data frames with a single component each.
Secondly, your data files don't contain headers, so R will make up a variable name. With header=FALSE, read.table names the columns V1, V2, and so on, so the single variable in both vector1 and vector2 will be called V1. Do head(vector1) and the same on vector2 (or names(vector1)) to see how your objects look in R.
I can see why you might think vector1.var might work, but you should realise that, as far as R is concerned, it was looking for an object named vector1.var. The . is just like any other character in R object names. If you meant to use . as a subsetting or selection operator, then you need to read up on the subsetting operators in R. These are $ and [ and [[. See for example the R Language Definition manual or the R manual.
I suspect you could just change your code to:
vector1 <- read.table ("D:...path.../mytext1.txt", header=FALSE)[, 1]
vector2 <- read.table ("D:...path.../mytext2.txt", header=FALSE)[, 1]
cor.coeff <- cor(vector1 , vector2 , method = "spearman")
cor.test(vector1 , vector2 , method = "spearman")
plot(vector1, vector2)
But I am supposing quite a bit about what is in your two text files...
str is a very useful function (see ?str for more) that one should use often, especially to verify R object types. A quick str(vector1) and str(vector2) will tell you if those columns were read as characters instead of numeric. If so, then use as.numeric(vector1) to typecast the data in each vector.
Also, names(vector1) and names(vector2) will tell you what the column names are and likely resolve your plotting issue.
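Putting the points above together, here is a self-contained sketch. The two temporary files and their contents are invented stand-ins for mytext1.txt and mytext2.txt, so the real files may behave differently.

```r
# Write two small stand-in files, one value per line
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("7.3445555667", "10.3445555667", "8.1", "9.25", "6.5", "11.0"), f1)
writeLines(c("1", "2", "1", "2", "1", "2"), f2)

# read.table() returns a data frame, even for a single column
df1 <- read.table(f1, header = FALSE)
str(df1)
names(df1)

# Extract the first column to get plain vectors
vector1 <- df1[, 1]
vector2 <- read.table(f2, header = FALSE)[, 1]

cor.coeff <- cor(vector1, vector2, method = "spearman")
# exact = FALSE avoids the warning about exact p-values when ranks are tied
cor.test(vector1, vector2, method = "spearman", exact = FALSE)
plot(vector1, vector2)
```

If str() reports a character column instead of a numeric one, the file probably contains something other than numbers, such as a stray header line or a decimal comma.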

Resources