Spearman's rank correlation - r

I'm writing a script that reads two .txt files into two vectors. After that I want to compute Spearman's rank correlation and plot the result.
Each value in the first vector is 12-13 characters long (e.g. 7.3445555667 or 10.3445555667), and each value in the second vector is a single character (e.g. 1 or 2).
The code:
vector1 <- read.table ("D:...path.../mytext1.txt", header=FALSE)
vector2 <- read.table ("D:...path.../mytext2.txt", header=FALSE)
cor.coeff = cor(vector1 , vector2 , method = "spearman")
cor.test(vector1 , vector2 , method = "spearman")
plot(vector1.var, vector2.var)
The .txt files contain only numeric values.
I'm getting two errors. The first, in line 4, is something like "'x' must be a numeric vector",
and the second, in line 5, is something like "object 'vector1.var' not found".
I also tried
plot(vector1, vector2)
instead of
plot(vector1.var, vector2.var)
But then there's an error like "Error in stripchart.default(x1, ...) : invalid plotting method".
The implementation is based on http://www.gardenersown.co.uk/Education/Lectures/R/correl.htm#correlation

I doubt vector1 and vector2 are vectors. Reading ?read.table we note in the Value section:
Value:
A data frame (‘data.frame’) containing a representation of the
data in the file.
....
So even if your two text files contain just a single variable, the two objects read in will be data frames with a single component each.
Secondly, your data files don't contain headers, so R will make up a variable name. I haven't tested this, but IIRC the variables in vector1 and vector2 will both be called V1. Do head(vector1) and the same on vector2 (or names(vector1)) to see how your objects look in R.
I can see why you might think vector1.var might work, but you should realise that, as far as R was concerned, it was looking for an object named vector1.var. The . is just like any other character in R object names. If you meant to use . as a subsetting or selection operator, then you need to read up on the subsetting operators in R. These are $ and [ and [[. See for example the R Language Definition manual or the R manual.
I suspect you could just change your code to:
vector1 <- read.table ("D:...path.../mytext1.txt", header=FALSE)[, 1]
vector2 <- read.table ("D:...path.../mytext2.txt", header=FALSE)[, 1]
cor.coeff <- cor(vector1 , vector2 , method = "spearman")
cor.test(vector1 , vector2 , method = "spearman")
plot(vector1, vector2)
But I am supposing quite a bit about what is in your two text files...
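To make the data frame versus vector point concrete, here is a minimal sketch. The two temporary files and their contents are made up for illustration (they stand in for mytext1.txt and mytext2.txt), and exact = FALSE is only there because the toy data contain ties.

```r
# Hypothetical stand-ins for mytext1.txt and mytext2.txt: one number per line.
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("7.3445555667", "10.3445555667", "8.1", "9.5", "6.2"), f1)
writeLines(c("1", "2", "1", "2", "1"), f2)

df1 <- read.table(f1, header = FALSE)
class(df1)   # read.table() returns a data frame, not a vector

# Pull the single column out so cor.test() and plot() get plain numeric vectors.
vector1 <- read.table(f1, header = FALSE)[, 1]
vector2 <- read.table(f2, header = FALSE)[, 1]

cor.coeff <- cor(vector1, vector2, method = "spearman")
# exact = FALSE avoids the "cannot compute exact p-value with ties" warning
cor.test(vector1, vector2, method = "spearman", exact = FALSE)
plot(vector1, vector2)
```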

str is a very useful function (see ?str for more) that one should use often, especially to verify R object types. A quick str(vector1) and str(vector2) will tell you if those columns were read as characters instead of numeric. If so, then use as.numeric(vector1) to typecast the data in each vector.
Also, names(vector1) and names(vector2) will tell you what the column names are and likely resolve your plotting issue.
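As a quick illustration of that inspection step, the sketch below reads a made-up one-column input from an in-memory string (an assumption, so no file is needed) and then looks at its structure and column name.

```r
# Hypothetical one-column input, read from a string instead of a file.
vector1 <- read.table(text = "7.3445555667\n10.3445555667\n8.1", header = FALSE)

str(vector1)     # a data.frame with one column, plus that column's type
names(vector1)   # the auto-generated column name

# Select the column by name. The conversion only matters if str() reported a
# character or factor column; going through as.character() is safe for both.
x <- as.numeric(as.character(vector1$V1))
```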

Related

How to remove the first row from multiple dataframes?

I have multiple dataframes and would like to remove the first row in all of them.
I have tried using a for loop but cannot understand what I am doing wrong
for (i in cities){
i <- i[-1, ]
}
I get the following error code:
Error in i[-1, ] : incorrect number of dimensions
If we assume that the only objects in your workspace are dataframes then this might succeed:
cities <- objects()
for (i in cities) { assign(i, get(i)[-1,])}
Explanation:
Two things were wrong with the original code:
One was already mentioned in comments: "df" is not the same as df. You need to use get to convert a character value to a "true" R name that is used to retrieve an object having that name. The result of objects() is only a character vector. In R the term "name" means a "language object". See the help page: ?mode. (There is potential confusion about rownames and colnames, which are always of class "character".) It's not like SAS, which is a macro language that has no such distinction.
The second error was trying to get substitution for the i on the left-hand side of <-. That would have failed even if you were working with actual R names. The assign function is designed to handle character values that are then converted to R names.
Say you get a list of all the tables in your environment, and you call that list cities. You can't just iterate over each value of cities and change things, because in the list they are just character strings.
Here is what you need:
for (i in cities){
tmp <- get(i) # load the actual table
tmp <- tmp[-1, ] # remove first row
assign(i, tmp) # re-assign table to original table name
}
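An alternative to get/assign, sketched here with two made-up data frames (berlin and paris are hypothetical names), is to collect the objects into a named list and work on the list. This is a swapped-in technique, not what the answers above use.

```r
# Hypothetical data frames standing in for the real city tables.
berlin <- data.frame(x = 1:3, y = c("a", "b", "c"))
paris  <- data.frame(x = 4:6, y = c("d", "e", "f"))

cities <- c("berlin", "paris")   # object names as character strings

# mget() fetches the objects by name into a named list;
# lapply() then drops the first row of each data frame.
city_list <- lapply(mget(cities, envir = environment()),
                    function(df) df[-1, , drop = FALSE])

nrow(city_list$berlin)
```

Keeping the tables in a list avoids writing back into the workspace, at the cost of accessing them as city_list$berlin rather than berlin.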

Making list of factors in a function but return warning error

Let's say that I have these vectors:
time <- c(306,455,1010,210,883,1022,310,361,218,166)
status <- c(1,1,0,1,1,0,1,1,1,1)
gender <- c(1,1,1,1,1,1,2,2,1,1)
And I turn them into this data frame:
dataset <- data.frame(time, status, gender)
I want to list the factors in the third column using this function (p/s: pardon the immaturity. I'm still learning):
getFactor<-function(dataset){
result <- list()
result["Factors"] <- unique(dataset[[3]])
return(result)
}
And all I get is this:
getFactor(dataset)
$Factors
[1] 1
Warning message:
In result["Factors"] <- unique(dataset[[3]]) :
number of items to replace is not a multiple of replacement length
I tried using levels, but all I get is an empty list. My question is (1) why does this happen? and (2) is there any other way that I can get the list of the factor in a function?
Solution is simple, you just need double brackets around "Factors" :)
In the function
result[["Factors"]] <- unique(dataset[[3]])
That should be the line.
The double brackets return an element, single brackets return that selection as a list.
Sounds silly, but try this
test <- list()
class(test["Factors"])
class(test[["Factors"]])
The first class will be of type 'list'. The second will be of type 'NULL'. This is because the single brackets returns a subset as a list, and the double brackets return the element itself. It's useful depending on the scenario. The element in this case is "NULL" because nothing has been assigned to it.
The warning "number of items to replace is not a multiple of replacement length" appears because you've asked R to put 2 things (the unique values 1 and 2) into a single-element slice of the list, so only the first one is kept. When you use double brackets you assign the whole vector as one element of the list, and an element can hold multiple values, so it works!
Hope that makes sense!
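A small sketch of the difference, using a two-value vector shaped like unique(dataset[[3]]) from the question. The object names are made up for illustration, and the warning is silenced only to keep the example quiet.

```r
vals <- c(1, 2)   # same shape as unique(dataset[[3]]) in the question

res_single <- list()
# Single brackets replace a length-one slice: only the first value is kept
# and R warns about the leftover one (silenced here).
suppressWarnings(res_single["Factors"] <- vals)

res_double <- list()
# Double brackets assign the whole vector as one list element.
res_double[["Factors"]] <- vals

# Single brackets also work if the right-hand side is itself a list.
res_wrapped <- list()
res_wrapped["Factors"] <- list(vals)

res_single$Factors
res_double$Factors
```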
Currently, when you create your data frame, dataset$gender is a double vector (which R will automatically do if everything in it is numbers). If you want it to be a factor, you can declare it that way at the beginning:
dataset <- data.frame(time, status, gender = as.factor(gender))
Or coerce it to be a factor later:
dataset$gender <- as.factor(gender)
Then getting a vector of the levels is simple, without writing a function:
level_vector <- levels(dataset$gender)
level_vector
Your function's subsetting is also worth a second look. dataset[[3]] does return the third column, because a data frame is a list of columns; dataset[, 3] is the equivalent matrix-style form. The first element of a list is extracted with list[[1]].
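Putting the factor approach together on the question's own data gives the sketch below. The last line shows why levels() appeared to return nothing in the original attempt: a plain numeric vector has no levels.

```r
time   <- c(306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166)
status <- c(1, 1, 0, 1, 1, 0, 1, 1, 1, 1)
gender <- c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1)
dataset <- data.frame(time, status, gender = as.factor(gender))

levels(dataset$gender)   # character vector of the factor levels
levels(dataset[, 3])     # same column, selected by position
levels(c(1, 2))          # NULL: a numeric vector has no levels
```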

Automated conversion from simple character vector to string vector in R

Suppose I have the following character vector:
models <- c(CNRM_CERFACS_CNRM_CM5_ALADIN53_r1i1p1,
CNRM_CERFACS_CNRM_CM5_ALARO_0_r1i1p1,
CNRM_CERFACS_CNRM_CM5_CCLM4_8_17_r1i1p1,
CNRM_CERFACS_CNRM_CM5_RCA4_r1i1p1,
ICHEC_EC_EARTH_CCLM4_8_17_r12i1p1,
ICHEC_EC_EARTH_HIRHAM5_r3i1p1,
ICHEC_EC_EARTH_RACMO22E_r1i1p1,
ICHEC_EC_EARTH_RCA4_r12i1p1,
IPSL_IPSL_CM5A_MR_RCA4_r1i1p1,
IPSL_IPSL_CM5A_MR_WRF331F_r1i1p1,
MPI_M_MPI_ESM_LR_CCLM4_8_17_r1i1p1,
MPI_M_MPI_ESM_LR_RCA4_r1i1p1,
MPI_M_MPI_ESM_LR_REMO2009_r1i1p1,
MPI_M_MPI_ESM_LR_REMO2009_r2i1p1
)
I now want to convert these 14 character objects into strings, i.e. add quotation marks at the beginning and ending of each of these names to get this
models <- c("CNRM_CERFACS_CNRM_CM5_ALADIN53_r1i1p1",
"CNRM_CERFACS_CNRM_CM5_ALARO_0_r1i1p1",...
Is there a way of doing that automatically, rather than by hand?
c() erases the names of the variables you concatenate. You should use an object that keeps the names and then access them, I guess. Where do these names come from? I am sure there is no need to write them all explicitly...
For instance this does what you want...
models <- data.frame(CNRM_CERFACS_CNRM_CM5_ALADIN53_r1i1p1,
CNRM_CERFACS_CNRM_CM5_ALARO_0_r1i1p1,
CNRM_CERFACS_CNRM_CM5_CCLM4_8_17_r1i1p1,
CNRM_CERFACS_CNRM_CM5_RCA4_r1i1p1,
ICHEC_EC_EARTH_CCLM4_8_17_r12i1p1,
ICHEC_EC_EARTH_HIRHAM5_r3i1p1,
ICHEC_EC_EARTH_RACMO22E_r1i1p1,
ICHEC_EC_EARTH_RCA4_r12i1p1,
IPSL_IPSL_CM5A_MR_RCA4_r1i1p1,
IPSL_IPSL_CM5A_MR_WRF331F_r1i1p1,
MPI_M_MPI_ESM_LR_CCLM4_8_17_r1i1p1,
MPI_M_MPI_ESM_LR_RCA4_r1i1p1,
MPI_M_MPI_ESM_LR_REMO2009_r1i1p1,
MPI_M_MPI_ESM_LR_REMO2009_r2i1p1
)
colnames(models)
A work-around would be
models <- "CNRM_CERFACS_CNRM_CM5_ALADIN53_r1i1p1,
CNRM_CERFACS_CNRM_CM5_ALARO_0_r1i1p1,
CNRM_CERFACS_CNRM_CM5_CCLM4_8_17_r1i1p1,
CNRM_CERFACS_CNRM_CM5_RCA4_r1i1p1,
ICHEC_EC_EARTH_CCLM4_8_17_r12i1p1,
ICHEC_EC_EARTH_HIRHAM5_r3i1p1,
ICHEC_EC_EARTH_RACMO22E_r1i1p1,
ICHEC_EC_EARTH_RCA4_r12i1p1,
IPSL_IPSL_CM5A_MR_RCA4_r1i1p1,
IPSL_IPSL_CM5A_MR_WRF331F_r1i1p1,
MPI_M_MPI_ESM_LR_CCLM4_8_17_r1i1p1,
MPI_M_MPI_ESM_LR_RCA4_r1i1p1,
MPI_M_MPI_ESM_LR_REMO2009_r1i1p1,
MPI_M_MPI_ESM_LR_REMO2009_r2i1p1"
unlist(strsplit(models, split = ",\\s*"))
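A shortened, self-contained version of this work-around, assuming only three of the names and some indentation inside the string (the variable name models_raw is made up):

```r
models_raw <- "CNRM_CERFACS_CNRM_CM5_ALADIN53_r1i1p1,
  CNRM_CERFACS_CNRM_CM5_ALARO_0_r1i1p1,
  ICHEC_EC_EARTH_HIRHAM5_r3i1p1"

# Split on a comma plus any whitespace after it (newline, indentation);
# trimws() removes stray whitespace at either end of each piece.
models <- trimws(unlist(strsplit(models_raw, split = ",\\s*")))
models
```

Splitting on ",\\s*" rather than ",\n" means the result does not depend on how the lines inside the string are indented.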

How do I refer to defined objects using the reshape function in R?

I'm unable to get the reshape function ( stats::reshape ) to accept a reference to a defined character vector in one of its arguments. I don't know if this reflects wrong syntax on my part, a limitation of the function, or a more general issue related to how R itself operates.
I am using reshape to change data from wide to long format. I have a dataset with many repeated measures that are sorted appropriately for reshape (x.1, x.2, x.3, y.1, y.2, y.3, etc). I've defined a variable firstlastmeasure that contains the index to the first and last column of repeated measures data that needs to be processed by reshape (this is to prevent having to change the index every time columns are added or removed from the original data).
This is how it's defined (in a convoluted way):
temp0 <- subset(p, select=nameoffirstcolumn:nameoflastcolumn)
lastmeasname = names(temp0[ncol(temp0)])
firstmeasname = names(temp0[1])
firstmeasindex = grep(firstmeasname,colnames(p))
lastmeasindex = grep(lastmeasname,colnames(p))
firstlastmeasure <- paste(firstmeasindex,lastmeasindex,sep=":")
I'm using this variable as an argument to reshape's varying parameter, like so:
reshape(dataset, direction = "long", varying = firstlastmeasure)
Reshape always returns:
"Error in guess(varying) : failed to guess time-varying variables from their names".
Using the numerical index explicitly (i.e. varying = 6:34) works fine.
paste creates a string, if you look at firstlastmeasure it will be something like "6:34". If you look at 6:34 it will be a vector 6 7 8 9 ... 34. You need to define the vector, not paste together a string. (Note that subset does a bit of special processing to make : work with column names.)
If I'm interpreting your code correctly, temp0 has all the columns you want, so you could just do
firstlastmeasure = names(temp0)
and be done with it. A little more complicated, you could keep your grep code and just not use paste:
firstlastmeasure = firstmeasindex:lastmeasindex
Since you are inputting names, the subset is unnecessary. Simplest of all would be to skip temp0 and do
firstlastmeasure = grep(nameoffirstcolumn, names(p)):grep(nameoflastcolumn, names(p))
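The difference between the pasted string and the integer vector can be seen in a small sketch. The data frame p and its column names below are made up to match the x.1, x.2, ..., y.3 layout described in the question, and which() with an exact match is used in place of grep().

```r
# Hypothetical wide data: an id column plus repeated measures x.1..x.3, y.1..y.3.
p <- data.frame(id = 1:2,
                x.1 = c(1, 2), x.2 = c(3, 4), x.3 = c(5, 6),
                y.1 = c(7, 8), y.2 = c(9, 10), y.3 = c(11, 12))

first_idx <- which(names(p) == "x.1")   # exact match on the column name
last_idx  <- which(names(p) == "y.3")

as_string <- paste(first_idx, last_idx, sep = ":")   # a single string
as_vector <- first_idx:last_idx                      # a vector of column indices

# The string cannot be interpreted as column names, so reshape() fails.
failed <- inherits(try(reshape(p, direction = "long", varying = as_string),
                       silent = TRUE), "try-error")

# The integer vector is what the varying argument can work with.
long <- reshape(p, direction = "long", varying = as_vector, idvar = "id")
head(long)
```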

A UPGMA cluster in R with NoData values

I have a matrix of sites. I want to build a UPGMA agglomerative clustering. I want to use R and the vegan library for that. My matrix has sites in which not all the variables were measured.
Below is a similar matrix of data:
Variable 1;Variable 2;Variable 3;Variable 4;Variable 5
0.5849774671338231;0.7962161133598957;0.3478909861199184;0.8027122599553912;0.5596553797833573
0.5904142034898171;0.18185393432022612;0.5503250366728479;NA;0.05657408486342197
0.2265148074206368;0.6345513807275411;0.8048128547418062;0.3303602674038131;0.8924461773052935
0.020429460126217602;0.18850489885886157;0.26412619465769416;0.8020472793070729;NA
0.006945970735023677;0.8404983401121199;0.058385134042814646;0.5750066564897788;0.737599672122899
0.9909722313946067;0.22356808747617019;0.7290078902086897;0.5621006367587756;0.3387823531518016
0.5932907022602052;0.899773235815933;0.5441346748937264;0.8045695319247985;0.6183003409599681
0.6520679140573288;0.5419713133237936;NA;0.7890033752744002;0.8561828607592286
0.31285906479192593;0.3396351688936058;0.5733594373520889;0.03867689654415574;0.1975784885854912
0.5045966366726562;0.6553489439611587;0.029929403932252963;0.42777351534900676;0.8787135401098227
I am planning to do it with the following code:
library(vegan)
# env <- read.csv("matrix_of_sites.csv")
env.norm <- decostand(env, method = "normalize") # Normalizing data here
env.ch <- vegdist(env.norm, method = "euclidean")
env.ch.UPGMA <- hclust(env.ch, method="average")
plot(env.ch.UPGMA)
After I run the second line, I get this error:
Error in x^2 : non-numeric argument to binary operator
I am not familiar with R, so I am not sure if this is due to the cells with no data. How can I solve this?
R does not think that the data in your matrix are numeric: at least some of them were interpreted as character variables and changed to factors. Inspect your data after reading it into R. If all your data are numbers, then sum(env) gives a numeric result. Use the str() or summary() functions for detailed inspection.
From R's point of view, your data file has mixed formatting. The R function read.csv assumes that items are separated by a comma (,) and the decimal separator is a period (.), while read.csv2 assumes that items are separated by a semicolon (;) and the decimal separator is a comma (,). You mix these two conventions. You can read data formatted like that, but you may have to give both the sep and dec arguments.
Once you get your data into R correctly, decostand will stop with an error: it does not accept missing values unless you add na.rm = TRUE. The same holds for the next command, vegdist: it also needs na.rm = TRUE to analyse your data.
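A minimal sketch of the reading step, assuming semicolon-separated fields with period decimals as in the question. The values below are shortened, made-up stand-ins for the real matrix, and the clustering uses base R's dist() on complete rows as a stand-in for the vegan calls, which is not the same as passing na.rm = TRUE to decostand and vegdist.

```r
# Shortened, made-up stand-in for the data file, read from a string.
txt <- "Variable 1;Variable 2;Variable 3
0.58;0.79;0.34
0.59;0.18;NA
0.22;0.63;0.80
0.02;0.18;0.26"

# Semicolon as field separator, period as decimal separator.
env <- read.csv(text = txt, sep = ";", dec = ".")

str(env)                       # every column should be reported as 'num'
all(sapply(env, is.numeric))   # TRUE when the file was parsed as numbers
colSums(is.na(env))            # number of missing values per variable

# Base-R stand-in for the vegan steps: drop incomplete rows, then UPGMA.
env.complete <- na.omit(env)
env.ch.UPGMA <- hclust(dist(env.complete, method = "euclidean"),
                       method = "average")
plot(env.ch.UPGMA)
```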
