Remove indexing from matrix in R

I am trying to make a barplot for which I need my data in a matrix. I have made some really nice plots before when my matrix looked like this:
0% 20% 40% 60% 80%
C2 0.22 0.94 1.66 2.38 3.10
CC -1.38 -0.66 0.06 0.79 1.51
CCW -1.61 -0.87 -0.13 0.62 1.36
P -1.13 -0.16 0.81 1.78 2.76
PF 0.03 0.72 1.42 2.11 2.80
S2 -2.34 -1.61 -0.88 -0.16 0.57
For the rest of my data, I had to convert it from a dataframe, which I did using as.matrix(df). This matrix looks like this:
trt 2009 2010 2011
[1,] "C2" "9.0525" " 8.1400" " 8.1400"
[2,] "CC" "5.4200" " 4.7975" " 4.7975"
[3,] "CCW" "4.9675" " 4.0400" " 4.0400"
[4,] "P" "9.3150" "10.3500" "10.3500"
[5,] "PF" "9.0950" " 3.3375" " 3.3375"
[6,] "S2" "3.1725" " 3.1125" " 3.1125"
It won't work with the barplot function. I think I need to remove the index column, but haven't been able to. And what is with the quotes? I thought a matrix was a matrix, so I'm not sure what is going on here.

The quotes mean your matrix is in mode character. This is because a matrix, as opposed to a data.frame (which is superficially similar), can only hold one type. Because alphanumeric characters cannot be converted to numeric, your whole matrix was converted to mode character. It would be easier to remove the first column before converting to a matrix, and save yourself the trouble of converting the matrix back to numeric.
m <- as.matrix(df[, -1])
# To add the row names:
row.names(m) <- df[, 1]
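If you already have the character matrix and want to convert it instead of rebuilding it, a sketch along these lines should work (assuming df is the data frame from the question, with trt as its first column):
m_chr <- as.matrix(df)                      # character matrix, as in the question
m_num <- apply(m_chr[, -1], 2, as.numeric)  # drop the trt column, coerce the rest to numeric
rownames(m_num) <- m_chr[, 1]               # use the trt values as row names
barplot(m_num, beside = TRUE)               # barplot() should now accept it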

Related

reading all excel cells into a vector with R

I have an Excel spreadsheet with a bunch of numbers in it (some empty cells as well).
What I want to do is read that file into R in such a way that I can take the numbers from all the non-empty cells and put them into a vector. What is the best way to do this?
It's difficult to demonstrate a solution here without an example, but it's straightforward to create one:
library(openxlsx)
set.seed(1)
sample(c(round(rnorm(100), 2), rep("", 100)), 50, TRUE) |>
  matrix(10) |>
  as.data.frame() |>
  write.xlsx("myfile.xlsx")
In our spreadsheet software, the file looks like this:
To get the values in the spreadsheet into a single vector in R, we read it into a data frame, unlist it, convert to numeric, and remove NA values:
all_numbers <- read.xlsx("myfile.xlsx") |>
  unlist() |>
  as.numeric() |>
  na.omit() |>
  c()
all_numbers
#> [1] -0.44 -0.57 0.82 -0.02 1.36 -1.22 -1.25 -0.04 0.76 0.58 0.88
#> [12] 1.21 -2.21 -0.04 1.12 -0.74 -0.02 -0.16 -0.71 -0.41 -0.11 -0.16
#> [23] 0.03 0.34
You will see these match the numbers in the picture of the spreadsheet.

Inter-scale correlation matrix wide format to long format (in R)

In one of my data files, correlation matrices are stored in long format, where the first three columns represent the variables and the last three columns represent the inter-scale correlations. Frustratingly, I have to admit, the rows may represent different sub-dimensions of a particular construct (e.g., the 0s in column 1).
The data file (which is used for a meta-analysis) was constructed by a PhD student who "decomposed" all relevant correlation matrices by hand (so the wide-format matrix was not generated by another piece of code).
Example
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 4 1 -0.32 0.25 -0.08
[2,] 0 4 2 -0.32 0.15 -0.04
[3,] 0 4 3 -0.32 -0.04 0.27
[4,] 0 4 4 -0.32 -0.14 -0.16
[5,] 0 4 1 -0.01 0.33 -0.08
[6,] 0 4 2 -0.01 0.36 -0.04
[7,] 0 4 3 -0.01 0.04 0.27
[8,] 0 4 4 -0.01 -0.03 -0.16
My question is how to restore the inter-scale correlation matrix, such that:
c1_0a c1_0b c2_4 c3_1 c3_2 c3_3 c3_4
c1_0a
c1_0b
c2_4 -0.32 -0.01
c3_1 0.25 0.33 -0.08
c3_2 0.15 0.36 -0.04
c3_3 -0.04 0.04 0.27
c3_4 -0.14 -0.03 -0.16
I suppose this can be done with the reshape2 package, but I am unsure how to proceed. Your help is greatly appreciated!
What I have tried so far (rather clumsy):
I identified the unique combinations of columns 1, 2, and 4, which I transposed to correlation matrix #1;
Similarly, I identified the unique combinations of columns 1, 3, and 5, which I transposed to correlation matrix #2;
Similarly, I identified the unique combinations of columns 2, 3, and 6, which I transposed to correlation matrix #3;
Next, I bound the three matrices together, which gives an incorrect solution (this decomposition is sketched in code below).
The problem here is that matrix #1 has different dimensions [as there are two different correlations reported for the relationship between the first construct (0) and the second construct (4)] than matrix #3 [as there are eight different correlations reported for the second construct (4) and the third construct (1-4)].
I tried both the melt and reshape2 packages to overcome these problems (and to come up with a more elegant solution), but I did not find any clues about how to set up functions in these packages to reconstruct the correlation matrix.
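For concreteness, here is a minimal sketch of the manual decomposition described above, rebuilding the example data first (column positions are assumed as in the question); it also makes the dimension mismatch visible:
# Reconstruct the long-format example from the question
m <- matrix(c(0, 4, 1, -0.32,  0.25, -0.08,
              0, 4, 2, -0.32,  0.15, -0.04,
              0, 4, 3, -0.32, -0.04,  0.27,
              0, 4, 4, -0.32, -0.14, -0.16,
              0, 4, 1, -0.01,  0.33, -0.08,
              0, 4, 2, -0.01,  0.36, -0.04,
              0, 4, 3, -0.01,  0.04,  0.27,
              0, 4, 4, -0.01, -0.03, -0.16),
            ncol = 6, byrow = TRUE)
dat <- as.data.frame(m)
r12 <- unique(dat[, c(1, 2, 4)])  # construct 1 vs construct 2: 2 unique rows
r13 <- unique(dat[, c(1, 3, 5)])  # construct 1 vs construct 3: 8 unique rows
r23 <- unique(dat[, c(2, 3, 6)])  # construct 2 vs construct 3: 4 unique rows
nrow(r12); nrow(r13); nrow(r23)   # the differing dimensions described above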

Aggregating columns

I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1 and then aggregate these two columns. The aggregated column is considered the new column 1, and the column that was most correlated is removed from the set, so the data is reduced by one column. I then repeat the process until the data frame result has n columns, with the second column being the aggregation of two columns, the third column the aggregation of three columns, and so on. I am therefore wondering if there is an efficient or quicker way to get to the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash come from memory overhead during the loop, not from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.tables and taking advantage of their "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance and comment on the results. Cheers.
library(data.table)
n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA, nrow = r, ncol = n)))
temp <- setDT(copy(df))                 # temporary data.table in which the correlations are calculated
set(result, j = 1L, value = temp[[1]])  # the first column is the same
for (icol in as.integer(2:n)) {
  mch <- match(c(max(cor(temp)[-1, 1])), cor(temp)[, 1])               # determine which column is correlated most
  set(x = result, i = NULL, j = icol, value = temp[[1]] + temp[[mch]]) # aggregate and place the result in the result data.table
  set(x = temp, i = NULL, j = 1L, value = result[[icol]])              # set the aggregate as the new 1st column
  set(x = temp, i = NULL, j = as.integer(mch), value = NULL)           # remove the used column
}
Try
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor]
  temp[, 1] <- result[, i]  # set the aggregate as the new 1st column
  temp[, maxcor] <- NULL    # remove the used column
}
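This loop assumes temp and result have already been initialized from the question's df; a minimal sketch of that setup (names as in the question, not shown in the original answer):
temp <- df                                    # working copy whose columns get merged away
result <- data.frame(matrix(NA, nrow = r, ncol = n))
result[, 1] <- df[, 1]                        # the first column is carried over unchanged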
The error occurred because, in the last iteration, subsetting temp yields a single vector, and standard R behavior is to drop the class from data.frame to vector in such cases, which causes sapply to pass on only the first element, etc. The drop = FALSE argument above prevents this.
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.

Calculate the Euclidean distance of 3 points

I have a data.frame (Centroid) that contains points in virtual 3D space (columns = AV, V and A), each representing a character (column = Character). Each row contains a different character.
AV<-c(37.9,10.87,40.05)
V<-c(1.07,1.14,1.9)
A<-c(0.04,-1.23,-1.1)
Character<-c("a","A","b")
centroid = data.frame(AV,V,A,Character)
centroid
AV V A Character
1 37.90 1.07 0.04 a
2 10.87 1.14 -1.23 A
3 40.05 1.90 -1.10 b
I wish to know the similarity/dissimilarity between the characters. For example, "a" corresponds to 37.9, 1.07 and 0.04, whilst "A" corresponds to 10.87, 1.14 and -1.23. I want to know the distance between these characters / three points.
I believe I can calculate this using Euclidean distance between each character, but am unsure of the code to run.
I have attempted to use
dist(as.matrix(Centroids))
But have been unsuccessful, as this just gives a big print in the console. Any assistance would be greatly appreciated.
The following may be helpful:
AV<-c(37.9,10.87,40.05)
V<-c(1.07,1.14,1.9)
A<-c(0.04,-1.23,-1.1)
centroid = data.frame(A,V,AV)
centroid
A V AV
1 0.04 1.07 37.90
2 -1.23 1.14 10.87
3 -1.10 1.90 40.05
mm = as.matrix(centroid)
mm
A V AV
[1,] 0.04 1.07 37.90
[2,] -1.23 1.14 10.87
[3,] -1.10 1.90 40.05
dist(mm)
1 2
2 27.059909
3 2.571186 29.190185
as.dist(mm)
A V
V -1.23
AV -1.10 1.90
It is not clear what you mean by "Character<-c(a,A,b)"
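If you want the distances labelled by character rather than by row number, a small sketch (using the centroid data frame as defined in the question, which keeps the Character column): move the labels into the row names before calling dist().
mm <- as.matrix(centroid[, c("AV", "V", "A")])     # numeric coordinate columns only
rownames(mm) <- as.character(centroid$Character)   # label each point by its character
dist(mm)
#           a         A
# A 27.059909
# b  2.571186 29.190185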

Having strange output on the last list of values

I am iterating through a list which contains 4 lists. Below is the output that I get. I am wondering why the last list prints with this odd precision, for example, why isn't the first value just 1.00 as it is in the other cases?
[[1]]
[1] 1.00 0.96 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[2]]
[1] 1.00 0.98 0.84 0.74 0.66 0.56 0.48 0.38 0.26 0.16 0.06 0.00
[[3]]
[1] 1.00 0.94 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[4]]
[1] 1.000000e+00 9.400000e-01 8.400000e-01 7.400000e-01 6.600000e-01 5.800000e-01 4.600000e-01 3.600000e-01 2.600000e-01 1.600000e-01 6.000000e-02 1.110223e-16
As I commented when you first posted it as a follow-up comment on your previous question, this is more of a display issue. The last number is effectively zero:
R> identical(0, 1.1e-16)
[1] FALSE
R> all.equal(0, 1.1e-16)
[1] TRUE
R>
While its binary representation is not zero, it evaluates to something close enough under most circumstances. So you could run a filter over your data and replace 'near-zeros' with zero, or you could debug the code and see how/why it comes out as non-zero.
Also see the R FAQ and general references on issues related to floating-point computations.
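A quick sketch of such a "near-zero" filter (mylist stands in for your list of four vectors); base R's zapsmall() does much the same by rounding away tiny values:
# Replace values that are effectively zero (the tolerance here is arbitrary)
lapply(mylist, function(x) ifelse(abs(x) < 1e-12, 0, x))
# Or let zapsmall() round away the floating-point noise
lapply(mylist, zapsmall)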
If you want floating-point numbers displayed rounded to the second decimal digit, then use:
lapply( mylist, round, digits=2)
This approach has the advantage that it returns numeric-mode values, which a format() call would not. It can also be used with a "long" digits specification, which makes it an effective "zero filter":
lapply(list(c(1, 2), c(1.000000e+00, 9.400000e-01, 6.000000e-02, 1.110223e-16)),
       round, digits = 13)
[[1]]
[1] 1 2
[[2]]
[1] 1.00 0.94 0.06 0.00
I am not sure of the exact algorithm R uses to choose the format. It is clear that a single format is used for all values in each list. It is also clear that the last list contains values of vastly different orders of magnitude: 1.000000e+00 and 1.110223e-16. I therefore think it's reasonable that R chooses to print the last list using scientific notation.
