Reducing the dimensionality of a vector - r

How can I reduce the size of a vector to a lower dimension?
Say, for example, X := (1,2,3,4,5,6,7,8,9,10) is a 10-dimensional vector, and suppose
I want to reduce it to a 5-dimensional space. Is there any way to do this?
I have a situation where I need to compare an N-d vector with a corresponding vector of a lower dimension.

There are infinitely many ways to convert a 10-d vector into a 5-d vector.
This is like saying "I want a function that takes two integer parameters and returns an integer; can I make such a function?" There are infinitely many such functions.
It really depends on what you want to do with the vector. What are the meanings of your 10d and 5d vectors?
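For instance, if the goal is to keep as much of the variance in the data as possible, one common choice is principal component analysis. A minimal sketch using base R's prcomp(), assuming you have a sample of 10-dimensional observations (a single vector cannot be meaningfully reduced in isolation; the random data here is purely illustrative):

```r
# Sketch: reduce 10-dimensional observations to 5 dimensions with PCA.
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)  # 100 observations, 10 dims

p  <- prcomp(X, center = TRUE, scale. = TRUE)  # fit PCA
X5 <- p$x[, 1:5]                               # keep the first 5 components

dim(X5)  # 100 rows, now 5 columns each
```

Whether PCA (or any other projection) is appropriate depends entirely on what the coordinates mean, which is the point of the answer above.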

If my assumption is right, the OP would like to convert a vector of 10 values to a matrix with 2 columns.
This could be done easily in R:
# make up the demo data
> v <- c(1,2,3,4,5,6,7,8,9,10)
# modify the dimensions of 'v' to have 2 columns
> dim(v) <- c(5,2)
# and check the result
> v
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
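The same reshape can be done in one step with matrix(), which also fills column-major by default:

```r
v <- c(1,2,3,4,5,6,7,8,9,10)
# one-step equivalent of dim(v) <- c(5, 2)
m <- matrix(v, nrow = 5, ncol = 2)
m
#      [,1] [,2]
# [1,]    1    6
# [2,]    2    7
# [3,]    3    8
# [4,]    4    9
# [5,]    5   10
```

Use matrix(v, ncol = 2, byrow = TRUE) if you instead want the values filled row by row.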

Related

R: mean pairwise differences in string vectors

I have a big list (10,000 items) of string vectors (50 elements each).
Here is a small sample from a vector:
c("0101000000000000000000000000000000000000000000001000000000000000000000000000000100000000000100000000000000100000000000000000110000000000000000000000",
"0000000000000000000000000000000010000000000000000000000000110000000000000000000000000001000000000000001000000010000000000000000000000000000000000001",
"0000000000000000000100010000000010000000000000010000000000100000000000000100000001000001100000000000001000001000000100000000000000000000000000000100")
In every vector, I would like to compute all pairwise differences (i.e. the number of differing characters in each pair) and then take the mean.
Given the big size of the list, what would be a cost-effective way to do so?
thank you!!
EDIT- UPDATE
I am adding an example with shorter vectors and expected output (same as my comment below):
a=c("0010100101",
"1001011101",
"1111111010")
expected result (mean of pairwise differences):
6.666667
Many thanks to @Roland for providing the solution, based on Levenshtein distance!
Your question is not terribly clear but you appear to want a Levenshtein distance:
x = c("0010100101",
"1001011101",
"1111111010")
#switch off deletions and insertions:
d <- adist(x, costs = list(ins=Inf, del=Inf, sub=1))
#     [,1] [,2] [,3]
#[1,]    0    6    8
#[2,]    6    0    6
#[3,]    8    6    0
mean(d[upper.tri(d)])
#[1] 6.666667
Well, I realised my original sample strings were rather long, so here I repeat the example with shorter ones:
a=c("0010100101",
"1001011101",
"1111111010")
I came up with this solution:
library(stringdist)
# matrix with no. of pairwise different characters
xx=(1-outer(a,a,"stringsim",method="hamming"))*nchar(a[1])
> xx
     [,1] [,2] [,3]
[1,]    0    6    8
[2,]    6    0    6
[3,]    8    6    0
# mean no. of pairwise differences
xx1=mean(xx[lower.tri(xx,diag = FALSE)])
> xx1
[1] 6.666667
The result is fine, but I am not sure whether this is an ideal solution,
or whether the "hamming" method is the correct choice in every case (it only applies to strings of equal length).
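For equal-length strings like these, the Hamming distance is exactly the count of differing positions, so a base-R cross-check (a sketch with no extra packages) can confirm the numbers:

```r
a <- c("0010100101",
       "1001011101",
       "1111111010")

# Hamming distance between two equal-length strings:
# count the positions where the characters differ
ham <- function(s1, s2) sum(strsplit(s1, "")[[1]] != strsplit(s2, "")[[1]])

d <- outer(a, a, Vectorize(ham))  # full pairwise distance matrix
mean(d[lower.tri(d)])             # mean pairwise difference
# [1] 6.666667
```

This matches the adist() and stringsim() results above, which suggests the "hamming" method is doing the right thing for this kind of fixed-length data.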

Creating upper/lower triangular correlation matrix based on values from a group of text files?

I'll try not to complicate things too much with my explanations, but I'm confused about how best to fill a triangular correlation matrix (with no repeated values) using correlation values derived from another package. This involves extracting specific values from a list of text files. This is what I have done so far:
# read in list of file names (they are named '1_1', '1_2' .. so on until '47_48' with no repeat values generated)
filenames <- read_table('/home/filenames.txt', col_names = 'file_id')
# create symmetrical matrix
M <- diag(48)
ct <- 1
for (sub in (filenames$file_id)) {
subj <- read.table(paste0(dat_dir, '/ht_', sub, '.HEreg'), sep="", fill=TRUE)
ht <- as.character(subj$V2[grep("rG", subj$V1)]) # wanting to extract the specific value in that column for each text file
M[ct,] <- as.numeric(ht) #input this value into the appropriate location
ct <- ct + 1
}
This obviously does not give me the triangular output I envision - I know there is an error in how the variable 'ht' is inserted into the matrix, but am not sure how to solve it. Ideally, the correlation value from file 1_1 should be inserted in row 1, col 1; file 1_2 in row 2, col 1; and so on, avoiding repeats (which should be 0's).
Should I turn to nested loops?
Much help would be appreciated from this R newbie here, I hope I didn't complicate things unnecessarily!
I think the easiest way would be to read in all your values into a vector. You can do this using a variation of your existing loop.
Let us assume that your desired size correlation matrix is 5x5 (I know you have 48x48 judging by your code, but to keep the example simple I will work with a smaller matrix).
Let us assume that you have read all of your correlation values into the vector x in column-major order (the order R itself uses), i.e. the first element of x is row 2 column 1, the second element is row 3 column 1, etc. I am further assuming that you are creating a symmetric correlation matrix with ones on the diagonal (hence your use of the diag() function), which is why the indexing starts below the diagonal. Let's assume your vector x contains the following values:
x <- 1:10
I know that these are not correlations, but they will make it easy to see how we fill the matrix, i.e. which vector element goes into which position in the resulting matrix.
Now, let us create the identity matrix and zero matrices for the upper and lower triangular correlations (off diagonal).
# Assuming 5x5 matrix
n_elements <- 5
m <- diag(n_elements)
m_upper <- m_lower <- matrix(0, n_elements, n_elements)
To quickly fill the lower triangular matrix, we can use lower.tri().
m_lower[lower.tri(m_lower, diag = FALSE)] <- x
This will yield the following output:
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    1    0    0    0    0
[3,]    2    5    0    0    0
[4,]    3    6    8    0    0
[5,]    4    7    9   10    0
As you can see, we have successfully filled the lower triangle. Also note the order in which the elements of the vector are filled into the matrix; this is crucial for your results to be correct. The upper triangle is simply the transpose of the lower triangle, so we can add our three matrices together to form your symmetric correlation matrix.
m_upper <- t(m_lower)
M <- m_lower + m + m_upper
Which yields the desired output:
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    2    3    4
[2,]    1    1    5    6    7
[3,]    2    5    1    8    9
[4,]    3    6    8    1   10
[5,]    4    7    9   10    1
As you see, there is no need to work with nested loops to fill these matrices. The only loop you need is to read in the results from files (which it appears you have a handle on). If you only want the triangulated output, you can simply stop at the lower triangular matrix above. If your vector of estimated correlations (in my example x) include the diagonal elements, simply set diag = TRUE in the lower.tri() function and you are good to go.
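As a concrete sketch of that last variant: if the vector also carries the diagonal values (here 15 demo values for a 5x5 matrix, column-major, diagonal included), the fill and mirror look like this:

```r
n <- 5
x <- 1:15  # demo values: lower triangle INCLUDING the diagonal, column-major

M <- matrix(0, n, n)
M[lower.tri(M, diag = TRUE)] <- x        # fill diagonal + lower triangle
M[upper.tri(M)] <- t(M)[upper.tri(M)]    # mirror to make the matrix symmetric
M
```

The mirroring idiom in the last assignment is a common alternative to building separate upper and lower matrices and summing them.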

R: Correct strings by distance measure (stringdistmatrix)

I am dealing with the problem that I need to count unique names of people in a string, but taking into consideration that there may be slight typos.
My thought was to set strings below a certain threshold (e.g. levenshtein distance below 2) as being equal. Right now I manage to calculate the string distances, but not making any changes to my input string that would get me the correct number of unique names.
library(stringdist);library(stringr)
names<-"Michael, Liz, Miichael, Maria"
names_split<-strsplit(names, ", ")[[1]]
stringdistmatrix(names_split,names_split)
     [,1] [,2] [,3] [,4]
[1,]    0    6    1    5
[2,]    6    0    7    4
[3,]    1    7    0    6
[4,]    5    4    6    0
(number_of_people<-str_count(names, ",")+1)
[1] 4
The correct value of number_of_people should be, of course, 3.
As I am only interested in the number of uniques names, I am not concerned if "Michael" becomes replaced by "Miichael" or the other way round.
One option is to try to cluster the names based on their distance matrix:
library(stringdist)
# create a 'dist' object (=lower triangular part of distance matrix)
d <- stringdistmatrix(names_split,method="osa")
# use hierarchical clustering to group nearest neighbors
hc <- hclust(d)
# visual inspection: y-axis labels the distance value
plot(hc)
# decide what distance value you find acceptable for grouping.
cutree(hc, h=3)
Depending on your actual data you will need to experiment with the distance type (qgrams/cosine may be useful, or the Jaro-Winkler distance in the case of names).
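An illustrative sketch with the same names, using Jaro-Winkler distances instead (method = "jw"); the cutoff h = 0.2 is an assumed value and would need tuning on real data:

```r
library(stringdist)

names_split <- c("Michael", "Liz", "Miichael", "Maria")

d  <- stringdistmatrix(names_split, method = "jw")  # 'dist' object of JW distances
hc <- hclust(d)                                     # hierarchical clustering
groups <- cutree(hc, h = 0.2)                       # group names closer than 0.2

length(unique(groups))  # number of unique names after merging likely typos
# [1] 3
```

"Michael" and "Miichael" end up in the same cluster while "Liz" and "Maria" stay separate, giving the desired count of 3.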

Does the c command create a row vector or a column vector by default in R

In R, when I use a command like this:
b <-c(7,10)
b
Does it create a row vector (1 row, 2 cols) or a column vector (1 col, 2 rows) by default?
I can't tell from the displayed output.
I am an R beginner (as is obvious :))
Neither. A vector does not have a dimension attribute by default; it only has a length.
If you look at the documentation on matrix arithmetic, help("%*%"), you see that:
Multiplies two matrices, if they are conformable. If one argument is a
vector, it will be promoted to either a row or column matrix to make
the two arguments conformable. If both are vectors of the same length,
it will return the inner product (as a matrix).
So R will interpret a vector in whichever way makes the matrix product sensible.
Some examples to illustrate:
> b <- c(7,10)
> b
[1] 7 10
> dim(b) <- c(1,2)
> b
     [,1] [,2]
[1,]    7   10
> dim(b) <- c(2,1)
> b
     [,1]
[1,]    7
[2,]   10
> class(b)
[1] "matrix"
> dim(b) <- NULL
> b
[1] 7 10
> class(b)
[1] "numeric"
A matrix is just a vector with a dimension attribute. So adding an explicit dimension makes it a matrix, and R will do that in whichever way makes sense in context.
And an example of the behavior in the context of matrix multiplication:
> m <- matrix(1:2,1,2)
> m
     [,1] [,2]
[1,]    1    2
> m %*% b
     [,1]
[1,]   27
> m <- matrix(1:2,2,1)
> m %*% b
     [,1] [,2]
[1,]    7   10
[2,]   14   20
You can treat a vector (c()) in R as either a row or a column.
You can see this by
rbind(c(1,3,5),c(2,4,6))
cbind(c(1,2,3),c(4,5,6))
It is simply a collection. By default, though, when coerced to a data frame
data.frame(c(1,2,3))
it becomes a column, so the first index addresses which column of the table is being referenced, in contrast to the row-first convention of linear algebra.
I.e., to access the "hello" in this coercion of a vector into a data.frame,
an additional index is required:
a = data.frame(c("hello","F***ery"))
a[[1]][[1]]
And this is where things get wacky, because data frames (before R 4.0) do not store strings as strings by default: "hello" is converted to a factor, represented internally as an integer with levels.
The c function creates an "atomic" vector, in the words of Norman Matloff in The Art of R Programming:
atomic vectors, since their components cannot be broken down into
smaller components.
It can be seen as a "concatenation" (in fact c stands for concatenate) of elements, indexed by position: it has no dimensions (in a spatial sense), just a continuous index running from 1 to the length of the object itself.

Split matrix into submatrices

I found a solution to a similar question that breaks up a matrix into a set of non-overlapping sub-matrices when the matrix may not be square (as it may not be in my situation) Function to split a matrix into sub-matrices in R. However, in my case I want the matrices to be overlapping. Using the following matrix as an example:
M <- matrix(1:20, 5)
#     [,1] [,2] [,3] [,4]
#[1,]    1    6   11   16
#[2,]    2    7   12   17
#[3,]    3    8   13   18
#[4,]    4    9   14   19
#[5,]    5   10   15   20
If I choose the sub-matrices to have 4 rows and 4 columns, the function should return every possible overlapping 4x4 sub-matrix that fits completely within the boundary of the original matrix (this is another place where the answer may need to differ from the question I referenced). In my example there would be only two such sub-matrices: M[1:4,1:4] and M[2:5,1:4]. I should be able to choose any arbitrary size of sub-matrix. I can't seem to find any examples of extracting overlapping sub-matrices, but I may be over-thinking this one. Does anyone have any ideas on the best way to go about this?
Perhaps a strategy like this could work
submat <- function(m, nrow, ncol) {
  stopifnot(nrow(m) >= nrow, ncol(m) >= ncol)
  rowstarts <- 1:(nrow(m) - nrow + 1)
  colstarts <- 1:(ncol(m) - ncol + 1)
  ss <- function(r, c) {
    m[r:(r + nrow - 1), c:(c + ncol - 1), drop = FALSE]
  }
  with(expand.grid(r = rowstarts, c = colstarts),
       mapply(ss, r, c, SIMPLIFY = FALSE))
}
submat(M, 4, 4)
We determine the possible start indices for the rows and columns, use expand.grid() to generate all combinations of those starting values, and then use mapply() to extract every possible submatrix at those starting positions.
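As a quick sanity check on the count: for an R x C matrix and r x c windows there are (R-r+1)*(C-c+1) overlapping submatrices, so a 5x4 matrix with 3x2 windows should yield 9. A self-contained usage sketch (repeating the function so the snippet runs on its own):

```r
# Same submat() as above, repeated so this snippet is self-contained
submat <- function(m, nrow, ncol) {
  stopifnot(nrow(m) >= nrow, ncol(m) >= ncol)
  rowstarts <- 1:(nrow(m) - nrow + 1)
  colstarts <- 1:(ncol(m) - ncol + 1)
  ss <- function(r, c) m[r:(r + nrow - 1), c:(c + ncol - 1), drop = FALSE]
  with(expand.grid(r = rowstarts, c = colstarts),
       mapply(ss, r, c, SIMPLIFY = FALSE))
}

M <- matrix(1:20, 5)
subs <- submat(M, 3, 2)
length(subs)  # (5-3+1) * (4-2+1) = 9 overlapping 3x2 windows
subs[[1]]     # the top-left window, M[1:3, 1:2]
```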
