Using R, what is the best way to read a symmetric matrix from a file that omits the upper triangular part? For example:
1.000
.505 1.000
.569 .422 1.000
.602 .467 .926 1.000
.621 .482 .877 .874 1.000
.603 .450 .878 .894 .937 1.000
I have tried read.table, but haven't been successful.
Here's a loopless and *apply-less read.table solution:
txt <- "1.000
.505 1.000
.569 .422 1.000
.602 .467 .926 1.000
.621 .482 .877 .874 1.000
.603 .450 .878 .894 .937 1.000"
# Could use clipboard or read this from a file as well.
mat <- data.matrix( read.table(text=txt, fill=TRUE, col.names=paste0("V", 1:6)) )
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
> mat
V1 V2 V3 V4 V5 V6
[1,] 1.000 0.505 0.569 0.602 0.621 0.603
[2,] 0.505 1.000 0.422 0.467 0.482 0.450
[3,] 0.569 0.422 1.000 0.926 0.877 0.878
[4,] 0.602 0.467 0.926 1.000 0.874 0.894
[5,] 0.621 0.482 0.877 0.874 1.000 0.937
[6,] 0.603 0.450 0.878 0.894 0.937 1.000
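If you read from a file and don't know the dimension in advance, a small variant (a sketch; "yourfile.txt" is a stand-in name) can derive the column count with count.fields():
n <- max(count.fields("yourfile.txt"))
mat <- data.matrix(read.table("yourfile.txt", fill = TRUE,
                              col.names = paste0("V", seq_len(n))))
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]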
I copied your text, and then used tt <- file('clipboard','rt') to import it. For a standard file:
tt <- file("yourfile.txt",'rt')
a <- readLines(tt)
b <- strsplit(a," ") #insert delimiter here; can use regex
b <- lapply(b,function(x) {
x <- as.numeric(x)
length(x) <- max(unlist(lapply(b,length)));
return(x)
})
b <- do.call(rbind,b)
b[is.na(b)] <- 0
#kinda kludgy way to get the symmetric matrix
b <- b + t(b) - diag(b[1,1], nrow=dim(b)[1], ncol=dim(b)[2])
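A less kludgy alternative for the last two steps (a sketch on the same b, applied before the NAs are zeroed) is the upper.tri() assignment from the first answer:
# mirror the lower triangle into the NA upper triangle
b[upper.tri(b)] <- t(b)[upper.tri(b)]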
I'm posting this, but I like Blue Magister's approach way better. Maybe there's something in this that's of use.
mat <- readLines(n=6)
1.000
.505 1.000
.569 .422 1.000
.602 .467 .926 1.000
.621 .482 .877 .874 1.000
.603 .450 .878 .894 .937 1.000
nmat <- lapply(mat, function(x) unlist(strsplit(x, "\\s+")))
lens <- sapply(nmat, length)
dlen <- max(lens) - lens
bmat <- lapply(seq_along(nmat), function(i) {
as.numeric(c(nmat[[i]], rep(NA, dlen[i])))
})
mat <- do.call(rbind, bmat)
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
mat
Here is an approach which also works if the dimensions of the matrix are unknown.
# read file as a vector
mat <- scan("file.txt", what = numeric())
# calculate the number of columns (and rows):
# a lower triangle with n rows holds n * (n + 1) / 2 values, so solve for n
ncol <- (sqrt(8 * length(mat) + 1) - 1) / 2
# index of the diagonal values
diag_idx <- cumsum(seq.int(ncol))
# generate split index
split_idx <- cummax(sequence(seq.int(ncol)))
split_idx[diag_idx] <- split_idx[diag_idx] - 1
# split vector into list of rows
splitted_rows <- split(mat, f = split_idx)
# generate matrix
mat_full <- suppressWarnings(do.call(rbind, splitted_rows))
mat_full[upper.tri(mat_full)] <- t(mat_full)[upper.tri(mat_full)]
[,1] [,2] [,3] [,4] [,5] [,6]
0 1.000 0.505 0.569 0.602 0.621 0.603
1 0.505 1.000 0.422 0.467 0.482 0.450
2 0.569 0.422 1.000 0.926 0.877 0.878
3 0.602 0.467 0.926 1.000 0.874 0.894
4 0.621 0.482 0.877 0.874 1.000 0.937
5 0.603 0.450 0.878 0.894 0.937 1.000
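Two optional sanity checks on this approach, using the ncol and mat_full computed above:
stopifnot(ncol * (ncol + 1) / 2 == length(mat))  # the input really is a complete lower triangle
stopifnot(isSymmetric(unname(mat_full)))         # unname() ignores the rownames added by split()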
This won't work in the OP's case because the diagonal was 1, but if the diagonal is zero or missing, then you can use as.dist() followed by as.matrix() to copy the lower triangle to the upper triangle and set the diagonal to zero:
input=" Pop0 Pop1 Pop2
Pop0
Pop1 0.015
Pop2 0.079 0.083
Pop3 0.014 0.016 0.073"
as.matrix(as.dist(cbind(read.table(text=input,fill=T),NA)))
Result:
Pop0 Pop1 Pop2 Pop3
Pop0 0.000 0.015 0.079 0.014
Pop1 0.015 0.000 0.083 0.016
Pop2 0.079 0.083 0.000 0.073
Pop3 0.014 0.016 0.073 0.000
In my case the input had column names, so read.table(fill=T) was automatically able to determine the number of columns, and IRTFM's trick of specifying col.names=1:4 was not needed.
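For reference, printing the intermediate dist object from this answer (a sketch on the same input) shows that as.dist() keeps only the lower triangle and implies a zero diagonal:
d <- as.dist(cbind(read.table(text = input, fill = TRUE), NA))
d
#       Pop0  Pop1  Pop2
# Pop1 0.015
# Pop2 0.079 0.083
# Pop3 0.014 0.016 0.073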
Related
I did a for loop and want to return the results as a vector. I seem to only succeed with print, but that's not what I am after.
n<-20
for (i in 1:n) {
start_point <- 0.50
frac <- (start_point / n) * (i-1+1)
increment <- start_point + frac
print(increment)
}
You are overwriting the increment value in every iteration; you need to declare it as a numeric vector and store each iteration's value using an index.
Some improvements to your current code:
1) There is no need to initialise start_point in every iteration; it can be moved outside the loop.
2) (i - 1 + 1) is just i.
n <- 20
increment <- numeric(length = n)
start_point <- 0.50
for (i in 1:n) {
frac <- (start_point / n) * i
increment[i] <- start_point + frac
}
increment
# [1] 0.525 0.550 0.575 0.600 0.625 0.650 0.675 0.700 0.725 0.750 0.775
# 0.800 0.825 0.850 0.875 0.900 0.925 0.950 0.975 1.000
However, you could avoid the loop by using seq:
seq(start_point + (start_point/n), by = start_point/n, length.out = n)
#[1] 0.525 0.550 0.575 0.600 0.625 0.650 0.675 0.700 0.725 0.750 0.775
# 0.800 0.825 0.850 0.875 0.900 0.925 0.950 0.975 1.000
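Equivalently, since each increment is just start_point plus i times the step, plain vectorized arithmetic (with the same start_point and n) reproduces the loop without seq():
start_point + (start_point / n) * seq_len(n)
# same 20 values as above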
I am new to R, and am working on the problem of importing and working with a correlation matrix as the only data source for PCA and PCF in R.
I have searched Stack Overflow and even books but could not find any hints; it seems as though R only works with raw data files, whereas in SAS you can simply input the correlation matrix and get the PCA and PCF results easily. I hope I am wrong.
The Stack Overflow answers I did find are mostly about how to calculate the correlation matrix or the eigenvalue decomposition.
Below are my attempts:
setwd("D:/BlueHDD/MAQAB/RStudio/R/PCA/Intelligence")
mydata <- read.csv("Intelligence.csv",na.strings = ".")
head(mydata)
X M P C E H F
1 M 1.000 0.620 0.540 0.320 0.284 0.370
2 P 0.620 1.000 0.510 0.380 0.351 0.430
3 C 0.540 0.510 1.000 0.360 0.336 0.405
4 E 0.320 0.380 0.360 1.000 0.686 0.730
5 H 0.284 0.351 0.336 0.686 1.000 0.735
6 F 0.370 0.430 0.405 0.730 0.735 1.000
ii <- as.matrix(mydata[,2:7])
rownames(ii)<- c ("M","P","C","E","H","F")
colnames(ii)<- c ("M","P","C","E","H","F")
head(ii)
M P C E H F
M 1.000 0.620 0.540 0.320 0.284 0.370
P 0.620 1.000 0.510 0.380 0.351 0.430
C 0.540 0.510 1.000 0.360 0.336 0.405
E 0.320 0.380 0.360 1.000 0.686 0.730
H 0.284 0.351 0.336 0.686 1.000 0.735
F 0.370 0.430 0.405 0.730 0.735 1.000
myPCA <- eigen(ii)
head(myPCA)
$values
[1] 3.3670861 1.1941791 0.5070061 0.3718472 0.3131559 0.2467257
$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.3677678 -0.5098401 0.266985551 0.72768020 0.047584025 -0.04178482
[2,] -0.3913477 -0.4092063 0.485916591 -0.66464527 -0.005392018 -0.03872816
[3,] -0.3719504 -0.3825819 -0.831626240 -0.15204371 -0.003331423 -0.02352388
[4,] -0.4321872 0.3748248 0.021531885 0.06531777 -0.742970281 -0.34056682
[5,] -0.4219572 0.4214599 0.002730054 0.01174474 0.665109730 -0.44922966
[6,] -0.4565228 0.3288196 0.023032686 0.03473540 0.057617669 0.82365511
myPCA$values
[1] 3.3670861 1.1941791 0.5070061 0.3718472 0.3131559 0.2467257
myPCA$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.3677678 -0.5098401 0.266985551 0.72768020 0.047584025 -0.04178482
[2,] -0.3913477 -0.4092063 0.485916591 -0.66464527 -0.005392018 -0.03872816
[3,] -0.3719504 -0.3825819 -0.831626240 -0.15204371 -0.003331423 -0.02352388
[4,] -0.4321872 0.3748248 0.021531885 0.06531777 -0.742970281 -0.34056682
[5,] -0.4219572 0.4214599 0.002730054 0.01174474 0.665109730 -0.44922966
[6,] -0.4565228 0.3288196 0.023032686 0.03473540 0.057617669 0.82365511
The problem now is in the vectors: all the "+" and "-" signs are opposite!
Also, from here, I don't know how to get the loading matrix. I tried the code below, but it fails:
fit <- princomp(ii)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit,type="lines") # scree plot
fit$scores # the principal components
biplot(fit)
You can perform PCA in R with the princomp function. The documentation says that if you supply the argument covmat it will compute the principal components from the covariance matrix. But it also works to use this argument with the correlation matrix.
Here is a simple example using the iris data.
## principal components from the original data
princomp(iris[,1:4], cor=T)
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.7083611 0.9560494 0.3830886 0.1439265
Now suppose that you only have a correlation matrix
## from correlation matrix
CM = cor(iris[,1:4])
myPCA = princomp(covmat=CM)
myPCA
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.7083611 0.9560494 0.3830886 0.1439265
You get the same answer either way. If you want the loadings, they are stored in the myPCA structure.
myPCA$loadings
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length 0.521 0.377 0.720 0.261
Sepal.Width -0.269 0.923 -0.244 -0.124
Petal.Length 0.580 -0.142 -0.801
Petal.Width 0.565 -0.634 0.524
Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00
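As for the flipped "+" and "-" signs in the question: eigenvector signs are arbitrary (if v is an eigenvector, so is -v), so different routines may return either sign and the results are equivalent. Applied to the question's matrix ii, the same covmat pattern should work (a sketch, not run against the original data):
fit <- princomp(covmat = ii)  # ii is the correlation matrix from the question
summary(fit)                  # variance accounted for
fit$loadings                  # the loading matrix
plot(fit, type = "lines")     # scree plot
# fit$scores is NULL here: scores require the raw observations, not just covmat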
I would like to be able to make a graph like the one produced by the code shown below (but using logarithmic axes). I have a 2D matrix containing the data, and I know the separation positions between one cell and the next (equispaced when viewed on a logarithmic scale). The code below simulates what I would like to obtain, but it uses Hist_2D, so I do not think it is usable in my case.
An example of my data:
data is a 9*9 matrix:
data [0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.429 0.143 0.000 0.000 0.048 0.000 0.000 0.000 0.000
0.857 0.810 0.667 0.429 0.429 0.286 0.190 0.286 0.143
0.952 0.952 0.905 0.857 0.857 0.905 0.857 0.762 0.810
1.000 1.000 0.952 0.952 0.952 0.952 0.952 0.952 1.000
1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000]
x and y are two vectors of length 10:
x [0.189036 0.484322 0.779609 1.07489 1.37018 1.66547 1.96075 2.25604 2.55133]
y [ -1.06208 -0.584192 -0.106299 0.371593 0.849485 1.32738 1.80527 2.28316 2.76105]
Code:
PRO Plot2
x = cgScaleVector(Randomn(-3L, 100000)*3., -10, 10)
y = cgScaleVector(Randomn(-5L, 100000)*10., 0, 100)
xrange = [Min(x), Max(x)]
yrange = [Min(y), Max(y)]
xbinsize = 0.25
ybinsize = 3.00
cgDisplay
density = Hist_2D(x, y, Min1=xrange[0], Max1=xrange[1], Bin1=xbinsize, Min2=yrange[0], Max2=yrange[1], Bin2=ybinsize)
maxDensity = Ceil(Max(density)/1e2) * 1e2
scaledDensity = BytScl(density, Min=0, Max=maxDensity)
cgLoadCT, 33
TVLCT, cgColor('gray', /Triple), 0
TVLCT, r, g, b, /Get
palette = [ [r], [g], [b] ]
cgImage, scaledDensity, XRange=xrange, YRange=yrange, /Axes, Palette=palette, $
XTitle='Concentration of X', YTitle='Concentration of Y', $
Position=[0.125, 0.125, 0.9, 0.8]
thick = (!D.Name EQ 'PS') ? 6 : 2
cgContour, density, LEVELS=maxDensity*[0.25, 0.5, 0.75], /OnImage, $
C_Colors=['Tan','Tan', 'Brown'], C_Annotation=['Low', 'Avg', 'High'], $
C_Thick=thick, C_CharThick=thick
cgColorbar, Position=[0.125, 0.875, 0.9, 0.925], Title='Density', $
Range=[0, maxDensity], NColors=254, Bottom=1, OOB_Low='gray', $
TLocation='Top'
END ;*****************************************************************
Plot2
END
Thanks for your help!
In the code you have posted, Hist_2D computes the density map that is then displayed by cgImage. Since you already have the matrix you want to display (data), you can simply run:
cgImage, data, /axes, /scale, /keep, xrange=[0.04,2.70], yrange=[-1.30,3.00]
My data frame consists of time series financial data from many public companies. I purposely set companies' weights as their column headers while cleaning the data, and I also calculated log returns for each of them in order to calculate weighted returns in the next step.
Here is an example. There are four companies: A, B, C and D, and their corresponding weights in the portfolio are 0.4, 0.3, 0.2 and 0.1, respectively. So the current data set looks like:
df1 <- data.frame(matrix(vector(),ncol=9, nrow = 4))
colnames(df1) <- c("Date","0.4","0.4.Log","0.3","0.3.Log","0.2","0.2.Log","0.1","0.1.Log")
df1[1,] <- c("2004-10-29","103.238","0","131.149","0","99.913","0","104.254","0")
df1[2,] <- c("2004-11-30","104.821","0.015","138.989","0.058","99.872","0.000","103.997","-0.002")
df1[3,] <- c("2004-12-31","105.141","0.003","137.266","-0.012","99.993","0.001","104.025","0.000")
df1[4,] <- c("2005-01-31","107.682","0.024","137.08","-0.001","99.782","-0.002","105.287","0.012")
df1
Date 0.4 0.4.Log 0.3 0.3.Log 0.2 0.2.Log 0.1 0.1.Log
1 2004-10-29 103.238 0 131.149 0 99.913 0 104.254 0
2 2004-11-30 104.821 0.015 138.989 0.058 99.872 0.000 103.997 -0.002
3 2004-12-31 105.141 0.003 137.266 -0.012 99.993 0.001 104.025 0.000
4 2005-01-31 107.682 0.024 137.08 -0.001 99.782 -0.002 105.287 0.012
I want to create new columns that contain company weights so that I can calculate weighted returns in my next step:
Date 0.4 0.4.W 0.4.Log 0.3 0.3.W 0.3.Log 0.2 0.2.W 0.2.Log 0.1 0.1.W 0.1.Log
1 2004-10-29 103.238 0.400 0.000 131.149 0.300 0.000 99.913 0.200 0.000 104.254 0.100 0.000
2 2004-11-30 104.821 0.400 0.015 138.989 0.300 0.058 99.872 0.200 0.000 103.997 0.100 -0.002
3 2004-12-31 105.141 0.400 0.003 137.266 0.300 -0.012 99.993 0.200 0.001 104.025 0.100 0.000
4 2005-01-31 107.682 0.400 0.024 137.080 0.300 -0.001 99.782 0.200 -0.002 105.287 0.100 0.012
We can try
v1 <- grep("^[0-9.]+$", names(df1), value = TRUE)  # columns whose names are the weights
df1[paste0(v1, ".W")] <- as.list(as.numeric(v1))   # one new weight column per company
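The new columns are appended at the end rather than interleaved, but the order doesn't matter for the arithmetic. From here, a hedged sketch of the weighted-return step, assuming the .Log columns hold the log returns and remembering that df1 currently stores everything as character (the Portfolio.Log name is only an illustration):
log_cols <- grep("\\.Log$", names(df1), value = TRUE)
w <- as.numeric(sub("\\.Log$", "", log_cols))  # weights recovered from the column names
returns <- sapply(df1[log_cols], as.numeric)   # character columns -> numeric matrix
df1$Portfolio.Log <- as.vector(returns %*% w)  # weighted log return per date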
This is my data frame:
>head(dat)
geno P1 P2 P3 P4 dif
1 G1 0.015 0.007 0.026 0.951 0.001
2 G2 0.008 0.006 0.015 0.970 0.001
3 G3 0.009 0.006 0.017 0.968 0.000
4 G4 0.011 0.007 0.017 0.965 0.000
5 G5 0.013 0.005 0.021 0.961 0.000
6 G6 0.009 0.006 0.007 0.977 0.001
Here, I need to find the max in each row and add dat$dif to it.
When I used which.max(dat[,-1]), I got this error:
Error in which.max(dat[,-1]) :
(list) object cannot be coerced to type 'double'
A previous answer (by Scriven) gives most of it, but as others have stated, it incorrectly includes the last column. Here is one method that works around that:
idx <- (! names(dat) %in% c('geno','dif'))
dat$dif + apply(dat[,idx], 1, max)
# 1 2 3 4 5 6
# 0.952 0.971 0.968 0.965 0.961 0.978
You can easily put the idx stuff directly into the dat[,...] subsetting, but I broke it out here for clarity.
idx can be defined by numerous things here, such as "all but the first and last columns": idx <- names(dat)[-c(1, ncol(dat))]; or "anything that looks like P#": idx <- grep('^P[0-9]+', names(dat)).
There's an app, eh function for that :-).
max.col finds the index of the maximum position for each row of a matrix. Take note that, as max.col expects a matrix (numeric values only), you have to exclude the “geno” column when applying this function.
sapply(1:6,function(x) dat[x,max.col(dat[,2:5])[x] +1]) + dat$dif
[1] 0.952 0.971 0.968 0.965 0.961 0.978
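A fully vectorized variant of the same max.col idea (a sketch that avoids the sapply over rows by using matrix indexing):
m <- as.matrix(dat[, 2:5])                        # the numeric P columns only
dat$dif + m[cbind(seq_len(nrow(m)), max.col(m))]  # row maxima plus dif
# [1] 0.952 0.971 0.968 0.965 0.961 0.978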