Reading an Array in a DataFrame in R and calculating cosine distance

I have a csv file with four columns
sigID , author,lowered,array
1, Lukic M,lukicm,"[ 0.05192188 -0.02984986 -0.01315994 -0.05446223 0.01090824 -0.0310401 -0.00134283 -0.0536921 -0.02986531 -0.01161558]"
2, Houssin C,houssinc,"[ 0.05371874 -0.07439778 0.3917329 -0.15246899 0.35638699 0.14586256 0.12886068 -0.10721818 -0.14641574 0.08469024]"
....
How do I read this csv file in R? (I am having trouble parsing the array column.)
How can I calculate the cosine similarity between array[1] and array[2]?
Thanks,

Here is a way to parse your array into vector:
myList <- strsplit(gsub("\\[\\s*|\\s*\\]", "", df$array), "\\s+")
myList
[[1]]
[1] "0.05192188" "-0.02984986" "-0.01315994" "-0.05446223" "0.01090824" "-0.0310401" "-0.00134283" "-0.0536921"
[9] "-0.02986531" "-0.01161558"
[[2]]
[1] "0.05371874" "-0.07439778" "0.3917329" "-0.15246899" "0.35638699" "0.14586256" "0.12886068" "-0.10721818"
[9] "-0.14641574" "0.08469024"
Convert them to numeric before calculating the cosine distance:
mat <- do.call(cbind, lapply(myList, as.numeric))
mat
[,1] [,2]
[1,] 0.05192188 0.05371874
[2,] -0.02984986 -0.07439778
[3,] -0.01315994 0.39173290
[4,] -0.05446223 -0.15246899
[5,] 0.01090824 0.35638699
[6,] -0.03104010 0.14586256
[7,] -0.00134283 0.12886068
[8,] -0.05369210 -0.10721818
[9,] -0.02986531 -0.14641574
[10,] -0.01161558 0.08469024
You can use the cosine function from the lsa package to calculate the cosine similarity between the columns of mat:
library(lsa)
cosine(mat)
[,1] [,2]
[1,] 1.0000000 0.2438864
[2,] 0.2438864 1.0000000
So the cosine similarity measure between vector 1 and vector 2 is 0.244.
Note: As to why you can't read the file, I guess you have one quote missing at the end of your first array. Otherwise, can't think of any reason why you can't read it. It is a normal .csv file.
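The steps above can be put together in one self-contained sketch. It assumes the file looks like the two rows shown in the question, supplied inline through read.csv(text = ...) rather than read from disk, and it computes the cosine similarity in base R instead of with lsa. The names csv_text, parse_array and cos_sim are illustrative only and do not come from any package.

```r
# Inline stand-in for the csv file (assumption: same layout as in the question).
csv_text <- 'sigID,author,lowered,array
1, Lukic M,lukicm,"[ 0.05192188 -0.02984986 -0.01315994 -0.05446223 0.01090824 -0.0310401 -0.00134283 -0.0536921 -0.02986531 -0.01161558]"
2, Houssin C,houssinc,"[ 0.05371874 -0.07439778 0.3917329 -0.15246899 0.35638699 0.14586256 0.12886068 -0.10721818 -0.14641574 0.08469024]"'

# The quoted array column is read as a plain character column.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE, strip.white = TRUE)

# Strip the brackets, split on whitespace, convert to numeric.
parse_array <- function(x) {
  as.numeric(strsplit(gsub("\\[\\s*|\\s*\\]", "", x), "\\s+")[[1]])
}
vecs <- lapply(df$array, parse_array)

# Cosine similarity: dot product divided by the product of the norms.
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim <- cos_sim(vecs[[1]], vecs[[2]])
print(sim)
```

The printed value should agree with the off-diagonal entry that cosine(mat) reports above.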

Related

Incorrect result when multiplying matrices in R?

I'm getting some weird results when multiplying these two matrices in R:
> matrix3
[1,] 3.19747172 -2.806e-05 -0.00579284 -0.00948720 -0.01054026 0.17575719
[2,] -0.00002806 2.000e-08 0.00000057 0.00000006 -0.00000009 -0.00000358
[3,] -0.00579284 5.700e-07 0.00054269 0.00001793 -0.00002686 -0.00310465
[4,] -0.00948720 6.000e-08 0.00001793 0.00003089 0.00002527 -0.00066290
[5,] -0.01054026 -9.000e-08 -0.00002686 0.00002527 0.00023776 -0.00100898
[6,] 0.17575719 -3.580e-06 -0.00310465 -0.00066290 -0.00100898 0.03725362
> matrix4
[,1]
x0 2428.711
x1 1115178.561
x2 74411.013
x3 925700.445
x4 74727.396
x5 13342.182
> matrix3%*%matrix4
[,1]
[1,] 78.4244581753
[2,] -0.0023802299
[3,] 0.1164568885
[4,] -0.0018504732
[5,] -0.0006493249
[6,] -0.1497822396
The thing is that if you multiply these two matrices in Excel you get:
>78.4824494081686
>-0.0000419022486847151
>0.112430295996347
>-0.000379343461780479
>0.000340414687578061
>-0.14454024116344
Using online matrix calculators I also got Excel's result.
Would love your help in understanding how to get the same result in R.
The problem was caused by using the function inv() from library(matlib).
matrix3 is the result of inverting a matrix with inv().
When I used solve() to invert instead and then continued as before, I got the correct result.
Perhaps inv() rounds its result: every entry of matrix3 above is printed with at most eight decimal places.
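The effect of rounding an inverse can be sketched in base R without matlib. The matrix A and vector b below are made-up values chosen so that one entry of the inverse is smaller than 1e-8, and the round(..., 8) call only imitates the suspected rounding; it is an assumption, not a statement about what inv() does internally.

```r
# Hypothetical, badly scaled system (not the matrices from the question).
A <- diag(c(3e8, 2))
b <- c(6e8, 4)

# Full-precision inverse from solve(), then the product with b.
exact <- solve(A) %*% b

# Same inverse rounded to 8 decimal places before multiplying:
# the entry 1 / 3e8 is about 3.3e-9 and rounds to 0.
rounded <- round(solve(A), 8) %*% b

# Solving the system directly avoids forming the inverse at all.
direct <- solve(A, b)

print(cbind(exact = as.vector(exact),
            rounded = as.vector(rounded),
            direct = direct))
```

When the right-hand side contains large values, as matrix4 does, tiny entries of the inverse still matter, so solve(A, b) is usually the safer form.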

Get maximum distance between points in a vector (R)

I have two vectors of latitudes and longitudes. I would like to find the maximum distance between the points. The way I see it, I should get a matrix of distances between all points and get the max of those.
So far I’ve done (using geosphere package for the last command):
> lat = dt[assetId == u_assetIds[1000], latitude]
> lon = dt[assetId == u_assetIds[1000], longitude]
>
> head(cbind(lat, lon))
lat lon
[1,] 0.7266145 -1.512977
[2,] 0.7270650 -1.504216
[3,] 0.7267265 -1.499622
[4,] 0.7233676 -1.487970
[5,] 0.7232196 -1.443160
[6,] 0.7225059 -1.434848
>
> distm(c(lat_1K[1], lon_1K[1]), c(lat_1K[4], lon_1K[4]), fun = distHaversine)
[,1]
[1,] 2807.119
How do I change the last command so that it gives me a matrix of all pairwise distances? I am not familiar with how to do that in R, having more experience in Python.
Thanks.
From a quick read of the help page for distm, here is what I found:
distm(x, y, fun=distHaversine)
x: longitude/latitude of point(s). Can be a vector of two numbers, a matrix of 2 columns (first one is longitude, second is latitude) or a SpatialPoints* object
y: Same as x. If missing, y is the same as x
So you should simply pass cbind(lon, lat) as the first argument x (longitude first, as the help page says). Here is a test:
> lat <- c(0.7266145, 0.7270650, 0.7267265, 0.7233676, 0.7232196, 0.7225059)
> lon <- c(-1.512977, -1.504216, -1.499622, -1.487970, -1.443160, -1.434848)
> distm(cbind(lon,lat))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.0000 976.4802 1486.6045 2806.912 7780.5544 8708.6036
[2,] 976.4802 0.0000 512.7471 1854.601 6809.6464 7738.0538
[3,] 1486.6045 512.7471 0.0000 1349.813 6296.9308 7225.3240
[4,] 2806.9123 1854.6008 1349.8129 0.000 4987.8561 5913.8213
[5,] 7780.5544 6809.6464 6296.9308 4987.856 0.0000 928.6189
[6,] 8708.6036 7738.0538 7225.3240 5913.821 928.6189 0.0000
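To get the maximum the question asks for, take max() of that matrix; with geosphere loaded, max(distm(cbind(lon, lat))) would do it. The sketch below does the same thing with a base-R Haversine stand-in so that it runs without the package. The function haversine_m and the sphere radius of 6378137 m are assumptions made for illustration, not geosphere's code.

```r
lat <- c(0.7266145, 0.7270650, 0.7267265, 0.7233676, 0.7232196, 0.7225059)
lon <- c(-1.512977, -1.504216, -1.499622, -1.487970, -1.443160, -1.434848)

# Haversine great-circle distance in metres; inputs are in degrees.
# r is an assumed sphere radius.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# All pairwise distances: outer() expands every (i, j) index pair.
idx <- seq_along(lat)
d <- outer(idx, idx, function(i, j) haversine_m(lon[i], lat[i], lon[j], lat[j]))

# Largest distance and the pair of points that produces it.
max_dist <- max(d)
far_pair <- which(d == max_dist, arr.ind = TRUE)[1, ]
print(max_dist)
print(far_pair)
```

For many points the full matrix grows quadratically, so this is only practical for moderately sized vectors.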

Perform pairwise comparison of matrix

I have a matrix of n variables and I want to make a new matrix that holds the pairwise difference of each pair of columns, but not of a column with itself. Here is an example of the data.
Transportation.services Recreational.goods.and.vehicles Recreation.services Other.services
2.958003 -0.25983789 5.526694 2.8912009
2.857370 -0.03425164 5.312857 2.9698044
2.352275 0.30536569 4.596742 2.9190123
2.093233 0.65920773 4.192716 3.2567390
1.991406 0.92246531 3.963058 3.6298314
2.065791 1.06120930 3.692287 3.4422340
I tried running a for loop below, but I'm aware that R is very slow with loops.
Difference.Matrix<- function(data){
n<-2
new.cols="New Columns"
list = list()
for (i in 1:ncol(data)){
for (j in n:ncol(data)){
name <- paste("diff",i,j,data[,i],data[,j],sep=".")
new<- data[,i]-data[,j]
list[[new.cols]]<-c(name)
data<-merge(data,new)
}
n= n+1
}
results<-list(data=data)
return(results)
}
As I said before the code is running very slow and has not even finished a single run through yet. Also I apologize for the beginner level coding. Also I am aware this code leaves the original data on the matrix, but I can delete it later.
Is it possible for me to use an apply function or foreach on this data?
You can find the pairs with combn and use apply to create the result:
apply(combn(ncol(d), 2), 2, function(x) d[,x[1]] - d[,x[2]])
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 3.217841 -2.568691 0.0668021 -5.786532 -3.151039 2.6354931
## [2,] 2.891622 -2.455487 -0.1124344 -5.347109 -3.004056 2.3430526
## [3,] 2.046909 -2.244467 -0.5667373 -4.291376 -2.613647 1.6777297
## [4,] 1.434025 -2.099483 -1.1635060 -3.533508 -2.597531 0.9359770
## [5,] 1.068941 -1.971652 -1.6384254 -3.040593 -2.707366 0.3332266
## [6,] 1.004582 -1.626496 -1.3764430 -2.631078 -2.381025 0.2500530
You can add appropriate names with another apply. Here the column names are very long, which impairs the formatting, but the labels tell what differences are in each column:
x <- apply(combn(ncol(d), 2), 2, function(x) d[,x[1]] - d[,x[2]])
colnames(x) <- apply(combn(ncol(d), 2), 2, function(x) paste(names(d)[x], collapse=' - '))
> x
Transportation.services - Recreational.goods.and.vehicles Transportation.services - Recreation.services
[1,] 3.217841 -2.568691
[2,] 2.891622 -2.455487
[3,] 2.046909 -2.244467
[4,] 1.434025 -2.099483
[5,] 1.068941 -1.971652
[6,] 1.004582 -1.626496
Transportation.services - Other.services Recreational.goods.and.vehicles - Recreation.services
[1,] 0.0668021 -5.786532
[2,] -0.1124344 -5.347109
[3,] -0.5667373 -4.291376
[4,] -1.1635060 -3.533508
[5,] -1.6384254 -3.040593
[6,] -1.3764430 -2.631078
Recreational.goods.and.vehicles - Other.services Recreation.services - Other.services
[1,] -3.151039 2.6354931
[2,] -3.004056 2.3430526
[3,] -2.613647 1.6777297
[4,] -2.597531 0.9359770
[5,] -2.707366 0.3332266
[6,] -2.381025 0.2500530
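The same result can be sketched without apply by indexing the columns with the two rows of combn's output. The data frame d is rebuilt here from the sample rows in the question; pairs and diffs are illustrative names.

```r
# Sample data from the question.
d <- data.frame(
  Transportation.services         = c(2.958003, 2.857370, 2.352275, 2.093233, 1.991406, 2.065791),
  Recreational.goods.and.vehicles = c(-0.25983789, -0.03425164, 0.30536569, 0.65920773, 0.92246531, 1.06120930),
  Recreation.services             = c(5.526694, 5.312857, 4.596742, 4.192716, 3.963058, 3.692287),
  Other.services                  = c(2.8912009, 2.9698044, 2.9190123, 3.2567390, 3.6298314, 3.4422340)
)

# Each column of pairs is one (i, j) combination with i < j.
pairs <- combn(ncol(d), 2)

# Subtract all "j" columns from all "i" columns in one matrix operation.
m <- as.matrix(d)
diffs <- m[, pairs[1, ], drop = FALSE] - m[, pairs[2, ], drop = FALSE]
colnames(diffs) <- paste(names(d)[pairs[1, ]], names(d)[pairs[2, ]], sep = " - ")

print(round(diffs[, 1:2], 6))
```

Because the subtraction is done on whole matrices, no explicit loop over column pairs is needed.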

R hdf5 dataset written incorrectly?

When I execute the following my "predictors" dataset is populated correctly:
library(rhdf5)
library(forecast)
library(sltl)
library(tseries)
fid <- H5Fcreate(output_file)
## TODO: compute the order p
p <- 4
# write predictors
h5createDataset(output_file, dataset="predictors", c(p, length(tsstl.remainder) - (p - 1)), storage.mode='double')
predictors <- as.matrix(tsstl.remainder)
for (i in 1:(p - 1)) {
predictors <- as.matrix(cbind(predictors, Lag(as.matrix(tsstl.remainder), i)))
}
predictors <- as.matrix(predictors[-1:-(p-1),])
head(predictors)
h5write(predictors, output_file, name="predictors")
H5Fclose(fid)
The generated (correct) output for head(predictors) is:
[,1] [,2] [,3] [,4]
[1,] 0.3089645 6.7722063 5.1895389 5.2323261
[2,] 8.7607228 0.3089645 6.7722063 5.1895389
[3,] -0.9411553 8.7607228 0.3089645 6.7722063
[4,] -14.1390243 -0.9411553 8.7607228 0.3089645
[5,] -26.6605296 -14.1390243 -0.9411553 8.7607228
[6,] -8.1293076 -26.6605296 -14.1390243 -0.9411553
However, when I read it the results are not correct:
tsmatrix <- t(as.matrix(h5read(output_file, "predictors")))
head(tsmatrix)
Incorrectly outputs:
[,1] [,2] [,3] [,4]
[1,] 0.3089645 8.760723 -0.9411553 -14.13902
[2,] -26.6605296 -8.129308 -9.8687675 31.52086
[3,] 54.2703126 43.902489 31.8164836 43.87957
[4,] 22.1260636 36.733055 54.7064107 56.35158
[5,] 36.3919851 25.193068 48.2244464 57.12196
[6,] 48.0585673 72.402673 68.3265518 80.18960
How come what I write does not correspond to what I get back? I double-checked, and the HDFView HDF5 viewer also shows these incorrect values for the "predictors" dataset.
What is wrong here?
From the rhdf5 docs:
Please note, that arrays appear as transposed matrices when opening it
with a C-program (h5dump or HDFView). This is due to the fact the
fastest changing dimension on C is the last one, but on R it is the
first one (as in Fortran).
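The storage-order point can be illustrated in base R, without rhdf5. The sketch below is one reading of the question's output, not something the rhdf5 docs state: it assumes the N x p values of predictors end up in a container declared as p x N, which is how the dataset is created in the question. The small predictors matrix here is a made-up stand-in.

```r
# Hypothetical stand-in for predictors: N = 6 rows, p = 2 columns.
predictors <- cbind(1:6, 11:16)
n <- nrow(predictors)
p <- ncol(predictors)

# Same values in the same column-major order, but declared as p x N.
as_stored <- array(predictors, dim = c(p, n))

# Transposing afterwards does not restore the original rows:
# the first rows are consecutive values of column 1.
read_back <- t(as_stored)
print(read_back[1:3, ])

# Declaring the container as N x p keeps the layout intact.
ok <- array(predictors, dim = c(n, p))
print(all(ok == predictors))
```

The question's output shows the same pattern: the first row read back holds the first four values of column 1 of predictors. Under this reading, creating the dataset with dim(predictors) and reading it without t() would be worth trying.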

Weighted variance-covariance matrices and lapply

I have a list prob with 50 elements. Each element is a 601x3 matrix of probabilities, each row of which represents a complete sample space (i.e., each row of each matrix sums to 1). For instance, here are the first five rows of the first element of prob:
> prob[[1]][1:5,]
[,1] [,2] [,3]
[1,] 0.6027004 0.3655563 0.03174335
[2,] 0.6013667 0.3665756 0.03205767
[3,] 0.6000306 0.3675946 0.03237481
[4,] 0.5986921 0.3686131 0.03269480
[5,] 0.5973513 0.3696311 0.03301765
Now, what I want to do is to create the following matrix for each row of each matrix/element in the list prob. Taking the first row, let a = .603, b = .366, and c = .032 (rounding to three decimal places). Then,
> w
[,1] [,2] [,3]
[1,] a*(1-a) -a*b -a*c
[2,] -b*a b*(1-b) -b*c
[3,] -c*a -c*b c*(1-c)
Such that:
> w
[,1] [,2] [,3]
[1,] 0.239391 -0.220698 -0.019296
[2,] -0.220698 0.232044 -0.011712
[3,] -0.019296 -0.011712 0.030976
I want to obtain a similar 3x3 matrix 600 more times (for the rest of the rows of this matrix) and then to repeat this entire process 49 more times for the rest of the elements of prob. The only thing I can think of is to call apply within lapply so that I am accessing each row of each matrix one-at-a-time. I'm sure that is not an elegant way to do this (not to mention I can't get it to work), but I can't think of anything else. Can anyone help me out with this? I'd also love to hear suggestions for using a different structure (e.g., is it bad to use matrices within lists?).
Running this process with lapply on a list of similarly dimensioned matrices should be very simple. If that is a challenge, then you should post the dput(.) output for a two-element list with similar matrices. The real challenge is doing the processing row by row, which is illustrated here with M holding the five rows shown above and the output reshaped into a 3x3xN array:
w <- apply(M, 1, function(rw) diag( rw*(1-rw) ) +
rbind( rw*c(0, -rw[1], -rw[1] ),
rw*c(-rw[2],0, -rw[2] ),
rw*c(-rw[3], -rw[3], 0)
)
)
w
[,1] [,2] [,3] [,4] [,5]
[1,] 0.23945263 0.23972479 0.23999388 0.24025987 0.24052272
[2,] -0.22032093 -0.22044636 -0.22056801 -0.22068575 -0.22079962
[3,] -0.01913173 -0.01927842 -0.01942588 -0.01957412 -0.01972314
[4,] -0.22032093 -0.22044636 -0.22056801 -0.22068575 -0.22079962
[5,] 0.23192489 0.23219793 0.23246881 0.23273748 0.23300395
[6,] -0.01160398 -0.01175156 -0.01190081 -0.01205173 -0.01220435
[7,] -0.01913173 -0.01927842 -0.01942588 -0.01957412 -0.01972314
[8,] -0.01160398 -0.01175156 -0.01190081 -0.01205173 -0.01220435
[9,] 0.03073571 0.03102998 0.03132668 0.03162585 0.03192748
w <- array(w, c(3,3,5) )
w
, , 1
[,1] [,2] [,3]
[1,] 0.23945263 -0.22032093 -0.01913173
[2,] -0.22032093 0.23192489 -0.01160398
[3,] -0.01913173 -0.01160398 0.03073571
, , 2
[,1] [,2] [,3]
[1,] 0.23972479 -0.22044636 -0.01927842
[2,] -0.22044636 0.23219793 -0.01175156
[3,] -0.01927842 -0.01175156 0.03102998
.... snipped remaining output
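A compact way to wrap this in lapply is sketched below. It uses the fact that the target matrix for a row p equals diag(p) minus the outer product of p with itself. The names row_cov and cov_array are illustrative, and prob is a made-up two-element list built from the five rows shown in the question.

```r
# Five sample rows from the question, used for both list elements.
M <- rbind(
  c(0.6027004, 0.3655563, 0.03174335),
  c(0.6013667, 0.3665756, 0.03205767),
  c(0.6000306, 0.3675946, 0.03237481),
  c(0.5986921, 0.3686131, 0.03269480),
  c(0.5973513, 0.3696311, 0.03301765)
)
prob <- list(M, M)

# One row: p*(1-p) on the diagonal, -p[i]*p[j] off the diagonal.
row_cov <- function(p) diag(p) - tcrossprod(p)

# One matrix: a k x k x N array, one slice per row.
cov_array <- function(mat) {
  k <- ncol(mat)
  vapply(seq_len(nrow(mat)), function(i) row_cov(mat[i, ]), matrix(0, k, k))
}

# Whole list: one array per element of prob.
w_list <- lapply(prob, cov_array)
print(dim(w_list[[1]]))
print(w_list[[1]][, , 1])
```

vapply with a matrix template returns the 3x3xN array directly, so the separate array(w, c(3,3,5)) step is not needed.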
