Find column with values closest to vector - r

I have a vector of times in milliseconds that looks like this:
vector <- c(667753, 671396, 675356, 679286, 683413, 687890, 691742,
695651, 700100, 704552, 708832, 713117, 717082, 720872, 725002, 729490,
733824, 738233, 742239, 746092, 750003, 754236, 867342, 870889, 873704,
876617, 879626, 882595, 885690, 888602, 891789, 894717, 897547, 900797,
903615, 906646, 909624, 912613, 915645, 918566, 921792, 924625, 927538,
930721, 933542)
Now I want to search a large data frame with many time columns for the single column whose values are closest (row-wise) to the values in my vector.
The data frame has as many rows as the vector has elements. So if my vector has 240 elements, every column in the larger data frame has 240 rows.
Any idea how to do this?

You can calculate the Euclidean distance between your vector and each column of the data frame and then check which column has the smallest distance:
which.min(sapply(seq_len(ncol(dataFrame)), function(i) sqrt(sum((v - dataFrame[, i])^2))))
This returns the index of the column with the smallest distance.
Here dataFrame is the data frame containing the time columns (each column is compared to the vector) and v is the vector.
The following is just the square root of the sum of squared differences (the Euclidean distance):
sqrt(sum((v - dataFrame[, i])^2))
You can also use the sum of absolute differences as a distance measure:
sum(abs(v - dataFrame[, i]))
EDIT
As Evan Friedland pointed out, you can actually just use:
which.min(colSums(abs(v - dataFrame)))
or
which.min(sqrt(colSums((v - dataFrame)^2)))
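A small reproducible check of the vectorised approach, as a sketch only: the values and the column names a, b and c are made up for illustration, and the code assumes v has exactly as many elements as dataFrame has rows.

```r
# Hypothetical reference vector and data frame, for illustration only
v <- c(100, 200, 300, 400)
dataFrame <- data.frame(
  a = c(150, 260, 340, 470),   # far from v
  b = c(101, 199, 302, 398),   # closest to v
  c = c(90, 210, 280, 430)     # in between
)

# Sum of absolute row-wise differences, one value per column
abs_dist <- colSums(abs(v - dataFrame))

# Euclidean distance, one value per column
euc_dist <- sqrt(colSums((v - dataFrame)^2))

# Name of the closest column under each measure
closest_abs <- names(which.min(abs_dist))
closest_euc <- names(which.min(euc_dist))
```

Both measures pick the same column here, but they can disagree on real data, because the Euclidean distance penalises a few large deviations more heavily than many small ones.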

Related

How to calculate jaccard similarity between two rows in data frame

I have an Excel file with student records, each with 14 attributes (shown below). I want to calculate the similarity between each pair of students.
First, I convert the rows to character vectors. Then I build a document-term matrix and calculate the distance between each pair. Then I subtract the distance from 1. But the similarity I get is wrong.
F360 <- read_excel("C:/Users/DreamWorld/F360.xlsx")
mydf=data.frame(F360$nursery,F360$higher,F360$internet,F360$romantic,stringsAsFactors = FALSE)
td1=as.character(mydf[1,])
td2=as.character(mydf[2,])
d1=paste(td1[1],td1[2],td1[3],td1[4],sep = " ")
d2=paste(td2[1],td2[2],td2[3],td2[4],sep = " ")
myvector=c(d1,d2)
mycorpus=Corpus(VectorSource(myvector))
dtm=as.matrix(DocumentTermMatrix(mycorpus))
jdist=as.matrix(dist(dtm,method = "jaccard"))
jsim=1-jdist
I'm expecting the similarity between each pair of rows in the data frame.
Recently, I found that the sum function will give me the number of common attributes.
com=sum(td1==td2)
The next thing is to get the number of elements in both vectors, which is obviously 4.
len = length(td1)
Finally, we can find the Jaccard similarity, which is the intersection over the union.
sim = com/len
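The same ratio (matching attributes divided by the number of attributes) can be computed for every pair of rows at once. The following is a minimal sketch, assuming a small made-up data frame in place of the Excel data; the values are hypothetical and only the four column names come from the code above.

```r
# Hypothetical stand-in for the four yes/no columns of the student data
mydf <- data.frame(
  nursery  = c("yes", "yes", "no"),
  higher   = c("yes", "yes", "yes"),
  internet = c("no",  "yes", "yes"),
  romantic = c("no",  "no",  "yes"),
  stringsAsFactors = FALSE
)

m <- as.matrix(mydf)   # character matrix, one row per student
n <- nrow(m)

# sim[i, j] = share of attributes on which students i and j agree
sim <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- sum(m[i, ] == m[j, ]) / ncol(m)
  }
}
```

The result is a symmetric matrix with ones on the diagonal, so sim[1, 2] is the similarity between the first and second student.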

Iterating a vector over a list in R

I am dealing with some computational feature extracting problem from RNA data, and I found myself unable to deal with this question:
I have n sequences (say two, for example) for which I obtained a statistic over i iterations (a kind of Monte Carlo iteration for analyzing the distribution of the obtained statistics compared with the original).
Example:
Say we iterate 10 times
n <- 10
I got a vector of 20 values with all the iterations, but this vector corresponds to two different sequences, so I must divide this vector in two equal parts (the iterations are ordered 1:10 - 1:10 for each sequence).
MFEit <- c(10, 12, 34, 32, 12 .....) ## vector of length 20
MFEit.split <- split(MFEit, ceiling(seq_along(MFEit)/n))
This generates a list of two items each with 10 values, named $1 and $2
On the other hand I have a vector of two values which are the original statistics, each corresponding to each original sequence
MFE <- c(25, 15)
What I want to know is how many values in the first item of MFEit.split are less than or equal to the first value of MFE, how many values in the second item are less than or equal to the second value of MFE, and so on for however many values or items there are.
I know how to do it one by one, say:
R <- length(subset(MFEit.split$`1`, MFEit.split$`1`<=MFE[1]))
R <- length(subset(MFEit.split$`2`, MFEit.split$`2`<=MFE[2]))
But... how to include this into a loop so that I can get iteratively each comparison, no matter how many MFE values or items in the list I have?
The desired output would be a vector called R, with n values corresponding to each comparison.
Any help?...
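One possible way to do all the comparisons at once is mapply, which walks through the list and the vector in parallel. This is a sketch under assumptions: the 20 values in MFEit are made up for illustration, and the split uses seq_along to cut the vector into consecutive chunks of length n.

```r
n <- 10  # iterations per sequence

# Made-up iteration results for two sequences (20 values in total)
MFEit <- c(10, 12, 34, 32, 12, 25, 40, 18, 26, 30,
           15, 14, 20, 9, 16, 15, 22, 11, 13, 30)

# Split into consecutive chunks of length n
MFEit.split <- split(MFEit, ceiling(seq_along(MFEit) / n))

MFE <- c(25, 15)  # original statistic for each sequence

# Pair each list item with its MFE value and count the values <= that MFE
R <- mapply(function(x, m) sum(x <= m), MFEit.split, MFE)
```

R has one element per list item, so the same call works unchanged for any number of sequences as long as MFE and MFEit.split have the same length.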

creating many matrix from one matrix with looping?

I have a square matrix M with 25x25 dimension.
Then I want to create 25 matrices as follows:
the first matrix is matrix M without the first row and first column,
the second matrix is matrix M without the second row and second column,
and so on until the 25th matrix.
This little snippet will do:
lapply(1:25, function(i) M[-i, -i])
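To see what the snippet returns, here is a sketch on a small 3x3 matrix standing in for the 25x25 one. Using seq_len(nrow(M)) instead of a hard-coded 1:25, and adding drop = FALSE, are choices made for this illustration and are not part of the original answer.

```r
# Small 3x3 stand-in for the 25x25 matrix M
M <- matrix(1:9, nrow = 3)

# One sub-matrix per index i, dropping row i and column i;
# drop = FALSE keeps the result a matrix even if only one cell is left
subs <- lapply(seq_len(nrow(M)), function(i) M[-i, -i, drop = FALSE])
```

subs is a list with one element per index, so subs[[2]] is M without its second row and second column.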

Convert equal interval of vector to rows of matrix

I've imported a table that contains the travel times for an origin-destination (OD) cost matrix of size n x n. Travel times are equal to zero when the origin and destination are the same.
For example, an OD cost matrix of 25 origins and 25 destinations (625 elements) would have zero values running down the diagonal. In a vector, the value 0 occurs at the 0th element, 26th element, 51st element, etc.
I've read the travel times in as a vector and I'd like to reshape the vector into a matrix where every element on the diagonal has the value of zero. Does anyone have any ideas on how this would be done?
Code:
### READ and PREPARE DATA ###
# Read OD cost matrix
od_table <- read.table('DMatrix.txt', sep=',', header=TRUE, na.strings="NA", stringsAsFactors=FALSE)
v <- od_table$Total_TravelTime
n <- sqrt(length(v))
D <- matrix(v, nrow=n)
The resulting matrix has zero values along the first row only:
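If the rows of the table are not stored in plain origin-by-destination order, reshaping by position cannot put the zeros on the diagonal. One possible alternative is to place each value by its origin and destination identifiers. This is only a sketch: the OriginID and DestinationID column names and all values are assumptions (only Total_TravelTime appears in the code above), and the identifiers are assumed to be the integers 1 to n.

```r
# Hypothetical OD table; OriginID and DestinationID are assumed column names.
# Rows are deliberately not in origin-by-destination order.
od_table <- data.frame(
  OriginID         = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  DestinationID    = c(1, 3, 2, 2, 1, 3, 3, 2, 1),
  Total_TravelTime = c(0, 4, 7, 0, 7, 5, 0, 5, 4)
)

n <- length(unique(od_table$OriginID))
D <- matrix(NA_real_, nrow = n, ncol = n)

# Place each travel time at [origin, destination] via a two-column index matrix
D[cbind(od_table$OriginID, od_table$DestinationID)] <- od_table$Total_TravelTime
```

If the identifiers are not consecutive integers starting at 1, they would first need to be mapped to positions, for example with match() against the sorted unique identifiers.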

R: Assigning value to a matrix with variable name

I'm struggling to remove a row in a matrix whose name is "unknown". What I mean by "unknown" is that there are several matrices, and the last 3 characters of each matrix's name are different.
An example would make this a lot clearer I think.
Say I have 3 matrices, Trades_ABC, Trades_DEF, Trades_HIJ. Each of these matrices has x rows and 5 columns.
I currently have the following code:
for (k in 1:3)
assign(get(paste0("Trades_",sellLeg))[1,1],y)
next k
Where "sellLeg" is one of "ABC","DEF","HIJ"
In this code I am trying to change the value of the first element in each of the three matrices to some number, represented by y, as an example. In reality, I'm not so much looking to CHANGE a value as I am looking to REMOVE a row, but my main problem is that I don't know how to assign a value to a matrix with an "unknown" name (once I can do this I should be able to remove a row).
Many thanks!
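One common pattern for this is to fetch the matrix with get, modify the copy, and write it back with assign. The sketch below uses made-up matrices with 2 rows and 5 columns, and the loop over the three suffixes is an assumption about how sellLeg takes its values.

```r
# Three example matrices standing in for the real Trades_* objects
Trades_ABC <- matrix(1:10, ncol = 5)
Trades_DEF <- matrix(11:20, ncol = 5)
Trades_HIJ <- matrix(21:30, ncol = 5)

for (sellLeg in c("ABC", "DEF", "HIJ")) {
  nm <- paste0("Trades_", sellLeg)   # build the object name
  m  <- get(nm)                      # fetch a copy of the matrix
  m  <- m[-1, , drop = FALSE]        # remove the first row
  assign(nm, m)                      # write the modified copy back
}
```

Changing a single value works the same way, with m[1, 1] <- y in place of the row removal. Keeping the matrices in a named list and indexing it by name is often simpler than get and assign.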

Resources