I'm trying to match stock trades from one data frame with the mid-quote that was prevailing at the time of each trade. The time stamps therefore don't match exactly; for each trade I only have the interval between two consecutive quote times that it falls into.
I wrote a loop which works, but since I know that loops should be avoided in R whenever possible, I looked for an alternative.
First, this is my loop:
t <- dim(x1)[1]
z <- 1                              # running index into the quote data x1
for (i in 1:t) {
  flag <- FALSE
  while (flag == FALSE) {
    if (x1[z, 1] > x2[i, 1]) {      # first quote time after the trade time
      x2[i, 2] <- x1[z - 1, 2]      # take the midquote of the preceding quote
      flag <- TRUE
    } else {
      z <- z + 1
    }
  }
}
I've found the advice on Stack Overflow to merge the two data sets, so I added the upper bound of each interval as another column and matched the corresponding times with the subset function.
Unfortunately, this method takes far more time than the loop. I assume it's due to the huge array created by the merge. The quote data has about 500,000 observations and the transaction data about 100,000.
Is there a more elegant (and especially faster) way to solve this problem?
Furthermore, for some data I get the error message "missing value where TRUE/FALSE needed", even though the if condition works when I evaluate it manually.
Edit:
My quote data looks like this:
Time midquote
[1,] 35551 50.85229
[2,] 35589 53.77627
[3,] 36347 54.27945
[4,] 37460 52.01283
[5,] 37739 53.65414
[6,] 38249 52.34947
[7,] 38426 50.59568
[8,] 39858 53.75646
[9,] 40219 51.38876
[10,] 40915 52.09319
and my transaction data:
Time midquote
[1,] 36429 0
[2,] 38966 0
[3,] 39334 0
[4,] 39998 0
[5,] 40831 0
So I want to fill in the midquote column of the transaction data with the quote that was prevailing at each transaction time. The times in the example are in seconds after midnight.
For your example datasets, the following approach is faster:
x2[, 2] <- x1[vapply(x2[, 1], function(x) which(x <= x1[, 1])[1] - 1L,
                     FUN.VALUE = integer(1)), 2]
# Time midquote
# [1,] 36429 54.27945
# [2,] 38966 50.59568
# [3,] 39334 50.59568
# [4,] 39998 53.75646
# [5,] 40831 51.38876
A second approach, which relies on the transaction rows having midquote 0 so that which(!tmp) can locate them in the combined, time-ordered vector:
o <- order(c(x1[, 1], x2[, 1]))
tmp <- c(x1[, 2], x2[, 2])[o]
idx <- which(!tmp)
x2[, 2] <- tmp[unlist(tapply(idx, c(0, cumsum(diff(idx) > 1)),
                             function(x) x - seq_along(x)), use.names = FALSE)]
# Time midquote
# [1,] 36429 54.27945
# [2,] 38966 50.59568
# [3,] 39334 50.59568
# [4,] 39998 53.75646
# [5,] 40831 51.38876
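Since the quote times are already sorted, findInterval() offers another fully vectorised alternative. This is only a sketch under the same setup as above (x1 = quotes, x2 = trades), assuming every trade occurs at or after the first quote time and that you want the last quote at or before each trade:
# for each trade time, the index of the last quote time that is <= it
idx <- findInterval(x2[, 1], x1[, 1])
x2[, 2] <- x1[idx, 2]   # reproduces the midquotes shown above
This avoids the per-trade scan of the quote data entirely.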
Related
I have a 2134 by 2134 matrix of correlation values and I would like to count the total number of values that are above 0.8 or below -0.8. I have tried
length(TFcoTF[TFcoTF>.8])
but this does not seem to be correct, as I am getting about 50 percent of the values above 0.8, which does not correspond to the histogram I have for the data. Also, when I do
length(TFcoTF[TFcoTF<-.8])
I got 0 as the output. Any help is appreciated.
The data.table package has a function called between. It returns a TRUE/FALSE value for each element of your matrix, indicating whether that element lies between two bounds.
In my example below, I randomly created a 10x10 matrix with values uniformly drawn from [-1, +1], then used the length function on the subset of values that fall inside [-0.8, +0.8].
library(data.table)
data <- matrix(runif(100,-1,1), nrow = 10, ncol=10)
data
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.05585901 -0.7497720 -0.8371569 -0.401079424 -0.4130752 -0.788961736 0.2909987 0.48965177 0.4076504 -0.0682856
[2,] -0.42442920 0.7476111 0.8238973 -0.912507391 -0.4450897 -0.001308901 0.5151425 -0.16838841 -0.1648151 0.8370660
[3,] -0.73295874 0.5271986 0.5822628 -0.008554908 -0.2785803 -0.499058508 -0.5661172 0.35957967 0.5807055 0.2350893
[4,] 0.18949338 0.3827603 -0.6112584 0.209209240 -0.5883962 -0.087900052 0.1272227 0.58165922 -0.9950324 -0.9118599
[5,] 0.40862973 0.9496163 0.4996253 0.079538601 0.9839763 -0.119883751 0.3667418 -0.02751815 -0.6724141 0.3217434
[6,] 0.77338548 -0.7698167 -0.5632436 0.223301216 -0.9936610 0.650110638 -0.9400395 -0.47808065 -0.1579283 -0.6896787
[7,] 0.93210326 0.5360980 0.7677325 0.815231731 -0.4320206 0.647954028 0.5180600 -0.09574138 -0.3848389 0.9726445
[8,] -0.66411834 0.1125759 -0.4021577 -0.711363103 0.7161801 -0.071971464 0.7953436 0.40326575 0.6895480 0.7496597
[9,] 0.14118154 0.4775983 0.8966069 0.852880293 0.4715885 -0.542526148 0.5200246 -0.62649677 -0.3677738 0.1961003
[10,] -0.59353193 -0.2358892 0.5769562 -0.287113142 -0.7100862 -0.107092848 -0.8101459 -0.46754146 -0.4082147 -0.4475972
length(data[between(data,-0.8,0.8)])
[1] 84
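The count the question asks for is the complement of that, i.e. the number of values above 0.8 or below -0.8:
length(data) - length(data[between(data, -0.8, 0.8)])
# [1] 16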
It's difficult to answer without your dataset; please provide a minimal reproducible example.
Your first line of code looks correct.
For the second, the problem is how R parses the expression. In R you can assign a value with either = or <-, so x<-1 assigns the value 1 to x, whereas x < -1 (note the space) returns a logical value. In your case TFcoTF<-.8 therefore assigns 0.8 instead of comparing against -0.8.
You can then combine logical comparisons and run the code below:
set.seed(42)
m <- matrix(runif(25, min = -1, max = 1), nrow = 5, ncol = 5)
m
length(m[m > .8]) + length(m[m < -.8])  # long version of what you did
length(m[m < -.8 | m > .8])             # | means "or": TRUE | FALSE returns TRUE
sum(m > .8 | m < -.8)                   # summing a logical vector counts the TRUEs, since TRUE is 1 and FALSE is 0
sum(abs(m) > .8)                        # the shortest version
Presently, I am working through the rbprobitGibbs example from the package's help file in RStudio, which contains the following sample:
library(bayesm)   # rbprobitGibbs comes from the bayesm package

##
## rbprobitGibbs example
##
if (nchar(Sys.getenv("LONG_TEST")) != 0) {R = 2000} else {R = 10}

set.seed(66)
simbprobit = function(X, beta) {
  ## function to simulate from binary probit including x variable
  y = ifelse((X %*% beta + rnorm(nrow(X))) < 0, 0, 1)
  list(X = X, y = y, beta = beta)
}

nobs = 200
X    = cbind(rep(1, nobs), runif(nobs), runif(nobs))
beta = c(0, 1, -1)
nvar = ncol(X)
simout = simbprobit(X, beta)

Data1 = list(X = simout$X, y = simout$y)
Mcmc1 = list(R = R, keep = 1)

out = rbprobitGibbs(Data = Data1, Mcmc = Mcmc1)
summary(out$betadraw, tvalues = beta)

if (0) {
  ## plotting example
  plot(out$betadraw, tvalues = beta)
}
When I step through the code, I don't see anywhere that the A matrix is set. It is only when I reach this line:
out=rbprobitGibbs(Data=Data1,Mcmc=Mcmc1)
that I see the A matrix displayed in the output, which I understand has to be a k x k matrix, where betabar is a k x 1 vector.
Prior Parms:
betabar
# [1] 0 0 0
A
# [,1] [,2] [,3]
# [1,] 0.01 0.00 0.00
# [2,] 0.00 0.01 0.00
# [3,] 0.00 0.00 0.01
So I can understand how A gets its dimensions; however, what is not clear to me is how the values in A come to be 0.01. I am trying to figure out how a user calling the rbprobitGibbs function can set the precision via A to whatever they like. I can see where A is output, but how are its values derived from the input? Does anyone have any suggestions? TIA.
UPDATE:
Here is the output produced, but as far as I can determine it is identical whether or not I pass prior = list(rep(0,3), .2*diag(3)):
> out
$betadraw
[,1] [,2] [,3]
[1,] 0.3565099 0.6369436 -0.9859025
[2,] 0.4705437 0.7211755 -1.1955608
[3,] 0.1478930 0.6538157 -0.6989660
[4,] 0.4118663 0.7910846 -1.3919411
[5,] 0.0385419 0.9421720 -0.7359932
[6,] 0.1091359 0.7991905 -0.7731041
[7,] 0.4072556 0.5183280 -0.7993501
[8,] 0.3869478 0.8116237 -1.2831395
[9,] 0.8893555 0.5448905 -1.8526630
[10,] 0.3165972 0.6484716 -0.9857531
attr(,"class")
[1] "bayesm.mat" "mcmc"
attr(,"mcpar")
[1] 1 10 1
The value comes from a default scaling constant applied to the prior precision matrix. In the source you will note that if you do not supply a prior precision matrix, the function generates a k x k identity matrix and multiplies it by the constant BayesmConstant.A (0.01, which is what produces the printed A above). Nothing fancy here. The default constants for all of the various functions in bayesm can be found in the ./bayesm/R/bayesmConstants.R file.
if (is.null(Prior$A)) {
  A = BayesmConstant.A * diag(nvar)
}
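For reference, that constant is defined in bayesmConstants.R; judging from the printed prior above, its value is:
BayesmConstant.A = 0.01   # default scale for the prior precision matrix A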
If you would like to supply your own constant, say 0.2, you could do so as follows: prior = list(rep(0, k), .2 * diag(k)), or even introduce some relational information into the prior.
Very late to the party, but I ran across this same issue and just figured it out. In order to change the A matrix and the betabar prior you have to name them, since all of your other input arguments are named.
For example, your call should be:
rbprobitGibbs(Data=Data1, Prior=list(betabar=betabar1, A=A1), Mcmc=Mcmc1)
If you do that, you are able to set your own values for betabar and A.
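Putting it together with the simulated data from the question (the particular prior values here are just illustrative):
betabar1 <- rep(0, nvar)        # prior mean for the 3 coefficients
A1       <- 0.2 * diag(nvar)    # prior precision, replacing the 0.01 default
out <- rbprobitGibbs(Data = Data1,
                     Prior = list(betabar = betabar1, A = A1),
                     Mcmc = Mcmc1)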
I have a csv file with the following columns:
sigID , author,lowered,array
1, Lukic M,lukicm,"[ 0.05192188 -0.02984986 -0.01315994 -0.05446223 0.01090824 -0.0310401 -0.00134283 -0.0536921 -0.02986531 -0.01161558]"
2, Houssin C,houssinc,"[ 0.05371874 -0.07439778 0.3917329 -0.15246899 0.35638699 0.14586256 0.12886068 -0.10721818 -0.14641574 0.08469024]"
....
How do I read this csv file in R? (I am having a parsing issue with the array column.)
How can I calculate the cosine similarity between array[1] and array[2]?
Thanks,
Here is a way to parse your array column into vectors:
myList <- strsplit(gsub("\\[\\s*|\\s*\\]", "", df$array), "\\s+")
myList
[[1]]
[1] "0.05192188" "-0.02984986" "-0.01315994" "-0.05446223" "0.01090824" "-0.0310401" "-0.00134283" "-0.0536921"
[9] "-0.02986531" "-0.01161558"
[[2]]
[1] "0.05371874" "-0.07439778" "0.3917329" "-0.15246899" "0.35638699" "0.14586256" "0.12886068" "-0.10721818"
[9] "-0.14641574" "0.08469024"
Convert them to numeric before calculating the cosine similarity:
mat <- do.call(cbind, lapply(myList, as.numeric))
mat
[,1] [,2]
[1,] 0.05192188 0.05371874
[2,] -0.02984986 -0.07439778
[3,] -0.01315994 0.39173290
[4,] -0.05446223 -0.15246899
[5,] 0.01090824 0.35638699
[6,] -0.03104010 0.14586256
[7,] -0.00134283 0.12886068
[8,] -0.05369210 -0.10721818
[9,] -0.02986531 -0.14641574
[10,] -0.01161558 0.08469024
You can use the cosine function from the lsa package to calculate the cosine similarity:
library(lsa)
cosine(mat)
[,1] [,2]
[1,] 1.0000000 0.2438864
[2,] 0.2438864 1.0000000
So the cosine similarity measure between vector 1 and vector 2 is 0.244.
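If you prefer not to add a dependency, the same number can be computed directly from the definition (dot product divided by the product of the norms):
v1 <- mat[, 1]
v2 <- mat[, 2]
sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
# [1] 0.2438864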
Note: as to why you can't read the file, I suspect you have a quote missing at the end of your first array. Otherwise I can't think of any reason why you couldn't read it; it is a normal .csv file.
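For completeness, something like the following should read the file once the quoting is consistent (the file name here is just a placeholder):
# the quoted array column comes in as a single character string per row
df <- read.csv("signatures.csv", stringsAsFactors = FALSE, strip.white = TRUE)
str(df$array)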
I have a matrix of n variables and I want to make a new matrix containing the pairwise differences between each pair of column vectors, but not of a column with itself. Here is an example of the data.
Transportation.services Recreational.goods.and.vehicles Recreation.services Other.services
2.958003 -0.25983789 5.526694 2.8912009
2.857370 -0.03425164 5.312857 2.9698044
2.352275 0.30536569 4.596742 2.9190123
2.093233 0.65920773 4.192716 3.2567390
1.991406 0.92246531 3.963058 3.6298314
2.065791 1.06120930 3.692287 3.4422340
I tried running the for loop below, but I'm aware that R is very slow with loops.
Difference.Matrix <- function(data){
  n <- 2
  new.cols <- "New Columns"
  list <- list()
  for (i in 1:ncol(data)){
    for (j in n:ncol(data)){
      name <- paste("diff", i, j, data[,i], data[,j], sep = ".")
      new  <- data[,i] - data[,j]
      list[[new.cols]] <- c(name)
      data <- merge(data, new)
    }
    n <- n + 1
  }
  results <- list(data = data)
  return(results)
}
As I said before, the code runs very slowly and has not even finished a single run yet. I apologize for the beginner-level coding. I am also aware this code leaves the original data in the matrix, but I can delete it later.
Is it possible for me to use an apply function or foreach on this data?
You can find the pairs with combn and use apply to create the result:
apply(combn(ncol(d), 2), 2, function(x) d[,x[1]] - d[,x[2]])
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 3.217841 -2.568691 0.0668021 -5.786532 -3.151039 2.6354931
## [2,] 2.891622 -2.455487 -0.1124344 -5.347109 -3.004056 2.3430526
## [3,] 2.046909 -2.244467 -0.5667373 -4.291376 -2.613647 1.6777297
## [4,] 1.434025 -2.099483 -1.1635060 -3.533508 -2.597531 0.9359770
## [5,] 1.068941 -1.971652 -1.6384254 -3.040593 -2.707366 0.3332266
## [6,] 1.004582 -1.626496 -1.3764430 -2.631078 -2.381025 0.2500530
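To see what combn is doing here: combn(ncol(d), 2) enumerates every pair of column indices, one pair per column of its output, and the anonymous function subtracts the second column of each pair from the first:
combn(4, 2)   # all pairs of 4 column indices
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4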
You can add appropriate names with another apply. Here the column names are very long, which impairs the formatting, but the labels tell you which difference is in each column:
x <- apply(combn(ncol(d), 2), 2, function(x) d[,x[1]] - d[,x[2]])
colnames(x) <- apply(combn(ncol(d), 2), 2, function(x) paste(names(d)[x], collapse=' - '))
> x
Transportation.services - Recreational.goods.and.vehicles Transportation.services - Recreation.services
[1,] 3.217841 -2.568691
[2,] 2.891622 -2.455487
[3,] 2.046909 -2.244467
[4,] 1.434025 -2.099483
[5,] 1.068941 -1.971652
[6,] 1.004582 -1.626496
Transportation.services - Other.services Recreational.goods.and.vehicles - Recreation.services
[1,] 0.0668021 -5.786532
[2,] -0.1124344 -5.347109
[3,] -0.5667373 -4.291376
[4,] -1.1635060 -3.533508
[5,] -1.6384254 -3.040593
[6,] -1.3764430 -2.631078
Recreational.goods.and.vehicles - Other.services Recreation.services - Other.services
[1,] -3.151039 2.6354931
[2,] -3.004056 2.3430526
[3,] -2.613647 1.6777297
[4,] -2.597531 0.9359770
[5,] -2.707366 0.3332266
[6,] -2.381025 0.2500530
Here is an excerpt of a numeric matrix that I have:
[1,] 30 -33.129487 3894754.1 -39.701738 -38.356477 -34.220534
[2,] 29 -44.289487 -8217525.9 -44.801738 -47.946477 -41.020534
[3,] 28 -48.439487 -4572815.9 -49.181738 -48.086477 -46.110534
[4,] 27 -48.359487 -2454575.9 -42.031738 -43.706477 -43.900534
[5,] 26 -38.919487 -2157535.9 -47.881738 -43.576477 -46.330534
[6,] 25 -45.069487 -5122485.9 -47.831738 -47.156477 -42.860534
[7,] 24 -46.207487 -2336325.9 -53.131738 -50.576477 -50.410534
[8,] 23 -51.127487 -2637685.9 -43.121738 -47.336477 -47.040534
[9,] 22 -45.645487 3700424.1 -56.151738 -47.396477 -50.720534
[10,] 21 -56.739487 1572594.1 -49.831738 -54.386577 -52.470534
[11,] 20 -46.319487 642214.1 -39.631738 -44.406577 -41.490534
What I want to do now is to scale the values in each column so that they range from 0 to 1.
I tried to accomplish this using the scale() function on my matrix (default parameters), and I got this:
[1,] -0.88123100 0.53812440 -1.05963281 -1.031191482 -0.92872324
[2,] -1.17808251 -1.13538649 -1.19575096 -1.289013031 -1.11327085
[3,] -1.28847084 -0.63180980 -1.31265244 -1.292776849 -1.25141017
[4,] -1.28634287 -0.33914007 -1.12182012 -1.175023107 -1.19143220
[5,] -1.03524267 -0.29809911 -1.27795565 -1.171528133 -1.25738083
[6,] -1.19883019 -0.70775576 -1.27662116 -1.267774342 -1.16320727
[7,] -1.22910054 -0.32280189 -1.41807728 -1.359719044 -1.36810940
[8,] -1.35997055 -0.36443973 -1.15091204 -1.272613537 -1.27664977
[9,] -1.21415156 0.51127451 -1.49868058 -1.274226602 -1.37652260
[10,] -1.50924749 0.21727976 -1.33000083 -1.462151358 -1.42401647
[11,] -1.23207969 0.08873245 -1.05776452 -1.193844887 -1.12602635
This is already close to what I want, but values from 0 to 1 would be even better. I read the help page of scale(), but I really don't understand how I would do that.
Try the following, which seems simple enough:
## Data to make a minimal reproducible example
m <- matrix(rnorm(9), ncol=3)
## Rescale each column to range between 0 and 1
apply(m, MARGIN = 2, FUN = function(X) (X - min(X))/diff(range(X)))
# [,1] [,2] [,3]
# [1,] 0.0000000 0.0000000 0.5220198
# [2,] 0.6239273 1.0000000 0.0000000
# [3,] 1.0000000 0.9253893 1.0000000
And if you still want to use scale():
maxs <- apply(a, 2, max)
mins <- apply(a, 2, min)
scale(a, center = mins, scale = maxs - mins)
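A quick check that each column now spans exactly 0 to 1 (assuming a is the matrix from the question):
scaled <- scale(a, center = mins, scale = maxs - mins)
apply(scaled, 2, range)   # every column should give 0 and 1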
Install the clusterSim package and run the following:
library(clusterSim)
normX <- data.Normalization(x, type = "n4")   # "n4" = unitization with zero minimum, i.e. (x - min) / range
The scales package has a function called rescale:
set.seed(2020)
x <- runif(5, 100, 150)
scales::rescale(x)
#1.0000000 0.5053362 0.9443995 0.6671695 0.0000000
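rescale works on a single vector; for a matrix like the one in the question you would apply it column-wise (otherwise the whole matrix is rescaled against its global minimum and maximum):
m <- matrix(runif(20, 100, 150), nrow = 5)
apply(m, 2, scales::rescale)   # each column rescaled to [0, 1] separately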
Not the prettiest, but this got the job done, since I needed to do it in a data frame.
column_zero_one_range_scale <- function(
  input_df,
  columns_to_scale   # columns in input_df to scale; must be numeric or integer
){
  input_df_replace <- input_df
  for (columnnum in columns_to_scale) {
    if (!class(input_df[, columnnum]) %in% c('numeric', 'integer')) {
      print(paste0('Column name ', colnames(input_df)[columnnum],
                   ' not an integer or numeric, will skip'))
      next   # skip instead of carrying over the previous column's scaled values
    }
    vec      <- input_df[, columnnum]
    rangevec <- max(vec, na.rm = TRUE) - min(vec, na.rm = TRUE)
    vec1     <- vec - min(vec, na.rm = TRUE)
    vec2     <- vec1 / rangevec
    input_df_replace[, columnnum] <- vec2
    colnames(input_df_replace)[columnnum] <- paste0(colnames(input_df)[columnnum], '_scaled')
  }
  return(input_df_replace)
}
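A quick usage sketch on a built-in data frame (iris: the first four columns are numeric and get rescaled and renamed, the fifth is a factor and is skipped):
scaled_iris <- column_zero_one_range_scale(iris, 1:5)
sapply(scaled_iris[1:4], range)   # each scaled column ranges from 0 to 1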