Optimizing Speed for Populating a Matrix - r

I am trying to fill in a large matrix (55,920,484 elements) in R that will eventually be symmetric (so I am actually only performing calculations for half of the matrix). The resulting values matrix is square and has the same row and column names. Each value in the matrix is the result of comparing unique lists and counting the number of intersections. The data come from a larger dataframe (427.5 Mb). Here is my fastest solution so far; I am trying to get rid of the loops, which I know are slow:
for (i in 1:length(rownames(values))) {
  for (j in i:length(colnames(values))) {
    A <- data[data$Stock == rownames(values)[i], "Fund"]
    B <- data[data$Stock == colnames(values)[j], "Fund"]
    values[i, j] <- length(intersect(A, B))
  }
}
I have tried several other approaches such as using a database with an SQL connection, using a sparse matrix with 0s and 1s, and using the sqldf package in R.
Here is the structure of my data:
head(data)
Fund Stock Type Shares.Held Maket.Value X..of.Portfolio Rank Change.in.Shares X..Change X..Ownership
1 12 WEST CAPITAL MANAGEMENT LP GRUB CALL 500000 12100000 0.0173 12 500000 New N/A
2 12 WEST CAPITAL MANAGEMENT LP FIVE SH 214521 6886000 0.0099 15 214521 New 0
3 12 WEST CAPITAL MANAGEMENT LP SHAK SH 314114 12439000 0.0178 11 307114 4387 1
4 12 WEST CAPITAL MANAGEMENT LP FRSH SH 324120 3650000 0.0053 16 -175880 -35 2
5 12 WEST CAPITAL MANAGEMENT LP ATRA SH 393700 10398000 0.0149 14 162003 69 1
6 12 WEST CAPITAL MANAGEMENT LP ALNY SH 651000 61285000 0.0875 4 No Change 0 1

I see three problems, in order of increasing importance:
(1) You call rownames(values) and colnames(values) many times, instead of calling them just once outside of the loops. It may or may not help.
(2) You calculate A = data[data$Stock==rownames(values)[i],"Fund"] inside the innermost loop, while you should calculate it outside of this loop.
(3) Most important: your code uses only two columns of your table, Fund and Stock. I see that in your data there are many rows with the same Fund and Stock. You should eliminate this redundancy. Maybe you want to create data1 = data[,c("Fund","Stock")] and eliminate redundant rows in data1 (without a loop):
data1 <- data1[order(data1[, "Fund"], data1[, "Stock"]), ]   # sort rows so identical pairs end up adjacent
len <- nrow(data1)
good <- c(TRUE, data1[-len, 1] != data1[-1, 1] | data1[-len, 2] != data1[-1, 2])
data1 <- data1[good, ]
(I did not test the code above)
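For what it's worth, base R's unique() should give the same deduplication in a single step:
data1 <- unique(data[, c("Fund", "Stock")])   # drops duplicate Fund/Stock rows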
Maybe you want to go further and create a list which, for each fund, specifies which stocks it contains, without redundancies.
PS: you can still create the list which, for each stock, specifies which funds hold it:
rv <- rownames(values)
len <- length(rv)
fund.list <- list()
for (i in 1:len)
  fund.list[[i]] <- data[data$Stock == rv[i], "Fund"]

for (i in 1:len) {
  A <- fund.list[[i]]
  for (j in i:len) {
    values[i, j] <- length(intersect(A, fund.list[[j]]))
  }
}
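If it helps, the per-stock fund lists can also be built without the explicit loop, for example with split() (an untested sketch, assuming every rowname of values appears in data$Stock):
rv <- rownames(values)
# one vector of funds per stock, kept in the same order as the rows/columns of values
fund.list <- lapply(split(data$Fund, data$Stock)[rv], unique)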


R - Spatstat - Nearest neighbour searching by ID using data table

I have two large dataframes called intersections (representing intersections of a street system) and users (representing users of a network) as follows:
intersections has three columns: x, y and label_street. They respectively represent the position of an intersection in a square observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
head(intersections)
x y label_street
1 0.147674 0.132956 5
2 0.235356 0.150813 6
3 0.095337 0.087345 5
4 0.147674 0.132956 6
Since an intersection is located at the crossing of several streets, every (x,y) combination in the intersections table appears at least twice, but with a different label_street (e.g. rows 1 and 4 in the previous example). The label_street may not be the row number (which is why it starts at 5 in my example).
users has 4 columns: x, y, label_street, ID. They respectively represent the position of a user, the street the user is located on and a unique ID per user. There are no duplicates in this dataframe, as a user is located on a unique street and has a unique ID. Here is an example (the ID and the label_street may not be the row number):
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))
head(users)
x y label_street ID
1 0.20428152 0.14448448 6 2703
2 0.17840619 0.13921481 6 3460
3 0.12964668 0.11724543 5 4325
4 0.20423856 0.14447573 6 12506
5 0.19349761 0.14228827 6 19753
6 0.10861251 0.09891443 5 21282
What I want to do is the following: for each point (x,y) of intersections, get the ID of, and the distance to, its closest neighbour sharing the same label_street in users.
I have a working solution using the spatstat function nncross for nearest neighbour searching and the plyr function adply for working on the data.
My working solution is as follows:
1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table
NN <- function(row, query){
  df <- row
  window <- c(0, 5, 0, 5)  # needed to convert to ppp objects and compute the NN distance with nncross
  NN <- nncross(as.ppp(row[, 1:2], window), as.ppp(query[, 1:2], window))
  df$NN.ID <- query$ID[NN$which]
  df$NN.dist <- NN$dist
  return(df)
}
2) Apply this user-defined function row-wise to my dataframe "intersections", with the query being the subset of users sharing the same label_street as the row:
result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ]))
The result is as follows on the example:
head(result)
x y label_street NN.ID NN.dist
1 0.147674 0.132956 5 4325 0.02391247
2 0.235356 0.150813 6 2703 0.03171236
3 0.095337 0.087345 5 21282 0.01760940
4 0.147674 0.132956 6 3460 0.03136304
Since my real dataframes will be huge, I think computing distance matrices for looking at the nearest neighbour won't be efficient and that adply will be slow.
Does anyone have an idea of a data.table-like solution? I only know about the basics of data.table and have always found it very efficient compared to plyr.
This solution uses the RANN package to find nearest neighbours. The trick is to first ensure that elements with different label_street have a higher distance between them than elements within the same label_street. We do this by adding an additional numeric column with a very large value that is constant within the same label_street but different between different values of label_street. In total, you get:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), number = c(2703, 3460, 4325, 12506, 19753, 21282))
# add a numeric column that is constant within each category and has a very large value
intersections$label_street_large <- intersections$label_street * 1e6
users$label_street_large <- users$label_street * 1e6
# call the nearest neighbour function (k = 1 neighbour):
# 'users' is the set being searched, 'intersections' supplies the query points
nearest_neighbours <- RANN::nn2(
  users[, c("x", "y", "label_street_large")],
  intersections[, c("x", "y", "label_street_large")],
  k = 1
)
# get original IDs and distances (one entry per row of intersections)
IDs <- users$number[c(nearest_neighbours$nn.idx)]
distances <- c(nearest_neighbours$nn.dists)
IDs
# [1]  4325  2703 21282  3460
distances
# [1] 0.02391247 0.03171236 0.01760940 0.03136304
I hope this helps you. It should be very fast because it only calls nn2 once, which runs in O(N * log(N)) time.
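If you also want the output in the same shape as the adply result above, a small (hypothetical) assembly step could be:
# one row per intersection, with the ID of and distance to its nearest same-street user
result <- cbind(intersections[, c("x", "y", "label_street")],
                NN.ID = IDs, NN.dist = distances)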

New variable: sum of numbers from a list powered by value of different columns

This is my first question on Stack Overflow. I am not new to R, although I sometimes struggle with things that might be considered basic.
I want to calculate the count median diameter (CMD) for each of my rows from a Particle Size Distribution dataset.
My data looks like this (several rows and 53 columns in total):
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
2015-01-01 00:00:00 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2015-01-01 01:00:00 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
Each variable starting with "n" indicates the number of particles for the corresponding size (e.g. variable n3.16 = number of particles with a median size of 3.16 nm). I will divide the values by 100 prior to the calculations, to avoid numbers so large that they prevent the computation.
To compute the CMD, I need to do the following calculation:
CMD = (D1^n1 * D2^n2 * ... * Di^ni)^(1/N)
where Di is the diameter (to be extracted from the column name), ni is the number of particles for diameter Di, and N is the total sum of particles (sum of all the columns starting with "n").
To get the Di, I created a numeric list from the column names that start with n:
D <- as.numeric(gsub("n", "", names(data)[3:54]))
This is my attempt to create a new variable with the calculation of CMD, although it doesn't work.
data$cmd <- for i in 1:ncol(D) {
prod(D[[i]]^data[,i+2])
}
I also tried to use apply, but again, it didn't work:
data$cmd <- for i in 1:ncol(size) {
apply(data,1, function(x) prod(size[[i]]^data[,i+2])
}
I have different datasets from different sites with different numbers of columns, so I would like to make the code "universal".
Thank you very much
This should work (I had to mutilate your date variable because of read.table, but it is not involved in the calculations, so just ignore that):
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
N <- sum(df[3:11]) # did you mean the sum of all n.columns over all rows? if not, you'd need to edit this
> N
[1] 7235.488
D <- as.numeric(gsub("n", "", names(df)[3:11]))
> D
[1] 3.16 3.55 3.98 4.47 5.01 5.62 6.31 7.08 7.94
new <- t(apply(df[3:11], 1, function(x, y) (x^y), y = D))
> new
n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
[1,] 772457.6 41933406 336296640 9957341349 5.167135e+12 1.232886e+15 3.625318e+17 2.054007e+20 3.621747e+23
[2,] 7980615.0 5922074 348176502 25783108893 1.368736e+12 2.305272e+14 9.119184e+16 5.071946e+20 1.129304e+24
df$CMD <- rowSums(new)^(1/N)
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94 CMD
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082 1.007526
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846 1.007684
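If the intent is the literal formula CMD = (D1^n1 * D2^n2 * ... * Di^ni)^(1/N), here is an untested alternative sketch; it assumes N is the per-row total particle count and works in log space to avoid overflow:
counts <- df[3:11]                            # the n-columns
D <- as.numeric(gsub("n", "", names(counts)))
N_row <- rowSums(counts)                      # assumption: per-row total count
# (prod_i Di^ni)^(1/N) computed as exp(sum_i ni * log(Di) / N)
df$CMD_alt <- exp(drop(as.matrix(counts) %*% log(D)) / N_row)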

Need advice on how to abstract my simulator for opening collectible card packs

I've been building simulators in Excel with VBA to understand the distribution of outcomes a player may experience as they open up collectible card packs. These were largely built with nested for loops and, as you can imagine, were slow as molasses.
I've been spinning up on R over the last couple of months and have come up with a function that handles a particular definition of a pack (i.e., two cards with particular drop rates for n characters on either card). Now I am trying to abstract the function so that it can take any number of cards of whatever type of thing you want to throw at it (e.g., currency, gear, materials, etc.).
What this simulator is basically doing is saying "I want to watch 10,000 people open up 250 packs of 2 cards" and then I perform some analysis after the results are generated to ask questions like "How many $ will you need to spend to acquire character x?" or "What's the distribution of outcomes for getting x, y or z pieces of a character?"
Here's my generic function and then I'll provide some inputs that the function operates on:
mySimAnyCard <- function(observations, packs, lookup, droptable, cardNum){
  obvs <- rep(1:observations, each = packs)
  pks <- rep(1:packs, times = observations)
  crd <- rep(cardNum, length.out = length(obvs))
  if ("prob" %in% colnames(lookup)) {
    awrd <- sample(lookup[, "award"], length(obvs), replace = TRUE, prob = lookup[, "prob"])
  } else {
    awrd <- sample(unique(lookup[, "award"]), length(obvs), replace = TRUE)
  }
  qty <- sample(droptable[, "qty"], length(obvs), prob = droptable[, "prob"], replace = TRUE)
  df <- data.frame(observation = obvs, pack = pks, card = cardNum, award = awrd, quantity = qty)
  df
}
observations and packs each take an integer.
lookup takes a dataframe:
award prob
1 Nick 0.5
2 Alex 0.4
3 Sam 0.1
and droptable takes a similar dataframe :
qty prob
1 10 0.1355
2 12 0.3500
3 15 0.2500
4 20 0.1500
5 25 0.1000
6 50 0.0080
... continued
cardNum also takes an integer.
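For concreteness, a hypothetical single-card call using the example tables above (with the truncated droptable rows omitted) might look like this:
lookup <- data.frame(award = c("Nick", "Alex", "Sam"),
                     prob = c(0.5, 0.4, 0.1))
droptable <- data.frame(qty = c(10, 12, 15, 20, 25, 50),
                        prob = c(0.1355, 0.35, 0.25, 0.15, 0.10, 0.008))
# 10,000 simulated players opening 250 packs each, card slot 1
one_card <- mySimAnyCard(observations = 10000, packs = 250,
                         lookup = lookup, droptable = droptable, cardNum = 1)
head(one_card)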
It's fine to run this multiple times, assign the output to a variable, and then rbind and order, but what I'd really like to do is feed a master function a dataframe that says which cards it needs to provision and which lookup and droptables it should pull against for each card, à la:
card lookup droptable
1 1 char1 chardrops
2 2 char1 chardrops
3 3 char2 <NA>
4 4 credits <NA>
5 5 credits creditdrops
6 6 abilityMats abilityMatDrops
7 7 abilityMats abilityMatDrops
It's probably never going to be more than 20 cards...so I'm willing to take the speed of a for loop, but I'm curious how the SO community would approach this problem.
Here's what I put together thus far:
mySimAllCards <- function(observations, packs, cards){
  full <- data.frame()
  for (i in 1:length(cards$card)) {
    tmp <- mySimAnyCard(observations, packs, cards[i, 2], cards[i, 3], i)
    full <- rbind(full, tmp)
  }
  full
}
which trips over
Error in `[.default`(lookup, , "award") : incorrect number of dimensions
I can work through the issues above, but is there a better approach to consider?
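One possible direction (a sketch, not a drop-in answer): keep the lookup and drop tables in a named list and let the master function resolve the names given in the cards dataframe. The list name tables below is my own invention:
# 'tables' is a hypothetical named list, e.g. list(char1 = ..., chardrops = ..., creditdrops = ...)
mySimAllCards <- function(observations, packs, cards, tables){
  results <- lapply(seq_len(nrow(cards)), function(i){
    lookup <- tables[[as.character(cards$lookup[i])]]
    droptable <- if (is.na(cards$droptable[i])) NULL
                 else tables[[as.character(cards$droptable[i])]]
    # NOTE: mySimAnyCard as written expects a non-NULL droptable,
    # so the <NA> rows would need their own handling there
    mySimAnyCard(observations, packs, lookup, droptable, cards$card[i])
  })
  do.call(rbind, results)   # one data frame, one block of rows per card
}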

Quickly create new columns in dataframe using lists - R

I have data containing quotations of indexes (S&P500, CAC40, ...) for every 5 minutes of the last 3 years, which makes it quite huge. I am trying to create new columns containing the performance of the index for each time (i.e. (quotation at [TIME] / quotation at yesterday's close) - 1) and for each index. I began this way (my data is named temp):
listIndexes<-list("CAC","SP","MIB") # there are a lot more
listTime<-list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
  Time <- listTime[j]
  for (i in 1:length(listIndexes)) {
    Index <- listIndexes[i]
    temp[[paste0(Index, "perf", Time)]] <- temp[[paste0(Index, Time)]] / temp[[paste0(Index, "close")]] - 1
    # other stuff to do but with the same concept
  }
}
but it is quite slow. Is there a way to get rid of the for loops or to make the creation of those variables quicker? I have read about the apply family of functions, but I do not see if and how they should be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 4001.2 4005 .... 2000 .... 2003
20140106 4005 4004 4003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more exactly, a new column for each time and each index) which would be added to temp
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
idem for the following day
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
                  stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
  this_row <- df[row_i, ]
  temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a long, tidy format with columns for Date, Index, Time, Value, and ClosingValue, and to operate directly on just the two relevant columns there.
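Here is a rough, untested sketch of that reshape route using tidyr/dplyr, assuming the reference close is the "<Index>close" column used in the loop above (swap in the "closeyesterday" column if that is the right reference):
library(dplyr)
library(tidyr)

long <- temp %>%
  pivot_longer(-date,
               names_to = c("Index", "Time"),
               names_pattern = "^(CAC|SP|MIB)(.+)$",   # extend the pattern to all your indexes
               values_to = "Value")

perf <- long %>%
  group_by(date, Index) %>%
  mutate(ClosingValue = Value[Time == "close"],
         Perf = Value / ClosingValue - 1) %>%
  ungroup()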

Subset data with dynamic conditions in R

I have a dataset of 2500 rows which are all bank loans. Each bank loan has an outstanding amount and a collateral type (real estate, machine tools, etc.).
I need to draw a random selection out of this dataset where, for example, the sum of the outstanding amounts is 2.5 million ±5% and at most 25% of the loans have the same asset class.
I found the function optim, but it asks for a function to optimize and seems built for optimizing a portfolio of stocks, which is much more complex. I would guess there is an easier way of achieving this?
I created a sample data set which could illustrate my question better:
dataset <- data.frame(balance = c(25000, 50000, 35000, 40000, 65000, 10000, 5000, 2000, 2500, 5000),
                      Collateral = c("Real estate", "Aeroplanes", "Machine tools", "Auto Vehicles", "Real estate",
                                     "Machine tools", "Office equipment", "Machine tools", "Real estate", "Auto Vehicles"))
Say, for example, I want 5 loans out of this dataset whose outstanding balances sum to 200,000 (with a 10% margin) and where no more than 40% may be of the same collateral type (so a maximum of 2 out of 5 in this example).
Please let me know if additional information is necessary.
Many thanks,
Tim
This function I made works:
pick_records <- function(df, size, bal, collat, max.it) {
  j <- 1
  while (TRUE) {
    s_index <- sample(1:nrow(df), size)   # draw a random candidate selection
    print(s_index)
    output <- df[s_index, ]
    if (sum(output$balance) < (bal * 1.1) &
        sum(output$balance) > (bal * 0.9) &
        all(table(output$Collateral) / size <= collat)) {
      return(output)                      # both constraints satisfied
    }
    print(j)
    j <- j + 1
    if (j == max.it + 1) {
      print('No solution found')
      break
    }
  }
}
> a <- pick_records(dataset,5,200000,0.4,20)
> a
balance Collateral
3 35000 Machine tools
7 5000 Office equipment
4 40000 Auto Vehicles
5 65000 Real estate
2 50000 Aeroplanes
Here df is your dataframe, size is the number of records you want, bal is the target for the total balance, collat is the maximum share per Collateral type, and max.it is the maximum number of iterations to try before reporting that no solution was found. You can change those as you please.
Let me know if you don't get any part of it.
