Subset data with dynamic conditions in R

I have a dataset of 2500 rows, all bank loans. Each loan has an outstanding amount and a collateral type (real estate, machine tools, etc.).
I need to draw a random selection out of this dataset where, for example, the sum of the outstanding amounts is 2.5 million ±5% and at most 25% of the loans share the same asset class.
I found the function optim, but it asks for an objective function and looks to be built for optimizing a portfolio of stocks, which is much more complex. Is there an easier way of achieving this?
I created a sample data set which could illustrate my question better:
dataset <- data.frame(
  balance = c(25000, 50000, 35000, 40000, 65000, 10000, 5000, 2000, 2500, 5000),
  Collateral = c("Real estate", "Aeroplanes", "Machine tools", "Auto Vehicles",
                 "Real estate", "Machine tools", "Office equipment",
                 "Machine tools", "Real estate", "Auto Vehicles")
)
Suppose I want, for example, 5 loans out of this dataset whose outstanding balances sum to 200,000 (with a 10% margin), and no more than 40% may have the same collateral type (so at most 2 out of 5 in this example).
Please let me know if additional information is necessary.
Many thanks,
Tim

This function I made works:
pick_records <- function(df, size, bal, collat, max.it) {
  for (j in seq_len(max.it)) {
    s_index <- sample(seq_len(nrow(df)), size)
    output <- df[s_index, ]
    # Accept the draw when the total balance lands within +/-10% of the
    # target and no collateral type exceeds its allowed share of the sample.
    if (sum(output$balance) < bal * 1.1 &&
        sum(output$balance) > bal * 0.9 &&
        all(table(output$Collateral) / size <= collat)) {
      return(output)
    }
  }
  print('No solution found')
}
> a <- pick_records(dataset,5,200000,0.4,20)
> a
balance Collateral
3 35000 Machine tools
7 5000 Office equipment
4 40000 Auto Vehicles
5 65000 Real estate
2 50000 Aeroplanes
Here df is your dataframe, size is the number of records you want, bal is the target total balance (with a hard-coded 10% margin), collat is the maximum share per Collateral type, and max.it is the maximum number of iterations to try before returning a 'No solution found' message. You can change those as you please.
Let me know if you don't get any part of it.
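As a hedged usage sketch for the original 2,500-loan problem: the sample size of 40 below is an assumption (the question fixes the target sum and the concentration limit, but not how many loans to draw), loans stands in for the real dataframe, and the ±5% tolerance would require changing the hard-coded 1.1/0.9 factors inside the function to 1.05/0.95.
# Hypothetical call: `loans` and size = 40 are illustrative assumptions.
selection <- pick_records(loans, size = 40, bal = 2.5e6,
                          collat = 0.25, max.it = 10000)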

Related

Create a list from a complex comparison of two lists

I am working on market transaction data where each observation contains the buyer's id and the seller's id. For each observation (i.e. each transaction), I would like to create a variable equal to the number of other transactions the associated seller has done with a different buyer than the one involved in this transaction. As a consequence, given the following data:
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched existing answers to similar problems, usually involving the function mapply(), and tried to implement them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If yes, then you might want to change your reproducible example to use c() instead of list() when you construct your data.frame.
data <- data.frame(Buyer_id = c("001","001","002","001"),
                   Seller_id = c("021","022","022","021"))
data$n <- NA

for (i in seq_len(nrow(data))) {
  seller <- as.character(data[i, "Seller_id"])
  buyer <- as.character(data[i, "Buyer_id"])
  # All buyers that traded with this seller, without duplicates
  with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
  with.buyers <- unique(with.buyers)
  # Count the buyers other than the one in the current transaction
  diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
  data[i, "n"] <- length(diff.buyers)
}
Buyer_id Seller_id n
1 001 021 0
2 001 022 1
3 002 022 1
4 001 021 0
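The loop above also fits in a single sapply call; this is just a compact rewrite of the same logic, not a different algorithm:
# For each row, count the distinct other buyers of that row's seller.
data$n <- sapply(seq_len(nrow(data)), function(i) {
  length(setdiff(data$Buyer_id[data$Seller_id == data$Seller_id[i]],
                 data$Buyer_id[i]))
})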
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)
data <- data.frame(Seller_id = c("021","022","022","021"),
                   Buyer_id = c("001","001","002","001"),
                   stringsAsFactors = FALSE)
my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))
# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))
V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))
data$n <- apply(data, MARGIN = 1, FUN = function(transaction) {
  node.out.degree <- V(my.graph)$out.degree[V(my.graph)$name == transaction["Seller_id"]]
  if (node.out.degree <= 1) {
    # An out degree of at most 1 means this seller has only one distinct
    # buyer: the one in the current transaction.
    return(0)
  } else {
    # The seller trades with at least one other buyer, so we simply
    # subtract one (the current transaction's buyer) from the out degree.
    return(node.out.degree - 1)
  }
})
data
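As a side note, the per-row vertex matching above can be avoided: degree() returns a vector named by vertex, so the seller ids can index it directly. A minimal sketch, assuming the simplified graph from above:
out.deg <- degree(my.graph, mode = "out")
# Subtract one for the buyer in the current transaction itself.
data$n <- unname(out.deg[data$Seller_id] - 1)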

Go through an R dataframe and increase the salary at certain conditions

I have this sample of my dataframe (df):
age salary
1 25 20000
2 35 22000
3 31 23500
4 24 19200
5 27 27900
6 32 31010
I want to increase the salary by 11% for people who are aged above 30 and their salary is not the maximum salary in the table. I wrote this loop:
for (row in df) {
  if (row$age > 30 & row$salary != max(df$salary)) {
    row$salary = row$salary * 0.11
  }
}
but I get salaries lower than those posted rather than an increase.
Would really appreciate any help.
Here's one way without explicit ifelse. It should be one of the fastest ways to do this:
df$new_salary <- with(df, salary + 0.11 * salary * (age > 30) * (salary != max(salary)))
The reason you're experiencing problems is that for (row in df) iterates over the columns of df, not its rows, and multiplying by 0.11 replaces a salary with 11% of its value instead of raising it by 11%. Try this vectorized assignment instead:
k <- max(df$salary)
df[df$age > 30 & df$salary < k, 'salary'] <- df[df$age > 30 & df$salary < k, 'salary'] * 1.11
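For comparison, a minimal ifelse version of the same update, equivalent to the two answers above:
# ifelse() keeps the old salary wherever the condition is FALSE.
df$salary <- ifelse(df$age > 30 & df$salary != max(df$salary),
                    df$salary * 1.11, df$salary)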

Need advice on how to abstract my simulator for opening collectible card packs

I've been building simulators in Excel with VBA to understand the distribution of outcomes a player may experience as they open collectible card packs. These were largely built with nested for loops and, as you can imagine, were slow as molasses.
I've been spinning up on R over the last couple of months and have come up with a function that handles a particular definition of a pack (i.e., two cards with particular drop rates for n characters on either card). Now I am trying to abstract the function so that it can take any number of cards of whatever type you want to throw at it (i.e., currency, gear, materials, etc.).
What this simulator is basically doing is saying "I want to watch 10,000 people open up 250 packs of 2 cards"; I then perform some analysis on the generated results to answer questions like "How many $ will you need to spend to acquire character x?" or "What's the distribution of outcomes for getting x, y or z pieces of a character?"
Here's my generic function and then I'll provide some inputs that the function operates on:
mySimAnyCard <- function(observations, packs, lookup, droptable, cardNum) {
  obvs <- rep(1:observations, each = packs)
  pks <- rep(1:packs, times = observations)
  # Weighted draw when the lookup table carries probabilities,
  # uniform draw over the distinct awards otherwise.
  if ("prob" %in% colnames(lookup)) {
    awrd <- sample(lookup[, "award"], length(obvs), replace = TRUE, prob = lookup[, "prob"])
  } else {
    awrd <- sample(unique(lookup[, "award"]), length(obvs), replace = TRUE)
  }
  qty <- sample(droptable[, "qty"], length(obvs), prob = droptable[, "prob"], replace = TRUE)
  data.frame(observation = obvs, pack = pks, card = cardNum, award = awrd, quantity = qty)
}
observations and packs are integers.
lookup takes a dataframe:
award prob
1 Nick 0.5
2 Alex 0.4
3 Sam 0.1
and droptable takes a similar dataframe:
qty prob
1 10 0.1355
2 12 0.3500
3 15 0.2500
4 20 0.1500
5 25 0.1000
6 50 0.0080
... continued
cardNum also takes an integer.
It's fine to run this multiple times, assign the output to a variable, and then rbind and order, but what I'd really like to do is feed a master function a dataframe that tells it which cards to provision and which lookup and droptable to use for each card, a la:
card lookup droptable
1 1 char1 chardrops
2 2 char1 chardrops
3 3 char2 <NA>
4 4 credits <NA>
5 5 credits creditdrops
6 6 abilityMats abilityMatDrops
7 7 abilityMats abilityMatDrops
It's probably never going to be more than 20 cards...so I'm willing to take the speed of a for loop, but I'm curious how the SO community would approach this problem.
Here's what I put together thus far:
mySimAllCards <- function(observations, packs, cards){
  full <- data.frame()
  for (i in i:length(cards$card)) {
    tmp <- mySimAnyCard(observations, packs, cards[i, 2], cards[i, 3], i)
    full <- rbind(full, tmp)
  }
}
which trips over
Error in `[.default`(lookup, , "award") : incorrect number of dimensions
I can work through the issues above, but is there a better approach to consider?
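No answer is recorded here, but one hedged sketch of the master-function idea: keep each card's actual lookup and droptable objects in a list instead of storing their names in a dataframe (passing a column value where a table is expected is what produces the dimension error above), iterate with lapply, and bind once at the end. The card.specs structure and the do.call(rbind, ...) step are illustrative assumptions, not the asker's code:
# Assumed structure: one entry per card holding the real data frames, e.g.
card.specs <- list(
  list(lookup = char1, droptable = chardrops),
  list(lookup = char1, droptable = chardrops),
  list(lookup = abilityMats, droptable = abilityMatDrops)
)

mySimAllCards <- function(observations, packs, card.specs) {
  sims <- lapply(seq_along(card.specs), function(i) {
    mySimAnyCard(observations, packs,
                 card.specs[[i]]$lookup, card.specs[[i]]$droptable, i)
  })
  do.call(rbind, sims)  # one rbind instead of growing a dataframe in a loop
}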

Optimizing Speed for Populating a Matrix

I am trying to fill in a large matrix (55920484 elements) in R that will eventually be symmetric, so I am actually only performing calculations for half of the matrix. The resulting values matrix is square and has the same row and column names. Each value in the matrix is the result of comparing unique lists and counting the number of intersections. The data come from a larger dataframe (427.5 Mb). Here is my fastest solution so far; I am trying to get rid of the loops, which I know are slow:
for (i in 1:length(rownames(values))) {
  for (j in i:length(colnames(values))) {
    A = data[data$Stock == rownames(values)[i], "Fund"]
    B = data[data$Stock == colnames(values)[j], "Fund"]
    values[i, j] = length(intersect(A, B))
  }
}
I have tried several other approaches such as using a database with an SQL connection, using a sparse matrix with 0s and 1s, and using the sqldf package in R.
Here is the structure of my data:
head(data)
Fund Stock Type Shares.Held Maket.Value X..of.Portfolio Rank Change.in.Shares X..Change X..Ownership
1 12 WEST CAPITAL MANAGEMENT LP GRUB CALL 500000 12100000 0.0173 12 500000 New N/A
2 12 WEST CAPITAL MANAGEMENT LP FIVE SH 214521 6886000 0.0099 15 214521 New 0
3 12 WEST CAPITAL MANAGEMENT LP SHAK SH 314114 12439000 0.0178 11 307114 4387 1
4 12 WEST CAPITAL MANAGEMENT LP FRSH SH 324120 3650000 0.0053 16 -175880 -35 2
5 12 WEST CAPITAL MANAGEMENT LP ATRA SH 393700 10398000 0.0149 14 162003 69 1
6 12 WEST CAPITAL MANAGEMENT LP ALNY SH 651000 61285000 0.0875 4 No Change 0 1
I see three problems, in order of increasing importance:
(1) You call rownames(values) and colnames(values) many times, instead of calling them just once outside of the loops. It may or may not help.
(2) You calculate A = data[data$Stock==rownames(values)[i],"Fund"] inside the innermost loop, while you should calculate it outside of that loop.
(3) Most important: your code uses only two columns of your table, Fund and Stock. I see that in your data there are many rows where both Fund and Stock are the same. You should eliminate this redundancy. Maybe you want to create data1 = data[, c("Fund","Stock")] and eliminate redundant rows in data1 (without a loop):
data1 = data1[order(data1[, "Fund"], data1[, "Stock"]), ]
len = nrow(data1)
good = c(TRUE, data1[-len, 1] != data1[-1, 1] | data1[-len, 2] != data1[-1, 2])
data1 = data1[good, ]
(I did not test the code above)
Maybe you want to go further and create the list, which, for each fund, specifies what stocks it contains, without redundancies.
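A hedged one-liner for that per-fund list, assuming the deduplicated data1 from above: split() groups the Stock column by Fund in a single pass.
# Named list with one vector of stocks per fund, no loop required.
fund.stocks <- split(data1[, "Stock"], data1[, "Fund"])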
PS: you can still create the list which, for each stock, specifies what funds have it:
rv = rownames(values)
len = length(rv)
fund.list = list()
for (i in 1:len)
  fund.list[[i]] = data[data$Stock == rv[i], "Fund"]

for (i in 1:len) {
  A = fund.list[[i]]
  for (j in i:len) {
    values[i, j] = length(intersect(A, fund.list[[j]]))
  }
}
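Going a step further, the double loop can be replaced by a single matrix product. This is a sketch built on the deduplicated (Fund, Stock) pairs, not part of the original answer: for a 0/1 fund-by-stock incidence matrix, entry (i, j) of its crossproduct is exactly the number of funds holding both stock i and stock j.
# Rows are funds, columns are stocks, entry TRUE if the fund holds the stock.
inc <- table(data1[, "Fund"], data1[, "Stock"]) > 0
# crossprod(inc) = t(inc) %*% inc: shared-fund counts for every stock pair.
values <- crossprod(inc)
If memory is a concern, xtabs(~ Fund + Stock, data1, sparse = TRUE) builds the same incidence structure as a sparse Matrix object.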

In R, iterating over two datasets and getting back results without looping

I have two datasets, Transaction_long and Transaction_short. Transaction_long has many quotes of policy and price, with a purchase point (denoted by true) in the dataset. Transaction_short has only the entries of the purchase points.
My objective is to add a column called Policy_Change_Frequency to the Transaction_short dataset. For every customer in the short dataset, iterate over that customer's rows in the long dataset and calculate how many times the policy changed.
To find the policy changes I can use sum(diff(Transaction_Long$policy) != 0), but I am not sure how to iterate over these two datasets and get the results.
Details:
Customer_Name : name of customer
Customer_ID: Customer Identifier number
Purchased: Boolean variable (Yes = 1, No = 0)
Policy: Categorical (takes values 1-5)
Price : Price quoted
Data set 1: Transaction_Long
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627
Data set 2: Transaction_Short
Customer_Name , Customer_ID,Purchased,Policy, Price
Joe,101,1,3,501
Mary,103,1,2,627
I need to add a Policy_Change_Freq column to the Transaction_Short dataset, so my final dataset should look like this:
Customer_Name , Customer_ID,Purchased, Policy, Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2
Consider a calculated column for policy change that tags a change from the previous row within each customer with a one, then aggregate the ones into a count. merge is used because two aggregations are needed (the final row for each customer and the PolicyChanged count):
Transaction_Long$PolicyChangedFreq <- sapply(1:nrow(Transaction_Long), function(i) {
  if (i > 1) {
    ifelse(Transaction_Long$Policy[i-1] == Transaction_Long$Policy[i], 0,
           ifelse(Transaction_Long$Customer_ID[i-1] != Transaction_Long$Customer_ID[i],
                  0, 1))
  } else {
    0
  }
})
Transaction_Final <- merge(
  aggregate(. ~ Customer_ID + Customer_Name, Transaction_Long[, c(1:5)],
            FUN = tail, n = 1),
  aggregate(. ~ Customer_ID + Customer_Name, Transaction_Long[, c(1:2, 6)],
            FUN = sum),
  by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
# Customer_ID Customer_Name Purchased Policy Price PolicyChangedFreq
#1 101 Joe 1 3 501 3
#2 103 Mary 1 2 627 2
@Parfait Thank you for the solution. I solved this using the sqldf package in R:
library(sqldf)
for (i in 1:nrow(Transaction_short)) {
  sql <- sprintf("SELECT policy FROM Transaction_long WHERE customer_ID = %s",
                 Transaction_short$customer_ID[i])
  df <- sqldf(sql)
  NF <- sum(df$policy[-1] != df$policy[-length(df$policy)])
  Transaction_short$Policy_Change_Freq[i] <- NF
}
Since I have about 500K rows in the long dataset and about 100K in the short dataset, this is taking a while. Is there another solution that does not require loops? Thank you.
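One loop-free sketch, assuming the rows of Transaction_long are in chronological order within each customer (as in the sample data above): count the policy changes per customer in a single pass with tapply, then merge the counts onto the short dataset. The column names follow the sample data.
# Policy changes per customer, computed once over the whole long table.
chg <- tapply(Transaction_long$Policy, Transaction_long$Customer_ID,
              function(p) sum(diff(p) != 0))
# Assumes numeric customer ids, as in the sample data.
freq <- data.frame(Customer_ID = as.numeric(names(chg)),
                   Policy_Change_Freq = as.vector(chg))
Transaction_short <- merge(Transaction_short, freq, by = "Customer_ID")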
