Looped Equation on different subset of data R - r

I am trying to set up an earning pattern on some data. I'm doing this by creating an 'Earned_Multiplier' variable which I can then use to multiply on whatever other variable necessary later on. Where the 'Earned_Duration' is >0 and <= 30, the Earned_Multiplier should be equal to ((Earned_Duration/30)*0.347), where the 'Earned_Duration' is >30 and <=60, the Earned_Multiplier should be equal to (0.347+((Earned_Duration/30)*0.16)), and so on.
I'm hoping the below should make sense given the above description. Unfortunately I am getting the error message "NAs are not allowed in subscripted assignments". I feel like this is likely because I need to be using a loop to do the calculation?
Could anyone help direct me as to how to build this loop and make sure it does the right calculation for each different subset?
Output_All$Earned_Multiplier <- 1
Output_All$Earned_Multiplier[Output_All$Earned_Duration == 0] <- 0
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 0) &
(Output_All$Earned_Duration <= 30)] <- 0+
((Output_All$Earned_Duration/30)*.347) # Month 1
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 30) &
(Output_All$Earned_Duration <= 60)] <- .347+(((Output_All$Earned_Duration-
30)/30)*.16) # Month 2
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 60) &
(Output_All$Earned_Duration <= 90)] <- .507+(((Output_All$Earned_Duration-
60)/30)*.085) # Month 3

It would be helpful to provide a dummy dataset so we could work on that. You probably have some NAs in your dataset causing that error.
In any case, using the dplyr library you could do an ifelse statement along with a mutate to create a new column with your calculation result:
library(dplyr)
Output_All <- Output_All %>% mutate(Earned_Multiplier = ifelse(Earned_Duration == 0, 0,
  ifelse(Earned_Duration > 0 & Earned_Duration <= 30, (Earned_Duration / 30) * 0.347, # Month 1
  ifelse(Earned_Duration > 30 & Earned_Duration <= 60, 0.347 + ((Earned_Duration - 30) / 30) * 0.16, # Month 2
  1 # final else, used if none of the above is met (1 is the default in your code); or continue with more ifelse statements
))))
Regarding the NAs:
If you do have NAs and they are causing you issues, depending on your preference, you can include this as part of your logical statements:
!is.na(Earned_Duration) # don't forget to add & if you add it as a condition
to make sure that NAs are disregarded.
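As a minimal base R sketch of the same idea: the error in the question most likely comes from NAs in the logical index combined with a full-length right-hand side. The toy data frame below is hypothetical (it only stands in for Output_All), and the masks m1 to m3 are names introduced for illustration. Each mask excludes NAs, and the same mask is applied on both sides of the assignment so the lengths match.

```r
# Hypothetical toy data standing in for Output_All
Output_All <- data.frame(Earned_Duration = c(0, 15, 30, 45, 90, NA))

d <- Output_All$Earned_Duration
mult <- rep(1, length(d))          # default multiplier, as in the question
mult[is.na(d)] <- NA               # keep missing durations missing

# Logical masks that are never NA, one per earning band
m0 <- !is.na(d) & d == 0
m1 <- !is.na(d) & d > 0  & d <= 30
m2 <- !is.na(d) & d > 30 & d <= 60
m3 <- !is.na(d) & d > 60 & d <= 90

# Subset the right-hand side with the same mask as the left-hand side
mult[m0] <- 0
mult[m1] <- (d[m1] / 30) * 0.347                  # Month 1
mult[m2] <- 0.347 + ((d[m2] - 30) / 30) * 0.16    # Month 2
mult[m3] <- 0.507 + ((d[m3] - 60) / 30) * 0.085   # Month 3

Output_All$Earned_Multiplier <- mult
```

No loop is needed here, because each assignment is vectorised over the rows selected by its mask.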

Related

R - Looping with while always results in missing value where TRUE/FALSE is expected

EDIT: I implemented offered solutions so far, and the code looks way cleaner now. This was the key to finally finding my error. It was a logical condition that I didn't check within the while loop. It could happen that the iterator would exceed the number of elements in the vector and thus pass a "NA" to the while condition! Thx
I also changed the solution to use vector assignments to store the results and then recombine after the for loop, as vector indexing seems to be way faster than data.table indexing and value assignment within the loop.
Pls let me apologize first for any errors and lack of information for troubleshooting my problem, as this is my first post. I have already read that this can happen accidentally whenever there is an error in a computation and the value of a condition evaluates to a missing value (NA), such as
if (TRUE & sqrt(-1))
It's been days and I am still receiving this error. It really gives me a headache, as the inherent logic behind the code is actually pretty straightforward and I still can't properly formalize it. It goes as follows: for each unique bond ID contained in a vector of size N (loop through with i), compare the static value of its corresponding maturity to the end dates of 7 periods, each with a distinct set of rules (loop through with k), to determine which periods the respective issue falls into. Then loop through all the periods' size thresholds (loop through with l) to find out whether a particular issue has violated these minimum size requirements. If a violation is found, I can assign the date of the violation. If (l == k), I can conclude that the issue has passed the size requirement checks for all periods that its maturity falls into, and as such hasn't violated any rules. I then assign the result of the conditional checks as binary values in a new data.table column, together with the violation date. So far, I really can't determine what is causing this error.
My data looks as follows. I have a pretty large data.table containing bond issue identifiers and various other column variables that describe those issues. It was initially imported with the read_dta() function and then transformed to a data.table with setDT().
I extract 3 columns out of this data.table, using
issue_IDs.vec <- as.numeric(issues.dt[[2]])
maturity.vec <- as.Date(issues.dt[[8]], "%Y-%m-%d")
offerings_atm.vec <- as.numeric(issues.dt[[33]])
Next, I transform the eligibility criteria of an index as follows.
# (1) Creating size requirement end periods (valid thru) ----
size_req_per_1 <- as.Date("1992-01-01", "%Y-%m-%d")
size_req_per_2 <- as.Date("1994-01-01", "%Y-%m-%d")
size_req_per_3 <- as.Date("1999-07-01", "%Y-%m-%d")
size_req_per_4 <- as.Date("2003-10-01", "%Y-%m-%d")
size_req_per_5 <- as.Date("2004-07-01", "%Y-%m-%d")
size_req_per_6 <- as.Date("2017-02-01", "%Y-%m-%d")
size_req_per_7 <- as.Date("2021-02-01", "%Y-%m-%d")
size_req_val_per.vec <- c(size_req_per_1, size_req_per_2, size_req_per_3, size_req_per_4,
size_req_per_5, size_req_per_6, size_req_per_7)
# (2) Create a size requirement threshold per rules' validity period ----
size_req_thresh_1 <- 25000
size_req_thresh_2 <- 50000
size_req_thresh_3 <- 100000
size_req_thresh_4 <- 150000
size_req_thresh_5 <- 200000
size_req_thresh_6 <- 250000
size_req_thresh_7 <- 300000
size_req_thresh.vec <- c(size_req_thresh_1, size_req_thresh_2, size_req_thresh_3,
size_req_thresh_4, size_req_thresh_5, size_req_thresh_6,
size_req_thresh_7)
Next, I write a loop that performs conditional checks to find, for each issue ID stored in issue_IDs.vec, whether it violates the index eligibility criterion of the minimum issuance size during its maturity. I do this by passing the value of the iterator variable i as a position value to issue_IDs.vec.
# (3) Looping through a set of conditional checks to find out if, and if so when, a particular issue violated the size requirement ----
# Iterator variables ----
# Length of issues.dt
j <- issues.dt[, .N]
# Main iterator looping through all entries of issues.dt extracted as vector
i <- 1
# Looping through vector elements of issue rules (vec. 1: validity periods)
k <- 1
# Looping through vector elements of issue rules (vec. 2: size thresholds)
l <- 1
# Loop
for (i in 1:j) {
id <- issue_IDs.vec[i]
maturity <- maturity.vec[i]
offering_atm <- offerings_atm.vec[i]
k <- 1
maturity_comp <- size_req_val_per.vec[k]
while (maturity >= maturity_comp) {
if (k < 7) {
k <- k + 1
maturity_comp <- size_req_val_per.vec[k]
} else {
break
}
}
l <- 1
offering_size_comp <- size_req_thresh.vec[l]
for (l in 1:k) {
if (offering_atm >= offering_size_comp) {
offering_size_comp <- size_req_thresh.vec[l]
next
} else {}
}
if (l == k) {
issues.dt[ISSUE_ID == id,
`:=`(SIZE_REQ_VIOLATION = 0,
SIZE_REQ_VIOLATION_DATE = NA)]
} else {
issues.dt[ISSUE_ID == id,
`:=`(SIZE_REQ_VIOLATION = 1,
SIZE_REQ_VIOLATION_DATE = size_req_val_per.vec[l])]
}
i <- i + 1
}
Whenever I try running the code in a simplified version, such as
k <- 1
for (l in 1:7) {
print(maturity >= maturity_comp)
k <- k + 1
maturity_comp <- format(as.Date(size_req_val_per.vec[k]), "%Y-%m-%d")
}
the code runs smoothly and always results in the printed evaluations TRUE or FALSE, depending on which ID I initially use to create the corresponding static maturity of the particular bond issue. At this stage, I have already exhausted my troubleshooting skills.
I'd appreciate any input from you guys, and if you need any additional information, explanations etc. just let me know.
I think the answer lies in Gregor's comment. The way you are formatting your dates converts them to character variables. Here's a quick example:
Exmpl <- as.Date("08-25-2020", "%m-%d-%Y")
class(Exmpl)
# [1] "Date"
## Not your preferred format, but it is a Date variable ##
Exmpl
# [1] "2020-08-25"
## Formatting changes it to a character ##
Exmpl2 <- format(as.Date(Exmpl), "%m-%d-%Y")
class(Exmpl2)
# [1] "character"
When you use them in the while() condition, R tries to make a comparison to decide whether the condition (i.e., maturity is greater than or equal to maturity_comp) is TRUE or FALSE (a logical value). Because format() has turned your dates into character variables, R cannot reliably make this comparison.
I think your code will work if you don't format the dates, but simply read them in and leave them as Date objects in the YYYY-mm-dd format.
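The asker's edit points to a second cause: an index that runs past the end of the vector hands NA to the while condition. A minimal sketch of that failure and one possible guard follows. The maturity value is hypothetical, only the first three period end dates from the question are used, and the findInterval() line is offered as a loop-free alternative rather than the asker's actual fix.

```r
# Period end dates taken from the question (first three only, for brevity)
periods <- as.Date(c("1992-01-01", "1994-01-01", "1999-07-01"))
maturity <- as.Date("2005-06-30")  # hypothetical maturity later than every period end

# Indexing past the end of a vector yields NA, and if/while cannot branch on NA
past_end <- periods[length(periods) + 1]
comparison <- maturity >= past_end

# Guarded loop: && short-circuits, so periods[k] is not compared once k is out of range
k <- 1
while (k <= length(periods) && maturity >= periods[k]) {
  k <- k + 1
}

# Loop-free equivalent: count the period ends that are <= maturity, then add 1
k_vec <- findInterval(as.numeric(maturity), as.numeric(periods)) + 1
```

Both k and k_vec point one position past the last period end that the maturity has reached, so they agree whether or not the maturity lies beyond the final period.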

Multiple if statements to define what filters are applied and ignored in R

I've searched various posts related to this question and still cannot find a solution.
I'm trying to apply multiple filters on a dataframe using if() statements, so that if a condition is met the individual filter inside {} is included in the set of filters, and if the condition is not met that particular filter is ignored. Consider the following base code, which works perfectly although it doesn't include the if statements yet.
library(tidyverse)
library(purrr)
library(dplyr)
Cb1 <- 0.75
Cb2 <- 1.0
Cb3 <- 20
Cb4 <- 8
test <- map(metrics, ~filter(.,
industry == "Technology",
price <= (median(price, na.rm = TRUE)) * Cb1,
ror >= (median(ror, na.rm = TRUE)) * Cb2,
debt <= Cb3,
periods >= Cb4,
price >= 0))
The code references a list of 3 dataframes titled "metrics" that I'm not including in the code for simplicity (which means that you won't be able to test this code). But it includes numerous columns of data, some of which include: industry, price, debt, periods, and ror. Which are the columns that I'm interested in filtering. Note that this includes a map function because it is performing the filtering on the 3 dataframes included in the list "metrics", but that is not important in my question.
I would like to add some if statements that check if the constants Cb are not equal to 0, then the filters inside {} are included in the set of multiple filters. But if the constants Cb are equal to 0, then this particular filter is excluded in the multiple set of filters (and the other filters still may apply). Through some research I thought that maybe using "else" after the {} and simply having nothing after "else" will achieve what I'm looking for. But it does not work. This is the code that I've tried using if statements inside the filter function, but again this does not work. To clarify, the code runs without errors, but the filters do not work properly.
Cb1 <- 0.75
Cb2 <- 0
Cb3 <- 20
Cb4 <- 8
test <- map(metrics, ~filter(.,
industry == "Technology",
if(Cb1 != 0) {price <= (median(price, na.rm = TRUE)) * Cb1} else
if(Cb2 != 0) {ror >= (median(ror, na.rm = TRUE)) * Cb2} else
if(Cb3 != 0) {debt <= Cb3} else
if(Cb4 != 0) {periods >= Cb4} else
price >= 0))
I feel like this is something relatively simple, like the syntax of where I place my {}, (), or even the commas. Or maybe it's some combination of boolean operators (& or |). But I can't seem to get it to work. Note that there are also some filters that don't include an if check, as I want them to always apply.
I would normally try to include an expected outcome, but that is difficult since this is a set of filters on a dataframe that I cannot include. I'm hoping that someone can help with my syntax in the if statement code and that will solve the problem.
Any help is appreciated!
The filter() function expects each of its inputs to be a logical vector, equal in length to the number of rows of the data frame, or 1. "Excluding" the filter might mean the filter always resolves to TRUE, leaving all rows in place.
So you could change your statement to this (note the comma at the end):
if (Cb1 != 0) { price <= (median(price, na.rm = TRUE)) * Cb1 } else TRUE,
I have a comment though: What you are doing here feels a bit counter-intuitive. Perhaps if you could describe better why you are trying to do this, there might be a more dplyr-like solution.
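A minimal sketch of the "else TRUE" pattern on a single data frame. The toy data frame below is hypothetical and only stands in for one element of metrics; Cb1 and Cb3 follow the question's naming. Each condition sits in its own comma-separated argument, so the conditions are combined with AND rather than chained through else.

```r
library(dplyr)

# Hypothetical toy data standing in for one element of `metrics`
toy <- data.frame(
  industry = c("Technology", "Technology", "Technology", "Energy"),
  price    = c(10, 20, 30, 5),
  ror      = c(0.1, 0.2, 0.3, 0.4),
  debt     = c(5, 25, 10, 1)
)

Cb1 <- 0    # 0 switches the price filter off
Cb3 <- 20   # non-zero keeps the debt filter on

result <- toy %>% filter(
  industry == "Technology",                                            # always applied
  if (Cb1 != 0) price <= median(price, na.rm = TRUE) * Cb1 else TRUE,  # skipped: resolves to TRUE
  if (Cb3 != 0) debt <= Cb3 else TRUE                                  # applied: debt <= 20
)
```

With these values only the industry and debt conditions remove rows; the price condition resolves to a single TRUE and leaves every row in place.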

Removing dataframe outliers in R with `boxplot.stats`

I'm relatively new at R, so please bear with me.
I'm using the Ames dataset (full description of dataset here; link to dataset download here).
I'm trying to create a subset data frame that will allow me to run a linear regression analysis, and I'm trying to remove the outliers using the boxplot.stats function. I created a frame that will include my samples using the following code:
regressionFrame <- data.frame(subset(ames_housing_data[,c('SalePrice','GrLivArea','LotArea')] , BldgType == '1Fam'))
My next objective was to remove the outliers, so I tried to subset using a which() function:
regressionFrame <- regressionFrame[which(regressionFrame$GrLivArea != boxplot.stats(regressionFrame$GrLivArea)$out),]
Unfortunately, that produced the
longer object length is not a multiple of shorter object length
error. Does anyone know a better way to approach this, ideally using the which() subsetting function? I'm assuming it would include some form of lapply(), but for the life of me I can't figure out how. (I figure I can always learn fancier methods later, but this is the one I'm going for right now since I already understand it.)
Nice use with boxplot.stats.
You cannot test SAFELY using != if boxplot.stats returns more than one outlier in $out. An analogy here is 1:5 != 1:3. You probably want to try !(1:5 %in% 1:3).
regressionFrame <- subset(regressionFrame,
subset = !(GrLivArea %in% boxplot.stats(GrLivArea)$out))
What I mean by SAFELY is that 1:5 != 1:3 gives a wrong result with a warning, but 1:6 != 1:3 gives a wrong result without a warning. The warning is related to the recycling rule. In the latter case, 1:3 can be recycled to have the same length as 1:6 (that is, the length of 1:6 is a multiple of the length of 1:3), so you will be testing with 1:6 != c(1:3, 1:3).
A simple example.
x <- c(1:10/10, 101, 102, 103) ## has three outliers: 101, 102 and 103
out <- boxplot.stats(x)$out ## `boxplot.stats` has picked them out
x[x != out] ## this gives a warning and wrong result
x[!(x %in% out)] ## this removes them from x
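Since the question asks for a which()-based version, here is a minimal sketch of one. It reuses the toy vector above; y is an additional hypothetical vector without outliers. It also shows why negative indexing with which() is worth avoiding: when there are no outliers, which() returns an empty vector, and x[-integer(0)] selects nothing rather than everything.

```r
x <- c(1:10 / 10, 101, 102, 103)   # toy vector from the example above
out <- boxplot.stats(x)$out

# Safe: ask which() for the positions to keep
keep <- which(!(x %in% out))
x_clean <- x[keep]

# Pitfall: with no outliers, dropping by negative index returns an empty vector
y <- 1:10                          # hypothetical data without outliers
y_out <- boxplot.stats(y)$out
y_wrong <- y[-which(y %in% y_out)]
y_right <- y[which(!(y %in% y_out))]
```

For a data frame the same index can be used for rows, for example regressionFrame[keep, ] with keep computed from the GrLivArea column.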

Matching negative and positive values using For Loop in R

This is my first post so I hope it is not too elementary. I am trying to match observations which have a negative Amount to counterparts that have a positive Amount and an equal abs(Amount). Furthermore, I want to check that the Amounts are both from the same Account. To do this, I am trying to use a for loop, but am getting the following error: "Operations are possibly only for numeric, logical or complex types." This is my code so far:
for(i in 1:nrow(data)){
for(j in 1:nrow(data)){
if ((data$Amount[i]=abs(data$Amount[j]))&(data$Amount[i]!=data$Amount[j])&(data$Account[i]=data$Account[j]))
{data$debit[i]<-1}}}
Does anyone have any idea why this is happening, or know of a better way using the Apply function family? Thank you in advance!
EDIT:
Below is a toy data set to illustrate this example. For instance, on this data set, I want to create an indicator variable which would be 0 except for ID=3, because for that observation 4.7=abs(-4.7) and "abc1"="abc1".
Data <- " ID Amount Account
1 5.0 abc1
2 -5.0 abc9
3 4.7 abc1
4 4.6 abc7
5 5.0 abc8
6 -4.7 abc1 "
Here's an alternative method of achieving the same result with a lot less code (and I think it's easier to read too)
library(dplyr)
Data <- Data %>%
group_by(Account) %>%
mutate(
debit = (Amount > 0 & -Amount %in% unique(Amount)) * 1
) %>%
ungroup()
If you aren't familiar with the pipe operator (%>%), it allows us to avoid nesting a lot of functions inside one another. It works by taking the output of the previous function, and entering it as the first argument of the next function. So this code takes the data set (Data), groups it by the Account, adds a new column with the indicator variable with the desired criterion, and then ungroups the data so it's back to its normal format.
The looping is done within these function calls, which allows them to be implemented in compiled languages (usually C++) - which can be a lot faster than R.
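The same grouped check can be sketched in base R with ave(), assuming the toy data from the question is first read into a data frame (the question only shows it as a string). The use of read.table() here is an assumption about how the data would be loaded.

```r
# Toy data from the question, read into a data frame
Data <- read.table(text = "ID Amount Account
1 5.0 abc1
2 -5.0 abc9
3 4.7 abc1
4 4.6 abc7
5 5.0 abc8
6 -4.7 abc1", header = TRUE, stringsAsFactors = FALSE)

# Within each Account, flag positive amounts whose negative counterpart also occurs
Data$debit <- ave(Data$Amount, Data$Account,
                  FUN = function(a) as.numeric(a > 0 & (-a) %in% a))
```

Only ID 3 is flagged: ID 1 has a matching -5.0, but that row belongs to a different account.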
You need to use the == operator (= is an assignment operator) and the && rather than the & operator for your logical condition:
## Assignment (incorrect in this case!)
1 = 1
# Error in 1 = 1 : invalid (do_set) left-hand side to assignment
a <- 1
a = a
Note that with a = a no logical check is performed (it is just the equivalent of a <- a; see more here).
## Checking equivalence (returns a logical)
1 == 1
# [1] TRUE
a == a
# [1] TRUE
For the difference between & and &&: & compares element by element and returns a logical vector, while && evaluates a single overall condition and returns one TRUE or FALSE, which is what if() expects (see here).
Also, it might be more elegant to check whether the sum of data$Amount[i] and data$Amount[j] is zero rather than to check whether they have the same absolute value but not the same signed value.
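A small sketch of how the two operators behave, using made-up values; the variable names are introduced only for illustration.

```r
# & compares element by element and returns a vector
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)

# && expects single values and stops as soon as the result is known,
# so the right-hand side is never evaluated here
short_circuit <- FALSE && stop("never evaluated")

# == tests equality, which is what an if() condition needs
same_size <- abs(-4.7) == 4.7
```

Inside if(), a single TRUE or FALSE is required, which is why && is the usual choice there.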
## Your example
for(i in 1:nrow(data)){
for(j in 1:nrow(data)){
if ( (sum(c(data$Amount[i], data$Amount[j])) == 0) && (data$Account[i] == data$Account[j]) ) {
data$debit[i]<-1
}
}
}

Using mapply() in R over rows, vs. columns

I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables - the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores,gender,age) {
if(gender=="Male") {
if(8 <= age & age <= 11) {convertvec <- c(1:12)}
if(12 <= age & age <= 14) {convertvec <- c(13:24)}
}
else if(gender=="Female") {
if(8 <= age & age <= 11) {convertvec <- c(25:36)}
if(12 <= age & age <= 14) {convertvec <- c(37:48)}
}
converted_scores <- rep(0,12)
for(z in 1:12) {
converted_scores[z] <- conversion_table[(unlist(rawscores)+1)[z],
convertvec[z]]
}
rm(z)
return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[,21:32] <- 0
for(x in 1:dim(mydata)[1]) {
mydata[x,21:32] <- converter(mydata[x,7:18],
mydata[x,"gender"],
mydata[x,"age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, like as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach, as it does not iterate through the vectors passed to subsequent arguments, but rather tries to take them as the argument as a whole. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows, instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite 'converter' so that it takes vectors of gender, age, and a row index which you then use to do lookups and assignments to converted_scores using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply since it will convert all its x arguments to "character" class because of the gender class being "character". It wasn't clear whether your code normdf[ rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
cvt.arr <- matrix(1:48, nrow=4, byrow=TRUE)[c(1,3,2,4), ] # reorder rows so the genders alternate
converter <- function(idx, gender, age) {
gidx <- match(gender, c("Male", "Female") )
aidx <- findInterval(age, c(8,12,15) )
ag.idx <- gidx + 2*(aidx - 1)
# the multiplier 2 is the number of gender categories; rows of cvt.arr run
# male 8-11, female 8-11, male 12-14, female 12-14
cvt <- cvt.arr[ ag.idx, ]
raw <- unlist(mydata[idx, 7:18]) # assumes the raw scores are in columns 7:18, as in the question's loop
normdf[cbind(raw + 1, cvt)] # one (row, column) lookup per subscale
}
# mapply returns one column per record, so transpose to get one row per record
cvt.scores <- t(mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age))
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[,1:12] <- matrix(idxs$norm, ncol=12)
The tricky part here is the idxs data frame, which combines all the rest. It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this here with this first thought, and see whether I can come up with a better answer, either based on joran's comment, or using a three- or four-dimensional array for normdf. Not sure yet.
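The column-offset idea can be tried on a small self-contained example. Everything below is hypothetical: normdf is a made-up 5 x 48 table, and scores, gender and age are invented stand-ins for the survey data. It is a sketch of matrix indexing with a two-column index, not the answer's exact code.

```r
# Hypothetical normalization table: raw scores 0..4 (rows 1..5) by 48 columns
normdf <- matrix(seq_len(5 * 48), nrow = 5, ncol = 48)

# Hypothetical survey data: two respondents, 12 raw subscale scores each
scores <- matrix(c(0:4, 0:4, 0:1,
                   4:0, 4:0, 4:3), nrow = 2, byrow = TRUE)
gender <- c("Male", "Female")
age    <- c(9, 13)

# Column offset per respondent: 0, 12, 24 or 36, depending on gender and age band
offset <- 24 * (gender == "Female") + 12 * (age >= 12)

# For every cell, the normdf row (raw score + 1) and column (subscale + offset)
row_idx <- scores + 1
col_idx <- outer(offset, 1:12, "+")

# A two-column matrix index picks one normdf cell per (row, column) pair
converted <- matrix(normdf[cbind(c(row_idx), c(col_idx))], nrow = nrow(scores))
```

All lookups happen in one vectorised indexing step, so no explicit loop over rows or columns is required.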

Resources