Matching negative and positive values using For Loop in R

This is my first post so I hope it is not too elementary. I am trying to match observations which have a negative Amount to counterparts that have a positive Amount and an equal abs(Amount). Furthermore, I want to check that the Amounts are both from the same Account. To do this, I am trying to use a for loop, but am getting the following error: "Operations are possibly only for numeric, logical or complex types." This is my code so far:
for(i in 1:nrow(data)){
for(j in 1:nrow(data)){
if ((data$Amount[i]=abs(data$Amount[j]))&(data$Amount[i]!=data$Amount[j])&(data$Account[i]=data$Account[j]))
{data$debit[i]<-1}}}
Does anyone have any idea why this is happening, or know of a better way using the Apply function family? Thank you in advance!
EDIT:
Below is a toy data set to illustrate this example. For instance, on this data set, I want to create an indicator variable which would be 0 except for ID=3, because for that observation 4.7=abs(-4.7) and "abc1"="abc1".
Data <- " ID Amount Account
1 5.0 abc1
2 -5.0 abc9
3 4.7 abc1
4 4.6 abc7
5 5.0 abc8
6 -4.7 abc1 "

Here's an alternative method of achieving the same result with a lot less code (and I think it's easier to read too)
library(dplyr)
Data <- Data %>%
group_by(Account) %>%
mutate(
debit = (Amount > 0 & -Amount %in% unique(Amount)) * 1
) %>%
ungroup()
If you aren't familiar with the pipe operator (%>%), it allows us to avoid nesting a lot of functions inside one another. It works by taking the output of the previous function, and entering it as the first argument of the next function. So this code takes the data set (Data), groups it by the Account, adds a new column with the indicator variable with the desired criterion, and then ungroups the data so it's back to its normal format.
The looping is done within these function calls, which allows them to be implemented in compiled languages (usually C++) - which can be a lot faster than R.
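For comparison, the same indicator can be sketched in base R without dplyr. This is only an illustration, not part of the answer above: it first reads the toy data from the question into a data frame (the question stores it as a string), and the names key and neg_key are made up for this example.

```r
# Toy data from the question, read into a data frame.
Data <- read.table(text = "ID Amount Account
1 5.0 abc1
2 -5.0 abc9
3 4.7 abc1
4 4.6 abc7
5 5.0 abc8
6 -4.7 abc1", header = TRUE, stringsAsFactors = FALSE)

# Build an "Account Amount" key per row, plus the key of the negated Amount.
key     <- paste(Data$Account, Data$Amount)
neg_key <- paste(Data$Account, -Data$Amount)

# debit is 1 when the Amount is positive and its negative counterpart
# exists in the same Account, otherwise 0.
Data$debit <- as.integer(Data$Amount > 0 & neg_key %in% key)
Data$debit
# [1] 0 0 1 0 0 0
```

Comparing keys as text relies on both amounts printing identically, which holds for exact negations like 4.7 and -4.7 but may not hold for amounts that come out of floating-point arithmetic.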

You need to use the == operator (= is an assignment operator) and the && rather than the & operator for your logical condition:
## Assignment (incorrect in this case!)
1 = 1
# Error in 1 = 1 : invalid (do_set) left-hand side to assignment
a <- 1
a = a
Note that a = a performs no logical check at all; it is just the equivalent of a <- a.
## Checking equivalence (returns a logical)
1 == 1
# [1] TRUE
a == a
# [1] TRUE
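For the difference between & and &&: & is vectorized and compares element by element, while && combines two single (length-one) conditions and stops evaluating as soon as the result is known, which is what if() expects.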
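A small sketch of that difference, using made-up vectors x and y:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# & works element by element and returns a vector.
elementwise <- x & y
elementwise
# [1]  TRUE FALSE FALSE

# && combines two length-one conditions and returns a single value.
single <- x[1] && y[1]
single
# [1] TRUE

# && short-circuits: the right-hand side is not evaluated here,
# so stop() is never called.
short <- FALSE && stop("not reached")
short
# [1] FALSE
```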
Also, it might be more elegant to check whether the sum of data$Amount[i] and data$Amount[j] is zero rather than checking that they have the same absolute value but not the same signed value.
## Your example
for(i in 1:nrow(data)){
for(j in 1:nrow(data)){
if ( (sum(c(data$Amount[i], data$Amount[j])) == 0) && (data$Account[i] == data$Account[j]) ) {
data$debit[i]<-1
}
}
}

Related

Using the `$` operator in a function [duplicate]

I'm trying to write a function that I can use across multiple dataframes which accepts column names as input. The objective is to identify whether an event happened (if it was the earliest) and then code the results into a binary 0 and 1. This is what I've come up with so far:
event <- function(x){
analysis$event <- 0
analysis$event[analysis$earliest == analysis$x] <- 1
}
However, when I try it with say test <- event(death_date) it returns just a value of 1. What went wrong and how can I fix it? Thanks!
The dollar operator does not work with variables. You can use double square brackets instead:
col_name <- "mpg"
mtcars[[col_name]]
# compare:
mtcars$mpg
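Applied to the function in the question, this could look roughly like the sketch below. The toy analysis frame and the two-argument signature are assumptions made for illustration. The original most likely returned 1 because a function returns the value of its last expression (here an assignment of 1), and because changes made to analysis inside the function stay local to it.

```r
# Hypothetical toy data standing in for the asker's `analysis` frame.
analysis <- data.frame(
  earliest   = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  death_date = as.Date(c("2020-01-01", "2020-05-01", "2020-03-01"))
)

# Take the data frame and the column name (a string) as arguments,
# and return the 0/1 indicator instead of modifying a global object.
event <- function(data, col_name) {
  as.integer(data$earliest == data[[col_name]])
}

analysis$event <- event(analysis, "death_date")
analysis$event
# [1] 1 0 1
```

Note that the column name is passed as a quoted string ("death_date"), not as a bare name.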

How do fix issue with for loop in R

I am writing a package for folks who want to predict values based on AADTMAJ, L, and Base_Past. The function provides two options: 1) allow the user to enter their own regression coefficients, or 2) provide the user with predefined coefficients. However, I have not been able to use return() correctly.
input data
data=data.frame(Base_Past=c("HSM-RUR2U-KABCO",
"HSM-RUR2U-KABCO",
"HSM-RUR4-KABC",
"HSM-RUR4-KABCO"),
AADTMAJ=c(100,100,100,100),
L=c(1,1,1,1)
)
input custom regression coefficients
custom.spf=data.frame(Base_Past=c("HSM-RUR2U-KABCO","HSM-RUR2U-KABC"), a=c(-0.312,-0.19))
define helper function
helper_function = function (data, Base_Past=FALSE, override=custom.spf){
if (is.data.frame(override)){
for (j in 1:nrow(override)){
for (i in 1:nrow(data)){
if(data[i, ]$Base_Past==override[j, ]$Base_Past){
output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(override[j, ]$a))
return(output)} else{
if(data[i, ]$Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
if(data[i, ]$Base_Past=="HSM-RUR4-KABC") {a=-0.143}
output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
return(output)
}
}
}
}
else if (!is.data.frame(override)){
if(Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
if(Base_Past=="HSM-RUR4-KABC") {a=-0.143}
output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
return(output)
}
}
run
(data %>% dplyr::rowwise() %>% dplyr::mutate(predicted_value = helper_function(data = data, override=custom.spf)))[,4]
Output
# A tibble: 4 x 1
# Rowwise:
predicted_value
<dbl>
1 0.0267
2 0.0267
3 0.0267
4 0.0267
alternative
data %>% dplyr::mutate(predicted_value=dplyr::case_when(Base_Past =="HSM-RUR4-KABCO" ~AADTMAJ*L*365*10^(-6)*exp(-0.101),
Base_Past=="HSM-RUR4-KABC" ~AADTMAJ*L*365*10^(-6)*exp(-0.143),
Base_Past=="HSM-RUR2U-KABCO" ~AADTMAJ*L*365*10^(-6)*exp(-0.312),
Base_Past=="HSM-RUR2U-KABC" ~AADTMAJ*L*365*10^(-6)*exp(-0.190),
TRUE ~ NA_real_))
desired output
Base_Past AADTMAJ L predicted_value
1 HSM-RUR2U-KABCO 100 1 0.02671733
2 HSM-RUR2U-KABCO 100 1 0.02671733
3 HSM-RUR4-KABC 100 1 0.03163652
4 HSM-RUR4-KABCO 100 1 0.03299356
The function and your use of it have several problems. Notable on the list of problems since my first batch of comments:
You call it within a rowwise pipe but then pass data=data, which means that it is ignoring the data coming in the pipe and instead looking at the whole thing. You might instead use data=cur_data() (since it is inside of a mutate, this works, as cur_data() is defined by dplyr for situations something like this).
Your helper_function is ill-defined by assuming that custom.spf is defined and available. Having a function rely on the presence of external variables not explicitly passed to it makes it fragile and can be rather difficult to troubleshoot. If for instance custom.spf were not defined in the calling environment, then this function will fail with object 'custom.spf' not found. Instead, I think you could use:
helper_function <- function(..., override=NA) {
if (isTRUE(is.na(override)) && exists("custom.spf")) {
message("found 'custom.spf', using it as 'override'")
override <- custom.spf
}
...
}
I'm not totally thrilled with this still, but at least it won't fail too quickly, and is verbose in what it is doing.
Using 1:nrow(.) can be a little risky if used programmatically. That is, if for some reason one of the inputs has 0 rows (perhaps custom.spf has nothing to override), then 1:nrow(.) should logically do nothing but instead will iterate twice over rows that do not exist. That is, if nrow(.) is 0, then note that 1:0 returns c(1, 0), which is clearly not "do nothing". Instead, use seq_len(nrow(.)), as seq_len(0) returns integer(0), which is what we would want.
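The difference is easy to see on an empty frame (the frame empty and the two counters are made up for this illustration):

```r
empty <- data.frame(x = numeric(0))

1:nrow(empty)
# [1] 1 0
seq_len(nrow(empty))
# integer(0)

# Count how many times each loop body runs on the empty frame.
runs_colon <- 0
for (i in 1:nrow(empty)) runs_colon <- runs_colon + 1

runs_seq_len <- 0
for (i in seq_len(nrow(empty))) runs_seq_len <- runs_seq_len + 1

c(runs_colon, runs_seq_len)
# [1] 2 0
```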
There is no reason to use rowwise() here, and its use should be avoided whenever possible. (It does what it does very well, and when it is truly necessary, it works great. But the performance penalty for iterating one row at a time can be significant, especially for larger data.)
Some of what you are trying to do can be simplified by learning about merge/join methods. Two really good references for merge/join are: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.
Further, it seems as if a significant portion of your effort is to assign a reasonable value to a for your equation. Your inner code (looking for "-KABCO" and "-KABC") looks like it really should be yet another frame of default values.
Here's a suggested helper_function that changes things slightly. It takes as mandatory arguments Base_Past, AADTMAJ, and L, and then zero or more frames to merge/join in order to find an appropriate value for a in the equation.
helper_function <- function(Base_Past, AADTMAJ, L, ...) {
stopifnot(
length(Base_Past) == length(AADTMAJ),
length(Base_Past) == length(L)
)
defaults <- data.frame(Base_Past = c("HSM-RUR4-KABCO", "HSM-RUR4-KABC"), a = c(-0.101, -0.143))
frames <- c(list(defaults), list(...))
a <- rep(NA, length(Base_Past))
tmpdat <- data.frame(row = seq_along(Base_Past), Base_Past = Base_Past, a = a)
for (frame in frames) {
tmpdat <- merge(tmpdat, frame, by = "Base_Past", suffixes = c("", ".y"),
all.x = TRUE, sort = FALSE)
tmpdat$a <- ifelse(is.na(tmpdat$a), tmpdat$a.y, tmpdat$a)
tmpdat$a.y <- NULL
}
tmpdat <- tmpdat[order(tmpdat$row),]
AADTMAJ * L * 365 * 10^(-6) * exp(tmpdat$a)
}
The premise is that looking up "default" values of a inside your function is really the same as looking them up in your override variable. I could have given you the override= argument for a single lookup dictionary, but it is sometimes useful to have a "one or more" type of argument: perhaps you have more than one frame with other values for a, and you may want to use them all at once. This works as you desired for a single frame, and it also works if you have several (by adding all of them after the L argument when called).
I chose to keep the internals of the function simple base R for a few reasons, nothing that stands out as critical. The portion that could be dplyr-ized is within the for (frame in frames) loop.
data %>%
mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf))
# Base_Past AADTMAJ L a
# 1 HSM-RUR2U-KABCO 100 1 0.02671733
# 2 HSM-RUR2U-KABCO 100 1 0.02671733
# 3 HSM-RUR4-KABC 100 1 0.03163652
# 4 HSM-RUR4-KABCO 100 1 0.03299356
The function should operate cleanly within grouping (group_by or rowwise) if you desire, but it is certainly not necessary to do what you asked originally.
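If only a single lookup is needed, the same idea can be sketched more compactly with match() on one stacked lookup table. This swaps in a different technique from the merge loop above and is only an illustration; the name lookup is made up here. Because match() returns the first hit, rows from defaults take precedence over rows from custom.spf, mirroring the order used in the function above.

```r
# Input data and custom coefficients from the question.
data <- data.frame(
  Base_Past = c("HSM-RUR2U-KABCO", "HSM-RUR2U-KABCO",
                "HSM-RUR4-KABC", "HSM-RUR4-KABCO"),
  AADTMAJ = c(100, 100, 100, 100),
  L = c(1, 1, 1, 1)
)
custom.spf <- data.frame(Base_Past = c("HSM-RUR2U-KABCO", "HSM-RUR2U-KABC"),
                         a = c(-0.312, -0.19))
defaults <- data.frame(Base_Past = c("HSM-RUR4-KABCO", "HSM-RUR4-KABC"),
                       a = c(-0.101, -0.143))

# Stack all coefficient frames into one lookup table.
lookup <- rbind(defaults, custom.spf)

# match() gives, for each row of data, the position of its Base_Past in
# the lookup table (NA if it is not found).
a <- lookup$a[match(data$Base_Past, lookup$Base_Past)]

data$predicted_value <- data$AADTMAJ * data$L * 365 * 10^(-6) * exp(a)
data$predicted_value
```

On the question's data this should reproduce the desired output column (0.02671733, 0.02671733, 0.03163652, 0.03299356).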

Assignment in an if statment over data frame?

I hope someone could take a look at the if statement below and tell me how I should change it to get the results I want.
Essentially, I want the code to (1) run through (iterate over) every row in the data frame beh_data, and (2) if the character in the "Cue" column is identical to that in the "face1" column, I want to (3) take the value from the "Enc_trials.thisRepN" column, and (4) assign it to the "scr_of_trial" column. If they are not the same, I want to assign an NA to the "scr_of_trial" column.
Currently, the code runs, but assigns NA to every row in the "scr_of_trial" column.
Can anyone tell me why?
Here is the code:
j <- 1
i = as.character(beh_data$Cue[1:1])
for (x in 1:NROW(beh_data$Cue)) {
if (beh_data$Cue[j] == beh_data$face1[j]) {
beh_data$scr_of_trial[j] <- beh_data$Enc_trials.thisRepN[j]
j <- j + 1
i = as.character(beh_data$Cue[1:1+j])
}
else {
beh_data$scr_of_trial[j] <- NA
j <- j + 1
i = as.character(beh_data$Cue[1:1+j])
next
}
}
Shift your thinking to whole-vectors-at-a-time.
A few techniques:
ifelse; while it works fine here, realize that ifelse has issues with class.
beh_data$scr_of_trial <- ifelse(beh_data$Cue == beh_data$face1,
beh_data$Enc_trials.thisRepN, NA_character_)
replace; similar functionality, no class problem:
replace(beh_data$Enc_trials.thisRepN, beh_data$Cue != beh_data$face1, NA_character_)
Use what I call an "indicator variable":
ind <- beh_data$Cue == beh_data$face1
beh_data$scr_of_trial <- NA_character_
beh_data$scr_of_trial[ind] <- beh_data$Enc_trials.thisRepN[ind]
No for loops, just whole vectors at a time.
When reasonable, I tend to use class-specific NA types like NA_character_; while base R's functions will happily up-convert for you to whatever class you have, many other dialects within R (e.g., dplyr, data.table) are less permissive. It's a little declarative programming, a little style, perhaps a little snobbery, I don't know ...
(This is all untested on actual data.)
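A small worked example of the indicator-variable approach, on made-up data. The toy beh_data frame is hypothetical, and Enc_trials.thisRepN is assumed to be numeric here, so NA_real_ is used instead of NA_character_.

```r
# Hypothetical toy data; the real beh_data was not shown.
beh_data <- data.frame(
  Cue = c("a", "b", "c"),
  face1 = c("a", "x", "c"),
  Enc_trials.thisRepN = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# TRUE where Cue and face1 agree.
ind <- beh_data$Cue == beh_data$face1

# Start with NA everywhere, then fill in only the matching rows.
beh_data$scr_of_trial <- NA_real_
beh_data$scr_of_trial[ind] <- beh_data$Enc_trials.thisRepN[ind]
beh_data$scr_of_trial
# [1] 10 NA 30
```

Both sides of the last assignment are subset with the same ind, so the values stay aligned with their rows.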

Looped Equation on different subset of data R

I am trying to set up an earning pattern on some data. I'm doing this by creating an 'Earned_Multiplier' variable which I can then use to multiply on whatever other variable necessary later on. Where the 'Earned_Duration' is >0 and <= 30, the Earned_Multiplier should be equal to ((Earned_Duration/30)*0.347), where the 'Earned_Duration' is >30 and <=60, the Earned_Multiplier should be equal to (0.347+((Earned_Duration/30)*0.16)), and so on.
I'm hoping the below should make sense given the above description. Unfortunately I am getting the error message "NAs are not allowed in subscripted assignments". I feel like this is likely because I need to be using a loop to do the calculation?
Could anyone help direct me as to how to build this loop and making sure it does the right calculation for each different subset?
Output_All$Earned_Multiplier <- 1
Output_All$Earned_Multiplier[Output_All$Earned_Duration == 0] <- 0
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 0) &
(Output_All$Earned_Duration <= 30)] <- 0+
((Output_All$Earned_Duration/30)*.347) # Month 1
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 30) &
(Output_All$Earned_Duration <= 60)] <- .347+(((Output_All$Earned_Duration-
30)/30)*.16) # Month 2
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 60) &
(Output_All$Earned_Duration <= 90)] <- .507+(((Output_All$Earned_Duration-
60)/30)*.085) # Month 3
It would be helpful to provide a dummy dataset so we could work on that. You probably have some NAs in your dataset causing that error.
In any case, using the dplyr library you could do an ifelse statement along with a mutate to create a new column with your calculation result:
library(dplyr)
Output_All <- Output_All %>% mutate(Earned_Multiplier = ifelse(Earned_Duration == 0, 0,
ifelse(Earned_Duration > 0 & Earned_Duration <= 30, (Earned_Duration/30)*0.347,
ifelse(Earned_Duration > 30 & Earned_Duration <= 60, 0.347 + ((Earned_Duration - 30)/30)*0.16,
NA_real_ # final else if none of the above is met (placeholder), or continue with more ifelse statements
))))
Regarding the NAs:
If you do have NAs and they are causing you issues, depending on your preference, you can include this as part of your logical statements:
!is.na(Earned_Duration) # dont forget to add & if you add it as a condition
to make sure that NAs are disregarded.
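The subscripted assignments from the question can also be kept in base R. In the sketch below, which uses made-up data, which() drops NAs from the index and the right-hand side is subset with the same index. The short names d, m, i1, i2 and i3 are introduced only for this illustration.

```r
# Hypothetical toy data; the real Output_All was not shown.
Output_All <- data.frame(Earned_Duration = c(0, 15, 45, 75, NA))
d <- Output_All$Earned_Duration

m <- rep(1, length(d))   # default multiplier, as in the question
m[is.na(d)] <- NA        # keep missing durations missing
m[which(d == 0)] <- 0

# which() returns only the positions where the condition is TRUE,
# so NAs never reach the subscript.
i1 <- which(d > 0 & d <= 30)
m[i1] <- (d[i1] / 30) * 0.347                 # Month 1
i2 <- which(d > 30 & d <= 60)
m[i2] <- 0.347 + ((d[i2] - 30) / 30) * 0.16   # Month 2
i3 <- which(d > 60 & d <= 90)
m[i3] <- 0.507 + ((d[i3] - 60) / 30) * 0.085  # Month 3

Output_All$Earned_Multiplier <- m
Output_All$Earned_Multiplier
# [1] 0.0000 0.1735 0.4270 0.5495     NA
```

No loop is needed: each assignment handles one duration band for all rows at once.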

Comparing two date vectors with function in R to avoid loop and dealing with NA

There is probably a very trivial workaround to this, but here goes... I am trying to compare two date vectors in R (not originally input as date vectors) to: return the first value if the second is NA and the first is not missing; to return the largest of the two dates if the second is not missing; or to return NA if both values are missing. For example, for data presented below, I'd like lastdate to compute as follows:
v1 v2 lastdate
1/2/2006 NA 1/2/2006
1/2/2006 12/2/2006 12/2/2006
NA NA NA
I have written a formula to avoid looping over each row (85K in these data) as follows:
lastdate <- function(lastdate1,lastdate2){
if (is.na(lastdate1)==T & is.na(lastdate2)==T) {NA}
else if (is.na(lastdate2)==T & !is.na(lastdate1)) {as.Date(lastdate1,format="%m/%d/%Y")}
else {max(as.Date(lastdate2,format="%m/%d/%Y"),as.Date(lastdate1,format="%m/%d/%Y"))}
}
dfbobs$leaveobsdate <- lastdate(as.Date(dfbobs$leavedate1,format="%m/%d/%Y"),as.Date(dfbobs$leavedate2,format="%m/%d/%Y"))
The last line is telling it to compare two vectors of dates, but is not quite right as I am getting the errors
Warning messages:
1: In if (is.na(lastdate1) == T & is.na(lastdate2) == T) { :
the condition has length > 1 and only the first element will be used
2: In if (is.na(lastdate2) == T & !is.na(lastdate1)) { :
the condition has length > 1 and only the first element will be used
I'm sure this is very silly and there's probably a much easier way to do this, but any help would be appreciated.
EDIT: I have now attempted this with an ifelse function to deal with the vectors, as suggested, but the comparison, while working if I type in single values (e.g., lastdate("1/1/2006","1/2/2006")), produces NAs if I try it on the dataframe vectors. The code follows:
lastdate <- function(lastdate1,lastdate2){
ifelse(is.na(lastdate1==T) & is.na(lastdate2==T), NA,
ifelse(is.na(lastdate2)==T & !is.na(lastdate1), as.Date(lastdate1,format="%m/%d/%Y"),
ifelse(!is.na(lastdate2) & !is.na(lastdate1), max(as.Date(lastdate2,format="%m/%d/%Y"),as.Date(lastdate1,format="%m/%d/%Y")),NA)))
}
dfbobs$leaveobsdate <- as.Date(lastdate(as.Date(dfbobs$leavedate1,format="%m/%d/%Y"),as.Date(dfbobs$leavedate2,format="%m/%d/%Y")),origin="1970-01-01")
try this:
convert dates to numeric form like so
v1<-as.character(v1); v2<-as.character(v2);
v1<-as.numeric(strftime(strptime(v1,"%m/%d/%Y"),"%Y%m%d"));
v2<-as.numeric(strftime(strptime(v2,"%m/%d/%Y"),"%Y%m%d"));
compute result now
result<-pmax(v1,v2,na.rm=TRUE); # element-wise maximum (max() would return one overall value); NA only where both are NA
cast back to format of your choice
result<-strptime(result,"%Y%m%d");
result<-strftime(result,"%m/%d/%Y");
if is not vectorized - it expects a single argument. Use ifelse.
Alternatively, you can use mapply with your existing function:
mapply(lastdate, as.Date(df$leavedate1, ...), as.Date(df$v2, ...))
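Another option is to compare the Date vectors directly with pmax, which takes the maximum row by row. This is a sketch on the three example rows from the question; the names d1 and d2 are made up for the illustration.

```r
# Example rows from the question, as character vectors.
v1 <- c("1/2/2006", "1/2/2006", NA)
v2 <- c(NA, "12/2/2006", NA)

d1 <- as.Date(v1, format = "%m/%d/%Y")
d2 <- as.Date(v2, format = "%m/%d/%Y")

# Row-wise maximum; with na.rm = TRUE a single missing value is ignored,
# and the result is NA only where both dates are missing.
lastdate <- pmax(d1, d2, na.rm = TRUE)
format(lastdate, "%m/%d/%Y")
# [1] "01/02/2006" "12/02/2006" NA
```

This covers all three cases from the question in one vectorized call, without if, ifelse or mapply.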
