Assign zero to factor based on case_when - r

I'm working with a data frame "mydata" containing a variable "therapy" which is a factor (0/1). Another variable is "died" (1 if died, 0 if survived). There are no missing values, so every observation has a value for therapy and died.
Now I would like to alter the value of "therapy" based on the value of "died": if died == 1, therapy should be set to 0 (so I want to replace the existing value); otherwise the value should stay unchanged.
mydata$therapy <- ifelse(mydata$died == 1,
0,
mydata$therapy)
As a result I get values that not only contain 0 and 1, but also 2 (therapy never contained any "2"). I assume that the increment by one is due to the factor type of "therapy". Also the following code with case_when leads to the same results:
mydata <- mydata %>%
mutate(therapy = case_when(
died == 1 ~ 0,
TRUE ~ therapy))
Does anybody have an idea what I am doing wrong? Or does anybody have a solution for just changing "therapy" to zero if died == 1 and keeping all values as they are if died == 0?

Thank you all for your answers!
Especially the comment by Ray was helpful - my problem was solved this way:
mydata$therapy <- ifelse(mydata$died == 1,
0,
as.numeric(levels(mydata$therapy))[mydata$therapy])
This way I get the original 0/1 values instead of the 1/2 values that come from the factor's internal codes.
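A minimal sketch of why the 2s appear, using small made-up vectors rather than the asker's data: ifelse() drops the factor attributes, so the unchanged branch returns the internal integer codes (1 and 2) instead of the labels "0" and "1". Converting through levels() first recovers the labels.

```r
# Hypothetical toy data standing in for mydata$therapy and mydata$died
therapy <- factor(c(0, 1, 1, 0))
died    <- c(0, 1, 0, 0)

# ifelse() strips the factor class, so the "no" branch yields the
# internal codes 1/2 rather than the labels "0"/"1"
codes <- ifelse(died == 1, 0, therapy)

# Map the codes back to the numeric labels before calling ifelse()
values <- as.numeric(levels(therapy))[therapy]
fixed  <- ifelse(died == 1, 0, values)

# Optionally turn the result back into a 0/1 factor
fixed_factor <- factor(fixed, levels = c(0, 1))
```

Here codes contains 1, 0, 2, 1, while fixed contains 0, 0, 1, 0.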

Related

R: how to set an if statement condition to only be triggered if whole column is equal to a value?

So let's say I have a list of data frames. Within each data frame, there is a column on which I want to base a new dummy column. This is how it works. For simplicity, let's just use vectors instead of a data frame in the example.
vect<-c(0, 0, 100, 100, 0, 0)
In this case, the dummy column created would be as follows:
dummy_vect<- c(0, 0, 0, 0, 1, 1)
The dummy is 1 only at the indexes after the last nonzero value in vect. I have the code written to do this and it works without any issues. The big issue I'm running into occurs in the rare instance when all of vect is 0s:
vect<-c(0,0,0,0,0,0)
For the context of the problem, when this case occurs, I need the dummy columns to be 1 at every instance.
How would I translate this into code? So if every value in vect is 0, return all 1s in the dummy column, else just do the code I've written that works for other cases. Any help is greatly appreciated! It might be something simple and I'm just really over thinking it, but I don't know how to set the if condition up properly at all
Take absolute values, reverse the input and take the cumulative sum. The result is 0 exactly at the trailing zeros, so compare it to 0 to get TRUE there, reverse again and convert to numeric with the unary +.
vect <- c(0, 0, 100, 100, 0, 0)
+rev(cumsum(rev(abs(vect))) == 0)
## [1] 0 0 0 0 1 1
+rev(cumsum(rev(abs(0*vect))) == 0) # 0*vect is all 0 input
## [1] 1 1 1 1 1 1
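The one-liner above can be unpacked step by step. This is only a sketch of the same idea; the function name trailing_zero_dummy is made up for illustration.

```r
# Hypothetical helper: 1 for every position after the last nonzero value
trailing_zero_dummy <- function(vect) {
  reversed <- rev(abs(vect))    # look at the vector from the end
  running  <- cumsum(reversed)  # stays 0 until the first nonzero value
  trailing <- running == 0      # TRUE only for the trailing zeros
  as.numeric(rev(trailing))     # restore the order, convert to 0/1
}

dummy_vect <- trailing_zero_dummy(c(0, 0, 100, 100, 0, 0))
all_zero   <- trailing_zero_dummy(c(0, 0, 0, 0, 0, 0))
```

For the first input the running sum of the reversed vector is 0 0 100 200 200 200, which gives 0 0 0 0 1 1 after the comparison and reversal. An all-zero input never leaves 0, so every position becomes 1 without a separate if branch.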
Just found a condition in an if statement that looks as though it is working.
if (all(df$x == 0)) {
  df$dummy_col <- 1
} else {
  # The code that does the process for all other cases...
}

Grouping conditional linked values within a data.table

I have a data.table with 3 input columns as follows and a fourth column representing my target output:
require(data.table)
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
Next_peak = c(9,15,15,23,27,27,31,39,49,49,50),
Valid_reversal = c(T,T,F,F,T,F,T,F,T,F,F),
Target_output = c(5,5,13,5,19,23,19,19,39,42,39))
I'm not sure if this is completely necessary, but I'll try to explain the dataset to hopefully make it easier to see what I'm trying to do. This is a little hard to explain in writing, so please bear with me!
The "Created" column represents the row number location of a price 'peak' (i.e. reversal point) in a time-series of financial data that I'm analysing. The "Next_peak" column represents the corresponding row number (in the original data set) of the next peak which exceeds the peak for that row. e.g. looking at row 1, the "Next_peak" value is 9, corresponding to the same row location as the "Created" level on row 2 of this summarised table. This means that the second peak exceeds the first peak. Conversely, in row 2 where the second peak's data is stored, the "Next peak" value of 15 suggests that it isn't until the 4th peak (i.e. corresponding to the '15' value in the "Created" column) that the second peak's price level is exceeded.
Lastly, the "Valid_reversal" column denotes whether the "Created" and "Next_peak" levels are within a predefined threshold. For example, "T" in the first row means that the peaks at rows 5 ("Created") and 9 ("Next_peak") met this criterion. If I then go to the row where "Created" is 9, there is also a "T", meaning that the "Next_peak" value of 15 also meets the criterion. However, when I go to the 4th row where Created = 15, there is an "F", so the next peak does not meet the criterion.
What I'm trying to do is to link the 'chains' of valid reversal points and then return the original starting "Created" value. i.e. I want rows 1, 2 and 4 to have a value of '5', suggesting that the peaks for these rows were all within a predefined threshold of the original peak in row 5 of the original data-set.
Conversely, row 3 should simply return 13 as there were no valid reversals at the "Next_peak" value of 15 relative to the peak formed at row 13.
I can create the desired output with the following code, however, it's not a workable solution as the number of steps could easily exceed 3 with my actual data sets where there are more than 3 peaks which are 'linked' with the same reversal point.
I could do this with a 'for' loop, but I'm wondering if there is a better way to do this, preferably in a manner which is as vectorised as possible as the actual data set that I'm using contains millions of rows.
Here's my current approach:
Test[Valid_reversal == T,Step0 := Next_peak]
Test[,Step1 := sapply(seq_len(.N),function(x) ifelse(any(!(Created[x] %in% Step0[seq_len(x)])),
Created[x],NA))]
Test[,Step2 := unlist(ifelse(is.na(Step1),
lapply(.I,function(x) Step1[which.max(Step0[seq_len(x-1)] == Created[x])]),
Step1))]
Test[,Step3 := unlist(ifelse(is.na(Step2),
lapply(.I,function(x) Step2[which.max(Step0[seq_len(x-1)] == Created[x])]),
Step2))]
As you can see, while this data set only needs 3 iterations, the number of steps in the approach that I've taken is not definable in advance (as far as I can see). Therefore, to implement this approach, I'd have to repeat Step 2 until all values had been calculated, potentially via a 'while' loop. I'm struggling a little to work out how to do this.
Please let me know if you have any thoughts on how to address this in a more efficient way.
Thanks in advance,
Phil
Edit: Please note that I didn't mention in the above that the "Next_peak" values aren't necessarily monotonically increasing. The example above meant that nafill could be used, however, as the following example / sample output shows, it wouldn't give the correct output in the following instance:
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
Next_peak = c(27,15,15,19,23,27,42,39,42,49,50),
Valid_reversal = c(T,T,F,T,F,F,T,F,F,T,F),
Target_output = c(5,9,13,9,9,23,5,31,39,5,5))
I am not sure I understand your requirements correctly, but you can use nafill after Step 1:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#your steps 2, 3, ...
Test[Valid_reversal | is.na(out), out := nafill(out, "locf")]
Edit for the new example: you can use igraph to find the chains:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#steps 2, 3, ...
library(igraph)
g <- graph_from_data_frame(Test[Valid_reversal | is.na(out)])
DT <- setDT(stack(clusters(g)$membership), key="ind")[,
ind := as.numeric(levels(ind))[ind]][,
root := min(ind), values]
Test[Valid_reversal | is.na(out), out := DT[.SD, on=.(ind=Created), root]]
Just for completeness, here is a while loop version:
#step 0 & 1
Test[, out :=
Test[(Valid_reversal)][.SD, on=.(Next_peak=Created), mult="last",
fifelse(is.na(x.Created), i.Created, NA_integer_)]
]
#step 2, 3, ...
while(Test[, any(is.na(out))]) {
Test[is.na(out), out := Test[.SD, on=.(Next_peak=Created), mult="first", x.out]]
}
Test
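For reference, the chain-linking rule described in the question can also be written as a single forward pass in base R. This is only a sketch under two assumptions that hold in both example tables but are not stated as guarantees: rows are sorted by Created, and every Next_peak is larger than its own Created, so a row's predecessor always appears earlier. The function name link_chains is made up, and when several valid rows point at the same peak it arbitrarily follows the first one.

```r
# Hypothetical helper: follow valid reversals back to the first peak of a chain
link_chains <- function(created, next_peak, valid) {
  # For each row, the first valid row whose Next_peak equals this Created
  pred <- match(created, replace(next_peak, !valid, NA))
  root <- created
  for (i in seq_along(created)) {
    # Predecessors come earlier, so their root is already final
    if (!is.na(pred[i])) root[i] <- root[pred[i]]
  }
  root
}

created <- c(5, 9, 13, 15, 19, 23, 27, 31, 39, 42, 49)

# First example from the question
out1 <- link_chains(created,
                    c(9, 15, 15, 23, 27, 27, 31, 39, 49, 49, 50),
                    c(T, T, F, F, T, F, T, F, T, F, F))

# Second example (Next_peak not monotonically increasing)
out2 <- link_chains(created,
                    c(27, 15, 15, 19, 23, 27, 42, 39, 42, 49, 50),
                    c(T, T, F, T, F, F, T, F, F, T, F))
```

Both results reproduce the Target_output columns given in the question. The loop is linear in the number of rows because the lookup is done once with match().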

How to create a factor with specified levels and labels, change the levels and adapt the labels step by step

I would like to do three things step by step and I am unfortunately stuck. Maybe someone could walk me through the process in R or point out my mistakes.
# Create a dataset containing a factor with pre-defined levels and labels
testdat<-data.frame(a=factor(c(1,2), labels=c("yes","no")))
I was expecting to get a factor named "a" that takes on the values 1 and 2 and is assigned the labels "yes" (for 1) and "no" (for 2). Unfortunately, the factor now only contains what I specified as labels, and c(1,2) is not accessible anymore.
# Next, I would like to assign new levels to the factor, namely {1,0} instead of {1,2}
testdat$a[testdat==2] <- 0
Obviously this doesn't work, because of the problems in the first step and because there is no value == 2. But ideally, after this second step, I would have a variable "a" that now takes the values 1 and 0, but that still has the original labels "yes" (for 1) and "no" (for 2) assigned.
So in a third step, I would like to adjust the value labels so that "no" corresponds to the value 0, and no longer to the (now absent) value 2. How would I do that?
And should this be a community wiki?
As mentioned in the comments, once we have a factor with labels we lose the initial values. We can try to store the data in a named vector or list instead:
#Label can be considered as name of the vector
a <- c(yes = 1, no = 2)
#We can now change the value where a == 2 to 0 and labels are still intact
a[a == 2] <- 0
a
#yes no
# 1 0
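If the factor itself has to be kept, one possible approach (a sketch, not the only way) is to leave the labels in the factor and keep the numeric meaning in a separate lookup vector. The name value_of is made up for illustration.

```r
# Factor built from the values 1/2 with the labels "yes"/"no"
a <- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("yes", "no"))

# The factor only stores the labels plus internal integer codes
codes <- as.integer(a)

# Hypothetical lookup table: which number each label should stand for
value_of <- c(yes = 1, no = 0)

# Translate labels to the desired numeric values
values <- unname(value_of[as.character(a)])
```

Here codes is 1, 2, 2, 1 and values is 1, 0, 0, 1, while a keeps its "yes"/"no" labels. Changing what "no" stands for only means editing value_of.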

How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check if out of 4 neighboring observations in a given column in my data frame, that at least 3 out of 4 of them contain the same value). I have tried first creating another indicator variable indicating if the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization of the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself (not its neighbors) where the condition is satisfied is receiving the value, rather than all neighboring observations also receiving the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(sample_dat)){
sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
((sample_dat$initial_ind[i-2]==1 |
sample_dat$initial_ind[i-1]==1) &
(sample_dat$initial_ind[i+2]==1 |
sample_dat$initial_ind[i+1]==1)),
"trending",
"non-trending"
)
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3/4 observations in that group of 4 all have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this, but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function to set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks if the condition is true for at least 3/5 of the observation plus its four closest neighbors. In this case just adding the 1s up and checking if they're above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
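The rollapply calls above label one observation per window. If every member of a qualifying window should be labelled, as in the question's expected result for observations 7-10, a plain base R loop over window start positions is one way to sketch it. The function name flag_windows and its arguments width and min_hits are made up for illustration.

```r
# Hypothetical helper: flag all members of any window of `width`
# consecutive observations that contains at least `min_hits` ones
flag_windows <- function(x, width, min_hits) {
  n <- length(x)
  flagged <- logical(n)
  if (n < width) return(flagged)
  for (s in seq_len(n - width + 1)) {
    idx <- s:(s + width - 1)
    if (sum(x[idx]) >= min_hits) flagged[idx] <- TRUE
  }
  flagged
}

sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
hit <- flag_windows(sample_dat$initial_ind, width = 4, min_hits = 3)
sample_dat$violate <- ifelse(hit, "trending", "non-trending")
```

With width = 4 and min_hits = 3, only the window covering observations 7-10 qualifies in the sample data, so exactly those four rows are labelled "trending". Other rules such as 5/6 or 2/5 only need different arguments.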

Replacing Nulls for one Variable based off another

I have a dataset consisting of measured variables and categorical variables based off these measurements, i.e. X1 is a measured variable and Y1 will be either 0 or 1 based off the measurement in X1.
There were a lot of Null values in the X1 variable, which I have replaced already. I am now trying to replace the corresponding Y1 values based off the new values in X1.
So what I'm trying to do with the below code is say: if there is a Null in Y1, check if the corresponding X1 value is less than 34.5. If so, give that Y1 a 0, otherwise a 1.
Data$Y1[is.na(Data$Y1)] <- ifelse(Data$X1 <34.5, 0, 1)
This is the warning I get:
Warning message:
In x[...] <- m :
number of items to replace is not a multiple of replacement length
A simple loop should do the trick:
for (i in seq_len(nrow(Data))) {
  if (is.na(Data$Y1[i])) {
    Data$Y1[i] <- ifelse(Data$X1[i] < 34.5, 0, 1)
  }
}
It may not be the most efficient way, but the logic is pretty clear and it runs fairly fast when your dataset isn't huge.
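The warning in the question arises because the right-hand side has one value per row of Data while the left-hand side selects only the rows where Y1 is missing. A vectorised sketch that subsets both sides with the same index avoids the loop. The data frame below is made-up toy data, not the asker's.

```r
# Hypothetical toy data: Y1 is missing in rows 1 and 3
Data <- data.frame(X1 = c(30, 40, 34.5, 20),
                   Y1 = c(NA, 1, NA, 0))

# Use the same logical index on both sides so the lengths match
missing <- is.na(Data$Y1)
Data$Y1[missing] <- ifelse(Data$X1[missing] < 34.5, 0, 1)
```

After this, Y1 is 0, 1, 1, 0: row 1 becomes 0 because 30 is below 34.5, and row 3 becomes 1 because 34.5 is not strictly below the cutoff.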
