Avoid loop to improve r code - r

I have a data frame with millions of rows and ten columns.
My code seems to work, but it never finishes, I think because of the for loop and the if statement.
I want to write it differently, but I'm stuck.
df <- data.frame(x = 1:5,
y = c("a", "a", "b", "b", "c"),
z = sample(5))
for (i in seq_along(df$x)){
if (df$y[i] == df$y[i+1] & df$y[i] == "a"){
df$status[i] <- 1
} else {
df$status[i] <- "ok"
}
}

In fact, you can replace the whole loop by a vectorised ifelse:
df$status = ifelse(df$y == df$y[-1] & df$y == 'a', 1, 'ok')
This code will give you a warning, unlike the for loop. However, the warning is actually correct and also concerns your code: you are reading past the last element of df$y when doing df$y[i + 1].
You can make this warning go away (and make the code arguably clearer) by borrowing the lead function from dplyr (simplified):
lead = function (x, n = 1, default = NA) {
if (n == 0)
return(x)
`attributes<-`(c(x[-seq_len(n)], rep(default, n)), attributes(x))
}
With this, you can rewrite the code ever so slightly and get rid of the warning:
df$status = ifelse(df$y == lead(df$y) & df$y == 'a', 1, 'ok')
It’s a shame that this function doesn’t seem to exist in base R.
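If adding a helper function is not desired, the same shift can be written inline in base R. The following is a minimal sketch, assuming the example data frame from the question (with z fixed here instead of sampled); the name next_y is introduced only for illustration.

```r
df <- data.frame(x = 1:5,
                 y = c("a", "a", "b", "b", "c"),
                 z = c(3, 1, 5, 2, 4),
                 stringsAsFactors = FALSE)

# Shift y up by one position and pad with NA so both vectors have equal length
next_y <- c(df$y[-1], NA)

# TRUE only where a row is "a" and the following row is "a" as well
df$status <- ifelse(df$y == next_y & df$y == "a", 1, "ok")
```

Because the two outcomes mix a number and a string, status ends up as a character column. If the last value of y were "a", the comparison against the NA padding would yield NA for that row; wrapping the condition as (df$y == next_y & df$y == "a") %in% TRUE is one way to treat that case as "ok".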

Related

How to use ifelse () correctly in R?

Given the following code:
a <- 3
colors <- ifelse(a == 3,
c("#004B93", "#389DC3", "#6DBE99"),
ifelse(a == 2, c("#004B93", "#389DC3"), c("#000000", "#000000", "#000000")))
My expectation is to get something like
> colors
[1] "#004B93" "#389DC3" "#6DBE99"
But what I get is
> colors
[1] "#004B93"
What am I doing wrong?
You can also use a plain if/else statement to check the conditions in R.
The following applies the same logic as your code:
a <- 3
colors <- if(a == 3) {
c("#004B93", "#389DC3", "#6DBE99")
} else if (a == 2) {
c("#004B93", "#389DC3")
} else {
c("#000000", "#000000", "#000000")
}
print(colors)
Output Result :
[1] "#004B93" "#389DC3" "#6DBE99"
Here's how I would write this:
possible_colors = list(
c("#000000", "#000000", "#000000"),
c("#004B93", "#389DC3"),
c("#004B93", "#389DC3", "#6DBE99")
)
colors = if (a < 1L || a > 3L) possible_colors[[1L]] else possible_colors[[a]]
This assumes that a only has integer values. Adjust the condition of the if accordingly if that is not the case.
If you know that a will never be any value except 1, 2 or 3, you can omit the if completely and directly write colors = possible_colors[[a]].
ifelse is a command that is usually better to avoid.
Why? Because
ifelse(TRUE, 1:3, 4:6)
returns 1 and not 1:3.
In fact, ifelse wants the output to be the same length as the test. Above, TRUE is a vector of length 1, while the candidate output 1:3 is of length 3. However, ifelse does not raise any warning; it just truncates the vector 1:3 to its first element.
There is rationale behind this, and this example should explain it:
ifelse(c(TRUE, FALSE, TRUE), yes = 1:3, no = 4:6)
[1] 1 5 3
True values are taken from the yes argument and false values are taken from the no argument. If you really want this sophisticated behaviour, ifelse can be extremely powerful, but this is rarely what you want.
The good news is that R is a functional language, thus everything is a function, including the if/else construct. Hence, to get the intended result from
ret <- ifelse(TRUE, 1:3, 4:6)
rewrite it like so:
ret <- if(TRUE) 1:3 else 4:6
and ret is 1:3
Do you need nested conditions? No problem.
a <- 1
ret <- if(a == 1) 1:3 else (if(a == 2) 4:6 else 7:9)
The line above emphasises the functional approach, where I replace the expression 4:6 with the bracketed one, but it can be shortened to:
ret <- if(a == 1) 1:3 else if(a == 2) 4:6 else 7:9
However, let me note that for this type of task, there is a dedicated R function: switch.
ret <- switch(a, 1:3, 4:6, 7:9)
Counting after a, the first, second, etc. argument is chosen depending on the value of a, thus:
a <- 2
switch(a, 1:3, 4:6, 7:9)
# 4:6
As for your case:
a <- 3
switch(a,
c("#000000", "#000000", "#000000"),
c("#004B93", "#389DC3"),
c("#004B93", "#389DC3", "#6DBE99"))
# [1] "#004B93" "#389DC3" "#6DBE99"
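One detail worth keeping in mind: with a numeric first argument, switch returns NULL (invisibly) when the value does not select any of the alternatives. The sketch below wraps the call in a function with an explicit fallback; pick_colors is a hypothetical name, and using the all-black palette as the fallback is an assumption taken from the question's else branch.

```r
pick_colors <- function(a) {
  # For a numeric first argument, switch() returns NULL (invisibly)
  # when a does not select one of the alternatives
  out <- switch(a,
                c("#000000", "#000000", "#000000"),
                c("#004B93", "#389DC3"),
                c("#004B93", "#389DC3", "#6DBE99"))
  # Fall back to the all-black palette in that case
  if (is.null(out)) c("#000000", "#000000", "#000000") else out
}

pick_colors(3)  # the three-colour palette
pick_colors(7)  # out of range, so the fallback is returned
```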

build a call using a recursive function

I'm trying to write a recursive function that builds a nested ifelse call. I do realize there are much better approaches than nested ifelse, e.g., dplyr::case_when and data.table::fcase, but I'm trying to learn how to approach such problems with metaprogramming.
The following code builds out the nested ifelse, but I'm struggling to substitute data with the actual supplied value, in this case my_df.
If I replace quote(data) with substitute(data), it only works for the first ifelse, but after entering the next iteration, it turns into data.
I think something like pryr::modify_lang could solve this after the fact, but I think there's probably a base R solution someone knows.
my_df <- data.frame(group = letters[1:3],
value = 1:3)
build_ifelse <- function(data, by, values, iter=1){
x <- call("ifelse",
call("==",
call("[[", quote(data), by),
values[iter]),
1,
if(iter != length(values)) build_ifelse(data, by, values, iter = iter + 1) else NA)
return(x)
}
build_ifelse(data = my_df, by = "group", values = letters[1:3])
# ifelse(data[["group"]] == "a", 1, ifelse(data[["group"]] == "b",
# 1, ifelse(data[["group"]] == "c", 1, NA)))
Thanks for any input!
Edit:
I found this question/answer: https://stackoverflow.com/a/59242109/9244371
Based on that, I found a solution that seems to work pretty well:
build_ifelse <- function(data, by, values, iter=1){
x <- call("ifelse",
call("==",
call("[[", quote(data), by),
values[iter]),
1,
if(iter != length(values)) build_ifelse(data, by, values, iter = iter + 1) else NA)
x <- do.call(what = "substitute",
args = list(x,
list(data = substitute(data))))
return(x)
}
build_ifelse(data = my_df, by = "group", values = letters[1:3])
# ifelse(my_df[["group"]] == "a", 1, ifelse(my_df[["group"]] ==
# "b", 1, ifelse(my_df[["group"]] == "c", 1, NA)))
eval(build_ifelse(data = my_df, by = "group", values = letters[1:3]))
# [1] 1 1 1
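An alternative that avoids the recursion, and therefore the problem of substitute(data) changing on each nested call, is to capture the expression once and fold over values. This is only a sketch using base R's Reduce and bquote; build_ifelse_reduce is a hypothetical name, not something from the linked answer.

```r
my_df <- data.frame(group = letters[1:3], value = 1:3)

build_ifelse_reduce <- function(data, by, values) {
  # Capture the caller's expression for data (here the symbol my_df) once
  data_expr <- substitute(data)
  # Fold from the right: the innermost "no" branch is NA, and each step
  # wraps the accumulated call in one more ifelse()
  Reduce(
    function(value, inner) {
      bquote(ifelse(.(data_expr)[[.(by)]] == .(value), 1, .(inner)))
    },
    values,
    NA,
    right = TRUE
  )
}

expr <- build_ifelse_reduce(my_df, by = "group", values = letters[1:3])
eval(expr)
```

Since substitute is called only in the outermost frame, every level of the generated call refers to my_df rather than to data.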
There is a base function, switch, that can deliver sequential testing and results similar to dplyr::case_when, at least when used with a loop wrapper. It's not well documented. It is really two different functions, one that expects a numeric input for its classification variable and another that expects character values. I can never remember its name, so I typically need to remind myself that it is referenced on the ?Control page. Since you're using character values, here goes. (I changed the outputs so you can see that some degree of substitution is occurring and that there is an "otherwise" option.)
sapply( my_df$group, switch, a=4, b=5, d=6, NA)
a b c
4 5 NA
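For a plain value-to-value mapping like this one, a named vector used as a lookup table gives the same result without any loop wrapper. This is a sketch under the assumption that each group maps to a single number; the name lookup is illustrative.

```r
my_df <- data.frame(group = letters[1:3], value = 1:3)

# Named vector acting as the lookup table; "c" is deliberately missing
lookup <- c(a = 4, b = 5, d = 6)

# Index by name; as.character() guards against group being a factor
result <- unname(lookup[as.character(my_df$group)])
```

Groups without an entry in the table come back as NA, which plays the role of the "otherwise" option.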

R programming, I need to find the expected amount of draws it takes to get three certain letters in a sample using monte carlo sim

For example, I need to find the total number of draws it takes to pull A, B, and C from a sample.
So far I have used which() to find the positions at which a letter appears in the output, but I don't know where to go from there. We are told the answer is around 20.25. I know how to do it with only one letter (the answer is around 13), but I don't know the proper way to handle multiple letters.
my code:
draw <- function(){
letters <- sample(LETTERS)
which(letters == "A")
#print(letters)
#print(which(letters == "A"))
#print(which(letters == "B"))
#print(which(letters == "C"))
}
mean(replicate(10000, draw()))
Continue sampling 3 letters until you get the required letter combination and return the count of draws that it took to get the combination required.
draw <- function(){
vec <- c('A', 'B', 'C')
count <- 1
tmp <- sample(LETTERS, 3)
while(!all(tmp %in% vec)) {
tmp <- sample(LETTERS, 3)
count <- count + 1
}
return(count)
}
You can use replicate to repeat this n times and get the mean.
mean(replicate(10000, draw()))
I think I got it.
draw <- function(){
letters <- sample(LETTERS)
a <- which(letters == "A")
b <- which(letters == "B")
c <- which(letters == "C")
select <- c(a,b,c)
max(select)
}
mean(replicate(10000, draw()))
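The simulation can be checked against a closed form: when k target items sit at random positions among n items, the expected position of the last one is k(n + 1)/(k + 1), which gives 3 * 27 / 4 = 20.25 for three letters and 13.5 for one. The sketch below generalises the function above to any set of target letters; draws_needed and the seed are illustrative choices.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

draws_needed <- function(targets) {
  shuffled <- sample(LETTERS)
  # Position of the last target to appear = number of draws required
  max(match(targets, shuffled))
}

simulated <- mean(replicate(10000, draws_needed(c("A", "B", "C"))))

# Closed form for the expected maximum position of k targets among n items
k <- 3
n <- length(LETTERS)
exact <- k * (n + 1) / (k + 1)
```

Comparing simulated with exact is a quick sanity check that the simulation models the intended experiment.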

Numeric vs Factors & IF Statements

I am trying to create a function for gender distribution. Is there a way to define a letter as something other than as.factor? I would like to operate func(F) instead of func("F"). Or should I go numeric: func(0), func(1), func(2)?
I also finished off the statement with an else that is designed to operate when left blank, but does not. If I whittle down the function to not include an IF statement a blank variable works fine:
genderDist <- function(){
cat("Female:", sum(voterData$GENDER == "F"))
}
Thanks in advance! Cheers!
Full Statement:
genderDist <- function(x){
if (x == "F"){
cat("Female:", sum(voterData$GENDER == "F"))
}
else if (x == "M"){
cat("Male:", sum(voterData$GENDER == "M"))
}
else if(x == "U"){
cat("Unknown:", sum(voterData$GENDER == ""))
}
else{
cat("Female:", sum(voterData$GENDER == "F"))
cat("Male:", sum(voterData$GENDER == "M"))
cat("Unknown:", sum(voterData$GENDER == ""))
}
}
Desired results:
genderDist(F) gives count of Females
genderDist(M) gives count of Males
genderDist(U) gives count of Unknown
genderDist() gives count for all the above
There are several possibilities for coding gender, besides factor:
1. as character, not as factor. You will still have to call your function like func("F").
2. You already thought of using numeric yourself. Disadvantage is that it may be unclear if 1 is male or female.
3. The best option IMHO would be to go binary. Name your column "male" and use TRUE, FALSE and NA for unknown. The binary also works great in your if statement. Start with if(is.na(male)) ... ; else if(male).
EDIT
But to achieve your desired outcome, the coding of gender is not the issue. I would take this approach:
#First, define variables Fe, Ma and Un
#WARNING: Do NOT USE 'F', as 'F' is an abbr. for 'FALSE'!!
Fe <- "F"
Ma <- "M"
Un <- "U"
#now define a lookup data frame for convenience
LT <- data.frame(code = c(Fe,Ma,Un), name = c("Female","Male","Unknown"), stringsAsFactors = FALSE)
# then define your function; no ifelse is needed
genderDist <- function(x){
cat(LT[LT$code == x,"name"], sum(voterData$GENDER == x))
}
Introduce some fake data:
voterData <- data.frame(GENDER= c("F","F","F","M","M","U"))
Then run function:
> genderDist(Fe)
Female 3
> genderDist(Ma)
Male 2
> genderDist(Un)
Unknown 1
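The question also asked for a call with no argument to report all counts. One way to sketch that is a default of NULL that is checked with is.null; gender_counts and labels are hypothetical names, and the data are the fake voterData from above (with "U" for unknown).

```r
voterData <- data.frame(GENDER = c("F", "F", "F", "M", "M", "U"),
                        stringsAsFactors = FALSE)

gender_counts <- function(x = NULL) {
  # Map each code to its display name
  labels <- c(F = "Female", M = "Male", U = "Unknown")
  # Count the rows for every code
  counts <- vapply(names(labels),
                   function(code) sum(voterData$GENDER == code),
                   integer(1))
  names(counts) <- labels
  # No argument: return all counts; otherwise only the requested one
  if (is.null(x)) counts else counts[labels[[x]]]
}

gender_counts("F")  # count of females only
gender_counts()     # counts for all three categories
```

Returning a named vector instead of printing with cat keeps the counts usable in later computations.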

vectorized ifelse Rstudio

Writing a vectorized ifelse() I am trying to create and assign a new variable back to the data frame.
set.seed(1)
heights <- data.frame(
height_ft = sample( seq(from=5.5, to=6.1, length=10) , 50, replace=T),
gender = sample(c("M","F"),50, replace=T) )
Here are my attempts:
y = ifelse(gender = "F", 1,0)
##ERROR
if (gender = "F" & under_rep = 1){ print ("1") }
else if (gender = "F" & under_rep = 0) { print ("0") }
##ERROR
As @BondedDust pointed out, the error message was not included in the post, which would have been helpful. But it's not hard to reproduce the error from the code.
The error is Error in ifelse(gender = "F", 1, 0) : unused argument (gender = "F").
The "unused argument" in the error message arises because, inside a function call, gender = "F" is read as passing an argument named gender, and ifelse has no such argument. This is the point @Richard_Scriven makes: the comparison should use == instead of =.
Even with ==, R would not find gender anywhere in its environment, because the heights data frame where it resides is not referenced, as in heights$gender.
Lastly, assigning the new variable back into the data frame is not addressed: the result is stored in y instead of heights$y.
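Putting these points together, a corrected version of the first attempt might look like the sketch below. It uses the data from the question; the under_rep variable from the second attempt is not defined in the post, so it is left out.

```r
set.seed(1)
heights <- data.frame(
  height_ft = sample(seq(from = 5.5, to = 6.1, length.out = 10), 50, replace = TRUE),
  gender = sample(c("M", "F"), 50, replace = TRUE),
  stringsAsFactors = FALSE)

# Compare with ==, reference the column through the data frame,
# and assign the result back as a new column
heights$y <- ifelse(heights$gender == "F", 1, 0)
```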
