vectorized ifelse Rstudio - r

Writing a vectorized ifelse()

I am trying to create a new variable and assign it back to the data frame.
set.seed(1)
heights <- data.frame(
  height_ft = sample(seq(from = 5.5, to = 6.1, length.out = 10), 50, replace = TRUE),
  gender = sample(c("M", "F"), 50, replace = TRUE)
)
Here are my attempts:
y = ifelse(gender = "F", 1,0)
##ERROR
if (gender = "F" & under_rep = 1){ print ("1") }
else if (gender = "F" & under_rep = 0) { print ("0") }
##ERROR

As @BondedDust pointed out, the post did not include the error message, which would have been helpful. But it's not hard to reproduce the error from the code.
The error is Error in ifelse(gender = "F", 1, 0) : unused argument (gender = "F").
The "unused argument" arises because, inside a function call, = names an argument: R reads gender = "F" as an argument called gender, and ifelse() has no such parameter (its arguments are test, yes and no). Even with that fixed, R would not find gender in its environment, because the column lives in the heights data frame and has to be referenced as heights$gender.
So, as @Richard_Scriven points out, the conditional is not written correctly either: the comparison should use == instead of =.
Lastly, the new variable is not assigned back into the data frame, because the result goes to y instead of heights$y.
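Putting the three fixes together (heights$gender, ==, and assigning to heights$y), a minimal sketch of the corrected call could look like this. It reuses the data from the question and assumes a 1/0 coding for "F" versus everything else:

```r
set.seed(1)
heights <- data.frame(
  height_ft = sample(seq(from = 5.5, to = 6.1, length.out = 10), 50, replace = TRUE),
  gender = sample(c("M", "F"), 50, replace = TRUE)
)

# Compare with ==, reference the column through the data frame,
# and store the result as a new column of that data frame
heights$y <- ifelse(heights$gender == "F", 1, 0)

head(heights)
```

Because ifelse() is vectorized, the whole column is processed in one call and no loop or if/else chain is needed.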

Related

build a call using a recursive function

I'm trying to write a recursive function that builds a nested ifelse call. I do realize there are much better approaches than nested ifelse, e.g., dplyr::case_when and data.table::fcase, but I'm trying to learn how to approach such problems with metaprogramming.
The following code builds out the nested ifelse, but I'm struggling to substitute data with the actual supplied value, in this case my_df.
If I replace quote(data) with substitute(data), it only works for the first ifelse, but after entering the next iteration, it turns into data.
I think something like pryr::modify_lang could solve this after the fact, but I think there's probably a base R solution someone knows.
my_df <- data.frame(group = letters[1:3],
                    value = 1:3)
build_ifelse <- function(data, by, values, iter = 1) {
  x <- call("ifelse",
            call("==",
                 call("[[", quote(data), by),
                 values[iter]),
            1,
            if (iter != length(values)) build_ifelse(data, by, values, iter = iter + 1) else NA)
  return(x)
}
build_ifelse(data = my_df, by = "group", values = letters[1:3])
# ifelse(data[["group"]] == "a", 1, ifelse(data[["group"]] == "b",
# 1, ifelse(data[["group"]] == "c", 1, NA)))
Thanks for any input!
Edit:
I found this question/answer: https://stackoverflow.com/a/59242109/9244371
Based on that, I found a solution that seems to work pretty well:
build_ifelse <- function(data, by, values, iter = 1) {
  x <- call("ifelse",
            call("==",
                 call("[[", quote(data), by),
                 values[iter]),
            1,
            if (iter != length(values)) build_ifelse(data, by, values, iter = iter + 1) else NA)
  x <- do.call(what = "substitute",
               args = list(x,
                           list(data = substitute(data))))
  return(x)
}
build_ifelse(data = my_df, by = "group", values = letters[1:3])
# ifelse(my_df[["group"]] == "a", 1, ifelse(my_df[["group"]] ==
# "b", 1, ifelse(my_df[["group"]] == "c", 1, NA)))
eval(build_ifelse(data = my_df, by = "group", values = letters[1:3]))
# [1] 1 1 1
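Another base R possibility, sketched here as an alternative rather than as the approach from the linked answer, is to capture the caller's expression once at the top level and let an inner helper do the recursion, building each level with bquote(). The names build_ifelse2, build and data_expr are made up for this illustration:

```r
my_df <- data.frame(group = letters[1:3],
                    value = 1:3)

build_ifelse2 <- function(data, by, values) {
  # Capture the caller's expression (here the symbol my_df) exactly once,
  # before any recursion can turn it into the symbol `data`
  data_expr <- substitute(data)

  build <- function(iter) {
    rest <- if (iter < length(values)) build(iter + 1) else NA
    # .() splices evaluated values into the quoted template
    bquote(ifelse(.(data_expr)[[.(by)]] == .(values[iter]), 1, .(rest)))
  }

  build(1)
}

expr <- build_ifelse2(data = my_df, by = "group", values = letters[1:3])
print(expr)
eval(expr)
```

The recursion never passes data along, so substitute() only ever sees the original argument.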
There is a base function, switch, that can deliver sequential testing and results similar to dplyr::case_when, at least when used with a loop wrapper. It's not well documented. It is really two different functions: one that expects a numeric input for its classification variable and another that expects character values. I can never remember its name, so I typically need to remind myself that it is referenced on the ?Control page. Since you're using character values, here goes. (I changed the outputs so you can see that some degree of substitution is occurring and that there is an "otherwise" option.)
sapply( my_df$group, switch, a=4, b=5, d=6, NA)
a b c
4 5 NA
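To make the two behaviours of switch concrete, here is a small sketch with made-up values. With a character first argument the alternatives are matched by name and a trailing unnamed value acts as the default; with a numeric first argument they are selected by position:

```r
# Character input: matched by name, unnamed last value is the default
by_name    <- switch("b", a = 4, b = 5, 6)   # 5
by_default <- switch("z", a = 4, b = 5, 6)   # 6, no name matched

# Numeric input: selected by position, no default mechanism
by_position  <- switch(2, "first", "second", "third")   # "second"
out_of_range <- switch(4, "first", "second", "third")   # NULL (returned invisibly)

print(by_name)
print(by_default)
print(by_position)
print(is.null(out_of_range))
```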

Error in summary.formula : matrix variables must have column dimnames

I am new to R, and I can't fix the bug after searching for one hour. It seems that there's no similar problem posted before.
I followed the instructions at https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ and want to test the proportional odds assumption for my data.
Following is my code:
sf <- function(y) {
  # Each label now matches the threshold it describes
  c('Y>=1' = qlogis(mean(y >= 1)),
    'Y>=2' = qlogis(mean(y >= 2)),
    'Y>=3' = qlogis(mean(y >= 3)),
    'Y>=4' = qlogis(mean(y >= 4)),
    'Y>=5' = qlogis(mean(y >= 5)))
}
(s <- with(dat, summary(as.numeric(implied_rating) ~ GDP + importance, fun = sf)))
But the error occurs.
"Error in summary.formula(matrix(as.numeric(implied_rating)) ~
matrix(GDP) + : matrix variables must have column dimnames"
What should I do?
Many thanks in advance!
Solved. I thought dimnames was the same as colnames...
I just manually set the dimnames for every column.
But I still wonder if there's a better way to solve the problem.
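The error message shows the variables wrapped in matrix(), which suggests they were stored as one-column matrices without column names. As a minimal base R sketch (the gdp values are invented, and whether the original data looked like this is an assumption), either give the matrix a column name or convert it to a plain vector before building the formula:

```r
# Hypothetical data: scale() returns a one-column matrix without column names
gdp <- scale(c(1.2, 3.4, 2.2, 5.0, 4.1))
is.matrix(gdp)           # TRUE
is.null(colnames(gdp))   # TRUE

# Option 1: keep the matrix, but name its column
gdp_named <- gdp
colnames(gdp_named) <- "GDP"

# Option 2: drop the matrix structure altogether
gdp_vec <- as.numeric(gdp)
is.matrix(gdp_vec)       # FALSE
```

Option 2 is usually the simpler choice when the variable is conceptually a single column.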

R programming - not deleting the right column

I am pasting my code here.
I am following an online course in R and was trying to automate a multiple-variable regression. I have tried to check what's going on: at the beginning it works, but when it gets to the last two variables, it enters a loop and does not eliminate them, even though it enters the if.
At the end, I get this error:
Error in if (maxVar > sl) { : missing value where TRUE/FALSE needed
Here is the code
backwardElimination <- function(training, sl) {
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ .,
                     data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  if (maxVar > sl) {
    for (j in c(1:numVar)) {
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}
Thanks in advance
Edit: this is the rest of my code
#importing dataset
dataset = read.csv('50_Startups.csv')
# Encoding categorical data
dataset$State = factor(dataset$State,
levels = c('New York', 'California', 'Florida'),
labels = c(1, 2, 3))
#splitting in train / test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 4/5)
trainingSet = subset(dataset, split == TRUE)
testSet = subset(dataset, split == FALSE)
#Transforming state in dummy variables
trainingSet$State = factor(trainingSet$State)
dummies = model.matrix(~trainingSet$State)
trainingSet = cbind(trainingSet,dummies)
profit = trainingSet$Profit
trainingSet = trainingSet[, -4]
trainingSet = trainingSet[, -4]
trainingSet = cbind(trainingSet,profit)
#calling the function
SL = 0.05
backwardElimination(trainingSet, SL)
This error indicates that the condition in your if statement is NA instead of TRUE or FALSE.
if (NA) {}
## Error in if (NA) { : missing value where TRUE/FALSE needed
Either p contains NA (or NaN), or sl is NA.
The intercept is also fed back into the next modeling step; you need to get rid of it before moving to the next iteration.
I can replicate your error with R's built-in dataset state.x77:
dataset <- as.data.frame(state.x77)
dataset$State <- rownames(dataset)
dataset$profit <- rnorm(nrow(dataset))
backwardElimination <- function(training, sl) {
  if (!"profit" %in% names(training)) return(NULL)
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ .,
                     data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  #print(funzRegressor)
  if (maxVar > sl) {
    for (j in c(1:numVar)) {
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}
backwardElimination(dataset, 0.05)
Some of your betas are NA, and all the p-values become NaN. Do you need to regress within states? If not, you can remove the State column to get rid of the error.
There will be another error when you reach the boundary case in your recursion, which you can fix :)
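For comparison, here is one possible iterative version, written as a sketch rather than as the course's reference solution. It assumes all predictors are numeric columns with syntactic names, so that coefficient names match column names. It drops predictors by name instead of by position, ignores the intercept when looking for the largest p-value, and stops when the p-values are undefined. The function name backward_eliminate and the demo data are invented for this illustration:

```r
backward_eliminate <- function(training, sl, response = "profit") {
  predictors <- setdiff(names(training), response)
  repeat {
    rhs <- if (length(predictors) > 0) paste(predictors, collapse = " + ") else "1"
    fit <- lm(as.formula(paste(response, "~", rhs)), data = training)

    # p-values of all coefficients except the intercept, kept with their names
    coefs <- coef(summary(fit))
    keep <- rownames(coefs) != "(Intercept)"
    p <- setNames(coefs[keep, 4], rownames(coefs)[keep])

    # Stop when nothing is left to test, p-values are NA/NaN, or all pass
    if (length(p) == 0 || anyNA(p) || max(p) <= sl) return(fit)

    worst <- names(p)[which.max(p)]
    # Guard for the stated assumption (e.g. factor columns would break it)
    if (!worst %in% predictors) stop("coefficient '", worst, "' is not a column name")
    predictors <- setdiff(predictors, worst)
  }
}

# Made-up demo data: only x1 actually drives profit
set.seed(42)
n <- 100
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$profit <- 3 * demo$x1 + rnorm(n)

fit <- backward_eliminate(demo, sl = 0.05)
print(summary(fit))
```

A loop avoids the boundary case of the recursion entirely, and the fitted model that is returned always corresponds to the final set of predictors.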

Numeric vs Factors & IF Statements

I am trying to create a function for gender distribution. Is there a way to define a letter as something other than as.factor? I would like to operate func(F) instead of func("F"). Or should I go numeric: func(0), func(1), func(2)?
I also finished off the statement with an else that is designed to operate when left blank, but does not. If I whittle down the function to not include an IF statement a blank variable works fine:
genderDist <- function(){
cat("Female:", sum(voterData$GENDER == "F"))
}
Thanks in advance! Cheers!
Full Statement:
genderDist <- function(x){
  if (x == "F"){
    cat("Female:", sum(voterData$GENDER == "F"))
  }
  else if (x == "M"){
    cat("Male:", sum(voterData$GENDER == "M"))
  }
  else if(x == "U"){
    cat("Unknown:", sum(voterData$GENDER == ""))
  }
  else{
    cat("Female:", sum(voterData$GENDER == "F"))
    cat("Male:", sum(voterData$GENDER == "M"))
    cat("Unknown:", sum(voterData$GENDER == ""))
  }
}
Desired results:
genderDist(F) gives count of Females
genderDist(M) gives count of Males
genderDist(U) gives count of Unknown
genderDist() gives count for all the above
There are several possibilities for coding gender, besides factor:
1. as character, not as factor. You will still have to call your function like func("F").
2. You already thought of using numeric yourself. Disadvantage is that it may be unclear if 1 is male or female.
3. The best option IMHO would be to go binary. Name your column "male" and use TRUE, FALSE and NA for unknown. The binary also works great in your if statement. Start with if(is.na(male)) ... ; else if(male).
EDIT
But to achieve your desired outcome, the coding of gender is not the issue, I would take this approach:
#First, define variables Fe, Ma and Un
#WARNING: Do NOT USE 'F', as 'F' is an abbr. for 'FALSE'!!
Fe <- "F"
Ma <- "M"
Un <- "U"
#now define a lookup dataframe for convenience
LT <- data.frame(code = c(Fe,Ma,Un), name = c("Female","Male","Unknown"), stringsAsFactors = FALSE)
# then define your function; no if/else chain is needed
genderDist <- function(x){
  cat(LT[LT$code == x,"name"], sum(voterData$GENDER == x))
}
Introduce some fake data:
voterData <- data.frame(GENDER= c("F","F","F","M","M","U"))
Then run function:
> genderDist(Fe)
Female 3
> genderDist(Ma)
Male 2
> genderDist(Un)
Unknown 1
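The blank call genderDist() from the question is not covered above. One possible sketch gives x a default of NULL and treats that as "report every code". The lookup vector gender_labels, the "U" coding for unknown (as in the fake data), and returning a named vector instead of calling cat() are assumptions made for this illustration:

```r
voterData <- data.frame(GENDER = c("F", "F", "F", "M", "M", "U"))

# Hypothetical lookup from code to display name
gender_labels <- c(F = "Female", M = "Male", U = "Unknown")

genderDist <- function(x = NULL) {
  # No argument supplied: report all known codes
  codes <- if (is.null(x)) names(gender_labels) else x
  counts <- vapply(codes, function(code) sum(voterData$GENDER == code), integer(1))
  names(counts) <- gender_labels[codes]
  counts
}

genderDist("F")
genderDist()
```

The codes still have to be passed as strings. Passing a bare F would hand the function the logical value FALSE.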

Avoid loop to improve r code

I have a data frame with millions of rows and ten columns.
My code seems to work but never finishes, I think because of the for loop and the if statement.
I want to write it differently, but I'm stuck.
df <- data.frame(x = 1:5,
                 y = c("a", "a", "b", "b", "c"),
                 z = sample(5))
for (i in seq_along(df$x)){
  if (df$y[i] == df$y[i+1] & df$y[i] == "a"){
    df$status[i] <- 1
  } else {
    df$status[i] <- "ok"
  }
}
In fact, you can replace the whole loop by a vectorised ifelse:
df$status = ifelse(df$y == df$y[-1] & df$y == 'a', 1, 'ok')
This code will give you a warning, unlike the for loop. However, the warning is actually correct and also concerns your code: you are reading past the last element of df$y when doing df$y[i + 1].
You can make this warning go away (and make the code arguably clearer) by borrowing the lead function from dplyr (simplified):
lead = function (x, n = 1, default = NA) {
  if (n == 0)
    return(x)
  `attributes<-`(c(x[-seq_len(n)], rep(default, n)), attributes(x))
}
With this, you can rewrite the code ever so slightly and get rid of the warning:
df$status = ifelse(df$y == lead(df$y) & df$y == 'a', 1, 'ok')
It’s a shame that this function doesn’t seem to exist in base R.
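For a plain character column, the same shift can be sketched in base R without a helper, by dropping the first element and padding with NA. This assumes y is a character vector (not a factor); the name next_y is made up for this illustration:

```r
df <- data.frame(x = 1:5,
                 y = c("a", "a", "b", "b", "c"),
                 stringsAsFactors = FALSE)

# Shift y up by one position; the last row has no successor
next_y <- c(df$y[-1], NA)

# The !is.na() guard keeps the last row from producing NA
df$status <- ifelse(!is.na(next_y) & df$y == next_y & df$y == "a", 1, "ok")

print(df)
```

Note that mixing 1 and "ok" makes ifelse() return a character vector, so the 1 is stored as "1".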

Resources