If Else Statement in R Error Message - r

This is my first time using an if-else statement in R. I have a column (in a data frame) named 'Performance' that has percentages (some of which are estimates). I have created a new column named 'Estimate' which takes the last character of the 'Performance' column. Now I want to fill that column using the following condition: if 'Estimate' = '*' then 'Estimate' = 'YES', else 'Estimate' = 'NO'. I just want to keep the column name 'Estimate'. The statement I have written works, but I am getting a warning message that says:
"Warning message: In if (data.set$Estimate == "*") { : the condition has length > 1 and only the first element will be used"
Here is my statement:
data.set$Estimate <- if (data.set$Estimate == '*') {data.set$Estimate = 'YES'} else{data.set$Estimate = 'NO'}
Can someone please explain why I am getting this error message, and/or what I would need to change to not get it? Any help is much appreciated.

About the warning: the statement isn't actually working. A single if statement expects a single logical value, yet you are comparing data.set$Estimate (a vector) to "*" (a single element). R therefore does exactly what the warning says and uses only the first element of data.set$Estimate, so every row gets the same result. (Since R 4.2.0 this is an error rather than a warning.) Your thought process was good, the implementation just didn't quite get there!
ifelse() allows for vectorization (i.e. a vector of booleans), so it does what you thought the if-else should do. No need for sapply in this case, we can just vectorize.
Replace vect below with your data.set$Estimate:
vect <- c("*", "X", "*", "*", "10YEAH", "WHAT", "BlurbBlurb")
vect <- ifelse(vect == "*", "Yes", "No")
vect
#[1] "Yes" "No" "Yes" "Yes" "No" "No" "No"
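Applied to a data frame, the same idea can be sketched as follows. The data here is made up for illustration; only the column names 'Performance' and 'Estimate' come from the question.

```r
# Hypothetical data; the column names follow the question, the values are made up.
data.set <- data.frame(
  Performance = c("85%", "90%*", "72%*", "64%"),
  stringsAsFactors = FALSE
)

# Last character of each Performance value, as described in the question.
n <- nchar(data.set$Performance)
data.set$Estimate <- substr(data.set$Performance, n, n)

# Vectorised test: one "YES"/"NO" per row, no warning.
data.set$Estimate <- ifelse(data.set$Estimate == "*", "YES", "NO")
data.set$Estimate
# [1] "NO"  "YES" "YES" "NO"
```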

Your Estimate column has more than one entry, and several of them contain "*". An if statement, however, uses only a single logical value, for example:
any(data.set$Estimate=="*")
Since you want to replace the entries equal to "*" with "YES" (and all others with "NO"), you can use:
my.fun1 <- function(x) {
  ifelse(x == "*", "YES", "NO")
}
data.set$Estimate <- sapply(data.set$Estimate, my.fun1)

Related

Detecting "/" in R

I am working in RStudio with a data frame that has a column containing dates. Some rows contain years such as 1687, others contain dates in a format such as 12/12/23, and others contain text such as "First half of 19e century". I am trying to extract numerical values from this column.
My approach is to extract the numerical values from all rows except the ones that contain "/", because I want to keep those as they are. Here is my code:
detect<-str_detect("/", MR_all$period)
MR_all$period[detect=FALSE]
MR_all['period_3']=dplyr::case_when(MR_all$period[detect=FALSE]~gsub("\\D", "",MR_all$period))
This does not work because the detect function fails to detect the slash and prints all observations. I'd really appreciate your help in writing a function that detects '/'.
1. You're using a single = for comparison, as in [detect=FALSE], which is going to return nothing, c.f., mtcars$cyl[detect=FALSE] (whether or not detect was previously assigned; it is neither checked nor used if it exists).
Why? Because inside [..] a single = does not compare anything: R treats detect=FALSE as an argument named detect with the value FALSE, and [ ignores argument names. The [..] therefore receives a single FALSE. Effectively, MR_all$period[FALSE] is what you're telling R you want to do, and it happily returns a 0-length vector (likely character(0), but I'm guessing ... see #2).
2. Since you are using str_detect("/", MR_all$period), this suggests that MR_all$period is a column of strings (character class), so your next use of MR_all$period[detect==FALSE] ~ gsub(..) seems wrong: the left-hand side (LHS) of the ~ pairs in case_when must resolve to logical, so a_string ~ gsub(..) is wrong.
3. Further, even if MR_all$period[detect==FALSE] did resolve to something logical so that case_when could continue, we then have the problem that this vector is shorter than the number of rows in the original frame, yet you're assigning this shorter vector back into the entire MR_all['period_3']. If you're lucky this will warn or fail, but if you aren't it will silently recycle data into your frame (which is a logic problem and a chief complaint of mine about recycling arguments).
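A small sketch of the difference between [detect = FALSE] and [detect == FALSE]. The vector period is a hypothetical stand-in for MR_all$period, and base R's grepl is used here only so the snippet runs without extra packages.

```r
# Hypothetical values standing in for MR_all$period.
period <- c("1687", "12/12/23", "First half of 19e century")
detect <- grepl("/", period, fixed = TRUE)   # FALSE TRUE FALSE

# `=` inside [ ] is not a comparison; the index that reaches `[` is just FALSE.
period[detect = FALSE]
# character(0)

# Comparing with == (or negating with !) indexes by the logical vector itself.
period[detect == FALSE]   # same as period[!detect]
# [1] "1687"                      "First half of 19e century"
```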
Ultimately, I think you need one of the following:
Assign the gsubed version of period only for those rows where detect is FALSE, returning NA for all other rows:
MR_all['period_3'] <- dplyr::case_when(
  detect == FALSE ~ gsub("\\D", "", MR_all$period)
)
Same as above, but defaulting to the original period for all other rows:
MR_all['period_3'] <- dplyr::case_when(
  detect == FALSE ~ gsub("\\D", "", MR_all$period),
  TRUE ~ MR_all$period
)
or the preferred method as of dplyr 1.1.0:
MR_all['period_3'] <- dplyr::case_when(
  detect == FALSE ~ gsub("\\D", "", MR_all$period),
  .default = MR_all$period
)
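One more thing worth checking is the argument order of the detection call: stringr::str_detect takes the string first and the pattern second, so the slash test would be str_detect(MR_all$period, "/"). The sketch below shows the whole step in base R so that it runs without stringr or dplyr; the data values and the name has_slash are made up for illustration.

```r
# Hypothetical values standing in for MR_all$period.
MR_all <- data.frame(
  period = c("1687", "12/12/23", "First half of 19e century"),
  stringsAsFactors = FALSE
)

# TRUE where the value contains a literal slash.
has_slash <- grepl("/", MR_all$period, fixed = TRUE)

# Keep slash dates as they are; strip non-digits everywhere else.
MR_all$period_3 <- ifelse(
  has_slash,
  MR_all$period,
  gsub("\\D", "", MR_all$period)
)
MR_all$period_3
# [1] "1687"     "12/12/23" "19"
```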

How to replace NA's with blank value?

I am pretty new to R and I wonder if I can replace NA values (which look like strings) with a blank, i.e. nothing.
It is easy when the entire table is character; however, my table contains doubles as well, so when I try to run
f <- as.data.frame(replace(df, is.na(df), ""))
or
df[is.na(df)] <- ""
Neither works.
The error looks like
Assigned data `values` must be compatible with existing data.
Error occurred for column `ID`.
Can't convert <character> to <double>.
and I understand it, but I really need the IDs, as well as any other empty cell in the table (character, double or other), to show as blank. The table is later connected to a BI tool, where I can't present "NA", just a blank, for the sake of clarity.
If your column is of type double (numbers), you can't replace NAs (R's internal representation of missing values) with a character string. And "" IS a character string: it may look empty, but it is still of type character.
So you need to choose: convert your whole column to type character, or leave the missing values as NA.
EDIT:
If you really want to convert your numeric column to character, you can just use as.character(MYCOLUMN). But I think what you really want is
to tell your exporting function how to treat NAs, which is easy, e.g. write.csv(df, na = ""). Also check the help page with ?write.csv.
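Both options can be sketched as follows. This assumes a plain data.frame with made-up values; the names out_file and df_chr are hypothetical. Option 1 keeps the numeric column numeric inside R and only blanks the missing values in the exported file, while option 2 gives up the numeric type.

```r
# Hypothetical frame with a numeric and a character column.
df <- data.frame(
  ID = c(1, NA, 3),
  name = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# Option 1: keep the column types and blank the NAs only when exporting.
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, na = "", row.names = FALSE)
readLines(out_file)

# Option 2: convert every column to character, then replace NA with "".
df_chr <- df
df_chr[] <- lapply(df_chr, as.character)
df_chr[is.na(df_chr)] <- ""
df_chr
```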

Is there a way to replace the value in one column with the value from another column and then 'blank' out the value taken to replace?

I have the data frame pictured below. What I am trying to do is, in the case where the '2nd_Booking_Deadline' column is blank (""), replace it with the value in '3rd_Booking_Deadline'. Once '2nd_Booking_Deadline' is replaced with the value from '3rd_Booking_Deadline', I need to blank ("") out '3rd_Booking_Deadline'.
I have written the piece of code below, which seems to accomplish the first part of that task, but it returns a warning that makes me a bit nervous:
Warning message:
In AAR_Combined_w_LL$`2nd_Booking_Deadline`[AAR_Combined_w_LL$`2nd_Booking_Deadline` == :
number of items to replace is not a multiple of replacement length
Here is what I have come up with so far:
AAR_Combined_w_LL$`2nd_Booking_Deadline`[AAR_Combined_w_LL $`2nd_Booking_Deadline` == ""] <- AAR_Combined_w_LL$`3rd_Booking_Deadline`
Any thoughts on whether that warning is serious, and whether there is a way to complete both tasks at the same time, would be super helpful.
When replacing, you need to subset both sides of the assignment. Try:
# Get the indices where `2nd_Booking_Deadline` is blank
inds <- AAR_Combined_w_LL$`2nd_Booking_Deadline` == ""
# Replace those blank values with the values from the corresponding rows
AAR_Combined_w_LL$`2nd_Booking_Deadline`[inds] <- AAR_Combined_w_LL$`3rd_Booking_Deadline`[inds]
# Change `3rd_Booking_Deadline` to a blank string in those rows
AAR_Combined_w_LL$`3rd_Booking_Deadline`[inds] <- ""
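As for whether the warning is serious: it is, because the unsubset right-hand side hands over values from the wrong rows. A minimal sketch with a made-up frame (the short name AAR and all values are hypothetical; only the column names come from the question):

```r
# Hypothetical frame; column names follow the question, values are made up.
AAR <- data.frame(
  `2nd_Booking_Deadline` = c("2021-01-05", "", "", "2021-03-01"),
  `3rd_Booking_Deadline` = c("2021-02-01", "2021-02-10", "", "2021-04-01"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

inds <- AAR$`2nd_Booking_Deadline` == ""

# Unsubset right-hand side: 2 slots to fill, 4 values offered -> warning,
# and the values are taken from rows 1 and 2 rather than from rows 2 and 3.
# AAR$`2nd_Booking_Deadline`[inds] <- AAR$`3rd_Booking_Deadline`

# Subset both sides so each blank row takes the value from its own row.
AAR$`2nd_Booking_Deadline`[inds] <- AAR$`3rd_Booking_Deadline`[inds]
AAR$`3rd_Booking_Deadline`[inds] <- ""
AAR
```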

Comparing Character Vectors Per Row in R if Same True/False

I'm missing something very simple here.
I tried referring to This Post
I'm trying to compare 2 character vectors to see if they equal each other on a per row basis, just expecting a TRUE / FALSE result per row. (Trying to find the FALSE)
all(data.frame(dBase$process_name == dBase$import_process))
When I run the above I receive the result:
Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables
I tried this as well but it seems to be overall, not per row.
identical(dBase$process_name, dBase$import_process)
So is there an alternative for comparing characters / strings to see if they are the same, and for pulling up the rows where FALSE occurs?
If dBase is the name of an existing data.frame, then you can do the following
to see all the rows where process_name and import_process are not identical.
Notice that I use the != to get only those where they are not equal.
dBase[ dBase$process_name != dBase$import_process, ]
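To get the TRUE / FALSE result per row that the question asks for, the comparison can be kept as its own vector. The sketch below uses made-up data and a hypothetical name same; it also shows one way to handle rows where a value is missing, since == returns NA there.

```r
# Hypothetical data; the column names follow the question.
dBase <- data.frame(
  process_name   = c("load", "clean", "export", NA),
  import_process = c("load", "Clean", "export", "merge"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE (or NA) per row.
same <- dBase$process_name == dBase$import_process
same
# [1]  TRUE FALSE  TRUE    NA

# Rows that differ; which() drops the NA comparison instead of
# returning an all-NA row for it.
dBase[which(!same), ]
```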

Unexpected behavior using -which() in R when the search term is not found

I have been using the R which function to remove rows from a data frame. I recently discovered that if the search term is NOT in the data.frame, the result is an empty character.
# 1: returns A-Q, S-Z (as expected)
LETTERS[-which(LETTERS == "R")]
# 2: returns "character(0)" (not what I would expect)
LETTERS[-which(LETTERS == "1")]
# 3: returns A-Z (expected)
LETTERS[which(LETTERS != "1")]
# 4: returns A-Q, S-Z (expected)
LETTERS[which(LETTERS != "R")]
Is the second example the expected behavior for -which() when the search term is not found? I have already switched my code to use the syntax in example 4, which seems safer, but I am just curious.
That is a well-known pitfall. When nothing matches the logical test, which() returns integer(0), and then "[" returns nothing instead of returning everything, which is what one would expect. You can use:
LETTERS[ ! LETTERS == "1" ]
LETTERS[ ! LETTERS %in% "1" ]
There is another gotcha to be aware of, and it is the one that makes me choose to use which(). With logical indexing, an NA value inside "[" will return a row (of NAs). I generally do not want that, so I use DFRM[ which(logical), ] although this seems to bother some people who say it is not needed. I just think they are working with small datasets and infrequently encounter the annoyance of seeing tens of thousands of NA-induced useless lines of output on their console. I never use the negated which version though.
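A minimal illustration of that NA behaviour, using a made-up numeric vector x rather than a data frame:

```r
x <- c(1, NA, 3)

# The comparison yields FALSE NA TRUE, and the NA index produces an NA element.
x[x > 2]
# [1] NA  3

# which() keeps only the positions that are TRUE, so the NA is dropped.
x[which(x > 2)]
# [1] 3
```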
Because of this:
which(LETTERS == '-1')
## integer(0)
and this:
(1:2)[integer(0)]
## integer(0)
Instead of #4, use this:
LETTERS[LETTERS != "R"]
In example 2, which() returns integer(0) (a zero-length integer vector) because no values are TRUE. A negated zero-length vector (-integer(0)) is still a zero-length vector. So you're essentially asking for zero elements of LETTERS, and that is exactly what you get: character(0).
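The pitfall and two safer alternatives, side by side. The name hits is hypothetical, and the setdiff() variant is just one possible way to keep an integer index; it is not taken from the answers above.

```r
hits <- which(LETTERS == "1")
hits
# integer(0)

# Negating an empty index is still an empty index, so nothing is selected.
length(LETTERS[-hits])
# [1] 0

# Logical negation keeps all 26 letters when nothing matches.
length(LETTERS[LETTERS != "1"])
# [1] 26

# If an integer index is needed, setdiff() on positions is also safe.
length(LETTERS[setdiff(seq_along(LETTERS), hits)])
# [1] 26
```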

Resources