Detecting "/" in R

I am working in RStudio with a data frame that has a column containing dates. Some rows contain years such as 1687, others contain dates in a format such as 12/12/23, and others contain text such as "First half of 19e century". I am trying to extract numerical values from this column.
My approach is to extract numerical values from all rows except the ones that contain "/", because I want to keep those as they are. Here is my code:
detect<-str_detect("/", MR_all$period)
MR_all$period[detect=FALSE]
MR_all['period_3']=dplyr::case_when(MR_all$period[detect=FALSE]~gsub("\\D", "",MR_all$period))
This does not work because the detect function fails to detect the slash and prints all observations. I'd really appreciate your help in writing a function that detects '/'.

1. You're using a single = where you mean a comparison, as in [detect=FALSE], which is going to return nothing; cf. mtcars$cyl[detect=FALSE] (this behaves the same with or without detect previously assigned, because the existing object is neither checked nor used).
Why? Inside a call (and [ is a function call), a single = names an argument; it does not compare anything, and it does not create or modify an object named detect. Because [ ignores argument names and matches its indices by position, the [..] simply receives a single FALSE. Effectively, MR_all$period[FALSE] is what you're telling R you want to do, and it happily returns a 0-length vector (likely character(0), but I'm guessing ... see #2).
2. Since you are using str_detect("/", MR_all$period), this suggests that MR_all$period is a column of strings (character class), so your next use of MR_all$period[detect==FALSE] ~ gsub(..) seems wrong: the left-hand side (LHS) of the ~ pairs in case_when must resolve to logical, so a_string ~ gsub(..) is wrong.
3. Further, even if MR_all$period[detect==FALSE] did resolve to something logical so that case_when could continue, we then have the problem that this vector is shorter than the number of rows in the original frame, yet you're assigning this shorter vector back into the entire MR_all['period_3']. If you're lucky this will warn or fail, but if you aren't then it will silently recycle data into your frame (which is a logical problem and a chief complaint of mine about recycling arguments).
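A small sketch of the difference between = and == inside [..]. The vector x and its values are made up for illustration and stand in for MR_all$period:

```r
# Hypothetical stand-ins for MR_all$period and the detect vector
x <- c("1687", "12/12/23", "19e")
detect <- c(FALSE, TRUE, FALSE)

# Single "=": the name is ignored and a lone FALSE is used as the index,
# so nothing is selected and `detect` itself is left untouched
with_single <- x[detect = FALSE]

# Double "==": an elementwise comparison, giving one logical per element
with_double <- x[detect == FALSE]

with_single
with_double
```

The first result is a zero-length character vector; the second keeps only the elements where detect is FALSE.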
Ultimately, I think you need one of the following:
Assign the gsubed version of period only for those rows where detect is FALSE, returning NA for all other rows:
MR_all['period_3'] <- dplyr::case_when(
  detect == FALSE ~ gsub("\\D", "", MR_all$period)
)
Same as above, but defaulting to the original period for all other rows:
MR_all['period_3'] <- dplyr::case_when(
  detect == FALSE ~ gsub("\\D", "", MR_all$period),
  TRUE ~ MR_all$period
)
or, the preferred form as of dplyr 1.1.0:
MR_all['period_3'] <- dplyr::case_when(
  detect == FALSE ~ gsub("\\D", "", MR_all$period),
  .default = MR_all$period
)
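All of the above rely on detect actually flagging the rows with a slash. Note that stringr documents the signature as str_detect(string, pattern), so the call in the question has its two arguments swapped, which would explain why no slash is found. Below is a minimal base-R sketch of the whole task that avoids that pitfall; the sample values are assumptions taken from the examples in the question, and grepl/ifelse are swapped in for str_detect/case_when so that no packages are needed:

```r
# Hypothetical sample standing in for MR_all$period
period <- c("1687", "12/12/23", "First half of 19e century")

# fixed = TRUE treats "/" as a literal string, not a regular expression
has_slash <- grepl("/", period, fixed = TRUE)

# Keep slash-dates as they are; strip every non-digit from everything else
period_3 <- ifelse(has_slash, period, gsub("\\D", "", period))
period_3
```

With these three inputs, the slash date is kept unchanged and the other two are reduced to their digits.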

Related

read.csv with check.names=FALSE in R: why does the "if." column name become "if"?

Please look at the column name "if" in the second column. The difference is that with check.names=F, the "." after "if" disappears.
Sorry for not providing code: I tried to write some code to generate the data.frame shown in the picture, but I failed because of the "if". We know that "if" is a reserved word in R (like else, for, while, function). Here I deliberately used "if" as the column name (the 2nd column) to see whether R would do something unusual with it.
So, as another way, I typed "if" in Excel and saved the sheet as CSV in order to use read.csv.
The question is:
Why does "if." change to "if" (after I use check.names=FALSE)?
?read.csv describes check.names= in a similar fashion:
check.names: logical. If 'TRUE' then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
'make.names') so that they are, and also to ensure that there
are no duplicates.
The default action is to allow you to do something like dat$<column-name>, but unfortunately dat$if will fail with Error: unexpected 'if' in "dat$if", ergo check.names=TRUE changing it to something that the parser will not trip over. Note, though, that dat[["if"]] will work even when dat$if will not.
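A short sketch of that behaviour, using a made-up two-column CSV (the column name x and the values are assumptions for illustration):

```r
# make.names() is what check.names = TRUE applies; reserved words get a "." appended
fixed_name <- make.names("if")

# Hypothetical CSV with a reserved word as the second column name
csv_text <- "x,if\n1,2"

checked   <- read.csv(text = csv_text)                       # default check.names = TRUE
unchecked <- read.csv(text = csv_text, check.names = FALSE)

names(checked)      # the second name has been adjusted
names(unchecked)    # the second name is kept verbatim

# With the raw name, use [[ ]] or backticks instead of a bare dat$if
unchecked[["if"]]
unchecked$`if`
```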
If you are wondering if check.names=FALSE is ever a bad thing, then imagine this:
dat <- read.csv(text = "a,a\n2,3")
dat
# a a.1
# 1 2 3
dat <- read.csv(text = "a,a\n2,3", check.names = FALSE)
dat
# a a
# 1 2 3
In the second case, how does one access the second column by name? dat$a returns 2 only. However, if you don't want to use $ or [[, and can instead index the columns with a logical vector, then dat[,colnames(dat) == "a"] does return both of them.

R made my character variable into a factor, and now the function doesn't work

I have a data frame of hospital discharge data. It has 99,779 records with 263 variables, including up to 50 ICD-10-CM diagnosis codes per record. I loaded the file including the subcommand "stringsAsFactors= FALSE" and then copied just the diagnosis codes to another df to make it easier to look at the data in RStudio. My current goal is to assign injury severity codes using icdpictr. I ran that program successfully, then looked at the output. As documented in the author's site, when the 7th character of the ICD-10-CM code is "B" or "C", the program ignores it, although it should not. So I want to change the 7th character from "B" or "C" to the character that triggers the attention. Here is where I run into a problem. Setting aside that I don't know how to write a function that will do this for each of my 50 variables, I anticipate writing 50 nearly identical statements like this:
mutate(temp = if_else(substr(DIAG1,7,7) == 'B' | substr(DIAG1,7,7) == 'C',
paste(substr(DIAG1,1,6),'A',sep=""),
DIAG1),
DIAG1 = temp, ...
I ran the program with just this one mutate command. This is the error message that appears:
Error: Problem with mutate() input temp.
x false must be a character vector, not a factor object.
i Input temp is if_else(...).
Although I loaded the DIAG variables as character, when I copied them to the other table, R -- without my permission -- turned them into factors. That was very efficient, but now I can't handle them as character type.
How do I solve this problem?
It may be because of the comparison with a factor class: when 'yes' and 'no' have different types in if_else, you get that error, because if_else checks the types, unlike ifelse. Based on the OP's code, if 'DIAG1' is a factor and the 'no' case returns 'DIAG1', then it is a factor vs. character mismatch, because substr automatically coerces the factor to character. We can convert 'DIAG1' to character with as.character and it should work:
library(dplyr)
df2 <- df1 %>%
mutate(DIAG1 = as.character(DIAG1),
temp = if_else(substr(DIAG1, 7, 7) %in% c("B", "C"),
paste0(substr(DIAG1,1,6),'A'), DIAG1))
NOTE: When there is more than one value to compare against, instead of doing the same operation twice (substr(DIAG1, 7, 7)) and then combining == tests (== is an elementwise comparison), we can use %in% with a single substr.
NOTE2: From R 4.0, read.csv/read.table and data.frame construction calls use stringsAsFactors = FALSE by default. Previously, the default was TRUE. So it is better to check the R version as well.
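For the part of the question about repeating this over all 50 DIAG columns, here is one possible base-R sketch. The helper name fix_seventh, the toy data frame and its codes are all hypothetical, and it assumes every diagnosis column name starts with "DIAG":

```r
# Hypothetical helper: turn a 7th character of "B" or "C" into "A"
fix_seventh <- function(x) {
  x <- as.character(x)                       # factors become plain strings
  hit <- !is.na(x) & substr(x, 7, 7) %in% c("B", "C")
  x[hit] <- paste0(substr(x[hit], 1, 6), "A")
  x
}

# Toy data with made-up codes, stored as factors on purpose
df1 <- data.frame(
  DIAG1 = factor(c("S72001B", "S72001A", "I10")),
  DIAG2 = factor(c("S82101C", NA, "S06"))
)

# Apply the helper to every column whose name starts with "DIAG"
diag_cols <- grep("^DIAG", names(df1), value = TRUE)
df1[diag_cols] <- lapply(df1[diag_cols], fix_seventh)
df1
```

Because the helper converts to character first, it works whether or not the columns arrived as factors.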

How to replace NA's with blank value?

I am pretty new to R and I wonder if I can replace NA values (which look like strings) with a blank, i.e. nothing.
It is easy when the entire table is character; however, my table contains doubles as well, so when I try to run
f <- as.data.frame(replace(df, is.na(df), ""))
or
df[is.na(df)] <- ""
Neither works.
The error is:
Assigned data `values` must be compatible with existing data.
Error occurred for column `ID`.
Can't convert <character> to <double>.
and I understand it, but I really need the IDs, as well as any other cell in the table (character, double or other), to stay in the table and show as blank. Later on the table is connected to a BI tool, and for the sake of clarity I can't present "NA", just a blank.
If your column is of type double (numbers), you can't replace NAs (R's internal representation of missing values) with a character string. And "" IS a character string: it looks empty, but it is still a string, not a number.
So you need to choose: convert your whole column to type character, or leave the missing values as NA.
EDIT:
If you really want to convert your numeric column to character, you can just use as.character(MYCOLUMN). But I think what you really want is:
To tell your exporting function how to treat NAs, which is easy, e.g. write.csv(df, na = ""). Also check the help page with ?write.csv.
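A minimal sketch of the export approach, with a made-up two-column data frame and a temporary file (both are assumptions for illustration). The ID column stays numeric in R; only the written file shows blanks:

```r
# Hypothetical data with missing values in a numeric and a character column
df <- data.frame(ID = c(1, NA), name = c("a", NA), stringsAsFactors = FALSE)

# Write NAs as empty fields; row.names = FALSE drops the row-number column
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, na = "", row.names = FALSE)

# Inspect what was actually written
csv_lines <- readLines(out_file)
csv_lines
```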

Comparing Character Vectors Per Row in R if Same True/False

I'm missing something very simple here.
I tried referring to This Post
I'm trying to compare 2 character vectors to see if they equal each other on a per row basis, just expecting a TRUE / FALSE result per row. (Trying to find the FALSE)
all(data.frame(dBase$process_name == dBase$import_process))
When I run the above I receive the result:
Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables
I tried this as well but it seems to be overall, not per row.
identical(dBase$process_name, dBase$import_process)
So is there an alternative way to compare characters / strings to see if they are the same, and to pull up the rows where FALSE occurs?
If dBase is the name of an existing data.frame, then you can do the following
to see all the rows where process_name and import_process are not identical.
Notice that I use the != to get only those where they are not equal.
dBase[ dBase$process_name != dBase$import_process, ]
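A self-contained sketch of the same idea, with a made-up dBase (the values are assumptions for illustration). It first builds the per-row TRUE/FALSE vector the question asks for, then uses it to pull the mismatching rows:

```r
# Hypothetical data standing in for the real dBase
dBase <- data.frame(
  process_name   = c("load", "clean", "export"),
  import_process = c("load", "scrub", "export"),
  stringsAsFactors = FALSE
)

# == on two vectors compares element by element: one TRUE/FALSE per row
same <- dBase$process_name == dBase$import_process
same

# Row numbers and rows where the two columns differ
mismatch_rows <- which(!same)
dBase[mismatch_rows, ]
```

Using which() here also drops rows where either value is NA, since comparing with NA gives NA rather than TRUE or FALSE.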

If Else Statement in R Error Message

This is my first time using an if-else statement in R. I have a column (in a data frame) named 'Performance' that has percentages (some of which are estimates). I have created a new column named 'Estimate' which takes the last character of the 'Performance' column. Now, I want to make a column with the following conditions: If 'Estimate' = '*' then 'Estimate' = 'YES' else 'Estimate' = 'NO'. I just want to keep the new column name 'Estimate'. The statement I have written works, but I am getting an error message that says:
"Warning message: In if (data.set$Estimate == "*") { : the condition has length > 1 and only the first element will be used"
Here is my statement:
data.set$Estimate <- if (data.set$Estimate == '*') {data.set$Estimate = 'YES'} else{data.set$Estimate = 'NO'}
Can someone please explain why I am getting this error message, and/or what I would need to change to not get it? Any help is much appreciated.
For the warning: the statement isn't actually working. A single if statement expects a single boolean, yet you compare data.set$Estimate (a vector) to "*" (a single element), which yields a vector of booleans. It therefore does exactly what the warning says and uses only the first element of data.set$Estimate. Your thought process was good, it just didn't quite hit the implementation!
ifelse() allows for vectorization (i.e. a vector of booleans), so it does what you thought the if-else should do. No need for sapply in this case, we can just vectorize.
Replace vect below with your data.set$Estimate:
vect<-c("*", "X", "*", "*", "10YEAH", "WHAT", "BlurbBlurb")
vect<-ifelse(vect=="*","Yes", "No" )
vect
#[1] "Yes" "No" "Yes" "Yes" "No" "No" "No"
In your column Estimate, more than one entry contains "*". But an if statement uses only one logical element, like:
any(data.set$Estimate=="*")
And since you want to replace entries that are * with "YES", you can use
my.fun1<-function(x)
{
ifelse(x=="*","YES","NO")
}
data.set$Estimate<-sapply(data.set$Estimate,my.fun1)
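Putting the pieces together, here is a sketch of the whole task done directly on the data frame with a vectorized ifelse. The Performance values are made up for illustration. As far as I know, from R 4.2.0 onwards an if condition of length greater than one is an error rather than a warning, so the vectorized form is the safer habit either way:

```r
# Hypothetical data: a trailing "*" marks an estimate
data.set <- data.frame(
  Performance = c("45%*", "60%", "72%*"),
  stringsAsFactors = FALSE
)

# Last character of each Performance value
n <- nchar(data.set$Performance)
last_char <- substr(data.set$Performance, n, n)

# ifelse() tests every element, unlike if (...) which needs a single TRUE/FALSE
data.set$Estimate <- ifelse(last_char == "*", "YES", "NO")
data.set
```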
