I am quite new to this kind of function in R. What I am trying to do is to use the if statement over a vector.
Specifically, let's say we have a vector of characters:
id <- c('4450', '73635', '7462', '12')
What I'd like to do is to substitute those elements containing a specific number of characters with a particular term. Here is what I have tried so far:
for (i in 1:length(id)) {
if(nchar(i) > 3) {
id[i] <- 'good'
}
else id[i] <- 'bad'
}
However, the code doesn't work and I don't understand why. Also I'd like to ask you:
How can I use multiple conditions in this example? For instance, substitute elements with nchar(i) > 6 with 'mild', elements with nchar(i) < 2 with 'not bad', and so on.
In your for statement, i is the iterator, not the actual element of your vector.
I think your code would work if you replace :
if(nchar(i) > 3)
by
if(nchar(id[i]) > 3)
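For reference, a minimal sketch of the corrected loop next to a vectorised equivalent. The sample vector is the one from the question; the names labels and labels_vec are only illustrative, and the result is written to a new vector so that id itself is left untouched.

```r
id <- c('4450', '73635', '7462', '12')

# Loop version: index into id so nchar() sees the string, not the counter i
labels <- character(length(id))
for (i in seq_along(id)) {
  if (nchar(id[i]) > 3) {
    labels[i] <- 'good'
  } else {
    labels[i] <- 'bad'
  }
}

# Vectorised version: nchar() and ifelse() both operate on whole vectors
labels_vec <- ifelse(nchar(id) > 3, 'good', 'bad')

print(labels)  # "good" "good" "good" "bad"
```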
You could use dplyr::case_when to include multiple such conditions.
temp <- nchar(id)
id1 <- dplyr::case_when(temp > 6 ~ 'mild',
temp < 2 ~ 'not bad',
#default condition
TRUE ~ 'bad')
Or using nested ifelse
id1 <- ifelse(temp > 6, 'mild', ifelse(temp < 2, 'not bad', 'bad'))
Consider below expression:
x$Y = ifelse(x$A<= 5 & abs(x$B) >= 2,
ifelse(x$B> 2 ,"YES","NO"),
'NA')
What I understand is that if A <= 5 and B >= 2 then all are YES, and if not then NO, but I am confused by the second ifelse condition. Any help will be highly appreciated.
Thanks
This code defines a new column, Y, in the data set x. Y is populated according to the following rules: if A <= 5 and abs(B) >= 2, then Y is "YES" when B > 2 and "NO" otherwise; in every other case Y is the string 'NA'.
If we rewrite your ifelse expression using expanded syntax, it might be easier to understand.
x$Y <- ifelse(x$A <= 5 & abs(x$B) >= 2, ifelse(x$B > 2, "YES", "NO"), 'NA')
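# becomes, for a single row at a time (if() is not vectorised)
if (x$A <= 5 & abs(x$B) >= 2) {
  if (x$B > 2) {
    x$Y <- "YES"
  } else {
    x$Y <- "NO"
  }
} else {
  x$Y <- 'NA'  # the string 'NA', as in the original ifelse() call
}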
The second, nested ifelse() corresponds to the inner if above. It checks whether x$B is greater than 2. Because the earlier check abs(x$B) >= 2 has already passed, x$B is either at least 2 or at most -2. If x$B is greater than 2, x$Y is set to YES; otherwise (x$B is exactly 2, or at most -2) it is set to NO.
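To see each branch in action, here is a small sketch on made-up data (the values of A and B are hypothetical and chosen only to hit every case):

```r
# Hypothetical data: one row per branch of the nested ifelse()
x <- data.frame(A = c(1, 1, 1, 9),
                B = c(3, 2, -3, 5))

x$Y <- ifelse(x$A <= 5 & abs(x$B) >= 2,
              ifelse(x$B > 2, "YES", "NO"),
              'NA')

print(x$Y)  # "YES" "NO" "NO" "NA"
```

Note that the last value is the character string 'NA', not a missing value, so is.na(x$Y) is FALSE for every row.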
I know that this is not the most efficient way to achieve my goal; however, I am using this as a teaching moment (i.e., to show that you can use an if/else statement nested within a for loop). Specifically, I have a nominal variable that currently uses integers. I want to use if/else combined with a for loop to reassign these numbers to their respective categories (class character). I have tried to do this in multiple ways; my current code is as follows:
# Take the original data and separate out the variable of interest
oasis_CDR <- oasis_final %>% select('CDR')
# transpose this data
oasis_CDR <- t(oasis_CDR)
# create the for loop
for(i in seq_along(oasis_CDR)){
if(i == 0.0){
oasis_CDR[1, i] <- "Normal"
} else if(i == 0.5) {
oasis_CDR[1 ,i] <- "Very Mild Dementia"
} else if(i == 1.0){
oasis_CDR[1 ,i] <- "Mild Dementia"
} else if(i == 2.0){
oasis_CDR[1 ,i] <- "Moderate Dementia"
} else if(i == 3.0){
oasis_CDR[1 ,i] <- "Severe Dementia"
} else{
oasis_CDR[1 ,i] <- "NA"
}
}
When I look at oasis_CDR, it contains 'NA' for all observations.
If I replace 'i' with 'CDR' in each 'if' condition, it only returns 'Normal'.
Is there any way to do this so that the reassigned labels match the data?
If you have a different value to assign to every number you can use dplyr::recode
library(dplyr)
oasis_CDR <- oasis_CDR %>%
mutate(new_col = recode(CDR, `0` = 'Normal',
`0.5` = 'Very Mild Dementia',
`1` = 'Mild Dementia',
`2` = 'Moderate Dementia',
`3` = 'Severe Dementia',
.default = NA_character_))
Run a check on your seq_along(oasis_CDR) expression! These will be your i values.
My guess is that you do not really want to compare 0.0, 0.5, 1 and 2 against 1 up to > 220, do you?
And if you really want to work through this with a for loop rather than by indexing the vector,
isn't it more likely that you want to achieve something like this:
oasis_CDR$result <- NA_character_
j <- 1
for (i in oasis_CDR$CDR) {  # loop over the CDR values, not over the data frame's columns
if (i == ...) oasis_CDR$result[j] <- 'Normal'
...
j <- j + 1
}
But imho, while that can get the job done, it is not (very) nice R code (or nice code in any similar language).
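For the teaching purpose described in the question, a complete version of such a loop might look like the sketch below. The vector CDR is a hypothetical stand-in for the real column, and the labels are taken from the question's own if/else chain. The key point is that the conditions test the value CDR[j], not the loop index.

```r
# Hypothetical stand-in for the CDR column in the question
CDR <- c(0, 0.5, 1, 2, 3, NA)

result <- rep(NA_character_, length(CDR))
for (j in seq_along(CDR)) {
  value <- CDR[j]              # compare the value, not the index j
  if (is.na(value)) {
    next                       # leave missing scores as NA
  } else if (value == 0) {
    result[j] <- "Normal"
  } else if (value == 0.5) {
    result[j] <- "Very Mild Dementia"
  } else if (value == 1) {
    result[j] <- "Mild Dementia"
  } else if (value == 2) {
    result[j] <- "Moderate Dementia"
  } else if (value == 3) {
    result[j] <- "Severe Dementia"
  }
}

print(result)
```

Any score that matches none of the conditions keeps the initial NA_character_, which plays the role of the final else branch in the question.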
I'd like to know the shape or length of the filtered dataframe through multiple conditions. I have 2 methods I've used, but I'm a little stumped because they're giving me different outputs.
Method 1
x <- df[df$gender=='male',]
x <- x[x$stat == 0,]
nrow(x)
OUTPUT = Some Number
Method 2
nrow(sqldf('SELECT * FROM df WHERE gender == "male" AND stat == 0'))
OUTPUT = Some Number
I'm a little confused as to why the outputs would be different? Any ideas?
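In method one you assign x to df[df$gender == 'male', ] and then overwrite x with x[x$stat == 0, ], so x ends up holding the rows that pass both filters in turn. Off the top of my head, with no dataset to test on, one thing that can make the counts differ is missing values: indexing with [ returns a row of NAs for every NA in the condition, whereas a SQL WHERE clause drops those rows. You could filter in one step with x <- df[df$gender == 'male' & df$stat == 0, ], although I have never done it this way. I would use the subset function, x <- subset(df, gender == 'male' & stat == 0), which drops rows where the condition is NA, and then nrow(x).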
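Without the original data this is only a guess at the cause, but the effect of missing values on the row count is easy to reproduce. The data frame below is hypothetical, with one missing gender:

```r
# Hypothetical data with one missing gender value
df <- data.frame(gender = c('male', 'male', NA, 'female'),
                 stat   = c(0, 1, 0, 0))

# Method 1 from the question: two rounds of logical indexing
x <- df[df$gender == 'male', ]   # the NA comparison yields an all-NA row
x <- x[x$stat == 0, ]            # that row's stat is NA too, so it survives
n_bracket <- nrow(x)

# subset() and which() both drop rows where the condition is NA
n_subset <- nrow(subset(df, gender == 'male' & stat == 0))
n_which  <- nrow(df[which(df$gender == 'male' & df$stat == 0), ])

print(c(n_bracket, n_subset, n_which))  # 2 1 1
```

If sum(is.na(df$gender)) or sum(is.na(df$stat)) is non-zero in the real data, this would be one explanation for the mismatch.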
I am very new to R but I am interested in learning more and improving.
I have a dataset with around 40,000+ rows containing the length of neuron segments. I want to compare the length trends of neurons of different groups. The first step in this analysis involves sorting the measurements into 1 of 6 different categories such as '<10' '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
{if (FinalData$Length[i] <10)
{FinalData$`<10`[i]<-1
} else {FinalData$`<10`[i]<-0}} #Fills '<10'
if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
FinalData$`10-15`[i]<-1
} else{FinalData$`10-15`[i]<-0} #Fills'10-15'
if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
FinalData$`15-20`[i]<-1
} else{FinalData$`15-20`[i]<-0} #Fills '15-20'
if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
FinalData$`20-25`[i]<-1
} else{FinalData$`20-25`[i]<-0} #Fills '20-25'
if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
FinalData$`25-30`[i]<-1
} else{FinalData$`25-30`[i]<-0} #Fills '25-30'
if(FinalData$Length[i] >=30){
FinalData$`>30`[i]<-1
} else{FinalData$`>30`[i]<-0} #Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
user system elapsed
94.408 19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the dataframe is structured, here is a screenshot:
[Screenshot: how my data frame is organized]
You can use the cut function in R. It is used to convert numeric values to factors:
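x <- c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
cut(x = x, breaks = c(0,3,6,9,12), labels = c("grp1","grp2","grp3","grp4"), right = FALSE)
Set right = TRUE or right = FALSE (logical values, not strings) as needed. With right = FALSE the intervals are closed on the left, e.g. [0,3); with right = TRUE they are closed on the right, e.g. (0,3].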
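Applied to the categories in the question, a sketch might look like this. The Length values are the five sample values given in the question; the open-ended breaks -Inf and Inf and the names grp and indicators are my own choices for illustration.

```r
# Sample values from the question
Length <- c(14.362, 12.482337, 8.236, 16.752, 12.045)

# right = FALSE gives intervals [10,15), [15,20), ... so exactly 30 lands in '>30'
grp <- cut(Length,
           breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
           labels = c('<10', '10-15', '15-20', '20-25', '25-30', '>30'),
           right  = FALSE)

# One 0/1 column per category, if the six-column layout is really needed
indicators <- sapply(levels(grp), function(l) as.integer(grp == l))

print(as.character(grp))  # "10-15" "10-15" "<10" "15-20" "10-15"
```

Each row of indicators contains exactly one 1, and the whole computation is vectorised, so no loop over rows is needed.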
You can vectorise that as follows (I made a sample of some data called DF)
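DF <- data.frame(id = 1:40000,
                 letter = sample(letters, 40000, replace = TRUE),
                 Length = sample(1:40, 40000, replace = TRUE))
MyFunc <- function(x) {
  # Write labels into a separate vector so x stays numeric for every comparison
  out <- character(length(x))
  out[x < 10] <- "<10"
  out[x >= 10 & x < 15] <- "10-15"
  out[x >= 15 & x < 20] <- "15-20"
  out[x >= 20 & x < 25] <- "20-25"
  out[x >= 25 & x < 30] <- "25-30"
  out[x >= 30] <- ">30"  # >= so that exactly 30 is labelled as well
  return(out)
}
DF$Group <- MyFunc(DF$Length)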
If it has to be 6 columns like that, you can modify the above to return a one or zero for the appropriate size and everything else, respectively, for each of the 6 columns.
Edit: I guess a series of ifelse might be best if it really has to be 6 columns like that.
e.g.
DF$'<10' <- ifelse(DF$Length < 10, 1, 0)  # ifelse() is already vectorised, so sapply() is not needed
I have two columns (df$Z and df$A)
I basically want to say: if df$Z is less than 5, then fill df$A with an NA and if not then leave df$A alone. I've tried these things but am not sure where I'm going wrong or what the error message means.
if(df$X<5){df$A <- NA}
Error:
In if (df$X < 5) { : the condition has length > 1 and only the first element will be used
I also tried to do something more like this.
for(i in dfX){
if(df$X<5){
df$A <- "NA"
}
}
No if statement needed. That's the magic of vectorization.
df$A[df$Z < 5] <- NA
A simple way is the "is.na<-" function:
is.na(df$A) <- df$Z < 5
The vectorized form of the if statement in R is the ifelse() function:
df$A <- ifelse( df$X < 5, NA, df$A )
However, in this case I would also go with @mark-heckmann's solution.
And please note that "NA" is not the same as NA.
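A short sketch of the difference, using a hypothetical data frame with the column names Z and A from the question:

```r
# Hypothetical data with the column names from the question
df <- data.frame(Z = c(1, 7, 3, 9),
                 A = c(10, 20, 30, 40))

# Vectorised assignment of a real missing value
with_na <- df$A
with_na[df$Z < 5] <- NA

# Assigning the string "NA" instead silently converts the vector to character
with_string <- df$A
with_string[df$Z < 5] <- "NA"

print(is.na(with_na))      # TRUE FALSE TRUE FALSE
print(class(with_na))      # "numeric"
print(is.na(with_string))  # FALSE FALSE FALSE FALSE
print(class(with_string))  # "character"
```

With the real NA the column stays numeric and functions such as is.na() or mean(..., na.rm = TRUE) treat the entries as missing; with the string "NA" they do not.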