Multi Conditional Statements in R

I'd like to know the shape or length of the filtered dataframe through multiple conditions. I have 2 methods I've used, but I'm a little stumped because they're giving me different outputs.
Method 1
x <- df[df$gender=='male',]
x <- x[x$stat == 0,]
nrow(x)
OUTPUT = Some Number
Method 2
nrow(sqldf('SELECT * FROM df WHERE gender == "male" AND stat == 0'))
OUTPUT = Some Number
I'm a little confused as to why the outputs would be different? Any ideas?

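In method one you assign df[df$gender == 'male', ] to x and then filter x again with x[x$stat == 0, ], so the two steps together apply both conditions. Off the top of my head, with no dataset, you can do the same in one step with x <- df[df$gender == 'male' & df$stat == 0, ] (note the trailing comma, which selects rows rather than columns). One likely source of the difference is missing values: indexing with a logical vector that contains NA returns a row of NAs for each NA, while SQL's WHERE drops rows where the condition is NULL. I would use the subset function, which also drops those rows: x <- subset(df, gender == 'male' & stat == 0), and then nrow(x).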
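A minimal sketch of how missing values alone can make the counts differ. The data frame below is made up for illustration (the real df is not shown in the question), so NA values are only one possible cause, not a confirmed diagnosis.

```r
# Hypothetical data: one NA in gender, one NA in stat
df <- data.frame(
  gender = c("male", "male", NA, "female", "male"),
  stat   = c(0, 1, 0, 0, NA),
  stringsAsFactors = FALSE
)

# Method 1 from the question: two rounds of logical indexing.
# Each NA in the logical index yields an all-NA row that is kept.
x <- df[df$gender == "male", ]
x <- x[x$stat == 0, ]
n_bracket <- nrow(x)

# subset() treats NA in the condition as FALSE, like SQL's WHERE
n_subset <- nrow(subset(df, gender == "male" & stat == 0))

# which() drops NA positions, so bracket indexing then agrees with subset()
n_which <- nrow(df[which(df$gender == "male" & df$stat == 0), ])

print(c(bracket = n_bracket, subset = n_subset, which = n_which))
```

If the counts in your own data differ by exactly the number of rows with a missing gender or stat, this is the likely explanation; colSums(is.na(df)) is a quick way to check.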

Related

Using if/else nested within a for loop in order to cycle through & reassign values within a column in R?

I know that this is not the most efficient way in order to achieve my goal; however, I am using this as a teaching moment (i.e., to show that you can use a if/else statement nested within a for loop). Specifically, I have a nominal variable that uses integers as of right now. I want to use the if/else combined with the for loop in order to reassign these numbers to their respective category (class character). I have tried to do this in multiple ways, my current code is as follows:
# Take the original data and separate out the variable of interest
oasis_CDR <- oasis_final %>% select('CDR')
# transpose this data
oasis_CDR <- t(oasis_CDR)
# create the for loop
for(i in seq_along(oasis_CDR)){
  if(i == 0.0){
    oasis_CDR[1, i] <- "Normal"
  } else if(i == 0.5) {
    oasis_CDR[1 ,i] <- "Very Mild Dementia"
  } else if(i == 1.0){
    oasis_CDR[1 ,i] <- "Mild Dementia"
  } else if(i == 2.0){
    oasis_CDR[1 ,i] <- "Moderate Dementia"
  } else if(i == 3.0){
    oasis_CDR[1 ,i] <- "Severe Dementia"
  } else{
    oasis_CDR[1 ,i] <- "NA"
  }
}
When I look at oasis_CDR, it returns 'NA' for all observations.
If I replace 'i' with 'CDR' in each condition, it returns only 'Normal'.
Is there any way to do this so that the reassigned labels match the values in the data?

If you have a different value to assign to every number, you can use dplyr::recode:
library(dplyr)
oasis_CDR <- oasis_CDR %>%
  mutate(new_col = recode(CDR, `0` = 'Normal',
                          `0.5` = 'Very Mild Dementia',
                          `1` = 'Mild Dementia',
                          `2` = 'Moderate Dementia',
                          `3` = 'Severe Dementia',
                          .default = NA_character_))
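The same mapping can also be sketched in base R with a named lookup vector. This is an alternative to recode, not part of the answer above; the cdr vector and the name cdr_labels are made up for illustration.

```r
# Hypothetical CDR scores, including an unmapped value (1.5) and a missing one
cdr <- c(0, 0.5, 1, 2, 3, 1.5, NA)

# Lookup table: names are the scores as text, values are the labels
cdr_labels <- c(
  "0"   = "Normal",
  "0.5" = "Very Mild Dementia",
  "1"   = "Mild Dementia",
  "2"   = "Moderate Dementia",
  "3"   = "Severe Dementia"
)

# Index the lookup by name; scores without an entry come back as NA
labelled <- unname(cdr_labels[as.character(cdr)])
print(labelled)
```

Indexing by name returns NA for any score that has no entry, which plays the role of .default = NA_character_ in the recode call.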

Run a check on your seq_along(oasis_CDR) expression! These will be your i values: the positions 1, 2, 3, ..., not the CDR scores themselves.
My guess is that you do not really want to compare 0.0, 0.5, 1 and 2 against 1 up to > 220, do you?
And if you really want to work through this with a for loop rather than by indexing the vector, isn't it more likely that you want to achieve something like this:
oasis_CDR$result <- NA_character_
j <- 1
for (i in oasis_CDR$CDR) {
if (i == ...) oasis_CDR$result[j] <- 'Normal'
...
j <- j + 1
}
In my opinion, though, that gets the job done but is not (very) nice code, in R or in any similar language.
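For the teaching purpose described in the question, a complete loop might look like the sketch below. It assumes the scores sit in a plain numeric vector (here a made-up cdr vector), and it loops over positions while testing the value at each position rather than the position itself.

```r
# Hypothetical scores standing in for oasis_final$CDR
cdr <- c(0, 0.5, 1, 2, 3, NA, 0.5)

# Pre-allocate the result so the loop only fills it in
label <- character(length(cdr))

for (i in seq_along(cdr)) {
  value <- cdr[i]            # test the value, not the index i
  if (is.na(value)) {
    label[i] <- NA_character_
  } else if (value == 0) {
    label[i] <- "Normal"
  } else if (value == 0.5) {
    label[i] <- "Very Mild Dementia"
  } else if (value == 1) {
    label[i] <- "Mild Dementia"
  } else if (value == 2) {
    label[i] <- "Moderate Dementia"
  } else if (value == 3) {
    label[i] <- "Severe Dementia"
  } else {
    label[i] <- NA_character_
  }
}
print(label)
```

The is.na() branch comes first because if (NA == 0) is an error in R, and a real NA is used instead of the string "NA" so that missing labels stay recognisably missing.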

Numeric vs Factors & IF Statements

I am trying to create a function for gender distribution. Is there a way to define a letter as something other than as.factor? I would like to operate func(F) instead of func("F"). Or should I go numeric: func(0), func(1), func(2)?
I also finished off the statement with an else that is designed to operate when left blank, but does not. If I whittle down the function to not include an IF statement a blank variable works fine:
genderDist <- function(){
cat("Female:", sum(voterData$GENDER == "F"))
}
Thanks in advance! Cheers!
Full Statement:
genderDist <- function(x){
  if (x == "F"){
    cat("Female:", sum(voterData$GENDER == "F"))
  }
  else if (x == "M"){
    cat("Male:", sum(voterData$GENDER == "M"))
  }
  else if(x == "U"){
    cat("Unknown:", sum(voterData$GENDER == ""))
  }
  else{
    cat("Female:", sum(voterData$GENDER == "F"))
    cat("Male:", sum(voterData$GENDER == "M"))
    cat("Unknown:", sum(voterData$GENDER == ""))
  }
}
Desired results:
genderDist(F) gives count of Females
genderDist(M) gives count of Males
genderDist(U) gives count of Unknown
genderDist() gives count for all the above

There are several possibilities for coding gender, besides factor:
1. as character, not as factor. You will still have to call your function like func("F").
2. You already thought of using numeric yourself. Disadvantage is that it may be unclear if 1 is male or female.
3. The best option IMHO would be to go binary. Name your column "male" and use TRUE, FALSE and NA for unknown. The binary also works great in your if statement. Start with if(is.na(male)) ... ; else if(male).
EDIT
But to achieve your desired outcome, the coding of gender is not the issue, I would take this approach:
# First, define variables Fe, Ma and Un
# WARNING: do NOT use F as a variable name; F is R's built-in shorthand for FALSE!
Fe <- "F"
Ma <- "M"
Un <- "U"
# now define a lookup data frame for convenience
LT <- data.frame(code = c(Fe,Ma,Un), name = c("Female","Male","Unknown"), stringsAsFactors = FALSE)
# then define your function; no if/else is needed
genderDist <- function(x){
cat(LT[LT$code == x,"name"], sum(voterData$GENDER == x))
}
Introduce some fake data:
voterData <- data.frame(GENDER= c("F","F","F","M","M","U"))
Then run function:
> genderDist(Fe)
Female 3
> genderDist(Ma)
Male 2
> genderDist(Un)
Unknown 1
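The function above still needs an argument, so it does not cover the genderDist() case from the question. One way to sketch that, assuming a default of NULL for "no gender given" and unknown gender stored as an empty string as in the question (the function name genderCounts and the sample data are made up here):

```r
# Hypothetical data; unknown gender is the empty string, as in the question
voterData <- data.frame(GENDER = c("F", "F", "F", "M", "M", ""),
                        stringsAsFactors = FALSE)

genderCounts <- function(x = NULL, data = voterData) {
  counts <- c(
    Female  = sum(data$GENDER == "F"),
    Male    = sum(data$GENDER == "M"),
    Unknown = sum(data$GENDER == "")
  )
  # No argument supplied: return every count
  if (is.null(x)) {
    return(counts)
  }
  # Otherwise translate the code ("F", "M" or "U") into a label
  key <- c(F = "Female", M = "Male", U = "Unknown")[[x]]
  counts[key]
}

print(genderCounts("F"))
print(genderCounts())
```

An if (x == "F") test on a missing argument fails before any else branch is reached, which is why the blank call in the question never worked; a default value that is checked first avoids that.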

speeding up boolean logic loop in R

I am very new to R but I am interested in learning more and improving.
I have a dataset with around 40,000+ rows containing the length of neuron segments. I want to compare the length trends of neurons of different groups. The first step in this analysis involves sorting the measurements into 1 of 6 different categories such as '<10' '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
  {if (FinalData$Length[i] <=10)
    {FinalData$`<10`[i]<-1
    } else {FinalData$`<10`[i]<-0}} #Fills '<10'
  if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
    FinalData$`10-15`[i]<-1
  } else{FinalData$`10-15`[i]<-0} #Fills'10-15'
  if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
    FinalData$`15-20`[i]<-1
  } else{FinalData$`15-20`[i]<-0} #Fills '15-20'
  if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
    FinalData$`20-25`[i]<-1
  } else{FinalData$`20-25`[i]<-0} #Fills '20-25'
  if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
    FinalData$`25-30`[i]<-1
  } else{FinalData$`25-30`[i]<-0} #Fills '25-30'
  if(FinalData$Length[i] >=30){
    FinalData$`>30`[i]<-1
  } else{FinalData$`>30`[i]<-0} #Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
user system elapsed
94.408 19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the dataframe is structured, here is a screenshot: [image: How my data frame is organized]

You can use the cut function in R. It converts numeric values to factors:
x<-c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
cut(x = x,breaks = c(0,3,6,9,12),labels = c("grp1","grp2","grp3","grp4"),right=FALSE)
Set right = TRUE or right = FALSE as needed (these are logical values, not strings); it controls whether each interval is closed on the right or on the left.
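Applied to the bins from the question, this could look like the sketch below. The len vector is made up from the sample values in the question plus two extra values; the breaks assume that a length of exactly 30 belongs in '>30', matching the >= 30 test in the original loop.

```r
# Sample lengths from the question, plus two extra values for the upper bins
len <- c(14.362, 12.482337, 8.236, 16.752, 12.045, 31.5, 25)

# One call assigns every value to its bin; right = FALSE gives [a, b) intervals
bins <- cut(
  len,
  breaks = c(-Inf, 10, 15, 20, 25, 30, Inf),
  labels = c("<10", "10-15", "15-20", "20-25", "25-30", ">30"),
  right  = FALSE
)

# If six 0/1 columns are really needed, build one column per level
indicators <- sapply(levels(bins), function(lv) as.integer(bins == lv))

print(table(bins))
print(indicators)
```

Both steps are vectorised, so no row-by-row loop is involved; the indicator matrix can be attached to the data with cbind() if the six columns are required downstream.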

You can vectorise that as follows (I made some sample data called DF):
DF <- data.frame(1:40000, sample(letters, 40000, replace = TRUE), "Length" = sample(1:40, 40000, replace = TRUE))
MyFunc <- function(x) {
  # fill a separate character vector so every comparison is made on the original numbers
  out <- character(length(x))
  out[x < 10] <- "<10"
  out[x >= 10 & x < 15] <- "10-15"
  out[x >= 15 & x < 20] <- "15-20"
  out[x >= 20 & x < 25] <- "20-25"
  out[x >= 25 & x < 30] <- "25-30"
  out[x >= 30] <- ">30"
  return(out)
}
DF$Group <- MyFunc(DF[,3])
If it has to be 6 columns like that, you can modify the above to return a one or zero for the appropriate size and everything else, respectively, for each of the 6 columns.
Edit: I guess a series of ifelse might be best if it really has to be 6 columns like that.
e.g.
DF$'<10' <- ifelse(DF$Length < 10, 1, 0)

Extend conditions in a dynamic way

I am trying to build a decision table. At time 3 for example I have to take the previous results in time t=1 and time t=2 in order to make my decision in time 3. The decision table is going to be pretty big so I am considering an efficient way to do it by building a function. For instance at time 3:
rm(list=ls()) # clear memory
names <- c("a","b","c","d","e","f","g","h","i","j","k50","l50","m50","n50","o50")
proba <- c(1,1,1,1,1,1,1,1,1,1,0.5,0.5,0.5,0.5,0.5)
need <- 4
re <- 0.5
w <- 1000000000
# t1
t1 <- as.integer(names %in% (sample(names,need,prob=proba,replace=F)))
# t2
t2 <- rep(t1)
# t3
proba3 <- ifelse(t2==1,proba*re,proba)
t3 <- as.integer(names %in% (sample(names,need,prob=proba3,replace=F)))
Now the table is going to be big until t=7 with proba7 which takes condition from t=1 to t=6. After t=7 it always takes the 6 previous outcomes plus the random part proba in order to make decision. In other words the ifelse must be dynamic in order that I can call it later. I have been trying something like
probF <- function(a){
test <- ifelse(paste0("t",a,sep="")==1,proba*re,proba)
return(test)
}
test <- probF(2)
but this fails: I get a single value back instead of a vector. I know that it looks complicated.
Here are the conditions that one person asked for (I know they are not very well written):
proba7 <- ifelse(t2==1 & t3==1 & t4==0 & t5==0 & t6==0,proba,
ifelse(t2==1 & t3==0 & t4==0 & t5==1 & t6==1,proba*re,
ifelse(t2==1 & t3==0 & t4==0 & t5==0 & t6==1, w,
ifelse(t2==0 & t3==1 & t4==1 & t5==0 & t6==0,proba,
ifelse(t2==0 & t3==1 & t4==1 & t5==1 & t6==0,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 & t6==1,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 &t6==0,0,
ifelse(t2==0 & t3==0 & t4==0 & t5==1 & t6==1, proba*re,
ifelse(t2==0 & t3==0 & t4==0 & t5==0 & t6==1,w,proba)))))))))
t7 <- as.integer(names %in% (sample(names,need,prob=proba7,replace=F)))

If you take a bit of a different approach, you'll gain quite a lot of speed.
First of all, it is really a terribly bad idea to store every step as a separate t1, proba1, etc. If you need to keep all that information, predefine a matrix or list of the right size and store everything in there. That way you can use simple indices instead of having to resort to the bug-prone use of get(). If you find yourself typing get(), almost always it's time to stop and rethink your solution.
Secondly, you can use a simple principle to select the indices of the test t:
seq(max(1, i-6), i-1)
will allow you to use a loop index i and refer to the 6 previous positions if they exist.
Thirdly, depending on what you want, you can reformulate your decision as well. If you store every t as a row in the matrix, you can simply use colSums() and check whether the column sum is larger than 0. Based on that index, you can update the probabilities in such a way that a 1 in any of the previous 6 rows halves the probability.
Wrapping everything in a function would then look like:
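The row window and the colSums() check can be tried on a toy matrix first. The values below are made up for illustration (three earlier steps, three names) and are not taken from the question.

```r
# Three earlier steps (rows) for three names (columns); 1 means "selected"
theT <- rbind(
  c(1, 0, 0),
  c(0, 0, 0),
  c(0, 1, 0)
)
proba <- c(1, 1, 0.5)
re <- 0.5

i <- 4                                 # the step we are about to compute
tid <- seq(max(1, i - 6), i - 1)       # previous rows available: 1, 2, 3

# TRUE for every column that was selected at least once in the window
pid <- colSums(theT[tid, , drop = FALSE]) > 0

# Reduce the probability only for those columns
p <- proba
p[pid] <- p[pid] * re
print(p)
```

drop = FALSE matters when the window contains a single row: without it R would return a plain vector and colSums() would fail.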
myfun <- function(names, proba, need, re,
                  w=100){
  # For convenience, so I don't have to type this twice
  resample <- function(p){
    as.integer(
      names %in% sample(names,need,prob=p, replace = FALSE)
    )
  }
  # get the number of needed columns
  nnames <- length(names)
  # create two matrices to store all the t-steps and the probabilities used
  theT <- matrix(nrow = w, ncol = nnames)
  theproba <- matrix(nrow = w, ncol = nnames)
  # Create a first step, using the original probabilities
  theT[1,] <- resample(proba)
  theproba[1,] <- proba
  # loop over the other simulations, each time checking the condition
  # recalculating the probability and storing the result in the next
  # row of the matrices
  for(i in 2:w){
    # the id vector to select the (maximal) 6 previous rows. If
    # i-6 is smaller than 1 (i.e. there are no 6 steps yet), the
    # max(1, i-6) guarantees that you start minimal at 1.
    tid <- seq(max(1, i-6), i-1)
    # Create the probability vector from the original one
    p <- proba
    # look for which columns in the 6 previous steps contain a 1
    pid <- colSums(theT[tid,,drop = FALSE]) > 0
    # update the probability vector, using the re argument as the factor
    p[pid] <- p[pid]*re
    # store the next step and the used probabilities in the matrices
    theT[i,] <- resample(p)
    theproba[i,] <- p
  }
  # Return both matrices in a single list for convenience
  return(list(decisions = theT,
              proba = theproba)
  )
}
which can be used as:
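# here w is the number of t-steps to simulate, not the weight of 1000000000 from the question
myres <- myfun(names, proba, need, re, w = 100)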
head(myres$decisions)
head(myres$proba)
This returns a list of two matrices in which every row is one t-point in the decision table: decisions holds the outcomes and proba holds the probabilities that were used.

cor() function in R with a subset

I have a table in R with three columns. I want to get the correlation of the first two columns with a subset of the third column following a specific set of conditions (values are all numeric, I want them to be > a certain number). The cor() function doesn't seem to have an argument to define such a subset.
I know that I could use the summary(lm()) function and square-root the r^2, but the issue is that I'm doing this inside a for loop and am just appending the correlation to a separate list that I have. I can't really append part of the summary of the regression easily to a list.
Here is what I am trying to do:
for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}
Obviously, though, I can't do that because the cor() function doesn't work with subsets.
(Note: x = seq(1,100) and list = NULL)

You can do this without a loop using lapply. Here's some code that will output a data frame with the month-range in one column and the correlation in another column. The do.call(rbind... business is just to take the list output from lapply and turn it into a data frame.
corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month),
  function(x) {
    data.frame(month_range=paste0(x," - ", max(airquality$Month)),
               correlation = cor(airquality$Temp[airquality$Month >= x & airquality$Temp < 80],
                                 airquality$Wind[airquality$Month >= x & airquality$Temp < 80]))
  }))
corrs
month_range correlation
1 5 - 9 -0.3519351
2 6 - 9 -0.2778532
3 7 - 9 -0.3291274
4 8 - 9 -0.3395647
5 9 - 9 -0.3823090

You can subset the data first, and then find the correlation.
a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)
Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (see how the month requirement changes with each iteration)
list <- seq(1, 5)
for (i in 1:5){
  a <- subset(airquality, Temp < 80 & Month > i)
  list[i] <- cor(a$Temp, a$Wind)
}

Based on the pseudo-code you provided alone, here's something that should work:
for (i in x) {
  df <- subset(data, column_c > i)
  list[i] = cor(df$column_a, df$column_b)
}
However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.
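A sketch that keeps the thresholds and the results separate, using vapply instead of growing a vector inside a loop. The data frame is simulated here (the real data is not shown), and the thresholds are an arbitrary choice; the column names follow the pseudo-code in the question.

```r
set.seed(1)

# Simulated stand-in for the question's table
data <- data.frame(
  column_a = rnorm(200),
  column_b = rnorm(200),
  column_c = runif(200, min = 0, max = 100)
)

# Thresholds to test; kept moderate so every subset still has enough rows
x <- seq(10, 90, by = 10)

# One correlation per threshold, computed on the rows where column_c > i
correlations <- vapply(x, function(i) {
  keep <- data$column_c > i
  cor(data$column_a[keep], data$column_b[keep])
}, numeric(1))

# Label each result with its threshold instead of using it as an index
names(correlations) <- x
print(correlations)
```

Naming the results by threshold avoids the indexing concern raised above, and not calling the result list avoids shadowing the base R function of that name. With real data it is worth checking how many rows survive each threshold, since a correlation needs at least two complete pairs.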
