Numeric vs Factors & IF Statements - r

I am trying to create a function for gender distribution. Is there a way to define a letter as something other than as.factor? I would like to operate func(F) instead of func("F"). Or should I go numeric: func(0), func(1), func(2)?
I also finished off the statement with an else that is designed to operate when left blank, but does not. If I whittle down the function to not include an IF statement a blank variable works fine:
genderDist <- function(){
cat("Female:", sum(voterData$GENDER == "F"))
}
Thanks in advance! Cheers!
Full Statement:
genderDist <- function(x){
if (x == "F"){
cat("Female:", sum(voterData$GENDER == "F"))
}
else if (x == "M"){
cat("Male:", sum(voterData$GENDER == "M"))
}
else if(x == "U"){
cat("Unknown:", sum(voterData$GENDER == ""))
}
else{
cat("Female:", sum(voterData$GENDER == "F"))
cat("Male:", sum(voterData$GENDER == "M"))
cat("Unknown:", sum(voterData$GENDER == ""))
}
}
Desired results:
genderDist(F) gives count of Females
genderDist(M) gives count of Males
genderDist(U) gives count of Unknown
genderDist() gives count for all the above

There are several possibilities for coding gender, besides factor:
1. as character, not as factor. You will still have to call your function like func("F").
2. You already thought of using numeric yourself. Disadvantage is that it may be unclear if 1 is male or female.
3. The best option IMHO would be to go binary. Name your column "male" and use TRUE, FALSE and NA for unknown. The binary also works great in your if statement. Start with if(is.na(male)) ... ; else if(male).
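As a rough sketch of option 3, assuming a logical vector called male (the values below are made up, not the asker's data) and a hypothetical helper describe(), the NA check has to come first because if (NA) is an error:

```r
# Made-up example data: TRUE = male, FALSE = female, NA = unknown
male <- c(TRUE, FALSE, NA, FALSE)

# Hypothetical helper: translate one logical value into a label.
# is.na() is tested first, since if (NA) would stop with an error.
describe <- function(m) {
  if (is.na(m)) {
    "Unknown"
  } else if (m) {
    "Male"
  } else {
    "Female"
  }
}

labels <- vapply(male, describe, character(1))
table(labels)
```

The same counts could also be read off directly with sum(male, na.rm = TRUE) and sum(is.na(male)).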
EDIT
But to achieve your desired outcome, the coding of gender is not the issue, I would take this approach:
#First, define variables Fe, Ma and Un
#WARNING: Do NOT USE 'F', as 'F' is an abbr. for 'FALSE'!!
Fe <- "F"
Ma <- "M"
Un <- "U"
#now define a lookup data frame for convenience
LT <- data.frame(code = c(Fe,Ma,Un), name = c("Female","Male","Unknown"), stringsAsFactors = FALSE)
# then define your function without needing any if/else
genderDist <- function(x){
cat(LT[LT$code == x,"name"], sum(voterData$GENDER == x))
}
Introduce some fake data:
voterData <- data.frame(GENDER= c("F","F","F","M","M","U"))
Then run function:
> genderDist(Fe)
Female 3
> genderDist(Ma)
Male 2
> genderDist(Un)
Unknown 1
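The function above does not yet cover the genderDist() case from the question. One possible way, sketched here under assumptions, is to give x a default of NULL and report every code when nothing is passed. The lookup vectors gender_labels and gender_codes are hypothetical names, the sample data is made up, and unknown is assumed to be stored as an empty string as in the question's code. Quoted codes are used because a bare F evaluates to FALSE.

```r
# Made-up data; "" marks an unknown gender, as in the question's code
voterData <- data.frame(GENDER = c("F", "F", "F", "M", "M", ""),
                        stringsAsFactors = FALSE)

# Hypothetical lookups: argument code -> label, argument code -> stored value
gender_labels <- c(F = "Female", M = "Male", U = "Unknown")
gender_codes  <- c(F = "F", M = "M", U = "")

genderDist <- function(x = NULL) {
  # With no argument, report every known code
  keys <- if (is.null(x)) names(gender_labels) else x
  counts <- vapply(keys,
                   function(k) sum(voterData$GENDER == gender_codes[[k]]),
                   integer(1))
  names(counts) <- gender_labels[keys]
  cat(sprintf("%s: %d\n", names(counts), counts), sep = "")
  invisible(counts)
}

genderDist("F")  # one category
genderDist()     # all categories
```

Returning the counts invisibly keeps the printed output but still lets the result be assigned and reused.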

Related

R programming, I need to find the expected amount of draws it takes to get three certain letters in a sample using monte carlo sim

For example, I need to find the total amount of draws it takes to pull A , B, and C from a sample.
So far I have used which() to find the positions at which the letters appear in the output, but I don't know where to go from there. We are told the answer is around 20.25. I know how to do it for a single letter (the answer is around 13), but I don't know the proper way to handle multiple letters.
my code:
draw <- function(){
letters <- sample(LETTERS)
which(letters == "A")
#print(letters)
#print(which(letters == "A"))
#print(which(letters == "B"))
#print(which(letters == "C"))
}
mean(replicate(10000, draw()))
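Continue sampling 3 letters until you get the required letter combination, and return the number of draws it took. Note that this counts repeated three-letter samples, so its mean comes out near 2600 (one success per choose(26, 3) tries on average) rather than the 20.25 quoted in the question.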
draw <- function(){
vec <- c('A', 'B', 'C')
count <- 1
tmp <- sample(LETTERS, 3)
while(!all(tmp %in% vec)) {
tmp <- sample(LETTERS, 3)
count <- count + 1
}
return(count)
}
You can use replicate to repeat this n times and get the mean.
mean(replicate(10000, draw()))
I think I got it.
draw <- function(){
letters <- sample(LETTERS)
a <- which(letters == "A")
b <- which(letters == "B")
c <- which(letters == "C")
select <- c(a,b,c)
max(select)
}
mean(replicate(10000, draw()))
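As a sanity check on the simulated mean, the expected position of the last of the wanted letters can also be computed exactly. This is only a sketch of the arithmetic, with an illustrative helper name: in a random ordering of n letters, the last of k chosen letters sits at position m with probability choose(m - 1, k - 1) / choose(n, k).

```r
# Hypothetical helper: exact expected position of the last of k chosen
# letters in a random permutation of n letters
expected_last <- function(n, k) {
  m <- k:n                                       # possible positions of the last letter
  p_last <- choose(m - 1, k - 1) / choose(n, k)  # P(last chosen letter is at m)
  sum(m * p_last)
}

three_letters <- expected_last(26, 3)  # A, B and C
one_letter    <- expected_last(26, 1)  # a single letter

# The sum simplifies to k * (n + 1) / (k + 1)
closed_form <- 3 * (26 + 1) / (3 + 1)

c(three_letters, one_letter, closed_form)
```

The three-letter value agrees with the 20.25 given in the question, and the single-letter value of 13.5 is in line with the "around 13" mentioned there.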

Using if/else nested within a for loop in order to cycle through & reassign values within a column in R?

I know that this is not the most efficient way to achieve my goal; however, I am using this as a teaching moment (i.e., to show that you can nest an if/else statement within a for loop). Specifically, I have a nominal variable that is currently stored as numeric codes. I want to use if/else combined with a for loop to reassign these numbers to their respective categories (class character). I have tried to do this in multiple ways; my current code is as follows:
# Take the original data and separate out the variable of interest
oasis_CDR <- oasis_final %>% select('CDR')
# transpose this data
oasis_CDR <- t(oasis_CDR)
# create the for loop
for(i in seq_along(oasis_CDR)){
if(i == 0.0){
oasis_CDR[1, i] <- "Normal"
} else if(i == 0.5) {
oasis_CDR[1 ,i] <- "Very Mild Dementia"
} else if(i == 1.0){
oasis_CDR[1 ,i] <- "Mild Dementia"
} else if(i == 2.0){
oasis_CDR[1 ,i] <- "Moderate Dementia"
} else if(i == 3.0){
oasis_CDR[1 ,i] <- "Severe Dementia"
} else{
oasis_CDR[1 ,i] <- "NA"
}
}
When I look at oasis_CDR it returns 'NA' for all observations.
If I replace 'i' with 'CDR' in each 'if' condition, it only returns 'Normal'.
Is there a way to do this so that the reassigned labels match the data?
If you have a different value to assign to every number you can use dplyr::recode
library(dplyr)
oasis_CDR <- oasis_CDR %>%
mutate(new_col = recode(CDR, `0` = 'Normal',
`0.5` = 'Very Mild Dementia',
`1` = 'Mild Dementia',
`2` = 'Moderate Dementia',
`3` = 'Severe Dementia',
.default = NA_character_))
Run a check on your seq_along(oasis_CDR) expression! These will be your i values.
My guess is that you do not really want to compare 0.0, 0.5, 1 and 2 against 1 up to > 220, do you?
And if you really want to work through this with a for loop rather than by indexing the vector,
isn't it more likely that you want to achieve something like this:
oasis_CDR$result <- NA_character_
j <- 1
for (i in oasis_CDR) {
if (i == ...) oasis_CDR$result[j] <- 'Normal'
...
j <- j + 1
}
IMHO that gets the job done, but it is not (very) nice code in R (or any similar language).
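For the teaching purpose described in the question, a complete loop might look like the sketch below. It assumes a data frame with a numeric CDR column (the values here are made up) and compares the value at position i, not the index i itself; the names value and labels are illustrative.

```r
# Made-up stand-in for the real data
oasis_final <- data.frame(CDR = c(0, 0.5, 1, 2, 3, NA))

# Pre-allocate the result, then fill it one element at a time
labels <- character(nrow(oasis_final))
for (i in seq_along(oasis_final$CDR)) {
  value <- oasis_final$CDR[i]   # the data value, not the loop index
  if (is.na(value)) {
    labels[i] <- NA_character_  # handle NA first; if (NA == 0) would fail
  } else if (value == 0) {
    labels[i] <- "Normal"
  } else if (value == 0.5) {
    labels[i] <- "Very Mild Dementia"
  } else if (value == 1) {
    labels[i] <- "Mild Dementia"
  } else if (value == 2) {
    labels[i] <- "Moderate Dementia"
  } else if (value == 3) {
    labels[i] <- "Severe Dementia"
  } else {
    labels[i] <- NA_character_
  }
}
oasis_final$CDR_label <- labels
```

Writing the labels to a separate character vector avoids mixing numbers and strings in the original column while the loop is still running.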

Multi Conditional Statements in R

I'd like to know the shape or length of the filtered dataframe through multiple conditions. I have 2 methods I've used, but I'm a little stumped because they're giving me different outputs.
Method 1
x <- df[df$gender=='male',]
x <- x[x$stat == 0,]
nrow(x)
OUTPUT = Some Number
Method 2
nrow(sqldf('SELECT * FROM df WHERE gender == "male" AND stat == 0'))
OUTPUT = Some Number
I'm a little confused as to why the outputs would be different? Any ideas?
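In method one you assign x to df[df$gender=='male',] and then filter x again with x[x$stat == 0,], so the two steps together act like a single AND condition. One difference from the SQL version is how missing values are handled: when gender or stat contains NA, logical indexing with [ keeps an all-NA row for each NA in the condition, whereas the SQL WHERE clause drops those rows, so nrow() can come out larger in method one. Off the top of my head, with no dataset, x <- df[df$gender == 'male' & df$stat == 0, ] combines both conditions in one step but has the same NA behaviour. I would use the subset function with x <- subset(df, gender == 'male' & stat == 0) and then nrow(x), since subset drops rows where the condition is NA.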
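One way the two counts can disagree is when the data contains NA values. The sketch below uses made-up data (not the asker's df) to show that chained logical indexing keeps all-NA rows, while subset() and which() drop them; the variable names are illustrative.

```r
# Made-up data with one missing gender
df <- data.frame(gender = c("male", "male", NA, "female"),
                 stat   = c(0, 1, 0, 0),
                 stringsAsFactors = FALSE)

# Chained logical indexing: an NA in the condition yields an all-NA row
x <- df[df$gender == "male", ]
x <- x[x$stat == 0, ]
n_chained <- nrow(x)

# subset() treats NA in the condition as FALSE
n_subset <- nrow(subset(df, gender == "male" & stat == 0))

# which() drops NA as well, so plain indexing can be made to agree
n_which <- nrow(df[which(df$gender == "male" & df$stat == 0), ])

c(chained = n_chained, subset = n_subset, which = n_which)
```

Checking anyNA(df$gender) and anyNA(df$stat) on the real data would show whether this is what causes the difference.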

Avoid loop to improve r code

I have a data frame with millions of rows and ten columns.
My code seems to work but never finishes, I think because of the for loop and the if statement.
I want to write it differently, but I'm stuck.
df <- data.frame(x = 1:5,
y = c("a", "a", "b", "b", "c"),
z = sample(5))
for (i in seq_along(df$x)){
if (df$y[i] == df$y[i+1] & df$y[i] == "a"){
df$status[i] <- 1
} else {
df$status[i] <- "ok"
}
}
In fact, you can replace the whole loop by a vectorised ifelse:
df$status = ifelse(df$y == df$y[-1] & df$y == 'a', 1, 'ok')
This code will give you a warning, unlike the for loop. However, the warning is actually correct and also concerns your code: you are reading past the last element of df$y when doing df$y[i + 1].
You can make this warning go away (and make the code arguably clearer) by borrowing the lead function from dplyr (simplified):
lead = function (x, n = 1, default = NA) {
if (n == 0)
return(x)
`attributes<-`(c(x[-seq_len(n)], rep(default, n)), attributes(x))
}
With this, you can rewrite the code ever so slightly and get rid of the warning:
df$status = ifelse(df$y == lead(df$y) & df$y == 'a', 1, 'ok')
It’s a shame that this function doesn’t seem to exist in base R.
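A comparable one-step shift can also be written inline in base R by dropping the first element and padding with NA. This is a minimal sketch on the question's toy data (with y kept as character); next_y is an illustrative name, and the explicit !is.na() guard makes the treatment of the last row visible.

```r
df <- data.frame(x = 1:5,
                 y = c("a", "a", "b", "b", "c"),
                 stringsAsFactors = FALSE)

# Shift y up by one position; the last row has no successor, so pad with NA
next_y <- c(df$y[-1], NA)

# Vectorised version of the loop's condition; ifelse() returns a character
# vector here because "ok" forces the 1 to be stored as "1"
df$status <- ifelse(!is.na(next_y) & df$y == next_y & df$y == "a", "1", "ok")
df
```

Because everything operates on whole columns, this avoids the row-by-row assignment that makes the original loop slow on millions of rows.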

Summing Values of One Vector Conditional on Values of Another Vector

For some context, I am trying to add up all of the home wins the Chicago Cubs have. Thus, the W_L column refers to the wins ("W") and losses ("L"). Also, the H_A column refers to home games ("H") and away games ("A").
I am having trouble adding the total number of values "W" from one column when another column has a value of "H". Below is the code I am trying to use.
setwd("blah blah blah")
br <- read.csv(file="Baseball-Reference.csv", h=T)
record <- function(){
wins <- sum(br$W_L[!is.na(br$W_L)] == "W")
losses <- sum(br$W_L[!is.na(br$W_L)] == "L")
wp <- round(wins/games, digits = 3)
home_wins <- if(br$H_A[!is.na(br$H_A)] == "H"){
wins <- sum(br$W_L[!is.na(br$W_L)] == "W")}
}
If I run this I get a warning.
Warning message:
In if (br$H_A[!is.na(br$H_A)] == "H") { :
the condition has length > 1 and only the first element will be used
Do you really need the if statement? Perhaps this works as well:
home_wins <- with(br, sum(W_L == "W" & H_A == "H", na.rm = TRUE))
home_wins
Or just a quick count, albeit without assigning the separate results to variables:
tt <- with(br, addmargins(table(W_L , H_A)))
tt
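To see the vectorised condition at work without the CSV file, here is a small sketch on made-up results (the real Baseball-Reference data is not reproduced here). The comparison gives one TRUE or FALSE per game, and sum() counts the TRUE values.

```r
# Made-up stand-in for the CSV: six games, two with missing entries
br <- data.frame(W_L = c("W", "L", "W", "W", NA, "L"),
                 H_A = c("H", "H", "A", "H", "A", NA),
                 stringsAsFactors = FALSE)

# One logical value per game: TRUE only for a win played at home
is_home_win <- br$W_L == "W" & br$H_A == "H"

# Count the TRUE values; na.rm = TRUE skips games with missing data
home_wins <- sum(is_home_win, na.rm = TRUE)
home_wins
```

The same pattern gives home losses or away wins by changing the two compared values.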
Use ifelse instead of if
The ifelse function performs elementwise conditional evaluation upon a vector:
Example:
x<-3
y<-c(1,2,3)
With ifelse:
> ifelse(x==y,"good","bad")
[1] "bad" "bad" "good"
With if:
> if (x ==y) "good" else "bad"
[1] "bad"
Warning message:
In if (x == y) "good" else "bad" :
the condition has length > 1 and only the first element will be used
Since R 4.2.0, a condition of length greater than one is an error rather than a warning.
