Scoring a column with conditions in R using sd - r

I'm an R/coding newbie. I want to assign a score to a column based on some conditions. I have some random data below, that helps explain my own data.
name average score
a -3.56714858 0
a -0.41934072 0
a -1.02200958 0
b 0.67713883 0
b 0.29228235 0
b 0.11338159 0
c -1.48595572 0
c -0.35328884 0
c -1.26491347 0
d -0.27093065 0
d -0.14913264 0
What I want to do:
If (name=a & average > 2sd of benchmark) then assign score= 2
if (name=a & average < 2sd of benchmark) then assign score=0.5
etc.
Edit: benchmark = average(of top 3 "a"), so I'm scoring the rest of the "a" based on how they compare to the top three, so how many standard deviations they lie from the top 3.
Each letter has its own benchmark or number that I am comparing it to. So I was manually going through, letter by letter, like:
df$score[df$name == "a"
& df$average >= benchmark
& df$average < (benchmark + sd(benchmark)]<- 1
df$score[df$name == "a"
& df$average >= (benchmark + sd(benchmark)
& df$average < (benchmark+ 2sd(benchmark))]<- 2.0
df$score[df$name == "a"
& df$average > (benchmark+ 2sd(benchmark))]<- 2.5
df$score[df$name == "a"
& df$average < benchmark
& df$average >= (benchmark - sd(benchmark)]<- 1
df$score[df$name == "a"
& df$average < (benchmark - sd(benchmark)
& df$average >= (benchmark - 2sd(benchmark))]<- 0.5
df$score[df$name == "a"
& df$average < (benchmark - 2sd(benchmark))]<- 0
I have thousands of rows and more groups than the letters a-d. I'm hoping I can find a faster way to do this. My long method is also creating errors. Please help
I have the same scoring principle for each group, but the benchmark is different for each group.

I would structure your problem this way: first you have your data frame, and then you should have a second data frame, which we will call bench, with the variables "name", "benchmark", and "sd.benchmark".
Then you can use the dplyr package to join the two and compute the score:
library(dplyr)
# Attach each group's benchmark and sd to its rows, then score by distance from the benchmark
df.new <- left_join(df, bench, by = "name") %>%
  mutate(score = ifelse(average >= (benchmark - sd.benchmark) & average < (benchmark + sd.benchmark), 1,
                 ifelse(average >= (benchmark + sd.benchmark) & average < (benchmark + 2 * sd.benchmark), 2,
                 ifelse(average >= (benchmark + 2 * sd.benchmark), 2.5,
                 ifelse(average < (benchmark - sd.benchmark) & average >= (benchmark - 2 * sd.benchmark), 0.5, 0)))))
I have fewer conditions than you because in your example the range from benchmark - sd(benchmark) to benchmark and the range from benchmark to benchmark + sd(benchmark) both get a score of 1, so the two adjoining ranges can be merged into one.
left_join adds the columns of bench to df, keeping every row of df. It is like merge(x, y, all.x = TRUE). The joined data is then piped to mutate, which creates the new score variable from the nested ifelse statements.
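For what it's worth, here is a minimal base R sketch of the same idea that also builds bench. It assumes the benchmark is the mean of the three largest averages in each group and that sd.benchmark is the standard deviation of those same three values; that is my reading of the question's edit, not something confirmed. The toy data and the helper top3 are made up for illustration, and match() plus findInterval() stand in for the join and the nested ifelse.

```r
# Toy data in the same shape as the question (values are made up)
df <- data.frame(
  name    = rep(c("a", "b"), each = 5),
  average = c(7.5, 8.5, 9, 10, 11, 17, 18.5, 19, 20, 21)
)

# Assumed benchmark rule: mean and sd of the three largest averages per group
top3 <- function(x) head(sort(x, decreasing = TRUE), 3)
bm   <- tapply(df$average, df$name, function(x) mean(top3(x)))
s    <- tapply(df$average, df$name, function(x) sd(top3(x)))
bench <- data.frame(name         = names(bm),
                    benchmark    = as.numeric(bm),
                    sd.benchmark = as.numeric(s))

# Look up each row's benchmark (base R stand-in for the left_join)
i <- match(df$name, bench$name)
z <- (df$average - bench$benchmark[i]) / bench$sd.benchmark[i]

# Bins on z: below -2, [-2, -1), [-1, 1), [1, 2), 2 and above
df$score <- c(0, 0.5, 1, 2, 2.5)[findInterval(z, c(-2, -1, 1, 2)) + 1]
df
```

Working in units of standard deviations (z) means the cut points are the same for every group, so only the lookup differs between groups.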

Related

Replace logical values conditionally in R

I am sure this question has been asked before and has an easy solution, but I can't seem to find it.
I am trying to conditionally replace the logical value of a variable based on the value of other variables in the data. Specifically, I am trying to determine eligibility based on survey responses.
I have created my eligibility variable in dataframe screen:
screen$eligible <- ifelse (
(screen$age > 17 & screen$age < 23)
& (screen$alcohol > 3 | screen$marijuana > 3)
& (screen$country == 0 | screen$ageus < 12)
& (screen$county_1 == 17 | screen$county_1 == 27 | screen$county_1 == 31)
& (screen$residence_1 == 47),
TRUE,
FALSE)
And now, based on study changes, I would like to further limit eligibility. I tried the code below, and it works in part, but it appears that I am introducing NAs to my eligibility variable and missing out on folks who should be eligible.
screen$eligible <- ifelse( screen$eligible ==TRUE, ifelse(
(screen$gender_1 == 1 & screen$age > 18)
|(screen$gender_8 == 1 & screen$age > 20),
FALSE, TRUE), FALSE)
I ultimately want TRUE or FALSE values.
Two questions
Is there a clearer or more concise way to update the code to update my eligibility requirements?
Any ideas as to why I might be introducing NAs?
Continuing from what @zephryl wrote, an even more readable version is:
screen$eligible <- with(screen,
(age > 17 & age < 23)
& (alcohol > 3 | marijuana > 3)
& (country == 0 | ageus < 12)
& county_1 %in% c(17, 27, 31)
& (residence_1 == 47))
To find which columns contain NAs:
sapply(screen, anyNA)
1. Is there a clearer or more concise way to update the code to update my eligibility requirements?
If you ever find yourself writing x = ifelse(condition, TRUE, FALSE), as you are here -- that's equivalent to just writing x = condition. Also, your three county_1 == x statements can be replaced with one county_1 %in% c(x, y, z). So your first code block could be written as,
# Each line ends with & so that R keeps reading the expression on the next line
screen$eligible <- (screen$age > 17 & screen$age < 23) &
  (screen$alcohol > 3 | screen$marijuana > 3) &
  (screen$country == 0 | screen$ageus < 12) &
  screen$county_1 %in% c(17, 27, 31) &
  (screen$residence_1 == 47)
Likewise, your second codeblock could be simplified as:
# Negated, because your ifelse returns FALSE when the gender/age condition holds
screen$eligible <- screen$eligible &
  !((screen$gender_1 == 1 & screen$age > 18) |
    (screen$gender_8 == 1 & screen$age > 20))
2. Any ideas as to why I might be introducing NAs?
It's hard to say without seeing your data, but the NAs probably indicate that one or more of your constituent variables (gender_1, gender_8, age) is NA for some cases.
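To make the NA point concrete, here is a small sketch with made-up values (the column names follow the question, the data does not). It shows how a missing age or gender_1 turns the combined condition into NA, and one possible way to force TRUE/FALSE. Treating a missing value as "not excluded" is purely an assumption for illustration; which way to resolve it is a study decision.

```r
# Made-up rows; only the column names come from the question
screen <- data.frame(
  eligible = c(TRUE, TRUE, TRUE, FALSE),
  gender_1 = c(1, 1, NA, NA),
  gender_8 = c(0, 0, 0, 0),
  age      = c(19, NA, 21, 22)
)

# Which columns contain any NA?
has_na <- sapply(screen, anyNA)

# The exclusion condition is NA whenever a needed value is missing
excl <- with(screen, (gender_1 == 1 & age > 18) | (gender_8 == 1 & age > 20))

# NA propagates: TRUE & NA is NA, but FALSE & NA is FALSE
with_na <- screen$eligible & !excl

# Assumption: a missing value counts as "not excluded"
no_na <- screen$eligible & !(excl %in% TRUE)
```

Here with_na is FALSE, NA, NA, FALSE, while no_na is FALSE, TRUE, TRUE, FALSE.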

Using ifelse function in R to sort results that fulfil a "less than and greater than"

I am trying to get this function to select outputs that fall in between two values.
Data <- Data %>%
mutate(D= ifelse(A >= "80" & B == "InPlay" & (C <= "20")&(C >= "6"), "YES", paste(D)))
So I would like column D to read "YES" when column A is greater than 20, column B reads "InPlay", and column C falls between 6 and 20.
Are you trying to compare the numbers as strings or as numbers?
If they are currently strings, can you convert them to numbers using a = as.numeric(a)?
That will allow you to use the usual comparison operators.
The example below worked for me in a fresh script; I'm assuming you'd want this running inside a loop.
A = 90
B = "InPlay"
C = 15
D = "No"
if (A >= 80 && B == "InPlay" && C <= 20 && C >= 6) {
D = "YES"
}
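If a loop is not needed, the same test can probably be applied to the whole data frame at once. A rough sketch, assuming A and C are stored as character columns (the toy values below are invented) and using base ifelse rather than mutate:

```r
# Invented data; A and C arrive as strings, as the quoted numbers in the question suggest
Data <- data.frame(
  A = c("90", "100", "70"),
  B = c("InPlay", "InPlay", "Out"),
  C = c("15", "9", "10"),
  D = c("No", "No", "No"),
  stringsAsFactors = FALSE
)

# Comparing strings goes character by character, so "100" >= "80" is FALSE
string_cmp <- "100" >= "80"

# Convert once, then compare as numbers
Data$A <- as.numeric(Data$A)
Data$C <- as.numeric(Data$C)

# Vectorised version of the if block; rows that fail keep their old D
Data$D <- ifelse(Data$A >= 80 & Data$B == "InPlay" & Data$C >= 6 & Data$C <= 20,
                 "YES", Data$D)
Data$D
```

With these values D ends up as "YES", "YES", "No".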

Calculate count of number of switch in vector

I have a vector in which I have to count how many times the data switched from 0 to 100 and back to 0. An example is given below.
Input
X1<-c(100,100,100,0,0,0,0,0,100,100,100,100,100,0,0,0,0,100,100,100,0,0,100,100)
So the output should be 3, as the value started at 0, stayed at 100 for some time, and went back to 0. My requirement is to count how many times this switch has occurred. I am aware of rle, but that only gives me the lengths.
Thanks in advance for the help.
This looks sufficient
sum(X1[-1] != X1[-length(X1)]) / 2
The assumptions are:
1. You only have two unique values in X1.
2. The last element of X1 equals the first element, that is, it switches back to the original state in the end.
You can do something like,
sum(diff(X1) == 100)
#[1] 3
#Or
min(sum(diff(X1) == 100), sum(diff(X1) == -100))
#[1] 3
You could run rle and then iterate through three elements of values at a time to see if the required condition has been met.
with(rle(X1),
sum(sapply(3:length(lengths), function(i)
values[i-2] == 0 & values[i-1] == 100 & values[i] == 0)))
#[1] 2
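The answers above give 3 or 2 depending on what counts as a switch. A short sketch using only rle and diff on the question's X1 puts the three candidate counts side by side:

```r
X1 <- c(100,100,100,0,0,0,0,0,100,100,100,100,100,0,0,0,0,100,100,100,0,0,100,100)

# rle collapses the vector into runs: values 100 0 100 0 100 0 100
r <- rle(X1)

# Every change of state, in either direction
n_changes <- length(r$values) - 1          # 6

# Only the 0 -> 100 changes
n_up <- sum(diff(r$values) == 100)         # 3

# Complete 0 -> 100 -> 0 cycles: runs of 100 with a run on both sides
n_cycles <- sum(r$values[-c(1, length(r$values))] == 100)   # 2
```

The last count drops the first and last runs because they have a neighbour on one side only; with two distinct values, every interior run of 100 is necessarily surrounded by runs of 0.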
more generally for counting switches in n cases (numeric or character):
count_switches_groups <- function(seq.input){
COUNT <- 0
transition = rep("no switch",length(seq.input))
for (i in 2:length(seq.input)) {
if (seq.input[i] != seq.input[i - 1]) {
COUNT <- COUNT + 1
transition[i] <- paste0("from ",seq.input[i - 1]," to ",seq.input[i])
}
}
total_switches <- COUNT
state_transitions <- transition[transition != "no switch"]
occurances <- as.data.frame(table(state_transitions))
return_list <- list(total_switches,occurances)
names(return_list) <- c("total_transitions","unique_switches")
return(return_list)
}
count_switches_groups(X1)
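In Python with NumPy (not R), where x holds the same data as a NumPy array, this counts every change in either direction:
sum((np.diff(x)==100)|(np.diff(x)==-100))
I think this would be the answer; it worked for me.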

Ignore NA in Ifelse statement- R [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have the following ifelse statement.
ww.LIG = ifelse( (Accel2$wk.VWD >= 3 & Accel2$we.VWD >= 0 )
| ( Accel2$wk.VWD >=2 & Accel2$we.VWD >=1 )
| ( Accel2$wk.VWD >=1 & Accel2$we.VWD >=2) ,
(Accel2$wk.LIG + Accel2$we.LIG)/2, NA)
The final line takes the average of two variables if the above conditions are met. For data that meets the first criteria in the first line (Accel2$wk.VWD >= 3 & Accel2$we.VWD >= 0 ) there is an NA for the variable named Accel2$we.VWD, which obviously returns a NAN when trying to do the calculation.
What is a simple way to remove NAs from this argument?
Many thanks.
You could solve this in two ways I think:
1) Another ifelse before this to check for NAs - something like:
ww.LIG = ifelse( is.na(Accel2$wk.VWD) | is.na(Accel2$we.VWD), NA,
ifelse( (Accel2$wk.VWD >= 3 & Accel2$we.VWD >= 0 )
| ( Accel2$wk.VWD >=2 & Accel2$we.VWD >=1 )
| ( Accel2$wk.VWD >=1 & Accel2$we.VWD >=2) ,
(Accel2$wk.LIG + Accel2$we.LIG)/2, NA))
2) Remove the NA rows to start with - something like:
df = na.omit(data.frame(wkVWD = Accel2$wk.VWD, weVWD = Accel2$we.VWD, wkLIG = Accel2$wk.LIG, weLIG = Accel2$we.LIG))  # drops every row that has an NA in any of the four columns
df$wwLIG = ifelse( (df$wkVWD >= 3 & df$weVWD >= 0 )
| ( df$wkVWD >=2 & df$weVWD >=1 )
| ( df$wkVWD >=1 & df$weVWD >=2) ,
(df$wkLIG + df$weLIG)/2, NA)
Does that work for you?
Your problem is ill-defined: what should be the result of the following comparison NA >= value, true or false? Define this first.
I will assume that an NA in the conditions means the condition is not satisfied (testing is.na(a + b) is just a shortcut; it can be written as two separate is.na checks as well):
a = Accel2$wk.VWD
b = Accel2$we.VWD
ww.LIG = ifelse(!is.na(a + b) &
((a >= 3 & b >= 0) | (a >= 2 & b >= 1) | (a >= 1 & b >= 2)),
(Accel2$wk.LIG + Accel2$we.LIG)/2, NA)
You may be going about it the hard way. I can't tell for sure without a sample of your data, but I think you can just replace your "averaging" line with a row-wise mean that skips NAs:
rowMeans(cbind(Accel2$wk.LIG, Accel2$we.LIG), na.rm=TRUE)
It's not clear whether you wanted to keep inputs containing NA values or not.
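Since the question is ambiguous, here is one possible combination as a sketch on invented data: an NA in either VWD column is treated as "condition not met", and an NA in one LIG column is skipped when averaging. Both choices are assumptions, not something the question confirms.

```r
# Invented data; only the column names come from the question
Accel2 <- data.frame(
  wk.VWD = c(3, 3, 1, 0),
  we.VWD = c(0, NA, 2, 1),
  wk.LIG = c(10, 20, 30, 40),
  we.LIG = c(20, 40, NA, 60)
)

a <- Accel2$wk.VWD
b <- Accel2$we.VWD

# Assumption: a missing VWD value means the row does not qualify
ok <- !is.na(a) & !is.na(b) &
  ((a >= 3 & b >= 0) | (a >= 2 & b >= 1) | (a >= 1 & b >= 2))

# Row-wise mean of the two LIG columns, ignoring a single missing value
avg <- rowMeans(cbind(Accel2$wk.LIG, Accel2$we.LIG), na.rm = TRUE)

ww.LIG <- ifelse(ok, avg, NA)
ww.LIG
```

With these values the result is 15, NA, 30, NA: row 2 fails because we.VWD is missing, and row 3 averages the one LIG value that is present.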

Changing data based on conditions in R

I have posted a similar question to this before and got a quick answer, but am just an R beginner and haven't been able to adapt it to what I need.
Basically I want to take the below code (says if Date_Index is between two numbers and df is < X, then turn df to Y) and make it so it only applies to entries that meet a certain criteria, i.e:
HAVE: df[df$Date_Index >= 50 & df$Date_Index <= 52 & df < .0000001]=1
ADD: if df$Date_Index <= 49 AND df = 0.00 ignore the above statement, else execute:
In other words I need the equivalent to an if, then, else clause. If Date_Index <= 49 and df = 0, leave alone, else if Date_Index >=50 and Date Index <= 52 and df < .001 then replace data (in Date Index rows 50-52) with 1.
This (simple) data set should illustrate it enough:
xx <- matrix(0,52,5)
xx[,1]=1
xx[,3]=1
xx[,5]=1
xx[50:52,]=0
xx[,1]=1:52
xx[50,3]=1
So what I'd like is column 2 and column 4 to stay all 0's but for the bottom of column 3 and 5 to continue to be all 1's.
I suppose you're looking for this:
xx[xx[,1] >= 50 & xx[,1] <= 52, c(FALSE, !colSums(!xx[xx[,1] <= 49, -1]))] <- 1
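The one-liner is dense, so here is the same logic unpacked into named steps on the question's own matrix. The variable names are mine, and colSums(... == 0) == 0 is just a more explicit spelling of !colSums(!...).

```r
# The data set from the question
xx <- matrix(0, 52, 5)
xx[, 1] <- 1
xx[, 3] <- 1
xx[, 5] <- 1
xx[50:52, ] <- 0
xx[, 1] <- 1:52
xx[50, 3] <- 1

# Rows to repair and rows used as the reference period
rows_fix <- xx[, 1] >= 50 & xx[, 1] <= 52
rows_ref <- xx[, 1] <= 49

# Data columns (all but the index) with no zero anywhere in the reference rows
cols_nonzero <- colSums(xx[rows_ref, -1] == 0) == 0

# Prepend FALSE so the index column itself is never overwritten
cols_fix <- c(FALSE, cols_nonzero)

xx[rows_fix, cols_fix] <- 1
tail(xx, 4)
```

Columns 2 and 4 are all zero in rows 1 to 49, so they are left alone; columns 3 and 5 have no zeros there, so their rows 50 to 52 are set to 1.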

Resources