Setting parameters for a factor - r

I am having trouble setting the parameters for a factor and would appreciate some help. I want to make a dummy variable that equals 0 when a variable takes any of five specific values and 1 for all other values. So far I have tried the following:
htd$CBSA = factor(with(data = htd, ifelse(( cbsa==41460|16980|35620|37980|14460),0,1)))
and
htd$CBSA = as.numeric(htd$cbsa == 41460|16980|35620|37980|14460)
and tried any combination of , and "" in place of the | and don't know where to go.
Thanks for any help

Notice:
> 1 == 3
[1] FALSE
> 1 == 3 | 1 == 2
[1] FALSE
> 1 == 3 | 2
[1] TRUE
You want %in%:
> 1 %in% c(3, 2)
[1] FALSE
> 1 %in% c(3, 1)
[1] TRUE
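Applied to the question, a minimal sketch could look like the following. The data frame contents here are made up for illustration; only the column name cbsa and the five codes come from the question, and the helper name special is an assumption.

```r
# Hypothetical stand-in for the asker's data frame `htd`
htd <- data.frame(cbsa = c(41460, 10000, 16980, 35620, 99999, 14460))

# The five codes that should map to 0 (taken from the question)
special <- c(41460, 16980, 35620, 37980, 14460)

# %in% is TRUE for members of `special`; negate it and convert to 0/1
htd$CBSA <- factor(as.integer(!(htd$cbsa %in% special)))
htd$CBSA
```

The original attempts fail because cbsa == 41460 | 16980 is read as (cbsa == 41460) | 16980, and any nonzero number counts as TRUE.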

Related

R - use if statement to regroup variable

I want to regroup a variable into a new one.
If value is 0, new one should be 0 too.
If value is 999, then make it missing, NA.
Everything else 1
This is my try:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
df$variable2 <-
if (df$variable == 0) {
df$variable2 = 0
} else if (df$variable == 999){
df$variable2 = NA
} else {
df$variable2 = 1
}
And this the error message:
In if (df$variable == 0) { : the condition has length > 1 and only
the first element will be used
A pretty basic question but I'm a basic user. Thanks in advance!
Try ifelse
df$variable2 <- ifelse(df$variable == 999, NA, ifelse(df$variable > 0, 1, 0))
df
# id variable variable2
#1 1 0 0
#2 2 0 0
#3 3 0 0
#4 4 1 1
#5 5 2 1
#6 6 3 1
#7 7 4 1
#8 8 5 1
#9 9 999 NA
#10 10 999 NA
When you do df$variable == 0, the resulting condition is a logical vector of length 10:
#[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
whereas if (condition) expects a length-one logical vector that is not NA; see ?"if".
You can avoid ifelse, for example, like so
df$variable2 <- df$variable
df$variable2[df$variable2 == 999] <- NA
df$variable2[df$variable2 > 0] <- 1
It might be easier to avoid the if/else statement altogether by using logical conditions inside the subset brackets (note that this overwrites df$variable in place):
# when df$variable is equal to zero, keep it at zero
df$variable[df$variable == 0] <- 0
# when df$variable is equal to 999, change it to NA
df$variable[df$variable == 999] <- NA
# when df$variable is greater than 0 and is not NA, change it to 1
df$variable[df$variable > 0 & !is.na(df$variable)] <- 1
Looks like you want to recode your variable. You can do this (and other data/variable transformations) with the sjmisc-package, in your case with the rec()-command:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
library(sjmisc)
rec(df, variable, rec = c("0=0;999=NA;else=1"))
#> id variable variable_r
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 0 0
#> 4 4 1 1
#> 5 5 2 1
#> 6 6 3 1
#> 7 7 4 1
#> 8 8 5 1
#> 9 9 999 NA
#> 10 10 999 NA
# or a single vector as input
rec(df$variable, rec = c("0=0;999=NA;else=1"))
#> [1] 0 0 0 1 1 1 1 1 NA NA
There are many examples, also in the help file, and you can find a sjmisc cheatsheet in the RStudio cheatsheet collection.
df$variable2 <- sapply(df$variable,
function(el) if (el == 0) {0} else if (el == 999) {NA} else {1})
This one-liner mirrors your description:
If value is 0, new one should be 0 too. If value is 999, then make it
missing, NA. Everything else 1
It is slightly slower than @markus's second solution or @SPJ's solution, which are the most idiomatic R solutions.
Why one should keep one's hands off ifelse
tt <- c(TRUE, FALSE, TRUE, FALSE)
a <- c("a", "b", "c", "d")
b <- 1:4
ifelse(tt, a, b) ## [1] "a" "2" "c" "4"
# totally perfect and as expected!
df <- data.frame(a=a, b=b, c=tt)
df$d <- ifelse(df$c, df$a, df$b)
## > df
## a b c d
## 1 a 1 TRUE 1
## 2 b 2 FALSE 2
## 3 c 3 TRUE 3
## 4 d 4 FALSE 4
######### This is wrong!! ##########################
## df$d is not [1] "a" "2" "c" "4"
## the problem is that
## ifelse(df$c, df$a, df$b)
## returns for each TRUE or FALSE the entire
## df$a or df$b instead of treating it like a vector.
## Since the last df$c is FALSE, df$b is returned
## Thus we get df$b for df$d.
## Quite an unintuitive behaviour.
##
## If one uses purely vectors, ifelse is fine.
## But actually df$c, df$a, df$b should be treated each like a vector.
## However, `ifelse` does not.
## No warnings that using `ifelse` with them will lead to a
## totally different behaviour.
## In my view, this is a design mistake of `ifelse`.
## Thus I decided myself to abandon `ifelse` from my set of R commands.
## To avoid that such kind of mistakes can ever happen.
#####################################################
As @Parfait pointed out correctly, the comments above were a misinterpretation.
The actual problem was that df$a was stored in the data frame as a factor.
df <- data.frame(a=a, b=b, c=tt, stringsAsFactors = FALSE)
df$d <- ifelse(df$c, df$a, df$b)
df
Gives the correct result.
a b c d
1 a 1 TRUE a
2 b 2 FALSE 2
3 c 3 TRUE c
4 d 4 FALSE 4
Thank you, @Parfait, for pointing that out!
Strange that I didn't notice that in my initial trials.
But yeah, you are absolutely right!
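The factor behaviour discussed in this answer can be reproduced on any R version by asking for factors explicitly (since R 4.0.0, data.frame() no longer converts strings to factors by default). This is only a small sketch; the names codes and labels are illustrative.

```r
tt <- c(TRUE, FALSE, TRUE, FALSE)
a  <- c("a", "b", "c", "d")
b  <- 1:4

# Request the factor conversion explicitly so the result is version-independent
df <- data.frame(a = a, b = b, c = tt, stringsAsFactors = TRUE)

# With a factor, ifelse() picks up the underlying integer codes, not the labels
codes <- ifelse(df$c, df$a, df$b)

# Converting the factor to character first restores the labels
labels <- ifelse(df$c, as.character(df$a), df$b)

codes
labels
```

So ifelse itself is vectorised as expected; the surprise comes from the factor column, and as.character() is one way around it.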

counting the larger value

Completely new to R and am trying to count how many numbers in a list are larger than the one right before.
This is what I have so far,
count <- 0
number <- function(value) {
for (i in 1:length(value))
{ if(value[i+1] > value[i])
{count <- count + 1}
}
}
x <- c(1,2,1,1,3,5)
number(x)
The output should be 3 based on the list.
Any help or advice would be greatly appreciated!
A base R alternative would be diff
sum(diff(x) > 0)
#[1] 3
Or we can drop the first value from one copy of the vector and the last value from the other, then compare the two element by element.
sum(x[-1] > x[-length(x)])
#[1] 3
where
x[-1]
#[1] 2 1 1 3 5
x[-length(x)]
#[1] 1 2 1 1 3
You can lag your vector and count how many times your initial vector is greater than your lagged vector
library(dplyr)
sum(x>lag(x), na.rm = TRUE)
In detail, lag(x) does:
> lag(x)
[1] NA 1 2 1 1 3
so x > lag(x) does
> x>lag(x)
[1] NA TRUE FALSE FALSE TRUE TRUE
The sum of the above is 3.
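For completeness, the loop from the question can also be repaired directly. The sketch below is one possible fix, not the only one: the counter lives inside the function, the loop stops one element early so value[i + 1] never runs past the end, and the count is returned. The function name count_increases is made up for this example.

```r
count_increases <- function(value) {
  count <- 0
  # Stop at length - 1 so that value[i + 1] always exists
  for (i in seq_len(max(length(value) - 1, 0))) {
    if (value[i + 1] > value[i]) {
      count <- count + 1
    }
  }
  count  # the last expression is the function's return value
}

x <- c(1, 2, 1, 1, 3, 5)
count_increases(x)
```

The original version modified a local copy of count and returned nothing useful, and at the last iteration it compared against value[length(value) + 1], which is NA.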

conditional assignment of values

I have a dataset with two variables:
ACCURACY Feedback
141 0 3
156 0 1
167 1 2
185 1 1
191 1 NA
193 1 1
I have created a new column called X, where I would like to assign 3 potential values (correct, incorrect, unknown) based on combinations between the previous two values (i.e. accuracy ~ Feedback).
I have tried the following:
df$X=NA
df[!is.na((df$ACC==1)&(df$Feedback==1)),]$X <- "correct"
df[!is.na((df$ACC==1)&(df$Feedback==2)),]$X <- "unknown"
df[!is.na((df$ACC==1)&(df$Feedback==3)),]$X <- "incorrect"
df[!is.na((df$ACC==0)&(df$Feedback==1)),]$X <- "correct"
df[!is.na((df$ACC==0)&(df$Feedback==2)),]$X <- "unknown"
df[!is.na((df$ACC==0)&(df$Feedback==3)),]$X <- "incorrect"
But this doesn't assign a value in X based on both ACC and Feedback; instead, each line of code overrides the values assigned by the previous one.
I would appreciate any guidance/suggestions.
This can be done with nested ifelse functions. Although, based on the example posted, it looks like X depends only on Feedback, never ACCURACY.
ACCURACY Feedback
1 0 3
2 0 1
3 1 2
4 1 1
5 1 NA
6 1 1
df$X <- ifelse(df$ACCURACY == 1,
               ifelse(df$Feedback == 1, "correct",
                      ifelse(df$Feedback == 2, "unknown", "incorrect")),
               ifelse(df$Feedback == 1, "correct",
                      ifelse(df$Feedback == 2, "unknown", "incorrect")))
ACCURACY Feedback X
1 0 3 incorrect
2 0 1 correct
3 1 2 unknown
4 1 1 correct
5 1 NA <NA>
6 1 1 correct
If the values of X indeed do not depend on ACCURACY, you could just recode Feedback as a factor
df$X <- factor(df$Feedback,
levels = c(1, 2, 3),
labels = c("correct", "unknown", "incorrect"))
The issue is that you've wrapped all the assignment conditions with !is.na. These vectors all evaluate to the same thing. For example:
> !is.na((df$ACC==1)&(df$Feedback==2))
[1] TRUE TRUE TRUE TRUE FALSE TRUE
> !is.na((df$ACC==1)&(df$Feedback==3))
[1] TRUE TRUE TRUE TRUE FALSE TRUE
A possible solution would be to write a little function to do the assignments you want, and then use apply.
recoder <- function(row) {
accuracy <- row[['ACCURACY']]
feedback <- row[['Feedback']]
if(is.na(accuracy) || is.na(feedback)) {
ret_val <- NA
}
else if((accuracy==1 && feedback==1) || (accuracy==0 && feedback==1)) {
ret_val <- "correct"
}
else if((accuracy==1 && feedback==2) || (accuracy==0 && feedback==2)) {
ret_val <- "unknown"
}
else {
ret_val <- "incorrect"
}
return(ret_val)
}
df$X <- apply(df, 1, recoder)
df
> df
ACCURACY Feedback X
141 0 3 incorrect
156 0 1 correct
167 1 2 unknown
185 1 1 correct
191 1 NA <NA>
193 1 1 correct
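The overwriting described in the question can also be avoided while keeping the subset-assignment style, by dropping the !is.na() wrapper and selecting rows with which(), which ignores NA comparisons. This is a sketch under the assumption that the six rules in the question are the intended mapping; the row names of the original data are not reproduced.

```r
df <- data.frame(
  ACCURACY = c(0, 0, 1, 1, 1, 1),
  Feedback = c(3, 1, 2, 1, NA, 1)
)

df$X <- NA_character_

# which() returns only positions where the condition is TRUE,
# so rows with a missing Feedback are left as NA
df$X[which(df$ACCURACY == 1 & df$Feedback == 1)] <- "correct"
df$X[which(df$ACCURACY == 1 & df$Feedback == 2)] <- "unknown"
df$X[which(df$ACCURACY == 1 & df$Feedback == 3)] <- "incorrect"
df$X[which(df$ACCURACY == 0 & df$Feedback == 1)] <- "correct"
df$X[which(df$ACCURACY == 0 & df$Feedback == 2)] <- "unknown"
df$X[which(df$ACCURACY == 0 & df$Feedback == 3)] <- "incorrect"

df
```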

In R, how to add TRUE / FALSE column if a value differs from previous or following value?

Sample data:
df <- data.frame(
time = c(1, 2, 3, 4, 5, 6, 7),
status = c("good", "good", "good", "bad", "good", "good", "good")
)
Output:
time status
1 1 good
2 2 good
3 3 good
4 4 bad
5 5 good
6 6 good
7 7 good
I would like to add a new column statuschange IF status differs from the row above or below. The output would look like this:
time status statuschange
1 1 good NA
2 2 good TRUE
3 3 good FALSE
4 4 bad FALSE
5 5 good FALSE
6 6 good TRUE
7 7 good NA
I have the sense that are lots of ways to do this, but I haven't been able to figure it out. Any assistance is appreciated!
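You can apply diff to see if two entries are the same. You want two of these diffs, to see if both entries around an element are the same. Here x is the data frame, and status is assumed to be a factor so that as.numeric returns its integer codes (with a character column, use as.numeric(factor(x$status)) instead):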
> !(c(NA, diff(as.numeric(x$status))) | c(rev(diff(as.numeric(rev(x$status)))), NA))
[1] NA TRUE FALSE FALSE FALSE TRUE NA
The first expression tells whether the prior element is different:
> c(NA, diff(as.numeric(x$status)))
[1] NA 0 0 -1 1 0 0
The second tells whether the following element is different:
> c(rev(diff(as.numeric(rev(x$status)))), NA)
[1] 0 0 1 -1 0 0 NA
The "or" operation | returns TRUE for nonzero, which means a leading or following element is different, so we then invert the result with the leading !.
You can use something like this
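# start with NA so the first and last rows (which have only one neighbour) stay NA
df$A <- NA
for (i in 2:(nrow(df) - 1)) {
  # TRUE when the status matches both the previous and the next row
  df$A[i] <- df$status[i] == df$status[i - 1] &&
    df$status[i] == df$status[i + 1]
}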
df
Another solution using looping:
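# logical vector, NA for the first and last rows, which have only one neighbour
statuschange <- rep(NA, nrow(df))
for (i in seq_along(df$status)[-c(1, nrow(df))]) {
  statuschange[i] <- df$status[i] == df$status[i - 1] && df$status[i] == df$status[i + 1]
}
df$statuschange <- statuschange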
Here's an approach using zoo::rollapply to loop through the column:
> library(zoo)
> c(NA, rollapply(x$status, width=3, FUN=function(x) x[1]==x[2] & x[2]==x[3]), NA)
[1] NA TRUE FALSE FALSE FALSE TRUE NA
This creates a series of "windows", each of width 3, rolling across the data, and applies the function to each one. Because the window has width 3, the result is two elements shorter than the original column (width - 1 shorter). The endpoint NA values are then added with c.
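The same result can be sketched in base R without a loop or an extra package, by comparing the column with shifted copies of itself. The intermediate names same_prev and same_next are made up for this example, and the column is converted with as.character() so the code does not depend on whether status is a factor.

```r
df <- data.frame(
  time = c(1, 2, 3, 4, 5, 6, 7),
  status = c("good", "good", "good", "bad", "good", "good", "good")
)

status <- as.character(df$status)
n <- length(status)

# Is each element equal to the one before it? (no predecessor for row 1)
same_prev <- c(NA, status[-1] == status[-n])
# Is each element equal to the one after it? (no successor for row n)
same_next <- c(status[-n] == status[-1], NA)

df$statuschange <- same_prev & same_next
df
```

Note that FALSE & NA is FALSE in R, so an end row would only stay NA when its single neighbour matches; with this sample data both end rows come out as NA, as in the desired output.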

Error in If else loop in R

I am writing some code in R. First I create a blank column in the data set, and then I want to assign the values 0 and 1 in that column according to some conditions. Here is my code
#Creating a empty column in the data file
Mydata$final <- "";
#To assign 0,1 value in final variable
if(Mydata$Default_Config == "No" & is.na(Mydata$Best_Config)=="TRUE" & (Mydata$AlmostDefaultConfig!=1 | Mydata$AlmostDefaultConfig!=3)){
Mydata$final <- 1
}else{
Mydata$final <- 0
}
And I am getting this error
Warning message:
In if (Mydata$Default_Config == "No" & is.na(Mydata$Best_Config) == :
the condition has length > 1 and only the first element will be used
How Can I fix this error? Please help me out. Thanks in advance
Your problem is one of vectorisation. if is not vectorised. You are testing multiple values in each comparison in your if statement and R is telling you it will only use the first because if is not vectorised. You need ifelse which is vectorised:
Mydata$final <- ifelse(Mydata$Default_Config == "No" & is.na(Mydata$Best_Config) & (Mydata$AlmostDefaultConfig != 1 | Mydata$AlmostDefaultConfig != 3), 1, 0)
A reproducible example is below. If x is > 5 and y is even then return 1 otherwise return 0:
x <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
y <- seq(1,30,3)
# [1] 1 4 7 10 13 16 19 22 25 28
x > 5
# [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
y %% 2 == 0
# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
ifelse( x > 5 & y %% 2 == 0 , 1 , 0 )
# [1] 0 0 0 0 0 1 0 1 0 1
An alternative approach is to take advantage of R's coercion. You have a set of conditionals which are vectorizable, and R is happy to convert TRUE/FALSE to 1 / 0, so you can write it like:
Mydata$final <- (Mydata$Default_Config == "No") * is.na(Mydata$Best_Config) * ((Mydata$AlmostDefaultConfig != 1) + (Mydata$AlmostDefaultConfig != 3))
(extra parentheses added for clarity; the ones around each != comparison are required, because + binds more tightly than !=).
Apologies if I fouled up the logic there.
Edit: My code for the OR won't quite work, since if both sides are TRUE you'd get a big number ("2" :-) ). Change the last factor to as.logical((Mydata$AlmostDefaultConfig != 1) + (Mydata$AlmostDefaultConfig != 3))
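One further point, offered as a guess about intent rather than a correction: as written, AlmostDefaultConfig != 1 | AlmostDefaultConfig != 3 is TRUE for every non-missing value, since no number equals both 1 and 3. If the goal was "neither 1 nor 3", a sketch could look like this. The sample data and the name cond are invented for illustration.

```r
# Hypothetical stand-in for the asker's Mydata
Mydata <- data.frame(
  Default_Config      = c("No", "No", "Yes", "No"),
  Best_Config         = c(NA, "A", NA, NA),
  AlmostDefaultConfig = c(2, 2, 2, 1)
)

# Build the vectorised condition once; %in% handles "neither 1 nor 3"
cond <- Mydata$Default_Config == "No" &
  is.na(Mydata$Best_Config) &
  !(Mydata$AlmostDefaultConfig %in% c(1, 3))

# TRUE/FALSE converts directly to 1/0
Mydata$final <- as.integer(cond)
Mydata
```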

Resources