Translating Stata to R and having some is.na troubles

I'm translating Stata code to R code, but now I'm having some n00b troubles like this one.
This is my Stata code:
gen aposentadofam=1 if proprendaposent > 0 & proprendaposent ~=.;
replace aposentadofam=0 if proprendaposent == 0 | proprendaposent ==.;
And this is what I tried to do in R:
# pemg <- mutate(pemg, aposentadofam = NA_real_)
# pemg <- mutate(pemg, aposentadofam = case_when(proprendaposent >0 & !is.na(proprendaposent) ~ 1, TRUE ~ aposentadofam))
# pemg <- mutate(pemg, aposentadofam = case_when(proprendaposent==0 | is.na(proprendaposent) ~ 0, TRUE ~ aposentadofam))
The line with is.na() seems to be running correctly, but the one with !is.na() does not. It gives me this error message:
LHS of case 1 (`proprendaposent > 0 & !is.na(proprendaposent) ~ 1`) must be a logical vector, not a `formula` object.
What should I do?

Not enough reputation to comment (yet!) but I just ran the following using your example code (in R) with no issues. How exactly does your data/code differ?
library(dplyr)
pemg <- data.frame(c(1, 2, 3.1, 4, 5.5, 0, 0, 0, NA))
colnames(pemg) <- "proprendaposent"
pemg <- mutate(pemg, aposentadofam = NA_real_)
pemg <- mutate(pemg, aposentadofam = case_when(proprendaposent >0 & !is.na(proprendaposent) ~ 1, TRUE ~ aposentadofam))
pemg <- mutate(pemg, aposentadofam = case_when(proprendaposent==0 | is.na(proprendaposent) ~ 0, TRUE ~ aposentadofam))
pemg
which outputs:
proprendaposent aposentadofam
1 1.0 1
2 2.0 1
3 3.1 1
4 4.0 1
5 5.5 1
6 0.0 0
7 0.0 0
8 0.0 0
9 NA 0

Often, within() is most illustrative.
dat <- within(dat, {
aposentadofam <- NA
aposentadofam[proprendaposent > 0 & !is.na(proprendaposent)] <- 1
aposentadofam[proprendaposent == 0 | is.na(proprendaposent)] <- 0
})
Or using transform().
dat <- transform(dat, aposentadofam=ifelse(proprendaposent %in% c(0, NA), 0, 1))
Both functions come with base R, so you won't need any extra packages (which is rather rarely the case anyway).
# proprendaposent aposentadofam
# 1 0 0
# 2 4 1
# 3 0 0
# 4 0 0
# 5 1 1
# 6 3 1
# 7 1 1
# 8 1 1
# 9 0 0
# 10 NA 0
# 11 NA 0
# 12 3 1
Data
dat <- structure(list(proprendaposent = c(0L, 4L, 0L, 0L, 1L, 3L, 1L,
1L, 0L, NA, NA, 3L)), class = "data.frame", row.names = c(NA,
-12L))
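One difference from the Stata code may be worth checking. The Stata version leaves aposentadofam missing for negative values, whereas the ifelse(... %in% c(0, NA), 0, 1) one-liner codes them as 1. Below is a minimal base R sketch that keeps the Stata behaviour. The vector x and its negative value are made up for illustration; the real data may contain no negatives at all.

```r
x <- c(2, 0, NA, -1)   # hypothetical values, including one negative

# Start with everything missing, as the Stata gen ... if does for unmatched rows
aposentadofam <- rep(NA_real_, length(x))

# Comparisons with NA return NA, so NA is tested explicitly in each condition
aposentadofam[!is.na(x) & x > 0] <- 1
aposentadofam[is.na(x) | x == 0] <- 0

# For comparison: the %in% one-liner maps the negative value to 1
one_liner <- ifelse(x %in% c(0, NA), 0, 1)
```

If the column can never be negative, the two approaches give the same result and the one-liner is the shorter choice.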


Nested ifelse() statement in R not producing the desired results

I have a data set of boolean variables and I am trying to generate a new variable based on 3 of the existing booleans using ifelse().
The rules I'd like to implement are:
If any of the three columns have value 1, 1
If all of the three columns have value 0, 0
If all of the three columns have value NA, NA
If the three columns have some combination of 0 and NA, 0
Here is the code to generate a sample with 3 variables that I want to use to create a fourth:
df <- structure(list(var1 = c(NA, NA, NA, 0,1),
var2 = c(1, NA, 0,0, 1),
var3 = c(NA, NA, NA,0,1)), class = "data.frame", row.names = c(NA, -5L))
I have tried the following to generate the new variable according to my desired rules:
df$newvar1 <-ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse((is.na(df$var1) & is.na(df$var2) & is.na(df$var3)), NA,0))
df$newvar2 <- ifelse((is.na(df$var1)|df$var1==0) &
(is.na(df$var2)|df$var2==0) &
(is.na(df$var3)|df$var3==0),0,
ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse(is.na(df$var1) & is.na(df$var2) & is.na(df$var3), NA,NA)))
df$newvar3 <-ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse((is.na(df$var1) & is.na(df$var2) & is.na(df$var3)), NA,
ifelse((is.na(df$var1)|df$var1==0) &
(is.na(df$var2)|df$var2==0) &
(is.na(df$var3)|df$var3==0),0,0)))
I don't understand why newvar1 and newvar3 have NA values corresponding to combinations of NAs and 0s when both examples use "&" between the na specifications (row 3 in the results).
I am assuming that NAs don't show up in newvar2 because the first ifelse() function takes precedent.
Any insight to the ifelse() function or advice on how to get the results I'm looking for would be really helpful.
Here is another possible option using rowSums:
df$newvar <- +(rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0) > 0)
# var1 var2 var3 newvar
#1 NA 1 NA 1
#2 NA NA NA NA
#3 NA 0 NA 0
#4 0 0 0 0
#5 1 1 1 1
This gives your expected results:
df$newvar <- 0
df$newvar[Reduce(`|`, lapply(df[1:3], `%in%`, 1))] <- 1
df$newvar[Reduce(`&`, lapply(df[1:3], is.na))] <- NA
df
# var1 var2 var3 newvar
# 1 NA 1 NA 1
# 2 NA NA NA NA
# 3 NA 0 NA 0
# 4 0 0 0 0
# 5 1 1 1 1
This defaults to 0 and only changes values with known conditions. Because %in% treats NA as a non-match instead of propagating it, any row containing a 1 is assigned 1 even when the other columns are NA, and only rows that are NA in all three columns become NA.
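As for why newvar1 and newvar3 in the question contain NA for mixed rows of NA and 0: the likely cause is that == and | propagate NA, so the first ifelse() test is already NA for such rows and ifelse() returns NA without ever reaching the nested calls. The sketch below illustrates this on the question's sample data and shows one NA-safe way to write the two conditions. The names first_test, any_one and all_na are introduced here for illustration.

```r
df <- data.frame(var1 = c(NA, NA, NA, 0, 1),
                 var2 = c(1, NA, 0, 0, 1),
                 var3 = c(NA, NA, NA, 0, 1))

# Row 3 is (NA, 0, NA): NA | FALSE | NA evaluates to NA, not FALSE,
# so ifelse() yields NA for that row regardless of the nested branches
first_test <- df$var1 == 1 | df$var2 == 1 | df$var3 == 1

# NA-safe versions of the conditions: count matches per row, ignoring NA
any_one <- rowSums(df == 1, na.rm = TRUE) > 0   # at least one column equals 1
all_na  <- rowSums(!is.na(df)) == 0             # no non-missing value in the row

df$newvar <- ifelse(any_one, 1, ifelse(all_na, NA, 0))
```

Since any_one and all_na never contain NA themselves, the nested ifelse() behaves as the rules in the question describe.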

How do I code 2 separate categorical variables into a single one in R?

I have two continuous variables that I dummy coded into categorical variables with 2 levels each. Each of these variables is coded either 0 or 1 for low and high levels of the variable. Both variables were z-scored to know if they fell below or above the mean.
MeanAboveAvo <- ifelse(Dataframeforstudy2$avo < 0, 0, 1)
MeanAboveAnx <- ifelse(Dataframeforstudy2$anx < 0, 0 , 1)
My question is how do I dummy code these two variables together? I want to create a single variable with 4 different levels using these two variables (MeanAboveAvo & MeanAboveAnx). I want a single variable that is coded with either 1,2,3,4 and the 1 is (0,0), 2 is (0,1), 3 is (1,0) and 4 is (1,1).
My code is this:
stats <- while(MeanAboveAnx = 0 || MeanAboveAvx = 1) {
if(MeanAboveAnx = 0 & MeanAboveAvo = 0 ){
1
}
else if (MeanAboveAnx = 0 & MeanAboveAvo = 1){
2
}
else if(MeanAboveAnx = 1 & MeanAboveAvo = 0){
3
}
else {
4
}}
It is not coding it at all and I am getting an error message. What can I do differently to get the results I want?
Thank you for your help in advance!
Base R has function interaction precisely for this type of problem. The code below can become a one-liner, I leave it like this in order to make it more clear.
f <- with(df, interaction(anx, avo, lex.order = TRUE))
as.integer(f)
# [1] 1 2 1 1 2 3 3 3 4 2
Edit.
I was using the data in TomasIsCoding's answer; here is a solution closer to the question's problem, with anx and avo as z-scores. Thanks to @KonradRudolph for his comment.
f <- with(df, interaction(as.integer(anx < 0),
as.integer(avo < 0),
lex.order = TRUE))
f
# [1] 1.1 0.1 0.1 1.0 0.0 0.1 1.1 1.1 1.1 1.0
#Levels: 0.0 0.1 1.0 1.1
as.integer(f)
# [1] 4 2 2 3 1 2 4 4 4 3
Data.
set.seed(1234)
df <- data.frame(anx = rnorm(10), avo = rnorm(10))
Categorical variables in R don’t need to be numeric (and making them so has several drawbacks!): there’s consequently no need for your ifelse:
MeanAboveAvo <- Dataframeforstudy2$avo >= 0
MeanAboveAnx <- Dataframeforstudy2$anx >= 0
Next, the code using these encodings contains multiple mistakes:
It’s not clear what the while here is supposed to mean.
All = signs need to be converted to == because you’re performing comparisons.
if, unlike ifelse, isn’t vectorised so you cannot use it to assign its result to a vector of length > 1.
If I understand you correctly, then the following is one (canonical) way of encoding the stats:
stats <- paste(MeanAboveAvo, MeanAboveAnx)
This converts the logical vectors into character vectors and concatenates them element-wise. Once again, it is unnecessary (and unconventional!) in R to convert these categories into a numeric variable; though it may make sense to convert it to a factor via as.factor.
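If the 1 to 4 coding from the question is still wanted, one option is to fix the factor levels explicitly, so the integer codes do not depend on which combinations happen to occur in the data. This is only a sketch: the z-scores below are made up, and it assumes the first element of each pair in the question refers to avo and the second to anx.

```r
avo <- c(-0.5, 0.3, -1.2, 0.8)   # hypothetical z-scores
anx <- c(-0.1, -0.7, 0.4, 1.1)   # hypothetical z-scores

above_avo <- avo >= 0            # TRUE when at or above the mean
above_anx <- anx >= 0

# Explicit level order: (0,0), (0,1), (1,0), (1,1)
stats_code <- factor(paste(above_avo, above_anx),
                     levels = c("FALSE FALSE", "FALSE TRUE",
                                "TRUE FALSE", "TRUE TRUE"))

codes <- as.integer(stats_code)  # 1, 3, 2, 4 for the values above
```

With the levels fixed, a combination that is absent from the data simply has a count of zero instead of shifting the codes of the others.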
Given the mapping rule for coding anx and avo, you don't actually need a while loop, since yours is a shifted mapping from binary to decimal. In this case, you can do it like this:
df <- within(df,code <- 2*anx + avo + 1)
such that
> df
anx avo code
1 0 0 1
2 0 1 2
3 0 0 1
4 0 0 1
5 0 1 2
6 1 0 3
7 1 0 3
8 1 0 3
9 1 1 4
10 0 1 2
Dummy Data
df <- structure(list(anx = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L
), avo = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
Try this:
as.integer(factor(paste0(MeanAboveAvo, MeanAboveAnx)))
For example:
set.seed(123)
x <- sample(0:1, 10, T) # [1] 0 0 0 1 0 1 1 1 0 0
y <- sample(0:1, 10, T) # [1] 1 1 1 0 1 0 1 0 0 0
as.integer(factor(paste0(x, y)))
# [1] 2 2 2 3 2 3 4 3 1 1

R how to create counter and make it print to new column before every reset

Suppose I have the following dataframe
df
R1 R2
0 0
1 1
1 1
0 1
1 1
0 0
0 1
1 0
0 0
1 0
1 0
1 1
And I wish to create a counter that counts, in every column individually, the consecutive occurrences of 1s, resets after every 0, and outputs the counts in a new column. That is, in column 1 it would reset at the first step, then count to 1, then count to 2, then reset, then count 1, then reset, reset, etc., with the desired output for column 1 being:
df
R1(Counted)
N/A
N/A
2
N/A
1
N/A
N/A
1
N/A
N/A
N/A
3
I suspect I need something like:
Counter = 0
for i = 1:nrow(df){
if (???==1){
counter=counter+1
} else {
counter=0
}
}
But I really have no experience with counters and don't know how to make it continuously print its count to a new column before reseting the counter or anything like that.
Any help is much appreciated
We can create a function taking help from data.table::rleid to create groups based on every change in value. Turn all the values to NA except the ones where the value is 1 and it is the last element in the group.
get_counter <- function(ct) {
ave(ct, data.table::rleid(ct), FUN = function(x)
replace(seq_along(x), x != 1 | seq_along(x) != length(x), NA))
}
This function can be applied to multiple columns using lapply
df[paste0("ct_", names(df))] <- lapply(df, get_counter)
df
# R1 R2 ct_R1 ct_R2
#1 0 0 NA NA
#2 1 1 NA NA
#3 1 1 2 NA
#4 0 1 NA NA
#5 1 1 1 4
#6 0 0 NA NA
#7 0 1 NA 1
#8 1 0 1 NA
#9 0 0 NA NA
#10 1 0 NA NA
#11 1 0 NA NA
#12 1 1 3 1
data
df <- structure(list(R1 = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L,
1L, 1L), R2 = c(0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L
)), class = "data.frame", row.names = c(NA, -12L))
Here is how I would do it:
a <- c(0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1)
b <- sequence(rle(as.character(a))$lengths)
b[a == 0] <- NA
b[!is.na(dplyr::lead(b))] <- NA # this finds any where the next value isn't NA
b
# NA NA 2 NA 1 NA NA 1 NA NA NA 3
You could make this into a function and lapply over your data.frame to do every column all at once if you have more than 1 to do, like this:
counter <- function(x){
count <- sequence(rle(as.character(x))$lengths)
count[x == 0] <- NA
count[!is.na(dplyr::lead(count))] <- NA
return(count)
}
df <- data.frame(
R1 = sample(c(0, 1), 20, T, c(0.2, 0.8)),
R2 = sample(c(0, 1), 20, T, c(0.7, 0.3))
)
df[paste0(names(df), '_ct')] <- lapply(df, counter)
Here is a (rather convoluted) solution with just base R using a while loop each (for R1 and R2)!!
df <- data.frame(R1 = c(0,1,1,0,1,0,0,1,0,1,1,1), R2 = c(0,1,1,1,1,0,1,0,0,0,0,1))
#For R1
mycount <- 0
i <- 1
df$R1_counted <- NA
while(i <= nrow(df)){
mycount <- mycount + df$R1[i]
if(df$R1[i] == 0 & i == 1){
df$R1_counted[i] <- NA
} else if(df$R1[i] != 0 & i == 1){
df$R1_counted[i] <- df$R1[i]
}
if(df$R1[i] == 0 & i > 1){
df$R1_counted[i] <- NA
if(df$R1[i-1] != 0){df$R1_counted[i-1] <- mycount}
mycount <- 0
} else if(df$R1[i] != 0 & i > 1){
df$R1_counted[i] <- NA
}
if(i == nrow(df) & df$R1[i] != 0){
df$R1_counted[i] <- mycount
}
i <- i + 1
}
#For R2
mycount <- 0
i <- 1
df$R2_counted <- NA
while(i <= nrow(df)){
mycount <- mycount + df$R2[i]
if(df$R2[i] == 0 & i == 1){
df$R2_counted[i] <- NA
} else if(df$R2[i] != 0 & i == 1){
df$R2_counted[i] <- df$R2[i]
}
if(df$R2[i] == 0 & i > 1){
df$R2_counted[i] <- NA
if(df$R2[i-1] != 0){df$R2_counted[i-1] <- mycount}
mycount <- 0
} else if(df$R2[i] != 0 & i > 1){
df$R2_counted[i] <- NA
}
if(i == nrow(df) & df$R2[i] != 0){
df$R2_counted[i] <- mycount
}
i <- i + 1
}
df
# R1 R2 R1_counted R2_counted
#1 0 0 NA NA
#2 1 1 NA NA
#3 1 1 2 NA
#4 0 1 NA NA
#5 1 1 1 4
#6 0 0 NA NA
#7 0 1 NA 1
#8 1 0 1 NA
#9 0 0 NA NA
#10 1 0 NA NA
#11 1 0 NA NA
#12 1 1 3 1
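For comparison, the same result can probably be reached in base R without an explicit loop by working directly on the output of rle(). This is a sketch, not a drop-in replacement for the answers above. The helper name run_end_counts is made up here, and it assumes the columns contain only 0 and 1.

```r
# Hypothetical helper: put the length of each run of 1s at the run's last position
run_end_counts <- function(x) {
  r <- rle(x)                      # run values and run lengths
  ends <- cumsum(r$lengths)        # index of the last element of each run
  ones <- which(r$values == 1)     # which runs consist of 1s
  out <- rep(NA_integer_, length(x))
  out[ends[ones]] <- r$lengths[ones]
  out
}

df <- data.frame(R1 = c(0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1),
                 R2 = c(0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1))

# Apply to every column at once
df[paste0(names(df), "_counted")] <- lapply(df, run_end_counts)
```

Because the counts come straight from the run lengths, no counter variable or per-row branching is needed.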

how to define an indicator with 2 columns

I have these columns
utility pass
2 None
3 NA
-1 None
-2 NA
indicator is 1 if : pass=None and utility>0
Desired output:
utility pass indicator
2 None 1
3 NA 0
-1 None 0
-2 NA 0
One possibility could be:
with(df, +(grepl("None", pass, fixed = TRUE) * utility > 0))
[1] 1 0 0 0
Assuming that NA is an NA and not a character, we can achieve the desired output as follows :
With dplyr (case_when would also work):
library(dplyr)
df %>%
mutate(indicator = ifelse( !is.na(pass) & utility >0 , 1, 0))
utility pass indicator
1 2 None 1
2 3 <NA> 0
3 -1 None 0
4 -2 <NA> 0
Without relying on external packages, we can do the following with the base package:
df$indicator <- ifelse( !is.na(df$pass) & df$utility >0 , 1, 0)
Using within:
within(df, {
indicator <- ifelse(!is.na(pass) & utility >0, 1, 0)
})
utility pass indicator
1 2 None 1
2 3 <NA> 0
3 -1 None 0
4 -2 <NA> 0
Data:
df <- structure(list(utility = c(2L, 3L, -1L, -2L), pass = structure(c(1L,
NA, 1L, NA), .Label = "None", class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
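The ifelse() answers above test !is.na(pass), which would also flag any other non-missing value of pass. If pass can take values other than "None", it may be safer to test for "None" explicitly. A minimal base R sketch follows; the data frame is rebuilt here with pass as a character column, which is an assumption about the real data.

```r
df <- data.frame(utility = c(2, 3, -1, -2),
                 pass = c("None", NA, "None", NA))

# %in% returns FALSE for NA, so no separate is.na() check is needed
df$indicator <- as.integer(df$pass %in% "None" & df$utility > 0)
```

The same expression also works when pass is a factor, since %in% matches on the labels.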

Loop with conditions in R programming

I would like to check whether the previous row's value is the same as the current one (for more than one variable, and also against a list of values). How do I write code for this? I read that the 'apply' functions can be used.
I searched this topic here before posting and found somewhat similar questions, but was unable to find an exact match. I'm quite new to R.
Here is my sample table: (Flag needs to be done based on conditions)
Ticket No V1 V2 Flag
Tkt10256 1 X 0
Tkt10257 1 aa 0
Tkt10257 2 bb 1
Tkt10257 3 x 0
Tkt10260 1 cc 0
Tkt10260 2 aa 1
Tkt10262 3 bb 0
I have to Flag based on the below conditions (if all the conditions are satisfied then mark as 1)
Variable 2 should be the following one of 4 names (aa, bb, cc, dd)
Variable 1 should be the different from previous row
Ticket number has to be the same as previous row
Thanks in advance for the help !
An approach without looping:
indx1 <- with(df, V2 %in% paste0(letters[1:4], letters[1:4]) )
indx2 <- with(df, c(TRUE,V1[-1]!=V1[-length(V1)]))
indx3 <- with(df, c(FALSE,Ticket.No[-1]==Ticket.No[-nrow(df)]))
df$Flag <- (indx1 & indx2 & indx3)+0
df$Flag
#[1] 0 0 1 0 0 1 0
data
df <- structure(list(Ticket.No = c("Tkt10256", "Tkt10257", "Tkt10257",
"Tkt10257", "Tkt10260", "Tkt10260", "Tkt10262"), V1 = c(1L, 1L,
2L, 3L, 1L, 2L, 3L), V2 = c("X", "aa", "bb", "x", "cc", "aa",
"bb"), Flag = c(0L, 0L, 1L, 1L, 0L, 1L, 0L)), .Names = c("Ticket.No",
"V1", "V2", "Flag"), class = "data.frame", row.names = c(NA,
-7L))
One more:
Check this on your larger data. I'm not exactly sure if duplicated is the right function to use there. If the numbers in the TicketNo column are increasing (i.e. the Xs in TktXXXXX), then it should work fine.
> dat2 <- dat[dat$V2 %in% c("aa", "bb", "cc", "dd"),]
> rn <- rownames(dat2)[duplicated(dat2[[1]]) & !c(FALSE, diff(dat2[[2]]) == 0)]
> dat$Flag <- (rownames(dat) %in% rn)+0
> dat
# TicketNo V1 V2 Flag
# 1 Tkt10256 1 X 0
# 2 Tkt10257 1 aa 0
# 3 Tkt10257 2 bb 1
# 4 Tkt10257 3 x 0
# 5 Tkt10260 1 cc 0
# 6 Tkt10260 2 aa 1
# 7 Tkt10262 3 bb 0
A variation on @akrun's answer:
with(df,
V2 %in% c("aa","bb","cc","dd") &
c(FALSE,diff(V1) != 0) &
c(FALSE,head(Ticket.No, -1)) == Ticket.No
) + 0
#[1] 0 0 1 0 0 1 0
Try:
ddf$Flag <- 0 # initialise; the first row has no previous row, so it stays 0
for(i in 2:nrow(ddf)){
ddf$Flag[i] = ifelse( ddf$V2[i] %in% c('aa', 'bb', 'cc', 'dd')
&& ddf$V1[i] != ddf$V1[(i-1)]
&& ddf$TicketNo[i] == ddf$TicketNo[(i-1)]
,1,0)
}
ddf
TicketNo V1 V2 Flag
1 Tkt10256 1 X 0
2 Tkt10257 1 aa 0
3 Tkt10257 2 bb 1
4 Tkt10257 3 x 0
5 Tkt10260 1 cc 0
6 Tkt10260 2 aa 1
7 Tkt10262 3 bb 0
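The "compare with the previous row" step shared by the answers above can also be factored into a small helper, which may read more clearly when several columns are involved. This is a sketch on the sample data from the question. The helper name prev is made up, and the column is called Ticket.No as in the first answer.

```r
df <- data.frame(
  Ticket.No = c("Tkt10256", "Tkt10257", "Tkt10257", "Tkt10257",
                "Tkt10260", "Tkt10260", "Tkt10262"),
  V1 = c(1, 1, 2, 3, 1, 2, 3),
  V2 = c("X", "aa", "bb", "x", "cc", "aa", "bb")
)

# Hypothetical helper: the value from the previous row, NA for the first row
prev <- function(x) c(NA, head(x, -1))

allowed <- c("aa", "bb", "cc", "dd")
cond <- df$V2 %in% allowed &
  df$V1 != prev(df$V1) &
  df$Ticket.No == prev(df$Ticket.No)

# cond can be NA in the first row; %in% TRUE counts NA as not flagged
df$Flag <- as.integer(cond %in% TRUE)
```

Treating the first row as not flagged matches the c(FALSE, ...) padding used in the vectorised answers.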
