Removing 0s from dataframe without removing NAs - r

I am trying to create a subset that removes all rows where variable B == 0, given that another variable A == 1. However, I want to keep the NAs in variable B (just remove the 0s).
I tried df2 <- subset(df, B[df$A == 1] > 0), but the result makes no sense. Can someone help?
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)

subset() takes a condition and returns only the rows where that condition is TRUE. Comparing NA with anything (NA == 0 or NA != 0) always returns NA, which is neither TRUE nor FALSE, so subset() drops those rows as well. There are multiple ways around this:
subset(df, !(A == 1 & B == 0) | is.na(B))
or:
subset(df, !(A == 1 & B %in% 0))
There are plenty of other options as well.
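To see the NA behaviour described above in isolation, here is a minimal sketch that rebuilds the question's df and filters it with plain bracket indexing instead of subset(). The name drop_rows is only an illustrative choice, not something from the original answer.

```r
df <- data.frame(i = 1:10,
                 A = c(0, 1, 1, 1, 0, 0, 1, 1, 0, 1),
                 B = c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA))

# Comparing NA with a value yields NA, not TRUE or FALSE
NA == 0     # NA
NA != 0     # NA

# %in% never returns NA, so it is safe to use inside a filter
NA %in% 0   # FALSE

# Flag only the rows with A == 1 and B == 0 (hypothetical helper name)
drop_rows <- df$A == 1 & df$B %in% 0
kept <- df[!drop_rows, ]
kept$i      # 1 2 3 4 5 6 9 10
```

Rows 7 and 8 are removed, while every row with NA in B is kept.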

This should work, if I understand it correctly:
subset(df, (df$A == 1) & ((df$B != 0) | (is.na(df$B))))
outputs:
i A B
2 1 10
3 1 13
4 1 NA
10 1 NA

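If you do not want to specify every single column, you can change the 0s to NA and the NAs (temporarily) to a number that does not occur in the data (for example 999 or -999), then switch back after you are finished. Note that this drops every row containing a 0 in any column, so rows with A == 0 are removed as well.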
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)
df[is.na(df)] <- 999
df[df==0] <- NA
df <- na.omit(df)
df[df==999] <- NA
i A B
2 2 1 10
3 3 1 13
4 4 1 NA
10 10 1 NA

If i is unique, identify which cases you want to remove and select the rest. Use %in% rather than != so the comparison still works when more than one row is removed:
df[!(df$i %in% subset(df, A == 1 & B == 0)$i), ]
Output:
i A B
1 1 0 0
2 2 1 10
3 3 1 13
4 4 1 NA
5 5 0 NA
6 6 0 9
9 9 0 3
10 10 1 NA
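When several ids are excluded at once, != and %in% behave differently, which is easy to miss because != can give the right answer by coincidence. The sketch below uses made-up ids (ids_to_drop is a hypothetical name) to show the difference.

```r
i <- 1:10
ids_to_drop <- c(8, 7)   # hypothetical ids, deliberately not in ascending order

# != compares position by position and recycles ids_to_drop,
# so in this case no element is ever flagged for removal
keep_wrong <- i != ids_to_drop

# %in% tests membership, independent of order or length
keep_right <- !(i %in% ids_to_drop)

sum(keep_wrong)   # 10
sum(keep_right)   # 8
```

With the ids in the question (7 and 8, in that order) the recycled comparison happens to line up, which is why the output above is still correct.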

Related

Nested ifelse() statement in R not producing the desired results

I have a data set of boolean variables and I am trying to generate a new variable based on 3 of the existing booleans using ifelse().
The rules I'd like to implement are:
If any of the three columns have value 1, 1
If all of the three columns have value 0, 0
If all of the three columns have value NA, NA
If the three columns have some combination of 0 and NA, 0
Here is the code to generate a sample with 3 variables that I want to use to create a fourth:
df <- structure(list(var1 = c(NA, NA, NA, 0,1),
var2 = c(1, NA, 0,0, 1),
var3 = c(NA, NA, NA,0,1)), class = "data.frame", row.names = c(NA, -5L))
I have tried the following to generate the new variable according to my desired rules:
df$newvar1 <-ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse((is.na(df$var1) & is.na(df$var2) & is.na(df$var3)), NA,0))
df$newvar2 <- ifelse((is.na(df$var1)|df$var1==0) &
(is.na(df$var2)|df$var2==0) &
(is.na(df$var3)|df$var3==0),0,
ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse(is.na(df$var1) & is.na(df$var2) & is.na(df$var3), NA,NA)))
df$newvar3 <-ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse((is.na(df$var1) & is.na(df$var2) & is.na(df$var3)), NA,
ifelse((is.na(df$var1)|df$var1==0) &
(is.na(df$var2)|df$var2==0) &
(is.na(df$var3)|df$var3==0),0,0)))
I don't understand why newvar1 and newvar3 have NA values corresponding to combinations of NAs and 0s when both examples use "&" between the na specifications (row 3 in the results).
I am assuming that NAs don't show up in newvar2 because the first ifelse() function takes precedence.
Any insight to the ifelse() function or advice on how to get the results I'm looking for would be really helpful.
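As for why row 3 comes out as NA: the first test, var1 == 1 | var2 == 1 | var3 == 1, evaluates to NA | FALSE | NA for that row, which is NA, and ifelse() returns NA whenever its test is NA, so the nested calls are never reached. A minimal sketch of this, plus one possible fix that swaps == for %in% (the names any_one and all_na are illustrative assumptions):

```r
df <- data.frame(var1 = c(NA, NA, NA, 0, 1),
                 var2 = c(1, NA, 0, 0, 1),
                 var3 = c(NA, NA, NA, 0, 1))

# Three-valued logic: an unknown OR FALSE is still unknown
NA | FALSE          # NA
NA | TRUE           # TRUE
ifelse(NA, 1, 0)    # NA, neither branch is used

# %in% returns FALSE instead of NA for missing values
any_one <- df$var1 %in% 1 | df$var2 %in% 1 | df$var3 %in% 1
all_na  <- is.na(df$var1) & is.na(df$var2) & is.na(df$var3)

newvar <- ifelse(any_one, 1, ifelse(all_na, NA, 0))
newvar              # 1 NA 0 0 1
```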
Here is another possible option using rowSums:
df$newvar <- +(rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0) > 0)
# var1 var2 var3 newvar
#1 NA 1 NA 1
#2 NA NA NA NA
#3 NA 0 NA 0
#4 0 0 0 0
#5 1 1 1 1
This gives your expected results:
df$newvar <- 0
df$newvar[Reduce(`|`, lapply(df[1:3], `%in%`, 1))] <- 1
df$newvar[Reduce(`&`, lapply(df[1:3], is.na))] <- NA
df
# var1 var2 var3 newvar
# 1 NA 1 NA 1
# 2 NA NA NA NA
# 3 NA 0 NA 0
# 4 0 0 0 0
# 5 1 1 1 1
This defaults to 0 and only overwrites rows that meet one of the tested conditions: any 1 in the three columns gives 1 (even alongside NAs, as in row 1), and all-NA gives NA. Everything else, including any value other than 0, 1 or NA, stays 0. It's not difficult to test for further cases, but they weren't in your logic.

Insert a blank row before zero

x<-c(0,1,1,0,1,1,1,0,1,1)
aaa<-data.frame(x)
How can I insert a blank row before each zero? When the first row is zero, do not add a blank row. Thank you.
Result:
0
1
1
.
0
1
1
1
.
0
1
1
Below we used dot but you can replace "." with NA or "" or something else depending on what you want.
1) We can use Reduce and append:
Append <- function(x, y) append(x, ".", y - 1)
data.frame(x = Reduce(Append, setdiff(rev(which(aaa$x == 0)), 1), init = aaa$x))
2) gsub Another possibility is to convert to a character string, use gsub and convert back:
data.frame(x = strsplit(gsub("(.)0", "\\1.0", paste(aaa$x, collapse = "")), "")[[1]])
3) We can create a two row matrix in which the first row is dot before each 0 and NA otherwise. Then unravel it to a vector and use na.omit to remove the NA values.
data.frame(x = na.omit(c(rbind(replace(ifelse(aaa$x == 0, ".", NA), 1, NA), aaa$x))))
4) We can lapply over aaa$x[-1] outputting c(".", 0) or 1. Unlist that and insert aaa$x[1] back in. No packages are used.
repl <- function(x) if (!x) c(".", 0) else 1
data.frame(x = c(aaa$x[1], unlist(lapply(aaa$x[-1], repl))))
5) Create a list of all but the first element and replace the 0's in that list with c(".", 0) . Unlist that and insert the first element back in. No packages are used.
L <- as.list(aaa$x[-1])
L[aaa$x[-1] == 0] <- list(c(".", 0))
data.frame(x = c(aaa$x[1], unlist(L)))
6) Assuming aaa has two columns where the second column is character (NOT factor). Append a row of dots to aaa and then create an index vector using unlist and Map to access the appropriate row of the extended aaa.
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10],
stringsAsFactors = FALSE)
nr <- nrow(aaa); nc <- ncol(aaa)
fun <- function(ix, x) if (!is.na(x) & x == 0 & ix > 1) c(nr + 1, ix) else ix
rbind(aaa, rep(".", nc))[unlist(Map(fun, 1:nr, aaa$x)), ]
If we did want to have y be factor then note that we can't just add a dot to a factor if it is not a level of that factor so there is the question of what levels the factor can have. To get around that let us add an NA rather than a dot to the factor. Then we get the following which is the same except that aaa has been redefined so that y is a factor, we no longer need nc since we are assuming 2 columns and rep(...) in the last line is replaced with c(".", NA).
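# stringsAsFactors = TRUE is needed in R >= 4.0.0, where the default is FALSE
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10],
stringsAsFactors = TRUE)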
nr <- nrow(aaa)
fun <- function(ix, x) if (!is.na(x) & x == 0 & ix > 1) c(nr + 1, ix) else ix
rbind(aaa, c(".", NA))[unlist(Map(fun, 1:nr, aaa$x)), ]
One dplyr and tidyr possibility may be:
library(dplyr)
library(tidyr)
aaa %>%
uncount(ifelse(row_number() > 1 & x == 0, 2, 1)) %>%
mutate(x = ifelse(x == 0 & lag(x == 1, default = first(x)), NA_integer_, x))
x
1 0
2 1
3 1
4 NA
5 0
6 1
7 1
8 1
9 NA
10 0
11 1
12 1
It is not adding a blank row as you have a numeric vector. Instead, it is adding a row with NA. If you need a blank row, you can convert it into a character vector and then replace NA with blank.
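The conversion to a blank row mentioned above could look like the following sketch. The vector x_with_na is typed in by hand to match the NA-padded result shown above, and both variable names are illustrative.

```r
# NA-padded column, copied from the result above
x_with_na <- c(0, 1, 1, NA, 0, 1, 1, 1, NA, 0, 1, 1)

# Convert to character first, then swap the NAs for empty strings
x_chr <- as.character(x_with_na)
x_chr[is.na(x_chr)] <- ""

data.frame(x = x_chr)
```

The column is character after this step, so convert back with as.numeric() if the values are needed for arithmetic later.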
ind = with(aaa, ifelse(x == 0 & seq_along(x) > 1, 2, 1))
d = aaa[rep(1:NROW(aaa), ind), , drop = FALSE]
transform(d, x = replace(x, sequence(ind) == 2, NA))
Here is an option with rleid
library(data.table)
setDT(aaa)[, .(x = if(x[.N] == 1) c(x, NA) else x), rleid(x)][-.N, .(x)]
# x
# 1: 0
# 2: 1
# 3: 1
# 4: NA
# 5: 0
# 6: 1
# 7: 1
# 8: 1
# 9: NA
#10: 0
#11: 1
#12: 1
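Another base R option uses by() to append a dot after each group that starts with a 0. Note that it also leaves a trailing dot after the last group (row 13 below), which can be dropped with head(..., -1):
data.frame(x = unname(unlist(by(aaa$x,cumsum(aaa==0),c,'.'))))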
x
1 0
2 1
3 1
4 .
5 0
6 1
7 1
8 1
9 .
10 0
11 1
12 1
13 .
My solution is
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10])
aaa$ind = with(aaa, ifelse(x == 0 & seq_along(x) > 1, 2, 1))
aaa <- aaa[rep(1:nrow(aaa), aaa$ind), ]
# blank out the first copy of each duplicated row (the one without a ".1" suffix)
aaa[aaa$ind == 2 & !grepl("\\.1$", rownames(aaa)), ] <- NA
aaa$ind<- NULL
aaa
x y
1 0 a
2 1 b
3 1 c
4 NA <NA>
4.1 0 d
5 1 e
6 1 f
7 1 g
8 NA <NA>
8.1 0 h
9 1 i
10 1 j

Append strings to a text in R

I want to classify some characters that fulfill a condition in one column and concatenate the other characters in a string in another column.
The classification is working. When there is a 1 in the column "col", the program has to compare the inputs in "Category", the actual value with the previous one. If the priority number is smaller, save the value in "AlarmPrior", and the other value in "Other Alarms". I want to concatenate all the values with less priority in a string in "Other Alarms".
#loading the libraries
library(data.table)
library(magrittr)
library(dplyr)
#test the function
col <- c(0, 1, 0, 0, 1, 1)
Priority <- c(1,2,3,4,5,6)
Category <- c("a","b","c","d","e","f")
eventlog_overlap.dt <- data.table(col, Priority, Category)
#comparison and value assignation in function of the priority
eventlog_overlap.dt$OtherAlarms <- ""
eventlog_overlap.dt <-
eventlog_overlap.dt %>%
mutate(AlarmPrior = ifelse(col == 1,
ifelse(Priority <= lag(Priority),
Category,
lag(Category)), NA),
OtherAlarms = ifelse(col == 1,
ifelse(Priority <= lag(Priority),
"1",
paste0(sprintf(Category, lag(OtherAlarms)), collapse = ", ")),NA))
For example:
This input,
col <- c(0, 1, 0, 0, 1, 1)
Priority <- c(1,2,3,4,5,6)
Category <- c("a","b","c","d","e","f")
Should return:
col Priority Category OtherAlarms AlarmPrior
1 0 1 a NA NA
2 1 2 b b a
3 0 3 c b,c NA
4 0 4 d b,c NA
5 1 5 e b,c,e d
6 1 6 f b,c,e,f e
My actual result is this one:
col Priority Category OtherAlarms AlarmPrior
1 0 1 a NA NA
2 1 2 b a,b,c,d,e,f a
3 0 3 c NA NA
4 0 4 d NA NA
5 1 5 e a,b,c,d,e,f d
6 1 6 f a,b,c,d,e,f e
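One likely reason every row with col == 1 shows the full "a,b,c,d,e,f" string is that paste0(..., collapse = ) squashes the whole column into a single value, which is then reused for each row. The sketch below contrasts that with a running concatenation built with Reduce(). It only illustrates the mechanism on a shortened vector and does not reproduce the priority logic from the question.

```r
Category <- c("a", "b", "c")

# collapse= turns the entire vector into one string
all_in_one <- paste0(Category, collapse = ",")
all_in_one   # "a,b,c"

# Reduce() with accumulate = TRUE keeps one running string per element
running <- Reduce(function(acc, s) paste(acc, s, sep = ","),
                  Category, accumulate = TRUE)
running      # "a" "a,b" "a,b,c"
```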
I used a for loop to solve the problem:
#loading the libraries
library(data.table)
library(magrittr)
library(dplyr)
col <- c(0, 1, 0, 0, 1, 1)
Priority <- c(1,2,3,4,5,6)
Category <- c("a","b","c","d","e","f")
eventlog_overlap.dt <- data.table(col,Priority, Category)
#comparison and value assignation in function of the priority
eventlog_overlap.dt$OtherAlarms <- ""
eventlog_overlap.dt <-
eventlog_overlap.dt %>%
mutate(AlarmPrior = ifelse(col == 1,
ifelse(Priority <= lag(Priority),
Category,
lag(Category)), NA))
eventlog_overlap.dt$leadCate= lead(eventlog_overlap.dt$AlarmPrior)
tmpdata = character()
eventlog_overlap.dt$tmp= NA
for(i in 1:nrow(eventlog_overlap.dt)){
tmp = eventlog_overlap.dt[i,3]
leadtmp = eventlog_overlap.dt[i,6]
if(!is.na(leadtmp == tmp) & !as.logical(eventlog_overlap.dt$col[i])){
tmp = tmp[!grepl(tmp,leadtmp)]
tmp = ifelse(NROW(tmp)==0,NA,tmp)
tmpdata = tmpdata
} else{
tmpdata = c(tmpdata,tmp)
}
eventlog_overlap.dt[i,7] = paste(tmpdata,collapse = ',')
}
And the result is shown below:
> eventlog_overlap.dt
  col Priority Category OtherAlarms AlarmPrior leadCate     tmp
1   0        1        a                   <NA>        a
2   1        2        b                      a     <NA>       b
3   0        3        c                   <NA>     <NA>     b,c
4   0        4        d                   <NA>        d     b,c
5   1        5        e                      d        e   b,c,e
6   1        6        f                      e     <NA> b,c,e,f

The Number of Rows with a Specific Number of Missing Values

Imagine a small data set like the one below, composed of three variables:
v1 <- c(0, 1, NA, 1, NA, 0)
v2 <- c(0, 0, NA, 1, NA, NA)
v3 <- c(1, NA, 0, 0, NA, 0)
df <- data.frame(v1, v2, v3)
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
One can use the is.na command as follows to calculate the number of rows with at least one missing value - and R would return 4:
sum(is.na(df$v1) | is.na(df$v2) | is.na(df$v3))
Or the number of rows with all three values missing - and R would return 1:
sum(is.na(df$v1) & is.na(df$v2) & is.na(df$v3))
Two questions at this point:
(1) How can I calculate the number of rows where "exactly one" or "exactly two" values are missing?
(2) If I am to do the above in a large data set, how can I limit the scope of the calculation to v1, v2 and v3 (that is, without having to create a subset)?
I tried variations of is.na, nrow and df, but could not get any of them to work.
Thanks!
We can use rowSums on the logical matrix (is.na(df)) and check whether the number of NAs in each row is equal to the value of interest.
n1 <- 1
sum(rowSums(is.na(df))==n1)
To make it easier, create a function to do this
f1 <- function(dat, n){
sum(rowSums(is.na(dat)) == n)
}
f1(df, 0)
#[1] 2
f1(df, 1)
#[1] 2
f1(df, 3)
#[1] 1
f1(df, 2)
#[1] 1
NOTE: rowSums is very fast, but if it is a large dataset, then creating a logical matrix can also create problems in memory. So, we can use Reduce after looping through the columns of the dataset (lapply(df, is.na)).
sum(Reduce(`+`, lapply(df, is.na))==1)
#[1] 2
f2 <- function(dat, n){
sum(Reduce(`+`, lapply(dat, is.na))==n)
}
f2(df, 1)
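To limit the count to v1, v2 and v3 inside a larger data set (the second part of the question), the same rowSums idea can take a vector of column names. This is a sketch under assumptions: count_na_rows and the extra column other are made up for illustration.

```r
df <- data.frame(v1 = c(0, 1, NA, 1, NA, 0),
                 v2 = c(0, 0, NA, 1, NA, NA),
                 v3 = c(1, NA, 0, 0, NA, 0),
                 other = c(NA, "x", "y", NA, "z", "w"))  # extra column, ignored below

# Count rows with exactly n missing values among the chosen columns
count_na_rows <- function(dat, n, cols = names(dat)) {
  sum(rowSums(is.na(dat[cols])) == n)
}

vars <- c("v1", "v2", "v3")
count_na_rows(df, 1, vars)   # 2
count_na_rows(df, 2, vars)   # 1

# All counts at once: how many rows have 0, 1, 2 or 3 NAs
na_per_row <- rowSums(is.na(df[vars]))
table(na_per_row)
```

Indexing with dat[cols] selects the columns on the fly, so no separate subset has to be stored.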
Try this:
num.rows.with.x.NA <- function(df, x, cols=names(df)) {
return(sum(apply(df, 1, function(y) sum(is.na(y[cols])) == x)))
}
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
num.rows.with.x.NA(df, 0, names(df))
#[1] 2
num.rows.with.x.NA(df, 1, names(df))
#[1] 2
num.rows.with.x.NA(df, 2, names(df))
#[1] 1
num.rows.with.x.NA(df, 3, names(df))
#[1] 1

recoding variable into two new variables in R

I have a variable A containing continuous numeric values and a binary variable B. I would like to create a new variable A1 which contains the same values as A if B=1 and missing values (NA) if B=2.
Many thanks!
You can use ifelse() for that:
A1 <- ifelse(B == 1, A, NA)
Here's a simple and efficient approach without ifelse:
A <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
B <- rep(1:2, 5)
# [1] 1 2 1 2 1 2 1 2 1 2
A1 <- A * NA ^ (B - 1)
# [1] 1 NA 3 NA 5 NA 7 NA 9 NA
You can use ifelse for this:
A = runif(100)
B = sample(c(0,1), 100, replace = TRUE)
B1 = ifelse(B == 1, A, NA)
You can even leave out the == 1 as R interprets 0 as FALSE and any other number as TRUE:
B1 = ifelse(B, A, NA)
Although the == 1 is both more flexible and makes it more clear what happens. So I'd go for the first approach.
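The difference matters for the coding in the question, where B takes the values 1 and 2 rather than 0 and 1: both values count as TRUE, so the shortcut without == 1 would keep everything. A small sketch with made-up values:

```r
A <- c(2.5, 3.1, 4.7, 5.0)
B <- c(1, 2, 1, 2)

# Both 1 and 2 are treated as TRUE, so nothing is set to NA
shortcut <- ifelse(B, A, NA)
shortcut   # 2.5 3.1 4.7 5.0

# The explicit comparison gives the intended result
explicit <- ifelse(B == 1, A, NA)
explicit   # 2.5 NA 4.7 NA
```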
