If statement with two conditions and NA - r

I am looking to use a conditional statement to access rows whose date is before 0021-01-11 and which have an NA value in a specific column (People_vaccinated, for example). For those rows I want to impute zero.
I want to use an IF statement with (condition1 AND condition 2).
Condition1 can be df$People_vaccinated == NA and condition2 can be df$date < 'given date'

Maybe this will help -
df <- data.frame(Date = c('0021-01-07', '0021-01-08','0021-01-11', '0021-01-12'),
a = c(2, NA, 3, NA),
b = c(1, NA, 2, 3))
ind <- match('0021-01-11', df$Date)
df$a[1:ind][is.na(df$a[1:ind])] <- 0
df
# Date a b
#1 0021-01-07 2 1
#2 0021-01-08 0 NA
#3 0021-01-11 3 2
#4 0021-01-12 NA 3
Or using dplyr -
library(dplyr)
df <- df %>%
mutate(a = replace(a,
row_number() <= match('0021-01-11', Date) & is.na(a), 0))
df
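If you would rather test both conditions directly instead of relying on row position, here is a minimal base R sketch. It assumes the Date column parses with as.Date(); the names cutoff and rows are only illustrative. Note that a comparison such as df$a == NA never returns TRUE, which is why is.na() is used for the first condition.

```r
df <- data.frame(Date = c('0021-01-07', '0021-01-08', '0021-01-11', '0021-01-12'),
                 a = c(2, NA, 3, NA),
                 b = c(1, NA, 2, 3))

# Assumed cutoff: rows strictly before this date are candidates
cutoff <- as.Date('0021-01-11')

# condition1: value is missing; condition2: date is before the cutoff
rows <- is.na(df$a) & as.Date(df$Date) < cutoff

# Impute zero only where both conditions hold
df$a[rows] <- 0
```

Unlike the positional version, this leaves the row dated exactly 0021-01-11 out of the candidate set; change < to <= if that row should be included.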

Related

Using dplyr to select rows containing non-missing values in several specified columns

Here is my data
data <- data.frame(a= c(1, NA, 3), b = c(2, 4, NA), c=c(NA, 1, 2))
I wish to select only the rows with no missing data in columns a AND b. For my example, only the first row will be selected.
I could use
data %>% filter(!is.na(a) & !is.na(b))
to achieve my purpose. But I wish to do it using if_any/if_all, if possible. I tried data %>% filter(if_all(c(a, b), !is.na)) but this returns an error. My question is how to do it in dplyr through if_any/if_all.
data %>%
filter(if_all(c(a,b), ~!is.na(.)))
a b c
1 1 2 NA
We could use filter with if_all
library(dplyr)
data %>%
filter(if_all(c(a,b), complete.cases))
-output
a b c
1 1 2 NA
This could do the trick - use filter_at and all_vars in dplyr:
data %>%
filter_at(vars(a, b), all_vars(!is.na(.)))
Output:
# a b c
#1 1 2 NA
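As for why the attempt in the question errors: !is.na tries to apply ! to the function object itself rather than to its result. A small sketch of one way around that, assuming a dplyr version that provides if_all(), is to wrap the function with base R's Negate():

```r
library(dplyr)

data <- data.frame(a = c(1, NA, 3), b = c(2, 4, NA), c = c(NA, 1, 2))

# Negate(is.na) returns a new function equivalent to function(x) !is.na(x),
# so it can be passed to if_all() without a lambda
res <- data %>%
  filter(if_all(c(a, b), Negate(is.na)))
```

This should be equivalent to the ~!is.na(.) lambda shown in the first answer.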

Add multiple columns with dplyr and fill cells based on condition

I am trying to:
1) add multiple columns that correspond to existing columns (e.g., a1 exists and add a1_yes).
2) Next, if a given cell contains 1:3, put 1 in a#_yes column, otherwise, put 0.
I can easily do this with base R but I'm trying to also make it work with dplyr.
My data:
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
a2 = c(NA, 1, 2, 3, 3))
With base R:
df[paste0("a", 1:2, "_yes")] <- NA # add columns
for(c in 1:2) {
for(r in 1:nrow(df)) {
ifelse(df[r,c] %in% c(1,2,3), df[r,c+2] <- 1,df[r,c+2] <- 0)
}
}
> df
a1 a2 a1_yes a2_yes
1 1 NA 1 0
2 2 1 1 1
3 0 2 0 1
4 NA 3 0 1
5 NA 3 0 1
Thank you
Here is an option, assuming you want to do this to all columns of your dataframe
library(dplyr)
df %>%
mutate_all(., list('yes' = ~ifelse(.x %in% c(1:3), 1, 0)))
# a1 a2 a1_yes a2_yes
#1 1 NA 1 0
#2 2 1 1 1
#3 0 2 0 1
#4 NA 3 0 1
#5 NA 3 0 1
Edits
As @Akrun mentioned, you can do this without ifelse using as.integer or +
df %>%
mutate_all(., list('yes' = ~as.integer(.x %in% 1:3)))
You can also use mutate_at to select specific vars
df %>%
mutate_at(vars(a1, a2), list('yes' = ~as.integer(.x %in% 1:3)))
This will work without editing no matter how many columns you have if they are all in this format
df %>%
mutate_all(., function(x) ifelse(x == 0 | is.na(x), 0, 1)) %>%
rename_all(., function(x) paste0(x, "_yes")) %>%
bind_cols(df, .)
Here's a dplyr solution:
library(dplyr)
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
a2 = c(NA, 1, 2, 3, 3))
df2 <- df %>%
mutate(a1_yes = ifelse(a1 == 0 | is.na(a1), 0, 1),
a2_yes = ifelse(a2 == 0 | is.na(a2), 0, 1))
Instead of putting the conditions so that the new columns' values are 1, I put the conditions so that they're equal to zero.
Here is a solution
library(dplyr)
df <- data.frame( a1 = c(1,2,0,NA,NA),
a2 = c(NA,1,2,3,3))
check_values <- c(1,2,3)
df %>% mutate(a1_yes = ifelse(a1 %in% check_values,1,0),
a2_yes =ifelse(a2 %in% check_values,1,0))
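For newer dplyr versions, where mutate_all() and mutate_at() are superseded, the same idea can probably be written with across(). This is a sketch that assumes dplyr 1.0.0 or later; the result name out is only illustrative.

```r
library(dplyr)

df <- data.frame(a1 = c(1, 2, 0, NA, NA),
                 a2 = c(NA, 1, 2, 3, 3))

# For each selected column, flag values in 1:3 (NA and 0 give 0)
# and store the result in a new column named <column>_yes
out <- df %>%
  mutate(across(c(a1, a2),
                ~ as.integer(.x %in% 1:3),
                .names = "{.col}_yes"))
```

The .names argument is what keeps the original columns and adds the _yes copies alongside them.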

Variable names as Input in an R Function

I have a dataframe with several numeric variables along with factors. I wish to run over the numeric variables and replace the negative values to missing. I couldn't do that.
My alternative idea was to write a function that gets a dataframe and a variable, and does it. It didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
what am I doing wrong ?
Thank you.
Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
With tidyverse, we can make use of mutate_if
library(tidyverse)
df1 %>%
mutate_if(is.numeric, ~ replace(., . < 0, NA)) # funs() is deprecated in newer dplyr; a formula works instead
If you still want to change only one selected variable, a solution with dplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3
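On the original error itself: df$var looks for a column literally named "var" and ignores the function argument, and the function in the question also never returns the modified data frame. A minimal base R sketch that takes the column name as a string (the name negative_to_missing is just illustrative) might look like this:

```r
negative_to_missing <- function(df, var) {
  # [[ ]] evaluates `var`, so the column name can be passed as a string
  x <- df[[var]]
  # Guard against existing NAs so the logical index contains no NA
  x[!is.na(x) & x < 0] <- NA
  df[[var]] <- x
  # Return the modified copy; the caller's data frame is not changed in place
  df
}

data <- data.frame(val1 = c(1, -1, 2),
                   val2 = 1:3)
out <- negative_to_missing(data, "val1")
```

Remember to assign the result, for example data <- negative_to_missing(data, "val1"), since R functions work on a copy.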

Count visits (cumsum) per ID while ignoring NA's and 0's

I have the following df:
df <- data.frame(ID = c(1,1,2,2,2,3,3,3,3),
Attendance = c(1, 1, NA, 1,1, NA, 1, NA, 1 ))
And I want this one:
df <- data.frame(ID = c(1,1,2,2,2,3,3,3,3),
Attendance = c(1, 1, NA, 1,1, NA, 1, NA, 1),
Visit = c(1,2,0,1,2,0,1,0,2))
How can I keep a running count (cumsum) in the 'Visit' column of each time an ID appears, based on the 'Attendance' column value, while ignoring NAs and 0s?
I have tried something with ave function like this one, but unsuccessfully:
df$Visit <- ifelse(!is.na(df$ID), (ave(df$ID, df$ID, FUN=cumsum))/df$ID, 0)
I have achieved the result by creating an auxiliary df with:
aux <- df[complete.cases(df$Attendance),]
Then counting the visits with the ave function and merging, but I'm sure there is an easier way.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Visit = if_else(is.na(Attendance), 0, cumsum(if_else(is.na(Attendance), 0, 1))))
We can use data.table. Convert the data.frame to a data.table with setDT(df). In i, pass a logical vector that is TRUE for the non-NA elements of 'Attendance'; grouped by 'ID', assign (:=) the row id of 'Attendance' to a new 'Visit' column. Then replace the remaining NAs in 'Visit' with 0.
library(data.table)
setDT(df)[!is.na(Attendance), Visit := rowidv(Attendance),
ID][is.na(Visit), Visit := 0]
df
# ID Attendance Visit
#1: 1 1 1
#2: 1 1 2
#3: 2 NA 0
#4: 2 1 1
#5: 2 1 2
#6: 3 NA 0
#7: 3 1 1
#8: 3 NA 0
#9: 3 1 2
Or if we are using ave, then create an index for non-NA rows, and then use ave on those rows
i1 <- !is.na(df$Attendance)
df$Visit <- 0
df$Visit[i1] <- with(df[i1, ], ave(Attendance, ID, FUN = cumsum))
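Since the title also asks to ignore 0s, here is a base R sketch that treats both NA and 0 as "no visit". It assumes any non-zero, non-NA Attendance counts as one visit; the name attended is only illustrative.

```r
df <- data.frame(ID = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
                 Attendance = c(1, 1, NA, 1, 1, NA, 1, NA, 1))

# TRUE only where Attendance is present and non-zero
attended <- !is.na(df$Attendance) & df$Attendance != 0

# Running count of attended rows within each ID;
# multiplying by `attended` resets non-attended rows to 0
df$Visit <- ave(as.integer(attended), df$ID, FUN = cumsum) * attended
```

This avoids the auxiliary data frame and the merge step described in the question.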

Removing empty rows of a data file in R

I have a dataset with empty rows. I would like to remove them:
myData<-myData[-which(apply(myData,1,function(x)all(is.na(x)))),]
It works OK. But now I would like to add a column in my data and initialize the first value:
myData$newCol[1] <- -999
Error in `$<-.data.frame`(`*tmp*`, "newCol", value = -999) :
replacement has 1 rows, data has 0
Unfortunately it doesn't work and I don't really understand why and I can't solve this.
It worked when I removed one line at a time using:
TgData = TgData[2:nrow(TgData),]
Or anything similar.
It also works when I used only the first 13.000 rows.
But it doesn't work with my actual data, with 32.000 rows.
What did I do wrong? It seems to make no sense to me.
I assume you want to remove rows that are all NAs. Then, you can do the following :
data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] NA NA NA
[5,] 4 8 NA
data[rowSums(is.na(data)) != ncol(data),]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] 4 8 NA
If you want to remove rows that have at least one NA, just change the condition :
data[rowSums(is.na(data)) == 0,]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 6 7
If you have empty rows, not NAs, you can do:
data[!apply(data == "", 1, all),]
To remove both (NAs and empty):
data <- data[!apply(is.na(data) | data == "", 1, all),]
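A possible explanation for the zero-row result in the question: if no row is entirely NA, which() returns integer(0), and indexing with -integer(0) selects no rows at all instead of keeping every row. That would also fit the observation that a subset of the data behaved differently. A small sketch with made-up data (the names all_na, bad and good are illustrative):

```r
# Made-up data in which no row is entirely NA
myData <- data.frame(x = c(1, 2, 3), y = c("a", NA, "c"))

all_na <- apply(myData, 1, function(r) all(is.na(r)))

# which(all_na) is integer(0); negating it is still integer(0),
# so this selects zero rows rather than dropping none
bad <- myData[-which(all_na), ]

# Logical negation keeps every row that is not all-NA
good <- myData[!all_na, ]
```

Using the logical index, as in the rowSums() and apply() examples above, sidesteps the problem.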
Here are some dplyr options:
# sample data
df <- data.frame(a = c('1', NA, '3', NA), b = c('a', 'b', 'c', NA), c = c('e', 'f', 'g', NA))
library(dplyr)
# remove rows where all values are NA:
df %>% filter_all(any_vars(!is.na(.)))
df %>% filter_all(any_vars(complete.cases(.)))
# remove rows where only some values are NA:
df %>% filter_all(all_vars(!is.na(.)))
df %>% filter_all(all_vars(complete.cases(.)))
# or more succinctly:
df %>% filter(complete.cases(.))
df %>% na.omit
# dplyr and tidyr:
library(tidyr)
df %>% drop_na
Alternative solution for rows of NAs using janitor package
myData %>% remove_empty("rows")
This is similar to some of the above answers, but with this, you can specify if you want to remove rows with a percentage of missing values greater-than or equal-to a given percent (with the argument pct)
drop_rows_all_na <- function(x, pct=1) x[!rowSums(is.na(x)) >= ncol(x)*pct,]
Where x is a dataframe and pct is the threshold of NA-filled data you want to get rid of.
pct = 1 means remove rows in which 100% of the values are NA.
pct = .5 means remove rows in which at least half of the values are NA.
Using dplyr's if_all/if_any
Drop rows with any NA OR Select rows with no NA value.
df %>% filter(!if_any(a:c, is.na))
# a b c
#1 1 a e
#2 3 c g
#Also
df %>% filter(if_all(a:c, Negate(is.na)))
Drop rows with all NA values or select rows with at least one non-NA value.
df %>% filter(!if_all(a:c, is.na))
# a b c
#1 1 a e
#2 <NA> b f
#3 3 c g
#Also
df %>% filter(if_any(a:c, Negate(is.na)))
data
Using data from @sbha -
df <- data.frame(a = c('1', NA, '3', NA),
b = c('a', 'b', 'c', NA),
c = c('e', 'f', 'g', NA))
Here's yet another answer if you just want a handy function wrapper. Also, many of the above solutions remove a row with ANY NAs, whereas this one only removes rows that are ALL NAs.
data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
rmNArows<-function(d){
goodRows<-apply(d,1,function(x) sum(is.na(x))!=ncol(d))
d[goodRows,]
}
rmNArows(data)
