Removing rows having only zeros [duplicate]

I want to remove all rows whose numeric values are all zeros or NAs. In the code below I select the numeric variables and then filter out the all-zero rows, but the problem is that the final output no longer includes the character variables alongside the numeric ones.
df <- read.table(header = TRUE, text =
"x y z
a 1 2
b 0 3
c 1 NA
d 0 NA
")
df %>% select_if(is.numeric) %>% filter(rowSums(., na.rm = T)!=0)

You can use filter_if:
library(dplyr)
df %>% filter_if(is.numeric, any_vars(. != 0 & !is.na(.)))
# x y z
#1 a 1 2
#2 b 0 3
#3 c 1 NA
Or using base R:
cols <- sapply(df, is.numeric)
df[rowSums(!is.na(df[cols]) & df[cols] != 0) > 0, ]

Another dplyr option could be:
df %>%
  rowwise() %>%
  filter(any(across(where(is.numeric)) != 0, na.rm = TRUE))
x y z
<fct> <int> <int>
1 a 1 2
2 b 0 3
3 c 1 NA
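If you are on dplyr 1.0.4 or later, if_any() expresses the same "any column" condition directly and avoids rowwise(); a minimal sketch, assuming the df from the question:
df %>% filter(if_any(where(is.numeric), ~ !is.na(.x) & .x != 0))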

Following the suggestions in the documentation added after the release of dplyr version 1.0.0, you can create a helper function to replace the superseded functions filter_if and any_vars.
Previously, filter() was paired with the all_vars() and any_vars()
helpers. Now, across() is equivalent to all_vars(), and there’s no
direct replacement for any_vars(). However you can make a simple
helper yourself
From now on, this should be the reference method for this kind of filtering step.
rowAny <- function(x) {rowSums(x != 0 & !is.na(x)) > 0}
df %>% filter(rowAny(across(where(is.numeric))))
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA
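For completeness, the all_vars() side of the quoted advice has an analogous helper; a minimal sketch that keeps only rows where every numeric value is non-zero and non-NA:
rowAll <- function(x) {rowSums(x != 0 & !is.na(x)) == ncol(x)}
df %>% filter(rowAll(across(where(is.numeric))))
#   x y z
# 1 a 1 2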

You could simply do
df[rowSums(suppressWarnings(sapply(df, as.double)), na.rm=TRUE) > 0, ]
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA
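This works because the character column collapses to all NA under as.double() (assuming x is a character column, the read.table default since R 4.0), so it contributes nothing to the row sums. The intermediate matrix looks like this:
suppressWarnings(sapply(df, as.double))
#       x y  z
# [1,] NA 1  2
# [2,] NA 0  3
# [3,] NA 1 NA
# [4,] NA 0 NA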

Related

Recode NA when another column value is NA in R

I have a quick recoding question. Here is what my sample dataset looks like:
df <- data.frame(id = c(1,2,3),
                 i1 = c(1,NA,0),
                 i2 = c(1,1,1))
> df
id i1 i2
1 1 1 1
2 2 NA 1
3 3 0 1
When i1 is NA, I need to recode i2 to NA. I tried the code below but had no luck.
df %>%
  mutate(i2 = case_when(
    i1 == NA ~ NA_real_,
    TRUE ~ as.character(i2)))
Error in `mutate()`:
! Problem while computing `i2 = case_when(i1 == "NA" ~ NA_real_, TRUE ~ as.character(i2))`.
Caused by error in `` names(message) <- `*vtmp*` ``:
! 'names' attribute [1] must be the same length as the vector [0]
My desired output looks like this:
> df
id i1 i2
1 1 1 1
2 2 NA NA
3 3 0 1
Would a simple assignment meet your requirements for this?
df$i2[is.na(df$i1)] <- NA
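For reference, the case_when() attempt above fails for two reasons: i1 == NA is always NA (comparisons with NA never return TRUE, so you need is.na()), and the two branches mix a double with a character. A sketch of a corrected dplyr version:
library(dplyr)
df %>%
  mutate(i2 = case_when(is.na(i1) ~ NA_real_,
                        TRUE ~ as.numeric(i2)))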
Here is an option:
t(apply(df, 1, \(x) if (any(is.na(x))) cumsum(x) else x))
# id i1 i2
#[1,] 1 1 1
#[2,] 2 NA NA
#[3,] 3 0 1
The idea is to calculate the cumulative sum of a row if that row contains an NA; once term i is NA, every subsequent term will also be NA (since e.g. NA + 1 = NA). Since your sample data df is all numeric, I recommend using a matrix rather than a data.frame: matrix operations are usually faster than data.frame (i.e. list) operations.
Key assumptions:
id cannot be NA.
This replaces NAs in i2 based on an NA in i1 per row.
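The NA propagation that this answer relies on is easy to verify in isolation:
cumsum(c(2, NA, 1))
# [1]  2 NA NA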
A tidyverse solution
I advise against a tidyverse solution here for a couple of reasons:
Your data is all-numerical, so a matrix is a more suitable data structure than a data.frame/tibble.
dplyr/tidyr syntax usually operates efficiently on columns; as soon as you want to do things "row-wise", dplyr (and its family packages) might not be the best way (despite dplyr::rowwise() which just introduces a row number-based grouping).
With that out of the way, you can transpose the problem.
library(tidyverse)
df %>%
  transpose() %>%
  map(~ { if (is.na(.x$i1)) .x$i2 <- NA_real_; .x }) %>%
  transpose() %>%
  as_tibble() %>%
  unnest(everything())
# A tibble: 3 × 3
# id i1 i2
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 2 NA NA
#3 3 0 1
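Following the answer's own advice about matrices, here is a minimal matrix-based sketch of the same recoding (assuming df is all-numeric as in the question):
m <- as.matrix(df)
m[is.na(m[, "i1"]), "i2"] <- NA
m
#      id i1 i2
# [1,]  1  1  1
# [2,]  2 NA NA
# [3,]  3  0  1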

Select columns of specific types and manipulate their content

Here is my problem in the form of a reproducible example, together with my partial attempt at a solution.
# input
mydf_in <- data.frame(a = letters[6:10],
                      b = c("<0.5", "2", "<0.5", "9", "10"),
                      c = 1:5,
                      d = 6:10,
                      e = c("<0.8", "12", "<0.8", "<0.8", "<0.8"))
mydf_in
# output: the desired final result
mydf_out <- data.frame(a = letters[6:10],
                       b = c(0.5, 2, 0.5, 9, 10),
                       b_flag = c(1, 0, 1, 0, 0),
                       c = 1:5,
                       d = 6:10,
                       e = c(0.8, 12, 0.8, 0.8, 0.8),
                       e_flag = c(1, 0, 1, 1, 1))
mydf_out
library(tidyverse)
mydf_in %>%
  select(where(~ is.character(.x) &
                 any(str_detect(.x, "<")))) %>%
  # in between here is missing the creation and
  # the population of the flagging columns, i.e. "b_flag" and "e_flag"
  mutate(across(everything(), ~ as.numeric(str_replace(.x, "<", ""))))
In short, what is missing in the middle of the above code snippet is, for each selected column:
create a corresponding flagging column
populate the rows of the flagging column with 1 or 0 depending on the presence of the sign "<" (see desired output)
If we want to use the conditions explicitly, use mutate with where() instead of select: loop over the matching columns with across to create the 'flag' columns, then use a second across to change the column types.
library(dplyr)
library(stringr)
mydf_in %>%
  mutate(across(where(~ is.character(.x) &
                        any(str_detect(.x, fixed("<")))),
                ~ +(str_detect(.x, fixed("<"))), .names = "{.col}_flag"),
         across(where(~ is.character(.x) &
                        any(str_detect(.x, fixed("<")))),
                ~ readr::parse_number(.)))
Output:
a b c d e b_flag e_flag
1 f 0.5 1 6 0.8 1 1
2 g 2.0 2 7 12.0 0 0
3 h 0.5 3 8 0.8 1 1
4 i 9.0 4 9 0.8 0 1
5 j 10.0 5 10 0.8 0 1
You can use grepl() to make a logical vector for the flag columns; as.integer() will give 1s and 0s instead of TRUE and FALSE. Then use gsub() to clean up your initial columns.
library(dplyr)
mydf_in %>%
  mutate(b_flag = grepl("<", mydf_in$b) %>% as.integer,
         e_flag = grepl("<", mydf_in$e) %>% as.integer,
         b = gsub("<", "", mydf_in$b),
         e = gsub("<", "", mydf_in$e)) %>%
  select(a, b, b_flag, c, d, e, e_flag)
  a   b b_flag c  d   e e_flag
1 f 0.5      1 1  6 0.8      1
2 g   2      0 2  7  12      0
3 h 0.5      1 3  8 0.8      1
4 i   9      0 4  9 0.8      1
5 j  10      0 5 10 0.8      1
And if you happen to have several more columns that need flags, you could use grepl() in combination with lapply() or sapply() like this:
# make all flags
flags_columns <- mydf_in[, grepl("<", mydf_in)] %>%
  sapply(., grepl, pattern = "<") %>%
  data.frame(.) %>%
  sapply(., as.integer) %>%
  data.frame(.) %>%
  rename_with(., ~ paste0(., "_flag"))
# remove "<" from all columns
edited_columns <- mydf_in[, grepl("<", mydf_in)] %>%
  lapply(., gsub, pattern = "<", replacement = "") %>%
  data.frame(.)
# gather any other columns
anything_else <- mydf_in[, !grepl("<", mydf_in)]
# make a data frame
data.frame(c(flags_columns, edited_columns, anything_else))

Filter rows with dplyr/magrittr based on entire row

You can filter rows with dplyr's filter, but the condition is usually based on specific columns of each row, such as
d <- data.frame(x=c(1,2,NA),y=c(3,NA,NA),z=c(NA,4,5))
d %>% filter(!is.na(y))
I want to filter rows based on whether the proportion of NAs across the entire row exceeds 50%, something like
d %>% filter(mean(is.na(EACHROW)) < 0.5 )
How do I do this in a dplyr/magrittr flow fashion?
You could use rowSums or rowMeans for that. An example with the provided data:
> d
x y z
1 1 3 NA
2 2 NA 4
3 NA NA 5
# with rowSums:
d %>% filter(rowSums(is.na(.))/ncol(.) < 0.5)
# with rowMeans:
d %>% filter(rowMeans(is.na(.)) < 0.5)
which both give:
x y z
1 1 3 NA
2 2 NA 4
As you can see row 3 is removed from the data.
In base R, you could just do:
d[rowMeans(is.na(d)) < 0.5,]
to get the same result.
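If you are on dplyr 1.1.0 or later, pick() lets you write the same filter without the magrittr-only . pronoun, so it also works with the native |> pipe; a sketch:
library(dplyr)
d |> filter(rowMeans(is.na(pick(everything()))) < 0.5)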

R: By group, test if for each value of one variable, that value exists in another variable

I have a data frame structured something like:
a <- c(1,1,1,2,2,2,3,3,3,3,4,4)
b <- c(1,2,3,1,2,3,1,2,3,4,1,2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a,b,c)
Where a and b uniquely identify an observation. I want to create a new variable, d, which indicates whether each observation's value of b is present at least once in c within its group defined by a, such that d would be:
[1] 0 1 0 1 0 0 1 0 0 0 0 0
I can write a for loop which will do the trick,
attach(df)
for (i in unique(a)) {
  for (j in b[a == i]) {
    df$d[a == i & b == j] <- ifelse(j %in% c[a == i], 1, 0)
  }
}
But surely in R there must be a cleaner/faster way of achieving the same result?
Using data.table:
library(data.table)
setDT(df) #convert df to a data.table without copying
# +() is code golf for as.integer
df[ , d := +(b %in% c), by = a]
# a b c d
# 1: 1 1 NA 0
# 2: 1 2 NA 1
# 3: 1 3 2 0
# 4: 2 1 NA 1
# 5: 2 2 1 0
# 6: 2 3 1 0
# 7: 3 1 NA 1
# 8: 3 2 NA 0
# 9: 3 3 1 0
# 10: 3 4 1 0
# 11: 4 1 NA 0
# 12: 4 2 NA 0
Adding the dplyr version for those of that persuasion. All credit due to @akrun.
library(dplyr)
df %>% group_by(a) %>% mutate(d = +(b %in% c))
And for posterity, a base R version as well (via @thelatemail below)
df <- df[order(df$a, df$b), ]
df$d <- unlist(by(df, df$a, FUN = function(x) (x$b %in% x$c) + 0L ))
The above answer by MichaelChirico apparently works well and is correct. I rarely use data.table so I don't understand the syntax. This is a way to get the same results without data.table.
invisible(lapply(unique(df$a), function(x) {
  df$d[df$a == x] <<- 0L + (df$b[df$a == x] %in% df$c[df$a == x])
}))
This code gets all of the unique levels of a and then modifies the data.frame for that level of a using the logic you request. The <<- is necessary because df will otherwise be modified just in the scope of the apply and not in .GlobalEnv. With <<- it finds the parent environment where df is defined and sets df there.
Also, note the slightly different version of the + "trick": a leading 0L makes it clearer to the reader that the resulting vector is an integer, because the logical vector must be cast that way for the addition to work. The L after the 0 indicates that 0 is an integer and not a double. Note that the notation used by MichaelChirico for this casting gives the same result (a column of class integer).
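A quick illustration of the casting variants discussed above:
x <- c(TRUE, FALSE, TRUE)
+x             # 1 0 1 -- unary + coerces logical to integer
0L + x         # 1 0 1 -- integer addition, same result and class
as.integer(x)  # 1 0 1 -- the explicit spelling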

R Undo Dummy Variables

I have a data set where a bunch of categorical variables were converted to dummy variables (all classes used, NOT n-1) and some were not. I'm trying to recode each set back into a single column.
For instance
Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1
Is there a simple way to convert this to:
Q1 Q2 Q3
1 3 2
2 4 1
3 2 2
Right now I'm just using strsplit() (as all the dummied variable names contain '.') with a couple of loops, but I feel like there should be a better way. Any suggestions?
I wrote a function a while back that did this sort of thing.
MultChoiceCondense <- function(vars, indata) {
  # One output slot per row; NaN marks rows where no dummy is set
  tempvar <- matrix(NaN, ncol = 1, nrow = nrow(indata))
  dat <- indata[, vars]
  for (i in 1:length(vars)) {      # loop over the dummy columns
    for (j in 1:nrow(indata)) {    # loop over the rows
      if (dat[j, i] == 1) tempvar[j] <- i
    }
  }
  return(tempvar)
}
If your data is called Dat, then:
Dat$Q1<-MultChoiceCondense(c("Q1.1","Q1.2","Q1.3"),Dat)
Here's an approach that uses melt from "reshape2" and cSplit from my "splitstackshape" package along with some "data.table" fun. I've loaded dplyr so that we can pipe all the things.
library(splitstackshape)
library(reshape2)
library(dplyr)
mydf %>%
  as.data.table(keep.rownames = TRUE) %>%        # Convert to data.table. Keep rownames
  melt(id.vars = "rn", variable.name = "V") %>%  # Melt the dataset by rownames
  .[value > 0] %>%                               # Subset for all non-zero values
  cSplit("V", ".") %>%                           # Split the "V" column (names) by "."
  .[is.na(V_2), V_2 := value] %>%                # Replace NA values with actual values
  dcast.data.table(rn ~ V_1, value.var = "V_2")  # Go wide.
# rn Q1 Q2 Q3
# 1: 1 1 3 2
# 2: 2 2 4 1
# 3: 3 3 2 2
Here's a possible base R approach:
## Which columns are binary?
Bins <- sapply(mydf, function(x) {
  all(x %in% c(0, 1))
})
## Two vectors -- part after the dot and before
X <- gsub(".*\\.(.*)$", "\\1", names(mydf)[Bins])
Y <- unique(gsub("(.*)\\..*$", "\\1", names(mydf)[Bins]))
## Use `apply` to subset the X value based on the
## logical version of the binary variable
cbind(mydf[!Bins],
      `colnames<-`(t(apply(mydf[Bins], 1, function(z) {
        X[as.logical(z)]
      })), Y))
# Q2 Q1 Q3
# 1 3 1 2
# 2 4 2 1
# 3 2 3 2
At the end, you can just reorder the columns as required. You may also need to convert them to numeric since in this case, Q1 and Q3 will be factors.
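That final conversion could look like the following, assuming the combined result above is stored in a data frame called res (a name used here purely for illustration):
res[c("Q1", "Q3")] <- lapply(res[c("Q1", "Q3")],
                             function(x) as.numeric(as.character(x)))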
another base R approach
dat <- read.table(header = TRUE, text = "Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1")
## this will take all the unique questions; Q1, Q2, Q3; test if
## they are dummies; and return the column if so or find which
## dummy column is a 1 otherwise
res <- lapply(unique(gsub('\\..*', '', names(dat))), function(x) {
  tmp <- dat[, grep(x, names(dat)), drop = FALSE]
  if (ncol(tmp) == 1) unlist(tmp, use.names = FALSE) else max.col(tmp)
})
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 3 4 2
#
# [[3]]
# [1] 2 1 2
do.call('cbind', res)
# [,1] [,2] [,3]
# [1,] 1 3 2
# [2,] 2 4 1
# [3,] 3 2 2
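If you want the question names back on the combined matrix, one optional extra step (reusing the gsub() pattern from above; out is just an illustrative name):
out <- do.call('cbind', res)
colnames(out) <- unique(gsub('\\..*', '', names(dat)))
out
#      Q1 Q2 Q3
# [1,]  1  3  2
# [2,]  2  4  1
# [3,]  3  2  2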
I'm assuming your data looks like this, where the categorical columns are encoded with a dot (e.g. Q1.1). You may also have a case where all of the values in a row are zero, which indicates a base level (such as how dummyVars in caret works with fullRank=TRUE). If so, here is a vectorized solution.
library(dplyr)
dummyVars.undo = function(df, col_prefix) {
  if (!endsWith(col_prefix, '.')) {
    # If col_prefix doesn't end with a period, include one, but save the
    # "pretty name" as the one without a period
    pretty_col_prefix = col_prefix
    col_prefix = paste0(col_prefix, '.')
  } else {
    # Otherwise, strip the period for the pretty column name
    pretty_col_prefix = substr(col_prefix, 1, nchar(col_prefix)-1)
  }
  # Get all columns with that encoding prefix
  cols = names(df)[names(df) %>% startsWith(col_prefix)]
  # Find the rows where all values are zero. If this isn't the case
  # with your data there's no worry, it won't hurt anything.
  base_level.idx = rowSums(df[cols]) == 0
  # Set the column value to a base value of zero
  df[base_level.idx, pretty_col_prefix] = 0
  # Go through the remaining columns and find where the maximum value (1) occurs
  df[!base_level.idx, pretty_col_prefix] = cols[apply(df[!base_level.idx, cols], 1, which.max)] %>%
    strsplit('\\.') %>%
    sapply(tail, 1)
  # Drop the encoded columns
  df[cols] = NULL
  return(df)
}
Usage:
# Collapse Q1
df = dummyVars.undo(df, 'Q1')
# Collapse Q3
df = dummyVars.undo(df, 'Q3')
This uses dplyr, but only for the pipe operator %>%. You could certainly remove that if you'd prefer base R instead.
