Not able to get the value, getting the position instead - R

I am trying to get the first non-zero value in each row, but my code gives me the position instead. I know this happens because I use which() in my code, but I need the value. I am sharing my sample data and the R code I used.
Cnt<- c(9940000126,9940000188,9940000406,9940000992,9940001017,9940001288,9940001833,9940002276,9940002629)
FY12_April <- c(0,0,0,0,0,0,0,0,0)
FY12_August <- c(0,0,.343545,0,0,0,0,0,0)
FY12_December <- c(0,0,0,0,0,0,0,0,0)
FY12_February <- c(0,0,0,0,0,0,0,0,0)
FY12_January <- c(0,0,0.98557,0,0,0,0,0,0.41949703)
FY12_July <- c(0,0,0,0,0,-1.211583915,0,0,0)
FY12_June <- c(-1.47268885,0,0,0,-0.80164469,0,0,0,0)
SamData <- data.frame(Cnt,FY12_April,FY12_August,FY12_December,FY12_February,FY12_January,FY12_July,FY12_June)
library(dplyr)
ProcessData <- SamData %>%
  mutate(Count = apply(select(., FY12_April:FY12_June), 1, function(x) sum(x != 0, na.rm = TRUE))) %>%
  mutate(FirstInst = colnames(select(., FY12_April:FY12_June))[apply(select(., FY12_April:FY12_June), 1, function(x) which(x != 0)[1])]) %>%
  mutate(FirstInstAmt = apply(select(., FY12_April:FY12_June), 1, function(x) which(x != 0, arr.ind = TRUE, useNames = TRUE)[1]))

We can use max.col with a row index to get the value. It would be more efficient than apply:
SamData$FirstInstAmt <- SamData[-1][cbind(seq_len(nrow(SamData)),
                                          max.col(SamData[-1] != 0, 'first'))]
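This works because indexing a data frame with a two-column matrix of (row, column) positions extracts one element per row. A minimal toy illustration (my own, not part of the original answer):
# cbind() builds a (row, column) index matrix: here elements (1,2) and (2,1)
m <- data.frame(x = c(10, 20), y = c(30, 40))
m[cbind(1:2, c(2, 1))]
#[1] 30 20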
Or if we want to use apply
SamData$FirstInstAmt <- apply(SamData[-1], 1, function(x) x[x !=0][1])
SamData$FirstInstAmt
#[1] -1.4726888 NA 0.3435450 NA -0.8016447
#[6] -1.2115839 NA NA 0.4194970
Or using pmap with dplyr
library(dplyr)
library(purrr)
SamData %>%
  mutate(FirstInstAmt = pmap_dbl(.[-1], ~ {x <- c(...); x[x != 0][1]}))
Or use c_across with rowwise
SamData %>%
  rowwise %>%
  mutate(FirstInstAmt = {tmp <- c_across(FY12_April:FY12_June)
                         tmp[tmp != 0][1]})
Or replace the 0 values with NA and use coalesce to return the first non-NA
SamData %>%
mutate(FirstInstAmt = coalesce(!!! .[-1] * NA^(!.[-1])))
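The NA^(!.) trick works because !x is 1 for zeros and 0 otherwise, so zeros become NA (NA^1) while non-zero values are multiplied by 1 (NA^0). A quick vector-level check (my own illustration):
x <- c(0, 2, 0, 5)
x * NA^(!x)
#[1] NA  2 NA  5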
NOTE: Using rowwise/c_across, pmap, or apply could be slower, as these all loop over rows. The most efficient of these approaches would be the one based on indexing (max.col), followed to some extent by the coalesce/replace approach.
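If you want to verify this yourself, here is a rough sketch using the microbenchmark package (assuming it is installed; the nine-row sample here is too small for meaningful differences, so try a larger data set):
library(microbenchmark)
microbenchmark(
  indexing = SamData[-1][cbind(seq_len(nrow(SamData)),
                               max.col(SamData[-1] != 0, 'first'))],
  apply    = apply(SamData[-1], 1, function(x) x[x != 0][1])
)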

Related

Replace values in a column of a data frame with another value (same for all)

My data frame consists of 21 columns; for this problem only one is relevant:
I want to replace the values 2, 3, 4, or 5 in a column a with the value 1 (in the same column).
Instead of repeating the code below for each of the values 2, 3, 4, 5, I'm looking for something more elegant:
df <- df %>% mutate (a = replace(a, a == 2,1))
df <- df %>% mutate (a = replace(a, a == 3,1))
df <- df %>% mutate (a = replace(a, a == 4,1))
df <- df %>% mutate (a = replace(a, a == 5,1))
So I'm just stuck on the "or" condition I need to create inside the code...
Any solution?
You can replace values in multiple columns using across and match multiple values with %in%. For example, if you want to replace values in columns a, b, c and d, you can do:
library(dplyr)
df <- df %>% mutate(across(a:d, ~replace(., . %in% 2:5, 1)))
#For dplyr < 1.0.0 use `mutate_at`
#df <- df %>% mutate_at(vars(a:d), ~replace(., . %in% 2:5, 1))
In base R, you can do this with lapply:
cols <- c('a','b','c','d')
df[cols] <- lapply(df[cols], function(x) replace(x, x %in% 2:5, 1))
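As a quick sanity check, here is a hypothetical toy data frame (my own, not from the original post) run through the base R version:
toy <- data.frame(a = c(1, 2, 5, 7), b = c(3, 1, 4, 2))
toy[c('a', 'b')] <- lapply(toy[c('a', 'b')], function(x) replace(x, x %in% 2:5, 1))
toy
#  a b
#1 1 1
#2 1 1
#3 1 1
#4 7 1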

Apply dplyr functions on a single column across a list using piping

I'm trying to filter something across a list of dataframes for a specific column. Typically, across a single dataframe using dplyr, I would use:
#creating dataframe
df <- data.frame(a = 0:10, d = 10:20)
# filtering column a for rows greater than 7
df %>% filter(a > 7)
I've tried doing this across a list using the following:
# creating list
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(c = 11:20, d = 21:30),
          data.frame(e = 15:25, f = 35:45))
# selecting the appropriate column and trying to filter
# this is not working
x[1][[1]][1] %>% lapply(. %>% {filter(. > 2)})
# however, if I use the min() function it works
x[1][[1]][1] %>% lapply(. %>% {min(.)})
I find the %>% syntax quite easy to understand and work with. However, in this case, selecting a specific column and doing something quite simple like filtering is not working. I'm guessing map could be equally useful. Any help is appreciated.
You can use filter_at to refer to a column by position.
library(dplyr)
purrr::map(x, ~.x %>% filter_at(1, any_vars(. > 7)))
In filter, you can subset the column by position and use it:
purrr::map(x, ~.x %>% filter(.[[1]] > 7))
In base R, that would be:
lapply(x, function(y) y[y[[1]] > 7, ])
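All three approaches give the same result; for the first data frame in the list, the expected output (for the data above) is:
lapply(x, function(y) y[y[[1]] > 7, ])[[1]]
#    a  b
#9   8 18
#10  9 19
#11 10 20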
It seems you are interested in checking the condition on the first column of each dataframe in your list.
One solution using dplyr would be
lapply(x, function(df) {df %>% filter_at(1, ~. > 7)})
The 1 in filter_at indicates that I want to check the condition on the first column (1 is a positional index) of each dataframe in the list.
EDIT
After the discussion in the comments, I propose the following solution
lapply(x, function(df) {df %>% filter(a > 7) %>% select(a) %>% slice(1)})
Input data
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(a = 11:20, b = 21:30),
          data.frame(a = 15:25, b = 35:45))
Output
[[1]]
  a
1 8

[[2]]
   a
1 11

[[3]]
   a
1 15
Using filter with across
library(dplyr)
library(purrr)
map(x, ~ .x %>%
      filter(across(names(.)[1], ~ . > 7)))

purrr::map_if: how to return a value if the condition is FALSE?

I am running a function using purrr::map that will return an error if the dataframe does not contain numeric data (i.e., na.omit does not return any valid rows). I discovered map_if, but it seems map_if returns .x if .p is false. Is there a way to return NA instead? This example should explain what I need:
library(openair)
library(tidyverse)
# Build test dataset
df1 <- mydata
df2 <- mydata
df2$no2 <- NA_real_
df3 <- mydata
dfx <- tibble(id = c(1, 2, 3), data = list(df1, df2, df3))
# polarPlot will return an error if the dataframe does not contain numeric data (i.e., it only contains NAs)
polarPlot(df2, pollutant = "no2")
# Function to test length of dataframe (i.e., if 0 then everything is NA)
check_length <- function(x) (x %>% select(ws, wd, "no2") %>% na.omit() %>% nrow()) > 0
check_length(df1)
check_length(df2)
# purrr::map (is there a way for map_if to return NA if length == 0?)
dfx %>% mutate(mynewvar = map_if(.x = data, check_length, ~ polarPlot(.x, pollutant = "no2")))
In other words, I would like mynewvar[[2]] to return NA.
@dylanjm I posted a reprex; not sure if you are able to see it. As you suggested, the function possibly is what I need.
possible_polarPlot <- possibly(polarPlot, otherwise = NA)
out <- dfx %>% mutate(mynewvar = map(.x = data, ~ possible_polarPlot(.x, pollutant = "no2")))
out$mynewvar[[2]] # Returns NA as I was looking for.
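For readers without openair installed, the same idea can be shown with a self-contained toy example (my own illustration of possibly, not from the original post):
library(purrr)
# Wrap a function so that any error yields NA instead of stopping
safe_log <- possibly(log, otherwise = NA_real_)
safe_log(10)   #[1] 2.302585
safe_log("a")  # log("a") would error; this returns NA instead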

How to use apply functions correctly when there are NA values

I'd like to calculate a function on multiple columns of a dataframe with random NA values. I have two questions:
How to deal with NAs? The code runs when I try it on non-NA columns, but returns NA when there are NAs even though I remove them.
How to print the results in a dataframe format instead of multiple arrays? I used mapply but it doesn't seem to do the calculations correctly.
Here is my code:
#create a data frame with random NAs
df <- data.frame(category1 = sample(c(1:10), 100, replace = TRUE),
                 category2 = sample(c(1:10), 100, replace = TRUE))
insert_nas <- function(x) {
len <- length(x)
n <- sample(1:floor(0.2*len), 1)
i <- sample(1:len, n)
x[i] <- NA
x
}
df <- sapply(df, insert_nas) %>% as.data.frame()
df$type <- sample(c("A", "B", "C"),100,replace=TRUE)
#using apply:
library(NPS)
apply(df[, c('category1', 'category2')], 2,
      function(x) df %>% filter(!is.na(x)) %>% group_by(type) %>%
        transmute(nps(x)) %>% unique())
#results:
$category1
# A tibble: 3 x 2
# Groups: type [3]
type `nps(x)`
<chr> <dbl>
1 B NA
2 A NA
3 C NA
...
#using mapply
mapply(function(x) df %>% filter(!is.na(x)) %>% group_by(type) %>%
         transmute(nps(x)) %>% unique(), df[, c('category1', 'category2')])
#results:
category1 category2
type Character,3 Character,3
nps(x) Numeric,3 Numeric,3
Regarding the function I use: it doesn't have a built-in way to deal with NAs, so I remove NAs prior to calling it.
I still used the !is.na part of your code because it seems that nps can't deal with NA, even though the documentation says it should (possibly a bug). I changed your apply to lapply and passed the variable names as the list. Then I used get to treat the quoted variable name as a variable in your df.
df <- data.frame(category1 = sample(c(1:10), 100, replace = TRUE),
                 category2 = sample(c(1:10), 100, replace = TRUE))
insert_nas <- function(x) {
len <- length(x)
n <- sample(1:floor(0.2*len), 1)
i <- sample(1:len, n)
x[i] <- NA
x
}
df <- sapply(df, insert_nas) %>% as.data.frame()
df$type <- sample(c("A", "B", "C"),100,replace=TRUE)
#using apply:
library(NPS)
df2 <- as.data.frame(lapply(c('category1', 'category2'),
                            function(x) df %>% filter(!is.na(get(x))) %>% group_by(type) %>%
                              transmute(nps(get(x))) %>% unique()),
                     stringsAsFactors = FALSE)
colnames(df2) <- c("type", "nps_cat1","type2","nps_cat2")
#type2 is redundant
df2 <- select(df2, -type2)
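The same get() pattern works with any per-group summary; a minimal sketch substituting mean() for nps() so it runs without the NPS package (my own variation, the column name avg is mine):
library(dplyr)
lapply(c('category1', 'category2'),
       function(x) df %>% filter(!is.na(get(x))) %>% group_by(type) %>%
         transmute(avg = mean(get(x))) %>% unique())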

Remove rows where all variables are NA using dplyr

I'm having some issues with a seemingly simple task: to remove all rows where all variables are NA using dplyr. I know it can be done using base R (Remove rows in R matrix where all data is NA and Removing empty rows of a data file in R), but I'm curious to know if there is a simple way of doing it using dplyr.
Example:
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
filter(dat, !is.na(a) | !is.na(b) | !is.na(c))
The filter call above does what I want but it's infeasible in the situation I'm facing (as there is a large number of variables). I guess one could do it by using filter_ and first creating a string with the (long) logical statement, but it seems like there should be a simpler way.
Another way is to use rowwise() and do():
na <- dat %>%
  rowwise() %>%
  do(tibble(na = !all(is.na(.)))) %>%
  .$na
filter(dat, na)
but that does not look too nice, although it gets the job done. Other ideas?
Since dplyr 0.7.0, new scoped filtering verbs exist. Using filter_all with any_vars you can easily keep rows with at least one non-missing column:
# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))
Using @hejseb's benchmarking algorithm, it appears that this solution is as efficient as f4.
UPDATE:
Since dplyr 1.0.0 the scoped verbs above are superseded. Instead, the across function family was introduced, which allows performing a function on multiple (or all) columns. Filtering rows with at least one non-NA column now looks like this:
# dplyr 1.0.0
dat %>% filter(if_any(everything(), ~ !is.na(.)))
I would suggest using the wonderful janitor package here; janitor is very user-friendly:
janitor::remove_empty(dat, which = "rows")
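On the example data above, this keeps the two rows that contain at least one value (expected output; exact printing may differ by version):
janitor::remove_empty(dat, which = "rows")
## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     1     2
## 2     2    NA    NA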
Benchmarking
@DavidArenburg suggested a number of alternatives. Here's a simple benchmark of them.
library(tidyverse)
library(microbenchmark)
n <- 100
dat <- tibble(a = rep(c(1, 2, NA), n), b = rep(c(1, 1, NA), n))
f1 <- function(dat) {
  na <- dat %>%
    rowwise() %>%
    do(tibble(na = !all(is.na(.)))) %>%
    .$na
  filter(dat, na)
}
f2 <- function(dat) {
  dat %>% filter(rowSums(is.na(.)) != ncol(.))
}
f3 <- function(dat) {
  dat %>% filter(rowMeans(is.na(.)) < 1)
}
f4 <- function(dat) {
  dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))
}
f5 <- function(dat) {
  dat %>%
    mutate(indx = row_number()) %>%
    gather(var, val, -indx) %>%
    group_by(indx) %>%
    filter(sum(is.na(val)) != n()) %>%
    spread(var, val)
}
# f1 is too slow to be included!
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Using Reduce and lapply appears to be the fastest:
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Unit: microseconds
 expr        min          lq       mean      median         uq        max neval
   f2    909.495    986.4680   2948.913   1154.4510   1434.725 131159.384   100
   f3    946.321   1036.2745   1908.857   1221.1615   1805.405   7604.069   100
   f4    706.647    809.2785   1318.694    960.0555   1089.099  13819.295   100
   f5 640392.269 664101.2895 692349.519 679580.6435 709054.821 901386.187   100
Using a larger data set 107,880 x 40:
dat <- diamonds
# Let every third row be NA
dat[seq(1, nrow(diamonds), 3), ] <- NA
# Add some extra NA to first column so na.omit() wouldn't work
dat[seq(2, nrow(diamonds), 3), 1] <- NA
# Increase size
dat <- dat %>%
bind_rows(., .) %>%
bind_cols(., .) %>%
bind_cols(., .)
# Make names unique
names(dat) <- 1:ncol(dat)
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
f5 is too slow so it is also excluded. f4 seems to do relatively better than before.
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
Unit: milliseconds
 expr      min       lq      mean    median       uq      max neval
   f2 34.60212 42.09918 114.65140 143.56056 148.8913 181.4218   100
   f3 35.50890 44.94387 119.73744 144.75561 148.8678 254.5315   100
   f4 27.68628 31.80557  73.63191  35.36144 137.2445 152.4686   100
Starting with dplyr 1.0, the colwise vignette gives a similar case as an example:
filter(across(everything(), ~ !is.na(.x))) #Remove rows with *any* NA
We can see it uses the same implicit "and" logic that filter applies across multiple expressions. So the following minor adjustment selects the rows where every value is NA:
filter(across(everything(), ~ is.na(.x))) #Remove rows with *any* non-NA
But the question asks for the inverse set: Remove rows with all NA.
We can do a simple setdiff using the previous, or
we can use the fact that across returns a logical tibble and filter effectively does a row-wise all() (i.e. &).
E.g.:
rowAny = function(x) apply(x, 1, any)
anyVar = function(fcn) rowAny(across(everything(), fcn)) #make it readable
df %<>% filter(anyVar(~ !is.na(.x))) #Remove rows with *all* NA
Or:
filterout = function(df, ...) setdiff(df, filter(df, ...))
df %<>% filterout(across(everything(), is.na)) #Remove rows with *all* NA
Or even combine the above two to express the first example more directly:
df %<>% filterout(anyVar(~ is.na(.x))) #Remove rows with *any* NA
In my opinion, the tidyverse filter function would benefit from a parameter describing the 'aggregation logic'. It could default to "all" and preserve behavior, or allow "any" so we wouldn't need to write anyVar-like helper functions.
The solution using dplyr 1.0 is simple and does not require helper functions; you just need to add a negation in the right place.
dat %>% filter(!across(everything(), is.na))
dplyr 1.0.4 introduced the if_any() and if_all() functions:
dat %>% filter(if_any(everything(), ~!is.na(.)))
or, more verbose:
dat %>% filter(if_any(everything(), purrr::negate(is.na)))
"Take dat and keep all rows where any entry is non-NA"
Here's another solution that uses purrr::map_lgl() and tidyr::nest():
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
any_not_na <- function(x) {
  !all(map_lgl(x, is.na))
}
dat_cleaned <- dat %>%
  rownames_to_column("ID") %>%
  group_by(ID) %>%
  nest() %>%
  filter(map_lgl(data, any_not_na)) %>%
  unnest() %>%
  select(-ID)
dat_cleaned
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1. 1. 2.
## 2 2. NA NA
I doubt this approach will be able to compete with the benchmarks in @hejseb's answer, but I think it does a pretty good job of showing how the nest %>% map %>% unnest pattern works, and users can run through it line by line to figure out what's going on.
You can use the function complete.cases (which comes from base R's stats package, not dplyr), using the dot (.) to refer to the data frame earlier in the chain. Note that complete.cases keeps only rows with no NAs at all, i.e. it removes rows containing any NA, which is stricter than removing only rows where all variables are NA.
library(dplyr)
df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5),
  x3 = c(NA, 2, 3, 5)
)
df %>%
filter(complete.cases(.))
  x1 x2 x3
1  2  2  2
A neat solution that works in dplyr 1.0.1 is to use rowwise():
dat %>%
  rowwise() %>%
  filter(!all(is.na(across(everything())))) %>%
  ungroup()
Very similar to @Callum Savage's comment on the top post (I missed it on the first pass), but without the sum():
(tidyverse 1.3.1)
data %>%
  rowwise() %>%
  filter(!all(is.na(c_across(where(is.numeric)))))
data %>%
  rowwise() %>%
  filter(!all(is.na(c_across(starts_with("***")))))
