I am trying to remove/filter out specific rows when a condition on two columns is met; otherwise, the column EP should be flagged as 1. What is the specific code for this?
For example: in the data frame df_NC, when the column "Population_Type" (binary) is equal to 1 and the column NC (binary) is equal to 0, remove the rows where this condition is satisfied; otherwise flag EP as 1.
df_ep <- df_NC %>% mutate(EP= case_when(
df_NC$Population_Type == 1 & df_NC$NC == 0 ~ 1,
TRUE ~ 0
))
From your code I'm assuming you are using the dplyr package. There are a couple of mistakes there.
You don't need to use the base notation like df_NC$NC inside dplyr functions; just use the name of the variable.
I don't see a reason to create the column EP if you are filtering out one of the values (0/FALSE).
df_NC %>%
mutate(EC = if_else(Population_Type == 1 & NC == 0, 1, 0)) %>%
filter(EC == 1)
# Or shorter, considering my second point
df_NC %>%
filter(Population_Type == 1, NC == 0) # Equivalent to EC == 1
Also, try to use booleans (TRUE/FALSE) instead of the integers 1/0 when working with a "binary" data type.
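For instance, here is a small sketch (assuming Population_Type and NC really are 0/1 columns of df_NC) that converts them to logicals once and then filters on them directly:
library(dplyr)

df_NC %>%
  mutate(across(c(Population_Type, NC), as.logical)) %>%
  filter(Population_Type, !NC)  # same rows as Population_Type == 1 & NC == 0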
I'm using this dataset.
I want to create a variable named neg_gw_shock such that
neg_gw_shock = (-gw_level_dev) if gw_level_dev < 0
neg_gw_shock = NA if gw_level_dev >= 0
How can I do that?
Solution with dplyr:
library(dplyr)
load("...\\water_data_1.rda")
water_data_1 %>%
  mutate(neg_gw_shock = case_when(gw_level_dev < 0 ~ -gw_level_dev,
                                  gw_level_dev >= 0 ~ NA_real_))  # NA_real_ keeps the column numeric
case_when works like a chain of if/else statements, but is vectorised.
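For example, applied to a plain numeric vector (a quick sketch with made-up values, independent of water_data_1):
library(dplyr)

gw <- c(-2.5, 0, 1.3, -0.4)
case_when(gw < 0 ~ -gw,
          gw >= 0 ~ NA_real_)
# [1] 2.5  NA  NA 0.4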
How would I remove columns from a data frame when both rows of that column have non-zero values?
For example, I want to change the following table

  Dogs Cats Snakes Elephants
1    1    0      1         3
2    2    1      0         2

to the following

  Cats Snakes
1    0      1
2    1      0
The other columns are removed because both of their rows had non-zero numbers. If at least one of the two rows has a zero, then we retain the entire column. It does not matter which row contains the zero.
I tried to use dplyr and if/else statements, but most of those are based on a single condition in the column being met.
You may use colSums here:
df[, colSums(df!=0) != nrow(df)]
Cats Snakes
1 0 1
2 1 0
The logic here is to retain any column such that the count of row values not equal to zero does not equal the total number of rows. Put another way, this says to retain any column having at least one zero row.
Data:
df <- data.frame(Dogs=c(1,2), Cats=c(0,1), Snakes=c(1,0), Elephants=c(3,2))
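To see the intermediate pieces of the colSums approach, here is a quick sketch using the df defined above:
df != 0
#   Dogs  Cats Snakes Elephants
# 1 TRUE FALSE   TRUE      TRUE
# 2 TRUE  TRUE  FALSE      TRUE

colSums(df != 0)
#      Dogs      Cats    Snakes Elephants
#         2         1         1         2

colSums(df != 0) != nrow(df)  # TRUE = keep (the column has at least one zero)
#      Dogs      Cats    Snakes Elephants
#     FALSE      TRUE      TRUE     FALSE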
Here are a few other options -
#1. Base R Filter
Filter(function(x) any(x == 0), df)
#2. purrr::keep
purrr::keep(df, ~any(.x == 0))
#3. purrr::discard
purrr::discard(df, ~all(.x != 0))
All of which return the output as -
# Cats Snakes
#1 0 1
#2 1 0
Here is a dplyr solution using select along with any:
We just select columns that contain at least one value that is 0 (or less):
library(dplyr)
df %>%
select(where(~ any(. <= 0)))
Cats Snakes
1 0 1
2 1 0
Benchmark of the answers provided so far:
library(microbenchmark)
library(ggplot2)  # needed for autoplot()

mbm <- microbenchmark(
  base_TimBiegeleisen = df[, colSums(df != 0) != nrow(df)],
  dplyr_TarJae = df %>% select(where(~ any(. <= 0))),
  base_Ronak_Shah = Filter(function(x) any(x == 0), df),
  purr_keep_Ronak_Shah = purrr::keep(df, ~ any(.x == 0)),
  purr_discard_Ronak_Shah = purrr::discard(df, ~ all(.x != 0)),
  times = 50
)
mbm
autoplot(mbm)
I am trying to create a column and set it to 1 based on whether all of the columns with a similar name pattern are NA.
This is what I have tried so far, and it doesn't seem to work.
Any help would be appreciated thanks!
mutate(
column_to_create =
case_when(
is.na(vars(matches('pattern'))) ~ as.character(1)
)
)
You can try -
library(dplyr)
df <- df %>%
  mutate(column_to_create = as.integer(rowSums(!is.na(select(., matches('pattern')))) == 0))
This should give 1 when all the values in the columns whose names contain 'pattern' are NA, and 0 otherwise.
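As a quick check, here is a minimal sketch with a made-up data frame (the column names pattern_a and pattern_b are just placeholders for whatever your real names are):
library(dplyr)

df <- data.frame(id = 1:3,
                 pattern_a = c(NA, 1, NA),
                 pattern_b = c(NA, NA, 2))

df %>%
  mutate(column_to_create = as.integer(rowSums(!is.na(select(., matches('pattern')))) == 0))
#   id pattern_a pattern_b column_to_create
# 1  1        NA        NA                1
# 2  2         1        NA                0
# 3  3        NA         2                0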
I am new to R and having difficulty understanding why I get a difference in values between the two pieces of code. Why does the code below return different results when I move !is.na(arr_time) from mutate to filter? My data is coming from the nycflights13 package.
A <- flights %>%
filter(!is.na(tailnum)) %>%
mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
group_by(tailnum) %>%
summarise(on_time = mean(on_time), n = n()) %>%
filter(min_rank(on_time) == 1)
B <- flights %>%
filter(!is.na(tailnum), !is.na(arr_time)) %>%
mutate(on_time = arr_delay <= 0) %>%
group_by(tailnum) %>%
summarise(on_time = mean(on_time), n = n()) %>%
filter(min_rank(on_time) == 1)
Tibble A returns 110 observations while Tibble B returns 104 observations. When I separate the 6 unique observations between A and B and look them up in the flights data frame, all 6 have observations where arr_time is NA. Shouldn't those be excluded from Tibble A based on the conditions in mutate? What am I missing?
Thanks!
Regarding Tibble A:
mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) is saying "create a new column in my dataset called on_time which is true only when arr_delay is less than or equal to zero AND when arr_time is not NA." So whether arr_time is NA or not is a part of the resulting boolean (T/F) result that you're storing within this new column's value. In other words, no filtering is taking place due to if arr_time is NA. It's only being used to determine if the result should be TRUE or FALSE.
Regarding Tibble B:
filter(!is.na(tailnum), !is.na(arr_time)) is saying "filter out observations (rows) where EITHER tailnum is NA, OR where arr_time is NA."
Let's consider a much simpler version of this same concept:
x <- c(1, 2, NA, 3, 4)
# "filter()" example
# filtering based on if values in x are NA:
x[!is.na(x)]
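# [1] 1 2 3 4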
# equivalent to "mutate()" example where our result doesn't exclude NA
# values, they are simply used within our logic to determine T/F...
# determining the value of a boolean (TRUE/FALSE) based on if values in x are NA:
is.na(x)
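# [1] FALSE FALSE  TRUE FALSE FALSE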
The dplyr filter function removes rows from a data frame. From the help of this function:
Use filter() to choose rows/cases where conditions are true. Unlike
base subsetting with [, rows where the condition evaluates to NA are
dropped.
So rows that evaluate to NA are dropped. How many rows?
> sum(is.na(flights$arr_time))
[1] 8713
How many rows are you left with after filtering:
> sum(!is.na(flights$arr_time))
[1] 328063
If I run the first two lines of each of the two code blocks and check how many rows are left:
A <- flights %>%
filter(!is.na(tailnum))
> nrow(A)
[1] 334264
and
B <- flights %>%
filter(!is.na(tailnum), !is.na(arr_time))
> nrow(B)
[1] 328063
So by adding the !is.na(arr_time) clause in the filter function of B you are dropping the respective rows. Mutate does not drop rows; it changes or adds variables.
Does this help?
mutate doesn't exclude anything: for your condition you will get TRUE or FALSE. In other words, mutate will generate a new column with a value for each row of the existing data. filter, on the other hand, can reduce the number of rows depending on your condition.
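As a quick illustration of that difference in row counts, here is a small sketch using the same nycflights13 data (on_time_known is just an illustrative column name):
library(dplyr)
library(nycflights13)

flights %>% mutate(on_time_known = !is.na(arr_time)) %>% nrow()  # 336776 - no rows dropped
flights %>% filter(!is.na(arr_time)) %>% nrow()                  # 328063 - NA rows dropped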
So I have a data frame like this:
And I'd like all the missing values (NAs), and only those, to be replaced by this formula: Value1 / Value2.
I know how to do this with a loop, but with a large-scale data frame it takes time, so I was wondering if there is any function/tip that gives me the expected result faster.
Not a direct function but something like this would work
#Get indices for NA non-zero values
inds1 <- is.na(df$Result) & df$Value2 != 0
#Get indices for NA zero values
inds2 <- is.na(df$Result) & df$Value2 == 0
#Replace them
df$Result[inds1] <- df$Value1[inds1]/df$Value2[inds1]
df$Result[inds2] <- 0
A perfect fit for the tidyverse:
library(tidyverse)
d %>%
  mutate(Result = ifelse(is.na(Result), Value1/Value2, Result))
or
d %>%
  mutate(Result = case_when(is.na(Result) & Value2 == 0 ~ Value2,
                            is.na(Result) ~ Value1/Value2,
                            TRUE ~ Result))
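For a quick check, here is a minimal sketch with a made-up d (assuming the columns are named Value1, Value2 and Result, as in the question):
library(dplyr)

d <- data.frame(Value1 = c(10, 6, 8),
                Value2 = c(2, 0, 4),
                Result = c(5, NA, NA))

d %>%
  mutate(Result = case_when(is.na(Result) & Value2 == 0 ~ Value2,
                            is.na(Result) ~ Value1/Value2,
                            TRUE ~ Result))
#   Value1 Value2 Result
# 1     10      2      5
# 2      6      0      0
# 3      8      4      2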