How would I remove columns from a data frame when both rows of that column have non-zero values?
For example, I want to change this table
Dogs Cats Snakes Elephants
1    0    1      3
2    1    0      2
to the following
Cats Snakes
0    1
1    0
The reason the other columns are removed is because both rows had non-zero numbers. If one of the two rows has a zero then we'd retain the entire column. It does not matter which one contains the zero.
I tried to use dplyr and if/else statements, but most of those approaches test a single condition within the column.
You may use colSums here:
df[, colSums(df!=0) != nrow(df)]
Cats Snakes
1 0 1
2 1 0
The logic here is to retain any column such that the count of row values not equal to zero does not equal the total number of rows. Put another way, this says to retain any column having at least one zero row.
Data:
df <- data.frame(Dogs=c(1,2), Cats=c(0,1), Snakes=c(1,0), Elephants=c(3,2))
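To see why this works, it can help to look at the intermediate values. This is only a minimal sketch on the same df: the names nonzero_per_col and keep are illustrative, and drop = FALSE is my own addition (not part of the answer above) so that the result stays a data frame even if only one column survives.

```r
df <- data.frame(Dogs = c(1, 2), Cats = c(0, 1), Snakes = c(1, 0), Elephants = c(3, 2))

# Number of non-zero entries in each column
nonzero_per_col <- colSums(df != 0)

# TRUE for columns where not every row is non-zero, i.e. at least one zero
keep <- nonzero_per_col != nrow(df)

# drop = FALSE (an assumption/addition) keeps a data frame even for one column
result <- df[, keep, drop = FALSE]
print(result)
```

Without drop = FALSE, a single surviving column would come back as a plain vector, which may or may not be what you want.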
Here are a few other options:
#1. Base R Filter
Filter(function(x) any(x == 0), df)
#2. purrr::keep
purrr::keep(df, ~any(.x == 0))
#3. purrr::discard
purrr::discard(df, ~all(.x != 0))
All of these return the same output:
# Cats Snakes
#1 0 1
#2 1 0
Here is a dplyr solution using select along with any:
We select only the columns that contain at least one value that is 0 or less (for non-negative data this is the same as containing at least one zero):
library(dplyr)
df %>%
select(where(~ any(. <= 0)))
Cats Snakes
1 0 1
2 1 0
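Note that <= 0 and == 0 only agree when the data has no negative values. A small sketch of the difference, written with base R's Filter so it runs without extra packages; df2 and its values are made up for illustration and are not from the question.

```r
# Hypothetical data: column A contains a negative value but no zero
df2 <- data.frame(A = c(-1, 2), B = c(0, 3), C = c(4, 5))

# "0 or less" keeps A as well as B
kept_le <- Filter(function(x) any(x <= 0), df2)

# "exactly 0" keeps only B
kept_eq <- Filter(function(x) any(x == 0), df2)

names(kept_le)
names(kept_eq)
```

If negative values can occur and only true zeros should count, the == 0 predicate is probably the safer choice.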
Benchmark of the answers provided so far:
library(microbenchmark)
library(dplyr)
library(ggplot2) # autoplot() generic; microbenchmark supplies the method
mbm <- microbenchmark(
base_TimBiegeleisen = df[, colSums(df!=0) != nrow(df)],
dplyr_TarJae = df %>% select(where(~ any(. <= 0))),
base_Ronak_Shah = Filter(function(x) any(x == 0), df),
purrr_keep_Ronak_Shah = purrr::keep(df, ~any(.x == 0)),
purrr_discard_Ronak_Shah = purrr::discard(df, ~all(.x != 0)),
times=50
)
mbm
autoplot(mbm)
I wish to count the consecutive occurrences of each value and assign that count to every row of the run in a new column. Below is an example of the input and the desired output:
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"))
dataset$count <- c(1,2,2,2,2,1,4,4,4,4,1,1)
dataset
input count
a 1
b 2
b 2
a 2
a 2
c 1
a 4
a 4
a 4
a 4
b 1
c 1
With rle(dataset$input) I can just get number of occurrences of each value. But I want resulting output in above format.
My question is similar to:
R: count consecutive occurrences of values in a single column
But here output is in sequence and I want to assign the count itself to that value.
You can repeat each run length returned by rle as many times as the run is long:
with(rle(dataset$input), rep(lengths, lengths))
#[1] 1 2 2 2 2 1 4 4 4 4 1 1
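In case the one-liner is opaque, here is a small sketch that unpacks it step by step. It uses a plain character vector with the same values as dataset$input; the name runs is just illustrative.

```r
input <- c("a", "b", "b", "a", "a", "c", "a", "a", "a", "a", "b", "c")

# rle() returns the value of each run and how long the run is
runs <- rle(input)
runs$values   # "a" "b" "a" "c" "a" "b" "c"
runs$lengths  # 1 2 2 1 4 1 1

# Repeat each run length once per element of that run
count <- rep(runs$lengths, times = runs$lengths)
count
```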
Using dplyr, we can use lag to create groups and then count the number of rows in each group.
library(dplyr)
dataset %>%
group_by(gr = cumsum(input != lag(input, default = first(input)))) %>%
mutate(count = n())
and with data.table
library(data.table)
setDT(dataset)[, count:= .N, rleid(input)]
data
Make sure the input column is character and not factor.
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"),
stringsAsFactors = FALSE)
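The lag/cumsum grouping idea can also be sketched in base R, which may help to see what the group id looks like. This is only an illustration of the same logic, not a replacement for the answers above; starts_run and grp are hypothetical names.

```r
input <- c("a", "b", "b", "a", "a", "c", "a", "a", "a", "a", "b", "c")

# TRUE wherever a value differs from the previous one; the first element always starts a run
starts_run <- c(TRUE, input[-1] != input[-length(input)])

# Running total of run starts gives one id per consecutive run
grp <- cumsum(starts_run)

# Size of each run, repeated for every element of the run
count <- ave(seq_along(input), grp, FUN = length)
count
```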
We can use rleid from data.table together with dplyr:
library(dplyr)
dataset %>%
group_by(grp = data.table::rleid(input)) %>%
mutate(count = n())
I want to write a function that accepts two arguments: a data.frame and a vector (here, called id_var).
Then it filters the data.frame by a value that is in id_var (eg. the first value in the vector), adds the resulting data.frame to a variable called data_filt_by_var.
If the number of rows in data_filt_by_var is greater than one, it takes that same initial data.frame, filters by the same id_var value, selects the distinct end values (end is the name of a column present in the data.frame), and gets the number of rows. If that number of rows is >= 1, it returns 1, else 0.
The problem is that it has to do this for each value in id_var. I cannot make this iteration work without using loops, which are not desirable.
I wrote the following function, but it's not working.
is_this_unique = function(data, id_var) {
data_filt_by_var = nrow(data[data$id == id_var, ])
if (data_filt_by_var >= 1) {
if (nrow(data[data$id == id_var, ] %>%
distinct(full_address)) == 1) {
return(1)
}
} else {
return(0)
}
}
sample_data = (tibble::tribble(~id, ~full_address,
1,'abc',
1,'bcd',
1,'abc',
2,'qaa',
2,'xcv',
2,'qaa'))
id_var = c(1,2)
I was hoping to use map_dbl in this function.
The expected output would be:
input:
>is_this_unique(sample_data, id_var)
desired output:
[1] 0 1 0 1 0 1
The first 0 is because the first id and full_address pair (1 and abc) are not unique, and so on...
The function can be written in the tidyverse without using any loops or purrr. The task amounts to a grouped frequency count after filtering for the ids passed into the function. We group by id and the column of interest (passed with the curly-curly operator, {{ }}), then create a 0/1 column by checking whether the number of rows in the group (n()) equals 1. If we pass an id_var that is not in the dataset, the pipeline would usually return integer(0), which the if/else condition at the end changes to 0.
library(dplyr)
is_this_unique <- function(data, id_var, colNm) {
out <- data %>%
filter(id %in% id_var) %>%
group_by(id, {{colNm}}) %>%
transmute(n = +(n() == 1)) %>%
pull(n)
if(length(out) > 0) out else 0
}
is_this_unique(sample_data, 1:2, full_address)
#[1] 0 1 0 0 1 0
is_this_unique(sample_data, 1, full_address)
#[1] 0 1 0
is_this_unique(sample_data, 0, full_address)
#[1] 0
IMO purrr isn't suitable here; you can try this function instead.
library(dplyr)
is_this_unique <- function(data, id_var) {
temp_data <- data %>% filter(id %in% id_var)
if (nrow(temp_data) > 0)
temp_data %>%
add_count(id, full_address) %>%
mutate(n = +(n == 1)) %>%
pull(n)
else return(0)
}
is_this_unique(sample_data, 1:2)
#[1] 0 1 0 0 1 0
is_this_unique(sample_data, 1)
#[1] 0 1 0
is_this_unique(sample_data, 0)
#[1] 0
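For comparison, the same count-the-pair logic can be sketched in base R. This is only an illustration under the assumption that the columns are named id and full_address as in the sample data; is_pair_unique and key are hypothetical names, and sample_data is rebuilt as a plain data.frame so the snippet is self-contained.

```r
sample_data <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  full_address = c("abc", "bcd", "abc", "qaa", "xcv", "qaa")
)

is_pair_unique <- function(data, id_var) {
  # Keep only the rows whose id was requested
  sub <- data[data$id %in% id_var, , drop = FALSE]
  if (nrow(sub) == 0) return(0)
  # One key per id/address pair, then look up how often each key occurs
  key <- paste(sub$id, sub$full_address, sep = "|")
  as.integer(table(key)[key] == 1)
}

is_pair_unique(sample_data, 1:2)
is_pair_unique(sample_data, 0)
```

The "|" separator assumes that character does not appear in the addresses; otherwise two different pairs could collide on the same key.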
I am trying to filter out specific rows that meet a condition on two columns; rows that do not meet it should have the column EP flagged as 1. What is the code for this?
For example, in the data frame df_NC, when the column Population_Type (binary) equals 1 and the column NC (binary) equals 0, remove the row; otherwise flag EP as 1.
df_ep <- df_NC %>% mutate(EP= case_when(
df_NC$Population_Type == 1 & df_NC$NC == 0 ~ 1,
TRUE ~ 0
))
From your code I'm assuming you are using the dplyr package. A couple of mistakes there.
You don't need base notation such as df_NC$NC inside dplyr functions; just use the name of the variable.
I don't see a reason to create the column EP if you then filter on one of its values (0/FALSE).
df_NC %>%
mutate(EC = if_else(Population_Type == 1 & NC == 0, 1, 0)) %>%
filter(EC == 1)
# Or shorter, considering my second point
df_NC %>%
filter(Population_Type == 1, NC == 0) # Equivalent to EC == 1
Also, try to use boolean (TRUE/FALSE) instead of integer 1/0 to work with "binary" data type.
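As a rough base R sketch of the logical-vector idea: the condition itself is already a TRUE/FALSE vector, so it can be stored once and used either to keep or to drop the matching rows. The df_NC values below are invented for illustration, and whether you want the matching rows or the remaining ones depends on how the question is read.

```r
# Hypothetical data; only the column names come from the question
df_NC <- data.frame(Population_Type = c(1, 1, 0, 0), NC = c(0, 1, 0, 1))

# Logical flag: TRUE where both conditions hold
cond <- df_NC$Population_Type == 1 & df_NC$NC == 0

# Rows that meet the condition (what filter(Population_Type == 1, NC == 0) keeps)
matching <- df_NC[cond, , drop = FALSE]

# Rows that do not meet it (use this if the matching rows should be removed)
remaining <- df_NC[!cond, , drop = FALSE]

nrow(matching)
nrow(remaining)
```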
df <- data.frame(Name=c('black','white','green','red','brown', 'blue'),
Num=c(1,1,1,0,1,0))
How many times does 1 change to 0 in the column Num? How can I count this in R?
One way is to use head, tail and count instances where the previous value was 1 and current value is 0.
sum(head(df$Num, -1) == 1 & tail(df$Num, -1) == 0)
#[1] 2
Using the same logic with dplyr lead/lag we can do
library(dplyr)
df %>% filter(Num == 0 & lag(Num) == 1) %>% nrow()
df %>% filter(Num == 1 & lead(Num) == 0) %>% nrow()
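Another way to express the same previous/current comparison, assuming Num only ever contains 0 and 1, is to look at the differences between neighbours: a step from 1 to 0 is a difference of -1. A minimal sketch on the same values:

```r
Num <- c(1, 1, 1, 0, 1, 0)

# diff() gives current minus previous; -1 marks a 1 -> 0 step (binary data assumed)
steps <- diff(Num)
n_changes <- sum(steps == -1)
n_changes
```

With values other than 0 and 1 this shortcut would miscount, so the explicit head/tail comparison is the more general form.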
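We can just use rle from base R. This counts the runs of 1s, which equals the number of 1-to-0 changes only when Num is binary and ends in 0, as it does here: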
sum(rle(df$Num)$values)
#[1] 2
Or with rleid from data.table
library(data.table)
nrow(setDT(df)[, .N[any(Num > 0)] , rleid(Num)])
#[1] 2
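One caveat with the run-based counts: they count runs of 1s rather than 1-to-0 steps, so the two can differ when the vector ends in 1. A small sketch with a made-up vector x (not from the question) that shows the difference and one possible way to count steps on the run values instead:

```r
# Hypothetical vector that ends in 1
x <- c(1, 0, 1)

# Counts runs of 1s: there are two of them
runs_of_ones <- sum(rle(x)$values)

# Counts actual 1 -> 0 steps: there is only one
transitions <- sum(head(x, -1) == 1 & tail(x, -1) == 0)

# Same step count computed on the (shorter) vector of run values
r <- rle(x)$values
transitions_rle <- sum(head(r, -1) == 1 & tail(r, -1) == 0)

c(runs_of_ones, transitions, transitions_rle)
```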
Working with grouped data, I want to change the last entry in one column to match the corresponding value for that group in another column. So for my data below, for each 'nest' (group), the last 'Status' entry will equal the 'fate' for that nest.
Data like this:
nest Status fate
1 1 2
1 1 2
2 1 3
2 1 3
2 1 3
Desired result:
nest Status fate
1 1 2
1 2 2
2 1 3
2 1 3
2 3 3
It should be so simple. I tried the following, taken from "dplyr and tail to change last value in a group_by in r"; it works properly for some groups, but in others it substitutes the wrong 'fate' value:
library(data.table)
indx <- setDT(df)[, .I[.N], by = .(nest)]$V1
df[indx, Status := df$fate]
I get various errors trying this approach, taken from "dplyr mutate/replace on a subset of rows":
mutate_last <- function(.data, ...) {
n <- n_groups(.data)
indices <- attr(.data, "indices")[[n]] + 1
.data[indices, ] <- .data[indices, ] %>% mutate(...)
.data
}
df <- df %>%
group_by(nest) %>%
mutate_last(df, Status == fate)
I must be missing something simple from the resources mentioned above?
Something like this:
library(tidyverse)
df <- data.frame(nest = c(1,1,2,2,2),
status = rep(1, 5),
fate = c(2,2,3,3,3))
df %>%
group_by(nest) %>%
mutate(status = c(status[-n()], tail(fate,1)))
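If you prefer to avoid packages, the "last row of each group" can also be located in base R with duplicated(..., fromLast = TRUE). This is just a sketch on the example data, using the Status capitalisation from the question; is_last is an illustrative name, and it relies on the rows already being in the order you care about within each nest.

```r
df <- data.frame(nest = c(1, 1, 2, 2, 2),
                 Status = rep(1, 5),
                 fate = c(2, 2, 3, 3, 3))

# Scanning from the end, the first time a nest is seen is its last row
is_last <- !duplicated(df$nest, fromLast = TRUE)

# Overwrite Status only on those rows, using fate from the same rows
df$Status[is_last] <- df$fate[is_last]
df
```

Indexing both sides with is_last keeps each replacement aligned with its own row, rather than pulling values from the whole fate column.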
Not sure if this is definitely the best way to do it but here's a very simple solution:
library(dplyr)
dat <- data.frame(nest = c(1,1,2,2,2),
Status = c(1,1,1,1,1),
fate = c(2,2,3,3,3))
dat %>%
arrange(nest, Status, fate) %>% #enforce order
group_by(nest) %>%
mutate(Status = ifelse(is.na(lead(nest)), fate, Status))
E: Made a quick change.