How to count the number of changes in a column in R

df <- data.frame(Name = c('black','white','green','red','brown','blue'),
                 Num = c(1,1,1,0,1,0))
How many times did 1 change to 0 in the column Num? How can I count this in R?

One way is to use head and tail and count the instances where the previous value was 1 and the current value is 0.
sum(head(df$Num, -1) == 1 & tail(df$Num, -1) == 0)
#[1] 2
Using the same logic with dplyr lead/lag we can do
library(dplyr)
df %>% filter(Num == 0 & lag(Num) == 1) %>% nrow()
df %>% filter(Num == 1 & lead(Num) == 0) %>% nrow()
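A base R alternative not shown in the original answers uses diff: every 1-to-0 change appears as a step of -1 in the first differences, so we can count those directly.
# each 1 -> 0 transition shows up as -1 in the first differences
sum(diff(df$Num) == -1)
#[1] 2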

We can just use rle from base R. For this data its values component is c(1, 0, 1, 0), so summing it counts the runs of 1s, which equals the number of 1-to-0 changes here because the vector ends in 0.
sum(rle(df$Num)$values)
#[1] 2
Or with rleid from data.table
library(data.table)
nrow(setDT(df)[, .N[any(Num > 0)], rleid(Num)])
#[1] 2
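Note that both run-length approaches count the runs of 1s, which matches the number of 1-to-0 changes only when the vector does not end in 1 (e.g. for c(1, 0, 1) they would give 2 instead of 1). A sketch, not from the original answers, of a run-based variant that counts the transitions explicitly:
# count runs of 1 that are immediately followed by a run of 0
with(rle(df$Num), sum(head(values, -1) == 1 & tail(values, -1) == 0))
#[1] 2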

Related

How to select rows in an R data.frame?

How can I select the rows that have the value 1 at least once across all 4 columns, or have only 0s in all columns?
We can use filter with if_any/if_all
library(dplyr)
df1 %>%
  filter(if_any(everything(), ~ . == 1) | if_all(everything(), ~ . == 0))
Or with base R
df1[(rowSums(df1 == 1) > 0) | (rowSums(df1 == 0) == ncol(df1)),]
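The question does not include df1, so here is a hypothetical 4-column example (column names assumed) to make the snippets reproducible. Rows 1 and 2 satisfy the condition (a 1 somewhere / all zeros), while row 3 has no 1 and is not all zeros, so it is dropped:
df1 <- data.frame(V1 = c(1, 0, 2), V2 = c(0, 0, 2),
                  V3 = c(0, 0, 3), V4 = c(0, 0, 2))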

Pipe/dplyr friendly way to filter non zero elements of a vector

Surprisingly I am not finding an answer to this easy question. I need a pipe-friendly way to count the number of non-zero elements in a vector.
Without piping:
v <- c(1.1,2.2,0,0)
length(which(v != 0))
When I try to do this with pipes I get an error
v %>% which(. != 0) %>% length
Error in which(., . != 0) : argument to 'which' is not logical
A dplyr solution would also help
Here are some different options.
First, a note on the error: magrittr inserts the left-hand side as the first argument even when . appears nested inside another expression (here . != 0), so the call becomes which(v, v != 0). Wrapping the call in {} suppresses that insertion, which rescues your original form:
v %>% {which(. != 0)} %>% length
#[1] 2
Or we could use {} to allow us to repeat .:
v %>% {.[. != 0]} %>% length
#[1] 2
Or we could use subset from Base R:
v %>% subset(. != 0) %>% length
#[1] 2
One way using magrittr's alias functions could be:
library(magrittr)
v %>%
  equals(0) %>%
  not() %>%
  sum()
#[1] 2
We can also convert to a tibble and filter:
library(dplyr)
tibble(v) %>%
  filter(v != 0) %>%
  nrow
#[1] 2
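For completeness (not one of the original answers): v != 0 is a logical vector, so summing it counts the non-zero elements directly. With the pipe this again needs {}, since otherwise v would be inserted as the first argument of sum:
v %>% {sum(. != 0)}
#[1] 2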

Filtering by values in vectors using purrr

I want to write a function that accepts two arguments: a data.frame and a vector (here called id_var).
It should filter the data.frame by a value that is in id_var (e.g. the first value in the vector) and assign the resulting data.frame to a variable called data_filt_by_var.
If the number of rows in data_filt_by_var is at least one, it takes the same initial data.frame, filters it by the same id_var value, selects the distinct end values (end is the name of a column present in the data.frame), and gets the number of rows. If that number of rows is >= 1, it returns 1, else 0.
The problem is that this has to be done for each value in id_var, and I cannot make the iteration work without using loops, which are not desirable.
I wrote the following function, but it's not working.
is_this_unique = function(data, id_var) {
  data_filt_by_var = nrow(data[data$id == id_var, ])
  if (data_filt_by_var >= 1) {
    if (nrow(data[data$id == id_var, ] %>%
             distinct(full_address)) == 1) {
      return(1)
    }
  } else {
    return(0)
  }
}
sample_data = tibble::tribble(
  ~id, ~full_address,
  1, 'abc',
  1, 'bcd',
  1, 'abc',
  2, 'qaa',
  2, 'xcv',
  2, 'qaa'
)
id_var = c(1, 2)
I was hoping to use map_dbl in this function.
The expected output would be:
input:
> is_this_unique(sample_data, id_var)
desired output:
[1] 0 1 0 1 0 1
The first 0 is because the first id and full_address pair (1 and abc) are not unique, and so on...
The function can be written in tidyverse without using any loops or purrr. This amounts to grouping and counting the frequency after filtering for the 'id's passed into the function: we group by 'id' and the column that is needed (passed inside curly-curly, {{}}), then create a 0/1 column by checking whether the number of rows (n()) equals 1. If we pass an 'id_var' that is not in the dataset, this would return integer(0), which is changed to 0 with an if/else condition at the end.
library(dplyr)
is_this_unique <- function(data, id_var, colNm) {
  out <- data %>%
    filter(id %in% id_var) %>%
    group_by(id, {{colNm}}) %>%
    transmute(n = +(n() == 1)) %>%
    pull(n)
  if (length(out) > 0) out else 0
}
is_this_unique(sample_data, 1:2, full_address)
#[1] 0 1 0 0 1 0
is_this_unique(sample_data, 1, full_address)
#[1] 0 1 0
is_this_unique(sample_data, 0, full_address)
#[1] 0
IMO using purrr here isn't suitable; you can try this function:
library(dplyr)
is_this_unique <- function(data, id_var) {
  temp_data <- data %>% filter(id %in% id_var)
  if (nrow(temp_data) > 0)
    temp_data %>%
      add_count(id, full_address) %>%
      mutate(n = +(n == 1)) %>%
      pull(n)
  else
    return(0)
}
is_this_unique(sample_data, 1:2)
#[1] 0 1 0 0 1 0
is_this_unique(sample_data, 1)
#[1] 0 1 0
is_this_unique(sample_data, 0)
#[1] 0
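Since the question explicitly asked for purrr, here is a map-based sketch for comparison (my addition, not one of the original answers; map_dbl itself does not fit, because each id yields a whole vector of flags rather than a single number, so map plus unlist is used instead):
library(purrr)
library(dplyr)
is_this_unique_purrr <- function(data, id_var) {
  res <- map(id_var, function(i) {
    d <- filter(data, id == i)
    if (nrow(d) == 0) return(0)
    # flag rows whose full_address occurs exactly once within this id
    as.numeric(table(d$full_address)[d$full_address] == 1)
  })
  unlist(res)
}
is_this_unique_purrr(sample_data, c(1, 2))
#[1] 0 1 0 0 1 0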

Filter by subgroup criteria (specify the occurrence of a value per group) using dplyr

I would like to filter a dataset and keep all groups that have exactly n rows (in my case 1 row) with a specific item.
library(tibble)
df <- tibble(group = c("a","a","a","b","b","b"),
             item = c(1,2,2,1,1,3))
I know how to filter all groups that have at least one row with item == 1 using any:
df %>%
  group_by(group) %>%
  filter(any(item == 1))
However, I do not know if it is possible to specify the occurrence per group.
I thought about something like this:
filter(n(item==1)==1)
filter(any(item==1,1))
We could group_by group, calculate the number of occurrences of item == 1 in each group, and filter where there are >= n occurrences.
library(dplyr)
n <- 1
df %>%
  group_by(group) %>%
  filter(sum(item == 1) >= n)
Or using the same logic with base R ave
df[with(df, ave(item == 1, group, FUN = sum) >= n), ]
and for completion one with data.table
library(data.table)
setDT(df)[, if(sum(item == 1) >= n) .SD, by = group]
We can use data.table by directly subsetting
library(data.table)
n <- 1
setDT(df)[, .SD[sum(item == 1) >= n], by = group]
Or using length
library(dplyr)
df %>%
  group_by(group) %>%
  filter(length(item[item == 1]) >= n)
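The question actually asks for groups with exactly n matching rows; replacing >= with == gives that stricter version. With the df above and n = 1, only group "a" is kept, because group "b" has two rows with item == 1:
df %>%
  group_by(group) %>%
  filter(sum(item == 1) == n)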

R get rows based on multiple conditions - use dplyr and reshape2

df <- data.frame(
  exp = c(1, 1, 2, 2),
  name = c("gene1", "gene2", "gene1", "gene2"),
  value = c(1, 1, 3, -1)
)
In trying to get accustomed to dplyr and reshape2, I stumbled over what should be a "simple" task: selecting rows based on several conditions. Say I want those genes (the name variable) that have value above 0 in experiment 1 (exp == 1) AND at the same time value below 0 in experiment 2; in df this would be "gene2". Surely there must be many ways to do this, e.g. subset df for each set of conditions (exp == 1 & value > 0, and exp == 2 & value < 0) and then join the results of these subsets:
library(dplyr)
inner_join(filter(df, exp == 1 & value > 0), filter(df, exp == 2 & value < 0), by = c("name" = "name"))[[1]]
Although this works, it looks very awkward, and I feel that such conditional filtering lies at the heart of reshape2 and dplyr, but I cannot figure out how to do it. Can someone enlighten me here?
One alternative that comes to mind is to transform the data to a "wide" format and then do the filtering.
Here's an example using "data.table" (for the convenience of compound-statements):
library(data.table)
dcast.data.table(as.data.table(df), name ~ exp)[`1` > 0 & `2` < 0]
# name 1 2
# 1: gene2 1 -1
Similarly, with "dplyr" and "tidyr":
library(dplyr)
library(tidyr)
df %>%
  spread(exp, value) %>%
  filter(`1` > 0 & `2` < 0)
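As a side note (my addition, not part of the original answer): spread has since been superseded by tidyr's pivot_wider, and the equivalent form is:
df %>%
  pivot_wider(names_from = exp, values_from = value) %>%
  filter(`1` > 0 & `2` < 0)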
Another dplyr option is:
group_by(df, name) %>% filter(value[exp == 1] > 0 & value[exp == 2] < 0)
#Source: local data frame [2 x 3]
#Groups: name
#
# exp name value
#1 1 gene2 1
#2 2 gene2 -1
Probably this is even more convoluted than your own solution, but I think it has a "dplyr" feel:
df %>%
  filter((exp == 1 & value > 0) | (exp == 2 & value < 0)) %>%
  group_by(name) %>%
  filter(length(unique(exp)) == 2) %>%
  select(name) %>%
  unique()
#Source: local data frame [1 x 1]
#Groups: name
# name
#1 gene2
filter allows multiple conditions separated by commas, same as select. Each extra condition is an AND:
group_by(df, name) %>% filter(value[exp == 1] > 0, value[exp == 2] < 0)
From the official documentation: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
The examples shown there (using the flights data from the nycflights13 package) are:
flights[flights$month == 1 & flights$day == 1, ] in base R
filter(flights, month == 1, day == 1) in dplyr.
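Applied to this question's df, the comma-separated form keeps exactly the experiment-1 rows with positive values:
filter(df, exp == 1, value > 0)
#  exp  name value
#1   1 gene1     1
#2   1 gene2     1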

Resources