I have a data frame with the following structure:
pos <- c(67, 125, 158, 195, 235, 458, 499, 526, 785, 912, 999, 1525)
v_1 <- c("j", "c", "v", "r", "s", "q", "r", "r", "s", "t", "u", "v")
v_2 <- c("c", "t", "v", "r", "s", "q", "r", "w", "c", "c", "o", "v")
v_3 <- c("z", "c", "v", "r", "s", "q", "r", "w", "c", "b", "p", "v")
v_4 <- c("x", "w", "z", "z", "s", "q", "r", "w", "c", "o", "t", "v")
data <- as.data.frame(cbind(pos, v_1, v_2, v_3, v_4))
In this data frame, the same letters can appear across the different columns in consecutive rows. I need to obtain a separate data frame with the values of the variable "pos" for consecutive rows with shared letters, as can be seen in the figure:
In the figure, even though all the columns have the same letter at pos 1525, that row isn't included, since it isn't consecutive with another row of repeated letters.
Solution using tidyr and dplyr:
After pivoting to long, use dplyr::add_count() to find repeated values within each pos;
Within each v, find consecutive rows with repeated values, defined as: >1 repeat and >1 repeat in either preceding or following row;
Create a column containing pos for consecutive rows and NA otherwise;
Take the minimum and maximum to get start and end for each v.
library(tidyr)
library(dplyr)
data %>%
  pivot_longer(!pos, names_to = "v") %>%
  add_count(pos, value) %>%
  group_by(v) %>%
  mutate(consec = ifelse(
    n > 1 & (lag(n) > 1 | lead(n) > 1),
    pos,
    NA
  )) %>%
  summarize(
    start = min(consec, na.rm = TRUE),
    end = max(consec, na.rm = TRUE)
  )
# A tibble: 4 × 3
  v     start end
  <chr> <chr> <chr>
1 v_1   125   499
2 v_2   158   785
3 v_3   125   785
4 v_4   235   785
Note: I'm not sure how you want to handle more than one set of consecutive rows, so the solution above doesn't address that; one option is sketched below.
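If separate runs per column should each get their own start/end, one option (a sketch, assuming dplyr >= 1.0.0) is to keep the same logic but number each run with cumsum() over the breaks between runs, then summarize per run:
data %>%
  pivot_longer(!pos, names_to = "v") %>%
  add_count(pos, value) %>%
  group_by(v) %>%
  mutate(
    consec = n > 1 & (lag(n, default = 0L) > 1 | lead(n, default = 0L) > 1),
    run = cumsum(!consec)  # constant within each stretch of consecutive TRUEs
  ) %>%
  filter(consec) %>%
  group_by(v, run) %>%
  # pos is character here (cbind() coerced it), so convert before min/max
  summarize(start = min(as.numeric(pos)), end = max(as.numeric(pos)), .groups = "drop")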
Related
I have data where an id variable should identify a unique observation. However, some ids are repeated. I want to get an idea of which measurements are driving this repetition by grouping by id and then calculating the proportion of inconsistent responses for each variable.
Below is an example of what I mean:
require(tidyverse)
df <- tibble(
  id   = c(1, 1, 2, 3, 4, 4, 4),
  col1 = c('a', 'a', 'b', 'b', 'c', 'c', 'c'), # perfectly consistent
  col2 = c('a', 'b', 'b', 'b', 'c', 'c', 'c'), # id 1 is inconsistent - proportion inconsistent = 0.25
  col3 = c('a', 'a', 'b', 'b', 'a', 'b', 'c'), # id 4 is inconsistent - proportion inconsistent = 0.25
  col4 = c('a', 'b', 'b', 'b', 'b', 'b', 'c')  # ids 1 and 4 are inconsistent - proportion inconsistent = 0.5
)
I can test for inconsistent responses within ids by using group_by(), across(), and n_distinct() as per the below:
# count the number of distinct responses for each id in each column
# if the value is equal to 1, it means that all responses were consistent
df <- df %>%
  group_by(id) %>%
  mutate(across(.cols = c(col1:col4), ~ n_distinct(.), .names = '{.col}_distinct')) %>%
  ungroup()
For simplicity I can now take one row for each id:
# take one row for each test (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))
Now I would like to calculate the proportion of ids that contained an inconsistent response for each variable. I would like to do something like the following:
consistency <- df %>%
  summarise(across(contains('distinct'), ~ sum(. > 1) / n(.)))
But this gives the following error, which I am having trouble interpreting:
Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.
I can get the answer I want by doing the following:
# calculate consistency for each column by finding the number of distinct values greater
# than 1 and dividing by total rows
# first get the number of distinct values
n_inconsistent <- df %>%
  summarise(across(.cols = contains('distinct'), ~ sum(. > 1)))
# next get the number of rows
n_total <- nrow(df)
# calculate the proportion of tests that have more than one value for each column
consistency <- n_inconsistent %>%
  mutate(across(contains('distinct'), ~ . / n_total))
But this involves intermediate variables and feels inelegant.
You can do it in the following way:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(across(starts_with('col'), n_distinct)) %>%
  summarise(across(starts_with('col'), ~ mean(. > 1), .names = '{.col}_distinct'))
#  col1_distinct col2_distinct col3_distinct col4_distinct
#          <dbl>         <dbl>         <dbl>         <dbl>
#1             0          0.25          0.25           0.5
First we count the number of unique values per id in each column, then we calculate the proportion of ids that have more than one value in each column.
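As an aside, the error in the question comes from n(.): n() takes no arguments (hence "unused argument (.)"), so once the dot is removed the original one-liner also works on the deduplicated df (a minimal sketch):
consistency <- df %>%
  summarise(across(contains('distinct'), ~ sum(. > 1) / n()))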
Regular expression to find words and digits that are repeated back to back
Suppose I have a data frame
df <- data.frame(name = c("mike", "mike", "mike", "bob", "mike"),
                 age = c(23, 23, 23, 25, 23))
How can I write a regular expression to check whether any word in the "name" column is repeated back to back (here "mike" is repeated 3 times) and whether any digit in the "age" column is repeated back to back (here 23 is repeated 3 times)?
You can try this:
library(dplyr)
df %>%
  mutate(across(everything(), data.table::rleid, .names = '{.col}_grp')) %>%
  group_by(across(ends_with('grp'))) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(names(df))
#  name    age
#  <chr> <dbl>
#1 mike     23
#2 mike     23
#3 mike     23
For every column in df we use rleid to give a unique number to consecutive values and select those groups that have >= 3 rows in them.
Base R one liner (no regex):
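# as.factor/as.integer turn name into numeric codes; the cumulative count of
# changes is 0 only until the first change, so this keeps the leading run of
# identical names: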
df[which(c(0, cumsum(abs(diff(as.integer(as.factor(df$name)))))) == 0),]
I am new to R and having difficulty understanding why I get a difference in values between the two pieces of code. Why does the code below return different results when I move !is.na(arr_time) from mutate to filter? My data is coming from the nycflights13 package.
A <- flights %>%
  filter(!is.na(tailnum)) %>%
  mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
  group_by(tailnum) %>%
  summarise(on_time = mean(on_time), n = n()) %>%
  filter(min_rank(on_time) == 1)
B <- flights %>%
  filter(!is.na(tailnum), !is.na(arr_time)) %>%
  mutate(on_time = arr_delay <= 0) %>%
  group_by(tailnum) %>%
  summarise(on_time = mean(on_time), n = n()) %>%
  filter(min_rank(on_time) == 1)
Tibble A returns 110 observations while Tibble B returns 104. When I take the 6 tailnums that differ between A and B and look them up in the flights data frame, all 6 have flights where arr_time is NA. Shouldn't those be excluded from Tibble A by the condition in mutate? What am I missing?
Thanks!
Regarding Tibble A:
mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) says: create a new column called on_time that is TRUE only when arr_delay is less than or equal to zero AND arr_time is not NA. Whether arr_time is NA only determines whether the stored value is TRUE or FALSE; no rows are filtered out because of it.
Regarding Tibble B:
filter(!is.na(tailnum), !is.na(arr_time)) says: keep only rows where tailnum is not NA and arr_time is not NA; any row where either is NA is dropped.
Let's consider a much simpler version of this same concept:
x <- c(1, 2, NA, 3, 4)

# "filter()" analogue: subsetting on a condition drops the NA
x[!is.na(x)]   # 1 2 3 4

# "mutate()" analogue: nothing is dropped; the NA just determines
# whether each element of the boolean result is TRUE or FALSE
is.na(x)       # FALSE FALSE TRUE FALSE FALSE
The dplyr filter function removes rows from a data frame. From the help of this function:
Use filter() to choose rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
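A quick illustration of that last sentence (a hypothetical x for demonstration):
x <- c(1, NA, 3)
x[x > 1]                                 # base `[` keeps the NA:  NA 3
dplyr::filter(data.frame(x = x), x > 1)  # filter() drops it: only the row with 3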
So rows that evaluate to NA are dropped. How many rows?
> sum(is.na(flights$arr_time))
[1] 8713
How many rows are you left with after filtering:
> sum(!is.na(flights$arr_time))
[1] 328063
If I run the first two lines of each of the two code blocks and check how many rows are left:
A <- flights %>%
  filter(!is.na(tailnum))
> nrow(A)
[1] 334264
and
B <- flights %>%
  filter(!is.na(tailnum), !is.na(arr_time))
> nrow(B)
[1] 328063
So by adding the !is.na(arr_time) clause in the filter function of B you are dropping the respective rows. Mutate does not drop rows; it changes or adds variables.
Does this help?
mutate doesn't exclude anything: for your condition you simply get TRUE or FALSE, so mutate generates a new column with a value for every existing row. filter, on the other hand, can reduce the number of rows depending on your condition.
I have a very large data frame with fish species captured as one of the columns. Here is a very shortened example:
ID = seq(1, 50, 1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = as.data.frame(cbind(ID, common))
I want to remove any species that make up less than a certain percentage of the data. For the example here say I want to remove all species that make up less than 30% of the data:
library(dplyr)
nrow(filter(dat, common == "bass")) #22 rows -> 22/50 -> 44%
nrow(filter(dat, common == "jack")) #12 rows -> 12/50 -> 24%
nrow(filter(dat, common == "snapper")) #16 rows -> 16/50 -> 32%
Here, jacks make up less than 30% of the rows, so I want to remove all the rows with jacks (i.e., all species with fewer than 15 rows). This is easy to do here, but in reality I have over 700 fish species in my data frame, and I want to throw out all species that make up less than 1% of the data (in my case, fewer than 18,003 rows). Is there a streamlined way to do this without having to filter out each species individually?
I imagine perhaps some kind of loop that says if the number of rows for common name = "x" is less than 18003, remove those rows...
You may also do it in one pipe:
library(dplyr)
dat %>%
  mutate(percentage = n()) %>%
  group_by(common) %>%
  mutate(percentage = n() / percentage) %>%
  filter(percentage >= 0.3) %>%
  select(-percentage)
One way to approach this is to first create a summary table, then filter based on the summary stat. There are probably more direct ways to accomplish the same thing.
library(dplyr)
set.seed(914) # so you get the same results from sample()

ID = seq(1, 50, 1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = as.data.frame(cbind(ID, common)) # same structure as yours, but I ended up with a different species mix

summ.table <- dat %>%
  group_by(common) %>%
  summarize(number = n()) %>%
  mutate(pct = number / sum(number))

summ.table
# # A tibble: 3 x 3
#   common  number   pct
#   <fct>    <int> <dbl>
# 1 bass        18  0.36
# 2 jack        18  0.36
# 3 snapper     14  0.28
include <- summ.table$common[summ.table$pct >= .3]
dat.selected = filter(dat, common %in% include)
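For completeness, a more direct variant of the same idea (a sketch using dplyr::add_count(); the 30% cutoff is hard-coded):
dat %>%
  add_count(common) %>%              # n = number of rows per species
  filter(n >= 0.3 * nrow(dat)) %>%   # keep species at or above the cutoff
  select(-n)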
I have a data frame, for example:
letter class value
A      0      55
B      1      23
C      1      12
D      1      9
E      2      68
F      2      78
G      2      187
I want to randomly re-sample the values within each class, so that each letter is associated with a new random value (from the same class).
Desired example output:
letter class value
A      0      55
B      1      12
C      1      9
D      1      23
E      2      187
F      2      78
G      2      68
I tried something with dplyr like:
tab %>% group_by(class) %>% sample_n(size = 3)
But this samples 3 rows per group, and I don't have the same number of rows in every group.
The only solution I have found so far is to create a separate data frame for each class and shuffle each one independently. But since I have a large number of classes, that would be slow and messy.
We can use sample on the sequence of rows (row_number()) and rearrange the 'value' based on the sampled index:
df1 %>%
  group_by(class) %>%
  mutate(value = value[sample(row_number())])
Or, as @RonakShah mentioned in the comments: if a group has only a single row, sample() on that one value would sample from 1:value instead (the usual behavior of sample() given a single number). So if we use sample directly on 'value', an if/else guard is needed:
df1 %>%
  group_by(class) %>%
  mutate(value = if (n() == 1) value else sample(value, n()))
If we want to use sample_n, it can be done within do:
df1 %>%
  group_by(class) %>%
  do(sample_n(., size = nrow(.)))
NOTE: We need to specify nrow(.) instead of n() because n() only works inside certain tidyverse verbs such as mutate()/summarise()/filter()/arrange(); it is not implemented to work inside sample_n().
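As an aside (assuming dplyr >= 1.0.0): slice_sample() is grouping-aware, so the do() wrapper can be dropped for the same whole-row shuffle (a sketch):
df1 %>%
  group_by(class) %>%
  slice_sample(prop = 1) %>%   # permutes the rows within each class
  ungroup()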