Count and Assign Consecutive Occurrences of Variable

I want to count the consecutive occurrences of each value and assign that count to every row of the run, in a new column. Below is an example of the input and the desired output:
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"))
dataset$count <- c(1,2,2,2,2,1,4,4,4,4,1,1)
dataset
input count
a 1
b 2
b 2
a 2
a 2
c 1
a 4
a 4
a 4
a 4
b 1
c 1
With rle(dataset$input) I can only get the length of each run, but I want the output in the format above.
My question is similar to:
R: count consecutive occurrences of values in a single column
But there the output is a running sequence within each run, whereas I want to assign the run's total count to each of its values.

You can repeat each element of the lengths component returned by rle as many times as its own value:
with(rle(dataset$input), rep(lengths, lengths))
#[1] 1 2 2 2 2 1 4 4 4 4 1 1
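To see why this works, it can help to look at what rle returns before the lengths are repeated. The following is a small base R sketch on the same input values; the names x, r and count are only for illustration.

```r
# Same values as dataset$input in the question
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

r <- rle(x)
r$values   # the value of each run:  "a" "b" "a" "c" "a" "b" "c"
r$lengths  # the length of each run:  1   2   2   1   4   1   1

# Repeat each run length once for every element of that run
count <- rep(r$lengths, r$lengths)
count
# [1] 1 2 2 2 2 1 4 4 4 4 1 1
```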
Using dplyr, we can use lag to create groups and then count the number of rows in each group.
library(dplyr)
dataset %>%
group_by(gr = cumsum(input != lag(input, default = first(input)))) %>%
mutate(count = n())
and with data.table
library(data.table)
setDT(dataset)[, count:= .N, rleid(input)]
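Both the cumsum(input != lag(...)) expression and rleid build a run identifier that increases by one each time the value changes; the count is then the size of each run. A rough base R equivalent, shown only to illustrate the idea (changed and run_id are illustrative names, and the identifier here starts at 0 whereas rleid starts at 1):

```r
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# TRUE wherever an element differs from the one before it;
# the first element has no predecessor, so it counts as "no change"
changed <- c(FALSE, x[-1] != x[-length(x)])

# The cumulative sum of the change flags labels each run
run_id <- cumsum(changed)
run_id
# [1] 0 1 1 2 2 3 4 4 4 4 5 6

# Size of the run that each element belongs to
count <- ave(seq_along(x), run_id, FUN = length)
count
# [1] 1 2 2 2 2 1 4 4 4 4 1 1
```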
data
Make sure the input column is character and not factor.
dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"),
stringsAsFactors = FALSE)
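The reason for this note is that rle only accepts plain atomic vectors, so a factor column makes it fail. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so this mainly matters on older versions or when the column was explicitly created as a factor. A small sketch of the failure and of the as.character workaround (f, res and counts are illustrative names):

```r
f <- factor(c("a","b","b","a"))

# rle() rejects factors because they are not plain atomic vectors
res <- try(rle(f), silent = TRUE)
inherits(res, "try-error")
# [1] TRUE

# Converting to character first avoids the problem
counts <- with(rle(as.character(f)), rep(lengths, lengths))
counts
# [1] 1 2 2 1
```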

We can use rleid (from data.table) with dplyr
library(dplyr)
library(data.table) # provides rleid()
dataset %>%
group_by(grp = rleid(input)) %>%
mutate(count = n())

Related

r - arrange values in column based on unique values in another column within a group

I'm trying to reorder a column in a data frame, in descending or ascending order, based on the unique value of another column in the same data frame, within groups.
To demonstrate, the example below has a data frame with three columns. The goal is to group by the gr column and to order the a column based on the unique value of the b column within each group. For example, if within gr = 1 the unique value of b is TRUE, column a should be in ascending order; otherwise it should be in descending order. The example is below:
library(dplyr)
# sample dataset
df <- data.frame(
  a = c(1, 3, 2, 4),
  b = c(TRUE, TRUE, FALSE, FALSE),
  gr = c(1, 1, 2, 2)
)
# split dataset according to a grouping column
df <- df %>% split(df$gr)
# ordering function: ascending if the group's b is TRUE, descending otherwise
f1 <- function(dt) {
  if (isTRUE(unique(dt$b))) {
    arrange(dt, a)
  } else {
    arrange(dt, -a)
  }
}
The desired dataset should look like this:
# order within groups based on variable b
df %>% purrr::map_df(f1)
Can this be done without using lists or tidyr::nest? Ideally it should be possible with a simple dplyr::group_by and dplyr::arrange.

Here is one option with arrange alone without doing any split
library(dplyr)
df %>%
arrange(gr, c(1, -1)[gr] * a)
# a b gr
#1 1 TRUE 1
#2 3 TRUE 1
#3 4 FALSE 2
#4 2 FALSE 2
or if it needs to be with 'b'
df %>%
arrange(gr, c(-1, 1)[(b + 1)] * a)
# a b gr
#1 1 TRUE 1
#2 3 TRUE 1
#3 4 FALSE 2
#4 2 FALSE 2
The first version relies on 'gr' being numeric with values 1 and 2, because it is used directly as an index. If it is not, create the grouping index with match and use that to change the sign of 'a':
df %>%
arrange(gr, c(1, -1)[match(gr, unique(gr))] * a)
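The idea behind these variants is to multiply a by +1 or -1 depending on the group, so that a single ascending sort yields ascending order in some groups and descending order in others. The same trick can be sketched in base R with order(); the names sgn and res are only for illustration, and the data frame is the unsplit one from the question.

```r
df <- data.frame(
  a  = c(1, 3, 2, 4),
  b  = c(TRUE, TRUE, FALSE, FALSE),
  gr = c(1, 1, 2, 2)
)

# +1 where b is TRUE (ascending), -1 where b is FALSE (descending)
sgn <- ifelse(df$b, 1, -1)

# Sort by group first, then by the sign-adjusted value of a
res <- df[order(df$gr, sgn * df$a), ]
res$a
# [1] 1 3 4 2
```

Note that the multiplication relies on a being numeric.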

Here is a way.
library(dplyr)
# maps b to +1 (TRUE) or -1 (FALSE); dt defaults to the global df
f2 <- function(dt = df) {
2*as.integer(dt$b) - 1
}
df %>% arrange(gr, a*f2())
If it is acceptable for the gr column to be reordered as well, remove it from arrange.
df %>% arrange(a*f2())
Edit.
Simpler?
f2 <- function(x) 2*x - 1
df %>% arrange(gr, a*f2(b))

remove a group of inputs in a column based on a single input in a separate column in r

Here is a data frame:
df <- data.frame(letter = rep(c("a","b","c","d"), each = 4), number = c(2,1,5,3,9,4,2,4,3,11,1,2,1,1,5,6))
I know how to remove rows based on an observation:
rmv <- with(df, number > 8) # finds observations greater than 8
new.df<- df[!rmv, ] # removes observations
However, I want to remove all rows of a letter group (here, all the 'b' and 'c' rows) if the group contains any observation greater than 8. The ideal output would be:
letter number
1 a 2
2 a 1
3 a 5
4 a 3
13 d 1
14 d 1
15 d 5
16 d 6
How would I accomplish this?

We can use any with negation (!) after grouping by 'letter'
library(dplyr)
df %>%
group_by(letter) %>%
filter(!any(number > 8))
Or do the reverse with all
df %>%
group_by(letter) %>%
filter(all(number <= 8))
In base R, this can be done with ave
df[with(df, ave(number <= 8, letter, FUN = all)),]
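In the base R version, ave applies all() within each letter group and returns the result once per row, which gives a logical vector suitable for row indexing. A short sketch of the intermediate step (keep and new.df are illustrative names):

```r
df <- data.frame(letter = rep(c("a","b","c","d"), each = 4),
                 number = c(2,1,5,3,9,4,2,4,3,11,1,2,1,1,5,6))

# One value per row: TRUE if every number in that row's group is <= 8
keep <- with(df, ave(number <= 8, letter, FUN = all))

# Keep only the rows whose whole group passed the check
new.df <- df[keep, ]
as.character(unique(new.df$letter))
# [1] "a" "d"
```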

Replace last value in group with corresponding value in other column

Working with grouped data, I want to change the last entry in one column to match the corresponding value for that group in another column. So for my data below, for each 'nest' (group), the last 'Status' entry will equal the 'fate' for that nest.
Data like this:
nest Status fate
1 1 2
1 1 2
2 1 3
2 1 3
2 1 3
Desired result:
nest Status fate
1 1 2
1 2 2
2 1 3
2 1 3
2 3 3
It should be so simple. I tried the following, taken from "dplyr and tail to change last value in a group_by in r"; it works properly for some groups, but in others it substitutes the wrong 'fate' value:
library(data.table)
indx <- setDT(df)[, .I[.N], by = .(nest)]$V1
df[indx, Status := df$fate]
I get various errors when trying this approach, taken from "dplyr mutate/replace on a subset of rows":
mutate_last <- function(.data, ...) {
n <- n_groups(.data)
indices <- attr(.data, "indices")[[n]] + 1
.data[indices, ] <- .data[indices, ] %>% mutate(...)
.data
}
df <- df %>%
group_by(nest) %>%
mutate_last(df, Status == fate)
I must be missing something simple from the resources mentioned above?

Something like this should work:
library(tidyverse)
df <- data.frame(nest = c(1,1,2,2,2),
Status = rep(1, 5),
fate = c(2,2,3,3,3))
df %>%
group_by(nest) %>%
# keep every Status except the last one, then append the group's last fate
mutate(Status = c(Status[-n()], tail(fate, 1)))
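The same result can also be sketched in base R by flagging the last row of each nest with duplicated(..., fromLast = TRUE). This assumes the rows are already in the desired order within each nest; is_last is an illustrative name.

```r
df <- data.frame(nest   = c(1, 1, 2, 2, 2),
                 Status = rep(1, 5),
                 fate   = c(2, 2, 3, 3, 3))

# Scanning from the end, the first time a nest value is seen is its last row
is_last <- !duplicated(df$nest, fromLast = TRUE)

# Overwrite Status only on those rows, using fate from the same rows
df$Status[is_last] <- df$fate[is_last]
df$Status
# [1] 1 2 1 1 3
```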

Not sure if this is definitely the best way to do it but here's a very simple solution:
library(dplyr)
dat <- data.frame(nest = c(1,1,2,2,2),
Status = c(1,1,1,1,1),
fate = c(2,2,3,3,3))
dat %>%
arrange(nest, Status, fate) %>% #enforce order
group_by(nest) %>%
mutate(Status = ifelse(is.na(lead(nest)), fate, Status))

dplyr n_distinct with condition

Using dplyr to summarise a dataset, I want to call n_distinct to count the number of unique occurrences in a column. However, I also want to do another summarise() for all unique occurrences in a column where a condition in another column is satisfied.
Example dataframe named "a":
A B
1 Y
2 N
3 Y
1 Y
a %>% summarise(count = n_distinct(A))
However I also want to add a count of n_distinct(A) where B == "Y"
The result should be:
count
3
When the condition is added, the result should be:
count
2
The end result I am trying to achieve is both statements merged into one call that gives me a result like
count_all count_BisY
3 2
What is the appropriate way to go about this with dplyr?

This produces the distinct A counts by each value of B using dplyr.
library(dplyr)
a %>%
group_by(B) %>%
summarise(count = n_distinct(A))
This produces the result:
Source: local data frame [2 x 2]
B count
(fctr) (int)
1 N 1
2 Y 2
To produce the desired output added above using dplyr, you can do the following:
a %>% summarise(count_all = n_distinct(A), count_BisY = length(unique(A[B == 'Y'])))
This produces the result:
count_all count_BisY
1 3 2
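The conditional count works because A[B == 'Y'] is ordinary vector subsetting, evaluated on the columns of the data frame. As a sketch, the example data frame can be rebuilt and the intermediate values inspected in base R; the construction of a below is an assumption based on the table printed in the question.

```r
# Assumed reconstruction of the example data frame "a"
a <- data.frame(A = c(1, 2, 3, 1),
                B = c("Y", "N", "Y", "Y"))

# Values of A on the rows where B is "Y"
a_y <- a$A[a$B == "Y"]
a_y
# [1] 1 3 1

count_all  <- length(unique(a$A))  # distinct values overall
count_BisY <- length(unique(a_y))  # distinct values where B is "Y"
count_all
# [1] 3
count_BisY
# [1] 2
```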

An alternative is to use the uniqueN function from data.table inside dplyr:
library(dplyr)
library(data.table)
a %>% summarise(count_all = n_distinct(A), count_BisY = uniqueN(A[B == 'Y']))
which gives:
count_all count_BisY
1 3 2
You can also do everything with data.table:
library(data.table)
setDT(a)[, .(count_all = uniqueN(A), count_BisY = uniqueN(A[B == 'Y']))]
which gives the same result.

Filtering the data frame before the summarise also works, although it returns only the conditional count:
a %>%
filter(B=="Y") %>%
summarise(count = n_distinct(A))

We can also use aggregate from base R
aggregate(cbind(count=A)~B, a, FUN=function(x) length(unique(x)))
# B count
#1 N 1
#2 Y 2
Based on the OP's expected output
data.frame(count=length(unique(a$A)),
count_BisY = length(unique(a$A[a$B=="Y"])))

Unique on a dataframe with only selected columns

I have a data frame with >100 columns, and I would like to find the unique rows by comparing only two of the columns. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.
In the example below, I would like to apply unique using only id and id2:
data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
id id2 somevalue
1 1 x
1 1 y
3 4 z
I would like to obtain either:
id id2 somevalue
1 1 x
3 4 z
or:
id id2 somevalue
1 1 y
3 4 z
(I have no preference which of the unique rows is kept)

Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
dat[!duplicated(dat[,c('id','id2')]),]
id id2 somevalue
1 1 1 x
3 3 4 z
Inside the duplicated call, I'm simply passing only those columns from dat that I don't want duplicates of. This code will automatically always select the first of any ambiguous values. (In this case, x.)
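If the last row of each duplicate set should be kept instead (the second acceptable output in the question), duplicated accepts fromLast = TRUE. A brief sketch showing the logical vector it returns and both variants; dup, first_kept and last_kept are illustrative names.

```r
dat <- data.frame(id = c(1, 1, 3), id2 = c(1, 1, 4),
                  somevalue = c("x", "y", "z"))

# TRUE marks rows whose (id, id2) pair already appeared earlier
dup <- duplicated(dat[, c("id", "id2")])
dup
# [1] FALSE  TRUE FALSE

# Keep the first row of each (id, id2) pair: somevalue "x" and "z"
first_kept <- dat[!dup, ]

# Scan from the bottom to keep the last row instead: somevalue "y" and "z"
last_kept <- dat[!duplicated(dat[, c("id", "id2")], fromLast = TRUE), ]
```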

Here are a couple of dplyr options that keep non-duplicate rows based on columns id and id2 (df is assumed to hold the data frame from the question):
library(dplyr)
df %>% distinct(id, id2, .keep_all = TRUE)
df %>% group_by(id, id2) %>% filter(row_number() == 1)
df %>% group_by(id, id2) %>% slice(1)

Using unique():
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])),]

Minor update to @Joran's code.
Using the code below, you can avoid the ambiguity and return only the unique combinations of the two columns:
dat <- data.frame(id=c(1,1,3), id2=c(1,1,4) ,somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])), c("id", "id2")]