Count how many times a value has changed in a column using R

Hi, I want to count how many times a value has changed in a column by group, and how many unique values there are in each group. I am sort of getting what I want, but the result includes an NA observation, which I do not want to be counted.
df <- data.frame(x=c("a",'a', "a", "b",'b', "b", "c",'c', "d")
,y=c(1,2,NA,3,3,3,2,1,5))
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(y), by=x][]
setDT(df)[, count := uniqueN(y),by=x][]
x y wanted count
1: a 1 1 3
2: a 2 2 3
3: a NA 3 3
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
Desired results:
x y wanted count
1: a 1 1 2
2: a 2 2 2
3: a NA 2 2
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
I tried rleid(!is.na(y)), but it does not seem to work as I expected. Thank you.

We can replace each NA with the previous non-NA value (na.locf from zoo) and take rleid of the result to get 'wanted'. For 'count', we take the number of unique values that are not NA, using uniqueN(y, na.rm = TRUE).
library(zoo)
setDT(df)[, c('wanted', 'count') := list(rleid(na.locf(y)), uniqueN(y, na.rm = TRUE)), x]
df
# x y wanted count
#1: a 1 1 2
#2: a 2 2 2
#3: a NA 2 2
#4: b 3 1 1
#5: b 3 1 1
#6: b 3 1 1
#7: c 2 1 2
#8: c 1 2 2
#9: d 5 1 1
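A minimal sketch of the same idea without the zoo dependency, assuming a data.table version that provides nafill() (that function is not used in the answer above, so treat it as a swapped-in alternative). Note one difference: zoo's na.locf drops leading NAs by default, while nafill(type = "locf") keeps a leading NA in place, so a group that starts with NA would need extra handling in either variant.

```r
library(data.table)

# Same example data as in the question
df <- data.frame(x = c("a", "a", "a", "b", "b", "b", "c", "c", "d"),
                 y = c(1, 2, NA, 3, 3, 3, 2, 1, 5))

setDT(df)[, `:=`(
  # carry the last observed value forward, then number the runs
  wanted = rleid(nafill(y, type = "locf")),
  # count distinct values, ignoring NA
  count  = uniqueN(y, na.rm = TRUE)
), by = x]

df
```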

Related

R Data Table add rows to each group if not existing [duplicate]

I have a data table with multiple groups. Each group I'd like to fill with rows containing the values in vals if they are not already present. Additional columns should be filled with NAs.
DT = data.table(group = c(1,1,1,2,2,3,3,3,3), val = c(1,2,4,2,3,1,2,3,4), somethingElse = rep(1,9))
vals = data.table(val = c(1,2,3,4))
What I want:
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
The order of val does not necessarily have to be increasing; the values may also be appended at the beginning or end of each group.
I don't know how to approach this problem. I've thought about using rbindlist(...,fill = TRUE), but then the values will be simply appended.
I think some expression with DT[, lapply(...), by = c("group")] might be useful here but I have no idea how to check if a value already exists.
You can use a cross-join:
setDT(DT)[
CJ(group = group, val = val, unique = TRUE),
on = .(group, val)
]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
Another way to solve your problem:
DT[, .SD[vals, on="val"], by=group]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
# or
DT[CJ(group, val, unique=TRUE), on=.NATURAL]
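The CJ() calls above build the grid from the val values that already occur in DT. If a value listed in vals never appears in any group, it would be missing from the result. A small sketch that builds the grid from vals explicitly instead; the name grid is only illustrative:

```r
library(data.table)

DT <- data.table(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 val = c(1, 2, 4, 2, 3, 1, 2, 3, 4),
                 somethingElse = rep(1, 9))
vals <- data.table(val = c(1, 2, 3, 4))

# every group crossed with every value listed in vals
grid <- CJ(group = unique(DT$group), val = vals$val)

# join DT onto the grid; combinations missing from DT get NA
res <- DT[grid, on = .(group, val)]
res
```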
I will just add this answer for a slightly more complex case:
#Raw Data
DT = data.table(group = c(1,1,2,2,2,3,3,3,3),
x = c(1,2,1,3,4,1,2,3,4),
y = c(2,4,2,6,8,2,4,6,8),
somethingElse = rep(1,9))
#allowed combinations of x and y
DTxy = data.table(x = c(1,2,3,4), y = c(2,4,6,8))
Here, I want to add all x,y combinations from DTxy to each group from DT, if not already present.
I wrote a function that works on subsets.
#function to join subsets on two columns (here: x,y)
DTxyJoin = function(.SD, xy){
.SD = .SD[xy, on = .(x,y)]
return(.SD)
}
I then applied the function to each group:
#add x and y to each group if missing
DTres = DT[, DTxyJoin(.SD, DTxy), by = c("group")]
The Result:
group x y somethingElse
1: 1 1 2 1
2: 1 2 4 1
3: 1 3 6 NA
4: 1 4 8 NA
5: 2 1 2 1
6: 2 2 4 NA
7: 2 3 6 1
8: 2 4 8 1
9: 3 1 2 1
10: 3 2 4 1
11: 3 3 6 1
12: 3 4 8 1
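As a possible shortening, the helper function can probably be inlined, since all it does is join .SD on DTxy. A sketch under that assumption, using the same data:

```r
library(data.table)

DT <- data.table(group = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
                 x = c(1, 2, 1, 3, 4, 1, 2, 3, 4),
                 y = c(2, 4, 2, 6, 8, 2, 4, 6, 8),
                 somethingElse = rep(1, 9))
DTxy <- data.table(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

# join each group's subset onto the allowed (x, y) pairs
DTres <- DT[, .SD[DTxy, on = .(x, y)], by = group]
DTres
```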

Is there some way to keep variable names from .SD + .SDcols together with non .SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) is lost. (It is for this reason that I am re-asking the question, which I initially marked as a duplicate of the one linked above.)
Is there some way to keep the names from both .SD variables and non .SD variables?
The goal is simultaneously being able to use .() to select variables without quotes and being able to select variables through .SDcols = patterns("...")
Thanks in advance!
Not really sure why, but it works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6
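Another option that may be easier to reason about is to name the extra column explicitly inside the list that is concatenated with .SD. This is a sketch, not the approach from the answer above. It uses a character vector for .SDcols to stay conservative; x:y or patterns("...") should be usable in the same place on versions that support them.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = rep(c(1, 3, 6), 3),
                 a = 1:9, b = 9:1)

# naming the element (a = a) keeps the column name instead of V1
res <- DT[, c(list(a = a), .SD), .SDcols = c("x", "v", "y")]
names(res)
```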

R mutate a column by group in ifelse

I'd like to mutate a column in R data.table.
Here's the example of my data.
df <- data.table(id=c(1,1,1,2,2,2,3,3,3),
stopId=c("a","b","c","a","b","c","a","b","c"),
category=c(1,1,1,NA,NA,NA,2,2,2),
result = c('a','a','a','b','b','b','c','c','c'))
My goal is to create a column using an if-else condition.
The column should hold the first value of groupId (stopId in the example data) within each id group.
The point is that, after mutating, the value should be the same within each group.
If category is NA, the result should instead be the last value of groupId.
This is the result I'm looking forward to.
id groupId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA b
5: 2 c NA b
6: 2 b NA b
7: 3 c 2 c
8: 3 b 2 c
9: 3 a 2 c
with data.table:
df[,result:=fifelse(is.na(category),last(stopId),first(stopId)),by=id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
As the names suggest, you can use first and last:
df %>%
group_by(id) %>%
mutate(resultt = ifelse(is.na(category), last(stopId), first(stopId)))
id stopId category result resultt
<dbl> <chr> <dbl> <chr> <chr>
1 1 a 1 a a
2 1 b 1 a a
3 1 c 1 a a
4 2 a NA b b
5 2 c NA b b
6 2 b NA b b
7 3 c 2 c c
8 3 b 2 c c
9 3 a 2 c c
Note that the data you provided differs from the expected output shown above.
We can use .N or 1 to index stopId per group
> df[, result := stopId[ifelse(is.na(category), .N, 1)], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
or shorter
> df[, result := stopId[c(1, .N)[is.na(category) + 1]], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
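To see why the shorter form works, here is a small base R sketch of the indexing step for a single group. The variable n stands in for .N and pos is only an illustrative name: is.na(category) + 1 turns FALSE/TRUE into 1/2, which then picks either the first or the last position from c(1, n).

```r
stopId <- c("a", "b", "c")
n <- length(stopId)                  # plays the role of .N within a group

# group where category is NA: pick the last position
category_na <- c(NA, NA, NA)
pos_na <- c(1, n)[is.na(category_na) + 1]
res_na <- stopId[pos_na]

# group where category is present: pick the first position
category_ok <- c(1, 1, 1)
pos_ok <- c(1, n)[is.na(category_ok) + 1]
res_ok <- stopId[pos_ok]
```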

Index and count unique combination of variables using R, but do NOT remove duplicates

Take this data frame for example:
DT <- data.table(A = rep(1:3, each=4),
B = rep(c(NA,1,2,4), each=3),
C = rep(1:2, 6))
I want to append a column that assign index to unique combinations of A and B, but ignore C. I also want another column that count the number of duplicates, that looks like this:
A B C Index Count
1: 1 NA 1 1 3
2: 1 NA 2 1 3
3: 1 NA 1 1 3
4: 1 1 2 2 1
5: 2 1 1 3 2
6: 2 1 2 3 2
7: 2 2 1 4 2
8: 2 2 2 4 2
9: 3 2 1 5 1
10: 3 4 2 6 3
11: 3 4 1 6 3
12: 3 4 2 6 3
I don't want to trim the data frame and (preferably) I don't want to reorder the rows.
I tried setDT, for example:
setDT(DT)[,.(.I, .N), by = names(DT[,1:2])]
But the I column is not the index I want, and column C is gone.
Thanks in advance!
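One possible approach, offered as a sketch rather than a definitive answer: assign both columns by reference with :=, using .GRP for the group counter and .N for the group size. Because := adds columns in place, all rows and column C are kept and the row order is unchanged.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(c(NA, 1, 2, 4), each = 3),
                 C = rep(1:2, 6))

# .GRP numbers the (A, B) groups in order of first appearance,
# .N is the number of rows in each group
DT[, `:=`(Index = .GRP, Count = .N), by = .(A, B)]
DT
```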

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
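DF <- data.frame(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4)) # example data from the question
DF <- within(DF, {C <- ave(A, A, FUN=length)})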
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
Here is one approach using data.table that makes use of .N, which the "data.table" help file describes as follows: ".N is an integer, length 1, containing the number of rows in the group."
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4

Resources